A curious case of suboptimal code generation

While writing the scheduler for my bare metal kernel I noticed that the timing of the task interrupt was slightly off from what it should have been, by a few clock cycles. Not a huge problem, but it got me curious to investigate where the slowdown was. The problem, as it turns out, was more intricate than I originally thought.

Consider the piece of code below:

static uint32_t var;

static void func(void) {
    printf("%d\n", var);
}

/* ... some other code that actually changes `var' */

Suppose we build this for the ARMv6 (or ARMv6-M) architecture. We should expect to see some assembly code for the func function that roughly does the following:

loads the value of var into a register
loads the address of the "%d\n" string constant into another register
makes a subroutine call to the printf function with e.g. bl

Indeed, if we disassemble some code generated by GCC we get:

000000f0 <func>:
  f0:   b508            push    {r3, lr}
  f2:   4b03            ldr     r3, [pc, #12]   ; (100 <func+0x10>)
  f4:   4803            ldr     r0, [pc, #12]   ; (104 <func+0x14>)
  f6:   6819            ldr     r1, [r3, #0]
  f8:   f000 fca2       bl      a40 <printf>
  fc:   bd08            pop     {r3, pc}
  fe:   46c0            nop                     ; (mov r8, r8)
 100:   10000200        .word   0x10000200
 104:   00000b88        .word   0x00000b88

And we know from the linker map that the var variable lives at address 0x10000200 in the .bss section, and that the string literal was placed in the .rodata section at 0x00000b88; note that memory is mapped at 0x1000000 in this example:

.bss.var       0x0000000010000200        0x4 test.o
.rodata.str1.4 0x0000000000000b88        0x4 test.o

Back to the assembly, skipping the push and pop instructions we are left with three ldr instructions, one bl, a nop and two constant word literals at the end of the function. Wait, why three ldr instructions? We only need to load two pieces of data into registers. The reason is that the ldr instruction can’t just accept any 32-bit immediate source memory address; it can do so if this address is “near” the program counter, by doing a PC-relative move as in ldr r3, [pc, #12] (meaning “load the memory at address PC + 12 bytes into R3”). Otherwise the address has to be in a register, as in ldr r1, [r3, #0] (which means “load the memory at address R3 + 0 bytes into R1”).

In this case, the address of our var variable is pretty far away from the PC, since it’s in memory (above 0x1000000) while the PC (around 0xf0) is firmly in the flash address space. So what the compiler does is it places the address of var in a constant near the function – this is what func+100 is in the disassembly above – and loads that into a register using PC-relative addressing, before finally loading the variable from memory via that register. It also does that for the string constant, although we don’t need to dereference it so only one ldr is needed. And, to complete the analysis, the nop instruction was inserted to word-align those two trailing literals.

That makes sense, and is just a result of the compiler working around a limitation of the ldr instruction. But exactly how near to the PC does a memory address need to be to be eligible for PC-relative addressing? The ARM documentation provides the answer to that question: it must be within ±1020 bytes of it for the Cortex-M0 (apparently it depends on the processor model).

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0497a/BABJFJBD.html

So suppose you were writing some really efficient code, with executable code running straight from memory instead of flash. Then you could arrange for the variables used by the function to be very close to the PC, and then you would need only one PC-relative ldr to load the variable into a register! However, neither GCC nor LLVM/Clang do this, and always emit an indirect ldr with a helper constant literal even when the variable to load is literally bytes away from the PC:

10000238 <func>:
10000238:       b508            push    {r3, lr}
1000023a:       4b03            ldr     r3, [pc, #12]   ; (10000248 <func+0x10>)
1000023c:       4803            ldr     r0, [pc, #12]   ; (1000024c <func+0x14>)
1000023e:       6819            ldr     r1, [r3, #0]
10000240:       f000 f8c6       bl      100003d0 <__printf_veneer>
10000244:       bd08            pop     {r3, pc}
10000246:       46c0            nop                     ; (mov r8, r8)
10000248:       10000200        .word   0x10000200
1000024c:       00000b70        .word   0x00000b70

The reason the compiler doesn’t pick this up is because even though the variable is declared static, the compiler does not know what the address of var or of the function are going to be during the code generation stage, so it can’t make any assumptions about the variable being near the PC at the ldr instruction. This is confirmed when looking at the intermediate disassembly of test.o:

00000000 <func>:
   0:   b508            push    {r3, lr}
   2:   4b03            ldr     r3, [pc, #12]   ; (10 <func+0x10>)
   4:   4803            ldr     r0, [pc, #12]   ; (14 <func+0x14>)
   6:   6819            ldr     r1, [r3, #0]
   8:   f7ff fffe       bl      0 <printf>
   c:   bd08            pop     {r3, pc}
   e:   46c0            nop                     ; (mov r8, r8)
        ...

Now observe that there are no constant literals yet, those are added by the linker to the final ELF file, GCC just added some space for the linker to put them in later – that’s what the ellipsis at the end represents. Notice it doesn’t even know where printf is, and just put a placeholder zero there instead. If we tell objdump to show the raw contents of the function (.ram is just a section I created to put the function into, for it to be loaded into memory instead of flash) we get:

Contents of section .ram:
 0000 08b5034b 03481968 fff7feff 08bdc046  ...K.H.h.......F
 0010 00000000 00000000                    ........

Notice the zeroes at the end, conveniently placed for the linker to populate: mystery solved.

However, what is disappointing is that even with link-time optimization turned on (-flto) neither compiler was able to notice this at link time and fix up the code to remove the redundant ldr instruction. Now this isn’t a huge deal, as an ldr instruction takes only two cycles anyway, but it goes to show that the compiler doesn’t always know everything; in this case a suboptimal decision was made by the code generation stage due to limited information, that could not be reversed later in the build process.

A curious case of suboptimal code generation

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0497a/BABJFJBD.html