A curious case of suboptimal code generation
While writing the scheduler for my bare-metal kernel I noticed that the timing of the task interrupt was off from what it should have been by a few clock cycles. Not a huge problem, but it made me curious about where the slowdown came from. The problem, as it turns out, was more intricate than I originally thought.
Consider the piece of code below:
#include <stdint.h>
#include <stdio.h>
static uint32_t var;
static void func(void) {
    printf("%d\n", var);
}
/* ... some other code that actually changes `var' */
Suppose we build this for the ARMv6 (or ARMv6-M) architecture. We should expect to see some assembly code for the func function that roughly does the following:
- loads the value of var into a register
- loads the address of the "%d\n" string constant into another register
- makes a subroutine call to the printf function, e.g. with bl
Indeed, if we disassemble some code generated by GCC we get:
000000f0 <func>:
f0: b508 push {r3, lr}
f2: 4b03 ldr r3, [pc, #12] ; (100 <func+0x10>)
f4: 4803 ldr r0, [pc, #12] ; (104 <func+0x14>)
f6: 6819 ldr r1, [r3, #0]
f8: f000 fca2 bl a40 <printf>
fc: bd08 pop {r3, pc}
fe: 46c0 nop ; (mov r8, r8)
100: 10000200 .word 0x10000200
104: 00000b88 .word 0x00000b88
And we know from the linker map that the var variable lives at address 0x10000200 in the .bss section, and that the string literal was placed in the .rodata section at 0x00000b88; note that memory is mapped at 0x10000000 in this example:
.bss.var 0x0000000010000200 0x4 test.o
.rodata.str1.4 0x0000000000000b88 0x4 test.o
Back to the assembly: skipping the push and pop instructions, we are left with three ldr instructions, one bl, a nop and two constant word literals at the end of the function. Wait, why three ldr instructions? We only need to load two pieces of data into registers. The reason is that the ldr instruction can’t just accept any 32-bit immediate source memory address; it can only do so if this address is “near” the program counter, by doing a PC-relative load as in ldr r3, [pc, #12] (meaning “load the memory at address PC + 12 bytes into R3”, where PC reads as the address of the instruction plus 4, rounded down to a word boundary). Otherwise the address has to be in a register, as in ldr r1, [r3, #0] (which means “load the memory at address R3 + 0 bytes into R1”).
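To make the PC-relative arithmetic concrete, here is a minimal sketch in C that decodes the 16-bit Thumb form of ldr rt, [pc, #imm] and computes the address of the literal it reads. The struct and function names are my own, and the decoder assumes only this one encoding (top five bits 01001, as in the 4b03 and 4803 opcodes above).

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded form of a 16-bit Thumb "ldr rt, [pc, #imm]" instruction.
 * The names here are hypothetical, chosen for this illustration. */
struct ldr_literal {
    unsigned rt;      /* destination register number, 0..7 */
    uint32_t offset;  /* byte offset from the aligned PC, 0..1020 */
    uint32_t target;  /* absolute address of the literal word */
};

/* Returns true and fills *out if `insn`, located at address `addr`,
 * is a 16-bit PC-relative ldr; returns false for anything else. */
bool decode_ldr_literal(uint32_t addr, uint16_t insn,
                        struct ldr_literal *out)
{
    if ((insn >> 11) != 0x09)   /* top five bits must be 01001 */
        return false;

    out->rt = (insn >> 8) & 0x7;
    /* The 8-bit immediate counts words, so scale it to bytes. */
    out->offset = (uint32_t)(insn & 0xff) * 4;
    /* PC reads as the instruction address plus 4, rounded down to a word. */
    out->target = ((addr + 4) & ~UINT32_C(3)) + out->offset;
    return true;
}
```

Feeding it the instruction 4b03 at address 0xf2 gives register 3, offset 12 and target 0x100, which matches the (100 <func+0x10>) annotation in the disassembly.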
In this case, the address of our var variable is pretty far away from the PC, since it’s in memory (above 0x10000000) while the PC (around 0xf0) is firmly in the flash address space. So what the compiler does is place the address of var in a constant near the function (this is the word at 100 <func+0x10> in the disassembly above) and load that into a register using PC-relative addressing, before finally loading the variable from memory via that register. It also does that for the string constant, although we don’t need to dereference it, so only one ldr is needed. And, to complete the analysis, the nop instruction was inserted to word-align those two trailing literals.
That makes sense, and is just a result of the compiler working around a limitation of the ldr instruction. But exactly how near to the PC does a memory address need to be to be eligible for PC-relative addressing? The ARM documentation provides the answer to that question: for the Cortex-M0 it must be word-aligned and within 1020 bytes of the PC (the range depends on the instruction encoding, and the 16-bit Thumb form used here only takes a positive offset).
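The range rule can be written down as a small predicate. This is a sketch under the assumption that only the 16-bit Thumb encoding is available, as on ARMv6-M, where the offset is an unsigned 8-bit word count and therefore only reaches forward from the aligned PC. The function name is made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Can a 16-bit Thumb PC-relative ldr at `insn_addr` load the word
 * stored at `target`? Assumes the forward-only, 0..1020 byte encoding. */
bool ldr_literal_reaches(uint32_t insn_addr, uint32_t target)
{
    /* PC reads as the instruction address plus 4, rounded down to a word. */
    uint32_t base = (insn_addr + 4) & ~UINT32_C(3);

    if (target & 3)          /* the literal must be word-aligned */
        return false;
    if (target < base)       /* no negative offsets in this encoding */
        return false;
    return target - base <= 1020;
}
```

For the ldr at 0xf2 in the flash build, the literal at 0x100 passes the check, while var itself at 0x10000200 does not, which is why the compiler has to go through the helper constant.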
So suppose you were writing some really efficient code, with executable code running straight from memory instead of flash. Then you could arrange for the variables used by the function to be very close to the PC, and then you would need only one PC-relative ldr to load the variable into a register! However, neither GCC nor LLVM/Clang does this; both always emit an indirect ldr with a helper constant literal even when the variable to load is literally bytes away from the PC:
10000238 <func>:
10000238: b508 push {r3, lr}
1000023a: 4b03 ldr r3, [pc, #12] ; (10000248 <func+0x10>)
1000023c: 4803 ldr r0, [pc, #12] ; (1000024c <func+0x14>)
1000023e: 6819 ldr r1, [r3, #0]
10000240: f000 f8c6 bl 100003d0 <__printf_veneer>
10000244: bd08 pop {r3, pc}
10000246: 46c0 nop ; (mov r8, r8)
10000248: 10000200 .word 0x10000200
1000024c: 00000b70 .word 0x00000b70
The reason the compiler doesn’t pick this up is that, even though the variable is declared static, the compiler does not know what the address of var or of the function is going to be during the code generation stage, so it can’t make any assumptions about the variable being near the PC at the ldr instruction. This is confirmed when looking at the intermediate disassembly of test.o:
00000000 <func>:
0: b508 push {r3, lr}
2: 4b03 ldr r3, [pc, #12] ; (10 <func+0x10>)
4: 4803 ldr r0, [pc, #12] ; (14 <func+0x14>)
6: 6819 ldr r1, [r3, #0]
8: f7ff fffe bl 0 <printf>
c: bd08 pop {r3, pc}
e: 46c0 nop ; (mov r8, r8)
...
Now observe that there are no constant literals yet; those are filled in by the linker in the final ELF file. GCC just reserved some space for the linker to populate later, which is what the ellipsis at the end represents. Notice it doesn’t even know where printf is, and just put a placeholder zero there instead. If we tell objdump to show the raw contents of the function (.ram is just a section I created to put the function into, for it to be loaded into memory instead of flash) we get:
Contents of section .ram:
0000 08b5034b 03481968 fff7feff 08bdc046 ...K.H.h.......F
0010 00000000 00000000 ........
Notice the zeroes at the end, conveniently placed for the linker to populate: mystery solved.
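If the raw dump is hard to match against the disassembly, keep in mind that the dump lists bytes in memory order while the disassembler prints 16-bit halfwords. A minimal sketch of the conversion, assuming a little-endian target (the helper and array names are mine):

```c
#include <stddef.h>
#include <stdint.h>

/* Reads the little-endian halfword at index `i` of a raw byte dump. */
uint16_t halfword_at(const uint8_t *bytes, size_t i)
{
    return (uint16_t)(bytes[2 * i] | (bytes[2 * i + 1] << 8));
}

/* First 16 bytes of the .ram section dump, copied from the first row. */
const uint8_t ram_dump[16] = {
    0x08, 0xb5, 0x03, 0x4b, 0x03, 0x48, 0x19, 0x68,
    0xff, 0xf7, 0xfe, 0xff, 0x08, 0xbd, 0xc0, 0x46,
};
```

Reading ram_dump two bytes at a time yields b508, 4b03, 4803, 6819, f7ff, fffe, bd08 and 46c0, the same opcodes as in the test.o disassembly; the eight zero bytes that follow in the dump are the two still-empty literal slots.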
However, what is disappointing is that even with link-time optimization turned on (-flto) neither compiler was able to notice this at link time and fix up the code to remove the redundant ldr instruction. Now this isn’t a huge deal, as an ldr instruction takes only two cycles anyway, but it goes to show that the compiler doesn’t always know everything; in this case a suboptimal decision was made by the code generation stage due to limited information, and it could not be reversed later in the build process.