nullprogram.com/blog/2016/12/23/
Relocatable code — program code that executes correctly from any
properly-aligned address — is an essential feature for shared
libraries. Otherwise all of a system’s shared libraries would need to
coordinate their virtual load addresses. Loading programs and
libraries to random addresses is also a valuable security feature:
Address Space Layout Randomization (ASLR). But how does a compiler
generate code for a function that accesses a global variable if that
variable’s address isn’t known at compile time?
Consider this simple C code sample.
static const float values[] = {1.1f, 1.2f, 1.3f, 1.4f};
float get_value(unsigned x)
{
    return x < 4 ? values[x] : 0.0f;
}
This function needs the base address of values in order to
dereference it for values[x]. The easiest way to find out how this
works, especially without knowing where to start, is to compile the
code and have a look! I’ll compile for x86-64 with GCC 4.9.2 (Debian
Jessie).
$ gcc -c -Os -fPIC get_value.c
I optimized for size (-Os) to make the disassembly easier to follow.
Next, disassemble this pre-linked code with objdump. Alternatively I
could have asked for the compiler's assembly output with -S, but this
will be good reverse engineering practice.
$ objdump -d -Mintel get_value.o
0000000000000000 <get_value>:
0: 83 ff 03 cmp edi,0x3
3: 0f 57 c0 xorps xmm0,xmm0
6: 77 0e ja 16 <get_value+0x16>
8: 48 8d 05 00 00 00 00 lea rax,[rip+0x0]
f: 89 ff mov edi,edi
11: f3 0f 10 04 b8 movss xmm0,DWORD PTR [rax+rdi*4]
16: c3 ret
There are a couple of interesting things going on, but let’s start
from the beginning.
- The ABI specifies that the first integer/pointer argument (the
  32-bit integer x) is passed through the edi register. The function
  compares x to 3, to satisfy x < 4.
- The ABI specifies that floating point values are returned through
  the SSE2 SIMD register xmm0. It's cleared by XORing the register
  with itself — the conventional way to clear registers on x86 —
  setting up for a return value of 0.0f.
- It then uses the result of the previous comparison to perform a
  jump, ja ("jump if after"). That is, jump to the relative address
  specified by the jump's operand if the first operand to cmp (edi)
  comes after the second operand (0x3) as unsigned values. Its
  cousin, jg ("jump if greater"), is for signed values. If x is
  outside the array bounds, it jumps straight to ret, returning 0.0f.
- If x was in bounds, it uses a lea ("load effective address") to
  load something into the 64-bit rax register. This is the
  complicated bit, and I'll start by giving the answer: The value
  loaded into rax is the address of the values array. More on this in
  a moment.
- Finally it uses x as an index into the address in rax. The movss
  ("move scalar single-precision") instruction loads a 32-bit float
  into the first lane of xmm0, where the caller expects to find the
  return value. This is all preceded by a mov edi, edi which looks
  like a hotpatch nop, but it isn't. x86-64 always uses 64-bit
  registers for addressing, meaning it uses rdi not edi. All 32-bit
  register assignments clear the upper 32 bits, and so this mov
  zero-extends edi into rdi, covering the unlikely event that the
  caller left garbage in those upper bits.
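To tie the walkthrough together, here is the same logic written back
out as C, with each statement annotated with the instructions it
corresponds to. This is a conceptual sketch, not compiler output, and
the function name is made up to avoid colliding with the original.

#include <stdint.h>

static const float values[] = {1.1f, 1.2f, 1.3f, 1.4f};

float get_value_annotated(unsigned x)
{
    float result = 0.0f;        /* xorps xmm0, xmm0                 */
    if (x <= 3) {               /* cmp edi, 0x3 / ja (unsigned)     */
        uint64_t i = x;         /* mov edi, edi (zero-extend index) */
        result = values[i];     /* lea rax, [rip+disp] + movss      */
    }
    return result;              /* ret                              */
}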
Clearing xmm0
The first interesting part: xmm0 is cleared even when its first lane
is loaded with a value. There are two reasons to do this.
The obvious reason is that the alternative requires additional
instructions, and I told GCC to optimize for size. It would need
either an extra ret or a conditional jmp over the "else" branch.
The less obvious reason is that it breaks a data dependency. For
over 20 years now, x86 micro-architectures have employed an
optimization technique called register renaming. Architectural
registers (rax, edi, etc.) are just temporary names for
underlying physical registers. This disconnect allows for more
aggressive out-of-order execution. Two instructions sharing an
architectural register can be executed independently so long as there
are no data dependencies between these instructions.
For example, take this assembly sample. It assembles to 9 bytes of
machine code.
mov edi, [rcx]
mov ecx, 7
shl eax, cl
This reads a 32-bit value from the address stored in rcx, then
assigns ecx and uses cl (the lowest byte of rcx) in a shift
operation. Without register renaming, the shift couldn't be performed
until the load in the first instruction completed. However, the
second instruction is a 32-bit assignment, which, as I mentioned
before, also clears the upper 32 bits of rcx, wiping the unused parts
of the register.
So after the second instruction, it's guaranteed that the value in
rcx has no dependencies on code that comes before it. Because of
this, it’s likely a different physical register will be used for the
second and third instructions, allowing these instructions to be
executed out of order, before the load. Ingenious!
Compare it to this example, where the second instruction assigns to
cl instead of ecx. This assembles to just 6 bytes.
mov edi, [rcx]
mov cl, 7
shl eax, cl
The result is 3 bytes smaller, but since it's not a 32-bit
assignment, the upper bits of rcx still hold the original register
contents.
This creates a false dependency and may prevent out-of-order
execution, reducing performance.
By clearing xmm0, instructions in get_value involving xmm0 have the
opportunity to be executed prior to instructions in the caller that
use xmm0.
RIP-relative addressing
Going back to the instruction that computes the address of values.
8: 48 8d 05 00 00 00 00 lea rax,[rip+0x0]
Normally load/store addresses are absolute, based off an address
either in a general purpose register, or at some hard-coded base
address. The latter is not an option in relocatable code. With
RIP-relative addressing that’s still the case, but the register with
the absolute address is rip, the instruction pointer. This
addressing mode was introduced in x86-64 to make relocatable code more
efficient.
That means this instruction copies the instruction pointer (pointing
to the next instruction) into rax, plus a 32-bit displacement,
currently zero. This isn’t the right way to encode a displacement of
zero (unless you want a larger instruction). That’s because the
displacement will be filled in later by the linker. The compiler adds
a relocation entry to the object file so that the linker knows how
to do this.
On platforms that use ELF we can inspect these relocations with
readelf.
$ readelf -r get_value.o
Relocation section '.rela.text' at offset 0x270 contains 1 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
00000000000b 000700000002 R_X86_64_PC32 0000000000000000 .rodata - 4
The relocation type is R_X86_64_PC32. In the AMD64 Architecture
Processor Supplement, this is defined as “S + A - P”.
- S: Represents the value of the symbol whose index resides in the
  relocation entry.
- A: Represents the addend used to compute the value of the
  relocatable field.
- P: Represents the place of the storage unit being relocated.
The symbol, S, is .rodata — the final address for this object file's
portion of .rodata (where values resides). The addend, A, is -4 since
the instruction pointer points at the next instruction. That is, this
will be relative to four bytes after the relocation offset. Finally,
the address of the relocation, P, is the address of the last four
bytes of the lea instruction. These values are all known at
link-time, so no run-time support is necessary.
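To make that concrete, here is the same formula worked through with
made-up link-time addresses (the numbers are hypothetical, chosen
only to illustrate the arithmetic):

S = 0x4009a0   (final address of this object file's .rodata)
P = 0x4006cb   (final address of the lea's 4-byte displacement field)
A = -4

S + A - P = 0x4009a0 - 4 - 0x4006cb = 0x2d1

At run time:  rip = P + 4 = 0x4006cf
              rax = rip + 0x2d1 = 0x4009a0 = S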
Being “S - P” (overall), this will be the displacement between these
two addresses: the 32-bit value is relative. It’s relocatable so long
as these two parts of the binary (code and data) maintain a fixed
distance from each other. The binary is relocated as a whole, so this
assumption holds.
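If you want to see that fixed code-to-data distance for yourself, a
small test program can print it; the value stays the same from run to
run even while ASLR changes the absolute addresses. This is a
hypothetical test harness, not part of the original example, and the
function-pointer-to-integer conversion is a common extension that
behaves as expected with GCC on Linux.

#include <stdio.h>
#include <stdint.h>

static const float values[] = {1.1f, 1.2f, 1.3f, 1.4f};

float get_value(unsigned x)
{
    return x < 4 ? values[x] : 0.0f;
}

int main(void)
{
    /* Distance between data (.rodata) and code (.text): fixed at
       link time, so it's identical across runs despite ASLR. */
    long long diff = (long long)(uintptr_t)values
                   - (long long)(uintptr_t)&get_value;
    printf("%lld\n", diff);
    return 0;
}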
32-bit relocation
Since RIP-relative addressing wasn’t introduced until x86-64, how did
this all work on x86? Again, let’s just see what the compiler does.
Add the -m32 flag for a 32-bit target, and -fomit-frame-pointer to
make it simpler for explanatory purposes.
$ gcc -c -m32 -fomit-frame-pointer -Os -fPIC get_value.c
$ objdump -d -Mintel get_value.o
00000000 <get_value>:
0: 8b 44 24 04 mov eax,DWORD PTR [esp+0x4]
4: d9 ee fldz
6: e8 fc ff ff ff call 7 <get_value+0x7>
b: 81 c1 02 00 00 00 add ecx,0x2
11: 83 f8 03 cmp eax,0x3
14: 77 09 ja 1f <get_value+0x1f>
16: dd d8 fstp st(0)
18: d9 84 81 00 00 00 00 fld DWORD PTR [ecx+eax*4+0x0]
1f: c3 ret
Disassembly of section .text.__x86.get_pc_thunk.cx:
00000000 <__x86.get_pc_thunk.cx>:
0: 8b 0c 24 mov ecx,DWORD PTR [esp]
3: c3 ret
Hmm, this one includes an extra function.
- In this calling convention, arguments are passed on the stack. The
  first instruction loads the argument, x, into eax.
- The fldz instruction clears the x87 floating point return register,
  just like clearing xmm0 in the x86-64 version.
- Next it calls __x86.get_pc_thunk.cx. The call pushes the
  instruction pointer, eip, onto the stack. This function reads that
  value off the stack into ecx and returns. In other words, calling
  this function copies eip into ecx. It's setting up to load data at
  an address relative to the code. Notice the function name starts
  with two underscores — a name which is reserved exactly for these
  sorts of implementation purposes.
- Next a 32-bit displacement is added to ecx. In this case it's 2,
  but, like before, this is actually going to be filled in later by
  the linker.
- Then it's just like before: a branch to optionally load a value.
  The floating point load (fld) is another relocation.
Let’s look at the relocations. There are three this time:
$ readelf -r get_value.o
Relocation section '.rel.text' at offset 0x2b0 contains 3 entries:
Offset Info Type Sym.Value Sym. Name
00000007 00000e02 R_386_PC32 00000000 __x86.get_pc_thunk.cx
0000000d 00000f0a R_386_GOTPC 00000000 _GLOBAL_OFFSET_TABLE_
0000001b 00000709 R_386_GOTOFF 00000000 .rodata
The first relocation is the call-site for the thunk. The thunk has
external linkage and may be merged with a matching thunk in another
object file, and so may be relocated. (Clang inlines its thunk.) Calls
are relative, so its type is R_386_PC32: a code-relative
displacement just like on x86-64.
The next is of type R_386_GOTPC and sets the second operand in that
add ecx. It's defined as "GOT + A - P" where "GOT" is the address of
the Global Offset Table — a table of addresses of the binary's
relocated objects. Since values is static, the GOT won't actually
hold an address for it, but the relative address of the GOT itself
will be useful.
The final relocation is of type R_386_GOTOFF. This is defined as
"S + A - GOT". Another displacement between two addresses. This is
the displacement in the load, fld. Ultimately the load adds these last
two relocations together, canceling the GOT:
(GOT + A0 - P) + (S + A1 - GOT)
= S + A0 + A1 - P
So the GOT isn’t relevant in this case. It’s just a mechanism for
constructing a custom relocation type.
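Plugging in the offsets from the disassembly makes the cancellation
tangible. Write GOT for wherever the linker ends up placing the
table, and note that these .rel relocations store their addends in
place: the 0x2 in add ecx,0x2 and the 0x0 in the fld displacement are
A0 and A1. Worked by hand (not additional tool output):

P0 = final address of the add's immediate field   (offset 0xd above)
return address pushed by the call = P0 - 2        (offset 0xb above)

ecx after the add   = (P0 - 2) + (GOT + 2 - P0)   = GOT
fld address         = ecx + eax*4 + (S + 0 - GOT) = S + eax*4

That's exactly &values[x], and it also explains the 2: it's the
distance from the call's return address to the field being patched,
so that ecx lands precisely on the GOT.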
Branch optimization
Notice in the x86 version the thunk is called before checking the
argument. What if it's most likely that x will be out of bounds of
the array, and the function usually returns zero? That means it's
usually wasting its time calling the thunk. Without profile-guided
optimization the compiler probably won't know this.
The typical way to provide such a compiler hint is with a pair of
macros, likely() and unlikely(). With GCC and Clang, these would be
defined to use __builtin_expect. Compilers without this sort of
feature would have macros that do nothing instead. So I gave it a
shot:
#define likely(x) __builtin_expect((x),1)
#define unlikely(x) __builtin_expect((x),0)
static const float values[] = {1.1f, 1.2f, 1.3f, 1.4f};
float get_value(unsigned x)
{
    return unlikely(x < 4) ? values[x] : 0.0f;
}
Unfortunately this makes no difference even in the latest version of
GCC. In Clang it changes branch fall-through (for static branch
prediction), but still always calls the thunk. It seems
compilers have difficulty with optimizing relocatable
code on x86.
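As an aside, the "do nothing" fallback mentioned above usually looks
something like the following. This is a sketch of the conventional
idiom rather than code from any particular project; the !!(x)
normalizes the condition to 0 or 1, matching the expected value
passed to __builtin_expect.

#if defined(__GNUC__) || defined(__clang__)
#  define likely(x)   __builtin_expect(!!(x), 1)
#  define unlikely(x) __builtin_expect(!!(x), 0)
#else
#  define likely(x)   (x)
#  define unlikely(x) (x)
#endif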
x86-64 isn’t just about more memory
It’s commonly understood that the advantage of 64-bit versus 32-bit
systems is processes having access to more than 4GB of memory. But as
this shows, there’s more to it than that. Even programs that don’t
need that much memory can really benefit from newer features like
RIP-relative addressing.