There are a few things about linking that I think are worthy of writing down after I studied linking from CSAPP-3e and some resources on the web.
Linkage of const
Global Variables
const
global variables in C++ have internal linkage (as declared static
) unless they’re explicitly declared extern
or inline
(see C++17 inline variable), whereas in C external linkage (as declared extern
) is the default for all file-scoped entities (functions and global variables). An excellent answer on StackOverflow about inline variables can be found here. One of the major uses of the inline
keyword is to enable multiple identical function definitions across translation units (TUs) (in practice function definitions in header files), so this meaning was extended to variables to allow an inline
const
variable to have a uniform memory address across TUs and allow a non-const
global variable to be defined in headers without redefinition problems.
$ cat t.c int int_ = 1; static int static_int = 1; const int const_int = 1; static const int static_const_int = 1; $ gcc -c t.c $ nm t.o 0000000000000000 R const_int # global read-only data section in C 0000000000000000 D int_ 0000000000000004 r static_const_int 0000000000000004 d static_int $ g++ -c t.c $ nm -C t.o 0000000000000000 D int_ 0000000000000004 d static_int 0000000000000004 r static_const_int 0000000000000000 r const_int # local read-only data section in C++
Link Order
The link order matters when we link against static or shared libraries. The rules used by the linker to resolve symbols are clearly described in the Symbol Resolution section of the Linking chapter in CSAPP-3e. Please refer to the text or this post from Ian Lance Taylor’s blog for more information.
Shared Libraries
Shared libraries are ubiquitous on almost every platform that has an operating system. The key purpose of shared libraries is to share code among processes and thus save memory. One approach to sharing library code could be to put the given library code at a specific memory location. But it has at least two problems: (a) There are so many (literally unlimited number of) libraries, how do we specify which library to put at which location? If there are ten libraries in total, but a program is only linked against one or two of them, the reserved memory for other shared libraries is wasted. (b) Having a library loaded at a fixed memory location is vulnerable to attacks. So for these reasons, shared libraries are implemented in a way that they can be loaded at any (random) memory locations that you don’t know in advance.
Luckily, the compiler can generate position-independent code (PIC) for us. Let’s first look at a normal non-PIC case:
// foo.c int x = 17; int foo(int i) { return i + x; } // main.c #include <stdio.h> int foo(int i); int main() { int r = foo(3); printf("%d\n", r); return r; }
$ gcc main.c foo.c -o main $ objdump -d main ... 0000000000001149 <main>: 1149: f3 0f 1e fa endbr64 114d: 55 push %rbp 114e: 48 89 e5 mov %rsp,%rbp 1151: 48 83 ec 10 sub $0x10,%rsp 1155: bf 03 00 00 00 mov $0x3,%edi 115a: e8 21 00 00 00 call 1180 <foo> 115f: 89 45 fc mov %eax,-0x4(%rbp) 1162: 8b 45 fc mov -0x4(%rbp),%eax 1165: 89 c6 mov %eax,%esi 1167: 48 8d 05 96 0e 00 00 lea 0xe96(%rip),%rax # 2004 <_IO_stdin_used+0x4> 116e: 48 89 c7 mov %rax,%rdi 1171: b8 00 00 00 00 mov $0x0,%eax 1176: e8 d5 fe ff ff call 1050 <printf@plt> 117b: 8b 45 fc mov -0x4(%rbp),%eax 117e: c9 leave 117f: c3 ret 0000000000001180 <foo>: 1180: f3 0f 1e fa endbr64 1184: 55 push %rbp 1185: 48 89 e5 mov %rsp,%rbp 1188: 89 7d fc mov %edi,-0x4(%rbp) 118b: 8b 15 7f 2e 00 00 mov 0x2e7f(%rip),%edx # 4010 <x> 1191: 8b 45 fc mov -0x4(%rbp),%eax 1194: 01 d0 add %edx,%eax 1196: 5d pop %rbp 1197: c3 ret ... $ objdump --full-contents main ... Contents of section .data: 4000 00000000 00000000 08400000 00000000 .........@...... 4010 11000000 ...
Fun story: I was confused about the disassembly once by the line call 1180 <foo>
. I thought it was an absolute address since it could use something like 0x21(%rip)
for relative addressing otherwise. I knew if I printed the function’s address, it would be different every time it was executed since the OS would load the code (.text section) at random addresses for security’s sake. But I also knew the code section was read-only. So, how was address space layout randomization implemented? Did the kernel (dynamic linker) perform some relocation and modify the machine code anyway? If it did, that would be soooo inefficient, for all calls to other normal functions would need to be modified as well. I just couldn’t figure it out. Then I realized that the secret must be behind the machine code e8 21 00 00 00
itself. So I asked ChatGPT about it. It turned out that, yes indeed, this call
instruction uses relative addressing.
Therefore, knowing how some instructions are encoded is helpful in many situations. In x86 assembly, the opcode e8
is a near call (see more at https://shell-storm.org/x86doc/CALL.html). The 4 bytes followed are the operand which is encoded in two’s complement. x86 machines are little-endian. So the rel32 operand is 0x21
, added to the RIP register (115f
) gives our target address 1180
. Similarly, the E9
JMP
instruction also uses relative addressing. Another example: the operand in 1176: e8 d5 fe ff ff
is the value whose two’s complement is 0xfffffed5
. The value is -12b
. Therefore, this instruction can be decoded as call RIP-12b
(RIP = 117b
), i.e. call 1050
. Most of the time it’s fine just to look at the decoded assembler code rather than the machine code itself. However, knowing the instruction encoding makes us better understand how the CPU really executes some instructions (e.g. call RIP-12b
shows us that at runtime the CPU pushes the address of the next instruction (RIP) onto the stack and then the execution jumps to the memory location with an offset of -12b
to RIP).
Go back to the example above. We can see that main
makes a near call to foo
, and foo
references the the global variable in the .data section which has an initial value 0x11 (17)
. Nothing special. Now we make foo.c
a shared library.
$ gcc -shared -fpic foo.c -o foo.so $ objdump -d foo.so ... 00000000000010f9 <foo>: 10f9: f3 0f 1e fa endbr64 10fd: 55 push %rbp 10fe: 48 89 e5 mov %rsp,%rbp 1101: 89 7d fc mov %edi,-0x4(%rbp) 1104: 48 8b 05 cd 2e 00 00 mov 0x2ecd(%rip),%rax # 3fd8 <x-0x48> 110b: 8b 10 mov (%rax),%edx 110d: 8b 45 fc mov -0x4(%rbp),%eax 1110: 01 d0 add %edx,%eax 1112: 5d pop %rbp 1113: c3 ret ... $ objdump --full-contents foo.so ... Contents of section .got: 3fd8 00000000 00000000 00000000 00000000 ................ 3fe8 00000000 00000000 00000000 00000000 ................ 3ff8 00000000 00000000 ........ Contents of section .got.plt: 4000 883e0000 00000000 00000000 00000000 .>.............. 4010 00000000 00000000 ........ Contents of section .data: 4018 18400000 00000000 11000000 .@.......... ...
There’s an observable difference. Without being PIC, the instruction for referring to x
is mov 0x2e7f(%rip),%edx
which copies 4 bytes from the .data section to register %edx
. Now being PIC, two load instructions are involved: mov 0x2ecd(%rip),%rax
and mov (%rax),%edx
. The first one copies 8 bytes from the address 3fd8
in the .got section to register %rax
, and the second accesses the lower half part of the value stored at address in %rax
. But why is the global variable x
now read from the .got section and why does the address 3fd8
contain an empty value?
To answer these questions, we are introduced to the global offset table (GOT), which contains an 8-byte entry (a pointer) for each global data object (procedure or global variable) that is referenced by the object module. The compiler generates a relocation record for each entry in the GOT.
$ readelf --relocs foo.so Relocation section '.rela.dyn' at offset 0x420 contains 8 entries: Offset Info Type Sym. Value Sym. Name + Addend 000000003e78 000000000008 R_X86_64_RELATIVE 10f0 000000003e80 000000000008 R_X86_64_RELATIVE 10b0 000000004018 000000000008 R_X86_64_RELATIVE 4018 000000003fd8 000500000006 R_X86_64_GLOB_DAT 0000000000004020 x + 0 000000003fe0 000100000006 R_X86_64_GLOB_DAT 0000000000000000 __cxa_finalize + 0 000000003fe8 000200000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_registerTMCl[...] + 0 000000003ff0 000300000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_deregisterTM[...] + 0 000000003ff8 000400000006 R_X86_64_GLOB_DAT 0000000000000000 __gmon_start__ + 0
The GOT serves as a kind of indirection that enables us to override symbols (global variables and functions) in shared libraries. This is done with the help of the dynamic linker, which can relocate GOT entries at load time or runtime (lazy binding) so that they contain absolute memory addresses to the proper (overridden) symbols. For example, we can also define a global variable x
in main.c
and this one is going to override the one provided in the shared library foo.so
:
$ cat main.c #include <stdio.h> int foo(int i); #ifdef OVERRIDE_X int x = 13; #endif int main() { int r = foo(3); printf("%d\n", r); return r; } $ gcc main.c ./foo.so -o main $ ./main 20 $ gcc main.c ./foo.so -o main -DOVERRIDE_X $ ./main 16
Another typical way of overriding symbols in a shared library is to use LD_PRELOAD
which we will cover later.
Back to the questions. Reading the global variable x
via an extra indirection from the GOT is to allow things like symbol overriding. The GOT entry for x
will be filled by the dynamic linker at load time or runtime, so it’s fine to leave the value empty.
Next, let’s focus on function calls in shared libraries. This involves two aspects: calling library functions from our application code and function calls within the shared library. For the former, see the disassembly below:
$ gcc main.c ./foo.so -o main $ objdump -d main ... Disassembly of section .plt: 0000000000001020 <.plt>: 1020: ff 35 92 2f 00 00 push 0x2f92(%rip) # 3fb8 <_GLOBAL_OFFSET_TABLE_+0x8> 1026: f2 ff 25 93 2f 00 00 bnd jmp *0x2f93(%rip) # 3fc0 <_GLOBAL_OFFSET_TABLE_+0x10> 102d: 0f 1f 00 nopl (%rax) 1030: f3 0f 1e fa endbr64 1034: 68 00 00 00 00 push $0x0 1039: f2 e9 e1 ff ff ff bnd jmp 1020 <_init+0x20> 103f: 90 nop 1040: f3 0f 1e fa endbr64 1044: 68 01 00 00 00 push $0x1 1049: f2 e9 d1 ff ff ff bnd jmp 1020 <_init+0x20> 104f: 90 nop ... Disassembly of section .plt.sec: 0000000000001060 <printf@plt>: 1060: f3 0f 1e fa endbr64 1064: f2 ff 25 5d 2f 00 00 bnd jmp *0x2f5d(%rip) # 3fc8 <printf@GLIBC_2.2.5> 106b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) 0000000000001070 <foo@plt>: 1070: f3 0f 1e fa endbr64 1074: f2 ff 25 55 2f 00 00 bnd jmp *0x2f55(%rip) # 3fd0 <foo@Base> 107b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) Disassembly of section .text: ... 0000000000001169 <main>: 1169: f3 0f 1e fa endbr64 116d: 55 push %rbp 116e: 48 89 e5 mov %rsp,%rbp 1171: 48 83 ec 10 sub $0x10,%rsp 1175: bf 03 00 00 00 mov $0x3,%edi 117a: e8 f1 fe ff ff call 1070 <foo@plt> 117f: 89 45 fc mov %eax,-0x4(%rbp) 1182: 8b 45 fc mov -0x4(%rbp),%eax 1185: 89 c6 mov %eax,%esi 1187: 48 8d 05 76 0e 00 00 lea 0xe76(%rip),%rax # 2004 <_IO_stdin_used+0x4> 118e: 48 89 c7 mov %rax,%rdi 1191: b8 00 00 00 00 mov $0x0,%eax 1196: e8 c5 fe ff ff call 1060 <printf@plt> 119b: 8b 45 fc mov -0x4(%rbp),%eax 119e: c9 leave 119f: c3 ret ... $ objdump --full-contents main ... Contents of section .dynamic: 3db0 01000000 00000000 72000000 00000000 ........r....... 3dc0 01000000 00000000 7b000000 00000000 ........{....... ... Contents of section .got: 3fb0 b03d0000 00000000 00000000 00000000 .=.............. 3fc0 00000000 00000000 30100000 00000000 ........0....... 3fd0 40100000 00000000 00000000 00000000 @............... 3fe0 00000000 00000000 00000000 00000000 ................ 3ff0 00000000 00000000 00000000 00000000 ................ Contents of section .data: 4000 00000000 00000000 08400000 00000000 .........@...... ...
There’s a lot of jumps happening here. We can see that calls to shared library functions are made through the procedure linkage table (PLT). Each shared library function called by the executable has its own PLT entry in the .plt.sec section. In our example, the PLT entries are foo@plt
and printf@plt
. PLT entries will be relocated lazily by the dynamic linker at runtime, meaning that the address of each function will be resolved by the dynamic linker the first time the function is called. This makes sense since usually a shared library, e.g. libc
, encompasses a large number of functions, and only a few get called by our program. This lazy binding saves a lot of work for the dynamic linker.
Now we walk through our example of calling the library function foo
. Function main
calls into foo@plt
, which immediately jumps to another memory address via GOT[4]
. The initial value of GOT[4]
is 0x1040
. Following that, the execution jumps to somewhere in the .plt section and pushes the index of this PLT entry (0x1
for foo
; 0x0
for printf
) onto the stack and then branches to the common code at the beginning of the .plt section. The common code pushes GOT[1]
onto the stack and the execution jumps to GOT[2]
. It can be guessed that GOT[1]
is a pointer to relocation information, and it will be used together with the pushed PLT entry index as two arguments by the dynamic linker to resolve the address of foo
. GOT[2]
should be the entry point for the dynamic linker to perform relocations. GOT[1]
and GOT[2]
will be filled by the dynamic linker at load time. After the dynamic linker figures out the address of foo
at runtime, it rewrites GOT[4]
with this address so that subsequent calls to foo
will jump directly to the resolved destination. Finally, the control is transferred back to foo
when the dynamic linker has done its job. What a beautiful engineering!
Within a shared library, let’s see what happens when a function calls another function.
// bar.c int x = 17; int bar(int i) { return i + 42; } int foo(int i) { return bar(i) + x; }
$ gcc -shared -fpic bar.c -o bar.so $ objdump -d bar.so ... Disassembly of section .plt.sec: 0000000000001050 <bar@plt>: 1050: f3 0f 1e fa endbr64 1054: f2 ff 25 bd 2f 00 00 bnd jmp *0x2fbd(%rip) # 4018 <bar+0x2eff> 105b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) Disassembly of section .text: ... 0000000000001119 <bar>: 1119: f3 0f 1e fa endbr64 111d: 55 push %rbp 111e: 48 89 e5 mov %rsp,%rbp 1121: 89 7d fc mov %edi,-0x4(%rbp) 1124: 8b 45 fc mov -0x4(%rbp),%eax 1127: 83 c0 2a add $0x2a,%eax 112a: 5d pop %rbp 112b: c3 ret 000000000000112c <foo>: 112c: f3 0f 1e fa endbr64 1130: 55 push %rbp 1131: 48 89 e5 mov %rsp,%rbp 1134: 48 83 ec 10 sub $0x10,%rsp 1138: 89 7d fc mov %edi,-0x4(%rbp) 113b: 8b 45 fc mov -0x4(%rbp),%eax 113e: 89 c7 mov %eax,%edi 1140: e8 0b ff ff ff call 1050 <bar@plt> 1145: 48 8b 15 8c 2e 00 00 mov 0x2e8c(%rip),%rdx # 3fd8 <x-0x50> 114c: 8b 12 mov (%rdx),%edx 114e: 01 d0 add %edx,%eax 1150: c9 leave 1151: c3 ret ...
Well, we can see that the call to bar
is done through a PLT entry, just like it’s been called from non-library code. This actually makes sense since all exported symbols can be overridden. If we don’t want to export our internal implementation functions or variables, we can declare them static
.
$ cat bar.c int x = 17; static int bar(int i) { return i + 42; } int foo(int i) { return bar(i) + x; } $ gcc -shared -fpic bar.c -o bar.so $ objdump -d bar.so ... 00000000000010f9 <bar>: 10f9: f3 0f 1e fa endbr64 10fd: 55 push %rbp 10fe: 48 89 e5 mov %rsp,%rbp 1101: 89 7d fc mov %edi,-0x4(%rbp) 1104: 8b 45 fc mov -0x4(%rbp),%eax 1107: 83 c0 2a add $0x2a,%eax 110a: 5d pop %rbp 110b: c3 ret 000000000000110c <foo>: 110c: f3 0f 1e fa endbr64 1110: 55 push %rbp 1111: 48 89 e5 mov %rsp,%rbp 1114: 48 83 ec 08 sub $0x8,%rsp 1118: 89 7d fc mov %edi,-0x4(%rbp) 111b: 8b 45 fc mov -0x4(%rbp),%eax 111e: 89 c7 mov %eax,%edi 1120: e8 d4 ff ff ff call 10f9 <bar> 1125: 48 8b 15 ac 2e 00 00 mov 0x2eac(%rip),%rdx # 3fd8 <x-0x48> 112c: 8b 12 mov (%rdx),%edx 112e: 01 d0 add %edx,%eax 1130: c9 leave 1131: c3 ret ...
In designing a library, we often have headers and their implementation details will be put in separate translation units. Those functions have external linkage. But we don’t want to export those symbols unless they are API functions intended to be provided to the user. Under such cases, we can pass -fvisibility=hidden
to the compiler and explicitly mark the symbols we want to export using __attribute__((__visibility__("default")))
(in practice, a descriptive macro can be used for this, e.g. LIBNAME_EXPORT
), see more in the GCC manual. Doing so has many practical benefits, for example, less work for the dynamic linker, more efficient code can be generated since no PLT entry is needed, and for data references, there will be only one data load vs two loads if it gets exported.
$ cat bar.c int x = 17; int bar(int i) { return i + 42; } __attribute__((__visibility__("default"))) int foo(int i) { return bar(i) + x; } $ gcc -shared -fpic bar.c -o bar.so -fvisibility=hidden $ objdump -d bar.so ... 00000000000010f9 <bar>: 10f9: f3 0f 1e fa endbr64 10fd: 55 push %rbp 10fe: 48 89 e5 mov %rsp,%rbp 1101: 89 7d fc mov %edi,-0x4(%rbp) 1104: 8b 45 fc mov -0x4(%rbp),%eax 1107: 83 c0 2a add $0x2a,%eax 110a: 5d pop %rbp 110b: c3 ret 000000000000110c <foo>: 110c: f3 0f 1e fa endbr64 1110: 55 push %rbp 1111: 48 89 e5 mov %rsp,%rbp 1114: 48 83 ec 08 sub $0x8,%rsp 1118: 89 7d fc mov %edi,-0x4(%rbp) 111b: 8b 45 fc mov -0x4(%rbp),%eax 111e: 89 c7 mov %eax,%edi 1120: e8 d4 ff ff ff call 10f9 <bar> 1125: 8b 15 f5 2e 00 00 mov 0x2ef5(%rip),%edx # 4020 <x> 112b: 01 d0 add %edx,%eax 112d: c9 leave 112e: c3 ret ...
At the end of this section, let’s see an example of using LD_PRELOAD
to override symbols by loading specific libraries before any other shared libraries. In practice, LD_PRELOAD
can be used to override the malloc
implementation in libc.so
or to load a debug library that contains logging versions of some functions (e.g. to debug some multithreaded code by logging each thread’s activity).
$ cat foo.c int x = 17; int foo(int i) { return i + x; } $ cat main.c #include <stdio.h> int foo(int i); int main() { int r = foo(3); printf("%d\n", r); return r; } $ cat bar.c int x = 17; int bar(int i) { return i + 42; } __attribute__((__visibility__("default"))) int foo(int i) { return bar(i) + x; } $ gcc -shared -fpic foo.c -o foo.so $ gcc -shared -fpic bar.c -o bar.so -fvisibility=hidden $ gcc main.c ./foo.so -o main $ ./main 20 $ LD_PRELOAD=./bar.so ./main 62
Oh, one more thing to highlight. Symbol collisions in shared libraries can catch you unexpectedly. The exported symbols in a shared library can be overridden by anything as long as they have the same name. This means an exported function/variable in a shared library can be overridden by a global variable of any type, or by a function of any signature. If the same symbol is used for different things, it’s an undefined behavior. Symbol collisions can happen between a shared library and our application code, or between two shared libraries linked by our innocent application. Therefore, it’s a good practice to limit the visibility of the symbols in a shared library as small as possible. Better performance and lower chance of symbol collision. Why not!
$ cat foo.c #include <stdio.h> int x = 5; int foobar(int i) { printf("foobar called from foo.c, argument i=%d\n", i); return i + x; } int foo(int i) { return foobar(i) * 10; } $ cat bar.c #include <stdio.h> int foobar(int i, int j) { printf("foobar called from bar.c, arguments i=%d, j=%d\n", i, j); return i + j; } int bar(int i, int j) { return foobar(i, j) + 1; } $ cat barz.c #include <stdio.h> int foobar = 2; int bar(int i, int j) { printf("foobar=%d\n", foobar); return foobar + i + j + 5; } $ cat main.c #include <stdio.h> int foo(int i); // from foo.so int bar(int i, int j); // from bar(z).so int main() { int a = foo(3); int b = bar(3, 7); printf("%d\n", a); printf("%d\n", b); return a + b; } $ gcc -shared -fpic foo.c -o foo.so $ gcc -shared -fpic bar.c -o bar.so $ gcc -shared -fpic barz.c -o barz.so $ gcc main.c ./foo.so ./bar.so -o main $ ./main foobar called from foo.c, argument i=3 foobar called from foo.c, argument i=3 80 9 $ gcc main.c ./bar.so ./foo.so -o main $ ./main foobar called from bar.c, arguments i=3, j=2081388488 foobar called from bar.c, arguments i=3, j=7 -660951570 11 $ gcc main.c ./foo.so ./barz.so -o main $ ./main foobar called from foo.c, argument i=3 foobar=-98693133 80 -98693118 $ gcc main.c ./barz.so ./foo.so -o main $ ./main fish: Job 1, './main' terminated by signal SIGSEGV (Address boundary error)
In the example above, our program calls two external functions from two shared libraries: foo
from foo.so
, and bar
from bar.so
or barz.so
. They all call or reference foobar
. Whichever the shared library is loaded first overrides the other library’s foobar
symbol. From the results, we can see that the library load order seems to match the library link order. But we can manually specify the load order using LD_PRELOAD
.
$ gcc main.c ./foo.so ./bar.so -o main $ LD_PRELOAD=./foo.so:./bar.so ./main foobar called from foo.c, argument i=3 foobar called from foo.c, argument i=3 80 9 $ LD_PRELOAD=./bar.so:./foo.so ./main foobar called from bar.c, arguments i=3, j=1241836152 foobar called from bar.c, arguments i=3, j=7 -466540338 11 $ gcc main.c ./foo.so ./barz.so -o main $ LD_PRELOAD=./foo.so:./barz.so ./main foobar called from foo.c, argument i=3 foobar=-98693133 80 -98693118 $ LD_PRELOAD=./barz.so:./foo.so ./main fish: Job 1, 'LD_PRELOAD=./barz.so:./foo.so .…' terminated by signal SIGSEGV (Address boundary error)
The point here is that if symbol collisions happened, there would be many possible ways our program might pan out due to undefined behaviors. If we only exported foo
and bar
, none of them would happen. If we really need to export foobar
as it may be an API function, can we avoid the symbol collision on it? The answer is to pass -Bsymbolic
or -Bsymbolic-functions
to the program linker when creating the shared library. These options make the shared library bind references to global (function) symbols to the definition within the shared library, if any. This means calls to functions within the shared library don’t have to go through the PLT, and referenced data doesn’t have to be read from the GOT. Thus, more efficient code can be generated by the compiler. But this renders symbol overriding impossible, even for exported public symbols, as no dynamic relocation entry for the shared library would be generated. If one wants finer control, --dynamic-list
can be used.
$ gcc -shared -fpic foo.c -o foo_symbolic.so -Wl,-Bsymbolic $ readelf -r foo.so Relocation section '.rela.dyn' at offset 0x4b0 contains 8 entries: Offset Info Type Sym. Value Sym. Name + Addend 000000003e08 000000000008 R_X86_64_RELATIVE 1130 000000003e10 000000000008 R_X86_64_RELATIVE 10f0 000000004028 000000000008 R_X86_64_RELATIVE 4028 000000003fd8 000100000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_deregisterTM[...] + 0 000000003fe0 000600000006 R_X86_64_GLOB_DAT 0000000000004030 x + 0 000000003fe8 000300000006 R_X86_64_GLOB_DAT 0000000000000000 __gmon_start__ + 0 000000003ff0 000400000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_registerTMCl[...] + 0 000000003ff8 000500000006 R_X86_64_GLOB_DAT 0000000000000000 __cxa_finalize@GLIBC_2.2.5 + 0 Relocation section '.rela.plt' at offset 0x570 contains 2 entries: Offset Info Type Sym. Value Sym. Name + Addend 000000004018 000800000007 R_X86_64_JUMP_SLO 0000000000001139 foobar + 0 000000004020 000200000007 R_X86_64_JUMP_SLO 0000000000000000 printf@GLIBC_2.2.5 + 0 $ readelf -r foo_symbolic.so # no relocations for x and foobar Relocation section '.rela.dyn' at offset 0x4b0 contains 7 entries: Offset Info Type Sym. Value Sym. Name + Addend 000000003df0 000000000008 R_X86_64_RELATIVE 1110 000000003df8 000000000008 R_X86_64_RELATIVE 10d0 000000004020 000000000008 R_X86_64_RELATIVE 4020 000000003fe0 000100000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_deregisterTM[...] + 0 000000003fe8 000300000006 R_X86_64_GLOB_DAT 0000000000000000 __gmon_start__ + 0 000000003ff0 000400000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_registerTMCl[...] + 0 000000003ff8 000500000006 R_X86_64_GLOB_DAT 0000000000000000 __cxa_finalize@GLIBC_2.2.5 + 0 Relocation section '.rela.plt' at offset 0x558 contains 1 entry: Offset Info Type Sym. Value Sym. Name + Addend 000000004018 000200000007 R_X86_64_JUMP_SLO 0000000000000000 printf@GLIBC_2.2.5 + 0 $ objdump -d foo_symbolic.so ... Disassembly of section .plt.sec: 0000000000001050 <printf@plt>: 1050: f3 0f 1e fa endbr64 1054: f2 ff 25 bd 2f 00 00 bnd jmp *0x2fbd(%rip) # 4018 <printf@GLIBC_2.2.5> 105b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) Disassembly of section .text: 0000000000001060 <deregister_tm_clones>: ... 0000000000001119 <foobar>: 1119: f3 0f 1e fa endbr64 111d: 55 push %rbp 111e: 48 89 e5 mov %rsp,%rbp 1121: 48 83 ec 10 sub $0x10,%rsp 1125: 89 7d fc mov %edi,-0x4(%rbp) 1128: 8b 45 fc mov -0x4(%rbp),%eax 112b: 89 c6 mov %eax,%esi 112d: 48 8d 05 cc 0e 00 00 lea 0xecc(%rip),%rax # 2000 <_fini+0xe88> 1134: 48 89 c7 mov %rax,%rdi 1137: b8 00 00 00 00 mov $0x0,%eax 113c: e8 0f ff ff ff call 1050 <printf@plt> 1141: 48 8d 05 e0 2e 00 00 lea 0x2ee0(%rip),%rax # 4028 <x> 1148: 8b 10 mov (%rax),%edx 114a: 8b 45 fc mov -0x4(%rbp),%eax 114d: 01 d0 add %edx,%eax 114f: c9 leave 1150: c3 ret 0000000000001151 <foo>: 1151: f3 0f 1e fa endbr64 1155: 55 push %rbp 1156: 48 89 e5 mov %rsp,%rbp 1159: 48 83 ec 10 sub $0x10,%rsp 115d: 89 7d fc mov %edi,-0x4(%rbp) 1160: 8b 45 fc mov -0x4(%rbp),%eax 1163: 89 c7 mov %eax,%edi 1165: e8 af ff ff ff call 1119 <foobar> 116a: 89 c2 mov %eax,%edx 116c: 89 d0 mov %edx,%eax 116e: c1 e0 02 shl $0x2,%eax 1171: 01 d0 add %edx,%eax 1173: 01 c0 add %eax,%eax 1175: c9 leave 1176: c3 ret ... $ objdump --full-contents foo_symbolic.so ... Contents of section .data: 4020 20400000 00000000 05000000 @.......... ...
Some other good stuff to read on shared libraries: “How To Write Shared Libraries” by Ulrich Drepper.
Link-Time Optimization
Link-time optimization (LTO) is also called interprocedural optimization. The kind of optimization, as its name indicates, happens at the link stage when the compiler sees the whole program. LTO is done by dumping the compiler’s intermediate representation (GIMPLE in GCC, LLVM IR/bitcode in Clang) when compiling each source file (translation unit). So the resulting object files are fat as they contain extra information used for LTO. Typical LTO includes function inlining, dead code elimination, etc. Read more here or watch this awesome video. We’ll instead focus on a concrete and interesting example: trying to inline a comparison function pointer for a library routine insertion_sort
in C.
// isort.h #pragma once #include <stddef.h> void insertion_sort(void *ptr, size_t num, size_t size, int (*comp)(const void *, const void *)); // isort.c #include "isort.h" #include <stdio.h> static void swap_item(char *a, char *b, size_t size) { for (size_t i = 0; i < size; ++i) { char tmp = a[i]; a[i] = b[i]; b[i] = tmp; } } void insertion_sort(void *ptr, size_t num, size_t size, int (*comp)(const void *, const void *)) { char *base = (char *)ptr; for (size_t i = 1; i < num; ++i) { for (size_t j = i; j > 0 && comp(base + (j-1)*size, base + j*size) > 0; --j) { swap_item(base + (j-1)*size, base + j*size, size); } } } // main.c #include "isort.h" #include <stdio.h> int compare_int(const void *a, const void *b) { return *(int *)a - *(int *)b; } int main() { int arr[] = {7, 3, 1, 5, 2, 9, 1, 7, 8, 2, 4, 6, 5}; size_t arr_size = sizeof(arr) / sizeof(arr[0]); insertion_sort(arr, arr_size, sizeof(arr[0]), compare_int); for (size_t i = 0; i < arr_size; ++i) { printf("%d ", arr[i]); } printf("\n"); return 0; }
This is a typical setup – the definition of the library routine insertion_sort
is separate from our application code, and our application code provides a custom comparison function upon calling the library routine.
$ gcc -o main main.c isort.c -O3 $ objdump -d main | grep call 1014: ff d0 call *%rax 110d: e8 0e 03 00 00 call 1420 <insertion_sort> 1128: e8 63 ff ff ff call 1090 <__printf_chk@plt> 1137: e8 34 ff ff ff call 1070 <putchar@plt> 1157: e8 24 ff ff ff call 1080 <__stack_chk_fail@plt> 117f: ff 15 53 2e 00 00 call *0x2e53(%rip) # 3fd8 <__libc_start_main@GLIBC_2.34> 1222: e8 39 fe ff ff call 1060 <__cxa_finalize@plt> 1227: e8 64 ff ff ff call 1190 <deregister_tm_clones> 14e1: ff d0 call *%rax $ gcc -o main main.c isort.c -O3 -flto $ objdump -d main | grep call 1014: ff d0 call *%rax 1180: e8 0b ff ff ff call 1090 <__printf_chk@plt> 118f: e8 dc fe ff ff call 1070 <putchar@plt> 11af: e8 cc fe ff ff call 1080 <__stack_chk_fail@plt> 11df: ff 15 f3 2d 00 00 call *0x2df3(%rip) # 3fd8 <__libc_start_main@GLIBC_2.34> 1282: e8 d9 fd ff ff call 1060 <__cxa_finalize@plt> 1287: e8 64 ff ff ff call 11f0 <deregister_tm_clones>
From the grepped results we can tell, even with -O3
optimization level, that the comparison function is still invoked via an indirect call *%rax
rather than being inlined, which is bad for performance especially when called within a tight loop. With LTO, both the insertion_sort
routine and the comparison function are inlined into main
. Sadly, even if we put all the stuff in a single file, the comparator is not inlined (Godbolt link). How about LTO for shared libraries and static libraries?
$ gcc -shared -fpic isort.c -O3 -flto -o libisort.so $ gcc main.c -L. -lisort -O3 -flto -o main $ objdump -d main | grep call 1014: ff d0 call *%rax 112d: e8 7e ff ff ff call 10b0 <insertion_sort@plt> 1148: e8 53 ff ff ff call 10a0 <__printf_chk@plt> 1157: e8 24 ff ff ff call 1080 <putchar@plt> 1177: e8 14 ff ff ff call 1090 <__stack_chk_fail@plt> 119f: ff 15 33 2e 00 00 call *0x2e33(%rip) # 3fd8 <__libc_start_main@GLIBC_2.34> 1242: e8 29 fe ff ff call 1070 <__cxa_finalize@plt> 1247: e8 64 ff ff ff call 11b0 <deregister_tm_clones> $ objdump -d isort.so | grep call 1014: ff d0 call *%rax 10d2: e8 59 ff ff ff call 1030 <__cxa_finalize@plt> 10d7: e8 64 ff ff ff call 1040 <deregister_tm_clones> 11c1: ff d0 call *%rax $ gcc isort.c -c -O3 -flto $ ar rcs libisort.a isort.o ar: isort.o: plugin needed to handle lto object $ gcc-ar rcs libisort.a isort.o $ gcc main.c -L. -l:libisort.a -O3 -flto -o main $ objdump -d main | grep call 1014: ff d0 call *%rax 1180: e8 0b ff ff ff call 1090 <__printf_chk@plt> 118f: e8 dc fe ff ff call 1070 <putchar@plt> 11af: e8 cc fe ff ff call 1080 <__stack_chk_fail@plt> 11df: ff 15 f3 2d 00 00 call *0x2df3(%rip) # 3fd8 <__libc_start_main@GLIBC_2.34> 1282: e8 d9 fd ff ff call 1060 <__cxa_finalize@plt> 1287: e8 64 ff ff ff call 11f0 <deregister_tm_clones>
There should be no surprise. For shared library routines, since they can be overridden at runtime the compiler cannot assume the call to our library routine insertion_sort
will be the one used at runtime. Although LTO doesn’t work well for application code calling library code, compiling the shared library with -flto
flag may help the library itself generate more aggressive/performant code. Static libraries, however, are just archives of object files. As long as the archived object files needed for static linking contain LTO information, LTO can be performed in the same way as for object files.