There are a few things about linking that I think are worthy of writing down after I studied linking from CSAPP-3e and some resources on the web.
Linkage of const Global Variables
const global variables in C++ have internal linkage (as declared static) unless they’re explicitly declared extern or inline (see C++17 inline variable), whereas in C external linkage (as declared extern) is the default for all file-scoped entities (functions and global variables). An excellent answer on StackOverflow about inline variables can be found here. One of the major uses of the inline keyword is to enable multiple identical function definitions across translation units (TUs) (in practice function definitions in header files), so this meaning was extended to variables to allow an inline const variable to have a uniform memory address across TUs and allow a non-const global variable to be defined in headers without redefinition problems.
$ cat t.c int int_ = 1; static int static_int = 1; const int const_int = 1; static const int static_const_int = 1; $ gcc -c t.c $ nm t.o 0000000000000000 R const_int # global read-only data section in C 0000000000000000 D int_ 0000000000000004 r static_const_int 0000000000000004 d static_int $ g++ -c t.c $ nm -C t.o 0000000000000000 D int_ 0000000000000004 d static_int 0000000000000004 r static_const_int 0000000000000000 r const_int # local read-only data section in C++
Link Order
The link order matters when we link against static or shared libraries. The rules used by the linker to resolve symbols are clearly described in the Symbol Resolution section of the Linking chapter in CSAPP-3e. Please refer to the text or this post from Ian Lance Taylor’s blog for more information.
Shared Libraries
Shared libraries are ubiquitous on almost every platform that has an operating system. The key purpose of shared libraries is to share code among processes and thus save memory. One approach to sharing library code could be to put the given library code at a specific memory location. But it has at least two problems: (a) There are so many (literally unlimited number of) libraries, how do we specify which library to put at which location? If there are ten libraries in total, but a program is only linked against one or two of them, the reserved memory for other shared libraries is wasted. (b) Having a library loaded at a fixed memory location is vulnerable to attacks. So for these reasons, shared libraries are implemented in a way that they can be loaded at any (random) memory locations that you don’t know in advance.
Luckily, the compiler can generate position-independent code (PIC) for us. Let’s first look at a normal non-PIC case:
// foo.c
int x = 17;
int foo(int i) {
return i + x;
}
// main.c
#include <stdio.h>
int foo(int i);
int main() {
int r = foo(3);
printf("%d\n", r);
return r;
}
$ gcc main.c foo.c -o main
$ objdump -d main
...
0000000000001149 <main>:
1149: f3 0f 1e fa endbr64
114d: 55 push %rbp
114e: 48 89 e5 mov %rsp,%rbp
1151: 48 83 ec 10 sub $0x10,%rsp
1155: bf 03 00 00 00 mov $0x3,%edi
115a: e8 21 00 00 00 call 1180 <foo>
115f: 89 45 fc mov %eax,-0x4(%rbp)
1162: 8b 45 fc mov -0x4(%rbp),%eax
1165: 89 c6 mov %eax,%esi
1167: 48 8d 05 96 0e 00 00 lea 0xe96(%rip),%rax # 2004 <_IO_stdin_used+0x4>
116e: 48 89 c7 mov %rax,%rdi
1171: b8 00 00 00 00 mov $0x0,%eax
1176: e8 d5 fe ff ff call 1050 <printf@plt>
117b: 8b 45 fc mov -0x4(%rbp),%eax
117e: c9 leave
117f: c3 ret
0000000000001180 <foo>:
1180: f3 0f 1e fa endbr64
1184: 55 push %rbp
1185: 48 89 e5 mov %rsp,%rbp
1188: 89 7d fc mov %edi,-0x4(%rbp)
118b: 8b 15 7f 2e 00 00 mov 0x2e7f(%rip),%edx # 4010 <x>
1191: 8b 45 fc mov -0x4(%rbp),%eax
1194: 01 d0 add %edx,%eax
1196: 5d pop %rbp
1197: c3 ret
...
$ objdump --full-contents main
...
Contents of section .data:
4000 00000000 00000000 08400000 00000000 .........@......
4010 11000000
...
Fun story: I was confused about the disassembly once by the line call 1180 <foo>. I thought it was an absolute address since it could use something like 0x21(%rip) for relative addressing otherwise. I knew if I printed the function’s address, it would be different every time it was executed since the OS would load the code (.text section) at random addresses for security’s sake. But I also knew the code section was read-only. So, how was address space layout randomization implemented? Did the kernel (dynamic linker) perform some relocation and modify the machine code anyway? If it did, that would be soooo inefficient, for all calls to other normal functions would need to be modified as well. I just couldn’t figure it out. Then I realized that the secret must be behind the machine code e8 21 00 00 00 itself. So I asked ChatGPT about it. It turned out that, yes indeed, this call instruction uses relative addressing.
Therefore, knowing how some instructions are encoded is helpful in many situations. In x86 assembly, the opcode e8 is a near call (see more at https://shell-storm.org/x86doc/CALL.html). The 4 bytes followed are the operand which is encoded in two’s complement. x86 machines are little-endian. So the rel32 operand is 0x21, added to the RIP register (115f) gives our target address 1180. Similarly, the E9 JMP instruction also uses relative addressing. Another example: the operand in 1176: e8 d5 fe ff ff is the value whose two’s complement is 0xfffffed5. The value is -12b. Therefore, this instruction can be decoded as call RIP-12b (RIP = 117b), i.e. call 1050. Most of the time it’s fine just to look at the decoded assembler code rather than the machine code itself. However, knowing the instruction encoding makes us better understand how the CPU really executes some instructions (e.g. call RIP-12b shows us that at runtime the CPU pushes the address of the next instruction (RIP) onto the stack and then the execution jumps to the memory location with an offset of -12b to RIP).
Go back to the example above. We can see that main makes a near call to foo, and foo references the the global variable in the .data section which has an initial value 0x11 (17). Nothing special. Now we make foo.c a shared library.
$ gcc -shared -fpic foo.c -o foo.so
$ objdump -d foo.so
...
00000000000010f9 <foo>:
10f9: f3 0f 1e fa endbr64
10fd: 55 push %rbp
10fe: 48 89 e5 mov %rsp,%rbp
1101: 89 7d fc mov %edi,-0x4(%rbp)
1104: 48 8b 05 cd 2e 00 00 mov 0x2ecd(%rip),%rax # 3fd8 <x-0x48>
110b: 8b 10 mov (%rax),%edx
110d: 8b 45 fc mov -0x4(%rbp),%eax
1110: 01 d0 add %edx,%eax
1112: 5d pop %rbp
1113: c3 ret
...
$ objdump --full-contents foo.so
...
Contents of section .got:
3fd8 00000000 00000000 00000000 00000000 ................
3fe8 00000000 00000000 00000000 00000000 ................
3ff8 00000000 00000000 ........
Contents of section .got.plt:
4000 883e0000 00000000 00000000 00000000 .>..............
4010 00000000 00000000 ........
Contents of section .data:
4018 18400000 00000000 11000000 .@..........
...
There’s an observable difference. Without being PIC, the instruction for referring to x is mov 0x2e7f(%rip),%edx which copies 4 bytes from the .data section to register %edx. Now being PIC, two load instructions are involved: mov 0x2ecd(%rip),%rax and mov (%rax),%edx. The first one copies 8 bytes from the address 3fd8 in the .got section to register %rax, and the second accesses the lower half part of the value stored at address in %rax. But why is the global variable x now read from the .got section and why does the address 3fd8 contain an empty value?
To answer these questions, we are introduced to the global offset table (GOT), which contains an 8-byte entry (a pointer) for each global data object (procedure or global variable) that is referenced by the object module. The compiler generates a relocation record for each entry in the GOT.
$ readelf --relocs foo.so Relocation section '.rela.dyn' at offset 0x420 contains 8 entries: Offset Info Type Sym. Value Sym. Name + Addend 000000003e78 000000000008 R_X86_64_RELATIVE 10f0 000000003e80 000000000008 R_X86_64_RELATIVE 10b0 000000004018 000000000008 R_X86_64_RELATIVE 4018 000000003fd8 000500000006 R_X86_64_GLOB_DAT 0000000000004020 x + 0 000000003fe0 000100000006 R_X86_64_GLOB_DAT 0000000000000000 __cxa_finalize + 0 000000003fe8 000200000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_registerTMCl[...] + 0 000000003ff0 000300000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_deregisterTM[...] + 0 000000003ff8 000400000006 R_X86_64_GLOB_DAT 0000000000000000 __gmon_start__ + 0
The GOT serves as a kind of indirection that enables us to override symbols (global variables and functions) in shared libraries. This is done with the help of the dynamic linker, which can relocate GOT entries at load time or runtime (lazy binding) so that they contain absolute memory addresses to the proper (overridden) symbols. For example, we can also define a global variable x in main.c and this one is going to override the one provided in the shared library foo.so:
$ cat main.c
#include <stdio.h>
int foo(int i);
#ifdef OVERRIDE_X
int x = 13;
#endif
int main() {
int r = foo(3);
printf("%d\n", r);
return r;
}
$ gcc main.c ./foo.so -o main
$ ./main
20
$ gcc main.c ./foo.so -o main -DOVERRIDE_X
$ ./main
16
Another typical way of overriding symbols in a shared library is to use LD_PRELOAD which we will cover later.
Back to the questions. Reading the global variable x via an extra indirection from the GOT is to allow things like symbol overriding. The GOT entry for x will be filled by the dynamic linker at load time or runtime, so it’s fine to leave the value empty.
Next, let’s focus on function calls in shared libraries. This involves two aspects: calling library functions from our application code and function calls within the shared library. For the former, see the disassembly below:
$ gcc main.c ./foo.so -o main
$ objdump -d main
...
Disassembly of section .plt:
0000000000001020 <.plt>:
1020: ff 35 92 2f 00 00 push 0x2f92(%rip) # 3fb8 <_GLOBAL_OFFSET_TABLE_+0x8>
1026: f2 ff 25 93 2f 00 00 bnd jmp *0x2f93(%rip) # 3fc0 <_GLOBAL_OFFSET_TABLE_+0x10>
102d: 0f 1f 00 nopl (%rax)
1030: f3 0f 1e fa endbr64
1034: 68 00 00 00 00 push $0x0
1039: f2 e9 e1 ff ff ff bnd jmp 1020 <_init+0x20>
103f: 90 nop
1040: f3 0f 1e fa endbr64
1044: 68 01 00 00 00 push $0x1
1049: f2 e9 d1 ff ff ff bnd jmp 1020 <_init+0x20>
104f: 90 nop
...
Disassembly of section .plt.sec:
0000000000001060 <printf@plt>:
1060: f3 0f 1e fa endbr64
1064: f2 ff 25 5d 2f 00 00 bnd jmp *0x2f5d(%rip) # 3fc8 <printf@GLIBC_2.2.5>
106b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
0000000000001070 <foo@plt>:
1070: f3 0f 1e fa endbr64
1074: f2 ff 25 55 2f 00 00 bnd jmp *0x2f55(%rip) # 3fd0 <foo@Base>
107b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
Disassembly of section .text:
...
0000000000001169 <main>:
1169: f3 0f 1e fa endbr64
116d: 55 push %rbp
116e: 48 89 e5 mov %rsp,%rbp
1171: 48 83 ec 10 sub $0x10,%rsp
1175: bf 03 00 00 00 mov $0x3,%edi
117a: e8 f1 fe ff ff call 1070 <foo@plt>
117f: 89 45 fc mov %eax,-0x4(%rbp)
1182: 8b 45 fc mov -0x4(%rbp),%eax
1185: 89 c6 mov %eax,%esi
1187: 48 8d 05 76 0e 00 00 lea 0xe76(%rip),%rax # 2004 <_IO_stdin_used+0x4>
118e: 48 89 c7 mov %rax,%rdi
1191: b8 00 00 00 00 mov $0x0,%eax
1196: e8 c5 fe ff ff call 1060 <printf@plt>
119b: 8b 45 fc mov -0x4(%rbp),%eax
119e: c9 leave
119f: c3 ret
...
$ objdump --full-contents main
...
Contents of section .dynamic:
3db0 01000000 00000000 72000000 00000000 ........r.......
3dc0 01000000 00000000 7b000000 00000000 ........{.......
...
Contents of section .got:
3fb0 b03d0000 00000000 00000000 00000000 .=..............
3fc0 00000000 00000000 30100000 00000000 ........0.......
3fd0 40100000 00000000 00000000 00000000 @...............
3fe0 00000000 00000000 00000000 00000000 ................
3ff0 00000000 00000000 00000000 00000000 ................
Contents of section .data:
4000 00000000 00000000 08400000 00000000 .........@......
...
There’s a lot of jumps happening here. We can see that calls to shared library functions are made through the procedure linkage table (PLT). Each shared library function called by the executable has its own PLT entry in the .plt.sec section. In our example, the PLT entries are foo@plt and printf@plt. PLT entries will be relocated lazily by the dynamic linker at runtime, meaning that the address of each function will be resolved by the dynamic linker the first time the function is called. This makes sense since usually a shared library, e.g. libc, encompasses a large number of functions, and only a few get called by our program. This lazy binding saves a lot of work for the dynamic linker.
Now we walk through our example of calling the library function foo. Function main calls into foo@plt, which immediately jumps to another memory address via GOT[4]. The initial value of GOT[4] is 0x1040. Following that, the execution jumps to somewhere in the .plt section and pushes the index of this PLT entry (0x1 for foo; 0x0 for printf) onto the stack and then branches to the common code at the beginning of the .plt section. The common code pushes GOT[1] onto the stack and the execution jumps to GOT[2]. It can be guessed that GOT[1] is a pointer to relocation information, and it will be used together with the pushed PLT entry index as two arguments by the dynamic linker to resolve the address of foo. GOT[2] should be the entry point for the dynamic linker to perform relocations. GOT[1] and GOT[2] will be filled by the dynamic linker at load time. After the dynamic linker figures out the address of foo at runtime, it rewrites GOT[4] with this address so that subsequent calls to foo will jump directly to the resolved destination. Finally, the control is transferred back to foo when the dynamic linker has done its job. What a beautiful engineering!
Within a shared library, let’s see what happens when a function calls another function.
// bar.c
int x = 17;
int bar(int i) {
return i + 42;
}
int foo(int i) {
return bar(i) + x;
}
$ gcc -shared -fpic bar.c -o bar.so
$ objdump -d bar.so
...
Disassembly of section .plt.sec:
0000000000001050 <bar@plt>:
1050: f3 0f 1e fa endbr64
1054: f2 ff 25 bd 2f 00 00 bnd jmp *0x2fbd(%rip) # 4018 <bar+0x2eff>
105b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
Disassembly of section .text:
...
0000000000001119 <bar>:
1119: f3 0f 1e fa endbr64
111d: 55 push %rbp
111e: 48 89 e5 mov %rsp,%rbp
1121: 89 7d fc mov %edi,-0x4(%rbp)
1124: 8b 45 fc mov -0x4(%rbp),%eax
1127: 83 c0 2a add $0x2a,%eax
112a: 5d pop %rbp
112b: c3 ret
000000000000112c <foo>:
112c: f3 0f 1e fa endbr64
1130: 55 push %rbp
1131: 48 89 e5 mov %rsp,%rbp
1134: 48 83 ec 10 sub $0x10,%rsp
1138: 89 7d fc mov %edi,-0x4(%rbp)
113b: 8b 45 fc mov -0x4(%rbp),%eax
113e: 89 c7 mov %eax,%edi
1140: e8 0b ff ff ff call 1050 <bar@plt>
1145: 48 8b 15 8c 2e 00 00 mov 0x2e8c(%rip),%rdx # 3fd8 <x-0x50>
114c: 8b 12 mov (%rdx),%edx
114e: 01 d0 add %edx,%eax
1150: c9 leave
1151: c3 ret
...
Well, we can see that the call to bar is done through a PLT entry, just like it’s been called from non-library code. This actually makes sense since all exported symbols can be overridden. If we don’t want to export our internal implementation functions or variables, we can declare them static.
$ cat bar.c
int x = 17;
static int bar(int i) {
return i + 42;
}
int foo(int i) {
return bar(i) + x;
}
$ gcc -shared -fpic bar.c -o bar.so
$ objdump -d bar.so
...
00000000000010f9 <bar>:
10f9: f3 0f 1e fa endbr64
10fd: 55 push %rbp
10fe: 48 89 e5 mov %rsp,%rbp
1101: 89 7d fc mov %edi,-0x4(%rbp)
1104: 8b 45 fc mov -0x4(%rbp),%eax
1107: 83 c0 2a add $0x2a,%eax
110a: 5d pop %rbp
110b: c3 ret
000000000000110c <foo>:
110c: f3 0f 1e fa endbr64
1110: 55 push %rbp
1111: 48 89 e5 mov %rsp,%rbp
1114: 48 83 ec 08 sub $0x8,%rsp
1118: 89 7d fc mov %edi,-0x4(%rbp)
111b: 8b 45 fc mov -0x4(%rbp),%eax
111e: 89 c7 mov %eax,%edi
1120: e8 d4 ff ff ff call 10f9 <bar>
1125: 48 8b 15 ac 2e 00 00 mov 0x2eac(%rip),%rdx # 3fd8 <x-0x48>
112c: 8b 12 mov (%rdx),%edx
112e: 01 d0 add %edx,%eax
1130: c9 leave
1131: c3 ret
...
In designing a library, we often have headers and their implementation details will be put in separate translation units. Those functions have external linkage. But we don’t want to export those symbols unless they are API functions intended to be provided to the user. Under such cases, we can pass -fvisibility=hidden to the compiler and explicitly mark the symbols we want to export using __attribute__((__visibility__("default"))) (in practice, a descriptive macro can be used for this, e.g. LIBNAME_EXPORT), see more in the GCC manual. Doing so has many practical benefits, for example, less work for the dynamic linker, more efficient code can be generated since no PLT entry is needed, and for data references, there will be only one data load vs two loads if it gets exported.
$ cat bar.c
int x = 17;
int bar(int i) {
return i + 42;
}
__attribute__((__visibility__("default")))
int foo(int i) {
return bar(i) + x;
}
$ gcc -shared -fpic bar.c -o bar.so -fvisibility=hidden
$ objdump -d bar.so
...
00000000000010f9 <bar>:
10f9: f3 0f 1e fa endbr64
10fd: 55 push %rbp
10fe: 48 89 e5 mov %rsp,%rbp
1101: 89 7d fc mov %edi,-0x4(%rbp)
1104: 8b 45 fc mov -0x4(%rbp),%eax
1107: 83 c0 2a add $0x2a,%eax
110a: 5d pop %rbp
110b: c3 ret
000000000000110c <foo>:
110c: f3 0f 1e fa endbr64
1110: 55 push %rbp
1111: 48 89 e5 mov %rsp,%rbp
1114: 48 83 ec 08 sub $0x8,%rsp
1118: 89 7d fc mov %edi,-0x4(%rbp)
111b: 8b 45 fc mov -0x4(%rbp),%eax
111e: 89 c7 mov %eax,%edi
1120: e8 d4 ff ff ff call 10f9 <bar>
1125: 8b 15 f5 2e 00 00 mov 0x2ef5(%rip),%edx # 4020 <x>
112b: 01 d0 add %edx,%eax
112d: c9 leave
112e: c3 ret
...
At the end of this section, let’s see an example of using LD_PRELOAD to override symbols by loading specific libraries before any other shared libraries. In practice, LD_PRELOAD can be used to override the malloc implementation in libc.so or to load a debug library that contains logging versions of some functions (e.g. to debug some multithreaded code by logging each thread’s activity).
$ cat foo.c
int x = 17;
int foo(int i) {
return i + x;
}
$ cat main.c
#include <stdio.h>
int foo(int i);
int main() {
int r = foo(3);
printf("%d\n", r);
return r;
}
$ cat bar.c
int x = 17;
int bar(int i) {
return i + 42;
}
__attribute__((__visibility__("default")))
int foo(int i) {
return bar(i) + x;
}
$ gcc -shared -fpic foo.c -o foo.so
$ gcc -shared -fpic bar.c -o bar.so -fvisibility=hidden
$ gcc main.c ./foo.so -o main
$ ./main
20
$ LD_PRELOAD=./bar.so ./main
62
Oh, one more thing to highlight. Symbol collisions in shared libraries can catch you unexpectedly. The exported symbols in a shared library can be overridden by anything as long as they have the same name. This means an exported function/variable in a shared library can be overridden by a global variable of any type, or by a function of any signature. If the same symbol is used for different things, it’s an undefined behavior. Symbol collisions can happen between a shared library and our application code, or between two shared libraries linked by our innocent application. Therefore, it’s a good practice to limit the visibility of the symbols in a shared library as small as possible. Better performance and lower chance of symbol collision. Why not!
$ cat foo.c
#include <stdio.h>
int x = 5;
int foobar(int i) {
printf("foobar called from foo.c, argument i=%d\n", i);
return i + x;
}
int foo(int i) {
return foobar(i) * 10;
}
$ cat bar.c
#include <stdio.h>
int foobar(int i, int j) {
printf("foobar called from bar.c, arguments i=%d, j=%d\n", i, j);
return i + j;
}
int bar(int i, int j) {
return foobar(i, j) + 1;
}
$ cat barz.c
#include <stdio.h>
int foobar = 2;
int bar(int i, int j) {
printf("foobar=%d\n", foobar);
return foobar + i + j + 5;
}
$ cat main.c
#include <stdio.h>
int foo(int i); // from foo.so
int bar(int i, int j); // from bar(z).so
int main() {
int a = foo(3);
int b = bar(3, 7);
printf("%d\n", a);
printf("%d\n", b);
return a + b;
}
$ gcc -shared -fpic foo.c -o foo.so
$ gcc -shared -fpic bar.c -o bar.so
$ gcc -shared -fpic barz.c -o barz.so
$ gcc main.c ./foo.so ./bar.so -o main
$ ./main
foobar called from foo.c, argument i=3
foobar called from foo.c, argument i=3
80
9
$ gcc main.c ./bar.so ./foo.so -o main
$ ./main
foobar called from bar.c, arguments i=3, j=2081388488
foobar called from bar.c, arguments i=3, j=7
-660951570
11
$ gcc main.c ./foo.so ./barz.so -o main
$ ./main
foobar called from foo.c, argument i=3
foobar=-98693133
80
-98693118
$ gcc main.c ./barz.so ./foo.so -o main
$ ./main
fish: Job 1, './main' terminated by signal SIGSEGV (Address boundary error)
In the example above, our program calls two external functions from two shared libraries: foo from foo.so, and bar from bar.so or barz.so. They all call or reference foobar. Whichever the shared library is loaded first overrides the other library’s foobar symbol. From the results, we can see that the library load order seems to match the library link order. But we can manually specify the load order using LD_PRELOAD.
$ gcc main.c ./foo.so ./bar.so -o main $ LD_PRELOAD=./foo.so:./bar.so ./main foobar called from foo.c, argument i=3 foobar called from foo.c, argument i=3 80 9 $ LD_PRELOAD=./bar.so:./foo.so ./main foobar called from bar.c, arguments i=3, j=1241836152 foobar called from bar.c, arguments i=3, j=7 -466540338 11 $ gcc main.c ./foo.so ./barz.so -o main $ LD_PRELOAD=./foo.so:./barz.so ./main foobar called from foo.c, argument i=3 foobar=-98693133 80 -98693118 $ LD_PRELOAD=./barz.so:./foo.so ./main fish: Job 1, 'LD_PRELOAD=./barz.so:./foo.so .…' terminated by signal SIGSEGV (Address boundary error)
The point here is that if symbol collisions happened, there would be many possible ways our program might pan out due to undefined behaviors. If we only exported foo and bar, none of them would happen. If we really need to export foobar as it may be an API function, can we avoid the symbol collision on it? The answer is to pass -Bsymbolic or -Bsymbolic-functions to the program linker when creating the shared library. These options make the shared library bind references to global (function) symbols to the definition within the shared library, if any. This means calls to functions within the shared library don’t have to go through the PLT, and referenced data doesn’t have to be read from the GOT. Thus, more efficient code can be generated by the compiler. But this renders symbol overriding impossible, even for exported public symbols, as no dynamic relocation entry for the shared library would be generated. If one wants finer control, --dynamic-list can be used.
$ gcc -shared -fpic foo.c -o foo_symbolic.so -Wl,-Bsymbolic
$ readelf -r foo.so
Relocation section '.rela.dyn' at offset 0x4b0 contains 8 entries:
Offset Info Type Sym. Value Sym. Name + Addend
000000003e08 000000000008 R_X86_64_RELATIVE 1130
000000003e10 000000000008 R_X86_64_RELATIVE 10f0
000000004028 000000000008 R_X86_64_RELATIVE 4028
000000003fd8 000100000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_deregisterTM[...] + 0
000000003fe0 000600000006 R_X86_64_GLOB_DAT 0000000000004030 x + 0
000000003fe8 000300000006 R_X86_64_GLOB_DAT 0000000000000000 __gmon_start__ + 0
000000003ff0 000400000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_registerTMCl[...] + 0
000000003ff8 000500000006 R_X86_64_GLOB_DAT 0000000000000000 __cxa_finalize@GLIBC_2.2.5 + 0
Relocation section '.rela.plt' at offset 0x570 contains 2 entries:
Offset Info Type Sym. Value Sym. Name + Addend
000000004018 000800000007 R_X86_64_JUMP_SLO 0000000000001139 foobar + 0
000000004020 000200000007 R_X86_64_JUMP_SLO 0000000000000000 printf@GLIBC_2.2.5 + 0
$ readelf -r foo_symbolic.so # no relocations for x and foobar
Relocation section '.rela.dyn' at offset 0x4b0 contains 7 entries:
Offset Info Type Sym. Value Sym. Name + Addend
000000003df0 000000000008 R_X86_64_RELATIVE 1110
000000003df8 000000000008 R_X86_64_RELATIVE 10d0
000000004020 000000000008 R_X86_64_RELATIVE 4020
000000003fe0 000100000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_deregisterTM[...] + 0
000000003fe8 000300000006 R_X86_64_GLOB_DAT 0000000000000000 __gmon_start__ + 0
000000003ff0 000400000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_registerTMCl[...] + 0
000000003ff8 000500000006 R_X86_64_GLOB_DAT 0000000000000000 __cxa_finalize@GLIBC_2.2.5 + 0
Relocation section '.rela.plt' at offset 0x558 contains 1 entry:
Offset Info Type Sym. Value Sym. Name + Addend
000000004018 000200000007 R_X86_64_JUMP_SLO 0000000000000000 printf@GLIBC_2.2.5 + 0
$ objdump -d foo_symbolic.so
...
Disassembly of section .plt.sec:
0000000000001050 <printf@plt>:
1050: f3 0f 1e fa endbr64
1054: f2 ff 25 bd 2f 00 00 bnd jmp *0x2fbd(%rip) # 4018 <printf@GLIBC_2.2.5>
105b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
Disassembly of section .text:
0000000000001060 <deregister_tm_clones>:
...
0000000000001119 <foobar>:
1119: f3 0f 1e fa endbr64
111d: 55 push %rbp
111e: 48 89 e5 mov %rsp,%rbp
1121: 48 83 ec 10 sub $0x10,%rsp
1125: 89 7d fc mov %edi,-0x4(%rbp)
1128: 8b 45 fc mov -0x4(%rbp),%eax
112b: 89 c6 mov %eax,%esi
112d: 48 8d 05 cc 0e 00 00 lea 0xecc(%rip),%rax # 2000 <_fini+0xe88>
1134: 48 89 c7 mov %rax,%rdi
1137: b8 00 00 00 00 mov $0x0,%eax
113c: e8 0f ff ff ff call 1050 <printf@plt>
1141: 48 8d 05 e0 2e 00 00 lea 0x2ee0(%rip),%rax # 4028 <x>
1148: 8b 10 mov (%rax),%edx
114a: 8b 45 fc mov -0x4(%rbp),%eax
114d: 01 d0 add %edx,%eax
114f: c9 leave
1150: c3 ret
0000000000001151 <foo>:
1151: f3 0f 1e fa endbr64
1155: 55 push %rbp
1156: 48 89 e5 mov %rsp,%rbp
1159: 48 83 ec 10 sub $0x10,%rsp
115d: 89 7d fc mov %edi,-0x4(%rbp)
1160: 8b 45 fc mov -0x4(%rbp),%eax
1163: 89 c7 mov %eax,%edi
1165: e8 af ff ff ff call 1119 <foobar>
116a: 89 c2 mov %eax,%edx
116c: 89 d0 mov %edx,%eax
116e: c1 e0 02 shl $0x2,%eax
1171: 01 d0 add %edx,%eax
1173: 01 c0 add %eax,%eax
1175: c9 leave
1176: c3 ret
...
$ objdump --full-contents foo_symbolic.so
...
Contents of section .data:
4020 20400000 00000000 05000000 @..........
...
Some other good stuff to read on shared libraries: “How To Write Shared Libraries” by Ulrich Drepper.
Link-Time Optimization
Link-time optimization (LTO) is also called interprocedural optimization. The kind of optimization, as its name indicates, happens at the link stage when the compiler sees the whole program. LTO is done by dumping the compiler’s intermediate representation (GIMPLE in GCC, LLVM IR/bitcode in Clang) when compiling each source file (translation unit). So the resulting object files are fat as they contain extra information used for LTO. Typical LTO includes function inlining, dead code elimination, etc. Read more here or watch this awesome video. We’ll instead focus on a concrete and interesting example: trying to inline a comparison function pointer for a library routine insertion_sort in C.
// isort.h
#pragma once
#include <stddef.h>
void insertion_sort(void *ptr, size_t num, size_t size, int (*comp)(const void *, const void *));
// isort.c
#include "isort.h"
#include <stdio.h>
static void swap_item(char *a, char *b, size_t size) {
for (size_t i = 0; i < size; ++i) {
char tmp = a[i];
a[i] = b[i];
b[i] = tmp;
}
}
void insertion_sort(void *ptr, size_t num, size_t size, int (*comp)(const void *, const void *)) {
char *base = (char *)ptr;
for (size_t i = 1; i < num; ++i) {
for (size_t j = i; j > 0 && comp(base + (j-1)*size, base + j*size) > 0; --j) {
swap_item(base + (j-1)*size, base + j*size, size);
}
}
}
// main.c
#include "isort.h"
#include <stdio.h>
int compare_int(const void *a, const void *b) {
return *(int *)a - *(int *)b;
}
int main() {
int arr[] = {7, 3, 1, 5, 2, 9, 1, 7, 8, 2, 4, 6, 5};
size_t arr_size = sizeof(arr) / sizeof(arr[0]);
insertion_sort(arr, arr_size, sizeof(arr[0]), compare_int);
for (size_t i = 0; i < arr_size; ++i) {
printf("%d ", arr[i]);
}
printf("\n");
return 0;
}
This is a typical setup – the definition of the library routine insertion_sort is separate from our application code, and our application code provides a custom comparison function upon calling the library routine.
$ gcc -o main main.c isort.c -O3
$ objdump -d main | grep call
1014: ff d0 call *%rax
110d: e8 0e 03 00 00 call 1420 <insertion_sort>
1128: e8 63 ff ff ff call 1090 <__printf_chk@plt>
1137: e8 34 ff ff ff call 1070 <putchar@plt>
1157: e8 24 ff ff ff call 1080 <__stack_chk_fail@plt>
117f: ff 15 53 2e 00 00 call *0x2e53(%rip) # 3fd8 <__libc_start_main@GLIBC_2.34>
1222: e8 39 fe ff ff call 1060 <__cxa_finalize@plt>
1227: e8 64 ff ff ff call 1190 <deregister_tm_clones>
14e1: ff d0 call *%rax
$ gcc -o main main.c isort.c -O3 -flto
$ objdump -d main | grep call
1014: ff d0 call *%rax
1180: e8 0b ff ff ff call 1090 <__printf_chk@plt>
118f: e8 dc fe ff ff call 1070 <putchar@plt>
11af: e8 cc fe ff ff call 1080 <__stack_chk_fail@plt>
11df: ff 15 f3 2d 00 00 call *0x2df3(%rip) # 3fd8 <__libc_start_main@GLIBC_2.34>
1282: e8 d9 fd ff ff call 1060 <__cxa_finalize@plt>
1287: e8 64 ff ff ff call 11f0 <deregister_tm_clones>
From the grepped results we can tell, even with -O3 optimization level, that the comparison function is still invoked via an indirect call *%rax rather than being inlined, which is bad for performance especially when called within a tight loop. With LTO, both the insertion_sort routine and the comparison function are inlined into main. Sadly, even if we put all the stuff in a single file, the comparator is not inlined (Godbolt link). How about LTO for shared libraries and static libraries?
$ gcc -shared -fpic isort.c -O3 -flto -o libisort.so
$ gcc main.c -L. -lisort -O3 -flto -o main
$ objdump -d main | grep call
1014: ff d0 call *%rax
112d: e8 7e ff ff ff call 10b0 <insertion_sort@plt>
1148: e8 53 ff ff ff call 10a0 <__printf_chk@plt>
1157: e8 24 ff ff ff call 1080 <putchar@plt>
1177: e8 14 ff ff ff call 1090 <__stack_chk_fail@plt>
119f: ff 15 33 2e 00 00 call *0x2e33(%rip) # 3fd8 <__libc_start_main@GLIBC_2.34>
1242: e8 29 fe ff ff call 1070 <__cxa_finalize@plt>
1247: e8 64 ff ff ff call 11b0 <deregister_tm_clones>
$ objdump -d isort.so | grep call
1014: ff d0 call *%rax
10d2: e8 59 ff ff ff call 1030 <__cxa_finalize@plt>
10d7: e8 64 ff ff ff call 1040 <deregister_tm_clones>
11c1: ff d0 call *%rax
$ gcc isort.c -c -O3 -flto
$ ar rcs libisort.a isort.o
ar: isort.o: plugin needed to handle lto object
$ gcc-ar rcs libisort.a isort.o
$ gcc main.c -L. -l:libisort.a -O3 -flto -o main
$ objdump -d main | grep call
1014: ff d0 call *%rax
1180: e8 0b ff ff ff call 1090 <__printf_chk@plt>
118f: e8 dc fe ff ff call 1070 <putchar@plt>
11af: e8 cc fe ff ff call 1080 <__stack_chk_fail@plt>
11df: ff 15 f3 2d 00 00 call *0x2df3(%rip) # 3fd8 <__libc_start_main@GLIBC_2.34>
1282: e8 d9 fd ff ff call 1060 <__cxa_finalize@plt>
1287: e8 64 ff ff ff call 11f0 <deregister_tm_clones>
There should be no surprise. For shared library routines, since they can be overridden at runtime the compiler cannot assume the call to our library routine insertion_sort will be the one used at runtime. Although LTO doesn’t work well for application code calling library code, compiling the shared library with -flto flag may help the library itself generate more aggressive/performant code. Static libraries, however, are just archives of object files. As long as the archived object files needed for static linking contain LTO information, LTO can be performed in the same way as for object files.




