Linking

There are a few things about linking that I think are worthy of writing down after I studied linking from CSAPP-3e and some resources on the web.

Linkage of const Global Variables

const global variables in C++ have internal linkage (as declared static) unless they’re explicitly declared extern or inline (see C++17 inline variable), whereas in C external linkage (as declared extern) is the default for all file-scoped entities (functions and global variables). An excellent answer on StackOverflow about inline variables can be found here. One of the major uses of the inline keyword is to enable multiple identical function definitions across translation units (TUs) (in practice function definitions in header files), so this meaning was extended to variables to allow an inline const variable to have a uniform memory address across TUs and allow a non-const global variable to be defined in headers without redefinition problems.

$ cat t.c
int int_ = 1;
static int static_int = 1;
const int const_int = 1;
static const int static_const_int = 1;
$ gcc -c t.c
$ nm t.o
0000000000000000 R const_int # global read-only data section in C
0000000000000000 D int_
0000000000000004 r static_const_int
0000000000000004 d static_int
$ g++ -c t.c
$ nm -C t.o
0000000000000000 D int_
0000000000000004 d static_int
0000000000000004 r static_const_int
0000000000000000 r const_int # local read-only data section in C++

Link Order

The link order matters when we link against static or shared libraries. The rules used by the linker to resolve symbols are clearly described in the Symbol Resolution section of the Linking chapter in CSAPP-3e. Please refer to the text or this post from Ian Lance Taylor’s blog for more information.

Shared Libraries

Shared libraries are ubiquitous on almost every platform that has an operating system. The key purpose of shared libraries is to share code among processes and thus save memory. One approach to sharing library code could be to put the given library code at a specific memory location. But it has at least two problems: (a) There are so many (literally unlimited number of) libraries, how do we specify which library to put at which location? If there are ten libraries in total, but a program is only linked against one or two of them, the reserved memory for other shared libraries is wasted. (b) Having a library loaded at a fixed memory location is vulnerable to attacks. So for these reasons, shared libraries are implemented in a way that they can be loaded at any (random) memory locations that you don’t know in advance.

Luckily, the compiler can generate position-independent code (PIC) for us. Let’s first look at a normal non-PIC case:

// foo.c
int x = 17;
int foo(int i) {
    return i + x;
}

// main.c
#include <stdio.h>
int foo(int i);
int main() {
    int r = foo(3);
    printf("%d\n", r);
    return r;
}
$ gcc main.c foo.c -o main
$ objdump -d main
...
0000000000001149 <main>:
    1149:	f3 0f 1e fa          	endbr64 
    114d:	55                   	push   %rbp
    114e:	48 89 e5             	mov    %rsp,%rbp
    1151:	48 83 ec 10          	sub    $0x10,%rsp
    1155:	bf 03 00 00 00       	mov    $0x3,%edi
    115a:	e8 21 00 00 00       	call   1180 <foo>
    115f:	89 45 fc             	mov    %eax,-0x4(%rbp)
    1162:	8b 45 fc             	mov    -0x4(%rbp),%eax
    1165:	89 c6                	mov    %eax,%esi
    1167:	48 8d 05 96 0e 00 00 	lea    0xe96(%rip),%rax        # 2004 <_IO_stdin_used+0x4>
    116e:	48 89 c7             	mov    %rax,%rdi
    1171:	b8 00 00 00 00       	mov    $0x0,%eax
    1176:	e8 d5 fe ff ff       	call   1050 <printf@plt>
    117b:	8b 45 fc             	mov    -0x4(%rbp),%eax
    117e:	c9                   	leave  
    117f:	c3                   	ret    

0000000000001180 <foo>:
    1180:	f3 0f 1e fa          	endbr64 
    1184:	55                   	push   %rbp
    1185:	48 89 e5             	mov    %rsp,%rbp
    1188:	89 7d fc             	mov    %edi,-0x4(%rbp)
    118b:	8b 15 7f 2e 00 00    	mov    0x2e7f(%rip),%edx        # 4010 <x>
    1191:	8b 45 fc             	mov    -0x4(%rbp),%eax
    1194:	01 d0                	add    %edx,%eax
    1196:	5d                   	pop    %rbp
    1197:	c3                   	ret   
... 
$ objdump --full-contents main
...
Contents of section .data:
 4000 00000000 00000000 08400000 00000000  .........@......
 4010 11000000 
...

Fun story: I was confused about the disassembly once by the line call 1180 <foo>. I thought it was an absolute address since it could use something like 0x21(%rip) for relative addressing otherwise. I knew if I printed the function’s address, it would be different every time it was executed since the OS would load the code (.text section) at random addresses for security’s sake. But I also knew the code section was read-only. So, how was address space layout randomization implemented? Did the kernel (dynamic linker) perform some relocation and modify the machine code anyway? If it did, that would be soooo inefficient, for all calls to other normal functions would need to be modified as well. I just couldn’t figure it out. Then I realized that the secret must be behind the machine code e8 21 00 00 00 itself. So I asked ChatGPT about it. It turned out that, yes indeed, this call instruction uses relative addressing.

Therefore, knowing how some instructions are encoded is helpful in many situations. In x86 assembly, the opcode e8 is a near call (see more at https://shell-storm.org/x86doc/CALL.html). The 4 bytes followed are the operand which is encoded in two’s complement. x86 machines are little-endian. So the rel32 operand is 0x21, added to the RIP register (115f) gives our target address 1180. Similarly, the E9 JMP instruction also uses relative addressing. Another example: the operand in 1176: e8 d5 fe ff ff is the value whose two’s complement is 0xfffffed5. The value is -12b. Therefore, this instruction can be decoded as call RIP-12b (RIP = 117b), i.e. call 1050. Most of the time it’s fine just to look at the decoded assembler code rather than the machine code itself. However, knowing the instruction encoding makes us better understand how the CPU really executes some instructions (e.g. call RIP-12b shows us that at runtime the CPU pushes the address of the next instruction (RIP) onto the stack and then the execution jumps to the memory location with an offset of -12b to RIP).

Go back to the example above. We can see that main makes a near call to foo, and foo references the the global variable in the .data section which has an initial value 0x11 (17). Nothing special. Now we make foo.c a shared library.

$ gcc -shared -fpic foo.c -o foo.so
$ objdump -d foo.so
...
00000000000010f9 <foo>:
    10f9:	f3 0f 1e fa          	endbr64 
    10fd:	55                   	push   %rbp
    10fe:	48 89 e5             	mov    %rsp,%rbp
    1101:	89 7d fc             	mov    %edi,-0x4(%rbp)
    1104:	48 8b 05 cd 2e 00 00 	mov    0x2ecd(%rip),%rax        # 3fd8 <x-0x48>
    110b:	8b 10                	mov    (%rax),%edx
    110d:	8b 45 fc             	mov    -0x4(%rbp),%eax
    1110:	01 d0                	add    %edx,%eax
    1112:	5d                   	pop    %rbp
    1113:	c3                   	ret
...
$ objdump --full-contents foo.so
...
Contents of section .got:
 3fd8 00000000 00000000 00000000 00000000  ................
 3fe8 00000000 00000000 00000000 00000000  ................
 3ff8 00000000 00000000                    ........        
Contents of section .got.plt:
 4000 883e0000 00000000 00000000 00000000  .>..............
 4010 00000000 00000000                    ........        
Contents of section .data:
 4018 18400000 00000000 11000000           .@..........
...

There’s an observable difference. Without being PIC, the instruction for referring to x is mov 0x2e7f(%rip),%edx which copies 4 bytes from the .data section to register %edx. Now being PIC, two load instructions are involved: mov 0x2ecd(%rip),%rax and mov (%rax),%edx. The first one copies 8 bytes from the address 3fd8 in the .got section to register %rax, and the second accesses the lower half part of the value stored at address in %rax. But why is the global variable x now read from the .got section and why does the address 3fd8 contain an empty value?

To answer these questions, we are introduced to the global offset table (GOT), which contains an 8-byte entry (a pointer) for each global data object (procedure or global variable) that is referenced by the object module. The compiler generates a relocation record for each entry in the GOT.

$ readelf --relocs foo.so
Relocation section '.rela.dyn' at offset 0x420 contains 8 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000003e78  000000000008 R_X86_64_RELATIVE                    10f0
000000003e80  000000000008 R_X86_64_RELATIVE                    10b0
000000004018  000000000008 R_X86_64_RELATIVE                    4018
000000003fd8  000500000006 R_X86_64_GLOB_DAT 0000000000004020 x + 0
000000003fe0  000100000006 R_X86_64_GLOB_DAT 0000000000000000 __cxa_finalize + 0
000000003fe8  000200000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_registerTMCl[...] + 0
000000003ff0  000300000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_deregisterTM[...] + 0
000000003ff8  000400000006 R_X86_64_GLOB_DAT 0000000000000000 __gmon_start__ + 0

The GOT serves as a kind of indirection that enables us to override symbols (global variables and functions) in shared libraries. This is done with the help of the dynamic linker, which can relocate GOT entries at load time or runtime (lazy binding) so that they contain absolute memory addresses to the proper (overridden) symbols. For example, we can also define a global variable x in main.c and this one is going to override the one provided in the shared library foo.so:

$ cat main.c
#include <stdio.h>
int foo(int i);
#ifdef OVERRIDE_X
int x = 13;
#endif
int main() {
    int r = foo(3);
    printf("%d\n", r);
    return r;
}
$ gcc main.c ./foo.so -o main
$ ./main
20
$ gcc main.c ./foo.so -o main -DOVERRIDE_X
$ ./main
16

Another typical way of overriding symbols in a shared library is to use LD_PRELOAD which we will cover later.

Back to the questions. Reading the global variable x via an extra indirection from the GOT is to allow things like symbol overriding. The GOT entry for x will be filled by the dynamic linker at load time or runtime, so it’s fine to leave the value empty.

Next, let’s focus on function calls in shared libraries. This involves two aspects: calling library functions from our application code and function calls within the shared library. For the former, see the disassembly below:

$ gcc main.c ./foo.so -o main
$ objdump -d main
...
Disassembly of section .plt:

0000000000001020 <.plt>:
    1020:	ff 35 92 2f 00 00    	push   0x2f92(%rip)        # 3fb8 <_GLOBAL_OFFSET_TABLE_+0x8>
    1026:	f2 ff 25 93 2f 00 00 	bnd jmp *0x2f93(%rip)        # 3fc0 <_GLOBAL_OFFSET_TABLE_+0x10>
    102d:	0f 1f 00             	nopl   (%rax)
    1030:	f3 0f 1e fa          	endbr64 
    1034:	68 00 00 00 00       	push   $0x0
    1039:	f2 e9 e1 ff ff ff    	bnd jmp 1020 <_init+0x20>
    103f:	90                   	nop
    1040:	f3 0f 1e fa          	endbr64 
    1044:	68 01 00 00 00       	push   $0x1
    1049:	f2 e9 d1 ff ff ff    	bnd jmp 1020 <_init+0x20>
    104f:	90                   	nop
...
Disassembly of section .plt.sec:

0000000000001060 <printf@plt>:
    1060:	f3 0f 1e fa          	endbr64 
    1064:	f2 ff 25 5d 2f 00 00 	bnd jmp *0x2f5d(%rip)        # 3fc8 <printf@GLIBC_2.2.5>
    106b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)

0000000000001070 <foo@plt>:
    1070:	f3 0f 1e fa          	endbr64 
    1074:	f2 ff 25 55 2f 00 00 	bnd jmp *0x2f55(%rip)        # 3fd0 <foo@Base>
    107b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)

Disassembly of section .text:
...
0000000000001169 <main>:
    1169:	f3 0f 1e fa          	endbr64 
    116d:	55                   	push   %rbp
    116e:	48 89 e5             	mov    %rsp,%rbp
    1171:	48 83 ec 10          	sub    $0x10,%rsp
    1175:	bf 03 00 00 00       	mov    $0x3,%edi
    117a:	e8 f1 fe ff ff       	call   1070 <foo@plt>
    117f:	89 45 fc             	mov    %eax,-0x4(%rbp)
    1182:	8b 45 fc             	mov    -0x4(%rbp),%eax
    1185:	89 c6                	mov    %eax,%esi
    1187:	48 8d 05 76 0e 00 00 	lea    0xe76(%rip),%rax        # 2004 <_IO_stdin_used+0x4>
    118e:	48 89 c7             	mov    %rax,%rdi
    1191:	b8 00 00 00 00       	mov    $0x0,%eax
    1196:	e8 c5 fe ff ff       	call   1060 <printf@plt>
    119b:	8b 45 fc             	mov    -0x4(%rbp),%eax
    119e:	c9                   	leave  
    119f:	c3                   	ret
...
$ objdump --full-contents main
...
Contents of section .dynamic:
 3db0 01000000 00000000 72000000 00000000  ........r.......
 3dc0 01000000 00000000 7b000000 00000000  ........{.......
 ...
Contents of section .got:
 3fb0 b03d0000 00000000 00000000 00000000  .=..............
 3fc0 00000000 00000000 30100000 00000000  ........0.......
 3fd0 40100000 00000000 00000000 00000000  @...............
 3fe0 00000000 00000000 00000000 00000000  ................
 3ff0 00000000 00000000 00000000 00000000  ................
Contents of section .data:
 4000 00000000 00000000 08400000 00000000  .........@......
...

There’s a lot of jumps happening here. We can see that calls to shared library functions are made through the procedure linkage table (PLT). Each shared library function called by the executable has its own PLT entry in the .plt.sec section. In our example, the PLT entries are foo@plt and printf@plt. PLT entries will be relocated lazily by the dynamic linker at runtime, meaning that the address of each function will be resolved by the dynamic linker the first time the function is called. This makes sense since usually a shared library, e.g. libc, encompasses a large number of functions, and only a few get called by our program. This lazy binding saves a lot of work for the dynamic linker.

Now we walk through our example of calling the library function foo. Function main calls into foo@plt, which immediately jumps to another memory address via GOT[4]. The initial value of GOT[4] is 0x1040. Following that, the execution jumps to somewhere in the .plt section and pushes the index of this PLT entry (0x1 for foo; 0x0 for printf) onto the stack and then branches to the common code at the beginning of the .plt section. The common code pushes GOT[1] onto the stack and the execution jumps to GOT[2]. It can be guessed that GOT[1] is a pointer to relocation information, and it will be used together with the pushed PLT entry index as two arguments by the dynamic linker to resolve the address of foo. GOT[2] should be the entry point for the dynamic linker to perform relocations. GOT[1] and GOT[2] will be filled by the dynamic linker at load time. After the dynamic linker figures out the address of foo at runtime, it rewrites GOT[4] with this address so that subsequent calls to foo will jump directly to the resolved destination. Finally, the control is transferred back to foo when the dynamic linker has done its job. What a beautiful engineering!

Within a shared library, let’s see what happens when a function calls another function.

// bar.c
int x = 17;
int bar(int i) {
    return i + 42;
}
int foo(int i) {
    return bar(i) + x;
}
$ gcc -shared -fpic bar.c -o bar.so
$ objdump -d bar.so
...
Disassembly of section .plt.sec:

0000000000001050 <bar@plt>:
    1050:	f3 0f 1e fa          	endbr64 
    1054:	f2 ff 25 bd 2f 00 00 	bnd jmp *0x2fbd(%rip)        # 4018 <bar+0x2eff>
    105b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)

Disassembly of section .text:
...
0000000000001119 <bar>:
    1119:	f3 0f 1e fa          	endbr64 
    111d:	55                   	push   %rbp
    111e:	48 89 e5             	mov    %rsp,%rbp
    1121:	89 7d fc             	mov    %edi,-0x4(%rbp)
    1124:	8b 45 fc             	mov    -0x4(%rbp),%eax
    1127:	83 c0 2a             	add    $0x2a,%eax
    112a:	5d                   	pop    %rbp
    112b:	c3                   	ret    

000000000000112c <foo>:
    112c:	f3 0f 1e fa          	endbr64 
    1130:	55                   	push   %rbp
    1131:	48 89 e5             	mov    %rsp,%rbp
    1134:	48 83 ec 10          	sub    $0x10,%rsp
    1138:	89 7d fc             	mov    %edi,-0x4(%rbp)
    113b:	8b 45 fc             	mov    -0x4(%rbp),%eax
    113e:	89 c7                	mov    %eax,%edi
    1140:	e8 0b ff ff ff       	call   1050 <bar@plt>
    1145:	48 8b 15 8c 2e 00 00 	mov    0x2e8c(%rip),%rdx        # 3fd8 <x-0x50>
    114c:	8b 12                	mov    (%rdx),%edx
    114e:	01 d0                	add    %edx,%eax
    1150:	c9                   	leave  
    1151:	c3                   	ret    
...

Well, we can see that the call to bar is done through a PLT entry, just like it’s been called from non-library code. This actually makes sense since all exported symbols can be overridden. If we don’t want to export our internal implementation functions or variables, we can declare them static.

$ cat bar.c 
int x = 17;
static int bar(int i) {
    return i + 42;
}
int foo(int i) {
    return bar(i) + x;
}
$ gcc -shared -fpic bar.c -o bar.so
$ objdump -d bar.so
...
00000000000010f9 <bar>:
    10f9:	f3 0f 1e fa          	endbr64 
    10fd:	55                   	push   %rbp
    10fe:	48 89 e5             	mov    %rsp,%rbp
    1101:	89 7d fc             	mov    %edi,-0x4(%rbp)
    1104:	8b 45 fc             	mov    -0x4(%rbp),%eax
    1107:	83 c0 2a             	add    $0x2a,%eax
    110a:	5d                   	pop    %rbp
    110b:	c3                   	ret    

000000000000110c <foo>:
    110c:	f3 0f 1e fa          	endbr64 
    1110:	55                   	push   %rbp
    1111:	48 89 e5             	mov    %rsp,%rbp
    1114:	48 83 ec 08          	sub    $0x8,%rsp
    1118:	89 7d fc             	mov    %edi,-0x4(%rbp)
    111b:	8b 45 fc             	mov    -0x4(%rbp),%eax
    111e:	89 c7                	mov    %eax,%edi
    1120:	e8 d4 ff ff ff       	call   10f9 <bar>
    1125:	48 8b 15 ac 2e 00 00 	mov    0x2eac(%rip),%rdx        # 3fd8 <x-0x48>
    112c:	8b 12                	mov    (%rdx),%edx
    112e:	01 d0                	add    %edx,%eax
    1130:	c9                   	leave  
    1131:	c3                   	ret   
...

In designing a library, we often have headers and their implementation details will be put in separate translation units. Those functions have external linkage. But we don’t want to export those symbols unless they are API functions intended to be provided to the user. Under such cases, we can pass -fvisibility=hidden to the compiler and explicitly mark the symbols we want to export using __attribute__((__visibility__("default"))) (in practice, a descriptive macro can be used for this, e.g. LIBNAME_EXPORT), see more in the GCC manual. Doing so has many practical benefits, for example, less work for the dynamic linker, more efficient code can be generated since no PLT entry is needed, and for data references, there will be only one data load vs two loads if it gets exported.

$ cat bar.c
int x = 17;

int bar(int i) {
    return i + 42;
}

__attribute__((__visibility__("default")))
int foo(int i) {
    return bar(i) + x;
}
$ gcc -shared -fpic bar.c -o bar.so -fvisibility=hidden
$ objdump -d bar.so
...
00000000000010f9 <bar>:
    10f9:	f3 0f 1e fa          	endbr64 
    10fd:	55                   	push   %rbp
    10fe:	48 89 e5             	mov    %rsp,%rbp
    1101:	89 7d fc             	mov    %edi,-0x4(%rbp)
    1104:	8b 45 fc             	mov    -0x4(%rbp),%eax
    1107:	83 c0 2a             	add    $0x2a,%eax
    110a:	5d                   	pop    %rbp
    110b:	c3                   	ret    

000000000000110c <foo>:
    110c:	f3 0f 1e fa          	endbr64 
    1110:	55                   	push   %rbp
    1111:	48 89 e5             	mov    %rsp,%rbp
    1114:	48 83 ec 08          	sub    $0x8,%rsp
    1118:	89 7d fc             	mov    %edi,-0x4(%rbp)
    111b:	8b 45 fc             	mov    -0x4(%rbp),%eax
    111e:	89 c7                	mov    %eax,%edi
    1120:	e8 d4 ff ff ff       	call   10f9 <bar>
    1125:	8b 15 f5 2e 00 00    	mov    0x2ef5(%rip),%edx        # 4020 <x>
    112b:	01 d0                	add    %edx,%eax
    112d:	c9                   	leave  
    112e:	c3                   	ret
...

At the end of this section, let’s see an example of using LD_PRELOAD to override symbols by loading specific libraries before any other shared libraries. In practice, LD_PRELOAD can be used to override the malloc implementation in libc.so or to load a debug library that contains logging versions of some functions (e.g. to debug some multithreaded code by logging each thread’s activity).

$ cat foo.c
int x = 17;

int foo(int i) {
    return i + x;
}
$ cat main.c 
#include <stdio.h>

int foo(int i);

int main() {
    int r = foo(3);
    printf("%d\n", r);
    return r;
}
$ cat bar.c
int x = 17;

int bar(int i) {
    return i + 42;
}

__attribute__((__visibility__("default")))
int foo(int i) {
    return bar(i) + x;
}
$ gcc -shared -fpic foo.c -o foo.so
$ gcc -shared -fpic bar.c -o bar.so -fvisibility=hidden
$ gcc main.c ./foo.so -o main
$ ./main
20
$ LD_PRELOAD=./bar.so ./main
62

Oh, one more thing to highlight. Symbol collisions in shared libraries can catch you unexpectedly. The exported symbols in a shared library can be overridden by anything as long as they have the same name. This means an exported function/variable in a shared library can be overridden by a global variable of any type, or by a function of any signature. If the same symbol is used for different things, it’s an undefined behavior. Symbol collisions can happen between a shared library and our application code, or between two shared libraries linked by our innocent application. Therefore, it’s a good practice to limit the visibility of the symbols in a shared library as small as possible. Better performance and lower chance of symbol collision. Why not!

$ cat foo.c
#include <stdio.h>
int x = 5;

int foobar(int i) {
    printf("foobar called from foo.c, argument i=%d\n", i);
    return i + x;
}

int foo(int i) {
    return foobar(i) * 10;
}
$ cat bar.c
#include <stdio.h>

int foobar(int i, int j) {
    printf("foobar called from bar.c, arguments i=%d, j=%d\n", i, j);
    return i + j;
}

int bar(int i, int j) {
    return foobar(i, j) + 1;
}
$ cat barz.c
#include <stdio.h>

int foobar = 2;

int bar(int i, int j) {
    printf("foobar=%d\n", foobar);
    return foobar + i + j + 5;
}
$ cat main.c
#include <stdio.h>

int foo(int i);        // from foo.so
int bar(int i, int j); // from bar(z).so

int main() {
    int a = foo(3);
    int b = bar(3, 7);
    printf("%d\n", a);
    printf("%d\n", b);
    return a + b;
}
$ gcc -shared -fpic foo.c -o foo.so
$ gcc -shared -fpic bar.c -o bar.so
$ gcc -shared -fpic barz.c -o barz.so
$ gcc main.c ./foo.so ./bar.so -o main
$ ./main
foobar called from foo.c, argument i=3
foobar called from foo.c, argument i=3
80
9
$ gcc main.c ./bar.so ./foo.so -o main
$ ./main
foobar called from bar.c, arguments i=3, j=2081388488
foobar called from bar.c, arguments i=3, j=7
-660951570
11
$ gcc main.c ./foo.so ./barz.so -o main
$ ./main
foobar called from foo.c, argument i=3
foobar=-98693133
80
-98693118
$ gcc main.c ./barz.so ./foo.so -o main
$ ./main
fish: Job 1, './main' terminated by signal SIGSEGV (Address boundary error)

In the example above, our program calls two external functions from two shared libraries: foo from foo.so, and bar from bar.so or barz.so. They all call or reference foobar. Whichever the shared library is loaded first overrides the other library’s foobar symbol. From the results, we can see that the library load order seems to match the library link order. But we can manually specify the load order using LD_PRELOAD.

$ gcc main.c ./foo.so ./bar.so -o main
$ LD_PRELOAD=./foo.so:./bar.so ./main
foobar called from foo.c, argument i=3
foobar called from foo.c, argument i=3
80
9
$ LD_PRELOAD=./bar.so:./foo.so ./main
foobar called from bar.c, arguments i=3, j=1241836152
foobar called from bar.c, arguments i=3, j=7
-466540338
11
$ gcc main.c ./foo.so ./barz.so -o main
$ LD_PRELOAD=./foo.so:./barz.so ./main
foobar called from foo.c, argument i=3
foobar=-98693133
80
-98693118
$ LD_PRELOAD=./barz.so:./foo.so ./main
fish: Job 1, 'LD_PRELOAD=./barz.so:./foo.so .…' terminated by signal SIGSEGV (Address boundary error)

The point here is that if symbol collisions happened, there would be many possible ways our program might pan out due to undefined behaviors. If we only exported foo and bar, none of them would happen. If we really need to export foobar as it may be an API function, can we avoid the symbol collision on it? The answer is to pass -Bsymbolic or -Bsymbolic-functions to the program linker when creating the shared library. These options make the shared library bind references to global (function) symbols to the definition within the shared library, if any. This means calls to functions within the shared library don’t have to go through the PLT, and referenced data doesn’t have to be read from the GOT. Thus, more efficient code can be generated by the compiler. But this renders symbol overriding impossible, even for exported public symbols, as no dynamic relocation entry for the shared library would be generated. If one wants finer control, --dynamic-list can be used.

$ gcc -shared -fpic foo.c -o foo_symbolic.so -Wl,-Bsymbolic
$ readelf -r foo.so 
Relocation section '.rela.dyn' at offset 0x4b0 contains 8 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000003e08  000000000008 R_X86_64_RELATIVE                    1130
000000003e10  000000000008 R_X86_64_RELATIVE                    10f0
000000004028  000000000008 R_X86_64_RELATIVE                    4028
000000003fd8  000100000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_deregisterTM[...] + 0
000000003fe0  000600000006 R_X86_64_GLOB_DAT 0000000000004030 x + 0
000000003fe8  000300000006 R_X86_64_GLOB_DAT 0000000000000000 __gmon_start__ + 0
000000003ff0  000400000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_registerTMCl[...] + 0
000000003ff8  000500000006 R_X86_64_GLOB_DAT 0000000000000000 __cxa_finalize@GLIBC_2.2.5 + 0

Relocation section '.rela.plt' at offset 0x570 contains 2 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000004018  000800000007 R_X86_64_JUMP_SLO 0000000000001139 foobar + 0
000000004020  000200000007 R_X86_64_JUMP_SLO 0000000000000000 printf@GLIBC_2.2.5 + 0
$ readelf -r foo_symbolic.so # no relocations for x and foobar
Relocation section '.rela.dyn' at offset 0x4b0 contains 7 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000003df0  000000000008 R_X86_64_RELATIVE                    1110
000000003df8  000000000008 R_X86_64_RELATIVE                    10d0
000000004020  000000000008 R_X86_64_RELATIVE                    4020
000000003fe0  000100000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_deregisterTM[...] + 0
000000003fe8  000300000006 R_X86_64_GLOB_DAT 0000000000000000 __gmon_start__ + 0
000000003ff0  000400000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_registerTMCl[...] + 0
000000003ff8  000500000006 R_X86_64_GLOB_DAT 0000000000000000 __cxa_finalize@GLIBC_2.2.5 + 0

Relocation section '.rela.plt' at offset 0x558 contains 1 entry:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000004018  000200000007 R_X86_64_JUMP_SLO 0000000000000000 printf@GLIBC_2.2.5 + 0
$ objdump -d foo_symbolic.so
...
Disassembly of section .plt.sec:

0000000000001050 <printf@plt>:
    1050:	f3 0f 1e fa          	endbr64 
    1054:	f2 ff 25 bd 2f 00 00 	bnd jmp *0x2fbd(%rip)        # 4018 <printf@GLIBC_2.2.5>
    105b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)

Disassembly of section .text:

0000000000001060 <deregister_tm_clones>:
...
0000000000001119 <foobar>:
    1119:	f3 0f 1e fa          	endbr64 
    111d:	55                   	push   %rbp
    111e:	48 89 e5             	mov    %rsp,%rbp
    1121:	48 83 ec 10          	sub    $0x10,%rsp
    1125:	89 7d fc             	mov    %edi,-0x4(%rbp)
    1128:	8b 45 fc             	mov    -0x4(%rbp),%eax
    112b:	89 c6                	mov    %eax,%esi
    112d:	48 8d 05 cc 0e 00 00 	lea    0xecc(%rip),%rax        # 2000 <_fini+0xe88>
    1134:	48 89 c7             	mov    %rax,%rdi
    1137:	b8 00 00 00 00       	mov    $0x0,%eax
    113c:	e8 0f ff ff ff       	call   1050 <printf@plt>
    1141:	48 8d 05 e0 2e 00 00 	lea    0x2ee0(%rip),%rax        # 4028 <x>
    1148:	8b 10                	mov    (%rax),%edx
    114a:	8b 45 fc             	mov    -0x4(%rbp),%eax
    114d:	01 d0                	add    %edx,%eax
    114f:	c9                   	leave  
    1150:	c3                   	ret    

0000000000001151 <foo>:
    1151:	f3 0f 1e fa          	endbr64 
    1155:	55                   	push   %rbp
    1156:	48 89 e5             	mov    %rsp,%rbp
    1159:	48 83 ec 10          	sub    $0x10,%rsp
    115d:	89 7d fc             	mov    %edi,-0x4(%rbp)
    1160:	8b 45 fc             	mov    -0x4(%rbp),%eax
    1163:	89 c7                	mov    %eax,%edi
    1165:	e8 af ff ff ff       	call   1119 <foobar>
    116a:	89 c2                	mov    %eax,%edx
    116c:	89 d0                	mov    %edx,%eax
    116e:	c1 e0 02             	shl    $0x2,%eax
    1171:	01 d0                	add    %edx,%eax
    1173:	01 c0                	add    %eax,%eax
    1175:	c9                   	leave  
    1176:	c3                   	ret    
...
$ objdump --full-contents foo_symbolic.so
...
Contents of section .data:
 4020 20400000 00000000 05000000            @..........   
...

Some other good stuff to read on shared libraries: “How To Write Shared Libraries” by Ulrich Drepper.

Link-Time Optimization

Link-time optimization (LTO) is also called interprocedural optimization. The kind of optimization, as its name indicates, happens at the link stage when the compiler sees the whole program. LTO is done by dumping the compiler’s intermediate representation (GIMPLE in GCC, LLVM IR/bitcode in Clang) when compiling each source file (translation unit). So the resulting object files are fat as they contain extra information used for LTO. Typical LTO includes function inlining, dead code elimination, etc. Read more here or watch this awesome video. We’ll instead focus on a concrete and interesting example: trying to inline a comparison function pointer for a library routine insertion_sort in C.

// isort.h
#pragma once
#include <stddef.h>

void insertion_sort(void *ptr, size_t num, size_t size, int (*comp)(const void *, const void *));
// isort.c
#include "isort.h"
#include <stdio.h>

static void swap_item(char *a, char *b, size_t size) {
    for (size_t i = 0; i < size; ++i) {
        char tmp = a[i];
        a[i] = b[i];
        b[i] = tmp;
    }
}

void insertion_sort(void *ptr, size_t num, size_t size, int (*comp)(const void *, const void *)) {
    char *base = (char *)ptr;
    for (size_t i = 1; i < num; ++i) {
        for (size_t j = i; j > 0 && comp(base + (j-1)*size, base + j*size) > 0; --j) {
            swap_item(base + (j-1)*size, base + j*size, size);
        }
    }
}
// main.c
#include "isort.h"
#include <stdio.h>

int compare_int(const void *a, const void *b) {
    return *(int *)a - *(int *)b;
}

int main() {
    int arr[] = {7, 3, 1, 5, 2, 9, 1, 7, 8, 2, 4, 6, 5};
    size_t arr_size = sizeof(arr) / sizeof(arr[0]);

    insertion_sort(arr, arr_size, sizeof(arr[0]), compare_int);

    for (size_t i = 0; i < arr_size; ++i) {
        printf("%d ", arr[i]);
    }
    printf("\n");

    return 0;
}

This is a typical setup – the definition of the library routine insertion_sort is separate from our application code, and our application code provides a custom comparison function upon calling the library routine.

$ gcc -o main main.c isort.c -O3 
$ objdump -d main | grep call
    1014:	ff d0                	call   *%rax
    110d:	e8 0e 03 00 00       	call   1420 <insertion_sort>
    1128:	e8 63 ff ff ff       	call   1090 <__printf_chk@plt>
    1137:	e8 34 ff ff ff       	call   1070 <putchar@plt>
    1157:	e8 24 ff ff ff       	call   1080 <__stack_chk_fail@plt>
    117f:	ff 15 53 2e 00 00    	call   *0x2e53(%rip)        # 3fd8 <__libc_start_main@GLIBC_2.34>
    1222:	e8 39 fe ff ff       	call   1060 <__cxa_finalize@plt>
    1227:	e8 64 ff ff ff       	call   1190 <deregister_tm_clones>
    14e1:	ff d0                	call   *%rax
$ gcc -o main main.c isort.c -O3 -flto
$ objdump -d main | grep call
    1014:	ff d0                	call   *%rax
    1180:	e8 0b ff ff ff       	call   1090 <__printf_chk@plt>
    118f:	e8 dc fe ff ff       	call   1070 <putchar@plt>
    11af:	e8 cc fe ff ff       	call   1080 <__stack_chk_fail@plt>
    11df:	ff 15 f3 2d 00 00    	call   *0x2df3(%rip)        # 3fd8 <__libc_start_main@GLIBC_2.34>
    1282:	e8 d9 fd ff ff       	call   1060 <__cxa_finalize@plt>
    1287:	e8 64 ff ff ff       	call   11f0 <deregister_tm_clones>

From the grepped results we can tell, even with -O3 optimization level, that the comparison function is still invoked via an indirect call *%rax rather than being inlined, which is bad for performance especially when called within a tight loop. With LTO, both the insertion_sort routine and the comparison function are inlined into main. Sadly, even if we put all the stuff in a single file, the comparator is not inlined (Godbolt link). How about LTO for shared libraries and static libraries?

$ gcc -shared -fpic isort.c -O3 -flto -o libisort.so
$ gcc main.c -L. -lisort -O3 -flto -o main
$ objdump -d main | grep call
    1014:	ff d0                	call   *%rax
    112d:	e8 7e ff ff ff       	call   10b0 <insertion_sort@plt>
    1148:	e8 53 ff ff ff       	call   10a0 <__printf_chk@plt>
    1157:	e8 24 ff ff ff       	call   1080 <putchar@plt>
    1177:	e8 14 ff ff ff       	call   1090 <__stack_chk_fail@plt>
    119f:	ff 15 33 2e 00 00    	call   *0x2e33(%rip)        # 3fd8 <__libc_start_main@GLIBC_2.34>
    1242:	e8 29 fe ff ff       	call   1070 <__cxa_finalize@plt>
    1247:	e8 64 ff ff ff       	call   11b0 <deregister_tm_clones>
$ objdump -d isort.so | grep call
    1014:	ff d0                	call   *%rax
    10d2:	e8 59 ff ff ff       	call   1030 <__cxa_finalize@plt>
    10d7:	e8 64 ff ff ff       	call   1040 <deregister_tm_clones>
    11c1:	ff d0                	call   *%rax
$ gcc isort.c -c -O3 -flto
$ ar rcs libisort.a isort.o
ar: isort.o: plugin needed to handle lto object
$ gcc-ar rcs libisort.a isort.o
$ gcc main.c -L. -l:libisort.a -O3 -flto -o main
$ objdump -d main | grep call
    1014:	ff d0                	call   *%rax
    1180:	e8 0b ff ff ff       	call   1090 <__printf_chk@plt>
    118f:	e8 dc fe ff ff       	call   1070 <putchar@plt>
    11af:	e8 cc fe ff ff       	call   1080 <__stack_chk_fail@plt>
    11df:	ff 15 f3 2d 00 00    	call   *0x2df3(%rip)        # 3fd8 <__libc_start_main@GLIBC_2.34>
    1282:	e8 d9 fd ff ff       	call   1060 <__cxa_finalize@plt>
    1287:	e8 64 ff ff ff       	call   11f0 <deregister_tm_clones>

There should be no surprise. For shared library routines, since they can be overridden at runtime the compiler cannot assume the call to our library routine insertion_sort will be the one used at runtime. Although LTO doesn’t work well for application code calling library code, compiling the shared library with -flto flag may help the library itself generate more aggressive/performant code. Static libraries, however, are just archives of object files. As long as the archived object files needed for static linking contain LTO information, LTO can be performed in the same way as for object files.

Leave a comment

Your email address will not be published.

The maximum upload file size: 10 MB. You can upload: image, audio, video, document, spreadsheet, interactive, text, archive, code, other. Links to YouTube, Facebook, Twitter and other services inserted in the comment text will be automatically embedded. Drop file here