{"id":1766,"date":"2023-11-19T22:00:00","date_gmt":"2023-11-19T14:00:00","guid":{"rendered":"https:\/\/markjohntaylor.com\/blog\/wordpress\/?p=1766"},"modified":"2024-02-23T21:26:24","modified_gmt":"2024-02-23T13:26:24","slug":"linking","status":"publish","type":"post","link":"https:\/\/markjohntaylor.com\/blog\/wordpress\/index.php\/2023\/11\/19\/linking\/","title":{"rendered":"Linking"},"content":{"rendered":"\n<p>There are a few things about linking that I think are worth writing down after studying the topic in CSAPP-3e and some resources on the web.<\/p>\n\n\n\n<p>Linkage of <code>const<\/code> Global Variables<\/p>\n\n\n\n<p><code>const<\/code> global variables in C++ have internal linkage (as if declared <code>static<\/code>) unless they&#8217;re explicitly declared <code>extern<\/code> or <code>inline<\/code> (see <a href=\"https:\/\/en.cppreference.com\/w\/cpp\/language\/inline\">C++17 inline variable<\/a>), whereas in C external linkage (as if declared <code>extern<\/code>) is the default for all file-scoped entities (functions and global variables). An excellent answer on Stack Overflow about inline variables can be found <a href=\"https:\/\/stackoverflow.com\/a\/53896763\">here<\/a>. 
One of the major uses of the <code>inline<\/code> keyword is to enable multiple identical function definitions across translation units (TUs) (in practice function definitions in header files), so this meaning was extended to variables to allow an <code>inline<\/code> <code>const<\/code> variable to have a uniform memory address across TUs and allow a non-<code>const<\/code> global variable to be defined in headers without redefinition problems.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"godzilla\" data-enlighter-highlight=\"8,17\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">$ cat t.c\nint int_ = 1;\nstatic int static_int = 1;\nconst int const_int = 1;\nstatic const int static_const_int = 1;\n$ gcc -c t.c\n$ nm t.o\n0000000000000000 R const_int # global read-only data section in C\n0000000000000000 D int_\n0000000000000004 r static_const_int\n0000000000000004 d static_int\n$ g++ -c t.c\n$ nm -C t.o\n0000000000000000 D int_\n0000000000000004 d static_int\n0000000000000004 r static_const_int\n0000000000000000 r const_int # local read-only data section in C++<\/pre>\n\n\n\n<p>Link Order<\/p>\n\n\n\n<p>The link order matters when we link against static or shared libraries. The rules used by the linker to resolve symbols are clearly described in the Symbol Resolution section of the Linking chapter in CSAPP-3e. Please refer to the text or <a href=\"https:\/\/www.airs.com\/blog\/archives\/49\">this post<\/a> from Ian Lance Taylor&#8217;s blog for more information.<\/p>\n\n\n\n<p>Shared Libraries<\/p>\n\n\n\n<p>Shared libraries are ubiquitous on almost every platform that has an operating system. The key purpose of shared libraries is to share code among processes and thus save memory. One approach to sharing library code could be to put the given library code at a specific memory location. 
But it has at least two problems: (a) There are so many (literally unlimited number of) libraries, how do we specify which library to put at which location? If there are ten libraries in total, but a program is only linked against one or two of them, the reserved memory for other shared libraries is wasted. (b) Having a library loaded at a fixed memory location is vulnerable to attacks. So for these reasons, shared libraries are implemented in a way that they can be loaded at any (random) memory locations that you don&#8217;t know in advance.<\/p>\n\n\n\n<p>Luckily, the compiler can generate <a href=\"https:\/\/en.wikipedia.org\/wiki\/Position-independent_code\">position-independent code<\/a> (PIC) for us. Let&#8217;s first look at a normal non-PIC case:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"c\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">\/\/ foo.c\nint x = 17;\nint foo(int i) {\n    return i + x;\n}\n\n\/\/ main.c\n#include &lt;stdio.h>\nint foo(int i);\nint main() {\n    int r = foo(3);\n    printf(\"%d\\n\", r);\n    return r;\n}<\/pre>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"10,27,37\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">$ gcc main.c foo.c -o main\n$ objdump -d main\n...\n0000000000001149 &lt;main>:\n    1149:\tf3 0f 1e fa          \tendbr64 \n    114d:\t55                   \tpush   %rbp\n    114e:\t48 89 e5             \tmov    %rsp,%rbp\n    1151:\t48 83 ec 10          \tsub    $0x10,%rsp\n    1155:\tbf 03 00 00 00       \tmov    $0x3,%edi\n    115a:\te8 21 00 00 00       \tcall   1180 &lt;foo>\n    115f:\t89 45 fc             \tmov    %eax,-0x4(%rbp)\n    1162:\t8b 45 fc             \tmov    -0x4(%rbp),%eax\n    1165:\t89 c6                
\tmov    %eax,%esi\n    1167:\t48 8d 05 96 0e 00 00 \tlea    0xe96(%rip),%rax        # 2004 &lt;_IO_stdin_used+0x4>\n    116e:\t48 89 c7             \tmov    %rax,%rdi\n    1171:\tb8 00 00 00 00       \tmov    $0x0,%eax\n    1176:\te8 d5 fe ff ff       \tcall   1050 &lt;printf@plt>\n    117b:\t8b 45 fc             \tmov    -0x4(%rbp),%eax\n    117e:\tc9                   \tleave  \n    117f:\tc3                   \tret    \n\n0000000000001180 &lt;foo>:\n    1180:\tf3 0f 1e fa          \tendbr64 \n    1184:\t55                   \tpush   %rbp\n    1185:\t48 89 e5             \tmov    %rsp,%rbp\n    1188:\t89 7d fc             \tmov    %edi,-0x4(%rbp)\n    118b:\t8b 15 7f 2e 00 00    \tmov    0x2e7f(%rip),%edx        # 4010 &lt;x>\n    1191:\t8b 45 fc             \tmov    -0x4(%rbp),%eax\n    1194:\t01 d0                \tadd    %edx,%eax\n    1196:\t5d                   \tpop    %rbp\n    1197:\tc3                   \tret   \n... \n$ objdump --full-contents main\n...\nContents of section .data:\n 4000 00000000 00000000 08400000 00000000  .........@......\n 4010 11000000 \n...<\/pre>\n\n\n\n<p>Fun story: I was confused about the disassembly once by the line <code>call 1180 &lt;foo&gt;<\/code>. I thought it was an absolute address since it could use something like <code>0x21(%rip)<\/code> for relative addressing otherwise. I knew if I printed the function&#8217;s address, it would be different every time it was executed since the OS would load the code (.text section) at random addresses for security&#8217;s sake. But I also knew the code section was read-only. So, how was address space layout randomization implemented? Did the kernel (dynamic linker) perform some relocation and modify the machine code anyway? If it did, that would be soooo inefficient, for all calls to other normal functions would need to be modified as well. I just couldn&#8217;t figure it out. Then I realized that the secret must be behind the machine code <code>e8 21 00 00 00<\/code> itself. 
So I asked ChatGPT about it. It turned out that, yes indeed, this <code>call<\/code> instruction uses relative addressing.<\/p>\n\n\n\n<p>Therefore, knowing how some instructions are encoded is helpful in many situations. In x86 assembly, the opcode <code>e8<\/code> is a near call (see more at <a href=\"https:\/\/shell-storm.org\/x86doc\/CALL.html\">https:\/\/shell-storm.org\/x86doc\/CALL.html<\/a>). The 4 bytes that follow are the operand, a signed 32-bit displacement (rel32) encoded in two&#8217;s complement. x86 machines are little-endian, so the rel32 operand is <code>0x21<\/code>, which, added to the RIP register (<code>115f<\/code>, the address of the next instruction), gives our target address <code>1180<\/code>. Similarly, the <a href=\"https:\/\/shell-storm.org\/x86doc\/JMP.html\"><code>E9<\/code> <code>JMP<\/code><\/a> instruction also uses relative addressing. Another example: the operand in <code>1176: e8 d5 fe ff ff<\/code> is the value whose two&#8217;s complement representation is <code>0xfffffed5<\/code>, namely <code>-12b<\/code>. Therefore, this instruction can be decoded as <code>call RIP-12b<\/code> (RIP = <code>117b<\/code>), i.e. <code>call 1050<\/code>. Most of the time it&#8217;s fine just to look at the decoded assembler code rather than the machine code itself. However, knowing the instruction encoding helps us better understand how the CPU really executes some instructions (e.g. <code>call RIP-12b<\/code> shows us that at runtime the CPU pushes the address of the next instruction (RIP) onto the stack and then execution jumps to the memory location at an offset of <code>-12b<\/code> from RIP).<\/p>\n\n\n\n<p>Let&#8217;s go back to the example above. We can see that <code>main<\/code> makes a near call to <code>foo<\/code>, and <code>foo<\/code> references the global variable <code>x<\/code> in the .data section, which has an initial value of <code>0x11 (17)<\/code>. Nothing special. 
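<\/p>\n\n\n\n<p>The rel32 arithmetic described earlier can be double-checked with a few lines of C. This is only a minimal sketch: the file name <code>rel32.c<\/code> and the helper <code>call_target<\/code> are made up for illustration, while the instruction bytes and addresses are copied from the disassembly above.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"c\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">\/\/ rel32.c (hypothetical file, for illustration only)\n#include &lt;stdint.h>\n#include &lt;stdio.h>\n\n\/\/ Hypothetical helper: target of an E8 rel32 near call located at addr.\nstatic uint64_t call_target(uint64_t addr, const uint8_t insn[5]) {\n    \/\/ Assemble the four little-endian operand bytes.\n    uint32_t raw = insn[1] + insn[2] * 0x100u + insn[3] * 0x10000u + insn[4] * 0x1000000u;\n    \/\/ Interpret the 32-bit pattern as a signed value.\n    int64_t rel32 = raw >= 0x80000000u ? (int64_t)raw - 0x100000000LL : (int64_t)raw;\n    \/\/ RIP already points at the next instruction, i.e. addr + 5.\n    return addr + 5 + (uint64_t)rel32;\n}\n\nint main(void) {\n    const uint8_t call_foo[5] = {0xe8, 0x21, 0x00, 0x00, 0x00};\n    const uint8_t call_printf[5] = {0xe8, 0xd5, 0xfe, 0xff, 0xff};\n    printf(\"%llx\\n\", (unsigned long long)call_target(0x115a, call_foo));    \/\/ prints 1180\n    printf(\"%llx\\n\", (unsigned long long)call_target(0x1176, call_printf)); \/\/ prints 1050\n    return 0;\n}<\/pre>\n\n\n\n<p>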
Now we make <code>foo.c<\/code> a shared library.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"9,10,19\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">$ gcc -shared -fpic foo.c -o foo.so\n$ objdump -d foo.so\n...\n00000000000010f9 &lt;foo>:\n    10f9:\tf3 0f 1e fa          \tendbr64 \n    10fd:\t55                   \tpush   %rbp\n    10fe:\t48 89 e5             \tmov    %rsp,%rbp\n    1101:\t89 7d fc             \tmov    %edi,-0x4(%rbp)\n    1104:\t48 8b 05 cd 2e 00 00 \tmov    0x2ecd(%rip),%rax        # 3fd8 &lt;x-0x48>\n    110b:\t8b 10                \tmov    (%rax),%edx\n    110d:\t8b 45 fc             \tmov    -0x4(%rbp),%eax\n    1110:\t01 d0                \tadd    %edx,%eax\n    1112:\t5d                   \tpop    %rbp\n    1113:\tc3                   \tret\n...\n$ objdump --full-contents foo.so\n...\nContents of section .got:\n 3fd8 00000000 00000000 00000000 00000000  ................\n 3fe8 00000000 00000000 00000000 00000000  ................\n 3ff8 00000000 00000000                    ........        \nContents of section .got.plt:\n 4000 883e0000 00000000 00000000 00000000  .>..............\n 4010 00000000 00000000                    ........        \nContents of section .data:\n 4018 18400000 00000000 11000000           .@..........\n...<\/pre>\n\n\n\n<p>There&#8217;s an observable difference. Without being PIC, the instruction for referring to <code>x<\/code> is <code>mov 0x2e7f(%rip),%edx<\/code> which copies 4 bytes from the .data section to register <code>%edx<\/code>. Now being PIC, two load instructions are involved: <code>mov 0x2ecd(%rip),%rax<\/code> and <code>mov (%rax),%edx<\/code>. 
The first one copies 8 bytes from the address <code>3fd8<\/code> in the .got section to register <code>%rax<\/code>, and the second loads 4 bytes (the value of <code>x<\/code>) from the address now held in <code>%rax<\/code>. But why is the global variable <code>x<\/code> now read through the .got section, and why does the address <code>3fd8<\/code> contain all zeros?<\/p>\n\n\n\n<p>To answer these questions, we need to introduce the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Global_Offset_Table\">global offset table (GOT)<\/a>, which contains an 8-byte entry (a pointer) for each global data object (procedure or global variable) that is referenced by the object module. The compiler generates a relocation record for each entry in the GOT. In the output below, the <code>R_X86_64_GLOB_DAT<\/code> record at offset <code>3fd8<\/code> is the one for <code>x<\/code>.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"7\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">$ readelf --relocs foo.so\nRelocation section '.rela.dyn' at offset 0x420 contains 8 entries:\n  Offset          Info           Type           Sym. Value    Sym. Name + Addend\n000000003e78  000000000008 R_X86_64_RELATIVE                    10f0\n000000003e80  000000000008 R_X86_64_RELATIVE                    10b0\n000000004018  000000000008 R_X86_64_RELATIVE                    4018\n000000003fd8  000500000006 R_X86_64_GLOB_DAT 0000000000004020 x + 0\n000000003fe0  000100000006 R_X86_64_GLOB_DAT 0000000000000000 __cxa_finalize + 0\n000000003fe8  000200000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_registerTMCl[...] + 0\n000000003ff0  000300000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_deregisterTM[...] + 0\n000000003ff8  000400000006 R_X86_64_GLOB_DAT 0000000000000000 __gmon_start__ + 0<\/pre>\n\n\n\n<p>The GOT serves as a kind of indirection that enables us to override symbols (global variables and functions) in shared libraries. 
This is done with the help of the dynamic linker, which can relocate GOT entries at load time or runtime (lazy binding) so that they contain absolute memory addresses to the proper (overridden) symbols. For example, we can also define a global variable <code>x<\/code> in <code>main.c<\/code> and this one is going to override the one provided in the shared library <code>foo.so<\/code>:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">$ cat main.c\n#include &lt;stdio.h>\nint foo(int i);\n#ifdef OVERRIDE_X\nint x = 13;\n#endif\nint main() {\n    int r = foo(3);\n    printf(\"%d\\n\", r);\n    return r;\n}\n$ gcc main.c .\/foo.so -o main\n$ .\/main\n20\n$ gcc main.c .\/foo.so -o main -DOVERRIDE_X\n$ .\/main\n16<\/pre>\n\n\n\n<p>Another typical way of overriding symbols in a shared library is to use <code>LD_PRELOAD<\/code> which we will cover later.<\/p>\n\n\n\n<p>Back to the questions. Reading the global variable <code>x<\/code> via an extra indirection from the GOT is to allow things like symbol overriding. The GOT entry for <code>x<\/code> will be filled by the dynamic linker at load time or runtime, so it&#8217;s fine to leave the value empty.<\/p>\n\n\n\n<p>Next, let&#8217;s focus on function calls in shared libraries. This involves two aspects: calling library functions from our application code and function calls within the shared library. 
For the former, see the disassembly below:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"39,26,60,14,7,59\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">$ gcc main.c .\/foo.so -o main\n$ objdump -d main\n...\nDisassembly of section .plt:\n\n0000000000001020 &lt;.plt>:\n    1020:\tff 35 92 2f 00 00    \tpush   0x2f92(%rip)        # 3fb8 &lt;_GLOBAL_OFFSET_TABLE_+0x8>\n    1026:\tf2 ff 25 93 2f 00 00 \tbnd jmp *0x2f93(%rip)        # 3fc0 &lt;_GLOBAL_OFFSET_TABLE_+0x10>\n    102d:\t0f 1f 00             \tnopl   (%rax)\n    1030:\tf3 0f 1e fa          \tendbr64 \n    1034:\t68 00 00 00 00       \tpush   $0x0\n    1039:\tf2 e9 e1 ff ff ff    \tbnd jmp 1020 &lt;_init+0x20>\n    103f:\t90                   \tnop\n    1040:\tf3 0f 1e fa          \tendbr64 \n    1044:\t68 01 00 00 00       \tpush   $0x1\n    1049:\tf2 e9 d1 ff ff ff    \tbnd jmp 1020 &lt;_init+0x20>\n    104f:\t90                   \tnop\n...\nDisassembly of section .plt.sec:\n\n0000000000001060 &lt;printf@plt>:\n    1060:\tf3 0f 1e fa          \tendbr64 \n    1064:\tf2 ff 25 5d 2f 00 00 \tbnd jmp *0x2f5d(%rip)        # 3fc8 &lt;printf@GLIBC_2.2.5>\n    106b:\t0f 1f 44 00 00       \tnopl   0x0(%rax,%rax,1)\n\n0000000000001070 &lt;foo@plt>:\n    1070:\tf3 0f 1e fa          \tendbr64 \n    1074:\tf2 ff 25 55 2f 00 00 \tbnd jmp *0x2f55(%rip)        # 3fd0 &lt;foo@Base>\n    107b:\t0f 1f 44 00 00       \tnopl   0x0(%rax,%rax,1)\n\nDisassembly of section .text:\n...\n0000000000001169 &lt;main>:\n    1169:\tf3 0f 1e fa          \tendbr64 \n    116d:\t55                   \tpush   %rbp\n    116e:\t48 89 e5             \tmov    %rsp,%rbp\n    1171:\t48 83 ec 10          \tsub    $0x10,%rsp\n    1175:\tbf 03 00 00 00       \tmov    $0x3,%edi\n    117a:\te8 f1 fe ff ff       \tcall   1070 &lt;foo@plt>\n    117f:\t89 45 fc             \tmov    %eax,-0x4(%rbp)\n    
1182:\t8b 45 fc             \tmov    -0x4(%rbp),%eax\n    1185:\t89 c6                \tmov    %eax,%esi\n    1187:\t48 8d 05 76 0e 00 00 \tlea    0xe76(%rip),%rax        # 2004 &lt;_IO_stdin_used+0x4>\n    118e:\t48 89 c7             \tmov    %rax,%rdi\n    1191:\tb8 00 00 00 00       \tmov    $0x0,%eax\n    1196:\te8 c5 fe ff ff       \tcall   1060 &lt;printf@plt>\n    119b:\t8b 45 fc             \tmov    -0x4(%rbp),%eax\n    119e:\tc9                   \tleave  \n    119f:\tc3                   \tret\n...\n$ objdump --full-contents main\n...\nContents of section .dynamic:\n 3db0 01000000 00000000 72000000 00000000  ........r.......\n 3dc0 01000000 00000000 7b000000 00000000  ........{.......\n ...\nContents of section .got:\n 3fb0 b03d0000 00000000 00000000 00000000  .=..............\n 3fc0 00000000 00000000 30100000 00000000  ........0.......\n 3fd0 40100000 00000000 00000000 00000000  @...............\n 3fe0 00000000 00000000 00000000 00000000  ................\n 3ff0 00000000 00000000 00000000 00000000  ................\nContents of section .data:\n 4000 00000000 00000000 08400000 00000000  .........@......\n...<\/pre>\n\n\n\n<p>There&#8217;s a lot of jumps happening here. We can see that calls to shared library functions are made through the procedure linkage table (PLT). Each shared library function called by the executable has its own PLT entry in the .plt.sec section. In our example, the PLT entries are <code>foo@plt<\/code> and <code>printf@plt<\/code>. PLT entries will be relocated <em>lazily<\/em> by the dynamic linker at runtime, meaning that the address of each function will be resolved by the dynamic linker the <em>first time<\/em> the function is called. This makes sense since usually a shared library, e.g. <code>libc<\/code>, encompasses a large number of functions, and only a few get called by our program. 
This lazy binding saves a lot of work for the dynamic linker.<\/p>\n\n\n\n<p>Now let&#8217;s walk through our example of calling the library function <code>foo<\/code>. Function <code>main<\/code> calls into <code>foo@plt<\/code>, which immediately jumps to another memory address via <code>GOT[4]<\/code> (the 8-byte slot at <code>3fd0<\/code>, counting from the start of the .got section at <code>3fb0<\/code>). The initial value of <code>GOT[4]<\/code> is <code>0x1040<\/code>. Following that address, execution lands in the .plt section, where the stub pushes the index of this PLT entry (<code>0x1<\/code> for <code>foo<\/code>; <code>0x0<\/code> for <code>printf<\/code>) onto the stack and then branches to the common code at the beginning of the .plt section. The common code pushes <code>GOT[1]<\/code> onto the stack and execution jumps to <code>GOT[2]<\/code>. It can be guessed that <code>GOT[1]<\/code> is a pointer to relocation information, and it will be used together with the pushed PLT entry index as two arguments by the dynamic linker to resolve the address of <code>foo<\/code>. <code>GOT[2]<\/code> should be the entry point for the dynamic linker to perform relocations. <code>GOT[1]<\/code> and <code>GOT[2]<\/code> will be filled by the dynamic linker at <em>load time<\/em>. After the dynamic linker figures out the address of <code>foo<\/code> at runtime, it rewrites <code>GOT[4]<\/code> with this address so that subsequent calls to <code>foo<\/code> will jump directly to the resolved destination. Finally, control is transferred to <code>foo<\/code> once the dynamic linker has done its job. 
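<\/p>\n\n\n\n<p>To make the lazy-binding idea more concrete, here is a toy model in plain C. It is only a sketch under simplifying assumptions: <code>got_slot<\/code>, <code>resolver_stub<\/code>, <code>foo_plt<\/code> and <code>resolve_count<\/code> are made-up names, a function pointer stands in for a GOT entry, and no real dynamic linker is involved.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"c\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">\/\/ lazy.c (hypothetical file, for illustration only)\n#include &lt;stdio.h>\n\ntypedef int (*fn_t)(int);\n\nstatic int foo(int i) { return i + 17; } \/\/ stands in for the library function\nstatic int resolver_stub(int i);         \/\/ stands in for the PLT stub plus the dynamic linker\n\n\/\/ A one-slot \"GOT\": it initially points at the stub, not at foo.\nstatic fn_t got_slot = resolver_stub;\nstatic int resolve_count = 0;\n\nstatic int resolver_stub(int i) {\n    resolve_count++; \/\/ the expensive symbol lookup would happen here\n    got_slot = foo;  \/\/ patch the slot so that later calls skip the stub\n    return foo(i);   \/\/ then transfer control to the real target\n}\n\n\/\/ Plays the role of foo@plt: always an indirect call through the slot.\nstatic int foo_plt(int i) { return got_slot(i); }\n\nint main(void) {\n    int a = foo_plt(3); \/\/ first call goes through resolver_stub\n    int b = foo_plt(3); \/\/ second call reaches foo directly\n    printf(\"%d %d %d\\n\", a, b, resolve_count); \/\/ prints 20 20 1\n    return 0;\n}<\/pre>\n\n\n\n<p>In the real mechanism the slot is <code>GOT[4]<\/code>, and the patching is done by the dynamic linker rather than by the program itself. 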
What a beautiful engineering!<\/p>\n\n\n\n<p>Within a shared library, let&#8217;s see what happens when a function calls another function.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"c\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">\/\/ bar.c\nint x = 17;\nint bar(int i) {\n    return i + 42;\n}\nint foo(int i) {\n    return bar(i) + x;\n}<\/pre>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"31\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">$ gcc -shared -fpic bar.c -o bar.so\n$ objdump -d bar.so\n...\nDisassembly of section .plt.sec:\n\n0000000000001050 &lt;bar@plt>:\n    1050:\tf3 0f 1e fa          \tendbr64 \n    1054:\tf2 ff 25 bd 2f 00 00 \tbnd jmp *0x2fbd(%rip)        # 4018 &lt;bar+0x2eff>\n    105b:\t0f 1f 44 00 00       \tnopl   0x0(%rax,%rax,1)\n\nDisassembly of section .text:\n...\n0000000000001119 &lt;bar>:\n    1119:\tf3 0f 1e fa          \tendbr64 \n    111d:\t55                   \tpush   %rbp\n    111e:\t48 89 e5             \tmov    %rsp,%rbp\n    1121:\t89 7d fc             \tmov    %edi,-0x4(%rbp)\n    1124:\t8b 45 fc             \tmov    -0x4(%rbp),%eax\n    1127:\t83 c0 2a             \tadd    $0x2a,%eax\n    112a:\t5d                   \tpop    %rbp\n    112b:\tc3                   \tret    \n\n000000000000112c &lt;foo>:\n    112c:\tf3 0f 1e fa          \tendbr64 \n    1130:\t55                   \tpush   %rbp\n    1131:\t48 89 e5             \tmov    %rsp,%rbp\n    1134:\t48 83 ec 10          \tsub    $0x10,%rsp\n    1138:\t89 7d fc             \tmov    %edi,-0x4(%rbp)\n    113b:\t8b 45 fc             \tmov    -0x4(%rbp),%eax\n    113e:\t89 c7                \tmov    %eax,%edi\n    1140:\te8 0b ff ff ff       \tcall   1050 &lt;bar@plt>\n    
1145:\t48 8b 15 8c 2e 00 00 \tmov    0x2e8c(%rip),%rdx        # 3fd8 &lt;x-0x50>\n    114c:\t8b 12                \tmov    (%rdx),%edx\n    114e:\t01 d0                \tadd    %edx,%eax\n    1150:\tc9                   \tleave  \n    1151:\tc3                   \tret    \n...<\/pre>\n\n\n\n<p>Well, we can see that the call to <code>bar<\/code> is done through a PLT entry, just like it&#8217;s been called from non-library code. This actually makes sense since all exported symbols can be overridden. If we don&#8217;t want to export our internal implementation functions or variables, we can declare them <code>static<\/code>.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"30\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">$ cat bar.c \nint x = 17;\nstatic int bar(int i) {\n    return i + 42;\n}\nint foo(int i) {\n    return bar(i) + x;\n}\n$ gcc -shared -fpic bar.c -o bar.so\n$ objdump -d bar.so\n...\n00000000000010f9 &lt;bar>:\n    10f9:\tf3 0f 1e fa          \tendbr64 \n    10fd:\t55                   \tpush   %rbp\n    10fe:\t48 89 e5             \tmov    %rsp,%rbp\n    1101:\t89 7d fc             \tmov    %edi,-0x4(%rbp)\n    1104:\t8b 45 fc             \tmov    -0x4(%rbp),%eax\n    1107:\t83 c0 2a             \tadd    $0x2a,%eax\n    110a:\t5d                   \tpop    %rbp\n    110b:\tc3                   \tret    \n\n000000000000110c &lt;foo>:\n    110c:\tf3 0f 1e fa          \tendbr64 \n    1110:\t55                   \tpush   %rbp\n    1111:\t48 89 e5             \tmov    %rsp,%rbp\n    1114:\t48 83 ec 08          \tsub    $0x8,%rsp\n    1118:\t89 7d fc             \tmov    %edi,-0x4(%rbp)\n    111b:\t8b 45 fc             \tmov    -0x4(%rbp),%eax\n    111e:\t89 c7                \tmov    %eax,%edi\n    1120:\te8 d4 ff ff ff       \tcall   10f9 &lt;bar>\n    1125:\t48 8b 15 ac 2e 00 00 \tmov    
0x2eac(%rip),%rdx        # 3fd8 &lt;x-0x48>\n    112c:\t8b 12                \tmov    (%rdx),%edx\n    112e:\t01 d0                \tadd    %edx,%eax\n    1130:\tc9                   \tleave  \n    1131:\tc3                   \tret   \n...<\/pre>\n\n\n\n<p>In designing a library, we often have headers and their implementation details will be put in separate translation units. Those functions have external linkage. But we don&#8217;t want to export those symbols unless they are API functions intended to be provided to the user. Under such cases, we can pass <code>-fvisibility=hidden<\/code>&nbsp;to the compiler and explicitly mark the symbols we want to export using <code>__attribute__((__visibility__(\"default\")))<\/code> (in practice, a descriptive macro can be used for this, e.g. <code>LIBNAME_EXPORT<\/code>), see more in the <a href=\"https:\/\/www.gnu.org\/software\/gnulib\/manual\/html_node\/Exported-Symbols-of-Shared-Libraries.html\">GCC manual<\/a>. Doing so has many practical benefits, for example, less work for the dynamic linker,  more efficient code can be generated since no PLT entry is needed, and for data references, there will be only one data load vs two loads if it gets exported.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"33,34\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">$ cat bar.c\nint x = 17;\n\nint bar(int i) {\n    return i + 42;\n}\n\n__attribute__((__visibility__(\"default\")))\nint foo(int i) {\n    return bar(i) + x;\n}\n$ gcc -shared -fpic bar.c -o bar.so -fvisibility=hidden\n$ objdump -d bar.so\n...\n00000000000010f9 &lt;bar>:\n    10f9:\tf3 0f 1e fa          \tendbr64 \n    10fd:\t55                   \tpush   %rbp\n    10fe:\t48 89 e5             \tmov    %rsp,%rbp\n    1101:\t89 7d fc             \tmov    %edi,-0x4(%rbp)\n    1104:\t8b 45 fc             \tmov    
-0x4(%rbp),%eax\n    1107:\t83 c0 2a             \tadd    $0x2a,%eax\n    110a:\t5d                   \tpop    %rbp\n    110b:\tc3                   \tret    \n\n000000000000110c &lt;foo>:\n    110c:\tf3 0f 1e fa          \tendbr64 \n    1110:\t55                   \tpush   %rbp\n    1111:\t48 89 e5             \tmov    %rsp,%rbp\n    1114:\t48 83 ec 08          \tsub    $0x8,%rsp\n    1118:\t89 7d fc             \tmov    %edi,-0x4(%rbp)\n    111b:\t8b 45 fc             \tmov    -0x4(%rbp),%eax\n    111e:\t89 c7                \tmov    %eax,%edi\n    1120:\te8 d4 ff ff ff       \tcall   10f9 &lt;bar>\n    1125:\t8b 15 f5 2e 00 00    \tmov    0x2ef5(%rip),%edx        # 4020 &lt;x>\n    112b:\t01 d0                \tadd    %edx,%eax\n    112d:\tc9                   \tleave  \n    112e:\tc3                   \tret\n...<\/pre>\n\n\n\n<p>At the end of this section, let&#8217;s see an example of using <code><a href=\"https:\/\/man7.org\/linux\/man-pages\/man8\/ld.so.8.html\">LD_PRELOAD<\/a><\/code> to override symbols by loading specific libraries before any other shared libraries. In practice, <code>LD_PRELOAD<\/code> can be used to override the <code>malloc<\/code> implementation in <code>libc.so<\/code> or to load a debug library that contains logging versions of some functions (e.g. 
to debug some multithreaded code by logging each thread&#8217;s activity).<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">$ cat foo.c\nint x = 17;\n\nint foo(int i) {\n    return i + x;\n}\n$ cat main.c \n#include &lt;stdio.h>\n\nint foo(int i);\n\nint main() {\n    int r = foo(3);\n    printf(\"%d\\n\", r);\n    return r;\n}\n$ cat bar.c\nint x = 17;\n\nint bar(int i) {\n    return i + 42;\n}\n\n__attribute__((__visibility__(\"default\")))\nint foo(int i) {\n    return bar(i) + x;\n}\n$ gcc -shared -fpic foo.c -o foo.so\n$ gcc -shared -fpic bar.c -o bar.so -fvisibility=hidden\n$ gcc main.c .\/foo.so -o main\n$ .\/main\n20\n$ LD_PRELOAD=.\/bar.so .\/main\n62<\/pre>\n\n\n\n<p>Oh, one more thing to highlight. Symbol collisions in shared libraries can catch you off guard. An exported symbol in a shared library can be overridden by anything that has the same name. This means an exported function\/variable in a shared library can be overridden by a global variable of any type, or by a function of any signature. If the same symbol name is used for different things, the behavior is undefined. Symbol collisions can happen between a shared library and our application code, or between two shared libraries linked by our innocent application. Therefore, it&#8217;s good practice to keep the <a href=\"https:\/\/gcc.gnu.org\/wiki\/Visibility\">visibility<\/a> of the symbols in a shared library as limited as possible. Better performance and a lower chance of symbol collisions. 
Why not!<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">$ cat foo.c\n#include &lt;stdio.h>\nint x = 5;\n\nint foobar(int i) {\n    printf(\"foobar called from foo.c, argument i=%d\\n\", i);\n    return i + x;\n}\n\nint foo(int i) {\n    return foobar(i) * 10;\n}\n$ cat bar.c\n#include &lt;stdio.h>\n\nint foobar(int i, int j) {\n    printf(\"foobar called from bar.c, arguments i=%d, j=%d\\n\", i, j);\n    return i + j;\n}\n\nint bar(int i, int j) {\n    return foobar(i, j) + 1;\n}\n$ cat barz.c\n#include &lt;stdio.h>\n\nint foobar = 2;\n\nint bar(int i, int j) {\n    printf(\"foobar=%d\\n\", foobar);\n    return foobar + i + j + 5;\n}\n$ cat main.c\n#include &lt;stdio.h>\n\nint foo(int i);        \/\/ from foo.so\nint bar(int i, int j); \/\/ from bar(z).so\n\nint main() {\n    int a = foo(3);\n    int b = bar(3, 7);\n    printf(\"%d\\n\", a);\n    printf(\"%d\\n\", b);\n    return a + b;\n}\n$ gcc -shared -fpic foo.c -o foo.so\n$ gcc -shared -fpic bar.c -o bar.so\n$ gcc -shared -fpic barz.c -o barz.so\n$ gcc main.c .\/foo.so .\/bar.so -o main\n$ .\/main\nfoobar called from foo.c, argument i=3\nfoobar called from foo.c, argument i=3\n80\n9\n$ gcc main.c .\/bar.so .\/foo.so -o main\n$ .\/main\nfoobar called from bar.c, arguments i=3, j=2081388488\nfoobar called from bar.c, arguments i=3, j=7\n-660951570\n11\n$ gcc main.c .\/foo.so .\/barz.so -o main\n$ .\/main\nfoobar called from foo.c, argument i=3\nfoobar=-98693133\n80\n-98693118\n$ gcc main.c .\/barz.so .\/foo.so -o main\n$ .\/main\nfish: Job 1, '.\/main' terminated by signal SIGSEGV (Address boundary error)<\/pre>\n\n\n\n<p>In the example above, our program calls two external functions from two shared libraries: <code>foo<\/code> from <code>foo.so<\/code>, and <code>bar<\/code> from <code>bar.so<\/code> or 
<code>barz.so<\/code>. They all call or reference <code>foobar<\/code>. Whichever shared library is loaded first overrides the other library&#8217;s <code>foobar<\/code> symbol. From the results, we can see that the library load order seems to match the library link order. But we can manually specify the load order using <code>LD_PRELOAD<\/code>.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">$ gcc main.c .\/foo.so .\/bar.so -o main\n$ LD_PRELOAD=.\/foo.so:.\/bar.so .\/main\nfoobar called from foo.c, argument i=3\nfoobar called from foo.c, argument i=3\n80\n9\n$ LD_PRELOAD=.\/bar.so:.\/foo.so .\/main\nfoobar called from bar.c, arguments i=3, j=1241836152\nfoobar called from bar.c, arguments i=3, j=7\n-466540338\n11\n$ gcc main.c .\/foo.so .\/barz.so -o main\n$ LD_PRELOAD=.\/foo.so:.\/barz.so .\/main\nfoobar called from foo.c, argument i=3\nfoobar=-98693133\n80\n-98693118\n$ LD_PRELOAD=.\/barz.so:.\/foo.so .\/main\nfish: Job 1, 'LD_PRELOAD=.\/barz.so:.\/foo.so .\u2026' terminated by signal SIGSEGV (Address boundary error)<\/pre>\n\n\n\n<p>The point here is that when symbol collisions happen, our program can misbehave in many different ways because the behavior is undefined. If we only exported <code>foo<\/code> and <code>bar<\/code>, none of this would happen. If we really need to export <code>foobar<\/code>, for example because it is an API function, can we still avoid the symbol collision on it? The answer is to pass <code><a href=\"https:\/\/man7.org\/linux\/man-pages\/man1\/ld.1.html\">-Bsymbolic<\/a><\/code> or <code>-Bsymbolic-functions<\/code> to the program linker when creating the shared library. These options make the shared library bind references to global (function) symbols to the definition within the shared library, if any. 
This means calls to functions within the shared library don&#8217;t have to go through the PLT, and referenced data doesn&#8217;t have to be read from the GOT, so the linked library ends up with more efficient code. But this makes it impossible to override these symbols for references made inside the library, even for exported public symbols, because no dynamic relocation entries are generated for them. If one wants finer control, <code>--dynamic-list<\/code> can be used.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"9,16,72,57,58,84\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">$ gcc -shared -fpic foo.c -o foo_symbolic.so -Wl,-Bsymbolic\n$ readelf -r foo.so \nRelocation section '.rela.dyn' at offset 0x4b0 contains 8 entries:\n  Offset          Info           Type           Sym. Value    Sym. Name + Addend\n000000003e08  000000000008 R_X86_64_RELATIVE                    1130\n000000003e10  000000000008 R_X86_64_RELATIVE                    10f0\n000000004028  000000000008 R_X86_64_RELATIVE                    4028\n000000003fd8  000100000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_deregisterTM[...] + 0\n000000003fe0  000600000006 R_X86_64_GLOB_DAT 0000000000004030 x + 0\n000000003fe8  000300000006 R_X86_64_GLOB_DAT 0000000000000000 __gmon_start__ + 0\n000000003ff0  000400000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_registerTMCl[...] + 0\n000000003ff8  000500000006 R_X86_64_GLOB_DAT 0000000000000000 __cxa_finalize@GLIBC_2.2.5 + 0\n\nRelocation section '.rela.plt' at offset 0x570 contains 2 entries:\n  Offset          Info           Type           Sym. Value    Sym. 
Name + Addend\n000000004018  000800000007 R_X86_64_JUMP_SLO 0000000000001139 foobar + 0\n000000004020  000200000007 R_X86_64_JUMP_SLO 0000000000000000 printf@GLIBC_2.2.5 + 0\n$ readelf -r foo_symbolic.so # no relocations for x and foobar\nRelocation section '.rela.dyn' at offset 0x4b0 contains 7 entries:\n  Offset          Info           Type           Sym. Value    Sym. Name + Addend\n000000003df0  000000000008 R_X86_64_RELATIVE                    1110\n000000003df8  000000000008 R_X86_64_RELATIVE                    10d0\n000000004020  000000000008 R_X86_64_RELATIVE                    4020\n000000003fe0  000100000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_deregisterTM[...] + 0\n000000003fe8  000300000006 R_X86_64_GLOB_DAT 0000000000000000 __gmon_start__ + 0\n000000003ff0  000400000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_registerTMCl[...] + 0\n000000003ff8  000500000006 R_X86_64_GLOB_DAT 0000000000000000 __cxa_finalize@GLIBC_2.2.5 + 0\n\nRelocation section '.rela.plt' at offset 0x558 contains 1 entry:\n  Offset          Info           Type           Sym. Value    Sym. 
Name + Addend\n000000004018  000200000007 R_X86_64_JUMP_SLO 0000000000000000 printf@GLIBC_2.2.5 + 0\n$ objdump -d foo_symbolic.so\n...\nDisassembly of section .plt.sec:\n\n0000000000001050 &lt;printf@plt>:\n    1050:\tf3 0f 1e fa          \tendbr64 \n    1054:\tf2 ff 25 bd 2f 00 00 \tbnd jmp *0x2fbd(%rip)        # 4018 &lt;printf@GLIBC_2.2.5>\n    105b:\t0f 1f 44 00 00       \tnopl   0x0(%rax,%rax,1)\n\nDisassembly of section .text:\n\n0000000000001060 &lt;deregister_tm_clones>:\n...\n0000000000001119 &lt;foobar>:\n    1119:\tf3 0f 1e fa          \tendbr64 \n    111d:\t55                   \tpush   %rbp\n    111e:\t48 89 e5             \tmov    %rsp,%rbp\n    1121:\t48 83 ec 10          \tsub    $0x10,%rsp\n    1125:\t89 7d fc             \tmov    %edi,-0x4(%rbp)\n    1128:\t8b 45 fc             \tmov    -0x4(%rbp),%eax\n    112b:\t89 c6                \tmov    %eax,%esi\n    112d:\t48 8d 05 cc 0e 00 00 \tlea    0xecc(%rip),%rax        # 2000 &lt;_fini+0xe88>\n    1134:\t48 89 c7             \tmov    %rax,%rdi\n    1137:\tb8 00 00 00 00       \tmov    $0x0,%eax\n    113c:\te8 0f ff ff ff       \tcall   1050 &lt;printf@plt>\n    1141:\t48 8d 05 e0 2e 00 00 \tlea    0x2ee0(%rip),%rax        # 4028 &lt;x>\n    1148:\t8b 10                \tmov    (%rax),%edx\n    114a:\t8b 45 fc             \tmov    -0x4(%rbp),%eax\n    114d:\t01 d0                \tadd    %edx,%eax\n    114f:\tc9                   \tleave  \n    1150:\tc3                   \tret    \n\n0000000000001151 &lt;foo>:\n    1151:\tf3 0f 1e fa          \tendbr64 \n    1155:\t55                   \tpush   %rbp\n    1156:\t48 89 e5             \tmov    %rsp,%rbp\n    1159:\t48 83 ec 10          \tsub    $0x10,%rsp\n    115d:\t89 7d fc             \tmov    %edi,-0x4(%rbp)\n    1160:\t8b 45 fc             \tmov    -0x4(%rbp),%eax\n    1163:\t89 c7                \tmov    %eax,%edi\n    1165:\te8 af ff ff ff       \tcall   1119 &lt;foobar>\n    116a:\t89 c2                \tmov    %eax,%edx\n    116c:\t89 d0      
          \tmov    %edx,%eax\n    116e:\tc1 e0 02             \tshl    $0x2,%eax\n    1171:\t01 d0                \tadd    %edx,%eax\n    1173:\t01 c0                \tadd    %eax,%eax\n    1175:\tc9                   \tleave  \n    1176:\tc3                   \tret    \n...\n$ objdump --full-contents foo_symbolic.so\n...\nContents of section .data:\n 4020 20400000 00000000 05000000            @..........   \n...<\/pre>\n\n\n\n<p>Some other good stuff to read on shared libraries: &#8220;<a href=\"https:\/\/akkadia.org\/drepper\/dsohowto.pdf\">How To Write Shared Libraries<\/a>&#8221; by Ulrich Drepper.<\/p>\n\n\n\n<p>Link-Time Optimization<\/p>\n\n\n\n<p>Link-time optimization (LTO) is a form of <a href=\"https:\/\/en.wikipedia.org\/wiki\/Interprocedural_optimization\">interprocedural optimization<\/a>. As its name indicates, this kind of optimization happens at the link stage, when the compiler can see the whole program. To make that possible, the compiler dumps its intermediate representation (GIMPLE in GCC, LLVM IR\/bitcode in Clang) into the object file when compiling each source file (translation unit), so the resulting object files carry the extra information needed for LTO. Typical link-time optimizations include function inlining and dead code elimination. Read more <a href=\"https:\/\/www.airs.com\/blog\/archives\/51\">here<\/a> or watch <a href=\"https:\/\/www.youtube.com\/watch?v=p9nH2vZ2mNo\">this<\/a> awesome video. 
We&#8217;ll instead focus on a concrete and interesting example: trying to inline a comparison function pointer for a library routine <code>insertion_sort<\/code> in C.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"c\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">\/\/ isort.h\n#pragma once\n#include &lt;stddef.h>\n\nvoid insertion_sort(void *ptr, size_t num, size_t size, int (*comp)(const void *, const void *));\n\/\/ isort.c\n#include \"isort.h\"\n#include &lt;stdio.h>\n\nstatic void swap_item(char *a, char *b, size_t size) {\n    for (size_t i = 0; i &lt; size; ++i) {\n        char tmp = a[i];\n        a[i] = b[i];\n        b[i] = tmp;\n    }\n}\n\nvoid insertion_sort(void *ptr, size_t num, size_t size, int (*comp)(const void *, const void *)) {\n    char *base = (char *)ptr;\n    for (size_t i = 1; i &lt; num; ++i) {\n        for (size_t j = i; j > 0 &amp;&amp; comp(base + (j-1)*size, base + j*size) > 0; --j) {\n            swap_item(base + (j-1)*size, base + j*size, size);\n        }\n    }\n}\n\/\/ main.c\n#include \"isort.h\"\n#include &lt;stdio.h>\n\nint compare_int(const void *a, const void *b) {\n    return *(int *)a - *(int *)b;\n}\n\nint main() {\n    int arr[] = {7, 3, 1, 5, 2, 9, 1, 7, 8, 2, 4, 6, 5};\n    size_t arr_size = sizeof(arr) \/ sizeof(arr[0]);\n\n    insertion_sort(arr, arr_size, sizeof(arr[0]), compare_int);\n\n    for (size_t i = 0; i &lt; arr_size; ++i) {\n        printf(\"%d \", arr[i]);\n    }\n    printf(\"\\n\");\n\n    return 0;\n}<\/pre>\n\n\n\n<p>This is a typical setup &#8211; the definition of the library routine <code>insertion_sort<\/code> is separate from our application code, and our application code provides a custom comparison function upon calling the library routine.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"bash\" data-enlighter-theme=\"\" 
data-enlighter-highlight=\"4,11\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">$ gcc -o main main.c isort.c -O3 \n$ objdump -d main | grep call\n    1014:\tff d0                \tcall   *%rax\n    110d:\te8 0e 03 00 00       \tcall   1420 &lt;insertion_sort>\n    1128:\te8 63 ff ff ff       \tcall   1090 &lt;__printf_chk@plt>\n    1137:\te8 34 ff ff ff       \tcall   1070 &lt;putchar@plt>\n    1157:\te8 24 ff ff ff       \tcall   1080 &lt;__stack_chk_fail@plt>\n    117f:\tff 15 53 2e 00 00    \tcall   *0x2e53(%rip)        # 3fd8 &lt;__libc_start_main@GLIBC_2.34>\n    1222:\te8 39 fe ff ff       \tcall   1060 &lt;__cxa_finalize@plt>\n    1227:\te8 64 ff ff ff       \tcall   1190 &lt;deregister_tm_clones>\n    14e1:\tff d0                \tcall   *%rax\n$ gcc -o main main.c isort.c -O3 -flto\n$ objdump -d main | grep call\n    1014:\tff d0                \tcall   *%rax\n    1180:\te8 0b ff ff ff       \tcall   1090 &lt;__printf_chk@plt>\n    118f:\te8 dc fe ff ff       \tcall   1070 &lt;putchar@plt>\n    11af:\te8 cc fe ff ff       \tcall   1080 &lt;__stack_chk_fail@plt>\n    11df:\tff 15 f3 2d 00 00    \tcall   *0x2df3(%rip)        # 3fd8 &lt;__libc_start_main@GLIBC_2.34>\n    1282:\te8 d9 fd ff ff       \tcall   1060 &lt;__cxa_finalize@plt>\n    1287:\te8 64 ff ff ff       \tcall   11f0 &lt;deregister_tm_clones><\/pre>\n\n\n\n<p>The grepped output shows that, even at the <code>-O3<\/code> optimization level, the comparison function is still invoked through an indirect <code>call *%rax<\/code> (the one at <code>14e1<\/code>; the one at <code>1014<\/code> shows up in every build here, so it is not the comparator) rather than being inlined, which hurts performance, especially when it happens inside a tight loop. With LTO, both the call to <code>insertion_sort<\/code> and that indirect call are gone: the routine and the comparison function are inlined into <code>main<\/code>. Sadly, even if we put everything in a single file, the comparator is not inlined (<a href=\"https:\/\/godbolt.org\/z\/16oxhhToE\">Godbolt link<\/a>). 
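<\/p>\n\n\n\n<p>When LTO is not an option, one way to get rid of the indirect call is to drop the function pointer altogether and specialize the routine for the element type. The following is only a sketch and not part of the measurements above: <code>insertion_sort_int<\/code> and <code>isort_int.c<\/code> are hypothetical names, and this variant handles <code>int<\/code> arrays only.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"c\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">\/\/ isort_int.c (hypothetical file, for illustration only)\n#include &lt;stddef.h>\n\n\/\/ int-only variant of insertion_sort: the comparison is written directly in\n\/\/ the loop condition, so there is no comparator to call through a pointer.\nvoid insertion_sort_int(int *arr, size_t num) {\n    for (size_t i = 1; i &lt; num; ++i) {\n        for (size_t j = i; j > 0 &amp;&amp; arr[j-1] > arr[j]; --j) {\n            int tmp = arr[j-1];\n            arr[j-1] = arr[j];\n            arr[j] = tmp;\n        }\n    }\n}<\/pre>\n\n\n\n<p>The price is genericity: every element type needs its own copy of the routine, which is what the function pointer was meant to avoid.<\/p>\n\n\n\n<p>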
What about LTO for shared libraries and static libraries?<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"5,16,20\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">$ gcc -shared -fpic isort.c -O3 -flto -o libisort.so\n$ gcc main.c -L. -lisort -O3 -flto -o main\n$ objdump -d main | grep call\n    1014:\tff d0                \tcall   *%rax\n    112d:\te8 7e ff ff ff       \tcall   10b0 &lt;insertion_sort@plt>\n    1148:\te8 53 ff ff ff       \tcall   10a0 &lt;__printf_chk@plt>\n    1157:\te8 24 ff ff ff       \tcall   1080 &lt;putchar@plt>\n    1177:\te8 14 ff ff ff       \tcall   1090 &lt;__stack_chk_fail@plt>\n    119f:\tff 15 33 2e 00 00    \tcall   *0x2e33(%rip)        # 3fd8 &lt;__libc_start_main@GLIBC_2.34>\n    1242:\te8 29 fe ff ff       \tcall   1070 &lt;__cxa_finalize@plt>\n    1247:\te8 64 ff ff ff       \tcall   11b0 &lt;deregister_tm_clones>\n$ objdump -d libisort.so | grep call\n    1014:\tff d0                \tcall   *%rax\n    10d2:\te8 59 ff ff ff       \tcall   1030 &lt;__cxa_finalize@plt>\n    10d7:\te8 64 ff ff ff       \tcall   1040 &lt;deregister_tm_clones>\n    11c1:\tff d0                \tcall   *%rax\n$ gcc isort.c -c -O3 -flto\n$ ar rcs libisort.a isort.o\nar: isort.o: plugin needed to handle lto object\n$ gcc-ar rcs libisort.a isort.o\n$ gcc main.c -L. 
-l:libisort.a -O3 -flto -o main\n$ objdump -d main | grep call\n    1014:\tff d0                \tcall   *%rax\n    1180:\te8 0b ff ff ff       \tcall   1090 &lt;__printf_chk@plt>\n    118f:\te8 dc fe ff ff       \tcall   1070 &lt;putchar@plt>\n    11af:\te8 cc fe ff ff       \tcall   1080 &lt;__stack_chk_fail@plt>\n    11df:\tff 15 f3 2d 00 00    \tcall   *0x2df3(%rip)        # 3fd8 &lt;__libc_start_main@GLIBC_2.34>\n    1282:\te8 d9 fd ff ff       \tcall   1060 &lt;__cxa_finalize@plt>\n    1287:\te8 64 ff ff ff       \tcall   11f0 &lt;deregister_tm_clones><\/pre>\n\n\n\n<p>There should be no surprise. Calls into a shared library are resolved at runtime, where the routine may even be overridden, so the link step cannot assume that the <code>insertion_sort<\/code> it sees is the one that will actually run, and nothing is inlined across the library boundary. Although LTO doesn&#8217;t help application code that calls into a shared library, compiling the shared library itself with the <code>-flto<\/code> flag may still produce more aggressively optimized code inside the library. Static libraries, however, are just archives of object files. As long as the archived object files needed for static linking contain LTO information, LTO can be performed in the same way as for plain object files.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>There are a few things about linking that I think are worthy of writing down after I studied linking from CSAPP-3e and some resources on the web. 
Linkage of const Global Variables const global variables in C++ have internal linkage (as declared static) unless they&#8217;re explicitly declared extern or inline (see C++17 inline variable), whereas &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/markjohntaylor.com\/blog\/wordpress\/index.php\/2023\/11\/19\/linking\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Linking&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[15,29],"tags":[],"_links":{"self":[{"href":"https:\/\/markjohntaylor.com\/blog\/wordpress\/index.php\/wp-json\/wp\/v2\/posts\/1766"}],"collection":[{"href":"https:\/\/markjohntaylor.com\/blog\/wordpress\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/markjohntaylor.com\/blog\/wordpress\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/markjohntaylor.com\/blog\/wordpress\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/markjohntaylor.com\/blog\/wordpress\/index.php\/wp-json\/wp\/v2\/comments?post=1766"}],"version-history":[{"count":142,"href":"https:\/\/markjohntaylor.com\/blog\/wordpress\/index.php\/wp-json\/wp\/v2\/posts\/1766\/revisions"}],"predecessor-version":[{"id":1933,"href":"https:\/\/markjohntaylor.com\/blog\/wordpress\/index.php\/wp-json\/wp\/v2\/posts\/1766\/revisions\/1933"}],"wp:attachment":[{"href":"https:\/\/markjohntaylor.com\/blog\/wordpress\/index.php\/wp-json\/wp\/v2\/media?parent=1766"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/markjohntaylor.com\/blog\/wordpress\/index.php\/wp-json\/wp\/v2\/categories?post=1766"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/markjohntaylor.com\/blog\/wordpress\/index.php\/wp-json\/wp\/v2\/tags?post=1766"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","tem
plated":true}]}}