WHO'S AWESOME – YOU'RE AWESOME

Linking

There are a few things about linking that I think are worthy of writing down after I studied linking from CSAPP-3e and some resources on the web.

Linkage of const Global Variables

const global variables in C++ have internal linkage (as declared static) unless they’re explicitly declared extern or inline (see C++17 inline variable), whereas in C external linkage (as declared extern) is the default for all file-scoped entities (functions and global variables). An excellent answer on StackOverflow about inline variables can be found here. One of the major uses of the inline keyword is to enable multiple identical function definitions across translation units (TUs) (in practice function definitions in header files), so this meaning was extended to variables to allow an inline const variable to have a uniform memory address across TUs and allow a non-const global variable to be defined in headers without redefinition problems.

$ cat t.c
int int_ = 1;
static int static_int = 1;
const int const_int = 1;
static const int static_const_int = 1;
$ gcc -c t.c
$ nm t.o
0000000000000000 R const_int # global read-only data section in C
0000000000000000 D int_
0000000000000004 r static_const_int
0000000000000004 d static_int
$ g++ -c t.c
$ nm -C t.o
0000000000000000 D int_
0000000000000004 d static_int
0000000000000004 r static_const_int
0000000000000000 r const_int # local read-only data section in C++

Link Order

The link order matters when we link against static or shared libraries. The rules used by the linker to resolve symbols are clearly described in the Symbol Resolution section of the Linking chapter in CSAPP-3e. Please refer to the text or this post from Ian Lance Taylor’s blog for more information.

Shared Libraries

Shared libraries are ubiquitous on almost every platform that has an operating system. The key purpose of shared libraries is to share code among processes and thus save memory. One approach to sharing library code could be to put the given library code at a specific memory location. But it has at least two problems: (a) There are so many (literally unlimited number of) libraries, how do we specify which library to put at which location? If there are ten libraries in total, but a program is only linked against one or two of them, the reserved memory for other shared libraries is wasted. (b) Having a library loaded at a fixed memory location is vulnerable to attacks. So for these reasons, shared libraries are implemented in a way that they can be loaded at any (random) memory locations that you don’t know in advance.

Luckily, the compiler can generate position-independent code (PIC) for us. Let’s first look at a normal non-PIC case:

// foo.c
int x = 17;
int foo(int i) {
    return i + x;
}

// main.c
#include <stdio.h>
int foo(int i);
int main() {
    int r = foo(3);
    printf("%d\n", r);
    return r;
}

$ gcc main.c foo.c -o main
$ objdump -d main
...
0000000000001149 <main>:
    1149:	f3 0f 1e fa          	endbr64 
    114d:	55                   	push   %rbp
    114e:	48 89 e5             	mov    %rsp,%rbp
    1151:	48 83 ec 10          	sub    $0x10,%rsp
    1155:	bf 03 00 00 00       	mov    $0x3,%edi
    115a:	e8 21 00 00 00       	call   1180 <foo>
    115f:	89 45 fc             	mov    %eax,-0x4(%rbp)
    1162:	8b 45 fc             	mov    -0x4(%rbp),%eax
    1165:	89 c6                	mov    %eax,%esi
    1167:	48 8d 05 96 0e 00 00 	lea    0xe96(%rip),%rax        # 2004 <_IO_stdin_used+0x4>
    116e:	48 89 c7             	mov    %rax,%rdi
    1171:	b8 00 00 00 00       	mov    $0x0,%eax
    1176:	e8 d5 fe ff ff       	call   1050 <printf@plt>
    117b:	8b 45 fc             	mov    -0x4(%rbp),%eax
    117e:	c9                   	leave  
    117f:	c3                   	ret    

0000000000001180 <foo>:
    1180:	f3 0f 1e fa          	endbr64 
    1184:	55                   	push   %rbp
    1185:	48 89 e5             	mov    %rsp,%rbp
    1188:	89 7d fc             	mov    %edi,-0x4(%rbp)
    118b:	8b 15 7f 2e 00 00    	mov    0x2e7f(%rip),%edx        # 4010 <x>
    1191:	8b 45 fc             	mov    -0x4(%rbp),%eax
    1194:	01 d0                	add    %edx,%eax
    1196:	5d                   	pop    %rbp
    1197:	c3                   	ret   
... 
$ objdump --full-contents main
...
Contents of section .data:
 4000 00000000 00000000 08400000 00000000  .........@......
 4010 11000000 
...

Fun story: I was confused about the disassembly once by the line call 1180 <foo>. I thought it was an absolute address since it could use something like 0x21(%rip) for relative addressing otherwise. I knew if I printed the function’s address, it would be different every time it was executed since the OS would load the code (.text section) at random addresses for security’s sake. But I also knew the code section was read-only. So, how was address space layout randomization implemented? Did the kernel (dynamic linker) perform some relocation and modify the machine code anyway? If it did, that would be soooo inefficient, for all calls to other normal functions would need to be modified as well. I just couldn’t figure it out. Then I realized that the secret must be behind the machine code e8 21 00 00 00 itself. So I asked ChatGPT about it. It turned out that, yes indeed, this call instruction uses relative addressing.

Therefore, knowing how some instructions are encoded is helpful in many situations. In x86 assembly, the opcode e8 is a near call (see more at https://shell-storm.org/x86doc/CALL.html). The 4 bytes followed are the operand which is encoded in two’s complement. x86 machines are little-endian. So the rel32 operand is 0x21, added to the RIP register (115f) gives our target address 1180. Similarly, the E9 JMP instruction also uses relative addressing. Another example: the operand in 1176: e8 d5 fe ff ff is the value whose two’s complement is 0xfffffed5. The value is -12b. Therefore, this instruction can be decoded as call RIP-12b (RIP = 117b), i.e. call 1050. Most of the time it’s fine just to look at the decoded assembler code rather than the machine code itself. However, knowing the instruction encoding makes us better understand how the CPU really executes some instructions (e.g. call RIP-12b shows us that at runtime the CPU pushes the address of the next instruction (RIP) onto the stack and then the execution jumps to the memory location with an offset of -12b to RIP).

Go back to the example above. We can see that main makes a near call to foo, and foo references the the global variable in the .data section which has an initial value 0x11 (17). Nothing special. Now we make foo.c a shared library.

$ gcc -shared -fpic foo.c -o foo.so
$ objdump -d foo.so
...
00000000000010f9 <foo>:
    10f9:	f3 0f 1e fa          	endbr64 
    10fd:	55                   	push   %rbp
    10fe:	48 89 e5             	mov    %rsp,%rbp
    1101:	89 7d fc             	mov    %edi,-0x4(%rbp)
    1104:	48 8b 05 cd 2e 00 00 	mov    0x2ecd(%rip),%rax        # 3fd8 <x-0x48>
    110b:	8b 10                	mov    (%rax),%edx
    110d:	8b 45 fc             	mov    -0x4(%rbp),%eax
    1110:	01 d0                	add    %edx,%eax
    1112:	5d                   	pop    %rbp
    1113:	c3                   	ret
...
$ objdump --full-contents foo.so
...
Contents of section .got:
 3fd8 00000000 00000000 00000000 00000000  ................
 3fe8 00000000 00000000 00000000 00000000  ................
 3ff8 00000000 00000000                    ........        
Contents of section .got.plt:
 4000 883e0000 00000000 00000000 00000000  .>..............
 4010 00000000 00000000                    ........        
Contents of section .data:
 4018 18400000 00000000 11000000           .@..........
...

There’s an observable difference. Without being PIC, the instruction for referring to x is mov 0x2e7f(%rip),%edx which copies 4 bytes from the .data section to register %edx. Now being PIC, two load instructions are involved: mov 0x2ecd(%rip),%rax and mov (%rax),%edx. The first one copies 8 bytes from the address 3fd8 in the .got section to register %rax, and the second accesses the lower half part of the value stored at address in %rax. But why is the global variable x now read from the .got section and why does the address 3fd8 contain an empty value?

To answer these questions, we are introduced to the global offset table (GOT), which contains an 8-byte entry (a pointer) for each global data object (procedure or global variable) that is referenced by the object module. The compiler generates a relocation record for each entry in the GOT.

$ readelf --relocs foo.so
Relocation section '.rela.dyn' at offset 0x420 contains 8 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000003e78  000000000008 R_X86_64_RELATIVE                    10f0
000000003e80  000000000008 R_X86_64_RELATIVE                    10b0
000000004018  000000000008 R_X86_64_RELATIVE                    4018
000000003fd8  000500000006 R_X86_64_GLOB_DAT 0000000000004020 x + 0
000000003fe0  000100000006 R_X86_64_GLOB_DAT 0000000000000000 __cxa_finalize + 0
000000003fe8  000200000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_registerTMCl[...] + 0
000000003ff0  000300000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_deregisterTM[...] + 0
000000003ff8  000400000006 R_X86_64_GLOB_DAT 0000000000000000 __gmon_start__ + 0

The GOT serves as a kind of indirection that enables us to override symbols (global variables and functions) in shared libraries. This is done with the help of the dynamic linker, which can relocate GOT entries at load time or runtime (lazy binding) so that they contain absolute memory addresses to the proper (overridden) symbols. For example, we can also define a global variable x in main.c and this one is going to override the one provided in the shared library foo.so:

$ cat main.c
#include <stdio.h>
int foo(int i);
#ifdef OVERRIDE_X
int x = 13;
#endif
int main() {
    int r = foo(3);
    printf("%d\n", r);
    return r;
}
$ gcc main.c ./foo.so -o main
$ ./main
20
$ gcc main.c ./foo.so -o main -DOVERRIDE_X
$ ./main
16

Another typical way of overriding symbols in a shared library is to use LD_PRELOAD which we will cover later.

Back to the questions. Reading the global variable x via an extra indirection from the GOT is to allow things like symbol overriding. The GOT entry for x will be filled by the dynamic linker at load time or runtime, so it’s fine to leave the value empty.

Next, let’s focus on function calls in shared libraries. This involves two aspects: calling library functions from our application code and function calls within the shared library. For the former, see the disassembly below:

$ gcc main.c ./foo.so -o main
$ objdump -d main
...
Disassembly of section .plt:

0000000000001020 <.plt>:
    1020:	ff 35 92 2f 00 00    	push   0x2f92(%rip)        # 3fb8 <_GLOBAL_OFFSET_TABLE_+0x8>
    1026:	f2 ff 25 93 2f 00 00 	bnd jmp *0x2f93(%rip)        # 3fc0 <_GLOBAL_OFFSET_TABLE_+0x10>
    102d:	0f 1f 00             	nopl   (%rax)
    1030:	f3 0f 1e fa          	endbr64 
    1034:	68 00 00 00 00       	push   $0x0
    1039:	f2 e9 e1 ff ff ff    	bnd jmp 1020 <_init+0x20>
    103f:	90                   	nop
    1040:	f3 0f 1e fa          	endbr64 
    1044:	68 01 00 00 00       	push   $0x1
    1049:	f2 e9 d1 ff ff ff    	bnd jmp 1020 <_init+0x20>
    104f:	90                   	nop
...
Disassembly of section .plt.sec:

0000000000001060 <printf@plt>:
    1060:	f3 0f 1e fa          	endbr64 
    1064:	f2 ff 25 5d 2f 00 00 	bnd jmp *0x2f5d(%rip)        # 3fc8 <printf@GLIBC_2.2.5>
    106b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)

0000000000001070 <foo@plt>:
    1070:	f3 0f 1e fa          	endbr64 
    1074:	f2 ff 25 55 2f 00 00 	bnd jmp *0x2f55(%rip)        # 3fd0 <foo@Base>
    107b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)

Disassembly of section .text:
...
0000000000001169 <main>:
    1169:	f3 0f 1e fa          	endbr64 
    116d:	55                   	push   %rbp
    116e:	48 89 e5             	mov    %rsp,%rbp
    1171:	48 83 ec 10          	sub    $0x10,%rsp
    1175:	bf 03 00 00 00       	mov    $0x3,%edi
    117a:	e8 f1 fe ff ff       	call   1070 <foo@plt>
    117f:	89 45 fc             	mov    %eax,-0x4(%rbp)
    1182:	8b 45 fc             	mov    -0x4(%rbp),%eax
    1185:	89 c6                	mov    %eax,%esi
    1187:	48 8d 05 76 0e 00 00 	lea    0xe76(%rip),%rax        # 2004 <_IO_stdin_used+0x4>
    118e:	48 89 c7             	mov    %rax,%rdi
    1191:	b8 00 00 00 00       	mov    $0x0,%eax
    1196:	e8 c5 fe ff ff       	call   1060 <printf@plt>
    119b:	8b 45 fc             	mov    -0x4(%rbp),%eax
    119e:	c9                   	leave  
    119f:	c3                   	ret
...
$ objdump --full-contents main
...
Contents of section .dynamic:
 3db0 01000000 00000000 72000000 00000000  ........r.......
 3dc0 01000000 00000000 7b000000 00000000  ........{.......
 ...
Contents of section .got:
 3fb0 b03d0000 00000000 00000000 00000000  .=..............
 3fc0 00000000 00000000 30100000 00000000  ........0.......
 3fd0 40100000 00000000 00000000 00000000  @...............
 3fe0 00000000 00000000 00000000 00000000  ................
 3ff0 00000000 00000000 00000000 00000000  ................
Contents of section .data:
 4000 00000000 00000000 08400000 00000000  .........@......
...

There’s a lot of jumps happening here. We can see that calls to shared library functions are made through the procedure linkage table (PLT). Each shared library function called by the executable has its own PLT entry in the .plt.sec section. In our example, the PLT entries are foo@plt and printf@plt. PLT entries will be relocated lazily by the dynamic linker at runtime, meaning that the address of each function will be resolved by the dynamic linker the first time the function is called. This makes sense since usually a shared library, e.g. libc, encompasses a large number of functions, and only a few get called by our program. This lazy binding saves a lot of work for the dynamic linker.

Now we walk through our example of calling the library function foo. Function main calls into foo@plt, which immediately jumps to another memory address via GOT[4]. The initial value of GOT[4] is 0x1040. Following that, the execution jumps to somewhere in the .plt section and pushes the index of this PLT entry (0x1 for foo; 0x0 for printf) onto the stack and then branches to the common code at the beginning of the .plt section. The common code pushes GOT[1] onto the stack and the execution jumps to GOT[2]. It can be guessed that GOT[1] is a pointer to relocation information, and it will be used together with the pushed PLT entry index as two arguments by the dynamic linker to resolve the address of foo. GOT[2] should be the entry point for the dynamic linker to perform relocations. GOT[1] and GOT[2] will be filled by the dynamic linker at load time. After the dynamic linker figures out the address of foo at runtime, it rewrites GOT[4] with this address so that subsequent calls to foo will jump directly to the resolved destination. Finally, the control is transferred back to foo when the dynamic linker has done its job. What a beautiful engineering!

Within a shared library, let’s see what happens when a function calls another function.

// bar.c
int x = 17;
int bar(int i) {
    return i + 42;
}
int foo(int i) {
    return bar(i) + x;
}

$ gcc -shared -fpic bar.c -o bar.so
$ objdump -d bar.so
...
Disassembly of section .plt.sec:

0000000000001050 <bar@plt>:
    1050:	f3 0f 1e fa          	endbr64 
    1054:	f2 ff 25 bd 2f 00 00 	bnd jmp *0x2fbd(%rip)        # 4018 <bar+0x2eff>
    105b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)

Disassembly of section .text:
...
0000000000001119 <bar>:
    1119:	f3 0f 1e fa          	endbr64 
    111d:	55                   	push   %rbp
    111e:	48 89 e5             	mov    %rsp,%rbp
    1121:	89 7d fc             	mov    %edi,-0x4(%rbp)
    1124:	8b 45 fc             	mov    -0x4(%rbp),%eax
    1127:	83 c0 2a             	add    $0x2a,%eax
    112a:	5d                   	pop    %rbp
    112b:	c3                   	ret    

000000000000112c <foo>:
    112c:	f3 0f 1e fa          	endbr64 
    1130:	55                   	push   %rbp
    1131:	48 89 e5             	mov    %rsp,%rbp
    1134:	48 83 ec 10          	sub    $0x10,%rsp
    1138:	89 7d fc             	mov    %edi,-0x4(%rbp)
    113b:	8b 45 fc             	mov    -0x4(%rbp),%eax
    113e:	89 c7                	mov    %eax,%edi
    1140:	e8 0b ff ff ff       	call   1050 <bar@plt>
    1145:	48 8b 15 8c 2e 00 00 	mov    0x2e8c(%rip),%rdx        # 3fd8 <x-0x50>
    114c:	8b 12                	mov    (%rdx),%edx
    114e:	01 d0                	add    %edx,%eax
    1150:	c9                   	leave  
    1151:	c3                   	ret    
...

Well, we can see that the call to bar is done through a PLT entry, just like it’s been called from non-library code. This actually makes sense since all exported symbols can be overridden. If we don’t want to export our internal implementation functions or variables, we can declare them static.

$ cat bar.c 
int x = 17;
static int bar(int i) {
    return i + 42;
}
int foo(int i) {
    return bar(i) + x;
}
$ gcc -shared -fpic bar.c -o bar.so
$ objdump -d bar.so
...
00000000000010f9 <bar>:
    10f9:	f3 0f 1e fa          	endbr64 
    10fd:	55                   	push   %rbp
    10fe:	48 89 e5             	mov    %rsp,%rbp
    1101:	89 7d fc             	mov    %edi,-0x4(%rbp)
    1104:	8b 45 fc             	mov    -0x4(%rbp),%eax
    1107:	83 c0 2a             	add    $0x2a,%eax
    110a:	5d                   	pop    %rbp
    110b:	c3                   	ret    

000000000000110c <foo>:
    110c:	f3 0f 1e fa          	endbr64 
    1110:	55                   	push   %rbp
    1111:	48 89 e5             	mov    %rsp,%rbp
    1114:	48 83 ec 08          	sub    $0x8,%rsp
    1118:	89 7d fc             	mov    %edi,-0x4(%rbp)
    111b:	8b 45 fc             	mov    -0x4(%rbp),%eax
    111e:	89 c7                	mov    %eax,%edi
    1120:	e8 d4 ff ff ff       	call   10f9 <bar>
    1125:	48 8b 15 ac 2e 00 00 	mov    0x2eac(%rip),%rdx        # 3fd8 <x-0x48>
    112c:	8b 12                	mov    (%rdx),%edx
    112e:	01 d0                	add    %edx,%eax
    1130:	c9                   	leave  
    1131:	c3                   	ret   
...

In designing a library, we often have headers and their implementation details will be put in separate translation units. Those functions have external linkage. But we don’t want to export those symbols unless they are API functions intended to be provided to the user. Under such cases, we can pass -fvisibility=hidden to the compiler and explicitly mark the symbols we want to export using __attribute__((__visibility__("default"))) (in practice, a descriptive macro can be used for this, e.g. LIBNAME_EXPORT), see more in the GCC manual. Doing so has many practical benefits, for example, less work for the dynamic linker, more efficient code can be generated since no PLT entry is needed, and for data references, there will be only one data load vs two loads if it gets exported.

$ cat bar.c
int x = 17;

int bar(int i) {
    return i + 42;
}

__attribute__((__visibility__("default")))
int foo(int i) {
    return bar(i) + x;
}
$ gcc -shared -fpic bar.c -o bar.so -fvisibility=hidden
$ objdump -d bar.so
...
00000000000010f9 <bar>:
    10f9:	f3 0f 1e fa          	endbr64 
    10fd:	55                   	push   %rbp
    10fe:	48 89 e5             	mov    %rsp,%rbp
    1101:	89 7d fc             	mov    %edi,-0x4(%rbp)
    1104:	8b 45 fc             	mov    -0x4(%rbp),%eax
    1107:	83 c0 2a             	add    $0x2a,%eax
    110a:	5d                   	pop    %rbp
    110b:	c3                   	ret    

000000000000110c <foo>:
    110c:	f3 0f 1e fa          	endbr64 
    1110:	55                   	push   %rbp
    1111:	48 89 e5             	mov    %rsp,%rbp
    1114:	48 83 ec 08          	sub    $0x8,%rsp
    1118:	89 7d fc             	mov    %edi,-0x4(%rbp)
    111b:	8b 45 fc             	mov    -0x4(%rbp),%eax
    111e:	89 c7                	mov    %eax,%edi
    1120:	e8 d4 ff ff ff       	call   10f9 <bar>
    1125:	8b 15 f5 2e 00 00    	mov    0x2ef5(%rip),%edx        # 4020 <x>
    112b:	01 d0                	add    %edx,%eax
    112d:	c9                   	leave  
    112e:	c3                   	ret
...

At the end of this section, let’s see an example of using LD_PRELOAD to override symbols by loading specific libraries before any other shared libraries. In practice, LD_PRELOAD can be used to override the malloc implementation in libc.so or to load a debug library that contains logging versions of some functions (e.g. to debug some multithreaded code by logging each thread’s activity).

$ cat foo.c
int x = 17;

int foo(int i) {
    return i + x;
}
$ cat main.c 
#include <stdio.h>

int foo(int i);

int main() {
    int r = foo(3);
    printf("%d\n", r);
    return r;
}
$ cat bar.c
int x = 17;

int bar(int i) {
    return i + 42;
}

__attribute__((__visibility__("default")))
int foo(int i) {
    return bar(i) + x;
}
$ gcc -shared -fpic foo.c -o foo.so
$ gcc -shared -fpic bar.c -o bar.so -fvisibility=hidden
$ gcc main.c ./foo.so -o main
$ ./main
20
$ LD_PRELOAD=./bar.so ./main
62

Oh, one more thing to highlight. Symbol collisions in shared libraries can catch you unexpectedly. The exported symbols in a shared library can be overridden by anything as long as they have the same name. This means an exported function/variable in a shared library can be overridden by a global variable of any type, or by a function of any signature. If the same symbol is used for different things, it’s an undefined behavior. Symbol collisions can happen between a shared library and our application code, or between two shared libraries linked by our innocent application. Therefore, it’s a good practice to limit the visibility of the symbols in a shared library as small as possible. Better performance and lower chance of symbol collision. Why not!

$ cat foo.c
#include <stdio.h>
int x = 5;

int foobar(int i) {
    printf("foobar called from foo.c, argument i=%d\n", i);
    return i + x;
}

int foo(int i) {
    return foobar(i) * 10;
}
$ cat bar.c
#include <stdio.h>

int foobar(int i, int j) {
    printf("foobar called from bar.c, arguments i=%d, j=%d\n", i, j);
    return i + j;
}

int bar(int i, int j) {
    return foobar(i, j) + 1;
}
$ cat barz.c
#include <stdio.h>

int foobar = 2;

int bar(int i, int j) {
    printf("foobar=%d\n", foobar);
    return foobar + i + j + 5;
}
$ cat main.c
#include <stdio.h>

int foo(int i);        // from foo.so
int bar(int i, int j); // from bar(z).so

int main() {
    int a = foo(3);
    int b = bar(3, 7);
    printf("%d\n", a);
    printf("%d\n", b);
    return a + b;
}
$ gcc -shared -fpic foo.c -o foo.so
$ gcc -shared -fpic bar.c -o bar.so
$ gcc -shared -fpic barz.c -o barz.so
$ gcc main.c ./foo.so ./bar.so -o main
$ ./main
foobar called from foo.c, argument i=3
foobar called from foo.c, argument i=3
80
9
$ gcc main.c ./bar.so ./foo.so -o main
$ ./main
foobar called from bar.c, arguments i=3, j=2081388488
foobar called from bar.c, arguments i=3, j=7
-660951570
11
$ gcc main.c ./foo.so ./barz.so -o main
$ ./main
foobar called from foo.c, argument i=3
foobar=-98693133
80
-98693118
$ gcc main.c ./barz.so ./foo.so -o main
$ ./main
fish: Job 1, './main' terminated by signal SIGSEGV (Address boundary error)

In the example above, our program calls two external functions from two shared libraries: foo from foo.so, and bar from bar.so or barz.so. They all call or reference foobar. Whichever the shared library is loaded first overrides the other library’s foobar symbol. From the results, we can see that the library load order seems to match the library link order. But we can manually specify the load order using LD_PRELOAD.

$ gcc main.c ./foo.so ./bar.so -o main
$ LD_PRELOAD=./foo.so:./bar.so ./main
foobar called from foo.c, argument i=3
foobar called from foo.c, argument i=3
80
9
$ LD_PRELOAD=./bar.so:./foo.so ./main
foobar called from bar.c, arguments i=3, j=1241836152
foobar called from bar.c, arguments i=3, j=7
-466540338
11
$ gcc main.c ./foo.so ./barz.so -o main
$ LD_PRELOAD=./foo.so:./barz.so ./main
foobar called from foo.c, argument i=3
foobar=-98693133
80
-98693118
$ LD_PRELOAD=./barz.so:./foo.so ./main
fish: Job 1, 'LD_PRELOAD=./barz.so:./foo.so .…' terminated by signal SIGSEGV (Address boundary error)

The point here is that if symbol collisions happened, there would be many possible ways our program might pan out due to undefined behaviors. If we only exported foo and bar, none of them would happen. If we really need to export foobar as it may be an API function, can we avoid the symbol collision on it? The answer is to pass -Bsymbolic or -Bsymbolic-functions to the program linker when creating the shared library. These options make the shared library bind references to global (function) symbols to the definition within the shared library, if any. This means calls to functions within the shared library don’t have to go through the PLT, and referenced data doesn’t have to be read from the GOT. Thus, more efficient code can be generated by the compiler. But this renders symbol overriding impossible, even for exported public symbols, as no dynamic relocation entry for the shared library would be generated. If one wants finer control, --dynamic-list can be used.

$ gcc -shared -fpic foo.c -o foo_symbolic.so -Wl,-Bsymbolic
$ readelf -r foo.so 
Relocation section '.rela.dyn' at offset 0x4b0 contains 8 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000003e08  000000000008 R_X86_64_RELATIVE                    1130
000000003e10  000000000008 R_X86_64_RELATIVE                    10f0
000000004028  000000000008 R_X86_64_RELATIVE                    4028
000000003fd8  000100000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_deregisterTM[...] + 0
000000003fe0  000600000006 R_X86_64_GLOB_DAT 0000000000004030 x + 0
000000003fe8  000300000006 R_X86_64_GLOB_DAT 0000000000000000 __gmon_start__ + 0
000000003ff0  000400000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_registerTMCl[...] + 0
000000003ff8  000500000006 R_X86_64_GLOB_DAT 0000000000000000 __cxa_finalize@GLIBC_2.2.5 + 0

Relocation section '.rela.plt' at offset 0x570 contains 2 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000004018  000800000007 R_X86_64_JUMP_SLO 0000000000001139 foobar + 0
000000004020  000200000007 R_X86_64_JUMP_SLO 0000000000000000 printf@GLIBC_2.2.5 + 0
$ readelf -r foo_symbolic.so # no relocations for x and foobar
Relocation section '.rela.dyn' at offset 0x4b0 contains 7 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000003df0  000000000008 R_X86_64_RELATIVE                    1110
000000003df8  000000000008 R_X86_64_RELATIVE                    10d0
000000004020  000000000008 R_X86_64_RELATIVE                    4020
000000003fe0  000100000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_deregisterTM[...] + 0
000000003fe8  000300000006 R_X86_64_GLOB_DAT 0000000000000000 __gmon_start__ + 0
000000003ff0  000400000006 R_X86_64_GLOB_DAT 0000000000000000 _ITM_registerTMCl[...] + 0
000000003ff8  000500000006 R_X86_64_GLOB_DAT 0000000000000000 __cxa_finalize@GLIBC_2.2.5 + 0

Relocation section '.rela.plt' at offset 0x558 contains 1 entry:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000004018  000200000007 R_X86_64_JUMP_SLO 0000000000000000 printf@GLIBC_2.2.5 + 0
$ objdump -d foo_symbolic.so
...
Disassembly of section .plt.sec:

0000000000001050 <printf@plt>:
    1050:	f3 0f 1e fa          	endbr64 
    1054:	f2 ff 25 bd 2f 00 00 	bnd jmp *0x2fbd(%rip)        # 4018 <printf@GLIBC_2.2.5>
    105b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)

Disassembly of section .text:

0000000000001060 <deregister_tm_clones>:
...
0000000000001119 <foobar>:
    1119:	f3 0f 1e fa          	endbr64 
    111d:	55                   	push   %rbp
    111e:	48 89 e5             	mov    %rsp,%rbp
    1121:	48 83 ec 10          	sub    $0x10,%rsp
    1125:	89 7d fc             	mov    %edi,-0x4(%rbp)
    1128:	8b 45 fc             	mov    -0x4(%rbp),%eax
    112b:	89 c6                	mov    %eax,%esi
    112d:	48 8d 05 cc 0e 00 00 	lea    0xecc(%rip),%rax        # 2000 <_fini+0xe88>
    1134:	48 89 c7             	mov    %rax,%rdi
    1137:	b8 00 00 00 00       	mov    $0x0,%eax
    113c:	e8 0f ff ff ff       	call   1050 <printf@plt>
    1141:	48 8d 05 e0 2e 00 00 	lea    0x2ee0(%rip),%rax        # 4028 <x>
    1148:	8b 10                	mov    (%rax),%edx
    114a:	8b 45 fc             	mov    -0x4(%rbp),%eax
    114d:	01 d0                	add    %edx,%eax
    114f:	c9                   	leave  
    1150:	c3                   	ret    

0000000000001151 <foo>:
    1151:	f3 0f 1e fa          	endbr64 
    1155:	55                   	push   %rbp
    1156:	48 89 e5             	mov    %rsp,%rbp
    1159:	48 83 ec 10          	sub    $0x10,%rsp
    115d:	89 7d fc             	mov    %edi,-0x4(%rbp)
    1160:	8b 45 fc             	mov    -0x4(%rbp),%eax
    1163:	89 c7                	mov    %eax,%edi
    1165:	e8 af ff ff ff       	call   1119 <foobar>
    116a:	89 c2                	mov    %eax,%edx
    116c:	89 d0                	mov    %edx,%eax
    116e:	c1 e0 02             	shl    $0x2,%eax
    1171:	01 d0                	add    %edx,%eax
    1173:	01 c0                	add    %eax,%eax
    1175:	c9                   	leave  
    1176:	c3                   	ret    
...
$ objdump --full-contents foo_symbolic.so
...
Contents of section .data:
 4020 20400000 00000000 05000000            @..........   
...

Some other good stuff to read on shared libraries: “How To Write Shared Libraries” by Ulrich Drepper.

Link-Time Optimization

Link-time optimization (LTO) is also called interprocedural optimization. The kind of optimization, as its name indicates, happens at the link stage when the compiler sees the whole program. LTO is done by dumping the compiler’s intermediate representation (GIMPLE in GCC, LLVM IR/bitcode in Clang) when compiling each source file (translation unit). So the resulting object files are fat as they contain extra information used for LTO. Typical LTO includes function inlining, dead code elimination, etc. Read more here or watch this awesome video. We’ll instead focus on a concrete and interesting example: trying to inline a comparison function pointer for a library routine insertion_sort in C.

// isort.h
#pragma once
#include <stddef.h>

void insertion_sort(void *ptr, size_t num, size_t size, int (*comp)(const void *, const void *));
// isort.c
#include "isort.h"
#include <stdio.h>

static void swap_item(char *a, char *b, size_t size) {
    for (size_t i = 0; i < size; ++i) {
        char tmp = a[i];
        a[i] = b[i];
        b[i] = tmp;
    }
}

void insertion_sort(void *ptr, size_t num, size_t size, int (*comp)(const void *, const void *)) {
    char *base = (char *)ptr;
    for (size_t i = 1; i < num; ++i) {
        for (size_t j = i; j > 0 && comp(base + (j-1)*size, base + j*size) > 0; --j) {
            swap_item(base + (j-1)*size, base + j*size, size);
        }
    }
}
// main.c
#include "isort.h"
#include <stdio.h>

int compare_int(const void *a, const void *b) {
    return *(int *)a - *(int *)b;
}

int main() {
    int arr[] = {7, 3, 1, 5, 2, 9, 1, 7, 8, 2, 4, 6, 5};
    size_t arr_size = sizeof(arr) / sizeof(arr[0]);

    insertion_sort(arr, arr_size, sizeof(arr[0]), compare_int);

    for (size_t i = 0; i < arr_size; ++i) {
        printf("%d ", arr[i]);
    }
    printf("\n");

    return 0;
}

This is a typical setup – the definition of the library routine insertion_sort is separate from our application code, and our application code provides a custom comparison function upon calling the library routine.

$ gcc -o main main.c isort.c -O3 
$ objdump -d main | grep call
    1014:	ff d0                	call   *%rax
    110d:	e8 0e 03 00 00       	call   1420 <insertion_sort>
    1128:	e8 63 ff ff ff       	call   1090 <__printf_chk@plt>
    1137:	e8 34 ff ff ff       	call   1070 <putchar@plt>
    1157:	e8 24 ff ff ff       	call   1080 <__stack_chk_fail@plt>
    117f:	ff 15 53 2e 00 00    	call   *0x2e53(%rip)        # 3fd8 <__libc_start_main@GLIBC_2.34>
    1222:	e8 39 fe ff ff       	call   1060 <__cxa_finalize@plt>
    1227:	e8 64 ff ff ff       	call   1190 <deregister_tm_clones>
    14e1:	ff d0                	call   *%rax
$ gcc -o main main.c isort.c -O3 -flto
$ objdump -d main | grep call
    1014:	ff d0                	call   *%rax
    1180:	e8 0b ff ff ff       	call   1090 <__printf_chk@plt>
    118f:	e8 dc fe ff ff       	call   1070 <putchar@plt>
    11af:	e8 cc fe ff ff       	call   1080 <__stack_chk_fail@plt>
    11df:	ff 15 f3 2d 00 00    	call   *0x2df3(%rip)        # 3fd8 <__libc_start_main@GLIBC_2.34>
    1282:	e8 d9 fd ff ff       	call   1060 <__cxa_finalize@plt>
    1287:	e8 64 ff ff ff       	call   11f0 <deregister_tm_clones>

From the grepped results we can tell, even with -O3 optimization level, that the comparison function is still invoked via an indirect call *%rax rather than being inlined, which is bad for performance especially when called within a tight loop. With LTO, both the insertion_sort routine and the comparison function are inlined into main. Sadly, even if we put all the stuff in a single file, the comparator is not inlined (Godbolt link). How about LTO for shared libraries and static libraries?

$ gcc -shared -fpic isort.c -O3 -flto -o libisort.so
$ gcc main.c -L. -lisort -O3 -flto -o main
$ objdump -d main | grep call
    1014:	ff d0                	call   *%rax
    112d:	e8 7e ff ff ff       	call   10b0 <insertion_sort@plt>
    1148:	e8 53 ff ff ff       	call   10a0 <__printf_chk@plt>
    1157:	e8 24 ff ff ff       	call   1080 <putchar@plt>
    1177:	e8 14 ff ff ff       	call   1090 <__stack_chk_fail@plt>
    119f:	ff 15 33 2e 00 00    	call   *0x2e33(%rip)        # 3fd8 <__libc_start_main@GLIBC_2.34>
    1242:	e8 29 fe ff ff       	call   1070 <__cxa_finalize@plt>
    1247:	e8 64 ff ff ff       	call   11b0 <deregister_tm_clones>
$ objdump -d isort.so | grep call
    1014:	ff d0                	call   *%rax
    10d2:	e8 59 ff ff ff       	call   1030 <__cxa_finalize@plt>
    10d7:	e8 64 ff ff ff       	call   1040 <deregister_tm_clones>
    11c1:	ff d0                	call   *%rax
$ gcc isort.c -c -O3 -flto
$ ar rcs libisort.a isort.o
ar: isort.o: plugin needed to handle lto object
$ gcc-ar rcs libisort.a isort.o
$ gcc main.c -L. -l:libisort.a -O3 -flto -o main
$ objdump -d main | grep call
    1014:	ff d0                	call   *%rax
    1180:	e8 0b ff ff ff       	call   1090 <__printf_chk@plt>
    118f:	e8 dc fe ff ff       	call   1070 <putchar@plt>
    11af:	e8 cc fe ff ff       	call   1080 <__stack_chk_fail@plt>
    11df:	ff 15 f3 2d 00 00    	call   *0x2df3(%rip)        # 3fd8 <__libc_start_main@GLIBC_2.34>
    1282:	e8 d9 fd ff ff       	call   1060 <__cxa_finalize@plt>
    1287:	e8 64 ff ff ff       	call   11f0 <deregister_tm_clones>

There should be no surprise. For shared library routines, since they can be overridden at runtime the compiler cannot assume the call to our library routine insertion_sort will be the one used at runtime. Although LTO doesn’t work well for application code calling library code, compiling the shared library with -flto flag may help the library itself generate more aggressive/performant code. Static libraries, however, are just archives of object files. As long as the archived object files needed for static linking contain LTO information, LTO can be performed in the same way as for object files.

You Are So Beautiful by Joe Cocker

Loop-invariant Code Hoisting

We often face such a scenario in writing code:

our_loop:
    // some code before
    if (flag):
       do_something()
    else:
       do_something_else()
    // some code after

The flag is usually a constant, for example, an argument (e.g. true/false; 0/1) we passed to this function at runtime. The functions do_something() and do_something_else() may also be some code blocks and they’re almost identical. Well, if we know the flag will definitely not change within the loop we can move the conditional statement out of the loop so that the flag will only be tested once.

if (flag):
    our_loop:
        // some code before
        do_something()
        // some code after
else:
    our_loop:
        // some code before
        do_something_else()
        // some code after

This is great, except, …, well, it really looks ugly and we’re repeating ourselves. If the code within the loop is large, it makes our binary executable a bit larger (.text section).

To be concrete, let’s consider the following C++ code that computes the rank of the current value in arr[beg, end):

int get_rank(int arr[], int beg, int end, int cur, bool desc)
{
    int rank = 1;
    for (int i = beg; i < end; ++i) {
        if (i == cur)
            continue;
        if (!desc) {
            if (arr[i] < arr[cur])
                rank++;
        }
        else {
            if (arr[i] > arr[cur])
                rank++;
        }
    }
    return rank;
}

One may want to compute the ascending rank or the descending rank of a value in an array/range. Here the order (asc/desc) is a runtime constant. However, if it’s a compile-time constant one can avoid repeating themselves by using C++17’s if constexpr:

template<bool desc>
int get_rank(int arr[], int beg, int end, int cur)
{
    int rank = 1;
    for (int i = beg; i < end; ++i) {
        if (i == cur)
            continue;
        if constexpr(!desc) {  // requires C++17 or later
            if (arr[i] < arr[cur])
                rank++;
        }
        else {
            if (arr[i] > arr[cur])
                rank++;
        }
    }
    return rank;
}

// usage:
// int rank = get_rank<false>(arr, beg, end, cur);
// or
// constexpr bool desc = false; // or some compile-time constant
// int rank = get_rank<desc>(arr, beg, end, cur);

This solution combines both efficiency and elegance (thanks to the compiler). But how about the efficiency of the non-templated function (the one with an additional parameter desc)? I wasn’t sure if the compiler would do any sort of optimization. But I do know that modern computer architectures are designed to do a great job in branch prediction. The compiler may also aid in generating branch-friendly (or even branchless) code. So, if the prediction was correct, the fitched instructions for this branch will be kept hot (remaining in the cache and busy running in the CPU pipeline). Otherwise, the fitched instructions by the wrong prediction may be discarded (evicted from the cache) and instructions for the other branch will be fetched. Also, the CPU stalls caused by the pipeline invalidation will make it worse. Luckily, here in this case, the branch condition is constant and therefore the CPU is on our side.

Next, let’s see what the compile can do for us. Compiling this function on Godbolt using gcc 12.2 with -O3 flag yields the following assembly code:

get_rank(int*, int, int, int, bool):
        movq    %rdi, %r9
        movl    %edx, %edi
        cmpl    %edx, %esi
        jge     .L9
        movslq  %ecx, %rax
        movl    $1, %edx
        leaq    (%r9,%rax,4), %r10
        movslq  %esi, %rax
        testb   %r8b, %r8b
        jne     .L8
.L5:
        cmpl    %eax, %ecx
        je      .L4
        movl    (%r10), %esi
        cmpl    %esi, (%r9,%rax,4)
        setl    %sil
        movzbl  %sil, %esi
        addl    %esi, %edx
.L4:
        addq    $1, %rax
        cmpl    %eax, %edi
        jg      .L5
        movl    %edx, %eax
        ret
.L8:
        cmpl    %eax, %ecx  ; if (i == cur)
        je      .L7
        movl    (%r10), %esi
        cmpl    %esi, (%r9,%rax,4)
        setg    %sil
        movzbl  %sil, %esi
        addl    %esi, %edx
.L7: ; for (int i = beg; i < end; ++i)
        addq    $1, %rax
        cmpl    %eax, %edi
        jg      .L8
        movl    %edx, %eax
        ret
.L9:
        movl    $1, %edx
        movl    %edx, %eax
        ret

We can see that desc is only tested once. Cheers! With -O0 flag it gives

get_rank(int*, int, int, int, bool):
        pushq   %rbp
        movq    %rsp, %rbp
        movq    %rdi, -24(%rbp)
        movl    %esi, -28(%rbp)
        movl    %edx, -32(%rbp)
        movl    %ecx, -36(%rbp)
        movl    %r8d, %eax
        movb    %al, -40(%rbp)
        movl    $1, -4(%rbp)
        movl    -28(%rbp), %eax
        movl    %eax, -8(%rbp)
        jmp     .L2
.L6:
        movl    -8(%rbp), %eax
        cmpl    -36(%rbp), %eax
        je      .L8
        movzbl  -40(%rbp), %eax
        xorl    $1, %eax
        testb   %al, %al
        je      .L5
        movl    -8(%rbp), %eax
        cltq
        leaq    0(,%rax,4), %rdx
        movq    -24(%rbp), %rax
        addq    %rdx, %rax
        movl    (%rax), %edx
        movl    -36(%rbp), %eax
        cltq
        leaq    0(,%rax,4), %rcx
        movq    -24(%rbp), %rax
        addq    %rcx, %rax
        movl    (%rax), %eax
        cmpl    %eax, %edx
        jge     .L4
        addl    $1, -4(%rbp)
        jmp     .L4
.L5:
        movl    -8(%rbp), %eax
        cltq
        leaq    0(,%rax,4), %rdx
        movq    -24(%rbp), %rax
        addq    %rdx, %rax
        movl    (%rax), %edx
        movl    -36(%rbp), %eax
        cltq
        leaq    0(,%rax,4), %rcx
        movq    -24(%rbp), %rax
        addq    %rcx, %rax
        movl    (%rax), %eax
        cmpl    %eax, %edx
        jle     .L4
        addl    $1, -4(%rbp)
        jmp     .L4
.L8:
        nop
.L4:
        addl    $1, -8(%rbp)
.L2: ; for (int i = beg; i < end; ++i)
        movl    -8(%rbp), %eax
        cmpl    -32(%rbp), %eax
        jl      .L6
        movl    -4(%rbp), %eax
        popq    %rbp
        ret

which suggests desc is tested within each loop.

Now we know that compilers can actually perform this optimization (although it’s not enabled by default), we should be relaxed and fearless to choose our seemingly inefficient but shorter & sweeter code. Cheers again ;)

Awesome Netcat

We can turn any process into a server in an incredibly simple manner using the powerful networking utility netcat. For example, we can make a shell a server:

(a computer with ip address 192.168.1.2)
$ mkfifo fifo  # create a named pipe
$ cat fifo | sh -i 2>&1 | nc -l 1234 > fifo  # server

(on another window)
$ nc localhost 1234  # client

(or if you're on the same netwok from another computer)
$ nc 192.168.1.2 1234  # client

Now let’s try to understand the above-highlighted line. Noting that a pipeline runs in parallel, cat fifo therefore outputs the content of fifo only when nc -l 1234 writes its output to the pipe fifo.

We know that when the client connects to the server via netcat, if the client types in anything, it will be output to the server; and vice versa. Hence, nc localhost 1234‘s input becomes nc -l 1234‘s output, and nc -l 1234‘s input becomes nc localhost 1234‘s output.

When the client types in something, say, the command ls, it also appears in the output of the server, which then is redirected to the pipe fifo, which in turn goes to cat fifo (read end comes the data). Then the output of cat fifo (“ls”) is redirected to as the input of sh -i 2>&1, which executes the command ls and sends the results to the server nc -l 1234 as its input, which finally as output appears in the client nc localhost 1234.

Isn’t it beautiful? Just setting a few pipes and redirections we can turn a process into a server. Pipes are really one of the greatest UNIX inventions. They make many incredible things possible.

Zero Pointer Dereference, Huh?

Consider the following code:

int main()
{
    struct s {
        int m0;
        int m1;
        int m2;
    };

    struct s *p = NULL;
    int a = (int) &((struct s *)p)->m2;  // crash ??
    int b = (int) &((struct s *)0)->m2;  // crash ??
    int c = (int) p->m2;  // definitely dead

    return 0;
}

So, do you think the above two highlighted lines will crash the program? Well, at least to me, it will, since they’re dereferencing null pointers. Take, int b = (int) &((struct s *)0)->m2;, for example, we first dereference the zero pointer to get the member m2, and then obtain its address. Right? This is how we literally read &p_struct->member.

However, this is not how the compiler interprets it. The compiler is a very cool and knowledgeable guy who is assumed to know everything. So the compiler knows the offset of each member in a structure, and he says, “Well, why do I even care for the values of those members (dereferencing)? To obtain the address of a member in a structure, I just need to add the offset of the member to the address of the structure: &p_struct->member := p_struct + offset(member).”

With such ability, the compiler now can generate assembly code

6       {
   0x0000000000001129 <+0>:     endbr64
   0x000000000000112d <+4>:     push   %rbp
   0x000000000000112e <+5>:     mov    %rsp,%rbp

7           struct s {
8               int m0;
9               int m1;
10              int m2;
11          };
12
13          struct s *p = NULL;
   0x0000000000001131 <+8>:     movq   $0x0,-0x8(%rbp)

14          int a = (int) &((struct s *)p)->m2;  // okay
   0x0000000000001139 <+16>:    mov    -0x8(%rbp),%rax
   0x000000000000113d <+20>:    add    $0x8,%rax
   0x0000000000001141 <+24>:    mov    %eax,-0x14(%rbp)

15          int b = (int) &((struct s *)0)->m2;  // okay
   0x0000000000001144 <+27>:    movl   $0x8,-0x10(%rbp)

16          int c = (int) p->m2;  // dead
   0x000000000000114b <+34>:    mov    -0x8(%rbp),%rax
   0x000000000000114f <+38>:    mov    0x8(%rax),%eax # segfault, since it tries to access invalid memory address 0x8 (page 0)
   0x0000000000001152 <+41>:    mov    %eax,-0xc(%rbp)

17
18          return 0;
   0x0000000000001155 <+44>:    mov    $0x0,%eax

19      }
   0x000000000000115a <+49>:    pop    %rbp
   0x000000000000115b <+50>:    ret

We can see that there are no dereferences whatsoever when we try to get the address of a member in a structure, just the addition of the member offset to the address of the structure.

Therefore, as a special case with the structure address being zero, &((TYPE *)0)->MEMBER yields the exact offset of the member in a structure, which accounts for why the macro

#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)

works in the Linux kernel.

An Amazing Mathemusician

Recently I found a beautiful, funny, intelligent woman, Vi Hart, a YouTuber who creates entertaining, thought-provoking mathematics and music videos that explain mathematical concepts through doodles. It is an absolutely delightful and enlightening experience to watch her doodling, drawing those beautiful, adorable mathematical geometric shapes. Besides, she explains those mathematical concepts in a singing tone, which is cheerful and amusing to listen to. I think she did a fantastic job in spreading the idea that mathematics can be, and yes, is fun & beautiful in such exciting and entertaining ways. That is also what I always want to make the general audience feel about math.

Awesome Swiss Tables

The basic idea is open addressing, but without any redirections (thus making Swiss tables very memory efficient). Also, Swiss tables store a densely packed array of metadata with each entry consisting of 1 byte (which makes the metadata table easier to fit into the CPU cache, as opposed to, say, metadata of pointers of 8 bytes), sort of like UNIX file system inodes and bit maps (see OSTEP: Chapter 40 File System Implementation). Thus, we can use this compact metadata to determine if an entry in the table is present, empty, or deleted by using the single control bit and the 7-bit H2 hash (see the details at https://abseil.io/about/design/swisstables).

And there’s an excellent video about it on YouTube.

Finally, let’s see the performance differences on the counting words benchmark.

Founding Our Awesome Kick-ass Company

Today is a special occasion (& might be a historic moment) that we might appreciate a few years later: I, Yichuan Wang, & Robin Qiu just founded our own company. We were roomies in college, and they’re brilliant & awesome guys, and I firmly believe in a few years we will create a very bright future, with our hands, our passion, our genius minds, and our entrepreneurial spirits. I am just feeling passionate all over my body, my blood, my cells, just like Richard Hendricks. So, work hard, keep moving, and be a badass :) Notes on April 4, 2022.

Monty Hall Problem

This problem is actually a little tricky and counterintuitive, but really fun to think about (as a brain teaser).

Here is the problem (I’ll just copy it from Wikipedia, lazy evil being :)

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?

At first sight, it appears to be a fifty-fifty chance to win without switching, as there’re only two doors left, and behind one there’s the car. Most people do think so (me too). But there is a piece of important information that people may not notice, the host knows what’s behind each door. On your choosing (a 1/3 probability of the car and a 2/3 probability of a goat), he purposely opens a door that has a goat. Actually, when you choose a door, you only have a 1/3 chance to win the car. The car behind one of the other two doors has a 2/3 chance. This won’t change with or without the host telling ya one of the other two has a goat. If the host opens a door with a goat, that means the other door left has a 2/3 chance to have the car. Thank you for that extra 33.3%!!

Mathematically speaking, we denote Ga, Gb the two goats. When choosing one door, say door 1, it has a 1/3 chance that the car is right behind it. But more likely (a 2/3 probability) it’s a goat behind it. It can be Ga or Gb. When it’s Ga, the host opens one of the other two doors behind which Gb is. When it’s Gb, the host opens the door where the one remaining goat Ga is. If now you decide to switch, then you get the car. There’s a 2-in-3 chance of happening (on choosing Ga or Gb, the car may be behind one of the other 2 doors, thus we have 2 x 2 = 4 cases, divided by the total number of permutations which is P(3, 3) = 6 cases). But if you were lucky and picked the door with the car (with a 1/3 probability) at the very beginning, and then decide to switch to the other remaining door, you lose. But there’s just a 1-in-3 chance (1 x 2 = 2 divided by P(3, 3) = 6). So, statistically, if you continue to play this game for, like 300 times, you’ll probably win about 200 times and lose only about 100 times. Overall, the odds are in your favor if you do the switch. (I also found a nice picture on Wikipedia, so again, I’ll shamelessly steal it)

But if there’re, say 100 doors, and only behind one of them has a car. And you’ve just picked your lucky number, door No. 13, but still, you are pretty much damn sure about it that the chance you win will be very very slim (1/100). Now if the host opens the other 98 doors, all with goats (why the hell would he do so??), and let you decide if you will switch your choice. Ha, this is a no-brainer, of course, I’ll f*$#ing do it, just switch!

Anyway, the most important thing is the host has the information (the power you want :) If he doesn’t know where the car is and randomly opens a door you didn’t pick, then you’ll get a 50% chance of winning. But that’s not how the game plays, he may open a door with the car (Ooops)! Also, it’s different from that first let the host opens a door with a goat and then let you choose one (you just have two choices then), it’s really a 50% win.

Interestingly and historically, this problem was made popular in a letter from reader Craig F. Whitaker, quoted in Marilyn vos Savant‘s “Ask Marilyn” column. There’s a great youtube video about it, see https://www.youtube.com/watch?v=dOQowCeAnRs&t=19s

Recommended watching & reading:

https://www.youtube.com/watch?v=4Lb-6rxZxx0

https://www.youtube.com/watch?v=7u6kFlWZOWg

https://en.wikipedia.org/wiki/Monty_Hall_problem