Awesome Swiss Tables

The basic idea is open addressing, but without any redirections (thus making swiss tables very memory efficient). Also, swiss tables store a densely packed array of metadata, sort of like UNIX file system inodes and bit maps (see OSTEP: Chapter 40 File System Implementation). Thus, we can use the metadata to tell if an entry in the table is present or not (see the details at

And there’s an excellent video about it on youtube.

Finally, let’s see the performance differences on the counting words benchmark.

Founding Our Awesome Kick-ass Company

Today is a special occasion (& might be a historic moment) that we might appreciate a few years later: I, Yichuan Wang, & Robin Qiu just founded our own company. We were roomies in college, and they’re brilliant & awesome guys, and I firmly believe in a few years we will create a very bright future, with our hands, our passion, our genius minds, and our entrepreneurial spirits. I am just feeling passionate all over my body, my blood, my cells, just like Richard Hendricks. So, work hard, keep moving, and be a badass :) Notes on April 4, 2022.

Monty Hall Problem

This is actually a tricky problem, but really fun (a brain teaser).

Here is the problem (I’ll just copy it from Wikipedia, lazy evil being :)

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?

At first sight, it appears to be a fifty-fifty chance to win without switching, as there’re only two doors left, and behind one there’s the car. Most people do think so (me too). But there is a piece of important information that people may not notice, the host knows what’s behind each door. On your choosing (a 1/3 probability of the car and a 2/3 probability of a goat), he purposely opens a door that has a goat. Actually, when you choose a door, you only have a 1/3 chance to win the car. The car is behind one of the other two doors has a 2/3 chance. This won’t change with or without the host telling ya one of the other two has a goat. If the host opens a door with a goat, that means the other door left has a 2/3 chance to have the car. Thank you for that extra 33.3%!!

Mathematically speaking, we denote Ga, Gb the two goats. When choosing one door, say door 1, it has a 1/3 chance that the car is right behind it. But more likely (a 2/3 probability) it’s a goat behind it. It can be Ga or Gb. When it’s Ga, the host opens one of the other two doors which Gb is behind. When it’s Gb, the host opens the door where the one remaining goat Ga is. If now you decide to switch, then you get the car. There’s a 2-in-3 chance of happening (2 x 2 = 4 total cases). But if you were lucky and picked the door with the car (with a 1/3 probability) at the very beginning, and then decide to switch to the other remaining door, you lose. But there’s just a 1-in-3 chance (1 x 2 = 2 total cases). So, statistically, if you continue to play this game for, like 300 times, you’ll probably win about 200 times and lose only about 100 times. Overall, the odds are in your favor if you do the switch. (I also found a nice picture on Wikipedia, so again, I’ll shamelessly steal it)

But if there’re, say 100 doors, and only behind one of them has a car. And you’ve just picked your lucky number, door No. 13, but still, you are pretty much damn sure about it the chance you win will be very very slim (1/100). Now if the host opens the other 98 doors, all with goats (why the hell would he do so??), and let you decide if you will switch your choice. Ha, this is a no-brainer, of course, I’ll f*$#ing do it, just switch!

Anyway, the most important thing is the host has the information (the power you want :) If he doesn’t know where the car is and randomly opens a door you didn’t pick, then you’ll get a 50% chance of winning. But that’s not how the game plays, he may open a door with the car (Ooops)! Also, it’s different from that first let the host opens a door with a goat and then let you choose one (you just have two choices then), it’s really a 50% win.

Interestingly and historically, this problem was made popular in a letter from reader Craig F. Whitaker, quoted in Marilyn vos Savant‘s “Ask Marilyn” column. There’s a great youtube video about it, see

Recommended watching & reading:

A Wet Dream You’ll Never Wanna Wake Up From

I just couldn’t believe it, … seemed after I told Ania how desperately, madly I’ve fallen love with her, we gazed into each other’s eyes, and…, and the most amazing, exciting thing that I’ve always dreamed of happened… WOW, she kissed me passionately, it was a big sloppy kiss, a minutes-long tongue kiss, a hell of a kiss & my first kiss (I know it’s lame ;) I could feel her sweet tongue screwing mine, the best experience I could ever dream about from a romantic movie seeing the male protagonist kissing the female protagonist, but even much better. Then I kissed her back, with all the strength I could have, the wonderful exciting feelings can never be spoken…

What’s next was a lot of raw, animal, sexual touches – Yup, we made love. Making love with the one you’ve always dreamed of, the one you’ve known, loved & been crazy about for many years is a sheer thing I could die for.

This is such a beautiful dream and I had a thought & strong (VERY VERY intense) feeling to make love to her for the rest of my life, then I woke up. FUCK! I found a fact that the more you don’t wanna wake up from an amazing dream, the more likely you’ll, and be very soon, ’cause you’re just sooooooooooooo fucking excited ;) Damn!

The Bizarreness of Optimization

Background: consider the hashtable with closed addressing/separate chaining:

Typically, we can have two different schemes to link through all the nodes. One is to link the last node of the current bucket to the first node of the next non-empty bucket. See the above picture and imagine taht there are links among these buckets. Another is to make all the nodes look exactly the same as the above picture (by which I mean to link the tail nodes with null pointers).

The second approach will have some more work to obtain the successor of the current node. It may need to scan a lot of (empty) buckets to get its successor. Sounds awful, huh? Actually not. By amortized analysis (assume the max_load_factor==1 ), ideally, we will have each node in a (different) bucket exactly. So it can be efficient.

How about the first one? The bad news is that it can suffer from very terrible insertion overhead. Now imagine you have a very big hashtable, tab[i] and tab[j] are two adjacent (non-empty) buckets. The horrible thing is that you have a very big difference |i-j|, and then an item whose hash code k is between i and j is needed to be inserted there. We will need to scan all the way down from k to j in order to update the links of the three nodes: the last node of bucket i, the newly inserted node, the first node of bucket j (you can think of this process as the insertion of a doubly-linked list). And if we constantly perform this pathological operation, j-1, i+1, j-2, i+2,..., things can get really bad, bad, bad, bad.

The Problem: Now if I ask you a simple question, after building a hashtable, we iterate through all the elements, which do you think is faster, the all-linked-up one (1st case) or the only-in-bucket-linked one (2nd case)?

Here is a piece of test code and the whole code can be found here.

int main()
#if   defined(USE_HASHMAP)
    mySymbolTable::HashMap<int, int> mp{};
#elif defined(USE_HASHMAP2)
    mySymbolTable::alternative::HashMap<int, int> mp{};
    std::unordered_map<int, int> mp{};

    constexpr int N = 1'000'000;
    int k = 0;

    for(int i=0; i < N; ++i)
        mp[N-i] = i;

    auto t1 = std::clock();
    for(auto it=mp.begin(); it!=mp.end(); ++it)
    auto t2= std::clock();

    auto iteration_time_elapsed = (t2 - t1) / (double)1'000;
    std::cout << "Iterating through " << k << " items used "
              << iteration_time_elapsed << "ms\n";
    return 0;

Essentially, they run the following code respectively

_self& operator++() {
    _ptr = _ptr->_next;
    return *this;


_self& operator++() {
    auto next = _table->next_entry(_ptr, _bucket_index);
    _ptr = next.first;
    _bucket_index = next.second;
    return *this;

std::pair<node_ptr, size_t> next_entry(node_ptr x, size_t bucket_index) const {
    assert(x != nullptr && "cannot increment end() iterator");
    if (x->_next) return { x->_next, bucket_index };
    size_t i = bucket_index + 1;
    for (; i < _hashtable.size(); ++i) {
        if (_hashtable[i]) return { _hashtable[i], i };
    return { nullptr, i };

So, which one? (I think it should the first one, any rational human beings would make this choice, I suppose. Don’t believe it??)

Now have a look at their assembly code

The bizarre thing is that in debug mode (-g flag), the 1st case beats the 2nd by about one fifth (t1 = 82% * t2 ). While if optimization is enabled (-O3 flag), it’s a whole different story. The summary table is given below.

Here, _d suffix means debug version, *1 suffix means myst::HashMap (1st case), *2 suffix means myst::alternative::HashMap (2nd case), and no suffix suggests we are using std::unordered_map.

We can see from the results that optimization -O3 barely boosts the performance of the 1st case (std::unordered_map and myst::HashMap) except the test_int cases, but rather make a significant boost in the 2nd case (myst::alternative::HashMap). Given the 2nd is a little bit more complicated, it is therefore expected there will be more space for the 2nd case to optimize, for example, inlining. Still, I don’t think it can beat the 1st small code. How do we explain this weird phenomenon?

(In my understanding, small code means fast, think about cache, and technically speaking, 2nd case should never be faster than the 1st one.)

How To Implement A Real Emplace Routine

Above is a piece of code that attempts to implement the emplace family routines. One can insert something like const value_type& (value_type&&) val or variadic arguments Args&&... args, both of which then can be handled by the insert_leaf_at subroutine

Of course, this simple implementation is flawed as it requires an extra move operation. In order to obtain the key from Args&&... args, we need to somehow get the value itself which can be a value (set) or a pair of <key, val> values (map). This is done by using the compile-time template bool constant/parameter IsMap. If we must construct the actual object within the method insert_at_leaf, then, well, an extra move operation is inevitable.

The solution to the question is we construct the node in-place, thus the name empalce ;) We first construct the node in the heap using the variadic arguments node_ptr z = new_node(/*parent=*/nullptr, std::forward<Args>(args)...); , then get_key from the node by reference. And what is left is we walk through the tree and find the right position to insert the node and reset the parent link. Tada🎉, there you are. This makes even more sense if we look at the C++ specification where node_type and node_handle were introduced in C++17. Their presence makes implementing emplace more easily and naturally.

But the C++ folks are soooo ambitious and if the original one has so many combined benefits (simplicity, flexibility, versatility, etc.) and actually works pretty well (given the overhead of an extra move operation is negligible), we can just keep the way it is :)

Ref: A C++17 compliant AVL map container.

A Running Example

#include <stdio.h>
int add(int a, int b)
    int c = a+b;
    printf("@add(): &a=%p, &b=%p\n", &a, &b);
    return c;

int main()
    int i = 3;
    int j = 4;
    int k = add(i,j);
    printf("@main(): &i=%p, &j=%p, &k=%p\n", &i, &j, &k);
    return k;

I made a test in Dev-C++ (version 5.7.1, with MinGW GCC 4.8.1 32-bit), and in debug mode, via the CPU Window it produced the following instructions:

procedure main:

   0x004016e0 <+0>:	push   %ebp
   0x004016e1 <+1>:	mov    %esp,%ebp
   0x004016e3 <+3>:	and    $0xfffffff0,%esp
   0x004016e6 <+6>:	sub    $0x20,%esp
   0x004016e9 <+9>:	call   0x401cc0 <__main>
=> 0x004016ee <+14>:	movl   $0x3,0x1c(%esp)
   0x004016f6 <+22>:	movl   $0x4,0x18(%esp)
   0x004016fe <+30>:	mov    0x18(%esp),%edx
   0x00401702 <+34>:	mov    0x1c(%esp),%eax
   0x00401706 <+38>:	mov    %edx,0x4(%esp)
   0x0040170a <+42>:	mov    %eax,(%esp)
   0x0040170d <+45>:	call   0x4016b0 <add>
   0x00401712 <+50>:	mov    %eax,0x14(%esp)
   0x00401716 <+54>:	lea    0x14(%esp),%eax
   0x0040171a <+58>:	mov    %eax,0xc(%esp)
   0x0040171e <+62>:	lea    0x18(%esp),%eax
   0x00401722 <+66>:	mov    %eax,0x8(%esp)
   0x00401726 <+70>:	lea    0x1c(%esp),%eax
   0x0040172a <+74>:	mov    %eax,0x4(%esp)
   0x0040172e <+78>:	movl   $0x40507a,(%esp)
   0x00401735 <+85>:	call   0x4036c8 <printf>
   0x0040173a <+90>:	mov    0x14(%esp),%eax
   0x0040173e <+94>:	leave  
   0x0040173f <+95>:	ret    

subroutine add:

   0x004016b0 <+0>:	push   %ebp
   0x004016b1 <+1>:	mov    %esp,%ebp
   0x004016b3 <+3>:	sub    $0x28,%esp
   0x004016b6 <+6>:	mov    0x8(%ebp),%edx
   0x004016b9 <+9>:	mov    0xc(%ebp),%eax
   0x004016bc <+12>:	add    %edx,%eax
   0x004016be <+14>:	mov    %eax,-0xc(%ebp)
   0x004016c1 <+17>:	lea    0xc(%ebp),%eax
   0x004016c4 <+20>:	mov    %eax,0x8(%esp)
   0x004016c8 <+24>:	lea    0x8(%ebp),%eax
   0x004016cb <+27>:	mov    %eax,0x4(%esp)
   0x004016cf <+31>:	movl   $0x405064,(%esp)
   0x004016d6 <+38>:	call   0x4036c8 <printf>
   0x004016db <+43>:	mov    -0xc(%ebp),%eax
   0x004016de <+46>:	leave  
   0x004016df <+47>:	ret    

Following these instructions, I drew a simple diagram (partial, incomplete):

The arguments for printf were left out in this diagram for simplicity, but it is not difficult to imagine the whole running process with the aid of the above instructions. The calling convention, however, can be different from another compiler. For example, on my Ubuntu 20.04.1 LTS server with GCC 9.3.0, it yields something like

@add(): &a=0x7ffdf0f69a3c, &b=0x7ffdf0f69a38
@main(): &i=0x7ffdf0f69a6c, &j=0x7ffdf0f69a70, &k=0x7ffdf0f69a74

which is completely the opposite of the above diagram: the passed arguments were arranged in decreasing address, and the local vars were placed in increasing address. For local variables, the compiler knows in advance how much space should be allocated in the stack by looking at its symbol table, and then allocates enough space by subtracting some value from the RSP register. Thus, it looks like the compiler can place the local vars wherever it likes, it just needs to perform some arithmetics based on the RSP or RBP register. A good explanation for the locations of the passed arguments is, as opposed to storing them on the caller stack, it first copies those arguments to registers (if possible) in reverse order and then saves them to the callee stack near the beginning in argument order (EDI, ESI, EDX, ECX, R8, R9). In this way, the value in the EDI register (1st argument) is first placed into the callee stack and hence it has a higher address. Aha! Now it makes perfect sense! Next, let’s check it out by looking at the disassembly by running gdb a.out -batch -ex 'disassemble/s main' (or add).

Dump of assembler code for function main:
11	{
   0x00000000000011a7 <+0>:	endbr64 
   0x00000000000011ab <+4>:	push   %rbp
   0x00000000000011ac <+5>:	mov    %rsp,%rbp
   0x00000000000011af <+8>:	sub    $0x20,%rsp
   0x00000000000011b3 <+12>:	mov    %fs:0x28,%rax
   0x00000000000011bc <+21>:	mov    %rax,-0x8(%rbp)
   0x00000000000011c0 <+25>:	xor    %eax,%eax

12	    int i = 3;
   0x00000000000011c2 <+27>:	movl   $0x3,-0x14(%rbp)

13	    int j = 4;
   0x00000000000011c9 <+34>:	movl   $0x4,-0x10(%rbp)

14	    int k = add(i,j);
   0x00000000000011d0 <+41>:	mov    -0x10(%rbp),%edx
   0x00000000000011d3 <+44>:	mov    -0x14(%rbp),%eax
   0x00000000000011d6 <+47>:	mov    %edx,%esi
   0x00000000000011d8 <+49>:	mov    %eax,%edi
   0x00000000000011da <+51>:	callq  0x1169 <add>
   0x00000000000011df <+56>:	mov    %eax,-0xc(%rbp)

15	    printf("@main(): &i=%p, &j=%p, &k=%p\n", &i, &j, &k);
   0x00000000000011e2 <+59>:	lea    -0xc(%rbp),%rcx
   0x00000000000011e6 <+63>:	lea    -0x10(%rbp),%rdx
   0x00000000000011ea <+67>:	lea    -0x14(%rbp),%rax
   0x00000000000011ee <+71>:	mov    %rax,%rsi
   0x00000000000011f1 <+74>:	lea    0xe22(%rip),%rdi        # 0x201a
   0x00000000000011f8 <+81>:	mov    $0x0,%eax
   0x00000000000011fd <+86>:	callq  0x1070 <printf@plt>

16	    return k;
   0x0000000000001202 <+91>:	mov    -0xc(%rbp),%eax

17	}
   0x0000000000001205 <+94>:	mov    -0x8(%rbp),%rsi
   0x0000000000001209 <+98>:	xor    %fs:0x28,%rsi
   0x0000000000001212 <+107>:	je     0x1219 <main+114>
   0x0000000000001214 <+109>:	callq  0x1060 <__stack_chk_fail@plt>
   0x0000000000001219 <+114>:	leaveq 
   0x000000000000121a <+115>:	retq   
End of assembler dump.
Dump of assembler code for function add:
4	{
   0x0000000000001169 <+0>:	endbr64 
   0x000000000000116d <+4>:	push   %rbp
   0x000000000000116e <+5>:	mov    %rsp,%rbp
   0x0000000000001171 <+8>:	sub    $0x20,%rsp
   0x0000000000001175 <+12>:	mov    %edi,-0x14(%rbp)
   0x0000000000001178 <+15>:	mov    %esi,-0x18(%rbp)

5	    int c = a+b;
   0x000000000000117b <+18>:	mov    -0x14(%rbp),%edx
   0x000000000000117e <+21>:	mov    -0x18(%rbp),%eax
   0x0000000000001181 <+24>:	add    %edx,%eax
   0x0000000000001183 <+26>:	mov    %eax,-0x4(%rbp)

6	    printf("@add(): &a=%p, &b=%p\n", &a, &b);
   0x0000000000001186 <+29>:	lea    -0x18(%rbp),%rdx
   0x000000000000118a <+33>:	lea    -0x14(%rbp),%rax
   0x000000000000118e <+37>:	mov    %rax,%rsi
   0x0000000000001191 <+40>:	lea    0xe6c(%rip),%rdi        # 0x2004
   0x0000000000001198 <+47>:	mov    $0x0,%eax
   0x000000000000119d <+52>:	callq  0x1070 <printf@plt>

7	    return c;
   0x00000000000011a2 <+57>:	mov    -0x4(%rbp),%eax

8	}
   0x00000000000011a5 <+60>:	leaveq 
   0x00000000000011a6 <+61>:	retq   
End of assembler dump.

The above assembler code agrees with our speculation. Note, however, the location of the local variable c here. It is above the two arguments in the stack (just one position below the RBP register). If we rewrite the main function as int main(int argc, char** argv), then we can get something like

   0x00000000000011a7 <+0>:	endbr64 
   0x00000000000011ab <+4>:	push   %rbp
   0x00000000000011ac <+5>:	mov    %rsp,%rbp
   0x00000000000011af <+8>:	sub    $0x30,%rsp
   0x00000000000011b3 <+12>:	mov    %edi,-0x24(%rbp)
   0x00000000000011b6 <+15>:	mov    %rsi,-0x30(%rbp)
   0x00000000000011ba <+19>:	mov    %fs:0x28,%rax
   0x00000000000011c3 <+28>:	mov    %rax,-0x8(%rbp)
   0x00000000000011c7 <+32>:	xor    %eax,%eax

in the setup of the stack of main. We can see that the two passed arguments argc and argv are put at the stack top.