Incredible time on Oct 4, 2017
Incredible time on Oct 2, 2020
We often face the following scenario when writing code:
our_loop:
    // some code before
    if (flag):
        do_something()
    else:
        do_something_else()
    // some code after
The flag is usually a constant, for example an argument (e.g. true/false, 0/1) we passed to this function at runtime. The functions do_something() and do_something_else() may also be larger code blocks that are almost identical. Well, if we know the flag definitely won't change within the loop, we can move the conditional statement out of the loop so that the flag is only tested once:
if (flag):
    our_loop:
        // some code before
        do_something()
        // some code after
else:
    our_loop:
        // some code before
        do_something_else()
        // some code after
This is great, except... well, it really looks ugly and we're repeating ourselves. And if the code within the loop is large, it makes our binary executable a bit larger (a bigger .text section).
To be concrete, let's consider the following C++ code that computes the rank of the value arr[cur] within arr[beg, end):
int get_rank(int arr[], int beg, int end, int cur, bool desc) {
    int rank = 1;
    for (int i = beg; i < end; ++i) {
        if (i == cur) continue;
        if (!desc) {
            if (arr[i] < arr[cur]) rank++;
        } else {
            if (arr[i] > arr[cur]) rank++;
        }
    }
    return rank;
}
One may want to compute the ascending rank or the descending rank of a value in an array/range. Here the order (asc/desc) is a runtime constant. However, if it were a compile-time constant, one could avoid the repetition by using C++17's if constexpr:
template<bool desc>
int get_rank(int arr[], int beg, int end, int cur) {
    int rank = 1;
    for (int i = beg; i < end; ++i) {
        if (i == cur) continue;
        if constexpr (!desc) {  // requires C++17 or later
            if (arr[i] < arr[cur]) rank++;
        } else {
            if (arr[i] > arr[cur]) rank++;
        }
    }
    return rank;
}

// usage:
//     int rank = get_rank<false>(arr, beg, end, cur);
// or
//     constexpr bool desc = false;  // or some compile-time constant
//     int rank = get_rank<desc>(arr, beg, end, cur);
This solution combines both efficiency and elegance (thanks to the compiler). But how about the efficiency of the non-templated function (the one with the additional parameter desc)? I wasn't sure whether the compiler would do any sort of optimization there. But I do know that modern computer architectures are designed to do a great job at branch prediction, and the compiler may also help by generating branch-friendly (or even branchless) code. So, if a prediction is correct, the fetched instructions for that branch are kept hot: they stay in the cache and keep running through the CPU pipeline. Otherwise, the instructions fetched under the wrong prediction are discarded (evicted from the cache) and the instructions for the other branch must be fetched; the CPU stalls caused by the pipeline invalidation make it even worse. Luckily, in this case the branch condition is constant, so the CPU is on our side.
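As a side note, the cost of misprediction is easy to feel with a classic toy experiment. The sketch below is my own (not part of the rank example): it sums the elements that are >= 128, first over shuffled data (an unpredictable branch) and then over the same data sorted (a predictable branch). When the compiler keeps the branch (e.g. at -O1; at -O3 it may emit branchless or vectorized code and the gap disappears), the sorted run is typically several times faster.

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Sum only the "big" elements; the if below is the branch under test.
long long sum_big(const std::vector<int>& v) {
    long long s = 0;
    for (int x : v)
        if (x >= 128) s += x;  // taken ~50% of the time on random bytes
    return s;
}

int main() {
    std::vector<int> v(1 << 22);
    std::mt19937 rng(0);
    std::uniform_int_distribution<int> dist(0, 255);
    for (int& x : v) x = dist(rng);

    auto time_it = [&](const char* tag) {
        auto t0 = std::chrono::steady_clock::now();
        long long s = sum_big(v);
        auto t1 = std::chrono::steady_clock::now();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("%-28s sum=%lld, %lld ms\n", tag, s, (long long)ms);
    };

    time_it("shuffled (hard to predict):");
    std::sort(v.begin(), v.end());  // same data, but now the branch is predictable
    time_it("sorted (easy to predict):");
}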
Next, let's see what the compiler can do for us. Compiling this function in godbolt using gcc 12.2 with the -O3 flag yields the following assembly code:
get_rank(int*, int, int, int, bool):
        movq    %rdi, %r9
        movl    %edx, %edi
        cmpl    %edx, %esi
        jge     .L9
        movslq  %ecx, %rax
        movl    $1, %edx
        leaq    (%r9,%rax,4), %r10
        movslq  %esi, %rax
        testb   %r8b, %r8b
        jne     .L8
.L5:
        cmpl    %eax, %ecx
        je      .L4
        movl    (%r10), %esi
        cmpl    %esi, (%r9,%rax,4)
        setl    %sil
        movzbl  %sil, %esi
        addl    %esi, %edx
.L4:
        addq    $1, %rax
        cmpl    %eax, %edi
        jg      .L5
        movl    %edx, %eax
        ret
.L8:
        cmpl    %eax, %ecx              ; if (i == cur)
        je      .L7
        movl    (%r10), %esi
        cmpl    %esi, (%r9,%rax,4)
        setg    %sil
        movzbl  %sil, %esi
        addl    %esi, %edx
.L7:                                    ; for (int i = beg; i < end; ++i)
        addq    $1, %rax
        cmpl    %eax, %edi
        jg      .L8
        movl    %edx, %eax
        ret
.L9:
        movl    $1, %edx
        movl    %edx, %eax
        ret
We can see that desc is only tested once. Cheers! With the -O0 flag, it gives:
get_rank(int*, int, int, int, bool):
        pushq   %rbp
        movq    %rsp, %rbp
        movq    %rdi, -24(%rbp)
        movl    %esi, -28(%rbp)
        movl    %edx, -32(%rbp)
        movl    %ecx, -36(%rbp)
        movl    %r8d, %eax
        movb    %al, -40(%rbp)
        movl    $1, -4(%rbp)
        movl    -28(%rbp), %eax
        movl    %eax, -8(%rbp)
        jmp     .L2
.L6:
        movl    -8(%rbp), %eax
        cmpl    -36(%rbp), %eax
        je      .L8
        movzbl  -40(%rbp), %eax
        xorl    $1, %eax
        testb   %al, %al
        je      .L5
        movl    -8(%rbp), %eax
        cltq
        leaq    0(,%rax,4), %rdx
        movq    -24(%rbp), %rax
        addq    %rdx, %rax
        movl    (%rax), %edx
        movl    -36(%rbp), %eax
        cltq
        leaq    0(,%rax,4), %rcx
        movq    -24(%rbp), %rax
        addq    %rcx, %rax
        movl    (%rax), %eax
        cmpl    %eax, %edx
        jge     .L4
        addl    $1, -4(%rbp)
        jmp     .L4
.L5:
        movl    -8(%rbp), %eax
        cltq
        leaq    0(,%rax,4), %rdx
        movq    -24(%rbp), %rax
        addq    %rdx, %rax
        movl    (%rax), %edx
        movl    -36(%rbp), %eax
        cltq
        leaq    0(,%rax,4), %rcx
        movq    -24(%rbp), %rax
        addq    %rcx, %rax
        movl    (%rax), %eax
        cmpl    %eax, %edx
        jle     .L4
        addl    $1, -4(%rbp)
        jmp     .L4
.L8:
        nop
.L4:
        addl    $1, -8(%rbp)
.L2:                                    ; for (int i = beg; i < end; ++i)
        movl    -8(%rbp), %eax
        cmpl    -32(%rbp), %eax
        jl      .L6
        movl    -4(%rbp), %eax
        popq    %rbp
        ret
which shows that desc is tested in every iteration of the loop.
Now that we know compilers can actually perform this optimization (although it's not enabled by default, i.e., at -O0), we can relax and fearlessly choose our seemingly inefficient but shorter & sweeter code. Cheers again ;)
See also https://en.wikipedia.org/wiki/Loop-invariant_code_motion, https://en.wikipedia.org/wiki/Loop_unswitching, http://www.cs.cmu.edu/afs/cs/academic/class/15745-s06/web/handouts/06.pdf.
We can turn any process into a server in an incredibly simple manner using the powerful networking utility netcat. For example, we can turn a shell into a server:
(a computer with ip address 192.168.1.2)
$ mkfifo fifo                                  # create a named pipe
$ cat fifo | sh -i 2>&1 | nc -l 1234 > fifo    # server

(on another window)
$ nc localhost 1234                            # client

(or from another computer on the same network)
$ nc 192.168.1.2 1234                          # client
Now let's try to understand the highlighted line above. Note that the commands in a pipeline run in parallel; cat fifo therefore outputs the content of fifo only when nc -l 1234 writes its output into the pipe fifo.
We know that when the client connects to the server via netcat, anything the client types is output on the server side, and vice versa. Hence, nc localhost 1234's input becomes nc -l 1234's output, and nc -l 1234's input becomes nc localhost 1234's output.
When the client types in something, say the command ls, it appears in the output of the server's nc -l 1234, which is then redirected into the pipe fifo, and in turn arrives at cat fifo (the data comes out of the read end). The output of cat fifo ("ls") is then piped in as the input of sh -i 2>&1, which executes the command and pipes the result into the server nc -l 1234 as its input, which finally appears as the output of the client nc localhost 1234.
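Putting it all together, the round trip of a single command looks like this (my own sketch of the data flow; nc -l 1234 on both rows is the same server process):

"ls"   -->  nc 192.168.1.2 1234  --network-->  nc -l 1234  --"> fifo"-->  fifo
                                                                            |
                                                                        cat fifo
                                                                            |
                                                                          (pipe)
                                                                            v
output <--  nc 192.168.1.2 1234  <--network--  nc -l 1234  <--(pipe)--  sh -i 2>&1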
Isn't it beautiful? By just setting up a few pipes and redirections, we can turn a process into a server. Pipes are really one of the greatest UNIX inventions; they make many incredible things possible.
See also: using netcat as a proxy.
Consider the following code:
int main() {
    struct s {
        int m0;
        int m1;
        int m2;
    };
    struct s *p = NULL;
    int a = (int) &((struct s *)p)->m2;  // crash ??
    int b = (int) &((struct s *)0)->m2;  // crash ??
    int c = (int) p->m2;                 // definitely dead
    return 0;
}
So, do you think the two highlighted lines above will crash the program? Well, at least to me, they should, since they're dereferencing null pointers. Take int b = (int) &((struct s *)0)->m2; for example: we first dereference the zero pointer to get the member m2, and then take its address. Right? That's how we literally read &p_struct->member.
However, this is not how the compiler interprets it. The compiler is a very cool and knowledgeable guy who is assumed to know everything. In particular, the compiler knows the offset of each member within a structure, and he says, "Well, why would I even care about the values of those members (dereferencing)? To obtain the address of a member of a structure, I just need to add the offset of the member to the address of the structure: &p_struct->member := p_struct + offset(member)."
With this ability, the compiler can generate the following assembly code:
6       {
   0x0000000000001129 <+0>:     endbr64
   0x000000000000112d <+4>:     push   %rbp
   0x000000000000112e <+5>:     mov    %rsp,%rbp
7           struct s {
8               int m0;
9               int m1;
10              int m2;
11          };
12
13          struct s *p = NULL;
   0x0000000000001131 <+8>:     movq   $0x0,-0x8(%rbp)
14          int a = (int) &((struct s *)p)->m2;  // okay
   0x0000000000001139 <+16>:    mov    -0x8(%rbp),%rax
   0x000000000000113d <+20>:    add    $0x8,%rax
   0x0000000000001141 <+24>:    mov    %eax,-0x14(%rbp)
15          int b = (int) &((struct s *)0)->m2;  // okay
   0x0000000000001144 <+27>:    movl   $0x8,-0x10(%rbp)
16          int c = (int) p->m2;  // dead
   0x000000000000114b <+34>:    mov    -0x8(%rbp),%rax
   0x000000000000114f <+38>:    mov    0x8(%rax),%eax    # segfault, since it tries to access invalid memory address 0x8 (page 0)
   0x0000000000001152 <+41>:    mov    %eax,-0xc(%rbp)
17
18          return 0;
   0x0000000000001155 <+44>:    mov    $0x0,%eax
19      }
   0x000000000000115a <+49>:    pop    %rbp
   0x000000000000115b <+50>:    ret
We can see that there is no dereference whatsoever when we take the address of a member of a structure, just the addition of the member offset to the address of the structure.
Therefore, as a special case where the structure address is zero, &((TYPE *)0)->MEMBER yields exactly the offset of the member within the structure, which explains why the macro

#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)

works in the Linux kernel.
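As a quick sanity check, here is a minimal user-space sketch of the same trick (my own toy program, not kernel code; it compiles as both C and C++). On a typical ABI with 4-byte ints, both lines print 8:

#include <stddef.h>
#include <stdio.h>

/* the kernel-style macro from above */
#define my_offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)

struct s { int m0; int m1; int m2; };

int main(void) {
    printf("my_offsetof(struct s, m2) = %zu\n", my_offsetof(struct s, m2));
    printf("offsetof(struct s, m2)    = %zu\n", offsetof(struct s, m2));  /* the standard macro from <stddef.h> */
    return 0;
}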
Recently I discovered a beautiful, funny, intelligent woman, Vi Hart, a YouTuber who creates entertaining, thought-provoking mathematics and music videos that explain mathematical concepts through doodles. It is an absolutely delightful and enlightening experience to watch her doodle, drawing those beautiful, adorable geometric shapes. Besides, she explains those mathematical concepts in a singing tone, which is cheerful and amusing to listen to. I think she does a fantastic job of spreading the idea that mathematics can be, and indeed is, fun & beautiful, in such exciting and entertaining ways. That is also how I have always wanted to make a general audience feel about math.
The basic idea is open addressing, but without any redirections (which makes Swiss tables very memory efficient). Also, Swiss tables store a densely packed array of metadata, one byte per entry (which makes the metadata easy to fit into the CPU cache, as opposed to, say, metadata made of 8-byte pointers), sort of like UNIX file system inodes and bitmaps (see OSTEP, Chapter 40: File System Implementation). Thus, we can use the metadata to tell whether an entry in the table is present (using the single control bit and the 7-bit H2 hash; see the details at https://abseil.io/about/design/swisstables).
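To make the control-byte idea concrete, here is a heavily simplified sketch of mine (a toy, not Abseil's API: the real implementation groups control bytes and probes 16 of them at a time with SSE, and also handles deletion and rehashing, which this sketch omits):

#include <cstdint>
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Toy Swiss-table sketch: one byte of metadata per slot. A control byte is
// either kEmpty or the 7-bit "H2" part of the key's hash, so most
// non-matching slots are rejected by reading a single byte, without touching
// the entries array at all.
struct SketchTable {
    static constexpr int8_t kEmpty = -128;           // slot never used
    std::vector<int8_t> ctrl;                        // densely packed metadata
    std::vector<std::pair<std::string, int>> slots;  // the actual entries
    size_t mask;                                     // capacity - 1

    // capacity must be a power of two; the sketch assumes the table never fills up
    explicit SketchTable(size_t capacity)
        : ctrl(capacity, kEmpty), slots(capacity), mask(capacity - 1) {}

    static size_t H1(size_t h) { return h >> 7; }    // picks the probe start
    static int8_t H2(size_t h) { return h & 0x7F; }  // 7-bit tag kept in ctrl

    void insert(const std::string& key, int value) {
        size_t h = std::hash<std::string>{}(key);
        for (size_t i = H1(h) & mask; ; i = (i + 1) & mask) {  // linear probing
            if (ctrl[i] == kEmpty || (ctrl[i] == H2(h) && slots[i].first == key)) {
                ctrl[i] = H2(h);
                slots[i] = {key, value};
                return;
            }
        }
    }

    std::optional<int> find(const std::string& key) const {
        size_t h = std::hash<std::string>{}(key);
        for (size_t i = H1(h) & mask; ; i = (i + 1) & mask) {
            if (ctrl[i] == kEmpty) return std::nullopt;     // key cannot be further
            if (ctrl[i] == H2(h) && slots[i].first == key)  // 1-byte filter first
                return slots[i].second;
        }
    }
};

The point is the find loop: most non-matching slots are rejected by reading one byte of densely packed metadata, and the much bigger (likely cache-cold) entry is only inspected when the 7-bit tag already matches.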
There's also an excellent video about it on YouTube.
Finally, let’s see the performance differences on the counting words benchmark.
Today is a special occasion (& might be a historic moment) that we may come to appreciate a few years from now: I, Yichuan Wang, & Robin Qiu just founded our own company. We were roomies in college, they're brilliant & awesome guys, and I firmly believe that in a few years we will create a very bright future, with our hands, our passion, our genius minds, and our entrepreneurial spirit. I'm feeling passion all over my body, my blood, my cells, just like Richard Hendricks. So, work hard, keep moving, and be a badass :) Notes on April 4, 2022.
This problem is actually a little tricky and counterintuitive, but really fun to think about (as a brain teaser).
Here is the problem (I’ll just copy it from Wikipedia, lazy evil being :)
Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?
At first sight, it appears to be a fifty-fifty chance of winning without switching, as only two doors are left and the car is behind one of them. Most people do think so (me too). But there is a piece of important information that people may not notice: the host knows what's behind each door. After you choose (a 1/3 probability of the car, a 2/3 probability of a goat), he purposely opens a door that has a goat. When you chose your door, you only had a 1/3 chance of winning the car; the car being behind one of the other two doors has a 2/3 chance, and this doesn't change just because the host shows you that one of them hides a goat. So once the host opens a door with a goat, the other remaining door has a 2/3 chance of hiding the car. Thank you for that extra 33.3%!!
Mathematically speaking, denote the two goats by Ga and Gb. When you choose a door, say door 1, there is a 1/3 chance that the car is right behind it, but more likely (a 2/3 probability) there's a goat behind it, either Ga or Gb. If it's Ga, the host opens the door hiding Gb; if it's Gb, the host opens the door hiding Ga. If you now decide to switch, you get the car. This happens with a 2-in-3 chance (picking Ga or Gb, with the car behind one of the other two doors, gives 2 x 2 = 4 cases out of the P(3, 3) = 6 total permutations). But if you were lucky and picked the door with the car at the very beginning (a 1/3 probability) and then switch, you lose; that's only a 1-in-3 chance (1 x 2 = 2 out of the P(3, 3) = 6 cases). So, statistically, if you keep playing this game for, say, 300 rounds, you'll probably win about 200 times and lose only about 100 times. Overall, the odds are in your favor if you switch. (I also found a nice picture on Wikipedia, so again, I'll shamelessly steal it.)
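If you don't trust the counting, here's a quick Monte Carlo sketch (my own; doors are numbered 0..2) that tells the same story:

#include <cstdio>
#include <random>

int main() {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> door(0, 2);  // doors 0, 1, 2
    const int N = 300'000;
    int stay_wins = 0, switch_wins = 0;
    for (int i = 0; i < N; ++i) {
        int car  = door(rng);  // where the car really is
        int pick = door(rng);  // our initial choice
        // The host now opens a goat door we didn't pick. Staying wins only if
        // our first pick was the car; otherwise the switch lands on the car.
        if (pick == car) ++stay_wins;
        else ++switch_wins;
    }
    std::printf("stay:   %.3f\n", stay_wins / double(N));    // ~0.333
    std::printf("switch: %.3f\n", switch_wins / double(N));  // ~0.667
}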
Now suppose there are, say, 100 doors and a car behind only one of them. You've just picked your lucky number, door No. 13, but still, you're pretty damn sure that your chance of winning is very, very slim (1/100). If the host then opens the other 98 doors, all with goats (why the hell would he do that??), and lets you decide whether to switch your choice... ha, this is a no-brainer: of course I'll f*$#ing do it, just switch!
Anyway, the most important thing is that the host has the information (the power you want :) If he didn't know where the car is and opened one of the unpicked doors at random, then you'd get a 50% chance of winning; but that's not how the game is played, and he might even open the door with the car (oops)! It's also different from the variant where the host first opens a door with a goat and only then lets you choose one of the remaining two (you simply have two choices then); that one really is a 50% win.
Interestingly and historically, this problem was made popular in a letter from reader Craig F. Whitaker, quoted in Marilyn vos Savant's "Ask Marilyn" column. There's also a great YouTube video about it: https://www.youtube.com/watch?v=dOQowCeAnRs&t=19s
Recommended watching & reading:
https://www.youtube.com/watch?v=4Lb-6rxZxx0
I just couldn't believe it... it seemed that after I told Ania how desperately, madly I'd fallen in love with her, we gazed into each other's eyes, and..., and the most amazing, exciting thing that I've always dreamed of happened... WOW, she kissed me passionately; it was a big sloppy kiss, a minutes-long tongue kiss, a hell of a kiss & my first kiss (I know, it's lame ;) I could feel her sweet tongue entwined with mine, the best experience I could ever dream of from a romantic movie where the male protagonist kisses the female protagonist, but even much better. Then I kissed her back, with all the strength I had; the wonderful, exciting feelings can never be put into words...
What came next was a lot of raw, animal, sexual touches. Yup, we made love. Making love with the one you've always dreamed of, the one you've known, loved & been crazy about for many years, is simply a thing I could die for.
This was such a beautiful dream, and I had a thought & a strong (VERY VERY intense) feeling of making love to her for the rest of my life; then I woke up. FUCK! I've found it a fact that the more you don't wanna wake up from an amazing dream, the more likely you will, and very soon, 'cause you're just sooooooooooooo fucking excited ;) Damn!
Background: consider the hashtable with closed addressing/separate chaining:
Typically, we can have two different schemes for linking all the nodes. One is to link the last node of the current bucket to the first node of the next non-empty bucket; see the picture above and imagine that there are such links among the buckets. The other is to leave the nodes exactly as in the picture above, i.e., each bucket's tail node simply ends with a null pointer.
The second approach requires some more work to obtain the successor of a given node: it may need to scan a lot of (empty) buckets before finding it. Sounds awful, huh? Actually not. By amortized analysis (assuming max_load_factor == 1), ideally each bucket holds exactly one node, so a full traversal scans each bucket about once and each increment is O(1) amortized. So it can be efficient.
How about the first one? The bad news is that it can suffer from really terrible insertion overhead. Imagine you have a very big hashtable in which tab[i] and tab[j] are two adjacent non-empty buckets. The horrible part is when the difference |i-j| is very large and an item whose hash code k lies between i and j needs to be inserted there. We will need to scan all the way down from k to j in order to update the links of three nodes: the last node of bucket i, the newly inserted node, and the first node of bucket j (you can think of this process as insertion into a doubly-linked list). And if we constantly perform this pathological operation, j-1, i+1, j-2, i+2, ..., things can get really bad, bad, bad, bad.
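Here's a sketch of that costly scan (a hypothetical helper of mine, with names in the spirit of the iterator code shown further below):

// Scheme 1, inserting into an empty bucket k: before splicing the new node
// into the single global chain, we must find the head of the next non-empty
// bucket, which may take up to |j - k| probes.
node_ptr next_nonempty_head(size_t k) const {
    for (size_t i = k + 1; i < _hashtable.size(); ++i)
        if (_hashtable[i]) return _hashtable[i];  // the first node of bucket j
    return nullptr;                               // no non-empty bucket after k
}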
The Problem: Now a simple question. After building a hashtable, we iterate through all of its elements; which one do you think is faster, the all-linked-up one (1st case) or the only-linked-within-buckets one (2nd case)?
Here is a piece of test code and the whole code can be found here.
int main() {
#if defined(USE_HASHMAP)
    mySymbolTable::HashMap<int, int> mp{};
#elif defined(USE_HASHMAP2)
    mySymbolTable::alternative::HashMap<int, int> mp{};
#else
    std::unordered_map<int, int> mp{};
#endif
    constexpr int N = 1'000'000;
    int k = 0;
    for (int i = 0; i < N; ++i) mp[N - i] = i;

    auto t1 = std::clock();
    for (auto it = mp.begin(); it != mp.end(); ++it) ++k;
    auto t2 = std::clock();

    auto iteration_time_elapsed = (t2 - t1) / (double)1'000;
    std::cout << "Iterating through " << k << " items used "
              << iteration_time_elapsed << "ms\n";
    return 0;
}
Essentially, they run the following code, respectively:
_self& operator++() {
    _ptr = _ptr->_next;
    return *this;
}
vs
_self& operator++() {
    auto next = _table->next_entry(_ptr, _bucket_index);
    _ptr = next.first;
    _bucket_index = next.second;
    return *this;
}

std::pair<node_ptr, size_t> next_entry(node_ptr x, size_t bucket_index) const {
    assert(x != nullptr && "cannot increment end() iterator");
    if (x->_next) return { x->_next, bucket_index };
    size_t i = bucket_index + 1;
    for (; i < _hashtable.size(); ++i) {
        if (_hashtable[i]) return { _hashtable[i], i };
    }
    return { nullptr, i };
}
So, which one? (I think it should be the first one; any rational human being would make this choice, I suppose. Don't believe it??)
Now have a look at their assembly code.
The bizarre thing is that in debug mode (the -g flag), the 1st case beats the 2nd by about one fifth (t1 = 82% * t2), while with optimization enabled (the -O3 flag) it's a whole different story. The summary table is given below.
Here, the _d suffix means the debug version, the *1 suffix means myst::HashMap (1st case), the *2 suffix means myst::alternative::HashMap (2nd case), and no suffix means std::unordered_map.
We can see from the results that -O3 barely boosts the performance of the 1st case (std::unordered_map and myst::HashMap) except in the test_int cases, yet gives a significant boost to the 2nd case (myst::alternative::HashMap). Given that the 2nd is a little more complicated, it is expected that it leaves more room for optimization, for example inlining. Still, I didn't think it could beat the 1st, smaller code. How do we explain this weird phenomenon?
(In my understanding, small code means fast (think about the instruction cache), and, technically speaking, the 2nd case should never be faster than the 1st one.)