We often face such a scenario in writing code:
```
our_loop:
    // some code before
    if (flag):
        do_something()
    else:
        do_something_else()
    // some code after
```
The flag is usually a runtime constant, for example an argument (e.g. true/false, 0/1) passed to this function at runtime. The branches do_something() and do_something_else() may also be larger code blocks that are almost identical. Well, if we know the flag will definitely not change within the loop, we can move the conditional statement out of the loop so that the flag is only tested once:
```
if (flag):
    our_loop:
        // some code before
        do_something()
        // some code after
else:
    our_loop:
        // some code before
        do_something_else()
        // some code after
```
This is great, except… well, it really looks ugly and we’re repeating ourselves. If the code within the loop is large, it also makes our binary executable a bit larger (a bigger .text section).
To be concrete, let’s consider the following C++ code that computes the rank of the current value in arr[beg, end):
```cpp
int get_rank(int arr[], int beg, int end, int cur, bool desc) {
    int rank = 1;
    for (int i = beg; i < end; ++i) {
        if (i == cur) continue;
        if (!desc) {
            if (arr[i] < arr[cur]) rank++;
        } else {
            if (arr[i] > arr[cur]) rank++;
        }
    }
    return rank;
}
```
One may want to compute the ascending or the descending rank of a value in an array/range. Here the order (asc/desc) is a runtime constant. However, if it’s a compile-time constant, one can avoid repeating themselves by using C++17’s if constexpr:
```cpp
template<bool desc>
int get_rank(int arr[], int beg, int end, int cur) {
    int rank = 1;
    for (int i = beg; i < end; ++i) {
        if (i == cur) continue;
        if constexpr (!desc) {  // requires C++17 or later
            if (arr[i] < arr[cur]) rank++;
        } else {
            if (arr[i] > arr[cur]) rank++;
        }
    }
    return rank;
}

// usage:
// int rank = get_rank<false>(arr, beg, end, cur);
// or
// constexpr bool desc = false;  // or some compile-time constant
// int rank = get_rank<desc>(arr, beg, end, cur);
```
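If the direction is only known at runtime, a common trick is a thin wrapper that branches once and forwards the flag as a template argument, so the unswitched template body does all the work. A minimal sketch (the wrapper name get_rank_dispatch is mine; the template is repeated so the snippet is self-contained):

```cpp
#include <cassert>

template<bool desc>
int get_rank(int arr[], int beg, int end, int cur) {
    int rank = 1;
    for (int i = beg; i < end; ++i) {
        if (i == cur) continue;
        if constexpr (!desc) {
            if (arr[i] < arr[cur]) rank++;  // ascending rank
        } else {
            if (arr[i] > arr[cur]) rank++;  // descending rank
        }
    }
    return rank;
}

// Branch on the runtime flag exactly once, at the call boundary;
// each arm runs a loop with no per-iteration desc test.
int get_rank_dispatch(int arr[], int beg, int end, int cur, bool desc) {
    return desc ? get_rank<true>(arr, beg, end, cur)
                : get_rank<false>(arr, beg, end, cur);
}
```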
This solution combines both efficiency and elegance (thanks to the compiler). But how about the efficiency of the non-templated function (the one with an additional parameter desc)? I wasn’t sure if the compiler would do any sort of optimization here. But I do know that modern computer architectures are designed to do a great job at branch prediction, and the compiler may also aid by generating branch-friendly (or even branchless) code. If the prediction is correct, the fetched instructions for this branch are kept hot (remaining in the cache and busy running in the CPU pipeline). Otherwise, the instructions fetched under the wrong prediction may be discarded (evicted from the cache) and the instructions for the other branch must be fetched; the CPU stalls caused by the pipeline invalidation make it even worse. Luckily, here the branch condition is constant and therefore the CPU is on our side.
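As an aside, the loop body can also be written so that the comparison result is consumed arithmetically instead of branched on, which compilers typically lower to setcc/cmov; only the desc selection and the i == cur skip remain as branches. A sketch under that assumption (the name get_rank_branchless is mine):

```cpp
#include <cassert>

int get_rank_branchless(int arr[], int beg, int end, int cur, bool desc) {
    int rank = 1;
    for (int i = beg; i < end; ++i) {
        if (i == cur) continue;
        // The comparison result (0 or 1) is added directly, so no jump
        // is needed on the outcome of the comparison itself.
        rank += desc ? (arr[i] > arr[cur]) : (arr[i] < arr[cur]);
    }
    return rank;
}
```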
Next, let’s see what the compiler can do for us. Compiling this function on Godbolt using gcc 12.2 with the -O3 flag yields the following assembly code:
```asm
get_rank(int*, int, int, int, bool):
        movq    %rdi, %r9
        movl    %edx, %edi
        cmpl    %edx, %esi
        jge     .L9
        movslq  %ecx, %rax
        movl    $1, %edx
        leaq    (%r9,%rax,4), %r10
        movslq  %esi, %rax
        testb   %r8b, %r8b
        jne     .L8
.L5:
        cmpl    %eax, %ecx
        je      .L4
        movl    (%r10), %esi
        cmpl    %esi, (%r9,%rax,4)
        setl    %sil
        movzbl  %sil, %esi
        addl    %esi, %edx
.L4:
        addq    $1, %rax
        cmpl    %eax, %edi
        jg      .L5
        movl    %edx, %eax
        ret
.L8:
        cmpl    %eax, %ecx        ; if (i == cur)
        je      .L7
        movl    (%r10), %esi
        cmpl    %esi, (%r9,%rax,4)
        setg    %sil
        movzbl  %sil, %esi
        addl    %esi, %edx
.L7:                              ; for (int i = beg; i < end; ++i)
        addq    $1, %rax
        cmpl    %eax, %edi
        jg      .L8
        movl    %edx, %eax
        ret
.L9:
        movl    $1, %edx
        movl    %edx, %eax
        ret
```
We can see that desc is only tested once. Cheers! With the -O0 flag it gives
```asm
get_rank(int*, int, int, int, bool):
        pushq   %rbp
        movq    %rsp, %rbp
        movq    %rdi, -24(%rbp)
        movl    %esi, -28(%rbp)
        movl    %edx, -32(%rbp)
        movl    %ecx, -36(%rbp)
        movl    %r8d, %eax
        movb    %al, -40(%rbp)
        movl    $1, -4(%rbp)
        movl    -28(%rbp), %eax
        movl    %eax, -8(%rbp)
        jmp     .L2
.L6:
        movl    -8(%rbp), %eax
        cmpl    -36(%rbp), %eax
        je      .L8
        movzbl  -40(%rbp), %eax
        xorl    $1, %eax
        testb   %al, %al
        je      .L5
        movl    -8(%rbp), %eax
        cltq
        leaq    0(,%rax,4), %rdx
        movq    -24(%rbp), %rax
        addq    %rdx, %rax
        movl    (%rax), %edx
        movl    -36(%rbp), %eax
        cltq
        leaq    0(,%rax,4), %rcx
        movq    -24(%rbp), %rax
        addq    %rcx, %rax
        movl    (%rax), %eax
        cmpl    %eax, %edx
        jge     .L4
        addl    $1, -4(%rbp)
        jmp     .L4
.L5:
        movl    -8(%rbp), %eax
        cltq
        leaq    0(,%rax,4), %rdx
        movq    -24(%rbp), %rax
        addq    %rdx, %rax
        movl    (%rax), %edx
        movl    -36(%rbp), %eax
        cltq
        leaq    0(,%rax,4), %rcx
        movq    -24(%rbp), %rax
        addq    %rcx, %rax
        movl    (%rax), %eax
        cmpl    %eax, %edx
        jle     .L4
        addl    $1, -4(%rbp)
        jmp     .L4
.L8:
        nop
.L4:
        addl    $1, -8(%rbp)
.L2:                              ; for (int i = beg; i < end; ++i)
        movl    -8(%rbp), %eax
        cmpl    -32(%rbp), %eax
        jl      .L6
        movl    -4(%rbp), %eax
        popq    %rbp
        ret
```
which suggests desc is tested in every loop iteration.
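Whether or not the compiler unswitches the loop, the duplication in the source can also be removed by parameterizing the comparison itself, e.g. with std::less / std::greater, so the branch disappears from the loop body entirely. This variant (get_rank_cmp, my naming) compiles one specialization per direction, just like the if constexpr version:

```cpp
#include <cassert>
#include <functional>

// Cmp fixes the counting direction at compile time; the loop body
// contains no direction branch at all.
template<typename Cmp>
int get_rank_cmp(int arr[], int beg, int end, int cur, Cmp cmp = Cmp{}) {
    int rank = 1;
    for (int i = beg; i < end; ++i) {
        if (i == cur) continue;
        if (cmp(arr[i], arr[cur])) rank++;
    }
    return rank;
}

// usage:
// int asc_rank  = get_rank_cmp<std::less<int>>(arr, beg, end, cur);
// int desc_rank = get_rank_cmp<std::greater<int>>(arr, beg, end, cur);
```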
Now that we know compilers can actually perform this optimization (although it’s not enabled by default), we can feel relaxed and fearless in choosing our seemingly inefficient but shorter & sweeter code. Cheers again ;)
See also https://en.wikipedia.org/wiki/Loop-invariant_code_motion, https://en.wikipedia.org/wiki/Loop_unswitching, http://www.cs.cmu.edu/afs/cs/academic/class/15745-s06/web/handouts/06.pdf.