We often face a scenario like this when writing code:
our_loop:
    // some code before
    if (flag):
        do_something()
    else:
        do_something_else()
    // some code after
The flag is usually a runtime constant, for example an argument (true/false, 0/1) passed to the function. The branches do_something() and do_something_else() may be larger code blocks that are almost identical. If we know the flag definitely won't change within the loop, we can move the conditional statement out of the loop so that the flag is tested only once:
if (flag):
    our_loop:
        // some code before
        do_something()
        // some code after
else:
    our_loop:
        // some code before
        do_something_else()
        // some code after
This is great, except that it looks ugly and we're repeating ourselves. If the code within the loop is large, it also makes the binary executable a bit larger (the .text section).
To be concrete, let's consider the following C++ code that computes the rank of arr[cur] within arr[beg, end):
int get_rank(int arr[], int beg, int end, int cur, bool desc)
{
    int rank = 1;
    for (int i = beg; i < end; ++i) {
        if (i == cur)
            continue;
        if (!desc) {
            if (arr[i] < arr[cur])
                rank++;
        }
        else {
            if (arr[i] > arr[cur])
                rank++;
        }
    }
    return rank;
}
One may want to compute the ascending rank or the descending rank of a value in an array/range. Here the order (asc/desc) is a runtime constant. However, if it’s a compile-time constant one can avoid repeating themselves by using C++17’s if constexpr:
template<bool desc>
int get_rank(int arr[], int beg, int end, int cur)
{
    int rank = 1;
    for (int i = beg; i < end; ++i) {
        if (i == cur)
            continue;
        if constexpr (!desc) { // requires C++17 or later
            if (arr[i] < arr[cur])
                rank++;
        }
        else {
            if (arr[i] > arr[cur])
                rank++;
        }
    }
    return rank;
}
// usage:
// int rank = get_rank<false>(arr, beg, end, cur);
// or
// constexpr bool desc = false; // or some compile-time constant
// int rank = get_rank<desc>(arr, beg, end, cur);
This solution combines efficiency and elegance (thanks to the compiler). But what about the efficiency of the non-templated function (the one with the additional desc parameter)? I wasn't sure whether the compiler would do any sort of optimization there. But I do know that modern computer architectures are designed to do a great job at branch prediction, and the compiler may also help by generating branch-friendly (or even branchless) code. If a prediction is correct, the fetched instructions for that branch stay hot (they remain in the cache and keep running through the CPU pipeline). If it is wrong, the speculatively fetched instructions may be discarded (evicted from the cache) and the instructions for the other branch fetched instead; the CPU stalls caused by the pipeline invalidation make it worse. Luckily, in our case the branch condition is constant, so the CPU is on our side.
Next, let's see what the compiler can do for us. Compiling this function on Godbolt with gcc 12.2 and the -O3 flag yields the following assembly code:
get_rank(int*, int, int, int, bool):
        movq    %rdi, %r9
        movl    %edx, %edi
        cmpl    %edx, %esi
        jge     .L9
        movslq  %ecx, %rax
        movl    $1, %edx
        leaq    (%r9,%rax,4), %r10
        movslq  %esi, %rax
        testb   %r8b, %r8b      ; test desc once, before entering either loop
        jne     .L8
.L5:
        cmpl    %eax, %ecx
        je      .L4
        movl    (%r10), %esi
        cmpl    %esi, (%r9,%rax,4)
        setl    %sil
        movzbl  %sil, %esi
        addl    %esi, %edx
.L4:
        addq    $1, %rax
        cmpl    %eax, %edi
        jg      .L5
        movl    %edx, %eax
        ret
.L8:
        cmpl    %eax, %ecx      ; if (i == cur)
        je      .L7
        movl    (%r10), %esi
        cmpl    %esi, (%r9,%rax,4)
        setg    %sil
        movzbl  %sil, %esi
        addl    %esi, %edx
.L7:                            ; for (int i = beg; i < end; ++i)
        addq    $1, %rax
        cmpl    %eax, %edi
        jg      .L8
        movl    %edx, %eax
        ret
.L9:
        movl    $1, %edx
        movl    %edx, %eax
        ret
We can see that desc is only tested once (the testb %r8b, %r8b before the loop). Cheers! With the -O0 flag it gives
get_rank(int*, int, int, int, bool):
        pushq   %rbp
        movq    %rsp, %rbp
        movq    %rdi, -24(%rbp)
        movl    %esi, -28(%rbp)
        movl    %edx, -32(%rbp)
        movl    %ecx, -36(%rbp)
        movl    %r8d, %eax
        movb    %al, -40(%rbp)
        movl    $1, -4(%rbp)
        movl    -28(%rbp), %eax
        movl    %eax, -8(%rbp)
        jmp     .L2
.L6:
        movl    -8(%rbp), %eax
        cmpl    -36(%rbp), %eax
        je      .L8
        movzbl  -40(%rbp), %eax ; reload and test desc in every iteration
        xorl    $1, %eax
        testb   %al, %al
        je      .L5
        movl    -8(%rbp), %eax
        cltq
        leaq    0(,%rax,4), %rdx
        movq    -24(%rbp), %rax
        addq    %rdx, %rax
        movl    (%rax), %edx
        movl    -36(%rbp), %eax
        cltq
        leaq    0(,%rax,4), %rcx
        movq    -24(%rbp), %rax
        addq    %rcx, %rax
        movl    (%rax), %eax
        cmpl    %eax, %edx
        jge     .L4
        addl    $1, -4(%rbp)
        jmp     .L4
.L5:
        movl    -8(%rbp), %eax
        cltq
        leaq    0(,%rax,4), %rdx
        movq    -24(%rbp), %rax
        addq    %rdx, %rax
        movl    (%rax), %edx
        movl    -36(%rbp), %eax
        cltq
        leaq    0(,%rax,4), %rcx
        movq    -24(%rbp), %rax
        addq    %rcx, %rax
        movl    (%rax), %eax
        cmpl    %eax, %edx
        jle     .L4
        addl    $1, -4(%rbp)
        jmp     .L4
.L8:
        nop
.L4:
        addl    $1, -8(%rbp)
.L2:                            ; for (int i = beg; i < end; ++i)
        movl    -8(%rbp), %eax
        cmpl    -32(%rbp), %eax
        jl      .L6
        movl    -4(%rbp), %eax
        popq    %rbp
        ret
which shows that desc is tested on every loop iteration.
Now that we know compilers can actually perform this optimization — known as loop unswitching; in gcc it's controlled by -funswitch-loops, which -O3 turns on but is not enabled by default — we can relax and fearlessly choose our seemingly inefficient but shorter and sweeter code. Cheers again ;)
See also: loop-invariant code motion (https://en.wikipedia.org/wiki/Loop-invariant_code_motion), loop unswitching (https://en.wikipedia.org/wiki/Loop_unswitching), and these CMU 15-745 lecture notes (http://www.cs.cmu.edu/afs/cs/academic/class/15745-s06/web/handouts/06.pdf).
