# Loop-invariant Code Hoisting

We often face such a scenario in writing code:

```our_loop:
// some code before
if (flag):
do_something()
else:
do_something_else()
// some code after```

The `flag` is usually a constant, for example, an argument (e.g. true/false; 0/1) we passed to this function at runtime. The functions `do_something()` and `do_something_else()` may also be some code blocks and they’re almost identical. Well, if we know the flag will definitely not change within the loop we can move the conditional statement out of the loop so that the flag will only be tested once.

```if (flag):
our_loop:
// some code before
do_something()
// some code after
else:
our_loop:
// some code before
do_something_else()
// some code after```

This is great, except, …, well, it really looks ugly and we’re repeating ourselves. If the code within the loop is large, it makes our binary executable a bit larger (.text section).

To be concrete, let’s consider the following C++ code that computes the rank of the current value in arr[beg, end):

```int get_rank(int arr[], int beg, int end, int cur, bool desc)
{
int rank = 1;
for (int i = beg; i < end; ++i) {
if (i == cur)
continue;
if (!desc) {
if (arr[i] < arr[cur])
rank++;
}
else {
if (arr[i] > arr[cur])
rank++;
}
}
return rank;
}```

One may want to compute the ascending rank or the descending rank of a value in an array/range. Here the order (asc/desc) is a runtime constant. However, if it’s a compile-time constant one can avoid repeating themselves by using C++17’s `if constexpr`:

```template<bool desc>
int get_rank(int arr[], int beg, int end, int cur)
{
int rank = 1;
for (int i = beg; i < end; ++i) {
if (i == cur)
continue;
if constexpr(!desc) {  // requires C++17 or later
if (arr[i] < arr[cur])
rank++;
}
else {
if (arr[i] > arr[cur])
rank++;
}
}
return rank;
}

// usage:
// int rank = get_rank<false>(arr, beg, end, cur);
// or
// constexpr bool desc = false; // or some compile-time constant
// int rank = get_rank<desc>(arr, beg, end, cur);```

This solution combines both efficiency and elegance (thanks to the compiler). But how about the efficiency of the non-templated function (the one with an additional parameter `desc`)? I wasn’t sure if the compiler would do any sort of optimization. But I do know that modern computer architectures are designed to do a great job in branch prediction. The compiler may also aid in generating branch-friendly (or even branchless) code. So, if the prediction was correct, the fitched instructions for this branch will be kept hot (remaining in the cache and busy running in the CPU pipeline). Otherwise, the fitched instructions by the wrong prediction may be discarded (evicted from the cache) and instructions for the other branch will be fetched. Also, the CPU stalls caused by the pipeline invalidation will make it worse. Luckily, here in this case, the branch condition is constant and therefore the CPU is on our side.

Next, let’s see what the compile can do for us. Compiling this function in godbolt using gcc 12.2 with `-O3` flag yields the following assembly code:

```get_rank(int*, int, int, int, bool):
movq    %rdi, %r9
movl    %edx, %edi
cmpl    %edx, %esi
jge     .L9
movslq  %ecx, %rax
movl    \$1, %edx
leaq    (%r9,%rax,4), %r10
movslq  %esi, %rax
testb   %r8b, %r8b
jne     .L8
.L5:
cmpl    %eax, %ecx
je      .L4
movl    (%r10), %esi
cmpl    %esi, (%r9,%rax,4)
setl    %sil
movzbl  %sil, %esi
.L4:
cmpl    %eax, %edi
jg      .L5
movl    %edx, %eax
ret
.L8:
cmpl    %eax, %ecx  ; if (i == cur)
je      .L7
movl    (%r10), %esi
cmpl    %esi, (%r9,%rax,4)
setg    %sil
movzbl  %sil, %esi
.L7: ; for (int i = beg; i < end; ++i)
cmpl    %eax, %edi
jg      .L8
movl    %edx, %eax
ret
.L9:
movl    \$1, %edx
movl    %edx, %eax
ret```

We can see that `desc` is only tested once. Cheers! With `-O0` flag it gives

```get_rank(int*, int, int, int, bool):
pushq   %rbp
movq    %rsp, %rbp
movq    %rdi, -24(%rbp)
movl    %esi, -28(%rbp)
movl    %edx, -32(%rbp)
movl    %ecx, -36(%rbp)
movl    %r8d, %eax
movb    %al, -40(%rbp)
movl    \$1, -4(%rbp)
movl    -28(%rbp), %eax
movl    %eax, -8(%rbp)
jmp     .L2
.L6:
movl    -8(%rbp), %eax
cmpl    -36(%rbp), %eax
je      .L8
movzbl  -40(%rbp), %eax
xorl    \$1, %eax
testb   %al, %al
je      .L5
movl    -8(%rbp), %eax
cltq
leaq    0(,%rax,4), %rdx
movq    -24(%rbp), %rax
movl    (%rax), %edx
movl    -36(%rbp), %eax
cltq
leaq    0(,%rax,4), %rcx
movq    -24(%rbp), %rax
movl    (%rax), %eax
cmpl    %eax, %edx
jge     .L4
jmp     .L4
.L5:
movl    -8(%rbp), %eax
cltq
leaq    0(,%rax,4), %rdx
movq    -24(%rbp), %rax
movl    (%rax), %edx
movl    -36(%rbp), %eax
cltq
leaq    0(,%rax,4), %rcx
movq    -24(%rbp), %rax
movl    (%rax), %eax
cmpl    %eax, %edx
jle     .L4
jmp     .L4
.L8:
nop
.L4:
which suggests `desc` is tested within each loop.