Loop-invariant Code Hoisting

We often face such a scenario in writing code:

our_loop:
    // some code before
    if (flag):
       do_something()
    else:
       do_something_else()
    // some code after

The flag is usually a constant, for example, an argument (e.g. true/false; 0/1) we passed to this function at runtime. The functions do_something() and do_something_else() may also be some code blocks and they’re almost identical. Well, if we know the flag will definitely not change within the loop we can move the conditional statement out of the loop so that the flag will only be tested once.

if (flag):
    our_loop:
        // some code before
        do_something()
        // some code after
else:
    our_loop:
        // some code before
        do_something_else()
        // some code after

This is great, except, …, well, it really looks ugly and we’re repeating ourselves. If the code within the loop is large, it makes our binary executable a bit larger (.text section).

To be concrete, let’s consider the following C++ code that computes the rank of the current value in arr[beg, end):

int get_rank(int arr[], int beg, int end, int cur, bool desc)
{
    int rank = 1;
    for (int i = beg; i < end; ++i) {
        if (i == cur)
            continue;
        if (!desc) {
            if (arr[i] < arr[cur])
                rank++;
        }
        else {
            if (arr[i] > arr[cur])
                rank++;
        }
    }
    return rank;
}

One may want to compute the ascending rank or the descending rank of a value in an array/range. Here the order (asc/desc) is a runtime constant. However, if it’s a compile-time constant one can avoid repeating themselves by using C++17’s if constexpr:

template<bool desc>
int get_rank(int arr[], int beg, int end, int cur)
{
    int rank = 1;
    for (int i = beg; i < end; ++i) {
        if (i == cur)
            continue;
        if constexpr(!desc) {  // requires C++17 or later
            if (arr[i] < arr[cur])
                rank++;
        }
        else {
            if (arr[i] > arr[cur])
                rank++;
        }
    }
    return rank;
}

// usage:
// int rank = get_rank<false>(arr, beg, end, cur);
// or
// constexpr bool desc = false; // or some compile-time constant
// int rank = get_rank<desc>(arr, beg, end, cur);

This solution combines both efficiency and elegance (thanks to the compiler). But how about the efficiency of the non-templated function (the one with an additional parameter desc)? I wasn’t sure if the compiler would do any sort of optimization. But I do know that modern computer architectures are designed to do a great job in branch prediction. The compiler may also aid in generating branch-friendly (or even branchless) code. So, if the prediction was correct, the fitched instructions for this branch will be kept hot (remaining in the cache and busy running in the CPU pipeline). Otherwise, the fitched instructions by the wrong prediction may be discarded (evicted from the cache) and instructions for the other branch will be fetched. Also, the CPU stalls caused by the pipeline invalidation will make it worse. Luckily, here in this case, the branch condition is constant and therefore the CPU is on our side.

Next, let’s see what the compile can do for us. Compiling this function on Godbolt using gcc 12.2 with -O3 flag yields the following assembly code:

get_rank(int*, int, int, int, bool):
        movq    %rdi, %r9
        movl    %edx, %edi
        cmpl    %edx, %esi
        jge     .L9
        movslq  %ecx, %rax
        movl    $1, %edx
        leaq    (%r9,%rax,4), %r10
        movslq  %esi, %rax
        testb   %r8b, %r8b
        jne     .L8
.L5:
        cmpl    %eax, %ecx
        je      .L4
        movl    (%r10), %esi
        cmpl    %esi, (%r9,%rax,4)
        setl    %sil
        movzbl  %sil, %esi
        addl    %esi, %edx
.L4:
        addq    $1, %rax
        cmpl    %eax, %edi
        jg      .L5
        movl    %edx, %eax
        ret
.L8:
        cmpl    %eax, %ecx  ; if (i == cur)
        je      .L7
        movl    (%r10), %esi
        cmpl    %esi, (%r9,%rax,4)
        setg    %sil
        movzbl  %sil, %esi
        addl    %esi, %edx
.L7: ; for (int i = beg; i < end; ++i)
        addq    $1, %rax
        cmpl    %eax, %edi
        jg      .L8
        movl    %edx, %eax
        ret
.L9:
        movl    $1, %edx
        movl    %edx, %eax
        ret

We can see that desc is only tested once. Cheers! With -O0 flag it gives

get_rank(int*, int, int, int, bool):
        pushq   %rbp
        movq    %rsp, %rbp
        movq    %rdi, -24(%rbp)
        movl    %esi, -28(%rbp)
        movl    %edx, -32(%rbp)
        movl    %ecx, -36(%rbp)
        movl    %r8d, %eax
        movb    %al, -40(%rbp)
        movl    $1, -4(%rbp)
        movl    -28(%rbp), %eax
        movl    %eax, -8(%rbp)
        jmp     .L2
.L6:
        movl    -8(%rbp), %eax
        cmpl    -36(%rbp), %eax
        je      .L8
        movzbl  -40(%rbp), %eax
        xorl    $1, %eax
        testb   %al, %al
        je      .L5
        movl    -8(%rbp), %eax
        cltq
        leaq    0(,%rax,4), %rdx
        movq    -24(%rbp), %rax
        addq    %rdx, %rax
        movl    (%rax), %edx
        movl    -36(%rbp), %eax
        cltq
        leaq    0(,%rax,4), %rcx
        movq    -24(%rbp), %rax
        addq    %rcx, %rax
        movl    (%rax), %eax
        cmpl    %eax, %edx
        jge     .L4
        addl    $1, -4(%rbp)
        jmp     .L4
.L5:
        movl    -8(%rbp), %eax
        cltq
        leaq    0(,%rax,4), %rdx
        movq    -24(%rbp), %rax
        addq    %rdx, %rax
        movl    (%rax), %edx
        movl    -36(%rbp), %eax
        cltq
        leaq    0(,%rax,4), %rcx
        movq    -24(%rbp), %rax
        addq    %rcx, %rax
        movl    (%rax), %eax
        cmpl    %eax, %edx
        jle     .L4
        addl    $1, -4(%rbp)
        jmp     .L4
.L8:
        nop
.L4:
        addl    $1, -8(%rbp)
.L2: ; for (int i = beg; i < end; ++i)
        movl    -8(%rbp), %eax
        cmpl    -32(%rbp), %eax
        jl      .L6
        movl    -4(%rbp), %eax
        popq    %rbp
        ret

which suggests desc is tested within each loop.

Now we know that compilers can actually perform this optimization (although it’s not enabled by default), we should be relaxed and fearless to choose our seemingly inefficient but shorter & sweeter code. Cheers again ;)

See also https://en.wikipedia.org/wiki/Loop-invariant_code_motion, https://en.wikipedia.org/wiki/Loop_unswitching, http://www.cs.cmu.edu/afs/cs/academic/class/15745-s06/web/handouts/06.pdf.

Leave a comment

Your email address will not be published.

The maximum upload file size: 10 MB. You can upload: image, audio, video, document, spreadsheet, interactive, text, archive, code, other. Links to YouTube, Facebook, Twitter and other services inserted in the comment text will be automatically embedded. Drop file here