{"id":1633,"date":"2023-01-15T21:53:28","date_gmt":"2023-01-15T13:53:28","guid":{"rendered":"https:\/\/markjohntaylor.com\/blog\/wordpress\/?p=1633"},"modified":"2023-12-21T22:48:42","modified_gmt":"2023-12-21T14:48:42","slug":"loop-invariant-code-hoisting","status":"publish","type":"post","link":"https:\/\/markjohntaylor.com\/blog\/wordpress\/index.php\/2023\/01\/15\/loop-invariant-code-hoisting\/","title":{"rendered":"Loop-invariant Code Hoisting"},"content":{"rendered":"\n<p>We often face such a scenario in writing code:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"cpp\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">our_loop:\n    \/\/ some code before\n    if (flag):\n       do_something()\n    else:\n       do_something_else()\n    \/\/ some code after<\/pre>\n\n\n\n<p>The <code>flag<\/code> is usually a constant, for example, an argument (e.g. true\/false; 0\/1) we passed to this function at <strong>runtime<\/strong>. The functions <code>do_something()<\/code> and <code>do_something_else()<\/code> may also be some code blocks and they&#8217;re almost identical. Well, if we know the flag will <em>definitely not<\/em> change within the loop we can move the conditional statement out of the loop so that the flag will only be tested once.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"cpp\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">if (flag):\n    our_loop:\n        \/\/ some code before\n        do_something()\n        \/\/ some code after\nelse:\n    our_loop:\n        \/\/ some code before\n        do_something_else()\n        \/\/ some code after<\/pre>\n\n\n\n<p>This is great, except, &#8230;, well, it really looks ugly and we&#8217;re repeating ourselves. If the code within the loop is large, it makes our binary executable a bit larger (.text section).<\/p>\n\n\n\n<p>To be concrete, let&#8217;s consider the following C++ code that computes the rank of the current value in arr[beg, end):<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"cpp\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">int get_rank(int arr[], int beg, int end, int cur, bool desc)\n{\n    int rank = 1;\n    for (int i = beg; i &lt; end; ++i) {\n        if (i == cur)\n            continue;\n        if (!desc) {\n            if (arr[i] &lt; arr[cur])\n                rank++;\n        }\n        else {\n            if (arr[i] > arr[cur])\n                rank++;\n        }\n    }\n    return rank;\n}<\/pre>\n\n\n\n<p>One may want to compute the ascending rank or the descending rank of a value in an array\/range. Here the order (asc\/desc) is a runtime constant. However, if it&#8217;s a compile-time constant one can avoid repeating themselves by using C++17&#8217;s <code>if constexpr<\/code>:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"cpp\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">template&lt;bool desc>\nint get_rank(int arr[], int beg, int end, int cur)\n{\n    int rank = 1;\n    for (int i = beg; i &lt; end; ++i) {\n        if (i == cur)\n            continue;\n        if constexpr(!desc) {  \/\/ requires C++17 or later\n            if (arr[i] &lt; arr[cur])\n                rank++;\n        }\n        else {\n            if (arr[i] > arr[cur])\n                rank++;\n        }\n    }\n    return rank;\n}\n\n\/\/ usage:\n\/\/ int rank = get_rank&lt;false>(arr, beg, end, cur);\n\/\/ or\n\/\/ constexpr bool desc = false; \/\/ or some compile-time constant\n\/\/ int rank = get_rank&lt;desc>(arr, beg, end, cur);<\/pre>\n\n\n\n<p>This solution combines both efficiency and elegance (thanks to the compiler). But how about the efficiency of the non-templated function (the one with an additional parameter <code>desc<\/code>)? I wasn&#8217;t sure if the compiler would do any sort of optimization. But I do know that modern computer architectures are designed to do a great job in branch prediction. The compiler may also aid in generating branch-friendly (or even branchless) code. So, if the prediction was correct, the fitched instructions for this branch will be kept hot (remaining in the cache and busy running in the CPU pipeline). Otherwise, the fitched instructions by the wrong prediction may be discarded (evicted from the cache) and instructions for the other branch will be fetched. Also, the CPU stalls caused by the pipeline invalidation will make it worse. Luckily, here in this case, the branch condition is constant and therefore the CPU is on our side.<\/p>\n\n\n\n<p>Next, let&#8217;s see what the compile can do for us. Compiling this function on <a href=\"https:\/\/gcc.godbolt.org\/z\/e38cGd713\">Godbolt<\/a> using gcc 12.2 with <code>-O3<\/code> flag yields the following assembly code:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"asm\" data-enlighter-theme=\"\" data-enlighter-highlight=\"10\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">get_rank(int*, int, int, int, bool):\n        movq    %rdi, %r9\n        movl    %edx, %edi\n        cmpl    %edx, %esi\n        jge     .L9\n        movslq  %ecx, %rax\n        movl    $1, %edx\n        leaq    (%r9,%rax,4), %r10\n        movslq  %esi, %rax\n        testb   %r8b, %r8b\n        jne     .L8\n.L5:\n        cmpl    %eax, %ecx\n        je      .L4\n        movl    (%r10), %esi\n        cmpl    %esi, (%r9,%rax,4)\n        setl    %sil\n        movzbl  %sil, %esi\n        addl    %esi, %edx\n.L4:\n        addq    $1, %rax\n        cmpl    %eax, %edi\n        jg      .L5\n        movl    %edx, %eax\n        ret\n.L8:\n        cmpl    %eax, %ecx  ; if (i == cur)\n        je      .L7\n        movl    (%r10), %esi\n        cmpl    %esi, (%r9,%rax,4)\n        setg    %sil\n        movzbl  %sil, %esi\n        addl    %esi, %edx\n.L7: ; for (int i = beg; i &lt; end; ++i)\n        addq    $1, %rax\n        cmpl    %eax, %edi\n        jg      .L8\n        movl    %edx, %eax\n        ret\n.L9:\n        movl    $1, %edx\n        movl    %edx, %eax\n        ret<\/pre>\n\n\n\n<p>We can see that <code>desc<\/code> is only tested once. Cheers! With <code>-O0<\/code> flag it gives<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"asm\" data-enlighter-theme=\"\" data-enlighter-highlight=\"20, 62\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">get_rank(int*, int, int, int, bool):\n        pushq   %rbp\n        movq    %rsp, %rbp\n        movq    %rdi, -24(%rbp)\n        movl    %esi, -28(%rbp)\n        movl    %edx, -32(%rbp)\n        movl    %ecx, -36(%rbp)\n        movl    %r8d, %eax\n        movb    %al, -40(%rbp)\n        movl    $1, -4(%rbp)\n        movl    -28(%rbp), %eax\n        movl    %eax, -8(%rbp)\n        jmp     .L2\n.L6:\n        movl    -8(%rbp), %eax\n        cmpl    -36(%rbp), %eax\n        je      .L8\n        movzbl  -40(%rbp), %eax\n        xorl    $1, %eax\n        testb   %al, %al\n        je      .L5\n        movl    -8(%rbp), %eax\n        cltq\n        leaq    0(,%rax,4), %rdx\n        movq    -24(%rbp), %rax\n        addq    %rdx, %rax\n        movl    (%rax), %edx\n        movl    -36(%rbp), %eax\n        cltq\n        leaq    0(,%rax,4), %rcx\n        movq    -24(%rbp), %rax\n        addq    %rcx, %rax\n        movl    (%rax), %eax\n        cmpl    %eax, %edx\n        jge     .L4\n        addl    $1, -4(%rbp)\n        jmp     .L4\n.L5:\n        movl    -8(%rbp), %eax\n        cltq\n        leaq    0(,%rax,4), %rdx\n        movq    -24(%rbp), %rax\n        addq    %rdx, %rax\n        movl    (%rax), %edx\n        movl    -36(%rbp), %eax\n        cltq\n        leaq    0(,%rax,4), %rcx\n        movq    -24(%rbp), %rax\n        addq    %rcx, %rax\n        movl    (%rax), %eax\n        cmpl    %eax, %edx\n        jle     .L4\n        addl    $1, -4(%rbp)\n        jmp     .L4\n.L8:\n        nop\n.L4:\n        addl    $1, -8(%rbp)\n.L2: ; for (int i = beg; i &lt; end; ++i)\n        movl    -8(%rbp), %eax\n        cmpl    -32(%rbp), %eax\n        jl      .L6\n        movl    -4(%rbp), %eax\n        popq    %rbp\n        ret<\/pre>\n\n\n\n<p>which suggests <code>desc<\/code> is tested within each loop.<\/p>\n\n\n\n<p>Now we know that compilers can actually perform this optimization (although it&#8217;s not enabled by default), we should be relaxed and fearless to choose our seemingly inefficient but shorter &amp; sweeter code. Cheers again ;)<\/p>\n\n\n\n<p>See also <a href=\"https:\/\/en.wikipedia.org\/wiki\/Loop-invariant_code_motion\">https:\/\/en.wikipedia.org\/wiki\/Loop-invariant_code_motion<\/a>, <a href=\"https:\/\/en.wikipedia.org\/wiki\/Loop_unswitching\">https:\/\/en.wikipedia.org\/wiki\/Loop_unswitching<\/a>, <a href=\"http:\/\/www.cs.cmu.edu\/afs\/cs\/academic\/class\/15745-s06\/web\/handouts\/06.pdf\">http:\/\/www.cs.cmu.edu\/afs\/cs\/academic\/class\/15745-s06\/web\/handouts\/06.pdf<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We often face such a scenario in writing code: The flag is usually a constant, for example, an argument (e.g. true\/false; 0\/1) we passed to this function at runtime. The functions do_something() and do_something_else() may also be some code blocks and they&#8217;re almost identical. Well, if we know the flag will definitely not change within &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/markjohntaylor.com\/blog\/wordpress\/index.php\/2023\/01\/15\/loop-invariant-code-hoisting\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Loop-invariant Code Hoisting&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[15],"tags":[],"_links":{"self":[{"href":"https:\/\/markjohntaylor.com\/blog\/wordpress\/index.php\/wp-json\/wp\/v2\/posts\/1633"}],"collection":[{"href":"https:\/\/markjohntaylor.com\/blog\/wordpress\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/markjohntaylor.com\/blog\/wordpress\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/markjohntaylor.com\/blog\/wordpress\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/markjohntaylor.com\/blog\/wordpress\/index.php\/wp-json\/wp\/v2\/comments?post=1633"}],"version-history":[{"count":42,"href":"https:\/\/markjohntaylor.com\/blog\/wordpress\/index.php\/wp-json\/wp\/v2\/posts\/1633\/revisions"}],"predecessor-version":[{"id":1929,"href":"https:\/\/markjohntaylor.com\/blog\/wordpress\/index.php\/wp-json\/wp\/v2\/posts\/1633\/revisions\/1929"}],"wp:attachment":[{"href":"https:\/\/markjohntaylor.com\/blog\/wordpress\/index.php\/wp-json\/wp\/v2\/media?parent=1633"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/markjohntaylor.com\/blog\/wordpress\/index.php\/wp-json\/wp\/v2\/categories?post=1633"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/markjohntaylor.com\/blog\/wordpress\/index.php\/wp-json\/wp\/v2\/tags?post=1633"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}