@paniq
Last active November 12, 2023 14:46
Loop Integration

by Leonard Ritter

This article proposes loop integration, a generalization of loop unrolling to an arithmetic loop summation operator that permits static loop analysis and expansion with logarithmic complexity.

Rationale

In program optimization, we often encounter loops whose complexity we aim to reduce. Some loops can even be removed completely through folding of constant expressions. The classical technique for this is called loop unrolling: apply initial arguments to the loop body, evaluate, and if the result is constant, then reapply it again, and so forth, until the only path left leads out of the loop.

So unrolling is a form of decompression that speculates on further reducibility of the expressions within. When unrolling leaves residual expressions that cannot be folded, it ceases to be useful: we are now repeatedly producing the same expressions, blowing up the size of the program, which hurts both compile-time and runtime performance. Thus unrolling is often only attempted for a handful of iterations, as it is a cumbersome approach for loops whose upper iteration limit reaches, for example, 1000 iterations or more. In Scopes, the upper limit for compile-time unrolling is 64.

Iteration Arithmetic

Let's reason about this from a divide-and-conquer perspective. For a simpler argument, we consider an infinite loop that we wish to apply to itself a finite number of times. A single iteration of this loop can be abstracted to a function L(x), which takes as x the initial value init OR the value produced by a previous iteration, and produces a value for the next iteration OR undefined. We further understand that L(undefined) = undefined, so once the loop reaches an undefined result, we can terminate its application.
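To make this concrete, here is a minimal sketch in Python (not from the original article; the loop body and the names L and apply_n are illustrative) that models a single iteration as a function returning None for undefined, with the L(undefined) = undefined convention:

```python
# Hypothetical example: one iteration L(x) of a loop that counts i up to 5.
# None stands in for "undefined"; L(None) = None terminates application.

def L(i):
    if i is None:              # L(undefined) = undefined
        return None
    return i + 1 if i < 5 else None

def apply_n(L, n, init):
    """Naive unrolling: apply L to init n times."""
    x = init
    for _ in range(n):
        x = L(x)
    return x
```

Applying more iterations than the loop supports simply yields undefined, which is what makes the abstraction safe to over-apply.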

If we now wish to iterate this loop 15 times for example, the naive approach would be to unroll it, that is, evaluate the expression L(L(L(L(L(L(L(L(L(L(L(L(L(L(L(init))))))))))))))). But instead, let's define an arithmetic of loop iterations:

  1. Above notation of nested operations obfuscates that in a real evaluation scenario, loop iterations are time-ordered and form a linear sequence of expressions: %x0 = L init; %x1 = L %x0; ... and so on in static single assignment notation. This describes a concatenation of expressions, whose full expansion can often be reduced to fewer unique expressions.

  2. Therefore, the only real operator we need defines the concatenation of two iterations of the same loop function: L(L(x)) = (L + L)(x) = (2*L)(x), or shorter, 2*L. Like the numerical summation operator, the loop summation operator is commutative and associative since both arguments are the same. We colloquially refer to 2*L as doubling loop L.

  3. We can now abstractly compute any N*L in logarithmic time by breaking down N into a sum of powers of two and repeatedly applying our operator. We understand N*L as definite integral of loop L over N (iterations).
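The three rules above can be sketched as a generic integration routine, here in Python with function composition standing in for concatenation and None for undefined (the names concat and integrate are ours, not the article's):

```python
def concat(f, g):
    """Loop summation (f + g): run f, then g; undefined (None) propagates."""
    def h(x):
        y = f(x)
        return None if y is None else g(y)
    return h

def integrate(L, n):
    """Compute n*L in O(log n) summations by binary decomposition of n."""
    assert n >= 1
    result = None
    power = L                          # invariant: power == (2^k)*L
    while n:
        if n & 1:                      # this power of two is part of n
            result = power if result is None else concat(result, power)
        n >>= 1
        if n:
            power = concat(power, power)   # doubling: (2^k)*L -> (2^(k+1))*L
    return result
```

This is the same square-and-multiply structure used for fast exponentiation, applied to loop iterations instead of numbers.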

For our example of 15 iterations, we want to compute the new loop function 15*L, the loop integral of L over 15. We break this down so that

15*L = (2^3 + 2^2 + 2^1 + 2^0)*L = 2*2*2*L + 2*2*L + 2*L + L

Once we break down the operation into a series of static single assignments, it is easy to see how we can reuse previous computations to reach our desired result in 6 operations rather than 15:

%2L = L + L
%3L = %2L + L
%4L = %2L + %2L
%8L = %4L + %4L
%12L = %8L + %4L
%15L = %12L + %3L
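In Python, the chain above can be replayed directly, with a counter confirming that six summations suffice (concat and the increment loop body are illustrative stand-ins):

```python
ops = 0

def concat(f, g):
    """One loop summation; counted to verify the cost of the chain."""
    global ops
    ops += 1
    def h(x):
        y = f(x)
        return None if y is None else g(y)
    return h

L = lambda x: None if x is None else x + 1   # example body: increment

L2  = concat(L, L)       # %2L  = L + L
L3  = concat(L2, L)      # %3L  = %2L + L
L4  = concat(L2, L2)     # %4L  = %2L + %2L
L8  = concat(L4, L4)     # %8L  = %4L + %4L
L12 = concat(L8, L4)     # %12L = %8L + %4L
L15 = concat(L12, L3)    # %15L = %12L + %3L
```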

The complexity of loop integration is logarithmic. The exact number of operations performed depends on the bits set in the binary representation of the desired constant. To compute a loop integral over 2^N, only N operations are required, whereas a loop integral over 2^N-1 requires 2*(N-1) operations, since (2^N-1)*L = 2^0*L + 2^1*L + ... + 2^(N-1)*L requires us to perform a final summation of the individual products.
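Counting doublings and final summations, the operation count for an arbitrary N can be sketched as follows (the formula is ours, derived from the binary decomposition described above):

```python
def integration_ops(n):
    """Summations needed to build n*L: one doubling per bit position up
    to the highest set bit, plus one summation per additional set bit."""
    doublings = n.bit_length() - 1
    summations = bin(n).count("1") - 1
    return doublings + summations
```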

Bounding Space and Time

Doubling a loop is a complex composite operation. Without optimizing the concatenated expressions, the instruction size ratio of an integrated loop N*L to L is N:1. We aim to optimize the concatenated expressions in order to push the size ratio R below N:1; the ideal lower bound is 1:1, which predicts full elision to a single constant or source variable. While integrating a loop, we can already extrapolate an estimate of the final ratio in relation to the upper bound of iterations, and decide whether to continue integration or give up. The author proposes a threshold of 2*log2(N):1 for a large iteration count N, favoring loop integrals that are at least able to fold from linear to logarithmic complexity.

When no clear upper bound on loop iterations is defined, we need to deduce the limit from loop doubling; bisection can be employed to find the precise point when a concatenation of L(x) evaluates to undefined. If no such upper bound appears to exist, we must define a heuristic by which to decide on when to give up.

When to terminate a loop has historically been a tricky question, but here we can define sane upper bounds based on real-world scenarios. The maximum number of iterations permitted for any loop should be

N_max(L) = min(B_max, M_max / M_loop(L), T_max / T_loop(L))

where

  • B_max is the maximum number of iterations under which the loop values do not wrap around (for a signed 64-bit integer, B_max would be 2^63-1),
  • M_max is the maximum memory we want to progressively allocate within the loop,
  • M_loop(L) is the memory progressively allocated per iteration of L,
  • T_max is the time we are willing to wait for a result, and
  • T_loop(L) is the time required to execute one iteration of the loop.

T_loop can be estimated heuristically via T_loop(L) = I(L) * T_cycle, where I(L) is the number of instructions within the loop L and T_cycle is the mean time spent processing an instruction on the target machine.

Having determined N_max(L) for a loop from analysis, we can bound the maximum number of steps that should be spent on loop integrating L to 2*log2(N_max(L))-1. This makes it feasible to even attempt to fold seemingly infinite loops, as data width, memory capacity and time limits provide orientation.
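As a sketch, with illustrative units chosen by us (bytes for memory, nanoseconds for time, kept as integers to stay exact):

```python
import math

def n_max(b_max, m_max, m_loop, t_max_ns, t_loop_ns):
    """N_max(L) = min(B_max, M_max / M_loop, T_max / T_loop)."""
    return min(b_max, m_max // m_loop, t_max_ns // t_loop_ns)

def integration_budget(n):
    """Maximum summation steps to spend integrating L: 2*log2(N_max) - 1."""
    return 2 * int(math.log2(n)) - 1
```

For example, a loop allocating 64 bytes per iteration against a 1 GiB budget, with 20 instructions at roughly 1 ns each and a one-second time limit, would be memory-bound.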

Loop Folding

Let's look at loop folding on the example of a simple loop with a constant iteration count:

loop (i = 0; x = 1)
    if (i < 20)
        repeat i + 1, x * 2
    else
        break x

For the sake of simplicity, we ignore the break section and transform the loop to infinite form:

L(i, x) = (i < 20)?{i + 1, x * 2}:undefined
        (or simply)
        = (i < 20)?{i + 1, x * 2}
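In Python, the infinite form might look like this, using a tuple for the state {i, x} and None for undefined:

```python
def L(state):
    # (i < 20) ? {i + 1, x * 2} : undefined
    if state is None:
        return None
    i, x = state
    return (i + 1, x * 2) if i < 20 else None
```

Applying L repeatedly to (0, 1) yields (20, 1048576) after twenty iterations and undefined thereafter.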

After loop doubling, our loop looks as follows:

(2*L)(i, x) = (i < 20)?{(i + 1 < 20)?(i + 1 + 1), (i + 1 < 20)?(x * 2 * 2)}

We can now trivially fold expressions within the doubled loop:

            = (i < 20)?{(i + 1 < 20)?(i + 2), (i + 1 < 20)?(x * 4)} # 1 + 1 = 2, 2 * 2 = 4
            = (i + 1 < 20)?{i + 2, x * 4}                           # (i < N)?(i + 1 < N)?X == (i + 1 < N)?X

We perform doubling until we reach the undefined condition, by testing each doubled loop with the initial value:

L = (i < 20)?{i + 1, x * 2}
2*L = (i + 1 < 20)?{i + 2, x * 4}
4*L = (i + 3 < 20)?{i + 4, x * 16}
8*L = (i + 7 < 20)?{i + 8, x * 256}
16*L = (i + 15 < 20)?{i + 16, x * 65536}
32*L = undefined                            # (i_init + 31 < 20) == false

Now we can bisect towards the final iteration, succeeding after four steps:

16*L + 8*L = 24*L = undefined
16*L + 4*L = 20*L = (i + 19 < 20)?{i + 20, x * 1048576}
20*L + 2*L = undefined
20*L + L = undefined

The number of iterations is perfectly bounded, and our constant result can be evaluated:

(20*L)(0, 1) = {20, 1048576}
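The whole procedure, doubling until undefined and then bisecting back down, can be sketched as follows (helper names are ours; concat and the None convention are as in the earlier notation):

```python
def concat(f, g):
    """Loop summation: run f, then g; None (undefined) propagates."""
    def h(x):
        y = f(x)
        return None if y is None else g(y)
    return h

def fold(L, init):
    """Find the largest n with (n*L)(init) defined; return n and value."""
    if L(init) is None:
        return 0, init
    powers = [L]                            # powers[k] = (2^k)*L
    while powers[-1](init) is not None:     # doubling phase
        powers.append(concat(powers[-1], powers[-1]))
    powers.pop()                            # the last doubling overshot
    n, current = 2 ** (len(powers) - 1), powers[-1]
    for k in range(len(powers) - 2, -1, -1):     # bisection phase
        candidate = concat(current, powers[k])
        if candidate(init) is not None:
            current, n = candidate, n + 2 ** k
    return n, current(init)
```

Running fold on the example loop recovers both the iteration bound of 20 and the constant result.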

Loop Reduction

Loops that depend on unknowns cannot be folded, but they can at least be reduced. As a loop is an iterated function, the task breaks down to finding the n-th iterate f^n(x) of an iterative function f(x), or n*f in our notation.

Unfortunately, the task is as difficult as integrating regular functions, and for many formulas there is no closed-form solution. Hence we cannot reduce every possible loop. Wikipedia lists a few solutions suitable for rational and negative n, but fortunately we only need solutions for positive integer n.

Here are a few more simple ones:

f(x) -> C                   N*f -> C
f(x) -> x + C               N*f -> x + N*C
f(x) -> x * C               N*f -> x * (C^N)
f(x) -> x ^ C               N*f -> x ^ (C^N)
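These closed forms are easy to cross-check against brute-force iteration for positive integer N (the example constants below are ours):

```python
def iterate(f, n, x):
    """Brute-force n-th iterate f^n(x)."""
    for _ in range(n):
        x = f(x)
    return x
```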

All loop output variables can be optimized independently of each other. Hence a loop can be partially optimized, replacing iterative code with a closed-form formula where possible.

Another difficulty lies in analytically determining N from the loop function itself. The problem is finding the smallest N where (N*f != undefined) && ((N+1)*f == undefined). See this common form:

f(x) -> (x < K)?g(x)        N*f -> ((N-1)*g < K)?(N*g)
(or, more traditionally)
f(x,n) -> (g(x, n-1) < K)?(g(x, n))

For the inequality g(x, n-1) < K to first fail, n must be chosen so that g(x, n) >= K. Hence N = g^-1(x, K), with g^-1 being the inverse of N*g solved towards N. Most of the time, g(x) -> x + C, which simplifies the task to N = (K - x) / C, where we always round the result up to the next integer.
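For the common case g(x) -> x + C, this can be sketched as follows (the name iteration_count is ours):

```python
import math

def iteration_count(x, K, C):
    """Smallest N for which the guard x + N*C < K first fails:
    N = ceil((K - x) / C), clamped to zero."""
    return max(0, math.ceil((K - x) / C))
```

For the earlier folding example (K = 20, x = 0, C = 1), this yields N = 20 directly, without any doubling or bisection.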
