
[Compiler] 15. Loop Optimization and Code Motion


Overview

Most of a program's execution time is spent in loops. Therefore, optimizing loops is the most effective way to improve overall program performance. In this article, we systematically examine techniques for identifying and optimizing loops.


1. Dominators

1.1 Definition

In a flow graph, node d dominates node n if every path from the entry point to n passes through d.

This is denoted as d dom n.

Properties:

  • Every node dominates itself (reflexive)
  • If d dom n and n dom d then d = n (antisymmetric)
  • If d dom n and n dom p then d dom p (transitive)

1.2 Dominator Tree

A tree representation of the dominator relation. Each node's parent is its immediate dominator.

// Flow graph (edges: 1->2, 2->3, 2->5, 5->3, 5->6, 3->4, 4->6, 6->7):
//   1 -> 2 -> 3 -> 4
//        |    ^    |
//        v    |    v
//        5 ---+--> 6
//                  |
//                  v
//                  7

// Dominator tree:
//        1
//        |
//        2
//      / | \
//     3  5  6
//     |     |
//     4     7
//
// Interpretation: 1 dom all nodes
//                 2 dom 3,4,5,6,7
//                 3 dom 4
//                 6 dom 7

1.3 Dominator Computation Algorithm

Dominators are computed using an iterative algorithm:

Algorithm: Dominator Computation

Initialization:
  DOM[entry] = {entry}
  DOM[n] = set of all nodes (n != entry)

Iterate (until convergence):
  for (all nodes n != entry):
    DOM[n] = {n} U ( ∩ DOM[p] )   // p ranges over all predecessors of n
                                  // i.e., n plus the intersection of its
                                  // predecessors' dominator sets
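The iterative algorithm can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation; the predecessor map below is the example graph of section 1.2 (including the 5 -> 6 edge implied by its dominator tree).

```python
# Iterative dominator computation.
# preds maps each node to the list of its predecessors.
preds = {1: [], 2: [1], 3: [2, 5], 4: [3], 5: [2], 6: [4, 5], 7: [6]}

def compute_dominators(preds, entry):
    nodes = set(preds)
    # Initialization: DOM[entry] = {entry}, DOM[n] = all nodes otherwise
    dom = {n: ({entry} if n == entry else set(nodes)) for n in nodes}
    changed = True
    while changed:                      # iterate until convergence
        changed = False
        for n in nodes - {entry}:
            # DOM[n] = {n} U (intersection of DOM[p] over predecessors p)
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

dom = compute_dominators(preds, entry=1)
print(sorted(dom[6]))  # [1, 2, 6]
```

Real compilers typically visit nodes in reverse postorder so the sets converge in very few passes, but any visit order reaches the same fixed point.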

1.4 Dominance Frontier

The dominance frontier DF(n) of node n is the boundary just outside the region dominated by n.

DF(n) = { y | n dominates some predecessor of y, but n does not strictly dominate y }

The dominance frontier is critically used to determine where to insert phi functions in SSA construction.
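The definition translates directly into code. Below is a small sketch that computes DF straight from the dominator sets of the section 1.2 example (the dominator sets are written out by hand here; a production compiler would derive DF from the dominator tree instead):

```python
# Predecessors and dominator sets for the example graph of section 1.2
preds = {1: [], 2: [1], 3: [2, 5], 4: [3], 5: [2], 6: [4, 5], 7: [6]}
dom = {1: {1}, 2: {1, 2}, 3: {1, 2, 3}, 4: {1, 2, 3, 4},
       5: {1, 2, 5}, 6: {1, 2, 6}, 7: {1, 2, 6, 7}}

def dominance_frontier(preds, dom):
    df = {n: set() for n in preds}
    for y in preds:
        for p in preds[y]:
            for n in dom[p]:                   # n dominates predecessor p
                if n == y or n not in dom[y]:  # n does not strictly dominate y
                    df[n].add(y)
    return df

df = dominance_frontier(preds, dom)
print(sorted(df[5]))  # [3, 6]: the region dominated by 5 is bounded by 3 and 6
```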


2. Natural Loops

2.1 Back Edge

In a flow graph, an edge n -> d is called a back edge if d dominates n.

// In a flow graph:
//   1 -> 2 -> 3 -> 4
//        ^         |
//        +---------+   // 4 -> 2 is a back edge (2 dom 4)

2.2 Definition of Natural Loop

The natural loop for back edge n -> d:

  • Header: Node d (the sole entry point of the loop)
  • Body: The set of all nodes that can reach n without going through d, plus d itself

Algorithm: Natural Loop Construction (back edge n -> d)

1. loop = {d}
2. stack = {n}  (if n is different from d)
3. while stack is not empty:
     m = pop from stack
     if m is not in loop:
         add m to loop
         push all predecessors of m onto stack
4. Result: loop
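The worklist algorithm above can be sketched in Python; the graph below is the one used in the section 2.3 example, with block names as strings.

```python
# Natural loop construction for a back edge n -> d.
def natural_loop(preds, n, d):
    loop = {d}                       # the header is always in the loop
    stack = [n] if n != d else []
    while stack:
        m = stack.pop()
        if m not in loop:
            loop.add(m)
            stack.extend(preds[m])   # anything that reaches m reaches n
    return loop

# B1 -> B2 -> B3 -> B4, back edge B4 -> B2; B2 -> B5, back edge B5 -> B2
preds = {"B1": [], "B2": ["B1", "B4", "B5"], "B3": ["B2"],
         "B4": ["B3"], "B5": ["B2"]}
print(sorted(natural_loop(preds, "B4", "B2")))  # ['B2', 'B3', 'B4']
print(sorted(natural_loop(preds, "B5", "B2")))  # ['B2', 'B5']
```

Note that the search never pushes predecessors of d, because d is placed in the loop before the search starts; this is what enforces "without going through d".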

2.3 Example

// Flow graph:
//   B1 -> B2 -> B3 -> B4
//          ^          |
//          +----------+  (back edge: B4 -> B2)
//          |
//          +-- B5 (B2 -> B5, B5 -> B2 is also a back edge)

// Natural loop for back edge B4 -> B2:
//   {B2, B3, B4}  (B3 and B4 can reach B4 without going through B2, plus the header B2)

// Natural loop for back edge B5 -> B2:
//   {B2, B5}

2.4 Loop Nesting

If two loops share the same header, they are merged and treated as a single loop. Otherwise, one is completely contained within the other (nested) or they are disjoint.

// Outer loop: {B1, B2, B3, B4, B5}
// Inner loop: {B3, B4}
//
// Optimize the inner loop first, then the outer loop
// (from inside out)

3. Loop-Invariant Code Detection and Motion

3.1 Loop-Invariant Code Detection

For instruction s: x = y op z to be loop invariant, each operand (y and z) must satisfy one of the following:

  1. The operand is a constant
  2. All definitions of the operand that reach s are outside the loop
  3. The operand has exactly one reaching definition, and that definition has
     already been identified as loop invariant

Algorithm: Loop-Invariant Code Detection

1. Mark instructions whose operands are constants or defined only outside
   the loop as "loop invariant"
2. Repeat:
   - Newly mark instructions as "loop invariant" if all operands are constants,
     outside definitions, or have exactly one definition already marked as loop invariant
3. Terminate when no new markings are made
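The iteration can be sketched over a toy three-address representation. This is a simplification: each instruction is a (dest, op1, op2) tuple, integer operands count as constants, and the array load t2 = a[t1] is modeled as an ordinary operation on a and t1 (i.e., we assume there are no stores to a inside the loop).

```python
# Iterative loop-invariant detection over (dest, op1, op2) instructions.
def find_invariant(instrs, defined_outside):
    # Indices of the definitions of each variable inside the loop
    defs_in_loop = {}
    for idx, (dest, _, _) in enumerate(instrs):
        defs_in_loop.setdefault(dest, []).append(idx)
    invariant = set()

    def operand_ok(v):
        if isinstance(v, int):                 # constant
            return True
        if v not in defs_in_loop:              # all defs outside the loop
            return v in defined_outside
        d = defs_in_loop[v]                    # single def, already marked
        return len(d) == 1 and d[0] in invariant

    changed = True
    while changed:                             # repeat until no new markings
        changed = False
        for idx, (dest, a, b) in enumerate(instrs):
            if idx not in invariant and operand_ok(a) and operand_ok(b):
                invariant.add(idx)
                changed = True
    return invariant

# Loop body from section 3.2: i=i+1, t1=n*4, t2=a[t1], t3=t2+i
instrs = [("i", "i", 1), ("t1", "n", 4), ("t2", "a", "t1"), ("t3", "t2", "i")]
print(sorted(find_invariant(instrs, {"n", "a"})))  # [1, 2]
```

Note that i = i + 1 is never marked: its operand i has a single in-loop definition, but that definition is the instruction itself, which can only be marked if it is already marked.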

3.2 Example

// Loop:
//   B2: i = i + 1
//       t1 = n * 4       // n is defined outside the loop -> loop invariant
//       t2 = a[t1]       // t1 is loop invariant and a is invariant in loop -> loop invariant
//       t3 = t2 + i
//       if i < 100 goto B2

// Pass 1: Mark t1 = n * 4 (n is an outside definition)
// Pass 2: Mark t2 = a[t1] (t1 is loop invariant)
// Pass 3: No more -> terminate

3.3 Code Motion Conditions

To move a loop-invariant instruction s: x = y op z outside the loop (to the preheader), all of the following conditions must be satisfied:

Condition 1: The block containing s dominates all exits of the loop

// Safe case:
//   preheader -> [B1: s: x=... ] -> B2(exit)
//   B1 dominates the exit, so motion is possible

// Dangerous case:
//   preheader -> B1 -> B3(exit)
//                |
//                v
//               B2: s: x=...
//   B2 does not dominate exit B3 -> motion not possible
//   (a path exists where s is not executed)

Condition 2: x is not defined elsewhere within the loop

Condition 3: Every use of x in the loop is reached only by the definition at s

3.4 Preheader Insertion

A preheader block is inserted just before the loop header for code motion.

// Before motion:
//   ... -> header -> body -> ...
//           ^               |
//           +---------------+

// After preheader insertion:
//   ... -> preheader -> header -> body -> ...
//                        ^               |
//                        +---------------+

// Place loop-invariant code in the preheader:
//   preheader:
//     t1 = n * 4       // Moved code
//     t2 = a[t1]       // Moved code

4. Induction Variable Detection and Elimination

4.1 Basic Induction Variable

A variable i whose only definitions within the loop have the form i = i + c or i = i - c (where c is a loop constant) is called a basic induction variable.

4.2 Derived Induction Variable

A variable j that can be expressed as j = c1 * i + c2 (where c1, c2 are loop constants) for a basic induction variable i is called a derived induction variable.

Triple notation for induction variables: (i, c1, c2) means j = c1 * i + c2.

// Example:
// i = i + 1        // i is a basic induction variable
// t1 = 4 * i       // t1 is a derived induction variable: (i, 4, 0)
// t2 = t1 + base   // t2 is a derived induction variable: (i, 4, base)

4.3 Induction Variable Detection Algorithm

Algorithm: Induction Variable Detection

Step 1: Detect basic induction variables
  - Find variables defined only in the form i = i + c within the loop

Step 2: Detect derived induction variables
  - j = c1 * i (i is basic induction variable, c1 is loop constant)
    -> j's triple: (i, c1, 0)
  - j = k + c2 (k is induction variable (i, a, b), c2 is loop constant)
    -> j's triple: (i, a, b+c2)
  - j = c1 * k (k is induction variable (i, a, b))
    -> j's triple: (i, c1*a, c1*b)
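The detection rules can be sketched as follows. For simplicity this version accepts only numeric loop constants, so a symbolic constant such as base is stood in for by a number, and derived variables are found in a single forward pass over the loop body (a fixed-point iteration would be more robust):

```python
# Induction variable detection over (dest, left, op, right) instructions.
# Numeric operands are loop constants.
def detect_induction_vars(instrs):
    defs = {}
    for dest, l, op, r in instrs:
        defs.setdefault(dest, []).append((l, op, r))
    triples = {}
    # Step 1: basic IVs -- the sole definition has the form i = i + c
    for v, ds in defs.items():
        if len(ds) == 1 and ds[0][1] == "+" and ds[0][0] == v \
                and isinstance(ds[0][2], int):
            triples[v] = (v, 1, 0)                # i = 1*i + 0
    # Step 2: derived IVs, following the rules above
    for dest, l, op, r in instrs:
        if dest in triples:
            continue
        if op == "*" and isinstance(l, int) and r in triples:
            i, a, b = triples[r]
            triples[dest] = (i, l * a, l * b)     # j = c1 * k
        elif op == "+" and l in triples and isinstance(r, int):
            i, a, b = triples[l]
            triples[dest] = (i, a, b + r)         # j = k + c2
    return triples

# i = i + 1; t1 = 4 * i; t2 = t1 + 100  (100 standing in for `base`)
instrs = [("i", "i", "+", 1), ("t1", 4, "*", "i"), ("t2", "t1", "+", 100)]
print(detect_induction_vars(instrs))
# {'i': ('i', 1, 0), 't1': ('i', 4, 0), 't2': ('i', 4, 100)}
```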

4.4 Applying Strength Reduction

Convert the multiplication of derived induction variables to addition.

// Original code:
// i = 0
// loop:
//   t1 = 4 * i       // Multiplication
//   a[t1] = 0
//   i = i + 1
//   if i < n goto loop

// After strength reduction:
// i = 0
// t1 = 0             // Initial value: 4 * 0 = 0
// loop:
//   a[t1] = 0
//   i = i + 1
//   t1 = t1 + 4      // Multiplication -> addition
//   if i < n goto loop
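A quick way to convince yourself the transformation is safe is to transliterate both versions and compare the sequence of t1 values they produce; a Python sketch:

```python
# Original loop: t1 recomputed by multiplication each iteration
def original(n):
    out, i = [], 0
    while i < n:
        t1 = 4 * i          # multiplication every iteration
        out.append(t1)
        i += 1
    return out

# Strength-reduced loop: t1 advanced by addition
def strength_reduced(n):
    out, i, t1 = [], 0, 0   # t1 initialized to 4 * 0 = 0
    while i < n:
        out.append(t1)
        i += 1
        t1 += 4             # multiplication replaced by addition
    return out

assert original(100) == strength_reduced(100)
print(strength_reduced(5))  # [0, 4, 8, 12, 16]
```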

4.5 Induction Variable Elimination

After strength reduction, if the basic induction variable is used only in the loop termination condition, it can be replaced with the derived induction variable.

// After strength reduction:
// i = 0
// t1 = 0
// loop:
//   a[t1] = 0
//   i = i + 1         // i is used only in comparison
//   t1 = t1 + 4
//   if i < n goto loop

// After induction variable elimination:
// t1 = 0
// limit = 4 * n       // Termination condition conversion (computed once outside loop)
// loop:
//   a[t1] = 0
//   t1 = t1 + 4
//   if t1 < limit goto loop
//
// i is completely eliminated!

5. Loop Unrolling

5.1 Concept

A technique that copies the loop body multiple times to reduce the number of iterations.

5.2 Basic Loop Unrolling

// Original (100 iterations)
for (i = 0; i < 100; i++) {
    a[i] = 0;
}

// 4x unrolling (25 iterations)
for (i = 0; i < 100; i += 4) {
    a[i]   = 0;
    a[i+1] = 0;
    a[i+2] = 0;
    a[i+3] = 0;
}

5.3 Advantages

  1. Reduced loop overhead: Fewer branch, comparison, and increment instructions
  2. Increased instruction scheduling opportunities: More independent instructions improve pipeline utilization
  3. Improved register utilization: Value reuse between consecutive operations

5.4 Disadvantages and Considerations

// When the iteration count is not a multiple of the unrolling factor
// Remainder handling is needed

// When n might not be a multiple of 4
i = 0
// Main loop (4x unrolling)
while (i + 3 < n) {
    a[i] = 0; a[i+1] = 0; a[i+2] = 0; a[i+3] = 0;
    i += 4;
}
// Remainder handling
while (i < n) {
    a[i] = 0;
    i++;
}

  • Code size increase: May negatively impact cache
  • Register pressure: Excessive unrolling may cause register spills
  • Typically 2x to 8x unrolling is effective
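The main-loop-plus-remainder pattern above is easy to check directly; here it is transliterated into Python for an iteration count that is not a multiple of 4:

```python
# 4x-unrolled array zeroing with remainder handling.
def zero_unrolled(a):
    n, i = len(a), 0
    while i + 3 < n:                    # main loop: 4 elements per pass
        a[i] = a[i+1] = a[i+2] = a[i+3] = 0
        i += 4
    while i < n:                        # remainder: at most 3 elements
        a[i] = 0
        i += 1
    return a

print(zero_unrolled([9] * 10))  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```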

6. Comprehensive Loop Optimization Example

Let us examine the entire loop optimization process through a single example.

// Original code (initializing a row of a matrix)
i = 0
loop:
    t1 = i * 8           // Induction variable (multiplication)
    t2 = base + t1       // Induction variable
    *t2 = 0.0
    i = i + 1
    if i < n goto loop

// Step 1: Loop-invariant code motion
//   (none applicable in this example)

// Step 2: Strength reduction
i = 0
t1 = 0                   // Initial value
t2 = base                // Initial value
loop:
    *t2 = 0.0
    i = i + 1
    t1 = t1 + 8          // Multiplication -> addition
    t2 = t2 + 8          // Multiplication -> addition
    if i < n goto loop

// Step 3: Induction variable elimination (remove i and t1)
t2 = base
limit = base + n * 8     // Computed once outside the loop
loop:
    *t2 = 0.0
    t2 = t2 + 8
    if t2 < limit goto loop

// Step 4: Loop unrolling (2x)
t2 = base
limit = base + n * 8
loop:
    *t2 = 0.0
    *(t2 + 8) = 0.0
    t2 = t2 + 16
    if t2 < limit goto loop
// (remainder handling code omitted)

The original required 1 multiplication, 2 additions, 1 store, and 1 comparison per iteration; after optimization, each iteration needs only 1 addition, 2 stores, and 1 comparison, while covering two array elements at once.


Summary

  • Dominator: A node through which all paths from the entry point to a specific node pass
  • Dominator tree: A tree structure representing the dominator relation
  • Back edge: An edge going back to a dominator (forming a loop)
  • Natural loop: A loop defined by a back edge
  • Loop-invariant code motion: Moving code that computes the same value every iteration outside the loop
  • Induction variable: A variable that changes by a constant amount in a loop
  • Strength reduction: Replacing multiplication with addition
  • Induction variable elimination: Removing induction variables that become unnecessary
  • Loop unrolling: Copying the loop body to reduce the number of iterations

Loop optimization is the area of compiler optimization that yields the greatest performance improvements. The basic strategy is to identify loops through dominator analysis, lighten the loop body through invariant code motion and strength reduction, and reduce overhead through loop unrolling. In the next article, we will cover instruction-level parallelism and scheduling.