
[Compiler] 14. Data-Flow Analysis Framework


Overview

Data-flow analysis is a foundational technique for compiler optimization. It gathers information about a property of interest at each point in a program in order to determine whether an optimization can be applied safely at that point.

In this article, we cover the basic concepts of data-flow analysis, three representative analyses (reaching definitions, live variables, available expressions), iterative algorithms, and the basics of lattice theory.


1. Basic Concepts of Data-Flow Analysis

1.1 Goal of Analysis

At each point in a program, we answer questions such as:

  • Where was the value of variable x defined?
  • Is the value of variable x used later?
  • Has the expression a + b already been computed and is it available?

1.2 Program Points

The positions before and after each instruction in a basic block are called program points.

// Block B:
//  [point 1] d1: x = a + b [point 2]
//  [point 2] d2: y = x * c [point 3]
//
// IN[B] = information at point 1
// OUT[B] = information at point 3

1.3 Transfer Function

For each basic block B, the transfer function f_B transforms input information into output information:

OUT[B] = f_B(IN[B])

In most data-flow analyses, the transfer function takes the gen/kill form:

OUT[B] = gen[B] U (IN[B] - kill[B])
  • gen[B]: Information newly generated in block B
  • kill[B]: Information invalidated in block B

1.4 Meet/Join Operation

When a block has multiple predecessors, the meet operation combines the information from each predecessor.

// Union: Include if it holds on any path
IN[B] = U OUT[P]  (P is all predecessors of B)

// Intersection: Include only if it holds on all paths
IN[B] = n OUT[P]  (P is all predecessors of B)
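As a minimal sketch (the function names here are illustrative, not a fixed API), the gen/kill transfer function and the two meet operators can be written over Python sets:

```python
# Minimal sketch: gen/kill transfer function plus the two meet
# operators, using plain Python sets.

def transfer(in_set, gen, kill):
    # OUT[B] = gen[B] U (IN[B] - kill[B])
    return gen | (in_set - kill)

def meet_union(pred_outs):
    # Keep a fact if it holds on ANY incoming path.
    return set().union(*pred_outs)

def meet_intersection(pred_outs):
    # Keep a fact only if it holds on ALL incoming paths.
    pred_outs = list(pred_outs)
    result = set(pred_outs[0]) if pred_outs else set()
    for out in pred_outs[1:]:
        result &= out
    return result

print(transfer({'d1'}, {'d2'}, {'d1', 'd4'}))   # {'d2'}
```

The choice between `meet_union` and `meet_intersection` is exactly what distinguishes the analyses in the following sections.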

2. Reaching Definitions

2.1 Concept

A definition d reaches a program point p if there is a path from d to p along which the variable defined by d is not redefined.

2.2 gen/kill Sets

For instruction d: x = y op z:
  gen = definition d itself
  kill = all other definitions of x

gen/kill for the entire block B:

// When block B has instructions d1, d2, ..., dk in order
// Later instructions can overwrite earlier ones

gen[B] = gen(dk) U (gen(dk-1) - kill(dk)) U ...
kill[B] = kill(d1) U kill(d2) U ... U kill(dk)
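These two formulas fold into a single forward scan over the block; a small sketch (the helper name is chosen here for illustration, with values taken from block B3 of the example in section 2.4):

```python
# Folding per-instruction gen/kill sets into block-level sets with one
# forward scan: a later kill removes earlier gens, and kill[B]
# accumulates every instruction's kill set.

def block_gen_kill(insts):
    """insts: list of (gen_set, kill_set) pairs in program order."""
    gen, kill = set(), set()
    for g, k in insts:
        gen = g | (gen - k)
        kill = kill | k
    return gen, kill

# Block B3 from section 2.4: d3: y = x, then d4: x = y + 1
g, k = block_gen_kill([({'d3'}, set()), ({'d4'}, {'d1', 'd2'})])
print(sorted(g), sorted(k))   # ['d3', 'd4'] ['d1', 'd2']
```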

2.3 Data-Flow Equations

OUT[B] = gen[B] U (IN[B] - kill[B])
IN[B]  = U OUT[P]  (P is all predecessors of B)

The meet operation is union: if a definition reaches p along any path, its value might be the one used there, so it must conservatively be included.

2.4 Example

// Flow graph:
//   B1 -> B2 -> B3
//          ^     |
//          +-----+

// B1: d1: x = 5        gen={d1}, kill={d2,d4}
// B2: d2: x = x + 1    gen={d2}, kill={d1,d4}
// B3: d3: y = x         gen={d3}, kill={}
//     d4: x = y + 1     gen={d4}, kill={d1,d2}

// gen[B3] = {d4} U ({d3} - {d1,d2}) = {d3, d4}
// kill[B3] = {d1, d2}

Iterative computation:

Initial values: OUT[B] = {} (for all B)

Iteration 1:
  IN[B1] = {}
  OUT[B1] = {d1} U ({} - {d2,d4}) = {d1}

  IN[B2] = OUT[B1] U OUT[B3] = {d1} U {} = {d1}
  OUT[B2] = {d2} U ({d1} - {d1,d4}) = {d2}

  IN[B3] = OUT[B2] = {d2}
  OUT[B3] = {d3,d4} U ({d2} - {d1,d2}) = {d3,d4}

Iteration 2:
  IN[B2] = OUT[B1] U OUT[B3] = {d1} U {d3,d4} = {d1,d3,d4}
  OUT[B2] = {d2} U ({d1,d3,d4} - {d1,d4}) = {d2,d3}

  IN[B3] = OUT[B2] = {d2,d3}
  OUT[B3] = {d3,d4} U ({d2,d3} - {d1,d2}) = {d3,d4}

Iteration 3:
  IN[B2] = {d1} U {d3,d4} = {d1,d3,d4}  (no change)
  -> Converged!
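The hand computation above can be reproduced with a short round-robin solver (a sketch; the CFG encoding as dictionaries is a choice made here):

```python
# Round-robin iteration for the reaching-definitions example:
# B1 -> B2 -> B3 with a back edge B3 -> B2. gen/kill sets are the
# ones computed in the text.

gen = {'B1': {'d1'}, 'B2': {'d2'}, 'B3': {'d3', 'd4'}}
kill = {'B1': {'d2', 'd4'}, 'B2': {'d1', 'd4'}, 'B3': {'d1', 'd2'}}
preds = {'B1': [], 'B2': ['B1', 'B3'], 'B3': ['B2']}

IN = {b: set() for b in gen}
OUT = {b: set() for b in gen}

changed = True
while changed:
    changed = False
    for b in ['B1', 'B2', 'B3']:
        IN[b] = set().union(*[OUT[p] for p in preds[b]])
        new_out = gen[b] | (IN[b] - kill[b])
        if new_out != OUT[b]:
            OUT[b] = new_out
            changed = True

print(sorted(IN['B2']))    # ['d1', 'd3', 'd4']
print(sorted(OUT['B2']))   # ['d2', 'd3']
```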

3. Live-Variable Analysis

3.1 Concept

A variable x is live at a program point p if some path starting at p uses x before redefining it.

3.2 Backward Analysis

Live variable analysis is a backward analysis. It proceeds from the end of the program toward the beginning.

// Transfer function (backward)
IN[B]  = use[B] U (OUT[B] - def[B])
OUT[B] = U IN[S]  (S is all successors of B)
  • use[B]: Variables used before being defined in block B
  • def[B]: Variables defined in block B

3.3 Computing use/def Sets

// Block B:
//   x = y + z    // use: {y, z}, def: {x}
//   a = x + 1    // use: {x}, def: {a}  (x was defined above, so not in use[B])

// use[B] = {y, z}  (variables used before being defined)
// def[B] = {x, a}  (variables defined)

3.4 Example

// B1: a = 1; b = 2
// B2: c = a + b; d = c - a
// B3: d = b + d; (exit)

// use[B1] = {}, def[B1] = {a, b}
// use[B2] = {a, b}, def[B2] = {c, d}
// use[B3] = {b, d}, def[B3] = {d}

// Initial values: IN[B] = {} (for all B)

Iteration 1: (backward: B3 -> B2 -> B1)
  OUT[B3] = {}  (exit block)
  IN[B3] = {b, d} U ({} - {d}) = {b, d}

  OUT[B2] = IN[B3] = {b, d}
  IN[B2] = {a, b} U ({b, d} - {c, d}) = {a, b}

  OUT[B1] = IN[B2] = {a, b}
  IN[B1] = {} U ({a, b} - {a, b}) = {}

  Iteration 2 produces no changes -> converged!
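The backward direction only changes which equations are evaluated and in what order; a sketch for this example (`use_b`/`def_b` are names chosen here because `def` is a Python keyword):

```python
# Backward liveness solver for the straight-line example
# B1 -> B2 -> B3, using the use/def sets computed in the text.

use_b = {'B1': set(), 'B2': {'a', 'b'}, 'B3': {'b', 'd'}}
def_b = {'B1': {'a', 'b'}, 'B2': {'c', 'd'}, 'B3': {'d'}}
succs = {'B1': ['B2'], 'B2': ['B3'], 'B3': []}

IN = {b: set() for b in use_b}
OUT = {b: set() for b in use_b}

changed = True
while changed:
    changed = False
    for b in ['B3', 'B2', 'B1']:           # visit blocks backward
        OUT[b] = set().union(*[IN[s] for s in succs[b]])
        new_in = use_b[b] | (OUT[b] - def_b[b])
        if new_in != IN[b]:
            IN[b] = new_in
            changed = True

print(sorted(IN['B2']))   # ['a', 'b']
```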

3.5 Applications

  • Dead code elimination: if x is not live at the OUT point of the assignment x = expr and expr has no side effects, the assignment is dead code
  • Register allocation: Variables that are simultaneously live form edges in the interference graph

4. Available Expressions

4.1 Concept

An expression x op y is available at a program point p if, on every path reaching p, x op y has been computed and neither x nor y has been redefined since the last such computation.

4.2 Data-Flow Equations

OUT[B]   = e_gen[B] U (IN[B] - e_kill[B])
IN[B]    = n OUT[P]  (P is all predecessors of B)
IN[ENTRY] = {}

The meet operation is intersection because: the expression must have been computed on all paths to be safely reused.

  • e_gen[B]: Expressions computed in block B whose operands are not subsequently redefined
  • e_kill[B]: When variable x is defined in block B, all expressions containing x as an operand

4.3 Example

// B1: t1 = a + b; c = t1
//     e_gen[B1] = {a+b}
//     e_kill[B1] = {expressions containing t1 or c}

// B2: t2 = a + b; d = t2 - c
//     e_gen[B2] = {a+b, t2-c}
//     e_kill[B2] = {expressions containing t2 or d}

// B3: a = 1
//     e_gen[B3] = {}
//     e_kill[B3] = {all expressions containing a} = {a+b}

4.4 Importance of Initial Values

Initial value settings are important in available expressions analysis:

OUT[ENTRY] = {}     // No expressions are available at the entry point
OUT[B] = U          // Other blocks are initialized to the universal set (all expressions)
                    // (Since intersection is used, we must start from a large value to converge correctly)
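To see why this matters, consider a one-expression universe and a loop block; the tiny CFG and sets below are invented for illustration:

```python
# A loop block B2 (edges B1 -> B2 and B2 -> B2) under the intersection
# meet. Starting OUT[B2] at the universal set U lets a+b survive the
# loop; starting at {} wrongly concludes nothing is available.

U = {'a+b'}
e_gen = {'B1': {'a+b'}, 'B2': set()}
e_kill = {'B1': set(), 'B2': set()}

def solve(out_b2_init):
    out_b1 = e_gen['B1']              # OUT[B1] is immediately stable
    out_b2 = set(out_b2_init)
    for _ in range(10):               # more than enough to converge
        in_b2 = out_b1 & out_b2       # intersection over B2's preds
        out_b2 = e_gen['B2'] | (in_b2 - e_kill['B2'])
    return out_b2

print(solve(U))       # {'a+b'}  (correct: a+b is available in the loop)
print(solve(set()))   # set()    (safe but overly pessimistic)
```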

5. Iterative Algorithm

5.1 General Iterative Algorithm

Algorithm: Forward Data-Flow Analysis

Input: Flow graph, gen/kill sets, meet operation
Output: IN/OUT sets for each block

1. Initialization
   OUT[ENTRY] = boundary value (usually {})
   for (all blocks B != ENTRY)
       OUT[B] = initial value ({} for a union meet, U for an intersection meet)

2. Iteration
   changed = true
   while (changed):
       changed = false
       for (all blocks B != ENTRY):
           IN[B] = meet(OUT[P] for all predecessors P)
           oldOUT = OUT[B]
           OUT[B] = gen[B] U (IN[B] - kill[B])
           if OUT[B] != oldOUT:
               changed = true
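The pseudocode above translates almost directly into Python. The sketch below is parameterized by the meet operator; the function name and the dictionary-based CFG encoding are choices made here, not a fixed API:

```python
# Generic forward solver: a direct rendering of steps 1 and 2 above.
# OUT for every block starts at the supplied initial value; ENTRY's
# OUT is part of out_init and is never recomputed.

def forward_analysis(blocks, preds, gen, kill, meet, out_init):
    OUT = {b: set(s) for b, s in out_init.items()}
    IN = {}
    changed = True
    while changed:
        changed = False
        for b in blocks:                       # interior blocks only
            IN[b] = meet([OUT[p] for p in preds[b]])
            new_out = gen[b] | (IN[b] - kill[b])
            if new_out != OUT[b]:
                OUT[b], changed = new_out, True
    return IN, OUT

# Reaching definitions (meet = union) on the CFG from section 2.4:
IN, OUT = forward_analysis(
    blocks=['B1', 'B2', 'B3'],
    preds={'B1': ['ENTRY'], 'B2': ['B1', 'B3'], 'B3': ['B2']},
    gen={'B1': {'d1'}, 'B2': {'d2'}, 'B3': {'d3', 'd4'}},
    kill={'B1': {'d2', 'd4'}, 'B2': {'d1', 'd4'}, 'B3': {'d1', 'd2'}},
    meet=lambda outs: set().union(*outs),
    out_init={'ENTRY': set(), 'B1': set(), 'B2': set(), 'B3': set()},
)
print(sorted(OUT['B2']))   # ['d2', 'd3']
```

Swapping the `meet` argument for an intersection and changing `out_init` to the universal set gives available expressions with no other changes.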

5.2 Convergence Guarantee

Why the iterative algorithm always terminates:

  1. gen/kill transfer functions are monotone - as input grows, output grows (or stays the same)
  2. The set of data-flow values is finite
  3. Meet operations (union or intersection) are also monotone

Therefore, OUT values change in only one direction at each iteration and eventually reach a stable state within the finite range.

5.3 Convergence Speed

In general, the number of iterations is proportional to the depth of the flow graph, which corresponds roughly to the loop nesting depth. When blocks are visited in reverse postorder, most real programs converge within a few passes.


6. Comparison of Three Analyses

| Property | Reaching Definitions | Live Variables | Available Expressions |
| --- | --- | --- | --- |
| Direction | Forward | Backward | Forward |
| Meet operation | Union | Union | Intersection |
| Transfer function | gen U (IN - kill) | use U (OUT - def) | e_gen U (IN - e_kill) |
| Initial value | Empty set | Empty set | Universal set (except ENTRY) |
| Application | Constant propagation | Dead code elimination | Common subexpression elimination |

7. Basics of Lattice Theory

7.1 Partially Ordered Set

The mathematical foundation of data-flow analysis is Lattice Theory.

A partially ordered set consists of a set S and a relation <=, satisfying:

  • Reflexivity: a <= a
  • Antisymmetry: if a <= b and b <= a then a = b
  • Transitivity: if a <= b and b <= c then a <= c

7.2 Lattice

In a partially ordered set, for any two elements a and b:

  • Least upper bound (join): The smallest element that is greater than or equal to both a and b
  • Greatest lower bound (meet): The largest element that is less than or equal to both a and b

If both always exist, it is called a lattice.
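The powerset of a finite set, ordered by inclusion, is the standard example: union is the join and intersection is the meet. A quick concrete check:

```python
# In the powerset of {1, 2, 3} ordered by subset inclusion, union is
# the least upper bound (join) and intersection is the greatest lower
# bound (meet).
a, b = {1, 2}, {2, 3}

join = a | b   # smallest set containing both a and b
meet = a & b   # largest set contained in both a and b

assert a <= join and b <= join      # <= on Python sets is subset
assert meet <= a and meet <= b
print(sorted(join), sorted(meet))   # [1, 2, 3] [2]
```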

7.3 Data-Flow Analysis and Lattices

// Lattice for reaching definitions analysis
// Elements: subsets of definitions (2^D)
// Meet operation: union (U)
// Order: x <= y iff x U y = x, i.e. reverse set inclusion (superset)
// Top: empty set ({})
// Bottom: universal set (D)

// Lattice for available expressions analysis
// Elements: subsets of expressions (2^E)
// Meet operation: intersection (n)
// Order: x <= y iff x n y = x, i.e. set inclusion (subset)
// Top: universal set (E)
// Bottom: empty set ({})

7.4 Fixed-Point Theorem

Knaster-Tarski Fixed-Point Theorem: a monotone function f on a complete lattice has both a least and a greatest fixed point. Starting from the top element, the iterative algorithm converges to the maximum fixed point (MFP) of the data-flow equations.

This is the mathematical foundation that guarantees the correctness and termination of the iterative algorithm.

7.5 Safety and Precision

// Conservative analysis:
// - Approximates toward the safe side
// - May miss optimization opportunities, but never performs incorrect optimizations

// Reaching definitions: analyzes that more definitions reach than actually do
//   -> May not be able to apply constant propagation in some cases (safe)

// Available expressions: analyzes that fewer expressions are available than actually are
//   -> May not be able to apply common subexpression elimination in some cases (safe)

8. Practical Considerations

8.1 Bit Vector Representation

Representing data-flow values as bit vectors is efficient:

// Given definitions d1, d2, d3, d4
// gen[B] = {d1, d3} -> bit vector: 1010
// kill[B] = {d2}    -> bit vector: 0100

// Transfer function computation:
// OUT = gen | (IN & ~kill)  (processed at once with bitwise operations)

Bitwise operations are very fast at the hardware level, making them efficient even for large programs.
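A sketch of the same transfer function over Python integers (the bit assignment, with d1 as the least-significant bit, is a choice made here):

```python
# Definitions d1..d4 mapped to bits 0..3 (d1 = least-significant bit).
# Set union and difference become single bitwise operations.

NUM_DEFS = 4
MASK = (1 << NUM_DEFS) - 1            # 0b1111

def to_bits(defs, index):
    v = 0
    for d in defs:
        v |= 1 << index[d]
    return v

index = {'d1': 0, 'd2': 1, 'd3': 2, 'd4': 3}
gen = to_bits({'d1', 'd3'}, index)    # 0b0101
kill = to_bits({'d2'}, index)         # 0b0010

IN = 0b1111                           # suppose every definition reaches
OUT = gen | (IN & ~kill & MASK)       # OUT = gen | (IN & ~kill)
print(bin(OUT))                       # 0b1101: every definition but d2
```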

8.2 Worklist Algorithm

Using a worklist algorithm instead of simple iteration is more efficient:

Algorithm: Worklist-Based Iteration

1. Add all blocks to the worklist
2. While the worklist is not empty:
   a. Remove block B from the worklist
   b. Recompute OUT[B]
   c. If OUT[B] has changed, add B's successors to the worklist

By recomputing only blocks affected by changes, unnecessary computations are reduced.
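A sketch of the worklist loop for reaching definitions, reusing the CFG from section 2.4 (the deque-based FIFO queue is one reasonable choice; priority orders such as reverse postorder converge faster):

```python
from collections import deque

# Worklist-driven reaching definitions: a block is revisited only when
# one of its predecessors' OUT sets has changed.

gen = {'B1': {'d1'}, 'B2': {'d2'}, 'B3': {'d3', 'd4'}}
kill = {'B1': {'d2', 'd4'}, 'B2': {'d1', 'd4'}, 'B3': {'d1', 'd2'}}
preds = {'B1': [], 'B2': ['B1', 'B3'], 'B3': ['B2']}
succs = {'B1': ['B2'], 'B2': ['B3'], 'B3': ['B2']}

OUT = {b: set() for b in gen}
worklist = deque(gen)                     # 1. start with every block
while worklist:                           # 2. drain the worklist
    b = worklist.popleft()
    in_b = set().union(*[OUT[p] for p in preds[b]])
    new_out = gen[b] | (in_b - kill[b])
    if new_out != OUT[b]:                 # changed: requeue successors
        OUT[b] = new_out
        worklist.extend(s for s in succs[b] if s not in worklist)

print(sorted(OUT['B2']))   # ['d2', 'd3']
```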


Summary

| Concept | Description |
| --- | --- |
| Transfer function | Transforms a basic block's input information into its output information |
| gen/kill sets | Data-flow information generated/invalidated in a block |
| Meet operation | Combines information from multiple paths (union or intersection) |
| Reaching definitions | Analyzes whether a variable's definition reaches a specific point |
| Live variables | Analyzes whether a variable's value is used later |
| Available expressions | Analyzes whether an expression has been computed and is reusable |
| Iterative algorithm | Iteratively evaluates the data-flow equations until a fixed point |
| Lattice theory | Mathematical foundation of data-flow analysis |

Data-flow analysis serves as the theoretical foundation for compiler optimization, guaranteeing the safety of optimizations. Since it has mathematical backing from lattice theory, the iterative algorithm is guaranteed to converge and provides conservatively safe results. In the next article, we will examine loop optimization in detail using these analysis techniques.