Authors
- Youngju Kim (@fjvbn20031)
Overview
Data-flow analysis is a foundational technique for compiler optimization. It collects information about specific properties at each point in a program in order to determine whether an optimization can be applied safely.
In this article, we cover the basic concepts of data-flow analysis, three representative analyses (reaching definitions, live variables, available expressions), iterative algorithms, and the basics of lattice theory.
1. Basic Concepts of Data-Flow Analysis
1.1 Goal of Analysis
At each point in a program, we answer questions such as:
- Where was the value of variable x defined?
- Is the value of variable x used later?
- Has the expression a + b already been computed, and is it available?
1.2 Program Points
The positions before and after each instruction in a basic block are called program points.
// Block B:
// [point 1] d1: x = a + b [point 2]
// [point 2] d2: y = x * c [point 3]
//
// IN[B] = information at point 1
// OUT[B] = information at point 3
1.3 Transfer Function
For each basic block B, the transfer function f_B transforms input information into output information:
OUT[B] = f_B(IN[B])
In most data-flow analyses, the transfer function takes the gen/kill form:
OUT[B] = gen[B] U (IN[B] - kill[B])
- gen[B]: information newly generated in block B
- kill[B]: information invalidated in block B
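The gen/kill form maps directly onto set operations. A minimal Python sketch (the block and definition names are illustrative, not from a specific program):

```python
def transfer(in_set, gen, kill):
    """Gen/kill transfer function: OUT = gen U (IN - kill)."""
    return gen | (in_set - kill)

# Hypothetical block: generates d1, kills the other definitions of x (d2, d4)
gen_B = {"d1"}
kill_B = {"d2", "d4"}
out_B = transfer({"d2", "d3"}, gen_B, kill_B)
# d2 is removed by kill, d3 passes through, d1 is added by gen
```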
1.4 Meet/Join Operation
When a block has multiple predecessors, the meet operation combines the information from each predecessor.
// Union: Include if it holds on any path
IN[B] = U OUT[P] (P is all predecessors of B)
// Intersection: Include only if it holds on all paths
IN[B] = n OUT[P] (P is all predecessors of B)
2. Reaching Definitions
2.1 Concept
Definition d reaches program point p if there exists a path from d to p along which the variable defined by d is not redefined.
2.2 gen/kill Sets
For instruction d: x = y op z:
gen = definition d itself
kill = all other definitions of x
gen/kill for the entire block B:
// When block B has instructions d1, d2, ..., dk in order
// Later instructions can overwrite earlier ones
gen[B] = gen(dk) U (gen(dk-1) - kill(dk)) U ...
kill[B] = kill(d1) U kill(d2) U ... U kill(dk)
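This composition can be written as a left-to-right fold over the instructions of a block. The sketch below recomputes gen[B3]/kill[B3] for the block B3 of the example in section 2.4:

```python
def block_gen_kill(insts):
    """Fold per-instruction (gen, kill) pairs, given in execution order,
    into block-level gen/kill sets. A later instruction removes the gens
    of earlier instructions that it kills."""
    gen_b, kill_b = set(), set()
    for gen_d, kill_d in insts:
        gen_b = gen_d | (gen_b - kill_d)
        kill_b = kill_b | kill_d
    return gen_b, kill_b

# B3 from section 2.4: d3: y = x, then d4: x = y + 1
gen_B3, kill_B3 = block_gen_kill([({"d3"}, set()), ({"d4"}, {"d1", "d2"})])
# gen_B3 == {"d3", "d4"}, kill_B3 == {"d1", "d2"}
```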
2.3 Data-Flow Equations
OUT[B] = gen[B] U (IN[B] - kill[B])
IN[B] = U OUT[P] (P is all predecessors of B)
The meet operation is union: if a definition reaches p along any path, its value might still be used there, so every such definition must be conservatively included.
2.4 Example
// Flow graph:
// B1 -> B2 -> B3
// ^ |
// +-----+
// B1: d1: x = 5 gen={d1}, kill={d2,d4}
// B2: d2: x = x + 1 gen={d2}, kill={d1,d4}
// B3: d3: y = x gen={d3}, kill={}
// d4: x = y + 1 gen={d4}, kill={d1,d2}
// gen[B3] = {d4} U ({d3} - {d1,d2}) = {d3, d4}
// kill[B3] = {d1, d2}
Iterative computation:
Initial values: OUT[B] = {} (for all B)
Iteration 1:
IN[B1] = {}
OUT[B1] = {d1} U ({} - {d2,d4}) = {d1}
IN[B2] = OUT[B1] U OUT[B3] = {d1} U {} = {d1}
OUT[B2] = {d2} U ({d1} - {d1,d4}) = {d2}
IN[B3] = OUT[B2] = {d2}
OUT[B3] = {d3,d4} U ({d2} - {d1,d2}) = {d3,d4}
Iteration 2:
IN[B2] = OUT[B1] U OUT[B3] = {d1} U {d3,d4} = {d1,d3,d4}
OUT[B2] = {d2} U ({d1,d3,d4} - {d1,d4}) = {d2,d3}
IN[B3] = OUT[B2] = {d2,d3}
OUT[B3] = {d3,d4} U ({d2,d3} - {d1,d2}) = {d3,d4}
Iteration 3:
IN[B2] = {d1} U {d3,d4} = {d1,d3,d4} (no change)
-> Converged!
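The whole iteration above can be reproduced in a few lines of Python; the round-robin loop below runs until no OUT set changes:

```python
def reaching_definitions(preds, gen, kill, blocks):
    """Iterate OUT[B] = gen[B] U (IN[B] - kill[B]) to a fixed point.
    The meet is union over all predecessors."""
    out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            in_b = set().union(*(out[p] for p in preds[b])) if preds[b] else set()
            new_out = gen[b] | (in_b - kill[b])
            if new_out != out[b]:
                out[b], changed = new_out, True
    return out

# CFG from the example: B1 -> B2 -> B3, with a back edge B3 -> B2
preds = {"B1": [], "B2": ["B1", "B3"], "B3": ["B2"]}
gen = {"B1": {"d1"}, "B2": {"d2"}, "B3": {"d3", "d4"}}
kill = {"B1": {"d2", "d4"}, "B2": {"d1", "d4"}, "B3": {"d1", "d2"}}
out = reaching_definitions(preds, gen, kill, ["B1", "B2", "B3"])
# out["B2"] == {"d2", "d3"}, out["B3"] == {"d3", "d4"}, as computed by hand
```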
3. Live-Variable Analysis
3.1 Concept
Variable x is live at program point p if, on some path starting from p, x is used before being redefined.
3.2 Backward Analysis
Live variable analysis is a backward analysis. It proceeds from the end of the program toward the beginning.
// Transfer function (backward)
IN[B] = use[B] U (OUT[B] - def[B])
OUT[B] = U IN[S] (S is all successors of B)
- use[B]: variables used in block B before being defined there
- def[B]: variables defined in block B
3.3 Computing use/def Sets
// Block B:
// x = y + z // use: {y, z}, def: {x}
// a = x + 1 // use: {x}, def: {a} (x was defined above, so not in use[B])
// use[B] = {y, z} (variables used before being defined)
// def[B] = {x, a} (variables defined)
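Computing use[B]/def[B] is a single forward scan over the block. A small sketch over per-instruction (defs, uses) pairs:

```python
def use_def(insts):
    """Compute use[B]/def[B] from (defs, uses) pairs in execution order.
    A variable enters use[B] only if it is used before any definition in B."""
    use_b, def_b = set(), set()
    for defs, uses in insts:
        use_b |= uses - def_b   # skip uses already preceded by a local def
        def_b |= defs
    return use_b, def_b

# The block above: x = y + z; a = x + 1
u, d = use_def([({"x"}, {"y", "z"}), ({"a"}, {"x"})])
# u == {"y", "z"}, d == {"x", "a"}
```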
3.4 Example
// B1: a = 1; b = 2
// B2: c = a + b; d = c - a
// B3: d = b + d; (exit)
// use[B1] = {}, def[B1] = {a, b}
// use[B2] = {a, b}, def[B2] = {c, d}
// use[B3] = {b, d}, def[B3] = {d}
// Initial values: IN[B] = {} (for all B)
Iteration 1: (backward: B3 -> B2 -> B1)
OUT[B3] = {} (exit block)
IN[B3] = {b, d} U ({} - {d}) = {b, d}
OUT[B2] = IN[B3] = {b, d}
IN[B2] = {a, b} U ({b, d} - {c, d}) = {a, b}
OUT[B1] = IN[B2] = {a, b}
IN[B1] = {} U ({a, b} - {a, b}) = {}
-> Converged! (no change)
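The backward iteration can be sketched the same way, reusing the use/def sets of this example:

```python
def live_variables(succs, use, defs, blocks):
    """Backward analysis: IN[B] = use[B] U (OUT[B] - def[B]),
    OUT[B] = union of IN over all successors."""
    in_sets = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in reversed(blocks):   # reverse order speeds up convergence
            out_b = set().union(*(in_sets[s] for s in succs[b])) if succs[b] else set()
            new_in = use[b] | (out_b - defs[b])
            if new_in != in_sets[b]:
                in_sets[b], changed = new_in, True
    return in_sets

succs = {"B1": ["B2"], "B2": ["B3"], "B3": []}
use = {"B1": set(), "B2": {"a", "b"}, "B3": {"b", "d"}}
defs = {"B1": {"a", "b"}, "B2": {"c", "d"}, "B3": {"d"}}
ins = live_variables(succs, use, defs, ["B1", "B2", "B3"])
# ins["B3"] == {"b", "d"}, ins["B2"] == {"a", "b"}, ins["B1"] == set()
```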
3.5 Applications
- Dead code elimination: if x is not live at the OUT point of the assignment x = expr, the assignment is dead code
- Register allocation: variables that are live at the same time are connected by edges in the interference graph
4. Available Expressions
4.1 Concept
Expression x op y is available at program point p if, on every path reaching p, x op y has been computed and neither x nor y has been redefined since the last such computation.
4.2 Data-Flow Equations
OUT[B] = e_gen[B] U (IN[B] - e_kill[B])
IN[B] = n OUT[P] (P is all predecessors of B)
IN[ENTRY] = {}
The meet operation is intersection: an expression can be safely reused only if it has been computed on every path reaching the point.
- e_gen[B]: expressions computed in block B whose operands are not subsequently redefined in B
- e_kill[B]: for each variable x defined in block B, all expressions with x as an operand
4.3 Example
// B1: t1 = a + b; c = t1
// e_gen[B1] = {a+b}
// e_kill[B1] = {expressions containing t1 or c}
// B2: t2 = a + b; d = t2 - c
// e_gen[B2] = {a+b, t2-c}
// e_kill[B2] = {expressions containing t2 or d}
// B3: a = 1
// e_gen[B3] = {}
// e_kill[B3] = {all expressions containing a} = {a+b}
4.4 Importance of Initial Values
Initial values matter in available expressions analysis:
OUT[ENTRY] = {} // no expressions are available at the entry point
OUT[B] = U      // all other blocks are initialized to the universal set U (all expressions)
// Since the meet is intersection, iteration can only remove expressions;
// starting from the universal set yields the greatest (most precise) safe solution
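A sketch of the analysis with this initialization, on a hypothetical diamond-shaped CFG (the block names and expressions are made up for illustration):

```python
def available_expressions(preds, e_gen, e_kill, blocks, universe):
    """Forward analysis; the meet is intersection, so non-entry blocks
    start at the universal set and the iteration only shrinks them."""
    out = {"ENTRY": set()}                      # nothing available at entry
    out.update({b: set(universe) for b in blocks})
    changed = True
    while changed:
        changed = False
        for b in blocks:
            in_b = set.intersection(*(out[p] for p in preds[b]))
            new_out = e_gen[b] | (in_b - e_kill[b])
            if new_out != out[b]:
                out[b], changed = new_out, True
    return out

# Hypothetical diamond: a+b is computed on both branches, c*d on only one
universe = {"a+b", "c*d"}
preds = {"B1": ["ENTRY"], "B2": ["B1"], "B3": ["B1"], "B4": ["B2", "B3"]}
e_gen = {"B1": set(), "B2": {"a+b"}, "B3": {"a+b", "c*d"}, "B4": set()}
e_kill = {b: set() for b in preds}
out = available_expressions(preds, e_gen, e_kill, ["B1", "B2", "B3", "B4"], universe)
# out["B4"] == {"a+b"}: only a+b is available on all paths into B4
```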
5. Iterative Algorithm
5.1 General Iterative Algorithm
Algorithm: Forward Data-Flow Analysis
Input: flow graph, gen/kill sets, meet operation
Output: IN/OUT sets for each block

1. Initialization
   OUT[ENTRY] = boundary value
   for (all blocks B != ENTRY):
       OUT[B] = initial value

2. Iteration
   changed = true
   while (changed):
       changed = false
       for (all blocks B != ENTRY):
           IN[B] = meet(OUT[P] for all predecessors P of B)
           oldOUT = OUT[B]
           OUT[B] = gen[B] U (IN[B] - kill[B])
           if (OUT[B] != oldOUT):
               changed = true
5.2 Convergence Guarantee
Why the iterative algorithm always terminates:
- gen/kill transfer functions are monotone - as input grows, output grows (or stays the same)
- The set of data-flow values is finite
- Meet operations (union or intersection) are also monotone
Therefore, OUT values change in only one direction at each iteration and eventually reach a stable state within the finite range.
5.3 Convergence Speed
In general, the number of iterations needed is proportional to the depth of the flow graph, which corresponds to the nesting depth of its loops. In practice, most programs converge within two or three iterations.
6. Comparison of Three Analyses
| Property | Reaching Definitions | Live Variables | Available Expressions |
|---|---|---|---|
| Direction | Forward | Backward | Forward |
| Meet operation | Union | Union | Intersection |
| Transfer function | gen U (IN - kill) | use U (OUT - def) | e_gen U (IN - e_kill) |
| Initial value | Empty set | Empty set | Universal set (except Entry) |
| Application | Constant propagation | Dead code elimination | Common subexpression elimination |
7. Basics of Lattice Theory
7.1 Partially Ordered Set
The mathematical foundation of data-flow analysis is Lattice Theory.
A partially ordered set consists of a set S and a relation <= satisfying:
- Reflexivity: a <= a
- Antisymmetry: if a <= b and b <= a, then a = b
- Transitivity: if a <= b and b <= c, then a <= c
7.2 Lattice
In a partially ordered set, for any two elements a and b:
- Least upper bound (join): The smallest element that is greater than or equal to both a and b
- Greatest lower bound (meet): The largest element that is less than or equal to both a and b
If both always exist, it is called a lattice.
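For the powerset lattices used in data-flow analysis, join and meet are simply set union and intersection. A tiny check in Python:

```python
# Powerset of {d1, d2} ordered by subset inclusion: join = union, meet = intersection
a = {"d1"}
b = {"d2"}
join = a | b    # least upper bound of a and b
meet = a & b    # greatest lower bound of a and b
# join is indeed an upper bound of both (<= is the subset test on sets)
assert a <= join and b <= join
# and meet is a lower bound of both
assert meet <= a and meet <= b
```

Both bounds exist for any pair of subsets, so 2^S with subset ordering is a lattice.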
7.3 Data-Flow Analysis and Lattices
// Lattice for reaching definitions analysis
// Elements: subsets of definitions (2^D)
// Order: set inclusion (subset)
// Meet operation: union (U)
// Bottom: empty set ({})
// Top: universal set (D)
// Lattice for available expressions analysis
// Elements: subsets of expressions (2^E)
// Order: reverse set inclusion (superset)
// Meet operation: intersection
// Bottom: universal set (E)
// Top: empty set ({})
7.4 Fixed-Point Theorem
Tarski's Fixed-Point Theorem: A monotone function f on a complete lattice has a least fixed point. The iterative algorithm converges to this least fixed point.
This is the mathematical foundation that guarantees the correctness and termination of the iterative algorithm.
7.5 Safety and Precision
// Conservative analysis:
// - Approximates toward the safe side
// - May miss optimization opportunities, but never performs incorrect optimizations
// Reaching definitions: analyzes that more definitions reach than actually do
// -> May not be able to apply constant propagation in some cases (safe)
// Available expressions: analyzes that fewer expressions are available than actually are
// -> May not be able to apply common subexpression elimination in some cases (safe)
8. Practical Considerations
8.1 Bit Vector Representation
Representing data-flow values as bit vectors is efficient:
// Given definitions d1, d2, d3, d4
// gen[B] = {d1, d3} -> bit vector: 1010
// kill[B] = {d2} -> bit vector: 0100
// Transfer function computation:
// OUT = gen | (IN & ~kill) (processed at once with bitwise operations)
Bitwise operations are very fast at the hardware level, making them efficient even for large programs.
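A sketch of the bit-vector encoding in Python, using a plain integer as the vector (here d1 is mapped to the least-significant bit, unlike the left-to-right notation above):

```python
# Definitions d1..d4 mapped to bits 0..3 (d1 = least-significant bit)
GEN_B  = 0b0101   # {d1, d3}
KILL_B = 0b0010   # {d2}
MASK   = 0b1111   # all four definitions

def transfer_bits(in_bits):
    """OUT = gen | (IN & ~kill): all definitions processed in one word."""
    return GEN_B | (in_bits & ~KILL_B & MASK)

out = transfer_bits(0b0010)   # IN = {d2}: d2 is killed, d1/d3 are generated
# out == 0b0101
```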
8.2 Worklist Algorithm
Using a worklist algorithm instead of simple iteration is more efficient:
Algorithm: Worklist-Based Iteration
1. Add all blocks to the worklist
2. While the worklist is not empty:
a. Remove block B from the worklist
b. Recompute OUT[B]
c. If OUT[B] has changed, add B's successors to the worklist
By recomputing only blocks affected by changes, unnecessary computations are reduced.
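A worklist version of the reaching-definitions analysis, reusing the gen/kill sets from the example in section 2.4:

```python
from collections import deque

def worklist_reaching_defs(preds, succs, gen, kill, blocks):
    """Recompute OUT[B] only for blocks whose inputs may have changed."""
    out = {b: set() for b in blocks}
    work = deque(blocks)
    while work:
        b = work.popleft()
        in_b = set().union(*(out[p] for p in preds[b])) if preds[b] else set()
        new_out = gen[b] | (in_b - kill[b])
        if new_out != out[b]:
            out[b] = new_out
            for s in succs[b]:
                if s not in work:          # avoid duplicate worklist entries
                    work.append(s)
    return out

preds = {"B1": [], "B2": ["B1", "B3"], "B3": ["B2"]}
succs = {"B1": ["B2"], "B2": ["B3"], "B3": ["B2"]}
gen = {"B1": {"d1"}, "B2": {"d2"}, "B3": {"d3", "d4"}}
kill = {"B1": {"d2", "d4"}, "B2": {"d1", "d4"}, "B3": {"d1", "d2"}}
out = worklist_reaching_defs(preds, succs, gen, kill, ["B1", "B2", "B3"])
# Matches the fixed point from section 2.4: OUT[B2] = {d2, d3}, OUT[B3] = {d3, d4}
```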
Summary
| Concept | Description |
|---|---|
| Transfer function | Transforms input information to output information for a basic block |
| gen/kill sets | Data-flow information generated/invalidated in a block |
| Meet operation | Operation combining information from multiple paths (union or intersection) |
| Reaching definitions | Analyzes whether a variable's definition reaches a specific point |
| Live variables | Analyzes whether a variable's value is used later |
| Available expressions | Analyzes whether an expression has been computed and is reusable |
| Iterative algorithm | Iteratively computes data-flow equations until a fixed point |
| Lattice theory | Mathematical foundation of data-flow analysis |
Data-flow analysis serves as the theoretical foundation for compiler optimization, guaranteeing the safety of optimizations. Since it has mathematical backing from lattice theory, the iterative algorithm is guaranteed to converge and provides conservatively safe results. In the next article, we will examine loop optimization in detail using these analysis techniques.