[Compiler] 17. Parallelism Detection and Loop Transformations

Overview

With the prevalence of multicore processors and GPUs, finding and exploiting parallelism in programs has become extremely important. Compilers can analyze programs to automatically detect parts that can be safely executed in parallel, and through loop transformations, improve both parallelism and locality.


1. Types of Parallelism

1.1 Data Parallelism

Performs the same operation on different data simultaneously.

// Data parallel: each iteration is independent
for (i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
}

// Split execution across 4 processors:
// P0: i = 0..n/4-1
// P1: i = n/4..n/2-1
// P2: i = n/2..3n/4-1
// P3: i = 3n/4..n-1

1.2 Task Parallelism

Performs different tasks simultaneously.

// Task parallel: three functions are independent
result1 = processA(data);   // Task 1
result2 = processB(data);   // Task 2
result3 = processC(data);   // Task 3
// All three tasks can execute simultaneously

1.3 Pipeline Parallelism

Processes data in stages, with each stage handled by a different processor.

// 3-stage pipeline:
// Stage 1: read    -> P0
// Stage 2: process -> P1
// Stage 3: write   -> P2

// Time:   1      2      3      4      5
// P0:   read1  read2  read3  read4  read5
// P1:          proc1  proc2  proc3  proc4
// P2:                 wrt1   wrt2   wrt3

2. Dependence Analysis

To parallelize loops, dependencies between iterations must be analyzed.

2.1 Types of Loop Dependencies

// Flow dependence (true dependence)
for (i = 1; i < n; i++) {
    a[i] = a[i-1] + 1;  // Iteration i depends on result of iteration i-1
}
// Dependence distance = 1, cannot parallelize

// Anti-dependence
for (i = 0; i < n-1; i++) {
    a[i] = a[i+1] + 1;  // Read then write
}
// Can be removed by renaming: read from a copy of the original array

// Output dependence
for (i = 0; i < n; i++) {
    a[i % 2] = b[i];    // Repeatedly writing to the same location
}

2.2 Dependence Vectors and Dependence Distance

Dependencies in multiply-nested loops are represented as vectors.

// Dependence vector in a doubly-nested loop
for (i = 1; i < n; i++) {
    for (j = 1; j < m; j++) {
        a[i][j] = a[i-1][j-1] + 1;
    }
}

// a[i][j] depends on a[i-1][j-1]
// Dependence vector: (1, 1)  (distance 1 in i-direction, distance 1 in j-direction)

2.3 Array Dependence Tests

Tests to determine whether two array accesses refer to the same location.

GCD Test (necessary condition):

// For a[c1*i + c2] and a[c3*j + c4] to reference the same location
// there must exist integers i, j satisfying c1*i + c2 = c3*j + c4
// -> gcd(c1, c3) must divide (c4 - c2)

// Example:
// a[2*i + 1] and a[2*j]
// gcd(2, 2) = 2, c4 - c2 = 0 - 1 = -1
// 2 does not divide -1 -> no dependence! (can parallelize)

Banerjee Test (more precise):

// Tests dependence considering loop bounds
// a[i + 1] and a[j] (0 <= i,j < n)
// Does a pair (i, j) with i + 1 = j exist within the bounds?
// Yes: any i with 0 <= i <= n-2 gives j = i+1 in range (e.g., i = 0, j = 1)
// -> dependence exists

2.4 Limitations of Dependence Analysis

// Cases where precise dependence analysis is difficult:

// 1. Indirect array access
a[b[i]] = ...  // Value of b[i] is unknown

// 2. Pointer-based access
*p = ...       // Location pointed to by p is unknown

// 3. Function calls
foo(a, i)      // Unknown whether foo modifies a

// In such cases, conservatively assume dependence exists (safe but cannot parallelize)

3. Loop Transformations

3.1 Loop Interchange

A transformation that changes the order of nested loops. Used to expose parallelism or improve memory locality.

// Original: row-major access (already optimal in C)
for (i = 0; i < n; i++)
    for (j = 0; j < m; j++)
        a[i][j] = 0;

// After interchange: column-major access (worse in C; in practice the
// transformation is applied in the direction that yields row-major access)
for (j = 0; j < m; j++)
    for (i = 0; i < n; i++)
        a[i][j] = 0;

Condition for legal interchange: the lexicographic order of dependence vectors must be maintained.

// Dependence vector (1, -1):
// After interchange (-1, 1) -> first component is negative -> illegal!
//
// Dependence vector (1, 0):
// After interchange (0, 1) -> lexicographically positive -> legal!

3.2 Loop Tiling / Blocking

Divides loops into small tiles (blocks) to improve cache utilization.

// Original: matrix multiplication
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
            c[i][j] += a[i][k] * b[k][j];

// After tiling (tile size T):
for (ii = 0; ii < n; ii += T)
    for (jj = 0; jj < n; jj += T)
        for (kk = 0; kk < n; kk += T)
            for (i = ii; i < min(ii+T, n); i++)
                for (j = jj; j < min(jj+T, n); j++)
                    for (k = kk; k < min(kk+T, n); k++)
                        c[i][j] += a[i][k] * b[k][j];

When tile size T is matched to cache size:

  • T x T sub-matrices of a, b, c fit in cache
  • Cache misses decrease from O(n^3) to O(n^3/T)

3.3 Loop Fusion

Combines multiple loops that iterate over the same range into one.

// Original: two separate loops
for (i = 0; i < n; i++)
    a[i] = b[i] + 1;

for (i = 0; i < n; i++)
    c[i] = a[i] * 2;

// After fusion: one loop
for (i = 0; i < n; i++) {
    a[i] = b[i] + 1;
    c[i] = a[i] * 2;
}

Advantages of fusion:

  • Reduced loop overhead
  • a[i] can be kept in a register (reduced memory access)
  • Improved data locality

Fusion condition: fusing must not turn any dependence from the first loop to the second into a backward (lexicographically negative) dependence within the fused loop.

3.4 Loop Fission / Distribution

Splits one loop into multiple loops.

// Original (i starts at 1 so that c[i-1] stays in bounds)
for (i = 1; i < n; i++) {
    a[i] = b[i] + 1;        // Independent part 1
    c[i] = c[i-1] + a[i];   // Dependent part 2
}

// After fission
for (i = 1; i < n; i++)
    a[i] = b[i] + 1;        // Can be parallelized!

for (i = 1; i < n; i++)
    c[i] = c[i-1] + a[i];   // Must execute sequentially

Advantages of fission:

  • Isolate parallelizable parts for parallel execution
  • Smaller working set for each loop improves cache efficiency

4. Basics of Affine Transformation Theory

4.1 What is Affine Transformation

A unified framework for optimizing both parallelism and locality by applying affine (linear plus constant-offset) transformations to the iteration space of loops.

// Original iteration space: (i, j)
for (i = 0; i < n; i++)
    for (j = 0; j < m; j++)
        S: a[i][j] = ...

// Affine transformation: new iteration variables (p, q) = T * (i, j)
// Example: (p, q) = (i+j, j) -> skewed iteration space

4.2 Skewing Transformation

Changes dependence directions to expose parallelism.

// Original: dependence vectors (1,0) and (0,1) -> both loops carry dependencies
for (i = 1; i < n; i++)
    for (j = 1; j < n; j++)
        a[i][j] = (a[i-1][j] + a[i][j-1]) / 2;

// Skewing: p = i + j, q = j
// After transformation, iterations with the same p value can execute in parallel

// Transformed dependence vectors: (1,1), (1,0) -> p (outer loop) is sequential, q (inner loop) can be parallel

4.3 Polyhedral Model

A generalized framework of affine transformations, used in modern automatic parallelization compilers (Polly, Pluto).

Each statement's iteration domain is represented as a polyhedron, dependencies as affine relations, and schedules as affine functions.

// Polyhedral representation:
// Iteration domain of statement S: 0 <= i < n, 0 <= j < m
// This is a 2D polyhedron (rectangle)

// Schedule function: theta(i,j) = (i, j)    (original order)
//                    or theta(i,j) = (i+j, j) (skewing)
//                    or theta(i,j) = (j, i)   (interchange)

5. Locality Optimization

5.1 Temporal Locality

The property of accessing recently accessed data again soon.

// Good temporal locality
for (i = 0; i < n; i++) {
    sum += a[i];  // sum is reused every iteration
}

// Poor temporal locality (can be improved by tiling)
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        b[j] += a[i][j];  // b[j] is reused only in the outer loop

5.2 Spatial Locality

The property of accessing adjacent memory locations consecutively.

// Good spatial locality (row-major access, matches C array layout)
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[i][j] = 0;      // Access a[i][0], a[i][1], ... in order

// Poor spatial locality (column-major access)
for (j = 0; j < n; j++)
    for (i = 0; i < n; i++)
        a[i][j] = 0;      // Access a[0][j], a[1][j], ... in order
                           // Different cache line for each access

5.3 Relationship Between Locality and Loop Transformations

Transformation      Temporal locality   Spatial locality   Parallelism
Loop interchange    Indirect            Direct             Possible
Loop tiling         Direct              Indirect           Indirect
Loop fusion         Direct              Indirect           Possible
Loop fission        Indirect            Direct             Direct

5.4 Optimization Effect on Matrix Multiplication

// No optimization: O(n^3) cache misses
// Loop interchange: improved spatial locality
// Tiling (B x B): O(n^3 / B) cache misses
// Tiling + interchange: near-optimal performance

// Illustrative figures (n = 1000, 32KB L1 cache):
// No optimization:   ~10 seconds
// Loop interchange:  ~5 seconds
// 32x32 tiling:      ~1 second
// -> roughly a 10x difference from loop transformations alone

6. Automatic Parallelization in Practice

6.1 DOALL Loops

A loop where all iterations are independent is called a DOALL loop.

// DOALL: fully parallelizable
for (i = 0; i < n; i++)
    a[i] = b[i] + c[i];

// Automatic conversion to OpenMP:
// #pragma omp parallel for
// for (i = 0; i < n; i++)
//     a[i] = b[i] + c[i];

6.2 DOACROSS Loops

Loops with loop-carried dependencies can sometimes still be overlapped across processors (DOACROSS): each iteration waits only for the specific value it needs from an earlier iteration, so the work that does not touch the recurrence can proceed in parallel.

// DOACROSS: parallelism exists up to the dependence distance
for (i = 1; i < n; i++)
    a[i] = a[i-1] + b[i];  // Dependence distance 1

// Round-robin schedule with point-to-point synchronization:
// P0: a[0], a[4], a[8], ...
// P1: (wait for a[0]) a[1], a[5], a[9], ...
// P2: (wait for a[1]) a[2], a[6], ...
// P3: (wait for a[2]) a[3], a[7], ...
// With distance 1 the recurrence itself is serial; any speedup comes from
// overlapping each iteration's independent work (e.g., loading b[i])

Summary

Concept                  Description
Data parallelism         Performing the same operation on different data simultaneously
Dependence analysis      Identifying data dependence relations between loop iterations
GCD test                 Necessary-condition check for array dependence
Loop interchange         Changing the order of nested loops
Loop tiling              Splitting loops into small blocks to maximize cache utilization
Loop fusion              Combining multiple loops into one
Loop fission             Splitting one loop into multiple loops
Affine transformation    Optimizing parallelism and locality by transforming the iteration space
Locality optimization    Optimizing memory access patterns to maximize cache utilization

Parallelism detection and loop transformations are core compiler technologies in the multicore era. By applying various loop transformations based on precise dependence analysis, compilers can automatically generate parallel code even without explicit parallelization by the programmer. In the next article, we will cover interprocedural analysis and pointer analysis.