Split View: [컴파일러] 17. 병렬성 검출과 루프 변환

[컴파일러] 17. 병렬성 검출과 루프 변환

개요

멀티코어 프로세서와 GPU가 보편화됨에 따라, 프로그램에서 병렬성을 찾아내고 활용하는 것이 매우 중요해졌습니다. 컴파일러는 프로그램을 분석하여 안전하게 병렬 실행할 수 있는 부분을 자동으로 검출하고, 루프 변환을 통해 병렬성과 지역성을 모두 향상시킬 수 있습니다.

1. 병렬성의 유형

1.1 데이터 병렬성 (Data Parallelism)

같은 연산을 서로 다른 데이터에 대해 동시에 수행합니다.

// 데이터 병렬: 각 반복이 독립적
for (i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
}

// 4개 프로세서로 분할 실행:
// P0: i = 0..n/4-1
// P1: i = n/4..n/2-1
// P2: i = n/2..3n/4-1
// P3: i = 3n/4..n-1

1.2 태스크 병렬성 (Task Parallelism)

서로 다른 작업을 동시에 수행합니다.

// 태스크 병렬: 세 함수가 독립적
result1 = processA(data)    // 태스크 1
result2 = processB(data)    // 태스크 2
result3 = processC(data)    // 태스크 3
// 세 태스크를 동시에 실행 가능

1.3 파이프라인 병렬성

데이터를 단계별로 처리하며, 각 단계를 서로 다른 프로세서가 담당합니다.

// 3단계 파이프라인:
// Stage 1: read    -> P0
// Stage 2: process -> P1
// Stage 3: write   -> P2

// 시간:   1      2      3      4      5
// P0:   read1  read2  read3  read4  read5
// P1:          proc1  proc2  proc3  proc4
// P2:                 wrt1   wrt2   wrt3

2. 의존성 분석 (Dependence Analysis)

루프를 병렬화하려면 반복 간의 의존성을 분석해야 합니다.

2.1 루프 의존성의 종류

// 흐름 의존성 (진정한 의존성)
for (i = 1; i < n; i++) {
    a[i] = a[i-1] + 1;  // 반복 i가 반복 i-1의 결과에 의존
}
// 의존 거리(distance) = 1, 병렬화 불가

// 반의존성
for (i = 0; i < n-1; i++) {
    a[i] = a[i+1] + 1;  // 읽기 후 쓰기
}
// 순서를 역전하면 해결 가능

// 출력 의존성
for (i = 0; i < n; i++) {
    a[i % 2] = b[i];    // 같은 위치에 반복적으로 쓰기
}

2.2 의존 벡터와 의존 거리

다중 중첩 루프에서 의존성을 벡터로 표현합니다.

// 2중 루프에서의 의존 벡터
for (i = 1; i < n; i++) {
    for (j = 1; j < m; j++) {
        a[i][j] = a[i-1][j-1] + 1;
    }
}

// a[i][j]는 a[i-1][j-1]에 의존
// 의존 벡터: (1, 1)  (i방향 거리 1, j방향 거리 1)

2.3 배열 의존성 테스트

두 배열 접근이 같은 위치를 참조하는지 판별하는 테스트입니다.

GCD 테스트 (필요 조건):

// a[c1*i + c2]와 a[c3*j + c4]가 같은 위치를 참조하려면
// c1*i + c2 = c3*j + c4를 만족하는 정수 i, j가 존재해야 함
// -> gcd(c1, c3)이 (c4 - c2)를 나누어야 함

// 예시:
// a[2*i + 1]과 a[2*j]
// gcd(2, 2) = 2, c4 - c2 = 0 - 1 = -1
// 2는 -1을 나누지 못함 -> 의존성 없음! (병렬화 가능)

Banerjee 테스트 (더 정밀한 검사):

// 루프 범위를 고려하여 의존성을 검사
// a[i + 1]과 a[j] (0 <= i,j < n)
// i + 1 = j를 만족하는 (i,j) 쌍이 범위 내에 존재하는가?
// 0 <= i < n이고 0 <= i+1 < n이면 존재
// -> 의존성 있음

2.4 의존성 분석의 한계

// 정확한 의존성 분석이 어려운 경우:

// 1. 간접 배열 접근
a[b[i]] = ...  // b[i]의 값을 알 수 없음

// 2. 포인터를 통한 접근
*p = ...       // p가 가리키는 위치를 알 수 없음

// 3. 함수 호출
foo(a, i)      // foo가 a를 수정하는지 알 수 없음

// 이런 경우 보수적으로 의존성이 있다고 가정 (안전하지만 병렬화 불가)

3. 루프 변환

3.1 루프 교환 (Loop Interchange)

중첩 루프의 순서를 바꾸는 변환입니다. 병렬성 노출이나 메모리 지역성 향상에 사용됩니다.

// 원본: 행 우선 접근 (C에서는 이미 최적)
for (i = 0; i < n; i++)
    for (j = 0; j < m; j++)
        a[i][j] = 0;

// 교환 후: 열 우선 접근
for (j = 0; j < m; j++)
    for (i = 0; i < n; i++)
        a[i][j] = 0;

교환이 합법적인 조건: 의존 벡터의 사전식 순서가 유지되어야 합니다.

// 의존 벡터 (1, -1)인 경우:
// 교환 후 (-1, 1) -> 첫 성분이 음수 -> 불법!
//
// 의존 벡터 (1, 0)인 경우:
// 교환 후 (0, 1) -> 사전식 양수 -> 합법!

3.2 루프 타일링 / 블로킹 (Loop Tiling/Blocking)

루프를 작은 타일(블록)로 나누어 캐시 활용도를 높입니다.

// 원본: 행렬 곱셈
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
            c[i][j] += a[i][k] * b[k][j];

// 타일링 후 (타일 크기 T):
for (ii = 0; ii < n; ii += T)
    for (jj = 0; jj < n; jj += T)
        for (kk = 0; kk < n; kk += T)
            for (i = ii; i < min(ii+T, n); i++)
                for (j = jj; j < min(jj+T, n); j++)
                    for (k = kk; k < min(kk+T, n); k++)
                        c[i][j] += a[i][k] * b[k][j];

타일 크기 T를 캐시 크기에 맞추면:

a, b, c의 T x T 부분 행렬이 캐시에 들어감
캐시 미스가 O(n^3)에서 O(n^3/T)로 감소

3.3 루프 융합 (Loop Fusion)

같은 범위를 순회하는 여러 루프를 하나로 합칩니다.

// 원본: 두 개의 별도 루프
for (i = 0; i < n; i++)
    a[i] = b[i] + 1;

for (i = 0; i < n; i++)
    c[i] = a[i] * 2;

// 융합 후: 하나의 루프
for (i = 0; i < n; i++) {
    a[i] = b[i] + 1;
    c[i] = a[i] * 2;
}

융합의 장점:

루프 오버헤드 감소
a[i]를 레지스터에 유지 가능 (메모리 접근 감소)
데이터 지역성 향상

융합 조건: 두 루프 사이에 의존성 방향이 역전되지 않아야 합니다.

3.4 루프 분열 (Loop Fission/Distribution)

하나의 루프를 여러 루프로 나눕니다.

// 원본
for (i = 0; i < n; i++) {
    a[i] = b[i] + 1;        // 독립적 부분 1
    c[i] = c[i-1] + a[i];   // 의존적 부분 2
}

// 분열 후
for (i = 0; i < n; i++)
    a[i] = b[i] + 1;        // 병렬화 가능!

for (i = 0; i < n; i++)
    c[i] = c[i-1] + a[i];   // 순차 실행 필요

분열의 장점:

병렬화 가능한 부분만 분리하여 병렬 실행
각 루프의 작업 집합(working set)이 작아져 캐시 효율 향상

4. 아핀 변환 이론 기초

4.1 아핀 변환이란

루프의 반복 공간을 선형 변환하여 병렬성과 지역성을 동시에 최적화하는 통합 프레임워크입니다.

// 원본 반복 공간: (i, j)
for (i = 0; i < n; i++)
    for (j = 0; j < m; j++)
        S: a[i][j] = ...

// 아핀 변환: 새로운 반복 변수 (p, q) = T * (i, j)
// 예: (p, q) = (i+j, j) -> 기울어진 반복 공간

4.2 기울이기 변환 (Skewing)

의존성 방향을 변경하여 병렬성을 노출합니다.

// 원본: 의존 벡터 (0,1)과 (1,0) -> 두 차원 모두 순차적
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[i][j] = (a[i-1][j] + a[i][j-1]) / 2;

// 기울이기: p = i + j, q = j
// 변환 후 같은 p 값을 가진 반복들은 병렬 실행 가능

// 변환 후 의존 벡터: (1,1), (1,0) -> p(외부 루프)는 순차, q(내부 루프)는 병렬 가능

4.3 다면체 모델 (Polyhedral Model)

아핀 변환을 일반화한 프레임워크로, 현대 자동 병렬화 컴파일러(Polly, Pluto)에서 사용됩니다.

각 문장의 반복 도메인을 다면체로, 의존성을 아핀 관계로, 스케줄을 아핀 함수로 표현합니다.

// 다면체 표현:
// 문장 S의 반복 도메인: 0 <= i < n, 0 <= j < m
// 이는 2차원 다면체(직사각형)

// 스케줄 함수: theta(i,j) = (i, j)  (원본 순서)
//            또는 theta(i,j) = (i+j, j) (기울이기)
//            또는 theta(i,j) = (j, i)   (교환)

5. 지역성 최적화 (Locality Optimization)

5.1 시간적 지역성 (Temporal Locality)

최근에 접근한 데이터를 곧 다시 접근하는 성질입니다.

// 시간적 지역성이 좋은 코드
for (i = 0; i < n; i++) {
    sum += a[i];  // sum이 매 반복 재사용됨
}

// 시간적 지역성이 나쁜 코드 (타일링으로 개선 가능)
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        b[j] += a[i][j];  // b[j]가 외부 루프에서만 재사용

5.2 공간적 지역성 (Spatial Locality)

인접한 메모리 위치를 연속으로 접근하는 성질입니다.

// 공간적 지역성이 좋은 코드 (행 우선 접근, C 배열 레이아웃과 일치)
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[i][j] = 0;      // a[i][0], a[i][1], ... 순서로 접근

// 공간적 지역성이 나쁜 코드 (열 우선 접근)
for (j = 0; j < n; j++)
    for (i = 0; i < n; i++)
        a[i][j] = 0;      // a[0][j], a[1][j], ... 순서로 접근
                           // 매 접근마다 캐시 라인이 다름

5.3 지역성과 루프 변환의 관계

변환	시간적 지역성	공간적 지역성	병렬성
루프 교환	간접적	직접적	가능
루프 타일링	직접적	간접적	간접적
루프 융합	직접적	간접적	가능
루프 분열	간접적	직접적	직접적

5.4 행렬 곱셈의 최적화 효과

// 최적화 없음: O(n^3) 캐시 미스
// 루프 교환: 공간적 지역성 개선
// 타일링 (B x B): O(n^3 / B) 캐시 미스
// 타일링 + 교환: 최적에 가까운 성능

// 실제 성능 차이 (n = 1000, L1 캐시 32KB 기준):
// 최적화 없음:   ~10초
// 루프 교환:     ~5초
// 32x32 타일링:  ~1초
// 성능 차이가 10배에 달할 수 있음!

6. 자동 병렬화의 실제

6.1 DOALL 루프

모든 반복이 독립적인 루프를 DOALL 루프라 합니다.

// DOALL: 완전 병렬화 가능
for (i = 0; i < n; i++)
    a[i] = b[i] + c[i];

// OpenMP로 자동 변환:
// #pragma omp parallel for
// for (i = 0; i < n; i++)
//     a[i] = b[i] + c[i];

6.2 DOACROSS 루프

반복 간에 의존성이 있지만 파이프라인 방식으로 병렬화할 수 있는 루프입니다.

// DOACROSS: 의존 거리 만큼의 병렬성 존재
for (i = 1; i < n; i++)
    a[i] = a[i-1] + b[i];  // 의존 거리 1

// 파이프라인 병렬화:
// P0: a[0], a[4], a[8], ...
// P1: (대기) a[1], a[5], a[9], ...
// P2: (대기) (대기) a[2], a[6], ...
// P3: (대기) (대기) (대기) a[3], a[7], ...

정리

개념	설명
데이터 병렬성	같은 연산을 서로 다른 데이터에 동시 수행
의존성 분석	루프 반복 간의 데이터 의존 관계 파악
GCD 테스트	배열 의존성의 필요 조건 검사
루프 교환	중첩 루프의 순서 변경
루프 타일링	루프를 작은 블록으로 분할하여 캐시 활용 극대화
루프 융합	여러 루프를 하나로 합침
루프 분열	하나의 루프를 여러 개로 분리
아핀 변환	반복 공간의 선형 변환으로 병렬성과 지역성 최적화
지역성 최적화	캐시 활용을 극대화하는 메모리 접근 패턴 최적화

병렬성 검출과 루프 변환은 멀티코어 시대의 핵심 컴파일러 기술입니다. 정확한 의존성 분석을 바탕으로 다양한 루프 변환을 적용하면, 프로그래머가 명시적으로 병렬화하지 않아도 컴파일러가 자동으로 병렬 코드를 생성할 수 있습니다. 다음 글에서는 프로시저 간 분석과 포인터 분석을 다루겠습니다.

[Compiler] 17. Parallelism Detection and Loop Transformations

Overview

With the prevalence of multicore processors and GPUs, finding and exploiting parallelism in programs has become extremely important. Compilers can analyze programs to automatically detect parts that can be safely executed in parallel, and through loop transformations, improve both parallelism and locality.

1. Types of Parallelism

1.1 Data Parallelism

Performs the same operation on different data simultaneously.

// Data parallel: each iteration is independent
for (i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
}

// Split execution across 4 processors:
// P0: i = 0..n/4-1
// P1: i = n/4..n/2-1
// P2: i = n/2..3n/4-1
// P3: i = 3n/4..n-1

1.2 Task Parallelism

Performs different tasks simultaneously.

// Task parallel: three functions are independent
result1 = processA(data)    // Task 1
result2 = processB(data)    // Task 2
result3 = processC(data)    // Task 3
// All three tasks can execute simultaneously

1.3 Pipeline Parallelism

Processes data in stages, with each stage handled by a different processor.

// 3-stage pipeline:
// Stage 1: read    -> P0
// Stage 2: process -> P1
// Stage 3: write   -> P2

// Time:   1      2      3      4      5
// P0:   read1  read2  read3  read4  read5
// P1:          proc1  proc2  proc3  proc4
// P2:                 wrt1   wrt2   wrt3

2. Dependence Analysis

To parallelize loops, dependencies between iterations must be analyzed.

2.1 Types of Loop Dependencies

// Flow dependence (true dependence)
for (i = 1; i < n; i++) {
    a[i] = a[i-1] + 1;  // Iteration i depends on result of iteration i-1
}
// Dependence distance = 1, cannot parallelize

// Anti-dependence
for (i = 0; i < n-1; i++) {
    a[i] = a[i+1] + 1;  // Read then write
}
// Can be resolved by reversing order

// Output dependence
for (i = 0; i < n; i++) {
    a[i % 2] = b[i];    // Repeatedly writing to the same location
}

2.2 Dependence Vectors and Dependence Distance

Dependencies in multiply-nested loops are represented as vectors.

// Dependence vector in a doubly-nested loop
for (i = 1; i < n; i++) {
    for (j = 1; j < m; j++) {
        a[i][j] = a[i-1][j-1] + 1;
    }
}

// a[i][j] depends on a[i-1][j-1]
// Dependence vector: (1, 1)  (distance 1 in i-direction, distance 1 in j-direction)

2.3 Array Dependence Tests

Tests to determine whether two array accesses refer to the same location.

GCD Test (necessary condition):

// For a[c1*i + c2] and a[c3*j + c4] to reference the same location
// there must exist integers i, j satisfying c1*i + c2 = c3*j + c4
// -> gcd(c1, c3) must divide (c4 - c2)

// Example:
// a[2*i + 1] and a[2*j]
// gcd(2, 2) = 2, c4 - c2 = 0 - 1 = -1
// 2 does not divide -1 -> no dependence! (can parallelize)

Banerjee Test (more precise):

// Tests dependence considering loop bounds
// a[i + 1] and a[j] (0 <= i,j < n)
// Does a pair (i,j) satisfying i + 1 = j exist within bounds?
// If 0 <= i < n and 0 <= i+1 < n, then yes
// -> dependence exists

2.4 Limitations of Dependence Analysis

// Cases where precise dependence analysis is difficult:

// 1. Indirect array access
a[b[i]] = ...  // Value of b[i] is unknown

// 2. Pointer-based access
*p = ...       // Location pointed to by p is unknown

// 3. Function calls
foo(a, i)      // Unknown whether foo modifies a

// In such cases, conservatively assume dependence exists (safe but cannot parallelize)

3. Loop Transformations

3.1 Loop Interchange

A transformation that changes the order of nested loops. Used to expose parallelism or improve memory locality.

// Original: row-major access (already optimal in C)
for (i = 0; i < n; i++)
    for (j = 0; j < m; j++)
        a[i][j] = 0;

// After interchange: column-major access
for (j = 0; j < m; j++)
    for (i = 0; i < n; i++)
        a[i][j] = 0;

Condition for legal interchange: the lexicographic order of dependence vectors must be maintained.

// Dependence vector (1, -1):
// After interchange (-1, 1) -> first component is negative -> illegal!
//
// Dependence vector (1, 0):
// After interchange (0, 1) -> lexicographically positive -> legal!

3.2 Loop Tiling / Blocking

Divides loops into small tiles (blocks) to improve cache utilization.

// Original: matrix multiplication
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
            c[i][j] += a[i][k] * b[k][j];

// After tiling (tile size T):
for (ii = 0; ii < n; ii += T)
    for (jj = 0; jj < n; jj += T)
        for (kk = 0; kk < n; kk += T)
            for (i = ii; i < min(ii+T, n); i++)
                for (j = jj; j < min(jj+T, n); j++)
                    for (k = kk; k < min(kk+T, n); k++)
                        c[i][j] += a[i][k] * b[k][j];

When tile size T is matched to cache size:

T x T sub-matrices of a, b, c fit in cache
Cache misses decrease from O(n^3) to O(n^3/T)

3.3 Loop Fusion

Combines multiple loops that iterate over the same range into one.

// Original: two separate loops
for (i = 0; i < n; i++)
    a[i] = b[i] + 1;

for (i = 0; i < n; i++)
    c[i] = a[i] * 2;

// After fusion: one loop
for (i = 0; i < n; i++) {
    a[i] = b[i] + 1;
    c[i] = a[i] * 2;
}

Advantages of fusion:

Reduced loop overhead
a[i] can be kept in a register (reduced memory access)
Improved data locality

Fusion condition: the dependence direction between the two loops must not be reversed.

3.4 Loop Fission / Distribution

Splits one loop into multiple loops.

// Original
for (i = 0; i < n; i++) {
    a[i] = b[i] + 1;        // Independent part 1
    c[i] = c[i-1] + a[i];   // Dependent part 2
}

// After fission
for (i = 0; i < n; i++)
    a[i] = b[i] + 1;        // Can be parallelized!

for (i = 0; i < n; i++)
    c[i] = c[i-1] + a[i];   // Must execute sequentially

Advantages of fission:

Isolate parallelizable parts for parallel execution
Smaller working set for each loop improves cache efficiency

4. Basics of Affine Transformation Theory

4.1 What is Affine Transformation

A unified framework for optimizing both parallelism and locality by linearly transforming the iteration space of loops.

// Original iteration space: (i, j)
for (i = 0; i < n; i++)
    for (j = 0; j < m; j++)
        S: a[i][j] = ...

// Affine transformation: new iteration variables (p, q) = T * (i, j)
// Example: (p, q) = (i+j, j) -> skewed iteration space

4.2 Skewing Transformation

Changes dependence directions to expose parallelism.

// Original: dependence vectors (0,1) and (1,0) -> both dimensions are sequential
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[i][j] = (a[i-1][j] + a[i][j-1]) / 2;

// Skewing: p = i + j, q = j
// After transformation, iterations with the same p value can execute in parallel

// Transformed dependence vectors: (1,1), (1,0) -> p (outer loop) is sequential, q (inner loop) can be parallel

4.3 Polyhedral Model

A generalized framework of affine transformations, used in modern automatic parallelization compilers (Polly, Pluto).

Each statement's iteration domain is represented as a polyhedron, dependencies as affine relations, and schedules as affine functions.

// Polyhedral representation:
// Iteration domain of statement S: 0 <= i < n, 0 <= j < m
// This is a 2D polyhedron (rectangle)

// Schedule function: theta(i,j) = (i, j)    (original order)
//                    or theta(i,j) = (i+j, j) (skewing)
//                    or theta(i,j) = (j, i)   (interchange)

5. Locality Optimization

5.1 Temporal Locality

The property of accessing recently accessed data again soon.

// Good temporal locality
for (i = 0; i < n; i++) {
    sum += a[i];  // sum is reused every iteration
}

// Poor temporal locality (can be improved by tiling)
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        b[j] += a[i][j];  // b[j] is reused only in the outer loop

5.2 Spatial Locality

The property of accessing adjacent memory locations consecutively.

// Good spatial locality (row-major access, matches C array layout)
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[i][j] = 0;      // Access a[i][0], a[i][1], ... in order

// Poor spatial locality (column-major access)
for (j = 0; j < n; j++)
    for (i = 0; i < n; i++)
        a[i][j] = 0;      // Access a[0][j], a[1][j], ... in order
                           // Different cache line for each access

5.3 Relationship Between Locality and Loop Transformations

Transformation	Temporal Locality	Spatial Locality	Parallelism
Loop interchange	Indirect	Direct	Possible
Loop tiling	Direct	Indirect	Indirect
Loop fusion	Direct	Indirect	Possible
Loop fission	Indirect	Direct	Direct

5.4 Optimization Effect on Matrix Multiplication

// No optimization: O(n^3) cache misses
// Loop interchange: improved spatial locality
// Tiling (B x B): O(n^3 / B) cache misses
// Tiling + interchange: near-optimal performance

// Actual performance difference (n = 1000, L1 cache 32KB):
// No optimization:   ~10 seconds
// Loop interchange:  ~5 seconds
// 32x32 tiling:      ~1 second
// Performance difference can be up to 10x!

6. Automatic Parallelization in Practice

6.1 DOALL Loops

A loop where all iterations are independent is called a DOALL loop.

// DOALL: fully parallelizable
for (i = 0; i < n; i++)
    a[i] = b[i] + c[i];

// Automatic conversion to OpenMP:
// #pragma omp parallel for
// for (i = 0; i < n; i++)
//     a[i] = b[i] + c[i];

6.2 DOACROSS Loops

Loops with inter-iteration dependencies that can be parallelized in a pipeline fashion.

// DOACROSS: parallelism exists up to the dependence distance
for (i = 1; i < n; i++)
    a[i] = a[i-1] + b[i];  // Dependence distance 1

// Pipeline parallelization:
// P0: a[0], a[4], a[8], ...
// P1: (wait) a[1], a[5], a[9], ...
// P2: (wait) (wait) a[2], a[6], ...
// P3: (wait) (wait) (wait) a[3], a[7], ...

Summary

Concept	Description
Data parallelism	Performing the same operation on different data simultaneously
Dependence analysis	Identifying data dependence relations between loop iterations
GCD test	Necessary condition check for array dependence
Loop interchange	Changing the order of nested loops
Loop tiling	Splitting loops into small blocks to maximize cache utilization
Loop fusion	Combining multiple loops into one
Loop fission	Splitting one loop into multiple loops
Affine transformation	Optimizing parallelism and locality through linear transformation of iteration space
Locality optimization	Optimizing memory access patterns to maximize cache utilization

Parallelism detection and loop transformations are core compiler technologies in the multicore era. By applying various loop transformations based on precise dependence analysis, compilers can automatically generate parallel code even without explicit parallelization by the programmer. In the next article, we will cover interprocedural analysis and pointer analysis.