Split View: [컴파일러] 15. 루프 최적화와 코드 이동

[컴파일러] 15. 루프 최적화와 코드 이동

개요

프로그램의 실행 시간 중 대부분은 루프에서 소비됩니다. 따라서 루프를 최적화하는 것이 전체 프로그램 성능 향상에 가장 효과적입니다. 이번 글에서는 루프를 식별하고 최적화하는 기법들을 체계적으로 살펴봅니다.

1. 지배자 (Dominator)

1.1 정의

흐름 그래프에서 노드 d가 노드 n을 **지배(dominate)**한다 = 진입점에서 n으로 가는 모든 경로가 d를 거침.

기호로 d dom n이라 표기합니다.

성질:

모든 노드는 자기 자신을 지배합니다 (반사적)
d dom n이고 n dom d이면 d = n (반대칭적)
d dom n이고 n dom p이면 d dom p (추이적)

1.2 지배자 트리 (Dominator Tree)

지배자 관계를 트리로 표현한 것입니다. 각 노드의 부모는 해당 노드의 **직접 지배자(immediate dominator)**입니다.

// 흐름 그래프:
//   1 -> 2 -> 3 -> 4
//        |    ^    |
//        v    |    v
//        5 ---+    6
//                  |
//                  v
//                  7

// 지배자 트리:
//        1
//        |
//        2
//      / | \
//     3  5  6
//     |     |
//     4     7
//
// 해석: 1 dom 모든 노드
//       2 dom 3,4,5,6,7
//       3 dom 4
//       6 dom 7

1.3 지배자 계산 알고리즘

반복 알고리즘으로 지배자를 계산합니다:

알고리즘: 지배자 계산

초기화:
  DOM[entry] = {entry}
  DOM[n] = 모든 노드의 집합 (n != entry)

반복 (수렴할 때까지):
  for (모든 노드 n != entry):
    DOM[n] = {n} U ( n DOM[p] )  // p는 n의 모든 선행자
                                  // 선행자들의 DOM 교집합에 n을 추가

1.4 지배 경계 (Dominance Frontier)

노드 n의 지배 경계 DF(n)는 n이 지배하는 영역의 바로 바깥 경계입니다.

DF(n) = 노드 y의 집합 | n이 y의 선행자를 지배하지만, y 자체는 (엄격하게) 지배하지 않음

지배 경계는 SSA 구성에서 phi 함수의 삽입 위치를 결정하는 데 핵심적으로 사용됩니다.

2. 자연 루프 (Natural Loop)

2.1 후방 간선 (Back Edge)

흐름 그래프에서 간선 n -> d에서 d가 n을 지배하면, 이 간선을 후방 간선이라 합니다.

// 흐름 그래프에서:
//   1 -> 2 -> 3 -> 4
//        ^         |
//        +---------+   // 4 -> 2는 후방 간선 (2 dom 4)

2.2 자연 루프의 정의

후방 간선 n -> d에 대한 자연 루프:

헤더: 노드 d (루프의 유일한 진입점)
본체: d를 거치지 않고 n에 도달할 수 있는 모든 노드의 집합과 d

알고리즘: 자연 루프 구성 (후방 간선 n -> d)

1. loop = {d}
2. stack = {n}  (n이 d와 다르면)
3. while stack이 비어있지 않음:
     m = stack에서 pop
     if m이 loop에 없으면:
         loop에 m 추가
         m의 모든 선행자를 stack에 push
4. 결과: loop

2.3 예시

// 흐름 그래프:
//   B1 -> B2 -> B3 -> B4
//          ^          |
//          +----------+  (후방 간선: B4 -> B2)
//          |
//          +-- B5 (B2 -> B5, B5 -> B2도 후방 간선)

// 후방 간선 B4 -> B2의 자연 루프:
//   {B2, B3, B4}  (B2를 거치지 않고 B4에 도달 가능: B3, B4)

// 후방 간선 B5 -> B2의 자연 루프:
//   {B2, B5}

2.4 루프 중첩

두 루프가 같은 헤더를 공유하면 합쳐서 하나의 루프로 취급합니다. 그렇지 않으면 하나가 다른 하나에 완전히 포함되거나(중첩) 서로소입니다.

// 외부 루프: {B1, B2, B3, B4, B5}
// 내부 루프: {B3, B4}
//
// 내부 루프를 먼저 최적화한 후 외부 루프를 최적화합니다
// (안쪽에서 바깥쪽으로)

3. 루프 불변 코드 검출과 이동

3.1 루프 불변 코드 검출

명령어 s: x = y op z가 루프 불변인 조건:

y와 z의 모든 정의가 루프 밖에 있거나
y(또는 z)의 정의가 정확히 하나이고, 그 정의가 이미 루프 불변으로 판별된 경우
y(또는 z)가 상수

알고리즘: 루프 불변 코드 검출

1. 상수이거나 루프 밖에서만 정의된 피연산자를 가진 명령어를
   "루프 불변"으로 표시
2. 반복:
   - 모든 피연산자가 상수이거나, 루프 밖 정의이거나,
     이미 루프 불변으로 표시된 정의 하나만 가지는 명령어를
     새로 "루프 불변"으로 표시
3. 더 이상 새로운 표시가 없으면 종료

3.2 예시

// 루프:
//   B2: i = i + 1
//       t1 = n * 4       // n은 루프 밖에서 정의 -> 루프 불변
//       t2 = a[t1]       // t1이 루프 불변이고 a가 루프에서 불변 -> 루프 불변
//       t3 = t2 + i
//       if i < 100 goto B2

// 1차: t1 = n * 4 표시 (n이 루프 밖 정의)
// 2차: t2 = a[t1] 표시 (t1이 루프 불변)
// 3차: 더 이상 없음 -> 종료

3.3 코드 이동 조건

루프 불변 명령어 s: x = y op z를 루프 밖(preheader)으로 이동하려면 다음 조건을 모두 만족해야 합니다:

조건 1: s를 포함하는 블록이 루프의 모든 출구를 지배

// 안전한 경우:
//   preheader -> [B1: s: x=... ] -> B2(출구)
//   B1이 출구를 지배하므로 이동 가능

// 위험한 경우:
//   preheader -> B1 -> B3(출구)
//                |
//                v
//               B2: s: x=...
//   B2가 출구 B3를 지배하지 않음 -> 이동 불가
//   (s가 실행되지 않는 경로가 존재)

조건 2: x가 루프 내에서 다른 곳에서 정의되지 않음

조건 3: 루프 내에서 x를 사용하는 모든 곳이 s의 정의에만 도달

3.4 Preheader 삽입

코드 이동을 위해 루프 헤더 바로 앞에 preheader 블록을 삽입합니다.

// 이동 전:
//   ... -> header -> body -> ...
//           ^               |
//           +---------------+

// preheader 삽입 후:
//   ... -> preheader -> header -> body -> ...
//                        ^               |
//                        +---------------+

// 루프 불변 코드를 preheader에 배치:
//   preheader:
//     t1 = n * 4       // 이동된 코드
//     t2 = a[t1]       // 이동된 코드

4. 유도 변수 검출과 제거

4.1 기본 유도 변수 (Basic Induction Variable)

루프 내에서 i = i + c (c는 루프 상수) 형태로만 정의되는 변수 i를 기본 유도 변수라 합니다.

4.2 파생 유도 변수 (Derived Induction Variable)

기본 유도 변수 i에 대해 j = c1 * i + c2 (c1, c2는 루프 상수)로 표현되는 변수 j를 파생 유도 변수라 합니다.

유도 변수의 삼중쌍(triple) 표기: (i, c1, c2)는 j = c1 * i + c2를 의미합니다.

// 예시:
// i = i + 1        // i는 기본 유도 변수
// t1 = 4 * i       // t1은 파생 유도 변수: (i, 4, 0)
// t2 = t1 + base   // t2는 파생 유도 변수: (i, 4, base)

4.3 유도 변수 검출 알고리즘

알고리즘: 유도 변수 검출

1단계: 기본 유도 변수 검출
  - 루프 내에서 i = i + c 형태로만 정의되는 변수를 찾음

2단계: 파생 유도 변수 검출
  - j = c1 * i (i가 기본 유도 변수, c1이 루프 상수)
    -> j의 삼중쌍: (i, c1, 0)
  - j = k + c2 (k가 유도 변수 (i, a, b), c2가 루프 상수)
    -> j의 삼중쌍: (i, a, b+c2)
  - j = c1 * k (k가 유도 변수 (i, a, b))
    -> j의 삼중쌍: (i, c1*a, c1*b)

4.4 강도 감소 적용

파생 유도 변수의 곱셈을 덧셈으로 변환합니다.

// 원본 코드:
// i = 0
// loop:
//   t1 = 4 * i       // 곱셈
//   a[t1] = 0
//   i = i + 1
//   if i < n goto loop

// 강도 감소 후:
// i = 0
// t1 = 0             // 초기값: 4 * 0 = 0
// loop:
//   a[t1] = 0
//   i = i + 1
//   t1 = t1 + 4      // 곱셈 -> 덧셈
//   if i < n goto loop

4.5 유도 변수 제거

강도 감소 후, 기본 유도 변수가 루프 종료 조건에서만 사용되면 파생 유도 변수로 대체할 수 있습니다.

// 강도 감소 후:
// i = 0
// t1 = 0
// loop:
//   a[t1] = 0
//   i = i + 1         // i는 비교에서만 사용
//   t1 = t1 + 4
//   if i < n goto loop

// 유도 변수 제거 후:
// t1 = 0
// limit = 4 * n       // 종료 조건 변환 (루프 밖에서 한 번만 계산)
// loop:
//   a[t1] = 0
//   t1 = t1 + 4
//   if t1 < limit goto loop
//
// i가 완전히 제거됨!

5. 루프 펼침 (Loop Unrolling)

5.1 개념

루프 본체를 여러 번 복사하여 반복 횟수를 줄이는 기법입니다.

5.2 기본 루프 펼침

// 원본 (100번 반복)
for (i = 0; i < 100; i++) {
    a[i] = 0;
}

// 4배 펼침 (25번 반복)
for (i = 0; i < 100; i += 4) {
    a[i]   = 0;
    a[i+1] = 0;
    a[i+2] = 0;
    a[i+3] = 0;
}

5.3 장점

루프 오버헤드 감소: 분기, 비교, 증가 명령어 횟수 감소
명령어 스케줄링 기회 증가: 독립적인 명령어가 많아져 파이프라인 활용도 향상
레지스터 활용 개선: 연속된 연산 사이에서 값 재사용 가능

5.4 단점과 고려사항

// 반복 횟수가 펼침 인수의 배수가 아닌 경우
// 나머지 처리가 필요

// n이 4의 배수가 아닐 수 있는 경우
i = 0
// 메인 루프 (4배 펼침)
while (i + 3 < n) {
    a[i] = 0; a[i+1] = 0; a[i+2] = 0; a[i+3] = 0;
    i += 4;
}
// 나머지 처리
while (i < n) {
    a[i] = 0;
    i++;
}

코드 크기 증가: 캐시에 악영향을 줄 수 있음
레지스터 압박: 펼침이 과도하면 레지스터 스필이 발생할 수 있음
일반적으로 2배~8배 펼침이 효과적

6. 루프 최적화 종합 예시

전체 루프 최적화 과정을 하나의 예시로 살펴보겠습니다.

// 원본 코드 (행렬의 한 행 초기화)
i = 0
loop:
    t1 = i * 8           // 유도 변수 (곱셈)
    t2 = base + t1       // 유도 변수
    *t2 = 0.0
    i = i + 1
    if i < n goto loop

// 1단계: 루프 불변 코드 이동
//   (이 예시에서는 해당 없음)

// 2단계: 강도 감소
i = 0
t1 = 0                   // 초기값
t2 = base                // 초기값
loop:
    *t2 = 0.0
    i = i + 1
    t1 = t1 + 8          // 곱셈 -> 덧셈
    t2 = t2 + 8          // 곱셈 -> 덧셈
    if i < n goto loop

// 3단계: 유도 변수 제거 (i와 t1 제거)
t2 = base
limit = base + n * 8     // 루프 밖에서 한 번 계산
loop:
    *t2 = 0.0
    t2 = t2 + 8
    if t2 < limit goto loop

// 4단계: 루프 펼침 (2배)
t2 = base
limit = base + n * 8
loop:
    *t2 = 0.0
    *(t2 + 8) = 0.0
    t2 = t2 + 16
    if t2 < limit goto loop
// (나머지 처리 코드는 생략)

원본에서 루프 한 번 반복에 곱셈 1회, 덧셈 2회, 저장 1회, 비교 1회가 필요했지만, 최적화 후에는 덧셈 1회, 저장 2회, 비교 1회로 줄었습니다.

정리

개념	설명
지배자	진입점에서 특정 노드로 가는 모든 경로가 거치는 노드
지배자 트리	지배 관계를 트리로 표현한 구조
후방 간선	지배자로 되돌아가는 간선 (루프 형성)
자연 루프	후방 간선에 의해 정의되는 루프
루프 불변 코드 이동	매 반복 같은 값을 계산하는 코드를 루프 밖으로 이동
유도 변수	루프에서 일정하게 증감하는 변수
강도 감소	곱셈을 덧셈으로 대체
유도 변수 제거	불필요해진 유도 변수를 제거
루프 펼침	루프 본체를 복사하여 반복 횟수 감소

루프 최적화는 컴파일러 최적화에서 가장 큰 성능 향상을 가져오는 분야입니다. 지배자 분석으로 루프를 식별하고, 불변 코드 이동과 강도 감소로 루프 본체를 경량화하며, 루프 펼침으로 오버헤드를 줄이는 것이 기본 전략입니다. 다음 글에서는 명령어 수준 병렬성과 스케줄링을 다루겠습니다.

[Compiler] 15. Loop Optimization and Code Motion

Overview

Most of a program's execution time is spent in loops. Therefore, optimizing loops is the most effective way to improve overall program performance. In this article, we systematically examine techniques for identifying and optimizing loops.

1. Dominators

1.1 Definition

In a flow graph, node d dominates node n = every path from the entry point to n passes through d.

This is denoted as d dom n.

Properties:

Every node dominates itself (reflexive)
If d dom n and n dom d then d = n (antisymmetric)
If d dom n and n dom p then d dom p (transitive)

1.2 Dominator Tree

A tree representation of the dominator relation. Each node's parent is its immediate dominator.

// Flow graph:
//   1 -> 2 -> 3 -> 4
//        |    ^    |
//        v    |    v
//        5 ---+    6
//                  |
//                  v
//                  7

// Dominator tree:
//        1
//        |
//        2
//      / | \
//     3  5  6
//     |     |
//     4     7
//
// Interpretation: 1 dom all nodes
//                 2 dom 3,4,5,6,7
//                 3 dom 4
//                 6 dom 7

1.3 Dominator Computation Algorithm

Dominators are computed using an iterative algorithm:

Algorithm: Dominator Computation

Initialization:
  DOM[entry] = {entry}
  DOM[n] = set of all nodes (n != entry)

Iterate (until convergence):
  for (all nodes n != entry):
    DOM[n] = {n} U ( n DOM[p] )  // p is all predecessors of n
                                  // Add n to the intersection of predecessors' DOMs

1.4 Dominance Frontier

The dominance frontier DF(n) of node n is the boundary just outside the region dominated by n.

DF(n) = set of nodes y | n dominates a predecessor of y, but does not (strictly) dominate y itself

The dominance frontier is critically used to determine where to insert phi functions in SSA construction.

2. Natural Loops

2.1 Back Edge

In a flow graph, if in edge n -> d, d dominates n, then this edge is called a back edge.

// In a flow graph:
//   1 -> 2 -> 3 -> 4
//        ^         |
//        +---------+   // 4 -> 2 is a back edge (2 dom 4)

2.2 Definition of Natural Loop

The natural loop for back edge n -> d:

Header: Node d (the sole entry point of the loop)
Body: The set of all nodes that can reach n without going through d, plus d

Algorithm: Natural Loop Construction (back edge n -> d)

1. loop = {d}
2. stack = {n}  (if n is different from d)
3. while stack is not empty:
     m = pop from stack
     if m is not in loop:
         add m to loop
         push all predecessors of m onto stack
4. Result: loop

2.3 Example

// Flow graph:
//   B1 -> B2 -> B3 -> B4
//          ^          |
//          +----------+  (back edge: B4 -> B2)
//          |
//          +-- B5 (B2 -> B5, B5 -> B2 is also a back edge)

// Natural loop for back edge B4 -> B2:
//   {B2, B3, B4}  (reachable to B4 without going through B2: B3, B4)

// Natural loop for back edge B5 -> B2:
//   {B2, B5}

2.4 Loop Nesting

If two loops share the same header, they are merged and treated as a single loop. Otherwise, one is completely contained within the other (nested) or they are disjoint.

// Outer loop: {B1, B2, B3, B4, B5}
// Inner loop: {B3, B4}
//
// Optimize the inner loop first, then the outer loop
// (from inside out)

3. Loop-Invariant Code Detection and Motion

3.1 Loop-Invariant Code Detection

The conditions for instruction s: x = y op z to be loop invariant:

All definitions of y and z are outside the loop, or
y (or z) has exactly one definition, and that definition has already been identified as loop invariant
y (or z) is a constant

Algorithm: Loop-Invariant Code Detection

1. Mark instructions whose operands are constants or defined only outside
   the loop as "loop invariant"
2. Repeat:
   - Newly mark instructions as "loop invariant" if all operands are constants,
     outside definitions, or have exactly one definition already marked as loop invariant
3. Terminate when no new markings are made

3.2 Example

// Loop:
//   B2: i = i + 1
//       t1 = n * 4       // n is defined outside the loop -> loop invariant
//       t2 = a[t1]       // t1 is loop invariant and a is invariant in loop -> loop invariant
//       t3 = t2 + i
//       if i < 100 goto B2

// Pass 1: Mark t1 = n * 4 (n is an outside definition)
// Pass 2: Mark t2 = a[t1] (t1 is loop invariant)
// Pass 3: No more -> terminate

3.3 Code Motion Conditions

To move a loop-invariant instruction s: x = y op z outside the loop (to the preheader), all of the following conditions must be satisfied:

Condition 1: The block containing s dominates all exits of the loop

// Safe case:
//   preheader -> [B1: s: x=... ] -> B2(exit)
//   B1 dominates the exit, so motion is possible

// Dangerous case:
//   preheader -> B1 -> B3(exit)
//                |
//                v
//               B2: s: x=...
//   B2 does not dominate exit B3 -> motion not possible
//   (a path exists where s is not executed)

Condition 2: x is not defined elsewhere within the loop

Condition 3: All uses of x in the loop reach only the definition at s

3.4 Preheader Insertion

A preheader block is inserted just before the loop header for code motion.

// Before motion:
//   ... -> header -> body -> ...
//           ^               |
//           +---------------+

// After preheader insertion:
//   ... -> preheader -> header -> body -> ...
//                        ^               |
//                        +---------------+

// Place loop-invariant code in the preheader:
//   preheader:
//     t1 = n * 4       // Moved code
//     t2 = a[t1]       // Moved code

4. Induction Variable Detection and Elimination

4.1 Basic Induction Variable

A variable i that is defined only in the form i = i + c (where c is a loop constant) within the loop is called a basic induction variable.

4.2 Derived Induction Variable

A variable j that can be expressed as j = c1 * i + c2 (where c1, c2 are loop constants) for a basic induction variable i is called a derived induction variable.

Triple notation for induction variables: (i, c1, c2) means j = c1 * i + c2.

// Example:
// i = i + 1        // i is a basic induction variable
// t1 = 4 * i       // t1 is a derived induction variable: (i, 4, 0)
// t2 = t1 + base   // t2 is a derived induction variable: (i, 4, base)

4.3 Induction Variable Detection Algorithm

Algorithm: Induction Variable Detection

Step 1: Detect basic induction variables
  - Find variables defined only in the form i = i + c within the loop

Step 2: Detect derived induction variables
  - j = c1 * i (i is basic induction variable, c1 is loop constant)
    -> j's triple: (i, c1, 0)
  - j = k + c2 (k is induction variable (i, a, b), c2 is loop constant)
    -> j's triple: (i, a, b+c2)
  - j = c1 * k (k is induction variable (i, a, b))
    -> j's triple: (i, c1*a, c1*b)

4.4 Applying Strength Reduction

Convert the multiplication of derived induction variables to addition.

// Original code:
// i = 0
// loop:
//   t1 = 4 * i       // Multiplication
//   a[t1] = 0
//   i = i + 1
//   if i < n goto loop

// After strength reduction:
// i = 0
// t1 = 0             // Initial value: 4 * 0 = 0
// loop:
//   a[t1] = 0
//   i = i + 1
//   t1 = t1 + 4      // Multiplication -> addition
//   if i < n goto loop

4.5 Induction Variable Elimination

After strength reduction, if the basic induction variable is used only in the loop termination condition, it can be replaced with the derived induction variable.

// After strength reduction:
// i = 0
// t1 = 0
// loop:
//   a[t1] = 0
//   i = i + 1         // i is used only in comparison
//   t1 = t1 + 4
//   if i < n goto loop

// After induction variable elimination:
// t1 = 0
// limit = 4 * n       // Termination condition conversion (computed once outside loop)
// loop:
//   a[t1] = 0
//   t1 = t1 + 4
//   if t1 < limit goto loop
//
// i is completely eliminated!

5. Loop Unrolling

5.1 Concept

A technique that copies the loop body multiple times to reduce the number of iterations.

5.2 Basic Loop Unrolling

// Original (100 iterations)
for (i = 0; i < 100; i++) {
    a[i] = 0;
}

// 4x unrolling (25 iterations)
for (i = 0; i < 100; i += 4) {
    a[i]   = 0;
    a[i+1] = 0;
    a[i+2] = 0;
    a[i+3] = 0;
}

5.3 Advantages

Reduced loop overhead: Fewer branch, comparison, and increment instructions
Increased instruction scheduling opportunities: More independent instructions improve pipeline utilization
Improved register utilization: Value reuse between consecutive operations

5.4 Disadvantages and Considerations

// When the iteration count is not a multiple of the unrolling factor
// Remainder handling is needed

// When n might not be a multiple of 4
i = 0
// Main loop (4x unrolling)
while (i + 3 < n) {
    a[i] = 0; a[i+1] = 0; a[i+2] = 0; a[i+3] = 0;
    i += 4;
}
// Remainder handling
while (i < n) {
    a[i] = 0;
    i++;
}

Code size increase: May negatively impact cache
Register pressure: Excessive unrolling may cause register spills
Typically 2x to 8x unrolling is effective

6. Comprehensive Loop Optimization Example

Let us examine the entire loop optimization process through a single example.

// Original code (initializing a row of a matrix)
i = 0
loop:
    t1 = i * 8           // Induction variable (multiplication)
    t2 = base + t1       // Induction variable
    *t2 = 0.0
    i = i + 1
    if i < n goto loop

// Step 1: Loop-invariant code motion
//   (none applicable in this example)

// Step 2: Strength reduction
i = 0
t1 = 0                   // Initial value
t2 = base                // Initial value
loop:
    *t2 = 0.0
    i = i + 1
    t1 = t1 + 8          // Multiplication -> addition
    t2 = t2 + 8          // Multiplication -> addition
    if i < n goto loop

// Step 3: Induction variable elimination (remove i and t1)
t2 = base
limit = base + n * 8     // Computed once outside the loop
loop:
    *t2 = 0.0
    t2 = t2 + 8
    if t2 < limit goto loop

// Step 4: Loop unrolling (2x)
t2 = base
limit = base + n * 8
loop:
    *t2 = 0.0
    *(t2 + 8) = 0.0
    t2 = t2 + 16
    if t2 < limit goto loop
// (remainder handling code omitted)

The original required 1 multiplication, 2 additions, 1 store, and 1 comparison per loop iteration, but after optimization, only 1 addition, 2 stores, and 1 comparison are needed.

Summary

Concept	Description
Dominator	A node through which all paths from the entry point to a specific node pass
Dominator tree	A tree structure representing the dominator relation
Back edge	An edge going back to a dominator (forming a loop)
Natural loop	A loop defined by a back edge
Loop-invariant code motion	Moving code that computes the same value every iteration outside the loop
Induction variable	A variable that changes by a constant amount in a loop
Strength reduction	Replacing multiplication with addition
Induction variable elimination	Removing induction variables that become unnecessary
Loop unrolling	Copying the loop body to reduce the number of iterations

Loop optimization is the area of compiler optimization that yields the greatest performance improvements. The basic strategy is to identify loops through dominator analysis, lighten the loop body through invariant code motion and strength reduction, and reduce overhead through loop unrolling. In the next article, we will cover instruction-level parallelism and scheduling.