Split View: [컴파일러] 16. 명령어 수준 병렬성과 스케줄링

[컴파일러] 16. 명령어 수준 병렬성과 스케줄링

개요

현대 프로세서는 여러 명령어를 동시에 실행하여 성능을 높입니다. 이를 명령어 수준 병렬성(Instruction-Level Parallelism, ILP)이라 합니다. 컴파일러는 ILP를 최대한 활용하도록 명령어의 순서를 재배치(스케줄링)하는 역할을 합니다.

이번 글에서는 ILP의 개념, 이를 지원하는 프로세서 아키텍처, 그리고 컴파일러의 명령어 스케줄링 기법을 살펴봅니다.

1. 명령어 수준 병렬성 (ILP)

1.1 기본 개념

ILP는 프로그램 내에서 동시에 실행할 수 있는 명령어가 얼마나 있는지를 나타냅니다.

// ILP가 높은 코드
a = x + y     // 이 세 명령어는 서로 독립적
b = p + q     // 동시 실행 가능
c = r - s

// ILP가 낮은 코드
a = x + y     // 순차적 의존성
b = a + z     // a에 의존
c = b * 2     // b에 의존

1.2 데이터 의존성의 종류

명령어 간의 의존성이 ILP를 제한합니다:

진정한 의존성 (True Dependence, RAW: Read After Write)

a = b + c     // a를 쓰기
d = a + e     // a를 읽기 (대기 필요)

반의존성 (Anti-Dependence, WAR: Write After Read)

a = b + c     // b를 읽기
b = d + e     // b를 쓰기 (이름 변경으로 해결 가능)

출력 의존성 (Output Dependence, WAW: Write After Write)

a = b + c     // a를 쓰기
a = d + e     // a를 다시 쓰기 (순서 유지 필요)

반의존성과 출력 의존성은 레지스터 이름 변경(register renaming)으로 해결할 수 있으므로 이름 의존성이라고도 합니다.

2. ILP를 위한 프로세서 아키텍처

2.1 파이프라이닝 (Pipelining)

명령어 실행을 여러 단계로 나누어 각 단계를 겹쳐서 수행합니다.

// 5단계 파이프라인: IF(Fetch), ID(Decode), EX(Execute), MEM, WB(Writeback)

시간:    1    2    3    4    5    6    7
명령1:  IF   ID   EX   MEM  WB
명령2:       IF   ID   EX   MEM  WB
명령3:            IF   ID   EX   MEM  WB

// 이상적으로 매 클록마다 하나의 명령어 완료

파이프라인 해저드(hazard):

// 데이터 해저드 (RAW)
ADD R1, R2, R3    // R1에 결과 쓰기 (WB 단계)
SUB R4, R1, R5    // R1을 읽기 (ID 단계) - 아직 값이 준비 안됨!

// 해결: 포워딩(forwarding) 또는 지연(stall)
ADD R1, R2, R3
NOP               // 버블 삽입 (stall)
SUB R4, R1, R5

// 컴파일러가 독립적인 명령어를 사이에 삽입하면 stall 방지
ADD R1, R2, R3
MUL R6, R7, R8    // 독립적인 명령어로 빈 슬롯 채움
SUB R4, R1, R5

2.2 슈퍼스칼라 프로세서 (Superscalar)

한 클록에 여러 명령어를 동시에 발행(issue)합니다.

// 2-issue 슈퍼스칼라 프로세서
시간:    1         2         3
슬롯1:  ADD       MUL       LD
슬롯2:  SUB       AND       ST

// 이상적으로 매 클록마다 2개 명령어 완료
// (IPC = Instructions Per Cycle = 2)

2.3 VLIW 아키텍처

VLIW(Very Long Instruction Word)는 하나의 긴 명령어에 여러 연산을 묶습니다.

// VLIW 명령어 (4개 연산 슬롯)
// | ALU1     | ALU2     | MEM      | BRANCH   |
// | ADD R1.. | MUL R3.. | LD R5..  | NOP      |

// 각 슬롯은 독립적인 기능 유닛에서 실행
// 병렬 실행의 책임이 컴파일러에 있음

슈퍼스칼라 vs VLIW:

특성	슈퍼스칼라	VLIW
병렬성 결정	하드웨어 (동적)	컴파일러 (정적)
하드웨어 복잡도	높음	낮음
코드 호환성	높음	낮음
컴파일러 부담	보통	높음
대표 제품	Intel x86, ARM	Intel Itanium, TI DSP

3. 명령어 스케줄링

3.1 기본 블록 스케줄링

기본 블록 내에서 명령어 순서를 재배치하여 파이프라인 stall을 최소화합니다.

의존성 DAG 구성:

// 원본 코드:
// 1: LD  R1, a       // 로드 지연시간 = 2
// 2: ADD R2, R1, R3  // R1에 의존 (RAW)
// 3: LD  R4, b       // 독립적
// 4: MUL R5, R4, R6  // R4에 의존 (RAW)
// 5: ADD R7, R2, R5  // R2, R5에 의존

// 의존성 DAG:
//   1 --2--> 2 --1--> 5
//   3 --2--> 4 --1--> 5
// (간선의 가중치 = 지연 시간)

3.2 리스트 스케줄링 (List Scheduling)

리스트 스케줄링은 기본 블록 내에서 가장 널리 사용되는 스케줄링 알고리즘입니다.

알고리즘: 리스트 스케줄링

1. 의존성 DAG 구성
2. 각 노드의 우선순위 계산 (보통 임계 경로 길이)
3. 준비 리스트(ready list)에 의존성이 없는 명령어 추가
4. while 준비 리스트가 비어있지 않음:
     a. 우선순위가 가장 높은 명령어를 선택
     b. 현재 사이클에 발행
     c. 후속 명령어의 의존성이 해소되면 준비 리스트에 추가

3.3 리스트 스케줄링 예시

// 의존성 DAG (지연시간):
//   LD a(2) -> ADD(1) -> ST(1)
//   LD b(2) -> MUL(3) -> ST(1)

// 우선순위 (루트까지 임계 경로 길이):
//   LD a: 2+1+1 = 4
//   LD b: 2+3+1 = 6  (더 높은 우선순위)
//   ADD: 1+1 = 2
//   MUL: 3+1 = 4
//   ST(ADD 결과): 1
//   ST(MUL 결과): 1

// 스케줄링:
// 사이클 1: LD b (우선순위 6, 가장 높음)
// 사이클 2: LD a (LD b 결과 대기 중, 다른 것 발행)
// 사이클 3: MUL (LD b 결과 준비됨, 사이클 1+2=3)
// 사이클 4: ADD (LD a 결과 준비됨, 사이클 2+2=4)
// 사이클 5: ST (ADD 결과, 사이클 4+1=5)
// 사이클 6: ST (MUL 결과, 사이클 3+3=6)

// 원본 순서대로 실행하면 7사이클 필요
// 스케줄링 후 6사이클로 단축

4. 소프트웨어 파이프라이닝 (Software Pipelining)

4.1 개념

루프의 여러 반복을 겹쳐서 실행하도록 코드를 변환하는 기법입니다. 하드웨어 파이프라이닝과 유사한 아이디어를 소프트웨어 수준에서 적용합니다.

// 원본 루프 (각 반복에 3개 명령어, 각각 1사이클)
// 반복 i:  A(i), B(i), C(i)
// 반복 i+1: A(i+1), B(i+1), C(i+1)

// 시간:  1     2     3     4     5     6     7
// 원본:  A(0)  B(0)  C(0)  A(1)  B(1)  C(1)  A(2)...

// 소프트웨어 파이프라이닝 후:
// 시간:  1     2     3     4     5
// 원본:  A(0)  B(0)  C(0)
//              A(1)  B(1)  C(1)
//                    A(2)  B(2)  C(2)
//
// 정상 상태(steady state) 커널:
// 시간:  ...   k     k+1   k+2   ...
//              C(i)  C(i+1) ...
//              B(i+1) B(i+2) ...
//              A(i+2) A(i+3) ...

4.2 구성

소프트웨어 파이프라이닝된 루프는 세 부분으로 구성됩니다:

// 1. 프롤로그 (prologue): 파이프라인 채우기
A(0)
A(1), B(0)

// 2. 커널 (kernel): 정상 상태 반복
loop:
    A(i+2), B(i+1), C(i)   // 세 반복이 겹침
    i = i + 1
    if i < n-2 goto loop

// 3. 에필로그 (epilogue): 파이프라인 비우기
B(n-1), C(n-2)
C(n-1)

4.3 장점

루프 본체의 처리량(throughput) 최대화
리소스(기능 유닛, 레지스터)의 균형 잡힌 활용
루프 펼침과 결합하면 더욱 효과적

5. 레지스터 할당과 스케줄링의 상호작용

5.1 상충 관계

레지스터 할당과 명령어 스케줄링은 서로 상충하는 목표를 가집니다:

// 스케줄링에 유리한 코드 (레지스터 많이 사용)
LD R1, a
LD R2, b        // R1 대기 중 다른 로드 수행
LD R3, c
ADD R4, R1, R2  // R1 준비됨
MUL R5, R3, R4

// 레지스터 할당에 유리한 코드 (레지스터 적게 사용)
LD R1, a
ADD R1, R1, b   // R1 즉시 사용 (stall 가능!)
LD R1, c
MUL R1, R1, R1

5.2 해결 방법

방법 1: 스케줄링 먼저, 할당 나중

1. 가상 레지스터(무한 레지스터)로 스케줄링 수행
2. 스케줄링된 코드에 대해 레지스터 할당
3. 스필이 발생하면 스필 코드 삽입 후 재스케줄링

방법 2: 할당 먼저, 스케줄링 나중

1. 먼저 레지스터 할당 수행
2. 할당된 코드에 대해 스케줄링
3. 스케줄링 자유도가 제한됨 (같은 레지스터 사용으로 인한 거짓 의존성)

방법 3: 통합 접근

1. 레지스터 할당과 스케줄링을 동시에 고려
2. 더 좋은 결과를 얻을 수 있으나 복잡도가 높음

대부분의 실용적인 컴파일러는 방법 1을 사용합니다.

6. 전역 스케줄링

6.1 기본 블록을 넘어선 스케줄링

기본 블록 크기가 작으면 스케줄링의 자유도가 제한됩니다. 전역 스케줄링은 여러 기본 블록에 걸쳐 명령어를 이동합니다.

6.2 트레이스 스케줄링 (Trace Scheduling)

가장 자주 실행되는 경로(트레이스)를 찾아 그 경로 전체를 하나의 단위로 스케줄링합니다.

// 흐름 그래프:
//   B1 (실행 확률 100%)
//    |
//   B2 (if cond) -- 90% --> B3
//        |
//        10% --> B4

// 트레이스: B1 -> B2 -> B3 (가장 빈번한 경로)
// 이 경로를 하나의 긴 블록처럼 스케줄링

// B2에서 B3로 명령어를 이동하거나
// B1에서 B2로 명령어를 이동할 수 있음
// (덜 빈번한 경로 B4에 대한 보상 코드 필요)

6.3 슈퍼블록 (Superblock)

트레이스에서 중간 진입점을 제거하여 스케줄링을 단순화한 것입니다.

// 슈퍼블록: 하나의 진입점, 여러 출구점
//   entry -> A -> B -> C -> exit1
//                 |
//                 v
//               exit2
//
// 중간에서 들어오는 간선이 없으므로 코드 이동이 더 자유로움

7. VLIW 코드 생성

7.1 VLIW의 특징

VLIW 프로세서에서는 컴파일러가 병렬 실행의 전적인 책임을 집니다.

// VLIW 명령어 패킷 (3 슬롯: ALU, MEM, BR)
// 각 슬롯에 독립적인 연산을 배치

// 사이클 1: | ADD R1,R2,R3 | LD R4, mem1 | NOP        |
// 사이클 2: | MUL R5,R4,R6 | LD R7, mem2 | NOP        |
// 사이클 3: | SUB R8,R5,R7 | ST mem3, R1 | BEQ R8, L1 |

7.2 NOP 문제

ILP가 부족하면 빈 슬롯이 NOP으로 채워져 코드 크기가 커집니다.

// ILP가 낮은 코드:
// | ADD R1,R2,R3 | NOP         | NOP |
// | MUL R4,R1,R5 | NOP         | NOP |  // 모든 연산이 의존적
// | SUB R6,R4,R7 | NOP         | NOP |

// 해결: 루프 펼침 + 소프트웨어 파이프라이닝으로 ILP 증가

7.3 번들 압축

NOP으로 인한 코드 크기 증가를 줄이기 위해 번들 압축 기법을 사용합니다:

실제로 사용되는 슬롯만 인코딩
헤더 비트로 어떤 슬롯이 활성인지 표시

정리

개념	설명
ILP	동시 실행 가능한 명령어의 수준
파이프라이닝	명령어 실행 단계를 겹쳐서 수행
슈퍼스칼라	하드웨어가 동적으로 여러 명령어를 동시 발행
VLIW	컴파일러가 정적으로 병렬 연산을 하나의 긴 명령어에 배치
리스트 스케줄링	우선순위 기반 기본 블록 스케줄링 알고리즘
소프트웨어 파이프라이닝	루프의 여러 반복을 겹쳐서 실행
트레이스 스케줄링	빈번한 실행 경로를 하나의 단위로 스케줄링

명령어 스케줄링은 현대 프로세서의 성능을 최대한 끌어내는 핵심 컴파일러 기법입니다. 특히 VLIW 아키텍처에서는 컴파일러의 스케줄링 능력이 곧 프로세서의 성능을 결정합니다. 다음 글에서는 더 큰 단위의 병렬성인 데이터 병렬성과 루프 변환을 다루겠습니다.

[Compiler] 16. Instruction-Level Parallelism and Scheduling

Overview

Modern processors improve performance by executing multiple instructions simultaneously. This is called Instruction-Level Parallelism (ILP). The compiler plays the role of rearranging (scheduling) instruction order to maximize ILP utilization.

In this article, we examine the concept of ILP, processor architectures that support it, and compiler instruction scheduling techniques.

1. Instruction-Level Parallelism (ILP)

1.1 Basic Concept

ILP indicates how many instructions within a program can be executed simultaneously.

// High ILP code
a = x + y     // These three instructions are independent
b = p + q     // Can be executed simultaneously
c = r - s

// Low ILP code
a = x + y     // Sequential dependency
b = a + z     // Depends on a
c = b * 2     // Depends on b

1.2 Types of Data Dependencies

Dependencies between instructions limit ILP:

True Dependence (RAW: Read After Write)

a = b + c     // Write a
d = a + e     // Read a (must wait)

Anti-Dependence (WAR: Write After Read)

a = b + c     // Read b
b = d + e     // Write b (can be resolved by renaming)

Output Dependence (WAW: Write After Write)

a = b + c     // Write a
a = d + e     // Write a again (order must be maintained)

Anti-dependencies and output dependencies can be resolved by register renaming, so they are also called name dependencies.

2. Processor Architectures for ILP

2.1 Pipelining

Instruction execution is divided into multiple stages, and stages are overlapped.

// 5-stage pipeline: IF(Fetch), ID(Decode), EX(Execute), MEM, WB(Writeback)

Time:    1    2    3    4    5    6    7
Inst 1:  IF   ID   EX   MEM  WB
Inst 2:       IF   ID   EX   MEM  WB
Inst 3:            IF   ID   EX   MEM  WB

// Ideally, one instruction completes every clock cycle

Pipeline hazards:

// Data hazard (RAW)
ADD R1, R2, R3    // Write result to R1 (WB stage)
SUB R4, R1, R5    // Read R1 (ID stage) - value not ready yet!

// Solution: forwarding or stall
ADD R1, R2, R3
NOP               // Bubble insertion (stall)
SUB R4, R1, R5

// Compiler inserts independent instructions to prevent stalls
ADD R1, R2, R3
MUL R6, R7, R8    // Fill empty slot with independent instruction
SUB R4, R1, R5

2.2 Superscalar Processor

Issues multiple instructions simultaneously in a single clock cycle.

// 2-issue superscalar processor
Time:    1         2         3
Slot 1:  ADD       MUL       LD
Slot 2:  SUB       AND       ST

// Ideally, 2 instructions complete every clock cycle
// (IPC = Instructions Per Cycle = 2)

2.3 VLIW Architecture

VLIW (Very Long Instruction Word) bundles multiple operations into a single long instruction.

// VLIW instruction (4 operation slots)
// | ALU1     | ALU2     | MEM      | BRANCH   |
// | ADD R1.. | MUL R3.. | LD R5..  | NOP      |

// Each slot executes on an independent functional unit
// The compiler is responsible for parallel execution

Superscalar vs VLIW:

Feature	Superscalar	VLIW
Parallelism decision	Hardware (dynamic)	Compiler (static)
Hardware complexity	High	Low
Code compatibility	High	Low
Compiler burden	Moderate	High
Representative products	Intel x86, ARM	Intel Itanium, TI DSP

3. Instruction Scheduling

3.1 Basic Block Scheduling

Rearranges instruction order within a basic block to minimize pipeline stalls.

Dependency DAG construction:

// Original code:
// 1: LD  R1, a       // Load latency = 2
// 2: ADD R2, R1, R3  // Depends on R1 (RAW)
// 3: LD  R4, b       // Independent
// 4: MUL R5, R4, R6  // Depends on R4 (RAW)
// 5: ADD R7, R2, R5  // Depends on R2, R5

// Dependency DAG:
//   1 --2--> 2 --1--> 5
//   3 --2--> 4 --1--> 5
// (Edge weights = latencies)

3.2 List Scheduling

List scheduling is the most widely used scheduling algorithm for basic blocks.

Algorithm: List Scheduling

1. Build dependency DAG
2. Compute priority for each node (usually critical path length)
3. Add instructions with no dependencies to the ready list
4. while ready list is not empty:
     a. Select the instruction with highest priority
     b. Issue at current cycle
     c. Add successor instructions to ready list when dependencies are resolved

3.3 List Scheduling Example

// Dependency DAG (latencies):
//   LD a(2) -> ADD(1) -> ST(1)
//   LD b(2) -> MUL(3) -> ST(1)

// Priorities (critical path length to root):
//   LD a: 2+1+1 = 4
//   LD b: 2+3+1 = 6  (higher priority)
//   ADD: 1+1 = 2
//   MUL: 3+1 = 4
//   ST(ADD result): 1
//   ST(MUL result): 1

// Scheduling:
// Cycle 1: LD b (priority 6, highest)
// Cycle 2: LD a (waiting for LD b result, issue something else)
// Cycle 3: MUL (LD b result ready, cycle 1+2=3)
// Cycle 4: ADD (LD a result ready, cycle 2+2=4)
// Cycle 5: ST (ADD result, cycle 4+1=5)
// Cycle 6: ST (MUL result, cycle 3+3=6)

// Original order would need 7 cycles
// After scheduling, reduced to 6 cycles

4. Software Pipelining

4.1 Concept

A technique that transforms code so that multiple iterations of a loop are overlapped in execution. It applies an idea similar to hardware pipelining at the software level.

// Original loop (3 instructions per iteration, 1 cycle each)
// Iteration i:  A(i), B(i), C(i)
// Iteration i+1: A(i+1), B(i+1), C(i+1)

// Time:  1     2     3     4     5     6     7
// Orig:  A(0)  B(0)  C(0)  A(1)  B(1)  C(1)  A(2)...

// After software pipelining:
// Time:  1     2     3     4     5
// Orig:  A(0)  B(0)  C(0)
//              A(1)  B(1)  C(1)
//                    A(2)  B(2)  C(2)
//
// Steady state kernel:
// Time:  ...   k     k+1   k+2   ...
//              C(i)  C(i+1) ...
//              B(i+1) B(i+2) ...
//              A(i+2) A(i+3) ...

4.2 Structure

A software-pipelined loop consists of three parts:

// 1. Prologue: Filling the pipeline
A(0)
A(1), B(0)

// 2. Kernel: Steady state iteration
loop:
    A(i+2), B(i+1), C(i)   // Three iterations overlap
    i = i + 1
    if i < n-2 goto loop

// 3. Epilogue: Draining the pipeline
B(n-1), C(n-2)
C(n-1)

4.3 Advantages

Maximizes loop body throughput
Balanced utilization of resources (functional units, registers)
Even more effective when combined with loop unrolling

5. Interaction Between Register Allocation and Scheduling

5.1 Conflicting Goals

// Code favorable for scheduling (uses many registers)
LD R1, a
LD R2, b        // Perform another load while waiting for R1
LD R3, c
ADD R4, R1, R2  // R1 is ready
MUL R5, R3, R4

// Code favorable for register allocation (uses few registers)
LD R1, a
ADD R1, R1, b   // Use R1 immediately (may stall!)
LD R1, c
MUL R1, R1, R1

5.2 Solutions

Approach 1: Schedule first, allocate later

1. Perform scheduling with virtual registers (infinite registers)
2. Perform register allocation on scheduled code
3. If spills occur, insert spill code and reschedule

Approach 2: Allocate first, schedule later

1. Perform register allocation first
2. Schedule the allocated code
3. Scheduling freedom is limited (false dependencies from shared registers)

Approach 3: Integrated approach

1. Consider register allocation and scheduling simultaneously
2. Can achieve better results but complexity is high

Most practical compilers use Approach 1.

6. Global Scheduling

6.1 Scheduling Beyond Basic Blocks

When basic block sizes are small, scheduling freedom is limited. Global scheduling moves instructions across multiple basic blocks.

6.2 Trace Scheduling

Finds the most frequently executed path (trace) and schedules the entire path as a single unit.

// Flow graph:
//   B1 (execution probability 100%)
//    |
//   B2 (if cond) -- 90% --> B3
//        |
//        10% --> B4

// Trace: B1 -> B2 -> B3 (most frequent path)
// Schedule this path as one long block

// Instructions can be moved from B2 to B3
// or from B1 to B2
// (compensation code needed for the less frequent path B4)

6.3 Superblock

A trace with intermediate entry points removed to simplify scheduling.

// Superblock: one entry point, multiple exit points
//   entry -> A -> B -> C -> exit1
//                 |
//                 v
//               exit2
//
// No incoming edges in the middle, so code motion is freer

7. VLIW Code Generation

7.1 VLIW Characteristics

In VLIW processors, the compiler has full responsibility for parallel execution.

// VLIW instruction packet (3 slots: ALU, MEM, BR)
// Place independent operations in each slot

// Cycle 1: | ADD R1,R2,R3 | LD R4, mem1 | NOP        |
// Cycle 2: | MUL R5,R4,R6 | LD R7, mem2 | NOP        |
// Cycle 3: | SUB R8,R5,R7 | ST mem3, R1 | BEQ R8, L1 |

7.2 NOP Problem

When ILP is insufficient, empty slots are filled with NOPs, increasing code size.

// Code with low ILP:
// | ADD R1,R2,R3 | NOP         | NOP |
// | MUL R4,R1,R5 | NOP         | NOP |  // All operations are dependent
// | SUB R6,R4,R7 | NOP         | NOP |

// Solution: Increase ILP through loop unrolling + software pipelining

7.3 Bundle Compression

To reduce code size increase due to NOPs, bundle compression techniques are used:

Encode only actually used slots
Use header bits to indicate which slots are active

Summary

Concept	Description
ILP	The degree of instructions that can be executed simultaneously
Pipelining	Overlapping instruction execution stages
Superscalar	Hardware dynamically issues multiple instructions simultaneously
VLIW	Compiler statically places parallel operations in one long instruction
List scheduling	Priority-based basic block scheduling algorithm
Software pipelining	Overlapping execution of multiple loop iterations
Trace scheduling	Scheduling a frequently executed path as a single unit

Instruction scheduling is a key compiler technique for maximizing the performance of modern processors. In VLIW architectures in particular, the compiler's scheduling capability directly determines processor performance. In the next article, we will cover data parallelism and loop transformations at a larger scale.