- Author: Youngju Kim (@fjvbn20031)
Overview
Modern processors improve performance by executing multiple instructions simultaneously, a property called Instruction-Level Parallelism (ILP). The compiler's job is to reorder (schedule) instructions so the hardware can exploit as much of this parallelism as possible.
In this article, we examine the concept of ILP, processor architectures that support it, and compiler instruction scheduling techniques.
1. Instruction-Level Parallelism (ILP)
1.1 Basic Concept
ILP indicates how many instructions within a program can be executed simultaneously.
// High ILP code
a = x + y // These three instructions are independent
b = p + q // Can be executed simultaneously
c = r - s
// Low ILP code
a = x + y // Sequential dependency
b = a + z // Depends on a
c = b * 2 // Depends on b
1.2 Types of Data Dependencies
Dependencies between instructions limit ILP:
True Dependence (RAW: Read After Write)
a = b + c // Write a
d = a + e // Read a (must wait)
Anti-Dependence (WAR: Write After Read)
a = b + c // Read b
b = d + e // Write b (can be resolved by renaming)
Output Dependence (WAW: Write After Write)
a = b + c // Write a
a = d + e // Write a again (order must be maintained)
Anti-dependencies and output dependencies can be resolved by register renaming, so they are also called name dependencies.
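The three dependence kinds above can be checked mechanically. Below is a minimal Python sketch that classifies the dependences from one instruction to a later one; the `(dest, sources)` instruction model and all names are illustrative, not from any particular compiler:

```python
# Classify data dependences between two instructions, each modeled as
# (destination register, set of source registers).

def classify_dependences(first, second):
    """Return the set of dependence types from `first` to a later `second`."""
    deps = set()
    d1, srcs1 = first
    d2, srcs2 = second
    if d1 in srcs2:
        deps.add("RAW")   # true dependence: second reads what first wrote
    if d2 in srcs1:
        deps.add("WAR")   # anti-dependence: second overwrites a source of first
    if d1 == d2:
        deps.add("WAW")   # output dependence: both write the same register
    return deps

# a = b + c ; d = a + e  -> true dependence on a
print(classify_dependences(("a", {"b", "c"}), ("d", {"a", "e"})))  # {'RAW'}
# a = b + c ; b = d + e  -> anti-dependence on b
print(classify_dependences(("a", {"b", "c"}), ("b", {"d", "e"})))  # {'WAR'}
```

Note that only the RAW result constrains scheduling fundamentally; WAR and WAW results mark candidates for renaming.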
2. Processor Architectures for ILP
2.1 Pipelining
Instruction execution is divided into multiple stages, and stages are overlapped.
// 5-stage pipeline: IF(Fetch), ID(Decode), EX(Execute), MEM, WB(Writeback)
Time:    1   2   3   4   5   6   7
Inst 1:  IF  ID  EX  MEM WB
Inst 2:      IF  ID  EX  MEM WB
Inst 3:          IF  ID  EX  MEM WB
// Ideally, one instruction completes every clock cycle
Pipeline hazards:
// Data hazard (RAW)
ADD R1, R2, R3 // Write result to R1 (WB stage)
SUB R4, R1, R5 // Read R1 (ID stage) - value not ready yet!
// Solution: forwarding or stall
ADD R1, R2, R3
NOP // Bubble insertion (stall)
SUB R4, R1, R5
// Compiler inserts independent instructions to prevent stalls
ADD R1, R2, R3
MUL R6, R7, R8 // Fill empty slot with independent instruction
SUB R4, R1, R5
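The benefit of filling a delay slot with an independent instruction can be illustrated with a toy model. The sketch below is a deliberate simplification (single-issue, in-order, every result ready a fixed number of cycles after issue), not a real pipeline model:

```python
# Count the bubble (stall) cycles an in-order, single-issue machine would
# insert, assuming each result becomes available `latency` cycles after issue.

def count_stalls(instrs, latency):
    """instrs: list of (dest, srcs). Returns total stall cycles."""
    ready = {}            # register -> cycle its value becomes available
    cycle = 0
    stalls = 0
    for dest, srcs in instrs:
        earliest = max([ready.get(r, 0) for r in srcs], default=0)
        if earliest > cycle:
            stalls += earliest - cycle   # bubbles inserted before issue
            cycle = earliest
        ready[dest] = cycle + latency
        cycle += 1
    return stalls

# ADD R1,R2,R3 ; SUB R4,R1,R5 with a 2-cycle result latency: one bubble.
add_sub = [("R1", ["R2", "R3"]), ("R4", ["R1", "R5"])]
print(count_stalls(add_sub, 2))   # 1

# With an independent MUL filling the slot, the bubble disappears.
filled = [("R1", ["R2", "R3"]), ("R6", ["R7", "R8"]), ("R4", ["R1", "R5"])]
print(count_stalls(filled, 2))    # 0
```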
2.2 Superscalar Processor
Issues multiple instructions simultaneously in a single clock cycle.
// 2-issue superscalar processor
Time:    1    2    3
Slot 1:  ADD  MUL  LD
Slot 2:  SUB  AND  ST
// Ideally, 2 instructions complete every clock cycle
// (IPC = Instructions Per Cycle = 2)
2.3 VLIW Architecture
VLIW (Very Long Instruction Word) bundles multiple operations into a single long instruction.
// VLIW instruction (4 operation slots)
// | ALU1 | ALU2 | MEM | BRANCH |
// | ADD R1.. | MUL R3.. | LD R5.. | NOP |
// Each slot executes on an independent functional unit
// The compiler is responsible for parallel execution
Superscalar vs VLIW:
| Feature | Superscalar | VLIW |
|---|---|---|
| Parallelism decision | Hardware (dynamic) | Compiler (static) |
| Hardware complexity | High | Low |
| Code compatibility | High | Low |
| Compiler burden | Moderate | High |
| Representative products | Intel x86, ARM | Intel Itanium, TI DSP |
3. Instruction Scheduling
3.1 Basic Block Scheduling
Rearranges instruction order within a basic block to minimize pipeline stalls.
Dependency DAG construction:
// Original code:
// 1: LD R1, a // Load latency = 2
// 2: ADD R2, R1, R3 // Depends on R1 (RAW)
// 3: LD R4, b // Independent
// 4: MUL R5, R4, R6 // Depends on R4 (RAW)
// 5: ADD R7, R2, R5 // Depends on R2, R5
// Dependency DAG:
// 1 --2--> 2 --1--> 5
// 3 --2--> 4 --1--> 5
// (Edge weights = latencies)
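The construction step above can be sketched directly: walk the instructions in order, remember the last writer of each register, and emit a latency-weighted edge to every later reader. A minimal Python sketch (the instruction tuples and names are illustrative; anti- and output dependences are omitted for brevity):

```python
# Build a RAW dependency DAG for a basic block.

def build_dag(instrs, latency):
    """instrs: list of (name, dest, srcs); latency: name -> cycles.
    Returns edges as (producer, consumer, weight)."""
    edges = []
    last_writer = {}                  # register -> instruction that wrote it
    for name, dest, srcs in instrs:
        for r in srcs:
            if r in last_writer:
                p = last_writer[r]
                edges.append((p, name, latency[p]))
        last_writer[dest] = name
    return edges

# The example block from the text:
instrs = [
    ("1", "R1", ["a"]),         # LD  R1, a
    ("2", "R2", ["R1", "R3"]),  # ADD R2, R1, R3
    ("3", "R4", ["b"]),         # LD  R4, b
    ("4", "R5", ["R4", "R6"]),  # MUL R5, R4, R6
    ("5", "R7", ["R2", "R5"]),  # ADD R7, R2, R5
]
latency = {"1": 2, "2": 1, "3": 2, "4": 1, "5": 1}
print(build_dag(instrs, latency))
# [('1', '2', 2), ('3', '4', 2), ('2', '5', 1), ('4', '5', 1)]
```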
3.2 List Scheduling
List scheduling is the most widely used scheduling algorithm for basic blocks.
Algorithm: List Scheduling
1. Build dependency DAG
2. Compute priority for each node (usually critical path length)
3. Add instructions with no dependencies to the ready list
4. while ready list is not empty:
a. Select the instruction with highest priority
b. Issue at current cycle
c. Add successor instructions to ready list when dependencies are resolved
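The steps above can be condensed into a compact implementation. The following Python sketch assumes a single-issue machine and uses critical-path length as the priority, as in step 2; the structure and helper names are my own, and a production scheduler would also track per-cycle operand readiness rather than selecting by priority alone:

```python
import heapq

def list_schedule(succs, latency):
    """succs: node -> list of successors; latency: node -> cycles.
    Returns node -> issue cycle for a single-issue machine."""
    nodes = list(latency)
    preds = {n: 0 for n in nodes}
    for n in nodes:
        for s in succs.get(n, []):
            preds[s] += 1

    # Priority = critical-path length from the node to the end of the DAG.
    prio = {}
    def cp(n):
        if n not in prio:
            prio[n] = latency[n] + max((cp(s) for s in succs.get(n, [])), default=0)
        return prio[n]
    for n in nodes:
        cp(n)

    earliest = {n: 1 for n in nodes}    # earliest legal issue cycle
    ready = [(-prio[n], n) for n in nodes if preds[n] == 0]
    heapq.heapify(ready)
    schedule, cycle = {}, 1
    while ready:
        _, n = heapq.heappop(ready)     # highest priority first
        cycle = max(cycle, earliest[n])
        schedule[n] = cycle
        for s in succs.get(n, []):
            earliest[s] = max(earliest[s], cycle + latency[n])
            preds[s] -= 1
            if preds[s] == 0:
                heapq.heappush(ready, (-prio[s], s))
        cycle += 1
    return schedule

# The DAG from section 3.3: LDa->ADD->STa, LDb->MUL->STb.
succs = {"LDa": ["ADD"], "LDb": ["MUL"], "ADD": ["STa"], "MUL": ["STb"]}
lat = {"LDa": 2, "LDb": 2, "ADD": 1, "MUL": 3, "STa": 1, "STb": 1}
print(list_schedule(succs, lat))
# {'LDb': 1, 'LDa': 2, 'MUL': 3, 'ADD': 4, 'STa': 5, 'STb': 6}
```

This reproduces the 6-cycle schedule worked out in the next section.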
3.3 List Scheduling Example
// Dependency DAG (latencies):
// LD a(2) -> ADD(1) -> ST(1)
// LD b(2) -> MUL(3) -> ST(1)
// Priorities (critical path length to root):
// LD a: 2+1+1 = 4
// LD b: 2+3+1 = 6 (higher priority)
// ADD: 1+1 = 2
// MUL: 3+1 = 4
// ST(ADD result): 1
// ST(MUL result): 1
// Scheduling:
// Cycle 1: LD b (priority 6, highest)
// Cycle 2: LD a (MUL still waits on LD b's result, so issue the next ready instruction)
// Cycle 3: MUL (LD b result ready, cycle 1+2=3)
// Cycle 4: ADD (LD a result ready, cycle 2+2=4)
// Cycle 5: ST (ADD result, cycle 4+1=5)
// Cycle 6: ST (MUL result, cycle 3+3=6)
// Original order would need 7 cycles
// After scheduling, reduced to 6 cycles
4. Software Pipelining
4.1 Concept
A technique that transforms code so that multiple iterations of a loop are overlapped in execution. It applies an idea similar to hardware pipelining at the software level.
// Original loop (3 instructions per iteration, 1 cycle each)
// Iteration i: A(i), B(i), C(i)
// Iteration i+1: A(i+1), B(i+1), C(i+1)
// Time:  1    2    3    4    5    6    7
// Orig:  A(0) B(0) C(0) A(1) B(1) C(1) A(2)...

// After software pipelining:
// Time:  1    2    3    4    5
//        A(0) B(0) C(0)
//             A(1) B(1) C(1)
//                  A(2) B(2) C(2)
//
// Steady-state kernel:
// Time:  ...  k      k+1    ...
//             C(i)   C(i+1) ...
//             B(i+1) B(i+2) ...
//             A(i+2) A(i+3) ...
4.2 Structure
A software-pipelined loop consists of three parts:
// 1. Prologue: Filling the pipeline
A(0)
A(1), B(0)
// 2. Kernel: Steady state iteration
loop:
A(i+2), B(i+1), C(i) // Three iterations overlap
i = i + 1
if i < n-2 goto loop
// 3. Epilogue: Draining the pipeline
B(n-1), C(n-2)
C(n-1)
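The prologue/kernel/epilogue shape can be generated mechanically in the idealized case. A toy Python sketch, assuming three single-cycle stages A, B, C and no resource or register constraints:

```python
# Generate the overlapped schedule of a software-pipelined loop:
# at cycle t, stage s runs for iteration t - s (when that iteration exists).

def software_pipeline(n, stages=("A", "B", "C")):
    """Return, per cycle, the stage instances executed for n iterations."""
    depth = len(stages)
    rows = []
    for t in range(n + depth - 1):        # total cycles after overlapping
        row = []
        for s, stage in enumerate(stages):
            i = t - s                     # iteration running stage s now
            if 0 <= i < n:
                row.append(f"{stage}({i})")
        rows.append(row)
    return rows

for row in software_pipeline(4):
    print(", ".join(row))
# A(0)
# A(1), B(0)
# A(2), B(1), C(0)
# A(3), B(2), C(1)
# B(3), C(2)
# C(3)
```

The first `depth - 1` rows are the prologue, the full-width rows are the kernel, and the trailing rows are the epilogue.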
4.3 Advantages
- Maximizes loop body throughput
- Balanced utilization of resources (functional units, registers)
- Even more effective when combined with loop unrolling
5. Interaction Between Register Allocation and Scheduling
5.1 Conflicting Goals
Register allocation and instruction scheduling have conflicting objectives:
// Code favorable for scheduling (uses many registers)
LD R1, a
LD R2, b // Perform another load while waiting for R1
LD R3, c
ADD R4, R1, R2 // R1 is ready
MUL R5, R3, R4
// Code favorable for register allocation (uses few registers)
LD R1, a
LD R2, b
ADD R1, R1, R2 // Use R1 immediately (may stall!)
LD R2, c // Reuse R2
MUL R1, R1, R2 // Only two registers, but execution is serialized
5.2 Solutions
Approach 1: Schedule first, allocate later
1. Perform scheduling with virtual registers (infinite registers)
2. Perform register allocation on scheduled code
3. If spills occur, insert spill code and reschedule
Approach 2: Allocate first, schedule later
1. Perform register allocation first
2. Schedule the allocated code
3. Scheduling freedom is limited (false dependencies from shared registers)
Approach 3: Integrated approach
1. Consider register allocation and scheduling simultaneously
2. Can achieve better results but complexity is high
Most practical compilers use Approach 1.
6. Global Scheduling
6.1 Scheduling Beyond Basic Blocks
When basic block sizes are small, scheduling freedom is limited. Global scheduling moves instructions across multiple basic blocks.
6.2 Trace Scheduling
Finds the most frequently executed path (trace) and schedules the entire path as a single unit.
// Flow graph:
// B1 (executed 100%)
//  |
// B2 (if cond) --90%--> B3
//  |
//  +---10%---> B4
// Trace: B1 -> B2 -> B3 (most frequent path)
// Schedule this path as one long block
// Instructions can be moved from B2 to B3
// or from B1 to B2
// (compensation code needed for the less frequent path B4)
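Trace selection itself is simple to sketch: starting from the entry block, repeatedly follow the most probable outgoing edge. A minimal Python version over the example graph (the CFG encoding is illustrative; real compilers use profile counts and stop at loop back edges):

```python
# Greedily select the hottest trace through a CFG annotated with
# branch probabilities.

def select_trace(cfg, entry):
    """cfg: block -> list of (successor, probability). Returns the trace."""
    trace, block, seen = [entry], entry, {entry}
    while cfg.get(block):
        nxt, _ = max(cfg[block], key=lambda e: e[1])  # most probable edge
        if nxt in seen:              # stop rather than revisit (loop)
            break
        trace.append(nxt)
        seen.add(nxt)
        block = nxt
    return trace

cfg = {"B1": [("B2", 1.0)], "B2": [("B3", 0.9), ("B4", 0.1)]}
print(select_trace(cfg, "B1"))   # ['B1', 'B2', 'B3']
```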
6.3 Superblock
A trace from which intermediate (side) entry points have been removed, typically by duplicating the tail of the trace, so that code motion within it is simpler.
// Superblock: one entry point, multiple exit points
// entry -> A -> B -> C -> exit1
//               |
//               v
//             exit2
//
// No incoming edges in the middle, so code motion is freer
7. VLIW Code Generation
7.1 VLIW Characteristics
In VLIW processors, the compiler has full responsibility for parallel execution.
// VLIW instruction packet (3 slots: ALU, MEM, BR)
// Place independent operations in each slot
// Cycle 1: | ADD R1,R2,R3 | LD R4, mem1 | NOP |
// Cycle 2: | MUL R5,R4,R6 | LD R7, mem2 | NOP |
// Cycle 3: | SUB R8,R5,R7 | ST mem3, R1 | BEQ R8, L1 |
7.2 NOP Problem
When ILP is insufficient, empty slots are filled with NOPs, increasing code size.
// Code with low ILP:
// | ADD R1,R2,R3 | NOP | NOP |
// | MUL R4,R1,R5 | NOP | NOP | // All operations are dependent
// | SUB R6,R4,R7 | NOP | NOP |
// Solution: Increase ILP through loop unrolling + software pipelining
7.3 Bundle Compression
To reduce code size increase due to NOPs, bundle compression techniques are used:
- Encode only actually used slots
- Use header bits to indicate which slots are active
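A toy version of the header-bit scheme: the header carries one bit per slot, and only non-NOP slots are emitted. The encoding below is invented for illustration, not a real ISA format:

```python
# Compress a VLIW bundle by dropping NOP slots; a header bitmask records
# which slots are active so the bundle can be expanded again at decode time.

def compress_bundle(slots):
    """slots: list of operations or None (NOP). Returns (header, ops)."""
    header = 0
    ops = []
    for i, op in enumerate(slots):
        if op is not None:
            header |= 1 << i          # bit i set -> slot i holds an operation
            ops.append(op)
    return header, ops

def decompress_bundle(header, ops, width):
    """Rebuild the full bundle, reinserting NOPs for cleared header bits."""
    it = iter(ops)
    return [next(it) if header & (1 << i) else None for i in range(width)]

bundle = ["ADD R1,R2,R3", None, None]     # low-ILP bundle: two NOP slots
packed = compress_bundle(bundle)
print(packed)                             # (1, ['ADD R1,R2,R3'])
assert decompress_bundle(*packed, width=3) == bundle
```

Here a three-slot bundle with two NOPs shrinks to one operation plus a small header, which is the essence of the techniques listed above.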
Summary
| Concept | Description |
|---|---|
| ILP | The number of independent instructions that can be executed simultaneously |
| Pipelining | Overlapping instruction execution stages |
| Superscalar | Hardware dynamically issues multiple instructions simultaneously |
| VLIW | Compiler statically places parallel operations in one long instruction |
| List scheduling | Priority-based basic block scheduling algorithm |
| Software pipelining | Overlapping execution of multiple loop iterations |
| Trace scheduling | Scheduling a frequently executed path as a single unit |
Instruction scheduling is a key compiler technique for maximizing the performance of modern processors. In VLIW architectures in particular, the compiler's scheduling capability directly determines processor performance. In the next article, we will cover data parallelism and loop transformations at a larger scale.