[Compiler] 16. Instruction-Level Parallelism and Scheduling

Overview

Modern processors improve performance by executing multiple instructions simultaneously; this capability is called Instruction-Level Parallelism (ILP). The compiler's job is to rearrange (schedule) instructions so that this parallelism is exploited as fully as possible.

In this article, we examine the concept of ILP, processor architectures that support it, and compiler instruction scheduling techniques.


1. Instruction-Level Parallelism (ILP)

1.1 Basic Concept

ILP indicates how many instructions within a program can be executed simultaneously.

// High ILP code
a = x + y     // These three instructions are independent
b = p + q     // Can be executed simultaneously
c = r - s

// Low ILP code
a = x + y     // Sequential dependency
b = a + z     // Depends on a
c = b * 2     // Depends on b

1.2 Types of Data Dependencies

Dependencies between instructions limit ILP:

True Dependence (RAW: Read After Write)

a = b + c     // Write a
d = a + e     // Read a (must wait)

Anti-Dependence (WAR: Write After Read)

a = b + c     // Read b
b = d + e     // Write b (can be resolved by renaming)

Output Dependence (WAW: Write After Write)

a = b + c     // Write a
a = d + e     // Write a again (order must be maintained)

Anti-dependencies and output dependencies can be resolved by register renaming, so they are also called name dependencies.
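The three dependence kinds can be detected mechanically from each instruction's destination and source registers. A minimal sketch (not from the article; instructions are encoded as a destination plus a list of sources):

```python
def classify(earlier, later):
    """Dependence kinds from `earlier` to `later`, each instruction
    given as (destination register, list of source registers)."""
    e_dst, e_srcs = earlier
    l_dst, l_srcs = later
    deps = set()
    if e_dst in l_srcs:       # later reads what earlier wrote
        deps.add("RAW")       # true dependence
    if l_dst in e_srcs:       # later overwrites a register earlier read
        deps.add("WAR")       # anti-dependence (removable by renaming)
    if l_dst == e_dst:        # both write the same register
        deps.add("WAW")       # output dependence (removable by renaming)
    return deps

# the three examples above
print(classify(("a", ["b", "c"]), ("d", ["a", "e"])))   # {'RAW'}
print(classify(("a", ["b", "c"]), ("b", ["d", "e"])))   # {'WAR'}
print(classify(("a", ["b", "c"]), ("a", ["d", "e"])))   # {'WAW'}
```

Only the RAW result is a real dataflow constraint; the other two disappear once the second writer is renamed to a fresh register.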


2. Processor Architectures for ILP

2.1 Pipelining

Instruction execution is divided into multiple stages, and stages are overlapped.

// 5-stage pipeline: IF (Fetch), ID (Decode), EX (Execute), MEM (Memory access), WB (Writeback)

Time:    1    2    3    4    5    6    7
Inst 1:  IF   ID   EX   MEM  WB
Inst 2:       IF   ID   EX   MEM  WB
Inst 3:            IF   ID   EX   MEM  WB

// Ideally, one instruction completes every clock cycle

Pipeline hazards:

// Data hazard (RAW)
ADD R1, R2, R3    // Write result to R1 (WB stage)
SUB R4, R1, R5    // Read R1 (ID stage) - value not ready yet!

// Solution: forwarding or stall
ADD R1, R2, R3
NOP               // Bubble insertion (stall)
SUB R4, R1, R5

// Compiler inserts independent instructions to prevent stalls
ADD R1, R2, R3
MUL R6, R7, R8    // Fill empty slot with independent instruction
SUB R4, R1, R5
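The effect of filling the stall slot can be checked with a toy in-order, single-issue model (an assumption for illustration, not from the article): each result becomes available `latency` cycles after issue, and an instruction issues at the first cycle when all of its sources are ready.

```python
def cycles_with_stalls(instrs):
    """instrs: list of (dest, sources, latency) in program order.
    Returns the cycle at which the last instruction issues."""
    ready = {}                 # register -> cycle its value is available
    cycle = 1                  # next free issue cycle
    last = 0
    for dest, srcs, lat in instrs:
        issue = max([cycle] + [ready.get(s, 1) for s in srcs])
        ready[dest] = issue + lat
        cycle = issue + 1
        last = issue
    return last

# ADD R1,R2,R3 ; SUB R4,R1,R5  (ADD result takes 2 cycles)
stalled = cycles_with_stalls([("R1", ["R2", "R3"], 2),
                              ("R4", ["R1", "R5"], 1)])
# same, with an independent MUL filling the stall slot
filled = cycles_with_stalls([("R1", ["R2", "R3"], 2),
                             ("R6", ["R7", "R8"], 1),
                             ("R4", ["R1", "R5"], 1)])
print(stalled, filled)   # both finish issuing at cycle 3,
                         # but the second sequence did one more useful op
```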

2.2 Superscalar Processor

A superscalar processor issues multiple instructions in a single clock cycle.

// 2-issue superscalar processor
Time:    1         2         3
Slot 1:  ADD       MUL       LD
Slot 2:  SUB       AND       ST

// Ideally, 2 instructions complete every clock cycle
// (IPC = Instructions Per Cycle = 2)

2.3 VLIW Architecture

VLIW (Very Long Instruction Word) bundles multiple operations into a single long instruction.

// VLIW instruction (4 operation slots)
// | ALU1     | ALU2     | MEM      | BRANCH   |
// | ADD R1.. | MUL R3.. | LD R5..  | NOP      |

// Each slot executes on an independent functional unit
// The compiler is responsible for parallel execution

Superscalar vs VLIW:

Feature                  | Superscalar         | VLIW
-------------------------+---------------------+----------------------
Parallelism decision     | Hardware (dynamic)  | Compiler (static)
Hardware complexity      | High                | Low
Code compatibility       | High                | Low
Compiler burden          | Moderate            | High
Representative products  | Intel x86, ARM      | Intel Itanium, TI DSP

3. Instruction Scheduling

3.1 Basic Block Scheduling

Rearranges instruction order within a basic block to minimize pipeline stalls.

Dependency DAG construction:

// Original code:
// 1: LD  R1, a       // Load latency = 2
// 2: ADD R2, R1, R3  // Depends on R1 (RAW)
// 3: LD  R4, b       // Independent
// 4: MUL R5, R4, R6  // Depends on R4 (RAW)
// 5: ADD R7, R2, R5  // Depends on R2, R5

// Dependency DAG:
//   1 --2--> 2 --1--> 5
//   3 --2--> 4 --1--> 5
// (Edge weights = latencies)
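The DAG above can be built mechanically by tracking, for each register, its most recent writer. A minimal sketch (the instruction encoding is an assumption; latencies match the comment above, with loads taking 2 cycles and ALU ops 1):

```python
def build_dag(instrs):
    """instrs: list of (id, dest, sources, latency) in program order.
    Emits RAW edges (producer, consumer, producer latency)."""
    edges = []
    last_def = {}                       # register -> id of last writer
    lat = {i: l for i, _, _, l in instrs}
    for i, dest, srcs, _ in instrs:
        for s in srcs:
            if s in last_def:           # s was produced by an earlier instr
                edges.append((last_def[s], i, lat[last_def[s]]))
        last_def[dest] = i
    return edges

code = [
    (1, "R1", ["a"],        2),   # LD  R1, a
    (2, "R2", ["R1", "R3"], 1),   # ADD R2, R1, R3
    (3, "R4", ["b"],        2),   # LD  R4, b
    (4, "R5", ["R4", "R6"], 1),   # MUL R5, R4, R6
    (5, "R7", ["R2", "R5"], 1),   # ADD R7, R2, R5
]
print(build_dag(code))
# [(1, 2, 2), (3, 4, 2), (2, 5, 1), (4, 5, 1)]  -- the DAG drawn above
```

A production scheduler would also add WAR/WAW edges for unrenamed registers and memory-ordering edges between loads and stores.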

3.2 List Scheduling

List scheduling is the most widely used scheduling algorithm for basic blocks.

Algorithm: List Scheduling

1. Build dependency DAG
2. Compute priority for each node (usually critical path length)
3. Add instructions with no dependencies to the ready list
4. while ready list is not empty:
     a. Select the instruction with highest priority
     b. Issue at current cycle
     c. Add successor instructions to ready list when dependencies are resolved

3.3 List Scheduling Example

// Dependency DAG (latencies):
//   LD a(2) -> ADD(1) -> ST(1)
//   LD b(2) -> MUL(3) -> ST(1)

// Priorities (critical path length to root):
//   LD a: 2+1+1 = 4
//   LD b: 2+3+1 = 6  (higher priority)
//   ADD: 1+1 = 2
//   MUL: 3+1 = 4
//   ST(ADD result): 1
//   ST(MUL result): 1

// Scheduling:
// Cycle 1: LD b (priority 6, highest)
// Cycle 2: LD a (MUL must still wait for LD b's result, so issue the other load)
// Cycle 3: MUL (LD b result ready, cycle 1+2=3)
// Cycle 4: ADD (LD a result ready, cycle 2+2=4)
// Cycle 5: ST (ADD result, cycle 4+1=5)
// Cycle 6: ST (MUL result, cycle 3+3=6)

// Original order would need 7 cycles
// After scheduling, reduced to 6 cycles

4. Software Pipelining

4.1 Concept

A technique that transforms code so that multiple iterations of a loop are overlapped in execution. It applies an idea similar to hardware pipelining at the software level.

// Original loop (3 instructions per iteration, 1 cycle each)
// Iteration i:  A(i), B(i), C(i)
// Iteration i+1: A(i+1), B(i+1), C(i+1)

// Time:  1     2     3     4     5     6     7
// Orig:  A(0)  B(0)  C(0)  A(1)  B(1)  C(1)  A(2)...

// After software pipelining:
// Time:  1     2     3     4     5
// Pipe:  A(0)  B(0)  C(0)
//              A(1)  B(1)  C(1)
//                    A(2)  B(2)  C(2)
//
// Steady state kernel:
// Time:  ...   k     k+1   k+2   ...
//              C(i)  C(i+1) ...
//              B(i+1) B(i+2) ...
//              A(i+2) A(i+3) ...

4.2 Structure

A software-pipelined loop consists of three parts:

// 1. Prologue: Filling the pipeline
A(0)
A(1), B(0)

// 2. Kernel: Steady state iteration
loop:
    A(i+2), B(i+1), C(i)   // Three iterations overlap
    i = i + 1
    if i < n-2 goto loop

// 3. Epilogue: Draining the pipeline
B(n-1), C(n-2)
C(n-1)
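For a loop whose body splits into single-cycle stages, the overlapped schedule (prologue, kernel, epilogue) can be generated mechanically. A toy sketch using the stage names A/B/C from the example:

```python
def software_pipeline(n, stages=("A", "B", "C")):
    """Overlapped schedule for an n-iteration loop whose body has
    len(stages) single-cycle stages: one list of stage instances
    per cycle, where stage s works on iteration t - s at cycle t."""
    depth = len(stages)
    rows = []
    for t in range(n + depth - 1):          # total cycles
        rows.append([f"{stages[s]}({t - s})"
                     for s in range(depth) if 0 <= t - s < n])
    return rows

rows = software_pipeline(4)
for row in rows:
    print(row)
# ['A(0)']                   <- prologue
# ['A(1)', 'B(0)']
# ['A(2)', 'B(1)', 'C(0)']   <- kernel (steady state)
# ['A(3)', 'B(2)', 'C(1)']
# ['B(3)', 'C(2)']           <- epilogue
# ['C(3)']
```

Note that 4 iterations of 3 stages finish in 6 cycles instead of 12; the saving grows with the iteration count while the prologue/epilogue cost stays fixed.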

4.3 Advantages

  • Maximizes loop body throughput
  • Balanced utilization of resources (functional units, registers)
  • Even more effective when combined with loop unrolling

5. Interaction Between Register Allocation and Scheduling

5.1 Conflicting Goals

Register allocation and instruction scheduling have conflicting objectives:

// Code favorable for scheduling (uses many registers)
LD R1, a
LD R2, b        // Perform another load while waiting for R1
LD R3, c
ADD R4, R1, R2  // R1 is ready
MUL R5, R3, R4

// Code favorable for register allocation (uses few registers)
LD R1, a
ADD R1, R1, b   // Use R1 immediately (may stall!)
LD  R2, c
MUL R1, R2, R1  // Same computation as above, using only two registers
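The tension can be quantified with a simple register-pressure measure: the maximum number of values live at once, where a value lives from its definition to its last use. A sketch (the two value lists model the same computation in the two orderings above, written over fresh virtual values; counting an instruction's destination as live alongside its sources slightly overestimates the true minimum):

```python
def max_pressure(instrs):
    """instrs: list of (dest, sources) over virtual values, in order.
    Returns the peak number of values live at the same time."""
    last_use = {}
    for i, (_, srcs) in enumerate(instrs):
        for s in srcs:
            last_use[s] = i
    live, peak = set(), 0
    for i, (dest, _) in enumerate(instrs):
        live.add(dest)
        peak = max(peak, len(live))
        # a value dies once its last use is at or before this instruction
        live -= {v for v in live if last_use.get(v, -1) <= i}
    return peak

# scheduling-friendly order: three loads up front
many_regs = [("t1", []), ("t2", []), ("t3", []),
             ("t4", ["t1", "t2"]), ("t5", ["t3", "t4"])]
# register-friendly order: consume each value as soon as possible
few_regs = [("t1", []), ("t2", ["t1"]), ("t3", []), ("t4", ["t3", "t2"])]
print(max_pressure(many_regs), max_pressure(few_regs))   # 4 3
```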

5.2 Solutions

Approach 1: Schedule first, allocate later

1. Perform scheduling with virtual registers (infinite registers)
2. Perform register allocation on scheduled code
3. If spills occur, insert spill code and reschedule

Approach 2: Allocate first, schedule later

1. Perform register allocation first
2. Schedule the allocated code
3. Scheduling freedom is limited (false dependencies from shared registers)

Approach 3: Integrated approach

1. Consider register allocation and scheduling simultaneously
2. Can achieve better results but complexity is high

Most practical compilers use Approach 1.


6. Global Scheduling

6.1 Scheduling Beyond Basic Blocks

When basic block sizes are small, scheduling freedom is limited. Global scheduling moves instructions across multiple basic blocks.

6.2 Trace Scheduling

Finds the most frequently executed path (trace) and schedules the entire path as a single unit.

// Flow graph:
//   B1 (execution probability 100%)
//    |
//   B2 (if cond) -- 90% --> B3
//        |
//        10% --> B4

// Trace: B1 -> B2 -> B3 (most frequent path)
// Schedule this path as one long block

// Instructions can be moved from B2 to B3
// or from B1 to B2
// (compensation code needed for the less frequent path B4)
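Trace selection itself can be sketched as a greedy walk that always follows the most probable outgoing edge. This is a toy model; production compilers work from profile counts and use more careful heuristics, but the flow-graph example above is enough to illustrate the idea:

```python
def pick_trace(cfg, entry):
    """cfg: {block: [(successor, probability), ...]}.
    Greedily follow the most probable successor from `entry`."""
    trace, node, seen = [entry], entry, {entry}
    while cfg.get(node):
        succ, _ = max(cfg[node], key=lambda e: e[1])
        if succ in seen:        # stop rather than loop on a back edge
            break
        trace.append(succ)
        seen.add(succ)
        node = succ
    return trace

# the flow graph from the comment above
cfg = {
    "B1": [("B2", 1.0)],
    "B2": [("B3", 0.9), ("B4", 0.1)],
    "B3": [],
    "B4": [],
}
print(pick_trace(cfg, "B1"))    # ['B1', 'B2', 'B3']
```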

6.3 Superblock

A superblock is a trace whose intermediate entry points have been removed (typically by duplicating the tail of the trace), which simplifies scheduling.

// Superblock: one entry point, multiple exit points
//   entry -> A -> B -> C -> exit1
//                 |
//                 v
//               exit2
//
// No incoming edges in the middle, so code motion is freer

7. VLIW Code Generation

7.1 VLIW Characteristics

In VLIW processors, the compiler has full responsibility for parallel execution.

// VLIW instruction packet (3 slots: ALU, MEM, BR)
// Place independent operations in each slot

// Cycle 1: | ADD R1,R2,R3 | LD R4, mem1 | NOP        |
// Cycle 2: | MUL R5,R4,R6 | LD R7, mem2 | NOP        |
// Cycle 3: | SUB R8,R5,R7 | ST mem3, R1 | BEQ R8, L1 |
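Slot assignment can be sketched as greedy in-order packing: an operation joins the current bundle unless its functional unit is already taken or it conflicts with a register the bundle touches. A toy model (cross-bundle latencies are ignored, and the example reuses the section 7.1 packet minus the branch):

```python
def pack_bundles(ops):
    """ops: list of (unit, dest, sources) in program order.
    Returns a list of bundles, each {unit: (dest, sources)}."""
    bundles, cur, wr, rd = [], {}, set(), set()
    for unit, dest, srcs in ops:
        conflict = (unit in cur                      # slot already used
                    or any(s in wr for s in srcs)    # RAW within bundle
                    or dest in wr                    # WAW within bundle
                    or dest in rd)                   # WAR within bundle
        if conflict:                  # close the bundle, start a new one
            bundles.append(cur)
            cur, wr, rd = {}, set(), set()
        cur[unit] = (dest, srcs)
        wr.add(dest)
        rd |= set(srcs)
    if cur:
        bundles.append(cur)
    return bundles

ops = [
    ("ALU", "R1", ["R2", "R3"]),    # ADD R1,R2,R3
    ("MEM", "R4", ["mem1"]),        # LD  R4, mem1
    ("ALU", "R5", ["R4", "R6"]),    # MUL R5,R4,R6  (needs R4)
    ("MEM", "R7", ["mem2"]),        # LD  R7, mem2
    ("ALU", "R8", ["R5", "R7"]),    # SUB R8,R5,R7  (needs R5, R7)
    ("MEM", "mem3", ["R1"]),        # ST  mem3, R1
]
bundles = pack_bundles(ops)
print(len(bundles))                 # 3 bundles, one per cycle
```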

7.2 NOP Problem

When ILP is insufficient, empty slots are filled with NOPs, increasing code size.

// Code with low ILP:
// | ADD R1,R2,R3 | NOP         | NOP |
// | MUL R4,R1,R5 | NOP         | NOP |  // All operations are dependent
// | SUB R6,R4,R7 | NOP         | NOP |

// Solution: Increase ILP through loop unrolling + software pipelining

7.3 Bundle Compression

To reduce the code-size growth caused by NOPs, bundle compression techniques are used:

  • Encode only actually used slots
  • Use header bits to indicate which slots are active
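A minimal sketch of the header-bit idea (the 4-slot layout and encoding here are illustrative assumptions, not a real ISA): bit i of the header says whether slot i is occupied, and NOP slots are simply not stored.

```python
SLOTS = ("ALU1", "ALU2", "MEM", "BRANCH")

def compress(bundle):
    """bundle: {slot name: operation text}. Returns (header, ops)."""
    header, ops = 0, []
    for i, slot in enumerate(SLOTS):
        if slot in bundle:
            header |= 1 << i        # mark slot i as active
            ops.append(bundle[slot])
    return header, ops

def decompress(packed):
    header, ops = packed
    it = iter(ops)
    return {slot: next(it) if header & (1 << i) else "NOP"
            for i, slot in enumerate(SLOTS)}

bundle = {"ALU1": "ADD R1,R2,R3", "MEM": "LD R5,a"}
packed = compress(bundle)
print(packed)                       # (5, ['ADD R1,R2,R3', 'LD R5,a'])
print(decompress(packed)["BRANCH"]) # NOP
```

Only two operations plus a small header are stored instead of four full slots, which is where the code-size saving comes from.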

Summary

Concept              | Description
---------------------+------------------------------------------------------
ILP                  | The degree to which instructions can execute simultaneously
Pipelining           | Overlapping instruction execution stages
Superscalar          | Hardware dynamically issues multiple instructions at once
VLIW                 | Compiler statically packs parallel operations into one long instruction
List scheduling      | Priority-based basic block scheduling algorithm
Software pipelining  | Overlapping execution of multiple loop iterations
Trace scheduling     | Scheduling a frequently executed path as a single unit

Instruction scheduling is a key compiler technique for maximizing the performance of modern processors. In VLIW architectures in particular, the compiler's scheduling capability directly determines processor performance. In the next article, we will cover data parallelism and loop transformations at a larger scale.