- Authors

- Youngju Kim (@fjvbn20031)
- Introduction
- 1. Overview of Computer Architecture
- 2. Instruction Set Architecture (ISA)
- 3. ALU and Datapath
- 4. Pipelining
- 5. Memory Hierarchy
- 6. Virtual Memory
- 7. I/O Systems
- 8. Parallel Architecture
- 9. GPU Architecture
- 10. Modern Architecture Trends
- 11. Performance Optimization in Practice
- 12. Quiz
- References
Introduction
Computer Architecture is the foundational discipline that bridges hardware and software. Whether you are an electronics engineer, computer science student, or systems programmer, a deep understanding of how processors work is essential for writing high-performance software, designing chips, and reasoning about system behavior.
This guide is structured around Patterson & Hennessy's Computer Organization and Design and Computer Architecture: A Quantitative Approach, covering everything from ISA design to modern GPU parallel architectures.
1. Overview of Computer Architecture
Von Neumann Architecture
The Von Neumann architecture, proposed by John von Neumann in 1945, forms the foundation of nearly all modern computers. Its defining characteristic is that program instructions and data share the same memory space.
Components:
- CPU (Central Processing Unit): ALU + Control Unit + Registers
- Memory: Stores both program code and data
- I/O Devices: Keyboard, display, disk, network
- Bus: Carries data, address, and control signals
Von Neumann Bottleneck: The bandwidth between the CPU and memory limits overall performance. The modern cache hierarchy was designed to mitigate this bottleneck.
Harvard Architecture
The Harvard architecture uses separate instruction and data memories, enabling simultaneous access to both. It is used in DSPs, microcontrollers (AVR, PIC), and in the L1 cache split (I-Cache / D-Cache) of modern processors.
| Aspect | Von Neumann | Harvard |
|---|---|---|
| Memory | Unified | Separate |
| Bandwidth | Limited | Higher |
| Complexity | Lower | Higher |
| Use case | General-purpose CPU | DSP, microcontroller |
Levels of Abstraction
Computer systems are organized in layers of abstraction:
Application Software
↓
Operating System
↓
ISA (Instruction Set Architecture) ← Hardware/Software boundary
↓
Microarchitecture
↓
Logic Gates
↓
Transistors
The ISA is the contract between hardware and software. Any processor that implements the same ISA can run the same software, regardless of its internal microarchitecture.
Performance Metrics
CPU Execution Time:
CPU Time = Instruction Count × CPI × Clock Cycle Time
= Instruction Count × CPI / Clock Rate
- CPI (Cycles Per Instruction): Average clock cycles per instruction
- Clock Rate: Measured in Hz (e.g., 3.5 GHz)
- MIPS (Millions of Instructions Per Second): Throughput in instructions
- FLOPS (Floating Point Operations Per Second): Floating-point throughput
Amdahl's Law (Limits of Parallel Speedup):
Speedup = 1 / ((1 - f) + f/s)
Where f is the fraction of work that can be parallelized, and s is the speedup of the parallelized part. If only 40% of a program can be parallelized, the theoretical maximum speedup is 1.67x regardless of core count.
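Both formulas are easy to sanity-check in code; a minimal C sketch (function names are illustrative):

```c
#include <assert.h>

/* CPU Time = Instruction Count x CPI / Clock Rate */
double cpu_time(double insts, double cpi, double clock_hz) {
    return insts * cpi / clock_hz;
}

/* Amdahl's Law: Speedup = 1 / ((1 - f) + f/s) */
double amdahl(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}

/* cpu_time(1e9, 2.0, 3.5e9) -> ~0.571 s
   amdahl(0.4, 1e9)          -> ~1.67x: the 60% serial part caps the speedup */
```

Plugging in a huge `s` makes the f/s term vanish, which is exactly the "infinite cores" limit quoted above.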
2. Instruction Set Architecture (ISA)
RISC vs CISC
RISC (Reduced Instruction Set Computer):
- Simple, fixed-length instructions
- Register-centric operations (Load/Store architecture)
- Favors hardware pipelining
- Examples: ARM, RISC-V, MIPS, PowerPC
CISC (Complex Instruction Set Computer):
- Complex, variable-length instructions
- Can operate directly on memory
- Higher code density (more work per instruction)
- Examples: x86-64, VAX
Modern x86-64 processors internally decode CISC instructions into RISC-like micro-ops, blurring the original distinction.
Instruction Formats (RISC-V)
RISC-V defines six instruction formats:
R-type (Register operations):
[funct7 | rs2 | rs1 | funct3 | rd | opcode]
7-bit 5-bit 5-bit 3-bit 5-bit 7-bit
I-type (Immediate / Load):
[imm[11:0] | rs1 | funct3 | rd | opcode]
12-bit 5-bit 3-bit 5-bit 7-bit
S-type (Store):
[imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode]
B-type (Branch):
[imm[12|10:5] | rs2 | rs1 | funct3 | imm[4:1|11] | opcode]
U-type (Upper immediate):
[imm[31:12] | rd | opcode]
J-type (JAL):
[imm[20|10:1|11|19:12] | rd | opcode]
RISC-V Assembly Examples
RISC-V is an open-source ISA developed at UC Berkeley, growing rapidly in both education and industry.
# RISC-V Assembly Examples
# Basic arithmetic
add a0, a0, a1 # a0 = a0 + a1
sub t0, t1, t2 # t0 = t1 - t2
addi t0, t0, 10 # t0 = t0 + 10 (immediate)
mul a0, a1, a2 # a0 = a1 * a2
# Memory access (Load/Store)
lw t0, 0(a0) # t0 = Memory[a0 + 0] (word load)
lh t1, 2(a0) # t1 = Memory[a0 + 2] (halfword load)
lb t2, 1(a0) # t2 = Memory[a0 + 1] (byte load)
sw t0, 4(a0) # Memory[a0 + 4] = t0 (word store)
# Logical operations
and t0, t1, t2 # t0 = t1 & t2
or t0, t1, t2 # t0 = t1 | t2
xor t0, t1, t2 # t0 = t1 ^ t2
sll t0, t1, t2 # t0 = t1 << t2 (logical left shift)
srl t0, t1, t2 # t0 = t1 >> t2 (logical right shift)
# Branches and jumps
beq t0, t1, label # if t0 == t1, jump to label
bne t0, t1, label # if t0 != t1, jump to label
blt t0, t1, label # if t0 < t1, jump to label
bge t0, t1, label # if t0 >= t1, jump to label
jal ra, func # ra = PC+4; jump to func
jalr ra, 0(t0) # ra = PC+4; jump to address in t0
# RISC-V Register Conventions
# x0 (zero): hardwired zero
# x1 (ra): return address
# x2 (sp): stack pointer
# x10-x17 (a0-a7): function arguments / return values
# x5-x7, x28-x31 (t0-t6): temporaries
# x8-x9, x18-x27 (s0-s11): saved registers
Translating C to RISC-V Assembly
// C code
int factorial(int n) {
if (n <= 1) return 1;
return n * factorial(n - 1);
}
# RISC-V Assembly (recursive factorial)
factorial:
addi sp, sp, -16 # Allocate stack frame
sw ra, 12(sp) # Save return address
sw a0, 8(sp) # Save n
addi t0, zero, 1
bgt a0, t0, recurse # if n > 1, recurse
addi a0, zero, 1 # return 1
j done
recurse:
addi a0, a0, -1 # n - 1
jal ra, factorial # recursive call
lw t0, 8(sp) # restore n
mul a0, t0, a0 # n * factorial(n-1)
done:
lw ra, 12(sp) # restore return address
addi sp, sp, 16 # free stack frame
jalr zero, 0(ra) # return
3. ALU and Datapath
ALU Design
The ALU (Arithmetic Logic Unit) is the computational core of the CPU:
ADD: result = A + B
SUB: result = A + (~B + 1) = A - B (two's complement)
AND: result = A AND B
OR: result = A OR B
XOR: result = A XOR B
SLT: result = (A < B) ? 1 : 0 (Set Less Than)
SLL: result = A << B
Ripple Carry Adder:
FA0: S0 = A0 XOR B0 XOR Cin; C1 = carry out
FA1: S1 = A1 XOR B1 XOR C1; C2 = carry out
...
For an N-bit ripple carry adder, the serial carry chain makes the delay O(N): a 64-bit addition must wait through 64 sequential full-adder delays before the top bit is valid.
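The ripple structure can be modeled directly in software; a C sketch in which each loop iteration plays the role of one full adder:

```c
#include <stdint.h>

/* One full adder: sum = a ^ b ^ cin, carry out = majority(a, b, cin) */
static unsigned full_adder(unsigned a, unsigned b, unsigned cin, unsigned *cout) {
    *cout = (a & b) | (a & cin) | (b & cin);
    return a ^ b ^ cin;
}

/* 32-bit ripple-carry adder: the carry propagates bit by bit, O(N) delay */
uint32_t ripple_add(uint32_t a, uint32_t b) {
    uint32_t sum = 0;
    unsigned carry = 0;
    for (int i = 0; i < 32; i++) {
        unsigned s = full_adder((a >> i) & 1, (b >> i) & 1, carry, &carry);
        sum |= (uint32_t)s << i;
    }
    return sum;
}
```

The loop-carried `carry` variable is the software analogue of the hardware bottleneck: iteration i cannot start until iteration i-1 has produced its carry.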
Carry Lookahead Adder (CLA):
Define Generate and Propagate:
Gi = Ai AND Bi (carry generate)
Pi = Ai XOR Bi (carry propagate)
Ci+1 = Gi OR (Pi AND Ci)
By computing carries for 4-bit groups in parallel, total delay drops to O(log N).
Single-Cycle Datapath
In a single-cycle implementation, every instruction completes in exactly one clock cycle.
Instruction flow (R-type ADD):
1. IF: PC → instruction memory → fetch instruction
2. ID: Read rs1, rs2 from register file; generate control signals
3. EX: ALU computes rs1 + rs2
4. MEM: (R-type has no memory access)
5. WB: Write ALU result to rd
Problem: The slowest instruction (e.g., memory load) determines the clock period. Fast instructions are penalized by the slow clock.
4. Pipelining
Pipelining overlaps the execution of multiple instructions, just as a laundromat overlaps washing, drying, and folding across successive loads of laundry.
Five-Stage Pipeline
| Stage | Abbr | Function |
|---|---|---|
| Fetch | IF | Fetch instruction from memory using PC |
| Decode | ID | Decode instruction, read registers, generate control signals |
| Execute | EX | Perform ALU operation |
| Memory | MEM | Read/write data memory |
| Writeback | WB | Write result back to register file |
Timing diagram:
Cycle:   1   2   3   4   5   6   7   8
Inst 1:  IF  ID  EX  MEM WB
Inst 2:      IF  ID  EX  MEM WB
Inst 3:          IF  ID  EX  MEM WB
Inst 4:              IF  ID  EX  MEM WB
Ideally, a 5-stage pipeline provides up to 5x throughput improvement over single-cycle.
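That 5x figure follows from a simple timing model: with no hazards, N instructions on a k-stage pipeline take k + (N - 1) cycles, versus N x k cycle-equivalents for the single-cycle design. A C sketch (function names are illustrative):

```c
/* Cycles for n instructions on a k-stage pipeline (no hazards): k + (n - 1) */
long pipeline_cycles(long n, long k) { return k + (n - 1); }

/* Speedup over an equivalent single-cycle design */
double pipeline_speedup(long n, long k) {
    return (double)(n * k) / (double)pipeline_cycles(n, k);
}

/* pipeline_speedup(4, 5)       -> 2.5x  (fill/drain overhead dominates)
   pipeline_speedup(1000000, 5) -> ~5x   (asymptotic limit)              */
```

The fill and drain cycles are why short instruction sequences never reach the ideal speedup.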
Pipeline Hazards
1. Structural Hazard
Two instructions require the same hardware resource at the same time.
- Solution: Duplicate resources (separate I-Cache and D-Cache), add register file ports.
2. Data Hazard
An instruction needs a result that a previous instruction has not yet produced.
add t0, t1, t2 # write t0 (WB: cycle 5)
sub t3, t0, t4 # read t0 (ID: cycle 3) → RAW hazard!
Types:
- RAW (Read After Write): Most common. Next instruction reads before previous writes.
- WAR (Write After Read): Occurs in out-of-order execution.
- WAW (Write After Write): Two instructions write the same register.
Solutions:
- Forwarding (Bypassing): Route EX/MEM stage result directly to next EX stage input — no stall.
- Stall (Bubble): Insert NOP cycles. Reduces performance.
- Code Reordering: Compiler inserts independent instructions between dependent ones.
# Load-use hazard: 1 stall cycle unavoidable
lw t0, 0(a0) # load from memory
add t1, t0, t2 # uses t0 immediately → load result is only ready after MEM, so even with forwarding one stall is needed
# Solution: compiler inserts an independent instruction
lw t0, 0(a0)
add t3, t4, t5 # independent instruction fills the slot
add t1, t0, t2 # t0 is now ready
3. Control Hazard
A branch instruction creates uncertainty about which instruction to fetch next.
beq t0, t1, label # branch decision made in EX stage
# Instructions already fetched in IF/ID may be wrong
Solutions:
- Flush: Discard incorrectly fetched instructions (2-3 cycle penalty).
- Branch Prediction:
- Static: always Not-Taken or always Taken
- Dynamic: 1-bit/2-bit predictors, Branch Target Buffer (BTB)
- Modern CPUs achieve 95%+ prediction accuracy
- Delayed Branch: Always execute the instruction after the branch (MIPS approach).
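Dynamic prediction is built from 2-bit saturating counters; a minimal C sketch (type and function names are illustrative):

```c
/* 2-bit saturating counter: states 0,1 predict not-taken; 2,3 predict taken */
typedef struct { int state; } BranchCounter;

int predict_taken(const BranchCounter *p) { return p->state >= 2; }

/* Train with the actual outcome: saturate at 0 and 3 */
void train(BranchCounter *p, int taken) {
    if (taken)  { if (p->state < 3) p->state++; }
    else        { if (p->state > 0) p->state--; }
}
```

The saturation is the whole point: a loop-closing branch mispredicts only once per loop execution, because the single not-taken exit moves the counter from 3 to 2 and the prediction stays "taken" for the next entry, which a 1-bit scheme cannot do.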
Superscalar Pipelines
Modern CPUs run multiple pipelines in parallel:
Intel Core: 6-way out-of-order execution (OoO)
AMD Zen 4: 4-way decode + OoO execution
ARM Cortex-X4: 10-wide decode
Out-of-Order Execution:
- Instructions fetched and decoded in order
- Instructions dispatched as soon as their operands are ready (Tomasulo's algorithm)
- Results committed in original program order (Reorder Buffer)
5. Memory Hierarchy
The Memory Hierarchy
| Level | Typical size | Latency | Cost per byte |
|---|---|---|---|
| Registers | ~1KB | 1 cycle | Very high |
| L1 Cache (on-chip) | 32-64KB | 4-5 cycles | High |
| L2 Cache (on-chip) | 256KB-1MB | 12-15 cycles | Medium |
| L3 Cache (on-chip, shared) | 8-64MB | 30-40 cycles | Low |
| DRAM (main memory) | 8-256GB | 200-300 cycles | Very low |
| SSD / NVMe | 1-4TB | 10,000+ cycles | Extremely low |
Principle of Locality
- Temporal Locality: Recently accessed data will likely be accessed again soon. (Loop variables, counters)
- Spatial Locality: Data near recently accessed locations will likely be accessed soon. (Array traversal)
// Spatial locality optimization
// Bad: column-major access (frequent cache misses)
for (int j = 0; j < N; j++)
for (int i = 0; i < N; i++)
sum += A[i][j]; // A[0][0], A[1][0], A[2][0]... (non-contiguous)
// Good: row-major access (cache-friendly)
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
sum += A[i][j]; // A[0][0], A[0][1], A[0][2]... (contiguous memory)
Cache Organization
Direct-Mapped Cache: Each memory block maps to exactly one cache slot.
- Pro: Simple hardware, fast lookup
- Con: Conflict misses (two blocks competing for the same slot)
N-way Set-Associative Cache: Each memory block can go into any of N slots in a set.
- Most common in real CPUs (L1: 4-way, L2: 8-way, L3: 16-way)
Fully Associative Cache: A block can go anywhere in the cache.
- Pro: No conflict misses
- Con: Expensive parallel lookup (used for TLBs)
Cache address breakdown (32-bit address, 4KB cache, 64B block, 4-way — 4KB / 64B = 64 blocks, 64 / 4 ways = 16 sets):
[Tag (22 bits) | Index (4 bits) | Offset (6 bits)]
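Decomposing an address into these fields is just shifts and masks. A C sketch for a 4KB, 4-way, 64B-block cache (4KB / 64B = 64 blocks, 64 / 4 ways = 16 sets, hence 4 index bits and 6 offset bits):

```c
#include <stdint.h>

#define OFFSET_BITS 6   /* 64 B blocks */
#define INDEX_BITS  4   /* 16 sets     */

uint32_t cache_offset(uint32_t addr) { return addr & ((1u << OFFSET_BITS) - 1); }
uint32_t cache_index(uint32_t addr)  { return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
uint32_t cache_tag(uint32_t addr)    { return addr >> (OFFSET_BITS + INDEX_BITS); }
```

On a lookup, the index selects the set, all 4 tags in that set are compared in parallel, and the offset selects the byte within the hit block.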
Replacement Policies
- LRU (Least Recently Used): Evict the block unused for the longest time. Best performance, complex hardware.
- FIFO (First In First Out): Evict the oldest-inserted block. Simple.
- Random: Evict a random block. Simple hardware, performance close to LRU in practice.
Write Policies
- Write-through: Update both cache and memory on every write. Simple, high bandwidth usage.
- Write-back: Update cache only; write to memory when the block is evicted (Dirty bit). Better performance, more complex coherence.
6. Virtual Memory
Overview
Virtual memory gives each process its own address space, providing isolation, protection, and the ability to overcommit physical memory.
Virtual Address (VA): Address used by the process (0x0000 to 0xFFFFFFFF)
Physical Address (PA): Actual DRAM address
Page: Fixed-size block of virtual/physical memory (typically 4KB)
Page Table Translation
Virtual Address [VPN | Page Offset]
↓
Page table lookup
↓
Physical Address [PFN | Page Offset]
64-bit systems use a 4-level page table (x86-64):
[PML4 (9 bits) | PDPT (9 bits) | PD (9 bits) | PT (9 bits) | Offset (12 bits)]
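Extracting the five fields is the same shift-and-mask pattern as cache indexing; a C sketch (function names are illustrative):

```c
#include <stdint.h>

/* x86-64 4-level paging: 9+9+9+9 index bits + 12 offset bits = 48-bit VA */
unsigned pml4_index(uint64_t va)  { return (va >> 39) & 0x1FF; }
unsigned pdpt_index(uint64_t va)  { return (va >> 30) & 0x1FF; }
unsigned pd_index(uint64_t va)    { return (va >> 21) & 0x1FF; }
unsigned pt_index(uint64_t va)    { return (va >> 12) & 0x1FF; }
unsigned page_offset(uint64_t va) { return va & 0xFFF;         }
```

A page-table walk uses these four 9-bit indices in order, each selecting one of 512 entries in its level's table, and the final 12-bit offset addresses the byte within the 4KB page.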
TLB (Translation Lookaside Buffer)
Every page table lookup requires memory accesses. The TLB is a fully associative cache for recent virtual-to-physical translations.
TLB Hit: VA → TLB lookup → PA (1-2 cycles)
TLB Miss: VA → page table walk → PA (100+ cycles)
Typical TLB size: 64-1024 entries. Must maintain 99%+ hit rate for good performance.
Page Replacement Algorithms
When physical memory is full, the OS must evict a page to disk.
- Optimal: Evict the page that won't be used for the longest time (theoretical best, not implementable)
- LRU: Evict the least recently used page (good performance, expensive hardware)
- Clock (FIFO + second chance): LRU approximation used by most OS kernels
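The Clock algorithm is compact enough to sketch in full; a minimal C version (fixed frame count, with `ref` standing in for the hardware-set reference bit):

```c
#include <string.h>

#define FRAMES 4

typedef struct {
    int page[FRAMES];   /* which page occupies each frame, -1 = free */
    int ref[FRAMES];    /* reference ("second chance") bit           */
    int hand;           /* clock hand position                       */
} Clock;

void clock_init(Clock *c) {
    memset(c->ref, 0, sizeof c->ref);
    for (int i = 0; i < FRAMES; i++) c->page[i] = -1;
    c->hand = 0;
}

/* Access a page; returns 1 on hit, 0 on fault (evicting a victim if needed) */
int clock_access(Clock *c, int page) {
    for (int i = 0; i < FRAMES; i++)
        if (c->page[i] == page) { c->ref[i] = 1; return 1; }
    /* Fault: sweep the hand, giving referenced pages a second chance */
    while (c->page[c->hand] != -1 && c->ref[c->hand]) {
        c->ref[c->hand] = 0;
        c->hand = (c->hand + 1) % FRAMES;
    }
    c->page[c->hand] = page;
    c->ref[c->hand] = 1;
    c->hand = (c->hand + 1) % FRAMES;
    return 0;
}
```

Pages touched since the hand last swept past keep their frame; only a page whose reference bit has already been cleared gets evicted, which approximates LRU with one bit per frame.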
7. I/O Systems
I/O Control Methods
Polling: The CPU repeatedly checks device status registers.
- Pro: Simple, low latency
- Con: Wastes CPU cycles (busy-wait)
Interrupts: The device signals the CPU when ready.
- Pro: CPU free to do other work
- Con: Interrupt handling overhead (context save/restore)
DMA (Direct Memory Access): A DMA controller transfers data between memory and device without CPU involvement.
- Essential for high-throughput transfers (disk, network, GPU)
- CPU only initiates and acknowledges completion
Bus Architecture
PCIe 5.0: x16 = 128 GB/s bidirectional
NVMe (PCIe): up to 7 GB/s sequential read (Gen4)
USB4: up to 40 Gbps
DDR5-6400: 51.2 GB/s per channel
8. Parallel Architecture
Flynn's Taxonomy
| Class | Instruction streams | Data streams | Examples |
|---|---|---|---|
| SISD | Single | Single | Single-core CPU |
| SIMD | Single | Multiple | GPU, AVX vector ops |
| MISD | Multiple | Single | Fault-tolerant systems |
| MIMD | Multiple | Multiple | Multi-core CPU, clusters |
Multicore and Cache Coherence
When multiple cores hold copies of the same cache line, coherence must be maintained.
MESI Protocol (Modified, Exclusive, Shared, Invalid):
Modified (M): This core holds the only (modified) copy
Exclusive (E): This core holds the only copy, matches memory
Shared (S): Multiple cores hold read-only copies
Invalid (I): This copy is stale (another core modified it)
State transitions:
- Core A writes to a Shared line → A transitions to M, all others go to I (invalidation)
- Core B reads an Invalid line → bus snooping causes A to supply the line
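These transitions can be captured as a small state machine; a simplified C sketch that tracks only one core's view of a line (real protocols also manage bus requests and data sourcing):

```c
typedef enum { M, E, S, I } MesiState;

/* This core writes the line: any copy becomes Modified.
   (From S or I, an invalidation/ownership request goes on the bus first.) */
MesiState on_local_write(MesiState s) { (void)s; return M; }

/* Another core writes the same line: our snooped copy becomes stale. */
MesiState on_remote_write(MesiState s) { (void)s; return I; }

/* Another core reads the line: M/E copies are downgraded to Shared
   (a Modified copy is also supplied to the reader / written back). */
MesiState on_remote_read(MesiState s) {
    return (s == M || s == E) ? S : s;
}
```

The invalidation traffic generated by `on_remote_write` is exactly what makes false sharing expensive: two cores writing different variables in the same line ping-pong it between M and I.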
OpenMP Parallel Programming
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
// Parallel array summation
int main() {
int n = 1000000;
long long sum = 0;
int *arr = malloc(n * sizeof(int));
for (int i = 0; i < n; i++)
arr[i] = i + 1;
// Reduction clause ensures thread-safe accumulation
#pragma omp parallel for reduction(+:sum) schedule(static)
for (int i = 0; i < n; i++) {
sum += arr[i];
}
printf("Sum = %lld\n", sum); // Expected: 500000500000
// Print thread IDs
#pragma omp parallel
{
int tid = omp_get_thread_num();
int nthreads = omp_get_num_threads();
printf("Thread %d of %d\n", tid, nthreads);
}
free(arr);
return 0;
}
// Cache-friendly parallel matrix multiplication
void matmul(float *A, float *B, float *C, int N) {
#pragma omp parallel for collapse(2) schedule(dynamic, 64)
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
float sum = 0.0f;
for (int k = 0; k < N; k++) {
sum += A[i*N + k] * B[k*N + j];
}
C[i*N + j] = sum;
}
}
}
NUMA Architecture
NUMA (Non-Uniform Memory Access) assigns local memory to each CPU socket.
Socket 0 (Core 0-15) ←→ Local DRAM (64GB) [~80ns]
↕ QPI / Infinity Fabric
Socket 1 (Core 16-31) ←→ Local DRAM (64GB) [~80ns]
Socket 0 accessing Socket 1 memory: ~160ns (2x latency)
Use numactl to pin processes to specific NUMA nodes for best performance.
9. GPU Architecture
GPU vs CPU Design Philosophy
| Aspect | CPU | GPU |
|---|---|---|
| Core count | Dozens | Thousands to tens of thousands |
| Core complexity | Very high (OoO, branch prediction) | Simple |
| Cache size | Large, multi-level | Small, minimal |
| Design goal | Minimize single-thread latency | Maximize throughput |
| Use case | General-purpose serial work | Massively parallel computation |
SIMT Execution Model
SIMT (Single Instruction, Multiple Thread) is the GPU's core execution model.
Warp: 32 threads executing the same instruction (NVIDIA)
→ All 32 threads execute the same instruction simultaneously
→ Each thread operates on different data
Warp scheduler: On memory stall, immediately switch to another warp
→ Thousands of warps rotate rapidly, hiding memory latency
NVIDIA GPU Hierarchy
GPU
├── GPC (Graphics Processing Cluster) x 8
│ └── SM (Streaming Multiprocessor) x 7-12
│ ├── CUDA Core x 128 (FP32 operations)
│ ├── Tensor Core x 4 (matrix multiply, AI)
│ ├── RT Core x 1 (ray tracing)
│ ├── Warp Scheduler x 4
│ ├── Register File (256KB)
│ └── Shared Memory / L1 Cache (128-256KB)
└── L2 Cache (shared, tens of MB)
NVIDIA H100 (Hopper, 2022):
- 132 SMs, 16,896 CUDA Cores
- 528 Tensor Cores (4th generation, FP8 support)
- 80GB HBM3, 3.35 TB/s memory bandwidth
- 4 PetaFLOPS FP8 Tensor Core performance
CUDA Programming Basics
#include <cuda_runtime.h>
#include <stdio.h>
// GPU kernel: vector addition
__global__ void vectorAdd(float *a, float *b, float *c, int n) {
// Global thread index
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
c[idx] = a[idx] + b[idx];
}
}
int main() {
int n = 1 << 20; // 1M elements
size_t size = n * sizeof(float);
// Host (CPU) memory
float *h_a = (float*)malloc(size);
float *h_b = (float*)malloc(size);
float *h_c = (float*)malloc(size);
for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; } // initialize inputs
// Device (GPU) memory
float *d_a, *d_b, *d_c;
cudaMalloc(&d_a, size);
cudaMalloc(&d_b, size);
cudaMalloc(&d_c, size);
// Host → Device transfer
cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
// Launch kernel: 256 threads per block
int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);
// Device → Host transfer
cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
free(h_a); free(h_b); free(h_c);
return 0;
}
GPU Memory Optimization with Shared Memory
// Tiled matrix multiplication using shared memory
#define TILE_SIZE 16
__global__ void matmulShared(float *A, float *B, float *C, int N) {
__shared__ float tileA[TILE_SIZE][TILE_SIZE];
__shared__ float tileB[TILE_SIZE][TILE_SIZE];
int row = blockIdx.y * TILE_SIZE + threadIdx.y;
int col = blockIdx.x * TILE_SIZE + threadIdx.x;
float sum = 0.0f;
for (int t = 0; t < N / TILE_SIZE; t++) { // assumes N is a multiple of TILE_SIZE
// Load tiles into shared memory
tileA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_SIZE + threadIdx.x];
tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
__syncthreads(); // Wait for all threads to finish loading
for (int k = 0; k < TILE_SIZE; k++)
sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
__syncthreads();
}
if (row < N && col < N)
C[row * N + col] = sum;
}
Tensor Cores and AI Acceleration
Tensor Cores accelerate the matrix multiply-accumulate (MMA) operation in hardware.
Regular CUDA Core: 1 FP32 MAC per cycle
4th-gen Tensor Core: 16x16x16 matrix multiply = 4096 FP16 MACs per cycle
NVIDIA cuBLAS, cuDNN, and TensorRT automatically leverage Tensor Cores when the data sizes align.
10. Modern Architecture Trends
Chiplet Design
Instead of one large monolithic die, chiplets connect multiple smaller dies on an interposer.
- AMD EPYC (Genoa): 12 CCDs (Core Complex Die, 5nm) + 1 IOD (I/O Die, 6nm)
- Intel Meteor Lake: separate CPU, GPU, SoC tiles
- Benefits: Higher yield, process optimization per function, cost reduction
- Technologies: TSMC CoWoS, Intel EMIB, UCIe standard
HBM (High Bandwidth Memory)
DRAM dies are stacked in 3D to achieve extreme memory bandwidth.
HBM3E (2024): 9.6 Gbps/pin, 8-stack = 1.2 TB/s per package
GDDR6X (RTX 4090): 21 Gbps/pin, 384-bit = 1 TB/s
DDR5-6400: 51.2 GB/s per channel (CPU)
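All of these headline figures come from the same arithmetic: bus width times per-pin data rate, divided by 8 bits per byte. A C check (the 1024-bit HBM stack interface and 64-bit DDR5 channel width are standard values, not stated in the list above):

```c
/* Peak bandwidth in GB/s: (bus width in bits) x (Gbps per pin) / 8 */
double peak_gb_per_s(double bus_bits, double gbps_per_pin) {
    return bus_bits * gbps_per_pin / 8.0;
}

/* peak_gb_per_s(1024, 9.6) -> 1228.8 GB/s (~1.2 TB/s, one HBM3E stack)
   peak_gb_per_s(384, 21.0) -> 1008 GB/s   (~1 TB/s, GDDR6X at 384-bit)
   peak_gb_per_s(64, 6.4)   -> 51.2 GB/s   (one DDR5-6400 channel)     */
```

The comparison makes HBM's trade-off explicit: modest per-pin speed, but an enormously wide interface made feasible by 3D stacking.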
NPU/TPU: AI-Dedicated Accelerators
- Google TPU v5p: 459 TFLOPS BF16, mesh interconnect
- Apple Neural Engine: iPhone 15 Pro, 35 TOPS INT8
- Qualcomm Hexagon: Smartphone NPU, 75 TOPS
- Intel Gaudi 3: 1835 TFLOPS BF16
The Rise of RISC-V
Open-source ISA expanding from low-power embedded to data-center servers:
- Vendors: SiFive, StarFive, Alibaba T-Head
- RISC-V International: 3,000+ member organizations
- Mainline Linux kernel support; Android port in progress
11. Performance Optimization in Practice
Cache-Friendly Programming
#include <stdlib.h>
#define N 4096
float A[N][N], B[N][N], C[N][N];
// Inefficient: column-major traversal (many cache misses)
void matmul_naive() {
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
for (int k = 0; k < N; k++)
C[i][j] += A[i][k] * B[k][j]; // B[k][j] strides across rows
}
// Efficient: loop tiling (reuses cache blocks)
#define BLOCK 64
void matmul_tiled() {
for (int ii = 0; ii < N; ii += BLOCK)
for (int jj = 0; jj < N; jj += BLOCK)
for (int kk = 0; kk < N; kk += BLOCK)
for (int i = ii; i < ii+BLOCK && i < N; i++)
for (int j = jj; j < jj+BLOCK && j < N; j++)
for (int k = kk; k < kk+BLOCK && k < N; k++)
C[i][j] += A[i][k] * B[k][j];
}
SIMD Vectorization with AVX2
#include <immintrin.h> // AVX2 intrinsics
// Process 8 floats simultaneously with AVX2
void vector_add_avx2(float *a, float *b, float *c, int n) {
int i;
for (i = 0; i <= n - 8; i += 8) {
__m256 va = _mm256_loadu_ps(&a[i]); // Load 8 floats
__m256 vb = _mm256_loadu_ps(&b[i]);
__m256 vc = _mm256_add_ps(va, vb); // Add 8 floats at once
_mm256_storeu_ps(&c[i], vc); // Store 8 floats
}
// Handle remaining elements
for (; i < n; i++)
c[i] = a[i] + b[i];
}
12. Quiz
Q1. What causes a Data Hazard in a pipeline, and how can it be resolved?
Answer: A data hazard occurs when an instruction needs the result of a previous instruction that has not yet completed — specifically the RAW (Read After Write) hazard where the pipeline tries to read a register before a prior write has committed.
Solutions:
- Forwarding (Bypassing): Route the EX/MEM stage output directly to the next EX stage input. Eliminates the stall in most cases.
- Stall (Pipeline Bubble): Insert NOP cycles to wait for the result. Reduces throughput.
- Code Reordering: Compiler inserts independent instructions between dependent pairs, hiding the latency.
- Load-use hazard is a special case where one stall is unavoidable (MEM result cannot be forwarded to EX in time).
Q2. What is the difference between a Direct-Mapped cache and a 4-way Set-Associative cache?
Answer: In a direct-mapped cache, each memory block has exactly one possible location in the cache. In a 4-way set-associative cache, each block can be placed in any of 4 slots within its set.
Key differences:
- Direct-mapped: susceptible to conflict misses (two frequently used blocks competing for one slot). Hardware is simple and fast.
- 4-way SA: conflict misses greatly reduced. Requires a replacement policy (LRU). More hardware cost.
- Real CPUs typically use 4-way or 8-way for L1, 16-way for L3. Fully associative is reserved for TLBs.
Q3. Explain the role of the TLB and the process that happens on a TLB miss.
Answer: The TLB (Translation Lookaside Buffer) is a fully associative cache that stores recent virtual-to-physical address translations, avoiding expensive page table walks on every memory access.
TLB miss handling:
- The VPN (Virtual Page Number) is not found in the TLB
- Hardware (x86) or software (RISC-V, MIPS) performs a page table walk
- Multi-level page tables are traversed in order: PML4 → PDPT → PD → PT
- The PFN (Physical Frame Number) is retrieved
- A new TLB entry is inserted, evicting an old one via LRU if necessary
- The original memory access is retried
- Full miss handling takes hundreds of cycles — maintaining a 99%+ hit rate is critical.
Q4. What is warp divergence in a GPU and how does it affect performance?
Answer: Warp divergence occurs when threads within a warp (group of 32) take different branches of an if-else statement.
How it works:
- In the SIMT model, all threads in a warp must execute the same instruction.
- When an if-else diverges: the if-branch executes first with non-if threads masked off, then the else-branch executes with non-else threads masked off.
- Both paths are executed serially, up to 2x slower in the worst case.
Mitigations:
- Arrange data so threads in the same warp always take the same branch
- Replace conditional logic with arithmetic (e.g., branchless select)
- Minimize conditional code in kernel hot paths
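The "replace conditional logic with arithmetic" idea, sketched in C (the same pattern applies inside a CUDA kernel; note that compilers often lower simple ternaries to predicated selects on their own, so this matters most for heavier conditional bodies):

```c
#include <math.h>

/* Branchy: threads in a warp can diverge on the comparison */
float relu_branchy(float x) { return (x > 0.0f) ? x : 0.0f; }

/* Branchless: relu(x) = (x + |x|) / 2, one identical instruction
   stream for every thread in the warp */
float relu_branchless(float x) { return 0.5f * (x + fabsf(x)); }
```

Both functions compute the same result; the branchless form just guarantees that all 32 threads of a warp march through the same instructions regardless of their data.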
Q5. Using Amdahl's Law, calculate the maximum theoretical speedup when 80% of a program is parallelized.
Answer: With an unlimited number of processors, the maximum speedup is 5x.
Calculation:
- Parallelizable fraction: f = 0.8
- Sequential fraction: 1 - 0.8 = 0.2
- As s (parallel speedup) approaches infinity: f/s approaches 0
- Speedup = 1 / ((1 - f) + f/s) = 1 / (0.2 + 0) = 1 / 0.2 = 5x
Implication: No matter how many cores are added, the 20% sequential portion caps the overall speedup at 5x. This is why eliminating sequential bottlenecks is more valuable than simply adding more parallelism.
References
- Patterson, D. & Hennessy, J. (2020). Computer Organization and Design: RISC-V Edition (2nd ed.). Morgan Kaufmann.
- Hennessy, J. & Patterson, D. (2019). Computer Architecture: A Quantitative Approach (6th ed.). Morgan Kaufmann.
- RISC-V International. (2024). The RISC-V Instruction Set Manual. https://riscv.org/technical/specifications/
- NVIDIA. (2023). CUDA C++ Programming Guide. https://docs.nvidia.com/cuda/cuda-c-programming-guide/
- ARM Ltd. (2024). ARM Architecture Reference Manual. https://developer.arm.com/documentation/
- Intel. (2024). Intel 64 and IA-32 Architectures Software Developer's Manual.