Computer Architecture Complete Guide: From ISA to GPU Parallel Architecture

Introduction

Computer Architecture is the foundational discipline that bridges hardware and software. Whether you are an electronics engineer, computer science student, or systems programmer, a deep understanding of how processors work is essential for writing high-performance software, designing chips, and reasoning about system behavior.

This guide is structured around Patterson & Hennessy's Computer Organization and Design and Computer Architecture: A Quantitative Approach, covering everything from ISA design to modern GPU parallel architectures.


1. Overview of Computer Architecture

Von Neumann Architecture

The Von Neumann architecture, proposed by John von Neumann in 1945, forms the foundation of nearly all modern computers. Its defining characteristic is that program instructions and data share the same memory space.

Components:

  • CPU (Central Processing Unit): ALU + Control Unit + Registers
  • Memory: Stores both program code and data
  • I/O Devices: Keyboard, display, disk, network
  • Bus: Carries data, address, and control signals

Von Neumann Bottleneck: The bandwidth between the CPU and memory limits overall performance. The modern cache hierarchy was designed to mitigate this bottleneck.

Harvard Architecture

The Harvard architecture uses separate instruction and data memories, enabling simultaneous access to both. It is used in DSPs, microcontrollers (AVR, PIC), and in the L1 cache split (I-Cache / D-Cache) of modern processors.

Aspect      | Von Neumann         | Harvard
------------|---------------------|----------------------
Memory      | Unified             | Separate
Bandwidth   | Limited             | Higher
Complexity  | Lower               | Higher
Use case    | General-purpose CPU | DSP, microcontroller

Levels of Abstraction

Computer systems are organized in layers of abstraction:

Application Software
Operating System
ISA (Instruction Set Architecture)  ← Hardware/Software boundary
Microarchitecture
Logic Gates
Transistors

The ISA is the contract between hardware and software. Any processor that implements the same ISA can run the same software, regardless of its internal microarchitecture.

Performance Metrics

CPU Execution Time:

CPU Time = Instruction Count × CPI × Clock Cycle Time
         = Instruction Count × CPI / Clock Rate
  • CPI (Cycles Per Instruction): Average clock cycles per instruction
  • Clock Rate: Measured in Hz (e.g., 3.5 GHz)
  • MIPS (Millions of Instructions Per Second): Throughput in instructions
  • FLOPS (Floating Point Operations Per Second): Floating-point throughput

Amdahl's Law (Limits of Parallel Speedup):

Speedup = 1 / ((1 - f) + f/s)

Where f is the fraction of work that can be parallelized, and s is the speedup of the parallelized part. If only 40% of a program can be parallelized, the theoretical maximum speedup is 1.67x regardless of core count.


2. Instruction Set Architecture (ISA)

RISC vs CISC

RISC (Reduced Instruction Set Computer):

  • Simple, fixed-length instructions
  • Register-centric operations (Load/Store architecture)
  • Favors hardware pipelining
  • Examples: ARM, RISC-V, MIPS, PowerPC

CISC (Complex Instruction Set Computer):

  • Complex, variable-length instructions
  • Can operate directly on memory
  • Higher code density (more work per instruction)
  • Examples: x86-64, VAX

Modern x86-64 processors internally decode CISC instructions into RISC-like micro-ops, blurring the original distinction.

Instruction Formats (RISC-V)

RISC-V defines six instruction formats:

R-type (Register operations):
[funct7 | rs2 | rs1 | funct3 | rd | opcode]
  7-bit   5-bit 5-bit  3-bit  5-bit  7-bit

I-type (Immediate / Load):
[imm[11:0] | rs1 | funct3 | rd | opcode]
  12-bit    5-bit   3-bit  5-bit  7-bit

S-type (Store):
[imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode]

B-type (Branch):
[imm[12|10:5] | rs2 | rs1 | funct3 | imm[4:1|11] | opcode]

U-type (Upper immediate):
[imm[31:12] | rd | opcode]

J-type (JAL):
[imm[20|10:1|11|19:12] | rd | opcode]

RISC-V Assembly Examples

RISC-V is an open, royalty-free ISA originally developed at UC Berkeley; it is growing rapidly in both education and industry.

# RISC-V Assembly Examples
# Basic arithmetic
add  a0, a0, a1      # a0 = a0 + a1
sub  t0, t1, t2      # t0 = t1 - t2
addi t0, t0, 10      # t0 = t0 + 10 (immediate)
mul  a0, a1, a2      # a0 = a1 * a2

# Memory access (Load/Store)
lw   t0, 0(a0)       # t0 = Memory[a0 + 0]  (word load)
lh   t1, 2(a0)       # t1 = Memory[a0 + 2]  (halfword load)
lb   t2, 1(a0)       # t2 = Memory[a0 + 1]  (byte load)
sw   t0, 4(a0)       # Memory[a0 + 4] = t0  (word store)

# Logical operations
and  t0, t1, t2      # t0 = t1 & t2
or   t0, t1, t2      # t0 = t1 | t2
xor  t0, t1, t2      # t0 = t1 ^ t2
sll  t0, t1, t2      # t0 = t1 << t2 (logical left shift)
srl  t0, t1, t2      # t0 = t1 >> t2 (logical right shift)

# Branches and jumps
beq  t0, t1, label   # if t0 == t1, jump to label
bne  t0, t1, label   # if t0 != t1, jump to label
blt  t0, t1, label   # if t0 < t1,  jump to label
bge  t0, t1, label   # if t0 >= t1, jump to label
jal  ra, func        # ra = PC+4; jump to func
jalr ra, 0(t0)       # ra = PC+4; jump to address in t0

# RISC-V Register Conventions
# x0  (zero): hardwired zero
# x1  (ra):   return address
# x2  (sp):   stack pointer
# x10-x17 (a0-a7): function arguments / return values
# x5-x7, x28-x31 (t0-t6): temporaries
# x8-x9, x18-x27 (s0-s11): saved registers

Translating C to RISC-V Assembly

// C code
int factorial(int n) {
    if (n <= 1) return 1;
    return n * factorial(n - 1);
}

# RISC-V Assembly (recursive factorial)
factorial:
    addi sp, sp, -16     # Allocate stack frame
    sw   ra, 12(sp)      # Save return address
    sw   a0, 8(sp)       # Save n

    addi t0, zero, 1
    bgt  a0, t0, recurse # if n > 1, recurse
    addi a0, zero, 1     # return 1
    j    done

recurse:
    addi a0, a0, -1      # n - 1
    jal  ra, factorial   # recursive call
    lw   t0, 8(sp)       # restore n
    mul  a0, t0, a0      # n * factorial(n-1)

done:
    lw   ra, 12(sp)      # restore return address
    addi sp, sp, 16      # free stack frame
    jalr zero, 0(ra)     # return

3. ALU and Datapath

ALU Design

The ALU (Arithmetic Logic Unit) is the computational core of the CPU:

ADD:  result = A + B
SUB:  result = A + (~B + 1) = A - B  (two's complement)
AND:  result = A AND B
OR:   result = A OR B
XOR:  result = A XOR B
SLT:  result = (A < B) ? 1 : 0  (Set Less Than)
SLL:  result = A << B

Ripple Carry Adder:

FA0: S0 = A0 XOR B0 XOR Cin;  C1 = carry out
FA1: S1 = A1 XOR B1 XOR C1;  C2 = carry out
...

For an N-bit ripple carry adder, the carry must propagate through every stage, so delay is O(N): a 64-bit addition waits on 64 sequential full-adder delays.

Carry Lookahead Adder (CLA):

Define Generate and Propagate:

Gi = Ai AND Bi       (carry generate)
Pi = Ai XOR Bi       (carry propagate)
Ci+1 = Gi OR (Pi AND Ci)

By computing carries for 4-bit groups in parallel, total delay drops to O(log N).

Single-Cycle Datapath

In a single-cycle implementation, every instruction completes in exactly one clock cycle.

Instruction flow (R-type ADD):
1. IF:  PC → instruction memory → fetch instruction
2. ID:  Read rs1, rs2 from register file; generate control signals
3. EX:  ALU computes rs1 + rs2
4. MEM: (R-type has no memory access)
5. WB:  Write ALU result to rd

Problem: The slowest instruction (e.g., memory load) determines the clock period. Fast instructions are penalized by the slow clock.


4. Pipelining

Pipelining overlaps the execution of multiple instructions, just as laundry loads overlap when one load is in the washer while an earlier one is in the dryer and another is being ironed.

Five-Stage Pipeline

Stage      Abbr  Function
---------  ----  --------------------------------------------------
Fetch      IF    Fetch instruction from memory using PC
Decode     ID    Decode instruction, read registers, generate control
Execute    EX    Perform ALU operation
Memory     MEM   Read/write data memory
Writeback  WB    Write result back to register file

Timing diagram:

Cycle:  1    2    3    4    5    6    7    8
Inst 1: IF   ID   EX  MEM   WB
Inst 2:      IF   ID   EX  MEM   WB
Inst 3:           IF   ID   EX  MEM   WB
Inst 4:                IF   ID   EX  MEM   WB

Ideally, a 5-stage pipeline provides up to 5x throughput improvement over single-cycle.

Pipeline Hazards

1. Structural Hazard

Two instructions require the same hardware resource at the same time.

  • Solution: Duplicate resources (separate I-Cache and D-Cache), add register file ports.

2. Data Hazard

An instruction needs a result that a previous instruction has not yet produced.

add t0, t1, t2    # write t0 (WB: cycle 5)
sub t3, t0, t4    # read t0  (ID: cycle 3) → RAW hazard!

Types:

  • RAW (Read After Write): Most common. Next instruction reads before previous writes.
  • WAR (Write After Read): Occurs in out-of-order execution.
  • WAW (Write After Write): Two instructions write the same register.

Solutions:

  • Forwarding (Bypassing): Route EX/MEM stage result directly to next EX stage input — no stall.
  • Stall (Bubble): Insert NOP cycles. Reduces performance.
  • Code Reordering: Compiler inserts independent instructions between dependent ones.

# Load-use hazard: 1 stall cycle unavoidable
lw  t0, 0(a0)      # load from memory
add t1, t0, t2     # uses t0 immediately → MEM result arrives too late to forward; 1 bubble
# Solution: compiler inserts an independent instruction
lw  t0, 0(a0)
add t3, t4, t5     # independent instruction fills the slot
add t1, t0, t2     # t0 is now ready

3. Control Hazard

A branch instruction creates uncertainty about which instruction to fetch next.

beq t0, t1, label  # branch decision made in EX stage
# Instructions already fetched in IF/ID may be wrong

Solutions:

  • Flush: Discard incorrectly fetched instructions (2-3 cycle penalty).
  • Branch Prediction:
    • Static: always Not-Taken or always Taken
    • Dynamic: 1-bit/2-bit predictors, Branch Target Buffer (BTB)
    • Modern CPUs achieve 95%+ prediction accuracy
  • Delayed Branch: Always execute the instruction after the branch (MIPS approach).

Superscalar Pipelines

Modern CPUs run multiple pipelines in parallel:

Intel Core: 6-way out-of-order execution (OoO)
AMD Zen 4:  4-way decode + OoO execution
ARM Cortex-X4: 5-way decode

Out-of-Order Execution:

  1. Instructions fetched and decoded in order
  2. Instructions dispatched as soon as their operands are ready (Tomasulo's algorithm)
  3. Results committed in original program order (Reorder Buffer)

5. Memory Hierarchy

The Memory Hierarchy

Registers
  Size: ~1KB  |  Latency: 1 cycle       |  Cost: very high

L1 Cache (on-chip)
  Size: 32-64KB  |  Latency: 4-5 cycles  |  Cost: high

L2 Cache (on-chip)
  Size: 256KB-1MB  |  Latency: 12-15 cycles  |  Cost: medium

L3 Cache (on-chip, shared)
  Size: 8-64MB  |  Latency: 30-40 cycles  |  Cost: low

DRAM (Main Memory)
  Size: 8-256GB  |  Latency: 200-300 cycles  |  Cost: very low

SSD / NVMe
  Size: 1-4TB  |  Latency: 10,000+ cycles  |  Cost: extremely low

Principle of Locality

  • Temporal Locality: Recently accessed data will likely be accessed again soon. (Loop variables, counters)
  • Spatial Locality: Data near recently accessed locations will likely be accessed soon. (Array traversal)

// Spatial locality optimization
// Bad: column-major access (frequent cache misses)
for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
        sum += A[i][j];  // A[0][0], A[1][0], A[2][0]... (non-contiguous)

// Good: row-major access (cache-friendly)
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        sum += A[i][j];  // A[0][0], A[0][1], A[0][2]... (contiguous memory)

Cache Organization

Direct-Mapped Cache: Each memory block maps to exactly one cache slot.

  • Pro: Simple hardware, fast lookup
  • Con: Conflict misses (two blocks competing for the same slot)

N-way Set-Associative Cache: Each memory block can go into any of N slots in a set.

  • Most common in real CPUs (L1: 4-way, L2: 8-way, L3: 16-way)

Fully Associative Cache: A block can go anywhere in the cache.

  • Pro: No conflict misses
  • Con: Expensive parallel lookup (used for TLBs)

Cache address breakdown (32-bit address, 4KB cache, 64B block, 4-way → 16 sets):

[Tag (22 bits) | Index (4 bits) | Offset (6 bits)]

Replacement Policies

  • LRU (Least Recently Used): Evict the block unused for the longest time. Best performance, complex hardware.
  • FIFO (First In First Out): Evict the oldest-inserted block. Simple.
  • Random: Evict a random block. Simple hardware, performance close to LRU in practice.

Write Policies

  • Write-through: Update both cache and memory on every write. Simple, high bandwidth usage.
  • Write-back: Update cache only; write to memory when the block is evicted (Dirty bit). Better performance, more complex coherence.

6. Virtual Memory

Overview

Virtual memory gives each process its own address space, providing isolation, protection, and the ability to overcommit physical memory.

Virtual Address (VA): Address used by the process (0x0000 to 0xFFFFFFFF)
Physical Address (PA): Actual DRAM address
Page: Fixed-size block of virtual/physical memory (typically 4KB)

Page Table Translation

Virtual Address [VPN | Page Offset]
             Page table lookup
Physical Address [PFN | Page Offset]

64-bit systems use a 4-level page table (x86-64):

[PML4 (9 bits) | PDPT (9 bits) | PD (9 bits) | PT (9 bits) | Offset (12 bits)]

TLB (Translation Lookaside Buffer)

Every page table lookup requires memory accesses. The TLB is a fully associative cache for recent virtual-to-physical translations.

TLB Hit:  VA → TLB lookup → PA  (1-2 cycles)
TLB Miss: VA → page table walk → PA  (100+ cycles)

Typical TLB size: 64-1024 entries. Must maintain 99%+ hit rate for good performance.

Page Replacement Algorithms

When physical memory is full, the OS must evict a page to disk.

  • Optimal: Evict the page that won't be used for the longest time (theoretical best, not implementable)
  • LRU: Evict the least recently used page (good performance, expensive hardware)
  • Clock (FIFO + second chance): LRU approximation used by most OS kernels

7. I/O Systems

I/O Control Methods

Polling: The CPU repeatedly checks device status registers.

  • Pro: Simple, low latency
  • Con: Wastes CPU cycles (busy-wait)

Interrupts: The device signals the CPU when ready.

  • Pro: CPU free to do other work
  • Con: Interrupt handling overhead (context save/restore)

DMA (Direct Memory Access): A DMA controller transfers data between memory and device without CPU involvement.

  • Essential for high-throughput transfers (disk, network, GPU)
  • CPU only initiates and acknowledges completion

Bus Architecture

PCIe 5.0:   x16 = 128 GB/s bidirectional
NVMe (PCIe): up to 7 GB/s sequential read (Gen4)
USB 4.0:    up to 40 Gbps
DDR5-6400:  51.2 GB/s per channel

8. Parallel Architecture

Flynn's Taxonomy

Class  Instruction streams  Data streams  Examples
-----  -------------------  ------------  ------------------------
SISD   Single               Single        Single-core CPU
SIMD   Single               Multiple      GPU, AVX vector ops
MISD   Multiple             Single        Fault-tolerant systems
MIMD   Multiple             Multiple      Multi-core CPU, clusters

Multicore and Cache Coherence

When multiple cores hold copies of the same cache line, coherence must be maintained.

MESI Protocol (Modified, Exclusive, Shared, Invalid):

Modified (M):  This core holds the only (modified) copy
Exclusive (E): This core holds the only copy, matches memory
Shared (S):    Multiple cores hold read-only copies
Invalid (I):   This copy is stale (another core modified it)

State transitions:

  • Core A writes to a Shared line → A transitions to M, all others go to I (invalidation)
  • Core B reads an Invalid line → bus snooping causes A to supply the line

OpenMP Parallel Programming

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

// Parallel array summation
int main() {
    int n = 1000000;
    long long sum = 0;
    int *arr = malloc(n * sizeof(int));

    for (int i = 0; i < n; i++)
        arr[i] = i + 1;

    // Reduction clause ensures thread-safe accumulation
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (int i = 0; i < n; i++) {
        sum += arr[i];
    }

    printf("Sum = %lld\n", sum);  // Expected: 500000500000

    // Print thread IDs
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        printf("Thread %d of %d\n", tid, nthreads);
    }

    free(arr);
    return 0;
}

// Cache-friendly parallel matrix multiplication
void matmul(float *A, float *B, float *C, int N) {
    #pragma omp parallel for collapse(2) schedule(dynamic, 64)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++) {
                sum += A[i*N + k] * B[k*N + j];
            }
            C[i*N + j] = sum;
        }
    }
}

NUMA Architecture

NUMA (Non-Uniform Memory Access) assigns local memory to each CPU socket.

Socket 0 (Core 0-15)  ←→ Local DRAM (64GB)  [~80ns]
QPI / Infinity Fabric
Socket 1 (Core 16-31) ←→ Local DRAM (64GB)  [~80ns]

Socket 0 accessing Socket 1 memory: ~160ns (2x latency)

Use numactl to pin processes to specific NUMA nodes for best performance.


9. GPU Architecture

GPU vs CPU Design Philosophy

Aspect           CPU                                 GPU
---------------  ----------------------------------  -------------------------------
Core count       Dozens                              Thousands to tens of thousands
Core complexity  Very high (OoO, branch prediction)  Simple
Cache size       Large, multi-level                  Small, minimal
Design goal      Minimize single-thread latency      Maximize throughput
Use case         General-purpose serial work         Massively parallel computation

SIMT Execution Model

SIMT (Single Instruction, Multiple Thread) is the GPU's core execution model.

Warp: 32 threads executing the same instruction (NVIDIA)
All 32 threads execute the same instruction simultaneously
Each thread operates on different data

Warp scheduler: On memory stall, immediately switch to another warp
Thousands of warps rotate rapidly, hiding memory latency

NVIDIA GPU Hierarchy

GPU
├── GPC (Graphics Processing Cluster) x 8
│   └── SM (Streaming Multiprocessor) x 7-12
│       ├── CUDA Core x 128 (FP32 operations)
│       ├── Tensor Core x 4 (matrix multiply, AI)
│       ├── RT Core x 1 (ray tracing)
│       ├── Warp Scheduler x 4
│       ├── Register File (256KB)
│       └── Shared Memory / L1 Cache (128-256KB)
└── L2 Cache (shared, tens of MB)

NVIDIA H100 (Hopper, 2022):

  • 132 SMs, 16,896 CUDA Cores
  • 528 Tensor Cores (4th generation, FP8 support)
  • 80GB HBM3, 3.35 TB/s memory bandwidth
  • 4 PetaFLOPS FP8 Tensor Core performance

CUDA Programming Basics

#include <cuda_runtime.h>
#include <stdio.h>

// GPU kernel: vector addition
__global__ void vectorAdd(float *a, float *b, float *c, int n) {
    // Global thread index
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    int n = 1 << 20;  // 1M elements
    size_t size = n * sizeof(float);

    // Host (CPU) memory
    float *h_a = (float*)malloc(size);
    float *h_b = (float*)malloc(size);
    float *h_c = (float*)malloc(size);

    // Device (GPU) memory
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, size);
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_c, size);

    // Host → Device transfer
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    // Launch kernel: 256 threads per block
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // Device → Host transfer
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

GPU Memory Optimization with Shared Memory

// Tiled matrix multiplication using shared memory
#define TILE_SIZE 16

__global__ void matmulShared(float *A, float *B, float *C, int N) {
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE_SIZE; t++) {
        // Load tiles into shared memory
        tileA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_SIZE + threadIdx.x];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        __syncthreads();  // Wait for all threads to finish loading

        for (int k = 0; k < TILE_SIZE; k++)
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();
    }

    if (row < N && col < N)
        C[row * N + col] = sum;
}

Tensor Cores and AI Acceleration

Tensor Cores accelerate the matrix multiply-accumulate (MMA) operation in hardware.

Regular CUDA Core: 1 FP32 MAC per cycle
4th-gen Tensor Core: 16x16x16 matrix multiply = 4096 FP16 MACs per cycle

NVIDIA cuBLAS, cuDNN, and TensorRT automatically leverage Tensor Cores when the data sizes align.


10. Modern Architecture Trends

Chiplet Design

Instead of one large monolithic die, chiplets connect multiple smaller dies on an interposer.

  • AMD EPYC (Genoa): 12 CCDs (Core Complex Die, 5nm) + 1 IOD (I/O Die, 6nm)
  • Intel Meteor Lake: separate CPU, GPU, SoC tiles
  • Benefits: Higher yield, process optimization per function, cost reduction
  • Technologies: TSMC CoWoS, Intel EMIB, UCIe standard

HBM (High Bandwidth Memory)

DRAM dies are stacked in 3D to achieve extreme memory bandwidth.

HBM3E (2024): 9.6 Gbps/pin, 8-stack = 1.2 TB/s per package
GDDR6X (RTX 4090): 21 Gbps/pin, 384-bit = 1 TB/s
DDR5-6400: 51.2 GB/s per channel (CPU)

NPU/TPU: AI-Dedicated Accelerators

  • Google TPU v5p: 459 TFLOPS BF16, mesh interconnect
  • Apple Neural Engine: iPhone 15 Pro, 35 TOPS INT8
  • Qualcomm Hexagon: Smartphone NPU, 75 TOPS
  • Intel Gaudi 3: 1835 TFLOPS BF16

The Rise of RISC-V

Open-source ISA expanding from low-power embedded to data-center servers:

  • Vendors: SiFive, StarFive, Alibaba T-Head
  • RISC-V International: 3,000+ member organizations
  • Official Linux 5.19 support, Android port complete

11. Performance Optimization in Practice

Cache-Friendly Programming

#include <stdlib.h>

#define N 4096

float A[N][N], B[N][N], C[N][N];

// Inefficient: column-major traversal (many cache misses)
void matmul_naive() {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];  // B[k][j] strides across rows
}

// Efficient: loop tiling (reuses cache blocks)
#define BLOCK 64
void matmul_tiled() {
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int kk = 0; kk < N; kk += BLOCK)
                for (int i = ii; i < ii+BLOCK && i < N; i++)
                    for (int j = jj; j < jj+BLOCK && j < N; j++)
                        for (int k = kk; k < kk+BLOCK && k < N; k++)
                            C[i][j] += A[i][k] * B[k][j];
}

SIMD Vectorization with AVX2

#include <immintrin.h>  // AVX2 intrinsics

// Process 8 floats simultaneously with AVX2
void vector_add_avx2(float *a, float *b, float *c, int n) {
    int i;
    for (i = 0; i <= n - 8; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);   // Load 8 floats
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vc = _mm256_add_ps(va, vb);    // Add 8 floats at once
        _mm256_storeu_ps(&c[i], vc);          // Store 8 floats
    }
    // Handle remaining elements
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}

12. Quiz

Q1. What causes a Data Hazard in a pipeline, and how can it be resolved?

Answer: A data hazard occurs when an instruction needs the result of a previous instruction that has not yet completed — specifically the RAW (Read After Write) hazard where the pipeline tries to read a register before a prior write has committed.

Solutions:

  • Forwarding (Bypassing): Route the EX/MEM stage output directly to the next EX stage input. Eliminates the stall in most cases.
  • Stall (Pipeline Bubble): Insert NOP cycles to wait for the result. Reduces throughput.
  • Code Reordering: Compiler inserts independent instructions between dependent pairs, hiding the latency.
  • Load-use hazard is a special case where one stall is unavoidable (MEM result cannot be forwarded to EX in time).

Q2. What is the difference between a Direct-Mapped cache and a 4-way Set-Associative cache?

Answer: In a direct-mapped cache, each memory block has exactly one possible location in the cache. In a 4-way set-associative cache, each block can be placed in any of 4 slots within its set.

Key differences:

  • Direct-mapped: susceptible to conflict misses (two frequently used blocks competing for one slot). Hardware is simple and fast.
  • 4-way SA: conflict misses greatly reduced. Requires a replacement policy (LRU). More hardware cost.
  • Real CPUs typically use 4-way or 8-way for L1, 16-way for L3. Fully associative is reserved for TLBs.

Q3. Explain the role of the TLB and the process that happens on a TLB miss.

Answer: The TLB (Translation Lookaside Buffer) is a fully associative cache that stores recent virtual-to-physical address translations, avoiding expensive page table walks on every memory access.

TLB miss handling:

  1. The VPN (Virtual Page Number) is not found in the TLB
  2. Hardware (x86) or software (RISC-V, MIPS) performs a page table walk
  3. Multi-level page tables are traversed in order: PML4 → PDPT → PD → PT
  4. The PFN (Physical Frame Number) is retrieved
  5. A new TLB entry is inserted, evicting an old one via LRU if necessary
  6. The original memory access is retried
  • Full miss handling takes hundreds of cycles — maintaining a 99%+ hit rate is critical.

Q4. What is warp divergence in a GPU and how does it affect performance?

Answer: Warp divergence occurs when threads within a warp (group of 32) take different branches of an if-else statement.

How it works:

  • In the SIMT model, all threads in a warp must execute the same instruction.
  • When an if-else diverges: the if-branch executes first with non-if threads masked off, then the else-branch executes with non-else threads masked off.
  • Both paths are executed serially, up to 2x slower in the worst case.

Mitigations:

  • Arrange data so threads in the same warp always take the same branch
  • Replace conditional logic with arithmetic (e.g., branchless select)
  • Minimize conditional code in kernel hot paths

Q5. Using Amdahl's Law, calculate the maximum theoretical speedup when 80% of a program is parallelized.

Answer: With an unlimited number of processors, the maximum speedup is 5x.

Calculation:

  • Parallelizable fraction: f = 0.8
  • Sequential fraction: 1 - 0.8 = 0.2
  • As s (parallel speedup) approaches infinity: f/s approaches 0
  • Speedup = 1 / ((1 - f) + f/s) = 1 / (0.2 + 0) = 1 / 0.2 = 5x

Implication: No matter how many cores are added, the 20% sequential portion caps the overall speedup at 5x. This is why eliminating sequential bottlenecks is more valuable than simply adding more parallelism.
