Split View: 컴퓨터 구조 완전 가이드: ISA부터 GPU 병렬 아키텍처까지

컴퓨터 구조 완전 가이드: ISA부터 GPU 병렬 아키텍처까지

들어가며
1. 컴퓨터 구조 개요
2. 명령어 집합 구조 (ISA)
3. ALU와 데이터패스
- ALU 설계
- 단일 사이클 데이터패스
4. 파이프라이닝 (Pipelining)
5. 메모리 계층 구조
6. 가상 메모리
7. 입출력 시스템
- I/O 제어 방법
- 버스 아키텍처
8. 병렬 아키텍처
9. GPU 아키텍처
10. 최신 아키텍처 트렌드
11. 성능 최적화 실전
- 캐시 친화적 프로그래밍
- SIMD 벡터화 (AVX2)
12. 퀴즈
참고 자료

들어가며

컴퓨터 구조(Computer Architecture)는 하드웨어와 소프트웨어의 경계를 이해하는 핵심 학문입니다. 전자/컴퓨터공학 전공자라면 반드시 깊이 이해해야 하는 분야이며, 고성능 소프트웨어 개발자, 시스템 프로그래머, 반도체 설계자 모두에게 필수 지식입니다.

이 가이드는 Patterson & Hennessy의 Computer Organization and Design과 Computer Architecture: A Quantitative Approach를 기반으로, ISA 설계부터 현대 GPU 병렬 아키텍처까지 체계적으로 다룹니다.

1. 컴퓨터 구조 개요

폰 노이만 아키텍처 (Von Neumann Architecture)

현대 컴퓨터의 기반인 폰 노이만 아키텍처는 1945년 John von Neumann이 제안했습니다. 핵심 특징은 프로그램과 데이터를 같은 메모리 공간에 저장한다는 것입니다.

구성 요소:

CPU (Central Processing Unit): ALU + 제어 유닛 + 레지스터
메모리: 프로그램 코드 + 데이터 저장
입출력 장치: 키보드, 디스플레이, 디스크 등
버스: 데이터/주소/제어 신호 전달

폰 노이만 병목(Von Neumann Bottleneck): CPU와 메모리 사이의 버스 대역폭이 성능을 제한합니다. 현대 캐시 계층 구조는 이 문제를 완화하기 위해 설계되었습니다.

하버드 아키텍처 (Harvard Architecture)

하버드 아키텍처는 명령어 메모리와 데이터 메모리를 분리하여 동시 접근을 허용합니다. DSP, 마이크로컨트롤러(AVR, PIC), 그리고 현대 CPU의 L1 캐시(I-Cache/D-Cache 분리)에 사용됩니다.

구분	폰 노이만	하버드
메모리	통합	분리
대역폭	제한적	높음
복잡도	낮음	높음
사용처	범용 CPU	DSP, 마이크로컨트롤러

컴퓨터 추상화 계층

컴퓨터 시스템은 여러 추상화 계층으로 구성됩니다:

응용 프로그램 (Application)
       ↓
운영체제 (Operating System)
       ↓
ISA (Instruction Set Architecture)  ← 하드웨어/소프트웨어 경계
       ↓
마이크로아키텍처 (Microarchitecture)
       ↓
논리 게이트 (Logic Gates)
       ↓
트랜지스터 (Transistors)

ISA는 하드웨어와 소프트웨어의 계약(contract)입니다. 같은 ISA를 구현하는 프로세서라면 어떤 마이크로아키텍처를 사용하든 동일한 소프트웨어가 동작합니다.

성능 측정 지표

CPU 실행 시간 공식:

CPU Time = Instruction Count × CPI × Clock Cycle Time
         = Instruction Count × CPI / Clock Rate

CPI (Cycles Per Instruction): 명령어 1개 실행에 걸리는 평균 클럭 사이클 수
Clock Rate: Hz 단위, 초당 클럭 사이클 수 (예: 3.5 GHz)
MIPS (Millions of Instructions Per Second): 초당 실행 명령어 수 (백만 단위)
FLOPS (Floating Point Operations Per Second): 부동소수점 연산 성능

Amdahl의 법칙 (성능 개선 한계):

Speedup = 1 / ((1 - f) + f/s)

여기서 f는 병렬화 가능한 비율, s는 병렬화된 부분의 속도 향상 배율입니다. 전체의 40%만 병렬화 가능하다면, 아무리 많은 코어를 사용해도 최대 1.67배 속도 향상이 한계입니다.

2. 명령어 집합 구조 (ISA)

RISC vs CISC

RISC (Reduced Instruction Set Computer):

단순하고 고정 길이 명령어
레지스터 중심 연산 (Load/Store 아키텍처)
하드웨어 파이프라이닝에 유리
대표: ARM, RISC-V, MIPS, PowerPC

CISC (Complex Instruction Set Computer):

복잡하고 가변 길이 명령어
메모리 직접 연산 가능
코드 밀도 높음 (적은 명령어로 복잡한 작업)
대표: x86-64, VAX

현대 x86-64 프로세서는 내부적으로 RISC 마이크로연산(micro-ops)으로 변환하여 실행하므로, 실제 경계는 모호해졌습니다.

명령어 형식 (RISC-V 기준)

RISC-V는 6가지 명령어 형식을 사용합니다:

R-형식 (레지스터 연산):
[funct7|rs2|rs1|funct3|rd|opcode]
 7비트  5비트 5비트 3비트 5비트 7비트

I-형식 (즉시값/로드):
[imm[11:0]|rs1|funct3|rd|opcode]
  12비트   5비트 3비트 5비트 7비트

S-형식 (스토어):
[imm[11:5]|rs2|rs1|funct3|imm[4:0]|opcode]

B-형식 (분기):
[imm[12|10:5]|rs2|rs1|funct3|imm[4:1|11]|opcode]

U-형식 (상위 즉시값):
[imm[31:12]|rd|opcode]

J-형식 (JAL):
[imm[20|10:1|11|19:12]|rd|opcode]

RISC-V ISA 예제 코드

RISC-V는 UC Berkeley에서 개발한 오픈소스 ISA로, 교육 및 산업 모두에서 급성장 중입니다.

# RISC-V 어셈블리 예제
# 기본 산술 연산
add  a0, a0, a1      # a0 = a0 + a1
sub  t0, t1, t2      # t0 = t1 - t2
addi t0, t0, 10      # t0 = t0 + 10 (즉시값)
mul  a0, a1, a2      # a0 = a1 * a2

# 메모리 접근 (Load/Store)
lw   t0, 0(a0)       # t0 = Memory[a0 + 0]   (워드 로드)
lh   t1, 2(a0)       # t1 = Memory[a0 + 2]   (하프워드 로드)
lb   t2, 1(a0)       # t2 = Memory[a0 + 1]   (바이트 로드)
sw   t0, 4(a0)       # Memory[a0 + 4] = t0   (워드 스토어)

# 논리 연산
and  t0, t1, t2      # t0 = t1 & t2
or   t0, t1, t2      # t0 = t1 | t2
xor  t0, t1, t2      # t0 = t1 ^ t2
sll  t0, t1, t2      # t0 = t1 << t2 (논리 왼쪽 시프트)
srl  t0, t1, t2      # t0 = t1 >> t2 (논리 오른쪽 시프트)

# 분기 및 점프
beq  t0, t1, label   # if t0 == t1, jump to label
bne  t0, t1, label   # if t0 != t1, jump to label
blt  t0, t1, label   # if t0 < t1,  jump to label
bge  t0, t1, label   # if t0 >= t1, jump to label
jal  ra, func        # ra = PC+4; jump to func
jalr ra, 0(t0)       # ra = PC+4; jump to t0+0

# RISC-V 레지스터 규약
# x0 (zero): 항상 0
# x1 (ra):   반환 주소 (Return Address)
# x2 (sp):   스택 포인터 (Stack Pointer)
# x10-x17 (a0-a7): 함수 인자/반환값
# x5-x7, x28-x31 (t0-t6): 임시 레지스터
# x8-x9, x18-x27 (s0-s11): 저장 레지스터

C 함수를 RISC-V 어셈블리로 변환

// C 코드
int factorial(int n) {
    if (n <= 1) return 1;
    return n * factorial(n - 1);
}

# RISC-V 어셈블리 (재귀 팩토리얼)
factorial:
    addi sp, sp, -16     # 스택 프레임 할당
    sw   ra, 12(sp)      # 반환 주소 저장
    sw   a0, 8(sp)       # n 저장

    addi t0, zero, 1
    bgt  a0, t0, recurse # n > 1이면 재귀
    addi a0, zero, 1     # return 1
    j    done

recurse:
    addi a0, a0, -1      # n - 1
    jal  ra, factorial   # 재귀 호출
    lw   t0, 8(sp)       # n 복원
    mul  a0, t0, a0      # n * factorial(n-1)

done:
    lw   ra, 12(sp)      # 반환 주소 복원
    addi sp, sp, 16      # 스택 프레임 해제
    jalr zero, 0(ra)     # 반환

3. ALU와 데이터패스

ALU 설계

ALU(Arithmetic Logic Unit)는 CPU의 핵심 연산 장치입니다. 기본 연산:

ADD:  결과 = A + B
SUB:  결과 = A + (~B + 1) = A - B  (2의 보수)
AND:  결과 = A AND B
OR:   결과 = A OR B
XOR:  결과 = A XOR B
SLT:  결과 = (A < B) ? 1 : 0  (Set Less Than)
SLL:  결과 = A << B (시프트)

리플 캐리 가산기 (Ripple Carry Adder):

FA0: S0 = A0 XOR B0 XOR Cin;  C1 = (A0 AND B0) OR ...
FA1: S1 = A1 XOR B1 XOR C1;  C2 = ...
...
FAn: Sn = An XOR Bn XOR Cn;  Cout = ...

N비트 리플 캐리 가산기의 지연: O(N). 64비트 가산에 64단계 게이트 지연 발생.

선행 올림 가산기 (Carry Lookahead Adder, CLA):

생성(Generate)과 전파(Propagate)를 정의합니다:

Gi = Ai AND Bi       (올림 생성)
Pi = Ai XOR Bi       (올림 전파)
Ci+1 = Gi OR (Pi AND Ci)

4비트 그룹 단위로 올림을 병렬 계산하면 O(log N) 지연으로 줄어듭니다.

단일 사이클 데이터패스

단일 사이클(Single-cycle) 구현에서는 모든 명령어가 한 클럭 사이클에 실행됩니다.

명령어 흐름 (R-형식 ADD):
1. IF: PC → 명령어 메모리 → 명령어 인출
2. ID: 레지스터 파일에서 rs1, rs2 읽기 + 제어 신호 생성
3. EX: ALU에서 rs1 + rs2 계산
4. MEM: (R-형식은 메모리 접근 없음)
5. WB: ALU 결과를 rd에 기록

단일 사이클의 문제: 가장 느린 명령어(예: 메모리 접근)가 클럭 주기를 결정 → 빠른 명령어도 느린 클럭에 맞춰야 함.

4. 파이프라이닝 (Pipelining)

파이프라이닝은 세탁기-건조기-다리미를 동시에 사용하는 것처럼, 여러 명령어의 서로 다른 단계를 겹쳐 실행하는 기법입니다.

5단계 파이프라인

단계   약어  역할
------+-----+----------------------------------------
인출   IF   PC로부터 명령어를 메모리에서 가져옴
해독   ID   명령어 해독, 레지스터 읽기, 제어 신호 생성
실행   EX   ALU 연산 수행
메모리 MEM  데이터 메모리 읽기/쓰기
기록   WB   레지스터 파일에 결과 저장

타이밍 다이어그램:

사이클:  1    2    3    4    5    6    7    8
Inst 1: IF   ID   EX  MEM   WB
Inst 2:      IF   ID   EX  MEM   WB
Inst 3:           IF   ID   EX  MEM   WB
Inst 4:                IF   ID   EX  MEM   WB

이상적으로 5단계 파이프라인은 단일 사이클 대비 최대 5배 처리량 향상을 제공합니다.

파이프라인 해저드 (Pipeline Hazards)

1. 구조적 해저드 (Structural Hazard)

동일한 하드웨어 자원을 두 명령어가 동시에 사용하려 할 때 발생합니다.

해결: 자원 분리(I-Cache와 D-Cache 분리), 레지스터 포트 추가

2. 데이터 해저드 (Data Hazard)

이전 명령어의 결과가 준비되기 전에 다음 명령어가 그 값을 필요로 할 때 발생합니다.

add t0, t1, t2    # t0 쓰기 (WB: 5사이클)
sub t3, t0, t4    # t0 읽기 (ID: 3사이클) → RAW 해저드!

종류:

RAW (Read After Write): 가장 흔함. 이전 명령어가 쓰기 전에 읽기 시도.
WAR (Write After Read): 순서 바뀐 실행(OoO)에서 발생.
WAW (Write After Write): 두 명령어가 같은 레지스터에 쓸 때.

해결 방법:

전방전달 (Forwarding/Bypassing): EX/MEM 단계 결과를 바로 다음 EX 단계 입력으로 전달.
스톨 (Stall/Bubble): 파이프라인을 멈추고 NOP(No Operation) 삽입. 성능 저하.
코드 재배치: 컴파일러가 독립적인 명령어를 사이에 삽입.

# 로드-유스 해저드 (1 사이클 스톨 불가피)
lw  t0, 0(a0)      # 메모리에서 로드
add t1, t0, t2     # t0 바로 사용 → MEM→EX 전방전달 불가
# 해결: 컴파일러가 독립 명령어 삽입
lw  t0, 0(a0)
add t3, t4, t5     # 독립 명령어 삽입
add t1, t0, t2     # 이제 t0 준비됨

3. 제어 해저드 (Control Hazard)

분기(branch) 명령어로 인해 다음에 실행할 명령어가 불확실할 때 발생합니다.

beq t0, t1, label  # 분기 여부 EX 단계에서 결정
# 그 동안 IF, ID에서 인출된 명령어들은?

해결 방법:

플러시 (Flush): 잘못 인출된 명령어 버리기 (2-3 사이클 페널티)
분기 예측 (Branch Prediction):
- 정적 예측: 항상 Not-Taken 또는 항상 Taken
- 동적 예측: 1비트/2비트 예측기, BTB(Branch Target Buffer)
- 현대 CPU는 95%+ 예측 정확도
지연 분기 (Delayed Branch): 분기 명령어 직후 슬롯에 항상 유용한 명령어 배치 (MIPS)

슈퍼스칼라 파이프라인

현대 CPU는 여러 파이프라인을 병렬로 운영합니다:

Intel Core: 6-way 비순서 실행 (Out-of-Order Execution)
AMD Zen 4:  4-way 디코드 + OoO 실행
ARM Cortex-X4: 5-way 디코드

비순서 실행 (Out-of-Order Execution):

명령어를 순서대로 인출/해독
준비된 명령어부터 실행 (토마술로 알고리즘)
결과는 순서대로 기록 (Reorder Buffer 사용)

5. 메모리 계층 구조

메모리 계층

레지스터 (Register)
  용량: ~1KB  |  속도: 1 사이클  |  비용: 매우 높음

L1 캐시 (On-chip)
  용량: 32-64KB  |  속도: 4-5 사이클  |  비용: 높음

L2 캐시 (On-chip)
  용량: 256KB-1MB  |  속도: 12-15 사이클  |  비용: 중간

L3 캐시 (On-chip, 공유)
  용량: 8-64MB  |  속도: 30-40 사이클  |  비용: 낮음

DRAM (Main Memory)
  용량: 8-256GB  |  속도: 200-300 사이클  |  비용: 매우 낮음

SSD/NVMe
  용량: 1-4TB  |  속도: 10,000+ 사이클  |  비용: 아주 낮음

지역성 원리 (Principle of Locality)

시간적 지역성 (Temporal Locality): 최근 접근한 데이터는 곧 다시 접근될 가능성이 높다. (루프 변수, 카운터)
공간적 지역성 (Spatial Locality): 접근한 데이터 근처의 데이터도 곧 접근될 가능성이 높다. (배열 순차 접근)

// 공간적 지역성 최적화 예시
// 나쁜 예: 열 우선 접근 (캐시 미스 빈번)
for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
        sum += A[i][j];  // A[0][0], A[1][0], A[2][0]... (행 건너뜀)

// 좋은 예: 행 우선 접근 (캐시 친화적)
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        sum += A[i][j];  // A[0][0], A[0][1], A[0][2]... (연속 메모리)

캐시 구조

Direct-mapped 캐시: 메모리의 각 블록이 캐시의 딱 한 곳에만 저장될 수 있습니다.

장점: 구현 간단, 빠른 접근
단점: 충돌 미스(Conflict Miss) 발생 가능

N-way Set-associative 캐시: 메모리 블록이 N개 슬롯 중 하나에 저장됩니다.

실제 CPU에서 가장 많이 사용 (L1: 4-way, L2: 8-way, L3: 16-way)

Fully-associative 캐시: 메모리 블록이 어느 슬롯에나 저장 가능.

장점: 충돌 미스 없음
단점: 탐색 비용 큼 (TLB에 사용)

캐시 주소 분해 (32비트 주소, 4KB 캐시, 64B 블록, 4-way):

[태그(20비트) | 인덱스(6비트) | 오프셋(6비트)]

캐시 교체 정책

LRU (Least Recently Used): 가장 오래 사용되지 않은 블록 교체. 성능 최고, 구현 복잡.
FIFO (First In First Out): 가장 먼저 들어온 블록 교체. 구현 간단.
Random: 무작위 교체. 하드웨어 구현 간단, LRU와 유사한 성능.

Write 정책

Write-through: 캐시와 메모리를 동시에 갱신. 일관성 보장, 대역폭 소비.
Write-back: 캐시만 갱신, 교체 시 메모리 기록(Dirty bit 사용). 성능 좋음, 일관성 복잡.

6. 가상 메모리

개요

가상 메모리는 각 프로세스에 독립적인 주소 공간을 제공하여 보호, 격리, 메모리 오버커밋을 가능하게 합니다.

가상 주소 (VA): 프로세스가 사용하는 주소 (0x0000 ~ 0xFFFFFFFF)
물리 주소 (PA): 실제 DRAM의 주소
페이지:         가상/물리 메모리의 고정 크기 블록 (보통 4KB)

페이지 테이블 변환

가상 주소 [VPN | Page Offset]
              ↓
         페이지 테이블 조회
              ↓
물리 주소 [PFN | Page Offset]

64비트 시스템에서는 4단계 페이지 테이블을 사용합니다 (x86-64):

[PML4(9비트) | PDPT(9비트) | PD(9비트) | PT(9비트) | Offset(12비트)]

TLB (Translation Lookaside Buffer)

페이지 테이블 조회는 메모리 접근을 추가로 필요로 합니다. TLB는 최근 변환 결과를 캐싱하는 완전 연관 캐시입니다.

TLB Hit:  가상 주소 → TLB 조회 → 물리 주소 (1-2 사이클)
TLB Miss: 가상 주소 → 페이지 테이블 워크 → 물리 주소 (100+ 사이클)

TLB 크기: 보통 64-1024 엔트리. 히트율 99%+ 유지.

페이지 교체 알고리즘

메모리가 가득 찼을 때 어떤 페이지를 디스크로 내보낼지 결정합니다.

Optimal: 미래에 가장 오랫동안 사용되지 않을 페이지 교체 (이론적 최적, 실제 불가)
LRU: 가장 오래 사용되지 않은 페이지 교체 (성능 좋음, 구현 비쌈)
Clock (FIFO + 기회): 실제 OS에서 많이 사용하는 LRU 근사

7. 입출력 시스템

I/O 제어 방법

폴링 (Polling): CPU가 장치 상태를 주기적으로 확인합니다.

장점: 구현 간단, 낮은 지연
단점: CPU 시간 낭비 (Busy-wait)

인터럽트 (Interrupt): 장치가 준비되면 CPU에 신호를 보냅니다.

장점: CPU가 다른 작업 수행 가능
단점: 인터럽트 처리 오버헤드

DMA (Direct Memory Access): DMA 컨트롤러가 CPU 없이 직접 메모리-장치 간 데이터 전송을 수행합니다.

대량 데이터 전송에 필수 (디스크, 네트워크, GPU)
CPU는 전송 시작/완료만 처리

버스 아키텍처

PCIe 5.0:   x16 슬롯 = 128 GB/s 양방향
NVMe (PCIe):최대 7 GB/s 순차 읽기 (Gen4)
USB 4.0:    최대 40 Gbps
DDR5-6400:  최대 51.2 GB/s (채널당)

8. 병렬 아키텍처

Flynn의 분류

분류	명령어 스트림	데이터 스트림	예시
SISD	단일	단일	단일 코어 CPU
SIMD	단일	다중	GPU, AVX 벡터 연산
MISD	다중	단일	결함 허용 시스템
MIMD	다중	다중	멀티코어 CPU, 클러스터

멀티코어와 캐시 일관성

여러 코어가 같은 데이터를 캐시에 갖고 있을 때 일관성을 유지해야 합니다.

MESI 프로토콜 (Modified, Exclusive, Shared, Invalid):

Modified (M): 이 코어만 수정된 최신 복사본 보유
Exclusive (E): 이 코어만 보유, 메모리와 동일
Shared (S):   여러 코어가 읽기 전용 복사본 보유
Invalid (I):  유효하지 않음 (다른 코어가 수정함)

상태 전환:

코어 A가 S 상태 데이터 쓰기 → A는 M으로 전환, 다른 코어는 I로 무효화
코어 B가 I 상태 데이터 읽기 → 버스 스누핑으로 A에서 전달

OpenMP 병렬 프로그래밍

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

// 병렬 배열 합산
int main() {
    int n = 1000000;
    long long sum = 0;
    int *arr = malloc(n * sizeof(int));

    for (int i = 0; i < n; i++)
        arr[i] = i + 1;

    // reduction으로 안전한 병렬 합산
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (int i = 0; i < n; i++) {
        sum += arr[i];
    }

    printf("Sum = %lld\n", sum);  // 500000500000

    // 병렬 for + 작업 분배
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        printf("Thread %d of %d\n", tid, nthreads);
    }

    free(arr);
    return 0;
}

// 행렬 곱셈 병렬화 (캐시 친화적 + OpenMP)
void matmul(float *A, float *B, float *C, int N) {
    #pragma omp parallel for collapse(2) schedule(dynamic, 64)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++) {
                sum += A[i*N + k] * B[k*N + j];
            }
            C[i*N + j] = sum;
        }
    }
}

NUMA 아키텍처

NUMA(Non-Uniform Memory Access)는 각 CPU 소켓이 로컬 메모리를 갖는 구조입니다.

소켓 0 (Core 0-15) ←→ Local DRAM (64GB)  [~80ns]
      ↕ QPI/Infinity Fabric
소켓 1 (Core 16-31) ←→ Local DRAM (64GB) [~80ns]

소켓 0에서 소켓 1의 메모리 접근: ~160ns (2배 지연)

numactl을 사용하면 프로세스를 특정 NUMA 노드에 고정할 수 있습니다.

9. GPU 아키텍처

GPU vs CPU 설계 철학

항목	CPU	GPU
코어 수	수십 개	수천-수만 개
코어 복잡도	매우 복잡 (OoO, 분기예측)	단순
캐시 크기	크고 복잡	작고 단순
설계 목표	단일 스레드 지연 최소화	처리량(Throughput) 최대화
용도	범용 직렬 연산	대규모 병렬 연산

SIMT 실행 모델

SIMT(Single Instruction, Multiple Thread)는 GPU의 핵심 실행 모델입니다.

워프(Warp): 32개 스레드의 묶음 (NVIDIA)
→ 32개 스레드가 동일한 명령어를 동시에 실행
→ 각 스레드는 서로 다른 데이터에 작용

워프 스케줄러: 지연 발생 시(메모리 접근) 즉시 다른 워프로 전환
→ 수천 개 워프를 빠르게 전환하여 지연을 숨김 (Latency Hiding)

NVIDIA GPU 계층 구조

GPU
├── GPC (Graphics Processing Cluster) x 8
│   └── SM (Streaming Multiprocessor) x 7-12
│       ├── CUDA Core x 128 (FP32 연산)
│       ├── Tensor Core x 4 (행렬 연산, AI)
│       ├── RT Core x 1 (레이 트레이싱)
│       ├── Warp Scheduler x 4
│       ├── Register File (256KB)
│       └── Shared Memory / L1 Cache (128-256KB)
└── L2 Cache (공유, 수십 MB)

NVIDIA H100 (Hopper, 2022):

132개 SM, 16,896개 CUDA Core
528개 Tensor Core (4세대, FP8 지원)
80GB HBM3 메모리, 3.35 TB/s 대역폭
4 PetaFLOPS FP8 Tensor Core 성능

CUDA 프로그래밍 기초

#include <cuda_runtime.h>
#include <stdio.h>

// GPU 커널: 벡터 덧셈
__global__ void vectorAdd(float *a, float *b, float *c, int n) {
    // 스레드 글로벌 인덱스 계산
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    int n = 1 << 20;  // 1M 원소
    size_t size = n * sizeof(float);

    // 호스트(CPU) 메모리
    float *h_a = (float*)malloc(size);
    float *h_b = (float*)malloc(size);
    float *h_c = (float*)malloc(size);

    // 디바이스(GPU) 메모리
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, size);
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_c, size);

    // 호스트 → 디바이스 복사
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    // 커널 실행: 블록당 256 스레드
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // 디바이스 → 호스트 복사
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

GPU 메모리 최적화

// 공유 메모리를 활용한 행렬 곱 최적화
#define TILE_SIZE 16

__global__ void matmulShared(float *A, float *B, float *C, int N) {
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE_SIZE; t++) {
        // 타일을 공유 메모리로 로드
        tileA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_SIZE + threadIdx.x];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        __syncthreads();  // 모든 스레드 로드 완료 대기

        for (int k = 0; k < TILE_SIZE; k++)
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();
    }

    if (row < N && col < N)
        C[row * N + col] = sum;
}

Tensor Core와 AI 가속

Tensor Core는 행렬 곱셈-누산(MMA: Matrix Multiply-Accumulate)을 하드웨어에서 가속합니다.

일반 CUDA Core: FP32 MAC 1개/사이클
Tensor Core(4세대): 16x16x16 행렬 곱 = 4096 FP16 MAC/사이클

NVIDIA cuBLAS, cuDNN, TensorRT는 Tensor Core를 자동으로 활용합니다.

10. 최신 아키텍처 트렌드

칩렛(Chiplet) 설계

단일 대형 다이(monolithic die) 대신 여러 작은 칩렛을 인터포저로 연결합니다.

AMD EPYC (Genoa): 12개 CCD(Core Complex Die, 5nm) + 1개 IOD(I/O Die, 6nm)
Intel Meteor Lake: CPU + GPU + SoC 타일 분리
장점: 수율 향상, 공정 최적화, 비용 절감
기술: TSMC CoWoS, Intel EMIB, UCIe 표준

HBM (High Bandwidth Memory)

DRAM을 3D로 적층하여 초고대역폭을 달성합니다.

HBM3E (2024): 9.6 Gbps/핀, 8-스택 = 1.2 TB/s (패키지당)
GDDR6X (RTX 4090): 21 Gbps/핀, 384비트 = 1 TB/s
DDR5-6400: 51.2 GB/s (채널당, CPU 기준)

NPU/TPU: AI 전용 가속기

Google TPU v5p: 459 TFLOPS BF16, Mesh 연결
Apple Neural Engine: iPhone 15 Pro, 35 TOPS INT8
Qualcomm Hexagon: 스마트폰 NPU, 75 TOPS
Intel Gaudi 3: 1835 TFLOPS BF16

RISC-V의 부상

오픈소스 ISA로 저전력 임베디드부터 데이터센터 서버까지 확장 중:

SiFive, StarFive, Alibaba T-Head
RISC-V International: 3,000+ 회원사
Linux 5.19 공식 지원, Android 포팅 완료

11. 성능 최적화 실전

캐시 친화적 프로그래밍

#include <time.h>
#include <stdio.h>
#include <stdlib.h>

#define N 4096

float A[N][N], B[N][N], C[N][N];

// 비효율적: 열 우선 접근 (캐시 미스 많음)
void matmul_naive() {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];  // B[k][j] 접근이 비연속
}

// 효율적: 블록 타일링 (캐시 재사용)
#define BLOCK 64
void matmul_tiled() {
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int kk = 0; kk < N; kk += BLOCK)
                for (int i = ii; i < ii+BLOCK && i < N; i++)
                    for (int j = jj; j < jj+BLOCK && j < N; j++)
                        for (int k = kk; k < kk+BLOCK && k < N; k++)
                            C[i][j] += A[i][k] * B[k][j];
}

SIMD 벡터화 (AVX2)

#include <immintrin.h>  // AVX2

// AVX2로 float 8개를 동시에 덧셈
void vector_add_avx2(float *a, float *b, float *c, int n) {
    int i;
    for (i = 0; i <= n - 8; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);   // 8x float 로드
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vc = _mm256_add_ps(va, vb);    // 8x 동시 덧셈
        _mm256_storeu_ps(&c[i], vc);          // 8x float 저장
    }
    // 나머지 원소 처리
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}

12. 퀴즈

Q1. 파이프라이닝에서 데이터 해저드(Data Hazard)가 발생하는 원인과 해결 방법은?

정답: 이전 명령어의 쓰기(Write)가 완료되기 전에 다음 명령어가 해당 레지스터를 읽으려 할 때(RAW: Read After Write) 발생합니다.

해결 방법:

전방전달(Forwarding): EX/MEM 단계 결과를 바로 다음 명령어 EX 입력으로 직접 전달하여 스톨 없이 해결.
스톨(Stall): NOP(버블)을 삽입하여 파이프라인을 일시 중단. 성능 저하.
코드 재배치: 컴파일러가 독립적인 명령어를 사이에 끼워 지연 없이 처리.
로드-유스 해저드의 경우 1사이클 스톨이 불가피합니다 (MEM 단계 결과를 EX 입력으로 전달 불가).

Q2. Direct-mapped 캐시와 4-way Set-associative 캐시의 차이점은?

정답: Direct-mapped 캐시는 메모리의 각 블록이 캐시의 딱 한 슬롯에만 매핑되는 반면, 4-way Set-associative는 동일한 인덱스를 가진 4개의 슬롯 중 하나에 매핑될 수 있습니다.

핵심 차이:

Direct-mapped: 충돌 미스(Conflict Miss) 발생 가능. 구현 단순, 비용 낮음.
4-way SA: 충돌 미스 감소. LRU 등 교체 정책 필요. 비용 증가.
실제 CPU L1 캐시는 보통 4-way 또는 8-way를 사용하며, 완전 연관(Fully-associative)은 TLB에만 사용됩니다.

Q3. TLB(Translation Lookaside Buffer)의 역할과 TLB 미스 시 처리 과정을 설명하시오.

정답: TLB는 가상 주소를 물리 주소로 변환하는 최근 결과를 캐싱하는 완전 연관 캐시입니다. 메모리에 있는 페이지 테이블 접근 오버헤드를 줄여줍니다.

TLB 미스 처리:

TLB에 해당 가상 페이지 번호(VPN)가 없음
하드웨어(x86) 또는 소프트웨어(RISC-V, MIPS) 페이지 테이블 워크 수행
다단계 페이지 테이블을 순서대로 조회 (PML4 → PDPT → PD → PT)
최종 물리 프레임 번호(PFN) 획득
TLB에 새 항목 추가 (기존 항목 교체 필요 시 LRU 적용)
원래 메모리 접근 재시도

전체 과정: 수백 사이클 소요 → TLB 히트율(99%+) 유지가 중요

Q4. GPU의 워프(Warp) 발산(Divergence)이란 무엇이며 성능에 어떤 영향을 주는가?

정답: 워프 내 32개 스레드가 서로 다른 분기(if-else)를 취할 때 워프 발산이 발생합니다.

동작 방식:

SIMT 모델에서 워프의 모든 스레드는 같은 명령어를 실행해야 합니다.
if 블록 실행 시: if를 취한 스레드만 활성화, else 스레드는 마스킹(비활성)
else 블록 실행 시: else를 취한 스레드만 활성화, if 스레드는 마스킹
두 경로를 직렬로 실행하므로 최악 2배 시간 소요

해결 방법:

워프 내 스레드들이 같은 분기를 취하도록 데이터 정렬
분기 대신 수학적 연산으로 대체 (조건부 선택)
쿠다 쉐이더의 분기 최소화 설계

Q5. Amdahl의 법칙을 이용하여, 전체 코드의 80%를 병렬화했을 때 최대 이론 속도 향상을 계산하시오.

정답: 무한히 많은 프로세서를 사용한다고 가정하면 최대 5배 속도 향상입니다.

계산:

병렬화 비율 f = 0.8 (80%)
순차 비율 = 1 - 0.8 = 0.2 (20%)
병렬화 배율 s를 무한대로 보내면: f/s → 0
Speedup = 1 / ((1 - f) + f/s) = 1 / (0.2 + 0) = 1 / 0.2 = 5배

의미: 아무리 많은 코어를 추가해도, 순차 실행 부분(20%)이 전체 속도를 5배로 제한합니다. 이것이 순차 병목을 제거하는 것이 병렬 최적화의 핵심인 이유입니다.

참고 자료

Patterson, D. & Hennessy, J. (2020). Computer Organization and Design: RISC-V Edition (2nd ed.). Morgan Kaufmann.
Hennessy, J. & Patterson, D. (2019). Computer Architecture: A Quantitative Approach (6th ed.). Morgan Kaufmann.
RISC-V International. (2024). The RISC-V Instruction Set Manual. https://riscv.org/technical/specifications/
NVIDIA. (2023). CUDA C++ Programming Guide. https://docs.nvidia.com/cuda/cuda-c-programming-guide/
ARM Ltd. (2024). ARM Architecture Reference Manual. https://developer.arm.com/documentation/
Intel. (2024). Intel 64 and IA-32 Architectures Software Developer's Manual.

Computer Architecture Complete Guide: From ISA to GPU Parallel Architecture

Introduction
1. Overview of Computer Architecture
2. Instruction Set Architecture (ISA)
3. ALU and Datapath
- ALU Design
- Single-Cycle Datapath
4. Pipelining
5. Memory Hierarchy
6. Virtual Memory
7. I/O Systems
- I/O Control Methods
- Bus Architecture
8. Parallel Architecture
9. GPU Architecture
10. Modern Architecture Trends
11. Performance Optimization in Practice
- Cache-Friendly Programming
- SIMD Vectorization with AVX2
12. Quiz
References

Introduction

Computer Architecture is the foundational discipline that bridges hardware and software. Whether you are an electronics engineer, computer science student, or systems programmer, a deep understanding of how processors work is essential for writing high-performance software, designing chips, and reasoning about system behavior.

This guide is structured around Patterson & Hennessy's Computer Organization and Design and Computer Architecture: A Quantitative Approach, covering everything from ISA design to modern GPU parallel architectures.

1. Overview of Computer Architecture

Von Neumann Architecture

The Von Neumann architecture, proposed by John von Neumann in 1945, forms the foundation of nearly all modern computers. Its defining characteristic is that program instructions and data share the same memory space.

Components:

CPU (Central Processing Unit): ALU + Control Unit + Registers
Memory: Stores both program code and data
I/O Devices: Keyboard, display, disk, network
Bus: Carries data, address, and control signals

Von Neumann Bottleneck: The bandwidth between the CPU and memory limits overall performance. The modern cache hierarchy was designed to mitigate this bottleneck.

Harvard Architecture

The Harvard architecture uses separate instruction and data memories, enabling simultaneous access to both. It is used in DSPs, microcontrollers (AVR, PIC), and in the L1 cache split (I-Cache / D-Cache) of modern processors.

Aspect	Von Neumann	Harvard
Memory	Unified	Separate
Bandwidth	Limited	Higher
Complexity	Lower	Higher
Use case	General-purpose CPU	DSP, microcontroller

Levels of Abstraction

Computer systems are organized in layers of abstraction:

Application Software
       ↓
Operating System
       ↓
ISA (Instruction Set Architecture)  ← Hardware/Software boundary
       ↓
Microarchitecture
       ↓
Logic Gates
       ↓
Transistors

The ISA is the contract between hardware and software. Any processor that implements the same ISA can run the same software, regardless of its internal microarchitecture.

Performance Metrics

CPU Execution Time:

CPU Time = Instruction Count × CPI × Clock Cycle Time
         = Instruction Count × CPI / Clock Rate

CPI (Cycles Per Instruction): Average clock cycles per instruction
Clock Rate: Measured in Hz (e.g., 3.5 GHz)
MIPS (Millions of Instructions Per Second): Throughput in instructions
FLOPS (Floating Point Operations Per Second): Floating-point throughput

Amdahl's Law (Limits of Parallel Speedup):

Speedup = 1 / ((1 - f) + f/s)

Where f is the fraction of work that can be parallelized, and s is the speedup of the parallelized part. If only 40% of a program can be parallelized, the theoretical maximum speedup is 1.67x regardless of core count.

2. Instruction Set Architecture (ISA)

RISC vs CISC

RISC (Reduced Instruction Set Computer):

Simple, fixed-length instructions
Register-centric operations (Load/Store architecture)
Favors hardware pipelining
Examples: ARM, RISC-V, MIPS, PowerPC

CISC (Complex Instruction Set Computer):

Complex, variable-length instructions
Can operate directly on memory
Higher code density (more work per instruction)
Examples: x86-64, VAX

Modern x86-64 processors internally decode CISC instructions into RISC-like micro-ops, blurring the original distinction.

Instruction Formats (RISC-V)

RISC-V defines six instruction formats:

R-type (Register operations):
[funct7 | rs2 | rs1 | funct3 | rd | opcode]
  7-bit   5-bit 5-bit  3-bit  5-bit  7-bit

I-type (Immediate / Load):
[imm[11:0] | rs1 | funct3 | rd | opcode]
  12-bit    5-bit   3-bit  5-bit  7-bit

S-type (Store):
[imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode]

B-type (Branch):
[imm[12|10:5] | rs2 | rs1 | funct3 | imm[4:1|11] | opcode]

U-type (Upper immediate):
[imm[31:12] | rd | opcode]

J-type (JAL):
[imm[20|10:1|11|19:12] | rd | opcode]

RISC-V Assembly Examples

RISC-V is an open-source ISA developed at UC Berkeley, growing rapidly in both education and industry.

# RISC-V Assembly Examples
# Basic arithmetic
add  a0, a0, a1      # a0 = a0 + a1
sub  t0, t1, t2      # t0 = t1 - t2
addi t0, t0, 10      # t0 = t0 + 10 (immediate)
mul  a0, a1, a2      # a0 = a1 * a2

# Memory access (Load/Store)
lw   t0, 0(a0)       # t0 = Memory[a0 + 0]  (word load)
lh   t1, 2(a0)       # t1 = Memory[a0 + 2]  (halfword load)
lb   t2, 1(a0)       # t2 = Memory[a0 + 1]  (byte load)
sw   t0, 4(a0)       # Memory[a0 + 4] = t0  (word store)

# Logical operations
and  t0, t1, t2      # t0 = t1 & t2
or   t0, t1, t2      # t0 = t1 | t2
xor  t0, t1, t2      # t0 = t1 ^ t2
sll  t0, t1, t2      # t0 = t1 << t2 (logical left shift)
srl  t0, t1, t2      # t0 = t1 >> t2 (logical right shift)

# Branches and jumps
beq  t0, t1, label   # if t0 == t1, jump to label
bne  t0, t1, label   # if t0 != t1, jump to label
blt  t0, t1, label   # if t0 < t1,  jump to label
bge  t0, t1, label   # if t0 >= t1, jump to label
jal  ra, func        # ra = PC+4; jump to func
jalr ra, 0(t0)       # ra = PC+4; jump to address in t0

# RISC-V Register Conventions
# x0  (zero): hardwired zero
# x1  (ra):   return address
# x2  (sp):   stack pointer
# x10-x17 (a0-a7): function arguments / return values
# x5-x7, x28-x31 (t0-t6): temporaries
# x8-x9, x18-x27 (s0-s11): saved registers

Translating C to RISC-V Assembly

// C code
int factorial(int n) {
    if (n <= 1) return 1;
    return n * factorial(n - 1);
}

# RISC-V Assembly (recursive factorial)
factorial:
    addi sp, sp, -16     # Allocate stack frame
    sw   ra, 12(sp)      # Save return address
    sw   a0, 8(sp)       # Save n

    addi t0, zero, 1
    bgt  a0, t0, recurse # if n > 1, recurse
    addi a0, zero, 1     # return 1
    j    done

recurse:
    addi a0, a0, -1      # n - 1
    jal  ra, factorial   # recursive call
    lw   t0, 8(sp)       # restore n
    mul  a0, t0, a0      # n * factorial(n-1)

done:
    lw   ra, 12(sp)      # restore return address
    addi sp, sp, 16      # free stack frame
    jalr zero, 0(ra)     # return

3. ALU and Datapath

ALU Design

The ALU (Arithmetic Logic Unit) is the computational core of the CPU:

ADD:  result = A + B
SUB:  result = A + (~B + 1) = A - B  (two's complement)
AND:  result = A AND B
OR:   result = A OR B
XOR:  result = A XOR B
SLT:  result = (A < B) ? 1 : 0  (Set Less Than)
SLL:  result = A << B

Ripple Carry Adder:

FA0: S0 = A0 XOR B0 XOR Cin;  C1 = carry out
FA1: S1 = A1 XOR B1 XOR C1;  C2 = carry out
...

For an N-bit ripple carry adder, delay is O(N). A 64-bit addition requires 64 gate delays.

Carry Lookahead Adder (CLA):

Define Generate and Propagate:

Gi = Ai AND Bi       (carry generate)
Pi = Ai XOR Bi       (carry propagate)
Ci+1 = Gi OR (Pi AND Ci)

By computing carries for 4-bit groups in parallel, total delay drops to O(log N).

Single-Cycle Datapath

In a single-cycle implementation, every instruction completes in exactly one clock cycle.

Instruction flow (R-type ADD):
1. IF:  PC → instruction memory → fetch instruction
2. ID:  Read rs1, rs2 from register file; generate control signals
3. EX:  ALU computes rs1 + rs2
4. MEM: (R-type has no memory access)
5. WB:  Write ALU result to rd

Problem: The slowest instruction (e.g., memory load) determines the clock period. Fast instructions are penalized by the slow clock.

4. Pipelining

Pipelining overlaps the execution of multiple instructions, just like running a washer, dryer, and iron simultaneously for a load of laundry.

Five-Stage Pipeline

Stage  Abbr  Function
------+------+--------------------------------------------
Fetch  IF    Fetch instruction from memory using PC
Decode ID    Decode instruction, read registers, gen control
Execute EX   Perform ALU operation
Memory MEM   Read/write data memory
Writeback WB Write result back to register file

Timing diagram:

Cycle:  1    2    3    4    5    6    7    8
Inst 1: IF   ID   EX  MEM   WB
Inst 2:      IF   ID   EX  MEM   WB
Inst 3:           IF   ID   EX  MEM   WB
Inst 4:                IF   ID   EX  MEM   WB

Ideally, a 5-stage pipeline provides up to 5x throughput improvement over single-cycle.

Pipeline Hazards

1. Structural Hazard

Two instructions require the same hardware resource at the same time.

Solution: Duplicate resources (separate I-Cache and D-Cache), add register file ports.

2. Data Hazard

An instruction needs a result that a previous instruction has not yet produced.

add t0, t1, t2    # write t0 (WB: cycle 5)
sub t3, t0, t4    # read t0  (ID: cycle 3) → RAW hazard!

Types:

RAW (Read After Write): Most common. Next instruction reads before previous writes.
WAR (Write After Read): Occurs in out-of-order execution.
WAW (Write After Write): Two instructions write the same register.

Solutions:

Forwarding (Bypassing): Route EX/MEM stage result directly to next EX stage input — no stall.
Stall (Bubble): Insert NOP cycles. Reduces performance.
Code Reordering: Compiler inserts independent instructions between dependent ones.

# Load-use hazard: 1 stall cycle unavoidable
lw  t0, 0(a0)      # load from memory
add t1, t0, t2     # use t0 immediately → MEM→EX forwarding impossible
# Solution: compiler inserts an independent instruction
lw  t0, 0(a0)
add t3, t4, t5     # independent instruction fills the slot
add t1, t0, t2     # t0 is now ready

3. Control Hazard

A branch instruction creates uncertainty about which instruction to fetch next.

beq t0, t1, label  # branch decision made in EX stage
# Instructions already fetched in IF/ID may be wrong

Solutions:

Flush: Discard incorrectly fetched instructions (2-3 cycle penalty).
Branch Prediction:
- Static: always Not-Taken or always Taken
- Dynamic: 1-bit/2-bit predictors, Branch Target Buffer (BTB)
- Modern CPUs achieve 95%+ prediction accuracy
Delayed Branch: Always execute the instruction after the branch (MIPS approach).

Superscalar Pipelines

Modern CPUs run multiple pipelines in parallel:

Intel Core: 6-way out-of-order execution (OoO)
AMD Zen 4:  4-way decode + OoO execution
ARM Cortex-X4: 5-way decode

Out-of-Order Execution:

Instructions fetched and decoded in order
Instructions dispatched as soon as their operands are ready (Tomasulo's algorithm)
Results committed in original program order (Reorder Buffer)

5. Memory Hierarchy

The Memory Hierarchy

Registers
  Size: ~1KB  |  Latency: 1 cycle       |  Cost: very high

L1 Cache (on-chip)
  Size: 32-64KB  |  Latency: 4-5 cycles  |  Cost: high

L2 Cache (on-chip)
  Size: 256KB-1MB  |  Latency: 12-15 cycles  |  Cost: medium

L3 Cache (on-chip, shared)
  Size: 8-64MB  |  Latency: 30-40 cycles  |  Cost: low

DRAM (Main Memory)
  Size: 8-256GB  |  Latency: 200-300 cycles  |  Cost: very low

SSD / NVMe
  Size: 1-4TB  |  Latency: 10,000+ cycles  |  Cost: extremely low

Principle of Locality

Temporal Locality: Recently accessed data will likely be accessed again soon. (Loop variables, counters)
Spatial Locality: Data near recently accessed locations will likely be accessed soon. (Array traversal)

// Spatial locality optimization
// Bad: column-major access (frequent cache misses)
for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
        sum += A[i][j];  // A[0][0], A[1][0], A[2][0]... (non-contiguous)

// Good: row-major access (cache-friendly)
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        sum += A[i][j];  // A[0][0], A[0][1], A[0][2]... (contiguous memory)

Cache Organization

Direct-Mapped Cache: Each memory block maps to exactly one cache slot.

Pro: Simple hardware, fast lookup
Con: Conflict misses (two blocks competing for the same slot)

N-way Set-Associative Cache: Each memory block can go into any of N slots in a set.

Most common in real CPUs (L1: 4-way, L2: 8-way, L3: 16-way)

Fully Associative Cache: A block can go anywhere in the cache.

Pro: No conflict misses
Con: Expensive parallel lookup (used for TLBs)

Cache address breakdown (32-bit address, 4KB cache, 64B block, 4-way):

[Tag (20 bits) | Index (6 bits) | Offset (6 bits)]

Replacement Policies

LRU (Least Recently Used): Evict the block unused for the longest time. Best performance, complex hardware.
FIFO (First In First Out): Evict the oldest-inserted block. Simple.
Random: Evict a random block. Simple hardware, performance close to LRU in practice.

Write Policies

Write-through: Update both cache and memory on every write. Simple, high bandwidth usage.
Write-back: Update cache only; write to memory when the block is evicted (Dirty bit). Better performance, more complex coherence.

6. Virtual Memory

Overview

Virtual memory gives each process its own address space, providing isolation, protection, and the ability to overcommit physical memory.

Virtual Address (VA): Address used by the process (0x0000 to 0xFFFFFFFF)
Physical Address (PA): Actual DRAM address
Page: Fixed-size block of virtual/physical memory (typically 4KB)

Page Table Translation

Virtual Address [VPN | Page Offset]
                    ↓
             Page table lookup
                    ↓
Physical Address [PFN | Page Offset]

64-bit systems use a 4-level page table (x86-64):

[PML4 (9 bits) | PDPT (9 bits) | PD (9 bits) | PT (9 bits) | Offset (12 bits)]

TLB (Translation Lookaside Buffer)

Every page table lookup requires memory accesses. The TLB is a fully associative cache for recent virtual-to-physical translations.

TLB Hit:  VA → TLB lookup → PA  (1-2 cycles)
TLB Miss: VA → page table walk → PA  (100+ cycles)

Typical TLB size: 64-1024 entries. Must maintain 99%+ hit rate for good performance.

Page Replacement Algorithms

When physical memory is full, the OS must evict a page to disk.

Optimal: Evict the page that won't be used for the longest time (theoretical best, not implementable)
LRU: Evict the least recently used page (good performance, expensive hardware)
Clock (FIFO + second chance): LRU approximation used by most OS kernels

7. I/O Systems

I/O Control Methods

Polling: The CPU repeatedly checks device status registers.

Pro: Simple, low latency
Con: Wastes CPU cycles (busy-wait)

Interrupts: The device signals the CPU when ready.

Pro: CPU free to do other work
Con: Interrupt handling overhead (context save/restore)

DMA (Direct Memory Access): A DMA controller transfers data between memory and device without CPU involvement.

Essential for high-throughput transfers (disk, network, GPU)
CPU only initiates and acknowledges completion

Bus Architecture

PCIe 5.0:   x16 = 128 GB/s bidirectional
NVMe (PCIe): up to 7 GB/s sequential read (Gen4)
USB 4.0:    up to 40 Gbps
DDR5-6400:  51.2 GB/s per channel

8. Parallel Architecture

Flynn's Taxonomy

Class	Instruction streams	Data streams	Examples
SISD	Single	Single	Single-core CPU
SIMD	Single	Multiple	GPU, AVX vector ops
MISD	Multiple	Single	Fault-tolerant systems
MIMD	Multiple	Multiple	Multi-core CPU, clusters

Multicore and Cache Coherence

When multiple cores hold copies of the same cache line, coherence must be maintained.

MESI Protocol (Modified, Exclusive, Shared, Invalid):

Modified (M):  This core holds the only (modified) copy
Exclusive (E): This core holds the only copy, matches memory
Shared (S):    Multiple cores hold read-only copies
Invalid (I):   This copy is stale (another core modified it)

State transitions:

Core A writes to a Shared line → A transitions to M, all others go to I (invalidation)
Core B reads an Invalid line → bus snooping causes A to supply the line

OpenMP Parallel Programming

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

// Parallel array summation
int main() {
    int n = 1000000;
    long long sum = 0;
    int *arr = malloc(n * sizeof(int));

    for (int i = 0; i < n; i++)
        arr[i] = i + 1;

    // Reduction clause ensures thread-safe accumulation
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (int i = 0; i < n; i++) {
        sum += arr[i];
    }

    printf("Sum = %lld\n", sum);  // Expected: 500000500000

    // Print thread IDs
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        printf("Thread %d of %d\n", tid, nthreads);
    }

    free(arr);
    return 0;
}

// Cache-friendly parallel matrix multiplication
void matmul(float *A, float *B, float *C, int N) {
    #pragma omp parallel for collapse(2) schedule(dynamic, 64)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++) {
                sum += A[i*N + k] * B[k*N + j];
            }
            C[i*N + j] = sum;
        }
    }
}

NUMA Architecture

NUMA (Non-Uniform Memory Access) assigns local memory to each CPU socket.

Socket 0 (Core 0-15)  ←→ Local DRAM (64GB)  [~80ns]
      ↕ QPI / Infinity Fabric
Socket 1 (Core 16-31) ←→ Local DRAM (64GB)  [~80ns]

Socket 0 accessing Socket 1 memory: ~160ns (2x latency)

Use numactl to pin processes to specific NUMA nodes for best performance.

9. GPU Architecture

GPU vs CPU Design Philosophy

Aspect	CPU	GPU
Core count	Dozens	Thousands to tens of thousands
Core complexity	Very high (OoO, branch prediction)	Simple
Cache size	Large, multi-level	Small, minimal
Design goal	Minimize single-thread latency	Maximize throughput
Use case	General-purpose serial work	Massively parallel computation

SIMT Execution Model

SIMT (Single Instruction, Multiple Thread) is the GPU's core execution model.

Warp: 32 threads executing the same instruction (NVIDIA)
→ All 32 threads execute the same instruction simultaneously
→ Each thread operates on different data

Warp scheduler: On memory stall, immediately switch to another warp
→ Thousands of warps rotate rapidly, hiding memory latency

NVIDIA GPU Hierarchy

GPU
├── GPC (Graphics Processing Cluster) x 8
│   └── SM (Streaming Multiprocessor) x 7-12
│       ├── CUDA Core x 128 (FP32 operations)
│       ├── Tensor Core x 4 (matrix multiply, AI)
│       ├── RT Core x 1 (ray tracing)
│       ├── Warp Scheduler x 4
│       ├── Register File (256KB)
│       └── Shared Memory / L1 Cache (128-256KB)
└── L2 Cache (shared, tens of MB)

NVIDIA H100 (Hopper, 2022):

132 SMs, 16,896 CUDA Cores
528 Tensor Cores (4th generation, FP8 support)
80GB HBM3, 3.35 TB/s memory bandwidth
4 PetaFLOPS FP8 Tensor Core performance

CUDA Programming Basics

#include <cuda_runtime.h>
#include <stdio.h>

// GPU kernel: vector addition
__global__ void vectorAdd(float *a, float *b, float *c, int n) {
    // Global thread index
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    int n = 1 << 20;  // 1M elements
    size_t size = n * sizeof(float);

    // Host (CPU) memory
    float *h_a = (float*)malloc(size);
    float *h_b = (float*)malloc(size);
    float *h_c = (float*)malloc(size);

    // Device (GPU) memory
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, size);
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_c, size);

    // Host → Device transfer
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    // Launch kernel: 256 threads per block
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // Device → Host transfer
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

GPU Memory Optimization with Shared Memory

// Tiled matrix multiplication using shared memory
#define TILE_SIZE 16

__global__ void matmulShared(float *A, float *B, float *C, int N) {
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE_SIZE; t++) {
        // Load tiles into shared memory
        tileA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_SIZE + threadIdx.x];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        __syncthreads();  // Wait for all threads to finish loading

        for (int k = 0; k < TILE_SIZE; k++)
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();
    }

    if (row < N && col < N)
        C[row * N + col] = sum;
}

Tensor Cores and AI Acceleration

Tensor Cores accelerate the matrix multiply-accumulate (MMA) operation in hardware.

Regular CUDA Core: 1 FP32 MAC per cycle
4th-gen Tensor Core: 16x16x16 matrix multiply = 4096 FP16 MACs per cycle

NVIDIA cuBLAS, cuDNN, and TensorRT automatically leverage Tensor Cores when the data sizes align.

10. Modern Architecture Trends

Chiplet Design

Instead of one large monolithic die, chiplets connect multiple smaller dies on an interposer.

AMD EPYC (Genoa): 12 CCDs (Core Complex Die, 5nm) + 1 IOD (I/O Die, 6nm)
Intel Meteor Lake: separate CPU, GPU, SoC tiles
Benefits: Higher yield, process optimization per function, cost reduction
Technologies: TSMC CoWoS, Intel EMIB, UCIe standard

HBM (High Bandwidth Memory)

DRAM dies are stacked in 3D to achieve extreme memory bandwidth.

HBM3E (2024): 9.6 Gbps/pin, 8-stack = 1.2 TB/s per package
GDDR6X (RTX 4090): 21 Gbps/pin, 384-bit = 1 TB/s
DDR5-6400: 51.2 GB/s per channel (CPU)

NPU/TPU: AI-Dedicated Accelerators

Google TPU v5p: 459 TFLOPS BF16, mesh interconnect
Apple Neural Engine: iPhone 15 Pro, 35 TOPS INT8
Qualcomm Hexagon: Smartphone NPU, 75 TOPS
Intel Gaudi 3: 1835 TFLOPS BF16

The Rise of RISC-V

Open-source ISA expanding from low-power embedded to data-center servers:

Vendors: SiFive, StarFive, Alibaba T-Head
RISC-V International: 3,000+ member organizations
Official Linux 5.19 support, Android port complete

11. Performance Optimization in Practice

Cache-Friendly Programming

#include <stdlib.h>

#define N 4096

float A[N][N], B[N][N], C[N][N];

// Inefficient: column-major traversal (many cache misses)
void matmul_naive() {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];  // B[k][j] strides across rows
}

// Efficient: loop tiling (reuses cache blocks)
#define BLOCK 64
void matmul_tiled() {
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int kk = 0; kk < N; kk += BLOCK)
                for (int i = ii; i < ii+BLOCK && i < N; i++)
                    for (int j = jj; j < jj+BLOCK && j < N; j++)
                        for (int k = kk; k < kk+BLOCK && k < N; k++)
                            C[i][j] += A[i][k] * B[k][j];
}

SIMD Vectorization with AVX2

#include <immintrin.h>  // AVX2 intrinsics

// Process 8 floats simultaneously with AVX2
void vector_add_avx2(float *a, float *b, float *c, int n) {
    int i;
    for (i = 0; i <= n - 8; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);   // Load 8 floats
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vc = _mm256_add_ps(va, vb);    // Add 8 floats at once
        _mm256_storeu_ps(&c[i], vc);          // Store 8 floats
    }
    // Handle remaining elements
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}

12. Quiz

Q1. What causes a Data Hazard in a pipeline, and how can it be resolved?

Answer: A data hazard occurs when an instruction needs the result of a previous instruction that has not yet completed — specifically the RAW (Read After Write) hazard where the pipeline tries to read a register before a prior write has committed.

Solutions:

Forwarding (Bypassing): Route the EX/MEM stage output directly to the next EX stage input. Eliminates the stall in most cases.
Stall (Pipeline Bubble): Insert NOP cycles to wait for the result. Reduces throughput.
Code Reordering: Compiler inserts independent instructions between dependent pairs, hiding the latency.
Load-use hazard is a special case where one stall is unavoidable (MEM result cannot be forwarded to EX in time).

Q2. What is the difference between a Direct-Mapped cache and a 4-way Set-Associative cache?

Answer: In a direct-mapped cache, each memory block has exactly one possible location in the cache. In a 4-way set-associative cache, each block can be placed in any of 4 slots within its set.

Key differences:

Direct-mapped: susceptible to conflict misses (two frequently used blocks competing for one slot). Hardware is simple and fast.
4-way SA: conflict misses greatly reduced. Requires a replacement policy (LRU). More hardware cost.
Real CPUs typically use 4-way or 8-way for L1, 16-way for L3. Fully associative is reserved for TLBs.

Q3. Explain the role of the TLB and the process that happens on a TLB miss.

Answer: The TLB (Translation Lookaside Buffer) is a fully associative cache that stores recent virtual-to-physical address translations, avoiding expensive page table walks on every memory access.

TLB miss handling:

The VPN (Virtual Page Number) is not found in the TLB
Hardware (x86) or software (RISC-V, MIPS) performs a page table walk
Multi-level page tables are traversed in order: PML4 → PDPT → PD → PT
The PFN (Physical Frame Number) is retrieved
A new TLB entry is inserted, evicting an old one via LRU if necessary
The original memory access is retried

Full miss handling takes hundreds of cycles — maintaining a 99%+ hit rate is critical.

Q4. What is warp divergence in a GPU and how does it affect performance?

Answer: Warp divergence occurs when threads within a warp (group of 32) take different branches of an if-else statement.

How it works:

In the SIMT model, all threads in a warp must execute the same instruction.
When an if-else diverges: the if-branch executes first with non-if threads masked off, then the else-branch executes with non-else threads masked off.
Both paths are executed serially, up to 2x slower in the worst case.

Mitigations:

Arrange data so threads in the same warp always take the same branch
Replace conditional logic with arithmetic (e.g., branchless select)
Minimize conditional code in kernel hot paths

Q5. Using Amdahl's Law, calculate the maximum theoretical speedup when 80% of a program is parallelized.

Answer: With an unlimited number of processors, the maximum speedup is 5x.

Calculation:

Parallelizable fraction: f = 0.8
Sequential fraction: 1 - 0.8 = 0.2
As s (parallel speedup) approaches infinity: f/s approaches 0
Speedup = 1 / ((1 - f) + f/s) = 1 / (0.2 + 0) = 1 / 0.2 = 5x

Implication: No matter how many cores are added, the 20% sequential portion caps the overall speedup at 5x. This is why eliminating sequential bottlenecks is more valuable than simply adding more parallelism.

References

Patterson, D. & Hennessy, J. (2020). Computer Organization and Design: RISC-V Edition (2nd ed.). Morgan Kaufmann.
Hennessy, J. & Patterson, D. (2019). Computer Architecture: A Quantitative Approach (6th ed.). Morgan Kaufmann.
RISC-V International. (2024). The RISC-V Instruction Set Manual. https://riscv.org/technical/specifications/
NVIDIA. (2023). CUDA C++ Programming Guide. https://docs.nvidia.com/cuda/cuda-c-programming-guide/
ARM Ltd. (2024). ARM Architecture Reference Manual. https://developer.arm.com/documentation/
Intel. (2024). Intel 64 and IA-32 Architectures Software Developer's Manual.