Modern Computer Architecture — CPU Pipelines, Out-of-Order, Caches, Branch Prediction, Meltdown, Apple Silicon, ARM, RISC-V, SIMD, GPU Deep Dive (2025)
Why a 2026 Backend Engineer Must Know CPU Internals

You wrote Rust but it was sometimes slower than C. You reached for HashMap but a linear scan over Vec<(K,V)> was faster. A single if made your code 10x slower. M1 outran same-generation Intel. All those answers become obvious once you understand what happens inside the CPU. Software runs on hardware. The CPU does not execute a + b in the order you wrote it — pipelines, caches, branch predictors, and memory controllers silently intervene. This post opens that box.

Since Meltdown/Spectre (2018), "the CPU is no longer a black box" is industry consensus. Apple Silicon M1 (2020) proved ARM could decisively beat x86. AWS Graviton4, NVIDIA Grace, Ampere Altra made 30% cloud savings real. RISC-V is silicon-shipping stage at Western Digital, SiFive, Tenstorrent, Meta. GPUs moved past H100/H200 to B100/B200 Blackwell (2024) and GB300 (2025), dominating the LLM economy. No software engineer can sidestep this anymore.

This post extends the earlier Rust Deep Dive. The reason Rust can claim zero-cost abstraction is that it compiles to CPU-friendly code — and that "CPU-friendly" is what we dissect here.

Part 1. From Transistors to Instructions — What Your if Actually Does

1.1 Von Neumann to Modern CPU in 30 Seconds

The 1945 Von Neumann architecture is logically unchanged: code and data share memory, CPU reads and executes sequentially. But the physical implementation has been revolutionized.

  • 1971 Intel 4004 — 4-bit, 2,300 transistors, 740kHz
  • 1985 Intel 386 — 32-bit, 275K transistors, 16MHz
  • 1993 Pentium — superscalar (2 instructions per cycle), 5-stage pipeline
  • 1995 Pentium Pro — Out-of-Order, speculative execution, register renaming
  • 2006 Core 2 — Intel hit the power wall, switched to parallelism
  • 2011 Sandy Bridge — integrated memory controller, AVX
  • 2020 Apple M1 — 5nm, 16B transistors, Unified Memory Architecture (UMA), class-leading IPC
  • 2024 Intel Lunar Lake / AMD Zen 5 / Apple M4 — hybrid Performance-core + Efficiency-core

A chip now holds 200 billion+ transistors, yet we still write a + b. Modern CPU design is the bridge.

1.2 5 Stages of Execution (Actually 20)

The textbook RISC pipeline:

  1. Fetch — read instruction from memory
  2. Decode — interpret the opcode
  3. Execute — actual computation
  4. Memory Access — read/write if needed
  5. Write Back — store result to register

Modern x86 (Golden Cove, Zen 4) splits this into 14 to 20 stages so each stage completes faster and clock can go higher. Longer pipelines amplify the cost of branch misprediction.

1.3 Superscalar — 4 to 8 Instructions per Cycle

Single-instruction pipelines ended in the 1990s. Today multiple instructions issue per clock.

  • Apple M3/M4 P-core: 10-wide decode, 9-wide issue
  • Intel Golden Cove: 6-wide decode, 5-wide issue
  • AMD Zen 5: 8-wide decode, 6-wide issue

Why it matters: if your IPC (Instructions Per Cycle) hits 3 to 4, the CPU is running many ops in parallel. If IPC is 0.5, something is killing the pipeline — branch miss, cache miss, or dependency chain.

Part 2. Out-of-Order Execution — The CPU Reorders Your Code

2.1 Why Reorder

int a = load_from_memory(&x);  // 100-cycle miss
int b = 2 + 2;                  // 1 cycle

In-order CPU blocks on line 1 before starting line 2. Out-of-Order CPU runs b first, then merges when a arrives. If instructions have no dependency, they execute in any order.
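A minimal Rust sketch of handing the OoO engine independent work (illustrative, not from the post's benchmarks): one accumulator forms a single dependency chain where every add waits on the last, while four accumulators give the scheduler four chains to interleave. Floating-point addition is not associative, so the compiler will not do this reassociation for you.

// one chain: each add depends on the previous result
fn sum_serial(xs: &[f64]) -> f64 {
    xs.iter().sum()
}

// four chains: the OoO core overlaps the independent adds
fn sum_four_chains(xs: &[f64]) -> f64 {
    let mut acc = [0.0f64; 4];
    let chunks = xs.chunks_exact(4);
    let rest: f64 = chunks.remainder().iter().sum();
    for c in chunks {
        acc[0] += c[0];
        acc[1] += c[1];
        acc[2] += c[2];
        acc[3] += c[3];
    }
    acc.iter().sum::<f64>() + rest
}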

2.2 ROB, RS, Register Renaming

  • ROB (Re-Order Buffer) — out-of-order execution, in-order commit
  • RS (Reservation Station) — dispatch to execution unit as soon as operands are ready
  • Register Renaming — architectural registers (16) map to physical registers (hundreds)

The RAX in your code is actually one of 192 physical registers, so false dependencies vanish. Apple M3 ROB has ~700 entries, Zen 5 ~480. Larger ROB means further-out reordering.

2.3 Speculative Execution — Living the Future First

The predictor picks one side of the branch and the CPU executes it before the outcome is known. Right: free win. Wrong: roll back and replay. Careless design here produced Spectre/Meltdown in 2018 (Part 4).

Part 3. Cache Hierarchy — Why Arrays Beat Linked Lists

3.1 Memory Hierarchy Reality

Approximate 2025 x86 P-core numbers:

| Tier | Capacity | Latency | Throughput |
|---|---|---|---|
| Registers | ~400 B | 0 cycles | per-op |
| L1d | 48 to 80 KB | 4 to 5 cycles | 3 to 4 loads/cycle |
| L2 | 2 to 3 MB | 12 to 15 cycles | 1 load/cycle |
| L3 | 32 to 96 MB | 40 to 50 cycles | shared |
| DRAM | 32 to 512 GB | 200 to 300 cycles | 50 to 100 GB/s |
| NVMe | TB-scale | 10 to 100 us | 7 to 14 GB/s |
| Network | infinity | 100 us+ | 10 to 100 Gb/s |

Register and DRAM differ by ~100x. Cache bridges the gap.

3.2 Cache Line — 64 Bytes Is the Unit

Read 1 byte, get 64 (Apple Silicon: 128). This is the top reason arrays beat linked lists.

  • int[1000] traversal: first touch loads 16 ints; next 15 hit cache
  • linked list: nodes scattered in memory, cache miss per hop

5 cycles vs 100 cycles — 20x.
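The two traversals in Rust, as a sketch (types are illustrative): the slice walk streams through consecutive lines the prefetcher can predict; each list hop is a dependent load whose address is unknown until the previous node arrives.

struct Node { val: i32, next: Option<Box<Node>> }

// sequential: 16 ints per 64-byte line, prefetcher stays ahead
fn sum_slice(v: &[i32]) -> i64 {
    v.iter().map(|&x| x as i64).sum()
}

// pointer chase: potentially one cache miss per node
fn sum_list(mut cur: Option<&Node>) -> i64 {
    let mut s = 0;
    while let Some(n) = cur {
        s += n.val as i64;
        cur = n.next.as_deref();
    }
    s
}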

3.3 Prefetch — The CPU Prepares the Future

Hardware prefetchers detect access patterns and pull lines into L1/L2 early. Linear stride (a[0], a[1], a[2]...) is perfectly detected; pointer chasing fails. Software-side: __builtin_prefetch(ptr) (GCC/Clang).
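When the pattern is data-dependent, say a gather through an index array, the hardware prefetcher cannot see ahead, but you can hint future lines yourself. A hedged x86_64-only sketch using Rust's std::arch prefetch intrinsic; the look-ahead distance of 8 is an arbitrary starting point to tune per workload:

#[cfg(target_arch = "x86_64")]
fn gather_sum(xs: &[u64], idx: &[usize]) -> u64 {
    use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    const AHEAD: usize = 8; // hypothetical look-ahead distance
    let mut s = 0u64;
    for (i, &j) in idx.iter().enumerate() {
        if let Some(&k) = idx.get(i + AHEAD) {
            // request the line we will need a few iterations from now
            unsafe { _mm_prefetch::<_MM_HINT_T0>(xs.as_ptr().add(k) as *const i8) };
        }
        s = s.wrapping_add(xs[j]);
    }
    s
}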

3.4 False Sharing — 64-Byte Hell

Two threads writing different variables on the same cache line cause the MESI coherence protocol to ping-pong the whole line. 10x performance drop is common.

#[repr(align(64))]
struct PaddedCounter(AtomicU64);

Align to 64 bytes and the problem disappears.
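A quick way to see it (sketch; thread and iteration counts are arbitrary): one line-aligned counter per thread means no shared lines and no MESI ping-pong. Remove the repr(align(64)) and adjacent counters pack into one line again.

use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

#[repr(align(64))]
struct PaddedCounter(AtomicU64);

fn main() {
    // each counter occupies its own 64-byte line
    let counters: Vec<PaddedCounter> =
        (0..4).map(|_| PaddedCounter(AtomicU64::new(0))).collect();
    thread::scope(|s| {
        for c in &counters {
            s.spawn(move || {
                for _ in 0..1_000_000 {
                    c.0.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
}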

3.5 Data-Oriented Design — What Game Engines Learned

OOP's Entity { name, pos, vel, hp, ... } is a cache nightmare. update_positions only needs pos and vel, but name and hp pollute lines. ECS (Entity-Component-System) and SoA (Struct-of-Arrays) fix this structurally.

// AoS: Entity entities[N]  -> cache waste
// SoA: float xs[N], ys[N], vxs[N], vys[N]  -> pure loads

Unity DOTS, Unreal Mass Entity, Bevy ECS all adopt this.
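A minimal SoA sketch in Rust (field names are illustrative): update_positions touches only positions and velocities, so every byte loaded is a byte used, and the loop auto-vectorizes cleanly.

struct World {
    xs: Vec<f32>, ys: Vec<f32>,   // positions
    vxs: Vec<f32>, vys: Vec<f32>, // velocities
    // names, hp, ... live in separate arrays, never loaded here
}

fn update_positions(w: &mut World, dt: f32) {
    for i in 0..w.xs.len() {
        w.xs[i] += w.vxs[i] * dt; // contiguous loads and stores only
        w.ys[i] += w.vys[i] * dt;
    }
}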

Part 4. Branch Prediction — When a Branch Is 10x Slower

4.1 Why Predict

With a 15-stage pipeline, an if means "don't know which way yet." Instead of stalling, the CPU guesses and continues. Right: no cost. Wrong: 15 cycles wasted.

4.2 From 2-bit Saturating Counter to TAGE

  • 1-bit: remembers last outcome only, ~60% accuracy
  • 2-bit saturating: confidence accumulates, ~90%
  • Perceptron (AMD 2008+): neural network, 95%+
  • TAGE (Intel/Apple 2010s+): tagged histories of varied lengths, 97 to 99%

This is why the famous Stack Overflow question "why is a sorted array faster" answers itself — sorted patterns are trivially predictable (near 100%), random ones are ~50%.

4.3 Branchless Programming

Strip branches from hot loops and misprediction cost disappears.

// branch
int max = a > b ? a : b;

// branchless bit trick: -(a > b) is all-ones when a > b, so the XOR selects a
int max = b ^ ((a ^ b) & -(a > b));

Rust/LLVM already lowers simple ternaries to cmov. For heavier branch bodies, SIMD evaluates both sides and selects the result with a mask, as sketched below in Rust.
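The same idea in Rust, sketched: casting the comparison to an integer removes control flow entirely, and LLVM typically lowers it to setcc arithmetic and auto-vectorizes the loop.

// branchy: mispredicts on random data
fn count_below_branchy(xs: &[i32], t: i32) -> usize {
    let mut n = 0;
    for &x in xs {
        if x < t { n += 1; }
    }
    n
}

// branchless: the comparison is just a 0-or-1 value
fn count_below_branchless(xs: &[i32], t: i32) -> usize {
    xs.iter().map(|&x| (x < t) as usize).sum()
}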

Part 5. Post-2018 — Meltdown, Spectre, MDS, Zenbleed

5.1 Meltdown (Intel)

Speculatively reads kernel memory from user mode, then infers the value via cache side-channel. Intel loaded data before permission check. KPTI patches cost 5 to 30%.

5.2 Spectre v1/v2

Attacker trains the branch predictor, extracts victim memory via cache side channel. A design-level flaw: "no full fix," only layered mitigations (IBRS, STIBP, retpoline).

5.3 MDS, L1TF, ZombieLoad, Zenbleed

Variants surfaced from 2019 to 2023. Some cloud providers disable SMT outright. AWS Nitro, Google Titan, and Microsoft Pluton are dedicated security chips shaped by this threat model.

5.4 Post-Quantum Era Security Chips

Apple Secure Enclave, Intel SGX/TDX, AMD SEV-SNP, ARM CCA provide isolated execution inside the CPU even the kernel cannot see — the hardware basis of Confidential Computing.

Part 6. Apple Silicon — Why M1 Beat Intel

6.1 Five Design Wins

  1. Wide decode (10-wide vs Intel's 6) — x86 variable-length instructions resist parallel decode; ARM's fixed 32-bit makes it easy
  2. Unified Memory Architecture (UMA) — CPU/GPU/NPU share LPDDR, zero copy cost
  3. Huge ROB + many physical registers — 2x Intel's OoO window
  4. SoC-level integration — memory controller, Thunderbolt, NVMe, display on one die, latency collapses
  5. TSMC 5nm/3nm leading process — 20 to 30% head start at same power

6.2 P-core / E-core Hybrid

M-series mixes performance and efficiency cores. Background work (Spotlight indexing, mail) runs on E-cores to save battery; foreground compute on P-cores. Windows/Linux scheduling is tricky because the OS must infer workload character (Intel's Thread Director helps).

6.3 Graviton/Ampere — Cloud Is Moving to ARM

AWS Graviton3/4 is 30 to 40% cheaper at equal performance. Netflix, Snap, Airbnb migrated at scale. Cloudflare Workers is ARM-first. Reason: absolute lead in performance per watt.

Part 7. x86 vs ARM vs RISC-V — 2025 Landscape

7.1 Why x86 Remains

  • Legacy compatibility (Windows/Linux/Steam catalog)
  • Mature SIMD (AVX-512, VNNI)
  • Server datacenter inertia

But x86 is losing the perf/watt war. Intel's Lunar Lake (2024) imitates Apple with on-package memory.

7.2 ARM's Winning Fronts

  • Mobile (already 100% ARM)
  • Cloud (Graviton/Ampere 30%+)
  • Mac (100% ARM)
  • Windows on ARM (Qualcomm Snapdragon X, 2024)

7.3 RISC-V — The Open-ISA Revolt

  • Strengths: royalty-free, modular (RV32I, RV64I, +M/A/F/D/V)
  • 2024 to 2025 silicon: SiFive P870, Tenstorrent Ascalon, Meta MTIA, China XuanTie
  • Weaknesses: ecosystem maturity, vector fragmentation

Key is RVA23 profile standardization. Linux kernel already fully supports it.

Part 8. SIMD — 8x Faster with Vector Instructions

8.1 Why SIMD

Scalar: one value per op. SIMD: many values per op (Single Instruction, Multiple Data).

  • SSE (1999+): 128-bit, 4 floats
  • AVX/AVX2 (2011+): 256-bit, 8 floats
  • AVX-512 (2017+): 512-bit, 16 floats (Intel server, AMD Zen 4+)
  • ARM NEON (2004+): 128-bit standard
  • SVE/SVE2 (2016+): variable-length (128 to 2048-bit), Apple M4, Graviton3
  • RISC-V V (2021+): SVE-like, variable length

8.2 Auto-Vectorization and Manual

LLVM/GCC auto-vectorize simple loops. Dependencies or branches defeat them. For tight performance, use intrinsics or std::simd (Rust portable SIMD, 2024 nightly), std::experimental::simd (C++).

#![feature(portable_simd)]  // nightly-only feature gate
use std::simd::f32x8;

let a = f32x8::from_array([1.0; 8]);
let b = f32x8::from_array([2.0; 8]);
let c = a + b;  // 8 adds in one instruction
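This needs a nightly toolchain until portable SIMD stabilizes; on stable, third-party crates such as wide offer similar fixed-width vector types.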

8.3 Real Examples

  • simdjson: JSON parse 1 to 3 GB/s (4 to 10x over scalar)
  • ClickHouse: SIMD-tuned aggregations
  • WebP/AVIF decode
  • LLM inference Q4/Q8 quantization
  • Ripgrep string search (memchr crate)

Part 9. GPU Architecture — CUDA Cores, SM, Warp

9.1 CPU vs GPU Philosophy

  • CPU: deep pipeline, complex prediction, few threads, minimize latency
  • GPU: shallow pipeline, tens of thousands of threads, maximize throughput

GPU hides latency by context-switching threads on cache miss. CPU just waits.

9.2 H100/B200/GB300 Internals

  • SM (Streaming Multiprocessor): GPU's "core cluster", H100 has 132, B200 has 208
  • CUDA Core: scalar ALU in an SM, 128 per SM
  • Tensor Core: 4x4 matrix multiply per cycle, LLM's engine
  • Warp: 32 threads co-execute in SIMT
  • Shared Memory: ~228KB per SM, a software-managed cache
  • HBM3e (B200): 192GB, 8TB/s

9.3 Warp Divergence — GPU's Branch Cost

If 32 threads in a warp take different branches, both paths execute with masks. Branch cost is brutal — hence LLM kernels are branchless.

9.4 LLM Inference Bottlenecks

  • Prefill: compute-bound, tensor cores 100% busy
  • Decode: memory-bound, KV-cache reads dominate → FlashAttention, PagedAttention, continuous batching
  • Blackwell B200 FP4 (2024): 2x effective memory bandwidth

9.5 ROCm, Apple MLX, Metal — Outside NVIDIA

  • AMD ROCm ports CUDA via HIP, MI300X (HBM3 192GB) is H100 alternative
  • Apple Metal + MLX: local LLM on M-series, UMA gives instant model load
  • Intel Arc/Gaudi: OneAPI/SYCL, Gaudi3 targets inference value

Part 10. Memory and Interconnect

10.1 DDR5, LPDDR5X, HBM3e

  • DDR5 (desktop/server): up to 8400 MT/s, ~90 GB/s dual-channel
  • LPDDR5X (mobile, Apple, Qualcomm): 8533 MT/s, low power
  • HBM3e (GPU): 3D-stacked, 8TB/s per GPU, mandatory on B200

10.2 NUMA Pitfalls

Remote-socket memory is 2 to 3x slower on 2-socket servers. Tune with Linux numactl, mbind. Kubernetes Topology Manager recognizes this.
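A typical pinning invocation (node numbers are illustrative):

numactl --cpunodebind=0 --membind=0 ./myapp   # execute and allocate on node 0 only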

10.3 CXL (Compute Express Link)

A real, shipping standard since 2024.

  • CXL.cache: CPU caches remote memory
  • CXL.mem: memory pooling, disaggregation
  • CXL.io: PCIe-compatible

Meaning: a server's RAM becomes a shared, poolable resource, and rack-level memory over-provisioning finally has a fix.

10.4 NVLink

Direct GPU-to-GPU interconnect. H100 NVLink runs at 900 GB/s, B200 at 1.8 TB/s. The hardware foundation for large-LLM parallelism (TP/PP/EP).

Part 11. Power, Heat, Quantum — Why Clocks Stopped Rising

11.1 End of Power Wall and Dennard Scaling

Dennard Scaling (shrinking transistors kept power density constant, letting clocks rise for free) broke down around 2006. Clocks have been stuck at 3 to 5 GHz since. The answers: parallelism (more cores) and specialization (NPU, Tensor Core).

11.2 Dark Silicon

With 200B transistors, the power budget only allows a fraction on simultaneously. Hence purpose-built blocks (AMX, NPU, media engine) that turn on only when used.

11.3 3D Packaging

TSMC CoWoS and Intel Foveros stack logic and memory vertically; bandwidth jumps while latency drops. B200 and MI300X are 3D assemblies.

11.4 Quantum Computing — 2025 Reality

IBM Osprey (433 qubits), Google Willow (2024, 105 qubits plus an error-correction breakthrough), Atom Computing's 1,000-qubit neutral-atom machine. Far from breaking crypto, but useful for optimization and chemistry. The immediate engineering impact: post-quantum crypto preparation (Kyber, Dilithium).

Part 12. Wiring Performance End-to-End — Measure, Profile, Tune

12.1 Linux perf

perf stat -d ./myapp
# instructions, cycles, IPC, cache-misses, branch-misses
perf record -g ./myapp && perf report
  • IPC < 1 → many misses
  • branch-miss > 2% → prediction failure
  • LLC-load-miss > 10% → cache design failure

12.2 Intel VTune, AMD uProf, Apple Instruments

Top-Down analysis: Retiring / Bad Speculation / Frontend Bound / Backend Bound. Judge the bottleneck in a minute.

12.3 Microbench Traps

Run the code a million times with std::chrono and average? If the result is never used, the compiler treats the work as dead or loop-invariant and deletes it. Use black_box (Rust) or benchmark::DoNotOptimize (Google Benchmark).
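A minimal sketch of the guard in Rust, assuming std::hint::black_box (stable since 1.66): it passes values through an opaque identity so the optimizer can neither hoist the work out of the loop nor delete it.

use std::hint::black_box;
use std::time::Instant;

fn main() {
    let data: Vec<u64> = (0..1_000_000).collect();
    let t = Instant::now();
    for _ in 0..100 {
        let s: u64 = black_box(&data).iter().sum(); // input looks unknown
        black_box(s);                               // result looks used
    }
    println!("{:?} for 100 sums", t.elapsed());
}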

12.4 Performance Checklist

  • Measure first, never guess
  • Algorithm: O(n²) → O(n log n) before any microtuning
  • Check data layout (SoA/AoS)
  • Allocation count — Rust alloc hooks, jemalloc stats
  • Lock contention — perf lock, perf trace
  • SIMD opportunity — auto-vectorization logs (-Rpass=loop-vectorize)

Part 13. Practice — Why 10x Differences Appear in the Same Language

13.1 Case 1: Linear Probing vs Chaining HashMap

std::unordered_map-style chaining is a cache-miss generator. Open addressing (linear probing) runs 2 to 5x faster on modern CPUs. Rust's hashbrown (the backbone of std HashMap, a port of Google's SwissTable) is the canonical example, as the sketch below illustrates.
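A toy probe loop (not hashbrown's actual code) shows why: collisions walk forward through one contiguous array, so the next probe usually sits in the line already fetched instead of behind another heap pointer. The multiplier is an arbitrary Fibonacci-hashing constant.

fn find(slots: &[Option<(u64, u64)>], key: u64) -> Option<u64> {
    let mask = slots.len() - 1; // capacity assumed a power of two
    let mut i = (key.wrapping_mul(0x9E37_79B9_7F4A_7C15) as usize) & mask;
    loop {
        match slots[i] {
            Some((k, v)) if k == key => return Some(v),
            None => return None,          // empty slot: key absent
            _ => i = (i + 1) & mask,      // next slot, likely same line
        }
    }
}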

13.2 Case 2: Why a JSON Parser Can Do 1 GB/s

simdjson scans 64 bytes at once via AVX-512/ARM NEON to identify structure, then parses in a second pass. 10x-plus over traditional parsers.

13.3 Case 3: Redis vs KeyDB vs Dragonfly

Redis is single-threaded. KeyDB/Dragonfly use io_uring + shared-nothing for linear per-core scaling. 10 to 25x throughput gap on the same hardware.

13.4 Case 4: LLM Inference — TensorRT-LLM vs vLLM vs SGLang

  • Same model, same GPU, 3 to 8x throughput spread
  • Reasons: KV-cache management, continuous batching, paged attention, CUDA graph, FP8/FP4 quantization
  • "Runtime swap" rivals "model swap" in effect

Part 14. How Hardware Shifts Shape Software Design

14.1 NVMe + io_uring = File I/O Reborn

As disks accelerated, blocking I/O overhead became the bottleneck. Async I/O is mandatory.

14.2 100 Gb/s NIC + RDMA + DPDK

A 100 Gb/s link moves data as fast as a top NVMe drive. Without kernel bypass (DPDK) and RDMA (RoCEv2) you leave performance on the table. Cloudflare, Meta, and Stripe edges run on these.

14.3 DPU/IPU

NVIDIA BlueField, Intel IPU, AWS Nitro offload network, storage, security off CPU. 2025 datacenter standard.

14.4 AI Accelerator Warring States

NVIDIA B200/GB300, AMD MI350X, Google TPU v5p/v6, AWS Trainium2/Inferentia3, Cerebras WSE-3, Groq LPU, Tenstorrent. Each aims to beat NVIDIA in one domain.

Part 15. Checklist 12, Anti-patterns 10

Checklist 12

  1. Hot-path data structures are cache-friendly? (array beats linked list)
  2. Struct layout considers 64-byte alignment/padding?
  3. Multi-thread shared data has no false sharing?
  4. Hot loop branch prediction rate checked with perf?
  5. If IPC below 1, investigated cause (cache, branch, dependency)?
  6. Loops with SIMD opportunity got auto-vectorized?
  7. On NUMA systems, pinned local memory with numactl or topology manager?
  8. Measured SMT noisy-neighbor impact on vCPU?
  9. GPU workload Warp divergence measured?
  10. Memory allocation batched via arena or slab to reduce fragmentation?
  11. Have a post-quantum crypto migration roadmap (2030 timeline)?
  12. ARM64 builds (CI/CD) and benches ready?

Anti-patterns 10

  1. Deciding "Rust is fast" without benchmarks
  2. Using OOP AoS exclusively, ignoring SoA
  3. Spraying HashMap in hot loops
  4. All threads contending on one atomic counter
  5. Thinking virtual threads/async fix CPU-bound problems
  6. Tuning on x86, not testing ARM
  7. GPU code with if-else causing warp divergence
  8. Micro-tuning memcpy while leaving algorithm alone
  9. Trusting KVM/VM-internal benches as bare-metal
  10. Disabling Meltdown/Spectre mitigation for performance

Next — "Observability in the Modern Era" — OpenTelemetry, eBPF continuous profiling, Grafana LGTM, Pyroscope, Sentry Performance, Datadog, Honeycomb, SLO, SRE, on-call

You learned the CPU — now you need to see what happens on it in real time. Next post drills into 2025 observability.

  • Metrics x Logs x Traces x Profiles — Four Pillars
  • OpenTelemetry becoming the de facto standard
  • eBPF continuous profiling — Parca, Pyroscope, Polar Signals
  • Distributed tracing — sampling strategies, tail/head-based
  • LGTM stack — Loki/Grafana/Tempo/Mimir
  • Sentry Performance, Datadog APM, Honeycomb compared
  • SLO/SLI/Error Budget — decade since the SRE book
  • AI anomaly detection — Datadog Watchdog, Grafana Oncall AI
  • On-call operations — rotation, postmortem, blameless culture

From hardware to software to operations — up the next stack layer.
