Compilers and Modern Language Runtimes — LLVM, JIT, GC, V8 TurboFan/Maglev, Inline Caching, Escape Analysis, Rust Monomorphization Complete Guide (2025)
Why You Should Know Compilers and Runtimes
Reality in 2025:
- V8, JVM, Go runtime, CPython, LLVM — tools you use daily.
- Their internal optimizations decide 70% of your app's performance.
- "Why is this code slow?" answers often hide in compiler decisions.
- Picking a language (Go vs Rust vs Python vs TS) is picking a runtime profile.
- In the AI era, million-QPS LLM inference runtimes are a compiler-optimization battle.
This post traces "what happens until my code runs."
Part 1 — Compiler vs Interpreter vs JIT — A Spectrum
Classical split
- Compiler: translate to machine code ahead-of-time (AOT).
- Interpreter: execute source one line at a time.
- JIT: compile hot parts to machine code during execution.
Reality is hybrid
- JVM: interpreter + C1 + C2 JIT + AOT (GraalVM).
- V8: Ignition (interpreter) + Sparkplug + Maglev + TurboFan (JIT).
- CPython: pure bytecode interpreter → specializing adaptive interpreter (3.11+), experimental JIT (3.13).
- .NET: bytecode + RyuJIT + AOT (NativeAOT).
"Is this language compiled or interpreted?" is the wrong question. "What tiers does its runtime have?" is the real one.
Part 2 — LLVM's Dominance
Why LLVM became the standard
Chris Lattner's graduate research project, started in 2000. As of 2024:
- Language frontends emit LLVM IR; optimization and codegen are LLVM's job.
- Languages using LLVM: Rust, Swift, Julia, Zig, Crystal, Kotlin/Native, Mojo, Chapel.
- Core tool for Apple Silicon migration.
- GPU code (CUDA/HIP) goes through LLVM.
LLVM IR
define i32 @add(i32 %a, i32 %b) {
%sum = add i32 %a, %b
ret i32 %sum
}
Language-neutral IR. Hundreds of optimization passes run on it:
- Constant Folding
- Dead Code Elimination
- Loop Invariant Code Motion
- Inlining
- Vectorization
- Instruction Combining
MLIR (2019, led by Chris Lattner at Google)
"Multi-level IR." Not one LLVM IR, but domain-specific IRs at multiple levels.
Why it's needed: ML frameworks (TensorFlow/PyTorch) lower high-level computation graphs through several abstraction levels down to low-level ops. LLVM IR alone is too low-level to express the upper layers.
MLIR adoption in 2024:
- Mojo (Modular AI) — Python superset, MLIR-based.
- IREE — ML compiler.
- Triton (OpenAI) — GPU kernel language.
Part 3 — JIT Masters
V8's four tiers
As of 2024:
Bytecode (Ignition interpreter)
↓ (hot)
Sparkplug — 1:1 bytecode-to-machine (non-optimizing, fast codegen)
↓ (hotter)
Maglev — mid-tier optimizing JIT (2023)
↓ (very hot)
TurboFan — peak optimization (Sea of Nodes, slow)
Each higher tier uses type feedback collected by the tiers below it to optimize more aggressively.
Deoptimization: if assumptions break, fall back. Core mechanism of dynamic-language optimization.
Hidden Class + Inline Caching
JS objects are dynamic; obj.x has no fixed address. V8's answer:
Hidden Class (aka Shape, Map):
- Objects with identical property structure share a hidden class.
- obj.x compiles to "offset N in this hidden class."
Inline Cache (IC):
- Cache the property-access result at the call site.
- Repeated access on the same hidden class → cache hit → native speed.
- Shape changes → IC miss → polymorphic → megamorphic.
Lesson: keep JS object shapes stable.
// Bad: conditional property addition mutates the shape after creation
const a = {};
if (cond) a.x = 1;
if (cond2) a.y = 2;

// Good: declare all properties at construction so the hidden class stays stable
const b = { x: cond ? 1 : undefined, y: cond2 ? 2 : undefined };
Escape Analysis
"If this object doesn't escape the function → stack-allocate or scalar-replace."
Heap allocation is expensive and adds GC pressure. Code that is friendly to escape analysis is much faster.
- JVM C2: strong escape analysis.
- Go compiler: check with go build -gcflags="-m" (see the sketch after this list).
- V8 TurboFan: limited.
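A minimal Go sketch of what these decisions look like, using hypothetical functions newOnHeap and sumOnStack; building it with go build -gcflags="-m" prints which values the compiler moves to the heap.

package main

// Escapes: the pointer is returned, so the value must outlive the stack
// frame; -m typically reports "moved to heap: v".
func newOnHeap() *int {
    v := 42
    return &v
}

// Does not escape: the array never leaves the frame, so it can stay on
// the stack (or be scalar-replaced entirely).
func sumOnStack() int {
    v := [4]int{1, 2, 3, 4}
    total := 0
    for _, x := range v {
        total += x
    }
    return total
}

func main() {
    _ = newOnHeap()
    _ = sumOnStack()
}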
Tiered Compilation
JVM: C1 (fast, lightly optimizing JIT) → C2 (aggressive JIT). GraalVM offers an alternative top-tier JIT and Native Image AOT.
- Observe with -XX:+PrintCompilation.
- C2 is still among the strongest JITs in production.
LuaJIT: trace-based JIT. Mike Pall's masterpiece. Still cited as the best dynamic-language JIT in the 2020s.
Part 4 — GC Lineage
Mark & Sweep (1960)
- Mark reachable objects from roots.
- Sweep the unmarked.
- Cons: stop-the-world, fragmentation.
Copying GC
- Split into two spaces; copy live objects to the other.
- No fragmentation; 2x memory.
Generational GC
- Most objects die young (Weak Generational Hypothesis).
- Young/Old split; collect young often.
Concurrent and Incremental GC
Run GC concurrently or in small chunks to minimize STW.
G1 GC (default in JDK 9+)
- Region-based (~2048 heap regions).
- Predictable pause-time targets.
- Default for most server workloads.
ZGC (experimental in JDK 11, production-ready since JDK 15; Generational ZGC in JDK 21, 2023)
- Colored Pointers — track object moves via color bits in references.
- Sub-millisecond pauses even on multi-terabyte heaps.
- 2023 Generational ZGC improves throughput too.
Shenandoah (Red Hat, JDK 12+)
- Concurrent compaction.
- Competitor to ZGC.
Go's GC
- Concurrent mark & sweep.
- Tri-color abstraction + write barrier.
- STW pauses typically well under 1ms (concurrent GC since Go 1.5; sub-millisecond pauses since roughly Go 1.8).
- Not generational — uses escape analysis to put more on the stack.
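A small sketch for observing those pauses from inside a program (the allocation loop and memory limit are illustrative): runtime.ReadMemStats exposes recent pause times, GOGC / GOMEMLIMIT (or debug.SetGCPercent / debug.SetMemoryLimit) control pacing, and GODEBUG=gctrace=1 prints a line per cycle.

package main

import (
    "fmt"
    "runtime"
    "runtime/debug"
)

func main() {
    // Soft heap limit (Go 1.19+); the pacer tries to stay under it.
    debug.SetMemoryLimit(512 << 20) // 512 MiB

    // Allocate enough to trigger at least one GC cycle.
    sink := make([][]byte, 0, 1024)
    for i := 0; i < 1024; i++ {
        sink = append(sink, make([]byte, 1<<20))
        if len(sink) > 64 {
            sink = sink[1:] // drop old buffers so they become garbage
        }
    }
    runtime.GC()

    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    fmt.Printf("GC cycles:  %d\n", m.NumGC)
    fmt.Printf("last pause: %d µs\n", m.PauseNs[(m.NumGC+255)%256]/1000)
    fmt.Printf("total STW:  %d µs\n", m.PauseTotalNs/1000)
    _ = sink
}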
CPython
- Primarily reference counting + a cycle collector for cycles.
- Pros: deterministic release.
- Cons: refcount updates on every reference change; keeping them thread-safe is a key reason for the GIL.
V8
- Orinoco — concurrent, incremental, parallel.
- Young generation (new space): copying.
- Old generation (old space): mark-compact.
Choosing a GC
| GC | pause | throughput | memory overhead |
|---|---|---|---|
| Parallel (JDK, not default) | long | high | low |
| G1 | mid | mid | mid |
| ZGC | ultra-low | mid | mid |
| Shenandoah | ultra-low | mid | mid |
| Go | low | mid | low-mid |
| CPython RC | near 0 | low (counters) | low |
Part 5 — Go Scheduler
G-M-P model
- G: goroutine
- M: OS thread
- P: processor (a scheduling context; there are GOMAXPROCS of them)
Each P has a local G queue. If empty, work-steals from another P.
Traits
- Cooperative preemption, plus signal-based asynchronous preemption since Go 1.14.
- On syscall: even if M blocks, P is reassigned to another M immediately → other goroutines keep running.
- Netpoller integrates I/O with epoll/kqueue.
Limitations
- Limited NUMA-aware scheduling.
- GOMAXPROCS defaults to the host CPU count and ignores cgroup CPU quotas — set it manually in containers.
- Uber's automaxprocs library fixes this (see the sketch below).
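A hedged sketch of that fix: a blank import of go.uber.org/automaxprocs (added to go.mod) rewrites GOMAXPROCS at startup to match the container's CPU quota; otherwise set it explicitly.

package main

import (
    "fmt"
    "runtime"

    // On init, rewrites GOMAXPROCS to match the container's CPU quota.
    _ "go.uber.org/automaxprocs"
)

func main() {
    // GOMAXPROCS(0) reads the current value without changing it.
    fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))

    // Manual alternative for environments without the library:
    // runtime.GOMAXPROCS(2)
}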
Part 6 — Rust and the Power of AOT
Monomorphization
fn max<T: PartialOrd>(a: T, b: T) -> T { if a > b { a } else { b } }
let x = max(1i32, 2i32); // generates max::<i32>
let y = max(1.0f64, 2.0); // generates max::<f64>
Dedicated machine code per type → zero call overhead, optimal inlining. Downside: binary bloat.
Zero-Cost Abstractions
"Code using abstractions is no slower than hand-written equivalent."
- Iterator chains compile to the same assembly as a for-loop.
- async/await fully unrolls into a state machine.
- Trait objects (dyn Trait) have dynamic-dispatch cost, but static dispatch is zero-cost.
Rust's Borrow Checker
Zero-runtime-cost memory safety. Compile-time ownership/lifetime verification.
Rust 2024-2025
- async fn in traits stabilized (1.75).
- Parallel frontend (nightly) — faster compiles.
- Cranelift backend — an optional alternative to LLVM for debug builds.
- cargo-nextest — faster test runs.
Part 7 — Python 3.13's Revolution
Specializing Adaptive Interpreter (PEP 659, 3.11+)
CPython injects shape-specialized bytecode instructions at runtime.
LOAD_ATTR → LOAD_ATTR_INSTANCE_VALUE (dict-based object)
→ LOAD_ATTR_SLOT (__slots__)
→ LOAD_ATTR_MODULE
→ ...
V8's Inline Cache, brought to CPython.
3.13 additions
- Experimental copy-and-patch JIT, behind a build-time option.
- Free-Threading (PEP 703) — GIL-less execution. Separate build option.
- Incremental GC.
PyPy
- Tracing JIT. 4-10x faster than CPython for years.
- Compatibility issues limit mainstream adoption.
- HPy (2020+) — portable C-extension API effort.
Part 8 — JavaScript Runtime Landscape
V8 (Chrome, Node.js, Deno)
- Released 2008. Lars Bak.
- The reference point for JS optimization.
JavaScriptCore (Safari/WebKit)
- 4-tier JIT (LLInt → Baseline → DFG → FTL).
- B3, a custom compiler backend, powers the FTL tier.
- Reputed to be more memory-efficient than V8.
SpiderMonkey (Firefox)
- IonMonkey JIT.
- Strong WebAssembly implementation.
Bun picks JSC
Bun uses JSC instead of V8. Part of its benchmark advantage over Node.js stems from this choice.
Runtime API differences
- Node.js: fs, net, http, CommonJS+ESM.
- Deno: Web API first + permissions + native TypeScript.
- Bun: Node API + Web API + built-in bundler/test runner.
- Workerd (Cloudflare): V8 isolates, limited Node compat.
Part 9 — WebAssembly Runtimes
Covered previously, but from a runtime angle:
| Runtime | Trait | Where |
|---|---|---|
| V8 + Wasm | Browser standard | Web |
| Wasmtime | Bytecode Alliance | Server WASI |
| Wasmer | Multiple backends | Embedded |
| WasmEdge | CNCF | Edge |
| Wasmer + Cranelift | Fast compile | Dev |
| Wasmer + LLVM | Best code | Prod |
Part 10 — AOT vs JIT Trade-offs
AOT (Rust, Go, Swift, GraalVM Native Image)
Pros:
- Fast startup.
- Predictable performance.
- Low runtime overhead.
Cons:
- No dynamic optimization (no type feedback).
- Binary size.
- Compile time.
JIT (V8, JVM C2)
Pros:
- Dynamic typing/polymorphism optimization.
- Uses runtime information.
Cons:
- Warmup (slow at first).
- Memory overhead.
- Unpredictable pauses.
GraalVM — the bridge
- Native Image — AOT-compile Java.
- 50ms startup, 90% memory reduction.
- Spring Native, Quarkus, Micronaut — serverless/container-friendly.
- Cons: reflection and dynamic class loading constrained.
Part 11 — Performance Analysis Workflow
CPU profiling
- Flame graph for the big picture.
- Identify hotspot functions.
- Inspect their assembly (perf annotate or Compiler Explorer).
- For JIT output: Node --print-opt-code, V8 logging.
Memory profiling
- Chrome DevTools Heap Snapshot (V8).
- JFR + Mission Control (JVM).
- pprof (Go) — see the sketch after this list.
- memray (Python).
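For the Go pprof entry above, a minimal sketch (port and paths illustrative): a side-effect import of net/http/pprof registers the /debug/pprof/* handlers, which go tool pprof can then scrape.

package main

import (
    "log"
    "net/http"

    // Side-effect import: registers /debug/pprof/* handlers
    // (heap, profile, goroutine, ...) on http.DefaultServeMux.
    _ "net/http/pprof"
)

func main() {
    // Expose the profiling endpoints on a separate port.
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // From a shell:
    //   go tool pprof http://localhost:6060/debug/pprof/heap
    //   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30

    // ... application work; select{} just keeps this sketch alive.
    select {}
}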
Tracing
- Linux perf + flame graph.
- Parca, Pyroscope — continuous profiling.
- async-profiler (JVM) — safepoint-free JVM profiler.
Part 12 — Checklist (12 items)
- Latest LTS runtime — V8, JVM, Go, Python all keep improving.
- Tune the GC only after benchmarking — premature optimization is bad.
- Keep JS object shape stable — stabilize Hidden Class.
- Mind Go escape analysis — check with -gcflags="-m".
- Rust: #[inline]/PGO — manual hints often needed.
- JVM: JFR always on — continuous prod profiling.
- Consider CPython 3.13+ — specialized interpreter gains.
- Container CPU awareness — GOMAXPROCS, -XX:ActiveProcessorCount.
- Startup-sensitive apps → AOT — Native Image, Go AOT.
- Measure JIT warmup — discard initial benchmark numbers.
- Minimize allocations on hot paths — many perf issues originate here (see the sync.Pool sketch after this list).
- Compiler Explorer (godbolt.org) — habitual assembly inspection.
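For the allocation item above, one common pattern, sketched in Go with illustrative names: reuse buffers through sync.Pool instead of allocating per call.

package main

import (
    "bytes"
    "fmt"
    "sync"
)

// A pool of reusable buffers; New runs only when the pool is empty.
var bufPool = sync.Pool{
    New: func() any { return new(bytes.Buffer) },
}

func render(name string) string {
    buf := bufPool.Get().(*bytes.Buffer)
    defer func() {
        buf.Reset() // hand a clean buffer back to the pool
        bufPool.Put(buf)
    }()

    fmt.Fprintf(buf, "hello, %s", name)
    return buf.String()
}

func main() {
    fmt.Println(render("runtime"))
}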
Part 13 — 10 Anti-patterns
- "This language is fast" — runtime config/structure matters more.
- Benchmarking with a single time ./app run — ignores variance and warmup (see the benchmark sketch after this list).
- Ignoring JIT warmup — judging from -Xcomp runs alone.
- Microbenchmarking with System.nanoTime() instead of JMH — results can be off by tens of times.
- Generic abuse (monomorphization explosion) — tens-of-MB binaries.
- Tight loops in pure Python — consider NumPy/Cython/Numba.
- CPU-bound work on the Node main event loop — worker_threads required.
- JVM in prod without Xms=Xmx — heap-resize cost.
- No GC logs — impossible to triage prod incidents.
- JVM on container default memory — gets OOM-killed. -XX:+UseContainerSupport is the default but verify.
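For the first anti-pattern, a sketch of letting Go's testing package handle iteration counts (file and names are illustrative): run with go test -bench=. -count=10 and compare runs with benchstat instead of trusting a single timing.

// sum_bench_test.go
package sum

import "testing"

func sum(xs []int) int {
    total := 0
    for _, x := range xs {
        total += x
    }
    return total
}

// The framework grows b.N until the measurement is stable and -count
// repeats the whole run, so variance and warmup are handled instead of
// trusting a single time ./app measurement.
func BenchmarkSum(b *testing.B) {
    xs := make([]int, 1024)
    for i := range xs {
        xs[i] = i
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = sum(xs)
    }
}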
Part 14 — Learning Resources
- Book: Crafting Interpreters (Robert Nystrom) — free online. Kindest intro.
- Book: Engineering a Compiler (Cooper & Torczon).
- Book: Modern Compiler Implementation in ML/Java/C (Andrew Appel).
- Book: The Garbage Collection Handbook (Jones, Hosking, Moss).
- Blog: V8 blog; Chrome V8 engineer talks on YouTube.
- Tool: godbolt.org — assembly playground.
- Course: Stanford CS143 Compilers (open).
Closing — Language Is Runtime
"Which language?" is often really "which runtime profile?"
- Ultra-low latency → Rust, Go AOT, GraalVM Native.
- Dev productivity first → Python, TypeScript, Kotlin.
- Polymorphism optimization → JVM C2, V8 TurboFan.
- Startup matters → AOT.
Every runtime is a trade-off. The same language shifts performance profile by config, version, GC choice. An engineer's weapon is concrete runtime knowledge: "I don't break V8's Hidden Class," "I verify Go's escape analysis."
In the LLM-writes-code era, engineers who can explain why code is fast or slow grow more valuable. The answer mostly lives in the compiler and the runtime.
Next — "AI Engineering in Practice" — LLM API architecture, RAG, agents, fine-tuning, vector DBs, evaluation, production ops
After 14 systems posts, the next sits on top of them all: AI applications.
- LLM API calls in practice — retries, timeouts, streaming, cost
- RAG architecture — from basic retrieval to Hybrid Search
- Agent design patterns — ReAct, Plan-and-Execute, Tool Use
- When (and when not) to fine-tune — LoRA, DPO, RLHF
- Vector DBs — pgvector vs Qdrant vs Pinecone
- LLM evaluation — the real difficulty of measuring accuracy
- Prompt engineering science — Structured Output, Few-shot, Chain-of-Thought
- LLM observability — OpenTelemetry GenAI, LangSmith, LangFuse
- Cost optimization — model choice, caching, Prompt Compression
- Security — Prompt Injection, Data Leakage
"How to actually ship an AI product." Next post.