Compilers and Modern Language Runtimes — LLVM, JIT, GC, V8 TurboFan/Maglev, Inline Caching, Escape Analysis, Rust Monomorphization Complete Guide (2025)

Why You Should Know Compilers and Runtimes

Reality in 2025:

  • V8, JVM, Go runtime, CPython, LLVM — tools you use daily.
  • Their internal optimizations often decide the lion's share of your app's performance.
  • "Why is this code slow?" answers often hide in compiler decisions.
  • Picking a language (Go vs Rust vs Python vs TS) is picking a runtime profile.
  • In the AI era, LLM inference runtimes serving millions of QPS are a compiler-optimization battle.

This post traces what happens between the code you write and the instructions that actually run.

Part 1 — Compiler vs Interpreter vs JIT — A Spectrum

Classical split

  • Compiler: translate to machine code ahead-of-time (AOT).
  • Interpreter: execute source one line at a time.
  • JIT: compile hot parts to machine code during execution.

Reality is hybrid

  • JVM: interpreter + C1 + C2 JIT + AOT (GraalVM).
  • V8: Ignition (interpreter) + Sparkplug + Maglev + TurboFan (JIT).
  • CPython: bytecode interpreter → Specializing Adaptive Interpreter since 3.11, experimental JIT in 3.13.
  • .NET: bytecode + RyuJIT + AOT (NativeAOT).

"Is this language compiled or interpreted?" is the wrong question. "What tiers does its runtime have?" is the real one.

Part 2 — LLVM's Dominance

Why LLVM became the standard

Chris Lattner's research project, begun in 2000 at the University of Illinois. As of 2024:

  • Language frontends emit LLVM IR; optimization and codegen are LLVM's job.
  • Languages using LLVM: Rust, Swift, Julia, Zig, Crystal, Kotlin/Native, Mojo, Chapel.
  • Core tool for Apple Silicon migration.
  • GPU code (CUDA/HIP) goes through LLVM.

LLVM IR

define i32 @add(i32 %a, i32 %b) {
  %sum = add i32 %a, %b
  ret i32 %sum
}

Language-neutral IR. Hundreds of optimization passes run on it — for example (the first two are demonstrated in the sketch after this list):

  • Constant Folding
  • Dead Code Elimination
  • Loop Invariant Code Motion
  • Inlining
  • Vectorization
  • Instruction Combining
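
These passes aren't LLVM-specific; every optimizing compiler has equivalents. As a minimal sketch of the first two using the Go toolchain (any optimizing compiler would do), compiling the snippet below with go tool compile -S typically shows the function returning the constant 20 directly:

package main

// The add and multiply below are constant-folded at compile time and the
// temporaries are dead-code-eliminated; inspect with `go tool compile -S`.
func calc() int {
    x := 2 + 3 // constant folding: 5
    y := x * 4 // constant propagation + folding: 20
    return y
}

func main() {
    println(calc())
}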

MLIR (2019, Chris Lattner returns)

"Multi-level IR." Not one LLVM IR, but domain-specific IRs at multiple levels.

Why it's needed: ML frameworks (TensorFlow/PyTorch) lower high-level computation graphs through several abstraction levels down to low-level ops; LLVM IR alone sits too low to express the upper levels.

MLIR adoption in 2024:

  • Mojo (Modular AI) — Python superset, MLIR-based.
  • IREE — ML compiler.
  • Triton (OpenAI) — GPU kernel language.

Part 3 — JIT Masters

V8's four tiers

As of 2024:

Bytecode (Ignition interpreter)
   ↓ (hot)
Sparkplug — 1:1 bytecode-to-machine-code baseline (non-optimizing, fast codegen)
   ↓ (hotter)
Maglev — mid-tier optimizing JIT (2023)
   ↓ (very hot)
TurboFan — peak optimization (Sea of Nodes, slow to compile)

Each tier uses type feedback collected by the tiers below it to optimize more aggressively.

Deoptimization: if assumptions break, fall back. Core mechanism of dynamic-language optimization.

Hidden Class + Inline Caching

JS objects are dynamic; obj.x has no fixed address. V8's answer:

Hidden Class (aka Shape, Map):

  • Objects with identical property structure share a hidden class.
  • obj.x compiles to "offset N in this hidden class."

Inline Cache (IC):

  • Cache the property-access result at the call site.
  • Repeated access on the same hidden class → cache hit → native speed.
  • Varying shapes at one call site → IC misses → polymorphic, then megamorphic (slow path).

Lesson: keep JS object shapes stable.

// Bad: conditional property addition keeps changing the shape
const a = {};
if (cond) a.x = 1;   // transitions to a new hidden class
if (cond2) a.y = 2;  // ...and another

// Good: declare all properties at construction (one stable shape)
const b = { x: cond ? 1 : undefined, y: cond2 ? 2 : undefined };

Escape Analysis

"If this object doesn't escape the function → stack-allocate or scalar-replace."

Heap allocation is expensive and adds GC pressure. Code that is friendly to escape analysis is much faster.

  • JVM C2: strong escape analysis.
  • Go compiler: check with go build -gcflags="-m" (sketch after this list).
  • V8 TurboFan: limited.
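
A minimal sketch of what -m reports in Go (type and function names are illustrative):

package main

// Build with: go build -gcflags="-m"
// The compiler prints its escape-analysis decision for each allocation.

type point struct{ x, y int }

// p never leaves the frame → the compiler keeps it on the stack.
func sum() int {
    p := point{1, 2}
    return p.x + p.y
}

// Returning &p forces p to outlive the frame → "moved to heap: p".
func leak() *point {
    p := point{1, 2}
    return &p
}

func main() {
    println(sum(), leak().x)
}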

Tiered Compilation

JVM: interpreter → C1 (fast JIT) → C2 (aggressive JIT). GraalVM's compiler is an alternative top tier.

  • Observe with -XX:+PrintCompilation.
  • C2 is still among the strongest JITs in production.

LuaJIT: trace-based JIT. Mike Pall's masterpiece. Still cited as the best dynamic-language JIT in the 2020s.

Part 4 — GC Lineage

Mark & Sweep (1960)

  • Mark reachable objects from roots.
  • Sweep the unmarked.
  • Cons: stop-the-world, fragmentation.

Copying GC

  • Split into two spaces; copy live objects to the other.
  • No fragmentation; 2x memory.

Generational GC

  • Most objects die young (Weak Generational Hypothesis).
  • Young/Old split; collect young often.

Concurrent and Incremental GC

Run GC concurrently or in small chunks to minimize STW.

G1 GC (default in JDK 9+)

  • Region-based (~2048 heap regions).
  • Predictable pause-time targets.
  • Default for most server workloads.

ZGC (JDK 11+, production-ready since JDK 15)

  • Colored Pointers — track object moves via metadata bits inside references.
  • Sub-millisecond pauses even on multi-terabyte heaps.
  • Generational ZGC (JDK 21, 2023) improves throughput too.

Shenandoah (Red Hat, JDK 12+)

  • Concurrent compaction.
  • Competitor to ZGC.

Go's GC

  • Concurrent mark & sweep.
  • Tri-color abstraction + write barrier.
  • STW pauses typically sub-millisecond (the concurrent collector landed in Go 1.5; sub-ms pauses since roughly 1.8).
  • Not generational — relies on escape analysis to keep more data on the stack instead (tuning sketch below).
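
A minimal sketch of Go's two main tuning knobs — GOGC and the Go 1.19+ soft memory limit — with illustrative values (both are also settable via the GOGC and GOMEMLIMIT environment variables):

package main

import "runtime/debug"

func main() {
    debug.SetGCPercent(100)       // GOGC: run GC when the live heap grows 100% (the default)
    debug.SetMemoryLimit(4 << 30) // GOMEMLIMIT: soft cap of 4 GiB (Go 1.19+)
    // ... application code ...
}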

CPython

  • Primarily reference counting + a cycle collector for cycles.
  • Pros: deterministic release.
  • Cons: counter updates on every reference change; historically relies on the GIL for thread safety.

V8

  • Orinoco — concurrent, incremental, parallel.
  • Young generation (new space): copying.
  • Old generation (old space): mark-compact.

Choosing a GC

| GC | Pause | Throughput | Memory overhead |
|---|---|---|---|
| Parallel (JDK, not default) | long | high | low |
| G1 | mid | mid | mid |
| ZGC | ultra-low | mid | mid |
| Shenandoah | ultra-low | mid | mid |
| Go | low | mid | low-mid |
| CPython RC | near 0 | low (counters) | low |

Part 5 — Go Scheduler

G-M-P model

  • G: goroutine
  • M: OS thread
  • P: processor — a scheduling context; their count is GOMAXPROCS (one per logical CPU by default).

Each P has a local G queue. If empty, work-steals from another P.

Traits

  • Cooperative preemption, plus Go 1.14+ asynchronous preemption (signal-based, so tight loops can no longer monopolize a P).
  • On syscall: even if M blocks, P is reassigned to another M immediately → other goroutines keep running.
  • Netpoller integrates I/O with epoll/kqueue.

Limitations

  • Limited NUMA-aware scheduling.
  • GOMAXPROCS auto-detection counts host CPUs, not cgroup CPU limits — set it manually in containers.
  • Uber's automaxprocs library fixes this (usage sketch below).
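
A minimal usage sketch — the library's documented pattern is a blank import whose init() lowers GOMAXPROCS to the container's cgroup CPU quota:

package main

import (
    "fmt"
    "runtime"

    _ "go.uber.org/automaxprocs" // side-effect import: adjusts GOMAXPROCS at startup
)

func main() {
    fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0)) // 0 = query without changing
}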

Part 6 — Rust and the Power of AOT

Monomorphization

fn max<T: PartialOrd>(a: T, b: T) -> T {
    if a > b { a } else { b }
}

fn main() {
    let x = max(1i32, 2i32);  // instantiates max::<i32>
    let y = max(1.0f64, 2.0); // instantiates max::<f64>
}

Dedicated machine code per type → zero call overhead, optimal inlining. Downside: binary bloat.

Zero-Cost Abstractions

"Code using abstractions is no slower than hand-written equivalent."

  • Iterator chains compile to the same assembly as a for-loop.
  • async/await fully unrolls into a state machine.
  • Trait objects (dyn Trait) pay for dynamic dispatch; static dispatch is free.

Rust's Borrow Checker

Zero-runtime-cost memory safety. Compile-time ownership/lifetime verification.

Rust 2024-2025

  • Native async trait (1.75).
  • Parallel frontend (Nightly) — faster compiles.
  • Cranelift backend — optional alternative to LLVM for faster debug builds.
  • cargo-nextest — faster tests.

Part 7 — Python 3.13's Revolution

Specializing Adaptive Interpreter (PEP 659, 3.11+)

CPython rewrites hot bytecode instructions into type-specialized variants at runtime.

LOAD_ATTR → LOAD_ATTR_INSTANCE_VALUE (dict-based objects)
          → LOAD_ATTR_SLOT (__slots__)
          → LOAD_ATTR_MODULE
          → ...

The idea behind V8's inline caches, brought to CPython.

3.13 additions

  • Experimental JIT (copy-and-patch), behind an opt-in build flag.
  • Free-Threading (PEP 703) — GIL-less execution. Separate build option.
  • Incremental GC.

PyPy

  • Tracing JIT. 4-10x faster than CPython for years.
  • Compatibility issues limit mainstream adoption.
  • HPy (2020+) — portable C-extension API effort.

Part 8 — JavaScript Runtime Landscape

V8 (Chrome, Node.js, Deno)

  • Shipped in 2008; built by a team led by Lars Bak.
  • The reference point for JS optimization.

JavaScriptCore (Safari/WebKit)

  • 4-tier JIT (LLInt → Baseline → DFG → FTL).
  • FTL's backend is the custom B3 compiler (which replaced LLVM in 2016).
  • Reputed to be more memory-efficient than V8.

SpiderMonkey (Firefox)

  • IonMonkey JIT.
  • Strong WebAssembly implementation.

Bun picks JSC

Bun uses JSC instead of V8. Part of its benchmark advantage over Node stems from this choice.

Runtime API differences

  • Node.js: fs, net, http, CommonJS+ESM.
  • Deno: Web API first + permissions + native TypeScript.
  • Bun: Node API + Web API + built-in bundler/test runner.
  • Workerd (Cloudflare): V8 isolates, limited Node compat.

Part 9 — WebAssembly Runtimes

Covered previously, but from a runtime angle:

| Runtime | Trait | Where |
|---|---|---|
| V8 + Wasm | Browser standard | Web |
| Wasmtime | Bytecode Alliance | Server WASI |
| Wasmer | Multiple backends | Embedded |
| WasmEdge | CNCF | Edge |
| Wasmer + Cranelift | Fast compile | Dev |
| Wasmer + LLVM | Best code | Prod |

Part 10 — AOT vs JIT Trade-offs

AOT (Rust, Go, Swift, GraalVM Native Image)

Pros:

  • Fast startup.
  • Predictable performance.
  • Low runtime overhead.

Cons:

  • No dynamic optimization (no type feedback).
  • Binary size.
  • Compile time.

JIT (V8, JVM C2)

Pros:

  • Dynamic typing/polymorphism optimization.
  • Uses runtime information.

Cons:

  • Warmup (slow at first).
  • Memory overhead.
  • Unpredictable pauses.

GraalVM — the bridge

  • Native Image — AOT-compile Java.
  • Startup in tens of milliseconds and drastically lower memory (often-cited figures: ~50ms and up to 90% less).
  • Spring Native, Quarkus, Micronaut — serverless/container-friendly.
  • Cons: reflection and dynamic class loading constrained.

Part 11 — Performance Analysis Workflow

CPU profiling

  1. Flame graph for the big picture.
  2. Identify hotspot functions.
  3. Inspect their assembly (perf annotate or Compiler Explorer).
  4. For JIT output: node --print-opt-code and other V8 logging flags.

Memory profiling

  • Chrome DevTools Heap Snapshot (V8).
  • JFR + Mission Control (JVM).
  • pprof (Go) — see the sketch after this list.
  • memray (Python).
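
For Go, a minimal pprof setup is a blank import plus an HTTP listener (port choice illustrative):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
    // Then: go tool pprof http://localhost:6060/debug/pprof/heap
    //       go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
    log.Fatal(http.ListenAndServe("localhost:6060", nil))
}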

Tracing

  • Linux perf + flame graph.
  • Parca, Pyroscope — continuous profiling.
  • async-profiler (JVM) — avoids safepoint bias.

Part 12 — Checklist (12 items)

  1. Latest LTS runtime — V8, JVM, Go, Python all keep improving.
  2. GC tune only after benchmarking — premature optimization is bad.
  3. Keep JS object shape stable — stabilize Hidden Class.
  4. Mind Go escape analysis — check with -gcflags="-m".
  5. Rust: #[inline]/PGO — manual hints often needed.
  6. JVM: JFR always on — continuous prod profiling.
  7. Consider CPython 3.13+ — specialized interpreter gains.
  8. Container CPU awareness — GOMAXPROCS, -XX:ActiveProcessorCount.
  9. Startup-sensitive apps → AOT — Native Image, Go AOT.
  10. Measure JIT warmup — discard initial benchmark numbers.
  11. Minimize allocations in hot paths — most perf issues live here (sync.Pool sketch after this list).
  12. Compiler Explorer (godbolt.org) — habitual assembly inspection.
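
For item 11, a common Go pattern is reusing buffers through sync.Pool instead of allocating per call — a minimal sketch with a hypothetical handler:

package main

import (
    "bytes"
    "sync"
)

// bufPool hands out reusable buffers; New runs only when the pool is empty.
var bufPool = sync.Pool{
    New: func() any { return new(bytes.Buffer) },
}

func handle(payload []byte) string {
    buf := bufPool.Get().(*bytes.Buffer)
    defer func() {
        buf.Reset() // reset before returning it to the pool
        bufPool.Put(buf)
    }()
    buf.Write(payload)
    return buf.String()
}

func main() {
    println(handle([]byte("hello")))
}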

Part 13 — 10 Anti-patterns

  1. "This language is fast" — runtime config/structure matters more.
  2. Benchmark with a single time ./app run — ignoring variance and warmup (see the benchmark sketch after this list).
  3. Ignoring JIT warmup — e.g. forcing -Xcomp instead of measuring steady state.
  4. Microbenchmarking with raw System.nanoTime() instead of JMH — results can be off by an order of magnitude or more.
  5. Generic abuse (monomorphization explosion) — tens of MB binaries.
  6. Tight loops in pure Python — consider NumPy/Cython/Numba.
  7. CPU-bound work on Node main event loop — worker_threads required.
  8. JVM prod without Xms=Xmx — heap-resize cost.
  9. No GC logs — impossible to triage prod incidents.
  10. JVM on container default memory — gets OOM-killed. -XX:+UseContainerSupport is default but verify.
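
For anti-patterns 2 and 3 in Go, the testing package is the built-in fix — a minimal sketch (save as e.g. work_test.go; go test -bench=. reruns the loop until timings stabilize, absorbing warmup and run-to-run variance):

package bench

import "testing"

// work is a stand-in hot function.
func work() int {
    s := 0
    for i := 0; i < 1000; i++ {
        s += i
    }
    return s
}

// Run with: go test -bench=. -benchmem
func BenchmarkWork(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _ = work()
    }
}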

Part 14 — Learning Resources

  • Book: Crafting Interpreters (Robert Nystrom) — free online. The gentlest introduction.
  • Book: Engineering a Compiler (Cooper & Torczon).
  • Book: Modern Compiler Implementation in ML/Java/C (Andrew Appel).
  • Book: The Garbage Collection Handbook (Jones, Hosking, Moss).
  • Blog: V8 blog; Chrome V8 engineer talks on YouTube.
  • Tool: godbolt.org — assembly playground.
  • Course: Stanford CS143 Compilers (open).

Closing — Language Is Runtime

"Which language?" is often really "which runtime profile?"

  • Ultra-low latency → Rust, Go AOT, GraalVM Native.
  • Dev productivity first → Python, TypeScript, Kotlin.
  • Polymorphism optimization → JVM C2, V8 TurboFan.
  • Startup matters → AOT.

Every runtime is a trade-off. The same language's performance profile shifts with configuration, version, and GC choice. An engineer's edge is concrete runtime knowledge: "I don't break V8's hidden classes," "I verify Go's escape analysis."

In the LLM-writes-code era, engineers who can explain why code is fast or slow grow more valuable. The answer mostly lives in the compiler and the runtime.

Next — "AI Engineering in Practice" — LLM API architecture, RAG, agents, fine-tuning, vector DBs, evaluation, production ops

After 14 systems posts, the next sits on top of them all: AI applications.

  • LLM API calls in practice — retries, timeouts, streaming, cost
  • RAG architecture — from basic retrieval to Hybrid Search
  • Agent design patterns — ReAct, Plan-and-Execute, Tool Use
  • When (and when not) to fine-tune — LoRA, DPO, RLHF
  • Vector DBs — pgvector vs Qdrant vs Pinecone
  • LLM evaluation — the real difficulty of measuring accuracy
  • Prompt engineering science — Structured Output, Few-shot, Chain-of-Thought
  • LLM observability — OpenTelemetry GenAI, LangSmith, LangFuse
  • Cost optimization — model choice, caching, Prompt Compression
  • Security — Prompt Injection, Data Leakage

"How to actually ship an AI product." Next post.