CPython Bytecode Interpreter Deep Dive — ceval.c, Specializing Adaptive, PEP 659, Copy-and-Patch JIT (2025)


TL;DR

  • CPython runs a pipeline: "Python source to AST to bytecode to ceval.c interpreter". It is an "interpreted language" but really compile-then-VM-execute.
  • Bytecode: stack-based VM instructions. LOAD_FAST, BINARY_OP, CALL, etc. Stored in .pyc.
  • ceval.c: the heart of Python. A giant switch or computed-goto loop. Every Python statement passes through.
  • Python 3.11+ PEP 659: Specializing Adaptive Interpreter. Hot paths are swapped for specialized opcodes at runtime. 25%+ perf uplift.
  • Inline cache: cache slots right next to each opcode. Profile on first hit, specialize, fast path on subsequent hits.
  • Python 3.12 PEP 684: Per-interpreter GIL. Each subinterpreter gets an independent GIL.
  • Python 3.13 PEP 703: Free-threaded build. An optional build without the GIL (experimental).
  • Python 3.13 PEP 744: Copy-and-patch JIT. Tier 2 optimizer. Template-based ultra-fast codegen.
  • Frame object: the execution context per call. Locals, stack, instruction pointer, exception state.
  • dis module: inspect bytecode with dis.dis(f).

1. Why Python is "slow"

1.1 The cost of dynamic typing

Simple addition in Python:

x + y

In C, int + int is one CPU instruction (add). In Python:

  1. Check types of x and y.
  2. Look up __add__ on x.
  3. If missing, try __radd__ on y.
  4. Call the method.
  5. Allocate a result object (heap).
  6. Adjust reference counts.

Dozens of C instructions for one addition; easily 100x slower than the single native add.
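Roughly what that generic path does, rendered as a Python sketch (the real logic lives in C in Objects/abstract.c and handles more corner cases, such as subclass priority):

# Hypothetical, heavily simplified rendering of the generic "x + y" protocol.
def generic_add(x, y):
    # Try type(x).__add__ first.
    add = getattr(type(x), "__add__", None)
    if add is not None:
        result = add(x, y)
        if result is not NotImplemented:
            return result
    # Fall back to the reflected operation on y.
    radd = getattr(type(y), "__radd__", None)
    if radd is not None:
        result = radd(y, x)
        if result is not NotImplemented:
            return result
    raise TypeError(f"unsupported operand type(s) for +: {type(x).__name__!r} and {type(y).__name__!r}")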

1.2 Boxing

Every value in Python is an object. Even a single int is a PyLongObject on the heap:

typedef struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
} PyLongObject;

About 28 bytes on a typical 64-bit build. Even the value 1 carries that much overhead.
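You can see the boxing overhead directly with sys.getsizeof (exact sizes vary slightly by version and platform):

import sys

print(sys.getsizeof(1))         # ~28 bytes on a typical 64-bit build
print(sys.getsizeof(10**100))   # grows with the number of digits
print(sys.getsizeof(True))      # even booleans are full heap objects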

1.3 Reference counting

CPython's memory management is refcount-based:

Py_INCREF(obj);  // +1
Py_DECREF(obj);  // -1, free if 0

Every object access pays this bookkeeping cost. Under multi-threading, refcount updates would need atomic operations; avoiding that overhead is one of the reasons the GIL exists.
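sys.getrefcount shows the counter in action (the call itself temporarily adds one reference):

import sys

x = object()
print(sys.getrefcount(x))   # typically 2: x plus the temporary argument reference
y = x
print(sys.getrefcount(x))   # one more after the second binding
del y
print(sys.getrefcount(x))   # back down once y is gone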

1.4 "So Python must stay slow forever?"

No. Since 2021 CPython has been in a real performance revolution:

  • Python 3.11 (Oct 2022): ~25% faster.
  • Python 3.12 (2023): +5%.
  • Python 3.13 (2024): free-threaded + JIT (experimental).
  • Python 3.14 (2025): more aggressive optimization.

The "Faster CPython" project (started after Guido joined Microsoft) targets 5x faster than 3.10.


2. The compilation pipeline

2.1 Stages

Python source (hello.py)
     |
Lexer (tokenizer)
     |
Parser
     |
AST (abstract syntax tree)
     |
Compiler (ast -> bytecode)
     |
Bytecode (code object)
     |
ceval.c interpreter
     |
Execute

Surprise: Python does compile. "Interpreted" means the target is VM instructions, not machine code.

2.2 Lexer and parser

Python 3.9+ uses a PEG parser (Pegen). Previously LL(1). PEG is more flexible.

import tokenize, io

code = "x = 1 + 2\n"
tokens = list(tokenize.tokenize(io.BytesIO(code.encode()).readline))
for t in tokens:
    print(t)

2.3 AST

import ast
tree = ast.parse("x = 1 + 2")
print(ast.dump(tree, indent=2))
Module(
  body=[
    Assign(
      targets=[Name(id='x', ctx=Store())],
      value=BinOp(
        left=Constant(value=1),
        op=Add(),
        right=Constant(value=2)))],
  type_ignores=[])

The ast module lets you transform trees — macro-like power.
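A minimal sketch of such a transform: constant-fold 1 + 2 in the tree before compiling it (CPython's compiler already does this particular folding; the point is that you can rewrite the AST yourself):

import ast

class FoldAdd(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold children first
        if (isinstance(node.op, ast.Add)
                and isinstance(node.left, ast.Constant)
                and isinstance(node.right, ast.Constant)):
            return ast.copy_location(ast.Constant(node.left.value + node.right.value), node)
        return node

tree = ast.parse("x = 1 + 2")
tree = ast.fix_missing_locations(FoldAdd().visit(tree))
exec(compile(tree, "<ast>", "exec"))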

2.4 Code object

def add(a, b):
    return a + b

print(add.__code__)
print(add.__code__.co_code)
print(add.__code__.co_consts)
print(add.__code__.co_names)
print(add.__code__.co_varnames)

PyCodeObject fields:

  • co_code: bytecode.
  • co_consts: constant pool.
  • co_names: global/attr names.
  • co_varnames: locals.
  • co_flags: async, generator, etc.
  • co_linetable: bytecode offset -> source line mapping (replaces the deprecated co_lnotab).
  • co_stacksize: max eval stack depth.

2.5 .pyc files

Code objects are marshaled to .pyc under __pycache__/:

magic number (version)
source mtime or hash
source size
marshaled code object

On rerun, Python compares the .py source against the cached .pyc; if unchanged, parsing and compiling are skipped. Python 3.7+ (PEP 552) also supports hash-based .pyc invalidation, which enables reproducible builds (e.g. with SOURCE_DATE_EPOCH).
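A small sketch for peeking at a cached file's 16-byte header (PEP 552 layout: magic, flags, then mtime and size, or a source hash, depending on the flags). "hello.py" is a placeholder for any module you have already imported or compiled:

import importlib.util
import struct

pyc_path = importlib.util.cache_from_source("hello.py")  # __pycache__/hello.cpython-3XX.pyc
with open(pyc_path, "rb") as f:
    magic, flags, field1, field2 = struct.unpack("<4sIII", f.read(16))
print(magic, flags, field1, field2)  # field1/field2 are mtime and size for timestamp-based .pycs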


3. Bytecode dissection

3.1 dis module

import dis

def add(a, b):
    return a + b

dis.dis(add)
  2           0 RESUME                   0
  3           2 LOAD_FAST                0 (a)
              4 LOAD_FAST                1 (b)
              6 BINARY_OP                0 (+)
             10 RETURN_VALUE

3.2 Opcodes

Python 3.12 has ~200 opcodes:

  • Stack: LOAD_CONST, LOAD_FAST, LOAD_GLOBAL, LOAD_ATTR, STORE_FAST, POP_TOP.
  • Arithmetic: BINARY_OP, COMPARE_OP.
  • Control: JUMP_FORWARD, POP_JUMP_IF_FALSE, FOR_ITER.
  • Calls: CALL, RETURN_VALUE, MAKE_FUNCTION.
  • Build: BUILD_LIST, BUILD_TUPLE, BUILD_MAP.
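The dis module exposes the opcode table, so you can inspect it on your own build:

import dis

print(len(dis.opmap))           # number of named opcodes in this CPython build
print(sorted(dis.opmap)[:10])   # a few opcode names
print(dis.opmap["BINARY_OP"])   # numeric value of a specific opcode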

3.3 Stack-based

Python's VM is stack-based. No registers.

a + b:

LOAD_FAST  0  # stack: [a]
LOAD_FAST  1  # stack: [a, b]
BINARY_OP  0  # pop a, b -> push a+b
RETURN_VALUE  # return top

Pros: simple instructions, easy compile. Cons: more instructions than register-based, less JIT-friendly.

The JVM and CLR are stack-based; Lua (since 5.0) and Dalvik are register-based.

3.4 Python 3.11 changes

Before 3.11, each binary operator was its own opcode (BINARY_ADD, BINARY_SUB, ...). 3.11 unified them:

LOAD_FAST  0
LOAD_FAST  1
BINARY_OP  0  # 0=+, 5=*, ...

Unification makes PEP 659 specialization easier — one adaptive opcode to specialize.

3.5 Larger example

def fib(n):
    if n < 2:
        return n
    return fib(n-1) + fib(n-2)

Dispatches through recursive calls, branches, arithmetic — all visible in dis.dis(fib).
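A quick way to see that mix without reading the full listing: tally the opcode names (exact opcodes vary by version):

import dis
from collections import Counter

def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

# The CALL, comparison/jump, BINARY_OP and LOAD_* opcodes all show up in the tally.
print(Counter(instr.opname for instr in dis.get_instructions(fib)))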


4. ceval.c — the heart of Python

4.1 File overview

Python/ceval.c is the most important file in CPython: several thousand lines of C. Since 3.12 the individual opcode bodies are written in Python/bytecodes.c (a C-like DSL) and generated into generated_cases.c.h, which ceval.c includes.

Key functions:

  • _PyEval_EvalFrameDefault(): main interpreter loop.
  • PyEval_EvalCode(): top-level evaluation entry.

4.2 Frame execution

On call:

  1. Create a new frame object.
  2. Initialize locals.
  3. Call _PyEval_EvalFrameDefault(frame).
  4. Run the loop.
  5. Free the frame on return.

A simplified view of the internal frame struct (3.12):

typedef struct _PyInterpreterFrame {
    PyFunctionObject *f_funcobj;
    PyObject *f_builtins;
    PyObject *f_globals;
    PyObject *f_locals;
    PyCodeObject *f_code;
    PyObject *frame_obj;
    struct _PyInterpreterFrame *previous;
    _Py_CODEUNIT *prev_instr;
    int stacktop;
    bool is_entry;
    char owner;
    PyObject *localsplus[1];
} _PyInterpreterFrame;

localsplus fuses locals and eval stack into one array for speed.

4.3 Main loop

PyObject *
_PyEval_EvalFrameDefault(_PyInterpreterFrame *frame, int throwflag)
{
    _Py_CODEUNIT *next_instr = frame->prev_instr + 1;
    PyObject **stack_pointer = _PyFrame_GetStackPointer(frame);

    while (1) {
        _Py_CODEUNIT word = *next_instr++;
        int opcode = _Py_OPCODE(word);
        int oparg = _Py_OPARG(word);

        switch (opcode) {
            case LOAD_FAST: {
                PyObject *value = frame->localsplus[oparg];
                Py_INCREF(value);
                *stack_pointer++ = value;
                DISPATCH();
            }
            case BINARY_OP: {
                PyObject *right = *(--stack_pointer);
                PyObject *left = *(--stack_pointer);
                PyObject *result = binary_op(left, right, oparg);
                Py_DECREF(left); Py_DECREF(right);
                if (result == NULL) goto error;
                *stack_pointer++ = result;
                DISPATCH();
            }
            case RETURN_VALUE: {
                return *(--stack_pointer);
            }
        }
    }
}

4.4 Computed GOTO

switch can be slow: one shared jump point = poor branch prediction. GCC/Clang support computed goto:

#define DISPATCH() goto *opcode_targets[opcode];

Each handler ends with its own indirect jump, giving the CPU predictor per-opcode branch history. Typically 10-15% faster than switch. Enabled via USE_COMPUTED_GOTOS, default on GCC/Clang.

4.5 Dispatch overhead

Per-opcode fetch-decode-dispatch takes a few nanoseconds. Over millions of executed opcodes that adds up to milliseconds of pure overhead before any useful work happens. Compiled C has no dispatch step at all, which is why an equivalent C loop finishes orders of magnitude sooner. This dispatch tax is a major chunk of Python's slowness.


5. Frame and call stack

5.1 Role of frame

Each call = one frame = "state of one running function".

5.2 Python 3.11 frame improvements

  • Before: each frame was a heap object (malloc).
  • 3.11+: frames on the C stack; heap only when required (generators, tracebacks).

Result: 50% faster function calls.

5.3 Tracebacks

Exception unwinding walks the frame stack.

5.4 sys._getframe

import sys
frame = sys._getframe()
print(frame.f_code.co_name)
print(frame.f_locals)
print(frame.f_back)

Used by debuggers, profilers, logging.

5.5 f-strings and frame

f"{x}" compiles at compile time but values come from the runtime frame's locals.


6. Specializing Adaptive Interpreter (PEP 659)

Python 3.11's biggest innovation.

6.1 Problem

Consider LOAD_ATTR for obj.name:

  1. Walk type(obj) MRO.
  2. Find descriptor.
  3. Apply descriptor protocol.
  4. Fall back to obj.__dict__.
  5. Finally __getattr__.

Complex. But in practice, obj is usually the same type and name is a plain dict key. Cacheable.

6.2 Inline cache

Inline cache stores a cache slot next to each opcode. Used by V8, HotSpot. CPython 3.11+ adds adaptive opcodes:

LOAD_ATTR  0   <- generic opcode

First hit:
  - Profile: record type + offset
  - Rewrite opcode to LOAD_ATTR_INSTANCE_VALUE

Next hits:
  - Check type version tag
  - Match: direct offset access (fast)
  - Miss: fall back to LOAD_ATTR

6.3 Cache structure

[LOAD_ATTR opcode (2 bytes)] [cache slot (several bytes)]

Cache slot holds the expected type version tag, dict offset or descriptor cache, and hit/miss counters. Inline because it lives in the bytecode stream itself.
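dis can display those inline slots; they show up as CACHE entries immediately after the instruction they belong to:

import dis

def get_name(obj):
    return obj.name

dis.dis(get_name, show_caches=True)  # CACHE pseudo-instructions follow LOAD_ATTR (3.11+)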

6.4 Specialization variants

LOAD_ATTR variants:

  • LOAD_ATTR_INSTANCE_VALUE
  • LOAD_ATTR_WITH_HINT
  • LOAD_ATTR_SLOT
  • LOAD_ATTR_MODULE
  • LOAD_ATTR_CLASS
  • LOAD_ATTR_METHOD_WITH_VALUES
  • LOAD_ATTR_PROPERTY

Each has a separate C path — fast when its assumption holds.

6.5 BINARY_OP specializations

BINARY_OP_ADD_INT
BINARY_OP_ADD_FLOAT
BINARY_OP_ADD_UNICODE
BINARY_OP_MULTIPLY_INT
BINARY_OP_SUBTRACT_INT

If x + y always sees ints, swap to BINARY_OP_ADD_INT. Check types, do raw int add. 10x+ faster.
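You can observe the rewrite from Python. This sketch warms a function up with int arguments and then asks dis for the adaptive (specialized) form; exact opcode names and warm-up thresholds vary by version:

import dis

def add(a, b):
    return a + b

for _ in range(5000):        # enough calls for the adaptive interpreter to specialize
    add(1, 2)

dis.dis(add, adaptive=True)  # BINARY_OP should now appear as BINARY_OP_ADD_INT (3.11+)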

6.6 De-optimization

add(1, 2)       # BINARY_OP_ADD_INT
add(1.0, 2.0)   # int assumption fails, fallback

After enough misses, de-optimize back to the generic opcode. Re-specialization possible later.

6.7 Results

  • pyperformance: 25%+ avg speedup.
  • Some benches: 60%+.
  • Django template rendering: ~15% faster.

Zero code changes required.

6.8 Significance

PEP 659 proved major perf wins without JIT — refuting "Python needs a JIT to be fast". JIT still helps further but pure interpreter optimization had runway left.


7. Python 3.12 — Per-interpreter GIL

7.1 Subinterpreters

CPython has long supported subinterpreters: multiple independent Python states per process. But they all shared one GIL — no parallelism.

7.2 PEP 684

Per-interpreter GIL (Python 3.12). Each subinterpreter has its own GIL.

Process
 ├── Main interpreter (GIL 1)
 ├── Sub 1 (GIL 2)   <- real parallel execution
 └── Sub 2 (GIL 3)   <- real parallel execution

7.3 Use from Python

A public Python-level API is still settling: Python 3.13 ships only the private low-level _interpreters module, and PEP 734 adds a public concurrent.interpreters module in 3.14. Illustrative usage per PEP 734:

from concurrent import interpreters  # PEP 734, 3.14+; 3.13 exposes only the private _interpreters

interp = interpreters.create()
interp.exec("""
import math
print(math.pi)
""")

Real parallel Python code.

7.4 Limits

  • Shared objects restricted: pass data via channels.
  • C extensions: many don't yet support per-interpreter state.
  • Memory cost: each interpreter has its own state.

A middle ground between multiprocessing and threads — useful for plugin systems and sandboxes.


8. Python 3.13 — Free-threaded build

8.1 Sam Gross's nogil

In 2021 Sam Gross published a "nogil" branch — CPython without a GIL, minimal perf loss. After years of refinement, PEP 703 was accepted.

8.2 PEP 703

Python 3.13 (Oct 2024): optional free-threaded build.

./configure --disable-gil
make
./python

  • No GIL.
  • Real parallel threads.
  • Built-in objects become thread-safe (new locks).
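Two quick runtime checks for whether you are on such a build (both added in 3.13):

import sys
import sysconfig

print(sysconfig.get_config_var("Py_GIL_DISABLED"))  # 1 on a free-threaded build, 0 or None otherwise
if hasattr(sys, "_is_gil_enabled"):
    print(sys._is_gil_enabled())                     # False when the GIL is actually off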

8.3 Technical challenges

Refcounting: guarded by the GIL before. Solution: biased reference counting — each object has an owner thread that modifies refs without locks; other threads use atomics.

Dicts and lists: per-object locks (critical sections) plus lock-free fast paths for common reads, rather than one global lock.

GC: the cyclic collector has to coordinate all threads; the free-threaded build pauses them during collection.

8.4 Performance

  • Single-threaded: ~5-10% slower (atomic overhead).
  • Multi-threaded CPU: linear scaling with N threads.

Tradeoff: small single-threaded loss for massive multi-thread gain.

8.5 Status

Experimental in 3.13 and not the default: you build with --disable-gil or install a separately published free-threaded binary (typically named with a t suffix, e.g. python3.13t). Under PEP 703's phased plan it becomes the default only once the ecosystem is ready, several releases out. Ecosystem compatibility is the main hurdle.


9. Python 3.13 — Copy-and-Patch JIT

9.1 JIT's long promise

PyPy has had a JIT for years, making Python several times faster. But it is complex and has compat issues. CPython avoided JIT — until 2024.

9.2 PEP 744

Copy-and-Patch JIT, led by Brandt Bucher and Mark Shannon.

  • Template-based: pre-compiled machine-code snippets per opcode.
  • Ultra-fast compilation: just copy snippets and patch addresses.
  • Tier 2 optimizer: Tier 1 = interpreter, only hot paths get JIT.

9.3 The idea

Normal JIT = run LLVM at runtime = milliseconds per compile. Copy-and-patch = LLVM at build time, runtime does only memcpy + address patching.

// Build-time template (pseudo):
const char LOAD_FAST_template[] = {
    // mov rax, [rdi + <FRAME_OFFSET_PLACEHOLDER>]  (disp32 form: ModRM 0x87, 4 placeholder bytes start at offset 3)
    0x48, 0x8B, 0x87, 0x00, 0x00, 0x00, 0x00,
    // push rax
    0x50,
};

// Runtime:
memcpy(code_buffer, LOAD_FAST_template, sizeof(LOAD_FAST_template));
patch_offset(code_buffer + 3, actual_offset);

Microseconds to generate machine code. 1000x faster than LLVM JIT.

9.4 Tier 2 optimizer

  • Tier 1: interpreter + inline caches.
  • Tier 2: detect hot code and translate it into traces of micro-ops (uops) that can be optimized and combined.
  • Tier 2 JIT: turn Tier 2 into machine code via copy-and-patch.

Only the hottest code becomes native.

9.5 Performance

Early results: roughly +5-10% on top of PEP 659. The Python 3.13 JIT is experimental and off by default (a build-time --enable-experimental-jit option). Expected to mature in 3.14 and go mainstream around 3.15-3.16. End goal: a gradual approach toward C-level speed.


10. Generators and coroutines

10.1 Generator function

def count_up():
    n = 0
    while True:
        yield n
        n += 1

A function with yield returns a generator object.

10.2 Frame suspension

The trick: the frame is kept alive. On yield, the frame is suspended — preserved until the next next(). _PyInterpreterFrame lives on the heap and prev_instr points just past the yield.
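You can poke at the suspended frame from Python (a small sketch using the long-standing gi_frame and f_lasti attributes):

def count_up():
    n = 0
    while True:
        yield n
        n += 1

g = count_up()
print(next(g), next(g))     # 0 1
print(g.gi_frame)           # the frame object is still alive between next() calls
print(g.gi_frame.f_lasti)   # bytecode offset where execution is paused (at the yield)
print(g.gi_frame.f_locals)  # locals survive too: {'n': 1}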

10.3 Coroutines and async

async def fetch():
    data = await download()
    return process(data)

async def generalizes generators. await generalizes yield. Same frame-suspension mechanism under the hood. asyncio schedules these coroutines.

10.4 Chaining

async def outer():
    return await inner()

Frame chain: outer's frame awaits inner's frame.

10.5 Async performance

Python async is cheap thanks to C-level frame suspension — lighter than threads. Still bound by the GIL for CPU parallelism.


11. Import system and bytecode caching

11.1 How import works

  1. Check sys.modules.
  2. Walk sys.meta_path finders.
  3. Finder returns a loader.
  4. Loader creates the module and runs the code.
  5. Register in sys.modules.
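The machinery is inspectable: sys.meta_path holds the finders consulted in step 2, in order.

import sys

print(sys.meta_path)
# Typically BuiltinImporter, FrozenImporter, PathFinder: built-in modules,
# frozen modules, then the filesystem / sys.path.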

11.2 .pyc cache

a.py  ->  __pycache__/a.cpython-312.pyc

Compare mtime/hash; load .pyc if unchanged.

11.3 Invalidation modes

python -m py_compile a.py
python -m py_compile --invalidation-mode=checked_hash a.py

Hash-based is essential for reproducible builds.

11.4 Frozen modules

Some stdlib modules are frozen — bytecode embedded in the CPython binary.

>>> import sys
>>> sys.modules['_frozen_importlib']
<module '_frozen_importlib' (frozen)>

No disk read, fast startup. Python 3.11+ freezes many stdlib modules.


12. Exception handling

12.1 Try bytecode (pre-3.11)

SETUP_FINALLY   target
  risky()
POP_BLOCK
  JUMP_FORWARD
target:
  <handler path>

Every try entry had a SETUP_FINALLY opcode.

12.2 Python 3.11: zero-cost exceptions

3.11 removes the overhead. Uses an exception table instead:

exception_table:
  [bytecode range] -> [handler location, stack_depth]

Consulted only when an exception actually fires — like Rust's DWARF-based unwinding. No hot-path cost.
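dis makes the table visible. In 3.11+ the disassembly of any function with a try block ends with an ExceptionTable section, and there is no setup opcode on the happy path:

import dis

def f():
    try:
        risky()          # NameError at runtime is fine; we only disassemble
    except ValueError:
        return None

dis.dis(f)   # no SETUP_FINALLY; an "ExceptionTable:" section appears at the end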

12.3 Raise and unwinding

On raise:

  1. Create the exception object.
  2. Walk frames, consult exception tables.
  3. No handler: terminate.
  4. Handler found: unwind and run.

13. Tuning and optimization

13.1 Finding what is slow

cProfile:

import cProfile
cProfile.run("myfunc()")

line_profiler, py-spy (sampling, attach to running process).

13.2 General tips

Built-ins:

sum(xs)  # fast

Local > global:

def f():
    sqrt = math.sqrt
    for x in data:
        y = sqrt(x)

LOAD_FAST beats LOAD_GLOBAL.
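A rough micro-benchmark of that tip (numbers are illustrative; 3.11+'s specialization of LOAD_GLOBAL narrows the gap):

import math
import timeit

data = list(range(10_000))

def use_global():
    total = 0.0
    for x in data:
        total += math.sqrt(x)     # LOAD_GLOBAL math + LOAD_ATTR sqrt on every iteration
    return total

def use_local():
    sqrt = math.sqrt              # bound once: the loop body uses LOAD_FAST
    total = 0.0
    for x in data:
        total += sqrt(x)
    return total

print(timeit.timeit(use_global, number=200))
print(timeit.timeit(use_local, number=200))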

Comprehensions are optimized. Avoid unnecessary list() over generators.

13.3 Cython / PyPy / C extensions

  • Cython: Python + type annotations -> C. 10-100x.
  • PyPy: drop-in alternative with a JIT. Compat caveats.
  • C extensions: max perf, more dev effort.
  • FFI (cffi, ctypes): wrap existing C libs.

13.4 Newer alternatives

Mojo, Codon, Nuitka.


14. Exploration tools

14.1 dis

import dis
dis.dis(my_function)
dis.show_code(my_function.__code__)

14.2 gc

import gc
gc.get_objects()
gc.collect()
gc.get_threshold()

14.3 sys

sys.getsizeof(obj)
sys.getrefcount(obj)
sys.settrace(tracer)

14.4 tracemalloc

import tracemalloc
tracemalloc.start()
snapshot = tracemalloc.take_snapshot()

14.5 Reading CPython source

The best way to learn. GitHub: python/cpython. Key files:

  • Python/ceval.c: interpreter loop.
  • Objects/: built-in types.
  • Python/compile.c: AST to bytecode.
  • Include/internal/: internal headers.

15. Timeline

  • Python 3.6 (2016): f-strings.
  • Python 3.9 (2020): PEG parser.
  • Python 3.10 (2021): pattern matching, better errors.
  • Python 3.11 (2022): specializing adaptive interpreter (PEP 659). 25% faster.
  • Python 3.12 (2023): per-interpreter GIL.
  • Python 3.13 (2024): free-threaded build, experimental JIT.
  • Python 3.14 (2025): JIT expansion, free-threaded maturation.

Performance trajectory from 3.10 baseline: 3.11 ~1.25x, 3.12 ~1.3x, 3.13 ~1.4x with JIT, 3.15 targeted 2-3x. Faster CPython project aims at 5x 3.10.


16. Learning resources

Books: "CPython Internals" (Anthony Shaw). "Python Internals for Developers" (Obi Ike-Nwosu). Online: Faster CPython project on GitHub; Łukasz Langa's blog; Brett Cannon's import series. Talks: "CPython from the Inside Out" (Philip Guo); PyCon / EuroPython internals tracks. Code: python/cpython. PEP 659, 703, 744. Alternatives: PyPy (JIT), MicroPython, RustPython, Pyston.


17. Quiz

Q1. Is calling Python an "interpreted language" accurate?

A. Partially. Python source is compiled to bytecode (that is what .pyc holds). At runtime a bytecode VM (ceval.c's dispatch loop) interprets it — just not to machine code. Java and C# do the same (JVM, CLR). The difference is they JIT to native; CPython was a pure interpreter for a long time. Python 3.13 adds an experimental copy-and-patch JIT, so the "pure interpreter" era is ending.

Q2. Why is computed GOTO faster than switch?

A. Branch prediction accuracy. A switch dispatches from one jump site — the predictor struggles because the next target depends on the opcode stream. Computed GOTO puts an indirect jump at the end of each handler, so each jump site has independent branch history. If LOAD_FAST often follows LOAD_FAST, that site learns the pattern. Typically 10-15% faster. Uses GCC/Clang &&label. Standard in Python, Ruby, Lua, V8.

Q3. How does PEP 659 specializing adaptive interpreter work?

A. Inline-cache-based runtime opcode rewriting. Each opcode has adjacent cache slots. First hit profiles type/offset, then rewrites the opcode in place to a specialized variant, e.g. LOAD_ATTR -> LOAD_ATTR_INSTANCE_VALUE. Subsequent hits verify a cheap type version tag; match means direct offset access (5-10x faster). Miss triggers de-optimization. Basically the V8/HotSpot inline-cache trick applied to a pure interpreter. Main driver of 25%+ perf in Python 3.11.

Q4. Why does the Python 3.13 free-threaded build slow down single-threaded code?

A. Atomic refcount cost. Refcounts were previously protected by the GIL (cheap integer ops). Without the GIL, refcount updates need atomic ops (tens of ns). That overhead applies to every object access, hence the 5-10% single-threaded slowdown. Sam Gross mitigates this with biased reference counting: the owner thread updates refcounts without atomics; other threads fall back to atomic operations. Tradeoff: a small single-threaded loss for linear multi-threaded scaling.

Q5. Why is copy-and-patch JIT faster than LLVM-based JIT?

A. Separation of compile time and runtime. Normal JITs run LLVM at runtime — milliseconds per function. Copy-and-patch compiles machine-code templates at build time, runtime does memcpy + address patching (microseconds). 1000x faster than LLVM JIT; code quality lower, but "75% of JIT benefit for 25% of effort". Used by Python 3.13's Tier 2 optimizer experimentally. A small innovation (templates) disrupting established JIT infra.

Q6. How are generators implemented differently from normal functions?

A. Frame lifecycle. Normal functions free their frame on return. Generators suspend the frame on yield: it stays on the heap with prev_instr pointing just past the yield. The next next() resumes from that instruction. Locals, eval stack, everything is preserved. async def generalizes this; coroutines are generators in new clothes. This is why Python async is lightweight and why asyncio can schedule coroutines cheaply, much as Go schedules goroutines.

Q7. What did Python 3.11 "zero-cost exceptions" eliminate?

A. Bytecode overhead on try entry. Pre-3.11, entering a try ran a SETUP_FINALLY opcode every time. 3.11+ stores an exception table in the code object: "range X..Y -> handler Z", set statically at compile time. try entry becomes a no-op at runtime. Only a thrown exception pays the lookup cost. Same approach as Rust's DWARF-based unwinding. try/except no longer penalizes hot code.


Related posts:

  • "Python GIL and CPython Internals" — concurrency in the same project.
  • "JIT Compilation: V8 and JVM" — the same ideas in other languages.
  • "Rust Tokio Async Runtime" — zero-cost async taken to the extreme.
  • "Diffusion Models" — the AI workloads Python runs on top of all this.