Building an LLM from Scratch — A Stanford CS336 Style Learning Roadmap

Introduction — Why From-Scratch, Why Now
The Curriculum at a Glance
Stage 1 — Tokenizer: Implementing BPE Yourself
Stage 2 — Transformer Architecture: Understanding Attention as Code
Stage 3 — Training Infrastructure: Data Pipelines and Distributed Training
- The Data Pipeline
- Distributed Training Basics
Stage 4 — Scaling Laws: Where to Spend the Budget
Stage 5 — Alignment: SFT, RLHF, DPO
Stage 6 — Inference Optimization: Starting with the KV Cache
What Building It Yourself Teaches — and Why It Matters in the API Era
What Compute Do You Actually Need
- Do Not Skip Evaluation
The 20-Week Study Plan
Mini Project Ideas
Pitfalls and a Critical View
Closing
References

Introduction — Why From-Scratch, Why Now

Throughout the first half of 2026, one piece of educational content has consistently climbed to the top of Hacker News and GeekNews: Stanford CS336, Language Modeling from Scratch. All lecture materials and assignments are public, and students implement everything themselves — tokenizer, transformer, training loop, inference engine — without leaning on high-level library abstractions.

The timing is what makes this interesting. 2026 is the year AI coding agents went mainstream. Agents like Claude Code and Codex handle multi-hour autonomous tasks, and the phrase prompt engineering has been replaced by context engineering and loop engineering. In an era when building your own model seems less necessary than ever, a course that builds one from the ground up is exploding in popularity.

This post dissects a CS336-style curriculum stage by stage: what you learn at each step, why it matters in the API era, and a 20-week plan you can follow on your own.

The Curriculum at a Glance

A CS336-style curriculum is organized around six pillars.

+----------------+    +----------------+    +------------------+
| 1. Tokenizer   | -> | 2. Architecture | -> | 3. Training Infra |
|  BPE from zero |    |  Transformer    |    |  Data / Distributed|
+----------------+    +----------------+    +------------------+
                                                     |
                                                     v
+----------------+    +----------------+    +------------------+
| 6. Inference   | <- | 5. Alignment   | <- | 4. Scaling Laws   |
|  KV cache      |    |  SFT/RLHF/DPO  |    |  Budget planning  |
+----------------+    +----------------+    +------------------+

The order matters. Without a tokenizer there is no data; without an architecture there is nothing to train; without infrastructure you cannot train at any meaningful scale. Scaling laws tell you where to spend your budget, alignment makes a base model useful, and inference optimization lets you actually serve what you built.

Stage 1 — Tokenizer: Implementing BPE Yourself

The entry point of every LLM is the tokenizer. The first CS336 assignment is implementing a BPE (Byte Pair Encoding) tokenizer from scratch. The concept is simple: repeatedly merge the most frequent adjacent byte pair to grow a vocabulary.

from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent pair frequencies across token sequences."""
    counts = Counter()
    for tokens in corpus:
        for a, b in zip(tokens, tokens[1:]):
            counts[(a, b)] += 1
    return counts

def merge_pair(corpus, pair, new_token):
    """Merge every occurrence of pair into new_token."""
    merged = []
    for tokens in corpus:
        out, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                out.append(new_token)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        merged.append(out)
    return merged

def train_bpe(text, vocab_size):
    # Start at byte level: base vocabulary of 256
    corpus = [list(text.encode("utf-8"))]
    merges = {}
    next_id = 256
    while next_id < vocab_size:
        counts = get_pair_counts(corpus)
        if not counts:
            break
        best = counts.most_common(1)[0][0]
        merges[best] = next_id
        corpus = merge_pair(corpus, best, next_id)
        next_id += 1
    return merges

Run this naive implementation on a real corpus and you learn your first lesson immediately: it is far too slow. Every merge rescans the entire corpus, so the total cost is roughly the number of merges times the corpus size, which can stretch to days on a multi-gigabyte dataset. The fix is incremental pair-count updates backed by a priority queue and an inverted index from pairs to their positions. In the process you realize the tokenizer is not mere preprocessing but a systems engineering problem in its own right.

What you learn by building it yourself:

- How byte-level BPE sidesteps Unicode headaches (Korean text, emoji)
- The trade-off vocabulary size imposes on sequence length versus embedding table size
- Why a single rule about digits or whitespace handling can swing downstream performance
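
The byte-level point can be made concrete with a small encode/decode pair. This is a minimal sketch, not the CS336 reference solution: `encode`, `decode`, and `toy_merges` are hypothetical names, and the merge table is assumed to use the same `{(left, right): new_id}` format that `train_bpe` returns.

```python
def encode(text, merges):
    """Apply learned merges to UTF-8 bytes, earliest-learned merge first."""
    tokens = list(text.encode("utf-8"))
    while len(tokens) >= 2:
        # Among the adjacent pairs present, pick the one with the lowest merge id.
        pairs = set(zip(tokens, tokens[1:]))
        best = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if best not in merges:
            break  # nothing left to merge
        new_id, out, i = merges[best], [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
                out.append(new_id)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens


def decode(tokens, merges):
    """Rebuild each token's byte string, then decode the concatenation."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[t] for t in tokens).decode("utf-8", errors="replace")


# Hypothetical toy merge table: "a"+"b" -> 256, then 256+"c" -> 257
toy_merges = {(97, 98): 256, (256, 99): 257}
text = "abcab 한글 🙂"
ids = encode(text, toy_merges)
print(ids[:3], decode(ids, toy_merges) == text)
```

Because the base vocabulary is the 256 raw byte values, Korean text and emoji round-trip without any unknown-token fallback; they simply cost more tokens until merges covering them are learned.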

Stage 2 — Transformer Architecture: Understanding Attention as Code

Self-attention looks intimidating as math, but as code it is just a few matrix multiplications. The essence of scaled dot-product attention in PyTorch:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_head):
        super().__init__()
        assert d_model % n_head == 0
        self.n_head = n_head
        self.head_dim = d_model // n_head
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # (B, T, C) -> (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)

        # attention scores = Q @ K^T / sqrt(head_dim)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        # causal mask: no peeking at future tokens
        mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))
        attn = F.softmax(scores, dim=-1)

        out = attn @ v  # (B, n_head, T, head_dim)
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)

Add RMSNorm, a SwiGLU feed-forward layer, and RoPE (rotary position embeddings), and you have the standard 2026 decoder block. Certain things only sink in when you build them:

- A single line of causal masking is the decisive difference between a language model and an encoder
- How pre-norm versus post-norm affects training stability
- The quadratic memory cost of attention with respect to sequence length, confirmed firsthand by an OOM error
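
The quadratic cost is easy to put numbers on before the OOM error does it for you. The sketch below only counts the `(B, n_head, T, T)` score tensor that the naive implementation materializes for one layer; the batch size, head count, and 2-byte element size are assumptions chosen for illustration, and `attention_scores_bytes` is a hypothetical helper name.

```python
def attention_scores_bytes(batch, n_head, seq_len, bytes_per_elem=2):
    """Bytes for one layer's (batch, n_head, T, T) attention score tensor."""
    return batch * n_head * seq_len * seq_len * bytes_per_elem


GIB = 1024 ** 3
for seq_len in (1024, 2048, 4096, 8192):
    size = attention_scores_bytes(batch=8, n_head=12, seq_len=seq_len)
    print(f"T={seq_len:5d}: {size / GIB:6.2f} GiB per layer")
```

Under these assumed settings the score tensor grows from about 0.19 GiB at T=1024 to 12 GiB at T=8192 for a single layer: doubling the sequence length quadruples the memory.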

CS336 goes further and has students implement the core FlashAttention idea (tiling so the attention matrix never materializes in memory) as a Triton kernel. This is where intuition for the GPU memory hierarchy starts to form.
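
The tiling idea rests on an "online softmax": keys and values are processed block by block while a running maximum, a running denominator, and a running weighted sum are kept up to date. Here is a pure-Python sketch for a single query vector. It illustrates only the arithmetic, not a Triton kernel or the course's assignment code, and the function names and `block_size` default are hypothetical.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention_row(q, keys, values):
    """Reference: materialize every score, then apply softmax and average."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [_dot(q, k) * scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def tiled_attention_row(q, keys, values, block_size=2):
    """Same result, but only one block of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [_dot(q, k) * scale for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention_row(q, keys, values))
print(tiled_attention_row(q, keys, values))
```

The two functions agree up to floating-point error. A real kernel applies the same rescaling to tiles of queries and keys held in on-chip memory.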

Stage 3 — Training Infrastructure: Data Pipelines and Distributed Training

Model code is maybe 20 percent of the work. The rest is data and infrastructure.

The Data Pipeline

Turning web crawl data (Common Crawl and friends) into a trainable token stream is a pipeline in itself.

Raw HTML
   |  text extraction (boilerplate removal)
   v
Language filtering (fastText classifier)
   |
   v
Quality filtering (heuristics + classifiers)
   |
   v
Deduplication (MinHash / exact match)
   |
   v
Tokenize + shuffle + fixed-length chunks
   |
   v
Binary training shards (.bin)

Deduplication alone requires implementing MinHash LSH, and a single quality-filter threshold can swing final model performance by several percent. The consistent lesson of recent years: data work moves the needle more than architecture work.
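
To give a feel for the deduplication step, here is a minimal MinHash sketch using only the standard library. It covers signature computation and similarity estimation, not the LSH banding that makes the search scale. The shingle size, hash count, helper names, and sample documents are all assumptions made for illustration.

```python
import hashlib


def shingles(text, n=3):
    """Set of overlapping n-word windows from the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def _hash(seed, item):
    digest = hashlib.blake2b(f"{seed}:{item}".encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all items."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = ("the quick brown fox jumps over the lazy dog near the quiet "
         "river bank at dawn while birds sing")
doc_b = doc_a.replace("dawn", "dusk")  # near-duplicate
doc_c = ("gradient descent updates model weights using the loss "
         "computed on each training batch")

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print("near-duplicate:", estimated_jaccard(sig_a, sig_b))
print("unrelated:     ", estimated_jaccard(sig_a, sig_c))
```

A pipeline would then drop documents whose estimated similarity to an already-kept document exceeds a chosen threshold. That threshold is exactly the kind of knob that ends up affecting final model quality.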

Distributed Training Basics

The moment you outgrow a single GPU, you need distributed training. Three core concepts cover most of it.

Technique	What it splits	Communication cost	When to use
DDP (data parallel)	The batch	Gradient all-reduce	Model fits on one GPU
FSDP / ZeRO	Params + optimizer state	All-gather, reduce-scatter	Model does not fit on one GPU
Tensor parallel	The matmuls themselves	All-reduce per layer	Very large models, fast interconnect
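
The "fits on one GPU" column can be estimated with back-of-the-envelope arithmetic. The sketch below assumes mixed-precision Adam at 16 bytes per parameter (2 for bf16 weights, 2 for bf16 gradients, 12 for fp32 master weights plus the two Adam moments) and ignores activations entirely; `training_state_gb` is a hypothetical helper, and real frameworks will differ in the details.

```python
def training_state_gb(n_params, n_gpus=1, shard=False):
    """Approximate per-GPU memory (GB) for weights, gradients, and Adam state.

    Assumption: 2 B bf16 weights + 2 B bf16 grads + 12 B fp32 master
    weights, momentum, and variance = 16 B per parameter.
    Activations and temporary buffers are NOT included.
    """
    bytes_per_param = 2 + 2 + 12
    total = n_params * bytes_per_param
    if shard:  # FSDP / ZeRO-3 style: every state is split across GPUs
        total /= n_gpus
    return total / 1e9


for n_params in (100e6, 1e9, 7e9):
    ddp = training_state_gb(n_params)
    sharded = training_state_gb(n_params, n_gpus=8, shard=True)
    print(f"{n_params / 1e9:4.1f}B params: DDP {ddp:6.1f} GB/GPU, "
          f"sharded over 8 GPUs {sharded:5.1f} GB/GPU")
```

Under these assumptions a 1B-parameter model already needs about 16 GB per GPU with plain DDP before any activations, which is why sharding becomes attractive well below frontier scale.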

A minimal DDP training loop:

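import os

import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with torchrun, which sets RANK, LOCAL_RANK, and WORLD_SIZE.
dist.init_process_group(backend="nccl")
# Use the per-node local rank as the device index; the global rank
# exceeds the GPU count on multi-node jobs.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# GPT, config, and loader are assumed to be defined elsewhere; the loader
# should use a DistributedSampler so each rank sees a different shard.
model = GPT(config).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

for step, batch in enumerate(loader):
    x, y = batch
    x, y = x.cuda(local_rank), y.cuda(local_rank)
    logits = model(x)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    loss.backward()  # DDP all-reduces gradients during backward
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)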

Run this yourself and you will inevitably watch a training curve diverge. Learning-rate warmup, gradient clipping, bf16 mixed precision — this is the stage where you learn in your bones why those mechanisms exist.
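
Of those mechanisms, warmup is the easiest to show in isolation. Below is one common shape, linear warmup followed by cosine decay, written as a plain function of the step number. The specific values (peak 3e-4, floor 3e-5, 100 warmup steps, 1000 total steps) and the name `lr_at` are illustrative assumptions, not recommendations from the course.

```python
import math


def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


for step in (0, 50, 99, 100, 550, 999):
    print(f"step {step:4d}: lr = {lr_at(step):.2e}")
```

In a PyTorch loop, one way to apply it is to set `group["lr"] = lr_at(step)` for each entry in `optimizer.param_groups` before calling `optimizer.step()`.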

Stage 4 — Scaling Laws: Where to Spend the Budget

Scaling laws are empirical relationships between loss, model size, data volume, and compute. In practice they tell you how to split a fixed compute budget between parameters and training tokens. The headline result of the Chinchilla paper: roughly 20 training tokens per parameter is compute-optimal.

def chinchilla_optimal(compute_budget_flops):
    """Rough compute-optimal split using the 6ND approximation.
    Assumes C = 6 * N * D and D = 20 * N."""
    # C = 6 * N * (20 * N) = 120 * N^2
    n_params = (compute_budget_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Example: 8x A100 for two weeks (roughly 1.2e21 FLOPs, assuming ~40% of bf16 peak)
params, tokens = chinchilla_optimal(1.2e21)
print(f"params: {params/1e9:.2f}B, tokens: {tokens/1e9:.0f}B")
# Roughly a 3B-parameter model on 63B tokens

Plotting your own scaling curve from small experiments is the highlight of CS336. Train 1M, 10M, and 100M parameter models on the same data, plot loss on a log-log chart, and you get nearly a straight line. Extrapolating that line to predict the performance of a bigger model is a miniature version of what frontier labs actually do.
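
The fit itself is a few lines of least squares on log-transformed values. The sketch below uses synthetic points generated from a made-up law `L = 20 * N^-0.1`, so the numbers are not real measurements. A pure power law also ignores the irreducible-loss term that fuller scaling-law fits include. `fit_power_law` and `predict_loss` are hypothetical helper names.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares line through (log N, log L): log L = intercept + slope * log N."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return slope, intercept


def predict_loss(n_params, slope, intercept):
    """Extrapolate the fitted line to a new model size."""
    return math.exp(intercept + slope * math.log(n_params))


# Synthetic points from L = 20 * N^-0.1 (illustrative only, not measured)
sizes = [1e6, 1e7, 1e8]
losses = [20 * n ** -0.1 for n in sizes]

slope, intercept = fit_power_law(sizes, losses)
print(f"fitted exponent: {slope:.3f}")
print(f"extrapolated loss at 1B params: {predict_loss(1e9, slope, intercept):.3f}")
```

With real runs the points will not sit exactly on the line, and how far the prediction misses for the held-out larger model is the interesting part of the experiment.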

One caveat: Chinchilla-optimal means optimal for training compute only. Once you account for inference cost, overtraining a smaller model is often cheaper in total, and recent open models are indeed trained on far more tokens than the Chinchilla ratio suggests.

Stage 5 — Alignment: SFT, RLHF, DPO

A pretrained base model is just a next-token predictor. Making it follow instructions requires alignment.

Base model
   |
   v
SFT (supervised fine-tuning)
   |  train on high-quality instruction-response pairs
   v
Preference optimization (RLHF or DPO)
   |  steer toward responses humans prefer
   v
Final assistant model

SFT is simple: build instruction-response pairs and fine-tune with the ordinary language-modeling loss. Data quality is everything.
RLHF trains a separate reward model, then optimizes the policy with a reinforcement learning algorithm such as PPO. Powerful, but hard to implement and prone to training instability.
DPO removes the reward model and the RL loop entirely, optimizing the policy directly from preference pairs (a chosen response and a rejected one).

The core of the DPO loss in code:

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Push apart the log-prob gap between chosen and rejected
    responses, measured relative to a reference model."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

Implement it yourself and you discover that alignment is not magic but loss-function design — and that base-model quality and preference-data quality set the ceiling on results.
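
A useful sanity check when implementing the loss: at the start of training the policy equals the reference model, so the margin is zero and the loss must equal log 2 (about 0.693). The scalar version below mirrors the same formula with plain floats. The log-probability values are made up for illustration and `dpo_loss_single` is a hypothetical name.

```python
import math


def dpo_loss_single(policy_chosen_logp, policy_rejected_logp,
                    ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, using plain floats."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) == log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: margin is 0, loss is log(2)
at_init = dpo_loss_single(-12.0, -15.0, -12.0, -15.0)
# Policy now favors the chosen response more than the reference does
improved = dpo_loss_single(-10.0, -18.0, -12.0, -15.0)

print(f"at init:  {at_init:.4f}")
print(f"improved: {improved:.4f}")
```

If the first logged training loss is far from 0.693, the usual suspects are swapped chosen/rejected inputs or a reference model that is not actually a frozen copy of the starting policy.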

Stage 6 — Inference Optimization: Starting with the KV Cache

The moment you serve a trained model, a completely different engineering problem begins. Autoregressive generation naively recomputes the whole sequence for every new token; the KV cache fixes that.

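class KVCache:
    """Preallocated key/value buffers for one attention layer."""

    def __init__(self, batch, n_head, max_len, head_dim, device):
        shape = (batch, n_head, max_len, head_dim)
        self.k = torch.zeros(shape, device=device, dtype=torch.bfloat16)
        self.v = torch.zeros(shape, device=device, dtype=torch.bfloat16)
        self.max_len = max_len
        self.pos = 0  # number of positions filled so far

    def update(self, k_new, v_new):
        # k_new, v_new: (batch, n_head, t, head_dim) for the t newest tokens
        t = k_new.size(2)
        if self.pos + t > self.max_len:
            raise ValueError("KV cache overflow: sequence exceeds max_len")
        self.k[:, :, self.pos:self.pos + t] = k_new
        self.v[:, :, self.pos:self.pos + t] = v_new
        self.pos += t
        # Return views over the filled prefix only
        return self.k[:, :, :self.pos], self.v[:, :, :self.pos]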

With the cache attached, each decoding step computes attention only for the new token against the cached keys and values, so per-token attention compute drops from quadratic to linear in sequence length. The price is memory: the cache grows with batch size, layer count, and sequence length. Once you feel that trade-off firsthand, you understand naturally why GQA (grouped-query attention) appeared and why PagedAttention in vLLM was such a breakthrough. Inference optimization deserves its own post, but the key insight is that decoding has a completely different bottleneck from training: memory bandwidth rather than raw compute.
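
The memory side of that trade-off is simple arithmetic. The sketch below counts bytes for the K and V tensors across all layers. The model shape (32 layers, head dimension 128, 4096-token context, 2-byte elements) is a hypothetical configuration chosen for round numbers, and `kv_cache_bytes` is not an API from any library.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Total KV cache size: two tensors (K and V) per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


GIB = 1024 ** 3
# Multi-head attention: every query head has its own K/V head
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
# Grouped-query attention: 32 query heads share 8 K/V heads
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)

print(f"MHA: {mha / GIB:.2f} GiB per sequence")
print(f"GQA: {gqa / GIB:.2f} GiB per sequence")
```

For this assumed shape the cache costs 2 GiB per sequence with full multi-head attention and 0.5 GiB with 8 shared K/V heads. The savings multiply with batch size, which is what limits how many requests a server can hold in flight.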

What Building It Yourself Teaches — and Why It Matters in the API Era

A fair objection: in an era when agents write the code, why implement attention by hand?

First, the quality of your debugging and decision-making changes. Why does cost grow quadratically when you extend context length? Why is a model weak at arithmetic because of its tokenizer? Should you fine-tune or use RAG? People who know the internals answer these practical questions completely differently from people who do not.

Second, your ability to direct AI agents scales with your understanding of their internals. Context engineering — the headline skill of 2026 — can only become precise on top of an understanding of how models actually consume context: attention, position embeddings, the KV cache.

Third, abstractions always leak. Even a model hidden behind an API exposes its internals through tokenizer boundaries, context limits, and sampling parameters. Having built one from scratch is insurance against panic when you hit those leaks.

What Compute Do You Actually Need

GPU cost is the most cited barrier to from-scratch learning, but the compute required per stage is smaller than people think.

Stage	Compute needed	Realistic option	Rough cost
Tokenizer, attention implementation	CPU is enough	Local laptop	Free
Char-level LM, 10M model	One GPU (8GB VRAM)	Colab free/Pro, local gaming GPU	Free to ~15 USD/month
100M model pretraining	One GPU (24GB VRAM)	Cloud spot instances	Tens of dollars
1B-scale experiments, distributed	2 to 8 GPUs	Hourly GPU cloud	Hundreds of dollars per run
SFT, DPO (1B or below)	One GPU (24GB VRAM)	Even smaller with LoRA	Tens of dollars

The first eight weeks are essentially free. Environment setup is minimal:

# Virtual environment and core dependencies
python -m venv .venv
source .venv/bin/activate
pip install torch numpy tiktoken datasets wandb

# Get a feel for training with nanoGPT
git clone https://github.com/karpathy/nanoGPT.git
cd nanoGPT
python data/shakespeare_char/prepare.py
python train.py config/train_shakespeare_char.py

Get into the habit of logging every run from day one. Loss curves, learning rates, and gradient norms recorded per experiment become a huge asset when you later need to track down a divergence.

Do Not Skip Evaluation

You must evaluate your own model objectively. A minimal evaluation setup:

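@torch.no_grad()
def estimate_perplexity(model, loader, max_batches=50):
    """Token-weighted perplexity over at most max_batches batches."""
    model.eval()
    total_nll, n_tokens = 0.0, 0
    for i, (x, y) in enumerate(loader):
        if i >= max_batches:
            break
        logits = model(x.cuda())
        # Sum (not mean) so uneven batches are weighted by token count
        loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)), y.cuda().view(-1),
            reduction="sum",
        )
        total_nll += loss.item()
        n_tokens += y.numel()
    model.train()
    return math.exp(total_nll / max(n_tokens, 1))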

Note that perplexity is only comparable between models sharing the same tokenizer. Different vocabularies change the very meaning of per-token loss.
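
One common workaround is to normalize by bytes of raw text instead of tokens. The toy calculation below uses invented numbers: the same 1,000-byte text and the same total negative log-likelihood, split into 200 tokens by one tokenizer and 400 by another. The helper names are hypothetical.

```python
import math


def perplexity(total_nll_nats, n_tokens):
    """Per-token perplexity: depends on how the text was tokenized."""
    return math.exp(total_nll_nats / n_tokens)


def bits_per_byte(total_nll_nats, n_bytes):
    """Per-byte loss in bits: independent of the tokenizer."""
    return total_nll_nats / (math.log(2) * n_bytes)


# Hypothetical numbers for illustration only
total_nll = 600.0          # summed negative log-likelihood, in nats
n_bytes = 1000             # UTF-8 length of the evaluation text
coarse_tokens, fine_tokens = 200, 400

print(f"perplexity, 200 tokens: {perplexity(total_nll, coarse_tokens):.2f}")
print(f"perplexity, 400 tokens: {perplexity(total_nll, fine_tokens):.2f}")
print(f"bits per byte (both):   {bits_per_byte(total_nll, n_bytes):.4f}")
```

Both hypothetical models assign the text the same total probability, yet their perplexities differ by more than a factor of four. Bits per byte gives the same value for both.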

The 20-Week Study Plan

This self-study plan assumes 8 to 10 hours per week.

Week	Topic	Core activity	Deliverable
1	Setup and baseline	PyTorch, GPU env, char-level LM	Char-level bigram model
2	BPE tokenizer 1	Naive BPE implementation	Working training code
3	BPE tokenizer 2	Priority-queue optimization, encode/decode	Production-speed tokenizer
4	Attention basics	Single-head attention, causal mask	Attention module
5	Transformer block	Multi-head, RMSNorm, SwiGLU, RoPE	Decoder block
6	Full model assembly	GPT class, weight initialization	10M-parameter model
7	Training loop	AdamW, cosine schedule, clipping	Stable training curve
8	Mixed precision and profiling	bf16, torch.compile, throughput	Tokens-per-second report
9	Data pipeline 1	Text extraction, language filter	Cleaning scripts
10	Data pipeline 2	MinHash dedup, sharding	Training dataset
11	Midpoint project	Pretrain a 100M-scale model	Your own base model
12	Scaling experiments	Fit loss curves from 1M to 100M	Scaling chart
13	Distributed training 1	DDP, multi-GPU training	Distributed training script
14	Distributed training 2	FSDP or ZeRO concepts and experiments	Memory usage comparison
15	SFT	Build instruction data, fine-tune	Instruction-following model
16	DPO	Preference data, DPO loss implementation	Aligned model
17	Evaluation	Perplexity, benchmarks, LLM judge	Evaluation report
18	Inference optimization 1	KV cache, sampling implementation	Your own generation engine
19	Inference optimization 2	Batching, quantization concepts, speed	Latency/throughput report
20	Final project	Clean up the pipeline, write it up	Public repo + retrospective

Mini Project Ideas

Small, self-contained projects in the spirit of nanoGPT give the best learning efficiency.

- Nano Shakespeare: generate Shakespeare-style text with a char-level model. Done in four hours, and the whole picture clicks.
- A Korean BPE tokenizer: build a 32k-vocab tokenizer from a Korean Wikipedia dump and compare compression ratio against existing tokenizers.
- Your own nanoGPT fork: add RoPE, RMSNorm, and SwiGLU yourself and compare loss curves against the original.
- A scaling mini-lab: fit loss curves on three model sizes, predict the loss of a model ten times bigger, then verify it.
- A DPO exercise: align a model of 1B or smaller on a public preference dataset and blind-compare before/after responses.
- A mini inference server: implement a generation server with KV cache and dynamic batching, then measure concurrent-request throughput.

Pitfalls and a Critical View

From-scratch learning has its own traps.

- The compute illusion: you cannot build a frontier-grade model at home. The goal is internalizing the mechanics, not reproducing SOTA. A 100M-scale model teaches you everything you need.
- Drowning in wheel reinvention: spending four weeks optimizing the tokenizer misses the point. Discipline matters: get each stage working, then move on.
- Distance from production code: educational implementations are simplified. For real serving, use a battle-tested engine like vLLM. Building your own is the foundation for understanding and tuning that engine.
- Overrated resume effect: what gets evaluated is not the from-scratch experience itself but the writing, code, and measurements that came out of it. Publish your artifacts.

Closing

Building an LLM from scratch remains a worthwhile investment in 2026 — perhaps especially in 2026. The more agents write our code, the more quietly the gap widens beneath the abstraction layer between people who know the bottom of the system and people who do not. With excellent open materials like CS336 and nanoGPT available, start lightly from week one of the plan. The moment a char-level model first emits a plausible sentence is a feeling no API call can ever give you.

References

CS336 official page: https://cs336.stanford.edu/
nanoGPT (Karpathy): https://github.com/karpathy/nanoGPT
Karpathy, Zero to Hero lectures: https://karpathy.ai/zero-to-hero.html
Attention Is All You Need: https://arxiv.org/abs/1706.03762
Chinchilla scaling laws paper: https://arxiv.org/abs/2203.15556
DPO paper: https://arxiv.org/abs/2305.18290
FlashAttention paper: https://arxiv.org/abs/2205.14135
vLLM documentation: https://docs.vllm.ai/
CS336 discussion on Hacker News: https://news.ycombinator.com/item?id=44046428
GeekNews: https://news.hada.io/