- Published on
Building an LLM from Scratch — A Stanford CS336 Style Learning Roadmap
- Authors

- Name
- Youngju Kim
- @fjvbn20031
Introduction — Why From-Scratch, Why Now
Throughout the first half of 2026, one piece of educational content has consistently climbed to the top of Hacker News and GeekNews: Stanford CS336, Language Modeling from Scratch. All lecture materials and assignments are public, and students implement everything themselves — tokenizer, transformer, training loop, inference engine — without leaning on high-level library abstractions.
The timing is what makes this interesting. 2026 is the year AI coding agents went mainstream. Agents like Claude Code and Codex handle multi-hour autonomous tasks, and the phrase prompt engineering has been replaced by context engineering and loop engineering. In an era when building your own model seems less necessary than ever, a course that builds one from the ground up is exploding in popularity.
This post dissects a CS336-style curriculum stage by stage: what you learn at each step, why it matters in the API era, and a 20-week plan you can follow on your own.
The Curriculum at a Glance
A CS336-style curriculum is organized around six pillars.
+------------------+    +------------------+    +----------------------+
| 1. Tokenizer     | -> | 2. Architecture  | -> | 3. Training Infra    |
| BPE from zero    |    | Transformer      |    | Data / Distributed   |
+------------------+    +------------------+    +----------------------+
                                                           |
                                                           v
+------------------+    +------------------+    +----------------------+
| 6. Inference     | <- | 5. Alignment     | <- | 4. Scaling Laws      |
| KV cache         |    | SFT/RLHF/DPO     |    | Budget planning      |
+------------------+    +------------------+    +----------------------+
The order matters. Without a tokenizer there is no data; without an architecture there is nothing to train; without infrastructure you cannot train at any meaningful scale. Scaling laws tell you where to spend your budget, alignment makes a base model useful, and inference optimization lets you actually serve what you built.
Stage 1 — Tokenizer: Implementing BPE Yourself
The entry point of every LLM is the tokenizer. The first CS336 assignment is implementing a BPE (Byte Pair Encoding) tokenizer from scratch. The concept is simple: repeatedly merge the most frequent adjacent byte pair to grow a vocabulary.
from collections import Counter


def get_pair_counts(corpus):
    """Count adjacent pair frequencies across token sequences."""
    counts = Counter()
    for tokens in corpus:
        for a, b in zip(tokens, tokens[1:]):
            counts[(a, b)] += 1
    return counts


def merge_pair(corpus, pair, new_token):
    """Merge every occurrence of pair into new_token."""
    merged = []
    for tokens in corpus:
        out, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                out.append(new_token)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        merged.append(out)
    return merged


def train_bpe(text, vocab_size):
    # Start at byte level: base vocabulary of 256
    corpus = [list(text.encode("utf-8"))]
    merges = {}
    next_id = 256
    while next_id < vocab_size:
        counts = get_pair_counts(corpus)
        if not counts:
            break
        best = counts.most_common(1)[0][0]
        merges[best] = next_id
        corpus = merge_pair(corpus, best, next_id)
        next_id += 1
    return merges
Run this naive implementation on a real corpus and you learn your first lesson immediately: it is far too slow. Rescanning the entire corpus on every merge takes days on a multi-gigabyte dataset. So you end up implementing incremental updates with a priority queue and an inverted index — and in the process you realize the tokenizer is not mere preprocessing but a systems engineering problem in its own right.
What you learn by building it yourself:
- How byte-level BPE sidesteps Unicode headaches (Korean text, emoji)
- The trade-off vocabulary size imposes on sequence length versus embedding table size
- Why a single rule about digits or whitespace handling can swing downstream performance
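The first point is easy to check with a round trip. Below is a minimal sketch of encoding and decoding with a learned merge table, assuming the merges are stored as a dict from byte or token pairs to new ids. The function names and the tiny `toy_merges` table are hypothetical and written by hand for illustration. Because everything starts from UTF-8 bytes, Korean text and emoji need no special handling.

```python
def encode(text, merges):
    """Apply learned merges to UTF-8 bytes, earliest-learned merge first."""
    tokens = list(text.encode("utf-8"))
    while len(tokens) >= 2:
        pairs = set(zip(tokens, tokens[1:]))
        # Pick the present pair that was learned earliest (lowest new id).
        best = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if best not in merges:
            break  # nothing left to merge
        new_id, out, i = merges[best], [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
                out.append(new_id)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens


def decode(tokens, merges):
    """Expand every token back to raw bytes, then decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    # Walk merges in id order so both halves of each pair already exist.
    for (a, b), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[t] for t in tokens).decode("utf-8", errors="replace")


# Hypothetical merge table: "h"+"i" -> 256, then "hi"+" " -> 257.
toy_merges = {(104, 105): 256, (256, 32): 257}
sample = "hi hi 안녕 🙂"
ids = encode(sample, toy_merges)
print(len(sample.encode("utf-8")), "bytes ->", len(ids), "tokens")
print(decode(ids, toy_merges) == sample)
```

Choosing the lowest merge id at each step replays the merges in the order they were learned, which is one common way to keep encoding consistent with training.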
Stage 2 — Transformer Architecture: Understanding Attention as Code
Self-attention looks intimidating as math, but as code it is just a few matrix multiplications. The essence of scaled dot-product attention in PyTorch:
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_head):
        super().__init__()
        assert d_model % n_head == 0
        self.n_head = n_head
        self.head_dim = d_model // n_head
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # (B, T, C) -> (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        # attention scores = Q @ K^T / sqrt(head_dim)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        # causal mask: no peeking at future tokens
        mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        out = attn @ v  # (B, n_head, T, head_dim)
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)
Add RMSNorm, a SwiGLU feed-forward layer, and RoPE (rotary position embeddings), and you have the standard 2026 decoder block. Certain things only sink in when you build them:
- A single line of causal masking is the decisive difference between a decoder-style language model and a bidirectional encoder
- How pre-norm versus post-norm affects training stability
- The quadratic memory cost of attention with respect to sequence length — confirmed firsthand by an OOM error
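RoPE is the piece of the block that is hardest to trust until you test it. The sketch below is a plain-Python illustration, not the course's implementation. It assumes the adjacent-pair rotation layout from the original RoPE paper (some codebases pair dimensions differently), and the vectors are arbitrary. It checks the property that matters: the attention score between a rotated query and key depends only on their relative offset.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate each (even, odd) pair of vec by the angle pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)  # lower dimensions rotate faster
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Arbitrary illustrative vectors with an even dimension.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

score_near = dot(rope(q, 5), rope(k, 3))      # positions 5 and 3, offset 2
score_far = dot(rope(q, 105), rope(k, 103))   # same offset, shifted by 100
print(abs(score_near - score_far) < 1e-9)
```

Both scores agree up to floating-point error, which is why a model using RoPE sees relative rather than absolute position inside attention.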
CS336 goes further and has students implement the core FlashAttention idea (tiling so the attention matrix never materializes in memory) as a Triton kernel. This is where intuition for the GPU memory hierarchy starts to form.
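The tiling trick rests on an online softmax: keep a running maximum and a running denominator, and rescale what has been accumulated whenever a new tile raises the maximum. The following is a CPU-only sketch of that idea for a single query row in plain Python. It is not a Triton kernel and not the course's code. The function names and the sample data are made up for illustration.

```python
import math


def attention_row_tiled(q, keys, values, tile=2):
    """softmax(q . K^T / sqrt(d)) @ V for one query, one tile of keys at a time."""
    d = len(q)
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


def attention_row_full(q, keys, values):
    """Reference version that materializes every score at once."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


# Made-up data: one query, five keys, five 2-dimensional values.
q_demo = [0.5, -1.0, 2.0]
k_demo = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0], [3.0, 1.0, 0.5],
          [-1.0, 0.5, 0.0], [0.2, 0.2, 0.2]]
v_demo = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
tiled = attention_row_tiled(q_demo, k_demo, v_demo, tile=2)
full = attention_row_full(q_demo, k_demo, v_demo)
print(all(abs(a - b) < 1e-9 for a, b in zip(tiled, full)))
```

The tiled version only ever holds one tile of scores, which is the property a GPU kernel exploits to keep the working set in fast on-chip memory.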
Stage 3 — Training Infrastructure: Data Pipelines and Distributed Training
Model code is maybe 20 percent of the work. The rest is data and infrastructure.
The Data Pipeline
Turning web crawl data (Common Crawl and friends) into a trainable token stream is a pipeline in itself.
Raw HTML
| text extraction (boilerplate removal)
v
Language filtering (fastText classifier)
|
v
Quality filtering (heuristics + classifiers)
|
v
Deduplication (MinHash / exact match)
|
v
Tokenize + shuffle + fixed-length chunks
|
v
Binary training shards (.bin)
Deduplication alone requires implementing MinHash LSH, and a single quality-filter threshold can swing final model performance by several percent. The consistent lesson of recent years: data work moves the needle more than architecture work.
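To make the deduplication step concrete, here is a small sketch of MinHash signatures using only the standard library. It covers the signature and similarity estimate, not the LSH banding step that makes lookups scale. The function names, the word 3-gram shingling, the 64 hash functions, and the sample sentences are all illustrative assumptions.

```python
import hashlib


def shingles(text, n=3):
    """Set of word n-grams; short texts collapse to a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the quiet river bank yesterday"
doc_c = "gradient clipping and warmup keep large language model training numerically stable"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print(estimated_jaccard(sig_a, sig_b))  # near-duplicates: high
print(estimated_jaccard(sig_a, sig_c))  # unrelated text: near zero
```

In a real pipeline, the signature is split into bands that are hashed into buckets, so only documents sharing a bucket are compared.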
Distributed Training Basics
The moment you outgrow a single GPU, you need distributed training. Three core concepts cover most of it.
| Technique | What it splits | Communication cost | When to use |
|---|---|---|---|
| DDP (data parallel) | The batch | Gradient all-reduce | Model fits on one GPU |
| FSDP / ZeRO | Params, gradients, optimizer state | All-gather, reduce-scatter | Model does not fit on one GPU |
| Tensor parallel | The matmuls themselves | All-reduce per layer | Very large models, fast interconnect |
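A back-of-the-envelope estimate shows when the first row stops being an option. The sketch below assumes mixed-precision Adam at 16 bytes per parameter (2 for bf16 weights, 2 for bf16 gradients, 12 for fp32 master weights, momentum, and variance), ignores activations, and treats sharding as a perfectly even split across GPUs. The function name and the 7B example are hypothetical.

```python
def training_bytes_per_gpu(n_params, n_gpus, sharded):
    """Weights + gradients + Adam state per GPU; activations are excluded.

    Assumption: 2 B bf16 weights + 2 B bf16 grads + 12 B fp32 optimizer state.
    """
    bytes_per_param = 2 + 2 + 12
    total = n_params * bytes_per_param
    # DDP replicates everything on every GPU; full sharding splits it evenly.
    return total / n_gpus if sharded else total


ddp_gb = training_bytes_per_gpu(7e9, n_gpus=8, sharded=False) / 1e9
fsdp_gb = training_bytes_per_gpu(7e9, n_gpus=8, sharded=True) / 1e9
print(f"DDP: {ddp_gb:.0f} GB per GPU, fully sharded: {fsdp_gb:.0f} GB per GPU")
```

Under these assumptions, a 7B model needs 112 GB per GPU with plain data parallelism, more than a single 80 GB card holds, and 14 GB per GPU when fully sharded across eight.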
A minimal DDP training loop:
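import os

import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with torchrun, which sets LOCAL_RANK for every process.
# GPT, config, and loader are assumed to be defined elsewhere; each rank
# should read a different shard of the data (for example via DistributedSampler).
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = GPT(config).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

for step, batch in enumerate(loader):
    x, y = batch
    x, y = x.cuda(local_rank), y.cuda(local_rank)
    logits = model(x)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    loss.backward()  # DDP all-reduces gradients during backward
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)

dist.destroy_process_group()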
Run this yourself and you will inevitably watch a training curve diverge. Learning-rate warmup, gradient clipping, bf16 mixed precision — this is the stage where you learn in your bones why those mechanisms exist.
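Warmup is the easiest of these to write down. The sketch below shows a linear warmup followed by cosine decay as a plain function of the step number. The function name and every default value are illustrative assumptions, not recommended settings.

```python
import math


def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=200, total_steps=5000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up so early, noisy gradients take small steps.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for s in (0, 199, 200, 2600, 5000):
    print(s, f"{lr_at(s):.2e}")
```

In a training loop, one way to apply it is to set `group["lr"] = lr_at(step)` for each entry in `optimizer.param_groups` before calling `optimizer.step()`.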
Stage 4 — Scaling Laws: Where to Spend the Budget
Scaling laws are empirical rules for splitting a fixed compute budget between model size and data volume. The headline result of the Chinchilla paper: about 20 training tokens per parameter is roughly compute-optimal.
def chinchilla_optimal(compute_budget_flops):
    """Rough compute-optimal split using the 6ND approximation.

    Assumes C = 6 * N * D and D = 20 * N.
    """
    # C = 6 * N * (20 * N) = 120 * N^2
    n_params = (compute_budget_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens


# Example: 8x A100 for two weeks, roughly 1.2e21 FLOPs
# (assuming about 40% of the 312 TFLOP/s bf16 peak is actually sustained)
params, tokens = chinchilla_optimal(1.2e21)
print(f"params: {params/1e9:.2f}B, tokens: {tokens/1e9:.0f}B")
# Prints "params: 3.16B, tokens: 63B": roughly a 3B-parameter model on 63B tokens
Plotting your own scaling curve from small experiments is the highlight of CS336. Train 1M, 10M, and 100M parameter models on the same data, plot loss on a log-log chart, and you get nearly a straight line. Extrapolating that line to predict the performance of a bigger model is a miniature version of what frontier labs actually do.
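Fitting that straight line takes only a few lines of arithmetic. The sketch below does ordinary least squares in log-log space with the standard library. The loss values are synthetic, generated from a made-up power law rather than measured, and the function name is hypothetical. Real curves also flatten toward an irreducible loss, which this pure power-law fit ignores.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares line in log-log space: log L = log a + b * log N."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b


# Synthetic losses generated from L = 20 * N^-0.1 (not real measurements).
sizes = [1e6, 1e7, 1e8]
losses = [20 * n ** -0.1 for n in sizes]

a, b = fit_power_law(sizes, losses)
predicted_1b = a * 1e9 ** b  # extrapolate one order of magnitude further
print(f"L(N) ~ {a:.2f} * N^{b:.3f}, predicted loss at 1B params: {predicted_1b:.3f}")
```

With real runs, the interesting part is the gap between the extrapolated value and the loss you measure once you actually train the larger model.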
One caveat: Chinchilla-optimal means optimal for training compute only. Once you account for inference cost, overtraining a smaller model is often cheaper in total, and recent open models are indeed trained on far more tokens than the Chinchilla ratio suggests.
Stage 5 — Alignment: SFT, RLHF, DPO
A pretrained base model is just a next-token predictor. Making it follow instructions requires alignment.
Base model
|
v
SFT (supervised fine-tuning)
| train on high-quality instruction-response pairs
v
Preference optimization (RLHF or DPO)
| steer toward responses humans prefer
v
Final assistant model
- SFT is simple: build instruction-response pairs and fine-tune with the ordinary language-modeling loss. Data quality is everything.
- RLHF trains a separate reward model, then optimizes the policy with a reinforcement learning algorithm such as PPO. Powerful, but hard to implement and prone to instability.
- DPO removes the reward model and the RL loop entirely, optimizing the policy directly from preference pairs (a chosen response and a rejected one).
The core of the DPO loss in code:
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Push apart the log-prob gap between chosen and rejected
    responses, measured relative to a reference model."""
    # Each *_logps tensor holds one summed sequence log-probability per example.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()
Implement it yourself and you discover that alignment is not magic but loss-function design — and that base-model quality and preference-data quality set the ceiling on results.
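A few scalar cases make the loss-design point tangible. The sketch below recomputes the same DPO loss for a single preference pair in plain Python. The function name and the log-probability values are invented for illustration. At initialization the policy equals the reference, so the margin is zero and the loss is exactly ln 2.

```python
import math


def dpo_loss_scalar(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probs."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) is the same as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Invented log-probabilities for a single (chosen, rejected) pair.
at_init = dpo_loss_scalar(-40.0, -42.0, -40.0, -42.0)    # policy == reference
improved = dpo_loss_scalar(-35.0, -47.0, -40.0, -42.0)   # chosen up, rejected down
regressed = dpo_loss_scalar(-45.0, -37.0, -40.0, -42.0)  # the wrong direction
print(f"{at_init:.4f} {improved:.4f} {regressed:.4f}")
```

Only the change relative to the reference model matters, which is why the absolute values of the log-probabilities drop out of the comparison.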
Stage 6 — Inference Optimization: Starting with the KV Cache
The moment you serve a trained model, a completely different engineering problem begins. Autoregressive generation naively recomputes the whole sequence for every new token; the KV cache fixes that.
import torch


class KVCache:
    """Preallocated key/value buffers for one attention layer."""

    def __init__(self, batch, n_head, max_len, head_dim, device):
        shape = (batch, n_head, max_len, head_dim)
        self.k = torch.zeros(shape, device=device, dtype=torch.bfloat16)
        self.v = torch.zeros(shape, device=device, dtype=torch.bfloat16)
        self.pos = 0  # number of positions filled so far

    def update(self, k_new, v_new):
        """Append new keys/values and return everything cached so far."""
        t = k_new.size(2)
        self.k[:, :, self.pos:self.pos + t] = k_new
        self.v[:, :, self.pos:self.pos + t] = v_new
        self.pos += t
        return self.k[:, :, :self.pos], self.v[:, :, :self.pos]
With the cache attached, per-token compute during generation drops from quadratic to linear in sequence length. The price is memory. Once you feel that trade-off firsthand, you understand naturally why GQA (grouped-query attention) appeared and why PagedAttention in vLLM was such a breakthrough. Inference optimization deserves its own post, but the key insight is that inference has a completely different bottleneck from training: memory bandwidth.
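The memory price is easy to quantify. The sketch below counts cache bytes for a hypothetical configuration (32 layers, head dimension 128, 8,192 tokens of context, bf16). The function name and the configuration are assumptions chosen for round numbers, not a description of any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Total KV cache size; the leading 2 covers keys plus values per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


GIB = 2 ** 30
# Hypothetical model: every one of 32 query heads has its own K/V head.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192, batch=1)
# Same model with grouped-query attention: 8 K/V heads shared by 32 query heads.
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192, batch=1)
print(f"MHA: {mha / GIB:.0f} GiB per sequence, GQA: {gqa / GIB:.0f} GiB per sequence")
```

Under these assumptions, one 8K-token sequence costs 4 GiB with full multi-head attention and 1 GiB with 8 KV heads, and the figure scales linearly with batch size.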
What Building It Yourself Teaches — and Why It Matters in the API Era
A fair objection: in an era when agents write the code, why implement attention by hand?
First, the quality of your debugging and decision-making changes. Why does attention cost grow quadratically when you extend context length? Why can a tokenizer make a model weak at arithmetic? Should you fine-tune or use RAG? People who know the internals answer these practical questions very differently from people who do not.
Second, your ability to direct AI agents scales with your understanding of their internals. Context engineering — the headline skill of 2026 — can only become precise on top of an understanding of how models actually consume context: attention, position embeddings, the KV cache.
Third, abstractions always leak. Even a model hidden behind an API exposes its internals through tokenizer boundaries, context limits, and sampling parameters. Having built one from scratch is insurance against panic when you hit those leaks.
What Compute Do You Actually Need
GPU cost is the most cited barrier to from-scratch learning, but the compute required per stage is smaller than people think.
| Stage | Compute needed | Realistic option | Rough cost |
|---|---|---|---|
| Tokenizer, attention implementation | CPU is enough | Local laptop | Free |
| Char-level LM, 10M model | One GPU (8GB VRAM) | Colab free/Pro, local gaming GPU | Free to ~15 USD/month |
| 100M model pretraining | One GPU (24GB VRAM) | Cloud spot instances | Tens of dollars |
| 1B-scale experiments, distributed | 2 to 8 GPUs | Hourly GPU cloud | Hundreds of dollars per run |
| SFT, DPO (1B or below) | One GPU (24GB VRAM) | Even smaller with LoRA | Tens of dollars |
The first eight weeks are essentially free. Environment setup is minimal:
# Virtual environment and core dependencies
python -m venv .venv
source .venv/bin/activate
pip install torch numpy tiktoken datasets wandb
# Get a feel for training with nanoGPT
git clone https://github.com/karpathy/nanoGPT.git
cd nanoGPT
python data/shakespeare_char/prepare.py
python train.py config/train_shakespeare_char.py
Get into the habit of logging every run from day one. Loss curves, learning rates, and gradient norms recorded per experiment become a huge asset when you later need to track down a divergence.
Do Not Skip Evaluation
You must evaluate your own model objectively. A minimal evaluation setup:
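import math

import torch
import torch.nn.functional as F


@torch.no_grad()
def estimate_perplexity(model, loader, max_batches=50):
    """Perplexity from the mean per-token loss over up to max_batches batches."""
    model.eval()
    total_loss, n = 0.0, 0
    for i, (x, y) in enumerate(loader):
        if i >= max_batches:
            break
        logits = model(x.cuda())
        loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)), y.cuda().view(-1)
        )
        total_loss += loss.item()
        n += 1
    model.train()
    # Averaging batch means is exact only when all batches have the same size.
    return math.exp(total_loss / n)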
Note that perplexity is only comparable between models sharing the same tokenizer. Different vocabularies change the very meaning of per-token loss.
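One common workaround is to normalize by bytes of raw text instead of by tokens. The sketch below computes bits per byte for two hypothetical tokenizers on the same 1,000-byte text. All token counts and losses are invented to illustrate the effect, and the function name is an assumption.

```python
import math


def bits_per_byte(total_nll_nats, n_bytes):
    """Convert a summed negative log-likelihood (in nats) to bits per UTF-8 byte."""
    return total_nll_nats / (math.log(2) * n_bytes)


n_bytes = 1000  # the same evaluation text for both models

# Invented numbers: model A uses a coarse tokenizer, model B a fine-grained one.
tokens_a, mean_loss_a = 250, 2.8
tokens_b, mean_loss_b = 400, 1.9

ppl_a, ppl_b = math.exp(mean_loss_a), math.exp(mean_loss_b)
bpb_a = bits_per_byte(tokens_a * mean_loss_a, n_bytes)
bpb_b = bits_per_byte(tokens_b * mean_loss_b, n_bytes)
print(f"A: ppl {ppl_a:.1f}, {bpb_a:.3f} bits/byte")
print(f"B: ppl {ppl_b:.1f}, {bpb_b:.3f} bits/byte")
```

In this made-up case, model B has the lower perplexity, yet model A compresses the text better, which is exactly the confusion that per-token metrics invite.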
The 20-Week Study Plan
This self-study plan assumes 8 to 10 hours per week.
| Week | Topic | Core activity | Deliverable |
|---|---|---|---|
| 1 | Setup and baseline | PyTorch, GPU env, char-level LM | Char-level bigram model |
| 2 | BPE tokenizer 1 | Naive BPE implementation | Working training code |
| 3 | BPE tokenizer 2 | Priority-queue optimization, encode/decode | Production-speed tokenizer |
| 4 | Attention basics | Single-head attention, causal mask | Attention module |
| 5 | Transformer block | Multi-head, RMSNorm, SwiGLU, RoPE | Decoder block |
| 6 | Full model assembly | GPT class, weight initialization | 10M-parameter model |
| 7 | Training loop | AdamW, cosine schedule, clipping | Stable training curve |
| 8 | Mixed precision and profiling | bf16, torch.compile, throughput | Tokens-per-second report |
| 9 | Data pipeline 1 | Text extraction, language filter | Cleaning scripts |
| 10 | Data pipeline 2 | MinHash dedup, sharding | Training dataset |
| 11 | Midpoint project | Pretrain a 100M-scale model | Your own base model |
| 12 | Scaling experiments | Fit loss curves from 1M to 100M | Scaling chart |
| 13 | Distributed training 1 | DDP, multi-GPU training | Distributed training script |
| 14 | Distributed training 2 | FSDP or ZeRO concepts and experiments | Memory usage comparison |
| 15 | SFT | Build instruction data, fine-tune | Instruction-following model |
| 16 | DPO | Preference data, DPO loss implementation | Aligned model |
| 17 | Evaluation | Perplexity, benchmarks, LLM judge | Evaluation report |
| 18 | Inference optimization 1 | KV cache, sampling implementation | Your own generation engine |
| 19 | Inference optimization 2 | Batching, quantization concepts, speed | Latency/throughput report |
| 20 | Final project | Clean up the pipeline, write it up | Public repo + retrospective |
Mini Project Ideas
Small, self-contained projects in the spirit of nanoGPT give the best learning efficiency.
- Nano Shakespeare: generate Shakespeare-style text with a char-level model. Done in four hours, and the whole picture clicks.
- A Korean BPE tokenizer: build a 32k-vocab tokenizer from a Korean Wikipedia dump and compare compression ratio against existing tokenizers.
- Your own nanoGPT fork: add RoPE, RMSNorm, and SwiGLU yourself and compare loss curves against the original.
- A scaling mini-lab: fit loss curves on three model sizes, predict the loss of a model ten times bigger, then verify it.
- A DPO exercise: align a model of 1B or smaller on a public preference dataset and blind-compare before/after responses.
- A mini inference server: implement a generation server with KV cache and dynamic batching, then measure concurrent-request throughput.
Pitfalls and a Critical View
From-scratch learning has its own traps.
- The compute illusion: you cannot build a frontier-grade model at home. The goal is internalizing the mechanics, not reproducing SOTA. A 100M-scale model teaches you everything you need.
- Drowning in wheel reinvention: spending four weeks optimizing the tokenizer misses the point. Discipline matters — get each stage working, then move on.
- Distance from production code: educational implementations are simplified. For real serving, use a battle-tested engine like vLLM. Building your own is the foundation for understanding and tuning that engine.
- Overrated resume effect: what gets evaluated is not the from-scratch experience itself but the writing, code, and measurements that came out of it. Publish your artifacts.
Closing
Building an LLM from scratch remains a worthwhile investment in 2026, perhaps especially in 2026. The more agents write our code, the wider the gap quietly grows, beneath the abstraction layer, between people who understand the bottom of the system and people who do not. With excellent open materials like CS336 and nanoGPT available, start small with week one of the plan. The moment a char-level model first emits a plausible sentence is a feeling no API call can give you.
References
- CS336 official page: https://cs336.stanford.edu/
- nanoGPT (Karpathy): https://github.com/karpathy/nanoGPT
- Karpathy, Zero to Hero lectures: https://karpathy.ai/zero-to-hero.html
- Attention Is All You Need: https://arxiv.org/abs/1706.03762
- Chinchilla scaling laws paper: https://arxiv.org/abs/2203.15556
- DPO paper: https://arxiv.org/abs/2305.18290
- FlashAttention paper: https://arxiv.org/abs/2205.14135
- vLLM documentation: https://docs.vllm.ai/
- CS336 discussion on Hacker News: https://news.ycombinator.com/item?id=44046428
- GeekNews: https://news.hada.io/