Introduction
ChatGPT, Claude, Gemini: the AI assistants we use every day are all built on Transformer language models, and the best-known design is **GPT (Generative Pre-trained Transformer)**. But have you ever built one yourself?
In this series, we train a language model **from scratch** using Andrej Karpathy's **nanoGPT**, running on a home GPU. It is a miniature version of the large models, but it follows the same core principles as the GPT family: a decoder-only Transformer trained to predict the next token.
Why Build It Yourself?
- Reading papers gives 30% understanding; **coding it yourself gives 90%**
- We have GPU servers, so we can actually train it (GB10 128GB!)
- "Trained an LLM from scratch" on your resume — a real differentiator
- When reading AI papers, you develop the intuition to say "Ah, that part!"
GPT Architecture Essentials
GPT is a **Decoder-only Transformer**. Three key components:
1. Tokenization
The first step: converting text into numbers.
```python
# Character-level tokenizer (the simplest)
text = "hello world"
chars = sorted(set(text))
# chars = [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']

stoi = {ch: i for i, ch in enumerate(chars)}  # char → int
itos = {i: ch for i, ch in enumerate(chars)}  # int → char

encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

print(encode("hello"))            # [3, 2, 4, 4, 5]
print(decode([3, 2, 4, 4, 5]))    # "hello"
```
In practice, GPT uses a **BPE (Byte-Pair Encoding)** tokenizer:
```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode("Let's build our own GPT!")
print(tokens)       # [5756, 338, 1382, 674, 898, ...]
print(len(tokens))  # ~7 tokens
```
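To give a feel for what BPE does internally, here is a minimal sketch of the merge step: count adjacent pairs, then fuse the most frequent pair into a single token. This is a toy illustration on characters, not tiktoken's actual implementation (which works on bytes with a pre-trained merge table). The helper names `most_frequent_pair` and `merge_pair` are hypothetical.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the adjacent pair that occurs most often (ties: first seen)."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")   # start from single characters
for _ in range(2):                  # two merge rounds
    tokens = merge_pair(tokens, most_frequent_pair(tokens))

print(tokens)  # ['low', ' ', 'low', 'e', 'r', ' ', 'low', 'e', 's', 't']
```

After two merges the 16 characters have shrunk to 10 tokens, which is the sequence-length saving that makes BPE attractive for real models.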
2. Self-Attention (The Core of Cores)
Each token learns how much attention to pay to every other token it is allowed to see. In GPT, that means itself and the tokens before it.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention(nn.Module):
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.query = nn.Linear(embed_dim, head_dim, bias=False)
        self.key = nn.Linear(embed_dim, head_dim, bias=False)
        self.value = nn.Linear(embed_dim, head_dim, bias=False)

    def forward(self, x):
        B, T, C = x.shape
        q = self.query(x)  # (B, T, head_dim)
        k = self.key(x)    # (B, T, head_dim)
        v = self.value(x)  # (B, T, head_dim)

        # Attention scores
        weights = q @ k.transpose(-2, -1)           # (B, T, T)
        weights = weights * (k.shape[-1] ** -0.5)   # Scale by 1/sqrt(d_k), d_k = head_dim

        # Causal mask: future tokens cannot be seen
        mask = torch.tril(torch.ones(T, T, device=x.device))
        weights = weights.masked_fill(mask == 0, float('-inf'))
        weights = F.softmax(weights, dim=-1)

        out = weights @ v  # (B, T, head_dim)
        return out
```
**Key Intuition**: When predicting the blank in "The cat sat on the \_\_\_", a trained model tends to assign high attention weights to informative earlier words such as "cat" and "sat".
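The scale, mask, and softmax steps can also be traced by hand on a tiny input. The sketch below re-implements just the attention weights in plain Python (no PyTorch), assuming queries and keys are given as lists of vectors. The function name `causal_attention_weights` is hypothetical and exists only for this illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(q, k):
    """q, k: lists of T vectors of length d_k. Returns a T x T weight matrix."""
    T, d_k = len(q), len(q[0])
    weights = []
    for t in range(T):
        scores = []
        for s in range(T):
            if s > t:
                scores.append(float("-inf"))      # future position: masked out
            else:
                dot = sum(a * b for a, b in zip(q[t], k[s]))
                scores.append(dot / math.sqrt(d_k))  # scale by sqrt(d_k)
        weights.append(softmax(scores))
    return weights

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn = causal_attention_weights(q, k)
for row in attn:
    print([round(w, 3) for w in row])
```

Each row sums to 1, and every entry above the diagonal is exactly 0: the first token can only attend to itself, the second to the first two, and so on.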
3. Transformer Block
```python
class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        head_dim = embed_dim // num_heads
        self.heads = nn.ModuleList([
            SelfAttention(embed_dim, head_dim)
            for _ in range(num_heads)
        ])
        self.proj = nn.Linear(embed_dim, embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )
        self.ln1 = nn.LayerNorm(embed_dim)
        self.ln2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # Multi-Head Attention + Residual
        attn_out = torch.cat([h(self.ln1(x)) for h in self.heads], dim=-1)
        x = x + self.proj(attn_out)
        # Feed-Forward + Residual
        x = x + self.ffn(self.ln2(x))
        return x
```
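Both sub-layers in the block follow the same pattern: normalize, transform, then add the result back to the input. As a rough sketch of that pattern on a single vector, the plain-Python version below omits LayerNorm's learnable gain and bias, and the names `layer_norm` and `pre_ln_residual` are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learnable gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_ln_residual(x, sublayer):
    """Compute x + sublayer(LayerNorm(x)), the pattern used twice per block."""
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]

x = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(x)
out = pre_ln_residual(x, lambda h: [0.5 * v for v in h])  # toy sub-layer
print([round(v, 3) for v in normed])
print([round(v, 3) for v in out])
```

If the sub-layer outputs zeros, the block reduces to the identity, which is why gradients can flow through many stacked blocks without vanishing.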
Full GPT Model
```python
class MicroGPT(nn.Module):
    def __init__(self, vocab_size, embed_dim=384, num_heads=6,
                 num_layers=6, block_size=256):
        super().__init__()
        self.block_size = block_size  # maximum context length
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(block_size, embed_dim)
        self.blocks = nn.Sequential(*[
            TransformerBlock(embed_dim, num_heads)
            for _ in range(num_layers)
        ])
        self.ln_f = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_emb(idx)                                # (B, T, embed_dim)
        pos_emb = self.pos_emb(torch.arange(T, device=idx.device))   # (T, embed_dim)
        x = tok_emb + pos_emb
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.head(x)                                        # (B, T, vocab_size)

        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1)
            )
        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            # Crop the context to the last block_size tokens
            logits, _ = self(idx[:, -self.block_size:])
            probs = F.softmax(logits[:, -1, :], dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, next_token], dim=1)
        return idx
```
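The generation loop is easier to follow with the model swapped out for something trivial. In the sketch below, `toy_logits` is a hypothetical stand-in for the network (it simply favours the token after the last one), so the focus stays on the three steps of the loop: crop the context, turn logits into probabilities, sample one token and append it.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_logits(context, vocab_size):
    """Hypothetical stand-in for the model: favours (last token + 1) % vocab_size."""
    logits = [0.0] * vocab_size
    logits[(context[-1] + 1) % vocab_size] = 5.0
    return logits

def generate(idx, max_new_tokens, block_size, vocab_size, rng):
    idx = list(idx)
    for _ in range(max_new_tokens):
        context = idx[-block_size:]                      # block_size limit
        probs = softmax(toy_logits(context, vocab_size)) # next-token distribution
        next_token = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        idx.append(next_token)                           # feed the sample back in
    return idx

out = generate([0], max_new_tokens=10, block_size=4, vocab_size=5,
               rng=random.Random(0))
print(out)
```

Because the next token is sampled rather than taken as the argmax, different seeds give different continuations, just as with the real model.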
**Model Size Comparison:**
| Model | Parameters | Layers | Training Time (1 GPU) |
| ------------ | ------------ | ------ | --------------------- |
| Our MicroGPT | **10M** | 6 | ~30 minutes |
| GPT-2 Small | 124M | 12 | ~several days |
| GPT-3 | 175B | 96 | thousands of GPU-days |
| GPT-4 | ~1.8T (est.) | ? | ? |
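The 10M figure for MicroGPT can be checked by counting the weights layer by layer. The sketch below mirrors the layers defined in the model code above. The vocabulary size of 65 is an assumption (the size of the character-level Shakespeare vocabulary in nanoGPT), and `count_params` is a hypothetical helper.

```python
def count_params(vocab_size, embed_dim=384, num_heads=6,
                 num_layers=6, block_size=256):
    d = embed_dim
    head_dim = d // num_heads
    attn = num_heads * 3 * d * head_dim          # query/key/value, no bias
    proj = d * d + d                             # output projection + bias
    ffn = (d * 4 * d + 4 * d) + (4 * d * d + d)  # expand 4x, then contract
    norms = 2 * (2 * d)                          # ln1 and ln2 (gain + bias each)
    per_block = attn + proj + ffn + norms

    embeddings = vocab_size * d + block_size * d   # token + position tables
    final = 2 * d + (d * vocab_size + vocab_size)  # ln_f + output head
    return num_layers * per_block + embeddings + final

total = count_params(vocab_size=65)  # assumed char-level vocabulary
print(f"{total:,}")                  # 10,788,929
```

Almost all of the weights (about 10.6M) sit in the six Transformer blocks, roughly `12 * embed_dim**2` per block. With such a small vocabulary the embedding tables barely register.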
Hands-On: Training on Shakespeare
Data Preparation
```bash
# Run on spark01 (GB10 128GB)
cd ~/nanoGPT
python3 data/shakespeare_char/prepare.py
# → train: 1,003,854 tokens / val: 111,540 tokens
```
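The two token counts above are consistent with a simple 90/10 split of the encoded text. As a small sketch under that assumption (the helper name `split_train_val` is hypothetical), the arithmetic works out as follows:

```python
def split_train_val(ids, train_fraction=0.9):
    """Put the first `train_fraction` of the tokens in train, the rest in val."""
    n = int(train_fraction * len(ids))
    return ids[:n], ids[n:]

total_tokens = 1_003_854 + 111_540       # counts reported by prepare.py
ids = range(total_tokens)                # stand-in for the encoded text
train, val = split_train_val(ids)
print(len(train), len(val))              # 1003854 111540
```

The validation tokens are never used for gradient updates. They are only evaluated periodically to check whether the model is overfitting.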
Start Training
```bash
python3 train.py config/train_shakespeare_char.py \
    --device=cuda \
    --max_iters=5000 \
    --eval_interval=500 \
    --log_interval=100
```
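During training, each iteration draws random windows from the token stream and asks the model to predict every next token in the window. The sketch below shows how the inputs and targets line up in plain Python. It is a simplified illustration rather than nanoGPT's own batching code, and `get_batch` here is a hypothetical helper.

```python
import random

def get_batch(data, block_size, batch_size, rng):
    """Sample windows x and their targets y, where y is x shifted by one token."""
    xs, ys = [], []
    for _ in range(batch_size):
        i = rng.randrange(len(data) - block_size)
        xs.append(data[i:i + block_size])          # tokens i .. i+block_size-1
        ys.append(data[i + 1:i + block_size + 1])  # the token after each position
    return xs, ys

data = list(range(100))  # stand-in for encoded text
x, y = get_batch(data, block_size=8, batch_size=4, rng=random.Random(0))
for xi, yi in zip(x, y):
    print(xi, "->", yi)
```

Every position in a window is a training example, so a single window of length `block_size` yields `block_size` next-token predictions at once.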
Generated Output (After 5000 iterations)
```text
ROMEO:
What say you to this? Let me not stay a whit;
And yet I feel the thing I have forgot
To take upon the honour of my word.
```
It generates text in Shakespeare's style **from scratch**!
Next Steps: Korean GPT
1. **Train a Korean tokenizer** (SentencePiece BPE)
2. **Collect Namuwiki / news data**
3. **Train MicroGPT 500M** (spark01 128GB)
4. **LoRA fine-tuning** for a conversational model
Series Roadmap
| Part | Topic | Status |
| ---- | ------------------------------------------ | ------ |
| 1 | Understanding GPT with nanoGPT (This post) | ✅ |
| 2 | Building a Korean Tokenizer | 🔜 |
| 3 | Training a 500M Korean GPT | 🔜 |
| 4 | Building a Conversational Model with RLHF | 🔜 |
| 5 | Image Generation Model (DDPM) from Scratch | 🔜 |
**Q1.** Is GPT an Encoder or Decoder architecture?
||Decoder-only Transformer||
**Q2.** What is the role of the Causal Mask in Self-Attention?
||Blocks future tokens from being seen — masking with a lower triangular matrix for autoregressive generation||
**Q3.** Why do we divide by sqrt(d_k) when computing attention scores?
||When dot products become large, softmax becomes extreme causing gradient vanishing — scaling stabilizes this||
**Q4.** What are the pros and cons of BPE vs Character-level tokenizers?
||BPE: Vocabulary-efficient (shorter sequences), but more complex to implement. Character-level: Simple, but sequences are longer and long-range dependencies are harder to learn||
**Q5.** Why are Residual Connections important in Transformers?
||They add the original input back to prevent gradient vanishing in deep networks — improving training stability and convergence speed||
**Q6.** Why does the Feed-Forward Network expand 4x and then contract back?
||To gain representational capacity for non-linear transformations — extracting features in a wider space then compressing back to the original dimension||
**Q7.** How many parameters does our MicroGPT have, and how does it compare to GPT-2 Small?
||About 10M, roughly 12x smaller than GPT-2 Small (124M)||
**Q8.** How can a model understand language just by learning next token prediction?
||To accurately predict the next token, the model must implicitly learn grammar, semantics, context, and world knowledge — compression is understanding||
References & GitHub Resources
Source Code
- **nanoGPT** — Andrej Karpathy's minimal GPT implementation: [github.com/karpathy/nanoGPT](https://github.com/karpathy/nanoGPT)
- **minGPT** — Predecessor to nanoGPT, educational implementation: [github.com/karpathy/minGPT](https://github.com/karpathy/minGPT)
- **microGPT analysis** (based on this blog): [github.com/fjvbn2003/ai-model-analysis](https://github.com/fjvbn2003/ai-model-analysis)
Original Papers
- [Attention Is All You Need (2017)](https://arxiv.org/abs/1706.03762) — The original Transformer architecture paper
- [Improving Language Understanding by Generative Pre-Training (GPT-1, 2018)](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)
- [Language Models are Unsupervised Multitask Learners (GPT-2, 2019)](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
Video Lectures
- [Andrej Karpathy — Let's build GPT from scratch](https://www.youtube.com/watch?v=kCc8FmEb1nY) — 2-hour nanoGPT implementation lecture (must watch!)
- [Andrej Karpathy — Let's reproduce GPT-2 (124M)](https://www.youtube.com/watch?v=l8pRSuU81PU) — GPT-2 reproduction
Related Series & Recommended Posts
- [Complete Math Guide for AI](/blog/ai/2026-03-03-math-for-ai-complete-guide) — Linear algebra, calculus, information theory (essential for understanding GPT)
- [GPT Series Paper Analysis: From GPT-1 to GPT-4](/blog/ai-papers/gpt_series_evolution) — GPT evolution lineage
- [torchaudio Complete Guide](/blog/ai-platform/2026-03-03-torchaudio-complete-guide) — Audio AI (multimodal after GPT)
- [torchvision Complete Guide](/blog/ai-platform/2026-03-03-torchvision-complete-guide) — Vision AI (ViT = Vision Transformer)
- [vLLM Inference Optimization Guide](/blog/llm/2026-03-03-vllm-inference-optimization) — Serving your trained GPT
- [LLM Quantization GPTQ/AWQ/GGUF](/blog/llm/2026-03-03-llm-quantization-gptq-awq-gguf) — Model compression