
Build Your Own GPT: Training a Language Model from Scratch with nanoGPT


Introduction

ChatGPT, Claude, Gemini — the core of the AI we use every day is the **GPT (Generative Pre-trained Transformer)** architecture. But have you ever built one yourself?

In this series, we train a language model **from scratch** using Andrej Karpathy's **nanoGPT** on a home GPU. The result is a miniature version of the large models, but it rests on the same core principles as the rest of the GPT family: a decoder-only Transformer trained to predict the next token.

Why Build It Yourself?

- Reading papers gives 30% understanding; **coding it yourself gives 90%**

- We have GPU servers, so we can actually train it (GB10 128GB!)

- "Trained an LLM from scratch" on your resume — a real differentiator

- When reading AI papers, you develop the intuition to say "Ah, that part!"

GPT Architecture Essentials

GPT is a **Decoder-only Transformer**. Three key components:

1. Tokenization

The first step: converting text into numbers.

    # Character-level tokenizer (the simplest)
    text = "hello world"
    chars = sorted(list(set(text)))
    # chars == [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']

    stoi = {ch: i for i, ch in enumerate(chars)}  # char → int
    itos = {i: ch for i, ch in enumerate(chars)}  # int → char

    encode = lambda s: [stoi[c] for c in s]
    decode = lambda l: ''.join([itos[i] for i in l])

    print(encode("hello"))            # [3, 2, 4, 4, 5]
    print(decode([3, 2, 4, 4, 5]))    # "hello"

In practice, GPT uses a **BPE (Byte-Pair Encoding)** tokenizer:

    import tiktoken

    enc = tiktoken.get_encoding("gpt2")
    tokens = enc.encode("Let's build our own GPT!")

    print(tokens)       # [5756, 338, 1382, 674, 898, ...]
    print(len(tokens))  # ~7 tokens
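To make the idea behind BPE concrete, here is a minimal sketch of a single merge step on a toy string. It is an illustration only: the helper names `most_frequent_pair` and `merge_pair` are hypothetical, and the real GPT-2 tokenizer works on bytes with a large table of learned merges rather than on characters like this.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the adjacent token pair that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("aaabdaaabac")            # start from single characters
pair = most_frequent_pair(tokens)       # the pair to merge first
tokens_after = merge_pair(tokens, pair)
print(pair, tokens_after)
```

Repeating this step many times grows the vocabulary and shortens the sequence, which is the trade-off BPE makes against a character-level tokenizer.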

2. Self-Attention (The Core of Cores)

Each token learns "how much attention to pay to every other token."

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SelfAttention(nn.Module):
        def __init__(self, embed_dim, head_dim):
            super().__init__()
            self.query = nn.Linear(embed_dim, head_dim, bias=False)
            self.key = nn.Linear(embed_dim, head_dim, bias=False)
            self.value = nn.Linear(embed_dim, head_dim, bias=False)

        def forward(self, x):
            B, T, C = x.shape
            q = self.query(x)  # (B, T, head_dim)
            k = self.key(x)    # (B, T, head_dim)
            v = self.value(x)  # (B, T, head_dim)

            # Attention scores
            weights = q @ k.transpose(-2, -1)          # (B, T, T)
            # Scale by 1/sqrt(d_k), where d_k is the head dimension (not embed_dim)
            weights = weights * (k.shape[-1] ** -0.5)

            # Causal mask: future tokens cannot be seen!
            # Build the mask on the same device as the input
            mask = torch.tril(torch.ones(T, T, device=x.device))
            weights = weights.masked_fill(mask == 0, float('-inf'))
            weights = F.softmax(weights, dim=-1)

            out = weights @ v  # (B, T, head_dim)
            return out

**Key Intuition**: When predicting the blank in "The cat sat on the \_\_\_", a trained model can place high attention weights on earlier tokens such as "cat" and "sat", because they carry the most information about what comes next.
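The effect of the causal mask can be checked by hand on a tiny example. The sketch below is a pure-Python illustration, not the PyTorch code used in the model; the function name `causal_softmax` and the score values are made up for the demonstration.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position t only attends to positions <= t."""
    weights = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]                 # drop future positions
        peak = max(visible)                    # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Future positions receive exactly zero weight
        weights.append([e / total for e in exps] + [0.0] * (len(row) - t - 1))
    return weights

# Hypothetical raw attention scores for a 3-token sequence
scores = [
    [0.5, 2.0, 1.0],
    [1.0, 1.0, 3.0],
    [0.2, 0.4, 0.6],
]
weights = causal_softmax(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row sums to 1, the first token can only attend to itself, and everything above the diagonal is zero. This is what `masked_fill` with `-inf` followed by `softmax` achieves in the tensor version.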

3. Transformer Block

    class TransformerBlock(nn.Module):
        def __init__(self, embed_dim, num_heads):
            super().__init__()
            head_dim = embed_dim // num_heads
            self.heads = nn.ModuleList([
                SelfAttention(embed_dim, head_dim)
                for _ in range(num_heads)
            ])
            self.proj = nn.Linear(embed_dim, embed_dim)
            self.ffn = nn.Sequential(
                nn.Linear(embed_dim, 4 * embed_dim),
                nn.GELU(),
                nn.Linear(4 * embed_dim, embed_dim),
            )
            self.ln1 = nn.LayerNorm(embed_dim)
            self.ln2 = nn.LayerNorm(embed_dim)

        def forward(self, x):
            # Multi-Head Attention + Residual
            attn_out = torch.cat([h(self.ln1(x)) for h in self.heads], dim=-1)
            x = x + self.proj(attn_out)

            # Feed-Forward + Residual
            x = x + self.ffn(self.ln2(x))
            return x

Full GPT Model

    class MicroGPT(nn.Module):
        def __init__(self, vocab_size, embed_dim=384, num_heads=6,
                     num_layers=6, block_size=256):
            super().__init__()
            self.block_size = block_size  # maximum context length
            self.token_emb = nn.Embedding(vocab_size, embed_dim)
            self.pos_emb = nn.Embedding(block_size, embed_dim)
            self.blocks = nn.Sequential(*[
                TransformerBlock(embed_dim, num_heads)
                for _ in range(num_layers)
            ])
            self.ln_f = nn.LayerNorm(embed_dim)
            self.head = nn.Linear(embed_dim, vocab_size)

        def forward(self, idx, targets=None):
            B, T = idx.shape
            tok_emb = self.token_emb(idx)                               # (B, T, embed_dim)
            # Create positions on the same device as the input
            pos_emb = self.pos_emb(torch.arange(T, device=idx.device))  # (T, embed_dim)
            x = tok_emb + pos_emb
            x = self.blocks(x)
            x = self.ln_f(x)
            logits = self.head(x)                                       # (B, T, vocab_size)

            loss = None
            if targets is not None:
                loss = F.cross_entropy(
                    logits.view(-1, logits.size(-1)),
                    targets.view(-1)
                )
            return logits, loss

        @torch.no_grad()
        def generate(self, idx, max_new_tokens):
            for _ in range(max_new_tokens):
                # Crop the context to the last block_size tokens
                logits, _ = self(idx[:, -self.block_size:])
                # Sample the next token from the distribution at the last position
                probs = F.softmax(logits[:, -1, :], dim=-1)
                next_token = torch.multinomial(probs, num_samples=1)
                idx = torch.cat([idx, next_token], dim=1)
            return idx
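The sampling step inside `generate` turns the last position's logits into probabilities and draws one token at random. As a rough stand-in for `F.softmax` plus `torch.multinomial`, the same step can be sketched in plain Python; the helper names `softmax` and `sample_next` and the logit values here are hypothetical.

```python
import math
import random

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, rng):
    """Draw one token index in proportion to its softmax probability."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)
probs = softmax([1.0, 2.0, 3.0])
print([round(p, 3) for p in probs])

# When one logit dominates, sampling almost always picks that token
samples = [sample_next([0.0, 0.0, 100.0], rng) for _ in range(20)]
print(samples)
```

Because the token is sampled rather than chosen by argmax, two runs from the same prompt can produce different text.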

**Model Size Comparison:**

| Model        | Parameters                | Layers | Training Time (1 GPU) |
| ------------ | ------------------------- | ------ | --------------------- |
| Our MicroGPT | **10M**                   | 6      | ~30 minutes           |
| GPT-2 Small  | 124M                      | 12     | ~several days         |
| GPT-3        | 175B                      | 96     | thousands of GPU-days |
| GPT-4        | ~1.8T (unconfirmed est.)  | ?      | ?                     |
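The 10M figure can be sanity-checked by counting the weights of the `MicroGPT` class layer by layer. The sketch below assumes the default hyperparameters shown in that class and a vocabulary of 65 characters, which is an assumption about the Shakespeare character set rather than something stated in this post; `count_params` is a hypothetical helper.

```python
def count_params(vocab_size, embed_dim=384, num_heads=6, num_layers=6, block_size=256):
    """Count trainable parameters of the MicroGPT layout described in this post."""
    head_dim = embed_dim // num_heads
    # Token and position embedding tables
    embeddings = vocab_size * embed_dim + block_size * embed_dim
    # Query, key, value projections per head (no bias)
    attn = num_heads * 3 * embed_dim * head_dim
    # Output projection after concatenating heads (weight + bias)
    proj = embed_dim * embed_dim + embed_dim
    # Feed-forward: expand 4x, then contract (weights + biases)
    ffn = (embed_dim * 4 * embed_dim + 4 * embed_dim) + (4 * embed_dim * embed_dim + embed_dim)
    # Two LayerNorms per block, each with a scale and a shift vector
    norms = 2 * 2 * embed_dim
    block = attn + proj + ffn + norms
    # Final LayerNorm plus the output head (weight + bias)
    final = 2 * embed_dim + (embed_dim * vocab_size + vocab_size)
    return embeddings + num_layers * block + final

total = count_params(vocab_size=65)  # 65 is an assumed character vocabulary size
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Almost all of the parameters sit in the six Transformer blocks; with a character-level vocabulary the embedding tables and output head are tiny by comparison.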

Hands-On: Training on Shakespeare

Data Preparation

    # Run on spark01 (GB10 128GB)
    cd ~/nanoGPT
    python3 data/shakespeare_char/prepare.py
    # → train: 1,003,854 tokens / val: 111,540 tokens
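The two token counts above are consistent with a 90/10 train/validation split of the character stream. The sketch below reproduces the arithmetic under that assumption; `split_sizes` is a hypothetical helper, not a function from nanoGPT.

```python
def split_sizes(n_tokens, train_fraction=0.9):
    """Return (train, val) sizes for a simple head/tail split of the data."""
    n_train = int(n_tokens * train_fraction)   # truncate, as a slice index would
    return n_train, n_tokens - n_train

n_total = 1_003_854 + 111_540                  # total characters reported above
n_train, n_val = split_sizes(n_total)
print(n_total, n_train, n_val)
```

With a character-level tokenizer, one token is one character, so these counts are also the size of the text file in characters.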

Start Training

    python3 train.py config/train_shakespeare_char.py \
        --device=cuda \
        --max_iters=5000 \
        --eval_interval=500 \
        --log_interval=100
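During training, each step samples random windows from the token stream and asks the model to predict every next token. A minimal sketch of that batching idea follows; it uses plain lists instead of tensors, and the `get_batch` signature, the toy data, and the vocabulary size of 65 are assumptions for illustration.

```python
import math
import random

def get_batch(data, block_size, batch_size, rng):
    """Sample input windows and their targets (the inputs shifted one step right)."""
    xs, ys = [], []
    for _ in range(batch_size):
        i = rng.randrange(len(data) - block_size)    # random window start
        xs.append(data[i : i + block_size])          # tokens the model sees
        ys.append(data[i + 1 : i + block_size + 1])  # tokens it must predict
    return xs, ys

data = list(range(20))                               # stand-in for encoded text
xs, ys = get_batch(data, block_size=8, batch_size=4, rng=random.Random(0))
print(xs[0], ys[0])

# A model that guesses uniformly over an assumed 65-character vocabulary
# has cross-entropy ln(65), a useful reference point for the first logged loss.
uniform_loss = math.log(65)
print(round(uniform_loss, 3))
```

If the loss printed at the first iterations is far from this reference value, the initialization or the data pipeline is worth a second look.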

Generated Output (After 5000 iterations)

ROMEO:

What say you to this? Let me not stay a whit;

And yet I feel the thing I have forgot

To take upon the honour of my word.

It generates text in Shakespeare's style **from scratch**!

Next Steps: Korean GPT

1. **Train a Korean tokenizer** (SentencePiece BPE)

2. **Collect Namuwiki / news data**

3. **Train MicroGPT 500M** (spark01 128GB)

4. **LoRA fine-tuning** for a conversational model

Series Roadmap

| Part | Topic                                      | Status |
| ---- | ------------------------------------------ | ------ |
| 1    | Understanding GPT with nanoGPT (This post) | ✅     |
| 2    | Building a Korean Tokenizer                | 🔜     |
| 3    | Training a 500M Korean GPT                 | 🔜     |
| 4    | Building a Conversational Model with RLHF  | 🔜     |
| 5    | Image Generation Model (DDPM) from Scratch | 🔜     |

**Q1.** Is GPT an Encoder or Decoder architecture?

||Decoder-only Transformer||

**Q2.** What is the role of the Causal Mask in Self-Attention?

||Blocks future tokens from being seen — masking with a lower triangular matrix for autoregressive generation||

**Q3.** Why do we divide by sqrt(d_k) when computing attention scores?

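||When dot products grow large, softmax saturates toward a one-hot distribution and gradients vanish. Scaling by 1/sqrt(d_k), where d_k is the head dimension, keeps the scores in a range where softmax stays well-behaved||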

**Q4.** What are the pros and cons of BPE vs Character-level tokenizers?

||BPE: Vocabulary-efficient (shorter sequences), but more complex to implement. Character-level: Simple, but sequences are longer and long-range dependencies are harder to learn||

**Q5.** Why are Residual Connections important in Transformers?

||They add the original input back to prevent gradient vanishing in deep networks — improving training stability and convergence speed||

**Q6.** Why does the Feed-Forward Network expand 4x and then contract back?

||To gain representational capacity for non-linear transformations — extracting features in a wider space then compressing back to the original dimension||

**Q7.** How many parameters does our MicroGPT have, and how does it compare to GPT-2 Small?

||About 10M, roughly 12x smaller than GPT-2 Small (124M)||

**Q8.** How can a model understand language just by learning next token prediction?

||To accurately predict the next token, the model must implicitly learn grammar, semantics, context, and world knowledge — compression is understanding||

References & GitHub Resources

Source Code

- **nanoGPT** — Andrej Karpathy's minimal GPT implementation: [github.com/karpathy/nanoGPT](https://github.com/karpathy/nanoGPT)

- **minGPT** — Predecessor to nanoGPT, educational implementation: [github.com/karpathy/minGPT](https://github.com/karpathy/minGPT)

- **microGPT analysis** (based on this blog): [github.com/fjvbn2003/ai-model-analysis](https://github.com/fjvbn2003/ai-model-analysis)

Original Papers

- [Attention Is All You Need (2017)](https://arxiv.org/abs/1706.03762) — The original Transformer architecture paper

- [Improving Language Understanding by Generative Pre-Training (GPT-1, 2018)](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)

- [Language Models are Unsupervised Multitask Learners (GPT-2, 2019)](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)

Video Lectures

- [Andrej Karpathy — Let's build GPT from scratch](https://www.youtube.com/watch?v=kCc8FmEb1nY) — 2-hour nanoGPT implementation lecture (must watch!)

- [Andrej Karpathy — Let's reproduce GPT-2 (124M)](https://www.youtube.com/watch?v=l8pRSuU81PU) — GPT-2 reproduction

Related Series & Recommended Posts

- [Complete Math Guide for AI](/blog/ai/2026-03-03-math-for-ai-complete-guide) — Linear algebra, calculus, information theory (essential for understanding GPT)

- [GPT Series Paper Analysis: From GPT-1 to GPT-4](/blog/ai-papers/gpt_series_evolution) — GPT evolution lineage

- [torchaudio Complete Guide](/blog/ai-platform/2026-03-03-torchaudio-complete-guide) — Audio AI (multimodal after GPT)

- [torchvision Complete Guide](/blog/ai-platform/2026-03-03-torchvision-complete-guide) — Vision AI (ViT = Vision Transformer)

- [vLLM Inference Optimization Guide](/blog/llm/2026-03-03-vllm-inference-optimization) — Serving your trained GPT

- [LLM Quantization GPTQ/AWQ/GGUF](/blog/llm/2026-03-03-llm-quantization-gptq-awq-gguf) — Model compression
