Build Your Own GPT — Training a Language Model from Scratch with nanoGPT

Introduction

ChatGPT, Claude, Gemini: the AI systems we use every day are built on Transformer language models, and GPT (Generative Pre-trained Transformer) is the best-known family. But have you ever built one yourself?

In this series, we train a language model from scratch using Andrej Karpathy's nanoGPT, running on a home GPU. It is a miniature version of the large models, but the core principles (a decoder-only Transformer trained on next-token prediction) are the same ones behind the GPT family.

Why Build It Yourself?

  • Reading papers gives 30% understanding; coding it yourself gives 90%
  • We have GPU servers, so we can actually train it (GB10 128GB!)
  • "Trained an LLM from scratch" on your resume — a real differentiator
  • When reading AI papers, you develop the intuition to say "Ah, that part!"

GPT Architecture Essentials

GPT is a Decoder-only Transformer. Three key components:

1. Tokenization

The first step: converting text into numbers.

# Character-level tokenizer (the simplest)
text = "hello world"
chars = sorted(list(set(text)))
# chars = [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']

stoi = {ch: i for i, ch in enumerate(chars)}  # char → int
itos = {i: ch for i, ch in enumerate(chars)}  # int → char

encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

print(encode("hello"))  # [3, 2, 4, 4, 5]
print(decode([3, 2, 4, 4, 5]))  # "hello"

In practice, GPT uses a BPE (Byte-Pair Encoding) tokenizer:

import tiktoken
enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode("Let's build our own GPT!")
print(tokens)  # [5756, 338, 1382, 674, 898, ...]
print(len(tokens))  # ~7 tokens
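
To see what "byte-pair" merging means, here is a toy sketch of the merge loop. It is not how tiktoken is implemented: the real GPT-2 tokenizer works on bytes and applies a fixed table of pre-learned merges, while this version works on characters and learns merges on the fly. The function names (`most_frequent_pair`, `merge_pair`, `toy_bpe`) are made up for this illustration.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the adjacent token pair that occurs most often, or None."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def toy_bpe(text, num_merges):
    """Start from characters and greedily merge the most frequent pair."""
    tokens = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
    return tokens

text = "low lower lowest"
tokens = toy_bpe(text, num_merges=2)
print(len(text), "characters ->", len(tokens), "tokens")
print(tokens)
```

Each merge shortens the sequence while growing the vocabulary by one entry, which is the trade-off BPE makes against a character-level tokenizer.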

2. Self-Attention (The Core of Cores)

Each token learns "how much attention to pay to every other token."

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.query = nn.Linear(embed_dim, head_dim, bias=False)
        self.key = nn.Linear(embed_dim, head_dim, bias=False)
        self.value = nn.Linear(embed_dim, head_dim, bias=False)

    def forward(self, x):
        B, T, C = x.shape
        q = self.query(x)  # (B, T, head_dim)
        k = self.key(x)    # (B, T, head_dim)
        v = self.value(x)  # (B, T, head_dim)

        # Attention scores, scaled by 1/sqrt(d_k) where d_k = head_dim
        weights = q @ k.transpose(-2, -1)  # (B, T, T)
        weights = weights * (k.size(-1) ** -0.5)

        # Causal mask: future tokens cannot be seen
        mask = torch.tril(torch.ones(T, T, device=x.device))
        weights = weights.masked_fill(mask == 0, float('-inf'))

        weights = F.softmax(weights, dim=-1)
        out = weights @ v  # (B, T, head_dim)
        return out

Key Intuition: When predicting the blank in "The cat sat on the ___", a trained model will typically put high attention weights on informative earlier tokens such as "cat" and "sat".
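
To make the masking and scaling concrete, here is a dependency-free sketch that computes the attention weight matrix for three tiny hand-picked vectors. It mirrors the math of the `SelfAttention` class above (scaled dot product, causal mask, softmax) but uses plain Python lists instead of tensors; the function name `causal_attention_weights` and the example vectors are assumptions made for illustration.

```python
import math

def causal_attention_weights(q, k):
    """q, k: lists of T vectors of length d_k. Returns a T x T weight matrix."""
    d_k = len(k[0])
    T = len(q)
    weights = []
    for t, q_t in enumerate(q):
        # Scaled dot-product scores against positions 0..t only (causal mask).
        scores = [
            sum(a * b for a, b in zip(q_t, k[s])) / math.sqrt(d_k)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible positions.
        top = max(scores)
        exps = [math.exp(score - top) for score in scores]
        total = sum(exps)
        # Future positions get weight exactly 0, as with the -inf mask.
        weights.append([e / total for e in exps] + [0.0] * (T - t - 1))
    return weights

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn = causal_attention_weights(q, k)
for row in attn:
    print([round(w, 3) for w in row])
```

The first row can only attend to itself, so its weight is 1.0; every row sums to 1, and everything above the diagonal is 0.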

3. Transformer Block

class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        head_dim = embed_dim // num_heads
        self.heads = nn.ModuleList([
            SelfAttention(embed_dim, head_dim)
            for _ in range(num_heads)
        ])
        self.proj = nn.Linear(embed_dim, embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )
        self.ln1 = nn.LayerNorm(embed_dim)
        self.ln2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # Multi-Head Attention + Residual
        attn_out = torch.cat([h(self.ln1(x)) for h in self.heads], dim=-1)
        x = x + self.proj(attn_out)
        # Feed-Forward + Residual
        x = x + self.ffn(self.ln2(x))
        return x
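
The block above applies the same pattern twice: normalize, run a sublayer, add the result back to the input. Below is a minimal sketch of that pattern on a single token vector, in plain Python. It assumes LayerNorm at initialization (scale 1, shift 0), and `halve` is a hypothetical stand-in for the attention or feed-forward sublayer.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and (almost exactly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_norm_residual(x, sublayer):
    """Compute x + sublayer(layer_norm(x)), the pre-norm residual pattern."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def halve(v):
    # Hypothetical sublayer used only for this demo.
    return [0.5 * e for e in v]

x = [2.0, 4.0, 6.0, 8.0]
normed = layer_norm(x)
out = pre_norm_residual(x, halve)
print([round(v, 3) for v in normed])
print([round(v, 3) for v in out])
```

Because the input is added back unchanged, a sublayer that outputs zeros leaves `x` untouched, which is what lets gradients flow straight through deep stacks.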

Full GPT Model

class MicroGPT(nn.Module):
    def __init__(self, vocab_size, embed_dim=384, num_heads=6,
                 num_layers=6, block_size=256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(block_size, embed_dim)
        self.blocks = nn.Sequential(*[
            TransformerBlock(embed_dim, num_heads)
            for _ in range(num_layers)
        ])
        self.ln_f = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_emb(idx)          # (B, T, embed_dim)
        pos_emb = self.pos_emb(torch.arange(T, device=idx.device))  # (T, embed_dim)
        x = tok_emb + pos_emb

        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.head(x)  # (B, T, vocab_size)

        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1)
            )
        return logits, loss

    @torch.no_grad()  # sampling needs no gradients
    def generate(self, idx, max_new_tokens):
        block_size = self.pos_emb.num_embeddings
        for _ in range(max_new_tokens):
            logits, _ = self(idx[:, -block_size:])  # crop to the context window
            probs = F.softmax(logits[:, -1, :], dim=-1)  # last position only
            next_token = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, next_token], dim=1)
        return idx
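
The `generate` loop is easier to follow with the neural network swapped out. The sketch below keeps the same steps (crop the context, turn logits into probabilities, sample, append) but replaces the model with `toy_logits`, a hypothetical rule that simply favors "last token + 1". The tiny `vocab_size` and `block_size` values and the `seed` parameter are chosen for the demo.

```python
import math
import random

def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_logits(context, vocab_size):
    """Hypothetical stand-in for the model: favors (last token + 1) % vocab_size."""
    favored = (context[-1] + 1) % vocab_size
    return [3.0 if i == favored else 0.0 for i in range(vocab_size)]

def generate(idx, max_new_tokens, vocab_size=5, block_size=4, seed=0):
    rng = random.Random(seed)
    idx = list(idx)
    for _ in range(max_new_tokens):
        context = idx[-block_size:]  # crop to the context window
        probs = softmax(toy_logits(context, vocab_size))
        next_token = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        idx.append(next_token)  # feed the sample back in
    return idx

sample = generate([0], max_new_tokens=10)
print(sample)
```

Sampling instead of always taking the most likely token is what gives the generated text variety from run to run.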

Model Size Comparison:

| Model | Parameters | Layers | Training Time (1 GPU) |
|---|---|---|---|
| Our MicroGPT | 10M | 6 | ~30 minutes |
| GPT-2 Small | 124M | 12 | several days |
| GPT-3 | 175B | 96 | thousands of GPU-days |
| GPT-4 | ~1.8T (est.) | ? | ? |
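
The 10M figure can be checked by hand. The sketch below counts the parameters of `MicroGPT` exactly as defined above (bias-free query/key/value projections, biased projection, feed-forward, and output head). The helper name `micro_gpt_params` is made up for this post, and `vocab_size=65` is an assumption: it is the character vocabulary size of the Shakespeare dataset used later.

```python
def micro_gpt_params(vocab_size, embed_dim=384, num_heads=6,
                     num_layers=6, block_size=256):
    head_dim = embed_dim // num_heads
    token_emb = vocab_size * embed_dim
    pos_emb = block_size * embed_dim

    # Per Transformer block
    attn_qkv = num_heads * 3 * embed_dim * head_dim   # query/key/value, no bias
    attn_proj = embed_dim * embed_dim + embed_dim     # output projection + bias
    ffn_up = embed_dim * 4 * embed_dim + 4 * embed_dim
    ffn_down = 4 * embed_dim * embed_dim + embed_dim
    layer_norms = 2 * (2 * embed_dim)                 # ln1 and ln2, weight + bias
    per_block = attn_qkv + attn_proj + ffn_up + ffn_down + layer_norms

    ln_f = 2 * embed_dim
    head = embed_dim * vocab_size + vocab_size
    return token_emb + pos_emb + num_layers * per_block + ln_f + head

total = micro_gpt_params(vocab_size=65)
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
# → 10,788,929 parameters (~10.8M)
```

Almost all of the parameters sit in the six Transformer blocks; with a character-level vocabulary the embeddings and output head are a rounding error.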

Hands-On: Training on Shakespeare

Data Preparation

# Run on spark01 (GB10 128GB)
cd ~/nanoGPT
python3 data/shakespeare_char/prepare.py
# → train: 1,003,854 tokens / val: 111,540 tokens
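
For a sense of what the preparation step produces, here is a rough sketch under the assumption that it builds a character vocabulary, encodes the text as integers, and splits it 90/10 into train and validation sets. The `prepare` function below is a simplified illustration, not the script itself (the real script also downloads the dataset and writes binary files).

```python
def prepare(text, train_fraction=0.9):
    """Encode text as character ids and split it into train / validation."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    ids = [stoi[c] for c in text]
    split = int(len(ids) * train_fraction)
    return ids[:split], ids[split:], len(chars)

text = "to be, or not to be: that is the question"
train_ids, val_ids, vocab_size = prepare(text)
print(len(train_ids), len(val_ids), vocab_size)

# Sanity check against the counts printed above:
# 1,003,854 + 111,540 = 1,115,394 characters, and 90% of that is the train split.
total_chars = 1_003_854 + 111_540
print(int(total_chars * 0.9))
```

With a character-level tokenizer, "tokens" and "characters" are the same thing, which is why the token counts match the length of the text.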

Start Training

python3 train.py config/train_shakespeare_char.py \
  --device=cuda \
  --max_iters=5000 \
  --eval_interval=500 \
  --log_interval=100
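
What does each training step actually feed the model? A minimal sketch, assuming batches are random windows of the token stream with the targets shifted one position to the right. The `get_batch` below uses plain lists and a seeded `random.Random` for the demo; it is an illustration of the idea, not the training script's code.

```python
import math
import random

def get_batch(data, batch_size, block_size, rng):
    """Sample random windows; targets are the inputs shifted by one token."""
    xs, ys = [], []
    for _ in range(batch_size):
        i = rng.randrange(len(data) - block_size)
        xs.append(data[i:i + block_size])
        ys.append(data[i + 1:i + block_size + 1])
    return xs, ys

data = list(range(20))  # stand-in for the encoded text
x, y = get_batch(data, batch_size=2, block_size=4, rng=random.Random(0))
print(x[0], "->", y[0])

# A model that guesses uniformly over 65 characters has loss ln(65),
# which is a useful reference point for the first logged loss value.
print(f"{math.log(65):.2f}")  # → 4.17
```

If the first logged loss is far above this value, something is usually wrong with the initialization or the data pipeline.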

Generated Output (After 5000 iterations)

ROMEO:
What say you to this? Let me not stay a whit;
And yet I feel the thing I have forgot
To take upon the honour of my word.

It generates text in Shakespeare's style from scratch!

Next Steps: Korean GPT

  1. Train a Korean tokenizer (SentencePiece BPE)
  2. Collect Namuwiki / news data
  3. Train MicroGPT 500M (spark01 128GB)
  4. LoRA fine-tuning for a conversational model

Series Roadmap

| Part | Topic | Status |
|---|---|---|
| 1 | Understanding GPT with nanoGPT | This post |
| 2 | Building a Korean Tokenizer | 🔜 |
| 3 | Training a 500M Korean GPT | 🔜 |
| 4 | Building a Conversational Model with RLHF | 🔜 |
| 5 | Image Generation Model (DDPM) from Scratch | 🔜 |

Quiz: Build Your Own GPT

Q1. Is GPT an Encoder or Decoder architecture?
Answer: Decoder-only Transformer.

Q2. What is the role of the Causal Mask in Self-Attention?
Answer: It blocks each position from seeing future tokens. The mask is a lower triangular matrix, which is what makes autoregressive generation possible.

Q3. Why do we divide by sqrt(d_k) when computing attention scores?
Answer: The magnitude of the dot products grows with d_k. Large scores saturate the softmax and its gradients vanish; dividing by sqrt(d_k) keeps the scores in a range where training is stable.

Q4. What are the pros and cons of BPE vs Character-level tokenizers?
Answer: BPE produces shorter sequences for the same text, at the cost of a larger vocabulary and a more complex implementation. Character-level is simple, but sequences are longer and long-range dependencies are harder to learn.

Q5. Why are Residual Connections important in Transformers?
Answer: They add the original input back to each sublayer's output, which prevents gradient vanishing in deep networks and improves training stability and convergence speed.

Q6. Why does the Feed-Forward Network expand 4x and then contract back?
Answer: To gain representational capacity for non-linear transformations: features are extracted in a wider space and then compressed back to the original dimension.

Q7. How many parameters does our MicroGPT have, and how does it compare to GPT-2 Small?
Answer: About 10M, roughly 12x smaller than GPT-2 Small (124M).

Q8. How can a model understand language just by learning next token prediction?
Answer: To predict the next token accurately, the model must implicitly learn grammar, semantics, context, and world knowledge. In that sense, compression is understanding.
