Natural Language Processing Complete Guide: Zero to Hero - From Text Processing to LLMs
Author: Youngju Kim (@fjvbn20031)
Natural Language Processing (NLP) is a core branch of artificial intelligence that enables computers to understand and generate human language. Countless services we use every day — ChatGPT, translation engines, search engines, sentiment analysis systems — are all built on NLP technology. This guide provides a complete learning path covering everything from the most basic text preprocessing to the latest large language models (LLMs).
Table of Contents
- NLP Fundamentals and Text Preprocessing
- Text Representation
- Word Embeddings
- NLP with Recurrent Neural Networks
- Attention Mechanisms
- Deep Dive into the Transformer Architecture
- Deep Dive into BERT
- The GPT Family of Models
- Modern NLP Techniques
- Real-World NLP Projects
1. NLP Fundamentals and Text Preprocessing
The first step in natural language processing is transforming raw text data into a form that models can work with. Because text data is unstructured, cleaning and structuring it into a consistent format is essential.
1.1 Tokenization
Tokenization is the process of splitting text into smaller units called tokens. Depending on the unit chosen, tokenization can operate at the word, character, or subword level.
Word Tokenization
The most intuitive approach — splitting text into words based on whitespace or punctuation.
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')
text = "Natural Language Processing is fascinating! It powers ChatGPT and many AI applications."
# Word tokenization
word_tokens = word_tokenize(text)
print("Word tokens:", word_tokens)
# Output: ['Natural', 'Language', 'Processing', 'is', 'fascinating', '!', ...]
# Sentence tokenization
sent_tokens = sent_tokenize(text)
print("Sentence tokens:", sent_tokens)
# Output: ['Natural Language Processing is fascinating!', 'It powers ChatGPT...']
Character Tokenization
Splits text into individual characters. The vocabulary is small and there are no out-of-vocabulary (OOV) problems, but sequences become very long.
text = "Hello NLP"
char_tokens = list(text)
print("Character tokens:", char_tokens)
# Output: ['H', 'e', 'l', 'l', 'o', ' ', 'N', 'L', 'P']
Subword Tokenization
Splits text into units between words and characters. Algorithms like BPE (Byte Pair Encoding), WordPiece, and SentencePiece are used by modern models like BERT and GPT.
from tokenizers import ByteLevelBPETokenizer
# Train a BPE tokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
files=["corpus.txt"],
vocab_size=5000,
min_frequency=2,
special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"]
)
# Encode
encoding = tokenizer.encode("Natural Language Processing")
print("Tokens:", encoding.tokens)
print("IDs:", encoding.ids)
1.2 Stop Word Removal
Stop words are words like "the", "is", and "at" that appear so frequently they carry almost no meaningful information. Removing them reduces data size and lets the model focus on informative words.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
text = "This is a sample sentence showing off the stop words filtration."
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(text)
filtered_sentence = [w for w in word_tokens if not w.lower() in stop_words]
print("Original:", word_tokens)
print("Filtered:", filtered_sentence)
# Output: ['sample', 'sentence', 'showing', 'stop', 'words', 'filtration', '.']
1.3 Stemming and Lemmatization
Stemming
Removes affixes to extract the stem of a word. Fast but can produce linguistically incorrect results.
from nltk.stem import PorterStemmer, LancasterStemmer
ps = PorterStemmer()
ls = LancasterStemmer()
words = ["running", "runs", "ran", "runner", "easily", "fairly"]
for word in words:
print(f"{word:15} -> Porter: {ps.stem(word):15} Lancaster: {ls.stem(word)}")
Lemmatization
Extracts the base form (lemma) of a word. Slower than stemming but linguistically correct.
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
# Specifying part of speech gives more accurate results
print(lemmatizer.lemmatize("running", pos='v')) # run
print(lemmatizer.lemmatize("better", pos='a')) # good
print(lemmatizer.lemmatize("dogs")) # dog
print(lemmatizer.lemmatize("went", pos='v')) # go
1.4 Text Cleaning with Regular Expressions
import re
def clean_text(text):
"""Text cleaning function"""
# Remove HTML tags
text = re.sub(r'<[^>]+>', '', text)
# Remove URLs
text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
# Remove email addresses
text = re.sub(r'\S+@\S+', '', text)
# Remove special characters (keep alphanumeric and spaces)
text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
# Collapse multiple spaces
text = re.sub(r'\s+', ' ', text)
# Strip and lowercase
text = text.strip().lower()
return text
sample_text = """
Check out my website at https://example.com!
Email me at user@example.com for <b>more info</b>.
It's really cool!!!
"""
cleaned = clean_text(sample_text)
print(cleaned)
# Output: "check out my website at email me at for more info its really cool"
1.5 Advanced Preprocessing with spaCy
spaCy is an industrial-strength NLP library that provides tokenization, POS tagging, named entity recognition, dependency parsing, and more.
import spacy
# Load English model
nlp = spacy.load("en_core_web_sm")
text = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(text)
print("=== Token Information ===")
for token in doc:
print(f"{token.text:15} | POS: {token.pos_:10} | Lemma: {token.lemma_:15} | Stop: {token.is_stop}")
print("\n=== Named Entity Recognition ===")
for ent in doc.ents:
print(f"{ent.text:20} -> {ent.label_} ({spacy.explain(ent.label_)})")
print("\n=== Dependency Parsing ===")
for token in doc:
print(f"{token.text:15} -> {token.dep_:15} (head: {token.head.text})")
2. Text Representation
Converting text to numerical vectors is at the heart of NLP. For a model to process text, it must be converted into a numeric form on which mathematical operations can be performed.
2.1 Bag of Words (BoW)
BoW is the simplest method, representing text as a word-frequency vector. It ignores word order and only considers how often each word appears.
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
corpus = [
"I love natural language processing",
"Natural language processing is amazing",
"I love machine learning too",
"Deep learning is part of machine learning"
]
# BoW vectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print("Vocabulary:", vectorizer.get_feature_names_out())
print("\nBoW Matrix:")
print(X.toarray())
print("\nShape:", X.shape) # (4 documents, n words)
# Inspect one document
doc_0 = dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0]))
print("\nDocument 0 word frequencies:")
for word, count in sorted(doc_0.items()):
if count > 0:
print(f" {word}: {count}")
2.2 TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) improves on BoW by considering not just how often a word appears in a document, but also how rare it is across all documents.
TF (Term Frequency): How often a word appears in a given document. IDF (Inverse Document Frequency): The inverse of how many documents contain the word — rare words get higher scores.
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
corpus = [
"the cat sat on the mat",
"the cat sat on the hat",
"the dog sat on the log",
"the cat wore the hat",
]
# TF-IDF vectorizer
tfidf = TfidfVectorizer(smooth_idf=True, norm='l2')
X = tfidf.fit_transform(corpus)
# Display as DataFrame
df = pd.DataFrame(
X.toarray(),
columns=tfidf.get_feature_names_out(),
index=[f"Doc {i}" for i in range(len(corpus))]
)
print(df.round(3))
# Manual TF-IDF implementation
def compute_tfidf(corpus):
from math import log
vocab = set()
for doc in corpus:
vocab.update(doc.split())
vocab = sorted(vocab)
def tf(word, doc):
words = doc.split()
return words.count(word) / len(words)
def idf(word, corpus):
n_docs_with_word = sum(1 for doc in corpus if word in doc.split())
return log((1 + len(corpus)) / (1 + n_docs_with_word)) + 1
tfidf_matrix = []
for doc in corpus:
row = [tf(word, doc) * idf(word, corpus) for word in vocab]
tfidf_matrix.append(row)
return tfidf_matrix, vocab
matrix, vocab = compute_tfidf(corpus)
print("\nManually computed TF-IDF:")
df_manual = pd.DataFrame(matrix, columns=vocab)
print(df_manual.round(3))
2.3 N-gram
An N-gram is a contiguous sequence of N items (words or characters) from a given text. N-grams allow a model to capture some local word-order information.
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
from collections import Counter
text = "I love natural language processing and machine learning"
tokens = word_tokenize(text.lower())
unigrams = list(ngrams(tokens, 1))
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))
print("Unigrams:", unigrams[:5])
print("Bigrams:", bigrams[:5])
print("Trigrams:", trigrams[:5])
def get_ngram_freq(text, n):
tokens = word_tokenize(text.lower())
n_grams = list(ngrams(tokens, n))
return Counter(n_grams)
large_corpus = """
Machine learning is a subset of artificial intelligence.
Artificial intelligence is transforming many industries.
Natural language processing is a part of machine learning.
Deep learning has revolutionized natural language processing.
"""
bigram_freq = get_ngram_freq(large_corpus, 2)
print("\nMost frequent bigrams:")
for bigram, count in bigram_freq.most_common(10):
print(f" {' '.join(bigram)}: {count}")
3. Word Embeddings
Word embeddings represent words as dense vectors such that semantically similar words are close together in vector space.
3.1 Word2Vec
Word2Vec, published by Tomas Mikolov at Google in 2013, is a groundbreaking word embedding model. It comes in two variants: CBOW (Continuous Bag of Words) and Skip-gram.
CBOW: Predicts the center word from surrounding context words. Skip-gram: Predicts surrounding context words from the center word.
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import Counter
class Word2VecSkipGram(nn.Module):
"""Skip-gram Word2Vec implementation"""
def __init__(self, vocab_size, embedding_dim):
super().__init__()
self.center_embedding = nn.Embedding(vocab_size, embedding_dim)
self.context_embedding = nn.Embedding(vocab_size, embedding_dim)
self.center_embedding.weight.data.uniform_(-0.5/embedding_dim, 0.5/embedding_dim)
self.context_embedding.weight.data.uniform_(-0.5/embedding_dim, 0.5/embedding_dim)
def forward(self, center, context, negative):
"""
center: (batch_size,) center word indices
context: (batch_size,) actual context word indices
negative: (batch_size, neg_samples) negative sample indices
"""
center_emb = self.center_embedding(center) # (batch, dim)
# Positive sample score
context_emb = self.context_embedding(context) # (batch, dim)
pos_score = torch.sum(center_emb * context_emb, dim=1) # (batch,)
pos_loss = -torch.log(torch.sigmoid(pos_score) + 1e-10)
# Negative sample score
neg_emb = self.context_embedding(negative) # (batch, neg_samples, dim)
center_emb_expanded = center_emb.unsqueeze(1) # (batch, 1, dim)
neg_score = torch.bmm(neg_emb, center_emb_expanded.transpose(1, 2)).squeeze(2)
neg_loss = -torch.sum(torch.log(torch.sigmoid(-neg_score) + 1e-10), dim=1)
return (pos_loss + neg_loss).mean()
class Word2VecCBOW(nn.Module):
"""CBOW Word2Vec implementation"""
def __init__(self, vocab_size, embedding_dim):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embedding_dim)
self.linear = nn.Linear(embedding_dim, vocab_size)
def forward(self, context):
"""
context: (batch_size, context_window_size) context word indices
"""
embedded = self.embedding(context) # (batch, window, dim)
mean_embedded = embedded.mean(dim=1) # (batch, dim)
output = self.linear(mean_embedded) # (batch, vocab_size)
return output
# Practical usage with Gensim
from gensim.models import Word2Vec
sentences = [
"I love natural language processing".split(),
"natural language processing is a field of AI".split(),
"machine learning is part of artificial intelligence".split(),
"deep learning models process natural language".split(),
"word embeddings represent words as vectors".split(),
"Word2Vec learns word representations".split(),
"semantic similarity between words is captured by embeddings".split(),
]
model = Word2Vec(
sentences=sentences,
vector_size=100, # embedding dimension
window=5, # context window size
min_count=1, # minimum word frequency
workers=4,
epochs=100,
sg=1, # 1=Skip-gram, 0=CBOW
negative=5 # negative sampling
)
print("'language' vector (first 5 dims):", model.wv['language'][:5])
similar_words = model.wv.most_similar('language', topn=5)
print("\nWords similar to 'language':")
for word, similarity in similar_words:
print(f" {word}: {similarity:.4f}")
result = model.wv.most_similar(
positive=['artificial', 'language'],
negative=['natural'],
topn=3
)
print("\nAnalogy result:")
for word, sim in result:
print(f" {word}: {sim:.4f}")
print("\nSimilarity between 'language' and 'processing':", model.wv.similarity('language', 'processing'))
3.2 Embedding Visualization with t-SNE
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np
def visualize_embeddings(model, words=None):
"""Visualize Word2Vec embeddings with t-SNE"""
if words is None:
words = list(model.wv.key_to_index.keys())[:50]
vectors = np.array([model.wv[word] for word in words])
tsne = TSNE(
n_components=2,
random_state=42,
perplexity=min(30, len(words)-1),
max_iter=1000  # called n_iter in scikit-learn < 1.5
)
vectors_2d = tsne.fit_transform(vectors)
fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(vectors_2d[:, 0], vectors_2d[:, 1], alpha=0.7)
for i, word in enumerate(words):
ax.annotate(word, (vectors_2d[i, 0], vectors_2d[i, 1]),
fontsize=9, ha='center', va='bottom')
ax.set_title("Word2Vec Embeddings t-SNE Visualization")
plt.tight_layout()
plt.savefig('word2vec_tsne.png', dpi=150, bbox_inches='tight')
plt.show()
visualize_embeddings(model)
3.3 GloVe
GloVe (Global Vectors for Word Representation), developed at Stanford, exploits global co-occurrence statistics from the entire corpus.
# Using pre-trained GloVe vectors
# pip install torchtext
from torchtext.vocab import GloVe
import torch
glove = GloVe(name='6B', dim=100)
vector = glove['computer']
print("'computer' GloVe vector (first 5 dims):", vector[:5].numpy())
def cosine_similarity(v1, v2):
return torch.nn.functional.cosine_similarity(
v1.unsqueeze(0), v2.unsqueeze(0)
).item()
words_to_compare = [('king', 'queen'), ('man', 'woman'), ('paris', 'france')]  # GloVe 6B vocabulary is lowercased
for w1, w2 in words_to_compare:
if w1 in glove.stoi and w2 in glove.stoi:
sim = cosine_similarity(glove[w1], glove[w2])
print(f"sim('{w1}', '{w2}') = {sim:.4f}")
4. NLP with Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are designed to process sequential data. They carry information from previous time steps to the current step, making them well-suited for text.
4.1 Basic RNN Architecture
import torch
import torch.nn as nn
class SimpleRNN(nn.Module):
"""Basic RNN implementation"""
def __init__(self, input_size, hidden_size, output_size):
super().__init__()
self.hidden_size = hidden_size
self.input_to_hidden = nn.Linear(input_size + hidden_size, hidden_size)
self.hidden_to_output = nn.Linear(hidden_size, output_size)
def forward(self, x, hidden=None):
"""
x: (seq_len, batch, input_size)
hidden: (batch, hidden_size) initial hidden state
"""
batch_size = x.size(1)
if hidden is None:
hidden = torch.zeros(batch_size, self.hidden_size)
outputs = []
for t in range(x.size(0)):
combined = torch.cat([x[t], hidden], dim=1)
hidden = torch.tanh(self.input_to_hidden(combined))
output = self.hidden_to_output(hidden)
outputs.append(output)
return torch.stack(outputs, dim=0), hidden
# Using PyTorch built-in RNN
rnn = nn.RNN(
input_size=50,
hidden_size=128,
num_layers=2,
batch_first=True,
dropout=0.3,
bidirectional=True
)
x = torch.randn(32, 20, 50) # (batch, seq_len, features)
output, hidden = rnn(x)
print("Output shape:", output.shape) # (32, 20, 256) - bidirectional: 128*2
print("Hidden shape:", hidden.shape) # (4, 32, 128) - 2 layers * 2 directions
4.2 LSTM — Gate Mechanisms Explained
LSTM (Long Short-Term Memory) was designed to solve the vanishing gradient problem of vanilla RNNs. Its gate mechanism allows it to capture long-range dependencies.
class LSTMCell(nn.Module):
"""LSTM cell — manual implementation for understanding gate mechanics"""
def __init__(self, input_size, hidden_size):
super().__init__()
self.hidden_size = hidden_size
# Compute all 4 gates at once for efficiency
# forget gate, input gate, cell gate, output gate
self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)
def forward(self, x, state=None):
"""
x: (batch, input_size)
state: (h, c) previous hidden and cell states
"""
batch_size = x.size(0)
if state is None:
h = torch.zeros(batch_size, self.hidden_size)
c = torch.zeros(batch_size, self.hidden_size)
else:
h, c = state
combined = torch.cat([x, h], dim=1)
gates = self.gates(combined)
f_gate, i_gate, g_gate, o_gate = gates.chunk(4, dim=1)
f = torch.sigmoid(f_gate) # forget gate: how much to forget
i = torch.sigmoid(i_gate) # input gate: how much new info to store
g = torch.tanh(g_gate) # cell gate: new candidate information
o = torch.sigmoid(o_gate) # output gate: how much to output
new_c = f * c + i * g # forget some, add new
new_h = o * torch.tanh(new_c)
return new_h, (new_h, new_c)
# Bidirectional LSTM for sentiment analysis
class SentimentLSTM(nn.Module):
"""LSTM-based sentiment analysis model"""
def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, num_layers=2, dropout=0.3):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
self.lstm = nn.LSTM(
embed_dim,
hidden_dim,
num_layers=num_layers,
batch_first=True,
dropout=dropout,
bidirectional=True
)
self.dropout = nn.Dropout(dropout)
self.classifier = nn.Sequential(
nn.Linear(hidden_dim * 2, hidden_dim),
nn.ReLU(),
nn.Dropout(dropout),
nn.Linear(hidden_dim, num_classes)
)
def forward(self, x, lengths=None):
embedded = self.dropout(self.embedding(x))
if lengths is not None:
packed = nn.utils.rnn.pack_padded_sequence(
embedded, lengths, batch_first=True, enforce_sorted=False
)
lstm_out, (hidden, cell) = self.lstm(packed)
lstm_out, _ = nn.utils.rnn.pad_packed_sequence(lstm_out, batch_first=True)
else:
lstm_out, (hidden, cell) = self.lstm(embedded)
forward_hidden = hidden[-2]
backward_hidden = hidden[-1]
combined = torch.cat([forward_hidden, backward_hidden], dim=1)
output = self.classifier(self.dropout(combined))
return output
4.3 GRU
GRU (Gated Recurrent Unit) is a simpler alternative to the LSTM: it merges the cell and hidden states and uses only two gates, so it has fewer parameters and trains faster while often reaching comparable performance.
class GRUCell(nn.Module):
"""GRU cell — manual implementation"""
def __init__(self, input_size, hidden_size):
super().__init__()
self.hidden_size = hidden_size
self.reset_gate = nn.Linear(input_size + hidden_size, hidden_size)
self.update_gate = nn.Linear(input_size + hidden_size, hidden_size)
self.new_gate_input = nn.Linear(input_size, hidden_size)
self.new_gate_hidden = nn.Linear(hidden_size, hidden_size)
def forward(self, x, h=None):
batch_size = x.size(0)
if h is None:
h = torch.zeros(batch_size, self.hidden_size)
combined = torch.cat([x, h], dim=1)
r = torch.sigmoid(self.reset_gate(combined)) # reset gate
z = torch.sigmoid(self.update_gate(combined)) # update gate
n = torch.tanh(
self.new_gate_input(x) + r * self.new_gate_hidden(h)
)
new_h = (1 - z) * n + z * h
return new_h
4.4 Seq2Seq Machine Translation
class Encoder(nn.Module):
"""Seq2Seq Encoder"""
def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=2, dropout=0.5):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
batch_first=True, dropout=dropout)
self.dropout = nn.Dropout(dropout)
def forward(self, src):
embedded = self.dropout(self.embedding(src))
outputs, (hidden, cell) = self.lstm(embedded)
return outputs, hidden, cell
class Decoder(nn.Module):
"""Seq2Seq Decoder"""
def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=2, dropout=0.5):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
batch_first=True, dropout=dropout)
self.fc = nn.Linear(hidden_dim, vocab_size)
self.dropout = nn.Dropout(dropout)
def forward(self, trg, hidden, cell):
trg = trg.unsqueeze(1) # (batch, 1)
embedded = self.dropout(self.embedding(trg))
output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
prediction = self.fc(output.squeeze(1)) # (batch, vocab_size)
return prediction, hidden, cell
class Seq2Seq(nn.Module):
"""Seq2Seq Translation Model"""
def __init__(self, encoder, decoder, device):
super().__init__()
self.encoder = encoder
self.decoder = decoder
self.device = device
def forward(self, src, trg, teacher_forcing_ratio=0.5):
batch_size = src.shape[0]
trg_len = trg.shape[1]
trg_vocab_size = self.decoder.fc.out_features
outputs = torch.zeros(batch_size, trg_len, trg_vocab_size).to(self.device)
encoder_outputs, hidden, cell = self.encoder(src)
# First decoder input: start-of-sequence token
input = trg[:, 0]
for t in range(1, trg_len):
output, hidden, cell = self.decoder(input, hidden, cell)
outputs[:, t] = output
teacher_force = torch.rand(1).item() < teacher_forcing_ratio
top1 = output.argmax(1)
input = trg[:, t] if teacher_force else top1
return outputs
5. Attention Mechanisms
Attention mechanisms allow a model to dynamically focus on different parts of the input sequence when generating each output token.
5.1 Bahdanau Attention
class BahdanauAttention(nn.Module):
"""Bahdanau (Additive) Attention"""
def __init__(self, hidden_dim):
super().__init__()
self.W_enc = nn.Linear(hidden_dim * 2, hidden_dim)
self.W_dec = nn.Linear(hidden_dim, hidden_dim)
self.v = nn.Linear(hidden_dim, 1)
def forward(self, decoder_hidden, encoder_outputs):
"""
decoder_hidden: (batch, hidden) current decoder hidden state
encoder_outputs: (batch, src_len, hidden*2) all encoder outputs
"""
src_len = encoder_outputs.shape[1]
decoder_hidden = decoder_hidden.unsqueeze(1).repeat(1, src_len, 1)
energy = torch.tanh(
self.W_enc(encoder_outputs) + self.W_dec(decoder_hidden)
)
attention = self.v(energy).squeeze(2) # (batch, src_len)
attention_weights = torch.softmax(attention, dim=1)
context = torch.bmm(
attention_weights.unsqueeze(1), # (batch, 1, src_len)
encoder_outputs # (batch, src_len, hidden*2)
).squeeze(1)
return context, attention_weights
class LuongAttention(nn.Module):
"""Luong (Multiplicative) Attention"""
def __init__(self, hidden_dim, method='dot'):
super().__init__()
self.method = method
if method == 'general':
self.W = nn.Linear(hidden_dim, hidden_dim)
elif method == 'concat':
self.W = nn.Linear(hidden_dim * 2, hidden_dim)
self.v = nn.Linear(hidden_dim, 1)
def forward(self, decoder_hidden, encoder_outputs):
if self.method == 'dot':
score = torch.bmm(
encoder_outputs,
decoder_hidden.unsqueeze(2)
).squeeze(2)
elif self.method == 'general':
energy = self.W(encoder_outputs)
score = torch.bmm(
energy,
decoder_hidden.unsqueeze(2)
).squeeze(2)
elif self.method == 'concat':
decoder_expanded = decoder_hidden.unsqueeze(1).expand_as(encoder_outputs)
energy = torch.tanh(self.W(torch.cat([decoder_expanded, encoder_outputs], dim=2)))
score = self.v(energy).squeeze(2)
attention_weights = torch.softmax(score, dim=1)
context = torch.bmm(attention_weights.unsqueeze(1), encoder_outputs).squeeze(1)
return context, attention_weights
5.2 Self-Attention
Self-attention allows each position in a sequence to attend to all other positions. It is the core building block of the Transformer.
class SelfAttention(nn.Module):
"""Multi-Head Self-Attention"""
def __init__(self, embed_dim, num_heads=8, dropout=0.1):
super().__init__()
assert embed_dim % num_heads == 0
self.embed_dim = embed_dim
self.num_heads = num_heads
self.head_dim = embed_dim // num_heads
self.scale = self.head_dim ** -0.5
self.W_q = nn.Linear(embed_dim, embed_dim)
self.W_k = nn.Linear(embed_dim, embed_dim)
self.W_v = nn.Linear(embed_dim, embed_dim)
self.W_o = nn.Linear(embed_dim, embed_dim)
self.dropout = nn.Dropout(dropout)
def forward(self, x, mask=None):
"""
x: (batch, seq_len, embed_dim)
mask: (batch, seq_len, seq_len) attention mask
"""
batch_size, seq_len, _ = x.shape
Q = self.W_q(x)
K = self.W_k(x)
V = self.W_v(x)
Q = Q.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
K = K.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
V = V.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
scores = torch.matmul(Q, K.transpose(-2, -1)) * self.scale
if mask is not None:
scores = scores.masked_fill(mask == 0, float('-inf'))
attention_weights = self.dropout(torch.softmax(scores, dim=-1))
attended = torch.matmul(attention_weights, V)
attended = attended.transpose(1, 2).contiguous()
attended = attended.view(batch_size, seq_len, self.embed_dim)
output = self.W_o(attended)
return output, attention_weights
6. Deep Dive into the Transformer Architecture
The Transformer, introduced in the 2017 paper "Attention is All You Need," completely changed the NLP paradigm. Using only attention mechanisms — no RNNs or convolutions — it achieved state-of-the-art translation performance.
6.1 Positional Encoding
Because Transformers process all positions in parallel, they need positional encoding to inject information about sequence order.
import torch
import torch.nn as nn
import math
class PositionalEncoding(nn.Module):
"""Sinusoidal Positional Encoding"""
def __init__(self, embed_dim, max_len=5000, dropout=0.1):
super().__init__()
self.dropout = nn.Dropout(p=dropout)
pe = torch.zeros(max_len, embed_dim)
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(
torch.arange(0, embed_dim, 2).float() * (-math.log(10000.0) / embed_dim)
)
pe[:, 0::2] = torch.sin(position * div_term) # even indices: sin
pe[:, 1::2] = torch.cos(position * div_term) # odd indices: cos
pe = pe.unsqueeze(0) # (1, max_len, embed_dim)
self.register_buffer('pe', pe)
def forward(self, x):
"""x: (batch, seq_len, embed_dim)"""
x = x + self.pe[:, :x.size(1), :]
return self.dropout(x)
class LearnablePositionalEncoding(nn.Module):
"""Learnable Positional Encoding (BERT style)"""
def __init__(self, embed_dim, max_len=512):
super().__init__()
self.pe = nn.Embedding(max_len, embed_dim)
def forward(self, x):
batch_size, seq_len, _ = x.shape
positions = torch.arange(seq_len, device=x.device).unsqueeze(0)
return x + self.pe(positions)
6.2 Full Transformer Implementation
class MultiHeadAttention(nn.Module):
"""Multi-Head Attention"""
def __init__(self, embed_dim, num_heads, dropout=0.1):
super().__init__()
assert embed_dim % num_heads == 0
self.embed_dim = embed_dim
self.num_heads = num_heads
self.head_dim = embed_dim // num_heads
self.scale = self.head_dim ** -0.5
self.W_q = nn.Linear(embed_dim, embed_dim, bias=False)
self.W_k = nn.Linear(embed_dim, embed_dim, bias=False)
self.W_v = nn.Linear(embed_dim, embed_dim, bias=False)
self.W_o = nn.Linear(embed_dim, embed_dim)
self.dropout = nn.Dropout(dropout)
def split_heads(self, x):
"""(batch, seq, embed) -> (batch, heads, seq, head_dim)"""
batch, seq, _ = x.shape
x = x.view(batch, seq, self.num_heads, self.head_dim)
return x.transpose(1, 2)
def forward(self, query, key, value, mask=None):
batch_size = query.shape[0]
Q = self.split_heads(self.W_q(query))
K = self.split_heads(self.W_k(key))
V = self.split_heads(self.W_v(value))
scores = torch.matmul(Q, K.transpose(-2, -1)) * self.scale
if mask is not None:
scores = scores.masked_fill(mask == 0, float('-inf'))
attn_weights = self.dropout(torch.softmax(scores, dim=-1))
output = torch.matmul(attn_weights, V)
output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.embed_dim)
return self.W_o(output), attn_weights
class FeedForward(nn.Module):
"""Position-wise Feed-Forward Network"""
def __init__(self, embed_dim, ff_dim, dropout=0.1):
super().__init__()
self.net = nn.Sequential(
nn.Linear(embed_dim, ff_dim),
nn.GELU(),
nn.Dropout(dropout),
nn.Linear(ff_dim, embed_dim),
nn.Dropout(dropout)
)
def forward(self, x):
return self.net(x)
class EncoderLayer(nn.Module):
"""Transformer Encoder Layer"""
def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
super().__init__()
self.self_attention = MultiHeadAttention(embed_dim, num_heads, dropout)
self.feed_forward = FeedForward(embed_dim, ff_dim, dropout)
self.norm1 = nn.LayerNorm(embed_dim)
self.norm2 = nn.LayerNorm(embed_dim)
self.dropout = nn.Dropout(dropout)
def forward(self, x, src_mask=None):
attn_output, _ = self.self_attention(x, x, x, src_mask)
x = self.norm1(x + self.dropout(attn_output))
ff_output = self.feed_forward(x)
x = self.norm2(x + ff_output)
return x
class DecoderLayer(nn.Module):
"""Transformer Decoder Layer"""
def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
super().__init__()
self.self_attention = MultiHeadAttention(embed_dim, num_heads, dropout)
self.cross_attention = MultiHeadAttention(embed_dim, num_heads, dropout)
self.feed_forward = FeedForward(embed_dim, ff_dim, dropout)
self.norm1 = nn.LayerNorm(embed_dim)
self.norm2 = nn.LayerNorm(embed_dim)
self.norm3 = nn.LayerNorm(embed_dim)
self.dropout = nn.Dropout(dropout)
def forward(self, x, memory, src_mask=None, tgt_mask=None):
# Masked Self-Attention (future token masking)
self_attn_out, _ = self.self_attention(x, x, x, tgt_mask)
x = self.norm1(x + self.dropout(self_attn_out))
# Cross-Attention (attend to encoder outputs)
cross_attn_out, cross_attn_weights = self.cross_attention(x, memory, memory, src_mask)
x = self.norm2(x + self.dropout(cross_attn_out))
ff_out = self.feed_forward(x)
x = self.norm3(x + ff_out)
return x, cross_attn_weights
class Transformer(nn.Module):
"""Full Transformer Implementation"""
def __init__(
self,
src_vocab_size,
tgt_vocab_size,
embed_dim=512,
num_heads=8,
num_encoder_layers=6,
num_decoder_layers=6,
ff_dim=2048,
max_len=5000,
dropout=0.1
):
super().__init__()
self.src_embedding = nn.Embedding(src_vocab_size, embed_dim)
self.tgt_embedding = nn.Embedding(tgt_vocab_size, embed_dim)
self.pos_encoding = PositionalEncoding(embed_dim, max_len, dropout)
self.embed_scale = math.sqrt(embed_dim)
self.encoder_layers = nn.ModuleList([
EncoderLayer(embed_dim, num_heads, ff_dim, dropout)
for _ in range(num_encoder_layers)
])
self.decoder_layers = nn.ModuleList([
DecoderLayer(embed_dim, num_heads, ff_dim, dropout)
for _ in range(num_decoder_layers)
])
self.output_norm = nn.LayerNorm(embed_dim)
self.output_projection = nn.Linear(embed_dim, tgt_vocab_size)
self._init_weights()
def _init_weights(self):
for p in self.parameters():
if p.dim() > 1:
nn.init.xavier_uniform_(p)
def make_causal_mask(self, seq_len, device):
"""Autoregressive mask — prevents attending to future tokens"""
mask = torch.tril(torch.ones(seq_len, seq_len, device=device)).bool()
return mask.unsqueeze(0).unsqueeze(0)
def make_pad_mask(self, x, pad_idx=0):
"""Padding mask"""
return (x != pad_idx).unsqueeze(1).unsqueeze(2)
def encode(self, src, src_mask=None):
x = self.pos_encoding(self.src_embedding(src) * self.embed_scale)
for layer in self.encoder_layers:
x = layer(x, src_mask)
return x
def decode(self, tgt, memory, src_mask=None, tgt_mask=None):
x = self.pos_encoding(self.tgt_embedding(tgt) * self.embed_scale)
for layer in self.decoder_layers:
x, _ = layer(x, memory, src_mask, tgt_mask)
return self.output_norm(x)
def forward(self, src, tgt, src_pad_idx=0, tgt_pad_idx=0):
src_mask = self.make_pad_mask(src, src_pad_idx)
tgt_len = tgt.shape[1]
tgt_pad_mask = self.make_pad_mask(tgt, tgt_pad_idx)
tgt_causal_mask = self.make_causal_mask(tgt_len, tgt.device)
tgt_mask = tgt_pad_mask & tgt_causal_mask
memory = self.encode(src, src_mask)
output = self.decode(tgt, memory, src_mask, tgt_mask)
logits = self.output_projection(output)
return logits
# Instantiate and test
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Transformer(
src_vocab_size=10000,
tgt_vocab_size=10000,
embed_dim=512,
num_heads=8,
num_encoder_layers=6,
num_decoder_layers=6,
ff_dim=2048,
dropout=0.1
).to(device)
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
src = torch.randint(1, 10000, (4, 20)).to(device)
tgt = torch.randint(1, 10000, (4, 18)).to(device)
output = model(src, tgt)
print(f"Output shape: {output.shape}") # (4, 18, 10000)
7. Deep Dive into BERT
BERT (Bidirectional Encoder Representations from Transformers) was published by Google in 2018 and revolutionized the NLP field.
7.1 BERT's Core Ideas
BERT uses two pre-training tasks:
Masked Language Modeling (MLM): 15% of input tokens are selected for prediction; of these, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged, and the model must recover the original words. Next Sentence Prediction (NSP): Given a pair of sentences, the model predicts whether the second sentence actually followed the first in the corpus.
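To make MLM concrete, here is a minimal sketch using the Hugging Face BertForMaskedLM head; the example sentence and the exact top predictions are illustrative and will vary with the checkpoint.
from transformers import BertTokenizer, BertForMaskedLM
import torch
mlm_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
mlm_model = BertForMaskedLM.from_pretrained('bert-base-uncased')
mlm_model.eval()
masked_text = "Natural language processing is a [MASK] of artificial intelligence."
inputs = mlm_tokenizer(masked_text, return_tensors='pt')
# Locate the [MASK] position
mask_index = (inputs['input_ids'][0] == mlm_tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
with torch.no_grad():
    logits = mlm_model(**inputs).logits
# Top-5 candidate tokens for the masked position
top_ids = logits[0, mask_index].topk(5, dim=-1).indices[0].tolist()
print("Top predictions for [MASK]:", mlm_tokenizer.convert_ids_to_tokens(top_ids))
# e.g. tokens such as 'branch', 'field', or 'form' (exact ranking depends on the model)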
from transformers import (
BertTokenizer,
BertModel,
BertForSequenceClassification,
BertForTokenClassification,
BertForQuestionAnswering,
get_linear_schedule_with_warmup
)
from torch.optim import AdamW  # transformers.AdamW is deprecated; use the PyTorch implementation
import torch
from torch.utils.data import Dataset, DataLoader
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "Natural language processing is transforming AI."
encoding = tokenizer(
text,
add_special_tokens=True,
max_length=128,
padding='max_length',
truncation=True,
return_tensors='pt'
)
print("Input IDs:", encoding['input_ids'][0][:10])
print("Attention mask:", encoding['attention_mask'][0][:10])
print("Decoded:", tokenizer.decode(encoding['input_ids'][0]))
# WordPiece tokenization
complex_words = ["unbelievable", "preprocessing", "transformers", "tokenization"]
for word in complex_words:
tokens = tokenizer.tokenize(word)
print(f"{word:20} -> {tokens}")
7.2 BERT Fine-tuning: Text Classification
class SentimentDataset(Dataset):
"""Sentiment analysis dataset"""
def __init__(self, texts, labels, tokenizer, max_length=128):
self.texts = texts
self.labels = labels
self.tokenizer = tokenizer
self.max_length = max_length
def __len__(self):
return len(self.texts)
def __getitem__(self, idx):
encoding = self.tokenizer(
self.texts[idx],
max_length=self.max_length,
padding='max_length',
truncation=True,
return_tensors='pt'
)
return {
'input_ids': encoding['input_ids'].squeeze(),
'attention_mask': encoding['attention_mask'].squeeze(),
'label': torch.tensor(self.labels[idx], dtype=torch.long)
}
def train_bert_classifier(
train_texts,
train_labels,
val_texts,
val_labels,
num_labels=2,
epochs=3,
batch_size=16,
lr=2e-5
):
"""BERT sentiment analysis fine-tuning"""
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
train_dataset = SentimentDataset(train_texts, train_labels, tokenizer)
val_dataset = SentimentDataset(val_texts, val_labels, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
model = BertForSequenceClassification.from_pretrained(
'bert-base-uncased',
num_labels=num_labels
).to(device)
optimizer = AdamW(model.parameters(), lr=lr, weight_decay=0.01)
total_steps = len(train_loader) * epochs
scheduler = get_linear_schedule_with_warmup(
optimizer,
num_warmup_steps=total_steps // 10,
num_training_steps=total_steps
)
best_val_acc = 0
for epoch in range(epochs):
model.train()
total_loss = 0
correct = 0
for batch in train_loader:
input_ids = batch['input_ids'].to(device)
attention_mask = batch['attention_mask'].to(device)
labels = batch['label'].to(device)
optimizer.zero_grad()
outputs = model(
input_ids=input_ids,
attention_mask=attention_mask,
labels=labels
)
loss = outputs.loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
scheduler.step()
total_loss += loss.item()
preds = outputs.logits.argmax(dim=1)
correct += (preds == labels).sum().item()
train_acc = correct / len(train_dataset)
avg_loss = total_loss / len(train_loader)
model.eval()
val_correct = 0
with torch.no_grad():
for batch in val_loader:
input_ids = batch['input_ids'].to(device)
attention_mask = batch['attention_mask'].to(device)
labels = batch['label'].to(device)
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
preds = outputs.logits.argmax(dim=1)
val_correct += (preds == labels).sum().item()
val_acc = val_correct / len(val_dataset)
print(f"Epoch {epoch+1}: Loss={avg_loss:.4f}, Train Acc={train_acc:.4f}, Val Acc={val_acc:.4f}")
if val_acc > best_val_acc:
best_val_acc = val_acc
torch.save(model.state_dict(), 'best_bert_classifier.pt')
return model
7.3 BERT NER (Named Entity Recognition)
from transformers import BertForTokenClassification, BertTokenizerFast
ner_labels = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
label2id = {label: i for i, label in enumerate(ner_labels)}
id2label = {i: label for i, label in enumerate(ner_labels)}
model = BertForTokenClassification.from_pretrained(
'bert-base-uncased',
num_labels=len(ner_labels),
id2label=id2label,
label2id=label2id
)
def predict_ner(text, model, tokenizer):
"""Predict named entities"""
model.eval()
encoding = tokenizer(
text.split(),
is_split_into_words=True,
return_offsets_mapping=True,
padding=True,
truncation=True,
return_tensors='pt'
)
with torch.no_grad():
outputs = model(
input_ids=encoding['input_ids'],
attention_mask=encoding['attention_mask']
)
predictions = outputs.logits.argmax(dim=2).squeeze().tolist()
word_ids = encoding.word_ids()
result = []
prev_word_id = None
for pred, word_id in zip(predictions, word_ids):
if word_id is None or word_id == prev_word_id:
continue
word = text.split()[word_id]
label = id2label[pred]
result.append((word, label))
prev_word_id = word_id
return result
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')  # fast tokenizer required for word_ids() / offset mapping
text = "Barack Obama was born in Hawaii and served as the 44th President of the United States."
# print(predict_ner(text, model, tokenizer)) # use with a fine-tuned model
8. The GPT Family of Models
GPT (Generative Pre-trained Transformer) is a series of autoregressive language models developed by OpenAI.
8.1 GPT Evolution
| Model | Year | Parameters | Key Feature |
|---|---|---|---|
| GPT-1 | 2018 | 117M | First GPT, unsupervised pre-training |
| GPT-2 | 2019 | 1.5B | Zero-shot transfer, "too dangerous to release" |
| GPT-3 | 2020 | 175B | In-context learning, few-shot |
| InstructGPT | 2022 | 1.3B | RLHF, following instructions |
| GPT-4 | 2023 | Undisclosed | Multimodal, stronger reasoning |
8.2 Text Generation with GPT-2
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
tokenizer.pad_token = tokenizer.eos_token
def generate_text(
prompt,
model,
tokenizer,
max_length=200,
temperature=0.9,
top_p=0.95,
top_k=50,
num_return_sequences=1,
do_sample=True
):
"""Generate text with GPT-2"""
inputs = tokenizer(prompt, return_tensors='pt')
input_ids = inputs['input_ids']
with torch.no_grad():
output = model.generate(
input_ids,
max_length=max_length,
temperature=temperature,
top_p=top_p,
top_k=top_k,
num_return_sequences=num_return_sequences,
do_sample=do_sample,
pad_token_id=tokenizer.eos_token_id,
repetition_penalty=1.2
)
generated_texts = []
for out in output:
text = tokenizer.decode(out, skip_special_tokens=True)
generated_texts.append(text)
return generated_texts
prompt = "Artificial intelligence is transforming the world by"
generated = generate_text(prompt, model, tokenizer, max_length=150, temperature=0.8)
for i, text in enumerate(generated):
print(f"\nGenerated text {i+1}:")
print(text)
print("-" * 50)
def compare_generation_strategies(prompt, model, tokenizer):
"""Compare different decoding strategies"""
input_ids = tokenizer.encode(prompt, return_tensors='pt')
strategies = {
"Greedy Search": dict(do_sample=False),
"Beam Search": dict(do_sample=False, num_beams=5, early_stopping=True),
"Temperature Sampling": dict(do_sample=True, temperature=0.7),
"Top-k Sampling": dict(do_sample=True, top_k=50),
"Top-p (Nucleus) Sampling": dict(do_sample=True, top_p=0.92),
}
print(f"Prompt: {prompt}\n")
with torch.no_grad():
for name, params in strategies.items():
output = model.generate(
input_ids,
max_new_tokens=50,
pad_token_id=tokenizer.eos_token_id,
**params
)
text = tokenizer.decode(output[0], skip_special_tokens=True)
print(f"[{name}]")
print(text[len(prompt):].strip())
print()
compare_generation_strategies(
"The future of artificial intelligence",
model,
tokenizer
)
8.3 Prompt Engineering
# Zero-shot classification
def zero_shot_classification(text, categories, model_name="gpt2"):
"""Zero-shot text classification"""
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()
scores = {}
for category in categories:
prompt = f"Text: {text}\nCategory: {category}"
inputs = tokenizer(prompt, return_tensors='pt')
with torch.no_grad():
outputs = model(**inputs, labels=inputs['input_ids'])
scores[category] = -outputs.loss.item()
best_category = max(scores, key=scores.get)
return best_category, scores
# Few-shot learning example
few_shot_prompt = """
Classify the sentiment of the following texts:
Text: "This movie was amazing!"
Sentiment: Positive
Text: "I hated every minute of it."
Sentiment: Negative
Text: "The film was okay, nothing special."
Sentiment: Neutral
Text: "Absolutely brilliant performance!"
Sentiment:"""
print("Few-shot prompt:")
print(few_shot_prompt)
# Chain-of-Thought prompting
cot_prompt = """
Q: A train travels 120 miles in 2 hours. How long will it take to travel 300 miles?
A: Let me think step by step.
1. First, find the speed: 120 miles / 2 hours = 60 miles per hour
2. Then, calculate the time for 300 miles: 300 miles / 60 mph = 5 hours
Therefore, it will take 5 hours.
Q: If a store sells 3 apples for $2.40, how much do 7 apples cost?
A: Let me think step by step.
"""
9. Modern NLP Techniques
9.1 LoRA / QLoRA Fine-tuning
LoRA (Low-Rank Adaptation) is a technique for efficiently fine-tuning large language models. It freezes the original weights and trains only small low-rank matrices added to the existing layers.
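Conceptually, LoRA keeps a pre-trained weight matrix frozen and learns a low-rank update (the product of two small matrices B and A, scaled by alpha/r), so only those two factors are trained. The minimal sketch below illustrates the idea with a wrapped nn.Linear; it is a simplified stand-in for what the peft library (used next) does internally, not its actual implementation.
import torch
import torch.nn as nn
class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (conceptual sketch)"""
    def __init__(self, base_linear, r=8, lora_alpha=32):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False  # original weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(r, base_linear.in_features) * 0.01)  # down-projection
        self.lora_B = nn.Parameter(torch.zeros(base_linear.out_features, r))  # up-projection, initialized to zero
        self.scaling = lora_alpha / r
    def forward(self, x):
        # frozen path + scaled low-rank update: W x + (alpha/r) * B A x
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"Trainable parameters: {trainable:,} of {total:,}")  # only the LoRA factors are trainable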
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
def setup_lora_model(model_name="gpt2", r=8, lora_alpha=32, lora_dropout=0.1):
"""Configure a LoRA model"""
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=r, # LoRA rank
lora_alpha=lora_alpha, # scaling parameter
target_modules=["c_attn"], # layers to apply LoRA to
lora_dropout=lora_dropout,
bias="none"
)
model = get_peft_model(model, lora_config)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} ({100 * trainable / total:.2f}%)")
return model, tokenizer
class InstructionDataset(torch.utils.data.Dataset):
"""Instruction-format dataset for SFT"""
def __init__(self, data, tokenizer, max_length=512):
self.tokenizer = tokenizer
self.max_length = max_length
self.texts = []
for item in data:
prompt = f"### Instruction:\n{item['instruction']}\n\n### Response:\n{item['response']}"
self.texts.append(prompt)
def __len__(self):
return len(self.texts)
def __getitem__(self, idx):
encoding = self.tokenizer(
self.texts[idx],
max_length=self.max_length,
padding='max_length',
truncation=True,
return_tensors='pt'
)
input_ids = encoding['input_ids'].squeeze()
return {
'input_ids': input_ids,
'attention_mask': encoding['attention_mask'].squeeze(),
'labels': input_ids.clone()
}
9.2 RAG (Retrieval-Augmented Generation)
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
class SimpleRAG:
"""Simple RAG system"""
def __init__(self, embedding_model="sentence-transformers/all-MiniLM-L6-v2"):
self.tokenizer = AutoTokenizer.from_pretrained(embedding_model)
self.model = AutoModel.from_pretrained(embedding_model)
self.knowledge_base = []
self.embeddings = []
def mean_pooling(self, model_output, attention_mask):
token_embeddings = model_output[0]
input_mask_expanded = attention_mask.unsqueeze(-1).expand(
token_embeddings.size()
).float()
return torch.sum(token_embeddings * input_mask_expanded, 1) / \
torch.clamp(input_mask_expanded.sum(1), min=1e-9)
def encode(self, texts):
encoded = self.tokenizer(
texts,
padding=True,
truncation=True,
max_length=512,
return_tensors='pt'
)
with torch.no_grad():
model_output = self.model(**encoded)
embeddings = self.mean_pooling(model_output, encoded['attention_mask'])
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
return embeddings.numpy()
def add_documents(self, documents):
self.knowledge_base.extend(documents)
new_embeddings = self.encode(documents)
if len(self.embeddings) == 0:
self.embeddings = new_embeddings
else:
self.embeddings = np.vstack([self.embeddings, new_embeddings])
print(f"Added {len(documents)} documents. Total: {len(self.knowledge_base)}.")
def retrieve(self, query, top_k=3):
query_embedding = self.encode([query])
similarities = cosine_similarity(query_embedding, self.embeddings)[0]
top_indices = np.argsort(similarities)[::-1][:top_k]
results = []
for idx in top_indices:
results.append({
'document': self.knowledge_base[idx],
'similarity': similarities[idx]
})
return results
def answer(self, query, generator_model, generator_tokenizer, top_k=3):
"""RAG: retrieve + generate"""
relevant_docs = self.retrieve(query, top_k=top_k)
context = "\n".join([doc['document'] for doc in relevant_docs])
prompt = f"""Based on the following context, answer the question.
Context:
{context}
Question: {query}
Answer:"""
inputs = generator_tokenizer(prompt, return_tensors='pt', max_length=512, truncation=True)
with torch.no_grad():
outputs = generator_model.generate(
inputs['input_ids'],
max_new_tokens=150,
temperature=0.7,
do_sample=True,
pad_token_id=generator_tokenizer.eos_token_id
)
answer = generator_tokenizer.decode(outputs[0], skip_special_tokens=True)
answer = answer[len(prompt):].strip()
return answer, relevant_docs
# Usage example
rag = SimpleRAG()
documents = [
"BERT (Bidirectional Encoder Representations from Transformers) was published by Google in 2018.",
"GPT-3 has 175 billion parameters and was developed by OpenAI in 2020.",
"The Transformer architecture was introduced in the paper 'Attention is All You Need' in 2017.",
"RLHF (Reinforcement Learning from Human Feedback) is used to align language models with human values.",
"LoRA allows efficient fine-tuning by adding low-rank matrices to pre-trained model weights.",
"RAG combines information retrieval with text generation for more accurate responses.",
]
rag.add_documents(documents)
query = "What is BERT and when was it created?"
results = rag.retrieve(query, top_k=2)
print(f"\nQuery: {query}")
print("\nRetrieved documents:")
for r in results:
print(f" [{r['similarity']:.3f}] {r['document']}")
9.3 RLHF (Reinforcement Learning from Human Feedback)
class RewardModel(nn.Module):
"""Reward model — scores response quality"""
def __init__(self, base_model_name="gpt2"):
super().__init__()
from transformers import GPT2Model
self.transformer = GPT2Model.from_pretrained(base_model_name)
hidden_size = self.transformer.config.hidden_size
self.reward_head = nn.Linear(hidden_size, 1)
def forward(self, input_ids, attention_mask=None):
outputs = self.transformer(input_ids=input_ids, attention_mask=attention_mask)
last_hidden = outputs.last_hidden_state[:, -1, :]
reward = self.reward_head(last_hidden)
return reward.squeeze(-1)
class PPOTrainer:
"""PPO-based RLHF training (conceptual implementation)"""
def __init__(self, policy_model, reward_model, ref_model, tokenizer):
self.policy = policy_model
self.reward_model = reward_model
self.ref_model = ref_model # Reference model for KL penalty
self.tokenizer = tokenizer
def compute_kl_penalty(self, policy_logprobs, ref_logprobs, kl_coeff=0.1):
"""KL divergence penalty — prevents policy from drifting too far"""
kl = policy_logprobs - ref_logprobs
return kl_coeff * kl.mean()
def compute_advantages(self, rewards, values, gamma=0.99, lam=0.95):
"""Generalized Advantage Estimation (GAE)"""
advantages = []
last_advantage = 0
for t in reversed(range(len(rewards))):
if t == len(rewards) - 1:
next_value = 0
else:
next_value = values[t + 1]
delta = rewards[t] + gamma * next_value - values[t]
advantage = delta + gamma * lam * last_advantage
advantages.insert(0, advantage)
last_advantage = advantage
return torch.tensor(advantages)
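The reward model above is typically trained on human preference pairs: for the same prompt, one response labeled as preferred (chosen) and one labeled as worse (rejected). A minimal sketch of the pairwise ranking loss, assuming batches of already-tokenized chosen/rejected responses:
def reward_ranking_loss(reward_model, chosen_ids, chosen_mask, rejected_ids, rejected_mask):
    """Pairwise preference loss: push r(chosen) above r(rejected)"""
    r_chosen = reward_model(chosen_ids, attention_mask=chosen_mask)
    r_rejected = reward_model(rejected_ids, attention_mask=rejected_mask)
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()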
10. Real-World NLP Projects
10.1 Sentiment Analysis System (Complete Pipeline)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup
import matplotlib.pyplot as plt
import seaborn as sns
class SentimentAnalysisPipeline:
"""End-to-end sentiment analysis pipeline"""
def __init__(self, model_name="distilbert-base-uncased-finetuned-sst-2-english"):
self.model_name = model_name
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.model.to(self.device)
self.model.eval()
self.label_map = {0: 'Negative', 1: 'Positive'}
def predict(self, texts, batch_size=32):
"""Batch prediction"""
if isinstance(texts, str):
texts = [texts]
all_predictions = []
all_probabilities = []
for i in range(0, len(texts), batch_size):
batch = texts[i:i + batch_size]
encoding = self.tokenizer(
batch,
padding=True,
truncation=True,
max_length=512,
return_tensors='pt'
)
input_ids = encoding['input_ids'].to(self.device)
attention_mask = encoding['attention_mask'].to(self.device)
with torch.no_grad():
outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
probs = torch.softmax(outputs.logits, dim=1)
preds = probs.argmax(dim=1)
all_predictions.extend(preds.cpu().numpy())
all_probabilities.extend(probs.cpu().numpy())
results = []
for pred, prob in zip(all_predictions, all_probabilities):
results.append({
'label': self.label_map[pred],
'score': float(prob[pred]),
'probabilities': {self.label_map[i]: float(p) for i, p in enumerate(prob)}
})
return results if len(results) > 1 else results[0]
def analyze_batch(self, texts):
"""Batch sentiment analysis with statistics"""
results = self.predict(texts)
positive_count = sum(1 for r in results if r['label'] == 'Positive')
negative_count = len(results) - positive_count
avg_confidence = np.mean([r['score'] for r in results])
print(f"\n=== Sentiment Analysis Results ===")
print(f"Total texts: {len(texts)}")
print(f"Positive: {positive_count} ({100*positive_count/len(texts):.1f}%)")
print(f"Negative: {negative_count} ({100*negative_count/len(texts):.1f}%)")
print(f"Average confidence: {avg_confidence:.3f}")
return results
pipeline = SentimentAnalysisPipeline()
reviews = [
"This product exceeded all my expectations! Absolutely love it.",
"Terrible quality, broke after one day. Complete waste of money.",
"It's okay, does what it's supposed to do.",
"Best purchase I've made this year!",
"Would not recommend to anyone.",
]
print("=== Individual Predictions ===")
for review in reviews:
result = pipeline.predict(review)
print(f"Text: {review[:50]}...")
print(f" -> {result['label']} ({result['score']:.3f})\n")
print("\n=== Batch Analysis ===")
pipeline.analyze_batch(reviews)
10.2 Text Summarization System
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
class TextSummarizer:
"""Text summarization system"""
def __init__(self, model_name="facebook/bart-large-cnn"):
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.model.to(self.device)
def summarize(self, text, max_length=130, min_length=30, num_beams=4):
"""Abstractive summarization"""
inputs = self.tokenizer(
text,
max_length=1024,
truncation=True,
return_tensors='pt'
).to(self.device)
with torch.no_grad():
summary_ids = self.model.generate(
inputs['input_ids'],
max_length=max_length,
min_length=min_length,
num_beams=num_beams,
length_penalty=2.0,
early_stopping=True
)
summary = self.tokenizer.decode(summary_ids[0], skip_special_tokens=True)
original_words = len(text.split())
summary_words = len(summary.split())
compression = (1 - summary_words / original_words) * 100
return {
'summary': summary,
'original_length': original_words,
'summary_length': summary_words,
'compression_rate': f"{compression:.1f}%"
}
def extractive_summarize(self, text, num_sentences=3):
"""Extractive summarization (TF-IDF based)"""
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
nltk.download('punkt', quiet=True)
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text)
if len(sentences) <= num_sentences:
return text
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(sentences)
similarity_matrix = cosine_similarity(tfidf_matrix)
scores = similarity_matrix.mean(axis=1)
top_indices = sorted(np.argsort(scores)[-num_sentences:].tolist())
summary = ' '.join([sentences[i] for i in top_indices])
return summary
summarizer = TextSummarizer()
long_text = """
Artificial intelligence has made remarkable strides in recent years, particularly in
natural language processing. The development of transformer-based models like BERT and
GPT has revolutionized how machines understand and generate human language. These models,
trained on massive datasets, can perform a wide range of tasks including translation,
summarization, question answering, and sentiment analysis.
The key innovation behind these models is the attention mechanism, which allows the model
to focus on relevant parts of the input when generating each word of the output. This has
enabled much more nuanced understanding of context and semantics compared to earlier
approaches like recurrent neural networks.
However, these advances come with challenges. Training large language models requires
enormous computational resources and energy. There are also concerns about bias in the
training data leading to biased outputs, and the potential for misuse in generating
misinformation. Researchers are actively working on making these models more efficient,
fair, and reliable.
The future of NLP looks promising, with models becoming increasingly capable of
understanding and generating language that is indistinguishable from human writing.
Applications range from customer service chatbots to scientific research assistance,
and the technology continues to evolve rapidly.
"""
result = summarizer.summarize(long_text)
print("=== Abstractive Summary ===")
print(f"Original length: {result['original_length']} words")
print(f"Summary length: {result['summary_length']} words")
print(f"Compression rate: {result['compression_rate']}")
print(f"\nSummary:\n{result['summary']}")
print("\n=== Extractive Summary ===")
extractive = summarizer.extractive_summarize(long_text, num_sentences=2)
print(extractive)
10.3 Question Answering System
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
class QASystem:
"""Question answering system"""
def __init__(self, model_name="deepset/roberta-base-squad2"):
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForQuestionAnswering.from_pretrained(model_name)
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.model.to(self.device)
self.model.eval()
def answer(self, question, context, max_answer_len=100):
"""Extract answer from context"""
encoding = self.tokenizer(
question,
context,
max_length=512,
truncation=True,
stride=128,
return_overflowing_tokens=True,
return_offsets_mapping=True,
padding='max_length',
return_tensors='pt'
)
offset_mapping = encoding.pop('offset_mapping').cpu()
sample_map = encoding.pop('overflow_to_sample_mapping').cpu()
encoding = {k: v.to(self.device) for k, v in encoding.items()}
with torch.no_grad():
outputs = self.model(**encoding)
start_logits = outputs.start_logits
end_logits = outputs.end_logits
answers = []
for i in range(len(start_logits)):
start_log = start_logits[i].cpu().numpy()
end_log = end_logits[i].cpu().numpy()
offsets = offset_mapping[i].numpy()
start_idx = np.argmax(start_log)
end_idx = np.argmax(end_log)
if start_idx <= end_idx:
start_char = offsets[start_idx][0]
end_char = offsets[end_idx][1]
answer_text = context[start_char:end_char]
score = float(start_log[start_idx] + end_log[end_idx])
if answer_text:
answers.append({
'answer': answer_text,
'score': score,
'start': int(start_char),
'end': int(end_char)
})
if not answers:
return {'answer': "No answer found.", 'score': 0.0}
return max(answers, key=lambda x: x['score'])
def multi_question_answer(self, questions, context):
"""Answer multiple questions about a context"""
results = []
for question in questions:
answer = self.answer(question, context)
results.append({
'question': question,
'answer': answer['answer'],
'confidence': answer['score']
})
return results
qa_system = QASystem()
context = """
The Transformer model was introduced in a 2017 paper titled "Attention Is All You Need"
by researchers at Google Brain. The model architecture relies entirely on attention
mechanisms, dispensing with recurrence and convolutions. BERT, which stands for
Bidirectional Encoder Representations from Transformers, was introduced in 2018 by
Google AI Language team. BERT achieved state-of-the-art results on eleven NLP tasks.
GPT-3, developed by OpenAI, was released in 2020 and has 175 billion parameters, making
it one of the largest language models at the time of its release.
"""
questions = [
"When was the Transformer model introduced?",
"What does BERT stand for?",
"How many parameters does GPT-3 have?",
"Who developed BERT?",
]
print("=== Question Answering System ===\n")
results = qa_system.multi_question_answer(questions, context)
for r in results:
print(f"Q: {r['question']}")
print(f"A: {r['answer']}")
print(f"Confidence: {r['confidence']:.2f}")
print()
Closing: NLP Learning Roadmap
This guide has walked through the complete NLP journey. A summary of each stage:
- Foundations: text preprocessing, BoW, TF-IDF
- Embeddings: Word2Vec, GloVe, FastText
- Sequence Models: RNN, LSTM, GRU, Seq2Seq
- The Attention Revolution: attention mechanisms, self-attention
- Mastering Transformers: the full architecture
- Pre-trained Models: BERT and GPT in practice
- Modern Techniques: LoRA, RAG, RLHF
Recommended Resources
- HuggingFace Official Docs
- Attention is All You Need Paper
- BERT Paper
- PyTorch NLP Tutorial
- Stanford CS224N — NLP with Deep Learning
- NLTK Documentation
- spaCy Documentation
- Gensim Word2Vec
NLP is a rapidly evolving field, so continuous learning and hands-on practice are essential. Run the code examples yourself, experiment with the parameters, and build your own projects to develop a deep understanding. Happy learning!