Natural Language Processing Complete Guide: Zero to Hero - From Text Processing to LLMs


Natural Language Processing (NLP) is a core branch of artificial intelligence that enables computers to understand and generate human language. Countless services we use every day — ChatGPT, translation engines, search engines, sentiment analysis systems — are all built on NLP technology. This guide provides a complete learning path covering everything from the most basic text preprocessing to the latest large language models (LLMs).

Table of Contents

  1. NLP Fundamentals and Text Preprocessing
  2. Text Representation
  3. Word Embeddings
  4. NLP with Recurrent Neural Networks
  5. Attention Mechanisms
  6. Deep Dive into the Transformer Architecture
  7. Deep Dive into BERT
  8. The GPT Family of Models
  9. Modern NLP Techniques
  10. Real-World NLP Projects

1. NLP Fundamentals and Text Preprocessing

The first step in natural language processing is transforming raw text data into a form that models can work with. Because text data is unstructured, cleaning and structuring it into a consistent format is essential.

1.1 Tokenization

Tokenization is the process of splitting text into smaller units called tokens. Depending on the unit, there are three main approaches: word, character, and subword tokenization.

Word Tokenization

The most intuitive approach — splitting text into words based on whitespace or punctuation.

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')  # required by newer NLTK releases

text = "Natural Language Processing is fascinating! It powers ChatGPT and many AI applications."

# Word tokenization
word_tokens = word_tokenize(text)
print("Word tokens:", word_tokens)
# Output: ['Natural', 'Language', 'Processing', 'is', 'fascinating', '!', ...]

# Sentence tokenization
sent_tokens = sent_tokenize(text)
print("Sentence tokens:", sent_tokens)
# Output: ['Natural Language Processing is fascinating!', 'It powers ChatGPT...']

Character Tokenization

Splits text into individual characters. The vocabulary is small and there are no out-of-vocabulary (OOV) problems, but sequences become very long.

text = "Hello NLP"
char_tokens = list(text)
print("Character tokens:", char_tokens)
# Output: ['H', 'e', 'l', 'l', 'o', ' ', 'N', 'L', 'P']

Subword Tokenization

Splits text into units between words and characters. Algorithms like BPE (Byte Pair Encoding), WordPiece, and SentencePiece are used by modern models like BERT and GPT.

from tokenizers import ByteLevelBPETokenizer

# Train a BPE tokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=5000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"]
)

# Encode
encoding = tokenizer.encode("Natural Language Processing")
print("Tokens:", encoding.tokens)
print("IDs:", encoding.ids)
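Under the hood, BPE starts from characters and repeatedly merges the most frequent adjacent symbol pair. A minimal pure-Python sketch of that training loop, using the classic toy corpus (word frequencies and merge count chosen for illustration):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merges from a word-frequency dict; words become tuples of symbols."""
    vocab = {tuple(w) + ('</w>',): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol
        new_vocab = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = bpe_merges(corpus, num_merges=5)
print("Learned merges:", merges)
```

Frequent suffixes like "est" get merged into single tokens early, which is exactly why BPE handles rare and unseen words gracefully.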

1.2 Stop Word Removal

Stop words are words like "the", "is", and "at" that appear so frequently they carry almost no meaningful information. Removing them reduces data size and lets the model focus on informative words.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')

text = "This is a sample sentence showing off the stop words filtration."
stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(text)
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]

print("Original:", word_tokens)
print("Filtered:", filtered_sentence)
# Output: ['sample', 'sentence', 'showing', 'stop', 'words', 'filtration', '.']

1.3 Stemming and Lemmatization

Stemming

Removes affixes to extract the stem of a word. Fast but can produce linguistically incorrect results.

from nltk.stem import PorterStemmer, LancasterStemmer

ps = PorterStemmer()
ls = LancasterStemmer()

words = ["running", "runs", "ran", "runner", "easily", "fairly"]
for word in words:
    print(f"{word:15} -> Porter: {ps.stem(word):15} Lancaster: {ls.stem(word)}")

Lemmatization

Extracts the base form (lemma) of a word. Slower than stemming but linguistically correct.

import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

# Specifying part of speech gives more accurate results
print(lemmatizer.lemmatize("running", pos='v'))  # run
print(lemmatizer.lemmatize("better", pos='a'))    # good
print(lemmatizer.lemmatize("dogs"))               # dog
print(lemmatizer.lemmatize("went", pos='v'))      # go

1.4 Text Cleaning with Regular Expressions

import re

def clean_text(text):
    """Text cleaning function"""
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)

    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)

    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)

    # Remove special characters (keep alphanumeric and spaces)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)

    # Collapse multiple spaces
    text = re.sub(r'\s+', ' ', text)

    # Strip and lowercase
    text = text.strip().lower()

    return text

sample_text = """
Check out my website at https://example.com!
Email me at user@example.com for <b>more info</b>.
It's   really   cool!!!
"""

cleaned = clean_text(sample_text)
print(cleaned)
# Output: "check out my website at email me at for more info its really cool"

1.5 Advanced Preprocessing with spaCy

spaCy is an industrial-strength NLP library that provides tokenization, POS tagging, named entity recognition, dependency parsing, and more.

import spacy

# Load English model
nlp = spacy.load("en_core_web_sm")

text = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(text)

print("=== Token Information ===")
for token in doc:
    print(f"{token.text:15} | POS: {token.pos_:10} | Lemma: {token.lemma_:15} | Stop: {token.is_stop}")

print("\n=== Named Entity Recognition ===")
for ent in doc.ents:
    print(f"{ent.text:20} -> {ent.label_} ({spacy.explain(ent.label_)})")

print("\n=== Dependency Parsing ===")
for token in doc:
    print(f"{token.text:15} -> {token.dep_:15} (head: {token.head.text})")

2. Text Representation

Converting text to numerical vectors is at the heart of NLP. For a model to process text, it must be converted into a numeric form on which mathematical operations can be performed.

2.1 Bag of Words (BoW)

BoW is the simplest method, representing text as a word-frequency vector. It ignores word order and only considers how often each word appears.

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

corpus = [
    "I love natural language processing",
    "Natural language processing is amazing",
    "I love machine learning too",
    "Deep learning is part of machine learning"
]

# BoW vectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("\nBoW Matrix:")
print(X.toarray())
print("\nShape:", X.shape)  # (4 documents, n words)

# Inspect one document
doc_0 = dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0]))
print("\nDocument 0 word frequencies:")
for word, count in sorted(doc_0.items()):
    if count > 0:
        print(f"  {word}: {count}")

2.2 TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) improves on BoW by considering not just how often a word appears in a document, but also how rare it is across all documents.

TF (Term Frequency): how often a word appears in a given document.
IDF (Inverse Document Frequency): the inverse of how many documents contain the word; rare words receive higher scores.
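In formula form (the smoothed variant that scikit-learn applies by default, which the manual implementation in this section also follows):

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log\frac{1 + N}{1 + \mathrm{df}(t)} + 1
```

where N is the total number of documents and df(t) is the number of documents containing term t.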

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    "the cat sat on the mat",
    "the cat sat on the hat",
    "the dog sat on the log",
    "the cat wore the hat",
]

# TF-IDF vectorizer
tfidf = TfidfVectorizer(smooth_idf=True, norm='l2')
X = tfidf.fit_transform(corpus)

# Display as DataFrame
df = pd.DataFrame(
    X.toarray(),
    columns=tfidf.get_feature_names_out(),
    index=[f"Doc {i}" for i in range(len(corpus))]
)
print(df.round(3))

# Manual TF-IDF implementation
def compute_tfidf(corpus):
    from math import log

    vocab = set()
    for doc in corpus:
        vocab.update(doc.split())
    vocab = sorted(vocab)

    def tf(word, doc):
        words = doc.split()
        return words.count(word) / len(words)

    def idf(word, corpus):
        n_docs_with_word = sum(1 for doc in corpus if word in doc.split())
        return log((1 + len(corpus)) / (1 + n_docs_with_word)) + 1

    tfidf_matrix = []
    for doc in corpus:
        row = [tf(word, doc) * idf(word, corpus) for word in vocab]
        tfidf_matrix.append(row)

    return tfidf_matrix, vocab

matrix, vocab = compute_tfidf(corpus)
print("\nManually computed TF-IDF:")
df_manual = pd.DataFrame(matrix, columns=vocab)
print(df_manual.round(3))
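A common use of these vectors is ranking documents against a query by cosine similarity. A short sketch with the same toy corpus (the query string is an arbitrary example):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "the cat sat on the mat",
    "the cat sat on the hat",
    "the dog sat on the log",
    "the cat wore the hat",
]

tfidf = TfidfVectorizer()
doc_vectors = tfidf.fit_transform(corpus)

# Vectorize the query with the same fitted vocabulary, then rank documents
query_vector = tfidf.transform(["cat with a hat"])
scores = cosine_similarity(query_vector, doc_vectors).flatten()

for i in scores.argsort()[::-1]:
    print(f"Doc {i} (score {scores[i]:.3f}): {corpus[i]}")
```

Documents sharing both "cat" and "hat" with the query rank highest; the dog document shares no terms and scores zero.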

2.3 N-grams

An N-gram is a contiguous sequence of N items (words or characters) from a given text. N-grams allow a model to capture some local word-order information.

from nltk.util import ngrams
from nltk.tokenize import word_tokenize
from collections import Counter

text = "I love natural language processing and machine learning"
tokens = word_tokenize(text.lower())

unigrams = list(ngrams(tokens, 1))
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

print("Unigrams:", unigrams[:5])
print("Bigrams:", bigrams[:5])
print("Trigrams:", trigrams[:5])

def get_ngram_freq(text, n):
    tokens = word_tokenize(text.lower())
    n_grams = list(ngrams(tokens, n))
    return Counter(n_grams)

large_corpus = """
Machine learning is a subset of artificial intelligence.
Artificial intelligence is transforming many industries.
Natural language processing is a part of machine learning.
Deep learning has revolutionized natural language processing.
"""

bigram_freq = get_ngram_freq(large_corpus, 2)
print("\nMost frequent bigrams:")
for bigram, count in bigram_freq.most_common(10):
    print(f"  {' '.join(bigram)}: {count}")
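Bigram counts are already enough for a rudimentary language model: to predict the word after w, look up the most frequent bigram starting with w. A minimal sketch (plain whitespace tokenization, ignoring sentence boundaries, to keep it self-contained):

```python
from collections import Counter, defaultdict

corpus = """
machine learning is a subset of artificial intelligence
artificial intelligence is transforming many industries
natural language processing is a part of machine learning
deep learning has revolutionized natural language processing
""".split()

# next_words[w] maps each word observed after w to its count
next_words = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    next_words[w1][w2] += 1

def predict_next(word):
    """Return the most frequent word observed after `word`, or None if unseen."""
    followers = next_words.get(word)
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("machine"))   # "learning" ("machine learning" appears twice)
print(predict_next("language"))  # "processing"
```

Real n-gram language models add smoothing and backoff on top of exactly these counts.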

3. Word Embeddings

Word embeddings represent words as dense vectors such that semantically similar words are close together in vector space.

3.1 Word2Vec

Word2Vec, published by Tomas Mikolov and colleagues at Google in 2013, is a groundbreaking word embedding model. It comes in two variants: CBOW (Continuous Bag of Words) and Skip-gram.

CBOW: Predicts the center word from surrounding context words. Skip-gram: Predicts surrounding context words from the center word.

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import Counter

class Word2VecSkipGram(nn.Module):
    """Skip-gram Word2Vec implementation"""

    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.center_embedding = nn.Embedding(vocab_size, embedding_dim)
        self.context_embedding = nn.Embedding(vocab_size, embedding_dim)

        self.center_embedding.weight.data.uniform_(-0.5/embedding_dim, 0.5/embedding_dim)
        self.context_embedding.weight.data.uniform_(-0.5/embedding_dim, 0.5/embedding_dim)

    def forward(self, center, context, negative):
        """
        center: (batch_size,) center word indices
        context: (batch_size,) actual context word indices
        negative: (batch_size, neg_samples) negative sample indices
        """
        center_emb = self.center_embedding(center)  # (batch, dim)

        # Positive sample score
        context_emb = self.context_embedding(context)  # (batch, dim)
        pos_score = torch.sum(center_emb * context_emb, dim=1)  # (batch,)
        pos_loss = -torch.log(torch.sigmoid(pos_score) + 1e-10)

        # Negative sample score
        neg_emb = self.context_embedding(negative)  # (batch, neg_samples, dim)
        center_emb_expanded = center_emb.unsqueeze(1)  # (batch, 1, dim)
        neg_score = torch.bmm(neg_emb, center_emb_expanded.transpose(1, 2)).squeeze(2)
        neg_loss = -torch.sum(torch.log(torch.sigmoid(-neg_score) + 1e-10), dim=1)

        return (pos_loss + neg_loss).mean()

class Word2VecCBOW(nn.Module):
    """CBOW Word2Vec implementation"""

    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, context):
        """
        context: (batch_size, context_window_size) context word indices
        """
        embedded = self.embedding(context)  # (batch, window, dim)
        mean_embedded = embedded.mean(dim=1)  # (batch, dim)
        output = self.linear(mean_embedded)  # (batch, vocab_size)
        return output

# Practical usage with Gensim
from gensim.models import Word2Vec

sentences = [
    "I love natural language processing".split(),
    "natural language processing is a field of AI".split(),
    "machine learning is part of artificial intelligence".split(),
    "deep learning models process natural language".split(),
    "word embeddings represent words as vectors".split(),
    "Word2Vec learns word representations".split(),
    "semantic similarity between words is captured by embeddings".split(),
]

model = Word2Vec(
    sentences=sentences,
    vector_size=100,   # embedding dimension
    window=5,          # context window size
    min_count=1,       # minimum word frequency
    workers=4,
    epochs=100,
    sg=1,              # 1=Skip-gram, 0=CBOW
    negative=5         # negative sampling
)

print("'language' vector (first 5 dims):", model.wv['language'][:5])

similar_words = model.wv.most_similar('language', topn=5)
print("\nWords similar to 'language':")
for word, similarity in similar_words:
    print(f"  {word}: {similarity:.4f}")

result = model.wv.most_similar(
    positive=['artificial', 'language'],
    negative=['natural'],
    topn=3
)
print("\nAnalogy result:")
for word, sim in result:
    print(f"  {word}: {sim:.4f}")

print("\nSimilarity between 'language' and 'processing':", model.wv.similarity('language', 'processing'))

3.2 Embedding Visualization with t-SNE

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np

def visualize_embeddings(model, words=None):
    """Visualize Word2Vec embeddings with t-SNE"""
    if words is None:
        words = list(model.wv.key_to_index.keys())[:50]

    vectors = np.array([model.wv[word] for word in words])

    tsne = TSNE(
        n_components=2,
        random_state=42,
        perplexity=min(30, len(words)-1),
        max_iter=1000  # older scikit-learn versions use n_iter
    )
    vectors_2d = tsne.fit_transform(vectors)

    fig, ax = plt.subplots(figsize=(12, 8))
    ax.scatter(vectors_2d[:, 0], vectors_2d[:, 1], alpha=0.7)

    for i, word in enumerate(words):
        ax.annotate(word, (vectors_2d[i, 0], vectors_2d[i, 1]),
                   fontsize=9, ha='center', va='bottom')

    ax.set_title("Word2Vec Embeddings t-SNE Visualization")
    plt.tight_layout()
    plt.savefig('word2vec_tsne.png', dpi=150, bbox_inches='tight')
    plt.show()

visualize_embeddings(model)

3.3 GloVe

GloVe (Global Vectors for Word Representation), developed at Stanford, exploits global co-occurrence statistics from the entire corpus.

# Using pre-trained GloVe vectors
# pip install torchtext

from torchtext.vocab import GloVe
import torch

glove = GloVe(name='6B', dim=100)

vector = glove['computer']
print("'computer' GloVe vector (first 5 dims):", vector[:5].numpy())

def cosine_similarity(v1, v2):
    return torch.nn.functional.cosine_similarity(
        v1.unsqueeze(0), v2.unsqueeze(0)
    ).item()

words_to_compare = [('king', 'queen'), ('man', 'woman'), ('paris', 'france')]  # 6B vectors are lowercased
for w1, w2 in words_to_compare:
    if w1 in glove.stoi and w2 in glove.stoi:
        sim = cosine_similarity(glove[w1], glove[w2])
        print(f"sim('{w1}', '{w2}') = {sim:.4f}")

4. NLP with Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are designed to process sequential data. They carry information from previous time steps to the current step, making them well-suited for text.

4.1 Basic RNN Architecture

import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    """Basic RNN implementation"""

    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.input_to_hidden = nn.Linear(input_size + hidden_size, hidden_size)
        self.hidden_to_output = nn.Linear(hidden_size, output_size)

    def forward(self, x, hidden=None):
        """
        x: (seq_len, batch, input_size)
        hidden: (batch, hidden_size) initial hidden state
        """
        batch_size = x.size(1)

        if hidden is None:
            hidden = torch.zeros(batch_size, self.hidden_size)

        outputs = []
        for t in range(x.size(0)):
            combined = torch.cat([x[t], hidden], dim=1)
            hidden = torch.tanh(self.input_to_hidden(combined))
            output = self.hidden_to_output(hidden)
            outputs.append(output)

        return torch.stack(outputs, dim=0), hidden

# Using PyTorch built-in RNN
rnn = nn.RNN(
    input_size=50,
    hidden_size=128,
    num_layers=2,
    batch_first=True,
    dropout=0.3,
    bidirectional=True
)

x = torch.randn(32, 20, 50)  # (batch, seq_len, features)
output, hidden = rnn(x)
print("Output shape:", output.shape)   # (32, 20, 256) - bidirectional: 128*2
print("Hidden shape:", hidden.shape)   # (4, 32, 128)  - 2 layers * 2 directions
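To confirm that the built-in layer computes the same recurrence as the manual loop, we can replay a single-layer nn.RNN's own weights by hand (h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh)) and compare; sizes and seed below are arbitrary:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=3, num_layers=1, batch_first=True)

x = torch.randn(2, 5, 4)  # (batch, seq_len, input_size)
output, hidden = rnn(x)

# Manual replay of the same recurrence with the layer's own parameters
W_ih, W_hh = rnn.weight_ih_l0, rnn.weight_hh_l0
b_ih, b_hh = rnn.bias_ih_l0, rnn.bias_hh_l0

h = torch.zeros(2, 3)
manual_outputs = []
for t in range(x.size(1)):
    h = torch.tanh(x[:, t] @ W_ih.T + b_ih + h @ W_hh.T + b_hh)
    manual_outputs.append(h)
manual = torch.stack(manual_outputs, dim=1)  # (batch, seq_len, hidden)

print("Match:", torch.allclose(output, manual, atol=1e-6))  # Match: True
```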

4.2 LSTM — Gate Mechanisms Explained

LSTM (Long Short-Term Memory) was designed to solve the vanishing gradient problem of vanilla RNNs. Its gate mechanism allows it to capture long-range dependencies.

class LSTMCell(nn.Module):
    """LSTM cell — manual implementation for understanding gate mechanics"""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size

        # Compute all 4 gates at once for efficiency
        # forget gate, input gate, cell gate, output gate
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)

    def forward(self, x, state=None):
        """
        x: (batch, input_size)
        state: (h, c) previous hidden and cell states
        """
        batch_size = x.size(0)

        if state is None:
            h = torch.zeros(batch_size, self.hidden_size)
            c = torch.zeros(batch_size, self.hidden_size)
        else:
            h, c = state

        combined = torch.cat([x, h], dim=1)
        gates = self.gates(combined)

        f_gate, i_gate, g_gate, o_gate = gates.chunk(4, dim=1)

        f = torch.sigmoid(f_gate)   # forget gate: how much to forget
        i = torch.sigmoid(i_gate)   # input gate: how much new info to store
        g = torch.tanh(g_gate)      # cell gate: new candidate information
        o = torch.sigmoid(o_gate)   # output gate: how much to output

        new_c = f * c + i * g       # forget some, add new
        new_h = o * torch.tanh(new_c)

        return new_h, (new_h, new_c)
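PyTorch's built-in nn.LSTMCell exposes the same one-step interface as the manual cell above. A quick smoke test of driving it across a sequence (sizes chosen arbitrarily):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.LSTMCell(input_size=10, hidden_size=20)

x = torch.randn(8, 5, 10)  # (batch, seq_len, input_size)
h = torch.zeros(8, 20)
c = torch.zeros(8, 20)

# Drive the cell one time step at a time, threading (h, c) through
outputs = []
for t in range(x.size(1)):
    h, c = cell(x[:, t], (h, c))
    outputs.append(h)
outputs = torch.stack(outputs, dim=1)  # (batch, seq_len, hidden)

print(outputs.shape)  # torch.Size([8, 5, 20])
# Since h = o * tanh(c) with sigmoid-gated o, every hidden value lies in (-1, 1)
print(bool(outputs.abs().max() < 1))
```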

# Bidirectional LSTM for sentiment analysis
class SentimentLSTM(nn.Module):
    """LSTM-based sentiment analysis model"""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, num_layers=2, dropout=0.3):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(
            embed_dim,
            hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout,
            bidirectional=True
        )
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_classes)
        )

    def forward(self, x, lengths=None):
        embedded = self.dropout(self.embedding(x))

        if lengths is not None:
            packed = nn.utils.rnn.pack_padded_sequence(
                embedded, lengths, batch_first=True, enforce_sorted=False
            )
            lstm_out, (hidden, cell) = self.lstm(packed)
            lstm_out, _ = nn.utils.rnn.pad_packed_sequence(lstm_out, batch_first=True)
        else:
            lstm_out, (hidden, cell) = self.lstm(embedded)

        forward_hidden = hidden[-2]   # last layer, forward direction
        backward_hidden = hidden[-1]  # last layer, backward direction
        combined = torch.cat([forward_hidden, backward_hidden], dim=1)

        output = self.classifier(self.dropout(combined))
        return output

4.3 GRU

GRU (Gated Recurrent Unit) is a simpler alternative to LSTM: it has fewer parameters and trains faster, while achieving comparable performance on many tasks.

class GRUCell(nn.Module):
    """GRU cell — manual implementation"""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.reset_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.update_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.new_gate_input = nn.Linear(input_size, hidden_size)
        self.new_gate_hidden = nn.Linear(hidden_size, hidden_size)

    def forward(self, x, h=None):
        batch_size = x.size(0)

        if h is None:
            h = torch.zeros(batch_size, self.hidden_size)

        combined = torch.cat([x, h], dim=1)

        r = torch.sigmoid(self.reset_gate(combined))   # reset gate
        z = torch.sigmoid(self.update_gate(combined))  # update gate

        n = torch.tanh(
            self.new_gate_input(x) + r * self.new_gate_hidden(h)
        )

        new_h = (1 - z) * n + z * h
        return new_h
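The "simpler" claim is easy to quantify: with PyTorch's gate layout, a GRU carries three weight blocks to the LSTM's four, so for identical sizes it has exactly 3/4 of the parameters:

```python
import torch.nn as nn

lstm = nn.LSTM(input_size=100, hidden_size=256, num_layers=1)
gru = nn.GRU(input_size=100, hidden_size=256, num_layers=1)

def count_params(m):
    return sum(p.numel() for p in m.parameters())

print("LSTM parameters:", count_params(lstm))
print("GRU parameters: ", count_params(gru))
print("Ratio:", count_params(gru) / count_params(lstm))  # 0.75
```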

4.4 Seq2Seq Machine Translation

class Encoder(nn.Module):
    """Seq2Seq Encoder"""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=2, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
                           batch_first=True, dropout=dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        embedded = self.dropout(self.embedding(src))
        outputs, (hidden, cell) = self.lstm(embedded)
        return outputs, hidden, cell

class Decoder(nn.Module):
    """Seq2Seq Decoder"""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=2, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
                           batch_first=True, dropout=dropout)
        self.fc = nn.Linear(hidden_dim, vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, trg, hidden, cell):
        trg = trg.unsqueeze(1)  # (batch, 1)
        embedded = self.dropout(self.embedding(trg))
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        prediction = self.fc(output.squeeze(1))  # (batch, vocab_size)
        return prediction, hidden, cell

class Seq2Seq(nn.Module):
    """Seq2Seq Translation Model"""

    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size = src.shape[0]
        trg_len = trg.shape[1]
        trg_vocab_size = self.decoder.fc.out_features

        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size).to(self.device)

        encoder_outputs, hidden, cell = self.encoder(src)

        # First decoder input: start-of-sequence token
        dec_input = trg[:, 0]

        for t in range(1, trg_len):
            output, hidden, cell = self.decoder(dec_input, hidden, cell)
            outputs[:, t] = output

            # With probability teacher_forcing_ratio, feed the ground-truth token;
            # otherwise feed the model's own prediction
            teacher_force = torch.rand(1).item() < teacher_forcing_ratio
            top1 = output.argmax(1)
            dec_input = trg[:, t] if teacher_force else top1

        return outputs

5. Attention Mechanisms

Attention mechanisms allow a model to dynamically focus on different parts of the input sequence when generating each output token.

5.1 Bahdanau Attention

class BahdanauAttention(nn.Module):
    """Bahdanau (Additive) Attention"""

    def __init__(self, hidden_dim):
        super().__init__()
        self.W_enc = nn.Linear(hidden_dim * 2, hidden_dim)
        self.W_dec = nn.Linear(hidden_dim, hidden_dim)
        self.v = nn.Linear(hidden_dim, 1)

    def forward(self, decoder_hidden, encoder_outputs):
        """
        decoder_hidden: (batch, hidden) current decoder hidden state
        encoder_outputs: (batch, src_len, hidden*2) all encoder outputs
        """
        src_len = encoder_outputs.shape[1]

        decoder_hidden = decoder_hidden.unsqueeze(1).repeat(1, src_len, 1)

        energy = torch.tanh(
            self.W_enc(encoder_outputs) + self.W_dec(decoder_hidden)
        )

        attention = self.v(energy).squeeze(2)  # (batch, src_len)
        attention_weights = torch.softmax(attention, dim=1)

        context = torch.bmm(
            attention_weights.unsqueeze(1),  # (batch, 1, src_len)
            encoder_outputs                  # (batch, src_len, hidden*2)
        ).squeeze(1)

        return context, attention_weights

class LuongAttention(nn.Module):
    """Luong (Multiplicative) Attention"""

    def __init__(self, hidden_dim, method='dot'):
        super().__init__()
        self.method = method

        if method == 'general':
            self.W = nn.Linear(hidden_dim, hidden_dim)
        elif method == 'concat':
            self.W = nn.Linear(hidden_dim * 2, hidden_dim)
            self.v = nn.Linear(hidden_dim, 1)

    def forward(self, decoder_hidden, encoder_outputs):
        if self.method == 'dot':
            score = torch.bmm(
                encoder_outputs,
                decoder_hidden.unsqueeze(2)
            ).squeeze(2)

        elif self.method == 'general':
            energy = self.W(encoder_outputs)
            score = torch.bmm(
                energy,
                decoder_hidden.unsqueeze(2)
            ).squeeze(2)

        elif self.method == 'concat':
            decoder_expanded = decoder_hidden.unsqueeze(1).expand_as(encoder_outputs)
            energy = torch.tanh(self.W(torch.cat([decoder_expanded, encoder_outputs], dim=2)))
            score = self.v(energy).squeeze(2)

        attention_weights = torch.softmax(score, dim=1)
        context = torch.bmm(attention_weights.unsqueeze(1), encoder_outputs).squeeze(1)

        return context, attention_weights
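Whatever the scoring function, the softmax output is a probability distribution over source positions, and the context vector is the corresponding weighted sum of encoder outputs. A self-contained check with plain dot-product scores (random tensors, arbitrary sizes):

```python
import torch

torch.manual_seed(0)
batch, src_len, hidden = 2, 6, 8

decoder_hidden = torch.randn(batch, hidden)
encoder_outputs = torch.randn(batch, src_len, hidden)

# Dot-product scores: one scalar per source position
scores = torch.bmm(encoder_outputs, decoder_hidden.unsqueeze(2)).squeeze(2)
weights = torch.softmax(scores, dim=1)                          # (batch, src_len)
context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)

print(weights.sum(dim=1))  # each row sums to 1
print(context.shape)       # torch.Size([2, 8])
```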

5.2 Self-Attention

Self-attention allows each position in a sequence to attend to all other positions. It is the core building block of the Transformer.

class SelfAttention(nn.Module):
    """Multi-Head Self-Attention"""

    def __init__(self, embed_dim, num_heads=8, dropout=0.1):
        super().__init__()
        assert embed_dim % num_heads == 0

        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = self.head_dim ** -0.5

        self.W_q = nn.Linear(embed_dim, embed_dim)
        self.W_k = nn.Linear(embed_dim, embed_dim)
        self.W_v = nn.Linear(embed_dim, embed_dim)
        self.W_o = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        """
        x: (batch, seq_len, embed_dim)
        mask: (batch, seq_len, seq_len) attention mask
        """
        batch_size, seq_len, _ = x.shape

        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)

        Q = Q.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        scores = torch.matmul(Q, K.transpose(-2, -1)) * self.scale

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attention_weights = self.dropout(torch.softmax(scores, dim=-1))

        attended = torch.matmul(attention_weights, V)
        attended = attended.transpose(1, 2).contiguous()
        attended = attended.view(batch_size, seq_len, self.embed_dim)

        output = self.W_o(attended)

        return output, attention_weights
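The mask argument is how the same module implements causal (decoder-style) attention: a lower-triangular mask sends future positions to -inf before the softmax, so they receive exactly zero weight. A standalone illustration of the masking step (random scores for demonstration):

```python
import torch

seq_len = 4
scores = torch.randn(1, seq_len, seq_len)  # raw attention scores (batch, seq, seq)

# Lower-triangular causal mask: position i may attend only to positions <= i
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)
masked_scores = scores.masked_fill(mask == 0, float('-inf'))
weights = torch.softmax(masked_scores, dim=-1)

print(weights[0])
# Upper triangle is exactly zero: no attention to future tokens
print(bool((weights[0].triu(diagonal=1) == 0).all()))
```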

6. Deep Dive into the Transformer Architecture

The Transformer, introduced in the 2017 paper "Attention Is All You Need," completely changed the NLP paradigm. Using only attention mechanisms, with no recurrence or convolutions, it achieved state-of-the-art translation performance.

6.1 Positional Encoding

Because Transformers process all positions in parallel, they need positional encoding to inject information about sequence order.

import torch
import torch.nn as nn
import math

class PositionalEncoding(nn.Module):
    """Sinusoidal Positional Encoding"""

    def __init__(self, embed_dim, max_len=5000, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, embed_dim)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, embed_dim, 2).float() * (-math.log(10000.0) / embed_dim)
        )

        pe[:, 0::2] = torch.sin(position * div_term)  # even indices: sin
        pe[:, 1::2] = torch.cos(position * div_term)  # odd indices: cos

        pe = pe.unsqueeze(0)  # (1, max_len, embed_dim)
        self.register_buffer('pe', pe)

    def forward(self, x):
        """x: (batch, seq_len, embed_dim)"""
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)
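Two properties of the sinusoidal table are worth verifying directly: every component lies in [-1, 1] (so it doesn't swamp the token embeddings), and position 0 is all zeros at the sin dimensions and all ones at the cos dimensions. A standalone check using the same formulas:

```python
import math
import torch

max_len, embed_dim = 100, 16

position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(
    torch.arange(0, embed_dim, 2).float() * (-math.log(10000.0) / embed_dim)
)

pe = torch.zeros(max_len, embed_dim)
pe[:, 0::2] = torch.sin(position * div_term)  # even indices: sin
pe[:, 1::2] = torch.cos(position * div_term)  # odd indices: cos

print(pe.shape)                   # torch.Size([100, 16])
print(bool(pe.abs().max() <= 1))  # all values bounded
print(pe[0])                      # position 0: sin(0)=0, cos(0)=1
```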

class LearnablePositionalEncoding(nn.Module):
    """Learnable Positional Encoding (BERT style)"""

    def __init__(self, embed_dim, max_len=512):
        super().__init__()
        self.pe = nn.Embedding(max_len, embed_dim)

    def forward(self, x):
        batch_size, seq_len, _ = x.shape
        positions = torch.arange(seq_len, device=x.device).unsqueeze(0)
        return x + self.pe(positions)

6.2 Full Transformer Implementation

class MultiHeadAttention(nn.Module):
    """Multi-Head Attention"""

    def __init__(self, embed_dim, num_heads, dropout=0.1):
        super().__init__()
        assert embed_dim % num_heads == 0

        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = self.head_dim ** -0.5

        self.W_q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_k = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_v = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_o = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)

    def split_heads(self, x):
        """(batch, seq, embed) -> (batch, heads, seq, head_dim)"""
        batch, seq, _ = x.shape
        x = x.view(batch, seq, self.num_heads, self.head_dim)
        return x.transpose(1, 2)

    def forward(self, query, key, value, mask=None):
        batch_size = query.shape[0]

        Q = self.split_heads(self.W_q(query))
        K = self.split_heads(self.W_k(key))
        V = self.split_heads(self.W_v(value))

        scores = torch.matmul(Q, K.transpose(-2, -1)) * self.scale

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attn_weights = self.dropout(torch.softmax(scores, dim=-1))
        output = torch.matmul(attn_weights, V)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.embed_dim)

        return self.W_o(output), attn_weights

class FeedForward(nn.Module):
    """Position-wise Feed-Forward Network"""

    def __init__(self, embed_dim, ff_dim, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(ff_dim, embed_dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)

class EncoderLayer(nn.Module):
    """Transformer Encoder Layer"""

    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.self_attention = MultiHeadAttention(embed_dim, num_heads, dropout)
        self.feed_forward = FeedForward(embed_dim, ff_dim, dropout)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, src_mask=None):
        attn_output, _ = self.self_attention(x, x, x, src_mask)
        x = self.norm1(x + self.dropout(attn_output))

        ff_output = self.feed_forward(x)
        x = self.norm2(x + ff_output)

        return x

class DecoderLayer(nn.Module):
    """Transformer Decoder Layer"""

    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.self_attention = MultiHeadAttention(embed_dim, num_heads, dropout)
        self.cross_attention = MultiHeadAttention(embed_dim, num_heads, dropout)
        self.feed_forward = FeedForward(embed_dim, ff_dim, dropout)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.norm3 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, memory, src_mask=None, tgt_mask=None):
        # Masked Self-Attention (future token masking)
        self_attn_out, _ = self.self_attention(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(self_attn_out))

        # Cross-Attention (attend to encoder outputs)
        cross_attn_out, cross_attn_weights = self.cross_attention(x, memory, memory, src_mask)
        x = self.norm2(x + self.dropout(cross_attn_out))

        ff_out = self.feed_forward(x)
        x = self.norm3(x + ff_out)

        return x, cross_attn_weights

class Transformer(nn.Module):
    """Full Transformer Implementation"""

    def __init__(
        self,
        src_vocab_size,
        tgt_vocab_size,
        embed_dim=512,
        num_heads=8,
        num_encoder_layers=6,
        num_decoder_layers=6,
        ff_dim=2048,
        max_len=5000,
        dropout=0.1
    ):
        super().__init__()

        self.src_embedding = nn.Embedding(src_vocab_size, embed_dim)
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, embed_dim)
        self.pos_encoding = PositionalEncoding(embed_dim, max_len, dropout)
        self.embed_scale = math.sqrt(embed_dim)

        self.encoder_layers = nn.ModuleList([
            EncoderLayer(embed_dim, num_heads, ff_dim, dropout)
            for _ in range(num_encoder_layers)
        ])

        self.decoder_layers = nn.ModuleList([
            DecoderLayer(embed_dim, num_heads, ff_dim, dropout)
            for _ in range(num_decoder_layers)
        ])

        self.output_norm = nn.LayerNorm(embed_dim)
        self.output_projection = nn.Linear(embed_dim, tgt_vocab_size)

        self._init_weights()

    def _init_weights(self):
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

    def make_causal_mask(self, seq_len, device):
        """Autoregressive mask — prevents attending to future tokens"""
        mask = torch.tril(torch.ones(seq_len, seq_len, device=device)).bool()
        return mask.unsqueeze(0).unsqueeze(0)

    def make_pad_mask(self, x, pad_idx=0):
        """Padding mask"""
        return (x != pad_idx).unsqueeze(1).unsqueeze(2)

    def encode(self, src, src_mask=None):
        x = self.pos_encoding(self.src_embedding(src) * self.embed_scale)
        for layer in self.encoder_layers:
            x = layer(x, src_mask)
        return x

    def decode(self, tgt, memory, src_mask=None, tgt_mask=None):
        x = self.pos_encoding(self.tgt_embedding(tgt) * self.embed_scale)
        for layer in self.decoder_layers:
            x, _ = layer(x, memory, src_mask, tgt_mask)
        return self.output_norm(x)

    def forward(self, src, tgt, src_pad_idx=0, tgt_pad_idx=0):
        src_mask = self.make_pad_mask(src, src_pad_idx)
        tgt_len = tgt.shape[1]
        tgt_pad_mask = self.make_pad_mask(tgt, tgt_pad_idx)
        tgt_causal_mask = self.make_causal_mask(tgt_len, tgt.device)
        tgt_mask = tgt_pad_mask & tgt_causal_mask

        memory = self.encode(src, src_mask)
        output = self.decode(tgt, memory, src_mask, tgt_mask)
        logits = self.output_projection(output)

        return logits

# Instantiate and test
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = Transformer(
    src_vocab_size=10000,
    tgt_vocab_size=10000,
    embed_dim=512,
    num_heads=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    ff_dim=2048,
    dropout=0.1
).to(device)

total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")

src = torch.randint(1, 10000, (4, 20)).to(device)
tgt = torch.randint(1, 10000, (4, 18)).to(device)
output = model(src, tgt)
print(f"Output shape: {output.shape}")  # (4, 18, 10000)
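The forward pass above uses teacher forcing; at inference time the decoder must instead generate tokens one at a time, feeding each prediction back in. A minimal greedy-decoding sketch for this kind of encoder-decoder model (the `bos_idx`/`eos_idx` special-token ids are placeholder assumptions, not defined by the model above):

```python
import torch

@torch.no_grad()
def greedy_decode(model, src, bos_idx=1, eos_idx=2, max_len=30):
    """Greedy autoregressive decoding.

    Assumes the model exposes encode/decode/make_pad_mask/make_causal_mask
    and an output_projection head, like the Transformer class above."""
    model.eval()
    src_mask = model.make_pad_mask(src)
    memory = model.encode(src, src_mask)          # encode the source once
    ys = torch.full((src.size(0), 1), bos_idx, dtype=torch.long, device=src.device)
    for _ in range(max_len - 1):
        tgt_mask = model.make_causal_mask(ys.size(1), ys.device)
        out = model.decode(ys, memory, src_mask, tgt_mask)
        # Project only the last position and pick the most likely next token
        next_tok = model.output_projection(out[:, -1]).argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_tok], dim=1)
        if (next_tok == eos_idx).all():           # stop once every sequence emitted EOS
            break
    return ys
```

Beam search or sampling would replace the `argmax` step, but the encode-once / decode-step-by-step structure stays the same.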

7. Deep Dive into BERT

BERT (Bidirectional Encoder Representations from Transformers) was published by Google in 2018 and revolutionized the NLP field.

7.1 BERT's Core Ideas

BERT uses two pre-training tasks:

Masked Language Modeling (MLM): 15% of the input tokens are selected for prediction, and the model must recover the original words at those positions.
Next Sentence Prediction (NSP): Given two sentences, the model predicts whether the second sentence actually follows the first in the corpus.
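In the original BERT recipe, the selected tokens are not all literally masked: 80% become [MASK], 10% become a random token, and 10% are left unchanged, so the model cannot rely on always seeing [MASK] at test time. A minimal sketch of this masking rule (the default token ids follow bert-base-uncased conventions and are an assumption here):

```python
import torch

def mlm_mask(input_ids, mask_token_id=103, vocab_size=30522,
             special_ids=(0, 101, 102), p=0.15):
    """BERT-style MLM masking: select ~p of the non-special tokens; of those,
    80% become [MASK], 10% a random token, 10% stay unchanged."""
    special = torch.zeros_like(input_ids, dtype=torch.bool)
    for s in special_ids:                      # never mask [PAD]/[CLS]/[SEP]
        special |= input_ids == s
    selected = (torch.rand(input_ids.shape) < p) & ~special

    labels = input_ids.clone()
    labels[~selected] = -100                   # -100 is ignored by cross-entropy

    roll = torch.rand(input_ids.shape)
    masked = input_ids.clone()
    masked[selected & (roll < 0.8)] = mask_token_id
    rand_pos = selected & (roll >= 0.8) & (roll < 0.9)
    masked[rand_pos] = torch.randint(0, vocab_size, (int(rand_pos.sum().item()),))
    return masked, labels
```

In practice, Hugging Face's `DataCollatorForLanguageModeling` applies this same rule for you.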

from transformers import (
    BertTokenizer,
    BertModel,
    BertForSequenceClassification,
    BertForTokenClassification,
    BertForQuestionAnswering,
    get_linear_schedule_with_warmup
)
import torch
from torch.optim import AdamW  # transformers' own AdamW is deprecated in recent versions
from torch.utils.data import Dataset, DataLoader

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = "Natural language processing is transforming AI."
encoding = tokenizer(
    text,
    add_special_tokens=True,
    max_length=128,
    padding='max_length',
    truncation=True,
    return_tensors='pt'
)

print("Input IDs:", encoding['input_ids'][0][:10])
print("Attention mask:", encoding['attention_mask'][0][:10])
print("Decoded:", tokenizer.decode(encoding['input_ids'][0]))

# WordPiece tokenization
complex_words = ["unbelievable", "preprocessing", "transformers", "tokenization"]
for word in complex_words:
    tokens = tokenizer.tokenize(word)
    print(f"{word:20} -> {tokens}")
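Under the hood, WordPiece splits an out-of-vocabulary word greedily into the longest matching vocabulary pieces, marking continuation pieces with "##". A toy sketch of this longest-match-first algorithm with a hand-built vocabulary (not the real BERT vocab):

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split; '##' marks continuation pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:                    # shrink the candidate until it matches
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand            # non-initial pieces carry the ## prefix
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:                     # no piece matches -> whole word is unknown
            return [unk]
        pieces.append(piece)
        start = end
    return pieces
```

The real tokenizer also lowercases, strips accents, and handles punctuation, but the core splitting loop looks like this.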

7.2 BERT Fine-tuning: Text Classification

class SentimentDataset(Dataset):
    """Sentiment analysis dataset"""

    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            'label': torch.tensor(self.labels[idx], dtype=torch.long)
        }

def train_bert_classifier(
    train_texts,
    train_labels,
    val_texts,
    val_labels,
    num_labels=2,
    epochs=3,
    batch_size=16,
    lr=2e-5
):
    """BERT sentiment analysis fine-tuning"""

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    train_dataset = SentimentDataset(train_texts, train_labels, tokenizer)
    val_dataset = SentimentDataset(val_texts, val_labels, tokenizer)

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size)

    model = BertForSequenceClassification.from_pretrained(
        'bert-base-uncased',
        num_labels=num_labels
    ).to(device)

    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=0.01)

    total_steps = len(train_loader) * epochs
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=total_steps // 10,
        num_training_steps=total_steps
    )

    best_val_acc = 0

    for epoch in range(epochs):
        model.train()
        total_loss = 0
        correct = 0

        for batch in train_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)

            optimizer.zero_grad()
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels
            )

            loss = outputs.loss
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()

            total_loss += loss.item()
            preds = outputs.logits.argmax(dim=1)
            correct += (preds == labels).sum().item()

        train_acc = correct / len(train_dataset)
        avg_loss = total_loss / len(train_loader)

        model.eval()
        val_correct = 0
        with torch.no_grad():
            for batch in val_loader:
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['label'].to(device)

                outputs = model(input_ids=input_ids, attention_mask=attention_mask)
                preds = outputs.logits.argmax(dim=1)
                val_correct += (preds == labels).sum().item()

        val_acc = val_correct / len(val_dataset)
        print(f"Epoch {epoch+1}: Loss={avg_loss:.4f}, Train Acc={train_acc:.4f}, Val Acc={val_acc:.4f}")

        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), 'best_bert_classifier.pt')

    return model

7.3 BERT NER (Named Entity Recognition)

from transformers import BertForTokenClassification

ner_labels = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
label2id = {label: i for i, label in enumerate(ner_labels)}
id2label = {i: label for i, label in enumerate(ner_labels)}

model = BertForTokenClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=len(ner_labels),
    id2label=id2label,
    label2id=label2id
)

def predict_ner(text, model, tokenizer):
    """Predict named entities"""
    model.eval()

    encoding = tokenizer(
        text.split(),
        is_split_into_words=True,
        padding=True,
        truncation=True,
        return_tensors='pt'
    )

    with torch.no_grad():
        outputs = model(
            input_ids=encoding['input_ids'],
            attention_mask=encoding['attention_mask']
        )

    predictions = outputs.logits.argmax(dim=2).squeeze().tolist()
    word_ids = encoding.word_ids()

    result = []
    prev_word_id = None
    for pred, word_id in zip(predictions, word_ids):
        if word_id is None or word_id == prev_word_id:
            continue
        word = text.split()[word_id]
        label = id2label[pred]
        result.append((word, label))
        prev_word_id = word_id

    return result

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')  # word_ids() requires a fast tokenizer
text = "Barack Obama was born in Hawaii and served as the 44th President of the United States."
# print(predict_ner(text, model, tokenizer))  # use with a fine-tuned model
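The B-/I-/O labels above encode spans: a B- tag opens an entity and subsequent I- tags of the same type extend it. A small helper (illustrative, not part of transformers) that merges tagged words back into entity spans:

```python
def bio_to_entities(tagged):
    """Merge BIO-tagged (word, label) pairs into (entity_text, type) spans."""
    entities, current, etype = [], [], None
    for word, label in tagged:
        if label.startswith("B-"):            # a B- tag always opens a new entity
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [word], label[2:]
        elif label.startswith("I-") and current and label[2:] == etype:
            current.append(word)              # continue the open entity
        else:                                 # O tag (or stray I-) closes any open span
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities
```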

8. The GPT Family of Models

GPT (Generative Pre-trained Transformer) is a series of autoregressive language models developed by OpenAI.

8.1 GPT Evolution

| Model | Year | Parameters | Key Feature |
| --- | --- | --- | --- |
| GPT-1 | 2018 | 117M | First GPT, unsupervised pre-training |
| GPT-2 | 2019 | 1.5B | Zero-shot transfer, "too dangerous to release" |
| GPT-3 | 2020 | 175B | In-context learning, few-shot |
| InstructGPT | 2022 | 1.3B | RLHF, following instructions |
| GPT-4 | 2023 | Undisclosed | Multimodal, stronger reasoning |

8.2 Text Generation with GPT-2

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

tokenizer.pad_token = tokenizer.eos_token

def generate_text(
    prompt,
    model,
    tokenizer,
    max_length=200,
    temperature=0.9,
    top_p=0.95,
    top_k=50,
    num_return_sequences=1,
    do_sample=True
):
    """Generate text with GPT-2"""

    inputs = tokenizer(prompt, return_tensors='pt')
    input_ids = inputs['input_ids']

    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_length=max_length,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            num_return_sequences=num_return_sequences,
            do_sample=do_sample,
            pad_token_id=tokenizer.eos_token_id,
            repetition_penalty=1.2
        )

    generated_texts = []
    for out in output:
        text = tokenizer.decode(out, skip_special_tokens=True)
        generated_texts.append(text)

    return generated_texts

prompt = "Artificial intelligence is transforming the world by"
generated = generate_text(prompt, model, tokenizer, max_length=150, temperature=0.8)

for i, text in enumerate(generated):
    print(f"\nGenerated text {i+1}:")
    print(text)
    print("-" * 50)

def compare_generation_strategies(prompt, model, tokenizer):
    """Compare different decoding strategies"""

    input_ids = tokenizer.encode(prompt, return_tensors='pt')

    strategies = {
        "Greedy Search": dict(do_sample=False),
        "Beam Search": dict(do_sample=False, num_beams=5, early_stopping=True),
        "Temperature Sampling": dict(do_sample=True, temperature=0.7),
        "Top-k Sampling": dict(do_sample=True, top_k=50),
        "Top-p (Nucleus) Sampling": dict(do_sample=True, top_p=0.92),
    }

    print(f"Prompt: {prompt}\n")

    with torch.no_grad():
        for name, params in strategies.items():
            output = model.generate(
                input_ids,
                max_new_tokens=50,
                pad_token_id=tokenizer.eos_token_id,
                **params
            )
            text = tokenizer.decode(output[0], skip_special_tokens=True)
            print(f"[{name}]")
            print(text[len(prompt):].strip())
            print()

compare_generation_strategies(
    "The future of artificial intelligence",
    model,
    tokenizer
)
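Of these strategies, nucleus (top-p) sampling is worth unpacking: it keeps only the smallest set of highest-probability tokens whose cumulative probability reaches p, then samples from that set. A minimal sketch of the logit-filtering step (a simplified version of what `generate` does internally):

```python
import torch

def top_p_filter(logits, p=0.9):
    """Set every token outside the top-p nucleus to -inf."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    cum = torch.cumsum(probs, dim=-1)
    # Drop tokens after cumulative probability crosses p,
    # shifting by one so the crossing token itself is kept
    remove = cum > p
    remove[..., 1:] = remove[..., :-1].clone()
    remove[..., 0] = False                     # always keep the most likely token
    sorted_logits[remove] = float('-inf')
    # Scatter the filtered values back into the original token order
    return torch.full_like(logits, float('-inf')).scatter(-1, sorted_idx, sorted_logits)
```

Sampling then proceeds with `torch.multinomial(torch.softmax(filtered, -1), 1)` over the surviving tokens.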

8.3 Prompt Engineering

# Zero-shot classification
def zero_shot_classification(text, categories, model_name="gpt2"):
    """Zero-shot text classification"""

    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)
    model.eval()

    scores = {}

    for category in categories:
        prompt = f"Text: {text}\nCategory: {category}"
        inputs = tokenizer(prompt, return_tensors='pt')

        with torch.no_grad():
            outputs = model(**inputs, labels=inputs['input_ids'])
            scores[category] = -outputs.loss.item()

    best_category = max(scores, key=scores.get)
    return best_category, scores

# Few-shot learning example
few_shot_prompt = """
Classify the sentiment of the following texts:

Text: "This movie was amazing!"
Sentiment: Positive

Text: "I hated every minute of it."
Sentiment: Negative

Text: "The film was okay, nothing special."
Sentiment: Neutral

Text: "Absolutely brilliant performance!"
Sentiment:"""

print("Few-shot prompt:")
print(few_shot_prompt)

# Chain-of-Thought prompting
cot_prompt = """
Q: A train travels 120 miles in 2 hours. How long will it take to travel 300 miles?

A: Let me think step by step.
1. First, find the speed: 120 miles / 2 hours = 60 miles per hour
2. Then, calculate the time for 300 miles: 300 miles / 60 mph = 5 hours
Therefore, it will take 5 hours.

Q: If a store sells 3 apples for $2.40, how much do 7 apples cost?

A: Let me think step by step.
"""

9. Modern NLP Techniques

9.1 LoRA / QLoRA Fine-tuning

LoRA (Low-Rank Adaptation) is a technique for efficiently fine-tuning large language models. It freezes the original weights and trains only small low-rank matrices added to the existing layers.
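Concretely, each adapted weight matrix W gets a trainable low-rank update scaled by alpha/r: h = Wx + (alpha/r)·BAx, where B is initialized to zero so training starts from the unchanged model. A minimal sketch of a LoRA-wrapped linear layer (illustrative, not the peft implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: Wx + (alpha/r) * BAx."""
    def __init__(self, base: nn.Linear, r=8, alpha=32):
        super().__init__()
        self.base = base
        for param in self.base.parameters():
            param.requires_grad = False            # freeze the pre-trained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B=0 -> no-op at init
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```

In practice you would use peft's `LoraConfig`/`get_peft_model` as shown below rather than hand-rolling this, but the trainable-parameter savings come from exactly this structure: only A and B (r·(in+out) values) receive gradients.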

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

def setup_lora_model(model_name="gpt2", r=8, lora_alpha=32, lora_dropout=0.1):
    """Configure a LoRA model"""

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=r,                          # LoRA rank
        lora_alpha=lora_alpha,        # scaling parameter
        target_modules=["c_attn"],    # layers to apply LoRA to
        lora_dropout=lora_dropout,
        bias="none"
    )

    model = get_peft_model(model, lora_config)

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable parameters: {trainable:,} ({100 * trainable / total:.2f}%)")

    return model, tokenizer

class InstructionDataset(torch.utils.data.Dataset):
    """Instruction-format dataset for SFT"""

    def __init__(self, data, tokenizer, max_length=512):
        self.tokenizer = tokenizer
        self.max_length = max_length

        self.texts = []
        for item in data:
            prompt = f"### Instruction:\n{item['instruction']}\n\n### Response:\n{item['response']}"
            self.texts.append(prompt)

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        input_ids = encoding['input_ids'].squeeze()
        attention_mask = encoding['attention_mask'].squeeze()
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100  # don't compute loss on padding tokens
        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels
        }

9.2 RAG (Retrieval-Augmented Generation)

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

class SimpleRAG:
    """Simple RAG system"""

    def __init__(self, embedding_model="sentence-transformers/all-MiniLM-L6-v2"):
        self.tokenizer = AutoTokenizer.from_pretrained(embedding_model)
        self.model = AutoModel.from_pretrained(embedding_model)
        self.knowledge_base = []
        self.embeddings = []

    def mean_pooling(self, model_output, attention_mask):
        token_embeddings = model_output[0]
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(
            token_embeddings.size()
        ).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / \
               torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    def encode(self, texts):
        encoded = self.tokenizer(
            texts,
            padding=True,
            truncation=True,
            max_length=512,
            return_tensors='pt'
        )

        with torch.no_grad():
            model_output = self.model(**encoded)

        embeddings = self.mean_pooling(model_output, encoded['attention_mask'])
        embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
        return embeddings.numpy()

    def add_documents(self, documents):
        self.knowledge_base.extend(documents)
        new_embeddings = self.encode(documents)
        if len(self.embeddings) == 0:
            self.embeddings = new_embeddings
        else:
            self.embeddings = np.vstack([self.embeddings, new_embeddings])
        print(f"Added {len(documents)} documents. Total: {len(self.knowledge_base)}.")

    def retrieve(self, query, top_k=3):
        query_embedding = self.encode([query])
        similarities = cosine_similarity(query_embedding, self.embeddings)[0]
        top_indices = np.argsort(similarities)[::-1][:top_k]

        results = []
        for idx in top_indices:
            results.append({
                'document': self.knowledge_base[idx],
                'similarity': similarities[idx]
            })

        return results

    def answer(self, query, generator_model, generator_tokenizer, top_k=3):
        """RAG: retrieve + generate"""
        relevant_docs = self.retrieve(query, top_k=top_k)

        context = "\n".join([doc['document'] for doc in relevant_docs])

        prompt = f"""Based on the following context, answer the question.

Context:
{context}

Question: {query}

Answer:"""

        inputs = generator_tokenizer(prompt, return_tensors='pt', max_length=512, truncation=True)
        with torch.no_grad():
            outputs = generator_model.generate(
                inputs['input_ids'],
                max_new_tokens=150,
                temperature=0.7,
                do_sample=True,
                pad_token_id=generator_tokenizer.eos_token_id
            )

        answer = generator_tokenizer.decode(outputs[0], skip_special_tokens=True)
        answer = answer[len(prompt):].strip()

        return answer, relevant_docs

# Usage example
rag = SimpleRAG()

documents = [
    "BERT (Bidirectional Encoder Representations from Transformers) was published by Google in 2018.",
    "GPT-3 has 175 billion parameters and was developed by OpenAI in 2020.",
    "The Transformer architecture was introduced in the paper 'Attention is All You Need' in 2017.",
    "RLHF (Reinforcement Learning from Human Feedback) is used to align language models with human values.",
    "LoRA allows efficient fine-tuning by adding low-rank matrices to pre-trained model weights.",
    "RAG combines information retrieval with text generation for more accurate responses.",
]

rag.add_documents(documents)

query = "What is BERT and when was it created?"
results = rag.retrieve(query, top_k=2)

print(f"\nQuery: {query}")
print("\nRetrieved documents:")
for r in results:
    print(f"  [{r['similarity']:.3f}] {r['document']}")

9.3 RLHF (Reinforcement Learning from Human Feedback)

import torch
import torch.nn as nn
from transformers import GPT2Model

class RewardModel(nn.Module):
    """Reward model — scores response quality"""

    def __init__(self, base_model_name="gpt2"):
        super().__init__()
        self.transformer = GPT2Model.from_pretrained(base_model_name)
        hidden_size = self.transformer.config.hidden_size
        self.reward_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.transformer(input_ids=input_ids, attention_mask=attention_mask)
        # Score from the final position (assumes unpadded or left-padded inputs)
        last_hidden = outputs.last_hidden_state[:, -1, :]
        reward = self.reward_head(last_hidden)
        return reward.squeeze(-1)
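The reward model is typically trained on human preference pairs — a chosen and a rejected response to the same prompt — with a Bradley-Terry-style loss that pushes the chosen response's score above the rejected one's. A minimal sketch:

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards, rejected_rewards):
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The loss is near zero when chosen responses score far above rejected ones, and log 2 when the model cannot tell them apart.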

class PPOTrainer:
    """PPO-based RLHF training (conceptual implementation)"""

    def __init__(self, policy_model, reward_model, ref_model, tokenizer):
        self.policy = policy_model
        self.reward_model = reward_model
        self.ref_model = ref_model  # Reference model for KL penalty
        self.tokenizer = tokenizer

    def compute_kl_penalty(self, policy_logprobs, ref_logprobs, kl_coeff=0.1):
        """KL divergence penalty — prevents policy from drifting too far"""
        kl = policy_logprobs - ref_logprobs
        return kl_coeff * kl.mean()

    def compute_advantages(self, rewards, values, gamma=0.99, lam=0.95):
        """Generalized Advantage Estimation (GAE)"""
        advantages = []
        last_advantage = 0

        for t in reversed(range(len(rewards))):
            if t == len(rewards) - 1:
                next_value = 0
            else:
                next_value = values[t + 1]

            delta = rewards[t] + gamma * next_value - values[t]
            advantage = delta + gamma * lam * last_advantage
            advantages.insert(0, advantage)
            last_advantage = advantage

        return torch.tensor(advantages)
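The missing piece of the conceptual trainer above is the PPO objective itself: the probability ratio between the new and old policy is clipped so that a single update cannot move the policy too far from the data-collecting policy. A minimal sketch of the clipped surrogate loss:

```python
import torch

def ppo_clip_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective (returned as a loss to minimize)."""
    ratio = torch.exp(logprobs - old_logprobs)     # pi_new / pi_old per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Taking the min makes the objective pessimistic: large ratio moves
    # in the advantage's favor are cut off at 1 +/- clip_eps
    return -torch.min(unclipped, clipped).mean()
```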

10. Real-World NLP Projects

10.1 Sentiment Analysis System (Complete Pipeline)

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup
import matplotlib.pyplot as plt
import seaborn as sns

class SentimentAnalysisPipeline:
    """End-to-end sentiment analysis pipeline"""

    def __init__(self, model_name="distilbert-base-uncased-finetuned-sst-2-english"):
        self.model_name = model_name
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device)
        self.model.eval()
        self.label_map = {0: 'Negative', 1: 'Positive'}

    def predict(self, texts, batch_size=32):
        """Batch prediction"""
        if isinstance(texts, str):
            texts = [texts]

        all_predictions = []
        all_probabilities = []

        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]

            encoding = self.tokenizer(
                batch,
                padding=True,
                truncation=True,
                max_length=512,
                return_tensors='pt'
            )

            input_ids = encoding['input_ids'].to(self.device)
            attention_mask = encoding['attention_mask'].to(self.device)

            with torch.no_grad():
                outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
                probs = torch.softmax(outputs.logits, dim=1)
                preds = probs.argmax(dim=1)

            all_predictions.extend(preds.cpu().numpy())
            all_probabilities.extend(probs.cpu().numpy())

        results = []
        for pred, prob in zip(all_predictions, all_probabilities):
            results.append({
                'label': self.label_map[pred],
                'score': float(prob[pred]),
                'probabilities': {self.label_map[i]: float(p) for i, p in enumerate(prob)}
            })

        return results if len(results) > 1 else results[0]

    def analyze_batch(self, texts):
        """Batch sentiment analysis with statistics"""
        results = self.predict(texts)
        if isinstance(results, dict):  # predict returns a bare dict for a single text
            results = [results]

        positive_count = sum(1 for r in results if r['label'] == 'Positive')
        negative_count = len(results) - positive_count
        avg_confidence = np.mean([r['score'] for r in results])

        print(f"\n=== Sentiment Analysis Results ===")
        print(f"Total texts: {len(texts)}")
        print(f"Positive: {positive_count} ({100*positive_count/len(texts):.1f}%)")
        print(f"Negative: {negative_count} ({100*negative_count/len(texts):.1f}%)")
        print(f"Average confidence: {avg_confidence:.3f}")

        return results

pipeline = SentimentAnalysisPipeline()

reviews = [
    "This product exceeded all my expectations! Absolutely love it.",
    "Terrible quality, broke after one day. Complete waste of money.",
    "It's okay, does what it's supposed to do.",
    "Best purchase I've made this year!",
    "Would not recommend to anyone.",
]

print("=== Individual Predictions ===")
for review in reviews:
    result = pipeline.predict(review)
    print(f"Text: {review[:50]}...")
    print(f"  -> {result['label']} ({result['score']:.3f})\n")

print("\n=== Batch Analysis ===")
pipeline.analyze_batch(reviews)

10.2 Text Summarization System

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

class TextSummarizer:
    """Text summarization system"""

    def __init__(self, model_name="facebook/bart-large-cnn"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device)

    def summarize(self, text, max_length=130, min_length=30, num_beams=4):
        """Abstractive summarization"""
        inputs = self.tokenizer(
            text,
            max_length=1024,
            truncation=True,
            return_tensors='pt'
        ).to(self.device)

        with torch.no_grad():
            summary_ids = self.model.generate(
                inputs['input_ids'],
                max_length=max_length,
                min_length=min_length,
                num_beams=num_beams,
                length_penalty=2.0,
                early_stopping=True
            )

        summary = self.tokenizer.decode(summary_ids[0], skip_special_tokens=True)

        original_words = len(text.split())
        summary_words = len(summary.split())
        compression = (1 - summary_words / original_words) * 100

        return {
            'summary': summary,
            'original_length': original_words,
            'summary_length': summary_words,
            'compression_rate': f"{compression:.1f}%"
        }

    def extractive_summarize(self, text, num_sentences=3):
        """Extractive summarization (TF-IDF based)"""
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity
        import nltk
        nltk.download('punkt', quiet=True)
        nltk.download('punkt_tab', quiet=True)  # required by newer NLTK releases
        from nltk.tokenize import sent_tokenize

        sentences = sent_tokenize(text)

        if len(sentences) <= num_sentences:
            return text

        vectorizer = TfidfVectorizer(stop_words='english')
        tfidf_matrix = vectorizer.fit_transform(sentences)

        similarity_matrix = cosine_similarity(tfidf_matrix)
        scores = similarity_matrix.mean(axis=1)

        top_indices = sorted(np.argsort(scores)[-num_sentences:].tolist())
        summary = ' '.join([sentences[i] for i in top_indices])
        return summary

summarizer = TextSummarizer()

long_text = """
Artificial intelligence has made remarkable strides in recent years, particularly in
natural language processing. The development of transformer-based models like BERT and
GPT has revolutionized how machines understand and generate human language. These models,
trained on massive datasets, can perform a wide range of tasks including translation,
summarization, question answering, and sentiment analysis.

The key innovation behind these models is the attention mechanism, which allows the model
to focus on relevant parts of the input when generating each word of the output. This has
enabled much more nuanced understanding of context and semantics compared to earlier
approaches like recurrent neural networks.

However, these advances come with challenges. Training large language models requires
enormous computational resources and energy. There are also concerns about bias in the
training data leading to biased outputs, and the potential for misuse in generating
misinformation. Researchers are actively working on making these models more efficient,
fair, and reliable.

The future of NLP looks promising, with models becoming increasingly capable of
understanding and generating language that is indistinguishable from human writing.
Applications range from customer service chatbots to scientific research assistance,
and the technology continues to evolve rapidly.
"""

result = summarizer.summarize(long_text)
print("=== Abstractive Summary ===")
print(f"Original length: {result['original_length']} words")
print(f"Summary length: {result['summary_length']} words")
print(f"Compression rate: {result['compression_rate']}")
print(f"\nSummary:\n{result['summary']}")

print("\n=== Extractive Summary ===")
extractive = summarizer.extractive_summarize(long_text, num_sentences=2)
print(extractive)
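The extractive summarizer above scores each sentence by its mean similarity to all others, which is a cheap stand-in for TextRank's PageRank centrality. As an optional refinement (not part of the original class; `pagerank_scores` is an illustrative helper written here in pure NumPy), the same cosine-similarity matrix can be run through a few power-iteration steps instead of a plain mean:

```python
import numpy as np

def pagerank_scores(sim: np.ndarray, damping: float = 0.85, iters: int = 50) -> np.ndarray:
    """Power-iteration PageRank over a sentence-similarity matrix."""
    n = sim.shape[0]
    # Row-normalise so each row becomes a transition distribution;
    # all-zero rows fall back to a uniform jump
    row_sums = sim.sum(axis=1, keepdims=True)
    transition = np.divide(sim, row_sums,
                           out=np.full_like(sim, 1.0 / n),
                           where=row_sums != 0)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = (1 - damping) / n + damping * transition.T @ scores
    return scores

# Toy similarity matrix: sentences 0 and 1 support each other,
# sentence 2 is weakly connected
sim = np.array([
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
scores = pagerank_scores(sim)
print(scores)  # the two mutually similar sentences receive the highest scores
```

Because the matrix is already computed in `extractive_summarize`, swapping `similarity_matrix.mean(axis=1)` for a centrality score like this changes only one line of the class.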

10.3 Question Answering System

import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

class QASystem:
    """Question answering system"""

    def __init__(self, model_name="deepset/roberta-base-squad2"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForQuestionAnswering.from_pretrained(model_name)
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device)
        self.model.eval()

    def answer(self, question, context, max_answer_len=100):
        """Extract answer from context"""

        encoding = self.tokenizer(
            question,
            context,
            max_length=512,
            truncation=True,
            stride=128,
            return_overflowing_tokens=True,
            return_offsets_mapping=True,
            padding='max_length',
            return_tensors='pt'
        )

        offset_mapping = encoding.pop('offset_mapping')
        encoding.pop('overflow_to_sample_mapping')  # chunk-to-sample map, unused for a single context

        # Record which tokens belong to the context (sequence id 1), so the
        # predicted span can never fall inside the question or special tokens
        sequence_ids = [encoding.sequence_ids(i) for i in range(len(offset_mapping))]

        inputs = {k: v.to(self.device) for k, v in encoding.items()}

        with torch.no_grad():
            outputs = self.model(**inputs)
            start_logits = outputs.start_logits
            end_logits = outputs.end_logits

        answers = []

        for i in range(len(start_logits)):
            start_log = start_logits[i].cpu().numpy()
            end_log = end_logits[i].cpu().numpy()
            offsets = offset_mapping[i].numpy()

            # Mask out every position that is not a context token
            not_context = np.array([sid != 1 for sid in sequence_ids[i]])
            start_log[not_context] = -np.inf
            end_log[not_context] = -np.inf

            start_idx = int(np.argmax(start_log))
            end_idx = int(np.argmax(end_log))

            if start_idx <= end_idx <= start_idx + max_answer_len:
                start_char = offsets[start_idx][0]
                end_char = offsets[end_idx][1]

                answer_text = context[start_char:end_char]
                score = float(start_log[start_idx] + end_log[end_idx])

                if answer_text:
                    answers.append({
                        'answer': answer_text,
                        'score': score,
                        'start': int(start_char),
                        'end': int(end_char)
                    })

        if not answers:
            return {'answer': "No answer found.", 'score': 0.0}

        return max(answers, key=lambda x: x['score'])

    def multi_question_answer(self, questions, context):
        """Answer multiple questions about a context"""
        results = []
        for question in questions:
            answer = self.answer(question, context)
            results.append({
                'question': question,
                'answer': answer['answer'],
                'confidence': answer['score']
            })
        return results

qa_system = QASystem()

context = """
The Transformer model was introduced in a 2017 paper titled "Attention Is All You Need"
by researchers at Google Brain. The model architecture relies entirely on attention
mechanisms, dispensing with recurrence and convolutions. BERT, which stands for
Bidirectional Encoder Representations from Transformers, was introduced in 2018 by
Google AI Language team. BERT achieved state-of-the-art results on eleven NLP tasks.
GPT-3, developed by OpenAI, was released in 2020 and has 175 billion parameters, making
it one of the largest language models at the time of its release.
"""

questions = [
    "When was the Transformer model introduced?",
    "What does BERT stand for?",
    "How many parameters does GPT-3 have?",
    "Who developed BERT?",
]

print("=== Question Answering System ===\n")
results = qa_system.multi_question_answer(questions, context)

for r in results:
    print(f"Q: {r['question']}")
    print(f"A: {r['answer']}")
    print(f"Confidence: {r['confidence']:.2f}")
    print()
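Note that taking independent argmaxes over the start and end logits, as the class above does, can still pair a strong start with an incompatible end. Production QA pipelines typically search for the valid (start, end) pair with the highest combined score. A toy sketch with made-up logit values (`best_span` is an illustrative helper, not part of the class above):

```python
import numpy as np

def best_span(start_logits: np.ndarray, end_logits: np.ndarray, max_len: int = 30):
    """Return (score, start, end) maximising start+end logit with start <= end."""
    best = (-np.inf, 0, 0)
    for s in range(len(start_logits)):
        # Only consider ends at or after the start, within the length budget
        for e in range(s, min(s + max_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best[0]:
                best = (score, s, e)
    return best

# Toy logits over a 4-token sequence
start = np.array([0.1, 2.0, 0.3, 0.2])
end = np.array([0.5, 0.1, 1.8, 0.4])
score, s, e = best_span(start, end)
print(s, e)  # → 1 2
```

The brute-force double loop is O(n · max_len), which is cheap for 512-token chunks; real implementations often vectorize it, but the selection logic is the same.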

Closing: NLP Learning Roadmap

This guide has walked through the complete NLP journey. A summary of each stage:

  1. Foundations: text preprocessing, BoW, TF-IDF
  2. Embeddings: Word2Vec, GloVe, FastText
  3. Sequence Models: RNN, LSTM, GRU, Seq2Seq
  4. The Attention Revolution: attention mechanisms, self-attention
  5. Mastering Transformers: the full architecture
  6. Pre-trained Models: BERT and GPT in practice
  7. Modern Techniques: LoRA, RAG, RLHF

NLP is a rapidly evolving field, so continuous learning and hands-on practice are essential. Run the code examples yourself, experiment with the parameters, and build your own projects to develop a deep understanding. Happy learning!