The Complete Guide to Natural Language Processing (NLP): Zero to Hero - From Text Preprocessing to LLMs
Natural Language Processing (NLP) is a core field of artificial intelligence that enables computers to understand and generate human language. Countless services we use every day, such as ChatGPT, machine translation, search engines, and sentiment analysis systems, are built on NLP technology. This guide provides a complete learning path, covering everything systematically from the most basic text preprocessing to the latest large language models (LLMs).
Table of Contents
- NLP Basics and Text Preprocessing
- Text Representation
- Word Embeddings
- NLP with Recurrent Neural Networks
- The Attention Mechanism
- The Transformer Architecture in Depth
- BERT in Depth
- The GPT Family of Models
- Modern NLP Techniques
- Hands-On NLP Projects
1. NLP Basics and Text Preprocessing
The first step in natural language processing is preprocessing: converting raw text into a form a model can work with. Because text is unstructured data, cleaning and structuring it into a consistent format is essential.
1.1 Tokenization
Tokenization splits text into smaller units called tokens. Depending on the unit, it is divided into word tokenization, character tokenization, and subword tokenization.
Word Tokenization
The most intuitive approach: split text into words on whitespace and punctuation.
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')

text = "Natural Language Processing is fascinating! It powers ChatGPT and many AI applications."

# Word tokenization
word_tokens = word_tokenize(text)
print("Word tokens:", word_tokens)
# Output: ['Natural', 'Language', 'Processing', 'is', 'fascinating', '!', ...]

# Sentence tokenization
sent_tokens = sent_tokenize(text)
print("Sentence tokens:", sent_tokens)
# Output: ['Natural Language Processing is fascinating!', 'It powers ChatGPT...']
Character Tokenization
Splits text into individual characters. The vocabulary is small and there is no out-of-vocabulary (OOV) problem, but sequences become much longer.

text = "Hello NLP"
char_tokens = list(text)
print("Character tokens:", char_tokens)
# Output: ['H', 'e', 'l', 'l', 'o', ' ', 'N', 'L', 'P']
Subword Tokenization
Splits text into subwords, a unit between words and characters. Algorithms include BPE (Byte Pair Encoding), WordPiece, and SentencePiece, and they are used by modern models such as BERT and GPT.
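Before turning to a library, the core BPE merge loop can be sketched in a few lines of plain Python. This is a simplified illustration of the algorithm only, not the actual implementation used by any library; the toy word list is made up:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Each word is represented as a tuple of symbols (initially characters)
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

words = ["low", "low", "lower", "newest", "newest", "newest", "widest"]
merges, vocab = bpe_merges(words, 5)
print(merges)
```

Each learned merge becomes a vocabulary entry, which is how frequent fragments like "est" end up as single tokens.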
from tokenizers import ByteLevelBPETokenizer

# Train a BPE tokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=5000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"]
)

# Encoding
encoding = tokenizer.encode("Natural Language Processing")
print("Tokens:", encoding.tokens)
print("IDs:", encoding.ids)
1.2 Stop Words Removal
Stop words are words like "the", "is", and "at" that appear so frequently they carry almost no meaningful information. Removing them reduces the data size and lets the model focus on the important words.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')

text = "This is a sample sentence showing off the stop words filtration."
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(text)
filtered_sentence = [w for w in word_tokens if not w.lower() in stop_words]
print("Original:", word_tokens)
print("Filtered:", filtered_sentence)
# Output: ['sample', 'sentence', 'showing', 'stop', 'words', 'filtration', '.']
1.3 Stemming and Lemmatization
Stemming
Stemming strips affixes to extract a word's stem. It is fast but can produce results that are not linguistically valid words.
from nltk.stem import PorterStemmer, LancasterStemmer

ps = PorterStemmer()
ls = LancasterStemmer()
words = ["running", "runs", "ran", "runner", "easily", "fairly"]
for word in words:
    print(f"{word:15} -> Porter: {ps.stem(word):15} Lancaster: {ls.stem(word)}")
Lemmatization
Lemmatization extracts a word's dictionary form (lemma). It is slower than stemming but produces linguistically correct results.
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
# Specifying the part of speech gives more accurate results
print(lemmatizer.lemmatize("running", pos='v'))  # run
print(lemmatizer.lemmatize("better", pos='a'))   # good
print(lemmatizer.lemmatize("dogs"))              # dog
print(lemmatizer.lemmatize("went", pos='v'))     # go
1.4 Cleaning Text with Regular Expressions

import re

def clean_text(text):
    """Clean raw text."""
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)
    # Remove special characters (keep letters, digits, whitespace)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Collapse multiple spaces into one
    text = re.sub(r'\s+', ' ', text)
    # Trim surrounding whitespace and lowercase
    text = text.strip().lower()
    return text

sample_text = """
Check out my website at https://example.com!
Email me at user@example.com for <b>more info</b>.
It's really cool!!!
"""
cleaned = clean_text(sample_text)
print(cleaned)
# Output: "check out my website at email me at for more info its really cool"
1.5 Advanced Preprocessing with spaCy
spaCy is an industrial-strength NLP library that provides tokenization, part-of-speech tagging, named entity recognition, dependency parsing, and more.
import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")
text = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(text)

print("=== Token info ===")
for token in doc:
    print(f"{token.text:15} | POS: {token.pos_:10} | Lemma: {token.lemma_:15} | Stop: {token.is_stop}")

print("\n=== Named entities ===")
for ent in doc.ents:
    print(f"{ent.text:20} -> {ent.label_} ({spacy.explain(ent.label_)})")

print("\n=== Dependency parse ===")
for token in doc:
    print(f"{token.text:15} -> {token.dep_:15} (head: {token.head.text})")
1.6 Korean Text Processing: KoNLPy
Korean is an agglutinative language and requires morphological analysis quite different from English. KoNLPy is a Python library for Korean NLP.
from konlpy.tag import Okt, Mecab, Komoran, Kkma

# Example using Okt (Open Korean Text)
okt = Okt()
text = "자연어 처리는 인공지능의 핵심 분야입니다."

print("Morphemes:", okt.morphs(text))
print("POS tags:", okt.pos(text))
print("Nouns:", okt.nouns(text))
print("Phrases:", okt.phrases(text))
# Example output:
# Morphemes: ['자연어', '처리', '는', '인공지능', '의', '핵심', '분야', '입니다', '.']
# POS tags: [('자연어', 'Noun'), ('처리', 'Noun'), ('는', 'Josa'), ...]
# Nouns: ['자연어', '처리', '인공지능', '핵심', '분야']

# Mecab - the fastest and most accurate morphological analyzer
try:
    mecab = Mecab()
    print("\nMecab analysis:")
    print(mecab.pos(text))
except Exception:
    print("Mecab is not installed.")

# Korean stop word removal
korean_stopwords = ['은', '는', '이', '가', '을', '를', '의', '에', '에서', '으로', '와', '과', '도']

def korean_tokenize(text):
    okt = Okt()
    tokens = okt.morphs(text, stem=True)  # include stemming
    tokens = [t for t in tokens if t not in korean_stopwords]
    tokens = [t for t in tokens if len(t) > 1]  # drop single-character tokens
    return tokens

print("\nKorean tokens:", korean_tokenize(text))
2. Text Representation
Converting text into numeric vectors is central to NLP. Before a model can process text, it must be turned into numbers that support mathematical operations.
2.1 Bag of Words (BoW)
BoW is the simplest way to represent text, as a vector of word counts. It ignores word order and considers only how often each word occurs.
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

corpus = [
    "I love natural language processing",
    "Natural language processing is amazing",
    "I love machine learning too",
    "Deep learning is part of machine learning"
]

# BoW vectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print("Vocabulary:", vectorizer.get_feature_names_out())
print("\nBoW matrix:")
print(X.toarray())
print("\nShape:", X.shape)  # (4 documents, n words)

# Inspect a single document
doc_0 = dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0]))
print("\nWord counts for document 0:")
for word, count in sorted(doc_0.items()):
    if count > 0:
        print(f"  {word}: {count}")
2.2 TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) improves on BoW by weighting not just how often a word occurs, but also how distinctive it is (how rarely it appears in other documents).
TF (Term Frequency): how often a word appears in a single document.
IDF (Inverse Document Frequency): the inverse of how many documents the word appears in.
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    "the cat sat on the mat",
    "the cat sat on the hat",
    "the dog sat on the log",
    "the cat wore the hat",
]

# TF-IDF vectorizer
tfidf = TfidfVectorizer(smooth_idf=True, norm='l2')
X = tfidf.fit_transform(corpus)

# Show the result as a DataFrame
df = pd.DataFrame(
    X.toarray(),
    columns=tfidf.get_feature_names_out(),
    index=[f"Doc {i}" for i in range(len(corpus))]
)
print(df.round(3))

# Manual TF-IDF computation
def compute_tfidf(corpus):
    from math import log

    # Build the vocabulary
    vocab = set()
    for doc in corpus:
        vocab.update(doc.split())
    vocab = sorted(vocab)

    # Term frequency
    def tf(word, doc):
        words = doc.split()
        return words.count(word) / len(words)

    # Inverse document frequency (smoothed, as in scikit-learn)
    def idf(word, corpus):
        n_docs_with_word = sum(1 for doc in corpus if word in doc.split())
        return log((1 + len(corpus)) / (1 + n_docs_with_word)) + 1

    # TF-IDF matrix
    tfidf_matrix = []
    for doc in corpus:
        row = [tf(word, doc) * idf(word, corpus) for word in vocab]
        tfidf_matrix.append(row)
    return tfidf_matrix, vocab

matrix, vocab = compute_tfidf(corpus)
print("\nManually computed TF-IDF:")
df_manual = pd.DataFrame(matrix, columns=vocab)
print(df_manual.round(3))
2.3 N-grams
An N-gram is a sequence of N consecutive items (words or characters). N-grams capture some of the ordering information between words.
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
from collections import Counter

text = "I love natural language processing and machine learning"
tokens = word_tokenize(text.lower())

# Unigram (1-gram)
unigrams = list(ngrams(tokens, 1))
# Bigram (2-gram)
bigrams = list(ngrams(tokens, 2))
# Trigram (3-gram)
trigrams = list(ngrams(tokens, 3))

print("Unigrams:", unigrams[:5])
print("Bigrams:", bigrams[:5])
print("Trigrams:", trigrams[:5])

# N-gram frequency counts
def get_ngram_freq(text, n):
    tokens = word_tokenize(text.lower())
    n_grams = list(ngrams(tokens, n))
    return Counter(n_grams)

# Applied to a larger corpus
large_corpus = """
Machine learning is a subset of artificial intelligence.
Artificial intelligence is transforming many industries.
Natural language processing is a part of machine learning.
Deep learning has revolutionized natural language processing.
"""
bigram_freq = get_ngram_freq(large_corpus, 2)
print("\nMost frequent bigrams:")
for bigram, count in bigram_freq.most_common(10):
    print(f"  {' '.join(bigram)}: {count}")
3. Word Embeddings
Word embeddings represent words as dense vectors, learned so that semantically similar words end up close together in vector space.
3.1 Word2Vec
Word2Vec is the groundbreaking word embedding model published by Tomas Mikolov at Google in 2013. It comes in two flavors: CBOW (Continuous Bag of Words) and Skip-gram.
CBOW: predict the center word from its surrounding context words.
Skip-gram: predict the surrounding context words from the center word.
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import Counter

class Word2VecSkipGram(nn.Module):
    """Skip-gram Word2Vec with negative sampling"""
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.center_embedding = nn.Embedding(vocab_size, embedding_dim)
        self.context_embedding = nn.Embedding(vocab_size, embedding_dim)
        # Initialization
        self.center_embedding.weight.data.uniform_(-0.5/embedding_dim, 0.5/embedding_dim)
        self.context_embedding.weight.data.uniform_(-0.5/embedding_dim, 0.5/embedding_dim)

    def forward(self, center, context, negative):
        """
        center: (batch_size,) center word indices
        context: (batch_size,) true context word indices
        negative: (batch_size, neg_samples) negative sample indices
        """
        # Center word embeddings
        center_emb = self.center_embedding(center)  # (batch, dim)
        # Positive sample score
        context_emb = self.context_embedding(context)  # (batch, dim)
        pos_score = torch.sum(center_emb * context_emb, dim=1)  # (batch,)
        pos_loss = -torch.log(torch.sigmoid(pos_score) + 1e-10)
        # Negative sample scores
        neg_emb = self.context_embedding(negative)  # (batch, neg_samples, dim)
        center_emb_expanded = center_emb.unsqueeze(1)  # (batch, 1, dim)
        neg_score = torch.bmm(neg_emb, center_emb_expanded.transpose(1, 2)).squeeze(2)  # (batch, neg_samples)
        neg_loss = -torch.sum(torch.log(torch.sigmoid(-neg_score) + 1e-10), dim=1)
        return (pos_loss + neg_loss).mean()

class Word2VecCBOW(nn.Module):
    """CBOW Word2Vec"""
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, context):
        """
        context: (batch_size, context_window_size) context word indices
        """
        # Average the context word embeddings
        embedded = self.embedding(context)  # (batch, window, dim)
        mean_embedded = embedded.mean(dim=1)  # (batch, dim)
        output = self.linear(mean_embedded)  # (batch, vocab_size)
        return output

# Preparing training data
def prepare_data(sentences, window_size=2):
    """Build (center, context) training pairs from raw sentences."""
    word_counts = Counter()
    for sentence in sentences:
        word_counts.update(sentence.split())
    vocab = ['<UNK>'] + [word for word, _ in word_counts.most_common()]
    word2idx = {word: idx for idx, word in enumerate(vocab)}
    idx2word = {idx: word for word, idx in word2idx.items()}
    training_data = []
    for sentence in sentences:
        words = sentence.split()
        for i, center_word in enumerate(words):
            center_idx = word2idx.get(center_word, 0)
            for j in range(max(0, i-window_size), min(len(words), i+window_size+1)):
                if i != j:
                    context_idx = word2idx.get(words[j], 0)
                    training_data.append((center_idx, context_idx))
    return training_data, word2idx, idx2word, vocab
# Word2Vec with Gensim (the practical route)
from gensim.models import Word2Vec

sentences = [
    "I love natural language processing".split(),
    "natural language processing is a field of AI".split(),
    "machine learning is part of artificial intelligence".split(),
    "deep learning models process natural language".split(),
    "word embeddings represent words as vectors".split(),
    "Word2Vec learns word representations".split(),
    "semantic similarity between words is captured by embeddings".split(),
]

# Train the model
model = Word2Vec(
    sentences=sentences,
    vector_size=100,  # embedding dimension
    window=5,         # context window size
    min_count=1,      # minimum word frequency
    workers=4,        # number of worker threads
    epochs=100,       # training epochs
    sg=1,             # 1=Skip-gram, 0=CBOW
    negative=5        # negative samples
)

# Inspect a word vector
print("'language' vector (first 5 dims):", model.wv['language'][:5])

# Find similar words
similar_words = model.wv.most_similar('language', topn=5)
print("\nWords similar to 'language':")
for word, similarity in similar_words:
    print(f"  {word}: {similarity:.4f}")

# Word analogies (King - Man + Woman = Queen style)
# Analogy: language - natural + artificial = ?
result = model.wv.most_similar(
    positive=['artificial', 'language'],
    negative=['natural'],
    topn=3
)
print("\nAnalogy results:")
for word, sim in result:
    print(f"  {word}: {sim:.4f}")

# Similarity between two words
print("\nSimilarity of 'language' and 'processing':", model.wv.similarity('language', 'processing'))
3.2 Visualizing Embeddings (t-SNE)

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np

def visualize_embeddings(model, words=None):
    """t-SNE visualization of Word2Vec embeddings."""
    if words is None:
        words = list(model.wv.key_to_index.keys())[:50]
    # Extract the vectors
    vectors = np.array([model.wv[word] for word in words])
    # Reduce to 2D with t-SNE
    tsne = TSNE(
        n_components=2,
        random_state=42,
        perplexity=min(30, len(words)-1),
        n_iter=1000
    )
    vectors_2d = tsne.fit_transform(vectors)
    # Plot
    fig, ax = plt.subplots(figsize=(12, 8))
    ax.scatter(vectors_2d[:, 0], vectors_2d[:, 1], alpha=0.7)
    for i, word in enumerate(words):
        ax.annotate(word, (vectors_2d[i, 0], vectors_2d[i, 1]),
                    fontsize=9, ha='center', va='bottom')
    ax.set_title("Word2Vec Embeddings t-SNE Visualization")
    plt.tight_layout()
    plt.savefig('word2vec_tsne.png', dpi=150, bbox_inches='tight')
    plt.show()

visualize_embeddings(model)
3.3 GloVe
GloVe (Global Vectors for Word Representation) is a word embedding algorithm developed at Stanford that leverages global co-occurrence statistics over the whole corpus.
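The co-occurrence counts GloVe starts from can be sketched with a simple windowed counter. This illustrates only how the statistics are gathered, not the GloVe training objective itself, and the two-sentence corpus is made up:

```python
from collections import Counter

def cooccurrence_counts(sentences, window_size=2):
    """Count how often each word pair appears within window_size of each other."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i, center in enumerate(words):
            # Look at neighbors within the window, excluding the center itself
            for j in range(max(0, i - window_size), min(len(words), i + window_size + 1)):
                if i != j:
                    counts[(center, words[j])] += 1
    return counts

corpus = [
    "ice is solid water",
    "steam is gaseous water",
]
counts = cooccurrence_counts(corpus)
print(counts[("ice", "solid")])
print(counts[("is", "water")])
```

GloVe then fits word vectors so that their dot products approximate the logarithms of these counts.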
# Using pretrained GloVe vectors
# pip install torchtext
import torch
from torchtext.vocab import GloVe

# Download and load GloVe
glove = GloVe(name='6B', dim=100)

# Access a word vector
vector = glove['computer']
print("'computer' GloVe vector (first 5 dims):", vector[:5].numpy())

# Cosine similarity
def cosine_similarity(v1, v2):
    return torch.nn.functional.cosine_similarity(
        v1.unsqueeze(0), v2.unsqueeze(0)
    ).item()

words_to_compare = [('king', 'queen'), ('man', 'woman'), ('Paris', 'France')]
for w1, w2 in words_to_compare:
    if w1 in glove.stoi and w2 in glove.stoi:
        sim = cosine_similarity(glove[w1], glove[w2])
        print(f"sim('{w1}', '{w2}') = {sim:.4f}")
4. NLP with Recurrent Neural Networks
Recurrent neural networks (RNNs) are designed for sequence data: they carry information from previous time steps forward to the current one.
4.1 Basic RNN Structure
import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    """A basic RNN implemented from scratch"""
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        # Input -> hidden transformation
        self.input_to_hidden = nn.Linear(input_size + hidden_size, hidden_size)
        # Hidden -> output transformation
        self.hidden_to_output = nn.Linear(hidden_size, output_size)

    def forward(self, x, hidden=None):
        """
        x: (seq_len, batch, input_size)
        hidden: (batch, hidden_size) initial hidden state
        """
        batch_size = x.size(1)
        if hidden is None:
            hidden = torch.zeros(batch_size, self.hidden_size)
        outputs = []
        for t in range(x.size(0)):
            # Concatenate the input with the previous hidden state
            combined = torch.cat([x[t], hidden], dim=1)
            # Compute the new hidden state
            hidden = torch.tanh(self.input_to_hidden(combined))
            # Compute the output
            output = self.hidden_to_output(hidden)
            outputs.append(output)
        return torch.stack(outputs, dim=0), hidden

# Using PyTorch's built-in RNN
rnn = nn.RNN(
    input_size=50,
    hidden_size=128,
    num_layers=2,
    batch_first=True,
    dropout=0.3,
    bidirectional=True  # bidirectional RNN
)

# Test with random input
x = torch.randn(32, 20, 50)  # (batch, seq_len, features)
output, hidden = rnn(x)
print("Output shape:", output.shape)        # (32, 20, 256) - 128*2 because bidirectional
print("Hidden state shape:", hidden.shape)  # (4, 32, 128) - 2 layers * 2 directions
4.2 LSTM - The Gate Mechanism in Detail
LSTM (Long Short-Term Memory) was designed to solve the vanishing gradient problem of plain RNNs. Its gate mechanism lets it capture long-range dependencies.
class LSTMCell(nn.Module):
    """LSTM cell implemented by hand, to make the gate mechanism explicit"""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        # Compute all 4 gates in one linear layer (for efficiency):
        # forget gate, input gate, cell gate, output gate
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)

    def forward(self, x, state=None):
        """
        x: (batch, input_size)
        state: (h, c) previous hidden state and cell state
        """
        batch_size = x.size(0)
        if state is None:
            h = torch.zeros(batch_size, self.hidden_size)
            c = torch.zeros(batch_size, self.hidden_size)
        else:
            h, c = state
        # Concatenate the input and the previous hidden state
        combined = torch.cat([x, h], dim=1)
        # Compute all 4 gates
        gates = self.gates(combined)
        # Split into individual gates
        f_gate, i_gate, g_gate, o_gate = gates.chunk(4, dim=1)
        # Apply activations
        f = torch.sigmoid(f_gate)  # forget gate: how much of the old state to forget
        i = torch.sigmoid(i_gate)  # input gate: how much new information to store
        g = torch.tanh(g_gate)     # cell gate: candidate new information
        o = torch.sigmoid(o_gate)  # output gate: how much to expose
        # New cell state: forget some of the old + add the new
        new_c = f * c + i * g
        # New hidden state
        new_h = o * torch.tanh(new_c)
        return new_h, (new_h, new_c)

# Sentiment analysis with PyTorch's built-in LSTM
class SentimentLSTM(nn.Module):
    """LSTM-based sentiment analysis model"""
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, num_layers=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(
            embed_dim,
            hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout,
            bidirectional=True
        )
        self.dropout = nn.Dropout(dropout)
        # hidden_dim * 2 because the LSTM is bidirectional
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_classes)
        )

    def forward(self, x, lengths=None):
        """
        x: (batch, seq_len) token indices
        lengths: actual length of each sequence (to handle padding)
        """
        # Embedding
        embedded = self.dropout(self.embedding(x))  # (batch, seq, embed)
        # Handle padded sequences
        if lengths is not None:
            packed = nn.utils.rnn.pack_padded_sequence(
                embedded, lengths, batch_first=True, enforce_sorted=False
            )
            lstm_out, (hidden, cell) = self.lstm(packed)
            lstm_out, _ = nn.utils.rnn.pad_packed_sequence(lstm_out, batch_first=True)
        else:
            lstm_out, (hidden, cell) = self.lstm(embedded)
        # Final hidden states (bidirectional)
        # hidden: (num_layers * 2, batch, hidden_dim)
        forward_hidden = hidden[-2]   # forward direction of the last layer
        backward_hidden = hidden[-1]  # backward direction of the last layer
        combined = torch.cat([forward_hidden, backward_hidden], dim=1)
        # Classify
        output = self.classifier(self.dropout(combined))
        return output
4.3 GRU
The GRU (Gated Recurrent Unit) has a simpler structure than the LSTM and achieves comparable performance while training faster.
class GRUCell(nn.Module):
    """GRU cell implemented by hand"""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        # Reset gate
        self.reset_gate = nn.Linear(input_size + hidden_size, hidden_size)
        # Update gate
        self.update_gate = nn.Linear(input_size + hidden_size, hidden_size)
        # New gate
        self.new_gate_input = nn.Linear(input_size, hidden_size)
        self.new_gate_hidden = nn.Linear(hidden_size, hidden_size)

    def forward(self, x, h=None):
        batch_size = x.size(0)
        if h is None:
            h = torch.zeros(batch_size, self.hidden_size)
        combined = torch.cat([x, h], dim=1)
        # Reset gate: how much of the previous hidden state to reset
        r = torch.sigmoid(self.reset_gate(combined))
        # Update gate: how much to update
        z = torch.sigmoid(self.update_gate(combined))
        # New candidate state
        n = torch.tanh(
            self.new_gate_input(x) + r * self.new_gate_hidden(h)
        )
        # New hidden state
        new_h = (1 - z) * n + z * h
        return new_h
4.4 Seq2Seq Machine Translation

class Encoder(nn.Module):
    """Seq2Seq encoder"""
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=2, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
                            batch_first=True, dropout=dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        embedded = self.dropout(self.embedding(src))
        outputs, (hidden, cell) = self.lstm(embedded)
        return outputs, hidden, cell

class Decoder(nn.Module):
    """Seq2Seq decoder"""
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=2, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
                            batch_first=True, dropout=dropout)
        self.fc = nn.Linear(hidden_dim, vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, trg, hidden, cell):
        # trg: (batch,) current token
        trg = trg.unsqueeze(1)  # (batch, 1)
        embedded = self.dropout(self.embedding(trg))  # (batch, 1, embed)
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        prediction = self.fc(output.squeeze(1))  # (batch, vocab_size)
        return prediction, hidden, cell

class Seq2Seq(nn.Module):
    """Seq2Seq translation model"""
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        """
        src: (batch, src_len) source sequence
        trg: (batch, trg_len) target sequence
        teacher_forcing_ratio: probability of using teacher forcing
        """
        batch_size = src.shape[0]
        trg_len = trg.shape[1]
        trg_vocab_size = self.decoder.fc.out_features
        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size).to(self.device)
        # Run the encoder
        encoder_outputs, hidden, cell = self.encoder(src)
        # First decoder input: the <SOS> token
        dec_input = trg[:, 0]
        for t in range(1, trg_len):
            output, hidden, cell = self.decoder(dec_input, hidden, cell)
            outputs[:, t] = output
            # Teacher forcing: feed either the ground-truth label or the model's own prediction
            teacher_force = torch.rand(1).item() < teacher_forcing_ratio
            top1 = output.argmax(1)
            dec_input = trg[:, t] if teacher_force else top1
        return outputs
5. The Attention Mechanism
The attention mechanism lets a model dynamically decide which parts of the input sequence to focus on when producing each output.
5.1 Bahdanau Attention
class BahdanauAttention(nn.Module):
    """Bahdanau (additive) attention"""
    def __init__(self, hidden_dim):
        super().__init__()
        self.W_enc = nn.Linear(hidden_dim * 2, hidden_dim)  # bidirectional encoder
        self.W_dec = nn.Linear(hidden_dim, hidden_dim)
        self.v = nn.Linear(hidden_dim, 1)

    def forward(self, decoder_hidden, encoder_outputs):
        """
        decoder_hidden: (batch, hidden) current decoder hidden state
        encoder_outputs: (batch, src_len, hidden*2) all encoder outputs
        """
        src_len = encoder_outputs.shape[1]
        # Repeat the decoder hidden state src_len times
        decoder_hidden = decoder_hidden.unsqueeze(1).repeat(1, src_len, 1)
        # Energy
        energy = torch.tanh(
            self.W_enc(encoder_outputs) + self.W_dec(decoder_hidden)
        )
        # Attention scores
        attention = self.v(energy).squeeze(2)  # (batch, src_len)
        attention_weights = torch.softmax(attention, dim=1)
        # Context vector = weighted sum of encoder outputs
        context = torch.bmm(
            attention_weights.unsqueeze(1),  # (batch, 1, src_len)
            encoder_outputs                  # (batch, src_len, hidden*2)
        ).squeeze(1)                         # (batch, hidden*2)
        return context, attention_weights

class LuongAttention(nn.Module):
    """Luong (multiplicative) attention"""
    def __init__(self, hidden_dim, method='dot'):
        super().__init__()
        self.method = method
        if method == 'general':
            self.W = nn.Linear(hidden_dim, hidden_dim)
        elif method == 'concat':
            self.W = nn.Linear(hidden_dim * 2, hidden_dim)
            self.v = nn.Linear(hidden_dim, 1)

    def forward(self, decoder_hidden, encoder_outputs):
        """
        decoder_hidden: (batch, hidden)
        encoder_outputs: (batch, src_len, hidden)
        """
        if self.method == 'dot':
            # Dot product
            score = torch.bmm(
                encoder_outputs,
                decoder_hidden.unsqueeze(2)
            ).squeeze(2)
        elif self.method == 'general':
            # W * h_encoder
            energy = self.W(encoder_outputs)
            score = torch.bmm(
                energy,
                decoder_hidden.unsqueeze(2)
            ).squeeze(2)
        elif self.method == 'concat':
            decoder_expanded = decoder_hidden.unsqueeze(1).expand_as(encoder_outputs)
            energy = torch.tanh(self.W(torch.cat([decoder_expanded, encoder_outputs], dim=2)))
            score = self.v(energy).squeeze(2)
        attention_weights = torch.softmax(score, dim=1)
        context = torch.bmm(attention_weights.unsqueeze(1), encoder_outputs).squeeze(1)
        return context, attention_weights
5.2 Self-Attention
In self-attention, every position in a sequence attends to every other position in the same sequence. It is the core building block of the Transformer.
class SelfAttention(nn.Module):
    """Multi-head self-attention"""
    def __init__(self, embed_dim, num_heads=8, dropout=0.1):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = self.head_dim ** -0.5
        # Query, Key, Value projections
        self.W_q = nn.Linear(embed_dim, embed_dim)
        self.W_k = nn.Linear(embed_dim, embed_dim)
        self.W_v = nn.Linear(embed_dim, embed_dim)
        self.W_o = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        """
        x: (batch, seq_len, embed_dim)
        mask: (batch, seq_len, seq_len) attention mask
        """
        batch_size, seq_len, _ = x.shape
        # Compute Query, Key, Value
        Q = self.W_q(x)  # (batch, seq, embed)
        K = self.W_k(x)
        V = self.W_v(x)
        # Reshape for multiple heads
        Q = Q.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        # Q, K, V: (batch, num_heads, seq_len, head_dim)
        # Scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) * self.scale
        # scores: (batch, num_heads, seq_len, seq_len)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attention_weights = torch.softmax(scores, dim=-1)
        attention_weights = self.dropout(attention_weights)
        # Apply attention to the values
        attended = torch.matmul(attention_weights, V)
        # attended: (batch, num_heads, seq_len, head_dim)
        # Recombine the heads
        attended = attended.transpose(1, 2).contiguous()
        attended = attended.view(batch_size, seq_len, self.embed_dim)
        # Output projection
        output = self.W_o(attended)
        return output, attention_weights
6. The Transformer Architecture in Depth
Introduced in the 2017 paper "Attention Is All You Need", the Transformer is the architecture that completely changed the NLP paradigm. Using pure attention with no recurrence, it achieved state-of-the-art results on tasks such as translation.
6.1 Positional Encoding
The Transformer has no inherent notion of order, so positional encodings are added to the embeddings to inject position information.
import torch
import torch.nn as nn
import math

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding"""
    def __init__(self, embed_dim, max_len=5000, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        # Compute the PE matrix
        pe = torch.zeros(max_len, embed_dim)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, embed_dim, 2).float() * (-math.log(10000.0) / embed_dim)
        )
        pe[:, 0::2] = torch.sin(position * div_term)  # even indices: sin
        pe[:, 1::2] = torch.cos(position * div_term)  # odd indices: cos
        pe = pe.unsqueeze(0)  # (1, max_len, embed_dim)
        self.register_buffer('pe', pe)

    def forward(self, x):
        """x: (batch, seq_len, embed_dim)"""
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)

class LearnablePositionalEncoding(nn.Module):
    """Learnable positional encoding (BERT style)"""
    def __init__(self, embed_dim, max_len=512):
        super().__init__()
        self.pe = nn.Embedding(max_len, embed_dim)

    def forward(self, x):
        batch_size, seq_len, _ = x.shape
        positions = torch.arange(seq_len, device=x.device).unsqueeze(0)
        return x + self.pe(positions)
6.2 A Complete Transformer Implementation

class MultiHeadAttention(nn.Module):
    """Multi-head attention"""
    def __init__(self, embed_dim, num_heads, dropout=0.1):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.W_q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_k = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_v = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_o = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)

    def split_heads(self, x):
        """(batch, seq, embed) -> (batch, heads, seq, head_dim)"""
        batch, seq, _ = x.shape
        x = x.view(batch, seq, self.num_heads, self.head_dim)
        return x.transpose(1, 2)

    def forward(self, query, key, value, mask=None):
        batch_size = query.shape[0]
        Q = self.split_heads(self.W_q(query))
        K = self.split_heads(self.W_k(key))
        V = self.split_heads(self.W_v(value))
        # Scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) * self.scale
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn_weights = self.dropout(torch.softmax(scores, dim=-1))
        output = torch.matmul(attn_weights, V)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.embed_dim)
        return self.W_o(output), attn_weights

class FeedForward(nn.Module):
    """Position-wise feed-forward network"""
    def __init__(self, embed_dim, ff_dim, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.GELU(),  # the original paper uses ReLU; BERT/GPT mostly use GELU
            nn.Dropout(dropout),
            nn.Linear(ff_dim, embed_dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)
class EncoderLayer(nn.Module):
    """Transformer encoder layer"""
    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.self_attention = MultiHeadAttention(embed_dim, num_heads, dropout)
        self.feed_forward = FeedForward(embed_dim, ff_dim, dropout)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, src_mask=None):
        # Self-attention + Add & Norm (Post-LN, as in the original paper)
        attn_output, _ = self.self_attention(x, x, x, src_mask)
        x = self.norm1(x + self.dropout(attn_output))
        # Feed-forward + Add & Norm
        ff_output = self.feed_forward(x)
        x = self.norm2(x + ff_output)
        return x

class DecoderLayer(nn.Module):
    """Transformer decoder layer"""
    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.self_attention = MultiHeadAttention(embed_dim, num_heads, dropout)
        self.cross_attention = MultiHeadAttention(embed_dim, num_heads, dropout)
        self.feed_forward = FeedForward(embed_dim, ff_dim, dropout)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.norm3 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, memory, src_mask=None, tgt_mask=None):
        # Masked self-attention (attends to itself, with future tokens masked)
        self_attn_out, _ = self.self_attention(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(self_attn_out))
        # Cross-attention (attends to the encoder output)
        cross_attn_out, cross_attn_weights = self.cross_attention(x, memory, memory, src_mask)
        x = self.norm2(x + self.dropout(cross_attn_out))
        # Feed-forward
        ff_out = self.feed_forward(x)
        x = self.norm3(x + ff_out)
        return x, cross_attn_weights
class Transformer(nn.Module):
"""완전한 Transformer 구현"""
def __init__(
self,
src_vocab_size,
tgt_vocab_size,
embed_dim=512,
num_heads=8,
num_encoder_layers=6,
num_decoder_layers=6,
ff_dim=2048,
max_len=5000,
dropout=0.1
):
super().__init__()
# 임베딩 레이어
self.src_embedding = nn.Embedding(src_vocab_size, embed_dim)
self.tgt_embedding = nn.Embedding(tgt_vocab_size, embed_dim)
self.pos_encoding = PositionalEncoding(embed_dim, max_len, dropout)
self.embed_scale = math.sqrt(embed_dim)
# 인코더 스택
self.encoder_layers = nn.ModuleList([
EncoderLayer(embed_dim, num_heads, ff_dim, dropout)
for _ in range(num_encoder_layers)
])
# 디코더 스택
self.decoder_layers = nn.ModuleList([
DecoderLayer(embed_dim, num_heads, ff_dim, dropout)
for _ in range(num_decoder_layers)
])
# 출력 레이어
self.output_norm = nn.LayerNorm(embed_dim)
self.output_projection = nn.Linear(embed_dim, tgt_vocab_size)
# 가중치 초기화
self._init_weights()
def _init_weights(self):
for p in self.parameters():
if p.dim() > 1:
nn.init.xavier_uniform_(p)
def make_causal_mask(self, seq_len, device):
"""자기회귀 마스크 (미래 토큰 마스킹)"""
mask = torch.tril(torch.ones(seq_len, seq_len, device=device)).bool()
return mask.unsqueeze(0).unsqueeze(0) # (1, 1, seq, seq)
def make_pad_mask(self, x, pad_idx=0):
"""패딩 마스크"""
return (x != pad_idx).unsqueeze(1).unsqueeze(2) # (batch, 1, 1, seq)
def encode(self, src, src_mask=None):
x = self.pos_encoding(self.src_embedding(src) * self.embed_scale)
for layer in self.encoder_layers:
x = layer(x, src_mask)
return x
def decode(self, tgt, memory, src_mask=None, tgt_mask=None):
x = self.pos_encoding(self.tgt_embedding(tgt) * self.embed_scale)
for layer in self.decoder_layers:
x, _ = layer(x, memory, src_mask, tgt_mask)
return self.output_norm(x)
def forward(self, src, tgt, src_pad_idx=0, tgt_pad_idx=0):
# Build masks
src_mask = self.make_pad_mask(src, src_pad_idx)
tgt_len = tgt.shape[1]
tgt_pad_mask = self.make_pad_mask(tgt, tgt_pad_idx)
tgt_causal_mask = self.make_causal_mask(tgt_len, tgt.device)
tgt_mask = tgt_pad_mask & tgt_causal_mask
# Encoder
memory = self.encode(src, src_mask)
# Decoder
output = self.decode(tgt, memory, src_mask, tgt_mask)
# Output projection
logits = self.output_projection(output)
return logits
# Instantiate the model and run a quick test
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Transformer(
src_vocab_size=10000,
tgt_vocab_size=10000,
embed_dim=512,
num_heads=8,
num_encoder_layers=6,
num_decoder_layers=6,
ff_dim=2048,
dropout=0.1
).to(device)
# Count trainable parameters
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
# Smoke test
src = torch.randint(1, 10000, (4, 20)).to(device)
tgt = torch.randint(1, 10000, (4, 18)).to(device)
output = model(src, tgt)
print(f"출력 형태: {output.shape}") # (4, 18, 10000)
7. Deep Dive into BERT
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model released by Google in 2018 that revolutionized the NLP field.
7.1 Core Ideas Behind BERT
BERT is pre-trained with two tasks:
- Masked Language Modeling (MLM): mask 15% of the input tokens (most of them replaced with [MASK]) and predict the original words
- Next Sentence Prediction (NSP): predict whether two sentences appear consecutively in the source text
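The MLM objective is easy to illustrate with a small sketch: select a fraction of tokens, replace them, and keep the originals as prediction targets. This is a simplified version (real BERT replaces 80% of selected tokens with [MASK], 10% with random tokens, and leaves 10% unchanged):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Simplified MLM masking: replace ~15% of tokens with [MASK]
    and record the original words as prediction targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # the model must recover this token
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

tokens = "the cat sat on the mat and looked at the dog".split()
masked, targets = mask_tokens(tokens)
print(masked)   # input the model sees
print(targets)  # positions and words it must predict
```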
from transformers import (
BertTokenizer,
BertModel,
BertForSequenceClassification,
BertForTokenClassification,
BertForQuestionAnswering,
get_linear_schedule_with_warmup
)
from torch.optim import AdamW  # AdamW was removed from transformers; use the torch implementation
import torch
from torch.utils.data import Dataset, DataLoader
# Load the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Basic encoding
text = "Natural language processing is transforming AI."
encoding = tokenizer(
text,
add_special_tokens=True, # add [CLS] and [SEP]
max_length=128,
padding='max_length',
truncation=True,
return_tensors='pt'
)
print("입력 ID:", encoding['input_ids'][0][:10])
print("어텐션 마스크:", encoding['attention_mask'][0][:10])
print("디코딩:", tokenizer.decode(encoding['input_ids'][0]))
# WordPiece 토큰화 확인
complex_words = ["unbelievable", "preprocessing", "transformers", "tokenization"]
for word in complex_words:
tokens = tokenizer.tokenize(word)
print(f"{word:20} -> {tokens}")
7.2 Fine-tuning BERT: Text Classification
class SentimentDataset(Dataset):
"""감성 분석 데이터셋"""
def __init__(self, texts, labels, tokenizer, max_length=128):
self.texts = texts
self.labels = labels
self.tokenizer = tokenizer
self.max_length = max_length
def __len__(self):
return len(self.texts)
def __getitem__(self, idx):
encoding = self.tokenizer(
self.texts[idx],
max_length=self.max_length,
padding='max_length',
truncation=True,
return_tensors='pt'
)
return {
'input_ids': encoding['input_ids'].squeeze(),
'attention_mask': encoding['attention_mask'].squeeze(),
'label': torch.tensor(self.labels[idx], dtype=torch.long)
}
def train_bert_classifier(
train_texts,
train_labels,
val_texts,
val_labels,
num_labels=2,
epochs=3,
batch_size=16,
lr=2e-5
):
"""BERT 감성 분석 파인튜닝"""
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# 데이터셋 생성
train_dataset = SentimentDataset(train_texts, train_labels, tokenizer)
val_dataset = SentimentDataset(val_texts, val_labels, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
# Load the model
model = BertForSequenceClassification.from_pretrained(
'bert-base-uncased',
num_labels=num_labels
).to(device)
# Optimizer
optimizer = AdamW(model.parameters(), lr=lr, weight_decay=0.01)
# Learning-rate scheduler
total_steps = len(train_loader) * epochs
scheduler = get_linear_schedule_with_warmup(
optimizer,
num_warmup_steps=total_steps // 10,
num_training_steps=total_steps
)
# Training loop
best_val_acc = 0
for epoch in range(epochs):
model.train()
total_loss = 0
correct = 0
for batch in train_loader:
input_ids = batch['input_ids'].to(device)
attention_mask = batch['attention_mask'].to(device)
labels = batch['label'].to(device)
optimizer.zero_grad()
outputs = model(
input_ids=input_ids,
attention_mask=attention_mask,
labels=labels
)
loss = outputs.loss
logits = outputs.logits
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
scheduler.step()
total_loss += loss.item()
preds = logits.argmax(dim=1)
correct += (preds == labels).sum().item()
train_acc = correct / len(train_dataset)
avg_loss = total_loss / len(train_loader)
# Validation
model.eval()
val_correct = 0
with torch.no_grad():
for batch in val_loader:
input_ids = batch['input_ids'].to(device)
attention_mask = batch['attention_mask'].to(device)
labels = batch['label'].to(device)
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
preds = outputs.logits.argmax(dim=1)
val_correct += (preds == labels).sum().item()
val_acc = val_correct / len(val_dataset)
print(f"Epoch {epoch+1}: Loss={avg_loss:.4f}, Train Acc={train_acc:.4f}, Val Acc={val_acc:.4f}")
if val_acc > best_val_acc:
best_val_acc = val_acc
torch.save(model.state_dict(), 'best_bert_classifier.pt')
return model
# Sentiment analysis example
texts = [
"This movie was absolutely fantastic! I loved every moment.",
"Terrible film. Complete waste of time and money.",
"The acting was decent but the plot was confusing.",
"One of the best movies I've seen this year!",
"I fell asleep halfway through. So boring."
]
labels = [1, 0, 0, 1, 0] # 1: positive, 0: negative
7.3 BERT for Named Entity Recognition (NER)
from transformers import BertForTokenClassification
# NER labels
ner_labels = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
label2id = {label: i for i, label in enumerate(ner_labels)}
id2label = {i: label for i, label in enumerate(ner_labels)}
model = BertForTokenClassification.from_pretrained(
'bert-base-uncased',
num_labels=len(ner_labels),
id2label=id2label,
label2id=label2id
)
def predict_ner(text, model, tokenizer):
"""NER 예측"""
model.eval()
encoding = tokenizer(
text.split(),
is_split_into_words=True,
return_offsets_mapping=True,
padding=True,
truncation=True,
return_tensors='pt'
)
with torch.no_grad():
outputs = model(
input_ids=encoding['input_ids'],
attention_mask=encoding['attention_mask']
)
predictions = outputs.logits.argmax(dim=2).squeeze().tolist()
word_ids = encoding.word_ids()
# Collapse subword tokens back to words
result = []
prev_word_id = None
for pred, word_id in zip(predictions, word_ids):
if word_id is None or word_id == prev_word_id:
continue
word = text.split()[word_id]
label = id2label[pred]
result.append((word, label))
prev_word_id = word_id
return result
# Usage example (word_ids() and offset mappings require a fast tokenizer)
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
text = "Barack Obama was born in Hawaii and served as the 44th President of the United States."
# NER output (with a trained model)
# print(predict_ner(text, model, tokenizer))
8. The GPT Family of Models
GPT (Generative Pre-trained Transformer) is a series of autoregressive language models developed by OpenAI.
8.1 Evolution of GPT
| Model | Year | Parameters | Highlights |
|---|---|---|---|
| GPT-1 | 2018 | 117M | First GPT; unsupervised pre-training |
| GPT-2 | 2019 | 1.5B | Zero-shot task transfer; initially withheld as "too dangerous to release" |
| GPT-3 | 2020 | 175B | In-context learning, few-shot prompting |
| InstructGPT | 2022 | 1.3B | RLHF, instruction following |
| GPT-4 | 2023 | Undisclosed | Multimodal, stronger reasoning |
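The sampling knobs used throughout this section (temperature, top-k, top-p) all reshape the next-token distribution before sampling. A small sketch with toy logits shows the effect of temperature: dividing logits by T < 1 sharpens the distribution toward the top token, while T > 1 flattens it:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; temperature rescales the logits
    before the softmax (lower T -> sharper, higher T -> flatter)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # toy next-token logits
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

Top-k and top-p then restrict sampling to the k most likely tokens, or to the smallest set whose cumulative probability exceeds p, before renormalizing.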
8.2 Text Generation with GPT-2
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
# Load GPT-2
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
tokenizer.pad_token = tokenizer.eos_token
def generate_text(
prompt,
model,
tokenizer,
max_length=200,
temperature=0.9,
top_p=0.95,
top_k=50,
num_return_sequences=1,
do_sample=True
):
"""GPT-2 텍스트 생성"""
inputs = tokenizer(prompt, return_tensors='pt')
input_ids = inputs['input_ids']
with torch.no_grad():
output = model.generate(
input_ids,
max_length=max_length,
temperature=temperature, # higher values give more diverse output
top_p=top_p, # nucleus sampling
top_k=top_k, # top-k sampling
num_return_sequences=num_return_sequences,
do_sample=do_sample,
pad_token_id=tokenizer.eos_token_id,
repetition_penalty=1.2 # discourage repetition
)
generated_texts = []
for out in output:
text = tokenizer.decode(out, skip_special_tokens=True)
generated_texts.append(text)
return generated_texts
# Text generation example
prompt = "Artificial intelligence is transforming the world by"
generated = generate_text(prompt, model, tokenizer, max_length=150, temperature=0.8)
for i, text in enumerate(generated):
print(f"\n생성된 텍스트 {i+1}:")
print(text)
print("-" * 50)
# Compare greedy decoding with sampling
def compare_generation_strategies(prompt, model, tokenizer):
"""다양한 생성 전략 비교"""
input_ids = tokenizer.encode(prompt, return_tensors='pt')
strategies = {
"Greedy Search": dict(do_sample=False),
"Beam Search": dict(do_sample=False, num_beams=5, early_stopping=True),
"Temperature Sampling": dict(do_sample=True, temperature=0.7),
"Top-k Sampling": dict(do_sample=True, top_k=50),
"Top-p (Nucleus) Sampling": dict(do_sample=True, top_p=0.92),
}
print(f"프롬프트: {prompt}\n")
with torch.no_grad():
for name, params in strategies.items():
output = model.generate(
input_ids,
max_new_tokens=50,
pad_token_id=tokenizer.eos_token_id,
**params
)
text = tokenizer.decode(output[0], skip_special_tokens=True)
print(f"[{name}]")
print(text[len(prompt):].strip())
print()
compare_generation_strategies(
"The future of artificial intelligence",
model,
tokenizer
)
8.3 Prompt Engineering
def zero_shot_classification(text, categories, model_name="gpt2"):
"""Zero-shot 분류"""
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()
# Score each candidate category by its log-probability
scores = {}
for category in categories:
prompt = f"Text: {text}\nCategory: {category}"
inputs = tokenizer(prompt, return_tensors='pt')
with torch.no_grad():
outputs = model(**inputs, labels=inputs['input_ids'])
# Lower perplexity = more plausible category
scores[category] = -outputs.loss.item()
best_category = max(scores, key=scores.get)
return best_category, scores
# Few-shot learning example
few_shot_prompt = """
Classify the sentiment of the following texts:
Text: "This movie was amazing!"
Sentiment: Positive
Text: "I hated every minute of it."
Sentiment: Negative
Text: "The film was okay, nothing special."
Sentiment: Neutral
Text: "Absolutely brilliant performance!"
Sentiment:"""
print("Few-shot 프롬프트:")
print(few_shot_prompt)
# Chain-of-Thought prompting
cot_prompt = """
Q: A train travels 120 miles in 2 hours. How long will it take to travel 300 miles?
A: Let me think step by step.
1. First, find the speed: 120 miles / 2 hours = 60 miles per hour
2. Then, calculate the time for 300 miles: 300 miles / 60 mph = 5 hours
Therefore, it will take 5 hours.
Q: If a store sells 3 apples for $2.40, how much do 7 apples cost?
A: Let me think step by step.
"""
9. Modern NLP Techniques
9.1 LoRA/QLoRA Fine-tuning
LoRA (Low-Rank Adaptation) is a technique for fine-tuning large language models efficiently: the original weights are frozen, and only a pair of low-rank matrices is trained. QLoRA combines LoRA with a 4-bit quantized frozen base model to reduce memory usage even further.
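The idea can be stated numerically: the frozen weight W is augmented with a low-rank update ΔW = (α/r)·B·A, where B and A are the only matrices that receive gradients. A small numpy sketch (hypothetical shapes, not tied to any particular model) shows how few parameters the update needs:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 512, 8, 32              # hidden size, LoRA rank, scaling factor

W = rng.normal(size=(d, d))           # frozen pre-trained weight (not trained)
A = rng.normal(size=(r, d)) * 0.01    # trained low-rank factor
B = np.zeros((d, r))                  # trained; zero-init so the update starts at 0

delta_W = (alpha / r) * (B @ A)       # low-rank update added to the frozen weight
W_adapted = W + delta_W

full = W.size                         # parameters a full fine-tune would train
lora = A.size + B.size                # parameters LoRA actually trains
print(f"full fine-tune params: {full:,}")
print(f"LoRA params: {lora:,} ({100 * lora / full:.2f}%)")
```

Because B starts at zero, the adapted model is exactly the base model at step 0, and training only ever moves it through a rank-r subspace of weight updates.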
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
def setup_lora_model(model_name="gpt2", r=8, lora_alpha=32, lora_dropout=0.1):
"""LoRA 모델 설정"""
# 기본 모델 로드
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# LoRA configuration
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=r, # LoRA rank
lora_alpha=lora_alpha, # scaling parameter
target_modules=["c_attn"], # layers to adapt
lora_dropout=lora_dropout,
bias="none"
)
# Wrap the model with LoRA
model = get_peft_model(model, lora_config)
# Report the fraction of trainable parameters
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"훈련 가능한 파라미터: {trainable:,} ({100 * trainable / total:.2f}%)")
return model, tokenizer
# LoRA training example
class InstructionDataset(torch.utils.data.Dataset):
"""지시문 형식 데이터셋"""
def __init__(self, data, tokenizer, max_length=512):
self.tokenizer = tokenizer
self.max_length = max_length
# Convert to instruction format
self.texts = []
for item in data:
prompt = f"### Instruction:\n{item['instruction']}\n\n### Response:\n{item['response']}"
self.texts.append(prompt)
def __len__(self):
return len(self.texts)
def __getitem__(self, idx):
encoding = self.tokenizer(
self.texts[idx],
max_length=self.max_length,
padding='max_length',
truncation=True,
return_tensors='pt'
)
input_ids = encoding['input_ids'].squeeze()
return {
'input_ids': input_ids,
'attention_mask': encoding['attention_mask'].squeeze(),
'labels': input_ids.clone() # language modeling: labels = inputs
}
9.2 RAG (Retrieval-Augmented Generation)
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import torch
class SimpleRAG:
"""간단한 RAG 시스템 구현"""
def __init__(self, embedding_model="sentence-transformers/all-MiniLM-L6-v2"):
self.tokenizer = AutoTokenizer.from_pretrained(embedding_model)
self.model = AutoModel.from_pretrained(embedding_model)
self.knowledge_base = []
self.embeddings = []
def mean_pooling(self, model_output, attention_mask):
"""평균 풀링으로 문장 임베딩 계산"""
token_embeddings = model_output[0]
input_mask_expanded = attention_mask.unsqueeze(-1).expand(
token_embeddings.size()
).float()
return torch.sum(token_embeddings * input_mask_expanded, 1) / \
torch.clamp(input_mask_expanded.sum(1), min=1e-9)
def encode(self, texts):
"""텍스트를 임베딩 벡터로 변환"""
encoded = self.tokenizer(
texts,
padding=True,
truncation=True,
max_length=512,
return_tensors='pt'
)
with torch.no_grad():
model_output = self.model(**encoded)
embeddings = self.mean_pooling(model_output, encoded['attention_mask'])
# L2 normalization
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
return embeddings.numpy()
def add_documents(self, documents):
"""지식베이스에 문서 추가"""
self.knowledge_base.extend(documents)
new_embeddings = self.encode(documents)
if len(self.embeddings) == 0:
self.embeddings = new_embeddings
else:
self.embeddings = np.vstack([self.embeddings, new_embeddings])
print(f"{len(documents)}개의 문서가 추가되었습니다. 총 {len(self.knowledge_base)}개.")
def retrieve(self, query, top_k=3):
"""쿼리와 가장 유사한 문서 검색"""
query_embedding = self.encode([query])
similarities = cosine_similarity(query_embedding, self.embeddings)[0]
top_indices = np.argsort(similarities)[::-1][:top_k]
results = []
for idx in top_indices:
results.append({
'document': self.knowledge_base[idx],
'similarity': similarities[idx]
})
return results
def answer(self, query, generator_model, generator_tokenizer, top_k=3):
"""RAG: 검색 + 생성"""
# 관련 문서 검색
relevant_docs = self.retrieve(query, top_k=top_k)
# 컨텍스트 구성
context = "\n".join([doc['document'] for doc in relevant_docs])
# 프롬프트 구성
prompt = f"""Based on the following context, answer the question.
Context:
{context}
Question: {query}
Answer:"""
# Generate the answer
inputs = generator_tokenizer(prompt, return_tensors='pt', max_length=512, truncation=True)
with torch.no_grad():
outputs = generator_model.generate(
inputs['input_ids'],
max_new_tokens=150,
temperature=0.7,
do_sample=True,
pad_token_id=generator_tokenizer.eos_token_id
)
answer = generator_tokenizer.decode(outputs[0], skip_special_tokens=True)
answer = answer[len(prompt):].strip()
return answer, relevant_docs
# RAG usage example
rag = SimpleRAG()
documents = [
"BERT (Bidirectional Encoder Representations from Transformers) was published by Google in 2018.",
"GPT-3 has 175 billion parameters and was developed by OpenAI in 2020.",
"The Transformer architecture was introduced in the paper 'Attention is All You Need' in 2017.",
"RLHF (Reinforcement Learning from Human Feedback) is used to align language models with human values.",
"LoRA allows efficient fine-tuning by adding low-rank matrices to pre-trained model weights.",
"RAG combines information retrieval with text generation for more accurate responses.",
]
rag.add_documents(documents)
query = "What is BERT and when was it created?"
results = rag.retrieve(query, top_k=2)
print(f"\n쿼리: {query}")
print("\n검색된 문서:")
for r in results:
print(f" [{r['similarity']:.3f}] {r['document']}")
9.3 RLHF (Reinforcement Learning from Human Feedback)
# Conceptual implementation of the core RLHF components
class RewardModel(nn.Module):
"""Reward model - scores the quality of a response"""
def __init__(self, base_model_name="gpt2"):
super().__init__()
from transformers import GPT2Model
self.transformer = GPT2Model.from_pretrained(base_model_name)
hidden_size = self.transformer.config.hidden_size
self.reward_head = nn.Linear(hidden_size, 1)
def forward(self, input_ids, attention_mask=None):
outputs = self.transformer(input_ids=input_ids, attention_mask=attention_mask)
# Use the representation of the last token
last_hidden = outputs.last_hidden_state[:, -1, :]
reward = self.reward_head(last_hidden)
return reward.squeeze(-1)
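A reward model like the one above is typically trained on human preference pairs with a Bradley-Terry-style loss: the chosen response should score higher than the rejected one. A framework-free sketch of that pairwise loss (a simplification of how RLHF reward models are actually trained):

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
    Small when the chosen response outscores the rejected one,
    large when the ranking is inverted."""
    diff = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

print(preference_loss(2.0, 0.5))  # correct ranking -> small loss
print(preference_loss(0.5, 2.0))  # inverted ranking -> large loss
```

Minimizing this loss over many labeled pairs teaches the reward head to assign higher scalar rewards to responses humans prefer.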
class PPOTrainer:
"""PPO 기반 RLHF 훈련 (개념적 구현)"""
def __init__(self, policy_model, reward_model, ref_model, tokenizer):
self.policy = policy_model
self.reward_model = reward_model
self.ref_model = ref_model # reference model for the KL penalty
self.tokenizer = tokenizer
def compute_kl_penalty(self, policy_logprobs, ref_logprobs, kl_coeff=0.1):
"""KL divergence 패널티 계산 - 정책이 참조 모델에서 너무 멀어지는 것을 방지"""
kl = policy_logprobs - ref_logprobs
return kl_coeff * kl.mean()
def compute_advantages(self, rewards, values, gamma=0.99, lam=0.95):
"""GAE (Generalized Advantage Estimation)"""
advantages = []
last_advantage = 0
for t in reversed(range(len(rewards))):
if t == len(rewards) - 1:
next_value = 0
else:
next_value = values[t + 1]
delta = rewards[t] + gamma * next_value - values[t]
advantage = delta + gamma * lam * last_advantage
advantages.insert(0, advantage)
last_advantage = advantage
return torch.tensor(advantages)
10. Real-World NLP Projects
10.1 Sentiment Analysis System (Complete Pipeline)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup
import matplotlib.pyplot as plt
import seaborn as sns
class SentimentAnalysisPipeline:
"""완전한 감성 분석 파이프라인"""
def __init__(self, model_name="distilbert-base-uncased-finetuned-sst-2-english"):
self.model_name = model_name
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.model.to(self.device)
self.model.eval()
self.label_map = {0: 'Negative', 1: 'Positive'}
def predict(self, texts, batch_size=32):
"""배치 예측"""
if isinstance(texts, str):
texts = [texts]
all_predictions = []
all_probabilities = []
for i in range(0, len(texts), batch_size):
batch = texts[i:i + batch_size]
encoding = self.tokenizer(
batch,
padding=True,
truncation=True,
max_length=512,
return_tensors='pt'
)
input_ids = encoding['input_ids'].to(self.device)
attention_mask = encoding['attention_mask'].to(self.device)
with torch.no_grad():
outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
probs = torch.softmax(outputs.logits, dim=1)
preds = probs.argmax(dim=1)
all_predictions.extend(preds.cpu().numpy())
all_probabilities.extend(probs.cpu().numpy())
results = []
for pred, prob in zip(all_predictions, all_probabilities):
results.append({
'label': self.label_map[pred],
'score': float(prob[pred]),
'probabilities': {self.label_map[i]: float(p) for i, p in enumerate(prob)}
})
return results if len(results) > 1 else results[0]
def analyze_batch(self, texts):
"""배치 감성 분석 및 통계"""
results = self.predict(texts)
positive_count = sum(1 for r in results if r['label'] == 'Positive')
negative_count = len(results) - positive_count
avg_confidence = np.mean([r['score'] for r in results])
print(f"\n=== 감성 분석 결과 ===")
print(f"총 텍스트: {len(texts)}")
print(f"긍정: {positive_count} ({100*positive_count/len(texts):.1f}%)")
print(f"부정: {negative_count} ({100*negative_count/len(texts):.1f}%)")
print(f"평균 신뢰도: {avg_confidence:.3f}")
return results
# Usage example
pipeline = SentimentAnalysisPipeline()
reviews = [
"This product exceeded all my expectations! Absolutely love it.",
"Terrible quality, broke after one day. Complete waste of money.",
"It's okay, does what it's supposed to do.",
"Best purchase I've made this year!",
"Would not recommend to anyone.",
]
print("=== 개별 예측 ===")
for review in reviews:
result = pipeline.predict(review)
print(f"텍스트: {review[:50]}...")
print(f" -> {result['label']} ({result['score']:.3f})\n")
print("\n=== 배치 분석 ===")
pipeline.analyze_batch(reviews)
10.2 Text Summarization System
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
class TextSummarizer:
"""텍스트 요약 시스템"""
def __init__(self, model_name="facebook/bart-large-cnn"):
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.model.to(self.device)
def summarize(
self,
text,
max_length=130,
min_length=30,
num_beams=4,
length_penalty=2.0,
early_stopping=True
):
"""단일 텍스트 요약"""
inputs = self.tokenizer(
text,
max_length=1024,
truncation=True,
return_tensors='pt'
).to(self.device)
with torch.no_grad():
summary_ids = self.model.generate(
inputs['input_ids'],
max_length=max_length,
min_length=min_length,
num_beams=num_beams,
length_penalty=length_penalty,
early_stopping=early_stopping
)
summary = self.tokenizer.decode(summary_ids[0], skip_special_tokens=True)
# Compute the compression rate
original_words = len(text.split())
summary_words = len(summary.split())
compression = (1 - summary_words / original_words) * 100
return {
'summary': summary,
'original_length': original_words,
'summary_length': summary_words,
'compression_rate': f"{compression:.1f}%"
}
def extractive_summarize(self, text, num_sentences=3):
"""추출적 요약 (TF-IDF 기반)"""
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
nltk.download('punkt', quiet=True)
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text)
if len(sentences) <= num_sentences:
return text
# TF-IDF matrix
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(sentences)
# Sentence importance = average similarity to all other sentences
similarity_matrix = cosine_similarity(tfidf_matrix)
scores = similarity_matrix.mean(axis=1)
# Pick the highest-scoring sentences (preserving original order)
top_indices = sorted(
np.argsort(scores)[-num_sentences:].tolist()
)
summary = ' '.join([sentences[i] for i in top_indices])
return summary
# Usage example
summarizer = TextSummarizer()
long_text = """
Artificial intelligence has made remarkable strides in recent years, particularly in
natural language processing. The development of transformer-based models like BERT and
GPT has revolutionized how machines understand and generate human language. These models,
trained on massive datasets, can perform a wide range of tasks including translation,
summarization, question answering, and sentiment analysis.
The key innovation behind these models is the attention mechanism, which allows the model
to focus on relevant parts of the input when generating each word of the output. This has
enabled much more nuanced understanding of context and semantics compared to earlier
approaches like recurrent neural networks.
However, these advances come with challenges. Training large language models requires
enormous computational resources and energy. There are also concerns about bias in the
training data leading to biased outputs, and the potential for misuse in generating
misinformation. Researchers are actively working on making these models more efficient,
fair, and reliable.
The future of NLP looks promising, with models becoming increasingly capable of
understanding and generating language that is indistinguishable from human writing.
Applications range from customer service chatbots to scientific research assistance,
and the technology continues to evolve rapidly.
"""
result = summarizer.summarize(long_text)
print("=== 추상적 요약 ===")
print(f"원본 길이: {result['original_length']} 단어")
print(f"요약 길이: {result['summary_length']} 단어")
print(f"압축률: {result['compression_rate']}")
print(f"\n요약:\n{result['summary']}")
print("\n=== 추출적 요약 ===")
extractive = summarizer.extractive_summarize(long_text, num_sentences=2)
print(extractive)
10.3 Question Answering System
from transformers import pipeline, AutoTokenizer, AutoModelForQuestionAnswering
class QASystem:
"""질의응답 시스템"""
def __init__(self, model_name="deepset/roberta-base-squad2"):
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForQuestionAnswering.from_pretrained(model_name)
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.model.to(self.device)
self.model.eval()
def answer(self, question, context, max_answer_len=100):
"""질문에 대한 답변 추출"""
# 토큰화
encoding = self.tokenizer(
question,
context,
max_length=512,
truncation=True,
stride=128,
return_overflowing_tokens=True,
return_offsets_mapping=True,
padding='max_length',
return_tensors='pt'
)
offset_mapping = encoding.pop('offset_mapping').cpu()
sample_map = encoding.pop('overflow_to_sample_mapping').cpu()
encoding = {k: v.to(self.device) for k, v in encoding.items()}
with torch.no_grad():
outputs = self.model(**encoding)
start_logits = outputs.start_logits
end_logits = outputs.end_logits
# Find the best answer span
answers = []
for i in range(len(start_logits)):
start_log = start_logits[i].cpu().numpy()
end_log = end_logits[i].cpu().numpy()
offsets = offset_mapping[i].numpy()
# Best start/end positions
start_idx = np.argmax(start_log)
end_idx = np.argmax(end_log)
if start_idx <= end_idx:
start_char = offsets[start_idx][0]
end_char = offsets[end_idx][1]
answer_text = context[start_char:end_char]
score = float(start_log[start_idx] + end_log[end_idx])
if answer_text:
answers.append({
'answer': answer_text,
'score': score,
'start': int(start_char),
'end': int(end_char)
})
if not answers:
return {'answer': "No answer found.", 'score': 0.0}
return max(answers, key=lambda x: x['score'])
def multi_question_answer(self, questions, context):
"""여러 질문에 대한 답변"""
results = []
for question in questions:
answer = self.answer(question, context)
results.append({
'question': question,
'answer': answer['answer'],
'confidence': answer['score']
})
return results
# Usage example
qa_system = QASystem()
context = """
The Transformer model was introduced in a 2017 paper titled "Attention Is All You Need"
by researchers at Google Brain. The model architecture relies entirely on attention
mechanisms, dispensing with recurrence and convolutions. BERT, which stands for
Bidirectional Encoder Representations from Transformers, was introduced in 2018 by
Google AI Language team. BERT achieved state-of-the-art results on eleven NLP tasks.
GPT-3, developed by OpenAI, was released in 2020 and has 175 billion parameters, making
it one of the largest language models at the time of its release.
"""
questions = [
"When was the Transformer model introduced?",
"What does BERT stand for?",
"How many parameters does GPT-3 have?",
"Who developed BERT?",
]
print("=== 질의응답 시스템 ===\n")
results = qa_system.multi_question_answer(questions, context)
for r in results:
print(f"Q: {r['question']}")
print(f"A: {r['answer']}")
print(f"신뢰도: {r['confidence']:.2f}")
print()
Wrapping Up: An NLP Learning Roadmap
This guide has walked through the full NLP journey. To recap each stage:
- Build the foundations: text preprocessing, BoW, TF-IDF
- Understand embeddings: Word2Vec, GloVe, FastText
- Sequence models: RNN, LSTM, GRU, Seq2Seq
- The attention revolution: attention mechanisms, self-attention
- Master the Transformer: a complete understanding of the architecture
- Pre-trained models: working with BERT and GPT
- Modern techniques: LoRA, RAG, RLHF
Recommended Learning Resources
- The official HuggingFace documentation
- The "Attention Is All You Need" paper
- The BERT paper
- The PyTorch NLP tutorials
- Stanford CS224N - NLP with Deep Learning
NLP is a fast-moving field, so continuous learning and hands-on practice are essential. Run the code yourself and experiment to build a deep understanding!
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
corpus = [
"I love natural language processing",
"Natural language processing is amazing",
"I love machine learning too",
"Deep learning is part of machine learning"
]
# BoW vectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print("Vocabulary:", vectorizer.get_feature_names_out())
print("\nBoW Matrix:")
print(X.toarray())
print("\nShape:", X.shape) # (4 documents, n words)
# Inspect one document
doc_0 = dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0]))
print("\nDocument 0 word frequencies:")
for word, count in sorted(doc_0.items()):
if count > 0:
print(f" {word}: {count}")
2.2 TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) improves on BoW by considering not just how often a word appears in a document, but also how rare it is across all documents.
TF (Term Frequency): How often a word appears in a given document. IDF (Inverse Document Frequency): The inverse of how many documents contain the word — rare words get higher scores.
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
corpus = [
"the cat sat on the mat",
"the cat sat on the hat",
"the dog sat on the log",
"the cat wore the hat",
]
# TF-IDF vectorizer
tfidf = TfidfVectorizer(smooth_idf=True, norm='l2')
X = tfidf.fit_transform(corpus)
# Display as DataFrame
df = pd.DataFrame(
X.toarray(),
columns=tfidf.get_feature_names_out(),
index=[f"Doc {i}" for i in range(len(corpus))]
)
print(df.round(3))
# Manual TF-IDF implementation (sklearn-style smoothed IDF; values differ from the DataFrame above because l2 normalization is omitted)
def compute_tfidf(corpus):
from math import log
vocab = set()
for doc in corpus:
vocab.update(doc.split())
vocab = sorted(vocab)
def tf(word, doc):
words = doc.split()
return words.count(word) / len(words)
def idf(word, corpus):
n_docs_with_word = sum(1 for doc in corpus if word in doc.split())
return log((1 + len(corpus)) / (1 + n_docs_with_word)) + 1
tfidf_matrix = []
for doc in corpus:
row = [tf(word, doc) * idf(word, corpus) for word in vocab]
tfidf_matrix.append(row)
return tfidf_matrix, vocab
matrix, vocab = compute_tfidf(corpus)
print("\nManually computed TF-IDF:")
df_manual = pd.DataFrame(matrix, columns=vocab)
print(df_manual.round(3))
2.3 N-gram
An N-gram is a contiguous sequence of N items (words or characters) from a given text. N-grams allow a model to capture some local word-order information.
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
from collections import Counter
text = "I love natural language processing and machine learning"
tokens = word_tokenize(text.lower())
unigrams = list(ngrams(tokens, 1))
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))
print("Unigrams:", unigrams[:5])
print("Bigrams:", bigrams[:5])
print("Trigrams:", trigrams[:5])
def get_ngram_freq(text, n):
tokens = word_tokenize(text.lower())
n_grams = list(ngrams(tokens, n))
return Counter(n_grams)
large_corpus = """
Machine learning is a subset of artificial intelligence.
Artificial intelligence is transforming many industries.
Natural language processing is a part of machine learning.
Deep learning has revolutionized natural language processing.
"""
bigram_freq = get_ngram_freq(large_corpus, 2)
print("\nMost frequent bigrams:")
for bigram, count in bigram_freq.most_common(10):
print(f" {' '.join(bigram)}: {count}")
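These counts also support the simplest statistical language model: estimating P(word | previous word) directly from bigram frequencies. A minimal sketch on a toy corpus:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate the food".split()

# Count bigrams and the contexts (all tokens except the last)
bigram_counts = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

print(bigram_prob("the", "cat"))  # "the" is followed by "cat" in 2 of its 4 occurrences
print(bigram_prob("cat", "sat"))
```

Real n-gram language models add smoothing (e.g., Laplace or Kneser-Ney) so that unseen bigrams do not get zero probability.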
3. Word Embeddings
Word embeddings represent words as dense vectors such that semantically similar words are close together in vector space.
3.1 Word2Vec
Word2Vec, published by Tomas Mikolov and colleagues at Google in 2013, is a groundbreaking word embedding model. It comes in two variants: CBOW (Continuous Bag of Words) and Skip-gram.
CBOW: Predicts the center word from surrounding context words. Skip-gram: Predicts surrounding context words from the center word.
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import Counter
class Word2VecSkipGram(nn.Module):
"""Skip-gram Word2Vec implementation"""
def __init__(self, vocab_size, embedding_dim):
super().__init__()
self.center_embedding = nn.Embedding(vocab_size, embedding_dim)
self.context_embedding = nn.Embedding(vocab_size, embedding_dim)
self.center_embedding.weight.data.uniform_(-0.5/embedding_dim, 0.5/embedding_dim)
self.context_embedding.weight.data.uniform_(-0.5/embedding_dim, 0.5/embedding_dim)
def forward(self, center, context, negative):
"""
center: (batch_size,) center word indices
context: (batch_size,) actual context word indices
negative: (batch_size, neg_samples) negative sample indices
"""
center_emb = self.center_embedding(center) # (batch, dim)
# Positive sample score
context_emb = self.context_embedding(context) # (batch, dim)
pos_score = torch.sum(center_emb * context_emb, dim=1) # (batch,)
pos_loss = -torch.log(torch.sigmoid(pos_score) + 1e-10)
# Negative sample score
neg_emb = self.context_embedding(negative) # (batch, neg_samples, dim)
center_emb_expanded = center_emb.unsqueeze(1) # (batch, 1, dim)
neg_score = torch.bmm(neg_emb, center_emb_expanded.transpose(1, 2)).squeeze(2)
neg_loss = -torch.sum(torch.log(torch.sigmoid(-neg_score) + 1e-10), dim=1)
return (pos_loss + neg_loss).mean()
class Word2VecCBOW(nn.Module):
"""CBOW Word2Vec implementation"""
def __init__(self, vocab_size, embedding_dim):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embedding_dim)
self.linear = nn.Linear(embedding_dim, vocab_size)
def forward(self, context):
"""
context: (batch_size, context_window_size) context word indices
"""
embedded = self.embedding(context) # (batch, window, dim)
mean_embedded = embedded.mean(dim=1) # (batch, dim)
output = self.linear(mean_embedded) # (batch, vocab_size)
return output
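Both variants are trained on (center, context) pairs collected with a sliding window over the corpus. A minimal sketch of that data preparation (the helper name is illustrative):

```python
def make_skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs within the given window size."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["i", "love", "natural", "language", "processing"]
pairs = make_skipgram_pairs(tokens, window=2)
print(pairs[:4])
```

For CBOW the same pairs are simply grouped the other way around: all context words in the window predict the center word.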
# Practical usage with Gensim
from gensim.models import Word2Vec
sentences = [
"I love natural language processing".split(),
"natural language processing is a field of AI".split(),
"machine learning is part of artificial intelligence".split(),
"deep learning models process natural language".split(),
"word embeddings represent words as vectors".split(),
"Word2Vec learns word representations".split(),
"semantic similarity between words is captured by embeddings".split(),
]
model = Word2Vec(
sentences=sentences,
vector_size=100, # embedding dimension
window=5, # context window size
min_count=1, # minimum word frequency
workers=4,
epochs=100,
sg=1, # 1=Skip-gram, 0=CBOW
negative=5 # negative sampling
)
print("'language' vector (first 5 dims):", model.wv['language'][:5])
similar_words = model.wv.most_similar('language', topn=5)
print("\nWords similar to 'language':")
for word, similarity in similar_words:
print(f" {word}: {similarity:.4f}")
result = model.wv.most_similar(
positive=['artificial', 'language'],
negative=['natural'],
topn=3
)
print("\nAnalogy result:")
for word, sim in result:
print(f" {word}: {sim:.4f}")
print("\nSimilarity between 'language' and 'processing':", model.wv.similarity('language', 'processing'))
3.2 Embedding Visualization with t-SNE
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np
def visualize_embeddings(model, words=None):
"""Visualize Word2Vec embeddings with t-SNE"""
if words is None:
words = list(model.wv.key_to_index.keys())[:50]
vectors = np.array([model.wv[word] for word in words])
tsne = TSNE(
n_components=2,
random_state=42,
perplexity=min(30, len(words)-1),
)
vectors_2d = tsne.fit_transform(vectors)
fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(vectors_2d[:, 0], vectors_2d[:, 1], alpha=0.7)
for i, word in enumerate(words):
ax.annotate(word, (vectors_2d[i, 0], vectors_2d[i, 1]),
fontsize=9, ha='center', va='bottom')
ax.set_title("Word2Vec Embeddings t-SNE Visualization")
plt.tight_layout()
plt.savefig('word2vec_tsne.png', dpi=150, bbox_inches='tight')
plt.show()
visualize_embeddings(model)
3.3 GloVe
GloVe (Global Vectors for Word Representation), developed at Stanford, exploits global co-occurrence statistics from the entire corpus.
# Using pre-trained GloVe vectors
# pip install torchtext
from torchtext.vocab import GloVe
import torch
glove = GloVe(name='6B', dim=100)
vector = glove['computer']
print("'computer' GloVe vector (first 5 dims):", vector[:5].numpy())
def cosine_similarity(v1, v2):
return torch.nn.functional.cosine_similarity(
v1.unsqueeze(0), v2.unsqueeze(0)
).item()
words_to_compare = [('king', 'queen'), ('man', 'woman'), ('Paris', 'France')]
for w1, w2 in words_to_compare:
if w1 in glove.stoi and w2 in glove.stoi:
sim = cosine_similarity(glove[w1], glove[w2])
print(f"sim('{w1}', '{w2}') = {sim:.4f}")
4. NLP with Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are designed to process sequential data. They carry information from previous time steps to the current step, making them well-suited for text.
4.1 Basic RNN Architecture
import torch
import torch.nn as nn
class SimpleRNN(nn.Module):
"""Basic RNN implementation"""
def __init__(self, input_size, hidden_size, output_size):
super().__init__()
self.hidden_size = hidden_size
self.input_to_hidden = nn.Linear(input_size + hidden_size, hidden_size)
self.hidden_to_output = nn.Linear(hidden_size, output_size)
def forward(self, x, hidden=None):
"""
x: (seq_len, batch, input_size)
hidden: (batch, hidden_size) initial hidden state
"""
batch_size = x.size(1)
if hidden is None:
hidden = torch.zeros(batch_size, self.hidden_size)
outputs = []
for t in range(x.size(0)):
combined = torch.cat([x[t], hidden], dim=1)
hidden = torch.tanh(self.input_to_hidden(combined))
output = self.hidden_to_output(hidden)
outputs.append(output)
return torch.stack(outputs, dim=0), hidden
# Using PyTorch built-in RNN
rnn = nn.RNN(
input_size=50,
hidden_size=128,
num_layers=2,
batch_first=True,
dropout=0.3,
bidirectional=True
)
x = torch.randn(32, 20, 50) # (batch, seq_len, features)
output, hidden = rnn(x)
print("Output shape:", output.shape) # (32, 20, 256) - bidirectional: 128*2
print("Hidden shape:", hidden.shape) # (4, 32, 128) - 2 layers * 2 directions
4.2 LSTM — Gate Mechanisms Explained
LSTM (Long Short-Term Memory) was designed to solve the vanishing gradient problem of vanilla RNNs. Its gate mechanism allows it to capture long-range dependencies.
class LSTMCell(nn.Module):
"""LSTM cell — manual implementation for understanding gate mechanics"""
def __init__(self, input_size, hidden_size):
super().__init__()
self.hidden_size = hidden_size
# Compute all 4 gates at once for efficiency
# forget gate, input gate, cell gate, output gate
self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)
def forward(self, x, state=None):
"""
x: (batch, input_size)
state: (h, c) previous hidden and cell states
"""
batch_size = x.size(0)
if state is None:
h = torch.zeros(batch_size, self.hidden_size)
c = torch.zeros(batch_size, self.hidden_size)
else:
h, c = state
combined = torch.cat([x, h], dim=1)
gates = self.gates(combined)
f_gate, i_gate, g_gate, o_gate = gates.chunk(4, dim=1)
f = torch.sigmoid(f_gate) # forget gate: how much to forget
i = torch.sigmoid(i_gate) # input gate: how much new info to store
g = torch.tanh(g_gate) # cell gate: new candidate information
o = torch.sigmoid(o_gate) # output gate: how much to output
new_c = f * c + i * g # forget some, add new
new_h = o * torch.tanh(new_c)
return new_h, (new_h, new_c)
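A quick way to see the recurrence in action is to unroll PyTorch's built-in nn.LSTMCell step by step, which mirrors the manual cell above (shapes here are illustrative):

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=10, hidden_size=20)
x = torch.randn(5, 3, 10)  # (seq_len, batch, input_size)
h = torch.zeros(3, 20)     # initial hidden state
c = torch.zeros(3, 20)     # initial cell state

# Unroll the recurrence one time step at a time
for t in range(x.size(0)):
    h, c = cell(x[t], (h, c))

print(h.shape)  # hidden state after the final step
```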
# Bidirectional LSTM for sentiment analysis
class SentimentLSTM(nn.Module):
"""LSTM-based sentiment analysis model"""
def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, num_layers=2, dropout=0.3):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
self.lstm = nn.LSTM(
embed_dim,
hidden_dim,
num_layers=num_layers,
batch_first=True,
dropout=dropout,
bidirectional=True
)
self.dropout = nn.Dropout(dropout)
self.classifier = nn.Sequential(
nn.Linear(hidden_dim * 2, hidden_dim),
nn.ReLU(),
nn.Dropout(dropout),
nn.Linear(hidden_dim, num_classes)
)
def forward(self, x, lengths=None):
embedded = self.dropout(self.embedding(x))
if lengths is not None:
packed = nn.utils.rnn.pack_padded_sequence(
embedded, lengths, batch_first=True, enforce_sorted=False
)
lstm_out, (hidden, cell) = self.lstm(packed)
lstm_out, _ = nn.utils.rnn.pad_packed_sequence(lstm_out, batch_first=True)
else:
lstm_out, (hidden, cell) = self.lstm(embedded)
forward_hidden = hidden[-2]
backward_hidden = hidden[-1]
combined = torch.cat([forward_hidden, backward_hidden], dim=1)
output = self.classifier(self.dropout(combined))
return output
4.3 GRU
GRU (Gated Recurrent Unit) is a simpler alternative to LSTM: it merges the cell and hidden states and uses two gates instead of three, so it trains faster while achieving comparable performance.
class GRUCell(nn.Module):
"""GRU cell — manual implementation"""
def __init__(self, input_size, hidden_size):
super().__init__()
self.hidden_size = hidden_size
self.reset_gate = nn.Linear(input_size + hidden_size, hidden_size)
self.update_gate = nn.Linear(input_size + hidden_size, hidden_size)
self.new_gate_input = nn.Linear(input_size, hidden_size)
self.new_gate_hidden = nn.Linear(hidden_size, hidden_size)
def forward(self, x, h=None):
batch_size = x.size(0)
if h is None:
h = torch.zeros(batch_size, self.hidden_size)
combined = torch.cat([x, h], dim=1)
r = torch.sigmoid(self.reset_gate(combined)) # reset gate
z = torch.sigmoid(self.update_gate(combined)) # update gate
n = torch.tanh(
self.new_gate_input(x) + r * self.new_gate_hidden(h)
)
new_h = (1 - z) * n + z * h
return new_h
4.4 Seq2Seq Machine Translation
class Encoder(nn.Module):
"""Seq2Seq Encoder"""
def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=2, dropout=0.5):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
batch_first=True, dropout=dropout)
self.dropout = nn.Dropout(dropout)
def forward(self, src):
embedded = self.dropout(self.embedding(src))
outputs, (hidden, cell) = self.lstm(embedded)
return outputs, hidden, cell
class Decoder(nn.Module):
"""Seq2Seq Decoder"""
def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=2, dropout=0.5):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
batch_first=True, dropout=dropout)
self.fc = nn.Linear(hidden_dim, vocab_size)
self.dropout = nn.Dropout(dropout)
def forward(self, trg, hidden, cell):
trg = trg.unsqueeze(1) # (batch, 1)
embedded = self.dropout(self.embedding(trg))
output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
prediction = self.fc(output.squeeze(1)) # (batch, vocab_size)
return prediction, hidden, cell
class Seq2Seq(nn.Module):
"""Seq2Seq Translation Model"""
def __init__(self, encoder, decoder, device):
super().__init__()
self.encoder = encoder
self.decoder = decoder
self.device = device
def forward(self, src, trg, teacher_forcing_ratio=0.5):
batch_size = src.shape[0]
trg_len = trg.shape[1]
trg_vocab_size = self.decoder.fc.out_features
outputs = torch.zeros(batch_size, trg_len, trg_vocab_size).to(self.device)
encoder_outputs, hidden, cell = self.encoder(src)
# First decoder input: start-of-sequence token
input = trg[:, 0]
for t in range(1, trg_len):
output, hidden, cell = self.decoder(input, hidden, cell)
outputs[:, t] = output
teacher_force = torch.rand(1).item() < teacher_forcing_ratio
top1 = output.argmax(1)
input = trg[:, t] if teacher_force else top1
return outputs
5. Attention Mechanisms
Attention mechanisms allow a model to dynamically focus on different parts of the input sequence when generating each output token.
5.1 Bahdanau Attention
class BahdanauAttention(nn.Module):
"""Bahdanau (Additive) Attention"""
def __init__(self, hidden_dim):
super().__init__()
self.W_enc = nn.Linear(hidden_dim * 2, hidden_dim)
self.W_dec = nn.Linear(hidden_dim, hidden_dim)
self.v = nn.Linear(hidden_dim, 1)
def forward(self, decoder_hidden, encoder_outputs):
"""
decoder_hidden: (batch, hidden) current decoder hidden state
encoder_outputs: (batch, src_len, hidden*2) all encoder outputs
"""
src_len = encoder_outputs.shape[1]
decoder_hidden = decoder_hidden.unsqueeze(1).repeat(1, src_len, 1)
energy = torch.tanh(
self.W_enc(encoder_outputs) + self.W_dec(decoder_hidden)
)
attention = self.v(energy).squeeze(2) # (batch, src_len)
attention_weights = torch.softmax(attention, dim=1)
context = torch.bmm(
attention_weights.unsqueeze(1), # (batch, 1, src_len)
encoder_outputs # (batch, src_len, hidden*2)
).squeeze(1)
return context, attention_weights
class LuongAttention(nn.Module):
"""Luong (Multiplicative) Attention"""
def __init__(self, hidden_dim, method='dot'):
super().__init__()
self.method = method
if method == 'general':
self.W = nn.Linear(hidden_dim, hidden_dim)
elif method == 'concat':
self.W = nn.Linear(hidden_dim * 2, hidden_dim)
self.v = nn.Linear(hidden_dim, 1)
def forward(self, decoder_hidden, encoder_outputs):
if self.method == 'dot':
score = torch.bmm(
encoder_outputs,
decoder_hidden.unsqueeze(2)
).squeeze(2)
elif self.method == 'general':
energy = self.W(encoder_outputs)
score = torch.bmm(
energy,
decoder_hidden.unsqueeze(2)
).squeeze(2)
elif self.method == 'concat':
decoder_expanded = decoder_hidden.unsqueeze(1).expand_as(encoder_outputs)
energy = torch.tanh(self.W(torch.cat([decoder_expanded, encoder_outputs], dim=2)))
score = self.v(energy).squeeze(2)
attention_weights = torch.softmax(score, dim=1)
context = torch.bmm(attention_weights.unsqueeze(1), encoder_outputs).squeeze(1)
return context, attention_weights
5.2 Self-Attention
Self-attention allows each position in a sequence to attend to all other positions. It is the core building block of the Transformer.
class SelfAttention(nn.Module):
"""Multi-Head Self-Attention"""
def __init__(self, embed_dim, num_heads=8, dropout=0.1):
super().__init__()
assert embed_dim % num_heads == 0
self.embed_dim = embed_dim
self.num_heads = num_heads
self.head_dim = embed_dim // num_heads
self.scale = self.head_dim ** -0.5
self.W_q = nn.Linear(embed_dim, embed_dim)
self.W_k = nn.Linear(embed_dim, embed_dim)
self.W_v = nn.Linear(embed_dim, embed_dim)
self.W_o = nn.Linear(embed_dim, embed_dim)
self.dropout = nn.Dropout(dropout)
def forward(self, x, mask=None):
"""
x: (batch, seq_len, embed_dim)
mask: (batch, seq_len, seq_len) attention mask
"""
batch_size, seq_len, _ = x.shape
Q = self.W_q(x)
K = self.W_k(x)
V = self.W_v(x)
Q = Q.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
K = K.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
V = V.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
scores = torch.matmul(Q, K.transpose(-2, -1)) * self.scale
if mask is not None:
scores = scores.masked_fill(mask == 0, float('-inf'))
attention_weights = self.dropout(torch.softmax(scores, dim=-1))
attended = torch.matmul(attention_weights, V)
attended = attended.transpose(1, 2).contiguous()
attended = attended.view(batch_size, seq_len, self.embed_dim)
output = self.W_o(attended)
return output, attention_weights
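The masking behavior can be verified with plain tensor ops, independent of the class above: after softmax every row of the attention matrix sums to 1, and a causal mask leaves zero weight on future positions. A small self-contained check:

```python
import torch

seq_len, dim = 4, 8
q = torch.randn(1, seq_len, dim)
k = torch.randn(1, seq_len, dim)

# Scaled dot-product scores, then mask out future positions
scores = torch.matmul(q, k.transpose(-2, -1)) / dim ** 0.5
causal = torch.tril(torch.ones(seq_len, seq_len)).bool()
scores = scores.masked_fill(~causal, float('-inf'))
weights = torch.softmax(scores, dim=-1)

print(weights[0])  # lower-triangular: no attention to the future
```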
6. Deep Dive into the Transformer Architecture
The Transformer, introduced in the 2017 paper "Attention is All You Need," completely changed the NLP paradigm. Using only attention mechanisms — no RNNs or convolutions — it achieved state-of-the-art translation performance.
6.1 Positional Encoding
Because Transformers process all positions in parallel, they need positional encoding to inject information about sequence order.
import torch
import torch.nn as nn
import math
class PositionalEncoding(nn.Module):
"""Sinusoidal Positional Encoding"""
def __init__(self, embed_dim, max_len=5000, dropout=0.1):
super().__init__()
self.dropout = nn.Dropout(p=dropout)
pe = torch.zeros(max_len, embed_dim)
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(
torch.arange(0, embed_dim, 2).float() * (-math.log(10000.0) / embed_dim)
)
pe[:, 0::2] = torch.sin(position * div_term) # even indices: sin
pe[:, 1::2] = torch.cos(position * div_term) # odd indices: cos
pe = pe.unsqueeze(0) # (1, max_len, embed_dim)
self.register_buffer('pe', pe)
def forward(self, x):
"""x: (batch, seq_len, embed_dim)"""
x = x + self.pe[:, :x.size(1), :]
return self.dropout(x)
class LearnablePositionalEncoding(nn.Module):
"""Learnable Positional Encoding (BERT style)"""
def __init__(self, embed_dim, max_len=512):
super().__init__()
self.pe = nn.Embedding(max_len, embed_dim)
def forward(self, x):
batch_size, seq_len, _ = x.shape
positions = torch.arange(seq_len, device=x.device).unsqueeze(0)
return x + self.pe(positions)
6.2 Full Transformer Implementation
class MultiHeadAttention(nn.Module):
"""Multi-Head Attention"""
def __init__(self, embed_dim, num_heads, dropout=0.1):
super().__init__()
assert embed_dim % num_heads == 0
self.embed_dim = embed_dim
self.num_heads = num_heads
self.head_dim = embed_dim // num_heads
self.scale = self.head_dim ** -0.5
self.W_q = nn.Linear(embed_dim, embed_dim, bias=False)
self.W_k = nn.Linear(embed_dim, embed_dim, bias=False)
self.W_v = nn.Linear(embed_dim, embed_dim, bias=False)
self.W_o = nn.Linear(embed_dim, embed_dim)
self.dropout = nn.Dropout(dropout)
def split_heads(self, x):
"""(batch, seq, embed) -> (batch, heads, seq, head_dim)"""
batch, seq, _ = x.shape
x = x.view(batch, seq, self.num_heads, self.head_dim)
return x.transpose(1, 2)
def forward(self, query, key, value, mask=None):
batch_size = query.shape[0]
Q = self.split_heads(self.W_q(query))
K = self.split_heads(self.W_k(key))
V = self.split_heads(self.W_v(value))
scores = torch.matmul(Q, K.transpose(-2, -1)) * self.scale
if mask is not None:
scores = scores.masked_fill(mask == 0, float('-inf'))
attn_weights = self.dropout(torch.softmax(scores, dim=-1))
output = torch.matmul(attn_weights, V)
output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.embed_dim)
return self.W_o(output), attn_weights
class FeedForward(nn.Module):
"""Position-wise Feed-Forward Network"""
def __init__(self, embed_dim, ff_dim, dropout=0.1):
super().__init__()
self.net = nn.Sequential(
nn.Linear(embed_dim, ff_dim),
nn.GELU(),
nn.Dropout(dropout),
nn.Linear(ff_dim, embed_dim),
nn.Dropout(dropout)
)
def forward(self, x):
return self.net(x)
class EncoderLayer(nn.Module):
"""Transformer Encoder Layer"""
def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
super().__init__()
self.self_attention = MultiHeadAttention(embed_dim, num_heads, dropout)
self.feed_forward = FeedForward(embed_dim, ff_dim, dropout)
self.norm1 = nn.LayerNorm(embed_dim)
self.norm2 = nn.LayerNorm(embed_dim)
self.dropout = nn.Dropout(dropout)
def forward(self, x, src_mask=None):
attn_output, _ = self.self_attention(x, x, x, src_mask)
x = self.norm1(x + self.dropout(attn_output))
ff_output = self.feed_forward(x)
x = self.norm2(x + ff_output)
return x
class DecoderLayer(nn.Module):
"""Transformer Decoder Layer"""
def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
super().__init__()
self.self_attention = MultiHeadAttention(embed_dim, num_heads, dropout)
self.cross_attention = MultiHeadAttention(embed_dim, num_heads, dropout)
self.feed_forward = FeedForward(embed_dim, ff_dim, dropout)
self.norm1 = nn.LayerNorm(embed_dim)
self.norm2 = nn.LayerNorm(embed_dim)
self.norm3 = nn.LayerNorm(embed_dim)
self.dropout = nn.Dropout(dropout)
def forward(self, x, memory, src_mask=None, tgt_mask=None):
# Masked Self-Attention (future token masking)
self_attn_out, _ = self.self_attention(x, x, x, tgt_mask)
x = self.norm1(x + self.dropout(self_attn_out))
# Cross-Attention (attend to encoder outputs)
cross_attn_out, cross_attn_weights = self.cross_attention(x, memory, memory, src_mask)
x = self.norm2(x + self.dropout(cross_attn_out))
ff_out = self.feed_forward(x)
x = self.norm3(x + ff_out)
return x, cross_attn_weights
class Transformer(nn.Module):
"""Full Transformer Implementation"""
def __init__(
self,
src_vocab_size,
tgt_vocab_size,
embed_dim=512,
num_heads=8,
num_encoder_layers=6,
num_decoder_layers=6,
ff_dim=2048,
max_len=5000,
dropout=0.1
):
super().__init__()
self.src_embedding = nn.Embedding(src_vocab_size, embed_dim)
self.tgt_embedding = nn.Embedding(tgt_vocab_size, embed_dim)
self.pos_encoding = PositionalEncoding(embed_dim, max_len, dropout)
self.embed_scale = math.sqrt(embed_dim)
self.encoder_layers = nn.ModuleList([
EncoderLayer(embed_dim, num_heads, ff_dim, dropout)
for _ in range(num_encoder_layers)
])
self.decoder_layers = nn.ModuleList([
DecoderLayer(embed_dim, num_heads, ff_dim, dropout)
for _ in range(num_decoder_layers)
])
self.output_norm = nn.LayerNorm(embed_dim)
self.output_projection = nn.Linear(embed_dim, tgt_vocab_size)
self._init_weights()
def _init_weights(self):
for p in self.parameters():
if p.dim() > 1:
nn.init.xavier_uniform_(p)
def make_causal_mask(self, seq_len, device):
"""Autoregressive mask — prevents attending to future tokens"""
mask = torch.tril(torch.ones(seq_len, seq_len, device=device)).bool()
return mask.unsqueeze(0).unsqueeze(0)
def make_pad_mask(self, x, pad_idx=0):
"""Padding mask"""
return (x != pad_idx).unsqueeze(1).unsqueeze(2)
def encode(self, src, src_mask=None):
x = self.pos_encoding(self.src_embedding(src) * self.embed_scale)
for layer in self.encoder_layers:
x = layer(x, src_mask)
return x
def decode(self, tgt, memory, src_mask=None, tgt_mask=None):
x = self.pos_encoding(self.tgt_embedding(tgt) * self.embed_scale)
for layer in self.decoder_layers:
x, _ = layer(x, memory, src_mask, tgt_mask)
return self.output_norm(x)
def forward(self, src, tgt, src_pad_idx=0, tgt_pad_idx=0):
src_mask = self.make_pad_mask(src, src_pad_idx)
tgt_len = tgt.shape[1]
tgt_pad_mask = self.make_pad_mask(tgt, tgt_pad_idx)
tgt_causal_mask = self.make_causal_mask(tgt_len, tgt.device)
tgt_mask = tgt_pad_mask & tgt_causal_mask
memory = self.encode(src, src_mask)
output = self.decode(tgt, memory, src_mask, tgt_mask)
logits = self.output_projection(output)
return logits
# Instantiate and test
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Transformer(
src_vocab_size=10000,
tgt_vocab_size=10000,
embed_dim=512,
num_heads=8,
num_encoder_layers=6,
num_decoder_layers=6,
ff_dim=2048,
dropout=0.1
).to(device)
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
src = torch.randint(1, 10000, (4, 20)).to(device)
tgt = torch.randint(1, 10000, (4, 18)).to(device)
output = model(src, tgt)
print(f"Output shape: {output.shape}") # (4, 18, 10000)
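For training such a model, the usual recipe is cross-entropy over the flattened logits with padding positions excluded; a minimal sketch (pad index 0 here matches make_pad_mask above):

```python
import torch
import torch.nn.functional as F

vocab_size = 10000
logits = torch.randn(4, 18, vocab_size)          # model output (batch, tgt_len, vocab)
targets = torch.randint(1, vocab_size, (4, 18))  # gold next tokens
targets[:, -3:] = 0                              # simulate padding at the end

loss = F.cross_entropy(
    logits.view(-1, vocab_size),
    targets.view(-1),
    ignore_index=0  # pad positions contribute neither loss nor gradient
)
print(loss.item())
```

In practice the decoder input is the target shifted right and the loss target is the target shifted left, so each position predicts the next token.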
7. Deep Dive into BERT
BERT (Bidirectional Encoder Representations from Transformers) was published by Google in 2018 and revolutionized the NLP field.
7.1 BERT's Core Ideas
BERT uses two pre-training tasks:
Masked Language Modeling (MLM): 15% of input tokens are selected for prediction; of those, 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged, and the model must recover the original words.
Next Sentence Prediction (NSP): Given a pair of sentences, the model predicts whether the second actually follows the first in the corpus.
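The MLM corruption step itself is easy to sketch with plain PyTorch: select 15% of positions, then apply the 80/10/10 rule. This is a simplified illustration (special tokens are not protected, and the helper name mask_tokens is ours, not from the transformers library):

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """BERT-style MLM corruption (simplified: special tokens are not protected)."""
    labels = input_ids.clone()
    corrupted = input_ids.clone()
    # Select 15% of positions as prediction targets
    masked = torch.rand(input_ids.shape) < mlm_prob
    labels[~masked] = -100  # -100 is ignored by the cross-entropy loss
    # 80% of the selected positions become [MASK]
    replace = masked & (torch.rand(input_ids.shape) < 0.8)
    corrupted[replace] = mask_token_id
    # 10% become a random token (half of the remaining 20%) ...
    random_tok = masked & ~replace & (torch.rand(input_ids.shape) < 0.5)
    corrupted[random_tok] = torch.randint(vocab_size, (int(random_tok.sum()),))
    # ... and the final 10% are left unchanged
    return corrupted, labels

ids = torch.randint(5, 1000, (2, 16))
corrupted, labels = mask_tokens(ids, mask_token_id=103, vocab_size=1000)
```

Leaving some selected tokens unchanged forces the model to build good representations for every position, not only for [MASK].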
from transformers import (
BertTokenizer,
BertModel,
BertForSequenceClassification,
BertForTokenClassification,
BertForQuestionAnswering,
get_linear_schedule_with_warmup
)
import torch
from torch.optim import AdamW  # recent transformers versions no longer export AdamW
from torch.utils.data import Dataset, DataLoader
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "Natural language processing is transforming AI."
encoding = tokenizer(
text,
add_special_tokens=True,
max_length=128,
padding='max_length',
truncation=True,
return_tensors='pt'
)
print("Input IDs:", encoding['input_ids'][0][:10])
print("Attention mask:", encoding['attention_mask'][0][:10])
print("Decoded:", tokenizer.decode(encoding['input_ids'][0]))
# WordPiece tokenization
complex_words = ["unbelievable", "preprocessing", "transformers", "tokenization"]
for word in complex_words:
tokens = tokenizer.tokenize(word)
print(f"{word:20} -> {tokens}")
7.2 BERT Fine-tuning: Text Classification
class SentimentDataset(Dataset):
"""Sentiment analysis dataset"""
def __init__(self, texts, labels, tokenizer, max_length=128):
self.texts = texts
self.labels = labels
self.tokenizer = tokenizer
self.max_length = max_length
def __len__(self):
return len(self.texts)
def __getitem__(self, idx):
encoding = self.tokenizer(
self.texts[idx],
max_length=self.max_length,
padding='max_length',
truncation=True,
return_tensors='pt'
)
return {
'input_ids': encoding['input_ids'].squeeze(),
'attention_mask': encoding['attention_mask'].squeeze(),
'label': torch.tensor(self.labels[idx], dtype=torch.long)
}
def train_bert_classifier(
train_texts,
train_labels,
val_texts,
val_labels,
num_labels=2,
epochs=3,
batch_size=16,
lr=2e-5
):
"""BERT sentiment analysis fine-tuning"""
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
train_dataset = SentimentDataset(train_texts, train_labels, tokenizer)
val_dataset = SentimentDataset(val_texts, val_labels, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
model = BertForSequenceClassification.from_pretrained(
'bert-base-uncased',
num_labels=num_labels
).to(device)
optimizer = AdamW(model.parameters(), lr=lr, weight_decay=0.01)
total_steps = len(train_loader) * epochs
scheduler = get_linear_schedule_with_warmup(
optimizer,
num_warmup_steps=total_steps // 10,
num_training_steps=total_steps
)
best_val_acc = 0
for epoch in range(epochs):
model.train()
total_loss = 0
correct = 0
for batch in train_loader:
input_ids = batch['input_ids'].to(device)
attention_mask = batch['attention_mask'].to(device)
labels = batch['label'].to(device)
optimizer.zero_grad()
outputs = model(
input_ids=input_ids,
attention_mask=attention_mask,
labels=labels
)
loss = outputs.loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
scheduler.step()
total_loss += loss.item()
preds = outputs.logits.argmax(dim=1)
correct += (preds == labels).sum().item()
train_acc = correct / len(train_dataset)
avg_loss = total_loss / len(train_loader)
model.eval()
val_correct = 0
with torch.no_grad():
for batch in val_loader:
input_ids = batch['input_ids'].to(device)
attention_mask = batch['attention_mask'].to(device)
labels = batch['label'].to(device)
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
preds = outputs.logits.argmax(dim=1)
val_correct += (preds == labels).sum().item()
val_acc = val_correct / len(val_dataset)
print(f"Epoch {epoch+1}: Loss={avg_loss:.4f}, Train Acc={train_acc:.4f}, Val Acc={val_acc:.4f}")
if val_acc > best_val_acc:
best_val_acc = val_acc
torch.save(model.state_dict(), 'best_bert_classifier.pt')
return model
7.3 BERT NER (Named Entity Recognition)
from transformers import BertForTokenClassification
ner_labels = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
label2id = {label: i for i, label in enumerate(ner_labels)}
id2label = {i: label for i, label in enumerate(ner_labels)}
model = BertForTokenClassification.from_pretrained(
'bert-base-uncased',
num_labels=len(ner_labels),
id2label=id2label,
label2id=label2id
)
def predict_ner(text, model, tokenizer):
"""Predict named entities"""
model.eval()
encoding = tokenizer(
text.split(),
is_split_into_words=True,
return_offsets_mapping=True,
padding=True,
truncation=True,
return_tensors='pt'
)
with torch.no_grad():
outputs = model(
input_ids=encoding['input_ids'],
attention_mask=encoding['attention_mask']
)
predictions = outputs.logits.argmax(dim=2).squeeze().tolist()
word_ids = encoding.word_ids()
result = []
prev_word_id = None
for pred, word_id in zip(predictions, word_ids):
if word_id is None or word_id == prev_word_id:
continue
word = text.split()[word_id]
label = id2label[pred]
result.append((word, label))
prev_word_id = word_id
return result
from transformers import BertTokenizerFast  # fast tokenizer: required for word_ids() / offset mapping
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
text = "Barack Obama was born in Hawaii and served as the 44th President of the United States."
# print(predict_ner(text, model, tokenizer)) # use with a fine-tuned model
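The label set above follows the BIO scheme: `B-` opens an entity, `I-` continues it, `O` is outside any entity. The (word, label) pairs returned by `predict_ner` can be grouped into entity spans with a small helper (a hypothetical `bio_to_entities`, not part of transformers):

```python
def bio_to_entities(tagged_words):
    """Group (word, BIO-label) pairs into (entity_text, entity_type) spans."""
    entities = []
    current_words, current_type = [], None
    for word, label in tagged_words:
        if label.startswith('B-'):
            if current_words:  # close the previous entity
                entities.append((' '.join(current_words), current_type))
            current_words, current_type = [word], label[2:]
        elif label.startswith('I-') and current_type == label[2:]:
            current_words.append(word)  # continue the current entity
        else:  # 'O', or an I- tag that does not continue the current entity
            if current_words:
                entities.append((' '.join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:  # flush a trailing entity
        entities.append((' '.join(current_words), current_type))
    return entities

tagged = [('Barack', 'B-PER'), ('Obama', 'I-PER'), ('was', 'O'),
          ('born', 'O'), ('in', 'O'), ('Hawaii', 'B-LOC')]
print(bio_to_entities(tagged))
# [('Barack Obama', 'PER'), ('Hawaii', 'LOC')]
```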
8. The GPT Family of Models
GPT (Generative Pre-trained Transformer) is a family of autoregressive language models developed by OpenAI: each token is generated conditioned on all tokens that precede it.
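"Autoregressive" means the model factorizes the probability of a sequence by the chain rule, p(x) = Π_t p(x_t | x_<t), so the log-probability of a sentence is the sum of next-token log-probabilities. A toy sketch with a hand-made bigram table (the probabilities below are made up purely for illustration):

```python
import math

# Toy bigram "language model": P(next | current). Values are illustrative only.
bigram = {
    ('<s>', 'the'): 0.5,
    ('the', 'cat'): 0.4,
    ('cat', 'sat'): 0.6,
}

def sequence_log_prob(tokens):
    """Sum of log P(x_t | x_{t-1}): the autoregressive chain rule."""
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log(bigram[(prev, cur)])
    return total

lp = sequence_log_prob(['<s>', 'the', 'cat', 'sat'])
print(lp)  # log(0.5) + log(0.4) + log(0.6)
```

A real GPT does the same thing, except each conditional comes from a Transformer over the full prefix instead of a lookup table.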
8.1 GPT Evolution
| Model | Year | Parameters | Key Feature |
|---|---|---|---|
| GPT-1 | 2018 | 117M | First GPT, unsupervised pre-training |
| GPT-2 | 2019 | 1.5B | Zero-shot transfer, "too dangerous to release" |
| GPT-3 | 2020 | 175B | In-context learning, few-shot |
| InstructGPT | 2022 | 1.3B–175B | RLHF, instruction following |
| GPT-4 | 2023 | Undisclosed | Multimodal, stronger reasoning |
8.2 Text Generation with GPT-2
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
tokenizer.pad_token = tokenizer.eos_token
def generate_text(
prompt,
model,
tokenizer,
max_length=200,
temperature=0.9,
top_p=0.95,
top_k=50,
num_return_sequences=1,
do_sample=True
):
"""Generate text with GPT-2"""
inputs = tokenizer(prompt, return_tensors='pt')
input_ids = inputs['input_ids']
with torch.no_grad():
output = model.generate(
input_ids,
max_length=max_length,
temperature=temperature,
top_p=top_p,
top_k=top_k,
num_return_sequences=num_return_sequences,
do_sample=do_sample,
pad_token_id=tokenizer.eos_token_id,
repetition_penalty=1.2
)
generated_texts = []
for out in output:
text = tokenizer.decode(out, skip_special_tokens=True)
generated_texts.append(text)
return generated_texts
prompt = "Artificial intelligence is transforming the world by"
generated = generate_text(prompt, model, tokenizer, max_length=150, temperature=0.8)
for i, text in enumerate(generated):
print(f"\nGenerated text {i+1}:")
print(text)
print("-" * 50)
def compare_generation_strategies(prompt, model, tokenizer):
"""Compare different decoding strategies"""
input_ids = tokenizer.encode(prompt, return_tensors='pt')
strategies = {
"Greedy Search": dict(do_sample=False),
"Beam Search": dict(do_sample=False, num_beams=5, early_stopping=True),
"Temperature Sampling": dict(do_sample=True, temperature=0.7),
"Top-k Sampling": dict(do_sample=True, top_k=50),
"Top-p (Nucleus) Sampling": dict(do_sample=True, top_p=0.92),
}
print(f"Prompt: {prompt}\n")
with torch.no_grad():
for name, params in strategies.items():
output = model.generate(
input_ids,
max_new_tokens=50,
pad_token_id=tokenizer.eos_token_id,
**params
)
text = tokenizer.decode(output[0], skip_special_tokens=True)
print(f"[{name}]")
print(text[len(prompt):].strip())
print()
compare_generation_strategies(
"The future of artificial intelligence",
model,
tokenizer
)
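To make the `temperature`, `top_k`, and `top_p` knobs above concrete, here is a from-scratch numpy sketch of how they reshape the next-token distribution (`filtered_probs` is a made-up helper over an illustrative logits vector, not a transformers API):

```python
import numpy as np

def filtered_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Temperature-scale logits, then zero out tokens outside top-k / top-p."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())  # stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]        # tokens sorted by probability
    keep = np.zeros_like(probs, dtype=bool)
    cumulative = 0.0
    for rank, idx in enumerate(order):
        keep[idx] = True
        cumulative += probs[idx]
        if top_k and rank + 1 >= top_k:    # top-k: keep only the k most likely
            break
        if cumulative >= top_p:            # nucleus: smallest set covering top_p mass
            break
    probs = np.where(keep, probs, 0.0)
    return probs / probs.sum()             # renormalize over surviving tokens

logits = [2.0, 1.0, 0.5, -1.0]
print(filtered_probs(logits, top_k=2))        # only the two most likely survive
print(filtered_probs(logits, temperature=0.5))  # sharper distribution
```

Lower temperature sharpens the distribution (greedier), while top-k/top-p trim the unreliable low-probability tail before sampling.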
8.3 Prompt Engineering
# Zero-shot classification
def zero_shot_classification(text, categories, model_name="gpt2"):
"""Zero-shot text classification"""
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()
scores = {}
for category in categories:
prompt = f"Text: {text}\nCategory: {category}"
inputs = tokenizer(prompt, return_tensors='pt')
with torch.no_grad():
outputs = model(**inputs, labels=inputs['input_ids'])
scores[category] = -outputs.loss.item()
best_category = max(scores, key=scores.get)
return best_category, scores
# Few-shot learning example
few_shot_prompt = """
Classify the sentiment of the following texts:
Text: "This movie was amazing!"
Sentiment: Positive
Text: "I hated every minute of it."
Sentiment: Negative
Text: "The film was okay, nothing special."
Sentiment: Neutral
Text: "Absolutely brilliant performance!"
Sentiment:"""
print("Few-shot prompt:")
print(few_shot_prompt)
# Chain-of-Thought prompting
cot_prompt = """
Q: A train travels 120 miles in 2 hours. How long will it take to travel 300 miles?
A: Let me think step by step.
1. First, find the speed: 120 miles / 2 hours = 60 miles per hour
2. Then, calculate the time for 300 miles: 300 miles / 60 mph = 5 hours
Therefore, it will take 5 hours.
Q: If a store sells 3 apples for $2.40, how much do 7 apples cost?
A: Let me think step by step.
"""
9. Modern NLP Techniques
9.1 LoRA / QLoRA Fine-tuning
LoRA (Low-Rank Adaptation) is a technique for efficiently fine-tuning large language models. It freezes the original weights and trains only small low-rank matrices added to the existing layers.
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
def setup_lora_model(model_name="gpt2", r=8, lora_alpha=32, lora_dropout=0.1):
"""Configure a LoRA model"""
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=r, # LoRA rank
lora_alpha=lora_alpha, # scaling parameter
target_modules=["c_attn"], # layers to apply LoRA to
lora_dropout=lora_dropout,
bias="none"
)
model = get_peft_model(model, lora_config)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} ({100 * trainable / total:.2f}%)")
return model, tokenizer
class InstructionDataset(torch.utils.data.Dataset):
"""Instruction-format dataset for SFT"""
def __init__(self, data, tokenizer, max_length=512):
self.tokenizer = tokenizer
self.max_length = max_length
self.texts = []
for item in data:
prompt = f"### Instruction:\n{item['instruction']}\n\n### Response:\n{item['response']}"
self.texts.append(prompt)
def __len__(self):
return len(self.texts)
def __getitem__(self, idx):
encoding = self.tokenizer(
self.texts[idx],
max_length=self.max_length,
padding='max_length',
truncation=True,
return_tensors='pt'
)
input_ids = encoding['input_ids'].squeeze()
return {
'input_ids': input_ids,
'attention_mask': encoding['attention_mask'].squeeze(),
'labels': input_ids.clone()
}
9.2 RAG (Retrieval-Augmented Generation)
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
class SimpleRAG:
"""Simple RAG system"""
def __init__(self, embedding_model="sentence-transformers/all-MiniLM-L6-v2"):
self.tokenizer = AutoTokenizer.from_pretrained(embedding_model)
self.model = AutoModel.from_pretrained(embedding_model)
self.knowledge_base = []
self.embeddings = []
def mean_pooling(self, model_output, attention_mask):
token_embeddings = model_output[0]
input_mask_expanded = attention_mask.unsqueeze(-1).expand(
token_embeddings.size()
).float()
return torch.sum(token_embeddings * input_mask_expanded, 1) / \
torch.clamp(input_mask_expanded.sum(1), min=1e-9)
def encode(self, texts):
encoded = self.tokenizer(
texts,
padding=True,
truncation=True,
max_length=512,
return_tensors='pt'
)
with torch.no_grad():
model_output = self.model(**encoded)
embeddings = self.mean_pooling(model_output, encoded['attention_mask'])
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
return embeddings.numpy()
def add_documents(self, documents):
self.knowledge_base.extend(documents)
new_embeddings = self.encode(documents)
if len(self.embeddings) == 0:
self.embeddings = new_embeddings
else:
self.embeddings = np.vstack([self.embeddings, new_embeddings])
print(f"Added {len(documents)} documents. Total: {len(self.knowledge_base)}.")
def retrieve(self, query, top_k=3):
query_embedding = self.encode([query])
similarities = cosine_similarity(query_embedding, self.embeddings)[0]
top_indices = np.argsort(similarities)[::-1][:top_k]
results = []
for idx in top_indices:
results.append({
'document': self.knowledge_base[idx],
'similarity': similarities[idx]
})
return results
def answer(self, query, generator_model, generator_tokenizer, top_k=3):
"""RAG: retrieve + generate"""
relevant_docs = self.retrieve(query, top_k=top_k)
context = "\n".join([doc['document'] for doc in relevant_docs])
prompt = f"""Based on the following context, answer the question.
Context:
{context}
Question: {query}
Answer:"""
inputs = generator_tokenizer(prompt, return_tensors='pt', max_length=512, truncation=True)
with torch.no_grad():
outputs = generator_model.generate(
inputs['input_ids'],
max_new_tokens=150,
temperature=0.7,
do_sample=True,
pad_token_id=generator_tokenizer.eos_token_id
)
answer = generator_tokenizer.decode(outputs[0], skip_special_tokens=True)
answer = answer[len(prompt):].strip()
return answer, relevant_docs
# Usage example
rag = SimpleRAG()
documents = [
"BERT (Bidirectional Encoder Representations from Transformers) was published by Google in 2018.",
"GPT-3 has 175 billion parameters and was developed by OpenAI in 2020.",
"The Transformer architecture was introduced in the paper 'Attention is All You Need' in 2017.",
"RLHF (Reinforcement Learning from Human Feedback) is used to align language models with human values.",
"LoRA allows efficient fine-tuning by adding low-rank matrices to pre-trained model weights.",
"RAG combines information retrieval with text generation for more accurate responses.",
]
rag.add_documents(documents)
query = "What is BERT and when was it created?"
results = rag.retrieve(query, top_k=2)
print(f"\nQuery: {query}")
print("\nRetrieved documents:")
for r in results:
print(f" [{r['similarity']:.3f}] {r['document']}")
9.3 RLHF (Reinforcement Learning from Human Feedback)
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Reward model: scores response quality."""
def __init__(self, base_model_name="gpt2"):
super().__init__()
from transformers import GPT2Model
self.transformer = GPT2Model.from_pretrained(base_model_name)
hidden_size = self.transformer.config.hidden_size
self.reward_head = nn.Linear(hidden_size, 1)
def forward(self, input_ids, attention_mask=None):
outputs = self.transformer(input_ids=input_ids, attention_mask=attention_mask)
last_hidden = outputs.last_hidden_state[:, -1, :]
reward = self.reward_head(last_hidden)
return reward.squeeze(-1)
class PPOTrainer:
"""PPO-based RLHF training (conceptual implementation)"""
def __init__(self, policy_model, reward_model, ref_model, tokenizer):
self.policy = policy_model
self.reward_model = reward_model
self.ref_model = ref_model # Reference model for KL penalty
self.tokenizer = tokenizer
def compute_kl_penalty(self, policy_logprobs, ref_logprobs, kl_coeff=0.1):
"""KL divergence penalty — prevents policy from drifting too far"""
kl = policy_logprobs - ref_logprobs
return kl_coeff * kl.mean()
def compute_advantages(self, rewards, values, gamma=0.99, lam=0.95):
"""Generalized Advantage Estimation (GAE)"""
advantages = []
last_advantage = 0
for t in reversed(range(len(rewards))):
if t == len(rewards) - 1:
next_value = 0
else:
next_value = values[t + 1]
delta = rewards[t] + gamma * next_value - values[t]
advantage = delta + gamma * lam * last_advantage
advantages.insert(0, advantage)
last_advantage = advantage
return torch.tensor(advantages)
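The GAE recursion above can be sanity-checked on toy numbers. With rewards [1, 1], values [0.5, 0.5], γ=0.99, λ=0.95: the last step gives δ=0.5, and the first step gives δ=0.995 plus γλ·0.5, i.e. 1.46525. The same recursion in plain Python:

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Same recursion as compute_advantages, in plain Python for checking."""
    advantages, last = [], 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last = delta + gamma * lam * last
        advantages.insert(0, last)
    return advantages

adv = gae([1.0, 1.0], [0.5, 0.5])
print(adv)  # ≈ [1.46525, 0.5]
```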
10. Real-World NLP Projects
10.1 Sentiment Analysis System (Complete Pipeline)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup
import matplotlib.pyplot as plt
import seaborn as sns
class SentimentAnalysisPipeline:
"""End-to-end sentiment analysis pipeline"""
def __init__(self, model_name="distilbert-base-uncased-finetuned-sst-2-english"):
self.model_name = model_name
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.model.to(self.device)
self.model.eval()
self.label_map = {0: 'Negative', 1: 'Positive'}
def predict(self, texts, batch_size=32):
"""Batch prediction"""
if isinstance(texts, str):
texts = [texts]
all_predictions = []
all_probabilities = []
for i in range(0, len(texts), batch_size):
batch = texts[i:i + batch_size]
encoding = self.tokenizer(
batch,
padding=True,
truncation=True,
max_length=512,
return_tensors='pt'
)
input_ids = encoding['input_ids'].to(self.device)
attention_mask = encoding['attention_mask'].to(self.device)
with torch.no_grad():
outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
probs = torch.softmax(outputs.logits, dim=1)
preds = probs.argmax(dim=1)
all_predictions.extend(preds.cpu().numpy())
all_probabilities.extend(probs.cpu().numpy())
results = []
for pred, prob in zip(all_predictions, all_probabilities):
results.append({
'label': self.label_map[pred],
'score': float(prob[pred]),
'probabilities': {self.label_map[i]: float(p) for i, p in enumerate(prob)}
})
return results if len(results) > 1 else results[0]
def analyze_batch(self, texts):
"""Batch sentiment analysis with statistics"""
results = self.predict(texts)
positive_count = sum(1 for r in results if r['label'] == 'Positive')
negative_count = len(results) - positive_count
avg_confidence = np.mean([r['score'] for r in results])
print(f"\n=== Sentiment Analysis Results ===")
print(f"Total texts: {len(texts)}")
print(f"Positive: {positive_count} ({100*positive_count/len(texts):.1f}%)")
print(f"Negative: {negative_count} ({100*negative_count/len(texts):.1f}%)")
print(f"Average confidence: {avg_confidence:.3f}")
return results
pipeline = SentimentAnalysisPipeline()
reviews = [
"This product exceeded all my expectations! Absolutely love it.",
"Terrible quality, broke after one day. Complete waste of money.",
"It's okay, does what it's supposed to do.",
"Best purchase I've made this year!",
"Would not recommend to anyone.",
]
print("=== Individual Predictions ===")
for review in reviews:
result = pipeline.predict(review)
print(f"Text: {review[:50]}...")
print(f" -> {result['label']} ({result['score']:.3f})\n")
print("\n=== Batch Analysis ===")
pipeline.analyze_batch(reviews)
10.2 Text Summarization System
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
class TextSummarizer:
"""Text summarization system"""
def __init__(self, model_name="facebook/bart-large-cnn"):
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.model.to(self.device)
def summarize(self, text, max_length=130, min_length=30, num_beams=4):
"""Abstractive summarization"""
inputs = self.tokenizer(
text,
max_length=1024,
truncation=True,
return_tensors='pt'
).to(self.device)
with torch.no_grad():
summary_ids = self.model.generate(
inputs['input_ids'],
max_length=max_length,
min_length=min_length,
num_beams=num_beams,
length_penalty=2.0,
early_stopping=True
)
summary = self.tokenizer.decode(summary_ids[0], skip_special_tokens=True)
original_words = len(text.split())
summary_words = len(summary.split())
compression = (1 - summary_words / original_words) * 100
return {
'summary': summary,
'original_length': original_words,
'summary_length': summary_words,
'compression_rate': f"{compression:.1f}%"
}
def extractive_summarize(self, text, num_sentences=3):
"""Extractive summarization (TF-IDF based)"""
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
nltk.download('punkt', quiet=True)
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text)
if len(sentences) <= num_sentences:
return text
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(sentences)
similarity_matrix = cosine_similarity(tfidf_matrix)
scores = similarity_matrix.mean(axis=1)
top_indices = sorted(np.argsort(scores)[-num_sentences:].tolist())
summary = ' '.join([sentences[i] for i in top_indices])
return summary
summarizer = TextSummarizer()
long_text = """
Artificial intelligence has made remarkable strides in recent years, particularly in
natural language processing. The development of transformer-based models like BERT and
GPT has revolutionized how machines understand and generate human language. These models,
trained on massive datasets, can perform a wide range of tasks including translation,
summarization, question answering, and sentiment analysis.
The key innovation behind these models is the attention mechanism, which allows the model
to focus on relevant parts of the input when generating each word of the output. This has
enabled much more nuanced understanding of context and semantics compared to earlier
approaches like recurrent neural networks.
However, these advances come with challenges. Training large language models requires
enormous computational resources and energy. There are also concerns about bias in the
training data leading to biased outputs, and the potential for misuse in generating
misinformation. Researchers are actively working on making these models more efficient,
fair, and reliable.
The future of NLP looks promising, with models becoming increasingly capable of
understanding and generating language that is indistinguishable from human writing.
Applications range from customer service chatbots to scientific research assistance,
and the technology continues to evolve rapidly.
"""
result = summarizer.summarize(long_text)
print("=== Abstractive Summary ===")
print(f"Original length: {result['original_length']} words")
print(f"Summary length: {result['summary_length']} words")
print(f"Compression rate: {result['compression_rate']}")
print(f"\nSummary:\n{result['summary']}")
print("\n=== Extractive Summary ===")
extractive = summarizer.extractive_summarize(long_text, num_sentences=2)
print(extractive)
10.3 Question Answering System
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
class QASystem:
"""Question answering system"""
def __init__(self, model_name="deepset/roberta-base-squad2"):
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForQuestionAnswering.from_pretrained(model_name)
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.model.to(self.device)
self.model.eval()
def answer(self, question, context, max_answer_len=100):
"""Extract answer from context"""
encoding = self.tokenizer(
question,
context,
max_length=512,
truncation=True,
stride=128,
return_overflowing_tokens=True,
return_offsets_mapping=True,
padding='max_length',
return_tensors='pt'
)
offset_mapping = encoding.pop('offset_mapping').cpu()
sample_map = encoding.pop('overflow_to_sample_mapping').cpu()
encoding = {k: v.to(self.device) for k, v in encoding.items()}
with torch.no_grad():
outputs = self.model(**encoding)
start_logits = outputs.start_logits
end_logits = outputs.end_logits
answers = []
for i in range(len(start_logits)):
start_log = start_logits[i].cpu().numpy()
end_log = end_logits[i].cpu().numpy()
offsets = offset_mapping[i].numpy()
start_idx = np.argmax(start_log)
end_idx = np.argmax(end_log)
if start_idx <= end_idx:
start_char = offsets[start_idx][0]
end_char = offsets[end_idx][1]
answer_text = context[start_char:end_char]
score = float(start_log[start_idx] + end_log[end_idx])
if answer_text:
answers.append({
'answer': answer_text,
'score': score,
'start': int(start_char),
'end': int(end_char)
})
if not answers:
return {'answer': "No answer found.", 'score': 0.0}
return max(answers, key=lambda x: x['score'])
def multi_question_answer(self, questions, context):
"""Answer multiple questions about a context"""
results = []
for question in questions:
answer = self.answer(question, context)
results.append({
'question': question,
'answer': answer['answer'],
'confidence': answer['score']
})
return results
qa_system = QASystem()
context = """
The Transformer model was introduced in a 2017 paper titled "Attention Is All You Need"
by researchers at Google Brain. The model architecture relies entirely on attention
mechanisms, dispensing with recurrence and convolutions. BERT, which stands for
Bidirectional Encoder Representations from Transformers, was introduced in 2018 by
Google AI Language team. BERT achieved state-of-the-art results on eleven NLP tasks.
GPT-3, developed by OpenAI, was released in 2020 and has 175 billion parameters, making
it one of the largest language models at the time of its release.
"""
questions = [
"When was the Transformer model introduced?",
"What does BERT stand for?",
"How many parameters does GPT-3 have?",
"Who developed BERT?",
]
print("=== Question Answering System ===\n")
results = qa_system.multi_question_answer(questions, context)
for r in results:
print(f"Q: {r['question']}")
print(f"A: {r['answer']}")
print(f"Confidence: {r['confidence']:.2f}")
print()
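One caveat: the `answer` method above picks the start and end positions with independent argmaxes, which can occasionally yield start > end and fall through to "No answer found." A common refinement is to search (start, end) pairs jointly under the constraints start ≤ end and a bounded span length; a sketch over raw logit lists (`best_span` is a hypothetical helper):

```python
def best_span(start_logits, end_logits, max_span_len=30):
    """Best (start, end) pair with start <= end and a bounded span length."""
    best = (0, 0, float('-inf'))
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_span_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best

# End argmax (index 0) sits before start argmax (index 1), so independent
# argmaxes would produce an invalid span; the joint search does not.
start = [0.1, 3.0, 0.2, 0.5]
end = [2.5, 0.1, 0.3, 2.0]
print(best_span(start, end))  # (1, 3, 5.0)
```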
Closing: NLP Learning Roadmap
This guide has walked through the complete NLP journey. A summary of each stage:
- Foundations: text preprocessing, BoW, TF-IDF
- Embeddings: Word2Vec, GloVe, FastText
- Sequence Models: RNN, LSTM, GRU, Seq2Seq
- The Attention Revolution: attention mechanisms, self-attention
- Mastering Transformers: the full architecture
- Pre-trained Models: BERT and GPT in practice
- Modern Techniques: LoRA, RAG, RLHF
Recommended Resources
- HuggingFace Official Docs
- Attention is All You Need Paper
- BERT Paper
- PyTorch NLP Tutorial
- Stanford CS224N — NLP with Deep Learning
- NLTK Documentation
- spaCy Documentation
- Gensim Word2Vec
NLP is a rapidly evolving field, so continuous learning and hands-on practice are essential. Run the code examples yourself, experiment with the parameters, and build your own projects to develop a deep understanding. Happy learning!