NLP & Text Processing Complete Guide: BERT Fine-tuning, RAG Systems, and Multilingual Processing
- Author: Youngju Kim (@fjvbn20031)
Natural Language Processing (NLP) is a core AI technology that enables computers to understand and generate human language. This guide covers everything from text preprocessing basics through BERT fine-tuning, RAG pipeline construction, and multilingual processing — with hands-on code throughout.
Table of Contents
- Text Preprocessing and Tokenization
- Subword Tokenization: BPE, WordPiece, SentencePiece
- Word Embeddings: Word2Vec, FastText, GloVe
- Sequence Model Evolution: RNN to Transformer
- BERT Family Models and Fine-tuning
- Text Task Practical Guide
- Building RAG Systems
- Multilingual NLP and Korean Language Specifics
- Quiz
1. Text Preprocessing and Tokenization
1.1 Korean Morpheme Analysis (KoNLPy, MeCab)
Korean is an agglutinative language where various affixes attach to stems to express meaning. Simple whitespace-based tokenization (effective in English) is therefore insufficient for Korean. KoNLPy is the leading Python library for Korean morpheme analysis.
from konlpy.tag import Okt, Mecab, Komoran
okt = Okt()
text = "자연어 처리는 인공지능의 핵심 분야입니다."
# Morpheme analysis
morphs = okt.morphs(text)
print("Morphemes:", morphs)
# ['자연어', '처리', '는', '인공지능', '의', '핵심', '분야', '입니다', '.']
# Part-of-speech tagging
pos_tags = okt.pos(text)
print("POS tags:", pos_tags)
# [('자연어', 'Noun'), ('처리', 'Noun'), ('는', 'Josa'), ...]
# Noun extraction
nouns = okt.nouns(text)
print("Nouns:", nouns)
# ['자연어', '처리', '인공지능', '핵심', '분야']
# MeCab provides faster and more accurate analysis for production use
# mecab = Mecab()
# mecab_morphs = mecab.morphs(text)
MeCab is a Japanese morphological analyzer ported to Korean; it outperforms Okt in speed and accuracy for large-scale data processing.
# Korean text preprocessing pipeline using KoNLPy
import re
from konlpy.tag import Okt
def preprocess_korean(text: str) -> list[str]:
    """Korean text preprocessing pipeline"""
    # Remove special characters (keep Hangul, Latin, digits)
    text = re.sub(r'[^가-힣a-zA-Z0-9\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    okt = Okt()
    # Extract only nouns, verbs, and adjectives (length > 1)
    tokens = [
        word for word, pos in okt.pos(text)
        if pos in ['Noun', 'Verb', 'Adjective'] and len(word) > 1
    ]
    return tokens
sample = "자연어 처리 기술이 빠르게 발전하고 있습니다!"
print(preprocess_korean(sample))
1.2 Stopword Removal and Normalization
KOREAN_STOPWORDS = [
    '이', '가', '을', '를', '은', '는', '의', '에', '에서',
    '로', '으로', '와', '과', '도', '만', '까지', '부터'
]

def remove_stopwords(tokens: list[str], stopwords: list[str]) -> list[str]:
    return [token for token in tokens if token not in stopwords]

def normalize_text(text: str) -> str:
    # Collapse repeated characters: 'ㅋㅋㅋㅋ' -> 'ㅋㅋ'
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)
    text = text.lower()
    return text
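A quick self-contained check of the repeated-character collapse (the function is redefined here so the snippet runs on its own):

```python
import re

def normalize_text(text: str) -> str:
    # Collapse runs of 3+ identical characters down to 2, then lowercase
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)
    return text.lower()

print(normalize_text("ㅋㅋㅋㅋㅋ So FUNNYYYY"))  # ㅋㅋ so funnyy
```

Note that runs of exactly two characters ('NN' in 'FUNNY') are left alone; only runs of three or more are shortened.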
2. Subword Tokenization: BPE, WordPiece, SentencePiece
2.1 BPE (Byte Pair Encoding)
BPE iteratively merges the most frequent adjacent symbol pair to build a vocabulary, starting from individual characters. It effectively mitigates the OOV (Out-of-Vocabulary) problem and is used in GPT-family models.
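The merge loop at BPE's core is small enough to sketch directly; the toy corpus below is assumed purely for illustration:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of symbol sequences."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (as a tuple of characters) -> frequency
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
for _ in range(3):  # three merge steps
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(pair)
```

Each iteration greedily picks the most frequent adjacent pair and fuses it into a new vocabulary symbol, which is exactly the training procedure libraries like `tokenizers` run at scale.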
from tokenizers import Tokenizer, trainers, pre_tokenizers, models, decoders
# Train a BPE tokenizer
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=30000,
    min_frequency=2,
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"]
)
files = ["train_corpus.txt"]
tokenizer.train(files, trainer)
# Tokenize example
encoding = tokenizer.encode("Natural Language Processing is fascinating!")
print("Tokens:", encoding.tokens)
print("IDs:", encoding.ids)
# Save and reload
tokenizer.save("bpe_tokenizer.json")
loaded_tokenizer = Tokenizer.from_file("bpe_tokenizer.json")
2.2 WordPiece vs SentencePiece
WordPiece (used in BERT) marks continuation subwords with the ## prefix. SentencePiece is language-agnostic and processes raw text without language-specific rules.
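WordPiece segmentation itself is greedy longest-match-first over the vocabulary; a minimal sketch with a toy vocabulary (assumed for illustration):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first segmentation, WordPiece-style."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation subwords carry the ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1  # shrink the candidate until it matches the vocabulary
        if cur is None:
            return ["[UNK]"]  # no subword matches: fall back to the unknown token
        tokens.append(cur)
        start = end
    return tokens

vocab = {"token", "##ization", "##ize", "un", "##related"}
print(wordpiece_tokenize("tokenization", vocab))  # ['token', '##ization']
```

The ## prefix distinguishes word-internal pieces from word-initial ones, so the original string can be reconstructed unambiguously.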
import sentencepiece as spm
# Train SentencePiece model
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='sp_model',
    vocab_size=32000,
    character_coverage=0.9995,  # Set high for CJK languages
    model_type='bpe',  # or 'unigram'
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3
)
sp = spm.SentencePieceProcessor()
sp.load('sp_model.model')
text = "Natural Language Processing is the core of AI."
tokens = sp.encode_as_pieces(text)
ids = sp.encode_as_ids(text)
print("SentencePiece tokens:", tokens)
| Method | Models | Key Feature |
|---|---|---|
| BPE | GPT-2, RoBERTa | Frequency-based merging, deterministic |
| WordPiece | BERT, DistilBERT | Likelihood maximization, ## prefix |
| SentencePiece | T5, mBART, XLM-R | Language-agnostic, handles whitespace |
| Unigram LM | ALBERT, XLNet | Probabilistic model, flexible segmentation |
3. Word Embeddings: Word2Vec, FastText, GloVe
3.1 Word2Vec
Word2Vec produces dense word representations using a shallow neural network. It offers two architectures: CBOW (Continuous Bag of Words) and Skip-gram.
from gensim.models import Word2Vec
corpus = [
    "Natural language processing is a core field of AI",
    "BERT is a bidirectional transformer model",
    "Text classification uses BERT for sentiment analysis",
]
# Tokenize corpus
tokenized_corpus = [sentence.lower().split() for sentence in corpus]
# Train Word2Vec (Skip-gram)
model = Word2Vec(
    sentences=tokenized_corpus,
    vector_size=100,
    window=5,
    min_count=1,
    sg=1,  # 1: Skip-gram, 0: CBOW
    workers=4,
    epochs=100
)
# Find similar words
similar = model.wv.most_similar('bert', topn=5)
print("Similar words:", similar)
# Word vector
vector = model.wv['bert']
print("Vector shape:", vector.shape) # (100,)
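Skip-gram's training signal is simply (center, context) pairs drawn from a sliding window over each sentence; a minimal sketch of how those pairs are generated:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for Skip-gram."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["nlp", "is", "fun"], window=1))
# [('nlp', 'is'), ('is', 'nlp'), ('is', 'fun'), ('fun', 'is')]
```

Skip-gram predicts each context word from the center word; CBOW inverts this and predicts the center from the averaged context.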
3.2 FastText and GloVe
FastText represents words as the sum of their character n-grams, enabling embedding of OOV words. GloVe uses global co-occurrence statistics.
from gensim.models import FastText
ft_model = FastText(
    sentences=tokenized_corpus,
    vector_size=100,
    window=5,
    min_count=1,
    sg=1,
    min_n=2,
    max_n=6,
    epochs=100
)
# OOV words are handled via n-gram composition
oov_vector = ft_model.wv['multilingualnlp'] # Works even if unseen
print("OOV vector shape:", oov_vector.shape)
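The n-gram decomposition FastText relies on can be sketched directly (boundary markers `<` and `>` included, as in the original model; the n-gram range here is illustrative):

```python
def char_ngrams(word, min_n=3, max_n=4):
    """FastText-style character n-grams with word boundary markers."""
    w = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i + n])
    return grams

print(char_ngrams("nlp"))
# ['<nl', 'nlp', 'lp>', '<nlp', 'nlp>']
```

An OOV word's vector is the sum of the vectors of its n-grams, which is why `ft_model.wv['multilingualnlp']` succeeds even though the word was never seen in training.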
3.3 Sentence Embeddings with Sentence Transformers
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
# Multilingual model
model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')
sentences = [
    "Natural language processing is a branch of artificial intelligence.",
    "NLP is an important area of AI research.",
    "The weather is sunny and pleasant today.",
]
embeddings = model.encode(sentences)
print("Embedding shape:", embeddings.shape) # (3, 768)
sim_matrix = cosine_similarity(embeddings)
print("Similarity matrix:")
print(sim_matrix.round(4))
# Semantic search
query = "What is a language model?"
query_embedding = model.encode([query])
similarities = cosine_similarity(query_embedding, embeddings)[0]
for sent, score in zip(sentences, similarities):
    print(f"Score {score:.4f}: {sent}")
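For reference, the cosine similarity used above is just a normalized dot product; a NumPy one-liner shows the idea:

```python
import numpy as np

def cosine_sim(a, b):
    """cos(a, b) = a.b / (|a| * |b|)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_sim(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # 1.0 (identical direction)
print(cosine_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0 (orthogonal)
```

Because SBERT models like the one above are trained so that semantically similar sentences have high cosine similarity, this single operation replaces a full cross-attention forward pass per pair.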
4. Sequence Model Evolution: RNN to Transformer
4.1 From RNN to LSTM and GRU
import torch
import torch.nn as nn
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(
            embed_dim, hidden_dim,
            num_layers=2,
            batch_first=True,
            dropout=0.3,
            bidirectional=True
        )
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, num_classes)
        )

    def forward(self, x):
        embedded = self.embedding(x)
        output, (hidden, cell) = self.lstm(embedded)
        # Concatenate last hidden states from both directions
        final_hidden = torch.cat([hidden[-2], hidden[-1]], dim=1)
        return self.classifier(final_hidden)
class GRUClassifier(nn.Module):
    """GRU has fewer parameters than LSTM and trains faster"""
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(
            embed_dim, hidden_dim,
            num_layers=2,
            batch_first=True,
            bidirectional=True
        )
        self.fc = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)
        _, hidden = self.gru(embedded)
        out = torch.cat([hidden[-2], hidden[-1]], dim=1)
        return self.fc(out)
4.2 Multi-Head Attention
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.d_k ** 0.5)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attn_weights = torch.softmax(scores, dim=-1)
        return torch.matmul(attn_weights, V), attn_weights

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        attn_output, _ = self.scaled_dot_product_attention(Q, K, V, mask)
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.W_o(attn_output)
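As a framework-free check of the scaled dot-product step, here is a NumPy sketch of the same formula, softmax(QK^T / sqrt(d_k))V:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention over 2-D arrays (positions x d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))  # 4 query positions, d_k = 8
K = rng.standard_normal((6, 8))  # 6 key positions
V = rng.standard_normal((6, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one output vector per query position
```

Each row of the attention-weight matrix is a probability distribution over key positions, which is what the `dim=-1` softmax in the PyTorch class above computes per head.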
5. BERT Family Models and Fine-tuning
5.1 BERT Pre-training Tasks
BERT (Bidirectional Encoder Representations from Transformers) uses two pre-training objectives:
- MLM (Masked Language Model): Select 15% of input tokens as prediction targets and train the model to recover the originals (most of the selected tokens are replaced with [MASK]).
- NSP (Next Sentence Prediction): Predict whether two sentences are consecutive in the original document.
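The MLM corruption rule (80% [MASK] / 10% random token / 10% unchanged among the selected 15%, per the BERT paper) can be sketched as follows; the helper name and toy vocabulary are illustrative:

```python
import random

def mlm_mask(tokens, vocab, mask_prob=0.15, seed=0):
    """BERT-style MLM corruption: returns (corrupted tokens, prediction labels)."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # this position becomes a prediction target
            r = rng.random()
            if r < 0.8:
                corrupted.append("[MASK]")           # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep unchanged
        else:
            labels.append(None)  # not a target; no loss is computed here
            corrupted.append(tok)
    return corrupted, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, labels = mlm_mask(tokens, vocab=tokens)
print(corrupted)
```

Keeping 10% of targets unchanged forces the model to produce useful representations for every position, not only for literal [MASK] tokens, since [MASK] never appears at fine-tuning time.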
5.2 KLUE-BERT for Korean Text Classification
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from datasets import Dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
model_name = "klue/bert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=5  # very negative / negative / neutral / positive / very positive
)

train_data = {
    "text": [
        "This movie is really entertaining! Highly recommended.",
        "The service was rude and the quality was poor.",
        "It was average. Nothing special.",
    ],
    "label": [4, 0, 2]
}

def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        max_length=128,
        padding="max_length",
        truncation=True
    )

dataset = Dataset.from_dict(train_data)
tokenized_dataset = dataset.map(tokenize_function, batched=True)

training_args = TrainingArguments(
    output_dir="./klue-bert-sentiment",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_ratio=0.1,
    weight_decay=0.01,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    fp16=True,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions, average="weighted")
    }

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset,  # demo only; use a held-out split in practice
    compute_metrics=compute_metrics,
)
trainer.train()
5.3 KoELECTRA NER Fine-tuning
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
model_name = "monologg/koelectra-base-finetuned-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
ner_model = AutoModelForTokenClassification.from_pretrained(model_name)
ner_pipeline = pipeline(
    "ner",
    model=ner_model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

# "Samsung Electronics chairman Lee Jae-yong held a press conference in Gangnam, Seoul."
text = "삼성전자 이재용 회장이 서울 강남에서 기자간담회를 열었다."
entities = ner_pipeline(text)
for entity in entities:
    print(f"Entity: {entity['word']}, Type: {entity['entity_group']}, "
          f"Score: {entity['score']:.4f}")
5.4 Comparing BERT Family Models
| Model | Key Improvement | Notes |
|---|---|---|
| BERT | Baseline | MLM + NSP |
| RoBERTa | Removes NSP, more data | Dynamic masking, larger batches |
| DeBERTa | Disentangled attention | Separate position and content attention |
| KLUE-BERT | Korean-specific | Trained on Korean Wikipedia and news |
| KoELECTRA | ELECTRA architecture | Generator-discriminator structure |
6. Text Task Practical Guide
6.1 Text Classification and Sentiment Analysis
from transformers import pipeline
# Sentiment analysis
sentiment = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)
result = sentiment("This product offers excellent value for the price.")
print(result) # [{'label': 'POSITIVE', 'score': 0.9987}]
# Summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
long_text = """
Natural language processing (NLP) is a subfield of linguistics, computer
science, and artificial intelligence. It focuses on the interactions between
computers and human language, particularly how to program computers to
process and analyze large amounts of natural language data.
"""
summary = summarizer(long_text, max_length=50, min_length=20)
print(summary[0]['summary_text'])
# Question answering
qa = pipeline(
    "question-answering",
    model="deepset/roberta-base-squad2"
)
context = "Seoul is the capital of South Korea with a population of approximately 9.5 million."
question = "What is the population of Seoul?"
answer = qa(question=question, context=context)
print(f"Answer: {answer['answer']}, Score: {answer['score']:.4f}")
6.2 Neural Machine Translation
from transformers import MarianMTModel, MarianTokenizer
model_name = "Helsinki-NLP/opus-mt-ko-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
def translate(text: str) -> str:
    inputs = tokenizer(text, return_tensors="pt", padding=True)
    translated = model.generate(**inputs, num_beams=5)
    return tokenizer.decode(translated[0], skip_special_tokens=True)
korean_text = "인공지능 기술이 우리의 삶을 크게 변화시키고 있습니다."
print(f"Translation: {translate(korean_text)}")
7. Building RAG Systems
7.1 RAG Pipeline Overview
Retrieval-Augmented Generation (RAG) enhances LLM answer quality by retrieving relevant external knowledge. The pipeline consists of: document chunking → embedding → vector storage → retrieval → re-ranking → generation.
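The chunking step can be approximated in plain Python (fixed-size character windows with overlap, a deliberate simplification of what LangChain's RecursiveCharacterTextSplitter does with its separator hierarchy):

```python
def chunk_text(text, chunk_size=100, overlap=20):
    """Split text into overlapping fixed-size character chunks."""
    step = chunk_size - overlap  # each window starts `step` characters after the last
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("a" * 250, chunk_size=100, overlap=20)
print([len(c) for c in chunks])  # [100, 100, 90, 10]
```

The overlap ensures a sentence straddling a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice in the vector store.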
7.2 LangChain RAG Pipeline with pgvector
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import PGVector
from langchain_community.chat_models import ChatOllama
from langchain.chains import RetrievalQA
from langchain.schema import Document
# Step 1: Document chunking
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " ", ""],
    length_function=len,
)

documents = [
    Document(
        page_content="BERT is a bidirectional transformer model published by Google in 2018...",
        metadata={"source": "nlp_guide.txt", "page": 1}
    )
]
chunks = text_splitter.split_documents(documents)
print(f"Number of chunks: {len(chunks)}")

# Step 2: Embedding model (multilingual support)
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True}
)

# Step 3: Store in pgvector
CONNECTION_STRING = "postgresql+psycopg2://user:password@localhost:5432/ragdb"
COLLECTION_NAME = "nlp_documents"
vectorstore = PGVector.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
)

# Step 4: Configure retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

# Step 5: Build RAG chain
llm = ChatOllama(model="llama3.1:8b-instruct-q4_K_M")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)
query = "What are BERT's pre-training methods?"
result = qa_chain.invoke({"query": query})
print(result["result"])
7.3 Two-Stage Retrieval: Bi-encoder + Cross-encoder
from sentence_transformers import SentenceTransformer, CrossEncoder
from sentence_transformers.util import cos_sim
# Stage 1: Bi-encoder for fast candidate retrieval
bi_encoder = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')
# Stage 2: Cross-encoder for precise re-ranking
cross_encoder = CrossEncoder('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')
def two_stage_retrieval(query: str, corpus: list[str], top_k: int = 5) -> list[str]:
    """Two-stage retrieval: bi-encoder + cross-encoder"""
    # Stage 1: Fast initial retrieval (top 20 candidates)
    query_emb = bi_encoder.encode(query, convert_to_tensor=True)
    corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)
    scores = cos_sim(query_emb, corpus_emb)[0]
    top_indices = scores.argsort(descending=True)[:20].tolist()
    candidates = [corpus[i] for i in top_indices]
    # Stage 2: Cross-encoder re-ranking for precision
    pairs = [[query, doc] for doc in candidates]
    rerank_scores = cross_encoder.predict(pairs)
    ranked = sorted(
        zip(candidates, rerank_scores),
        key=lambda x: x[1],
        reverse=True
    )
    return [doc for doc, _ in ranked[:top_k]]

corpus = [
    "BERT is pre-trained using masked language modeling.",
    "GPT generates text in an autoregressive manner.",
    "Today the weather is clear with a temperature of 20 degrees.",
]
results = two_stage_retrieval("How does BERT learn?", corpus)
for doc in results:
    print(f"- {doc}")
8. Multilingual NLP and Korean Language Specifics
8.1 mBERT and XLM-R
from transformers import AutoTokenizer, AutoModel
import torch
# XLM-RoBERTa: supports 100 languages
xlmr_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
xlmr_model = AutoModel.from_pretrained("xlm-roberta-base")
sentences = {
    "Korean": "자연어 처리는 인공지능의 핵심입니다.",
    "English": "NLP is the core of artificial intelligence.",
    "Japanese": "自然言語処理はAIの核心です。",
}

for lang, sentence in sentences.items():
    inputs = xlmr_tokenizer(
        sentence,
        return_tensors="pt",
        max_length=128,
        padding=True,
        truncation=True
    )
    with torch.no_grad():
        outputs = xlmr_model(**inputs)
    # Use the first token's embedding (<s>, XLM-R's [CLS] equivalent) as the sentence representation
    sentence_embedding = outputs.last_hidden_state[:, 0, :]
    print(f"{lang}: embedding shape = {sentence_embedding.shape}")
8.2 CJK Tokenization Specifics
CJK (Chinese, Japanese, Korean) languages require specialized handling since whitespace-based tokenization is ineffective.
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
texts = {
    "Korean": "자연어처리",  # Agglutinative: postpositions attach to stems
    "Chinese": "自然语言处理",  # Isolating: each character is a meaning unit
    "Japanese": "自然言語処理",  # Agglutinative + kanji/hiragana/katakana mix
    "English": "NaturalLanguageProcessing",
}

for lang, text in texts.items():
    tokens = tokenizer.tokenize(text)
    print(f"{lang}: {tokens}")
8.3 Korean Language Challenges
# Korean agglutinative characteristics:
# One verb stem can produce hundreds of surface forms:
# 먹다, 먹어, 먹었다, 먹을, 먹히다, 먹이다, 먹어서, 먹더라도...
from konlpy.tag import Mecab
def extract_korean_features(text: str) -> dict:
    """Extract linguistic features from Korean text"""
    mecab = Mecab()
    pos_tags = mecab.pos(text)
    features = {
        "nouns": [],
        "verbs": [],
        "adjectives": [],
        "proper_nouns": []
    }
    for word, tag in pos_tags:
        # Check NNP before the generic NN* prefix, or proper nouns are never reached
        if tag == 'NNP':
            features["proper_nouns"].append(word)
        elif tag.startswith('NN'):
            features["nouns"].append(word)
        elif tag.startswith('VV'):
            features["verbs"].append(word)
        elif tag.startswith('VA'):
            features["adjectives"].append(word)
    return features

# "Samsung Electronics announced development of a new AI semiconductor."
text = "삼성전자가 새로운 AI 반도체 개발을 발표했습니다."
features = extract_korean_features(text)
print(features)
Quiz
Q1. How does BPE tokenization solve the OOV problem?
Answer: BPE iteratively merges the most frequently occurring character pairs to build a subword vocabulary. Any unseen word can be decomposed into learned subword units, minimizing OOV occurrences.
Explanation: Traditional word-based vocabularies handle unseen words as [UNK], losing all information. BPE starts from characters and merges frequent pairs, so neologisms and compound words can always be represented as a combination of existing subwords. For example, "multilingualnlp" might be split into "multi", "lingual", "nlp" or even character pairs, but never falls back to [UNK].
Q2. What are BERT's MLM and NSP pre-training tasks?
Answer: MLM randomly masks 15% of input tokens and trains the model to predict the original tokens. NSP trains the model to classify whether two sentences are consecutive in the original document.
Explanation: MLM enables bidirectional context understanding — BERT can use both left and right context to predict a masked word, which GPT (unidirectional) cannot do. Of the masked 15%, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. NSP was later shown by RoBERTa to not meaningfully improve performance and was dropped in subsequent models.
Q3. What is the advantage of two-stage retrieval combining bi-encoder and cross-encoder in RAG?
Answer: Bi-encoders independently encode queries and documents enabling fast ANN search, while cross-encoders process query-document pairs jointly for precise relevance scoring. Combining both achieves good speed and accuracy simultaneously.
Explanation: A bi-encoder alone is fast but cannot capture fine-grained query-document interaction, limiting accuracy. A cross-encoder is highly accurate but requires evaluating all documents at query time, making it too slow for large corpora. The two-stage approach uses the bi-encoder to produce a small candidate set (20-100 documents), then applies the cross-encoder only to those candidates — getting the best of both worlds.
Q4. Why is SBERT more efficient than BERT for sentence similarity computation?
Answer: SBERT uses a Siamese/Triplet network to encode sentences into fixed-size vectors. Similarity computation only requires comparing pre-computed vectors with cosine similarity. BERT requires a new forward pass for every sentence pair.
Explanation: Computing pairwise similarities for n sentences with BERT requires n(n-1)/2 forward passes (roughly 65 hours for 10,000 sentences), while SBERT requires n encoding steps followed by fast vector operations (approximately 5 seconds). This makes SBERT orders of magnitude faster for semantic search, clustering, and other large-scale comparison tasks.
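The pair count behind that estimate is easy to verify:

```python
n = 10_000
pairs = n * (n - 1) // 2  # unordered sentence pairs, each needing one BERT forward pass
print(pairs)  # 49995000
```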
Q5. Why does Korean agglutinative morphology require different processing from English?
Answer: Korean attaches postpositions, verb endings, and affixes to stems, producing enormous surface form variety. Morpheme analysis is required to normalize words to their stems; without it, '인공지능의', '인공지능이', '인공지능을' are treated as three different words rather than the same concept.
Explanation: English morphology is relatively simple — 'eat', 'eats', 'eating' — and basic stemming suffices. Korean stems can combine with hundreds of postposition and ending combinations, so a specialized morphological analyzer (MeCab, Okt, Komoran) is essential. This also affects vocabulary size: Korean BERT models trained with raw whitespace tokenization show dramatically worse performance than those with morpheme-aware preprocessing.
Summary
This guide covered the full NLP text processing stack: Korean morpheme analysis, BPE/WordPiece/SentencePiece tokenization, Word2Vec/FastText/SBERT embeddings, the RNN-to-Transformer evolution, BERT family fine-tuning (KLUE-BERT, KoELECTRA), practical NLP tasks (NER, sentiment, QA, summarization, translation), RAG pipeline construction with pgvector and LangChain, two-stage retrieval, and multilingual processing with XLM-R. The key insight for Korean NLP is that agglutinative morphology demands morpheme-aware tokenization, and for RAG systems, combining bi-encoder speed with cross-encoder precision delivers the best retrieval quality.