NLP & Text Processing Complete Guide: BERT Fine-tuning, RAG Systems, and Multilingual Processing

Natural Language Processing (NLP) is a core AI technology that enables computers to understand and generate human language. This guide covers everything from text preprocessing basics through BERT fine-tuning, RAG pipeline construction, and multilingual processing — with hands-on code throughout.

1. Text Preprocessing and Tokenization

2. Subword Tokenization: BPE, WordPiece, SentencePiece

3. Word Embeddings: Word2Vec, FastText, GloVe

4. Sequence Model Evolution: RNN to Transformer

5. BERT Family Models and Fine-tuning

6. Text Task Practical Guide

7. Building RAG Systems

8. Multilingual NLP and Korean Language Specifics

9. Quiz

1. Text Preprocessing and Tokenization

1.1 Korean Morpheme Analysis (KoNLPy, MeCab)

Korean is an agglutinative language where various affixes attach to stems to express meaning. Simple whitespace-based tokenization (effective in English) is therefore insufficient for Korean. KoNLPy is the leading Python library for Korean morpheme analysis.

    from konlpy.tag import Okt, Mecab, Komoran

    okt = Okt()
    text = "자연어 처리는 인공지능의 핵심 분야입니다."

    # Morpheme analysis
    morphs = okt.morphs(text)
    print("Morphemes:", morphs)
    # ['자연어', '처리', '는', '인공지능', '의', '핵심', '분야', '입니다', '.']

    # Part-of-speech tagging
    pos_tags = okt.pos(text)
    print("POS tags:", pos_tags)
    # [('자연어', 'Noun'), ('처리', 'Noun'), ('는', 'Josa'), ...]

    # Noun extraction
    nouns = okt.nouns(text)
    print("Nouns:", nouns)
    # ['자연어', '처리', '인공지능', '핵심', '분야']

    # MeCab provides faster and more accurate analysis for production use
    mecab = Mecab()
    mecab_morphs = mecab.morphs(text)

MeCab was originally developed as a Japanese morphological analyzer and later adapted to Korean; it is generally faster than Okt and is often preferred for large-scale data processing.

    # Korean text preprocessing pipeline using KoNLPy
    import re

    from konlpy.tag import Okt


    def preprocess_korean(text: str) -> list[str]:
        """Korean text preprocessing pipeline"""
        # Remove special characters (keep Hangul, Latin, digits)
        text = re.sub(r'[^가-힣a-zA-Z0-9\s]', '', text)
        text = re.sub(r'\s+', ' ', text).strip()

        okt = Okt()

        # Extract only nouns, verbs, and adjectives (length > 1)
        tokens = [
            word for word, pos in okt.pos(text)
            if pos in ['Noun', 'Verb', 'Adjective'] and len(word) > 1
        ]
        return tokens


    sample = "자연어 처리 기술이 빠르게 발전하고 있습니다!"
    print(preprocess_korean(sample))

1.2 Stopword Removal and Normalization

    import re

    KOREAN_STOPWORDS = [
        '이', '가', '을', '를', '은', '는', '의', '에', '에서',
        '로', '으로', '와', '과', '도', '만', '까지', '부터'
    ]


    def remove_stopwords(tokens: list[str], stopwords: list[str]) -> list[str]:
        return [token for token in tokens if token not in stopwords]


    def normalize_text(text: str) -> str:
        # Collapse repeated characters: 'ㅋㅋㅋㅋ' -> 'ㅋㅋ'
        text = re.sub(r'(.)\1{2,}', r'\1\1', text)
        text = text.lower()
        return text

2. Subword Tokenization: BPE, WordPiece, SentencePiece

2.1 BPE (Byte Pair Encoding)

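BPE starts from individual characters (or bytes) and iteratively merges the most frequent adjacent symbol pair to build a subword vocabulary. Because unseen words can still be split into known subwords, it largely avoids the OOV (Out-of-Vocabulary) problem. GPT-family models use it.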

    from tokenizers import Tokenizer, trainers, pre_tokenizers, models, decoders

    # Train a BPE tokenizer
    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
    tokenizer.decoder = decoders.ByteLevel()

    trainer = trainers.BpeTrainer(
        vocab_size=30000,
        min_frequency=2,
        special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"]
    )

    files = ["train_corpus.txt"]
    tokenizer.train(files, trainer)

    # Tokenize example
    encoding = tokenizer.encode("Natural Language Processing is fascinating!")
    print("Tokens:", encoding.tokens)
    print("IDs:", encoding.ids)

    # Save and reload
    tokenizer.save("bpe_tokenizer.json")
    loaded_tokenizer = Tokenizer.from_file("bpe_tokenizer.json")
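To make the merge procedure from the start of this section concrete, here is a minimal pure-Python sketch of BPE vocabulary learning. It is a teaching toy, not how the `tokenizers` library is implemented: the helper names (`get_pair_counts`, `merge_pair`, `learn_bpe`) and the word counts are made up for illustration, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter


def get_pair_counts(words: dict[tuple[str, ...], int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words: dict[tuple[str, ...], int], pair: tuple[str, str]) -> dict:
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs: dict[str, int], num_merges: int):
    """Learn up to `num_merges` merges, starting from single characters."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


# Hypothetical word counts, chosen only to make the merges easy to follow
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_counts, num_merges=3)
print(merges)      # [('e', 's'), ('es', 't'), ('l', 'o')]
print(segmented)
```

After three merges the frequent suffix `est` has become a single symbol, which is why a later unseen word such as "lowest" can still be segmented into known pieces.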

2.2 WordPiece vs SentencePiece

WordPiece (used in BERT) marks continuation subwords with the `##` prefix. SentencePiece is language-agnostic and processes raw text without language-specific rules.

    import sentencepiece as spm

    # Train SentencePiece model
    spm.SentencePieceTrainer.train(
        input='corpus.txt',
        model_prefix='sp_model',
        vocab_size=32000,
        character_coverage=0.9995,  # Set high for CJK languages
        model_type='bpe',           # or 'unigram'
        pad_id=0,
        unk_id=1,
        bos_id=2,
        eos_id=3
    )

    sp = spm.SentencePieceProcessor()
    sp.load('sp_model.model')

    text = "Natural Language Processing is the core of NLP."
    tokens = sp.encode_as_pieces(text)
    ids = sp.encode_as_ids(text)
    print("SentencePiece tokens:", tokens)

| Method        | Models                  | Key Feature                                                    |
| ------------- | ----------------------- | -------------------------------------------------------------- |
| BPE           | GPT-2, RoBERTa          | Frequency-based merging, deterministic                         |
| WordPiece     | BERT, DistilBERT, mBERT | Likelihood maximization, `##` prefix                           |
| SentencePiece | T5, mBART, XLM-R        | Language-agnostic library (BPE or Unigram), handles whitespace |
| Unigram LM    | ALBERT, XLNet           | Probabilistic model, flexible segmentation                     |
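The `##` continuation prefix is easiest to understand from the way a trained WordPiece vocabulary is applied to a word: BERT-style tokenizers use greedy longest-match-first lookup. The sketch below illustrates only that lookup step under simplifying assumptions. The function name `wordpiece_tokenize` and the tiny vocabulary are hypothetical, and vocabulary training is not shown.

```python
def wordpiece_tokenize(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark non-initial pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary for illustration only
toy_vocab = {"token", "##ization", "##ize", "un", "##related"}

print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("unrelated", toy_vocab))     # ['un', '##related']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

Because pieces after the first carry `##`, the original word can be rebuilt by concatenating the pieces and dropping the prefix.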

3. Word Embeddings: Word2Vec, FastText, GloVe

3.1 Word2Vec

Word2Vec produces dense word representations using a shallow neural network. It offers two architectures: CBOW (Continuous Bag of Words) and Skip-gram.

    from gensim.models import Word2Vec

    corpus = [
        "Natural language processing is a core field of AI",
        "BERT is a bidirectional transformer model",
        "Text classification uses BERT for sentiment analysis",
    ]

    # Tokenize corpus (whitespace split is enough for this English toy corpus)
    tokenized_corpus = [sentence.lower().split() for sentence in corpus]

    # Train Word2Vec (Skip-gram)
    model = Word2Vec(
        sentences=tokenized_corpus,
        vector_size=100,
        window=5,
        min_count=1,
        sg=1,  # 1: Skip-gram, 0: CBOW
        workers=4,
        epochs=100
    )

    # Find similar words
    similar = model.wv.most_similar('bert', topn=5)
    print("Similar words:", similar)

    # Word vector
    vector = model.wv['bert']
    print("Vector shape:", vector.shape)  # (100,)
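The difference between the two Word2Vec architectures is mostly a difference in how training examples are cut from a sentence. The sketch below shows that step only, with no neural network involved. The helper names `skipgram_pairs` and `cbow_examples` are made up for illustration, and the fixed window is a simplification of what real implementations do.

```python
def skipgram_pairs(tokens: list[str], window: int = 2) -> list[tuple[str, str]]:
    """Skip-gram: each (center, context) pair is one training example."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens: list[str], window: int = 2) -> list[tuple[list[str], str]]:
    """CBOW: the whole context window predicts the center word."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples


toy_sentence = ["bert", "is", "a", "model"]

for center, context in skipgram_pairs(toy_sentence, window=1):
    print(f"skip-gram: {center} -> {context}")

for context, target in cbow_examples(toy_sentence, window=1):
    print(f"cbow: {context} -> {target}")
```

Skip-gram produces one example per neighbor, which is why it tends to be slower to train than CBOW on the same text.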

3.2 FastText and GloVe

FastText represents words as the sum of their character n-grams, enabling embedding of OOV words. GloVe uses global co-occurrence statistics.

    from gensim.models import FastText

    # Reuses tokenized_corpus from the Word2Vec example above
    ft_model = FastText(
        sentences=tokenized_corpus,
        vector_size=100,
        window=5,
        min_count=1,
        sg=1,
        min_n=2,
        max_n=6,
        epochs=100
    )

    # OOV words are handled via n-gram composition
    oov_vector = ft_model.wv['multilingualnlp']  # Works even if unseen
    print("OOV vector shape:", oov_vector.shape)
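To see why an unseen word still gets a vector, it helps to look at the character n-grams themselves. The sketch below extracts them the way the FastText approach describes, wrapping the word in `<` and `>` boundary markers. The function name `char_ngrams` is an assumption for illustration and this is not gensim's internal code.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """Return all character n-grams of the boundary-marked word."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


print(char_ngrams("nlp", min_n=3, max_n=3))  # ['<nl', 'nlp', 'lp>']

# An unseen word shares many n-grams with a word seen during training,
# so its vector can be composed from n-gram vectors that were already learned.
seen = set(char_ngrams("multilingual"))
unseen = set(char_ngrams("multilingualnlp"))
shared = seen & unseen
print(len(shared), "shared n-grams, e.g.", sorted(shared)[:5])
```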

3.3 Sentence Embeddings with Sentence Transformers

    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    # Multilingual model
    model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')

    sentences = [
        "Natural language processing is a branch of artificial intelligence.",
        "NLP is an important area of AI research.",
        "The weather is sunny and pleasant today.",
    ]

    embeddings = model.encode(sentences)
    print("Embedding shape:", embeddings.shape)  # (3, 768)

    sim_matrix = cosine_similarity(embeddings)
    print("Similarity matrix:")
    print(sim_matrix.round(4))

    # Semantic search
    query = "What is a language model?"
    query_embedding = model.encode([query])
    similarities = cosine_similarity(query_embedding, embeddings)[0]

    for sent, score in zip(sentences, similarities):
        print(f"Score {score:.4f}: {sent}")

4. Sequence Model Evolution: RNN to Transformer

4.1 From RNN to LSTM and GRU

    import torch
    import torch.nn as nn


    class LSTMClassifier(nn.Module):
        def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            self.lstm = nn.LSTM(
                embed_dim, hidden_dim,
                num_layers=2,
                batch_first=True,
                dropout=0.3,
                bidirectional=True
            )
            self.classifier = nn.Sequential(
                nn.Linear(hidden_dim * 2, hidden_dim),
                nn.ReLU(),
                nn.Dropout(0.3),
                nn.Linear(hidden_dim, num_classes)
            )

        def forward(self, x):
            embedded = self.embedding(x)
            output, (hidden, cell) = self.lstm(embedded)
            # Concatenate last hidden states from both directions
            # (hidden[-2]: forward, hidden[-1]: backward, both from the top layer)
            final_hidden = torch.cat([hidden[-2], hidden[-1]], dim=1)
            return self.classifier(final_hidden)


    class GRUClassifier(nn.Module):
        """GRU has fewer parameters than LSTM and trains faster"""

        def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            self.gru = nn.GRU(
                embed_dim, hidden_dim,
                num_layers=2,
                batch_first=True,
                bidirectional=True
            )
            self.fc = nn.Linear(hidden_dim * 2, num_classes)

        def forward(self, x):
            embedded = self.embedding(x)
            _, hidden = self.gru(embedded)
            out = torch.cat([hidden[-2], hidden[-1]], dim=1)
            return self.fc(out)

4.2 Multi-Head Attention

    import torch
    import torch.nn as nn


    class MultiHeadAttention(nn.Module):
        def __init__(self, d_model, num_heads):
            super().__init__()
            assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
            self.d_model = d_model
            self.num_heads = num_heads
            self.d_k = d_model // num_heads  # dimension of each head

            self.W_q = nn.Linear(d_model, d_model)
            self.W_k = nn.Linear(d_model, d_model)
            self.W_v = nn.Linear(d_model, d_model)
            self.W_o = nn.Linear(d_model, d_model)

        def scaled_dot_product_attention(self, Q, K, V, mask=None):
            # Scale by sqrt(d_k) so the softmax inputs do not grow with head size
            scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.d_k ** 0.5)
            if mask is not None:
                scores = scores.masked_fill(mask == 0, -1e9)
            attn_weights = torch.softmax(scores, dim=-1)
            return torch.matmul(attn_weights, V), attn_weights

        def forward(self, query, key, value, mask=None):
            batch_size = query.size(0)

            # Project, then reshape to (batch, heads, seq_len, d_k)
            Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
            K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
            V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

            attn_output, _ = self.scaled_dot_product_attention(Q, K, V, mask)

            # Merge heads back to (batch, seq_len, d_model)
            attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
            return self.W_o(attn_output)

5. BERT Family Models and Fine-tuning

5.1 BERT Pre-training Tasks

BERT (Bidirectional Encoder Representations from Transformers) uses two pre-training objectives:

- **MLM (Masked Language Model)**: Select 15% of input tokens as prediction targets, replace most of them with `[MASK]`, and train the model to predict the original tokens.

- **NSP (Next Sentence Prediction)**: Predict whether two sentences are consecutive in the original document.
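The MLM corruption step can be sketched in a few lines. In the original BERT recipe, each selected token is replaced with `[MASK]` 80% of the time, with a random token 10% of the time, and left unchanged 10% of the time. The function below is a simplified illustration under those assumptions: `mask_tokens` and the toy vocabulary are hypothetical names, and real implementations also skip special tokens such as `[CLS]` and `[SEP]`.

```python
import random


def mask_tokens(tokens: list[str], vocab: list[str],
                mask_prob: float = 0.15, seed: int = 0):
    """Return (corrupted_tokens, labels); labels are None where no loss is computed."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue              # token not selected for prediction
        labels[i] = tok           # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)
        # otherwise keep the original token unchanged
    return corrupted, labels


toy_vocab = ["the", "cat", "sat", "on", "mat"]
sentence = ["the", "cat", "sat", "on", "the", "mat"] * 5

corrupted, labels = mask_tokens(sentence, toy_vocab)
for original, new, label in zip(sentence, corrupted, labels):
    if label is not None:
        print(f"{original!r} -> {new!r}")
```

Keeping some selected tokens unchanged or randomly replaced means the model cannot rely on `[MASK]` being present, which matters because `[MASK]` never appears at fine-tuning time.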

5.2 KLUE-BERT for Korean Text Classification

    import numpy as np
    from datasets import Dataset
    from sklearn.metrics import accuracy_score, f1_score
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    from transformers import TrainingArguments, Trainer

    model_name = "klue/bert-base"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=5  # very negative / negative / neutral / positive / very positive
    )

    train_data = {
        "text": [
            "This movie is really entertaining! Highly recommended.",
            "The service was rude and the quality was poor.",
            "It was average. Nothing special.",
        ],
        "label": [4, 0, 2]
    }


    def tokenize_function(examples):
        return tokenizer(
            examples["text"],
            max_length=128,
            padding="max_length",
            truncation=True
        )


    dataset = Dataset.from_dict(train_data)
    tokenized_dataset = dataset.map(tokenize_function, batched=True)

    training_args = TrainingArguments(
        output_dir="./klue-bert-sentiment",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=32,
        warmup_ratio=0.1,
        weight_decay=0.01,
        learning_rate=2e-5,
        evaluation_strategy="epoch",  # newer transformers releases rename this to eval_strategy
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        fp16=True,  # requires a CUDA GPU; set to False on CPU
    )


    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return {
            "accuracy": accuracy_score(labels, predictions),
            "f1": f1_score(labels, predictions, average="weighted")
        }


    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset,
        eval_dataset=tokenized_dataset,  # toy example: use a held-out split in practice
        compute_metrics=compute_metrics,
    )

    trainer.train()

5.3 KoELECTRA NER Fine-tuning

    from transformers import AutoTokenizer, AutoModelForTokenClassification
    from transformers import pipeline

    model_name = "monologg/koelectra-base-finetuned-ner"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    ner_model = AutoModelForTokenClassification.from_pretrained(model_name)

    ner_pipeline = pipeline(
        "ner",
        model=ner_model,
        tokenizer=tokenizer,
        aggregation_strategy="simple"
    )

    text = "Samsung Electronics CEO Lee Jae-yong held a press conference in Gangnam, Seoul."
    entities = ner_pipeline(text)

    for entity in entities:
        print(f"Entity: {entity['word']}, Type: {entity['entity_group']}, "
              f"Score: {entity['score']:.4f}")

5.4 Comparing BERT Family Models

| Model     | Key Improvement        | Notes                                   |
| --------- | ---------------------- | --------------------------------------- |
| BERT      | Baseline               | MLM + NSP                               |
| RoBERTa   | Removes NSP, more data | Dynamic masking, larger batches         |
| DeBERTa   | Disentangled attention | Separate position and content attention |
| KLUE-BERT | Korean-specific        | Trained on Korean Wikipedia and news    |
| KoELECTRA | ELECTRA architecture   | Generator-discriminator structure       |

6. Text Task Practical Guide

6.1 Text Classification and Sentiment Analysis

    from transformers import pipeline

    # Sentiment analysis
    sentiment = pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english"
    )
    result = sentiment("This product offers excellent value for the price.")
    print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.9987}]

    # Summarization
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    long_text = """
    Natural language processing (NLP) is a subfield of linguistics, computer
    science, and artificial intelligence. It focuses on the interactions between
    computers and human language, particularly how to program computers to
    process and analyze large amounts of natural language data.
    """
    summary = summarizer(long_text, max_length=50, min_length=20)
    print(summary[0]['summary_text'])

    # Question answering
    qa = pipeline(
        "question-answering",
        model="deepset/roberta-base-squad2"
    )
    context = "Seoul is the capital of South Korea with a population of approximately 9.5 million."
    question = "What is the population of Seoul?"
    answer = qa(question=question, context=context)
    print(f"Answer: {answer['answer']}, Score: {answer['score']:.4f}")

6.2 Neural Machine Translation

    from transformers import MarianMTModel, MarianTokenizer

    model_name = "Helsinki-NLP/opus-mt-ko-en"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)


    def translate(text: str) -> str:
        inputs = tokenizer(text, return_tensors="pt", padding=True)
        translated = model.generate(**inputs, num_beams=5)
        return tokenizer.decode(translated[0], skip_special_tokens=True)


    korean_text = "인공지능 기술이 우리의 삶을 크게 변화시키고 있습니다."
    print(f"Translation: {translate(korean_text)}")

7. Building RAG Systems

7.1 RAG Pipeline Overview

Retrieval-Augmented Generation (RAG) enhances LLM answer quality by retrieving relevant external knowledge. The pipeline consists of: document chunking → embedding → vector storage → retrieval → re-ranking → generation.
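Before wiring up real components, the stages can be traced end to end with a dependency-free toy. The sketch below is an illustration under loud assumptions: the "embedding" is a plain bag-of-words count vector standing in for a neural encoder, the "vector store" is a Python list, re-ranking is omitted, and generation stops at building the prompt that would be sent to an LLM. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_words(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    """Chunking: sliding window over words, with overlap between chunks."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]


def embed(text: str) -> Counter:
    """Toy embedding: term counts (a stand-in for a sentence encoder)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, store: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Retrieval: rank stored chunks by similarity to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query: str, contexts: list[str]) -> str:
    """Generation input: retrieved chunks are placed in front of the question."""
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only the context below.\n{joined}\nQuestion: {query}"


docs = [
    "BERT is pre-trained with masked language modeling and next sentence prediction.",
    "GPT generates text in an autoregressive manner.",
    "Today the weather is clear and sunny.",
]

# Vector storage: keep each chunk next to its embedding
store = [(chunk, embed(chunk))
         for doc in docs
         for chunk in chunk_words(doc, chunk_size=20, overlap=0)]

question = "How is BERT pre-trained?"
top = retrieve(question, store, k=1)
print(build_prompt(question, top))
```

Swapping `embed` for a sentence encoder and the list for a vector database changes the quality and scale of retrieval, but not the shape of the pipeline.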

7.2 LangChain RAG Pipeline with pgvector

    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import PGVector
    from langchain_community.chat_models import ChatOllama
    from langchain.chains import RetrievalQA
    from langchain.schema import Document

    # Step 1: Document chunking
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=50,
        separators=["\n\n", "\n", ".", " ", ""],
        length_function=len,
    )

    documents = [
        Document(
            page_content="BERT is a bidirectional transformer model published by Google in 2018...",
            metadata={"source": "nlp_guide.txt", "page": 1}
        )
    ]

    chunks = text_splitter.split_documents(documents)
    print(f"Number of chunks: {len(chunks)}")

    # Step 2: Embedding model (multilingual support)
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
        model_kwargs={"device": "cuda"},  # use "cpu" if no GPU is available
        encode_kwargs={"normalize_embeddings": True}
    )

    # Step 3: Store in pgvector
    # Placeholder credentials; replace with your own database settings
    CONNECTION_STRING = "postgresql+psycopg2://user:password@localhost:5432/ragdb"
    COLLECTION_NAME = "nlp_documents"

    vectorstore = PGVector.from_documents(
        documents=chunks,
        embedding=embeddings,
        collection_name=COLLECTION_NAME,
        connection_string=CONNECTION_STRING,
    )

    # Step 4: Configure retriever
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 5}
    )

    # Step 5: Build RAG chain
    llm = ChatOllama(model="llama3.1:8b-instruct-q4_K_M")

    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True,
    )

    query = "What are BERT's pre-training methods?"
    result = qa_chain.invoke({"query": query})
    print(result["result"])

7.3 Two-Stage Retrieval: Bi-encoder + Cross-encoder

    from sentence_transformers import SentenceTransformer, CrossEncoder
    from sentence_transformers.util import cos_sim

    # Stage 1: Bi-encoder for fast candidate retrieval
    bi_encoder = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')

    # Stage 2: Cross-encoder for precise re-ranking
    cross_encoder = CrossEncoder('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')


    def two_stage_retrieval(query: str, corpus: list[str], top_k: int = 5) -> list[str]:
        """Two-stage retrieval: bi-encoder + cross-encoder"""
        # Stage 1: Fast initial retrieval (top 20 candidates)
        query_emb = bi_encoder.encode(query, convert_to_tensor=True)
        corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)

        scores = cos_sim(query_emb, corpus_emb)[0]
        top_indices = scores.argsort(descending=True)[:20].tolist()
        candidates = [corpus[i] for i in top_indices]

        # Stage 2: Cross-encoder re-ranking for precision
        pairs = [[query, doc] for doc in candidates]
        rerank_scores = cross_encoder.predict(pairs)

        ranked = sorted(
            zip(candidates, rerank_scores),
            key=lambda x: x[1],
            reverse=True
        )
        return [doc for doc, _ in ranked[:top_k]]


    corpus = [
        "BERT is pre-trained using masked language modeling.",
        "GPT generates text in an autoregressive manner.",
        "Today the weather is clear with a temperature of 20 degrees.",
    ]

    results = two_stage_retrieval("How does BERT learn?", corpus)
    for doc in results:
        print(f"- {doc}")

8. Multilingual NLP and Korean Language Specifics

8.1 mBERT and XLM-R

    import torch
    from transformers import AutoTokenizer, AutoModel

    # XLM-RoBERTa: supports 100 languages
    xlmr_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    xlmr_model = AutoModel.from_pretrained("xlm-roberta-base")

    sentences = {
        "Korean": "자연어 처리는 인공지능의 핵심입니다.",
        "English": "NLP is the core of artificial intelligence.",
        "Japanese": "自然言語処理はAIの核心です。",
    }

    for lang, sentence in sentences.items():
        inputs = xlmr_tokenizer(
            sentence,
            return_tensors="pt",
            max_length=128,
            padding=True,
            truncation=True
        )

        with torch.no_grad():
            outputs = xlmr_model(**inputs)

        # Use the first token embedding as the sentence representation
        # (XLM-R names this token <s>; it plays the role of BERT's [CLS])
        sentence_embedding = outputs.last_hidden_state[:, 0, :]
        print(f"{lang}: embedding shape = {sentence_embedding.shape}")

8.2 CJK Tokenization Specifics

CJK (Chinese, Japanese, Korean) languages require specialized handling since whitespace-based tokenization is ineffective.

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

    texts = {
        "Korean": "자연어처리",      # Agglutinative: postpositions attach to stems
        "Chinese": "自然语言处理",   # Isolating: each character is a meaning unit
        "Japanese": "自然言語処理",  # Agglutinative + kanji/hiragana/katakana mix
        "English": "NaturalLanguageProcessing",
    }

    for lang, text in texts.items():
        tokens = tokenizer.tokenize(text)
        print(f"{lang}: {tokens}")

8.3 Korean Language Challenges

    # Korean agglutinative characteristics:
    # One verb stem can produce hundreds of surface forms:
    # 먹다, 먹어, 먹었다, 먹을, 먹히다, 먹이다, 먹어서, 먹더라도...

    from konlpy.tag import Mecab


    def extract_korean_features(text: str) -> dict:
        """Extract linguistic features from Korean text"""
        mecab = Mecab()
        pos_tags = mecab.pos(text)

        features = {
            "nouns": [],
            "verbs": [],
            "adjectives": [],
            "proper_nouns": []
        }

        for word, tag in pos_tags:
            # Check NNP first: it also starts with 'NN', so the generic
            # noun branch would otherwise catch every proper noun
            if tag == 'NNP':
                features["proper_nouns"].append(word)
            elif tag.startswith('NN'):
                features["nouns"].append(word)
            elif tag.startswith('VV'):
                features["verbs"].append(word)
            elif tag.startswith('VA'):
                features["adjectives"].append(word)

        return features


    text = "Samsung announced development of a new AI semiconductor."
    features = extract_korean_features(text)
    print(features)

Quiz

**Answer**: BPE iteratively merges the most frequently occurring character pairs to build a subword vocabulary. Any unseen word can be decomposed into learned subword units, minimizing OOV occurrences.

**Explanation**: Traditional word-level vocabularies map unseen words to `[UNK]`, losing all information. BPE starts from characters (or bytes) and merges frequent pairs, so neologisms and compound words can be represented as a combination of existing subwords. For example, "multilingualnlp" might be split into "multi", "lingual", "nlp", or into even smaller pieces. A fallback to `[UNK]` happens only when a character is missing from the base vocabulary, which byte-level BPE rules out entirely.

**Answer**: MLM randomly masks 15% of input tokens and trains the model to predict the original tokens. NSP trains the model to classify whether two sentences are consecutive in the original document.

**Explanation**: MLM enables bidirectional context understanding — BERT can use both left and right context to predict a masked word, which GPT (unidirectional) cannot do. Of the masked 15%, 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged. NSP was later shown by RoBERTa to not meaningfully improve performance and was dropped in subsequent models.

**Answer**: Bi-encoders independently encode queries and documents enabling fast ANN search, while cross-encoders process query-document pairs jointly for precise relevance scoring. Combining both achieves good speed and accuracy simultaneously.

**Explanation**: A bi-encoder alone is fast but cannot capture fine-grained query-document interaction, limiting accuracy. A cross-encoder is highly accurate but requires evaluating all documents at query time, making it too slow for large corpora. The two-stage approach uses the bi-encoder to produce a small candidate set (20-100 documents), then applies the cross-encoder only to those candidates — getting the best of both worlds.

**Answer**: SBERT uses a Siamese/Triplet network to encode sentences into fixed-size vectors. Similarity computation only requires comparing pre-computed vectors with cosine similarity. BERT requires a new forward pass for every sentence pair.

**Explanation**: Computing pairwise similarities for n sentences with BERT requires n(n-1)/2 forward passes (roughly 65 hours for 10,000 sentences), while SBERT requires n encoding steps followed by fast vector operations (approximately 5 seconds). This makes SBERT orders of magnitude faster for semantic search, clustering, and other large-scale comparison tasks.
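The pair count behind these figures is easy to verify. The short calculation below takes the 65-hour figure quoted above as given and derives the number of cross-encoder forward passes and the implied time per pass; the variable names are illustrative.

```python
def num_pairs(n: int) -> int:
    """Number of unordered sentence pairs among n sentences: n(n-1)/2."""
    return n * (n - 1) // 2


n = 10_000
pairs = num_pairs(n)
print(f"{pairs:,} forward passes")  # 49,995,000 forward passes

# Implied cost per pair if the full comparison takes about 65 hours
seconds_total = 65 * 3600
ms_per_pair = seconds_total / pairs * 1000
print(f"about {ms_per_pair:.1f} ms per pair")  # about 4.7 ms per pair

# With pre-computed embeddings, only n encoder passes are needed
print(f"{n:,} encoder passes, {pairs // n:,}x fewer")  # 10,000 encoder passes, 4,999x fewer
```

The quadratic growth is the core issue: doubling the number of sentences roughly quadruples the number of pairwise forward passes, while the number of encoding passes only doubles.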

**Answer**: Korean attaches postpositions, verb endings, and affixes to stems, producing enormous surface form variety. Morpheme analysis is required to normalize words to their stems; without it, '인공지능의', '인공지능이', '인공지능을' are treated as three different words rather than the same concept.

**Explanation**: English morphology is relatively simple ('eat', 'eats', 'eating'), and basic stemming often suffices. Korean stems can combine with hundreds of postposition and ending combinations, so a specialized morphological analyzer (MeCab, Okt, Komoran) is usually needed. This also affects vocabulary size: with plain whitespace tokenization every surface form becomes its own vocabulary entry, which inflates the vocabulary and tends to hurt model quality compared with morpheme-aware preprocessing.
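The vocabulary effect can be shown with the three surface forms from the answer. The sketch below compares whitespace tokens with tokens after removing a trailing particle. The suffix list and `strip_particle` are a deliberately naive stand-in for a morphological analyzer, used here only so the example runs without KoNLPy; blind suffix stripping mis-handles nouns that happen to end in a particle character (for example '아이'), which is exactly why a real analyzer is needed.

```python
# Hypothetical, incomplete particle list for illustration only
PARTICLES = ("의", "이", "을", "를", "은", "는", "가")


def strip_particle(token: str) -> str:
    """Naively drop one trailing particle character, if present."""
    for particle in PARTICLES:
        if token.endswith(particle) and len(token) > len(particle):
            return token[: -len(particle)]
    return token


tokens = "인공지능의 인공지능이 인공지능을".split()

surface_vocab = set(tokens)
stem_vocab = {strip_particle(t) for t in tokens}

print(len(surface_vocab), surface_vocab)  # 3 distinct surface forms
print(len(stem_vocab), stem_vocab)        # 1 shared stem: {'인공지능'}
```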

Summary

This guide covered the full NLP text processing stack: Korean morpheme analysis, BPE/WordPiece/SentencePiece tokenization, Word2Vec/FastText/SBERT embeddings, the RNN-to-Transformer evolution, BERT family fine-tuning (KLUE-BERT, KoELECTRA), practical NLP tasks (NER, sentiment, QA, summarization, translation), RAG pipeline construction with pgvector and LangChain, two-stage retrieval, and multilingual processing with XLM-R. The key insight for Korean NLP is that agglutinative morphology demands morpheme-aware tokenization, and for RAG systems, combining bi-encoder speed with cross-encoder precision delivers the best retrieval quality.
