Complete Guide to Korean LLM Training Data: Hugging Face Datasets, Preprocessing, and Quality Control
- Author: Youngju Kim (@fjvbn20031)
- Introduction: Why Training Data Matters More Than Model Architecture
- 1. Hugging Face Datasets Deep Dive
- 2. Korean Data Collection Methods
- 3. Data Preprocessing Pipeline
- 4. Instruction Tuning Data Formats
- 5. RLHF/DPO Dataset Construction
- 6. Data Quality Metrics
- 7. Full Pipeline: Building a Korean SFT Dataset from Scratch
- 8. Quiz
- 9. References
Introduction: Why Training Data Matters More Than Model Architecture
In 2024, Microsoft Research's Phi-3 paper sent shockwaves through the industry: a 3.8B-parameter model matched or outperformed 7B-13B models, and the stated secret was meticulously curated, high-quality training data. Meta's LIMA paper ("Less Is More for Alignment") demonstrated that just 1,000 carefully curated samples could produce responses that human evaluators judged equivalent or preferable to GPT-4's in a substantial share of comparisons.
The phrase "data is the new oil" may be a cliché, but in the LLM era it has never been more accurate. While architectural innovations (Transformers, MoE, state space models) matter, the same architecture can produce vastly different performance depending on data quality and diversity.
The Korean LLM ecosystem is growing rapidly:
| Model | Developer | Parameters | Features |
|---|---|---|---|
| SOLAR | Upstage | 10.7B | Depth Up-Scaling, Korean-optimized |
| EXAONE | LG AI Research | 7.8B | Enterprise Korean LLM |
| HyperCLOVA X | NAVER | Undisclosed | NAVER's flagship large-scale Korean LLM |
| Qwen-KO | Community | Various | Qwen-based Korean fine-tuning |
| KULLM | Korea Univ. | 13B | Korean open-source LLM |
| Polyglot-Ko | EleutherAI | 12.8B | Korean pre-trained model |
What determines the performance of all these models is ultimately training data. This guide covers everything from Hugging Face dataset usage to Korean data collection, preprocessing, Instruction Tuning formats, and RLHF/DPO dataset construction.
1. Hugging Face Datasets Deep Dive
1.1 Platform Overview
Hugging Face is an ML community platform hosting over 150,000 datasets as of 2023. It provides dataset viewers, download statistics, and automatic documentation features.
Key Features:
- Dataset Viewer: Preview data directly in the browser
- Download Stats: Monthly download counts
- Dataset Card: Dataset metadata, licensing, and usage documentation
- Streaming: Load without full download
- Git LFS: Large file version control
1.2 Dataset Types by Category
Pre-training Data
Large-scale text corpora that form the model's foundational language understanding.
| Dataset | Size | Languages | Description |
|---|---|---|---|
| CC-100 | 2.5TB | 100+ langs | Cleaned Common Crawl corpus |
| mC4 | 27TB | 101 langs | Google's multilingual C4 |
| Korean Wikipedia | ~1GB | Korean | Full Korean Wikipedia |
| Namuwiki | ~5GB | Korean | Namuwiki dump (non-commercial) |
| KCC | ~30GB | Korean | Korean web crawl data |
| OSCAR | Various | Multilingual | Classified Common Crawl corpus |
SFT/Instruction Tuning Data
Core data for teaching LLMs to follow instructions.
| Dataset | Size | Format | Description |
|---|---|---|---|
| Alpaca (Stanford) | 52K | instruction/input/output | Generated via Self-Instruct |
| ShareGPT | 90K+ | conversations | Real ChatGPT conversations |
| LIMA | 1K | instruction/output | Hand-curated high quality |
| OpenOrca | 4M | instruction/output | Includes GPT-4 responses |
| Dolly 2.0 | 15K | instruction/output | Hand-crafted, commercially usable |
| FLAN Collection | 1836 tasks | Various | Google's large Instruction collection |
RLHF/DPO Data
Alignment data reflecting human preferences.
| Dataset | Size | Structure | Description |
|---|---|---|---|
| HH-RLHF (Anthropic) | 170K | chosen/rejected | Helpfulness + harmlessness preference |
| UltraFeedback | 64K | 4-aspect ratings | GPT-4-based automatic evaluation |
| Nectar | 183K | ranked list | 7-model response rankings |
| Chatbot Arena | Ongoing | ELO scores | Human blind comparison |
Evaluation Benchmarks
| Benchmark | Domain | Korean Support |
|---|---|---|
| MMLU | 57 academic subjects | Translated version available |
| ARC | Science reasoning | Translated version |
| HellaSwag | Common sense reasoning | Translated version |
| KoBBQ | Bias evaluation | Native Korean |
| KLUE | Korean NLU | Native Korean |
| KorNAT | Korean common sense | Native Korean |
1.3 Korean-Specific Datasets
Korean LLM Dataset Ecosystem
├── Pre-training
│ ├── Korean Wikipedia (~600K articles)
│ ├── Namuwiki Dump (~5GB)
│ ├── AI Hub Corpora (NIKL)
│ └── mC4-ko (Korean subset)
├── Instruction Tuning
│ ├── KoAlpaca (beomi) - 52K
│ ├── KoVicuna (melodysdreamj) - 40K+
│ ├── KOpen-platypus - 25K
│ ├── ko_wikidata_QA - Wiki-based QA
│ └── kullm-v2 (Korea Univ.) - 152K
├── Preference/Alignment
│ ├── ko-rlhf (community)
│ └── KoreanFeedback (custom-built)
└── Evaluation
├── KLUE (8 tasks)
├── KoBBQ (bias)
└── KorNAT (common sense)
1.4 Practical datasets Library Usage
Basic Loading and Exploration
from datasets import load_dataset, Dataset, DatasetDict
# Basic loading
ds = load_dataset("beomi/KoAlpaca-v1.1a")
print(ds)
# DatasetDict({
# train: Dataset({
# features: ['instruction', 'output'],
# num_rows: 21155
# })
# })
# Load specific split
train_ds = load_dataset("beomi/KoAlpaca-v1.1a", split="train")
# Inspect first 5 examples
for example in train_ds.select(range(5)):
print(f"Instruction: {example['instruction'][:50]}...")
print(f"Output: {example['output'][:50]}...")
print("---")
Filtering and Transformation
# Length-based filtering
filtered_ds = ds["train"].filter(
lambda x: len(x["instruction"]) > 10 and len(x["output"]) > 20
)
print(f"After filtering: {len(filtered_ds)} / {len(ds['train'])}")
# Convert to Alpaca format
def format_alpaca(example):
text = f"""### Instruction:
{example['instruction']}
### Response:
{example['output']}"""
return {"text": text}
formatted_ds = filtered_ds.map(format_alpaca)
# Apply tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("beomi/llama-2-ko-7b")
def tokenize_function(examples):
return tokenizer(
examples["text"],
truncation=True,
max_length=2048,
padding="max_length",
)
tokenized_ds = formatted_ds.map(
tokenize_function,
batched=True,
remove_columns=formatted_ds.column_names,
)
Streaming Mode (Large Datasets)
# Stream large datasets (memory efficient)
streaming_ds = load_dataset(
"allenai/c4",
"ko",
split="train",
streaming=True,
)
# Iterate through first 100 examples
for i, example in enumerate(streaming_ds):
if i >= 100:
break
process(example["text"])
# Streaming + filtering + batch processing
filtered_stream = streaming_ds.filter(
lambda x: len(x["text"]) > 100
).take(10000)
# Process in batches
batch = []
for example in filtered_stream:
batch.append(example)
if len(batch) == 32:
process_batch(batch)
batch = []
Upload to Hugging Face Hub
from datasets import Dataset
import pandas as pd
# Create dataset from DataFrame
df = pd.DataFrame({
"instruction": ["What is Korea's capital?", "What is Python?"],
"output": ["Korea's capital is Seoul.", "Python is a programming language."],
})
my_dataset = Dataset.from_pandas(df)
# Upload to Hub
my_dataset.push_to_hub(
"my-org/my-korean-dataset",
private=True,
token="hf_xxxxx",
)
# Auto-generate Dataset Card
from huggingface_hub import DatasetCard
card = DatasetCard.load("my-org/my-korean-dataset")
card.text = """
# My Korean Dataset
Korean Instruction Tuning dataset.
## Data Structure
- instruction: Question/instruction text
- output: Response
## License
CC-BY-4.0
"""
card.push_to_hub("my-org/my-korean-dataset")
2. Korean Data Collection Methods
2.1 Web Crawling
# News crawling with newspaper3k
from newspaper import Article
import json
def crawl_article(url):
"""Crawl news article (must comply with robots.txt!)"""
article = Article(url, language="ko")
article.download()
article.parse()
return {
"title": article.title,
"text": article.text,
"publish_date": str(article.publish_date),
"source_url": url,
}
# Large-scale crawling with Scrapy
# scrapy_spider.py
"""
import scrapy
class KoreanTextSpider(scrapy.Spider):
name = 'korean_text'
custom_settings = {
'ROBOTSTXT_OBEY': True, # Must comply with robots.txt
'DOWNLOAD_DELAY': 2, # 2-second intervals
'CONCURRENT_REQUESTS': 4, # Limit concurrent requests
}
def parse(self, response):
text = response.css('article::text').getall()
yield {
'url': response.url,
'text': ' '.join(text),
}
"""
Crawling Best Practices:
- Always comply with robots.txt
- Keep request intervals of at least 1-2 seconds
- Verify copyright and licensing
- Apply mandatory PII filtering
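Robots.txt compliance can be checked programmatically before fetching anything. A minimal sketch using Python's standard `urllib.robotparser` (the rules string and URLs below are hypothetical, for illustration only):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for illustration
ROBOTS_TXT = """User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def make_robot_checker(robots_text):
    """Parse robots.txt text and return a parser that answers can_fetch queries."""
    rp = RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp

rp = make_robot_checker(ROBOTS_TXT)
print(rp.can_fetch("*", "https://example.com/news/article-1"))  # allowed
print(rp.can_fetch("*", "https://example.com/private/data"))    # disallowed
```

In a real crawler, `RobotFileParser.set_url()` plus `read()` would fetch the live robots.txt, and `crawl_delay("*")` would inform the request interval.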
2.2 Public Data Sources
| Source | URL | Data Type | License |
|---|---|---|---|
| AI Hub | aihub.or.kr | Various Korean corpora | Public |
| Modoo Corpus | corpus.korean.go.kr | Written/spoken corpora | CC BY |
| NIKL | korean.go.kr | Standard corpora | Academic |
| Data Portal | data.go.kr | Government public data | Public |
# AI Hub data loading example
import json
import glob
def load_aihub_data(data_dir):
"""Load AI Hub JSON format data"""
all_data = []
for filepath in glob.glob(f"{data_dir}/**/*.json", recursive=True):
with open(filepath, "r", encoding="utf-8") as f:
data = json.load(f)
if "document" in data:
for doc in data["document"]:
for sent in doc.get("sentence", []):
all_data.append({
"text": sent.get("form", ""),
"source": "aihub",
})
return all_data
2.3 Translation-Based Data Generation
# Translation using NLLB (No Language Left Behind)
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
def translate_en_to_ko(text):
"""English -> Korean translation"""
tokenizer.src_lang = "eng_Latn"
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
translated = model.generate(
**inputs,
forced_bos_token_id=tokenizer.convert_tokens_to_ids("kor_Hang"),
max_length=512,
)
return tokenizer.decode(translated[0], skip_special_tokens=True)
# Translation quality validation
def validate_translation(original, translated):
"""Automatic translation quality validation"""
checks = {
"not_empty": len(translated.strip()) > 0,
"not_too_short": len(translated) > len(original) * 0.3,
"not_too_long": len(translated) < len(original) * 3,
"no_english_majority": sum(1 for c in translated if c.isascii()) / max(len(translated), 1) < 0.5,
}
return all(checks.values()), checks
2.4 Synthetic Data Generation
Self-Instruct Approach
import openai
import json
import random
# Self-Instruct: Generate new instructions from seed data
SEED_INSTRUCTIONS = [
"Explain the four seasons of Korea.",
"What are the advantages of list comprehension in Python?",
"Tell me the key points to consider when writing an email.",
]
def generate_new_instructions(seed_instructions, num_generate=10):
"""Generate new instructions using GPT-4"""
prompt = f"""Here are example Korean instructions:
{chr(10).join(f'{i+1}. {inst}' for i, inst in enumerate(seed_instructions))}
Generate {num_generate} completely new Korean instructions in a similar style.
Include diverse topics (science, history, technology, daily life, etc.).
Write each instruction on a single line with a number."""
client = openai.OpenAI()
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=0.8,
)
    # parse_instructions (not shown here) splits the numbered list in the reply into strings
    return parse_instructions(response.choices[0].message.content)
def generate_response(instruction):
"""Generate a response for the instruction"""
client = openai.OpenAI()
response = client.chat.completions.create(
model="gpt-4",
messages=[
{"role": "system", "content": "You are a helpful Korean AI assistant."},
{"role": "user", "content": instruction},
],
temperature=0.7,
)
return response.choices[0].message.content
Evol-Instruct Approach
def evolve_instruction(instruction, evolution_type="deepen"):
"""WizardLM's Evol-Instruct: Progressively complexify instructions"""
evolution_prompts = {
"deepen": f"""Make the following instruction deeper and more specific.
Original: {instruction}
Evolved version:""",
"broaden": f"""Broaden the scope of the following instruction to be more comprehensive.
Original: {instruction}
Evolved version:""",
"concretize": f"""Add specific conditions or constraints to the following instruction.
Original: {instruction}
Evolved version:""",
"reasoning": f"""Transform the following instruction into a form requiring step-by-step reasoning.
Original: {instruction}
Evolved version:""",
}
client = openai.OpenAI()
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": evolution_prompts[evolution_type]}],
temperature=0.7,
)
return response.choices[0].message.content
2.5 Community Data Sources
- Namuwiki: Rich Korean content (non-commercial CC-BY-NC-SA)
- Korean Reddit: r/korea, r/hanguk, etc.
- Stack Overflow Korean: Technical Q&A
- Naver Knowledge iN: Crawl with caution (check terms of service)
- Korean Wikipedia: CC-BY-SA license
3. Data Preprocessing Pipeline
3.1 Overall Pipeline Architecture
Raw Data
|
v
+-------------------+
| 1. Text Cleaning | HTML tag removal, encoding cleanup
+--------+----------+
v
+-------------------+
| 2. Lang Detection | Filter Korean-only text
+--------+----------+
v
+-------------------+
| 3. Deduplication | MinHash LSH, Exact Match
+--------+----------+
v
+-------------------+
| 4. Quality Filter | Perplexity, length, toxicity
+--------+----------+
v
+-------------------+
| 5. PII Removal | Personal info masking
+--------+----------+
v
+-------------------+
| 6. Tokenization | SentencePiece / BPE
+--------+----------+
v
Cleaned Data
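The stages above can be chained into a single driver function. A minimal sketch with stand-in stage functions (the real implementations appear in the subsections that follow; the lambdas here are placeholders):

```python
def run_pipeline(raw_docs, stages):
    """Apply each cleaning stage in order; a stage returns None to drop a document."""
    docs = raw_docs
    for name, stage in stages:
        before = len(docs)
        docs = [d for d in (stage(doc) for doc in docs) if d is not None]
        print(f"{name}: {before} -> {len(docs)}")
    return docs

# Stand-in stages for illustration
stages = [
    ("clean", lambda d: d.strip() or None),
    ("lang_filter", lambda d: d if not d.isascii() else None),  # keep non-ASCII (Korean) text
    ("length_filter", lambda d: d if len(d) >= 5 else None),
]

docs = ["  안녕하세요, 반갑습니다  ", "english only text", "짧음"]
cleaned = run_pipeline(docs, stages)
print(cleaned)  # ['안녕하세요, 반갑습니다']
```

Keeping each stage as an independent callable makes the per-stage survival counts easy to log, which is how data loss between pipeline steps is usually diagnosed.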
3.2 Text Cleaning
import re
import html
import unicodedata
def clean_text(text):
"""Basic Korean text cleaning"""
# Decode HTML entities
text = html.unescape(text)
# Remove HTML tags
text = re.sub(r'<[^>]+>', '', text)
# Remove URLs
text = re.sub(r'https?://\S+|www\.\S+', '', text)
# Remove emails
text = re.sub(r'\S+@\S+\.\S+', '[EMAIL]', text)
# Mask phone numbers
text = re.sub(r'\d{2,3}-\d{3,4}-\d{4}', '[PHONE]', text)
# Normalize whitespace
text = re.sub(r'\s+', ' ', text)
# Unicode normalization (NFC)
text = unicodedata.normalize('NFC', text)
return text.strip()
def clean_korean_specific(text):
    """Drop ad-like text (patterns below are English stand-ins; use Korean ad phrases in practice)"""
# Remove advertising patterns
ad_patterns = [
r'click\s*now',
r'free\s*consultation',
r'contact\s*us',
r'call\s*now',
]
for pattern in ad_patterns:
if re.search(pattern, text, re.IGNORECASE):
return None
return text
# Batch processing
def clean_batch(texts):
"""Batch cleaning"""
cleaned = []
for text in texts:
result = clean_text(text)
result = clean_korean_specific(result)
if result and len(result) > 20:
cleaned.append(result)
return cleaned
3.3 Deduplication
from datasketch import MinHash, MinHashLSH
import hashlib
class TextDeduplicator:
"""MinHash LSH-based approximate deduplication"""
def __init__(self, threshold=0.8, num_perm=128):
self.threshold = threshold
self.num_perm = num_perm
self.lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
self.seen_exact = set()
def get_minhash(self, text):
"""Generate MinHash for text"""
m = MinHash(num_perm=self.num_perm)
# Split into 3-grams
for i in range(len(text) - 2):
m.update(text[i:i+3].encode('utf-8'))
return m
def is_duplicate(self, text, doc_id):
"""Check for duplicates"""
# 1. Exact matching (hash-based)
text_hash = hashlib.md5(text.encode('utf-8')).hexdigest()
if text_hash in self.seen_exact:
return True
self.seen_exact.add(text_hash)
# 2. Approximate matching (MinHash LSH)
minhash = self.get_minhash(text)
result = self.lsh.query(minhash)
if result:
return True
self.lsh.insert(doc_id, minhash)
return False
def deduplicate(self, documents):
"""Deduplicate document list"""
unique_docs = []
for i, doc in enumerate(documents):
if not self.is_duplicate(doc["text"], f"doc_{i}"):
unique_docs.append(doc)
print(f"Dedup: {len(documents)} -> {len(unique_docs)} "
f"({len(documents) - len(unique_docs)} removed)")
return unique_docs
3.4 Language Detection Filtering
import fasttext
# Load fasttext language detection model
model_path = "lid.176.bin"  # download from https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
lang_model = fasttext.load_model(model_path)
def detect_language(text):
"""Detect text language"""
text_clean = text.replace('\n', ' ')[:200]
predictions = lang_model.predict(text_clean)
lang = predictions[0][0].replace('__label__', '')
confidence = predictions[1][0]
return lang, confidence
def filter_korean(documents, min_confidence=0.7):
"""Filter Korean-only text"""
korean_docs = []
for doc in documents:
lang, conf = detect_language(doc["text"])
if lang == "ko" and conf >= min_confidence:
korean_docs.append(doc)
return korean_docs
3.5 Quality Filtering
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
class QualityFilter:
"""Text quality filtering"""
def __init__(self):
self.criteria = {
"min_length": 50,
"max_length": 10000,
"min_words": 10,
"max_repetition_ratio": 0.3,
"max_special_char_ratio": 0.1,
}
def check_length(self, text):
"""Length-based filter"""
return self.criteria["min_length"] <= len(text) <= self.criteria["max_length"]
def check_repetition(self, text):
"""Detect repetitive text"""
words = text.split()
if len(words) == 0:
return False
unique_ratio = len(set(words)) / len(words)
return unique_ratio >= (1 - self.criteria["max_repetition_ratio"])
def check_special_chars(self, text):
"""Check special character ratio"""
special = sum(1 for c in text if not c.isalnum() and not c.isspace()
and c not in '.,!?;:')
return special / max(len(text), 1) < self.criteria["max_special_char_ratio"]
def compute_perplexity(self, text, model, tokenizer, device="cuda"):
"""Perplexity-based quality assessment (lower = more natural text)"""
inputs = tokenizer(text, return_tensors="pt", truncation=True,
max_length=512).to(device)
with torch.no_grad():
outputs = model(**inputs, labels=inputs["input_ids"])
return torch.exp(outputs.loss).item()
def filter(self, text):
"""Comprehensive quality filtering"""
return (
self.check_length(text)
and self.check_repetition(text)
and self.check_special_chars(text)
)
3.6 PII (Personally Identifiable Information) Removal
import re
from typing import Dict, List
class PIIRemover:
"""PII removal for Korean text"""
PATTERNS = {
"SSN": r'\d{6}[-]?\d{7}',
"PHONE": r'0\d{1,2}[-.]?\d{3,4}[-.]?\d{4}',
"EMAIL": r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
"CARD": r'\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}',
        "ACCOUNT": r'\d{3,6}[-]?\d{2,6}[-]?\d{2,6}[-]?\d{0,3}',  # broad pattern; review matches before masking
"IP": r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}',
}
def remove_pii(self, text: str) -> str:
"""Replace PII with mask tokens"""
for pii_type, pattern in self.PATTERNS.items():
mask_token = f"[{pii_type}]"
text = re.sub(pattern, mask_token, text)
return text
def detect_pii(self, text: str) -> Dict[str, List[str]]:
"""Detect PII (for review before removal)"""
found = {}
for pii_type, pattern in self.PATTERNS.items():
matches = re.findall(pattern, text)
if matches:
found[pii_type] = matches
return found
3.7 Tokenization Considerations
from tokenizers import Tokenizer, models, trainers, pre_tokenizers
# Train Korean BPE tokenizer
def train_korean_tokenizer(text_files, vocab_size=32000):
"""Train Korean-optimized BPE tokenizer"""
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# Pre-tokenization is critical for Korean
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
pre_tokenizers.ByteLevel(add_prefix_space=False),
])
trainer = trainers.BpeTrainer(
vocab_size=vocab_size,
special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
min_frequency=2,
)
tokenizer.train(text_files, trainer)
return tokenizer
# Compare tokenizer efficiency
def compare_tokenizer_efficiency(text, tokenizers_dict):
"""Compare Korean token efficiency across tokenizers"""
print(f"Original ({len(text)} chars): {text[:50]}...")
print("-" * 60)
for name, tok in tokenizers_dict.items():
tokens = tok.encode(text)
token_strs = tok.convert_ids_to_tokens(tokens)
fertility = len(tokens) / len(text.split())
print(f"{name}: {len(tokens)} tokens, fertility={fertility:.2f}")
print(f" First 10 tokens: {token_strs[:10]}")
4. Instruction Tuning Data Formats
4.1 Alpaca Format
The most basic and widely used format.
{
"instruction": "Summarize the following text.",
"input": "Artificial Intelligence (AI) is a subfield of computer science that artificially implements human learning, reasoning, and perception capabilities...",
"output": "AI is a technology that implements human intelligence on computers, and it is being utilized in various fields through advances in machine learning and deep learning."
}
def format_alpaca(instruction, input_text="", output=""):
    """Generate an Alpaca-format record"""
    return {
        "instruction": instruction,
        "input": input_text,
        "output": output,
    }
# Alpaca prompt templates
ALPACA_TEMPLATE = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input}
### Response:
{output}"""
ALPACA_TEMPLATE_NO_INPUT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response:
{output}"""
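A small helper can pick between the two templates depending on whether the input field is non-empty (a sketch built directly on the templates above):

```python
ALPACA_TEMPLATE = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input}
### Response:
{output}"""

ALPACA_TEMPLATE_NO_INPUT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response:
{output}"""

def render_alpaca(example):
    """Render an Alpaca record into a single training string, omitting an empty Input section."""
    if example.get("input"):
        return ALPACA_TEMPLATE.format(**example)
    return ALPACA_TEMPLATE_NO_INPUT.format(
        instruction=example["instruction"], output=example["output"]
    )

print(render_alpaca({"instruction": "Summarize the text.", "input": "", "output": "Done."}))
```

Branching on the empty input is important: training on a template that always contains an empty `### Input:` section teaches the model a spurious pattern.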
4.2 ShareGPT Format
A format for representing multi-turn conversations.
{
"conversations": [
{
"from": "human",
"value": "What is the difference between lists and tuples in Python?"
},
{
"from": "gpt",
"value": "The key differences between lists and tuples in Python are...\n\n1. **Mutability**: Lists are mutable, tuples are immutable\n2. **Performance**: Tuples are more memory efficient\n3. **Syntax**: Lists use [], tuples use ()"
},
{
"from": "human",
"value": "When should I prefer tuples?"
},
{
"from": "gpt",
"value": "Tuples are best used in these cases:\n\n1. When data should not be modified (coordinates, RGB values)\n2. When used as dictionary keys\n3. When returning multiple values from functions"
}
]
}
4.3 OpenAI Messages Format
Standard format compatible with the OpenAI API.
{
"messages": [
{
"role": "system",
"content": "You are an AI assistant fluent in Korean. Provide accurate and helpful answers."
},
{
"role": "user",
"content": "Explain the difference between machine learning and deep learning."
},
{
"role": "assistant",
"content": "Let me explain the key differences between ML and deep learning..."
}
]
}
4.4 Chat Templates (Model-Specific Differences)
# Llama 3 Chat Template
LLAMA3_TEMPLATE = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>
{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{assistant_message}<|eot_id|>"""
# Mistral Chat Template
MISTRAL_TEMPLATE = """<s>[INST] {system_message}
{user_message} [/INST]{assistant_message}</s>"""
# Qwen 2 Chat Template
QWEN2_TEMPLATE = """<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_message}<|im_end|>"""
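In practice these templates are rarely hand-assembled: with the transformers library, `tokenizer.apply_chat_template(messages, tokenize=False)` renders the correct template for the loaded model. As an offline illustration, the Qwen 2 template above can be filled with plain `str.format` (the message content is hypothetical):

```python
QWEN2_TEMPLATE = """<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_message}<|im_end|>"""

rendered = QWEN2_TEMPLATE.format(
    system_message="You are a helpful Korean AI assistant.",
    user_message="What is the capital of Korea?",
    assistant_message="The capital of Korea is Seoul.",
)
print(rendered.count("<|im_start|>"))  # 3 turns: system, user, assistant
```

Mixing templates (e.g., training on the Llama 3 format but serving with the Qwen format) silently degrades output quality, which is why using the tokenizer's own chat template is the safer default.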
4.5 Converting Between Formats
def sharegpt_to_openai(sharegpt_data):
"""ShareGPT -> OpenAI Messages conversion"""
messages = []
role_map = {"human": "user", "gpt": "assistant", "system": "system"}
for conv in sharegpt_data["conversations"]:
messages.append({
"role": role_map.get(conv["from"], conv["from"]),
"content": conv["value"],
})
return {"messages": messages}
def alpaca_to_openai(alpaca_data, system_prompt=""):
"""Alpaca -> OpenAI Messages conversion"""
messages = []
if system_prompt:
messages.append({"role": "system", "content": system_prompt})
user_content = alpaca_data["instruction"]
if alpaca_data.get("input"):
user_content += f"\n\n{alpaca_data['input']}"
messages.append({"role": "user", "content": user_content})
messages.append({"role": "assistant", "content": alpaca_data["output"]})
return {"messages": messages}
def openai_to_sharegpt(openai_data):
"""OpenAI Messages -> ShareGPT conversion"""
role_map = {"user": "human", "assistant": "gpt", "system": "system"}
conversations = []
for msg in openai_data["messages"]:
conversations.append({
"from": role_map.get(msg["role"], msg["role"]),
"value": msg["content"],
})
return {"conversations": conversations}
# Batch conversion
def batch_convert(dataset, source_format, target_format):
"""Convert entire dataset format"""
converters = {
("sharegpt", "openai"): sharegpt_to_openai,
("alpaca", "openai"): alpaca_to_openai,
("openai", "sharegpt"): openai_to_sharegpt,
}
converter = converters.get((source_format, target_format))
if not converter:
raise ValueError(f"Unsupported conversion: {source_format} -> {target_format}")
return [converter(item) for item in dataset]
5. RLHF/DPO Dataset Construction
5.1 Preference Data Structure
# DPO (Direct Preference Optimization) data structure
dpo_example = {
"prompt": "Recommend healthy traditional Korean foods.",
"chosen": "Here are some of the healthiest traditional Korean foods:\n\n1. **Kimchi**: Rich in probiotics, vitamin C, and dietary fiber.\n2. **Doenjang-jjigae**: Fermented food with anti-cancer properties and rich in protein.\n3. **Mixed grain rice**: Provides balanced nutrition from various grains.\n4. **Namul (seasoned vegetables)**: Spinach, bean sprouts, etc. provide various vitamins and minerals.",
"rejected": "Umm... bibimbap I guess. It's tasty. Bulgogi too.",
}
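Before training, each preference record should be sanity-checked. A minimal validator sketch (field names follow the DPO structure above; the check list is illustrative, not exhaustive):

```python
def validate_dpo_example(ex):
    """Return a list of problems found in a DPO preference record (empty list = OK)."""
    problems = []
    for field in ("prompt", "chosen", "rejected"):
        if not ex.get(field, "").strip():
            problems.append(f"empty field: {field}")
    if ex.get("chosen") == ex.get("rejected"):
        problems.append("chosen and rejected are identical")
    return problems

good = {"prompt": "p", "chosen": "a detailed answer", "rejected": "meh"}
bad = {"prompt": "p", "chosen": "same", "rejected": "same"}
print(validate_dpo_example(good))  # []
print(validate_dpo_example(bad))   # ['chosen and rejected are identical']
```

Identical chosen/rejected pairs are surprisingly common when responses are auto-generated, and they contribute zero preference signal while still consuming training steps.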
5.2 Human Annotation Guidelines
ANNOTATION_GUIDELINES = """
## Preference Evaluation Guidelines
### Evaluation Criteria (1-5 scale)
1. **Helpfulness**: How well does it answer the question
2. **Accuracy**: Is the information factually correct
3. **Safety**: Is there harmful or biased content
4. **Fluency**: Is the language natural
### Comparison Evaluation Notes
- Read both responses fully before comparing
- Judge quality, not length
- Mark 'tie' when uncertain
- Apply objective quality standards, not personal opinion
"""
# Preference data collection tool
from datetime import datetime
import numpy as np

class PreferenceCollector:
def __init__(self):
self.annotations = []
def add_comparison(self, prompt, response_a, response_b, preference, annotator_id):
"""Save preference comparison result"""
self.annotations.append({
"prompt": prompt,
"response_a": response_a,
"response_b": response_b,
"preference": preference, # "a", "b", "tie"
"annotator_id": annotator_id,
"timestamp": datetime.now().isoformat(),
})
def compute_agreement(self):
"""Calculate inter-annotator agreement"""
from collections import Counter
prompt_votes = {}
for ann in self.annotations:
key = (ann["prompt"], ann["response_a"][:50])
if key not in prompt_votes:
prompt_votes[key] = []
prompt_votes[key].append(ann["preference"])
agreements = []
for key, votes in prompt_votes.items():
if len(votes) >= 2:
most_common = Counter(votes).most_common(1)[0][1]
agreements.append(most_common / len(votes))
return np.mean(agreements) if agreements else 0
5.3 AI-Based Automatic Ranking
def ai_rank_responses(prompt, responses, model="gpt-4"):
"""Constitutional AI-style automatic ranking"""
ranking_prompt = f"""Evaluate the following responses to a question.
Question: {prompt}
"""
for i, resp in enumerate(responses):
ranking_prompt += f"Response {i+1}: {resp}\n\n"
ranking_prompt += """Rate each response on a 1-5 scale for these criteria and provide final rankings:
1. Helpfulness
2. Accuracy
3. Safety
4. Fluency
Return results in JSON format."""
client = openai.OpenAI()
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": ranking_prompt}],
response_format={"type": "json_object"},
)
return json.loads(response.choices[0].message.content)
5.4 UltraFeedback Methodology
def create_ultrafeedback_data(prompts, models_to_evaluate):
"""UltraFeedback-style multi-model response collection and evaluation"""
dataset = []
for prompt in prompts:
responses = {}
# Collect responses from multiple models
for model_name in models_to_evaluate:
responses[model_name] = generate_response(prompt, model_name)
# Evaluate each response with GPT-4 (1-10 scale)
evaluations = {}
for model_name, response in responses.items():
score = evaluate_single_response(prompt, response)
evaluations[model_name] = score
# Select best/worst scoring responses (for DPO)
best_model = max(evaluations, key=evaluations.get)
worst_model = min(evaluations, key=evaluations.get)
dataset.append({
"prompt": prompt,
"chosen": responses[best_model],
"rejected": responses[worst_model],
"chosen_model": best_model,
"rejected_model": worst_model,
"scores": evaluations,
})
return dataset
6. Data Quality Metrics
6.1 Diversity Measurement
from collections import Counter
import numpy as np
def vocabulary_diversity(texts):
"""Measure vocabulary diversity (Type-Token Ratio)"""
all_tokens = []
for text in texts:
all_tokens.extend(text.split())
types = len(set(all_tokens))
tokens = len(all_tokens)
ttr = types / tokens if tokens > 0 else 0
return {"type_token_ratio": ttr, "unique_words": types, "total_words": tokens}
def topic_diversity(texts, n_topics=10):
"""Topic diversity (LDA-based)"""
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features=5000)
dtm = vectorizer.fit_transform(texts)
lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
topic_dist = lda.fit_transform(dtm)
# Topic entropy (higher = more uniform distribution)
avg_dist = topic_dist.mean(axis=0)
entropy = -np.sum(avg_dist * np.log(avg_dist + 1e-10))
return {"topic_entropy": entropy, "max_entropy": np.log(n_topics)}
def instruction_diversity(instructions):
"""Instruction starter verb diversity"""
first_words = [inst.split()[0] if inst.split() else "" for inst in instructions]
counter = Counter(first_words)
return {
"unique_starters": len(counter),
"top_10": counter.most_common(10),
"starter_entropy": -sum(
(c/len(first_words)) * np.log(c/len(first_words))
for c in counter.values()
),
}
6.2 Length Distribution Analysis
import matplotlib.pyplot as plt
def analyze_length_distribution(dataset, text_field="text"):
"""Dataset length distribution analysis"""
lengths = [len(item[text_field]) for item in dataset]
stats = {
"count": len(lengths),
"mean": np.mean(lengths),
"median": np.median(lengths),
"std": np.std(lengths),
"min": np.min(lengths),
"max": np.max(lengths),
"p25": np.percentile(lengths, 25),
"p75": np.percentile(lengths, 75),
"p95": np.percentile(lengths, 95),
}
print("=== Length Distribution Statistics ===")
for k, v in stats.items():
print(f" {k}: {v:.1f}")
return stats
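The matplotlib import above can be put to use with a histogram of the same lengths. A sketch that saves the figure to a hypothetical `lengths_hist.png` rather than displaying it (the random lengths are stand-in data):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no display needed
import matplotlib.pyplot as plt
import numpy as np

def plot_length_distribution(lengths, out_path="lengths_hist.png"):
    """Save a histogram of text lengths with the 95th percentile marked."""
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.hist(lengths, bins=50)
    ax.axvline(np.percentile(lengths, 95), color="red", linestyle="--", label="p95")
    ax.set_xlabel("Characters")
    ax.set_ylabel("Documents")
    ax.legend()
    fig.savefig(out_path)
    plt.close(fig)
    return out_path

path = plot_length_distribution(np.random.default_rng(0).integers(50, 2000, 500))
```

Eyeballing the histogram alongside the p95 line is the quickest way to choose a `max_length` that truncates few documents while keeping training batches compact.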
6.3 Benchmark Contamination Check
def check_contamination(train_data, benchmark_data, n_gram=13):
"""Check contamination between training and benchmark data"""
# Build benchmark n-gram set
benchmark_ngrams = set()
for item in benchmark_data:
text = item["question"] if "question" in item else item["text"]
words = text.split()
for i in range(len(words) - n_gram + 1):
ngram = " ".join(words[i:i+n_gram])
benchmark_ngrams.add(ngram)
# Search for overlapping n-grams in training data
contaminated = []
for i, item in enumerate(train_data):
text = item.get("instruction", "") + " " + item.get("output", "")
words = text.split()
for j in range(len(words) - n_gram + 1):
ngram = " ".join(words[j:j+n_gram])
if ngram in benchmark_ngrams:
contaminated.append({
"train_idx": i,
"matched_ngram": ngram,
})
break
contamination_rate = len(contaminated) / max(len(train_data), 1)
print(f"Contamination rate: {contamination_rate:.4%} ({len(contaminated)}/{len(train_data)})")
return contaminated
7. Full Pipeline: Building a Korean SFT Dataset from Scratch
7.1 Complete Pipeline Code
"""
Korean SFT Dataset Construction Pipeline
Collect -> Clean -> Format -> Validate -> Upload
"""
import json
import os
from datasets import Dataset
from tqdm import tqdm
# ===== Step 1: Data Collection =====
def collect_data():
"""Collect data from various sources"""
all_data = []
# 1-1. Load existing datasets
from datasets import load_dataset
koalpaca = load_dataset("beomi/KoAlpaca-v1.1a", split="train")
for item in koalpaca:
all_data.append({
"instruction": item["instruction"],
"input": "",
"output": item["output"],
"source": "koalpaca",
})
# 1-2. Add synthetic data
synthetic = generate_synthetic_data(num_samples=1000)
all_data.extend(synthetic)
print(f"Total collected: {len(all_data)}")
return all_data
# ===== Step 2: Data Cleaning =====
def clean_data(raw_data):
"""Data cleaning pipeline"""
cleaned = []
pii_remover = PIIRemover()
quality_filter = QualityFilter()
for item in tqdm(raw_data, desc="Cleaning"):
instruction = clean_text(item["instruction"])
output = clean_text(item["output"])
instruction = pii_remover.remove_pii(instruction)
output = pii_remover.remove_pii(output)
if not quality_filter.filter(instruction) or not quality_filter.filter(output):
continue
cleaned.append({
"instruction": instruction,
"input": item.get("input", ""),
"output": output,
"source": item["source"],
})
print(f"After cleaning: {len(cleaned)}/{len(raw_data)}")
return cleaned
# ===== Step 3: Deduplication =====
def remove_duplicates(data):
"""Remove duplicates"""
dedup = TextDeduplicator(threshold=0.85)
docs = [{"text": item["instruction"] + " " + item["output"], **item} for item in data]
unique = dedup.deduplicate(docs)
return [{"instruction": d["instruction"], "input": d.get("input", ""),
"output": d["output"], "source": d["source"]} for d in unique]
# ===== Step 4: Format Conversion =====
def format_data(data, target_format="openai"):
"""Convert to target format"""
formatted = []
system_prompt = "You are a helpful Korean AI assistant."
for item in data:
if target_format == "openai":
formatted.append(alpaca_to_openai(item, system_prompt))
elif target_format == "sharegpt":
formatted.append({
"conversations": [
{"from": "human", "value": item["instruction"]},
{"from": "gpt", "value": item["output"]},
],
})
return formatted
# ===== Step 5: Validation =====
def validate_data(data, format_type="openai"):
"""Data quality validation"""
errors = []
for i, item in enumerate(data):
if format_type == "openai":
if "messages" not in item:
errors.append(f"[{i}] Missing messages field")
elif len(item["messages"]) < 2:
errors.append(f"[{i}] Insufficient message count")
for msg in item.get("messages", []):
if msg["role"] not in ("system", "user", "assistant"):
errors.append(f"[{i}] Invalid role: {msg['role']}")
if not msg["content"].strip():
errors.append(f"[{i}] Empty content")
if errors:
print(f"Validation errors: {len(errors)}")
for e in errors[:10]:
print(f" {e}")
else:
print("Validation passed!")
return len(errors) == 0
# ===== Step 6: Upload =====
def upload_to_hub(data, repo_name):
"""Upload to Hugging Face Hub"""
ds = Dataset.from_list(data)
split_ds = ds.train_test_split(test_size=0.05, seed=42)
split_ds.push_to_hub(repo_name, private=True)
print(f"Upload complete: {repo_name}")
print(f" Train: {len(split_ds['train'])}, Validation: {len(split_ds['test'])}")
# ===== Execute =====
if __name__ == "__main__":
raw_data = collect_data()
cleaned_data = clean_data(raw_data)
unique_data = remove_duplicates(cleaned_data)
formatted_data = format_data(unique_data, "openai")
is_valid = validate_data(formatted_data, "openai")
if is_valid:
upload_to_hub(formatted_data, "my-org/korean-sft-v1")
7.2 Quality Dashboard
from collections import Counter

def generate_quality_report(dataset):
    """Generate dataset quality report"""
    report = {
        "total_samples": len(dataset),
        "length_stats": analyze_length_distribution(dataset, "instruction"),
        "diversity": vocabulary_diversity([d["instruction"] for d in dataset]),
        "source_distribution": Counter(d["source"] for d in dataset),
    }

    print("=" * 60)
    print("Dataset Quality Report")
    print("=" * 60)
    print(f"Total samples: {report['total_samples']}")
    print("\nLength statistics:")
    for k, v in report['length_stats'].items():
        print(f"  {k}: {v:.1f}")
    print(f"\nVocabulary diversity: TTR = {report['diversity']['type_token_ratio']:.4f}")
    print("\nSource distribution:")
    for source, count in report['source_distribution'].most_common():
        print(f"  {source}: {count} ({count/len(dataset)*100:.1f}%)")
    print("=" * 60)
    return report
8. Quiz
Q1. What was the key factor that allowed Phi-3 to outperform larger models?
Answer: Meticulously curated high-quality training data
Phi-3 outperformed 7B-13B models with just 3.8B parameters. The key was not model size but training data quality. It was trained using textbook-quality synthetic data that was carefully curated.
Q2. What is MinHash LSH used for?
Answer: Approximate Deduplication
MinHash LSH is an algorithm that efficiently finds near-duplicate documents in large datasets so they can be removed. Instead of exact matching, it uses similarity-based approximate matching, reducing deduplication from O(n²) pairwise comparisons to roughly O(n).
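The core idea can be sketched with the standard library alone (production pipelines typically use a dedicated library such as datasketch). Here the seeded-MD5 hashes are an illustrative stand-in for true random permutations; each signature slot keeps the minimum hash over a document's word shingles, and the fraction of matching slots estimates Jaccard similarity:

```python
import hashlib

def minhash_signature(text, num_perm=64, shingle_size=3):
    """Build a MinHash signature from word shingles of `text`."""
    words = text.split()
    shingles = {" ".join(words[i:i + shingle_size])
                for i in range(max(len(words) - shingle_size + 1, 1))}
    sig = []
    for seed in range(num_perm):
        # One "permutation" per seed: keep the minimum hash value over all shingles
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature("the quick brown fox jumps over the lazy dog")
b = minhash_signature("the quick brown fox jumps over the lazy cat")
c = minhash_signature("a completely different sentence about kimchi stew")
print(estimated_jaccard(a, b))  # high: near-duplicates share most shingles
print(estimated_jaccard(a, c))  # near zero: unrelated texts
```

The LSH part then bands these signatures into hash buckets so that only documents sharing a bucket are compared, which is what keeps the whole pass near-linear.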
Q3. What are the key differences between Alpaca, ShareGPT, and OpenAI Messages formats?
Answer:
- Alpaca: Single-turn structure with instruction/input/output
- ShareGPT: Multi-turn conversations array with human/gpt roles
- OpenAI Messages: Messages array with system/user/assistant roles
Both ShareGPT and OpenAI formats support multi-turn conversations, but differ in role names and structure.
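To make the differences concrete, here is a small sketch that converts one hypothetical Alpaca record into the other two formats (the helper names are illustrative, not part of the pipeline above):

```python
# A hypothetical Alpaca record: single-turn instruction/input/output
rec = {"instruction": "Translate to Korean.", "input": "Hello", "output": "안녕하세요"}

def to_sharegpt(rec):
    """ShareGPT: a 'conversations' list with human/gpt roles (multi-turn capable)."""
    prompt = rec["instruction"] + ("\n" + rec["input"] if rec["input"] else "")
    return {"conversations": [
        {"from": "human", "value": prompt},
        {"from": "gpt", "value": rec["output"]},
    ]}

def to_openai(rec, system_prompt="You are a helpful assistant."):
    """OpenAI Messages: a 'messages' list with system/user/assistant roles."""
    prompt = rec["instruction"] + ("\n" + rec["input"] if rec["input"] else "")
    return {"messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": rec["output"]},
    ]}

print(to_sharegpt(rec)["conversations"][0]["from"])  # human
print(to_openai(rec)["messages"][0]["role"])         # system
```

Note that both conversions fold Alpaca's optional `input` field into the user turn, since neither chat format has a separate slot for it.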
Q4. What do "chosen" and "rejected" mean in DPO datasets?
Answer:
- chosen: The human-preferred (better) response
- rejected: The human-dispreferred (worse) response
DPO (Direct Preference Optimization) uses this paired data to train models to generate responses similar to chosen and different from rejected. Unlike RLHF, it optimizes directly without a separate reward model.
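As a rough numeric sketch (with made-up sequence log-probabilities, not a real training loop), the DPO objective rewards the policy for raising the likelihood of `chosen` relative to `rejected`, measured against a frozen reference model:

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, from sequence log-probs:
    -log sigmoid(beta * (log-ratio of chosen - log-ratio of rejected))."""
    chosen_ratio = policy_chosen_lp - ref_chosen_lp      # log pi(chosen)/ref(chosen)
    rejected_ratio = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))    # -log sigmoid(logits)

# A policy that prefers "chosen" more than the reference does gets a lower loss
better = dpo_loss(-10.0, -30.0, -20.0, -20.0)  # policy favors chosen
worse = dpo_loss(-30.0, -10.0, -20.0, -20.0)   # policy favors rejected
print(better, worse)
```

The pair-level loss drops toward zero as the policy's margin for `chosen` over `rejected` grows, which is how DPO gets preference alignment without training a separate reward model.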
Q5. Why is benchmark contamination dangerous and how can it be detected?
Answer:
Benchmark contamination occurs when evaluation data is included in training data, causing model performance to be overestimated.
Risks:
- Model has "memorized" rather than actually "solved" the problems
- Fair model comparison becomes impossible
- Below-expected performance when deployed in production
Detection methods:
- N-gram overlap checking (typically 13-gram)
- Hash comparison of benchmark sentences
- Follow the contamination-reporting methodology used in the GPT-4 technical report
9. References
- LIMA: Less Is More for Alignment - Zhou et al., 2023
- Phi-3 Technical Report - Microsoft Research, 2024
- Self-Instruct: Aligning LLMs with Self-Generated Instructions - Wang et al., 2023
- WizardLM: Empowering Large Language Models to Follow Complex Instructions - Xu et al., 2023
- Hugging Face Datasets Documentation - huggingface.co/docs/datasets
- KoAlpaca: Korean Alpaca Model - beomi, GitHub
- UltraFeedback: Boosting Language Models with High-quality Feedback - Cui et al., 2023
- Training Language Models to Follow Instructions with Human Feedback - Ouyang et al., 2022
- Direct Preference Optimization - Rafailov et al., 2023
- Deduplicating Training Data Makes Language Models Better - Lee et al., 2022
- KLUE: Korean Language Understanding Evaluation - Park et al., 2021
- Textbooks Are All You Need - Gunasekar et al., 2023
- Constitutional AI: Harmlessness from AI Feedback - Bai et al., 2022
- The RefinedWeb Dataset for Falcon LLM - Penedo et al., 2023