
Complete Guide to Korean LLM Training Data: Hugging Face Datasets, Preprocessing, and Quality Control

Introduction: Why Training Data Matters More Than Model Architecture

In 2024, Microsoft Research's Phi-3 paper sent shockwaves through the industry: a 3.8B-parameter model outperformed 7B-13B models, and the secret was meticulously curated, high-quality training data. Meta's LIMA paper ("LIMA: Less Is More for Alignment") demonstrated that just 1,000 carefully curated samples could produce responses that human evaluators rated close to GPT-4's.

The phrase "Data is the new oil" may be a cliché, but in the LLM era it has never been more accurate. Architectural innovations (Transformers, MoE, state space models) matter, but the same architecture can produce vastly different performance depending on the quality and diversity of its training data.

The Korean LLM ecosystem is growing rapidly:

| Model | Developer | Parameters | Features |
|---|---|---|---|
| SOLAR | Upstage | 10.7B | Depth Up-Scaling, Korean-optimized |
| EXAONE | LG AI Research | 7.8B | Enterprise Korean LLM |
| HyperCLOVA X | NAVER | Undisclosed | Largest Korean language model |
| Qwen-KO | Community | Various | Qwen-based Korean fine-tuning |
| KULLM | Korea Univ. | 13B | Korean open-source LLM |
| Polyglot-Ko | EleutherAI | 12.8B | Korean pre-trained model |

What determines the performance of all these models is ultimately training data. This guide covers everything from Hugging Face dataset usage to Korean data collection, preprocessing, Instruction Tuning formats, and RLHF/DPO dataset construction.


1. Hugging Face Datasets Deep Dive

1.1 Platform Overview

Hugging Face is an ML community platform hosting over 150,000 datasets as of 2023. It provides dataset viewers, download statistics, and automatic documentation features.

Key Features:

  • Dataset Viewer: Preview data directly in the browser
  • Download Stats: Monthly download counts
  • Dataset Card: Dataset metadata, licensing, and usage documentation
  • Streaming: Load without full download
  • Git LFS: Large file version control

1.2 Dataset Types by Category

Pre-training Data

Large-scale text corpora that form the model's foundational language understanding.

| Dataset | Size | Languages | Description |
|---|---|---|---|
| CC-100 | 2.5TB | 100+ langs | Cleaned Common Crawl corpus |
| mC4 | 27TB | 101 langs | Google's multilingual C4 |
| Korean Wikipedia | ~1GB | Korean | Full Korean Wikipedia |
| Namuwiki | ~5GB | Korean | Namuwiki dump (non-commercial) |
| KCC | ~30GB | Korean | Korean web crawl data |
| OSCAR | Various | Multilingual | Classified Common Crawl corpus |

SFT/Instruction Tuning Data

Core data for teaching LLMs to follow instructions.

| Dataset | Size | Format | Description |
|---|---|---|---|
| Alpaca (Stanford) | 52K | instruction/input/output | Generated via Self-Instruct |
| ShareGPT | 90K+ | conversations | Real ChatGPT conversations |
| LIMA | 1K | instruction/output | Hand-curated high quality |
| OpenOrca | 4M | instruction/output | Includes GPT-4 responses |
| Dolly 2.0 | 15K | instruction/output | Hand-crafted, commercially usable |
| FLAN Collection | 1836 tasks | Various | Google's large instruction collection |

RLHF/DPO Data

Alignment data reflecting human preferences.

| Dataset | Size | Structure | Description |
|---|---|---|---|
| HH-RLHF (Anthropic) | 170K | chosen/rejected | Helpfulness + harmlessness preference |
| UltraFeedback | 64K | 4-point scale | GPT-4 based auto-evaluation |
| Nectar | 183K | ranked list | 7-model response rankings |
| Chatbot Arena | Ongoing | ELO scores | Human blind comparison |

Evaluation Benchmarks

| Benchmark | Domain | Korean Support |
|---|---|---|
| MMLU | 57 academic subjects | Translated version available |
| ARC | Science reasoning | Translated version |
| HellaSwag | Common sense reasoning | Translated version |
| KoBBQ | Bias evaluation | Native Korean |
| KLUE | Korean NLU | Native Korean |
| KorNAT | Korean common sense | Native Korean |

1.3 Korean-Specific Datasets

Korean LLM Dataset Ecosystem
├── Pre-training
│   ├── Korean Wikipedia (~600K articles)
│   ├── Namuwiki Dump (~5GB)
│   ├── AI Hub Corpora (NIKL)
│   └── mC4-ko (Korean subset)
├── Instruction Tuning
│   ├── KoAlpaca (beomi) - 52K
│   ├── KoVicuna (melodysdreamj) - 40K+
│   ├── KOpen-platypus - 25K
│   ├── ko_wikidata_QA - Wiki-based QA
│   └── kullm-v2 (Korea Univ.) - 152K
├── Preference/Alignment
│   ├── ko-rlhf (community)
│   └── KoreanFeedback (custom-built)
└── Evaluation
    ├── KLUE (8 tasks)
    ├── KoBBQ (bias)
    └── KorNAT (common sense)

1.4 Practical datasets Library Usage

Basic Loading and Exploration

from datasets import load_dataset, Dataset, DatasetDict

# Basic loading
ds = load_dataset("beomi/KoAlpaca-v1.1a")
print(ds)
# DatasetDict({
#     train: Dataset({
#         features: ['instruction', 'output'],
#         num_rows: 21155
#     })
# })

# Load specific split
train_ds = load_dataset("beomi/KoAlpaca-v1.1a", split="train")

# Inspect first 5 examples
for example in train_ds.select(range(5)):
    print(f"Instruction: {example['instruction'][:50]}...")
    print(f"Output: {example['output'][:50]}...")
    print("---")

Filtering and Transformation

# Length-based filtering
filtered_ds = ds["train"].filter(
    lambda x: len(x["instruction"]) > 10 and len(x["output"]) > 20
)
print(f"After filtering: {len(filtered_ds)} / {len(ds['train'])}")

# Convert to Alpaca format
def format_alpaca(example):
    text = f"""### Instruction:
{example['instruction']}

### Response:
{example['output']}"""
    return {"text": text}

formatted_ds = filtered_ds.map(format_alpaca)

# Apply tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("beomi/llama-2-ko-7b")

def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=2048,
        padding="max_length",
    )

tokenized_ds = formatted_ds.map(
    tokenize_function,
    batched=True,
    remove_columns=formatted_ds.column_names,
)

Streaming Mode (Large Datasets)

# Stream large datasets (memory efficient)
streaming_ds = load_dataset(
    "allenai/c4",
    "ko",
    split="train",
    streaming=True,
)

# Iterate through first 100 examples
for i, example in enumerate(streaming_ds):
    if i >= 100:
        break
    process(example["text"])

# Streaming + filtering + batch processing
filtered_stream = streaming_ds.filter(
    lambda x: len(x["text"]) > 100
).take(10000)

# Process in batches
batch = []
for example in filtered_stream:
    batch.append(example)
    if len(batch) == 32:
        process_batch(batch)
        batch = []

Upload to Hugging Face Hub

from datasets import Dataset
import pandas as pd

# Create dataset from DataFrame
df = pd.DataFrame({
    "instruction": ["What is Korea's capital?", "What is Python?"],
    "output": ["Korea's capital is Seoul.", "Python is a programming language."],
})
my_dataset = Dataset.from_pandas(df)

# Upload to Hub
my_dataset.push_to_hub(
    "my-org/my-korean-dataset",
    private=True,
    token="hf_xxxxx",
)

# Auto-generate Dataset Card
from huggingface_hub import DatasetCard

card = DatasetCard.load("my-org/my-korean-dataset")
card.text = """
# My Korean Dataset
Korean Instruction Tuning dataset.
## Data Structure
- instruction: Question/instruction text
- output: Response
## License
CC-BY-4.0
"""
card.push_to_hub("my-org/my-korean-dataset")

2. Korean Data Collection Methods

2.1 Web Crawling

# News crawling with newspaper3k
from newspaper import Article
import json

def crawl_article(url):
    """Crawl news article (must comply with robots.txt!)"""
    article = Article(url, language="ko")
    article.download()
    article.parse()

    return {
        "title": article.title,
        "text": article.text,
        "publish_date": str(article.publish_date),
        "source_url": url,
    }

# Large-scale crawling with Scrapy
# scrapy_spider.py
"""
import scrapy

class KoreanTextSpider(scrapy.Spider):
    name = 'korean_text'
    custom_settings = {
        'ROBOTSTXT_OBEY': True,  # Must comply with robots.txt
        'DOWNLOAD_DELAY': 2,      # 2-second intervals
        'CONCURRENT_REQUESTS': 4, # Limit concurrent requests
    }

    def parse(self, response):
        text = response.css('article::text').getall()
        yield {
            'url': response.url,
            'text': ' '.join(text),
        }
"""

Crawling Best Practices:

  • Always comply with robots.txt
  • Minimum 1-2 second request intervals
  • Verify copyright/licensing
  • Mandatory PII filtering
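
The robots.txt rule can be checked programmatically before fetching. A minimal sketch using the standard library's urllib.robotparser (the rules and URLs below are illustrative, not from a real site):

```python
from urllib.robotparser import RobotFileParser

def make_robot_checker(robots_txt: str) -> RobotFileParser:
    """Build a checker from raw robots.txt content."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp

# Illustrative robots.txt content
rules = """User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

checker = make_robot_checker(rules)
print(checker.can_fetch("*", "https://example.com/news/article1"))  # allowed path
print(checker.can_fetch("*", "https://example.com/private/page"))   # disallowed path
print(checker.crawl_delay("*"))  # seconds to wait between requests
```

In a real crawler you would call `set_url(...)` and `read()` to fetch the live robots.txt, and honor `crawl_delay` in your request loop.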

2.2 Public Data Sources

| Source | URL | Data Type | License |
|---|---|---|---|
| AI Hub | aihub.or.kr | Various Korean corpora | Public |
| Modoo Corpus | corpus.korean.go.kr | Written/spoken corpora | CC BY |
| NIKL | korean.go.kr | Standard corpora | Academic |
| Data Portal | data.go.kr | Government public data | Public |

# AI Hub data loading example
import json
import glob

def load_aihub_data(data_dir):
    """Load AI Hub JSON format data"""
    all_data = []
    for filepath in glob.glob(f"{data_dir}/**/*.json", recursive=True):
        with open(filepath, "r", encoding="utf-8") as f:
            data = json.load(f)
            if "document" in data:
                for doc in data["document"]:
                    for sent in doc.get("sentence", []):
                        all_data.append({
                            "text": sent.get("form", ""),
                            "source": "aihub",
                        })
    return all_data

2.3 Translation-Based Data Generation

# Translation using NLLB (No Language Left Behind)
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate_en_to_ko(text):
    """English -> Korean translation"""
    tokenizer.src_lang = "eng_Latn"
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    translated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("kor_Hang"),
        max_length=512,
    )
    return tokenizer.decode(translated[0], skip_special_tokens=True)

# Translation quality validation
def validate_translation(original, translated):
    """Automatic translation quality validation"""
    checks = {
        "not_empty": len(translated.strip()) > 0,
        "not_too_short": len(translated) > len(original) * 0.3,
        "not_too_long": len(translated) < len(original) * 3,
        "no_english_majority": sum(1 for c in translated if c.isascii()) / max(len(translated), 1) < 0.5,
    }
    return all(checks.values()), checks

2.4 Synthetic Data Generation

Self-Instruct Approach

import openai
import json
import random

# Self-Instruct: Generate new instructions from seed data
SEED_INSTRUCTIONS = [
    "Explain the four seasons of Korea.",
    "What are the advantages of list comprehension in Python?",
    "Tell me the key points to consider when writing an email.",
]

def generate_new_instructions(seed_instructions, num_generate=10):
    """Generate new instructions using GPT-4"""
    prompt = f"""Here are example Korean instructions:

{chr(10).join(f'{i+1}. {inst}' for i, inst in enumerate(seed_instructions))}

Generate {num_generate} completely new Korean instructions in a similar style.
Include diverse topics (science, history, technology, daily life, etc.).
Write each instruction on a single line with a number."""

    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,
    )
    return parse_instructions(response.choices[0].message.content)

def generate_response(instruction):
    """Generate a response for the instruction"""
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful Korean AI assistant."},
            {"role": "user", "content": instruction},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content

Evol-Instruct Approach

def evolve_instruction(instruction, evolution_type="deepen"):
    """WizardLM's Evol-Instruct: Progressively complexify instructions"""

    evolution_prompts = {
        "deepen": f"""Make the following instruction deeper and more specific.
Original: {instruction}
Evolved version:""",
        "broaden": f"""Broaden the scope of the following instruction to be more comprehensive.
Original: {instruction}
Evolved version:""",
        "concretize": f"""Add specific conditions or constraints to the following instruction.
Original: {instruction}
Evolved version:""",
        "reasoning": f"""Transform the following instruction into a form requiring step-by-step reasoning.
Original: {instruction}
Evolved version:""",
    }

    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": evolution_prompts[evolution_type]}],
        temperature=0.7,
    )
    return response.choices[0].message.content

2.5 Community Data Sources

  • Namuwiki: Rich Korean content (non-commercial CC-BY-NC-SA)
  • Korean Reddit: r/korea, r/hanguk, etc.
  • Stack Overflow Korean: Technical Q&A
  • Naver Knowledge iN: Crawl with caution (check terms of service)
  • Korean Wikipedia: CC-BY-SA license
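
Because licenses vary widely across these sources, it helps to tag each document with its license at collection time and filter against an allowlist before training. A minimal sketch (the allowlist below is an illustrative assumption, not legal advice — adjust it to your requirements):

```python
# Keep only documents whose license tag permits the intended use.
# NOTE: this allowlist is illustrative; verify terms for each source.
COMMERCIAL_OK = {"cc0", "cc-by", "cc-by-sa", "public-domain", "mit"}

def filter_by_license(docs, allowlist=COMMERCIAL_OK):
    """Drop documents without an allowlisted license tag."""
    kept = [d for d in docs if d.get("license", "").lower() in allowlist]
    print(f"License filter: kept {len(kept)} of {len(docs)}")
    return kept

docs = [
    {"text": "Korean Wikipedia article", "license": "CC-BY-SA"},
    {"text": "Namuwiki page", "license": "CC-BY-NC-SA"},  # non-commercial
    {"text": "Government open data", "license": "public-domain"},
]
commercial_docs = filter_by_license(docs)  # drops the NC-licensed document
```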

3. Data Preprocessing Pipeline

3.1 Overall Pipeline Architecture

Raw Data
    |
    v
+-------------------+
| 1. Text Cleaning  |  HTML tag removal, encoding cleanup
+--------+----------+
         v
+-------------------+
| 2. Lang Detection |  Filter Korean-only text
+--------+----------+
         v
+-------------------+
| 3. Deduplication  |  MinHash LSH, Exact Match
+--------+----------+
         v
+-------------------+
| 4. Quality Filter |  Perplexity, length, toxicity
+--------+----------+
         v
+-------------------+
| 5. PII Removal    |  Personal info masking
+--------+----------+
         v
+-------------------+
| 6. Tokenization   |  SentencePiece / BPE
+--------+----------+
         v
    Cleaned Data
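
The stages above compose naturally as an ordered list of functions, where cleaning steps return a (possibly modified) string and filter steps return None to drop a document. A minimal orchestration sketch — the two steps here are simplified stand-ins for the full implementations in the following sections:

```python
import html
import re

def strip_html(text):
    """Stage 1 (simplified): decode entities and remove tags."""
    return re.sub(r"<[^>]+>", "", html.unescape(text))

def drop_short(text, min_len=20):
    """Stage 4 (simplified): drop very short documents."""
    return text if len(text.strip()) >= min_len else None

def run_pipeline(texts, steps):
    """Apply each step in order; a None result filters the document out."""
    out = []
    for text in texts:
        for step in steps:
            text = step(text)
            if text is None:
                break
        if text is not None:
            out.append(text)
    return out

raw = [
    "<p>A sufficiently long Korean news paragraph for training.</p>",
    "<b>too short</b>",
]
cleaned = run_pipeline(raw, [strip_html, drop_short])  # keeps only the first text
```

Keeping each stage as an independent function makes it easy to reorder stages, log per-stage drop rates, and unit-test each filter in isolation.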

3.2 Text Cleaning

import re
import html
import unicodedata

def clean_text(text):
    """Basic Korean text cleaning"""
    # Decode HTML entities
    text = html.unescape(text)

    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)

    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # Remove emails
    text = re.sub(r'\S+@\S+\.\S+', '[EMAIL]', text)

    # Mask phone numbers
    text = re.sub(r'\d{2,3}-\d{3,4}-\d{4}', '[PHONE]', text)

    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text)

    # Unicode normalization (NFC)
    text = unicodedata.normalize('NFC', text)

    return text.strip()

def clean_korean_specific(text):
    """Korean-specific cleaning"""
    # Remove advertising patterns
    ad_patterns = [
        r'click\s*now',
        r'free\s*consultation',
        r'contact\s*us',
        r'call\s*now',
    ]
    for pattern in ad_patterns:
        if re.search(pattern, text, re.IGNORECASE):
            return None
    return text

# Batch processing
def clean_batch(texts):
    """Batch cleaning"""
    cleaned = []
    for text in texts:
        result = clean_text(text)
        result = clean_korean_specific(result)
        if result and len(result) > 20:
            cleaned.append(result)
    return cleaned

3.3 Deduplication

from datasketch import MinHash, MinHashLSH
import hashlib

class TextDeduplicator:
    """MinHash LSH-based approximate deduplication"""

    def __init__(self, threshold=0.8, num_perm=128):
        self.threshold = threshold
        self.num_perm = num_perm
        self.lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
        self.seen_exact = set()

    def get_minhash(self, text):
        """Generate MinHash for text"""
        m = MinHash(num_perm=self.num_perm)
        # Split into 3-grams
        for i in range(len(text) - 2):
            m.update(text[i:i+3].encode('utf-8'))
        return m

    def is_duplicate(self, text, doc_id):
        """Check for duplicates"""
        # 1. Exact matching (hash-based)
        text_hash = hashlib.md5(text.encode('utf-8')).hexdigest()
        if text_hash in self.seen_exact:
            return True
        self.seen_exact.add(text_hash)

        # 2. Approximate matching (MinHash LSH)
        minhash = self.get_minhash(text)
        result = self.lsh.query(minhash)
        if result:
            return True

        self.lsh.insert(doc_id, minhash)
        return False

    def deduplicate(self, documents):
        """Deduplicate document list"""
        unique_docs = []
        for i, doc in enumerate(documents):
            if not self.is_duplicate(doc["text"], f"doc_{i}"):
                unique_docs.append(doc)

        print(f"Dedup: {len(documents)} -> {len(unique_docs)} "
              f"({len(documents) - len(unique_docs)} removed)")
        return unique_docs

3.4 Language Detection Filtering

import fasttext

# Load fasttext language detection model
model_path = "lid.176.bin"  # Pre-download required
lang_model = fasttext.load_model(model_path)

def detect_language(text):
    """Detect text language"""
    text_clean = text.replace('\n', ' ')[:200]
    predictions = lang_model.predict(text_clean)
    lang = predictions[0][0].replace('__label__', '')
    confidence = predictions[1][0]
    return lang, confidence

def filter_korean(documents, min_confidence=0.7):
    """Filter Korean-only text"""
    korean_docs = []
    for doc in documents:
        lang, conf = detect_language(doc["text"])
        if lang == "ko" and conf >= min_confidence:
            korean_docs.append(doc)
    return korean_docs

3.5 Quality Filtering

import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

class QualityFilter:
    """Text quality filtering"""

    def __init__(self):
        self.criteria = {
            "min_length": 50,
            "max_length": 10000,
            "min_words": 10,
            "max_repetition_ratio": 0.3,
            "max_special_char_ratio": 0.1,
        }

    def check_length(self, text):
        """Length-based filter"""
        return self.criteria["min_length"] <= len(text) <= self.criteria["max_length"]

    def check_repetition(self, text):
        """Detect repetitive text"""
        words = text.split()
        if len(words) == 0:
            return False
        unique_ratio = len(set(words)) / len(words)
        return unique_ratio >= (1 - self.criteria["max_repetition_ratio"])

    def check_special_chars(self, text):
        """Check special character ratio"""
        special = sum(1 for c in text if not c.isalnum() and not c.isspace()
                     and c not in '.,!?;:')
        return special / max(len(text), 1) < self.criteria["max_special_char_ratio"]

    def compute_perplexity(self, text, model, tokenizer, device="cuda"):
        """Perplexity-based quality assessment (lower = more natural text)"""
        inputs = tokenizer(text, return_tensors="pt", truncation=True,
                          max_length=512).to(device)
        with torch.no_grad():
            outputs = model(**inputs, labels=inputs["input_ids"])
        return torch.exp(outputs.loss).item()

    def filter(self, text):
        """Comprehensive quality filtering"""
        return (
            self.check_length(text)
            and self.check_repetition(text)
            and self.check_special_chars(text)
        )

3.6 PII (Personally Identifiable Information) Removal

import re
from typing import Dict, List

class PIIRemover:
    """PII removal for Korean text"""

    PATTERNS = {
        "SSN": r'\d{6}[-]?\d{7}',
        "PHONE": r'0\d{1,2}[-.]?\d{3,4}[-.]?\d{4}',
        "EMAIL": r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
        "CARD": r'\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}',
        "ACCOUNT": r'\d{3,6}[-]?\d{2,6}[-]?\d{2,6}[-]?\d{0,3}',
        "IP": r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}',
    }

    def remove_pii(self, text: str) -> str:
        """Replace PII with mask tokens"""
        for pii_type, pattern in self.PATTERNS.items():
            mask_token = f"[{pii_type}]"
            text = re.sub(pattern, mask_token, text)
        return text

    def detect_pii(self, text: str) -> Dict[str, List[str]]:
        """Detect PII (for review before removal)"""
        found = {}
        for pii_type, pattern in self.PATTERNS.items():
            matches = re.findall(pattern, text)
            if matches:
                found[pii_type] = matches
        return found

3.7 Tokenization Considerations

from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Train Korean BPE tokenizer
def train_korean_tokenizer(text_files, vocab_size=32000):
    """Train Korean-optimized BPE tokenizer"""
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

    # Pre-tokenization is critical for Korean
    tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
        pre_tokenizers.ByteLevel(add_prefix_space=False),
    ])

    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
        min_frequency=2,
    )

    tokenizer.train(text_files, trainer)
    return tokenizer

# Compare tokenizer efficiency
def compare_tokenizer_efficiency(text, tokenizers_dict):
    """Compare Korean token efficiency across tokenizers"""
    print(f"Original ({len(text)} chars): {text[:50]}...")
    print("-" * 60)
    for name, tok in tokenizers_dict.items():
        tokens = tok.encode(text)
        token_strs = tok.convert_ids_to_tokens(tokens)
        fertility = len(tokens) / len(text.split())
        print(f"{name}: {len(tokens)} tokens, fertility={fertility:.2f}")
        print(f"  First 10 tokens: {token_strs[:10]}")

4. Instruction Tuning Data Formats

4.1 Alpaca Format

The most basic and widely used format.

{
  "instruction": "Summarize the following text.",
  "input": "Artificial Intelligence (AI) is a subfield of computer science that artificially implements human learning, reasoning, and perception capabilities...",
  "output": "AI is a technology that implements human intelligence on computers, and it is being utilized in various fields through advances in machine learning and deep learning."
}
def format_alpaca(instruction, input_text="", output=""):
    """Generate Alpaca format"""
    if input_text:
        return {
            "instruction": instruction,
            "input": input_text,
            "output": output,
        }
    return {
        "instruction": instruction,
        "input": "",
        "output": output,
    }

# Alpaca prompt templates
ALPACA_TEMPLATE = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

ALPACA_TEMPLATE_NO_INPUT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
{output}"""

4.2 ShareGPT Format

A format for representing multi-turn conversations.

{
  "conversations": [
    {
      "from": "human",
      "value": "What is the difference between lists and tuples in Python?"
    },
    {
      "from": "gpt",
      "value": "The key differences between lists and tuples in Python are...\n\n1. **Mutability**: Lists are mutable, tuples are immutable\n2. **Performance**: Tuples are more memory efficient\n3. **Syntax**: Lists use [], tuples use ()"
    },
    {
      "from": "human",
      "value": "When should I prefer tuples?"
    },
    {
      "from": "gpt",
      "value": "Tuples are best used in these cases:\n\n1. When data should not be modified (coordinates, RGB values)\n2. When used as dictionary keys\n3. When returning multiple values from functions"
    }
  ]
}

4.3 OpenAI Messages Format

Standard format compatible with the OpenAI API.

{
  "messages": [
    {
      "role": "system",
      "content": "You are an AI assistant fluent in Korean. Provide accurate and helpful answers."
    },
    {
      "role": "user",
      "content": "Explain the difference between machine learning and deep learning."
    },
    {
      "role": "assistant",
      "content": "Let me explain the key differences between ML and deep learning..."
    }
  ]
}
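
Before training, it is worth validating that each record follows the expected structure: known roles, an optional system message only at the start, and user/assistant turns that alternate and end on an assistant response. A minimal validator sketch (the exact rules are an assumption — some pipelines allow other shapes):

```python
def validate_messages(record):
    """Check an OpenAI-style messages record for structural problems."""
    msgs = record.get("messages", [])
    if not msgs:
        return False
    # An optional system message must come first
    if msgs[0]["role"] == "system":
        msgs = msgs[1:]
    if not msgs:
        return False
    # Remaining turns must alternate user -> assistant with non-empty content
    expected = "user"
    for msg in msgs:
        if msg["role"] != expected or not msg.get("content", "").strip():
            return False
        expected = "assistant" if expected == "user" else "user"
    # A training example should end on an assistant turn
    return msgs[-1]["role"] == "assistant"

good = {"messages": [
    {"role": "system", "content": "You are a helpful Korean assistant."},
    {"role": "user", "content": "Explain machine learning."},
    {"role": "assistant", "content": "Machine learning is..."},
]}
bad = {"messages": [{"role": "user", "content": "Hello?"}]}  # no assistant turn
```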

4.4 Chat Templates (Model-Specific Differences)

# Llama 3 Chat Template
LLAMA3_TEMPLATE = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{assistant_message}<|eot_id|>"""

# Mistral Chat Template
MISTRAL_TEMPLATE = """<s>[INST] {system_message}

{user_message} [/INST]{assistant_message}</s>"""

# Qwen 2 Chat Template
QWEN2_TEMPLATE = """<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_message}<|im_end|>"""
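
In practice you rarely fill these template strings by hand — Hugging Face tokenizers expose `apply_chat_template`, which renders a messages list with the correct model-specific format. As a self-contained illustration of what that rendering does, the Qwen-2-style template above can be produced from OpenAI-style messages like this:

```python
QWEN2_TURN = "<|im_start|>{role}\n{content}<|im_end|>"

def render_qwen2(messages, add_generation_prompt=True):
    """Render OpenAI-style messages into the Qwen-2 chat format shown above."""
    parts = [QWEN2_TURN.format(role=m["role"], content=m["content"])
             for m in messages]
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete
        parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

prompt = render_qwen2([
    {"role": "system", "content": "You are a helpful Korean assistant."},
    {"role": "user", "content": "What is the capital of Korea?"},
])
print(prompt)
```

When fine-tuning, always render with the same template the base model was trained on; mixing templates silently degrades instruction-following.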

4.5 Converting Between Formats

def sharegpt_to_openai(sharegpt_data):
    """ShareGPT -> OpenAI Messages conversion"""
    messages = []
    role_map = {"human": "user", "gpt": "assistant", "system": "system"}

    for conv in sharegpt_data["conversations"]:
        messages.append({
            "role": role_map.get(conv["from"], conv["from"]),
            "content": conv["value"],
        })
    return {"messages": messages}

def alpaca_to_openai(alpaca_data, system_prompt=""):
    """Alpaca -> OpenAI Messages conversion"""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})

    user_content = alpaca_data["instruction"]
    if alpaca_data.get("input"):
        user_content += f"\n\n{alpaca_data['input']}"

    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": alpaca_data["output"]})

    return {"messages": messages}

def openai_to_sharegpt(openai_data):
    """OpenAI Messages -> ShareGPT conversion"""
    role_map = {"user": "human", "assistant": "gpt", "system": "system"}
    conversations = []

    for msg in openai_data["messages"]:
        conversations.append({
            "from": role_map.get(msg["role"], msg["role"]),
            "value": msg["content"],
        })
    return {"conversations": conversations}

# Batch conversion
def batch_convert(dataset, source_format, target_format):
    """Convert entire dataset format"""
    converters = {
        ("sharegpt", "openai"): sharegpt_to_openai,
        ("alpaca", "openai"): alpaca_to_openai,
        ("openai", "sharegpt"): openai_to_sharegpt,
    }
    converter = converters.get((source_format, target_format))
    if not converter:
        raise ValueError(f"Unsupported conversion: {source_format} -> {target_format}")

    return [converter(item) for item in dataset]

5. RLHF/DPO Dataset Construction

5.1 Preference Data Structure

# DPO (Direct Preference Optimization) data structure
dpo_example = {
    "prompt": "Recommend healthy traditional Korean foods.",
    "chosen": "Here are some of the healthiest traditional Korean foods:\n\n1. **Kimchi**: Rich in probiotics, vitamin C, and dietary fiber.\n2. **Doenjang-jjigae**: Fermented food with anti-cancer properties and rich in protein.\n3. **Mixed grain rice**: Provides balanced nutrition from various grains.\n4. **Namul (seasoned vegetables)**: Spinach, bean sprouts, etc. provide various vitamins and minerals.",
    "rejected": "Umm... bibimbap I guess. It's tasty. Bulgogi too.",
}
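
Preference records like this are easy to corrupt silently (empty fields, identical chosen/rejected pairs), so a cheap structural check before training pays for itself. A minimal sketch:

```python
def validate_dpo_example(ex):
    """Return a list of structural problems with a DPO preference record."""
    problems = []
    for field in ("prompt", "chosen", "rejected"):
        if not isinstance(ex.get(field), str) or not ex[field].strip():
            problems.append(f"missing or empty field: {field}")
    if not problems and ex["chosen"].strip() == ex["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems

ok = {"prompt": "Recommend healthy Korean foods.",
      "chosen": "Kimchi is rich in probiotics...",
      "rejected": "Umm... bibimbap I guess."}
dup = {"prompt": "p", "chosen": "same", "rejected": "same"}
ok_problems = validate_dpo_example(ok)    # no problems
dup_problems = validate_dpo_example(dup)  # identical pair flagged
```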

5.2 Human Annotation Guidelines

ANNOTATION_GUIDELINES = """
## Preference Evaluation Guidelines

### Evaluation Criteria (1-5 scale)
1. **Helpfulness**: How well does it answer the question
2. **Accuracy**: Is the information factually correct
3. **Safety**: Is there harmful or biased content
4. **Fluency**: Is the language natural

### Comparison Evaluation Notes
- Read both responses fully before comparing
- Judge quality, not length
- Mark 'tie' when uncertain
- Apply objective quality standards, not personal opinion
"""

# Preference data collection tool
from datetime import datetime

import numpy as np

class PreferenceCollector:
    def __init__(self):
        self.annotations = []

    def add_comparison(self, prompt, response_a, response_b, preference, annotator_id):
        """Save preference comparison result"""
        self.annotations.append({
            "prompt": prompt,
            "response_a": response_a,
            "response_b": response_b,
            "preference": preference,  # "a", "b", "tie"
            "annotator_id": annotator_id,
            "timestamp": datetime.now().isoformat(),
        })

    def compute_agreement(self):
        """Calculate inter-annotator agreement"""
        from collections import Counter
        prompt_votes = {}
        for ann in self.annotations:
            key = (ann["prompt"], ann["response_a"][:50])
            if key not in prompt_votes:
                prompt_votes[key] = []
            prompt_votes[key].append(ann["preference"])

        agreements = []
        for key, votes in prompt_votes.items():
            if len(votes) >= 2:
                most_common = Counter(votes).most_common(1)[0][1]
                agreements.append(most_common / len(votes))
        return np.mean(agreements) if agreements else 0

5.3 AI-Based Automatic Ranking

def ai_rank_responses(prompt, responses, model="gpt-4"):
    """Constitutional AI-style automatic ranking"""
    ranking_prompt = f"""Evaluate the following responses to a question.

Question: {prompt}

"""
    for i, resp in enumerate(responses):
        ranking_prompt += f"Response {i+1}: {resp}\n\n"

    ranking_prompt += """Rate each response on a 1-5 scale for these criteria and provide final rankings:
1. Helpfulness
2. Accuracy
3. Safety
4. Fluency

Return results in JSON format."""

    client = openai.OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": ranking_prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

5.4 UltraFeedback Methodology

def create_ultrafeedback_data(prompts, models_to_evaluate):
    """UltraFeedback-style multi-model response collection and evaluation"""
    dataset = []

    for prompt in prompts:
        responses = {}
        # Collect responses from multiple models
        for model_name in models_to_evaluate:
            responses[model_name] = generate_response(prompt, model_name)

        # Evaluate each response with GPT-4 (1-10 scale)
        evaluations = {}
        for model_name, response in responses.items():
            score = evaluate_single_response(prompt, response)
            evaluations[model_name] = score

        # Select best/worst scoring responses (for DPO)
        best_model = max(evaluations, key=evaluations.get)
        worst_model = min(evaluations, key=evaluations.get)

        dataset.append({
            "prompt": prompt,
            "chosen": responses[best_model],
            "rejected": responses[worst_model],
            "chosen_model": best_model,
            "rejected_model": worst_model,
            "scores": evaluations,
        })

    return dataset

6. Data Quality Metrics

6.1 Diversity Measurement

from collections import Counter
import numpy as np

def vocabulary_diversity(texts):
    """Measure vocabulary diversity (Type-Token Ratio)"""
    all_tokens = []
    for text in texts:
        all_tokens.extend(text.split())

    types = len(set(all_tokens))
    tokens = len(all_tokens)
    ttr = types / tokens if tokens > 0 else 0
    return {"type_token_ratio": ttr, "unique_words": types, "total_words": tokens}

def topic_diversity(texts, n_topics=10):
    """Topic diversity (LDA-based)"""
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(max_features=5000)
    dtm = vectorizer.fit_transform(texts)

    lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    topic_dist = lda.fit_transform(dtm)

    # Topic entropy (higher = more uniform distribution)
    avg_dist = topic_dist.mean(axis=0)
    entropy = -np.sum(avg_dist * np.log(avg_dist + 1e-10))
    return {"topic_entropy": entropy, "max_entropy": np.log(n_topics)}

def instruction_diversity(instructions):
    """Instruction starter verb diversity"""
    first_words = [inst.split()[0] if inst.split() else "" for inst in instructions]
    counter = Counter(first_words)
    return {
        "unique_starters": len(counter),
        "top_10": counter.most_common(10),
        "starter_entropy": -sum(
            (c/len(first_words)) * np.log(c/len(first_words))
            for c in counter.values()
        ),
    }
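A worked example of the Type-Token Ratio that `vocabulary_diversity` computes, on a toy two-sentence Korean corpus:

```python
texts = ["나는 학교에 간다", "나는 집에 간다"]  # toy two-sentence corpus

# Flatten into tokens, then compare unique types against total tokens
tokens = [t for text in texts for t in text.split()]
ttr = len(set(tokens)) / len(tokens)
print(len(tokens), len(set(tokens)), round(ttr, 3))  # -> 6 4 0.667
```

Note that raw whitespace TTR is sensitive to corpus size (longer corpora naturally repeat more words), so compare TTR only between datasets of similar scale, or on fixed-size samples.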

6.2 Length Distribution Analysis

import numpy as np

def analyze_length_distribution(dataset, text_field="text"):
    """Dataset length distribution analysis"""
    lengths = [len(item[text_field]) for item in dataset]

    stats = {
        "count": len(lengths),
        "mean": np.mean(lengths),
        "median": np.median(lengths),
        "std": np.std(lengths),
        "min": np.min(lengths),
        "max": np.max(lengths),
        "p25": np.percentile(lengths, 25),
        "p75": np.percentile(lengths, 75),
        "p95": np.percentile(lengths, 95),
    }

    print("=== Length Distribution Statistics ===")
    for k, v in stats.items():
        print(f"  {k}: {v:.1f}")

    return stats
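One common use of the p95 statistic above is choosing a truncation cutoff, so that a handful of extreme outliers does not dictate the sequence length budget for the whole dataset. A minimal sketch on toy lengths:

```python
import numpy as np

lengths = [120, 340, 280, 95, 4000, 310, 260]  # toy character lengths

# Use the 95th percentile as the cutoff so the single 4000-char
# outlier does not dominate max sequence length for everything else
cutoff = float(np.percentile(lengths, 95))
kept = [l for l in lengths if l <= cutoff]
print(len(kept))  # -> 6
```
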

6.3 Benchmark Contamination Check

def check_contamination(train_data, benchmark_data, n_gram=13):
    """Check contamination between training and benchmark data"""
    # Build benchmark n-gram set
    benchmark_ngrams = set()
    for item in benchmark_data:
        text = item["question"] if "question" in item else item["text"]
        words = text.split()
        for i in range(len(words) - n_gram + 1):
            ngram = " ".join(words[i:i+n_gram])
            benchmark_ngrams.add(ngram)

    # Search for overlapping n-grams in training data
    contaminated = []
    for i, item in enumerate(train_data):
        text = item.get("instruction", "") + " " + item.get("output", "")
        words = text.split()
        for j in range(len(words) - n_gram + 1):
            ngram = " ".join(words[j:j+n_gram])
            if ngram in benchmark_ngrams:
                contaminated.append({
                    "train_idx": i,
                    "matched_ngram": ngram,
                })
                break

    contamination_rate = len(contaminated) / max(len(train_data), 1)
    print(f"Contamination rate: {contamination_rate:.4%} ({len(contaminated)}/{len(train_data)})")
    return contaminated
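Besides n-gram overlap, a cheaper exact-match variant hashes normalized benchmark sentences and looks each training sample up in the resulting set. A minimal sketch (normalization scheme is an illustrative assumption):

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial edits don't hide a match."""
    return " ".join(text.lower().split())

def exact_contamination(train_texts, benchmark_texts):
    """Return indices of training samples that exactly match a benchmark sentence."""
    hashes = {hashlib.sha256(normalize(t).encode()).hexdigest() for t in benchmark_texts}
    return [
        i for i, t in enumerate(train_texts)
        if hashlib.sha256(normalize(t).encode()).hexdigest() in hashes
    ]

benchmark = ["What is the capital of France?"]
train = ["what is the   capital of france?", "Explain quicksort."]
print(exact_contamination(train, benchmark))  # -> [0]
```

Exact matching misses paraphrases, which is why it is usually paired with the n-gram check above rather than used alone.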

7. Full Pipeline: Building a Korean SFT Dataset from Scratch

7.1 Complete Pipeline Code

"""
Korean SFT Dataset Construction Pipeline
Collect -> Clean -> Deduplicate -> Format -> Validate -> Upload
"""

import json
import os
from datasets import Dataset
from tqdm import tqdm

# ===== Step 1: Data Collection =====
def collect_data():
    """Collect data from various sources"""
    all_data = []

    # 1-1. Load existing datasets
    from datasets import load_dataset
    koalpaca = load_dataset("beomi/KoAlpaca-v1.1a", split="train")
    for item in koalpaca:
        all_data.append({
            "instruction": item["instruction"],
            "input": "",
            "output": item["output"],
            "source": "koalpaca",
        })

    # 1-2. Add synthetic data
    synthetic = generate_synthetic_data(num_samples=1000)
    all_data.extend(synthetic)

    print(f"Total collected: {len(all_data)}")
    return all_data

# ===== Step 2: Data Cleaning =====
def clean_data(raw_data):
    """Data cleaning pipeline"""
    cleaned = []
    pii_remover = PIIRemover()
    quality_filter = QualityFilter()

    for item in tqdm(raw_data, desc="Cleaning"):
        instruction = clean_text(item["instruction"])
        output = clean_text(item["output"])

        instruction = pii_remover.remove_pii(instruction)
        output = pii_remover.remove_pii(output)

        if not quality_filter.filter(instruction) or not quality_filter.filter(output):
            continue

        cleaned.append({
            "instruction": instruction,
            "input": item.get("input", ""),
            "output": output,
            "source": item["source"],
        })

    print(f"After cleaning: {len(cleaned)}/{len(raw_data)}")
    return cleaned

# ===== Step 3: Deduplication =====
def remove_duplicates(data):
    """Remove duplicates"""
    dedup = TextDeduplicator(threshold=0.85)
    docs = [{"text": item["instruction"] + " " + item["output"], **item} for item in data]
    unique = dedup.deduplicate(docs)
    return [{"instruction": d["instruction"], "input": d.get("input", ""),
             "output": d["output"], "source": d["source"]} for d in unique]

# ===== Step 4: Format Conversion =====
def format_data(data, target_format="openai"):
    """Convert to target format"""
    formatted = []
    system_prompt = "You are a helpful Korean AI assistant."

    for item in data:
        if target_format == "openai":
            formatted.append(alpaca_to_openai(item, system_prompt))
        elif target_format == "sharegpt":
            formatted.append({
                "conversations": [
                    {"from": "human", "value": item["instruction"]},
                    {"from": "gpt", "value": item["output"]},
                ],
            })
    return formatted

# ===== Step 5: Validation =====
def validate_data(data, format_type="openai"):
    """Data quality validation"""
    errors = []
    for i, item in enumerate(data):
        if format_type == "openai":
            if "messages" not in item:
                errors.append(f"[{i}] Missing messages field")
            elif len(item["messages"]) < 2:
                errors.append(f"[{i}] Insufficient message count")
            for msg in item.get("messages", []):
                if msg.get("role") not in ("system", "user", "assistant"):
                    errors.append(f"[{i}] Invalid role: {msg.get('role')}")
                if not msg.get("content", "").strip():
                    errors.append(f"[{i}] Empty content")

    if errors:
        print(f"Validation errors: {len(errors)}")
        for e in errors[:10]:
            print(f"  {e}")
    else:
        print("Validation passed!")
    return len(errors) == 0

# ===== Step 6: Upload =====
def upload_to_hub(data, repo_name):
    """Upload to Hugging Face Hub"""
    ds = Dataset.from_list(data)
    split_ds = ds.train_test_split(test_size=0.05, seed=42)

    split_ds.push_to_hub(repo_name, private=True)
    print(f"Upload complete: {repo_name}")
    print(f"  Train: {len(split_ds['train'])}, Validation: {len(split_ds['test'])}")

# ===== Execute =====
if __name__ == "__main__":
    raw_data = collect_data()
    cleaned_data = clean_data(raw_data)
    unique_data = remove_duplicates(cleaned_data)
    formatted_data = format_data(unique_data, "openai")
    is_valid = validate_data(formatted_data, "openai")
    if is_valid:
        upload_to_hub(formatted_data, "my-org/korean-sft-v1")

7.2 Quality Dashboard

def generate_quality_report(dataset):
    """Generate dataset quality report"""
    report = {
        "total_samples": len(dataset),
        "length_stats": analyze_length_distribution(dataset, "instruction"),
        "diversity": vocabulary_diversity([d["instruction"] for d in dataset]),
        "source_distribution": Counter(d["source"] for d in dataset),
    }

    print("=" * 60)
    print("Dataset Quality Report")
    print("=" * 60)
    print(f"Total samples: {report['total_samples']}")
    print(f"\nLength statistics:")
    for k, v in report['length_stats'].items():
        print(f"  {k}: {v:.1f}")
    print(f"\nVocabulary diversity: TTR = {report['diversity']['type_token_ratio']:.4f}")
    print(f"\nSource distribution:")
    for source, count in report['source_distribution'].most_common():
        print(f"  {source}: {count} ({count/len(dataset)*100:.1f}%)")
    print("=" * 60)

    return report

8. Quiz

Q1. What was the key factor that allowed Phi-3 to outperform larger models?

Answer: Meticulously curated high-quality training data

Phi-3 outperformed 7B-13B models with just 3.8B parameters. The key was not model size but training data quality. It was trained using textbook-quality synthetic data that was carefully curated.

Q2. What is MinHash LSH used for?

Answer: Approximate Deduplication

MinHash LSH is an algorithm that efficiently finds near-duplicate documents in large datasets for deduplication. It uses similarity-based approximate matching rather than exact matching, running in near-linear time instead of the O(n²) cost of comparing every pair of documents.
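To make the idea concrete, here is a toy pure-Python MinHash (signatures only, without the LSH bucketing that makes lookups sub-linear). The hash scheme and Korean sample sentences are illustrative assumptions:

```python
import hashlib

def minhash_signature(tokens, num_perm=64):
    """For each of num_perm seeded hash functions, keep the minimum
    hash value over the token set; that vector is the signature."""
    return [
        min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16) for t in tokens)
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Two sentences differing by one word: true Jaccard = 6/8 = 0.75
doc_a = set("오늘 날씨가 정말 좋다 그래서 산책을 갔다".split())
doc_b = set("오늘 날씨가 정말 좋다 그래서 등산을 갔다".split())
sig_a, sig_b = minhash_signature(doc_a), minhash_signature(doc_b)
print(round(estimated_jaccard(sig_a, sig_b), 2))
```

In practice a library such as `datasketch` provides both the signatures and the LSH index, so each document is compared only against candidates sharing a bucket.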

Q3. What are the key differences between Alpaca, ShareGPT, and OpenAI Messages formats?

Answer:

  • Alpaca: Single-turn structure with instruction/input/output
  • ShareGPT: Multi-turn conversations array with human/gpt roles
  • OpenAI Messages: Messages array with system/user/assistant roles

Both ShareGPT and OpenAI formats support multi-turn conversations, but differ in role names and structure.
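The same single-turn sample expressed in each of the three formats (field names follow the descriptions above):

```python
# Alpaca: flat single-turn fields
alpaca = {
    "instruction": "Translate 'hello' into Korean.",
    "input": "",
    "output": "안녕하세요",
}

# ShareGPT: conversations array with human/gpt roles
sharegpt = {
    "conversations": [
        {"from": "human", "value": alpaca["instruction"]},
        {"from": "gpt", "value": alpaca["output"]},
    ]
}

# OpenAI: messages array with user/assistant (and optionally system) roles
openai_messages = {
    "messages": [
        {"role": "user", "content": alpaca["instruction"]},
        {"role": "assistant", "content": alpaca["output"]},
    ]
}
```
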

Q4. What do "chosen" and "rejected" mean in DPO datasets?

Answer:

  • chosen: The human-preferred (better) response
  • rejected: The human-dispreferred (worse) response

DPO (Direct Preference Optimization) uses this paired data to train models to generate responses similar to chosen and different from rejected. Unlike RLHF, it optimizes directly without a separate reward model.
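For reference, this paired data plugs directly into the DPO objective from Rafailov et al. (2023), where y_w is the chosen and y_l the rejected response, pi_theta the policy being trained, and pi_ref the frozen reference model:

```latex
\mathcal{L}_{\mathrm{DPO}} =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\;
      \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

Minimizing this loss raises the relative likelihood of chosen responses over rejected ones while the beta-scaled log-ratios keep the policy close to the reference model.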

Q5. Why is benchmark contamination dangerous and how can it be detected?

Answer:

Benchmark contamination occurs when evaluation data is included in training data, causing model performance to be overestimated.

Risks:

  • Model has "memorized" rather than actually "solved" the problems
  • Fair model comparison becomes impossible
  • Below-expected performance when deployed in production

Detection methods:

  • N-gram overlap checking (typically 13-gram)
  • Hash comparison of benchmark sentences
  • Following the contamination-reporting methodology described in the GPT-4 technical report

9. References

  1. LIMA: Less Is More for Alignment - Zhou et al., 2023
  2. Phi-3 Technical Report - Microsoft Research, 2024
  3. Self-Instruct: Aligning LLMs with Self-Generated Instructions - Wang et al., 2023
  4. WizardLM: Empowering Large Language Models to Follow Complex Instructions - Xu et al., 2023
  5. Hugging Face Datasets Documentation - huggingface.co/docs/datasets
  6. KoAlpaca: Korean Alpaca Model - beomi, GitHub
  7. UltraFeedback: Boosting Language Models with High-quality Feedback - Cui et al., 2023
  8. Training Language Models to Follow Instructions with Human Feedback - Ouyang et al., 2022
  9. Direct Preference Optimization - Rafailov et al., 2023
  10. Deduplicating Training Data Makes Language Models Better - Lee et al., 2022
  11. KLUE: Korean Language Understanding Evaluation - Park et al., 2021
  12. Textbooks Are All You Need - Gunasekar et al., 2023
  13. Constitutional AI: Harmlessness from AI Feedback - Bai et al., 2022
  14. The RefinedWeb Dataset for Falcon LLM - Penedo et al., 2023