한국어 LLM 학습 데이터 제작 완전 가이드: Hugging Face 데이터셋, 전처리, 품질 관리까지
- 서론: 왜 학습 데이터가 모델 아키텍처보다 중요한가
- 1. Hugging Face Datasets 딥다이브
- 2. 한국어 데이터 수집 방법
- 3. 데이터 전처리 파이프라인
- 4. Instruction Tuning 데이터 포맷
- 5. RLHF/DPO 데이터셋 구축
- 6. 데이터 품질 메트릭
- 7. 실전 예제: 한국어 SFT 데이터셋 처음부터 구축하기
- 8. 퀴즈
- 9. 참고 자료
서론: 왜 학습 데이터가 모델 아키텍처보다 중요한가
2024년, Microsoft Research의 Phi-3 논문은 업계에 충격을 줬습니다. 3.8B 파라미터 모델이 7B~13B 모델을 능가하는 성능을 보였고, 그 비결은 철저하게 큐레이션된 고품질 학습 데이터였습니다. Meta의 LIMA 논문("Less Is More for Alignment")은 단 1,000개의 수작업 큐레이션 샘플로 정렬 학습을 해도 인간 평가에서 GPT-4 응답과 견줄 만한 결과가 적지 않게 나온다는 것을 보여줬습니다.
"Data is the new oil"이라는 말은 이제 진부하지만, LLM 시대에는 그 어느 때보다 정확합니다. 모델 아키텍처의 혁신(Transformer, MoE, State Space Models)도 중요하지만, 같은 아키텍처라도 학습 데이터의 품질과 다양성에 따라 성능이 천지 차이입니다.
한국어 LLM 생태계도 빠르게 성장하고 있습니다:
| 모델 | 개발사 | 파라미터 | 특징 |
|---|---|---|---|
| SOLAR | Upstage | 10.7B | Depth Up-Scaling, 한국어 특화 |
| EXAONE | LG AI Research | 7.8B | 기업용 한국어 LLM |
| HyperCLOVA X | NAVER | 비공개 | 한국어 최대 규모 |
| Qwen-KO | 커뮤니티 | 다양 | Qwen 기반 한국어 파인튜닝 |
| KULLM | 고려대 | 13B | 한국어 오픈소스 LLM |
| Polyglot-Ko | EleutherAI | 12.8B | 한국어 사전학습 모델 |
이 모든 모델의 성능을 좌우하는 것은 결국 학습 데이터입니다. 이 가이드에서는 Hugging Face 데이터셋 활용법부터 한국어 데이터 수집, 전처리, Instruction Tuning 포맷, RLHF/DPO 데이터셋 구축까지 전 과정을 다룹니다.
1. Hugging Face Datasets 딥다이브
1.1 플랫폼 개요
Hugging Face는 2023년 기준 15만 개 이상의 데이터셋을 호스팅하는 ML 커뮤니티 플랫폼입니다. 데이터셋 뷰어, 다운로드 통계, 자동 문서화 등의 기능을 제공합니다.
핵심 기능:
- Dataset Viewer: 브라우저에서 바로 데이터 미리보기
- Download Stats: 월간 다운로드 수 확인
- Dataset Card: 데이터셋 메타데이터, 라이선스, 사용법 문서
- Streaming: 전체 다운로드 없이 스트리밍 로드
- Git LFS: 대용량 파일 버전 관리
1.2 데이터셋 유형별 분류
사전학습(Pre-training) 데이터
대규모 텍스트 코퍼스로, 모델의 기본 언어 이해력을 형성합니다.
| 데이터셋 | 크기 | 언어 | 설명 |
|---|---|---|---|
| CC-100 | 2.5TB | 100+언어 | Common Crawl 기반 정제 코퍼스 |
| mC4 | 27TB | 101언어 | Google의 다국어 C4 |
| Korean Wikipedia | ~1GB | 한국어 | 위키피디아 한국어판 전문 |
| Namuwiki | ~5GB | 한국어 | 나무위키 덤프 (비상업적 용도) |
| KCC (Korean Crawl Corpus) | ~30GB | 한국어 | 한국어 웹 크롤 데이터 |
| OSCAR | 다양 | 다국어 | Common Crawl 기반 분류된 코퍼스 |
SFT/Instruction Tuning 데이터
LLM이 지시를 따르도록 학습하는 핵심 데이터입니다.
| 데이터셋 | 크기 | 포맷 | 설명 |
|---|---|---|---|
| Alpaca (Stanford) | 52K | instruction/input/output | Self-Instruct로 생성 |
| ShareGPT | 90K+ | conversations | 실제 ChatGPT 대화 수집 |
| LIMA | 1K | instruction/output | 수작업 큐레이션 고품질 |
| OpenOrca | 4M | instruction/output | GPT-4 응답 포함 |
| Dolly 2.0 | 15K | instruction/output | 수작업, 상업적 사용 가능 |
| FLAN Collection | 1836 tasks | 다양 | Google의 대규모 Instruction 모음 |
RLHF/DPO 데이터
인간 선호도를 반영하는 정렬(Alignment) 데이터입니다.
| 데이터셋 | 크기 | 구조 | 설명 |
|---|---|---|---|
| HH-RLHF (Anthropic) | 170K | chosen/rejected | 도움됨 + 무해함 선호도 |
| UltraFeedback | 64K | 4개 기준 점수 | GPT-4 기반 자동 평가 |
| Nectar | 183K | ranked list | 7개 모델 응답 순위 |
| Chatbot Arena | 지속 갱신 | ELO 점수 | 인간 블라인드 비교 |
평가(Evaluation) 벤치마크
| 벤치마크 | 영역 | 한국어 지원 |
|---|---|---|
| MMLU | 57개 학문 분야 | 번역 버전 존재 |
| ARC | 과학 추론 | 번역 버전 |
| HellaSwag | 상식 추론 | 번역 버전 |
| KoBBQ | 편향 평가 | 한국어 네이티브 |
| KLUE | 한국어 NLU | 한국어 네이티브 |
| KorNAT | 한국어 상식 | 한국어 네이티브 |
1.3 한국어 특화 데이터셋
한국어 LLM 데이터셋 생태계
├── 사전학습
│ ├── Korean Wikipedia (~600K articles)
│ ├── Namuwiki Dump (~5GB)
│ ├── AI Hub 말뭉치 (국립국어원)
│ └── mC4-ko (Korean subset)
├── Instruction Tuning
│ ├── KoAlpaca (beomi) - 52K
│ ├── KoVicuna (melodysdreamj) - 40K+
│ ├── KOpen-platypus - 25K
│ ├── ko_wikidata_QA - 위키 기반 QA
│ └── kullm-v2 (고려대) - 152K
├── 선호도/정렬
│ ├── ko-rlhf (커뮤니티)
│ └── KoreanFeedback (자체 구축)
└── 평가
├── KLUE (8 tasks)
├── KoBBQ (편향)
└── KorNAT (상식)
1.4 datasets 라이브러리 실전 활용
기본 로딩 및 탐색
from datasets import load_dataset, Dataset, DatasetDict
# 기본 로딩
ds = load_dataset("beomi/KoAlpaca-v1.1a")
print(ds)
# DatasetDict({
# train: Dataset({
# features: ['instruction', 'output'],
# num_rows: 21155
# })
# })
# 특정 split 로딩
train_ds = load_dataset("beomi/KoAlpaca-v1.1a", split="train")
# 처음 5개 확인
for example in train_ds.select(range(5)):
print(f"Instruction: {example['instruction'][:50]}...")
print(f"Output: {example['output'][:50]}...")
print("---")
필터링 및 변환
# 길이 기반 필터링
filtered_ds = ds["train"].filter(
lambda x: len(x["instruction"]) > 10 and len(x["output"]) > 20
)
print(f"필터링 후: {len(filtered_ds)} / {len(ds['train'])}")
# Alpaca 포맷으로 변환
def format_alpaca(example):
text = f"""### Instruction:
{example['instruction']}
### Response:
{example['output']}"""
return {"text": text}
formatted_ds = filtered_ds.map(format_alpaca)
# 토크나이저 적용
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("beomi/llama-2-ko-7b")
def tokenize_function(examples):
return tokenizer(
examples["text"],
truncation=True,
max_length=2048,
padding="max_length",
)
tokenized_ds = formatted_ds.map(
tokenize_function,
batched=True,
remove_columns=formatted_ds.column_names,
)
스트리밍 모드 (대용량 데이터)
# 스트리밍으로 대용량 데이터 처리 (메모리 효율적)
streaming_ds = load_dataset(
"allenai/c4",
"ko",
split="train",
streaming=True,
)
# 처음 100개만 순회
for i, example in enumerate(streaming_ds):
if i >= 100:
break
process(example["text"])
# 스트리밍 + 필터링 + 배치 처리
filtered_stream = streaming_ds.filter(
lambda x: len(x["text"]) > 100
).take(10000)
# 배치 단위로 처리
batch = []
for example in filtered_stream:
batch.append(example)
if len(batch) == 32:
process_batch(batch)
batch = []
Hugging Face Hub에 업로드
from datasets import Dataset
import pandas as pd
# 데이터프레임에서 데이터셋 생성
df = pd.DataFrame({
"instruction": ["한국의 수도는?", "파이썬이란?"],
"output": ["한국의 수도는 서울입니다.", "파이썬은 프로그래밍 언어입니다."],
})
my_dataset = Dataset.from_pandas(df)
# Hub에 업로드
my_dataset.push_to_hub(
"my-org/my-korean-dataset",
private=True, # 비공개 설정
token="hf_xxxxx",
)
# Dataset Card 자동 생성
from huggingface_hub import DatasetCard
card = DatasetCard.load("my-org/my-korean-dataset")
card.text = """
# My Korean Dataset
한국어 Instruction Tuning 데이터셋입니다.
## 데이터 구조
- instruction: 질문/지시문
- output: 응답
## 라이선스
CC-BY-4.0
"""
card.push_to_hub("my-org/my-korean-dataset")
2. 한국어 데이터 수집 방법
2.1 웹 크롤링
# newspaper3k를 이용한 뉴스 크롤링
from newspaper import Article
import json
def crawl_article(url):
"""뉴스 기사 크롤링 (robots.txt 준수 필수!)"""
article = Article(url, language="ko")
article.download()
article.parse()
return {
"title": article.title,
"text": article.text,
"publish_date": str(article.publish_date),
"source_url": url,
}
# Scrapy를 이용한 대규모 크롤링
# scrapy_spider.py
"""
import scrapy
class KoreanTextSpider(scrapy.Spider):
name = 'korean_text'
custom_settings = {
'ROBOTSTXT_OBEY': True, # robots.txt 준수 필수
'DOWNLOAD_DELAY': 2, # 2초 간격
'CONCURRENT_REQUESTS': 4, # 동시 요청 제한
}
def parse(self, response):
text = response.css('article::text').getall()
yield {
'url': response.url,
'text': ' '.join(text),
}
"""
크롤링 시 주의사항:
- robots.txt 반드시 준수
- 요청 간격 최소 1~2초
- 저작권/라이선스 확인
- 개인정보 필터링 필수
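위 주의사항 중 robots.txt 준수는 표준 라이브러리 `urllib.robotparser`로 크롤링 전에 프로그래밍 방식으로 확인할 수 있습니다. 아래는 최소 스케치이며, User-Agent 이름 `MyCrawler`와 규칙 문자열은 예시로 가정한 값입니다.

```python
from urllib import robotparser

def can_fetch(robots_txt: str, url: str, user_agent: str = "MyCrawler") -> bool:
    """robots.txt 내용을 파싱해 해당 URL의 크롤링 허용 여부를 반환"""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# 예시 규칙: /private/ 이하만 수집 금지
rules = """User-agent: *
Disallow: /private/
"""
print(can_fetch(rules, "https://example.com/news/1"))     # True
print(can_fetch(rules, "https://example.com/private/x"))  # False
```

실제 크롤러에서는 대상 사이트의 `https://<도메인>/robots.txt`를 먼저 받아 같은 방식으로 검사한 뒤 요청을 보내면 됩니다.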
2.2 공공 데이터 소스
| 소스 | URL | 데이터 유형 | 라이선스 |
|---|---|---|---|
| AI Hub | aihub.or.kr | 다양한 한국어 말뭉치 | 공공 (이용 신청 필요) |
| 모두의말뭉치 | corpus.korean.go.kr | 문어/구어 코퍼스 | CC BY |
| NIKL (국립국어원) | korean.go.kr | 표준 말뭉치 | 학술용 |
| 공공데이터포털 | data.go.kr | 정부 공공데이터 | 공공 |
# AI Hub 데이터 로딩 예시
import json
import glob
def load_aihub_data(data_dir):
"""AI Hub JSON 포맷 데이터 로딩"""
all_data = []
for filepath in glob.glob(f"{data_dir}/**/*.json", recursive=True):
with open(filepath, "r", encoding="utf-8") as f:
data = json.load(f)
# AI Hub 형식에 따라 파싱
if "document" in data:
for doc in data["document"]:
for sent in doc.get("sentence", []):
all_data.append({
"text": sent.get("form", ""),
"source": "aihub",
})
return all_data
2.3 번역 기반 데이터 생성
# NLLB (No Language Left Behind)를 이용한 번역
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
def translate_en_to_ko(text):
"""영어 -> 한국어 번역"""
tokenizer.src_lang = "eng_Latn"
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
translated = model.generate(
**inputs,
forced_bos_token_id=tokenizer.convert_tokens_to_ids("kor_Hang"),
max_length=512,
)
return tokenizer.decode(translated[0], skip_special_tokens=True)
# 번역 품질 검증
def validate_translation(original, translated):
"""번역 품질 자동 검증"""
checks = {
"not_empty": len(translated.strip()) > 0,
"not_too_short": len(translated) > len(original) * 0.3,
"not_too_long": len(translated) < len(original) * 3,
"no_english_majority": sum(1 for c in translated if c.isascii()) / max(len(translated), 1) < 0.5,
}
return all(checks.values()), checks
2.4 합성 데이터 생성
Self-Instruct 방식
import openai
import json
import random
# Self-Instruct: 시드 데이터에서 새로운 지시문 생성
SEED_INSTRUCTIONS = [
"한국의 4계절에 대해 설명해주세요.",
"파이썬에서 리스트 컴프리헨션의 장점은?",
"이메일 작성 시 주의사항을 알려주세요.",
]
def generate_new_instructions(seed_instructions, num_generate=10):
"""GPT-4로 새로운 instruction 생성"""
prompt = f"""다음은 한국어 지시문 예시입니다:
{chr(10).join(f'{i+1}. {inst}' for i, inst in enumerate(seed_instructions))}
위 예시와 비슷한 스타일이지만 완전히 새로운 한국어 지시문을 {num_generate}개 생성하세요.
다양한 주제(과학, 역사, 기술, 일상생활 등)를 포함하세요.
각 지시문은 번호와 함께 한 줄로 작성하세요."""
client = openai.OpenAI()
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=0.8,
)
    # parse_instructions: 응답의 번호 목록을 지시문 리스트로 파싱하는 헬퍼 (별도 구현 필요)
    return parse_instructions(response.choices[0].message.content)
def generate_response(instruction):
"""지시문에 대한 응답 생성"""
client = openai.OpenAI()
response = client.chat.completions.create(
model="gpt-4",
messages=[
{"role": "system", "content": "당신은 도움이 되는 한국어 AI 어시스턴트입니다."},
{"role": "user", "content": instruction},
],
temperature=0.7,
)
return response.choices[0].message.content
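Self-Instruct 원 논문은 새로 생성된 지시문이 기존 풀과 ROUGE-L 기준으로 지나치게 유사하면 버리는 필터를 둡니다. 아래는 ROUGE 대신 표준 라이브러리 difflib로 유사도를 근사한 단순화 스케치입니다 (threshold 0.7은 예시 값).

```python
import difflib

def is_novel(candidate: str, pool: list, threshold: float = 0.7) -> bool:
    """기존 풀과의 최대 문자열 유사도가 threshold 미만일 때만 True (신규 지시문 채택)"""
    for inst in pool:
        if difflib.SequenceMatcher(None, candidate, inst).ratio() >= threshold:
            return False
    return True

pool = ["한국의 4계절에 대해 설명해주세요."]
print(is_novel("한국의 4계절에 대해 설명해주세요.", pool))              # False (중복)
print(is_novel("파이썬 데코레이터의 동작 원리를 설명해주세요.", pool))  # True (신규)
```

채택된 지시문을 다시 풀에 추가하면서 반복하면 풀이 점점 다양해지는 구조입니다.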
Evol-Instruct 방식
def evolve_instruction(instruction, evolution_type="deepen"):
"""WizardLM의 Evol-Instruct: 지시문을 점진적으로 복잡하게 만들기"""
evolution_prompts = {
"deepen": f"""다음 지시문을 더 깊이 있고 구체적으로 만들어주세요.
원본: {instruction}
진화된 버전:""",
"broaden": f"""다음 지시문의 범위를 넓혀서 더 포괄적으로 만들어주세요.
원본: {instruction}
진화된 버전:""",
"concretize": f"""다음 지시문에 구체적인 조건이나 제약을 추가해주세요.
원본: {instruction}
진화된 버전:""",
"reasoning": f"""다음 지시문을 단계적 추론이 필요한 형태로 변환해주세요.
원본: {instruction}
진화된 버전:""",
}
client = openai.OpenAI()
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": evolution_prompts[evolution_type]}],
temperature=0.7,
)
return response.choices[0].message.content
2.5 커뮤니티 데이터 소스
- 나무위키: 풍부한 한국어 콘텐츠 (비상업적 CC-BY-NC-SA)
- 한국어 Reddit: r/korea, r/hanguk 등
- Stack Overflow 한국어: 기술 Q&A
- 네이버 지식iN: 크롤링 주의 (이용약관 확인)
- 한국어 위키백과: CC-BY-SA 라이선스
3. 데이터 전처리 파이프라인
3.1 전체 파이프라인 아키텍처
원본 데이터
│
▼
┌─────────────────┐
│ 1. 텍스트 정제 │ HTML 태그 제거, 인코딩 정리
└────────┬────────┘
▼
┌─────────────────┐
│ 2. 언어 감지 │ 한국어 텍스트만 필터링
└────────┬────────┘
▼
┌─────────────────┐
│ 3. 중복 제거 │ MinHash LSH, Exact Match
└────────┬────────┘
▼
┌─────────────────┐
│ 4. 품질 필터링 │ Perplexity, 길이, 독성
└────────┬────────┘
▼
┌─────────────────┐
│ 5. PII 제거 │ 개인정보 마스킹
└────────┬────────┘
▼
┌─────────────────┐
│ 6. 토크나이징 │ SentencePiece / BPE
└────────┬────────┘
▼
정제된 데이터
3.2 텍스트 정제
import re
import html
import unicodedata
def clean_text(text):
"""한국어 텍스트 기본 정제"""
# HTML 엔티티 디코딩
text = html.unescape(text)
# HTML 태그 제거
text = re.sub(r'<[^>]+>', '', text)
# URL 제거
text = re.sub(r'https?://\S+|www\.\S+', '', text)
# 이메일 제거
text = re.sub(r'\S+@\S+\.\S+', '[EMAIL]', text)
# 전화번호 마스킹
text = re.sub(r'\d{2,3}-\d{3,4}-\d{4}', '[PHONE]', text)
# 연속 공백 정리
text = re.sub(r'\s+', ' ', text)
# Unicode 정규화 (NFC)
text = unicodedata.normalize('NFC', text)
# 한국어에 불필요한 특수문자 제거 (기본 문장부호 유지)
text = re.sub(r'[^\w\s가-힣ㄱ-ㅎㅏ-ㅣa-zA-Z0-9.,!?;:\'\"()\-]', '', text)
return text.strip()
def clean_korean_specific(text):
"""한국어 특화 정제"""
# 자음/모음만 있는 경우 제거 (ㅋㅋㅋ, ㅎㅎㅎ 등은 상황에 따라)
# 광고성 텍스트 패턴 제거
ad_patterns = [
r'지금\s*바로\s*클릭',
r'무료\s*상담',
r'카카오톡?\s*문의',
r'전화\s*주세요',
]
for pattern in ad_patterns:
if re.search(pattern, text):
return None # 광고성으로 판단되면 제거
return text
# 배치 처리
def clean_batch(texts):
"""배치 단위 정제"""
cleaned = []
for text in texts:
result = clean_text(text)
result = clean_korean_specific(result)
if result and len(result) > 20:
cleaned.append(result)
return cleaned
3.3 중복 제거 (Deduplication)
from datasketch import MinHash, MinHashLSH
import hashlib
class TextDeduplicator:
"""MinHash LSH 기반 근사 중복 제거"""
def __init__(self, threshold=0.8, num_perm=128):
self.threshold = threshold
self.num_perm = num_perm
self.lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
self.seen_exact = set()
def get_minhash(self, text):
"""텍스트의 MinHash 생성"""
m = MinHash(num_perm=self.num_perm)
# 3-gram 단위로 분할
for i in range(len(text) - 2):
m.update(text[i:i+3].encode('utf-8'))
return m
def is_duplicate(self, text, doc_id):
"""중복 여부 확인"""
# 1. 정확 매칭 (해시 기반)
text_hash = hashlib.md5(text.encode('utf-8')).hexdigest()
if text_hash in self.seen_exact:
return True
self.seen_exact.add(text_hash)
# 2. 근사 매칭 (MinHash LSH)
minhash = self.get_minhash(text)
result = self.lsh.query(minhash)
if result:
return True
self.lsh.insert(doc_id, minhash)
return False
def deduplicate(self, documents):
"""문서 리스트 중복 제거"""
unique_docs = []
for i, doc in enumerate(documents):
if not self.is_duplicate(doc["text"], f"doc_{i}"):
unique_docs.append(doc)
print(f"중복 제거: {len(documents)} -> {len(unique_docs)} "
f"({len(documents) - len(unique_docs)}개 제거)")
return unique_docs
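MinHash가 근사하려는 대상은 문자 n-gram 집합 간의 Jaccard 유사도입니다. 동작 원리를 이해하기 위한 순수 파이썬 스케치이며, 소규모 데이터라면 이 정확 계산만으로도 충분할 수 있습니다 (위 get_minhash와 동일한 3-gram 기준).

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """문자 단위 n-gram 집합 생성"""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a: str, b: str, n: int = 3) -> float:
    """두 텍스트의 n-gram Jaccard 유사도 (0~1, 높을수록 유사)"""
    A, B = char_ngrams(a, n), char_ngrams(b, n)
    return len(A & B) / max(len(A | B), 1)

print(jaccard("안녕하세요 여러분", "안녕하세요 여러분!"))  # 0.875 (거의 중복)
print(jaccard("안녕하세요 여러분", "오늘 날씨가 좋네요"))  # 0.0 (무관)
```

threshold=0.8이라는 설정은 곧 "Jaccard 유사도 0.8 이상이면 중복으로 간주"한다는 뜻입니다.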
3.4 언어 감지 필터링
import fasttext
# fasttext 언어 감지 모델 로드
model_path = "lid.176.bin" # 사전 다운로드 필요
lang_model = fasttext.load_model(model_path)
def detect_language(text):
"""텍스트 언어 감지"""
# 줄바꿈 제거 (fasttext는 한 줄 입력)
text_clean = text.replace('\n', ' ')[:200]
predictions = lang_model.predict(text_clean)
lang = predictions[0][0].replace('__label__', '')
confidence = predictions[1][0]
return lang, confidence
def filter_korean(documents, min_confidence=0.7):
"""한국어 텍스트만 필터링"""
korean_docs = []
for doc in documents:
lang, conf = detect_language(doc["text"])
if lang == "ko" and conf >= min_confidence:
korean_docs.append(doc)
return korean_docs
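fasttext 모델 파일을 내려받을 수 없는 환경이라면, 유니코드 범위 기반 한글 비율 휴리스틱으로 1차 필터링을 대신할 수 있습니다. 정확한 언어 식별기는 아니며, 단순한 가정하의 스케치입니다.

```python
def hangul_ratio(text: str) -> float:
    """전체 문자 중 한글(완성형 음절 + 자모)이 차지하는 비율"""
    if not text:
        return 0.0
    hangul = sum(
        1 for c in text
        if '가' <= c <= '힣' or 'ㄱ' <= c <= 'ㅎ' or 'ㅏ' <= c <= 'ㅣ'
    )
    return hangul / len(text)

print(hangul_ratio("안녕하세요"))    # 1.0
print(hangul_ratio("hello world"))  # 0.0
```

예를 들어 `hangul_ratio(text) >= 0.3` 같은 기준으로 거르면 영어/중국어 문서를 빠르게 제외할 수 있습니다 (임계값 0.3은 예시).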
3.5 품질 필터링
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
class QualityFilter:
"""텍스트 품질 필터링"""
def __init__(self):
self.criteria = {
"min_length": 50,
"max_length": 10000,
"min_words": 10,
"max_repetition_ratio": 0.3,
"max_special_char_ratio": 0.1,
}
def check_length(self, text):
"""길이 기반 필터"""
return self.criteria["min_length"] <= len(text) <= self.criteria["max_length"]
def check_repetition(self, text):
"""반복 텍스트 감지"""
words = text.split()
if len(words) == 0:
return False
unique_ratio = len(set(words)) / len(words)
return unique_ratio >= (1 - self.criteria["max_repetition_ratio"])
def check_special_chars(self, text):
"""특수 문자 비율 확인"""
special = sum(1 for c in text if not c.isalnum() and not c.isspace()
and c not in '.,!?;:')
return special / max(len(text), 1) < self.criteria["max_special_char_ratio"]
def compute_perplexity(self, text, model, tokenizer, device="cuda"):
"""Perplexity 기반 품질 평가 (낮을수록 자연스러운 텍스트)"""
inputs = tokenizer(text, return_tensors="pt", truncation=True,
max_length=512).to(device)
with torch.no_grad():
outputs = model(**inputs, labels=inputs["input_ids"])
return torch.exp(outputs.loss).item()
def filter(self, text):
"""종합 품질 필터링"""
return (
self.check_length(text)
and self.check_repetition(text)
and self.check_special_chars(text)
)
3.6 PII (개인정보) 제거
import re
from typing import Dict, List
class PIIRemover:
"""개인식별정보(PII) 제거"""
PATTERNS = {
"주민등록번호": r'\d{6}[-]?\d{7}',
"전화번호": r'0\d{1,2}[-.]?\d{3,4}[-.]?\d{4}',
"이메일": r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
"카드번호": r'\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}',
"계좌번호": r'\d{3,6}[-]?\d{2,6}[-]?\d{2,6}[-]?\d{0,3}',
"IP주소": r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}',
}
def remove_pii(self, text: str) -> str:
"""PII를 마스킹 토큰으로 대체"""
for pii_type, pattern in self.PATTERNS.items():
mask_token = f"[{pii_type}]"
text = re.sub(pattern, mask_token, text)
return text
def detect_pii(self, text: str) -> Dict[str, List[str]]:
"""PII 감지 (제거 전 확인용)"""
found = {}
for pii_type, pattern in self.PATTERNS.items():
matches = re.findall(pattern, text)
if matches:
found[pii_type] = matches
return found
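위 클래스의 동작을 확인하는 간단한 사용 예입니다. 단독 실행이 가능하도록 패턴 두 개만 추려 함수 형태로 줄인 스케치입니다.

```python
import re

# PIIRemover.PATTERNS에서 두 패턴만 발췌
PATTERNS = {
    "전화번호": r'0\d{1,2}[-.]?\d{3,4}[-.]?\d{4}',
    "이메일": r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
}

def mask_pii(text: str) -> str:
    """PII 패턴을 [유형] 토큰으로 치환"""
    for name, pattern in PATTERNS.items():
        text = re.sub(pattern, f"[{name}]", text)
    return text

print(mask_pii("문의: 010-1234-5678 / help@example.com"))
# 문의: [전화번호] / [이메일]
```

실제 파이프라인에서는 제거 전에 detect_pii로 감지 통계를 먼저 확인해, 패턴이 정상 텍스트를 과도하게 마스킹하지 않는지 점검하는 것이 안전합니다.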
3.7 토크나이징 고려사항
from tokenizers import Tokenizer, models, trainers, pre_tokenizers
# SentencePiece 한국어 토크나이저 학습
def train_korean_tokenizer(text_files, vocab_size=32000):
"""한국어 특화 BPE 토크나이저 학습"""
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# 한국어는 pre-tokenization이 중요
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
pre_tokenizers.ByteLevel(add_prefix_space=False),
])
trainer = trainers.BpeTrainer(
vocab_size=vocab_size,
special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
min_frequency=2,
)
tokenizer.train(text_files, trainer)
return tokenizer
# 토크나이저 효율성 비교
def compare_tokenizer_efficiency(text, tokenizers_dict):
"""여러 토크나이저의 한국어 토큰 효율성 비교"""
print(f"원문 ({len(text)}자): {text[:50]}...")
print("-" * 60)
for name, tok in tokenizers_dict.items():
tokens = tok.encode(text)
token_strs = tok.convert_ids_to_tokens(tokens)
fertility = len(tokens) / len(text.split())
print(f"{name}: {len(tokens)} tokens, fertility={fertility:.2f}")
print(f" 처음 10토큰: {token_strs[:10]}")
4. Instruction Tuning 데이터 포맷
4.1 Alpaca 포맷
가장 기본적이고 널리 사용되는 포맷입니다.
{
"instruction": "다음 텍스트를 요약해주세요.",
"input": "인공지능(AI)은 인간의 학습능력, 추론능력, 지각능력을 인공적으로 구현한 컴퓨터 과학의 세부분야입니다...",
"output": "인공지능은 인간의 지능을 컴퓨터로 구현하는 기술로, 머신러닝과 딥러닝의 발전으로 다양한 분야에서 활용되고 있습니다."
}
def format_alpaca(instruction, input_text="", output=""):
"""Alpaca 포맷 생성"""
if input_text:
return {
"instruction": instruction,
"input": input_text,
"output": output,
}
return {
"instruction": instruction,
"input": "",
"output": output,
}
# Alpaca 프롬프트 템플릿
ALPACA_TEMPLATE = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input}
### Response:
{output}"""
ALPACA_TEMPLATE_NO_INPUT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response:
{output}"""
4.2 ShareGPT 포맷
다중 턴 대화를 표현하는 포맷입니다.
{
"conversations": [
{
"from": "human",
"value": "파이썬에서 리스트와 튜플의 차이점이 뭐야?"
},
{
"from": "gpt",
"value": "파이썬에서 리스트와 튜플의 주요 차이점은...\n\n1. **가변성**: 리스트는 변경 가능(mutable), 튜플은 변경 불가(immutable)\n2. **성능**: 튜플이 메모리 효율적\n3. **문법**: 리스트는 [], 튜플은 ()"
},
{
"from": "human",
"value": "그러면 언제 튜플을 쓰는 게 좋아?"
},
{
"from": "gpt",
"value": "튜플은 다음과 같은 경우에 사용하면 좋습니다:\n\n1. 데이터가 변경되면 안 되는 경우 (좌표, RGB 값)\n2. 딕셔너리의 키로 사용할 때\n3. 함수에서 여러 값을 반환할 때"
}
]
}
4.3 OpenAI Messages 포맷
OpenAI API와 호환되는 표준 포맷입니다.
{
"messages": [
{
"role": "system",
"content": "당신은 한국어에 능통한 AI 어시스턴트입니다. 정확하고 도움이 되는 답변을 제공하세요."
},
{
"role": "user",
"content": "머신러닝과 딥러닝의 차이를 설명해줘."
},
{
"role": "assistant",
"content": "머신러닝과 딥러닝의 핵심 차이를 설명하겠습니다..."
}
]
}
4.4 Chat Template (모델별 차이)
# Llama 3 Chat Template
LLAMA3_TEMPLATE = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>
{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{assistant_message}<|eot_id|>"""
# Mistral Chat Template
MISTRAL_TEMPLATE = """<s>[INST] {system_message}
{user_message} [/INST]{assistant_message}</s>"""
# Qwen 2 Chat Template
QWEN2_TEMPLATE = """<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_message}<|im_end|>"""
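템플릿 문자열을 직접 채우는 대신 메시지 리스트를 받아 조립하는 함수로 만들면 다중 턴도 같은 코드로 처리할 수 있습니다. 아래는 Qwen 2 스타일(ChatML)을 수동 조립하는 스케치입니다. 실무에서는 모델별 차이를 자동 반영하는 transformers의 tokenizer.apply_chat_template 사용을 권장합니다.

```python
def build_qwen2_prompt(messages: list) -> str:
    """OpenAI Messages 리스트를 Qwen 2(ChatML) 프롬프트 문자열로 변환"""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    ]
    # 마지막에 assistant 턴을 열어 모델의 생성을 유도
    return "\n".join(parts) + "\n<|im_start|>assistant\n"

prompt = build_qwen2_prompt([
    {"role": "system", "content": "당신은 한국어 AI 어시스턴트입니다."},
    {"role": "user", "content": "안녕하세요!"},
])
print(prompt)
```

학습 데이터 생성 시에는 assistant 응답까지 포함해 `<|im_end|>`로 닫힌 완성형 문자열을 만들면 됩니다.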
4.5 포맷 간 변환
def sharegpt_to_openai(sharegpt_data):
"""ShareGPT -> OpenAI Messages 변환"""
messages = []
role_map = {"human": "user", "gpt": "assistant", "system": "system"}
for conv in sharegpt_data["conversations"]:
messages.append({
"role": role_map.get(conv["from"], conv["from"]),
"content": conv["value"],
})
return {"messages": messages}
def alpaca_to_openai(alpaca_data, system_prompt=""):
"""Alpaca -> OpenAI Messages 변환"""
messages = []
if system_prompt:
messages.append({"role": "system", "content": system_prompt})
user_content = alpaca_data["instruction"]
if alpaca_data.get("input"):
user_content += f"\n\n{alpaca_data['input']}"
messages.append({"role": "user", "content": user_content})
messages.append({"role": "assistant", "content": alpaca_data["output"]})
return {"messages": messages}
def openai_to_sharegpt(openai_data):
"""OpenAI Messages -> ShareGPT 변환"""
role_map = {"user": "human", "assistant": "gpt", "system": "system"}
conversations = []
for msg in openai_data["messages"]:
conversations.append({
"from": role_map.get(msg["role"], msg["role"]),
"value": msg["content"],
})
return {"conversations": conversations}
# 대량 변환
def batch_convert(dataset, source_format, target_format):
"""데이터셋 전체 포맷 변환"""
converters = {
("sharegpt", "openai"): sharegpt_to_openai,
("alpaca", "openai"): alpaca_to_openai,
("openai", "sharegpt"): openai_to_sharegpt,
}
converter = converters.get((source_format, target_format))
if not converter:
raise ValueError(f"지원하지 않는 변환: {source_format} -> {target_format}")
return [converter(item) for item in dataset]
5. RLHF/DPO 데이터셋 구축
5.1 선호도 데이터 구조
# DPO (Direct Preference Optimization) 데이터 구조
dpo_example = {
"prompt": "한국의 전통 음식 중 건강에 좋은 것을 추천해주세요.",
"chosen": "한국 전통 음식 중 건강에 좋은 대표적인 음식들을 소개하겠습니다.\n\n1. **김치**: 유산균이 풍부하고 비타민C, 식이섬유가 많습니다.\n2. **된장찌개**: 발효 식품으로 항암 효과가 있으며 단백질이 풍부합니다.\n3. **잡곡밥**: 다양한 영양소를 고르게 섭취할 수 있습니다.\n4. **나물**: 시금치, 콩나물 등 다양한 비타민과 미네랄을 제공합니다.",
"rejected": "음... 비빔밥이요. 맛있으니까요. 불고기도요.",
}
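이 쌍 데이터로 학습할 때 쓰이는 DPO 손실은 정책 모델과 참조 모델의 로그확률 차이에 대한 logistic loss입니다. chosen/rejected 시퀀스의 로그확률 합이 이미 계산돼 있다고 가정한 스케치이며, 숫자는 예시 값입니다.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO 손실: -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))"""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))  # log(1 + e^-m) == -log(sigmoid(m))

# chosen의 확률이 참조 대비 오르고 rejected가 내려갈수록 손실이 작아짐
print(dpo_loss(-10.0, -20.0, -12.0, -18.0))  # margin > 0 -> 작은 손실
print(dpo_loss(-20.0, -10.0, -18.0, -12.0))  # margin < 0 -> 큰 손실
```

beta는 참조 모델에서 얼마나 벗어날 수 있는지를 조절하는 하이퍼파라미터로, 실제 학습에서는 TRL의 DPOTrainer 같은 구현이 이 계산을 배치 단위로 수행합니다.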
5.2 인간 평가 가이드라인
ANNOTATION_GUIDELINES = """
## 선호도 평가 가이드라인
### 평가 기준 (1-5점)
1. **도움됨 (Helpfulness)**: 질문에 얼마나 잘 답했는가
2. **정확성 (Accuracy)**: 사실적으로 맞는 정보인가
3. **안전성 (Safety)**: 유해하거나 편향된 내용은 없는가
4. **유창성 (Fluency)**: 한국어가 자연스러운가
### 비교 평가 시 주의사항
- 두 응답 모두 읽은 후 비교할 것
- 길이가 아닌 품질 기준으로 판단
- 확신이 없으면 'tie'로 표시
- 개인 의견이 아닌 객관적 품질 기준으로 판단
"""
# 평가 데이터 수집 도구 (아래에서 사용하는 datetime, numpy 임포트 포함)
from datetime import datetime
import numpy as np
class PreferenceCollector:
def __init__(self):
self.annotations = []
def add_comparison(self, prompt, response_a, response_b, preference, annotator_id):
"""선호도 비교 결과 저장"""
self.annotations.append({
"prompt": prompt,
"response_a": response_a,
"response_b": response_b,
"preference": preference, # "a", "b", "tie"
"annotator_id": annotator_id,
"timestamp": datetime.now().isoformat(),
})
def compute_agreement(self):
"""평가자 간 일치도 계산"""
from collections import Counter
# 같은 prompt에 대한 평가 비교
prompt_votes = {}
for ann in self.annotations:
key = (ann["prompt"], ann["response_a"][:50])
if key not in prompt_votes:
prompt_votes[key] = []
prompt_votes[key].append(ann["preference"])
agreements = []
for key, votes in prompt_votes.items():
if len(votes) >= 2:
most_common = Counter(votes).most_common(1)[0][1]
agreements.append(most_common / len(votes))
return np.mean(agreements) if agreements else 0
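단순 일치 비율은 평가자들이 우연히 같은 선택을 한 경우를 보정하지 못합니다. 평가자가 2명일 때는 우연 일치를 보정한 Cohen's kappa를 쓸 수 있습니다 (순수 파이썬 스케치).

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """두 평가자의 라벨 리스트에 대한 Cohen's kappa (1=완전 일치, 0=우연 수준)"""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n        # 관측 일치율
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)    # 우연 일치율
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

print(cohen_kappa(["a", "b", "a", "tie"], ["a", "b", "b", "tie"]))
```

통상 kappa 0.6 이상이면 상당한 일치로 해석하며, 그 미만이라면 가이드라인을 보완하거나 평가자 교육을 다시 하는 것이 좋습니다.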
5.3 AI 기반 자동 순위 매기기
def ai_rank_responses(prompt, responses, model="gpt-4"):
"""Constitutional AI 방식의 자동 순위 매기기"""
ranking_prompt = f"""다음 질문에 대한 여러 응답을 평가해주세요.
질문: {prompt}
"""
for i, resp in enumerate(responses):
ranking_prompt += f"응답 {i+1}: {resp}\n\n"
ranking_prompt += """각 응답을 다음 기준으로 1-5점 평가하고, 최종 순위를 매겨주세요:
1. 도움됨 (Helpfulness)
2. 정확성 (Accuracy)
3. 안전성 (Safety)
4. 유창성 (Fluency)
JSON 형식으로 결과를 반환하세요."""
client = openai.OpenAI()
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": ranking_prompt}],
response_format={"type": "json_object"},
)
return json.loads(response.choices[0].message.content)
5.4 UltraFeedback 방법론
def create_ultrafeedback_data(prompts, models_to_evaluate):
"""UltraFeedback 스타일의 다중 모델 응답 수집 및 평가"""
dataset = []
for prompt in prompts:
responses = {}
# 여러 모델에서 응답 수집
for model_name in models_to_evaluate:
responses[model_name] = generate_response(prompt, model_name)
# GPT-4로 각 응답 평가 (1-10점)
evaluations = {}
for model_name, response in responses.items():
score = evaluate_single_response(prompt, response)
evaluations[model_name] = score
# 최고/최저 점수 응답 선택 (DPO용)
best_model = max(evaluations, key=evaluations.get)
worst_model = min(evaluations, key=evaluations.get)
dataset.append({
"prompt": prompt,
"chosen": responses[best_model],
"rejected": responses[worst_model],
"chosen_model": best_model,
"rejected_model": worst_model,
"scores": evaluations,
})
return dataset
6. 데이터 품질 메트릭
6.1 다양성 측정
from collections import Counter
import numpy as np
def vocabulary_diversity(texts):
"""어휘 다양성 측정 (Type-Token Ratio)"""
all_tokens = []
for text in texts:
all_tokens.extend(text.split())
types = len(set(all_tokens))
tokens = len(all_tokens)
ttr = types / tokens if tokens > 0 else 0
return {"type_token_ratio": ttr, "unique_words": types, "total_words": tokens}
def topic_diversity(texts, n_topics=10):
"""토픽 다양성 (LDA 기반)"""
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features=5000)
dtm = vectorizer.fit_transform(texts)
lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
topic_dist = lda.fit_transform(dtm)
# 토픽 엔트로피 (높을수록 균등 분포)
avg_dist = topic_dist.mean(axis=0)
entropy = -np.sum(avg_dist * np.log(avg_dist + 1e-10))
return {"topic_entropy": entropy, "max_entropy": np.log(n_topics)}
def instruction_diversity(instructions):
"""지시문 시작 동사 다양성"""
first_words = [inst.split()[0] if inst.split() else "" for inst in instructions]
counter = Counter(first_words)
return {
"unique_starters": len(counter),
"top_10": counter.most_common(10),
"starter_entropy": -sum(
(c/len(first_words)) * np.log(c/len(first_words))
for c in counter.values()
),
}
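위 엔트로피 값들은 절대값만으로는 해석하기 어렵습니다. 최대 엔트로피 log(고유값 수)로 나눠 0~1 범위로 정규화하면 데이터셋 간 비교가 쉬워집니다 (스케치).

```python
import math
from collections import Counter

def normalized_entropy(items) -> float:
    """분포의 Shannon 엔트로피를 log(k)로 나눈 값 (1에 가까울수록 균등 분포)"""
    counts = Counter(items)
    n = sum(counts.values())
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    k = len(counts)
    return h / math.log(k) if k > 1 else 0.0

print(normalized_entropy(["설명", "작성", "설명", "작성"]))  # 1.0 (완전 균등)
print(normalized_entropy(["설명"] * 9 + ["작성"]))           # 1보다 훨씬 작음 (편중)
```

topic_entropy, starter_entropy 모두 같은 방식으로 정규화해 두면 "이 데이터셋은 시작 동사가 얼마나 고르게 분포하는가"를 한눈에 비교할 수 있습니다.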
6.2 길이 분포 분석
import matplotlib.pyplot as plt
def analyze_length_distribution(dataset, text_field="text"):
"""데이터셋 길이 분포 분석"""
lengths = [len(item[text_field]) for item in dataset]
stats = {
"count": len(lengths),
"mean": np.mean(lengths),
"median": np.median(lengths),
"std": np.std(lengths),
"min": np.min(lengths),
"max": np.max(lengths),
"p25": np.percentile(lengths, 25),
"p75": np.percentile(lengths, 75),
"p95": np.percentile(lengths, 95),
}
print("=== 길이 분포 통계 ===")
for k, v in stats.items():
print(f" {k}: {v:.1f}")
return stats
6.3 벤치마크 오염 검사
def check_contamination(train_data, benchmark_data, n_gram=13):
"""학습 데이터와 벤치마크 간 오염(contamination) 검사"""
# 벤치마크 n-gram 집합 생성
benchmark_ngrams = set()
for item in benchmark_data:
text = item["question"] if "question" in item else item["text"]
words = text.split()
for i in range(len(words) - n_gram + 1):
ngram = " ".join(words[i:i+n_gram])
benchmark_ngrams.add(ngram)
# 학습 데이터에서 겹치는 n-gram 검색
contaminated = []
for i, item in enumerate(train_data):
text = item.get("instruction", "") + " " + item.get("output", "")
words = text.split()
for j in range(len(words) - n_gram + 1):
ngram = " ".join(words[j:j+n_gram])
if ngram in benchmark_ngrams:
contaminated.append({
"train_idx": i,
"matched_ngram": ngram,
})
break
contamination_rate = len(contaminated) / max(len(train_data), 1)
print(f"오염률: {contamination_rate:.4%} ({len(contaminated)}/{len(train_data)})")
return contaminated
7. 실전 예제: 한국어 SFT 데이터셋 처음부터 구축하기
7.1 전체 파이프라인
"""
한국어 SFT 데이터셋 구축 전체 파이프라인
수집 -> 정제 -> 포맷 변환 -> 검증 -> 업로드
"""
import json
import os
from datasets import Dataset
from tqdm import tqdm
# ===== Step 1: 데이터 수집 =====
def collect_data():
"""다양한 소스에서 데이터 수집"""
all_data = []
# 1-1. 기존 데이터셋 로드
from datasets import load_dataset
koalpaca = load_dataset("beomi/KoAlpaca-v1.1a", split="train")
for item in koalpaca:
all_data.append({
"instruction": item["instruction"],
"input": "",
"output": item["output"],
"source": "koalpaca",
})
# 1-2. 합성 데이터 추가
synthetic = generate_synthetic_data(num_samples=1000)
all_data.extend(synthetic)
print(f"총 수집: {len(all_data)}개")
return all_data
# ===== Step 2: 데이터 정제 =====
def clean_data(raw_data):
"""데이터 정제 파이프라인"""
cleaned = []
pii_remover = PIIRemover()
quality_filter = QualityFilter()
for item in tqdm(raw_data, desc="정제 중"):
# 텍스트 정제
instruction = clean_text(item["instruction"])
output = clean_text(item["output"])
# PII 제거
instruction = pii_remover.remove_pii(instruction)
output = pii_remover.remove_pii(output)
# 품질 필터링
if not quality_filter.filter(instruction) or not quality_filter.filter(output):
continue
cleaned.append({
"instruction": instruction,
"input": item.get("input", ""),
"output": output,
"source": item["source"],
})
print(f"정제 후: {len(cleaned)}/{len(raw_data)}")
return cleaned
# ===== Step 3: 중복 제거 =====
def remove_duplicates(data):
"""중복 제거"""
dedup = TextDeduplicator(threshold=0.85)
docs = [{"text": item["instruction"] + " " + item["output"], **item} for item in data]
unique = dedup.deduplicate(docs)
return [{"instruction": d["instruction"], "input": d.get("input", ""),
"output": d["output"], "source": d["source"]} for d in unique]
# ===== Step 4: 포맷 변환 =====
def format_data(data, target_format="openai"):
"""목표 포맷으로 변환"""
formatted = []
system_prompt = "당신은 도움이 되는 한국어 AI 어시스턴트입니다."
for item in data:
if target_format == "openai":
formatted.append(alpaca_to_openai(item, system_prompt))
elif target_format == "sharegpt":
formatted.append({
"conversations": [
{"from": "human", "value": item["instruction"]},
{"from": "gpt", "value": item["output"]},
],
})
return formatted
# ===== Step 5: 검증 =====
def validate_data(data, format_type="openai"):
"""데이터 품질 검증"""
errors = []
for i, item in enumerate(data):
if format_type == "openai":
if "messages" not in item:
errors.append(f"[{i}] messages 필드 없음")
elif len(item["messages"]) < 2:
errors.append(f"[{i}] 메시지 수 부족")
for msg in item.get("messages", []):
if msg["role"] not in ("system", "user", "assistant"):
errors.append(f"[{i}] 잘못된 role: {msg['role']}")
if not msg["content"].strip():
errors.append(f"[{i}] 빈 content")
if errors:
print(f"검증 오류 {len(errors)}개:")
for e in errors[:10]:
print(f" {e}")
else:
print("검증 통과!")
return len(errors) == 0
# ===== Step 6: 업로드 =====
def upload_to_hub(data, repo_name):
"""Hugging Face Hub에 업로드"""
ds = Dataset.from_list(data)
# Train/Validation 분할
split_ds = ds.train_test_split(test_size=0.05, seed=42)
split_ds.push_to_hub(
repo_name,
private=True,
)
print(f"업로드 완료: {repo_name}")
print(f" Train: {len(split_ds['train'])}, Validation: {len(split_ds['test'])}")
# ===== 실행 =====
if __name__ == "__main__":
# 1. 수집
raw_data = collect_data()
# 2. 정제
cleaned_data = clean_data(raw_data)
# 3. 중복 제거
unique_data = remove_duplicates(cleaned_data)
# 4. 포맷 변환
formatted_data = format_data(unique_data, "openai")
# 5. 검증
is_valid = validate_data(formatted_data, "openai")
# 6. 업로드
if is_valid:
upload_to_hub(formatted_data, "my-org/korean-sft-v1")
7.2 품질 대시보드
def generate_quality_report(dataset):
"""데이터셋 품질 보고서 생성"""
report = {
"total_samples": len(dataset),
"length_stats": analyze_length_distribution(dataset, "instruction"),
"diversity": vocabulary_diversity([d["instruction"] for d in dataset]),
"source_distribution": Counter(d["source"] for d in dataset),
}
print("=" * 60)
print("데이터셋 품질 보고서")
print("=" * 60)
print(f"총 샘플 수: {report['total_samples']}")
print(f"\n길이 통계:")
for k, v in report['length_stats'].items():
print(f" {k}: {v:.1f}")
print(f"\n어휘 다양성: TTR = {report['diversity']['type_token_ratio']:.4f}")
print(f"\n소스 분포:")
for source, count in report['source_distribution'].most_common():
print(f" {source}: {count} ({count/len(dataset)*100:.1f}%)")
print("=" * 60)
return report
8. 퀴즈
Q1. Phi-3 모델이 더 큰 모델을 능가할 수 있었던 핵심 요인은?
정답: 철저하게 큐레이션된 고품질 학습 데이터
Phi-3는 3.8B 파라미터로 7B~13B 모델을 능가했는데, 이는 모델 크기가 아닌 학습 데이터의 품질이 핵심이었습니다. 교과서 수준의 고품질 합성 데이터를 사용하여 학습했습니다.
Q2. MinHash LSH는 어떤 목적으로 사용되나요?
정답: 근사 중복 제거 (Approximate Deduplication)
MinHash LSH는 대규모 데이터에서 유사한 문서를 효율적으로 찾아 중복을 제거하는 알고리즘입니다. 정확한 매칭이 아닌 유사도 기반 근사 매칭으로, 모든 쌍을 비교하는 O(n²) 대신 거의 선형 시간에 작동합니다.
Q3. Alpaca, ShareGPT, OpenAI Messages 포맷의 핵심 차이점은?
정답:
- Alpaca: instruction/input/output 단일 턴 구조
- ShareGPT: conversations 배열로 다중 턴 대화 (human/gpt 역할)
- OpenAI Messages: messages 배열로 system/user/assistant 역할 구분
ShareGPT와 OpenAI 포맷은 모두 다중 턴을 지원하지만, 역할 이름과 구조가 다릅니다.
Q4. DPO 데이터셋에서 chosen과 rejected의 의미는?
정답:
- chosen: 인간이 선호하는 (더 나은) 응답
- rejected: 인간이 비선호하는 (덜 나은) 응답
DPO(Direct Preference Optimization)는 이 쌍 데이터를 이용해 모델이 chosen과 유사한 응답을 생성하고 rejected와 다른 응답을 생성하도록 학습합니다. RLHF와 달리 별도의 보상 모델 없이 직접 최적화합니다.
Q5. 벤치마크 오염(contamination)이 위험한 이유와 검사 방법은?
정답:
벤치마크 오염은 학습 데이터에 평가 데이터가 포함되어 모델 성능이 과대평가되는 문제입니다.
위험성:
- 모델이 실제로는 해당 문제를 "풀지" 못하고 "암기"한 것
- 공정한 모델 비교 불가능
- 실제 배포 시 기대 이하의 성능
검사 방법:
- n-gram 겹침 검사 (보통 13-gram 사용)
- 벤치마크 문장의 해시 비교
- GPT-4 Technical Report의 벤치마크 오염 분석 방식 참고
9. 참고 자료
- LIMA: Less Is More for Alignment - Zhou et al., 2023
- Phi-3 Technical Report - Microsoft Research, 2024
- Self-Instruct: Aligning LLMs with Self-Generated Instructions - Wang et al., 2023
- WizardLM: Empowering Large Language Models to Follow Complex Instructions - Xu et al., 2023
- Hugging Face Datasets Documentation - huggingface.co/docs/datasets
- KoAlpaca: Korean Alpaca Model - beomi, GitHub
- UltraFeedback: Boosting Language Models with High-quality Feedback - Cui et al., 2023
- Training Language Models to Follow Instructions with Human Feedback - Ouyang et al., 2022
- Direct Preference Optimization - Rafailov et al., 2023
- Deduplicating Training Data Makes Language Models Better - Lee et al., 2022
- KLUE: Korean Language Understanding Evaluation - Park et al., 2021
- Textbooks Are All You Need - Gunasekar et al., 2023
- Constitutional AI: Harmlessness from AI Feedback - Bai et al., 2022
- The RefinedWeb Dataset for Falcon LLM - Penedo et al., 2023
Complete Guide to Korean LLM Training Data: Hugging Face Datasets, Preprocessing, and Quality Control
- Introduction: Why Training Data Matters More Than Model Architecture
- 1. Hugging Face Datasets Deep Dive
- 2. Korean Data Collection Methods
- 3. Data Preprocessing Pipeline
- 4. Instruction Tuning Data Formats
- 5. RLHF/DPO Dataset Construction
- 6. Data Quality Metrics
- 7. Full Pipeline: Building a Korean SFT Dataset from Scratch
- 8. Quiz
- 9. References
Introduction: Why Training Data Matters More Than Model Architecture
In 2024, Microsoft Research's Phi-3 paper sent shockwaves through the industry. A 3.8B parameter model outperformed 7B-13B models, and the secret was meticulously curated high-quality training data. Meta's LIMA paper ("Less Is More for Alignment") demonstrated that just 1,000 high-quality samples could achieve performance close to GPT-4.
The phrase "Data is the new oil" may be cliche, but in the LLM era, it has never been more accurate. While architectural innovations (Transformers, MoE, State Space Models) matter, the same architecture can produce vastly different performance depending on data quality and diversity.
The Korean LLM ecosystem is growing rapidly:
| Model | Developer | Parameters | Features |
|---|---|---|---|
| SOLAR | Upstage | 10.7B | Depth Up-Scaling, Korean-optimized |
| EXAONE | LG AI Research | 7.8B | Enterprise Korean LLM |
| HyperCLOVA X | NAVER | Undisclosed | Largest Korean language model |
| Qwen-KO | Community | Various | Qwen-based Korean fine-tuning |
| KULLM | Korea Univ. | 13B | Korean open-source LLM |
| Polyglot-Ko | EleutherAI | 12.8B | Korean pre-trained model |
What determines the performance of all these models is ultimately training data. This guide covers everything from Hugging Face dataset usage to Korean data collection, preprocessing, Instruction Tuning formats, and RLHF/DPO dataset construction.
1. Hugging Face Datasets Deep Dive
1.1 Platform Overview
Hugging Face is an ML community platform hosting over 150,000 datasets as of 2023. It provides dataset viewers, download statistics, and automatic documentation features.
Key Features:
- Dataset Viewer: Preview data directly in the browser
- Download Stats: Monthly download counts
- Dataset Card: Dataset metadata, licensing, and usage documentation
- Streaming: Load without full download
- Git LFS: Large file version control
1.2 Dataset Types by Category
Pre-training Data
Large-scale text corpora that form the model's foundational language understanding.
| Dataset | Size | Languages | Description |
|---|---|---|---|
| CC-100 | 2.5TB | 100+ langs | Cleaned Common Crawl corpus |
| mC4 | 27TB | 101 langs | Google's multilingual C4 |
| Korean Wikipedia | ~1GB | Korean | Full Korean Wikipedia |
| Namuwiki | ~5GB | Korean | Namuwiki dump (non-commercial) |
| KCC | ~30GB | Korean | Korean web crawl data |
| OSCAR | Various | Multilingual | Classified Common Crawl corpus |
SFT/Instruction Tuning Data
Core data for teaching LLMs to follow instructions.
| Dataset | Size | Format | Description |
|---|---|---|---|
| Alpaca (Stanford) | 52K | instruction/input/output | Generated via Self-Instruct |
| ShareGPT | 90K+ | conversations | Real ChatGPT conversations |
| LIMA | 1K | instruction/output | Hand-curated high quality |
| OpenOrca | 4M | instruction/output | Includes GPT-4 responses |
| Dolly 2.0 | 15K | instruction/output | Hand-crafted, commercially usable |
| FLAN Collection | 1836 tasks | Various | Google's large Instruction collection |
RLHF/DPO Data
Alignment data reflecting human preferences.
| Dataset | Size | Structure | Description |
|---|---|---|---|
| HH-RLHF (Anthropic) | 170K | chosen/rejected | Helpfulness + harmlessness preference |
| UltraFeedback | 64K | 4-aspect ratings | GPT-4 based auto-evaluation |
| Nectar | 183K | ranked list | 7-model response rankings |
| Chatbot Arena | Ongoing | ELO scores | Human blind comparison |
Evaluation Benchmarks
| Benchmark | Domain | Korean Support |
|---|---|---|
| MMLU | 57 academic subjects | Translated version available |
| ARC | Science reasoning | Translated version |
| HellaSwag | Common sense reasoning | Translated version |
| KoBBQ | Bias evaluation | Native Korean |
| KLUE | Korean NLU | Native Korean |
| KorNAT | Korean common sense | Native Korean |
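Several of the native-Korean benchmarks above can be pulled straight from the Hub. A minimal sketch for KLUE follows; the task-to-config mapping uses the real KLUE config names, and the `datasets` import is deferred inside the function so the mapping can be inspected without the library installed.

```python
# KLUE task name -> Hub config name (real KLUE configs)
KLUE_TASKS = {
    "topic-classification": "ynat",
    "natural-language-inference": "nli",
    "semantic-textual-similarity": "sts",
    "named-entity-recognition": "ner",
}

def load_klue(task, split="validation"):
    """Load one KLUE task from the Hub (requires `datasets` and network)."""
    from datasets import load_dataset  # deferred import
    return load_dataset("klue", KLUE_TASKS[task], split=split)
```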
1.3 Korean-Specific Datasets
Korean LLM Dataset Ecosystem
├── Pre-training
│ ├── Korean Wikipedia (~600K articles)
│ ├── Namuwiki Dump (~5GB)
│ ├── AI Hub Corpora (NIKL)
│ └── mC4-ko (Korean subset)
├── Instruction Tuning
│ ├── KoAlpaca (beomi) - 52K
│ ├── KoVicuna (melodysdreamj) - 40K+
│ ├── KOpen-platypus - 25K
│ ├── ko_wikidata_QA - Wiki-based QA
│ └── kullm-v2 (Korea Univ.) - 152K
├── Preference/Alignment
│ ├── ko-rlhf (community)
│ └── KoreanFeedback (custom-built)
└── Evaluation
├── KLUE (8 tasks)
├── KoBBQ (bias)
└── KorNAT (common sense)
1.4 Practical datasets Library Usage
Basic Loading and Exploration
from datasets import load_dataset, Dataset, DatasetDict
# Basic loading
ds = load_dataset("beomi/KoAlpaca-v1.1a")
print(ds)
# DatasetDict({
# train: Dataset({
# features: ['instruction', 'output'],
# num_rows: 21155
# })
# })
# Load specific split
train_ds = load_dataset("beomi/KoAlpaca-v1.1a", split="train")
# Inspect first 5 examples
for example in train_ds.select(range(5)):
print(f"Instruction: {example['instruction'][:50]}...")
print(f"Output: {example['output'][:50]}...")
print("---")
Filtering and Transformation
# Length-based filtering
filtered_ds = ds["train"].filter(
lambda x: len(x["instruction"]) > 10 and len(x["output"]) > 20
)
print(f"After filtering: {len(filtered_ds)} / {len(ds['train'])}")
# Convert to Alpaca format
def format_alpaca(example):
text = f"""### Instruction:
{example['instruction']}
### Response:
{example['output']}"""
return {"text": text}
formatted_ds = filtered_ds.map(format_alpaca)
# Apply tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("beomi/llama-2-ko-7b")
def tokenize_function(examples):
return tokenizer(
examples["text"],
truncation=True,
max_length=2048,
padding="max_length",
)
tokenized_ds = formatted_ds.map(
tokenize_function,
batched=True,
remove_columns=formatted_ds.column_names,
)
Streaming Mode (Large Datasets)
# Stream large datasets (memory efficient)
streaming_ds = load_dataset(
"allenai/c4",
"ko",
split="train",
streaming=True,
)
# Iterate through first 100 examples
for i, example in enumerate(streaming_ds):
if i >= 100:
break
process(example["text"])
# Streaming + filtering + batch processing
filtered_stream = streaming_ds.filter(
lambda x: len(x["text"]) > 100
).take(10000)
# Process in batches
batch = []
for example in filtered_stream:
batch.append(example)
if len(batch) == 32:
process_batch(batch)
batch = []
Upload to Hugging Face Hub
from datasets import Dataset
import pandas as pd
# Create dataset from DataFrame
df = pd.DataFrame({
"instruction": ["What is Korea's capital?", "What is Python?"],
"output": ["Korea's capital is Seoul.", "Python is a programming language."],
})
my_dataset = Dataset.from_pandas(df)
# Upload to Hub
my_dataset.push_to_hub(
"my-org/my-korean-dataset",
private=True,
token="hf_xxxxx",
)
# Auto-generate Dataset Card
from huggingface_hub import DatasetCard
card = DatasetCard.load("my-org/my-korean-dataset")
card.text = """
# My Korean Dataset
Korean Instruction Tuning dataset.
## Data Structure
- instruction: Question/instruction text
- output: Response
## License
CC-BY-4.0
"""
card.push_to_hub("my-org/my-korean-dataset")
2. Korean Data Collection Methods
2.1 Web Crawling
# News crawling with newspaper3k
from newspaper import Article
import json
def crawl_article(url):
"""Crawl news article (must comply with robots.txt!)"""
article = Article(url, language="ko")
article.download()
article.parse()
return {
"title": article.title,
"text": article.text,
"publish_date": str(article.publish_date),
"source_url": url,
}
# Large-scale crawling with Scrapy
# scrapy_spider.py
"""
import scrapy
class KoreanTextSpider(scrapy.Spider):
name = 'korean_text'
custom_settings = {
'ROBOTSTXT_OBEY': True, # Must comply with robots.txt
'DOWNLOAD_DELAY': 2, # 2-second intervals
'CONCURRENT_REQUESTS': 4, # Limit concurrent requests
}
def parse(self, response):
text = response.css('article::text').getall()
yield {
'url': response.url,
'text': ' '.join(text),
}
"""
Crawling Best Practices:
- Always comply with robots.txt
- Minimum 1-2 second request intervals
- Verify copyright/licensing
- Mandatory PII filtering
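The first rule can be automated with Python's standard library. A minimal sketch: parse a site's robots.txt text and ask whether a given URL may be fetched. In production you would fetch robots.txt once per host (e.g. `RobotFileParser.set_url(...).read()`) and cache the parser; the sample rules and user-agent name below are made up.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, url: str, user_agent: str = "my-crawler") -> bool:
    """Check whether a URL may be crawled, given the site's robots.txt text."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

sample = """User-agent: *
Disallow: /private/
"""
print(is_allowed(sample, "https://example.com/news/1"))     # True
print(is_allowed(sample, "https://example.com/private/x"))  # False
```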
2.2 Public Data Sources
| Source | URL | Data Type | License |
|---|---|---|---|
| AI Hub | aihub.or.kr | Various Korean corpora | Public |
| Modoo Corpus | corpus.korean.go.kr | Written/spoken corpora | CC BY |
| NIKL | korean.go.kr | Standard corpora | Academic |
| Data Portal | data.go.kr | Government public data | Public |
# AI Hub data loading example
import json
import glob
def load_aihub_data(data_dir):
"""Load AI Hub JSON format data"""
all_data = []
for filepath in glob.glob(f"{data_dir}/**/*.json", recursive=True):
with open(filepath, "r", encoding="utf-8") as f:
data = json.load(f)
if "document" in data:
for doc in data["document"]:
for sent in doc.get("sentence", []):
all_data.append({
"text": sent.get("form", ""),
"source": "aihub",
})
return all_data
2.3 Translation-Based Data Generation
# Translation using NLLB (No Language Left Behind)
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
def translate_en_to_ko(text):
"""English -> Korean translation"""
tokenizer.src_lang = "eng_Latn"
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
translated = model.generate(
**inputs,
forced_bos_token_id=tokenizer.convert_tokens_to_ids("kor_Hang"),
max_length=512,
)
return tokenizer.decode(translated[0], skip_special_tokens=True)
# Translation quality validation
def validate_translation(original, translated):
"""Automatic translation quality validation"""
checks = {
"not_empty": len(translated.strip()) > 0,
"not_too_short": len(translated) > len(original) * 0.3,
"not_too_long": len(translated) < len(original) * 3,
"no_english_majority": sum(1 for c in translated if c.isascii()) / max(len(translated), 1) < 0.5,
}
return all(checks.values()), checks
2.4 Synthetic Data Generation
Self-Instruct Approach
import openai
import json
import random
# Self-Instruct: Generate new instructions from seed data
SEED_INSTRUCTIONS = [
"Explain the four seasons of Korea.",
"What are the advantages of list comprehension in Python?",
"Tell me the key points to consider when writing an email.",
]
def generate_new_instructions(seed_instructions, num_generate=10):
"""Generate new instructions using GPT-4"""
prompt = f"""Here are example Korean instructions:
{chr(10).join(f'{i+1}. {inst}' for i, inst in enumerate(seed_instructions))}
Generate {num_generate} completely new Korean instructions in a similar style.
Include diverse topics (science, history, technology, daily life, etc.).
Write each instruction on a single line with a number."""
client = openai.OpenAI()
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=0.8,
)
return parse_instructions(response.choices[0].message.content)
def generate_response(instruction):
"""Generate a response for the instruction"""
client = openai.OpenAI()
response = client.chat.completions.create(
model="gpt-4",
messages=[
{"role": "system", "content": "You are a helpful Korean AI assistant."},
{"role": "user", "content": instruction},
],
temperature=0.7,
)
return response.choices[0].message.content
Evol-Instruct Approach
def evolve_instruction(instruction, evolution_type="deepen"):
"""WizardLM's Evol-Instruct: Progressively complexify instructions"""
evolution_prompts = {
"deepen": f"""Make the following instruction deeper and more specific.
Original: {instruction}
Evolved version:""",
"broaden": f"""Broaden the scope of the following instruction to be more comprehensive.
Original: {instruction}
Evolved version:""",
"concretize": f"""Add specific conditions or constraints to the following instruction.
Original: {instruction}
Evolved version:""",
"reasoning": f"""Transform the following instruction into a form requiring step-by-step reasoning.
Original: {instruction}
Evolved version:""",
}
client = openai.OpenAI()
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": evolution_prompts[evolution_type]}],
temperature=0.7,
)
return response.choices[0].message.content
2.5 Community Data Sources
- Namuwiki: Rich Korean content (non-commercial CC-BY-NC-SA)
- Korean Reddit: r/korea, r/hanguk, etc.
- Stack Overflow Korean: Technical Q&A
- Naver Knowledge iN: Crawl with caution (check terms of service)
- Korean Wikipedia: CC-BY-SA license
3. Data Preprocessing Pipeline
3.1 Overall Pipeline Architecture
Raw Data
|
v
+-------------------+
| 1. Text Cleaning | HTML tag removal, encoding cleanup
+--------+----------+
v
+-------------------+
| 2. Lang Detection | Filter Korean-only text
+--------+----------+
v
+-------------------+
| 3. Deduplication | MinHash LSH, Exact Match
+--------+----------+
v
+-------------------+
| 4. Quality Filter | Perplexity, length, toxicity
+--------+----------+
v
+-------------------+
| 5. PII Removal | Personal info masking
+--------+----------+
v
+-------------------+
| 6. Tokenization | SentencePiece / BPE
+--------+----------+
v
Cleaned Data
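The skeleton above can be sketched as a chain of stage functions, each mapping a list of documents to a (usually smaller) list. The stage bodies below are toy placeholders; sections 3.2-3.7 give the real implementations.

```python
def run_pipeline(docs, stages):
    """Run documents through named stages, logging the shrinkage at each step."""
    for name, stage in stages:
        before = len(docs)
        docs = stage(docs)
        print(f"{name}: {before} -> {len(docs)}")
    return docs

# Toy stages: strip/drop empties, keep texts containing Hangul, exact dedup
stages = [
    ("clean", lambda ds: [d.strip() for d in ds if d.strip()]),
    ("lang_filter", lambda ds: [d for d in ds
                                if any('\uac00' <= c <= '\ud7a3' for c in d)]),
    ("dedup", lambda ds: list(dict.fromkeys(ds))),
]
docs = ["안녕하세요 ", "hello world", "안녕하세요", ""]
cleaned = run_pipeline(docs, stages)  # -> ["안녕하세요"]
```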
3.2 Text Cleaning
import re
import html
import unicodedata
def clean_text(text):
"""Basic Korean text cleaning"""
# Decode HTML entities
text = html.unescape(text)
# Remove HTML tags
text = re.sub(r'<[^>]+>', '', text)
# Remove URLs
text = re.sub(r'https?://\S+|www\.\S+', '', text)
# Remove emails
text = re.sub(r'\S+@\S+\.\S+', '[EMAIL]', text)
# Mask phone numbers
text = re.sub(r'\d{2,3}-\d{3,4}-\d{4}', '[PHONE]', text)
# Normalize whitespace
text = re.sub(r'\s+', ' ', text)
# Unicode normalization (NFC)
text = unicodedata.normalize('NFC', text)
return text.strip()
def clean_korean_specific(text):
"""Korean-specific cleaning"""
    # Advertising boilerplate patterns (shown in English as placeholders;
    # a real Korean pipeline would match the equivalent Korean ad phrases)
ad_patterns = [
r'click\s*now',
r'free\s*consultation',
r'contact\s*us',
r'call\s*now',
]
for pattern in ad_patterns:
if re.search(pattern, text, re.IGNORECASE):
return None
return text
# Batch processing
def clean_batch(texts):
"""Batch cleaning"""
cleaned = []
for text in texts:
result = clean_text(text)
result = clean_korean_specific(result)
if result and len(result) > 20:
cleaned.append(result)
return cleaned
3.3 Deduplication
from datasketch import MinHash, MinHashLSH
import hashlib
class TextDeduplicator:
"""MinHash LSH-based approximate deduplication"""
def __init__(self, threshold=0.8, num_perm=128):
self.threshold = threshold
self.num_perm = num_perm
self.lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
self.seen_exact = set()
def get_minhash(self, text):
"""Generate MinHash for text"""
m = MinHash(num_perm=self.num_perm)
# Split into 3-grams
for i in range(len(text) - 2):
m.update(text[i:i+3].encode('utf-8'))
return m
def is_duplicate(self, text, doc_id):
"""Check for duplicates"""
# 1. Exact matching (hash-based)
text_hash = hashlib.md5(text.encode('utf-8')).hexdigest()
if text_hash in self.seen_exact:
return True
self.seen_exact.add(text_hash)
# 2. Approximate matching (MinHash LSH)
minhash = self.get_minhash(text)
result = self.lsh.query(minhash)
if result:
return True
self.lsh.insert(doc_id, minhash)
return False
def deduplicate(self, documents):
"""Deduplicate document list"""
unique_docs = []
for i, doc in enumerate(documents):
if not self.is_duplicate(doc["text"], f"doc_{i}"):
unique_docs.append(doc)
print(f"Dedup: {len(documents)} -> {len(unique_docs)} "
f"({len(documents) - len(unique_docs)} removed)")
return unique_docs
3.4 Language Detection Filtering
import fasttext
# Load fasttext language detection model
model_path = "lid.176.bin" # Pre-download required
lang_model = fasttext.load_model(model_path)
def detect_language(text):
"""Detect text language"""
text_clean = text.replace('\n', ' ')[:200]
predictions = lang_model.predict(text_clean)
lang = predictions[0][0].replace('__label__', '')
confidence = predictions[1][0]
return lang, confidence
def filter_korean(documents, min_confidence=0.7):
"""Filter Korean-only text"""
korean_docs = []
for doc in documents:
lang, conf = detect_language(doc["text"])
if lang == "ko" and conf >= min_confidence:
korean_docs.append(doc)
return korean_docs
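When the fasttext model file is not available, a crude fallback is the ratio of Hangul syllables among non-whitespace characters. This is a heuristic sketch only (the 0.3 threshold is arbitrary); fasttext remains more reliable for short or code-mixed text.

```python
def hangul_ratio(text):
    """Fraction of non-whitespace characters in the Hangul syllable block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hangul = sum(1 for c in chars if '\uac00' <= c <= '\ud7a3')
    return hangul / len(chars)

def looks_korean(text, threshold=0.3):
    return hangul_ratio(text) >= threshold

print(looks_korean("안녕하세요, 데이터 전처리입니다."))  # True
print(looks_korean("This is English text."))              # False
```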
3.5 Quality Filtering
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
class QualityFilter:
"""Text quality filtering"""
def __init__(self):
self.criteria = {
"min_length": 50,
"max_length": 10000,
"min_words": 10,
"max_repetition_ratio": 0.3,
"max_special_char_ratio": 0.1,
}
def check_length(self, text):
"""Length-based filter"""
return self.criteria["min_length"] <= len(text) <= self.criteria["max_length"]
def check_repetition(self, text):
"""Detect repetitive text"""
words = text.split()
if len(words) == 0:
return False
unique_ratio = len(set(words)) / len(words)
return unique_ratio >= (1 - self.criteria["max_repetition_ratio"])
def check_special_chars(self, text):
"""Check special character ratio"""
special = sum(1 for c in text if not c.isalnum() and not c.isspace()
and c not in '.,!?;:')
return special / max(len(text), 1) < self.criteria["max_special_char_ratio"]
def compute_perplexity(self, text, model, tokenizer, device="cuda"):
"""Perplexity-based quality assessment (lower = more natural text)"""
inputs = tokenizer(text, return_tensors="pt", truncation=True,
max_length=512).to(device)
with torch.no_grad():
outputs = model(**inputs, labels=inputs["input_ids"])
return torch.exp(outputs.loss).item()
def filter(self, text):
"""Comprehensive quality filtering"""
return (
self.check_length(text)
and self.check_repetition(text)
and self.check_special_chars(text)
)
3.6 PII (Personally Identifiable Information) Removal
import re
from typing import Dict, List
class PIIRemover:
"""PII removal for Korean text"""
PATTERNS = {
"SSN": r'\d{6}[-]?\d{7}',
"PHONE": r'0\d{1,2}[-.]?\d{3,4}[-.]?\d{4}',
"EMAIL": r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
"CARD": r'\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}',
"ACCOUNT": r'\d{3,6}[-]?\d{2,6}[-]?\d{2,6}[-]?\d{0,3}',
"IP": r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}',
}
def remove_pii(self, text: str) -> str:
"""Replace PII with mask tokens"""
for pii_type, pattern in self.PATTERNS.items():
mask_token = f"[{pii_type}]"
text = re.sub(pattern, mask_token, text)
return text
def detect_pii(self, text: str) -> Dict[str, List[str]]:
"""Detect PII (for review before removal)"""
found = {}
for pii_type, pattern in self.PATTERNS.items():
matches = re.findall(pattern, text)
if matches:
found[pii_type] = matches
return found
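A quick check of how such masking behaves, with two of the patterns repeated so the snippet is self-contained (the contact details below are synthetic):

```python
import re

# Two of the PII patterns, inlined for a standalone demo
PATTERNS = {
    "PHONE": r'0\d{1,2}[-.]?\d{3,4}[-.]?\d{4}',
    "EMAIL": r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
}

def mask(text):
    for pii_type, pattern in PATTERNS.items():
        text = re.sub(pattern, f"[{pii_type}]", text)
    return text

print(mask("Contact: 010-1234-5678 / kim@example.com"))
# Contact: [PHONE] / [EMAIL]
```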
3.7 Tokenization Considerations
from tokenizers import Tokenizer, models, trainers, pre_tokenizers
# Train Korean BPE tokenizer
def train_korean_tokenizer(text_files, vocab_size=32000):
"""Train Korean-optimized BPE tokenizer"""
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# Pre-tokenization is critical for Korean
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
pre_tokenizers.ByteLevel(add_prefix_space=False),
])
trainer = trainers.BpeTrainer(
vocab_size=vocab_size,
special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
min_frequency=2,
)
tokenizer.train(text_files, trainer)
return tokenizer
# Compare tokenizer efficiency
def compare_tokenizer_efficiency(text, tokenizers_dict):
"""Compare Korean token efficiency across tokenizers"""
print(f"Original ({len(text)} chars): {text[:50]}...")
print("-" * 60)
for name, tok in tokenizers_dict.items():
tokens = tok.encode(text)
token_strs = tok.convert_ids_to_tokens(tokens)
fertility = len(tokens) / len(text.split())
print(f"{name}: {len(tokens)} tokens, fertility={fertility:.2f}")
print(f" First 10 tokens: {token_strs[:10]}")
4. Instruction Tuning Data Formats
4.1 Alpaca Format
The most basic and widely used format.
{
"instruction": "Summarize the following text.",
"input": "Artificial Intelligence (AI) is a subfield of computer science that artificially implements human learning, reasoning, and perception capabilities...",
"output": "AI is a technology that implements human intelligence on computers, and it is being utilized in various fields through advances in machine learning and deep learning."
}
def format_alpaca(instruction, input_text="", output=""):
"""Generate Alpaca format"""
if input_text:
return {
"instruction": instruction,
"input": input_text,
"output": output,
}
return {
"instruction": instruction,
"input": "",
"output": output,
}
# Alpaca prompt templates
ALPACA_TEMPLATE = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input}
### Response:
{output}"""
ALPACA_TEMPLATE_NO_INPUT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response:
{output}"""
4.2 ShareGPT Format
A format for representing multi-turn conversations.
{
"conversations": [
{
"from": "human",
"value": "What is the difference between lists and tuples in Python?"
},
{
"from": "gpt",
"value": "The key differences between lists and tuples in Python are...\n\n1. **Mutability**: Lists are mutable, tuples are immutable\n2. **Performance**: Tuples are more memory efficient\n3. **Syntax**: Lists use [], tuples use ()"
},
{
"from": "human",
"value": "When should I prefer tuples?"
},
{
"from": "gpt",
"value": "Tuples are best used in these cases:\n\n1. When data should not be modified (coordinates, RGB values)\n2. When used as dictionary keys\n3. When returning multiple values from functions"
}
]
}
4.3 OpenAI Messages Format
Standard format compatible with the OpenAI API.
{
"messages": [
{
"role": "system",
"content": "You are an AI assistant fluent in Korean. Provide accurate and helpful answers."
},
{
"role": "user",
"content": "Explain the difference between machine learning and deep learning."
},
{
"role": "assistant",
"content": "Let me explain the key differences between ML and deep learning..."
}
]
}
4.4 Chat Templates (Model-Specific Differences)
# Llama 3 Chat Template
LLAMA3_TEMPLATE = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>
{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{assistant_message}<|eot_id|>"""
# Mistral Chat Template
MISTRAL_TEMPLATE = """<s>[INST] {system_message}
{user_message} [/INST]{assistant_message}</s>"""
# Qwen 2 Chat Template
QWEN2_TEMPLATE = """<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_message}<|im_end|>"""
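As a sketch, the Qwen 2 template above can be applied to OpenAI-style messages by hand; in practice, prefer the tokenizer's built-in `apply_chat_template`, which encodes each model's exact conventions (special tokens, generation prompt).

```python
def render_qwen2(messages):
    """Hand-rolled Qwen 2 chat formatting for OpenAI-style messages.
    The trailing assistant header mimics add_generation_prompt=True."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
             for m in messages]
    return "\n".join(parts) + "\n<|im_start|>assistant\n"

text = render_qwen2([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi"},
])
```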
4.5 Converting Between Formats
def sharegpt_to_openai(sharegpt_data):
"""ShareGPT -> OpenAI Messages conversion"""
messages = []
role_map = {"human": "user", "gpt": "assistant", "system": "system"}
for conv in sharegpt_data["conversations"]:
messages.append({
"role": role_map.get(conv["from"], conv["from"]),
"content": conv["value"],
})
return {"messages": messages}
def alpaca_to_openai(alpaca_data, system_prompt=""):
"""Alpaca -> OpenAI Messages conversion"""
messages = []
if system_prompt:
messages.append({"role": "system", "content": system_prompt})
user_content = alpaca_data["instruction"]
if alpaca_data.get("input"):
user_content += f"\n\n{alpaca_data['input']}"
messages.append({"role": "user", "content": user_content})
messages.append({"role": "assistant", "content": alpaca_data["output"]})
return {"messages": messages}
def openai_to_sharegpt(openai_data):
"""OpenAI Messages -> ShareGPT conversion"""
role_map = {"user": "human", "assistant": "gpt", "system": "system"}
conversations = []
for msg in openai_data["messages"]:
conversations.append({
"from": role_map.get(msg["role"], msg["role"]),
"value": msg["content"],
})
return {"conversations": conversations}
# Batch conversion
def batch_convert(dataset, source_format, target_format):
"""Convert entire dataset format"""
converters = {
("sharegpt", "openai"): sharegpt_to_openai,
("alpaca", "openai"): alpaca_to_openai,
("openai", "sharegpt"): openai_to_sharegpt,
}
converter = converters.get((source_format, target_format))
if not converter:
raise ValueError(f"Unsupported conversion: {source_format} -> {target_format}")
return [converter(item) for item in dataset]
5. RLHF/DPO Dataset Construction
5.1 Preference Data Structure
# DPO (Direct Preference Optimization) data structure
dpo_example = {
"prompt": "Recommend healthy traditional Korean foods.",
"chosen": "Here are some of the healthiest traditional Korean foods:\n\n1. **Kimchi**: Rich in probiotics, vitamin C, and dietary fiber.\n2. **Doenjang-jjigae**: Fermented food with anti-cancer properties and rich in protein.\n3. **Mixed grain rice**: Provides balanced nutrition from various grains.\n4. **Namul (seasoned vegetables)**: Spinach, bean sprouts, etc. provide various vitamins and minerals.",
"rejected": "Umm... bibimbap I guess. It's tasty. Bulgogi too.",
}
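Before training on such pairs, a few sanity checks are worth running: required keys present, both responses non-trivially long, and `chosen` actually different from `rejected`. A minimal sketch (the `min_len` threshold is illustrative):

```python
def validate_dpo_example(ex, min_len=5):
    """Return (ok, per-check results) for one DPO preference pair."""
    checks = {
        "has_keys": all(k in ex for k in ("prompt", "chosen", "rejected")),
        "chosen_nonempty": len(ex.get("chosen", "").strip()) >= min_len,
        "rejected_nonempty": len(ex.get("rejected", "").strip()) >= min_len,
        "distinct": ex.get("chosen") != ex.get("rejected"),
    }
    return all(checks.values()), checks

ok, _ = validate_dpo_example({
    "prompt": "p", "chosen": "a good long answer", "rejected": "a bad answer"})
```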
5.2 Human Annotation Guidelines
ANNOTATION_GUIDELINES = """
## Preference Evaluation Guidelines
### Evaluation Criteria (1-5 scale)
1. **Helpfulness**: How well does it answer the question
2. **Accuracy**: Is the information factually correct
3. **Safety**: Is there harmful or biased content
4. **Fluency**: Is the language natural
### Comparison Evaluation Notes
- Read both responses fully before comparing
- Judge quality, not length
- Mark 'tie' when uncertain
- Apply objective quality standards, not personal opinion
"""
# Preference data collection tool
from datetime import datetime
import numpy as np

class PreferenceCollector:
    def __init__(self):
        self.annotations = []
def add_comparison(self, prompt, response_a, response_b, preference, annotator_id):
"""Save preference comparison result"""
self.annotations.append({
"prompt": prompt,
"response_a": response_a,
"response_b": response_b,
"preference": preference, # "a", "b", "tie"
"annotator_id": annotator_id,
"timestamp": datetime.now().isoformat(),
})
def compute_agreement(self):
"""Calculate inter-annotator agreement"""
from collections import Counter
prompt_votes = {}
for ann in self.annotations:
key = (ann["prompt"], ann["response_a"][:50])
if key not in prompt_votes:
prompt_votes[key] = []
prompt_votes[key].append(ann["preference"])
agreements = []
for key, votes in prompt_votes.items():
if len(votes) >= 2:
most_common = Counter(votes).most_common(1)[0][1]
agreements.append(most_common / len(votes))
return np.mean(agreements) if agreements else 0
5.3 AI-Based Automatic Ranking
def ai_rank_responses(prompt, responses, model="gpt-4"):
"""Constitutional AI-style automatic ranking"""
ranking_prompt = f"""Evaluate the following responses to a question.
Question: {prompt}
"""
for i, resp in enumerate(responses):
ranking_prompt += f"Response {i+1}: {resp}\n\n"
ranking_prompt += """Rate each response on a 1-5 scale for these criteria and provide final rankings:
1. Helpfulness
2. Accuracy
3. Safety
4. Fluency
Return results in JSON format."""
client = openai.OpenAI()
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": ranking_prompt}],
response_format={"type": "json_object"},
)
return json.loads(response.choices[0].message.content)
5.4 UltraFeedback Methodology
def create_ultrafeedback_data(prompts, models_to_evaluate):
"""UltraFeedback-style multi-model response collection and evaluation"""
dataset = []
for prompt in prompts:
responses = {}
        # Collect responses from multiple models (generate_response and
        # evaluate_single_response are thin API wrappers, omitted here)
for model_name in models_to_evaluate:
responses[model_name] = generate_response(prompt, model_name)
# Evaluate each response with GPT-4 (1-10 scale)
evaluations = {}
for model_name, response in responses.items():
score = evaluate_single_response(prompt, response)
evaluations[model_name] = score
# Select best/worst scoring responses (for DPO)
best_model = max(evaluations, key=evaluations.get)
worst_model = min(evaluations, key=evaluations.get)
dataset.append({
"prompt": prompt,
"chosen": responses[best_model],
"rejected": responses[worst_model],
"chosen_model": best_model,
"rejected_model": worst_model,
"scores": evaluations,
})
return dataset
6. Data Quality Metrics
6.1 Diversity Measurement
from collections import Counter
import numpy as np
def vocabulary_diversity(texts):
"""Measure vocabulary diversity (Type-Token Ratio)"""
all_tokens = []
for text in texts:
all_tokens.extend(text.split())
types = len(set(all_tokens))
tokens = len(all_tokens)
ttr = types / tokens if tokens > 0 else 0
return {"type_token_ratio": ttr, "unique_words": types, "total_words": tokens}
def topic_diversity(texts, n_topics=10):
"""Topic diversity (LDA-based)"""
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features=5000)
dtm = vectorizer.fit_transform(texts)
lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
topic_dist = lda.fit_transform(dtm)
# Topic entropy (higher = more uniform distribution)
avg_dist = topic_dist.mean(axis=0)
entropy = -np.sum(avg_dist * np.log(avg_dist + 1e-10))
return {"topic_entropy": entropy, "max_entropy": np.log(n_topics)}
def instruction_diversity(instructions):
"""Instruction starter verb diversity"""
first_words = [inst.split()[0] if inst.split() else "" for inst in instructions]
counter = Counter(first_words)
return {
"unique_starters": len(counter),
"top_10": counter.most_common(10),
"starter_entropy": -sum(
(c/len(first_words)) * np.log(c/len(first_words))
for c in counter.values()
),
}
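TTR can be complemented with distinct-n (Li et al., 2016): the ratio of unique n-grams to all n-grams, a common diversity measure for generated text. A minimal sketch:

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across a list of texts."""
    total, unique = 0, set()
    for text in texts:
        words = text.split()
        for i in range(len(words) - n + 1):
            unique.add(tuple(words[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0

print(distinct_n(["a b c", "a b d"], n=2))  # 0.75 (3 unique / 4 total)
```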
6.2 Length Distribution Analysis
import numpy as np
def analyze_length_distribution(dataset, text_field="text"):
"""Dataset length distribution analysis"""
lengths = [len(item[text_field]) for item in dataset]
stats = {
"count": len(lengths),
"mean": np.mean(lengths),
"median": np.median(lengths),
"std": np.std(lengths),
"min": np.min(lengths),
"max": np.max(lengths),
"p25": np.percentile(lengths, 25),
"p75": np.percentile(lengths, 75),
"p95": np.percentile(lengths, 95),
}
print("=== Length Distribution Statistics ===")
for k, v in stats.items():
print(f" {k}: {v:.1f}")
return stats
6.3 Benchmark Contamination Check
def check_contamination(train_data, benchmark_data, n_gram=13):
"""Check contamination between training and benchmark data"""
# Build benchmark n-gram set
benchmark_ngrams = set()
for item in benchmark_data:
text = item["question"] if "question" in item else item["text"]
words = text.split()
for i in range(len(words) - n_gram + 1):
ngram = " ".join(words[i:i+n_gram])
benchmark_ngrams.add(ngram)
# Search for overlapping n-grams in training data
contaminated = []
for i, item in enumerate(train_data):
text = item.get("instruction", "") + " " + item.get("output", "")
words = text.split()
for j in range(len(words) - n_gram + 1):
ngram = " ".join(words[j:j+n_gram])
if ngram in benchmark_ngrams:
contaminated.append({
"train_idx": i,
"matched_ngram": ngram,
})
break
contamination_rate = len(contaminated) / max(len(train_data), 1)
print(f"Contamination rate: {contamination_rate:.4%} ({len(contaminated)}/{len(train_data)})")
return contaminated
7. Full Pipeline: Building a Korean SFT Dataset from Scratch
7.1 Complete Pipeline Code
"""
Korean SFT Dataset Construction Pipeline
Collect -> Clean -> Format -> Validate -> Upload
"""
import json
import os
from datasets import Dataset
from tqdm import tqdm
# ===== Step 1: Data Collection =====
def collect_data():
"""Collect data from various sources"""
all_data = []
# 1-1. Load existing datasets
from datasets import load_dataset
koalpaca = load_dataset("beomi/KoAlpaca-v1.1a", split="train")
for item in koalpaca:
all_data.append({
"instruction": item["instruction"],
"input": "",
"output": item["output"],
"source": "koalpaca",
})
# 1-2. Add synthetic data
synthetic = generate_synthetic_data(num_samples=1000)
all_data.extend(synthetic)
print(f"Total collected: {len(all_data)}")
return all_data
# ===== Step 2: Data Cleaning =====
def clean_data(raw_data):
"""Data cleaning pipeline"""
cleaned = []
pii_remover = PIIRemover()
quality_filter = QualityFilter()
for item in tqdm(raw_data, desc="Cleaning"):
instruction = clean_text(item["instruction"])
output = clean_text(item["output"])
instruction = pii_remover.remove_pii(instruction)
output = pii_remover.remove_pii(output)
if not quality_filter.filter(instruction) or not quality_filter.filter(output):
continue
cleaned.append({
"instruction": instruction,
"input": item.get("input", ""),
"output": output,
"source": item["source"],
})
print(f"After cleaning: {len(cleaned)}/{len(raw_data)}")
return cleaned
# ===== Step 3: Deduplication =====
def remove_duplicates(data):
"""Remove duplicates"""
dedup = TextDeduplicator(threshold=0.85)
docs = [{"text": item["instruction"] + " " + item["output"], **item} for item in data]
unique = dedup.deduplicate(docs)
return [{"instruction": d["instruction"], "input": d.get("input", ""),
"output": d["output"], "source": d["source"]} for d in unique]
# ===== Step 4: Format Conversion =====
def format_data(data, target_format="openai"):
"""Convert to target format"""
formatted = []
system_prompt = "You are a helpful Korean AI assistant."
for item in data:
if target_format == "openai":
formatted.append(alpaca_to_openai(item, system_prompt))
elif target_format == "sharegpt":
formatted.append({
"conversations": [
{"from": "human", "value": item["instruction"]},
{"from": "gpt", "value": item["output"]},
],
})
return formatted
# ===== Step 5: Validation =====
def validate_data(data, format_type="openai"):
    """Data quality validation"""
    errors = []
    for i, item in enumerate(data):
        if format_type == "openai":
            if "messages" not in item:
                errors.append(f"[{i}] Missing messages field")
            elif len(item["messages"]) < 2:
                errors.append(f"[{i}] Insufficient message count")
            for msg in item.get("messages", []):
                if msg.get("role") not in ("system", "user", "assistant"):
                    errors.append(f"[{i}] Invalid role: {msg.get('role')}")
                # Guard against missing or None content before stripping
                if not (msg.get("content") or "").strip():
                    errors.append(f"[{i}] Empty content")
    if errors:
        print(f"Validation errors: {len(errors)}")
        for e in errors[:10]:
            print(f"  {e}")
    else:
        print("Validation passed!")
    return len(errors) == 0
# ===== Step 6: Upload =====
def upload_to_hub(data, repo_name):
"""Upload to Hugging Face Hub"""
ds = Dataset.from_list(data)
split_ds = ds.train_test_split(test_size=0.05, seed=42)
split_ds.push_to_hub(repo_name, private=True)
print(f"Upload complete: {repo_name}")
print(f" Train: {len(split_ds['train'])}, Validation: {len(split_ds['test'])}")
# ===== Execute =====
if __name__ == "__main__":
raw_data = collect_data()
cleaned_data = clean_data(raw_data)
unique_data = remove_duplicates(cleaned_data)
formatted_data = format_data(unique_data, "openai")
is_valid = validate_data(formatted_data, "openai")
if is_valid:
upload_to_hub(formatted_data, "my-org/korean-sft-v1")
7.2 Quality Dashboard
def generate_quality_report(dataset):
"""Generate dataset quality report"""
report = {
"total_samples": len(dataset),
"length_stats": analyze_length_distribution(dataset, "instruction"),
"diversity": vocabulary_diversity([d["instruction"] for d in dataset]),
"source_distribution": Counter(d["source"] for d in dataset),
}
print("=" * 60)
print("Dataset Quality Report")
print("=" * 60)
print(f"Total samples: {report['total_samples']}")
print(f"\nLength statistics:")
for k, v in report['length_stats'].items():
print(f" {k}: {v:.1f}")
print(f"\nVocabulary diversity: TTR = {report['diversity']['type_token_ratio']:.4f}")
print(f"\nSource distribution:")
for source, count in report['source_distribution'].most_common():
print(f" {source}: {count} ({count/len(dataset)*100:.1f}%)")
print("=" * 60)
return report
8. Quiz
Q1. What was the key factor that allowed Phi-3 to outperform larger models?
Answer: Meticulously curated high-quality training data
Phi-3 outperformed 7B-13B models with just 3.8B parameters. The key was not model size but training data quality: it was trained on heavily filtered web data plus carefully curated, textbook-quality synthetic data.
Q2. What is MinHash LSH used for?
Answer: Approximate Deduplication
MinHash LSH is an algorithm for efficiently finding near-duplicate documents in large corpora. It replaces exact all-pairs comparison, which costs O(n²), with similarity-based approximate matching: documents are hashed into compact signatures and bucketed so that only likely matches are compared, giving roughly linear-time deduplication.
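As a minimal, self-contained illustration of the MinHash half of the technique (the LSH banding step that buckets signatures is omitted, and the helper names `shingles`, `minhash_signature`, `estimated_jaccard` are illustrative):

```python
import hashlib

def shingles(text, k=3):
    """Word k-gram shingle set for a document."""
    tokens = text.split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def minhash_signature(shingle_set, num_perm=64):
    """One minimum hash value per seeded hash function."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

In a full LSH setup, each signature is split into bands and hashed into buckets, so only documents sharing a bucket are ever compared; this is what avoids the O(n²) all-pairs scan.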
Q3. What are the key differences between Alpaca, ShareGPT, and OpenAI Messages formats?
Answer:
- Alpaca: Single-turn structure with instruction/input/output
- ShareGPT: Multi-turn conversations array with human/gpt roles
- OpenAI Messages: Messages array with system/user/assistant roles
Both ShareGPT and OpenAI formats support multi-turn conversations, but differ in role names and structure.
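The same single-turn sample rendered in all three formats (field values shortened for illustration):

```python
# Alpaca: flat single-turn fields
alpaca = {
    "instruction": "Explain photosynthesis.",
    "input": "",
    "output": "Photosynthesis is the process by which plants ...",
}

# ShareGPT: multi-turn conversations array with human/gpt roles
sharegpt = {
    "conversations": [
        {"from": "human", "value": alpaca["instruction"]},
        {"from": "gpt", "value": alpaca["output"]},
    ]
}

# OpenAI Messages: messages array with system/user/assistant roles
openai_messages = {
    "messages": [
        {"role": "user", "content": alpaca["instruction"]},
        {"role": "assistant", "content": alpaca["output"]},
    ]
}
```

Note that Alpaca's `input` field has no direct counterpart in the other two; a common convention is to append it to the user message when it is non-empty.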
Q4. What do "chosen" and "rejected" mean in DPO datasets?
Answer:
- chosen: The human-preferred (better) response
- rejected: The human-dispreferred (worse) response
DPO (Direct Preference Optimization) uses this paired data to train models to generate responses similar to chosen and different from rejected. Unlike RLHF, it optimizes directly without a separate reward model.
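The chosen/rejected pair feeds directly into the DPO objective. A scalar sketch of the per-pair loss, assuming the summed token log-probabilities of each response under the policy and the frozen reference model are already computed (`dpo_loss` is an illustrative name, not a library function):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin)).
    Arguments are summed token log-probabilities of each full response."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss shrinks as the policy assigns relatively more probability to the chosen response than the rejected one, compared to the reference model; beta controls how strongly the policy is pushed away from the reference.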
Q5. Why is benchmark contamination dangerous and how can it be detected?
Answer:
Benchmark contamination occurs when evaluation data is included in training data, causing model performance to be overestimated.
Risks:
- Model has "memorized" rather than actually "solved" the problems
- Fair model comparison becomes impossible
- Below-expected performance when deployed in production
Detection methods:
- N-gram overlap checking (commonly 13-grams)
- Hash comparison against normalized benchmark sentences
- Applying the contamination-analysis methodology described in the GPT-4 technical report
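The hash-comparison method can be sketched as an exact-match complement to the 13-gram check in Section 6.3. The normalization choice here (lowercasing, whitespace collapsing) and the helper names are assumptions for illustration:

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace before hashing."""
    return " ".join(text.lower().split())

def sentence_hash_set(sentences):
    """SHA-256 hashes of normalized benchmark sentences."""
    return {hashlib.sha256(normalize(s).encode()).hexdigest() for s in sentences}

def exact_contamination(train_sentences, benchmark_sentences):
    """Return training sentences whose normalized hash appears in the benchmark."""
    bench = sentence_hash_set(benchmark_sentences)
    return [s for s in train_sentences
            if hashlib.sha256(normalize(s).encode()).hexdigest() in bench]
```

Exact hashing is cheap and has no false positives, but misses paraphrased leakage; the n-gram check catches partial overlaps that hashing misses, so the two are complementary.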
9. References
- LIMA: Less Is More for Alignment - Zhou et al., 2023
- Phi-3 Technical Report - Microsoft Research, 2024
- Self-Instruct: Aligning LLMs with Self-Generated Instructions - Wang et al., 2023
- WizardLM: Empowering Large Language Models to Follow Complex Instructions - Xu et al., 2023
- Hugging Face Datasets Documentation - huggingface.co/docs/datasets
- KoAlpaca: Korean Alpaca Model - beomi, GitHub
- UltraFeedback: Boosting Language Models with High-quality Feedback - Cui et al., 2023
- Training Language Models to Follow Instructions with Human Feedback - Ouyang et al., 2022
- Direct Preference Optimization - Rafailov et al., 2023
- Deduplicating Training Data Makes Language Models Better - Lee et al., 2022
- KLUE: Korean Language Understanding Evaluation - Park et al., 2021
- Textbooks Are All You Need - Gunasekar et al., 2023
- Constitutional AI: Harmlessness from AI Feedback - Bai et al., 2022
- The RefinedWeb Dataset for Falcon LLM - Penedo et al., 2023