Chaos and Order

https://www.youngju.dev/blog
Slowly and correctly. AI Researcher & DevOps Engineer Youngju's tech blog. GPU/CUDA, LLM, MLOps, Kubernetes AI workloads, distributed training, and data engineering.
ko
fjvbn2003@gmail.com (Youngju Kim)
Sat, 16 May 2026 00:00:00 GMT

https://www.youngju.dev/blog/culture/2026-05-16-ai-safety-alignment-2026-constitutional-ai-rlhf-dpo-grpo-mech-interp-aisi-evals-redteam-deep-dive.en
AI Safety & Alignment 2026 Deep Dive - Constitutional AI · RLHF · DPO · GRPO · Mechanistic Interpretability · AISI Evals · Red Team
A one-stop map of AI safety and alignment as of 2026. It starts from conceptual roots such as outer/inner alignment and mesa-optimization, then walks through training-time alignment (RLHF, DPO, GRPO, Constitutional AI), frontier policies (Anthropic RSP, OpenAI Preparedness, DeepMind Frontier Safety Framework), mechanistic interpretability with sparse autoencoders, capability evals (MMLU, GPQA, SWE-bench, METR) and safety evals (Apollo scheming, Anthropic sabotage), the AISI network (UK, US, Korea, Japan, EU), red teaming and jailbreaks (GCG, PAIR, AutoDAN), defenses (Llama Guard, NeMo Guardrails, Constitutional Classifiers), and regulation (EU AI Act, Korean AI Basic Act, METI guidelines), across 24 chapters.
Sat, 16 May 2026 00:00:00 GMT
fjvbn2003@gmail.com (Youngju Kim)
ai-safety, ai-alignment, constitutional-ai, rlhf, dpo, grpo, mechanistic-interpretability, aisi, red-team, evals, english

https://www.youngju.dev/blog/culture/2026-05-16-ai-safety-alignment-2026-constitutional-ai-rlhf-dpo-grpo-mech-interp-aisi-evals-redteam-deep-dive.ja
AI Safety & Alignment 2026 Complete Guide - Constitutional AI · RLHF · DPO · GRPO · Mechanistic Interpretability · AISI Evals · Red Team Thorough Explanation
An overview of the full landscape of AI safety and alignment in 2026, in one pass. It covers conceptual foundations such as outer/inner alignment and mesa-optimization, training-time alignment methods from RLHF and DPO to GRPO and Constitutional AI, and frontier policies such as Anthropic RSP, the OpenAI Preparedness Framework, and the Google DeepMind Frontier Safety Framework. It continues with Mechanistic Interpretability and Sparse Autoencoders, capability evals such as MMLU, GPQA, SWE-bench, and METR, safety evals such as Apollo Research's scheming evals, the AISI network (UK, US, Korea, Japan), and the Bletchley, Seoul, and Paris summits. It closes with red teaming and jailbreaks such as GCG, PAIR, and AutoDAN, defenses such as Llama Guard, NeMo Guardrails, and Constitutional Classifiers, and regulation including the EU AI Act, the Korean AI Basic Act, and the METI guidelines, laid out in 24 chapters.
Sat, 16 May 2026 00:00:00 GMT
fjvbn2003@gmail.com (Youngju Kim)
ai-safety, ai-alignment, constitutional-ai, rlhf, dpo, grpo, mechanistic-interpretability, aisi, red-team, evals, 日本語

https://www.youngju.dev/blog/culture/2026-05-16-ai-safety-alignment-2026-constitutional-ai-rlhf-dpo-grpo-mech-interp-aisi-evals-redteam-deep-dive
AI Safety & Alignment 2026 Complete Guide - Constitutional AI · RLHF · DPO · GRPO · Mechanistic Interpretability · AISI Evals · Red Team In-Depth Analysis
An overview of the full landscape of AI safety and alignment in 2026, in one pass.
It covers conceptual foundations such as outer/inner alignment and mesa-optimization, training-time alignment techniques from RLHF and DPO to GRPO and Constitutional AI, and frontier policies such as Anthropic RSP, the OpenAI Preparedness Framework, and the Google DeepMind Frontier Safety Framework. It continues with Mechanistic Interpretability and Sparse Autoencoders, capability evals such as MMLU, GPQA, SWE-bench, and METR, safety evals such as the Apollo Research scheming evals, the AISIs (UK, US, Korea, Japan), and the Bletchley, Seoul, and Paris summits. It closes with red teaming and jailbreaks such as GCG, PAIR, and AutoDAN, defenses such as Llama Guard, NeMo Guardrails, and Constitutional Classifiers, and regulation including the EU AI Act, the Korean AI Basic Act, and the METI guidelines, laid out in 24 chapters.
Sat, 16 May 2026 00:00:00 GMT
fjvbn2003@gmail.com (Youngju Kim)
ai-safety, ai-alignment, constitutional-ai, rlhf, dpo, grpo, mechanistic-interpretability, aisi, red-team, evals

https://www.youngju.dev/blog/culture/2026-05-16-llm-finetuning-frameworks-2026-axolotl-unsloth-llama-factory-trl-peft-torchtune-deep-dive.en
LLM Fine-tuning Frameworks 2026 - A Deep Dive into Axolotl, Unsloth, LLaMA-Factory, TRL, PEFT, and TorchTune
A complete map of the 2026 LLM fine-tuning ecosystem. It covers open-source frameworks such as Axolotl, Unsloth, LLaMA-Factory, TRL, PEFT, and TorchTune, LLM Foundry (MosaicML, acquired by Databricks), and cloud fine-tuning APIs from Modal, Together, OpenAI, Anthropic, and Cohere. It also covers distributed training techniques such as QLoRA, FSDP, and DeepSpeed ZeRO, preference-optimization algorithms such as DPO, GRPO (DeepSeek R1), KTO (Kahneman-Tversky), and IPO, and case studies from Korea (Upstage, KT, LG AI) and Japan (Sakana, Stockmark, ELYZA, PFN). It includes a decision guide for solo developers, academic researchers, startups, and enterprises.
Sat, 16 May 2026 00:00:00 GMT
fjvbn2003@gmail.com (Youngju Kim)
llm, finetuning, axolotl, unsloth, llama-factory, trl, peft, torchtune, mosaicml, llm-foundry, modal, dpo, grpo, kto, qlora, fsdp, deepspeed, 2026, deep-dive, english

https://www.youngju.dev/blog/culture/2026-05-16-llm-finetuning-frameworks-2026-axolotl-unsloth-llama-factory-trl-peft-torchtune-deep-dive.ja
LLM Fine-tuning Frameworks 2026 - Axolotl / Unsloth / LLaMA-Factory / TRL / PEFT / TorchTune Thorough Guide
An overview of the 2026 LLM fine-tuning ecosystem, in one pass. It covers open-source frameworks such as Axolotl, Unsloth, LLaMA-Factory, TRL, PEFT, and TorchTune, LLM Foundry (MosaicML, acquired by Databricks), and cloud fine-tuning APIs from Modal, Together, OpenAI, Anthropic, and Cohere. It also covers distributed training techniques such as QLoRA, FSDP, and DeepSpeed ZeRO, preference-optimization algorithms such as DPO, GRPO (DeepSeek R1), KTO (Kahneman-Tversky), and IPO, and case studies from Korea (Upstage, KT, LG AI) and Japan (Sakana, Stockmark, ELYZA, PFN). It includes a decision guide on what individual developers, academic researchers, startups, and enterprises should each choose.
Sat, 16 May 2026 00:00:00 GMT
fjvbn2003@gmail.com (Youngju Kim)
llm, finetuning, axolotl, unsloth, llama-factory, trl, peft, torchtune, mosaicml, llm-foundry, modal, dpo, grpo, kto, qlora, fsdp, deepspeed, 2026, deep-dive, 日本語

https://www.youngju.dev/blog/culture/2026-05-16-llm-finetuning-frameworks-2026-axolotl-unsloth-llama-factory-trl-peft-torchtune-deep-dive
LLM Fine-tuning Frameworks 2026 - Axolotl / Unsloth / LLaMA-Factory / TRL / PEFT / TorchTune In-Depth Guide
An overview of the 2026 LLM fine-tuning ecosystem, in one pass. It covers open-source frameworks such as Axolotl, Unsloth, LLaMA-Factory, TRL, PEFT, and TorchTune, LLM Foundry (MosaicML, acquired by Databricks), and cloud fine-tuning APIs from Modal, Together, OpenAI, Anthropic, and Cohere. It also covers distributed training techniques such as QLoRA, FSDP, and DeepSpeed ZeRO, preference-optimization algorithms such as DPO, GRPO (DeepSeek R1), KTO (Kahneman-Tversky), and IPO, and case studies from Korea (Upstage, KT, LG AI) and Japan (Sakana, Stockmark, ELYZA, PFN). It also includes a decision guide on what solo developers, academic researchers, startups, and enterprises should each pick.
Sat, 16 May 2026 00:00:00 GMT
fjvbn2003@gmail.com (Youngju Kim)
llm, finetuning, axolotl, unsloth, llama-factory, trl, peft, torchtune, mosaicml, llm-foundry, modal, dpo, grpo, kto, qlora, fsdp, deepspeed, 2026, deep-dive