Chaos and Order

https://www.youngju.dev/blog
Slowly and correctly. AI Researcher & DevOps Engineer Youngju's tech blog. GPU/CUDA, LLM, MLOps, Kubernetes AI workloads, distributed training, and data engineering.
Language: ko
fjvbn2003@gmail.com (Youngju Kim)
Sat, 16 May 2026 00:00:00 GMT

AI Safety & Alignment 2026 Deep Dive - Constitutional AI · RLHF · DPO · GRPO · Mechanistic Interpretability · AISI Evals · Red Team
https://www.youngju.dev/blog/culture/2026-05-16-ai-safety-alignment-2026-constitutional-ai-rlhf-dpo-grpo-mech-interp-aisi-evals-redteam-deep-dive.en
A single-pass map of AI safety and alignment as of 2026, in 24 chapters. It starts from conceptual roots such as outer/inner alignment and mesa-optimization, then walks through training-time alignment (RLHF, DPO, GRPO, Constitutional AI), frontier policies (Anthropic RSP, OpenAI Preparedness, DeepMind Frontier Safety Framework), mechanistic interpretability with sparse autoencoders, capability evals (MMLU, GPQA, SWE-bench, METR) and safety evals (Apollo scheming, Anthropic sabotage), the AISI network (UK, US, Korea, Japan, EU), red teaming and jailbreaks (GCG, PAIR, AutoDAN), defenses (Llama Guard, NeMo Guardrails, Constitutional Classifiers), and regulation (EU AI Act, Korean AI Basic Act, METI guidelines).
Sat, 16 May 2026 00:00:00 GMT
fjvbn2003@gmail.com (Youngju Kim)
Tags: ai-safety, ai-alignment, constitutional-ai, rlhf, dpo, grpo, mechanistic-interpretability, aisi, red-team, evals, english

AI Safety & Alignment 2026 Complete Guide - Constitutional AI · RLHF · DPO · GRPO · Mechanistic Interpretability · AISI Evals · Red Team In-Depth Explanation
https://www.youngju.dev/blog/culture/2026-05-16-ai-safety-alignment-2026-constitutional-ai-rlhf-dpo-grpo-mech-interp-aisi-evals-redteam-deep-dive.ja
This post maps the whole landscape of AI safety and alignment in 2026 in one pass, across 24 chapters. It covers conceptual foundations such as outer/inner alignment and mesa-optimization; training-time alignment methods running from RLHF, DPO, and GRPO to Constitutional AI; frontier policies such as Anthropic RSP, the OpenAI Preparedness Framework, and the Google DeepMind Frontier Safety Framework; Mechanistic Interpretability and sparse autoencoders; capability evals such as MMLU, GPQA, SWE-bench, and METR, and safety evals such as Apollo Research's scheming evals; the AISI network of the UK, US, Korea, and Japan, and the Bletchley, Seoul, and Paris summits; red teaming and jailbreaks such as GCG, PAIR, and AutoDAN; defenses such as Llama Guard, NeMo Guardrails, and Constitutional Classifiers; and the EU AI Act, the Korean AI Basic Act, and the METI guidelines.
Sat, 16 May 2026 00:00:00 GMT
fjvbn2003@gmail.com (Youngju Kim)
Tags: ai-safety, ai-alignment, constitutional-ai, rlhf, dpo, grpo, mechanistic-interpretability, aisi, red-team, evals, 日本語 (Japanese)

AI Safety & Alignment 2026 Complete Guide - Constitutional AI · RLHF · DPO · GRPO · Mechanistic Interpretability · AISI Evals · Red Team In-Depth Analysis
https://www.youngju.dev/blog/culture/2026-05-16-ai-safety-alignment-2026-constitutional-ai-rlhf-dpo-grpo-mech-interp-aisi-evals-redteam-deep-dive
This post maps the entire landscape of AI safety and alignment in 2026 in one pass.
It covers conceptual foundations such as outer/inner alignment and mesa-optimization; training-time alignment techniques running from RLHF, DPO, and GRPO to Constitutional AI; frontier policies such as Anthropic RSP, the OpenAI Preparedness Framework, and the Google DeepMind Frontier Safety Framework; Mechanistic Interpretability and sparse autoencoders; capability evals such as MMLU, GPQA, SWE-bench, and METR, and safety evals such as the Apollo Research scheming evals; the AISIs (UK, US, Korea, Japan) and the Bletchley, Seoul, and Paris summits; red teaming, with jailbreaks such as GCG, PAIR, and AutoDAN and defenses such as Llama Guard, NeMo Guardrails, and Constitutional Classifiers; and the EU AI Act, the Korean AI Basic Act, and the METI guidelines, all laid out over 24 chapters.
Sat, 16 May 2026 00:00:00 GMT
fjvbn2003@gmail.com (Youngju Kim)
Tags: ai-safety, ai-alignment, constitutional-ai, rlhf, dpo, grpo, mechanistic-interpretability, aisi, red-team, evals