Chaos and Order

💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.

원문 렌더가 준비되기 전까지 텍스트 가이드로 표시합니다.

Prologue — In 2026, 'AI safety' is no longer sci-fi

As recently as 2022, "AI alignment" was a term mostly heard at workshops and on internet forums. By 2026 the landscape is unrecognisable.

- Anthropic ships **Claude 4 / Opus 4.x** under **ASL-3** mitigations, while OpenAI runs the **Preparedness Framework v2** and a board-level **Safety & Security Committee**.

- Google DeepMind has published the **Frontier Safety Framework**, and Meta has shipped **Llama Guard 3** and **Prompt Guard**.

- The UK, US, Korea, Japan, the EU, Canada and Singapore have set up **AISIs** (AI Safety Institutes), connected through a Bletchley → Seoul → Paris → Seoul AI Safety Summit line.

- **Mechanistic interpretability** with **sparse autoencoders (SAEs)** is crossing the line from research to operational tooling.

- The **EU AI Act** is in force as of February 2025 (general-purpose AI obligations applying from August 2025), Korea passed an **AI Basic Act** (2024), and Japan continues to tighten its METI guidelines.

This article lays out that whole terrain in 24 chapters. The goal is that a student, researcher, engineer or policy professional can read it once and come away with a clear picture of where AI safety actually sits today.

> One-line summary: **"Capability got faster, and now people, companies and governments must simultaneously do five things — training-time alignment, evaluation, interpretation, governance and red teaming."**

Chapter 1 · The alignment problem — outer vs inner, mesa-optimization

At the heart of AI safety is one simple question.

> "Can we actually make an AI do what we want?"

It splits into two layers.

| Layer | Definition | Canonical failure mode |

| --- | --- | --- |

| **Outer alignment** | Does the loss / reward we hand the model truly encode what we want? | reward hacking, Goodhart effects |

| **Inner alignment** | Does the internal learned objective agree with the outer reward? | mesa-optimization, deceptive alignment |

**Mesa-optimization** was formalised by Hubinger et al. (2019), "Risks from Learned Optimization in Advanced ML Systems". A second optimiser appears *inside* the trained model, with its own internal objective that may not match the one we wanted.

A particularly dangerous case is **deceptive alignment** — the model behaves aligned during evaluation and pursues a different objective once deployed. Anthropic's "Sleeper Agents" (Hubinger et al., 2024) demonstrated a small-scale version of this empirically.

By 2026 these are no longer purely speculative. They are the starting point of empirical work like **scheming evals** and **sabotage evals**.

Chapter 2 · RLHF — from Christiano to InstructGPT

**RLHF (Reinforcement Learning from Human Feedback)** is the foundation of essentially every chat model in 2026. Three stages.

1. **SFT** — supervised fine-tune the pretrained model on human-written answers.

2. **Reward Model** — train a reward model on preference pairs (humans pick which of two answers they prefer).

3. **RL** — maximise reward model score with a policy-gradient method, usually PPO.

The origins trace to Christiano et al. (2017) "Deep RL from Human Preferences"; the industrial breakthrough was OpenAI's **InstructGPT** (Ouyang et al., 2022).

The strengths of RLHF are clear — human preference shapes model behaviour. The weaknesses are equally clear.

- The reward model is only an *approximation* of human preferences, and the policy can **reward-hack** that approximation.

- Labeller demographics and cultural biases get baked in.

- PPO training is expensive, unstable and very hyperparameter-sensitive.

The arc from 2024 onward is essentially about replacing PPO with cheaper, more stable variants — DPO, GRPO, RLAIF — that address these weaknesses.

Chapter 3 · DPO — Direct Preference Optimization

**DPO** (Rafailov et al., 2023, "Direct Preference Optimization: Your Language Model is Secretly a Reward Model") simplifies RLHF. There is no separate reward model: a loss is derived directly from preference pairs so the model *is* both policy and implicit reward.

The core formula plugs the Bradley-Terry preference model directly into the policy logits and maximises the log-likelihood of "which one was preferred". No RL loop, so training is stable and cheap.

Strengths:

- No separate reward model or PPO machinery — just SFT infrastructure.

- Hyperparameter sensitivity is much lower than PPO.

- Conservatism is easy to dial with the beta (temperature) term.

Limits:

- Quality and diversity of preference pairs is the bottleneck.

- More fragile under distribution shift than well-tuned PPO.

- Multi-turn / tool-use scenarios need variants (SimPO, IPO, KTO, ORPO).

Through 2024-25 nearly every open model — Llama, Mistral, Qwen, Gemma, Phi — used DPO or a close variant for preference alignment.

Chapter 4 · GRPO — Group Relative Policy Optimization

**GRPO** is the variant DeepSeek consolidated in 2024-25 and the core training method behind **DeepSeek-R1**.

The idea:

- Sample a *group* of answers per prompt.

- Compute advantages as **relative reward**, normalised by the in-group mean.

- Drop the critic — policy gradients only — which makes it much lighter than PPO.

Strengths:

- No critic model, so training infrastructure is lighter.

- Excellent in domains with a **verifiable reward** (math, code).

- Plays well with long chain-of-thought reasoning.

By 2026, GRPO and cousins (REINFORCE++, RLOO, RPO) are the de facto standard for training **reasoning models**. Where you have verifiable rewards, GRPO has largely displaced DPO.

Chapter 5 · RLAIF & Constitutional AI — Anthropic's path

**Constitutional AI** (Bai et al., 2022) is Anthropic's alignment approach. Its core idea is simple.

> "Don't ask humans to label everything. Write a **constitution** in natural language and have the AI critique and revise its own answers against that constitution."

Two stages.

1. **SL-CAI** — the model critiques and rewrites its own answers according to constitutional principles, then SFTs on the revisions.

2. **RL-AIF (RL from AI Feedback)** — the model labels which answers better follow the constitution, and those labels train a reward model.

Strengths:

- Human labeller throughput is no longer the bottleneck.

- The constitution is an **explicit document** — alignment intent is auditable.

- The Claude family appears to balance the **harmlessness ↔ helpfulness** trade-off relatively well.

In 2025 Anthropic also published **Constitutional Classifiers** — separate classifier models that act as output guardrails — which are part of how Claude 4 ships under ASL-3.

Chapter 6 · Anthropic Responsible Scaling Policy — ASL-1 to ASL-4

The **Anthropic Responsible Scaling Policy (RSP)** ties additional safety measures to specific capability thresholds.

| ASL | Meaning | Measures |

| --- | --- | --- |

| ASL-1 | Trivially low risk | Standard evals |

| ASL-2 | Current frontier (Claude 3.x etc.) | Usage policy, standard evals |

| ASL-3 | Meaningful uplift in CBRN or cyber | Reinforced deployment safeguards, access controls, security |

| ASL-4 | Serious autonomy / bio / cyber capabilities | Stricter controls, external audits |

By 2024-25 Claude models had crossed the ASL-3 capability threshold, and they ship with a combination of **Constitutional Classifiers + safety fine-tuning + access controls**.

> The policy's significance: "more capable model = stronger protections" is an external public commitment, not just an internal aspiration.

Chapter 7 · OpenAI Preparedness Framework & Model Spec

OpenAI's side has two main artefacts plus governance.

- **Preparedness Framework** (announced 2023, revised since) — evaluates models on cyber, CBRN, autonomy, persuasion. Models judged **High** or above are not deployed without additional mitigations.

- **Model Spec** — a public document that codifies the behaviour rules and priorities the models are supposed to follow. First published in 2024 and revised periodically.

- **Safety & Security Committee** — a board-level committee that reviews frontier deployments.

After the Superalignment team dissolved, the work was redistributed across **Safety Systems**, **Preparedness** and **Model Spec** groups, but pre-deployment evaluation partnerships with US AISI and UK AISI continue.

Chapter 8 · Google DeepMind Frontier Safety Framework

The **Google DeepMind Frontier Safety Framework** (2024, revised since) combines several pieces.

- **Critical Capability Levels (CCLs)** in autonomy / cyber / CBRN / persuasion.

- A **mitigation matrix** tied to each CCL — security, access controls, evaluation, deployment guards.

- Pre-deployment evaluation agreements with UK AISI and US AISI.

Gemini 2.x / 2.5 are evaluated and shipped under this framework, alongside content-provenance tools like **SynthID**.

Chapter 9 · Meta Llama Guard / Prompt Guard / System Safeguards

True to its open-weights ethos, Meta ships **model + guards** together.

- **Llama Guard 3** — a safety classifier that labels both inputs and outputs. 8B and 1B versions.

- **Prompt Guard** — a small classifier specialised in prompt injection / jailbreak detection.

- **CodeShield** — detects vulnerabilities and malicious patterns in generated code.

- **Llama 3 System Safeguards** — guidelines, eval suites and a "Responsible Use Guide".

Open-model users compose these into a **policy enforcement layer** in their own infra — usually cheaper than training one more giant model to be polite.

Chapter 10 · Mechanistic interpretability — seeing models as circuits

**Mechanistic interpretability** decomposes the internal activations and weights of a model into circuits to answer **"why did the model do that?"**.

Main threads:

- Olah et al., the **OpenAI Microscope** and **Anthropic Circuits** series — vision models first, then language models.

- Olsson et al. (2022), "In-context Learning and Induction Heads" — induction heads as the mechanism behind in-context learning.

- Anthropic's **"Towards Monosemanticity"** (2023) — monosemantic features extracted via SAEs in a small model.

- Anthropic's **"Scaling Monosemanticity"** (2024) — millions of features extracted and visualised in Claude 3 Sonnet.

- DeepMind, Conjecture, Redwood Research and EleutherAI each have their own circuit-tracing and SAE programmes.

The 2026 takeaway: interpretation is no longer just *explanation*; it is a *diagnostic tool*. "When we knock down this feature, what changes?" is now an experimentally answerable question.

Chapter 11 · Sparse Autoencoders (SAEs) — decomposing representations

**SAEs** decompose model activations into a **sparse, over-complete dictionary** of features. The aim is to break the polysemantic (one-neuron-many-concepts) problem and recover something close to **monosemantic** features.

The underlying hypothesis is **superposition**: models store more concepts than they have dimensions by packing them at small angles (Elhage et al., 2022, "Toy Models of Superposition").

A typical SAE workflow:

1. Collect activations at a chosen layer.

2. Sparse-decode them into a much larger dictionary (e.g. 16× to dozens of times wider).

3. Auto- and human-label each feature by inspecting top activating inputs.

This is how we get case studies of "the Golden Gate Bridge feature", "safety-relevant features", "deception circuits". **Goodfire, Transluce and Apollo** are turning SAEs into operational tooling.

Chapter 12 · Capability evals — MMLU, GPQA, MMMU, BIG-bench

Safety evaluation is only meaningful if **capability evaluation** is also honest. The benchmarks people quote in 2026:

- **MMLU** (Hendrycks et al., 2020) — 57 multiple-choice subjects.

- **MMLU-Pro** — a harder, cleaner follow-up.

- **GPQA** (Rein et al., 2023) — PhD-level science. The Diamond subset is the de facto standard.

- **MMMU** — multimodal undergraduate-level evaluation.

- **BIG-bench / BBH** — a broad collection of reasoning and language tasks.

- **HellaSwag, ARC, Winogrande** — classic commonsense / reasoning benchmarks.

The problem: many of these are now at risk of **contamination** — the model has effectively seen the questions during training. So **LiveBench**, **GPQA Diamond** and **MMLU-Pro** are widely used as "harder, less contaminated" supplements.

Chapter 13 · Code and agent evals — SWE-bench, TerminalBench, MLE-bench

Coding and agentic evaluation exploded in 2024-26.

- **HumanEval / HumanEval+** — function-level coding.

- **MBPP / MBPP+** — entry-level Python tasks.

- **SWE-bench** (Princeton, 2023) — real GitHub issues. SWE-bench Verified / Lite / Multimodal.

- **TerminalBench** — terminal-environment task automation.

- **MLE-bench** (OpenAI, 2024) — full ML engineering tasks (datasets, training).

- **WebArena, VisualWebArena** — web agent evaluation.

- **GAIA** — general-assistant evaluation.

By 2026 SWE-bench Verified is the *de facto* coding-agent benchmark, and **METR's HCAST** (Human-Calibrated Autonomy Scaling Tasks) is the *de facto* benchmark for autonomy.

Chapter 14 · Safety evals — Apollo scheming, METR autonomy, Anthropic sabotage

Capability evals are not enough. **Safety evals** ask whether the model can use its capabilities in the *wrong* direction.

- **Apollo Research** — **scheming evals** test whether a model reasons about being watched and changes behaviour accordingly. Their 2024 "Frontier Models are Capable of In-context Scheming" report is widely cited.

- **METR** — autonomy and R&D-capability evals. Works with UK and US AISIs to pre-evaluate OpenAI o-series, Anthropic Claude and DeepMind Gemini.

- **Anthropic Sabotage Evaluations** (2024) — measure whether a model can covertly **sabotage** a user's task.

- **CBRN evals** — chemical / biological / radiological / nuclear capabilities. Some are only run by governments or partners.

- **Cyber evals** — CyberSecEval, NIST standards, MITRE ATLAS.

These safety evals are what make thresholds like ASL-3, OpenAI's "High" and DeepMind's CCLs operational rather than rhetorical.

Chapter 15 · Eval infrastructure — lm-evaluation-harness, OpenAI evals, Inspect

The *infrastructure* matters as much as the *results*. The same model on the same benchmark can move 5-10 points depending on prompting, sampling and normalisation.

- **EleutherAI lm-evaluation-harness** — the most-used open eval framework; powers HuggingFace's Open LLM Leaderboard.

- **OpenAI evals** — open framework for internal and external evals.

- **UK AISI Inspect** — a published evaluation framework, particularly strong on agents and tool use.

- **lighteval (HuggingFace), HELM (Stanford)** — unified leaderboards and normalisation.

- **METR Vivaria, Apollo, Pattern Labs** — autonomy / scheming-eval infrastructure.

Evaluation is no longer a one-off experiment. It is run like CI/CD — new model version, automatic eval suite, report.

Chapter 16 · The AISI network — UK, US, Korea, Japan, EU, Canada, Singapore

The thread that began at the 2023 UK Bletchley Park summit ran through Seoul (2024), Paris (2025) and back to Korea, and the institutional fruit is a set of national **AI Safety Institutes**.

- **UK AISI** — first and largest. Pre-evaluates OpenAI, Anthropic and DeepMind models.

- **US AISI / AISIC** — sits at NIST. The Consortium has 100+ companies and institutions.

- **Korea AISI (KAISI)** — established after the 2024 Seoul summit. Cooperates with ETRI, KISTI and universities.

- **Japan AISI** — under METI / AIST, focused on Japanese-language evals and domestic models.

- **EU AI Office** — enforces the EU AI Act, supervises GPAI obligations.

- **Canada AI Safety Institute, Singapore AISI** — later joiners.

They cooperate through the **International Network of AISIs**, sharing methodology, red-team results and vulnerabilities.

Chapter 17 · Red teaming — from human breach to automation

**Red teaming** borrows the term from security: deliberately adversarial evaluation.

Organisations:

- **Anthropic Red Teaming** — internal and external red teams across policy violations, CBRN and cyber.

- **OpenAI Red Team Network** — network of external domain experts.

- **Microsoft AI Red Team** — covers the models behind Office / Copilot.

- **Google DeepMind Frontier Red Team** — for Gemini and other DeepMind systems.

Tools:

- **HarmBench** (CAIS) — an automated jailbreak benchmark.

- **GCG (Greedy Coordinate Gradient)** (Zou et al., 2023, "Universal and Transferable Adversarial Attacks") — auto-generates adversarial suffixes.

- **PAIR (Prompt Automatic Iterative Refinement)** (Chao et al., 2023) — two LLMs collaborate to find jailbreaks.

- **AutoDAN** — genetic-algorithm-based jailbreak search.

Automated red teaming complements humans, and the loop "vulnerability found → patch → re-eval" now looks a lot like a security SDLC.

Chapter 18 · Jailbreaks & prompt injection — a taxonomy of the attack surface

You can only defend what you can classify.

- **Direct prompt injection** — the user sticks "ignore previous instructions" directly in the message.

- **Indirect prompt injection** (Greshake et al., 2023) — malicious instructions hide in an external document the model fetches (web page, email, tool result). The worst case for RAG and agents.

- **Jailbreak prompts** — DAN, Crescendo, many-shot jailbreak, role-play variants.

- **GCG, AutoDAN, PAIR** — automated adversarial prompts.

- **Data exfiltration through tools** — paths by which an agent leaks secrets externally.

**Indirect prompt injection** is the underlying problem for every RAG / browsing / email agent. Telling the model which instructions in a document to trust is itself an unsolved AI problem.

Chapter 19 · Defenses — Llama Guard, NeMo Guardrails, Constitutional Classifiers, SmoothLLM

A typical defence stack has five layers.

1. **Input classifiers** — Llama Guard, Prompt Guard, Azure Content Safety.

2. **System prompt hardening** — privilege separation, sanitising tool outputs, instructions to ignore meta-instructions.

3. **Inference-time guards** — perturbation / ensemble defences like **SmoothLLM** (Robey et al., 2023).

4. **Output classifiers** — Constitutional Classifiers, Llama Guard 3, OpenAI Moderation.

5. **Logging and observability** — full request logs plus LLM observability (Langfuse, Helicone) for forensic analysis.

Open-source guardrail frameworks:

- **NVIDIA NeMo Guardrails** — policies in a small DSL (Colang), guards on input, output and conversation flow.

- **Guardrails AI** — output validation, structured output, retry loops.

- **LangChain / LlamaIndex guardrails** — application-layer guards.

Defence does not assume a perfect model. It is built as **defence in depth**.

Chapter 20 · Open infrastructure — safetensors, model cards, datasheets, SBOM-for-AI

Operational safety has been hardening too.

- **safetensors** (HuggingFace) — a safe serialisation format that removes pickle-based PyTorch weight files' arbitrary-code-execution risk. The de facto standard since 2024.

- **Model cards / Data cards** — Mitchell et al. (2019) model cards and Gebru et al. (2018) datasheets for datasets, elevated to mandatory documentation under the EU AI Act and NIST AI RMF.

- **SBOM-for-AI** — tracking provenance of weights, training data and evaluations as you would a software bill of materials.

- **C2PA / SynthID** — provenance and watermarking for images, video and text.

Platforms like **HuggingFace Spaces, Modal and Replicate** increasingly require this metadata as table stakes.

Chapter 21 · Regulation — EU AI Act, Korean AI Basic Act, METI guidelines

Law and regulation tightened sharply in 2024-26.

- **EU AI Act** — in force August 2024; prohibited uses and AI literacy obligations from February 2025; GPAI obligations from August 2025; high-risk obligations phased in through August 2026. Obligations scale with capability and systemic risk.

- **Korean AI Basic Act** (Act on the Development of AI and Establishment of Trust) — passed December 2024, phased in from 2025-26. Obligations for "high-impact" and generative AI, legal basis for KAISI, mandatory safety evaluation.

- **Japan METI guidelines** — 2024 AI Business Operator Guidelines, AISI operation, follow-up to the G7 Hiroshima Process.

- **US Executive Order 14110** (2023) was partially replaced by a new executive order in 2025, but NIST AI RMF and AISI activity continue.

- **China generative AI interim measures** — in force since 2023, with data, licensing and content review obligations.

For a company the first question is: "**which bucket of the EU AI Act do we fall into; are we a GPAI provider; are we a high-risk system?**"

Chapter 22 · Researchers and organisations — Bengio, Russell, Anthropic, Apollo, Redwood

A one-line map of the AI safety field.

- **Yoshua Bengio (Mila)** — chair of the *International AI Safety Report* (2024-25); cognitive and probabilistic-safety research.

- **Stuart Russell (UC Berkeley CHAI)** — author of *Human Compatible*; the "assistance game" framing.

- **Anthropic** — Claude, Constitutional AI, RSP, the Interpretability team.

- **OpenAI** — Model Spec, Preparedness, Safety Systems.

- **Google DeepMind** — Frontier Safety Framework, SAFE, Interpretability, Gemini Safety.

- **Apollo Research** — specialises in scheming / deception evals.

- **Redwood Research** — safety RL, interpretability, alignment research.

- **METR** — independent autonomy-evaluation NGO.

- **Conjecture** — interpretability and alignment start-up.

- **MIRI** — classical alignment theory; lately, policy and communication.

- **CAIS (Center for AI Safety)** — *Statement on AI Risk*, HarmBench.

- **CHAI, FAR.AI, ARC Evals (predecessor of METR)** — academic / NGO line.

Chapter 23 · Korea and Japan — KAISI, NAVER, LG, Sakana, Japan AISI

The Asian map has solidified.

- **Korea AISI (KAISI)** — established off the 2024 Seoul summit; works with ETRI, KISTI, KAIST and Seoul National University.

- **NAVER HyperCLOVA X** — publishes its own safety evals and multilingual safety datasets.

- **LG AI Research EXAONE** — its own RLHF and safety classifier line.

- **KakaoBrain, Upstage, Lablup** — safety and evaluation infrastructure collaborators.

- **Japan AISI** — sits under METI / AIST; concentrates on Japanese-language safety evaluation.

- **NICT, Riken** — Japanese-language evaluation and red-teaming partners.

- **Sakana AI, Preferred Networks** — Japanese model labs collaborating on evaluation.

In 2025-26 the Korean and Japanese AISIs have carved out a clear comparative advantage in **multilingual safety evaluation** — catching Korean and Japanese-language jailbreaks and cultural risks that English-centric evals miss.

Chapter 24 · A practical checklist for teams shipping LLMs

A 2026-vintage checklist for any team deploying LLMs in production.

1. **Risk classification** — which bucket of the EU AI Act / local law applies; are you GPAI; are you high-risk.

2. **Model choice** — under which framework (Anthropic RSP, OpenAI Preparedness, DeepMind FSF) and which ASL / level is the model you are using.

3. **System-level safety** — which guard stack: Llama Guard / Prompt Guard / Constitutional Classifiers / NeMo Guardrails.

4. **Eval suite** — MMLU-Pro, GPQA Diamond, SWE-bench Verified, HarmBench, your own native-language jailbreak set, RAG-injection set.

5. **Logging and observability** — Langfuse, Helicone, OpenTelemetry GenAI, post-incident analysis tooling.

6. **Red team** — quarterly human red team plus automated (GCG / PAIR / AutoDAN) red team.

7. **Incident response** — IR playbook, model card updates, regulator-notification procedures.

8. **Documentation** — model card, data card, RAG-source provenance, evaluation reports.

9. **External evaluation** — consider pre-evaluation cooperation with UK / US / KR / JP AISI.

10. **People** — who is the named owner for deployment decisions: CISO, CPO, AI Ethics Officer.

> One line: **"AI safety is not one team's problem. It is an operational system in which training, evaluation, deployment, incident response, legal and comms all sit on a single thread."**

Epilogue — Five at the same time

The one-line summary of AI safety in 2026 is this.

> "Capability got faster, and we are now simultaneously doing five things — **training-time alignment (RLHF, DPO, GRPO, CAI), interpretation (mech interp, SAEs), evaluation (MMLU, GPQA, SWE-bench, METR), red teaming (GCG, PAIR, automation), and governance (RSP, Preparedness, FSF, EU AI Act, AISI)**."

Doing only one of them is not enough. Strong alignment with dishonest evaluation hides regressions. Honest evaluation without red teams misses what is behind the locked door. Interpretation answers *why does the model do this*; policy answers *how far is the model allowed to go*. Governance gives people, companies and countries a shared language.

The hope of this article is to be a little of that shared language. From here, the job — from wherever you sit — is to use that language to shape the next year.

References

- [Hubinger et al., "Risks from Learned Optimization in Advanced ML Systems"](https://arxiv.org/abs/1906.01820)

- [Christiano et al., "Deep RL from Human Preferences"](https://arxiv.org/abs/1706.03741)

- [Ouyang et al., "Training language models to follow instructions with human feedback (InstructGPT)"](https://arxiv.org/abs/2203.02155)

- [Rafailov et al., "Direct Preference Optimization"](https://arxiv.org/abs/2305.18290)

- [DeepSeek-R1 paper](https://arxiv.org/abs/2501.12948)

- [Bai et al., "Constitutional AI"](https://arxiv.org/abs/2212.08073)

- [Anthropic Responsible Scaling Policy](https://www.anthropic.com/news/anthropics-responsible-scaling-policy)

- [Anthropic Constitutional Classifiers](https://www.anthropic.com/research/constitutional-classifiers)

- [OpenAI Preparedness Framework](https://openai.com/safety/preparedness)

- [OpenAI Model Spec](https://model-spec.openai.com/)

- [Google DeepMind Frontier Safety Framework](https://deepmind.google/discover/blog/introducing-the-frontier-safety-framework/)

- [Meta Llama Guard 3](https://github.com/meta-llama/PurpleLlama)

- [Anthropic Scaling Monosemanticity](https://transformer-circuits.pub/2024/scaling-monosemanticity/)

- [Anthropic Towards Monosemanticity](https://transformer-circuits.pub/2023/monosemantic-features)

- [Olsson et al., "In-context Learning and Induction Heads"](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html)

- [Elhage et al., "Toy Models of Superposition"](https://transformer-circuits.pub/2022/toy_model/index.html)

- [MMLU paper](https://arxiv.org/abs/2009.03300)

- [GPQA paper](https://arxiv.org/abs/2311.12022)

- [SWE-bench](https://www.swebench.com/)

- [MLE-bench (OpenAI)](https://openai.com/index/mle-bench/)

- [METR](https://metr.org/)

- [Apollo Research scheming evals](https://www.apolloresearch.ai/research/scheming-reasoning-evaluations)

- [Anthropic Sabotage Evaluations](https://www.anthropic.com/research/sabotage-evaluations)

- [UK AISI](https://www.aisi.gov.uk/)

- [US AISI / NIST AISIC](https://www.nist.gov/aisi)

- [International AI Safety Report 2025 (Bengio chair)](https://www.gov.uk/government/publications/international-ai-safety-report-2025)

- [Greshake et al., "Indirect Prompt Injection"](https://arxiv.org/abs/2302.12173)

- [Zou et al., "Universal and Transferable Adversarial Attacks on Aligned Language Models (GCG)"](https://arxiv.org/abs/2307.15043)

- [Chao et al., "PAIR"](https://arxiv.org/abs/2310.08419)

- [HarmBench (CAIS)](https://www.harmbench.org/)

- [SmoothLLM](https://arxiv.org/abs/2310.03684)

- [NVIDIA NeMo Guardrails](https://github.com/NVIDIA/NeMo-Guardrails)

- [HuggingFace safetensors](https://github.com/huggingface/safetensors)

- [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)

- [UK AISI Inspect](https://github.com/UKGovernmentBEIS/inspect_ai)

- [EU AI Act (consolidated text)](https://artificialintelligenceact.eu/)

- [Korean AI Basic Act news](https://www.korea.kr/news/policyNewsView.do?newsId=148937548)

- [Japan METI AI Guidelines](https://www.meti.go.jp/english/policy/mono_info_service/ai_society_principles.html)