Skip to content

필사 모드: AI Safety, Evals and Red-Teaming in 2026 — Deep Dive into Inspect AI, Garak, PyRIT, Promptfoo, OpenAI Evals, lm-eval-harness

English
0%
정확도 0%
💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.
원문 렌더가 준비되기 전까지 텍스트 가이드로 표시합니다.

Prologue — In 2026, "safety" does not collapse to one tool

May 2026. Model labs ship a new model every month. App companies wrap them into agents. Every week someone posts a fresh jailbreak on social media; every month some country's AI Safety Institute drops a new evaluation report. Safety has stopped being "we did RLHF, we are good." Safety is now a system where **evaluation, red-teaming, policy, and standards** run in parallel.

The problem: these four areas use different tools, and every company, government, and academic group calls them by different names. What is Inspect AI and how is it different from Garak? Why did Microsoft build PyRIT, and who actually uses it? Is Promptfoo academic or production? Are OpenAI Evals and lm-evaluation-harness the same thing? How is UK AISI different from Korea AISI?

This post draws the 2026 AI safety, evaluation, and red-teaming ecosystem on one page. It is not a deep dive into any single tool; it is a map of **who built each tool, what problem it solves, and where it fits**. So that when your team has a one-line "safety infra" conversation, you can decide what to deploy and what to skip.

1. The 2026 AI safety map — four domains: eval, red-team, policy, standards

The big picture first. The 2026 AI safety ecosystem splits into four domains. Different people build them and different people use them.

| Domain | What it does | Representative tools and bodies | Who uses it |

| --- | --- | --- | --- |

| **Eval** | Measures model / agent capability and safety | Inspect AI, lm-eval-harness, OpenAI Evals, Promptfoo, DeepEval | Model labs, app companies, AISIs |

| **Red-team** | Tries to adversarially break a model or app | Garak, PyRIT, Pliny DB, MITRE ATLAS | Security teams, AISIs, red teams |

| **Policy** | Company-internal decisions about what to ship | OpenAI Preparedness Framework, Anthropic RSP | Frontier lab governance |

| **Standards** | Industry / government common ground | OWASP LLM Top 10, NIST AI RMF, ISO 42001 | Security, compliance, governments |

These four depend on each other. **Eval automates the results of red-teaming** — when a new jailbreak is discovered, it gets folded into the eval suite. **Policy uses eval scores as thresholds** — "if score in the CBRN category is below X, hold the release." **Standards give eval and red-team the common vocabulary** of what to measure.

This post follows that order: eval (chapters 2 to 9), red-team (chapters 3 and 4), policy (chapter 11), standards (chapter 12), then regional variations (chapter 13) and who should pick what (chapter 14).

One-line insight: **"safety" does not collapse to one tool — it is the sum of four domains.** And as of 2026, very few teams are looking at all four. Most look at one or two.

2. Inspect AI (Anthropic, May 2024) — the eval framework UK AISI uses

**One-liner**: An LLM evaluation framework open-sourced by Anthropic in May 2024. The UK AI Safety Institute adopted it as its core evaluation tool, and by 2025–2026 it had effectively become "the standard for government-grade evaluation."

**Why it exists**: Before 2024, most LLM evaluation code was throwaway scripts. Different APIs per model, ad-hoc handling of non-determinism, tool-use evaluation, multi-turn evaluation, judge models — re-implemented every time. Inspect AI consolidates this into one framework. Evaluations are expressed as **Dataset, Solver, Scorer**.

**Three core concepts**:

- **Task**: One unit of evaluation. Combines a dataset, a solver (how to solve), and a scorer (how to grade).

- **Solver**: The strategy the model uses to answer. Plain generation, chain-of-thought, tool-use, multi-turn agent.

- **Scorer**: How to grade. Exact match, includes, model-graded (another model judges), custom.

from inspect_ai import Task, task

from inspect_ai.dataset import example_dataset

from inspect_ai.scorer import includes

from inspect_ai.solver import generate

@task

def theory_of_mind():

return Task(

dataset=example_dataset("theory_of_mind"),

solver=generate(),

scorer=includes(),

)

This one file is one evaluation. You run it with `inspect eval theory_of_mind.py --model anthropic/claude-sonnet-4-5`.

**UK AISI's adoption** was decisive. AISI receives pre-release models from labs and runs government-grade evaluations. That evaluation infrastructure is Inspect AI. This effectively signaled "a government-endorsed evaluation standard," and from 2025 US AISI, Japan AISI, and Korea AISI also added Inspect-based evaluation as a standard option.

**When to use it**:

- Pre-release capability and safety evals at model labs.

- External evaluation by AISI-class government bodies.

- Academic benchmark releases (increasingly shipped Inspect-compatible).

- App teams running their own model-to-model comparisons.

**When not to use it**: Quick prompt A/B tests (Promptfoo is lighter). Production observability (Phoenix or Langfuse fits better).

3. Garak (NVIDIA, then independent) — the LLM vulnerability scanner

**One-liner**: nmap or OWASP ZAP for LLMs. Fires a battery of known attack scenarios at a model and reports where it breaks. Started at NVIDIA in 2023, spun out into an independent project in 2024–2025.

**Why it exists**: Leon Derczynski (NVIDIA, later independent) noticed there was no security scanner for LLMs. Web security had ZAP and Burp, networks had nmap, but LLMs had nothing. Garak fills that gap with **probes (attacks), detectors (graders), and reports**.

**Core concepts**:

- **Probe**: One category of attack. Examples: `dan` (DAN-class jailbreaks), `promptinject` (Prompt Injection), `lmrc.SlurUsage` (eliciting slurs), `realtoxicityprompts`, `goodside` (previously known attacks).

- **Detector**: Decides whether the model broke. Keyword matching, model-based judge, toxicity classifier.

- **Generator**: Interface to the target model. OpenAI, Anthropic, HuggingFace, REST, local.

Full scan (takes a while)

garak --model_type openai --model_name gpt-4o-mini

Specific probes only

garak --model_type huggingface --model_name meta-llama/Llama-3-8b \

--probes dan.Dan_11_0,promptinject

Report: garak.<timestamp>.report.html

**What Garak shows**: A matrix of which models break on which attack categories. "This model responds to DAN 11.0 47 percent of the time, to RealToxicityPrompts 12 percent."

**Its 2026 value**:

- Pre-release for model labs — a "minimum baseline" sweep.

- Pre-release for app companies — see where the model you chose breaks in your use case.

- Academic — when a new jailbreak is published, a PR adds it as a Garak probe (so the project doubles as an attack DB).

**Limitation**: Garak is centered on *known* attacks. Zero-day jailbreaks are out of reach. That is why it pairs well with frameworks like PyRIT that *generate* attacks.

4. PyRIT (Microsoft) — the Python Risk Identification Toolkit

**One-liner**: An adversarial evaluation framework open-sourced by Microsoft AI Red Team in 2024. If Garak is "catalog of known attacks," PyRIT is "orchestration that generates attacks automatically."

**Why it is different (Garak vs PyRIT)**:

| Dimension | Garak | PyRIT |

| --- | --- | --- |

| Focus | Scanning known attacks | Generating adversarial attacks |

| Analogy | nmap, OWASP ZAP | Metasploit |

| Attack flow | Fire predefined probes | LLM attacks LLM (automatic variation) |

| Learning curve | Low (one CLI line) | Medium (orchestrator code) |

| Built by | NVIDIA then independent | Microsoft AI Red Team |

**Core concepts**:

- **Orchestrator**: Assembly of an attack flow. Single-turn (one shot), Multi-turn (jailbreak through dialog), Red Teaming (an attacker model trying to break a target model).

- **Converter**: Transforms the same intent into different surface forms — base64 encoding, ROT13, translation, paraphrase.

- **Scorer**: Judges whether a response violated policy. Azure Content Safety, self-ask, regex.

- **Target**: The attack target. OpenAI, Azure OpenAI, HuggingFace, custom endpoints.

Pseudo-code, Red Teaming Orchestrator

from pyrit.orchestrator import RedTeamingOrchestrator

from pyrit.score import SelfAskTrueFalseScorer

orchestrator = RedTeamingOrchestrator(

attack_strategy=...,

chat_target=victim_model,

red_teaming_chat=attacker_model,

scorer=SelfAskTrueFalseScorer(...),

)

await orchestrator.apply_attack_strategy_until_completion_async(max_turns=5)

The core idea is **one LLM tries to break another LLM, automatically**. Instead of a human writing each new jail case, an attacker LLM keeps the conversation going and tries variations.

**Who uses it**: Microsoft AI Red Team itself, plus Microsoft customers' security teams. Since 2025, some external evaluation teams at OpenAI and Anthropic have adopted parts of it. Reproducing new jailbreak papers via PyRIT orchestrators has become a common pattern in academia.

**Do you need both Garak and PyRIT?** Yes. Garak for daily regression scans (fast, fixed catalog); PyRIT for quarterly deep red-teams (slower, surfaces new patterns).

5. Promptfoo (Y Combinator) — the de facto standard for prompt testing

**One-liner**: An OSS CLI that compares prompts, models, and chains the way you would unit-test code. Went through Y Combinator and became the fastest-growing tool of 2024–2025. The entry point for the JS/TS ecosystem.

**Why it exists**: For app companies, the most frequent questions are simple: "which model should we use?" and "is this prompt better than the previous one?" Inspect AI is too heavy and OpenAI Evals is too academic. Promptfoo created the flow of **one YAML file for cases, one CLI for the comparison, a web UI for results**.

promptfooconfig.yaml

prompts:

- "Translate to French: {{text}}"

- file://prompts/translate_v2.txt

providers:

- openai:gpt-4o-mini

- anthropic:claude-haiku-4-5

tests:

- vars:

text: "Hello world"

assert:

- type: contains

value: "Bonjour"

- type: llm-rubric

value: "Translation is natural French, not literal."

One `promptfoo eval` runs the model-by-prompt matrix; `promptfoo view` opens the web UI for diffing.

**Why Promptfoo took off**:

- Low entry barrier (YAML plus CLI).

- Rich built-in assertions (contains, regex, llm-rubric, javascript, similar, factuality, latency, cost).

- Has a red-team mode (`promptfoo redteam`, not as deep as Garak or PyRIT but covers parts of OWASP LLM Top 10).

- CI integration is easy (GitHub Action, GitLab).

- Dataset import is easy (CSV, JSONL, HuggingFace).

**When to use it**:

- Prompt A/B tests (the most common use case).

- Model comparison (same input across providers).

- CI regression checks (fail if the new prompt scores lower).

- Light red-teaming (an entry point to OWASP LLM Top 10).

**When not to use it**: Government-grade evaluation (Inspect AI). Academic benchmarks (lm-eval-harness). Production observability (Phoenix).

6. OpenAI Evals — the original open-source eval framework

**One-liner**: The evaluation framework OpenAI open-sourced in March 2023. Carries the "what OpenAI uses to evaluate its own models" halo. Activity tapered off in 2024–2026, but it is still the reference for datasets and patterns.

**Why it matters historically**: Before 2023 there was no common pattern for "how should evaluation code look?" OpenAI Evals codified it as "Eval class plus JSONL data plus sampler." This pattern influenced every framework that followed.

**Core structure**:

- **Eval class**: A Python class implementing one evaluation.

- **JSONL dataset**: One sample per line (input, ideal, etc).

- **Sampler**: Interface to the model.

- **Registry**: YAML metadata for evals.

evals/registry/evals/my-eval.yaml

my-eval:

id: my-eval.dev.v0

description: My custom eval

metrics: [accuracy]

my-eval.dev.v0:

class: evals.elsuite.basic.match:Match

args:

samples_jsonl: my_eval/samples.jsonl

Run with `oaieval gpt-4o-mini my-eval`.

**Its 2026 position**:

- New projects rarely pick OpenAI Evals (Inspect AI and Promptfoo are more active).

- But it bundles hundreds of reference eval datasets and is frequently cited as ground truth for new evaluations.

- OpenAI itself no longer uses Evals as standalone infrastructure (presumably moved to internal tooling).

- For compatibility, Inspect AI advertises that it can import OpenAI Evals datasets.

**When to use it**: When you need reference datasets, or want to reuse existing OpenAI Evals assets.

7. lm-evaluation-harness (EleutherAI) — the academic standard

**One-liner**: EleutherAI's academic LLM benchmark runner. The backend behind the HuggingFace leaderboard. Almost every benchmark number in a new model paper came out of this tool.

**Why it became the academic standard**: Around 2022, every new LLM came with a different evaluation script from its authors. Same model, same benchmark, different scores in different papers — because of scoring method, few-shot count, prompt format. lm-evaluation-harness standardized that: "only numbers produced by the same code are comparable."

**Highlights**:

- 200+ built-in benchmarks (MMLU, HellaSwag, GPQA, ARC, TruthfulQA, GSM8K, HumanEval-derived, BBH, ...).

- Backends include HuggingFace `transformers`, `vllm`, OpenAI/Anthropic APIs, llama.cpp.

- Official backend for the HuggingFace Open LLM Leaderboard.

MMLU 5-shot

lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.1-8B \

--tasks mmlu --num_fewshot 5 --device cuda --batch_size 8

Several benchmarks at once

lm_eval --model hf --model_args pretrained=... \

--tasks mmlu,gpqa,hellaswag,arc_challenge --num_fewshot 5

**Relationship to Inspect AI**: There is overlap. lm-eval-harness is strong on "standard academic benchmarks via a standard runner"; Inspect AI is strong on "custom evaluation, agents, multi-turn." As of 2026 both are alive, and model labs and AISIs typically use both, for different workflows.

**When to use it**:

- Benchmark tables in a new model paper.

- Submitting to the HuggingFace leaderboard.

- "Compare against another model under identical conditions."

8. MLflow Evals / Arize Phoenix / DeepEval / Giskard — the operational four

If the tools above are "evaluation infrastructure," these four sit on the boundary between **evaluation and observability**.

MLflow Evals (Databricks)

LLM evaluation module added to MLflow in 2023. Fits into the flow of "compare this model experiment with that model experiment" inside an ML lifecycle. Default option for Databricks customers. Built-in metrics expanded for LLM use cases (`mlflow.evaluate(model_type="question-answering")`).

**When to use it**: ML teams already on MLflow. Managing classical ML and LLMs in one lifecycle.

Arize Phoenix (open source)

OpenTelemetry-based LLM observability plus evaluation. Receives traces, visualizes them, and runs evaluation on top. **Tracing first, evaluation layered on**. Self-hosted OSS and SaaS (Arize) both exist.

**When to use it**: When you want to evaluate based on real production traffic. RAG debugging ("why was this retrieval wrong?").

DeepEval (Confident AI)

"pytest for LLM evals." Define LLM evaluations as Python unit-test decorators. Rich built-in metrics (GEval, AnswerRelevancy, Faithfulness, ContextualPrecision/Recall, Hallucination, ToxicityMetric, BiasMetric).

Pseudo-code

@pytest.mark.eval

def test_answer_quality():

metric = AnswerRelevancyMetric(threshold=0.7)

test_case = LLMTestCase(input="...", actual_output="...")

assert_test(test_case, [metric])

**When to use it**: Python developers comfortable with pytest who want LLM evals in the same flow.

Giskard (open source)

Strong on **bias, robustness, and drift** for ML models (both classical ML and LLMs). Automatically discovers issues (e.g., accuracy gaps across demographic groups).

**When to use it**: Regulated industries (finance, healthcare). When fairness and bias reports are required.

**Summary table**:

| Tool | Focus | Who uses it |

| --- | --- | --- |

| MLflow Evals | ML lifecycle integration | Databricks customers |

| Phoenix | Tracing plus eval | RAG and agent observability teams |

| DeepEval | Pytest-style | Python developers |

| Giskard | Bias and drift | Regulated industries |

9. The benchmark battery — HumanEval / MMLU / GPQA / SWE-Bench / BigCodeBench

Not tools, but **datasets**. The standard battery that shows up in every new model release table as of 2026.

HumanEval (OpenAI, 2021)

164 Python function-writing problems. The model sees a docstring and fills in the body. Graded by unit-test pass. The oldest "code generation" benchmark. As of 2026 it is essentially saturated (top models 95%+) but still appears as a one-line standard.

MMLU (Massive Multitask Language Understanding, 2021)

About 16,000 multiple-choice questions across 57 academic subjects. Four-way choice. The standard for "broad knowledge." Top models are now 90%+, so it discriminates less; teams supplement it with **MMLU-Pro** (2024, harder) and **HLE** (Humanity's Last Exam, 2025).

GPQA (Google-Proof Q&A, 2023)

About 448 graduate-level science questions. Even experts score about 65 percent. Designed so a Google search cannot solve it. The standard for "really hard reasoning." Scores climbed quickly after the rise of reasoning models (the o-series, Claude's extended thinking).

SWE-Bench / SWE-Bench Verified (Princeton, 2024)

Tasks that ask the model to make a PR from a real GitHub issue. Graded by whether the PR passes the unit tests it should. Verified is a 500-task subset that OpenAI had humans review for solvability and test correctness. **The flagship agent benchmark.**

BigCodeBench (BigCode, 2024)

More realistic code generation than HumanEval — 1,140 problems involving standard library and third-party library calls (NumPy, Pandas, etc). The successor to HumanEval after saturation.

**Others you will see often**:

- **GSM8K / MATH**: math.

- **BBH (Big-Bench Hard)**: hard subset of BIG-bench.

- **ARC / ARC-AGI**: abstract reasoning.

- **WebArena / VisualWebArena / OSWorld**: agents driving browsers and operating systems.

- **τ-bench (tau-bench)**: multi-turn customer-service agent.

Key point: never look at one benchmark in isolation. Look at a battery — **MMLU/GPQA for knowledge, HumanEval/BigCodeBench/SWE-Bench for code, MATH for math, τ-bench/SWE-bench for agents** — and weight the ones closest to your use case more heavily.

10. AI Safety Institutes — UK, US, Japan, Korea, Singapore, France

After the UK established the first AI Safety Institute (UK AISI) in late 2023, most major countries set up sibling institutions during 2025–2026.

UK AISI (announced November 2023, operating in 2024)

The world's first AISI. Under the UK government. Centered on **pre-release model evaluation** (receiving models before deployment for risk assessment) and **collaborative research**. The primary user and de facto sponsor of the Inspect AI framework. Its 2024–2025 reports became standard academic references.

US AISI (established under NIST in 2024)

Set up inside NIST (National Institute of Standards and Technology). Handles US-government AI safety evaluation and standards development. Doubles as the operating body of the NIST AI Risk Management Framework (AI RMF).

Japan AISI (AI Safety Institute of Japan, 2024 under IPA)

Set up inside IPA (Information-technology Promotion Agency). Provides Japanese-government AI safety evaluation and policy advice. Publishes guidelines and evaluation methodology documents.

Korea AISI (announced 2024)

Built around ETRI (Electronics and Telecommunications Research Institute). Conducts government-level AI safety evaluation and standardization activities. Cooperates with academia including KAIST and KISTI.

Singapore AI Verify Foundation (2023)

Not strictly named "AISI" but functions as a peer institution. Has open-sourced AI Verify (an evaluation toolkit) and Project Moonshot (an LLM red-team toolkit).

France AI Safety Office (under Inria, 2024)

Located inside Inria, France's national research lab. Cooperates with the EU AI Office at the European level.

**Common traits**:

- Government-funded.

- Pre-release evaluation (MoUs with major model labs).

- Joint evaluation methodology development (international cooperation — AISI Network).

- Public outputs (published reports).

**Why this matters**: As of 2026, **the common vocabulary for "what is dangerous" is coming out of AISIs**. Categories like CBRN (chemical, biological, radiological, nuclear), cyber, autonomy, and model deception are being standardized in AISI evaluation reports.

11. OpenAI Preparedness Framework plus Anthropic RSP

Two model labs' **internal safety policy documents**. They publicly state the company's internal decisions about "what level of risk is acceptable to release."

OpenAI Preparedness Framework (announced December 2023, revised since)

OpenAI's own definition of **risk thresholds by category**:

- Categories: Cybersecurity, CBRN, Persuasion, Model Autonomy.

- Each category rates model risk as Low / Medium / High / Critical.

- Above a threshold, release is held or mitigations are required.

- A separate "Preparedness Team" organization; a "Safety Advisory Group" makes decisions.

Anthropic Responsible Scaling Policy (RSP, announced September 2023, revised since)

Anthropic's counterpart document. The **AI Safety Level (ASL)** concept:

- ASL-1: clearly no risk (small models, game AI).

- ASL-2: today's frontier models — minor risk signals.

- ASL-3: substantial risk in autonomy, cyber, CBRN.

- ASL-4+: defined later.

For each level, **deployment standards and security standards** are defined. ASL-3 and above require stronger weight security and stricter deployment validation.

Common pattern

- "Define risk categories, evaluate, set thresholds, gate decisions."

- **Evaluation is the input** to these decisions — which is why Inspect AI and red-team tools become the infrastructure for the policy gate.

- After 2025, Google DeepMind published a similar Frontier Safety Framework, making this a de facto frontier-lab pattern.

**Voluntary commitments vs legal duty**: Both documents are voluntary corporate commitments. Legal force entered partly via EU AI Act's GPAI (General Purpose AI) obligations in 2025, and US state-level legislation (California SB 53 transparency act) added more.

12. MITRE ATLAS plus OWASP LLM Top 10

The two pillars of industry standards.

MITRE ATLAS (Adversarial Threat Landscape for AI Systems)

A matrix that extends MITRE's ATT&CK (the standard taxonomy for traditional cyber attacks) to attacks on AI systems. First published around 2020, with LLM and GenAI tactics added heavily in 2024–2026.

**Structure**:

- **Tactics**: Reconnaissance, Resource Development, Initial Access, ML Model Access, Execution, Persistence, Defense Evasion, Discovery, Collection, Exfiltration, Impact.

- **Techniques**: Specific attacks under each tactic — "Jailbreak via Multi-Turn Coaxing," "Prompt Injection," "Model Theft," "Data Poisoning."

- **Case Studies**: Real-world attack incidents.

**Who uses it**: Security teams as a checklist for "what attacks could apply to our AI system?" Vocabulary for classifying red-team report findings.

OWASP LLM Top 10 (2023 v1 then 2025 v2)

OWASP's list of the ten most common security risks in LLM applications. The 2025 v2 items:

1. Prompt Injection

2. Insecure Output Handling

3. Training Data Poisoning

4. Model Denial of Service

5. Supply Chain Vulnerabilities

6. Sensitive Information Disclosure

7. Insecure Plugin/Tool Design

8. Excessive Agency

9. Overreliance

10. Model Theft

**Why it matters**: The most familiar vocabulary for app developers. "Our app addresses the OWASP LLM Top 10" has become a one-line compliance checklist.

**Mappings to tooling**:

- Promptfoo redteam mode covers parts of OWASP LLM Top 10 automatically.

- Garak probes map to OWASP categories.

- PyRIT orchestrators simulate OWASP scenarios.

ATLAS vs OWASP LLM Top 10

- ATLAS: broad, tactic-centered, all AI systems (ML, CV, LLM).

- OWASP: narrow, app-centered, LLM only.

- Use both. ATLAS for threat analysis, OWASP for app security checklists.

13. Korea and Japan — KAIST, KISTI, AI Safety Institute of Japan, RIKEN AIP

Korea

- **KAIST AI Safety group**: KAIST's internal AI safety and alignment research. The academic center publishing safety and interpretability papers.

- **KISTI (Korea Institute of Science and Technology Information) Safety**: A government-funded research institute. Strong on data and infrastructure-level safety (dataset governance, infra security).

- **ETRI (Electronics and Telecommunications Research Institute)**: The operating body of Korea AISI. Government-level standards and evaluation development.

- **TTA (Telecommunications Technology Association)**: AI trustworthiness standards.

- Industry side: Naver CLOVA Safety, Kakao safety and ethics team, LG AI Research safety group.

Korea's strengths: **Korean evaluation datasets** (KoBest, KMMLU, Korean toxicity datasets) and safety evaluation for Korea-specific domains (K-pop, law, medicine).

Japan

- **AI Safety Institute of Japan**: Established 2024 under IPA. Government-level safety evaluation and guidelines.

- **RIKEN AIP (Center for Advanced Intelligence Project)**: Japan's largest AI research center. Safety, alignment, and trustworthiness research.

- **NICT (National Institute of Information and Communications Technology)**: Japanese LLMs (JapaneseLM series) plus safety evaluation.

- Industry side: SoftBank, NTT, Rakuten internal safety groups.

Japan's strengths: **Japanese evaluation datasets** (JGLUE, llm-jp-eval), and evaluation for hierarchy, politeness, and culture-specific phenomena.

Common trend

Between 2025 and 2026, both Korea and Japan saw a rapid increase in datasets evaluating **"are domestic LLMs safe in our language and culture relative to English-trained models?"** The pattern: keep English evaluation tooling (Inspect AI, lm-eval-harness), but fill in the dataset layer in your own language.

14. Who should pick what — model release / app integration / governance / academic

The final chapter. Recommended stacks by persona.

A. Model labs (frontier lab safety and evaluation teams)

- **Must**: Inspect AI (custom eval plus AISI compatibility), lm-evaluation-harness (academic benchmark compatibility), PyRIT (red-team automation), internal platform.

- **Recommended**: Garak (external regression scan), OpenAI Evals (reference dataset import), self-written policy document (Anthropic RSP / OpenAI Preparedness as the template).

- **Release gate**: Policy document thresholds (RSP/Preparedness) gated on evaluation (Inspect AI) and red-team (PyRIT) results.

B. App companies (teams using foundation models)

- **Must**: Promptfoo (prompt A/B plus CI regression), one observability tool (Phoenix, Langfuse, or LangSmith).

- **Recommended**: DeepEval (pytest integration), Garak (quarterly external model security check), OWASP LLM Top 10 checklist.

- **Optional**: Inspect AI (when internal benchmarks grow), Giskard (regulated industries).

C. Governance, compliance, security teams

- **Must**: OWASP LLM Top 10 (checklist), MITRE ATLAS (threat analysis), NIST AI RMF / ISO 42001 (governance frameworks).

- **Recommended**: PyRIT (quarterly red-team), Garak (own regression scan separate from outsourced pen test), Giskard (bias reports).

- **Reference**: Your national AISI's evaluation methodology and guidelines.

D. Academia and researchers

- **Must**: lm-evaluation-harness (paper standard), Inspect AI (for releasing new evals).

- **Recommended**: OpenAI Evals (reference), Promptfoo (light experiments).

- **By research area**: For jailbreak papers, reproduce via PyRIT or Garak; for safety classifiers, Atla-class safety classifiers as a baseline.

General principles

- **Separate eval from red-team** operationally (different people build them, different cadences).

- Distinguish **eval in CI** (per PR) from **deep red-team** (quarterly).

- **Add datasets in your own language and domain** — English benchmarks alone do not tell you how the model behaves in your use case.

- **Imitate the policy documents** — the RSP/Preparedness pattern from frontier labs applies to small teams too. Codify "at what category and what level, you hold a release."

**One-line conclusion**: Safety in 2026 does not collapse to one tool. The job of safety infrastructure is to combine the four domains (eval, red-team, policy, standards) at your team's scale. And the starting point of that combination is to know **who built what and what problem they were solving** — may this post serve as that one-page map.

References

- Inspect AI (Anthropic, official): https://inspect.aisi.org.uk/ (hosted by UK AISI)

- Inspect AI GitHub: https://github.com/UKGovernmentBEIS/inspect_ai

- Garak (LLM vulnerability scanner): https://github.com/NVIDIA/garak

- Garak paper, "garak: A Framework for Security Probing Large Language Models," Derczynski et al., arXiv:2406.11036

- PyRIT (Microsoft): https://github.com/Azure/PyRIT

- PyRIT introduction (Microsoft Security blog): https://www.microsoft.com/security/blog/2024/02/22/announcing-microsofts-open-automation-framework-to-red-team-generative-ai-systems/

- Promptfoo: https://www.promptfoo.dev/ , GitHub: https://github.com/promptfoo/promptfoo

- OpenAI Evals: https://github.com/openai/evals

- lm-evaluation-harness (EleutherAI): https://github.com/EleutherAI/lm-evaluation-harness

- HuggingFace Open LLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard

- HumanEval, "Evaluating Large Language Models Trained on Code," Chen et al., arXiv:2107.03374

- MMLU, "Measuring Massive Multitask Language Understanding," Hendrycks et al., arXiv:2009.03300

- GPQA, "GPQA: A Graduate-Level Google-Proof Q&A Benchmark," Rein et al., arXiv:2311.12022

- SWE-bench, "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?", Jimenez et al., arXiv:2310.06770

- SWE-bench Verified (OpenAI announcement): https://openai.com/index/introducing-swe-bench-verified/

- BigCodeBench: https://github.com/bigcode-project/bigcodebench , arXiv:2406.15877

- DeepEval (Confident AI): https://github.com/confident-ai/deepeval

- Arize Phoenix: https://github.com/Arize-ai/phoenix

- MLflow Evals: https://mlflow.org/docs/latest/llms/llm-evaluate/index.html

- Giskard: https://github.com/Giskard-AI/giskard

- UK AI Safety Institute: https://www.aisi.gov.uk/

- US AI Safety Institute (NIST): https://www.nist.gov/aisi

- AI Safety Institute of Japan: https://aisi.go.jp/

- Singapore AI Verify Foundation, Project Moonshot: https://aiverifyfoundation.sg/

- OpenAI Preparedness Framework: https://openai.com/safety/preparedness

- Anthropic Responsible Scaling Policy: https://www.anthropic.com/news/anthropics-responsible-scaling-policy

- Google DeepMind Frontier Safety Framework: https://deepmind.google/discover/blog/introducing-the-frontier-safety-framework/

- MITRE ATLAS: https://atlas.mitre.org/

- OWASP Top 10 for LLM Applications: https://owasp.org/www-project-top-10-for-large-language-model-applications/

- NIST AI Risk Management Framework: https://www.nist.gov/itl/ai-risk-management-framework

- ISO/IEC 42001 (AI management systems): https://www.iso.org/standard/81230.html

- EU AI Act: https://artificialintelligenceact.eu/

- KMMLU (Korean MMLU): arXiv:2402.11548

- JGLUE (Japanese GLUE): https://github.com/yahoojapan/JGLUE

- RIKEN AIP: https://aip.riken.jp/

현재 단락 (1/299)

May 2026. Model labs ship a new model every month. App companies wrap them into agents. Every week s...

작성 글자: 0원문 글자: 26,936작성 단락: 0/299