
LLM Hallucination Fully Dissected: Why AI Lies and How to Stop It

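Introduction
What Exactly Is Hallucination?
- The 4 Types of Hallucination
Why Does Hallucination Happen? Technical Causes
- Three Specific Technical Causes
5 Mitigation Strategies
Hallucination Metrics
When Hallucination Cannot Be Fully Prevented
Recommended Production Settings
Closing Thoughts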

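Introduction

Any developer who has shipped an LLM to production has probably seen this at least once. A user asks, "What is the refund policy for our product?" and the chatbot confidently answers with a completely wrong policy. This is **hallucination**.

Hallucination is not a bug in LLMs. It is an inevitable consequence of how they are designed. This article digs into the technical causes of hallucination and presents five strategies that work in practice, with code.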

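What Exactly Is Hallucination?

Hallucination is not a single phenomenon. You have to distinguish the type before you can choose the right fix.

The 4 Types of Hallucination

1. Factual Hallucination The model generates facts that do not exist or are wrong.

Example: "The Eiffel Tower is in London"
Example: "Python was created by Guido van Rossum in 1995" (it was actually 1991)

2. Confabulation The model produces details that sound plausible but are entirely made up.

Example: citing a paper that does not exist ("According to Smith et al., 2023...")
Example: convincingly suggesting an API method name that does not actually exist

3. Attribution Hallucination The information is real, but the source is wrong.

Example: claiming that B said something A said
Example: citing an accurate statistic as coming from the wrong organization

4. Temporal Hallucination The model presents outdated information as if it were current.

Example: calling a model "the latest" when newer models were released after the training cutoff
Example: writing code based on API documentation that has already been deprecated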

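Why Does Hallucination Happen? Technical Causes

How an LLM works at its core:

Input tokens → [Transformer layers] → probability distribution over the next token → sampling

Example:
"Paris is France's" → ["capital": 0.92, "city": 0.05, "river": 0.02, ...]
                    → select "capital"

The core problem is simple: an LLM does not judge "is this true?". It only predicts "which token is most likely to come next?".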

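Three Specific Technical Causes

Cause 1: Confidence and accuracy are decoupled

An LLM's probability distribution has no notion of "I don't know". The model must always predict something. Even when it gets a question that is not covered by its training data, it fills in the blank with the most plausible pattern instead of saying it does not know.

Question: "Who won the 2024 Nobel Prize in Physics?"
(assuming a training data cutoff in 2023)

Inside the model:
- "Nobel Prize in Physics" + "winner" + "2024" → pattern matching
- It has learned the patterns that follow the names of earlier Nobel laureates
- → It confidently generates a plausible name

Cause 2: Learning errors from the training data

The internet is full of wrong information. An LLM learns correct and incorrect information alike without telling them apart. The more often a wrong claim appears, the more it is reinforced as a "plausible" pattern.

Cause 3: Degraded attention in long contexts

The longer the context, the less reliably the model uses information placed in the middle of it. This is known as the "Lost in the Middle" problem. Key information is best placed at the beginning or the end of the context.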

5 Mitigation Strategies

Strategy 1: RAG, the Most Effective Method

RAG (Retrieval-Augmented Generation) is the best-validated way to reduce hallucination. It anchors the model's response to retrieved facts.

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
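from langchain_core.prompts import ChatPromptTemplate

# The key: state explicitly that nothing outside the context may be asserted
SYSTEM_PROMPT = """You are an assistant that answers only from the provided context.

Rules:
1. Never make up information that is not in the context
2. If you do not know the answer, say "The provided documents do not contain this information"
3. State the source that supports your answer

Context:
{context}
"""

def rag_query(question: str, vectorstore) -> dict:
    # Retrieve relevant documents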
    docs = vectorstore.similarity_search(question, k=4)
    context = "\n\n---\n\n".join([doc.page_content for doc in docs])

    prompt = ChatPromptTemplate.from_messages([
        ("system", SYSTEM_PROMPT),
        ("human", "{question}")
    ])

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.1)
    chain = prompt | llm

    response = chain.invoke({
        "context": context,
        "question": question
    })

    return {
        "answer": response.content,
        "sources": [doc.metadata.get("source", "unknown") for doc in docs]
    }

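Effect of RAG: some studies report that it reduces hallucination rates by 60-80% on domain-specific questions.

Strategy 2: Self-Critique Pipeline

Ask the model to review its own answer. The same model plays both the "answerer" and the "reviewer" role.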

def self_critique_pipeline(question: str, llm) -> str:
    """Reduce hallucination with a two-step self-critique"""

    # Step 1: generate the initial answer
    initial_response = llm.invoke(
        f"Please answer the following question: {question}"
    )
    initial_answer = initial_response.content

    # Step 2: self-review
    critique_prompt = f"""Please review the following question and answer.

Question: {question}
Answer: {initial_answer}

Check the following:
1. Is it factually accurate?
2. Is any of the information uncertain?
3. Are there claims that might be wrong?

Mark uncertain parts explicitly and, if needed, provide a corrected answer.
"""

    critique_response = llm.invoke(critique_prompt)
    return critique_response.content

# Example usage
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o", temperature=0.3)
result = self_critique_pipeline("How does Python's GIL affect multithreading?", llm)

Strategy 3: Chain of Verification

This method was proposed in the paper by Dhuliawala et al. (2023). It extracts the verifiable facts from an answer and verifies each one independently.

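def chain_of_verification(question: str, llm) -> dict:
    """
    1. Generate an initial answer
    2. Generate verification questions
    3. Answer each verification question independently
    4. Revise the final answer using the verification results
    """

    # Step 1: initial answer
    initial = llm.invoke(question).content

    # Step 2: generate verification questions
    verification_prompt = f"""Extract the claims in the following answer that need fact-checking,
and write an independent question that can verify each claim.

Answer: {initial}

Format: one verification question per line"""

    verification_questions_raw = llm.invoke(verification_prompt).content
    questions = [q.strip() for q in verification_questions_raw.split('\n') if q.strip()]

    # Step 3: answer each verification question independently
    verifications = {}
    for vq in questions[:5]:  # at most 5
        answer = llm.invoke(
            f"Answer the following question concisely: {vq}"
        ).content
        verifications[vq] = answer

    # Step 4: revise the final answer
    # Build the Q/A text first: a backslash inside an f-string expression
    # is a SyntaxError before Python 3.12.
    verification_text = "\n".join(
        f"Q: {q}\nA: {a}" for q, a in verifications.items()
    )
    correction_prompt = f"""Original question: {question}
Initial answer: {initial}

Verification results:
{verification_text}

Write a final answer that reflects the verification results and improves accuracy.
Express uncertain content in a hedged form such as "is reported to be"."""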

    final_answer = llm.invoke(correction_prompt).content

    return {
        "initial_answer": initial,
        "verifications": verifications,
        "final_answer": final_answer
    }

Strategy 4: Adjusting Temperature and Sampling

from openai import OpenAI
client = OpenAI()

def factual_query(prompt: str) -> str:
    """Conservative settings for fact-based queries"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,    # Low temperature = more conservative, predictable output
        top_p=0.9,          # Sample only from the top 90% of probability mass
        presence_penalty=0.0,   # No presence penalty (no push toward new topics)
        frequency_penalty=0.0   # No frequency penalty (repeating facts is fine)
    )
    return response.choices[0].message.content

def creative_query(prompt: str) -> str:
    """Settings for creative tasks"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,    # Allow more creativity
        top_p=0.95
    )
    return response.choices[0].message.content

# Pick the function that fits the task
factual_result = factual_query("Explain the difference between HTTP and HTTPS")
creative_result = creative_query("Imagine and describe a day in a future transformed by AI")

The lower the temperature, the more the model favors "safe" high-probability tokens. For fact-based tasks, 0.0 to 0.3 is a suitable range.

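Strategy 5: Forced Source Citation

Forcing the model to cite a source for every claim makes hallucinations easier to spot.

CITATION_PROMPT = """Rules you must follow when answering the question below:

1. Mark every factual claim with its source in the form [Source: X]
2. Mark claims whose source you do not know as [Source: unclear]
3. Mark content based on your own reasoning as [Inference]

Example:
"Python was released in 1991 [Source: official Python documentation].
It is currently one of the most widely used programming languages [Source: Stack Overflow Developer Survey 2023].
It will keep its dominant position in AI/ML going forward [Inference]."

Question: {question}
"""

def cited_response(question: str, llm) -> str:
    prompt = CITATION_PROMPT.format(question=question)
    response = llm.invoke(prompt)
    return response.content

The advantage of this approach: users can see a [Source: unclear] tag and do their own follow-up verification.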

Hallucination Metrics

Metrics you can measure in code:

RAGAS Faithfulness Score (for RAG systems)

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Measures how faithful the response is to the context
# 0.0 (not faithful at all) to 1.0 (fully faithful)
results = evaluate(
    dataset=test_dataset,
    metrics=[faithfulness, answer_relevancy]
)
print(f"Faithfulness: {results['faithfulness']:.3f}")
# 0.85 or higher indicates a healthy RAG system

TruthfulQA: a benchmark made up of 817 human-written questions. GPT-4 reaches about 59% accuracy and humans 94%.

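When Hallucination Cannot Be Fully Prevented

To be honest: it is impossible to prevent every hallucination. And in some cases you should not try.

Cases where hallucination is actually useful:

Creative writing: fiction, poetry, and marketing copy need the ability to "make up something new"
Brainstorming: connecting ideas that do not exist yet is where the value lies
Hypothetical scenarios: "what if X?" style questions

Risk-based approach:

Use case	Hallucination risk	Recommended strategy
Medical information	Very high	RAG + verification + mandatory "consult a specialist" notice
Legal advice	Very high	Never use standalone
Code generation	Medium	Verify by automatically running test code
Summarization/translation	Low	Lower temperature + provide sources
Creative writing	Not applicable	No restrictions needed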

Recommended Production Settings

class HallucinationSafetyConfig:
    """Settings that minimize hallucination in production"""

    # Fact-based tasks
    FACTUAL = {
        "temperature": 0.1,
        "system_prompt_suffix": "\n\nImportant: if you are not sure about a piece of information, say 'this needs verification'.",
        "use_rag": True,
        "self_critique": True
    }

    # General conversation
    CONVERSATIONAL = {
        "temperature": 0.7,
        "system_prompt_suffix": "\n\nIf you do not know a fact, say so honestly.",
        "use_rag": False,
        "self_critique": False
    }

    # Code generation
    CODE = {
        "temperature": 0.2,
        "system_prompt_suffix": "\n\nDo not invent functions or libraries that do not exist.",
        "use_rag": True,  # Documentation-based RAG
        "self_critique": True
    }

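Closing Thoughts

Hallucination is not a defect of LLMs but an intrinsic property of probabilistic language models. The model does not know "is this true". It only knows the probability of the next token.

Still, the right architecture and prompt design can reduce hallucination substantially:

RAG: anchor responses to retrieved facts (most effective)
Self-critique: the model reviews its own answer
Chain of Verification: independent verification per claim
Low temperature: conservative output on fact-based tasks
Forced citation: makes claims verifiable

What matters is to assess the risk of your use case accurately and combine strategies to match it. In domains such as medicine or law, where errors can be critical, an LLM must not be used on its own.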

LLM Hallucination: Why AI Makes Things Up and 5 Strategies to Prevent It

Introduction
What Exactly Is Hallucination?
- The 4 Types of Hallucination
Why Does Hallucination Happen? The Technical Cause
- Three Specific Technical Root Causes
5 Prevention Strategies
Measuring Hallucination
- RAGAS Faithfulness (for RAG systems)
- TruthfulQA Benchmark
When You Can't (and Shouldn't) Prevent Hallucination
Production-Ready Configuration
Conclusion

Introduction

If you've deployed an LLM in production, you've encountered this: a user asks a straightforward question and the model responds with complete confidence — and complete inaccuracy. A chatbot invents a return policy that doesn't exist. A coding assistant suggests an API method that was never part of any library. A research assistant cites a paper that was never published.

This is hallucination. And it's not a bug — it's a fundamental consequence of how LLMs work. This guide breaks down the technical causes and gives you five practical strategies to fight back, with real code you can deploy today.

What Exactly Is Hallucination?

Hallucination isn't a single phenomenon. Identifying the type determines the correct fix.

The 4 Types of Hallucination

1. Factual Hallucination The model generates outright false facts with apparent confidence.

"The Eiffel Tower is located in London"
"Python was created by Guido van Rossum in 1995" (it was 1991)

2. Confabulation Plausible-sounding but entirely fabricated details — the model fills gaps with invented specifics.

Citing a paper that doesn't exist: "According to Smith et al., 2023..."
Suggesting a library method or function that has never existed

3. Attribution Hallucination Real information, wrong source.

Attributing a quote to the wrong person
Citing accurate statistics but crediting the wrong organization

4. Temporal Hallucination Outdated information presented as current fact.

Calling a model "the latest" when it was superseded after the training cutoff
Writing code against a deprecated API because that was in the training data

Why Does Hallucination Happen? The Technical Cause

How an LLM works at its core:

Input tokens → [Transformer layers] → probability distribution over next token → sample

Example:
"Paris is the ___" → {"capital": 0.91, "city": 0.06, "heart": 0.02, ...}
                   → select "capital"

The fundamental issue: an LLM does not reason about truth. It predicts the statistically most plausible next token. There is no "I don't know" state in the probability distribution — the model must always predict something.

When asked about information not in its training data, the model doesn't refuse. It pattern-matches to the closest thing it learned and fills in the blank — confidently.
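The "must always predict something" point can be made concrete with a minimal sketch. The logits below are made-up numbers standing in for a prompt the model has no real knowledge about; the `softmax` helper is written here for illustration and is not taken from any particular library.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that always sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# Hypothetical, nearly flat logits: the model has no strong preference
logits = {"Smith": 1.2, "Tanaka": 1.1, "Garcia": 1.0, "Novak": 0.9}
probs = softmax(logits)

# Greedy decoding still picks a winner, even though no option stands out
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

Even with almost no signal, the distribution is normalized over the vocabulary and decoding emits a name. Nothing in this step represents abstaining.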

Three Specific Technical Root Causes

Cause 1: Confidence and accuracy are decoupled

A high-probability token selection doesn't mean the output is factually correct. The model is confident that a token is a likely continuation — not that the statement is true. It has no internal flag for "I'm uncertain about this fact."

Cause 2: Training data contains errors

The internet is full of misinformation. LLMs train on it indiscriminately. Frequently repeated errors get reinforced as "plausible" patterns. There's no ground truth filter during pre-training.

Cause 3: Lost in the Middle

Research (Liu et al., 2023) shows that LLMs struggle to accurately recall information from the middle of long contexts. They attend more reliably to information at the beginning and end of the context window. This causes hallucination even when the correct answer was provided — the model just didn't attend to it.
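One common way to act on this finding is to reorder retrieved passages so the strongest ones sit at the edges of the prompt. The sketch below assumes the input list is already sorted best-first; the function name `reorder_for_long_context` is hypothetical.

```python
def reorder_for_long_context(docs_by_relevance):
    """Place the most relevant documents at the start and end of the context.

    The input is ordered best-first. Items are dealt alternately to the
    front and the back, so the weakest ones end up in the middle.
    """
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        if i % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    # Reverse the back half so the second-best document is the very last one
    return front + back[::-1]

ranked = ["doc1", "doc2", "doc3", "doc4", "doc5"]  # doc1 is the most relevant
reordered = reorder_for_long_context(ranked)
print(reordered)
```

Here the best passage opens the context, the second best closes it, and the least relevant passage lands in the middle, where a miss costs the least.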

5 Prevention Strategies

Strategy 1: RAG (Most Effective)

Retrieval-Augmented Generation grounds the model's response in retrieved facts. The system prompt instructs the model to answer only from the retrieved context and to say so when the context does not contain the answer.

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate

# The key: explicitly forbid fabrication
SYSTEM_PROMPT = """You are a helpful assistant that answers questions ONLY based on the provided context.

Rules:
1. Never make up information not in the context
2. If the answer isn't in the context, say "I don't have information about this in the provided documents"
3. Always cite which part of the context supports your answer

Context:
{context}
"""

def rag_query(question: str, vectorstore) -> dict:
    # Retrieve relevant documents
    docs = vectorstore.similarity_search(question, k=4)
    context = "\n\n---\n\n".join([doc.page_content for doc in docs])

    prompt = ChatPromptTemplate.from_messages([
        ("system", SYSTEM_PROMPT),
        ("human", "{question}")
    ])

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.1)
    chain = prompt | llm

    response = chain.invoke({
        "context": context,
        "question": question
    })

    return {
        "answer": response.content,
        "sources": [doc.metadata.get("source", "unknown") for doc in docs]
    }

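Real-world impact: RAG is reported to reduce hallucination rates by roughly 60-80% on domain-specific queries, though the exact figure depends on retrieval quality and on how hallucination is measured. The constraint "only answer from context" does most of the work.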
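Rule 3 of the system prompt asks the model to cite which part of the context supports its answer, which is easier when each passage carries a visible label. The sketch below is one possible way to build such a context string. The `Doc` class is a hypothetical stand-in for a retrieved document with `page_content` and `metadata` fields, and `format_context` is an illustrative helper, not part of LangChain.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_context(docs):
    """Number each passage and show its source so answers can point back to it."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown")
        blocks.append(f"[{i}] (source: {source})\n{doc.page_content}")
    return "\n\n---\n\n".join(blocks)

docs = [
    Doc("Refunds are accepted within 30 days.", {"source": "policy.md"}),
    Doc("Gift cards are non-refundable."),  # no source metadata
]
context = format_context(docs)
print(context)
```

The resulting string could be passed as the `context` value in place of the plain join, so the model can refer to passages as [1] or [2].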

Strategy 2: Self-Critique Pipeline

Ask the model to review its own answer. The same model plays both "answerer" and "reviewer" roles in two separate API calls — crucially, the reviewer doesn't see its own previous reasoning, reducing confirmation bias.

def self_critique_pipeline(question: str, llm) -> str:
    """Two-pass self-critique to reduce hallucination"""

    # Pass 1: Generate initial answer
    initial_response = llm.invoke(
        f"Please answer the following question: {question}"
    )
    initial_answer = initial_response.content

    # Pass 2: Self-review (separate call, no memory of pass 1's reasoning)
    critique_prompt = f"""Review the following question and answer critically.

Question: {question}
Answer: {initial_answer}

Check for:
1. Factual accuracy — are any claims potentially wrong?
2. Unsupported specifics — dates, names, numbers that might be invented?
3. Outdated information that may have changed?

Mark uncertain claims explicitly, and provide a revised answer with corrections if needed.
Uncertain claims should use phrases like "as of my last training data" or "I believe, but please verify."
"""

    critique_response = llm.invoke(critique_prompt)
    return critique_response.content

# Usage
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o", temperature=0.3)
result = self_critique_pipeline(
    "What were the main architectural innovations in GPT-3?",
    llm
)

Strategy 3: Chain of Verification

Proposed by Dhuliawala et al. (2023) at Meta AI. The model generates an answer, then generates verification questions from its own claims, answers each independently, and uses the results to correct itself.

def chain_of_verification(question: str, llm) -> dict:
    """
    Step 1: Generate initial answer
    Step 2: Extract verifiable claims as questions
    Step 3: Answer each verification question independently
    Step 4: Correct the final answer using verification results
    """

    # Step 1: Initial answer
    initial = llm.invoke(question).content

    # Step 2: Extract verification questions
    verification_prompt = f"""From the following answer, extract the key factual claims
and turn each into a standalone verification question.

Answer: {initial}

Format: one verification question per line.
Focus on specific facts: dates, names, numbers, relationships."""

    vq_raw = llm.invoke(verification_prompt).content
    questions = [q.strip() for q in vq_raw.split('\n') if q.strip()]

    # Step 3: Answer each independently (without seeing the original answer)
    verifications = {}
    for vq in questions[:5]:  # Cap at 5 to control costs
        answer = llm.invoke(
            f"Answer this question concisely and accurately: {vq}"
        ).content
        verifications[vq] = answer

    # Step 4: Produce corrected final answer
    # Build the Q/A text first: a backslash inside an f-string expression
    # is a SyntaxError before Python 3.12.
    verification_text = "\n".join(
        f"Q: {q}\nA: {a}" for q, a in verifications.items()
    )
    correction_prompt = f"""Original question: {question}
Original answer: {initial}

Verification results:
{verification_text}

Using the verification results, produce an improved final answer.
Where verification revealed uncertainty, use hedged language ("reportedly", "as of 2023", etc.)"""

    final_answer = llm.invoke(correction_prompt).content

    return {
        "initial_answer": initial,
        "verifications": verifications,
        "final_answer": final_answer
    }
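Step 2 relies on the model returning exactly one question per line, but in practice the output may include numbering, bullets, or an introductory sentence. A small parser can make that step more robust. This is a sketch under that assumption; `parse_verification_questions` is a hypothetical helper name.

```python
import re

def parse_verification_questions(raw: str, limit: int = 5) -> list[str]:
    """Extract unique questions from model output, one per line."""
    questions = []
    for line in raw.splitlines():
        # Drop leading numbering or bullets such as "1.", "2)", "-", "*"
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        # Keep only lines that look like questions, and skip duplicates
        if cleaned.endswith("?") and cleaned not in questions:
            questions.append(cleaned)
    return questions[:limit]

raw = """Here are the verification questions:
1. When was Python first released?
2) Who created Python?
- When was Python first released?
"""
parsed = parse_verification_questions(raw)
print(parsed)
```

Filtering on a trailing question mark is a heuristic: it drops preamble lines, but it would also drop a valid question that the model phrased as a statement.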

Strategy 4: Temperature and Sampling Tuning

Temperature controls how much randomness goes into token selection. Lower temperature makes output more conservative and repeatable, which tends to help on factual tasks, although it cannot correct a fact the model has learned wrong.

from openai import OpenAI
client = OpenAI()

def factual_query(prompt: str) -> str:
    """Conservative settings for fact-based queries"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,     # Low: stick to highest-probability tokens
        top_p=0.9,           # Only sample from top 90% probability mass
        presence_penalty=0.0,   # No penalty for reusing tokens that already appeared
        frequency_penalty=0.0   # No penalty that grows with how often a token repeats
    )
    return response.choices[0].message.content

def creative_query(prompt: str) -> str:
    """Relaxed settings for creative tasks"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,     # High: allow exploration
        top_p=0.95
    )
    return response.choices[0].message.content

# Task-appropriate dispatch
answer = factual_query("Explain the difference between TCP and UDP")
story = creative_query("Write a short story about an AI that becomes self-aware")

Temperature guidelines:

0.0–0.2: Factual Q&A, data extraction, classification
0.3–0.5: Technical writing, summarization, code generation
0.6–0.8: General conversation, explanations
0.9–1.0: Creative writing, brainstorming
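To see why these ranges behave differently, it helps to look at what temperature and top_p do to a distribution. The sketch below uses made-up logits and two illustrative helpers, `apply_temperature` and `top_p_filter`. It mirrors the usual definitions of temperature scaling and nucleus sampling and is not the implementation used by any specific API.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (must be > 0)."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: v / total for tok, v in exps.items()}

def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())  # renormalize the surviving tokens
    return {tok: p / total for tok, p in kept.items()}

logits = {"capital": 3.0, "city": 1.0, "heart": 0.0}  # hypothetical values
cold = apply_temperature(logits, 0.1)  # almost all mass on "capital"
hot = apply_temperature(logits, 1.0)   # alternatives keep real probability
nucleus = top_p_filter(hot, 0.9)       # "heart" falls outside the nucleus

print(round(cold["capital"], 4), round(hot["capital"], 4), sorted(nucleus))
```

At low temperature the top token dominates, so sampling becomes close to deterministic. At temperature 1.0 the lower-ranked tokens are sampled often enough to matter, which is useful for creative work and risky for factual answers.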

Strategy 5: Forced Source Citation

Require the model to tag every factual claim with its source. This makes hallucinations immediately visible — any claim tagged [Source: unknown] signals a fact worth verifying.

CITATION_SYSTEM = """When answering questions, you MUST tag every factual claim:

- [Source: X] — where X is the specific source you're drawing from
- [Source: unknown] — for facts you believe are true but can't cite specifically
- [Inference] — for logical conclusions you're drawing yourself

Example:
"Python was first released in 1991 [Source: Python docs / Guido van Rossum].
It is now one of the most popular languages worldwide [Source: Stack Overflow Survey 2024].
It will likely remain dominant in ML for the next decade [Inference]."

Never omit source tags. If you would need to say [Source: unknown] for too many claims,
say so upfront and reduce the scope of your answer.
"""

def cited_response(question: str, llm) -> str:
    from langchain_core.messages import SystemMessage, HumanMessage
    messages = [
        SystemMessage(content=CITATION_SYSTEM),
        HumanMessage(content=question)
    ]
    response = llm.invoke(messages)
    return response.content
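The tags are only useful if something reads them. The sketch below shows one possible post-processing step that counts tag types and collects sentences worth a manual check. The function name `audit_citations` and the sentence-splitting rule are assumptions for illustration; the tag format follows the prompt above.

```python
import re

TAG_PATTERN = re.compile(r"\[(Source: [^\]]+|Inference)\]")

def audit_citations(answer: str) -> dict:
    """Count tag types and collect the sentences that need manual checking."""
    counts = {"cited": 0, "unknown": 0, "inference": 0, "untagged": 0}
    needs_review = []
    # Naive split: a sentence ends at ., ! or ? followed by whitespace
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        tags = TAG_PATTERN.findall(sentence)
        if not tags:
            counts["untagged"] += 1
            needs_review.append(sentence)
            continue
        for tag in tags:
            if tag == "Inference":
                counts["inference"] += 1
            elif tag == "Source: unknown":
                counts["unknown"] += 1
                needs_review.append(sentence)
            else:
                counts["cited"] += 1
    return {"counts": counts, "needs_review": needs_review}

answer = (
    "Python was first released in 1991 [Source: Python docs]. "
    "It has a global interpreter lock [Source: unknown]. "
    "It will likely stay popular [Inference]. "
    "It is fast."
)
report = audit_citations(answer)
print(report["counts"])
print(report["needs_review"])
```

A named source is not proof that the source exists or says what the model claims, so the "cited" count measures tag coverage rather than correctness.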

Measuring Hallucination

RAGAS Faithfulness (for RAG systems)

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Measures: does the answer stay faithful to the retrieved context?
# Score: 0.0 (fabricates everything) to 1.0 (perfectly grounded)
results = evaluate(
    dataset=test_dataset,  # questions, answers, contexts, ground_truths
    metrics=[faithfulness, answer_relevancy, context_precision]
)

print(f"Faithfulness:       {results['faithfulness']:.3f}")   # Target: >0.85
print(f"Answer Relevancy:   {results['answer_relevancy']:.3f}")  # Target: >0.80
print(f"Context Precision:  {results['context_precision']:.3f}") # Target: >0.75
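Conceptually, the faithfulness score is the share of claims in the answer that the retrieved context supports. The sketch below shows only that arithmetic, with hand-labeled verdicts as an assumption. RAGAS itself uses an LLM to extract the claims and judge each one.

```python
def faithfulness_score(claim_supported):
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0  # no claims to evaluate
    return sum(claim_supported.values()) / len(claim_supported)

# Hypothetical claims from one answer, labeled by hand against the context
claims = {
    "Refunds are available within 30 days": True,
    "Refunds require a receipt": True,
    "Refunds are paid in store credit only": False,  # not in the context
    "Shipping fees are refundable": True,
}
score = faithfulness_score(claims)
print(f"Faithfulness: {score:.2f}")
```

One unsupported claim out of four already pulls this answer below the 0.85 target, which shows how strict that threshold is for short answers.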

TruthfulQA Benchmark

817 adversarially-crafted questions designed to elicit hallucinations. Reference scores (as of early 2025):

GPT-4: ~59% truthful
Claude 3 Opus: ~62% truthful
Humans: ~94% truthful

The gap between AI and humans is exactly why hallucination mitigation matters in production.

When You Can't (and Shouldn't) Prevent Hallucination

Being honest: some hallucination is unavoidable. Some is desirable.

Cases where "hallucination" is a feature:

Creative writing: you want the model to invent things
Brainstorming: novel connections are the point
Hypothetical scenarios: "what if" requires imagination

Risk-based framework:

Use Case	Hallucination Risk	Recommended Approach
Medical information	Critical	RAG + verification + mandatory "consult a doctor" disclaimer
Legal advice	Critical	Never use LLM alone
Code generation	Medium	Auto-run tests to verify output
Document summarization	Low	Low temperature + source documents provided
Creative writing	N/A	No restrictions needed

Production-Ready Configuration

class HallucinationConfig:
    """Hallucination-minimizing configs for different task types"""

    FACTUAL = {
        "temperature": 0.1,
        "system_suffix": "\n\nIf you're unsure about any fact, say so explicitly.",
        "use_rag": True,
        "self_critique": True
    }

    CONVERSATIONAL = {
        "temperature": 0.7,
        "system_suffix": "\n\nBe honest when you don't know something.",
        "use_rag": False,
        "self_critique": False
    }

    CODE = {
        "temperature": 0.2,
        "system_suffix": "\n\nOnly suggest functions and methods that actually exist.",
        "use_rag": True,   # Documentation-grounded RAG
        "self_critique": True
    }
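The config class defines profiles but not how a request picks one. A minimal dispatch sketch follows, assuming task types arrive as lowercase strings and that unknown types should fall back to the most conservative profile. `PROFILES` and `build_request` are hypothetical names, and the class is trimmed to two profiles to keep the example short.

```python
class HallucinationConfig:
    """Trimmed copy of the profiles above, for a self-contained example."""
    FACTUAL = {
        "temperature": 0.1,
        "system_suffix": "\n\nIf you're unsure about any fact, say so explicitly.",
        "use_rag": True,
        "self_critique": True,
    }
    CONVERSATIONAL = {
        "temperature": 0.7,
        "system_suffix": "\n\nBe honest when you don't know something.",
        "use_rag": False,
        "self_critique": False,
    }

PROFILES = {
    "factual": HallucinationConfig.FACTUAL,
    "conversational": HallucinationConfig.CONVERSATIONAL,
}

def build_request(task_type: str, base_system_prompt: str) -> dict:
    """Resolve a task type to concrete request settings."""
    # Unknown task types get the strictest profile rather than the loosest
    config = PROFILES.get(task_type, HallucinationConfig.FACTUAL)
    return {
        "system_prompt": base_system_prompt + config["system_suffix"],
        "temperature": config["temperature"],
        "use_rag": config["use_rag"],
        "self_critique": config["self_critique"],
    }

request = build_request("conversational", "You are a support assistant.")
fallback = build_request("something-new", "You are a support assistant.")
print(request["temperature"], fallback["temperature"])
```

Defaulting to the factual profile means a misrouted request costs extra latency instead of extra hallucination risk.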

Conclusion

Hallucination is not a fixable bug — it's an intrinsic property of probabilistic language models. The model doesn't know truth. It knows probabilities.

But with the right architecture, you can reduce hallucination dramatically:

RAG grounds responses in retrieved facts (most impactful)
Self-critique adds a review pass before the user sees the answer
Chain of Verification stress-tests individual claims
Low temperature keeps factual outputs conservative
Forced citation makes hallucinations visible and auditable

The key principle: match your mitigation strategy to your use case's risk level. For medical or legal applications, LLMs should never operate without human oversight regardless of what mitigation you apply.