Author: Youngju Kim (@fjvbn20031)

Contents:
- Introduction
- What Exactly Is Hallucination?
- Why Does Hallucination Happen? The Technical Cause
- 5 Prevention Strategies
- Measuring Hallucination
- When You Can't (and Shouldn't) Prevent Hallucination
- Production-Ready Configuration
- Conclusion
Introduction
If you've deployed an LLM in production, you've encountered this: a user asks a straightforward question and the model responds with complete confidence — and complete inaccuracy. A chatbot invents a return policy that doesn't exist. A coding assistant suggests an API method that was never part of any library. A research assistant cites a paper that was never published.
This is hallucination. And it's not a bug — it's a fundamental consequence of how LLMs work. This guide breaks down the technical causes and gives you five practical strategies to fight back, with real code you can deploy today.
What Exactly Is Hallucination?
Hallucination isn't a single phenomenon. Identifying the type determines the correct fix.
The 4 Types of Hallucination
1. Factual Hallucination: The model generates outright false facts with apparent confidence.
- "The Eiffel Tower is located in London"
- "Python was created by Guido van Rossum in 1995" (it was 1991)
2. Confabulation: Plausible-sounding but entirely fabricated details — the model fills gaps with invented specifics.
- Citing a paper that doesn't exist: "According to Smith et al., 2023..."
- Suggesting a library method or function that has never existed
3. Attribution Hallucination: Real information, wrong source.
- Attributing a quote to the wrong person
- Citing accurate statistics but crediting the wrong organization
4. Temporal Hallucination: Outdated information presented as current fact.
- Calling a model "the latest" when it was superseded after the training cutoff
- Writing code against a deprecated API because that was in the training data
Why Does Hallucination Happen? The Technical Cause
How an LLM works at its core:
```
Input tokens → [Transformer layers] → probability distribution over next token → sample

Example:
"Paris is the ___" → {"capital": 0.91, "city": 0.06, "heart": 0.02, ...}
                   → select "capital"
```
The fundamental issue: an LLM does not reason about truth. It predicts the statistically most plausible next token. There is no "I don't know" state in the probability distribution — the model must always predict something.
When asked about information not in its training data, the model doesn't refuse. It pattern-matches to the closest thing it learned and fills in the blank — confidently.
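To make the mechanism concrete, here is a toy next-token sampler. The logit values are invented for illustration; the point is that the softmax always names a winner — there is no "unsure" outcome in the distribution itself.

```python
import math

def next_token_distribution(logits: dict[str, float]) -> dict[str, float]:
    """Convert raw logits into a probability distribution (softmax)."""
    z = sum(math.exp(v) for v in logits.values())
    return {tok: math.exp(v) / z for tok, v in logits.items()}

# Hypothetical logits for continuations of "Paris is the ___"
logits = {"capital": 5.0, "city": 2.3, "heart": 1.2}
probs = next_token_distribution(logits)

# Greedy decoding: pick the highest-probability token.
# Whether or not the resulting statement is true never enters the computation.
best = max(probs, key=probs.get)
```

Note that `best` is chosen purely by probability mass; a model asked about a fact it never saw runs exactly the same computation and still emits a confident-looking winner.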
Three Specific Technical Root Causes
Cause 1: Confidence and accuracy are decoupled
A high-probability token selection doesn't mean the output is factually correct. The model is confident that a token is a likely continuation — not that the statement is true. It has no internal flag for "I'm uncertain about this fact."
Cause 2: Training data contains errors
The internet is full of misinformation. LLMs train on it indiscriminately. Frequently repeated errors get reinforced as "plausible" patterns. There's no ground truth filter during pre-training.
Cause 3: Lost in the Middle
Research (Liu et al., 2023) shows that LLMs struggle to accurately recall information from the middle of long contexts. They attend more reliably to information at the beginning and end of the context window. This causes hallucination even when the correct answer was provided — the model just didn't attend to it.
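You can probe this effect yourself by building otherwise-identical contexts that differ only in where a key fact sits, then measuring answer accuracy per position. A minimal sketch (the filler documents and the "needle" fact are made up for illustration; the model call and scoring are left to you):

```python
# Twenty distractor documents to pad the context window
FILLER_DOCS = [f"Distractor document number {i}." for i in range(20)]

def build_context(fact: str, fillers: list[str], position: str) -> str:
    """Place a key fact at the start, middle, or end of a long context."""
    docs = list(fillers)
    idx = {"start": 0, "middle": len(docs) // 2, "end": len(docs)}[position]
    docs.insert(idx, fact)
    return "\n\n".join(docs)

# Three prompts that differ only in where the fact sits;
# feed each to the model and compare recall accuracy per position.
prompts = {
    pos: build_context("The access code is 4512.", FILLER_DOCS, pos)
    for pos in ("start", "middle", "end")
}
```

Per the Liu et al. findings, you should expect the "middle" variant to be answered correctly least often — which is also why RAG pipelines often place the most relevant retrieved chunk first or last.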
5 Prevention Strategies
Strategy 1: RAG (Most Effective)
Retrieval-Augmented Generation grounds the model's response in retrieved facts. The model only answers from what's in the retrieved context.
```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.prompts import ChatPromptTemplate

# The key: explicitly forbid fabrication
SYSTEM_PROMPT = """You are a helpful assistant that answers questions ONLY based on the provided context.

Rules:
1. Never make up information not in the context
2. If the answer isn't in the context, say "I don't have information about this in the provided documents"
3. Always cite which part of the context supports your answer

Context:
{context}
"""

def rag_query(question: str, vectorstore) -> dict:
    # Retrieve relevant documents
    docs = vectorstore.similarity_search(question, k=4)
    context = "\n\n---\n\n".join([doc.page_content for doc in docs])

    prompt = ChatPromptTemplate.from_messages([
        ("system", SYSTEM_PROMPT),
        ("human", "{question}")
    ])
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.1)
    chain = prompt | llm

    response = chain.invoke({
        "context": context,
        "question": question
    })
    return {
        "answer": response.content,
        "sources": [doc.metadata.get("source", "unknown") for doc in docs]
    }
```
Real-world impact: RAG can cut hallucination rates dramatically on domain-specific queries — reductions of 60-80% are commonly reported. The constraint "only answer from context" is extraordinarily powerful.
Strategy 2: Self-Critique Pipeline
Ask the model to review its own answer. The same model plays both "answerer" and "reviewer" roles in two separate API calls — crucially, the reviewer doesn't see its own previous reasoning, reducing confirmation bias.
```python
def self_critique_pipeline(question: str, llm) -> str:
    """Two-pass self-critique to reduce hallucination"""
    # Pass 1: Generate initial answer
    initial_response = llm.invoke(
        f"Please answer the following question: {question}"
    )
    initial_answer = initial_response.content

    # Pass 2: Self-review (separate call, no memory of pass 1's reasoning)
    critique_prompt = f"""Review the following question and answer critically.

Question: {question}
Answer: {initial_answer}

Check for:
1. Factual accuracy — are any claims potentially wrong?
2. Unsupported specifics — dates, names, numbers that might be invented?
3. Outdated information that may have changed?

Mark uncertain claims explicitly, and provide a revised answer with corrections if needed.
Uncertain claims should use phrases like "as of my last training data" or "I believe, but please verify."
"""
    critique_response = llm.invoke(critique_prompt)
    return critique_response.content

# Usage
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0.3)
result = self_critique_pipeline(
    "What were the main architectural innovations in GPT-3?",
    llm
)
```
Strategy 3: Chain of Verification
Proposed by Dhuliawala et al. (2023) at Meta AI. The model generates an answer, then generates verification questions from its own claims, answers each independently, and uses the results to correct itself.
```python
def chain_of_verification(question: str, llm) -> dict:
    """
    Step 1: Generate initial answer
    Step 2: Extract verifiable claims as questions
    Step 3: Answer each verification question independently
    Step 4: Correct the final answer using verification results
    """
    # Step 1: Initial answer
    initial = llm.invoke(question).content

    # Step 2: Extract verification questions
    verification_prompt = f"""From the following answer, extract the key factual claims
and turn each into a standalone verification question.

Answer: {initial}

Format: one verification question per line.
Focus on specific facts: dates, names, numbers, relationships."""
    vq_raw = llm.invoke(verification_prompt).content
    questions = [q.strip() for q in vq_raw.split('\n') if q.strip()]

    # Step 3: Answer each independently (without seeing the original answer)
    verifications = {}
    for vq in questions[:5]:  # Cap at 5 to control costs
        answer = llm.invoke(
            f"Answer this question concisely and accurately: {vq}"
        ).content
        verifications[vq] = answer

    # Step 4: Produce corrected final answer
    verification_block = "\n".join(
        f"Q: {q}\nA: {a}" for q, a in verifications.items()
    )
    correction_prompt = f"""Original question: {question}
Original answer: {initial}

Verification results:
{verification_block}

Using the verification results, produce an improved final answer.
Where verification revealed uncertainty, use hedged language ("reportedly", "as of 2023", etc.)"""
    final_answer = llm.invoke(correction_prompt).content

    return {
        "initial_answer": initial,
        "verifications": verifications,
        "final_answer": final_answer
    }
```
Strategy 4: Temperature and Sampling Tuning
Temperature directly controls how "creative" (read: risky) the model is with its token selection. Lower temperature = more conservative = fewer hallucinations on factual tasks.
```python
from openai import OpenAI

client = OpenAI()

def factual_query(prompt: str) -> str:
    """Conservative settings for fact-based queries"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,       # Low: stick to highest-probability tokens
        top_p=0.9,             # Only sample from the top 90% probability mass
        presence_penalty=0.0,  # Don't penalize repetition of established facts
        frequency_penalty=0.0  # Same
    )
    return response.choices[0].message.content

def creative_query(prompt: str) -> str:
    """Relaxed settings for creative tasks"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,  # High: allow exploration
        top_p=0.95
    )
    return response.choices[0].message.content

# Task-appropriate dispatch
answer = factual_query("Explain the difference between TCP and UDP")
story = creative_query("Write a short story about an AI that becomes self-aware")
```
Temperature guidelines:
- 0.0–0.2: Factual Q&A, data extraction, classification
- 0.3–0.5: Technical writing, summarization, code generation
- 0.6–0.8: General conversation, explanations
- 0.9–1.0: Creative writing, brainstorming
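These guidelines are easy to encode as a lookup table so callers don't pick temperatures ad hoc. A small sketch — the task labels are my own naming for the categories above, not a standard API:

```python
# Conservative end of each range from the guidelines above
TEMPERATURE_BY_TASK = {
    "factual_qa": 0.1,
    "extraction": 0.1,
    "classification": 0.1,
    "summarization": 0.4,
    "code": 0.3,
    "conversation": 0.7,
    "creative": 0.9,
}

def temperature_for(task: str, default: float = 0.3) -> float:
    """Look up a temperature for the given task type, defaulting low."""
    return TEMPERATURE_BY_TASK.get(task, default)
```

Defaulting to a low temperature for unrecognized tasks is the safer failure mode: an overly conservative creative answer is merely dull, while an overly adventurous factual answer can be wrong.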
Strategy 5: Forced Source Citation
Require the model to tag every factual claim with its source. This makes hallucinations immediately visible — any claim tagged [Source: unknown] signals a fact worth verifying.
```python
CITATION_SYSTEM = """When answering questions, you MUST tag every factual claim:
- [Source: X] — where X is the specific source you're drawing from
- [Source: unknown] — for facts you believe are true but can't cite specifically
- [Inference] — for logical conclusions you're drawing yourself

Example:
"Python was first released in 1991 [Source: Python docs / Guido van Rossum].
It is now one of the most popular languages worldwide [Source: Stack Overflow Survey 2024].
It will likely remain dominant in ML for the next decade [Inference]."

Never omit source tags. If you would need to say [Source: unknown] for too many claims,
say so upfront and reduce the scope of your answer.
"""

def cited_response(question: str, llm) -> str:
    from langchain.schema import SystemMessage, HumanMessage

    messages = [
        SystemMessage(content=CITATION_SYSTEM),
        HumanMessage(content=question)
    ]
    response = llm.invoke(messages)
    return response.content
```
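Forced citation only pays off if you act on the tags. A small auditor can flag responses that lean too heavily on unverifiable claims; the regex is my own assumption matching the tag format in the system prompt above:

```python
import re

# Matches the tag formats from CITATION_SYSTEM: [Source: X] and [Inference]
TAG_RE = re.compile(r"\[(Source: [^\]]+|Inference)\]")

def audit_citations(text: str) -> dict:
    """Count citation tags and flag answers with unverifiable claims."""
    tags = TAG_RE.findall(text)
    unknown = sum(1 for t in tags if t == "Source: unknown")
    return {
        "total_tags": len(tags),
        "unknown_sources": unknown,
        "needs_review": unknown > 0,
    }

sample = (
    "Python was first released in 1991 [Source: Python docs]. "
    "It has tens of millions of users [Source: unknown]. "
    "It will remain popular [Inference]."
)
report = audit_citations(sample)
```

In production you might route any response with `needs_review` set to a human queue, or re-prompt the model to narrow its answer to claims it can source.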
Measuring Hallucination
RAGAS Faithfulness (for RAG systems)
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Measures: does the answer stay faithful to the retrieved context?
# Score: 0.0 (fabricates everything) to 1.0 (perfectly grounded)
results = evaluate(
    dataset=test_dataset,  # questions, answers, contexts, ground_truths
    metrics=[faithfulness, answer_relevancy, context_precision]
)

print(f"Faithfulness: {results['faithfulness']:.3f}")            # Target: >0.85
print(f"Answer Relevancy: {results['answer_relevancy']:.3f}")    # Target: >0.80
print(f"Context Precision: {results['context_precision']:.3f}")  # Target: >0.75
```
TruthfulQA Benchmark
817 adversarially crafted questions designed to elicit hallucinations. Reported reference scores (as of early 2025):
- GPT-4: ~59% truthful
- Claude 3 Opus: ~62% truthful
- Humans: ~94% truthful
The gap between AI and humans is exactly why hallucination mitigation matters in production.
When You Can't (and Shouldn't) Prevent Hallucination
Let's be honest: some hallucination is unavoidable, and some is even desirable.
Cases where "hallucination" is a feature:
- Creative writing: you want the model to invent things
- Brainstorming: novel connections are the point
- Hypothetical scenarios: "what if" requires imagination
Risk-based framework:
| Use Case | Hallucination Risk | Recommended Approach |
|---|---|---|
| Medical information | Critical | RAG + verification + mandatory "consult a doctor" disclaimer |
| Legal advice | Critical | Never use LLM alone |
| Code generation | Medium | Auto-run tests to verify output |
| Document summarization | Low | Low temperature + source documents provided |
| Creative writing | N/A | No restrictions needed |
Production-Ready Configuration
```python
class HallucinationConfig:
    """Hallucination-minimizing configs for different task types"""

    FACTUAL = {
        "temperature": 0.1,
        "system_suffix": "\n\nIf you're unsure about any fact, say so explicitly.",
        "use_rag": True,
        "self_critique": True
    }

    CONVERSATIONAL = {
        "temperature": 0.7,
        "system_suffix": "\n\nBe honest when you don't know something.",
        "use_rag": False,
        "self_critique": False
    }

    CODE = {
        "temperature": 0.2,
        "system_suffix": "\n\nOnly suggest functions and methods that actually exist.",
        "use_rag": True,  # Documentation-grounded RAG
        "self_critique": True
    }
```
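A small dispatcher can then pick the right profile per request. Here is a standalone sketch — the task labels and the fall-back-to-factual default are my own choices, not part of any library:

```python
# Profiles mirror the FACTUAL / CONVERSATIONAL / CODE configs above
CONFIGS = {
    "factual": {"temperature": 0.1, "use_rag": True, "self_critique": True},
    "conversational": {"temperature": 0.7, "use_rag": False, "self_critique": False},
    "code": {"temperature": 0.2, "use_rag": True, "self_critique": True},
}

def get_config(task: str) -> dict:
    """Select a mitigation profile; unknown tasks get the strictest one."""
    return CONFIGS.get(task, CONFIGS["factual"])
```

Falling back to the strictest profile for unrecognized tasks follows the same fail-safe logic as the risk table above: when in doubt, treat the query as factual.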
Conclusion
Hallucination is not a fixable bug — it's an intrinsic property of probabilistic language models. The model doesn't know truth. It knows probabilities.
But with the right architecture, you can reduce hallucination dramatically:
- RAG grounds responses in retrieved facts (most impactful)
- Self-critique adds a review pass before the user sees the answer
- Chain of Verification stress-tests individual claims
- Low temperature keeps factual outputs conservative
- Forced citation makes hallucinations visible and auditable
The key principle: match your mitigation strategy to your use case's risk level. For medical or legal applications, LLMs should never operate without human oversight regardless of what mitigation you apply.