Author: Youngju Kim (@fjvbn20031)
- Why RAG Evaluation Is Hard
- The Four Metrics RAGAS Measures
- RAGAS Implementation
- Building an Automated Evaluation Pipeline (CI/CD for RAG)
- Generating Evaluation Test Sets Automatically
- Score Interpretation Guide
- The Continuous Improvement Loop
- Conclusion
Why RAG Evaluation Is Hard
"It feels like it's working, but I don't know if it's actually good."
That's the most common RAG engineering frustration I hear. When you're deep in development, you have an intuitive sense that things are improving — but you can't show a number to your team lead or justify a decision based on that gut feeling.
It's also hard to write tests the way you would for traditional software. There's no single right answer. A good response to the same question can vary depending on context.
RAGAS (Retrieval Augmented Generation Assessment) was built to solve this problem. It uses LLMs as judges to quantify different aspects of your RAG system's performance as scores between 0 and 1.
This post covers understanding RAGAS's four core metrics, implementing them with real code, and building an automated evaluation pipeline.
The Four Metrics RAGAS Measures
RAGAS evaluates a RAG system across two dimensions: Retrieval quality and Generation quality.
Full RAG pipeline:

[User Question]
       |
[Retrieval Step]    <- evaluated by Context Precision, Context Recall
       |
[Generation Step]   <- evaluated by Faithfulness, Answer Relevancy
       |
[Final Answer]
Metric 1: Faithfulness
Question: Is the generated answer grounded in the retrieved context?
This is the hallucination detection metric. When the model "makes up" content not present in the context, Faithfulness drops.
How it's calculated:
- Extract individual claims from the generated answer
- An LLM judge determines whether each claim is supported by the retrieved context
- Faithfulness = supported claims / total claims
Example:
Question: "What is the battery life of this product?"
Context: "The battery lasts up to 12 hours. Charging time is 2 hours."
Answer: "The battery lasts 12 hours and the device is also waterproof."
Claim analysis:
- "battery lasts 12 hours" -> supported by context ✓
- "device is also waterproof" -> not in context ✗
Faithfulness = 1/2 = 0.5 (low! hallucination occurred)
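The arithmetic above can be sketched in a few lines. In RAGAS the per-claim verdicts come from an LLM judge; the hard-coded booleans here are stand-ins for those verdicts:

```python
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    """Supported claims divided by total claims."""
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

verdicts = [
    True,   # "battery lasts 12 hours" -> supported by context
    False,  # "device is also waterproof" -> not in context
]
print(faithfulness_score(verdicts))  # 0.5
```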
Metric 2: Answer Relevancy
Question: Does the generated answer actually address the user's question?
This detects answers that are faithful to the context but fail to answer the question.
How it's calculated:
- Generate multiple synthetic questions from the generated answer
- Calculate average cosine similarity between these synthetic questions and the original question
- Answer Relevancy = this similarity score
Example:
Original question: "How long does the battery last?"
Generated answer: "This product is made with premium materials and comes in various colors."
Synthetic questions generated:
- "What materials is this product made of?"
- "What colors does this come in?"
Similarity to original question: very low
Answer Relevancy ≈ 0.1 (low! answer ignores the question)
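A toy sketch of the similarity step, using made-up embedding vectors in place of a real embedding model (the vectors and averaging are illustrative only; RAGAS embeds the LLM-generated synthetic questions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Standard cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Made-up embeddings: the original question vs. two synthetic questions
original_q = [1.0, 0.0, 0.1]
synthetic_qs = [[0.1, 1.0, 0.0], [0.0, 0.9, 0.2]]

# Answer Relevancy = mean similarity of synthetic questions to the original
relevancy = sum(cosine_similarity(original_q, q) for q in synthetic_qs) / len(synthetic_qs)
print(round(relevancy, 2))  # low, because the synthetic questions diverge from the original
```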
Metric 3: Context Precision
Question: Of the retrieved chunks, how many were actually useful?
This measures the "noise" in retrieval. If you retrieved 5 chunks but only 1 was actually relevant, you get a low score.
How it's calculated:
- Weighted sum of Precision@k for the top k retrieved contexts
- Each context's relevance is judged against the ground truth answer
High Context Precision = retrieval is accurate, so the LLM isn't confused by irrelevant information.
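The weighted Precision@k sum can be illustrated with a small helper. The 0/1 relevance verdicts would come from an LLM judge in RAGAS; here they are supplied directly. Note the rank-sensitivity: a relevant chunk ranked first scores higher than the same chunk buried at rank 3.

```python
def context_precision(relevance: list[int]) -> float:
    """relevance[k-1] = 1 if the k-th retrieved chunk is judged relevant.
    Weighted mean of precision@k, taken at the ranks of relevant chunks."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    hits = 0
    for k, rel in enumerate(relevance, start=1):
        hits += rel
        if rel:
            score += hits / k  # precision@k at this relevant rank
    return score / total_relevant

print(context_precision([1, 0, 0]))  # 1.0  (relevant chunk ranked first)
print(context_precision([0, 0, 1]))  # ~0.33 (same chunk, buried at rank 3)
```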
Metric 4: Context Recall
Question: Did retrieval surface all the information needed to answer?
This measures missing information. Did the retrieval step find everything the ground truth answer requires?
How it's calculated:
- Each sentence in the ground truth answer is checked against the retrieved contexts
- Context Recall = sentences supported by contexts / total ground truth sentences
High Context Recall = retrieval is comprehensive, so the LLM has everything it needs.
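As a rough illustration only: the sketch below swaps RAGAS's LLM judge for a naive word-overlap heuristic, just to make the sentence-level ratio concrete. The heuristic and its threshold are invented for this example and are nowhere near what the real metric does.

```python
def sentence_supported(sentence: str, contexts: list[str], min_overlap: float = 0.5) -> bool:
    """Toy stand-in for the LLM judge: a sentence counts as supported if
    enough of its words appear in some retrieved context."""
    words = set(sentence.lower().split())
    for ctx in contexts:
        ctx_words = set(ctx.lower().split())
        if len(words & ctx_words) / max(len(words), 1) >= min_overlap:
            return True
    return False

def context_recall(ground_truth_sentences: list[str], contexts: list[str]) -> float:
    """Supported ground-truth sentences / total ground-truth sentences."""
    supported = [sentence_supported(s, contexts) for s in ground_truth_sentences]
    return sum(supported) / len(supported) if supported else 0.0
```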
RAGAS Implementation
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset.
# Each sample: question + system-generated answer + retrieved contexts + ground_truth
eval_data = {
    "question": [
        "What is the battery life of this product?",
        "What is the return policy?",
        "How long does shipping take?"
    ],
    "answer": [
        "The battery lasts up to 12 hours.",
        "You can request a refund within 30 days of purchase.",
        "Standard shipping takes 3-5 business days."
    ],
    "contexts": [
        # Retrieved context chunks for each question
        [
            "The battery lasts up to 12 hours. Charging time is 2 hours.",
            "Product dimensions are 15cm x 10cm x 2cm."
        ],
        [
            "Customers can request a refund within 30 days of purchase.",
            "Refunds are processed within 5-7 business days."
        ],
        [
            "Standard shipping: 3-5 business days. Express shipping: 1-2 business days.",
            "Shipping fee of $5 applies to orders under $50."
        ]
    ],
    "ground_truth": [
        "Battery life is up to 12 hours, and charging takes 2 hours.",
        "Returns are accepted within 30 days of purchase and processed in 5-7 business days.",
        "Standard shipping takes 3-5 business days, express takes 1-2 business days."
    ]
}

dataset = Dataset.from_dict(eval_data)

# Run evaluation
results = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ]
)
print(results)
# Sample output:
# {
#     'faithfulness': 0.95,
#     'answer_relevancy': 0.88,
#     'context_precision': 0.82,
#     'context_recall': 0.91
# }

# Convert to DataFrame for detailed analysis
df = results.to_pandas()
print(df.to_string())
Configuring the LLM Judge
RAGAS defaults to OpenAI as the judge LLM, but you can swap it for local models or other providers:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Use gpt-4o-mini to reduce evaluation costs
judge_llm = LangchainLLMWrapper(
    ChatOpenAI(model="gpt-4o-mini", temperature=0)
)

# Embeddings used for the Answer Relevancy calculation
embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=judge_llm,
    embeddings=embeddings,
)
Building an Automated Evaluation Pipeline (CI/CD for RAG)
Here's a pipeline that automatically runs evaluation whenever your RAG system changes.
# evaluate_rag.py - evaluation script run in CI/CD
import json
import sys
from datetime import datetime
from pathlib import Path

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall


def load_test_dataset(path: str) -> Dataset:
    """Load saved test dataset from disk."""
    with open(path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    return Dataset.from_dict(data)


def run_rag_on_testset(testset: Dataset, rag_chain) -> Dataset:
    """Run each test question through RAG, collect answers and contexts."""
    answers = []
    contexts_list = []
    for question in testset["question"]:
        result = rag_chain.invoke({"query": question})
        answers.append(result["result"])
        contexts_list.append([
            doc.page_content
            for doc in result["source_documents"]
        ])
    return testset.add_column("answer", answers).add_column("contexts", contexts_list)


def evaluate_and_gate(
    dataset: Dataset,
    thresholds: dict,
    save_path: str = None
) -> bool:
    """
    Run evaluation and apply a pass/fail gate based on thresholds.

    thresholds example:
        {
            "faithfulness": 0.8,
            "answer_relevancy": 0.75,
            "context_precision": 0.7,
            "context_recall": 0.75
        }
    """
    results = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )

    print("\n" + "=" * 60)
    print(f"RAG Evaluation Results - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print("=" * 60)

    all_passed = True
    for metric_name, threshold in thresholds.items():
        score = results[metric_name]
        passed = score >= threshold
        status = "PASS" if passed else "FAIL"
        print(f"{metric_name:25s}: {score:.3f} (threshold: {threshold}) [{status}]")
        if not passed:
            all_passed = False

    # Save results for history tracking
    if save_path:
        result_data = {
            "timestamp": datetime.now().isoformat(),
            "scores": {k: float(v) for k, v in results.items()},
            "passed": all_passed
        }
        Path(save_path).parent.mkdir(parents=True, exist_ok=True)
        with open(save_path, 'a') as f:
            f.write(json.dumps(result_data) + "\n")

    return all_passed


if __name__ == "__main__":
    testset = load_test_dataset("tests/rag_testset.json")
    # rag_chain = load_your_rag_chain()
    # testset = run_rag_on_testset(testset, rag_chain)

    thresholds = {
        "faithfulness": 0.80,
        "answer_relevancy": 0.75,
        "context_precision": 0.70,
        "context_recall": 0.75,
    }

    passed = evaluate_and_gate(
        testset,
        thresholds=thresholds,
        save_path="results/rag_eval_history.jsonl"
    )
    sys.exit(0 if passed else 1)  # Fail the build in CI/CD if below threshold
GitHub Actions integration:
# .github/workflows/rag-evaluation.yml
name: RAG Quality Gate

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  evaluate-rag:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install ragas langchain openai datasets
      - name: Run RAG Evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python evaluate_rag.py
Generating Evaluation Test Sets Automatically
Building test sets by hand is time-consuming. RAGAS's TestsetGenerator can automate this.
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_community.document_loaders import DirectoryLoader

# Load your documents
loader = DirectoryLoader("./docs", glob="**/*.md")
documents = loader.load()

# Initialize test set generator
generator = TestsetGenerator.with_openai()

# Generate test set with specified question types
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=50,
    distributions={
        simple: 0.5,          # Simple factual questions: 50%
        reasoning: 0.25,      # Questions requiring reasoning: 25%
        multi_context: 0.25   # Questions needing multiple docs: 25%
    }
)

# Save to file
testset_df = testset.to_pandas()
testset_df.to_json("tests/rag_testset.json", orient='records')

print(f"Generated {len(testset_df)} questions")
print("\nSample questions:")
print(testset_df[['question', 'ground_truth']].head())
Example generated question types:
- Simple: "How many days is the return window?" (direct factual retrieval)
- Reasoning: "Which product has a longer battery life, A or B?" (comparison reasoning)
- Multi-context: "Which product in the current lineup has the highest storage capacity?" (requires multiple docs)
Score Interpretation Guide
| Metric | Danger Zone | Acceptable | Good | Excellent |
|---|---|---|---|---|
| Faithfulness | below 0.5 | 0.5-0.7 | 0.7-0.9 | above 0.9 |
| Answer Relevancy | below 0.6 | 0.6-0.75 | 0.75-0.9 | above 0.9 |
| Context Precision | below 0.5 | 0.5-0.7 | 0.7-0.85 | above 0.85 |
| Context Recall | below 0.6 | 0.6-0.75 | 0.75-0.9 | above 0.9 |
How to Improve Low Scores
Low Faithfulness → Hallucinations occurring
- Strengthen system prompt: "Only answer based on the provided context. If the answer is not in the context, say so."
- Add explicit citation instructions
- Switch to a more conservative model
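For illustration, a grounding system prompt along these lines can help; the exact wording below is my own, not taken from RAGAS or any library:

```python
# Hypothetical grounding prompt; tune the wording to your domain.
GROUNDED_SYSTEM_PROMPT = """\
You are a support assistant. Answer ONLY using the provided context.
- If the context does not contain the answer, reply: "I don't have that information."
- Cite the context passage you used for each claim, e.g. [1], [2].
- Do not add facts, even if you believe them to be true.

Context:
{context}
"""

prompt = GROUNDED_SYSTEM_PROMPT.format(
    context="[1] The battery lasts up to 12 hours."
)
```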
Low Answer Relevancy → Answers ignore the question
- Emphasize the question more prominently in the prompt
- If too much context is causing the model to lose track of the question, reduce the number of retrieved chunks
- Add a re-ranking step
Low Context Precision → Too many irrelevant chunks being retrieved
- Reduce retrieval top-k (5 → 3)
- Add a reranker (Cohere Rerank, BGE Reranker, etc.)
- Adjust chunk size
Low Context Recall → Retrieval is missing needed information
- Increase retrieval top-k
- Implement Hybrid Search (BM25 + vector)
- Review chunking strategy (overly large chunks dilute relevant information)
The Continuous Improvement Loop
Putting it all together into a sustainable improvement cycle:
1. Measure baseline (run RAGAS)
        |
2. Identify the lowest-scoring metric
        |
3. Improve that metric (adjust chunking / retrieval / prompt)
        |
4. Measure again, confirm the improvement
        |
5. Add a quality gate to CI/CD
        |
6. Automated evaluation on every code change
Conclusion
RAGAS transforms RAG system development from vibes-based to data-based engineering. When you first run it, the scores might be lower than you expect — that's normal. Facing reality is the first step.
Start with a small test set (20-50 questions), add a Quality Gate to CI/CD, and track scores with every change. Within a few weeks, you'll see systematic improvement reflected in numbers you can share with stakeholders.
This four-post series has covered the core pillars of building RAG: choosing an approach → chunking → retrieval → evaluation. Get these four right and you can build a production-quality RAG system.