Author: Youngju Kim (@fjvbn20031)
- Why RAG Evaluation Is Hard
- The Four Metrics RAGAS Measures
- RAGAS Implementation
- Building an Automated Evaluation Pipeline (CI/CD for RAG)
- Generating Evaluation Test Sets Automatically
- Score Interpretation Guide
- The Continuous Improvement Loop
- Conclusion
Why RAG Evaluation Is Hard
"It feels like it's working, but I don't know if it's actually good."
That's the most common RAG engineering frustration I hear. When you're deep in development, you have an intuitive sense that things are improving — but you can't show a number to your team lead or justify a decision based on that gut feeling.
It's also hard to write tests the way you would for traditional software. There's no single right answer. A good response to the same question can vary depending on context.
RAGAS (Retrieval Augmented Generation Assessment) was built to solve this problem. It uses LLMs as judges to quantify different aspects of your RAG system's performance as scores between 0 and 1.
This post covers understanding RAGAS's four core metrics, implementing them with real code, and building an automated evaluation pipeline.
The Four Metrics RAGAS Measures
RAGAS evaluates a RAG system across two dimensions: Retrieval quality and Generation quality.
Full RAG pipeline:

[User Question]
       |
[Retrieval Step]    <- evaluated by Context Precision, Context Recall
       |
[Generation Step]   <- evaluated by Faithfulness, Answer Relevancy
       |
[Final Answer]
Metric 1: Faithfulness
Question: Is the generated answer grounded in the retrieved context?
This is the hallucination detection metric. When the model "makes up" content not present in the context, Faithfulness drops.
How it's calculated:
- Extract individual claims from the generated answer
- An LLM judge determines whether each claim is supported by the retrieved context
- Faithfulness = supported claims / total claims
Example:
Question: "What is the battery life of this product?"
Context: "The battery lasts up to 12 hours. Charging time is 2 hours."
Answer: "The battery lasts 12 hours and the device is also waterproof."
Claim analysis:
- "battery lasts 12 hours" -> supported by context ✓
- "device is also waterproof" -> not in context ✗
Faithfulness = 1/2 = 0.5 (low! hallucination occurred)
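The arithmetic above can be sketched in a few lines. In RAGAS the per-claim verdicts come from an LLM judge; the hard-coded booleans here are stand-ins for those verdicts:

```python
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    """Supported claims divided by total claims."""
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

verdicts = [
    True,   # "battery lasts 12 hours" -> supported by context
    False,  # "device is also waterproof" -> not in context
]
print(faithfulness_score(verdicts))  # 0.5
```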
Metric 2: Answer Relevancy
Question: Does the generated answer actually address the user's question?
This detects answers that are faithful to the context but fail to answer the question.
How it's calculated:
- Generate multiple synthetic questions from the generated answer
- Calculate average cosine similarity between these synthetic questions and the original question
- Answer Relevancy = this similarity score
Example:
Original question: "How long does the battery last?"
Generated answer: "This product is made with premium materials and comes in various colors."
Synthetic questions generated:
- "What materials is this product made of?"
- "What colors does this come in?"
Similarity to original question: very low
Answer Relevancy ≈ 0.1 (low! answer ignores the question)
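A toy sketch of the similarity step, using made-up embedding vectors in place of a real embedding model (the vectors and averaging are illustrative only; RAGAS embeds the LLM-generated synthetic questions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Standard cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Made-up embeddings: the original question vs. two synthetic questions
original_q = [1.0, 0.0, 0.1]
synthetic_qs = [[0.1, 1.0, 0.0], [0.0, 0.9, 0.2]]

# Answer Relevancy = mean similarity of synthetic questions to the original
relevancy = sum(cosine_similarity(original_q, q) for q in synthetic_qs) / len(synthetic_qs)
print(round(relevancy, 2))  # low, because the synthetic questions diverge from the original
```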
Metric 3: Context Precision
Question: Of the retrieved chunks, how many were actually useful?
This measures the "noise" in retrieval. If you retrieved 5 chunks but only 1 was actually relevant, you get a low score.
How it's calculated:
- Weighted sum of Precision@k for the top k retrieved contexts
- Each context's relevance is judged against the ground truth answer
High Context Precision = retrieval is accurate, so the LLM isn't confused by irrelevant information.
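The weighted Precision@k sum can be illustrated with a small helper. The 0/1 relevance verdicts would come from an LLM judge in RAGAS; here they are supplied directly. Note the rank-sensitivity: a relevant chunk ranked first scores higher than the same chunk buried at rank 3.

```python
def context_precision(relevance: list[int]) -> float:
    """relevance[k-1] = 1 if the k-th retrieved chunk is judged relevant.
    Weighted mean of precision@k, taken at the ranks of relevant chunks."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    hits = 0
    for k, rel in enumerate(relevance, start=1):
        hits += rel
        if rel:
            score += hits / k  # precision@k at this relevant rank
    return score / total_relevant

print(context_precision([1, 0, 0]))  # 1.0  (relevant chunk ranked first)
print(context_precision([0, 0, 1]))  # ~0.33 (same chunk, buried at rank 3)
```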
Metric 4: Context Recall
Question: Did retrieval surface all the information needed to answer?
This measures missing information. Did the retrieval step find everything the ground truth answer requires?
How it's calculated:
- Each sentence in the ground truth answer is checked against the retrieved contexts
- Context Recall = sentences supported by contexts / total ground truth sentences
High Context Recall = retrieval is comprehensive, so the LLM has everything it needs.
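As a rough illustration only: the sketch below swaps RAGAS's LLM judge for a naive word-overlap heuristic, just to make the sentence-level ratio concrete. The heuristic and its threshold are invented for this example and are nowhere near what the real metric does.

```python
def sentence_supported(sentence: str, contexts: list[str], min_overlap: float = 0.5) -> bool:
    """Toy stand-in for the LLM judge: a sentence counts as supported if
    enough of its words appear in some retrieved context."""
    words = set(sentence.lower().split())
    for ctx in contexts:
        ctx_words = set(ctx.lower().split())
        if len(words & ctx_words) / max(len(words), 1) >= min_overlap:
            return True
    return False

def context_recall(ground_truth_sentences: list[str], contexts: list[str]) -> float:
    """Supported ground-truth sentences / total ground-truth sentences."""
    supported = [sentence_supported(s, contexts) for s in ground_truth_sentences]
    return sum(supported) / len(supported) if supported else 0.0
```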
RAGAS Implementation
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset.
# Each sample: question + system-generated answer + retrieved contexts + ground_truth
eval_data = {
    "question": [
        "What is the battery life of this product?",
        "What is the return policy?",
        "How long does shipping take?"
    ],
    "answer": [
        "The battery lasts up to 12 hours.",
        "You can request a refund within 30 days of purchase.",
        "Standard shipping takes 3-5 business days."
    ],
    "contexts": [
        # Retrieved context chunks for each question
        [
            "The battery lasts up to 12 hours. Charging time is 2 hours.",
            "Product dimensions are 15cm x 10cm x 2cm."
        ],
        [
            "Customers can request a refund within 30 days of purchase.",
            "Refunds are processed within 5-7 business days."
        ],
        [
            "Standard shipping: 3-5 business days. Express shipping: 1-2 business days.",
            "Shipping fee of $5 applies to orders under $50."
        ]
    ],
    "ground_truth": [
        "Battery life is up to 12 hours, and charging takes 2 hours.",
        "Returns are accepted within 30 days of purchase and processed in 5-7 business days.",
        "Standard shipping takes 3-5 business days, express takes 1-2 business days."
    ]
}

dataset = Dataset.from_dict(eval_data)

# Run evaluation
results = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ]
)
print(results)
# Sample output:
# {
#     'faithfulness': 0.95,
#     'answer_relevancy': 0.88,
#     'context_precision': 0.82,
#     'context_recall': 0.91
# }

# Convert to DataFrame for detailed analysis
df = results.to_pandas()
print(df.to_string())
Configuring the LLM Judge
RAGAS defaults to OpenAI as the judge LLM, but you can swap it for local models or other providers:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Use gpt-4o-mini to reduce evaluation costs
judge_llm = LangchainLLMWrapper(
    ChatOpenAI(model="gpt-4o-mini", temperature=0)
)

# Embeddings used for the Answer Relevancy calculation
embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=judge_llm,
    embeddings=embeddings,
)
Building an Automated Evaluation Pipeline (CI/CD for RAG)
Here's a pipeline that automatically runs evaluation whenever your RAG system changes.
# evaluate_rag.py - evaluation script run in CI/CD
import json
import sys
from datetime import datetime
from pathlib import Path

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall


def load_test_dataset(path: str) -> Dataset:
    """Load saved test dataset from disk."""
    with open(path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    return Dataset.from_dict(data)


def run_rag_on_testset(testset: Dataset, rag_chain) -> Dataset:
    """Run each test question through RAG, collect answers and contexts."""
    answers = []
    contexts_list = []
    for question in testset["question"]:
        result = rag_chain.invoke({"query": question})
        answers.append(result["result"])
        contexts_list.append([
            doc.page_content
            for doc in result["source_documents"]
        ])
    return testset.add_column("answer", answers).add_column("contexts", contexts_list)


def evaluate_and_gate(
    dataset: Dataset,
    thresholds: dict,
    save_path: str = None
) -> bool:
    """
    Run evaluation and apply a pass/fail gate based on thresholds.

    thresholds example:
        {
            "faithfulness": 0.8,
            "answer_relevancy": 0.75,
            "context_precision": 0.7,
            "context_recall": 0.75
        }
    """
    results = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )

    print("\n" + "=" * 60)
    print(f"RAG Evaluation Results - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print("=" * 60)

    all_passed = True
    for metric_name, threshold in thresholds.items():
        score = results[metric_name]
        passed = score >= threshold
        status = "PASS" if passed else "FAIL"
        print(f"{metric_name:25s}: {score:.3f} (threshold: {threshold}) [{status}]")
        if not passed:
            all_passed = False

    # Save results for history tracking
    if save_path:
        result_data = {
            "timestamp": datetime.now().isoformat(),
            "scores": {k: float(v) for k, v in results.items()},
            "passed": all_passed
        }
        Path(save_path).parent.mkdir(parents=True, exist_ok=True)
        with open(save_path, 'a') as f:
            f.write(json.dumps(result_data) + "\n")

    return all_passed


if __name__ == "__main__":
    testset = load_test_dataset("tests/rag_testset.json")
    # rag_chain = load_your_rag_chain()
    # testset = run_rag_on_testset(testset, rag_chain)

    thresholds = {
        "faithfulness": 0.80,
        "answer_relevancy": 0.75,
        "context_precision": 0.70,
        "context_recall": 0.75,
    }

    passed = evaluate_and_gate(
        testset,
        thresholds=thresholds,
        save_path="results/rag_eval_history.jsonl"
    )
    sys.exit(0 if passed else 1)  # Fail the build in CI/CD if below threshold
GitHub Actions integration:
# .github/workflows/rag-evaluation.yml
name: RAG Quality Gate

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  evaluate-rag:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install ragas langchain openai datasets
      - name: Run RAG Evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python evaluate_rag.py
Generating Evaluation Test Sets Automatically
Building test sets by hand is time-consuming. RAGAS's TestsetGenerator can automate this.
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_community.document_loaders import DirectoryLoader

# Load your documents
loader = DirectoryLoader("./docs", glob="**/*.md")
documents = loader.load()

# Initialize test set generator
generator = TestsetGenerator.with_openai()

# Generate test set with specified question types
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=50,
    distributions={
        simple: 0.5,          # Simple factual questions: 50%
        reasoning: 0.25,      # Questions requiring reasoning: 25%
        multi_context: 0.25   # Questions needing multiple docs: 25%
    }
)

# Save to file
testset_df = testset.to_pandas()
testset_df.to_json("tests/rag_testset.json", orient='records')

print(f"Generated {len(testset_df)} questions")
print("\nSample questions:")
print(testset_df[['question', 'ground_truth']].head())
Example generated question types:
- Simple: "How many days is the return window?" (direct factual retrieval)
- Reasoning: "Which product has a longer battery life, A or B?" (comparison reasoning)
- Multi-context: "Which product in the current lineup has the highest storage capacity?" (requires multiple docs)
Score Interpretation Guide
| Metric | Danger Zone | Acceptable | Good | Excellent |
|---|---|---|---|---|
| Faithfulness | below 0.5 | 0.5-0.7 | 0.7-0.9 | above 0.9 |
| Answer Relevancy | below 0.6 | 0.6-0.75 | 0.75-0.9 | above 0.9 |
| Context Precision | below 0.5 | 0.5-0.7 | 0.7-0.85 | above 0.85 |
| Context Recall | below 0.6 | 0.6-0.75 | 0.75-0.9 | above 0.9 |
How to Improve Low Scores
Low Faithfulness → Hallucinations occurring
- Strengthen system prompt: "Only answer based on the provided context. If the answer is not in the context, say so."
- Add explicit citation instructions
- Switch to a more conservative model
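For illustration, a grounding system prompt along these lines can help; the exact wording below is my own, not taken from RAGAS or any library:

```python
# Hypothetical grounding prompt; tune the wording to your domain.
GROUNDED_SYSTEM_PROMPT = """\
You are a support assistant. Answer ONLY using the provided context.
- If the context does not contain the answer, reply: "I don't have that information."
- Cite the context passage you used for each claim, e.g. [1], [2].
- Do not add facts, even if you believe them to be true.

Context:
{context}
"""

prompt = GROUNDED_SYSTEM_PROMPT.format(
    context="[1] The battery lasts up to 12 hours."
)
```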
Low Answer Relevancy → Answers ignore the question
- Emphasize the question more prominently in the prompt
- If too much context is causing the model to lose track of the question, reduce the number of retrieved chunks
- Add a re-ranking step
Low Context Precision → Too many irrelevant chunks being retrieved
- Reduce retrieval top-k (5 → 3)
- Add a reranker (Cohere Rerank, BGE Reranker, etc.)
- Adjust chunk size
Low Context Recall → Retrieval is missing needed information
- Increase retrieval top-k
- Implement Hybrid Search (BM25 + vector)
- Review chunking strategy (overly large chunks dilute relevant information)
The Continuous Improvement Loop
Putting it all together into a sustainable improvement cycle:
1. Measure baseline (run RAGAS)
        |
2. Identify the lowest-scoring metric
        |
3. Improve that metric (adjust chunking / retrieval / prompt)
        |
4. Measure again, confirm the improvement
        |
5. Add a quality gate to CI/CD
        |
6. Automated evaluation on every code change
Conclusion
RAGAS transforms RAG system development from vibes-based to data-based engineering. When you first run it, the scores might be lower than you expect — that's normal. Facing reality is the first step.
Start with a small test set (20-50 questions), add a Quality Gate to CI/CD, and track scores with every change. Within a few weeks, you'll see systematic improvement reflected in numbers you can share with stakeholders.
This four-post series has covered the core pillars of building RAG: choosing an approach → chunking → retrieval → evaluation. Get these four right and you can build a production-quality RAG system.