RAGAS Complete Guide: How to Quantitatively Evaluate Your RAG System

Why RAG Evaluation Is Hard
The Four Metrics RAGAS Measures
RAGAS Implementation
- Configuring the LLM Judge
Building an Automated Evaluation Pipeline (CI/CD for RAG)
Generating Evaluation Test Sets Automatically
Score Interpretation Guide
- How to Improve Low Scores
The Continuous Improvement Loop
Conclusion

Why RAG Evaluation Is Hard

"It feels like it's working, but I don't know if it's actually good."

That's the most common RAG engineering frustration I hear. When you're deep in development, you have an intuitive sense that things are improving — but you can't show a number to your team lead or justify a decision based on that gut feeling.

It's also hard to write tests the way you would for traditional software. There's no single right answer. A good response to the same question can vary depending on context.

RAGAS (RAG Assessment) was built to solve this problem. It uses LLMs as judges to quantify different aspects of your RAG system's performance as scores between 0 and 1.

This post explains RAGAS's four core metrics, shows how to run them in code, and builds an automated evaluation pipeline around them.

The Four Metrics RAGAS Measures

RAGAS evaluates a RAG system across two dimensions: Retrieval quality and Generation quality.

Full RAG pipeline:

[User Question]
        |
[Retrieval Step] <- Evaluated by Context Precision, Context Recall
        |
[Generation Step] <- Evaluated by Faithfulness, Answer Relevancy
        |
[Final Answer]

Metric 1: Faithfulness

Question: Is the generated answer grounded in the retrieved context?

This is the hallucination detection metric. When the model "makes up" content not present in the context, Faithfulness drops.

How it's calculated:

1. Extract individual claims from the generated answer.
2. An LLM judge determines whether each claim is supported by the retrieved context.
3. Faithfulness = supported claims / total claims

Example:

Question: "What is the battery life of this product?"
Context: "The battery lasts up to 12 hours. Charging time is 2 hours."
Answer: "The battery lasts 12 hours and the device is also waterproof."

Claim analysis:
- "battery lasts 12 hours" -> supported by context ✓
- "device is also waterproof" -> not in context ✗

Faithfulness = 1/2 = 0.5  (low! hallucination occurred)
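
To make the ratio concrete, here's a minimal sketch that assumes an LLM judge has already produced the per-claim verdicts. `faithfulness_score` and the hard-coded verdicts are hypothetical names for illustration, not part of the RAGAS API.

```python
def faithfulness_score(verdicts: list[bool]) -> float:
    """Fraction of claims the judge marked as supported by the context."""
    if not verdicts:
        raise ValueError("need at least one claim")
    return sum(verdicts) / len(verdicts)


# Verdicts for the battery example: first claim supported, second not.
claims = {
    "battery lasts 12 hours": True,
    "device is also waterproof": False,
}
score = faithfulness_score(list(claims.values()))
print(score)  # 0.5
```

The expensive part in practice is the claim extraction and the judging, which is why the judge model matters so much for this metric.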

Metric 2: Answer Relevancy

Question: Does the generated answer actually address the user's question?

This detects answers that are faithful to the context but fail to answer the question.

How it's calculated:

1. Generate multiple synthetic questions from the generated answer.
2. Embed the synthetic questions and the original question, then compute the cosine similarity between each synthetic question and the original.
3. Answer Relevancy = the average of those similarities

Example:

Original question: "How long does the battery last?"
Generated answer: "This product is made with premium materials and comes in various colors."

Synthetic questions generated:
- "What materials is this product made of?"
- "What colors does this come in?"

Similarity to original question: very low
Answer Relevancy ≈ 0.1  (low! answer ignores the question)
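
The similarity step can be sketched with plain Python. This assumes the embeddings already exist; the 3-dimensional vectors below are made up for illustration (real ones come from an embedding model), and `cosine` and `answer_relevancy_score` are hypothetical helper names.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def answer_relevancy_score(question_vec: list[float], synthetic_vecs: list[list[float]]) -> float:
    """Mean cosine similarity between the original question and each synthetic one."""
    return sum(cosine(question_vec, v) for v in synthetic_vecs) / len(synthetic_vecs)


# Toy embeddings: the on-topic synthetic questions point the same way as the original.
original = [1.0, 0.0, 0.0]
on_topic = [[0.9, 0.1, 0.0], [0.8, 0.0, 0.2]]
off_topic = [[0.0, 1.0, 0.0], [0.1, 0.0, 1.0]]

print(answer_relevancy_score(original, on_topic) > answer_relevancy_score(original, off_topic))  # True
```

Because the score depends on embeddings, changing the embedding model can shift Answer Relevancy even when the answers stay the same.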

Metric 3: Context Precision

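Question: Of the retrieved chunks, how many were actually useful, and are the useful ones ranked first?

This measures the "noise" in retrieval, with attention to rank. If you retrieved 5 chunks and the only relevant one sits in last place, you get a low score (0.2); if that same chunk is ranked first, the score is 1.0.

How it's calculated:

1. Each retrieved context's relevance is judged against the ground truth answer.
2. Precision@k is computed at every rank k that holds a relevant context.
3. Context Precision = sum of Precision@k at the relevant ranks / number of relevant contexts in the top k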

High Context Precision = retrieval is accurate, so the LLM isn't confused by irrelevant information.
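
As a rough illustration of the Precision@k weighting, here's a sketch that assumes the per-rank relevance verdicts are already available. It averages Precision@k over the ranks that hold a relevant chunk, which is how I understand the metric to be defined; `context_precision_score` is a hypothetical helper, not a RAGAS function.

```python
def context_precision_score(relevance: list[bool]) -> float:
    """Average Precision@k over the ranks that hold a relevant chunk.

    relevance[i] is True when the chunk at rank i + 1 was judged relevant.
    """
    hits = 0
    precision_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k  # Precision@k at this relevant rank
    return precision_sum / hits if hits else 0.0


print(context_precision_score([True, False, False, False, False]))  # 1.0
print(context_precision_score([False, False, False, False, True]))  # 0.2
```

Both calls have one relevant chunk out of five, yet the scores differ: under this definition the metric penalizes relevant chunks that are ranked late, which is why a reranker can help even when top-k stays the same.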

Metric 4: Context Recall

Question: Did retrieval surface all the information needed to answer?

This measures missing information. Did the retrieval step find everything the ground truth answer requires?

How it's calculated:

1. Each sentence in the ground truth answer is checked against the retrieved contexts.
2. Context Recall = sentences supported by contexts / total ground truth sentences

High Context Recall = retrieval is comprehensive, so the LLM has everything it needs.
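
Here's a small sketch of how recall reacts to the number of chunks you keep. It assumes a judge has already told us which retrieved chunk (by rank) supports each ground truth sentence; `context_recall_at_k` and the ranks below are hypothetical.

```python
from typing import Optional


def context_recall_at_k(support_rank: list[Optional[int]], top_k: int) -> float:
    """Recall when only the first top_k chunks are kept.

    support_rank[i] is the 1-based rank of the chunk supporting ground truth
    sentence i, or None when no retrieved chunk supports it.
    """
    supported = sum(1 for rank in support_rank if rank is not None and rank <= top_k)
    return supported / len(support_rank)


# Sentence 1 is backed by the chunk at rank 1, sentence 2 by the chunk at rank 4.
ranks = [1, 4]
print(context_recall_at_k(ranks, top_k=3))  # 0.5
print(context_recall_at_k(ranks, top_k=5))  # 1.0
```

Widening top-k recovers the second sentence here, at the cost of pulling in more chunks that may hurt Context Precision.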

RAGAS Implementation

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset
# Each sample: question + system-generated answer + retrieved contexts + ground_truth
eval_data = {
    "question": [
        "What is the battery life of this product?",
        "What is the return policy?",
        "How long does shipping take?"
    ],
    "answer": [
        "The battery lasts up to 12 hours.",
        "You can request a refund within 30 days of purchase.",
        "Standard shipping takes 3-5 business days."
    ],
    "contexts": [
        # Retrieved context chunks for each question
        [
            "The battery lasts up to 12 hours. Charging time is 2 hours.",
            "Product dimensions are 15cm x 10cm x 2cm."
        ],
        [
            "Customers can request a refund within 30 days of purchase.",
            "Refunds are processed within 5-7 business days."
        ],
        [
            "Standard shipping: 3-5 business days. Express shipping: 1-2 business days.",
            "Shipping fee of $5 applies to orders under $50."
        ]
    ],
    "ground_truth": [
        "Battery life is up to 12 hours, and charging takes 2 hours.",
        "Returns are accepted within 30 days of purchase and processed in 5-7 business days.",
        "Standard shipping takes 3-5 business days, express takes 1-2 business days."
    ]
}

dataset = Dataset.from_dict(eval_data)

# Run evaluation
results = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ]
)

print(results)
# Sample output:
# {
#   'faithfulness': 0.95,
#   'answer_relevancy': 0.88,
#   'context_precision': 0.82,
#   'context_recall': 0.91
# }

# Convert to DataFrame for detailed analysis
df = results.to_pandas()
print(df.to_string())
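
Aggregate scores hide which questions are dragging a metric down. The sketch below assumes you've turned the DataFrame into a list of row dicts (for example with `df.to_dict("records")`); the rows and the `worst_samples` helper are hypothetical.

```python
def worst_samples(rows: list[dict], metric: str, n: int = 2) -> list[dict]:
    """Return the n rows with the lowest score for the given metric."""
    return sorted(rows, key=lambda row: row[metric])[:n]


# Hypothetical per-sample scores, one dict per evaluated question.
rows = [
    {"question": "What is the battery life of this product?", "faithfulness": 1.0},
    {"question": "What is the return policy?", "faithfulness": 0.5},
    {"question": "How long does shipping take?", "faithfulness": 0.9},
]
for row in worst_samples(rows, "faithfulness"):
    print(f'{row["faithfulness"]:.2f}  {row["question"]}')
```

Reading the two or three worst samples by hand is usually the fastest way to tell a retrieval problem from a prompt problem.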

Configuring the LLM Judge

RAGAS defaults to OpenAI as the judge LLM, but you can swap it for local models or other providers:

from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper

# Use gpt-4o-mini to reduce evaluation costs
judge_llm = LangchainLLMWrapper(
    ChatOpenAI(model="gpt-4o-mini", temperature=0)
)

# Embeddings used for Answer Relevancy calculation
embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=judge_llm,
    embeddings=embeddings,
)

Building an Automated Evaluation Pipeline (CI/CD for RAG)

Here's a pipeline that automatically runs evaluation whenever your RAG system changes.

# evaluate_rag.py - evaluation script run in CI/CD

import json
import sys
from datetime import datetime
from pathlib import Path
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset


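def load_test_dataset(path: str) -> Dataset:
    """Load saved test dataset from disk"""
    with open(path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    # to_json(orient='records') saves a list of row dicts, which needs
    # Dataset.from_list; a dict of columns still goes through from_dict.
    if isinstance(data, list):
        return Dataset.from_list(data)
    return Dataset.from_dict(data)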


def run_rag_on_testset(testset: Dataset, rag_chain) -> Dataset:
    """Run each test question through RAG, collect answers and contexts"""
    answers = []
    contexts_list = []

    for question in testset["question"]:
        result = rag_chain.invoke({"query": question})
        answers.append(result["result"])
        contexts_list.append([
            doc.page_content
            for doc in result["source_documents"]
        ])

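    # Drop columns that would collide (a generated test set already has "contexts")
    stale = [name for name in ("answer", "contexts") if name in testset.column_names]
    if stale:
        testset = testset.remove_columns(stale)
    return testset.add_column("answer", answers).add_column("contexts", contexts_list)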


def evaluate_and_gate(
    dataset: Dataset,
    thresholds: dict,
    save_path: str | None = None
) -> bool:
    """
    Run evaluation and apply pass/fail gate based on thresholds.

    thresholds example:
    {
        "faithfulness": 0.8,
        "answer_relevancy": 0.75,
        "context_precision": 0.7,
        "context_recall": 0.75
    }
    """
    results = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )

    print("\n" + "="*60)
    print(f"RAG Evaluation Results - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print("="*60)

    all_passed = True
    for metric_name, threshold in thresholds.items():
        score = results[metric_name]
        passed = score >= threshold
        status = "PASS" if passed else "FAIL"
        print(f"{metric_name:25s}: {score:.3f} (threshold: {threshold}) [{status}]")
        if not passed:
            all_passed = False

    # Save results for history tracking
    if save_path:
        result_data = {
            "timestamp": datetime.now().isoformat(),
            "scores": {k: float(v) for k, v in results.items()},
            "passed": all_passed
        }
        Path(save_path).parent.mkdir(parents=True, exist_ok=True)
        with open(save_path, 'a') as f:
            f.write(json.dumps(result_data) + "\n")

    return all_passed


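if __name__ == "__main__":
    testset = load_test_dataset("tests/rag_testset.json")
    # Plug in your own chain here; evaluation needs the "answer" and "contexts"
    # columns that run_rag_on_testset adds.
    # rag_chain = load_your_rag_chain()
    # testset = run_rag_on_testset(testset, rag_chain)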

    thresholds = {
        "faithfulness": 0.80,
        "answer_relevancy": 0.75,
        "context_precision": 0.70,
        "context_recall": 0.75,
    }

    passed = evaluate_and_gate(
        testset,
        thresholds=thresholds,
        save_path="results/rag_eval_history.jsonl"
    )

    sys.exit(0 if passed else 1)  # Fail the build in CI/CD if below threshold
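
Fixed thresholds catch hard failures, but a metric can slide for weeks while staying above the bar. Here's a sketch that compares the two most recent entries of the JSONL history written above. `load_history`, `regressions`, and the 0.02 tolerance are assumptions for illustration, and the sample lines are invented.

```python
import json


def load_history(lines: list[str]) -> list[dict]:
    """Parse JSONL lines in the shape written by evaluate_and_gate."""
    return [json.loads(line) for line in lines if line.strip()]


def regressions(history: list[dict], tolerance: float = 0.02) -> dict:
    """Metrics whose latest score dropped more than `tolerance` below the previous run."""
    if len(history) < 2:
        return {}
    previous, latest = history[-2]["scores"], history[-1]["scores"]
    return {
        name: (previous[name], score)
        for name, score in latest.items()
        if name in previous and previous[name] - score > tolerance
    }


# Two hypothetical runs in the same shape as results/rag_eval_history.jsonl.
sample = [
    '{"timestamp": "2024-01-01T10:00:00", "scores": {"faithfulness": 0.91, "context_recall": 0.80}, "passed": true}',
    '{"timestamp": "2024-01-02T10:00:00", "scores": {"faithfulness": 0.84, "context_recall": 0.81}, "passed": true}',
]
print(regressions(load_history(sample)))  # {'faithfulness': (0.91, 0.84)}
```

Both runs pass the 0.80 faithfulness gate, yet the drop from 0.91 to 0.84 is the kind of change you'd want flagged in a pull request. Since LLM judges are not perfectly deterministic, the tolerance should be wide enough to absorb run-to-run noise.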

GitHub Actions integration:

# .github/workflows/rag-evaluation.yml
name: RAG Quality Gate

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  evaluate-rag:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
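        # The imports in this post follow the ragas 0.1.x API, hence the pin
        run: pip install "ragas<0.2" langchain langchain-openai langchain-community openai datasets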
      - name: Run RAG Evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python evaluate_rag.py

Generating Evaluation Test Sets Automatically

Building test sets by hand is time-consuming. RAGAS's TestsetGenerator can automate this.

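# Note: these import paths match the ragas 0.1.x API; later releases reorganized the testset module
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context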
from langchain_community.document_loaders import DirectoryLoader

# Load your documents
loader = DirectoryLoader("./docs", glob="**/*.md")
documents = loader.load()

# Initialize test set generator
generator = TestsetGenerator.with_openai()

# Generate test set with specified question types
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=50,
    distributions={
        simple: 0.5,        # Simple factual questions 50%
        reasoning: 0.25,    # Questions requiring reasoning 25%
        multi_context: 0.25 # Questions needing multiple docs 25%
    }
)

# Save to file
testset_df = testset.to_pandas()
testset_df.to_json("tests/rag_testset.json", orient='records')

print(f"Generated {len(testset_df)} questions")
print("\nSample questions:")
print(testset_df[['question', 'ground_truth']].head())

Example generated question types:

Simple: "How many days is the return window?" (direct factual retrieval)
Reasoning: "Which product has a longer battery life, A or B?" (comparison reasoning)
Multi-context: "Which product in the current lineup has the highest storage capacity?" (requires multiple docs)
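
Generated questions are worth a quick sanity pass before they become your benchmark. The sketch below assumes the test set has been loaded as a list of row dicts with `question` and `ground_truth` keys; `clean_testset` and the sample rows are hypothetical.

```python
REQUIRED = ("question", "ground_truth")


def clean_testset(records: list[dict]) -> list[dict]:
    """Drop rows with missing fields and questions that repeat (ignoring case)."""
    seen = set()
    cleaned = []
    for row in records:
        if any(not str(row.get(key) or "").strip() for key in REQUIRED):
            continue  # incomplete row
        key = row["question"].strip().lower()
        if key in seen:
            continue  # duplicate question
        seen.add(key)
        cleaned.append(row)
    return cleaned


records = [
    {"question": "How many days is the return window?", "ground_truth": "30 days."},
    {"question": "how many days is the return window?", "ground_truth": "30 days."},
    {"question": "Which product has a longer battery life, A or B?", "ground_truth": ""},
]
print(len(clean_testset(records)))  # 1
```

A manual skim of the surviving rows is still a good idea, since an automated filter can't tell whether a generated ground truth is actually correct.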

Score Interpretation Guide

Metric	Danger Zone	Acceptable	Good	Excellent
Faithfulness	below 0.5	0.5-0.7	0.7-0.9	0.9 and above
Answer Relevancy	below 0.6	0.6-0.75	0.75-0.9	0.9 and above
Context Precision	below 0.5	0.5-0.7	0.7-0.85	0.85 and above
Context Recall	below 0.6	0.6-0.75	0.75-0.9	0.9 and above
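
If you want these bands in a report or dashboard, the table translates into a small lookup. This is a sketch: the boundaries are copied from the table, and treating a score that lands exactly on a boundary as the higher band is my assumption.

```python
# Lower bounds of the Acceptable / Good / Excellent bands, copied from the table.
BANDS = {
    "faithfulness": (0.5, 0.7, 0.9),
    "answer_relevancy": (0.6, 0.75, 0.9),
    "context_precision": (0.5, 0.7, 0.85),
    "context_recall": (0.6, 0.75, 0.9),
}
LABELS = ("Danger Zone", "Acceptable", "Good", "Excellent")


def band(metric: str, score: float) -> str:
    """Label a score; a value on a boundary falls in the higher band."""
    passed = sum(score >= bound for bound in BANDS[metric])
    return LABELS[passed]


print(band("faithfulness", 0.95))       # Excellent
print(band("context_precision", 0.82))  # Good
```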

How to Improve Low Scores

Low Faithfulness → Hallucinations occurring

- Strengthen system prompt: "Only answer based on the provided context. If the answer is not in the context, say so."
- Add explicit citation instructions
- Switch to a more conservative model

Low Answer Relevancy → Answers ignore the question

- Emphasize the question more prominently in the prompt
- If too much context is causing the model to lose track of the question, reduce the number of retrieved chunks
- Add a re-ranking step

Low Context Precision → Too many irrelevant chunks being retrieved

- Reduce retrieval top-k (5 → 3)
- Add a reranker (Cohere Rerank, BGE Reranker, etc.)
- Adjust chunk size

Low Context Recall → Retrieval is missing needed information

- Increase retrieval top-k
- Implement Hybrid Search (BM25 + vector)
- Review chunking strategy (overly large chunks dilute relevant information)
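
For the Hybrid Search suggestion, one common way to merge a BM25 ranking with a vector ranking is reciprocal rank fusion (RRF). That choice is mine, not something this post prescribes, and the sketch assumes each retriever has already returned an ordered list of chunk ids. The ids and the constant `k=60` are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk ids; each list contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


bm25_hits = ["c3", "c1", "c7"]
vector_hits = ["c1", "c4", "c3"]
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))  # ['c1', 'c3', 'c4', 'c7']
```

Chunks found by both retrievers rise to the top, while chunks found by only one still make the list, which is what helps recall.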

The Continuous Improvement Loop

Putting it all together into a sustainable improvement cycle:

1. Measure baseline (run RAGAS)
        |
2. Identify lowest scoring metric
        |
3. Improve that metric (adjust chunking/retrieval/prompt)
        |
4. Measure again, confirm improvement
        |
5. Add Quality Gate to CI/CD
        |
6. Automated evaluation on every code change
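
Step 2 can be automated once you have scores and thresholds side by side. This sketch picks the metric with the smallest margin over its threshold rather than the lowest raw score, which is a design choice on my part; `next_focus` is a hypothetical helper, and the numbers reuse the sample output and thresholds from earlier in this post.

```python
def next_focus(scores: dict[str, float], thresholds: dict[str, float]) -> str:
    """Metric with the smallest margin over (or largest shortfall against) its threshold."""
    return min(scores, key=lambda name: scores[name] - thresholds[name])


scores = {
    "faithfulness": 0.95,
    "answer_relevancy": 0.88,
    "context_precision": 0.82,
    "context_recall": 0.91,
}
thresholds = {
    "faithfulness": 0.80,
    "answer_relevancy": 0.75,
    "context_precision": 0.70,
    "context_recall": 0.75,
}
print(next_focus(scores, thresholds))  # context_precision
```

Comparing margins instead of raw scores keeps the loop from chasing a metric whose bar was deliberately set lower.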

Conclusion

RAGAS transforms RAG system development from vibes-based to data-based engineering. When you first run it, the scores might be lower than you expect — that's normal. Facing reality is the first step.

Start with a small test set (20-50 questions), add a Quality Gate to CI/CD, and track scores with every change. Within a few weeks, you'll see systematic improvement reflected in numbers you can share with stakeholders.

This four-post series has covered the core pillars of building RAG: choosing an approach → chunking → retrieval → evaluation. Get these four right and you can build a production-quality RAG system.