Building a Chatbot Evaluation Framework: LLM-as-Judge, RAGAS, and Automated Testing Pipelines


Introduction

"We changed the prompt and the answers seem better, but are they really?" When operating an LLM-based chatbot, answering this question objectively is surprisingly difficult. Having humans review every response does not scale, and simple keyword matching fails to capture the diversity of LLM outputs.

Since 2025, the LLM evaluation ecosystem has matured rapidly. RAGAS standardized RAG pipeline-specific metrics, DeepEval popularized pytest-style LLM testing, and LangSmith unified tracing and evaluation on a single platform. The most impactful innovation is the LLM-as-Judge pattern, where a powerful LLM automatically evaluates outputs from other LLMs. Research has shown that sophisticated judge models can achieve 85% agreement with human judgment, which actually surpasses the 81% inter-annotator agreement among human evaluators themselves.

This guide covers the entire process of building a chatbot evaluation framework from scratch: designing evaluation metrics, leveraging the RAGAS framework, implementing LLM-as-Judge, constructing golden datasets, integrating with CI/CD pipelines, running A/B tests, and addressing evaluation bias issues encountered in production.

The Challenge of Chatbot Evaluation

LLM-based chatbot evaluation is fundamentally different from traditional software testing because of non-deterministic outputs. The same input can generate different answers every time, and the very concept of a "correct answer" is ambiguous.

Why Traditional Testing Is Not Enough

  • Output diversity: Dozens of semantically equivalent but differently worded correct answers exist for the same question
  • Context dependency: In multi-turn conversations, the appropriate answer changes based on prior context
  • Subjective quality: "Good answer" criteria are multi-dimensional, spanning accuracy, usefulness, tone, and conciseness
  • Hallucination detection: Content that reads naturally but is factually incorrect must be automatically identified

The Evaluation Pyramid: A Three-Layer Strategy

Effective chatbot evaluation requires combining three layers:

  1. Offline automated evaluation (every deployment): Golden dataset regression testing, RAGAS metrics
  2. LLM-as-Judge deep evaluation (weekly/per sprint): Fine-grained quality assessment on complex scenarios
  3. Online evaluation (continuous): User feedback, A/B testing, production monitoring

Evaluation Metrics Framework

Chatbot evaluation metrics are categorized into four dimensions.

Correctness

Evaluates whether the generated answer is factually accurate. Measures factual correctness by comparing against reference answers in the golden dataset, using RAGAS Factual Correctness or Semantic Similarity metrics.

Relevancy

Measures whether the answer appropriately addresses the user question. Detects cases where the answer includes irrelevant information or misses key points. RAGAS Answer Relevancy metric covers this dimension.

Faithfulness

A particularly important metric for RAG systems. Verifies whether the generated answer is grounded in retrieved contexts, detecting hallucinations where the model fabricates content not present in the context. This is one of the core RAGAS metrics.

Harmfulness

Checks whether the answer contains harmful, biased, or inappropriate content, personal information, or offensive language. Safety evaluation operates in conjunction with guardrails.

Deep Dive into the RAGAS Framework

RAGAS (Retrieval Augmented Generation Assessment) is a framework that can evaluate RAG pipelines even without reference answers. It uses LLMs to independently measure the quality of each stage in retrieval and generation.

Core RAGAS Metrics

  • Faithfulness: Determines whether each sentence in the answer can be inferred from the context. Values range from 0 to 1, with higher values indicating fewer hallucinations.
  • Answer Relevancy: Measures how relevant the answer is to the question. It generates questions from the answer in reverse and computes similarity with the original question.
  • Context Precision: Measures the proportion of retrieved documents that are actually relevant. The score decreases when many irrelevant documents are retrieved.
  • Context Recall: Measures whether the information needed to derive the correct answer is included in the retrieval results.
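
The Answer Relevancy mechanism above can be sketched without the framework: embed the original question, embed the questions an LLM generated back from the answer, and take the mean cosine similarity. The three-dimensional "embeddings" below are illustrative stand-ins for real embedding vectors, which have hundreds of dimensions.

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def answer_relevancy_score(
    question_embedding: list[float],
    generated_question_embeddings: list[list[float]],
) -> float:
    """Mean cosine similarity between the original question and the
    questions generated back from the answer (RAGAS-style)."""
    sims = [
        cosine_similarity(question_embedding, g)
        for g in generated_question_embeddings
    ]
    return sum(sims) / len(sims)

# Illustrative vectors only; real scores come from an embedding model
original = [1.0, 0.0, 0.0]
regenerated = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]]
print(round(answer_relevancy_score(original, regenerated), 3))  # → 0.994
```

An answer that drifts off-topic yields regenerated questions whose embeddings sit far from the original question, pulling the mean similarity down.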

Practical RAGAS Implementation

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset
eval_data = {
    "question": [
        "What is the company's annual leave policy?",
        "How do I apply for remote work?",
        "How many days of family event leave are available?",
    ],
    "answer": [
        "Employees with over 1 year of service receive 15 days of annual leave. "
        "After 3 years of service, an additional day is added every 2 years.",
        "Remote work requires team lead approval followed by application "
        "through the HR system. Up to 3 days per week are allowed, "
        "with Monday and Friday being mandatory in-office days.",
        "Family event leave includes 5 days for marriage, "
        "10 days for spouse's childbirth, 5 days for parent's death, "
        "and 3 days for sibling's death.",
    ],
    "contexts": [
        [
            "Annual Leave Policy: Employees with over 1 year of service "
            "receive 15 paid annual leave days. After 3 years of service, "
            "1 additional day is added every 2 years. "
            "Unused leave does not carry over."
        ],
        [
            "Remote Work Guide: Employees wishing to work remotely must "
            "obtain prior approval from their team lead and apply through "
            "the HR portal. Up to 3 remote work days per week are allowed. "
            "Monday and Friday are mandatory in-office days for all employees."
        ],
        [
            "Family Event Leave: Marriage 5 days, spouse childbirth 10 days, "
            "parent death 5 days, grandparent death 3 days, "
            "sibling death 3 days."
        ],
    ],
    "ground_truth": [
        "15 days after 1 year, plus 1 additional day every 2 years after 3 years",
        "Team lead approval then HR system application, up to 3 days, Mon/Fri in-office required",
        "Marriage 5 days, spouse childbirth 10 days, parent death 5 days, sibling death 3 days",
    ],
}

dataset = Dataset.from_dict(eval_data)

# Run RAGAS evaluation
result = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ],
)

print(result)
# Example results:
# faithfulness: 0.92
# answer_relevancy: 0.88
# context_precision: 0.95
# context_recall: 0.90

Extending RAGAS with Custom Metrics

Beyond the built-in metrics, you can add domain-specific metrics. For example, a customer support chatbot might need metrics like "includes empathetic expression" or "provides next-step guidance."

from ragas.metrics.base import MetricWithLLM
from dataclasses import dataclass, field

@dataclass
class EmpathyScore(MetricWithLLM):
    """Custom metric that evaluates empathy level in support responses (0-1).

    Note: the MetricWithLLM interface changes between RAGAS versions;
    treat this as a sketch and adapt it to your installed version.
    """
    name: str = "empathy_score"
    evaluation_mode: str = "qa"

    async def _ascore(self, row, callbacks=None):
        prompt = (
            "Evaluate the following customer support response for "
            "appropriate empathetic expression on a scale of 0 to 1.\n\n"
            f"Question: {row['question']}\n"
            f"Answer: {row['answer']}\n\n"
            "Respond with only the numeric score."
        )
        response = await self.llm.agenerate_text(prompt)
        try:
            return float(response.generations[0][0].text.strip())
        except (ValueError, IndexError):
            return 0.0

Implementing the LLM-as-Judge Pattern

LLM-as-Judge uses a powerful LLM (such as GPT-4o or Claude) as a judge to evaluate the outputs of other LLMs. As noted in the introduction, strong judge models agree with human judgment at rates comparable to, or better than, the rate at which human annotators agree with each other, which makes the pattern reliable enough for routine quality assessment.

Two Evaluation Approaches

  1. Direct Assessment (Pointwise Scoring): The judge evaluates individual responses and assigns scores
  2. Pairwise Comparison: The judge compares two responses and selects the better one

Direct Assessment Implementation

import openai
import json
from typing import TypedDict

class EvalResult(TypedDict):
    score: int
    reasoning: str

def llm_as_judge_evaluate(
    question: str,
    answer: str,
    criteria: str,
    model: str = "gpt-4o",
) -> EvalResult:
    """Evaluate answer quality on a 1-5 scale using LLM-as-Judge"""

    system_prompt = """You are an expert judge evaluating AI chatbot response quality.
Rate the response on a 1-5 scale according to the given criteria and explain your reasoning.

Scoring rubric:
- 1: Completely inappropriate or incorrect response
- 2: Partially relevant but contains significant errors
- 3: Basically correct but has room for improvement
- 4: Good quality, meets most expectations
- 5: Excellent response, perfectly meets all criteria

You must respond in JSON format:
{"score": number, "reasoning": "evaluation reasoning"}"""

    user_prompt = f"""Evaluation criteria: {criteria}

User question: {question}

Chatbot response: {answer}

Evaluate the above response according to the criteria."""

    client = openai.OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.0,
        response_format={"type": "json_object"},
    )

    return json.loads(response.choices[0].message.content)


# Usage example
result = llm_as_judge_evaluate(
    question="What is the difference between a list and a tuple in Python?",
    answer="Lists are created with square brackets ([]) and are mutable. "
           "Tuples are created with parentheses (()) and are immutable. "
           "In terms of performance, tuples are slightly faster than lists.",
    criteria="Evaluate based on accuracy, completeness, and clarity",
)
print(f"Score: {result['score']}/5")
print(f"Reasoning: {result['reasoning']}")

Pairwise Comparison Implementation

Pairwise comparison is useful for A/B testing or model comparisons.

def pairwise_compare(
    question: str,
    answer_a: str,
    answer_b: str,
    criteria: str,
    model: str = "gpt-4o",
) -> dict:
    """Compare two answers and select the better one"""

    system_prompt = """You are an expert judge comparing AI chatbot responses.
Compare responses A and B and determine which is better.

You must respond in JSON format:
{"winner": "A" or "B" or "tie", "reasoning": "comparison reasoning"}

Important: Do not be influenced by the order of responses.
Judge solely on content quality."""

    user_prompt = f"""Evaluation criteria: {criteria}

User question: {question}

Response A: {answer_a}

Response B: {answer_b}

Compare and evaluate both responses."""

    client = openai.OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.0,
        response_format={"type": "json_object"},
    )

    return json.loads(response.choices[0].message.content)

Mitigating Position Bias

The biggest limitation of LLM-as-Judge is position bias: judges tend to favor whichever response is presented first. An effective mitigation strategy is to evaluate twice with swapped response order and aggregate the results.

def debiased_pairwise_compare(
    question: str,
    answer_a: str,
    answer_b: str,
    criteria: str,
) -> dict:
    """Pairwise comparison with position bias mitigation"""

    # First evaluation: A presented first
    result_1 = pairwise_compare(question, answer_a, answer_b, criteria)

    # Second evaluation: B presented first (order reversed)
    result_2 = pairwise_compare(question, answer_b, answer_a, criteria)
    # Invert result_2's winner
    if result_2["winner"] == "A":
        result_2["winner"] = "B"
    elif result_2["winner"] == "B":
        result_2["winner"] = "A"

    # Aggregate results
    if result_1["winner"] == result_2["winner"]:
        return {
            "winner": result_1["winner"],
            "confidence": "high",
            "reasoning": f"Both evaluations agree: {result_1['reasoning']}",
        }
    else:
        return {
            "winner": "tie",
            "confidence": "low",
            "reasoning": (
                f"Evaluations disagree - "
                f"Forward: {result_1['winner']}, "
                f"Reversed: {result_2['winner']}"
            ),
        }
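
Across a full evaluation run, the individual outcomes that `debiased_pairwise_compare` produces can be rolled up into per-variant win rates. A minimal sketch, assuming a list of dicts with a `winner` key and treating ties as split support:

```python
from collections import Counter

def aggregate_pairwise_results(results: list[dict]) -> dict:
    """Turn pairwise outcomes ({"winner": "A" | "B" | "tie"}) into
    win rates per variant, counting each tie as half a win for both."""
    counts = Counter(r["winner"] for r in results)
    total = len(results)
    return {
        "A": (counts["A"] + counts["tie"] / 2) / total,
        "B": (counts["B"] + counts["tie"] / 2) / total,
        "ties": counts["tie"] / total,
    }

outcomes = [{"winner": "A"}, {"winner": "A"}, {"winner": "tie"}, {"winner": "B"}]
print(aggregate_pairwise_results(outcomes))
# → {'A': 0.625, 'B': 0.375, 'ties': 0.25}
```

A high tie rate is itself a signal: it usually means the judge's low-confidence verdicts dominate and more test cases (or a sharper rubric) are needed.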

Golden Dataset Construction Strategy

A golden dataset consists of expert-verified question-answer pairs that serve as the evaluation benchmark. The quality of the dataset directly determines the reliability of your evaluations.

Construction Principles

  1. Representativeness: Must reflect actual user question patterns. Extract high-frequency question types from production logs
  2. Diversity: Include a balanced range of difficulty levels, from easy questions to edge cases
  3. Scale: Secure a minimum of 100, ideally 500+ test cases
  4. Version control: Manage the golden dataset in Git and track change history
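
These principles can be enforced mechanically with a schema check on every entry. The field names below (`id`, `question`, `ground_truth`, `contexts`, `expires`) are illustrative assumptions, not a standard:

```python
from datetime import date

REQUIRED_FIELDS = {"id", "question", "ground_truth", "contexts", "expires"}

def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems with a golden dataset entry (empty = valid)."""
    problems = [
        f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - entry.keys())
    ]
    if "expires" in entry and date.fromisoformat(entry["expires"]) < date.today():
        problems.append("entry has expired; re-verify against current policy")
    return problems

entry = {
    "id": "hr-001",
    "question": "How many annual leave days do I get?",
    "ground_truth": "15 days after 1 year of service",
    "contexts": ["Annual Leave Policy: ..."],
    "expires": "2099-12-31",
}
print(validate_entry(entry))  # → []
```

Running this validator in CI keeps malformed or expired entries from silently skewing evaluation scores.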

Leveraging Synthetic Data

For initial construction, generating synthetic test data with LLMs and having experts review it is an efficient approach.

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import DirectoryLoader

# Load documents
loader = DirectoryLoader("./knowledge_base/", glob="**/*.md")
documents = loader.load()

# Configure test set generator
generator_llm = ChatOpenAI(model="gpt-4o")
critic_llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm=generator_llm,
    critic_llm=critic_llm,
    embeddings=embeddings,
)

# Generate test set with varied difficulty levels
testset = generator.generate_with_langchain_docs(
    documents=documents,
    test_size=200,
    distributions={
        simple: 0.4,        # Simple factual questions 40%
        reasoning: 0.3,     # Questions requiring reasoning 30%
        multi_context: 0.3,  # Questions needing multiple documents 30%
    },
)

# Export as DataFrame for review
df = testset.to_pandas()
df.to_csv("golden_dataset_draft.csv", index=False)
print(f"Generated test cases: {len(df)}")

Automated Testing Pipeline (CI/CD)

For production operations, you need a pipeline that automatically verifies performance is maintained whenever prompts change, models are swapped, or RAG configurations are modified.

pytest-Style Testing with DeepEval

DeepEval integrates with pytest, allowing LLM tests to fit naturally into existing test workflows.

# tests/test_chatbot_quality.py
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
    GEval,
)

# Custom G-Eval metric: response tone evaluation
tone_metric = GEval(
    name="Professional Tone",
    criteria=(
        "Evaluate whether the response maintains a professional "
        "and courteous tone. It should not contain colloquialisms, "
        "emojis, or inappropriate expressions."
    ),
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
)

faithfulness_metric = FaithfulnessMetric(threshold=0.8)
relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
hallucination_metric = HallucinationMetric(threshold=0.5)


@pytest.fixture
def chatbot_response():
    """Fixture to generate chatbot responses for testing"""
    from app.chatbot import get_response
    return get_response


class TestChatbotQuality:
    """Chatbot response quality regression tests"""

    def test_faq_faithfulness(self, chatbot_response):
        """Verify FAQ answers are faithful to retrieved context"""
        question = "How many annual leave days do I get?"
        response = chatbot_response(question)

        test_case = LLMTestCase(
            input=question,
            actual_output=response["answer"],
            retrieval_context=response["contexts"],
        )
        assert_test(test_case, [faithfulness_metric])

    def test_answer_relevancy(self, chatbot_response):
        """Verify answers are relevant to the question"""
        question = "How do I apply for remote work?"
        response = chatbot_response(question)

        test_case = LLMTestCase(
            input=question,
            actual_output=response["answer"],
        )
        assert_test(test_case, [relevancy_metric])

    def test_no_hallucination(self, chatbot_response):
        """Verify no hallucinations are present"""
        question = "How is severance pay calculated?"
        response = chatbot_response(question)

        test_case = LLMTestCase(
            input=question,
            actual_output=response["answer"],
            context=response["contexts"],
        )
        assert_test(test_case, [hallucination_metric])

    def test_professional_tone(self, chatbot_response):
        """Verify professional tone is maintained"""
        question = "When is payday?"
        response = chatbot_response(question)

        test_case = LLMTestCase(
            input=question,
            actual_output=response["answer"],
        )
        assert_test(test_case, [tone_metric])

GitHub Actions CI/CD Integration

# .github/workflows/chatbot-eval.yml
name: Chatbot Evaluation Pipeline

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'app/chatbot/**'
      - 'config/rag/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install -r requirements-eval.txt

      - name: Run RAGAS evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python scripts/run_ragas_eval.py \
            --dataset golden_dataset.json \
            --output eval_results.json

      - name: Run DeepEval tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          deepeval test run tests/test_chatbot_quality.py \
            --verbose

      - name: Check regression thresholds
        run: |
          python scripts/check_thresholds.py \
            --results eval_results.json \
            --thresholds config/eval_thresholds.json

      - name: Post evaluation summary to PR
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(
              fs.readFileSync('eval_results.json', 'utf8')
            );
            const rows = [
              ['Faithfulness', results.faithfulness, 0.85],
              ['Relevancy', results.relevancy, 0.80],
              ['Context Precision', results.context_precision, 0.80],
            ];
            // Build the table without leading indentation so GitHub
            // renders it as markdown rather than a code block
            const body = [
              '## Chatbot Evaluation Results',
              '| Metric | Score | Threshold | Status |',
              '|--------|-------|-----------|--------|',
              ...rows.map(([name, score, min]) =>
                `| ${name} | ${score} | ${min} | ${score >= min ? 'PASS' : 'FAIL'} |`
              ),
            ].join('\n');
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });
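
The `scripts/check_thresholds.py` referenced above is not shown in this guide; a minimal sketch of what it might do (exit non-zero so CI blocks the merge) could look like the following, with the metric names assumed to match the evaluation output and argument parsing simplified to positional paths:

```python
import json
import sys

def check_thresholds(results: dict, thresholds: dict) -> list[str]:
    """Return one failure message per metric below its minimum threshold."""
    return [
        f"{metric}: {results.get(metric, 0.0):.3f} < {minimum}"
        for metric, minimum in thresholds.items()
        if results.get(metric, 0.0) < minimum
    ]

if __name__ == "__main__" and len(sys.argv) == 3:
    results = json.load(open(sys.argv[1]))      # e.g. eval_results.json
    thresholds = json.load(open(sys.argv[2]))   # e.g. config/eval_thresholds.json
    failures = check_thresholds(results, thresholds)
    for line in failures:
        print(f"REGRESSION: {line}")
    sys.exit(1 if failures else 0)
```

Keeping the thresholds in a versioned config file means a quality bar change shows up in code review rather than being buried in a script edit.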

A/B Testing and Online Evaluation

Offline evaluation alone cannot fully predict real user experience. A/B testing and continuous monitoring in production environments are necessary.

Online Evaluation Metrics

  • User satisfaction: Thumbs up/down feedback ratio
  • Conversation completion rate: Percentage of conversations where users obtained desired information
  • Escalation rate: Percentage of conversations transferred from chatbot to human agents
  • Re-ask rate: Percentage of repeated questions on the same topic (lower is better)
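
Given structured conversation logs, most of these metrics reduce to simple ratios. The log field names below (`thumbs_up`, `escalated`, `repeated_topic`) are assumptions for illustration; conversation completion rate is omitted because it typically needs per-session goal labeling:

```python
def online_metrics(conversations: list[dict]) -> dict:
    """Compute online evaluation ratios from conversation log records."""
    n = len(conversations)
    if n == 0:
        return {}
    def rate(field: str) -> float:
        return sum(1 for c in conversations if c.get(field)) / n
    return {
        "satisfaction": rate("thumbs_up"),
        "escalation_rate": rate("escalated"),
        "re_ask_rate": rate("repeated_topic"),
    }

logs = [
    {"thumbs_up": True},
    {"thumbs_up": True, "escalated": True},
    {"repeated_topic": True},
    {"thumbs_up": True},
]
print(online_metrics(logs))
# → {'satisfaction': 0.75, 'escalation_rate': 0.25, 're_ask_rate': 0.25}
```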

A/B Test Design

When changing prompts or swapping models, split user traffic to compare performance of two versions. To achieve statistically significant results, a minimum of 2 weeks and 1,000+ conversations per group is recommended.
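
To make the "statistically significant" check concrete, thumbs-up ratios from two variants can be compared with a standard two-proportion z-test; the counts below are made-up illustrations:

```python
import math

def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int) -> tuple[float, float]:
    """Two-sided z-test for a difference in two proportions
    (e.g. thumbs-up rates of chatbot variants A and B)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical: variant A got 550/1000 thumbs up, variant B got 500/1000
z, p = two_proportion_z_test(550, 1000, 500, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")  # significant at p < 0.05
```

With 1,000 conversations per group, even a 5-percentage-point difference only clears the p < 0.05 bar by a modest margin, which is why smaller expected effects need larger samples or longer test windows.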

Framework Comparison

| Category | RAGAS | DeepEval | LangSmith | Custom (Self-built) |
|----------|-------|----------|-----------|---------------------|
| Primary use | RAG pipeline evaluation | LLM output testing | Tracing + evaluation | Domain-specific evaluation |
| Core metrics | Faithfulness, Relevancy, Context Precision/Recall | G-Eval, Hallucination, Answer Relevancy, Toxicity | LLM-as-Judge, Heuristic, Human | Freely designed |
| pytest integration | Possible (wrapper needed) | Native support | Via SDK | Manual implementation |
| Tracing | Not provided | Confident AI integration | Native support | Manual implementation |
| Reference answer required | Optional | Depends on metric | Optional | Freely designed |
| Custom metrics | LLM-based extension | Free definition via G-Eval | Custom Evaluator | Fully flexible |
| Learning curve | Low | Low | Medium | High |
| Cost | Open source + LLM API costs | Open source + paid platform | Paid (free tier available) | LLM API costs only |
| Recommended for | RAG performance optimization | CI/CD quality gates | Full lifecycle management | Special requirements |

Failure Cases and Lessons Learned

Case 1: Wrong Model Selection Due to Evaluation Bias

One team found that Model A consistently outperformed Model B in LLM-as-Judge comparisons. The cause was that Model A produced more verbose answers, and the judge LLM had a verbosity bias. Model B's concise but accurate answers were systematically undervalued.

Lesson: Explicitly state "conciseness should also be positively evaluated" in the evaluation prompt, and add a separate metric that normalizes for answer length.
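
One hedged way to implement the length normalization mentioned above is a penalty that discounts judge scores for answers far beyond a target length. The `target_words` and `alpha` parameters are arbitrary illustrative values, not established constants, and should be tuned against human-labeled examples:

```python
def length_normalized_score(score: float, answer: str,
                            target_words: int = 80, alpha: float = 0.5) -> float:
    """Discount a judge score when the answer greatly exceeds a target
    length, counteracting verbosity bias. Shorter answers are untouched."""
    words = len(answer.split())
    if words <= target_words:
        return score
    return score / (1 + alpha * (words / target_words - 1))

concise = "Tuples are immutable; lists are mutable."
verbose = "word " * 240  # three times the target length
print(length_normalized_score(4.5, concise))             # → 4.5 (no penalty)
print(round(length_normalized_score(4.5, verbose), 2))   # → 2.25
```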

Case 2: Expired Golden Dataset

When evaluating with a golden dataset created six months earlier, all metrics showed decline. The cause was that company policies had changed, making the reference answers in the golden dataset no longer valid.

Lesson: Set expiration dates on golden datasets and build a system that automatically flags related test cases when underlying documents change.

Case 3: Over-Trusting Metric Reliability

A RAGAS Faithfulness score of 0.95 was high, yet user complaints persisted. Investigation revealed that the chatbot was faithfully answering based on context, but the retrieved context itself was irrelevant to what users actually wanted. Faithfulness was high, but Context Precision was low.

Lesson: Do not rely on a single metric. Monitor multiple metrics comprehensively, and in particular, evaluate retrieval quality and generation quality separately.

Production Checklist

Review these items when building a chatbot evaluation framework.

Foundation

  • Have you secured at least 100 golden dataset entries?
  • Does the golden dataset reflect actual user question patterns?
  • Are evaluation metrics aligned with business objectives?

Automation Pipeline

  • Do regression tests run automatically on prompt changes?
  • Are evaluation results automatically posted as PR comments?
  • Is deployment blocked when metric thresholds are not met?

LLM-as-Judge Operations

  • Is position bias mitigation applied to judge prompts?
  • Is verbosity bias addressed?
  • Is judge model evaluation consistency periodically verified?

Online Monitoring

  • Are you collecting user feedback (thumbs up/down)?
  • Is a time-series dashboard for key metrics operational?
  • Are alerts configured for sudden metric changes?

Data Management

  • Are golden dataset expiration dates managed?
  • Are evaluation results stored by version?
  • Are new question types continuously added to the golden dataset?

Conclusion

Building a chatbot evaluation framework is not a one-time project but a system that must be continuously evolved. Start with a small golden dataset and basic RAGAS metrics, then gradually add LLM-as-Judge evaluation, automated pipelines, and online monitoring.

The key principle is "you cannot improve what you cannot measure." Rather than judging prompt changes by intuition, establishing a culture of validating through objective metrics is what creates the most long-term value. Tools like RAGAS, DeepEval, and LangSmith are merely infrastructure that technically supports this culture.