Advanced RAG Pipeline Complete Guide 2025: Chunking Strategies, Re-ranking, Agentic RAG, Evaluation

1. The Evolution of RAG: From Naive to Advanced

1.1 What Is RAG

RAG (Retrieval-Augmented Generation) is a technique in which an LLM retrieves relevant information from external knowledge sources and receives it as context before generating an answer. It reduces hallucinations, incorporates up-to-date information, and enables use of domain-specific knowledge.

User question → [Retrieval] → [Relevant documents] → [LLM + document context] → Answer generation
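The flow above can be sketched end-to-end with stubs standing in for the real components. The word-overlap retriever and template "LLM" below are toys for illustration, not production code:

```python
def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank docs by how many question words they share, return top-k."""
    q_words = set(question.lower().split())
    scored = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def generate(question: str, context: list[str]) -> str:
    """Stand-in for the LLM call: an answer grounded in retrieved context."""
    return f"Q: {question}\nContext: {' | '.join(context)}"

docs = [
    "HNSW is a graph-based index for approximate nearest neighbor search.",
    "RAG retrieves external documents and passes them to the LLM as context.",
    "The weather forecast predicts rain tomorrow.",
]
question = "How does RAG use external documents?"
answer = generate(question, retrieve(question, docs))
```

In a real pipeline, `retrieve` becomes a vector-database query and `generate` a chat-completion call with the retrieved chunks in the prompt.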

1.2 RAG Architecture Evolution

Naive RAG (Early 2023)
├── Fixed-size chunking
├── Single embedding retrieval
├── Top-K results passed directly to LLM
└── Problems: Low retrieval accuracy, context noise

Advanced RAG (2024)
├── Semantic chunking + metadata
├── Hybrid search (vector + keyword)
├── Re-ranking to refine search results
├── Query transformation (HyDE, Multi-Query)
└── Context compression

Modular RAG (2025)
├── Agentic RAG (dynamic routing)
├── Self-RAG (self-reflection)
├── CRAG (Corrective RAG)
├── Multi-modal RAG
├── Knowledge Graph + RAG
└── Modular composable pipelines

1.3 Bottlenecks at Each Stage

| Stage | Naive RAG Problem | Advanced RAG Solution |
|---|---|---|
| Indexing | Fixed-size chunking | Semantic chunking, hierarchical indexing |
| Retrieval | Single vector search | Hybrid search, re-ranking |
| Generation | Noisy context | Context compression, filtering |
| Query | Raw query as-is | Query transformation, decomposition |
| Evaluation | Manual evaluation | RAGAS, automated evaluation |

2. Chunking Strategies

2.1 Why Chunking Matters

Chunking is the first and most important step in a RAG pipeline. Poor chunking degrades performance of all subsequent stages.

Bad chunking: Cut mid-sentence → meaning lost → retrieval fails → wrong answer
Good chunking: Split by meaning → rich context → accurate retrieval → correct answer

2.2 Fixed-Size Chunking

The simplest approach. Splits text by a fixed token/character count.

from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=500,      # chunk size
    chunk_overlap=50,    # overlap (preserves boundary info)
    separator="\n\n"     # separator
)

chunks = splitter.split_text(document_text)

Pros: Simple, fast, predictable sizes
Cons: Ignores semantic boundaries, may cut mid-sentence

2.3 Recursive Character Splitting

Tries multiple separators in hierarchical order. The most widely used approach.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]  # priority order
)

chunks = splitter.split_text(document_text)
# First tries \n\n, if chunk too large then \n, then period...

Pros: Respects paragraph/sentence boundaries, general-purpose
Cons: No guarantee of semantic coherence

2.4 Semantic Chunking

Measures embedding similarity between sentences and splits where meaning changes.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Semantic chunking
chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95  # split where sentence-to-sentence distance exceeds the 95th percentile
)

chunks = chunker.split_text(document_text)

How it works:

Sentence 1: "Vector DBs are core AI infrastructure." → embedding [0.1, 0.3, ...]
Sentence 2: "HNSW is the fastest search algorithm." → similarity 0.85 (high → same chunk)
Sentence 3: "Meanwhile, the weather is sunny today." → similarity 0.15 (low → new chunk!)

Pros: Semantic unit splitting, high retrieval accuracy
Cons: Requires embedding calls (cost/time), uneven chunk sizes
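The mechanism can be sketched without LangChain: compute the similarity between adjacent sentence embeddings and break where it drops below a threshold. The sentences and 2-D embeddings below are toy values chosen to mirror the example above:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_split(sentences, embeddings, threshold=0.5):
    """Start a new chunk whenever adjacent-sentence similarity drops below threshold."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append(current)
            current = []
        current.append(sentences[i])
    chunks.append(current)
    return chunks

sents = [
    "Vector DBs are core AI infrastructure.",
    "HNSW is a fast search algorithm.",
    "Meanwhile, the weather is sunny today.",
]
# Toy embeddings: the first two point the same way, the third is orthogonal.
embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
chunks = semantic_split(sents, embs)
```

The first two sentences end up in one chunk; the off-topic third sentence starts a new one.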

2.5 Document-Based Chunking

Leverages document structure (headings, sections, tables, code blocks).

from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

splitter = MarkdownHeaderTextSplitter(headers_to_split)
chunks = splitter.split_text(markdown_text)

# Each chunk automatically includes header metadata
for chunk in chunks:
    print(f"Metadata: {chunk.metadata}")
    # {'Header 1': 'Vector Database', 'Header 2': 'Indexing Algorithms'}

2.6 Agentic Chunking

Uses an LLM to determine optimal chunking.

import json
from openai import OpenAI

client = OpenAI()

def agentic_chunk(text, max_chunk_size=1500):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"""Split the given text into semantically independent chunks
of at most {max_chunk_size} characters each. Each chunk should be self-contained and understandable on its own.
Return JSON in the form {{"chunks": ["...", "..."]}}."""},
            {"role": "user", "content": text}
        ],
        response_format={"type": "json_object"}  # JSON mode requires an object, hence the "chunks" key
    )
    return json.loads(response.choices[0].message.content)["chunks"]

Pros: Highest quality, context understanding
Cons: High cost, slow processing, impractical at scale

2.7 Chunking Strategy Comparison

| Strategy | Quality | Speed | Cost | Best For |
|---|---|---|---|---|
| Fixed-Size | Low | Very Fast | Free | Prototyping, simple docs |
| Recursive | Medium | Fast | Free | General purpose (default) |
| Semantic | High | Medium | Embedding cost | When accuracy matters |
| Document-Based | High | Fast | Free | Structured docs (MD, HTML) |
| Agentic | Very High | Slow | LLM cost | Small high-quality docs |

2.8 Optimal Chunk Size Guide

General text: 500-1000 tokens (10-20% overlap)
Technical docs: 800-1500 tokens (section-based)
Legal/Medical: 300-500 tokens (precision-focused)
Code: Function/class units (structure-based)
Conversations/QA: Per dialog turn
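The guide above can be encoded as defaults in a small (hypothetical) helper; the preset names and numbers below simply restate the list. Code and conversations are split structurally (by function or dialog turn) rather than by size, so they get no size preset:

```python
# Presets derived from the chunk size guide: size in tokens, ~15% overlap
# for general text, smaller precision-focused chunks for legal/medical.
CHUNK_PRESETS = {
    "general":   {"chunk_size": 800,  "chunk_overlap": 120},  # 500-1000 tokens
    "technical": {"chunk_size": 1200, "chunk_overlap": 150},  # 800-1500 tokens
    "legal":     {"chunk_size": 400,  "chunk_overlap": 60},   # 300-500 tokens
}

def chunking_params(doc_type: str) -> dict:
    """Return splitter parameters for a document type, defaulting to general text."""
    return CHUNK_PRESETS.get(doc_type, CHUNK_PRESETS["general"])
```

These values feed directly into a splitter, e.g. `RecursiveCharacterTextSplitter(**chunking_params("legal"))`.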

3. Embedding Model Selection

3.1 Embedding Model Comparison

| Model | Dims | Max Tokens | MTEB Score | Cost | Recommended For |
|---|---|---|---|---|---|
| text-embedding-3-large | 3072 | 8,191 | 64.6 | Paid | When top performance needed |
| text-embedding-3-small | 1536 | 8,191 | 62.3 | Low | General purpose (best cost-perf) |
| embed-v3.0 (Cohere) | 1024 | 512 | 64.5 | Paid | Multilingual, search-focused |
| BGE-M3 (BAAI) | 1024 | 8,192 | 68.2 | Free | Self-hosted, best OSS |
| Jina-embeddings-v3 | 1024 | 8,192 | 65.5 | Free | Multilingual, long docs |
| voyage-3 (Voyage AI) | 1024 | 32,000 | 67.1 | Paid | Code search specialized |

3.2 Selection Criteria

Cost-focused + general      → text-embedding-3-small
Top performance + free      → BGE-M3 (self-hosted)
Multilingual + search       → Cohere embed-v3.0
Code search                 → voyage-code-3
Long docs (8K+ tokens)      → Jina-embeddings-v3
Private data + on-premises  → BGE-M3 or Nomic

3.3 Late Interaction Models (ColBERT)

Performs token-level fine-grained matching.

from ragatouille import RAGPretrainedModel

rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Indexing
rag.index(
    collection=[doc1, doc2, doc3],
    index_name="my_index"
)

# Search (token-level matching)
results = rag.search(query="vector database performance comparison", k=5)

4. Query Transformation

4.1 Why Query Transformation Is Needed

User queries are often ambiguous, too short, or not suitable for retrieval.

Original query: "RAG is slow" (ambiguous, short)
Transformed: "How to optimize RAG pipeline retrieval latency" (specific, retrieval-suitable)

4.2 HyDE (Hypothetical Document Embeddings)

LLM generates a hypothetical answer document, then searches using that document's embedding.

from openai import OpenAI

client = OpenAI()

def hyde_search(query, collection):
    # 1. Generate hypothetical answer with LLM
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Write a detailed answer to the given question. It doesn't need to be accurate."},
            {"role": "user", "content": query}
        ]
    )
    hypothetical_doc = response.choices[0].message.content
    
    # 2. Embed the hypothetical answer and search
    embedding = get_embedding(hypothetical_doc)  # get_embedding: your embedding helper
    results = collection.search(query_vector=embedding, limit=5)
    
    return results

Pros: Bridges the embedding gap between query and document
Cons: LLM call cost, hypothetical answer may be wrong

4.3 Multi-Query

Rewrites a single query from multiple angles.

def multi_query_transform(original_query):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": """Rewrite the given question from 3 different perspectives.
Each question should maintain the original intent but use different keywords.
Return only 3 questions separated by newlines."""},
            {"role": "user", "content": original_query}
        ]
    )
    queries = response.choices[0].message.content.strip().split("\n")
    return [original_query] + queries

# Usage
queries = multi_query_transform("RAG pipeline performance optimization")
# → ["RAG pipeline performance optimization",
#    "How to improve response time in retrieval augmented generation systems",
#    "Strategies to increase search accuracy in RAG architecture",
#    "LLM-based document retrieval system optimization techniques"]

# Search with each query and merge results (deduplicate)
all_results = set()
for q in queries:
    results = search(q)
    all_results.update(results)

4.4 Step-Back Prompting

Transforms a specific question into a more general one.

def step_back_prompt(query):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": """Generate a question that is one step more general and broader than the given question.
The answer to this general question should help answer the original question."""},
            {"role": "user", "content": query}
        ]
    )
    return response.choices[0].message.content

# Example
original = "What is the memory impact when setting HNSW M parameter to 32 in Qdrant?"
step_back = step_back_prompt(original)
# → "How do HNSW index parameters affect performance and resources in vector databases?"

4.5 Query Decomposition

Breaks complex questions into sub-questions.

import json

def decompose_query(complex_query):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": """Decompose the complex question into simpler sub-questions.
Each sub-question should be independently answerable.
Return JSON in the form {"sub_questions": ["...", "..."]}."""},
            {"role": "user", "content": complex_query}
        ],
        response_format={"type": "json_object"}  # JSON mode requires an object
    )
    return json.loads(response.choices[0].message.content)["sub_questions"]

# Example
complex_q = "Compare Pinecone and Qdrant on performance, cost, and operational ease at 10M vector scale?"
sub_questions = decompose_query(complex_q)
# → ["What is Pinecone's performance at 10M vector scale?",
#    "What is Qdrant's performance at 10M vector scale?",
#    "What is Pinecone's cost structure?",
#    "What is Qdrant's cost structure?",
#    "How easy is Pinecone to operate?",
#    "How easy is Qdrant to operate?"]

5. Retrieval Optimization

5.1 Hybrid Search

Combines vector search with keyword search.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Qdrant

# BM25 (keyword) retriever
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 5

# Vector retriever
vector_retriever = qdrant_vectorstore.as_retriever(
    search_kwargs={"k": 5}
)

# Ensemble (hybrid)
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]  # weight toward vector
)

results = ensemble_retriever.invoke("How to optimize RAG pipeline")
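Under the hood, the ensemble merges the two ranked lists with weighted Reciprocal Rank Fusion (RRF): each document scores `weight / (k + rank)` per list, summed across lists. A minimal sketch with hypothetical document IDs:

```python
def rrf_merge(ranked_lists, weights=None, k=60):
    """Weighted Reciprocal Rank Fusion: score(doc) = sum(w / (k + rank))."""
    weights = weights or [1.0] * len(ranked_lists)
    scores = {}
    for w, ranking in zip(weights, ranked_lists):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["doc_a", "doc_b", "doc_c"]  # keyword ranking
vector_hits = ["doc_b", "doc_d", "doc_a"]  # vector ranking
fused = rrf_merge([bm25_hits, vector_hits], weights=[0.4, 0.6])
```

`doc_b` wins because it ranks well in both lists; documents appearing in only one list fall behind.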

5.2 Contextual Compression

Extracts only the relevant portions from retrieved documents.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_retriever
)

# Retrieval + compression: extracts only relevant parts
compressed_docs = compression_retriever.invoke("Role of M parameter in HNSW algorithm")
# Returns only question-relevant paragraphs, not the entire document

5.3 Parent Document Retrieval

Searches with small chunks but returns larger parent documents.

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Small chunks: for retrieval
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

# Large chunks: for context
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

store = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Precise search with 200-token chunks → return 2000-token parent documents
retriever.add_documents(documents)
results = retriever.invoke("HNSW parameter tuning")
# Retrieval is precise with small chunks, context is rich with large ones

5.4 Multi-Vector Retrieval

Generates multiple vectors (summary, questions, original) per document.

from langchain.retrievers.multi_vector import MultiVectorRetriever

# Generate summaries + hypothetical questions per document
summaries = generate_summaries(documents)
hypothetical_questions = generate_questions(documents)

# Search by summaries/questions, return original documents
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,  # stores summary/question vectors
    docstore=store,           # stores original documents
    id_key="doc_id"
)

# Searches via summary embeddings but returns full original documents

6. Re-ranking

6.1 Why Re-ranking Is Needed

Initial retrieval (bi-encoder) is fast but imprecise. Re-ranking (cross-encoder) processes query and document together for more accurate relevance judgment.

Stage 1 (Bi-encoder): query vector vs doc vector → fast but imprecise → Top 20
Stage 2 (Cross-encoder): (query, doc) pairs directly compared → slow but accurate → Top 5
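The two-stage funnel reduces to a few lines when the scorers are injected; the dictionaries of toy scores below stand in for the bi-encoder and cross-encoder:

```python
def retrieve_then_rerank(query, docs, cheap_score, precise_score,
                         first_k=20, final_k=5):
    """Stage 1: cheap score over all docs; Stage 2: precise score over survivors."""
    stage1 = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:first_k]
    return sorted(stage1, key=lambda d: precise_score(query, d), reverse=True)[:final_k]

# Toy scores: d4 has the best precise score but a poor cheap score.
cheap   = {"d1": 0.9, "d2": 0.8, "d3": 0.7, "d4": 0.1}
precise = {"d1": 0.2, "d2": 0.95, "d3": 0.6, "d4": 0.99}

top = retrieve_then_rerank(
    "q", ["d1", "d2", "d3", "d4"],
    cheap_score=lambda q, d: cheap[d],
    precise_score=lambda q, d: precise[d],
    first_k=3, final_k=2,
)
```

Note that `d4` never reaches stage 2 despite the best precise score: the cheap first stage caps recall, which is why `first_k` is usually generous (20+).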

6.2 Cross-Encoder Re-ranking

from sentence_transformers import CrossEncoder

# Load cross-encoder model
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Initial search results
initial_results = vector_search(query, top_k=20)

# Re-rank
pairs = [(query, doc.content) for doc in initial_results]
scores = model.predict(pairs)

# Sort by score
reranked = sorted(
    zip(initial_results, scores),
    key=lambda x: x[1],
    reverse=True
)[:5]

6.3 Cohere Rerank

Commercial re-ranking API. High performance with multilingual support.

import cohere

co = cohere.Client("YOUR_API_KEY")

# Initial search result texts
documents = [doc.content for doc in initial_results]

# Cohere re-ranking
response = co.rerank(
    model="rerank-v3.5",
    query="RAG pipeline optimization",
    documents=documents,
    top_n=5
)

for result in response.results:
    print(f"Index: {result.index}, Score: {result.relevance_score:.4f}")
    print(f"Text: {documents[result.index][:100]}...")

6.4 ColBERT Re-ranking

Late interaction approach with token-level fine-grained matching.

from ragatouille import RAGPretrainedModel

rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

reranked = rag.rerank(
    query="How to optimize RAG pipeline",
    documents=[doc.content for doc in initial_results],
    k=5
)

6.5 LLM-Based Re-ranking

Lets the LLM directly judge relevance.

def llm_rerank(query, documents, top_n=5):
    scored_docs = []
    
    for doc in documents:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Rate the relevance between the question and document on a 0-10 scale. Return only the number."},
                {"role": "user", "content": f"Question: {query}\n\nDocument: {doc.content[:500]}"}
            ]
        )
        score = float(response.choices[0].message.content.strip())
        scored_docs.append((doc, score))
    
    return sorted(scored_docs, key=lambda x: x[1], reverse=True)[:top_n]

6.6 Re-ranking Model Comparison

| Model | Speed | Quality | Cost | Multilingual | Recommended |
|---|---|---|---|---|---|
| cross-encoder/ms-marco | Fast | Good | Free | English | English only |
| Cohere Rerank v3.5 | Fast | Very Good | Paid | 100+ languages | Production default |
| ColBERT v2 | Medium | Very Good | Free | English | Self-hosted |
| BGE-Reranker-v2 | Fast | Good | Free | Multilingual | OSS multilingual |
| LLM Re-ranking | Slow | Best | High | All | Small high-quality |

7. Agentic RAG

7.1 What Is Agentic RAG

LLM agents dynamically decide retrieval strategies. Instead of simple "retrieve then generate," agents evaluate search results and adjust strategy.

Traditional RAG: Question → Retrieve → Generate (fixed pipeline)
Agentic RAG: Question → [Agent decides]
              ├── Is retrieval needed? → Retrieve → Results sufficient?
              │                              ├── Yes → Generate
              │                              └── No → Search different source / modify query
              └── Can answer without retrieval → Generate directly

7.2 Self-RAG (Self-Reflective RAG)

The model evaluates whether retrieval is needed and judges the quality of generated results.

def self_rag(query):
    # 1. Judge if retrieval is needed
    need_retrieval = judge_retrieval_need(query)
    
    if not need_retrieval:
        return generate_without_context(query)
    
    # 2. Perform retrieval
    documents = retrieve(query)
    
    # 3. Judge relevance of each document
    relevant_docs = []
    for doc in documents:
        if is_relevant(query, doc):
            relevant_docs.append(doc)
    
    if not relevant_docs:
        # No relevant docs found, refine query and re-search
        refined_query = refine_query(query)
        documents = retrieve(refined_query)
        relevant_docs = [d for d in documents if is_relevant(refined_query, d)]
    
    # 4. Generate answer
    answer = generate_with_context(query, relevant_docs)
    
    # 5. Self-evaluate answer quality
    if not is_supported(answer, relevant_docs):
        # If answer not grounded in docs, regenerate
        answer = regenerate(query, relevant_docs)
    
    return answer

7.3 CRAG (Corrective RAG)

Takes corrective action based on retrieval quality.

def corrective_rag(query):
    # 1. Initial retrieval
    documents = retrieve(query)
    
    # 2. Evaluate retrieval quality
    quality = evaluate_retrieval_quality(query, documents)
    
    if quality == "CORRECT":
        # Good results → refine knowledge and generate
        refined_knowledge = refine_knowledge(documents)
        return generate(query, refined_knowledge)
    
    elif quality == "AMBIGUOUS":
        # Ambiguous results → supplement with web search
        web_results = web_search(query)
        combined = documents + web_results
        refined = refine_knowledge(combined)
        return generate(query, refined)
    
    elif quality == "INCORRECT":
        # Poor results → replace with web search
        web_results = web_search(query)
        refined = refine_knowledge(web_results)
        return generate(query, refined)

7.4 Adaptive RAG

Selects strategy based on query complexity.

def adaptive_rag(query):
    # Classify query complexity
    complexity = classify_query(query)
    
    if complexity == "SIMPLE":
        # Simple factual question → direct retrieve + generate
        docs = simple_retrieve(query, top_k=3)
        return generate(query, docs)
    
    elif complexity == "MODERATE":
        # Moderate → multi-query + re-ranking
        queries = multi_query_transform(query)
        docs = hybrid_search(queries)
        reranked = rerank(query, docs)
        return generate(query, reranked)
    
    elif complexity == "COMPLEX":
        # Complex → query decomposition + multi-step reasoning
        sub_queries = decompose_query(query)
        sub_answers = []
        for sq in sub_queries:
            docs = search(sq)
            sub_answers.append(generate(sq, docs))
        return synthesize(query, sub_answers)

7.5 Query Routing

Routes queries to appropriate data sources based on question type.

def query_router(query):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": """Analyze the query and select the appropriate data source:
- VECTOR_DB: When internal document search is needed
- WEB_SEARCH: When latest or external information is needed
- SQL_DB: When structured data queries are needed
- DIRECT: When the LLM can answer directly
Return a single word only."""},
            {"role": "user", "content": query}
        ]
    )
    route = response.choices[0].message.content.strip()
    
    if route == "VECTOR_DB":
        return vector_db_search(query)
    elif route == "WEB_SEARCH":
        return web_search(query)
    elif route == "SQL_DB":
        return sql_query(query)
    else:
        return direct_answer(query)

8. Multi-modal RAG

8.1 Image + Text RAG

Processes images and tables alongside text from PDFs and slides.

from langchain_community.document_loaders import UnstructuredPDFLoader

# Extract text + images + tables from PDF
loader = UnstructuredPDFLoader(
    "document.pdf",
    mode="elements",
    strategy="hi_res"  # high-res image/table extraction
)
elements = loader.load()

# Generate descriptions for image elements using GPT-4o
for element in elements:
    if element.metadata.get("type") == "Image":
        description = describe_image_with_vision(element)
        element.page_content = description  # convert image description to text

# Store all elements (text + image descriptions) in vector DB

8.2 Table Processing

def process_table(table_element):
    """Convert table to searchable format"""
    # 1. Convert table to markdown
    md_table = table_element.metadata.get("text_as_html", "")
    
    # 2. Generate table summary with LLM
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarize the following table in natural language."},
            {"role": "user", "content": md_table}
        ]
    )
    summary = response.choices[0].message.content
    
    # 3. Store summary with metadata
    return {
        "content": summary,
        "metadata": {
            "type": "table",
            "original_html": md_table,
            "source": table_element.metadata.get("source")
        }
    }

8.3 Vision Model Usage

import base64

def describe_image_with_vision(image_path):
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in detail. For diagrams explain the structure, for charts explain the data."},
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}}
                ]
            }
        ]
    )
    return response.choices[0].message.content

9. Knowledge Graph + RAG (GraphRAG)

9.1 GraphRAG Concept

Represents entity relationships via knowledge graphs and combines with vector search.

Standard RAG: Retrieves document chunks independently
GraphRAG: Uses entity relationships to retrieve connected information too

Example: "Which vector databases use HNSW?"
Standard RAG: Returns only HNSW chunks
GraphRAG: HNSW → (used_by) → Qdrant, Weaviate, Milvus
         + Returns HNSW configuration chunks for each DB
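The expansion step can be sketched with a plain adjacency dict before reaching for Neo4j; the entities and chunk texts below are toy data mirroring the example:

```python
# Toy knowledge graph: (entity, relation) -> neighboring entities
graph = {
    ("HNSW", "used_by"): ["Qdrant", "Weaviate", "Milvus"],
}
# Toy chunk store keyed by entity
chunks_by_entity = {
    "HNSW": "HNSW builds a multi-layer proximity graph...",
    "Qdrant": "Qdrant exposes HNSW parameters m and ef_construct...",
    "Weaviate": "Weaviate tunes HNSW via efConstruction...",
    "Milvus": "Milvus supports HNSW among other index types...",
}

def graph_expand(entity, relation):
    """Return context chunks for the entity plus its neighbors over `relation`."""
    neighbors = graph.get((entity, relation), [])
    return [chunks_by_entity[e] for e in [entity, *neighbors] if e in chunks_by_entity]

context = graph_expand("HNSW", "used_by")
```

Standard RAG would return only the HNSW chunk; the graph expansion also pulls in the configuration chunks for each connected database.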

9.2 Neo4j + RAG Implementation

from langchain_community.graphs import Neo4jGraph
from langchain.chains import GraphCypherQAChain

graph = Neo4jGraph(
    url="bolt://localhost:7687",
    username="neo4j",
    password="password"
)

# Knowledge graph + LLM QA chain
chain = GraphCypherQAChain.from_llm(
    llm=llm,
    graph=graph,
    verbose=True
)

result = chain.invoke("What indexing algorithms does Qdrant support?")
# LLM generates Cypher queries to explore relationships in Neo4j

9.3 Hybrid: Vector + Graph

def hybrid_graph_vector_rag(query):
    # 1. Vector search for relevant document chunks
    vector_results = vector_search(query, top_k=5)
    
    # 2. Extract entities from chunks
    entities = extract_entities(query)
    
    # 3. Explore related entities in knowledge graph
    graph_results = graph_query(entities, depth=2)
    
    # 4. Combine both results
    combined_context = merge_results(vector_results, graph_results)
    
    # 5. LLM generation
    return generate(query, combined_context)

10. RAG Evaluation

10.1 RAGAS Framework

A framework for automated RAG pipeline evaluation.

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset
eval_data = {
    "question": ["What is RAG?", "How does HNSW algorithm work?"],
    "answer": ["RAG is retrieval augmented generation...", "HNSW is a hierarchical graph..."],
    "contexts": [
        ["RAG (Retrieval-Augmented Generation) is...", "A retrieval-based generation technique..."],
        ["HNSW stands for Hierarchical Navigable...", "A graph-based ANN algorithm..."]
    ],
    "ground_truth": ["RAG enables LLMs to retrieve external...", "HNSW is a multi-layer graph structure..."]
}

dataset = Dataset.from_dict(eval_data)

# Run RAGAS evaluation
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)

print(results)
# {'faithfulness': 0.89, 'answer_relevancy': 0.92,
#  'context_precision': 0.85, 'context_recall': 0.78}

10.2 RAGAS Metrics Detail

| Metric | Measures | Range | Target |
|---|---|---|---|
| Faithfulness | How well answer is grounded in context | 0-1 | 0.85+ |
| Answer Relevancy | How well answer fits the question | 0-1 | 0.90+ |
| Context Precision | Ranking accuracy of relevant context | 0-1 | 0.80+ |
| Context Recall | How much needed info was retrieved | 0-1 | 0.75+ |
| Answer Similarity | Semantic similarity to ground truth | 0-1 | 0.80+ |
| Answer Correctness | Factual accuracy of the answer | 0-1 | 0.85+ |

10.3 TruLens Evaluation

from trulens.apps.langchain import TruChain
from trulens.core import Feedback, TruSession
from trulens.providers.openai import OpenAI as TruOpenAI

session = TruSession()
provider = TruOpenAI()

# Define feedback functions
context = TruChain.select_context(rag_chain)  # selector for the retrieved context

f_relevance = Feedback(provider.relevance).on_input_output()
f_groundedness = Feedback(provider.groundedness_measure_with_cot_reasons).on(
    context.collect()
).on_output()

# Wrap RAG chain
tru_chain = TruChain(
    rag_chain,
    app_name="RAG Pipeline v1",
    feedbacks=[f_relevance, f_groundedness]
)

# Run evaluation
with tru_chain as recording:
    response = rag_chain.invoke("How to optimize RAG pipelines?")

# Check dashboard
session.get_leaderboard()

10.4 LLM-as-Judge

def llm_judge(question, answer, context, ground_truth):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """Evaluate the RAG system's answer.
Score 1-5 on these criteria:
1. Accuracy: Is the answer factually correct
2. Relevance: Is it a fitting answer to the question
3. Groundedness: Is the answer based on the provided context
4. Completeness: Does it fully answer the question
Return JSON with each score and reasoning."""},
            {"role": "user", "content": f"""Question: {question}
Answer: {answer}
Context: {context}
Ground Truth: {ground_truth}"""}
        ],
        response_format={"type": "json_object"}
    )
    return response.choices[0].message.content

10.5 Evaluation Method Comparison

| Method | Automation | Cost | Accuracy | When to Use |
|---|---|---|---|---|
| RAGAS | Fully auto | Embedding cost | Good | Continuous monitoring |
| TruLens | Auto | LLM cost | Good | Iterative dev evaluation |
| LLM-as-Judge | Semi-auto | LLM cost | Very Good | Detailed analysis |
| Human Eval | Manual | Labor cost | Best | Final validation |

11. Production Optimization

11.1 Caching Strategy

import hashlib
import json

class RAGCache:
    def __init__(self, redis_client):
        self.redis = redis_client
        self.ttl = 3600  # 1 hour
    
    def _hash_query(self, query):
        return hashlib.md5(query.encode()).hexdigest()
    
    def get_cached_response(self, query):
        key = self._hash_query(query)
        cached = self.redis.get(f"rag:{key}")
        if cached:
            return json.loads(cached)
        return None
    
    def cache_response(self, query, response):
        key = self._hash_query(query)
        self.redis.setex(
            f"rag:{key}",
            self.ttl,
            json.dumps(response)
        )

# Semantic caching: cache hits for similar queries too
class SemanticCache:
    def __init__(self, vectorstore, threshold=0.95):
        self.store = vectorstore
        self.threshold = threshold
    
    def get(self, query):
        # NOTE: assumes the store returns similarity (higher = closer);
        # some stores return distance (lower = closer) - check yours.
        results = self.store.similarity_search_with_score(query, k=1)
        if results and results[0][1] >= self.threshold:
            return results[0][0].metadata["response"]
        return None

11.2 Streaming Responses

from openai import OpenAI

client = OpenAI()

def stream_rag_response(query, context):
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context:\n{context}"},
            {"role": "user", "content": query}
        ],
        stream=True
    )
    
    for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

# FastAPI streaming endpoint
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/rag/stream")
async def rag_stream(query: str):
    context = retrieve_context(query)
    return StreamingResponse(
        stream_rag_response(query, context),
        media_type="text/event-stream"
    )

11.3 Fallback Strategy

def rag_with_fallback(query):
    try:
        # Primary: Vector DB search
        docs = vector_search(query)
        
        if not docs or max_relevance_score(docs) < 0.5:
            # Secondary: Web search fallback
            docs = web_search_fallback(query)
        
        if not docs:
            # Tertiary: Direct LLM answer (no retrieval)
            return direct_llm_answer(query)
        
        return generate_with_context(query, docs)
    
    except Exception as e:
        # Error fallback
        return {
            "answer": "Sorry, a temporary error occurred.",
            "error": str(e),
            "fallback": True
        }

11.4 Monitoring

# Key RAG metrics
monitoring = {
    "Retrieval latency": "Target p99 200ms",
    "Generation latency": "Target p99 3s (streaming first-token 500ms)",
    "Retrieval relevance": "Auto-eval 0.85+",
    "Answer groundedness": "Auto-eval 0.90+",
    "Cache hit rate": "Target 30%+",
    "User feedback (thumbs up/down)": "Positive 80%+",
    "Token usage": "Cost tracking",
    "Error rate": "Target under 1%",
}
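For latency targets like the p99 figures above, a nearest-rank percentile over recorded samples is enough for a first-cut monitor (the latency numbers below are toy data):

```python
def percentile(samples, p):
    """Nearest-rank percentile over recorded samples."""
    ordered = sorted(samples)
    idx = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[idx]

# Toy retrieval latencies in ms; one slow outlier dominates the tail.
latencies_ms = [120, 90, 150, 400, 110, 95, 130, 105, 180, 100]
p99 = percentile(latencies_ms, 99)
within_target = p99 <= 200  # the "p99 200ms" retrieval target above
```

A single 400 ms outlier blows the p99 budget even though the median looks healthy, which is exactly why tail percentiles (not averages) are tracked.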

12. Common Failures and Solutions

12.1 Problem Diagnosis Checklist

| Symptom | Cause | Solution |
|---|---|---|
| Irrelevant docs returned | Chunk size inappropriate | Semantic chunking, resize |
| Answer ignores context | Context too long or noisy | Context compression, re-ranking |
| Synonym/similar term search fails | Embedding model limitation | Hybrid search, query expansion |
| Missing latest information | Index not updated | Incremental indexing, scheduler |
| Answer hallucination | No relevance threshold | Score filtering, Self-RAG |
| Multi-hop question fails | Single retrieval insufficient | Query decomposition, iterative search |
| Specific term search fails | Vector alone insufficient | Add BM25 hybrid |
| Slow responses | Re-ranking/generation delay | Caching, streaming, async |
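The "score filtering" fix for hallucination is a one-liner worth spelling out: drop low-relevance hits before generation, and treat an empty result as a signal to fall back rather than generate from noise. The threshold and hit tuples below are illustrative:

```python
def filter_by_score(hits, min_score=0.5):
    """Drop low-relevance hits; an empty result should trigger fallback, not generation."""
    kept = [(doc, score) for doc, score in hits if score >= min_score]
    return kept or None  # None = nothing relevant enough, use a fallback path

hits = [("good doc", 0.82), ("borderline", 0.49), ("noise", 0.12)]
filtered = filter_by_score(hits)
```

Without the threshold, the borderline and noise chunks reach the prompt and invite ungrounded answers.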

12.2 Debugging Workflow

def debug_rag_pipeline(query):
    print(f"=== RAG Debug: {query} ===\n")
    
    # 1. Check retrieval results
    results = vector_search(query, top_k=10)
    print("--- Retrieval Results ---")
    for i, r in enumerate(results):
        print(f"#{i+1} Score: {r.score:.4f} | {r.content[:100]}...")
    
    # 2. Check re-ranked results
    reranked = rerank(query, results)
    print("\n--- After Re-ranking ---")
    for i, (r, score) in enumerate(reranked):
        print(f"#{i+1} Score: {score:.4f} | {r.content[:100]}...")
    
    # 3. Check final context
    context = build_context(reranked[:5])
    print(f"\n--- Context length: {len(context)} chars ---")
    
    # 4. Check generated result
    answer = generate(query, context)
    print(f"\n--- Answer ---\n{answer}")
    
    return {"results": results, "reranked": reranked, "answer": answer}

13. Full Pipeline Integration

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Qdrant
from langchain.retrievers import EnsembleRetriever, ContextualCompressionRetriever

class AdvancedRAGPipeline:
    def __init__(self):
        self.llm = ChatOpenAI(model="gpt-4o")
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        self.vectorstore = Qdrant.from_existing_collection(
            embedding=self.embeddings,
            collection_name="knowledge_base",
            url="http://localhost:6333",  # assumes a locally running Qdrant server
        )
        self.cache = SemanticCache(self.vectorstore)
    
    def query(self, question):
        # 1. Check cache
        cached = self.cache.get(question)
        if cached:
            return cached
        
        # 2. Query transformation
        queries = self.multi_query_transform(question)
        
        # 3. Hybrid search
        all_docs = []
        for q in queries:
            docs = self.hybrid_search(q)
            all_docs.extend(docs)
        
        # 4. Deduplicate
        unique_docs = self.deduplicate(all_docs)
        
        # 5. Re-ranking
        reranked = self.rerank(question, unique_docs)
        
        # 6. Context compression
        compressed = self.compress_context(question, reranked)
        
        # 7. Generate answer
        answer = self.generate(question, compressed)
        
        # 8. Verify answer
        if not self.verify_groundedness(answer, compressed):
            answer = self.regenerate_with_instruction(question, compressed)
        
        # 9. Cache result
        self.cache.set(question, answer)
        
        return answer

14. Quiz

Q1. What is the difference between semantic chunking and recursive chunking?

Recursive chunking applies predefined separators (newlines, periods, etc.) in hierarchical order to split text. It does not consider semantics. Semantic chunking measures embedding similarity between sentences and splits where meaning changes significantly. Semantic chunking provides higher retrieval accuracy but incurs embedding call costs. Use recursive chunking as default, semantic chunking when accuracy is critical.
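
The recursive side of this comparison fits in a short dependency-free sketch (semantic chunking additionally needs an embedding model, so it is omitted here). This mirrors the idea behind `RecursiveCharacterTextSplitter`, not its exact implementation; production splitters also merge small pieces back up toward `chunk_size`:

```python
def recursive_split(text, chunk_size=100, separators=("\n\n", "\n", ". ", " ")):
    """Try separators in priority order, recursing into oversized pieces."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        if sep in text:
            chunks = []
            for part in text.split(sep):
                chunks.extend(recursive_split(part, chunk_size, separators))
            return [c for c in chunks if c.strip()]
    # No separator present: hard-cut as a last resort
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Paragraph breaks are tried before sentence breaks, so semantically coherent units survive whenever they fit the size budget.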

Q2. How does HyDE (Hypothetical Document Embeddings) improve retrieval?

HyDE has the LLM first generate a hypothetical answer document for the user query. It then embeds that hypothetical document and uses it for search. This bridges the embedding gap between short queries and long documents, improving retrieval accuracy. Downsides include LLM call cost and the risk that the hypothetical answer may be incorrect.
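
The HyDE flow reduces to three steps. In this sketch `llm`, `embed`, and `vector_search` are injected callables standing in for whatever stack you use; none of them refer to a specific library API:

```python
def hyde_search(query, llm, embed, vector_search, top_k=5):
    """HyDE: embed a hypothetical answer instead of the raw query."""
    # 1. Have the LLM write a plausible answer passage
    hypothetical = llm(f"Write a short passage answering: {query}")
    # 2. Embed the document-shaped hypothetical text, not the short query
    vector = embed(hypothetical)
    # 3. Search the vector store with that embedding
    return vector_search(vector, top_k=top_k)
```

Because the hypothetical passage resembles real documents in length and style, its embedding lands closer to relevant documents than the raw query's would.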

Q3. What is the difference between Self-RAG and CRAG?

Self-RAG has the model judge whether retrieval is needed and self-evaluate whether the generated answer is grounded in context. If retrieval is unnecessary, it answers directly. CRAG always performs retrieval but evaluates the quality of results, classifying them as "correct/ambiguous/incorrect." If incorrect, it falls back to alternative sources like web search. Self-RAG controls retrieval itself; CRAG corrects retrieval results.
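
CRAG's correct/ambiguous/incorrect routing can be sketched as a threshold check over retrieval scores. The thresholds and the injected `score_of`/`web_search` callables are assumptions for illustration; the paper uses a trained retrieval evaluator rather than raw similarity scores:

```python
def crag_route(query, docs, score_of, web_search):
    """CRAG-style routing: grade retrieval quality, then correct if needed."""
    if not docs:
        return web_search(query)
    best = max(score_of(query, d) for d in docs)
    if best >= 0.7:                # "correct": trust retrieved documents
        return docs
    if best >= 0.4:                # "ambiguous": blend both sources
        return docs + web_search(query)
    return web_search(query)       # "incorrect": discard and fall back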

Q4. What is the difference between RAGAS Faithfulness and Answer Relevancy?

Faithfulness measures whether each claim in the answer is grounded in the provided context. It is key for hallucination detection. Answer Relevancy measures how well the answer fits the original question. If the answer contains content unrelated to the question, the score drops. An answer can be high in Faithfulness but low in Answer Relevancy if it does not actually address the question.
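
The two metrics can be illustrated with toy proxies. RAGAS actually uses an LLM judge to extract and verify claims (Faithfulness) and embedding similarity between the question and questions regenerated from the answer (Answer Relevancy); the lexical versions below exist only to make the ratio-style definitions concrete:

```python
def faithfulness(claims, context):
    """Fraction of answer claims grounded in the context.
    Toy substring proxy; RAGAS uses an LLM judge per claim."""
    if not claims:
        return 0.0
    grounded = sum(1 for c in claims if c.lower() in context.lower())
    return grounded / len(claims)

def answer_relevancy(answer_terms, question_terms):
    """Toy lexical-overlap proxy for how well the answer addresses the question."""
    if not question_terms:
        return 0.0
    overlap = set(answer_terms) & set(question_terms)
    return len(overlap) / len(set(question_terms))
```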

Q5. What are the benefits of semantic caching in production RAG?

Regular caching only hits on exactly identical queries. Semantic caching also hits on semantically similar queries. "RAG optimization methods" and "RAG performance improvement strategies" would share the same cache. This significantly increases cache hit rates and reduces both LLM call costs and response latency. The similarity threshold (e.g., 0.95) controls hit precision.
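
A minimal semantic cache is a linear scan over stored query embeddings with a cosine-similarity threshold. The `embed` callable is injected (any embedding model works); a production version would use a vector index instead of a list, plus TTL-based invalidation:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Cache that also hits on semantically similar queries."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (query_vector, cached_answer)

    def get(self, query):
        qv = self.embed(query)
        for vec, answer in self.entries:
            if cosine(qv, vec) >= self.threshold:
                return answer
        return None  # cache miss

    def set(self, query, answer):
        self.entries.append((self.embed(query), answer))
```

Lowering the threshold raises the hit rate but risks serving a cached answer to a question that only looks similar; 0.95 is a conservative starting point.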

15. References

  1. LangChain Documentation - https://python.langchain.com/docs/
  2. LlamaIndex Documentation - https://docs.llamaindex.ai/
  3. RAGAS Documentation - https://docs.ragas.io/
  4. TruLens Documentation - https://www.trulens.org/
  5. Cohere Rerank - https://docs.cohere.com/reference/rerank
  6. ColBERT Paper - https://arxiv.org/abs/2004.12832
  7. Self-RAG Paper - https://arxiv.org/abs/2310.11511
  8. CRAG Paper - https://arxiv.org/abs/2401.15884
  9. Adaptive RAG Paper - https://arxiv.org/abs/2403.14403
  10. HyDE Paper - https://arxiv.org/abs/2212.10496
  11. GraphRAG (Microsoft) - https://github.com/microsoft/graphrag
  12. RAG Survey Paper - https://arxiv.org/abs/2312.10997
  13. Chunking Strategies Guide - https://www.pinecone.io/learn/chunking-strategies/
  14. MTEB Leaderboard - https://huggingface.co/spaces/mteb/leaderboard