RAG Pipeline Optimization Strategy: Chunking, Reranking, and Hybrid Search

Introduction

If your team has deployed RAG (Retrieval-Augmented Generation) to production, you've likely experienced this: "It retrieves something but the answer is wrong", "The relevant document clearly exists but doesn't show up in search", "It works for short questions but hallucinates on complex ones". The root cause of these issues is almost always retrieval quality. No matter how smart the LLM, it will generate wrong answers if given wrong context.

This article covers three core pillars for maximizing RAG pipeline retrieval quality:

  1. Chunking: How to split documents
  2. Hybrid Search: How to combine dense vector and sparse keyword search
  3. Reranking: How to re-order search results

For each strategy, we examine practical code, benchmark numbers, and operational considerations. This article is based on the latest models and tools as of March 2026.

RAG Pipeline Architecture Overview

The complete flow of an advanced RAG pipeline:

[Indexing Phase]
Document Collection -> Preprocessing -> Chunking -> Embedding -> Vector DB Storage + Inverted Index Storage

[Query Phase]
Question -> Query Transformation -> Hybrid Search (Dense + Sparse)
         -> Reranking -> Top K Selection -> Prompt Construction -> LLM Generation

Three key differences compared to basic RAG:

  • Refined chunking strategy: Semantic, recursive, and agentic chunking instead of simple fixed-size
  • Hybrid search: Combining BM25 keyword search with vector similarity instead of relying solely on vectors
  • Added reranking layer: Re-evaluating initial search results with a Cross-Encoder for improved precision

Combining these three can improve retrieval accuracy (Precision@K) by 30-50%+. Let's examine each in depth.

Deep Dive into Chunking Strategies

Chunking is the first decision in a RAG pipeline and has the greatest impact on overall performance. Poor chunking is nearly impossible to recover from with subsequent optimizations.

Chunking Strategy Comparison

| Strategy   | Principle                       | Best For                        | Chunk Size      | Pros                       | Cons              |
| ---------- | ------------------------------- | ------------------------------- | --------------- | -------------------------- | ----------------- |
| Fixed-size | Split by fixed token/char count | Uniformly structured docs       | 256-512 tokens  | Simple, fast               | Ignores semantics |
| Recursive  | Recursive split by separators   | General text                    | 512-1024 tokens | Respects boundaries        | Config needed     |
| Semantic   | Detect boundaries by embedding  | Docs with frequent topic shifts | Variable        | Best semantic preservation | Costly, slow      |
| Agentic    | LLM analyzes structure          | Complex technical docs          | Variable        | Highest quality            | High cost, slow   |

Fixed-size Chunking

The simplest but still effective strategy. Even in 2026 benchmarks, 512-token fixed-size chunking sometimes outperforms complex semantic chunking.

from langchain_text_splitters import CharacterTextSplitter

# Fixed-size chunking - the most basic approach
fixed_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=512,
    chunk_overlap=50,       # 10% overlap recommended
    length_function=len,
    is_separator_regex=False,
)

chunks = fixed_splitter.split_text(document_text)
print(f"Total {len(chunks)} chunks created")

Recommended settings:

  • Factoid (fact-checking) queries: 256-512 tokens
  • Analytical queries: 1024+ tokens
  • Overlap: 10-20% of total chunk size
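
These recommendations can be captured in a small preset table; the `chunk_config` helper and its profile names below are illustrative, not a library API:

```python
# Hypothetical presets following the guidance above: smaller chunks for
# factoid lookups, larger chunks for analytical queries, 10-20% overlap.
CHUNK_PRESETS = {
    "factoid": {"chunk_size": 384, "overlap_ratio": 0.15},      # 256-512 range
    "analytical": {"chunk_size": 1024, "overlap_ratio": 0.10},  # 1024+ tokens
}

def chunk_config(query_profile: str) -> dict:
    """Return chunk_size and chunk_overlap for a query profile."""
    preset = CHUNK_PRESETS.get(query_profile, CHUNK_PRESETS["factoid"])
    size = preset["chunk_size"]
    return {"chunk_size": size, "chunk_overlap": int(size * preset["overlap_ratio"])}
```

The resulting dict can be passed straight to a splitter constructor, e.g. `RecursiveCharacterTextSplitter(**chunk_config("analytical"))`.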

Recursive Chunking

LangChain's RecursiveCharacterTextSplitter recursively splits following a separator hierarchy. It respects paragraph, sentence, and word boundaries while matching target size.

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Recursive chunking - recommended production default
recursive_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=512,
    chunk_overlap=64,
    length_function=len,
    add_start_index=True,   # For source position tracking
)

chunks = recursive_splitter.split_documents(documents)

# Each chunk automatically includes metadata
for chunk in chunks[:3]:
    print(f"Size: {len(chunk.page_content)}, "
          f"Start position: {chunk.metadata.get('start_index')}")

Practical tip: For most production environments, starting with Recursive chunking is recommended. It's simple while respecting paragraph boundaries, offering the best cost-effectiveness.

Semantic Chunking

Uses embeddings to calculate semantic similarity between adjacent sentences and splits at points where similarity drops sharply.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Semantic chunking - splitting by meaning units
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",  # percentile, standard_deviation, interquartile
    breakpoint_threshold_amount=75,          # Split at 75th percentile difference
)

semantic_chunks = semantic_splitter.split_text(document_text)
print(f"Semantic chunk count: {len(semantic_chunks)}")
print(f"Average chunk length: {sum(len(c) for c in semantic_chunks) / len(semantic_chunks):.0f}")

Caution: Semantic chunking requires embedding every sentence in the document, so cost and latency grow significantly at scale. For 100K+ documents, Recursive chunking is more practical.

Agentic Chunking

Uses an LLM to understand the document's logical structure and determine optimal split points. Most sophisticated but highest cost.

from openai import OpenAI

client = OpenAI()

def agentic_chunk(text: str, max_chunks: int = 20) -> list[dict]:
    """LLM-based agentic chunking"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Split the given text into logical units. "
                    "Each chunk should cover one complete topic. "
                    'Return a JSON object: {"chunks": '
                    '[{"title": "chunk title", "content": "chunk content", '
                    '"summary": "one-line summary"}]}'
                ),
            },
            {"role": "user", "content": text[:8000]},  # Watch token limits
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    import json
    result = json.loads(response.choices[0].message.content)
    return result.get("chunks", [])

# Usage example
chunks = agentic_chunk(long_document_text)
for chunk in chunks:
    print(f"[{chunk['title']}] {chunk['summary']}")

Cost consideration: Agentic chunking incurs LLM API calls per document, making it unsuitable for bulk indexing. Best for small volumes of high-value documents (contracts, technical specifications, etc.).

Embedding Model Selection and Optimization

After chunking, the choice of vector embedding model directly impacts retrieval quality.

Major Embedding Model Comparison

| Model                  | Dims | Max Tokens | Multilingual | MTEB Score | Features                                    |
| ---------------------- | ---- | ---------- | ------------ | ---------- | ------------------------------------------- |
| text-embedding-3-large | 3072 | 8191       | Yes          | 64.6       | OpenAI latest, dimension reduction possible |
| text-embedding-3-small | 1536 | 8191       | Yes          | 62.3       | Cost-efficient                              |
| BAAI/bge-m3            | 1024 | 8192       | Yes          | 68.2       | Open source, Dense+Sparse simultaneous      |
| Cohere embed-v4        | 1024 | 512        | Yes          | 66.1       | Multimodal support                          |
| voyage-3-large         | 1024 | 32000      | Yes          | 67.5       | Long-context specialized                    |

from langchain_openai import OpenAIEmbeddings

# OpenAI embeddings - leveraging dimension reduction
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-large",
    dimensions=1024,  # Reduce 3072 -> 1024 for cost/storage savings
)

# BGE-M3: Simultaneous Dense + Sparse generation (optimal for hybrid search)
from FlagEmbedding import BGEM3FlagModel

bge_model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

# Generate Dense and Sparse vectors simultaneously
output = bge_model.encode(
    ["RAG pipeline optimization methodology"],
    return_dense=True,
    return_sparse=True,
)

dense_vector = output["dense_vecs"][0]    # (1024,) float vector
sparse_vector = output["lexical_weights"][0]  # Sparse vector (per-word weights)

print(f"Dense dimensions: {len(dense_vector)}")
print(f"Sparse active token count: {len(sparse_vector)}")

Practical recommendations:

  • Cost priority: text-embedding-3-small (OpenAI) or bge-m3 (self-hosted)
  • Quality priority: text-embedding-3-large or voyage-3-large
  • Planning hybrid search: bge-m3 (simultaneous Dense + Sparse generation simplifies infrastructure)

Vector Database Comparison

Vector DB selection significantly impacts operational complexity, cost, and performance.

Major Vector Database Comparison

| Item               | Pinecone              | Weaviate                 | Qdrant                       | Milvus                             |
| ------------------ | --------------------- | ------------------------ | ---------------------------- | ---------------------------------- |
| Hosting            | Managed (Serverless)  | Managed + Self-hosted    | Managed + Self-hosted        | Self-hosted focused (Zilliz Cloud) |
| Hybrid Search      | Supported (Sparse)    | Native support           | Supported (Sparse)           | Supported                          |
| Metadata Filtering | Basic                 | GraphQL-based, powerful  | Rust-based, high performance | Basic                              |
| Free Tier          | Starter (100K vectors)| Sandbox                  | 1GB free (permanent)         | Open source                        |
| Query Latency      | Under 50ms            | 50-100ms                 | Under 50ms                   | 30-50ms                            |
| Scalability        | Auto-scaling          | Manual config needed     | Horizontal scaling           | K8s native                         |
| Language SDKs      | Python, JS, Go        | Python, JS, Go, Java     | Python, JS, Rust, Go         | Python, JS, Go, Java               |
| Best For           | Minimal ops teams     | OSS + flexibility        | Complex filtering            | Large-scale enterprise             |

Selection guide:

  • Quick start + minimal operations: Pinecone
  • Self-hosted + complex filtering: Qdrant
  • Open source + native hybrid search: Weaviate
  • Large scale (1B+ vectors) + GPU acceleration: Milvus/Zilliz

Hybrid Search: Combining Dense + Sparse

Vector search alone struggles with exact keyword matching, while BM25 alone can't capture semantic similarity. Hybrid search combines both approaches to improve recall by 15-30%.

Hybrid Search Architecture

Query: "What is the CPU threshold in Kubernetes HPA settings?"

Dense Search (vector similarity):
  -> "How to configure autoscaling in container orchestration" (semantically similar)

Sparse Search (BM25 keyword matching):
  -> "Set HPA targetCPUUtilizationPercentage to 80" (exact keyword match)

Hybrid Combination (RRF or weighted sum):
  -> Merge both results for optimal document retrieval

Weaviate Hybrid Search Implementation

import weaviate, { WeaviateClient } from 'weaviate-client'

// Connect Weaviate client
const client: WeaviateClient = await weaviate.connectToLocal({
  host: 'localhost',
  port: 8080,
})

// Execute hybrid search
const collection = client.collections.get('Documents')

const result = await collection.query.hybrid('Kubernetes HPA CPU threshold', {
  alpha: 0.7, // 0 = pure BM25, 1 = pure vector, 0.7 = 70% vector
  limit: 20, // Candidates before reranking
  fusionType: 'RelativeScore', // RelativeScore or Ranked
  returnMetadata: ['score', 'explainScore'],
  returnProperties: ['title', 'content', 'source'],
})

for (const item of result.objects) {
  console.log(`[${item.metadata?.score?.toFixed(3)}] ${item.properties.title}`)
}

BM25 + Dense Direct Implementation in Python

When native hybrid search from the vector DB isn't available, implement directly with Reciprocal Rank Fusion (RRF).

from rank_bm25 import BM25Okapi
import numpy as np
from typing import List, Tuple

def hybrid_search(
    query: str,
    documents: list[dict],
    dense_scores: np.ndarray,
    k: int = 10,
    alpha: float = 0.7,
    rrf_k: int = 60,
) -> list[dict]:
    """
    RRF-based hybrid search
    alpha: Dense search weight (1-alpha is Sparse weight)
    rrf_k: RRF constant (60, as recommended in the original RRF paper)
    """
    # Sparse search (BM25)
    tokenized_docs = [doc["content"].split() for doc in documents]
    bm25 = BM25Okapi(tokenized_docs)
    sparse_scores = bm25.get_scores(query.split())

    # Per-document rank calculation: a double argsort converts scores
    # into 1-indexed ranks (a single argsort gives sorted indices, not ranks)
    dense_ranks = np.argsort(np.argsort(-dense_scores)) + 1
    sparse_ranks = np.argsort(np.argsort(-sparse_scores)) + 1

    # RRF score calculation
    rrf_scores = []
    for i in range(len(documents)):
        dense_rrf = alpha / (rrf_k + dense_ranks[i])
        sparse_rrf = (1 - alpha) / (rrf_k + sparse_ranks[i])
        rrf_scores.append(dense_rrf + sparse_rrf)

    # Return top K
    top_indices = np.argsort(rrf_scores)[::-1][:k]
    return [
        {**documents[i], "hybrid_score": rrf_scores[i]}
        for i in top_indices
    ]

# Usage example
results = hybrid_search(
    query="Kubernetes HPA CPU threshold",
    documents=all_documents,
    dense_scores=embedding_similarity_scores,
    k=20,       # Generous before reranking
    alpha=0.7,  # Dense 70%, Sparse 30%
)

Alpha Tuning Guide

| Query Type                    | Recommended Alpha | Reason                           |
| ----------------------------- | ----------------- | -------------------------------- |
| Technical queries with jargon | 0.3-0.5           | Exact keyword matching important |
| Natural language questions    | 0.7-0.8           | Semantic similarity important    |
| Code search                   | 0.2-0.4           | Function/variable name matching  |
| General FAQ                   | 0.5-0.6           | Balanced search                  |

Key insight: Dynamically adjusting alpha per query type can improve Precision@1 by 2-7.5 percentage points compared to static settings.
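
A minimal sketch of such per-query alpha selection, using regex heuristics (the patterns and thresholds below are illustrative; a small trained classifier is usually more robust):

```python
import re

def choose_alpha(query: str) -> float:
    """Pick a hybrid-search alpha from surface features of the query."""
    # Code-like tokens (snake_case, camelCase, method calls) -> favor BM25
    if re.search(r"[a-z]+_[a-z]+|[a-z]+[A-Z][a-z]+|\w+\.\w+\(", query):
        return 0.3
    # Acronym-heavy technical queries -> lean toward keyword matching
    if len(re.findall(r"\b[A-Z]{2,}\b", query)) >= 2:
        return 0.4
    # Default: natural-language question, favor dense retrieval
    return 0.7
```

The returned value plugs directly into the `alpha` parameter of the hybrid search calls shown above.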

Applying Reranking Models

After broadening candidates with hybrid search, reranking models precisely adjust the final ranking. Rerankers take query and document together as input (Cross-Encoding) to directly compute relevance scores, achieving higher accuracy than Bi-Encoder embeddings.

Reranking Model Comparison

| Model                                 | Type        | Parameters  | Multilingual   | Latency (100 docs) | Cost        |
| ------------------------------------- | ----------- | ----------- | -------------- | ------------------ | ----------- |
| Cohere Rerank 4                       | API         | Undisclosed | 100+ languages | 200-400ms          | Pay-per-use |
| BAAI/bge-reranker-v2-m3               | Open source | 0.6B        | Yes            | 500-800ms (GPU)    | Free        |
| BAAI/bge-reranker-large               | Open source | 560M        | Limited        | 400-600ms (GPU)    | Free        |
| cross-encoder/ms-marco-MiniLM-L-12-v2 | Open source | 33M         | No (English)   | 100-200ms (GPU)    | Free        |

Cohere Rerank Integration

import cohere

co = cohere.ClientV2(api_key="your-cohere-api-key")

def rerank_with_cohere(
    query: str,
    documents: list[str],
    top_n: int = 5,
) -> list[dict]:
    """Rerank documents with Cohere's rerank endpoint"""
    response = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=documents,
        top_n=top_n,
        return_documents=True,
    )

    results = []
    for item in response.results:
        results.append({
            "index": item.index,
            "score": item.relevance_score,
            "text": item.document.text if item.document else documents[item.index],
        })
    return results

# Usage: Rerank 20 hybrid search results to select top 5
hybrid_results = hybrid_search(query, documents, dense_scores, k=20)
reranked = rerank_with_cohere(
    query="How to set Kubernetes HPA CPU threshold",
    documents=[r["content"] for r in hybrid_results],
    top_n=5,
)

for r in reranked:
    print(f"[{r['score']:.4f}] {r['text'][:80]}...")

Self-hosting BGE Reranker

For environments where API costs must be avoided or data cannot be sent externally, host an open-source reranker directly.

from FlagEmbedding import FlagReranker

# BGE Reranker v2 M3 - lightweight multilingual reranker
reranker = FlagReranker(
    "BAAI/bge-reranker-v2-m3",
    use_fp16=True,  # Save GPU memory
)

def rerank_with_bge(
    query: str,
    documents: list[str],
    top_n: int = 5,
) -> list[dict]:
    """Rerank documents with BGE Reranker"""
    # Create query-document pairs
    pairs = [[query, doc] for doc in documents]

    # Calculate relevance scores
    scores = reranker.compute_score(pairs, normalize=True)

    # Sort by score
    scored_docs = [
        {"index": i, "score": s, "text": d}
        for i, (s, d) in enumerate(zip(scores, documents))
    ]
    scored_docs.sort(key=lambda x: x["score"], reverse=True)

    return scored_docs[:top_n]

# Usage example
results = rerank_with_bge(
    query="What criteria determine chunking size in RAG pipelines?",
    documents=candidate_documents,
    top_n=5,
)

for r in results:
    print(f"[{r['score']:.4f}] {r['text'][:100]}...")

Full Pipeline Integration

An example integrating chunking, hybrid search, and reranking into a single pipeline.

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from FlagEmbedding import FlagReranker

class OptimizedRAGPipeline:
    """Advanced RAG Pipeline"""

    def __init__(self):
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=512, chunk_overlap=64
        )
        self.embeddings = OpenAIEmbeddings(
            model="text-embedding-3-large", dimensions=1024
        )
        self.llm = ChatOpenAI(model="gpt-4o", temperature=0)
        self.reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

    def query(self, question: str, top_k: int = 5) -> str:
        # 1. Hybrid search: 20 candidates
        candidates = self._hybrid_search(question, k=20)

        # 2. Reranking: select top 5
        reranked = self._rerank(question, candidates, top_n=top_k)

        # 3. Prompt construction and LLM generation
        context = "\n\n---\n\n".join([doc["text"] for doc in reranked])
        prompt = ChatPromptTemplate.from_messages([
            ("system", "Answer the question based on the following context.\n\n{context}"),
            ("human", "{question}"),
        ])
        chain = prompt | self.llm
        response = chain.invoke({"context": context, "question": question})
        return response.content

    def _hybrid_search(self, query: str, k: int = 20) -> list[dict]:
        # Dense + Sparse hybrid search (see code above)
        ...

    def _rerank(self, query: str, docs: list[dict], top_n: int) -> list[dict]:
        pairs = [[query, doc["text"]] for doc in docs]
        scores = self.reranker.compute_score(pairs, normalize=True)
        for doc, score in zip(docs, scores):
            doc["rerank_score"] = score
        docs.sort(key=lambda x: x["rerank_score"], reverse=True)
        return docs[:top_n]

Evaluation Metrics and Benchmarking

Quantitative evaluation is essential for RAG pipeline optimization. "It seems better" doesn't work in production.

RAGAS Framework

RAGAS (Retrieval-Augmented Generation Assessment) is a RAG-specific evaluation framework that uses LLMs as judges for automatic scoring; several of its metrics work even without ground-truth answers.

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Evaluation dataset construction
eval_data = {
    "question": [
        "What is the default CPU threshold for Kubernetes HPA?",
        "What criteria determine chunking size in RAG?",
    ],
    "answer": [
        "The default CPU threshold for HPA is 80%.",
        "It is determined in the 256-1024 token range based on query type.",
    ],
    "contexts": [
        ["HPA uses targetCPUUtilizationPercentage default value of 80."],
        ["Factoid queries recommend 256-512, analytical queries 1024+ tokens."],
    ],
    "ground_truth": [
        "The default is 80%.",
        "It is determined by query type and document characteristics.",
    ],
}

dataset = Dataset.from_dict(eval_data)

# Run evaluation
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

print(results)
# faithfulness: 0.92, answer_relevancy: 0.88,
# context_precision: 0.85, context_recall: 0.90

Unit Testing with DeepEval

DeepEval enables Pytest-style RAG testing, making it ideal for CI/CD pipeline integration.

from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    FaithfulnessMetric,
    ContextualRelevancyMetric,
    AnswerRelevancyMetric,
)

def test_rag_faithfulness():
    """Test that RAG responses are faithful to context"""
    test_case = LLMTestCase(
        input="What is the Kubernetes HPA CPU threshold?",
        actual_output="The default CPU threshold for HPA is 80%.",
        retrieval_context=[
            "HPA uses targetCPUUtilizationPercentage default value of 80."
        ],
    )

    faithfulness = FaithfulnessMetric(threshold=0.8)
    relevancy = ContextualRelevancyMetric(threshold=0.7)
    answer_rel = AnswerRelevancyMetric(threshold=0.7)

    assert_test(test_case, [faithfulness, relevancy, answer_rel])

# Run with pytest: pytest test_rag.py -v

Key Evaluation Metrics Summary

| Metric            | Measures                              | Target | Tools           |
| ----------------- | ------------------------------------- | ------ | --------------- |
| Context Precision | Relevance of retrieved context        | 0.8+   | RAGAS           |
| Context Recall    | Ratio of needed context retrieved     | 0.85+  | RAGAS           |
| Faithfulness      | Response fidelity to context          | 0.9+   | RAGAS, DeepEval |
| Answer Relevancy  | Response relevance to question        | 0.85+  | RAGAS, DeepEval |
| MRR@K             | Mean reciprocal rank of first hit     | 0.7+   | Custom          |
| NDCG@K            | Normalized discounted cumulative gain | 0.75+  | Custom          |
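
For the two "Custom" metrics, a minimal implementation looks like this (binary relevance labels are assumed for MRR, graded labels for NDCG):

```python
import math

def mrr_at_k(relevant_ids: set, ranked_ids: list, k: int = 10) -> float:
    """Reciprocal rank of the first relevant document within the top K."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(relevance: dict, ranked_ids: list, k: int = 10) -> float:
    """NDCG@K with graded relevance labels (doc_id -> gain)."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(r + 1) for r, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

Averaging these over a labeled query set gives the per-pipeline numbers to track across chunking, hybrid-search, and reranking changes.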

Operational Considerations and Troubleshooting

1. Full Re-indexing Required When Changing Embedding Models

Upgrading the embedding model changes the vector space between old and new vectors. Partial re-indexing severely degrades retrieval quality.

Solution: Use a Blue-Green index strategy. Complete a new index with the new model, then switch traffic.

# Blue-Green index switching example
import time

def reindex_with_blue_green(
    old_collection: str,
    new_collection: str,
    new_embedding_model: str,
):
    """Zero-downtime re-indexing"""
    # 1. Index into new collection (existing service uses old_collection)
    print(f"Starting indexing for new collection '{new_collection}'...")
    create_and_populate_collection(new_collection, new_embedding_model)

    # 2. Validation: run test queries against new collection
    test_results = run_evaluation_suite(new_collection)
    if test_results["context_precision"] < 0.8:
        raise ValueError(
            f"New index quality below threshold: {test_results['context_precision']:.2f}"
        )

    # 3. Traffic switch: alias or config change
    update_active_collection(new_collection)
    print(f"Traffic switched: {old_collection} -> {new_collection}")

    # 4. Keep old collection for rollback for a period
    schedule_cleanup(old_collection, delay_days=7)

2. Chunk Size and Embedding Model Token Limit Mismatch

Chunks exceeding the embedding model's maximum token count get truncated or cause errors.

Solution: Use a length function that considers the embedding model's token limit during chunking.

import tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter

enc = tiktoken.encoding_for_model("text-embedding-3-small")

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    length_function=lambda text: len(enc.encode(text)),  # Token-count based
)

3. Reranking Latency Management

Rerankers are Cross-Encoders, so latency increases proportionally with candidate count. Reranking 100 docs adds 500ms-1s.

Solutions:

  • Limit candidates to 20-30 from hybrid search
  • Use async calls for batch processing
  • Deploy open-source rerankers on GPU servers for latency
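
The batching advice can be sketched with asyncio (`score_batch` is a hypothetical coroutine standing in for an async reranker call; its overlap-count scoring is only a placeholder):

```python
import asyncio

async def score_batch(query: str, docs: list[str]) -> list[float]:
    """Placeholder for an async Cross-Encoder call: word-overlap scoring."""
    await asyncio.sleep(0)  # stands in for real network/GPU latency
    return [float(len(set(query.split()) & set(d.split()))) for d in docs]

async def rerank_batched(query: str, docs: list[str],
                         max_candidates: int = 30,
                         batch_size: int = 10) -> list[tuple[str, float]]:
    """Cap candidate count, then score batches concurrently."""
    docs = docs[:max_candidates]  # bound Cross-Encoder input
    batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
    results = await asyncio.gather(*(score_batch(query, b) for b in batches))
    scores = [s for batch in results for s in batch]
    return sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
```

With a real async reranker client, the batches overlap their network latency instead of running sequentially, which is where the wall-clock savings come from.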

4. Vector DB Index Memory Management

HNSW indexes must maintain the entire graph in memory. 1M vectors (1024 dims) use approximately 4-8GB memory.

Solutions:

  • Vector dimension reduction (3072 -> 1024)
  • Apply quantization: Scalar, Product, Binary quantization
  • Use DiskANN index (Milvus supported)
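
The 4-8GB figure can be sanity-checked with a back-of-envelope estimate (the link-count overhead below assumes HNSW's common default of M=16 and is only a rough approximation; real usage varies with quantization and index settings):

```python
def hnsw_memory_gb(n_vectors: int, dims: int, m: int = 16,
                   bytes_per_float: int = 4) -> float:
    """Rough HNSW memory estimate: raw float32 vectors plus graph links.
    Each node keeps about 2*M neighbor ids (4 bytes each) on the base layer."""
    raw = n_vectors * dims * bytes_per_float
    links = n_vectors * 2 * m * 4
    return (raw + links) / 1024**3

# 1M x 1024-dim float32 vectors: ~3.8 GB raw + ~0.1 GB links
print(f"{hnsw_memory_gb(1_000_000, 1024):.1f} GB")
```

Halving dimensions roughly halves the raw-vector term, which is why the 3072 -> 1024 reduction above is such an effective lever.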

5. Metadata Filters and Search Performance

Excessive metadata filters can drastically degrade vector search performance. This is especially problematic with high-cardinality fields (timestamps, user IDs, etc.).

Solutions:

  • Use only low-cardinality fields for filters (category, department, document type)
  • Use broad date filters and apply recency weighting during reranking
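
The recency-weighting idea can be sketched as a post-rerank score adjustment (the half-life and blend weights below are illustrative, not tuned values):

```python
def recency_adjusted(score: float, doc_timestamp: float, now: float,
                     half_life_days: float = 90.0) -> float:
    """Blend a rerank score with an exponential freshness decay.
    Timestamps are Unix epoch seconds; a doc loses half its freshness
    credit every half_life_days."""
    age_days = (now - doc_timestamp) / 86400
    decay = 0.5 ** (age_days / half_life_days)
    return 0.8 * score + 0.2 * decay  # 80% relevance, 20% freshness
```

This keeps the vector index free of high-cardinality timestamp filters while still pushing fresh documents up the final ranking.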

Failure Cases and Recovery Procedures

Case 1: Retrieval Quality Drop After Switching to Semantic Chunking

Situation: After switching from Recursive to Semantic chunking, Context Precision dropped from 0.85 to 0.72.

Cause: Semantic chunking produced highly uneven chunk sizes. Some chunks were 50 tokens while others were 2000, leading to inconsistent embedding quality.

Recovery:

  1. Added minimum/maximum size constraints to semantic chunking output
  2. Immediately rolled back to the existing Recursive chunking index (possible because of Blue-Green approach)
  3. Retried semantic chunking with min 200, max 800 token constraints
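
The min/max constraint from step 3 can be sketched as a post-processing pass over the semantic chunker's output (whitespace tokenization here is a stand-in for a real tokenizer such as tiktoken):

```python
def constrain_chunks(chunks: list[str], min_tokens: int = 200,
                     max_tokens: int = 800) -> list[str]:
    """Merge undersized semantic chunks forward and split oversized ones."""
    result: list[str] = []
    buffer = ""
    for chunk in chunks:
        buffer = f"{buffer} {chunk}".strip() if buffer else chunk
        tokens = buffer.split()
        if len(tokens) < min_tokens:
            continue  # keep merging until the minimum size is reached
        while len(tokens) > max_tokens:  # split chunks over the maximum
            result.append(" ".join(tokens[:max_tokens]))
            tokens = tokens[max_tokens:]
        if tokens:
            result.append(" ".join(tokens))
        buffer = ""
    if buffer:
        result.append(buffer)  # trailing remainder below the minimum
    return result
```

Merging at paragraph boundaries (rather than raw token counts) preserves more semantics, but even this simple pass removes the 50-token and 2000-token outliers that hurt embedding quality.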

Case 2: Fixed Hybrid Search Alpha Causing Performance Issues for Specific Query Types

Situation: While operating with fixed alpha=0.7, multiple issues with code search queries failing to find exact function names.

Cause: Code-related queries need exact keyword matching, but Dense weighting was too high.

Recovery:

  1. Added query classifier for automatic query type detection
  2. Dynamic adjustment: alpha=0.3 for code/technical queries, alpha=0.7 for natural language
  3. Classifier itself uses a lightweight model (distilbert-based) with under 10ms latency

Case 3: Service Down During Reranker Failure

Situation: Cohere Rerank API outage caused the entire RAG pipeline to become unresponsive.

Cause: Reranking step was configured as mandatory with no fallback logic.

Recovery:

  1. Made the reranking step optional
  2. On timeout (2s) or API error, return hybrid search results as-is
  3. Deployed self-hosted BGE Reranker as backup for redundancy

import asyncio

async def rerank_with_fallback(
    query: str,
    documents: list[str],
    top_n: int = 5,
    timeout: float = 2.0,
) -> list[dict]:
    """Reranking with fallback"""
    try:
        # Primary: Cohere Rerank (2s timeout)
        result = await asyncio.wait_for(
            cohere_rerank_async(query, documents, top_n),
            timeout=timeout,
        )
        return result
    except Exception as e:  # asyncio.TimeoutError or API error
        print(f"Cohere reranking failed, BGE fallback: {e}")
        try:
            # Fallback: Self-hosted BGE Reranker
            return rerank_with_bge(query, documents, top_n)
        except Exception as e2:
            print(f"BGE reranking also failed, returning original ranking: {e2}")
            # Final fallback: return hybrid search results as-is
            return [
                {"index": i, "score": 1.0 - i * 0.05, "text": d}
                for i, d in enumerate(documents[:top_n])
            ]

Case 4: Service Quality Degradation During Bulk Re-indexing

Situation: During re-indexing of 100K documents, embedding API call surge hit rate limits, delaying real-time query embedding responses too.

Recovery:

  1. Separated API keys/endpoints for indexing and querying
  2. Added batch size control and rate limit handling logic for indexing
  3. Ran indexing during off-peak hours to avoid competing with query traffic
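
Step 2's batch-size control and rate-limit handling can be sketched as follows (`embed_batch` is a hypothetical wrapper around your embedding API, and `RuntimeError` stands in for the provider's rate-limit exception):

```python
import time

def index_with_rate_limit(texts: list[str], embed_batch,
                          batch_size: int = 64,
                          requests_per_minute: int = 60,
                          max_retries: int = 3) -> list:
    """Embed texts in throttled batches with exponential backoff on errors."""
    interval = 60.0 / requests_per_minute  # minimum seconds between calls
    vectors = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        for attempt in range(max_retries):
            try:
                vectors.extend(embed_batch(batch))
                break
            except RuntimeError:  # stand-in for a rate-limit error
                time.sleep(interval * 2 ** attempt)  # exponential backoff
        time.sleep(interval)  # pace requests below the rate limit
    return vectors
```

A production version would also log batches that exhaust their retries so they can be re-queued rather than silently dropped.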

Conclusion

RAG pipeline optimization is not a single technology but a combination of chunking, search, reranking, and evaluation. Optimize each stage independently, but make decisions based on the overall pipeline's evaluation metrics.

Recommended implementation order:

  1. Recursive chunking + Dense search to establish baseline (1 week)
  2. RAGAS/DeepEval evaluation pipeline setup (1 week)
  3. Add hybrid search to improve Recall (1 week)
  4. Add reranking to improve Precision (1 week)
  5. Per-query dynamic alpha and continuous improvement based on evaluation (ongoing)

The approach of "apply everything at once" will fail. At each step, verify evaluation metric changes and immediately roll back if things get worse. That's the production playbook.
