RAG Pipeline Optimization Strategy: Chunking, Reranking, and Hybrid Search

Introduction

If your team has deployed RAG (Retrieval-Augmented Generation) to production, you've likely experienced this: "It retrieves something but the answer is wrong", "The relevant document clearly exists but doesn't show up in search", "It works for short questions but hallucinates on complex ones". The root cause of these issues is almost always retrieval quality. No matter how smart the LLM, it will generate wrong answers if given wrong context.

This article covers three core pillars for maximizing RAG pipeline retrieval quality:

  1. Chunking: How to split documents
  2. Hybrid Search: How to combine dense vector and sparse keyword search
  3. Reranking: How to re-order search results

For each strategy, we examine practical code, benchmark numbers, and operational considerations. This article is based on the latest models and tools as of March 2026.

RAG Pipeline Architecture Overview

The complete flow of an advanced RAG pipeline:

[Indexing Phase]
Document Collection -> Preprocessing -> Chunking -> Embedding -> Vector DB Storage + Inverted Index Storage

[Query Phase]
Question -> Query Transformation -> Hybrid Search (Dense + Sparse)
         -> Reranking -> Top K Selection -> Prompt Construction -> LLM Generation

Three key differences compared to basic RAG:

  • Refined chunking strategy: Semantic, recursive, and agentic chunking instead of simple fixed-size
  • Hybrid search: Combining BM25 keyword search with vector similarity instead of relying solely on vectors
  • Added reranking layer: Re-evaluating initial search results with a Cross-Encoder for improved precision

Combining these three can improve retrieval accuracy (Precision@K) by 30-50%+. Let's examine each in depth.

Deep Dive into Chunking Strategies

Chunking is the first decision in a RAG pipeline and has the greatest impact on overall performance. Poor chunking is nearly impossible to recover from with subsequent optimizations.

Chunking Strategy Comparison

| Strategy   | Principle                       | Best For                        | Chunk Size      | Pros                       | Cons              |
| ---------- | ------------------------------- | ------------------------------- | --------------- | -------------------------- | ----------------- |
| Fixed-size | Split by fixed token/char count | Uniformly structured docs       | 256-512 tokens  | Simple, fast               | Ignores semantics |
| Recursive  | Recursive split by separators   | General text                    | 512-1024 tokens | Respects boundaries        | Config needed     |
| Semantic   | Detect boundaries by embedding  | Docs with frequent topic shifts | Variable        | Best semantic preservation | Costly, slow      |
| Agentic    | LLM analyzes structure          | Complex technical docs          | Variable        | Highest quality            | High cost, slow   |

Fixed-size Chunking

The simplest but still effective strategy. Even in 2026 benchmarks, 512-token fixed-size chunking sometimes outperforms complex semantic chunking.

from langchain_text_splitters import CharacterTextSplitter

# Fixed-size chunking - the most basic approach
fixed_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=512,
    chunk_overlap=50,       # 10% overlap recommended
    length_function=len,
    is_separator_regex=False,
)

chunks = fixed_splitter.split_text(document_text)
print(f"Total {len(chunks)} chunks created")

Recommended settings:

  • Factoid (fact-checking) queries: 256-512 tokens
  • Analytical queries: 1024+ tokens
  • Overlap: 10-20% of total chunk size
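
These recommendations can be captured in a small preset table; the `chunk_config` helper and its profile names below are illustrative, not a library API:

```python
# Hypothetical presets following the guidance above: smaller chunks for
# factoid lookups, larger chunks for analytical queries, 10-20% overlap.
CHUNK_PRESETS = {
    "factoid": {"chunk_size": 384, "overlap_ratio": 0.15},      # 256-512 range
    "analytical": {"chunk_size": 1024, "overlap_ratio": 0.10},  # 1024+ tokens
}

def chunk_config(query_profile: str) -> dict:
    """Return chunk_size and chunk_overlap for a query profile."""
    preset = CHUNK_PRESETS.get(query_profile, CHUNK_PRESETS["factoid"])
    size = preset["chunk_size"]
    return {"chunk_size": size, "chunk_overlap": int(size * preset["overlap_ratio"])}
```

The resulting dict can be passed straight to a splitter constructor, e.g. `RecursiveCharacterTextSplitter(**chunk_config("analytical"))`.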

Recursive Chunking

LangChain's RecursiveCharacterTextSplitter recursively splits following a separator hierarchy. It respects paragraph, sentence, and word boundaries while matching target size.

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Recursive chunking - recommended production default
recursive_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=512,
    chunk_overlap=64,
    length_function=len,
    add_start_index=True,   # For source position tracking
)

chunks = recursive_splitter.split_documents(documents)

# Each chunk automatically includes metadata
for chunk in chunks[:3]:
    print(f"Size: {len(chunk.page_content)}, "
          f"Start position: {chunk.metadata.get('start_index')}")

Practical tip: For most production environments, starting with Recursive chunking is recommended. It's simple while respecting paragraph boundaries, offering the best cost-effectiveness.

Semantic Chunking

Uses embeddings to calculate semantic similarity between adjacent sentences and splits at points where similarity drops sharply.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Semantic chunking - splitting by meaning units
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",  # percentile, standard_deviation, interquartile
    breakpoint_threshold_amount=75,          # Split at 75th percentile difference
)

semantic_chunks = semantic_splitter.split_text(document_text)
print(f"Semantic chunk count: {len(semantic_chunks)}")
print(f"Average chunk length: {sum(len(c) for c in semantic_chunks) / len(semantic_chunks):.0f}")

Caution: Semantic chunking requires embedding every sentence in the document, so cost and latency grow significantly at scale. For 100K+ documents, Recursive chunking is more practical.

Agentic Chunking

Uses an LLM to understand the document's logical structure and determine optimal split points. Most sophisticated but highest cost.

from openai import OpenAI

client = OpenAI()

def agentic_chunk(text: str, max_chunks: int = 20) -> list[dict]:
    """LLM-based agentic chunking"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Split the given text into logical units. "
                    "Each chunk should cover one complete topic. "
                    'Return a JSON object: {"chunks": '
                    '[{"title": "chunk title", "content": "chunk content", '
                    '"summary": "one-line summary"}]}'
                ),
            },
            {"role": "user", "content": text[:8000]},  # Watch token limits
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    import json
    result = json.loads(response.choices[0].message.content)
    return result.get("chunks", [])

# Usage example
chunks = agentic_chunk(long_document_text)
for chunk in chunks:
    print(f"[{chunk['title']}] {chunk['summary']}")

Cost consideration: Agentic chunking incurs LLM API calls per document, making it unsuitable for bulk indexing. Best for small volumes of high-value documents (contracts, technical specifications, etc.).

Embedding Model Selection and Optimization

After chunking, the choice of vector embedding model directly impacts retrieval quality.

Major Embedding Model Comparison

| Model                  | Dims | Max Tokens | Multilingual | MTEB Score | Features                                    |
| ---------------------- | ---- | ---------- | ------------ | ---------- | ------------------------------------------- |
| text-embedding-3-large | 3072 | 8191       | Yes          | 64.6       | OpenAI latest, dimension reduction possible |
| text-embedding-3-small | 1536 | 8191       | Yes          | 62.3       | Cost-efficient                              |
| BAAI/bge-m3            | 1024 | 8192       | Yes          | 68.2       | Open source, Dense+Sparse simultaneous      |
| Cohere embed-v4        | 1024 | 512        | Yes          | 66.1       | Multimodal support                          |
| voyage-3-large         | 1024 | 32000      | Yes          | 67.5       | Long-context specialized                    |

from langchain_openai import OpenAIEmbeddings

# OpenAI embeddings - leveraging dimension reduction
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-large",
    dimensions=1024,  # Reduce 3072 -> 1024 for cost/storage savings
)

# BGE-M3: Simultaneous Dense + Sparse generation (optimal for hybrid search)
from FlagEmbedding import BGEM3FlagModel

bge_model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

# Generate Dense and Sparse vectors simultaneously
output = bge_model.encode(
    ["RAG pipeline optimization methodology"],
    return_dense=True,
    return_sparse=True,
)

dense_vector = output["dense_vecs"][0]    # (1024,) float vector
sparse_vector = output["lexical_weights"][0]  # Sparse vector (per-word weights)

print(f"Dense dimensions: {len(dense_vector)}")
print(f"Sparse active token count: {len(sparse_vector)}")

Practical recommendations:

  • Cost priority: text-embedding-3-small (OpenAI) or bge-m3 (self-hosted)
  • Quality priority: text-embedding-3-large or voyage-3-large
  • Planning hybrid search: bge-m3 (simultaneous Dense + Sparse generation simplifies infrastructure)

Vector Database Comparison

Vector DB selection significantly impacts operational complexity, cost, and performance.

Major Vector Database Comparison

| Item               | Pinecone              | Weaviate                 | Qdrant                       | Milvus                             |
| ------------------ | --------------------- | ------------------------ | ---------------------------- | ---------------------------------- |
| Hosting            | Managed (Serverless)  | Managed + Self-hosted    | Managed + Self-hosted        | Self-hosted focused (Zilliz Cloud) |
| Hybrid Search      | Supported (Sparse)    | Native support           | Supported (Sparse)           | Supported                          |
| Metadata Filtering | Basic                 | GraphQL-based, powerful  | Rust-based, high performance | Basic                              |
| Free Tier          | Starter (100K vectors)| Sandbox                  | 1GB free (permanent)         | Open source                        |
| Query Latency      | Under 50ms            | 50-100ms                 | Under 50ms                   | 30-50ms                            |
| Scalability        | Auto-scaling          | Manual config needed     | Horizontal scaling           | K8s native                         |
| Language SDKs      | Python, JS, Go        | Python, JS, Go, Java     | Python, JS, Rust, Go         | Python, JS, Go, Java               |
| Best For           | Minimal ops teams     | OSS + flexibility        | Complex filtering            | Large-scale enterprise             |

Selection guide:

  • Quick start + minimal operations: Pinecone
  • Self-hosted + complex filtering: Qdrant
  • Open source + native hybrid search: Weaviate
  • Large scale (1B+ vectors) + GPU acceleration: Milvus/Zilliz

Hybrid Search: Combining Dense + Sparse

Vector search alone struggles with exact keyword matching, while BM25 alone can't capture semantic similarity. Hybrid search combines both approaches to improve recall by 15-30%.

Hybrid Search Architecture

Query: "What is the CPU threshold in Kubernetes HPA settings?"

Dense Search (vector similarity):
  -> "How to configure autoscaling in container orchestration" (semantically similar)

Sparse Search (BM25 keyword matching):
  -> "Set HPA targetCPUUtilizationPercentage to 80" (exact keyword match)

Hybrid Combination (RRF or weighted sum):
  -> Merge both results for optimal document retrieval

Weaviate Hybrid Search Implementation

import weaviate, { WeaviateClient } from 'weaviate-client'

// Connect Weaviate client
const client: WeaviateClient = await weaviate.connectToLocal({
  host: 'localhost',
  port: 8080,
})

// Execute hybrid search
const collection = client.collections.get('Documents')

const result = await collection.query.hybrid('Kubernetes HPA CPU threshold', {
  alpha: 0.7, // 0 = pure BM25, 1 = pure vector, 0.7 = 70% vector
  limit: 20, // Candidates before reranking
  fusionType: 'RelativeScore', // RelativeScore or Ranked
  returnMetadata: ['score', 'explainScore'],
  returnProperties: ['title', 'content', 'source'],
})

for (const item of result.objects) {
  console.log(`[${item.metadata?.score?.toFixed(3)}] ${item.properties.title}`)
}

BM25 + Dense Direct Implementation in Python

When native hybrid search from the vector DB isn't available, implement directly with Reciprocal Rank Fusion (RRF).

from rank_bm25 import BM25Okapi
import numpy as np
from typing import List, Tuple

def hybrid_search(
    query: str,
    documents: list[dict],
    dense_scores: np.ndarray,
    k: int = 10,
    alpha: float = 0.7,
    rrf_k: int = 60,
) -> list[dict]:
    """
    RRF-based hybrid search
    alpha: Dense search weight (1-alpha is Sparse weight)
    rrf_k: RRF constant (60, as recommended in the original RRF paper)
    """
    # Sparse search (BM25)
    tokenized_docs = [doc["content"].split() for doc in documents]
    bm25 = BM25Okapi(tokenized_docs)
    sparse_scores = bm25.get_scores(query.split())

    # Per-document rank calculation: a double argsort converts scores
    # into 1-indexed ranks (a single argsort gives sorted indices, not ranks)
    dense_ranks = np.argsort(np.argsort(-dense_scores)) + 1
    sparse_ranks = np.argsort(np.argsort(-sparse_scores)) + 1

    # RRF score calculation
    rrf_scores = []
    for i in range(len(documents)):
        dense_rrf = alpha / (rrf_k + dense_ranks[i])
        sparse_rrf = (1 - alpha) / (rrf_k + sparse_ranks[i])
        rrf_scores.append(dense_rrf + sparse_rrf)

    # Return top K
    top_indices = np.argsort(rrf_scores)[::-1][:k]
    return [
        {**documents[i], "hybrid_score": rrf_scores[i]}
        for i in top_indices
    ]

# Usage example
results = hybrid_search(
    query="Kubernetes HPA CPU threshold",
    documents=all_documents,
    dense_scores=embedding_similarity_scores,
    k=20,       # Generous before reranking
    alpha=0.7,  # Dense 70%, Sparse 30%
)

Alpha Tuning Guide

| Query Type                    | Recommended Alpha | Reason                           |
| ----------------------------- | ----------------- | -------------------------------- |
| Technical queries with jargon | 0.3-0.5           | Exact keyword matching important |
| Natural language questions    | 0.7-0.8           | Semantic similarity important    |
| Code search                   | 0.2-0.4           | Function/variable name matching  |
| General FAQ                   | 0.5-0.6           | Balanced search                  |

Key insight: Dynamically adjusting alpha per query type can improve Precision@1 by 2-7.5 percentage points compared to static settings.
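
A minimal sketch of such per-query alpha selection, using regex heuristics (the patterns and thresholds below are illustrative; a small trained classifier is usually more robust):

```python
import re

def choose_alpha(query: str) -> float:
    """Pick a hybrid-search alpha from surface features of the query."""
    # Code-like tokens (snake_case, camelCase, method calls) -> favor BM25
    if re.search(r"[a-z]+_[a-z]+|[a-z]+[A-Z][a-z]+|\w+\.\w+\(", query):
        return 0.3
    # Acronym-heavy technical queries -> lean toward keyword matching
    if len(re.findall(r"\b[A-Z]{2,}\b", query)) >= 2:
        return 0.4
    # Default: natural-language question, favor dense retrieval
    return 0.7
```

The returned value plugs directly into the `alpha` parameter of the hybrid search calls shown above.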

Applying Reranking Models

After broadening candidates with hybrid search, reranking models precisely adjust the final ranking. Rerankers take query and document together as input (Cross-Encoding) to directly compute relevance scores, achieving higher accuracy than Bi-Encoder embeddings.

Reranking Model Comparison

| Model                                 | Type        | Parameters  | Multilingual   | Latency (100 docs) | Cost        |
| ------------------------------------- | ----------- | ----------- | -------------- | ------------------ | ----------- |
| Cohere Rerank 4                       | API         | Undisclosed | 100+ languages | 200-400ms          | Pay-per-use |
| BAAI/bge-reranker-v2-m3               | Open source | 0.6B        | Yes            | 500-800ms (GPU)    | Free        |
| BAAI/bge-reranker-large               | Open source | 560M        | Limited        | 400-600ms (GPU)    | Free        |
| cross-encoder/ms-marco-MiniLM-L-12-v2 | Open source | 33M         | No (English)   | 100-200ms (GPU)    | Free        |

Cohere Rerank Integration

import cohere

co = cohere.ClientV2(api_key="your-cohere-api-key")

def rerank_with_cohere(
    query: str,
    documents: list[str],
    top_n: int = 5,
) -> list[dict]:
    """Rerank documents with Cohere's rerank endpoint"""
    response = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=documents,
        top_n=top_n,
        return_documents=True,
    )

    results = []
    for item in response.results:
        results.append({
            "index": item.index,
            "score": item.relevance_score,
            "text": item.document.text if item.document else documents[item.index],
        })
    return results

# Usage: Rerank 20 hybrid search results to select top 5
hybrid_results = hybrid_search(query, documents, dense_scores, k=20)
reranked = rerank_with_cohere(
    query="How to set Kubernetes HPA CPU threshold",
    documents=[r["content"] for r in hybrid_results],
    top_n=5,
)

for r in reranked:
    print(f"[{r['score']:.4f}] {r['text'][:80]}...")

Self-hosting BGE Reranker

For environments where API costs must be avoided or data cannot be sent externally, host an open-source reranker directly.

from FlagEmbedding import FlagReranker

# BGE Reranker v2 M3 - lightweight multilingual reranker
reranker = FlagReranker(
    "BAAI/bge-reranker-v2-m3",
    use_fp16=True,  # Save GPU memory
)

def rerank_with_bge(
    query: str,
    documents: list[str],
    top_n: int = 5,
) -> list[dict]:
    """Rerank documents with BGE Reranker"""
    # Create query-document pairs
    pairs = [[query, doc] for doc in documents]

    # Calculate relevance scores
    scores = reranker.compute_score(pairs, normalize=True)

    # Sort by score
    scored_docs = [
        {"index": i, "score": s, "text": d}
        for i, (s, d) in enumerate(zip(scores, documents))
    ]
    scored_docs.sort(key=lambda x: x["score"], reverse=True)

    return scored_docs[:top_n]

# Usage example
results = rerank_with_bge(
    query="What criteria determine chunking size in RAG pipelines?",
    documents=candidate_documents,
    top_n=5,
)

for r in results:
    print(f"[{r['score']:.4f}] {r['text'][:100]}...")

Full Pipeline Integration

An example integrating chunking, hybrid search, and reranking into a single pipeline.

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from FlagEmbedding import FlagReranker

class OptimizedRAGPipeline:
    """Advanced RAG Pipeline"""

    def __init__(self):
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=512, chunk_overlap=64
        )
        self.embeddings = OpenAIEmbeddings(
            model="text-embedding-3-large", dimensions=1024
        )
        self.llm = ChatOpenAI(model="gpt-4o", temperature=0)
        self.reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

    def query(self, question: str, top_k: int = 5) -> str:
        # 1. Hybrid search: 20 candidates
        candidates = self._hybrid_search(question, k=20)

        # 2. Reranking: select top 5
        reranked = self._rerank(question, candidates, top_n=top_k)

        # 3. Prompt construction and LLM generation
        context = "\n\n---\n\n".join([doc["text"] for doc in reranked])
        prompt = ChatPromptTemplate.from_messages([
            ("system", "Answer the question based on the following context.\n\n{context}"),
            ("human", "{question}"),
        ])
        chain = prompt | self.llm
        response = chain.invoke({"context": context, "question": question})
        return response.content

    def _hybrid_search(self, query: str, k: int = 20) -> list[dict]:
        # Dense + Sparse hybrid search (see code above)
        ...

    def _rerank(self, query: str, docs: list[dict], top_n: int) -> list[dict]:
        pairs = [[query, doc["text"]] for doc in docs]
        scores = self.reranker.compute_score(pairs, normalize=True)
        for doc, score in zip(docs, scores):
            doc["rerank_score"] = score
        docs.sort(key=lambda x: x["rerank_score"], reverse=True)
        return docs[:top_n]

Evaluation Metrics and Benchmarking

Quantitative evaluation is essential for RAG pipeline optimization. "It seems better" doesn't work in production.

RAGAS Framework

RAGAS (Retrieval-Augmented Generation Assessment) is a RAG-specific evaluation framework that uses LLMs as judges for automatic scoring; several of its metrics work even without ground-truth answers.

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Evaluation dataset construction
eval_data = {
    "question": [
        "What is the default CPU threshold for Kubernetes HPA?",
        "What criteria determine chunking size in RAG?",
    ],
    "answer": [
        "The default CPU threshold for HPA is 80%.",
        "It is determined in the 256-1024 token range based on query type.",
    ],
    "contexts": [
        ["HPA uses targetCPUUtilizationPercentage default value of 80."],
        ["Factoid queries recommend 256-512, analytical queries 1024+ tokens."],
    ],
    "ground_truth": [
        "The default is 80%.",
        "It is determined by query type and document characteristics.",
    ],
}

dataset = Dataset.from_dict(eval_data)

# Run evaluation
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

print(results)
# faithfulness: 0.92, answer_relevancy: 0.88,
# context_precision: 0.85, context_recall: 0.90

Unit Testing with DeepEval

DeepEval enables Pytest-style RAG testing, making it ideal for CI/CD pipeline integration.

from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    FaithfulnessMetric,
    ContextualRelevancyMetric,
    AnswerRelevancyMetric,
)

def test_rag_faithfulness():
    """Test that RAG responses are faithful to context"""
    test_case = LLMTestCase(
        input="What is the Kubernetes HPA CPU threshold?",
        actual_output="The default CPU threshold for HPA is 80%.",
        retrieval_context=[
            "HPA uses targetCPUUtilizationPercentage default value of 80."
        ],
    )

    faithfulness = FaithfulnessMetric(threshold=0.8)
    relevancy = ContextualRelevancyMetric(threshold=0.7)
    answer_rel = AnswerRelevancyMetric(threshold=0.7)

    assert_test(test_case, [faithfulness, relevancy, answer_rel])

# Run with pytest: pytest test_rag.py -v

Key Evaluation Metrics Summary

| Metric            | Measures                              | Target | Tools           |
| ----------------- | ------------------------------------- | ------ | --------------- |
| Context Precision | Relevance of retrieved context        | 0.8+   | RAGAS           |
| Context Recall    | Ratio of needed context retrieved     | 0.85+  | RAGAS           |
| Faithfulness      | Response fidelity to context          | 0.9+   | RAGAS, DeepEval |
| Answer Relevancy  | Response relevance to question        | 0.85+  | RAGAS, DeepEval |
| MRR@K             | Mean reciprocal rank of first hit     | 0.7+   | Custom          |
| NDCG@K            | Normalized discounted cumulative gain | 0.75+  | Custom          |
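
For the two "Custom" metrics, a minimal implementation looks like this (binary relevance labels are assumed for MRR, graded labels for NDCG):

```python
import math

def mrr_at_k(relevant_ids: set, ranked_ids: list, k: int = 10) -> float:
    """Reciprocal rank of the first relevant document within the top K."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(relevance: dict, ranked_ids: list, k: int = 10) -> float:
    """NDCG@K with graded relevance labels (doc_id -> gain)."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(r + 1) for r, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

Averaging these over a labeled query set gives the per-pipeline numbers to track across chunking, hybrid-search, and reranking changes.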

Operational Considerations and Troubleshooting

1. Full Re-indexing Required When Changing Embedding Models

Upgrading the embedding model changes the vector space between old and new vectors. Partial re-indexing severely degrades retrieval quality.

Solution: Use a Blue-Green index strategy. Complete a new index with the new model, then switch traffic.

# Blue-Green index switching example
import time

def reindex_with_blue_green(
    old_collection: str,
    new_collection: str,
    new_embedding_model: str,
):
    """Zero-downtime re-indexing"""
    # 1. Index into new collection (existing service uses old_collection)
    print(f"Starting indexing for new collection '{new_collection}'...")
    create_and_populate_collection(new_collection, new_embedding_model)

    # 2. Validation: run test queries against new collection
    test_results = run_evaluation_suite(new_collection)
    if test_results["context_precision"] < 0.8:
        raise ValueError(
            f"New index quality below threshold: {test_results['context_precision']:.2f}"
        )

    # 3. Traffic switch: alias or config change
    update_active_collection(new_collection)
    print(f"Traffic switched: {old_collection} -> {new_collection}")

    # 4. Keep old collection for rollback for a period
    schedule_cleanup(old_collection, delay_days=7)

2. Chunk Size and Embedding Model Token Limit Mismatch

Chunks exceeding the embedding model's maximum token count get truncated or cause errors.

Solution: Use a length function that considers the embedding model's token limit during chunking.

import tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter

enc = tiktoken.encoding_for_model("text-embedding-3-small")

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    length_function=lambda text: len(enc.encode(text)),  # Token-count based
)

3. Reranking Latency Management

Rerankers are Cross-Encoders, so latency increases proportionally with candidate count. Reranking 100 docs adds 500ms-1s.

Solutions:

  • Limit candidates to 20-30 from hybrid search
  • Use async calls for batch processing
  • Deploy open-source rerankers on GPU servers for latency
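
The batching advice can be sketched with asyncio (`score_batch` is a hypothetical coroutine standing in for an async reranker call; its overlap-count scoring is only a placeholder):

```python
import asyncio

async def score_batch(query: str, docs: list[str]) -> list[float]:
    """Placeholder for an async Cross-Encoder call: word-overlap scoring."""
    await asyncio.sleep(0)  # stands in for real network/GPU latency
    return [float(len(set(query.split()) & set(d.split()))) for d in docs]

async def rerank_batched(query: str, docs: list[str],
                         max_candidates: int = 30,
                         batch_size: int = 10) -> list[tuple[str, float]]:
    """Cap candidate count, then score batches concurrently."""
    docs = docs[:max_candidates]  # bound Cross-Encoder input
    batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
    results = await asyncio.gather(*(score_batch(query, b) for b in batches))
    scores = [s for batch in results for s in batch]
    return sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
```

With a real async reranker client, the batches overlap their network latency instead of running sequentially, which is where the wall-clock savings come from.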

4. Vector DB Index Memory Management

HNSW indexes must maintain the entire graph in memory. 1M vectors (1024 dims) use approximately 4-8GB memory.

Solutions:

  • Vector dimension reduction (3072 -> 1024)
  • Apply quantization: Scalar, Product, Binary quantization
  • Use DiskANN index (Milvus supported)
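
The 4-8GB figure can be sanity-checked with a back-of-envelope estimate (the link-count overhead below assumes HNSW's common default of M=16 and is only a rough approximation; real usage varies with quantization and index settings):

```python
def hnsw_memory_gb(n_vectors: int, dims: int, m: int = 16,
                   bytes_per_float: int = 4) -> float:
    """Rough HNSW memory estimate: raw float32 vectors plus graph links.
    Each node keeps about 2*M neighbor ids (4 bytes each) on the base layer."""
    raw = n_vectors * dims * bytes_per_float
    links = n_vectors * 2 * m * 4
    return (raw + links) / 1024**3

# 1M x 1024-dim float32 vectors: ~3.8 GB raw + ~0.1 GB links
print(f"{hnsw_memory_gb(1_000_000, 1024):.1f} GB")
```

Halving dimensions roughly halves the raw-vector term, which is why the 3072 -> 1024 reduction above is such an effective lever.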

5. Metadata Filters and Search Performance

Excessive metadata filters can drastically degrade vector search performance. This is especially problematic with high-cardinality fields (timestamps, user IDs, etc.).

Solutions:

  • Use only low-cardinality fields for filters (category, department, document type)
  • Use broad date filters and apply recency weighting during reranking
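
The recency-weighting idea can be sketched as a post-rerank score adjustment (the half-life and blend weights below are illustrative, not tuned values):

```python
def recency_adjusted(score: float, doc_timestamp: float, now: float,
                     half_life_days: float = 90.0) -> float:
    """Blend a rerank score with an exponential freshness decay.
    Timestamps are Unix epoch seconds; a doc loses half its freshness
    credit every half_life_days."""
    age_days = (now - doc_timestamp) / 86400
    decay = 0.5 ** (age_days / half_life_days)
    return 0.8 * score + 0.2 * decay  # 80% relevance, 20% freshness
```

This keeps the vector index free of high-cardinality timestamp filters while still pushing fresh documents up the final ranking.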

Failure Cases and Recovery Procedures

Case 1: Retrieval Quality Drop After Switching to Semantic Chunking

Situation: After switching from Recursive to Semantic chunking, Context Precision dropped from 0.85 to 0.72.

Cause: Semantic chunking produced highly uneven chunk sizes. Some chunks were 50 tokens while others were 2000, leading to inconsistent embedding quality.

Recovery:

  1. Added minimum/maximum size constraints to semantic chunking output
  2. Immediately rolled back to the existing Recursive chunking index (possible because of Blue-Green approach)
  3. Retried semantic chunking with min 200, max 800 token constraints
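
The min/max constraint from step 3 can be sketched as a post-processing pass over the semantic chunker's output (whitespace tokenization here is a stand-in for a real tokenizer such as tiktoken):

```python
def constrain_chunks(chunks: list[str], min_tokens: int = 200,
                     max_tokens: int = 800) -> list[str]:
    """Merge undersized semantic chunks forward and split oversized ones."""
    result: list[str] = []
    buffer = ""
    for chunk in chunks:
        buffer = f"{buffer} {chunk}".strip() if buffer else chunk
        tokens = buffer.split()
        if len(tokens) < min_tokens:
            continue  # keep merging until the minimum size is reached
        while len(tokens) > max_tokens:  # split chunks over the maximum
            result.append(" ".join(tokens[:max_tokens]))
            tokens = tokens[max_tokens:]
        if tokens:
            result.append(" ".join(tokens))
        buffer = ""
    if buffer:
        result.append(buffer)  # trailing remainder below the minimum
    return result
```

Merging at paragraph boundaries (rather than raw token counts) preserves more semantics, but even this simple pass removes the 50-token and 2000-token outliers that hurt embedding quality.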

Case 2: Fixed Hybrid Search Alpha Causing Performance Issues for Specific Query Types

Situation: While operating with fixed alpha=0.7, multiple issues with code search queries failing to find exact function names.

Cause: Code-related queries need exact keyword matching, but Dense weighting was too high.

Recovery:

  1. Added query classifier for automatic query type detection
  2. Dynamic adjustment: alpha=0.3 for code/technical queries, alpha=0.7 for natural language
  3. Classifier itself uses a lightweight model (distilbert-based) with under 10ms latency

Case 3: Service Down During Reranker Failure

Situation: Cohere Rerank API outage caused the entire RAG pipeline to become unresponsive.

Cause: Reranking step was configured as mandatory with no fallback logic.

Recovery:

  1. Made the reranking step optional
  2. On timeout (2s) or API error, return hybrid search results as-is
  3. Deployed self-hosted BGE Reranker as backup for redundancy

import asyncio

async def rerank_with_fallback(
    query: str,
    documents: list[str],
    top_n: int = 5,
    timeout: float = 2.0,
) -> list[dict]:
    """Reranking with fallback"""
    try:
        # Primary: Cohere Rerank (2s timeout)
        result = await asyncio.wait_for(
            cohere_rerank_async(query, documents, top_n),
            timeout=timeout,
        )
        return result
    except Exception as e:  # asyncio.TimeoutError or API error
        print(f"Cohere reranking failed, BGE fallback: {e}")
        try:
            # Fallback: Self-hosted BGE Reranker
            return rerank_with_bge(query, documents, top_n)
        except Exception as e2:
            print(f"BGE reranking also failed, returning original ranking: {e2}")
            # Final fallback: return hybrid search results as-is
            return [
                {"index": i, "score": 1.0 - i * 0.05, "text": d}
                for i, d in enumerate(documents[:top_n])
            ]

Case 4: Service Quality Degradation During Bulk Re-indexing

Situation: During re-indexing of 100K documents, embedding API call surge hit rate limits, delaying real-time query embedding responses too.

Recovery:

  1. Separated API keys/endpoints for indexing and querying
  2. Added batch size control and rate limit handling logic for indexing
  3. Ran indexing during off-peak hours to avoid competing with query traffic
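
Step 2's batch-size control and rate-limit handling can be sketched as follows (`embed_batch` is a hypothetical wrapper around your embedding API, and `RuntimeError` stands in for the provider's rate-limit exception):

```python
import time

def index_with_rate_limit(texts: list[str], embed_batch,
                          batch_size: int = 64,
                          requests_per_minute: int = 60,
                          max_retries: int = 3) -> list:
    """Embed texts in throttled batches with exponential backoff on errors."""
    interval = 60.0 / requests_per_minute  # minimum seconds between calls
    vectors = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        for attempt in range(max_retries):
            try:
                vectors.extend(embed_batch(batch))
                break
            except RuntimeError:  # stand-in for a rate-limit error
                time.sleep(interval * 2 ** attempt)  # exponential backoff
        time.sleep(interval)  # pace requests below the rate limit
    return vectors
```

A production version would also log batches that exhaust their retries so they can be re-queued rather than silently dropped.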

Conclusion

RAG pipeline optimization is not a single technology but a combination of chunking, search, reranking, and evaluation. Optimize each stage independently, but make decisions based on the overall pipeline's evaluation metrics.

Recommended implementation order:

  1. Recursive chunking + Dense search to establish baseline (1 week)
  2. RAGAS/DeepEval evaluation pipeline setup (1 week)
  3. Add hybrid search to improve Recall (1 week)
  4. Add reranking to improve Precision (1 week)
  5. Per-query dynamic alpha and continuous improvement based on evaluation (ongoing)

The approach of "apply everything at once" will fail. At each step, verify evaluation metric changes and immediately roll back if things get worse. That's the production playbook.
