
RAG Pipeline Optimization Strategy: Chunking, Reranking, and Hybrid Search


Introduction

If your team has deployed RAG (Retrieval-Augmented Generation) to production, you've likely experienced this: "It retrieves something but the answer is wrong", "The relevant document clearly exists but doesn't show up in search", "It works for short questions but hallucinates on complex ones". The root cause of these issues is almost always **retrieval quality**. No matter how smart the LLM, it will generate wrong answers if given wrong context.

This article covers three core pillars for maximizing RAG pipeline retrieval quality:

1. **Chunking**: How to split documents

2. **Hybrid Search**: How to combine dense vector and sparse keyword search

3. **Reranking**: How to re-order search results

For each strategy, we examine practical code, benchmark numbers, and operational considerations. The article is based on the latest models and tools as of March 2026.

RAG Pipeline Architecture Overview

The complete flow of an advanced RAG pipeline:

[Indexing Phase]

Document Collection -> Preprocessing -> Chunking -> Embedding -> Vector DB Storage + Inverted Index Storage

[Query Phase]

Question -> Query Transformation -> Hybrid Search (Dense + Sparse)

-> Reranking -> Top K Selection -> Prompt Construction -> LLM Generation

Three key differences compared to basic RAG:

- **Refined chunking strategy**: Semantic, recursive, and agentic chunking instead of simple fixed-size

- **Hybrid search**: Combining BM25 keyword search with vector similarity instead of relying solely on vectors

- **Added reranking layer**: Re-evaluating initial search results with a Cross-Encoder for improved precision

Combining these three can improve retrieval accuracy (Precision@K) by 30-50%+. Let's examine each in depth.

Deep Dive into Chunking Strategies

Chunking is the first decision in a RAG pipeline and has the greatest impact on overall performance. Poor chunking is nearly impossible to recover from with subsequent optimizations.

Chunking Strategy Comparison

| Strategy   | Principle                       | Best For                        | Chunk Size      | Pros                | Cons              |
| ---------- | ------------------------------- | ------------------------------- | --------------- | ------------------- | ----------------- |
| Fixed-size | Split by fixed token/char count | Uniformly structured docs       | 256-512 tokens  | Simple, fast        | Ignores semantics |
| Recursive  | Recursive split by separators   | General text                    | 512-1024 tokens | Respects boundaries | Config needed     |
| Semantic   | Detect boundaries by embedding  | Docs with frequent topic shifts | Variable        | Best preservation   | Costly, slow      |
| Agentic    | LLM analyzes structure          | Complex technical docs          | Variable        | Highest quality     | High cost, slow   |

Fixed-size Chunking

The simplest but still effective strategy. Even in 2026 benchmarks, 512-token fixed-size chunking sometimes outperforms complex semantic chunking.

```python
from langchain_text_splitters import CharacterTextSplitter

# Fixed-size chunking - the most basic approach
fixed_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=512,  # Measured by length_function; len counts characters, not tokens
    chunk_overlap=50,  # 10% overlap recommended
    length_function=len,
    is_separator_regex=False,
)

chunks = fixed_splitter.split_text(document_text)
print(f"Total {len(chunks)} chunks created")
```

**Recommended settings:**

- Factoid (fact-checking) queries: 256-512 tokens

- Analytical queries: 1024+ tokens

- Overlap: 10-20% of total chunk size
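
The recommended settings are stated in tokens, while a splitter configured with `length_function=len` counts characters. As a minimal sketch of how a token window and a percentage overlap interact, the helper below slices a token list into overlapping windows. The name `chunk_tokens` and the synthetic token list are assumptions for illustration; a real pipeline would use the embedding model's tokenizer.

```python
def chunk_tokens(
    tokens: list[str],
    chunk_size: int = 512,
    overlap_ratio: float = 0.1,
) -> list[list[str]]:
    """Split a token list into fixed-size windows that overlap by overlap_ratio."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # How far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # This window already reaches the end of the text
    return chunks


# Synthetic tokens stand in for real tokenizer output (assumption)
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens, chunk_size=512, overlap_ratio=0.1)
print([len(c) for c in chunks])
```

With a 10% overlap on 512 tokens, each window advances by 461 tokens, so the last 51 tokens of one chunk reappear at the start of the next.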

Recursive Chunking

LangChain's `RecursiveCharacterTextSplitter` recursively splits following a separator hierarchy. It respects paragraph, sentence, and word boundaries while matching target size.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Recursive chunking - recommended production default
recursive_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=512,
    chunk_overlap=64,
    length_function=len,
    add_start_index=True,  # For source position tracking
)

chunks = recursive_splitter.split_documents(documents)

# Each chunk automatically includes metadata
for chunk in chunks[:3]:
    print(f"Size: {len(chunk.page_content)}, "
          f"Start position: {chunk.metadata.get('start_index')}")
```

**Practical tip:** For most production environments, starting with Recursive chunking is recommended. It's simple while respecting paragraph boundaries, offering the best cost-effectiveness.

Semantic Chunking

Uses embeddings to calculate semantic similarity between adjacent sentences and splits at points where similarity drops sharply.

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Semantic chunking - splitting by meaning units
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",  # percentile, standard_deviation, interquartile
    breakpoint_threshold_amount=75,  # Split at 75th percentile difference
)

semantic_chunks = semantic_splitter.split_text(document_text)
print(f"Semantic chunk count: {len(semantic_chunks)}")
print(f"Average chunk length: {sum(len(c) for c in semantic_chunks) / len(semantic_chunks):.0f}")
```

**Caution:** Semantic chunking has to embed every sentence in the corpus before it can compare adjacent ones, so cost and indexing time increase significantly for large document sets. For 100K+ documents, Recursive chunking is more practical.
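
To make the "split where similarity drops sharply" idea concrete, here is a simplified sketch of percentile-based breakpoint detection. It is an illustration of the principle, not LangChain's actual implementation; the function names and the toy 2-D vectors are assumptions, and real sentence embeddings would come from an embedding model.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values: list[float], pct: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_breakpoints(embeddings: list[list[float]], pct: float = 75.0) -> list[int]:
    """Return indices i such that a split falls between sentence i and i + 1."""
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    if not distances:
        return []
    threshold = percentile(distances, pct)
    return [i for i, d in enumerate(distances) if d > threshold]


# Toy embeddings: sentences 0-2 share one topic, sentences 3-4 another
toy = [[1.0, 0.0], [0.99, 0.1], [0.98, 0.15], [0.0, 1.0], [0.05, 1.0]]
print(semantic_breakpoints(toy, pct=75.0))
```

Only the gap between sentences 2 and 3 exceeds the 75th-percentile distance, so that is the single split point. Raising the percentile yields fewer, larger chunks.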

Agentic Chunking

Uses an LLM to understand the document's logical structure and determine optimal split points. It is the most sophisticated approach but also the most expensive.

```python
import json

from openai import OpenAI

client = OpenAI()

def agentic_chunk(text: str, max_chunks: int = 20) -> list[dict]:
    """LLM-based agentic chunking"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Split the given text into logical units. "
                    "Each chunk should cover one complete topic. "
                    f"Produce at most {max_chunks} chunks. "
                    # json_object mode returns an object, so the array is wrapped in a "chunks" key
                    "Return a JSON object of the form: "
                    '{"chunks": [{"title": "chunk title", "content": "chunk content", "summary": "one-line summary"}]}'
                ),
            },
            {"role": "user", "content": text[:8000]},  # Truncated by characters; watch token limits
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    result = json.loads(response.choices[0].message.content)
    return result.get("chunks", [])

# Usage example
chunks = agentic_chunk(long_document_text)
for chunk in chunks:
    print(f"[{chunk['title']}] {chunk['summary']}")
```

**Cost consideration:** Agentic chunking incurs LLM API calls per document, making it unsuitable for bulk indexing. Best for small volumes of high-value documents (contracts, technical specifications, etc.).

Embedding Model Selection and Optimization

After chunking, the choice of vector embedding model directly impacts retrieval quality.

Major Embedding Model Comparison

| Model                  | Dims | Max Tokens | Multilingual | MTEB Score | Features                                    |
| ---------------------- | ---- | ---------- | ------------ | ---------- | ------------------------------------------- |
| text-embedding-3-large | 3072 | 8191       | Yes          | 64.6       | OpenAI latest, dimension reduction possible |
| text-embedding-3-small | 1536 | 8191       | Yes          | 62.3       | Cost-efficient                              |
| BAAI/bge-m3            | 1024 | 8192       | Yes          | 68.2       | Open source, Dense+Sparse simultaneous      |
| Cohere embed-v4        | 1536 | 128000     | Yes          | 66.1       | Multimodal support                          |
| voyage-3-large         | 1024 | 32000      | Yes          | 67.5       | Long context specialized                    |

```python
from langchain_openai import OpenAIEmbeddings

# OpenAI embeddings - leveraging dimension reduction
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-large",
    dimensions=1024,  # Reduce 3072 -> 1024 for cost/storage savings
)

# BGE-M3: Simultaneous Dense + Sparse generation (optimal for hybrid search)
from FlagEmbedding import BGEM3FlagModel

bge_model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

# Generate Dense and Sparse vectors simultaneously
output = bge_model.encode(
    ["RAG pipeline optimization methodology"],
    return_dense=True,
    return_sparse=True,
)

dense_vector = output["dense_vecs"][0]  # (1024,) float vector
sparse_vector = output["lexical_weights"][0]  # Sparse vector (per-word weights)

print(f"Dense dimensions: {len(dense_vector)}")
print(f"Sparse active token count: {len(sparse_vector)}")
```

**Practical recommendations:**

- **Cost priority**: text-embedding-3-small (OpenAI) or bge-m3 (self-hosted)

- **Quality priority**: text-embedding-3-large or voyage-3-large

- **Planning hybrid search**: bge-m3 (simultaneous Dense + Sparse generation simplifies infrastructure)

Vector Database Comparison

Vector DB selection significantly impacts operational complexity, cost, and performance.

Major Vector Database Comparison

| Item               | Pinecone               | Weaviate               | Qdrant                | Milvus                             |
| ------------------ | ---------------------- | ---------------------- | --------------------- | ---------------------------------- |
| Hosting            | Managed (Serverless)   | Managed + Self-hosted  | Managed + Self-hosted | Self-hosted focused (Zilliz Cloud) |
| Hybrid Search      | Supported (Sparse)     | Native support         | Supported (Sparse)    | Supported                          |
| Metadata Filtering | Basic                  | GraphQL-based powerful | Rust-based high perf  | Basic                              |
| Free Tier          | Starter (100K vectors) | Sandbox                | 1GB free (permanent)  | Open source                        |
| Query Latency      | Under 50ms             | 50-100ms               | Under 50ms            | 30-50ms                            |
| Scalability        | Auto-scaling           | Manual config needed   | Horizontal scaling    | K8s native                         |
| Language SDKs      | Python, JS, Go         | Python, JS, Go, Java   | Python, JS, Rust, Go  | Python, JS, Go, Java               |
| Best For           | Minimal ops teams      | OSS + flexibility      | Complex filtering     | Large-scale enterprise             |

**Selection guide:**

- **Quick start + minimal operations**: Pinecone

- **Self-hosted + complex filtering**: Qdrant

- **Open source + native hybrid search**: Weaviate

- **Large scale (1B+ vectors) + GPU acceleration**: Milvus/Zilliz

Hybrid Search: Combining Dense + Sparse

Vector search alone struggles with exact keyword matching, while BM25 alone can't capture semantic similarity. Hybrid search combines both approaches to improve recall by 15-30%.

Hybrid Search Architecture

Query: "What is the CPU threshold in Kubernetes HPA settings?"

Dense Search (vector similarity):

-> "How to configure autoscaling in container orchestration" (semantically similar)

Sparse Search (BM25 keyword matching):

-> "Set HPA targetCPUUtilizationPercentage to 80" (exact keyword match)

Hybrid Combination (RRF or weighted sum):

-> Merge both results for optimal document retrieval

Weaviate Hybrid Search Implementation

```typescript
import weaviate, { WeaviateClient } from 'weaviate-client'

// Connect Weaviate client
const client: WeaviateClient = await weaviate.connectToLocal({
  host: 'localhost',
  port: 8080,
})

// Execute hybrid search
const collection = client.collections.get('Documents')

const result = await collection.query.hybrid('Kubernetes HPA CPU threshold', {
  alpha: 0.7, // 0 = pure BM25, 1 = pure vector, 0.7 = 70% vector
  limit: 20, // Candidates before reranking
  fusionType: 'RelativeScore', // RelativeScore or Ranked
  returnMetadata: ['score', 'explainScore'],
  returnProperties: ['title', 'content', 'source'],
})

for (const item of result.objects) {
  console.log(`[${item.metadata?.score?.toFixed(3)}] ${item.properties.title}`)
}
```

BM25 + Dense Direct Implementation in Python

When native hybrid search from the vector DB isn't available, implement directly with Reciprocal Rank Fusion (RRF).

```python
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_search(
    query: str,
    documents: list[dict],
    dense_scores: np.ndarray,
    k: int = 10,
    alpha: float = 0.7,
    rrf_k: int = 60,
) -> list[dict]:
    """
    RRF-based hybrid search
    alpha: Dense search weight (1-alpha is Sparse weight)
    rrf_k: RRF constant (default 60, paper recommended)
    """
    # Sparse search (BM25)
    tokenized_docs = [doc["content"].split() for doc in documents]
    bm25 = BM25Okapi(tokenized_docs)
    sparse_scores = bm25.get_scores(query.split())

    # Rank calculation: the first argsort gives the sort order,
    # the second converts it into each document's 1-indexed rank
    dense_ranks = np.argsort(np.argsort(-dense_scores)) + 1
    sparse_ranks = np.argsort(np.argsort(-sparse_scores)) + 1

    # RRF score calculation
    rrf_scores = []
    for i in range(len(documents)):
        dense_rrf = alpha / (rrf_k + dense_ranks[i])
        sparse_rrf = (1 - alpha) / (rrf_k + sparse_ranks[i])
        rrf_scores.append(dense_rrf + sparse_rrf)

    # Return top K
    top_indices = np.argsort(rrf_scores)[::-1][:k]
    return [
        {**documents[i], "hybrid_score": rrf_scores[i]}
        for i in top_indices
    ]

# Usage example
results = hybrid_search(
    query="Kubernetes HPA CPU threshold",
    documents=all_documents,
    dense_scores=embedding_similarity_scores,
    k=20,  # Generous before reranking
    alpha=0.7,  # Dense 70%, Sparse 30%
)
```

Alpha Tuning Guide

| Query Type                    | Recommended Alpha | Reason                           |
| ----------------------------- | ----------------- | -------------------------------- |
| Technical queries with jargon | 0.3-0.5           | Exact keyword matching important |
| Natural language questions    | 0.7-0.8           | Semantic similarity important    |
| Code search                   | 0.2-0.4           | Function/variable name matching  |
| General FAQ                   | 0.5-0.6           | Balanced search                  |

**Key insight:** Dynamically adjusting alpha per query type can improve Precision@1 by 2-7.5 percentage points compared to static settings.
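
One lightweight way to pick alpha per query is a rule-based classifier that runs before retrieval. The sketch below is a heuristic illustration only: the regular expressions, the `choose_alpha` name, and the specific alpha values (taken from inside the ranges in the table) are assumptions, and a trained classifier would likely be more robust in production.

```python
import re

# Identifier-like tokens: call syntax, camelCase, snake_case, scope/arrow operators
CODE_PATTERN = re.compile(r"\w+\(\)|\b[a-z]+[A-Z]\w*|\b\w+_\w+\b|::|->")
# Queries that open with a question word read as natural language
QUESTION_PATTERN = re.compile(
    r"^(how|what|why|when|where|which|who|can|does|is|are)\b", re.IGNORECASE
)


def choose_alpha(query: str) -> float:
    """Return a dense-search weight based on a rough guess at the query type."""
    q = query.strip()
    if CODE_PATTERN.search(q):
        return 0.3  # Favor exact keyword matching for identifiers
    if QUESTION_PATTERN.match(q):
        return 0.7  # Favor semantic similarity for natural-language questions
    return 0.5  # Balanced default


for q in [
    "targetCPUUtilizationPercentage default",
    "How do I scale pods automatically?",
    "autoscaling pricing",
]:
    print(q, "->", choose_alpha(q))
```

The returned value can be passed straight into the `alpha` parameter of whichever hybrid search call the pipeline uses.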

Applying Reranking Models

After broadening candidates with hybrid search, reranking models precisely adjust the final ranking. Rerankers take query and document together as input (Cross-Encoding) to directly compute relevance scores, achieving higher accuracy than Bi-Encoder embeddings.

Reranking Model Comparison

| Model                                 | Type        | Parameters  | Multilingual   | Latency (100 docs) | Cost        |
| ------------------------------------- | ----------- | ----------- | -------------- | ------------------ | ----------- |
| Cohere Rerank 4                       | API         | Undisclosed | 100+ languages | 200-400ms          | Pay-per-use |
| BAAI/bge-reranker-v2-m3               | Open source | 0.6B        | Yes            | 500-800ms (GPU)    | Free        |
| BAAI/bge-reranker-large               | Open source | 560M        | Limited        | 400-600ms (GPU)    | Free        |
| cross-encoder/ms-marco-MiniLM-L-12-v2 | Open source | 33M         | No (English)   | 100-200ms (GPU)    | Free        |

Cohere Rerank Integration

```python
import cohere

co = cohere.ClientV2(api_key="your-cohere-api-key")

def rerank_with_cohere(
    query: str,
    documents: list[str],
    top_n: int = 5,
) -> list[dict]:
    """Rerank documents with the Cohere Rerank API"""
    response = co.rerank(
        model="rerank-v3.5",  # Use the model ID of the Rerank version you have access to
        query=query,
        documents=documents,
        top_n=top_n,
        return_documents=True,
    )

    results = []
    for item in response.results:
        results.append({
            "index": item.index,
            "score": item.relevance_score,
            "text": item.document.text if item.document else documents[item.index],
        })
    return results

# Usage: Rerank 20 hybrid search results to select top 5
hybrid_results = hybrid_search(query, documents, dense_scores, k=20)
reranked = rerank_with_cohere(
    query="How to set Kubernetes HPA CPU threshold",
    documents=[r["content"] for r in hybrid_results],
    top_n=5,
)

for r in reranked:
    print(f"[{r['score']:.4f}] {r['text'][:80]}...")
```

Self-hosting BGE Reranker

For environments where API costs must be avoided or data cannot be sent externally, host an open-source reranker directly.

```python
from FlagEmbedding import FlagReranker

# BGE Reranker v2 M3 - lightweight multilingual reranker
reranker = FlagReranker(
    "BAAI/bge-reranker-v2-m3",
    use_fp16=True,  # Save GPU memory
)

def rerank_with_bge(
    query: str,
    documents: list[str],
    top_n: int = 5,
) -> list[dict]:
    """Rerank documents with BGE Reranker"""
    # Create query-document pairs
    pairs = [[query, doc] for doc in documents]

    # Calculate relevance scores
    scores = reranker.compute_score(pairs, normalize=True)

    # Sort by score
    scored_docs = [
        {"index": i, "score": s, "text": d}
        for i, (s, d) in enumerate(zip(scores, documents))
    ]
    scored_docs.sort(key=lambda x: x["score"], reverse=True)
    return scored_docs[:top_n]

# Usage example
results = rerank_with_bge(
    query="What criteria determine chunking size in RAG pipelines?",
    documents=candidate_documents,
    top_n=5,
)

for r in results:
    print(f"[{r['score']:.4f}] {r['text'][:100]}...")
```

Full Pipeline Integration

An example integrating chunking, hybrid search, and reranking into a single pipeline.

```python
from FlagEmbedding import FlagReranker
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

class OptimizedRAGPipeline:
    """Advanced RAG Pipeline"""

    def __init__(self):
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=512, chunk_overlap=64
        )
        self.embeddings = OpenAIEmbeddings(
            model="text-embedding-3-large", dimensions=1024
        )
        self.llm = ChatOpenAI(model="gpt-4o", temperature=0)
        self.reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

    def query(self, question: str, top_k: int = 5) -> str:
        # 1. Hybrid search: 20 candidates
        candidates = self._hybrid_search(question, k=20)

        # 2. Reranking: select top 5
        reranked = self._rerank(question, candidates, top_n=top_k)

        # 3. Prompt construction and LLM generation
        context = "\n\n---\n\n".join([doc["text"] for doc in reranked])
        prompt = ChatPromptTemplate.from_messages([
            ("system", "Answer the question based on the following context.\n\n{context}"),
            ("human", "{question}"),
        ])
        chain = prompt | self.llm
        response = chain.invoke({"context": context, "question": question})
        return response.content

    def _hybrid_search(self, query: str, k: int = 20) -> list[dict]:
        # Dense + Sparse hybrid search (see code above)
        ...

    def _rerank(self, query: str, docs: list[dict], top_n: int) -> list[dict]:
        pairs = [[query, doc["text"]] for doc in docs]
        scores = self.reranker.compute_score(pairs, normalize=True)
        for doc, score in zip(docs, scores):
            doc["rerank_score"] = score
        docs.sort(key=lambda x: x["rerank_score"], reverse=True)
        return docs[:top_n]
```

Evaluation Metrics and Benchmarking

Quantitative evaluation is essential for RAG pipeline optimization. "It seems better" doesn't work in production.

RAGAS Framework

RAGAS (Retrieval Augmented Generation Assessment) is a RAG-specific evaluation framework that uses LLMs as evaluators for automatic scoring. Metrics such as faithfulness and answer relevancy need no ground truth, while context recall is computed against a reference answer.

```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Evaluation dataset construction
eval_data = {
    "question": [
        "What is the default CPU threshold for Kubernetes HPA?",
        "What criteria determine chunking size in RAG?",
    ],
    "answer": [
        "The default CPU threshold for HPA is 80%.",
        "It is determined in the 256-1024 token range based on query type.",
    ],
    "contexts": [
        ["HPA uses targetCPUUtilizationPercentage default value of 80."],
        ["Factoid queries recommend 256-512, analytical queries 1024+ tokens."],
    ],
    "ground_truth": [
        "The default is 80%.",
        "It is determined by query type and document characteristics.",
    ],
}

dataset = Dataset.from_dict(eval_data)

# Run evaluation
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

print(results)
# faithfulness: 0.92, answer_relevancy: 0.88,
# context_precision: 0.85, context_recall: 0.90
```

Unit Testing with DeepEval

DeepEval enables Pytest-style RAG testing, making it ideal for CI/CD pipeline integration.

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    FaithfulnessMetric,
    ContextualRelevancyMetric,
    AnswerRelevancyMetric,
)

def test_rag_faithfulness():
    """Test that RAG responses are faithful to context"""
    test_case = LLMTestCase(
        input="What is the Kubernetes HPA CPU threshold?",
        actual_output="The default CPU threshold for HPA is 80%.",
        retrieval_context=[
            "HPA uses targetCPUUtilizationPercentage default value of 80."
        ],
    )

    faithfulness = FaithfulnessMetric(threshold=0.8)
    relevancy = ContextualRelevancyMetric(threshold=0.7)
    answer_rel = AnswerRelevancyMetric(threshold=0.7)

    assert_test(test_case, [faithfulness, relevancy, answer_rel])

# Run with pytest: pytest test_rag.py -v
```

Key Evaluation Metrics Summary

| Metric            | Measures                              | Target | Tools           |
| ----------------- | ------------------------------------- | ------ | --------------- |
| Context Precision | Relevance of retrieved context        | 0.8+   | RAGAS           |
| Context Recall    | Ratio of needed context retrieved     | 0.85+  | RAGAS           |
| Faithfulness      | Response fidelity to context          | 0.9+   | RAGAS, DeepEval |
| Answer Relevancy  | Response relevance to question        | 0.85+  | RAGAS, DeepEval |
| MRR@K             | Mean reciprocal rank of first hit     | 0.7+   | Custom          |
| NDCG@K            | Normalized discounted cumulative gain | 0.75+  | Custom          |
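
Since MRR@K and NDCG@K are listed as custom metrics, here is a minimal sketch of both using their standard definitions. The function names, the document IDs, and the graded relevance dictionary are assumptions for illustration; in practice the relevance labels come from a hand-annotated evaluation set.

```python
import math


def mrr_at_k(
    ranked_per_query: list[list[str]],
    relevant_per_query: list[set[str]],
    k: int = 10,
) -> float:
    """Mean over queries of 1/rank of the first relevant hit within the top k."""
    total = 0.0
    for ranked, relevant in zip(ranked_per_query, relevant_per_query):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # Only the first hit counts
    return total / len(ranked_per_query)


def ndcg_at_k(ranked: list[str], gains: dict[str, float], k: int = 10) -> float:
    """DCG of the returned ranking divided by the DCG of the ideal ranking."""
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(i + 1)
        for i, doc_id in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Query 1 finds its relevant doc at rank 2; query 2 finds nothing in the top 3
mrr = mrr_at_k([["a", "b", "c"], ["x", "y", "z"]], [{"b"}, {"q"}], k=3)
# The only relevant doc appears at rank 2 instead of rank 1
ndcg = ndcg_at_k(["b", "a"], {"a": 1.0}, k=2)
print(f"MRR@3: {mrr:.2f}, NDCG@2: {ndcg:.4f}")
```

MRR only rewards the first relevant hit, whereas NDCG accounts for every relevant document and its position, so the two are worth tracking together.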

Operational Considerations and Troubleshooting

1. Full Re-indexing Required When Changing Embedding Models

Upgrading the embedding model changes the vector space, so vectors produced by the old and new models are not comparable. Partial re-indexing therefore severely degrades retrieval quality.

**Solution:** Use a Blue-Green index strategy. Complete a new index with the new model, then switch traffic.

```python
# Blue-Green index switching example
# The helper functions called below are placeholders for your own infrastructure code
def reindex_with_blue_green(
    old_collection: str,
    new_collection: str,
    new_embedding_model: str,
):
    """Zero-downtime re-indexing"""
    # 1. Index into new collection (existing service uses old_collection)
    print(f"Starting indexing for new collection '{new_collection}'...")
    create_and_populate_collection(new_collection, new_embedding_model)

    # 2. Validation: run test queries against new collection
    test_results = run_evaluation_suite(new_collection)
    if test_results["context_precision"] < 0.8:
        raise ValueError(
            f"New index quality below threshold: {test_results['context_precision']:.2f}"
        )

    # 3. Traffic switch: alias or config change
    update_active_collection(new_collection)
    print(f"Traffic switched: {old_collection} -> {new_collection}")

    # 4. Keep old collection for rollback for a period
    schedule_cleanup(old_collection, delay_days=7)
```

2. Chunk Size and Embedding Model Token Limit Mismatch

Chunks exceeding the embedding model's maximum token count get truncated or cause errors.

**Solution:** Use a length function that considers the embedding model's token limit during chunking.

```python
import tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter

enc = tiktoken.encoding_for_model("text-embedding-3-small")

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    length_function=lambda text: len(enc.encode(text)),  # Token-count based
)
```

3. Reranking Latency Management

Rerankers are Cross-Encoders, so latency increases proportionally with candidate count. Reranking 100 docs adds 500ms-1s.

**Solutions:**

- Limit candidates to 20-30 from hybrid search

- Use async calls for batch processing

- Deploy open-source rerankers on GPU servers to reduce latency

4. Vector DB Index Memory Management

HNSW indexes must maintain the entire graph in memory. 1M vectors (1024 dims) use approximately 4-8GB memory.

**Solutions:**

- Vector dimension reduction (3072 -> 1024)

- Apply quantization: Scalar, Product, Binary quantization

- Use DiskANN index (Milvus supported)
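
The memory figure above can be sanity-checked with a back-of-the-envelope estimate: raw vector storage plus the graph links of the base layer. The sketch below is a rule of thumb under stated assumptions (float32 vectors, `m=16` connections per node, up to `2 * m` four-byte neighbor IDs on the base layer); real engines add overhead for upper layers, payloads, and allocator slack, which is why observed usage lands above the raw estimate.

```python
def hnsw_memory_gb(
    num_vectors: int,
    dims: int,
    bytes_per_dim: float = 4.0,  # float32; 1.0 for int8 scalar, 0.125 for binary
    m: int = 16,  # HNSW connections per node (assumed default)
) -> float:
    """Rough HNSW footprint: vector storage plus base-layer neighbor links."""
    vector_bytes = num_vectors * dims * bytes_per_dim
    link_bytes = num_vectors * 2 * m * 4  # Up to 2*m neighbor IDs, 4 bytes each
    return (vector_bytes + link_bytes) / 1e9


n, d = 1_000_000, 1024
print(f"float32: {hnsw_memory_gb(n, d):.3f} GB")
print(f"int8:    {hnsw_memory_gb(n, d, bytes_per_dim=1.0):.3f} GB")
print(f"binary:  {hnsw_memory_gb(n, d, bytes_per_dim=0.125):.3f} GB")
print(f"3072-d:  {hnsw_memory_gb(n, 3072):.3f} GB")
```

The estimate shows why dimension reduction and quantization are listed first: vector storage dominates, and the graph links are a small share of the total until the vectors are heavily compressed.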

5. Metadata Filters and Search Performance

Excessive metadata filters can drastically degrade vector search performance. This is especially problematic with high-cardinality fields (timestamps, user IDs, etc.).

**Solutions:**

- Use only low-cardinality fields for filters (category, department, document type)

- Use broad date filters and apply recency weighting during reranking
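
Recency weighting at the reranking stage can be as simple as blending the relevance score with an exponentially decaying freshness score. The sketch below assumes each result is a dict with a `score` in the 0-1 range and an `updated_at` timestamp; those field names, the 180-day half-life, and the 0.2 blend weight are all illustrative assumptions to be tuned against your own evaluation set.

```python
from datetime import datetime, timedelta, timezone


def apply_recency_weight(
    results: list[dict],
    now: datetime,
    half_life_days: float = 180.0,
    recency_weight: float = 0.2,
) -> list[dict]:
    """Blend each relevance score with a freshness score that halves every half_life_days."""
    rescored = []
    for r in results:
        age_days = max((now - r["updated_at"]).total_seconds() / 86400, 0.0)
        recency = 0.5 ** (age_days / half_life_days)
        final = (1 - recency_weight) * r["score"] + recency_weight * recency
        rescored.append({**r, "final_score": final})
    return sorted(rescored, key=lambda r: r["final_score"], reverse=True)


now = datetime(2026, 3, 1, tzinfo=timezone.utc)
results = [
    {"id": "old", "score": 0.80, "updated_at": now - timedelta(days=720)},
    {"id": "new", "score": 0.75, "updated_at": now},
]
ranked = apply_recency_weight(results, now)
print([(r["id"], round(r["final_score"], 4)) for r in ranked])
```

Because the date logic lives in the scoring step, the vector query itself only needs a coarse date filter, which keeps the filter low-cardinality.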

Failure Cases and Recovery Procedures

Case 1: Retrieval Quality Drop After Switching to Semantic Chunking

**Situation:** After switching from Recursive to Semantic chunking, Context Precision dropped from 0.85 to 0.72.

**Cause:** Semantic chunking produced highly uneven chunk sizes. Some chunks were 50 tokens while others were 2000, leading to inconsistent embedding quality.

**Recovery:**

1. Added minimum/maximum size constraints to semantic chunking output

2. Immediately rolled back to the existing Recursive chunking index (possible because of Blue-Green approach)

3. Retried semantic chunking with min 200, max 800 token constraints
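
A post-processing pass like the one described in the recovery steps might look as follows. This is a sketch under assumptions: chunks are represented as token lists, the name `enforce_size_bounds` is hypothetical, and the even-split strategy only guarantees the minimum when `max_tokens` is at least twice `min_tokens` (as with 200 and 800).

```python
import math


def enforce_size_bounds(
    chunks: list[list[str]],
    min_tokens: int = 200,
    max_tokens: int = 800,
) -> list[list[str]]:
    """Merge undersized chunks into a neighbor, then split oversized ones evenly."""
    merged: list[list[str]] = []
    for chunk in chunks:
        if not chunk:
            continue
        if merged and len(merged[-1]) < min_tokens:
            merged[-1] = merged[-1] + chunk  # Grow the previous chunk until it reaches the minimum
        else:
            merged.append(list(chunk))
    # A short trailing chunk has no successor, so fold it into its predecessor
    if len(merged) > 1 and len(merged[-1]) < min_tokens:
        tail = merged.pop()
        merged[-1] = merged[-1] + tail

    bounded: list[list[str]] = []
    for chunk in merged:
        parts = math.ceil(len(chunk) / max_tokens)
        size = math.ceil(len(chunk) / parts)  # Even split avoids a tiny leftover piece
        for start in range(0, len(chunk), size):
            bounded.append(chunk[start:start + size])
    return bounded


# Uneven semantic-chunker output: 50, 300, 2000, and 100 tokens
raw = [["tok"] * n for n in (50, 300, 2000, 100)]
sizes = [len(c) for c in enforce_size_bounds(raw)]
print(sizes)
```

Merging can join sentences from adjacent topics, so this trades some semantic purity for more consistent embedding quality.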

Case 2: Fixed Hybrid Search Alpha Causing Performance Issues for Specific Query Types

**Situation:** While operating with a fixed alpha=0.7, code search queries repeatedly failed to find exact function names.

**Cause:** Code-related queries need exact keyword matching, but Dense weighting was too high.

**Recovery:**

1. Added query classifier for automatic query type detection

2. Dynamic adjustment: alpha=0.3 for code/technical queries, alpha=0.7 for natural language

3. Classifier itself uses a lightweight model (distilbert-based) with under 10ms latency

Case 3: Service Down During Reranker Failure

**Situation:** Cohere Rerank API outage caused the entire RAG pipeline to become unresponsive.

**Cause:** Reranking step was configured as mandatory with no fallback logic.

**Recovery:**

1. Made the reranking step optional

2. On timeout (2s) or API error, return hybrid search results as-is

3. Deployed self-hosted BGE Reranker as backup for redundancy

```python
import asyncio

async def rerank_with_fallback(
    query: str,
    documents: list[str],
    top_n: int = 5,
    timeout: float = 2.0,
) -> list[dict]:
    """Reranking with fallback"""
    try:
        # Primary: Cohere Rerank (2s timeout)
        # cohere_rerank_async is assumed to be an async wrapper around the Cohere call
        result = await asyncio.wait_for(
            cohere_rerank_async(query, documents, top_n),
            timeout=timeout,
        )
        return result
    except Exception as e:  # Also covers asyncio.TimeoutError
        print(f"Cohere reranking failed, BGE fallback: {e}")
        try:
            # Fallback: Self-hosted BGE Reranker
            return rerank_with_bge(query, documents, top_n)
        except Exception as e2:
            print(f"BGE reranking also failed, returning original ranking: {e2}")
            # Final fallback: return hybrid search results as-is
            return [
                {"index": i, "score": 1.0 - i * 0.05, "text": d}
                for i, d in enumerate(documents[:top_n])
            ]
```

Case 4: Service Quality Degradation During Bulk Re-indexing

**Situation:** During re-indexing of 100K documents, a surge in embedding API calls hit rate limits, which also delayed real-time query embedding.

**Recovery:**

1. Separated API keys/endpoints for indexing and querying

2. Added batch size control and rate limit handling logic for indexing

3. Ran indexing during off-peak hours to avoid competing with query traffic
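
The batch size control and rate limit handling from step 2 can be sketched as a loop with exponential backoff. Everything here is illustrative: `RateLimitError` stands in for whatever exception your embedding provider raises, `embed_batch` is a caller-supplied function, and the `sleep` parameter exists only so the delay can be stubbed out in tests.

```python
import random
import time
from typing import Callable


class RateLimitError(Exception):
    """Stand-in for the provider's rate-limit exception (hypothetical)."""


def embed_in_batches(
    texts: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 64,
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[list[float]]:
    """Embed texts in fixed-size batches, backing off exponentially on rate limits."""
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(max_retries + 1):
            try:
                vectors.extend(embed_batch(batch))
                break
            except RateLimitError:
                if attempt == max_retries:
                    raise  # Give up after the final retry
                # Exponential backoff with jitter to avoid synchronized retries
                sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    return vectors


# Demo with a fake embedder that is rate-limited on its first call
calls = {"n": 0}

def flaky_embed(batch: list[str]) -> list[list[float]]:
    calls["n"] += 1
    if calls["n"] == 1:
        raise RateLimitError("429")
    return [[float(len(t))] for t in batch]

delays: list[float] = []
out = embed_in_batches(["a", "bb", "ccc"], flaky_embed, batch_size=2, sleep=delays.append)
print(out, len(delays))
```

Keeping the batch size small during business hours and larger off-peak is one way to combine this with the scheduling approach in step 3.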

Conclusion

RAG pipeline optimization is not a single technology but a **combination of chunking, search, reranking, and evaluation**. Optimize each stage independently, but make decisions based on the overall pipeline's evaluation metrics.

**Recommended implementation order:**

1. **Recursive chunking + Dense search** to establish baseline (1 week)

2. **RAGAS/DeepEval evaluation pipeline** setup (1 week)

3. **Add hybrid search** to improve Recall (1 week)

4. **Add reranking** to improve Precision (1 week)

5. **Per-query dynamic alpha** and continuous improvement based on evaluation (ongoing)

The approach of "apply everything at once" will fail. At each step, verify evaluation metric changes and immediately roll back if things get worse. That's the production playbook.

References

- [LangChain Text Splitters Official Docs](https://docs.langchain.com/oss/python/integrations/splitters)

- [Weaviate Hybrid Search Explained](https://weaviate.io/blog/hybrid-search-explained)

- [Cohere Rerank Official Docs](https://docs.cohere.com/docs/rerank)

- [BAAI/bge-reranker-v2-m3 (Hugging Face)](https://huggingface.co/BAAI/bge-reranker-v2-m3)

- [RAGAS: Automated Evaluation of RAG (arXiv)](https://arxiv.org/abs/2309.15217)

- [DeepEval RAG Evaluation Guide](https://deepeval.com/guides/guides-rag-evaluation)

- [Pinecone Chunking Strategies](https://www.pinecone.io/learn/chunking-strategies/)

- [FlagEmbedding GitHub (FlagOpen)](https://github.com/FlagOpen/FlagEmbedding)

- [Optimizing RAG with Hybrid Search and Reranking (VectorHub)](https://superlinked.com/vectorhub/articles/optimizing-rag-with-hybrid-search-reranking)
