Author: Youngju Kim (@fjvbn20031)
- When Pure Vector Search Fails
- What Is BM25?
- The Weaknesses of Each Approach
- RRF: The Method for Combining Both Results
- LangChain Implementation
- Production Implementation with Elasticsearch
- Performance Benchmark
- When Each Approach Wins
- Production Tip: Tuning the Tokenizer
- Full Hybrid RAG Pipeline
- Conclusion
When Pure Vector Search Fails
One of the most disorienting moments when you first build a RAG system is watching vector search fail to find a document that is obviously correct.
Here's a real example. Your product database has a document titled "iPhone 15 Pro Max 256GB Storage Capacity." A user asks: "Is the iPhone 15 Pro Max 256GB in stock?"
Vector search might not rank this as the top result. Why? Because "iPhone 14 Pro Max 512GB" is semantically very close in the embedding space. Both are in the "Pro Max smartphone" meaning cluster.
But the user asked specifically about the 256GB model of the 15. Those exact numbers and model identifiers need to match precisely.
This is the fundamental limitation of pure vector search: it's weak at exact keyword matching. Product codes, model numbers, proper nouns, dates, version numbers — all of these fall into this category.
What Is BM25?
BM25 (Okapi BM25) is a keyword search algorithm developed by Robertson et al. in 1994. Thirty years later, it remains the gold standard for keyword retrieval. Elasticsearch, Solr, and Apache Lucene all use BM25 as their default search algorithm.
The formula makes the logic clear:
score(D, Q) = Σ IDF(qi) × [f(qi, D) × (k1 + 1)] / [f(qi, D) + k1 × (1 - b + b × |D| / avgdl)]
where:
qi = each query term
f(qi, D) = term frequency of qi in document D
|D| = document length
avgdl = average document length across corpus
k1, b = tuning parameters (typically k1=1.5, b=0.75)
IDF(qi) = log((N - df + 0.5) / (df + 0.5))
BM25 has two key innovations:
1. TF Saturation: more occurrences increase the score, but with diminishing returns; the term-frequency factor asymptotically approaches a ceiling of (k1 + 1), so the difference between appearing once and 100 times is far smaller than 100x.
2. Document Length Normalization: longer documents naturally accumulate higher term frequencies. BM25 normalizes by document length relative to the corpus average to remove this unfair advantage.
These two properties explain why BM25 significantly outperforms simple TF-IDF scoring.
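To make these two properties concrete, here is a minimal sketch of the scoring formula above. This is a toy implementation over a hand-tokenized corpus, not production code (real systems use Lucene or rank_bm25); note that this raw IDF can go negative for terms appearing in more than half the corpus, which production implementations clamp.

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized doc against query terms using the formula above."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for q in query_terms:
        df = sum(1 for d in corpus if q in d)        # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5))  # negative if df > N/2
        tf = doc.count(q)                            # term frequency in this doc
        norm = 1 - b + b * len(doc) / avgdl          # length normalization
        score += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return score

# Saturation demo: all docs have length 10, only the term count varies
corpus = [
    ["apple"] + ["x"] * 9,      # "apple" once
    ["apple"] * 5 + ["x"] * 5,  # "apple" five times
    ["y"] * 10, ["y"] * 10, ["y"] * 10,
]
s1 = bm25_score(["apple"], corpus[0], corpus)
s5 = bm25_score(["apple"], corpus[1], corpus)
# s5 > s1, but far less than 5x s1: the TF contribution saturates
```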
The Weaknesses of Each Approach
Vector search weaknesses:
- Poor at exact keyword matching (model numbers, product codes)
- "Semantic drift": retrieves semantically similar but incorrect content
- Vulnerable to rare terms, neologisms, abbreviations
BM25 weaknesses:
- No synonym understanding ("automobile" vs "car")
- No contextual understanding (word order ignored)
- Vulnerable to typos
- Weak multilingual support
These weaknesses complement each other perfectly. Vector search captures meaning; BM25 captures exact keywords. Combining the two yields significantly better results than either alone.
RRF: The Method for Combining Both Results
The simplest and most effective way to combine two ranked result lists is RRF (Reciprocal Rank Fusion).
The idea is elegant: convert each result's rank to its reciprocal and sum them up.
```python
def reciprocal_rank_fusion(results_list: list, k: int = 60) -> list:
    """
    Combine multiple ranked result lists using RRF.

    RRF score = Σ 1/(k + rank_i)
    k=60 is a constant that dampens the impact of low-ranked items.

    Args:
        results_list: list of result lists (each is a list of doc IDs in rank order)
        k: rank stabilization constant (default 60)

    Returns:
        List of doc IDs sorted by descending RRF score
    """
    scores: dict = {}
    for results in results_list:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.keys(), key=lambda x: scores[x], reverse=True)

# Concrete example
vector_results = ["doc_A", "doc_C", "doc_B"]  # from embedding search
bm25_results = ["doc_B", "doc_A", "doc_D"]    # from BM25 keyword search

# Score calculation:
# doc_A: 1/(60+1) + 1/(60+2) = 0.01639 + 0.01613 = 0.03252
# doc_B: 1/(60+3) + 1/(60+1) = 0.01587 + 0.01639 = 0.03226
# doc_C: 1/(60+2) = 0.01613
# doc_D: 1/(60+3) = 0.01587

fused = reciprocal_rank_fusion([vector_results, bm25_results])
print(fused)
# ['doc_A', 'doc_B', 'doc_C', 'doc_D']
# doc_A ranks first because it's highly ranked in both lists
# doc_B ranks second: #1 in BM25 but #3 in vector
```
Why k=60? The value comes from the original RRF paper (Cormack et al., 2009) and was determined experimentally. It controls how steeply scores decay with rank: too small and the fusion is dominated by tiny rank differences at the top of each list; too large and all ranks contribute nearly equally, washing out the ranking signal.
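A quick numeric sketch of that trade-off: the score gap between rank 1 and rank 2 shrinks as k grows, so k determines how much the fusion rewards topping an individual list.

```python
def rrf_gap(k: int) -> float:
    """RRF score gap between rank 1 and rank 2 for a given k."""
    return 1 / (k + 1) - 1 / (k + 2)

# k=1: rank 1 towers over rank 2; k=60: moderate; k=1000: nearly flat
gaps = {k: rrf_gap(k) for k in (1, 60, 1000)}
```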
LangChain Implementation
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Assume: docs is a list of LangChain Document objects
embedding_model = OpenAIEmbeddings()

# Create BM25 retriever
bm25_retriever = BM25Retriever.from_documents(
    docs,
    k=5  # return top 5
)

# Create vector store and retriever
vectorstore = FAISS.from_documents(docs, embedding_model)
vector_retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

# EnsembleRetriever: fuses both result lists (weighted RRF)
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]  # BM25 40%, Vector 60%
)

# Usage
query = "iPhone 15 Pro Max 256GB in stock"
results = hybrid_retriever.invoke(query)
```
Weight selection guidelines:
- Exact keywords matter (product codes, model numbers): increase BM25 weight (0.5-0.6)
- Semantic understanding matters (customer queries, natural language): increase vector weight (0.6-0.7)
- General enterprise documents: start with 0.4/0.6 or 0.5/0.5
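These guidelines can even be applied per query. The heuristic below is my own illustration, not a standard technique: detect model-number-like tokens (mixed digits and letters, such as "256GB") and shift weight toward BM25 when they appear.

```python
import re

# Hypothetical helper: pick [bm25_weight, vector_weight] from the query text.
# A "model token" mixes digits and letters, e.g. "256GB", "A17", "v2beta".
MODEL_TOKEN = re.compile(r"\b(?=\w*\d)(?=\w*[A-Za-z])\w+\b")

def choose_weights(query: str) -> list:
    """BM25-heavy when code-like tokens are present, vector-heavy otherwise."""
    if MODEL_TOKEN.search(query):
        return [0.6, 0.4]  # exact identifiers present: favor BM25
    return [0.3, 0.7]      # natural-language query: favor vector search

# choose_weights("Is the iPhone 15 Pro Max 256GB in stock?") -> [0.6, 0.4]
# choose_weights("how do I return a damaged item") -> [0.3, 0.7]
```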
Production Implementation with Elasticsearch
LangChain's EnsembleRetriever is convenient but runs both searches locally. For large-scale production, Elasticsearch is more appropriate — it natively supports both BM25 and vector search.
```python
from elasticsearch import Elasticsearch

es_client = Elasticsearch(["http://localhost:9200"])

def hybrid_search_es(query: str, query_embedding: list, k: int = 5):
    response = es_client.search(
        index="documents",
        body={
            "query": {
                "bool": {
                    "should": [
                        # BM25 keyword search
                        {
                            "match": {
                                "content": {
                                    "query": query,
                                    "boost": 0.4
                                }
                            }
                        }
                    ]
                }
            },
            # KNN vector search (ES 8.x+)
            "knn": {
                "field": "content_vector",
                "query_vector": query_embedding,
                "k": k,
                "num_candidates": 100,
                "boost": 0.6
            },
            "size": k
        }
    )
    return response["hits"]["hits"]
```
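One caveat with boost weights: they sum raw BM25 and vector scores, which live on different scales. An alternative is to run the two searches separately and fuse the ranked `_id` lists client-side with RRF, exactly as earlier in the post. A sketch, with mocked hit lists standing in for real `es_client.search` responses:

```python
def rrf_fuse_hits(hit_lists, k: int = 60) -> list:
    """Fuse ranked Elasticsearch hit lists by _id using reciprocal rank fusion."""
    scores = {}
    for hits in hit_lists:
        for rank, hit in enumerate(hits, start=1):
            scores[hit["_id"]] = scores.get(hit["_id"], 0.0) + 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Mocked responses; in practice these come from a match query and a knn query
bm25_hits = [{"_id": "p1"}, {"_id": "p2"}]
knn_hits = [{"_id": "p2"}, {"_id": "p3"}]

fused_ids = rrf_fuse_hits([bm25_hits, knn_hits])
# p2 appears in both lists, so it ranks first
```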
Performance Benchmark
Research results based on the BEIR (Benchmarking Information Retrieval) dataset:
| Search Method | nDCG@10 (average) | Strongest Domain |
|---|---|---|
| BM25 only | 43.0 | Keyword matching, factual retrieval |
| Vector only | 47.8 | Semantic similarity, multilingual |
| Hybrid (RRF) | 52.1 | Balanced across all types |
Hybrid achieves roughly 21% higher nDCG@10 than BM25 alone, and about 9% higher than vector-only. The gap is especially large for general-purpose RAG systems where you can't predict what queries will come in.
From personal production experience: switching from vector-only to hybrid search reduced "irrelevant answer" complaints from customers by roughly 30%.
When Each Approach Wins
| Situation | Recommended | Reason |
|---|---|---|
| Product catalog search | Hybrid (BM25-heavy) | Model numbers and specs must be exact |
| FAQ chatbot | Hybrid (vector-heavy) | Questions come in diverse phrasings |
| Legal document search | Hybrid (BM25-heavy) | Exact legal terminology matching required |
| Sentiment-based search | Vector only | Meaning matters more than keywords |
| Code search | Hybrid (BM25-heavy) | Function names and variables need exact matching |
| Cross-language documents | Vector or multilingual BM25 | Cross-lingual semantic search needed |
Production Tip: Tuning the Tokenizer
BM25's default tokenizer is whitespace-based. For technical documentation, you can significantly improve results with a better tokenizer:
```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def technical_tokenizer(text: str) -> list:
    # Tokenize
    tokens = word_tokenize(text.lower())
    # Remove stopwords and apply stemming
    tokens = [
        stemmer.stem(token)
        for token in tokens
        if token.isalnum() and token not in stop_words
    ]
    return tokens

# Apply custom tokenizer when creating BM25 retriever
bm25_retriever = BM25Retriever.from_documents(
    docs,
    preprocess_func=technical_tokenizer,
    k=5
)
```
For code search specifically, you might want to preserve camelCase and snake_case splitting: getUserById → ["get", "user", "by", "id"].
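A minimal splitter for that looks like the sketch below; it could be plugged into preprocess_func the same way as the tokenizer above.

```python
import re

def split_identifier(name: str) -> list:
    """Split snake_case and camelCase identifiers into lowercase word tokens."""
    spaced = name.replace("_", " ")
    # insert a space before each uppercase letter that follows a lowercase/digit
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", spaced)
    return spaced.lower().split()

# split_identifier("getUserById")     -> ["get", "user", "by", "id"]
# split_identifier("max_retry_count") -> ["max", "retry", "count"]
```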
Full Hybrid RAG Pipeline
Putting it all together in a complete RAG pipeline:
```python
from langchain.chains import RetrievalQA
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Initialize all components (docs is the list of LangChain Documents)
embedding_model = OpenAIEmbeddings()
bm25_retriever = BM25Retriever.from_documents(docs, k=5)
vectorstore = FAISS.from_documents(docs, embedding_model)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]
)

# Build the RAG chain
llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=hybrid_retriever,
    return_source_documents=True
)

# Run a query
response = qa_chain.invoke({"query": "iPhone 15 Pro Max 256GB price"})
print(response["result"])
print("\nSources:")
for doc in response["source_documents"]:
    print(f"  - {doc.metadata.get('source', 'unknown')}")
```
Conclusion
Hybrid Search is one of the highest-ROI improvements you can make to a RAG system. The performance gain relative to implementation complexity is substantial.
Recommended approach:
- Build baseline RAG with vector search
- Measure with RAGAS (context precision, context recall)
- Switch to Hybrid Search (BM25 + vector)
- Measure again and verify improvement
Don't rely on gut feeling to decide whether things improved; measure. The next post covers quantitative RAG evaluation with RAGAS in detail.