Hybrid Search Guide: Combining BM25 and Vector Search for Better RAG

When Pure Vector Search Fails

One of the most disorienting moments when you first build a RAG system is watching vector search fail to find a document that is obviously correct.

Here's a real example. Your product database has a document titled "iPhone 15 Pro Max 256GB Storage Capacity." A user asks: "Is the iPhone 15 Pro Max 256GB in stock?"

Vector search might not rank this as the top result. Why? Because "iPhone 14 Pro Max 512GB" is semantically very close in the embedding space. Both are in the "Pro Max smartphone" meaning cluster.

But the user asked specifically about the 256GB model of the 15. Those exact numbers and model identifiers need to match precisely.

This is the fundamental limitation of pure vector search: it's weak at exact keyword matching. Product codes, model numbers, proper nouns, dates, version numbers — all of these fall into this category.

What Is BM25?

BM25 (Okapi BM25) is a keyword search algorithm developed by Robertson et al. in 1994. Thirty years later, it remains the gold standard for keyword retrieval. Elasticsearch, Solr, and Apache Lucene all use BM25 as their default search algorithm.

The formula makes the logic clear:

score(D, Q) = Σ IDF(qi) × [f(qi, D) × (k1 + 1)] / [f(qi, D) + k1 × (1 - b + b × |D| / avgdl)]

where:
  qi        = each query term
  f(qi, D)  = term frequency of qi in document D
  |D|       = document length
  avgdl     = average document length across corpus
  k1, b     = tuning parameters (typically k1=1.5, b=0.75)
  IDF(qi)   = log((N - df + 0.5) / (df + 0.5))
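The formula translates directly into code. Below is a minimal sketch for intuition, not a production scorer; note the +1 inside the IDF log is a Lucene-style smoothing I've added, because the raw IDF formula above goes negative for terms appearing in more than half the corpus:

```python
import math

def bm25_score(query_terms: list, doc: list, corpus: list,
               k1: float = 1.5, b: float = 0.75) -> float:
    """Score one tokenized document against a query using the BM25 formula."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # +1: Lucene-style smoothing
        f = doc.count(term)                                # term frequency in this doc
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

# Toy corpus (pre-tokenized)
corpus = [
    ["iphone", "15", "pro", "max", "256gb", "storage"],
    ["iphone", "14", "pro", "max", "512gb", "storage"],
    ["galaxy", "s24", "ultra", "256gb"],
]
query = ["iphone", "15", "256gb"]
scores = [bm25_score(query, d, corpus) for d in corpus]
# The document matching "iphone", "15", and "256gb" exactly scores highest
```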

BM25 has two key innovations:

1. TF Saturation: More occurrences increase the score, but with diminishing returns — the term-frequency contribution is bounded above by k1 + 1. The difference between appearing once and 100 times is far from linear.

2. Document Length Normalization: Longer documents naturally accumulate higher term frequencies. BM25 normalizes by document length relative to the corpus average, removing this unfair advantage.

These two properties explain why BM25 significantly outperforms simple TF-IDF scoring.
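The saturation property is easy to verify numerically. When a document has average length (|D| = avgdl), the length term reduces to 1 and the per-term contribution simplifies to f × (k1 + 1) / (f + k1):

```python
def tf_component(f: int, k1: float = 1.5) -> float:
    # Term-frequency contribution when |D| == avgdl (length term = 1)
    return f * (k1 + 1) / (f + k1)

for f in [1, 2, 5, 10, 100]:
    print(f"f={f}: {tf_component(f):.3f}")
# f=1: 1.000
# f=2: 1.429
# f=5: 1.923
# f=10: 2.174
# f=100: 2.463  — saturating toward the ceiling k1 + 1 = 2.5
```

Going from 1 occurrence to 100 barely more than doubles the contribution, which is exactly the diminishing-returns behavior described above.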

The Weaknesses of Each Approach

Vector search weaknesses:
- Poor at exact keyword matching (model numbers, product codes)
- "Semantic drift": retrieves semantically similar but incorrect content
- Vulnerable to rare terms, neologisms, abbreviations

BM25 weaknesses:
- No synonym understanding ("automobile" vs "car")
- No contextual understanding (word order ignored)
- Vulnerable to typos
- Weak multilingual support

These weaknesses complement each other perfectly. Vector search captures meaning; BM25 captures exact keywords. Combining the two yields results significantly better than either alone.

RRF: The Method for Combining Both Results

The simplest and most effective way to combine two ranked result lists is RRF (Reciprocal Rank Fusion).

The idea is elegant: convert each result's rank to its reciprocal and sum them up.

def reciprocal_rank_fusion(results_list: list, k: int = 60) -> list:
    """
    Combine multiple ranked result lists using RRF.

    RRF score = Σ 1/(k + rank_i)
    k=60 is a constant that dampens the impact of low-ranked items.

    Args:
        results_list: list of result lists (each is a list of doc IDs in rank order)
        k: rank stabilization constant (default 60)

    Returns:
        List of doc IDs sorted by descending RRF score
    """
    scores: dict = {}

    for results in results_list:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)

    return sorted(scores.keys(), key=lambda x: scores[x], reverse=True)


# Concrete example
vector_results = ["doc_A", "doc_C", "doc_B"]  # from embedding search
bm25_results   = ["doc_B", "doc_A", "doc_D"]  # from BM25 keyword search

# Score calculation:
# doc_A: 1/(60+1) + 1/(60+2) = 0.01639 + 0.01613 = 0.03252
# doc_B: 1/(60+3) + 1/(60+1) = 0.01587 + 0.01639 = 0.03226
# doc_C: 1/(60+2) = 0.01613
# doc_D: 1/(60+3) = 0.01587

fused = reciprocal_rank_fusion([vector_results, bm25_results])
print(fused)
# ['doc_A', 'doc_B', 'doc_C', 'doc_D']
# doc_A ranks first because it's highly ranked in both lists
# doc_B ranks second: #1 in BM25 but #3 in vector

Why k=60? This value was determined experimentally. It's designed to appropriately dampen the contribution of lower-ranked items. Too small and low-ranked items get excessive credit; too large and all items become indistinguishable.
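You can see this trade-off directly by comparing how much more weight a rank-1 result gets than a rank-10 result at different values of k:

```python
def rrf_weight(rank: int, k: int) -> float:
    return 1 / (k + rank)

for k in [1, 10, 60, 1000]:
    ratio = rrf_weight(1, k) / rrf_weight(10, k)
    print(f"k={k}: rank-1 vs rank-10 weight ratio = {ratio:.2f}")
# k=1:    5.50  — rank 1 dominates everything below it
# k=10:   1.82
# k=60:   1.15  — moderate preference for top ranks
# k=1000: 1.01  — ranks are nearly indistinguishable
```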

LangChain Implementation

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Assume: docs is a list of LangChain Document objects
embedding_model = OpenAIEmbeddings()

# Create BM25 retriever
bm25_retriever = BM25Retriever.from_documents(
    docs,
    k=5  # return top 5
)

# Create vector store and retriever
vectorstore = FAISS.from_documents(docs, embedding_model)
vector_retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

# EnsembleRetriever: combines both searches with weights
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]  # BM25 40%, Vector 60%
)

# Usage
query = "iPhone 15 Pro Max 256GB in stock"
results = hybrid_retriever.invoke(query)  # get_relevant_documents() is deprecated in recent LangChain versions

Weight selection guidelines:

  • Exact keywords matter (product codes, model numbers): increase BM25 weight (0.5-0.6)
  • Semantic understanding matters (customer queries, natural language): increase vector weight (0.6-0.7)
  • General enterprise documents: start with 0.4/0.6 or 0.5/0.5
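These weights scale each retriever's contribution during rank fusion. A simplified sketch of weighted RRF shows the idea (the doc IDs here are illustrative; this is a conceptual model, not LangChain's exact internals):

```python
def weighted_rrf(results_list: list, weights: list, k: int = 60) -> list:
    """Weighted Reciprocal Rank Fusion: each retriever's reciprocal-rank
    contribution is scaled by its weight before summing."""
    scores: dict = {}
    for results, weight in zip(results_list, weights):
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["doc_B", "doc_A", "doc_D"]
vector_results = ["doc_A", "doc_C", "doc_B"]

# With weights [0.4, 0.6], doc_A (ranked high in both, and #1 in the
# heavier vector list) comes out on top
fused = weighted_rrf([bm25_results, vector_results], weights=[0.4, 0.6])
print(fused)  # ['doc_A', 'doc_B', 'doc_C', 'doc_D']
```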

Production Implementation with Elasticsearch

LangChain's EnsembleRetriever is convenient but runs both searches locally. For large-scale production, Elasticsearch is more appropriate — it natively supports both BM25 and vector search.

from elasticsearch import Elasticsearch

es_client = Elasticsearch(["http://localhost:9200"])

def hybrid_search_es(query: str, query_embedding: list, k: int = 5):
    response = es_client.search(
        index="documents",
        body={
            "query": {
                "bool": {
                    "should": [
                        # BM25 keyword search
                        {
                            "match": {
                                "content": {
                                    "query": query,
                                    "boost": 0.4
                                }
                            }
                        }
                    ]
                }
            },
            # KNN vector search (ES 8.x+)
            "knn": {
                "field": "content_vector",
                "query_vector": query_embedding,
                "k": k,
                "num_candidates": 100,
                "boost": 0.6
            },
            "size": k
        }
    )
    return response["hits"]["hits"]

Performance Benchmark

Research results based on the BEIR (Benchmarking IR) dataset:

| Search Method | nDCG@10 (average) | Strongest Domain |
|---|---|---|
| BM25 only | 43.0 | Keyword matching, factual retrieval |
| Vector only | 47.8 | Semantic similarity, multilingual |
| Hybrid (RRF) | 52.1 | Balanced across all types |

Hybrid achieves roughly 21% higher nDCG@10 than BM25 alone, and about 9% higher than vector-only. The gap is especially large for general-purpose RAG systems where you can't predict what queries will come in.

From personal production experience: switching from vector-only to hybrid search reduced "irrelevant answer" complaints from customers by roughly 30%.

When Each Approach Wins

| Situation | Recommended | Reason |
|---|---|---|
| Product catalog search | Hybrid (BM25-heavy) | Model numbers and specs must be exact |
| FAQ chatbot | Hybrid (vector-heavy) | Questions come in diverse phrasings |
| Legal document search | Hybrid (BM25-heavy) | Exact legal terminology matching required |
| Sentiment-based search | Vector only | Meaning matters more than keywords |
| Code search | Hybrid (BM25-heavy) | Function names and variables need exact matching |
| Cross-language documents | Vector or multilingual BM25 | Cross-lingual semantic search needed |

Production Tip: Tuning the Tokenizer

BM25's default tokenizer is whitespace-based. For technical documentation, you can significantly improve results with a better tokenizer:

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def technical_tokenizer(text: str) -> list:
    # Tokenize
    tokens = word_tokenize(text.lower())
    # Remove stopwords and apply stemming
    tokens = [
        stemmer.stem(token)
        for token in tokens
        if token.isalnum() and token not in stop_words
    ]
    return tokens

# Apply custom tokenizer when creating BM25 retriever
bm25_retriever = BM25Retriever.from_documents(
    docs,
    preprocess_func=technical_tokenizer,
    k=5
)

For code search specifically, you might want to preserve camelCase and snake_case splitting: getUserById → ["get", "user", "by", "id"].
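A regex-based identifier splitter handles both conventions; a small sketch (the function name and regex are my own, not from any particular library):

```python
import re

def split_identifier(identifier: str) -> list:
    """Split camelCase and snake_case identifiers into lowercase word tokens."""
    tokens = []
    for part in identifier.split("_"):          # snake_case boundaries
        # camelCase boundaries: lowercase runs, acronym runs, digit runs
        tokens.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part))
    return [t.lower() for t in tokens]

print(split_identifier("getUserById"))          # ['get', 'user', 'by', 'id']
print(split_identifier("parse_http_response"))  # ['parse', 'http', 'response']
```

Plugged into BM25Retriever via preprocess_func (as shown above), this lets a query containing "user" match documents mentioning getUserById.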

Full Hybrid RAG Pipeline

Putting it all together in a complete RAG pipeline:

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

# Initialize all components
bm25_retriever = BM25Retriever.from_documents(docs, k=5)
vectorstore = FAISS.from_documents(docs, embedding_model)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]
)

# Build the RAG chain
llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=hybrid_retriever,
    return_source_documents=True
)

# Run a query
response = qa_chain.invoke({"query": "iPhone 15 Pro Max 256GB price"})
print(response["result"])
print("\nSources:")
for doc in response["source_documents"]:
    print(f"  - {doc.metadata.get('source', 'unknown')}")

Conclusion

Hybrid Search is one of the highest-ROI improvements you can make to a RAG system. The performance gain relative to implementation complexity is substantial.

Recommended approach:

  1. Build baseline RAG with vector search
  2. Measure with RAGAS (context precision, context recall)
  3. Switch to Hybrid Search (BM25 + vector)
  4. Measure again and verify improvement

Don't rely on gut feeling to determine if things improved. The next post covers quantitative RAG evaluation with RAGAS in detail.