
- Overview
- Chunking Strategy Comparison
- Embedding Model Selection
- Vector DB Indexing Strategies
- Retrieval Quality Metrics: MRR, NDCG, Recall@K
- Hybrid Search Implementation
- Reranking
- Troubleshooting
- Operations Checklist
- Failure Cases
- References
Overview
The two most important axes determining LLM response quality in a RAG (Retrieval-Augmented Generation) pipeline are chunking strategy and embedding optimization. No matter how powerful the LLM, hallucinations follow when the retrieval stage fails to fetch the relevant documents. Conversely, when retrieval quality is high, even smaller models can produce adequate answers.
As of early 2026, the recurring problems encountered in practice when building RAG pipelines are as follows:
- Retrieval accuracy plummeting due to incorrectly configured chunk sizes
- Costs escalating without clear criteria for embedding model selection
- Retrieval latency caused by mismatched vector DB indexing strategies
- Inability to determine improvement directions due to lack of quantitative retrieval quality measurement
This article covers concrete solutions for each problem with code and benchmark data. It reflects the latest benchmark results as of February 2026 and focuses on configuration values validated in production environments.
Chunking Strategy Comparison
Chunking is the process of splitting original documents into pieces of a size suitable for vector embedding. Depending on the chunking strategy, retrieval accuracy, embedding cost, and context quality vary significantly.
Fixed-Size Chunking
The simplest approach, where text is cut into uniform sizes based on a specified number of characters or tokens. It is easy to implement and predictable, but since it ignores sentence and paragraph boundaries, semantic breaks can occur.
```python
from langchain.text_splitter import CharacterTextSplitter

# Fixed-size chunking - the most basic approach
splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=512,
    chunk_overlap=50,  # 10% overlap to maintain context
    length_function=len,
)
documents = splitter.split_text(raw_text)
print(f"Total chunks: {len(documents)}")
print(f"Average chunk length: {sum(len(d) for d in documents) / len(documents):.0f} chars")
```
Pros: Minimal implementation cost, fastest processing speed, predictable chunk count. Cons: Mid-sentence cuts occur, unable to preserve semantic units.
Recursive Character Splitting
In the February 2026 FloTorch benchmark, 512-token recursive splitting achieved 69% accuracy, ranking first. Recursive chunking attempts splits in order of paragraph (`\n\n`) -> newline (`\n`) -> space (`" "`) -> character (`""`), preserving semantic units as much as possible within the specified size.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Optimal settings based on 2026 benchmarks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,  # approximately 12% overlap
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
    is_separator_regex=False,
)
chunks = splitter.split_text(raw_text)

# Chunk quality verification
for i, chunk in enumerate(chunks[:3]):
    print(f"[Chunk {i}] length={len(chunk)} | start: {chunk[:80]}...")
```
Key configuration values: The validated recommended values as of early 2026 are chunk_size 400-512 and overlap 10-20%. When exceeding 2,500 tokens, a "context cliff" phenomenon is observed where response quality drops sharply.
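Since the splitters above measure length in characters while the recommendations are in tokens, it helps to sanity-check chunk sizes against the token budget. The sketch below uses the common ~4-characters-per-token heuristic for English; this ratio is an assumption, so use a real tokenizer (e.g. tiktoken) when precise counts matter.

```python
def approx_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate (~4 chars/token for English text)."""
    return max(1, round(len(text) / chars_per_token))

def flag_oversized(chunks: list[str], max_tokens: int = 2500) -> list[int]:
    """Return indices of chunks whose estimated token count passes the 'context cliff'."""
    return [i for i, c in enumerate(chunks) if approx_tokens(c) > max_tokens]

chunks = ["short chunk", "x" * 12000]  # second chunk is ~3000 estimated tokens
print(flag_oversized(chunks))  # -> [1]
```

Running a check like this over the whole corpus after splitting catches the occasional giant chunk (e.g. a table or code block that no separator could break) before it silently degrades response quality.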
Semantic Chunking
Uses an embedding model to calculate semantic similarity between sentences and splits at points where the meaning shifts. While theoretically the most sophisticated approach, it surprisingly recorded a low 54% accuracy in the 2026 benchmark. The cause was an average chunk size of only 43 tokens, too small to carry usable context.
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Semantic chunking - split based on semantic transition points
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
semantic_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",  # percentile, standard_deviation, interquartile
    breakpoint_threshold_amount=75,  # split at the top 25% of similarity differences
)
semantic_chunks = semantic_splitter.split_text(raw_text)
print(f"Semantic chunk count: {len(semantic_chunks)}")
print(f"Average length: {sum(len(c) for c in semantic_chunks) / len(semantic_chunks):.0f} chars")
```
Caution: Semantic chunking generates 3-5x more vectors than recursive splitting on the same corpus. For 10,000 documents, recursive splitting creates approximately 50,000 chunks, while semantic splitting can increase to 250,000.
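The storage impact of that vector-count blowup is easy to estimate. A quick sketch, assuming 1024-dimensional float32 vectors and counting raw vector data only (HNSW graph and payload overhead excluded):

```python
def vector_storage_gb(num_vectors: int, dim: int = 1024, bytes_per_float: int = 4) -> float:
    """Raw float32 storage for the vectors alone (index overhead excluded)."""
    return num_vectors * dim * bytes_per_float / 1024**3

# 10,000 documents: recursive splitting vs. semantic chunking
print(f"recursive (50K chunks):  {vector_storage_gb(50_000):.2f} GB")
print(f"semantic (250K chunks):  {vector_storage_gb(250_000):.2f} GB")
```

Since vector DB memory and pricing scale with vector count, the 5x chunk multiplier translates directly into a 5x storage bill before any index overhead is added.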
Document Structure-Based Chunking
Splits using the document's own structure such as Markdown headers, HTML tags, and PDF sections. It is effective for documents with clear hierarchical structures like technical documentation, API references, and legal documents. In the November 2025 MDPI Bioengineering study, adaptive chunking aligned with logical topic boundaries achieved 87% accuracy.
```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

# Markdown structure-based chunking
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False,
)
md_chunks = md_splitter.split_text(markdown_text)

# Each chunk includes header hierarchy info as metadata
for chunk in md_chunks[:3]:
    print(f"Metadata: {chunk.metadata}")
    print(f"Content: {chunk.page_content[:100]}...")
    print("---")
```
Chunking Strategy Comparison Table
| Strategy | Accuracy (Benchmark) | Chunk Size Predictability | Implementation Complexity | Embedding Cost | Suitable Documents |
|---|---|---|---|---|---|
| Fixed-Size | 60-65% | High | Low | Baseline | Unstructured text |
| Recursive Splitting | 69% | Medium | Low | Baseline | General purpose (recommended) |
| Semantic | 54% | Low | Medium | 3-5x | Documents with frequent topic shifts |
| Document Structure | 87% | Medium | Medium | 1-2x | Structured technical docs |
| Proposition-Based | 62% | Low | High | 5x+ | Research papers |
Practical recommendation: Start with RecursiveCharacterTextSplitter (400-512 tokens, 10-20% overlap), measure retrieval quality metrics, then decide whether to switch to structure-based or semantic approaches.
Embedding Model Selection
The embedding model directly determines the retrieval performance of a RAG pipeline. This section synthesizes the MTEB (Massive Text Embedding Benchmark) leaderboard and practical application results as of early 2026.
Model Comparison Based on MTEB Benchmark
| Model | MTEB Score | Dimensions | Max Tokens | Multilingual | License | Cost (1M tokens) |
|---|---|---|---|---|---|---|
| Cohere embed-v4 | 65.2 | 1024 | 512 | Yes | API | $0.10 |
| OpenAI text-embedding-3-large | 64.6 | 3072 | 8191 | Yes | API | $0.13 |
| OpenAI text-embedding-3-small | 62.3 | 1536 | 8191 | Yes | API | $0.02 |
| BGE-M3 | 63.0 | 1024 | 8192 | 100+ | MIT | Self-hosted |
| Qwen3-Embedding-8B | 70.58 | 4096 | 8192 | Yes | Apache 2.0 | Self-hosted |
| E5-Mistral-7B | 63.5 | 4096 | 32768 | Yes | MIT | Self-hosted |
Selection criteria summary:
- API-based rapid prototyping: OpenAI text-embedding-3-small (best performance-to-cost ratio)
- Production API: Cohere embed-v4 or OpenAI text-embedding-3-large
- Self-hosted multilingual: BGE-M3 (supports dense, sparse, and multi-vector simultaneously)
- Best performance self-hosted: Qwen3-Embedding-8B (MTEB 70.58, requires GPU resources)
Embedding Generation Code
```python
from openai import OpenAI
import numpy as np

client = OpenAI()

def generate_embeddings(
    texts: list[str],
    model: str = "text-embedding-3-large",
    dimensions: int = 1024,  # dimension reduction for cost/speed optimization
    batch_size: int = 100,
) -> np.ndarray:
    """Batch embedding generation with dimension reduction."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            input=batch,
            model=model,
            dimensions=dimensions,  # only supported by the text-embedding-3 series
        )
        batch_embs = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embs)
    return np.array(all_embeddings, dtype=np.float32)

# Usage example
chunks = [
    "The core of a RAG pipeline is retrieval quality.",
    "Results vary depending on the chunking strategy.",
]
embeddings = generate_embeddings(chunks, dimensions=1024)
print(f"Embeddings shape: {embeddings.shape}")  # (2, 1024)
```
Dimension reduction tip: text-embedding-3-large defaults to 3072 dimensions, but you can reduce to 1024 or even 256 using the `dimensions` parameter. The MTEB score drop from 3072 to 1024 is within 1-2%, while the savings in vector DB storage cost and search speed are significant.
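If full 3072-dimension vectors are already stored, they can in principle be reduced client-side rather than re-embedding: the text-embedding-3 models are reported to be trained Matryoshka-style, so truncating a vector and re-normalizing it preserves most of the similarity structure. A minimal sketch (NumPy only; the random matrix stands in for real embeddings):

```python
import numpy as np

def truncate_and_normalize(embeddings: np.ndarray, target_dim: int) -> np.ndarray:
    """Client-side dimension reduction: keep the first target_dim components,
    then L2-normalize so cosine similarity remains meaningful."""
    truncated = embeddings[:, :target_dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

full = np.random.default_rng(0).normal(size=(2, 3072)).astype(np.float32)
reduced = truncate_and_normalize(full, 1024)
print(reduced.shape)  # (2, 1024)
```

This mirrors what the API's `dimensions` parameter does server-side, so it is mainly useful for migrating an existing collection without a full re-embedding pass.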
BGE-M3 Self-Hosted Embedding
```python
from FlagEmbedding import BGEM3FlagModel

# BGE-M3: supports dense + sparse + colbert simultaneously
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

sentences = [
    "Chunking strategy is the core of retrieval quality in LLM RAG pipelines.",
    "Vector database indexing directly impacts retrieval latency.",
]

# Generate dense + sparse + colbert embeddings in one pass
output = model.encode(
    sentences,
    batch_size=12,
    max_length=512,
    return_dense=True,
    return_sparse=True,
    return_colbert_vecs=True,
)
dense_embeddings = output["dense_vecs"]        # shape: (2, 1024)
sparse_embeddings = output["lexical_weights"]  # sparse vectors (BM25 replacement)
colbert_vecs = output["colbert_vecs"]          # multi-vector (precise matching)
print(f"Dense shape: {dense_embeddings.shape}")
print(f"Sparse keys example: {list(sparse_embeddings[0].keys())[:5]}")
```
The key advantage of BGE-M3 is that a single model supports dense, sparse, and multi-vector retrieval. By leveraging this, you can implement Hybrid Search without a separate BM25 index.
Vector DB Indexing Strategies
The choice and indexing strategy of the vector database that stores and retrieves embedded vectors directly impacts retrieval latency and accuracy.
Vector DB Comparison
| Feature | Chroma | Pinecone | Weaviate | Qdrant | Milvus |
|---|---|---|---|---|---|
| Hosting | Self/Cloud | Managed | Self/Cloud | Self/Cloud | Self/Cloud |
| p50 Latency (100K vectors) | ~20ms | ~15ms | ~25ms | ~18ms | ~20ms |
| Max Vector Count | Millions | Billions | Hundreds of M | Billions | Billions |
| Metadata Filtering | Basic | Advanced | GraphQL | Advanced | Advanced |
| Hybrid Search | No | Yes | Yes | Yes | Yes |
| Free Tier | Unlimited Local | Limited | 14 days | 1GB Free | Open Source |
| Prototyping | Optimal | Good | Good | Good | Fair |
| Enterprise | Not suitable | Optimal | Good | Good | Good |
Vector Storage and Search with Chroma
```python
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

# Initialize Chroma client (persistent storage)
client = chromadb.PersistentClient(path="./chroma_db")
embedding_fn = OpenAIEmbeddingFunction(
    api_key="sk-...",
    model_name="text-embedding-3-large",
)

# Create collection (HNSW index applied automatically)
collection = client.get_or_create_collection(
    name="rag_knowledge_base",
    embedding_function=embedding_fn,
    metadata={
        "hnsw:space": "cosine",       # similarity metric
        "hnsw:M": 32,                 # HNSW connections (higher = more accurate, more memory)
        "hnsw:construction_ef": 200,  # search width during index construction (Chroma's key name)
    },
)

# Add documents (batch)
collection.add(
    documents=[
        "Chunking determines 80% of retrieval quality in RAG.",
        "Embedding model selection determines the remaining 20%.",
    ],
    metadatas=[
        {"source": "rag_guide", "section": "chunking", "date": "2026-03"},
        {"source": "rag_guide", "section": "embedding", "date": "2026-03"},
    ],
    ids=["doc_001", "doc_002"],
)

# Search (metadata filter + similarity)
results = collection.query(
    query_texts=["What is the most important factor in a RAG pipeline?"],
    n_results=5,
    where={"source": "rag_guide"},
    include=["documents", "distances", "metadatas"],
)
for doc, dist, meta in zip(
    results["documents"][0], results["distances"][0], results["metadatas"][0]
):
    print(f"[Distance: {dist:.4f}] {meta['section']} | {doc[:80]}")
```
HNSW Index Parameter Tuning
There are three key parameters for the HNSW (Hierarchical Navigable Small World) index used by most vector DBs.
| Parameter | Description | Default | Production Recommended | Impact |
|---|---|---|---|---|
| M | Connections per node | 16 | 32-48 | Higher = more accuracy, more memory usage |
| ef_construction | Search width during indexing | 100 | 200-400 | Higher = better index quality, longer build time |
| ef_search | Search width during query | 50 | 100-200 | Higher = better recall, higher search latency |
Practical tip: At 1 million vectors, increasing M from 32 to 48 improves Recall@10 by about 2-3%, but memory usage increases by 40%. If memory is constrained, increasing ef_search is more cost-effective.
Retrieval Quality Metrics: MRR, NDCG, Recall@K
Without quantitatively measuring the retrieval quality of a RAG pipeline, you cannot determine the direction for improvement. Here are the three key metrics with code.
Metric Definitions
- MRR (Mean Reciprocal Rank): The average of the reciprocal rank of the first relevant document. It measures "how quickly the correct answer appears."
- NDCG@K (Normalized Discounted Cumulative Gain): Evaluates the relevance of the top K results with rank-weighted scoring. Higher ranks receive higher weights.
- Recall@K: The proportion of all relevant documents included in the top K results. It measures "how many relevant documents were found."
Evaluation Code Implementation
```python
import numpy as np
from typing import List, Set

def mean_reciprocal_rank(
    retrieved_ids: List[List[str]],
    relevant_ids: List[Set[str]],
) -> float:
    """MRR: average reciprocal rank of the first relevant document per query."""
    mrr_scores = []
    for retrieved, relevant in zip(retrieved_ids, relevant_ids):
        for rank, doc_id in enumerate(retrieved, 1):
            if doc_id in relevant:
                mrr_scores.append(1.0 / rank)
                break
        else:  # no relevant document retrieved at all
            mrr_scores.append(0.0)
    return np.mean(mrr_scores)

def recall_at_k(
    retrieved_ids: List[List[str]],
    relevant_ids: List[Set[str]],
    k: int = 10,
) -> float:
    """Recall@K: proportion of relevant documents found in the top K results."""
    recalls = []
    for retrieved, relevant in zip(retrieved_ids, relevant_ids):
        if len(relevant) == 0:
            continue
        top_k = set(retrieved[:k])
        recalls.append(len(top_k & relevant) / len(relevant))
    return np.mean(recalls)

def ndcg_at_k(
    retrieved_ids: List[List[str]],
    relevant_ids: List[Set[str]],
    k: int = 10,
) -> float:
    """NDCG@K: rank-weighted relevance evaluation (binary relevance)."""
    ndcg_scores = []
    for retrieved, relevant in zip(retrieved_ids, relevant_ids):
        # DCG: each hit contributes 1/log2(rank + 1)
        dcg = 0.0
        for rank, doc_id in enumerate(retrieved[:k], 1):
            if doc_id in relevant:
                dcg += 1.0 / np.log2(rank + 1)
        # Ideal DCG: all relevant documents ranked at the top
        ideal_hits = min(len(relevant), k)
        idcg = sum(1.0 / np.log2(r + 1) for r in range(1, ideal_hits + 1))
        ndcg_scores.append(dcg / idcg if idcg > 0 else 0.0)
    return np.mean(ndcg_scores)

# Usage example
retrieved = [["d1", "d3", "d5", "d2", "d4"]]
relevant = [{"d1", "d2", "d7"}]
print(f"MRR: {mean_reciprocal_rank(retrieved, relevant):.4f}")
print(f"Recall@3: {recall_at_k(retrieved, relevant, k=3):.4f}")
print(f"Recall@5: {recall_at_k(retrieved, relevant, k=5):.4f}")
print(f"NDCG@5: {ndcg_at_k(retrieved, relevant, k=5):.4f}")
```
Metric Interpretation Guidelines
| Metric | Poor | Fair | Good | Target |
|---|---|---|---|---|
| MRR | under 0.3 | 0.3-0.5 | 0.5-0.8 | over 0.7 |
| NDCG@10 | under 0.4 | 0.4-0.6 | 0.6-0.8 | over 0.7 |
| Recall@10 | under 0.5 | 0.5-0.7 | 0.7-0.9 | over 0.8 |
If MRR is low but Recall@K is high, it means relevant documents are being found but ranked too low. In this case, introducing reranking is highly effective.
Hybrid Search Implementation
Pure vector search (Dense Retrieval) alone has limitations when exact keyword matching is needed (proper nouns, code names, product numbers, etc.). Hybrid Search combines vector search with keyword search (BM25/Sparse) to leverage the strengths of both approaches.
Hybrid Search with Qdrant
```python
from qdrant_client import QdrantClient, models
from qdrant_client.models import Distance, VectorParams, SparseVectorParams

client = QdrantClient(host="localhost", port=6333)

# Create collection storing Dense + Sparse vectors simultaneously
client.create_collection(
    collection_name="hybrid_rag",
    vectors_config={
        "dense": VectorParams(size=1024, distance=Distance.COSINE),
    },
    sparse_vectors_config={
        "sparse": SparseVectorParams(),
    },
)

# Index documents (dense + sparse vectors from the embedder, e.g. BGE-M3 above)
client.upsert(
    collection_name="hybrid_rag",
    points=[
        models.PointStruct(
            id=1,
            vector={
                "dense": dense_embedding.tolist(),
                "sparse": models.SparseVector(
                    # BGE-M3 lexical_weights keys are token-id strings; SparseVector needs ints
                    indices=[int(k) for k in sparse_weights.keys()],
                    values=list(sparse_weights.values()),
                ),
            },
            payload={"text": "RAG pipeline chunking guide", "source": "blog"},
        ),
    ],
)

# Execute Hybrid Search (RRF-based score fusion)
results = client.query_points(
    collection_name="hybrid_rag",
    prefetch=[
        models.Prefetch(
            query=dense_query_vector.tolist(),
            using="dense",
            limit=20,
        ),
        models.Prefetch(
            query=models.SparseVector(
                indices=[int(k) for k in sparse_query.keys()],
                values=list(sparse_query.values()),
            ),
            using="sparse",
            limit=20,
        ),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),  # Reciprocal Rank Fusion
    limit=10,
)
for point in results.points:
    print(f"[Score: {point.score:.4f}] {point.payload['text']}")
```
Dense vs. Sparse vs. Hybrid Performance Comparison
| Search Method | Keyword Matching | Semantic Similarity | Proper Nouns/Code | General Questions | Recommended Use Case |
|---|---|---|---|---|---|
| Dense Only | Weak | Strong | Weak | Strong | Natural language Q&A |
| Sparse Only (BM25) | Strong | Weak | Strong | Weak | Keyword search |
| Hybrid (RRF) | Strong | Strong | Strong | Strong | Production RAG (recommended) |
In Hybrid Search, the weight ratio between Dense and Sparse needs per-domain tuning. For technical documentation, weighting Sparse higher (around 0.6) is effective; for general conversational Q&A, weighting Dense higher (around 0.7) empirically works better.
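When a vector DB does not expose weighted fusion directly, the fusion step is small enough to implement yourself. A standalone weighted RRF sketch (the k=60 damping constant is the conventional default; the weights shown are the technical-doc profile from above, not universal values):

```python
def weighted_rrf(
    ranked_lists: dict[str, list[str]],
    weights: dict[str, float],
    k: int = 60,  # standard RRF damping constant
) -> list[tuple[str, float]]:
    """Fuse several ranked result lists with weighted Reciprocal Rank Fusion.
    score(doc) = sum over lists of weight / (k + rank)."""
    scores: dict[str, float] = {}
    for name, ranking in ranked_lists.items():
        w = weights.get(name, 1.0)
        for rank, doc_id in enumerate(ranking, 1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Technical-doc profile: favor the sparse (keyword) ranking
fused = weighted_rrf(
    {"dense": ["d2", "d1", "d3"], "sparse": ["d1", "d4", "d2"]},
    weights={"dense": 0.4, "sparse": 0.6},
)
print(fused[0][0])  # -> d1 (top of the sparse list wins under this weighting)
```

Because RRF operates on ranks rather than raw scores, it needs no score normalization between the dense and sparse retrievers, which is why it is the default fusion method in most engines.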
Reranking
Reranking is the process of re-evaluating initial search results with a Cross-Encoder model to readjust rankings. According to Databricks research, applying reranking improves retrieval quality by up to 48%, with typical NDCG@10 improvements of 20-35%.
Reranking Architecture
- Stage 1 - Candidate Retrieval: Quickly extract the top 50-100 documents using vector search (or Hybrid Search).
- Stage 2 - Reranking: The Cross-Encoder directly compares query-document pairs to produce precise relevance scores.
- Stage 3 - Final Selection: Pass the top 5-10 documents based on reranking scores to the LLM context.
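The three stages above can be sketched as a pluggable pipeline. The toy overlap scorer below is purely illustrative; in production, `score_fn` would wrap a real Cross-Encoder (e.g. sentence-transformers' `CrossEncoder` with cross-encoder/ms-marco-MiniLM-L-6-v2, as noted in the comment):

```python
from typing import Callable

def two_stage_retrieve(
    query: str,
    candidates: list[str],                   # stage 1 output: top 50-100 from vector search
    score_fn: Callable[[str, str], float],   # stage 2: query-document relevance scorer
    top_n: int = 5,                          # stage 3: documents passed to the LLM context
) -> list[str]:
    """Rerank stage-1 candidates with a query-document scorer, keep the top N."""
    return sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)[:top_n]

# Toy scorer: token overlap. In production, replace with e.g.
#   model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
#   score_fn = lambda q, d: model.predict([(q, d)])[0]
def overlap_score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(1, len(q))

docs = ["chunking strategy guide", "vector db latency", "rag chunking and overlap"]
top = two_stage_retrieve("rag chunking", docs, overlap_score, top_n=2)
print(top)
```

Keeping the scorer behind a plain callable makes it trivial to A/B different rerankers (or skip reranking entirely) without touching the retrieval code.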
Reranking Model Comparison
| Model | NDCG@10 Improvement | Latency (50 docs) | Cost | Recommended |
|---|---|---|---|---|
| Cohere Rerank v3 | +30-35% | ~300ms | API-based | Production |
| cross-encoder/ms-marco-MiniLM-L-6-v2 | +20-25% | ~150ms | Free | Cost-sensitive |
| BGE-Reranker-v2-m3 | +25-30% | ~200ms | Free | Multilingual |
| Jina Reranker v2 | +28-32% | ~250ms | API/Self | Balanced |
Key trade-off: Cross-Encoder reranking improves accuracy by 20-35% but adds 200-500ms latency per query. In real-time chat applications, limit reranking candidates to 20-30 to keep latency under 150ms.
Troubleshooting
Here are frequently encountered problems and solutions in production RAG pipelines.
Problem 1: Search Results Return Irrelevant Documents
Root cause analysis: In most cases the chunk size is too large (over 2,500 tokens), or insufficient overlap is breaking semantic units apart.
Solution:
- Reduce chunk size to 400-512 and set overlap to 10-20%.
- Prepend the original document's title or section header to the beginning of each chunk before embedding.
- Add metadata filtering to narrow the search scope.
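The header-prepending fix above can be sketched as a small preprocessing step. The `>`-separated breadcrumb format is just one convention (an assumption here, not a standard); any consistent scheme works:

```python
def contextualize_chunks(
    doc_title: str,
    sections: list[tuple[str, list[str]]],  # (section header, chunks in that section)
) -> list[str]:
    """Prepend document title and section header to each chunk before embedding,
    so retrieval can match on context the raw chunk text lacks."""
    out = []
    for header, chunks in sections:
        for chunk in chunks:
            out.append(f"{doc_title} > {header}\n\n{chunk}")
    return out

enriched = contextualize_chunks(
    "RAG Pipeline Guide",
    [("Chunking", ["Use 400-512 token chunks."]), ("Embedding", ["Compare MTEB scores."])],
)
print(enriched[0].splitlines()[0])  # -> RAG Pipeline Guide > Chunking
```

With structure-based splitters such as MarkdownHeaderTextSplitter, the header metadata each chunk already carries can feed this step directly.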
Problem 2: Relevant Documents Found but Ranked Low (Low MRR, High Recall)
Root cause analysis: When using only Dense search, documents that are semantically related but not direct answers rank higher.
Solution:
- Introduce Cross-Encoder reranking. In most cases, MRR increases by 0.2-0.3.
- Add a domain prefix to queries, e.g. embed them in the format `"Question: {query}"`.
- Apply Hybrid Search to reinforce keyword matching signals.
Problem 3: Embedding Costs Exceed Budget
Root cause analysis: Too many vectors generated from semantic chunking, or using high-dimensional embeddings.
Solution:
- Use the `dimensions` parameter of text-embedding-3-large to reduce from 3072 to 1024 dimensions. The MTEB score drop is within 1-2%.
- Switching from semantic chunking to recursive splitting reduces vector count by 3-5x.
- Separate infrequently accessed old documents into cold storage.
Problem 4: Vector Search Latency Exceeds SLA
Root cause analysis: Untuned HNSW index parameters, insufficient memory due to vector count growth, disk-based search occurring.
Solution:
- Incrementally adjust ef_search values (50 -> 100 -> 200). Measure the Recall vs. Latency trade-off.
- Quantize vectors (Scalar/Product Quantization) to reduce memory usage by 50-75%.
- Shard collections by date to reduce the number of vectors searched.
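The quantization saving is straightforward arithmetic: scalar quantization stores each dimension as int8 (1 byte) instead of float32 (4 bytes), a 75% cut, and product quantization can go further. A quick estimator, counting vector data only (graph and payload overhead excluded):

```python
def quantization_memory_gb(num_vectors: int, dim: int, bytes_per_dim: float) -> float:
    """Vector-data memory at a given per-dimension storage size."""
    return num_vectors * dim * bytes_per_dim / 1024**3

# 10M vectors at 1024 dims: float32 vs. int8 scalar quantization
before = quantization_memory_gb(10_000_000, 1024, 4)  # float32: 4 bytes/dim
after = quantization_memory_gb(10_000_000, 1024, 1)   # int8: 1 byte/dim
print(f"{before:.1f} GB -> {after:.1f} GB ({(1 - after / before):.0%} saved)")
```

Most engines keep the original vectors on disk for optional rescoring, so the quoted saving applies to RAM, not total storage.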
Problem 5: Cross-Language Search Failure in Multilingual Documents
Root cause analysis: When using English-centric embedding models, embedding quality degrades for non-English queries such as Korean or Japanese.
Solution:
- Switch to BGE-M3 (supports over 100 languages) or Cohere embed-v4 (multilingual optimized).
- When the query language differs from the document language, add a pipeline that translates the query to the document language before searching.
Operations Checklist
Here are items that must be verified before deploying a production RAG pipeline.
Chunking Configuration
- Is the chunk size set to 400-512 tokens?
- Is the overlap set to 10-20%?
- Have you verified that no chunks exceed 2,500 tokens?
- Have you separated chunking strategies by document type (PDF, Markdown, code, etc.)?
- Is there logic to filter empty and duplicate chunks?
Embedding
- Have you compared MTEB scores and costs of embedding models?
- Have you tested whether dimension reduction is applicable (3072 -> 1024)?
- Is rate limit handling implemented for batch embedding processing?
- Is the full re-indexing procedure documented for embedding model version changes?
Vector DB
- Have you tuned the HNSW index parameters (M, ef_construction, ef_search)?
- Is there a memory scaling plan for growing vector counts?
- Have you tested the backup/recovery procedures?
- Have you appropriately configured metadata filtering indexes?
Retrieval Quality
- Have you built an evaluation dataset (query-answer pairs) of at least 50 items?
- Have you set target values for MRR, NDCG@10, and Recall@10?
- Is an A/B testing pipeline built?
- Is there a system to collect and analyze retrieval failure logs?
Monitoring
- Are you monitoring per-query retrieval latency at p50/p95/p99?
- Are you tracking embedding API call failure rates?
- Are alerts configured for vector DB disk/memory usage?
- Is there a periodic batch job that automatically evaluates retrieval quality metrics?
Failure Cases
Case 1: The Semantic Chunking Trap
A company processed all documents with semantic chunking under the assumption that "more sophisticated chunking must be better." The results were:
- Vector count increased 4.2x compared to recursive splitting
- Monthly Pinecone cost rose to $3,400
- Average chunk size shrank to 38 tokens, causing insufficient context, and retrieval accuracy actually dropped by 12%
Lesson: Chunking strategies must be selected based on benchmarks. The assumption "more sophisticated method = better results" is repeatedly disproven in 2026 benchmarks.
Case 2: Missing Re-Indexing During Embedding Model Replacement
This case involved upgrading from text-embedding-ada-002 to text-embedding-3-large without re-indexing existing vectors. Vectors from different embedding spaces became mixed, causing search results to become nearly random.
Lesson: When changing embedding models, all vectors must be regenerated. For zero-downtime migration, re-index into a new collection, verify, then switch traffic using a Blue-Green deployment strategy.
Case 3: Outage Due to Unset HNSW ef_search
When vectors exceeded 1 million, search latency surpassed 500ms, but the default ef_search value (10) was still being used. Raising ef_search to 100 increased Recall@10 from 72% to 91% while latency remained at around 80ms.
Lesson: HNSW parameter tuning must be readjusted based on data scale. Re-evaluate ef_search and ef_construction every time vector count increases by 10x.
References
- MTEB Leaderboard - Hugging Face - Latest embedding model benchmark rankings
- LangChain Text Splitters Documentation - Official LangChain chunking implementation documentation
- Chunking Strategies for RAG - Weaviate - Performance comparison guide by chunking strategy
- Optimizing RAG with Hybrid Search and Reranking - Superlinked - Practical guide for Hybrid Search and reranking optimization
- Rerankers and Two-Stage Retrieval - Pinecone - Two-stage retrieval and reranking architecture explanation
- BGE-M3 - FlagEmbedding GitHub - BGE-M3 multilingual embedding model official repository
- Best Vector Databases in 2026 - Firecrawl - 2026 vector DB comparison analysis