
LLM RAG Pipeline: Chunking Strategies and Embedding Optimization in Practice 2026


Overview

The two most important axes that determine LLM response quality in a RAG (Retrieval-Augmented Generation) pipeline are chunking strategy and embedding optimization. No matter how powerful the LLM is, if the retrieval stage fails to accurately fetch relevant documents, hallucinations occur. Conversely, when retrieval quality is high, even smaller models can generate sufficient responses.

As of early 2026, the recurring problems encountered in practice when building RAG pipelines are as follows:

  • Retrieval accuracy plummeting due to incorrectly configured chunk sizes
  • Costs escalating without clear criteria for embedding model selection
  • Retrieval latency caused by mismatched vector DB indexing strategies
  • Inability to determine improvement directions due to lack of quantitative retrieval quality measurement

This article covers concrete solutions for each problem with code and benchmark data. It reflects the latest benchmark results as of February 2026 and focuses on configuration values validated in production environments.

Chunking Strategy Comparison

Chunking is the process of splitting original documents into pieces of a size suitable for vector embedding. Depending on the chunking strategy, retrieval accuracy, embedding cost, and context quality vary significantly.

Fixed-Size Chunking

The simplest approach, where text is cut into uniform sizes based on a specified number of characters or tokens. It is easy to implement and predictable, but since it ignores sentence and paragraph boundaries, semantic breaks can occur.

from langchain.text_splitter import CharacterTextSplitter

# Fixed-size chunking - the most basic approach
splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=512,
    chunk_overlap=50,       # 10% overlap to maintain context
    length_function=len,
)

documents = splitter.split_text(raw_text)
print(f"Total chunks: {len(documents)}")
print(f"Average chunk length: {sum(len(d) for d in documents) / len(documents):.0f} chars")

Pros: Minimal implementation cost, fastest processing speed, predictable chunk count. Cons: Mid-sentence cuts occur, unable to preserve semantic units.

Recursive Character Splitting

In the February 2026 FloTorch benchmark, 512-token recursive splitting achieved 69% accuracy, ranking first. Recursive chunking attempts splitting in order of paragraph (\n\n) -> newline (\n) -> space ( ) -> character (""), maintaining semantic units as much as possible within the specified size.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Optimal settings based on 2026 benchmarks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,          # approximately 12% overlap
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
    is_separator_regex=False,
)

chunks = splitter.split_text(raw_text)

# Chunk quality verification
for i, chunk in enumerate(chunks[:3]):
    print(f"[Chunk {i}] length={len(chunk)} | start: {chunk[:80]}...")

Key configuration values: The validated recommended values as of early 2026 are chunk_size 400-512 and overlap 10-20%. When exceeding 2,500 tokens, a "context cliff" phenomenon is observed where response quality drops sharply.
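The splitter above measures length in characters (`length_function=len`) while the benchmark figures are stated in tokens. A quick sanity check against the 2,500-token cliff can use a characters-per-token heuristic; the 4:1 ratio below is a rough approximation for English prose, not a value from the benchmarks, so use a real tokenizer such as tiktoken when exact counts matter:

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose.
    # Use a tokenizer (e.g. tiktoken) for exact counts.
    return max(1, len(text) // 4)

def flag_oversized_chunks(chunks: list[str], max_tokens: int = 2500) -> list[int]:
    """Return indices of chunks whose estimated token count exceeds max_tokens."""
    return [i for i, c in enumerate(chunks) if approx_tokens(c) > max_tokens]

chunks = ["short chunk", "x" * 12000]  # second chunk estimates to ~3000 tokens
print(flag_oversized_chunks(chunks))   # -> [1]
```

Running a check like this after every splitter configuration change catches oversized chunks before they reach the embedding stage.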

Semantic Chunking

Uses an embedding model to calculate semantic similarity between sentences and splits at points where the meaning transitions. While theoretically the most sophisticated, it surprisingly recorded a low 54% accuracy in the 2026 benchmark. The cause was the average chunk size becoming too small at 43 tokens.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Semantic chunking - split based on semantic transition points
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
semantic_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",  # percentile, standard_deviation, interquartile
    breakpoint_threshold_amount=75,          # split at top 25% similarity differences
)

semantic_chunks = semantic_splitter.split_text(raw_text)
print(f"Semantic chunk count: {len(semantic_chunks)}")
print(f"Average length: {sum(len(c) for c in semantic_chunks) / len(semantic_chunks):.0f} chars")

Caution: Semantic chunking generates 3-5x more vectors than recursive splitting on the same corpus. For 10,000 documents, recursive splitting creates approximately 50,000 chunks, while semantic splitting can increase to 250,000.
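The storage impact of that vector-count explosion is easy to estimate. A minimal sketch, assuming float32 vectors at 1024 dimensions and excluding index overhead:

```python
def index_footprint(num_chunks: int, dims: int = 1024) -> float:
    """Approximate dense-vector storage in MB (float32; index overhead excluded)."""
    return num_chunks * dims * 4 / 1024 / 1024

# Illustrative figures from the 10,000-document example above
print(f"recursive: {index_footprint(50_000):.0f} MB")   # -> recursive: 195 MB
print(f"semantic:  {index_footprint(250_000):.0f} MB")  # -> semantic:  977 MB
```

HNSW graph links add further memory on top of the raw vectors, so the real gap is wider than the raw-storage estimate suggests.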

Document Structure-Based Chunking

Splits using the document's own structure such as Markdown headers, HTML tags, and PDF sections. It is effective for documents with clear hierarchical structures like technical documentation, API references, and legal documents. In the November 2025 MDPI Bioengineering study, adaptive chunking aligned with logical topic boundaries achieved 87% accuracy.

from langchain.text_splitter import MarkdownHeaderTextSplitter

# Markdown structure-based chunking
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False,
)

md_chunks = md_splitter.split_text(markdown_text)

# Each chunk includes header hierarchy info as metadata
for chunk in md_chunks[:3]:
    print(f"Metadata: {chunk.metadata}")
    print(f"Content: {chunk.page_content[:100]}...")
    print("---")

Chunking Strategy Comparison Table

| Strategy | Accuracy (Benchmark) | Chunk Size Predictability | Implementation Complexity | Embedding Cost | Suitable Documents |
| --- | --- | --- | --- | --- | --- |
| Fixed-Size | 60-65% | High | Low | Baseline | Unstructured text |
| Recursive Splitting | 69% | Medium | Low | Baseline | General purpose (recommended) |
| Semantic | 54% | Low | Medium | 3-5x | Documents with frequent topic shifts |
| Document Structure | 87% | Medium | Medium | 1-2x | Structured technical docs |
| Proposition-Based | 62% | Low | High | 5x+ | Research papers |

Practical recommendation: Start with RecursiveCharacterTextSplitter (400-512 tokens, 10-20% overlap), measure retrieval quality metrics, then decide whether to switch to structure-based or semantic approaches.

Embedding Model Selection

The embedding model directly determines the retrieval performance of a RAG pipeline. This section synthesizes the MTEB (Massive Text Embedding Benchmark) leaderboard and practical application results as of early 2026.

Model Comparison Based on MTEB Benchmark

| Model | MTEB Score | Dimensions | Max Tokens | Multilingual | License | Cost (1M tokens) |
| --- | --- | --- | --- | --- | --- | --- |
| Cohere embed-v4 | 65.2 | 1024 | 512 | Yes | API | $0.10 |
| OpenAI text-embedding-3-large | 64.6 | 3072 | 8191 | Yes | API | $0.13 |
| OpenAI text-embedding-3-small | 62.3 | 1536 | 8191 | Yes | API | $0.02 |
| BGE-M3 | 63.0 | 1024 | 8192 | 100+ | MIT | Self-hosted |
| Qwen3-Embedding-8B | 70.58 | 4096 | 8192 | Multilingual | Apache 2.0 | Self-hosted |
| E5-Mistral-7B | 63.5 | 4096 | 32768 | Yes | MIT | Self-hosted |

Selection criteria summary:

  • API-based rapid prototyping: OpenAI text-embedding-3-small (best performance-to-cost ratio)
  • Production API: Cohere embed-v4 or OpenAI text-embedding-3-large
  • Self-hosted multilingual: BGE-M3 (supports dense, sparse, and multi-vector simultaneously)
  • Best performance self-hosted: Qwen3-Embedding-8B (MTEB 70.58, requires GPU resources)

Embedding Generation Code

from openai import OpenAI
import numpy as np

client = OpenAI()

def generate_embeddings(
    texts: list[str],
    model: str = "text-embedding-3-large",
    dimensions: int = 1024,    # dimension reduction for cost/speed optimization
    batch_size: int = 100,
) -> np.ndarray:
    """Batch embedding generation with dimension reduction"""
    all_embeddings = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            input=batch,
            model=model,
            dimensions=dimensions,  # only supported by text-embedding-3 series
        )
        batch_embs = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embs)

    return np.array(all_embeddings, dtype=np.float32)


# Usage example
chunks = ["The core of a RAG pipeline is retrieval quality.", "Results vary depending on the chunking strategy."]
embeddings = generate_embeddings(chunks, dimensions=1024)
print(f"Embeddings shape: {embeddings.shape}")  # (2, 1024)

Dimension reduction tip: text-embedding-3-large defaults to 3072 dimensions, but you can reduce to 1024 or even 256 using the dimensions parameter. The MTEB score drop from 3072 to 1024 is within 1-2%, while gaining significant benefits in vector DB storage cost and search speed.
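The dimensions parameter works because the text-embedding-3 models use Matryoshka-style representations: truncating a vector to its leading components and L2-renormalizing approximates the server-side reduction. A sketch of the client-side equivalent, useful when full-size vectors are already stored (the random array below stands in for real embeddings):

```python
import numpy as np

def truncate_and_renormalize(emb: np.ndarray, dims: int) -> np.ndarray:
    """Truncate embeddings to the first `dims` components and L2-renormalize,
    approximating the API's `dimensions` parameter client-side."""
    cut = emb[..., :dims]
    return cut / np.linalg.norm(cut, axis=-1, keepdims=True)

full = np.random.rand(2, 3072).astype(np.float32)  # stand-in for stored embeddings
reduced = truncate_and_renormalize(full, 1024)
print(reduced.shape)  # -> (2, 1024)
```

Renormalization matters: cosine-similarity search assumes unit-length vectors, and plain truncation breaks that invariant.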

BGE-M3 Self-Hosted Embedding

from FlagEmbedding import BGEM3FlagModel

# BGE-M3: supports dense + sparse + colbert simultaneously
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

sentences = [
    "Chunking strategy is the core of retrieval quality in LLM RAG pipelines.",
    "Vector database indexing directly impacts retrieval latency.",
]

# Generate dense + sparse + colbert embeddings simultaneously
output = model.encode(
    sentences,
    batch_size=12,
    max_length=512,
    return_dense=True,
    return_sparse=True,
    return_colbert_vecs=True,
)

dense_embeddings = output["dense_vecs"]       # shape: (2, 1024)
sparse_embeddings = output["lexical_weights"]  # sparse vectors (BM25 replacement)
colbert_vecs = output["colbert_vecs"]          # multi-vector (precise matching)

print(f"Dense shape: {dense_embeddings.shape}")
print(f"Sparse keys example: {list(sparse_embeddings[0].keys())[:5]}")

The key advantage of BGE-M3 is that a single model supports dense, sparse, and multi-vector retrieval. By leveraging this, you can implement Hybrid Search without a separate BM25 index.

Vector DB Indexing Strategies

The choice and indexing strategy of the vector database that stores and retrieves embedded vectors directly impacts retrieval latency and accuracy.

Vector DB Comparison

| Feature | Chroma | Pinecone | Weaviate | Qdrant | Milvus |
| --- | --- | --- | --- | --- | --- |
| Hosting | Self/Cloud | Managed | Self/Cloud | Self/Cloud | Self/Cloud |
| p50 Latency (100K) | ~20ms | ~15ms | ~25ms | ~18ms | ~20ms |
| Max Vector Count | Millions | Billions | Hundreds of M | Billions | Billions |
| Metadata Filtering | Basic | Advanced | GraphQL | Advanced | Advanced |
| Hybrid Search | No | Yes | Yes | Yes | Yes |
| Free Tier | Unlimited Local | Limited | 14 days | 1GB Free | Open Source |
| Prototyping | Optimal | Good | Good | Good | Fair |
| Enterprise | Not suitable | Optimal | Good | Good | Good |

Vector Storage and Search with Chroma

import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

# Initialize Chroma client (persistent storage)
client = chromadb.PersistentClient(path="./chroma_db")

embedding_fn = OpenAIEmbeddingFunction(
    api_key="sk-...",
    model_name="text-embedding-3-large",
)

# Create collection (HNSW index automatically applied)
collection = client.get_or_create_collection(
    name="rag_knowledge_base",
    embedding_function=embedding_fn,
    metadata={
        "hnsw:space": "cosine",       # similarity metric
        "hnsw:M": 32,                 # HNSW connections (higher = more accurate, more memory)
        "hnsw:construction_ef": 200,  # search width during index construction (Chroma's key name)
    },
)

# Add documents (batch)
collection.add(
    documents=["Chunking determines 80% of retrieval quality in RAG.", "Embedding model selection determines the remaining 20%."],
    metadatas=[
        {"source": "rag_guide", "section": "chunking", "date": "2026-03"},
        {"source": "rag_guide", "section": "embedding", "date": "2026-03"},
    ],
    ids=["doc_001", "doc_002"],
)

# Search (metadata filter + similarity)
results = collection.query(
    query_texts=["What is the most important factor in a RAG pipeline?"],
    n_results=5,
    where={"source": "rag_guide"},
    include=["documents", "distances", "metadatas"],
)

for doc, dist, meta in zip(
    results["documents"][0], results["distances"][0], results["metadatas"][0]
):
    print(f"[Distance: {dist:.4f}] {meta['section']} | {doc[:80]}")

HNSW Index Parameter Tuning

There are three key parameters for the HNSW (Hierarchical Navigable Small World) index used by most vector DBs.

| Parameter | Description | Default | Production Recommended | Impact |
| --- | --- | --- | --- | --- |
| M | Connections per node | 16 | 32-48 | Higher = more accuracy, more memory usage |
| ef_construction | Search width during indexing | 100 | 200-400 | Higher = better index quality, longer build time |
| ef_search | Search width during query | 50 | 100-200 | Higher = better recall, higher search latency |

Practical tip: At 1 million vectors, increasing M from 32 to 48 improves Recall@10 by about 2-3%, but memory usage increases by 40%. If memory is constrained, increasing ef_search is more cost-effective.

Retrieval Quality Metrics: MRR, NDCG, Recall@K

Without quantitatively measuring the retrieval quality of a RAG pipeline, you cannot determine the direction for improvement. Here are the three key metrics with code.

Metric Definitions

  • MRR (Mean Reciprocal Rank): The average of the reciprocal rank of the first relevant document. It measures "how quickly the correct answer appears."
  • NDCG@K (Normalized Discounted Cumulative Gain): Evaluates the relevance of the top K results with rank-weighted scoring. Higher ranks receive higher weights.
  • Recall@K: The proportion of all relevant documents included in the top K results. It measures "how many relevant documents were found."

Evaluation Code Implementation

import numpy as np
from typing import List, Set


def mean_reciprocal_rank(
    retrieved_ids: List[List[str]],
    relevant_ids: List[Set[str]],
) -> float:
    """MRR: average reciprocal rank of the first relevant document per query"""
    mrr_scores = []
    for retrieved, relevant in zip(retrieved_ids, relevant_ids):
        for rank, doc_id in enumerate(retrieved, 1):
            if doc_id in relevant:
                mrr_scores.append(1.0 / rank)
                break
        else:
            mrr_scores.append(0.0)
    return np.mean(mrr_scores)


def recall_at_k(
    retrieved_ids: List[List[str]],
    relevant_ids: List[Set[str]],
    k: int = 10,
) -> float:
    """Recall@K: proportion of relevant documents in top K results"""
    recalls = []
    for retrieved, relevant in zip(retrieved_ids, relevant_ids):
        top_k = set(retrieved[:k])
        if len(relevant) == 0:
            continue
        recalls.append(len(top_k & relevant) / len(relevant))
    return np.mean(recalls)


def ndcg_at_k(
    retrieved_ids: List[List[str]],
    relevant_ids: List[Set[str]],
    k: int = 10,
) -> float:
    """NDCG@K: rank-weighted relevance evaluation"""
    ndcg_scores = []
    for retrieved, relevant in zip(retrieved_ids, relevant_ids):
        # DCG calculation
        dcg = 0.0
        for rank, doc_id in enumerate(retrieved[:k], 1):
            if doc_id in relevant:
                dcg += 1.0 / np.log2(rank + 1)

        # Ideal DCG calculation
        ideal_hits = min(len(relevant), k)
        idcg = sum(1.0 / np.log2(r + 1) for r in range(1, ideal_hits + 1))

        ndcg_scores.append(dcg / idcg if idcg > 0 else 0.0)
    return np.mean(ndcg_scores)


# Usage example
retrieved = [["d1", "d3", "d5", "d2", "d4"]]
relevant = [{"d1", "d2", "d7"}]

print(f"MRR:       {mean_reciprocal_rank(retrieved, relevant):.4f}")
print(f"Recall@3:  {recall_at_k(retrieved, relevant, k=3):.4f}")
print(f"Recall@5:  {recall_at_k(retrieved, relevant, k=5):.4f}")
print(f"NDCG@5:    {ndcg_at_k(retrieved, relevant, k=5):.4f}")

Metric Interpretation Guidelines

| Metric | Poor | Fair | Good | Target |
| --- | --- | --- | --- | --- |
| MRR | under 0.3 | 0.3-0.5 | 0.5-0.8 | over 0.7 |
| NDCG@10 | under 0.4 | 0.4-0.6 | 0.6-0.8 | over 0.7 |
| Recall@10 | under 0.5 | 0.5-0.7 | 0.7-0.9 | over 0.8 |

If MRR is low but Recall@K is high, it means relevant documents are being found but ranked too low. In this case, introducing reranking is highly effective.

Hybrid Search Implementation

Pure vector search (Dense Retrieval) alone has limitations when exact keyword matching is needed (proper nouns, code names, product numbers, etc.). Hybrid Search combines vector search with keyword search (BM25/Sparse) to leverage the strengths of both approaches.

Hybrid Search with Qdrant

from qdrant_client import QdrantClient, models
from qdrant_client.models import Distance, VectorParams, SparseVectorParams

client = QdrantClient(host="localhost", port=6333)

# Create collection storing Dense + Sparse vectors simultaneously
client.create_collection(
    collection_name="hybrid_rag",
    vectors_config={
        "dense": VectorParams(size=1024, distance=Distance.COSINE),
    },
    sparse_vectors_config={
        "sparse": SparseVectorParams(),
    },
)

# Index documents (store dense + sparse vectors simultaneously)
client.upsert(
    collection_name="hybrid_rag",
    points=[
        models.PointStruct(
            id=1,
            vector={
                "dense": dense_embedding.tolist(),
                "sparse": models.SparseVector(
                    indices=[int(k) for k in sparse_weights.keys()],  # BGE-M3 lexical_weights keys are token-id strings
                    values=list(sparse_weights.values()),
                ),
            },
            payload={"text": "RAG pipeline chunking guide", "source": "blog"},
        ),
    ],
)

# Execute Hybrid Search (RRF-based score fusion)
results = client.query_points(
    collection_name="hybrid_rag",
    prefetch=[
        models.Prefetch(
            query=dense_query_vector.tolist(),
            using="dense",
            limit=20,
        ),
        models.Prefetch(
            query=models.SparseVector(
                indices=[int(k) for k in sparse_query.keys()],  # token-id strings -> ints
                values=list(sparse_query.values()),
            ),
            using="sparse",
            limit=20,
        ),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),  # Reciprocal Rank Fusion
    limit=10,
)

for point in results.points:
    print(f"[Score: {point.score:.4f}] {point.payload['text']}")

Dense vs. Sparse vs. Hybrid Performance Comparison

| Search Method | Keyword Matching | Semantic Similarity | Proper Nouns/Code | General Questions | Recommended Use Case |
| --- | --- | --- | --- | --- | --- |
| Dense Only | Weak | Strong | Weak | Strong | Natural language Q&A |
| Sparse Only (BM25) | Strong | Weak | Strong | Weak | Keyword search |
| Hybrid (RRF) | Strong | Strong | Strong | Strong | Production RAG (recommended) |

In Hybrid Search, the weight ratio between Dense and Sparse needs to be tuned per domain. Empirically, a higher Sparse weight (around 0.6) works well for technical documentation, while a higher Dense weight (around 0.7) works better for general conversational Q&A.
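RRF fuses by rank only; to apply domain-specific Dense/Sparse weights, scores can be fused directly. A minimal sketch with min-max normalization per retriever so dense cosine scores and sparse BM25-style scores become comparable (the document ids, scores, and weights below are illustrative):

```python
def weighted_fusion(dense_scores: dict[str, float],
                    sparse_scores: dict[str, float],
                    dense_weight: float = 0.7) -> list[tuple[str, float]]:
    """Fuse dense and sparse scores with a convex combination.
    Scores are min-max normalized per retriever before mixing."""
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {k: (v - lo) / span for k, v in scores.items()}

    d, s = normalize(dense_scores), normalize(sparse_scores)
    fused = {doc: dense_weight * d.get(doc, 0.0) + (1 - dense_weight) * s.get(doc, 0.0)
             for doc in set(d) | set(s)}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

ranked = weighted_fusion({"a": 0.9, "b": 0.4}, {"b": 12.0, "c": 8.0}, dense_weight=0.7)
print(ranked[0][0])  # -> a
```

Min-max normalization is the simplest choice; rank-based or z-score normalization are common alternatives when score distributions are skewed.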

Reranking

Reranking is the process of re-evaluating initial search results with a Cross-Encoder model to readjust rankings. According to Databricks research, applying reranking improves retrieval quality by up to 48%, with typical NDCG@10 improvements of 20-35%.

Reranking Architecture

  1. Stage 1 - Candidate Retrieval: Quickly extract the top 50-100 documents using vector search (or Hybrid Search).
  2. Stage 2 - Reranking: The Cross-Encoder directly compares query-document pairs to produce precise relevance scores.
  3. Stage 3 - Final Selection: Pass the top 5-10 documents based on reranking scores to the LLM context.
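The three stages can be sketched as follows. To keep the example self-contained, the Cross-Encoder is replaced by a stub word-overlap scorer; in practice you would load a model such as sentence_transformers.CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2") and use its predict method on (query, document) pairs:

```python
from typing import Callable

def rerank_pipeline(query: str,
                    candidates: list[str],
                    score_fn: Callable[[str, str], float],
                    top_k: int = 5) -> list[str]:
    """Stages 2-3: score each (query, doc) pair, return the top_k docs.
    `candidates` is the Stage-1 output (top 50-100 from vector search)."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_k]

# Stub scorer standing in for a Cross-Encoder: counts shared words.
def stub_score(query: str, doc: str) -> float:
    return len(set(query.lower().split()) & set(doc.lower().split()))

candidates = ["chunking determines retrieval quality",
              "vector databases store embeddings",
              "retrieval quality depends on chunking strategy"]
print(rerank_pipeline("what determines retrieval quality", candidates, stub_score, top_k=2))
```

Because `score_fn` is pluggable, swapping the stub for a real Cross-Encoder changes one line without touching the pipeline structure.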

Reranking Model Comparison

| Model | NDCG@10 Improvement | Latency (50 docs) | Cost | Recommended For |
| --- | --- | --- | --- | --- |
| Cohere Rerank v3 | +30-35% | ~300ms | API-based | Production |
| cross-encoder/ms-marco-MiniLM-L-6-v2 | +20-25% | ~150ms | Free | Cost-sensitive |
| BGE-Reranker-v2-m3 | +25-30% | ~200ms | Free | Multilingual |
| Jina Reranker v2 | +28-32% | ~250ms | API/Self | Balanced |

Key trade-off: Cross-Encoder reranking improves accuracy by 20-35% but adds 200-500ms latency per query. In real-time chat applications, limit reranking candidates to 20-30 to keep latency under 150ms.

Troubleshooting

Here are frequently encountered problems and solutions in production RAG pipelines.

Problem 1: Search Results Return Irrelevant Documents

Root cause analysis: In most cases, chunk size is too large (over 2,500 tokens) or insufficient overlap causes semantic units to break.

Solution:

  • Reduce chunk size to 400-512 and set overlap to 10-20%.
  • Prepend the original document's title or section header to the beginning of each chunk before embedding.
  • Add metadata filtering to narrow the search scope.
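The header-prepending step above is a one-liner; a minimal sketch (the separator format is illustrative):

```python
def contextualize_chunk(chunk: str, doc_title: str, section: str) -> str:
    """Prepend document title and section header so the embedding carries
    document-level context the raw chunk text lacks."""
    return f"{doc_title} > {section}\n\n{chunk}"

text = contextualize_chunk("Set chunk_size to 512.", "RAG Guide", "Chunking")
print(text.splitlines()[0])  # -> RAG Guide > Chunking
```

Embed the contextualized text, but store the original chunk in the payload so the LLM context is not cluttered with repeated headers.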

Problem 2: Relevant Documents Found but Ranked Low (Low MRR, High Recall)

Root cause analysis: When using only Dense search, documents that are semantically related but not direct answers rank higher.

Solution:

  • Introduce Cross-Encoder reranking. In most cases, MRR increases by 0.2-0.3.
  • Add a domain prefix to queries. Example: embed in the format "Question: {query}".
  • Apply Hybrid Search to reinforce keyword matching signals.

Problem 3: Embedding Costs Exceed Budget

Root cause analysis: Too many vectors generated from semantic chunking, or using high-dimensional embeddings.

Solution:

  • Use the dimensions parameter of text-embedding-3-large to reduce from 3072 to 1024 dimensions. The MTEB score drop is within 1-2%.
  • Switching from semantic chunking to recursive splitting reduces vector count by 3-5x.
  • Separate infrequently accessed old documents into cold storage.

Problem 4: Vector Search Latency Exceeds SLA

Root cause analysis: Untuned HNSW index parameters, insufficient memory due to vector count growth, disk-based search occurring.

Solution:

  • Incrementally adjust ef_search values (50 -> 100 -> 200). Measure the Recall vs. Latency trade-off.
  • Quantize vectors (Scalar/Product Quantization) to reduce memory usage by 50-75%.
  • Shard collections by date to reduce the number of vectors searched.
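The incremental ef_search sweep in the first bullet can be structured as a small harness. Here search_fn is a placeholder for your vector DB's query call with an ef_search override (the parameter name varies by DB); the toy stand-in only illustrates the measurement loop:

```python
import time
from typing import Callable

def sweep_ef_search(search_fn: Callable[[int], list[str]],
                    relevant: set[str],
                    ef_values: tuple[int, ...] = (50, 100, 200)) -> list[dict]:
    """Measure Recall@10 and latency for each ef_search candidate."""
    rows = []
    for ef in ef_values:
        start = time.perf_counter()
        retrieved = search_fn(ef)[:10]
        latency_ms = (time.perf_counter() - start) * 1000
        recall = len(set(retrieved) & relevant) / len(relevant)
        rows.append({"ef_search": ef, "recall@10": recall, "latency_ms": latency_ms})
    return rows

# Toy stand-in: higher ef "finds" more of the relevant ids.
relevant = {"d1", "d2", "d3", "d4"}
fake_search = lambda ef: ["d1", "d2"] if ef < 100 else ["d1", "d2", "d3", "d4"]
for row in sweep_ef_search(fake_search, relevant):
    print(row["ef_search"], row["recall@10"])
```

Run the sweep against a held-out query set with known relevant ids, then pick the smallest ef_search that meets your recall target within the latency SLA.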

Problem 5: Cross-Language Search Failure in Multilingual Documents

Root cause analysis: When using English-centric embedding models, embedding quality degrades for non-English queries such as Korean or Japanese.

Solution:

  • Switch to BGE-M3 (supports over 100 languages) or Cohere embed-v4 (multilingual optimized).
  • When the query language differs from the document language, add a pipeline that translates the query to the document language before searching.

Operations Checklist

Here are items that must be verified before deploying a production RAG pipeline.

Chunking Configuration

  • Is the chunk size set to 400-512 tokens?
  • Is the overlap set to 10-20%?
  • Have you verified that no chunks exceed 2,500 tokens?
  • Have you separated chunking strategies by document type (PDF, Markdown, code, etc.)?
  • Is there logic to filter empty and duplicate chunks?

Embedding

  • Have you compared MTEB scores and costs of embedding models?
  • Have you tested whether dimension reduction is applicable (3072 -> 1024)?
  • Is rate limit handling implemented for batch embedding processing?
  • Is the full re-indexing procedure documented for embedding model version changes?

Vector DB

  • Have you tuned the HNSW index parameters (M, ef_construction, ef_search)?
  • Is there a memory scaling plan for growing vector counts?
  • Have you tested the backup/recovery procedures?
  • Have you appropriately configured metadata filtering indexes?

Retrieval Quality

  • Have you built an evaluation dataset (query-answer pairs) of at least 50 items?
  • Have you set target values for MRR, NDCG@10, and Recall@10?
  • Is an A/B testing pipeline built?
  • Is there a system to collect and analyze retrieval failure logs?

Monitoring

  • Are you monitoring per-query retrieval latency at p50/p95/p99?
  • Are you tracking embedding API call failure rates?
  • Are alerts configured for vector DB disk/memory usage?
  • Is there a periodic batch job that automatically evaluates retrieval quality metrics?

Failure Cases

Case 1: The Semantic Chunking Trap

A company processed all documents with semantic chunking under the assumption that "more sophisticated chunking must be better." The results were:

  • Vector count increased 4.2x compared to recursive splitting
  • Monthly Pinecone cost rose from 800to800 to 3,400
  • Average chunk size shrank to 38 tokens, causing insufficient context, and retrieval accuracy actually dropped by 12%

Lesson: Chunking strategies must be selected based on benchmarks. The assumption "more sophisticated method = better results" is repeatedly disproven in 2026 benchmarks.

Case 2: Missing Re-Indexing During Embedding Model Replacement

This case involved upgrading from text-embedding-ada-002 to text-embedding-3-large without re-indexing existing vectors. Vectors from different embedding spaces became mixed, causing search results to become nearly random.

Lesson: When changing embedding models, all vectors must be regenerated. For zero-downtime migration, re-index into a new collection, verify, then switch traffic using a Blue-Green deployment strategy.

Case 3: Default HNSW Parameters at Scale

When vectors exceeded 1 million, search latency surpassed 500ms, but the default ef_search value (10) was still being used. Raising ef_search to 100 increased Recall@10 from 72% to 91% while latency remained at around 80ms.

Lesson: HNSW parameter tuning must be readjusted based on data scale. Re-evaluate ef_search and ef_construction every time vector count increases by 10x.
