RAG Chunking Strategies: From Naive Splitting to RAPTOR

Why Chunking Determines 70% of RAG Quality

"Garbage in, garbage out."

When you first build a RAG system, you feel this statement viscerally: with bad chunking, no embedding model, no matter how good, and no LLM, no matter how expensive, can save you.

Chunking is the process of splitting long documents into smaller units that can be searched. It sounds simple, but every decision you make here has an outsized impact on the final RAG performance.

Why does chunking matter so much?

Information density of embeddings: Embedding models represent an entire chunk as a single vector. If the chunk is too large, the core meaning gets diluted. If it's too small, context disappears.

Retrieval vs. generation trade-off: Small, focused chunks retrieve more precisely, but the generation step needs enough surrounding context. The art of chunking is satisfying both at once.

Document structure destruction: Splitting purely by character count cuts through sentences and lists, completely breaking meaning.
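
To see the boundary problem concretely, here is a toy illustration in plain Python (the text is made up):

```python
# Naive fixed-size splitting every 40 characters cuts "Refunds" in half
text = "Returns are accepted within 30 days. Refunds take 5 business days."
naive = [text[i:i + 40] for i in range(0, len(text), 40)]
print(naive)
```

The first chunk ends mid-word ("…Ref") and the second begins with the orphaned "unds take 5 business days." — neither chunk embeds cleanly.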

This post walks through five chunking strategies with real code and gives you clear criteria for when to use each one.

Strategy 1: Fixed-Size Chunking (The Simplest)

The most basic approach: split at a fixed token/character count with overlap to mitigate boundary cut issues.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,        # measured in characters by default; use .from_tiktoken_encoder() to count tokens
    chunk_overlap=50,      # overlapping characters at chunk boundaries
    separators=["\n\n", "\n", ".", " ", ""]
    # Priority order: paragraph > line > sentence > word > character
)

chunks = splitter.split_text(document)
print(f"Generated {len(chunks)} chunks")
print(f"First chunk preview: {chunks[0][:100]}...")

RecursiveCharacterTextSplitter tries each separator in order. It first tries to split on \n\n (paragraph breaks), then \n if chunks are still too large, then . and so on down the list.
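
The cascade is easy to misread, so here is a minimal pure-Python sketch of the idea. This is an illustration, not LangChain's actual implementation — the real splitter also merges adjacent small pieces back together and applies overlap:

```python
def recursive_split(text, separators, chunk_size):
    """Split on the highest-priority separator first; recurse into any
    oversized piece using the remaining (finer) separators."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    parts = [p for p in (text.split(sep) if sep else list(text)) if p]
    chunks = []
    for part in parts:
        if len(part) <= chunk_size:
            chunks.append(part)
        else:
            chunks.extend(recursive_split(part, rest, chunk_size))
    return chunks

doc = "First paragraph about returns.\n\n" + "Second paragraph with details. " * 10
chunks = recursive_split(doc, ["\n\n", "\n", ". ", " ", ""], chunk_size=100)
print(len(chunks), max(len(c) for c in chunks))
```

The short first paragraph survives intact, while the long second paragraph falls through to the sentence-level separator.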

Pros: Simple to implement, fast, predictable chunk sizes.

Cons: Ignores semantic boundaries. Can cut mid-sentence. Overlap helps but doesn't fully solve the problem.

When to use: Quick prototypes, documents with regular structure, when cost matters most.

Strategy 2: Semantic Chunking

Uses embedding similarity to find split points where the meaning shifts. Sentence boundaries where embedding distance spikes become chunk boundaries.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # or "standard_deviation", "interquartile"
    breakpoint_threshold_amount=95           # split at top 5% distance changes
)

# Internal logic:
# 1. Embed each sentence
# 2. Calculate cosine distance between adjacent sentences
# 3. Split where distance exceeds the threshold

chunks = splitter.create_documents([long_document])

Pros: Chunks are semantically coherent — each chunk is "about one thing." Retrieval precision improves noticeably.

Cons: Incurs embedding API costs. Slower. Overkill for short documents.

When to use: Long documents that cover multiple topics, documents with unclear section boundaries, production systems where search accuracy is critical.
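
The internal logic sketched in the comments above fits in a few lines of plain Python. This toy version uses hand-made two-dimensional "embeddings" in place of a real model, so it is an illustration of the mechanism only:

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1 - dot / norm

def semantic_split(sentences, embeddings, threshold=0.5):
    """Start a new chunk wherever the adjacent-sentence distance spikes."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

sentences = ["Cats purr.", "Cats meow.", "GDP rose 3%."]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy vectors: the last sentence is distant
print(semantic_split(sentences, embeddings))
# → ['Cats purr. Cats meow.', 'GDP rose 3%.']
```

The two cat sentences stay together; the topic shift to economics becomes a chunk boundary.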

Real-World Performance Difference

From experience comparing fixed-size vs. semantic chunking on the same document corpus: semantic chunking improves retrieval precision by roughly 15-20%. The improvement is most noticeable on nuanced queries like "A but not B."

Strategy 3: Document-Structure-Aware Chunking

Splits while preserving the document's structure (headings, lists, code blocks, etc.).

from langchain.text_splitter import MarkdownHeaderTextSplitter

# Split by Markdown headings
headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]

md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False  # keep heading text in chunks
)

md_header_splits = md_splitter.split_text(markdown_document)

# Each chunk now has heading metadata:
# md_header_splits[0].metadata = {'h1': 'Product Overview', 'h2': 'Key Features'}
# This metadata enables filtered retrieval!

# If any heading-based chunk is still too large, split further
from langchain.text_splitter import RecursiveCharacterTextSplitter

secondary_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512, chunk_overlap=50
)

final_chunks = secondary_splitter.split_documents(md_header_splits)

Key insight: Preserving document structure as metadata enables filtered search. You can search "only within the Installation section" instead of across everything. This is the critical difference from naive fixed-size chunking.
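
In spirit, filtered retrieval is just a metadata predicate applied before (or during) similarity scoring. A toy sketch with hypothetical chunks (`mytool` is made up; real vector stores expose this as a `filter` argument on their search methods):

```python
# Toy chunks whose shape mirrors MarkdownHeaderTextSplitter output
chunks = [
    {"text": "Run pip install mytool to get started.", "metadata": {"h1": "Installation"}},
    {"text": "Call mytool run --config app.yaml.", "metadata": {"h1": "Usage"}},
]

# Restrict the candidate set to the Installation section before any scoring
installation_chunks = [c for c in chunks if c["metadata"].get("h1") == "Installation"]
print(installation_chunks)
```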

When to use: Markdown documentation, technical docs, well-structured HTML pages, any use case that needs section-based filtering.

Strategy 4: Hierarchical (Parent-Child) Chunking

This is my personal favorite strategy. It captures both precise retrieval and rich context generation.

Core idea: Use small chunks for searching, but return the parent (larger) chunk to the LLM for generation.

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Parent chunks: large context windows (sent to LLM)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

# Child chunks: small, focused units (used for embedding and search)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

# Document store: holds the parent documents
docstore = InMemoryStore()  # use Redis or a DB in production

# Vector store: holds child chunk embeddings
# (FAISS.from_documents([]) fails on an empty list -- build an empty index explicitly)
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore

index = faiss.IndexFlatL2(len(embedding_model.embed_query("hello")))
vectorstore = FAISS(
    embedding_function=embedding_model,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Adding documents automatically creates the parent-child hierarchy
retriever.add_documents(documents)

# At query time:
# 1. Find child chunks similar to the query (small, precise search)
# 2. Look up the parent ID for each matching child
# 3. Return the parent chunk (rich context for the LLM)
results = retriever.invoke("What is the return policy?")  # get_relevant_documents() is deprecated

Why this works so well:

During retrieval, 200-token child chunks are used, so the embeddings are focused and specific. A query like "What is the release date of Product A?" finds exactly the small chunk containing that information.

During generation, the parent chunk (2000 tokens) is passed to the LLM. The LLM has ample context around the answer to reason correctly.

Production tip: InMemoryStore is only for prototypes. In production, use Redis or PostgreSQL as the docstore so data persists across server restarts.
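
Stripped of LangChain, the mechanism is just two stores plus an ID mapping. Here is a toy sketch where word overlap stands in for vector similarity (not the real retriever, just the shape of it):

```python
def chunk(text, size):
    """Fixed-size character chunks (stand-in for a real splitter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_stores(documents, parent_size=200, child_size=50):
    docstore = {}   # parent_id -> parent text (Redis/Postgres in production)
    children = []   # (child_text, parent_id) pairs (vector store in production)
    pid = 0
    for doc in documents:
        for parent in chunk(doc, parent_size):
            docstore[pid] = parent
            children.extend((child, pid) for child in chunk(parent, child_size))
            pid += 1
    return docstore, children

def retrieve(query, docstore, children):
    """Find the best-matching child, then return its PARENT."""
    def score(text):
        return len(set(query.lower().split()) & set(text.lower().split()))
    best_child, pid = max(children, key=lambda c: score(c[0]))
    return docstore[pid]

doc = ("Our return policy: returns accepted within 30 days. "
       + "Shipping costs five dollars for small orders. " * 4)
docstore, children = build_stores([doc])
print(retrieve("what is the return policy", docstore, children))
```

The 50-character child pinpoints the match; the caller receives the full 200-character parent.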

Strategy 5: RAPTOR (Recursive Abstractive Processing)

RAPTOR stands for Recursive Abstractive Processing for Tree-Organized Retrieval, introduced by Stanford researchers in 2024. It automatically builds a tree hierarchy over your documents.

Core idea: Cluster similar chunks together, summarize each cluster into a parent node, and repeat recursively.

Leaf Nodes (original chunks):
[Product A Feature 1] [Product A Feature 2] [Product B Feature 1] [Product B Comparison]

Level 1 (cluster + summarize):
[Product A Summary: Feature 1, Feature 2 combined] [Product B Summary: Feature 1, comparison]

Level 2 (higher abstraction):
[Full Product Line Overview]

# RAPTOR implementation (conceptual code)
from sklearn.mixture import GaussianMixture
import numpy as np

def build_raptor_tree(chunks, embeddings, levels=3):
    """
    Recursively build a RAPTOR tree.
    At each level: cluster chunks, then summarize each cluster.
    """
    tree = {'level_0': chunks}

    current_chunks = chunks
    current_embeddings = embeddings

    for level in range(1, levels + 1):
        if len(current_chunks) <= 1:
            break  # nothing left to cluster

        # 1. Cluster with a Gaussian mixture model
        #    (k is a simple heuristic here; the paper selects it via BIC)
        n_clusters = max(1, len(current_chunks) // 5)
        gm = GaussianMixture(n_components=n_clusters, random_state=42)
        gm.fit(current_embeddings)
        labels = gm.predict(current_embeddings)

        # 2. Combine chunks in each cluster and summarize
        summaries = []
        for cluster_id in range(n_clusters):
            cluster_chunks = [
                current_chunks[i]
                for i in range(len(current_chunks))
                if labels[i] == cluster_id
            ]
            combined_text = "\n\n".join(cluster_chunks)
            summary = llm.summarize(combined_text)  # LLM call
            summaries.append(summary)

        tree[f'level_{level}'] = summaries

        # Use current level summaries as input for the next level
        current_chunks = summaries
        current_embeddings = embed(summaries)

    return tree
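
At query time, RAPTOR's "collapsed tree" variant searches every level at once, so leaf details and high-level summaries compete on equal footing. A toy sketch, with word overlap again standing in for embedding similarity:

```python
def overlap(a, b):
    return len(set(a.lower().split()) & set(b.lower().split()))

def collapsed_tree_search(tree, query, top_k=1):
    """Pool the nodes from every level of the tree and rank them together."""
    pool = [node for nodes in tree.values() for node in nodes]
    return sorted(pool, key=lambda n: overlap(n, query), reverse=True)[:top_k]

tree = {
    "level_0": ["Product A costs $10.", "Product B costs $20."],
    "level_1": ["Summary: the product line spans budget and premium tiers."],
}

# Specific query -> a leaf node wins
print(collapsed_tree_search(tree, "What does Product A cost?"))
# Abstract query -> the summary node wins
print(collapsed_tree_search(tree, "Give an overview of the product line"))
```

The same index answers both the factual and the abstract question, which is exactly the property the pros below describe.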

Pros:

  • Handles both specific ("What is the price of Product A?") and abstract ("What product categories exist?") questions well
  • Hierarchical retrieval — search at the right abstraction level for each query

Cons:

  • Slow to build (clustering + LLM summarization, repeated)
  • Expensive (many LLM calls for summarization)
  • Complex to implement correctly

When to use: Very large document corpora, expected queries at various abstraction levels, enterprise production where quality outweighs cost.

Chunk Size Selection Guide

Chunk Size      | Best For                               | Main Risk
128-256 tokens  | Precise factual lookups, short answers | Insufficient context causes LLM errors
512-1024 tokens | General QA, explanatory queries        | Increasing noise, diluted relevance
2048+ tokens    | Complex reasoning, analytical queries  | Degraded embedding quality, key info diluted

Rule of thumb: 512 tokens works well for most cases. Start here unless you have a specific reason to deviate.

Measuring Chunking Quality

Don't rely on gut feeling when changing chunking strategies. Measure quantitatively.

from ragas.metrics import context_precision, context_recall
from ragas import evaluate
from datasets import Dataset

# Evaluation dataset: question + ground truth pairs
eval_questions = [
    "What is the return period?",
    "How much is shipping?",
    # ...
]
ground_truths = [
    "Returns accepted within 30 days",
    "Free shipping on orders over $50",
    # ...
]

def evaluate_chunking_strategy(strategy_name, retriever):
    results = []
    for q, gt in zip(eval_questions, ground_truths):
        retrieved_docs = retriever.invoke(q)  # get_relevant_documents() is deprecated
        contexts = [doc.page_content for doc in retrieved_docs]
        results.append({
            "question": q,
            "contexts": contexts,
            "ground_truth": gt
        })

    dataset = Dataset.from_list(results)
    scores = evaluate(dataset, metrics=[context_precision, context_recall])  # these metrics call a judge LLM

    print(f"\n=== {strategy_name} ===")
    print(f"Context Precision: {scores['context_precision']:.3f}")
    print(f"Context Recall:    {scores['context_recall']:.3f}")
    return scores

# Compare strategies head-to-head
evaluate_chunking_strategy("Fixed-size 512", fixed_retriever)
evaluate_chunking_strategy("Semantic", semantic_retriever)
evaluate_chunking_strategy("Parent-Child", parent_child_retriever)

Strategy Selection Summary

Situation                             | Recommended Strategy
Quick prototype                       | Fixed-size (512 tokens)
Structured Markdown/HTML docs         | Document-Structure-Aware
Long docs covering many topics        | Semantic Chunking
Need precise retrieval + rich context | Parent-Child
Large corpus with diverse query types | RAPTOR
High budget, maximum quality          | RAPTOR + Parent-Child hybrid

Final Thoughts

Chunking is not "set it and forget it." As your service grows and document types diversify, your chunking strategy should evolve with it. Build the habit of periodically creating evaluation datasets, measuring scores, and iterating.

The next post covers Hybrid Search (BM25 + vector search) — the other half of the retrieval equation. Even perfect chunking can't save you if the search itself is weak.