
1 Million Token Context Windows: Is RAG Becoming Obsolete?

The Question Behind the Hype

Between 2024 and 2025, LLM context windows exploded in size:

  • Gemini 1.5 Pro: 1 million tokens
  • Claude 3.5/3.7: 200K tokens
  • GPT-4o: 128K tokens

A natural question followed: "If I can fit 10 books into the context window, why build a RAG pipeline at all? Why not just stuff everything in?"

It's a reasonable question. RAG is genuinely complex to build — chunking strategies, embedding model selection, vector DB operations, retrieval quality tuning. If all that could disappear, wouldn't that be wonderful?

But anyone who's shipped production LLM systems knows it's not that simple. Let's look at the data.


Long Context vs RAG: The Real Comparison

1. Cost — Numbers Don't Lie

# Scenario: 10,000-document knowledge base, each 500 tokens
# 1,000 user queries per day

total_docs_tokens = 10_000 * 500  # 5,000,000 tokens

# Option A: Long Context (stuff everything in)
# GPT-4o input price: $2.50 / 1M tokens
cost_long_context_per_query = total_docs_tokens * (2.50 / 1_000_000)
daily_cost_long_context = cost_long_context_per_query * 1_000
print(f"Long context daily cost: ${daily_cost_long_context:.2f}")
# Output: Long context daily cost: $12500.00 (!!)

# Option B: RAG (top-5 relevant chunks, ~1,000 tokens)
rag_context_tokens = 1_000
cost_rag_per_query = rag_context_tokens * (2.50 / 1_000_000)
daily_cost_rag = cost_rag_per_query * 1_000
print(f"RAG daily cost: ${daily_cost_rag:.2f}")
# Output: RAG daily cost: $2.50

ratio = daily_cost_long_context / daily_cost_rag
print(f"RAG is {ratio:.0f}x cheaper")
# Output: RAG is 5000x cheaper

That's $12,500 vs $2.50 per day. Monthly, that's $375,000 vs $75. For most startups, the long context approach would kill the business before it has a chance to grow.

This is an extreme example, but the principle holds: the larger your knowledge base and the more queries you handle, the worse the economics of long context become.

2. Latency

Approximate production measurements:

  • 200K token context: 5-10 seconds to first token
  • RAG + 2K context: 0.3-0.8 seconds (retrieval included)

That's a dramatically different user experience. Streaming output helps but doesn't fully solve this — TTFT (time to first token) is also slower with long contexts.
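The gap follows directly from prefill cost: before the first token streams out, the model must process the entire input. A back-of-envelope sketch, assuming an illustrative prefill throughput of 20K tokens/s (real numbers vary widely by provider and model):

```python
# Illustrative assumption, not a measured number: real prefill throughput
# varies by provider, model, and hardware.
PREFILL_TOKENS_PER_SEC = 20_000

def estimated_ttft_seconds(context_tokens: int) -> float:
    """Rough time-to-first-token: prefill time dominates for long contexts."""
    return context_tokens / PREFILL_TOKENS_PER_SEC

print(f"200K context: ~{estimated_ttft_seconds(200_000):.1f}s")  # ~10.0s
print(f"2K context:   ~{estimated_ttft_seconds(2_000):.2f}s")    # ~0.10s
```

Even under generous throughput assumptions, a 100x difference in input size translates into a 100x difference in wait time before anything appears on screen.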

3. Quality — The "Lost in the Middle" Problem

This is the most important comparison. Many developers assume that feeding more context always produces better answers. Research shows the opposite.

Liu et al. (2023), "Lost in the Middle: How Language Models Use Long Contexts":

Performance by position in a long context:

Beginning:  ████████████ 85%  <- LLM pays attention here
Middle:     ████████     65%  <- Performance drops (Lost in the Middle)
End:        ████████████ 87%  <- LLM pays attention here

RAG places relevant chunks at the beginning -> avoids this problem

If critical information is buried in the middle of a 500K-token context, the LLM is likely to miss it. RAG naturally places retrieved chunks at the front of the context, sidestepping this issue entirely.
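And when you do feed many chunks at once, a common mitigation (similar in spirit to document-reordering transformers in RAG libraries) is to order them so the highest-ranked chunks land at the edges of the context and the weakest in the middle, mirroring the attention pattern above. A minimal sketch:

```python
def reorder_for_attention(chunks_ranked_best_first: list) -> list:
    """Interleave ranked chunks so the strongest sit at the beginning and
    end of the context, the weakest in the middle -- matching the
    'lost in the middle' attention curve. A mitigation sketch, not a
    specific library API."""
    front, back = [], []
    for i, chunk in enumerate(chunks_ranked_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

print(reorder_for_attention(["A", "B", "C", "D", "E"]))
# ['A', 'C', 'E', 'D', 'B'] -> the top two chunks ('A', 'B') sit at the edges
```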


When Long Context Beats RAG

Long context isn't always the wrong choice. These scenarios favor it:

1. Whole-document analysis

"Analyze character development across this entire novel" or "Review this 10,000-line codebase for security issues" — when you genuinely need to understand everything, long context is the right tool.

2. Complex cross-document reasoning

Finding subtle conflicts between two contracts, or tracing a causal chain across ten reports — chunk-based retrieval can miss the connections.

3. Unpredictable queries

If you have no idea what users will ask, putting everything in context is the safe bet.

4. Small knowledge bases

Fewer than ~50 short documents? Stuffing them all in may be simpler and more effective than building a full RAG pipeline.
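At that scale, the "pipeline" can be a single string join. A minimal sketch of the stuff-everything approach (prompt layout is illustrative):

```python
def stuff_all_prompt(query: str, docs: list[str]) -> str:
    """For a small corpus (tens of short documents), skip retrieval
    entirely and place every document in the prompt."""
    context = "\n\n---\n\n".join(docs)
    return f"Documents:\n{context}\n\nQuestion: {query}\n\nAnswer:"
```

No chunking, no embeddings, no vector DB to operate. If token counts stay comfortably inside the window and the cost per query is tolerable, this is hard to beat for simplicity.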


When RAG Beats Long Context

For most production systems, RAG wins in these scenarios:

1. Large knowledge bases

Thousands or millions of documents make long context physically impossible or economically ruinous.

2. Cost-sensitive services

B2C apps, free tiers, high-volume internal tools — per-query cost is existential.

3. Latency requirements

Customer service bots, real-time help systems — a 10-second response is unacceptable.

4. Frequently updated knowledge

Add a new document to RAG and it's immediately searchable. Long context requires rebuilding the prompt.

5. Source attribution

"What document does this answer come from?" RAG can show the exact retrieved chunks. Long context can't.


The Best Answer: Long Context RAG

The highest-performing pattern in practice combines both approaches:

# Long Context RAG pattern
# Step 1: RAG retrieves many candidate chunks (prioritize recall)
# Step 2: Feed all candidates into a longer context for careful reading

def long_context_rag_query(query: str, vectordb, llm):
    # Standard RAG: top-5 chunks
    # Long Context RAG: top-20 chunks
    retrieved_chunks = vectordb.similarity_search(query, k=20)

    # Approximately 20K-40K token context
    context = "\n\n---\n\n".join([chunk.page_content for chunk in retrieved_chunks])

    prompt = f"""Answer the following question based on the provided documents.

Documents:
{context}

Question: {query}

Answer:"""

    return llm.invoke(prompt)

Benefits:

  • Higher recall from the retrieval stage (top-20 misses less)
  • Far cheaper than stuffing the full knowledge base
  • At 20K tokens, "Lost in the Middle" is much less severe than at 200K+

Reranking: The Quality Multiplier

Add a reranker to significantly improve retrieval quality without extra LLM calls:

from sentence_transformers import CrossEncoder

# Cross-encoder reranker (Cohere, Jina, BGE-reranker, etc.)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rag_with_reranking(query: str, vectordb, llm, top_k=20, rerank_top=5):
    # 1. Broad retrieval (maximize recall)
    candidates = vectordb.similarity_search(query, k=top_k)

    # 2. Reranker scores each candidate precisely
    pairs = [(query, chunk.page_content) for chunk in candidates]
    scores = reranker.predict(pairs)

    # 3. Keep only top-5 for the final context
    # Sort by score only; without a key, tied scores would fall back to
    # comparing chunk objects, which raises TypeError
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    top_chunks = [chunk for _, chunk in ranked[:rerank_top]]

    context = "\n\n".join([chunk.page_content for chunk in top_chunks])

    return llm.invoke(f"Context:\n{context}\n\nQuestion: {query}")

Cross-encoders are small, fast models that run locally. They dramatically improve precision without touching your LLM budget.


Future Outlook: Contexts Keep Growing

Context windows will keep expanding, and prices will keep falling:

2023: GPT-4 -> 8K tokens
2024: Claude 3 -> 200K, Gemini 1.5 -> 1M tokens
2025: further expansion expected

Input price trends:
2023 GPT-4:   $30.00 / 1M tokens
2024 GPT-4o:   $2.50 / 1M tokens
2025+: continuing to fall

But RAG won't die — it will evolve:

  • Single-stage RAG -> Multi-hop RAG (multi-step retrieval)
  • Keyword/semantic search -> Hybrid search + Reranking
  • Flat retrieval -> Hierarchical/GraphRAG
  • RAG becomes a context management discipline rather than just a search technique

The future question won't be "RAG vs Long Context" — it'll be "which RAG strategy is optimal for this use case?"


Practical Decision Guide

Knowledge base size?
├─ Under ~50 short documents -> consider Long Context directly
├─ Hundreds to thousands -> Long Context RAG (top-20 + 50K context)
└─ Tens of thousands+ -> Standard RAG + Reranking

Query type?
├─ Specific fact lookup -> Standard RAG
├─ Complex reasoning across documents -> GraphRAG or Multi-hop RAG
└─ Global summarization/analysis -> Long Context or Global GraphRAG

Latency requirements?
├─ Under 1 second -> Standard RAG (small context)
├─ Under 3 seconds -> Long Context RAG (under 30K)
└─ Flexible -> Long Context fine

Monthly budget?
├─ Under $100 -> Standard RAG, no debate
├─ Under $1,000 -> Long Context RAG feasible
└─ Unconstrained -> optimize for quality
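The knowledge-base-size branch of the guide above can be encoded as a trivial helper. The thresholds are the article's rough heuristics, not hard rules; calibrate them against your own document lengths and budget:

```python
def pick_strategy(num_docs: int) -> str:
    """Map knowledge-base size to a starting strategy, per the guide above.
    Thresholds are heuristics; tune for your document sizes and budget."""
    if num_docs < 50:
        return "Long Context"
    if num_docs < 10_000:
        return "Long Context RAG (top-20, ~50K context)"
    return "Standard RAG + Reranking"

print(pick_strategy(30))       # Long Context
print(pick_strategy(5_000))    # Long Context RAG (top-20, ~50K context)
print(pick_strategy(100_000))  # Standard RAG + Reranking
```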

Conclusion

RAG is not going away. It's getting more sophisticated.

Long context is a powerful tool for specific scenarios — small document sets, whole-document analysis, complex cross-document reasoning. But for production services, cost, latency, and scalability mean RAG remains the right choice in most situations.

Practical advice: start with the Long Context RAG pattern. Retrieve broadly with RAG, then give the LLM enough context to reason carefully. This pattern balances cost, speed, and quality better than either extreme.

There's no universal answer. Measure what matters for your specific service and iterate.