RAG Systems Complete Guide: Everything About Retrieval-Augmented Generation


LLMs like GPT-4 and Claude are remarkable, but they have fundamental limitations. They lack knowledge beyond their training cutoff, they may be unfamiliar with specialized domain knowledge, and they sometimes generate convincing but incorrect information — a phenomenon called hallucination. RAG (Retrieval-Augmented Generation) is the most practical architecture for solving these problems.

This guide starts from basic RAG and progresses to state-of-the-art architectures like Self-RAG, Corrective RAG, and GraphRAG, with complete production-ready code throughout.


1. What is RAG?

1.1 The Knowledge Limits of LLMs

LLMs are pre-trained on vast corpora of text, but face two fundamental limitations.

Knowledge Cutoff: LLMs cannot know about information after their training was completed. GPT-4's training data only extends to a certain date.

Hallucination: LLMs are probabilistic language models. Rather than saying "I don't know," they tend to generate plausible-sounding but incorrect information. This is especially common with specific facts, dates, citations, and numbers.

Lack of Domain Expertise: Internal company documents, the latest technical specifications, and expertise in domains like medicine and law are difficult to fully encode in a general-purpose LLM.

1.2 The Core Idea of RAG

The core idea of RAG is simple: before the LLM generates an answer, first retrieve relevant information and provide it as context.

User question → Retrieve relevant documents → Provide [documents + question] to LLM → Generate answer

That is the entirety of it. But the details — how to retrieve, how to prepare documents, how to pass context to the LLM — determine the quality of the system.

1.3 RAG vs Fine-tuning

| Criterion | RAG | Fine-tuning |
| --- | --- | --- |
| Knowledge updates | Real-time, just swap documents | Requires retraining |
| Cost | Relatively low | High (GPU required) |
| Source tracing | Can cite source documents | Opaque |
| Domain-specific format | Difficult | Works well |
| Current information | Strength | Only up to training date |
| Hallucination | Lower (grounded in documents) | Can still occur |

In many cases RAG is more practical. However, when output format, style, or specialized domain reasoning is required, fine-tuning can be used as a complement.

1.4 RAG System Architecture Overview

The full RAG pipeline is divided into two phases.

Offline (Indexing) Phase:

  1. Document collection (PDF, HTML, DB, etc.)
  2. Text chunking (splitting)
  3. Embedding generation
  4. Storage in a vector database

Online (Query) Phase:

  1. Embed the user query
  2. Retrieve similar chunks from the vector database
  3. Assemble context
  4. Generate answer with an LLM

2. Text Embeddings

2.1 The Concept of Embeddings

An embedding converts text into a high-dimensional real-valued vector. The key insight is that semantically similar texts are positioned close together in the vector space.

For example:

  • "The puppy is playing"
  • "The dog is running"

These two sentences have embeddings with a very high cosine similarity.

2.2 Key Embedding Models

OpenAI Embeddings

  • text-embedding-3-small: 1536 dimensions, fast and cost-effective
  • text-embedding-3-large: 3072 dimensions, higher quality
  • API-based, easy to use, paid service

Sentence-Transformers

  • all-MiniLM-L6-v2: 384 dimensions, fast and general-purpose
  • BAAI/bge-large-en-v1.5: 1024 dimensions, high performance
  • Run locally, free to use

Multilingual Embedding Models

  • BAAI/bge-m3: Multilingual support, strong across many languages
  • intfloat/multilingual-e5-large: Strong multilingual performance
  • paraphrase-multilingual-mpnet-base-v2: Good for cross-lingual retrieval

2.3 Retrieval via Cosine Similarity

Embedding-based retrieval computes the cosine similarity between the query embedding and stored document embeddings:

$$\text{similarity}(q, d) = \frac{q \cdot d}{\|q\| \, \|d\|}$$

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

documents = [
    "Python is a widely used programming language for data science.",
    "Machine learning is a field of AI that learns patterns from data.",
    "Paris is the capital of France.",
    "Deep learning is a machine learning method using neural networks.",
]

doc_embeddings = model.encode(documents)
print(f"Embedding shape: {doc_embeddings.shape}")  # (4, 384)

query = "artificial intelligence and data analysis"
query_embedding = model.encode([query])[0]

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

similarities = [cosine_similarity(query_embedding, emb) for emb in doc_embeddings]
ranked = sorted(zip(similarities, documents), reverse=True)

print("\nRanked by similarity:")
for score, doc in ranked:
    print(f"  {score:.4f}: {doc}")

2.4 Evaluating Embedding Quality (MTEB)

MTEB (Massive Text Embedding Benchmark) is a systematic benchmark for evaluating embedding models across tasks including Retrieval, Classification, and Clustering.

Practical criteria for choosing embedding models in RAG:

  1. Performance in the target language (check language-specific benchmark scores)
  2. Performance relative to embedding dimensions (higher dimensions increase storage costs)
  3. Inference speed (important for real-time systems)
  4. License (check commercial use allowances)
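Criterion 2 above is easy to quantify with a back-of-envelope calculation: raw vector storage grows linearly with embedding dimension (float32 is 4 bytes per value). The corpus size of one million chunks below is an assumed illustrative figure, and real indexes add overhead for metadata and graph structures.

```python
# Raw vector storage cost as a function of embedding dimension.
# float32 = 4 bytes per value; 1M chunks is an assumed illustrative corpus size.
def index_size_gb(num_chunks: int, dims: int, bytes_per_value: int = 4) -> float:
    return num_chunks * dims * bytes_per_value / 1024**3

for dims in (384, 1024, 1536, 3072):
    print(f"{dims:>5} dims, 1M chunks: {index_size_gb(1_000_000, dims):.2f} GB")
# 3072-dim vectors cost 8x the storage of 384-dim vectors for the same corpus
```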

3. Chunking Strategies

Chunking is one of the most critical design decisions for RAG performance. Chunks that are too large contain too much noise; chunks too small lack sufficient context.

3.1 Fixed-Size Chunking

The simplest approach. Splits uniformly by a specified character or token count.

from langchain.text_splitter import CharacterTextSplitter

text = """
Machine learning is a subfield of artificial intelligence that enables computers
to learn from data without being explicitly programmed. Methods include supervised
learning, unsupervised learning, and reinforcement learning, each solving different
types of problems.
"""

splitter = CharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=20,
    separator="\n"
)
chunks = splitter.split_text(text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk[:80]}...")

3.2 Recursive Chunking

LangChain's RecursiveCharacterTextSplitter recursively splits on paragraphs, sentences, and words in order, preserving semantic boundaries as much as possible.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""]
)

with open("document.txt", "r", encoding="utf-8") as f:
    text = f.read()

chunks = splitter.split_text(text)
print(f"Total chunks: {len(chunks)}")
print(f"Average chunk length: {sum(len(c) for c in chunks) / len(chunks):.0f} chars")

3.3 Semantic Chunking

Uses embedding similarity to split at semantic boundaries. The split point is where the embedding similarity between adjacent sentences drops sharply.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

semantic_splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95    # split at top 5% similarity change points
)

chunks = semantic_splitter.create_documents([text])
print(f"Semantic chunking result: {len(chunks)} chunks")

3.4 Parent-Child Chunking

A strategy that uses small chunks (children) for retrieval and large chunks (parents) for context delivery.

  • Child chunks: small and precise (higher retrieval accuracy)
  • Parent chunks: large and comprehensive (sufficient context for the LLM)

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)

vectorstore = Chroma(
    collection_name="full_documents",
    embedding_function=OpenAIEmbeddings()
)
store = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# After setup, call retriever.add_documents(docs): child chunks are embedded
# into the vectorstore for retrieval, while full parent chunks are kept in the docstore

3.5 Chunk Size Decision Guide

| Use Case | Recommended Chunk Size | Overlap |
| --- | --- | --- |
| Fact retrieval (Q&A) | 200-400 tokens | 10-20% |
| Document summarization | 800-1200 tokens | 5-10% |
| Code retrieval | Function/class unit | None |
| Mixed content | 512 tokens | 50-100 tokens |

4. Vector Databases

A vector database is a database specialized for storing high-dimensional vectors and efficiently finding similar vectors.

4.1 Comparing Major Vector Databases

FAISS (Facebook AI Similarity Search)

  • Library developed by Meta
  • In-memory processing, very fast
  • No production server required (it's a library)
  • Optimized for large-scale batch processing

Chroma

  • Open source, built-in embeddings
  • Python-native API
  • Ideal for development and prototyping
  • Persistence supported (SQLite-based)

Pinecone

  • Fully managed cloud service
  • Enterprise-grade scaling
  • Paid service, easy to operate

Weaviate

  • Open source + cloud option
  • Hybrid search built-in
  • GraphQL API

Milvus

  • High-performance open source
  • Distributed architecture
  • Scales to billions of vectors

pgvector

  • PostgreSQL extension
  • Leverages existing PostgreSQL infrastructure
  • Vector search via SQL

4.2 ANN Algorithms

Exact nearest-neighbor search (KNN) requires O(n) time. At scale, approximate algorithms (ANN) are used.

HNSW (Hierarchical Navigable Small World)

Supports fast search via a hierarchical graph structure.

  • Insert: O(log n)
  • Search: O(log n)
  • High recall, fast queries
  • Default algorithm in Chroma and Weaviate

IVF (Inverted File Index)

Divides data into clusters and searches only relevant clusters.

  • Memory efficient
  • Trade recall for speed via the nprobe parameter
  • Commonly used with FAISS
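The IVF idea can be sketched conceptually in pure NumPy: assign vectors to clusters, then scan only the `nprobe` clusters nearest to the query. This toy version uses randomly chosen data points as centroids in place of real k-means training; FAISS's `IndexIVFFlat` implements the same principle far more efficiently.

```python
# Conceptual IVF sketch: partition vectors into clusters, then search only
# the nprobe nearest clusters. Toy-scale illustration, not production code.
import numpy as np

rng = np.random.default_rng(0)
d, n, n_clusters = 32, 2000, 16
data = rng.standard_normal((n, d)).astype("float32")

# Crude "training": random data points as centroids (real IVF runs k-means)
centroids = data[rng.choice(n, n_clusters, replace=False)]
assignments = np.argmin(
    np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2), axis=1
)
inverted_lists = {c: np.where(assignments == c)[0] for c in range(n_clusters)}

def ivf_search(query, k=5, nprobe=4):
    # 1. Find the nprobe centroids closest to the query
    centroid_dists = np.linalg.norm(centroids - query, axis=1)
    probe = np.argsort(centroid_dists)[:nprobe]
    # 2. Scan only the vectors in those clusters
    candidates = np.concatenate([inverted_lists[c] for c in probe])
    dists = np.linalg.norm(data[candidates] - query, axis=1)
    order = np.argsort(dists)[:k]
    return candidates[order], dists[order]

query = rng.standard_normal(d).astype("float32")
ids, dists = ivf_search(query, k=5, nprobe=4)
print(f"Scanned a subset of {n} vectors; top-5 ids: {ids}")
```

Raising `nprobe` scans more clusters, trading speed for recall, which is exactly the knob mentioned above.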

4.3 FAISS vs Chroma Implementation Comparison

import numpy as np
import faiss
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.schema import Document

# ===== FAISS Direct Usage =====
d = 384       # vector dimension
n = 10000     # number of documents
vectors = np.random.randn(n, d).astype('float32')

# Flat L2 index (exact search)
index_flat = faiss.IndexFlatL2(d)
index_flat.add(vectors)
print(f"FAISS index size: {index_flat.ntotal}")

# Search
query = np.random.randn(1, d).astype('float32')
k = 5
distances, indices = index_flat.search(query, k)
print(f"Top {k} results: {indices[0]}")

# HNSW index (approximate, faster)
index_hnsw = faiss.IndexHNSWFlat(d, 32)
index_hnsw.add(vectors)
distances_hnsw, indices_hnsw = index_hnsw.search(query, k)
print(f"HNSW results: {indices_hnsw[0]}")

# ===== LangChain + Chroma =====
documents = [
    Document(page_content="Python is used for data science.", metadata={"source": "doc1"}),
    Document(page_content="Machine learning learns patterns from data.", metadata={"source": "doc2"}),
    Document(page_content="Deep learning is neural network-based ML.", metadata={"source": "doc3"}),
    Document(page_content="NLP analyzes text data.", metadata={"source": "doc4"}),
]

embeddings = OpenAIEmbeddings()
chroma_db = Chroma.from_documents(
    documents,
    embeddings,
    persist_directory="./chroma_db"
)

results = chroma_db.similarity_search("AI and machine learning", k=2)
for doc in results:
    print(f"Source: {doc.metadata['source']}, Content: {doc.page_content}")

results_with_score = chroma_db.similarity_search_with_score("deep learning", k=2)
for doc, score in results_with_score:
    print(f"Score: {score:.4f}, Content: {doc.page_content}")

5. Basic RAG Pipeline Implementation

5.1 Complete RAG Pipeline with LangChain

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# ===== 1. Document Loading =====
loader = PyPDFLoader("company_handbook.pdf")
pages = loader.load()
print(f"Pages loaded: {len(pages)}")

# ===== 2. Text Splitting =====
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", "!", "?", ",", " "]
)
chunks = text_splitter.split_documents(pages)
print(f"Chunks created: {len(chunks)}")

# ===== 3. Embedding and Vector DB Storage =====
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    chunks,
    embeddings,
    persist_directory="./rag_db"
)
print("Vector DB saved")

# ===== 4. RAG Chain Setup =====
prompt_template = """You are a helpful AI assistant.
Answer the question using ONLY the provided context.
If the answer is not in the context, say "I cannot find that in the provided documents."

Context:
{context}

Question: {question}

Answer:"""

PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": PROMPT},
    return_source_documents=True
)

# ===== 5. Question Answering =====
query = "What is the vacation policy?"
result = qa_chain.invoke({"query": query})

print(f"\nQuestion: {query}")
print(f"Answer: {result['result']}")
print(f"\nSource documents:")
for doc in result['source_documents']:
    print(f"  - Page {doc.metadata.get('page', '?')}: {doc.page_content[:100]}...")

5.2 RAG with LlamaIndex

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)

documents = SimpleDirectoryReader("./docs/").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(
    similarity_top_k=4,
    response_mode="compact"
)

response = query_engine.query("What are the main topics covered in these documents?")
print(f"Answer: {response}")
print("\nSource nodes:")
for node in response.source_nodes:
    print(f"  - Score: {node.score:.4f}")
    print(f"    Text: {node.text[:100]}...")

6. Advanced Retrieval Techniques

6.1 Hybrid Search (BM25 + Vector)

Combining vector search (semantic) with BM25 (keyword-based) captures the strengths of both approaches.

from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# BM25 retriever (keyword-based)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 4

# Vector retriever (semantic)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Ensemble (hybrid)
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5]
)

results = ensemble_retriever.invoke("Python programming tutorial")
print(f"Hybrid search results: {len(results)}")

6.2 Multi-Query Retrieval

Rewrites a single question into multiple different phrasings to broaden the retrieval coverage.

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm
)

# Internally, the LLM rewrites the question into multiple versions
# e.g. "What are the advantages of RAG?"
# → "What are the benefits of retrieval-augmented generation?"
# → "Why is RAG better than standard LLMs?"
# → "What makes RAG useful?"

results = multi_query_retriever.invoke("What are the advantages of RAG?")
print(f"Multi-query retrieval results: {len(results)}")

6.3 MMR (Maximal Marginal Relevance)

Considering only similarity can lead to selecting redundant chunks. MMR balances relevance and diversity simultaneously.

$$\text{MMR} = \arg\max_{d_i \in D \setminus R} \left[ \lambda \cdot \text{sim}(d_i, q) - (1-\lambda) \cdot \max_{d_j \in R} \text{sim}(d_i, d_j) \right]$$

mmr_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 4,             # final number to return
        "fetch_k": 20,      # initial candidate pool
        "lambda_mult": 0.5  # balance: 0=diversity, 1=similarity
    }
)

results = mmr_retriever.invoke("machine learning algorithms")
print(f"MMR results: {len(results)}")
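The MMR formula can also be implemented directly, which makes the relevance/diversity trade-off concrete. This is a minimal sketch over toy random vectors, unit-normalized so dot products equal cosine similarity; a vector store's built-in MMR works on its own stored embeddings.

```python
# Direct implementation of the MMR formula: greedily pick the document that
# maximizes lambda*sim(d,q) - (1-lambda)*max_similarity_to_already_selected.
import numpy as np

def mmr_select(query_vec, doc_vecs, k=4, lambda_mult=0.5):
    sim_to_query = doc_vecs @ query_vec        # sim(d_i, q)
    sim_between = doc_vecs @ doc_vecs.T        # sim(d_i, d_j)
    selected = [int(np.argmax(sim_to_query))]  # start with the most relevant doc
    while len(selected) < min(k, len(doc_vecs)):
        remaining = [i for i in range(len(doc_vecs)) if i not in selected]
        scores = [
            lambda_mult * sim_to_query[i]
            - (1 - lambda_mult) * max(sim_between[i][j] for j in selected)
            for i in remaining
        ]
        selected.append(remaining[int(np.argmax(scores))])
    return selected

rng = np.random.default_rng(42)
docs = rng.standard_normal((20, 64))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # unit vectors
query = docs[0] + 0.1 * rng.standard_normal(64)      # query close to doc 0
query /= np.linalg.norm(query)

picked = mmr_select(query, docs, k=4, lambda_mult=0.5)
print(f"MMR-selected document indices: {picked}")
```

With `lambda_mult=1.0` this degenerates to plain top-k similarity; lowering it penalizes candidates similar to documents already selected.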

6.4 Metadata Filtering

Add metadata conditions to searches to narrow the scope.

from langchain.schema import Document

docs_with_metadata = [
    Document(
        page_content="Q1 2024 revenue was $10M.",
        metadata={"year": 2024, "quarter": "Q1", "category": "financial"}
    ),
    Document(
        page_content="Q2 2024 revenue was $12M.",
        metadata={"year": 2024, "quarter": "Q2", "category": "financial"}
    ),
    Document(
        page_content="Technology roadmap: AI feature enhancements planned.",
        metadata={"year": 2024, "quarter": "Q1", "category": "strategy"}
    ),
]

# Narrow search with a metadata filter
# (Chroma requires the $and operator to combine multiple conditions)
filtered_results = vectorstore.similarity_search(
    "revenue performance",
    k=2,
    filter={"$and": [{"category": "financial"}, {"year": 2024}]}
)

7. Reranking

Reranking reorders retrieval results with a more precise model to improve ranking quality. This enables a two-stage strategy: high recall from the initial retrieval, high precision from the reranker.

7.1 Cross-Encoder Reranker

Cross-encoders (which encode both texts together) are more accurate than bi-encoders (which encode texts separately then compute similarity).

from sentence_transformers import CrossEncoder
import numpy as np

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

query = "types of machine learning algorithms"
initial_results = vectorstore.similarity_search(query, k=20)  # retrieve many

# Re-score with cross-encoder
pairs = [[query, doc.page_content] for doc in initial_results]
scores = cross_encoder.predict(pairs)

# Sort by score
ranked = sorted(zip(scores, initial_results), key=lambda x: x[0], reverse=True)
top_k = [doc for _, doc in ranked[:5]]

print("Top results after reranking:")
for score, doc in ranked[:3]:
    print(f"  Score {score:.4f}: {doc.page_content[:80]}...")

7.2 Cohere Rerank API

from langchain.retrievers.document_compressors import CohereRerank
from langchain.retrievers import ContextualCompressionRetriever

compressor = CohereRerank(
    cohere_api_key="your-api-key",
    top_n=3,
    model="rerank-multilingual-v3.0"
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20})
)

results = compression_retriever.invoke("Tell me about company policies")
print(f"Reranked documents: {len(results)}")

7.3 BGE Reranker (Open Source)

from FlagEmbedding import FlagReranker

reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True)

query = "What is RAG?"
passages = [
    "RAG stands for Retrieval-Augmented Generation.",
    "A rag is a piece of cloth used for cleaning.",
    "RAG systems combine retrieval with generation for better LLM responses.",
]

scores = reranker.compute_score([[query, p] for p in passages])
ranked = sorted(zip(scores, passages), reverse=True)

for score, passage in ranked:
    print(f"  {score:.4f}: {passage}")

8. HyDE (Hypothetical Document Embeddings)

8.1 The Idea Behind HyDE

Standard RAG directly compares query embeddings with document embeddings. However, a short query's embedding may be far from a long document's embedding in the semantic space.

HyDE's solution: have the LLM generate a hypothetical answer document, then use that hypothetical document's embedding for retrieval.

Query → LLM generates hypothetical answer → embed hypothetical answer → retrieve real documents

8.2 Implementing HyDE

from langchain.chains import HypotheticalDocumentEmbedder
from langchain_openai import OpenAI, OpenAIEmbeddings, ChatOpenAI

llm = OpenAI()
embeddings = OpenAIEmbeddings()

# Using LangChain's built-in HyDE
hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=llm,
    embeddings=embeddings,
    prompt_key="web_search"
)

# Manual HyDE implementation
def manual_hyde(query, llm, embeddings, vectorstore, k=4):
    # 1. Generate hypothetical document
    hypothetical_doc = llm.invoke(
        f"Write a detailed answer to the following question: {query}"
    )

    # 2. Embed hypothetical document
    hyp_embedding = embeddings.embed_query(hypothetical_doc.content)

    # 3. Search with hypothetical embedding
    results = vectorstore.similarity_search_by_vector(hyp_embedding, k=k)

    return results, hypothetical_doc.content

chat_llm = ChatOpenAI(temperature=0.7)
results, hyp_doc = manual_hyde(
    "History of deep learning", chat_llm, embeddings, vectorstore
)
print(f"Hypothetical document: {hyp_doc[:200]}...")
print(f"Retrieved actual documents: {len(results)}")

9. Advanced RAG Architectures

9.1 Self-RAG

Self-RAG (2023, Asai et al.) allows the LLM to judge whether retrieval is necessary and critically evaluate the relevance of retrieved documents and the quality of responses.

Uses four special tokens:

  • [Retrieve]: Is retrieval needed? (Yes/No)
  • [IsRel]: Is the retrieved document relevant? (Relevant/Irrelevant)
  • [IsSup]: Is the response supported by documents? (Supported/Partially/Not)
  • [IsUse]: Is the response useful? (score 1-5)

from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

class SelfRAGSimulator:
    """Simulates Self-RAG behavior (real Self-RAG requires specially trained models)"""

    def __init__(self, retriever, llm):
        self.retriever = retriever
        self.llm = llm

    def should_retrieve(self, query: str) -> bool:
        prompt = ChatPromptTemplate.from_template("""
Determine whether answering the following question requires searching external documents.
Answer 'NO' if it can be answered with general knowledge or reasoning alone.
Answer 'YES' if specific facts or specialized knowledge are needed.
Answer only YES or NO.

Question: {query}
Decision:""")
        response = self.llm.invoke(prompt.format_messages(query=query))
        return "YES" in response.content.upper()

    def is_relevant(self, query: str, doc_content: str) -> bool:
        prompt = ChatPromptTemplate.from_template("""
Determine if the following document is relevant to the question.
Answer only RELEVANT or IRRELEVANT.

Question: {query}
Document: {doc}
Decision:""")
        response = self.llm.invoke(
            prompt.format_messages(query=query, doc=doc_content[:500])
        )
        return "RELEVANT" in response.content.upper()

    def generate_with_reflection(self, query: str) -> str:
        # 1. Determine if retrieval is needed
        need_retrieve = self.should_retrieve(query)
        print(f"Retrieval needed: {need_retrieve}")

        if not need_retrieve:
            response = self.llm.invoke(query)
            return response.content

        # 2. Retrieve documents
        docs = self.retriever.invoke(query)

        # 3. Filter by relevance
        relevant_docs = [d for d in docs if self.is_relevant(query, d.page_content)]
        print(f"Relevant documents: {len(relevant_docs)}/{len(docs)}")

        if not relevant_docs:
            return "No relevant documents found. Answering from general knowledge: " + \
                   self.llm.invoke(query).content

        # 4. Generate answer with context
        context = "\n\n".join([d.page_content for d in relevant_docs[:3]])
        prompt = f"""Use the context to answer the question.
Context: {context}
Question: {query}
Answer:"""
        return self.llm.invoke(prompt).content

9.2 Corrective RAG (CRAG)

CRAG evaluates the quality of retrieved documents and supplements with web search if quality is low.

from langchain_community.tools.tavily_search import TavilySearchResults
from typing import List, Tuple

class CorrectiveRAG:
    def __init__(self, retriever, llm):
        self.retriever = retriever
        self.llm = llm
        self.web_search = TavilySearchResults(max_results=3)

    def evaluate_documents(self, query: str, docs: list) -> Tuple[str, List]:
        """
        Evaluate document relevance.
        Returns: ("CORRECT"|"INCORRECT"|"AMBIGUOUS", filtered documents)
        """
        evaluation_prompt = """Evaluate the relevance of retrieved documents for the question.
- CORRECT: Documents can directly answer the question
- INCORRECT: Documents are not related to the question
- AMBIGUOUS: Partially related but incomplete

Question: {query}
Documents:
{docs}

Evaluation (CORRECT/INCORRECT/AMBIGUOUS):"""

        docs_text = "\n---\n".join([d.page_content[:300] for d in docs[:4]])
        response = self.llm.invoke(
            evaluation_prompt.format(query=query, docs=docs_text)
        )

        evaluation = response.content.strip().upper()
        if "CORRECT" in evaluation:
            return "CORRECT", docs
        elif "INCORRECT" in evaluation:
            return "INCORRECT", []
        else:
            return "AMBIGUOUS", docs

    def run(self, query: str) -> str:
        # 1. Initial retrieval
        docs = self.retriever.invoke(query)

        # 2. Evaluate document quality
        status, filtered_docs = self.evaluate_documents(query, docs)
        print(f"Document evaluation: {status}")

        # 3. Handle each case
        if status == "INCORRECT":
            print("Supplementing with web search...")
            web_results = self.web_search.invoke(query)
            context = "\n".join([r['content'] for r in web_results])

        elif status == "AMBIGUOUS":
            web_results = self.web_search.invoke(query)
            web_context = "\n".join([r['content'] for r in web_results])
            doc_context = "\n".join([d.page_content for d in filtered_docs[:2]])
            context = doc_context + "\n\n[Web Search Supplement]\n" + web_context

        else:
            context = "\n\n".join([d.page_content for d in filtered_docs[:4]])

        # 4. Generate final response
        response = self.llm.invoke(
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        )
        return response.content

9.3 Adaptive RAG

Dynamically selects retrieval strategies based on query complexity.

class AdaptiveRAG:
    def __init__(self, simple_retriever, advanced_retriever, llm):
        self.simple_retriever = simple_retriever       # basic vector search
        self.advanced_retriever = advanced_retriever   # hybrid + reranking
        self.llm = llm

    def classify_query(self, query: str) -> str:
        prompt = f"""Classify the complexity of the following question:
- simple: Simple fact check or directly answerable
- complex: Requires combining multiple sources or multi-step reasoning

Question: {query}
Classification (simple/complex):"""

        response = self.llm.invoke(prompt)
        return "complex" if "complex" in response.content.lower() else "simple"

    def run(self, query: str) -> str:
        query_type = self.classify_query(query)
        print(f"Query type: {query_type}")

        if query_type == "simple":
            docs = self.simple_retriever.invoke(query)
        else:
            docs = self.advanced_retriever.invoke(query)

        context = "\n\n".join([d.page_content for d in docs])
        return self.llm.invoke(
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        ).content

9.4 GraphRAG (Microsoft)

Microsoft's GraphRAG constructs a knowledge graph from documents and uses graph structure for retrieval.

Key ideas:

  1. Extract entities (people, places, concepts) and relationships from documents
  2. Group related entities using community detection algorithms
  3. Generate summaries for each community
  4. For global queries use community summaries; for local queries use graph traversal

# Install and initialize GraphRAG
pip install graphrag

# Initialize project
python -m graphrag.index --init --root ./ragtest

# Edit settings then index
python -m graphrag.index --root ./ragtest

# Global search (requires understanding of full document set)
python -m graphrag.query --root ./ragtest --method global "What are the main themes?"

# Local search (focused on specific entities)
python -m graphrag.query --root ./ragtest --method local "Tell me about company X"

10. RAG Evaluation Metrics

Objectively measuring the quality of a RAG system is essential for improvement.

10.1 RAGAS (RAG Assessment)

RAGAS is a framework for automatically evaluating RAG pipelines.

Key metrics:

  • Faithfulness: How faithful is the answer to the context? (measures hallucination)
  • Answer Relevancy: How relevant is the answer to the question?
  • Context Recall: How well were the relevant contexts retrieved?
  • Context Precision: What proportion of retrieved contexts were actually useful?

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision
)
from datasets import Dataset

evaluation_data = {
    "question": [
        "What is the company's annual leave policy?",
        "What is the remote work policy?",
    ],
    "answer": [
        "Employees receive 15 days of annual leave after 1 year, with 1 additional day per year.",
        "Employees may work remotely 3 days per week.",
    ],
    "contexts": [
        ["Employees are granted 15 days of annual leave after 1 year of employment. One additional day is added each subsequent year."],
        ["Employees may work remotely up to 2 days per week. Additional days may be approved with permission."],
    ],
    "ground_truth": [
        "15 days after 1 year, 1 additional day per year",
        "2 days remote work by default, additional with approval",
    ]
}

dataset = Dataset.from_dict(evaluation_data)

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision]
)

print(result)
# faithfulness: 0.75 (detects inconsistency between answer and context)
# answer_relevancy: 0.92
# context_recall: 0.85
# context_precision: 0.78

10.2 Production Evaluation Pipeline

from typing import List, Dict, Optional
import time

class RAGEvaluator:
    def __init__(self, rag_chain, llm):
        self.rag_chain = rag_chain
        self.llm = llm

    def evaluate_faithfulness(self, answer: str, context: str) -> float:
        prompt = f"""Evaluate whether the following answer is based solely on information in the given context.
Score from 0.0 (not at all) to 1.0 (completely faithful).

Context: {context}
Answer: {answer}

Faithfulness score (number only):"""
        response = self.llm.invoke(prompt)
        try:
            return float(response.content.strip())
        except ValueError:
            return 0.5

    def evaluate_answer_relevancy(self, question: str, answer: str) -> float:
        prompt = f"""Evaluate how relevant the following answer is to the question.
Score from 0.0 to 1.0.

Question: {question}
Answer: {answer}

Relevancy score (number only):"""
        response = self.llm.invoke(prompt)
        try:
            return float(response.content.strip())
        except ValueError:
            # Non-numeric judge output: fall back to a neutral score
            return 0.5

    def run_evaluation(self, test_cases: List[Dict]) -> Dict:
        results = []
        for case in test_cases:
            question = case["question"]

            result = self.rag_chain.invoke({"query": question})
            answer = result["result"]
            context = "\n".join([d.page_content for d in result["source_documents"]])

            faithfulness_score = self.evaluate_faithfulness(answer, context)
            relevancy_score = self.evaluate_answer_relevancy(question, answer)

            results.append({
                "question": question,
                "answer": answer,
                "faithfulness": faithfulness_score,
                "relevancy": relevancy_score,
            })

        avg_faithfulness = sum(r["faithfulness"] for r in results) / len(results)
        avg_relevancy = sum(r["relevancy"] for r in results) / len(results)

        return {
            "results": results,
            "avg_faithfulness": avg_faithfulness,
            "avg_relevancy": avg_relevancy,
            "overall_score": (avg_faithfulness + avg_relevancy) / 2
        }
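One fragile point in LLM-as-judge evaluation is parsing the score: models do not always reply with a bare number. A more defensive parser sketch that could replace the plain float() calls above (function name and fallback value are illustrative):

```python
import re

def parse_score(text: str, default: float = 0.5) -> float:
    """Extract the first number from a judge response and clamp it to [0, 1].
    Judges sometimes reply '0.8', 'Score: 0.8', or free text with no number."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if not match:
        return default
    return max(0.0, min(1.0, float(match.group())))

print(parse_score("0.8"))          # 0.8
print(parse_score("Score: 0.95"))  # 0.95
print(parse_score("no idea"))      # 0.5 (default)
```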

11. Production RAG Systems

11.1 Caching Strategies

import hashlib
import json
import redis

class CachedRAGSystem:
    def __init__(self, rag_chain, redis_client=None, ttl=3600):
        self.rag_chain = rag_chain
        self.redis = redis_client
        self.ttl = ttl

    def _get_cache_key(self, query: str) -> str:
        return f"rag:{hashlib.md5(query.encode()).hexdigest()}"

    def query(self, query: str) -> dict:
        cache_key = self._get_cache_key(query)

        # Check cache
        if self.redis:
            cached = self.redis.get(cache_key)
            if cached:
                print("Cache hit!")
                return json.loads(cached)

        # Execute RAG
        result = self.rag_chain.invoke({"query": query})
        response = {
            "answer": result["result"],
            "sources": [d.metadata for d in result["source_documents"]]
        }

        # Store in cache
        if self.redis:
            self.redis.setex(cache_key, self.ttl, json.dumps(response))

        return response
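The md5-of-query key above only hits on exact string matches. One cheap variation worth considering (a sketch, separate from the class above): normalize case and whitespace before hashing so trivially different phrasings share a cache entry. Semantic caching via embeddings goes further but adds latency per lookup.

```python
import hashlib

def normalized_cache_key(query: str, prefix: str = "rag") -> str:
    """Normalize case and collapse whitespace before hashing, so
    trivially different phrasings map to the same cache entry."""
    normalized = " ".join(query.lower().split())
    return f"{prefix}:{hashlib.md5(normalized.encode()).hexdigest()}"

k1 = normalized_cache_key("What is the  remote work policy?")
k2 = normalized_cache_key("what is the remote work policy?")
print(k1 == k2)  # True: both map to the same cache entry
```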

# LLM response caching
from langchain.globals import set_llm_cache
from langchain_community.cache import InMemoryCache, SQLiteCache

# Pick one; a later set_llm_cache() call replaces the earlier cache
set_llm_cache(InMemoryCache())                             # development: fast, lost on restart
# set_llm_cache(SQLiteCache(database_path=".langchain.db"))  # production: persisted to disk

11.2 Streaming Responses

from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain_openai import ChatOpenAI
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

streaming_llm = ChatOpenAI(
    model="gpt-4o-mini",
    streaming=True,
    # Optional: echoes tokens to server stdout for debugging;
    # astream() below is what feeds the HTTP response
    callbacks=[StreamingStdOutCallbackHandler()]
)

async def generate_rag_stream(query: str):
    docs = retriever.invoke(query)
    context = "\n\n".join([d.page_content for d in docs])

    async for chunk in streaming_llm.astream(
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    ):
        if chunk.content:
            yield f"data: {chunk.content}\n\n"

@app.get("/rag/stream")
async def rag_stream_endpoint(query: str):
    return StreamingResponse(
        generate_rag_stream(query),
        media_type="text/event-stream"
    )
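The `data: ...\n\n` framing in the generator is the Server-Sent Events wire format: each event is one or more `data:` lines terminated by a blank line. A minimal framing helper (an illustrative sketch; the generator above handles only single-line chunks, while multi-line payloads need one `data:` line per line):

```python
def sse_event(payload: str) -> str:
    """Frame a payload as a Server-Sent Events message: one 'data:' line
    per payload line, terminated by a blank line."""
    lines = payload.split("\n")
    return "".join(f"data: {line}\n" for line in lines) + "\n"

print(repr(sse_event("Hello")))         # 'data: Hello\n\n'
print(repr(sse_event("line1\nline2")))  # 'data: line1\ndata: line2\n\n'
```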

11.3 Cost Optimization

# Track token usage
from langchain.callbacks import get_openai_callback

with get_openai_callback() as cb:
    result = qa_chain.invoke({"query": "Your question here"})
    print(f"Total tokens: {cb.total_tokens}")
    print(f"Prompt tokens: {cb.prompt_tokens}")
    print(f"Completion tokens: {cb.completion_tokens}")
    print(f"Cost: ${cb.total_cost:.6f}")

# Context compression to save tokens
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.retrievers import ContextualCompressionRetriever

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 8})
)

# Only pass compressed (relevant) portions to LLM
compressed_docs = compression_retriever.invoke("your query")
total_tokens_estimate = sum(len(d.page_content.split()) for d in compressed_docs)
print(f"Compressed context size (word count, a rough proxy for tokens): {total_tokens_estimate}")
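Token counts translate to dollars via per-model prices. A rough estimator sketch; the prices below are placeholders quoted per 1M tokens (verify against the provider's current pricing page before relying on them):

```python
# Placeholder per-1M-token prices; check the provider's current pricing
PRICING = {
    "gpt-4o-mini": {"prompt": 0.15, "completion": 0.60},
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the dollar cost of one call from its token counts."""
    p = PRICING[model]
    return (prompt_tokens * p["prompt"] + completion_tokens * p["completion"]) / 1_000_000

cost = estimate_cost("gpt-4o-mini", prompt_tokens=3000, completion_tokens=500)
print(f"${cost:.6f}")  # $0.000750
```

Prompt tokens usually dominate in RAG because the retrieved context rides along on every call, which is why context compression pays off.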

11.4 Monitoring

import time
import logging
from dataclasses import dataclass
from typing import Optional

@dataclass
class RAGMetrics:
    query: str
    retrieval_time: float = 0.0
    generation_time: float = 0.0
    num_docs_retrieved: int = 0
    answer_length: int = 0
    error: Optional[str] = None

class MonitoredRAGSystem:
    def __init__(self, rag_chain, retriever, logger=None):
        self.rag_chain = rag_chain
        self.retriever = retriever  # injected rather than relying on a global
        self.logger = logger or logging.getLogger(__name__)
        self.metrics_history = []

    def query(self, query: str) -> dict:
        metrics = RAGMetrics(query=query)
        start_total = time.time()

        try:
            # Time retrieval separately (note: the chain retrieves again internally)
            retrieval_start = time.time()
            docs = self.retriever.invoke(query)
            metrics.retrieval_time = time.time() - retrieval_start
            metrics.num_docs_retrieved = len(docs)

            gen_start = time.time()
            result = self.rag_chain.invoke({"query": query})
            metrics.generation_time = time.time() - gen_start
            metrics.answer_length = len(result["result"])

        except Exception as e:
            metrics.error = str(e)
            self.logger.error(f"RAG error: {e}")
            raise

        finally:
            total_time = time.time() - start_total
            self.metrics_history.append(metrics)
            self.logger.info(
                f"Query processed | "
                f"Retrieval: {metrics.retrieval_time:.2f}s | "
                f"Generation: {metrics.generation_time:.2f}s | "
                f"Total: {total_time:.2f}s | "
                f"Docs: {metrics.num_docs_retrieved}"
            )

        return result

    def get_stats(self) -> dict:
        if not self.metrics_history:
            return {}
        retrieval_times = [m.retrieval_time for m in self.metrics_history if not m.error]
        gen_times = [m.generation_time for m in self.metrics_history if not m.error]
        return {
            "total_queries": len(self.metrics_history),
            "error_rate": sum(1 for m in self.metrics_history if m.error) / len(self.metrics_history),
            "avg_retrieval_time": sum(retrieval_times) / len(retrieval_times) if retrieval_times else 0,
            "avg_generation_time": sum(gen_times) / len(gen_times) if gen_times else 0,
        }
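Averages hide tail latency; p95 is usually the number worth alerting on. A small nearest-rank percentile helper that could extend get_stats (a sketch; names are illustrative):

```python
def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile for pct in [0, 100]; 0.0 for empty input."""
    if not values:
        return 0.0
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(round(pct / 100 * (len(ordered) - 1))))
    return ordered[idx]

# Two slow outliers barely move the mean but dominate p95
latencies = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 4.8, 5.2]
print(percentile(latencies, 50))  # 1.2
print(percentile(latencies, 95))  # 5.2
```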

12. Production RAG Checklist

Key considerations when building a production RAG system.

Document Processing

  • Support diverse file formats (PDF, Word, HTML, Markdown)
  • Preserve metadata (source, date, author)
  • Strategy for images and tables
  • Support for incremental updates

Retrieval Quality

  • Choose embedding models suited to the domain and language
  • Consider hybrid search (keyword + semantic)
  • Tune chunk size and overlap appropriately
  • Use reranking to improve precision

LLM Integration

  • Clear system prompt (emphasize using only context)
  • Require source citation
  • Allow expression of uncertainty

Operations

  • Response caching to reduce costs
  • Monitor token usage
  • A/B test chunking and retrieval parameters
  • Automated evaluation pipeline (RAGAS)
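The A/B testing point above is much easier when every tunable lives in one config object, so that variants differ only in named fields and the full config can be logged with each evaluation run. A sketch (field names and defaults are illustrative, not recommendations):

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RAGConfig:
    """All pipeline tunables in one place, for logging and A/B variants."""
    chunk_size: int = 512
    chunk_overlap: int = 64
    top_k: int = 4
    use_reranker: bool = False
    embedding_model: str = "text-embedding-3-small"

baseline = RAGConfig()
variant = RAGConfig(top_k=8, use_reranker=True)

# Log the full config alongside evaluation scores so runs stay comparable
print(asdict(variant))
```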

Conclusion

RAG is the most practical approach to overcoming the limitations of LLMs. To summarize:

  1. Basic RAG: chunking → embedding → vector DB → retrieval → generation
  2. Improving retrieval quality: hybrid search, MMR, reranking
  3. Advanced architectures: Self-RAG, CRAG, HyDE, GraphRAG
  4. Evaluation: measure faithfulness and relevancy with RAGAS
  5. Production: caching, monitoring, cost optimization

The performance of a RAG system depends on the harmony of the entire pipeline, not any single component. That said, chunking strategy and embedding model choice dominate retrieval quality in practice, so investing in those two elements first tends to yield the highest ROI.


References