Author: Youngju Kim (@fjvbn20031)
RAG Systems Complete Guide: Everything About Retrieval-Augmented Generation
LLMs like GPT-4 and Claude are remarkable, but they have fundamental limitations. They lack knowledge beyond their training cutoff, they may be unfamiliar with specialized domain knowledge, and they sometimes generate convincing but incorrect information — a phenomenon called hallucination. RAG (Retrieval-Augmented Generation) is the most practical architecture for solving these problems.
This guide starts from basic RAG and progresses to state-of-the-art architectures like Self-RAG, Corrective RAG, and GraphRAG, with complete production-ready code throughout.
1. What is RAG?
1.1 The Knowledge Limits of LLMs
LLMs are pre-trained on vast corpora of text, but face three fundamental limitations.
Knowledge Cutoff: LLMs cannot know about events that occurred after their training data was collected. GPT-4, for example, only knows about the world up to its training cutoff date.
Hallucination: LLMs are probabilistic language models. Rather than saying "I don't know," they tend to generate plausible-sounding but incorrect information. This is especially common with specific facts, dates, citations, and numbers.
Lack of Domain Expertise: Internal company documents, the latest technical specifications, and expertise in domains like medicine and law are difficult to fully encode in a general-purpose LLM.
1.2 The Core Idea of RAG
The core idea of RAG is simple: before the LLM generates an answer, first retrieve relevant information and provide it as context.
User question → Retrieve relevant documents → Provide [documents + question] to LLM → Generate answer
That is the entirety of it. But the details — how to retrieve, how to prepare documents, how to pass context to the LLM — determine the quality of the system.
1.3 RAG vs Fine-tuning
| Criterion | RAG | Fine-tuning |
|---|---|---|
| Knowledge updates | Real-time, just swap documents | Requires retraining |
| Cost | Relatively low | High (GPU required) |
| Source tracing | Can cite source documents | Opaque |
| Domain-specific format | Difficult | Works well |
| Current information | Strength | Only up to training date |
| Hallucination | Lower (grounded in documents) | Can still occur |
In many cases RAG is more practical. However, when output format, style, or specialized domain reasoning is required, fine-tuning can be used as a complement.
1.4 RAG System Architecture Overview
The full RAG pipeline is divided into two phases.
Offline (Indexing) Phase:
- Document collection (PDF, HTML, DB, etc.)
- Text chunking (splitting)
- Embedding generation
- Storage in a vector database
Online (Query) Phase:
- Embed the user query
- Retrieve similar chunks from the vector database
- Assemble context
- Generate answer with an LLM
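The two phases above can be sketched end to end. This is a toy illustration, not a production pipeline: `embed` is a stand-in bag-of-words embedding (a real system would use an embedding model), and generation is replaced by assembling the prompt that would be sent to an LLM.

```python
from collections import Counter
import math

def embed(text: str, vocab: list[str]) -> list[float]:
    # Toy bag-of-words "embedding" (stand-in for a real embedding model)
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# --- Offline (indexing) phase ---
docs = [
    "refund policy: refunds are issued within 14 days",
    "shipping policy: orders ship within 2 business days",
]
vocab = sorted({w for d in docs for w in d.lower().split()})
index = [(d, embed(d, vocab)) for d in docs]  # plays the role of the vector DB

# --- Online (query) phase ---
query = "how many days until I get a refund"
q_vec = embed(query, vocab)
best_doc = max(index, key=lambda pair: cosine(q_vec, pair[1]))[0]

# Assemble the context + question prompt that would go to the LLM
prompt = f"Context:\n{best_doc}\n\nQuestion: {query}\nAnswer:"
print(best_doc)  # → the refund-policy document
```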
2. Text Embeddings
2.1 The Concept of Embeddings
An embedding converts text into a high-dimensional real-valued vector. The key insight is that semantically similar texts are positioned close together in the vector space.
For example:
- "The puppy is playing"
- "The dog is running"
These two sentences have embeddings with a very high cosine similarity.
2.2 Key Embedding Models
OpenAI Embeddings
- text-embedding-3-small: 1536 dimensions, fast and cost-effective
- text-embedding-3-large: 3072 dimensions, higher quality
- API-based, easy to use, paid service
Sentence-Transformers
- all-MiniLM-L6-v2: 384 dimensions, fast and general-purpose
- BAAI/bge-large-en-v1.5: 1024 dimensions, high performance
- Run locally, free to use
Multilingual Embedding Models
- BAAI/bge-m3: Multilingual support, strong across many languages
- intfloat/multilingual-e5-large: Strong multilingual performance
- paraphrase-multilingual-mpnet-base-v2: Good for cross-lingual retrieval
2.3 Retrieval via Cosine Similarity
Embedding-based retrieval computes the cosine similarity between the query embedding and stored document embeddings:
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('all-MiniLM-L6-v2')
documents = [
"Python is a widely used programming language for data science.",
"Machine learning is a field of AI that learns patterns from data.",
"Paris is the capital of France.",
"Deep learning is a machine learning method using neural networks.",
]
doc_embeddings = model.encode(documents)
print(f"Embedding shape: {doc_embeddings.shape}") # (4, 384)
query = "artificial intelligence and data analysis"
query_embedding = model.encode([query])[0]
def cosine_similarity(a, b):
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
similarities = [cosine_similarity(query_embedding, emb) for emb in doc_embeddings]
ranked = sorted(zip(similarities, documents), reverse=True)
print("\nRanked by similarity:")
for score, doc in ranked:
print(f" {score:.4f}: {doc}")
2.4 Evaluating Embedding Quality (MTEB)
MTEB (Massive Text Embedding Benchmark) is a systematic benchmark for evaluating embedding models across tasks including Retrieval, Classification, and Clustering.
Practical criteria for choosing embedding models in RAG:
- Performance in the target language (check language-specific benchmark scores)
- Performance relative to embedding dimensions (higher dimensions increase storage costs)
- Inference speed (important for real-time systems)
- License (check commercial use allowances)
3. Chunking Strategies
Chunking is one of the most critical design decisions for RAG performance. Chunks that are too large contain too much noise; chunks too small lack sufficient context.
3.1 Fixed-Size Chunking
The simplest approach. Splits uniformly by a specified character or token count.
from langchain.text_splitter import CharacterTextSplitter
text = """
Machine learning is a subfield of artificial intelligence that enables computers
to learn from data without being explicitly programmed. Methods include supervised
learning, unsupervised learning, and reinforcement learning, each solving different
types of problems.
"""
splitter = CharacterTextSplitter(
chunk_size=200,
chunk_overlap=20,
separator="\n"
)
chunks = splitter.split_text(text)
for i, chunk in enumerate(chunks):
print(f"Chunk {i}: {chunk[:80]}...")
3.2 Recursive Chunking
LangChain's RecursiveCharacterTextSplitter recursively splits on paragraphs, sentences, and words in order, preserving semantic boundaries as much as possible.
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=50,
separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""]
)
with open("document.txt", "r", encoding="utf-8") as f:
text = f.read()
chunks = splitter.split_text(text)
print(f"Total chunks: {len(chunks)}")
print(f"Average chunk length: {sum(len(c) for c in chunks) / len(chunks):.0f} chars")
3.3 Semantic Chunking
Uses embedding similarity to split at semantic boundaries. The split point is where the embedding similarity between adjacent sentences drops sharply.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
semantic_splitter = SemanticChunker(
embeddings=OpenAIEmbeddings(),
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=95 # split at top 5% similarity change points
)
chunks = semantic_splitter.create_documents([text])
print(f"Semantic chunking result: {len(chunks)} chunks")
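The breakpoint rule can be illustrated in isolation (this is a minimal sketch, not the SemanticChunker internals): split wherever the similarity between adjacent sentences falls into the lowest few percent. The similarity scores here are invented stand-ins for what an embedding model would produce.

```python
def split_at_breakpoints(sentences, similarities, percentile=5):
    # similarities[i] = similarity between sentences[i] and sentences[i+1].
    # Split wherever the similarity falls into the lowest `percentile` percent.
    assert len(similarities) == len(sentences) - 1
    cutoff = sorted(similarities)[max(0, int(len(similarities) * percentile / 100))]
    chunks, current = [], [sentences[0]]
    for sent, sim in zip(sentences[1:], similarities):
        if sim <= cutoff:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "RAG retrieves documents.",
    "It then generates answers.",
    "Paris is in France.",
    "France is in Europe.",
]
sims = [0.8, 0.1, 0.7]  # assumed adjacent-sentence embedding similarities
print(split_at_breakpoints(sentences, sims))
# → two chunks, split at the sharp similarity drop (0.1)
```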
3.4 Parent-Child Chunking
A strategy that uses small chunks (children) for retrieval and large chunks (parents) for context delivery.
- Child chunks: small and precise (higher retrieval accuracy)
- Parent chunks: large and comprehensive (sufficient context for the LLM)
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
vectorstore = Chroma(
collection_name="full_documents",
embedding_function=OpenAIEmbeddings()
)
store = InMemoryStore()
retriever = ParentDocumentRetriever(
vectorstore=vectorstore,
docstore=store,
child_splitter=child_splitter,
parent_splitter=parent_splitter,
)
3.5 Chunk Size Decision Guide
| Use Case | Recommended Chunk Size | Overlap |
|---|---|---|
| Fact retrieval (Q&A) | 200-400 tokens | 10-20% |
| Document summarization | 800-1200 tokens | 5-10% |
| Code retrieval | Function/class unit | None |
| Mixed content | 512 tokens | 50-100 tokens |
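As a worked example of the table's overlap percentages, here is a minimal sketch of fixed-size chunking with proportional overlap (using roughly the Q&A row's 15%); `tokens` is assumed to be an already-tokenized list.

```python
def chunk_tokens(tokens, chunk_size=300, overlap_ratio=0.15):
    # Fixed-size chunking where consecutive chunks share overlap_ratio
    # of their tokens, e.g. 15% of a 300-token chunk = 45 shared tokens.
    step = max(1, int(chunk_size * (1 - overlap_ratio)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

chunks = chunk_tokens(list(range(1000)))
print(len(chunks))                      # → 4
print(chunks[0][-45:] == chunks[1][:45])  # → True (45-token overlap)
```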
4. Vector Databases
A vector database is a database specialized for storing high-dimensional vectors and efficiently finding similar vectors.
4.1 Comparing Major Vector Databases
FAISS (Facebook AI Similarity Search)
- Library developed by Meta
- In-memory processing, very fast
- No production server required (it's a library)
- Optimized for large-scale batch processing
Chroma
- Open source, built-in embeddings
- Python-native API
- Ideal for development and prototyping
- Persistence supported (SQLite-based)
Pinecone
- Fully managed cloud service
- Enterprise-grade scaling
- Paid service, easy to operate
Weaviate
- Open source + cloud option
- Hybrid search built-in
- GraphQL API
Milvus
- High-performance open source
- Distributed architecture
- Scales to billions of vectors
pgvector
- PostgreSQL extension
- Leverages existing PostgreSQL infrastructure
- Vector search via SQL
4.2 ANN Algorithms
Exact nearest-neighbor search (KNN) takes time linear in the number of stored vectors, which becomes prohibitive at scale. Approximate nearest-neighbor (ANN) algorithms trade a small amount of recall for dramatically faster queries.
HNSW (Hierarchical Navigable Small World)
Supports fast search via a hierarchical graph structure.
- Insert: O(log n) on average
- Search: O(log n) on average
- High recall, fast queries
- Default algorithm in Chroma and Weaviate
IVF (Inverted File Index)
Divides data into clusters and searches only relevant clusters.
- Memory efficient
- Trade recall for speed via the nprobe parameter
- Commonly used with FAISS
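The IVF idea can be sketched in a few lines of NumPy. This is an illustration only: randomly sampled centroids stand in for the k-means training step a real FAISS IVF index performs, and searching probes just the `nprobe` nearest clusters instead of all vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, n_clusters, nprobe = 16, 1000, 8, 2

vectors = rng.standard_normal((n, d)).astype("float32")

# "Train": pick centroids (a real IVF index runs k-means here)
centroids = vectors[rng.choice(n, n_clusters, replace=False)]

# Build inverted lists: vector ids grouped by nearest centroid
assign = np.argmin(((vectors[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
inverted_lists = {c: np.where(assign == c)[0] for c in range(n_clusters)}

def ivf_search(query, k=5):
    # Probe only the nprobe closest clusters, not all n vectors
    probe = np.argsort(((centroids - query) ** 2).sum(-1))[:nprobe]
    candidates = np.concatenate([inverted_lists[c] for c in probe])
    dists = ((vectors[candidates] - query) ** 2).sum(-1)
    return candidates[np.argsort(dists)[:k]]

query = rng.standard_normal(d).astype("float32")
print(ivf_search(query))  # ids of the (approximate) 5 nearest vectors
```

Raising `nprobe` probes more clusters, recovering recall at the cost of speed, which is exactly the trade-off mentioned above.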
4.3 FAISS vs Chroma Implementation Comparison
import numpy as np
import faiss
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.schema import Document
# ===== FAISS Direct Usage =====
d = 384 # vector dimension
n = 10000 # number of documents
vectors = np.random.randn(n, d).astype('float32')
# Flat L2 index (exact search)
index_flat = faiss.IndexFlatL2(d)
index_flat.add(vectors)
print(f"FAISS index size: {index_flat.ntotal}")
# Search
query = np.random.randn(1, d).astype('float32')
k = 5
distances, indices = index_flat.search(query, k)
print(f"Top {k} results: {indices[0]}")
# HNSW index (approximate, faster)
index_hnsw = faiss.IndexHNSWFlat(d, 32)
index_hnsw.add(vectors)
distances_hnsw, indices_hnsw = index_hnsw.search(query, k)
print(f"HNSW results: {indices_hnsw[0]}")
# ===== LangChain + Chroma =====
documents = [
Document(page_content="Python is used for data science.", metadata={"source": "doc1"}),
Document(page_content="Machine learning learns patterns from data.", metadata={"source": "doc2"}),
Document(page_content="Deep learning is neural network-based ML.", metadata={"source": "doc3"}),
Document(page_content="NLP analyzes text data.", metadata={"source": "doc4"}),
]
embeddings = OpenAIEmbeddings()
chroma_db = Chroma.from_documents(
documents,
embeddings,
persist_directory="./chroma_db"
)
results = chroma_db.similarity_search("AI and machine learning", k=2)
for doc in results:
print(f"Source: {doc.metadata['source']}, Content: {doc.page_content}")
results_with_score = chroma_db.similarity_search_with_score("deep learning", k=2)
for doc, score in results_with_score:
print(f"Score: {score:.4f}, Content: {doc.page_content}")
5. Basic RAG Pipeline Implementation
5.1 Complete RAG Pipeline with LangChain
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
# ===== 1. Document Loading =====
loader = PyPDFLoader("company_handbook.pdf")
pages = loader.load()
print(f"Pages loaded: {len(pages)}")
# ===== 2. Text Splitting =====
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=50,
separators=["\n\n", "\n", ".", "!", "?", ",", " "]
)
chunks = text_splitter.split_documents(pages)
print(f"Chunks created: {len(chunks)}")
# ===== 3. Embedding and Vector DB Storage =====
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
chunks,
embeddings,
persist_directory="./rag_db"
)
print("Vector DB saved")
# ===== 4. RAG Chain Setup =====
prompt_template = """You are a helpful AI assistant.
Answer the question using ONLY the provided context.
If the answer is not in the context, say "I cannot find that in the provided documents."
Context:
{context}
Question: {question}
Answer:"""
PROMPT = PromptTemplate(
template=prompt_template,
input_variables=["context", "question"]
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
retriever = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": 4}
)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=retriever,
chain_type_kwargs={"prompt": PROMPT},
return_source_documents=True
)
# ===== 5. Question Answering =====
query = "What is the vacation policy?"
result = qa_chain.invoke({"query": query})
print(f"\nQuestion: {query}")
print(f"Answer: {result['result']}")
print(f"\nSource documents:")
for doc in result['source_documents']:
print(f" - Page {doc.metadata.get('page', '?')}: {doc.page_content[:100]}...")
5.2 RAG with LlamaIndex
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)
documents = SimpleDirectoryReader("./docs/").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(
similarity_top_k=4,
response_mode="compact"
)
response = query_engine.query("What are the main topics covered in these documents?")
print(f"Answer: {response}")
print("\nSource nodes:")
for node in response.source_nodes:
print(f" - Score: {node.score:.4f}")
print(f" Text: {node.text[:100]}...")
6. Advanced Retrieval Techniques
6.1 Hybrid Search
Combining vector search (semantic) with BM25 (keyword-based) captures the strengths of both approaches.
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
# BM25 retriever (keyword-based)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 4
# Vector retriever (semantic)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
# Ensemble (hybrid)
ensemble_retriever = EnsembleRetriever(
retrievers=[bm25_retriever, vector_retriever],
weights=[0.5, 0.5]
)
results = ensemble_retriever.invoke("Python programming tutorial")
print(f"Hybrid search results: {len(results)}")
6.2 Multi-Query Retrieval
Rewrites a single question into multiple different phrasings to broaden the retrieval coverage.
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(temperature=0)
multi_query_retriever = MultiQueryRetriever.from_llm(
retriever=vectorstore.as_retriever(),
llm=llm
)
# Internally, the LLM rewrites the question into multiple versions
# e.g. "What are the advantages of RAG?"
# → "What are the benefits of retrieval-augmented generation?"
# → "Why is RAG better than standard LLMs?"
# → "What makes RAG useful?"
results = multi_query_retriever.invoke("What are the advantages of RAG?")
print(f"Multi-query retrieval results: {len(results)}")
6.3 MMR (Maximal Marginal Relevance)
Considering only similarity can lead to selecting redundant chunks. MMR balances relevance and diversity simultaneously.
mmr_retriever = vectorstore.as_retriever(
search_type="mmr",
search_kwargs={
"k": 4, # final number to return
"fetch_k": 20, # initial candidate pool
"lambda_mult": 0.5 # balance: 0=diversity, 1=similarity
}
)
results = mmr_retriever.invoke("machine learning algorithms")
print(f"MMR results: {len(results)}")
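What the retriever does under the hood can be sketched directly. This is a minimal NumPy implementation of the MMR greedy selection rule (score = lambda_mult * relevance − (1 − lambda_mult) * redundancy), not LangChain's internal code.

```python
import numpy as np

def mmr_select(query_vec, doc_vecs, k=4, lambda_mult=0.5):
    # Greedily pick documents that are relevant to the query
    # but dissimilar to the documents already selected.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    query_sims = np.array([cos(query_vec, d) for d in doc_vecs])
    selected = [int(np.argmax(query_sims))]  # start with the most relevant doc
    while len(selected) < min(k, len(doc_vecs)):
        best_i, best_score = None, -np.inf
        for i in range(len(doc_vecs)):
            if i in selected:
                continue
            redundancy = max(cos(doc_vecs[i], doc_vecs[j]) for j in selected)
            score = lambda_mult * query_sims[i] - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
    return selected
```

With `lambda_mult=1.0` this degenerates to plain similarity ranking; lowering it penalizes documents that resemble ones already chosen.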
6.4 Metadata Filtering
Add metadata conditions to searches to narrow the scope.
from langchain.schema import Document
docs_with_metadata = [
Document(
page_content="Q1 2024 revenue was $10M.",
metadata={"year": 2024, "quarter": "Q1", "category": "financial"}
),
Document(
page_content="Q2 2024 revenue was $12M.",
metadata={"year": 2024, "quarter": "Q2", "category": "financial"}
),
Document(
page_content="Technology roadmap: AI feature enhancements planned.",
metadata={"year": 2024, "quarter": "Q1", "category": "strategy"}
),
]
# Narrow search with metadata filter
filtered_results = vectorstore.similarity_search(
"revenue performance",
k=2,
filter={"category": "financial", "year": 2024}
)
7. Reranking
Reranking reorders retrieval results with a more precise model to improve ranking quality. It enables a two-stage strategy: high recall from the initial retrieval, high precision from the reranker.
7.1 Cross-Encoder Reranker
Cross-encoders (which encode both texts together) are more accurate than bi-encoders (which encode texts separately then compute similarity).
from sentence_transformers import CrossEncoder
import numpy as np
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
query = "types of machine learning algorithms"
initial_results = vectorstore.similarity_search(query, k=20) # retrieve many
# Re-score with cross-encoder
pairs = [[query, doc.page_content] for doc in initial_results]
scores = cross_encoder.predict(pairs)
# Sort by score
ranked = sorted(zip(scores, initial_results), reverse=True)
top_k = [doc for _, doc in ranked[:5]]
print("Top results after reranking:")
for score, doc in ranked[:3]:
print(f" Score {score:.4f}: {doc.page_content[:80]}...")
7.2 Cohere Rerank API
from langchain.retrievers.document_compressors import CohereRerank
from langchain.retrievers import ContextualCompressionRetriever
compressor = CohereRerank(
cohere_api_key="your-api-key",
top_n=3,
model="rerank-multilingual-v3.0"
)
compression_retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20})
)
results = compression_retriever.invoke("Tell me about company policies")
print(f"Reranked documents: {len(results)}")
7.3 BGE Reranker (Open Source)
from FlagEmbedding import FlagReranker
reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True)
query = "What is RAG?"
passages = [
"RAG stands for Retrieval-Augmented Generation.",
"A rag is a piece of cloth used for cleaning.",
"RAG systems combine retrieval with generation for better LLM responses.",
]
scores = reranker.compute_score([[query, p] for p in passages])
ranked = sorted(zip(scores, passages), reverse=True)
for score, passage in ranked:
print(f" {score:.4f}: {passage}")
8. HyDE (Hypothetical Document Embeddings)
8.1 The Idea Behind HyDE
Standard RAG directly compares query embeddings with document embeddings. However, a short query's embedding may be far from a long document's embedding in the semantic space.
HyDE's solution: have the LLM generate a hypothetical answer document, then use that hypothetical document's embedding for retrieval.
Query → LLM generates hypothetical answer → embed hypothetical answer → retrieve real documents
8.2 Implementing HyDE
from langchain.chains import HypotheticalDocumentEmbedder
from langchain_openai import OpenAI, OpenAIEmbeddings, ChatOpenAI
llm = OpenAI()
embeddings = OpenAIEmbeddings()
# Using LangChain's built-in HyDE
hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
llm=llm,
embeddings=embeddings,
prompt_key="web_search"
)
# Manual HyDE implementation
def manual_hyde(query, llm, embeddings, vectorstore, k=4):
# 1. Generate hypothetical document
hypothetical_doc = llm.invoke(
f"Write a detailed answer to the following question: {query}"
)
# 2. Embed hypothetical document
hyp_embedding = embeddings.embed_query(hypothetical_doc.content)
# 3. Search with hypothetical embedding
results = vectorstore.similarity_search_by_vector(hyp_embedding, k=k)
return results, hypothetical_doc.content
chat_llm = ChatOpenAI(temperature=0.7)
results, hyp_doc = manual_hyde(
"History of deep learning", chat_llm, embeddings, vectorstore
)
print(f"Hypothetical document: {hyp_doc[:200]}...")
print(f"Retrieved actual documents: {len(results)}")
9. Advanced RAG Architectures
9.1 Self-RAG
Self-RAG (2023, Asai et al.) allows the LLM to judge whether retrieval is necessary and critically evaluate the relevance of retrieved documents and the quality of responses.
Uses four special tokens:
- [Retrieve]: Is retrieval needed? (Yes/No)
- [IsRel]: Is the retrieved document relevant? (Relevant/Irrelevant)
- [IsSup]: Is the response supported by documents? (Supported/Partially/Not)
- [IsUse]: Is the response useful? (score 1-5)
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
class SelfRAGSimulator:
"""Simulates Self-RAG behavior (real Self-RAG requires specially trained models)"""
def __init__(self, retriever, llm):
self.retriever = retriever
self.llm = llm
def should_retrieve(self, query: str) -> bool:
prompt = ChatPromptTemplate.from_template("""
Determine whether answering the following question requires searching external documents.
Answer 'NO' if it can be answered with general knowledge or reasoning alone.
Answer 'YES' if specific facts or specialized knowledge are needed.
Answer only YES or NO.
Question: {query}
Decision:""")
response = self.llm.invoke(prompt.format_messages(query=query))
return "YES" in response.content.upper()
def is_relevant(self, query: str, doc_content: str) -> bool:
prompt = ChatPromptTemplate.from_template("""
Determine if the following document is relevant to the question.
Answer only RELEVANT or IRRELEVANT.
Question: {query}
Document: {doc}
Decision:""")
response = self.llm.invoke(
prompt.format_messages(query=query, doc=doc_content[:500])
)
return "RELEVANT" in response.content.upper()
def generate_with_reflection(self, query: str) -> str:
# 1. Determine if retrieval is needed
need_retrieve = self.should_retrieve(query)
print(f"Retrieval needed: {need_retrieve}")
if not need_retrieve:
response = self.llm.invoke(query)
return response.content
# 2. Retrieve documents
docs = self.retriever.invoke(query)
# 3. Filter by relevance
relevant_docs = [d for d in docs if self.is_relevant(query, d.page_content)]
print(f"Relevant documents: {len(relevant_docs)}/{len(docs)}")
if not relevant_docs:
return "No relevant documents found. Answering from general knowledge: " + \
self.llm.invoke(query).content
# 4. Generate answer with context
context = "\n\n".join([d.page_content for d in relevant_docs[:3]])
prompt = f"""Use the context to answer the question.
Context: {context}
Question: {query}
Answer:"""
return self.llm.invoke(prompt).content
9.2 Corrective RAG (CRAG)
CRAG evaluates the quality of retrieved documents and supplements with web search if quality is low.
from langchain_community.tools.tavily_search import TavilySearchResults
from typing import List, Tuple
class CorrectiveRAG:
def __init__(self, retriever, llm):
self.retriever = retriever
self.llm = llm
self.web_search = TavilySearchResults(max_results=3)
def evaluate_documents(self, query: str, docs: list) -> Tuple[str, List]:
"""
Evaluate document relevance.
Returns: ("CORRECT"|"INCORRECT"|"AMBIGUOUS", filtered documents)
"""
evaluation_prompt = """Evaluate the relevance of retrieved documents for the question.
- CORRECT: Documents can directly answer the question
- INCORRECT: Documents are not related to the question
- AMBIGUOUS: Partially related but incomplete
Question: {query}
Documents:
{docs}
Evaluation (CORRECT/INCORRECT/AMBIGUOUS):"""
docs_text = "\n---\n".join([d.page_content[:300] for d in docs[:4]])
response = self.llm.invoke(
evaluation_prompt.format(query=query, docs=docs_text)
)
evaluation = response.content.strip().upper()
if "CORRECT" in evaluation:
return "CORRECT", docs
elif "INCORRECT" in evaluation:
return "INCORRECT", []
else:
return "AMBIGUOUS", docs
def run(self, query: str) -> str:
# 1. Initial retrieval
docs = self.retriever.invoke(query)
# 2. Evaluate document quality
status, filtered_docs = self.evaluate_documents(query, docs)
print(f"Document evaluation: {status}")
# 3. Handle each case
if status == "INCORRECT":
print("Supplementing with web search...")
web_results = self.web_search.invoke(query)
context = "\n".join([r['content'] for r in web_results])
elif status == "AMBIGUOUS":
web_results = self.web_search.invoke(query)
web_context = "\n".join([r['content'] for r in web_results])
doc_context = "\n".join([d.page_content for d in filtered_docs[:2]])
context = doc_context + "\n\n[Web Search Supplement]\n" + web_context
else:
context = "\n\n".join([d.page_content for d in filtered_docs[:4]])
# 4. Generate final response
response = self.llm.invoke(
f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
)
return response.content
9.3 Adaptive RAG
Dynamically selects retrieval strategies based on query complexity.
class AdaptiveRAG:
def __init__(self, simple_retriever, advanced_retriever, llm):
self.simple_retriever = simple_retriever # basic vector search
self.advanced_retriever = advanced_retriever # hybrid + reranking
self.llm = llm
def classify_query(self, query: str) -> str:
prompt = f"""Classify the complexity of the following question:
- simple: Simple fact check or directly answerable
- complex: Requires combining multiple sources or multi-step reasoning
Question: {query}
Classification (simple/complex):"""
response = self.llm.invoke(prompt)
return "complex" if "complex" in response.content.lower() else "simple"
def run(self, query: str) -> str:
query_type = self.classify_query(query)
print(f"Query type: {query_type}")
if query_type == "simple":
docs = self.simple_retriever.invoke(query)
else:
docs = self.advanced_retriever.invoke(query)
context = "\n\n".join([d.page_content for d in docs])
return self.llm.invoke(
f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
).content
9.4 GraphRAG (Microsoft)
Microsoft's GraphRAG constructs a knowledge graph from documents and uses graph structure for retrieval.
Key ideas:
- Extract entities (people, places, concepts) and relationships from documents
- Group related entities using community detection algorithms
- Generate summaries for each community
- For global queries use community summaries; for local queries use graph traversal
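The first two steps can be illustrated with a toy sketch. The triples are invented examples, and connected components stand in for the Leiden community detection that GraphRAG actually uses; in the real pipeline each community would then get an LLM-generated summary.

```python
from collections import defaultdict

# Toy (entity, relation, entity) triples, as an extraction step might produce
triples = [
    ("Company X", "acquired", "Startup Y"),
    ("Startup Y", "builds", "Vision models"),
    ("Company Z", "partners_with", "University W"),
]

# Build an undirected entity graph from the triples
graph = defaultdict(set)
for head, _, tail in triples:
    graph[head].add(tail)
    graph[tail].add(head)

def communities(graph):
    # Connected components as a simple stand-in for Leiden communities
    seen, result = set(), []
    for node in graph:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            cur = stack.pop()
            if cur in comp:
                continue
            comp.add(cur)
            stack.extend(graph[cur] - comp)
        seen |= comp
        result.append(comp)
    return result

for comp in communities(graph):
    # GraphRAG would summarize each community with an LLM here
    print(sorted(comp))
```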
# Install and initialize GraphRAG
pip install graphrag
# Initialize project
python -m graphrag.index --init --root ./ragtest
# Edit settings then index
python -m graphrag.index --root ./ragtest
# Global search (requires understanding of full document set)
python -m graphrag.query --root ./ragtest --method global "What are the main themes?"
# Local search (focused on specific entities)
python -m graphrag.query --root ./ragtest --method local "Tell me about company X"
10. RAG Evaluation Metrics
Objectively measuring the quality of a RAG system is essential for improvement.
10.1 RAGAS (RAG Assessment)
RAGAS is a framework for automatically evaluating RAG pipelines.
Key metrics:
- Faithfulness: How faithful is the answer to the context? (measures hallucination)
- Answer Relevancy: How relevant is the answer to the question?
- Context Recall: How well were the relevant contexts retrieved?
- Context Precision: What proportion of retrieved contexts were actually useful?
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_recall,
context_precision
)
from datasets import Dataset
evaluation_data = {
"question": [
"What is the company's annual leave policy?",
"What is the remote work policy?",
],
"answer": [
"Employees receive 15 days of annual leave after 1 year, with 1 additional day per year.",
"Employees may work remotely 3 days per week.",
],
"contexts": [
["Employees are granted 15 days of annual leave after 1 year of employment. One additional day is added each subsequent year."],
["Employees may work remotely up to 2 days per week. Additional days may be approved with permission."],
],
"ground_truth": [
"15 days after 1 year, 1 additional day per year",
"2 days remote work by default, additional with approval",
]
}
dataset = Dataset.from_dict(evaluation_data)
result = evaluate(
dataset,
metrics=[faithfulness, answer_relevancy, context_recall, context_precision]
)
print(result)
# faithfulness: 0.75 (detects inconsistency between answer and context)
# answer_relevancy: 0.92
# context_recall: 0.85
# context_precision: 0.78
10.2 Production Evaluation Pipeline
from typing import List, Dict, Optional
import time
class RAGEvaluator:
def __init__(self, rag_chain, llm):
self.rag_chain = rag_chain
self.llm = llm
def evaluate_faithfulness(self, answer: str, context: str) -> float:
prompt = f"""Evaluate whether the following answer is based solely on information in the given context.
Score from 0.0 (not at all) to 1.0 (completely faithful).
Context: {context}
Answer: {answer}
Faithfulness score (number only):"""
response = self.llm.invoke(prompt)
try:
return float(response.content.strip())
except:
return 0.5
def evaluate_answer_relevancy(self, question: str, answer: str) -> float:
prompt = f"""Evaluate how relevant the following answer is to the question.
Score from 0.0 to 1.0.
Question: {question}
Answer: {answer}
Relevancy score (number only):"""
response = self.llm.invoke(prompt)
try:
return float(response.content.strip())
except:
return 0.5
def run_evaluation(self, test_cases: List[Dict]) -> Dict:
results = []
for case in test_cases:
question = case["question"]
result = self.rag_chain.invoke({"query": question})
answer = result["result"]
context = "\n".join([d.page_content for d in result["source_documents"]])
faithfulness_score = self.evaluate_faithfulness(answer, context)
relevancy_score = self.evaluate_answer_relevancy(question, answer)
results.append({
"question": question,
"answer": answer,
"faithfulness": faithfulness_score,
"relevancy": relevancy_score,
})
avg_faithfulness = sum(r["faithfulness"] for r in results) / len(results)
avg_relevancy = sum(r["relevancy"] for r in results) / len(results)
return {
"results": results,
"avg_faithfulness": avg_faithfulness,
"avg_relevancy": avg_relevancy,
"overall_score": (avg_faithfulness + avg_relevancy) / 2
}
11. Production RAG Systems
11.1 Caching Strategies
import hashlib
import json
import redis
class CachedRAGSystem:
def __init__(self, rag_chain, redis_client=None, ttl=3600):
self.rag_chain = rag_chain
self.redis = redis_client
self.ttl = ttl
def _get_cache_key(self, query: str) -> str:
return f"rag:{hashlib.md5(query.encode()).hexdigest()}"
def query(self, query: str) -> dict:
cache_key = self._get_cache_key(query)
# Check cache
if self.redis:
cached = self.redis.get(cache_key)
if cached:
print("Cache hit!")
return json.loads(cached)
# Execute RAG
result = self.rag_chain.invoke({"query": query})
response = {
"answer": result["result"],
"sources": [d.metadata for d in result["source_documents"]]
}
# Store in cache
if self.redis:
self.redis.setex(cache_key, self.ttl, json.dumps(response))
return response
# LLM response caching
from langchain.globals import set_llm_cache
from langchain_community.cache import InMemoryCache, SQLiteCache
set_llm_cache(InMemoryCache()) # development
set_llm_cache(SQLiteCache(database_path=".langchain.db")) # production
11.2 Streaming Responses
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain_openai import ChatOpenAI
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

streaming_llm = ChatOpenAI(
    model="gpt-4o-mini",
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()]
)

async def generate_rag_stream(query: str):
    docs = retriever.invoke(query)
    context = "\n\n".join([d.page_content for d in docs])
    async for chunk in streaming_llm.astream(
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    ):
        if chunk.content:
            yield f"data: {chunk.content}\n\n"

@app.get("/rag/stream")
async def rag_stream_endpoint(query: str):
    return StreamingResponse(
        generate_rag_stream(query),
        media_type="text/event-stream"
    )
11.3 Cost Optimization
# Track token usage
from langchain.callbacks import get_openai_callback

with get_openai_callback() as cb:
    result = qa_chain.invoke({"query": "Your question here"})
    print(f"Total tokens: {cb.total_tokens}")
    print(f"Prompt tokens: {cb.prompt_tokens}")
    print(f"Completion tokens: {cb.completion_tokens}")
    print(f"Cost: ${cb.total_cost:.6f}")

# Context compression to save tokens
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.retrievers import ContextualCompressionRetriever

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 8})
)

# Only pass compressed (relevant) portions to the LLM
compressed_docs = compression_retriever.invoke("your query")
total_tokens_estimate = sum(len(d.page_content.split()) for d in compressed_docs)
print(f"Compressed context tokens (estimate): {total_tokens_estimate}")
11.4 Monitoring
import time
import logging
from dataclasses import dataclass
from typing import Optional

@dataclass
class RAGMetrics:
    query: str
    retrieval_time: float = 0.0
    generation_time: float = 0.0
    num_docs_retrieved: int = 0
    answer_length: int = 0
    error: Optional[str] = None

class MonitoredRAGSystem:
    def __init__(self, rag_chain, retriever, logger=None):
        self.rag_chain = rag_chain
        self.retriever = retriever
        self.logger = logger or logging.getLogger(__name__)
        self.metrics_history = []

    def query(self, query: str) -> dict:
        metrics = RAGMetrics(query=query)
        start_total = time.time()
        try:
            # Time retrieval separately (the chain also retrieves internally;
            # this call is only for measurement)
            retrieval_start = time.time()
            docs = self.retriever.invoke(query)
            metrics.retrieval_time = time.time() - retrieval_start
            metrics.num_docs_retrieved = len(docs)

            gen_start = time.time()
            result = self.rag_chain.invoke({"query": query})
            metrics.generation_time = time.time() - gen_start
            metrics.answer_length = len(result["result"])
        except Exception as e:
            metrics.error = str(e)
            self.logger.error(f"RAG error: {e}")
            raise
        finally:
            total_time = time.time() - start_total
            self.metrics_history.append(metrics)
            self.logger.info(
                f"Query processed | "
                f"Retrieval: {metrics.retrieval_time:.2f}s | "
                f"Generation: {metrics.generation_time:.2f}s | "
                f"Total: {total_time:.2f}s | "
                f"Docs: {metrics.num_docs_retrieved}"
            )
        return result

    def get_stats(self) -> dict:
        if not self.metrics_history:
            return {}
        retrieval_times = [m.retrieval_time for m in self.metrics_history if not m.error]
        gen_times = [m.generation_time for m in self.metrics_history if not m.error]
        return {
            "total_queries": len(self.metrics_history),
            "error_rate": sum(1 for m in self.metrics_history if m.error) / len(self.metrics_history),
            "avg_retrieval_time": sum(retrieval_times) / len(retrieval_times) if retrieval_times else 0,
            "avg_generation_time": sum(gen_times) / len(gen_times) if gen_times else 0,
        }
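Averages hide tail latency, and in production the slow outliers are what users notice. A small nearest-rank percentile helper (a framework-independent sketch) can be layered on top of the collected timings to report p95 alongside the averages:

```python
def percentile(values: list[float], pct: float) -> float:
    """Return the pct-th percentile (0-100) of values using the nearest-rank method."""
    if not values:
        return 0.0
    ordered = sorted(values)
    # Nearest rank: ceil(len * pct / 100), computed via floor division on negatives
    rank = max(1, -(-len(ordered) * pct // 100))
    return ordered[int(rank) - 1]

# Example: retrieval times collected by a monitoring layer
times = [0.12, 0.15, 0.11, 0.95, 0.13, 0.14, 0.16, 0.12, 0.13, 1.20]
print(f"avg: {sum(times) / len(times):.2f}s, p95: {percentile(times, 95):.2f}s")
```

Here the average (0.32s) looks healthy while p95 (1.20s) exposes the slow tail.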
12. Production RAG Checklist
Key considerations when building a production RAG system.
Document Processing
- Support diverse file formats (PDF, Word, HTML, Markdown)
- Preserve metadata (source, date, author)
- Strategy for images and tables
- Support for incremental updates
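As a sketch of the multi-format and metadata points above: route each file to a handler by extension and attach source metadata at load time. The loader functions here are hypothetical placeholders — in practice you would plug in real parsers (pypdf, python-docx, etc.):

```python
from pathlib import Path

# Hypothetical parse function — substitute real loaders per format
def load_text(path: Path) -> str:
    return path.read_text(encoding="utf-8", errors="replace")

LOADERS = {
    ".md": load_text,
    ".txt": load_text,
    # ".pdf": load_pdf, ".docx": load_docx, ...
}

def load_document(path_str: str) -> dict:
    """Load a file and preserve its metadata alongside the content."""
    path = Path(path_str)
    loader = LOADERS.get(path.suffix.lower())
    if loader is None:
        raise ValueError(f"Unsupported format: {path.suffix}")
    return {
        "content": loader(path),
        "metadata": {"source": str(path), "format": path.suffix.lower()},
    }
```

Keeping the format-to-loader map in one place also makes incremental format support a one-line change.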
Retrieval Quality
- Choose embedding models suited to the domain and language
- Consider hybrid search (keyword + semantic)
- Tune chunk size and overlap appropriately
- Use reranking to improve precision
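To make the chunk size and overlap trade-off concrete, here is a minimal character-based sliding-window chunker (a sketch — production splitters such as LangChain's RecursiveCharacterTextSplitter additionally respect sentence and paragraph boundaries):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks; each chunk repeats the last
    `overlap` characters of the previous one so context spans boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last chunk reached; avoid a trailing overlap-only fragment
    return chunks

# Smaller chunks → more precise matches; larger chunks → more context per hit.
print(chunk_text("abcdefghij", chunk_size=4, overlap=2))  # → ['abcd', 'cdef', 'efgh', 'ghij']
```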
LLM Integration
- Clear system prompt (emphasize using only context)
- Require source citation
- Allow expression of uncertainty
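The three points above can be combined into a single system prompt. The wording below is only an illustrative starting point, not a canonical template:

```python
RAG_SYSTEM_PROMPT = """You are an assistant that answers strictly from the provided context.

Rules:
1. Use ONLY the information in the context below. Do not rely on prior knowledge.
2. Cite the source of every claim, e.g. [source: doc_name].
3. If the context does not contain the answer, say "I don't have enough
   information to answer this" instead of guessing.

Context:
{context}
"""

prompt = RAG_SYSTEM_PROMPT.format(context="(retrieved documents go here)")
```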
Operations
- Response caching to reduce costs
- Monitor token usage
- A/B test chunking and retrieval parameters
- Automated evaluation pipeline (RAGAS)
Conclusion
RAG is the most practical approach to overcoming the limitations of LLMs. To summarize:
- Basic RAG: chunking → embedding → vector DB → retrieval → generation
- Improving retrieval quality: hybrid search, MMR, reranking
- Advanced architectures: Self-RAG, CRAG, HyDE, GraphRAG
- Evaluation: measure faithfulness and relevancy with RAGAS
- Production: caching, monitoring, cost optimization
The performance of a RAG system depends on the harmony of the entire pipeline, not on any single component. That said, chunking strategy and embedding model selection tend to have the largest impact on retrieval quality in practice, so focusing on these two elements usually provides the highest ROI.
References
- Lewis et al. (2020), "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (arXiv:2005.11401)
- Asai et al. (2023), "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection" (arXiv:2310.11511)
- Gao et al. (2022), "Precise Zero-Shot Dense Retrieval without Relevance Labels" — HyDE (arXiv:2212.10496)
- Microsoft GraphRAG: https://microsoft.github.io/graphrag/
- LangChain documentation: https://python.langchain.com/docs/
- LlamaIndex documentation: https://docs.llamaindex.ai/
- RAGAS documentation: https://docs.ragas.io/
- FAISS: https://faiss.ai/
- Chroma: https://www.trychroma.com/