RAG Systems Complete Guide: Everything About Retrieval-Augmented Generation

LLMs like GPT-4 and Claude are remarkable, but they have fundamental limitations. They lack knowledge beyond their training cutoff, they may be unfamiliar with specialized domain knowledge, and they sometimes generate convincing but incorrect information — a phenomenon called hallucination. RAG (Retrieval-Augmented Generation) is the most practical architecture for solving these problems.

This guide starts from basic RAG and progresses to state-of-the-art architectures like Self-RAG, Corrective RAG, and GraphRAG, with complete production-ready code throughout.

1. What is RAG?

1.1 The Knowledge Limits of LLMs

LLMs are pre-trained on vast corpora of text, but face three fundamental limitations.

**Knowledge Cutoff**: LLMs cannot know about events or documents that appeared after their training data was collected. Every model, GPT-4 included, has a cutoff date beyond which it has no information.

**Hallucination**: LLMs are probabilistic language models. Rather than saying "I don't know," they tend to generate plausible-sounding but incorrect information. This is especially common with specific facts, dates, citations, and numbers.

**Lack of Domain Expertise**: Internal company documents, the latest technical specifications, and expertise in domains like medicine and law are difficult to fully encode in a general-purpose LLM.

1.2 The Core Idea of RAG

The core idea of RAG is simple: **before the LLM generates an answer, first retrieve relevant information and provide it as context.**

User question → Retrieve relevant documents → Provide [documents + question] to LLM → Generate answer

That is the entirety of it. But the details — _how_ to retrieve, _how_ to prepare documents, _how_ to pass context to the LLM — determine the quality of the system.

1.3 RAG vs Fine-tuning

| Criterion              | RAG                            | Fine-tuning              |
| ---------------------- | ------------------------------ | ------------------------ |
| Knowledge updates      | Real-time, just swap documents | Requires retraining      |
| Cost                   | Relatively low                 | High (GPU required)      |
| Source tracing         | Can cite source documents      | Opaque                   |
| Domain-specific format | Difficult                      | Works well               |
| Current information    | Strength                       | Only up to training date |
| Hallucination          | Lower (grounded in documents)  | Can still occur          |

In many cases RAG is more practical. However, when output format, style, or specialized domain reasoning is required, fine-tuning can be used as a complement.

1.4 RAG System Architecture Overview

The full RAG pipeline is divided into two phases.

**Offline (Indexing) Phase**:

1. Document collection (PDF, HTML, DB, etc.)

2. Text chunking (splitting)

3. Embedding generation

4. Storage in a vector database

**Online (Query) Phase**:

1. Embed the user query

2. Retrieve similar chunks from the vector database

3. Assemble context

4. Generate answer with an LLM
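The two phases above can be sketched end to end with a toy in-memory index. This is only an illustration under simplifying assumptions: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, the "vector database" is a plain Python list, and the final LLM call is left out, so the function returns the assembled prompt instead of an answer.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().replace(".", " ").replace("?", " ").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(docs, chunk_size=8):
    """Offline phase: chunk by word count, embed each chunk, store (vector, chunk) pairs."""
    index = []
    for doc in docs:
        words = doc.split()
        for start in range(0, len(words), chunk_size):
            chunk = " ".join(words[start:start + chunk_size])
            index.append((embed(chunk), chunk))
    return index

def build_prompt(index, question, k=2):
    """Online phase: embed the query, retrieve the top-k chunks, assemble the LLM prompt."""
    q_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[0]), reverse=True)
    context = "\n".join(chunk for _, chunk in ranked[:k])
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

docs = [
    "Employees receive 15 days of annual leave.",
    "The office cafeteria opens at noon.",
]
index = build_index(docs)
prompt = build_prompt(index, "How many days of annual leave do employees receive?", k=1)
print(prompt)
```

In a real system each piece is swapped for a production component: a trained embedding model, a vector database, and an LLM call on the returned prompt.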

2. Text Embeddings

2.1 The Concept of Embeddings

An embedding converts text into a high-dimensional real-valued vector. The key insight is that **semantically similar texts are positioned close together in the vector space**.

For example:

- "The puppy is playing"

- "The dog is running"

These two sentences have embeddings with a very high cosine similarity.

2.2 Key Embedding Models

**OpenAI Embeddings**

- `text-embedding-3-small`: 1536 dimensions, fast and cost-effective

- `text-embedding-3-large`: 3072 dimensions, higher quality

- API-based, easy to use, paid service

**Sentence-Transformers**

- `all-MiniLM-L6-v2`: 384 dimensions, fast and general-purpose

- `BAAI/bge-large-en-v1.5`: 1024 dimensions, high performance

- Run locally, free to use

**Multilingual Embedding Models**

- `BAAI/bge-m3`: Multilingual support, strong across many languages

- `intfloat/multilingual-e5-large`: Strong multilingual performance

- `paraphrase-multilingual-mpnet-base-v2`: Good for cross-lingual retrieval

2.3 Retrieval via Cosine Similarity

Embedding-based retrieval computes the cosine similarity between the query embedding and stored document embeddings:

$$\text{similarity}(q, d) = \frac{q \cdot d}{\|q\| \|d\|}$$

import numpy as np
from sentence_transformers import SentenceTransformer

# Load a small general-purpose embedding model (384-dimensional output)
model = SentenceTransformer('all-MiniLM-L6-v2')

documents = [
    "Python is a widely used programming language for data science.",
    "Machine learning is a field of AI that learns patterns from data.",
    "Paris is the capital of France.",
    "Deep learning is a machine learning method using neural networks.",
]

# Embed all documents at once
doc_embeddings = model.encode(documents)
print(f"Embedding shape: {doc_embeddings.shape}")  # (4, 384)

query = "artificial intelligence and data analysis"
query_embedding = model.encode([query])[0]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Score every document against the query, then sort from most to least similar
similarities = [cosine_similarity(query_embedding, emb) for emb in doc_embeddings]
ranked = sorted(zip(similarities, documents), reverse=True)

print("\nRanked by similarity:")
for score, doc in ranked:
    print(f"  {score:.4f}: {doc}")

2.4 Evaluating Embedding Quality (MTEB)

MTEB (Massive Text Embedding Benchmark) is a systematic benchmark for evaluating embedding models across tasks including Retrieval, Classification, and Clustering.

Practical criteria for choosing embedding models in RAG:

1. Performance in the target language (check language-specific benchmark scores)

2. Performance relative to embedding dimensions (higher dimensions increase storage costs)

3. Inference speed (important for real-time systems)

4. License (check commercial use allowances)

3. Chunking Strategies

Chunking is one of the most critical design decisions for RAG performance. Chunks that are too large contain too much noise; chunks too small lack sufficient context.

3.1 Fixed-Size Chunking

Fixed-size chunking is the simplest approach: it splits text uniformly by a specified character or token count.

from langchain.text_splitter import CharacterTextSplitter

text = """

Machine learning is a subfield of artificial intelligence that enables computers

to learn from data without being explicitly programmed. Methods include supervised

learning, unsupervised learning, and reinforcement learning, each solving different

types of problems.

"""

splitter = CharacterTextSplitter(

chunk_size=200,

chunk_overlap=20,

separator="\n"

)

chunks = splitter.split_text(text)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk[:80]}...")
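To make the mechanism concrete, here is a minimal plain-Python sketch of fixed-size chunking with overlap. It is not how `CharacterTextSplitter` is implemented (that class splits on a separator first and then merges pieces); the helper name `fixed_size_chunks` is hypothetical and only illustrates the sliding-window idea.

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Slide a window of chunk_size characters, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks

chunks = fixed_size_chunks("abcdefghij", chunk_size=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last `overlap` characters of the previous one, which is what keeps a sentence that straddles a boundary retrievable from either side.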

3.2 Recursive Chunking

LangChain's `RecursiveCharacterTextSplitter` recursively splits on paragraphs, sentences, and words in order, preserving semantic boundaries as much as possible.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(

chunk_size=500,

chunk_overlap=50,

separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""]

)

with open("document.txt", "r", encoding="utf-8") as f:
    text = f.read()

chunks = splitter.split_text(text)

print(f"Total chunks: {len(chunks)}")

print(f"Average chunk length: {sum(len(c) for c in chunks) / len(chunks):.0f} chars")

3.3 Semantic Chunking

Uses embedding similarity to split at semantic boundaries. The split point is where the embedding similarity between adjacent sentences drops sharply.

from langchain_experimental.text_splitter import SemanticChunker

from langchain_openai import OpenAIEmbeddings

semantic_splitter = SemanticChunker(

embeddings=OpenAIEmbeddings(),

breakpoint_threshold_type="percentile",

breakpoint_threshold_amount=95 # split at top 5% similarity change points

)

chunks = semantic_splitter.create_documents([text])

print(f"Semantic chunking result: {len(chunks)} chunks")
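The percentile rule can be illustrated without any embedding model. The sketch below assumes the distances between neighbouring sentences have already been computed (the values are hypothetical) and splits wherever a distance exceeds the chosen percentile. It is a simplified illustration of the idea, not the `SemanticChunker` source.

```python
def percentile(values, p):
    """Linear-interpolated percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

def split_at_breakpoints(sentences, distances, p=95):
    """distances[i] is the embedding distance between sentences[i] and sentences[i + 1]."""
    threshold = percentile(distances, p)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:
            # Large jump in meaning: close the current chunk and start a new one
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

sentences = ["Cats purr.", "Cats nap often.", "Stocks fell today.", "Bonds rose."]
distances = [0.10, 0.80, 0.15]  # hypothetical cosine distances between neighbours
chunks = split_at_breakpoints(sentences, distances, p=60)
print(chunks)
```

Only the jump between the second and third sentence exceeds the threshold, so the text is split into a "cats" chunk and a "markets" chunk.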

3.4 Parent-Child Chunking

A strategy that uses small chunks (children) for retrieval and large chunks (parents) for context delivery.

- Child chunks: small and precise (higher retrieval accuracy)

- Parent chunks: large and comprehensive (sufficient context for the LLM)

from langchain.retrievers import ParentDocumentRetriever

from langchain.storage import InMemoryStore

from langchain_community.vectorstores import Chroma

from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain_openai import OpenAIEmbeddings

child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)

vectorstore = Chroma(

collection_name="full_documents",

embedding_function=OpenAIEmbeddings()

)

store = InMemoryStore()

retriever = ParentDocumentRetriever(

vectorstore=vectorstore,

docstore=store,

child_splitter=child_splitter,

parent_splitter=parent_splitter,

)

3.5 Chunk Size Decision Guide

| Use Case               | Recommended Chunk Size | Overlap       |
| ---------------------- | ---------------------- | ------------- |
| Fact retrieval (Q&A)   | 200-400 tokens         | 10-20%        |
| Document summarization | 800-1200 tokens        | 5-10%         |
| Code retrieval         | Function/class unit    | None          |
| Mixed content          | 512 tokens             | 50-100 tokens |

4. Vector Databases

A vector database is a database specialized for storing high-dimensional vectors and efficiently finding similar vectors.

4.1 Comparing Major Vector Databases

**FAISS (Facebook AI Similarity Search)**

- Library developed by Meta

- In-memory processing, very fast

- No production server required (it's a library)

- Optimized for large-scale batch processing

**Chroma**

- Open source, built-in embeddings

- Python-native API

- Ideal for development and prototyping

- Persistence supported (SQLite-based)

**Pinecone**

- Fully managed cloud service

- Enterprise-grade scaling

- Paid service, easy to operate

**Weaviate**

- Open source + cloud option

- Hybrid search built-in

- GraphQL API

**Milvus**

- High-performance open source

- Distributed architecture

- Scales to billions of vectors

**pgvector**

- PostgreSQL extension

- Leverages existing PostgreSQL infrastructure

- Vector search via SQL

4.2 ANN Algorithms

Exact nearest-neighbor search (KNN) compares the query against every stored vector, so each query takes $O(n)$ time. At scale, approximate nearest-neighbor (ANN) algorithms are used instead.

**HNSW (Hierarchical Navigable Small World)**

Supports fast search via a hierarchical graph structure.

- Insert: approximately $O(\log n)$

- Search: approximately $O(\log n)$

- High recall, fast queries

- Default algorithm in Chroma and Weaviate

**IVF (Inverted File Index)**

Divides data into clusters and searches only relevant clusters.

- Memory efficient

- Trade recall for speed via the nprobe parameter

- Commonly used with FAISS
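The recall/speed trade-off behind `nprobe` can be seen in a toy inverted-file index. This sketch makes two loud assumptions: the centroids are fixed by hand (real IVF indexes learn them, typically with k-means), and `ToyIVF` is a hypothetical class written for illustration, not FAISS code.

```python
import math

def nearest(point, candidates, n=1):
    """Indices of the n candidates closest to point (Euclidean distance)."""
    order = sorted(range(len(candidates)), key=lambda i: math.dist(point, candidates[i]))
    return order[:n]

class ToyIVF:
    def __init__(self, centroids):
        self.centroids = centroids
        self.lists = [[] for _ in centroids]  # one inverted list per cluster

    def add(self, vectors):
        # Each vector is stored in the list of its closest centroid
        for vec in vectors:
            cluster = nearest(vec, self.centroids)[0]
            self.lists[cluster].append(vec)

    def search(self, query, k=1, nprobe=1):
        """Scan only the nprobe clusters whose centroids are closest to the query."""
        candidates = []
        for cluster in nearest(query, self.centroids, n=nprobe):
            candidates.extend(self.lists[cluster])
        return sorted(candidates, key=lambda v: math.dist(query, v))[:k]

index = ToyIVF(centroids=[(0.0, 0.0), (10.0, 0.0)])
index.add([(1.0, 0.0), (4.99, 0.0), (6.0, 0.0), (9.0, 0.0)])

query = (5.05, 0.0)
print(index.search(query, nprobe=1))  # probes one cluster and misses the true neighbour
print(index.search(query, nprobe=2))  # probes both clusters and finds it
```

The true nearest neighbour sits just across the cluster boundary, so a single probe returns a worse match. Raising `nprobe` scans more vectors and recovers it, which is exactly the trade-off the parameter controls.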

4.3 FAISS vs Chroma Implementation Comparison

import faiss
import numpy as np
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.schema import Document

# ===== FAISS Direct Usage =====
d = 384    # vector dimension
n = 10000  # number of documents

# Random vectors stand in for real document embeddings
vectors = np.random.randn(n, d).astype('float32')

# Flat L2 index (exact search)
index_flat = faiss.IndexFlatL2(d)
index_flat.add(vectors)
print(f"FAISS index size: {index_flat.ntotal}")

# Search
query = np.random.randn(1, d).astype('float32')
k = 5
distances, indices = index_flat.search(query, k)
print(f"Top {k} results: {indices[0]}")

# HNSW index (approximate, faster); 32 is the number of graph neighbors per node
index_hnsw = faiss.IndexHNSWFlat(d, 32)
index_hnsw.add(vectors)
distances_hnsw, indices_hnsw = index_hnsw.search(query, k)
print(f"HNSW results: {indices_hnsw[0]}")

# ===== LangChain + Chroma =====

documents = [

Document(page_content="Python is used for data science.", metadata={"source": "doc1"}),

Document(page_content="Machine learning learns patterns from data.", metadata={"source": "doc2"}),

Document(page_content="Deep learning is neural network-based ML.", metadata={"source": "doc3"}),

Document(page_content="NLP analyzes text data.", metadata={"source": "doc4"}),

]

embeddings = OpenAIEmbeddings()

chroma_db = Chroma.from_documents(

documents,

embeddings,

persist_directory="./chroma_db"

)

results = chroma_db.similarity_search("AI and machine learning", k=2)

for doc in results:
    print(f"Source: {doc.metadata['source']}, Content: {doc.page_content}")

results_with_score = chroma_db.similarity_search_with_score("deep learning", k=2)
for doc, score in results_with_score:
    print(f"Score: {score:.4f}, Content: {doc.page_content}")

5. Basic RAG Pipeline Implementation

5.1 Complete RAG Pipeline with LangChain

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader

from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain_community.vectorstores import Chroma

from langchain_openai import OpenAIEmbeddings, ChatOpenAI

from langchain.chains import RetrievalQA

from langchain.prompts import PromptTemplate

# ===== 1. Document Loading =====

loader = PyPDFLoader("company_handbook.pdf")

pages = loader.load()

print(f"Pages loaded: {len(pages)}")

# ===== 2. Text Splitting =====

text_splitter = RecursiveCharacterTextSplitter(

chunk_size=500,

chunk_overlap=50,

separators=["\n\n", "\n", ".", "!", "?", ",", " "]

)

chunks = text_splitter.split_documents(pages)

print(f"Chunks created: {len(chunks)}")

# ===== 3. Embedding and Vector DB Storage =====

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Chroma.from_documents(

chunks,

embeddings,

persist_directory="./rag_db"

)

print("Vector DB saved")

# ===== 4. RAG Chain Setup =====

prompt_template = """You are a helpful AI assistant.

Answer the question using ONLY the provided context.

If the answer is not in the context, say "I cannot find that in the provided documents."

Context:

{context}

Question: {question}

Answer:"""

PROMPT = PromptTemplate(

template=prompt_template,

input_variables=["context", "question"]

)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

retriever = vectorstore.as_retriever(

search_type="similarity",

search_kwargs={"k": 4}

)

qa_chain = RetrievalQA.from_chain_type(

llm=llm,

chain_type="stuff",

retriever=retriever,

chain_type_kwargs={"prompt": PROMPT},

return_source_documents=True

)

# ===== 5. Question Answering =====

query = "What is the vacation policy?"

result = qa_chain.invoke({"query": query})

print(f"\nQuestion: {query}")

print(f"Answer: {result['result']}")

print(f"\nSource documents:")

for doc in result['source_documents']:
    print(f"  - Page {doc.metadata.get('page', '?')}: {doc.page_content[:100]}...")

5.2 RAG with LlamaIndex

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings

from llama_index.core.node_parser import SentenceSplitter

from llama_index.llms.openai import OpenAI

from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)

documents = SimpleDirectoryReader("./docs/").load_data()

index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(

similarity_top_k=4,

response_mode="compact"

)

response = query_engine.query("What are the main topics covered in these documents?")

print(f"Answer: {response}")

print("\nSource nodes:")

for node in response.source_nodes:
    print(f"  - Score: {node.score:.4f}")
    print(f"    Text: {node.text[:100]}...")

6. Advanced Retrieval Techniques

6.1 Hybrid Search

Combining vector search (semantic) with BM25 (keyword-based) captures the strengths of both approaches.

from langchain_community.retrievers import BM25Retriever

from langchain.retrievers import EnsembleRetriever

# BM25 retriever (keyword-based)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 4

# Vector retriever (semantic)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Ensemble (hybrid)

ensemble_retriever = EnsembleRetriever(

retrievers=[bm25_retriever, vector_retriever],

weights=[0.5, 0.5]

)

results = ensemble_retriever.invoke("Python programming tutorial")

print(f"Hybrid search results: {len(results)}")
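A common way to merge a keyword ranking with a vector ranking is weighted Reciprocal Rank Fusion (RRF), which, as far as its documentation describes, is also the approach `EnsembleRetriever` takes. The sketch below is a stand-alone illustration rather than LangChain's code: the function name, the document ids, and the constant `c=60` (a conventional default) are assumptions.

```python
def reciprocal_rank_fusion(rankings, weights=None, c=60):
    """Fuse ranked lists of document ids: score(d) = sum_i w_i / (c + rank_i(d))."""
    weights = weights or [1.0] * len(rankings)
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]    # hypothetical keyword results
vector_ranking = ["doc_b", "doc_d", "doc_a"]  # hypothetical semantic results
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking], weights=[0.5, 0.5])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Because RRF uses ranks instead of raw scores, BM25 scores and cosine similarities never need to be normalized onto a common scale. Documents that appear in both lists rise to the top.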

6.2 Multi-Query Retrieval

Rewrites a single question into multiple different phrasings to broaden the retrieval coverage.

from langchain.retrievers.multi_query import MultiQueryRetriever

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)

multi_query_retriever = MultiQueryRetriever.from_llm(

retriever=vectorstore.as_retriever(),

llm=llm

)

# Internally, the LLM rewrites the question into multiple versions
# e.g. "What are the advantages of RAG?"
#   → "What are the benefits of retrieval-augmented generation?"
#   → "Why is RAG better than standard LLMs?"
#   → "What makes RAG useful?"

results = multi_query_retriever.invoke("What are the advantages of RAG?")

print(f"Multi-query retrieval results: {len(results)}")

6.3 MMR (Maximal Marginal Relevance)

Considering only similarity can lead to selecting redundant chunks. MMR balances relevance and diversity simultaneously.

$$\text{MMR} = \arg\max_{d_i \in D \setminus R} [\lambda \cdot \text{sim}(d_i, q) - (1-\lambda) \cdot \max_{d_j \in R} \text{sim}(d_i, d_j)]$$

mmr_retriever = vectorstore.as_retriever(

search_type="mmr",

search_kwargs={

"k": 4, # final number to return

"fetch_k": 20, # initial candidate pool

"lambda_mult": 0.5 # balance: 0=diversity, 1=similarity

}

)

results = mmr_retriever.invoke("machine learning algorithms")

print(f"MMR results: {len(results)}")
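The MMR formula can be implemented as a greedy loop in a few lines. The sketch below is a plain-Python illustration with hypothetical two-dimensional vectors. It follows the formula directly and is not the code the vector store runs internally.

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedy MMR: repeatedly pick the candidate with the best relevance-minus-redundancy score."""
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cos_sim(doc_vecs[i], query_vec)
            # Redundancy is the highest similarity to anything already selected
            redundancy = max((cos_sim(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.5], [1.0, 0.52], [1.0, -0.6]]  # docs 0 and 1 are near-duplicates

balanced = mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5)
relevance_only = mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0)
print(balanced)        # [0, 2]
print(relevance_only)  # [0, 1]
```

With `lambda_mult=1.0` the second pick is the near-duplicate of the first. With `lambda_mult=0.5` the redundancy penalty pushes the selection toward the slightly less relevant but different document.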

6.4 Metadata Filtering

Add metadata conditions to searches to narrow the scope.

from langchain.schema import Document

docs_with_metadata = [
    Document(
        page_content="Q1 2024 revenue was $10M.",
        metadata={"year": 2024, "quarter": "Q1", "category": "financial"},
    ),
    Document(
        page_content="Q2 2024 revenue was $12M.",
        metadata={"year": 2024, "quarter": "Q2", "category": "financial"},
    ),
    Document(
        page_content="Technology roadmap: AI feature enhancements planned.",
        metadata={"year": 2024, "quarter": "Q1", "category": "strategy"},
    ),
]

# Index the documents so their metadata is available for filtering
vectorstore.add_documents(docs_with_metadata)

# Narrow search with metadata filter.
# Assumption: the store is Chroma, where multiple conditions are combined with "$and";
# other vector stores use their own filter syntax.
filtered_results = vectorstore.similarity_search(
    "revenue performance",
    k=2,
    filter={"$and": [{"category": "financial"}, {"year": 2024}]},
)

7. Reranking

Reranking reorders retrieval results with a more precise model to improve ranking quality. It is a two-stage strategy: the retriever provides high recall, and the reranker provides high precision.

7.1 Cross-Encoder Reranker

Cross-encoders (which encode both texts together) are more accurate than bi-encoders (which encode texts separately then compute similarity).

from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

query = "types of machine learning algorithms"

initial_results = vectorstore.similarity_search(query, k=20)  # retrieve many candidates

# Re-score each (query, document) pair with the cross-encoder
pairs = [[query, doc.page_content] for doc in initial_results]
scores = cross_encoder.predict(pairs)

# Sort by score only; the key avoids comparing Document objects when scores tie
ranked = sorted(zip(scores, initial_results), key=lambda pair: pair[0], reverse=True)
top_k = [doc for _, doc in ranked[:5]]

print("Top results after reranking:")
for score, doc in ranked[:3]:
    print(f"  Score {score:.4f}: {doc.page_content[:80]}...")

7.2 Cohere Rerank API

from langchain.retrievers.document_compressors import CohereRerank

from langchain.retrievers import ContextualCompressionRetriever

compressor = CohereRerank(

cohere_api_key="your-api-key",

top_n=3,

model="rerank-multilingual-v3.0"

)

compression_retriever = ContextualCompressionRetriever(

base_compressor=compressor,

base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20})

)

results = compression_retriever.invoke("Tell me about company policies")

print(f"Reranked documents: {len(results)}")

7.3 BGE Reranker (Open Source)

from FlagEmbedding import FlagReranker

reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True)

query = "What is RAG?"

passages = [

"RAG stands for Retrieval-Augmented Generation.",

"A rag is a piece of cloth used for cleaning.",

"RAG systems combine retrieval with generation for better LLM responses.",

]

scores = reranker.compute_score([[query, p] for p in passages])

ranked = sorted(zip(scores, passages), reverse=True)

for score, passage in ranked:

print(f" {score:.4f}: {passage}")

8. HyDE (Hypothetical Document Embeddings)

8.1 The Idea Behind HyDE

Standard RAG directly compares query embeddings with document embeddings. However, a short query's embedding may be far from a long document's embedding in the semantic space.

HyDE's solution: have the LLM generate a hypothetical answer document, then use **that hypothetical document's embedding** for retrieval.

Query → LLM generates hypothetical answer → embed hypothetical answer → retrieve real documents

8.2 Implementing HyDE

from langchain.chains import HypotheticalDocumentEmbedder

from langchain_openai import OpenAI, OpenAIEmbeddings, ChatOpenAI

llm = OpenAI()

embeddings = OpenAIEmbeddings()

# Using LangChain's built-in HyDE
hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=llm,
    embeddings=embeddings,
    prompt_key="web_search"
)

# Manual HyDE implementation (expects a chat model, whose responses expose `.content`)
def manual_hyde(query, llm, embeddings, vectorstore, k=4):
    # 1. Generate hypothetical document
    hypothetical_doc = llm.invoke(
        f"Write a detailed answer to the following question: {query}"
    )

    # 2. Embed hypothetical document
    hyp_embedding = embeddings.embed_query(hypothetical_doc.content)

    # 3. Search with hypothetical embedding
    results = vectorstore.similarity_search_by_vector(hyp_embedding, k=k)

    return results, hypothetical_doc.content

chat_llm = ChatOpenAI(temperature=0.7)

results, hyp_doc = manual_hyde(

"History of deep learning", chat_llm, embeddings, vectorstore

)

print(f"Hypothetical document: {hyp_doc[:200]}...")

print(f"Retrieved actual documents: {len(results)}")

9. Advanced RAG Architectures

9.1 Self-RAG

Self-RAG (2023, Asai et al.) allows the LLM to judge whether retrieval is necessary and critically evaluate the relevance of retrieved documents and the quality of responses.

It uses four types of special reflection tokens:

- `[Retrieve]`: Is retrieval needed? (Yes/No)

- `[IsRel]`: Is the retrieved document relevant? (Relevant/Irrelevant)

- `[IsSup]`: Is the response supported by documents? (Supported/Partially/Not)

- `[IsUse]`: Is the response useful? (score 1-5)

from langchain_openai import ChatOpenAI

from langchain.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

class SelfRAGSimulator:
    """Simulates Self-RAG behavior (real Self-RAG requires specially trained models)"""

    def __init__(self, retriever, llm):
        self.retriever = retriever
        self.llm = llm

    def should_retrieve(self, query: str) -> bool:
        prompt = ChatPromptTemplate.from_template("""
Determine whether answering the following question requires searching external documents.
Answer 'NO' if it can be answered with general knowledge or reasoning alone.
Answer 'YES' if specific facts or specialized knowledge are needed.
Answer only YES or NO.

Question: {query}
Decision:""")
        response = self.llm.invoke(prompt.format_messages(query=query))
        return "YES" in response.content.upper()

    def is_relevant(self, query: str, doc_content: str) -> bool:
        prompt = ChatPromptTemplate.from_template("""
Determine if the following document is relevant to the question.
Answer only RELEVANT or IRRELEVANT.

Question: {query}
Document: {doc}
Decision:""")
        response = self.llm.invoke(
            prompt.format_messages(query=query, doc=doc_content[:500])
        )
        # "IRRELEVANT" contains "RELEVANT", so a substring test would always pass
        return response.content.strip().upper().startswith("RELEVANT")

    def generate_with_reflection(self, query: str) -> str:
        # 1. Determine if retrieval is needed
        need_retrieve = self.should_retrieve(query)
        print(f"Retrieval needed: {need_retrieve}")

        if not need_retrieve:
            response = self.llm.invoke(query)
            return response.content

        # 2. Retrieve documents
        docs = self.retriever.invoke(query)

        # 3. Filter by relevance
        relevant_docs = [d for d in docs if self.is_relevant(query, d.page_content)]
        print(f"Relevant documents: {len(relevant_docs)}/{len(docs)}")

        if not relevant_docs:
            return ("No relevant documents found. Answering from general knowledge: "
                    + self.llm.invoke(query).content)

        # 4. Generate answer with context
        context = "\n\n".join([d.page_content for d in relevant_docs[:3]])
        prompt = f"""Use the context to answer the question.

Context: {context}

Question: {query}
Answer:"""
        return self.llm.invoke(prompt).content

9.2 Corrective RAG (CRAG)

CRAG evaluates the quality of retrieved documents and supplements with web search if quality is low.

from langchain_community.tools.tavily_search import TavilySearchResults

from typing import List, Tuple

class CorrectiveRAG:
    def __init__(self, retriever, llm):
        self.retriever = retriever
        self.llm = llm
        self.web_search = TavilySearchResults(max_results=3)

    def evaluate_documents(self, query: str, docs: list) -> Tuple[str, List]:
        """
        Evaluate document relevance.
        Returns: ("CORRECT"|"INCORRECT"|"AMBIGUOUS", filtered documents)
        """
        evaluation_prompt = """Evaluate the relevance of retrieved documents for the question.
- CORRECT: Documents can directly answer the question
- INCORRECT: Documents are not related to the question
- AMBIGUOUS: Partially related but incomplete

Question: {query}
Documents:
{docs}

Evaluation (CORRECT/INCORRECT/AMBIGUOUS):"""

        docs_text = "\n---\n".join([d.page_content[:300] for d in docs[:4]])
        response = self.llm.invoke(
            evaluation_prompt.format(query=query, docs=docs_text)
        )
        evaluation = response.content.strip().upper()

        # Check INCORRECT first: the string "INCORRECT" also contains "CORRECT"
        if "INCORRECT" in evaluation:
            return "INCORRECT", []
        elif "CORRECT" in evaluation:
            return "CORRECT", docs
        else:
            return "AMBIGUOUS", docs

    def run(self, query: str) -> str:
        # 1. Initial retrieval
        docs = self.retriever.invoke(query)

        # 2. Evaluate document quality
        status, filtered_docs = self.evaluate_documents(query, docs)
        print(f"Document evaluation: {status}")

        # 3. Handle each case
        if status == "INCORRECT":
            # Retrieved documents are unusable: rely on web search alone
            print("Supplementing with web search...")
            web_results = self.web_search.invoke(query)
            context = "\n".join([r['content'] for r in web_results])
        elif status == "AMBIGUOUS":
            # Combine the best retrieved documents with web results
            web_results = self.web_search.invoke(query)
            web_context = "\n".join([r['content'] for r in web_results])
            doc_context = "\n".join([d.page_content for d in filtered_docs[:2]])
            context = doc_context + "\n\n[Web Search Supplement]\n" + web_context
        else:
            context = "\n\n".join([d.page_content for d in filtered_docs[:4]])

        # 4. Generate final response
        response = self.llm.invoke(
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        )
        return response.content

9.3 Adaptive RAG

Dynamically selects retrieval strategies based on query complexity.

class AdaptiveRAG:
    def __init__(self, simple_retriever, advanced_retriever, llm):
        self.simple_retriever = simple_retriever      # basic vector search
        self.advanced_retriever = advanced_retriever  # hybrid + reranking
        self.llm = llm

    def classify_query(self, query: str) -> str:
        prompt = f"""Classify the complexity of the following question:
- simple: Simple fact check or directly answerable
- complex: Requires combining multiple sources or multi-step reasoning

Question: {query}
Classification (simple/complex):"""
        response = self.llm.invoke(prompt)
        return "complex" if "complex" in response.content.lower() else "simple"

    def run(self, query: str) -> str:
        query_type = self.classify_query(query)
        print(f"Query type: {query_type}")

        # Route the query to the cheaper or the more thorough retriever
        if query_type == "simple":
            docs = self.simple_retriever.invoke(query)
        else:
            docs = self.advanced_retriever.invoke(query)

        context = "\n\n".join([d.page_content for d in docs])
        return self.llm.invoke(
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        ).content

9.4 GraphRAG (Microsoft)

Microsoft's GraphRAG constructs a knowledge graph from documents and uses graph structure for retrieval.

Key ideas:

1. Extract entities (people, places, concepts) and relationships from documents

2. Group related entities using community detection algorithms

3. Generate summaries for each community

4. For global queries use community summaries; for local queries use graph traversal
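Steps 1 and 2 can be sketched with plain data structures. The example below is a deliberately crude illustration, not GraphRAG's implementation: the triples are hypothetical output of an LLM extraction step, and connected components stand in for the hierarchical community detection (such as the Leiden algorithm) that a real pipeline would use.

```python
from collections import defaultdict

def build_graph(triples):
    """Undirected adjacency map from (subject, relation, object) triples."""
    graph = defaultdict(set)
    for subject, _relation, obj in triples:
        graph[subject].add(obj)
        graph[obj].add(subject)
    return graph

def connected_components(graph):
    """Group entities that are reachable from one another (a crude 'community')."""
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(graph[node] - members)
        seen |= members
        communities.append(sorted(members))
    return communities

triples = [  # hypothetical output of an LLM entity/relationship extraction step
    ("Alice", "works_at", "Acme"),
    ("Acme", "located_in", "Berlin"),
    ("Bob", "founded", "Globex"),
]
communities = connected_components(build_graph(triples))
print(communities)  # [['Acme', 'Alice', 'Berlin'], ['Bob', 'Globex']]
```

In the full pipeline each community would then be summarized by an LLM (step 3), and those summaries would serve global queries (step 4).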

# Install and initialize GraphRAG
# (command names vary across GraphRAG releases; check the docs for your installed version)
pip install graphrag

# Initialize project
python -m graphrag.index --init --root ./ragtest

# Edit settings then index
python -m graphrag.index --root ./ragtest

# Global search (requires understanding of full document set)
python -m graphrag.query --root ./ragtest --method global "What are the main themes?"

# Local search (focused on specific entities)
python -m graphrag.query --root ./ragtest --method local "Tell me about company X"

10. RAG Evaluation Metrics

Objectively measuring the quality of a RAG system is essential for improvement.

10.1 RAGAS (RAG Assessment)

RAGAS is a framework for automatically evaluating RAG pipelines.

**Key metrics**:

- **Faithfulness**: How faithful is the answer to the context? (measures hallucination)

- **Answer Relevancy**: How relevant is the answer to the question?

- **Context Recall**: How well were the relevant contexts retrieved?

- **Context Precision**: What proportion of retrieved contexts were actually useful?

from ragas import evaluate

from ragas.metrics import (

faithfulness,

answer_relevancy,

context_recall,

context_precision

)

from datasets import Dataset

evaluation_data = {
    "question": [
        "What is the company's annual leave policy?",
        "What is the remote work policy?",
    ],
    "answer": [
        "Employees receive 15 days of annual leave after 1 year, with 1 additional day per year.",
        # Deliberately contradicts the context (3 days vs. 2) to exercise faithfulness
        "Employees may work remotely 3 days per week.",
    ],
    "contexts": [
        ["Employees are granted 15 days of annual leave after 1 year of employment. One additional day is added each subsequent year."],
        ["Employees may work remotely up to 2 days per week. Additional days may be approved with permission."],
    ],
    "ground_truth": [
        "15 days after 1 year, 1 additional day per year",
        "2 days remote work by default, additional with approval",
    ]
}

dataset = Dataset.from_dict(evaluation_data)

result = evaluate(

dataset,

metrics=[faithfulness, answer_relevancy, context_recall, context_precision]

)

print(result)

# Example output:
# faithfulness: 0.75 (detects inconsistency between answer and context)
# answer_relevancy: 0.92
# context_recall: 0.85
# context_precision: 0.78
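To build intuition for the two context metrics, the sketch below computes simplified set-based versions from hand-labeled relevance. This is an assumption-laden simplification: RAGAS derives these scores with LLM judgments and, for precision, takes ranking into account, so the numbers here will not match what `evaluate` returns. The chunk ids are hypothetical.

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunk ids that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk_id in retrieved if chunk_id in relevant) / len(retrieved)

def context_recall(retrieved, relevant):
    """Fraction of relevant chunk ids that were retrieved."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(set(retrieved) & relevant) / len(relevant)

retrieved = ["c1", "c4", "c7", "c9"]  # hypothetical ids returned by the retriever
relevant = {"c1", "c2", "c7"}         # hypothetical ground-truth relevant ids

precision = context_precision(retrieved, relevant)
recall = context_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.67
```

Low precision points at noisy retrieval (too many irrelevant chunks reach the LLM), while low recall points at missing evidence. The two call for different fixes, such as reranking versus a larger `k` or better chunking.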

10.2 Production Evaluation Pipeline

from typing import List, Dict

class RAGEvaluator:
    def __init__(self, rag_chain, llm):
        self.rag_chain = rag_chain
        self.llm = llm

    def evaluate_faithfulness(self, answer: str, context: str) -> float:
        prompt = f"""Evaluate whether the following answer is based solely on information in the given context.
Score from 0.0 (not at all) to 1.0 (completely faithful).

Context: {context}
Answer: {answer}

Faithfulness score (number only):"""
        response = self.llm.invoke(prompt)
        try:
            return float(response.content.strip())
        except ValueError:
            # Neutral fallback when the model does not return a bare number
            return 0.5

    def evaluate_answer_relevancy(self, question: str, answer: str) -> float:
        prompt = f"""Evaluate how relevant the following answer is to the question.
Score from 0.0 to 1.0.

Question: {question}
Answer: {answer}

Relevancy score (number only):"""
        response = self.llm.invoke(prompt)
        try:
            return float(response.content.strip())
        except ValueError:
            # Neutral fallback when the model does not return a bare number
            return 0.5

    def run_evaluation(self, test_cases: List[Dict]) -> Dict:
        if not test_cases:
            raise ValueError("test_cases must contain at least one case")

        results = []
        for case in test_cases:
            question = case["question"]
            result = self.rag_chain.invoke({"query": question})
            answer = result["result"]
            context = "\n".join([d.page_content for d in result["source_documents"]])

            faithfulness_score = self.evaluate_faithfulness(answer, context)
            relevancy_score = self.evaluate_answer_relevancy(question, answer)

            results.append({
                "question": question,
                "answer": answer,
                "faithfulness": faithfulness_score,
                "relevancy": relevancy_score,
            })

        avg_faithfulness = sum(r["faithfulness"] for r in results) / len(results)
        avg_relevancy = sum(r["relevancy"] for r in results) / len(results)

        return {
            "results": results,
            "avg_faithfulness": avg_faithfulness,
            "avg_relevancy": avg_relevancy,
            "overall_score": (avg_faithfulness + avg_relevancy) / 2
        }

11. Production RAG Systems

11.1 Caching Strategies

import hashlib
import json


class CachedRAGSystem:
    def __init__(self, rag_chain, redis_client=None, ttl=3600):
        self.rag_chain = rag_chain
        self.redis = redis_client
        self.ttl = ttl  # cache lifetime in seconds

    def _get_cache_key(self, query: str) -> str:
        # MD5 only derives a compact key here; it is not used for security
        return f"rag:{hashlib.md5(query.encode()).hexdigest()}"

    def query(self, query: str) -> dict:
        cache_key = self._get_cache_key(query)

        # Check cache
        if self.redis:
            cached = self.redis.get(cache_key)
            if cached:
                print("Cache hit!")
                return json.loads(cached)

        # Execute RAG
        result = self.rag_chain.invoke({"query": query})
        response = {
            "answer": result["result"],
            "sources": [d.metadata for d in result["source_documents"]],
        }

        # Store in cache
        if self.redis:
            self.redis.setex(cache_key, self.ttl, json.dumps(response))

        return response
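Hashing the raw query string means that "What is RAG?" and "  what is rag? " produce different keys and never share a cache entry. The following is a hedged sketch of normalizing the query before hashing. `normalize_query` and `make_cache_key` are hypothetical helpers, and lowercasing is an assumption that may not suit domains with case-sensitive identifiers.

```python
import hashlib
import re


def normalize_query(query: str) -> str:
    """Collapse runs of whitespace, trim the ends, and lowercase the query."""
    return re.sub(r"\s+", " ", query).strip().lower()


def make_cache_key(query: str, prefix: str = "rag") -> str:
    """Build a cache key from the normalized query text."""
    digest = hashlib.sha256(normalize_query(query).encode("utf-8")).hexdigest()
    return f"{prefix}:{digest}"
```

Exact-match keys like these only catch trivial variations. Paraphrased questions still miss the cache unless a semantic (embedding-based) cache is used instead.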

# LLM response caching
from langchain.globals import set_llm_cache
from langchain_community.cache import InMemoryCache, SQLiteCache

# Call only one of these: each call replaces the cache set by the previous one
set_llm_cache(InMemoryCache())  # development
set_llm_cache(SQLiteCache(database_path=".langchain.db"))  # production

11.2 Streaming Responses

from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain_openai import ChatOpenAI
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

streaming_llm = ChatOpenAI(
    model="gpt-4o-mini",
    streaming=True,
    # Echoes tokens to the server's stdout; useful for debugging only
    callbacks=[StreamingStdOutCallbackHandler()],
)


async def generate_rag_stream(query: str):
    # `retriever` is assumed to be defined elsewhere (not shown in this section)
    docs = retriever.invoke(query)
    context = "\n\n".join([d.page_content for d in docs])

    # Forward each generated chunk as a Server-Sent Events message
    async for chunk in streaming_llm.astream(
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    ):
        if chunk.content:
            yield f"data: {chunk.content}\n\n"


@app.get("/rag/stream")
async def rag_stream_endpoint(query: str):
    return StreamingResponse(
        generate_rag_stream(query),
        media_type="text/event-stream",
    )
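One caveat with the generator above: in the Server-Sent Events format a blank line ends an event, so a chunk that itself contains a newline is split incorrectly when it is written after a single `data:` prefix. The following is a minimal sketch of a framing helper that emits one `data:` line per line of text. The name `sse_event` is an assumption for illustration.

```python
def sse_event(text: str) -> str:
    """Frame `text` as one Server-Sent Events message.

    Each line of the payload gets its own `data:` field, and the message
    ends with a blank line. Clients rejoin the fields with newlines.
    """
    # Normalize Windows and old Mac line endings before splitting
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    return "".join(f"data: {line}\n" for line in lines) + "\n"
```

Inside the generator, `yield sse_event(chunk.content)` would replace the hand-built f-string.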

11.3 Cost Optimization

# Track token usage
from langchain_community.callbacks import get_openai_callback

with get_openai_callback() as cb:
    result = qa_chain.invoke({"query": "Your question here"})
    print(f"Total tokens: {cb.total_tokens}")
    print(f"Prompt tokens: {cb.prompt_tokens}")
    print(f"Completion tokens: {cb.completion_tokens}")
    print(f"Cost: ${cb.total_cost:.6f}")

# Context compression to save tokens
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.retrievers import ContextualCompressionRetriever

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 8}),
)

# Only pass compressed (relevant) portions to LLM
compressed_docs = compression_retriever.invoke("your query")

# Whitespace word count is only a rough proxy for the real token count
total_tokens_estimate = sum(len(d.page_content.split()) for d in compressed_docs)
print(f"Compressed context tokens (estimate): {total_tokens_estimate}")
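`LLMChainExtractor` makes additional LLM calls of its own, so compression is not free. A cheaper complement is to cap the context at a fixed budget. The sketch below assumes the texts arrive ranked best-first and reuses the whitespace word count as a rough stand-in for tokens. `fit_to_budget` is a hypothetical helper, and an exact limit would need the model's real tokenizer.

```python
from typing import List


def fit_to_budget(texts: List[str], max_tokens: int) -> List[str]:
    """Keep the highest-ranked texts whose combined size fits the budget.

    Size is estimated by whitespace word count. Selection stops at the
    first text that would overflow, so rank order is preserved.
    """
    kept: List[str] = []
    used = 0
    for text in texts:
        cost = len(text.split())
        if used + cost > max_tokens:
            break
        kept.append(text)
        used += cost
    return kept
```

With retrieved documents this could be called as `fit_to_budget([d.page_content for d in docs], max_tokens=1500)`, where the budget value is an arbitrary example.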

11.4 Monitoring

import logging
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class RAGMetrics:
    query: str
    retrieval_time: float = 0.0
    generation_time: float = 0.0
    num_docs_retrieved: int = 0
    answer_length: int = 0
    error: Optional[str] = None


class MonitoredRAGSystem:
    def __init__(self, rag_chain, retriever, logger=None):
        self.rag_chain = rag_chain
        # Passed in explicitly so retrieval can be timed on its own
        self.retriever = retriever
        self.logger = logger or logging.getLogger(__name__)
        self.metrics_history = []

    def query(self, query: str) -> dict:
        metrics = RAGMetrics(query=query)
        start_total = time.time()

        try:
            # Time retrieval separately from the full chain
            retrieval_start = time.time()
            docs = self.retriever.invoke(query)
            metrics.retrieval_time = time.time() - retrieval_start
            metrics.num_docs_retrieved = len(docs)

            # The chain performs its own retrieval again, so this figure
            # covers retrieval plus generation
            gen_start = time.time()
            result = self.rag_chain.invoke({"query": query})
            metrics.generation_time = time.time() - gen_start
            metrics.answer_length = len(result["result"])
        except Exception as e:
            metrics.error = str(e)
            self.logger.error(f"RAG error: {e}")
            raise
        finally:
            # Record metrics for both successful and failed queries
            total_time = time.time() - start_total
            self.metrics_history.append(metrics)
            self.logger.info(
                f"Query processed | "
                f"Retrieval: {metrics.retrieval_time:.2f}s | "
                f"Generation: {metrics.generation_time:.2f}s | "
                f"Total: {total_time:.2f}s | "
                f"Docs: {metrics.num_docs_retrieved}"
            )

        return result

    def get_stats(self) -> dict:
        if not self.metrics_history:
            return {}

        # Timing averages exclude failed queries
        retrieval_times = [m.retrieval_time for m in self.metrics_history if not m.error]
        gen_times = [m.generation_time for m in self.metrics_history if not m.error]

        return {
            "total_queries": len(self.metrics_history),
            "error_rate": sum(1 for m in self.metrics_history if m.error) / len(self.metrics_history),
            "avg_retrieval_time": sum(retrieval_times) / len(retrieval_times) if retrieval_times else 0,
            "avg_generation_time": sum(gen_times) / len(gen_times) if gen_times else 0,
        }
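Averages hide tail latency, which is often what users notice first. The following is a minimal sketch of a nearest-rank percentile over recorded timings. `percentile` is a hypothetical helper written for illustration, not a method of the class above.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Return the nearest-rank percentile of `values` for 0 < pct <= 100."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    # Nearest-rank method: the smallest value with at least pct% of the
    # data at or below it
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]
```

For example, `percentile([m.generation_time for m in system.metrics_history if not m.error], 95)` would give a p95 generation latency, assuming `system` is a `MonitoredRAGSystem` that has already served some queries.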

12. Production RAG Checklist

Key considerations when building a production RAG system.

**Document Processing**

- Support diverse file formats (PDF, Word, HTML, Markdown)

- Preserve metadata (source, date, author)

- Strategy for images and tables

- Support for incremental updates

**Retrieval Quality**

- Choose embedding models suited to the domain and language

- Consider hybrid search (keyword + semantic)

- Tune chunk size and overlap appropriately

- Use reranking to improve precision

**LLM Integration**

- Clear system prompt (emphasize using only context)

- Require source citation

- Allow expression of uncertainty

**Operations**

- Response caching to reduce costs

- Monitor token usage

- A/B test chunking and retrieval parameters

- Automated evaluation pipeline (RAGAS)

Conclusion

RAG is the most practical approach to overcoming the limitations of LLMs. To summarize:

1. **Basic RAG**: chunking → embedding → vector DB → retrieval → generation

2. **Improving retrieval quality**: hybrid search, MMR, reranking

3. **Advanced architectures**: Self-RAG, CRAG, HyDE, GraphRAG

4. **Evaluation**: measure faithfulness and relevancy with RAGAS

5. **Production**: caching, monitoring, cost optimization

The performance of a RAG system depends on how well the whole pipeline works together, not on any single component. Even so, as a rough rule of thumb, chunking strategy and embedding model selection account for about 80% of retrieval quality, so tuning these two first offers the highest ROI.

References

- Lewis et al. (2020), "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (arXiv:2005.11401)

- Asai et al. (2023), "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection" (arXiv:2310.11511)

- Gao et al. (2022), "Precise Zero-Shot Dense Retrieval without Relevance Labels" (HyDE; arXiv:2212.10496)

- Microsoft GraphRAG: https://microsoft.github.io/graphrag/

- LangChain documentation: https://python.langchain.com/docs/

- LlamaIndex documentation: https://docs.llamaindex.ai/

- RAGAS documentation: https://docs.ragas.io/

- FAISS: https://faiss.ai/

- Chroma: https://www.trychroma.com/