RAG: Retrieval-Augmented Generation — Paper Analysis and Production Architecture


1. The LLM Hallucination Problem and the Emergence of RAG

Large Language Models (LLMs) are pre-trained on vast amounts of text data and demonstrate remarkable performance in natural language understanding and generation. However, LLMs have a fundamental limitation: the Hallucination problem.

Hallucination refers to the phenomenon where a model confidently generates information that is not factually correct. The root causes of this problem are as follows.

  • Static nature of knowledge: Knowledge encoded in LLM parameters is fixed at training time. It cannot reflect events or updated information that occurred after training.
  • Imperfect parametric memory: Even with billions of parameters, it is impossible to accurately store and reproduce every detail about the world.
  • Probabilistic generation: LLMs predict and generate the next token probabilistically, which means they can produce text that is statistically plausible but factually incorrect.
  • Inability to trace sources: It is impossible to trace which training data a generated answer originated from, making verification inherently difficult.

The approach that emerged to overcome these limitations is Retrieval-Augmented Generation (RAG). The core idea of RAG is simple yet powerful: before the LLM generates an answer, it retrieves relevant documents from an external knowledge store and generates the answer based on that information.

This yields the following benefits.

  1. Fact-based responses: Answers are generated based on actual retrieved documents, reducing hallucination.
  2. Easy knowledge updates: Simply updating the external database reflects the latest information. No model retraining is required.
  3. Source attribution: The documents underlying the answer can be presented alongside it, improving transparency and trustworthiness.
  4. Easy domain specialization: By indexing only domain-specific documents, a specialized system can be built quickly.

2. Original RAG Paper Architecture: Retriever + Generator

The paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (NeurIPS 2020) by Patrick Lewis et al. at Facebook AI Research (now Meta AI) is the seminal work that first formalized the concept of RAG.

2.1 Core Proposal

Lewis et al. proposed a general (fine-tunable) methodology for combining the Parametric Memory (implicit knowledge within parameters) of a pre-trained language model with Non-Parametric Memory (explicit knowledge from an external document index). Specifically, the Parametric Memory is a pre-trained seq2seq model (BART), and the Non-Parametric Memory is the entirety of Wikipedia built as a Dense Vector Index.

2.2 Architecture Components

The RAG model architecture consists of two major components.

Retriever - p_η(z|x)

Given an input query x, this component retrieves relevant documents (passages) z. The paper uses Dense Passage Retrieval (DPR) to encode queries and documents as Dense Vectors, then retrieves the top-k relevant documents via Maximum Inner Product Search (MIPS).

Generator - p_θ(y_i|x, z, y_{1:i-1})

This component receives the retrieved documents z along with the original input x as context and generates the final output y. The paper uses BART-large as the Generator.

2.3 Two RAG Variants

The paper proposes two variants based on how retrieved documents are utilized.

RAG-Sequence

A single retrieved document is used consistently when generating the entire output sequence. For each retrieved document z, the entire sequence is generated, and then the probabilities across documents are marginalized.

p_RAG-Sequence(y|x) ≈ Σ_z p_η(z|x) Π_i p_θ(y_i|x, z, y_{1:i-1})

RAG-Token

Different retrieved documents can be referenced for each output token generated. Probabilities across documents are marginalized at the token level.

p_RAG-Token(y|x) ≈ Π_i Σ_z p_η(z|x) p_θ(y_i|x, z, y_{1:i-1})
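To make the difference concrete, here is the expansion for two retrieved documents z_1, z_2 and a two-token output y = (y_1, y_2):

```latex
% RAG-Sequence: a single document conditions the whole sequence; mix afterwards
p_{\text{RAG-Seq}}(y|x) = \sum_{z \in \{z_1, z_2\}} p_\eta(z|x)\,
    p_\theta(y_1|x,z)\, p_\theta(y_2|x,z,y_1)

% RAG-Token: documents are re-mixed at every token position
p_{\text{RAG-Tok}}(y|x) = \Big[ \sum_{z \in \{z_1, z_2\}} p_\eta(z|x)\, p_\theta(y_1|x,z) \Big]
    \cdot \Big[ \sum_{z \in \{z_1, z_2\}} p_\eta(z|x)\, p_\theta(y_2|x,z,y_1) \Big]
```

In RAG-Sequence a single z must explain the entire output, while RAG-Token can effectively draw y_1 from one document and y_2 from another.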

2.4 Key Experimental Results

The RAG model achieved state-of-the-art performance on three Open-Domain QA benchmarks (Natural Questions, TriviaQA, WebQuestions), surpassing both existing parametric seq2seq models and task-specific retrieve-and-extract architectures. Notably, the RAG model generated text that was more specific, diverse, and factual compared to parametric-only models.


3. Dense Passage Retrieval (DPR) Principles

The critical component in RAG's Retriever is Dense Passage Retrieval (DPR). It was proposed in the paper "Dense Passage Retrieval for Open-Domain Question Answering" by Karpukhin et al. at EMNLP 2020.

3.1 Limitations of Sparse Retrieval

Traditional information retrieval primarily used Sparse Retrieval methods like BM25. BM25 ranks documents by keyword matching using term-frequency and inverse-document-frequency statistics (a refinement of TF-IDF) but has the following limitations.

  • Lexical Mismatch: Relevant documents cannot be found when synonyms or different expressions are used. For example, a query about "machine learning" will fail to match a document that uses "ML" or an equivalent term in another language.
  • No semantic similarity: Since only word frequency is considered, contextual meaning is not captured.

3.2 DPR's Bi-Encoder Architecture

DPR uses a Bi-Encoder architecture. Two independent BERT-base encoders convert queries and documents into Dense Vectors respectively.

- Query Encoder: E_Q(q) → d-dimensional vector
- Passage Encoder: E_P(p) → d-dimensional vector

Similarity is computed as the Inner Product of the two vectors.

sim(q, p) = E_Q(q)^T · E_P(p)

The key advantage of this architecture is that query and document encoding are independent. Document encoding can be performed offline and indexed in ANN (Approximate Nearest Neighbor) libraries like FAISS. At search time, only the query needs to be encoded, enabling millisecond-level retrieval even across millions of documents.
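This offline/online split can be sketched in a few lines of NumPy. The random vectors below are toy stand-ins for the trained BERT encoders (not the actual DPR models), and the brute-force argsort plays the role the ANN index plays in production:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dimension (768 for BERT-base in DPR)

# Stand-in for offline passage encoding: in DPR, E_P is a trained BERT-base.
P = rng.normal(size=(4, d))               # 4 passage vectors, computed offline
P /= np.linalg.norm(P, axis=1, keepdims=True)

def search(q_vec, top_k=2):
    """MIPS: score every passage by inner product, return top-k indices.
    In production this brute-force argsort is replaced by an ANN index
    such as FAISS."""
    scores = P @ q_vec                    # sim(q, p) = E_Q(q)^T · E_P(p)
    return np.argsort(-scores)[:top_k]

# At query time only the query is encoded. Here the "query vector" is
# passage 2's vector plus noise, so passage 2 should rank first.
q = P[2] + 0.01 * rng.normal(size=d)
top = search(q)
print(top)  # passage index 2 ranks first
```

The expensive part (encoding and indexing the corpus) happens once; each query costs one encoder pass plus one index lookup.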

3.3 Training Method

DPR is trained using an In-Batch Negatives strategy. Correct passages for other questions in the batch are used as Negative Samples. Additionally, passages retrieved by BM25 that are not correct answers are used as Hard Negatives to improve training effectiveness.
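The in-batch objective amounts to a softmax cross-entropy over the batch similarity matrix, where row i's positive passage sits on the diagonal and every other passage in the batch acts as a negative. A minimal sketch with toy random embeddings (not real DPR encoders):

```python
import numpy as np

rng = np.random.default_rng(1)
B, d = 4, 8                      # batch size, embedding dimension

Q = rng.normal(size=(B, d))      # query embeddings E_Q(q_i)
P = rng.normal(size=(B, d))      # positive passage embeddings E_P(p_i)

# S[i, j] = sim(q_i, p_j); the off-diagonal entries are in-batch negatives
S = Q @ P.T

# Negative log-likelihood of the positive passage under a softmax over all B
log_probs = S - np.log(np.exp(S).sum(axis=1, keepdims=True))
loss = -np.mean(np.diag(log_probs))
print(f"in-batch negative loss: {loss:.3f}")
```

One B×B matrix multiply yields B positives and B(B-1) negatives for free, which is why the strategy is so compute-efficient.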

3.4 Performance

DPR achieved an absolute improvement of 9% to 19% in Top-20 Passage Retrieval Accuracy compared to BM25. This demonstrates that high-quality Dense Retrievers can be trained with only a limited number of query-passage pairs.


4. Chunking Strategies

Splitting documents into appropriately sized chunks is a critical step in RAG systems that directly impacts retrieval quality. LangChain provides various Text Splitters, and the key chunking strategies are as follows.

4.1 Fixed-Size Chunking

The simplest method, splitting text based on a specified number of characters or tokens.

from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
docs = text_splitter.split_documents(documents)

  • Pros: Simple to implement and predictable.
  • Cons: May split in the middle of sentences or semantic units.

The chunk_overlap parameter creates overlap between adjacent chunks to mitigate context loss.

4.2 Recursive Character Splitting

The general-purpose Text Splitter that LangChain recommends for most cases. It recursively applies multiple levels of separators to preserve semantic units as much as possible.

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
)
docs = text_splitter.split_documents(documents)

The operation logic is as follows.

  1. First, attempt to split by \n\n (paragraph breaks).
  2. If a chunk still exceeds chunk_size, split by \n (line breaks).
  3. If still too large, split by . (sentence boundaries).
  4. As a last resort, split by spaces or individual characters.

The key principle is trying to preserve larger semantic units first, and only breaking down into smaller units when necessary.

4.3 Semantic Chunking

The most advanced strategy, detecting points where meaning changes based on embedding similarity.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
)
docs = text_splitter.split_documents(documents)

It computes the Embedding Cosine Similarity between adjacent sentences and sets chunk boundaries where similarity drops sharply. While it produces semantically coherent chunks, additional embedding computation costs are incurred.
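The breakpoint logic can be sketched as follows. The two-dimensional "embeddings" and the 25th-percentile threshold are toy assumptions standing in for real sentence embeddings and the splitter's configurable threshold:

```python
import numpy as np

def semantic_breakpoints(sent_vecs, percentile=25):
    """Chunk boundaries where adjacent-sentence cosine similarity drops
    below the given percentile of all adjacent similarities."""
    v = sent_vecs / np.linalg.norm(sent_vecs, axis=1, keepdims=True)
    sims = np.sum(v[:-1] * v[1:], axis=1)   # cosine sim of each adjacent pair
    threshold = np.percentile(sims, percentile)
    return [i + 1 for i, s in enumerate(sims) if s < threshold]

# Toy embeddings: sentences 0-2 share one topic, sentences 3-4 another.
topic_a, topic_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
rng = np.random.default_rng(2)
vecs = np.stack([topic_a, topic_a, topic_a, topic_b, topic_b]) \
       + 0.05 * rng.normal(size=(5, 2))
print(semantic_breakpoints(vecs))  # → [3]: one boundary before sentence 3
```

The similarity dip at the topic change (between sentences 2 and 3) falls below the threshold, so the text is cut there and nowhere else.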

4.4 Chunking Strategy Selection Guide

Strategy     Best For                                                            Cost
Fixed-Size   Uniformly structured documents, rapid prototyping                   Low
Recursive    Most general use cases (recommended default)                        Low
Semantic     Documents where semantic boundaries matter, high quality required   High

In practice, chunk_size is typically set between 500 and 1500, with chunk_overlap at 10-20% of the chunk_size. Optimal values should be determined experimentally based on data characteristics and use case.


5. Embedding Model Selection

The choice of embedding model for converting chunks to vectors is a crucial decision that determines retrieval quality. The MTEB (Massive Text Embedding Benchmark) leaderboard provides useful comparisons of major models.

5.1 OpenAI text-embedding-3 Series

from langchain_openai import OpenAIEmbeddings

# High-performance model
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")  # 3072 dimensions

# Cost-efficient model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")  # 1536 dimensions

  • MTEB Score: text-embedding-3-large approximately 64.6
  • Pros: Convenient to use via API calls with consistent quality. Supports Matryoshka Representation, allowing dimension reduction with minimal performance loss.
  • Cons: API costs apply, and data is sent to external servers.

5.2 Sentence-Transformers

from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

  • Pros: Open-source and can run locally. Many fast, lightweight models are available for English. all-MiniLM-L6-v2 is lightweight at 384 dimensions while offering decent performance.
  • Cons: Limited multilingual support, and performance may be lower compared to larger models.

5.3 BGE (BAAI General Embedding) Series

from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-m3",
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},
)

  • MTEB Score: BGE-M3 approximately 63.0
  • Pros: Top-tier open-source performance, supporting over 100 languages. Particularly suitable for multilingual RAG including Korean. Supports Dense, Sparse, and Multi-Vector Retrieval as a hybrid model.
  • Cons: The large model size requires a GPU.

5.4 Selection Criteria

Criterion                              Recommended Model
Rapid prototyping                      OpenAI text-embedding-3-small
Production (quality-first)             OpenAI text-embedding-3-large or Cohere embed-v4
Production (cost-first, self-hosted)   BGE-M3
Multilingual                           BGE-M3
Lightweight / edge environments        all-MiniLM-L6-v2

The key principle is to always benchmark with your actual data. MTEB scores are based on general benchmarks, so performance on specific domains may differ.


6. Vector Database Comparison

The Vector Database that stores embedding vectors and performs similarity search is a core piece of RAG system infrastructure. Here is a comparison of major Vector Databases.

6.1 Chroma

from langchain_chroma import Chroma

vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    persist_directory="./chroma_db",
)

  • Type: Open-source, Embedded (In-Process)
  • Best for: Local development, prototyping, small-scale projects
  • Pros: Easy to install (pip install chromadb), runs within the Python process without a separate server. Excellent LangChain integration.
  • Limitations: Performance may degrade at large scale (millions of vectors or more). Production-level availability and scalability are limited.

6.2 Pinecone

  • Type: Managed SaaS (fully managed)
  • Best for: Production environments, teams wanting to minimize operational burden
  • Pros: Serverless architecture eliminates infrastructure management. Multi-region support, high availability, and auto-scaling. Scales to billions of vectors.
  • Limitations: Relatively high cost. Vendor lock-in concerns. Not open-source.

6.3 Weaviate

  • Type: Open-source + Managed Cloud
  • Best for: When Hybrid Search (vector + keyword) is important, when flexible schemas are needed
  • Pros: Strong native Hybrid Search support. GraphQL API, modular architecture, and built-in vectorization modules. Flexible with both open-source and managed cloud options.
  • Limitations: Has a learning curve and configuration can be somewhat complex.

6.4 pgvector

CREATE EXTENSION vector;

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1536)
);

-- Similarity search
SELECT content, embedding <=> '[0.1, 0.2, ...]'::vector AS distance
FROM documents
ORDER BY distance
LIMIT 5;

  • Type: PostgreSQL extension
  • Best for: Organizations already using PostgreSQL, those not wanting to add separate infrastructure
  • Pros: Leverages existing PostgreSQL infrastructure. SQL and vector search in a single database simplifies architecture. Supports HNSW and IVFFlat indexes.
  • Limitations: Performance may degrade beyond 100 million vectors. Features are limited compared to dedicated Vector Databases.

6.5 Milvus

  • Type: Open-source, distributed architecture
  • Best for: Billion-scale vector systems, teams with data engineering capabilities
  • Pros: Proven performance for industrial-scale large-scale vector search. Supports various index types (IVF, HNSW, DiskANN, etc.) with GPU acceleration. Managed service available via Zilliz Cloud.
  • Limitations: High operational complexity. Cluster setup and management require specialized expertise.

6.6 Comparison Summary

DB         Type                 Max Scale              Hybrid Search    Recommended Scenario
Chroma     Embedded             ~1 million             Limited          Prototyping, development
Pinecone   Managed SaaS         Billions               Supported        Production, minimal management
Weaviate   Open-source/Cloud    Hundreds of millions   Strong support   Hybrid Search-centric
pgvector   PostgreSQL ext.      ~100 million           SQL combined     Existing PostgreSQL infra
Milvus     Open-source dist.    Billions               Supported        Large-scale systems

7. Advanced RAG Patterns

Basic RAG (Naive RAG) is a simple "retrieve then generate" pipeline. In production environments, various Advanced RAG patterns are needed to improve both retrieval and generation quality.

7.1 Re-ranking

In basic RAG, initial retrieval uses Bi-Encoder vector similarity, which is fast but may lack precision. Re-ranking is a pattern that re-evaluates initial retrieval results with a Cross-Encoder to improve precision.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Load Cross-Encoder model
model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-v2-m3")
compressor = CrossEncoderReranker(model=model, top_n=3)

# Configure Re-ranking Retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20}),
)

The workflow is as follows.

  1. The Bi-Encoder quickly retrieves a broad set of candidate documents (top-20).
  2. The Cross-Encoder takes each candidate document paired with the query and scores direct relevance.
  3. Only the top re-ranked documents (top-3) are passed to the Generator.

Cross-Encoders are more precise than Bi-Encoders but require individual inference for every candidate, making them unsuitable for large-scale retrieval. Therefore, the standard pattern is a two-stage pipeline: narrow candidates with a Bi-Encoder, then re-rank with a Cross-Encoder.

7.2 HyDE (Hypothetical Document Embeddings)

HyDE, proposed by Gao et al. (2022), is a pattern designed to bridge the semantic gap between queries and documents. User questions are typically short and abstract, while documents containing the answers are long and detailed. This difference can make direct vector similarity search suboptimal.

The core idea of HyDE is as follows.

  1. When a user query is received, ask the LLM to "write a hypothetical document that answers this question."
  2. Embed the generated hypothetical document. This hypothetical document may not be factually accurate, but it has similar format and vocabulary to actual relevant documents.
  3. Use this embedding to search for actual documents.

from langchain.chains import HypotheticalDocumentEmbedder
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini")
base_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=llm,
    base_embeddings=base_embeddings,
    prompt_key="web_search",
)

# Search using HyDE embeddings
results = vectorstore.similarity_search_by_vector(
    hyde_embeddings.embed_query("How to evaluate RAG system performance?")
)

The encoder's Dense Bottleneck serves to filter out hallucinations in the hypothetical document, so even if the hypothetical document is not accurate, it can effectively retrieve relevant real documents.

7.3 Self-RAG

Self-RAG, proposed by Asai et al. (2023), is a pattern where the LLM autonomously judges the need for retrieval and self-critically evaluates its generated results.

The core mechanism of Self-RAG is the Reflection Token.

  • [Retrieve]: Determines whether external retrieval is needed at the current point.
  • [IsRel]: Evaluates whether the retrieved document is relevant to the question.
  • [IsSup]: Evaluates whether the generated response is supported by the retrieved document.
  • [IsUse]: Evaluates whether the generated response is overall useful.

These Reflection Tokens are added to the model's vocabulary and trained like regular tokens, and the model generates them autonomously during inference. Self-RAG outperformed ChatGPT and retrieval-augmented Llama2-chat at the 7B and 13B parameter scales on Open-Domain QA, Reasoning, and Fact Verification tasks.

7.4 Corrective RAG (CRAG)

CRAG, proposed by Yan et al. (2024), is a pattern that actively evaluates and corrects the quality of retrieved documents.

The core components of CRAG are as follows.

  1. Retrieval Evaluator: A lightweight model judges the relevance of retrieved documents as Correct, Incorrect, or Ambiguous.
  2. Knowledge Refinement: Extracts key information from retrieved documents and removes unnecessary parts. Uses a Decompose-then-Recompose algorithm.
  3. Web Search Fallback: If the Retrieval Evaluator judges Incorrect, it falls back to web search instead of the static corpus to find better information.

[Query] → [Retriever] → [Retrieval Evaluator]
              Correct   → Knowledge Refinement → Generator
              Incorrect → Web Search → Knowledge Refinement → Generator
              Ambiguous → Combine both paths → Generator

The strength of this pattern is that it automatically activates fallback paths even when retrieval quality is low, generating robust responses.


8. Evaluation Metrics: RAGAS Framework

The RAGAS (Retrieval Augmented Generation Assessment) framework is widely used for systematically evaluating RAG system performance. Proposed by Es et al. (2023), RAGAS provides automated metrics that can evaluate RAG pipelines even without Ground Truth.

8.1 Faithfulness

Measures how faithful the generated answer is to the retrieved Context. This is the key metric for detecting hallucination.

Faithfulness = (Number of Claims supported by Context) / (Total number of Claims)

The process works as follows.

  1. The LLM extracts individual Claims from the generated answer.
  2. It determines whether each Claim is supported by the provided Context.
  3. The proportion of supported Claims is calculated.

Values range from 0 to 1, with values closer to 1 meaning the answer is more faithful to the Context.
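As a minimal sketch, once the per-Claim verdicts are available (in RAGAS an LLM judge produces them; here they are supplied by hand), the score is a simple ratio:

```python
def faithfulness(claim_verdicts):
    """claim_verdicts: one boolean per Claim extracted from the answer,
    True if the Claim is supported by the retrieved Context."""
    return sum(claim_verdicts) / len(claim_verdicts)

# 3 of the 4 extracted Claims are supported by the Context
print(faithfulness([True, True, True, False]))  # 0.75
```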

8.2 Answer Relevance

Measures how relevant the generated answer is to the original question.

The process works as follows.

  1. Questions are reverse-engineered from the generated answer.
  2. The Embedding similarity between the generated questions and the original question is computed.
  3. The average similarity becomes the Answer Relevance score.

This approach indirectly measures whether the answer addresses the core of the question and whether it contains unnecessary information.
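A sketch of the scoring step, assuming the reverse-engineered questions have already been embedded (the two-dimensional vectors below are toy illustrations, not real embeddings):

```python
import numpy as np

def answer_relevance(orig_q_vec, gen_q_vecs):
    """Mean cosine similarity between the original question embedding and
    the embeddings of questions generated back from the answer."""
    o = orig_q_vec / np.linalg.norm(orig_q_vec)
    g = gen_q_vecs / np.linalg.norm(gen_q_vecs, axis=1, keepdims=True)
    return float(np.mean(g @ o))

orig = np.array([1.0, 0.0])                           # original question
gen = np.array([[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]])  # 3 generated questions
print(round(answer_relevance(orig, gen), 3))          # close to 1.0
```

An off-topic or padded answer yields regenerated questions that drift from the original, pulling the mean similarity down.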

8.3 Context Recall

Measures how much of the information needed to generate the Ground Truth answer is contained in the retrieved Context.

Context Recall = (Number of GT sentences supported by Context) / (Total number of GT sentences)

This is the only metric that requires Ground Truth. It directly evaluates the Retriever's performance.

8.4 Context Precision

Measures whether the actually relevant items among the retrieved Context are positioned at the top. The score is higher when relevant documents appear earlier in the search results.
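This behaves like a mean of precision@k taken over the relevant ranks. A minimal sketch, assuming per-rank 0/1 relevance judgments are already available (in RAGAS an LLM judge produces them):

```python
def context_precision(relevance):
    """relevance: 0/1 relevance judgments for the retrieved contexts,
    in retrieval order. Mean of precision@k over the relevant ranks."""
    score, hits = 0.0, 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k       # precision@k at each relevant rank
    return score / max(hits, 1)

print(context_precision([1, 0, 1]))  # relevant docs at ranks 1 and 3
print(context_precision([0, 1, 1]))  # same docs ranked lower → lower score
```

The same set of relevant documents scores higher when they sit at the top of the result list, which is exactly what this metric rewards.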

8.5 RAGAS Usage Example

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)
from datasets import Dataset

# Compose evaluation dataset
eval_data = {
    "question": ["What is RAG?"],
    "answer": ["RAG is Retrieval-Augmented Generation, a method where the LLM retrieves external documents to generate answers."],
    "contexts": [["RAG (Retrieval-Augmented Generation) is a technique that retrieves external knowledge to leverage in LLM generation."]],
    "ground_truth": ["RAG stands for Retrieval-Augmented Generation, a methodology that augments LLM response generation by retrieving relevant information from external knowledge sources."],
}
dataset = Dataset.from_dict(eval_data)

# Run evaluation
result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
)
print(result)

In production environments, it is recommended to integrate these metrics into CI/CD pipelines to automatically monitor the impact of RAG system changes (chunking strategy modifications, model swaps, etc.) on quality.


9. Practical Implementation with LangChain + ChromaDB

Synthesizing all the concepts covered so far, here is a complete code implementation of a production RAG pipeline using LangChain and ChromaDB.

9.1 Environment Setup and Package Installation

pip install langchain langchain-openai langchain-chroma langchain-community
pip install chromadb pypdf sentence-transformers

9.2 Document Loading and Chunking

import os
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load PDF documents
loader = DirectoryLoader(
    "./documents",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader,
)
documents = loader.load()
print(f"Number of documents loaded: {len(documents)}")

# Recursive Character Splitting
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
)
splits = text_splitter.split_documents(documents)
print(f"Number of chunks generated: {len(splits)}")

9.3 Embedding and Vector Store Construction

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Embedding model setup
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key=os.getenv("OPENAI_API_KEY"),
)

# Build and persist ChromaDB Vector Store
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="rag_collection",
)
print(f"Documents stored in Vector Store: {vectorstore._collection.count()}")

9.4 Retriever Configuration

# Basic Retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5},
)

# MMR (Maximal Marginal Relevance) Retriever - ensures diversity
retriever_mmr = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 5,
        "fetch_k": 20,      # Initial retrieval candidate count
        "lambda_mult": 0.7,  # Balance between relevance (1.0) and diversity (0.0)
    },
)

9.5 RAG Chain Construction and Execution

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# LLM setup
llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    openai_api_key=os.getenv("OPENAI_API_KEY"),
)

# Prompt Template
prompt = ChatPromptTemplate.from_template("""
Answer the question based on the following Context.
If the Context does not contain the information needed to answer, respond with "No relevant information was found in the provided documents."

Context:
{context}

Question: {question}

Answer:
""")

# Context formatting function
def format_docs(docs):
    return "\n\n---\n\n".join(
        f"[Source: {doc.metadata.get('source', 'unknown')}, "
        f"Page: {doc.metadata.get('page', 'N/A')}]\n{doc.page_content}"
        for doc in docs
    )

# Construct RAG Chain with LCEL (LangChain Expression Language)
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Execute
question = "What are the types of chunking strategies in a RAG system and their pros and cons?"
answer = rag_chain.invoke(question)
print(answer)

9.6 Responses with Source Information

from langchain_core.runnables import RunnableParallel

# Chain that returns both source information and the answer
rag_chain_with_sources = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(
    answer=(
        # Format the retrieved docs, then run prompt → LLM → parser
        (lambda x: {"context": format_docs(x["context"]), "question": x["question"]})
        | prompt
        | llm
        | StrOutputParser()
    )
)

# More concise approach
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# create_retrieval_chain passes the user query under the "input" key,
# so the combine-documents prompt must use {input} rather than {question}
qa_prompt = ChatPromptTemplate.from_template(
    "Answer the question based on the following Context.\n\n"
    "Context:\n{context}\n\nQuestion: {input}\n\nAnswer:"
)

# Stuff Documents Chain (combines retrieved documents into a single Context)
combine_docs_chain = create_stuff_documents_chain(llm, qa_prompt)

# Retrieval Chain
retrieval_chain = create_retrieval_chain(retriever, combine_docs_chain)

# Execute - returns both context and answer
response = retrieval_chain.invoke({"input": question})
print("Answer:", response["answer"])
print("\nReference documents:")
for i, doc in enumerate(response["context"], 1):
    print(f"  [{i}] {doc.metadata.get('source', 'unknown')} "
          f"(p.{doc.metadata.get('page', 'N/A')})")

9.7 Conversational RAG

from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import MessagesPlaceholder
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

# Prompt that rewrites the latest question using conversation history
contextualize_prompt = ChatPromptTemplate.from_messages([
    ("system", "Reformulate the user's latest question so it can be understood independently, considering the previous conversation context."),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])

# Retriever that first reformulates the question, then retrieves
history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_prompt
)
conversational_chain = create_retrieval_chain(
    history_aware_retriever, combine_docs_chain
)

# Per-session history management
store = {}

def get_session_history(session_id: str):
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

# Construct Conversational RAG Chain
conversational_rag = RunnableWithMessageHistory(
    conversational_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
    output_messages_key="answer",
)

# Execute conversation
config = {"configurable": {"session_id": "user_001"}}

response1 = conversational_rag.invoke(
    {"input": "What is RAG?"},
    config=config,
)
print(response1["answer"])

response2 = conversational_rag.invoke(
    {"input": "What are its main advantages?"},  # "its" = RAG from previous conversation
    config=config,
)
print(response2["answer"])

This implementation is a basic RAG pipeline. Moving to production requires applying Advanced patterns such as Re-ranking and HyDE as covered above, systematic evaluation through RAGAS, and building monitoring and logging infrastructure.


10. Summary

RAG is the most practical and effective approach for addressing the hallucination problem in LLMs. The Retriever + Generator architecture proposed in the original paper by Lewis et al. (2020) has since evolved into various Advanced patterns, establishing itself as the core architecture for building production-level AI systems.

The key decisions for building an effective RAG system can be summarized as follows.

  1. Chunking Strategy: Start with RecursiveCharacterTextSplitter as the default, and consider Semantic Chunking based on data characteristics.
  2. Embedding Model: BGE-M3 for multilingual needs; OpenAI text-embedding-3 series for English-centric, reliable performance.
  3. Vector Database: Chroma for prototyping; choose a DB suited to the workload for production.
  4. Advanced Patterns: Re-ranking is worth applying in almost all cases; HyDE and CRAG are worth considering when retrieval quality is insufficient.
  5. Evaluation: Integrate RAGAS metrics into CI/CD to continuously monitor quality.

References