1. The LLM Hallucination Problem and the Emergence of RAG

Large Language Models (LLMs) are pre-trained on vast amounts of text data and demonstrate remarkable performance in natural language understanding and generation. However, LLMs have a fundamental limitation: the **Hallucination** problem.

Hallucination refers to the phenomenon where a model confidently generates information that is not factually correct. The root causes of this problem are as follows.

- **Static nature of knowledge**: Knowledge encoded in LLM parameters is fixed at training time. It cannot reflect events or updated information that occurred after training.

- **Imperfect parametric memory**: Even with billions of parameters, it is impossible to accurately store and reproduce every detail about the world.

- **Probabilistic generation**: LLMs predict and generate the next token probabilistically, which means they can produce text that is statistically plausible but factually incorrect.

- **Inability to trace sources**: It is impossible to trace which training data a generated answer originated from, making verification inherently difficult.

The approach that emerged to overcome these limitations is **Retrieval-Augmented Generation (RAG)**. The core idea of RAG is simple yet powerful: before the LLM generates an answer, the system retrieves relevant documents from an external knowledge store and passes them to the LLM, which then generates the answer grounded in that information.

This yields the following benefits.

1. **Fact-based responses**: Answers are generated based on actual retrieved documents, reducing hallucination.

2. **Easy knowledge updates**: Simply updating the external database reflects the latest information. No model retraining is required.

3. **Source attribution**: The documents underlying the answer can be presented alongside it, improving transparency and trustworthiness.

4. **Easy domain specialization**: By indexing only domain-specific documents, a specialized system can be built quickly.

2. Original RAG Paper Architecture: Retriever + Generator

The paper **"Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"** (NeurIPS 2020), published by Patrick Lewis et al. at Facebook AI Research (now Meta AI) in 2020, is the seminal work that first formalized the concept of RAG.

2.1 Core Proposal

Lewis et al. proposed a general (fine-tunable) methodology for combining the **Parametric Memory** (implicit knowledge within parameters) of a pre-trained language model with **Non-Parametric Memory** (explicit knowledge from an external document index). Specifically, the Parametric Memory is a pre-trained seq2seq model (BART), and the Non-Parametric Memory is the entirety of Wikipedia built as a Dense Vector Index.

2.2 Architecture Components

The RAG model architecture consists of two major components.

**Retriever - `p_eta(z|x)`**

Given an input query `x`, this component retrieves relevant documents (passages) `z`. The paper uses Dense Passage Retrieval (DPR) to encode queries and documents as Dense Vectors, then retrieves the top-k relevant documents via Maximum Inner Product Search (MIPS).

**Generator - `p_theta(y_i|x, z, y_{1:i-1})`**

This component receives the retrieved documents `z` along with the original input `x` as context and generates the final output `y`. The paper uses BART-large as the Generator.

2.3 Two RAG Variants

The paper proposes two variants based on how retrieved documents are utilized.

**RAG-Sequence**

A single retrieved document is used consistently when generating the entire output sequence. For each retrieved document `z`, the entire sequence is generated, and then the probabilities across documents are marginalized.

p_RAG-Sequence(y|x) ≈ Σ_z p_eta(z|x) Π_i p_theta(y_i|x, z, y_{1:i-1})

**RAG-Token**

Different retrieved documents can be referenced for each output token generated. Probabilities across documents are marginalized at the token level.

p_RAG-Token(y|x) ≈ Π_i Σ_z p_eta(z|x) p_theta(y_i|x, z, y_{1:i-1})
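To make the difference between the two marginalizations concrete, here is a minimal numeric sketch. The probabilities are made-up toy values (two retrieved documents, a two-token output), and the function names are hypothetical; it only evaluates the two formulas above and is not the paper's decoding procedure.

```python
from math import prod

# Hypothetical toy numbers: 2 retrieved documents, a 2-token target output y.
doc_prior = [0.6, 0.4]  # p_eta(z|x) for documents z1 and z2
# token_probs[z][i] = p_theta(y_i | x, z, y_{1:i-1}) for the target tokens
token_probs = [
    [0.9, 0.2],  # document z1 supports the first token well
    [0.1, 0.8],  # document z2 supports the second token well
]


def rag_sequence(doc_prior, token_probs):
    """Score the whole sequence per document, then marginalize over documents."""
    return sum(p_z * prod(probs) for p_z, probs in zip(doc_prior, token_probs))


def rag_token(doc_prior, token_probs):
    """Marginalize over documents at every token, then multiply across tokens."""
    n_tokens = len(token_probs[0])
    return prod(
        sum(p_z * probs[i] for p_z, probs in zip(doc_prior, token_probs))
        for i in range(n_tokens)
    )


print(rag_sequence(doc_prior, token_probs))  # sum over z of p(z) * product of token probs
print(rag_token(doc_prior, token_probs))     # product over tokens of the per-token mixture
```

In this toy case RAG-Token assigns the sequence a higher probability, because each token can draw on whichever document supports it best.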

2.4 Key Experimental Results

The RAG model achieved **state-of-the-art** performance on three Open-Domain QA benchmarks (Natural Questions, TriviaQA, WebQuestions), surpassing both existing parametric seq2seq models and task-specific retrieve-and-extract architectures. Notably, the RAG model generated text that was more **specific, diverse, and factual** compared to parametric-only models.

3. Dense Passage Retrieval (DPR) Principles

The critical component in RAG's Retriever is **Dense Passage Retrieval (DPR)**. It was proposed in the paper **"Dense Passage Retrieval for Open-Domain Question Answering"** by Karpukhin et al. at EMNLP 2020.

3.1 Limitations of Sparse Retrieval

Traditional information retrieval primarily used **Sparse Retrieval** methods like **BM25**. BM25 ranks documents by exact term matches, weighting each match with TF-IDF-style term-frequency and inverse-document-frequency statistics, but it has the following limitations.

- **Lexical Mismatch**: Relevant documents cannot be found when synonyms or different expressions are used. For example, a query about "machine learning" will fail to match a document that uses "ML" or an equivalent term in another language.

- **No semantic similarity**: Since only word frequency is considered, contextual meaning is not captured.
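The lexical mismatch problem can be reproduced with a small, self-contained BM25 scorer. This is a sketch under simplifying assumptions: whitespace tokenization, the common `k1=1.5` and `b=0.75` defaults, and a three-document toy corpus that is purely illustrative.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: how many documents contain the exact term.
            df = sum(1 for t in tokenized if term in t)
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "machine learning models need training data",
    "ML systems learn patterns from examples",  # relevant, but no shared term
    "the weather is sunny today",
]
scores = bm25_scores("machine learning", docs)
print(scores)
```

The second document is about the same topic, yet it receives the same score as the unrelated third one because it shares no exact term with the query.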

3.2 DPR's Bi-Encoder Architecture

DPR uses a **Bi-Encoder** architecture. Two independent BERT-base encoders convert queries and documents into Dense Vectors respectively.

- Query Encoder: E_Q(q) → d-dimensional vector

- Passage Encoder: E_P(p) → d-dimensional vector

Similarity is computed as the **Inner Product** of the two vectors.

sim(q, p) = E_Q(q)^T · E_P(p)

The key advantage of this architecture is that query and document encoding are independent. Document encoding can be performed offline and indexed in ANN (Approximate Nearest Neighbor) libraries like FAISS. At search time, only the query needs to be encoded, enabling millisecond-level retrieval even across millions of documents.
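The offline/online split can be sketched without any model at all. In the example below the three-dimensional vectors are made-up stand-ins for `E_P(p)` and `E_Q(q)`, and the search is an exhaustive inner-product scan; an ANN library such as FAISS would approximate the same ranking at much larger scale.

```python
def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))


# Offline step: passage vectors are computed once and stored.
# Hypothetical stand-ins for E_P(p); a real encoder outputs hundreds of dimensions.
passage_index = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.2, 0.8, 0.1],
    "p3": [0.0, 0.3, 0.9],
}


def search(query_vec, index, k=2):
    """Exhaustive maximum inner product search over the stored passage vectors."""
    scored = [(inner_product(query_vec, vec), pid) for pid, vec in index.items()]
    return sorted(scored, reverse=True)[:k]


# Online step: only the query is encoded (stand-in for E_Q(q)).
query_vec = [0.1, 0.9, 0.2]
top = search(query_vec, passage_index)
print(top)
```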

3.3 Training Method

DPR is trained using an **In-Batch Negatives** strategy. Correct passages for other questions in the batch are used as Negative Samples. Additionally, passages retrieved by BM25 that are not correct answers are used as Hard Negatives to improve training effectiveness.
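As a rough illustration of the in-batch negatives objective, the sketch below computes the mean negative log-likelihood of the gold passage for each question, given a question-by-passage similarity matrix. The matrices are toy values, and BM25 hard negatives are left out to keep the example short.

```python
import math


def in_batch_negative_loss(sim):
    """Mean negative log-likelihood of the diagonal (positive) entries.

    sim[i][j] is the similarity between question i and passage j. Passage i is
    the gold passage for question i; every other passage in the batch acts as
    a negative sample.
    """
    losses = []
    for i, row in enumerate(sim):
        log_denominator = math.log(sum(math.exp(s) for s in row))
        losses.append(log_denominator - row[i])
    return sum(losses) / len(losses)


# Toy batches of 3 questions: one where gold passages score highest, one where they do not.
good = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]
bad = [[0.0, 5.0, 0.0], [0.0, 0.0, 5.0], [5.0, 0.0, 0.0]]
print(in_batch_negative_loss(good), in_batch_negative_loss(bad))
```

A batch of size B yields B - 1 negatives per question at no extra encoding cost, which is what makes the strategy efficient.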

3.4 Performance

DPR achieved an absolute improvement of **9% to 19%** in Top-20 Passage Retrieval Accuracy compared to BM25. This demonstrates that high-quality Dense Retrievers can be trained with only a limited number of query-passage pairs.

4. Chunking Strategies

Splitting documents into appropriately sized chunks is a critical step in RAG systems that directly impacts retrieval quality. LangChain provides various Text Splitters, and the key chunking strategies are as follows.

4.1 Fixed-Size Chunking

The simplest method: split text into chunks of at most a specified number of characters or tokens.

from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(

separator="\n\n",

chunk_size=1000,

chunk_overlap=200,

length_function=len,

)

docs = text_splitter.split_documents(documents)

- **Pros**: Simple to implement and predictable.

- **Cons**: May split in the middle of sentences or semantic units.

The `chunk_overlap` parameter creates overlap between adjacent chunks to mitigate context loss.
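To see what overlap does to chunk boundaries, here is a plain sliding-window sketch. It is deliberately simpler than `CharacterTextSplitter` (which splits on a separator and then merges pieces); the function name and the tiny sizes are illustrative assumptions.

```python
def fixed_size_chunks(text, chunk_size, chunk_overlap):
    """Slide a window of chunk_size characters, stepping by size minus overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


chunks = fixed_size_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # each chunk repeats the last 2 characters of the previous one
```

Larger overlap lowers the chance that a sentence is cut off from its context, at the price of more chunks to embed and store.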

4.2 Recursive Character Splitting

LangChain's most recommended general-purpose Text Splitter. It recursively applies multiple levels of separators to preserve semantic units as much as possible.

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(

chunk_size=1000,

chunk_overlap=200,

separators=["\n\n", "\n", ". ", " ", ""],

length_function=len,

)

docs = text_splitter.split_documents(documents)

The operation logic is as follows.

1. First, attempt to split by `\n\n` (paragraph breaks).

2. If a chunk still exceeds `chunk_size`, split by `\n` (line breaks).

3. If still too large, split by `. ` (sentence boundaries).

4. As a last resort, split by spaces or individual characters.

The key principle is **trying to preserve larger semantic units first, and only breaking down into smaller units when necessary**.
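The recursion can be sketched in a few lines. This is a simplified illustration, not LangChain's implementation: it drops the separators, does not merge small pieces back together, and treats an exhausted separator list as the "split by individual characters" fallback.

```python
def recursive_split(text, chunk_size, separators):
    """Split on the coarsest separator first and recurse into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: fall back to hard character cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    head, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(head):
        chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks


text = "para one.\n\nline a\nline b is longer"
pieces = recursive_split(text, chunk_size=16, separators=["\n\n", "\n", ". ", " "])
print(pieces)
```

The paragraph break is tried first; only the second paragraph is too long, so only it is split again at the line break.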

4.3 Semantic Chunking

The most involved of the three strategies: it detects points where the meaning changes, based on embedding similarity.

from langchain_experimental.text_splitter import SemanticChunker

from langchain_openai.embeddings import OpenAIEmbeddings

text_splitter = SemanticChunker(

OpenAIEmbeddings(),

breakpoint_threshold_type="percentile",

)

docs = text_splitter.split_documents(documents)

It computes the Embedding Cosine Similarity between adjacent sentences and sets chunk boundaries where similarity drops sharply. While it produces semantically coherent chunks, additional embedding computation costs are incurred.
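The boundary detection can be illustrated with a toy stand-in for the embedding model. The sketch below assumes bag-of-words count vectors instead of real embeddings and a fixed similarity threshold instead of the percentile rule, so it shows the idea only, not `SemanticChunker`'s behavior.

```python
import math
from collections import Counter


def embed(sentence):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(sentence.lower().replace(".", "").split())


def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def semantic_chunks(sentences, threshold=0.2):
    """Start a new chunk whenever adjacent-sentence similarity drops below threshold."""
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append([cur])
        else:
            chunks[-1].append(cur)
    return [" ".join(c) for c in chunks]


sentences = [
    "Dense retrieval encodes queries as vectors.",
    "Dense retrieval encodes passages as vectors too.",
    "Our office cafeteria serves lunch at noon.",
]
result = semantic_chunks(sentences)
print(result)
```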

4.4 Chunking Strategy Selection Guide

| Strategy | Best For | Cost |
| ---------- | ----------------------------------------------------------------- | ---- |
| Fixed-Size | Uniformly structured documents, rapid prototyping | Low |
| Recursive | Most general use cases (recommended default) | Low |
| Semantic | Documents where semantic boundaries matter, high quality required | High |

In practice, `chunk_size` is typically set between 500 and 1500 characters (the examples above measure length with `len`), with `chunk_overlap` at 10-20% of `chunk_size`. Optimal values should be determined experimentally based on data characteristics and use case.

5. Embedding Model Selection

The choice of embedding model for converting chunks to vectors is a crucial decision that determines retrieval quality. The MTEB (Massive Text Embedding Benchmark) leaderboard provides useful comparisons of major models.

5.1 OpenAI text-embedding-3 Series

from langchain_openai import OpenAIEmbeddings

# High-performance model
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")  # 3072 dimensions

# Cost-efficient model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")  # 1536 dimensions

- **MTEB Score**: text-embedding-3-large approximately 64.6

- **Pros**: Convenient to use via API calls with consistent quality. Supports Matryoshka Representation, allowing dimension reduction with minimal performance loss.

- **Cons**: API costs apply, and data is sent to external servers.

5.2 Sentence-Transformers

from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(

model_name="sentence-transformers/all-MiniLM-L6-v2"

)

- **Pros**: Open-source and can run locally. Many fast, lightweight models are available for English. all-MiniLM-L6-v2 is lightweight at 384 dimensions while offering decent performance.

- **Cons**: Limited multilingual support, and performance may be lower compared to larger models.

5.3 BGE (BAAI General Embedding) Series

from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(

model_name="BAAI/bge-m3",

model_kwargs={"device": "cuda"},

encode_kwargs={"normalize_embeddings": True},

)

- **MTEB Score**: BGE-M3 approximately 63.0

- **Pros**: Top-tier open-source performance, supporting over 100 languages. Particularly suitable for multilingual RAG including Korean. Supports Dense, Sparse, and Multi-Vector Retrieval as a hybrid model.

- **Cons**: The model is comparatively large, so a GPU is recommended for acceptable throughput.

5.4 Selection Criteria

| Criterion | Recommended Model |
| ------------------------------------ | ------------------------------------------------ |
| Rapid prototyping | OpenAI text-embedding-3-small |
| Production (quality-first) | OpenAI text-embedding-3-large or Cohere embed-v4 |
| Production (cost-first, self-hosted) | BGE-M3 |
| Multilingual | BGE-M3 |
| Lightweight / edge environments | all-MiniLM-L6-v2 |

The key principle is to **always benchmark with your actual data**. MTEB scores are based on general benchmarks, so performance on specific domains may differ.

6. Vector Database Comparison

The Vector Database that stores embedding vectors and performs similarity search is a core piece of RAG system infrastructure. Here is a comparison of major Vector Databases.

6.1 Chroma

from langchain_chroma import Chroma

vectorstore = Chroma.from_documents(

documents=docs,

embedding=embeddings,

persist_directory="./chroma_db",

)

- **Type**: Open-source, Embedded (In-Process)

- **Best for**: Local development, prototyping, small-scale projects

- **Pros**: Easy to install (`pip install chromadb`), runs within the Python process without a separate server. Excellent LangChain integration.

- **Limitations**: Performance may degrade at large scale (millions of vectors or more). Production-level availability and scalability are limited.

6.2 Pinecone

- **Type**: Managed SaaS (fully managed)

- **Best for**: Production environments, teams wanting to minimize operational burden

- **Pros**: Serverless architecture eliminates infrastructure management. Multi-region support, high availability, and auto-scaling. Scales to billions of vectors.

- **Limitations**: Relatively high cost. Vendor lock-in concerns. Not open-source.

6.3 Weaviate

- **Type**: Open-source + Managed Cloud

- **Best for**: When Hybrid Search (vector + keyword) is important, when flexible schemas are needed

- **Pros**: Strong native Hybrid Search support. GraphQL API, modular architecture, and built-in vectorization modules. Flexible with both open-source and managed cloud options.

- **Limitations**: Has a learning curve and configuration can be somewhat complex.

6.4 pgvector

CREATE EXTENSION vector;

CREATE TABLE documents (

id SERIAL PRIMARY KEY,

content TEXT,

embedding vector(1536)

);

-- Similarity search

SELECT content, embedding <=> '[0.1, 0.2, ...]'::vector AS distance

FROM documents

ORDER BY distance

LIMIT 5;

- **Type**: PostgreSQL extension

- **Best for**: Organizations already using PostgreSQL, those not wanting to add separate infrastructure

- **Pros**: Leverages existing PostgreSQL infrastructure. SQL and vector search in a single database simplifies architecture. Supports HNSW and IVFFlat indexes.

- **Limitations**: Performance may degrade beyond 100 million vectors. Features are limited compared to dedicated Vector Databases.

6.5 Milvus

- **Type**: Open-source, distributed architecture

- **Best for**: Billion-scale vector systems, teams with data engineering capabilities

- **Pros**: Proven performance for industrial-scale large-scale vector search. Supports various index types (IVF, HNSW, DiskANN, etc.) with GPU acceleration. Managed service available via Zilliz Cloud.

- **Limitations**: High operational complexity. Cluster setup and management require specialized expertise.

6.6 Comparison Summary

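| Database | Type | Best for |
| -------- | ------------------------------------- | ---------------------------------------------------------------------- |
| Chroma | Open-source, Embedded (In-Process) | Local development, prototyping, small-scale projects |
| Pinecone | Managed SaaS (fully managed) | Production environments, teams wanting to minimize operational burden |
| Weaviate | Open-source + Managed Cloud | Hybrid Search (vector + keyword), flexible schemas |
| pgvector | PostgreSQL extension | Organizations already using PostgreSQL |
| Milvus | Open-source, distributed architecture | Billion-scale vector systems, teams with data engineering capabilities |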

7. Advanced RAG Patterns

Basic RAG (Naive RAG) is a simple "retrieve then generate" pipeline. In production environments, various Advanced RAG patterns are needed to improve both retrieval and generation quality.

7.1 Re-ranking

In basic RAG, initial retrieval uses Bi-Encoder vector similarity, which is fast but may lack precision. Re-ranking is a pattern that re-evaluates initial retrieval results with a **Cross-Encoder** to improve precision.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Load the Cross-Encoder model
model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-v2-m3")
compressor = CrossEncoderReranker(model=model, top_n=3)

# Configure the re-ranking Retriever: fetch 20 candidates, keep the top 3
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20}),
)

The workflow is as follows.

1. The Bi-Encoder quickly retrieves a broad set of candidate documents (top-20).

2. The Cross-Encoder takes each candidate document paired with the query and scores direct relevance.

3. Only the top re-ranked documents (top-3) are passed to the Generator.

Cross-Encoders are more precise than Bi-Encoders but require individual inference for every candidate, making them unsuitable for large-scale retrieval. Therefore, the standard pattern is a **two-stage pipeline: narrow candidates with a Bi-Encoder, then re-rank with a Cross-Encoder**.
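The cost argument behind the two-stage pipeline can be shown with toy scorers. In this sketch, `cheap_score` and `precise_score` are hypothetical stand-ins for a Bi-Encoder and a Cross-Encoder (word overlap and an exact-phrase bonus); the point is only that the expensive scorer runs on the shortlist, not on the whole corpus.

```python
calls = {"precise": 0}


def cheap_score(query, doc):
    """Stand-in for Bi-Encoder similarity: number of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def precise_score(query, doc):
    """Stand-in for a Cross-Encoder: also rewards the exact query phrase."""
    calls["precise"] += 1
    return cheap_score(query, doc) + (2 if query.lower() in doc.lower() else 0)


def two_stage_search(query, corpus, k=20, top_n=3):
    """Shortlist k candidates with the cheap scorer, then re-rank them precisely."""
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:k]
    return sorted(candidates, key=lambda d: precise_score(query, d), reverse=True)[:top_n]


corpus = [
    "vector search is fast",
    "search for a vector database",
    "fast vector search with FAISS",
    "cooking pasta at home",
]
result = two_stage_search("vector search", corpus, k=3, top_n=2)
print(result, calls)
```

With `k=3`, the precise scorer is invoked three times regardless of corpus size, which is the property that keeps Cross-Encoder re-ranking affordable.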

7.2 HyDE (Hypothetical Document Embeddings)

HyDE, proposed by Gao et al. (2022), is a pattern designed to bridge the semantic gap between queries and documents. User questions are typically short and abstract, while documents containing the answers are long and detailed. This difference can make direct vector similarity search suboptimal.

The core idea of HyDE is as follows.

1. When a user query is received, ask the LLM to "write a hypothetical document that answers this question."

2. Embed the generated hypothetical document. This hypothetical document may not be factually accurate, but it has similar format and vocabulary to actual relevant documents.

3. Use this embedding to search for actual documents.

from langchain.chains import HypotheticalDocumentEmbedder
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini")
base_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=llm,
    base_embeddings=base_embeddings,
    prompt_key="web_search",
)

# Search using HyDE embeddings
results = vectorstore.similarity_search_by_vector(
    hyde_embeddings.embed_query("How to evaluate RAG system performance?")
)

The encoder's Dense Bottleneck serves to filter out hallucinations in the hypothetical document, so even if the hypothetical document is not accurate, it can effectively retrieve relevant real documents.

7.3 Self-RAG

**Self-RAG**, proposed by Asai et al. (2023), is a pattern where the LLM autonomously judges the need for retrieval and self-critically evaluates its generated results.

The core mechanism of Self-RAG is the **Reflection Token**.

- **[Retrieve]**: Determines whether external retrieval is needed at the current point.

- **[IsRel]**: Evaluates whether the retrieved document is relevant to the question.

- **[IsSup]**: Evaluates whether the generated response is supported by the retrieved document.

- **[IsUse]**: Evaluates whether the generated response is overall useful.

These Reflection Tokens are added to the model's vocabulary and trained like regular tokens, and the model generates them autonomously during inference. Self-RAG outperformed ChatGPT and retrieval-augmented Llama2-chat at the 7B and 13B parameter scales on Open-Domain QA, Reasoning, and Fact Verification tasks.

7.4 Corrective RAG (CRAG)

**CRAG**, proposed by Yan et al. (2024), is a pattern that actively evaluates and corrects the quality of retrieved documents.

The core components of CRAG are as follows.

1. **Retrieval Evaluator**: A lightweight model judges the relevance of retrieved documents as Correct, Incorrect, or Ambiguous.

2. **Knowledge Refinement**: Extracts key information from retrieved documents and removes unnecessary parts. Uses a Decompose-then-Recompose algorithm.

3. **Web Search Fallback**: If the Retrieval Evaluator judges Incorrect, it falls back to web search instead of the static corpus to find better information.

[Query] → [Retriever] → [Retrieval Evaluator]
                                 ↓
    Correct   → Knowledge Refinement → Generator
    Incorrect → Web Search → Knowledge Refinement → Generator
    Ambiguous → Combine both paths → Generator

The strength of this pattern is that it **automatically activates fallback paths** even when retrieval quality is low, generating robust responses.
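The three-way routing can be sketched as plain control flow. Everything here is a hypothetical illustration: the threshold values, the function names, and the `refine` and `web_search` callables are assumptions standing in for CRAG's evaluator, refinement step, and search tool.

```python
def crag_route(score, upper=0.7, lower=0.3):
    """Map a retrieval-evaluator confidence score to one of three actions."""
    if score >= upper:
        return "correct"
    if score <= lower:
        return "incorrect"
    return "ambiguous"


def gather_knowledge(query, retrieved_docs, score, refine, web_search):
    """Pick the knowledge source(s) according to the evaluator's verdict."""
    action = crag_route(score)
    if action == "correct":
        return refine(retrieved_docs)
    if action == "incorrect":
        return refine(web_search(query))
    # Ambiguous: combine refined local documents with web results.
    return refine(retrieved_docs) + refine(web_search(query))


# Stub implementations so the sketch runs on its own.
def refine(docs):
    return [d.strip() for d in docs]


def web_search(query):
    return [f"web result for {query}"]


knowledge = gather_knowledge("q", [" local doc "], 0.5, refine, web_search)
print(knowledge)
```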

8. Evaluation Metrics: RAGAS Framework

The **RAGAS (Retrieval Augmented Generation Assessment)** framework is widely used for systematically evaluating RAG system performance. Proposed by Es et al. (2023), RAGAS provides automated metrics that can evaluate RAG pipelines even without Ground Truth.

8.1 Faithfulness

Measures how faithful the generated answer is to the retrieved Context. This is the key metric for detecting hallucination.

Faithfulness = (Number of Claims supported by Context) / (Total number of Claims)

The process works as follows.

1. The LLM extracts individual Claims from the generated answer.

2. It determines whether each Claim is supported by the provided Context.

3. The proportion of supported Claims is calculated.

Values range from 0 to 1, with values closer to 1 meaning the answer is more faithful to the Context.
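As a worked example of the formula, the sketch below assumes the per-claim verdicts have already been produced by an LLM judge and only performs the final ratio. The function name and the empty-answer convention are assumptions for illustration.

```python
def faithfulness_score(claim_verdicts):
    """Fraction of extracted claims that the judge marked as supported."""
    if not claim_verdicts:
        return 0.0  # assumption: an answer with no claims scores 0
    return sum(claim_verdicts) / len(claim_verdicts)


# Four claims extracted from an answer; the third is not backed by the Context.
verdicts = [True, True, False, True]
print(faithfulness_score(verdicts))
```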

8.2 Answer Relevance

Measures how relevant the generated answer is to the original question.

The process works as follows.

1. Questions are reverse-engineered from the generated answer.

2. The Embedding similarity between the generated questions and the original question is computed.

3. The average similarity becomes the Answer Relevance score.

This approach indirectly measures whether the answer addresses the core of the question and whether it contains unnecessary information.

8.3 Context Recall

Measures how much of the information needed to generate the Ground Truth answer is contained in the retrieved Context.

Context Recall = (Number of GT sentences supported by Context) / (Total number of GT sentences)

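Because it compares the Context against a reference answer, Context Recall requires Ground Truth by definition, and it directly evaluates the Retriever's performance. (In the `ragas` library, the default `context_precision` metric also uses the reference answer.)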

8.4 Context Precision

Measures whether the actually relevant items among the retrieved Context are positioned at the top. The score is higher when relevant documents appear earlier in the search results.
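One way to turn this rank sensitivity into a number is an average-precision-style score: take precision@k at every rank k that holds a relevant chunk and average those values. The sketch below assumes that formulation and assumes binary relevance labels are already available from a judge.

```python
def context_precision_score(relevance):
    """Mean of precision@k over the ranks k that contain a relevant chunk."""
    hits, total = 0, 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant rank
    return total / hits if hits else 0.0


# Same two relevant chunks, placed at different ranks.
ranked_early = context_precision_score([True, True, False])
ranked_late = context_precision_score([False, True, True])
print(ranked_early, ranked_late)
```

Both result lists contain the same relevant chunks, but the one that ranks them first scores higher.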

8.5 RAGAS Usage Example

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)
from datasets import Dataset

# Compose the evaluation dataset
eval_data = {
    "question": ["What is RAG?"],
    "answer": ["RAG is Retrieval-Augmented Generation, a method where the LLM retrieves external documents to generate answers."],
    "contexts": [["RAG (Retrieval-Augmented Generation) is a technique that retrieves external knowledge to leverage in LLM generation."]],
    "ground_truth": ["RAG stands for Retrieval-Augmented Generation, a methodology that augments LLM response generation by retrieving relevant information from external knowledge sources."],
}
dataset = Dataset.from_dict(eval_data)

# Run the evaluation
result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
)
print(result)

In production environments, it is recommended to integrate these metrics into CI/CD pipelines to automatically monitor the impact of RAG system changes (chunking strategy modifications, model swaps, etc.) on quality.

9. Practical Implementation with LangChain + ChromaDB

Synthesizing all the concepts covered so far, here is an end-to-end implementation of a basic RAG pipeline using LangChain and ChromaDB.

9.1 Environment Setup and Package Installation

pip install langchain langchain-openai langchain-chroma langchain-community

pip install chromadb pypdf sentence-transformers

9.2 Document Loading and Chunking

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load PDF documents
loader = DirectoryLoader(
    "./documents",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader,
)
documents = loader.load()
print(f"Number of documents loaded: {len(documents)}")

# Recursive Character Splitting
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
)
splits = text_splitter.split_documents(documents)
print(f"Number of chunks generated: {len(splits)}")

9.3 Embedding and Vector Store Construction

import os

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Embedding model setup
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key=os.getenv("OPENAI_API_KEY"),
)

# Build and persist the ChromaDB Vector Store
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="rag_collection",
)
print(f"Documents stored in Vector Store: {vectorstore._collection.count()}")

9.4 Retriever Configuration

# Basic Retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5},
)

# MMR (Maximal Marginal Relevance) Retriever - ensures diversity
retriever_mmr = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 5,
        "fetch_k": 20,  # Initial retrieval candidate count
        "lambda_mult": 0.7,  # Balance between relevance (1.0) and diversity (0.0)
    },
)
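To illustrate what `lambda_mult` trades off, here is a minimal greedy MMR sketch over precomputed similarities. The similarity values are made up, and the function is an illustration of the general technique rather than the vector store's internal code.

```python
def mmr(query_sim, doc_sim, k, lambda_mult=0.7):
    """Greedy Maximal Marginal Relevance over precomputed similarities.

    query_sim[i]  : similarity of candidate i to the query
    doc_sim[i][j] : similarity between candidates i and j
    """
    selected = []
    remaining = list(range(len(query_sim)))
    while remaining and len(selected) < k:
        def score(i):
            # Penalize candidates that resemble something already selected.
            redundancy = max((doc_sim[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sim[i] - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Candidates 0 and 1 are near-duplicates; candidate 2 is less relevant but different.
query_sim = [0.9, 0.85, 0.6]
doc_sim = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
print(mmr(query_sim, doc_sim, k=2, lambda_mult=0.7))
print(mmr(query_sim, doc_sim, k=2, lambda_mult=1.0))
```

With `lambda_mult=0.7` the near-duplicate is skipped in favor of the more diverse candidate, while `lambda_mult=1.0` reduces to plain relevance ranking.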

9.5 RAG Chain Construction and Execution

import os

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# LLM setup
llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    openai_api_key=os.getenv("OPENAI_API_KEY"),
)

# Prompt Template
prompt = ChatPromptTemplate.from_template("""
Answer the question based on the following Context.
If the Context does not contain the information needed to answer, respond with "No relevant information was found in the provided documents."

Context:
{context}

Question: {question}

Answer:
""")

# Context formatting function: prefixes each chunk with its source and page
def format_docs(docs):
    return "\n\n---\n\n".join(
        f"[Source: {doc.metadata.get('source', 'unknown')}, "
        f"Page: {doc.metadata.get('page', 'N/A')}]\n{doc.page_content}"
        for doc in docs
    )

# Construct RAG Chain with LCEL (LangChain Expression Language)
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Execute
question = "What are the types of chunking strategies in a RAG system and their pros and cons?"
answer = rag_chain.invoke(question)
print(answer)

9.6 Responses with Source Information

from langchain_core.runnables import RunnableParallel

# Chain that returns both the retrieved documents and the answer
rag_chain_with_sources = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(
    # The answer sub-chain formats the documents, fills the prompt, and calls the LLM
    answer=(
        RunnablePassthrough.assign(context=lambda x: format_docs(x["context"]))
        | prompt
        | llm
        | StrOutputParser()
    )
)

# More concise approach
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# create_retrieval_chain passes the user query under the "input" key,
# so this prompt uses {input} instead of {question}
qa_prompt = ChatPromptTemplate.from_template(
    "Answer the question based on the following Context.\n\n"
    "Context:\n{context}\n\nQuestion: {input}\n\nAnswer:"
)

# Stuff Documents Chain (combines retrieved documents into a single Context)
combine_docs_chain = create_stuff_documents_chain(llm, qa_prompt)

# Retrieval Chain
retrieval_chain = create_retrieval_chain(retriever, combine_docs_chain)

# Execute - returns both context and answer
response = retrieval_chain.invoke({"input": question})
print("Answer:", response["answer"])
print("\nReference documents:")
for i, doc in enumerate(response["context"], 1):
    print(f"  [{i}] {doc.metadata.get('source', 'unknown')} "
          f"(p.{doc.metadata.get('page', 'N/A')})")

9.7 Conversational RAG

from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

# Prompt that rewrites the latest question using the conversation history
contextualize_prompt = ChatPromptTemplate.from_messages([
    ("system", "Reformulate the user's latest question so it can be understood independently, considering the previous conversation context."),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])

# Retriever that first turns a follow-up into a standalone question
history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_prompt
)
conversational_chain = create_retrieval_chain(
    history_aware_retriever, combine_docs_chain
)

# Per-session history management
store = {}

def get_session_history(session_id: str):
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

# Construct the Conversational RAG Chain
conversational_rag = RunnableWithMessageHistory(
    conversational_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
    output_messages_key="answer",
)

# Execute conversation
config = {"configurable": {"session_id": "user_001"}}

response1 = conversational_rag.invoke(
    {"input": "What is RAG?"},
    config=config,
)
print(response1["answer"])

response2 = conversational_rag.invoke(
    {"input": "What are its main advantages?"},  # "its" = RAG from previous conversation
    config=config,
)
print(response2["answer"])

This implementation is a basic RAG pipeline. Moving to production requires applying Advanced patterns such as Re-ranking and HyDE as covered above, systematic evaluation through RAGAS, and building monitoring and logging infrastructure.

10. Summary

RAG is one of the most practical and effective approaches for addressing the hallucination problem in LLMs. The Retriever + Generator architecture proposed in the original paper by Lewis et al. (2020) has since evolved into various Advanced patterns, establishing itself as a core architecture for building production-level AI systems.

The key decisions for building an effective RAG system can be summarized as follows.

1. **Chunking Strategy**: Start with RecursiveCharacterTextSplitter as the default, and consider Semantic Chunking based on data characteristics.

2. **Embedding Model**: BGE-M3 for multilingual needs; OpenAI text-embedding-3 series for English-centric, reliable performance.

3. **Vector Database**: Chroma for prototyping; choose a DB suited to the workload for production.

4. **Advanced Patterns**: Re-ranking is worth applying in almost all cases; HyDE and CRAG are worth considering when retrieval quality is insufficient.

5. **Evaluation**: Integrate RAGAS metrics into CI/CD to continuously monitor quality.

References

- Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., ... & Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. _NeurIPS 2020_. [https://arxiv.org/abs/2005.11401](https://arxiv.org/abs/2005.11401)

- Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., & Yih, W. (2020). Dense Passage Retrieval for Open-Domain Question Answering. _EMNLP 2020_. [https://arxiv.org/abs/2004.04906](https://arxiv.org/abs/2004.04906)

- Gao, L., Ma, X., Lin, J., & Callan, J. (2022). Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE). _ACL 2023_. [https://arxiv.org/abs/2212.10496](https://arxiv.org/abs/2212.10496)

- Asai, A., Wu, Z., Wang, Y., Sil, A., & Hajishirzi, H. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. _ICLR 2024_. [https://arxiv.org/abs/2310.11511](https://arxiv.org/abs/2310.11511)

- Yan, S., Gu, J., Zhu, Y., & Ling, Z. (2024). Corrective Retrieval Augmented Generation. [https://arxiv.org/abs/2401.15884](https://arxiv.org/abs/2401.15884)

- Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. [https://arxiv.org/abs/2309.15217](https://arxiv.org/abs/2309.15217)

- LangChain Text Splitters Documentation. [https://docs.langchain.com/oss/python/integrations/splitters](https://docs.langchain.com/oss/python/integrations/splitters)

- RAGAS Documentation - Available Metrics. [https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/)

- Pinecone - Chunking Strategies. [https://www.pinecone.io/learn/chunking-strategies/](https://www.pinecone.io/learn/chunking-strategies/)

- MTEB: Massive Text Embedding Benchmark. [https://huggingface.co/blog/mteb](https://huggingface.co/blog/mteb)

Quiz

Q1: What is the main topic covered in "RAG: Retrieval-Augmented Generation — Paper Analysis and Production Architecture"?

Analyzing the core concepts of the RAG paper and covering chunking strategies, Vector DB selection, and Advanced RAG patterns for designing production-level RAG systems.
