Building a RAG Chatbot — Create Your Own Document QA Bot with LangChain + ChromaDB + OpenAI
- Overview
- RAG Architecture
- Environment Setup
- Step 1: Document Loading
- Step 2: Text Chunking
- Step 3: Vector Store (ChromaDB)
- Step 4: Retriever Configuration
- Step 5: RAG Chain Construction
- Step 6: Conversation History Support
- Step 7: Streamlit UI
- Performance Optimization Tips
- Conclusion
- Quiz

Overview
LLMs have broad general knowledge, but they cannot answer questions about your company's internal documents or the latest information. RAG (Retrieval-Augmented Generation) is a pattern that overcomes this limitation by first retrieving documents relevant to the question, then passing that context to the LLM to generate accurate answers.
In this post, we build a PDF document-based QA chatbot from start to finish using LangChain + ChromaDB + OpenAI. We will also create a web UI with Streamlit to complete a fully functional chatbot.
RAG Architecture
The overall RAG flow consists of two stages:
Stage 1: Indexing (Offline)
Documents → Chunking → Embedding → Store in Vector DB
Stage 2: Querying (Online)
Question → Embedding → Similar Document Search → Prompt Construction → LLM Answer
Environment Setup
Package Installation
pip install langchain langchain-openai langchain-community \
    chromadb pypdf tiktoken streamlit python-dotenv
Environment Variables
# .env
OPENAI_API_KEY=sk-proj-your-api-key-here

# config.py
import os
from dotenv import load_dotenv

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
CHROMA_PERSIST_DIR = "./chroma_db"
COLLECTION_NAME = "my_documents"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
EMBEDDING_MODEL = "text-embedding-3-small"
LLM_MODEL = "gpt-4o-mini"
Step 1: Document Loading
# document_loader.py
from langchain_community.document_loaders import (
    PyPDFLoader,
    DirectoryLoader,
    TextLoader,
)

def load_pdf(file_path: str):
    """Load a single PDF file"""
    loader = PyPDFLoader(file_path)
    documents = loader.load()
    print(f"Loaded {len(documents)} pages from {file_path}")
    return documents

def load_directory(dir_path: str, glob: str = "**/*.pdf"):
    """Load all PDFs in a directory"""
    loader = DirectoryLoader(
        dir_path,
        glob=glob,
        loader_cls=PyPDFLoader,
        show_progress=True,
    )
    documents = loader.load()
    print(f"Loaded {len(documents)} pages from {dir_path}")
    return documents

# Usage example
documents = load_directory("./docs")
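The import list above also includes TextLoader, which handles plain-text files. A minimal sketch of loading a directory of .txt files with it (the ./notes path and the encoding are illustrative assumptions, not part of the original setup):

# Hypothetical: load plain-text files with the already-imported TextLoader
txt_loader = DirectoryLoader(
    "./notes",  # assumed directory of .txt files
    glob="**/*.txt",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},
    show_progress=True,
)
text_documents = txt_loader.load()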
Step 2: Text Chunking
Chunking has a significant impact on RAG performance. Chunks that are too small lack context, while chunks that are too large reduce search accuracy.
# chunker.py
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_documents(documents, chunk_size=1000, chunk_overlap=200):
    """Split documents into chunks"""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=["\n\n", "\n", ".", " ", ""],
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks")
    # Add chunk index to metadata
    for i, chunk in enumerate(chunks):
        chunk.metadata["chunk_index"] = i
        chunk.metadata["chunk_size"] = len(chunk.page_content)
    return chunks

chunks = chunk_documents(documents)
Chunking Strategy Comparison
| Strategy | Pros | Cons | Recommended For |
|---|---|---|---|
| RecursiveCharacter | Excellent context preservation | General-purpose | General documents |
| TokenTextSplitter | Precise token count control | May break context | Strict token limits |
| MarkdownHeader | Preserves structure | Markdown only | Technical documentation |
| SemanticChunker | Semantic-based splitting | Slow, costly | High-quality requirements |
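The splitters in the table are largely drop-in replacements for each other. A brief sketch of two common alternatives (the chunk sizes, the header mapping, and the markdown_text variable are illustrative, not recommendations):

from langchain.text_splitter import TokenTextSplitter, MarkdownHeaderTextSplitter

# Token-based splitting: sizes are counted in tokens rather than characters
token_splitter = TokenTextSplitter(chunk_size=256, chunk_overlap=32)
token_chunks = token_splitter.split_documents(documents)

# Header-based splitting: keeps each Markdown section together with its heading
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2")],
)
md_chunks = md_splitter.split_text(markdown_text)  # takes a raw Markdown string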
Step 3: Vector Store (ChromaDB)
# vectorstore.py
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from config import CHROMA_PERSIST_DIR, COLLECTION_NAME, EMBEDDING_MODEL

def create_vectorstore(chunks):
    """Embed chunks and store in ChromaDB"""
    embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=CHROMA_PERSIST_DIR,
        collection_name=COLLECTION_NAME,
    )
    print(f"Stored {len(chunks)} chunks in ChromaDB")
    return vectorstore

def load_vectorstore():
    """Load existing ChromaDB"""
    embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)
    vectorstore = Chroma(
        persist_directory=CHROMA_PERSIST_DIR,
        collection_name=COLLECTION_NAME,
        embedding_function=embeddings,
    )
    # _collection is a private attribute; used here only for a quick sanity check
    count = vectorstore._collection.count()
    print(f"Loaded ChromaDB with {count} documents")
    return vectorstore
Embedding Model Selection Guide
| Model | Dimensions | Cost | Performance |
|---|---|---|---|
| text-embedding-3-small | 1536 | $0.02/1M tokens | Good |
| text-embedding-3-large | 3072 | $0.13/1M tokens | Very Good |
| text-embedding-ada-002 | 1536 | $0.10/1M tokens | Average (Legacy) |
For the best cost-to-performance ratio, text-embedding-3-small is recommended.
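Because tiktoken is already in the dependency list, you can estimate embedding cost before indexing a corpus. A rough sketch, assuming the $0.02/1M-token price above and the cl100k_base encoding used by the text-embedding-3 models:

import tiktoken

def estimate_embedding_cost(chunks, price_per_million_tokens=0.02):
    """Rough cost estimate for embedding all chunks with text-embedding-3-small."""
    encoding = tiktoken.get_encoding("cl100k_base")
    total_tokens = sum(len(encoding.encode(c.page_content)) for c in chunks)
    return total_tokens, total_tokens / 1_000_000 * price_per_million_tokens

tokens, cost = estimate_embedding_cost(chunks)
print(f"{tokens} tokens, approx. ${cost:.4f}")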
Step 4: Retriever Configuration
# retriever.py
def get_retriever(vectorstore, search_type="mmr", k=4):
    """Create a retriever from the vector store"""
    if search_type == "mmr":
        # MMR: balance between relevance and diversity
        retriever = vectorstore.as_retriever(
            search_type="mmr",
            search_kwargs={
                "k": k,
                "fetch_k": 20,  # number of candidate documents
                "lambda_mult": 0.7,  # 1.0 = relevance, 0.0 = diversity
            },
        )
    elif search_type == "similarity_score":
        # Similarity threshold-based
        retriever = vectorstore.as_retriever(
            search_type="similarity_score_threshold",
            search_kwargs={
                "score_threshold": 0.7,
                "k": k,
            },
        )
    else:
        # Default similarity search
        retriever = vectorstore.as_retriever(
            search_kwargs={"k": k},
        )
    return retriever
MMR (Maximal Marginal Relevance)
MMR selects documents that are highly relevant yet diverse from the search results. It prevents similar chunks from being returned redundantly, providing the LLM with richer context.
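To see what MMR actually returns before wiring it into a chain, you can call it directly on the vector store. A quick sketch using the same parameters as the retriever above (the query string is a placeholder):

docs = vectorstore.max_marginal_relevance_search(
    "your question here",  # placeholder query
    k=4,
    fetch_k=20,
    lambda_mult=0.7,
)
for doc in docs:
    print(doc.metadata.get("source"), "-", doc.page_content[:80])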
Step 5: RAG Chain Construction
# rag_chain.py
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
from config import LLM_MODEL

SYSTEM_TEMPLATE = """You are a document-based QA assistant.
Answer the question using only the provided context.

Rules:
1. If the information is not in the context, reply "I could not find that information in the documents."
2. Write your answer in Korean.
3. Include specific numbers or citations when possible.
4. Mention the source documents that support your answer.

Context:
{context}
"""

def format_docs(docs):
    """Format retrieved documents"""
    formatted = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        formatted.append(
            f"[Document {i}] (Source: {source}, Page: {page})\n{doc.page_content}"
        )
    return "\n\n---\n\n".join(formatted)

def create_rag_chain(retriever):
    """Create a RAG chain"""
    llm = ChatOpenAI(
        model=LLM_MODEL,
        temperature=0,
        max_tokens=2000,
    )
    prompt = ChatPromptTemplate.from_messages([
        ("system", SYSTEM_TEMPLATE),
        ("human", "{question}"),
    ])
    chain = (
        {
            "context": retriever | format_docs,
            "question": RunnablePassthrough(),
        }
        | prompt
        | llm
        | StrOutputParser()
    )
    return chain
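Putting the pieces together, usage might look like the sketch below (assuming the functions from the previous steps are imported). Since the chain is built with LCEL, it also supports token-by-token streaming via .stream:

vectorstore = load_vectorstore()
retriever = get_retriever(vectorstore)
chain = create_rag_chain(retriever)

answer = chain.invoke("What is the key content of this document?")
print(answer)

# Streaming variant: print tokens as they are generated
for token in chain.stream("What is the key content of this document?"):
    print(token, end="", flush=True)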
Step 6: Conversation History Support
# conversation.py
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains import ConversationalRetrievalChain
from langchain_openai import ChatOpenAI
from config import LLM_MODEL

def create_conversational_chain(retriever):
    """Create a RAG chain with conversation history support"""
    llm = ChatOpenAI(model=LLM_MODEL, temperature=0)
    memory = ConversationBufferWindowMemory(
        memory_key="chat_history",
        return_messages=True,
        output_key="answer",
        k=5,  # keep only the last 5 turns
    )
    chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=retriever,
        memory=memory,
        return_source_documents=True,
        verbose=False,
    )
    return chain

# Usage example
chain = create_conversational_chain(retriever)
result = chain.invoke({"question": "What is the key content of this document?"})
print(result["answer"])
print(f"\nReferenced {len(result['source_documents'])} documents")
Step 7: Streamlit UI
# app.py
import streamlit as st
from document_loader import load_pdf
from chunker import chunk_documents
from vectorstore import create_vectorstore, load_vectorstore
from retriever import get_retriever
from rag_chain import create_rag_chain

st.set_page_config(page_title="📚 Document QA Chatbot", layout="wide")
st.title("📚 RAG Document QA Chatbot")

# Sidebar: document upload
with st.sidebar:
    st.header("📁 Upload Documents")
    uploaded_files = st.file_uploader(
        "Upload PDF files",
        type=["pdf"],
        accept_multiple_files=True,
    )
    if uploaded_files and st.button("🔄 Process Documents"):
        with st.spinner("Processing documents..."):
            all_chunks = []
            for file in uploaded_files:
                # Save to a temporary file
                temp_path = f"/tmp/{file.name}"
                with open(temp_path, "wb") as f:
                    f.write(file.getbuffer())
                docs = load_pdf(temp_path)
                chunks = chunk_documents(docs)
                all_chunks.extend(chunks)
            vectorstore = create_vectorstore(all_chunks)
            st.session_state["vectorstore"] = vectorstore
            st.success(f"✅ {len(all_chunks)} chunks processed!")

# Main: chat interface
if "messages" not in st.session_state:
    st.session_state.messages = []

# Display previous messages
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# User input
if prompt := st.chat_input("Ask a question about your documents..."):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)
    with st.chat_message("assistant"):
        if "vectorstore" not in st.session_state:
            try:
                st.session_state["vectorstore"] = load_vectorstore()
            except Exception:
                st.error("Please upload documents first.")
                st.stop()
        vectorstore = st.session_state["vectorstore"]
        retriever = get_retriever(vectorstore)
        chain = create_rag_chain(retriever)
        with st.spinner("Generating answer..."):
            response = chain.invoke(prompt)
        st.markdown(response)
        st.session_state.messages.append(
            {"role": "assistant", "content": response}
        )

Run the app:
streamlit run app.py --server.port 8501
Performance Optimization Tips
1. Hybrid Search (Keyword + Vector)
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
def create_hybrid_retriever(chunks, vectorstore, k=4):
"""BM25 + Vector search ensemble"""
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = k
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": k})
ensemble = EnsembleRetriever(
retrievers=[bm25_retriever, vector_retriever],
weights=[0.4, 0.6], # Weight toward vector search
)
return ensemble
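The ensemble behaves like any other retriever, so it can be passed straight into create_rag_chain from Step 5; a brief usage sketch (the query string is a placeholder):

ensemble = create_hybrid_retriever(chunks, vectorstore)
docs = ensemble.get_relevant_documents("exact product code or keyword")
chain = create_rag_chain(ensemble)  # drop-in replacement for the base retriever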
2. Improving Search Accuracy with a Reranker
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.document_compressors import CohereRerank  # requires a Cohere API key

def create_reranked_retriever(vectorstore, k=4, top_n=3):
    """Rerank search results using the Cohere Reranker"""
    # Fetch a wider candidate pool so the reranker has more to choose from
    base_retriever = vectorstore.as_retriever(search_kwargs={"k": k * 3})
    compressor = CohereRerank(
        model="rerank-v3.5",
        top_n=top_n,
    )
    return ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=base_retriever,
    )
3. Enriching Chunk Metadata
# Add summary metadata to chunks
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
for chunk in chunks:
    summary = llm.invoke(
        f"Summarize the following text in one sentence:\n{chunk.page_content}"
    ).content
    chunk.metadata["summary"] = summary
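Calling the LLM once per chunk in a loop is slow for a large corpus. A sketch of the same enrichment using .batch, which issues the requests concurrently (the max_concurrency value is an illustrative choice, not a tested recommendation):

prompts = [
    f"Summarize the following text in one sentence:\n{chunk.page_content}"
    for chunk in chunks
]
# .batch sends the requests concurrently; max_concurrency caps the parallelism
responses = llm.batch(prompts, config={"max_concurrency": 8})
for chunk, response in zip(chunks, responses):
    chunk.metadata["summary"] = response.content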
Conclusion
The key to a RAG chatbot is retrieval quality. No matter how powerful the LLM is, it cannot generate good answers if irrelevant documents are provided. Here is a prioritized list for performance improvement:
- Optimize chunking strategy — Choose chunk sizes and splitting methods suited to document characteristics
- Choose the right embedding model — Select an embedding model appropriate for your domain
- Hybrid search — BM25 + vector search ensemble
- Apply a reranker — Improve precision by reranking search results
- Prompt engineering — Specify output format and rules
Because LangChain abstracts each of these components, you can swap them out and experiment with alternatives easily, which is a major advantage of this stack.
Quiz
Q1: What are the roles of Retrieval and Generation in RAG?
Retrieval is the stage that finds documents related to the question through vector similarity
search, and Generation is the stage that passes the retrieved documents as context to the LLM to
produce an answer.
Q2: Why do we set chunk_overlap in chunking?
To prevent context from being lost at chunk boundaries. By having overlapping sections between
adjacent chunks, even if a sentence is cut in the middle, the next chunk can maintain the complete
context.
Q3: What does the persist_directory setting in ChromaDB mean?
It specifies the path for permanently storing vector data on disk. With this setting, you can load
the existing vector DB without recalculating embeddings when the process restarts.
Q4: What is the advantage of MMR (Maximal Marginal Relevance) search?
It selects documents that are highly relevant yet diverse from each other. By preventing similar
chunks from being returned redundantly, it provides the LLM with a broader range of context.
Q5: What criteria should guide the choice between text-embedding-3-small and text-embedding-3-large?
The small model is 6.5 times cheaper while providing sufficient performance for most use cases.
Consider the large model only when domain-specific high precision is needed, such as in medical or
legal fields.
Q6: Why is hybrid search (BM25 + vector) better than pure vector search?
Vector search excels at semantic similarity but is weak at exact keyword matching. Since BM25 is
strong at keyword matching, ensembling both methods ensures both semantic similarity and keyword
accuracy.
Q7: What does k=5 mean in ConversationBufferWindowMemory?
It means only the last 5 conversation turns are kept in memory. Since keeping the entire
conversation could exceed the token limit, retaining only recent turns uses the context window
efficiently.
Q8: Why do we set a large k for the base retriever when using a reranker?
To allow the reranker to reorder and select the most relevant documents from a wider candidate
pool. For example, fetching k=12 candidates and then selecting top_n=3 can recover relevant
documents that were missed in the initial search through the reranking process.