- Overview
- RAG Architecture
- Environment Setup
- Step 1: Document Loading
- Step 2: Text Chunking
- Step 3: Vector Store (ChromaDB)
- Step 4: Retriever Configuration
- Step 5: RAG Chain Construction
- Step 6: Conversation History Support
- Step 7: Streamlit UI
- Performance Optimization Tips
- Conclusion
- Quiz

Overview
LLMs have broad general knowledge, but they cannot answer questions about your company's internal documents or the latest information. RAG (Retrieval-Augmented Generation) is a pattern that overcomes this limitation by first retrieving documents relevant to the question, then passing that context to the LLM to generate accurate answers.
In this post, we build a PDF document-based QA chatbot from start to finish using LangChain + ChromaDB + OpenAI. We will also create a web UI with Streamlit to complete a fully functional chatbot.
RAG Architecture
The overall RAG flow consists of two stages:
Stage 1: Indexing (Offline)
Documents → Chunking → Embedding → Store in Vector DB
Stage 2: Querying (Online)
Question → Embedding → Similar Document Search → Prompt Construction → LLM Answer
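Before wiring up the real libraries, the two stages can be seen end to end in a toy sketch. This is purely illustrative: word-set overlap (Jaccard) stands in for a real embedding model, and a plain list stands in for the vector DB.

```python
# Toy sketch of the two RAG stages. Word-set overlap (Jaccard) stands in
# for real embeddings; a list serves as the "vector DB".

def embed(text: str) -> set:
    """Stand-in for an embedding model: a set of lowercase words."""
    return set(text.lower().split())

def similarity(a: set, b: set) -> float:
    """Jaccard overlap as a stand-in for cosine similarity."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Stage 1: Indexing (offline) - chunk, embed, store
corpus = [
    "RAG retrieves documents before generation.",
    "ChromaDB persists embeddings on disk.",
    "Streamlit builds quick web UIs in Python.",
]
index = [(chunk, embed(chunk)) for chunk in corpus]

# Stage 2: Querying (online) - embed question, search, build prompt
question = "How does RAG use documents?"
q_vec = embed(question)
best = max(index, key=lambda item: similarity(q_vec, item[1]))
context_prompt = f"Context: {best[0]}\n\nQuestion: {question}"
print(best[0])  # → RAG retrieves documents before generation.
```

The rest of the post replaces each stand-in with the real component: `embed` becomes OpenAI embeddings, the list becomes ChromaDB, and the prompt feeds an actual LLM.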
Environment Setup
Package Installation
```bash
pip install langchain langchain-openai langchain-community \
    chromadb pypdf tiktoken streamlit python-dotenv
```
Environment Variables
```bash
# .env
OPENAI_API_KEY=sk-proj-your-api-key-here
```

```python
# config.py
import os
from dotenv import load_dotenv

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
CHROMA_PERSIST_DIR = "./chroma_db"
COLLECTION_NAME = "my_documents"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
EMBEDDING_MODEL = "text-embedding-3-small"
LLM_MODEL = "gpt-4o-mini"
```
Step 1: Document Loading
```python
# document_loader.py
from langchain_community.document_loaders import (
    PyPDFLoader,
    DirectoryLoader,
    TextLoader,
)

def load_pdf(file_path: str):
    """Load a single PDF file."""
    loader = PyPDFLoader(file_path)
    documents = loader.load()
    print(f"Loaded {len(documents)} pages from {file_path}")
    return documents

def load_directory(dir_path: str, glob: str = "**/*.pdf"):
    """Load all PDFs in a directory."""
    loader = DirectoryLoader(
        dir_path,
        glob=glob,
        loader_cls=PyPDFLoader,
        show_progress=True,
    )
    documents = loader.load()
    print(f"Loaded {len(documents)} pages from {dir_path}")
    return documents

# Usage example
documents = load_directory("./docs")
```
Step 2: Text Chunking
Chunking has a significant impact on RAG performance. Chunks that are too small lack context, while chunks that are too large reduce search accuracy.
```python
# chunker.py
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_documents(documents, chunk_size=1000, chunk_overlap=200):
    """Split documents into chunks."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=["\n\n", "\n", ".", " ", ""],
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks")

    # Add chunk index and size to metadata
    for i, chunk in enumerate(chunks):
        chunk.metadata["chunk_index"] = i
        chunk.metadata["chunk_size"] = len(chunk.page_content)
    return chunks

chunks = chunk_documents(documents)
```
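To see concretely what `chunk_overlap` buys, here is a deliberately naive character-window splitter (not LangChain's implementation): with overlap, the tail of one chunk reappears at the head of the next, so text cut at a boundary still has surrounding context in at least one chunk.

```python
# Minimal character-window splitter to illustrate chunk_overlap.
# Not LangChain's algorithm - just the windowing idea.

def naive_chunk(text: str, size: int, overlap: int) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

text = "The quarterly revenue grew by 12 percent compared to last year."

no_overlap = naive_chunk(text, size=30, overlap=0)
with_overlap = naive_chunk(text, size=30, overlap=10)

# With overlap=10, the last 10 characters of chunk 0 are repeated at the
# start of chunk 1, so a sentence cut at the boundary stays readable.
print(with_overlap[1][:10] == no_overlap[0][-10:])  # → True
```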
Chunking Strategy Comparison
| Strategy | Pros | Cons | Recommended For |
|---|---|---|---|
| RecursiveCharacter | Excellent context preservation; general-purpose | Character-based, so token counts are approximate | General documents |
| TokenTextSplitter | Precise token count control | May break context | Strict token limits |
| MarkdownHeader | Preserves structure | Markdown only | Technical documentation |
| SemanticChunker | Semantic-based splitting | Slow, costly | High-quality requirements |
Step 3: Vector Store (ChromaDB)
```python
# vectorstore.py
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from config import CHROMA_PERSIST_DIR, COLLECTION_NAME, EMBEDDING_MODEL

def create_vectorstore(chunks):
    """Embed chunks and store them in ChromaDB."""
    embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=CHROMA_PERSIST_DIR,
        collection_name=COLLECTION_NAME,
    )
    print(f"Stored {len(chunks)} chunks in ChromaDB")
    return vectorstore

def load_vectorstore():
    """Load an existing ChromaDB collection."""
    embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)
    vectorstore = Chroma(
        persist_directory=CHROMA_PERSIST_DIR,
        collection_name=COLLECTION_NAME,
        embedding_function=embeddings,
    )
    count = vectorstore._collection.count()  # internal client, handy for a quick count
    print(f"Loaded ChromaDB with {count} documents")
    return vectorstore
```
Embedding Model Selection Guide
| Model | Dimensions | Cost | Performance |
|---|---|---|---|
| text-embedding-3-small | 1536 | $0.02/1M tokens | Good |
| text-embedding-3-large | 3072 | $0.13/1M tokens | Very Good |
| text-embedding-ada-002 | 1536 | $0.10/1M tokens | Average (Legacy) |
For the best cost-to-performance ratio, text-embedding-3-small is recommended.
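A quick back-of-envelope calculation makes the cost difference tangible. The prices come from the table above; the ~4 characters per token figure and the corpus size are rough illustrative assumptions.

```python
# Rough embedding-cost estimator (prices per 1M tokens from the table above;
# ~4 characters per English token is a common rough heuristic).

PRICE_PER_1M = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "text-embedding-ada-002": 0.10,
}

def embedding_cost(total_chars: int, model: str) -> float:
    tokens = total_chars / 4  # rough token estimate
    return tokens / 1_000_000 * PRICE_PER_1M[model]

# Example: a 2,000-page corpus at ~3,000 characters per page
chars = 2_000 * 3_000
print(f"small: ${embedding_cost(chars, 'text-embedding-3-small'):.2f}")  # → small: $0.03
print(f"large: ${embedding_cost(chars, 'text-embedding-3-large'):.2f}")  # → large: $0.20
```

Even for a sizable corpus, embedding cost is usually negligible next to LLM generation cost, which is another reason the small model is a safe default.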
Step 4: Retriever Configuration
```python
# retriever.py
def get_retriever(vectorstore, search_type="mmr", k=4):
    """Create a retriever from the vector store."""
    if search_type == "mmr":
        # MMR: balance relevance against diversity
        retriever = vectorstore.as_retriever(
            search_type="mmr",
            search_kwargs={
                "k": k,
                "fetch_k": 20,       # Number of candidate documents
                "lambda_mult": 0.7,  # 1.0 = relevance, 0.0 = diversity
            },
        )
    elif search_type == "similarity_score":
        # Similarity threshold-based
        retriever = vectorstore.as_retriever(
            search_type="similarity_score_threshold",
            search_kwargs={
                "score_threshold": 0.7,
                "k": k,
            },
        )
    else:
        # Default similarity search
        retriever = vectorstore.as_retriever(
            search_kwargs={"k": k},
        )
    return retriever
```
MMR (Maximal Marginal Relevance)
MMR selects documents that are highly relevant yet diverse from the search results. It prevents similar chunks from being returned redundantly, providing the LLM with richer context.
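The greedy selection behind MMR is easy to sketch in a few lines. This is a conceptual illustration, not LangChain's internal code: at each step we pick the candidate with the best blend of query relevance and dissimilarity to what is already selected, with `lambda_mult` playing the same role as above.

```python
# Greedy MMR sketch: score(i) = lambda * relevance(i) - (1 - lambda) * redundancy(i)

def mmr_select(query_sims, doc_sims, k, lambda_mult=0.7):
    """query_sims[i]: similarity of doc i to the query.
    doc_sims[i][j]: similarity between docs i and j."""
    selected = []
    candidates = list(range(len(query_sims)))
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy: closest already-selected document
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sims[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates; doc 2 is less relevant but distinct.
query_sims = [0.90, 0.88, 0.60]
doc_sims = [
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.10],
    [0.10, 0.10, 1.00],
]
print(mmr_select(query_sims, doc_sims, k=2))  # → [0, 2]
```

Note that pure similarity search would have returned the two near-duplicates `[0, 1]`; MMR trades a little relevance for a distinct second document.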
Step 5: RAG Chain Construction
```python
# rag_chain.py
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
from config import LLM_MODEL

SYSTEM_TEMPLATE = """You are a document-based QA assistant.
Answer the question using only the provided context.

Rules:
1. If the information is not in the context, reply "I could not find that information in the documents."
2. Write your answer in Korean.
3. Include specific numbers or citations when possible.
4. Mention the source documents that support your answer.

Context:
{context}
"""

def format_docs(docs):
    """Format retrieved documents into a single context string."""
    formatted = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        formatted.append(
            f"[Document {i}] (Source: {source}, Page: {page})\n{doc.page_content}"
        )
    return "\n\n---\n\n".join(formatted)

def create_rag_chain(retriever):
    """Create a RAG chain."""
    llm = ChatOpenAI(
        model=LLM_MODEL,
        temperature=0,
        max_tokens=2000,
    )
    prompt = ChatPromptTemplate.from_messages([
        ("system", SYSTEM_TEMPLATE),
        ("human", "{question}"),
    ])
    chain = (
        {
            "context": retriever | format_docs,
            "question": RunnablePassthrough(),
        }
        | prompt
        | llm
        | StrOutputParser()
    )
    return chain
```
Step 6: Conversation History Support
```python
# conversation.py
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains import ConversationalRetrievalChain
from langchain_openai import ChatOpenAI
from config import LLM_MODEL

def create_conversational_chain(retriever):
    """Create a RAG chain with conversation history support."""
    llm = ChatOpenAI(model=LLM_MODEL, temperature=0)
    memory = ConversationBufferWindowMemory(
        memory_key="chat_history",
        return_messages=True,
        output_key="answer",
        k=5,  # Keep only the last 5 turns
    )
    chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=retriever,
        memory=memory,
        return_source_documents=True,
        verbose=False,
    )
    return chain

# Usage example
chain = create_conversational_chain(retriever)
result = chain.invoke({"question": "What is the key content of this document?"})
print(result["answer"])
print(f"\nReferenced {len(result['source_documents'])} documents")
```
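Conceptually, the `k=5` window behaves like a fixed-size queue: once full, each new turn evicts the oldest one. A `deque` with `maxlen` shows the same behavior (this is an illustration of the windowing idea, not the LangChain class itself).

```python
# Windowed memory as a bounded queue: only the last k turns survive.
from collections import deque

k = 5
history = deque(maxlen=k)  # each entry is one (question, answer) turn

for turn in range(8):
    history.append((f"question {turn}", f"answer {turn}"))

print(len(history))   # → 5
print(history[0][0])  # → question 3  (turns 0-2 were evicted)
```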
Step 7: Streamlit UI
```python
# app.py
import streamlit as st
from document_loader import load_pdf
from chunker import chunk_documents
from vectorstore import create_vectorstore, load_vectorstore
from retriever import get_retriever
from rag_chain import create_rag_chain

st.set_page_config(page_title="📚 Document QA Chatbot", layout="wide")
st.title("📚 RAG Document QA Chatbot")

# Sidebar: Document Upload
with st.sidebar:
    st.header("📁 Upload Documents")
    uploaded_files = st.file_uploader(
        "Upload PDF files",
        type=["pdf"],
        accept_multiple_files=True,
    )
    if uploaded_files and st.button("🔄 Process Documents"):
        with st.spinner("Processing documents..."):
            all_chunks = []
            for file in uploaded_files:
                # Save a temporary copy so PyPDFLoader can read from disk
                temp_path = f"/tmp/{file.name}"
                with open(temp_path, "wb") as f:
                    f.write(file.getbuffer())
                docs = load_pdf(temp_path)
                chunks = chunk_documents(docs)
                all_chunks.extend(chunks)
            vectorstore = create_vectorstore(all_chunks)
            st.session_state["vectorstore"] = vectorstore
            st.success(f"✅ {len(all_chunks)} chunks processed!")

# Main: Chat Interface
if "messages" not in st.session_state:
    st.session_state.messages = []

# Display previous messages
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# User input
if prompt := st.chat_input("Ask a question about your documents..."):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    with st.chat_message("assistant"):
        if "vectorstore" not in st.session_state:
            try:
                st.session_state["vectorstore"] = load_vectorstore()
            except Exception:
                st.error("Please upload documents first.")
                st.stop()
        vectorstore = st.session_state["vectorstore"]
        retriever = get_retriever(vectorstore)
        chain = create_rag_chain(retriever)
        with st.spinner("Generating answer..."):
            response = chain.invoke(prompt)
        st.markdown(response)
        st.session_state.messages.append(
            {"role": "assistant", "content": response}
        )
```

```bash
# Run
streamlit run app.py --server.port 8501
```
Performance Optimization Tips
1. Hybrid Search (Keyword + Vector)
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

def create_hybrid_retriever(chunks, vectorstore, k=4):
    """BM25 + vector search ensemble."""
    bm25_retriever = BM25Retriever.from_documents(chunks)
    bm25_retriever.k = k
    vector_retriever = vectorstore.as_retriever(search_kwargs={"k": k})
    ensemble = EnsembleRetriever(
        retrievers=[bm25_retriever, vector_retriever],
        weights=[0.4, 0.6],  # Weight toward vector search
    )
    return ensemble
```
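Under the hood, the ensemble fuses the two ranked lists with weighted Reciprocal Rank Fusion (RRF). A minimal sketch of the idea, with illustrative document IDs (LangChain's implementation uses a smoothing constant of `c=60` by default):

```python
# Weighted Reciprocal Rank Fusion: each list contributes w / (c + rank)
# per document; documents ranked well by both lists float to the top.

def weighted_rrf(rankings, weights, c=60):
    """rankings: list of ranked doc-id lists; weights: one weight per list."""
    scores = {}
    for ranked, w in zip(rankings, weights):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranked = ["doc_a", "doc_b", "doc_c"]      # keyword matches
vector_ranked = ["doc_c", "doc_a", "doc_d"]    # semantic matches
fused = weighted_rrf([bm25_ranked, vector_ranked], weights=[0.4, 0.6])
print(fused)
```

`doc_a` and `doc_c` win because both retrievers rank them, while documents found by only one list fall behind: exactly the "best of both" behavior hybrid search is after.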
2. Improving Search Accuracy with a Reranker
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank  # requires: pip install langchain-cohere

def create_reranked_retriever(vectorstore, k=4, top_n=3):
    """Rerank search results using the Cohere reranker."""
    # Fetch a wider candidate pool so the reranker has something to choose from
    base_retriever = vectorstore.as_retriever(search_kwargs={"k": k * 3})
    compressor = CohereRerank(
        model="rerank-v3.5",
        top_n=top_n,
    )
    return ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=base_retriever,
    )
```
3. Enriching Chunk Metadata
```python
# Add summary metadata to chunks
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Note: this makes one LLM call per chunk - consider batching or
# sampling for large corpora to keep costs down.
for chunk in chunks:
    summary = llm.invoke(
        f"Summarize the following text in one sentence:\n{chunk.page_content}"
    ).content
    chunk.metadata["summary"] = summary
```
Conclusion
The key to a RAG chatbot is retrieval quality. No matter how powerful the LLM is, it cannot generate good answers if irrelevant documents are provided. Here is a prioritized list for performance improvement:
- Optimize chunking strategy — Choose chunk sizes and splitting methods suited to document characteristics
- Choose the right embedding model — Select an embedding model appropriate for your domain
- Hybrid search — BM25 + vector search ensemble
- Apply a reranker — Improve precision by reranking search results
- Prompt engineering — Specify output format and rules
Because LangChain abstracts each of these components behind a common interface, you can easily swap out and experiment with any one of them.
Quiz
Q1: What are the roles of Retrieval and Generation in RAG?
Retrieval is the stage that finds documents related to the question through vector similarity search, and Generation is the stage that passes the retrieved documents as context to the LLM to produce an answer.
Q2: Why do we set chunk_overlap in chunking?
To prevent context from being lost at chunk boundaries. By having overlapping sections between adjacent chunks, even if a sentence is cut in the middle, the next chunk can maintain the complete context.
Q3: What does the persist_directory setting in ChromaDB mean?
It specifies the path for permanently storing vector data on disk. With this setting, you can load the existing vector DB without recalculating embeddings when the process restarts.
Q4: What is the advantage of MMR (Maximal Marginal Relevance) search?
It selects documents that are highly relevant yet diverse from each other. By preventing similar chunks from being returned redundantly, it provides the LLM with a broader range of context.
Q5: What criteria should guide the choice between text-embedding-3-small and text-embedding-3-large?
The small model is 6.5 times cheaper while providing sufficient performance for most use cases. Consider the large model only when domain-specific high precision is needed, such as in medical or legal fields.
Q6: Why is hybrid search (BM25 + vector) better than pure vector search?
Vector search excels at semantic similarity but is weak at exact keyword matching. Since BM25 is strong at keyword matching, ensembling both methods ensures both semantic similarity and keyword accuracy.
Q7: What does k=5 mean in ConversationBufferWindowMemory?
It means only the last 5 conversation turns are kept in memory. Since keeping the entire conversation could exceed the token limit, retaining only recent turns uses the context window efficiently.
Q8: Why do we set a large k for the base retriever when using a reranker?
To allow the reranker to reorder and select the most relevant documents from a wider candidate pool. For example, fetching k=12 candidates and then selecting top_n=3 can recover relevant documents that were missed in the initial search through the reranking process.