- Overview
- RAG Architecture
- Environment Setup
- Step 1: Document Loading
- Step 2: Text Chunking
- Step 3: Vector Store (ChromaDB)
- Step 4: Retriever Configuration
- Step 5: RAG Chain Construction
- Step 6: Conversation History Support
- Step 7: Streamlit UI
- Performance Optimization Tips
- Conclusion
- Quiz

Overview
LLMs have broad general knowledge, but they cannot answer questions about your company's internal documents or the latest information. RAG (Retrieval-Augmented Generation) is a pattern that overcomes this limitation by first retrieving documents relevant to the question, then passing that context to the LLM to generate accurate answers.
In this post, we build a PDF document-based QA chatbot from start to finish using LangChain + ChromaDB + OpenAI. We will also create a web UI with Streamlit to complete a fully functional chatbot.
RAG Architecture
The overall RAG flow consists of two stages:
Stage 1: Indexing (Offline)
Documents → Chunking → Embedding → Store in Vector DB
Stage 2: Querying (Online)
Question → Embedding → Similar Document Search → Prompt Construction → LLM Answer
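Before wiring up the real libraries, the two stages can be seen end to end in a toy sketch. This is purely illustrative: word-set overlap (Jaccard) stands in for a real embedding model, and a plain list stands in for the vector DB.

```python
# Toy sketch of the two RAG stages. Word-set overlap (Jaccard) stands in
# for real embeddings; a list serves as the "vector DB".

def embed(text: str) -> set:
    """Stand-in for an embedding model: a set of lowercase words."""
    return set(text.lower().split())

def similarity(a: set, b: set) -> float:
    """Jaccard overlap as a stand-in for cosine similarity."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Stage 1: Indexing (offline) - chunk, embed, store
corpus = [
    "RAG retrieves documents before generation.",
    "ChromaDB persists embeddings on disk.",
    "Streamlit builds quick web UIs in Python.",
]
index = [(chunk, embed(chunk)) for chunk in corpus]

# Stage 2: Querying (online) - embed question, search, build prompt
question = "How does RAG use documents?"
q_vec = embed(question)
best = max(index, key=lambda item: similarity(q_vec, item[1]))
context_prompt = f"Context: {best[0]}\n\nQuestion: {question}"
print(best[0])  # → RAG retrieves documents before generation.
```

The rest of the post replaces each stand-in with the real component: `embed` becomes OpenAI embeddings, the list becomes ChromaDB, and the prompt feeds an actual LLM.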
Environment Setup
Package Installation
```bash
pip install langchain langchain-openai langchain-community \
    chromadb pypdf tiktoken streamlit python-dotenv
```
Environment Variables
```bash
# .env
OPENAI_API_KEY=sk-proj-your-api-key-here
```

```python
# config.py
import os
from dotenv import load_dotenv

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
CHROMA_PERSIST_DIR = "./chroma_db"
COLLECTION_NAME = "my_documents"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
EMBEDDING_MODEL = "text-embedding-3-small"
LLM_MODEL = "gpt-4o-mini"
```
Step 1: Document Loading
```python
# document_loader.py
from langchain_community.document_loaders import (
    PyPDFLoader,
    DirectoryLoader,
    TextLoader,
)

def load_pdf(file_path: str):
    """Load a single PDF file."""
    loader = PyPDFLoader(file_path)
    documents = loader.load()
    print(f"Loaded {len(documents)} pages from {file_path}")
    return documents

def load_directory(dir_path: str, glob: str = "**/*.pdf"):
    """Load all PDFs in a directory."""
    loader = DirectoryLoader(
        dir_path,
        glob=glob,
        loader_cls=PyPDFLoader,
        show_progress=True,
    )
    documents = loader.load()
    print(f"Loaded {len(documents)} pages from {dir_path}")
    return documents

# Usage example
documents = load_directory("./docs")
```
Step 2: Text Chunking
Chunking has a significant impact on RAG performance. Chunks that are too small lack context, while chunks that are too large reduce search accuracy.
```python
# chunker.py
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_documents(documents, chunk_size=1000, chunk_overlap=200):
    """Split documents into chunks."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=["\n\n", "\n", ".", " ", ""],
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks")

    # Add chunk index and size to metadata
    for i, chunk in enumerate(chunks):
        chunk.metadata["chunk_index"] = i
        chunk.metadata["chunk_size"] = len(chunk.page_content)
    return chunks

chunks = chunk_documents(documents)
```
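To see concretely what `chunk_overlap` buys, here is a deliberately naive character-window splitter (not LangChain's implementation): with overlap, the tail of one chunk reappears at the head of the next, so text cut at a boundary still has surrounding context in at least one chunk.

```python
# Minimal character-window splitter to illustrate chunk_overlap.
# Not LangChain's algorithm - just the windowing idea.

def naive_chunk(text: str, size: int, overlap: int) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

text = "The quarterly revenue grew by 12 percent compared to last year."

no_overlap = naive_chunk(text, size=30, overlap=0)
with_overlap = naive_chunk(text, size=30, overlap=10)

# With overlap=10, the last 10 characters of chunk 0 are repeated at the
# start of chunk 1, so a sentence cut at the boundary stays readable.
print(with_overlap[1][:10] == no_overlap[0][-10:])  # → True
```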
Chunking Strategy Comparison
| Strategy | Pros | Cons | Recommended For |
|---|---|---|---|
| RecursiveCharacter | Excellent context preservation; general-purpose | Character-based, so token counts are approximate | General documents |
| TokenTextSplitter | Precise token count control | May break context | Strict token limits |
| MarkdownHeader | Preserves structure | Markdown only | Technical documentation |
| SemanticChunker | Semantic-based splitting | Slow, costly | High-quality requirements |
Step 3: Vector Store (ChromaDB)
```python
# vectorstore.py
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from config import CHROMA_PERSIST_DIR, COLLECTION_NAME, EMBEDDING_MODEL

def create_vectorstore(chunks):
    """Embed chunks and store them in ChromaDB."""
    embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=CHROMA_PERSIST_DIR,
        collection_name=COLLECTION_NAME,
    )
    print(f"Stored {len(chunks)} chunks in ChromaDB")
    return vectorstore

def load_vectorstore():
    """Load an existing ChromaDB collection."""
    embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)
    vectorstore = Chroma(
        persist_directory=CHROMA_PERSIST_DIR,
        collection_name=COLLECTION_NAME,
        embedding_function=embeddings,
    )
    count = vectorstore._collection.count()  # internal client, handy for a quick count
    print(f"Loaded ChromaDB with {count} documents")
    return vectorstore
```
Embedding Model Selection Guide
| Model | Dimensions | Cost | Performance |
|---|---|---|---|
| text-embedding-3-small | 1536 | $0.02/1M tokens | Good |
| text-embedding-3-large | 3072 | $0.13/1M tokens | Very Good |
| text-embedding-ada-002 | 1536 | $0.10/1M tokens | Average (Legacy) |
For the best cost-to-performance ratio, text-embedding-3-small is recommended.
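A quick back-of-envelope calculation makes the cost difference tangible. The prices come from the table above; the ~4 characters per token figure and the corpus size are rough illustrative assumptions.

```python
# Rough embedding-cost estimator (prices per 1M tokens from the table above;
# ~4 characters per English token is a common rough heuristic).

PRICE_PER_1M = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "text-embedding-ada-002": 0.10,
}

def embedding_cost(total_chars: int, model: str) -> float:
    tokens = total_chars / 4  # rough token estimate
    return tokens / 1_000_000 * PRICE_PER_1M[model]

# Example: a 2,000-page corpus at ~3,000 characters per page
chars = 2_000 * 3_000
print(f"small: ${embedding_cost(chars, 'text-embedding-3-small'):.2f}")  # → small: $0.03
print(f"large: ${embedding_cost(chars, 'text-embedding-3-large'):.2f}")  # → large: $0.20
```

Even for a sizable corpus, embedding cost is usually negligible next to LLM generation cost, which is another reason the small model is a safe default.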
Step 4: Retriever Configuration
```python
# retriever.py
def get_retriever(vectorstore, search_type="mmr", k=4):
    """Create a retriever from the vector store."""
    if search_type == "mmr":
        # MMR: balance relevance against diversity
        retriever = vectorstore.as_retriever(
            search_type="mmr",
            search_kwargs={
                "k": k,
                "fetch_k": 20,       # Number of candidate documents
                "lambda_mult": 0.7,  # 1.0 = relevance, 0.0 = diversity
            },
        )
    elif search_type == "similarity_score":
        # Similarity threshold-based
        retriever = vectorstore.as_retriever(
            search_type="similarity_score_threshold",
            search_kwargs={
                "score_threshold": 0.7,
                "k": k,
            },
        )
    else:
        # Default similarity search
        retriever = vectorstore.as_retriever(
            search_kwargs={"k": k},
        )
    return retriever
```
MMR (Maximal Marginal Relevance)
MMR selects documents that are highly relevant yet diverse from the search results. It prevents similar chunks from being returned redundantly, providing the LLM with richer context.
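The greedy selection behind MMR is easy to sketch in a few lines. This is a conceptual illustration, not LangChain's internal code: at each step we pick the candidate with the best blend of query relevance and dissimilarity to what is already selected, with `lambda_mult` playing the same role as above.

```python
# Greedy MMR sketch: score(i) = lambda * relevance(i) - (1 - lambda) * redundancy(i)

def mmr_select(query_sims, doc_sims, k, lambda_mult=0.7):
    """query_sims[i]: similarity of doc i to the query.
    doc_sims[i][j]: similarity between docs i and j."""
    selected = []
    candidates = list(range(len(query_sims)))
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy: closest already-selected document
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sims[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates; doc 2 is less relevant but distinct.
query_sims = [0.90, 0.88, 0.60]
doc_sims = [
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.10],
    [0.10, 0.10, 1.00],
]
print(mmr_select(query_sims, doc_sims, k=2))  # → [0, 2]
```

Note that pure similarity search would have returned the two near-duplicates `[0, 1]`; MMR trades a little relevance for a distinct second document.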
Step 5: RAG Chain Construction
```python
# rag_chain.py
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
from config import LLM_MODEL

SYSTEM_TEMPLATE = """You are a document-based QA assistant.
Answer the question using only the provided context.

Rules:
1. If the information is not in the context, reply "I could not find that information in the documents."
2. Write your answer in Korean.
3. Include specific numbers or citations when possible.
4. Mention the source documents that support your answer.

Context:
{context}
"""

def format_docs(docs):
    """Format retrieved documents into a single context string."""
    formatted = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        formatted.append(
            f"[Document {i}] (Source: {source}, Page: {page})\n{doc.page_content}"
        )
    return "\n\n---\n\n".join(formatted)

def create_rag_chain(retriever):
    """Create a RAG chain."""
    llm = ChatOpenAI(
        model=LLM_MODEL,
        temperature=0,
        max_tokens=2000,
    )
    prompt = ChatPromptTemplate.from_messages([
        ("system", SYSTEM_TEMPLATE),
        ("human", "{question}"),
    ])
    chain = (
        {
            "context": retriever | format_docs,
            "question": RunnablePassthrough(),
        }
        | prompt
        | llm
        | StrOutputParser()
    )
    return chain
```
Step 6: Conversation History Support
```python
# conversation.py
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains import ConversationalRetrievalChain
from langchain_openai import ChatOpenAI
from config import LLM_MODEL

def create_conversational_chain(retriever):
    """Create a RAG chain with conversation history support."""
    llm = ChatOpenAI(model=LLM_MODEL, temperature=0)
    memory = ConversationBufferWindowMemory(
        memory_key="chat_history",
        return_messages=True,
        output_key="answer",
        k=5,  # Keep only the last 5 turns
    )
    chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=retriever,
        memory=memory,
        return_source_documents=True,
        verbose=False,
    )
    return chain

# Usage example
chain = create_conversational_chain(retriever)
result = chain.invoke({"question": "What is the key content of this document?"})
print(result["answer"])
print(f"\nReferenced {len(result['source_documents'])} documents")
```
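Conceptually, the `k=5` window behaves like a fixed-size queue: once full, each new turn evicts the oldest one. A `deque` with `maxlen` shows the same behavior (this is an illustration of the windowing idea, not the LangChain class itself).

```python
# Windowed memory as a bounded queue: only the last k turns survive.
from collections import deque

k = 5
history = deque(maxlen=k)  # each entry is one (question, answer) turn

for turn in range(8):
    history.append((f"question {turn}", f"answer {turn}"))

print(len(history))   # → 5
print(history[0][0])  # → question 3  (turns 0-2 were evicted)
```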
Step 7: Streamlit UI
```python
# app.py
import streamlit as st
from document_loader import load_pdf
from chunker import chunk_documents
from vectorstore import create_vectorstore, load_vectorstore
from retriever import get_retriever
from rag_chain import create_rag_chain

st.set_page_config(page_title="📚 Document QA Chatbot", layout="wide")
st.title("📚 RAG Document QA Chatbot")

# Sidebar: Document Upload
with st.sidebar:
    st.header("📁 Upload Documents")
    uploaded_files = st.file_uploader(
        "Upload PDF files",
        type=["pdf"],
        accept_multiple_files=True,
    )
    if uploaded_files and st.button("🔄 Process Documents"):
        with st.spinner("Processing documents..."):
            all_chunks = []
            for file in uploaded_files:
                # Save a temporary copy so PyPDFLoader can read from disk
                temp_path = f"/tmp/{file.name}"
                with open(temp_path, "wb") as f:
                    f.write(file.getbuffer())
                docs = load_pdf(temp_path)
                chunks = chunk_documents(docs)
                all_chunks.extend(chunks)
            vectorstore = create_vectorstore(all_chunks)
            st.session_state["vectorstore"] = vectorstore
            st.success(f"✅ {len(all_chunks)} chunks processed!")

# Main: Chat Interface
if "messages" not in st.session_state:
    st.session_state.messages = []

# Display previous messages
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# User input
if prompt := st.chat_input("Ask a question about your documents..."):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    with st.chat_message("assistant"):
        if "vectorstore" not in st.session_state:
            try:
                st.session_state["vectorstore"] = load_vectorstore()
            except Exception:
                st.error("Please upload documents first.")
                st.stop()
        vectorstore = st.session_state["vectorstore"]
        retriever = get_retriever(vectorstore)
        chain = create_rag_chain(retriever)
        with st.spinner("Generating answer..."):
            response = chain.invoke(prompt)
        st.markdown(response)
        st.session_state.messages.append(
            {"role": "assistant", "content": response}
        )
```

```bash
# Run
streamlit run app.py --server.port 8501
```
Performance Optimization Tips
1. Hybrid Search (Keyword + Vector)
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

def create_hybrid_retriever(chunks, vectorstore, k=4):
    """BM25 + vector search ensemble."""
    bm25_retriever = BM25Retriever.from_documents(chunks)
    bm25_retriever.k = k
    vector_retriever = vectorstore.as_retriever(search_kwargs={"k": k})
    ensemble = EnsembleRetriever(
        retrievers=[bm25_retriever, vector_retriever],
        weights=[0.4, 0.6],  # Weight toward vector search
    )
    return ensemble
```
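Under the hood, the ensemble fuses the two ranked lists with weighted Reciprocal Rank Fusion (RRF). A minimal sketch of the idea, with illustrative document IDs (LangChain's implementation uses a smoothing constant of `c=60` by default):

```python
# Weighted Reciprocal Rank Fusion: each list contributes w / (c + rank)
# per document; documents ranked well by both lists float to the top.

def weighted_rrf(rankings, weights, c=60):
    """rankings: list of ranked doc-id lists; weights: one weight per list."""
    scores = {}
    for ranked, w in zip(rankings, weights):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranked = ["doc_a", "doc_b", "doc_c"]      # keyword matches
vector_ranked = ["doc_c", "doc_a", "doc_d"]    # semantic matches
fused = weighted_rrf([bm25_ranked, vector_ranked], weights=[0.4, 0.6])
print(fused)
```

`doc_a` and `doc_c` win because both retrievers rank them, while documents found by only one list fall behind: exactly the "best of both" behavior hybrid search is after.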
2. Improving Search Accuracy with a Reranker
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank  # requires: pip install langchain-cohere

def create_reranked_retriever(vectorstore, k=4, top_n=3):
    """Rerank search results using the Cohere reranker."""
    # Fetch a wider candidate pool so the reranker has something to choose from
    base_retriever = vectorstore.as_retriever(search_kwargs={"k": k * 3})
    compressor = CohereRerank(
        model="rerank-v3.5",
        top_n=top_n,
    )
    return ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=base_retriever,
    )
```
3. Enriching Chunk Metadata
```python
# Add summary metadata to chunks
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Note: this makes one LLM call per chunk - consider batching or
# sampling for large corpora to keep costs down.
for chunk in chunks:
    summary = llm.invoke(
        f"Summarize the following text in one sentence:\n{chunk.page_content}"
    ).content
    chunk.metadata["summary"] = summary
```
Conclusion
The key to a RAG chatbot is retrieval quality. No matter how powerful the LLM is, it cannot generate good answers if irrelevant documents are provided. Here is a prioritized list for performance improvement:
- Optimize chunking strategy — Choose chunk sizes and splitting methods suited to document characteristics
- Choose the right embedding model — Select an embedding model appropriate for your domain
- Hybrid search — BM25 + vector search ensemble
- Apply a reranker — Improve precision by reranking search results
- Prompt engineering — Specify output format and rules
Because LangChain abstracts each of these components behind a common interface, you can easily swap out and experiment with any one of them.
Quiz
Q1: What are the roles of Retrieval and Generation in RAG?
Retrieval is the stage that finds documents related to the question through vector similarity search, and Generation is the stage that passes the retrieved documents as context to the LLM to produce an answer.
Q2: Why do we set chunk_overlap in chunking?
To prevent context from being lost at chunk boundaries. By having overlapping sections between adjacent chunks, even if a sentence is cut in the middle, the next chunk can maintain the complete context.
Q3: What does the persist_directory setting in ChromaDB mean?
It specifies the path for permanently storing vector data on disk. With this setting, you can load the existing vector DB without recalculating embeddings when the process restarts.
Q4: What is the advantage of MMR (Maximal Marginal Relevance) search?
It selects documents that are highly relevant yet diverse from each other. By preventing similar chunks from being returned redundantly, it provides the LLM with a broader range of context.
Q5: What criteria should guide the choice between text-embedding-3-small and text-embedding-3-large?
The small model is 6.5 times cheaper while providing sufficient performance for most use cases. Consider the large model only when domain-specific high precision is needed, such as in medical or legal fields.
Q6: Why is hybrid search (BM25 + vector) better than pure vector search?
Vector search excels at semantic similarity but is weak at exact keyword matching. Since BM25 is strong at keyword matching, ensembling both methods ensures both semantic similarity and keyword accuracy.
Q7: What does k=5 mean in ConversationBufferWindowMemory?
It means only the last 5 conversation turns are kept in memory. Since keeping the entire conversation could exceed the token limit, retaining only recent turns uses the context window efficiently.
Q8: Why do we set a large k for the base retriever when using a reranker?
To allow the reranker to reorder and select the most relevant documents from a wider candidate pool. For example, fetching k=12 candidates and then selecting top_n=3 can recover relevant documents that were missed in the initial search through the reranking process.