Advanced RAG Pipeline Complete Guide 2025: Chunking Strategies, Re-ranking, Agentic RAG, Evaluation

1. The Evolution of RAG: From Naive to Advanced

1.1 What Is RAG

RAG (Retrieval-Augmented Generation) is a technique in which an LLM retrieves relevant information from external knowledge sources and receives it as context before generating an answer. It reduces hallucinations, incorporates up-to-date information, and lets the model draw on domain-specific knowledge.

User question → [Retrieval] → [Relevant documents] → [LLM + document context] → Answer generation
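
To make the flow concrete, here is a minimal sketch of that loop. The embed() helper and the collection.search() call are hypothetical stand-ins for whatever embedding model and vector store you use:

from openai import OpenAI

client = OpenAI()

def naive_rag(question: str, collection) -> str:
    # 1. Retrieval: embed the question and fetch the most similar chunks
    query_vector = embed(question)  # hypothetical embedding helper
    docs = collection.search(query_vector=query_vector, limit=3)

    # 2. Generation: hand the retrieved chunks to the LLM as context
    context = "\n\n".join(d.content for d in docs)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content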

1.2 RAG Architecture Evolution

Naive RAG (2023)
├── Fixed-size chunking
├── Single embedding retrieval
├── Top-K results passed directly to the LLM
└── Problems: low retrieval accuracy, context noise

Advanced RAG (2024)
├── Semantic chunking + metadata
├── Hybrid search (vector + keyword)
├── Re-ranking to refine search results
├── Query transformation (HyDE, Multi-Query)
└── Context compression

Modular RAG (2025)
├── Agentic RAG (dynamic routing)
├── Self-RAG (self-reflection)
├── CRAG (Corrective RAG)
├── Multi-modal RAG
├── Knowledge Graph + RAG
└── Modular composable pipelines

1.3 Bottlenecks at Each Stage

| Stage | Naive RAG Problem | Advanced RAG Solution |
|---|---|---|
| Indexing | Fixed-size chunking | Semantic chunking, hierarchical indexing |
| Retrieval | Single vector search | Hybrid search, re-ranking |
| Generation | Noisy context | Context compression, filtering |
| Query | Raw query as-is | Query transformation, decomposition |
| Evaluation | Manual evaluation | RAGAS, automated evaluation |

2. Chunking Strategies

2.1 Why Chunking Matters

Chunking is the first and most consequential step in a RAG pipeline. Poor chunking degrades every stage that follows.

Bad chunking: cut mid-sentence → meaning lost → retrieval fails → wrong answer
Good chunking: split by meaning → rich context → accurate retrieval → correct answer

2.2 Fixed-Size Chunking

The simplest approach: split text every fixed number of tokens/characters.

from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=500,      # chunk size
    chunk_overlap=50,    # overlap (preserves boundary information)
    separator="\n\n"     # separator
)

chunks = splitter.split_text(document_text)

Pros: simple, fast, predictable sizes. Cons: ignores semantic units, can cut mid-sentence.

2.3 Recursive Character Splitting

Tries multiple separators in hierarchical order. The most widely used approach.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]  # priority order
)

chunks = splitter.split_text(document_text)
# Tries \n\n first; if a chunk is still too large, falls back to \n, then ". ", ...

Pros: respects paragraph/sentence boundaries, general-purpose. Cons: no guarantee of semantic coherence.

2.4 Semantic Chunking

Measures embedding similarity between adjacent sentences and splits where the meaning shifts.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Semantic chunking
chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95  # split where sentence-to-sentence distance exceeds the 95th percentile
)

chunks = chunker.split_text(document_text)

How it works:

Sentence 1: "Vector DBs are core AI infrastructure."   → embedding [0.1, 0.3, ...]
Sentence 2: "HNSW is the fastest search algorithm."    → similarity 0.85 (high → same chunk)
Sentence 3: "Meanwhile, the weather in Korea is sunny." → similarity 0.15 (low → start a new chunk!)

Pros: splits on semantic units, high retrieval accuracy. Cons: requires embedding calls (cost/time), uneven chunk sizes.

2.5 Document-Based Chunking

Leverages document structure (headings, sections, tables, code blocks).

from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

splitter = MarkdownHeaderTextSplitter(headers_to_split)
chunks = splitter.split_text(markdown_text)

# Each chunk automatically carries header metadata
for chunk in chunks:
    print(f"Metadata: {chunk.metadata}")
    # {'Header 1': 'Vector Database', 'Header 2': 'Indexing Algorithms'}

2.6 Agentic Chunking

Uses an LLM to decide the optimal chunk boundaries.

import json
from openai import OpenAI

client = OpenAI()

def agentic_chunk(text, max_chunk_size=1500):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"""Split the given text into semantically independent chunks.
Each chunk must be understandable on its own and at most {max_chunk_size} characters long.
Return a JSON object with a "chunks" array of chunk texts."""},
            {"role": "user", "content": text}
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)["chunks"]

Pros: highest quality, genuine context understanding. Cons: expensive, slow, impractical at scale.

2.7 Chunking Strategy Comparison

| Strategy | Quality | Speed | Cost | Best For |
|---|---|---|---|---|
| Fixed-size | Low | Very fast | Free | Prototyping, simple documents |
| Recursive | Medium | Fast | Free | General purpose (default choice) |
| Semantic | High | Medium | Embedding cost | When accuracy matters |
| Document-based | High | Fast | Free | Structured documents (MD, HTML) |
| Agentic | Very high | Slow | LLM cost | Small sets of high-value documents |

2.8 Optimal Chunk Size Guide

General text: 500-1000 tokens (10-20% overlap)
Technical docs: 800-1500 tokens (section-based)
Legal/medical docs: 300-500 tokens (precision-focused)
Code: function/class units (structure-based)
Conversations/QA: per dialog turn
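
These numbers are starting points rather than hard rules. A minimal sketch encoding them as presets (the preset names and exact values are illustrative; sizes are counted in tokens via tiktoken):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Illustrative presets mirroring the guide above (sizes in tokens)
CHUNK_PRESETS = {
    "general":   {"chunk_size": 800,  "chunk_overlap": 120},  # ~15% overlap
    "technical": {"chunk_size": 1200, "chunk_overlap": 150},
    "legal":     {"chunk_size": 400,  "chunk_overlap": 60},
}

def make_splitter(doc_type: str) -> RecursiveCharacterTextSplitter:
    preset = CHUNK_PRESETS.get(doc_type, CHUNK_PRESETS["general"])
    # from_tiktoken_encoder measures chunk_size in tokens rather than characters
    return RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base", **preset
    )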

3. Embedding Model Selection

3.1 Embedding Model Comparison

| Model | Dims | Max Tokens | MTEB Score | Cost | Recommended For |
|---|---|---|---|---|---|
| text-embedding-3-large | 3072 | 8,191 | 64.6 | Paid | When top performance is needed |
| text-embedding-3-small | 1536 | 8,191 | 62.3 | Cheap | General purpose (best cost/performance) |
| embed-v3.0 (Cohere) | 1024 | 512 | 64.5 | Paid | Multilingual, search-focused |
| BGE-M3 (BAAI) | 1024 | 8,192 | 68.2 | Free | Self-hosting, best OSS |
| Jina-embeddings-v3 | 1024 | 8,192 | 65.5 | Free | Multilingual, long documents |
| voyage-3 (Voyage AI) | 1024 | 32,000 | 67.1 | Paid | Code search specialized |

3.2 Embedding Model Selection Criteria

Cost-focused + general purpose → text-embedding-3-small
Top performance + free         → BGE-M3 (self-hosted)
Multilingual + search-focused  → Cohere embed-v3.0
Code search                    → voyage-code-3
Long documents (8K+ tokens)    → Jina-embeddings-v3
Private data + on-premises     → BGE-M3 or Nomic
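
Whichever model you choose, keeping it behind one small interface makes swapping cheap. A sketch for the self-hosted option via sentence-transformers, assuming the BGE-M3 checkpoint from the table above:

from sentence_transformers import SentenceTransformer

# Self-hosted option; swap the checkpoint name to change models
model = SentenceTransformer("BAAI/bge-m3")

def embed(texts: list[str]) -> list[list[float]]:
    # normalize_embeddings=True makes dot product equal cosine similarity
    return model.encode(texts, normalize_embeddings=True).tolist()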

3.3 Late Interaction Models (ColBERT)

Performs fine-grained matching at the token level.

# ColBERT v2 example
from ragatouille import RAGPretrainedModel

rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Indexing
rag.index(
    collection=[doc1, doc2, doc3],
    index_name="my_index"
)

# Search (token-level matching)
results = rag.search(query="vector database performance comparison", k=5)

4. Query Transformation

4.1 Why Query Transformation Is Needed

User queries are often ambiguous, too short, or simply not well suited to retrieval.

Original query: "RAG is slow" (ambiguous, short)
→ Transformed: "How to optimize retrieval latency in a RAG pipeline" (specific, retrieval-friendly)

4.2 HyDE (Hypothetical Document Embeddings)

An LLM first generates a hypothetical answer document, and retrieval runs on that document's embedding.

from openai import OpenAI

client = OpenAI()

def hyde_search(query, collection):
    # 1. Generate a hypothetical answer with the LLM
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Write a detailed answer to the given question. It does not need to be accurate."},
            {"role": "user", "content": query}
        ]
    )
    hypothetical_doc = response.choices[0].message.content
    
    # 2. Embed the hypothetical answer and search with it
    embedding = get_embedding(hypothetical_doc)  # assumes an embedding helper like the embed() sketch in Section 3
    results = collection.search(query_vector=embedding, limit=5)
    
    return results

Pros: bridges the embedding gap between short queries and long documents. Cons: LLM call cost, and the hypothetical answer may be wrong.

4.3 Multi-Query

Rewrites a single query from several angles.

def multi_query_transform(original_query):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": """Rewrite the given question from 3 different perspectives.
Each question must keep the original intent but use different keywords.
Return exactly 3 questions separated by newlines."""},
            {"role": "user", "content": original_query}
        ]
    )
    queries = response.choices[0].message.content.strip().split("\n")
    return [original_query] + queries

# Usage
queries = multi_query_transform("RAG pipeline performance optimization")
# → ["RAG pipeline performance optimization",
#    "How to improve response time in retrieval-augmented generation systems",
#    "Strategies to increase retrieval accuracy in a RAG architecture",
#    "Optimization techniques for LLM-based document retrieval systems"]

# Search with each query and merge the results (deduplicate)
all_results = set()
for q in queries:
    results = search(q)
    all_results.update(results)  # assumes hashable results; in practice, dedupe by document id

4.4 Step-Back Prompting

Transforms a specific question into a more general one.

def step_back_prompt(query):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": """Generate a question that is one step more general and broader than the given question.
The answer to the general question should help answer the original one."""},
            {"role": "user", "content": query}
        ]
    )
    return response.choices[0].message.content

# Example
original = "What is the memory impact of setting the HNSW M parameter to 32 in Qdrant?"
step_back = step_back_prompt(original)
# → "How do HNSW index parameters affect performance and resource usage in vector databases?"

4.5 Query Decomposition

Breaks a complex question into several sub-questions.

import json

def decompose_query(complex_query):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": """Decompose the complex question into simpler sub-questions.
Each sub-question must be answerable independently.
Return a JSON object with a "sub_questions" array."""},
            {"role": "user", "content": complex_query}
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)["sub_questions"]

# Example
complex_q = "Compare Pinecone and Qdrant on performance, cost, and operational ease; which would you recommend at 10M-vector scale?"
sub_questions = decompose_query(complex_q)
# → ["What is Pinecone's performance at 10M-vector scale?",
#    "What is Qdrant's performance at 10M-vector scale?",
#    "What is Pinecone's cost structure?",
#    "What is Qdrant's cost structure?",
#    "How easy is Pinecone to operate?",
#    "How easy is Qdrant to operate?"]

5. Retrieval Optimization

5.1 Hybrid Search

Combines vector search with keyword search.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Qdrant

# BM25 (keyword) retriever
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 5

# Vector retriever
vector_retriever = qdrant_vectorstore.as_retriever(
    search_kwargs={"k": 5}
)

# Ensemble (hybrid)
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]  # weighted toward the vector retriever
)

results = ensemble_retriever.invoke("How to optimize a RAG pipeline")

5.2 Contextual Compression

Extracts only the relevant parts of retrieved documents.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_retriever
)

# Retrieval + compression: keep only the relevant parts
compressed_docs = compression_retriever.invoke("Role of the M parameter in the HNSW algorithm")
# Returns only the question-relevant passages, not the full documents

5.3 Parent Document Retrieval

Searches over small chunks but returns the larger parent documents.

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Small chunks: for retrieval
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

# Large chunks: for context
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

store = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Precise search over 200-token chunks → return 2000-token parent documents
retriever.add_documents(documents)
results = retriever.invoke("HNSW parameter tuning")
# Retrieval stays precise with small chunks; the returned context stays rich with large ones

5.4 Multi-Vector Retrieval

Generates multiple vectors per document (summary, questions, original).

from langchain.retrievers.multi_vector import MultiVectorRetriever

# Generate a summary + hypothetical questions per document
summaries = generate_summaries(documents)
hypothetical_questions = generate_questions(documents)

# Search via summaries/questions, return the original documents
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,  # stores summary/question vectors
    docstore=store,           # stores original documents
    id_key="doc_id"
)

# Searches on summary embeddings but returns the full original documents
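
The snippet above omits how the two stores get populated. A sketch following the LangChain multi-vector pattern, assuming the summaries and documents produced by the helpers above:

import uuid
from langchain_core.documents import Document

doc_ids = [str(uuid.uuid4()) for _ in documents]

# Index one small summary Document per original, tagged with its parent id
summary_docs = [
    Document(page_content=s, metadata={"doc_id": doc_ids[i]})
    for i, s in enumerate(summaries)
]
retriever.vectorstore.add_documents(summary_docs)

# Store the originals under the same ids so hits resolve back to full documents
retriever.docstore.mset(list(zip(doc_ids, documents)))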

6. Re-ranking

6.1 Why Re-ranking Is Needed

Initial retrieval (bi-encoder) is fast but imprecise. Re-ranking with a cross-encoder processes the query and document together, yielding a more accurate relevance judgment.

Stage 1 (bi-encoder): query vector vs. document vector → fast but imprecise → Top 20
Stage 2 (cross-encoder): scores (query, document) pairs directly → slow but accurate → Top 5

6.2 Cross-Encoder Re-ranking

from sentence_transformers import CrossEncoder

# Load a cross-encoder model
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Initial retrieval results
initial_results = vector_search(query, top_k=20)

# Re-rank
pairs = [(query, doc.content) for doc in initial_results]
scores = model.predict(pairs)

# Re-sort by score
reranked = sorted(
    zip(initial_results, scores),
    key=lambda x: x[1],
    reverse=True
)[:5]

6.3 Cohere Rerank

A commercial re-ranking API, notable for high quality and multilingual support.

import cohere

co = cohere.Client("YOUR_API_KEY")

# Texts from the initial retrieval results
documents = [doc.content for doc in initial_results]

# Cohere re-ranking
response = co.rerank(
    model="rerank-v3.5",
    query="RAG pipeline optimization",
    documents=documents,
    top_n=5
)

for result in response.results:
    print(f"Index: {result.index}, Score: {result.relevance_score:.4f}")
    print(f"Text: {documents[result.index][:100]}...")

6.4 ColBERT Re-ranking

Uses the late-interaction approach for fine-grained token-level matching.

from ragatouille import RAGPretrainedModel

# Load the ColBERT model
rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Re-rank
reranked = rag.rerank(
    query="How to optimize a RAG pipeline",
    documents=[doc.content for doc in initial_results],
    k=5
)

6.5 LLM-Based Re-ranking

Lets an LLM judge relevance directly.

def llm_rerank(query, documents, top_n=5):
    scored_docs = []
    
    for doc in documents:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Rate the relevance of the document to the question on a 0-10 scale. Return only the number."},
                {"role": "user", "content": f"Question: {query}\n\nDocument: {doc.content[:500]}"}
            ]
        )
        score = float(response.choices[0].message.content.strip())  # assumes the model returns a bare number
        scored_docs.append((doc, score))
    
    return sorted(scored_docs, key=lambda x: x[1], reverse=True)[:top_n]

6.6 Re-ranking Model Comparison

| Model | Speed | Quality | Cost | Multilingual | Recommended |
|---|---|---|---|---|---|
| cross-encoder/ms-marco | Fast | Good | Free | English | English-only workloads |
| Cohere Rerank v3.5 | Fast | Very good | Paid | 100+ languages | Production default |
| ColBERT v2 | Medium | Very good | Free | English | Self-hosting |
| BGE-Reranker-v2 | Fast | Good | Free | Multilingual | OSS multilingual |
| LLM re-ranking | Slow | Best | High | All | Small high-value sets |

7. Agentic RAG

7.1 What Is Agentic RAG

An LLM agent decides the retrieval strategy dynamically. Instead of a fixed "retrieve then generate" pipeline, the agent evaluates retrieval results and adjusts its strategy.

Traditional RAG: question → retrieve → generate (fixed pipeline)
Agentic RAG: question → [agent decides]
              ├── Is retrieval needed? → retrieve → are the results sufficient?
              │                            ├── Yes → generate
              │                            └── No → search another source / revise the query
              └── Answerable without retrieval → generate directly

7.2 Self-RAG (Self-Reflective RAG)

The model decides for itself whether retrieval is needed and evaluates the quality of its own output.

def self_rag(query):
    # 1. Judge whether retrieval is needed
    need_retrieval = judge_retrieval_need(query)
    
    if not need_retrieval:
        return generate_without_context(query)
    
    # 2. Retrieve
    documents = retrieve(query)
    
    # 3. Judge the relevance of each document
    relevant_docs = []
    for doc in documents:
        if is_relevant(query, doc):
            relevant_docs.append(doc)
    
    if not relevant_docs:
        # No relevant documents: refine the query and retrieve again
        refined_query = refine_query(query)
        documents = retrieve(refined_query)
        relevant_docs = [d for d in documents if is_relevant(refined_query, d)]
    
    # 4. Generate the answer
    answer = generate_with_context(query, relevant_docs)
    
    # 5. Self-evaluate the answer's quality
    if not is_supported(answer, relevant_docs):
        # If the answer is not grounded in the documents, regenerate
        answer = regenerate(query, relevant_docs)
    
    return answer

7.3 CRAG (Corrective RAG)

Takes corrective action based on the quality of the retrieval results.

def corrective_rag(query):
    # 1. Initial retrieval
    documents = retrieve(query)
    
    # 2. Evaluate retrieval quality
    quality = evaluate_retrieval_quality(query, documents)
    
    if quality == "CORRECT":
        # Good results → refine the knowledge, then generate
        refined_knowledge = refine_knowledge(documents)
        return generate(query, refined_knowledge)
    
    elif quality == "AMBIGUOUS":
        # Ambiguous results → supplement with web search
        web_results = web_search(query)
        combined = documents + web_results
        refined = refine_knowledge(combined)
        return generate(query, refined)
    
    elif quality == "INCORRECT":
        # Poor results → replace with web search
        web_results = web_search(query)
        refined = refine_knowledge(web_results)
        return generate(query, refined)

7.4 Adaptive RAG

Selects a strategy based on query complexity.

def adaptive_rag(query):
    # Classify query complexity
    complexity = classify_query(query)
    
    if complexity == "SIMPLE":
        # Simple factual question → direct retrieve + generate
        docs = simple_retrieve(query, top_k=3)
        return generate(query, docs)
    
    elif complexity == "MODERATE":
        # Moderate → multi-query + re-ranking
        queries = multi_query_transform(query)
        docs = hybrid_search(queries)
        reranked = rerank(query, docs)
        return generate(query, reranked)
    
    elif complexity == "COMPLEX":
        # Complex → query decomposition + multi-step reasoning
        sub_queries = decompose_query(query)
        sub_answers = []
        for sq in sub_queries:
            docs = search(sq)
            sub_answers.append(generate(sq, docs))
        return synthesize(query, sub_answers)

7.5 Query Routing

Routes each question to the appropriate data source based on its type.

def query_router(query):
    # Let the LLM make the routing decision
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": """Analyze the query and select the appropriate data source:
- VECTOR_DB: internal document search is needed
- WEB_SEARCH: up-to-date or external information is needed
- SQL_DB: a structured-data query is needed
- DIRECT: the LLM can answer directly
Return a single word only."""},
            {"role": "user", "content": query}
        ]
    )
    route = response.choices[0].message.content.strip()
    
    if route == "VECTOR_DB":
        return vector_db_search(query)
    elif route == "WEB_SEARCH":
        return web_search(query)
    elif route == "SQL_DB":
        return sql_query(query)
    else:
        return direct_answer(query)

8. Multi-modal RAG

8.1 Image + Text RAG

Processes images and tables alongside text from PDFs and slide decks.

from langchain_community.document_loaders import UnstructuredPDFLoader

# Extract text + images + tables from a PDF
loader = UnstructuredPDFLoader(
    "document.pdf",
    mode="elements",
    strategy="hi_res"  # high-resolution image/table extraction
)
elements = loader.load()

# Generate descriptions for image elements with GPT-4o
for element in elements:
    if element.metadata.get("type") == "Image":
        description = describe_image_with_vision(element)
        element.page_content = description  # replace the image with its text description

# Store all elements (text + image descriptions) in the vector DB

8.2 Table Processing

def process_table(table_element):
    """Convert a table into a searchable form"""
    # 1. Get the table's HTML rendering from the element metadata
    table_html = table_element.metadata.get("text_as_html", "")
    
    # 2. Generate a table summary with the LLM
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarize the contents of the following table in natural language."},
            {"role": "user", "content": table_html}
        ]
    )
    summary = response.choices[0].message.content
    
    # 3. Store the summary together with metadata
    return {
        "content": summary,
        "metadata": {
            "type": "table",
            "original_html": table_html,
            "source": table_element.metadata.get("source")
        }
    }

8.3 Using Vision Models

import base64

def describe_image_with_vision(image_path):
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in detail. For a diagram, explain the structure; for a chart, explain the data."},
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}}
                ]
            }
        ]
    )
    return response.choices[0].message.content

9. Knowledge Graph + RAG (GraphRAG)

9.1 The GraphRAG Concept

Represents relationships between entities in a knowledge graph and combines it with vector search.

Standard RAG: retrieves document chunks independently
GraphRAG: follows entity relationships to retrieve connected information as well

Example: "Which vector databases use HNSW?"
Standard RAG: returns only the chunks about HNSW
GraphRAG: HNSW ←(used_by)─ Qdrant, Weaviate, Milvus
         + also returns each DB's HNSW-configuration chunks

9.2 Neo4j + RAG Implementation

from langchain_community.graphs import Neo4jGraph
from langchain.chains import GraphCypherQAChain

graph = Neo4jGraph(
    url="bolt://localhost:7687",
    username="neo4j",
    password="password"
)

# Knowledge graph + LLM QA chain
chain = GraphCypherQAChain.from_llm(
    llm=llm,
    graph=graph,
    verbose=True
)

result = chain.invoke("What indexing algorithms does Qdrant support?")
# The LLM generates Cypher queries to traverse relationships in Neo4j

9.3 Hybrid: Vector + Graph

def hybrid_graph_vector_rag(query):
    # 1. Vector search for relevant document chunks
    vector_results = vector_search(query, top_k=5)
    
    # 2. Extract entities from the query
    entities = extract_entities(query)
    
    # 3. Explore related entities in the knowledge graph
    graph_results = graph_query(entities, depth=2)
    
    # 4. Merge the two result sets
    combined_context = merge_results(vector_results, graph_results)
    
    # 5. Generate with the LLM
    return generate(query, combined_context)

10. RAG Evaluation

10.1 The RAGAS Framework

A framework for automatically evaluating RAG pipelines.

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare the evaluation dataset
eval_data = {
    "question": ["What is RAG?", "How does the HNSW algorithm work?"],
    "answer": ["RAG is retrieval-augmented generation...", "HNSW is a hierarchical graph..."],
    "contexts": [
        ["RAG (Retrieval-Augmented Generation) is...", "A retrieval-based generation technique..."],
        ["HNSW stands for Hierarchical Navigable...", "A graph-based ANN algorithm..."]
    ],
    "ground_truth": ["RAG lets an LLM retrieve external knowledge...", "HNSW is a multi-layer graph structure..."]
}

dataset = Dataset.from_dict(eval_data)

# Run the RAGAS evaluation
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)

print(results)
# {'faithfulness': 0.89, 'answer_relevancy': 0.92,
#  'context_precision': 0.85, 'context_recall': 0.78}

10.2 RAGAS Metrics in Detail

| Metric | Measures | Range | Target |
|---|---|---|---|
| Faithfulness | How well the answer is grounded in the context | 0-1 | 0.85+ |
| Answer Relevancy | How well the answer fits the question | 0-1 | 0.90+ |
| Context Precision | Ranking accuracy of the relevant context | 0-1 | 0.80+ |
| Context Recall | How much of the needed information was retrieved | 0-1 | 0.75+ |
| Answer Similarity | Semantic similarity to the ground truth | 0-1 | 0.80+ |
| Answer Correctness | Factual accuracy of the answer | 0-1 | 0.85+ |
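
These targets can double as a regression gate in CI. A minimal sketch, assuming results is a plain dict of metric scores as printed above (the threshold values simply mirror the table):

# Illustrative thresholds mirroring the targets above
THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.90,
    "context_precision": 0.80,
    "context_recall": 0.75,
}

def check_rag_quality(results: dict) -> None:
    failures = {
        metric: (score, THRESHOLDS[metric])
        for metric, score in results.items()
        if metric in THRESHOLDS and score < THRESHOLDS[metric]
    }
    if failures:
        raise AssertionError(f"RAG quality gate failed: {failures}")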

10.3 TruLens Evaluation

from trulens.apps.langchain import TruChain
from trulens.core import Feedback, TruSession
from trulens.providers.openai import OpenAI as TruOpenAI

session = TruSession()
provider = TruOpenAI()

# Define feedback functions
f_relevance = Feedback(provider.relevance).on_input_output()

# Select the retrieved context from the chain's record (TruLens selector pattern)
context = TruChain.select_context(rag_chain)
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons)
    .on(context.collect())
    .on_output()
)

# Wrap the RAG chain
tru_chain = TruChain(
    rag_chain,
    app_name="RAG Pipeline v1",
    feedbacks=[f_relevance, f_groundedness]
)

# Run the evaluation
with tru_chain as recording:
    response = rag_chain.invoke("How do I optimize a RAG pipeline?")

# Inspect the leaderboard
session.get_leaderboard()

10.4 LLM-as-Judge

def llm_judge(question, answer, context, ground_truth):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """Evaluate the RAG system's answer.
Score it 1-5 on each criterion:
1. Accuracy: is the answer factually correct?
2. Relevance: does it actually answer the question?
3. Groundedness: is it based on the provided context?
4. Completeness: does it answer the question fully?
Return JSON with each score and the reasoning."""},
            {"role": "user", "content": f"""Question: {question}
Answer: {answer}
Context: {context}
Ground truth: {ground_truth}"""}
        ],
        response_format={"type": "json_object"}
    )
    return response.choices[0].message.content

10.5 Evaluation Method Comparison

| Method | Automation | Cost | Accuracy | When to Use |
|---|---|---|---|---|
| RAGAS | Fully automatic | Embedding cost | Good | Continuous monitoring |
| TruLens | Automatic | LLM cost | Good | Iterative evaluation during development |
| LLM-as-Judge | Semi-automatic | LLM cost | Very good | Detailed analysis |
| Human eval | Manual | Labor cost | Best | Final validation |

11. Production Optimization

11.1 Caching Strategy

import hashlib
import json

class RAGCache:
    def __init__(self, redis_client):
        self.redis = redis_client
        self.ttl = 3600  # 1 hour
    
    def _hash_query(self, query):
        return hashlib.md5(query.encode()).hexdigest()
    
    def get_cached_response(self, query):
        key = self._hash_query(query)
        cached = self.redis.get(f"rag:{key}")
        if cached:
            return json.loads(cached)
        return None
    
    def cache_response(self, query, response):
        key = self._hash_query(query)
        self.redis.setex(
            f"rag:{key}",
            self.ttl,
            json.dumps(response)
        )

# Semantic caching: similar queries also hit the cache
class SemanticCache:
    def __init__(self, vectorstore, threshold=0.95):
        self.store = vectorstore
        self.threshold = threshold
    
    def get(self, query):
        # Assumes the store returns similarity scores where higher is better
        results = self.store.similarity_search_with_score(query, k=1)
        if results and results[0][1] >= self.threshold:
            return results[0][0].metadata["response"]
        return None
    
    def set(self, query, response):
        # Store the query with its response so semantically similar queries can hit
        self.store.add_texts([query], metadatas=[{"response": response}])

11.2 Streaming Responses

from openai import OpenAI

client = OpenAI()

def stream_rag_response(query, context):
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context:\n{context}"},
            {"role": "user", "content": query}
        ],
        stream=True
    )
    
    for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

# FastAPI streaming endpoint
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/rag/stream")
async def rag_stream(query: str):
    context = retrieve_context(query)
    return StreamingResponse(
        stream_rag_response(query, context),
        media_type="text/event-stream"
    )

11.3 Fallback Strategy

def rag_with_fallback(query):
    try:
        # Primary: vector DB search
        docs = vector_search(query)
        
        if not docs or max_relevance_score(docs) < 0.5:
            # Secondary: web-search fallback
            docs = web_search_fallback(query)
        
        if not docs:
            # Tertiary: direct LLM answer (no retrieval)
            return direct_llm_answer(query)
        
        return generate_with_context(query, docs)
    
    except Exception as e:
        # Error fallback
        return {
            "answer": "Sorry, a temporary error occurred.",
            "error": str(e),
            "fallback": True
        }

11.4 Monitoring

# Key RAG metrics
monitoring = {
    "Retrieval latency": "Target p99 200ms",
    "Generation latency": "Target p99 3s (streaming first token 500ms)",
    "Retrieval relevance": "Auto-eval 0.85+",
    "Answer groundedness": "Auto-eval 0.90+",
    "Cache hit rate": "Target 30%+",
    "User feedback (thumbs up/down)": "Positive 80%+",
    "Token usage": "Cost tracking",
    "Error rate": "Target under 1%",
}
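
On the collection side, simple per-stage timing hooks go a long way. A minimal sketch using a decorator; record_metric is a hypothetical stand-in for your metrics sink (Prometheus, StatsD, etc.):

import time
from functools import wraps

def timed(stage: str):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                latency_ms = (time.perf_counter() - start) * 1000
                record_metric(stage, latency_ms)  # hypothetical metrics sink
        return wrapper
    return decorator

@timed("retrieval")
def retrieve_context(query):
    ...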

12. Common Failures and Solutions

12.1 Diagnostic Checklist

| Symptom | Cause | Fix |
|---|---|---|
| Irrelevant documents returned | Inappropriate chunk size | Semantic chunking, size tuning |
| Answer ignores the context | Context too long or noisy | Context compression, re-ranking |
| Synonym/paraphrase searches fail | Embedding model limits | Hybrid search, query expansion |
| Stale information | Index not refreshed | Incremental indexing, scheduler |
| Hallucinated answers | No relevance threshold | Score filtering, Self-RAG |
| Multi-hop questions fail | A single retrieval is not enough | Query decomposition, iterative retrieval |
| Searches for specific terms fail | Vectors alone are not enough | Add a BM25 hybrid |
| Slow responses | Re-ranking/generation latency | Caching, streaming, async |
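
The 'score filtering' fix from the table above is worth spelling out. A sketch assuming search results expose a similarity score where higher is better; the threshold is illustrative and should be tuned per embedding model:

MIN_RELEVANCE = 0.5  # illustrative threshold; tune per embedding model

def filter_by_score(results, threshold=MIN_RELEVANCE):
    kept = [r for r in results if r.score >= threshold]
    if not kept:
        # Better to admit "no relevant documents" than to generate from noise
        return None
    return kept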

12.2 Debugging Workflow

def debug_rag_pipeline(query):
    print(f"=== RAG debugging: {query} ===\n")
    
    # 1. Inspect retrieval results
    results = vector_search(query, top_k=10)
    print("--- Retrieval results ---")
    for i, r in enumerate(results):
        print(f"#{i+1} Score: {r.score:.4f} | {r.content[:100]}...")
    
    # 2. Inspect re-ranking results
    reranked = rerank(query, results)
    print("\n--- After re-ranking ---")
    for i, (r, score) in enumerate(reranked):
        print(f"#{i+1} Score: {score:.4f} | {r.content[:100]}...")
    
    # 3. Inspect the final context
    context = build_context(reranked[:5])
    print(f"\n--- Context length: {len(context)} chars ---")
    
    # 4. Inspect the generated answer
    answer = generate(query, context)
    print(f"\n--- Answer ---\n{answer}")
    
    return {"results": results, "reranked": reranked, "answer": answer}

13. Full Pipeline Integration

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Qdrant
from langchain.retrievers import EnsembleRetriever, ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

class AdvancedRAGPipeline:
    """Sketch tying the earlier sections together; the helper methods
    (multi_query_transform, hybrid_search, rerank, ...) correspond to
    the techniques shown in Sections 4-6."""

    def __init__(self):
        self.llm = ChatOpenAI(model="gpt-4o")
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        self.vectorstore = Qdrant.from_existing_collection(
            embedding=self.embeddings,
            collection_name="knowledge_base"
        )
        self.cache = SemanticCache(self.vectorstore)
    
    def query(self, question):
        # 1. Check the cache
        cached = self.cache.get(question)
        if cached:
            return cached
        
        # 2. Query transformation
        queries = self.multi_query_transform(question)
        
        # 3. Hybrid search
        all_docs = []
        for q in queries:
            docs = self.hybrid_search(q)
            all_docs.extend(docs)
        
        # 4. Deduplicate
        unique_docs = self.deduplicate(all_docs)
        
        # 5. Re-rank
        reranked = self.rerank(question, unique_docs)
        
        # 6. Context compression
        compressed = self.compress_context(question, reranked)
        
        # 7. Generate the answer
        answer = self.generate(question, compressed)
        
        # 8. Verify the answer
        if not self.verify_groundedness(answer, compressed):
            answer = self.regenerate_with_instruction(question, compressed)
        
        # 9. Cache the result
        self.cache.set(question, answer)
        
        return answer
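
A hypothetical usage example, assuming the helper methods are filled in as sketched in the earlier sections:

pipeline = AdvancedRAGPipeline()
answer = pipeline.query("How should I tune HNSW parameters in Qdrant?")
print(answer)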

14. Quiz

Q1. How do semantic chunking and recursive chunking differ?

Recursive chunking applies predefined separators (newlines, periods, etc.) hierarchically to split text; it does not consider meaning. Semantic chunking measures embedding similarity between sentences and splits where the meaning shifts significantly. Semantic chunking yields higher retrieval accuracy but incurs embedding-call costs. Use recursive chunking as the default, and semantic chunking when accuracy matters most.

Q2. How does HyDE (Hypothetical Document Embeddings) improve retrieval?

HyDE first has the LLM generate a hypothetical answer document for the user's query, then uses that document's embedding for retrieval. This bridges the embedding gap between queries and documents (short question vs. long document) and improves retrieval accuracy. The downsides are the LLM-call cost and the risk that the hypothetical answer is wrong.

Q3. How do Self-RAG and CRAG differ?

Self-RAG has the model decide for itself whether retrieval is needed and self-evaluate whether the generated answer is grounded in the context; if retrieval is unnecessary, it answers directly. CRAG always retrieves, but grades the retrieval results as correct/ambiguous/incorrect and, when they are inadequate, corrects course with alternative sources such as web search. In short, Self-RAG controls retrieval itself, while CRAG corrects retrieval results.

Q4. What is the difference between RAGAS's Faithfulness and Answer Relevancy?

Faithfulness measures whether each claim in the answer is grounded in the provided context; it is central to hallucination detection. Answer Relevancy measures how well the answer fits the original question; irrelevant content in the answer lowers the score. An answer can be highly faithful yet score low on Answer Relevancy if it does not actually address the question.

Q5. What are the benefits of semantic caching in production RAG?

Plain caching only hits on exactly identical queries. Semantic caching also hits on semantically similar ones: "how to optimize RAG" and "strategies to improve RAG performance" can share a cache entry. This substantially raises the cache hit rate and cuts LLM cost and response latency. A similarity threshold (e.g., 0.95) controls hit precision.


15. References

  1. LangChain Documentation - https://python.langchain.com/docs/
  2. LlamaIndex Documentation - https://docs.llamaindex.ai/
  3. RAGAS Documentation - https://docs.ragas.io/
  4. TruLens Documentation - https://www.trulens.org/
  5. Cohere Rerank - https://docs.cohere.com/reference/rerank
  6. ColBERT Paper - https://arxiv.org/abs/2004.12832
  7. Self-RAG Paper - https://arxiv.org/abs/2310.11511
  8. CRAG Paper - https://arxiv.org/abs/2401.15884
  9. Adaptive RAG Paper - https://arxiv.org/abs/2403.14403
  10. HyDE Paper - https://arxiv.org/abs/2212.10496
  11. GraphRAG (Microsoft) - https://github.com/microsoft/graphrag
  12. RAG Survey Paper - https://arxiv.org/abs/2312.10997
  13. Chunking Strategies Guide - https://www.pinecone.io/learn/chunking-strategies/
  14. MTEB Leaderboard - https://huggingface.co/spaces/mteb/leaderboard

Advanced RAG Pipeline Complete Guide 2025: Chunking Strategies, Re-ranking, Agentic RAG, Evaluation

1. The Evolution of RAG: From Naive to Advanced

1.1 What Is RAG

RAG (Retrieval-Augmented Generation) is a technique where an LLM retrieves relevant information from external knowledge sources before generating an answer, providing it as context. It reduces LLM hallucinations, incorporates up-to-date information, and enables domain-specific knowledge utilization.

User question → [Retrieval][Relevant documents][LLM + document context]Answer generation

1.2 RAG Architecture Evolution

Naive RAG (Early 2023)
├── Fixed-size chunking
├── Single embedding retrieval
├── Top-K results passed directly to LLM
└── Problems: Low retrieval accuracy, context noise

Advanced RAG (2024)
├── Semantic chunking + metadata
├── Hybrid search (vector + keyword)
├── Re-ranking to refine search results
├── Query transformation (HyDE, Multi-Query)
└── Context compression

Modular RAG (2025)
├── Agentic RAG (dynamic routing)
├── Self-RAG (self-reflection)
├── CRAG (Corrective RAG)
├── Multi-modal RAG
├── Knowledge Graph + RAG
└── Modular composable pipelines

1.3 Bottlenecks at Each Stage

StageNaive RAG ProblemAdvanced RAG Solution
IndexingFixed-size chunkingSemantic chunking, hierarchical indexing
RetrievalSingle vector searchHybrid search, re-ranking
GenerationNoisy contextContext compression, filtering
QueryRaw query as-isQuery transformation, decomposition
EvaluationManual evaluationRAGAS, automated evaluation

2. Chunking Strategies

2.1 Why Chunking Matters

Chunking is the first and most important step in a RAG pipeline. Poor chunking degrades performance of all subsequent stages.

Bad chunking: Cut mid-sentence → meaning lost → retrieval fails → wrong answer
Good chunking: Split by meaning → rich context → accurate retrieval → correct answer

2.2 Fixed-Size Chunking

The simplest approach. Splits text by a fixed token/character count.

from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=500,      # chunk size
    chunk_overlap=50,    # overlap (preserves boundary info)
    separator="\n\n"     # separator
)

chunks = splitter.split_text(document_text)

Pros: Simple, fast, predictable sizes Cons: Ignores semantic boundaries, may cut mid-sentence

2.3 Recursive Character Splitting

Tries multiple separators in hierarchical order. The most widely used approach.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]  # priority order
)

chunks = splitter.split_text(document_text)
# First tries \n\n, if chunk too large then \n, then period...

Pros: Respects paragraph/sentence boundaries, general-purpose Cons: No guarantee of semantic coherence

2.4 Semantic Chunking

Measures embedding similarity between sentences and splits where meaning changes.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Semantic chunking
chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95  # split when similarity below 95th percentile
)

chunks = chunker.split_text(document_text)

How it works:

Sentence 1: "Vector DBs are core AI infrastructure." → embedding [0.1, 0.3, ...]
Sentence 2: "HNSW is the fastest search algorithm." → similarity 0.85 (high → same chunk)
Sentence 3: "Meanwhile, the weather is sunny today." → similarity 0.15 (low → new chunk!)

Pros: Semantic unit splitting, high retrieval accuracy Cons: Requires embedding calls (cost/time), uneven chunk sizes

2.5 Document-Based Chunking

Leverages document structure (headings, sections, tables, code blocks).

from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

splitter = MarkdownHeaderTextSplitter(headers_to_split)
chunks = splitter.split_text(markdown_text)

# Each chunk automatically includes header metadata
for chunk in chunks:
    print(f"Metadata: {chunk.metadata}")
    # {'Header 1': 'Vector Database', 'Header 2': 'Indexing Algorithms'}

2.6 Agentic Chunking

Uses an LLM to determine optimal chunking.

from openai import OpenAI

client = OpenAI()

def agentic_chunk(text, max_chunk_size=1500):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": """Split the given text into semantically independent chunks.
Each chunk should be self-contained and understandable on its own.
Return chunk texts as a JSON array."""},
            {"role": "user", "content": text}
        ],
        response_format={"type": "json_object"}
    )
    return response.choices[0].message.content

Pros: Highest quality, context understanding Cons: High cost, slow processing, impractical at scale

2.7 Chunking Strategy Comparison

StrategyQualitySpeedCostBest For
Fixed-SizeLowVery FastFreePrototyping, simple docs
RecursiveMediumFastFreeGeneral purpose (default)
SemanticHighMediumEmbedding costWhen accuracy matters
Document-BasedHighFastFreeStructured docs (MD, HTML)
AgenticVery HighSlowLLM costSmall high-quality docs

2.8 Optimal Chunk Size Guide

General text: 500-1000 tokens (10-20% overlap)
Technical docs: 800-1500 tokens (section-based)
Legal/Medical: 300-500 tokens (precision-focused)
Code: Function/class units (structure-based)
Conversations/QA: Per dialog turn

3. Embedding Model Selection

3.1 Embedding Model Comparison

ModelDimsMax TokensMTEB ScoreCostRecommended For
text-embedding-3-large30728,19164.6PaidWhen top performance needed
text-embedding-3-small15368,19162.3LowGeneral purpose (best cost-perf)
embed-v3.0 (Cohere)102451264.5PaidMultilingual, search-focused
BGE-M3 (BAAI)10248,19268.2FreeSelf-hosted, best OSS
Jina-embeddings-v310248,19265.5FreeMultilingual, long docs
voyage-3 (Voyage AI)102432,00067.1PaidCode search specialized

3.2 Selection Criteria

Cost-focused + general      → text-embedding-3-small
Top performance + free      → BGE-M3 (self-hosted)
Multilingual + search       → Cohere embed-v3.0
Code search                 → voyage-code-3
Long docs (8K+ tokens)Jina-embeddings-v3
Private data + on-premises  → BGE-M3 or Nomic

3.3 Late Interaction Models (ColBERT)

Performs token-level fine-grained matching.

from ragatouille import RAGPretrainedModel

rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Indexing
rag.index(
    collection=[doc1, doc2, doc3],
    index_name="my_index"
)

# Search (token-level matching)
results = rag.search(query="vector database performance comparison", k=5)

4. Query Transformation

4.1 Why Query Transformation Is Needed

User queries are often ambiguous, too short, or not suitable for retrieval.

Original query: "RAG is slow" (ambiguous, short)
Transformed: "How to optimize RAG pipeline retrieval latency" (specific, retrieval-suitable)

4.2 HyDE (Hypothetical Document Embeddings)

LLM generates a hypothetical answer document, then searches using that document's embedding.

from openai import OpenAI

client = OpenAI()

def hyde_search(query, collection):
    # 1. Generate hypothetical answer with LLM
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Write a detailed answer to the given question. It doesn't need to be accurate."},
            {"role": "user", "content": query}
        ]
    )
    hypothetical_doc = response.choices[0].message.content
    
    # 2. Embed the hypothetical answer and search
    embedding = get_embedding(hypothetical_doc)
    results = collection.search(query_vector=embedding, limit=5)
    
    return results

Pros: Bridges the embedding gap between query and document Cons: LLM call cost, hypothetical answer may be wrong

4.3 Multi-Query

Rewrites a single query from multiple angles.

def multi_query_transform(original_query):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": """Rewrite the given question from 3 different perspectives.
Each question should maintain the original intent but use different keywords.
Return only 3 questions separated by newlines."""},
            {"role": "user", "content": original_query}
        ]
    )
    queries = response.choices[0].message.content.strip().split("\n")
    return [original_query] + queries

# Usage
queries = multi_query_transform("RAG pipeline performance optimization")
# → ["RAG pipeline performance optimization",
#    "How to improve response time in retrieval augmented generation systems",
#    "Strategies to increase search accuracy in RAG architecture",
#    "LLM-based document retrieval system optimization techniques"]

# Search with each query and merge results (deduplicate)
all_results = set()
for q in queries:
    results = search(q)
    all_results.update(results)

4.4 Step-Back Prompting

Transforms a specific question into a more general one.

def step_back_prompt(query):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": """Generate a question that is one step more general and broader than the given question.
The answer to this general question should help answer the original question."""},
            {"role": "user", "content": query}
        ]
    )
    return response.choices[0].message.content

# Example
original = "What is the memory impact when setting HNSW M parameter to 32 in Qdrant?"
step_back = step_back_prompt(original)
# → "How do HNSW index parameters affect performance and resources in vector databases?"

4.5 Query Decomposition

Breaks complex questions into sub-questions.

def decompose_query(complex_query):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": """Decompose the complex question into simpler sub-questions.
Each sub-question should be independently answerable.
Return as a JSON array."""},
            {"role": "user", "content": complex_query}
        ],
        response_format={"type": "json_object"}
    )
    return response.choices[0].message.content

# Example
complex_q = "Compare Pinecone and Qdrant on performance, cost, and operational ease at 10M vector scale?"
sub_questions = decompose_query(complex_q)
# → ["What is Pinecone's performance at 10M vector scale?",
#    "What is Qdrant's performance at 10M vector scale?",
#    "What is Pinecone's cost structure?",
#    "What is Qdrant's cost structure?",
#    "How easy is Pinecone to operate?",
#    "How easy is Qdrant to operate?"]

5. Retrieval Optimization

Combines vector search with keyword search.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Qdrant

# BM25 (keyword) retriever
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 5

# Vector retriever
vector_retriever = qdrant_vectorstore.as_retriever(
    search_kwargs={"k": 5}
)

# Ensemble (hybrid)
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]  # weight toward vector
)

results = ensemble_retriever.invoke("How to optimize RAG pipeline")

5.2 Contextual Compression

Extracts only the relevant portions from retrieved documents.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_retriever
)

# Retrieval + compression: extracts only relevant parts
compressed_docs = compression_retriever.invoke("Role of M parameter in HNSW algorithm")
# Returns only question-relevant paragraphs, not the entire document

5.3 Parent Document Retrieval

Searches with small chunks but returns larger parent documents.

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Small chunks: for retrieval
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

# Large chunks: for context
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

store = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Precise search with 200-token chunks → return 2000-token parent documents
retriever.add_documents(documents)
results = retriever.invoke("HNSW parameter tuning")
# Retrieval is precise with small chunks, context is rich with large ones

5.4 Multi-Vector Retrieval

Generates multiple vectors (summary, questions, original) per document.

from langchain.retrievers.multi_vector import MultiVectorRetriever

# Generate summaries + hypothetical questions per document
summaries = generate_summaries(documents)
hypothetical_questions = generate_questions(documents)

# Search by summaries/questions, return original documents
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,  # stores summary/question vectors
    docstore=store,           # stores original documents
    id_key="doc_id"
)

# Searches via summary embeddings but returns full original documents

6. Re-ranking

6.1 Why Re-ranking Is Needed

Initial retrieval (bi-encoder) is fast but imprecise. Re-ranking (cross-encoder) processes query and document together for more accurate relevance judgment.

Stage 1 (Bi-encoder): query vector vs doc vector → fast but imprecise → Top 20
Stage 2 (Cross-encoder): (query, doc) pairs directly compared → slow but accurate → Top 5

6.2 Cross-Encoder Re-ranking

from sentence_transformers import CrossEncoder

# Load cross-encoder model
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Initial search results
initial_results = vector_search(query, top_k=20)

# Re-rank
pairs = [(query, doc.content) for doc in initial_results]
scores = model.predict(pairs)

# Sort by score
reranked = sorted(
    zip(initial_results, scores),
    key=lambda x: x[1],
    reverse=True
)[:5]

6.3 Cohere Rerank

Commercial re-ranking API. High performance with multilingual support.

import cohere

co = cohere.Client("YOUR_API_KEY")

# Initial search result texts
documents = [doc.content for doc in initial_results]

# Cohere re-ranking
response = co.rerank(
    model="rerank-v3.5",
    query="RAG pipeline optimization",
    documents=documents,
    top_n=5
)

for result in response.results:
    print(f"Index: {result.index}, Score: {result.relevance_score:.4f}")
    print(f"Text: {documents[result.index][:100]}...")

6.4 ColBERT Re-ranking

Late interaction approach with token-level fine-grained matching.

from ragatouille import RAGPretrainedModel

rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

reranked = rag.rerank(
    query="How to optimize RAG pipeline",
    documents=[doc.content for doc in initial_results],
    k=5
)

6.5 LLM-Based Re-ranking

Lets the LLM directly judge relevance.

def llm_rerank(query, documents, top_n=5):
    scored_docs = []
    
    for doc in documents:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Rate the relevance between the question and document on a 0-10 scale. Return only the number."},
                {"role": "user", "content": f"Question: {query}\n\nDocument: {doc.content[:500]}"}
            ]
        )
        score = float(response.choices[0].message.content.strip())
        scored_docs.append((doc, score))
    
    return sorted(scored_docs, key=lambda x: x[1], reverse=True)[:top_n]

6.6 Re-ranking Model Comparison

ModelSpeedQualityCostMultilingualRecommended
cross-encoder/ms-marcoFastGoodFreeEnglishEnglish only
Cohere Rerank v3.5FastVery GoodPaid100+ languagesProduction default
ColBERT v2MediumVery GoodFreeEnglishSelf-hosted
BGE-Reranker-v2FastGoodFreeMultilingualOSS multilingual
LLM Re-rankingSlowBestHighAllSmall high-quality

7. Agentic RAG

7.1 What Is Agentic RAG

LLM agents dynamically decide retrieval strategies. Instead of simple "retrieve then generate," agents evaluate search results and adjust strategy.

Traditional RAG: QuestionRetrieveGenerate (fixed pipeline)
Agentic RAG: Question[Agent decides]
              ├── Is retrieval needed?RetrieveResults sufficient?
              │                              ├── YesGenerate
              │                              └── NoSearch different source / modify query
              └── Can answer without retrieval → Generate directly

7.2 Self-RAG (Self-Reflective RAG)

The model evaluates whether retrieval is needed and judges the quality of generated results.

def self_rag(query):
    # 1. Judge if retrieval is needed
    need_retrieval = judge_retrieval_need(query)
    
    if not need_retrieval:
        return generate_without_context(query)
    
    # 2. Perform retrieval
    documents = retrieve(query)
    
    # 3. Judge relevance of each document
    relevant_docs = []
    for doc in documents:
        if is_relevant(query, doc):
            relevant_docs.append(doc)
    
    if not relevant_docs:
        # No relevant docs found, refine query and re-search
        refined_query = refine_query(query)
        documents = retrieve(refined_query)
        relevant_docs = [d for d in documents if is_relevant(refined_query, d)]
    
    # 4. Generate answer
    answer = generate_with_context(query, relevant_docs)
    
    # 5. Self-evaluate answer quality
    if not is_supported(answer, relevant_docs):
        # If answer not grounded in docs, regenerate
        answer = regenerate(query, relevant_docs)
    
    return answer

7.3 CRAG (Corrective RAG)

Takes corrective action based on retrieval quality.

def corrective_rag(query):
    # 1. Initial retrieval
    documents = retrieve(query)
    
    # 2. Evaluate retrieval quality
    quality = evaluate_retrieval_quality(query, documents)
    
    if quality == "CORRECT":
        # Good results → refine knowledge and generate
        refined_knowledge = refine_knowledge(documents)
        return generate(query, refined_knowledge)
    
    elif quality == "AMBIGUOUS":
        # Ambiguous results → supplement with web search
        web_results = web_search(query)
        combined = documents + web_results
        refined = refine_knowledge(combined)
        return generate(query, refined)
    
    elif quality == "INCORRECT":
        # Poor results → replace with web search
        web_results = web_search(query)
        refined = refine_knowledge(web_results)
        return generate(query, refined)

7.4 Adaptive RAG

Selects strategy based on query complexity.

def adaptive_rag(query):
    # Classify query complexity
    complexity = classify_query(query)
    
    if complexity == "SIMPLE":
        # Simple factual question → direct retrieve + generate
        docs = simple_retrieve(query, top_k=3)
        return generate(query, docs)
    
    elif complexity == "MODERATE":
        # Moderate → multi-query + re-ranking
        queries = multi_query_transform(query)
        docs = hybrid_search(queries)
        reranked = rerank(query, docs)
        return generate(query, reranked)
    
    elif complexity == "COMPLEX":
        # Complex → query decomposition + multi-step reasoning
        sub_queries = decompose_query(query)
        sub_answers = []
        for sq in sub_queries:
            docs = search(sq)
            sub_answers.append(generate(sq, docs))
        return synthesize(query, sub_answers)

7.5 Query Routing

Routes queries to appropriate data sources based on question type.

def query_router(query):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": """Analyze the query and select the appropriate data source:
- VECTOR_DB: When internal document search is needed
- WEB_SEARCH: When latest or external information is needed
- SQL_DB: When structured data queries are needed
- DIRECT: When the LLM can answer directly
Return a single word only."""},
            {"role": "user", "content": query}
        ]
    )
    route = response.choices[0].message.content.strip()
    
    if route == "VECTOR_DB":
        return vector_db_search(query)
    elif route == "WEB_SEARCH":
        return web_search(query)
    elif route == "SQL_DB":
        return sql_query(query)
    else:
        return direct_answer(query)

8. Multi-modal RAG

8.1 Image + Text RAG

Processes images and tables alongside text from PDFs and slides.

from langchain_community.document_loaders import UnstructuredPDFLoader

# Extract text + images + tables from PDF
loader = UnstructuredPDFLoader(
    "document.pdf",
    mode="elements",
    strategy="hi_res"  # high-res image/table extraction
)
elements = loader.load()

# Generate descriptions for image elements using GPT-4o
for element in elements:
    if element.metadata.get("type") == "Image":
        description = describe_image_with_vision(element)
        element.page_content = description  # convert image description to text

# Store all elements (text + image descriptions) in vector DB

8.2 Table Processing

def process_table(table_element):
    """Convert table to searchable format"""
    # 1. Convert table to markdown
    md_table = table_element.metadata.get("text_as_html", "")
    
    # 2. Generate table summary with LLM
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarize the following table in natural language."},
            {"role": "user", "content": md_table}
        ]
    )
    summary = response.choices[0].message.content
    
    # 3. Store summary with metadata
    return {
        "content": summary,
        "metadata": {
            "type": "table",
            "original_html": md_table,
            "source": table_element.metadata.get("source")
        }
    }

8.3 Vision Model Usage

import base64

def describe_image_with_vision(image_path):
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in detail. For diagrams explain the structure, for charts explain the data."},
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}}
                ]
            }
        ]
    )
    return response.choices[0].message.content

9. Knowledge Graph + RAG (GraphRAG)

9.1 GraphRAG Concept

Represents entity relationships via knowledge graphs and combines with vector search.

Standard RAG: Retrieves document chunks independently
GraphRAG: Uses entity relationships to retrieve connected information too

Example: "Which vector databases use HNSW?"
Standard RAG: Returns only HNSW chunks
GraphRAG: HNSW  (used_by)Qdrant, Weaviate, Milvus
         + Returns HNSW configuration chunks for each DB

9.2 Neo4j + RAG Implementation

from langchain_community.graphs import Neo4jGraph
from langchain.chains import GraphCypherQAChain

graph = Neo4jGraph(
    url="bolt://localhost:7687",
    username="neo4j",
    password="password"
)

# Knowledge graph + LLM QA chain
chain = GraphCypherQAChain.from_llm(
    llm=llm,
    graph=graph,
    verbose=True
)

result = chain.invoke("What indexing algorithms does Qdrant support?")
# LLM generates Cypher queries to explore relationships in Neo4j

9.3 Hybrid: Vector + Graph

def hybrid_graph_vector_rag(query):
    # 1. Vector search for relevant document chunks
    vector_results = vector_search(query, top_k=5)
    
    # 2. Extract entities from chunks
    entities = extract_entities(query)
    
    # 3. Explore related entities in knowledge graph
    graph_results = graph_query(entities, depth=2)
    
    # 4. Combine both results
    combined_context = merge_results(vector_results, graph_results)
    
    # 5. LLM generation
    return generate(query, combined_context)

10. RAG Evaluation

10.1 RAGAS Framework

A framework for automated RAG pipeline evaluation.

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset
eval_data = {
    "question": ["What is RAG?", "How does HNSW algorithm work?"],
    "answer": ["RAG is retrieval augmented generation...", "HNSW is a hierarchical graph..."],
    "contexts": [
        ["RAG (Retrieval-Augmented Generation) is...", "A retrieval-based generation technique..."],
        ["HNSW stands for Hierarchical Navigable...", "A graph-based ANN algorithm..."]
    ],
    "ground_truth": ["RAG enables LLMs to retrieve external...", "HNSW is a multi-layer graph structure..."]
}

dataset = Dataset.from_dict(eval_data)

# Run RAGAS evaluation
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)

print(results)
# {'faithfulness': 0.89, 'answer_relevancy': 0.92,
#  'context_precision': 0.85, 'context_recall': 0.78}

10.2 RAGAS Metrics Detail

MetricMeasuresRangeTarget
FaithfulnessHow well answer is grounded in context0-10.85+
Answer RelevancyHow well answer fits the question0-10.90+
Context PrecisionRanking accuracy of relevant context0-10.80+
Context RecallHow much needed info was retrieved0-10.75+
Answer SimilaritySemantic similarity to ground truth0-10.80+
Answer CorrectnessFactual accuracy of the answer0-10.85+

10.3 TruLens Evaluation

from trulens.apps.langchain import TruChain
from trulens.core import Feedback, TruSession
from trulens.providers.openai import OpenAI as TruOpenAI

session = TruSession()
provider = TruOpenAI()

# Define feedback functions
f_relevance = Feedback(provider.relevance).on_input_output()
f_groundedness = Feedback(provider.groundedness_measure_with_cot_reasons).on(
    source=context, statement=output
)

# Wrap RAG chain
tru_chain = TruChain(
    rag_chain,
    app_name="RAG Pipeline v1",
    feedbacks=[f_relevance, f_groundedness]
)

# Run evaluation
with tru_chain as recording:
    response = rag_chain.invoke("How to optimize RAG pipelines?")

# Check dashboard
session.get_leaderboard()

10.4 LLM-as-Judge

def llm_judge(question, answer, context, ground_truth):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """Evaluate the RAG system's answer.
Score 1-5 on these criteria:
1. Accuracy: Is the answer factually correct
2. Relevance: Is it a fitting answer to the question
3. Groundedness: Is the answer based on the provided context
4. Completeness: Does it fully answer the question
Return JSON with each score and reasoning."""},
            {"role": "user", "content": f"""Question: {question}
Answer: {answer}
Context: {context}
Ground Truth: {ground_truth}"""}
        ],
        response_format={"type": "json_object"}
    )
    return response.choices[0].message.content
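
Since the judge returns JSON, per-criterion scores can be averaged over an evaluation set. The sketch below assumes a hypothetical response schema of the form {"Accuracy": {"score": 4, "reasoning": "..."}, ...}; adjust the parsing to whatever schema your prompt enforces.

import json

def judge_dataset(eval_rows):
    # Mean score per criterion across the evaluation set
    criteria = ["Accuracy", "Relevance", "Groundedness", "Completeness"]
    totals = {c: 0.0 for c in criteria}
    for row in eval_rows:
        verdict = json.loads(llm_judge(
            row["question"], row["answer"], row["context"], row["ground_truth"]
        ))
        for c in criteria:
            totals[c] += verdict[c]["score"]
    return {c: total / len(eval_rows) for c, total in totals.items()}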

10.5 Evaluation Method Comparison

| Method | Automation | Cost | Accuracy | When to Use |
|---|---|---|---|---|
| RAGAS | Fully automatic | Embedding cost | Good | Continuous monitoring |
| TruLens | Automatic | LLM cost | Good | Iterative dev evaluation |
| LLM-as-Judge | Semi-automatic | LLM cost | Very good | Detailed analysis |
| Human Eval | Manual | Labor cost | Best | Final validation |

11. Production Optimization

11.1 Caching Strategy

import hashlib
import json

class RAGCache:
    def __init__(self, redis_client):
        self.redis = redis_client
        self.ttl = 3600  # 1 hour
    
    def _hash_query(self, query):
        return hashlib.md5(query.encode()).hexdigest()
    
    def get_cached_response(self, query):
        key = self._hash_query(query)
        cached = self.redis.get(f"rag:{key}")
        if cached:
            return json.loads(cached)
        return None
    
    def cache_response(self, query, response):
        key = self._hash_query(query)
        self.redis.setex(
            f"rag:{key}",
            self.ttl,
            json.dumps(response)
        )

# Semantic caching: cache hits for similar queries too
class SemanticCache:
    def __init__(self, vectorstore, threshold=0.95):
        self.store = vectorstore
        self.threshold = threshold

    def get(self, query):
        # NOTE: assumes the store returns a similarity score (higher = closer);
        # some stores return a distance, in which case the comparison must be inverted
        results = self.store.similarity_search_with_score(query, k=1)
        if results and results[0][1] >= self.threshold:
            return results[0][0].metadata["response"]
        return None

    def set(self, query, response):
        # Index the query itself; the answer travels in the metadata
        self.store.add_texts([query], metadatas=[{"response": response}])
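
Wiring the exact-match cache into a request path is straightforward; rag_pipeline below is a hypothetical stand-in for whatever entry point your pipeline exposes:

import redis

r = redis.Redis(host="localhost", port=6379, db=0)
cache = RAGCache(r)

def cached_rag(query):
    # Serve from cache on exact repeats; otherwise compute and store
    hit = cache.get_cached_response(query)
    if hit is not None:
        return hit
    response = rag_pipeline(query)  # hypothetical pipeline entry point
    cache.cache_response(query, response)
    return response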

11.2 Streaming Responses

from openai import OpenAI

client = OpenAI()

def stream_rag_response(query, context):
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context:\n{context}"},
            {"role": "user", "content": query}
        ],
        stream=True
    )
    
    for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

# FastAPI streaming endpoint
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/rag/stream")
async def rag_stream(query: str):
    context = retrieve_context(query)
    return StreamingResponse(
        stream_rag_response(query, context),
        media_type="text/event-stream"
    )
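
One caveat: strict text/event-stream clients expect each chunk framed as a "data:" line followed by a blank line. A thin wrapper handles this without touching the generator above:

def sse_frame(token_stream):
    # Frame raw tokens as Server-Sent Events
    for token in token_stream:
        yield f"data: {token}\n\n"

# In the endpoint: StreamingResponse(sse_frame(stream_rag_response(query, context)), ...)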

11.3 Fallback Strategy

def rag_with_fallback(query):
    try:
        # Primary: Vector DB search
        docs = vector_search(query)
        
        if not docs or max_relevance_score(docs) < 0.5:
            # Secondary: Web search fallback
            docs = web_search_fallback(query)
        
        if not docs:
            # Tertiary: Direct LLM answer (no retrieval)
            return direct_llm_answer(query)
        
        return generate_with_context(query, docs)
    
    except Exception as e:
        # Error fallback
        return {
            "answer": "Sorry, a temporary error occurred.",
            "error": str(e),
            "fallback": True
        }
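
For completeness, the max_relevance_score guard used above can be as simple as the following, assuming each retrieved doc carries a similarity score attribute:

def max_relevance_score(docs):
    # Highest similarity among retrieved docs; 0.0 when nothing came back
    return max((d.score for d in docs), default=0.0)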

11.4 Monitoring

# Key RAG metrics
monitoring = {
    "Retrieval latency": "Target p99 200ms",
    "Generation latency": "Target p99 3s (streaming first-token 500ms)",
    "Retrieval relevance": "Auto-eval 0.85+",
    "Answer groundedness": "Auto-eval 0.90+",
    "Cache hit rate": "Target 30%+",
    "User feedback (thumbs up/down)": "Positive 80%+",
    "Token usage": "Cost tracking",
    "Error rate": "Target under 1%",
}
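
These targets only matter if the numbers are actually collected. A minimal Prometheus instrumentation sketch (metric names are illustrative, and the helpers are the ones from section 11.3):

from prometheus_client import Counter, Histogram

RETRIEVAL_LATENCY = Histogram("rag_retrieval_seconds", "Retrieval latency in seconds")
GENERATION_LATENCY = Histogram("rag_generation_seconds", "Generation latency in seconds")
CACHE_HITS = Counter("rag_cache_hits_total", "Cache hits")
ERRORS = Counter("rag_errors_total", "Pipeline errors")

def instrumented_query(query):
    # Time each stage separately so retrieval and generation regressions are visible
    with RETRIEVAL_LATENCY.time():
        docs = vector_search(query)
    with GENERATION_LATENCY.time():
        return generate_with_context(query, docs)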

12. Common Failures and Solutions

12.1 Problem Diagnosis Checklist

| Symptom | Cause | Solution |
|---|---|---|
| Irrelevant docs returned | Inappropriate chunk size | Semantic chunking, resize chunks |
| Answer ignores context | Context too long or noisy | Context compression, re-ranking |
| Synonym/similar-term search fails | Embedding model limitation | Hybrid search, query expansion |
| Missing latest information | Index not updated | Incremental indexing, scheduler |
| Answer hallucination | No relevance threshold | Score filtering, Self-RAG |
| Multi-hop question fails | Single retrieval insufficient | Query decomposition, iterative search |
| Specific-term search fails | Vector search alone insufficient | Add BM25 hybrid |
| Slow responses | Re-ranking/generation delay | Caching, streaming, async |

12.2 Debugging Workflow

def debug_rag_pipeline(query):
    print(f"=== RAG Debug: {query} ===\n")
    
    # 1. Check retrieval results
    results = vector_search(query, top_k=10)
    print("--- Retrieval Results ---")
    for i, r in enumerate(results):
        print(f"#{i+1} Score: {r.score:.4f} | {r.content[:100]}...")
    
    # 2. Check re-ranked results
    reranked = rerank(query, results)
    print("\n--- After Re-ranking ---")
    for i, (r, score) in enumerate(reranked):
        print(f"#{i+1} Score: {score:.4f} | {r.content[:100]}...")
    
    # 3. Check final context
    context = build_context(reranked[:5])
    print(f"\n--- Context length: {len(context)} chars ---")
    
    # 4. Check generated result
    answer = generate(query, context)
    print(f"\n--- Answer ---\n{answer}")
    
    return {"results": results, "reranked": reranked, "answer": answer}

13. Full Pipeline Integration
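
The class below ties the preceding sections together into a single query path. Methods such as multi_query_transform, hybrid_search, rerank, compress_context, and verify_groundedness stand for the components built in the earlier sections and are left abstract here.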

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Qdrant
from langchain.retrievers import EnsembleRetriever, ContextualCompressionRetriever

class AdvancedRAGPipeline:
    def __init__(self):
        self.llm = ChatOpenAI(model="gpt-4o")
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        self.vectorstore = Qdrant.from_existing_collection(
            embedding=self.embeddings,
            collection_name="knowledge_base",
            url="http://localhost:6333",  # connection info is required; adjust to your deployment
        )
        self.cache = SemanticCache(self.vectorstore)
    
    def query(self, question):
        # 1. Check cache
        cached = self.cache.get(question)
        if cached:
            return cached
        
        # 2. Query transformation
        queries = self.multi_query_transform(question)
        
        # 3. Hybrid search
        all_docs = []
        for q in queries:
            docs = self.hybrid_search(q)
            all_docs.extend(docs)
        
        # 4. Deduplicate
        unique_docs = self.deduplicate(all_docs)
        
        # 5. Re-ranking
        reranked = self.rerank(question, unique_docs)
        
        # 6. Context compression
        compressed = self.compress_context(question, reranked)
        
        # 7. Generate answer
        answer = self.generate(question, compressed)
        
        # 8. Verify answer
        if not self.verify_groundedness(answer, compressed):
            answer = self.regenerate_with_instruction(question, compressed)
        
        # 9. Cache result
        self.cache.set(question, answer)
        
        return answer
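
Most of the abstract methods map directly onto earlier sections; deduplicate is the one piece not shown before, and a minimal content-based version could be added to the class:

    def deduplicate(self, docs):
        # Keep the first occurrence of each unique page_content
        seen, unique = set(), []
        for doc in docs:
            if doc.page_content not in seen:
                seen.add(doc.page_content)
                unique.append(doc)
        return unique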

14. Quiz

Q1. What is the difference between semantic chunking and recursive chunking?

Recursive chunking applies predefined separators (newlines, periods, etc.) in hierarchical order to split text; it does not consider meaning. Semantic chunking measures embedding similarity between adjacent sentences and splits where the meaning changes significantly. Semantic chunking yields higher retrieval accuracy but incurs embedding-call costs. Use recursive chunking as the default, and switch to semantic chunking when accuracy is critical.

Q2. How does HyDE (Hypothetical Document Embeddings) improve retrieval?

HyDE has the LLM first generate a hypothetical answer document for the user query. It then embeds that hypothetical document and uses it for search. This bridges the embedding gap between short queries and long documents, improving retrieval accuracy. Downsides include LLM call cost and the risk that the hypothetical answer may be incorrect.

Q3. What is the difference between Self-RAG and CRAG?

Self-RAG has the model judge whether retrieval is needed and self-evaluate whether the generated answer is grounded in context. If retrieval is unnecessary, it answers directly. CRAG always performs retrieval but evaluates the quality of results, classifying them as "correct/ambiguous/incorrect." If incorrect, it falls back to alternative sources like web search. Self-RAG controls retrieval itself; CRAG corrects retrieval results.

Q4. What is the difference between RAGAS Faithfulness and Answer Relevancy?

Faithfulness measures whether each claim in the answer is grounded in the provided context. It is key for hallucination detection. Answer Relevancy measures how well the answer fits the original question. If the answer contains content unrelated to the question, the score drops. An answer can be high in Faithfulness but low in Answer Relevancy if it does not actually address the question.

Q5. What are the benefits of semantic caching in production RAG?

Regular caching only hits on exactly identical queries. Semantic caching also hits on semantically similar queries. "RAG optimization methods" and "RAG performance improvement strategies" would share the same cache. This significantly increases cache hit rates and reduces both LLM call costs and response latency. The similarity threshold (e.g., 0.95) controls hit precision.


15. References

  1. LangChain Documentation - https://python.langchain.com/docs/
  2. LlamaIndex Documentation - https://docs.llamaindex.ai/
  3. RAGAS Documentation - https://docs.ragas.io/
  4. TruLens Documentation - https://www.trulens.org/
  5. Cohere Rerank - https://docs.cohere.com/reference/rerank
  6. ColBERT Paper - https://arxiv.org/abs/2004.12832
  7. Self-RAG Paper - https://arxiv.org/abs/2310.11511
  8. CRAG Paper - https://arxiv.org/abs/2401.15884
  9. Adaptive RAG Paper - https://arxiv.org/abs/2403.14403
  10. HyDE Paper - https://arxiv.org/abs/2212.10496
  11. GraphRAG (Microsoft) - https://github.com/microsoft/graphrag
  12. RAG Survey Paper - https://arxiv.org/abs/2312.10997
  13. Chunking Strategies Guide - https://www.pinecone.io/learn/chunking-strategies/
  14. MTEB Leaderboard - https://huggingface.co/spaces/mteb/leaderboard