RAG Paper Survey: The Evolution of Retrieval-Augmented Generation, from RETRO to Self-RAG and Corrective RAG

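Introduction
Original RAG (Lewis et al., 2020)
REALM and RETRO: Integrating Retrieval at Scale
- REALM: Retrieval During Pre-training
- RETRO: A 2-Trillion-Token Database
Atlas: Few-shot Learning with Retrieval
Self-RAG: Adaptive Retrieval Through Self-Reflection
Corrective RAG (CRAG)
- Introducing a Retrieval Quality Evaluator
- The Decompose-then-Recompose Algorithm
From Naive RAG to Advanced RAG and Modular RAG
- Architecture Evolution Comparison
- Pre-retrieval, Retrieval, and Post-retrieval Optimization
Benchmark Comparison
- Overall Performance of Major Models
- CRAG Benchmark (Meta, NeurIPS 2024)
Practical Considerations
Future Research Directions
Conclusion
References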

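Introduction

Large language models (LLMs) show remarkable language understanding and generation abilities, but they have two fundamental limitations. The first is hallucination: they produce plausible-sounding content that is not factual. The second is the knowledge cutoff of their training data, which keeps them from reflecting recent information. Storing knowledge in parameters scales poorly, and retraining a model to refresh that knowledge is very expensive.

Retrieval-Augmented Generation (RAG) has emerged as one of the most practical answers to this problem. The core idea is simple: given a question, retrieve relevant documents from an external knowledge store, then generate an answer using them as context. This lets a system incorporate up-to-date knowledge and reduce hallucination without modifying the model's parameters.

This article traces the evolution of RAG research through its key papers. Starting from the original RAG of Lewis et al. (2020), it covers the large-scale retrieval integration of REALM and RETRO, the few-shot learning of Atlas, the self-reflection mechanism of Self-RAG, and the retrieval quality evaluation of Corrective RAG, comparing architectures and benchmarks along the way.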

Original RAG (Lewis et al., 2020)

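Architecture Overview

"Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", presented by Lewis et al. at NeurIPS 2020, is the starting point of the RAG paradigm. Its core structure pairs a DPR (Dense Passage Retrieval) retriever with a BART seq2seq generator.

The model draws on two kinds of memory.

- Parametric Memory: knowledge stored in BART's pre-trained parameters
- Non-parametric Memory: an external knowledge store built by indexing a Wikipedia dump with FAISS

RAG-Sequence vs RAG-Token

The paper proposes two model variants.

- RAG-Sequence: uses the same document for the whole output sequence. The full output y is generated conditioned on a single document z, and the sequence probability is then marginalized over the top-k documents
- RAG-Token: can draw on a different document for each token. The marginalization over the top-k documents happens at every token generation step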

P_{\text{RAG-Seq}}(y|x) \approx \sum_{z \in \text{top-k}} P_\eta(z|x) \prod_{i} P_\theta(y_i|x, z, y_{1:i-1})

P_{\text{RAG-Token}}(y|x) \approx \prod_{i} \sum_{z \in \text{top-k}} P_\eta(z|x) P_\theta(y_i|x, z, y_{1:i-1})
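
The difference between the two formulas is only where the sum over documents sits. The sketch below evaluates both on made-up toy probabilities (two documents, two output tokens); the numbers and function names are illustrative assumptions, not values from the paper.

```python
import math

# Toy setup: 2 retrieved documents, a 2-token target output.
# p_doc[z] plays the role of P_eta(z|x); p_tok[z][i] plays the role of
# P_theta(y_i | x, z, y_{1:i-1}) for the target token at step i.
p_doc = [0.6, 0.4]
p_tok = [
    [0.9, 0.2],  # document 0: helpful for token 1, unhelpful for token 2
    [0.1, 0.8],  # document 1: unhelpful for token 1, helpful for token 2
]

def rag_sequence(p_doc, p_tok):
    """Sum over documents outside the product over tokens."""
    return sum(pz * math.prod(steps) for pz, steps in zip(p_doc, p_tok))

def rag_token(p_doc, p_tok):
    """Sum over documents inside the product, once per token."""
    n_steps = len(p_tok[0])
    return math.prod(
        sum(pz * steps[i] for pz, steps in zip(p_doc, p_tok))
        for i in range(n_steps)
    )

seq = rag_sequence(p_doc, p_tok)
tok = rag_token(p_doc, p_tok)
print(f"RAG-Sequence: {seq:.4f}, RAG-Token: {tok:.4f}")
```

In this toy case each token is best explained by a different document, so the per-token marginalization assigns the target a higher probability than the per-sequence one.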

DPR-based Document Retrieval Implementation

import torch
import numpy as np
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer

class DPRRetriever:
    """Dense passage retrieval using the pre-trained DPR encoders."""

    def __init__(self, model_name="facebook/dpr-question_encoder-single-nq-base"):
        self.q_encoder = DPRQuestionEncoder.from_pretrained(model_name)
        self.q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(model_name)

        ctx_model = "facebook/dpr-ctx_encoder-single-nq-base"
        self.ctx_encoder = DPRContextEncoder.from_pretrained(ctx_model)
        self.ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained(ctx_model)

        self.document_embeddings = None
        self.documents = []

    def encode_documents(self, documents: list[str]) -> np.ndarray:
        """Embed the document corpus."""
        self.documents = documents
        embeddings = []

        for doc in documents:
            inputs = self.ctx_tokenizer(
                doc, return_tensors="pt",
                max_length=256, truncation=True, padding=True
            )
            with torch.no_grad():
                output = self.ctx_encoder(**inputs)
            embeddings.append(output.pooler_output.numpy())

        self.document_embeddings = np.vstack(embeddings)
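        # L2-normalize so the dot product in retrieve() is cosine similarity
        # (the original DPR scores with the raw inner product instead)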
        norms = np.linalg.norm(self.document_embeddings, axis=1, keepdims=True)
        self.document_embeddings = self.document_embeddings / norms
        return self.document_embeddings

    def retrieve(self, query: str, top_k: int = 5) -> list[dict]:
        """Return the top-k documents most relevant to the query."""
        inputs = self.q_tokenizer(
            query, return_tensors="pt",
            max_length=64, truncation=True, padding=True
        )
        with torch.no_grad():
            q_embedding = self.q_encoder(**inputs).pooler_output.numpy()

        q_embedding = q_embedding / np.linalg.norm(q_embedding)
        # Take column 0 rather than squeeze() so a single-document corpus still gives a 1-D array
        scores = np.dot(self.document_embeddings, q_embedding.T)[:, 0]
        top_indices = np.argsort(scores)[::-1][:top_k]

        results = []
        for idx in top_indices:
            results.append({
                "document": self.documents[idx],
                "score": float(scores[idx]),
                "index": int(idx)
            })
        return results


# Usage example
retriever = DPRRetriever()
corpus = [
    "RAG is a model that combines retrieval and generation.",
    "The Transformer uses a self-attention mechanism.",
    "BERT is a bidirectionally pre-trained language model.",
    "DPR retrieves passages using dense vectors.",
]
retriever.encode_documents(corpus)
results = retriever.retrieve("How does document retrieval work in RAG?")
for r in results:
    print(f"[Score: {r['score']:.4f}] {r['document']}")

Original RAG reached 44.5 EM on Natural Questions and 56.8 EM on TriviaQA, showing that a generative approach could compete with the extractive QA systems of the time.

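REALM and RETRO: Integrating Retrieval at Scale

REALM: Retrieval During Pre-training

REALM (Retrieval-Augmented Language Model pre-training) by Guu et al. (2020) appeared slightly before RAG and was among the first works to integrate retrieval from the pre-training stage onward. To predict masked tokens during masked language modeling, it retrieves external documents, and the retrieval step itself is trained through backpropagation.

Its key contribution was showing that the retriever and the language model that reads the retrieved documents can be trained jointly, end to end.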
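
To see why retrieval can receive a training signal at all, it helps to treat the retrieved document z as a latent variable: p(z|x) is a softmax over relevance scores f(x, z), and p(y|x) = Σ_z p(y|z, x) p(z|x). Differentiating log p(y|x) with respect to a score gives p(z|x) · (p(y|z, x) / p(y|x) − 1), so documents that help the prediction more than average get their scores pushed up. The sketch below checks this on made-up numbers; the scores and likelihoods are illustrative assumptions.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def marginal_likelihood(scores, p_y_given_z):
    """p(y|x) = sum_z p(y|z, x) * p(z|x), with p(z|x) = softmax(scores)."""
    p_z = softmax(scores)
    return sum(pz * py for pz, py in zip(p_z, p_y_given_z))

def retrieval_gradient(scores, p_y_given_z):
    """d log p(y|x) / d score_z = p(z|x) * (p(y|z, x) / p(y|x) - 1)."""
    p_z = softmax(scores)
    p_y = sum(pz * py for pz, py in zip(p_z, p_y_given_z))
    return [pz * (py / p_y - 1.0) for pz, py in zip(p_z, p_y_given_z)]

# Toy values: the top-scored document (index 0) is the least helpful one.
scores = [2.0, 1.0, 0.5]
p_y_given_z = [0.1, 0.7, 0.3]
grad = retrieval_gradient(scores, p_y_given_z)
print([round(g, 4) for g in grad])
```

The gradient is negative for the unhelpful document and positive for the helpful ones, which is the mechanism that lets the retriever improve without retrieval labels.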

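RETRO: A 2-Trillion-Token Database

RETRO (Retrieval-Enhanced Transformer) by Borgeaud et al. (2022) dramatically scaled up retrieval. It builds a database of 2 trillion tokens and introduces a Chunked Cross-Attention (CCA) mechanism to make efficient use of the retrieved chunks.

The comparison with GPT-3 summarizes the effect.

Property	RETRO	GPT-3
Parameters	7.5B	175B
Retrieval database	2T tokens	None
Pile test perplexity	Comparable	Baseline
Training cost	Relatively low	High

RETRO reaches comparable performance with about 25x fewer parameters than GPT-3 (the figure usually quoted; 175B / 7.5B is roughly 23x). This suggests that not all knowledge has to be stored in the parameters.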

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

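class ChunkedCrossAttention(nn.Module):
    """RETRO-style chunked cross-attention (simplified sketch).

    Omits the causal chunk shift, relative position encodings, and the
    neighbor encoder used in the paper.
    """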

    def __init__(self, d_model: int = 512, n_heads: int = 8, chunk_size: int = 64):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.chunk_size = chunk_size

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(
        self,
        hidden_states: torch.Tensor,
        retrieved_chunks: torch.Tensor
    ) -> torch.Tensor:
        """
        Args:
            hidden_states: (B, seq_len, d_model) - decoder hidden states
            retrieved_chunks: (B, n_chunks, chunk_len, d_model) - retrieved neighbor chunks,
                where n_chunks must equal seq_len // chunk_size
        """
        B, seq_len, D = hidden_states.shape
        n_chunks = seq_len // self.chunk_size

        # Split the sequence into chunks
        h_chunks = hidden_states[:, :n_chunks * self.chunk_size].reshape(
            B, n_chunks, self.chunk_size, D
        )

        # Each chunk cross-attends to its own retrieved neighbors
        Q = self.W_q(h_chunks)  # (B, n_chunks, chunk_size, D)
        K = self.W_k(retrieved_chunks)  # (B, n_chunks, chunk_len, D)
        V = self.W_v(retrieved_chunks)

        # Split into attention heads
        Q = Q.reshape(B, n_chunks, self.chunk_size, self.n_heads, self.d_k).permute(0, 1, 3, 2, 4)
        K = K.reshape(B, n_chunks, -1, self.n_heads, self.d_k).permute(0, 1, 3, 2, 4)
        V = V.reshape(B, n_chunks, -1, self.n_heads, self.d_k).permute(0, 1, 3, 2, 4)

        # Scaled Dot-Product Attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        attn_weights = F.softmax(scores, dim=-1)
        attn_output = torch.matmul(attn_weights, V)

        # Merge heads and apply the output projection
        attn_output = attn_output.permute(0, 1, 3, 2, 4).reshape(
            B, n_chunks, self.chunk_size, D
        )
        attn_output = self.W_o(attn_output)

        # Residual connection and layer normalization
        output = self.layer_norm(h_chunks + attn_output)
        output = output.reshape(B, n_chunks * self.chunk_size, D)

        # Pass through trailing tokens unchanged when seq_len is not divisible by chunk_size
        if seq_len > n_chunks * self.chunk_size:
            remainder = hidden_states[:, n_chunks * self.chunk_size:]
            output = torch.cat([output, remainder], dim=1)

        return output


# Example: shape check for the RETRO-style layer
cca = ChunkedCrossAttention(d_model=512, n_heads=8, chunk_size=64)
hidden = torch.randn(2, 256, 512)  # batch of 2, sequence length 256
retrieved = torch.randn(2, 4, 32, 512)  # 4 chunks of 32 tokens each
output = cca(hidden, retrieved)
print(f"Input shape: {hidden.shape} -> Output shape: {output.shape}")

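Atlas: Few-shot Learning with Retrieval

Atlas by Izacard et al. (2023) combines a Contriever retriever with a Fusion-in-Decoder (FiD) generator. Its key finding is that with sufficiently good retrieval, a model with far fewer parameters can compete with much larger ones.

With only 64 training examples (64-shot), the Atlas 11B model outperforms PaLM 540B on Natural Questions. A model roughly 50x smaller in parameter count can beat a much larger model when it is paired with a strong retrieval mechanism.

Model	Parameters	NQ (64-shot)	TriviaQA (64-shot)
PaLM	540B	39.6	81.4
Atlas	11B	42.4	84.7
Chinchilla	70B	35.5	72.3

One notable element of Atlas's training recipe is Attention Distillation, one of several retriever training objectives the paper studies. It uses the generator's cross-attention distribution over retrieved documents as a target for fine-tuning the retriever, so that the retriever and the generator reinforce each other.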
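
The idea behind attention distillation can be sketched as a KL divergence between two distributions over the retrieved documents: a target derived from how much the generator attended to each document, and the retriever's own distribution. The sketch below is a simplified illustration on made-up scores; it assumes the per-document attention scores have already been aggregated, and the function names are hypothetical rather than taken from the Atlas codebase.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_distillation_loss(attention_scores, retriever_scores):
    """KL(p_attn || p_retr) over the retrieved documents.

    attention_scores: one aggregated generator attention score per document
                      (treated as a fixed target, no gradient flows into it)
    retriever_scores: one query-document similarity score per document
    """
    p_attn = softmax(attention_scores)
    p_retr = softmax(retriever_scores)
    return sum(pa * math.log(pa / pr) for pa, pr in zip(p_attn, p_retr))

# The generator relied mostly on document 0.
attn = [3.0, 0.0]
print(attention_distillation_loss(attn, [2.0, 0.0]))  # retriever roughly agrees: small loss
print(attention_distillation_loss(attn, [0.0, 3.0]))  # retriever disagrees: large loss
```

Minimizing this loss moves the retriever toward ranking highly the documents the generator actually found useful.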

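Self-RAG: Adaptive Retrieval Through Self-Reflection

ICLR 2024 Oral Paper

Self-RAG (Self-Reflective Retrieval-Augmented Generation) by Asai et al. (2023) was selected for an oral presentation at ICLR 2024 (roughly the top 1% of submissions). It tackles a basic limitation of earlier RAG systems: they always retrieve, regardless of the question type, even though unnecessary retrieval can hurt performance on simple common-sense questions or creative tasks.

Reflection Token Mechanism

The central innovation of Self-RAG is a set of four reflection tokens.

Reflection token	Role	Output values
Retrieve	Decides whether retrieval is needed	Yes, No, Continue
ISREL	Rates the relevance of a retrieved document	Relevant, Irrelevant
ISSUP	Rates how well the generation is supported	Fully Supported, Partially Supported, No Support
ISUSE	Rates the overall usefulness of the response	1 to 5

The model emits these tokens itself during generation, judging on its own whether to retrieve, whether a document is relevant, and how good the response is.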

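Performance Comparison

In the comparison below, Self-RAG clearly outperforms both the plain and the retrieval-augmented Llama2 baselines.

Model	PopQA	Bio	ASQA (EM)
Llama2-7B	14.7	31.6	21.9
Llama2 + RAG	38.2	36.7	25.3
Self-RAG (7B)	55.8	51.5	30.1
ChatGPT	29.3	41.2	27.8

On PopQA this is a relative improvement of about 280% over Llama2-7B, (55.8 - 14.7) / 14.7, and about 90% over ChatGPT, (55.8 - 29.3) / 29.3.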

from dataclasses import dataclass
from enum import Enum
from typing import Optional

class RetrieveDecision(Enum):
    YES = "yes"
    NO = "no"
    CONTINUE = "continue"

class RelevanceScore(Enum):
    RELEVANT = "relevant"
    IRRELEVANT = "irrelevant"

class SupportScore(Enum):
    FULLY_SUPPORTED = "fully_supported"
    PARTIALLY_SUPPORTED = "partially_supported"
    NO_SUPPORT = "no_support"

@dataclass
class ReflectionResult:
    retrieve: RetrieveDecision
    relevance: Optional[RelevanceScore] = None
    support: Optional[SupportScore] = None
    utility: Optional[int] = None  # score from 1 to 5

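class SelfRAGPipeline:
    """Self-RAG-style adaptive retrieve-and-generate pipeline.

    The paper trains a single LM to emit reflection tokens inline; this sketch
    approximates that with a separate reflection_model prompted per decision.
    """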

    def __init__(self, generator, retriever, reflection_model):
        self.generator = generator
        self.retriever = retriever
        self.reflection_model = reflection_model

    def should_retrieve(self, query: str, partial_output: str = "") -> RetrieveDecision:
        """Reflection step that decides whether retrieval is needed."""
        prompt = (
            f"Query: {query}\n"
            f"Partial output: {partial_output}\n"
            "Does this query require external knowledge retrieval? "
            "Answer: yes, no, or continue"
        )
        decision = self.reflection_model.predict(prompt)
        return RetrieveDecision(decision.strip().lower())

    def evaluate_relevance(self, query: str, document: str) -> RelevanceScore:
        """Rate the relevance of a retrieved document (simulates the ISREL token)."""
        prompt = (
            f"Query: {query}\n"
            f"Document: {document}\n"
            "Is this document relevant to answering the query? "
            "Answer: relevant or irrelevant"
        )
        score = self.reflection_model.predict(prompt)
        return RelevanceScore(score.strip().lower())

    def evaluate_support(
        self, query: str, document: str, response: str
    ) -> SupportScore:
        """Rate how well the response is supported by the document (simulates the ISSUP token)."""
        prompt = (
            f"Query: {query}\n"
            f"Document: {document}\n"
            f"Response: {response}\n"
            "Is the response supported by the document? "
            "Answer: fully_supported, partially_supported, or no_support"
        )
        score = self.reflection_model.predict(prompt)
        return SupportScore(score.strip().lower())

    def generate_with_reflection(self, query: str) -> dict:
        """Run the full Self-RAG-style pipeline."""
        # Step 1: decide whether retrieval is needed
        retrieve_decision = self.should_retrieve(query)

        if retrieve_decision == RetrieveDecision.NO:
            # No retrieval needed: generate directly
            response = self.generator.generate(query)
            return {
                "response": response,
                "retrieved": False,
                "reflection": ReflectionResult(retrieve=RetrieveDecision.NO)
            }

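        # Step 2: retrieve documents
        # (assumes each result is a dict with a "text" key; the DPRRetriever
        # shown earlier uses "document", so adapt the key if you combine them)
        documents = self.retriever.retrieve(query, top_k=5)

        # Step 3: filter documents by relevance
        relevant_docs = []
        for doc in documents:
            relevance = self.evaluate_relevance(query, doc["text"])
            if relevance == RelevanceScore.RELEVANT:
                relevant_docs.append(doc)

        if not relevant_docs:
            # No relevant documents: generate without retrieved context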
            response = self.generator.generate(query)
            return {
                "response": response,
                "retrieved": True,
                "relevant_docs": 0,
                "reflection": ReflectionResult(
                    retrieve=RetrieveDecision.YES,
                    relevance=RelevanceScore.IRRELEVANT
                )
            }

        # Step 4: generate one candidate per relevant document and score it
        best_response = None
        best_support = None
        best_score = -1

        for doc in relevant_docs:
            context = f"Context: {doc['text']}\nQuery: {query}"
            candidate = self.generator.generate(context)

            support = self.evaluate_support(query, doc["text"], candidate)
            # Map the support label to a numeric score
            support_score = {
                SupportScore.FULLY_SUPPORTED: 3,
                SupportScore.PARTIALLY_SUPPORTED: 1,
                SupportScore.NO_SUPPORT: 0
            }.get(support, 0)

            if support_score > best_score:
                best_score = support_score
                best_response = candidate
                best_support = support

        return {
            "response": best_response,
            "retrieved": True,
            "relevant_docs": len(relevant_docs),
            "reflection": ReflectionResult(
                retrieve=RetrieveDecision.YES,
                relevance=RelevanceScore.RELEVANT,
                support=best_support,
                # Heuristic stand-in for the ISUSE score, derived from the support score
                utility=min(best_score + 2, 5)
            )
        }

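Corrective RAG (CRAG)

Introducing a Retrieval Quality Evaluator

Corrective RAG (CRAG) by Yan et al. (2024) targets another weakness of earlier RAG systems: they use retrieved documents as-is, without checking whether those documents are actually useful. When retrieval quality is poor, the inaccurate context can make hallucination worse rather than better.

CRAG introduces a lightweight retrieval evaluator that quantitatively scores the confidence of the retrieved results and triggers one of three actions accordingly.

Verdict	Confidence	Action
Correct	High	Refine the key knowledge from the retrieved documents, then use it
Incorrect	Low	Switch to an alternative knowledge source such as web search
Ambiguous	Intermediate	Combine the refined retrieval results with web search results

The Decompose-then-Recompose Algorithm

CRAG's other key contribution is the Decompose-then-Recompose algorithm, which removes irrelevant information from the retrieved documents and rebuilds the context from the key knowledge only.

1. Decompose each retrieved document into fine-grained knowledge strips
2. Score the relevance of each strip individually
3. Select only the relevant strips and recompose them
4. Generate the final response from the recomposed context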

from dataclasses import dataclass
from enum import Enum
import numpy as np

class ConfidenceLevel(Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"
    AMBIGUOUS = "ambiguous"

@dataclass
class EvaluationResult:
    confidence: ConfidenceLevel
    score: float
    action: str

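class CRAGPipeline:
    """Corrective-RAG-style pipeline (sketch).

    The thresholds and the evaluator's score scale are illustrative defaults,
    not values taken from the paper.
    """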

    def __init__(
        self,
        retriever,
        evaluator,
        generator,
        web_searcher,
        upper_threshold: float = 0.7,
        lower_threshold: float = 0.3
    ):
        self.retriever = retriever
        self.evaluator = evaluator
        self.generator = generator
        self.web_searcher = web_searcher
        self.upper_threshold = upper_threshold
        self.lower_threshold = lower_threshold

    def evaluate_retrieval(self, query: str, documents: list[dict]) -> EvaluationResult:
        """Score the confidence of the retrieval results."""
        scores = []
        for doc in documents:
            score = self.evaluator.score(query, doc["text"])
            scores.append(score)

        max_score = max(scores) if scores else 0.0

        if max_score >= self.upper_threshold:
            return EvaluationResult(
                confidence=ConfidenceLevel.CORRECT,
                score=max_score,
                action="refine_and_use"
            )
        elif max_score <= self.lower_threshold:
            return EvaluationResult(
                confidence=ConfidenceLevel.INCORRECT,
                score=max_score,
                action="web_search_fallback"
            )
        else:
            return EvaluationResult(
                confidence=ConfidenceLevel.AMBIGUOUS,
                score=max_score,
                action="combine_sources"
            )

    def decompose_then_recompose(
        self, query: str, document: str
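    ) -> str:
        """Decompose-then-Recompose: keep only the relevant knowledge from a document."""
        # Step 1: split the document into fine-grained knowledge strips
        # (naive split on ". "; strip any trailing period so none is doubled)
        sentences = document.split(". ")
        knowledge_strips = [s.strip().rstrip(".") + "." for s in sentences if s.strip()]

        # Step 2: score the relevance of each strip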
        relevant_strips = []
        for strip in knowledge_strips:
            relevance = self.evaluator.score(query, strip)
            if relevance > 0.5:
                relevant_strips.append((strip, relevance))

        # Step 3: sort by relevance and recompose
        relevant_strips.sort(key=lambda x: x[1], reverse=True)
        refined_context = " ".join([s[0] for s in relevant_strips])

        return refined_context if refined_context else document

    def process_query(self, query: str) -> dict:
        """Run the full CRAG-style pipeline."""
        # Step 1: initial document retrieval
        documents = self.retriever.retrieve(query, top_k=10)

        # Step 2: evaluate retrieval quality
        evaluation = self.evaluate_retrieval(query, documents)

        context = ""
        sources = []

        if evaluation.confidence == ConfidenceLevel.CORRECT:
            # Retrieval trusted: refine the key knowledge, then use it
            for doc in documents[:3]:
                refined = self.decompose_then_recompose(query, doc["text"])
                context += refined + "\n"
            sources = ["internal_retrieval"]

        elif evaluation.confidence == ConfidenceLevel.INCORRECT:
            # Retrieval not trusted: fall back to web search
            web_results = self.web_searcher.search(query)
            for result in web_results[:3]:
                context += result["snippet"] + "\n"
            sources = ["web_search"]

        else:  # AMBIGUOUS
            # Combine both sources
            for doc in documents[:2]:
                refined = self.decompose_then_recompose(query, doc["text"])
                context += refined + "\n"
            web_results = self.web_searcher.search(query)
            for result in web_results[:2]:
                context += result["snippet"] + "\n"
            sources = ["internal_retrieval", "web_search"]

        # Step 3: generate the final response
        prompt = f"Context: {context}\nQuery: {query}\nAnswer:"
        response = self.generator.generate(prompt)

        return {
            "response": response,
            "confidence": evaluation.confidence.value,
            "score": evaluation.score,
            "sources": sources
        }

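From Naive RAG to Advanced RAG and Modular RAG

The survey by Gao et al. (2024), "Retrieval-Augmented Generation for Large Language Models: A Survey", groups the development of RAG into three stages.

Architecture Evolution Comparison

Aspect	Naive RAG	Advanced RAG	Modular RAG
Period	2020-2022	2022-2023	2023 onward
Retrieval strategy	Simple similarity search	Query rewriting, HyDE	Adaptive retrieval, routing
Chunking	Fixed size	Semantic chunking	Hierarchical, recursive chunking
Post-retrieval processing	None	Re-ranking, compression	Self-reflection, correction
Limitations	Low retrieval precision, hallucination	Pipeline complexity	Design-space explosion
Representative models	RAG (Lewis)	RETRO, Atlas	Self-RAG, CRAG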

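Pre-retrieval, Retrieval, and Post-retrieval Optimization

From Advanced RAG onward, a range of optimization techniques has appeared at each stage.

Pre-retrieval optimization:

- Query Rewriting: transform the original question into a form better suited to retrieval
- HyDE (Hypothetical Document Embeddings): first generate a hypothetical document, then use it as the retrieval query
- Step-back Prompting: turn the question into a more abstract one to retrieve more broadly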
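
HyDE is the least intuitive of these, so here is a minimal sketch of its control flow. Everything in it is a stand-in: `generate_fn` represents an LLM call, `bow_embed` is a toy bag-of-words embedding used only so the example runs without a model, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Callable

def bow_embed(text: str) -> Counter:
    """Toy embedding: bag of lowercase whitespace tokens (stand-in for a real encoder)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def hyde_search(
    query: str,
    docs: list[str],
    generate_fn: Callable[[str], str],
    embed_fn: Callable[[str], Counter] = bow_embed,
    top_k: int = 1,
) -> list[str]:
    """Embed a generated hypothetical answer instead of the raw query, then rank documents."""
    hypothetical = generate_fn(query)  # in practice: an LLM drafts a plausible answer
    q_vec = embed_fn(hypothetical)
    ranked = sorted(docs, key=lambda d: cosine(q_vec, embed_fn(d)), reverse=True)
    return ranked[:top_k]

docs = [
    "rayleigh scattering of sunlight by air molecules",
    "blue whales are the largest animals",
]
query = "why is the sky blue"

def fake_generate(q: str) -> str:
    # Hard-coded stand-in for an LLM-written hypothetical document
    return "sunlight undergoes rayleigh scattering in air"

print(hyde_search(query, docs, generate_fn=lambda q: q))   # raw query as-is
print(hyde_search(query, docs, generate_fn=fake_generate)) # HyDE-style query
```

The raw query shares only surface words with the wrong document, while the hypothetical answer shares vocabulary with the right one. With a real dense encoder the same effect shows up in embedding space rather than in token overlap.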

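Retrieval optimization:

- Hybrid Search: combine BM25 (sparse) with vector search (dense)
- Multi-vector retrieval: ColBERT-style token-level interaction
- Recursive retrieval: retrieve repeatedly, building on the initial results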
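
ColBERT-style token-level interaction is usually described as a "MaxSim" score: each query token is matched to its most similar document token, and those maxima are summed. The sketch below shows that scoring rule on tiny hand-written vectors; the vectors and function names are illustrative assumptions, and a real system would use learned token embeddings.

```python
def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """Late interaction: for each query token take its best-matching document token, then sum."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Two query tokens, each a 2-d vector
query_vecs = [[1.0, 0.0], [0.0, 1.0]]

# Document A covers both query tokens partially; document B covers only the second
doc_a = [[1.0, 0.0], [0.5, 0.5]]
doc_b = [[0.0, 1.0], [0.0, 1.0]]

score_a = maxsim_score(query_vecs, doc_a)
score_b = maxsim_score(query_vecs, doc_b)
print(score_a, score_b)
```

Because every query token has to find its own match, a document that covers all parts of the query tends to outrank one that matches a single part very strongly.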

Post-retrieval optimization:

- Re-ranking: reorder the retrieved results with a cross-encoder
- Context compression: remove unnecessary information
- Self-RAG / CRAG: self-reflection and correction

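Benchmark Comparison

Overall Performance of Major Models

Model	Type	NQ (EM)	TriviaQA (EM)	PopQA (Acc)	FEVER (Acc)
RAG (Lewis, 2020)	Naive	44.5	56.8	-	-
REALM (Guu, 2020)	Pre-train	40.4	-	-	-
RETRO (Borgeaud, 2022)	Pre-train	-	-	-	-
Atlas-11B (Izacard, 2023)	Few-shot	42.4 (64-shot)	84.7 (64-shot)	-	-
Self-RAG-7B (Asai, 2023)	Adaptive	-	-	55.8	-
CRAG (Yan, 2024)	Corrective	-	-	-	-

A direct comparison on a shared benchmark is difficult because each paper uses a different evaluation setup (number of few-shot examples, retrieval corpus size, model size), which is why most cells are empty. The direction of the work is still clear: newer systems add adaptive retrieval and self-reflection on top of plain retrieve-then-generate, and the papers introducing them report gains over their own retrieval baselines.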

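CRAG Benchmark (Meta, NeurIPS 2024)

The CRAG Benchmark (Comprehensive RAG Benchmark, unrelated to Corrective RAG despite the shared acronym), presented by Meta at NeurIPS 2024, systematically evaluates RAG systems across 5 domains and 8 question types.

Approach	Overall accuracy	Hallucination rate
LLM only (no retrieval)	34%	High
Naive RAG	44%	Medium
Advanced RAG	55%	Low
SOTA RAG systems	63%	Very low

These results suggest two things. First, RAG gives a clear improvement over an LLM alone (34% vs 44%, a 10-point gain). Second, moving beyond naive RAG adds further gains: 11 points for Advanced RAG (44% to 55%) and 19 points for SOTA systems (44% to 63%).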

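Practical Considerations

Retriever Choice: Dense vs Sparse vs Hybrid

In practice, the choice of retriever depends on the characteristics of the data and on the requirements.

Retrieval method	Strengths	Weaknesses	Good fit
Sparse (BM25)	Exact keyword matching, fast	Misses semantic similarity	Technical terms, code search
Dense (vector)	Captures semantic similarity	Can miss exact keywords	General QA, conversational search
Hybrid	Combines both strengths	More complex, needs weight tuning	Production systems

Hybrid Retrieval Pipeline Implementation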

import numpy as np
from dataclasses import dataclass, field
from typing import Optional
import re
from collections import Counter
import math

@dataclass
class Document:
    text: str
    doc_id: str
    metadata: dict = field(default_factory=dict)

@dataclass
class SearchResult:
    document: Document
    score: float
    source: str  # "sparse", "dense", or "hybrid"

class BM25Retriever:
    """Simplified BM25 sparse retriever."""

    def __init__(self, k1: float = 1.5, b: float = 0.75):
        self.k1 = k1
        self.b = b
        self.documents: list[Document] = []
        self.doc_freqs: dict[str, int] = {}
        self.doc_lengths: list[int] = []
        self.avg_doc_length: float = 0
        self.doc_term_freqs: list[dict[str, int]] = []

    def _tokenize(self, text: str) -> list[str]:
        return re.findall(r'\w+', text.lower())

    def index(self, documents: list[Document]):
        self.documents = documents
        for doc in documents:
            tokens = self._tokenize(doc.text)
            self.doc_lengths.append(len(tokens))
            term_freq = Counter(tokens)
            self.doc_term_freqs.append(term_freq)
            for term in set(tokens):
                self.doc_freqs[term] = self.doc_freqs.get(term, 0) + 1

        self.avg_doc_length = (
            sum(self.doc_lengths) / len(self.doc_lengths) if self.doc_lengths else 0
        )

    def search(self, query: str, top_k: int = 10) -> list[SearchResult]:
        query_tokens = self._tokenize(query)
        n_docs = len(self.documents)
        scores = []

        for i, doc in enumerate(self.documents):
            score = 0.0
            for term in query_tokens:
                if term not in self.doc_term_freqs[i]:
                    continue
                tf = self.doc_term_freqs[i][term]
                df = self.doc_freqs.get(term, 0)
                idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
                dl = self.doc_lengths[i]
                numerator = tf * (self.k1 + 1)
                denominator = tf + self.k1 * (
                    1 - self.b + self.b * dl / self.avg_doc_length
                )
                score += idf * numerator / denominator
            scores.append(score)

        top_indices = np.argsort(scores)[::-1][:top_k]
        return [
            SearchResult(
                document=self.documents[i],
                score=float(scores[i]),
                source="sparse"
            )
            for i in top_indices if scores[i] > 0
        ]

class DenseRetriever:
    """Dense vector retriever (embedding-based)."""

    def __init__(self, embedding_fn):
        self.embedding_fn = embedding_fn
        self.documents: list[Document] = []
        self.embeddings: Optional[np.ndarray] = None

    def index(self, documents: list[Document]):
        self.documents = documents
        texts = [doc.text for doc in documents]
        self.embeddings = self.embedding_fn(texts)
        # L2 normalization
        norms = np.linalg.norm(self.embeddings, axis=1, keepdims=True)
        self.embeddings = self.embeddings / (norms + 1e-10)

    def search(self, query: str, top_k: int = 10) -> list[SearchResult]:
        q_emb = self.embedding_fn([query])
        q_emb = q_emb / (np.linalg.norm(q_emb) + 1e-10)
        scores = np.dot(self.embeddings, q_emb.T).squeeze()
        top_indices = np.argsort(scores)[::-1][:top_k]
        return [
            SearchResult(
                document=self.documents[i],
                score=float(scores[i]),
                source="dense"
            )
            for i in top_indices
        ]

class HybridRetriever:
    """Hybrid retrieval: combines BM25 and dense search."""

    def __init__(
        self,
        sparse: BM25Retriever,
        dense: DenseRetriever,
        alpha: float = 0.5
    ):
        self.sparse = sparse
        self.dense = dense
        self.alpha = alpha  # weight on dense scores (1 - alpha goes to sparse)

    def _normalize_scores(self, results: list[SearchResult]) -> dict[str, float]:
        """Min-max normalization."""
        if not results:
            return {}
        scores = [r.score for r in results]
        min_s, max_s = min(scores), max(scores)
        range_s = max_s - min_s if max_s != min_s else 1.0
        return {
            r.document.doc_id: (r.score - min_s) / range_s
            for r in results
        }

    def search(self, query: str, top_k: int = 10) -> list[SearchResult]:
        """Hybrid search via a weighted sum of min-max-normalized sparse and dense scores."""
        sparse_results = self.sparse.search(query, top_k=top_k * 2)
        dense_results = self.dense.search(query, top_k=top_k * 2)

        sparse_scores = self._normalize_scores(sparse_results)
        dense_scores = self._normalize_scores(dense_results)

        # Collect all unique documents
        all_doc_ids = set(sparse_scores.keys()) | set(dense_scores.keys())
        doc_map = {}
        for r in sparse_results + dense_results:
            doc_map[r.document.doc_id] = r.document

        # Weighted combination
        hybrid_scores = {}
        for doc_id in all_doc_ids:
            s_score = sparse_scores.get(doc_id, 0.0)
            d_score = dense_scores.get(doc_id, 0.0)
            hybrid_scores[doc_id] = (
                (1 - self.alpha) * s_score + self.alpha * d_score
            )

        # Sort and return the top k
        sorted_docs = sorted(
            hybrid_scores.items(), key=lambda x: x[1], reverse=True
        )[:top_k]

        return [
            SearchResult(
                document=doc_map[doc_id],
                score=score,
                source="hybrid"
            )
            for doc_id, score in sorted_docs
        ]


# Usage example
bm25 = BM25Retriever()
docs = [
    Document("RAG combines retrieval and generation.", "doc1"),
    Document("Self-RAG uses reflection tokens.", "doc2"),
    Document("CRAG evaluates retrieval quality.", "doc3"),
    Document("RETRO uses a 2-trillion-token database.", "doc4"),
]
bm25.index(docs)
sparse_results = bm25.search("How does RAG evaluate retrieval quality?")
for r in sparse_results:
    print(f"[BM25 Score: {r.score:.4f}] {r.document.text}")
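
The HybridRetriever above fuses normalized scores with a weighted sum. Reciprocal Rank Fusion (RRF) is a common alternative that uses only each document's rank in every result list, which avoids having to put BM25 and cosine scores on a common scale. The sketch below is a minimal, self-contained version; the function name is hypothetical and k = 60 is the conventional default, used here as an assumption.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank)."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

sparse_ranking = ["a", "b", "c"]  # e.g. doc ids ordered by BM25
dense_ranking = ["b", "c", "a"]   # e.g. doc ids ordered by cosine similarity

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Document "b" is ranked near the top by both lists and therefore comes out first, ahead of "a", which one list ranks first and the other last.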

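Chunking Strategies and the Cost-Performance Trade-off

Chunking strongly affects RAG performance, because it determines the unit of text the retriever can return.

Chunking strategy	Chunk size	Strengths	Weaknesses
Fixed size	256-512 tokens	Simple to implement	Breaks context
Sentence-based	3-5 sentences	Natural boundaries	Uneven sizes
Semantic	Variable	Keeps topics coherent	Embedding cost
Recursive	Hierarchical	Multi-level retrieval	Complex to implement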

import numpy as np

class SemanticChunker:
    """Semantic chunking: detect natural boundaries from embedding similarity."""

    def __init__(self, embedding_fn, similarity_threshold: float = 0.75):
        self.embedding_fn = embedding_fn
        self.threshold = similarity_threshold

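    def chunk(self, text: str, min_chunk_size: int = 100) -> list[str]:
        """Split where the semantic similarity between adjacent sentences drops."""
        # Strip trailing periods so re-joining below does not produce ".."
        sentences = [s.strip().rstrip(".") for s in text.split(". ") if s.strip().rstrip(".")]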
        if len(sentences) <= 1:
            return [text]

        # Embed every sentence
        embeddings = self.embedding_fn(sentences)

        # Compare adjacent sentences by cosine similarity
        chunks = []
        current_chunk = [sentences[0]]

        for i in range(1, len(sentences)):
            sim = np.dot(embeddings[i], embeddings[i - 1]) / (
                np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i - 1])
                + 1e-10
            )

            if sim < self.threshold and len(". ".join(current_chunk)) >= min_chunk_size:
                # Similarity fell below the threshold: start a new chunk
                chunks.append(". ".join(current_chunk) + ".")
                current_chunk = [sentences[i]]
            else:
                current_chunk.append(sentences[i])

        if current_chunk:
            chunks.append(". ".join(current_chunk) + ".")

        return chunks
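
For comparison with the semantic chunker above, the fixed-size strategy from the table fits in a few lines. The sketch below counts whitespace-separated words as a rough proxy for tokens and adds an overlap between neighboring chunks to soften the context-break problem; the function name and default sizes are illustrative assumptions.

```python
def fixed_size_chunks(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    """Split text into chunks of chunk_size words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    tokens = text.split()  # whitespace words as a rough proxy for model tokens
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
chunks = fixed_size_chunks(sample, chunk_size=4, overlap=1)
for c in chunks:
    print(c)
```

Each chunk starts with the last word of the previous one, so a sentence cut at a boundary is still partly visible from both sides.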

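Future Research Directions

Agentic RAG: Combining Tool Use and Retrieval

The direction drawing the most attention recently is Agentic RAG. Rather than only retrieving documents, an LLM agent actively gathers the information it needs using a variety of tools (API calls, database queries, code execution). Retrieval itself becomes one tool among several, and the agent chooses among retrieval, computation, and API calls depending on the situation.

Multi-modal RAG

Multi-modal RAG, which retrieves and uses not only text but also images, tables, and graphs, is also an active research area. Example scenarios include retrieving an architecture diagram from technical documentation, or parsing the tables of a financial report to answer numerical questions. Retrievers built on vision-language models, such as ColPali, are representative of this direction.

Real-time Knowledge Updates

Updating the knowledge store in real time remains an open problem for production RAG systems. Key research topics include how to refresh the embedding index efficiently when documents are added, modified, or deleted, as well as versioning and consistency. Streaming indexing and incremental update techniques are receiving attention.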
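
One simple form of incremental updating is to re-embed a document only when its content actually changes. The sketch below illustrates that idea with an in-memory index keyed by document id and a content hash; the class, its methods, and the toy embedding function are all hypothetical, and a production system would delegate storage to a vector database.

```python
import hashlib
from typing import Callable

class IncrementalIndex:
    """In-memory embedding store that skips re-embedding unchanged documents."""

    def __init__(self, embed_fn: Callable[[str], list[float]]):
        self.embed_fn = embed_fn
        self._vectors: dict[str, list[float]] = {}
        self._hashes: dict[str, str] = {}

    def upsert(self, doc_id: str, text: str) -> bool:
        """Add or update a document. Returns True if an embedding was (re)computed."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self._hashes.get(doc_id) == digest:
            return False  # content unchanged: keep the existing vector
        self._vectors[doc_id] = self.embed_fn(text)
        self._hashes[doc_id] = digest
        return True

    def delete(self, doc_id: str) -> bool:
        """Remove a document. Returns True if it was present."""
        self._hashes.pop(doc_id, None)
        return self._vectors.pop(doc_id, None) is not None

    def __len__(self) -> int:
        return len(self._vectors)

def toy_embed(text: str) -> list[float]:
    # Stand-in for a real embedding model
    return [float(len(text))]

index = IncrementalIndex(toy_embed)
print(index.upsert("doc1", "RAG combines retrieval and generation."))  # new document
print(index.upsert("doc1", "RAG combines retrieval and generation."))  # unchanged
print(index.delete("doc1"))
```

The hash check keeps embedding cost proportional to the amount of changed content rather than to the size of the corpus, which is the main point of incremental updates.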

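Conclusion

The evolution of RAG shows a shift from simple "retrieve, then generate" toward intelligent, adaptive use of knowledge. The main steps along the way can be summarized as follows.

- Original RAG (2020): showed that retrieval and generation can be combined
- RETRO (2022): used large-scale retrieval to maximize parameter efficiency
- Atlas (2023): showed that retrieval quality can substitute for model size
- Self-RAG (2023): made retrieval itself optional and used self-reflection to control quality
- CRAG (2024): scored the confidence of retrieved results and corrected them with alternative sources

In practice, the key is to combine ideas from these papers selectively. For a simple internal QA system, Naive RAG with BM25 may be enough, but in domains that demand high accuracy, such as medicine or law, Self-RAG's reflection mechanism or CRAG's quality evaluation becomes far more important.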