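AI Papers: Key Papers on Test-Time Scaling, or How to Raise Performance with Inference Budget

Train-Time vs Test-Time: Two Axes of Scaling
A History of TTS Through the Papers
Best-of-N: The Simplest Method, with Surprising Gains
Self-Consistency: The Mathematical Basis for Majority Voting
Tree Search Methods: Structured Exploration
TTS for Code Generation: Test-based Reranking
Controlling the Inference Budget: Preventing Cost Blowups
Collecting and Monitoring Operational Metrics
A Framework for Deciding When to Apply TTS
Common Mistakes and How to Handle Them
References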

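Train-Time vs Test-Time: Two Axes of Scaling

Efforts to improve LLM performance fall broadly into two lines.

Train-time scaling makes the model itself stronger by increasing parameter count, training data, and training steps. The Scaling Laws paper by Kaplan et al. (2020, arxiv:2001.08361) laid the theoretical foundation for this direction, and recent models such as GPT-4, Claude, and Gemini are products of this strategy.

Test-time scaling (TTS) leaves an already trained model as it is and spends more compute at inference time to raise performance. Because the model weights do not change, it can be applied without fine-tuning or retraining. It is especially effective on tasks such as math, coding, and multi-step reasoning, where "several paths lead to the answer and the answer can be verified."

The core idea of TTS fits in one sentence.

Instead of answering once, generate several candidates and pick the one you can trust most.

The reason this simple strategy works is intuitive in probabilistic terms. If the model answers correctly with probability p in a single attempt, the probability that at least one of N independent attempts is correct is 1-(1-p)^N, which rises quickly with N. That probability only turns into real performance, of course, if you also have the ability to pick out the correct answer.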
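To make the formula concrete, here is a minimal sketch that evaluates 1-(1-p)^N for a few budgets. The function name `coverage` is only an illustrative choice, and the calculation assumes the attempts are truly independent, which real samples from one model only approximate.

```python
def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts is correct."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


# Coverage for a model that is right 30% of the time in a single attempt.
for n in (1, 2, 4, 8, 16):
    print(f"N={n:2d}  coverage={coverage(0.3, n):.3f}")
```

The curve flattens quickly: most of the gain comes from the first few samples, which is one reason the later sections treat N as a budget to be allocated rather than a knob to maximize.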

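A History of TTS Through the Papers

TTS is not a single paper but a field formed by the accumulation of several lines of research. The main milestones are listed below in roughly chronological order.

Chain-of-Thought Prompting (Wei et al., 2022, arxiv:2201.11903) A prompting technique that places worked reasoning examples in the prompt so that the model writes out its intermediate steps. Its zero-shot variant is known for the phrase "Let's think step by step" (Kojima et al., 2022, arxiv:2205.11916). The paper showed that explicitly generating intermediate reasoning raises final-answer accuracy substantially. It is not a TTS technique in itself, but the idea that "reasoning paths can be separated out and examined" is a premise of the TTS methods that followed.

Self-Consistency Decoding (Wang et al., 2022, arxiv:2203.11171) Samples CoT several times and takes a majority vote over the final answers. It builds on the observation that different reasoning paths often converge on the same correct answer. It showed a large accuracy gain over greedy decoding on the GSM8K math benchmark.

Verifier-guided Decoding (Cobbe et al., 2021, arxiv:2110.14168) Published before the two papers above. It trains a solution-level verifier for math problems and selects, among several candidates, the answer with the highest verifier score. The key point is separating "generation" from "verification." This line continued into OpenAI's process reward model (PRM) work (Lightman et al., 2023, arxiv:2305.20050), which moved toward scoring each step of a solution.

Tree-of-Thought (Yao et al., 2023, arxiv:2305.10601) Structures reasoning as a tree and explores it with BFS/DFS. At each node the LLM performs a self-evaluation that is used for pruning. It showed a large gap over CoT on search-heavy problems such as Game of 24 and crosswords.

Scaling LLM Test-Time Compute (Snell et al., 2024, arxiv:2408.03314) A systematic analysis of "how should the inference budget be distributed?" From a compute-optimal scaling perspective, it showed that adjusting N (the number of samples) and search depth to problem difficulty is effective. Easy problems need only a small budget, so the budget should be concentrated on hard problems.

TTS Survey: What, How, Where, and How Well? (arxiv:2503.24235, 2025) A comprehensive survey that organizes TTS methods along four axes: what (what is scaled), how (how it is scaled), where (where it is scaled), and how well (how effective it is). Most of the methods covered in this article can be placed within this framework.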

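Best-of-N: The Simplest Method, with Surprising Gains

Best-of-N is the most basic form of TTS. Sample N independent answers at a nonzero temperature, then select the best one by some criterion.

It looks simple, but the effect is large. In code generation, for example, Chen et al. ("Evaluating Large Language Models Trained on Code", arxiv:2107.03374) report that their model solves 28.8% of HumanEval problems with a single sample and roughly 70% when 100 samples per problem are allowed. The crux is "how do you pick the good answer?"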
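The pass@k numbers above are usually estimated from n samples per problem rather than by literally drawing k samples. A minimal sketch of the unbiased estimator described in Chen et al., 1 - C(n-c, k) / C(n, k), where c of the n samples passed; the function name and argument checks are illustrative choices.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    # 1 minus the probability that a random size-k subset has no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# One problem, 20 samples drawn, 6 of them passed the tests.
print(pass_at_k(20, 6, 1))
print(pass_at_k(20, 6, 10))
```

The benchmark-level pass@k is the mean of this quantity over all problems.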

The following is a Best-of-N pipeline skeleton that can serve as a starting point for production use.

from dataclasses import dataclass, field
from typing import Callable
import asyncio
import time

@dataclass
class ScoredCandidate:
    text: str
    score: float
    latency_ms: float
    token_count: int

@dataclass
class BestOfNConfig:
    n_samples: int = 4
    temperature: float = 0.7
    max_tokens: int = 1024
    timeout_seconds: float = 15.0
    cost_budget_usd: float = 0.05

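class BestOfNPipeline:
    """Best-of-N pipeline for production use.

    Handles candidate generation and scoring in one flow. The max_tokens
    and cost_budget_usd fields of the config are not enforced here; the
    generator or the caller has to apply them.
    """

    def __init__(
        self,
        generator: Callable,
        scorers: list[Callable],
        config: BestOfNConfig | None = None,
    ):
        self.generator = generator
        self.scorers = scorers
        # Build the default per instance so pipelines never share one mutable config.
        self.config = config if config is not None else BestOfNConfig()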

    async def generate_candidates(self, prompt: str) -> list[ScoredCandidate]:
        """Generate N candidates concurrently and score them."""
        candidates = []
        tasks = [
            self._generate_one(prompt)
            for _ in range(self.config.n_samples)
        ]

        results = await asyncio.gather(*tasks, return_exceptions=True)

        for result in results:
            if isinstance(result, Exception):
                continue  # skip timeouts and API errors
            text, latency_ms, token_count = result
            score = self._compute_score(text)
            candidates.append(ScoredCandidate(
                text=text,
                score=score,
                latency_ms=latency_ms,
                token_count=token_count,
            ))

        return sorted(candidates, key=lambda c: c.score, reverse=True)

    def _compute_score(self, text: str) -> float:
        """Return the unweighted mean of all scorer outputs."""
        if not self.scorers:
            return 0.0
        scores = [scorer(text) for scorer in self.scorers]
        return sum(scores) / len(scores)

    async def _generate_one(self, prompt: str):
        """Generate one candidate. Raises asyncio.TimeoutError on timeout."""
        start = time.monotonic()
        result = await asyncio.wait_for(
            self.generator(prompt, temperature=self.config.temperature),
            timeout=self.config.timeout_seconds,
        )
        elapsed_ms = (time.monotonic() - start) * 1000
        return result.text, elapsed_ms, result.token_count

    async def run(self, prompt: str) -> ScoredCandidate | None:
        """Return the best candidate, or None if there is none."""
        candidates = await self.generate_candidates(prompt)
        return candidates[0] if candidates else None

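Three points stand out in this implementation.

Asynchronous parallel generation: the N requests are sent concurrently rather than sequentially, which reduces total latency.
Multiple scorers combined: instead of relying on a single scoring function, it averages scorers that look at the answer from different angles.
Tolerance to failures: if some candidates fail to generate, it proceeds with the rest.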
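The pipeline above averages its scorers with equal weight. If some signals deserve more trust than others, a weighted combination is a small change. The sketch below is one way to do it; `weighted_score`, `has_final_answer`, and `brevity` are hypothetical names and heuristics made up for illustration, not part of the pipeline above.

```python
from typing import Callable, Sequence

Scorer = Callable[[str], float]


def weighted_score(text: str, scorers: Sequence[tuple[Scorer, float]]) -> float:
    """Weighted mean of scorer outputs; each scorer should return a value in [0, 1]."""
    total_weight = sum(weight for _, weight in scorers)
    if total_weight <= 0:
        return 0.0
    return sum(weight * scorer(text) for scorer, weight in scorers) / total_weight


def has_final_answer(text: str) -> float:
    """Hypothetical format check: 1.0 if the text states an explicit answer."""
    return 1.0 if "answer:" in text.lower() else 0.0


def brevity(text: str, target_chars: int = 400) -> float:
    """Hypothetical length check: 1.0 up to target_chars, shrinking beyond it."""
    return min(1.0, target_chars / max(len(text), 1))


scorers = [(has_final_answer, 3.0), (brevity, 1.0)]
print(weighted_score("Answer: 42", scorers))
print(weighted_score("x" * 800, scorers))
```

Keeping every scorer in the same range makes the weights interpretable; a scorer with an unbounded scale would otherwise dominate the mean regardless of its weight.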

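Self-Consistency: The Mathematical Basis for Majority Voting

Self-Consistency generates several answers like Best-of-N does, but it selects differently. Instead of judging the "quality" of each answer, it picks the final answer that appears most often.

The method works because of the nature of math and logic problems. Reasoning paths vary, but there is only one correct answer, so when several paths converge on the same answer it is likely to be correct. Wrong answers, by contrast, tend to scatter across many different values.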

from collections import Counter
import re
from typing import Optional


def extract_final_answer(text: str) -> Optional[str]:
    """Extract the final answer from a CoT output.

    Tries several patterns in order; if none matches,
    the last non-empty line is used as the answer.
    """
    # First pattern: "정답: X" / "Answer: X" style
    patterns = [
        r"(?:정답|답|answer|result)\s*[:：=]\s*(.+)",
        r"\\boxed\{(.+?)\}",              # LaTeX \boxed{...} style
        r"####\s*(.+)",                     # GSM8K style
        r"(?:따라서|그러므로|therefore)\s*(.+)",
    ]
    for pattern in patterns:
        matches = re.findall(pattern, text, re.IGNORECASE)
        if matches:
            return matches[-1].strip()

    # If every pattern fails, return the last non-empty line
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    return lines[-1] if lines else None


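def normalize_answer(answer: str) -> str:
    """Normalize an answer for comparison.

    Removes minor differences in whitespace, punctuation, and
    currency symbols so that equal answers compare equal.
    """
    s = answer.strip().lower()
    # Remove currency symbols, backslashes, and thousands separators
    s = re.sub(r"[$€₩\\,]", "", s)
    # Remove trailing periods, commas, and semicolons
    s = s.rstrip(".,;")
    # Collapse runs of whitespace into a single space
    s = re.sub(r"\s+", " ", s)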
    return s


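def self_consistency_vote(
    samples: list[str],
    min_agreement: int = 2,
) -> tuple[Optional[str], int, int]:
    """Run a Self-Consistency majority vote.

    Returns:
        (best_answer, vote_count, total_valid):
        - best_answer: the most frequent answer (None if extraction
          failed or the top answer has fewer than min_agreement votes)
        - vote_count: number of votes for the top answer
        - total_valid: number of samples an answer could be extracted from
    """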
    extracted = []
    for sample in samples:
        ans = extract_final_answer(sample)
        if ans is not None:
            extracted.append(normalize_answer(ans))

    if not extracted:
        return None, 0, 0

    counter = Counter(extracted)
    best_answer, vote_count = counter.most_common(1)[0]

    if vote_count < min_agreement:
        return None, vote_count, len(extracted)

    return best_answer, vote_count, len(extracted)

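The limits of Self-Consistency are also clear. In free-form generation (essays, summaries) the notion of "the same answer" is ill-defined, so it is hard to apply. And when the model produces the same wrong answer systematically (systematic bias), majority voting reinforces that wrong answer.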
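Both sides of this argument can be checked with a small simulation. The sketch below draws answers from a fixed distribution and measures how often the plurality answer is the correct one. The two distributions are invented for illustration: in `scattered` the wrong answers are spread thin, while in `biased` one wrong answer is more likely than the correct one.

```python
import random
from collections import Counter


def vote_accuracy(
    answer_probs: dict[str, float],
    correct: str,
    n_samples: int,
    trials: int = 2000,
    seed: int = 0,
) -> float:
    """Fraction of trials in which the plurality answer equals `correct`."""
    rng = random.Random(seed)
    answers = list(answer_probs)
    weights = list(answer_probs.values())
    wins = 0
    for _ in range(trials):
        votes = Counter(rng.choices(answers, weights=weights, k=n_samples))
        winner, _count = votes.most_common(1)[0]
        wins += winner == correct
    return wins / trials


# Correct answer "42" is right 40% of the time; errors scatter over six values.
scattered = {"42": 0.4, "41": 0.1, "43": 0.1, "40": 0.1, "44": 0.1, "24": 0.1, "420": 0.1}
# Systematic bias: the wrong answer "24" is more likely than the correct "42".
biased = {"42": 0.3, "24": 0.5, "41": 0.1, "43": 0.1}

print("scattered errors:", vote_accuracy(scattered, "42", n_samples=15))
print("systematic bias: ", vote_accuracy(biased, "42", n_samples=15))
```

In the first case voting lifts accuracy well above the 40% single-sample rate; in the second it falls below the 30% single-sample rate, which is the failure mode described above.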

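Tree Search Methods: Structured Exploration

Tree-of-Thought (ToT) and methods based on MCTS (Monte Carlo Tree Search) structure the reasoning process as a tree and search it. Rather than simply producing several complete answers, they branch at intermediate steps, evaluate, and prune.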

from dataclasses import dataclass, field
from typing import Optional
import heapq

@dataclass
class ThoughtNode:
    """A single node in the reasoning tree."""
    content: str
    depth: int
    score: float = 0.0
    parent: Optional['ThoughtNode'] = None
    children: list['ThoughtNode'] = field(default_factory=list)

    def path(self) -> list[str]:
        """Return the reasoning path from the root to this node."""
        nodes = []
        current = self
        while current is not None:
            nodes.append(current.content)
            current = current.parent
        return list(reversed(nodes))


class BeamSearchReasoner:
    """Beam-search reasoning explorer.

    Keeps up to beam_width candidates at each depth, searches down to
    max_depth, and returns the highest-scoring path.
    """

    def __init__(
        self,
        expand_fn,       # (context: str) -> list[str]: candidate next reasoning steps
        evaluate_fn,     # (partial_solution: str) -> float: intermediate score
        beam_width: int = 3,
        max_depth: int = 5,
    ):
        self.expand_fn = expand_fn
        self.evaluate_fn = evaluate_fn
        self.beam_width = beam_width
        self.max_depth = max_depth

    def search(self, problem: str) -> list[str]:
        """Find the best reasoning path with beam search."""
        root = ThoughtNode(content=problem, depth=0)
        beam = [root]

        for depth in range(1, self.max_depth + 1):
            all_candidates = []

            for node in beam:
                context = "\n".join(node.path())
                expansions = self.expand_fn(context)

                for thought in expansions:
                    child = ThoughtNode(
                        content=thought,
                        depth=depth,
                        parent=node,
                    )
                    partial = "\n".join(child.path())
                    child.score = self.evaluate_fn(partial)
                    node.children.append(child)
                    all_candidates.append(child)

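            if not all_candidates:
                # No node produced a next step: stop and keep the previous beam.
                break

            # Keep only the top beam_width candidates
            beam = heapq.nlargest(
                self.beam_width,
                all_candidates,
                key=lambda n: n.score,
            )

        # Return the highest-scoring path in the final beam
        best = max(beam, key=lambda n: n.score)
        return best.path()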

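The cost of tree search can grow exponentially with depth when branches are not pruned, so in practice you must always cap both depth and beam width. The quality of the per-step evaluation function (evaluate_fn) determines the performance of the whole system.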
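A quick count shows how much the beam limit saves. The sketch below compares the number of nodes generated (and therefore scored) by an exhaustive tree with those generated by a beam search, assuming every expansion returns exactly `branching` thoughts. Both function names are illustrative.

```python
def exhaustive_tree_nodes(branching: int, depth: int) -> int:
    """Nodes generated below the root when every branch is expanded."""
    return sum(branching ** level for level in range(1, depth + 1))


def beam_search_nodes(branching: int, beam_width: int, depth: int) -> int:
    """Nodes generated below the root when only beam_width nodes survive per level."""
    total = 0
    frontier = 1  # the root
    for _ in range(depth):
        generated = frontier * branching
        total += generated
        frontier = min(beam_width, generated)
    return total


print(exhaustive_tree_nodes(branching=3, depth=5))         # 363
print(beam_search_nodes(branching=3, beam_width=3, depth=5))  # 39
```

Each generated node costs one evaluation call, so with these settings the beam keeps the number of scored nodes linear in depth instead of exponential.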

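TTS for Code Generation: Test-based Reranking

Code generation is the domain where TTS applies most naturally, because "correct" is clearly defined: a candidate that passes the tests is right, and one that fails is wrong.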

import subprocess
import tempfile
import textwrap
from pathlib import Path
from dataclasses import dataclass


@dataclass
class CodeEvalResult:
    code: str
    passed: bool
    tests_passed: int
    tests_total: int
    error_output: str
    execution_time_ms: float


def evaluate_code_candidate(
    code: str,
    test_code: str,
    timeout_seconds: int = 30,
) -> CodeEvalResult:
    """Evaluate a code candidate against a test suite.

    Runs pytest in a subprocess to decide pass/fail and handles timeouts;
    import errors show up as a non-zero exit code. The candidate runs
    unsandboxed here, so isolate it (container, restricted user) before
    feeding it model-generated code.
    """
    import re
    import sys
    import time

    with tempfile.TemporaryDirectory() as tmpdir:
        solution_path = Path(tmpdir) / "solution.py"
        test_path = Path(tmpdir) / "test_solution.py"

        solution_path.write_text(code, encoding="utf-8")

        # Make solution importable from the test file. The header is joined
        # line by line: dedenting one f-string breaks as soon as a multi-line
        # test_code is interpolated, and repr() keeps Windows paths valid.
        header = "\n".join([
            "import sys",
            f"sys.path.insert(0, {tmpdir!r})",
            "from solution import *",
            "",
            "",
        ])
        test_path.write_text(header + textwrap.dedent(test_code), encoding="utf-8")

        start = time.monotonic()
        try:
            result = subprocess.run(
                [sys.executable, "-m", "pytest", str(test_path), "-v", "--tb=short"],
                capture_output=True,
                text=True,
                timeout=timeout_seconds,
                cwd=tmpdir,
            )
            elapsed_ms = (time.monotonic() - start) * 1000

            # Exit code 0 means every collected test passed. Count only the
            # per-test progress lines ("file::test PASSED"), because the short
            # summary at the end repeats FAILED/ERROR entries.
            passed = result.returncode == 0
            statuses = re.findall(
                r"^\S+::\S+ (PASSED|FAILED|ERROR)\b", result.stdout, re.MULTILINE
            )

            return CodeEvalResult(
                code=code,
                passed=passed,
                tests_passed=statuses.count("PASSED"),
                tests_total=len(statuses),
                # pytest reports failures on stdout, so keep both streams
                error_output="" if passed else result.stdout + result.stderr,
                execution_time_ms=elapsed_ms,
            )
        except subprocess.TimeoutExpired:
            elapsed_ms = (time.monotonic() - start) * 1000
            return CodeEvalResult(
                code=code,
                passed=False,
                tests_passed=0,
                tests_total=0,
                error_output=f"Execution timed out after {timeout_seconds}s",
                execution_time_ms=elapsed_ms,
            )


def pick_best_code(
    candidates: list[str],
    test_code: str,
) -> CodeEvalResult | None:
    """Pick the candidate that passes the most tests.

    Ties, including several candidates that pass everything,
    go to the one with the shorter execution time.
    """
    results = [evaluate_code_candidate(c, test_code) for c in candidates]

    # Primary key: tests passed; secondary key: execution time (lower is better)
    results.sort(key=lambda r: (r.tests_passed, -r.execution_time_ms), reverse=True)

    return results[0] if results else None

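Controlling the Inference Budget: Preventing Cost Blowups

The biggest practical risk of TTS is a cost blowup. Applying N=8 to every request raises API cost by close to 8x. You need a strategy that allocates budget dynamically according to problem difficulty.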

# inference_policy.yaml
# TTS budget allocation policy by problem difficulty

policies:
  easy:
    description: 'Simple QA, FAQ, classification'
    strategy: single_pass
    samples: 1
    temperature: 0.0
    max_tokens: 256
    estimated_cost_per_query_usd: 0.001

  medium:
    description: 'Summarization, translation, simple reasoning'
    strategy: best_of_n
    samples: 3
    temperature: 0.5
    max_tokens: 512
    scorer: format_and_length
    estimated_cost_per_query_usd: 0.005

  hard:
    description: 'Math, coding, multi-step reasoning'
    strategy: self_consistency
    samples: 8
    temperature: 0.7
    max_tokens: 1024
    min_agreement: 3
    estimated_cost_per_query_usd: 0.015

  critical:
    description: 'Code review, safety judgments, medical/legal'
    strategy: verified_best_of_n
    samples: 5
    temperature: 0.6
    max_tokens: 2048
    verifier: domain_specific_verifier
    fallback_on_low_confidence: human_review
    estimated_cost_per_query_usd: 0.025

routing:
  classifier: difficulty_classifier_v2
  fallback_policy: medium

guardrails:
  max_total_tokens_per_query: 16384
  timeout_seconds: 30
  daily_budget_usd: 500
  alert_threshold_usd: 400

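Collecting and Monitoring Operational Metrics

Once TTS is introduced, you need to track metrics that differ from those of single-pass generation.

-- Query for the TTS operations dashboard
-- Shows daily accuracy, cost, and latency per strategy at a glance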

SELECT
    date_trunc('day', created_at)     AS day,
    tts_strategy,
    COUNT(*)                          AS total_queries,
    AVG(correct::int)                 AS accuracy,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY latency_ms) AS p50_latency_ms,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95_latency_ms,
    AVG(samples_generated)            AS avg_samples,
    SUM(cost_usd)                     AS total_cost_usd,
    AVG(cost_usd)                     AS avg_cost_per_query,
    -- Efficiency metric: cost per correct answer
    SUM(cost_usd) / NULLIF(SUM(correct::int), 0) AS cost_per_correct_answer
FROM reasoning_eval_runs
WHERE created_at >= CURRENT_DATE - INTERVAL '14 days'
GROUP BY 1, 2
ORDER BY 1 DESC, 2;

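The key metrics to track are summarized below.

Metric	Description	Example alert criterion
accuracy	Share of correct answers, or pass rate	If the gain over baseline is under 5%, question whether the cost is justified
p95 latency	95th-percentile latency	If it is 3x or more than single-pass generation, review the UX impact
avg_samples	Average number of samples	If it is below the target N, check the timeout/error rate
cost_per_correct	Cost per correct answer	Evaluate ROI against business value
agreement_rate	Agreement rate under Self-Consistency	If it is below 40%, re-examine the model or the problem type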
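Two of these metrics are easy to compute outside the database as well. The sketch below assumes each run is logged with the fields shown in `RunRecord` (a hypothetical record shape) and defines agreement rate as the winning answer's share of valid samples, averaged over runs; other definitions are possible.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RunRecord:
    correct: bool
    cost_usd: float
    vote_count: int   # votes for the winning answer
    total_valid: int  # samples with an extractable answer


def agreement_rate(records: Sequence[RunRecord]) -> Optional[float]:
    """Mean share of valid samples that agreed with the winning answer."""
    rates = [r.vote_count / r.total_valid for r in records if r.total_valid > 0]
    return sum(rates) / len(rates) if rates else None


def cost_per_correct(records: Sequence[RunRecord]) -> Optional[float]:
    """Total cost divided by the number of correct answers (None if there are none)."""
    n_correct = sum(r.correct for r in records)
    if n_correct == 0:
        return None
    return sum(r.cost_usd for r in records) / n_correct


runs = [
    RunRecord(correct=True, cost_usd=0.01, vote_count=6, total_valid=8),
    RunRecord(correct=False, cost_usd=0.02, vote_count=2, total_valid=8),
    RunRecord(correct=True, cost_usd=0.03, vote_count=0, total_valid=0),
]
print(agreement_rate(runs), cost_per_correct(runs))
```

Runs with no extractable answer are excluded from the agreement rate but still count toward cost, which matches how the SQL query above treats cost.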

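A Framework for Deciding When to Apply TTS

Applying TTS everywhere is inefficient. Use the following criteria to decide whether to apply it.

Where TTS is strong:

Tasks where answer verification can be automated (running tests, rule checks, formula verification)
Cases where a wrong answer has a high business cost (medical, legal, financial analysis)
Cases where latency requirements leave seconds of slack (batch processing, asynchronous pipelines)
Cases where each input is highly valuable (one code review saves hours of debugging)

Where TTS is inefficient:

Chat that needs real-time streaming responses (TTFT on the order of 100 ms matters)
Creative generation where the right answer is subjective (essays, marketing copy)
Cases where the verifier is weak or missing (if the scorer is close to random, raising N achieves nothing)
Cases with high input volume and low value per item (log classification, spam filtering)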

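Common Mistakes and How to Handle Them

Mistake 1: Raising N while neglecting the selection policy

Even at N=16, a "length-based" scorer will select long, useless answers. Scorer quality sets the ceiling on what a larger N can deliver. Improving the scorer should come before increasing N.

Mistake 2: Applying the same TTS strategy to every request

"Wait, we are running Best-of-8 on FAQ questions" is a situation that actually happens. Put a difficulty classifier in front, and send easy questions through a single pass.

Mistake 3: Deploying on offline benchmark numbers alone

Even if accuracy rose 15% on GSM8K, you have to look at the latency and cost increase in production alongside it. Confirm in an A/B test that user satisfaction or business KPIs actually improve.

Mistake 4: Trusting the verifier blindly

The verifier itself can be biased. If it has systematic errors, such as rating a certain type of wrong answer highly or rating correct answers low, TTS can make performance worse. Track the verifier's precision and recall separately.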
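Tracking verifier precision and recall needs only a labeled sample of verifier decisions. A minimal sketch, assuming each logged pair is (verifier accepted the answer, the answer was actually correct); the function name is illustrative.

```python
from typing import Optional


def verifier_precision_recall(
    pairs: list[tuple[bool, bool]],
) -> tuple[Optional[float], Optional[float]]:
    """Precision and recall of the verifier's accept decisions.

    Each pair is (verifier_accepted, actually_correct).
    A value is None when its denominator is zero.
    """
    tp = sum(1 for accepted, correct in pairs if accepted and correct)
    fp = sum(1 for accepted, correct in pairs if accepted and not correct)
    fn = sum(1 for accepted, correct in pairs if not accepted and correct)
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


labeled = [(True, True), (True, False), (False, True), (True, True), (False, False)]
print(verifier_precision_recall(labeled))
```

Low precision means the verifier lets wrong answers through, so a larger N mostly surfaces more convincing mistakes; low recall means it throws away correct candidates and wastes the extra samples.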

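Mistake 5: Not designing a fallback path

What happens when all N generations time out or hit API errors? The answer should not be "just return an error." There should be an alternative such as a single-pass fallback or returning a cached similar answer.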
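One way to wire in such a fallback is a thin wrapper around the TTS call. The sketch below assumes two caller-supplied coroutines, `tts_run` and `single_pass`, that each take a prompt and return a string (or None when no candidate survived); both names and the shared timeout are assumptions for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Optional

AsyncAnswerFn = Callable[[str], Awaitable[Optional[str]]]


async def answer_with_fallback(
    prompt: str,
    tts_run: AsyncAnswerFn,
    single_pass: AsyncAnswerFn,
    timeout_seconds: float = 30.0,
) -> Optional[str]:
    """Try the TTS path first; fall back to a single pass if it fails or is empty."""
    try:
        best = await asyncio.wait_for(tts_run(prompt), timeout=timeout_seconds)
        if best is not None:
            return best
    except Exception:
        # Timeouts and API errors both land here; fall through to the fallback.
        pass
    try:
        return await asyncio.wait_for(single_pass(prompt), timeout=timeout_seconds)
    except Exception:
        # Both paths failed: the caller decides what to show (cache, error page).
        return None


async def _demo_tts(prompt: str) -> Optional[str]:
    raise RuntimeError("all candidates failed")


async def _demo_single(prompt: str) -> Optional[str]:
    return f"single-pass answer to: {prompt}"


print(asyncio.run(answer_with_fallback("2+2?", _demo_tts, _demo_single)))
```

In a real service the two except branches would also log the failure, since a rising fallback rate is itself an operational signal worth alerting on.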

References

Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", 2022 -- arxiv:2201.11903
Wang et al., "Self-Consistency Improves Chain of Thought Reasoning in Language Models", 2022 -- arxiv:2203.11171
Yao et al., "Tree of Thoughts: Deliberate Problem Solving with Large Language Models", 2023 -- arxiv:2305.10601
Lightman et al., "Let's Verify Step by Step", 2023 -- arxiv:2305.20050
Snell et al., "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters", 2024 -- arxiv:2408.03314
"What, How, Where, and How Well? A Survey on Test-Time Scaling in LLMs", 2025 -- arxiv:2503.24235
"The Art of Scaling Test-Time Compute for Large Language Models", 2025 -- arxiv:2512.02008
Cobbe et al., "Training Verifiers to Solve Math Word Problems", 2021 -- arxiv:2110.14168
Kaplan et al., "Scaling Laws for Neural Language Models", 2020 -- arxiv:2001.08361
Chen et al., "Evaluating Large Language Models Trained on Code", 2021 -- arxiv:2107.03374

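Quiz

What is the fundamental difference between test-time scaling and train-time scaling? Answer: ||Train-time scaling changes the model weights (more parameters, data, or training), whereas test-time scaling leaves the trained model as it is and raises performance by spending more compute at inference time.||
What is the key difference between Self-Consistency and Best-of-N? Answer: ||Best-of-N scores each candidate individually and selects the highest score, whereas Self-Consistency selects by majority vote (frequency) over final answers. It needs no separate scorer, but it is only effective on tasks where the answer is clearly defined.||
What is the theoretical upper bound on pass@10 for a model whose pass@1 is 30%? Answer: ||1 - (1-0.3)^10 = 1 - 0.7^10 ≈ 97.2%, assuming independent samples and the same 30% success rate on every problem; when difficulty varies across problems, the actual pass@10 is lower. Turning it into accuracy further assumes that the correct answer can be picked out of the 10 perfectly (oracle selection).||
For what kind of problem does Tree-of-Thought beat plain Best-of-N? Answer: ||Search problems where branching decisions at intermediate steps matter (Game of 24, planning, multi-step decision making). Pruning midway uses the budget more efficiently than generating several complete answers wholesale.||
What is compute-optimal scaling in TTS? Answer: ||A strategy that, instead of giving every problem the same budget, allocates it by difficulty: few samples (1-2) for easy problems and many samples for hard ones. Snell et al. (2024) analyzed it systematically.||
What is the benefit of combining a verifier with Self-Consistency? Answer: ||Looking at the majority-vote result and the verifier score together lets the verifier catch the model's systematic bias toward a wrong answer, while the vote compensates for individual verifier misjudgments, so each covers the other's weakness.||
Why is test-based reranking stronger in code generation than in other domains? Answer: ||Because passing or failing the tests is a clear binary verification signal. In most other domains, verifying an answer is ambiguous or expensive.||
Which baseline numbers should you secure first when introducing TTS? Answer: ||Accuracy, latency, and cost for a single sample (single pass, temperature=0). Without this baseline, the actual improvement from TTS cannot be measured quantitatively.||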