Batching

Published on
June 26, 2026
LLM Inference Serving 2026: vLLM, SGLang, and TensorRT-LLM Compared
llm-serving vllm sglang tensorrt-llm inference mlops batching
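An at-a-glance overview of LLM inference serving in 2026. Written from a developer's perspective, it starts with core principles such as the differing characteristics of prefill and decode, continuous batching, and the paged KV cache, then compares the strengths and weaknesses of vLLM, SGLang, and TensorRT-LLM, offers a selection guide, and covers real deployment configuration.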
Published on
June 26, 2026
Faster Inference: Speculative Decoding and Throughput Optimization
speculative-decoding throughput inference mlops latency batching llm-serving
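A summary of the essentials of inference acceleration: the fundamental reason LLM decode is slow, how speculative decoding raises speed, variants such as Medusa and EAGLE, chunked prefill and prefill/decode disaggregation, the trade-off between latency and throughput, and measurement metrics such as TTFT and TPOT.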
Published on
April 14, 2026
The Complete Guide to LLM Inference Optimization 2025: vLLM, TensorRT-LLM, KV Cache, Speculative Decoding
llm-inference vllm tensorrt-llm kv-cache speculative-decoding quantization batching model-serving gpu-optimization 2026-04
Everything about LLM inference optimization: vLLM (PagedAttention), TensorRT-LLM (FP8/INT4), KV cache management, speculative decoding, continuous batching, FlashAttention, quantization (GPTQ/AWQ/GGUF), model serving (Triton/vLLM/TGI), GPU memory optimization, and cost analysis.

LLM Inference Serving 2026: vLLM, SGLang, and TensorRT-LLM Compared