A Complete Guide to AI Model Serving and Inference Optimization: vLLM, TensorRT, Triton, Ollama

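Introduction

Building an AI model and serving it efficiently in production are very different challenges. Serving a GPT-class LLM to millions of users with sub-100 ms response times, or running real-time image classification on an edge device, takes substantial optimization work.

This guide walks through the core tools for AI model serving (vLLM, TensorRT, NVIDIA Triton, Ollama) and the main optimization techniques, with practical examples.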

1. The Challenges and Goals of AI Inference

1.1 Training vs. Inference

Training and inference have fundamentally different compute requirements.

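Aspect	Training	Inference
Goal	Optimize model parameters	Produce predictions quickly
Batch size	Large (128~2048)	Small, or streaming
Memory	Weights, activations, gradients, and optimizer state	Weights and activations (plus the KV cache for LLMs)
Precision	FP32 or FP16	INT8 and INT4 are possible
Accelerator	A100, H100 (expensive)	T4, L4, RTX (cheaper)
Cost pattern	Large one-time cost	Small ongoing cost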

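1.2 Latency vs. Throughput

These are the two most important metrics in inference optimization:

Latency

Response time of a single request
Matters for real-time applications (chatbots, autocomplete)
Measured as percentiles: P50, P95, P99
Typical targets: under 100 ms (general), under 50 ms (real time); for LLMs such targets usually apply to the time to first token rather than the full response

Throughput

Requests processed per unit time (QPS, tokens/second)
Matters for batch processing and offline inference
Trades off against latency: larger batches raise throughput but make each request wait longer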
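The trade-off between the two metrics can be illustrated with a toy cost model. This is only a sketch: the constants `FIXED_MS` and `PER_ITEM_MS` are made-up numbers, not measurements, and real hardware does not scale this linearly.

```python
# Toy cost model (assumed numbers): one forward pass costs a fixed overhead
# plus a small per-item cost, which is roughly how batched GPU inference behaves.
FIXED_MS = 20.0      # hypothetical per-batch overhead (kernel launches, weight reads)
PER_ITEM_MS = 2.0    # hypothetical marginal cost of one extra item in the batch


def batch_latency_ms(batch_size: int) -> float:
    """Time until every request in the batch gets its result."""
    return FIXED_MS + PER_ITEM_MS * batch_size


def throughput_qps(batch_size: int) -> float:
    """Requests completed per second when batches run back to back."""
    return batch_size / (batch_latency_ms(batch_size) / 1000.0)


for bs in (1, 8, 32):
    print(f"batch={bs:>2}  latency={batch_latency_ms(bs):6.1f} ms  "
          f"throughput={throughput_qps(bs):7.1f} req/s")
```

Under these assumptions, going from a batch of 1 to a batch of 32 multiplies throughput several times over while the latency seen by each request also grows, which is the tension a serving system has to tune.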

# Example: measuring latency and throughput
import time
import numpy as np
from typing import List

def measure_latency(model_fn, inputs: List, n_runs: int = 100):
    """Measure per-request latency percentiles in milliseconds."""
    latencies = []

    # Warm-up
    for _ in range(10):
        _ = model_fn(inputs[0])

    # Measurement
    for inp in inputs[:n_runs]:
        start = time.perf_counter()
        _ = model_fn(inp)
        end = time.perf_counter()
        latencies.append((end - start) * 1000)  # ms

    latencies = np.array(latencies)
    return {
        "p50_ms": np.percentile(latencies, 50),
        "p95_ms": np.percentile(latencies, 95),
        "p99_ms": np.percentile(latencies, 99),
        "mean_ms": np.mean(latencies),
        "std_ms": np.std(latencies),
    }

def measure_throughput(model_fn, inputs: List, duration_sec: int = 60):
    """Measure sustained throughput (requests per second)."""
    count = 0
    start = time.time()

    while time.time() - start < duration_sec:
        _ = model_fn(inputs[count % len(inputs)])
        count += 1

    elapsed = time.time() - start
    return {
        "qps": count / elapsed,
        "total_requests": count,
        "duration_sec": elapsed,
    }

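1.3 Hardware Selection Guide

GPU (NVIDIA):
  A100 80GB: high performance for both training and inference, expensive
  H100 80GB: higher-end successor to the A100, well suited to LLM inference
  A10G 24GB: common on AWS, mid-range performance
  T4 16GB: cost-efficient, inference-oriented, inexpensive on AWS/GCP
  L4 24GB: successor to the T4, optimized for inference
  RTX 4090 24GB: small deployments, local LLMs

CPU:
  Pros: cheap, available everywhere, large memory capacity
  Cons: limited parallelism, slow matrix operations
  Use cases: INT8-quantized models, small models, edge

TPU (Google):
  Cloud TPU v4: large-scale LLM training and serving
  TPU v5e: inference-optimized version

NPU (Edge):
  Apple Neural Engine: Core ML models on iPhone/Mac
  Qualcomm AI Engine: on-device inference on Android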

2. Model Optimization Techniques

2.1 Quantization

Quantization represents a model's weights and activations at lower bit precision to reduce memory use and compute.
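To make the idea concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is one common scheme chosen for illustration, not the exact method any particular library uses; the function names and the sample weights are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map [-max|w|, +max|w|] onto [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]


weights = [0.42, -1.27, 0.003, 0.9]   # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored in 8 bits plus one shared scale, and the round-trip error per weight is bounded by half the scale. Very small weights (such as 0.003 here) collapse to zero, which is where the accuracy loss comes from.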

Precision	FP32 (32-bit)	FP16 (16-bit)	BF16 (16-bit)	INT8 (8-bit)	INT4 (4-bit)
Memory	100%	50%	50%	25%	12.5%
Speed (rough, hardware-dependent)	baseline	1.5-2x	1.5-2x	2-4x	4-8x
Accuracy loss	none	negligible	negligible	small	moderate
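The memory row translates directly into a back-of-the-envelope estimate. The sketch below counts weight storage only; it assumes a hypothetical 7-billion-parameter model and ignores activations, the KV cache, and the per-group scales that real quantized formats also store.

```python
BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "bf16": 16, "int8": 8, "int4": 4}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * BITS_PER_WEIGHT[precision] / 8 / 1e9


for precision in BITS_PER_WEIGHT:
    print(f"7B @ {precision}: {weight_memory_gb(7e9, precision):.1f} GB")
```

An estimate like this is a quick way to check whether a model can fit on a given GPU before trying to load it.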

Post-Training Quantization (PTQ)

# PyTorch PTQ example (MyModel and calibration_loader are placeholders)
import io

import torch
from torch.ao.quantization import (
    quantize_dynamic, prepare, convert, get_default_qconfig,
)

model = MyModel()
model.load_state_dict(torch.load("model.pth"))
model.eval()

# Dynamic quantization: weights are stored as INT8, activations are quantized on the fly
quantized_model = quantize_dynamic(
    model,
    {torch.nn.Linear},  # layer types to quantize
    dtype=torch.qint8
)

# Compare model sizes
def get_model_size_mb(model):
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.tell() / (1024 * 1024)

print(f"Original model: {get_model_size_mb(model):.2f} MB")
print(f"Quantized model: {get_model_size_mb(quantized_model):.2f} MB")

# Static quantization: weights and activations are both INT8.
# In eager mode the model must wrap its forward pass in QuantStub/DeQuantStub.
model.qconfig = get_default_qconfig('x86')

# Insert observers
prepared_model = prepare(model)

# Collect activation statistics on calibration data
with torch.no_grad():
    for batch in calibration_loader:
        prepared_model(batch)

# Convert to a quantized model
static_quantized_model = convert(prepared_model)

GPTQ - LLM Quantization

# INT4 quantization of an LLM with GPTQ
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"

# The tokenizer is needed to tokenize the calibration dataset
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ settings
gptq_config = GPTQConfig(
    bits=4,               # INT4 quantization
    dataset="wikitext2",  # calibration dataset
    tokenizer=tokenizer,
    group_size=128,       # number of weights that share one scale
    damp_percent=0.01,
)

# Run quantization
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto"
)

# Save
quantized_model.save_pretrained("llama2-7b-gptq-int4")
tokenizer.save_pretrained("llama2-7b-gptq-int4")

AWQ - Activation-Aware Weight Quantization

# AWQ quantization (INT4 that typically preserves quality better)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"
quant_path = "llama2-7b-awq"

# AWQ settings
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

# Load the model and quantize it
model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)  # keep the tokenizer next to the quantized weights

2.2 Pruning

import torch
import torch.nn.utils.prune as prune

model = MyConvNet()  # placeholder model with conv1, conv2, and fc1 layers

# Unstructured pruning (L1-norm based, 50% sparsity)
prune.l1_unstructured(
    model.conv1,
    name='weight',
    amount=0.5
)

# Structured pruning (whole channels, which is what yields real speedups)
prune.ln_structured(
    model.conv1,
    name='weight',
    amount=0.3,
    n=2,
    dim=0  # prune along the output-channel dimension
)

# Global pruning (removes the 20% of weights with the smallest magnitude across all listed layers)
parameters_to_prune = (
    (model.conv1, 'weight'),
    (model.conv2, 'weight'),
    (model.fc1, 'weight'),
)

prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.2,
)

# Make pruning permanent (drops the mask; repeat for every pruned module)
prune.remove(model.conv1, 'weight')

# Check sparsity
def print_sparsity(model):
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Conv2d):
            sparsity = 100. * float(torch.sum(module.weight == 0)) / float(module.weight.nelement())
            print(f"{name}: {sparsity:.1f}% sparsity")

2.3 Knowledge Distillation

import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillationTrainer:
    """Teacher-student knowledge distillation."""

    def __init__(self, teacher, student, temperature=4.0, alpha=0.7):
        self.teacher = teacher
        self.student = student
        self.temperature = temperature
        self.alpha = alpha  # weight of the soft-label loss

        self.teacher.eval()  # the teacher stays frozen

    def distillation_loss(self, student_logits, teacher_logits, labels):
        """Distillation loss = soft-label loss + hard-label loss."""
        T = self.temperature

        # Soft-label loss (uses the teacher's knowledge)
        soft_targets = F.softmax(teacher_logits / T, dim=1)
        soft_pred = F.log_softmax(student_logits / T, dim=1)
        soft_loss = F.kl_div(soft_pred, soft_targets, reduction='batchmean') * (T ** 2)

        # Hard-label loss (ground-truth labels)
        hard_loss = F.cross_entropy(student_logits, labels)

        # Combined loss
        total_loss = self.alpha * soft_loss + (1 - self.alpha) * hard_loss
        return total_loss

    def train_step(self, inputs, labels, optimizer):
        optimizer.zero_grad()

        # Teacher prediction (no gradients needed)
        with torch.no_grad():
            teacher_logits = self.teacher(inputs)

        # Student prediction
        student_logits = self.student(inputs)

        # Compute the distillation loss and backpropagate
        loss = self.distillation_loss(student_logits, teacher_logits, labels)
        loss.backward()
        optimizer.step()

        return loss.item()
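
Written out as a formula, the loss computed by `distillation_loss` is as follows. The symbols are introduced here for illustration: $z_s$ and $z_t$ are the student and teacher logits, $\sigma$ is softmax, $y$ is the ground-truth label, $T$ is the temperature, and $\alpha$ is the soft-label weight.

```latex
\mathcal{L}
  = \alpha \, T^{2} \,
    \mathrm{KL}\!\left( \sigma(z_t / T) \,\middle\|\, \sigma(z_s / T) \right)
  + (1 - \alpha) \, \mathrm{CE}\!\left( y, \sigma(z_s) \right)
```

The $T^{2}$ factor compensates for the gradients of the softened term shrinking roughly as $1/T^{2}$, so the two terms stay on a comparable scale when the temperature changes.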

2.4 TorchScript and ONNX Conversion

import torch
import torch.onnx

# Convert to TorchScript (tracing); MyModel is a placeholder
model = MyModel()
model.eval()

example_input = torch.randn(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)
traced_model.save("model_traced.pt")

# TorchScript scripting (keeps data-dependent control flow that tracing would miss)
scripted_model = torch.jit.script(model)
scripted_model.save("model_scripted.pt")

# Export to ONNX
torch.onnx.export(
    model,
    example_input,
    "model.onnx",
    opset_version=17,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"},
    },
    verbose=False
)

# Validate the ONNX model
import onnx
onnx_model = onnx.load("model.onnx")
onnx.checker.check_model(onnx_model)

# Run inference with ONNX Runtime
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession(
    "model.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)

input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

# Run inference
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
outputs = session.run([output_name], {input_name: input_data})
print(f"Output shape: {outputs[0].shape}")

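3. TensorRT

3.1 Introduction to TensorRT

TensorRT is NVIDIA's SDK for optimizing deep learning inference. It applies the following optimizations automatically:

Layer fusion: merges Conv+BN+ReLU into a single operation
Kernel auto-tuning: selects the CUDA kernels best suited to the target GPU architecture
FP16/INT8 precision calibration: lowers precision while keeping accuracy loss small
Memory reuse: allocates tensor memory efficiently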
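
Layer fusion works because a convolution followed by batch normalization is still a linear map, so the two can be folded into one set of weights ahead of time. The sketch below shows the algebra on a single scalar channel; it illustrates the general idea and is not TensorRT's internal implementation. All numeric values are made up.

```python
import math


def conv_then_bn(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Unfused: a (1x1, single-channel) convolution followed by batch norm."""
    y = w * x + b
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold the batch-norm statistics into the convolution's weight and bias."""
    k = gamma / math.sqrt(var + eps)
    return w * k, (b - mean) * k + beta


w, b = 0.8, 0.1
bn = dict(gamma=1.5, beta=-0.2, mean=0.3, var=4.0)
w_fused, b_fused = fold_bn(w, b, **bn)

for x in (-2.0, 0.0, 3.5):
    unfused = max(0.0, conv_then_bn(x, w, b, **bn))   # Conv -> BN -> ReLU
    fused = max(0.0, w_fused * x + b_fused)           # one fused expression
    assert math.isclose(unfused, fused, abs_tol=1e-9)
```

The fused form does one multiply-add and one clamp per element instead of three separate passes over memory, which is where most of the speedup comes from.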

3.2 Converting to TensorRT with the Python API

import tensorrt as trt
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine_from_onnx(onnx_path: str, precision: str = "fp16") -> trt.ICudaEngine:
    """Convert an ONNX model into a TensorRT engine."""

    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network(
             1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
         ) as network, \
         trt.OnnxParser(network, TRT_LOGGER) as parser:

        config = builder.create_builder_config()
        config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 * 1024 * 1024 * 1024)  # 4GB

        # Precision
        if precision == "fp16":
            config.set_flag(trt.BuilderFlag.FP16)
        elif precision == "int8":
            config.set_flag(trt.BuilderFlag.INT8)
            config.int8_calibrator = MyCalibrator()  # INT8 needs a calibrator (placeholder class)

        # Parse the ONNX file
        with open(onnx_path, 'rb') as model:
            if not parser.parse(model.read()):
                for error in range(parser.num_errors):
                    print(f"ONNX parse error: {parser.get_error(error)}")
                raise ValueError("ONNX parsing failed")

        # Dynamic input shape (variable batch size)
        profile = builder.create_optimization_profile()
        profile.set_shape(
            "input",
            min=(1, 3, 224, 224),    # smallest batch
            opt=(8, 3, 224, 224),    # batch size to tune for
            max=(32, 3, 224, 224)    # largest batch
        )
        config.add_optimization_profile(profile)

        # Build the engine (returns None on failure)
        serialized_engine = builder.build_serialized_network(network, config)
        if serialized_engine is None:
            raise RuntimeError("TensorRT engine build failed")

        runtime = trt.Runtime(TRT_LOGGER)
        return runtime.deserialize_cuda_engine(serialized_engine)

def save_engine(engine, path: str):
    with open(path, 'wb') as f:
        f.write(engine.serialize())

def load_engine(path: str):
    runtime = trt.Runtime(TRT_LOGGER)
    with open(path, 'rb') as f:
        return runtime.deserialize_cuda_engine(f.read())

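class TRTInferenceEngine:
    """TensorRT engine wrapper (uses the TensorRT 8.x bindings API, which TensorRT 10 removed)."""

    def __init__(self, engine_path: str, max_batch_size: int = 32):
        self.engine = load_engine(engine_path)
        self.context = self.engine.create_execution_context()

        # Allocate buffers
        self.inputs = []
        self.outputs = []
        self.bindings = []

        for binding in self.engine:
            # A dynamic batch dimension is reported as -1, so size buffers for the largest batch
            shape = [max_batch_size if d < 0 else d
                     for d in self.engine.get_binding_shape(binding)]
            size = trt.volume(shape)
            dtype = trt.nptype(self.engine.get_binding_dtype(binding))
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)

            self.bindings.append(int(device_mem))

            if self.engine.binding_is_input(binding):
                self.inputs.append({'host': host_mem, 'device': device_mem})
            else:
                self.outputs.append({'host': host_mem, 'device': device_mem})

    def infer(self, input_data: np.ndarray) -> np.ndarray:
        """Run inference on one batch (assumes one input at binding 0 and one output)."""
        # Tell the context the actual input shape for this call
        self.context.set_binding_shape(0, input_data.shape)

        # Copy the input into device memory
        host_in = self.inputs[0]['host']
        np.copyto(host_in[:input_data.size], input_data.ravel())
        cuda.memcpy_htod(self.inputs[0]['device'], host_in)

        # Execute
        self.context.execute_v2(bindings=self.bindings)

        # Copy the output back and trim it to the actual output shape
        cuda.memcpy_dtoh(self.outputs[0]['host'], self.outputs[0]['device'])
        out_shape = tuple(self.context.get_binding_shape(len(self.inputs)))
        return self.outputs[0]['host'][:trt.volume(out_shape)].reshape(out_shape).copy()

# Usage example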
engine = build_engine_from_onnx("model.onnx", precision="fp16")
save_engine(engine, "model_fp16.trt")

trt_engine = TRTInferenceEngine("model_fp16.trt")
output = trt_engine.infer(np.random.randn(1, 3, 224, 224).astype(np.float32))

3.3 Torch-TensorRT

# Torch-TensorRT: compile a PyTorch model directly to TensorRT (MyResNet50 is a placeholder)
import time

import torch
import torch_tensorrt

model = MyResNet50()
model.eval()
model.cuda()

# TorchScript -> TensorRT
traced_model = torch.jit.trace(model, torch.randn(1, 3, 224, 224).cuda())

trt_model = torch_tensorrt.compile(
    traced_model,
    inputs=[
        torch_tensorrt.Input(
            min_shape=[1, 3, 224, 224],
            opt_shape=[8, 3, 224, 224],
            max_shape=[32, 3, 224, 224],
            dtype=torch.float32
        )
    ],
    enabled_precisions={torch.float16},  # allow FP16 kernels
    workspace_size=4 * 1024 * 1024 * 1024,  # 4GB
)

# Save and load
torch.jit.save(trt_model, "model_trt.ts")
loaded_model = torch.jit.load("model_trt.ts")

# Speed comparison. CUDA kernels launch asynchronously, so synchronize
# before reading the clock; otherwise only the launch overhead is timed.
def time_model(fn, x, n_runs: int = 100, n_warmup: int = 10) -> float:
    with torch.no_grad():
        for _ in range(n_warmup):
            fn(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            fn(x)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs * 1000  # ms per batch

input_tensor = torch.randn(8, 3, 224, 224).cuda()

pytorch_time = time_model(model, input_tensor)      # original PyTorch
trt_time = time_model(loaded_model, input_tensor)   # TensorRT

print(f"PyTorch: {pytorch_time:.2f}ms, TensorRT: {trt_time:.2f}ms")
print(f"Speedup: {pytorch_time / trt_time:.2f}x")

4. NVIDIA Triton Inference Server

4.1 Introduction to Triton

NVIDIA Triton Inference Server is an open-source inference server for serving models from multiple ML frameworks in production.

Key features:

Multi-framework support (TensorRT, ONNX, PyTorch, TensorFlow, Python)
Dynamic batching
Concurrent model execution
Efficient use of GPU and CPU resources
Model ensemble pipelines
gRPC and HTTP REST APIs
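
Dynamic batching is the feature that most affects throughput, so it is worth seeing the idea in isolation. The sketch below is a simplified model of the policy, not Triton's actual scheduler: the function `dynamic_batches` and its parameters are hypothetical names, and it only decides how requests would be grouped given their arrival times.

```python
from typing import List


def dynamic_batches(arrivals_us: List[int], max_batch_size: int,
                    max_queue_delay_us: int) -> List[List[int]]:
    """Group request arrival times (sorted, in microseconds) into batches.

    A batch is dispatched when it is full, or when the next request arrives
    after the oldest queued request has already waited max_queue_delay_us.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    for t in arrivals_us:
        if current and t - current[0] > max_queue_delay_us:
            batches.append(current)      # oldest request timed out: flush
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            batches.append(current)      # batch is full: flush immediately
            current = []
    if current:
        batches.append(current)
    return batches


arrivals = [0, 1000, 2000, 9000, 9500, 30000]
print(dynamic_batches(arrivals, max_batch_size=4, max_queue_delay_us=5000))
```

A longer queue delay produces larger batches and higher throughput at the cost of extra waiting time per request; a delay of zero degenerates to one request per batch whenever traffic is sparse.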

4.2 Model Repository Layout

model_repository/
├── resnet50/
│   ├── config.pbtxt
│   ├── 1/
│   │   └── model.plan          # TensorRT engine
│   └── 2/
│       └── model.plan
├── bert_onnx/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx
├── llm_python/
│   ├── config.pbtxt
│   └── 1/
│       └── model.py
└── ensemble_pipeline/
    ├── config.pbtxt
    └── 1/
        └── (empty)

4.3 Configuration File (config.pbtxt)

# model_repository/resnet50/config.pbtxt
name: "resnet50"
platform: "tensorrt_plan"
max_batch_size: 32

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [3, 224, 224]
  }
]

output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [1000]
  }
]

# Dynamic batching
dynamic_batching {
  preferred_batch_size: [4, 8, 16, 32]
  max_queue_delay_microseconds: 5000  # wait up to 5 ms
}

# GPU instances
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [0]
  }
]

# model_repository/bert_onnx/config.pbtxt
name: "bert_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 8

input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [128]  # sequence length
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [128]
  }
]

output [
  {
    name: "last_hidden_state"
    data_type: TYPE_FP32
    dims: [128, 768]
  }
]

dynamic_batching {
  max_queue_delay_microseconds: 10000
}

optimization {
  execution_accelerators {
    gpu_execution_accelerator: [
      {
        name: "tensorrt"
        parameters { key: "precision_mode" value: "FP16" }
        parameters { key: "max_workspace_size_bytes" value: "1073741824" }
      }
    ]
  }
}

4.4 Python Backend Model

# model_repository/custom_model/1/model.py
import numpy as np
import json
import triton_python_backend_utils as pb_utils
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class TritonPythonModel:
    """Triton model for the Python backend."""

    def initialize(self, args):
        """Load the model (called once when the model instance is loaded)."""
        model_config = json.loads(args['model_config'])

        # Pick the device from the instance kind
        self.device = 'cuda' if args['model_instance_kind'] == 'GPU' else 'cpu'

        # Load the model
        model_name = "distilbert-base-uncased-finetuned-sst-2-english"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model.to(self.device)
        self.model.eval()

    def execute(self, requests):
        """Run inference for each request handed over by Triton."""
        responses = []

        for request in requests:
            # Read the input text (expects shape [batch, 1] of bytes)
            input_text = pb_utils.get_input_tensor_by_name(request, "TEXT")
            texts = input_text.as_numpy().tolist()
            texts = [t[0].decode('utf-8') for t in texts]

            # Tokenize
            inputs = self.tokenizer(
                texts,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=128
            ).to(self.device)

            # Inference
            with torch.no_grad():
                outputs = self.model(**inputs)
                probs = torch.softmax(outputs.logits, dim=1).cpu().numpy()

            # Build the response
            output_tensor = pb_utils.Tensor("PROBABILITIES", probs.astype(np.float32))
            response = pb_utils.InferenceResponse(output_tensors=[output_tensor])
            responses.append(response)

        return responses

    def finalize(self):
        """Release resources."""
        del self.model
        torch.cuda.empty_cache()

4.5 Deploying Triton with Docker

# Start the Triton server
# Ports: 8000 = HTTP, 8001 = gRPC, 8002 = metrics
# (comments cannot follow a trailing backslash, so they are listed here)
docker run --gpus all \
  -p 8000:8000 \
  -p 8001:8001 \
  -p 8002:8002 \
  -v /path/to/model_repository:/models \
  --shm-size=1g \
  nvcr.io/nvidia/tritonserver:24.02-py3 \
  tritonserver \
  --model-repository=/models \
  --log-verbose=1 \
  --strict-model-config=false

# Check that the server is ready to serve requests
curl http://localhost:8000/v2/health/ready

# Query model metadata
curl http://localhost:8000/v2/models/resnet50

# Sending requests to Triton from a Python client
import tritonclient.http as httpclient
import tritonclient.grpc as grpcclient
import numpy as np

# HTTP client (concurrency > 1 lets async_infer keep several requests in flight)
client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=8)

# Inputs
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
inputs = [httpclient.InferInput("input", input_data.shape, "FP32")]
inputs[0].set_data_from_numpy(input_data)

# Outputs
outputs = [httpclient.InferRequestedOutput("output")]

# Inference request
result = client.infer(
    model_name="resnet50",
    model_version="1",
    inputs=inputs,
    outputs=outputs
)

output = result.as_numpy("output")
print(f"Output shape: {output.shape}")
print(f"Top-5 predictions: {np.argsort(output[0])[-5:][::-1]}")

# Concurrent requests (throughput-oriented).
# The HTTP client's async_infer returns a handle immediately and get_result()
# blocks until the response arrives, so this is a regular function, not a coroutine.
def async_batch_inference(client, batch_inputs):
    tasks = []
    for inp in batch_inputs:
        inputs = [httpclient.InferInput("input", inp.shape, "FP32")]
        inputs[0].set_data_from_numpy(inp)
        task = client.async_infer(
            model_name="resnet50",
            inputs=inputs,
            outputs=[httpclient.InferRequestedOutput("output")]
        )
        tasks.append(task)

    results = [task.get_result() for task in tasks]
    return [r.as_numpy("output") for r in results]

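5. vLLM - High-Throughput LLM Serving

5.1 Introduction to vLLM

vLLM is a high-performance serving library for LLM inference. In the vLLM team's original benchmarks it reached up to 24x the throughput of HuggingFace Transformers; the actual gain depends on the model, hardware, and workload.

Core techniques:

PagedAttention: manages the KV cache in fixed-size pages (blocks) to minimize wasted memory
Continuous batching: admits and retires requests at every generation step instead of waiting for a fixed batch to finish
Optimized CUDA kernels: attention kernels such as FlashAttention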
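
The bookkeeping behind PagedAttention can be sketched as a block allocator: each sequence owns a list of fixed-size blocks and takes a new one only when its last block is full. This is a toy illustration of the idea, not vLLM's implementation; the class `BlockAllocator`, its methods, and the block size of 16 tokens are assumptions made for the example.

```python
from typing import Dict, List


class BlockAllocator:
    """Hands out fixed-size KV-cache blocks, in the spirit of PagedAttention."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}   # sequence id -> physical blocks
        self.lengths: Dict[str, int] = {}              # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, taking a new block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:     # all owned blocks are full
            if not self.free_blocks:
                raise MemoryError("KV cache is full")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("seq-a")      # 20 tokens -> 2 blocks
for _ in range(5):
    alloc.append_token("seq-b")      # 5 tokens -> 1 block
print(alloc.block_tables)
```

Because memory is claimed block by block, a sequence wastes at most one partially filled block, whereas reserving contiguous space for the maximum context length up front would leave most of that reservation unused for short outputs.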

5.2 Installing vLLM and Basic Usage

# Install vLLM (CUDA 12.1 build)
pip install vllm

# For a specific CUDA version
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121

from vllm import LLM, SamplingParams

# Load the model
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,      # number of GPUs
    gpu_memory_utilization=0.9,  # fraction of GPU memory vLLM may use
    max_model_len=4096,          # maximum context length
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=512,
    stop=["</s>", "[INST]"],
)

# Batch inference (several prompts processed together)
prompts = [
    "Python으로 피보나치 수열을 구현해줘",
    "머신러닝과 딥러닝의 차이를 설명해줘",
    "SQL에서 JOIN의 종류를 설명해줘",
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Generated: {output.outputs[0].text}")
    print(f"Tokens: {len(output.outputs[0].token_ids)}")
    print("---")

5.3 OpenAI-Compatible API Server

# Start the vLLM OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 4096 \
  --served-model-name llama3-8b

# Serve a quantized model
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-7B-Chat-GPTQ \
  --quantization gptq \
  --dtype float16 \
  --port 8000

# Using vLLM through the OpenAI client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # any value works unless the server was started with --api-key
)

# Chat completion
response = client.chat.completions.create(
    model="llama3-8b",
    messages=[
        {"role": "system", "content": "당신은 도움이 되는 AI 어시스턴트입니다."},
        {"role": "user", "content": "Python에서 비동기 프로그래밍을 설명해줘"},
    ],
    temperature=0.7,
    max_tokens=1000,
    stream=False,
)

print(response.choices[0].message.content)

# Streaming response
stream = client.chat.completions.create(
    model="llama3-8b",
    messages=[{"role": "user", "content": "한국의 수도는?"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

5.4 Serving Quantized Models with vLLM

from vllm import LLM, SamplingParams

# Each LLM() call reserves GPU memory, so on a single GPU load one of these at a time.

# GPTQ INT4 model
llm_gptq = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    dtype="float16",
    gpu_memory_utilization=0.85,
)

# AWQ INT4 model
llm_awq = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
    dtype="float16",
)

# FP8 quantization (fastest on GPUs with native FP8 support, such as the H100)
llm_fp8 = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",
    dtype="bfloat16",
)

# Throughput benchmark
def benchmark_throughput(llm, prompts, n_iterations=5):
    import time

    sampling_params = SamplingParams(max_tokens=200, temperature=0.8)

    # Warm-up
    llm.generate(prompts[:2], sampling_params)

    total_tokens = 0
    start = time.perf_counter()
    for _ in range(n_iterations):
        outputs = llm.generate(prompts, sampling_params)
        # Count the tokens generated in this iteration (sampling makes it vary per run)
        total_tokens += sum(len(o.outputs[0].token_ids) for o in outputs)
    elapsed = time.perf_counter() - start

    return {
        "tokens_per_second": total_tokens / elapsed,
        "latency_per_batch_ms": elapsed / n_iterations * 1000,
    }

5.5 Serving LoRA Adapters

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Enable LoRA serving
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_lora=True,
    max_lora_rank=64,
    max_loras=4,  # maximum number of LoRA adapters active at once
)

# Run inference with a specific LoRA adapter
sampling_params = SamplingParams(temperature=0.0, max_tokens=200)

outputs = llm.generate(
    "한국 역사에 대해 설명해줘",
    sampling_params=sampling_params,
    lora_request=LoRARequest(
        "korean-lora",    # LoRA name
        1,                # LoRA ID
        "/path/to/korean-lora-adapter"  # adapter path
    )
)

6. Ollama - Local LLM Serving

6.1 Introduction to Ollama

Ollama is a tool for running LLMs locally with little setup. A single terminal command downloads and runs a model.

6.2 Installation and Basic Usage

# Install on macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh

# Download and run a model
ollama run llama3.1

# Other models
ollama run mistral
ollama run codellama
ollama run phi3
ollama run gemma2

# Start the Ollama server (listens on localhost:11434 by default)
ollama serve

# List installed models
ollama list

# Delete a model
ollama rm llama3.1

# Show model details
ollama show llama3.1

6.3 Using the REST API

import requests
import json

# Basic text generation
def ollama_generate(prompt: str, model: str = "llama3.1") -> str:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
        }
    )
    return response.json()["response"]

# Streaming generation (the server sends one JSON object per line)
def ollama_stream(prompt: str, model: str = "llama3.1"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": True,
        },
        stream=True
    )

    for line in response.iter_lines():
        if line:
            data = json.loads(line)
            yield data.get("response", "")
            if data.get("done", False):
                break

# Chat API
def ollama_chat(messages: list, model: str = "llama3.1") -> str:
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": messages,
            "stream": False,
        }
    )
    return response.json()["message"]["content"]

# Usage example
result = ollama_generate("Python에서 데코레이터를 설명해줘")
print(result)

# Streaming
for token in ollama_stream("머신러닝 기초를 설명해줘"):
    print(token, end="", flush=True)

# Chat
messages = [
    {"role": "system", "content": "당신은 파이썬 전문가입니다."},
    {"role": "user", "content": "제너레이터와 이터레이터의 차이는?"},
]
print(ollama_chat(messages))

6.4 Custom Modelfile

# Modelfile: custom system prompt and parameters
FROM llama3.1

# System prompt
SYSTEM """
당신은 한국어로 답변하는 시니어 소프트웨어 엔지니어입니다.
코드 예제를 항상 포함하고, 명확하고 간결하게 답변합니다.
모르는 것은 솔직하게 모른다고 말합니다.
"""

# Generation parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 4096
PARAMETER num_predict 512
PARAMETER stop "<|im_end|>"
PARAMETER stop "Human:"

# Create the custom model
ollama create my-korean-assistant -f Modelfile

# Run it
ollama run my-korean-assistant

6.5 Python Client (ollama package)

import ollama

# Synchronous generation
response = ollama.generate(
    model='llama3.1',
    prompt='FastAPI로 REST API를 만드는 방법을 설명해줘',
    options={
        'temperature': 0.7,
        'num_ctx': 2048,
    }
)
print(response['response'])

# Chat (keeps the conversation history)
messages = []

def chat(user_message: str, model: str = "llama3.1") -> str:
    messages.append({'role': 'user', 'content': user_message})

    response = ollama.chat(
        model=model,
        messages=messages,
    )

    assistant_message = response['message']['content']
    messages.append({'role': 'assistant', 'content': assistant_message})
    return assistant_message

print(chat("안녕하세요! Python을 배우고 싶어요."))
print(chat("어디서 시작하면 좋을까요?"))
print(chat("추천 학습 자료가 있나요?"))

# Generate embeddings
embeddings = ollama.embeddings(
    model='nomic-embed-text',
    prompt='This is a sample text for embedding'
)
print(f"Embedding dimension: {len(embeddings['embedding'])}")

7. Text Generation Inference (TGI)

7.1 Introduction to HuggingFace TGI

HuggingFace TGI (Text Generation Inference) is a high-performance toolkit for serving LLMs in production.