The Complete Guide to AI Model Serving and Inference Optimization: vLLM, TensorRT, Triton, Ollama

Introduction

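Building an AI model and serving it efficiently in production are entirely different problems. Serving a GPT-scale LLM to millions of users with sub-100 ms response times, or running real-time image classification on an edge device, takes considerable optimization expertise.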

This guide covers the core AI model serving tools (vLLM, TensorRT, NVIDIA Triton, and Ollama) and the optimization techniques behind them, with worked examples throughout.

1. Challenges and Goals of AI Inference

1.1 Training vs. Inference

Training and inference have fundamentally different computational requirements.

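Category	Training	Inference
Goal	Optimize model parameters	Generate predictions quickly
Batch size	Large (128 to 2048)	Small, or streaming
Memory	Must store gradients and optimizer state	Activations only
Precision	FP32 or FP16	INT8 and INT4 are possible
Accelerator	A100, H100 (expensive)	T4, L4, RTX (cheaper)
Cost pattern	Large one-time cost	Small ongoing cost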

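1.2 Latency and Throughput

The two most important metrics in inference optimization:

Latency

Response time of a single request
Critical for real-time applications (chatbots, autocomplete)
Measured at the P50, P95, and P99 percentiles
Typical targets: under 100 ms (general), under 50 ms (real-time)

Throughput

Requests processed per unit time (QPS, tokens/second)
Critical for batch processing and offline inference
Trades off against latency: larger batches raise throughput but make each request wait longer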

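# Example: measuring latency and throughput
import time
import numpy as np
from typing import List

def measure_latency(model_fn, inputs: List, n_runs: int = 100):
    """Measure per-request inference latency in milliseconds."""
    latencies = []

    # Warm-up runs (not timed)
    for _ in range(10):
        _ = model_fn(inputs[0])

    # Timed runs; cycle through inputs so n_runs is honored even for short lists.
    # For GPU models, model_fn must synchronize before returning,
    # otherwise only the kernel launch is timed.
    for i in range(n_runs):
        inp = inputs[i % len(inputs)]
        start = time.perf_counter()
        _ = model_fn(inp)
        end = time.perf_counter()
        latencies.append((end - start) * 1000)  # ms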

    latencies = np.array(latencies)
    return {
        "p50_ms": np.percentile(latencies, 50),
        "p95_ms": np.percentile(latencies, 95),
        "p99_ms": np.percentile(latencies, 99),
        "mean_ms": np.mean(latencies),
        "std_ms": np.std(latencies),
    }

def measure_throughput(model_fn, inputs: List, duration_sec: int = 60):
    """Measure inference throughput (requests per second)."""
    count = 0
    start = time.time()

    while time.time() - start < duration_sec:
        _ = model_fn(inputs[count % len(inputs)])
        count += 1

    elapsed = time.time() - start
    return {
        "qps": count / elapsed,
        "total_requests": count,
        "duration_sec": elapsed,
    }
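
The latency/throughput trade-off from section 1.2 can be made concrete with a toy cost model. The sketch below is illustrative only: the function name `batch_stats` and the constants `fixed_ms` and `per_item_ms` are assumptions chosen for the example, not measurements of any real model.

```python
def batch_stats(batch_size: int, fixed_ms: float = 8.0, per_item_ms: float = 0.5) -> dict:
    """Toy cost model: one forward pass costs fixed_ms + per_item_ms * batch_size."""
    latency_ms = fixed_ms + per_item_ms * batch_size
    # Every request in the batch finishes together, so throughput is
    # batch_size requests per forward pass.
    throughput_qps = batch_size / (latency_ms / 1000.0)
    return {"batch_size": batch_size, "latency_ms": latency_ms, "qps": throughput_qps}


for b in (1, 8, 32):
    s = batch_stats(b)
    print(f"batch={s['batch_size']:>2}  latency={s['latency_ms']:.1f} ms  "
          f"throughput={s['qps']:.0f} req/s")
```

Under these assumed constants, growing the batch raises throughput sharply while each request's latency also grows, which is why real-time services cap batch size and offline jobs do not.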

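1.3 Hardware Selection Guide

GPU (NVIDIA):
  A100 80GB: High performance for both training and inference; expensive
  H100 80GB: Faster than the A100 and well suited to LLM inference; more expensive still
  A10G 24GB: Widely used on AWS; mid-range performance
  T4 16GB: Cost-effective and inference-oriented; inexpensive on AWS/GCP
  L4 24GB: Successor to the T4; optimized for inference
  RTX 4090 24GB: Small-scale deployments, local LLMs

CPU:
  Pros: Inexpensive, universally available, large memory capacity
  Cons: Limited parallelism, slow matrix operations
  Use cases: INT8-quantized models, small models, edge

TPU (Google):
  Cloud TPU v4: Large-scale LLM training/serving
  TPU v5e: Inference-optimized version

NPU (edge):
  Apple Neural Engine: Core ML models on iPhone/Mac
  Qualcomm AI Engine: On-device inference on Android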
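
A first-pass way to use the list above is to check whether a model's weights fit in a given GPU's memory. The sketch below is a rough estimate under stated assumptions: the helper names and the 80% `headroom` factor (leaving room for activations and the KV cache) are hypothetical, and memory is counted in decimal gigabytes.

```python
# GPU memory sizes (GB) taken from the hardware list above.
GPU_MEMORY_GB = {"T4": 16, "L4": 24, "A10G": 24, "RTX 4090": 24, "A100": 80, "H100": 80}


def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9


def gpus_that_fit(n_params: float, bits_per_weight: int, headroom: float = 0.8) -> list:
    """GPUs whose memory, scaled by an assumed headroom factor, holds the weights."""
    need = weight_memory_gb(n_params, bits_per_weight)
    return [name for name, gb in GPU_MEMORY_GB.items() if need <= gb * headroom]


print(weight_memory_gb(7e9, 16))   # 7B parameters at FP16
print(gpus_that_fit(7e9, 16))      # candidates at FP16
print(gpus_that_fit(7e9, 4))       # candidates at INT4
```

With these assumptions a 7B-parameter model at FP16 needs about 14 GB for weights and does not fit comfortably on a 16 GB T4, while the same model at INT4 fits on every GPU in the list.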

2. Model Optimization Techniques

2.1 Quantization

Quantization represents a model's weights and activations at lower bit precision, reducing memory use and computation.

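Precision	Memory	Speed	Accuracy loss
FP32 (32-bit)	100%	Baseline	None
FP16 (16-bit)	50%	1.5-2x	Negligible
BF16 (16-bit)	50%	1.5-2x	Negligible
INT8 (8-bit)	25%	2-4x	Minor
INT4 (4-bit)	12.5%	4-8x	Moderate

The speed figures are rough guides; actual gains depend on the hardware and on kernel support for each precision.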
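
To show what "lower bit precision" means numerically, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It illustrates the general idea only; the function names are hypothetical, and production libraries typically use per-channel or per-group scales with calibrated ranges.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map [-max_abs, max_abs] onto [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [x * scale for x in q]


weights = [0.42, -1.27, 0.003, 0.9, -0.55]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)
print(f"scale={scale:.4f}, max round-trip error={max_error:.4f}")
```

Each weight is stored as one signed byte plus a shared scale, which is where the 25% memory figure in the table comes from. The round-trip error is bounded by half the scale, so tensors with a few large outliers lose more precision on their small values.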

Post-Training Quantization (PTQ)

# PyTorch PTQ example
import torch
from torch.quantization import quantize_dynamic, prepare, convert

model = MyModel()
model.load_state_dict(torch.load("model.pth"))
model.eval()

# Dynamic quantization (INT8 weights; activations are quantized on the fly at runtime)
quantized_model = quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8
)

def get_model_size_mb(model):
    import io
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.tell() / (1024 * 1024)

print(f"Original model: {get_model_size_mb(model):.2f} MB")
print(f"Quantized model: {get_model_size_mb(quantized_model):.2f} MB")

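# Static quantization (weights + activations, INT8)
# Note: eager-mode static quantization expects QuantStub/DeQuantStub in the model,
# and calibration_loader is a user-supplied DataLoader of representative inputs.
from torch.quantization import get_default_qconfig

model.qconfig = get_default_qconfig('x86')

prepared_model = prepare(model)

# Calibration: run representative data through the inserted observers
with torch.no_grad():
    for batch in calibration_loader:
        prepared_model(batch)

static_quantized_model = convert(prepared_model)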

GPTQ - LLM Quantization

# INT4 LLM quantization with GPTQ
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"

# The tokenizer is needed to tokenize the calibration dataset
tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(
    bits=4,
    dataset="wikitext2",   # calibration dataset
    tokenizer=tokenizer,
    group_size=128,        # number of weights sharing one scale
    damp_percent=0.01,
)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto"
)

quantized_model.save_pretrained("llama2-7b-gptq-int4")
tokenizer.save_pretrained("llama2-7b-gptq-int4")

AWQ - Activation-aware Weight Quantization

# AWQ quantization (high-quality INT4)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"
quant_path = "llama2-7b-awq"

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)

2.2 Pruning

import torch
import torch.nn.utils.prune as prune

model = MyConvNet()

# Unstructured pruning (L1-norm based, 50% sparsity)
prune.l1_unstructured(
    model.conv1,
    name='weight',
    amount=0.5
)

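# Structured pruning (channel level). torch.nn.utils.prune only masks weights;
# a real speedup requires physically removing the pruned channels afterwards.
prune.ln_structured(
    model.conv1,
    name='weight',
    amount=0.3,
    n=2,
    dim=0  # output-channel dimension
)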

# Global pruning (remove the 20% of weights with the smallest magnitude across the listed layers)
parameters_to_prune = (
    (model.conv1, 'weight'),
    (model.conv2, 'weight'),
    (model.fc1, 'weight'),
)

prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.2,
)

# Make pruning permanent: fold each mask into its weight tensor
for module, name in parameters_to_prune:
    prune.remove(module, name)

def print_sparsity(model):
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Conv2d):
            sparsity = 100. * float(torch.sum(module.weight == 0)) / float(module.weight.nelement())
            print(f"{name}: {sparsity:.1f}% sparsity")
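
The idea behind L1 unstructured pruning can be shown without PyTorch. The sketch below is a plain-Python illustration, not the library's implementation: the function name `l1_unstructured_mask` is hypothetical, and it builds a 0/1 mask that zeroes the requested fraction of weights with the smallest absolute value.

```python
def l1_unstructured_mask(weights, amount):
    """Return a 0/1 mask that zeroes the `amount` fraction of weights with smallest |w|."""
    n_prune = round(amount * len(weights))
    # Indices sorted from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.4]
mask = l1_unstructured_mask(weights, amount=0.5)
pruned_weights = [w * m for w, m in zip(weights, mask)]
sparsity = mask.count(0) / len(mask)

print(mask)
print(f"sparsity: {sparsity:.0%}")
```

The masked tensor keeps its original shape, so a dense kernel does the same amount of work as before. That is why unstructured sparsity alone rarely speeds up inference without sparse-aware kernels or hardware.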

2.3 Knowledge Distillation

import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillationTrainer:
    """Teacher-student knowledge distillation"""

    def __init__(self, teacher, student, temperature=4.0, alpha=0.7):
        self.teacher = teacher
        self.student = student
        self.temperature = temperature
        self.alpha = alpha  # weight of the soft-label loss

        self.teacher.eval()  # inference mode; gradients are disabled in train_step

    def distillation_loss(self, student_logits, teacher_logits, labels):
        """Distillation loss = soft-label loss + hard-label loss"""
        T = self.temperature

        # Soft-label loss (uses the teacher's knowledge)
        soft_targets = F.softmax(teacher_logits / T, dim=1)
        soft_pred = F.log_softmax(student_logits / T, dim=1)
        soft_loss = F.kl_div(soft_pred, soft_targets, reduction='batchmean') * (T ** 2)

        # Hard-label loss (ground truth)
        hard_loss = F.cross_entropy(student_logits, labels)

        total_loss = self.alpha * soft_loss + (1 - self.alpha) * hard_loss
        return total_loss

    def train_step(self, inputs, labels, optimizer):
        optimizer.zero_grad()

        with torch.no_grad():
            teacher_logits = self.teacher(inputs)

        student_logits = self.student(inputs)

        loss = self.distillation_loss(student_logits, teacher_logits, labels)
        loss.backward()
        optimizer.step()

        return loss.item()
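
The role of the temperature in the trainer above is easier to see on concrete numbers. This is a small standalone sketch with made-up logits; the helper `softmax` is written here for illustration and is not part of the trainer class.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


teacher_logits = [6.0, 2.0, 1.0]          # hypothetical teacher output for 3 classes
hard = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=4.0)

print([round(p, 3) for p in hard])
print([round(p, 3) for p in soft])
```

At temperature 1 nearly all probability mass sits on the top class. At temperature 4 the distribution is flatter, so the student also sees how the teacher ranks the non-target classes. Because the softened gradients scale roughly as 1/T^2, the soft-label loss is multiplied by T ** 2 to keep the two loss terms on comparable scales.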

2.4 TorchScript and ONNX Conversion

import torch
import torch.onnx

model = MyModel()
model.eval()

example_input = torch.randn(1, 3, 224, 224)

# TorchScript tracing
traced_model = torch.jit.trace(model, example_input)
traced_model.save("model_traced.pt")

# TorchScript scripting (supports data-dependent control flow)
scripted_model = torch.jit.script(model)
scripted_model.save("model_scripted.pt")

# ONNX export
torch.onnx.export(
    model,
    example_input,
    "model.onnx",
    opset_version=17,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"},
    },
    verbose=False
)

# Validate the ONNX model
import onnx
onnx_model = onnx.load("model.onnx")
onnx.checker.check_model(onnx_model)

# Inference with ONNX Runtime
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession(
    "model.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)

input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
outputs = session.run([output_name], {input_name: input_data})
print(f"Output shape: {outputs[0].shape}")

3. TensorRT

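3.1 TensorRT Overview

TensorRT is NVIDIA's SDK for optimizing deep learning inference. It applies the following optimizations automatically:

Layer fusion: merges Conv+BN+ReLU into a single operation
Kernel auto-tuning: selects the CUDA kernels best suited to the target GPU architecture
Reduced precision (FP16/INT8): INT8 calibration minimizes the accuracy loss from lowering precision
Memory reuse: optimal tensor memory allocation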
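
Layer fusion rests on a simple arithmetic identity: at inference time, batch normalization is a fixed affine transform, so it can be folded into the preceding layer's weight and bias. The sketch below shows this for a scalar "convolution" in plain Python. It illustrates the identity only, not TensorRT's internal implementation, and the function names and parameter values are made up for the example.

```python
import math


def conv_bn(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Unfused: scalar 'conv' (w * x + b) followed by batch norm with fixed statistics."""
    y = w * x + b
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm parameters into the weight and bias of the preceding layer."""
    s = gamma / math.sqrt(var + eps)
    return w * s, (b - mean) * s + beta


params = dict(w=0.5, b=0.1, gamma=1.2, beta=-0.3, mean=0.4, var=0.25)
w_f, b_f = fold_bn(**params)

x = 2.0
unfused = max(0.0, conv_bn(x, **params))   # Conv -> BN -> ReLU as three steps
fused = max(0.0, w_f * x + b_f)            # one multiply-add, then ReLU

print(unfused, fused)
```

The fused form computes the same result with one multiply-add instead of a convolution followed by a separate normalization pass, which saves both arithmetic and a round trip through memory.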

3.2 Converting to TensorRT with the Python API

import tensorrt as trt
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine_from_onnx(onnx_path: str, precision: str = "fp16") -> trt.ICudaEngine:
    """Convert an ONNX model into a TensorRT engine"""

    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network(
             1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
         ) as network, \
         trt.OnnxParser(network, TRT_LOGGER) as parser:

        config = builder.create_builder_config()
        config.set_memory_pool_limit(
            trt.MemoryPoolType.WORKSPACE,
            4 * 1024 * 1024 * 1024  # 4GB
        )

        if precision == "fp16":
            config.set_flag(trt.BuilderFlag.FP16)
        elif precision == "int8":
            config.set_flag(trt.BuilderFlag.INT8)
            # MyCalibrator is a user-defined calibrator class (not shown here)
            config.int8_calibrator = MyCalibrator()

        with open(onnx_path, 'rb') as model:
            if not parser.parse(model.read()):
                for error in range(parser.num_errors):
                    print(f"ONNX parse error: {parser.get_error(error)}")
                raise ValueError("ONNX parsing failed")

        # Dynamic input shape (variable batch size)
        profile = builder.create_optimization_profile()
        profile.set_shape(
            "input",
            min=(1, 3, 224, 224),
            opt=(8, 3, 224, 224),
            max=(32, 3, 224, 224)
        )
        config.add_optimization_profile(profile)

        serialized_engine = builder.build_serialized_network(network, config)
        runtime = trt.Runtime(TRT_LOGGER)
        return runtime.deserialize_cuda_engine(serialized_engine)

class TRTInferenceEngine:
    """TensorRT inference engine wrapper (uses the binding-index API of TensorRT 8.x)"""

    def __init__(self, engine_path: str, max_batch_size: int = 32):
        runtime = trt.Runtime(TRT_LOGGER)
        with open(engine_path, 'rb') as f:
            self.engine = runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()

        self.inputs = []
        self.outputs = []
        self.bindings = []

        for binding in self.engine:
            # A dynamic batch dimension is reported as -1; allocate for the largest batch
            shape = [max_batch_size if d == -1 else d
                     for d in self.engine.get_binding_shape(binding)]
            size = trt.volume(shape)
            dtype = trt.nptype(self.engine.get_binding_dtype(binding))
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            self.bindings.append(int(device_mem))

            if self.engine.binding_is_input(binding):
                self.inputs.append({'host': host_mem, 'device': device_mem})
            else:
                self.outputs.append({'host': host_mem, 'device': device_mem})

    def infer(self, input_data: np.ndarray) -> np.ndarray:
        # Tell the context the actual input shape (assumes the input is binding 0)
        self.context.set_binding_shape(0, input_data.shape)
        # Assumes the single output is binding 1
        out_shape = tuple(self.context.get_binding_shape(1))

        np.copyto(self.inputs[0]['host'][:input_data.size], input_data.ravel())
        cuda.memcpy_htod(self.inputs[0]['device'], self.inputs[0]['host'])
        self.context.execute_v2(bindings=self.bindings)
        cuda.memcpy_dtoh(self.outputs[0]['host'], self.outputs[0]['device'])
        n_out = int(np.prod(out_shape))
        return self.outputs[0]['host'][:n_out].reshape(out_shape).copy()

3.3 Torch-TensorRT

import torch
import torch_tensorrt

model = MyResNet50()
model.eval()
model.cuda()

traced_model = torch.jit.trace(model, torch.randn(1, 3, 224, 224).cuda())

trt_model = torch_tensorrt.compile(
    traced_model,
    inputs=[
        torch_tensorrt.Input(
            min_shape=[1, 3, 224, 224],
            opt_shape=[8, 3, 224, 224],
            max_shape=[32, 3, 224, 224],
            dtype=torch.float32
        )
    ],
    enabled_precisions={torch.float16},
    workspace_size=4 * 1024 * 1024 * 1024,
)

torch.jit.save(trt_model, "model_trt.ts")
loaded_model = torch.jit.load("model_trt.ts")

# Speed comparison. CUDA calls are asynchronous, so synchronize before reading the clock.
import time
input_tensor = torch.randn(8, 3, 224, 224).cuda()

def time_model(m, n_iters=100, n_warmup=10):
    """Average time per forward pass in milliseconds."""
    with torch.no_grad():
        for _ in range(n_warmup):
            _ = m(input_tensor)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iters):
            _ = m(input_tensor)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters * 1000

pytorch_time = time_model(model)
trt_time = time_model(loaded_model)

print(f"PyTorch: {pytorch_time:.2f}ms, TensorRT: {trt_time:.2f}ms")
print(f"Speedup: {pytorch_time / trt_time:.2f}x")

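4. NVIDIA Triton Inference Server

4.1 Triton Overview

NVIDIA Triton Inference Server is an open-source inference server for serving models from many ML frameworks in production.

Key features:

Multi-framework support (TensorRT, ONNX, PyTorch, TensorFlow, Python)
Dynamic batching
Concurrent model execution
Efficient use of GPU/CPU resources
Model ensemble pipelines
gRPC and HTTP REST APIs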

4.2 Model Repository Structure

model_repository/
├── resnet50/
│   ├── config.pbtxt
│   ├── 1/
│   │   └── model.plan          # TensorRT engine
│   └── 2/
│       └── model.plan
├── bert_onnx/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx
└── custom_model/
    ├── config.pbtxt
    └── 1/
        └── model.py

4.3 Configuration File (config.pbtxt)

# model_repository/resnet50/config.pbtxt
name: "resnet50"
platform: "tensorrt_plan"
max_batch_size: 32

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [3, 224, 224]
  }
]

output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [1000]
  }
]

dynamic_batching {
  preferred_batch_size: [4, 8, 16, 32]
  max_queue_delay_microseconds: 5000  # wait up to 5 ms to fill a batch
}

instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [0]
  }
]

# model_repository/bert_onnx/config.pbtxt
name: "bert_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 8

input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [128]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [128]
  }
]

output [
  {
    name: "last_hidden_state"
    data_type: TYPE_FP32
    dims: [128, 768]
  }
]

dynamic_batching {
  max_queue_delay_microseconds: 10000
}

optimization {
  execution_accelerators {
    gpu_execution_accelerator: [
      {
        name: "tensorrt"
        parameters { key: "precision_mode" value: "FP16" }
        parameters { key: "max_workspace_size_bytes" value: "1073741824" }
      }
    ]
  }
}
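
The effect of the `dynamic_batching` settings in the configs above can be pictured with a toy simulation. This is a simplified sketch, not Triton's scheduler (which also accounts for preferred batch sizes and instance availability). The function `form_batches` and the arrival times are hypothetical.

```python
def form_batches(arrival_us, max_batch_size, max_queue_delay_us):
    """Group sorted arrival times (microseconds) into batches.

    A batch is closed when it is full, or when the next request arrives
    after the oldest queued request has already waited max_queue_delay_us.
    """
    batches = []
    current = []
    for t in arrival_us:
        if current and t - current[0] > max_queue_delay_us:
            batches.append(current)   # delay budget exhausted: dispatch what we have
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            batches.append(current)   # batch is full: dispatch immediately
            current = []
    if current:
        batches.append(current)
    return batches


arrivals = [0, 1000, 2000, 9000, 9500, 20000]
batches = form_batches(arrivals, max_batch_size=4, max_queue_delay_us=5000)
print(batches)
```

A longer queue delay yields fuller batches and higher GPU utilization, at the cost of extra waiting for the first request in each batch. This is the same latency/throughput trade-off described in section 1.2.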

4.4 Python Backend Model

# model_repository/custom_model/1/model.py
import numpy as np
import json
import triton_python_backend_utils as pb_utils
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class TritonPythonModel:
    def initialize(self, args):
        """Called once when the model is loaded"""
        self.device = 'cuda' if args['model_instance_kind'] == 'GPU' else 'cpu'

        model_name = "distilbert-base-uncased-finetuned-sst-2-english"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model.to(self.device)
        self.model.eval()

    def execute(self, requests):
        """Run inference for a batch of requests"""
        responses = []

        for request in requests:
            input_text = pb_utils.get_input_tensor_by_name(request, "TEXT")
            texts = input_text.as_numpy().tolist()
            texts = [t[0].decode('utf-8') for t in texts]

            inputs = self.tokenizer(
                texts,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=128
            ).to(self.device)

            with torch.no_grad():
                outputs = self.model(**inputs)
                probs = torch.softmax(outputs.logits, dim=1).cpu().numpy()

            output_tensor = pb_utils.Tensor("PROBABILITIES", probs.astype(np.float32))
            response = pb_utils.InferenceResponse(output_tensors=[output_tensor])
            responses.append(response)

        return responses

    def finalize(self):
        del self.model
        torch.cuda.empty_cache()

4.5 Deploying Triton with Docker

# Start the Triton server
docker run --gpus all \
  -p 8000:8000 \
  -p 8001:8001 \
  -p 8002:8002 \
  -v /path/to/model_repository:/models \
  --shm-size=1g \
  nvcr.io/nvidia/tritonserver:24.02-py3 \
  tritonserver \
  --model-repository=/models \
  --log-verbose=1

# Check readiness
curl http://localhost:8000/v2/health/ready

# Query model metadata
curl http://localhost:8000/v2/models/resnet50

# Python client for Triton
import tritonclient.http as httpclient
import numpy as np

client = httpclient.InferenceServerClient(url="localhost:8000")

input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
inputs = [httpclient.InferInput("input", input_data.shape, "FP32")]
inputs[0].set_data_from_numpy(input_data)

outputs = [httpclient.InferRequestedOutput("output")]

result = client.infer(
    model_name="resnet50",
    model_version="1",
    inputs=inputs,
    outputs=outputs
)

output = result.as_numpy("output")
print(f"Output shape: {output.shape}")
print(f"Top-5 predictions: {np.argsort(output[0])[-5:][::-1]}")

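5. vLLM - Fast LLM Serving

5.1 vLLM Overview

vLLM is a high-performance serving library for LLM inference. Its authors report up to 24x higher throughput than standard HuggingFace Transformers; the actual gain depends on the model and workload.

Core techniques:

PagedAttention: manages the KV cache in fixed-size pages (blocks) to minimize wasted memory
Continuous batching: schedules requests dynamically at each generation step instead of using fixed batches
Optimized CUDA kernels: optimized attention kernels, including FlashAttention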
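
The bookkeeping behind PagedAttention resembles virtual-memory paging: each sequence has a block table mapping its logical KV positions to physical blocks, and blocks are allocated only as tokens arrive. The sketch below is a toy illustration of that idea, not vLLM's code. The class `BlockAllocator` and its method names are hypothetical.

```python
class BlockAllocator:
    """Toy KV-cache allocator: fixed-size blocks handed out on demand."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # physical block ids not in use
        self.tables = {}                      # seq_id -> list of physical block ids
        self.lengths = {}                     # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % self.block_size == 0:          # no block yet, or the last one is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop(0))
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=4, block_size=16)
for _ in range(20):
    alloc.append_token("a")   # 20 tokens -> 2 blocks
for _ in range(5):
    alloc.append_token("b")   # 5 tokens -> 1 block

print(alloc.tables, alloc.free)
```

Because memory is reserved one block at a time rather than for the maximum sequence length up front, the only waste is the unfilled tail of each sequence's last block, and freed blocks can be reused immediately by other requests.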

5.2 Installing vLLM and Basic Usage

# Install vLLM (CUDA 12.1)
pip install vllm

# For a specific CUDA version
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121

from vllm import LLM, SamplingParams

# Load the model
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9,
    max_model_len=4096,
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=512,
    stop=["</s>", "[INST]"],
)

# Batch inference (process multiple prompts at once)
prompts = [
    "Implement a fibonacci sequence in Python",
    "Explain the difference between machine learning and deep learning",
    "Describe the types of SQL JOINs",
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Generated: {output.outputs[0].text}")
    print(f"Tokens: {len(output.outputs[0].token_ids)}")
    print("---")
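
The `gpu_memory_utilization` and `max_model_len` settings above matter because the KV cache grows linearly with sequence length. The sketch below estimates that cost. The architecture numbers (32 layers, 8 KV heads, head dimension 128) are assumed values for Llama 3.1 8B and should be checked against the model's config; the function name is hypothetical.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token: keys and values for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Assumed architecture for an 8B Llama-style model with grouped-query attention,
# storing the cache in 16-bit precision (2 bytes per value).
per_token = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
per_sequence_mib = per_token * 4096 / 2**20   # one sequence at max_model_len=4096

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{per_sequence_mib:.0f} MiB per 4096-token sequence")
```

Under these assumptions each full-length sequence needs about 512 MiB of cache, so the memory left after loading the weights directly limits how many sequences can be batched concurrently.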

5.3 OpenAI-Compatible API Server

# Start the vLLM OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 4096 \
  --served-model-name llama3-8b

# Serve a quantized model
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-7B-Chat-GPTQ \
  --quantization gptq \
  --dtype float16 \
  --port 8000

# Use vLLM through the OpenAI client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Chat completion
response = client.chat.completions.create(
    model="llama3-8b",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Explain asynchronous programming in Python"},
    ],
    temperature=0.7,
    max_tokens=1000,
    stream=False,
)

print(response.choices[0].message.content)

# Streaming response
stream = client.chat.completions.create(
    model="llama3-8b",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

5.4 Quantized Serving with vLLM

from vllm import LLM, SamplingParams

# Load one of the following at a time; each LLM instance reserves GPU memory.
# GPTQ INT4 quantized model
llm_gptq = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    dtype="float16",
    gpu_memory_utilization=0.85,
)

# AWQ INT4 quantized model
llm_awq = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
    dtype="float16",
)

# FP8 quantization (performs best on GPUs with hardware FP8 support, such as the H100)
llm_fp8 = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",
    dtype="bfloat16",
)

def benchmark_throughput(llm, prompts, n_iterations=5):
    import time
    sampling_params = SamplingParams(max_tokens=200, temperature=0.8)

    llm.generate(prompts[:2], sampling_params)  # warm-up

    total_tokens = 0
    start = time.time()
    for _ in range(n_iterations):
        outputs = llm.generate(prompts, sampling_params)
        # Count generated tokens from every iteration, not only the last one
        total_tokens += sum(len(o.outputs[0].token_ids) for o in outputs)
    elapsed = time.time() - start

    return {
        "tokens_per_second": total_tokens / elapsed,
        "latency_per_batch_ms": elapsed / n_iterations * 1000,
    }

5.5 Serving LoRA Adapters

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_lora=True,
    max_lora_rank=64,
    max_loras=4,
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=200)

outputs = llm.generate(
    "Explain the history of the Roman Empire",
    sampling_params=sampling_params,
    lora_request=LoRARequest(
        "history-lora",
        1,
        "/path/to/history-lora-adapter"
    )
)

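6. Ollama - Local LLM Serving

6.1 Ollama Overview

Ollama is a tool for running LLMs locally with minimal effort. A single terminal command runs a wide range of LLMs, with no complex setup.

6.2 Installation and Basic Usage

# Install on Linux (on macOS, download the app from ollama.com instead)
curl -fsSL https://ollama.com/install.sh | sh

# Download and run a model
ollama run llama3.1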

# Other popular models
ollama run mistral
ollama run codellama
ollama run phi3
ollama run gemma2

# Start the Ollama server manually (if it is not already running as a service)
ollama serve

# List installed models
ollama list

# Remove a model
ollama rm llama3.1

# Show model information
ollama show llama3.1

6.3 Using the REST API

import requests
import json

def ollama_generate(prompt: str, model: str = "llama3.1") -> str:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
        }
    )
    return response.json()["response"]

def ollama_stream(prompt: str, model: str = "llama3.1"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": True,
        },
        stream=True
    )

    for line in response.iter_lines():
        if line:
            data = json.loads(line)
            yield data.get("response", "")
            if data.get("done", False):
                break

def ollama_chat(messages: list, model: str = "llama3.1") -> str:
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": messages,
            "stream": False,
        }
    )
    return response.json()["message"]["content"]

# Usage examples
result = ollama_generate("Explain decorators in Python")
print(result)

for token in ollama_stream("Explain machine learning basics"):
    print(token, end="", flush=True)

messages = [
    {"role": "system", "content": "You are a Python expert."},
    {"role": "user", "content": "What is the difference between generators and iterators?"},
]
print(ollama_chat(messages))
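
The streaming endpoint returns one JSON object per line, which is what the loop in `ollama_stream` above parses. The sketch below exercises the same parsing logic offline. The `fake_lines` are hand-written to mimic the shape of the stream, not captured server output, and `collect_stream` is a hypothetical helper.

```python
import json


def collect_stream(lines):
    """Join the 'response' fragments of a newline-delimited JSON stream."""
    parts = []
    for raw in lines:
        if not raw:               # skip keep-alive blank lines
            continue
        data = json.loads(raw)
        parts.append(data.get("response", ""))
        if data.get("done", False):
            break
    return "".join(parts)


# Hand-written stand-in for response.iter_lines()
fake_lines = [
    b'{"model": "llama3.1", "response": "Hel", "done": false}',
    b'',
    b'{"model": "llama3.1", "response": "lo", "done": false}',
    b'{"model": "llama3.1", "response": "", "done": true}',
]

text = collect_stream(fake_lines)
print(text)
```

Separating the parsing from the HTTP call this way makes the streaming logic testable without a running Ollama server.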

6.4 Custom Modelfile

# Modelfile - custom system prompt and settings
FROM llama3.1

SYSTEM """
You are a senior software engineer who provides clear, concise answers.
Always include code examples and be honest when you don't know something.
"""

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 4096
PARAMETER num_predict 512
PARAMETER stop "<|eot_id|>"

# Create the custom model
ollama create my-coding-assistant -f Modelfile

# Run it
ollama run my-coding-assistant

6.5 Python Client (ollama package)

import ollama

# Synchronous generation
response = ollama.generate(
    model='llama3.1',
    prompt='Explain how to build a REST API with FastAPI',
    options={
        'temperature': 0.7,
        'num_ctx': 2048,
    }
)
print(response['response'])

# Chat with conversation history
messages = []

def chat(user_message: str, model: str = "llama3.1") -> str:
    messages.append({'role': 'user', 'content': user_message})

    response = ollama.chat(model=model, messages=messages)

    assistant_message = response['message']['content']
    messages.append({'role': 'assistant', 'content': assistant_message})
    return assistant_message

print(chat("I want to learn Python. Where should I start?"))
print(chat("What are the best learning resources?"))
print(chat("How long will it take to learn?"))

# Generate embeddings
embeddings = ollama.embeddings(
    model='nomic-embed-text',
    prompt='This is a sample text for embedding'
)
print(f"Embedding dimension: {len(embeddings['embedding'])}")
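
Embedding vectors like the one returned above are usually compared with cosine similarity. The sketch below is a minimal plain-Python version. The short vectors are made-up stand-ins for real embeddings, and `cosine_similarity` is a helper written for this example rather than part of the ollama package.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional stand-ins for embedding vectors
query = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]
orthogonal = [3.0, 0.0, -1.0]

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, orthogonal))
```

In a retrieval setting, the document whose embedding scores highest against the query embedding is treated as the most semantically similar.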

7. Text Generation Inference (TGI)

7.1 HuggingFace TGI Overview

HuggingFace TGI (Text Generation Inference) is a high-performance toolkit for serving LLMs in production.