AI Model Serving and Inference Optimization Complete Guide: vLLM, TensorRT, Triton, Ollama
By Youngju Kim (@fjvbn20031)
Introduction
Developing an AI model and serving it efficiently in production are entirely different challenges. Serving a GPT-scale LLM to millions of users with sub-100ms response times, or handling real-time image classification on edge devices, requires substantial optimization expertise.
This guide gives you complete mastery of the core AI model serving tools (vLLM, TensorRT, NVIDIA Triton, Ollama) and optimization techniques through real-world examples.
1. The Challenges and Goals of AI Inference
1.1 Training vs Inference
Training and inference have fundamentally different computing requirements.
| Category | Training | Inference |
|---|---|---|
| Goal | Optimize model parameters | Fast prediction generation |
| Batch size | Large (128–2048) | Small or streaming |
| Memory | Gradient storage required | Only activations needed |
| Precision | FP32 or FP16 | INT8, INT4 possible |
| Accelerator | A100, H100 (expensive) | T4, L4, RTX (cheaper) |
| Cost pattern | One-time large cost | Ongoing small cost |
1.2 Latency vs Throughput
The two most important metrics in inference optimization:
Latency
- Response time for a single request
- Critical for real-time applications (chatbots, autocomplete)
- Measured by P50, P95, P99 percentiles
- Target: below 100ms (general), below 50ms (real-time)
Throughput
- Number of requests handled per unit time (QPS, Tokens/second)
- Critical for batch processing, offline inference
- Trade-off relationship with latency
```python
# Latency vs throughput measurement example
import time
import numpy as np
from typing import List

def measure_latency(model_fn, inputs: List, n_runs: int = 100):
    """Measure inference latency"""
    latencies = []
    # Warm-up
    for _ in range(10):
        _ = model_fn(inputs[0])
    # Measure
    for inp in inputs[:n_runs]:
        start = time.perf_counter()
        _ = model_fn(inp)
        end = time.perf_counter()
        latencies.append((end - start) * 1000)  # ms
    latencies = np.array(latencies)
    return {
        "p50_ms": np.percentile(latencies, 50),
        "p95_ms": np.percentile(latencies, 95),
        "p99_ms": np.percentile(latencies, 99),
        "mean_ms": np.mean(latencies),
        "std_ms": np.std(latencies),
    }

def measure_throughput(model_fn, inputs: List, duration_sec: int = 60):
    """Measure inference throughput"""
    count = 0
    start = time.time()
    while time.time() - start < duration_sec:
        _ = model_fn(inputs[count % len(inputs)])
        count += 1
    elapsed = time.time() - start
    return {
        "qps": count / elapsed,
        "total_requests": count,
        "duration_sec": elapsed,
    }
```
1.3 Hardware Selection Guide
GPU (NVIDIA):
- A100 80GB: Best performance, suited to both training and inference, expensive
- H100 80GB: Current top of the line, strongest for LLM inference
- A10G 24GB: Widely used on AWS, mid-range performance
- T4 16GB: Cost-efficient, inference-only, cheap on AWS/GCP
- L4 24GB: T4 successor, inference-optimized
- RTX 4090 24GB: Small-scale deployments, local LLMs

CPU:
- Pros: Cheap, universally available, large memory
- Cons: Limited parallelism, slow matrix ops
- Use: INT8-quantized models, small models, edge

TPU (Google):
- Cloud TPU v4: Large-scale LLM training/serving
- TPU v5e: Inference-optimized version

NPU (Edge):
- Apple Neural Engine: Core ML models on iPhone/Mac
- Qualcomm AI Engine: On-device Android inference
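A quick back-of-the-envelope memory estimate helps match a model to a GPU from the list above. The helper below is a rough sketch (the 1.2x overhead factor for KV cache, activations, and CUDA context is an assumption, not a measured constant):

```python
def estimate_serving_memory_gb(n_params_billion: float,
                               bytes_per_param: float,
                               overhead: float = 1.2) -> float:
    """Rough GPU memory needed to serve a model: weight bytes times an
    overhead factor for KV cache, activations, and CUDA context."""
    weights_gb = n_params_billion * bytes_per_param
    return weights_gb * overhead

# A 7B model in FP16 holds ~14 GB of weights, so a 16 GB T4 is tight
# while a 24 GB L4/A10G is comfortable; INT4 fits far smaller cards.
for name, bytes_pp in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"7B @ {name}: ~{estimate_serving_memory_gb(7, bytes_pp):.1f} GB")
```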
2. Model Optimization Techniques
2.1 Quantization
Quantization reduces memory and computation by representing model weights and activations in lower bit precision.
| Precision | Memory | Speed | Accuracy loss |
|---|---|---|---|
| FP32 (32-bit) | 100% | baseline | none |
| FP16 (16-bit) | 50% | 1.5-2x | negligible |
| BF16 (16-bit) | 50% | 1.5-2x | negligible |
| INT8 (8-bit) | 25% | 2-4x | minor |
| INT4 (4-bit) | 12.5% | 4-8x | moderate |
Post-Training Quantization (PTQ)
```python
# PyTorch PTQ example
import torch
from torch.quantization import quantize_dynamic, prepare, convert

model = MyModel()  # your trained FP32 model
model.load_state_dict(torch.load("model.pth"))
model.eval()

# Dynamic quantization (weights only, INT8)
quantized_model = quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8
)

def get_model_size_mb(model):
    import io
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.tell() / (1024 * 1024)

print(f"Original model: {get_model_size_mb(model):.2f} MB")
print(f"Quantized model: {get_model_size_mb(quantized_model):.2f} MB")

# Static quantization (weights + activations, INT8)
from torch.quantization import get_default_qconfig

model.qconfig = get_default_qconfig('x86')
prepared_model = prepare(model)

# Calibrate with representative data (calibration_loader: your DataLoader)
with torch.no_grad():
    for batch in calibration_loader:
        prepared_model(batch)

static_quantized_model = convert(prepared_model)
```
GPTQ - LLM Quantization
```python
# INT4 LLM quantization using GPTQ
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(
    bits=4,
    dataset="wikitext2",
    tokenizer=tokenizer,
    group_size=128,
    damp_percent=0.01,
)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto"
)

quantized_model.save_pretrained("llama2-7b-gptq-int4")
tokenizer.save_pretrained("llama2-7b-gptq-int4")
```
AWQ - Activation-Aware Quantization
```python
# AWQ quantization (higher-quality INT4)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"
quant_path = "llama2-7b-awq"

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
```
2.2 Pruning
```python
import torch
import torch.nn.utils.prune as prune

model = MyConvNet()  # your trained model

# Unstructured pruning (L1 norm-based, 50% sparsity)
prune.l1_unstructured(
    model.conv1,
    name='weight',
    amount=0.5
)

# Structured pruning (channel-level - actually accelerates inference)
prune.ln_structured(
    model.conv1,
    name='weight',
    amount=0.3,
    n=2,
    dim=0  # output channel dimension
)

# Global pruning (remove bottom 20% of weights across the entire model)
parameters_to_prune = (
    (model.conv1, 'weight'),
    (model.conv2, 'weight'),
    (model.fc1, 'weight'),
)
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.2,
)

# Make pruning permanent
prune.remove(model.conv1, 'weight')

def print_sparsity(model):
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Conv2d):
            sparsity = 100. * float(torch.sum(module.weight == 0)) / float(module.weight.nelement())
            print(f"{name}: {sparsity:.1f}% sparsity")
```
2.3 Knowledge Distillation
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillationTrainer:
    """Teacher-Student Knowledge Distillation"""
    def __init__(self, teacher, student, temperature=4.0, alpha=0.7):
        self.teacher = teacher
        self.student = student
        self.temperature = temperature
        self.alpha = alpha  # Soft label weight
        self.teacher.eval()  # Freeze teacher

    def distillation_loss(self, student_logits, teacher_logits, labels):
        """Distillation loss = soft label loss + hard label loss"""
        T = self.temperature
        # Soft label loss (uses teacher's knowledge)
        soft_targets = F.softmax(teacher_logits / T, dim=1)
        soft_pred = F.log_softmax(student_logits / T, dim=1)
        soft_loss = F.kl_div(soft_pred, soft_targets, reduction='batchmean') * (T ** 2)
        # Hard label loss (ground truth)
        hard_loss = F.cross_entropy(student_logits, labels)
        total_loss = self.alpha * soft_loss + (1 - self.alpha) * hard_loss
        return total_loss

    def train_step(self, inputs, labels, optimizer):
        optimizer.zero_grad()
        with torch.no_grad():
            teacher_logits = self.teacher(inputs)
        student_logits = self.student(inputs)
        loss = self.distillation_loss(student_logits, teacher_logits, labels)
        loss.backward()
        optimizer.step()
        return loss.item()
```
2.4 TorchScript and ONNX Conversion
```python
import torch
import torch.onnx

model = MyModel()
model.eval()
example_input = torch.randn(1, 3, 224, 224)

# TorchScript tracing
traced_model = torch.jit.trace(model, example_input)
traced_model.save("model_traced.pt")

# TorchScript scripting (supports dynamic control flow)
scripted_model = torch.jit.script(model)
scripted_model.save("model_scripted.pt")

# ONNX export
torch.onnx.export(
    model,
    example_input,
    "model.onnx",
    opset_version=17,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"},
    },
    verbose=False
)

# Validate ONNX model
import onnx

onnx_model = onnx.load("model.onnx")
onnx.checker.check_model(onnx_model)

# ONNX Runtime inference
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession(
    "model.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
outputs = session.run([output_name], {input_name: input_data})
print(f"Output shape: {outputs[0].shape}")
```
3. TensorRT
3.1 Introduction to TensorRT
TensorRT is NVIDIA's deep learning inference optimization SDK. It automatically performs these optimizations:
- Layer Fusion: Merges Conv+BN+ReLU into a single operation
- Kernel Auto-Selection: Chooses optimized CUDA kernels for the GPU architecture
- FP16/INT8 Calibration: Minimizes accuracy loss during precision reduction
- Memory Reuse: Optimal tensor memory allocation
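Layer fusion is the easiest of these optimizations to see concretely: because a BatchNorm in eval mode is an affine transform, it can be folded into the preceding convolution's weights and bias, removing one op entirely. A minimal sketch of that folding arithmetic (the `fuse_conv_bn` helper is illustrative, not TensorRT's actual implementation):

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold an eval-mode BatchNorm into the preceding Conv:
    W' = W * gamma/std, b' = (b - mean) * gamma/std + beta."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, bias=True)
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    b = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = (b - bn.running_mean) * scale + bn.bias
    return fused
```

The fused module produces the same outputs as `bn(conv(x))` in eval mode while doing one pass over the data instead of two; TensorRT applies the same idea across Conv+BN+ReLU and many other patterns automatically.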
3.2 TensorRT Conversion via Python API
```python
import tensorrt as trt
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine_from_onnx(onnx_path: str, precision: str = "fp16") -> trt.ICudaEngine:
    """Convert ONNX model to TensorRT engine"""
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network(
             1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
         ) as network, \
         trt.OnnxParser(network, TRT_LOGGER) as parser:

        config = builder.create_builder_config()
        config.set_memory_pool_limit(
            trt.MemoryPoolType.WORKSPACE,
            4 * 1024 * 1024 * 1024  # 4GB
        )

        if precision == "fp16":
            config.set_flag(trt.BuilderFlag.FP16)
        elif precision == "int8":
            config.set_flag(trt.BuilderFlag.INT8)
            config.int8_calibrator = MyCalibrator()  # user-defined calibrator

        with open(onnx_path, 'rb') as model:
            if not parser.parse(model.read()):
                for error in range(parser.num_errors):
                    print(f"ONNX parse error: {parser.get_error(error)}")
                raise ValueError("ONNX parsing failed")

        # Dynamic input shape (variable batch size)
        profile = builder.create_optimization_profile()
        profile.set_shape(
            "input",
            min=(1, 3, 224, 224),
            opt=(8, 3, 224, 224),
            max=(32, 3, 224, 224)
        )
        config.add_optimization_profile(profile)

        serialized_engine = builder.build_serialized_network(network, config)
        runtime = trt.Runtime(TRT_LOGGER)
        return runtime.deserialize_cuda_engine(serialized_engine)

class TRTInferenceEngine:
    """TensorRT inference engine wrapper"""
    def __init__(self, engine_path: str):
        runtime = trt.Runtime(TRT_LOGGER)
        with open(engine_path, 'rb') as f:
            self.engine = runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()
        self.inputs = []
        self.outputs = []
        self.bindings = []
        for binding in self.engine:
            size = trt.volume(self.engine.get_binding_shape(binding))
            dtype = trt.nptype(self.engine.get_binding_dtype(binding))
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            self.bindings.append(int(device_mem))
            if self.engine.binding_is_input(binding):
                self.inputs.append({'host': host_mem, 'device': device_mem})
            else:
                self.outputs.append({'host': host_mem, 'device': device_mem})

    def infer(self, input_data: np.ndarray) -> np.ndarray:
        np.copyto(self.inputs[0]['host'], input_data.ravel())
        cuda.memcpy_htod(self.inputs[0]['device'], self.inputs[0]['host'])
        self.context.execute_v2(bindings=self.bindings)
        cuda.memcpy_dtoh(self.outputs[0]['host'], self.outputs[0]['device'])
        return self.outputs[0]['host'].copy()
```
3.3 Torch-TensorRT
```python
import torch
import torch_tensorrt

model = MyResNet50()
model.eval()
model.cuda()

traced_model = torch.jit.trace(model, torch.randn(1, 3, 224, 224).cuda())

trt_model = torch_tensorrt.compile(
    traced_model,
    inputs=[
        torch_tensorrt.Input(
            min_shape=[1, 3, 224, 224],
            opt_shape=[8, 3, 224, 224],
            max_shape=[32, 3, 224, 224],
            dtype=torch.float32
        )
    ],
    enabled_precisions={torch.float16},
    workspace_size=4 * 1024 * 1024 * 1024,
)

torch.jit.save(trt_model, "model_trt.ts")
loaded_model = torch.jit.load("model_trt.ts")

# Speed comparison (synchronize around timing so GPU work is fully counted)
import time

input_tensor = torch.randn(8, 3, 224, 224).cuda()

with torch.no_grad():
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        _ = model(input_tensor)
    torch.cuda.synchronize()
    pytorch_time = (time.perf_counter() - start) / 100 * 1000

    start = time.perf_counter()
    for _ in range(100):
        _ = loaded_model(input_tensor)
    torch.cuda.synchronize()
    trt_time = (time.perf_counter() - start) / 100 * 1000

print(f"PyTorch: {pytorch_time:.2f}ms, TensorRT: {trt_time:.2f}ms")
print(f"Speedup: {pytorch_time / trt_time:.2f}x")
```
4. NVIDIA Triton Inference Server
4.1 Introduction to Triton
NVIDIA Triton Inference Server is an open-source inference server that serves models from various ML frameworks in production.
Key features:
- Multi-framework support (TensorRT, ONNX, PyTorch, TensorFlow, Python)
- Dynamic Batching
- Concurrent Model Execution
- Efficient GPU/CPU resource utilization
- Model ensemble pipelines
- gRPC and HTTP REST API
4.2 Model Repository Structure
```text
model_repository/
├── resnet50/
│   ├── config.pbtxt
│   ├── 1/
│   │   └── model.plan      # TensorRT engine
│   └── 2/
│       └── model.plan
├── bert_onnx/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx
└── custom_model/
    ├── config.pbtxt
    └── 1/
        └── model.py
```
4.3 Configuration File (config.pbtxt)
```protobuf
# model_repository/resnet50/config.pbtxt
name: "resnet50"
platform: "tensorrt_plan"
max_batch_size: 32
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [3, 224, 224]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [1000]
  }
]
dynamic_batching {
  preferred_batch_size: [4, 8, 16, 32]
  max_queue_delay_microseconds: 5000  # wait up to 5ms to form a batch
}
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [0]
  }
]
```
```protobuf
# model_repository/bert_onnx/config.pbtxt
name: "bert_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [128]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [128]
  }
]
output [
  {
    name: "last_hidden_state"
    data_type: TYPE_FP32
    dims: [128, 768]
  }
]
dynamic_batching {
  max_queue_delay_microseconds: 10000
}
optimization {
  execution_accelerators {
    gpu_execution_accelerator: [
      {
        name: "tensorrt"
        parameters { key: "precision_mode" value: "FP16" }
        parameters { key: "max_workspace_size_bytes" value: "1073741824" }
      }
    ]
  }
}
```
4.4 Python Backend Model
```python
# model_repository/custom_model/1/model.py
import numpy as np
import triton_python_backend_utils as pb_utils
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class TritonPythonModel:
    def initialize(self, args):
        """Called once at server startup"""
        self.device = 'cuda' if args['model_instance_kind'] == 'GPU' else 'cpu'
        model_name = "distilbert-base-uncased-finetuned-sst-2-english"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model.to(self.device)
        self.model.eval()

    def execute(self, requests):
        """Execute batch inference"""
        responses = []
        for request in requests:
            input_text = pb_utils.get_input_tensor_by_name(request, "TEXT")
            texts = input_text.as_numpy().tolist()
            texts = [t[0].decode('utf-8') for t in texts]
            inputs = self.tokenizer(
                texts,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=128
            ).to(self.device)
            with torch.no_grad():
                outputs = self.model(**inputs)
            probs = torch.softmax(outputs.logits, dim=1).cpu().numpy()
            output_tensor = pb_utils.Tensor("PROBABILITIES", probs.astype(np.float32))
            response = pb_utils.InferenceResponse(output_tensors=[output_tensor])
            responses.append(response)
        return responses

    def finalize(self):
        del self.model
        torch.cuda.empty_cache()
```
4.5 Deploying Triton with Docker
```bash
# Start Triton server
docker run --gpus all \
  -p 8000:8000 \
  -p 8001:8001 \
  -p 8002:8002 \
  -v /path/to/model_repository:/models \
  --shm-size=1g \
  nvcr.io/nvidia/tritonserver:24.02-py3 \
  tritonserver \
  --model-repository=/models \
  --log-verbose=1 \
  --strict-model-config=false

# Check readiness
curl http://localhost:8000/v2/health/ready

# Query model info
curl http://localhost:8000/v2/models/resnet50
```

```python
# Python client for Triton
import tritonclient.http as httpclient
import numpy as np

client = httpclient.InferenceServerClient(url="localhost:8000")

input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
inputs = [httpclient.InferInput("input", input_data.shape, "FP32")]
inputs[0].set_data_from_numpy(input_data)
outputs = [httpclient.InferRequestedOutput("output")]

result = client.infer(
    model_name="resnet50",
    model_version="1",
    inputs=inputs,
    outputs=outputs
)

output = result.as_numpy("output")
print(f"Output shape: {output.shape}")
print(f"Top-5 predictions: {np.argsort(output[0])[-5:][::-1]}")
```
5. vLLM - High-Speed LLM Serving
5.1 Introduction to vLLM
vLLM is a high-performance serving library for LLM inference. It achieves up to 24x higher throughput compared to standard HuggingFace Transformers.
Core technologies:
- PagedAttention: Manages KV cache in pages to minimize memory waste
- Continuous Batching: Dynamically processes requests instead of fixed batches
- CUDA Kernel Optimization: Optimized attention kernels including FlashAttention
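The arithmetic behind PagedAttention is worth seeing once. Each generated token must keep K and V vectors for every layer, and a naive server reserves the full `max_model_len` worth of that cache per request; paged allocation wastes at most one partially filled block instead. A small sketch, assuming a Llama-2-7B-like shape (32 layers, 32 KV heads, head dimension 128, FP16):

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    """KV cache cost of one token: a K and a V vector in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

per_token = kv_cache_bytes_per_token(32, 32, 128)
print(f"KV cache per token: {per_token / 1024:.0f} KB")  # 512 KB

# Naive serving reserves max_model_len slots per request up front;
# PagedAttention hands out fixed-size blocks on demand, so per-sequence
# waste is bounded by one partial block.
max_len, generated, block_size = 4096, 600, 16
naive_waste_mb = (max_len - generated) * per_token / 1024**2
paged_waste_mb = ((block_size - generated % block_size) % block_size) * per_token / 1024**2
print(f"Reserved-but-unused: naive {naive_waste_mb:.0f} MB vs paged {paged_waste_mb:.0f} MB")
```

At half a megabyte per token, the reserved-but-unused tail of a short completion costs gigabytes per batch under naive allocation, which is exactly the memory vLLM reclaims for larger batches.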
5.2 vLLM Installation and Basic Usage
```bash
# Install vLLM (CUDA 12.1)
pip install vllm

# Specific CUDA version
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121
```

```python
from vllm import LLM, SamplingParams

# Load model
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9,
    max_model_len=4096,
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=512,
    stop=["</s>", "[INST]"],
)

# Batch inference (multiple prompts simultaneously)
prompts = [
    "Implement a fibonacci sequence in Python",
    "Explain the difference between machine learning and deep learning",
    "Describe the types of SQL JOINs",
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Generated: {output.outputs[0].text}")
    print(f"Tokens: {len(output.outputs[0].token_ids)}")
    print("---")
```
5.3 OpenAI-Compatible API Server
```bash
# Start vLLM OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 4096 \
  --served-model-name llama3-8b

# Serve quantized model
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-7B-Chat-GPTQ \
  --quantization gptq \
  --dtype float16 \
  --port 8000
```

```python
# Use vLLM with OpenAI client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Chat completion
response = client.chat.completions.create(
    model="llama3-8b",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Explain asynchronous programming in Python"},
    ],
    temperature=0.7,
    max_tokens=1000,
    stream=False,
)
print(response.choices[0].message.content)

# Streaming response
stream = client.chat.completions.create(
    model="llama3-8b",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
5.4 vLLM Quantization Serving
```python
from vllm import LLM, SamplingParams

# GPTQ INT4 quantized model
llm_gptq = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    dtype="float16",
    gpu_memory_utilization=0.85,
)

# AWQ INT4 quantized model
llm_awq = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
    dtype="float16",
)

# FP8 quantization (best performance on H100)
llm_fp8 = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",
    dtype="bfloat16",
)

def benchmark_throughput(llm, prompts, n_iterations=5):
    import time
    sampling_params = SamplingParams(max_tokens=200, temperature=0.8)
    llm.generate(prompts[:2], sampling_params)  # Warm-up

    total_tokens = 0
    start = time.time()
    for _ in range(n_iterations):
        outputs = llm.generate(prompts, sampling_params)
        total_tokens += sum(len(o.outputs[0].token_ids) for o in outputs)
    elapsed = time.time() - start

    return {
        "tokens_per_second": total_tokens / elapsed,
        "latency_per_batch_ms": elapsed / n_iterations * 1000,
    }
```
5.5 LoRA Adapter Serving
```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_lora=True,
    max_lora_rank=64,
    max_loras=4,
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=200)

outputs = llm.generate(
    "Explain the history of the Roman Empire",
    sampling_params=sampling_params,
    lora_request=LoRARequest(
        "history-lora",                  # adapter name
        1,                               # adapter ID
        "/path/to/history-lora-adapter"  # adapter path
    )
)
```
6. Ollama - Local LLM Serving
6.1 Introduction to Ollama
Ollama makes it easy to run LLMs locally. You can run various LLMs with a single terminal command, without complex configuration.
6.2 Installation and Basic Usage
```bash
# macOS/Linux installation
curl -fsSL https://ollama.com/install.sh | sh

# Download and run models
ollama run llama3.1

# Other popular models
ollama run mistral
ollama run codellama
ollama run phi3
ollama run gemma2

# Start background service
ollama serve

# List installed models
ollama list

# Remove model
ollama rm llama3.1

# Show model info
ollama show llama3.1
```
6.3 REST API Usage
```python
import requests
import json

def ollama_generate(prompt: str, model: str = "llama3.1") -> str:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
        }
    )
    return response.json()["response"]

def ollama_stream(prompt: str, model: str = "llama3.1"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": True,
        },
        stream=True
    )
    for line in response.iter_lines():
        if line:
            data = json.loads(line)
            yield data.get("response", "")
            if data.get("done", False):
                break

def ollama_chat(messages: list, model: str = "llama3.1") -> str:
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": messages,
            "stream": False,
        }
    )
    return response.json()["message"]["content"]

# Usage examples
result = ollama_generate("Explain decorators in Python")
print(result)

for token in ollama_stream("Explain machine learning basics"):
    print(token, end="", flush=True)

messages = [
    {"role": "system", "content": "You are a Python expert."},
    {"role": "user", "content": "What is the difference between generators and iterators?"},
]
print(ollama_chat(messages))
```
6.4 Custom Modelfile
```
# Modelfile - Custom system prompt and settings
FROM llama3.1

SYSTEM """
You are a senior software engineer who provides clear, concise answers.
Always include code examples and be honest when you don't know something.
"""

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 4096
PARAMETER num_predict 512
PARAMETER stop "<|im_end|>"
```

```bash
# Create custom model
ollama create my-coding-assistant -f Modelfile

# Run it
ollama run my-coding-assistant
```
6.5 Python Client (ollama package)
```python
import ollama

# Synchronous generation
response = ollama.generate(
    model='llama3.1',
    prompt='Explain how to build a REST API with FastAPI',
    options={
        'temperature': 0.7,
        'num_ctx': 2048,
    }
)
print(response['response'])

# Chat with conversation history
messages = []

def chat(user_message: str, model: str = "llama3.1") -> str:
    messages.append({'role': 'user', 'content': user_message})
    response = ollama.chat(model=model, messages=messages)
    assistant_message = response['message']['content']
    messages.append({'role': 'assistant', 'content': assistant_message})
    return assistant_message

print(chat("I want to learn Python. Where should I start?"))
print(chat("What are the best learning resources?"))
print(chat("How long will it take to learn?"))

# Generate embeddings
embeddings = ollama.embeddings(
    model='nomic-embed-text',
    prompt='This is a sample text for embedding'
)
print(f"Embedding dimension: {len(embeddings['embedding'])}")
```
7. Text Generation Inference (TGI)
7.1 HuggingFace TGI Introduction
HuggingFace TGI (Text Generation Inference) is a high-performance toolkit for serving LLMs in production.
Key features:
- High throughput via Continuous Batching
- Multi-GPU support with Tensor Parallelism
- Flash Attention 2 integration
- Distributed tracing with OpenTelemetry
- Safetensors support
7.2 Deploying TGI with Docker
```bash
# Single GPU
docker run --gpus all \
  -p 8080:80 \
  -v /path/to/models:/data \
  --shm-size 1g \
  ghcr.io/huggingface/text-generation-inference:2.4 \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --num-shard 1 \
  --max-input-length 2048 \
  --max-total-tokens 4096

# Multi-GPU (Tensor Parallelism)
docker run --gpus all \
  -p 8080:80 \
  -v /path/to/models:/data \
  --shm-size 2g \
  ghcr.io/huggingface/text-generation-inference:2.4 \
  --model-id meta-llama/Llama-3.1-70B-Instruct \
  --num-shard 4 \
  --quantize bitsandbytes
```

```python
# TGI client
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://localhost:8080")

response = client.text_generation(
    "Explain the core concepts of async programming in Python",
    max_new_tokens=500,
    temperature=0.7,
    repetition_penalty=1.1,
    return_full_text=False,
)
print(response)

# Streaming
for token in client.text_generation(
    "Compare FastAPI vs Flask",
    stream=True,
    max_new_tokens=300,
):
    print(token, end="", flush=True)
```
8. Inference Performance Benchmarking
8.1 Latency and Throughput Measurement
```python
import asyncio
import aiohttp
import time
import numpy as np
from dataclasses import dataclass
from typing import List

@dataclass
class BenchmarkResult:
    total_requests: int
    successful_requests: int
    failed_requests: int
    total_time_sec: float
    mean_latency_ms: float
    p50_latency_ms: float
    p95_latency_ms: float
    p99_latency_ms: float
    requests_per_second: float

async def send_request(session: aiohttp.ClientSession, url: str, payload: dict) -> float:
    start = time.perf_counter()
    try:
        async with session.post(url, json=payload) as response:
            await response.json()
        return (time.perf_counter() - start) * 1000
    except Exception as e:
        print(f"Request failed: {e}")
        return -1.0

async def run_benchmark(
    url: str,
    payload: dict,
    n_requests: int = 1000,
    concurrency: int = 10,
) -> BenchmarkResult:
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded_request(session):
        async with semaphore:
            return await send_request(session, url, payload)

    start_time = time.time()
    async with aiohttp.ClientSession() as session:
        tasks = [bounded_request(session) for _ in range(n_requests)]
        results = await asyncio.gather(*tasks)
    total_time = time.time() - start_time

    successful = [r for r in results if r > 0]
    failed = len(results) - len(successful)
    latencies = np.array(successful)

    return BenchmarkResult(
        total_requests=n_requests,
        successful_requests=len(successful),
        failed_requests=failed,
        total_time_sec=total_time,
        mean_latency_ms=np.mean(latencies),
        p50_latency_ms=np.percentile(latencies, 50),
        p95_latency_ms=np.percentile(latencies, 95),
        p99_latency_ms=np.percentile(latencies, 99),
        requests_per_second=len(successful) / total_time,
    )

# LLM-specific benchmark (token throughput)
def benchmark_llm_throughput(client, prompts: List[str], max_tokens: int = 200) -> dict:
    total_input_tokens = 0
    total_output_tokens = 0
    latencies = []

    start_time = time.time()
    for prompt in prompts:
        req_start = time.perf_counter()
        response = client.chat.completions.create(
            model="llama3-8b",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        req_latency = (time.perf_counter() - req_start) * 1000
        latencies.append(req_latency)
        total_input_tokens += response.usage.prompt_tokens
        total_output_tokens += response.usage.completion_tokens
    total_time = time.time() - start_time

    return {
        "total_requests": len(prompts),
        "total_time_sec": total_time,
        "input_tokens_per_sec": total_input_tokens / total_time,
        "output_tokens_per_sec": total_output_tokens / total_time,
        "mean_latency_ms": np.mean(latencies),
        "p99_latency_ms": np.percentile(latencies, 99),
        "requests_per_second": len(prompts) / total_time,
    }
```
8.2 GPU Memory Profiling
```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

def profile_model_inference(model, input_data):
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()

    with torch.no_grad():
        with profile(
            activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
            record_shapes=True,
            profile_memory=True,
            with_stack=True,
        ) as prof:
            with record_function("model_inference"):
                output = model(input_data)

    torch.cuda.synchronize()
    max_memory = torch.cuda.max_memory_allocated() / 1024**3
    current_memory = torch.cuda.memory_allocated() / 1024**3
    print(f"Peak GPU Memory: {max_memory:.2f} GB")
    print(f"Current GPU Memory: {current_memory:.2f} GB")

    print("\nTop 10 CUDA kernels by CUDA time:")
    print(prof.key_averages().table(
        sort_by="cuda_time_total",
        row_limit=10
    ))

    prof.export_chrome_trace("trace.json")
    return output
```
9. Cost Optimization Strategies
9.1 Autoscaling (Kubernetes + KEDA)
```yaml
# keda-scaled-object.yaml - HTTP request-based autoscaling
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: model-server-scaler
  namespace: ml-platform
spec:
  scaleTargetRef:
    name: model-server
  minReplicaCount: 1
  maxReplicaCount: 20
  cooldownPeriod: 300
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: http_requests_per_second
        threshold: '100'
        query: |
          sum(rate(http_requests_total{service="model-server"}[1m]))
```
9.2 Spot Instance Utilization
```python
# AWS Batch Spot instance for batch inference
import boto3
import time

def submit_spot_inference_job(
    job_queue: str,
    job_definition: str,
    input_s3_path: str,
    output_s3_path: str,
):
    client = boto3.client('batch')
    response = client.submit_job(
        jobName=f"inference-{int(time.time())}",
        jobQueue=job_queue,
        jobDefinition=job_definition,
        containerOverrides={
            'environment': [
                {'name': 'INPUT_PATH', 'value': input_s3_path},
                {'name': 'OUTPUT_PATH', 'value': output_s3_path},
            ],
            'resourceRequirements': [
                {'type': 'GPU', 'value': '1'},
                {'type': 'MEMORY', 'value': '16384'},
                {'type': 'VCPU', 'value': '4'},
            ]
        },
        retryStrategy={
            'attempts': 3,
            'evaluateOnExit': [
                {
                    'onReason': 'Host EC2*terminated',
                    'action': 'RETRY'
                }
            ]
        }
    )
    return response['jobId']
```
9.3 Edge Inference Optimization
```python
# TFLite conversion (mobile/edge)
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("my_model")

# INT8 quantization
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS_INT8
]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

def representative_dataset():
    # calibration_data: a representative tf.data.Dataset of inputs
    for data in calibration_data.batch(1).take(100):
        yield [data]

converter.representative_dataset = representative_dataset
tflite_model = converter.convert()

with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)

print(f"TFLite model size: {len(tflite_model) / 1024:.1f} KB")
```
9.4 Cost Optimization Checklist
1. Apply Quantization
- FP16: ~0% accuracy loss, 50% memory savings
- INT8: Minor accuracy loss, 75% memory savings, 2-4x speedup
- INT4 (GPTQ/AWQ): LLM-specialized, 87.5% memory savings
2. Use Batching
- Group small requests to maximize GPU utilization
- Dynamic Batching to balance latency and throughput
3. Model Compression
- Knowledge distillation to train smaller models (BERT to DistilBERT)
- Pruning to remove unnecessary weights
4. Infrastructure Optimization
- Spot/Preemptible instances for up to 70% cost reduction
- Autoscaling to eliminate idle resources
- Scale on load (RPS-based, not CPU-based)
5. Caching
- Cache responses for identical inputs
- Reuse KV cache (Prefix Caching in vLLM)
6. Edge Deployment
- Client-side inference with GGUF/TFLite
- Reduces server load and costs
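Item 5 of the checklist is easy to prototype: key the cache on a hash of the full request (model, prompt, and sampling options) so only truly identical requests hit. A minimal sketch (the `ResponseCache` class is illustrative; a production version would add TTLs and size limits, and caching is only safe when sampling is deterministic, e.g. temperature 0):

```python
import hashlib
import json

class ResponseCache:
    """Hash-keyed cache for identical (model, prompt, params) requests."""
    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, model: str, prompt: str, params: dict) -> str:
        # Canonical JSON so semantically equal requests hash identically
        payload = json.dumps({"m": model, "p": prompt, "o": params}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get_or_generate(self, model: str, prompt: str, params: dict, generate_fn):
        key = self._key(model, prompt, params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = generate_fn(prompt)  # fall through to the real model
        self._store[key] = result
        return result
```

Wrapping the serving client's call in `get_or_generate` turns repeated identical prompts into dictionary lookups, which directly cuts GPU time for workloads with duplicate traffic.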
Conclusion
Optimizing AI model serving is not simply about writing fast code. It requires a holistic understanding of model architecture, hardware characteristics, serving frameworks, and business requirements.
Remember the optimization approach order:
- Start with profiling: Identify bottlenecks first
- Simple things first: FP16 quantization → batching → TensorRT → INT8
- Measure accuracy: Always measure the accuracy impact of optimizations
- Test in production: Development and production performance can differ
- Balance cost and performance: Always consider cost-effectiveness
vLLM, TensorRT, Triton, and Ollama are each optimized for different use cases. Select the right tool for your requirements, continuously benchmark, and iterate toward improvement.