Edge AI Complete Guide 2025: On-Device Inference, Model Optimization, TensorRT/ONNX/CoreML
By Youngju Kim (@fjvbn20031)
Introduction: Why Edge AI?
The future of AI is not just the cloud. An increasing amount of AI inference is running where data is generated -- on edge devices. Edge AI is a paradigm for running AI models directly on smartphones, IoT devices, vehicles, and medical devices without sending data to the cloud.
5 reasons Edge AI is essential:
- Latency: Millisecond-level responses without cloud round-trips. For autonomous vehicles, even 100ms delay can be fatal.
- Privacy: Sensitive data (faces, voice, medical) never leaves the device.
- Bandwidth savings: Cameras generate gigabytes of video per second, but only inference results need to be transmitted.
- Cost reduction: Run inference locally without cloud GPU costs.
- Offline operation: AI features work without network connectivity.
1. Edge AI vs Cloud AI Comparison
| Dimension | Edge AI | Cloud AI |
|---|---|---|
| Latency | 1-10ms (local) | 50-200ms (network round-trip) |
| Privacy | Data stays on device | Data must be sent to cloud |
| Bandwidth | Minimal (results only) | Heavy (raw data transfer) |
| Cost | Upfront hardware cost | Ongoing cloud costs |
| Offline | Fully supported | Not possible |
| Model size | Limited (MB to a few GB) | Unlimited (hundreds of GB) |
| Compute power | Limited (NPU/GPU) | Nearly unlimited |
| Updates | OTA update required | Instant deployment |
| Scalability | Proportional to device count | Elastic with cloud resources |
| Accuracy | Slight degradation possible | Maximum accuracy |
2. Inference Runtime Comparison
2.1 TensorRT (NVIDIA)
A high-performance inference engine exclusive to NVIDIA GPUs.
```python
import numpy as np
import tensorrt as trt

# Build TensorRT engine
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Load ONNX model
with open("model.onnx", "rb") as f:
    parser.parse(f.read())

# Optimization profile configuration
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1GB

# Enable INT8 quantization (MyCalibrator is a user-defined IInt8Calibrator)
config.set_flag(trt.BuilderFlag.INT8)
config.int8_calibrator = MyCalibrator(calibration_data)

# Enable FP16 as well; TensorRT picks the best precision per layer
config.set_flag(trt.BuilderFlag.FP16)

# Build and serialize engine
serialized_engine = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(serialized_engine)
```
Key TensorRT optimization techniques:
1. Layer fusion: Combines Conv + BatchNorm + ReLU into a single kernel
2. Kernel auto-tuning: Automatic selection of optimal CUDA kernels per hardware
3. Dynamic tensor memory: Reduces total usage through memory reuse
4. Precision calibration: Minimizes accuracy loss during INT8 quantization
5. Multi-stream execution: Runs multiple inferences in parallel
2.2 ONNX Runtime
A cross-platform inference engine that works across various hardware.
```python
import numpy as np
import onnxruntime as ort

# Session options
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.intra_op_num_threads = 4
sess_options.inter_op_num_threads = 2

# Execution Provider selection (highest priority first)
# CPU: CPUExecutionProvider
# GPU: CUDAExecutionProvider, TensorrtExecutionProvider
# Mobile: CoreMLExecutionProvider, NNAPIExecutionProvider
providers = [
    ('TensorrtExecutionProvider', {
        'trt_max_workspace_size': 2147483648,  # 2GB
        'trt_fp16_enable': True,
        'trt_int8_enable': True,
    }),
    ('CUDAExecutionProvider', {
        'device_id': 0,
        'arena_extend_strategy': 'kNextPowerOfTwo',
    }),
    'CPUExecutionProvider',
]
session = ort.InferenceSession("model.onnx", sess_options, providers=providers)

# Run inference
input_name = session.get_inputs()[0].name
output = session.run(None, {input_name: input_data})
```
2.3 TensorFlow Lite (TFLite)
A lightweight inference engine for mobile and embedded devices.
```python
import numpy as np
import tensorflow as tf

# Convert to TFLite model
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")

# Quantization settings
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Full integer quantization (INT8); calibration_data is a list of sample inputs
def representative_dataset():
    for data in calibration_data:
        yield [data.astype(np.float32)]

converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()

# Save model
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)

# TFLite inference
interpreter = tf.lite.Interpreter(model_path="model_int8.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]['index'])
```
2.4 CoreML (Apple)
Apple's inference framework for running models on-device across iPhone, iPad, and Mac.
```python
import coremltools as ct
import torch

# PyTorch -> CoreML conversion
model = torch.load("model.pt")
model.eval()
traced_model = torch.jit.trace(model, torch.randn(1, 3, 224, 224))

coreml_model = ct.convert(
    traced_model,
    inputs=[ct.ImageType(name="image", shape=(1, 3, 224, 224))],
    compute_precision=ct.precision.FLOAT16,
    compute_units=ct.ComputeUnit.ALL,  # CPU + GPU + Neural Engine
)

# Add metadata
coreml_model.author = "ML Team"
coreml_model.short_description = "Image classifier"
coreml_model.save("MyModel.mlpackage")
```

```swift
// Using CoreML in Swift
import CoreML
import Vision

let model = try! MyModel(configuration: MLModelConfiguration())
let request = VNCoreMLRequest(model: try! VNCoreMLModel(for: model.model))
let handler = VNImageRequestHandler(cgImage: image, options: [:])
try! handler.perform([request])

if let results = request.results as? [VNClassificationObservation] {
    print("Top prediction: \(results.first!.identifier)")
}
```
2.5 OpenVINO (Intel)
An inference engine for Intel hardware.
```python
import openvino as ov

core = ov.Core()

# Load and compile model
model = core.read_model("model.xml")
compiled_model = core.compile_model(model, "CPU", config={
    "PERFORMANCE_HINT": "LATENCY",
    "INFERENCE_NUM_THREADS": "4",
})

# INT8 quantization (NNCF)
import nncf

calibration_dataset = nncf.Dataset(calibration_loader)
quantized_model = nncf.quantize(
    model,
    calibration_dataset,
    preset=nncf.QuantizationPreset.MIXED,
    subset_size=300,
)

# Save quantized model
ov.save_model(quantized_model, "model_int8.xml")
```
2.6 Runtime Comparison Summary
| Runtime | Platform | Key Hardware | Languages |
|---|---|---|---|
| TensorRT | Linux/Windows | NVIDIA GPU | C++, Python |
| ONNX Runtime | Cross-platform | CPU/GPU/NPU | C++, Python, C#, Java |
| TFLite | Mobile/Embedded | CPU/GPU/NPU | C++, Java, Swift, Python |
| CoreML | Apple only | ANE/GPU/CPU | Swift, Objective-C |
| OpenVINO | Intel only | CPU/iGPU/VPU | C++, Python |
| MLC LLM | Cross-platform | CPU/GPU/NPU | C++, Python, Swift |
3. Model Optimization Techniques
3.1 Quantization
Quantization is a core technique that reduces model weight precision to improve size and inference speed.
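The arithmetic underneath INT8 quantization can be sketched in a few lines of NumPy. This is a toy asymmetric affine scheme (one scale and zero-point for the whole tensor); real toolchains add per-channel scales and calibration:

```python
import numpy as np

def quantize_int8(x):
    # Asymmetric affine quantization: map [min, max] onto [-128, 127]
    scale = (x.max() - x.min()) / 255.0
    zero_point = -128 - int(round(x.min() / scale))
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8), scale, zero_point

def dequantize(q, scale, zero_point):
    # Recover an approximation of the original float values
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.default_rng(0).normal(size=1000).astype(np.float32)
q, scale, zp = quantize_int8(weights)
error = np.abs(dequantize(q, scale, zp) - weights).max()
# Reconstruction error stays within about one quantization step (scale)
```

The 4x size reduction versus FP32 comes directly from storing one byte instead of four per weight; the accuracy question is entirely about how well `scale` and `zero_point` cover the actual value distribution, which is what calibration tunes.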
Post-Training Quantization (PTQ)
Quantizes models after training without additional training.
```python
# PyTorch PTQ example
import torch
from torch.quantization import quantize_dynamic

# Dynamic quantization (simplest)
model_fp32 = load_model()
model_int8 = quantize_dynamic(
    model_fp32,
    {torch.nn.Linear, torch.nn.LSTM},
    dtype=torch.qint8
)

# Static quantization (higher performance)
from torch.quantization import prepare, convert, get_default_qconfig

model_fp32.eval()
model_fp32.qconfig = get_default_qconfig('x86')
model_prepared = prepare(model_fp32)

# Collect activation statistics with calibration data
with torch.no_grad():
    for batch in calibration_loader:
        model_prepared(batch)

model_int8 = convert(model_prepared)
```
Quantization-Aware Training (QAT)
Simulates quantization effects during training to minimize accuracy loss.
```python
import torch
from torch.quantization import prepare_qat, convert

model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = prepare_qat(model)

# QAT training loop (same as regular training)
optimizer = torch.optim.Adam(model_prepared.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()
for epoch in range(num_epochs):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        output = model_prepared(inputs)
        loss = criterion(output, targets)
        loss.backward()
        optimizer.step()

# Convert to quantized model
model_prepared.eval()
model_int8 = convert(model_prepared)
```
LLM Quantization Techniques
```python
# GPTQ (GPU-based quantization)
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
)
model = AutoGPTQForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantize_config
)
model.quantize(calibration_data)
model.save_quantized("llama-3.1-8b-gptq-4bit")
```

```python
# AWQ (Activation-aware Weight Quantization)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("llama-3.1-8b-awq-4bit")
```

```python
# bitsandbytes (simple load-time quantization)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
```
Precision-level comparison (7B-parameter model):

| Precision | Model Size | Memory | Speed Gain | Accuracy Loss |
|---|---|---|---|---|
| FP32 | 28GB | 28GB | 1x | Baseline |
| FP16 | 14GB | 14GB | 2x | Negligible |
| INT8 | 7GB | 7GB | 3-4x | Minimal (under 0.5%) |
| INT4 | 3.5GB | 3.5GB | 4-6x | Slight (1-2%) |
3.2 Pruning
Pruning removes unnecessary weights or neurons from a model.
Unstructured Pruning
Creates sparse models by zeroing out individual weights.
```python
import torch
import torch.nn.utils.prune as prune

# Magnitude-based pruning
model = load_model()

# Apply 50% global pruning across all Linear/Conv2d weights
parameters_to_prune = [
    (module, 'weight') for module in model.modules()
    if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d))
]
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.5,  # 50% pruning
)

# Permanently apply pruning masks
for module, name in parameters_to_prune:
    prune.remove(module, name)

# Check sparsity
total = 0
zero = 0
for p in model.parameters():
    total += p.numel()
    zero += (p == 0).sum().item()
print(f"Sparsity: {zero/total:.2%}")
```
Structured Pruning
Removes entire channels or heads for actual speed improvements.
```python
# Channel pruning example (the same idea applies to attention heads)
import torch.nn.utils.prune as prune

# Prune 30% of output channels in a specific conv layer by L2 norm
prune.ln_structured(
    model.layer1.conv1,
    name='weight',
    amount=0.3,
    n=2,   # Ln norm used to rank channels (here L2)
    dim=0  # Output channel dimension
)
```
Movement Pruning
Identifies and removes unimportant weights during fine-tuning.
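The scoring idea can be illustrated with a hand-rolled NumPy sketch (a toy linear regression, not a production recipe): each weight accumulates the movement score -w * grad during fine-tuning, and weights whose score is low, meaning they are being pushed toward zero, get pruned.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))   # weights of a toy linear model y = W @ x
scores = np.zeros_like(W)     # accumulated movement scores

for _ in range(200):          # toy fine-tuning loop
    x = rng.normal(size=(8, 16))
    y = rng.normal(size=(4, 16))                 # random regression targets
    grad = 2 * (W @ x - y) @ x.T / x.shape[1]    # dMSE/dW
    scores += -W * grad       # negative score => weight is moving toward zero
    W -= 0.05 * grad          # plain SGD step

# Prune the half of the weights with the lowest movement scores
threshold = np.median(scores)
mask = scores > threshold
W *= mask
sparsity = 1 - mask.mean()    # roughly 0.5 by construction
```

Unlike magnitude pruning, the decision here depends on the training dynamics (which way a weight is moving), not on its current size, which is why movement pruning tends to preserve accuracy better in transfer/fine-tuning settings.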
Pruning Technique Comparison:

| Technique | Sparsity Pattern | Actual Speedup | Accuracy |
|---|---|---|---|
| Magnitude-based | Unstructured | Limited* | High |
| Structured | Structured | High | Medium |
| Movement Pruning | Unstructured | Limited* | Very High |

\* Unstructured pruning requires specialized sparse-matrix hardware/libraries for real speedups
3.3 Knowledge Distillation
Transfers knowledge from a large teacher model to a smaller student model.
```python
import torch
import torch.nn.functional as F

class DistillationLoss(torch.nn.Module):
    def __init__(self, temperature=4.0, alpha=0.5):
        super().__init__()
        self.temperature = temperature
        self.alpha = alpha
        self.ce_loss = torch.nn.CrossEntropyLoss()

    def forward(self, student_logits, teacher_logits, labels):
        # Soft target loss (KL divergence between temperature-softened distributions)
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / self.temperature, dim=1),
            F.softmax(teacher_logits / self.temperature, dim=1),
            reduction='batchmean'
        ) * (self.temperature ** 2)
        # Hard target loss (cross-entropy against ground-truth labels)
        hard_loss = self.ce_loss(student_logits, labels)
        return self.alpha * soft_loss + (1 - self.alpha) * hard_loss

# Training loop
distillation_loss = DistillationLoss()
teacher_model.eval()
student_model.train()
for inputs, labels in train_loader:
    with torch.no_grad():
        teacher_logits = teacher_model(inputs)
    student_logits = student_model(inputs)
    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
Notable knowledge distillation success stories (teacher -> student):
- BERT-base (110M) -> DistilBERT (66M): 40% smaller, retains 97% of performance
- GPT-4 -> Phi-3 Mini (3.8B): large-model knowledge transferred to a small model
- Llama 70B -> Llama 8B: knowledge distillation plus synthetic data for maximum performance
3.4 Neural Architecture Search (NAS)
Automatically searches for optimal model architectures.
Efficient architectures discovered by NAS:

| Model | Parameters | Top-1 Accuracy | Inference Speed |
|---|---|---|---|
| EfficientNet-B0 | 5.3M | 77.3% | Very fast |
| MobileNetV3 | 5.4M | 75.2% | Very fast |
| EfficientNetV2 | 21M | 85.7% | Fast |
| MnasNet | 4.2M | 74.0% | Very fast |
4. Hardware Landscape
4.1 NVIDIA (Jetson, T4/L4)
NVIDIA Edge AI Hardware:

| Device | TOPS | Memory | TDP | Use Case |
|---|---|---|---|---|
| Jetson Orin Nano | 40 | 4-8GB | 7-15W | Embedded AI |
| Jetson Orin NX | 100 | 8-16GB | 10-25W | Robotics, drones |
| Jetson AGX Orin | 275 | 32-64GB | 15-60W | Autonomous driving |
| T4 (datacenter) | 130 | 16GB | 70W | Edge servers |
| L4 (datacenter) | 120 | 24GB | 72W | Video AI |
```bash
# Running TensorRT inference on Jetson (after JetPack SDK installation)

# Model conversion (ONNX -> TensorRT)
/usr/src/tensorrt/bin/trtexec \
  --onnx=model.onnx \
  --saveEngine=model.engine \
  --fp16 \
  --workspace=2048 \
  --minShapes=input:1x3x224x224 \
  --optShapes=input:4x3x224x224 \
  --maxShapes=input:8x3x224x224

# Benchmark at batch size 4 (explicit-batch engines take --shapes)
/usr/src/tensorrt/bin/trtexec --loadEngine=model.engine --shapes=input:4x3x224x224
```
4.2 Apple (Neural Engine, CoreML, MLX)
Apple Silicon AI Performance:

| Chip | Neural Engine TOPS | GPU | Unified Memory |
|---|---|---|---|
| M1 | 11 | 8-core | 8-16GB |
| M2 | 15.8 | 10-core | 8-24GB |
| M3 | 18 | 10-core | 8-36GB |
| M4 | 38 | 10-core | 16-64GB |
| A17 Pro | 35 | 6-core | 8GB |
| A18 Pro | 35 | 6-core | 8GB |
```python
# MLX framework (Apple Silicon native)
import mlx.core as mx
import mlx.nn as nn

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(784, 256)
        self.linear2 = nn.Linear(256, 10)

    def __call__(self, x):
        x = nn.relu(self.linear1(x))
        return self.linear2(x)

model = SimpleModel()
input_data = mx.random.normal((1, 784))
output = model(input_data)
mx.eval(output)  # Force MLX's lazy computation graph to execute
```
4.3 Qualcomm (Snapdragon NPU)
Qualcomm AI Engine:

| Chip | NPU TOPS | Use Case |
|---|---|---|
| Snapdragon 8 Gen 3 | 45 | Flagship smartphones |
| Snapdragon 8 Gen 2 | 36 | Premium smartphones |
| Snapdragon 7+ Gen 2 | 13 | Mid-range smartphones |
| QCS6490 | 12 | IoT, cameras |
4.4 Google (Edge TPU, MediaPipe)
```bash
# Edge TPU model compilation
edgetpu_compiler --min_runtime_version 15 model_int8.tflite
# Result: model_int8_edgetpu.tflite
```

```python
# MediaPipe real-time hand tracking
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
hands = mp_hands.Hands(
    static_image_mode=False,
    max_num_hands=2,
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5,
)

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        for hand_landmarks in results.multi_hand_landmarks:
            # Process the 21 landmarks of each detected hand
            pass
```
4.5 Intel (OpenVINO, Movidius)
Intel AI Accelerators:

| Hardware | Performance | Use Case |
|---|---|---|
| Core Ultra NPU | 10-34 TOPS | Laptop/desktop AI |
| Arc GPU | Variable | Desktop/workstation |
| Movidius VPU | 4 TOPS | IoT, cameras (discontinued) |
| Gaudi 2/3 | Server-grade | Edge servers |
5. On-Device LLMs
5.1 llama.cpp
A lightweight LLM inference engine written in C/C++.
```bash
# Build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j$(nproc)

# Metal support (macOS)
make LLAMA_METAL=1 -j$(nproc)

# CUDA support
make LLAMA_CUDA=1 -j$(nproc)

# Model quantization (GGUF format)
./llama-quantize models/llama-3.1-8b-f16.gguf \
  models/llama-3.1-8b-q4_k_m.gguf Q4_K_M

# Run inference (-ngl = number of layers offloaded to the GPU)
./llama-cli \
  -m models/llama-3.1-8b-q4_k_m.gguf \
  -p "Explain quantum computing in simple terms:" \
  -n 256 \
  -ngl 99
```
5.2 GGUF Quantization Levels
GGUF Quantization Comparison (Llama 3.1 8B):

| Quant | Size | Memory | Quality | Speed |
|---|---|---|---|---|
| Q2_K | 2.96GB | 5.4GB | Low | Very fast |
| Q3_K_M | 3.52GB | 6.0GB | Moderate | Fast |
| Q4_K_M | 4.58GB | 7.0GB | Good | Fast |
| Q5_K_M | 5.33GB | 7.8GB | Very good | Moderate |
| Q6_K | 6.14GB | 8.6GB | Excellent | Moderate |
| Q8_0 | 7.95GB | 10.4GB | Near-original | Slow |
| F16 | 15.0GB | 17.5GB | Original | Slow |
5.3 MLC LLM
A framework for running LLMs across diverse platforms.
```bash
# Install MLC LLM
pip install mlc-llm

# Compile model (Vulkan backend)
mlc_llm compile ./dist/Llama-3.1-8B-q4f16_1-MLC/ \
  --device vulkan \
  --output ./dist/Llama-3.1-8B-q4f16_1-vulkan/

# Compile for iOS (use --device android for Android)
mlc_llm compile ./dist/Llama-3.1-8B-q4f16_1-MLC/ \
  --device iphone \
  --output ./dist/Llama-3.1-8B-q4f16_1-ios/
```
5.4 MLX (Apple Silicon)
```python
# LLM inference with MLX
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")

prompt = "What is machine learning?"
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=256,
    temp=0.7,
)
print(response)
```
5.5 LLMs That Run on Edge Devices
On-Device LLM Comparison:

| Model | Size | RAM Req | Speed (tok/s) | Quality |
|---|---|---|---|---|
| Phi-3 Mini (3.8B) | 2.3GB | 4GB | 20-40 | Excellent |
| Gemma 2 (2B) | 1.4GB | 3GB | 30-50 | Good |
| Llama 3.2 (1B) | 0.7GB | 2GB | 40-60 | Moderate |
| Llama 3.2 (3B) | 1.8GB | 3.5GB | 25-40 | Good |
| Qwen2.5 (3B) | 1.8GB | 3.5GB | 25-40 | Good |
| SmolLM (1.7B) | 1.0GB | 2.5GB | 35-55 | Moderate |

\* Based on Q4_K_M quantization; approximate figures on smartphones
6. Federated Learning
6.1 FedAvg Algorithm
The most fundamental algorithm in federated learning.
```python
# FedAvg sketch (train_local is the client-side training routine)
import copy
import random

def federated_averaging(global_model, clients, rounds, local_epochs, fraction=0.3):
    for round_num in range(rounds):
        # 1. Select a fraction of clients and distribute the global model
        selected_clients = random.sample(clients, max(1, int(fraction * len(clients))))
        client_models = []
        client_sizes = []
        for client in selected_clients:
            # 2. Local training on each client
            local_model = copy.deepcopy(global_model)
            local_model = train_local(
                local_model,
                client.data,
                epochs=local_epochs,
                lr=0.01
            )
            client_models.append(local_model.state_dict())
            client_sizes.append(len(client.data))
        # 3. Update global model with a data-size-weighted average
        total_size = sum(client_sizes)
        new_global = {}
        for key in global_model.state_dict():
            new_global[key] = sum(
                client_models[i][key] * (client_sizes[i] / total_size)
                for i in range(len(client_models))
            )
        global_model.load_state_dict(new_global)
    return global_model
```
6.2 Flower Framework
```python
# Flower server
import flwr as fl

strategy = fl.server.strategy.FedAvg(
    fraction_fit=0.3,       # 30% of clients train each round
    fraction_evaluate=0.2,  # 20% of clients evaluate
    min_fit_clients=2,      # Minimum participating clients
    min_evaluate_clients=2,
    min_available_clients=3,
)
fl.server.start_server(
    server_address="0.0.0.0:8080",
    config=fl.server.ServerConfig(num_rounds=10),
    strategy=strategy,
)
```

```python
# Flower client (train/test are user-defined training and evaluation routines)
import flwr as fl
import torch

class FlowerClient(fl.client.NumPyClient):
    def __init__(self, model, trainloader, testloader):
        self.model = model
        self.trainloader = trainloader
        self.testloader = testloader

    def get_parameters(self, config):
        return [val.detach().cpu().numpy() for val in self.model.parameters()]

    def set_parameters(self, parameters):
        for param, new_val in zip(self.model.parameters(), parameters):
            param.data = torch.tensor(new_val)

    def fit(self, parameters, config):
        self.set_parameters(parameters)
        train(self.model, self.trainloader, epochs=1)
        return self.get_parameters(config), len(self.trainloader.dataset), {}

    def evaluate(self, parameters, config):
        self.set_parameters(parameters)
        loss, accuracy = test(self.model, self.testloader)
        return float(loss), len(self.testloader.dataset), {"accuracy": float(accuracy)}

fl.client.start_numpy_client(
    server_address="localhost:8080",
    client=FlowerClient(model, trainloader, testloader),
)
```
6.3 Differential Privacy
```python
# Differential privacy training with Opacus
import torch
from opacus import PrivacyEngine

model = create_model()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
criterion = torch.nn.CrossEntropyLoss()

privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    epochs=10,
    target_epsilon=1.0,  # Privacy budget
    target_delta=1e-5,   # Failure probability
    max_grad_norm=1.0,   # Per-sample gradient clipping
)

# Training (same loop as regular training)
for inputs, targets in train_loader:
    optimizer.zero_grad()
    output = model(inputs)
    loss = criterion(output, targets)
    loss.backward()
    optimizer.step()

# Check consumed privacy budget
epsilon = privacy_engine.get_epsilon(delta=1e-5)
print(f"Epsilon: {epsilon:.2f}")
```
7. Privacy-Preserving AI
7.1 On-Device Processing
An approach that processes sensitive data without sending it off the device.
On-Device Privacy Preservation Examples:

| Use Case | Technology | Privacy Guarantee |
|---|---|---|
| Apple Face ID | Neural Engine + Secure Enclave | Facial data processed on-device |
| Google keyboard prediction | Federated learning | Typing data never sent to server |
| Apple Siri voice | CoreML on-device | Voice data processed locally |
| Samsung Knox AI | NPU + TEE | Enterprise data isolation |
7.2 Secure Aggregation
Encrypts individual model updates in federated learning so even the server cannot see them.
Secure Aggregation protocol:
1. Each pair of clients exchanges seeds and derives pairwise random masks
2. Each client adds its masks to its local model update before sending
3. The server collects only the masked updates
4. The pairwise masks cancel out during aggregation, revealing only the sum
5. Individual updates cannot be recovered, even by the server
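The steps above can be demonstrated with a toy NumPy sketch. The pairwise seeds here are hard-coded; a real protocol derives them via key agreement and adds recovery for dropped clients:

```python
import numpy as np

n_clients, dim = 3, 5
rng = np.random.default_rng(42)
updates = [rng.normal(size=dim) for _ in range(n_clients)]  # raw local updates

# Pairwise masks: for every pair i < j, client i adds mask m_ij, client j subtracts it
masks = [np.zeros(dim) for _ in range(n_clients)]
for i in range(n_clients):
    for j in range(i + 1, n_clients):
        pair_rng = np.random.default_rng(1000 * i + j)  # seed shared by pair (i, j)
        m = pair_rng.normal(size=dim)
        masks[i] += m
        masks[j] -= m

# The server only ever sees masked updates ...
masked = [u + m for u, m in zip(updates, masks)]
# ... but the masks cancel in the sum, recovering the exact aggregate
aggregate = np.sum(masked, axis=0)
```

Because each individual `masked[i]` is the true update plus large random noise, the server learns nothing about any single client, yet the FedAvg-style sum it needs is exact.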
8. Deployment Pipeline
8.1 Model Conversion Workflow
Training framework -> intermediate format -> inference engine -> device:

```
PyTorch ---+                 +---> TensorRT (.engine)  ---> NVIDIA GPU
           +---> ONNX -------+---> ONNX Runtime        ---> Cross-platform
TensorFlow-+                 +---> OpenVINO (.xml)     ---> Intel
           +---> TFLite  --------> TFLite Runtime      ---> Mobile/IoT
           +---> CoreML  --------> CoreML Runtime      ---> Apple
           +---> GGUF    --------> llama.cpp           ---> All platforms
```
8.2 OTA (Over-The-Air) Model Updates
```python
# Model version management and OTA update example
import hashlib
import requests

class ModelManager:
    def __init__(self, model_dir, manifest_url):
        self.model_dir = model_dir
        self.manifest_url = manifest_url

    def check_update(self):
        """Check the latest model manifest on the server"""
        manifest = requests.get(self.manifest_url).json()
        current_version = self.get_current_version()
        if manifest["version"] > current_version:
            return manifest
        return None

    def download_model(self, manifest):
        """Delta update or full download"""
        if manifest.get("delta_url"):
            # Download delta patch (bandwidth savings)
            patch = requests.get(manifest["delta_url"]).content
            self.apply_delta(patch)
        else:
            # Full model download
            model_data = requests.get(manifest["model_url"]).content
            self.save_model(model_data, manifest)

    def validate_model(self, model_path, expected_hash):
        """Verify integrity with a SHA-256 hash"""
        with open(model_path, "rb") as f:
            file_hash = hashlib.sha256(f.read()).hexdigest()
        return file_hash == expected_hash

    def rollback(self):
        """Roll back to the previous version on failure"""
        # Previous-version restoration logic
        pass
```
8.3 Model Packaging
```yaml
# Model manifest (model_manifest.yaml)
model_name: "image-classifier-v2"
version: "2.1.0"
framework: "tflite"
file: "model_int8.tflite"
input_shape: [1, 224, 224, 3]
input_type: "uint8"
output_shape: [1, 1000]
labels_file: "labels.txt"
hardware_requirements:
  min_ram_mb: 256
  supported_delegates: ["gpu", "nnapi", "xnnpack"]
metrics:
  accuracy: 0.943
  latency_ms: 12.5
  model_size_mb: 4.2
```
9. Use Cases
9.1 Autonomous Driving
Autonomous Driving Edge AI Stack:

| Sensor | AI Task | Hardware | Latency Req |
|---|---|---|---|
| Cameras (8-12) | Object detection | NVIDIA Orin | Under 10ms |
| LiDAR | 3D point cloud | NVIDIA Orin | Under 20ms |
| Radar | Distance/speed estimation | DSP | Under 5ms |
| Fusion | Sensor fusion/decision | NVIDIA Orin | Under 30ms |
| Planning | Path planning | CPU/GPU | Under 50ms |
9.2 Smart Cameras
```python
# Real-time object detection on an edge device
# (preprocess and draw_box are user-defined helper functions)
import cv2
import numpy as np
import tensorflow as tf

# Load TFLite model (SSD MobileNet)
interpreter = tf.lite.Interpreter(
    model_path="ssd_mobilenet_v2_int8.tflite",
    num_threads=4
)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

cap = cv2.VideoCapture(0)
while True:
    ret, frame = cap.read()
    if not ret:
        break
    # Preprocessing
    input_data = preprocess(frame, target_size=(300, 300))
    # Inference
    interpreter.set_tensor(input_details[0]['index'], input_data)
    interpreter.invoke()
    # Parse results
    boxes = interpreter.get_tensor(output_details[0]['index'])
    classes = interpreter.get_tensor(output_details[1]['index'])
    scores = interpreter.get_tensor(output_details[2]['index'])
    # Display high-confidence detections only
    for i in range(len(scores[0])):
        if scores[0][i] > 0.5:
            draw_box(frame, boxes[0][i], classes[0][i], scores[0][i])
```
9.3 Voice Assistants
On-Device Voice Processing Pipeline:
1. Wake word detection: tiny model (50KB-1MB), always-on, runs on the NPU
2. Voice activity detection (VAD): speech-segment identification, very lightweight
3. Speech recognition (ASR): Whisper Tiny/Base (39M-74M parameters)
4. Natural language understanding (NLU): intent classification + entity recognition
5. Response generation: small LLM or template-based
6. Text-to-speech (TTS): convert text back to audio
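Stage 2 (VAD) can be as simple as a per-frame energy threshold. A toy NumPy sketch (production systems typically use small learned models instead, and the 0.01 threshold here is an arbitrary illustration value):

```python
import numpy as np

def energy_vad(samples, sr=16000, frame_ms=30, threshold=0.01):
    # Split audio into fixed-size frames and flag frames above an energy threshold
    frame = int(sr * frame_ms / 1000)
    n = len(samples) // frame
    frames = samples[:n * frame].reshape(n, frame)
    energy = (frames ** 2).mean(axis=1)
    return energy > threshold  # True = speech-like frame

# Synthetic check: half a second of silence, then half a second of a 440 Hz tone
sr = 16000
t = np.arange(sr // 2) / sr
signal = np.concatenate([np.zeros(sr // 2), 0.5 * np.sin(2 * np.pi * 440 * t)])
flags = energy_vad(signal, sr=sr)
# Silent frames come back False, tone frames come back True
```

Gating the much heavier ASR model on frames like these is what keeps the always-on part of the pipeline within a wearable-class power budget.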
9.4 Medical Devices
Medical Edge AI Applications:

| Device | AI Task | Regulatory |
|---|---|---|
| Smartwatch | ECG arrhythmia detection | FDA Class II |
| Stethoscope | Heart/lung sound analysis | FDA Class II |
| Fundus camera | Diabetic retinopathy screening | FDA Class II |
| CT/MRI assist | Lesion highlighting | FDA Class III |
| Glucose monitor | Blood sugar trend prediction | FDA Class II |
9.5 Industrial IoT
Industrial Edge AI Applications:

| Area | AI Task | Benefit |
|---|---|---|
| Quality inspection | Visual defect detection | Real-time, 100% inspection |
| Predictive maintenance | Vibration/noise anomaly detection | 40% downtime reduction |
| Energy optimization | Consumption prediction | 15% energy savings |
| Safety monitoring | PPE detection | Accident prevention |
| Robot control | Path planning/obstacle avoidance | Autonomous operation |
10. Challenges and Limitations
10.1 Power Constraints
Power Efficiency Comparison:

| Device | Power | AI Perf | TOPS/W |
|---|---|---|---|
| NVIDIA Jetson Orin Nano | 7-15W | 40 TOPS | 2.7-5.7 |
| Apple A18 Pro | ~5W | 35 TOPS | 7.0 |
| Google Edge TPU | 2W | 4 TOPS | 2.0 |
| Qualcomm Snapdragon 8 Gen 3 | ~5W | 45 TOPS | 9.0 |
| Intel Core Ultra NPU | ~5W | 34 TOPS | 6.8 |
10.2 Thermal Management
Continuous AI inference on mobile devices causes thermal issues. Thermal throttling can degrade performance by 30-50%.
10.3 Memory Limitations
Memory Constraints by Device Type:

| Device Type | Typical RAM | AI-Available Memory |
|---|---|---|
| IoT sensors | 256KB-4MB | Extremely limited |
| Microcontrollers | 2-16MB | Very limited |
| Smart cameras | 256MB-2GB | Hundreds of MB |
| Smartphones | 6-16GB | 2-8GB |
| Edge servers | 16-128GB | Most available |
10.4 Model Update Challenges
OTA Model Update Challenges:
- Bandwidth limits: Transmitting large models wirelessly
- Battery drain: Energy for download + conversion process
- A/B testing: Need to store both old and new versions
- Rollback: Recovery mechanism when updates fail
- Integrity: Prevent model tampering (signature/hash verification)
Quiz
Q1: List and explain at least 5 key differences between Edge AI and Cloud AI.
- Latency: Edge AI offers 1-10ms local inference, while Cloud AI requires 50-200ms network round-trip
- Privacy: Edge AI keeps data on device, while Cloud AI requires data transmission
- Bandwidth: Edge AI transmits only results (minimal), while Cloud AI sends raw data (heavy)
- Offline capability: Edge AI works without network, Cloud AI cannot
- Compute power: Edge AI is limited (NPU with tens of TOPS), Cloud AI is nearly unlimited
- Model size: Edge AI is limited to MB-to-few-GB range, Cloud AI can handle hundreds of GB
- Cost structure: Edge AI has upfront hardware costs, Cloud AI has ongoing service costs
Q2: Explain the difference between PTQ and QAT, and compare their pros and cons.
PTQ (Post-Training Quantization):
- Quantizes an already-trained model without additional training
- Pros: Simple, no training data needed (only calibration data), fast to apply
- Cons: May have greater accuracy loss than QAT, especially at INT4
QAT (Quantization-Aware Training):
- Simulates quantization effects during the training process
- Pros: Minimizes accuracy loss, good quality even at INT4
- Cons: Requires additional training (data, GPU, time), more complex implementation
Generally, PTQ is sufficient for INT8, while QAT is recommended for INT4 and below.
Q3: Explain the FedAvg algorithm in Federated Learning.
The FedAvg (Federated Averaging) process:
- Model distribution: Server sends the global model to selected clients
- Local training: Each client trains the model on their local data (multiple epochs)
- Update collection: Trained model parameters are sent to the server (original data is never transmitted)
- Weighted averaging: Server updates the global model using a weighted average proportional to each client's data size
- Iteration: The above process repeats for multiple rounds
Key point: Data never leaves the device -- only model updates (weights) are exchanged with the server, preserving privacy.
Q4: Explain what Q4_K_M means in GGUF quantization.
GGUF quantization naming convention:
- Q: Quantization
- 4: Bit count (4-bit). Uses an average of 4 bits per weight
- K: K-quant method. Applies different quantization levels per block, maintaining higher precision for important layers
- M: Medium quality. Among S (Small/low quality), M (Medium), and L (Large/high quality)
Q4_K_M is 4-bit K-quant medium quality quantization, the most popular level that reduces model size by approximately 4-5x while minimizing quality loss. For a 7B model, this results in about 4.58GB, runnable on smartphones.
Q5: What is the difference between structured and unstructured pruning?
Unstructured Pruning:
- Sets individual weights to zero, creating sparse matrices
- Can achieve high sparsity (90%+)
- Large theoretical compute savings, but limited actual speedup on general hardware
- Requires specialized sparse matrix hardware/libraries
Structured Pruning:
- Removes entire channels, filters, or Attention Heads as structural units
- Resulting model operates with regular dense tensor operations
- Achieves actual speedup even on general hardware
- Sparsity levels may be lower than unstructured (around 30-50%)
Practically, structured pruning is more effective for actual inference speedup, while unstructured pruning is effective on specialized hardware like NVIDIA Sparse Tensor Cores.
References
- TensorRT Developer Guide
- ONNX Runtime Documentation
- TensorFlow Lite Guide
- CoreML Documentation
- OpenVINO Documentation
- llama.cpp Repository
- MLC LLM Documentation
- MLX Documentation
- Flower Federated Learning
- Opacus Differential Privacy
- NVIDIA Jetson Developer
- MediaPipe Solutions
- Qualcomm AI Engine
- Edge AI and Vision Alliance