Edge AI Complete Guide 2025: On-Device Inference, Model Optimization, TensorRT/ONNX/CoreML
By Youngju Kim (@fjvbn20031)
Introduction: Why Edge AI?
The future of AI is not just the cloud. An increasing amount of AI inference is running where data is generated -- on edge devices. Edge AI is a paradigm for running AI models directly on smartphones, IoT devices, vehicles, and medical devices without sending data to the cloud.
5 reasons Edge AI is essential:
- Latency: Millisecond-level responses without cloud round-trips. For autonomous vehicles, even 100ms delay can be fatal.
- Privacy: Sensitive data (faces, voice, medical) never leaves the device.
- Bandwidth savings: Cameras generate gigabytes of video per second, but only inference results need to be transmitted.
- Cost reduction: Run inference locally without cloud GPU costs.
- Offline operation: AI features work without network connectivity.
1. Edge AI vs Cloud AI Comparison
| Dimension | Edge AI | Cloud AI |
|---|---|---|
| Latency | 1-10ms (local) | 50-200ms (network round-trip) |
| Privacy | Data stays on device | Data must be sent to cloud |
| Bandwidth | Minimal (results only) | Heavy (raw data transfer) |
| Cost | Upfront hardware cost | Ongoing cloud costs |
| Offline | Fully supported | Not possible |
| Model size | Limited (MB to a few GB) | Unlimited (hundreds of GB) |
| Compute power | Limited (NPU/GPU) | Nearly unlimited |
| Updates | OTA update required | Instant deployment |
| Scalability | Proportional to device count | Elastic with cloud resources |
| Accuracy | Slight degradation possible | Maximum accuracy |
2. Inference Runtime Comparison
2.1 TensorRT (NVIDIA)
A high-performance inference engine exclusive to NVIDIA GPUs.
```python
import numpy as np
import tensorrt as trt

# Build TensorRT engine
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Load ONNX model
with open("model.onnx", "rb") as f:
    parser.parse(f.read())

# Optimization profile configuration
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1GB

# Enable INT8 quantization (MyCalibrator is a user-defined IInt8Calibrator)
config.set_flag(trt.BuilderFlag.INT8)
config.int8_calibrator = MyCalibrator(calibration_data)

# Enable FP16 as well; TensorRT picks the best precision per layer
config.set_flag(trt.BuilderFlag.FP16)

# Build and serialize engine
serialized_engine = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(serialized_engine)
```
Key TensorRT optimization techniques:
1. Layer fusion: Combines Conv + BatchNorm + ReLU into a single kernel
2. Kernel auto-tuning: Automatic selection of optimal CUDA kernels per hardware
3. Dynamic tensor memory: Reduces total usage through memory reuse
4. Precision calibration: Minimizes accuracy loss during INT8 quantization
5. Multi-stream execution: Runs multiple inferences in parallel
2.2 ONNX Runtime
A cross-platform inference engine that works across various hardware.
```python
import numpy as np
import onnxruntime as ort

# Session options
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.intra_op_num_threads = 4
sess_options.inter_op_num_threads = 2

# Execution Provider selection (highest priority first)
# CPU: CPUExecutionProvider
# GPU: CUDAExecutionProvider, TensorrtExecutionProvider
# Mobile: CoreMLExecutionProvider, NNAPIExecutionProvider
providers = [
    ('TensorrtExecutionProvider', {
        'trt_max_workspace_size': 2147483648,  # 2GB
        'trt_fp16_enable': True,
        'trt_int8_enable': True,
    }),
    ('CUDAExecutionProvider', {
        'device_id': 0,
        'arena_extend_strategy': 'kNextPowerOfTwo',
    }),
    'CPUExecutionProvider',
]
session = ort.InferenceSession("model.onnx", sess_options, providers=providers)

# Run inference
input_name = session.get_inputs()[0].name
output = session.run(None, {input_name: input_data})
```
2.3 TensorFlow Lite (TFLite)
A lightweight inference engine for mobile and embedded devices.
```python
import numpy as np
import tensorflow as tf

# Convert to TFLite model
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")

# Quantization settings
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Full integer quantization (INT8); calibration_data is a list of sample inputs
def representative_dataset():
    for data in calibration_data:
        yield [data.astype(np.float32)]

converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()

# Save model
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)

# TFLite inference
interpreter = tf.lite.Interpreter(model_path="model_int8.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]['index'])
```
2.4 CoreML (Apple)
Apple's inference framework for running models on-device across iPhone, iPad, and Mac.
```python
import coremltools as ct
import torch

# PyTorch -> CoreML conversion
model = torch.load("model.pt")
model.eval()
traced_model = torch.jit.trace(model, torch.randn(1, 3, 224, 224))

coreml_model = ct.convert(
    traced_model,
    inputs=[ct.ImageType(name="image", shape=(1, 3, 224, 224))],
    compute_precision=ct.precision.FLOAT16,
    compute_units=ct.ComputeUnit.ALL,  # CPU + GPU + Neural Engine
)

# Add metadata
coreml_model.author = "ML Team"
coreml_model.short_description = "Image classifier"
coreml_model.save("MyModel.mlpackage")
```

```swift
// Using CoreML in Swift
import CoreML
import Vision

let model = try! MyModel(configuration: MLModelConfiguration())
let request = VNCoreMLRequest(model: try! VNCoreMLModel(for: model.model))
let handler = VNImageRequestHandler(cgImage: image, options: [:])
try! handler.perform([request])

if let results = request.results as? [VNClassificationObservation] {
    print("Top prediction: \(results.first!.identifier)")
}
```
2.5 OpenVINO (Intel)
An inference engine for Intel hardware.
```python
import openvino as ov

core = ov.Core()

# Load and compile model
model = core.read_model("model.xml")
compiled_model = core.compile_model(model, "CPU", config={
    "PERFORMANCE_HINT": "LATENCY",
    "INFERENCE_NUM_THREADS": "4",
})

# INT8 quantization (NNCF)
import nncf

calibration_dataset = nncf.Dataset(calibration_loader)
quantized_model = nncf.quantize(
    model,
    calibration_dataset,
    preset=nncf.QuantizationPreset.MIXED,
    subset_size=300,
)

# Save quantized model
ov.save_model(quantized_model, "model_int8.xml")
```
2.6 Runtime Comparison Summary
| Runtime | Platform | Key Hardware | Languages |
|---|---|---|---|
| TensorRT | Linux/Windows | NVIDIA GPU | C++, Python |
| ONNX Runtime | Cross-platform | CPU/GPU/NPU | C++, Python, C#, Java |
| TFLite | Mobile/Embedded | CPU/GPU/NPU | C++, Java, Swift, Python |
| CoreML | Apple only | ANE/GPU/CPU | Swift, Objective-C |
| OpenVINO | Intel only | CPU/iGPU/VPU | C++, Python |
| MLC LLM | Cross-platform | CPU/GPU/NPU | C++, Python, Swift |
3. Model Optimization Techniques
3.1 Quantization
Quantization is a core technique that reduces model weight precision to improve size and inference speed.
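The arithmetic underneath INT8 quantization can be sketched in a few lines of NumPy. This is a toy asymmetric affine scheme (one scale and zero-point for the whole tensor); real toolchains add per-channel scales and calibration:

```python
import numpy as np

def quantize_int8(x):
    # Asymmetric affine quantization: map [min, max] onto [-128, 127]
    scale = (x.max() - x.min()) / 255.0
    zero_point = -128 - int(round(x.min() / scale))
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8), scale, zero_point

def dequantize(q, scale, zero_point):
    # Recover an approximation of the original float values
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.default_rng(0).normal(size=1000).astype(np.float32)
q, scale, zp = quantize_int8(weights)
error = np.abs(dequantize(q, scale, zp) - weights).max()
# Reconstruction error stays within about one quantization step (scale)
```

The 4x size reduction versus FP32 comes directly from storing one byte instead of four per weight; the accuracy question is entirely about how well `scale` and `zero_point` cover the actual value distribution, which is what calibration tunes.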
Post-Training Quantization (PTQ)
Quantizes models after training without additional training.
```python
# PyTorch PTQ example
import torch
from torch.quantization import quantize_dynamic

# Dynamic quantization (simplest)
model_fp32 = load_model()
model_int8 = quantize_dynamic(
    model_fp32,
    {torch.nn.Linear, torch.nn.LSTM},
    dtype=torch.qint8
)

# Static quantization (higher performance)
from torch.quantization import prepare, convert, get_default_qconfig

model_fp32.eval()
model_fp32.qconfig = get_default_qconfig('x86')
model_prepared = prepare(model_fp32)

# Collect activation statistics with calibration data
with torch.no_grad():
    for batch in calibration_loader:
        model_prepared(batch)

model_int8 = convert(model_prepared)
```
Quantization-Aware Training (QAT)
Simulates quantization effects during training to minimize accuracy loss.
```python
import torch
from torch.quantization import prepare_qat, convert

model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = prepare_qat(model)

# QAT training loop (same as regular training)
optimizer = torch.optim.Adam(model_prepared.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()
for epoch in range(num_epochs):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        output = model_prepared(inputs)
        loss = criterion(output, targets)
        loss.backward()
        optimizer.step()

# Convert to quantized model
model_prepared.eval()
model_int8 = convert(model_prepared)
```
LLM Quantization Techniques
```python
# GPTQ (GPU-based quantization)
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
)
model = AutoGPTQForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantize_config
)
model.quantize(calibration_data)
model.save_quantized("llama-3.1-8b-gptq-4bit")
```

```python
# AWQ (Activation-aware Weight Quantization)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("llama-3.1-8b-awq-4bit")
```

```python
# bitsandbytes (simple load-time quantization)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
```
Precision-level comparison (7B-parameter model):

| Precision | Model Size | Memory | Speed Gain | Accuracy Loss |
|---|---|---|---|---|
| FP32 | 28GB | 28GB | 1x | Baseline |
| FP16 | 14GB | 14GB | 2x | Negligible |
| INT8 | 7GB | 7GB | 3-4x | Minimal (under 0.5%) |
| INT4 | 3.5GB | 3.5GB | 4-6x | Slight (1-2%) |
3.2 Pruning
Pruning removes unnecessary weights or neurons from a model.
Unstructured Pruning
Creates sparse models by zeroing out individual weights.
```python
import torch
import torch.nn.utils.prune as prune

# Magnitude-based pruning
model = load_model()

# Apply 50% global pruning across all Linear/Conv2d weights
parameters_to_prune = [
    (module, 'weight') for module in model.modules()
    if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d))
]
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.5,  # 50% pruning
)

# Permanently apply pruning masks
for module, name in parameters_to_prune:
    prune.remove(module, name)

# Check sparsity
total = 0
zero = 0
for p in model.parameters():
    total += p.numel()
    zero += (p == 0).sum().item()
print(f"Sparsity: {zero/total:.2%}")
```
Structured Pruning
Removes entire channels or heads for actual speed improvements.
```python
# Channel pruning example (the same idea applies to attention heads)
import torch.nn.utils.prune as prune

# Prune 30% of output channels in a specific conv layer by L2 norm
prune.ln_structured(
    model.layer1.conv1,
    name='weight',
    amount=0.3,
    n=2,   # Ln norm used to rank channels (here L2)
    dim=0  # Output channel dimension
)
```
Movement Pruning
Identifies and removes unimportant weights during fine-tuning.
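The scoring idea can be illustrated with a hand-rolled NumPy sketch (a toy linear regression, not a production recipe): each weight accumulates the movement score -w * grad during fine-tuning, and weights whose score is low, meaning they are being pushed toward zero, get pruned.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))   # weights of a toy linear model y = W @ x
scores = np.zeros_like(W)     # accumulated movement scores

for _ in range(200):          # toy fine-tuning loop
    x = rng.normal(size=(8, 16))
    y = rng.normal(size=(4, 16))                 # random regression targets
    grad = 2 * (W @ x - y) @ x.T / x.shape[1]    # dMSE/dW
    scores += -W * grad       # negative score => weight is moving toward zero
    W -= 0.05 * grad          # plain SGD step

# Prune the half of the weights with the lowest movement scores
threshold = np.median(scores)
mask = scores > threshold
W *= mask
sparsity = 1 - mask.mean()    # roughly 0.5 by construction
```

Unlike magnitude pruning, the decision here depends on the training dynamics (which way a weight is moving), not on its current size, which is why movement pruning tends to preserve accuracy better in transfer/fine-tuning settings.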
Pruning Technique Comparison:

| Technique | Sparsity Pattern | Actual Speedup | Accuracy |
|---|---|---|---|
| Magnitude-based | Unstructured | Limited* | High |
| Structured | Structured | High | Medium |
| Movement Pruning | Unstructured | Limited* | Very High |

\* Unstructured pruning requires specialized sparse-matrix hardware/libraries for real speedups
3.3 Knowledge Distillation
Transfers knowledge from a large teacher model to a smaller student model.
```python
import torch
import torch.nn.functional as F

class DistillationLoss(torch.nn.Module):
    def __init__(self, temperature=4.0, alpha=0.5):
        super().__init__()
        self.temperature = temperature
        self.alpha = alpha
        self.ce_loss = torch.nn.CrossEntropyLoss()

    def forward(self, student_logits, teacher_logits, labels):
        # Soft target loss (KL divergence between temperature-softened distributions)
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / self.temperature, dim=1),
            F.softmax(teacher_logits / self.temperature, dim=1),
            reduction='batchmean'
        ) * (self.temperature ** 2)
        # Hard target loss (cross-entropy against ground-truth labels)
        hard_loss = self.ce_loss(student_logits, labels)
        return self.alpha * soft_loss + (1 - self.alpha) * hard_loss

# Training loop
distillation_loss = DistillationLoss()
teacher_model.eval()
student_model.train()
for inputs, labels in train_loader:
    with torch.no_grad():
        teacher_logits = teacher_model(inputs)
    student_logits = student_model(inputs)
    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
Notable knowledge distillation success stories (teacher -> student):
- BERT-base (110M) -> DistilBERT (66M): 40% smaller, retains 97% of performance
- GPT-4 -> Phi-3 Mini (3.8B): large-model knowledge transferred to a small model
- Llama 70B -> Llama 8B: knowledge distillation plus synthetic data for maximum performance
3.4 Neural Architecture Search (NAS)
Automatically searches for optimal model architectures.
Efficient architectures discovered by NAS:

| Model | Parameters | Top-1 Accuracy | Inference Speed |
|---|---|---|---|
| EfficientNet-B0 | 5.3M | 77.3% | Very fast |
| MobileNetV3 | 5.4M | 75.2% | Very fast |
| EfficientNetV2 | 21M | 85.7% | Fast |
| MnasNet | 4.2M | 74.0% | Very fast |
4. Hardware Landscape
4.1 NVIDIA (Jetson, T4/L4)
NVIDIA Edge AI Hardware:

| Device | TOPS | Memory | TDP | Use Case |
|---|---|---|---|---|
| Jetson Orin Nano | 40 | 4-8GB | 7-15W | Embedded AI |
| Jetson Orin NX | 100 | 8-16GB | 10-25W | Robotics, drones |
| Jetson AGX Orin | 275 | 32-64GB | 15-60W | Autonomous driving |
| T4 (datacenter) | 130 | 16GB | 70W | Edge servers |
| L4 (datacenter) | 120 | 24GB | 72W | Video AI |
```bash
# Running TensorRT inference on Jetson (after JetPack SDK installation)

# Model conversion (ONNX -> TensorRT)
/usr/src/tensorrt/bin/trtexec \
  --onnx=model.onnx \
  --saveEngine=model.engine \
  --fp16 \
  --workspace=2048 \
  --minShapes=input:1x3x224x224 \
  --optShapes=input:4x3x224x224 \
  --maxShapes=input:8x3x224x224

# Benchmark at batch size 4 (explicit-batch engines take --shapes)
/usr/src/tensorrt/bin/trtexec --loadEngine=model.engine --shapes=input:4x3x224x224
```
4.2 Apple (Neural Engine, CoreML, MLX)
Apple Silicon AI Performance:

| Chip | Neural Engine TOPS | GPU | Unified Memory |
|---|---|---|---|
| M1 | 11 | 8-core | 8-16GB |
| M2 | 15.8 | 10-core | 8-24GB |
| M3 | 18 | 10-core | 8-36GB |
| M4 | 38 | 10-core | 16-64GB |
| A17 Pro | 35 | 6-core | 8GB |
| A18 Pro | 35 | 6-core | 8GB |
```python
# MLX framework (Apple Silicon native)
import mlx.core as mx
import mlx.nn as nn

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(784, 256)
        self.linear2 = nn.Linear(256, 10)

    def __call__(self, x):
        x = nn.relu(self.linear1(x))
        return self.linear2(x)

model = SimpleModel()
input_data = mx.random.normal((1, 784))
output = model(input_data)
mx.eval(output)  # Force MLX's lazy computation graph to execute
```
4.3 Qualcomm (Snapdragon NPU)
Qualcomm AI Engine:

| Chip | NPU TOPS | Use Case |
|---|---|---|
| Snapdragon 8 Gen 3 | 45 | Flagship smartphones |
| Snapdragon 8 Gen 2 | 36 | Premium smartphones |
| Snapdragon 7+ Gen 2 | 13 | Mid-range smartphones |
| QCS6490 | 12 | IoT, cameras |
4.4 Google (Edge TPU, MediaPipe)
```bash
# Edge TPU model compilation
edgetpu_compiler --min_runtime_version 15 model_int8.tflite
# Result: model_int8_edgetpu.tflite
```

```python
# MediaPipe real-time hand tracking
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
hands = mp_hands.Hands(
    static_image_mode=False,
    max_num_hands=2,
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5,
)

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        for hand_landmarks in results.multi_hand_landmarks:
            # Process the 21 landmarks of each detected hand
            pass
```
4.5 Intel (OpenVINO, Movidius)
Intel AI Accelerators:

| Hardware | Performance | Use Case |
|---|---|---|
| Core Ultra NPU | 10-34 TOPS | Laptop/desktop AI |
| Arc GPU | Variable | Desktop/workstation |
| Movidius VPU | 4 TOPS | IoT, cameras (discontinued) |
| Gaudi 2/3 | Server-grade | Edge servers |
5. On-Device LLMs
5.1 llama.cpp
A lightweight LLM inference engine written in C/C++.
```bash
# Build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j$(nproc)

# Metal support (macOS)
make LLAMA_METAL=1 -j$(nproc)

# CUDA support
make LLAMA_CUDA=1 -j$(nproc)

# Model quantization (GGUF format)
./llama-quantize models/llama-3.1-8b-f16.gguf \
  models/llama-3.1-8b-q4_k_m.gguf Q4_K_M

# Run inference (-ngl = number of layers offloaded to the GPU)
./llama-cli \
  -m models/llama-3.1-8b-q4_k_m.gguf \
  -p "Explain quantum computing in simple terms:" \
  -n 256 \
  -ngl 99
```
5.2 GGUF Quantization Levels
GGUF Quantization Comparison (Llama 3.1 8B):

| Quant | Size | Memory | Quality | Speed |
|---|---|---|---|---|
| Q2_K | 2.96GB | 5.4GB | Low | Very fast |
| Q3_K_M | 3.52GB | 6.0GB | Moderate | Fast |
| Q4_K_M | 4.58GB | 7.0GB | Good | Fast |
| Q5_K_M | 5.33GB | 7.8GB | Very good | Moderate |
| Q6_K | 6.14GB | 8.6GB | Excellent | Moderate |
| Q8_0 | 7.95GB | 10.4GB | Near-original | Slow |
| F16 | 15.0GB | 17.5GB | Original | Slow |
5.3 MLC LLM
A framework for running LLMs across diverse platforms.
```bash
# Install MLC LLM
pip install mlc-llm

# Compile model (Vulkan backend)
mlc_llm compile ./dist/Llama-3.1-8B-q4f16_1-MLC/ \
  --device vulkan \
  --output ./dist/Llama-3.1-8B-q4f16_1-vulkan/

# Compile for iOS (use --device android for Android)
mlc_llm compile ./dist/Llama-3.1-8B-q4f16_1-MLC/ \
  --device iphone \
  --output ./dist/Llama-3.1-8B-q4f16_1-ios/
```
5.4 MLX (Apple Silicon)
```python
# LLM inference with MLX
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")

prompt = "What is machine learning?"
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=256,
    temp=0.7,
)
print(response)
```
5.5 LLMs That Run on Edge Devices
On-Device LLM Comparison:

| Model | Size | RAM Req | Speed (tok/s) | Quality |
|---|---|---|---|---|
| Phi-3 Mini (3.8B) | 2.3GB | 4GB | 20-40 | Excellent |
| Gemma 2 (2B) | 1.4GB | 3GB | 30-50 | Good |
| Llama 3.2 (1B) | 0.7GB | 2GB | 40-60 | Moderate |
| Llama 3.2 (3B) | 1.8GB | 3.5GB | 25-40 | Good |
| Qwen2.5 (3B) | 1.8GB | 3.5GB | 25-40 | Good |
| SmolLM (1.7B) | 1.0GB | 2.5GB | 35-55 | Moderate |

\* Based on Q4_K_M quantization; approximate figures on smartphones
6. Federated Learning
6.1 FedAvg Algorithm
The most fundamental algorithm in federated learning.
```python
# FedAvg sketch (train_local is the client-side training routine)
import copy
import random

def federated_averaging(global_model, clients, rounds, local_epochs, fraction=0.3):
    for round_num in range(rounds):
        # 1. Select a fraction of clients and distribute the global model
        selected_clients = random.sample(clients, max(1, int(fraction * len(clients))))
        client_models = []
        client_sizes = []
        for client in selected_clients:
            # 2. Local training on each client
            local_model = copy.deepcopy(global_model)
            local_model = train_local(
                local_model,
                client.data,
                epochs=local_epochs,
                lr=0.01
            )
            client_models.append(local_model.state_dict())
            client_sizes.append(len(client.data))
        # 3. Update global model with a data-size-weighted average
        total_size = sum(client_sizes)
        new_global = {}
        for key in global_model.state_dict():
            new_global[key] = sum(
                client_models[i][key] * (client_sizes[i] / total_size)
                for i in range(len(client_models))
            )
        global_model.load_state_dict(new_global)
    return global_model
```
6.2 Flower Framework
```python
# Flower server
import flwr as fl

strategy = fl.server.strategy.FedAvg(
    fraction_fit=0.3,       # 30% of clients train each round
    fraction_evaluate=0.2,  # 20% of clients evaluate
    min_fit_clients=2,      # Minimum participating clients
    min_evaluate_clients=2,
    min_available_clients=3,
)
fl.server.start_server(
    server_address="0.0.0.0:8080",
    config=fl.server.ServerConfig(num_rounds=10),
    strategy=strategy,
)
```

```python
# Flower client (train/test are user-defined training and evaluation routines)
import flwr as fl
import torch

class FlowerClient(fl.client.NumPyClient):
    def __init__(self, model, trainloader, testloader):
        self.model = model
        self.trainloader = trainloader
        self.testloader = testloader

    def get_parameters(self, config):
        return [val.detach().cpu().numpy() for val in self.model.parameters()]

    def set_parameters(self, parameters):
        for param, new_val in zip(self.model.parameters(), parameters):
            param.data = torch.tensor(new_val)

    def fit(self, parameters, config):
        self.set_parameters(parameters)
        train(self.model, self.trainloader, epochs=1)
        return self.get_parameters(config), len(self.trainloader.dataset), {}

    def evaluate(self, parameters, config):
        self.set_parameters(parameters)
        loss, accuracy = test(self.model, self.testloader)
        return float(loss), len(self.testloader.dataset), {"accuracy": float(accuracy)}

fl.client.start_numpy_client(
    server_address="localhost:8080",
    client=FlowerClient(model, trainloader, testloader),
)
```
6.3 Differential Privacy
```python
# Differential privacy training with Opacus
import torch
from opacus import PrivacyEngine

model = create_model()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
criterion = torch.nn.CrossEntropyLoss()

privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    epochs=10,
    target_epsilon=1.0,  # Privacy budget
    target_delta=1e-5,   # Failure probability
    max_grad_norm=1.0,   # Per-sample gradient clipping
)

# Training (same loop as regular training)
for inputs, targets in train_loader:
    optimizer.zero_grad()
    output = model(inputs)
    loss = criterion(output, targets)
    loss.backward()
    optimizer.step()

# Check consumed privacy budget
epsilon = privacy_engine.get_epsilon(delta=1e-5)
print(f"Epsilon: {epsilon:.2f}")
```
7. Privacy-Preserving AI
7.1 On-Device Processing
An approach that processes sensitive data without sending it off the device.
On-Device Privacy Preservation Examples:

| Use Case | Technology | Privacy Guarantee |
|---|---|---|
| Apple Face ID | Neural Engine + Secure Enclave | Facial data processed on-device |
| Google keyboard prediction | Federated learning | Typing data never sent to server |
| Apple Siri voice | CoreML on-device | Voice data processed locally |
| Samsung Knox AI | NPU + TEE | Enterprise data isolation |
7.2 Secure Aggregation
Encrypts individual model updates in federated learning so even the server cannot see them.
Secure Aggregation protocol:
1. Each pair of clients exchanges seeds and derives pairwise random masks
2. Each client adds its masks to its local model update before sending
3. The server collects only the masked updates
4. The pairwise masks cancel out during aggregation, revealing only the sum
5. Individual updates cannot be recovered, even by the server
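The steps above can be demonstrated with a toy NumPy sketch. The pairwise seeds here are hard-coded; a real protocol derives them via key agreement and adds recovery for dropped clients:

```python
import numpy as np

n_clients, dim = 3, 5
rng = np.random.default_rng(42)
updates = [rng.normal(size=dim) for _ in range(n_clients)]  # raw local updates

# Pairwise masks: for every pair i < j, client i adds mask m_ij, client j subtracts it
masks = [np.zeros(dim) for _ in range(n_clients)]
for i in range(n_clients):
    for j in range(i + 1, n_clients):
        pair_rng = np.random.default_rng(1000 * i + j)  # seed shared by pair (i, j)
        m = pair_rng.normal(size=dim)
        masks[i] += m
        masks[j] -= m

# The server only ever sees masked updates ...
masked = [u + m for u, m in zip(updates, masks)]
# ... but the masks cancel in the sum, recovering the exact aggregate
aggregate = np.sum(masked, axis=0)
```

Because each individual `masked[i]` is the true update plus large random noise, the server learns nothing about any single client, yet the FedAvg-style sum it needs is exact.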
8. Deployment Pipeline
8.1 Model Conversion Workflow
Training framework -> intermediate format -> inference engine -> device:

```
PyTorch ---+                 +---> TensorRT (.engine)  ---> NVIDIA GPU
           +---> ONNX -------+---> ONNX Runtime        ---> Cross-platform
TensorFlow-+                 +---> OpenVINO (.xml)     ---> Intel
           +---> TFLite  --------> TFLite Runtime      ---> Mobile/IoT
           +---> CoreML  --------> CoreML Runtime      ---> Apple
           +---> GGUF    --------> llama.cpp           ---> All platforms
```
8.2 OTA (Over-The-Air) Model Updates
```python
# Model version management and OTA update example
import hashlib
import requests

class ModelManager:
    def __init__(self, model_dir, manifest_url):
        self.model_dir = model_dir
        self.manifest_url = manifest_url

    def check_update(self):
        """Check the latest model manifest on the server"""
        manifest = requests.get(self.manifest_url).json()
        current_version = self.get_current_version()
        if manifest["version"] > current_version:
            return manifest
        return None

    def download_model(self, manifest):
        """Delta update or full download"""
        if manifest.get("delta_url"):
            # Download delta patch (bandwidth savings)
            patch = requests.get(manifest["delta_url"]).content
            self.apply_delta(patch)
        else:
            # Full model download
            model_data = requests.get(manifest["model_url"]).content
            self.save_model(model_data, manifest)

    def validate_model(self, model_path, expected_hash):
        """Verify integrity with a SHA-256 hash"""
        with open(model_path, "rb") as f:
            file_hash = hashlib.sha256(f.read()).hexdigest()
        return file_hash == expected_hash

    def rollback(self):
        """Roll back to the previous version on failure"""
        # Previous-version restoration logic
        pass
```
8.3 Model Packaging
```yaml
# Model manifest (model_manifest.yaml)
model_name: "image-classifier-v2"
version: "2.1.0"
framework: "tflite"
file: "model_int8.tflite"
input_shape: [1, 224, 224, 3]
input_type: "uint8"
output_shape: [1, 1000]
labels_file: "labels.txt"
hardware_requirements:
  min_ram_mb: 256
  supported_delegates: ["gpu", "nnapi", "xnnpack"]
metrics:
  accuracy: 0.943
  latency_ms: 12.5
  model_size_mb: 4.2
```
9. Use Cases
9.1 Autonomous Driving
Autonomous Driving Edge AI Stack:

| Sensor | AI Task | Hardware | Latency Req |
|---|---|---|---|
| Cameras (8-12) | Object detection | NVIDIA Orin | Under 10ms |
| LiDAR | 3D point cloud | NVIDIA Orin | Under 20ms |
| Radar | Distance/speed estimation | DSP | Under 5ms |
| Fusion | Sensor fusion/decision | NVIDIA Orin | Under 30ms |
| Planning | Path planning | CPU/GPU | Under 50ms |
9.2 Smart Cameras
```python
# Real-time object detection on an edge device
# (preprocess and draw_box are user-defined helper functions)
import cv2
import numpy as np
import tensorflow as tf

# Load TFLite model (SSD MobileNet)
interpreter = tf.lite.Interpreter(
    model_path="ssd_mobilenet_v2_int8.tflite",
    num_threads=4
)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

cap = cv2.VideoCapture(0)
while True:
    ret, frame = cap.read()
    if not ret:
        break
    # Preprocessing
    input_data = preprocess(frame, target_size=(300, 300))
    # Inference
    interpreter.set_tensor(input_details[0]['index'], input_data)
    interpreter.invoke()
    # Parse results
    boxes = interpreter.get_tensor(output_details[0]['index'])
    classes = interpreter.get_tensor(output_details[1]['index'])
    scores = interpreter.get_tensor(output_details[2]['index'])
    # Display high-confidence detections only
    for i in range(len(scores[0])):
        if scores[0][i] > 0.5:
            draw_box(frame, boxes[0][i], classes[0][i], scores[0][i])
```
9.3 Voice Assistants
On-Device Voice Processing Pipeline:
1. Wake word detection: tiny model (50KB-1MB), always-on, runs on the NPU
2. Voice activity detection (VAD): speech-segment identification, very lightweight
3. Speech recognition (ASR): Whisper Tiny/Base (39M-74M parameters)
4. Natural language understanding (NLU): intent classification + entity recognition
5. Response generation: small LLM or template-based
6. Text-to-speech (TTS): convert text back to audio
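Stage 2 (VAD) can be as simple as a per-frame energy threshold. A toy NumPy sketch (production systems typically use small learned models instead, and the 0.01 threshold here is an arbitrary illustration value):

```python
import numpy as np

def energy_vad(samples, sr=16000, frame_ms=30, threshold=0.01):
    # Split audio into fixed-size frames and flag frames above an energy threshold
    frame = int(sr * frame_ms / 1000)
    n = len(samples) // frame
    frames = samples[:n * frame].reshape(n, frame)
    energy = (frames ** 2).mean(axis=1)
    return energy > threshold  # True = speech-like frame

# Synthetic check: half a second of silence, then half a second of a 440 Hz tone
sr = 16000
t = np.arange(sr // 2) / sr
signal = np.concatenate([np.zeros(sr // 2), 0.5 * np.sin(2 * np.pi * 440 * t)])
flags = energy_vad(signal, sr=sr)
# Silent frames come back False, tone frames come back True
```

Gating the much heavier ASR model on frames like these is what keeps the always-on part of the pipeline within a wearable-class power budget.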
9.4 Medical Devices
Medical Edge AI Applications:

| Device | AI Task | Regulatory |
|---|---|---|
| Smartwatch | ECG arrhythmia detection | FDA Class II |
| Stethoscope | Heart/lung sound analysis | FDA Class II |
| Fundus camera | Diabetic retinopathy screening | FDA Class II |
| CT/MRI assist | Lesion highlighting | FDA Class III |
| Glucose monitor | Blood sugar trend prediction | FDA Class II |
9.5 Industrial IoT
Industrial Edge AI Applications:

| Area | AI Task | Benefit |
|---|---|---|
| Quality inspection | Visual defect detection | Real-time, 100% inspection |
| Predictive maintenance | Vibration/noise anomaly detection | 40% downtime reduction |
| Energy optimization | Consumption prediction | 15% energy savings |
| Safety monitoring | PPE detection | Accident prevention |
| Robot control | Path planning/obstacle avoidance | Autonomous operation |
10. Challenges and Limitations
10.1 Power Constraints
Power Efficiency Comparison:

| Device | Power | AI Perf | TOPS/W |
|---|---|---|---|
| NVIDIA Jetson Orin Nano | 7-15W | 40 TOPS | 2.7-5.7 |
| Apple A18 Pro | ~5W | 35 TOPS | 7.0 |
| Google Edge TPU | 2W | 4 TOPS | 2.0 |
| Qualcomm Snapdragon 8 Gen 3 | ~5W | 45 TOPS | 9.0 |
| Intel Core Ultra NPU | ~5W | 34 TOPS | 6.8 |
10.2 Thermal Management
Continuous AI inference on mobile devices causes thermal issues. Thermal throttling can degrade performance by 30-50%.
10.3 Memory Limitations
Memory Constraints by Device Type:

| Device Type | Typical RAM | AI-Available Memory |
|---|---|---|
| IoT sensors | 256KB-4MB | Extremely limited |
| Microcontrollers | 2-16MB | Very limited |
| Smart cameras | 256MB-2GB | Hundreds of MB |
| Smartphones | 6-16GB | 2-8GB |
| Edge servers | 16-128GB | Most available |
10.4 Model Update Challenges
OTA Model Update Challenges:
- Bandwidth limits: Transmitting large models wirelessly
- Battery drain: Energy for download + conversion process
- A/B testing: Need to store both old and new versions
- Rollback: Recovery mechanism when updates fail
- Integrity: Prevent model tampering (signature/hash verification)
Quiz
Q1: List and explain at least 5 key differences between Edge AI and Cloud AI.
- Latency: Edge AI offers 1-10ms local inference, while Cloud AI requires 50-200ms network round-trip
- Privacy: Edge AI keeps data on device, while Cloud AI requires data transmission
- Bandwidth: Edge AI transmits only results (minimal), while Cloud AI sends raw data (heavy)
- Offline capability: Edge AI works without network, Cloud AI cannot
- Compute power: Edge AI is limited (NPU with tens of TOPS), Cloud AI is nearly unlimited
- Model size: Edge AI is limited to MB-to-few-GB range, Cloud AI can handle hundreds of GB
- Cost structure: Edge AI has upfront hardware costs, Cloud AI has ongoing service costs
Q2: Explain the difference between PTQ and QAT, and compare their pros and cons.
PTQ (Post-Training Quantization):
- Quantizes an already-trained model without additional training
- Pros: Simple, no training data needed (only calibration data), fast to apply
- Cons: May have greater accuracy loss than QAT, especially at INT4
QAT (Quantization-Aware Training):
- Simulates quantization effects during the training process
- Pros: Minimizes accuracy loss, good quality even at INT4
- Cons: Requires additional training (data, GPU, time), more complex implementation
Generally, PTQ is sufficient for INT8, while QAT is recommended for INT4 and below.
Q3: Explain the FedAvg algorithm in Federated Learning.
The FedAvg (Federated Averaging) process:
- Model distribution: Server sends the global model to selected clients
- Local training: Each client trains the model on their local data (multiple epochs)
- Update collection: Trained model parameters are sent to the server (original data is never transmitted)
- Weighted averaging: Server updates the global model using a weighted average proportional to each client's data size
- Iteration: The above process repeats for multiple rounds
Key point: Data never leaves the device -- only model updates (weights) are exchanged with the server, preserving privacy.
Q4: Explain what Q4_K_M means in GGUF quantization.
GGUF quantization naming convention:
- Q: Quantization
- 4: Bit count (4-bit). Uses an average of 4 bits per weight
- K: K-quant method. Applies different quantization levels per block, maintaining higher precision for important layers
- M: Medium quality. Among S (Small/low quality), M (Medium), and L (Large/high quality)
Q4_K_M is 4-bit K-quant medium quality quantization, the most popular level that reduces model size by approximately 4-5x while minimizing quality loss. For a 7B model, this results in about 4.58GB, runnable on smartphones.
Q5: What is the difference between structured and unstructured pruning?
Unstructured Pruning:
- Sets individual weights to zero, creating sparse matrices
- Can achieve high sparsity (90%+)
- Large theoretical compute savings, but limited actual speedup on general hardware
- Requires specialized sparse matrix hardware/libraries
Structured Pruning:
- Removes entire channels, filters, or Attention Heads as structural units
- Resulting model operates with regular dense tensor operations
- Achieves actual speedup even on general hardware
- Sparsity levels may be lower than unstructured (around 30-50%)
Practically, structured pruning is more effective for actual inference speedup, while unstructured pruning is effective on specialized hardware like NVIDIA Sparse Tensor Cores.
References
- TensorRT Developer Guide
- ONNX Runtime Documentation
- TensorFlow Lite Guide
- CoreML Documentation
- OpenVINO Documentation
- llama.cpp Repository
- MLC LLM Documentation
- MLX Documentation
- Flower Federated Learning
- Opacus Differential Privacy
- NVIDIA Jetson Developer
- MediaPipe Solutions
- Qualcomm AI Engine
- Edge AI and Vision Alliance