Edge AI and On-Device ML Complete Guide: TFLite, ONNX, Core ML, llama.cpp

Table of Contents

  1. Edge AI Overview
  2. TensorFlow Lite (TFLite)
  3. ONNX and ONNX Runtime
  4. Core ML (Apple)
  5. NVIDIA Jetson Platform
  6. Raspberry Pi AI
  7. MediaPipe
  8. llama.cpp and GGUF
  9. Whisper.cpp
  10. Web Browser AI
  11. AI Model Optimization Pipeline

1. Edge AI Overview

Cloud AI vs Edge AI

Where AI inference runs depends heavily on the nature and requirements of your application. Traditional cloud AI sends data to a remote server, performs inference, and returns results. Edge AI, by contrast, runs inference directly on the device where data is generated — smartphones, IoT sensors, cameras, and more.

| Dimension | Cloud AI | Edge AI |
|---|---|---|
| Compute location | Remote server | Local device |
| Latency | Hundreds of ms to seconds | A few ms to tens of ms |
| Privacy | Data leaves device | Data stays on device |
| Internet dependency | Required | Not required |
| Cost | Per-API-call charges | One-time model cost |
| Model size limit | None | Memory/storage constrained |

Advantages of Edge AI

1. Privacy Protection

Sensitive data such as medical images, biometrics, and personal audio never leaves the device, which greatly simplifies compliance with GDPR, HIPAA, and other data-protection regulations.

2. Ultra-Low Latency

Applications like autonomous vehicles, industrial automation, real-time translation, and AR/VR require millisecond responses. Eliminating the network round trip makes response times far more predictable.

3. Cost Reduction

No cloud API call costs. At scale, running inference locally across millions of devices can cut central inference costs dramatically.

4. Offline Operation

AI features work in environments with unreliable or no internet connectivity — rural areas, underground locations, aircraft, etc.

5. Real-Time Data Processing

Filtering, anomaly detection, and classification of IoT sensor data locally before uploading dramatically reduces transmission volume and storage costs.
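As a concrete sketch of this pattern, the snippet below keeps a rolling window of sensor readings on-device and flags only statistical outliers for upload. The window size and z-score threshold are illustrative values, not from any particular product:

```python
import random
import statistics
from collections import deque

WINDOW = 50        # rolling window of recent readings kept on-device
Z_THRESHOLD = 3.0  # upload only readings more than 3 sigma from the rolling mean

history = deque(maxlen=WINDOW)

def should_upload(reading):
    """Decide locally whether a sensor reading is worth transmitting."""
    anomaly = False
    if len(history) >= 10:  # wait for enough context before flagging
        mean = statistics.fmean(history)
        std = statistics.pstdev(history)
        anomaly = std > 0 and abs(reading - mean) / std > Z_THRESHOLD
    history.append(reading)
    return anomaly

# Simulated stream: stable ~20.0 readings plus one obvious spike
random.seed(42)
stream = [20.0 + random.gauss(0, 0.1) for _ in range(100)] + [35.0]
uploaded = [r for r in stream if should_upload(r)]
print(f"uploaded {len(uploaded)} of {len(stream)} readings")
```

Only the anomalous readings cross the network; the rest never leave the device.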

Edge Hardware

Mobile GPUs and NPUs

Modern smartphones include dedicated AI hardware:

  • Apple Neural Engine (ANE): Available since iPhone 8 and in M-series Macs. A17 Pro delivers 35 TOPS
  • Qualcomm Hexagon DSP: Android flagships. Snapdragon 8 Gen 3 features Hexagon NPU
  • Google Tensor: Pixel-exclusive chip, optimized for on-device speech recognition and translation
  • MediaTek APU: Widely used in mid-range Android devices

Edge Computing Boards

  • NVIDIA Jetson: For autonomous driving, robotics, smart cameras. Jetson Orin delivers 275 TOPS
  • Raspberry Pi 5: 4GB/8GB memory, suitable for general computer vision tasks
  • Google Coral: Edge TPU for TFLite-specific acceleration
  • Intel Neural Compute Stick: USB inference accelerator

Edge AI Application Areas

  • Smartphones: Face unlock, photo classification, real-time translation, voice assistants
  • Smart Home: Voice command processing, motion detection, energy optimization
  • Industrial IoT: Defect detection, predictive maintenance, anomaly detection
  • Medical Devices: ECG analysis, glucose prediction, skin condition diagnosis
  • Autonomous Vehicles: Real-time object detection, lane recognition, obstacle avoidance
  • Agriculture: Drone-based crop monitoring, pest and disease detection

2. TensorFlow Lite (TFLite)

TensorFlow Lite is Google's lightweight ML framework for mobile and edge devices. It converts TensorFlow models to the TFLite format (.tflite) for deployment on Android, iOS, embedded Linux, and microcontrollers.

Converting Models (SavedModel to TFLite)

import tensorflow as tf

# Method 1: Convert from SavedModel
converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_dir')
tflite_model = converter.convert()

with open('model.tflite', 'wb') as f:
    f.write(tflite_model)

# Method 2: Convert directly from a Keras model
model = tf.keras.applications.MobileNetV2(weights='imagenet')
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

with open('mobilenetv2.tflite', 'wb') as f:
    f.write(tflite_model)

# Method 3: Convert from a concrete function
@tf.function(input_signature=[tf.TensorSpec(shape=[1, 224, 224, 3], dtype=tf.float32)])
def predict(x):
    return model(x)

converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [predict.get_concrete_function()]
)
tflite_model = converter.convert()

Quantization

Quantization converts model weights and activations from floating point to lower-precision integers, reducing model size and improving inference speed.

Float16 Quantization

converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_dir')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()
# ~50% size reduction, minimal accuracy loss

Full Integer (INT8) Quantization

import numpy as np

def representative_dataset():
    # Use 100-1000 representative samples from real data
    for _ in range(100):
        data = np.random.rand(1, 224, 224, 3).astype(np.float32)
        yield [data]

converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_dir')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()
# ~75% size reduction, 2-4x inference speedup

Dynamic Range Quantization

converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_dir')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# No representative dataset needed - quantizes weights only
tflite_model = converter.convert()
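The size reductions quoted above follow from simple arithmetic: each mode changes the bits stored per weight. A back-of-envelope estimator (the ~3.5M parameter count for MobileNetV2 and the fixed graph overhead are rough assumptions):

```python
def estimated_size_mb(num_params, bits_per_weight, overhead_mb=0.5):
    """Back-of-envelope model size: parameters x bits per weight, plus graph overhead."""
    return num_params * bits_per_weight / 8 / 1e6 + overhead_mb

PARAMS = 3_500_000  # MobileNetV2 has roughly 3.5M parameters
for label, bits in [("float32 (original)", 32), ("float16 quantized", 16), ("int8 quantized", 8)]:
    print(f"{label:>19}: ~{estimated_size_mb(PARAMS, bits):.1f} MB")
```

Halving the bits roughly halves the file, which is why float16 lands near 50% and int8 near 25% of the original size.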

TFLite Interpreter

import tensorflow as tf
import numpy as np
from PIL import Image

# Initialize interpreter
interpreter = tf.lite.Interpreter(model_path='mobilenetv2.tflite')
interpreter.allocate_tensors()

# Inspect input/output details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
print(f"Input shape: {input_details[0]['shape']}")
print(f"Input dtype: {input_details[0]['dtype']}")
print(f"Output shape: {output_details[0]['shape']}")

# Preprocess image (Keras MobileNetV2 expects inputs scaled to [-1, 1])
img = Image.open('test_image.jpg').resize((224, 224))
input_data = np.expand_dims(np.array(img, dtype=np.float32) / 127.5 - 1.0, axis=0)

# Run inference
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()

# Extract result
output_data = interpreter.get_tensor(output_details[0]['index'])
predicted_class = np.argmax(output_data[0])
confidence = output_data[0][predicted_class]
print(f"Predicted class: {predicted_class}, Confidence: {confidence:.4f}")
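When comparing interpreter configurations, it helps to measure latency consistently. The helper below is a generic timing sketch (not part of the TFLite API); warmup runs absorb one-time allocation costs before the median is taken:

```python
import time

def median_latency_ms(fn, warmup=5, runs=50):
    """Median wall-clock latency of fn() in milliseconds, after warmup runs."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    return samples[len(samples) // 2]

# With the interpreter above (input tensor already set):
# print(f"median invoke latency: {median_latency_ms(interpreter.invoke):.2f} ms")
```

The median is preferred over the mean here because edge devices often show occasional scheduling spikes that would skew an average.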

Multithreading and GPU Delegate

# Configure multithreading
interpreter = tf.lite.Interpreter(
    model_path='model.tflite',
    num_threads=4
)

# GPU delegate (shared-library name varies by platform and build)
try:
    from tensorflow.lite.python.interpreter import load_delegate
    gpu_delegate = load_delegate('libtensorflowlite_gpu_delegate.so')
    interpreter = tf.lite.Interpreter(
        model_path='model.tflite',
        experimental_delegates=[gpu_delegate]
    )
    print("GPU delegate activated")
except Exception as e:
    print(f"GPU delegate unavailable, using CPU: {e}")
    interpreter = tf.lite.Interpreter(model_path='model.tflite')

interpreter.allocate_tensors()

Android Deployment

build.gradle:

dependencies {
    implementation 'org.tensorflow:tensorflow-lite:2.13.0'
    implementation 'org.tensorflow:tensorflow-lite-gpu:2.13.0'
    implementation 'org.tensorflow:tensorflow-lite-support:0.4.4'
}

Kotlin code:

import android.content.Context
import android.graphics.Bitmap
import org.tensorflow.lite.Interpreter
import java.io.FileInputStream
import java.nio.ByteBuffer
import java.nio.ByteOrder
import java.nio.channels.FileChannel

class TFLiteClassifier(private val context: Context) {

    private lateinit var interpreter: Interpreter
    private val inputSize = 224
    private val numClasses = 1000

    fun initialize() {
        val model = loadModelFile("mobilenetv2.tflite")
        val options = Interpreter.Options().apply {
            setNumThreads(4)
            setUseNNAPI(true)  // Android Neural Networks API
        }
        interpreter = Interpreter(model, options)
    }

    private fun loadModelFile(filename: String): ByteBuffer {
        val assetFileDescriptor = context.assets.openFd(filename)
        val fileInputStream = FileInputStream(assetFileDescriptor.fileDescriptor)
        val fileChannel = fileInputStream.channel
        val startOffset = assetFileDescriptor.startOffset
        val declaredLength = assetFileDescriptor.declaredLength
        return fileChannel.map(FileChannel.MapMode.READ_ONLY, startOffset, declaredLength)
    }

    fun classify(bitmap: Bitmap): FloatArray {
        val resized = Bitmap.createScaledBitmap(bitmap, inputSize, inputSize, true)
        val inputBuffer = ByteBuffer.allocateDirect(1 * inputSize * inputSize * 3 * 4)
        inputBuffer.order(ByteOrder.nativeOrder())

        for (y in 0 until inputSize) {
            for (x in 0 until inputSize) {
                val pixel = resized.getPixel(x, y)
                inputBuffer.putFloat(((pixel shr 16 and 0xFF) / 255.0f))
                inputBuffer.putFloat(((pixel shr 8 and 0xFF) / 255.0f))
                inputBuffer.putFloat(((pixel and 0xFF) / 255.0f))
            }
        }

        val outputBuffer = Array(1) { FloatArray(numClasses) }
        interpreter.run(inputBuffer, outputBuffer)
        return outputBuffer[0]
    }
}

iOS Deployment

import TensorFlowLite
import UIKit

class TFLiteImageClassifier {
    private var interpreter: Interpreter
    private let inputWidth = 224
    private let inputHeight = 224

    init(modelName: String) throws {
        guard let modelPath = Bundle.main.path(forResource: modelName, ofType: "tflite") else {
            throw NSError(domain: "ModelNotFound", code: 0, userInfo: nil)
        }
        var options = Interpreter.Options()
        options.threadCount = 4
        interpreter = try Interpreter(modelPath: modelPath, options: options)
        try interpreter.allocateTensors()
    }

    func classify(image: UIImage) throws -> [Float] {
        guard let cgImage = image.cgImage else { return [] }
        let inputData = preprocessImage(cgImage: cgImage)
        try interpreter.copy(inputData, toInputAt: 0)
        try interpreter.invoke()
        let outputTensor = try interpreter.output(at: 0)
        // Copy the raw output bytes into [Float] using the standard library
        let results = outputTensor.data.withUnsafeBytes { Array($0.bindMemory(to: Float.self)) }
        return results
    }

    private func preprocessImage(cgImage: CGImage) -> Data {
        // Placeholder: a real implementation draws the image into a 224x224
        // RGB context and packs normalized Float32 pixel values into Data
        let data = Data(count: inputWidth * inputHeight * 3 * 4)
        return data
    }
}

3. ONNX and ONNX Runtime

ONNX (Open Neural Network Exchange) is an open format that makes ML models portable across frameworks. Models trained in PyTorch, TensorFlow, or scikit-learn can be exported to a single standard format and run with ONNX Runtime anywhere.

Converting PyTorch to ONNX

import torch
import torch.nn as nn
import torchvision.models as models

# Load model ('pretrained=True' is deprecated in recent torchvision)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.eval()

# Create dummy input
dummy_input = torch.randn(1, 3, 224, 224)

# Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "resnet50.onnx",
    export_params=True,
    opset_version=17,
    do_constant_folding=True,
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)
print("ONNX export complete!")

# Validate the model
import onnx
onnx_model = onnx.load("resnet50.onnx")
onnx.checker.check_model(onnx_model)
print(f"ONNX IR version: {onnx_model.ir_version}")
print(f"Opset version: {onnx_model.opset_import[0].version}")
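Beyond the structural check, it is worth confirming that the exported model produces numerically similar outputs to the original. A small sketch of such a parity check (the tolerances are typical defaults, not mandated by ONNX):

```python
import numpy as np

def outputs_match(a, b, rtol=1e-3, atol=1e-5):
    """Closeness check plus max-abs-diff report for two model outputs."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    max_diff = float(np.max(np.abs(a - b))) if a.size else 0.0
    return bool(np.allclose(a, b, rtol=rtol, atol=atol)), max_diff

# Typical usage with the objects from the snippets above:
# torch_out = model(dummy_input).detach().numpy()
# ort_out = session.run(None, {'input': dummy_input.numpy()})[0]
# ok, diff = outputs_match(torch_out, ort_out)
# print(f"match={ok}, max abs diff={diff:.2e}")
```

Small float32 discrepancies are normal after export; a max difference above ~1e-2 usually signals a real conversion problem.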

ONNX Runtime Inference

import onnxruntime as ort
import numpy as np
from PIL import Image
import torchvision.transforms as transforms

# Create session with execution providers
providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
session = ort.InferenceSession("resnet50.onnx", providers=providers)

print(f"Active providers: {session.get_providers()}")

input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name
input_shape = session.get_inputs()[0].shape
print(f"Input: {input_name}, shape: {input_shape}")

# Preprocess image
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

img = Image.open('test.jpg')
input_tensor = transform(img).unsqueeze(0).numpy()

# Run inference
outputs = session.run([output_name], {input_name: input_tensor})
logits = outputs[0]
predicted_class = np.argmax(logits[0])
print(f"Predicted class: {predicted_class}")

ONNX Runtime Optimization

from onnxruntime.transformers import optimizer
from onnxruntime.quantization import quantize_dynamic, QuantType

# Graph optimization for transformer models
optimized_model = optimizer.optimize_model(
    'bert_base.onnx',
    model_type='bert',
    num_heads=12,
    hidden_size=768
)
optimized_model.save_model_to_file('bert_optimized.onnx')

# Dynamic INT8 quantization
quantize_dynamic(
    model_input='bert_optimized.onnx',
    model_output='bert_quantized_int8.onnx',
    weight_type=QuantType.QInt8,
    per_channel=True
)
print("Quantization complete!")

# Session options tuning
so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
so.intra_op_num_threads = 4
so.inter_op_num_threads = 2
so.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

session = ort.InferenceSession('model.onnx', sess_options=so)

ONNX Runtime Web (Browser)

// npm install onnxruntime-web
import * as ort from 'onnxruntime-web'

async function runInference() {
  // Configure WebAssembly backend
  ort.env.wasm.wasmPaths = '/static/'
  ort.env.wasm.numThreads = 4

  // Create session
  const session = await ort.InferenceSession.create('/models/mobilenet.onnx', {
    executionProviders: ['webgpu', 'wasm'],
    graphOptimizationLevel: 'all',
  })

  // Create input tensor (1, 3, 224, 224)
  const inputData = new Float32Array(1 * 3 * 224 * 224).fill(0.5)
  const inputTensor = new ort.Tensor('float32', inputData, [1, 3, 224, 224])

  // Run inference
  const feeds = { input: inputTensor }
  const results = await session.run(feeds)

  const outputData = results.output.data
  const maxIndex = Array.from(outputData).indexOf(Math.max(...outputData))
  console.log('Predicted class:', maxIndex)
}

runInference()

4. Core ML (Apple)

Core ML is Apple's framework for running ML models on Apple platforms (iOS, macOS, watchOS, tvOS). It leverages the Neural Engine for power-efficient and fast inference.

Converting Models with coremltools

import coremltools as ct
import torch
import torchvision.models as models

# PyTorch model conversion ('pretrained=True' is deprecated in recent torchvision)
torch_model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT)
torch_model.eval()

example_input = torch.rand(1, 3, 224, 224)
traced_model = torch.jit.trace(torch_model, example_input)

# Convert to Core ML
mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(name='input', shape=(1, 3, 224, 224))],
    compute_units=ct.ComputeUnit.ALL,  # CPU + GPU + Neural Engine
    minimum_deployment_target=ct.target.iOS16
)

# Add metadata
mlmodel.short_description = "MobileNetV2 Image Classifier"
mlmodel.author = "YJ Blog"
mlmodel.version = "1.0"

mlmodel.save("MobileNetV2.mlpackage")
print("Core ML conversion complete!")

Float16 and INT8 Quantization

import coremltools as ct
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linearly_quantize_weights
)

mlmodel = ct.models.MLModel("MobileNetV2.mlpackage")

# Linear weight quantization (8-bit)
op_config = OpLinearQuantizerConfig(mode="linear_symmetric", dtype="int8")
config = OptimizationConfig(global_config=op_config)

compressed_model = linearly_quantize_weights(mlmodel, config)
compressed_model.save("MobileNetV2_int8.mlpackage")

# Palettization (4-bit)
from coremltools.optimize.coreml import palettize_weights, OpPalettizerConfig

palette_config = OptimizationConfig(
    global_config=OpPalettizerConfig(mode="kmeans", nbits=4)
)
palette_model = palettize_weights(mlmodel, palette_config)
palette_model.save("MobileNetV2_4bit.mlpackage")
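To build intuition for what palettization does, the toy sketch below runs a 1-D k-means over a weight matrix and replaces every weight with its nearest of 2**nbits centroids. coremltools' actual implementation is more sophisticated, but the storage idea is the same: a small palette of values plus low-bit indices.

```python
import numpy as np

def palettize(weights, nbits=4, iters=20):
    """Toy 1-D k-means palettization: 2**nbits centroids + per-weight indices."""
    flat = weights.ravel().astype(np.float32)
    k = 2 ** nbits
    centroids = np.linspace(flat.min(), flat.max(), k)  # even init across the range
    idx = np.zeros(flat.size, dtype=np.int64)
    for _ in range(iters):
        # Assign each weight to its nearest centroid, then update centroids
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            members = flat[idx == j]
            if members.size:
                centroids[j] = members.mean()
    return centroids, idx.reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.05, size=(64, 64)).astype(np.float32)
palette, indices = palettize(w, nbits=4)
recon = palette[indices]
print("unique values:", len(np.unique(recon)))
print("max abs error:", float(np.abs(w - recon).max()))
```

With 4 bits there are at most 16 distinct weight values, so each weight needs only a 4-bit index instead of a 32-bit float.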

Using Core ML in Swift

import CoreML
import Vision
import UIKit

class CoreMLClassifier {
    private var model: VNCoreMLModel?

    func loadModel() {
        // Xcode compiles .mlpackage into .mlmodelc at build time,
        // so load the compiled model from the bundle
        guard let modelURL = Bundle.main.url(forResource: "MobileNetV2", withExtension: "mlmodelc") else {
            print("Model file not found")
            return
        }

        let config = MLModelConfiguration()
        config.computeUnits = .all  // CPU + GPU + Neural Engine

        do {
            let coreMLModel = try MLModel(contentsOf: modelURL, configuration: config)
            model = try VNCoreMLModel(for: coreMLModel)
            print("Model loaded successfully")
        } catch {
            print("Failed to load model: \(error)")
        }
    }

    func classify(image: UIImage, completion: @escaping ([VNClassificationObservation]?) -> Void) {
        guard let model = model,
              let cgImage = image.cgImage else {
            completion(nil)
            return
        }

        let request = VNCoreMLRequest(model: model) { request, error in
            guard let results = request.results as? [VNClassificationObservation] else {
                completion(nil)
                return
            }
            completion(Array(results.prefix(5)))
        }

        request.imageCropAndScaleOption = .centerCrop

        let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
        DispatchQueue.global(qos: .userInteractive).async {
            try? handler.perform([request])
        }
    }
}

Training Custom Models with Create ML

import CreateML
import Foundation

// Train an image classifier
let trainingData = MLImageClassifier.DataSource.labeledDirectories(
    at: URL(fileURLWithPath: "/path/to/training_data")
)

let parameters = MLImageClassifier.ModelParameters(
    featureExtractor: .scenePrint(revision: 2),
    maxIterations: 25,
    augmentation: [.flip, .crop, .rotation]
)

let classifier = try MLImageClassifier(
    trainingData: trainingData,
    parameters: parameters
)

let evaluationData = MLImageClassifier.DataSource.labeledDirectories(
    at: URL(fileURLWithPath: "/path/to/test_data")
)
let metrics = classifier.evaluation(on: evaluationData)
print("Accuracy: \(1.0 - metrics.classificationError)")

try classifier.write(to: URL(fileURLWithPath: "MyClassifier.mlmodel"))

5. NVIDIA Jetson Platform

NVIDIA Jetson is an embedded AI computing platform widely used in robotics, autonomous vehicles, and smart cameras.

Jetson Model Comparison

| Model | AI Performance | RAM | Power | Primary Use |
|---|---|---|---|---|
| Jetson Nano | 472 GFLOPS | 4GB | 10W | Education, prototyping |
| Jetson Xavier NX | 21 TOPS | 8/16GB | 15W | Industrial IoT |
| Jetson AGX Orin | 275 TOPS | 64GB | 60W | Autonomous vehicles, robotics |
| Jetson Orin NX | 100 TOPS | 16GB | 25W | Edge AI |
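When choosing a board, raw performance is only half the story; performance per watt often decides. Using the table's headline figures (note the Nano number is FP16 GFLOPS while the others are INT8 TOPS, so this is only a rough comparison):

```python
# (headline AI performance in TOPS, typical power in watts) from the table above
boards = {
    "Jetson Nano":      (0.472, 10),  # 472 GFLOPS = 0.472 TOPS (FP16)
    "Jetson Xavier NX": (21,    15),
    "Jetson Orin NX":   (100,   25),
    "Jetson AGX Orin":  (275,   60),
}

efficiency = {name: tops / watts for name, (tops, watts) in boards.items()}
for name in sorted(efficiency, key=efficiency.get, reverse=True):
    print(f"{name:>16}: {efficiency[name]:.2f} TOPS/W")
```

The Orin generation is far more efficient per watt than the earlier boards, which matters for battery-powered and thermally constrained deployments.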

TensorRT Conversion

import tensorrt as trt
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine_from_onnx(onnx_path, engine_path, fp16=True, int8=False):
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)) as network, \
         trt.OnnxParser(network, TRT_LOGGER) as parser:

        config = builder.create_builder_config()
        # TensorRT >= 8.4 replaces max_workspace_size with memory pool limits
        config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1GB

        if fp16:
            config.set_flag(trt.BuilderFlag.FP16)
        if int8:
            config.set_flag(trt.BuilderFlag.INT8)

        with open(onnx_path, 'rb') as f:
            if not parser.parse(f.read()):
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
                return None

        print("Building TensorRT engine (may take several minutes)...")
        serialized_engine = builder.build_serialized_network(network, config)

        with open(engine_path, 'wb') as f:
            f.write(serialized_engine)
        print(f"Engine saved: {engine_path}")

build_engine_from_onnx('resnet50.onnx', 'resnet50_fp16.trt', fp16=True)

TensorRT Inference

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

class TRTInference:
    def __init__(self, engine_path):
        TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

        with open(engine_path, 'rb') as f:
            runtime = trt.Runtime(TRT_LOGGER)
            self.engine = runtime.deserialize_cuda_engine(f.read())

        self.context = self.engine.create_execution_context()
        self.inputs, self.outputs, self.bindings, self.stream = self._allocate_buffers()

    def _allocate_buffers(self):
        inputs, outputs, bindings = [], [], []
        stream = cuda.Stream()

        # Uses the TensorRT 8.x binding API (TRT 10 replaces it with named I/O tensors)
        for binding in self.engine:
            size = trt.volume(self.engine.get_binding_shape(binding))
            dtype = trt.nptype(self.engine.get_binding_dtype(binding))
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            bindings.append(int(device_mem))

            if self.engine.binding_is_input(binding):
                inputs.append({'host': host_mem, 'device': device_mem})
            else:
                outputs.append({'host': host_mem, 'device': device_mem})

        return inputs, outputs, bindings, stream

    def infer(self, input_data):
        np.copyto(self.inputs[0]['host'], input_data.ravel())
        cuda.memcpy_htod_async(self.inputs[0]['device'], self.inputs[0]['host'], self.stream)
        self.context.execute_async_v2(bindings=self.bindings, stream_handle=self.stream.handle)
        cuda.memcpy_dtoh_async(self.outputs[0]['host'], self.outputs[0]['device'], self.stream)
        self.stream.synchronize()
        return self.outputs[0]['host']

trt_model = TRTInference('resnet50_fp16.trt')
input_array = np.random.rand(1, 3, 224, 224).astype(np.float32)
result = trt_model.infer(input_array)
print(f"Predicted class: {np.argmax(result)}")

DeepStream SDK for Video Pipelines

import gi
gi.require_version('Gst', '1.0')
from gi.repository import GObject, Gst, GLib

Gst.init(None)

def create_pipeline():
    pipeline = Gst.Pipeline()

    # Source: USB camera
    source = Gst.ElementFactory.make("v4l2src", "usb-cam-source")
    source.set_property("device", "/dev/video0")

    caps = Gst.ElementFactory.make("capsfilter", "capsfilter")
    caps.set_property("caps", Gst.Caps.from_string(
        "video/x-raw,width=1280,height=720,framerate=30/1"
    ))

    nvconv = Gst.ElementFactory.make("nvvideoconvert", "convertor")

    # nvinfer element runs TensorRT inference
    nvinfer = Gst.ElementFactory.make("nvinfer", "primary-inference")
    nvinfer.set_property("config-file-path", "config_infer_primary.txt")

    tracker = Gst.ElementFactory.make("nvtracker", "tracker")
    tracker.set_property(
        "ll-lib-file",
        "/opt/nvidia/deepstream/deepstream/lib/libnvds_nvmultiobjecttracker.so"
    )

    osd = Gst.ElementFactory.make("nvdsosd", "onscreendisplay")
    sink = Gst.ElementFactory.make("nveglglessink", "nvvideo-renderer")

    for element in [source, caps, nvconv, nvinfer, tracker, osd, sink]:
        pipeline.add(element)

    source.link(caps)
    caps.link(nvconv)
    nvconv.link(nvinfer)
    nvinfer.link(tracker)
    tracker.link(osd)
    osd.link(sink)

    return pipeline

pipeline = create_pipeline()
pipeline.set_state(Gst.State.PLAYING)

6. Raspberry Pi AI

Raspberry Pi has evolved from an educational platform into a genuine edge AI deployment target.

Raspberry Pi 5 + Hailo-8

The Hailo-8 is an AI accelerator, available as a HAT for the Raspberry Pi 5, that delivers 26 TOPS.

# Install Hailo SDK
pip install hailort

# Download pre-compiled model (ONNX -> HEF format)
wget https://hailo-model-zoo.s3.eu-west-2.amazonaws.com/ModelZoo/Compiled/v2.11.0/hailo8/resnet_v1_50.hef

Python inference with the HailoRT API:

import hailo_platform as hp
import numpy as np

with hp.VDevice() as vdevice:
    hef = hp.Hef("resnet_v1_50.hef")
    network_groups = vdevice.configure(hef)
    network_group = network_groups[0]

    input_vstreams_params = hp.InputVStreamParams.make_from_network_group(
        network_group, quantized=False, format_type=hp.FormatType.FLOAT32
    )
    output_vstreams_params = hp.OutputVStreamParams.make_from_network_group(
        network_group, quantized=False, format_type=hp.FormatType.FLOAT32
    )

    with hp.InferVStreams(network_group, input_vstreams_params, output_vstreams_params) as infer_pipeline:
        input_data = {"input_layer1": np.random.rand(1, 224, 224, 3).astype(np.float32)}
        with network_group.activate():
            infer_results = infer_pipeline.infer(input_data)
        output_key = 'resnet_v1_50/softmax1'
        print(f"Result: {np.argmax(infer_results[output_key])}")

OpenCV + Raspberry Pi Camera

import cv2
import numpy as np
import tflite_runtime.interpreter as tflite

# Use tflite_runtime (lightweight) on Raspberry Pi
interpreter = tflite.Interpreter(
    model_path='ssd_mobilenet_v2.tflite',
    num_threads=4
)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)

while True:
    ret, frame = cap.read()
    if not ret:
        break

    input_size = (input_details[0]['shape'][2], input_details[0]['shape'][1])
    rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    resized = cv2.resize(rgb_frame, input_size)
    input_data = np.expand_dims(resized, axis=0).astype(np.uint8)

    interpreter.set_tensor(input_details[0]['index'], input_data)
    interpreter.invoke()

    boxes = interpreter.get_tensor(output_details[0]['index'])[0]
    classes = interpreter.get_tensor(output_details[1]['index'])[0]
    scores = interpreter.get_tensor(output_details[2]['index'])[0]

    h, w = frame.shape[:2]
    for i in range(len(scores)):
        if scores[i] > 0.5:
            ymin, xmin, ymax, xmax = boxes[i]
            cv2.rectangle(frame,
                         (int(xmin * w), int(ymin * h)),
                         (int(xmax * w), int(ymax * h)),
                         (0, 255, 0), 2)
            label = f"class {int(classes[i])}: {scores[i]:.2f}"
            cv2.putText(frame, label, (int(xmin * w), int(ymin * h) - 10),
                       cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

    cv2.imshow('Raspberry Pi AI', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

7. MediaPipe

Google MediaPipe provides ready-to-use ML solutions for face detection, hand tracking, pose estimation, object detection, and more.

Hand Tracking in Python

import mediapipe as mp
import cv2
import numpy as np

mp_hands = mp.solutions.hands
mp_drawing = mp.solutions.drawing_utils
mp_drawing_styles = mp.solutions.drawing_styles

def run_hand_tracking():
    cap = cv2.VideoCapture(0)

    with mp_hands.Hands(
        static_image_mode=False,
        max_num_hands=2,
        min_detection_confidence=0.7,
        min_tracking_confidence=0.5
    ) as hands:
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break

            rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            rgb_frame.flags.writeable = False
            results = hands.process(rgb_frame)
            rgb_frame.flags.writeable = True

            frame = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2BGR)

            if results.multi_hand_landmarks:
                for hand_landmarks in results.multi_hand_landmarks:
                    mp_drawing.draw_landmarks(
                        frame,
                        hand_landmarks,
                        mp_hands.HAND_CONNECTIONS,
                        mp_drawing_styles.get_default_hand_landmarks_style(),
                        mp_drawing_styles.get_default_hand_connections_style()
                    )

                    # Extract landmark coordinates (21 keypoints)
                    for idx, landmark in enumerate(hand_landmarks.landmark):
                        h, w, _ = frame.shape
                        cx, cy = int(landmark.x * w), int(landmark.y * h)
                        if idx == 8:  # Index fingertip
                            cv2.circle(frame, (cx, cy), 10, (255, 0, 0), -1)

            cv2.imshow('Hand Tracking', frame)
            if cv2.waitKey(5) & 0xFF == ord('q'):
                break

    cap.release()
    cv2.destroyAllWindows()

run_hand_tracking()
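The 21 landmark indices can drive simple gesture logic directly. As a hypothetical example, a pinch detector that compares the thumb tip (index 4) with the index fingertip (index 8); the 0.05 threshold in normalized image coordinates is a tunable guess:

```python
import math

def is_pinching(landmarks, threshold=0.05):
    """Detect a thumb-index pinch from normalized MediaPipe hand landmarks.

    landmarks: sequence of (x, y) pairs in [0, 1] image coordinates;
    index 4 is the thumb tip, index 8 the index fingertip.
    """
    tx, ty = landmarks[4]
    ix, iy = landmarks[8]
    return math.hypot(tx - ix, ty - iy) < threshold

# Synthetic example: all 21 points at the origin except the two fingertips
points = [(0.0, 0.0)] * 21
points[4] = (0.50, 0.50)
points[8] = (0.52, 0.51)  # fingertips ~0.022 apart -> counts as a pinch
print(is_pinching(points))
```

Inside the tracking loop above this would be called as `is_pinching([(lm.x, lm.y) for lm in hand_landmarks.landmark])`.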

Pose Estimation

import mediapipe as mp
import cv2
import numpy as np

mp_pose = mp.solutions.pose

def calculate_angle(a, b, c):
    """Calculate angle between three points."""
    a = np.array(a)
    b = np.array(b)
    c = np.array(c)

    radians = np.arctan2(c[1] - b[1], c[0] - b[0]) - \
              np.arctan2(a[1] - b[1], a[0] - b[0])
    angle = np.abs(radians * 180.0 / np.pi)

    if angle > 180.0:
        angle = 360 - angle
    return angle

cap = cv2.VideoCapture(0)

with mp_pose.Pose(
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5,
    model_complexity=1  # 0: Lite, 1: Full, 2: Heavy
) as pose:
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        results = pose.process(rgb_frame)
        frame = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2BGR)

        if results.pose_landmarks:
            landmarks = results.pose_landmarks.landmark
            h, w, _ = frame.shape

            shoulder = [landmarks[mp_pose.PoseLandmark.LEFT_SHOULDER.value].x * w,
                       landmarks[mp_pose.PoseLandmark.LEFT_SHOULDER.value].y * h]
            elbow = [landmarks[mp_pose.PoseLandmark.LEFT_ELBOW.value].x * w,
                    landmarks[mp_pose.PoseLandmark.LEFT_ELBOW.value].y * h]
            wrist = [landmarks[mp_pose.PoseLandmark.LEFT_WRIST.value].x * w,
                    landmarks[mp_pose.PoseLandmark.LEFT_WRIST.value].y * h]

            angle = calculate_angle(shoulder, elbow, wrist)
            cv2.putText(frame, f"Elbow: {angle:.1f} deg",
                       (int(elbow[0]), int(elbow[1])),
                       cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)

            mp.solutions.drawing_utils.draw_landmarks(
                frame, results.pose_landmarks, mp_pose.POSE_CONNECTIONS
            )

        cv2.imshow('Pose Estimation', frame)
        if cv2.waitKey(5) & 0xFF == ord('q'):
            break

cap.release()
cv2.destroyAllWindows()

MediaPipe Tasks API

import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

base_options = python.BaseOptions(model_asset_path='efficientdet_lite0.tflite')
options = vision.ObjectDetectorOptions(
    base_options=base_options,
    running_mode=vision.RunningMode.IMAGE,
    max_results=5,
    score_threshold=0.5
)

with vision.ObjectDetector.create_from_options(options) as detector:
    image = mp.Image.create_from_file('test_image.jpg')
    detection_result = detector.detect(image)

    for detection in detection_result.detections:
        category = detection.categories[0]
        print(f"Object: {category.category_name}, Score: {category.score:.2f}")
        bbox = detection.bounding_box
        print(f"  Location: ({bbox.origin_x}, {bbox.origin_y}), Size: {bbox.width}x{bbox.height}")

8. llama.cpp and GGUF

llama.cpp is a C/C++ inference engine, originally written for Meta's LLaMA models, that runs large language models efficiently on CPUs (with optional GPU offload) using quantized GGUF model files.

Installation and Basic Usage

# Build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# CPU only
make -j4

# Apple Silicon (Metal GPU acceleration; enabled by default in recent builds)
make LLAMA_METAL=1 -j4

# NVIDIA CUDA (build flag names vary by version; recent releases use CMake)
make LLAMA_CUDA=1 -j4

# Download a GGUF model
huggingface-cli download \
    bartowski/Llama-3.2-3B-Instruct-GGUF \
    Llama-3.2-3B-Instruct-Q4_K_M.gguf \
    --local-dir ./models

# Interactive chat
./llama-cli \
    -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
    -n 512 \
    -p "You are a helpful AI assistant." \
    --repeat-penalty 1.1 \
    -t 8 \
    --color

Quantization Levels (GGUF)

| Quantization | Bits/weight | Size (7B) | Quality | When to Use |
| --- | --- | --- | --- | --- |
| Q2_K | ~2.6 | ~2.7GB | Low | Very limited memory |
| Q4_0 | 4.5 | ~3.8GB | Moderate | Basic use |
| Q4_K_M | 4.8 | ~4.1GB | Good | Recommended: balanced |
| Q5_K_M | 5.7 | ~4.8GB | Very good | Quality-focused |
| Q6_K | 6.6 | ~5.5GB | Excellent | High quality needed |
| Q8_0 | 8.5 | ~7.2GB | Best | Ample memory available |
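The sizes above follow roughly from bits-per-weight (about n_params × bits / 8, plus metadata). To build intuition for how block quantization works — grouping weights into blocks and storing low-bit integers plus a per-block scale — here is a toy 4-bit sketch. The function names are illustrative only; the real Q4_K layout additionally uses super-blocks and per-block minimums:

```python
import numpy as np

def quantize_block_q4(w):
    """Toy symmetric 4-bit quantization: one float scale per block of weights."""
    max_abs = np.abs(w).max()
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # 4-bit signed range
    return q, scale

def dequantize_block(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(32).astype(np.float32)  # one 32-weight block
q, scale = quantize_block_q4(w)
max_err = np.abs(dequantize_block(q, scale) - w).max()
print(f"scale={scale:.4f}, max reconstruction error={max_err:.4f}")

# Rough file-size estimate: 7B params at 4.8 bits/weight
print(f"~{7e9 * 4.8 / 8 / 1e9:.1f} GB")  # before metadata overhead
```

The rounding error per weight is bounded by half the block scale, which is why quality degrades gracefully as bits-per-weight drops.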

llama-cpp-python

from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="./models/Llama-3.2-3B-Instruct-Q4_K_M.gguf",
    n_ctx=4096,          # Context window
    n_threads=8,         # CPU threads
    n_gpu_layers=35,     # Layers to offload to GPU (-1 for all)
    verbose=False
)

# Basic text generation
output = llm(
    "What is the capital of France?",
    max_tokens=128,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repeat_penalty=1.1
)
print(output['choices'][0]['text'])

# Chat completion
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to compute the Fibonacci sequence."}
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=512,
    temperature=0.7
)
print(response['choices'][0]['message']['content'])

# Streaming output
stream = llm.create_chat_completion(
    messages=messages,
    max_tokens=512,
    stream=True
)

for chunk in stream:
    delta = chunk['choices'][0]['delta']
    if 'content' in delta:
        print(delta['content'], end='', flush=True)
print()

OpenAI-Compatible Server

# Start llama.cpp server
./llama-server \
    -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
    --port 8080 \
    --host 0.0.0.0 \
    -n 2048 \
    -t 8 \
    --n-gpu-layers 35

# Use the OpenAI SDK with the local server
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="none"
)

response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "user", "content": "Explain the difference between machine learning and deep learning."}
    ],
    max_tokens=512,
    temperature=0.7
)
print(response.choices[0].message.content)

Converting HuggingFace Models to GGUF

cd llama.cpp

# Install Python dependencies
pip install -r requirements.txt

# Convert HuggingFace model to GGUF
python convert_hf_to_gguf.py \
    /path/to/hf_model \
    --outfile models/my_model.gguf \
    --outtype f16

# Quantize the model (the binary was named ./quantize in older builds)
./llama-quantize models/my_model.gguf models/my_model_q4km.gguf Q4_K_M

9. Whisper.cpp

Whisper.cpp is a C/C++ port of OpenAI's Whisper speech recognition model, enabling fully offline transcription on devices ranging from the Raspberry Pi to smartphones.

Installation and Basic Usage

# Build
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make -j4

# Apple Silicon with Metal
make WHISPER_METAL=1 -j4

# Download models
bash ./models/download-ggml-model.sh base.en    # English only, 142MB
bash ./models/download-ggml-model.sh medium     # Multilingual, 1.5GB
bash ./models/download-ggml-model.sh large-v3   # Best quality, 3.1GB

# Transcribe an audio file (binary renamed to whisper-cli in newer builds)
./main -m models/ggml-medium.bin \
       -f audio.wav \
       -l en \
       --output-txt \
       -of output

# Real-time microphone input (requires SDL2; renamed to whisper-stream in newer builds)
./stream -m models/ggml-medium.bin \
         -t 8 \
         --step 500 \
         --length 5000 \
         -l en

whisper-cpp-python

# Note: several community Python bindings for whisper.cpp exist
# (e.g. whisper-cpp-python, pywhispercpp); exact APIs differ slightly
import whisper_cpp
import numpy as np
import soundfile as sf

# Load model
model = whisper_cpp.Whisper.from_pretrained("medium")

# Transcribe a WAV file
audio, sr = sf.read("audio.wav", dtype="float32")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # Stereo -> mono

# Resample to 16kHz if needed
if sr != 16000:
    import librosa
    audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)

result = model.transcribe(audio, language="en")
print(f"Transcript:\n{result['text']}")

# With timestamps
for segment in result['segments']:
    start = segment['start']
    end = segment['end']
    text = segment['text']
    print(f"[{start:.2f}s -> {end:.2f}s] {text}")
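The segment timestamps can be rendered straight into subtitle files. A small helper (hypothetical, not part of any whisper.cpp binding) that formats seconds as SRT timestamps and emits an SRT block per segment:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render a list of {'start', 'end', 'text'} dicts as an SRT string."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{srt_timestamp(seg['start'])} --> "
                      f"{srt_timestamp(seg['end'])}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)

demo = [{'start': 0.0, 'end': 2.5, 'text': ' Hello world.'}]
print(segments_to_srt(demo))
```

Writing the returned string to a `.srt` file makes the transcript usable in any video player.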

Quantizing Whisper Models

# Quantize GGML model
./quantize models/ggml-medium.bin models/ggml-medium-q5_0.bin q5_0

# Size comparison
ls -lh models/ggml-medium*.bin
# ggml-medium.bin: 1.5GB
# ggml-medium-q5_0.bin: ~900MB

Whisper on iOS with WhisperKit

import WhisperKit

class SpeechRecognizer {
    var whisperKit: WhisperKit?

    func initialize() async {
        do {
            whisperKit = try await WhisperKit(
                model: "openai_whisper-medium",
                computeOptions: ModelComputeOptions(melCompute: .cpuAndGPU)
            )
            print("Whisper model loaded")
        } catch {
            print("Initialization failed: \(error)")
        }
    }

    func transcribe(audioURL: URL) async -> String? {
        guard let whisperKit = whisperKit else { return nil }

        do {
            let result = try await whisperKit.transcribe(
                audioPath: audioURL.path,
                decodeOptions: DecodingOptions(language: "en")
            )
            return result.map(\.text).joined(separator: " ")
        } catch {
            print("Transcription failed: \(error)")
            return nil
        }
    }
}

10. Web Browser AI

Thanks to WebAssembly and WebGPU, powerful AI inference in the browser is now practical.

TensorFlow.js

<!DOCTYPE html>
<html>
  <head>
    <title>Browser Image Classification</title>
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@latest"></script>
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/mobilenet@2.1.1"></script>
  </head>
  <body>
    <input type="file" id="imageInput" accept="image/*" />
    <img id="preview" style="max-width: 400px;" />
    <div id="result"></div>

    <script>
      let model

      async function loadModel() {
        model = await mobilenet.load({ version: 2, alpha: 1.0 })
        console.log('Model loaded!')
        document.getElementById('result').textContent = 'Model ready. Select an image.'
      }

      document.getElementById('imageInput').addEventListener('change', async (e) => {
        const file = e.target.files[0]
        if (!file) return

        const img = document.getElementById('preview')
        img.src = URL.createObjectURL(file)
        img.onload = async () => {
          const predictions = await model.classify(img, 5)
          const resultDiv = document.getElementById('result')
          resultDiv.innerHTML = '<h3>Top Predictions:</h3>'
          predictions.forEach((pred) => {
            resultDiv.innerHTML += `
                    <p>${pred.className}: ${(pred.probability * 100).toFixed(2)}%</p>
                `
          })
        }
      })

      loadModel()
    </script>
  </body>
</html>

Transformers.js (HuggingFace)

import { pipeline, env } from '@xenova/transformers'

env.backends.onnx.wasm.wasmPaths = 'https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/'

// Sentiment analysis pipeline
async function runTextClassification() {
  const classifier = await pipeline(
    'sentiment-analysis',
    'Xenova/distilbert-base-uncased-finetuned-sst-2-english'
  )

  const results = await classifier(['I love machine learning!', 'This is terrible.'])
  results.forEach((result, i) => {
    console.log(`Text ${i + 1}: ${result.label} (${(result.score * 100).toFixed(2)}%)`)
  })
}

// Image classification
async function runImageClassification() {
  const classifier = await pipeline('image-classification', 'Xenova/vit-base-patch16-224')
  const result = await classifier('https://example.com/image.jpg')
  console.log('Image classification result:', result)
}

// Text generation with a small LLM
async function runTextGeneration() {
  const generator = await pipeline('text-generation', 'Xenova/gpt2')
  const output = await generator('The future of AI is', {
    max_new_tokens: 100,
    temperature: 0.7,
  })
  console.log('Generated text:', output[0].generated_text)
}

runTextClassification()

WebGPU-Accelerated Inference

import * as ort from 'onnxruntime-web'

async function runWithWebGPU() {
  if (!navigator.gpu) {
    console.log('WebGPU is not supported in this browser.')
    return
  }

  const adapter = await navigator.gpu.requestAdapter()
  if (!adapter) {
    console.log('No suitable GPU adapter found.')
    return
  }
  console.log('WebGPU adapter:', adapter.info)

  ort.env.wasm.wasmPaths = '/'
  const session = await ort.InferenceSession.create('/models/resnet50.onnx', {
    executionProviders: ['webgpu'],
    graphOptimizationLevel: 'all',
  })

  const batchSize = 4
  const inputData = new Float32Array(batchSize * 3 * 224 * 224)
  const inputTensor = new ort.Tensor('float32', inputData, [batchSize, 3, 224, 224])

  const startTime = performance.now()
  const output = await session.run({ input: inputTensor })
  const elapsed = performance.now() - startTime

  console.log(`WebGPU inference time: ${elapsed.toFixed(2)}ms`)
  console.log(`Throughput: ${((batchSize / elapsed) * 1000).toFixed(1)} images/sec`)
}

runWithWebGPU()

11. AI Model Optimization Pipeline

End-to-End Flow: Train → Optimize → Deploy

Training (PyTorch/TF)
     |
Pruning (remove unnecessary weights)
     |
Knowledge Distillation (Teacher-Student)
     |
Quantization-Aware Training (QAT)
     |
Format Conversion (ONNX/TFLite/GGUF)
     |
Runtime Optimization (TensorRT/OpenVINO)
     |
Deployment (Mobile/Edge/Web)

Pruning

import torch
import torch.nn.utils.prune as prune
import torchvision.models as models

model = models.resnet50(weights='DEFAULT')  # 'pretrained=True' is deprecated

# Unstructured pruning: zero out 30% of weights in Conv2d layers (by L1 magnitude)
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.l1_unstructured(module, name='weight', amount=0.3)
        prune.remove(module, 'weight')  # Bake the pruning mask into the weights

# Unstructured pruning zeroes weights without shrinking the tensors,
# so measure sparsity via nonzero counts, not parameter counts
total_params = sum(p.numel() for p in model.parameters())
nonzero_params = sum(int((p != 0).sum()) for p in model.parameters())
print(f"Total parameters: {total_params:,}")
print(f"Nonzero after pruning: {nonzero_params:,}")
print(f"Sparsity: {(1 - nonzero_params/total_params)*100:.1f}%")
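Unstructured pruning only zeroes individual weights, so dense kernels run no faster unless the runtime exploits sparsity. Structured pruning removes whole channels, which maps directly onto smaller dense layers. A minimal sketch using `torch.nn.utils.prune.ln_structured` on a standalone Conv2d:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(16, 32, kernel_size=3)

# Structured pruning: remove 25% of output channels (dim=0) by L2 norm of each filter
prune.ln_structured(conv, name='weight', amount=0.25, n=2, dim=0)
prune.remove(conv, 'weight')  # bake the mask into the weights

# 8 of the 32 filters are now entirely zero and could be physically removed
zero_filters = int((conv.weight.abs().sum(dim=(1, 2, 3)) == 0).sum())
print(f"Zeroed filters: {zero_filters}/32")
```

In practice you would then rebuild the layer with fewer output channels (and shrink the next layer's input accordingly) to realize the speedup.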

Knowledge Distillation

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class DistillationLoss(nn.Module):
    def __init__(self, temperature=4.0, alpha=0.7):
        super().__init__()
        self.T = temperature
        self.alpha = alpha

    def forward(self, student_logits, teacher_logits, labels):
        # Soft target loss (distillation)
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / self.T, dim=1),
            F.softmax(teacher_logits / self.T, dim=1),
            reduction='batchmean'
        ) * (self.T ** 2)

        # Hard target loss (cross-entropy)
        hard_loss = F.cross_entropy(student_logits, labels)

        return self.alpha * soft_loss + (1 - self.alpha) * hard_loss

def train_student(teacher, student, dataloader, epochs=10):
    teacher.eval()
    student.train()

    optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
    criterion = DistillationLoss(temperature=4.0, alpha=0.7)

    for epoch in range(epochs):
        total_loss = 0
        for images, labels in dataloader:
            with torch.no_grad():
                teacher_logits = teacher(images)

            student_logits = student(images)
            loss = criterion(student_logits, teacher_logits, labels)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss/len(dataloader):.4f}")

# Teacher: ResNet50, Student: MobileNetV2
teacher = models.resnet50(weights='DEFAULT')
student = models.mobilenet_v2(weights=None)
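A quick sanity check on the soft-target term: when student and teacher logits coincide, the KL divergence is exactly zero, so the distillation loss reduces to plain cross-entropy. This can be verified standalone with the same formula used in `DistillationLoss` above:

```python
import torch
import torch.nn.functional as F

T = 4.0
logits = torch.randn(8, 10)

# KL(p || p) with identical distributions is zero; the T^2 factor
# keeps gradient magnitudes comparable across temperatures
soft_loss = F.kl_div(
    F.log_softmax(logits / T, dim=1),
    F.softmax(logits / T, dim=1),
    reduction='batchmean'
) * (T ** 2)
print(float(soft_loss))  # ~0
```

As training progresses and the student tracks the teacher, the soft term shrinks and the hard cross-entropy term dominates.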

Quantization-Aware Training (QAT)

import torch
import torch.nn as nn
import torchvision.models as models
# Note: in recent PyTorch versions these APIs live under torch.ao.quantization
from torch.quantization import get_default_qat_qconfig, prepare_qat, convert

model = models.mobilenet_v2(weights='DEFAULT')
model.train()

# Set QAT config
model.qconfig = get_default_qat_qconfig('qnnpack')  # ARM/mobile
# model.qconfig = get_default_qat_qconfig('fbgemm')  # x86

# Prepare for QAT (inserts fake-quantization observers; for best results,
# fuse Conv-BN-ReLU blocks with torch.quantization.fuse_modules first)
model = prepare_qat(model, inplace=False)

# Fine-tune with QAT for a few epochs
optimizer = torch.optim.SGD(model.parameters(), lr=0.0001)
model.train()
for epoch in range(5):
    for images, labels in dataloader:
        outputs = model(images)
        loss = nn.CrossEntropyLoss()(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"QAT Epoch {epoch+1}/5 complete")

# Convert to a true INT8 model
model.eval()
quantized_model = convert(model, inplace=False)
torch.save(quantized_model.state_dict(), 'mobilenetv2_int8.pth')
print("QAT complete! 4x smaller model with minimal accuracy loss")
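When retraining is not feasible, post-training dynamic quantization is a one-line alternative: weights are stored as INT8 offline and activations are quantized on the fly at inference time. A minimal sketch using a toy model (dynamic quantization targets Linear/LSTM layers, so it suits NLP-style models better than CNNs like MobileNetV2):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

# Quantize all Linear layers to INT8; no calibration data or retraining needed
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

out = quantized(torch.randn(1, 128))
print(out.shape)  # torch.Size([1, 10])
```

Accuracy loss is typically larger than with QAT but often acceptable, making this a good first experiment before investing in a full QAT run.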

Comprehensive Benchmark Tool

import time
import numpy as np
import psutil
import os

class EdgeAIBenchmark:
    def __init__(self, model_path, framework='tflite'):
        self.model_path = model_path
        self.framework = framework
        self.results = {}

    def measure_latency(self, input_data, num_runs=100, warmup=10):
        """Measure average inference latency."""
        import tensorflow as tf

        interpreter = tf.lite.Interpreter(model_path=self.model_path)
        interpreter.allocate_tensors()
        input_details = interpreter.get_input_details()
        output_details = interpreter.get_output_details()

        # Warmup
        for _ in range(warmup):
            interpreter.set_tensor(input_details[0]['index'], input_data)
            interpreter.invoke()

        # Measurement
        latencies = []
        for _ in range(num_runs):
            start = time.perf_counter()
            interpreter.set_tensor(input_details[0]['index'], input_data)
            interpreter.invoke()
            _ = interpreter.get_tensor(output_details[0]['index'])
            latencies.append((time.perf_counter() - start) * 1000)

        self.results['latency_mean_ms'] = np.mean(latencies)
        self.results['latency_p99_ms'] = np.percentile(latencies, 99)
        self.results['throughput_fps'] = 1000 / np.mean(latencies)
        return self.results

    def measure_memory(self):
        """Measure memory consumption."""
        process = psutil.Process(os.getpid())
        before = process.memory_info().rss / 1024 / 1024

        import tensorflow as tf
        interpreter = tf.lite.Interpreter(model_path=self.model_path)
        interpreter.allocate_tensors()

        after = process.memory_info().rss / 1024 / 1024
        self.results['memory_mb'] = after - before
        return self.results

    def measure_model_size(self):
        """Measure model file size."""
        size_bytes = os.path.getsize(self.model_path)
        self.results['model_size_mb'] = size_bytes / 1024 / 1024
        return self.results

    def run_full_benchmark(self, input_data):
        self.measure_model_size()
        self.measure_memory()
        self.measure_latency(input_data)

        print(f"\n=== {self.model_path} Benchmark ===")
        print(f"Model size: {self.results.get('model_size_mb', 0):.2f} MB")
        print(f"Memory usage: {self.results.get('memory_mb', 0):.2f} MB")
        print(f"Mean latency: {self.results.get('latency_mean_ms', 0):.2f} ms")
        print(f"P99 latency: {self.results.get('latency_p99_ms', 0):.2f} ms")
        print(f"Throughput: {self.results.get('throughput_fps', 0):.1f} FPS")
        return self.results


# Usage
input_data = np.random.rand(1, 224, 224, 3).astype(np.float32)

bench = EdgeAIBenchmark('mobilenetv2.tflite')
results = bench.run_full_benchmark(input_data)

bench_q = EdgeAIBenchmark('mobilenetv2_int8.tflite')
results_q = bench_q.run_full_benchmark(input_data)

print("\n=== Quantization Impact ===")
size_reduction = (1 - results_q['model_size_mb'] / results['model_size_mb']) * 100
speed_improvement = results['latency_mean_ms'] / results_q['latency_mean_ms']
print(f"Size reduction: {size_reduction:.1f}%")
print(f"Speed improvement: {speed_improvement:.1f}x")

Summary

Edge AI is no longer a research curiosity — it is deployed in real products at scale. Here is a quick reference for the frameworks covered in this guide:

  1. TFLite: Widest adoption in mobile apps. Supports both Android and iOS natively
  2. ONNX Runtime: Framework-agnostic. Ideal for cross-platform deployment
  3. Core ML: Maximizes Apple Neural Engine on Apple devices
  4. TensorRT: Maximizes NVIDIA GPU acceleration on Jetson and servers
  5. llama.cpp: Runs LLMs on CPU. Especially powerful on Apple Silicon
  6. Whisper.cpp: The de facto standard for offline speech recognition
  7. MediaPipe: Fast prototyping of vision ML solutions
  8. Transformers.js: Run HuggingFace models directly in the browser

Model selection and optimization strategy depend on target hardware, accuracy requirements, and latency goals. INT8 quantization is the first optimization step strongly recommended for virtually every edge AI project — it dramatically reduces size and improves speed with minimal accuracy loss.


References