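Edge AI Complete Guide 2025: On-Device Inference, Model Optimization, TensorRT/ONNX/CoreML

Introduction: Why Edge AI?

The future of AI is not only in the cloud. A growing share of AI inference runs where the data is generated: on edge devices. Edge AI is the paradigm of running AI models directly on smartphones, IoT devices, vehicles, and medical devices instead of sending the data to the cloud.

Five reasons Edge AI matters:

Latency: Millisecond-level responses with no cloud round trip. A car traveling at 100 km/h covers about 2.8 m in 100 ms, so even that much delay can be critical for autonomous driving.
Privacy: Sensitive data (faces, voice, medical records) never leaves the device.
Bandwidth savings: Raw video is heavy (uncompressed 4K at 60 fps is roughly 1.5 GB/s), but only the inference results need to be transmitted.
Cost reduction: Inference runs locally, with no cloud GPU bill.
Offline operation: AI features keep working without a network connection.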
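
To make the latency point concrete, here is a small sketch that converts a response delay into distance traveled. The speed and the latency values are illustrative assumptions, not measurements.

```python
def distance_during_latency(speed_kmh: float, latency_ms: float) -> float:
    """Meters traveled while waiting latency_ms at speed_kmh."""
    meters_per_second = speed_kmh * 1000 / 3600
    return meters_per_second * latency_ms / 1000


# Illustrative latencies: local inference vs. a cloud round trip
for latency_ms in (10, 50, 100, 200):
    d = distance_during_latency(100, latency_ms)
    print(f"{latency_ms:>4} ms -> {d:.2f} m at 100 km/h")
```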

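1. Edge AI vs Cloud AI Comparison

Dimension	Edge AI	Cloud AI
Latency	Typically 1-10 ms (local)	Typically 50-200 ms (network round trip)
Privacy	Data stays on the device	Data must be sent to the cloud
Bandwidth	Minimal (results only)	Heavy (raw data is transferred)
Cost	Upfront hardware cost	Ongoing cloud cost
Offline	Fully supported	Not possible
Model size	Limited (MB to a few GB)	Effectively unlimited (hundreds of GB possible)
Compute power	Limited (NPU/GPU)	Nearly unlimited
Updates	Requires OTA updates	Can be deployed immediately
Scalability	Grows with the number of devices	Elastic through cloud resources
Accuracy	Slight degradation possible after optimization	Maximum accuracy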

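2. Inference Runtime Comparison

2.1 TensorRT (NVIDIA)

A high-performance inference engine that runs only on NVIDIA GPUs.

import tensorrt as trt

# Build a TensorRT engine from an ONNX model
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# EXPLICIT_BATCH is needed on TensorRT 8.x; TensorRT 10 deprecates the flag
# because explicit batch is always on
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Load the ONNX model and stop if parsing fails
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse model.onnx")

# Builder configuration
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB

# Enable INT8 quantization
# MyCalibrator is a user-defined calibrator (for example a subclass of
# trt.IInt8EntropyCalibrator2) and calibration_data is a representative
# sample of inputs; neither is shown here
config.set_flag(trt.BuilderFlag.INT8)
config.int8_calibrator = MyCalibrator(calibration_data)

# Also enable FP16 so layers without an INT8 kernel can fall back to FP16
config.set_flag(trt.BuilderFlag.FP16)

# Build and serialize the engine
serialized_engine = builder.build_serialized_network(network, config)
if serialized_engine is None:
    raise RuntimeError("Engine build failed")
with open("model.engine", "wb") as f:
    f.write(serialized_engine)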

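Key TensorRT optimization techniques:

1. Layer fusion: Merges Conv + BatchNorm + ReLU into a single kernel
2. Kernel auto-tuning: Automatically selects the best CUDA kernels for the target hardware
3. Dynamic tensor memory: Reuses memory to reduce the total footprint
4. Precision calibration: Minimizes accuracy loss under INT8 quantization
5. Multi-stream execution: Runs several inferences in parallel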
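
The arithmetic behind layer fusion (item 1) can be sketched in plain Python. This is not TensorRT's implementation; it only shows, for a single linear layer (equivalent to a 1x1 convolution at one pixel), that BatchNorm can be folded into the preceding weights and bias so the fused layer gives the same output. All values are made up for illustration.

```python
import math


def linear(x, weight, bias):
    """y[o] = sum_i weight[o][i] * x[i] + bias[o]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    return [g * (v - m) / math.sqrt(s + eps) + b
            for v, g, b, m, s in zip(y, gamma, beta, mean, var)]


def relu(y):
    return [max(0.0, v) for v in y]


def fold_batch_norm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Return (weight', bias') with linear(x, weight', bias') == batch_norm(linear(x, weight, bias))."""
    scale = [g / math.sqrt(s + eps) for g, s in zip(gamma, var)]
    folded_w = [[w * k for w in row] for row, k in zip(weight, scale)]
    folded_b = [(b - m) * k + bt for b, m, k, bt in zip(bias, mean, scale, beta)]
    return folded_w, folded_b


weight = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
bias = [0.1, -0.2]
gamma, beta = [1.2, 0.8], [0.05, -0.1]
mean, var = [0.3, -0.4], [0.9, 1.6]
x = [1.0, 2.0, -1.0]

unfused = relu(batch_norm(linear(x, weight, bias), gamma, beta, mean, var))
fw, fb = fold_batch_norm(weight, bias, gamma, beta, mean, var)
fused = relu(linear(x, fw, fb))
print(unfused, fused)
```

The fused path performs one multiply-add pass instead of three separate operations, which is where the saving in memory traffic comes from.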

2.2 ONNX Runtime

A cross-platform inference engine that runs on a wide range of hardware.

import onnxruntime as ort
import numpy as np

# Session options
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.intra_op_num_threads = 4
sess_options.inter_op_num_threads = 2

# Execution providers are tried in the order listed
# CPU: CPUExecutionProvider
# GPU: CUDAExecutionProvider, TensorrtExecutionProvider
# Mobile: CoreMLExecutionProvider, NNAPIExecutionProvider
providers = [
    ('TensorrtExecutionProvider', {
        'trt_max_workspace_size': 2147483648,  # 2 GiB
        'trt_fp16_enable': True,
        # INT8 additionally needs calibration data or a QDQ-quantized model
        'trt_int8_enable': True,
    }),
    ('CUDAExecutionProvider', {
        'device_id': 0,
        'arena_extend_strategy': 'kNextPowerOfTwo',
    }),
    'CPUExecutionProvider'
]

session = ort.InferenceSession("model.onnx", sess_options, providers=providers)

# Run inference (the input shape is an example; match your model's input)
input_name = session.get_inputs()[0].name
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)
output = session.run(None, {input_name: input_data})
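
The provider list above only works on a machine where those providers are installed. A minimal sketch of a fallback rule follows; `pick_providers` is a hypothetical helper, and the `available_on_laptop` list is an assumed example of what `onnxruntime.get_available_providers()` might return.

```python
def pick_providers(preferred, available):
    """Keep preferred providers that are available, in order; always end with CPU."""
    chosen = [p for p in preferred if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


preferred = ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CoreMLExecutionProvider"]
# On a real system this list would come from onnxruntime.get_available_providers()
available_on_laptop = ["CoreMLExecutionProvider", "CPUExecutionProvider"]
print(pick_providers(preferred, available_on_laptop))
```

The resulting list can be passed as `providers=` when creating the session, so the same code runs on a Jetson, a laptop, or a CPU-only box.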

2.3 TensorFlow Lite (TFLite)

A lightweight inference engine for mobile and embedded devices.

import numpy as np
import tensorflow as tf

# Convert a SavedModel to TFLite
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")

# Quantization settings
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Full integer quantization (INT8)
# calibration_data is an iterable of representative input arrays (not shown)
def representative_dataset():
    for data in calibration_data:
        yield [data.astype(np.float32)]

converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()

# Save the model
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)

# Run TFLite inference
interpreter = tf.lite.Interpreter(model_path="model_int8.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# input_data must match the model's input shape and dtype (int8 here)
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]['index'])

2.4 CoreML (Apple)

An inference framework that runs only on Apple devices.

import coremltools as ct
import torch

# PyTorch -> CoreML conversion
# Assumes model.pt holds a full pickled nn.Module; recent PyTorch versions
# need weights_only=False to load one
model = torch.load("model.pt", weights_only=False)
model.eval()

traced_model = torch.jit.trace(model, torch.randn(1, 3, 224, 224))
coreml_model = ct.convert(
    traced_model,
    inputs=[ct.ImageType(name="image", shape=(1, 3, 224, 224))],
    convert_to="mlprogram",  # compute_precision applies to ML programs
    compute_precision=ct.precision.FLOAT16,
    compute_units=ct.ComputeUnit.ALL,  # CPU + GPU + Neural Engine
)

# Add metadata
coreml_model.author = "ML Team"
coreml_model.short_description = "Image classifier"

coreml_model.save("MyModel.mlpackage")

// Using CoreML from Swift
import CoreML
import Vision

// try! keeps the example short; production code should handle these errors
let model = try! MyModel(configuration: MLModelConfiguration())
let request = VNCoreMLRequest(model: try! VNCoreMLModel(for: model.model))

let handler = VNImageRequestHandler(cgImage: image, options: [:])
try! handler.perform([request])

if let results = request.results as? [VNClassificationObservation],
   let top = results.first {
    print("Top prediction: \(top.identifier)")
}

2.5 OpenVINO (Intel)

An inference engine for Intel hardware.

import openvino as ov

core = ov.Core()

# Load and compile the model
model = core.read_model("model.xml")
compiled_model = core.compile_model(model, "CPU", config={
    "PERFORMANCE_HINT": "LATENCY",
    "INFERENCE_NUM_THREADS": "4",
})

# INT8 quantization (NNCF)
import nncf

# calibration_loader is assumed to yield model inputs directly;
# otherwise pass a transform function as the second argument of nncf.Dataset
calibration_dataset = nncf.Dataset(calibration_loader)
quantized_model = nncf.quantize(
    model,
    calibration_dataset,
    preset=nncf.QuantizationPreset.MIXED,
    subset_size=300,
)

# Save the quantized model
ov.save_model(quantized_model, "model_int8.xml")

2.6 Runtime Comparison Summary

Inference runtime comparison:
========================================
Runtime      | Platform        | Main hardware | Languages
TensorRT     | Linux/Windows   | NVIDIA GPU    | C++, Python
ONNX Runtime | Cross-platform  | CPU/GPU/NPU   | C++, Python, C#, Java
TFLite       | Mobile/embedded | CPU/GPU/NPU   | C++, Java, Swift, Python
CoreML       | Apple only      | ANE/GPU/CPU   | Swift, Objective-C
OpenVINO     | Primarily Intel | CPU/iGPU/VPU  | C++, Python
MLC LLM      | Cross-platform  | CPU/GPU/NPU   | C++, Python, Swift

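3. Model Optimization Techniques

3.1 Quantization

Quantization lowers the numerical precision of a model's weights (and often its activations) to shrink the model and speed up inference. It is the core optimization technique for edge deployment.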
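
As a rough illustration of what lowering precision means, the sketch below applies asymmetric (affine) 8-bit quantization to a handful of made-up weights and measures the round-trip error. It is a toy in plain Python, not the scheme any particular runtime uses.

```python
def quantize_params(values, num_bits=8):
    """Compute scale and zero point that map [min, max] onto the unsigned integer range."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)  # range must include 0
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    return scale, zero_point, qmin, qmax


def quantize(values, scale, zero_point, qmin, qmax):
    return [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    return [(q - zero_point) * scale for q in q_values]


weights = [-1.2, -0.4, 0.0, 0.3, 0.9, 2.1]
scale, zp, qmin, qmax = quantize_params(weights)
q = quantize(weights, scale, zp, qmin, qmax)
restored = dequantize(q, scale, zp)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"max error = {max_error:.4f}")
```

Each weight is stored in one byte instead of four, and for these values the reconstruction error stays within half a quantization step (`scale / 2`).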

Post-Training Quantization (PTQ)

PTQ quantizes an already trained model without any further training.

# PyTorch PTQ example (load_model and calibration_loader are placeholders)
import torch
from torch.quantization import quantize_dynamic

# Dynamic quantization (the simplest option)
model_fp32 = load_model()
model_int8 = quantize_dynamic(
    model_fp32,
    {torch.nn.Linear, torch.nn.LSTM},
    dtype=torch.qint8
)

# Static quantization (usually faster at inference time)
# In eager mode the model must wrap its forward pass in QuantStub/DeQuantStub
from torch.quantization import prepare, convert, get_default_qconfig

model_fp32.eval()
model_fp32.qconfig = get_default_qconfig('x86')
model_prepared = prepare(model_fp32)

# Collect activation statistics with calibration data
with torch.no_grad():
    for batch in calibration_loader:
        model_prepared(batch)

model_int8 = convert(model_prepared)

Quantization-Aware Training (QAT)

QAT simulates the effect of quantization during training, which minimizes the accuracy loss.

# model, train_loader, criterion, and num_epochs are assumed to be defined elsewhere
import torch
from torch.quantization import prepare_qat, convert

model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = prepare_qat(model)

# QAT training (same as ordinary training)
optimizer = torch.optim.Adam(model_prepared.parameters(), lr=1e-4)
for epoch in range(num_epochs):
    for inputs, target in train_loader:
        optimizer.zero_grad()
        output = model_prepared(inputs)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

# Convert to a quantized model
model_prepared.eval()
model_int8 = convert(model_prepared)

LLM Quantization Techniques

# GPTQ (post-training weight quantization using approximate second-order information)
# calibration_data is a list of tokenized examples (not shown)
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
)

model = AutoGPTQForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantize_config
)
model.quantize(calibration_data)
model.save_quantized("llama-3.1-8b-gptq-4bit")

# AWQ (Activation-aware Weight Quantization)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("llama-3.1-8b-awq-4bit")

# bitsandbytes (quantizes on the fly while loading)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

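Comparison by quantization precision (speedup and accuracy figures are typical values and depend on model and hardware):

Precision | Model size (7B model) | Memory | Speedup | Accuracy loss
FP32      | 28GB                  | 28GB   | 1x      | Baseline
FP16      | 14GB                  | 14GB   | 2x      | Negligible
INT8      | 7GB                   | 7GB    | 3-4x    | Minimal (0.5% or less)
INT4      | 3.5GB                 | 3.5GB  | 4-6x    | Small (1-2%)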
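
The size column follows directly from parameter count times bits per weight. A short sketch of that arithmetic (weights only; activations and KV cache are ignored):

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight storage in GB (10^9 bytes), ignoring activations and KV cache."""
    return num_params * bits_per_weight / 8 / 1e9


for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(7e9, bits):.1f} GB")
```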

3.2 Pruning

Pruning removes unnecessary weights or neurons from a model.

Unstructured Pruning

Sets individual weights to zero, producing a sparse model.

import torch
import torch.nn.utils.prune as prune

# Magnitude-based pruning (load_model is a placeholder)
model = load_model()

# Apply 50% pruning across the whole model
parameters_to_prune = [
    (module, 'weight') for module in model.modules()
    if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d))
]

prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.5,  # prune 50%
)

# Make the pruning masks permanent
for module, name in parameters_to_prune:
    prune.remove(module, name)

# Check sparsity (counts all parameters, including unpruned biases)
total = 0
zero = 0
for p in model.parameters():
    total += p.numel()
    zero += (p == 0).sum().item()
print(f"Sparsity: {zero/total:.2%}")

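Structured Pruning

Removes entire channels or attention heads, which yields real speedups.

# Channel pruning example (the same idea applies to attention heads)
import torch.nn.utils.prune as prune

# Prune 30% of the output channels of one layer
# Note: this only zeroes the selected channels; the tensor keeps its shape,
# so the channels must be physically removed to get a speedup
prune.ln_structured(
    model.layer1.conv1,
    name='weight',
    amount=0.3,
    n=2,
    dim=0  # output-channel dimension
)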

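Movement Pruning

Identifies and removes unimportant weights during fine-tuning. Importance is judged by how a weight moves during training rather than by its magnitude: weights that move toward zero are pruned.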
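
A toy sketch of the idea follows, using the first-order score S_i = -sum_t grad_i(t) * w_i(t), which grows for weights that move away from zero. Real movement pruning learns the scores jointly with the weights during fine-tuning; the histories here are hand-made for illustration (SGD with learning rate 0.1).

```python
def movement_scores(weight_history, grad_history):
    """First-order movement score per weight: S_i = -sum_t grad_i(t) * w_i(t)."""
    n = len(weight_history[0])
    scores = [0.0] * n
    for weights, grads in zip(weight_history, grad_history):
        for i in range(n):
            scores[i] -= grads[i] * weights[i]
    return scores


def keep_mask(scores, keep_ratio):
    """Keep the top keep_ratio fraction of weights by score."""
    k = max(1, round(len(scores) * keep_ratio))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [i in top for i in range(len(scores))]


# Two training steps for four weights; w(t+1) = w(t) - 0.1 * grad(t)
weight_history = [[2.0, 0.1, -0.5, -1.0], [1.8, 0.3, -0.7, -0.8]]
grad_history = [[2.0, -2.0, 2.0, -2.0], [2.0, -2.0, 2.0, -2.0]]

scores = movement_scores(weight_history, grad_history)
mask = keep_mask(scores, keep_ratio=0.5)
print(scores, mask)
```

Weights 0 and 3 have the largest magnitudes but are shrinking toward zero, so they are pruned; magnitude pruning on the same data would have kept exactly those two.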

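Pruning technique comparison:
========================================
Technique        | Sparsity pattern | Real speedup | Accuracy retention
Magnitude-based  | Unstructured     | Limited*     | High
Structured       | Structured       | High         | Medium
Movement pruning | Unstructured     | Limited*     | Very high

* Unstructured pruning needs dedicated hardware or libraries to turn sparsity into speed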

3.3 Knowledge Distillation

Transfers the knowledge of a large teacher model into a small student model.

import torch
import torch.nn.functional as F

class DistillationLoss(torch.nn.Module):
    def __init__(self, temperature=4.0, alpha=0.5):
        super().__init__()
        self.temperature = temperature
        self.alpha = alpha
        self.ce_loss = torch.nn.CrossEntropyLoss()

    def forward(self, student_logits, teacher_logits, labels):
        # Soft-target loss (KL divergence), scaled by T^2 to keep gradient magnitudes comparable
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / self.temperature, dim=1),
            F.softmax(teacher_logits / self.temperature, dim=1),
            reduction='batchmean'
        ) * (self.temperature ** 2)

        # Hard-target loss (cross-entropy)
        hard_loss = self.ce_loss(student_logits, labels)

        return self.alpha * soft_loss + (1 - self.alpha) * hard_loss

# Training loop (teacher_model, student_model, and train_loader are assumed to exist)
distillation_loss = DistillationLoss(temperature=4.0, alpha=0.5)
optimizer = torch.optim.Adam(student_model.parameters(), lr=1e-4)
teacher_model.eval()
student_model.train()

for batch in train_loader:
    inputs, labels = batch
    with torch.no_grad():
        teacher_logits = teacher_model(inputs)
    student_logits = student_model(inputs)

    optimizer.zero_grad()
    loss = distillation_loss(student_logits, teacher_logits, labels)
    loss.backward()
    optimizer.step()

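Well-known examples of knowledge distillation:

Teacher -> student examples:
BERT-base (110M) -> DistilBERT (66M): 40% smaller while retaining about 97% of the performance
Larger LLMs -> Phi-3 Mini (3.8B): trained largely on synthetic data generated by larger models, a distillation-like transfer
Llama 3.1 8B/70B -> Llama 3.2 1B/3B: pruning combined with distillation from the larger models' logits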

3.4 Neural Architecture Search (NAS)

A technique that automatically searches for the best model architecture.

Efficient architectures discovered with NAS (ImageNet Top-1):
========================================
Model             | Parameters | Top-1 accuracy | Inference speed
EfficientNet-B0   | 5.3M       | 77.3%          | Very fast
MobileNetV3-Large | 5.4M       | 75.2%          | Very fast
EfficientNetV2-S  | 21.5M      | 83.9%          | Fast
MnasNet           | 4.2M       | 74.0%          | Very fast

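4. Hardware Landscape

4.1 NVIDIA (Jetson, T4/L4)

NVIDIA edge AI hardware (peak INT8 TOPS from vendor specifications; Jetson and L4 figures assume sparsity):
========================================
Device           | TOPS      | Memory  | TDP    | Use case
Jetson Orin Nano | up to 40  | 4-8GB   | 7-15W  | Embedded AI
Jetson Orin NX   | up to 100 | 8-16GB  | 10-25W | Robots, drones
Jetson AGX Orin  | up to 275 | 32-64GB | 15-60W | Autonomous driving, high performance
T4 (data center) | 130       | 16GB    | 70W    | Edge servers
L4 (data center) | 485       | 24GB    | 72W    | Video AI

# Run TensorRT inference on Jetson
# (after installing the JetPack SDK)

# Convert the model (ONNX -> TensorRT)
# --memPoolSize takes MiB; TensorRT releases before 8.4 use --workspace=2048 instead
/usr/src/tensorrt/bin/trtexec \
  --onnx=model.onnx \
  --saveEngine=model.engine \
  --fp16 \
  --memPoolSize=workspace:2048 \
  --minShapes=input:1x3x224x224 \
  --optShapes=input:4x3x224x224 \
  --maxShapes=input:8x3x224x224

# Benchmark at batch size 4 (engines built from ONNX use explicit shapes, not --batch)
/usr/src/tensorrt/bin/trtexec --loadEngine=model.engine --shapes=input:4x3x224x224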

4.2 Apple (Neural Engine, CoreML, MLX)

Apple Silicon AI performance (base chips; Pro/Max variants offer more GPU cores and memory):
========================================
Chip    | Neural Engine TOPS | GPU cores  | Unified memory
M1      | 11                 | up to 8    | 8-16GB
M2      | 15.8               | up to 10   | 8-24GB
M3      | 18                 | up to 10   | 8-24GB
M4      | 38                 | up to 10   | 8-32GB
A17 Pro | 35                 | 6          | 8GB
A18 Pro | 35                 | 6          | 8GB

# MLX framework (native to Apple Silicon)
import mlx.core as mx
import mlx.nn as nn

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(784, 256)
        self.linear2 = nn.Linear(256, 10)

    def __call__(self, x):
        x = nn.relu(self.linear1(x))
        return self.linear2(x)

model = SimpleModel()
input_data = mx.random.normal((1, 784))
output = model(input_data)
mx.eval(output)  # MLX is lazy; this forces the computation

4.3 Qualcomm (Snapdragon NPU)

Qualcomm AI Engine (approximate, widely reported figures):
========================================
Chipset             | NPU TOPS | Use case
Snapdragon 8 Gen 3  | 45       | Flagship smartphones
Snapdragon 8 Gen 2  | 36       | Premium smartphones
Snapdragon 7+ Gen 2 | 13       | Mid-range smartphones
QCS6490             | 12       | IoT, cameras

4.4 Google (Edge TPU, MediaPipe)

# Compile a model for the Edge TPU
edgetpu_compiler --min_runtime_version 15 model_int8.tflite

# Result: model_int8_edgetpu.tflite

# Real-time inference with MediaPipe
import mediapipe as mp
import cv2

mp_hands = mp.solutions.hands
hands = mp_hands.Hands(
    static_image_mode=False,
    max_num_hands=2,
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5,
)

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break  # camera disconnected or stream ended
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        for hand_landmarks in results.multi_hand_landmarks:
            # Process the 21 landmarks of each hand
            pass

cap.release()
hands.close()

4.5 Intel (OpenVINO, Movidius)

Intel AI accelerators:
========================================
Hardware       | Performance                              | Use case
Core Ultra NPU | ~11 TOPS (Series 1) to ~48 TOPS (Series 2) | Laptop/desktop AI
Arc GPU        | Varies                                   | Desktop/workstation
Movidius VPU   | 4 TOPS                                   | IoT, cameras (discontinued)
Gaudi 2/3      | Server class                             | Edge servers

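5. On-Device LLMs

5.1 llama.cpp

A lightweight LLM inference engine written in C/C++.

# Build (recent versions use CMake; older checkouts used make with LLAMA_METAL=1 / LLAMA_CUDA=1)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Metal (macOS): enabled by default in the CMake build on Apple Silicon

# CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Quantize a model (GGUF format)
./build/bin/llama-quantize models/llama-3.1-8b-f16.gguf \
  models/llama-3.1-8b-q4_k_m.gguf Q4_K_M

# Run inference (-ngl sets the number of layers offloaded to the GPU)
./build/bin/llama-cli \
  -m models/llama-3.1-8b-q4_k_m.gguf \
  -p "Explain quantum computing in simple terms:" \
  -n 256 \
  -ngl 99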

5.2 GGUF Quantization Levels

GGUF quantization comparison (Llama 3.1 8B; memory and speed are approximate):
========================================
Quantization | Size   | Memory | Quality       | Speed
Q2_K         | 2.96GB | 5.4GB  | Low           | Very fast
Q3_K_M       | 3.52GB | 6.0GB  | Fair          | Fast
Q4_K_M       | 4.58GB | 7.0GB  | Good          | Fast
Q5_K_M       | 5.33GB | 7.8GB  | Very good     | Moderate
Q6_K         | 6.14GB | 8.6GB  | Excellent     | Moderate
Q8_0         | 7.95GB | 10.4GB | Near original | Slow
F16          | 15.0GB | 17.5GB | Original      | Slow
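
The file sizes can be turned into an effective number of bits per weight. The sketch below assumes the sizes in the table are binary gigabytes (GiB) and that the model has about 8.03 billion parameters; both are assumptions, chosen because they make the F16 row come out at roughly 16 bits.

```python
GIB = 2 ** 30
LLAMA_3_1_8B_PARAMS = 8.03e9  # assumed parameter count


def bits_per_weight(size_gib, num_params=LLAMA_3_1_8B_PARAMS):
    """Effective bits per weight for a model file of size_gib GiB."""
    return size_gib * GIB * 8 / num_params


sizes_gib = {"Q2_K": 2.96, "Q4_K_M": 4.58, "Q8_0": 7.95, "F16": 15.0}
for name, size in sizes_gib.items():
    print(f"{name}: {bits_per_weight(size):.2f} bits/weight")
```

Under these assumptions Q4_K_M lands near 4.9 bits per weight rather than exactly 4, because K-quants store per-block scales and keep some tensors at higher precision.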

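5.3 MLC LLM

A framework for running LLMs across many platforms.

# Install MLC LLM
# (prebuilt wheels are distributed through the project's own package index;
# check the MLC LLM documentation for the exact command for your platform)
pip install mlc-llm

# Compile the model library (Vulkan backend); -o names the output library file
mlc_llm compile ./dist/Llama-3.1-8B-q4f16_1-MLC/mlc-chat-config.json \
  --device vulkan \
  -o ./dist/libs/Llama-3.1-8B-q4f16_1-vulkan.so

# Compile for iOS (use --device android for Android)
mlc_llm compile ./dist/Llama-3.1-8B-q4f16_1-MLC/mlc-chat-config.json \
  --device iphone \
  -o ./dist/libs/Llama-3.1-8B-q4f16_1-iphone.tar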

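5.4 MLX (Apple Silicon)

# LLM inference with MLX
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")

prompt = "What is machine learning?"
# Recent mlx-lm versions take sampling settings through a sampler;
# older versions accepted temp=0.7 directly in generate()
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=256,
    sampler=make_sampler(temp=0.7),
)
print(response)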

5.5 LLMs That Can Run on Edge Devices

On-device LLM comparison:
========================================
Model             | Size  | RAM required | Speed (tok/s) | Quality
Phi-3 Mini (3.8B) | 2.3GB | 4GB          | 20-40         | Excellent
Gemma 2 (2B)      | 1.4GB | 3GB          | 30-50         | Good
Llama 3.2 (1B)    | 0.7GB | 2GB          | 40-60         | Fair
Llama 3.2 (3B)    | 1.8GB | 3.5GB        | 25-40         | Good
Qwen2.5 (3B)      | 1.8GB | 3.5GB        | 25-40         | Good
SmolLM (1.7B)     | 1.0GB | 2.5GB        | 35-55         | Fair

* With Q4_K_M quantization; rough figures for smartphones

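6. Federated Learning

6.1 The FedAvg Algorithm

The most basic federated learning algorithm.

# FedAvg sketch (train_local and the client objects are placeholders)
import copy
import random

def federated_averaging(global_model, clients, rounds, local_epochs, fraction=0.3):
    for round_num in range(rounds):
        # 1. Sample a subset of clients and send them the global model
        num_selected = max(1, int(fraction * len(clients)))
        selected_clients = random.sample(clients, num_selected)
        client_models = []
        client_sizes = []

        for client in selected_clients:
            # 2. Train locally on each client
            local_model = copy.deepcopy(global_model)
            local_model = train_local(
                local_model,
                client.data,
                epochs=local_epochs,
                lr=0.01
            )
            client_models.append(local_model.state_dict())
            client_sizes.append(len(client.data))

        # 3. Update the global model with a data-size-weighted average
        total_size = sum(client_sizes)
        new_global = {}
        for key in global_model.state_dict():
            # Cast to float so integer buffers (e.g. BatchNorm counters) can be averaged
            new_global[key] = sum(
                client_models[i][key].float() * (client_sizes[i] / total_size)
                for i in range(len(client_models))
            )
        global_model.load_state_dict(new_global)

    return global_model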

6.2 The Flower Framework

# Flower server
import flwr as fl

strategy = fl.server.strategy.FedAvg(
    fraction_fit=0.3,          # 30% of clients train each round
    fraction_evaluate=0.2,     # 20% of clients evaluate
    min_fit_clients=2,         # minimum number of training clients
    min_evaluate_clients=2,
    min_available_clients=3,
)

fl.server.start_server(
    server_address="0.0.0.0:8080",
    config=fl.server.ServerConfig(num_rounds=10),
    strategy=strategy,
)

# Flower client (train and test are user-defined helpers, not shown)
from collections import OrderedDict

import flwr as fl
import torch

class FlowerClient(fl.client.NumPyClient):
    def __init__(self, model, trainloader, testloader):
        self.model = model
        self.trainloader = trainloader
        self.testloader = testloader

    def get_parameters(self, config):
        # Send every state-dict entry (weights and buffers) as NumPy arrays
        return [val.cpu().numpy() for val in self.model.state_dict().values()]

    def set_parameters(self, parameters):
        keys = self.model.state_dict().keys()
        state_dict = OrderedDict((k, torch.tensor(v)) for k, v in zip(keys, parameters))
        self.model.load_state_dict(state_dict, strict=True)

    def fit(self, parameters, config):
        self.set_parameters(parameters)
        train(self.model, self.trainloader, epochs=1)
        return self.get_parameters(config), len(self.trainloader.dataset), {}

    def evaluate(self, parameters, config):
        self.set_parameters(parameters)
        loss, accuracy = test(self.model, self.testloader)
        return float(loss), len(self.testloader.dataset), {"accuracy": float(accuracy)}

# start_numpy_client is deprecated; newer Flower releases also offer a SuperNode-based launcher
fl.client.start_client(
    server_address="localhost:8080",
    client=FlowerClient(model, trainloader, testloader).to_client(),
)

6.3 Differential Privacy

# Differentially private training with Opacus
# create_model and train_loader are assumed to be defined elsewhere
import torch
from opacus import PrivacyEngine

model = create_model()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
criterion = torch.nn.CrossEntropyLoss()

privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    epochs=10,
    target_epsilon=1.0,    # privacy budget
    target_delta=1e-5,     # failure probability
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)

# Training (same as ordinary training; clipping and noise happen inside optimizer.step())
for epoch in range(10):
    for inputs, target in train_loader:
        optimizer.zero_grad()
        output = model(inputs)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

# Check how much privacy budget was spent
epsilon = privacy_engine.get_epsilon(delta=1e-5)
print(f"Epsilon: {epsilon:.2f}")
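
What the privacy engine does at each step can be sketched without any library: clip every per-sample gradient to a maximum norm, sum them, add Gaussian noise scaled to that norm, and average. This is a simplified illustration of DP-SGD with made-up gradients, not Opacus internals.

```python
import math
import random


def clip(grad, max_norm):
    """Scale grad down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]


def dp_sgd_gradient(per_sample_grads, max_grad_norm, noise_multiplier, rng):
    """Clip each sample's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, max_grad_norm) for g in per_sample_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_grad_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]
print(dp_sgd_gradient(grads, max_grad_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds how much any single example can influence the update, and the noise hides what remains; a larger `noise_multiplier` gives a smaller epsilon at the cost of accuracy.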

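7. Privacy-Preserving AI

7.1 On-Device Processing

An approach that processes sensitive data without sending it off the device.

On-device privacy-preserving examples:
========================================
Example                    | Technology                     | Privacy guarantee
Apple Face ID              | Neural Engine + Secure Enclave | Face data processed on the device
Google Keyboard prediction | Federated Learning             | Typing data not sent to the server
Apple Siri speech recognition | On-device CoreML            | Voice data processed locally
Samsung Knox AI            | NPU + TEE                      | Enterprise data isolation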

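7.2 Secure Aggregation

In federated learning, secure aggregation hides individual model updates so that even the server cannot see them.

Secure Aggregation protocol:
========================================
1. Each client generates masks (by exchanging pairwise seeds)
2. Each client adds its mask to the local model update before sending it
3. The server collects only masked updates
4. During aggregation the masks cancel out, so only the sum is recovered
5. The server cannot recover any individual update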
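
The cancellation in steps 1 to 4 can be demonstrated with a toy example. Here a seeded random generator stands in for the pairwise key agreement, updates are small integer vectors, and dropouts are not handled; a real protocol needs all of those pieces.

```python
import random

MODULUS = 2 ** 32


def pairwise_masks(client_ids, dim, seed_of_pair):
    """For each pair (a, b) with a before b, a adds a shared random vector and b subtracts it."""
    masks = {cid: [0] * dim for cid in client_ids}
    for i, a in enumerate(client_ids):
        for b in client_ids[i + 1:]:
            rng = random.Random(seed_of_pair(a, b))  # stands in for a key-agreement result
            shared = [rng.randrange(MODULUS) for _ in range(dim)]
            for k in range(dim):
                masks[a][k] = (masks[a][k] + shared[k]) % MODULUS
                masks[b][k] = (masks[b][k] - shared[k]) % MODULUS
    return masks


updates = {"c1": [5, 1, 7], "c2": [2, 9, 4], "c3": [8, 3, 6]}  # integer-encoded updates
ids = sorted(updates)
masks = pairwise_masks(ids, dim=3, seed_of_pair=lambda a, b: f"{a}|{b}")
masked = {cid: [(u + m) % MODULUS for u, m in zip(updates[cid], masks[cid])] for cid in ids}

# The server only sees `masked`; summing cancels every pairwise mask
aggregate = [sum(col) % MODULUS for col in zip(*masked.values())]
print(aggregate)
```

Each masked vector looks random on its own, yet the modular sum equals the sum of the original updates.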

8. Deployment Pipeline

8.1 Model Conversion Workflow

Training framework → intermediate format → inference engine → device
========================================
PyTorch  ─┐
           ├──► ONNX ──┬──► TensorRT (.engine)  ──► NVIDIA GPU
TensorFlow┤           ├──► ONNX Runtime           ──► Cross-platform
           │           ├──► OpenVINO (.xml)       ──► Intel
           ├──► TFLite ──► TFLite Runtime         ──► Mobile/IoT
           ├──► CoreML ──► CoreML Runtime          ──► Apple
           └──► GGUF   ──► llama.cpp              ──► All platforms

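8.2 OTA (Over-The-Air) Model Updates

# Example: model versioning and OTA updates
# get_current_version, apply_delta, and save_model are left to the application
import hashlib
import requests

def parse_version(version):
    """Turn "2.10.0" into (2, 10, 0) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in version.split("."))

class ModelManager:
    def __init__(self, model_dir, manifest_url):
        self.model_dir = model_dir
        self.manifest_url = manifest_url

    def check_update(self):
        """Fetch the latest model manifest from the server."""
        response = requests.get(self.manifest_url, timeout=10)
        response.raise_for_status()
        manifest = response.json()
        current_version = self.get_current_version()

        if parse_version(manifest["version"]) > parse_version(current_version):
            return manifest
        return None

    def download_model(self, manifest):
        """Apply a delta update if available, otherwise download the full model."""
        if manifest.get("delta_url"):
            # Download a delta patch (smaller transfer)
            patch = requests.get(manifest["delta_url"], timeout=60).content
            self.apply_delta(patch)
        else:
            # Download the full model
            model_data = requests.get(manifest["model_url"], timeout=60).content
            self.save_model(model_data, manifest)

    def validate_model(self, model_path, expected_hash):
        """Verify integrity with SHA-256, reading in chunks to limit memory use."""
        digest = hashlib.sha256()
        with open(model_path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest() == expected_hash

    def rollback(self):
        """Roll back to the previous version if something goes wrong."""
        # Logic for restoring the previous version goes here
        pass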

8.3 Model Packaging

Model manifest (model_manifest.json):
{
  "model_name": "image-classifier-v2",
  "version": "2.1.0",
  "framework": "tflite",
  "file": "model_int8.tflite",
  "input_shape": [1, 224, 224, 3],
  "input_type": "uint8",
  "output_shape": [1, 1000],
  "labels_file": "labels.txt",
  "hardware_requirements": {
    "min_ram_mb": 256,
    "supported_delegates": ["gpu", "nnapi", "xnnpack"]
  },
  "metrics": {
    "accuracy": 0.943,
    "latency_ms": 12.5,
    "model_size_mb": 4.2
  }
}

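9. Use Cases

9.1 Autonomous Driving

Autonomous driving Edge AI stack:
========================================
Sensor               | AI task                        | Hardware    | Latency requirement
Cameras (8-12 units) | Object detection/classification | NVIDIA Orin | 10 ms or less
LiDAR                | 3D point cloud processing      | NVIDIA Orin | 20 ms or less
Radar                | Distance/velocity estimation   | DSP         | 5 ms or less
Fusion               | Sensor fusion/decision         | NVIDIA Orin | 30 ms or less
Planning             | Path planning                  | CPU/GPU     | 50 ms or less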

9.2 Smart Cameras

# Real-time object detection on an edge device
# preprocess and draw_box are application-specific helpers (not shown)
import cv2
import numpy as np
import tensorflow as tf

# Load the TFLite model (SSD MobileNet)
interpreter = tf.lite.Interpreter(
    model_path="ssd_mobilenet_v2_int8.tflite",
    num_threads=4
)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

cap = cv2.VideoCapture(0)
while True:
    ret, frame = cap.read()
    if not ret:
        break
    # Preprocess
    input_data = preprocess(frame, target_size=(300, 300))

    # Inference
    interpreter.set_tensor(input_details[0]['index'], input_data)
    interpreter.invoke()

    # Parse results (output tensor order varies by model; check output_details)
    boxes = interpreter.get_tensor(output_details[0]['index'])
    classes = interpreter.get_tensor(output_details[1]['index'])
    scores = interpreter.get_tensor(output_details[2]['index'])

    # Draw only high-confidence detections
    for i in range(len(scores[0])):
        if scores[0][i] > 0.5:
            draw_box(frame, boxes[0][i], classes[0][i], scores[0][i])

cap.release()

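9.3 Voice Assistants

On-device speech processing pipeline:
========================================
1. Wake word detection: small model (50KB-1MB), always running, on the NPU
2. Voice activity detection (VAD): identifies speech segments, very lightweight
3. Speech recognition (ASR): Whisper Tiny/Base (39M-74M parameters)
4. Natural language understanding (NLU): intent classification + named entity recognition
5. Response generation: small LLM or template-based
6. Speech synthesis (TTS): converts text to speech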
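
Step 2 is the cheapest stage of the pipeline. A minimal energy-threshold VAD is sketched below; production systems usually use a small neural model instead, and the frame size, threshold, and synthetic signal are illustrative assumptions.

```python
import math


def frame_energy_db(frame):
    """Mean power of one frame in dB relative to full scale (amplitude 1.0)."""
    power = sum(s * s for s in frame) / len(frame)
    return 10 * math.log10(power + 1e-12)


def detect_speech(samples, frame_size=160, threshold_db=-30.0):
    """Return one boolean per frame: True if the frame energy exceeds the threshold."""
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples) - frame_size + 1, frame_size)]
    return [frame_energy_db(f) > threshold_db for f in frames]


# Synthetic signal at 16 kHz: 3 frames of near-silence, then 3 frames of a 440 Hz tone
silence = [0.001 * math.sin(i) for i in range(480)]
tone = [0.5 * math.sin(2 * math.pi * 440 * i / 16000) for i in range(480)]
flags = detect_speech(silence + tone)
print(flags)
```

Only frames flagged as active need to be passed to the much more expensive ASR stage, which is how an always-on pipeline stays within a small power budget.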

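9.4 Medical Devices

Medical Edge AI applications (typical FDA classes; the actual class depends on the intended use):
========================================
Device              | AI task                          | Regulatory consideration
Smartwatch          | ECG arrhythmia detection         | FDA Class II
Stethoscope         | Heart/lung sound analysis        | FDA Class II
Fundus camera       | Diabetic retinopathy screening   | FDA Class II
CT/MRI assistance   | Lesion detection and highlighting | FDA Class II or III
Blood glucose meter | Glucose trend prediction         | FDA Class II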

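9.5 Industrial IoT

Industrial Edge AI applications (percentages are commonly reported figures, not guarantees):
========================================
Area                   | AI task                               | Benefit
Quality inspection     | Visual defect detection               | Real time, 100% inspection
Predictive maintenance | Vibration/noise anomaly detection     | Up to about 40% less downtime
Energy optimization    | Consumption pattern prediction        | About 15% energy savings
Safety monitoring      | PPE compliance detection              | Accident prevention
Robot control          | Path planning/obstacle avoidance      | Autonomous operation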

10. Challenges and Limitations

10.1 Power Constraints

Power efficiency comparison (approximate vendor peak figures, not measured under identical conditions):
========================================
Device                   | Power | AI performance | TOPS/W
NVIDIA Jetson Orin Nano  | 7-15W | 40 TOPS        | 2.7-5.7
Apple A18 Pro            | ~5W   | 35 TOPS        | 7.0
Google Edge TPU          | 2W    | 4 TOPS         | 2.0
Qualcomm Snapdragon 8 Gen 3 | ~5W | 45 TOPS       | 9.0
Intel Core Ultra NPU     | ~5W   | 34 TOPS        | 6.8
10.2 Thermal Management

Sustained AI inference on mobile devices generates heat. Once thermal throttling kicks in, performance can drop by 30-50%.

10.3 Memory Limits

Memory constraints by device type:
========================================
Device type     | Typical RAM | Memory available for AI
IoT sensor      | 256KB-4MB   | Extremely limited
Microcontroller | 2-16MB      | Very limited
Smart camera    | 256MB-2GB   | A few hundred MB
Smartphone      | 6-16GB      | 2-8GB
Edge server     | 16-128GB    | Most of it

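10.4 The Difficulty of Model Updates

OTA model update challenges:
========================================
- Bandwidth limits: large models must be transferred over wireless links
- Battery drain: downloading and converting the model costs energy
- A/B testing: old and new versions must be stored side by side
- Rollback: a mechanism to restore the previous version on failure
- Integrity: protection against model tampering (signature/hash verification)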
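
The A/B, rollback, and integrity items can be combined in a two-slot layout: write the new model into the inactive slot, verify its hash, and only then switch. The sketch below is one possible arrangement; `SlotStore`, the file names, and the state file format are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


class SlotStore:
    """Keeps two model slots (a/b) and a small state file naming the active one."""

    def __init__(self, root: Path):
        self.root = root
        self.state_file = root / "active.json"
        if not self.state_file.exists():
            self._write_state("a")

    def _write_state(self, slot: str):
        tmp = self.state_file.with_suffix(".tmp")
        tmp.write_text(json.dumps({"active": slot}))
        tmp.replace(self.state_file)  # rename is atomic on POSIX filesystems

    @property
    def active(self) -> str:
        return json.loads(self.state_file.read_text())["active"]

    def install(self, data: bytes, expected_sha256: str) -> bool:
        """Write to the inactive slot; switch only if the hash matches."""
        target_slot = "b" if self.active == "a" else "a"
        target = self.root / f"model_{target_slot}.bin"
        target.write_bytes(data)
        if sha256_of(target) != expected_sha256:
            target.unlink()
            return False  # the active slot was never touched
        self._write_state(target_slot)
        return True


with tempfile.TemporaryDirectory() as d:
    store = SlotStore(Path(d))
    good = b"new-model-bytes"
    ok = store.install(good, hashlib.sha256(good).hexdigest())
    bad = store.install(b"corrupted", "0" * 64)
    final_active = store.active
    print(ok, bad, final_active)
```

Because a failed install never modifies the active slot, rollback after a bad download needs no extra work; rolling back after a bad but valid model would mean switching the state file back to the other slot.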

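Quiz

Q1: Explain at least five key differences between Edge AI and Cloud AI.

Latency: Edge AI runs inference locally in 1-10 ms; Cloud AI needs a 50-200 ms network round trip
Privacy: With Edge AI the data stays on the device; Cloud AI requires sending data out
Bandwidth: Edge AI sends only results (minimal); Cloud AI sends raw data (heavy)
Offline: Edge AI works without a network; Cloud AI does not
Compute power: Edge AI is limited (NPUs with tens of TOPS); Cloud AI is nearly unlimited
Model size: Edge AI is limited to MB up to a few GB; Cloud AI can use hundreds of GB
Cost structure: Edge AI has upfront hardware cost; Cloud AI has ongoing service cost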

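Q2: Explain the difference between PTQ and QAT, and compare their pros and cons.

PTQ (Post-Training Quantization):

Quantizes an already trained model without further training
Pros: simple, needs no training data (only calibration data), quick to apply
Cons: accuracy loss can be larger than with QAT, especially at INT4

QAT (Quantization-Aware Training):

Simulates the effect of quantization during training
Pros: minimal accuracy loss, good quality even at INT4
Cons: needs additional training (training data, GPUs, time) and is more complex to implement

In general, PTQ alone is sufficient for INT8, while QAT is recommended for INT4 and below.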

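Q3: Explain the FedAvg algorithm used in federated learning.

The steps of FedAvg (Federated Averaging):

Model distribution: the server sends the global model to the selected clients
Local training: each client trains the model on its own local data (for several epochs)
Update collection: the trained model parameters are sent to the server (the raw data is not)
Weighted averaging: the server updates the global model with an average weighted by each client's data size
Iteration: the process repeats for multiple rounds

Key point: the data never leaves the device. Only model updates (weights) are exchanged with the server, which preserves privacy.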

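Q4: Explain what Q4_K_M means in GGUF quantization.

GGUF quantization naming convention:

Q: Quantization
4: Nominal bit width (4 bits). The effective average is somewhat higher because scales are stored and some tensors keep more precision
K: K-quant scheme. Weights are quantized in blocks with their own scales, and more important tensors are kept at higher precision
M: Medium. One of S (Small, lowest quality), M (Medium), and L (Large, highest quality)

Q4_K_M is the 4-bit, medium K-quant. It makes the model roughly 3 times smaller than F16 while keeping quality loss small, which has made it one of the most popular quantization levels. For Llama 3.1 8B the file is about 4.58GB, small enough to run on high-end smartphones.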

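Q5: What is the difference between structured and unstructured pruning?

Unstructured Pruning:

Sets individual weights to zero, producing sparse matrices
Can reach high sparsity (90% or more)
Large theoretical compute savings, but limited real speedup on general-purpose hardware
Requires hardware or libraries built for sparse matrix operations

Structured Pruning:

Removes structural units such as whole channels, filters, or attention heads
The resulting model still runs as ordinary dense tensor operations
Delivers real speedups even on general-purpose hardware
Achievable sparsity can be lower than with unstructured pruning (around 30-50%)

In practice, structured pruning is more effective at improving real inference speed, while unstructured pruning pays off on dedicated hardware such as NVIDIA's Sparse Tensor Cores.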

참고 자료

Edge AI Complete Guide 2025: On-Device Inference, Model Optimization, TensorRT/ONNX/CoreML

Introduction: Why Edge AI?

The future of AI is not just the cloud. An increasing amount of AI inference is running where data is generated -- on edge devices. Edge AI is a paradigm for running AI models directly on smartphones, IoT devices, vehicles, and medical devices without sending data to the cloud.

5 reasons Edge AI is essential:

Latency: Millisecond-level responses without cloud round-trips. For autonomous vehicles, even 100ms delay can be fatal.
Privacy: Sensitive data (faces, voice, medical) never leaves the device.
Bandwidth savings: Cameras generate gigabytes of video per second, but only inference results need to be transmitted.
Cost reduction: Run inference locally without cloud GPU costs.
Offline operation: AI features work without network connectivity.

1. Edge AI vs Cloud AI Comparison

Dimension	Edge AI	Cloud AI
Latency	1-10ms (local)	50-200ms (network round-trip)
Privacy	Data stays on device	Data must be sent to cloud
Bandwidth	Minimal (results only)	Heavy (raw data transfer)
Cost	Upfront hardware cost	Ongoing cloud costs
Offline	Fully supported	Not possible
Model size	Limited (MB to a few GB)	Unlimited (hundreds of GB)
Compute power	Limited (NPU/GPU)	Nearly unlimited
Updates	OTA update required	Instant deployment
Scalability	Proportional to device count	Elastic with cloud resources
Accuracy	Slight degradation possible	Maximum accuracy

2. Inference Runtime Comparison

2.1 TensorRT (NVIDIA)

A high-performance inference engine exclusive to NVIDIA GPUs.

import tensorrt as trt
import numpy as np

# Build TensorRT engine
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Load ONNX model
with open("model.onnx", "rb") as f:
    parser.parse(f.read())

# Optimization profile configuration
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1GB

# Enable INT8 quantization
config.set_flag(trt.BuilderFlag.INT8)
config.int8_calibrator = MyCalibrator(calibration_data)

# Enable FP16
config.set_flag(trt.BuilderFlag.FP16)

# Build and serialize engine
serialized_engine = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(serialized_engine)

Key TensorRT optimization techniques:

1. Layer fusion: Combines Conv + BatchNorm + ReLU into a single kernel
2. Kernel auto-tuning: Automatic selection of optimal CUDA kernels per hardware
3. Dynamic tensor memory: Reduces total usage through memory reuse
4. Precision calibration: Minimizes accuracy loss during INT8 quantization
5. Multi-stream execution: Runs multiple inferences in parallel

2.2 ONNX Runtime

A cross-platform inference engine that works across various hardware.

import onnxruntime as ort
import numpy as np

# Session options
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.intra_op_num_threads = 4
sess_options.inter_op_num_threads = 2

# Execution Provider selection
# CPU: CPUExecutionProvider
# GPU: CUDAExecutionProvider, TensorrtExecutionProvider
# Mobile: CoreMLExecutionProvider, NNAPIExecutionProvider
providers = [
    ('TensorrtExecutionProvider', {
        'trt_max_workspace_size': 2147483648,
        'trt_fp16_enable': True,
        'trt_int8_enable': True,
    }),
    ('CUDAExecutionProvider', {
        'device_id': 0,
        'arena_extend_strategy': 'kNextPowerOfTwo',
    }),
    'CPUExecutionProvider'
]

session = ort.InferenceSession("model.onnx", sess_options, providers=providers)

# Run inference
input_name = session.get_inputs()[0].name
output = session.run(None, {input_name: input_data})

2.3 TensorFlow Lite (TFLite)

A lightweight inference engine for mobile and embedded devices.

import numpy as np
import tensorflow as tf

# Convert to TFLite model
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")

# Quantization settings
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Full integer quantization (INT8)
def representative_dataset():
    for data in calibration_data:
        yield [data.astype(np.float32)]

converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()

# Save model
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)

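# TFLite inference
interpreter = tf.lite.Interpreter(model_path="model_int8.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# The converted model takes int8 input, so quantize the float32 input_data
# with the scale and zero point stored in the model
scale, zero_point = input_details[0]['quantization']
input_int8 = np.clip(np.round(input_data / scale) + zero_point, -128, 127).astype(np.int8)

interpreter.set_tensor(input_details[0]['index'], input_int8)
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]['index'])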

2.4 CoreML (Apple)

An inference framework that runs only on Apple devices.

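import coremltools as ct
import torch

# PyTorch -> CoreML conversion
# model.pt is assumed to hold a fully pickled nn.Module (not just a state_dict)
model = torch.load("model.pt", weights_only=False)
model.eval()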

traced_model = torch.jit.trace(model, torch.randn(1, 3, 224, 224))
coreml_model = ct.convert(
    traced_model,
    inputs=[ct.ImageType(name="image", shape=(1, 3, 224, 224))],
    compute_precision=ct.precision.FLOAT16,
    compute_units=ct.ComputeUnit.ALL,  # CPU + GPU + Neural Engine
)

# Add metadata
coreml_model.author = "ML Team"
coreml_model.short_description = "Image classifier"

coreml_model.save("MyModel.mlpackage")

// Using CoreML in Swift
import CoreML
import Vision

let model = try! MyModel(configuration: MLModelConfiguration())
let request = VNCoreMLRequest(model: try! VNCoreMLModel(for: model.model))

let handler = VNImageRequestHandler(cgImage: image, options: [:])
try! handler.perform([request])

if let results = request.results as? [VNClassificationObservation] {
    print("Top prediction: \(results.first!.identifier)")
}

2.5 OpenVINO (Intel)

An inference engine for Intel hardware.

import openvino as ov

core = ov.Core()

# Load and compile model
model = core.read_model("model.xml")
compiled_model = core.compile_model(model, "CPU", config={
    "PERFORMANCE_HINT": "LATENCY",
    "INFERENCE_NUM_THREADS": "4",
})

# INT8 quantization (NNCF)
import nncf

calibration_dataset = nncf.Dataset(calibration_loader)
quantized_model = nncf.quantize(
    model,
    calibration_dataset,
    preset=nncf.QuantizationPreset.MIXED,
    subset_size=300,
)

# Save quantized model
ov.save_model(quantized_model, "model_int8.xml")

2.6 Runtime Comparison Summary

Inference Runtime Comparison:
========================================
Runtime      | Platform       | Key Hardware    | Languages
TensorRT     | Linux/Windows  | NVIDIA GPU      | C++, Python
ONNX Runtime | Cross-platform | CPU/GPU/NPU     | C++, Python, C#, Java
TFLite       | Mobile/Embedded| CPU/GPU/NPU     | C++, Java, Swift, Python
CoreML       | Apple only     | ANE/GPU/CPU     | Swift, Objective-C
OpenVINO     | Intel-focused  | CPU/iGPU/VPU    | C++, Python
MLC LLM      | Cross-platform | CPU/GPU/NPU     | C++, Python, Swift

3. Model Optimization Techniques

3.1 Quantization

Quantization is a core technique that lowers the numerical precision of model weights (and often activations) to shrink the model and speed up inference.
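
To make the idea concrete, here is a minimal sketch of affine (scale plus zero-point) INT8 quantization in plain Python. It is an illustration of the arithmetic only, not the routine any particular framework uses; the function names and sample weights are hypothetical.

```python
def quantize_int8(values):
    """Map floats to int8 with an affine (scale + zero-point) scheme."""
    lo = min(min(values), 0.0)  # the representable range must include 0
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Recover approximate floats from the int8 codes."""
    return [(qi - zero_point) * scale for qi in q]

weights = [-1.2, -0.4, 0.0, 0.35, 0.9, 2.1]  # illustrative values
q, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 codes:", q)
print("scale:", scale, "zero point:", zero_point)
print("max round-trip error:", max_error)
```

Each weight is stored in one byte instead of four, and the round-trip error stays within about half a quantization step for values inside the calibrated range.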

Post-Training Quantization (PTQ)

Quantizes an already-trained model without any retraining.

# PyTorch PTQ example
import torch
from torch.quantization import quantize_dynamic

# Dynamic quantization (simplest)
model_fp32 = load_model()
model_int8 = quantize_dynamic(
    model_fp32,
    {torch.nn.Linear, torch.nn.LSTM},
    dtype=torch.qint8
)

# Static quantization (higher performance)
from torch.quantization import prepare, convert, get_default_qconfig

model_fp32.eval()
model_fp32.qconfig = get_default_qconfig('x86')
model_prepared = prepare(model_fp32)

# Collect statistics with calibration data
with torch.no_grad():
    for batch in calibration_loader:
        model_prepared(batch)

model_int8 = convert(model_prepared)

Quantization-Aware Training (QAT)

Simulates quantization effects during training to minimize accuracy loss.

import torch
from torch.quantization import prepare_qat, convert

model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = prepare_qat(model)

# QAT training (same as regular training)
# criterion, num_epochs, and train_loader are defined elsewhere
optimizer = torch.optim.Adam(model_prepared.parameters(), lr=1e-4)
for epoch in range(num_epochs):
    for inputs, target in train_loader:
        optimizer.zero_grad()
        output = model_prepared(inputs)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

# Convert to quantized model
model_prepared.eval()
model_int8 = convert(model_prepared)

LLM Quantization Techniques

# GPTQ (GPU-based quantization)
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
)

model = AutoGPTQForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantize_config
)
model.quantize(calibration_data)
model.save_quantized("llama-3.1-8b-gptq-4bit")

# AWQ (Activation-aware Weight Quantization)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("llama-3.1-8b-awq-4bit")

# bitsandbytes (simple quantization)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

Precision-level comparison:

Precision | Model Size (7B) | Memory | Speed Gain | Accuracy Loss
FP32      | 28GB            | 28GB   | 1x         | Baseline
FP16      | 14GB            | 14GB   | 2x         | Negligible
INT8      | 7GB             | 7GB    | 3-4x       | Minimal (under 0.5%)
INT4      | 3.5GB           | 3.5GB  | 4-6x       | Slight (1-2%)
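
The size column follows directly from parameter count times bits per weight. A small sketch of that arithmetic, assuming 7 billion parameters and counting only the weights (no metadata, activations, or KV cache); the helper name is hypothetical:

```python
def weight_size_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_size_gb(7e9, bits):.1f} GB")
```

Real quantized files are usually somewhat larger than this estimate because they also store scales and keep some tensors at higher precision.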

3.2 Pruning

Pruning removes unnecessary weights or neurons from a model.

Unstructured Pruning

Creates sparse models by zeroing out individual weights.

import torch
import torch.nn.utils.prune as prune

# Magnitude-based pruning
model = load_model()

# Apply 50% pruning across the entire model
parameters_to_prune = [
    (module, 'weight') for module in model.modules()
    if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d))
]

prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.5,  # 50% pruning
)

# Permanently apply pruning masks
for module, name in parameters_to_prune:
    prune.remove(module, name)

# Check sparsity
total = 0
zero = 0
for p in model.parameters():
    total += p.numel()
    zero += (p == 0).sum().item()
print(f"Sparsity: {zero/total:.2%}")

Structured Pruning

Removes entire channels or heads for actual speed improvements.

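# Channel pruning example
# Note: ln_structured masks whole channels to zero but keeps the tensor shape;
# a real speedup requires rebuilding the layer without the zeroed channels
import torch.nn.utils.prune as prune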

# Prune 30% of output channels in a specific layer
prune.ln_structured(
    model.layer1.conv1,
    name='weight',
    amount=0.3,
    n=2,
    dim=0  # Output channel dimension
)

Movement Pruning

Scores weights by how they move during fine-tuning and removes those moving toward zero, rather than those that are merely small.

Pruning Technique Comparison:
========================================
Technique        | Sparsity Pattern | Actual Speedup | Accuracy
Magnitude-based  | Unstructured     | Limited*       | High
Structured       | Structured       | High           | Medium
Movement Pruning | Unstructured     | Limited*       | Very High

* Unstructured sparsity yields a real speedup only with sparse-aware hardware or libraries
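
The reason structured pruning speeds up ordinary hardware is that pruned units can be physically removed, leaving a smaller dense model. The following is a minimal sketch under simplifying assumptions: a two-layer MLP stored as nested lists, with hidden units ranked by the L2 norm of their incoming weights. The function names and weights are hypothetical.

```python
import math

def mlp(x, w1, b1, w2, b2):
    """Two-layer MLP with ReLU: y = W2 * relu(W1 * x + b1) + b2."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]

def prune_hidden_units(w1, b1, w2, keep):
    """Keep the `keep` hidden units with the largest L2 weight norm.

    Dropping hidden unit j removes row j of W1, entry j of b1, and column j
    of W2, so the result is a smaller dense model rather than a sparse one.
    """
    norms = [math.sqrt(sum(w * w for w in row)) for row in w1]
    ranked = sorted(range(len(w1)), key=lambda j: norms[j], reverse=True)
    kept = sorted(ranked[:keep])
    new_w1 = [w1[j] for j in kept]
    new_b1 = [b1[j] for j in kept]
    new_w2 = [[row[j] for j in kept] for row in w2]
    return new_w1, new_b1, new_w2, kept

# 2 inputs -> 4 hidden units -> 1 output (illustrative values)
w1 = [[0.9, -0.5], [0.01, 0.02], [-0.7, 0.8], [0.03, -0.01]]
b1 = [0.1, 0.0, -0.2, 0.05]
w2 = [[1.0, 0.5, -1.5, 0.2]]
b2 = [0.3]
x = [1.0, 2.0]

small_w1, small_b1, small_w2, kept = prune_hidden_units(w1, b1, w2, keep=2)

# Masked version for comparison: original shapes, pruned units zeroed out
masked_w1 = [row if j in kept else [0.0] * len(row) for j, row in enumerate(w1)]
masked_b1 = [b if j in kept else 0.0 for j, b in enumerate(b1)]

dense_small = mlp(x, small_w1, small_b1, small_w2, b2)
masked_full = mlp(x, masked_w1, masked_b1, w2, b2)
print(kept, dense_small, masked_full)
```

The compact model produces the same output as the masked one while doing half the multiplications, with no sparse kernels involved.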

3.3 Knowledge Distillation

Transfers knowledge from a large teacher model to a smaller student model.

import torch
import torch.nn.functional as F

class DistillationLoss(torch.nn.Module):
    def __init__(self, temperature=4.0, alpha=0.5):
        super().__init__()
        self.temperature = temperature
        self.alpha = alpha
        self.ce_loss = torch.nn.CrossEntropyLoss()

    def forward(self, student_logits, teacher_logits, labels):
        # Soft target loss (KL Divergence)
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / self.temperature, dim=1),
            F.softmax(teacher_logits / self.temperature, dim=1),
            reduction='batchmean'
        ) * (self.temperature ** 2)

        # Hard target loss (Cross-Entropy)
        hard_loss = self.ce_loss(student_logits, labels)

        return self.alpha * soft_loss + (1 - self.alpha) * hard_loss

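# Training loop
# teacher_model, student_model, and train_loader are defined elsewhere;
# the optimizer and learning rate below are illustrative choices
distillation_loss = DistillationLoss(temperature=4.0, alpha=0.5)
optimizer = torch.optim.Adam(student_model.parameters(), lr=1e-4)

teacher_model.eval()
student_model.train()

for batch in train_loader:
    inputs, labels = batch
    with torch.no_grad():
        teacher_logits = teacher_model(inputs)
    student_logits = student_model(inputs)

    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()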
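
The temperature in the soft-target loss controls how much of the teacher's "dark knowledge" about non-target classes reaches the student. A small sketch with hypothetical teacher logits shows the effect of dividing logits by the temperature before the softmax:

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [8.0, 3.0, 1.0]  # illustrative values
hard = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=4.0)

print("T=1:", [round(p, 4) for p in hard])
print("T=4:", [round(p, 4) for p in soft])
```

At a temperature of 4 the distribution is flatter, so the relative ranking of the wrong classes contributes a usable training signal instead of being rounded away.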

Notable knowledge distillation success stories:

Teacher -> Student examples:
BERT-base (110M) -> DistilBERT (66M): 40% smaller, retains 97% performance
GPT-4 -> Phi-3 Mini (3.8B): Large model knowledge transferred to small model
Llama 70B -> Llama 8B: Knowledge distillation + synthetic data for maximum performance

3.4 Neural Architecture Search (NAS)

Automatically searches for optimal model architectures.

Efficient architectures discovered by NAS:
========================================
Model          | Parameters | Top-1 Accuracy | Inference Speed
EfficientNet-B0| 5.3M       | 77.3%          | Very fast
MobileNetV3    | 5.4M       | 75.2%          | Very fast
EfficientNetV2-S| 22M       | 83.9%          | Fast
MnasNet        | 4.2M       | 74.0%          | Very fast

4. Hardware Landscape

4.1 NVIDIA (Jetson, T4/L4)

NVIDIA Edge AI Hardware:
========================================
Device          | TOPS | Memory  | TDP    | Use Case
Jetson Orin Nano| 40   | 4-8GB   | 7-15W  | Embedded AI
Jetson Orin NX  | 100  | 8-16GB  | 10-25W | Robotics, Drones
Jetson AGX Orin | 275  | 32-64GB | 15-60W | Autonomous driving
T4 (datacenter) | 130  | 16GB    | 70W    | Edge servers
L4 (datacenter) | 485  | 24GB    | 72W    | Video AI

# Running TensorRT inference on Jetson
# After JetPack SDK installation

# Model conversion (ONNX -> TensorRT)
/usr/src/tensorrt/bin/trtexec \
  --onnx=model.onnx \
  --saveEngine=model.engine \
  --fp16 \
  --memPoolSize=workspace:2048 \
  --minShapes=input:1x3x224x224 \
  --optShapes=input:4x3x224x224 \
  --maxShapes=input:8x3x224x224

# Benchmark (engines built from ONNX use explicit batch, so pass the full input shape)
/usr/src/tensorrt/bin/trtexec --loadEngine=model.engine --shapes=input:4x3x224x224

4.2 Apple (Neural Engine, CoreML, MLX)

Apple Silicon AI Performance:
========================================
Chip          | Neural Engine TOPS | GPU     | Unified Memory
M1            | 11                 | 8-core  | 8-16GB
M2            | 15.8               | 10-core | 8-24GB
M3            | 18                 | 10-core | 8-36GB
M4            | 38                 | 10-core | 16-64GB
A17 Pro       | 35                 | 6-core  | 8GB
A18 Pro       | 35                 | 6-core  | 8GB

# MLX framework (Apple Silicon native)
import mlx.core as mx
import mlx.nn as nn

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(784, 256)
        self.linear2 = nn.Linear(256, 10)

    def __call__(self, x):
        x = nn.relu(self.linear1(x))
        return self.linear2(x)

model = SimpleModel()
input_data = mx.random.normal((1, 784))
output = model(input_data)
mx.eval(output)  # Execute lazy evaluation

4.3 Qualcomm (Snapdragon NPU)

Qualcomm AI Engine:
========================================
Chip                | NPU TOPS | Use Case
Snapdragon 8 Gen 3  | 45       | Flagship smartphones
Snapdragon 8 Gen 2  | 36       | Premium smartphones
Snapdragon 7+ Gen 2 | 13       | Mid-range smartphones
QCS6490             | 12       | IoT, cameras

4.4 Google (Edge TPU, MediaPipe)

# Edge TPU model compilation
edgetpu_compiler --min_runtime_version 15 model_int8.tflite

# Result: model_int8_edgetpu.tflite

# MediaPipe real-time inference
import mediapipe as mp
import cv2

mp_hands = mp.solutions.hands
hands = mp_hands.Hands(
    static_image_mode=False,
    max_num_hands=2,
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5,
)

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break  # no frame available (camera disconnected or stream ended)
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        for hand_landmarks in results.multi_hand_landmarks:
            # Process 21 landmarks
            pass

cap.release()
hands.close()

4.5 Intel (OpenVINO, Movidius)

Intel AI Accelerators:
========================================
Hardware          | Performance | Use Case
Core Ultra NPU    | 10-34 TOPS  | Laptop/Desktop AI
Arc GPU           | Variable    | Desktop/Workstation
Movidius VPU      | 4 TOPS      | IoT, cameras (discontinued)
Gaudi 2/3         | Server-grade| Edge servers

5. On-Device LLMs

5.1 llama.cpp

A lightweight LLM inference engine written in C/C++.

# Build with CMake (Metal is enabled by default on macOS)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j 8

# CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 8

# Model quantization (GGUF format)
./build/bin/llama-quantize models/llama-3.1-8b-f16.gguf \
  models/llama-3.1-8b-q4_k_m.gguf Q4_K_M

# Run inference (-ngl sets the number of layers offloaded to the GPU)
./build/bin/llama-cli \
  -m models/llama-3.1-8b-q4_k_m.gguf \
  -p "Explain quantum computing in simple terms:" \
  -n 256 \
  -ngl 99

5.2 GGUF Quantization Levels

GGUF Quantization Comparison (Llama 3.1 8B):
========================================
Quant     | Size   | Memory | Quality    | Speed
Q2_K      | 2.96GB | 5.4GB  | Low        | Very fast
Q3_K_M    | 3.52GB | 6.0GB  | Moderate   | Fast
Q4_K_M    | 4.58GB | 7.0GB  | Good       | Fast
Q5_K_M    | 5.33GB | 7.8GB  | Very good  | Moderate
Q6_K      | 6.14GB | 8.6GB  | Excellent  | Moderate
Q8_0      | 7.95GB | 10.4GB | Near-original | Slow
F16       | 15.0GB | 17.5GB | Original   | Slow
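
Formats like these store weights in small blocks, each with its own scale. The sketch below is a deliberately simplified block-wise 4-bit scheme, not the actual K-quant layout used by GGUF; the block size, the 16-bit scale, and all names are assumptions for illustration. It shows why per-block scales help and why the effective cost is above 4 bits per weight.

```python
import random

def quantize_blockwise_4bit(values, block_size=32):
    """Quantize floats to signed 4-bit integers (-8..7) with one scale per block."""
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        q = [max(-8, min(7, round(v / scale))) for v in block]
        blocks.append((scale, q))
    return blocks

def dequantize_blockwise(blocks):
    """Rebuild approximate floats from (scale, codes) pairs."""
    return [qi * scale for scale, q in blocks for qi in q]

def bits_per_weight(block_size, weight_bits=4, scale_bits=16):
    """Effective storage cost once the per-block scale is included."""
    return weight_bits + scale_bits / block_size

rng = random.Random(0)
# Two blocks with very different magnitudes
weights = ([rng.uniform(-0.05, 0.05) for _ in range(32)]
           + [rng.uniform(-2.0, 2.0) for _ in range(32)])

blocks = quantize_blockwise_4bit(weights)
restored = dequantize_blockwise(blocks)
errors = [abs(a - b) for a, b in zip(weights, restored)]

print("scales per block:", [scale for scale, _ in blocks])
print("max error, small-magnitude block:", max(errors[:32]))
print("effective bits per weight:", bits_per_weight(32))
```

Because the small-magnitude block gets its own scale, its error stays proportionally small; a single scale for all 64 weights would have rounded most of that block to zero.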

5.3 MLC LLM

A framework for running LLMs across diverse platforms.

# Install MLC LLM
pip install mlc-llm

# Compile model (Vulkan backend)
mlc_llm compile ./dist/Llama-3.1-8B-q4f16_1-MLC/ \
  --device vulkan \
  --output ./dist/Llama-3.1-8B-q4f16_1-vulkan/

# Compile for iOS (pass --device android for Android)
mlc_llm compile ./dist/Llama-3.1-8B-q4f16_1-MLC/ \
  --device iphone \
  --output ./dist/Llama-3.1-8B-q4f16_1-ios/

5.4 MLX (Apple Silicon)

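# LLM inference with MLX
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")

prompt = "What is machine learning?"
# Recent mlx-lm releases take sampling settings through a sampler object;
# older releases accepted temp=0.7 directly in generate()
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=256,
    sampler=make_sampler(temp=0.7),
)
print(response)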

5.5 LLMs That Run on Edge Devices

On-Device LLM Comparison:
========================================
Model             | Size  | RAM Req | Speed (tok/s) | Quality
Phi-3 Mini (3.8B) | 2.3GB | 4GB     | 20-40         | Excellent
Gemma 2 (2B)      | 1.4GB | 3GB     | 30-50         | Good
Llama 3.2 (1B)    | 0.7GB | 2GB     | 40-60         | Moderate
Llama 3.2 (3B)    | 1.8GB | 3.5GB   | 25-40         | Good
Qwen2.5 (3B)      | 1.8GB | 3.5GB   | 25-40         | Good
SmolLM (1.7B)     | 1.0GB | 2.5GB   | 35-55         | Moderate

* Based on Q4_K_M quantization, approximate figures on smartphones

6. Federated Learning

6.1 FedAvg Algorithm

The most fundamental algorithm in federated learning.

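# FedAvg sketch (train_local is a user-supplied local training function)
import copy
import random

def federated_averaging(global_model, clients, rounds, local_epochs, fraction=1.0):
    for round_num in range(rounds):
        # 1. Select clients for this round and distribute the global model to them
        num_selected = max(1, int(fraction * len(clients)))
        selected_clients = random.sample(clients, num_selected)
        client_models = []
        client_sizes = []

        for client in selected_clients: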
            # 2. Local training on each client
            local_model = copy.deepcopy(global_model)
            local_model = train_local(
                local_model,
                client.data,
                epochs=local_epochs,
                lr=0.01
            )
            client_models.append(local_model.state_dict())
            client_sizes.append(len(client.data))

        # 3. Update global model with weighted average
        total_size = sum(client_sizes)
        new_global = {}
        for key in global_model.state_dict():
            new_global[key] = sum(
                client_models[i][key] * (client_sizes[i] / total_size)
                for i in range(len(client_models))
            )
        global_model.load_state_dict(new_global)

    return global_model

6.2 Flower Framework

# Flower server
import flwr as fl

strategy = fl.server.strategy.FedAvg(
    fraction_fit=0.3,          # 30% client participation
    fraction_evaluate=0.2,     # 20% client evaluation
    min_fit_clients=2,         # Minimum participating clients
    min_evaluate_clients=2,
    min_available_clients=3,
)

fl.server.start_server(
    server_address="0.0.0.0:8080",
    config=fl.server.ServerConfig(num_rounds=10),
    strategy=strategy,
)

# Flower client
import flwr as fl
import torch

class FlowerClient(fl.client.NumPyClient):
    def __init__(self, model, trainloader, testloader):
        self.model = model
        self.trainloader = trainloader
        self.testloader = testloader

    def get_parameters(self, config):
        # detach() is required: numpy() cannot be called on tensors that require grad
        return [val.detach().cpu().numpy() for val in self.model.parameters()]

    def set_parameters(self, parameters):
        for param, new_val in zip(self.model.parameters(), parameters):
            param.data = torch.tensor(new_val)

    def fit(self, parameters, config):
        self.set_parameters(parameters)
        train(self.model, self.trainloader, epochs=1)
        return self.get_parameters(config), len(self.trainloader.dataset), {}

    def evaluate(self, parameters, config):
        self.set_parameters(parameters)
        loss, accuracy = test(self.model, self.testloader)
        return float(loss), len(self.testloader.dataset), {"accuracy": float(accuracy)}

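# start_numpy_client is deprecated in recent Flower releases;
# convert the NumPyClient with to_client() and use start_client instead
fl.client.start_client(
    server_address="localhost:8080",
    client=FlowerClient(model, trainloader, testloader).to_client(),
)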

6.3 Differential Privacy

# Differential privacy training with Opacus
import torch
from opacus import PrivacyEngine

model = create_model()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    epochs=10,
    target_epsilon=1.0,    # Privacy budget
    target_delta=1e-5,     # Failure probability
    max_grad_norm=1.0,     # Gradient clipping
)

# Training (same as regular training)
# criterion is the loss function, defined elsewhere
for inputs, target in train_loader:
    optimizer.zero_grad()
    output = model(inputs)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()

# Check consumed privacy budget
epsilon = privacy_engine.get_epsilon(delta=1e-5)
print(f"Epsilon: {epsilon:.2f}")
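
Under the hood, differentially private training of this kind clips each sample's gradient and adds calibrated Gaussian noise before the optimizer step. The following is a minimal sketch of that single step in plain Python, not Opacus code; the function names, gradients, and noise multiplier are illustrative assumptions.

```python
import math
import random

def clip_gradient(grad, max_grad_norm):
    """Scale a per-sample gradient so its L2 norm is at most max_grad_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_grad_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]

def dp_sgd_gradient(per_sample_grads, max_grad_norm, noise_multiplier, rng):
    """Clip each sample's gradient, sum, add Gaussian noise, then average."""
    dim = len(per_sample_grads[0])
    clipped = [clip_gradient(g, max_grad_norm) for g in per_sample_grads]
    summed = [sum(g[k] for g in clipped) for k in range(dim)]
    sigma = noise_multiplier * max_grad_norm  # noise scales with the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # illustrative per-sample gradients

noise_free = dp_sgd_gradient(grads, max_grad_norm=1.0, noise_multiplier=0.0, rng=rng)
private = dp_sgd_gradient(grads, max_grad_norm=1.0, noise_multiplier=1.1, rng=rng)

print("clipped average without noise:", noise_free)
print("clipped average with noise:", private)
```

Clipping bounds how much any one sample can influence the update, and the noise hides what remains; a smaller target epsilon requires a larger noise multiplier.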

7. Privacy-Preserving AI

7.1 On-Device Processing

An approach that processes sensitive data without sending it off the device.

On-Device Privacy Preservation Examples:
========================================
Use Case                   | Technology                     | Privacy Guarantee
Apple Face ID              | Neural Engine + Secure Enclave | Facial data processed on-device
Google Keyboard Prediction | Federated Learning             | Typing data never sent to server
Apple Siri Voice           | CoreML on-device               | Voice data processed locally
Samsung Knox AI            | NPU + TEE                      | Enterprise data isolation

7.2 Secure Aggregation

Masks individual model updates in federated learning so that the server learns only their aggregate, never any single client's update.

Secure Aggregation Protocol:
========================================
1. Each client generates masks (pairwise seed exchange)
2. Adds masks to local model updates before sending
3. Server collects only masked updates
4. Masks cancel out during aggregation, recovering only the sum
5. Individual updates cannot be recovered even by the server
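
The cancellation in step 4 can be demonstrated with a toy example. This sketch assumes integer-encoded updates and arithmetic modulo 2^32, and it replaces the real key exchange with a deterministic seed per client pair; dropout handling, which production protocols need, is omitted. All names and values are hypothetical.

```python
import random

MODULUS = 2 ** 32  # all arithmetic is done modulo this value

def pairwise_masks(num_clients, dim, seed=0):
    """Build per-client masks that sum to zero (mod MODULUS) across all clients.

    For each pair (i, j) with i < j, a shared random vector is added to
    client i's mask and subtracted from client j's mask.
    """
    masks = [[0] * dim for _ in range(num_clients)]
    for i in range(num_clients):
        for j in range(i + 1, num_clients):
            # Stand-in for a seed both clients would agree on via key exchange
            pair_rng = random.Random(seed * 1_000_003 + i * 1_009 + j)
            shared = [pair_rng.randrange(MODULUS) for _ in range(dim)]
            for k in range(dim):
                masks[i][k] = (masks[i][k] + shared[k]) % MODULUS
                masks[j][k] = (masks[j][k] - shared[k]) % MODULUS
    return masks

def mask_update(update, mask):
    """What a client actually sends to the server."""
    return [(u + m) % MODULUS for u, m in zip(update, mask)]

def aggregate(masked_updates):
    """Server-side sum; the pairwise masks cancel out here."""
    dim = len(masked_updates[0])
    return [sum(vec[k] for vec in masked_updates) % MODULUS for k in range(dim)]

# Integer-encoded model updates from three clients (illustrative values)
updates = [[5, 10, 15], [1, 2, 3], [7, 0, 4]]
masks = pairwise_masks(num_clients=3, dim=3)
masked = [mask_update(u, m) for u, m in zip(updates, masks)]
total = aggregate(masked)

print("what the server sees from client 0:", masked[0])
print("recovered sum:", total)
```

Each masked vector looks random on its own, yet the server still recovers the exact element-wise sum of the three updates.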

8. Deployment Pipeline

8.1 Model Conversion Workflow

Training Framework -> Intermediate Format -> Inference Engine -> Device
========================================
PyTorch  --+
            +---> ONNX --+---> TensorRT (.engine)  ---> NVIDIA GPU
TensorFlow-+             +---> ONNX Runtime         ---> Cross-platform
            |             +---> OpenVINO (.xml)      ---> Intel
            +---> TFLite ---> TFLite Runtime         ---> Mobile/IoT
            +---> CoreML ---> CoreML Runtime          ---> Apple
            +---> GGUF   ---> llama.cpp              ---> All platforms

8.2 OTA (Over-The-Air) Model Updates

# Model version management and OTA update example
import hashlib
import json
import requests

class ModelManager:
    def __init__(self, model_dir, manifest_url):
        self.model_dir = model_dir
        self.manifest_url = manifest_url

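    def check_update(self):
        """Check latest model manifest from server"""
        response = requests.get(self.manifest_url, timeout=10)
        response.raise_for_status()
        manifest = response.json()
        current_version = self.get_current_version()

        # Compare versions numerically: as strings, "2.10.0" sorts before "2.9.0"
        def parse(version):
            return tuple(int(part) for part in version.split("."))

        if parse(manifest["version"]) > parse(current_version):
            return manifest
        return None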

    def download_model(self, manifest):
        """Delta update or full download"""
        if manifest.get("delta_url"):
            # Download delta patch (size savings)
            patch = requests.get(manifest["delta_url"]).content
            self.apply_delta(patch)
        else:
            # Full model download
            model_data = requests.get(manifest["model_url"]).content
            self.save_model(model_data, manifest)

    def validate_model(self, model_path, expected_hash):
        """Verify integrity with SHA-256 hash"""
        with open(model_path, "rb") as f:
            file_hash = hashlib.sha256(f.read()).hexdigest()
        return file_hash == expected_hash

    def rollback(self):
        """Rollback to previous version on failure"""
        # Previous version restoration logic
        pass
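
One way the rollback step could be filled in is to keep the previous model file next to the active one and swap them with atomic renames. The sketch below is standalone rather than a method of the class above, and the file names (current.bin, previous.bin, staged.bin) are assumptions made for illustration.

```python
import hashlib
import os
import tempfile

def sha256_of(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def install_model(model_dir, new_bytes, expected_hash):
    """Install new model bytes as current.bin, keeping the old file as previous.bin.

    Returns True if the new model was installed, False if it failed the hash check.
    """
    current = os.path.join(model_dir, "current.bin")
    previous = os.path.join(model_dir, "previous.bin")
    staged = os.path.join(model_dir, "staged.bin")

    with open(staged, "wb") as f:
        f.write(new_bytes)
    if sha256_of(staged) != expected_hash:
        os.remove(staged)  # corrupt or tampered download: leave current.bin untouched
        return False

    if os.path.exists(current):
        os.replace(current, previous)  # keep one older version for rollback
    os.replace(staged, current)        # atomic rename within the same directory
    return True

def rollback(model_dir):
    """Restore previous.bin as current.bin; returns False if nothing can be restored."""
    current = os.path.join(model_dir, "current.bin")
    previous = os.path.join(model_dir, "previous.bin")
    if not os.path.exists(previous):
        return False
    os.replace(previous, current)
    return True

with tempfile.TemporaryDirectory() as model_dir:
    v1, v2 = b"model-v1", b"model-v2"
    install_model(model_dir, v1, hashlib.sha256(v1).hexdigest())
    install_model(model_dir, v2, hashlib.sha256(v2).hexdigest())
    # A download whose bytes do not match the manifest hash is rejected
    corrupt_installed = install_model(model_dir, b"corrupted", hashlib.sha256(v2).hexdigest())
    rolled_back = rollback(model_dir)
    with open(os.path.join(model_dir, "current.bin"), "rb") as f:
        active = f.read()

print(corrupt_installed, rolled_back, active)
```

Staging the download under a separate name means a failed or interrupted update never overwrites the model that is currently serving inference.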

8.3 Model Packaging

# Model manifest (fields of model_manifest.json, shown in YAML-style notation)
model_name: "image-classifier-v2"
version: "2.1.0"
framework: "tflite"
file: "model_int8.tflite"
input_shape: [1, 224, 224, 3]
input_type: "uint8"
output_shape: [1, 1000]
labels_file: "labels.txt"
hardware_requirements:
  min_ram_mb: 256
  supported_delegates: ["gpu", "nnapi", "xnnpack"]
metrics:
  accuracy: 0.943
  latency_ms: 12.5
  model_size_mb: 4.2

9. Use Cases

9.1 Autonomous Driving

Autonomous Driving Edge AI Stack:
========================================
Sensor         | AI Task              | Hardware       | Latency Req
Cameras (8-12) | Object detection     | NVIDIA Orin    | Under 10ms
LiDAR          | 3D point cloud       | NVIDIA Orin    | Under 20ms
Radar          | Distance/speed est.  | DSP            | Under 5ms
Fusion         | Sensor fusion/decision| NVIDIA Orin   | Under 30ms
Planning       | Path planning        | CPU/GPU        | Under 50ms

9.2 Smart Cameras

# Real-time object detection on edge device
# preprocess() and draw_box() are application-specific helpers defined elsewhere
import cv2
import numpy as np
import tensorflow as tf

# Load TFLite model (SSD MobileNet)
interpreter = tf.lite.Interpreter(
    model_path="ssd_mobilenet_v2_int8.tflite",
    num_threads=4
)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

cap = cv2.VideoCapture(0)
while True:
    ret, frame = cap.read()
    if not ret:
        break  # camera disconnected or stream ended
    # Preprocessing
    input_data = preprocess(frame, target_size=(300, 300))

    # Inference
    interpreter.set_tensor(input_details[0]['index'], input_data)
    interpreter.invoke()

    # Parse results (output tensor order can differ between exported models)
    boxes = interpreter.get_tensor(output_details[0]['index'])
    classes = interpreter.get_tensor(output_details[1]['index'])
    scores = interpreter.get_tensor(output_details[2]['index'])

    # Display high-confidence results only
    for i in range(len(scores[0])):
        if scores[0][i] > 0.5:
            draw_box(frame, boxes[0][i], classes[0][i], scores[0][i])

cap.release()

9.3 Voice Assistants

On-Device Voice Processing Pipeline:
========================================
1. Wake Word Detection: Small model (50KB-1MB), always-on, NPU
2. Voice Activity Detection (VAD): Speech segment identification, very lightweight
3. Speech Recognition (ASR): Whisper Tiny/Base (39M-74M parameters)
4. Natural Language Understanding (NLU): Intent classification + entity recognition
5. Response Generation: Small LLM or template-based
6. Text-to-Speech (TTS): Convert text to speech
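
Stage 2 is the simplest part of the pipeline to illustrate. The sketch below is a bare energy-threshold VAD on a synthetic signal; production systems typically use small trained models instead, and the frame size, threshold, and function names here are assumptions chosen for the example.

```python
import math

def frame_energies(samples, frame_size):
    """Root-mean-square energy of consecutive, non-overlapping frames."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame_size]) / frame_size)
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]

def detect_speech(samples, frame_size=160, threshold=0.1):
    """Flag each frame whose RMS energy exceeds the threshold."""
    return [energy > threshold for energy in frame_energies(samples, frame_size)]

# Synthetic 16 kHz signal: 2 quiet frames, 2 loud frames (440 Hz tone), 1 quiet frame
quiet = [0.001] * 160
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(320)]
flags = detect_speech(quiet * 2 + tone + quiet)

print(flags)
```

Only the frames flagged as speech need to be passed on to the much larger ASR model, which is how the pipeline keeps always-on power consumption low.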

9.4 Medical Devices

Medical Edge AI Applications:
========================================
Device            | AI Task                | Regulatory
Smartwatch        | ECG arrhythmia detect  | FDA Class II
Stethoscope       | Heart/lung sound anal. | FDA Class II
Fundus Camera     | Diabetic retinopathy   | FDA Class II
CT/MRI Assist     | Lesion highlight       | FDA Class III
Glucose Monitor   | Blood sugar trend pred | FDA Class II

9.5 Industrial IoT

Industrial Edge AI Applications:
========================================
Area              | AI Task                | Benefit
Quality Inspection| Visual defect detect   | Real-time, 100% inspection
Predictive Maint. | Vibration/noise anomaly| 40% downtime reduction
Energy Optimization| Consumption prediction| 15% energy savings
Safety Monitoring | PPE detection          | Accident prevention
Robot Control     | Path planning/obstacle | Autonomous operation

10. Challenges and Limitations

10.1 Power Constraints

Power Efficiency Comparison:
========================================
Device                 | Power  | AI Perf  | TOPS/W
NVIDIA Jetson Orin Nano| 7-15W  | 40 TOPS  | 2.7-5.7
Apple A18 Pro          | ~5W    | 35 TOPS  | 7.0
Google Edge TPU        | 2W     | 4 TOPS   | 2.0
Qualcomm Snapdragon 8g3| ~5W   | 45 TOPS  | 9.0
Intel Core Ultra NPU   | ~5W   | 34 TOPS  | 6.8

10.2 Thermal Management

Continuous AI inference on mobile devices causes thermal issues. Thermal throttling can degrade performance by 30-50%.

10.3 Memory Limitations

Memory Constraints by Device Type:
========================================
Device Type       | Typical RAM | AI-Available Memory
IoT Sensors       | 256KB-4MB   | Extremely limited
Microcontrollers  | 2-16MB      | Very limited
Smart Cameras     | 256MB-2GB   | Hundreds of MB
Smartphones       | 6-16GB      | 2-8GB
Edge Servers      | 16-128GB    | Most available

10.4 Model Update Challenges

OTA Model Update Challenges:
========================================
- Bandwidth limits: Transmitting large models wirelessly
- Battery drain: Energy for download + conversion process
- A/B testing: Need to store both old and new versions
- Rollback: Recovery mechanism when updates fail
- Integrity: Prevent model tampering (signature/hash verification)

Quiz

Q1: List and explain at least 5 key differences between Edge AI and Cloud AI.

Latency: Edge AI offers 1-10ms local inference, while Cloud AI requires 50-200ms network round-trip
Privacy: Edge AI keeps data on device, while Cloud AI requires data transmission
Bandwidth: Edge AI transmits only results (minimal), while Cloud AI sends raw data (heavy)
Offline capability: Edge AI works without network, Cloud AI cannot
Compute power: Edge AI is limited (NPU with tens of TOPS), Cloud AI is nearly unlimited
Model size: Edge AI is limited to MB-to-few-GB range, Cloud AI can handle hundreds of GB
Cost structure: Edge AI has upfront hardware costs, Cloud AI has ongoing service costs

Q2: Explain the difference between PTQ and QAT, and compare their pros and cons.

PTQ (Post-Training Quantization):

Quantizes an already-trained model without additional training
Pros: Simple, no training data needed (only calibration data), fast to apply
Cons: May have greater accuracy loss than QAT, especially at INT4

QAT (Quantization-Aware Training):

Simulates quantization effects during the training process
Pros: Minimizes accuracy loss, good quality even at INT4
Cons: Requires additional training (data, GPU, time), more complex implementation

Generally, PTQ is sufficient for INT8, while QAT is recommended for INT4 and below.

Q3: Explain the FedAvg algorithm in Federated Learning.

The FedAvg (Federated Averaging) process:

Model distribution: Server sends the global model to selected clients
Local training: Each client trains the model on their local data (multiple epochs)
Update collection: Trained model parameters are sent to the server (original data is never transmitted)
Weighted averaging: Server updates the global model using a weighted average proportional to each client's data size
Iteration: The above process repeats for multiple rounds

Key point: Data never leaves the device -- only model updates (weights) are exchanged with the server, preserving privacy.

Q4: Explain what Q4_K_M means in GGUF quantization.

GGUF quantization naming convention:

Q: Quantization
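4: Nominal bit count (4-bit). The effective average is somewhat above 4 bits per weight because block scales are stored alongside the weights and some tensors are kept at higher precision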
K: K-quant method. Applies different quantization levels per block, maintaining higher precision for important layers
M: Medium quality. Among S (Small/low quality), M (Medium), and L (Large/high quality)

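Q4_K_M is 4-bit K-quant medium quality quantization, a popular default because it shrinks the model substantially while keeping quality loss small. For Llama 3.1 8B, the table in section 5.2 lists 4.58GB for Q4_K_M versus 15.0GB for F16 (about 3.3x smaller) and roughly 7GB of memory at run time, which fits only on high-memory smartphones.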

Q5: What is the difference between structured and unstructured pruning?

Unstructured Pruning:

Sets individual weights to zero, creating sparse matrices
Can achieve high sparsity (90%+)
Large theoretical compute savings, but limited actual speedup on general hardware
Requires specialized sparse matrix hardware/libraries

Structured Pruning:

Removes entire channels, filters, or Attention Heads as structural units
Resulting model operates with regular dense tensor operations
Achieves actual speedup even on general hardware
Sparsity levels may be lower than unstructured (around 30-50%)

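In practice, structured pruning is the more reliable route to inference speedups on general hardware. Unstructured sparsity pays off only with sparse-aware kernels or hardware; NVIDIA Sparse Tensor Cores, for example, accelerate a constrained 2:4 pattern (two zeros in every group of four weights) rather than arbitrary sparsity.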