Edge AI & TinyML 2026: LiteRT / ExecuTorch / Edge Impulse / Jetson / Coral / Hailo / Sipeed K230 / llama.cpp / Phi-4 Deep-Dive Guide
1. The 2026 Edge AI Map: Four Categories (MCU / SBC / Phone / Auto)
Edge AI in 2026 is not a single category. The single word "edge" spans devices from 100 mW microcontrollers to autonomous-driving computers consuming over 100 W, and the models that run on them range from sub-1KB keyword spotting networks to 4-bit quantized 70B LLMs.
First, the four broad categories of 2026 Edge AI devices:
- MCU (Microcontroller) class: 1-100 mW power, 16 KB-2 MB memory, 1 KB-1 MB models. Examples: Arduino Nano 33 BLE Sense, Seeed XIAO ESP32-S3, STMicro STM32H7, Nordic nRF52840. Typical workloads: keyword spotting ("Hey Siri"), vibration anomaly detection, gesture recognition.
- SBC (Single Board Computer) class: 1-15 W, 4-16 GB memory, 1 MB-1 GB models. Examples: Raspberry Pi 5, Rockchip RK3588 boards, NVIDIA Jetson Orin Nano, Coral Dev Board, Sipeed K230. Typical workloads: object detection, pose estimation, speech recognition.
- Mobile / phone class: 5-15 W, 8-16 GB memory, 1-8 GB models. Examples: iPhone (A17 Pro / A18 + Neural Engine), Galaxy S24/S25 (Snapdragon 8 Gen 3 / Snapdragon 8 Elite + Hexagon NPU), Pixel 9 (Tensor G4 + Edge TPU). Typical workloads: 1B-7B quantized LLMs, on-device Whisper, Stable Diffusion (LCM).
- Automotive / robotics / industrial class: 30-130 W, 32-64 GB memory, 1B-70B models. Examples: NVIDIA Jetson AGX Orin, Jetson Thor (new in 2026), Tesla FSD HW4, Mobileye EyeQ7. Typical workloads: autonomous driving, humanoid robots, industrial vision.
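To make the memory column of these categories concrete, here is a minimal sketch that estimates how much space a model's weights need at a given precision and whether that fits a device. The function names and the `headroom` rule of thumb are illustrative assumptions, not part of any runtime, and the estimate ignores activations and KV cache.

```python
def weight_footprint_bytes(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone (no activations, no KV cache)."""
    return num_params * bits_per_weight / 8


def fits(num_params: float, bits_per_weight: float,
         memory_bytes: float, headroom: float = 0.5) -> bool:
    """True if the weights use at most `headroom` of device memory.

    `headroom` is an assumed rule of thumb, not a vendor figure.
    """
    needed = weight_footprint_bytes(num_params, bits_per_weight)
    return needed <= headroom * memory_bytes


GB = 1e9
print(weight_footprint_bytes(70e9, 4) / GB)   # 4-bit 70B LLM, in GB
print(fits(3.8e9, 4, 8 * GB))                 # 3.8B model on an 8 GB phone
print(fits(70e9, 4, 16 * GB))                 # 70B model on a 16 GB SBC
```

A 4-bit 70B model needs about 35 GB for weights alone, which is why it only appears in the automotive / robotics class.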
Two events in 2024 reshaped this landscape. First, Google rebranded the TensorFlow Lite mobile/embedded runtime as LiteRT: TFLite's official name is now LiteRT, and TFLite Micro became LiteRT for Microcontrollers (called LiteRT Micro in this article). Second, Meta's ExecuTorch moved from alpha to beta on its way to a 1.0 GA release in 2025, and the PyTorch camp's mobile/embedded runtime emerged as a direct alternative to TFLite/LiteRT.
Until then the conventional wisdom was "to run PyTorch on the edge, convert via ONNX to TFLite." Now there is a direct PyTorch → ExecuTorch path. So the first fork in 2026 Edge AI is: do you go with the LiteRT (Google) camp or the ExecuTorch (Meta/PyTorch) camp?
This article lays out all of those forks as a single map: from MCUs to phones, from Google to Meta, from ONNX Runtime to Core ML, from small models (Phi-3, Gemma 3, Llama 3.2) to large ones (70B GGUF), with Korean/Japanese case studies along the way.
2. TFLite Micro → LiteRT (the 2024 Rebrand)
Let us start with the story of TFLite Micro becoming LiteRT.
Since Google released TensorFlow Lite in 2017, TFLite has been the de facto standard for mobile/embedded ML. TFLite Micro followed in 2018 as a lighter runtime that runs even on MCUs with only tens of KB of RAM, and for almost seven years these two were the core of Google's edge ML strategy.
Then in September 2024, Google announced two changes at once:
- TensorFlow Lite is now LiteRT
- LiteRT is no longer TensorFlow-only — you can convert from PyTorch, JAX, and Keras
The rebrand reason is clear. The name "TFLite" felt too TensorFlow-bound, and by 2023-2024 the ML ecosystem was dominated by PyTorch. Google had to break the perception that "the TFLite runtime is good but it cannot run PyTorch models."
The key LiteRT changes:
- Conversion from any framework (TF, PyTorch, JAX) is supported
- PyTorch conversion path — torch.export → LiteRT (still produces .tflite file format)
- Existing TFLite code keeps working — no migration burden
- The ai_edge_torch package supports direct PyTorch conversion
- MediaPipe sits on top with an LLM Inference API (the standard path to run models like Gemma 2B on a phone)
LiteRT Micro (formerly TFLite Micro) follows the same flow. The lightweight C++ runtime stays, and a model built in PyTorch can now be converted to the .tflite format that LiteRT Micro consumes.
A simple PyTorch → LiteRT conversion example:
# PyTorch model -> LiteRT (.tflite) conversion
import ai_edge_torch
import torch


class TinyClassifier(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(1, 8, 3)
        self.fc = torch.nn.Linear(8 * 26 * 26, 10)

    def forward(self, x):
        x = self.conv(x)
        x = torch.relu(x)
        x = x.flatten(1)
        return self.fc(x)


model = TinyClassifier().eval()
sample_input = (torch.randn(1, 1, 28, 28),)

# torch.export-based conversion
edge_model = ai_edge_torch.convert(model, sample_input)
edge_model.export("tiny_classifier.tflite")
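The resulting .tflite file loads unchanged in the LiteRT runtime on Android, iOS, and Raspberry Pi. Coral additionally needs the model fully int8-quantized and compiled for the Edge TPU, and an ESP32-S3 runs it through LiteRT Micro provided the model fits in its memory.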
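The `8 * 26 * 26` input size of the `Linear` layer in the example follows from the standard convolution output-size formula. The sketch below reproduces that arithmetic; `conv2d_out` is a hypothetical helper written for illustration, not a PyTorch function.

```python
def conv2d_out(size: int, kernel: int, stride: int = 1,
               padding: int = 0, dilation: int = 1) -> int:
    """Output length of one spatial dimension after a 2-D convolution."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


side = conv2d_out(28, kernel=3)      # 28x28 input, 3x3 kernel, no padding
flat_features = 8 * side * side      # 8 output channels, flattened
print(side, flat_features)
```

If the input resolution or kernel size changes, the `Linear` layer's input size has to be recomputed the same way, otherwise the conversion fails at trace time with a shape mismatch.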
The deeper significance is market competition with ExecuTorch. Had Google not embraced PyTorch compatibility, the PyTorch camp would have gone 100% with ExecuTorch. Now both standards coexist. From an edge-ML engineer's perspective, you can run the same model on both runtimes and pick whichever is faster.
3. ExecuTorch (PyTorch) GA — The Direct Alternative to LiteRT
ExecuTorch is the mobile/embedded PyTorch runtime that Meta (PyTorch) first announced at PyTorch Conference 2023. It reached beta in 2024 and 1.0 GA in 2025, becoming a direct competitor to LiteRT.
Two key ideas:
- Execute the torch.export graph directly on mobile/embedded
- Backend abstraction supports CPU / GPU / NPU / DSP uniformly
Old PyTorch Mobile used a separate IR called TorchScript, which often failed to convert PyTorch's dynamic graphs cleanly. ExecuTorch adopts torch.export (PyTorch 2.x's new static graph API) as the standard, dramatically improving conversion success rates.
The ExecuTorch backend list shows how serious it is:
- XNNPACK — ARM CPU optimization. Default backend
- CoreML Delegate — iOS / macOS Neural Engine
- MPS Delegate — Apple Metal Performance Shaders (GPU)
- Vulkan Delegate — Android GPU
- Qualcomm QNN Delegate — Snapdragon Hexagon NPU
- MediaTek Neuron Delegate — Dimensity NPU
- ARM Ethos-U Delegate — Cortex-M NPU
- Cadence DSP, NXP, XTensa — embedded DSPs
A single ExecuTorch graph can target iPhone Neural Engine, Snapdragon Hexagon, and Cortex-M Ethos-U with the same source.
A simple conversion example:
# PyTorch -> ExecuTorch conversion
import torch
from executorch.exir import to_edge
from torch.export import export


class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(10, 1)

    def forward(self, x):
        return self.lin(x)


model = MyModel().eval()
example_args = (torch.randn(1, 10),)

# Capture a static graph with torch.export
exported = export(model, example_args)

# Lower to the Edge dialect, then to an ExecuTorch program
edge_program = to_edge(exported)
et_program = edge_program.to_executorch()

# Save as .pte (the ExecuTorch program format)
with open("my_model.pte", "wb") as f:
    f.write(et_program.buffer)
Loaded by the ExecuTorch SDK on Android/iOS, that .pte runs the same model with the same semantics as the original PyTorch dynamic graph.
LiteRT vs ExecuTorch comparison:
- License — both Apache 2.0
- Model conversion — LiteRT supports PyTorch/TF/JAX; ExecuTorch supports PyTorch
- File format — LiteRT is .tflite; ExecuTorch is .pte
- Camp — Google vs Meta (PyTorch)
- Market — LiteRT is the Android standard; ExecuTorch is PyTorch-friendly mobile/MCU
- Tooling — LiteRT has MediaPipe + ai_edge_torch; ExecuTorch has torch.export + delegates
As of 2026 ExecuTorch is the official mobile execution path for Llama 3.2 1B/3B. It is natural for Meta to push its own LLM with its own runtime, and most Llama 3.2 mobile demos use ExecuTorch + iOS/Android.
4. Edge Impulse — The Largest TinyML Platform
Edge Impulse is a TinyML-focused startup founded in 2019. As of 2026 it is effectively the standard cloud platform for TinyML.
Its strength is handling the full stack — from data collection to deployment — in a single UI. A typical TinyML workflow:
1. Collect sensor data — upload accelerometer, microphone, and camera data from Arduino / ESP32 / phones
2. Labeling — label clips by class in the web UI
3. Preprocessing — pick DSP blocks like FFT, spectrogram, MFCC
4. Model training — Keras / scikit-learn / Edge Impulse's EON Tuner searches automatically
5. Quantization + compilation — int8 quantization, EON Compiler generates a C++ library
6. Deployment — Arduino IDE library, PlatformIO, or firmware OTA
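Step 3 of this workflow (DSP preprocessing) can be illustrated with a small pure-Python sketch: split a signal into overlapping frames, apply a Hann window, and compute a power spectrum per frame. This is not Edge Impulse's implementation; the function names, the sample rate, and the naive DFT are assumptions chosen for readability.

```python
import cmath
import math


def frames(signal, frame_len, hop):
    """Split a 1-D signal into overlapping frames."""
    last_start = len(signal) - frame_len
    return [signal[i:i + frame_len] for i in range(0, last_start + 1, hop)]


def power_spectrum(frame):
    """Hann-window a frame and return power for bins 0..N/2 (naive DFT)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


# Synthetic "sensor" data: a 125 Hz tone sampled at 1 kHz (assumed values)
sample_rate = 1000
signal = [math.sin(2 * math.pi * 125 * t / sample_rate) for t in range(256)]

features = [power_spectrum(f) for f in frames(signal, frame_len=64, hop=32)]
peak_bin = max(range(len(features[0])), key=features[0].__getitem__)
print(len(features), "frames; peak at", peak_bin * sample_rate / 64, "Hz")
```

The per-frame spectra are the kind of feature matrix a small classifier consumes; MFCC blocks add a mel filterbank and a cosine transform on top of the same framing step.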
The EON Compiler is Edge Impulse's secret weapon. While a generic TFLite Micro interpreter uses ~100 KB of RAM, EON compiles the model into static C++ code, cutting RAM usage by 30-50%. That is how it runs ML on Cortex-M0+ chips with only 64 KB of RAM.
Representative use cases:
- Keyword spotting — recognize wake words like "Hey Alexa"
- Vibration anomaly detection — attach to a factory motor and detect bearing faults early
- Pose recognition — classify human posture (sitting, standing, falling) from IMU data
- Object detection — FOMO (Faster Objects, More Objects), an ultra-light MobileNet variant
- Time-series classification — ECG, EEG, vibration, pressure 1-D signals
Edge Impulse has official partnerships with virtually every major MCU vendor — Sony Spresense, Nordic nRF5340, Renesas RA, Silicon Labs xG24 — so SDK support is clean.
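# Connect an Arduino Nano 33 BLE Sense via the Edge Impulse CLI
npm install -g edge-impulse-cli
# Link the board (already flashed with Edge Impulse firmware) to your project;
# --clean clears cached credentials and project selection
edge-impulse-daemon --clean
# After training, build the Arduino library from the Studio Deployment page, then
# import the downloaded .zip via Arduino IDE: Sketch > Include Library > Add .ZIP Library
# Optionally stream live classification results from a board running Edge Impulse firmware
edge-impulse-run-impulse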
From a company perspective, Edge Impulse's "data -> model -> firmware" full stack lowers the barrier dramatically. Firmware engineers do not need a PhD in ML, and ML engineers do not need to be firmware veterans — both sides meet inside Edge Impulse.
In 2026 LLM integration started landing on Edge Impulse Studio. A ChatGPT-style chat UI lets you say "analyze sensor data and propose a new model," and it suggests datasets, preprocessing, and candidate models.
5. NVIDIA Jetson Orin Nano / NX / Thor / AGX
NVIDIA Jetson is the standard in SBC / industrial embedded / robotics. The 2026 Jetson lineup is very strong.
- Jetson Orin Nano (8GB) — 40 TOPS, 7-15 W. Entry / developer kit. \$249-\$399
- Jetson Orin NX (8GB / 16GB) — 70-100 TOPS, 10-25 W. Industrial / robotics. \$599-\$899
- Jetson AGX Orin (32GB / 64GB) — 200-275 TOPS, 15-60 W. Autonomous driving / robots. \$1999-\$2999
- Jetson Thor (new in 2026) — 2000+ TOPS, 130 W. Humanoid robots / large autonomous driving. \$3499 (developer kit)
Jetson Thor was unveiled at GTC 2025 and shipped in earnest in early 2026 as a computer for humanoid robots. A Blackwell-architecture GPU plus 128 GB LPDDR5X lets you run 70B-class LLMs locally and handle 14 concurrent camera/LiDAR streams. Standard usage is alongside NVIDIA Isaac Lab's robot learning environment and the Cosmos sim-to-real model.
Jetson's software stack is essentially aligned with NVIDIA desktop GPUs:
- JetPack — Ubuntu-based OS + CUDA + cuDNN + TensorRT integrated SDK
- TensorRT — NVIDIA's inference accelerator. Optimizes ONNX/PyTorch models for GPU
- DeepStream — video analytics pipeline. Handle N concurrent cameras
- Isaac ROS — ROS 2 + GPU-accelerated nodes. Standard for autonomous driving / robots
- NIM (NVIDIA Inference Microservice) — LLM serving in containers
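The standard way to run LLMs on Jetson is llama.cpp (GGUF) or TensorRT-LLM. Single-stream token generation is bounded by memory bandwidth, so treat any figure as a rough estimate: on Orin Nano 8GB (about 68 GB/s), 4-bit Phi-3 mini (3.8B, about 2.2 GB of weights) cannot go much below ~30 ms per token, and on AGX Orin 64GB (about 205 GB/s), Llama 3.1 70B (4-bit, about 40 GB) needs roughly 200 ms per token. Jetson Thor's larger memory leaves far more headroom for 70B-class models, but its per-token latency is bandwidth-limited in the same way.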
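A quick way to sanity-check per-token latency claims on these boards is the memory-bandwidth bound: during batch-1 decoding every weight is read once per token, so tokens per second cannot exceed bandwidth divided by model size. The sketch below applies that bound; the bandwidth values (68 GB/s for Orin Nano 8GB, 204.8 GB/s for AGX Orin 64GB) and the model sizes are assumptions taken from spec sheets and typical 4-bit GGUF files, and real throughput is lower.

```python
def decode_bound(bandwidth_gb_s: float, weights_gb: float):
    """Return (max tokens/s, min ms/token) for batch-1 decoding."""
    tok_s = bandwidth_gb_s / weights_gb
    return tok_s, 1000.0 / tok_s


# (memory bandwidth in GB/s, weight size in GB): assumed values
cases = {
    "Orin Nano 8GB, Phi-3 mini Q4": (68.0, 2.2),
    "AGX Orin 64GB, Llama 3.1 70B Q4": (204.8, 40.0),
}

for name, (bandwidth, size) in cases.items():
    tok_s, ms = decode_bound(bandwidth, size)
    print(f"{name}: at most {tok_s:.1f} tok/s, at least {ms:.0f} ms/token")
```

Any measured number faster than this bound usually means batching, speculative decoding, or a smaller model than the one being quoted.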
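# Run llama.cpp + Phi-3 mini on Jetson Orin Nano
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Recent llama.cpp releases build with CMake (the old `make GGML_CUDA=1` path was retired)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# Download Phi-3 mini 4-bit GGUF (example model name)
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf \
  Phi-3-mini-4k-instruct-q4.gguf --local-dir ./models
# Offload 32 layers to the GPU (-ngl) and generate 64 tokens (-n)
./build/bin/llama-cli -m ./models/Phi-3-mini-4k-instruct-q4.gguf \
  -p "What is the capital of Korea?" -n 64 -ngl 32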
Jetson's weakness is price and thermals. AGX Orin 64GB is nearly \$3000, and 60 W TDP requires active cooling. People wanting lower power / lower cost look to Coral, Hailo, or Rockchip alternatives.
6. Coral Dev Board (Google TPU) — 4 TOPS, 2 W
Coral is Google's Edge TPU (Tensor Processing Unit) and the board series that ships it. It is one of the lowest-power alternatives to NVIDIA Jetson.
- Coral Dev Board — NXP i.MX 8M + Edge TPU. 4 TOPS, 2 W.
- Coral USB Accelerator — USB dongle that adds a TPU to Raspberry Pi / PC. 4 TOPS
- Coral M.2 / Mini PCIe — industrial form factors
- Coral SoM (System on Module) — for integration in industrial boards
The Edge TPU only runs int8 quantized models, specializing in light CNNs like MobileNet / EfficientNet-Lite / PoseNet. It cannot run large LLMs, but for "always-on 24/7 inference of a fixed small model" it is dramatically more efficient than Jetson.
Typical Coral use cases:
- Retail store cameras — people counting, queue length
- Smart doorbell — person vs animal vs vehicle classification
- Farm cameras — livestock behavior, crop condition monitoring
- Industrial CCTV — helmet detection, restricted zone intrusion
- Outdoor wildlife cameras — species identification
Coding for the Edge TPU on top of TFLite/LiteRT is straightforward:
# Object classification on Coral Edge TPU
from PIL import Image
from pycoral.adapters import classify, common
from pycoral.utils.edgetpu import make_interpreter

# Load a model compiled for the Edge TPU and allocate its tensors
interpreter = make_interpreter('mobilenet_v2_quant_edgetpu.tflite')
interpreter.allocate_tensors()

# Resize the image to the model's input size and copy it into the input tensor
image = Image.open('cat.jpg').convert('RGB')
size = common.input_size(interpreter)
common.set_input(interpreter, image.resize(size, Image.LANCZOS))

interpreter.invoke()
classes = classify.get_classes(interpreter, top_k=3)
for c in classes:
    print(f"class={c.id} score={c.score}")
Coral's 2024-2026 limitation is obvious. The Edge TPU silicon is a 2018 design, Google has not pushed a major update, and newer architectures (Transformer, ViT) are weakly accelerated. From 2024 onward, late entrants like Hailo / Sipeed / Rockchip have started taking market share.
Still, when you need a "proven, stable, low-power AI board with 4+ years of support," Coral remains a top pick.
7. Hailo-15 / Hailo-8 NPU — The Dark Horse from Israel
Hailo is an NPU (Neural Processing Unit) startup based in Tel Aviv, Israel. Founded in 2017, it reached unicorn status with its 2021 Series C, and a 2024 extension of that round brought total funding to more than \$340M.
The Hailo NPU lineup:
- Hailo-8 — 26 TOPS, 2.5 W. Automotive / industrial embedded. M.2 / Mini PCIe form factors
- Hailo-8L — 13 TOPS, 1.5 W. Lower-end
- Hailo-15 — 20 TOPS, 5 W (SoC-integrated). Video / IP camera SoC. ARM Cortex-A53 + Hailo NPU on one chip
- Hailo-10H — 40 TOPS, 5 W. Automotive ADAS-certified (ASIL-B)
Hailo's key strength is TOPS per watt. Coral Edge TPU is ~2 TOPS/W; Hailo-8 is ~10 TOPS/W. A 5x gap.
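The efficiency comparison is simple arithmetic on the TOPS and wattage figures listed in this article; the sketch below reproduces it. The dictionary holds the article's own numbers, which are vendor peak ratings rather than measured throughput.

```python
# (peak TOPS, typical watts) as quoted in this article
npus = {
    "Coral Edge TPU": (4.0, 2.0),
    "Hailo-8": (26.0, 2.5),
    "Hailo-8L": (13.0, 1.5),
    "Hailo-15": (20.0, 5.0),
}

efficiency = {name: tops / watts for name, (tops, watts) in npus.items()}
for name, value in sorted(efficiency.items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} {value:5.1f} TOPS/W")

ratio = efficiency["Hailo-8"] / efficiency["Coral Edge TPU"]
print(f"Hailo-8 vs Coral: {ratio:.1f}x")
```

Peak TOPS/W says nothing about how well a given model maps onto the accelerator, so it is a starting point for shortlisting rather than a benchmark.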
Hailo-15 in particular is reshaping the IP camera market. Previously a camera streamed 1080p H.264 to an NVR (Network Video Recorder) that ran AI analytics. A Hailo-15-based camera runs object detection + person re-identification + pose estimation inside the camera and only transmits metadata. That is a triple win: 99% bandwidth reduction, stronger privacy, lower latency.
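The bandwidth claim can be checked with a back-of-the-envelope calculation. Every number below is an assumption for illustration (a 4 Mbit/s 1080p H.264 stream, 10 detections per frame at 10 frames per second, 40 bytes of metadata per detection); actual savings depend on the codec settings and metadata schema.

```python
def bandwidth_reduction(video_bps: float, metadata_bps: float) -> float:
    """Fraction of uplink bandwidth saved by sending metadata instead of video."""
    return 1 - metadata_bps / video_bps


video_bps = 4_000_000          # assumed 1080p H.264 bitrate
detections_per_frame = 10      # assumed scene complexity
frames_per_second = 10         # assumed analytics rate
bytes_per_detection = 40       # assumed: box, class, score, track id

metadata_bps = detections_per_frame * frames_per_second * bytes_per_detection * 8
saving = bandwidth_reduction(video_bps, metadata_bps)
print(f"metadata: {metadata_bps / 1000:.0f} kbit/s, saving: {saving:.1%}")
```

Under these assumptions the camera sends 32 kbit/s instead of 4 Mbit/s, which is where a figure around 99% comes from.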
Hailo's SDK is its proprietary Dataflow Compiler:
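# Download a pre-trained Hailo Model Zoo model and run it
# (the packages are distributed through Hailo's developer downloads,
# so a plain PyPI install may not resolve)
pip install hailo-platform hailo-model-zoo
# Compile YOLOv8 to .hef (Hailo Executable Format); the parser expects an
# exported ONNX checkpoint rather than a raw .pt file
hailomz compile yolov8s --ckpt yolov8s.onnx --hw-arch hailo8
# Run evaluation on the device (flag names vary between Model Zoo releases)
hailomz eval yolov8s --target hailo8 --data-zip-path coco_val.zip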
Hailo's weakness is the ecosystem. The community / documentation / examples are not yet at the NVIDIA CUDA or Google TFLite level. Still, between 2025 and 2026 automotive Tier-1s like Bosch, Ficosa, and Continental adopted the Hailo-10H for ADAS, and in the automotive market Hailo now stands as one of the top three players alongside NVIDIA and Mobileye.
8. Sipeed K230 — The First Mainstream RISC-V + NPU
Sipeed is a Shenzhen-based embedded-ML board company. Famous for the MaixPy series, it began shipping the Sipeed K230 (RISC-V + NPU SoC) in earnest in 2024.
Sipeed K230 specs:
- CPU — Canaan Kendryte K230. Dual-core RISC-V (RV64GC). 1.6 GHz
- NPU — Canaan KPU 2.0. 6 TOPS @ int8
- DSP — Canaan KDPU (digital signal processor). Signal / audio acceleration
- Memory — 512 MB LPDDR4
- Camera — MIPI CSI 2 lanes, integrated ISP
- Form factor — Sipeed CanMV-K230 board (\$45-65) / Sipeed MaixCAM (\$65)
- Power — 1-3 W
Packing 6 TOPS NPU + camera ISP + dual RISC-V cores at this price is a big deal. For comparison, Raspberry Pi 5 is \$80 with no NPU (needs a separate accelerator module). Coral Dev Board is \$130 at 4 TOPS. Jetson Orin Nano starts at \$249.
RISC-V matters too. Unlike ARM Cortex, there is no licensing fee, and aligned with China's RISC-V self-sufficiency plan (2023-2030) the RISC-V infrastructure is maturing fast. MicroPython, OpenCV, and ONNX Runtime all officially ship RISC-V builds.
The Sipeed K230 development environment is MaixPy IDE or the raw SDK.
# YOLOv5 object detection on the K230 camera via MaixPy (conceptual sketch:
# class names and parameter spellings differ between MaixPy / CanMV releases)
from maix import camera, display, nn

# Load a YOLOv5 model onto the Kendryte KPU
model = nn.YOLOv5s(model="yolov5s_quant.kmodel")
cam = camera.Camera(640, 480)
disp = display.Display()

while True:
    img = cam.read()
    boxes = model.detect(img, conf_thres=0.5, iou_thres=0.45)
    for box in boxes:
        img.draw_rect(box.x, box.y, box.w, box.h, color="red")
        img.draw_string(box.x, box.y, box.class_name, color="green")
    disp.show(img)
The ".kmodel" format is Canaan's proprietary NPU format. A compiler called nncase converts ONNX / TFLite models to .kmodel.
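# Convert ONNX -> .kmodel (Canaan NPU format)
# Note: CLI flags differ across nncase versions; recent K230 toolchains are
# typically driven through the nncase Python API instead of ncc
pip install nncase
ncc compile yolov5s.onnx yolov5s.kmodel \
  --target k230 \
  --input-type uint8 \
  --output-type float32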
Sipeed's 2026 hit product, MaixCAM (K230 + 5 MP camera + 2.3-inch display), runs full vision-AI demos out of the box at \$65 and is selling explosively in education / maker markets.
9. Rockchip RK3588 — The De Facto Standard SBC NPU
Rockchip is an ARM-SoC design company in Fuzhou, China. The RK3588, released in 2022, has become the de facto standard SoC of the 2024-2026 SBC market.
RK3588 specs:
- CPU — 4x Cortex-A76 + 4x Cortex-A55 (big.LITTLE). 2.4 GHz
- GPU — Mali-G610 MP4. OpenGL ES 3.2 / Vulkan 1.2
- NPU — 6 TOPS @ int8 (split across 3 cores)
- Memory — 4/8/16/32 GB LPDDR4 / LPDDR5
- Video — 8K 60fps decode, 8K 30fps encode
- Form factor — many SBCs use it: Orange Pi 5, Radxa Rock 5B, Khadas Edge 2, FriendlyElec NanoPi M6, etc.
RK3588 boards have overwhelming bang per buck. Orange Pi 5 Plus 16GB is \$130-150 and Radxa Rock 5B 16GB is \$160-180 — more memory and faster CPU than Jetson Orin Nano 8GB (\$249), though the NPU maturity (software + model compatibility) does not yet match NVIDIA TensorRT.
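Rockchip's SDK is RKNN-Toolkit2 on the host PC plus RKNN-Toolkit-Lite2 on the board.
# Install RKNN-Toolkit2 (host PC, x86)
pip install rknn-toolkit2
# Convert ONNX -> .rknn (Rockchip NPU format)
python -c "
from rknn.api import RKNN
rknn = RKNN()
rknn.config(target_platform='rk3588')
rknn.load_onnx(model='yolov8n.onnx')
# dataset.txt lists the calibration images used for int8 quantization
rknn.build(do_quantization=True, dataset='./dataset.txt')
rknn.export_rknn('./yolov8n.rknn')
rknn.release()
"
# Run .rknn on an RK3588 board (rknnlite)
import cv2
import numpy as np
from rknnlite.api import RKNNLite

rknn = RKNNLite()
rknn.load_rknn('./yolov8n.rknn')
rknn.init_runtime(core_mask=RKNNLite.NPU_CORE_AUTO)

# Assumes the model expects a 640x640 RGB image in NHWC layout
img = cv2.cvtColor(cv2.imread('test.jpg'), cv2.COLOR_BGR2RGB)
img = np.expand_dims(cv2.resize(img, (640, 640)), axis=0)
outputs = rknn.inference(inputs=[img])
print(outputs[0].shape)
rknn.release()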
The RK3588 appeal is NPU + 8K video + generous memory options in one chip. It has become standard in 4K/8K security cameras, IoT gateways, digital signage, and industrial HMI. The follow-ups RK3588S (lower-end) and RK3576 (mid) are also popular, and the RK3688 (next gen, with an expected 14 TOPS NPU) unveiled in late 2025 is on track to be the 2026-2027 standard.
10. MaixPy / Arduino Nano 33 BLE Sense / Seeed Wio AI
This section covers representative MCU / maker boards.
MaixPy (Sipeed)
MaixPy is Sipeed's embedded MicroPython environment. It runs on Maixduino, MaixCube, MaixCAM, and similar boards, integrating camera + NPU + display into a maker kit. The progression has been K210 (Gen 1, 2018), K510 (Gen 2, 2022), K230 (Gen 3, 2024).
MaixCube in particular packs LCD + camera + microphone + battery + gyro for about \$30 and lets you run full AI demos — keyword spotting, face recognition, pose estimation — right out of the box.
Arduino Nano 33 BLE Sense
The Arduino Nano 33 BLE Sense (Rev2) is effectively the standard learning board for TinyML. Since its launch in 2019 it has been the official demo board for Edge Impulse and TensorFlow Lite Micro and appears in nearly every TinyML book and course.
Specs:
- MCU — Nordic nRF52840. ARM Cortex-M4F. 64 MHz. 1 MB Flash, 256 KB RAM
- Sensors — 9-axis IMU, microphone (PDM), barometer, temperature/humidity, light, proximity, color (all on-board)
- Wireless — BLE 5.0
- Price — \$30-35
At that price you can run nearly every TinyML demo (keyword spotting, gesture, vibration, environmental monitoring), which is why it dominates the education market.
// Arduino Nano 33 BLE Sense + TFLite Micro keyword spotting (conceptual)
#include <TensorFlowLite.h>
#include <PDM.h>
#include "tensorflow/lite/micro/all_ops_resolver.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/schema/schema_generated.h"
#include "model_data.h"  // Trained model (generated by Edge Impulse, etc.)

const tflite::Model* model = tflite::GetModel(g_model);
static tflite::MicroInterpreter* interpreter;
constexpr int kTensorArenaSize = 80 * 1024;
alignas(16) uint8_t tensor_arena[kTensorArenaSize];

void setup() {
  // Note: some library versions also require an ErrorReporter argument here
  static tflite::AllOpsResolver resolver;
  static tflite::MicroInterpreter static_interpreter(
      model, resolver, tensor_arena, kTensorArenaSize);
  interpreter = &static_interpreter;
  interpreter->AllocateTensors();
  PDM.begin(1, 16000);  // 1 channel, 16 kHz
}

void loop() {
  // Collect 1-second microphone clip
  // Extract MFCC features
  // Copy into the model input tensor
  // interpreter->Invoke();
  // Print result label ("yes", "no", "stop", ...)
}
Seeed Wio AI / XIAO ESP32-S3
Seeed Studio's (Shenzhen, China) Wio AI line and XIAO ESP32-S3 (Sense) are core to the maker market. XIAO ESP32-S3 Sense packs ESP32-S3 + camera + microphone + microSD onto a stamp-sized board (21x18 mm) for \$10-15. It is an officially supported Edge Impulse board.
The ESP32-S3 also brings built-in Wi-Fi. Arduino Nano 33 only has BLE, but ESP32-S3 ships Wi-Fi + BLE, which makes it more suitable for IoT scenarios (uploading results to the cloud, OTA firmware updates).
MicroPython for ML
MicroPython is the embedded edition of Python. Between 2024 and 2026, running ML on top of MicroPython became more common.
- ulab — a MicroPython port of numpy
- emlearn — C exports of scikit-learn tree / forest models
- tflite-micro Python bindings — provided by Sipeed / Espressif
The MicroPython appeal is rapid prototyping. With C++, every compile + flash cycle takes 30 seconds. With MicroPython you can execute via REPL on the device, which speeds up sensor data exploration.
11. ONNX Runtime Mobile / Core ML / TensorRT / Apache TVM
This section covers four mobile / edge inference runtimes.
ONNX Runtime Mobile
ONNX Runtime is Microsoft's multi-framework inference engine. It runs models in the ONNX (Open Neural Network Exchange) standard format, which PyTorch, TF, JAX, and Keras models can be exported or converted to.
ONNX Runtime Mobile is the slim mobile build.
- Android — AAR library, NNAPI backend, QNN (Qualcomm) backend
- iOS — Pod, Core ML backend
- Raspberry Pi / Linux ARM — .so libraries, XNNPACK backend
The appeal of ONNX Runtime is camp neutrality. Between the PyTorch camp (ExecuTorch) and Google camp (LiteRT), ONNX is the safe "compatible with both" choice. The trade-off is that for quantization and NPU optimization, native runtimes (LiteRT / ExecuTorch) are usually one or two steps ahead.
Core ML (Apple)
Core ML is Apple's first-party ML runtime for its own devices (iPhone, iPad, Mac, Watch). Introduced in iOS 11 (2017), it has become the standard path for tapping the Neural Engine on A17 Pro / A18 Pro / M3 / M4 between 2024 and 2026.
Core ML's strength is Apple Silicon integration. It schedules across CPU / GPU / Neural Engine (ANE) automatically, and Apple rates the Neural Engine at 35 TOPS on the A17 Pro and 38 TOPS on the M4. Mobile Stable Diffusion, on-device Whisper, and all of Apple Intelligence's on-device LLMs (WWDC 2024) run on Core ML.
# PyTorch -> Core ML conversion (coremltools)
import coremltools as ct
import torch


class MyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x)


model = MyModel().eval()
# Trace with an example input to get a TorchScript graph coremltools can convert
traced = torch.jit.trace(model, torch.randn(1, 3, 224, 224))
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=(1, 3, 224, 224))],
    compute_units=ct.ComputeUnit.ALL,  # CPU + GPU + ANE
)
mlmodel.save("MyModel.mlpackage")
Apple Intelligence's on-device model is reported to be roughly 3B parameters, quantized with a mix of 2-bit and 4-bit weights (about 3.5 bits per weight on average), and runs at ~30 ms per token on the Neural Engine of iPhone 15 Pro and above.
TensorRT (NVIDIA)
TensorRT is NVIDIA's GPU-only inference accelerator. The same API spans desktop RTX, server H100/H200/B200, and edge Jetson.
# PyTorch -> ONNX -> TensorRT engine build
# (`model` and `dummy_input` are assumed to exist: an eval-mode torch.nn.Module and a sample input)
import tensorrt as trt
import torch

# 1. PyTorch -> ONNX
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=17)

# 2. ONNX -> TensorRT engine
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(str(parser.get_error(0)))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 kernels
engine = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine)
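TensorRT-LLM is a dedicated LLM inference library that applies graph fusion, KV-cache optimization, and quantization (FP8 / INT4) to models such as Llama / Mistral / Qwen. Single-stream decoding of Llama 3.1 8B on Jetson AGX Orin is bounded by memory bandwidth, so expect per-token latency in the tens of milliseconds rather than single digits, and measure on your own board and precision.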
Apache TVM
Apache TVM is an open-source ML compiler that originated at the University of Washington; its creators later founded OctoML. It takes PyTorch / TF / ONNX models and auto-generates code that runs on CPU / GPU / NPU / DSP.
MLC LLM (next section) is built on TVM. TVM itself has a steep learning curve, but via the MLC user-friendly wrapper it has become the key infrastructure for running LLMs on phones.
12. LLMs on Phones — MLC LLM / llama.cpp / Whisper.cpp / GGUF
The biggest change from 2024 to 2026 is that 1-8B LLMs run at practical speeds on phones. The key tools:
llama.cpp
A C++ LLM inference engine by ggerganov. Started in spring 2023, by 2026 it is effectively the standard local LLM runtime.
Its core values:
- Pure C++ with almost no dependencies. Supports ARM / x86 / CUDA / Metal / Vulkan / SYCL
- GGUF: llama.cpp's unified model file format, embedding quantization info + metadata
- Quantization: Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_0, and related formats covering roughly 2 to 8 bits per weight
- Tokenization, sampling, and chat templates all built in
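To show what block quantization means in practice, here is a minimal sketch in the style of Q8_0: 32 weights share one fp16 scale and are each stored as an int8. It is an illustration of the idea rather than llama.cpp's code, and the function names are hypothetical; the K-quants (Q4_K and friends) use more elaborate nested scales.

```python
import math
import struct

BLOCK = 32  # weights per block


def quantize_block(values):
    """Pack 32 floats as one fp16 scale followed by 32 int8 values."""
    if len(values) != BLOCK:
        raise ValueError("expected exactly 32 values")
    scale = max(abs(v) for v in values) / 127
    # Round-trip the scale through fp16 so quantization uses the stored value
    (scale,) = struct.unpack("<e", struct.pack("<e", scale))
    if scale == 0.0:
        q = [0] * BLOCK
    else:
        q = [max(-127, min(127, round(v / scale))) for v in values]
    return struct.pack("<e32b", scale, *q)


def dequantize_block(blob):
    """Recover approximate floats from a packed block."""
    scale, *q = struct.unpack("<e32b", blob)
    return [scale * x for x in q]


values = [math.sin(i) for i in range(BLOCK)]
blob = quantize_block(values)
restored = dequantize_block(blob)

bits_per_weight = len(blob) * 8 / BLOCK
max_error = max(abs(a - b) for a, b in zip(values, restored))
print(f"{len(blob)} bytes per block, {bits_per_weight} bits/weight, "
      f"max error {max_error:.4f}")
```

The 34-byte block works out to 8.5 bits per weight, which is why quantized file sizes are slightly larger than the nominal bit width suggests.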
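# Build llama.cpp on Android (Termux)
pkg install clang cmake git
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Recent llama.cpp releases build with CMake rather than plain make
cmake -B build
cmake --build build --config Release -j
# Download a Phi-3.5 mini GGUF (4-bit, example); needs Python with huggingface_hub installed
huggingface-cli download bartowski/Phi-3.5-mini-instruct-GGUF \
  Phi-3.5-mini-instruct-Q4_K_M.gguf --local-dir ./models
# Generate 128 tokens (-n) on 4 CPU threads (-t)
./build/bin/llama-cli -m ./models/Phi-3.5-mini-instruct-Q4_K_M.gguf \
  -p "Explain attention." -n 128 -t 4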
On phones like the Galaxy S24 Ultra or iPhone 15 Pro, Phi-3.5 mini (3.8B Q4_K_M, ~2.2 GB) runs at roughly 30-50 ms per token (20-33 tok/s).
Whisper.cpp
A C++ port of OpenAI's Whisper speech recognition, also by ggerganov. Lets you run speech-to-text on phones / laptops without the cloud.
# Korean speech recognition with Whisper.cpp (CPU)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
# Fetch the medium model in ggml format
bash ./models/download-ggml-model.sh medium
make -j
# -l ko selects Korean; the input should be 16 kHz WAV audio
./build/bin/whisper-cli -m models/ggml-medium.bin -l ko -f my_audio.wav
Whisper.cpp's Core ML build on iPhone transcribes a 30-minute clip with the medium model (769M) in ~5 minutes. The small (244M) and base (74M) models are faster still, which makes them the usual choice for live transcription on a phone.
MLC LLM
MLC (Machine Learning Compilation) LLM is a phone / browser LLM engine from the Carnegie Mellon / Apache TVM camp.
- Android — Vulkan / OpenCL backend
- iOS — Metal backend
- Browser — WebGPU backend (LLM in the browser)
- Desktop — CUDA / ROCm / Metal
The WebGPU backend is particularly interesting. As a user lands on a page, the model downloads, and Chrome / Edge / Safari runs the GPU-accelerated LLM right inside the browser. No server calls — fully local.
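# MLC LLM Android demo build
git clone --recursive https://github.com/mlc-ai/mlc-llm
cd mlc-llm
export MLC_LLM_SOURCE_DIR="$(pwd)"
cd android/MLCChat
# List the model (e.g. "HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC") in
# mlc-package-config.json, then build the model libraries and weights
python -m mlc_llm package
# Open the android/MLCChat project in Android Studio and build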
With MLC LLM, a Galaxy S24 Ultra runs Llama 3.2 3B at ~25 ms per token (40 tok/s). On the same device the GPU backend is slightly faster than llama.cpp.
The GGUF Format
GGUF (Georgi Gerganov Unified Format) is llama.cpp's standard model file. A single file packs:
- Weights (quantized tensors)
- Tokenizer (BPE / SentencePiece)
- Chat template (chat_template, Jinja-like)
- Metadata (architecture, context size, RoPE settings)
That means one .gguf can run identically across llama.cpp / Ollama / LM Studio / GPT4All.
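Because everything lives in one file, a few bytes of the header are enough to identify a GGUF model. The sketch below reads the fixed-size header fields (magic, version, tensor count, metadata key/value count) as laid out in GGUF version 2 and later; it parses an in-memory fake header so it runs without a model file, and `read_gguf_header` is a hypothetical helper rather than part of llama.cpp.

```python
import io
import struct


def read_gguf_header(stream):
    """Parse the fixed GGUF header: magic, version, tensor and metadata counts."""
    magic = stream.read(4)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count
    version, tensor_count, kv_count = struct.unpack("<IQQ", stream.read(20))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Fake header for demonstration; the counts are made-up values
fake_file = io.BytesIO(b"GGUF" + struct.pack("<IQQ", 3, 291, 24))
header = read_gguf_header(fake_file)
print(header)
```

For a real model, pass a file opened with `open(path, "rb")` instead of the `BytesIO` object; the metadata key/value pairs (tokenizer, chat template, architecture) follow immediately after these 24 bytes.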
By May 2026 Hugging Face hosts 50,000+ GGUF models, with "Q4_K_M" or "Q5_K_M" as the standard quantization. Q4_K_M is the recommended quality / size sweet spot.
13. Small Models — Phi-3 / 3.5 / 4 (MS) / Gemma 2 / 3 (Google) / Llama 3.2 1B / 3B
The biggest variable in edge LLM is model selection. Between 2024 and 2026 we saw an explosion of "1-4B parameter models that match GPT-3.5 quality." The three main families:
Microsoft Phi Series
Phi is Microsoft's small-LLM series. Building on the "Textbooks Are All You Need" paper, the goal is to approach the performance of much larger models using high-quality synthetic data with small models.
- Phi-3 mini (3.8B) — April 2024. 128K context. ~12 tok/s on iPhone 15
- Phi-3 small (7B) — May 2024
- Phi-3 medium (14B) — May 2024
- Phi-3.5 mini (3.8B) — August 2024. Adds multilingual support (Korean / Japanese / etc.)
- Phi-3.5 vision (4.2B) — vision input
- Phi-3.5 MoE (16x3.8B, ~6.6B active) — MoE variant
- Phi-4 (14B) — December 2024. Strong at code / math
- Phi-4 mini (3.8B) — early 2025
Phi-3 mini's popularity comes from being the first practical phone LLM. It reaches ~12-15 tok/s on iPhone 15 Pro and ~20-25 tok/s on Galaxy S24 Ultra, fast enough for real-time chat.
Google Gemma Series
Gemma is Google's open-model series, derived from the same research infrastructure as Gemini.
- Gemma 2B / 7B: February 2024. Initial release
- Gemma 2 2B / 9B / 27B: June 2024. Big quality jump
- Gemma 3 1B / 4B / 12B / 27B: March 2025. Multimodal (vision + text) from 4B up, 128K context (the 1B model is text-only with 32K context)
- Gemma 3n (mobile-focused): May 2025. The E4B variant holds about 8B raw parameters but runs with the memory footprint of a 4B model thanks to PLE
Gemma 3 27B clearly outperforms Gemma 2 9B, and Gemma 3n E4B delivers quality comparable to typical 8B models within a mobile memory budget. PLE (Per-Layer Embeddings) keeps a large share of the parameters in per-layer embedding tables that can stay outside accelerator memory.
Meta Llama 3.2 1B / 3B
Llama 3.2, announced in September 2024, is Meta's small-model lineup. Effectively the mobile / edge line.
- Llama 3.2 1B / 3B — text-only, small
- Llama 3.2 11B / 90B Vision — vision + text (the larger sizes are not edge-friendly)
Llama 3.2 1B is the smallest practical LLM that still gives usable answers, running at ~50-80 tok/s on iPhone 15 / Galaxy S24. Sufficient for light scenarios — voice interfaces, chatbots, text classification.
Meta itself recommends ExecuTorch as the official mobile path for Llama 3.2 1B / 3B and ships demo apps on Android / iOS.
Model Selection Guide
- Phi-3 mini / 3.5 mini / Phi-4 mini — multilingual, general chat, the most balanced choice
- Gemma 2 2B / Gemma 3 4B (Gemma 3n) — Google camp, integrated with MediaPipe LLM Inference API
- Llama 3.2 1B / 3B — Meta camp, the #1 choice on ExecuTorch, strong in English
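By the rough throughput figures quoted in this article, the fastest-to-slowest order on a phone is: Llama 3.2 1B (50-80 tok/s), Phi-3 mini (20-25 tok/s), Gemma 3 4B (15-20 tok/s), Llama 3.2 3B (10-15 tok/s). These numbers depend heavily on runtime and backend (the MLC LLM GPU figure for Llama 3.2 3B in section 12 is far higher), so benchmark on the target device. For answer quality the order roughly reverses: Phi-3 mini / Gemma 3 4B / Llama 3.2 3B clearly outperform 1B.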
14. Always-on AI — The Era of Sensor + ML
The real value of Edge AI is not one-shot inference but 24/7 always-on operation. That is Always-on AI.
Typical scenarios:
- Smart speaker wake words — listening at all times, waking on "Hey Siri"
- Smartwatch fall detection — continuously monitoring IMU data, alerting on a matched pattern
- Industrial vibration analysis — listening to a motor and detecting bearing failure
- Agricultural IoT — a camera continuously watching crops for disease
- City CCTV — people / vehicle counting plus accident detection
The technical core of Always-on AI is:
1. Dual-core / dual-model — a tiny model (1-10 KB) runs constantly catching "candidates," then a larger model (100 KB - 1 MB) wakes up to verify. Keyword spotting is the canonical example. Apple Watch / Pixel Buds work this way.
2. Quantization — int8 or below (4-bit, 2-bit) for 99% power savings. Edge TPU, Hexagon DSP, Cortex-M NPUs are all int8.
3. Inference on NPU / DSP — the main CPU stays in deep sleep while the NPU does inference solo.
4. Direct sensor -> ML path — camera ISP / microphone PDM share the same SoC as the NPU, so data bypasses CPU memory and goes straight to the NPU.
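// Sketch: Cortex-M NPU always-on keyword spotting
// The extern buffer and helper functions are placeholders for platform code.
#include <stdint.h>

#define FRAME_SAMPLES  512   // assumed frame length in samples
#define THRESHOLD_LOW  64    // assumed first-pass score that counts as a candidate
#define LABEL_HEY_SIRI 1     // assumed class index of the wake word

extern int16_t audio_buffer[FRAME_SAMPLES];      // filled by microphone DMA
int  run_tiny_kws_model(const int16_t *audio);   // returns a confidence score
int  run_large_kws_model(const int16_t *audio);  // returns a class label
void wake_application_processor(void);
void enter_deep_sleep(void);

int main(void) {
    while (1) {
        // 1. First-pass filter with a tiny model (10 KB)
        int score = run_tiny_kws_model(audio_buffer);
        if (score > THRESHOLD_LOW) {
            // 2. Wake the larger model (500 KB) to verify the candidate
            int label = run_large_kws_model(audio_buffer);
            if (label == LABEL_HEY_SIRI) {
                // 3. Wake the application processor (UART / SPI / IPC)
                wake_application_processor();
            }
        }
        // Sleep until the next frame (DMA collects microphone data automatically)
        enter_deep_sleep();
    }
}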
That pattern is why "Hey Siri" on Apple Watch can listen around the clock with very little battery drain. A small always-on processor (Apple's in-house design, comparable in role to a Cortex-M-class NPU) listens on the microphone all day, and the main SoC only wakes when a keyword matches.
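The battery argument behind this pattern comes down to duty cycle: average power is the time-weighted mix of active and sleep power. A minimal sketch follows. The power figures, duty cycle, and battery capacity in the usage lines are hypothetical placeholders, not measurements of any real device, and the function names are illustrative.

```python
def average_power_mw(active_mw: float, sleep_mw: float, duty_cycle: float) -> float:
    """Time-weighted average power for a core that is active duty_cycle of the time."""
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    return duty_cycle * active_mw + (1.0 - duty_cycle) * sleep_mw


def battery_life_hours(capacity_mwh: float, avg_power_mw: float) -> float:
    """Ideal runtime in hours, ignoring self-discharge and regulator losses."""
    return capacity_mwh / avg_power_mw


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to illustrate the arithmetic.
    always_on_core = average_power_mw(active_mw=5.0, sleep_mw=0.05, duty_cycle=0.02)
    main_soc_awake = average_power_mw(active_mw=300.0, sleep_mw=300.0, duty_cycle=1.0)
    capacity_mwh = 1000.0
    print(f"always-on core: {battery_life_hours(capacity_mwh, always_on_core):.0f} h")
    print(f"main SoC awake: {battery_life_hours(capacity_mwh, main_soc_awake):.1f} h")
```

Whatever the real figures are, the structure of the formula is the point: runtime is dominated by sleep power and by how rarely the expensive stage has to wake.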
Industrial vibration anomaly detection follows the same pattern. An STM32H7 with an ST MEMS accelerometer and a 1 KB TFLite Micro (now LiteRT Micro) autoencoder monitors bearing health 24/7 and can run for over six months on a single battery.
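Anomaly detection with an autoencoder usually reduces to thresholding the reconstruction error: the model is trained on healthy vibration data only, so a worn bearing reconstructs poorly. The sketch below shows just that decision step in Python, assuming the reconstructed window comes from the on-device model. The mean + k × std threshold rule and the function names are illustrative assumptions, not something the deployment described above is known to use.

```python
from statistics import mean, stdev


def reconstruction_error(window, reconstructed):
    """Mean squared error between a sensor window and the autoencoder's output."""
    if len(window) != len(reconstructed):
        raise ValueError("window and reconstruction must have the same length")
    return sum((x - y) ** 2 for x, y in zip(window, reconstructed)) / len(window)


def calibrate_threshold(healthy_errors, k=3.0):
    """Threshold from errors seen on healthy data: mean + k * std (k is an assumed choice)."""
    return mean(healthy_errors) + k * stdev(healthy_errors)


def is_anomaly(window, reconstructed, threshold):
    """True when the window reconstructs worse than the calibrated threshold."""
    return reconstruction_error(window, reconstructed) > threshold
```

On a microcontroller the same three steps would be written in C against the model's output tensor; only the threshold needs to be stored on the device, and it can be recalibrated per motor.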
The 2026 trend is Visual Wake Words — the camera ISP stays on, but the main SoC only wakes when a "person is visible." The Visual Wake Words model is ~250 KB, an ultra-light MobileNet-V2 variant, and runs at the 1 mW level on Cortex-M55 + Ethos-U65 type integrated NPUs.
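The int8 quantization listed in the technical core above maps real values to 8-bit integers with a scale and a zero point, following the affine scheme real = scale × (q - zero_point) used by the TFLite / LiteRT int8 specification. The sketch below is plain Python for readability. The function names are illustrative, and a real deployment would rely on the converter's calibration rather than a hand-picked min/max.

```python
def quant_params(x_min, x_max, qmin=-128, qmax=127):
    """Derive (scale, zero_point) so that [x_min, x_max] maps onto [qmin, qmax]."""
    # The representable range must contain 0.0 so that zero maps to an exact integer.
    x_min = min(x_min, 0.0)
    x_max = max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero tensor; any positive scale works
    zero_point = round(qmin - x_min / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Real value -> int8 code, clamped to the representable range."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """int8 code -> approximate real value."""
    return scale * (q - zero_point)


if __name__ == "__main__":
    scale, zp = quant_params(0.0, 2.55)
    for x in (0.0, 1.0, 2.55, 10.0):
        q = quantize(x, scale, zp)
        print(x, "->", q, "->", dequantize(q, scale, zp))
```

Values inside the calibrated range round-trip with an error of at most half a quantization step, while values outside it saturate at the ends of the int8 range. That saturation is why the calibration range matters as much as the bit width.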
15. Korea / Japan — ETRI / Samsung / LG / Sony AI / NTT
Korea
- Samsung Electronics — Galaxy AI on Galaxy S24/S25 (2024-2026) is hybrid on-device + cloud. Translation, real-time call interpretation, and photo edits partly run on Snapdragon 8 Gen 3/4 Hexagon NPU + the Exynos modem NPU
- Samsung System LSI — Stronger NPU cores in Exynos 2400/2500. A unified AI Engine for consistent behavior across phone / tablet / wearable
- LG Electronics — On-device AI in LG ThinQ Home appliances (refrigerator food recognition, washing machine fabric recognition, AI upscaler in TVs). In-house NPU-integrated SoCs (webOS NPU)
- Hyundai Motor — Hyundai Mobis + in-house IDC (Infotainment Domain Controller) running NVIDIA Drive in tandem with proprietary solutions. ADAS standardization
- Naver / NAVER Cloud — Considering deploying lightweight HyperCLOVA X (2-3B) to mobile / edge
- Kakao / Kakao Brain — Device-targeted sLM Honeybee and the Kanana series (Korean-specialized small models)
- ETRI (Electronics and Telecommunications Research Institute) — Edge AI standardization research. Compression of KoBERT / KoBigBird, MOA (Meta-OS Acceleration) project
- KAIST / Seoul National University — Research on Korean speech / translation on Sipeed K230 and Jetson Nano
- Mando / HL Mando — Hailo / Ambarella NPU adoption for ADAS cameras
- LaonPeople, Suprema — Custom NPU or Hailo integration in industrial / security cameras
Japan
- Sony AI / Sony Semiconductor — IMX500, an NPU-integrated image sensor, is the marquee product. A pioneer of "AI on the sensor," running inference inside the camera silicon itself
- NTT / NTT DoCoMo — Edge AI infrastructure as part of IOWN (Innovative Optical and Wireless Network). NPUs at telecom base stations
- Renesas Electronics — Proprietary DRP-AI (Dynamically Reconfigurable Processor for AI) on RA / RZ-series MCUs. Standard in industrial / automotive
- Panasonic — Custom vision stacks in Iolite / Connect industrial cameras / HMI
- Japanese OEMs mirror the Samsung pattern — Sony Xperia and Sharp Aquos leverage their NPUs
- Toyota / Honda / Nissan — Proprietary self-driving / ADAS computers (Toyota T-MAS, Honda Sensing) alongside NVIDIA Drive
- Japanese startups — Edgecortix and LeapMind ship their own NPU / compiler solutions. LeapMind is known for Blueoil, a quantized model compiler
- ASTERA Labs (US headquartered, strong in Japan) — CXL / PCIe memory fabric for edge data center infrastructure. Gaining share in in-vehicle memory fabric
Common Threads
Both Korea and Japan are leaning hard into on-device AI. NPUs are now standard in phones / cars / appliances, and cloud-LLM cost / latency / privacy issues are pushing a "do everything possible on the device" strategy.
Japan in particular has strong in-house NPU design: Renesas DRP-AI, Sony IMX500, Panasonic's vision IP, and Edgecortix's SAKURA-II are positioned as global competitors to NVIDIA / Hailo / Coral.
16. Who Should Learn Edge AI — IoT / Mobile / Automotive
Finally, what tools to learn for which role.
IoT / Firmware Engineer
- Must — Arduino Nano 33 BLE Sense + TFLite Micro / LiteRT Micro + Edge Impulse. C / C++
- Recommended — Cortex-M NPU-integrated MCUs (Ethos-U55 / U65), Sipeed K230, ESP32-S3
- Applications — Always-on AI, vibration analysis, environmental monitoring, keyword spotting
- Career — Industrial IoT, smart factory, healthcare devices, agricultural IoT
Mobile Engineer
- Must — LiteRT (Android) + Core ML (iOS). Kotlin / Swift
- Recommended — ExecuTorch (both), MLC LLM, llama.cpp, Whisper.cpp
- Applications — On-phone LLM chat, speech recognition, image classification, AR
- Career — Phone OS / keyboard / messenger / camera app / health app
SBC / Robotics Engineer
- Must — NVIDIA Jetson + JetPack + TensorRT, ROS 2, Isaac ROS
- Recommended — Rockchip RK3588, Hailo-15, Coral, Sipeed K230
- Applications — Autonomous mobile robots, humanoids, industrial vision, security cameras
- Career — Robotics companies, autonomous driving, industrial automation, aerospace
Automotive Engineer
- Must — NVIDIA Drive AGX, Mobileye EyeQ, TensorRT
- Recommended — Hailo-10H (ASIL-B), Qualcomm Snapdragon Ride
- Applications — ADAS, autonomous driving, in-vehicle infotainment
- Career — OEMs, Tier-1 (Bosch, Continental), Tier-2 (NXP, Infineon)
ML Engineer / Data Scientist (transitioning to edge)
- Must — PyTorch + torch.export + quantization-aware training
- Recommended — ONNX, ExecuTorch, LiteRT, llama.cpp, MLC LLM
- Applications — Porting cloud models to the edge: quantization / pruning / knowledge distillation
Students / Beginners
The cheapest and fastest path:
1. Arduino Nano 33 BLE Sense ($35) + Edge Impulse (free tier) — first steps in TinyML. Keyword spotting, gesture recognition
2. Sipeed MaixCAM or XIAO ESP32-S3 Sense ($15-$65) — camera + AI maker projects
3. Raspberry Pi 5 + Coral USB Accelerator ($130) or Orange Pi 5 ($130) — entry to SBC
4. Jetson Orin Nano ($249) — serious robotics / SBC
Starting with a $15 board and stepping up to a $249 Jetson over six months is the smoothest path.
17. References
- LiteRT (formerly TFLite) — https://ai.google.dev/edge/litert
- LiteRT Micro — https://ai.google.dev/edge/litert/microcontrollers/overview
- ExecuTorch — https://pytorch.org/executorch/
- ExecuTorch GitHub — https://github.com/pytorch/executorch
- Edge Impulse — https://www.edgeimpulse.com/
- NVIDIA Jetson Orin — https://developer.nvidia.com/embedded/jetson-orin
- NVIDIA Jetson Thor — https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-thor/
- Coral by Google — https://coral.ai/
- Hailo — https://hailo.ai/
- Sipeed K230 / MaixPy — https://wiki.sipeed.com/hardware/en/maixIV/m4ndock/maixIV.html
- Rockchip RKNN-Toolkit2 — https://github.com/airockchip/rknn-toolkit2
- Arduino Nano 33 BLE Sense — https://store.arduino.cc/products/arduino-nano-33-ble-sense-rev2
- Seeed XIAO ESP32-S3 Sense — https://wiki.seeedstudio.com/xiao_esp32s3_getting_started/
- ONNX Runtime Mobile — https://onnxruntime.ai/docs/tutorials/mobile/
- Core ML Tools — https://apple.github.io/coremltools/docs-guides/
- NVIDIA TensorRT — https://developer.nvidia.com/tensorrt
- TensorRT-LLM — https://github.com/NVIDIA/TensorRT-LLM
- Apache TVM — https://tvm.apache.org/
- MLC LLM — https://llm.mlc.ai/
- llama.cpp — https://github.com/ggerganov/llama.cpp
- Whisper.cpp — https://github.com/ggerganov/whisper.cpp
- GGUF Spec — https://github.com/ggerganov/ggml/blob/master/docs/gguf.md
- Microsoft Phi-3 — https://azure.microsoft.com/en-us/products/phi
- Microsoft Phi-4 — https://huggingface.co/microsoft/phi-4
- Google Gemma — https://ai.google.dev/gemma
- Meta Llama 3.2 — https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/
- MediaPipe LLM Inference — https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference
- Sony IMX500 — https://www.sony-semicon.com/en/products/is/industry/imx500.html
- Renesas DRP-AI — https://www.renesas.com/en/key-technologies/ai-machine-learning/drp-ai
- ETRI — https://www.etri.re.kr/eng/main/main.etri