MLOps · 9 min

From Notebook to Edge: ONNX and int8 Quantization, Measured

June 22, 2026

Most edge-AI advice is abstract. This post is the opposite: a single, self-contained pipeline that trains a small image classifier, optimizes it two different ways, and measures the result so the numbers are defensible. Everything here comes from code you can run yourself.

The point is not to solve a hard ML problem. It is to show that the productionization and edge-optimization steps are real, measured, and repeatable, using the same methodology I use on client work.

The setup

The model is a small CNN trained on Fashion-MNIST in PyTorch (CPU-only, about 38 seconds of training, 89.3% test accuracy). Nothing exotic. The interesting part is what happens after training, when you have to ship it.

Two optimizations, applied in order, each isolated so you can attribute the win:

  1. Swap the runtime: export the PyTorch model to ONNX and run it under ONNX Runtime instead of PyTorch eager.
  2. Quantize the head: apply ONNX Runtime dynamic quantization (int8) to the fully-connected head, leaving the conv feature extractor in fp32.
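
To make the second step concrete, here is a simplified, pure-Python illustration of what int8 weight quantization does to a block of fp32 weights. It uses one common scheme (symmetric, per-tensor scaling). It is not ONNX Runtime's implementation, and the function names and sample weights are hypothetical.

```python
# Simplified illustration of symmetric, per-tensor int8 weight quantization.
# NOT ONNX Runtime's implementation; names and sample weights are hypothetical.
from array import array


def quantize_int8(weights):
    """Map float weights to int8 codes plus a single float scale."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    codes = array(
        "b",  # signed 8-bit integers
        (max(-127, min(127, round(w / scale))) for w in weights),
    )
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the int8 codes."""
    return [c * scale for c in codes]


weights = [0.42, -0.97, 0.05, 0.9, -0.33, 0.0, 1.1, -0.78]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

fp32_bytes = array("f", weights).itemsize * len(weights)  # 4 bytes per weight
int8_bytes = codes.itemsize * len(codes)                  # 1 byte per weight
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(f"fp32: {fp32_bytes} B, int8: {int8_bytes} B, max error: {max_error:.5f}")
```

Each quantized weight takes one byte instead of four, at the cost of a rounding error of at most half the scale. That is why a quantized layer shrinks by close to 4×, and why the whole file shrinks by somewhat less when part of the network stays in fp32.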

The measured result

Benchmarked over 300 timed runs after 20 warmup runs, batch size 1, threads pinned to 1 for reproducible single-core numbers:

Variant                            mean (ms)    p95 (ms)   size (KB)
--------------------------------------------------------------------
PyTorch eager (fp32)                   0.122       0.128       811.2
ONNX Runtime (fp32)                    0.067       0.072       810.3
ONNX Runtime (int8 quantized)          0.058       0.060       221.6

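Two distinct wins, from two distinct changes. The runtime swap (PyTorch eager → ONNX Runtime) is the latency win: about 1.8× lower mean latency (0.122 → 0.067 ms). Quantization is the footprint win: 3.66× smaller on disk (811 → 222 KB), because the FC weights, which account for most of the file, are stored as int8 while the conv layers stay fp32. Quantization also trims mean latency a little further (0.067 → 0.058 ms), but the size reduction is its main effect.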
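
The headline ratios can be re-derived from the table with a few lines of arithmetic. The sketch below uses the rounded table values, so the last digit may differ slightly from figures computed on raw timings. The final estimate of how much of the fp32 file was FC weights is a back-of-envelope figure, not a measurement from this post: it assumes every saved byte comes from storing a quantized weight in 1 byte instead of 4.

```python
# Ratios recomputed from the rounded values in the results table.
pytorch_ms, ort_fp32_ms, ort_int8_ms = 0.122, 0.067, 0.058
pytorch_kb, ort_fp32_kb, ort_int8_kb = 811.2, 810.3, 221.6

runtime_speedup = pytorch_ms / ort_fp32_ms   # effect of the runtime swap alone
quant_speedup = ort_fp32_ms / ort_int8_ms    # effect of quantization alone
size_ratio = pytorch_kb / ort_int8_kb        # original artifact vs. int8 artifact

# ASSUMPTION: the size saving is exactly 3/4 of the quantized weights'
# fp32 size (4 bytes -> 1 byte), ignoring scales and graph metadata.
saved_kb = ort_fp32_kb - ort_int8_kb
implied_fc_fp32_kb = saved_kb / 0.75

print(f"runtime swap: {runtime_speedup:.2f}x lower mean latency")
print(f"quantization: {quant_speedup:.2f}x lower mean latency")
print(f"size:         {size_ratio:.2f}x smaller on disk")
print(f"implied fp32 size of quantized weights: ~{implied_fc_fp32_kb:.0f} KB")
```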

Re-running on your hardware will produce different absolute numbers; the relative deltas are what matter. This is a demonstration of a measurement methodology, not a benchmark of a production workload under load.

Why measure like this

The discipline matters more than the numbers. A benchmark you can trust has a few non-negotiables:

  • Warm up first. The first few runs include JIT, allocation, and cache effects. Discard them.
  • Pin threads. Single-thread, batch-1 numbers are reproducible; multi-thread numbers depend on what else the box is doing.
  • Report p95, not just the mean. Tail latency is what users and SLAs actually feel.
  • Measure on-disk size too. For edge deployment, artifact size drives cold-start time, image size, and what hardware you can target.
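
The checklist above fits in a small harness. The following is a minimal sketch using only the standard library, not the exact benchmark script behind the table: the names benchmark and size_kb are hypothetical, and the p95 uses the nearest-rank definition. Thread pinning is framework-specific and not shown here; torch.set_num_threads(1) and ONNX Runtime's SessionOptions.intra_op_num_threads = 1 are the usual knobs.

```python
import os
import statistics
import time


def benchmark(fn, warmup=20, runs=300):
    """Call fn() repeatedly and return latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # untimed: absorbs lazy initialization, allocation, cold caches

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    rank = (95 * runs + 99) // 100  # ceil(0.95 * runs) in integer arithmetic
    return {
        "runs": runs,
        "mean_ms": statistics.fmean(samples_ms),
        "p95_ms": samples_ms[rank - 1],  # nearest-rank 95th percentile
    }


def size_kb(path):
    """On-disk size of an artifact in KB (1 KB = 1024 bytes)."""
    return os.path.getsize(path) / 1024.0


# Stand-in workload; replace the lambda with a single batch-1 inference call.
stats = benchmark(lambda: sum(range(1000)))
print(f"mean {stats['mean_ms']:.4f} ms, p95 {stats['p95_ms']:.4f} ms")
```

Passing the inference call as a zero-argument callable keeps the harness identical across all three variants, so any difference in the numbers comes from the model and runtime rather than from the measurement code.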

Then productionize it

A fast model is not a service. The quantized ONNX model gets wrapped in a FastAPI app with /predict and /healthz, structured JSON logging, and per-request latency instrumentation (every response carries X-Request-ID and X-Latency-Ms headers). It is containerized with a Dockerfile that bundles the optimized artifact.
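
The per-request instrumentation can be sketched without any web framework. This is an illustrative stand-in, not the service's actual code: in a FastAPI app the same logic would typically live in middleware, and the names instrumented and predict, the log fields, and the placeholder response are all hypothetical.

```python
import functools
import json
import logging
import time
import uuid

logger = logging.getLogger("inference")


def instrumented(handler):
    """Wrap a handler so each call yields (body, headers) and one JSON log line."""
    @functools.wraps(handler)
    def wrapper(payload):
        request_id = str(uuid.uuid4())
        start = time.perf_counter()
        body = handler(payload)
        latency_ms = (time.perf_counter() - start) * 1000.0

        headers = {
            "X-Request-ID": request_id,
            "X-Latency-Ms": f"{latency_ms:.3f}",
        }
        # Structured log: one JSON object per request, easy to parse downstream.
        logger.info(json.dumps({
            "event": "predict",
            "request_id": request_id,
            "latency_ms": round(latency_ms, 3),
        }))
        return body, headers
    return wrapper


@instrumented
def predict(payload):
    # Hypothetical placeholder for the real model call.
    return {"label": 0}


body, headers = predict({"pixels": [0.0] * 784})  # 28 x 28 Fashion-MNIST image
print(headers)
```

Because the request ID appears in both the response headers and the log line, a slow request reported by a client can be matched to its server-side record.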

Deliberately scoped: no auth, no metrics backend, no speculative features. Structured logging plus per-request latency is the monitoring hook a real deployment wires into its existing log and metrics pipeline. Scope is a feature.

Run it yourself

The full pipeline — train, export, quantize, benchmark, serve — is open and runnable. This is the kind of proof I'd rather show than claim.

If you have a model stuck in a notebook and a deadline to get it onto constrained hardware, that gap is exactly what the Prototype to Production work closes. The same steps shown here — runtime swap, quantization, a measured benchmark, a monitored service — applied to your model.
