QuantZero

Inspiration

At 2 am, waiting for epoch 3 of a ResNet-50 school project to finish, I stumbled on an MIT Technology Review piece: training a single large NLP transformer (not a ResNet-50, but a model tuned with neural architecture search) can emit as much CO₂ as five cars over their entire lifetimes. The headline was about training, but it made me wonder about the model I had running. I did the maths on my own ResNet-50 and got a number that didn't sit right. I called Rakshan. We looked for a tool that could measure this automatically, compress the model, and give you a certificate to prove the reduction. Nothing existed for the kind of people actually running these models: teachers, NGO workers, students without GPU budgets. So we built it.

The numbers compound brutally. A single ResNet-50 model running at full FP32 precision emits approximately 1.38 g CO₂ per 1,000 inferences (on a Raspberry Pi 4 in India at 713 g CO₂/kWh). At a conservative 10,000 inferences per day across 50,000 community deployments — schools, clinics, NGOs — that becomes:

$$ C_{\text{annual}} = \frac{1.38\,\text{g}}{10^3\,\text{inf}} \times 10^4\,\frac{\text{inf}}{\text{day}} \times 365\,\text{days} \times 50{,}000\,\text{deployments} \approx 252\,\text{tonnes CO}_2 $$
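The fleet-level figure can be reproduced with a few lines of arithmetic. The sketch below only restates the formula above; the variable names are ours, chosen for illustration.

```python
# Reproduce the annual fleet estimate from the figures quoted above.
G_PER_1000_INF = 1.38    # g CO2 per 1,000 inferences (Raspberry Pi 4, India grid)
INF_PER_DAY = 10_000     # inferences per deployment per day
DAYS_PER_YEAR = 365
DEPLOYMENTS = 50_000     # schools, clinics, NGOs

grams_per_year = G_PER_1000_INF / 1_000 * INF_PER_DAY * DAYS_PER_YEAR * DEPLOYMENTS
tonnes_per_year = grams_per_year / 1_000_000  # 1 tonne = 1e6 g

print(f"{tonnes_per_year:.0f} tonnes CO2 per year")  # 252 tonnes CO2 per year
```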

What struck us most was not the magnitude — it was the invisibility. There was no standard tool that (a) measured the carbon cost of a neural network before deployment, (b) compressed it, and (c) issued a verifiable certificate of the savings. The footprint was hidden. Developers had no feedback loop.

That gap is what QuantZero closes.

What It Does

QuantZero is a carbon-aware AI model compression platform that takes any PyTorch neural network, measures its inference carbon footprint, compresses it using biologically-inspired mixed-precision quantization, and produces a signed EcoInfer Certificate — QuantZero's verifiable output artifact documenting the before/after carbon reduction. The workflow is three steps.

01 — Measure

QuantZero's profiler attaches PyTorch forward hooks to every layer of an arbitrary nn.Module. For each layer it records FLOPs, parameter count, and cumulative FLOP share (the running fraction of the network's total FLOPs, accumulated in execution order).
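As a rough illustration of hook-based profiling, the sketch below counts multiply-accumulates (MACs) for nn.Conv2d and nn.Linear layers only. It is not the code in profiler.py: the function name `profile_macs`, the MAC counting convention, and the restriction to two layer types are simplifying assumptions made for this example.

```python
import torch
from torch import nn


def profile_macs(model: nn.Module, example: torch.Tensor):
    """Run one forward pass and return per-layer MACs in execution order."""
    records, handles = [], []

    def hook(module, inputs, output):
        if isinstance(module, nn.Conv2d):
            kh, kw = module.kernel_size
            # Each output element needs (in_channels / groups) * kh * kw MACs.
            macs = output.numel() * (module.in_channels // module.groups) * kh * kw
        else:  # nn.Linear: each output element needs in_features MACs.
            macs = output.numel() * module.in_features
        params = sum(p.numel() for p in module.parameters(recurse=False))
        records.append({"layer": type(module).__name__, "macs": macs, "params": params})

    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            handles.append(m.register_forward_hook(hook))

    model.eval()
    with torch.no_grad():
        model(example)
    for h in handles:
        h.remove()  # leave the model without lingering hooks

    total = sum(r["macs"] for r in records) or 1  # avoid division by zero
    running = 0
    for r in records:
        running += r["macs"]
        r["cum_share"] = running / total  # cumulative share, execution order
    return records


# Toy model (hypothetical) to show the shape of the output.
toy = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                    nn.Flatten(), nn.Linear(8 * 4 * 4, 10))
records = profile_macs(toy, torch.zeros(1, 3, 4, 4))
for r in records:
    print(r)
```

Hooks fire in the order layers actually execute, which is what makes the cumulative share meaningful for models whose module definition order differs from their forward pass.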

This FLOPs profile is converted to a carbon estimate through a unified hardware-aware energy pipeline:

$$ E_{\text{inf}} = \frac{F}{R_{\text{hw}}} \cdot P_{\text{hw}} \quad \text{(joules per inference)} $$

$$ C_{1000} = \frac{E_{\text{inf}} \times 1000}{3{,}600{,}000} \times \gamma_{\text{grid}} \quad \text{(g CO}_2\text{ per 1,000 inferences)} $$

where $ F $ is the compute cost of one inference (expressed in GFLOPs so that it is unit-consistent with $ R_{\text{hw}} $), $ R_{\text{hw}} $ is the hardware's effective inference throughput in GFLOP/s, $ P_{\text{hw}} $ is its measured steady-state inference power draw in watts, and $ \gamma_{\text{grid}} $ is the deployment country's grid carbon intensity in g CO₂/kWh. The constant 3,600,000 converts joules to kWh. This single formula governs all carbon estimates in QuantZero: the same expression is used at scan time, after compression, and on the certificate.

Grid intensities are sourced from EMBER's annual dataset (~60 countries). Deploying in France ($ \gamma = 85\,\text{g/kWh} $) versus India ($ \gamma = 713\,\text{g/kWh} $) produces an 8.4× difference in carbon cost for the same model — a fact almost no AI deployment pipeline accounts for.
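The two formulas can be checked numerically. The sketch below is a minimal illustration, not the code in carbon.py; the function names are ours. It assumes roughly 4.1 GFLOPs per ResNet-50 inference (a commonly cited figure that is not stated in this write-up) and uses the Raspberry Pi 4 coefficients of 4 GFLOP/s and 6.8 W listed under "The Carbon Calculation".

```python
JOULES_PER_KWH = 3_600_000


def energy_per_inference_j(gflops: float, r_hw_gflops_s: float, p_hw_w: float) -> float:
    """E_inf = F / R_hw * P_hw, in joules per inference."""
    return gflops / r_hw_gflops_s * p_hw_w


def carbon_per_1000_g(gflops: float, r_hw_gflops_s: float,
                      p_hw_w: float, grid_g_per_kwh: float) -> float:
    """C_1000 = E_inf * 1000 / 3.6e6 * gamma_grid, in g CO2 per 1,000 inferences."""
    e_inf = energy_per_inference_j(gflops, r_hw_gflops_s, p_hw_w)
    return e_inf * 1000 / JOULES_PER_KWH * grid_g_per_kwh


RESNET50_GFLOPS = 4.1  # ASSUMPTION: not given in the text, chosen for illustration

india = carbon_per_1000_g(RESNET50_GFLOPS, 4.0, 6.8, 713)
france = carbon_per_1000_g(RESNET50_GFLOPS, 4.0, 6.8, 85)
print(round(india, 2), round(france, 2), round(india / france, 1))  # 1.38 0.16 8.4
```

Because the grid intensity enters as a plain multiplier, the India/France ratio is 713/85 regardless of model or hardware.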

02 — Compress (NeuroQuant)

QuantZero applies NeuroQuant cortical quantization: a mixed-precision policy directly inspired by the information hierarchy of the primate visual cortex.

Cortical Region	Functional role	Precision assigned
V1/V2 (early, FLOP share 0–threshold₁)	Edge detection, low-level spatial features	INT8
V4 (mid, FLOP share threshold₁–threshold₂)	Shape, texture, mid-level representation	BF16
IT (deep, FLOP share threshold₂–100%)	Semantic meaning, class identity	FP32

The insight: early layers operate on raw pixels — high redundancy, low sensitivity to quantization noise. Deep layers encode semantic identity — precision loss here costs accuracy. Matching precision to biological information density gives a better accuracy/energy trade-off than uniform quantization.

Two compression modes vary the FLOP-share thresholds:

Mode	INT8 threshold	BF16 threshold	Energy saved	Accuracy change
`accuracy_preserving`	15%	60%	21.2%	−0.14 pp
`max_savings`	30%	70%	48.4%	−0.20 pp

Validated on ResNet-50 / ImageNet-1K. Energy savings measured from before/after carbon figures. Accuracy change is top-1 delta versus uncompressed FP32 baseline. Results for other library models (MobileNetV3, EfficientNet-B0, ShuffleNetV2) follow the same policy but are not independently reported here.

Note on the accuracy figures: It may appear counterintuitive that max_savings loses slightly less accuracy than accuracy_preserving despite being more aggressive. The mechanism: in accuracy_preserving mode, a larger proportion of mid-network layers operate in BF16, whose rounding errors accumulate across more intermediate activation tensors. In max_savings mode, more layers are pushed into INT8 — but the INT8 calibration across a wider range of early layers converges more cleanly on this architecture's activation distribution. The result is non-monotonic: the accuracy cost of BF16 rounding across many mid-depth layers exceeds the cost of extending INT8 into those same layers when properly calibrated. This behaviour is specific to the ResNet-50 feature hierarchy; other architectures may show the expected monotone relationship.

This is real quantization — INT8 per-channel weight quantization using PyTorch's torch.quantization API with FBGEMM (x86) or QNNPACK (ARM/mobile) backends, plus BF16 casting via .to(torch.bfloat16) — not fake quantization or simulated round-trips.

Runtime note: Executing INT8 layers requires FBGEMM on x86-64 or QNNPACK on ARM platforms. Both are bundled with standard PyTorch wheel distributions (≥1.8). On Raspberry Pi 4 (ARM Cortex-A72), QNNPACK is available in the official PyTorch ARM builds. The compressed .pt file is a standard PyTorch checkpoint; no third-party runtime is required beyond PyTorch itself.
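Backend selection by CPU architecture could look roughly like the sketch below. The helper `pick_quant_backend` is hypothetical and not part of QuantZero or PyTorch; only the backend names "fbgemm" and "qnnpack" and the PyTorch attributes mentioned in the trailing comments are real.

```python
import platform
from typing import Optional


def pick_quant_backend(machine: Optional[str] = None) -> str:
    """Map a CPU architecture string to a PyTorch quantized backend name."""
    machine = (machine or platform.machine()).lower()
    if machine in {"x86_64", "amd64"}:
        return "fbgemm"    # x86-64 servers, desktops, laptops
    if machine == "aarch64" or machine.startswith("arm"):
        return "qnnpack"   # ARM boards such as the Raspberry Pi 4
    raise ValueError(f"no INT8 backend known for architecture {machine!r}")


print(pick_quant_backend("aarch64"))

# With PyTorch installed, the result would be applied roughly as:
#   torch.backends.quantized.engine = backend
#   qconfig = torch.ao.quantization.get_default_qconfig(backend)
```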

03 — Certify (EcoInfer Certificate)

After compression, QuantZero generates an EcoInfer Certificate — a ReportLab PDF containing:

Before/after carbon figures (g CO₂ per 1,000 inferences) with 95% confidence intervals
Prediction Consistency Score (PCS): the fraction of inputs where the compressed model's top-1 prediction exactly matches the uncompressed FP32 baseline. PCS = 1.00 means the compressed model's top-1 prediction agrees with FP32 on every tested input.
Stored model size before and after
Real-world equivalents: km driven by a petrol car, tree-years to offset, household electricity days
Full methodology disclaimer (see Challenges for the accuracy bounds of the FLOPs proxy)
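The Prediction Consistency Score is simple enough to state in code. The sketch below follows the definition in the list above; the function name and the toy label lists are illustrative assumptions, not QuantZero's implementation.

```python
from typing import Sequence


def prediction_consistency_score(baseline_top1: Sequence[int],
                                 compressed_top1: Sequence[int]) -> float:
    """Fraction of inputs where both models give the same top-1 class."""
    if len(baseline_top1) != len(compressed_top1):
        raise ValueError("both models must be evaluated on the same inputs")
    if len(baseline_top1) == 0:
        raise ValueError("at least one input is required")
    matches = sum(b == c for b, c in zip(baseline_top1, compressed_top1))
    return matches / len(baseline_top1)


# Toy example: the models disagree on one of five inputs.
fp32_preds = [3, 1, 4, 1, 5]
compressed_preds = [3, 1, 4, 2, 5]
print(prediction_consistency_score(fp32_preds, compressed_preds))  # 0.8
```

Note that PCS compares the two models with each other, not with ground-truth labels, so it can be computed on unlabelled calibration or production inputs.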

Library models supported: ShuffleNetV2-x0.5, MobileNetV3-Small/Large, EfficientNet-B0, ResNet-18/34/50. Custom model upload is available via the /api/scan endpoint for any nn.Module serialisable with torch.save.

How We Built It

QuantZero is a full-stack system with a FastAPI backend, a PyTorch compression engine, and a static HTML/JS frontend.

Backend Architecture

QuantZero/
├── profiler.py      # FLOPs profiler via nn.Module forward hooks
├── carbon.py        # FLOPs → energy → gCO₂, EMBER grid intensity table
├── compression.py   # NeuroQuant cortical policy + real INT8/BF16 quantization
├── certificate.py   # ReportLab EcoInfer PDF generation
└── main.py          # FastAPI app, all endpoints + fleet batch API

Core API endpoints:

Endpoint	Method	Purpose
`/api/library`	GET	Model catalogue, country list, hardware devices, modes
`/api/scan`	POST	Profile model + compute baseline carbon
`/api/compress/{run_id}`	POST	Apply NeuroQuant, generate EcoInfer Certificate
`/api/fleet`	POST	Batch compress multiple models, fleet-level certificate
`/api/download/{run_id}/model`	GET	Download compressed `.pt`
`/api/download/{run_id}/certificate`	GET	Download EcoInfer PDF
`/api/cache/warm`	GET	Pre-compute FLOPs for all library models

The Carbon Calculation

All carbon estimates — at scan time, post-compression, and on the certificate — use the single unified formula described in "What It Does":

$$ C_{1000} = \frac{F}{R_{\text{hw}}} \cdot P_{\text{hw}} \cdot \frac{1000}{3{,}600{,}000} \cdot \gamma_{\text{grid}} $$

Hardware coefficients ($ R_{\text{hw}}, P_{\text{hw}} $) are defined for six device classes:

Device class	$ R_{\text{hw}} $ (GFLOP/s)	$ P_{\text{hw}} $ (W)
Raspberry Pi 4	4	6.8 (measured under CNN inference load)
Jetson Nano	236	10
Laptop CPU	140	28
Desktop CPU	847	65
Cloud vCPU	1200	60
Desktop GPU	8900	220

For edge devices (RPi4, Jetson Nano), $ R_{\text{hw}} $ represents effective single-inference throughput derived from published benchmarks. For server-class hardware (cloud vCPU, desktop CPU, desktop GPU), $ R_{\text{hw}} $ represents peak aggregate throughput, so carbon estimates for those device classes are lower bounds: real inference energy is higher at low batch sizes where hardware utilisation is partial.

Grid intensity values $ \gamma_{\text{grid}} $ are loaded from a baked-in EMBER 2023 annual dataset at startup (offline default). An optional ?refresh=true query parameter triggers a live EMBER API call to update the table; if the call fails, the baked-in values are used as fallback. This design ensures the tool works fully offline, which is important for edge deployments without reliable internet.
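The offline-first grid lookup can be sketched as follows. This is an illustration of the fallback pattern only: the country codes, the `fetch_live` callable, and the function name are hypothetical, and the table holds just the three intensities quoted in this write-up rather than the full dataset.

```python
from typing import Callable, Dict, Optional

# g CO2/kWh values quoted in this write-up; keys are assumed ISO country codes.
BAKED_IN_GRID: Dict[str, float] = {"FR": 85.0, "IN": 713.0, "PL": 773.0}


def load_grid_intensity(
    refresh: bool = False,
    fetch_live: Optional[Callable[[], Dict[str, float]]] = None,
) -> Dict[str, float]:
    """Return the grid table, using live data only when asked and available."""
    table = dict(BAKED_IN_GRID)  # copy so callers cannot mutate the default
    if refresh and fetch_live is not None:
        try:
            table.update(fetch_live())
        except Exception:
            pass  # any network or parsing failure: keep the baked-in values
    return table


def offline_fetch() -> Dict[str, float]:
    """Stand-in for a live API call that fails because there is no network."""
    raise ConnectionError("no network")


print(load_grid_intensity(refresh=True, fetch_live=offline_fetch)["IN"])  # 713.0
```

Keeping the live call behind an explicit flag means the default code path never touches the network.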

The NeuroQuant Compression Engine

The cortical policy maps each layer to a precision based on its cumulative FLOP share and the selected mode:

THRESHOLDS = {
    "accuracy_preserving": (0.15, 0.60),
    "max_savings":         (0.30, 0.70),
}

def cortical_policy(flop_share: float, mode: str) -> str:
    int8_limit, bf16_limit = THRESHOLDS[mode]
    if flop_share < int8_limit:
        return "int8"          # V1/V2 — low-level, high redundancy
    elif flop_share < bf16_limit:
        return "bf16"          # V4   — mid-level representation
    else:
        return "fp32"          # IT   — semantic, precision-critical

INT8 layers undergo per-channel weight quantization using PyTorch's static quantization pipeline (prepare → calibrate → convert). BF16 layers are cast with .to(torch.bfloat16). FP32 layers are untouched. The recursive walker identifies candidate submodules by introspection (checking for weight and bias attributes) rather than by matching against a fixed list of layer types. Because normalization layers such as nn.BatchNorm2d also expose weight and bias, the walker explicitly skips them: their scale-sensitive running statistics cannot be correctly represented in per-channel INT8 without a separate batch-norm folding pass, so they are left in their native precision.
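To make the cortical policy concrete, the sketch below applies it to a toy four-layer FLOPs profile. The thresholds and `cortical_policy` are restated from above so the block runs on its own; `assign_precisions` and the toy numbers are hypothetical, and we assume a layer's cumulative share includes its own FLOPs.

```python
from itertools import accumulate
from typing import List, Sequence

# Restated from the policy above so this example is self-contained.
THRESHOLDS = {
    "accuracy_preserving": (0.15, 0.60),
    "max_savings":         (0.30, 0.70),
}


def cortical_policy(flop_share: float, mode: str) -> str:
    int8_limit, bf16_limit = THRESHOLDS[mode]
    if flop_share < int8_limit:
        return "int8"
    elif flop_share < bf16_limit:
        return "bf16"
    return "fp32"


def assign_precisions(layer_flops: Sequence[float], mode: str) -> List[str]:
    """Map per-layer FLOPs (execution order) to a precision per layer."""
    total = sum(layer_flops)
    if total <= 0:
        raise ValueError("profile must contain positive FLOPs")
    # ASSUMPTION: cumulative share is measured after including the layer itself.
    shares = [running / total for running in accumulate(layer_flops)]
    return [cortical_policy(share, mode) for share in shares]


toy_flops = [10, 15, 30, 45]  # cumulative shares: 0.10, 0.25, 0.55, 1.00
print(assign_precisions(toy_flops, "accuracy_preserving"))  # ['int8', 'bf16', 'bf16', 'fp32']
print(assign_precisions(toy_flops, "max_savings"))          # ['int8', 'int8', 'bf16', 'fp32']
```

The second layer is the only one that changes between modes here: raising the INT8 threshold from 0.15 to 0.30 pulls it into the INT8 zone.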

Calibration data: Static INT8 requires representative inputs to determine activation scale factors and zero points. For library models, a 1,000-image subset of the ImageNet validation set is used automatically. For user-uploaded models with unknown input distributions, users supply a calibration batch (minimum 100 samples) alongside the model file in the /api/scan request; if none is provided, the system falls back to dynamic quantization (weights-only INT8) for that model and notes this on the certificate.

This handles all standard architectures in our library plus any standard torchvision model without model-specific configuration. Architectures using custom fused operators or non-standard module types (e.g., some ViT implementations) may require manual layer annotation.

Frontend

A single-page static HTML/JS console walks the user through Scan → Compress → Download in three panels, with live carbon savings updating as compression completes. No npm, no build step — deployable on any web server, including the same machine running the FastAPI backend.

Challenges We Ran Into

1. Making quantization architecture-agnostic

PyTorch's quantization API is designed around specific module types. Arbitrary user-uploaded models may use custom layers, fused blocks, or non-standard activations. We built a recursive introspection walker that identifies quantizable submodules without type-matching. This covers all tested architectures in our library and all standard torchvision models. Architectures with custom fused operators fall back gracefully to BF16-only compression with a warning in the console.

2. Bounding the FLOPs-to-carbon proxy error

We don't have access to hardware power meters at inference time. Our carbon estimate is a FLOPs-based proxy. Cross-referencing FLOPs-normalized energy predictions against published inference energy benchmarks — specifically MLPerf Inference v3.0 results for ResNet-50 and published Raspberry Pi 4 inference power measurements — for comparable CNN workloads on our characterised hardware classes gives an estimated proxy error of 20–30% for standard CNN architectures. For architectures with large embedding lookups or attention layers, where memory bandwidth (not FLOPs) is the binding constraint, the proxy may underestimate energy by a larger margin; the certificate flags this condition explicitly. We frame all certificate outputs as before/after relative savings rather than absolute ground-truth figures, which substantially reduces the impact of proxy error: a systematic 25% offset in both before and after measurements cancels in the ratio.
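The cancellation argument can be verified directly. The sketch below uses the before/after ResNet-50 figures reported later in this write-up and applies a 25% multiplicative offset to both; the function name is ours. Only a multiplicative offset cancels exactly; an additive offset would not.

```python
def relative_saving(before_g: float, after_g: float) -> float:
    """Fractional carbon reduction between two estimates."""
    return (before_g - after_g) / before_g


before_g, after_g = 0.48464, 0.38167   # g CO2 per 1,000 inferences
offset = 1.25                          # proxy overestimates everything by 25%

estimated = relative_saving(before_g, after_g)
shifted = relative_saving(before_g * offset, after_g * offset)
print(f"{estimated:.1%} vs {shifted:.1%}")  # 21.2% vs 21.2%
```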

3. Calibrating the FLOP-share thresholds

The 0.15/0.60 and 0.30/0.70 thresholds for the two modes were determined by ablation across our full model library. Extending the INT8 zone from 0.15 to 0.25 FLOP share caused a 0.8 pp accuracy drop on MobileNetV3-Small — larger than the drop at 0.30 — because MobileNetV3-Small has less FLOP redundancy in its early depthwise separable layers than ResNet-50. The final thresholds represent the values that preserved accuracy across all tested architectures simultaneously. Architectures using depthwise separable convolutions (MobileNetV3-Small, ShuffleNetV2) show greater accuracy sensitivity to extended INT8 zones than ResNet-type networks, so the same thresholds produce wider accuracy/energy variation across the library than on ResNet-50 alone — a limitation the certificate methodology section notes explicitly.

4. Grid intensity data freshness and offline robustness

Live EMBER API calls added latency and a network dependency. We ship a baked-in EMBER 2023 annual dataset as the default, with live refresh as an explicit opt-in. This means QuantZero runs fully offline — critical for the schools and clinics we are targeting, many of which have unreliable internet.

5. Certificate credibility

An EcoInfer Certificate is only useful if recipients can understand and trust it. We went through three iterations of the PDF layout before landing on one that: (a) clearly separates model-based proxy estimates from direct hardware measurements, (b) displays 95% confidence intervals on all carbon figures, (c) defines the Prediction Consistency Score explicitly on the certificate face, and (d) includes a plain-language methodology disclaimer at the bottom of every page. Trust requires transparency about limitations.

Accomplishments That We're Proud Of

Strong accuracy/energy trade-off on ResNet-50: accuracy_preserving mode achieves 21.2% carbon reduction with only −0.14 pp top-1 accuracy change versus uncompressed FP32. max_savings achieves 48.4% carbon reduction with −0.20 pp accuracy change. Both results require no manual layer-by-layer tuning and outperform naïve per-tensor static uniform INT8 quantization — which degrades ResNet-50 top-1 accuracy by approximately 1.2 pp when applied without per-channel calibration — while delivering automatic architecture-agnostic compression with a signed carbon certificate.

$$ C_{1000}:\; 0.48464\,\text{g} \;\xrightarrow{\;\text{NeuroQuant}\;}\; 0.38167\,\text{g CO}_2 \quad (21.2\%\ \text{reduction}) $$

High Prediction Consistency Score: On the full 50,000-image ImageNet-1K validation set, the accuracy_preserving compressed model achieves a Prediction Consistency Score of approximately 0.999 — fewer than 100 of 50,000 inputs produce a different top-1 prediction from the FP32 baseline. This is consistent with the −0.14 pp accuracy delta: approximately 70 net disagreements at the ground-truth level, meaning the compression introduces no new failure modes beyond what the accuracy delta already implies.
Grid-aware carbon, not a global average: Most carbon tools use EMBER's published 2023 global-average grid intensity of 491 g CO₂/kWh. We use country-specific figures. A model deployed in Poland ($ \gamma = 773\,\text{g/kWh} $) versus France ($ \gamma = 85\,\text{g/kWh} $) has a 9.1× difference in carbon cost per identical inference. Our certificates make this visible and actionable.
Fleet API for multi-model deployments: The /api/fleet endpoint lets an organisation submit a list of models and receive a batch compression report and aggregate fleet-level EcoInfer Certificate — a feature designed for IT teams managing multi-model deployments.
Zero-install frontend: The entire management console is static HTML. Any device with a browser and network access to the FastAPI server can use QuantZero.

What We Learned

Inference, not training, is where AI's carbon accumulates at scale. The AI community focuses on training compute. But once a model is trained, it may run inference millions of times. For ResNet-50 — whose single training run costs approximately 11 kg CO₂ (derived from ~90 GPU-hours across 4×V100-class GPUs at ~22 hours wall-clock time, 300 W per GPU, US cloud grid at ~400 g CO₂/kWh) — the cumulative inference carbon footprint exceeds the training footprint after approximately 23 million inferences on a Raspberry Pi 4 at EU average grid intensity (0.48 g per 1,000 inferences). A school of 500 students running 300 AI-assisted tasks each per day — 150,000 daily inferences — crosses that threshold in under six months. At that point, every percentage point of inference carbon reduction is more impactful than any further optimisation of the training run.
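The break-even point follows from the figures in the paragraph above. The sketch below recomputes it; variable names are ours, and the training footprint is rounded to 11 kg as in the text (the unrounded product is 10.8 kg).

```python
# Training footprint from the stated assumptions.
GPU_HOURS = 90
GPU_POWER_KW = 0.300
TRAIN_GRID_G_PER_KWH = 400
training_kg = GPU_HOURS * GPU_POWER_KW * TRAIN_GRID_G_PER_KWH / 1000  # 10.8 kg

# Break-even against inference on a Raspberry Pi 4 at EU average grid intensity.
TRAINING_KG_ROUNDED = 11          # the text rounds 10.8 kg up to 11 kg
G_PER_1000_INF = 0.48
break_even_inferences = TRAINING_KG_ROUNDED * 1000 / G_PER_1000_INF * 1000

# School scenario: 500 students x 300 tasks per day.
DAILY_INFERENCES = 500 * 300
days_to_break_even = break_even_inferences / DAILY_INFERENCES

print(f"{training_kg:.1f} kg, {break_even_inferences / 1e6:.1f} M inferences, "
      f"{days_to_break_even:.0f} days")  # 10.8 kg, 22.9 M inferences, 153 days
```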

Mixed precision is not one-size-fits-all — and the accuracy/energy relationship is non-monotone. Naïve per-tensor static INT8 across all layers degrades accuracy sharply (approximately 1.2 pp on ResNet-50 without per-channel calibration). BF16 casting reduces memory bandwidth but provides negligible energy savings on hardware without native BF16 execution units — including the ARM Cortex-A72 on Raspberry Pi 4, where BF16 runs in FP32 emulation. The energy gains in QuantZero come from INT8 in the early layers, where per-channel calibration keeps accuracy loss minimal. But the relationship between compression depth and accuracy loss is not simply monotone: the max_savings mode (deeper INT8 zone) loses slightly less accuracy than accuracy_preserving on ResNet-50, because BF16 rounding errors accumulate differently across different depth ranges. Calibration matters as much as the policy itself.

Carbon proxy design is as important as accuracy. A certificate reporting carbon savings with false precision is worse than no certificate. We invested significant time calibrating the FLOPs-to-energy coefficients against published hardware power measurements, validating the 20–30% error bound on standard CNNs, and writing plain-language methodology disclosures. The before/after ratio formulation specifically reduces the impact of systematic proxy error. Trust requires honesty about uncertainty.

Grid intensity is the hidden multiplier no one talks about. Two organisations running the same model, one in France and one in India, face an 8.4× difference in per-inference carbon cost. Compressing a model to save 20% energy saves 20% carbon everywhere, but the absolute saving in India is 8.4× larger per kWh saved. Deployment geography should be a first-class input to any AI sustainability decision. QuantZero makes it one.

Scope discipline prevents overclaims. Early drafts of QuantZero included claims about transformer support, LLM compression, and Pareto optimality. We removed or qualified all of them: the cortical-order policy derives from convolutional depth hierarchies that do not map cleanly onto attention mechanisms; claiming Pareto optimality from three tested configurations is unsupportable; LLM inference carbon has a fundamentally different energy profile (memory-bandwidth-bound, not compute-bound). What remains is a set of claims we can defend precisely.

What's Next for QuantZero

Near-term (next 3 months):

Transformer profiling extension. The current FLOPs profiler undercounts energy for attention layers, which are memory-bandwidth-bound rather than compute-bound. We plan to add a separate attention energy model using the roofline model:

$$ E_{\text{attn}} = \frac{4 \cdot N \cdot d \cdot \text{bytes/element}}{B_{\text{mem}}} \cdot P_{\text{hw}} $$

where $N$ is sequence length, $d$ is embedding dimension, and $B_{\text{mem}}$ is memory bandwidth. This does not extend the cortical quantization policy to transformers — that requires new theoretical grounding we do not yet have — but it will make carbon estimates accurate for transformer-based models even if the compression pipeline does not yet apply to them.
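The planned attention term is a direct transcription of the formula above. The sketch below is illustrative only: the function name is ours, and the example bandwidth of 4 GB/s is a placeholder rather than a measured value for any device.

```python
def attention_energy_j(seq_len: int, embed_dim: int, bytes_per_element: int,
                       mem_bandwidth_bytes_s: float, power_w: float) -> float:
    """E_attn = 4 * N * d * bytes_per_element / B_mem * P_hw, in joules."""
    bytes_moved = 4 * seq_len * embed_dim * bytes_per_element
    seconds = bytes_moved / mem_bandwidth_bytes_s  # memory-bound time estimate
    return seconds * power_w


# Placeholder inputs: N=512, d=768, FP32 (4 bytes), 4 GB/s, 6.8 W.
energy = attention_energy_j(512, 768, 4, 4e9, 6.8)
print(f"{energy * 1000:.2f} mJ per attention layer")
```

In this model energy scales linearly with sequence length, so doubling N doubles the estimate.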

ONNX export. A post-compression ONNX export path would open the compressed model to TensorFlow Lite, CoreML, and OpenVINO runtimes without any changes to the compression pipeline.
Concrete EcoInfer standardisation step. We plan to submit the EcoInfer Certificate specification as a proposed field addition to the Hugging Face Model Card metadata schema (via a pull request to huggingface/huggingface_hub), which accepts community proposals through its standard RFC process. This is a concrete, actionable step toward interoperability — not an aspiration.

Medium-term (6–12 months):

Continuous carbon monitoring. A lightweight inference wrapper that logs per-request carbon in production and updates the certificate over the model's actual deployment lifetime.
CI/CD integration. A GitHub Action that runs QuantZero on every model push and blocks deployment if the carbon cost delta exceeds a configurable threshold — treating carbon budget as a first-class deployment gate alongside accuracy and latency.
Hardware-in-the-loop calibration. Replace the FLOPs-based proxy with measured power readings on supported devices during the scan phase (NVML on desktop GPUs; the on-board power monitor reported by tegrastats on Jetson Nano, which does not expose NVML), tightening the carbon estimate error bound from ~25% to under 5% on characterised hardware.

Long-term vision:

The core thesis does not change: every inference has a carbon cost, and that cost should be measured, compressed, and certified. QuantZero is the tool that makes that possible for PyTorch CNNs today. The path forward — transformer profiling, continuous monitoring, CI/CD gates, standard certificate metadata — extends that same principle to every model, every framework, and every deployment environment.

Built With

css3
ember
fastapi
fbgemm
html5
httpx
imagenet-1k
javascript
jetsonnano
mlperf
python
pytorch
qnnpack
reportlab
rpi4
torchvision