Inspiration
Scaling laws have reshaped AI. Models grow deeper, wider, and more expressive. But they're also heavier, slower, and increasingly incompatible with embedded hardware. A Raspberry Pi cannot casually host frontier-scale intelligence.
Edge deployment demands a different philosophy: task precision over generality, efficiency over excess, compression without compromise.
Chom, Nom! exists to close that gap, transforming oversized PyTorch models into lean, deployable systems without rewriting architectures or sacrificing performance.
What it does
Compress any machine learning model to a quarter of its size with a single click! Chom, Nom! gets any 50M-parameter model ready to deploy on a Raspberry Pi in <20s.
Under the hood, Chom, Nom! acts like a little AI researcher. It orchestrates a multi-agent pipeline that runs per-layer quantization ablations, letting an intelligent agent decide how aggressively to compress so that footprint reduction and performance stay in balance. The automated pipeline looks like this (minimal sketches of the Scanner and Strategist steps follow the list):
- Per-layer sensitivity analysis: A Scanner agent inspects every layer's weight distribution, fitting a Beta distribution to estimate quantization sensitivity. Layers with heavy-tailed or high-kurtosis distributions are flagged as fragile; sparse, well-behaved layers are marked as robust.
- LLM-guided mixed-precision strategy: A Strategist agent (backed by an LLM) reads the sensitivity table and proposes 3-5 quantization configurations -- including mixed-precision plans that assign FP32, FP16, or INT8 per layer according to each layer's sensitivity score. A Critic agent then reviews and refines these proposals before any quantization is applied.
- Automated quantization and evaluation: An Executor agent applies each approved configuration via post-training quantization (dynamic INT8, static INT8 with calibration data, FP16, or mixed precision), then measures accuracy, model size, and inference latency. Static INT8 uses representative calibration batches to estimate per-layer activation ranges and zero-points.
- Pareto-optimal selection: An Analyst agent computes the Pareto frontier across all experiment results -- balancing accuracy, size, and speed -- and recommends the best configuration. If coverage gaps exist, additional experiments are proposed iteratively.
- Guaranteed footprint reduction: The final model can be up to 4x smaller (INT8) or even 8x smaller (INT4), enabling cheaper and more private inference while maintaining stable accuracy through sensitivity-aware layer protection.
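To make the Scanner step concrete, here is a minimal sketch of how a per-layer sensitivity scan can look. The function name, the kurtosis threshold, and the fragility rule are illustrative assumptions rather than our exact implementation:

```python
# Sketch: fit a Beta distribution to each weight tensor and flag fragile layers.
import torch.nn as nn
from scipy.stats import beta, kurtosis


def scan_sensitivity(model: nn.Module, kurtosis_threshold: float = 3.0) -> dict:
    """Return per-layer statistics plus a fragile/robust flag."""
    report = {}
    for name, param in model.named_parameters():
        if "weight" not in name or param.ndim < 2:
            continue  # skip biases and norm scales
        w = param.detach().flatten().float().cpu()
        # Rescale weights into (0, 1) so a Beta distribution can be fitted.
        w01 = (w - w.min()) / (w.max() - w.min() + 1e-12)
        w01 = w01.clamp(1e-6, 1 - 1e-6).numpy()
        a, b, _, _ = beta.fit(w01, floc=0.0, fscale=1.0)
        k = float(kurtosis(w.numpy()))  # heavy tails => harder to quantize
        report[name] = {
            "beta_a": a,
            "beta_b": b,
            "kurtosis": k,
            "fragile": k > kurtosis_threshold,
        }
    return report
```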
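The Strategist step can be sketched as a thin prompt-and-parse loop. The prompt wording, the JSON contract, and the `call_llm` hook below are placeholders for illustration (in practice this is an agent backed by Claude), and the Critic's review pass is omitted:

```python
# Sketch: ask an LLM for candidate mixed-precision plans and keep the valid ones.
import json


def propose_configs(sensitivity_report: dict, call_llm) -> list:
    """`call_llm` is any callable that takes a prompt string and returns text."""
    prompt = (
        "You are a quantization strategist. Given this per-layer sensitivity "
        "table, propose 3 to 5 configurations as a JSON list of objects, each "
        "mapping every layer name to one of FP32, FP16, or INT8. Protect the "
        "fragile layers.\n" + json.dumps(sensitivity_report, indent=2)
    )
    plans = json.loads(call_llm(prompt))
    # Keep only well-formed plans that assign a precision to every scanned layer.
    return [plan for plan in plans if set(plan) == set(sensitivity_report)]
```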
We put a lot of energy into making the user experience ruthlessly minimal: Upload, Click to Compress, and Deploy.
How we built it
We built the system on top of PyTorch's native quantization APIs and a pluggable quantizer registry to unify the following (minimal sketches of the per-layer quantization pass and the Pareto selection follow the list):
- Per-layer weight distribution analysis (Beta distribution fitting)
- Sensitivity-aware precision assignment (FP32 / FP16 / INT8 / INT4 per layer)
- Post-training quantization with optional calibration
- Multi-agent strategy, critique, and Pareto-optimal selection
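As a rough idea of the per-layer quantization pass, here is a minimal sketch built on PyTorch's post-training quantization API. The plan format is an assumption, dynamic quantization only swaps supported module types such as nn.Linear, and real inference with mixed dtypes needs activation casts we skip here:

```python
# Sketch: apply one mixed-precision plan and measure the on-disk footprint.
import os
import tempfile

import torch
import torch.nn as nn


def apply_plan(model: nn.Module, plan: dict) -> nn.Module:
    """`plan` maps fully-qualified layer names to 'FP32', 'FP16', or 'INT8'."""
    int8_layers = {name for name, precision in plan.items() if precision == "INT8"}
    # Dynamic INT8: weights stored as int8 and dequantized on the fly at runtime.
    model = torch.quantization.quantize_dynamic(
        model, qconfig_spec=int8_layers, dtype=torch.qint8
    )
    for name, precision in plan.items():
        if precision == "FP16":
            model.get_submodule(name).half()  # halve storage for robust layers
    return model  # layers marked FP32 are left untouched


def model_size_mb(model: nn.Module) -> float:
    """Serialize the state dict to a temp file to measure size in MB."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "model.pt")
        torch.save(model.state_dict(), path)
        return os.path.getsize(path) / 1e6
```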
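And the Analyst's Pareto-optimal selection boils down to a dominance check over the experiment records; the field names below are assumptions about that record format:

```python
# Sketch: keep only configurations that no other configuration dominates.
def pareto_frontier(results: list) -> list:
    """A result dominates another if it is no worse on accuracy, size, and
    latency, and strictly better on at least one of them."""
    def dominates(a, b):
        no_worse = (
            a["accuracy"] >= b["accuracy"]
            and a["size_mb"] <= b["size_mb"]
            and a["latency_ms"] <= b["latency_ms"]
        )
        strictly_better = (
            a["accuracy"] > b["accuracy"]
            or a["size_mb"] < b["size_mb"]
            or a["latency_ms"] < b["latency_ms"]
        )
        return no_worse and strictly_better

    return [r for r in results if not any(dominates(other, r) for other in results)]
```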
Architecture-Agnostic Design
The compression pipeline was designed to remain architecture-agnostic: it does not assume a convolutional, transformer, or custom module structure. Instead, the Scanner iterates over all named modules and parameter tensors, fitting statistical distributions to each layer's weights to assess quantization sensitivity. This lets it generalize to bespoke research models or your very own custom creations.
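As a quick illustration of that agnosticism, the same parameter walk treats a CNN and a Transformer identically; the models below are arbitrary picks (torchvision required) rather than anything we ship:

```python
# Sketch: the scan logic makes no assumptions about model family.
import torch.nn as nn
from torchvision.models import resnet18

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
for model in (resnet18(weights=None), encoder):
    weight_layers = [
        name for name, p in model.named_parameters()
        if "weight" in name and p.ndim >= 2
    ]
    print(type(model).__name__, "->", len(weight_layers), "quantizable weight tensors")
```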
Challenges we ran into
We tried a number of supplementary approaches to model compression, including:
- Agentic self-distillation
- Structural pruning
- Low-rank factorization
All required modifying the model’s architecture or training loop — pruning channels, changing ranks, or rewriting optimization logic.
Automating these decisions meant letting agents redesign core structural components. Current AI systems can apply local edits, but they are unreliable at global architectural reasoning. The result was instability, silent accuracy degradation, and excessive debugging overhead.
We ultimately prioritized graph-preserving methods like quantization, which compress models without requiring architectural redesign.
In addition, we built custom WebGPU kernels for on-device quantization through a web interface. However, we ultimately found that platform too limiting.
Accomplishments that we're proud of
- Accuracy preservation at 1/4 the precision
- Balancing accuracy and footprint reduction with agentic loops that evaluate per-layer structure
- A stable agentic loop that acts as a mini AI researcher
- The cutie patootie on the GUI
What we learned
There's still a gap in agentic ability for sequential ML research (e.g. reacting to distillation attempts and refining experiments). However, this is rapidly progressing!
What's next for Chom, Nom!
- Online container for asynchronous execution on more capable hardware
- Completion of autonomous self-distillation procedure
Built With
- claude
- pyqt
- pytorch
- raspberry-pi