Unspoiled

Landing Page

Inspiration

Every year, the U.S. throws out about $380 billion of food, and five states — Massachusetts, Vermont, California, Connecticut, and Rhode Island — have already made it illegal to send commercial food waste to the landfill. But when we started digging, the data behind those bans was a mess: each state publishes its generator list in its own format, thresholds change almost every year, and no one (not haulers, not regulators, not composters) has a single shared view of who is actually covered. Unspoiled started from a simple observation: the regulatory infrastructure is already in place — the operational infrastructure isn't.

What it does

Unspoiled turns public regulator data into a live operating view of the regulated food-waste market. For every commercial generator in Massachusetts and Vermont, it:

predicts annual tonnage with a calibrated PyTorch model,
scores each generator as Above, Near, or Below its state's current ban threshold,
routes it to its nearest permitted composter, anaerobic digester, animal-feed operation, or transfer station, and
for Massachusetts, overlays MassDEP enforcement actions town-by-town so you can see where the regulator is already knocking.

The ban-status rule is simple: a generator is Above when its predicted tonnage meets or exceeds its state's current threshold, Near when it sits between half of that threshold and the threshold itself, and Below otherwise. No invented heuristics — just the state's own number applied to every row.
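The rule above is small enough to sketch in a few lines. This is an illustration rather than the project's actual code: the function name `classify_ban_status` is hypothetical, both arguments are assumed to be in the same unit (for example annual tons), and treating exactly half the threshold as Near is an assumption, since the text does not say which side that edge falls on.

```python
def classify_ban_status(predicted_tons: float, threshold_tons: float) -> str:
    """Return 'Above', 'Near', or 'Below' for one generator.

    Both arguments must use the same unit (e.g. tons per year).
    """
    if threshold_tons <= 0:
        raise ValueError("threshold_tons must be positive")
    # Above: predicted tonnage meets or exceeds the state's current threshold.
    if predicted_tons >= threshold_tons:
        return "Above"
    # Near: at least half the threshold but still under it.
    # Assumption: the lower edge (exactly half) is counted as Near.
    if predicted_tons >= 0.5 * threshold_tons:
        return "Near"
    return "Below"


# The threshold of 50.0 here is a made-up number, not a real state value.
statuses = [classify_ban_status(t, 50.0) for t in (60.0, 30.0, 10.0)]
```

Because the threshold is passed in rather than hard-coded, the same function works for any state's current row in bans_thresholds.csv.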

All of that is exposed through a dashboard (generators, bans, map, enforcement, evidence), a JSON API, and a marketing site where every number is cited back to the CSV column it came from.

How we built it

Data layer: 16 public CSVs from the Dryad organic-waste-ban dataset + MassDEP enforcement log — food generator rosters for MA and VT, a permitted-processor list, bans_thresholds.csv, disposal/composting effect sizes, city-history benchmarks (Boulder, Seattle), and U.S. Census population.
Harmonization: Python ingestion scripts unify MA's annual-tons schema with VT's weekly-tons schema, gazetteer-match enforcement text to towns, and join every generator to Census population.
Model: A PyTorch MLP with category and state embeddings plus four numeric features: population, latitude, longitude, and distance in kilometers to the nearest permitted processor. Because self-reported tonnage spans several orders of magnitude, we train against log(1 + tonnage) rather than raw tons, then invert the transform (exponentiate and subtract one) at inference time. We optimize with Smooth-L1 (Huber) loss, which behaves like squared error for small residuals and like absolute error for large ones, so a handful of extreme outliers don't dominate the gradient. Training uses Adam with a cosine learning-rate schedule that starts high and decays smoothly to a small floor over 300 epochs.
Ban engine: Rule-based join of every generator to its state's current row in bans_thresholds.csv — no invented heuristics.
Routing graph: For every generator we compute the great-circle distance to every permitted processor using the haversine formula with Earth's mean radius (~6,371 km), then store the closest processor's ID and the distance to it in miles directly on the generator record. That makes "who should pick this up" a constant-time lookup at request time.
App: Next.js with an in-memory cache that boots from JSON snapshots, a Tailwind-styled marketing site, and a dashboard with generators, bans, map, enforcement, and evidence views.
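The routing-graph step can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the project's ingestion code: the names `haversine_km` and `nearest_processor`, the processor IDs, and the coordinates are all hypothetical.

```python
import math
from typing import Dict, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, as quoted in the text
KM_PER_MILE = 1.609344


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def nearest_processor(
    gen_lat: float,
    gen_lon: float,
    processors: Dict[str, Tuple[float, float]],
) -> Tuple[str, float]:
    """Return (processor_id, miles) for the closest processor by straight-line distance."""
    best_id, best_km = min(
        ((pid, haversine_km(gen_lat, gen_lon, lat, lon))
         for pid, (lat, lon) in processors.items()),
        key=lambda pair: pair[1],
    )
    return best_id, best_km / KM_PER_MILE


# Hypothetical processors and generator location, for illustration only.
processors = {"P1": (42.36, -71.06), "P2": (44.26, -72.58)}
pid, miles = nearest_processor(42.52, -70.93, processors)
```

Running this once per generator at ingestion time and saving `pid` and `miles` on the record is what keeps the request-time lookup constant-time.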

Challenges we ran into

Schema drift between states. MA reports annual tons and VT reports weekly tons, and you can't just multiply VT's weekly figure by 52 because seasonal closures (schools, ski-town restaurants, summer-only ice-cream stands) make the real annual total lower. Category taxonomies disagree, and town names don't always match Census gazetteer spellings. Harmonizing them took more engineering than the model did.
Sparse and skewed tonnage labels. Self-reported generator tonnage spans two to three orders of magnitude between a café and a distribution center, with heavy missingness. A model trained on raw tons collapsed to predicting the mean; moving to log-space targets with Huber loss stabilized training.
Enforcement is text, not data. MassDEP's enforcement log is a free-text document. We had to parse it, gazetteer-match town names, and deduplicate actions before it could be joined to generators.
Provenance without clutter. Every number on the site needed to be traceable to a CSV column without turning the UI into footnotes. The Provenance + Cite components were a design problem as much as a data one.
Honest "unknowns." Some source CSVs (e.g., Dryad's composting_effect for MA) have no value — we chose to render a dash rather than fabricate one, and that discipline had to propagate through every component.
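To show why log-space targets with Huber loss helped with the skewed labels described above, here is a small plain-Python illustration of the two formulas. It is not the training code: the helper names are hypothetical, the tonnage figures are made up, and the `beta` switch point of 1.0 is an assumption (it matches the default of PyTorch's `SmoothL1Loss`).

```python
import math


def to_log_target(tons: float) -> float:
    """Training target: log(1 + tons), which compresses the size range."""
    return math.log1p(tons)


def from_log_prediction(pred: float) -> float:
    """Inverse transform applied at inference time: exp(pred) - 1."""
    return math.expm1(pred)


def smooth_l1(residual: float, beta: float = 1.0) -> float:
    """Smooth-L1 (Huber-style) loss for one residual.

    Quadratic below beta, linear above it, so large errors grow
    linearly rather than quadratically.
    """
    r = abs(residual)
    if r < beta:
        return 0.5 * r * r / beta
    return r - 0.5 * beta


# Made-up example sizes: a small cafe versus a distribution center.
cafe_tons, warehouse_tons = 2.0, 2000.0
raw_gap = warehouse_tons - cafe_tons
log_gap = to_log_target(warehouse_tons) - to_log_target(cafe_tons)
```

In raw tons the two example generators are 1,998 apart, while in log space the gap is under 7, so no single large generator dominates the loss.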

Accomplishments that we're proud of

~13,000 generators modeled end-to-end across MA and VT with per-row predicted tonnage, ban status, and nearest-processor routing.
A calibrated model, not a demo. We report validation error honestly: mean absolute error in tons on the hold-out set, plus an R² computed in log-space so it reflects fit across the full size range rather than being driven by the few large outliers. Both numbers are surfaced on the landing page straight from model_metrics.json.
Every number is cited. The dashboard and marketing site both bind values to their CSV source and field — a regulator or ESG team can defend any stat shown.
Concrete case study. Peabody, MA (the #1 town in MassDEP's enforcement log) becomes a three-click routable workflow: filter → map → route.
One in-memory cache, three surfaces. Dashboard, JSON API, and state briefings all serve from the same source of truth.
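The two validation numbers mentioned above can be computed as follows. This is a sketch under assumptions: the function names are hypothetical, and the log-space R² is assumed to use the same log(1 + tons) transform as the training target, which the writeup does not state explicitly.

```python
import math
from typing import Sequence


def mae_tons(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error in raw tons."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_log_space(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination computed on log(1 + tons).

    Assumes y_true is not constant, otherwise the denominator is zero.
    """
    log_true = [math.log1p(t) for t in y_true]
    log_pred = [math.log1p(p) for p in y_pred]
    mean_true = sum(log_true) / len(log_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(log_true, log_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in log_true)
    return 1.0 - ss_res / ss_tot


# Made-up hold-out values, for illustration only.
true_tons = [1.0, 10.0, 100.0, 1000.0]
pred_tons = [2.0, 8.0, 120.0, 700.0]
mae = mae_tons(true_tons, pred_tons)
r2 = r2_log_space(true_tons, pred_tons)
```

In this made-up example the 300-ton miss on the largest generator accounts for most of the MAE, while the log-space R² weighs it about the same as the one-ton miss on the smallest generator. Reporting both gives a view of absolute error and of fit across sizes.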

What we learned

Compliance-critical markets look boring from the outside, but the real product work is in the joins — unifying schemas across states is where the defensibility lives.
Regulatory tailwinds compound: every state has cut its threshold at least once, and each cut drags thousands of new generators into coverage.
Modeling tonnage in log space and embedding category + state is much more stable than trying to regress raw tons.
Provenance is a feature, not a chore. Showing the CSV column next to a number is what makes regulators and ESG teams actually trust the dashboard.
State-additive architecture pays off — decoupling ingestion from UI means "add CA" is a data job, not a redesign.
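For readers who want to see the shape of the log-space setup with category and state embeddings, here is a minimal PyTorch sketch. It is an assumption-laden illustration, not the project's model: the class name `TonnageMLP`, the layer sizes, embedding width, learning rates, and full-batch training loop are all hypothetical, and the four numeric features are assumed to be standardized beforehand.

```python
import torch
from torch import nn


class TonnageMLP(nn.Module):
    """MLP over category and state embeddings plus four numeric features."""

    def __init__(self, n_categories: int, n_states: int,
                 emb_dim: int = 8, hidden: int = 64):
        super().__init__()
        self.category_emb = nn.Embedding(n_categories, emb_dim)
        self.state_emb = nn.Embedding(n_states, emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * emb_dim + 4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, category, state, numeric):
        # numeric: (batch, 4) = population, latitude, longitude, km to processor
        x = torch.cat(
            [self.category_emb(category), self.state_emb(state), numeric], dim=1
        )
        return self.mlp(x).squeeze(1)  # prediction in log(1 + tons) space


def train(model, category, state, numeric, tons,
          epochs: int = 300, lr: float = 1e-3, lr_floor: float = 1e-5) -> float:
    """Full-batch training on log1p targets; returns the last loss value."""
    target = torch.log1p(tons)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # Cosine schedule: decays from lr down to lr_floor over `epochs` steps.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs, eta_min=lr_floor
    )
    loss_fn = nn.SmoothL1Loss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(category, state, numeric), target)
        loss.backward()
        optimizer.step()
        scheduler.step()
    return loss.item()


def predict_tons(model, category, state, numeric):
    """Invert the log1p transform; clamp so predictions are never negative."""
    model.eval()
    with torch.no_grad():
        return torch.expm1(model(category, state, numeric)).clamp(min=0)
```

The embeddings let the network learn a per-category and per-state offset in log space, which is one plausible reason this setup is steadier than regressing raw tons directly.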

What's next for Unspoiled

More states. CA, CT, and RI ban thresholds are already indexed; the next milestone is ingesting their generator rosters and processor lists into the same schema so a New England–wide view becomes a coast-to-coast view with no UI rewrite.
Signed-contract feedback loop. Close the success-fee channel end-to-end: when an Unspoiled-sourced lead converts into a signed hauler or composter contract, the outcome flows back into the model as a training signal, so predicted tonnage and ban status sharpen every quarter against real-world placements instead of only self-reported CSVs.
Regulator workspace. A MassDEP-style view for state agencies — capacity-versus-demand heatmaps, clusters of likely non-compliant generators, and exportable audit packets — so regulators can target enforcement the same way haulers target sales.
Richer routing. Replace straight-line distance between a generator and a processor with true road-network driving distance, and add a capacity constraint on each processor so we never route more tonnage to a facility than it can physically accept in a week. The recommendation becomes "nearest processor that actually has room this week" rather than "nearest dot on a map" — the difference between a demo and a dispatch tool.
Benchmarks as a product. Boulder and Seattle city-history data already live on disk — package them as a public "what tightening does" benchmark, showing how diversion tonnage, participation rates, and enforcement activity moved in cities that lowered their threshold, so states considering their next cut have a reference curve instead of a guess.
Carbon and cost accounting. Pair every routed ton with its avoided-landfill emissions and avoided tipping fee, so each generator row carries not just a ban status but a dollar and a CO₂e figure — turning Unspoiled from a compliance map into an ESG-grade ledger.