Canary

Inspiration

H5N1 spread to dairy cattle herds across more than 30 US states over the past two years. It showed up in raw milk, in farm workers, and eventually in poultry flocks nationwide. The data tracking all of this was public the whole time. USDA APHIS publishes confirmed detections as a CSV. WAHIS logs outbreak reports globally. But there was no tool that put it together, scored the risk by state, and showed you where the disease was likely to move next.

I kept thinking about the phrase "canary in a coal mine." Miners used canaries because the bird reacted to danger before any human could detect it. That is the exact gap here. By the time an outbreak makes the news, it has already spread. The question is whether you can build something that catches the signal earlier.

What it does

Canary is a veterinary outbreak risk intelligence platform built on live public data from USDA APHIS and WAHIS, the animal health information system run by WOAH (formerly OIE).

It scores every US state on a 0 to 100 risk scale using five factors:

  • How recently outbreaks have occurred in the state
  • How severe the diseases are on a 1 to 5 scale
  • How frequently outbreaks happen in the state
  • How dense the livestock population is, using USDA NASS 2022 Census data
  • How much pressure is coming from neighboring states

That last factor, neighbor pressure, is what separates this from a simple case count. A state with no active outbreaks but surrounded by high-risk neighbors is still at risk. The model catches that.

The Spread Model tab is the flagship feature. Select any disease and any state, and the tool estimates the probability the disease jumps to each bordering state. It accounts for geographic proximity, livestock density, and current case load in each neighbor.

The tool ships as both an interactive Streamlit dashboard and a FastAPI REST service with endpoints for state rankings, disease threats, outbreak history, and spread probabilities.

How we built it

The pipeline runs in two blocks.

Block 1 fetches live data from two sources. The USDA APHIS H5N1 confirmation CSV is pulled directly from the APHIS website. WAHIS outbreak reports are pulled from the WAHIS REST API. Both are cleaned, species are classified into seven categories using keyword matching, and disease severity is assigned on a 1 to 5 scale. If either source is unavailable, the tool falls back to synthetic seed data so the dashboard stays functional.
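
The keyword-matching step can be sketched as follows. This is a minimal illustration, not Canary's actual classifier: the seven category names, the keyword lists, and the `classify_species` helper are all assumptions made for the example.

```python
import re

# Hypothetical keyword table. Category names and keywords are illustrative
# assumptions. Order matters: the first category with a matching keyword wins,
# so "wild_bird" is checked before "poultry".
SPECIES_KEYWORDS = {
    "wild_bird": ("wild", "goose", "hawk", "eagle", "vulture"),
    "poultry": ("chicken", "turkey", "duck", "poultry", "layer", "broiler"),
    "cattle": ("cattle", "dairy", "cow", "bovine"),
    "swine": ("swine", "hog"),
    "small_ruminant": ("sheep", "goat"),
    "equine": ("horse", "equine"),
}
FALLBACK_CATEGORY = "other"  # seventh bucket for anything unmatched


def classify_species(description: str) -> str:
    """Return the first category whose keyword appears as a whole word."""
    # Tokenize into lowercase words and strip trailing "s" so simple plurals
    # ("ducks", "horses") match. Whole-word matching avoids substring
    # false positives.
    tokens = {word.rstrip("s") for word in re.findall(r"[a-z]+", description.lower())}
    for category, keywords in SPECIES_KEYWORDS.items():
        if any(keyword in tokens for keyword in keywords):
            return category
    return FALLBACK_CATEGORY


if __name__ == "__main__":
    for label in ("Commercial Table Egg Layer", "Dairy Milking Cattle", "Wild ducks"):
        print(label, "->", classify_species(label))
```

Whole-word matching is a deliberate choice in this sketch. Plain substring checks are shorter but misfire on labels that merely contain a keyword.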

Block 2 computes the risk scores. The state-level risk formula is:

$$\text{state\_risk} = 0.30 \times \text{recency} + 0.25 \times \text{severity} + 0.20 \times \text{frequency} + 0.15 \times \text{livestock\_density} + 0.10 \times \text{neighbor\_pressure}$$
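
As a sketch, the formula above translates to a few lines of Python. The weights come from the formula. Two things are assumptions here: each factor is already normalized to the range 0 to 1, and the weighted sum is multiplied by 100 to land on the 0 to 100 scale. The function name `state_risk` is illustrative.

```python
STATE_WEIGHTS = {
    "recency": 0.30,
    "severity": 0.25,
    "frequency": 0.20,
    "livestock_density": 0.15,
    "neighbor_pressure": 0.10,
}


def state_risk(factors: dict[str, float]) -> float:
    """Weighted sum of the five factors, scaled to 0-100.

    Assumes every factor is pre-normalized to [0, 1]. Out-of-range values
    are clamped rather than rejected.
    """
    missing = STATE_WEIGHTS.keys() - factors.keys()
    if missing:
        raise KeyError(f"missing factors: {sorted(missing)}")
    total = sum(
        weight * min(max(factors[name], 0.0), 1.0)
        for name, weight in STATE_WEIGHTS.items()
    )
    return round(100 * total, 1)


if __name__ == "__main__":
    example = {
        "recency": 0.9,
        "severity": 0.8,
        "frequency": 0.5,
        "livestock_density": 0.6,
        "neighbor_pressure": 0.4,
    }
    print(state_risk(example))  # 70.0
```

With these example inputs the terms are 0.27 + 0.20 + 0.10 + 0.09 + 0.04 = 0.70, which gives a score of 70.0.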

Recency uses a 2-year rolling window and decays with distance from today. Neighbor pressure is calculated from a hand-built US state adjacency graph covering all contiguous states.
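
One way to picture these two factors is the sketch below. It assumes a linear decay across the 2-year window and takes neighbor pressure as the mean of the neighbors' scores. The text above does not specify either choice, so both are placeholders. The adjacency excerpt lists Iowa's real borders but is only a fragment of a full graph.

```python
from datetime import date

WINDOW_DAYS = 730  # the 2-year rolling window


def recency_score(last_outbreak: date, today: date) -> float:
    """Linear decay (an assumption): 1.0 today, 0.0 at or past the window edge."""
    age_days = (today - last_outbreak).days
    if age_days < 0:
        raise ValueError("last_outbreak is in the future")
    return max(0.0, 1.0 - age_days / WINDOW_DAYS)


# Fragment of a state adjacency graph, keyed by postal code.
ADJACENCY = {
    "IA": ["MN", "WI", "IL", "MO", "NE", "SD"],
}


def neighbor_pressure(state: str, local_scores: dict[str, float]) -> float:
    """Mean of the neighbors' local scores (an assumed aggregation).

    Neighbors with no recorded score count as 0.0. States missing from the
    graph get zero pressure.
    """
    neighbors = ADJACENCY.get(state, [])
    if not neighbors:
        return 0.0
    return sum(local_scores.get(n, 0.0) for n in neighbors) / len(neighbors)


if __name__ == "__main__":
    print(recency_score(date(2023, 1, 1), date(2024, 1, 1)))  # 0.5
    print(neighbor_pressure("IA", {"MN": 0.9, "NE": 0.3}))
```

An exponential decay or a maximum over neighbors would slot into the same function signatures if a different shape fits the data better.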

The disease-level threat score uses a separate formula:

$$\text{disease\_risk} = 0.35 \times \text{severity} + 0.30 \times \text{recency} + 0.20 \times \text{reach} + 0.15 \times \text{frequency}$$
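
A sketch of the disease-level score, under several assumptions that the formula does not settle: severity on the 1 to 5 scale is mapped linearly onto 0 to 1, reach is taken as the fraction of tracked states reporting the disease, and the result is scaled to 0 to 100 to mirror the state score. All function names are illustrative.

```python
DISEASE_WEIGHTS = {"severity": 0.35, "recency": 0.30, "reach": 0.20, "frequency": 0.15}


def reach(states_affected: set[str], states_tracked: int = 50) -> float:
    """Fraction of tracked states reporting the disease (assumed definition)."""
    if states_tracked <= 0:
        raise ValueError("states_tracked must be positive")
    return min(len(states_affected) / states_tracked, 1.0)


def disease_risk(
    severity_1_to_5: int,
    recency: float,
    states_affected: set[str],
    frequency: float,
) -> float:
    """Weighted disease threat score on an assumed 0-100 scale."""
    if not 1 <= severity_1_to_5 <= 5:
        raise ValueError("severity must be between 1 and 5")
    factors = {
        "severity": (severity_1_to_5 - 1) / 4,  # map 1..5 onto 0..1
        "recency": recency,
        "reach": reach(states_affected),
        "frequency": frequency,
    }
    return round(100 * sum(DISEASE_WEIGHTS[k] * factors[k] for k in DISEASE_WEIGHTS), 1)


if __name__ == "__main__":
    affected = {"IA", "MN", "WI", "SD", "NE", "CA", "TX", "CO", "ID", "MI"}
    print(disease_risk(5, recency=0.8, states_affected=affected, frequency=0.5))
```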

The Spread Model estimates jump probability between neighboring states as:

$$\text{spread\_prob} = 0.50 \times \text{proximity} + 0.30 \times \text{density} + 0.20 \times \text{case\_load}$$
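
The spread formula can be applied to every bordering state and the results ranked, roughly as follows. The inputs are assumed to be normalized to 0 to 1, and the per-neighbor numbers in the example are made up for illustration.

```python
def spread_prob(proximity: float, density: float, case_load: float) -> float:
    """Weighted jump score for one neighbor. Inputs are clamped to [0, 1]."""
    def clamp(x: float) -> float:
        return min(max(x, 0.0), 1.0)

    return 0.50 * clamp(proximity) + 0.30 * clamp(density) + 0.20 * clamp(case_load)


def rank_neighbors(neighbors: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Score each bordering state and sort from most to least likely."""
    scored = [(state, round(spread_prob(**f), 3)) for state, f in neighbors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical factor values for two of Iowa's neighbors.
    iowa_neighbors = {
        "MO": {"proximity": 1.0, "density": 0.4, "case_load": 0.1},
        "MN": {"proximity": 1.0, "density": 0.8, "case_load": 0.5},
    }
    for state, score in rank_neighbors(iowa_neighbors):
        print(state, score)
```

Because the weights sum to 1 and each input is clamped, the result always stays between 0 and 1.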

The stack is Python, Pandas, Streamlit, Plotly for the choropleth maps and charts, and FastAPI for the REST layer.

Challenges we ran into

The biggest challenge was data reliability. The WAHIS API has inconsistent response formats and goes down regularly. The USDA APHIS CSV changes column names without notice. I built a fallback system using synthetic seed data that mirrors the real data distribution, so the dashboards and API stay usable even when both upstream sources fail. Getting the fallback to feel realistic rather than clearly fake took several rounds of testing.
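
A minimal sketch of that fallback pattern, using only the standard library. The column aliases, the `load_detections` helper, and the idea of returning a `"live"` or `"seed"` label alongside the rows are all assumptions for illustration. The fetcher is passed in as a callable so the sketch makes no claims about the real APHIS URL or response shape.

```python
import csv
import io
from typing import Callable

# Hypothetical aliases for columns that get renamed upstream.
COLUMN_ALIASES = {
    "state": {"state", "state name", "state_name"},
    "confirmed": {"confirmed", "date confirmed", "confirmation date"},
}


def normalize_row(row: dict) -> dict:
    """Map whatever header names arrived onto canonical column names."""
    cleaned = {k.strip().lower(): (v or "").strip() for k, v in row.items() if k}
    out = {}
    for canonical, aliases in COLUMN_ALIASES.items():
        match = next((cleaned[a] for a in aliases if a in cleaned), None)
        if match is None:
            raise KeyError(f"no column found for {canonical!r}")
        out[canonical] = match
    return out


def load_detections(
    fetch_csv: Callable[[], str], seed_rows: list[dict]
) -> tuple[list[dict], str]:
    """Return (rows, source), where source is "live" or "seed"."""
    try:
        reader = csv.DictReader(io.StringIO(fetch_csv()))
        rows = [normalize_row(r) for r in reader]
        if not rows:
            raise ValueError("empty upstream response")
        return rows, "live"
    except Exception:
        # Deliberately broad: network errors, renamed columns, and empty
        # payloads all degrade to the seed data instead of crashing.
        return list(seed_rows), "seed"


if __name__ == "__main__":
    seed = [{"state": "Iowa", "confirmed": "2024-01-01"}]
    print(load_detections(lambda: "State Name,Date Confirmed\nOhio,2024-06-01\n", seed))
```

Returning the source label lets the caller show when it is serving seed data rather than live detections.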

Building the state adjacency graph by hand was tedious but necessary. No clean dataset existed for this, so I mapped out the neighbors for all contiguous states manually and validated them against known outbreak corridors like the Mississippi Flyway for avian influenza.

Calibrating the neighbor pressure weight was the hardest modeling decision. Too high and states with zero local outbreaks start appearing Critical just because they border Iowa or Minnesota. Too low and the spread model loses its purpose. I tested different weights against the 2022 to 2024 H5N1 spread pattern to find a balance that felt honest.
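
The trade-off can be made concrete with a small sweep: vary the neighbor weight, rescale the other four weights proportionally so the total stays at 1, and watch what happens to a state with no local outbreaks. This sketch only shows the sensitivity. It does not reproduce the comparison against historical H5N1 spread. The proportional rescaling and the example state are assumptions.

```python
BASE_WEIGHTS = {
    "recency": 0.30,
    "severity": 0.25,
    "frequency": 0.20,
    "livestock_density": 0.15,
    "neighbor_pressure": 0.10,
}


def reweight(neighbor_weight: float) -> dict[str, float]:
    """Set the neighbor weight and rescale the rest so weights sum to 1."""
    if not 0.0 <= neighbor_weight < 1.0:
        raise ValueError("neighbor_weight must be in [0, 1)")
    others = {k: v for k, v in BASE_WEIGHTS.items() if k != "neighbor_pressure"}
    scale = (1.0 - neighbor_weight) / sum(others.values())
    weights = {k: v * scale for k, v in others.items()}
    weights["neighbor_pressure"] = neighbor_weight
    return weights


def score(factors: dict[str, float], weights: dict[str, float]) -> float:
    return round(100 * sum(weights[k] * factors[k] for k in weights), 1)


def sweep(factors: dict[str, float], candidates=(0.05, 0.10, 0.20, 0.40)) -> dict[float, float]:
    """Score the same state under each candidate neighbor weight."""
    return {w: score(factors, reweight(w)) for w in candidates}


if __name__ == "__main__":
    # Hypothetical state: no local outbreaks, dense livestock, hot neighbors.
    quiet_state = {
        "recency": 0.0,
        "severity": 0.0,
        "frequency": 0.0,
        "livestock_density": 0.8,
        "neighbor_pressure": 1.0,
    }
    for weight, value in sweep(quiet_state).items():
        print(f"neighbor weight {weight:.2f} -> score {value}")
```

For this example state the score climbs steadily as the neighbor weight rises, which is the effect described above: past some point a state with zero local outbreaks is ranked mostly by who it borders.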

Accomplishments that we're proud of

The Spread Model is the piece I am most proud of. Modeling disease jump probability between states using only public data, livestock census figures, and a hand-built adjacency graph, and getting results that match the actual spread patterns of H5N1, was a real proof of concept.

The fallback data layer is something I did not expect to care about, but it ended up being important. Government APIs are unreliable. Building a system that degrades gracefully rather than crashing made the whole project feel production-ready rather than demo-only.

What we learned

The most important thing I learned is that the hardest part of public health data work is not modeling but data plumbing. Getting clean, consistent data out of two different government sources with different formats, different update cadences, and no reliability guarantees took most of the build time.

I also learned that geographic modeling is harder than it looks. Neighbor pressure sounds simple until you realize that a disease spreading through the Mississippi Flyway does not care about state borders the way a static adjacency graph does. The model is a useful approximation, but building a more accurate version would need flyway data, livestock movement records, and wind pattern data. That is the next version.

What's next for Canary

The most immediate next step is a push alert system. Right now you have to visit the dashboard to see what changed. The next version should send a notification when a state crosses into a new risk tier or when a new disease shows up in a state that had been clean.
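
The detection half of such an alert system could look like the sketch below, which compares two score snapshots. The tier cut-offs are hypothetical, since only a "Critical" tier is named elsewhere in this write-up, and actual delivery of the notification is left out.

```python
# Hypothetical tier floors, checked from highest to lowest.
TIERS = [(75, "Critical"), (50, "High"), (25, "Moderate"), (0, "Low")]


def tier(score: float) -> str:
    for floor, name in TIERS:
        if score >= floor:
            return name
    return "Low"


def tier_alerts(previous: dict[str, float], current: dict[str, float]) -> list[str]:
    """One message per state whose tier changed between snapshots."""
    alerts = []
    for state in sorted(current):
        old, new = tier(previous.get(state, 0.0)), tier(current[state])
        if old != new:
            alerts.append(f"{state}: {old} -> {new}")
    return alerts


def new_disease_alerts(
    previous: dict[str, set[str]], current: dict[str, set[str]]
) -> list[str]:
    """One message per disease that appears in a state for the first time."""
    alerts = []
    for state in sorted(current):
        for disease in sorted(current[state] - previous.get(state, set())):
            alerts.append(f"{state}: new disease {disease}")
    return alerts


if __name__ == "__main__":
    print(tier_alerts({"IA": 70, "MN": 40}, {"IA": 80, "MN": 45}))
    print(new_disease_alerts({"WI": set()}, {"WI": {"H5N1"}}))
```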

On the data side, I want to bring in livestock movement data from USDA NASS, NAHMS herd health surveys, and CDC One Health reports that connect animal outbreaks to human cases. The H5N1 connection to dairy and to human farm worker infections is the clearest example of why that link matters.

Longer term, Canary could serve as the data layer for biosecurity planning tools used by state veterinarians, poultry and cattle producers, and insurance underwriters who need forward-looking risk signals rather than last week's case counts.
