Inspiration
The California coast has been quietly changing for decades. Scientists know it — but there's no accessible way for anyone to actually see it.
The Scripps Institution of Oceanography / CalCOFI program has been sending ships along the California Current since 1949, dropping sample bottles into the ocean at fixed stations to measure temperature, oxygen, nutrients, and more. That's over 70 years of ocean chemistry — one of the longest continuous marine records in the world. Yet it lives largely as raw CSVs, inaccessible to non-specialists.
We wanted to turn that into something anyone could look at and immediately understand.
What it does
ChumBucket is an interactive ocean health dashboard for the California coast. For each CalCOFI sampling station, we compute a composite Ocean Health Score (0–100) by comparing measurements from 1991–2021 against a pre-disturbance historical baseline (pre-1990).
Scoring uses a z-score anomaly model across four variables — dissolved oxygen, temperature, chlorophyll-a, and the Redfield N:P nutrient ratio:
$$ z_i = \frac{x_i - \mu_{\text{baseline}}}{\sigma_{\text{baseline}}} $$
$$ \text{HealthScore} = 100 - \text{clip}!\left(10 \cdot \left(0.35\,z_{O_2}^{-} + 0.20\,z_{T}^{+} + 0.25\,z_{\text{ChlorA}}^{\pm} + 0.20\,z_{N:P}^{\pm}\right),\ 0,\ 100\right) $$
The result is an interactive map where each CalCOFI station is a colored dot — 🟢 green for healthy, 🔴 red for degraded — with a time slider covering 1991 to 2021 that lets you watch the coast evolve year by year. Click any station to see the full time-series history of each component variable.
How we built it
Data pipeline: CalCOFI CSVs (the bottle/ and cast/ prefixes) live in an AWS S3
data lake. AWS Glue external tables point at those prefixes, making the raw data directly
queryable as SQL tables (bottle_table).
Query layer: An AWS Lambda function receives POST /query requests with structured
JSON — column selection, depth filters, date ranges — and builds safe parameterized SQL
server-side against the Glue/Athena tables. No raw SQL ever comes from the browser. AWS
Athena executes the SQL over data in S3 and returns JSON rows. API Gateway exposes
the Lambda as a public HTTPS endpoint.
Scoring: The Lambda computes per-station z-score anomalies against the pre-1990 baseline
on the fly, across four variables: dissolved oxygen (o2ml_l), temperature (t_degc),
chlorophyll-a (chlqua), and the Redfield N:P ratio (no3um/po4um).
Frontend: Built with Streamlit and Folium — a color-coded interactive map, a year
slider (1991–2021), a station detail panel with per-variable time-series charts, and a
"Most Degraded Stations" leaderboard. The frontend calls POST /query via API Gateway;
a Streamlit or BI dashboard sits cleanly on top of the same API. Deployed to Streamlit
Community Cloud connected directly to our GitHub repo.
Challenges we ran into
pH data was essentially unusable. The entire dataset had only 84 non-null pH rows, mostly from 2014–2015. We had to drop it entirely and redesign our variable set mid-build — a crash course in the messiness of real-world scientific data.
Depth filter vs. station coverage tradeoff. Strict surface-only filtering left only ~60
usable stations — too sparse for a meaningful map. We settled on a mixed-layer average using
the depthm filter in our query layer, which is oceanographically valid and roughly doubled
our station count.
Nutrient baseline sparsity. NO₃ and PO₄ weren't measured consistently until the 1990s, so we computed z-scores using only the years where data existed per variable rather than requiring a full baseline across all variables simultaneously.
Safe query construction. We needed the frontend to be flexible — selectable columns,
date ranges, depth filters — without ever exposing raw SQL to the browser. Building the
server-side SQL allowlist in Lambda (matching athenaAllowlist.ts) took careful design to
be both secure and expressive enough for our use case.
Accomplishments that we're proud of
We turned decades of raw ocean chemistry into something a non-scientist can read in under 10 seconds. Drag the time slider from 1991 to 2021 and watch the California coast slowly change — that visual tells the entire climate story without a word of explanation. We're proud that the science is rigorous (z-score anomalies against a pre-disturbance baseline, Redfield ratio as an ecosystem stress signal) while the output is completely accessible.
We're also proud of the cloud architecture: a fully serverless pipeline where Athena queries
S3 directly via Glue external tables, Lambda handles all SQL construction securely, and the
same POST /query API could power any frontend — Streamlit, a BI dashboard, or a mobile app.
What we learned
- Real scientific datasets are intentionally hard. The CalCOFI schema required careful documentation reading before we could trust a single number.
- AWS Athena + Glue is a powerful pattern for querying large CSVs without ever loading them into a database — the data stays in S3 and SQL comes to it.
- Server-side SQL construction with an allowlist is the right security boundary: flexible for the client, safe from injection, and auditable.
- The Redfield ratio (\(N:P \approx 16:1\)) is a surprisingly clean ecosystem stress signal — deviations flag nutrient imbalance in ways individual nutrient levels miss.
- Data visualization is the difference between a result and a story.
What's next for ChumBucket
- Extend the time range as newer CalCOFI cruises are published beyond 2021
- Add predictive modeling — train a time-series model on the historical scores to forecast when stations will cross critical health thresholds
- Species correlation layer — overlay commercial fishery data to show how health score decline correlates with fish stock changes over the same period
- Expand the API — the
POST /queryendpoint is generic enough to power dashboards beyond ocean health; any CalCOFI variable is already queryable
Built With
- amazon-web-services
- athena
- json
- lambda
- python
- s3
- sql
- streamlit
Log in or sign up for Devpost to join the conversation.