About LakeScope

💡 Inspiration

The inspiration for LakeScope came from a stark realization: we are trying to save 21st-century ecosystems with 20th-century tools.

Dissolved oxygen (DO) is the heartbeat of a lake. When it drops below $4\,\text{mg/L}$, fish begin to die. Below $2\,\text{mg/L}$, the water becomes a "dead zone." Currently, environmental agencies rely on manual sampling: sending a person out on a boat to dip a probe into the water. This process is:

  1. Expensive: costing ~£50-100 per sample.
  2. Slow: taking 3-5 days for lab validation.
  3. Reactive: telling us what happened last week, not what will happen next week.

I wanted to change this paradigm. I asked: Can we use the laws of physics and biology—thermal inertia and photosynthesis—to predict these dead zones before they happen?

🏗️ How I Built It

LakeScope is a full-stack Bayesian forecasting system designed for reliability.

1. The Data Engine

We integrated over 10 years of historical data (2014-2023) from three disparate sources:

  • UK Environment Agency: Biweekly water chemistry (Oxygen, pH, Temperature).
  • Met Office MIDAS: Hourly weather data (Air Temp, Wind Speed, Pressure).
  • NRFA: Daily Hydrological data (River Flow, Rainfall).
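The core of this step is reconciling three cadences (biweekly, hourly, daily) onto one timeline. A minimal pandas sketch of the idea; the column names are illustrative stand-ins, not the real EA / MIDAS / NRFA schemas:

```python
import pandas as pd

# Hypothetical frames standing in for the three sources.
chem = pd.DataFrame(
    {"oxygen_mg_l": [9.1, 7.4]},
    index=pd.to_datetime(["2020-06-01", "2020-06-15"]),  # ~biweekly samples
)
weather = pd.DataFrame(
    {"air_temp_c": range(24)},
    index=pd.date_range("2020-06-01", periods=24, freq="h"),  # hourly
)
flow = pd.DataFrame(
    {"river_flow_m3s": [3.2, 3.0, 2.8]},
    index=pd.date_range("2020-06-01", periods=3, freq="D"),  # daily
)

# Downsample everything to a common daily grid, then left-join onto the
# sparse chemistry record so no sampling dates are lost.
daily_weather = weather.resample("D").mean()
merged = chem.join([daily_weather, flow], how="left")
```

Left-joining onto the chemistry frame keeps the sparse oxygen measurements as the anchor: dates with no lab sample simply never enter the training set.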

2. Leakage-Proof Feature Engineering

The biggest trap in time-series ML is data leakage. It's easy to get a high $R^2$ by using today's temperature to predict today's oxygen... but you don't have today's temperature 14 days in advance.

We built a strict temporal feature pipeline that explicitly removed 148+ features prone to leakage. We restricted the model to three "safe" tiers of information:

  1. Pure Temporal: Deterministic astronomical cycles like photoperiod.
  2. Autoregressive Lags: $y_{t-14}$ (the measurement from the last boat trip).
  3. Historical Weather: Weather aggregates from the $t-28$ to $t-14$ day window.
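The three tiers above reduce to plain pandas shifts. A sketch under assumed column names (not the project's actual pipeline):

```python
import pandas as pd

# Illustrative daily frame; "oxygen" and "air_temp" are assumed names.
idx = pd.date_range("2020-01-01", periods=60, freq="D")
df = pd.DataFrame({"oxygen": range(60), "air_temp": range(60)}, index=idx)

features = pd.DataFrame(index=df.index)
# Tier 1 - pure temporal: deterministic, always known in advance.
features["day_of_year"] = df.index.dayofyear
# Tier 2 - autoregressive lag: the measurement from 14 days before the target.
features["oxygen_lag14"] = df["oxygen"].shift(14)
# Tier 3 - historical weather: a 15-day mean whose window, after the
# shift, covers t-28..t-14. Nothing later than t-14 can leak in.
features["air_temp_28_14"] = df["air_temp"].rolling(15).mean().shift(14)
```

The `.shift(14)` on every non-deterministic column is the whole trick: it guarantees the feature value at time $t$ was observable 14 days earlier, when the forecast would actually be issued.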

3. Bayesian Modeling with PyMC

We chose a Bayesian approach over traditional "black box" ML because environmental decisions require risk management. We implemented a Hierarchical Seasonal Model:

$$ \begin{aligned} \text{Oxygen}_t &\sim \mathcal{N}(\mu_t, \sigma) \\ \mu_t &= \alpha_{\text{season}} + \beta X_t \end{aligned} $$

This gives us a full posterior distribution, not just a single guess.
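To illustrate what a full posterior buys you, here is a NumPy sketch. The draws below are synthetic stand-ins for a fitted trace (in practice they would come from the PyMC sampler), used to turn the posterior into a breach probability against the 4 mg/L threshold:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for posterior draws of mu_t and sigma; the numbers
# here are made up for illustration, not fitted values.
mu_draws = rng.normal(5.0, 0.6, size=4000)            # mean DO, mg/L
sigma_draws = np.abs(rng.normal(0.8, 0.1, size=4000))  # observation noise

# Posterior predictive draws of oxygen itself.
oxygen_draws = rng.normal(mu_draws, sigma_draws)

# Decision-relevant summaries, not a single point estimate.
p_below_4 = (oxygen_draws < 4.0).mean()       # P(fish-stress threshold breached)
lo, hi = np.percentile(oxygen_draws, [5, 95])  # 90% credible interval
print(f"P(DO < 4 mg/L) = {p_below_4:.2f}, 90% interval = [{lo:.1f}, {hi:.1f}]")
```

A regulator can act on "there is a 15% chance of breaching the fish-stress threshold" in a way that a bare point forecast never supports.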

🧠 What I Learned

1. Physics > Algorithms

The most powerful feature wasn't a complex neural architecture; it was photoperiod (day length). Simple physics (longer days mean more sunlight, more sunlight means more photosynthesis, and more photosynthesis means more oxygen) outperformed almost every other indicator.
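Photoperiod is also fully deterministic, which is what makes it leakage-proof. A standard solar-declination approximation (an illustrative sketch, not the project's exact code) computes it from latitude and day of year:

```python
import math

def photoperiod_hours(day_of_year: int, latitude_deg: float) -> float:
    """Approximate day length via a simple solar-declination model."""
    # Solar declination in radians (cosine approximation).
    decl = math.radians(-23.44) * math.cos(2 * math.pi * (day_of_year + 10) / 365)
    lat = math.radians(latitude_deg)
    # Hour angle at sunrise/sunset; clamp handles polar day/night.
    cos_h = max(-1.0, min(1.0, -math.tan(lat) * math.tan(decl)))
    return 24 / math.pi * math.acos(cos_h)

# Windermere sits at roughly 54.4 degrees N.
print(photoperiod_hours(172, 54.4))  # long midsummer day
print(photoperiod_hours(355, 54.4))  # short midwinter day
```

Because it depends only on the calendar and a fixed latitude, this feature is known arbitrarily far in advance, making it "Tier 1" safe.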

2. The "Leakage" Mirage

I initially had a model with $R^2 = 0.98$. It felt amazing, but it was fake. I had accidentally included a rolling mean of air temperature that overlapped with the prediction window. Hunting down that leak taught me that if a model looks too good to be true, it probably is.

3. Bayesian Thinking

Moving from point predictions ("the oxygen will be 95% saturation") to probabilistic thinking ("there is a 95% chance oxygen is between 90% and 100% saturation") is a paradigm shift. It makes the tool useful for policymakers, who manage risk, not just data.

🚧 Challenges I Faced

  • Sparse Data: Real-world environmental data is messy. Sampling happens only every 14 days, so 13 of every 14 days have no "ground truth." Interpolation would introduce leakage, so we had to design a model that learns robustly from sparse, irregular time steps. Simply finding and assembling the data also took far longer than expected.
  • The "Cold Start": The UK Lake District is huge. Generalizing a model trained on Windermere to other lakes without local training data remains an open challenge (the "Transfer Learning" problem).
