Inspiration

After the theme and mandatory datasets were shown, I wasn't too sure what I wanted to do for Datahacks at first. I decided to do something with the CalCOFI Data Portal datasets, since I had been fairly interested in marine biology ever since I was a kid (I remember drawing sharks and other sea animals daily in elementary school, which annoyed my family and teachers at times). While doing EDA on the larvae count dataset, I noticed that Northern anchovies were recorded a staggering number of times. After doing some research and learning how important they are to the local ecosystem, along with how surprisingly little is known about the effects of climate and environment on their spawning and larval habits, I decided to make my project all about them.

Project Deliverables

My project has multiple deliverables. The first is a set of visualizations, including geospatial animations, and models (ordinary least-squares regression, random forest regressor) showing the relationships between Northern anchovy larvae density and variables like oxygen level, sea surface temperature, and zooplankton abundance. The second is a random forest regressor that predicts a given area's anchovy larvae density with an R-squared score of 0.236, which I consider a strong result given the small number of features used (9) and how uncontrolled and chaotic ecological data is.

How I built it

📊 The Data: SIO, NOAA, & ERDDAP

This project heavily relies on gold-standard time-series datasets maintained by the Scripps Institution of Oceanography (SIO) and NOAA, accessed via the ERDDAP data server. The combination of localized bottle sampling and high-resolution satellite imagery provides a uniquely rich, multi-decade feature space for ML applications.

Datasets Used:

  1. CalCOFI NOAA Fish Larvae Counts: Target variable (Anchovy density per 10m²).
  2. CalCOFI NOAA Zooplankton Volume: Biomass data used to map the base of the predator food web.
  3. CalCOFI SIO Hydrographic Bottle Data: High-resolution physical and chemical profiles (Salinity, Dissolved $O_2$, Chlorophyll-a, Silicate, Phosphate, Nitrite).
  4. NOAA Optimum Interpolation (OI) SST V2 High Resolution Dataset: Provided the critical Sea Surface Temperature (SST) feature, leveraging advanced spatial interpolation to map historical climate regimes and currents.

🛠️ Data Preprocessing & EDA

Ecological data is notoriously noisy, heavily skewed, and spatio-temporally misaligned.

  • Spatio-Temporal Grid Mapping: Continuous latitude/longitude coordinates were quantized into localized 0.25-degree geographic bins. Time series were discretized into 4 distinct meteorological seasons to match CalCOFI's quarterly cruise schedules.
  • Depth Filtering: Because anchovy larvae are epipelagic, hydrographic bottle data (which spans thousands of meters) was strictly filtered to the top 10 meters to represent accurate sea-surface conditions.
  • Target Normalization: Larval catch counts follow a highly exponential, zero-inflated distribution. A $\log(x)$ transformation was applied to the target variable (log_larvae) and to extremely right-skewed nutrient metrics to stabilize model variance.
  • Geospatial Animation: Conducted frame-by-frame rendering of historical SST mapping overlaid with log-scaled catch densities to visually validate historical regime shifts.
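
The binning, depth-filtering, and transformation steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the function names and row fields (`lat`, `lon`, `depth_m`, `value`) are hypothetical, rolling December into the following year's winter is an assumption, and `log1p` (log of 1 + x) is used as an assumption so that zero catches stay defined, whereas the write-up only states a log transformation.

```python
import math
from collections import defaultdict

BIN_SIZE_DEG = 0.25  # grid resolution stated in the write-up
MAX_DEPTH_M = 10.0   # surface-layer cutoff stated in the write-up


def grid_bin(value_deg, size=BIN_SIZE_DEG):
    """Snap a latitude or longitude to the lower edge of its grid cell."""
    return math.floor(value_deg / size) * size


def season_of(month):
    """Map a calendar month to a meteorological season index.

    0 = winter (Dec-Feb), 1 = spring (Mar-May),
    2 = summer (Jun-Aug), 3 = autumn (Sep-Nov).
    """
    return (month % 12) // 3


def season_year(year, month):
    """Assumption: December is counted with the following year's winter."""
    return year + 1 if month == 12 else year


def log_density(larvae_per_10m2):
    """Assumption: log(1 + x) rather than log(x), so zero counts map to 0."""
    return math.log1p(larvae_per_10m2)


def surface_means(bottle_rows):
    """Average one bottle measurement per (lat_bin, lon_bin, year, season),
    keeping only samples taken in the top MAX_DEPTH_M meters."""
    sums = defaultdict(lambda: [0.0, 0])
    for row in bottle_rows:
        if row["depth_m"] > MAX_DEPTH_M:
            continue  # drop deep samples that do not reflect surface conditions
        key = (
            grid_bin(row["lat"]),
            grid_bin(row["lon"]),
            season_year(row["year"], row["month"]),
            season_of(row["month"]),
        )
        sums[key][0] += row["value"]
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}
```

Keying every dataset on the same (cell, year, season) tuple is what lets larvae counts, zooplankton volume, bottle chemistry, and SST be joined into one table despite being sampled at different places and times.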

🧠 Feature Engineering

  • Handling Multicollinearity: Ocean physics dictate that cold, deep upwelled water is inherently nutrient-rich, high in $O_2$, and highly saline. This creates massive multicollinearity traps for linear models.
  • Temporal Lags (The "Pantry" Metric): Cross-sectional modeling showed zooplankton to be statistically insignificant when matched with same-season anchovy populations (the "they already ate the food" dilemma). I engineered a T-1 season (3-month) time lag on zooplankton volume. Once it was transformed into a historical leading indicator, its predictive importance skyrocketed.
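
The T-1 season lag can be illustrated with a small dictionary-based sketch. The key layout `(cell, year, season)` with seasons indexed 0 to 3, and both function names, are hypothetical choices for this example; the project's own implementation is not shown in the write-up.

```python
def previous_season(year, season):
    """Return the (year, season) one quarter earlier; seasons run 0..3."""
    return (year - 1, 3) if season == 0 else (year, season - 1)


def add_zooplankton_lag(zoo_by_cell_season):
    """Build the T-1 feature: each key gets the value seen one season earlier.

    zoo_by_cell_season maps (cell, year, season) -> zooplankton volume.
    Keys with no observation in the previous season map to None.
    """
    lagged = {}
    for cell, year, season in zoo_by_cell_season:
        prev_year, prev_season = previous_season(year, season)
        lagged[(cell, year, season)] = zoo_by_cell_season.get(
            (cell, prev_year, prev_season)
        )
    return lagged
```

Rows whose lagged value is `None` (for example after a skipped cruise) would need to be dropped or imputed before modeling; which of those the project did is not stated.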

🤖 Modeling Strategy

  1. Ordinary Least Squares (OLS) Regression: Used primarily as a diagnostic baseline.

    • Findings: Demonstrated the mathematical limitations of linear models in ecology ($R^2 \approx 0.107$). OLS correctly identified 'Season' and 'Salinity' as statistically significant ($p < 0.05$), but collapsed under the weight of multicollinearity when interacting biological variables (Chlorophyll, Phosphates, Silicates) were introduced, failing to detect threshold-based biological carrying capacities.

  2. Random Forest Regressor: Selected to map the complex, non-linear interactions and threshold dynamics ("Goldilocks zones") without suffering from a multicollinearity penalty.

    • Performance: Achieved an $R^2 \approx 0.236$ (a ~400% improvement over initial linear baselines), a highly respectable variance capture for uncontained, wild ecological systems.

    • Feature Importance Hierarchy:

      1. Sea Surface Temperature (SST): Emerged as the undisputed macro-proxy for ocean currents and climate regimes once underlying chemical data was controlled for.
      2. Dissolved $O_2$ & Salinity: Confirmed as the primary localized physical drivers for larval survival.
      3. Lagged Zooplankton: Proved the importance of historical cause-and-effect in ecological time series.
      4. The "Upwelling Cocktail": Chlorophyll-a, Silicate, and Phosphate grouped together to successfully identify localized nutrient upwelling events.
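
For readers unfamiliar with the metric, the $R^2$ scores quoted above measure the share of variance in the target that a model's predictions explain. Below is a minimal sketch of the standard coefficient of determination in plain Python; the function name is hypothetical, and whether the project computed its scores on held-out data is not stated in the write-up.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    1.0 means perfect predictions; 0.0 means the model does no better
    than always predicting the mean of y_true.
    """
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy log-density values, invented purely for illustration.
observed = [0.0, 1.0, 2.0, 3.0]
predicted = [0.5, 1.0, 1.5, 3.0]
score = r_squared(observed, predicted)
```

Read this way, a score of 0.236 means the model accounts for a little under a quarter of the variance in log-transformed larvae density.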

Challenges I ran into

The datasets that I used all had missing or corrupted data and useless columns, and they involved features and information I was not familiar with. I thus had to spend a lot of time transforming and cleaning the data, along with doing 2-3 hours of research on anchovy spawning behavior, CalCOFI funding droughts, previous studies on how the 1982-83 El Niño affected spawning, etc. I also used multiple unfamiliar libraries such as xarray and contextily, which took me a while to understand.

Accomplishments that I'm proud of

I'm proud of creating a regressor model that, despite not using an insane number of features and having to use chaotic data, was still able to achieve an R-squared score of 0.236. Also, while I created several detailed visualizations that demonstrate the relationship between anchovy larvae density and various features, arguably the best one has to be the time lapse showing both anchovy larvae density and sea surface temperature changing over the years. It was my first time creating a time lapse and it does an incredible job showing the relationship between seasons, SST, location, and anchovy larvae populations.

What I learned

Perseverance, a lot of new imports, a lot on Northern anchovies, and a lot on weather patterns off the California coast.

What's next for Climate Changes and Predicting Northern Anchovy Spawning

There were multiple other features that I wanted to add but didn't have time to properly implement. To push the variance capture even higher, future iterations of this model will integrate macro-climate indices (Pacific Decadal Oscillation, El Niño Southern Oscillation) to account for decadal regime shifts, alongside static geological features like coastal bathymetry to delineate open-ocean vs. shelf spawning habitats.
