Inspiration

San Diego is one of the most biodiverse regions in North America, but climate change is quietly reshuffling which species can survive here. As temperatures rise, some species are pushed out of their thermal habitats while others expand into new ones. We wanted to make that invisible process visible. UCSD's campus sits at the intersection of urban heat, coastal influence, and native habitat, and it has real sensor infrastructure we could actually use. That combination made it the perfect place to ask: what does warming look like on the ground, for real wildlife, right now?

What it does

EcoShift is a climate-biodiversity interaction model for the UCSD campus area. It takes real microclimate conditions (temperature and humidity from UCSD's campus sensor network) and predicts which species are likely present at any location, at any time of year. The predictions are displayed as an interactive heatmap that updates in real time.

The key feature is the warming simulator: a slider that lets you shift temperatures by +1, +2, or +3°C and instantly see how the species community responds. You can see which species expand their habitat and which ones lose it. No retraining required. The model has learned the underlying relationship between microclimate and biodiversity, so the counterfactual runs in under a second.

How we built it

Data pipeline: We joined two Scripps-provided datasets (UCSD campus temperature/humidity sensor readings and research-grade iNaturalist species observations) using a KD-tree spatial index. Each observation is matched to its nearest sensor reading, provided that reading lies within 500 m and 2 hours of the observation.
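A minimal sketch of this matching step, not the project's actual pipeline code. It assumes coordinates have already been projected to local metres, that timestamps are in seconds, and that each sensor's reading times are stored as a sorted array. The function names and the two-stage strategy (KD-tree for space, binary search for time) are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

MAX_DIST_M = 500.0       # spatial tolerance stated above
MAX_DT_S = 2 * 3600.0    # temporal tolerance stated above


def to_local_metres(lat, lon, lat0, lon0):
    """Equirectangular projection around (lat0, lon0); adequate over a few km."""
    r = 6_371_000.0
    x = np.radians(lon - lon0) * r * np.cos(np.radians(lat0))
    y = np.radians(lat - lat0) * r
    return np.column_stack([x, y])


def match_observations(sensor_xy, readings_by_sensor, obs_xy, obs_t):
    """Return an (n_obs, 2) array of (sensor_idx, reading_idx), or (-1, -1) if unmatched.

    readings_by_sensor[i] is a sorted 1-D array of reading times for sensor i.
    """
    tree = cKDTree(sensor_xy)
    # Sensors beyond the distance bound come back with infinite distance.
    dist, idx = tree.query(obs_xy, k=1, distance_upper_bound=MAX_DIST_M)
    out = np.full((len(obs_xy), 2), -1, dtype=int)
    for i, (d, s, t) in enumerate(zip(dist, idx, obs_t)):
        if not np.isfinite(d):
            continue
        times = readings_by_sensor[s]
        if len(times) == 0:
            continue
        # Binary search for the reading closest in time to the observation.
        j = int(np.searchsorted(times, t))
        candidates = [c for c in (j - 1, j) if 0 <= c < len(times)]
        best = min(candidates, key=lambda c: abs(times[c] - t))
        if abs(times[best] - t) <= MAX_DT_S:
            out[i] = (s, best)
    return out
```

Observations that come back as (-1, -1) have no sensor reading inside both tolerances and would be dropped from the training set.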

Bias correction: iNaturalist over-represents popular trails. Rather than sampling absences at random, we applied per-species spatial thinning and target-group background sampling (the standard MaxEnt approach) to correct for observer effort.
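Both corrections can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the project's code: species are encoded as integer ids, coordinates are in local metres, thinning keeps one record per species per grid cell, and the 50 m cell size is a placeholder value.

```python
import numpy as np


def thin_per_species(xy, species, cell_m=50.0):
    """Keep the first record for each (species, grid cell) pair.

    Returns the sorted row indices to keep. cell_m is an assumed thinning distance.
    """
    cells = np.floor(xy / cell_m).astype(np.int64)
    keys = np.column_stack([species, cells])
    _, keep = np.unique(keys, axis=0, return_index=True)
    return np.sort(keep)


def target_group_background(species, target, n, rng):
    """Sample background rows from records of every species except `target`.

    Backgrounds drawn this way share the observers' spatial bias, so the
    model cannot separate presences from backgrounds by trail location alone.
    """
    pool = np.flatnonzero(species != target)
    return rng.choice(pool, size=min(n, len(pool)), replace=False)
```

A typical call would be target_group_background(species, target=3, n=1000, rng=np.random.default_rng(0)), repeated once per modelled species.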

Model: A PyTorch multi-label MLP (6 inputs -> 128 -> 64 -> 32 -> n_species) trained with BCEWithLogitsLoss and per-species pos_weight to handle severe class imbalance (typical absence-to-presence ratios of 50:1 to 500:1). Training used early stopping on a held-out validation set.
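A minimal sketch of that architecture and loss in PyTorch. The layer sizes follow the description above; the ReLU activations, the helper names, and the negatives-over-positives formula for pos_weight are assumptions, since the writeup does not specify them.

```python
import torch
from torch import nn


def build_model(n_species, n_inputs=6):
    """Multi-label MLP: n_inputs -> 128 -> 64 -> 32 -> n_species logits."""
    return nn.Sequential(
        nn.Linear(n_inputs, 128), nn.ReLU(),   # activation choice is an assumption
        nn.Linear(128, 64), nn.ReLU(),
        nn.Linear(64, 32), nn.ReLU(),
        nn.Linear(32, n_species),              # raw logits, one per species
    )


def pos_weight_from_labels(y):
    """Per-species weight = negatives / positives, computed from a 0/1 label matrix."""
    pos = y.sum(dim=0).clamp(min=1.0)          # avoid division by zero
    neg = y.shape[0] - y.sum(dim=0)
    return neg / pos


def train_step(model, optimiser, loss_fn, x, y):
    """One gradient step; returns the batch loss as a float."""
    model.train()
    optimiser.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimiser.step()
    return loss.item()
```

The loss would then be constructed as nn.BCEWithLogitsLoss(pos_weight=pos_weight_from_labels(y_train)), so that a species with a 100:1 imbalance has each of its presences weighted 100 times as heavily as an absence.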

Evaluation: We used blocked spatial cross-validation, dividing the study area into a 5x5 geographic grid and using leave-one-block-out CV, to get an honest estimate of generalisation to new locations. Species with mean spatial CV AUC below 0.65 were excluded from the final output as unreliable.
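The fold construction described here can be sketched as follows. The 5x5 grid and the 0.65 threshold come from the text; the function names, the bounding-box block assignment, and the use of a NaN-aware mean (for folds where a species has no presences) are illustrative assumptions.

```python
import numpy as np


def block_ids(xy, n_side=5):
    """Assign each point to a cell of an n_side x n_side grid over the data's bounding box."""
    mins = xy.min(axis=0)
    span = np.maximum(xy.max(axis=0) - mins, 1e-9)
    cell = ((xy - mins) / span * n_side).astype(int)
    cell = np.minimum(cell, n_side - 1)        # points on the max edge go in the last cell
    return cell[:, 1] * n_side + cell[:, 0]


def leave_one_block_out(blocks):
    """Yield (block, train_idx, test_idx), holding out one occupied block at a time."""
    for b in np.unique(blocks):
        held_out = blocks == b
        yield b, np.flatnonzero(~held_out), np.flatnonzero(held_out)


def reliable_species(auc_by_fold, threshold=0.65):
    """Indices of species whose mean AUC across folds meets the threshold.

    auc_by_fold has shape (n_folds, n_species); NaN marks folds with no presences.
    """
    return np.flatnonzero(np.nanmean(auc_by_fold, axis=0) >= threshold)
```

Because every test point lies in a block the model never saw during training, nearby and highly correlated observations cannot leak between the training and test sets.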

Serving: A FastAPI backend wraps a Predictor singleton that loads all model artefacts at startup. The predict_grid() endpoint covers the map with a spatial grid, runs inference in one batched forward pass, and returns per-species probability surfaces to the Leaflet.js frontend.
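The core of a grid-prediction routine like this can be sketched without the web framework. The version below leaves out the FastAPI routing and the Predictor class entirely; predict_fn and features_fn are hypothetical callables standing in for the trained model and the microclimate feature lookup, and the default grid resolution is a placeholder.

```python
import numpy as np


def build_grid(south, west, north, east, n_lat, n_lon):
    """Regular lat/lon grid over a bounding box; both outputs have shape (n_lat, n_lon)."""
    lats = np.linspace(south, north, n_lat)
    lons = np.linspace(west, east, n_lon)
    lon_g, lat_g = np.meshgrid(lons, lats)
    return lat_g, lon_g


def predict_grid(predict_fn, features_fn, bounds, n_lat=40, n_lon=40):
    """Return per-species probability surfaces of shape (n_species, n_lat, n_lon).

    features_fn(lat, lon) -> (n_points, n_features) feature matrix
    predict_fn(X)         -> (n_points, n_species) probabilities
    """
    lat_g, lon_g = build_grid(*bounds, n_lat, n_lon)
    X = features_fn(lat_g.ravel(), lon_g.ravel())
    probs = predict_fn(X)                       # one batched pass over every grid point
    return probs.T.reshape(-1, n_lat, n_lon)
```

Flattening the grid into a single batch is what keeps the request fast: the model is called once per map view rather than once per cell.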

Challenges we ran into

The biggest challenge was the sensor coverage gap. The UCSD campus sensors provide dense, accurate readings, but only over a relatively small area. To extend predictions beyond that footprint, we fall back to the sensor climatology mean for grid points far from any sensor. This is a real limitation, and the model's confidence flagging makes it visible.
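One way such a fallback with a confidence flag could look is sketched below. The writeup does not say how "far from a sensor" is defined or how nearby readings are combined, so the 500 m cutoff, the nearest-sensor lookup, and the function name are all assumptions made for illustration.

```python
import numpy as np


def temperature_with_fallback(grid_xy, sensor_xy, sensor_temp,
                              climatology_mean, max_dist_m=500.0):
    """Return (temperature, is_sensor_backed) for each grid point.

    Points within max_dist_m of a sensor take that sensor's reading; all
    others take the climatology mean and are flagged as low confidence.
    """
    # Pairwise distances, shape (n_grid, n_sensors); fine for a campus-sized network.
    d = np.linalg.norm(grid_xy[:, None, :] - sensor_xy[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    sensor_backed = d.min(axis=1) <= max_dist_m
    temp = np.where(sensor_backed, sensor_temp[nearest], climatology_mean)
    return temp, sensor_backed
```

The boolean mask can be passed to the frontend alongside the probabilities so that fallback cells are rendered differently from sensor-backed ones.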

Correcting for iNaturalist observer bias was also harder than expected. Standard random negative sampling produced a model that largely learned human trail patterns rather than species ecology. Switching to target-group background sampling (using the full iNaturalist observation pool as backgrounds) meaningfully changed which species the model could learn.

Finally, implementing proper spatial cross-validation rather than random CV revealed that naive AUC estimates were significantly inflated due to spatial autocorrelation. Several species that looked well-modelled under random CV dropped below our threshold under spatial CV and were cut from the output.

Accomplishments that we're proud of

Getting the methodological details right. Many species distribution models built at hackathons use random train/test splits on spatially correlated data, which leaks information and inflates performance estimates. We implemented blocked spatial CV, per-species class imbalance correction, and target-group background sampling: the same standards used in published conservation biology research. The model's reported AUC reflects genuine generalisation, not spatial leakage.

We are also proud of how the counterfactual inference works. The warming simulation is not a separate model or a lookup table. It is a live inference pass with a shifted temperature input, grounded in what the model actually learned about the climate-biodiversity relationship.
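The idea can be sketched as two forward passes over the same inputs, one with the temperature column shifted. Here predict_fn is a hypothetical stand-in for the trained model, and temp_col=0 assumes temperature is the first feature and is stored in °C. If the real inputs are standardised, the shift would need to be divided by the temperature scaling factor first.

```python
import numpy as np


def warming_delta(predict_fn, X, delta_c, temp_col=0):
    """Change in predicted presence probability under a uniform temperature shift.

    X         : (n_points, n_features) feature matrix, left unmodified
    delta_c   : warming in degrees Celsius, e.g. 1.0, 2.0 or 3.0
    Returns   : (n_points, n_species) array of warmed-minus-baseline probabilities
    """
    X_warm = np.array(X, dtype=float)           # copy so the baseline stays intact
    X_warm[:, temp_col] += delta_c
    return predict_fn(X_warm) - predict_fn(X)
```

Positive entries mark locations where a species gains predicted suitability under warming, and negative entries mark losses. No retraining is involved, only inference on modified inputs.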

What we learned

Spatial data breaks standard ML assumptions in ways that are not obvious until you look carefully. Random cross-validation, random negative sampling, and ignoring observer effort all seem reasonable until you understand the structure of the data, and all three produce models that look good but generalise poorly. The field of species distribution modelling has developed rigorous solutions to each of these problems, and applying them properly was the most valuable technical lesson from this project.

What's next for EcoShift

The most natural next step is expanding sensor coverage, either through additional campus sensors or by calibrating satellite-derived land surface temperature (Landsat LST) against the existing ground truth readings. That would allow the microclimate surface to extend across a wider area and make the species predictions spatially richer beyond the campus footprint.

On the modelling side, adding temporal dynamics (diurnal patterns, seasonal transitions, year-over-year drift) would make the predictions more realistic. The current model treats each observation independently. A sequential model could capture how species communities shift through the day and across seasons.
