Inspiration
After hearing that the theme covered climate, environment and energy, our group was excited to see how we could use our skills to build something that could aid our planet. Looking through the datasets we had access to, we quickly gravitated towards marine life and how the ocean’s shifting conditions affect it. Phytoplankton, the base of the vast majority of the ocean’s ecosystems, became the clear choice to learn about, analyze and innovate around. Changes in phytoplankton concentrations and populations drastically affect all ocean life. Knowing this, we built a tool that analyzes phytoplankton data and predicts shifts in their populations.
What it does
We took multiple approaches to predicting phytoplankton populations off the California coast. The first used data from the California Cooperative Oceanic Fisheries Investigations (CalCOFI), which contains multiple ocean health indicators such as temperature, salinity, and O2, to predict ChlorA (chlorophyll-a, which CalCOFI reports in micrograms per liter), a measure used to estimate the concentration of phytoplankton. We did this with several machine learning methods, including neural nets and random forests. We then created a heat map with scalers to show how changes in water temperature, salinity, O2, and more affect phytoplankton populations.
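As a rough illustration of the random forest approach, here is a minimal sketch using scikit-learn. Everything in it is an assumption made for the example: the data is synthetic (generated from a made-up formula, not CalCOFI measurements), the feature names `T_degC`, `Salnty`, and `O2ml_L` are assumed column names, and the hyperparameters are arbitrary defaults rather than the settings we actually used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT real measurements): the target depends on a
# made-up linear mix of oxygen and temperature plus a little noise.
rng = np.random.default_rng(0)
n = 600
temp = rng.uniform(8.0, 22.0, n)    # water temperature, deg C
sal = rng.uniform(32.5, 34.5, n)    # salinity (unused by the toy formula)
o2 = rng.uniform(1.0, 7.0, n)       # dissolved oxygen
chl = 0.5 + 0.4 * o2 - 0.05 * temp + rng.normal(0.0, 0.1, n)

# Assumed column names, used here only to label the features.
feature_names = ["T_degC", "Salnty", "O2ml_L"]
X = np.column_stack([temp, sal, o2])

# Hold out a quarter of the rows to check the fit on unseen samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, chl, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# R^2 on the held-out rows, and how much each feature drove the splits.
r2 = r2_score(y_test, model.predict(X_test))
importances = dict(zip(feature_names, model.feature_importances_))
```

On real data, the `importances` dictionary gives a quick first look at which ocean conditions the forest leans on most, although impurity-based importances can be misleading when features are strongly correlated.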
We also extended the analysis to predict, for each pier along the California coast, the chance that tomorrow's seawater will contain a toxic algal bloom, specifically Pseudo-nitzschia and the poison it produces, domoic acid. Domoic acid is the stuff that closes shellfish fisheries and poisons sea lions (and occasionally people). There's already an official government forecast for this called C-HARM v3.1, made by NOAA. It's the system that lifeguards, fishery managers, and public-health officials currently rely on, so how does our system compare?
How we built it - Data Analysis
After performing exploratory data analysis on the CalCOFI dataset, we found that most algal blooms occurred along the coast and were driven by a variety of external factors, including temperature, salinity, dissolved oxygen concentrations, and more. With that in mind, we dug further, quantifying chlorophyll concentration, the primary indicator of algal blooms, against distance from the coastline to test our hypothesis that algal blooms tend to linger along coastlines. In addition, to isolate the effects of mineral and oxygen presence in the water, we created interactive plots of average chlorophyll concentration against these variables and water depth, to identify the approximate values at which average chlorophyll concentration peaks.
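The distance-from-coast comparison boils down to grouping samples into distance bins and averaging chlorophyll within each bin. A minimal sketch of that idea, using only the standard library, might look like this. The function name, the 50 km bin width, and the sample values are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_chl_by_distance(samples, bin_width_km=50.0):
    """Average chlorophyll per distance-from-coast bin.

    samples: iterable of (distance_km, chlorophyll) pairs.
    Returns a dict mapping each bin's lower edge (km) to the mean
    chlorophyll of the samples in that bin, ordered by distance.
    """
    bins = defaultdict(list)
    for distance_km, chl in samples:
        # Floor the distance to the lower edge of its bin.
        lower_edge = (distance_km // bin_width_km) * bin_width_km
        bins[lower_edge].append(chl)
    return {edge: mean(values) for edge, values in sorted(bins.items())}


# Made-up samples (distance in km, chlorophyll), not real measurements.
toy_samples = [(5.0, 2.0), (20.0, 1.0), (60.0, 0.5), (90.0, 0.3), (130.0, 0.1)]
profile = mean_chl_by_distance(toy_samples)
```

A profile that falls as the bin edge grows is what the "blooms hug the coast" hypothesis would predict. The same grouping works with depth or oxygen in place of distance.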
We took our model and the official C-HARM forecast, lined them up on the exact same days at the exact same piers, and asked one simple question: when each model says "bloom likely," how often is it actually right? The precision-recall (PR) curve answers that question visually. A line that bows up toward the top-right corner belongs to a better forecaster. On the curves, our model's line sits clearly above C-HARM's for both the algae itself and the toxin. For the toxin (the rare, dangerous one), our model is roughly 4× better at correctly flagging real events. The gap is big enough that statistics confirm it's a real improvement, not luck.
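For readers who want to see what sits behind a PR curve, here is a small standard-library sketch of how precision, recall, and average precision (the area under the curve) can be computed for two forecasters scored on the same events. The labels and scores are toy numbers invented for the example, not our results or C-HARM's, and the function names are hypothetical. The sketch also assumes no two forecasts share the same score.

```python
def precision_recall_points(y_true, scores):
    """(recall, precision) pairs as the alert threshold is lowered.

    y_true: 1 if a bloom was observed, else 0.
    scores: forecast probability for the same events.
    Assumes all scores are distinct.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(y_true)
    points = []
    true_positives = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        # Flagging the top `rank` forecasts catches `true_positives` blooms.
        points.append((true_positives / total_positives, true_positives / rank))
    return points


def average_precision(y_true, scores):
    """Step-wise area under the PR curve: precision weighted by recall gain."""
    ap = 0.0
    previous_recall = 0.0
    for recall, precision in precision_recall_points(y_true, scores):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


# Toy example: 8 pier-days, 3 of them with an observed bloom.
observed = [1, 0, 0, 1, 0, 0, 0, 1]
forecast_a = [0.9, 0.2, 0.75, 0.8, 0.1, 0.4, 0.25, 0.7]  # ranks blooms near the top
forecast_b = [0.6, 0.7, 0.3, 0.4, 0.8, 0.2, 0.1, 0.5]    # ranks blooms mid-pack

ap_a = average_precision(observed, forecast_a)
ap_b = average_precision(observed, forecast_b)
```

A forecaster with no skill lands near the event base rate (3 out of 8 in this toy set), which is why average precision is a useful yardstick for rare events like toxin detections.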
How we built it - ML/AI
Challenges we ran into
A major challenge we faced early on was extracting enough data from the datasets and understanding which parameters held value. It was extremely difficult to find enough data covering all the parameters that matter for chlorophyll concentration, which correlates with phytoplankton populations. Things such as water conditions, nutrient availability and weather conditions had to be factored in. Merging datasets failed because the provided datasets had very little overlap in locations. We eventually settled on the CalCOFI dataset, which struck a good balance between quantity of data and diversity of important parameters.
Accomplishments that we're proud of
Our primary accomplishments in this hackathon were successfully training several different types of machine learning models and finding interesting ways to display our understanding of these models and the underlying data.
What we learned
We learned a lot about which ocean conditions affect phytoplankton populations, and we gained a lot of experience cleaning data and then building and training models on it. We also learned how to work together as a team of people who met for the first time today.
What's next for Hal's Homies
Our next step with this project, had we had more time, would be to start integrating the ocean health data with satellite data in order to build even stronger models and uncover new and exciting insights into phytoplankton populations.
Built With
- jupyter-notebooks
- python
- scikit-learn
- scripps-databases
- tensorflow