Joining this hackathon was an exciting opportunity for us: we are both interested in geography and data science but had yet to take part in a geography-related data science challenge. We were inspired by the experts who introduced us to spatial visualization tools through the introductory videos.
What we did
Visualizing the data
The spatio-temporal data given to us was complex and difficult to interpret, so we built an interactive visualization using Python Dash. It shows the spatial relationships between pollution sources and sensor stations, as well as the time series for each sensor station, so we can see how associations change over time. This helped us form some preliminary hypotheses, such as seasonality in nitrogen pollution levels and higher nitrogen pollution near urban areas such as Baltimore.
Transforming the data
Next, we transformed the given data into input features for machine learning models. It was important to capture the spatial relationships between HUCs (i.e. which HUCs are upstream and downstream of each other), so we represented HUC dependencies as a directed acyclic graph. We also tried to extract land use features by counting pixels of certain colors in QGIS, but because the files were huge we could only process land use data for 8 HUCs.
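To illustrate the directed acyclic graph idea, here is a minimal sketch using only the Python standard library. The HUC identifiers, the `downstream_of` mapping, and the helper names are hypothetical and are not taken from our repository; the sketch only shows how upstream dependencies could be collected and turned into a feature such as "total load from all upstream HUCs".

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical example: each HUC maps to the HUC(s) it drains into.
downstream_of = {
    "HUC_A": ["HUC_C"],
    "HUC_B": ["HUC_C"],
    "HUC_C": ["HUC_D"],
    "HUC_D": [],
}


def upstream_sets(downstream_of):
    """Return, for every HUC, the set of all HUCs upstream of it."""
    # Invert the edges: for each HUC, which HUCs drain directly into it.
    direct_upstream = {huc: set() for huc in downstream_of}
    for huc, targets in downstream_of.items():
        for target in targets:
            direct_upstream.setdefault(target, set()).add(huc)

    # static_order() yields each HUC after all of its upstream HUCs,
    # and raises graphlib.CycleError if the graph is not acyclic.
    order = TopologicalSorter(direct_upstream).static_order()

    all_upstream = {}
    for huc in order:
        collected = set(direct_upstream[huc])
        for parent in direct_upstream[huc]:
            collected |= all_upstream[parent]
        all_upstream[huc] = collected
    return all_upstream


def upstream_total(values, all_upstream):
    """Sum a per-HUC value over everything upstream of each HUC."""
    return {
        huc: sum(values[u] for u in ups)
        for huc, ups in all_upstream.items()
    }


ups = upstream_sets(downstream_of)
# Made-up per-HUC values, purely for illustration.
loads = {"HUC_A": 1.0, "HUC_B": 2.0, "HUC_C": 4.0, "HUC_D": 8.0}
totals = upstream_total(loads, ups)
```

Processing the HUCs in topological order means each HUC's upstream set is built from sets that are already complete, so the graph is walked only once.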
Modelling the data
We then used linear regression to assess the statistical significance of the different predictors, and XGBoost, a gradient-boosted tree algorithm, to uncover non-linear relationships and variable importance. This addressed our first aim of uncovering the underlying factors affecting nitrogen and phosphorus pollution levels. Then, for each point, we developed a time-series model (SARIMAX) to predict future pollution from past data, splitting our data into train and test samples to check that the fitted models generalize. We were able to fit a model that makes reasonably good predictions on the test data.
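The train/test idea for time series can be sketched as follows. This is not our SARIMAX code: to keep the example dependency-free, a seasonal-naive baseline (repeat the last observed season) stands in for the model, and the data, function names, and season length are made up for illustration. The point it shows is that the split is chronological, so the model is only ever evaluated on observations that come after its training data.

```python
def chronological_split(series, test_size):
    """Hold out the last `test_size` observations as the test set."""
    if not 0 < test_size < len(series):
        raise ValueError("test_size must be between 1 and len(series) - 1")
    return series[:-test_size], series[-test_size:]


def seasonal_naive_forecast(train, horizon, season_length):
    """Forecast by repeating the last full season seen in training."""
    if len(train) < season_length:
        raise ValueError("need at least one full season of training data")
    last_season = train[-season_length:]
    return [last_season[i % season_length] for i in range(horizon)]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between observations and forecasts."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Synthetic, perfectly periodic readings (season length 4), for illustration.
readings = [1.0, 2.0, 3.0, 2.0] * 3

train, test = chronological_split(readings, test_size=4)
forecast = seasonal_naive_forecast(train, horizon=len(test), season_length=4)
mae = mean_absolute_error(test, forecast)
```

A baseline like this is also a useful yardstick: a fitted seasonal model is only worth its added complexity if its test error beats the seasonal-naive error on the same held-out period.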
What we learned
We're happy to have learnt so many new skills and expanded our data science arsenal, and to have applied them to improve our understanding of the environment and ecosystems. Hopefully this hackathon is just the start of promising data science careers for us!
(Note: Code and presentation PDF are in the public GitHub repository https://github.com/thamsuppp/hackthebay)