Inspiration

The Chesapeake Bay watershed is home to over 18 million people, spanning 6 states and Washington DC. The Bay is the largest estuary in the United States, providing 500 million pounds of seafood harvest each year. Declining water quality due to pollution and overharvesting in past decades led to poor ecosystem health and the near extinction of economically and culturally important aquatic species. In response to this decline, state and federal partners have invested heavily in restoration efforts, amounting to nearly $1.5 billion of state and federal funding in 2019 alone. Despite many years of costly efforts, warming temperatures, sea level rise, and increasing precipitation due to climate change have hindered the attainment of restoration goals.

Working with our community partner, the Chesapeake Monitoring Cooperative (CMC), the hackathon team at Booz Allen Hamilton analyzed water quality measurements collected through citizen science efforts. CMC cultivates collaboration among previously siloed citizen science water monitoring groups through its pipeline and platform for collecting, vetting, sharing, and scaling data on the Chesapeake Bay, and engages residents in the conservation process. This hackathon submission seeks to leverage these data to provide actionable insights for advancing climate resiliency for Chesapeake Bay water quality.

What it does

We built a model to predict total nitrogen concentrations at water quality sampling locations. This is important for identifying sources of nutrient pollution to the Bay downstream and understanding the environmental factors that influence its presence in our vast watershed. In this analysis we incorporate historical climate, land use and other water quality indicators to predict nutrient pollution throughout the Chesapeake Bay watershed using a suite of leading regression models. By performing this historical analysis, we can better infer how future changes in climate, such as warming temperatures and increasing precipitation, might thwart restoration efforts.

How we built it

1. Data load

  • Chesapeake Monitoring Cooperative: download water quality and station metadata from CMC Data Explorer and upload to Databricks
  • Chesapeake Bay Program: download nontidal water quality monitoring data and metadata from Chesapeake Bay Program water quality database and upload to Databricks
  • NARR (North American Regional Reanalysis): download directly from a National Oceanic and Atmospheric Administration FTP site, stage in a tmp/ directory within Databricks, then subset and save the data.
  • Numeric watershed codes associated with CMC and CBP stations were included in our model. These hydrologic unit codes (HUCs) can be mapped to land use characterizations available here.
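
One practical detail when joining on numeric watershed codes: HUCs are hierarchical identifiers, and Mid-Atlantic codes begin with "02", so storing them as numbers drops the leading zero. The sketch below shows one way to restore it before a land use lookup. The function names, the 8-digit default, and the land use table are hypothetical illustrations, not the team's code or data.

```python
def normalize_huc(code, digits=8):
    """Return a HUC as a zero-padded string of `digits` characters.

    `digits` (8 here) is an assumption; use the level your stations carry.
    """
    text = str(int(code))
    if len(text) > digits:
        raise ValueError(f"{code!r} has more than {digits} digits")
    return text.zfill(digits)


def parent_huc(huc, digits):
    """Truncate a HUC string to a coarser level (2, 4, 6, ... digits)."""
    if digits % 2 != 0 or not 2 <= digits <= len(huc):
        raise ValueError(f"cannot truncate {huc!r} to {digits} digits")
    return huc[:digits]


# Made-up land use fractions keyed by HUC8, for illustration only.
LAND_USE_BY_HUC8 = {
    "02070010": {"developed": 0.45, "forest": 0.30, "agriculture": 0.25},
}

station_huc = normalize_huc(2070010)          # numeric code lost its leading zero
land_use = LAND_USE_BY_HUC8.get(station_huc)  # None if the HUC is not in the table
subregion = parent_huc(station_huc, 4)
```

Keeping the codes as strings after this step avoids the same problem reappearing in later joins.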

2. Exploratory data analysis

We investigated the spatial and temporal availability of data points and looked for correlations between the variables. No single variable stood out as an obvious predictor of nitrogen concentrations. To gain insight into which variables are important for predicting nitrogen concentrations, we chose to use a suite of regression models.
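
A correlation screen of this kind can be sketched in plain Python as below. This is a minimal illustration assuming Pearson correlation; the variable names and values are made up and do not come from the CMC, CBP, or NARR data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up example values, for illustration only.
nitrogen = [1.0, 1.5, 2.0, 2.5]
predictors = {
    "water_temp": [10, 15, 20, 25],
    "precip": [3, 1, 4, 2],
}

# Rank predictors by the absolute strength of their correlation with nitrogen.
ranked = sorted(
    ((name, pearson(values, nitrogen)) for name, values in predictors.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
```

Pearson correlation only captures linear relationships, which is one reason a weak screen like this motivates trying nonlinear regression models.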

3. Model build & evaluation

  • Build the model schema and apply StringIndexing, Imputation, and OneHotEncoding
  • Specify an 80/20 train/test split
  • Evaluate the performance of a baseline linear regression model using RMSE, R2, MSE, and MAE as our metrics, then use AutoML to build high-performing ensemble models (60+ models, e.g. GLM, XRF, DNN, DRF, XGBoost, GBM, Linear Regression)
  • Compare results with a Random Forest regressor, GBT regressor, XGBoost and H2O AutoML
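
For readers unfamiliar with the evaluation steps above, here is a minimal plain-Python sketch of an 80/20 split and the four metrics. It is not the Spark or H2O code used in the project; the function names and the fixed seed are assumptions made for illustration.

```python
import math
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def regression_metrics(y_true, y_pred):
    """Compute RMSE, R2, MSE, and MAE for paired observations and predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    ss_res = mse * n                                    # residual sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"rmse": math.sqrt(mse), "r2": r2, "mse": mse, "mae": mae}


train, test = train_test_split(range(10))
metrics = regression_metrics([1, 2, 3, 4], [1, 2, 3, 5])
```

RMSE and MAE are in the units of the target, while R2 is unitless, which makes it the easier number to compare across differently scaled targets.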

Challenges we ran into

  • Data sparseness: Many sampling efforts do not measure numerous environmental parameters at the same time, resulting in sparse data. We decided to remove predictor variables with many null values and then reduced the remaining dataset to records with no null values, because simple imputation schemes would not adequately represent the missing data in these environmental systems.

  • Modeling the interactive effects of complex processes: There are many complex processes that contribute to the flow of nitrogen pollution throughout the watershed. Water flow over different land use types affects the amount of nitrogen that is released into the waterways. Chemicals are transformed through microbial processes in the soil and water. Given the complexity of the system, we were not surprised when our baseline model performed poorly, and we looked to a number of more sophisticated models for better predictive power.
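
The null-handling approach described under data sparseness can be sketched as two passes: drop mostly-empty columns first, then drop any record that still has a gap. The 50% threshold, the function names, and the sample records below are assumptions for illustration; the writeup does not state the cutoff the team used.

```python
def drop_sparse_columns(rows, max_null_fraction=0.5):
    """Remove columns whose share of None values exceeds max_null_fraction."""
    n = len(rows)
    keep = [
        col
        for col in rows[0]
        if sum(row[col] is None for row in rows) / n <= max_null_fraction
    ]
    return [{col: row[col] for col in keep} for row in rows]


def drop_incomplete_rows(rows):
    """Keep only records with no None values."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Made-up records, for illustration only.
records = [
    {"total_nitrogen": 1.2, "water_temp": 14.0, "chlorophyll": None},
    {"total_nitrogen": 0.9, "water_temp": None, "chlorophyll": None},
    {"total_nitrogen": 2.1, "water_temp": 18.5, "chlorophyll": 3.3},
    {"total_nitrogen": 1.7, "water_temp": 16.0, "chlorophyll": None},
]

dense = drop_incomplete_rows(drop_sparse_columns(records))
```

Doing the column pass first matters: dropping incomplete rows before removing the sparse column would leave only one of the four records here instead of three.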

Accomplishments that we're proud of

  • Conducting a novel analysis. The strength of our model performance and the tools used in our analysis are a novel contribution to academic research in this field. It was difficult to find an appropriate study to compare our evaluation metrics against.
  • Our data-driven insight, that geography is the strongest predictor of nutrient pollution, is consistent with the general understanding of nutrient sources to the Bay: Land Use and Pollution Across the Bay, Nutrients in the Chesapeake Bay.
  • Collaborating with a knowledgeable community partner that's advancing high quality data collection by citizen scientists in the Chesapeake Bay region. We hope this analysis empowers further volunteer sampling efforts and can provide insight into future aims.
  • Using the power of Databricks' Unified Analytics platform to enable truly joint data science and engineering efforts. The tools and analysis we built are a demonstration of how combining subject matter expertise, geospatial data, and advanced analytics leads to actionable insights to enable climate resiliency for our community.
  • Working on this submission while also balancing the demands of our “day jobs”.

What we learned

The stacked ensemble model (H2O AutoML Regression GLM) performed best among all our models, with an R2 of 0.91 and RMSE of 0.48 on the test data. Deep learning models (DNN) did not rank among the better-performing models on these metrics. Since we couldn’t extract feature importance from the ensemble, we looked to our best-performing individual model, H2O AutoML XGBoost, for this insight.

The most important predictors for our nitrogen model are geographic indicators, i.e. longitude, latitude and watershed code. Despite the variation in climate over the sampling record, location was a much stronger predictor of nitrogen than climate variables. This supports location and community-based solutions for pollution reduction. By reducing nutrient pollution, watershed communities can offset the negative water quality impacts of climate change for the Chesapeake Bay.

What's next for Climate Resiliency for the Chesapeake Bay

  • Social Good: We are eager to share this work with our community partner, the CMC, to inform their future sampling strategy and encourage future citizen science efforts. Booz Allen Hamilton’s Women in Data Science and 901 Green Office team are supporting CMC in a broader data science project which includes a virtual hackathon in August of 2020. The work here will serve as a demonstration of the Databricks Community Cloud platform.
  • Analytics: Given the challenge of missing data, we recommend future work to devise a scientifically supported, process-based imputation scheme for null values. An example approach would be to interpolate between time points for stations that are frequently sampled.
  • Data Visualization: We look forward to building a dashboard where watershed citizens can look up land use and nitrogen concentrations for their subwatersheds. With further evaluation and support, this tool could be upgraded to predict nitrogen levels under future climate and land use scenarios.
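
As an illustration of the interpolation idea mentioned under Analytics, the sketch below fills gaps in one station's record by linear interpolation in time. This is only one possible scheme, not a validated process-based method; the function name and the sample values are hypothetical.

```python
from datetime import date


def interpolate_gaps(samples):
    """Fill None values that lie between two known values, linearly in time.

    `samples` is a list of (date, value_or_None) pairs sorted by strictly
    increasing date. Gaps before the first or after the last known value
    are left as None rather than extrapolated.
    """
    filled = list(samples)
    known = [i for i, (_, value) in enumerate(samples) if value is not None]
    for left, right in zip(known, known[1:]):
        d0, v0 = samples[left]
        d1, v1 = samples[right]
        span_days = (d1 - d0).days
        for i in range(left + 1, right):
            day, _ = samples[i]
            fraction = (day - d0).days / span_days
            filled[i] = (day, v0 + fraction * (v1 - v0))
    return filled


# Made-up nitrogen readings for a single station, for illustration only.
station = [
    (date(2019, 1, 1), 1.0),
    (date(2019, 1, 11), None),
    (date(2019, 1, 21), 3.0),
    (date(2019, 2, 1), None),
]

filled = interpolate_gaps(station)
```

Leaving the trailing gap unfilled is a deliberate choice in this sketch, since extrapolating beyond the sampled period would invent a trend the data do not support.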

Acknowledgments

Our team would like to thank the Booz Allen Women in Data Science, the Booz Allen 901 Green Office team, the Chesapeake Monitoring Cooperative, and all citizen scientists who collected this data. We appreciate your missions, time and support!

Built With

  • automl
  • databricks
  • datetime
  • geopandas
  • h2o
  • io
  • math
  • matplotlib
  • netcdf4
  • numpy
  • open-source-libraries
  • pandas
  • pyspark
  • pysparkling
  • python
  • scipy
  • seaborn
  • sklearn
  • sql
  • urllib
  • xgboost