Our whole team was born in third-world countries, where we were always taught to think actively about water, whether that meant turning on the water heater 30 minutes before a shower or rationing water for the backyard because we didn't have natural rainfall. When we came to the United States, we got a culture shock: water heaters that run 24 hours a day, and the impact that may have on the environment. When we saw the CDP's climate challenges, we knew immediately that water risk was the challenge we were all passionate about. We were interested in the current state of water security in each city, and we wanted to see whether we could provide or gain any insight into whether cities were making strides in the right direction.
What it does:
Overall, the idea was to assess water security. However, with such a complex qualitative dataset, it was difficult to extract useful features. First, we implemented a keyword analyzer using 2018 CDP data to understand the context of the questionnaires. Then, using those keywords, we assessed water security based on the description that each city gave of itself. This was a step in the right direction: understanding self-assessment is a key factor in understanding whether water security is truly at risk.
To complement the qualitative data, we then pulled from Aqueduct's water stress dataset to get a quantitative measure of how each state was doing, which gave us a good comparison for our test data.
Additionally, we analyzed the corporate side by looking at sectors and their respective grades. We took the grades that corporations received and organized them into sector average grades to understand which groups of corporations are causing the problem and what route the CDP should take in the future to help fix those issues.
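The sector averaging described above could look roughly like the sketch below. It is not our exact code: the column names, the sample companies, and the numeric scale in GRADE_POINTS are all assumptions made up for illustration, since letter grades have to be mapped to numbers somehow before they can be averaged.

```python
import pandas as pd

# Assumed numeric scale for letter grades (hypothetical, not an official mapping).
GRADE_POINTS = {
    "A": 4.0, "A-": 3.7, "B": 3.0, "B-": 2.7,
    "C": 2.0, "C-": 1.7, "D": 1.0, "D-": 0.7, "F": 0.0,
}


def sector_average_grades(df):
    """Return the mean grade points per sector, lowest-scoring sector first."""
    points = df["grade"].map(GRADE_POINTS)  # unknown grades become NaN
    return (
        df.assign(points=points)
        .dropna(subset=["points"])          # skip rows without a usable grade
        .groupby("sector")["points"]
        .mean()
        .sort_values()
    )


# Invented sample rows, only to show the shape of the input.
companies = pd.DataFrame(
    {
        "company": ["Acme Oil", "Beta Gas", "Corner Shop", "Daily Mart", "Easy Buy"],
        "sector": ["Energy", "Energy", "Retail", "Retail", "Retail"],
        "grade": ["C", "D", "A-", "B", "B"],
    }
)
averages = sector_average_grades(companies)
```

Sorting from lowest to highest puts the sectors that most need attention at the top of the result.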
How we built it:
The first challenge was organizing and cleaning the data. To do this, we used pandas and several other data science libraries to clean the data and prepare it for the keyword analyzer. Next, we pulled the text from the 2018 CDP report and used the NLTK package to compute TF-IDF weights, which let us find the terms in each response and how important each one is relative to its context.
We then fed this information into the sklearn library to build three classification models: SVM, Stochastic Gradient Descent, and Naive Bayes. The classes came from the questionnaire, where each city self-assessed whether a risk was extremely serious, serious, or less serious. To measure the success of each model, we used three metrics: accuracy, recall, and F1 score. We used k-fold cross-validation with random shuffling to split our data into training and test sets.
With these results in hand, we used plotly, matplotlib, and seaborn to turn them into geospatial visualizations. We used a variety of representations, such as city points (on a US map), state lines (on a US map), and bar graphs.
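To make the TF-IDF step concrete, here is a minimal sketch in plain Python. It is a stand-in rather than our pipeline: it uses a simple regex tokenizer instead of a real NLP tokenizer, one common TF-IDF variant (term frequency times log(N / document frequency)), and three invented responses. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs (a stand-in for a real tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {
                # Term frequency scaled by inverse document frequency.
                term: (count / total) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return weights


# Invented responses, for illustration only.
responses = [
    "Reduced rainfall threatens the city water supply",
    "Low rainfall and drought put the reservoir at risk",
    "Snowfall in the mountains feeds the city water supply",
]
scores = tf_idf(responses)
# The highest-weighted terms of the first response act as its keywords.
top_terms = sorted(scores[0], key=scores[0].get, reverse=True)[:3]
```

A term that shows up in every response (like "the" here) gets a weight of zero, which is why TF-IDF surfaces distinctive words rather than merely frequent ones.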
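The setup above can be sketched with scikit-learn as follows. Treat it as an illustration under assumptions, not our actual code: the texts and labels are invented, LinearSVC is just one possible SVM variant, the metrics are macro-averaged, and the fold count of 3 is arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold, cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented risk descriptions and self-assessed labels, for illustration only.
texts = [
    "severe drought has emptied the main reservoir",
    "groundwater is nearly exhausted and wells are failing",
    "extreme water shortage forces emergency rationing",
    "critical supply failure expected within the year",
    "declining rainfall is straining the water supply",
    "aging pipes cause significant leakage across the city",
    "rising demand is putting pressure on the aquifer",
    "seasonal shortages affect several neighborhoods",
    "minor flooding occasionally affects storm drains",
    "water quality is stable with small seasonal changes",
    "supply is adequate with limited risk of shortage",
    "infrastructure upgrades have reduced most concerns",
]
labels = ["extremely serious"] * 4 + ["serious"] * 4 + ["less serious"] * 4

models = {
    "SVM": LinearSVC(),
    "SGD": SGDClassifier(random_state=0),
    "Naive Bayes": MultinomialNB(),
}
# k-fold cross-validation with random shuffling, as described above.
cv = KFold(n_splits=3, shuffle=True, random_state=0)
scoring = ["accuracy", "recall_macro", "f1_macro"]

results = {}
for name, model in models.items():
    # Refit TF-IDF inside each fold so test text never leaks into training.
    pipeline = make_pipeline(TfidfVectorizer(), model)
    scores = cross_validate(pipeline, texts, labels, cv=cv, scoring=scoring)
    results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}
```

Putting the vectorizer inside the pipeline matters with a dataset this small, because fitting it on all the text up front would let the test folds influence the vocabulary and weights.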
Challenges we ran into:
As with most qualitative datasets, our primary challenge was cleaning the data and extracting useful features. It took us a long time to find and organize the features we wanted for our classifier and to work out which ones truly worked well. Another challenge was that the risk classes were imbalanced and we had only a small number of data points. We used machine learning and data science techniques to work around these challenges, such as k-fold cross-validation to cope with the imbalanced data when training our model.
Accomplishments that we're proud of:
The keyword analyzer worked surprisingly well. It finds almost all the keywords we would expect it to find. For example, in one of the responses to the water security questions, the word "rainfall" received a much higher weight than the word "snowfall". Additionally, we were proud of our model's accuracy given the amount of data we had. Also, the visual appeal of our analysis lets the project cater to more business-facing and professional audiences.
What we learned:
We learned many new natural language processing (NLP) techniques, such as how to create a custom inverse document frequency file. This knowledge can definitely be extended to other applications, as NLP is a prominent field in today's market. While we have worked with machine learning before, in classes and on Kaggle, the datasets we usually get are very balanced and designed to work. Working with a real dataset extracted from questionnaires was a new challenge for us because we had to overcome more hurdles to get our models to work.
This was particularly useful, as it can definitely be applied to other projects. As developers, we often get stuck in the back-end and don't think about how our data is presented. Getting experience with geospatial data and actively thinking about how we want our data to be seen by businesses and consumers helped us develop our front-end and professional skills.
What's next for Challenge 2 - Team Jackbox
Because we spent so much of our time on fundamentals like the keyword analyzer, this work can be extended to many more NLP tasks. While the classifier can keep improving as the CDP grows, the keywords can also be used for things such as sentiment analysis, consumer reports, and question answering (QA). We can also expand our analysis beyond cities to target corporations and assess their risk at an individual level on matters other than water security.