We have not had much experience analyzing datasets, so the challenge of analyzing data for a humanitarian cause, disaster relief, seemed like a great opportunity.
What it does
Our project consists of two main parts:
First, we pulled historical tweets posted during these catastrophes and rated how negative or positive they were. Using this sentiment score, we were able to make analytical claims about the relationship between general public opinion and the amount of funding per unit of damage done to these tweeters' areas.
We then gathered datasets for disaster relief funding and disaster statistics (about 9 datasets spanning roughly 600,000 entries). We used the DataScience module in a Jupyter Notebook, along with NumPy, sklearn, and matplotlib, to conduct the analysis. We looked for relationships between many factors, such as how a disaster's type relates to the funding it receives, and how its overall destruction level relates to the funding it receives. We plotted many of these relationships to determine which ones show a correlation. We also used the Twitter sentiment data to compare pre- and post-disaster sentiment and see how a large disaster may impact general discourse on Twitter.
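The core of that analysis boils down to checking how strongly funding tracks damage. A minimal sketch of one such check, using NumPy on toy data (the function name, column choices, and numbers here are illustrative, not our actual merged tables):

```python
import numpy as np

def funding_damage_correlation(damage, funding):
    """Pearson correlation between per-event damage and relief funding.

    Inputs are 1-D arrays of per-event totals. Rows with missing
    values are dropped, mirroring how we discarded sparse entries.
    """
    damage = np.asarray(damage, dtype=float)
    funding = np.asarray(funding, dtype=float)
    mask = ~(np.isnan(damage) | np.isnan(funding))  # keep complete rows only
    return np.corrcoef(damage[mask], funding[mask])[0, 1]

# Toy data: funding loosely tracks damage; the last row is a sparse entry.
damage = [1.0, 2.0, 3.0, 4.0, np.nan]
funding = [10.0, 22.0, 28.0, 41.0, 5.0]
r = funding_damage_correlation(damage, funding)
```

A value of `r` near 1 would indicate funding scales closely with damage; values near 0 suggest the factors are unrelated, which is what plots like ours help reveal at a glance.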
How we built it
We used python2 and python3 with getOldTweets to pull the Twitter post data, and Google's Sentiment Analysis API to generate the sentiment information.
For the data analysis, Jupyter Notebook was used in conjunction with Java Swing to visualize these plots in a clean GUI.
Challenges we ran into
Twitter Analysis: Initially, we wanted to use Tweepy to pull relevant tweets for specific locations and date ranges because we had previous experience with it. Assigning locations was a bit odd: rather than taking a location keyword as a string (e.g., "Melbourne"), Tweepy actually requires the coordinates of the top-left and bottom-right corners of a geographical bounding box around that location. We were able to retrieve this information from the Google Maps Places API. We then realized that Tweepy strictly pulls from the most recent tweets... there was no way to access past posts. We decided to pivot and try Twitter's official API, until we learned that it restricts access to tweets from only the past two weeks.
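Converting the Places API response into the corner pair Tweepy wants is a small pure transformation. A sketch, assuming the standard Places `viewport` shape with `northeast`/`southwest` points (the Melbourne numbers below are rough, illustrative values):

```python
def viewport_to_corners(viewport):
    """Turn a Google Maps Places 'viewport' dict into the two bounding-box
    corners a Tweepy-style location filter expects.

    viewport = {"northeast": {"lat": ..., "lng": ...},
                "southwest": {"lat": ..., "lng": ...}}
    Returns ((top_left_lat, top_left_lng),
             (bottom_right_lat, bottom_right_lng)).
    """
    ne, sw = viewport["northeast"], viewport["southwest"]
    top_left = (ne["lat"], sw["lng"])      # northernmost lat, westernmost lng
    bottom_right = (sw["lat"], ne["lng"])  # southernmost lat, easternmost lng
    return top_left, bottom_right

# Rough viewport for Melbourne, AU (illustrative numbers only).
melbourne = {"northeast": {"lat": -37.5, "lng": 145.5},
             "southwest": {"lat": -38.4, "lng": 144.5}}
tl, br = viewport_to_corners(melbourne)
```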
With some in-depth Google searches, we found a script on GitHub called getOldTweets that pulled historical tweets using a unique method. Unfortunately, the port of the code from python2 to python3 was incomplete: location searching was available, but date range queries weren't. Both features still existed in the python2 version. We decided to write and run the Twitter scraping in python2, and write everything else in python3.
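Bridging the two interpreters is straightforward: the python3 code just shells out to the python2 scraper. A sketch of how that handoff can look (the flag names follow the getOldTweets command-line script, but treat them as assumptions to check against your copy of the code):

```python
import subprocess

def build_got_command(query, since, until, near, max_tweets=600):
    """Assemble the python2 getOldTweets Exporter invocation.

    Date-range flags (--since/--until) only worked in the python2
    branch of the script, which is why we split the pipeline this way.
    """
    return [
        "python2", "Exporter.py",
        "--querysearch", query,
        "--since", since,
        "--until", until,
        "--near", near,
        "--maxtweets", str(max_tweets),
    ]

cmd = build_got_command("hurricane", "2017-08-25", "2017-09-05", "Houston")
# The python3 side then runs the python2 scraper as a subprocess:
# subprocess.run(cmd, check=True)
```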
Another challenge was that we are both computer science majors without much experience in data analysis. Dealing with datasets in the hundreds of thousands of entries was new territory for us, and it truly taught us a lot about how to handle data. For example, many entries had to be discarded for being unreasonable outliers or sparse rows. We also had to implement an efficient algorithm that could process all of this data in a reasonable time, which we did with a bit of dynamic programming plus concurrent processing to speed up dataset retrieval.
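The cleaning step can be sketched concretely. Below, outliers are dropped with a standard IQR rule (one common choice; the exact cutoffs we used were judgment calls per dataset), and chunks of a large table are cleaned concurrently:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def drop_outliers(values, k=1.5):
    """Discard entries farther than k * IQR outside the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    mask = (values >= q1 - k * iqr) & (values <= q3 + k * iqr)
    return values[mask]

def clean_chunks(chunks):
    """Clean many dataset chunks in parallel to keep runtime reasonable."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(drop_outliers, chunks))

# Toy column: 1000.0 is an unreasonable outlier and gets discarded.
data = [1.0, 2.0, 3.0, 2.5, 1000.0]
cleaned = drop_outliers(data)
```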
In addition, we had to figure out how to implement a Java GUI to show all of our data plots, so getting Java and Python to interface together was an interesting challenge.
Accomplishments that we're proud of
- Using a myriad of APIs to achieve the results we wanted
- Merging and filtering 1,000-element tables to create relevant datasets
- Applying data analysis skills we learned in class here at Penn State
What we learned
- How to pivot off of unsuccessful trials, building off the failures to slowly create something that works
- The Twitter and Google Maps APIs in general
- How to efficiently clean datasets to gather significant insights
- How to present our findings in a GUI
What's next for Insights into General Public Sentiment and Disaster Relief
A potential next step could be strengthening our understanding of this relationship with a much larger sample of data. Due to time limitations, we were only able to scrape ~600 tweets per event, not nearly the amount we think is necessary to judge public opinion.
We could also use these insights to help design a better structure for funding disaster relief.