Our approach
We tackled the problem in two phases. We started top-down: we assumed we knew roughly what the data would look like and built our models around that assumption. As we kept going, much of what we had in mind did not hold up, and the correlations we expected were hard to find in the data. We therefore switched to a bottom-up approach in which collecting data became our top priority.
The first dataset we scraped was the depression dataset from the CDC (https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2007), using Python's BeautifulSoup and Requests libraries. We collected the proportion of all respondents who chose the options "Several days", "More than half of the days", and "Nearly every day".
The same process was used to collect data on drug use in the United States.
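As a rough illustration of the proportion step described for the depression data, here is a minimal sketch. It assumes the scraped answers are available as a flat list of strings and that one proportion is kept per option; the function name `option_proportions` and the sample answers are hypothetical and not taken from our repo.

```python
from collections import Counter

# Answer options named in the write-up above.
SYMPTOM_OPTIONS = ("Several days", "More than half of the days", "Nearly every day")


def option_proportions(responses):
    """Return the share of all respondents who chose each symptom option."""
    counts = Counter(responses)
    total = len(responses)
    if total == 0:
        # No respondents: report zero for every option instead of dividing by zero.
        return {option: 0.0 for option in SYMPTOM_OPTIONS}
    return {option: counts[option] / total for option in SYMPTOM_OPTIONS}


# Made-up answers for illustration only; this is not real survey data.
sample = ["Not at all", "Several days", "Not at all", "Nearly every day"]
proportions = option_proportions(sample)
```

The denominator is the full list of respondents, including those who chose none of the three options, which matches "the proportion of all respondents".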
Song data came from several sources and was then cleaned and aggregated. The top songs of each year were taken from Billboard's top 100 list. Using each song's artist and title, we queried the Genius API for its lyrics. The lyrics were then run through NLTK's sentiment analysis to obtain positive, negative, neutral, and compound scores. Finally, the Spotify Web API was used to retrieve additional attributes of each year's top songs.
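The sentiment step could look roughly like the sketch below. The four values (positive, negative, neutral, compound) match what NLTK's VADER analyzer returns, so the sketch assumes VADER was the analyzer used. The helper names `vader_scores` and `average_scores` are hypothetical, and averaging over a list of lyric strings is an assumption about how the per-year figures were produced.

```python
from statistics import mean

SCORE_KEYS = ("neg", "neu", "pos", "compound")


def vader_scores(text):
    """Score one lyric string with NLTK's VADER analyzer.

    Needs the nltk package and a one-time nltk.download("vader_lexicon").
    """
    # Imported lazily so the averaging helper below works without NLTK installed.
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    return SentimentIntensityAnalyzer().polarity_scores(text)


def average_scores(lyrics, score_fn=vader_scores):
    """Average the neg/neu/pos/compound scores over a list of lyric strings."""
    if not lyrics:
        raise ValueError("need at least one lyric string")
    scores = [score_fn(text) for text in lyrics]
    return {key: mean(score[key] for score in scores) for key in SCORE_KEYS}
```

Passing the scorer in as `score_fn` keeps the averaging logic separate from NLTK, so a different analyzer could be swapped in without touching the aggregation.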
Analysis
The main idea of the project was to see whether popular songs and the messages they may carry affect the general population's mental health. We compared the average depression rate with the average positive and negative sentiment ratings of the top songs. Our full analysis is available in our GitHub repo.
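One way such a comparison could be made is a Pearson correlation over the years both series share. This is only a sketch under that assumption and not necessarily the method used in our repo; the function names and the year-keyed dictionaries are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def correlate_by_year(depression_by_year, sentiment_by_year):
    """Correlate two {year: value} series using only the years present in both."""
    years = sorted(set(depression_by_year) & set(sentiment_by_year))
    return pearson(
        [depression_by_year[year] for year in years],
        [sentiment_by_year[year] for year in years],
    )
```

Restricting the calculation to shared years matters here because the song data is yearly while the health data is not, so years without a depression figure are simply dropped.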
Reflection
We faced a lot of challenges, but one of the biggest was analyzing the data we scraped and finding the relationships, or lack thereof, between the datasets. What made this hard was that much of the survey data on depression and drug usage was collected only every two years. That leaves large gaps between data points, so our analysis of any relationship was less accurate than we would have liked. There was not much we could do about this, since we wanted to work with the CDC’s data and had to make do with what was available. With more time, we would have liked to gather more data and explore other relationships to see whether the datasets we created hold any further findings.