Inspiration

High school seniors might not have the opportunity to visit college campuses. College subreddits, which contain a wealth of information about their respective colleges, are a good second option.

What it does

Extracts common topics within each college subreddit (clustering), analyzes the general sentiment around each cluster (positive or negative), and displays the results by college.

How we built it

We used the Reddit API to pull comments, extracted noun phrases from those comments using spaCy, and performed K-means clustering to group the comments into topics. We then ran sentiment analysis on each noun phrase and averaged the scores to get the overall sentiment of each topic group. Finally, we took a random sample from each cluster to display on the website, censored the sample using a Python package, and colored each cluster by sentiment.
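The averaging, sampling, and coloring steps can be sketched roughly as follows. This is a minimal illustration, not our actual code: it assumes the cluster labels and per-phrase sentiment scores have already been computed, and the function name `summarize_clusters`, the score range, the zero threshold, and the two color names are all hypothetical choices made for the example.

```python
import random
from collections import defaultdict
from statistics import mean


def summarize_clusters(phrases, labels, scores, sample_size=3, seed=0):
    """Average sentiment per cluster, pick a display color, and sample phrases.

    phrases: noun phrases (assumed already extracted)
    labels:  cluster id for each phrase (assumed to come from K-means)
    scores:  sentiment score for each phrase (assumed negative = bad, positive = good)
    """
    grouped = defaultdict(list)
    for phrase, label, score in zip(phrases, labels, scores):
        grouped[label].append((phrase, score))

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    summary = {}
    for label, items in grouped.items():
        avg = mean(score for _, score in items)
        names = [phrase for phrase, _ in items]
        summary[label] = {
            "sentiment": avg,
            # Hypothetical coloring rule: non-negative average is shown as green.
            "color": "green" if avg >= 0 else "red",
            "sample": rng.sample(names, min(sample_size, len(names))),
        }
    return summary


# Toy input with made-up phrases and scores.
phrases = ["dining hall", "meal plan", "parking permit", "parking lot"]
labels = [0, 0, 1, 1]
scores = [0.5, 0.25, -0.5, -0.25]
summary = summarize_clusters(phrases, labels, scores, sample_size=1)
```

With the toy input, cluster 0 averages to a positive score and cluster 1 to a negative one, so they would be colored differently on the page.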

Challenges we ran into

The biggest challenge we faced was our original plan to fine-tune a BART summarization model to recognize a topic from our list of noun phrases. Unfortunately, we realized only in the last 12 hours that this was not feasible: the noun phrases were too disconnected, and summarization would not return a single topic word. This forced us to change direction and focus on clustering and sentiment analysis.

Accomplishments that we're proud of

We’re proud of extracting large amounts of data from an API and of implementing cluster analysis and sentiment analysis, considering that we hadn’t done any of this before.

What we learned

Through this project we gained a much deeper understanding of natural language processing and how it is used and implemented in applications. We also learned how to implement clustering to group data and how to perform basic sentiment analysis.

What's next for College Reddit Sentiment Analysis

Future improvements include pulling data from more sources and showing the user how much content was censored before display, for a more comprehensive understanding. Furthermore, we would look into creating an API that allows users to generate the analysis based on their own input.
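One way the censorship statistic could be reported is sketched below. This is only an assumption about how it might work: it does not use the censoring package from our pipeline, and the function `censor_with_stats`, the word pattern, and the blocklist are hypothetical stand-ins.

```python
import re

WORD = re.compile(r"[A-Za-z']+")  # simple word pattern, an assumption for this sketch


def censor_with_stats(text, blocked_words):
    """Mask blocked words with asterisks and return the fraction of words masked."""
    blocked = {word.lower() for word in blocked_words}
    words = WORD.findall(text)
    masked_count = sum(1 for word in words if word.lower() in blocked)

    def mask(match):
        word = match.group(0)
        return "*" * len(word) if word.lower() in blocked else word

    censored = WORD.sub(mask, text)
    fraction = masked_count / len(words) if words else 0.0
    return censored, fraction


# Toy example with a made-up blocklist.
censored, fraction = censor_with_stats("The darn parking is darn bad", ["darn"])
```

The returned fraction could be displayed next to each sample so users know how much of the original text was hidden.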
