Inspiration
Inspiration came from Charlotte Gils' CodeJam presentation, consultation with data scientists at Intact Insurance, and the team's personal experiences with Reddit. Sometimes you may be working on a really great post on your favourite subreddit, but feel you're missing something to tie the whole comment together. Or maybe you're on a subreddit for the first time and want to sound like you know what you're talking about. A quick glance at our cheat sheet can offer the necessary inspiration to turn your good post into a great one!
What it does
The Reddit Karma Cheat Sheet is an aggregation of data from 50 popular subreddits, offering information on which words are used, how often they appear, and the average rating of all comments that use each word. The website also gives a graphical visualization of the ratings associated with the 10 most common words. Linguists, Redditors, NLP enthusiasts and pop-culturalists alike will see some gut feelings confirmed and may even find a few surprises in the information we're offering... Did you know that in r/himym (the television series How I Met Your Mother) the 5 top-scoring words are the main characters' names, yet in r/starwars a character name doesn't appear in the top 10? (The first is "Darth," at rank 13.)
Scour the listing and see for yourself!
How we built it
The cheat sheet is produced by filtering and analyzing a publicly available repository of Reddit comments. This repo is available at github.com/linanqiu/reddit-dataset.
Knowing the dataset
The first step is exploring the dataset's structure, from the format of individual records to its compression and project layout. These basics can be worked out from an initial inspection of the tables, but more complicated properties, such as distributions, sentiments, and relative dataset size, only become visible after extraction and analysis.
The dataset is organized into 50 .csv files, 1 per subreddit. These .csv files have a variable number of columns, but always contain columns for author, post id, subreddit name, comment text, and comment rating.
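As a rough illustration of reading one of these files, the sketch below pulls the comment text and rating out of a CSV by column name, so that extra columns are simply ignored. The column names `text` and `score` and the sample rows are assumptions made for this example; the actual headers in the dataset may differ.

```python
import csv
import io

# Hypothetical column names; the real headers in the dataset may differ.
TEXT_COLUMN = "text"
SCORE_COLUMN = "score"


def read_comments(csv_file):
    """Yield (text, score) pairs, skipping rows whose score is not a number."""
    for row in csv.DictReader(csv_file):
        text = row.get(TEXT_COLUMN) or ""
        raw_score = (row.get(SCORE_COLUMN) or "").strip()
        try:
            score = int(raw_score)
        except ValueError:
            continue  # missing or malformed score: skip the row
        yield text, score


# Made-up sample standing in for one subreddit's .csv file.
sample = io.StringIO(
    "author,id,subreddit,text,score\n"
    "alice,c1,starwars,Darth Vader is great,12\n"
    "bob,c2,starwars,no score here,\n"
)
comments = list(read_comments(sample))
```

Looking columns up by name rather than by position is one way to cope with files that have a variable number of columns.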
Extraction, Filtering, and Aggregation
By opening each .csv file in Python, we're able to read in posts line by line, extracting the text from each one. Much of the data is inadmissible because it is unrepresentative, uninformative, malformed, or incomplete. During this step, we remove stop words (common function words such as pronouns and determiners) and words that are too long (above 15 characters, they're usually just web links), and we skip duplicate comments and comments without a number in the score column. Extraction is done on a per-subreddit, per-word basis.
for each subreddit
    for each comment c
        for each word w
            word_frequency[w]++
            if first time w encountered in c
                comment_frequency[w]++
                comment_score[w] += c.score
    enter results in subreddit.csv(word, frequency, comment frequency, comment score)
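The pseudocode above, together with the filtering rules, could be sketched in Python roughly as follows. This is a minimal sketch rather than our exact script: the tiny stop-word list, the whitespace tokenization, the helper names `tokenize` and `aggregate`, and the choice to detect duplicates by comparing comment text are all assumptions made for the example.

```python
import string
from collections import defaultdict

# Tiny illustrative stop-word list; a real run would use a much fuller one.
STOP_WORDS = {"the", "a", "an", "is", "it", "i", "you", "and", "of", "to"}
MAX_WORD_LENGTH = 15  # longer tokens are usually web links


def tokenize(text):
    """Split on whitespace, trim punctuation, drop stop words and long tokens."""
    words = []
    for token in text.lower().split():
        word = token.strip(string.punctuation)
        if word and word not in STOP_WORDS and len(word) <= MAX_WORD_LENGTH:
            words.append(word)
    return words


def aggregate(comments):
    """Aggregate (text, score) pairs for one subreddit into per-word counters."""
    word_frequency = defaultdict(int)     # total occurrences of each word
    comment_frequency = defaultdict(int)  # number of comments containing the word
    comment_score = defaultdict(int)      # summed score of those comments
    seen_texts = set()
    for text, score in comments:
        if text in seen_texts:
            continue  # skip duplicate comments (assumed: identical text)
        seen_texts.add(text)
        words = tokenize(text)
        for w in words:
            word_frequency[w] += 1
        for w in set(words):  # count each word once per comment
            comment_frequency[w] += 1
            comment_score[w] += score
    return word_frequency, comment_frequency, comment_score


# Made-up comments for illustration only.
sample_comments = [
    ("Darth Vader is the best, Darth!", 10),
    ("Vader returns", 4),
    ("Vader returns", 4),  # duplicate, skipped
]
word_frequency, comment_frequency, comment_score = aggregate(sample_comments)
```

Dividing `comment_score[w]` by `comment_frequency[w]` then gives the average rating of the comments that use a word.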
Presenting our findings
We are deploying our results in a visually friendly manner to a website hosted on Heroku. We use HTML5, JavaScript, CSS, and Node.js to fetch the CSV files stored on the server and present them in table and histogram formats.
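The site itself does this work in JavaScript, but to stay consistent with the extraction step, here is a hedged Python sketch of how the data behind the histogram could be derived from one results file: take the most frequent words and compute the average rating of the comments that use them. The header names (`word`, `frequency`, `comment_frequency`, `comment_score`), the function name `top_words`, and the sample rows are assumptions for illustration.

```python
import csv
import io


def top_words(results_file, n=10):
    """Return the n most frequent words with their average comment score."""
    rows = []
    for row in csv.DictReader(results_file):
        comment_frequency = int(row["comment_frequency"])
        if comment_frequency == 0:
            continue  # avoid dividing by zero on malformed rows
        rows.append({
            "word": row["word"],
            "frequency": int(row["frequency"]),
            "average_score": int(row["comment_score"]) / comment_frequency,
        })
    rows.sort(key=lambda r: r["frequency"], reverse=True)
    return rows[:n]


# Made-up results file for illustration only.
sample = io.StringIO(
    "word,frequency,comment_frequency,comment_score\n"
    "vader,2,2,14\n"
    "darth,3,1,10\n"
    "best,1,1,10\n"
)
top = top_words(sample, n=2)
```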
Challenges we ran into
Web technologies can be tricky! Our web app works great when hosted locally, but more research and expertise are necessary before attempting a full-featured web deployment.
Accomplishments that we're proud of
The data is interesting!
What we learned
Python, and just how complex NLP concepts can be.
What's next for Reddit Karma Cheat Sheet
Once our cheat sheet is deployed to the web with proper tables and charts, we can add deeper analysis and some interactive user activities:
Higher NLP: In its current state, the Reddit Karma Cheat Sheet simply presents aggregate scores. With logistic or linear regression, word meaning webs, and sentiment analysis APIs, we can infer much more information about each comment and offer a more nuanced approach to writing your Reddit comments!
The Perfect Comment Generator: Ever wonder what the perfect comment for r/drunk looks like? Using statistical NLP techniques, we have the data to generate the "perfect comment" which should - statistically speaking - achieve the highest possible rating! This concept can be extended with the help of the user: we give the key words, and they drag and drop them in order, adding stop words or other words as they like!
Rate My Comment: The user is given a post (mock or real) from a given subreddit. Their task is to write the comment that will score highest. With higher-level analysis, we can determine how well the comment matches the subject matter, whether the response makes sense, whether it is positive or personal, and much more!