Title:

Reddit Recommendation System

People:

Shuwen Wu(swu124), Yuntian Yang (yyang324), Yilin Miao(ymiao20), Yu Cao (ycao98)

Links

Our Check-in #2 reflection: https://docs.google.com/document/d/1A5lsZ7dkSVs4PGaF4srQ-JBlSKotOCinmVXC909EQmI/edit?usp=sharing

Our github repo: https://github.com/yl-miao/Reddit-Recommendation-System

Our final writeup: https://docs.google.com/document/d/1LDZriMmVjG_eL9ynUkcj063jQQp5AhIkARpuQrwXxq8/edit#

Introduction:

As avid Internet users, we have often found that the content recommended to us is not always particularly engaging. Beyond simply recommending content based on our past browsing history, there are other factors that can influence our decision to click on a link. With this in mind, we decided to build a deep learning-based recommendation system for Reddit communities. The system is based on a graph neural network (GNN) architecture called HinSAGE, which uses StellarGraph to handle heterogeneous graphs containing multiple node and edge types. The final recommendations are generated by solving a "link prediction" problem in the graph neural network: for a given user, we predict the probability that the user is linked to each subreddit node in our data. If our GNN model predicts a link between a user and a subreddit, we recommend that subreddit to the user. To further improve the performance of the recommendation system, we incorporated additional knowledge into the feature embeddings. For example, we fine-tuned an MBTI personality classifier based on DistilBERT as an additional feature extractor, as we believe that people's MBTI categories are closely related to the things they might be interested in. Additionally, we used KeyBERT to extract genre keywords that better describe each subreddit, which helped provide more accurate recommendations. Overall, our approach allowed us to generate more engaging and relevant recommendations for Reddit users.
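As a small illustration of how a predicted MBTI type can feed into feature embeddings (this is a hedged sketch, not our exact feature code; the `mbti_to_features` helper name is hypothetical), the four MBTI letter axes can be encoded as binary node features:

```python
def mbti_to_features(mbti: str) -> list:
    """Encode a 4-letter MBTI type (e.g. "INTJ") as four binary
    features, one per axis: E/I, S/N, T/F, J/P."""
    axes = [("E", "I"), ("S", "N"), ("T", "F"), ("J", "P")]
    feats = []
    for letter, (first, second) in zip(mbti.upper(), axes):
        if letter == first:
            feats.append(0)
        elif letter == second:
            feats.append(1)
        else:
            raise ValueError(f"unexpected letter {letter!r} in {mbti!r}")
    return feats

print(mbti_to_features("INTJ"))  # -> [1, 1, 0, 0]
```

A 4-dimensional binary vector like this can be concatenated onto a user node's other features before the GNN consumes them.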

Related Work

  • Deep Learning based Recommender System: A Survey and New Perspectives: https://arxiv.org/pdf/1707.07435.pdf
  • A Survey on deep learning based Point-of-Interest (POI) recommendations: https://www.sciencedirect.com/science/article/pii/S0925231221016106?casa_token=ex3BUWmVY5UAAAAA:SaJRFGQwsgUCKrKtoJ4cj3EPGo_NbaNoc2zcsrzrH_CI-9myIjayz--zvDhnRhazNLNWzAVm5w
  • Self-supervised Learning for Recommender Systems: A Survey: https://arxiv.org/pdf/2203.15876.pdf
  • A Personalized Subreddit Recommendation Engine: https://arxiv.org/pdf/1905.01263.pdf
  • Reinforcement Learning with External Knowledge and Two-Stage Q-functions for Predicting Popular Reddit Threads: https://arxiv.org/abs/1704.06217
  • Reddit Recommendation System: http://cs229.stanford.edu/proj2011/PoonWuZhang-RedditRecommendationSystem.pdf

Data

We use PRAW (the Python Reddit API Wrapper) to crawl information such as user activities, subreddits, posts, and comments from Reddit; more information about PRAW is available at https://praw.readthedocs.io/en/stable/index.html. We started from a public Reddit comment interaction dataset (Subreddit Interactions for 25,000 Users) and used our own PRAW-based crawler to enhance it, since the public dataset lacks features we need such as subreddit descriptions and, for Redditors, past posts, link karma, comment karma, is_mod, is_employee, and is_friend. Taking the subreddit and Redditor names from the public dataset, the crawler fetched these additional features, and we then used the combined data to generate the nodes and edges for our GNN recommendation system. For the MBTI classifier, we use two ready-made MBTI posts/type datasets from Kaggle with a total of about 100k users' data; the first consists of 430,000 posts authored by 8,600 people, together with each author's MBTI label. Dataset links: https://www.kaggle.com/datasets/datasnaek/mbti-type/code https://www.kaggle.com/datasets/zeyadkhalid/mbti-personality-types-500-dataset/code
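As a sketch of the edge-building step (the record format and the `build_edges` name are our illustrative assumptions, not part of PRAW's API), comment records gathered by the crawler can be aggregated into weighted Redditor–subreddit edges:

```python
from collections import Counter

def build_edges(comment_records):
    """Aggregate crawled (redditor, subreddit) comment records into
    weighted edges of the form (redditor, subreddit, n_comments)."""
    counts = Counter((r, s) for r, s in comment_records)
    return [(r, s, n) for (r, s), n in counts.items()]

records = [
    ("alice", "r/python"), ("alice", "r/python"),
    ("alice", "r/ml"), ("bob", "r/python"),
]
print(build_edges(records))
# -> [('alice', 'r/python', 2), ('alice', 'r/ml', 1), ('bob', 'r/python', 1)]
```

The resulting triples map directly to weighted edges between Redditor and Subreddit node types in the heterogeneous graph.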

Methodology

We feed the Redditor and Subreddit nodes and edges produced by the crawlers into the HinSAGE algorithm, which uses StellarGraph to extend GraphSAGE to heterogeneous graphs, i.e. graphs containing many different node and edge types. For example, the subreddit nodes contain features such as the name, keywords, the MBTI type classified from those words by our MBTI classifier, and link and comment karma. Each edge records a subreddit name, a Redditor name, and the number of comments that Redditor left in that subreddit. It took about 48 hours on CCV for our crawler to gather around 50K records. During inference, the top X subreddits are recommended to a given Redditor.
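A minimal sketch of the inference-time top-X step, assuming the GNN has already produced a link-probability score for each candidate subreddit for one Redditor (the `recommend_top_x` name and the score dictionary are hypothetical):

```python
def recommend_top_x(scores, x):
    """Given {subreddit: predicted link probability} for one Redditor,
    return the x highest-scoring subreddits."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [subreddit for subreddit, _ in ranked[:x]]

scores = {"r/python": 0.91, "r/ml": 0.74, "r/aww": 0.32, "r/news": 0.55}
print(recommend_top_x(scores, 2))  # -> ['r/python', 'r/ml']
```

In the real pipeline the scores would come from the trained HinSAGE link-prediction head rather than a hand-written dictionary.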

Metrics

Base goal: Build a deep learning-based recommendation system for Reddit posts. Extra goal: Build a deep learning-based MBTI classifier to predict users' MBTI types from their previous posts/comments, and apply it as a feature in our recommendation system. After training, both our error and our loss decreased, and we successfully augmented the data features by incorporating knowledge embeddings.

Ethics:

Deep learning is a good approach to recommender systems, as user interactions generate large amounts of data, and we can assume that certain user characteristics cause users to click on certain things; thus we can build a deep learning model that learns from this data. Specifically, we also considered attention-based models, since there are patterns and contexts in how a user interacts. The dataset is collected through a crawler. This could potentially violate user privacy, as a user might not be willing to become a training sample in a deep learning model without their knowledge. Besides, the dataset could be statistically biased, which requires careful examination when selecting data subjects.

Also, the Reddit dataset could be inherently biased, as people who tend to post on Reddit may be more likely to belong to certain MBTI categories.

Also, the pre-trained word embeddings could carry biases, as discussed in the homeworks.

Division of Labor

Together: the preliminary research & discussion.
Shuwen Wu: data processing, model setup.
Yuntian Yang: data processing, model setup.
Yilin Miao: data processing, model setup.
Yu Cao: data processing, model setup.

Built With

  • tensorflow