What is the problem you are solving?

We built a machine learning model that estimates when a Reddit comment was written based solely on its text, recognizing that Reddit language evolves over time through changing slang, memes, tone, and cultural references.

Why did you choose this problem?

Reddit's linguistic evolution over nearly two decades creates a fascinating classification challenge: the platform's culture has shifted dramatically, and we wanted to see whether these temporal patterns could be detected from text alone.

Briefly explain the models you experimented with to solve the problem.

We started with a simple classification model as a proof of concept, then moved to linear regression treating the year as a continuous variable, and finally pivoted back to multi-class classification with RoBERTa-base, experimenting with 2, 3, and 4 time-period bins.
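The binning setup behind the multi-class formulation can be sketched as a simple year-to-label mapping. Only the 2-bin boundaries (2008-2010 vs. 2020-2022) come from our reported results; the 3- and 4-bin splits below are illustrative placeholders, not our exact configuration.

```python
# Hypothetical sketch of mapping a comment's posting year to a class label.
# The 2-bin boundaries match our reported setup (2008-2010 vs. 2020-2022);
# the 3- and 4-bin splits are illustrative assumptions.
def year_to_bin(year: int, n_bins: int = 2) -> int:
    """Map a comment's posting year to a time-period class index."""
    if n_bins == 2:
        bins = [(2008, 2010), (2020, 2022)]
    elif n_bins == 3:
        bins = [(2008, 2010), (2014, 2016), (2020, 2022)]  # illustrative
    elif n_bins == 4:
        bins = [(2008, 2010), (2012, 2014), (2016, 2018), (2020, 2022)]  # illustrative
    else:
        raise ValueError(f"unsupported bin count: {n_bins}")
    for label, (lo, hi) in enumerate(bins):
        if lo <= year <= hi:
            return label
    raise ValueError(f"year {year} falls outside all bins")
```

Keeping a gap between bins (e.g. no 2011-2019 comments in the 2-bin setup) makes the classes more linguistically distinct, which is one reason fewer, wider-spaced bins are an easier target.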

What challenges did you face?

Linear regression required enormous datasets and training time while performance plateaued. Increasing the number of classification bins degraded accuracy significantly, requiring a careful balance between model performance and the number of time periods included in training.

What is the result you are most satisfied with?

Our 2-bin RoBERTa classifier achieved 85.3% accuracy, 87.3% precision, and an 85.5% F1 score when distinguishing 2008-2010 comments from 2020-2022 comments, demonstrating that temporal linguistic patterns are learnable.
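For context on how the reported metrics relate to each other, here is a minimal sketch of computing accuracy, precision, and F1 for a binary classifier from raw predictions. The example labels below are illustrative, not our actual model outputs.

```python
# Minimal sketch: binary accuracy/precision/F1 from true and predicted labels.
# Inputs here are illustrative; our real evaluation ran over held-out Reddit comments.
def binary_metrics(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, f1
```

Precision above accuracy, as in our results, indicates the model is slightly more conservative about the positive class: when it predicts that era, it is usually right, at the cost of missing some positives.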

If you were to continue this project, what would you explore/improve?

We would train on more data and epochs to improve the 3-bin and 4-bin classifiers, experiment with finer time granularity, and potentially explore what specific linguistic features the model learns to distinguish eras.

How would you recommend approaching similar projects?

Start with a simple formulation to validate the approach works, then gradually increase complexity.
