Inspiration

As an avid Reddit user, I've been seeing many subreddits get swamped with samey content. Subreddits about universal hobbies or trends, like r/breadit, r/sewing, and r/learnmachinelearning, have it worst.

This is a tricky problem. On the one hand, having a wiki doesn't really help, since few people read it and everyone thinks their question is unique. On the other hand, repetitive posts dilute a subreddit and turn off other readers. No one wants to see the same questions asked every few days.

Enter Thread Tamer to solve the dilemma. Every time someone submits a new post, it searches for similar posts from the past 90 days and surfaces them in a stickied comment. Everyone gets directed to an archive of helpful posts, and no one gets hurt!

What it does

On app installation

  • Indexes all posts from the subreddit submitted within the last 90 days. I think this is a Goldilocks window: not so short that it misses helpful past posts, and not so long that it surfaces posts that are no longer relevant.
  • Sends a mod mail notification when indexing is complete

On post submit

  • Tokenizes the new post's title (lowercase, remove punctuation, filter stop words, stem tokens)
  • Computes TF-IDF-style similarity scores against existing indexed posts
  • If similar posts are found (similarity ≥ 30%), posts a stickied, distinguished comment listing up to 5 similar past posts with links
  • Indexes the new post after the similarity search
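
The match-selection step can be sketched roughly as below. This is only an illustration: the `ScoredPost` shape, the `buildCommentBody` name, and the comment wording are assumptions, and actually posting, stickying, and distinguishing the comment through Devvit is left out.

```typescript
// Hypothetical shape for a past post that has already been scored.
interface ScoredPost {
  title: string;
  url: string;
  score: number; // similarity in the range 0..1
}

const SIMILARITY_THRESHOLD = 0.3; // the 30% cutoff described above
const MAX_RESULTS = 5;            // "up to 5 similar past posts"

// Returns the comment text, or null when nothing clears the threshold.
function buildCommentBody(candidates: ScoredPost[]): string | null {
  const matches = candidates
    .filter((p) => p.score >= SIMILARITY_THRESHOLD)
    .sort((a, b) => b.score - a.score) // most similar first
    .slice(0, MAX_RESULTS);

  if (matches.length === 0) return null;

  // Assumed wording and Markdown link format, for illustration only.
  const lines = matches.map(
    (p) => `- [${p.title}](${p.url}) (${Math.round(p.score * 100)}% similar)`
  );
  return ["Similar posts from the last 90 days:", "", ...lines].join("\n");
}
```

Returning `null` keeps the "no similar posts" case explicit, so the caller can simply skip commenting.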

Daily cleanup (scheduled task)

  • Removes posts and token entries older than 90 days from the Redis index

How we built it

I started with the Comment Mop tool and changed it incrementally.

The core of the app is a lightweight TF-IDF (Term Frequency–Inverse Document Frequency) similarity engine built from scratch. When a new post is submitted, its title is:

  1. Tokenized — lowercased, stripped of punctuation, and filtered against a list of common stop words
  2. Stemmed — tokens are reduced to their root form so that "running" and "runs" are treated the same
  3. Scored — each token is weighted by how rare it is across all indexed posts (IDF), then compared against existing posts to produce a similarity score

Posts scoring 30% or above are considered a match.
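
Here is a minimal sketch of the three steps above, not the app's actual code. The stop-word list, the naive suffix-stripping `stem` function, and the exact scoring formula (IDF weight of shared tokens divided by IDF weight of all query tokens) are assumptions chosen to keep the example short.

```typescript
// Hypothetical stop-word list; the real list is not shown in this write-up.
const STOP_WORDS = new Set<string>([
  "a", "an", "the", "is", "my", "to", "for", "how", "do", "i", "why", "of", "in", "and", "not",
]);

// Naive suffix stripper standing in for a real stemmer (an assumption).
function stem(word: string): string {
  for (const suffix of ["ing", "ed", "s"]) {
    const rootLength = word.length - suffix.length;
    if (word.endsWith(suffix) && !word.endsWith("ss") && rootLength >= 3) {
      let root = word.slice(0, rootLength);
      const n = root.length;
      // Collapse a doubled final consonant so "runn" becomes "run".
      if (n > 3 && root[n - 1] === root[n - 2] && !"aeiou".includes(root[n - 1])) {
        root = root.slice(0, n - 1);
      }
      return root;
    }
  }
  return word;
}

// Lowercase, strip punctuation, drop stop words, stem, and de-duplicate.
function tokenize(title: string): string[] {
  const words = title.toLowerCase().replace(/[^a-z0-9\s]/g, " ").split(/\s+/);
  const tokens: string[] = [];
  for (const word of words) {
    if (word.length === 0 || STOP_WORDS.has(word)) continue;
    const token = stem(word);
    if (tokens.indexOf(token) === -1) tokens.push(token);
  }
  return tokens;
}

// Share of the query's IDF weight that also appears in the candidate post.
// docFreq maps a token to the number of indexed posts that contain it.
function similarity(
  queryTokens: string[],
  postTokens: string[],
  docFreq: Map<string, number>,
  totalPosts: number
): number {
  let shared = 0;
  let total = 0;
  for (const token of queryTokens) {
    // Smoothed IDF: rarer tokens get a larger weight.
    const idf = Math.log(1 + totalPosts / (1 + (docFreq.get(token) ?? 0)));
    total += idf;
    if (postTokens.indexOf(token) !== -1) shared += idf;
  }
  return total === 0 ? 0 : shared / total;
}
```

With this formula, a post whose title shares every token with the new title scores 1, and a post sharing none scores 0. Sharing a rare word counts for more than sharing a common one.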

All post data and token mappings are stored in Redis via Devvit's key-value store:

  • Each post is stored as a hash under post:<id> containing its title, URL, and creation time
  • Each token maps to a sorted set under token:<token>, with posts scored by their creation timestamp
  • A token:registry sorted set tracks all known tokens for efficient cleanup
  • A posts:count key tracks the total number of indexed posts
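
To make the layout concrete, here is a small in-memory model of those four structures. It deliberately uses plain `Map` objects instead of Devvit's Redis client, so the `InMemoryIndex` class and its method names are hypothetical stand-ins. Scoring registry entries by the latest post time is also an assumption.

```typescript
interface PostRecord {
  id: string;
  title: string;
  url: string;
  createdAt: number; // creation time in milliseconds
}

// In-memory stand-in for the Redis hashes, sorted sets, and counter.
class InMemoryIndex {
  hashes = new Map<string, Record<string, string>>();  // post:<id>
  sortedSets = new Map<string, Map<string, number>>(); // key -> member -> score
  postCount = 0;                                       // stands in for posts:count

  // Rough equivalent of adding one member with a score to a sorted set.
  private addToSortedSet(key: string, member: string, score: number): void {
    let members = this.sortedSets.get(key);
    if (!members) {
      members = new Map<string, number>();
      this.sortedSets.set(key, members);
    }
    members.set(member, score);
  }

  indexPost(post: PostRecord, tokens: string[]): void {
    this.hashes.set(`post:${post.id}`, {
      title: post.title,
      url: post.url,
      createdAt: String(post.createdAt),
    });
    for (const token of tokens) {
      // Each token's sorted set holds post ids scored by creation time.
      this.addToSortedSet(`token:${token}`, post.id, post.createdAt);
      // The registry remembers every token that has been seen.
      this.addToSortedSet("token:registry", token, post.createdAt);
    }
    this.postCount += 1;
  }
}
```

Scoring each token's members by creation time is what makes age-based cleanup cheap: everything below a cutoff timestamp can be dropped in one range operation.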

What happens if a post expires but its tokens are still stored in Redis? A daily cron job keeps the index lean by removing post and token entries older than 90 days. Tokens with no remaining associated posts are also pruned from the registry.
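
The cleanup pass could look something like the sketch below. It runs over plain in-memory maps rather than Devvit's Redis client, and the `Index` shape and `cleanup` name are hypothetical. The registry is also simplified to a `Set` here, although the app stores it as a sorted set.

```typescript
type SortedSet = Map<string, number>; // post id -> creation time in ms

// Hypothetical in-memory mirror of the Redis index.
interface Index {
  posts: Map<string, { title: string; createdAt: number }>; // post:<id>
  tokenSets: Map<string, SortedSet>;                         // token:<token>
  registry: Set<string>;                                     // token:registry
}

const MAX_AGE_MS = 90 * 24 * 60 * 60 * 1000; // 90 days

// Removes expired posts and token entries; returns how many posts were removed
// so the caller can decrement its post counter by the same amount.
function cleanup(index: Index, now: number): number {
  const cutoff = now - MAX_AGE_MS;
  let removedPosts = 0;

  // Snapshot the entries first so deleting while looping is safe.
  Array.from(index.posts.entries()).forEach(([id, post]) => {
    if (post.createdAt < cutoff) {
      index.posts.delete(id);
      removedPosts += 1;
    }
  });

  Array.from(index.registry).forEach((token) => {
    const members = index.tokenSets.get(token);
    if (members) {
      // Drop every member whose score (creation time) is before the cutoff.
      Array.from(members.entries()).forEach(([postId, createdAt]) => {
        if (createdAt < cutoff) members.delete(postId);
      });
    }
    // Prune tokens that no longer point at any post.
    if (!members || members.size === 0) {
      index.tokenSets.delete(token);
      index.registry.delete(token);
    }
  });

  return removedPosts;
}
```

Walking the registry instead of scanning every key is the reason it exists: the job only needs to visit tokens it knows about.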

Challenges we ran into

You may notice that I used four data structures! That's because the Redis client available in Devvit is a lot more limited and only supports a few core operations. However, it's an acceptable tradeoff for having Reddit host the app on my behalf, so I don't have to worry about hosting platforms.

Also, it took a lot of trial and error to come up with a practical but uncomplicated algorithm for detecting similar posts.

Accomplishments that we're proud of

I'm proud that I was able to build a workable search algorithm within the limits of the Reddit API and Devvit's Redis client. Somewhere, some subreddit users will have a more enjoyable time browsing Reddit with Thread Tamer.

What we learned

Constraint is the mother of creativity. Start small and build incrementally. Not everything needs to be complicated. In this case, I only check similarity on the post title, so there is no need to implement a full-blown TF-IDF engine.

What's next for Thread Tamer

No two subreddits are the same! I plan to add more configuration options, such as how far back to index past posts, which types of posts trigger a similarity search, and the maximum number of similar posts to display in the stickied comment.
