🚀 What Inspired Us

Every founder has spent time manually digging through Reddit trying to figure out if anyone actually wants what they're building. It works — but it's slow, incomplete, and impossible to quantify. You leave with a feeling, not a signal.

The spark came when we discovered the Academic Torrents Reddit dataset — a complete 15-year archive of hundreds of millions of posts and comments, freely available to download. That flipped the question from:

“How do I find relevant threads?”
to
“What if I could search all of them at once?”

The second insight was about thread longevity.
A 2019 post asking “why is time tracking so painful?” might still be collecting comments today because it ranks on Google. These aren’t just threads — they’re living documents of sustained demand. No one had surfaced them systematically.


🛠 How We Built It

We downloaded a targeted slice of Reddit data covering 28 high-signal subreddits, parsed the raw JSONL files, and reconstructed individual comment chains — each post plus one top-level comment thread — instead of full threads.

This mattered.

A 500-comment thread isn’t one conversation — it’s 20 parallel conversations.
Embedding the whole thing creates noise.
Embedding each chain creates precision.

Our stack:

  • Postgres + pgvector for vector storage
  • OpenAI embeddings for semantic indexing
  • FastAPI backend with a semantic search endpoint
  • An AI reranking layer that scores results by likelihood of genuine customer signal

The result: instant search across years of real demand.


🧱 The Biggest Challenges

1️⃣ Embedding Granularity

Our first approach embedded full threads.
The results felt random.

Switching to chain-level embeddings required a mid-build schema redesign — but the quality improvement was dramatic. Precision went up immediately.

2️⃣ Scale

Even just 28 subreddits produced tens of gigabytes of data.

To ingest, reconstruct, and embed everything overnight, we:

  • Parallelised the pipeline across all four CPU cores
  • Optimised reconstruction logic
  • Monitored and babysat the process through the night

This wasn’t a toy dataset. It was infrastructure.


🧠 What We Learned

The dataset is the product.

The interface, the activity ratio, the agent — they all sit on top of a deeper insight:

Reddit’s historical archive is the largest free alternative dataset in existence —
and nobody has built the right interface to query it.

That’s what we built.

Built With

Share this project:

Updates