🚀 What Inspired Us

Every founder has spent time manually digging through Reddit trying to figure out if anyone actually wants what they're building. It works — but it's slow, incomplete, and impossible to quantify. You leave with a feeling, not a signal.

The spark came when we discovered the Academic Torrents Reddit dataset — a complete 15-year archive of hundreds of millions of posts and comments, freely available to download. That flipped the question from:

“How do I find relevant threads?”
to
“What if I could search all of them at once?”

The second insight was about thread longevity.
A 2019 post asking “why is time tracking so painful?” might still be collecting comments today because it ranks on Google. These aren’t just threads — they’re living documents of sustained demand. No one had surfaced them systematically.

🛠 How We Built It

We downloaded a targeted slice of Reddit data covering 28 high-signal subreddits, parsed the raw JSONL files, and reconstructed individual comment chains — each post plus one top-level comment thread — instead of full threads.

This mattered.

A 500-comment thread isn’t one conversation — it’s 20 parallel conversations.
Embedding the whole thing creates noise.
Embedding each chain creates precision.

Our stack:

Postgres + pgvector for vector storage
OpenAI embeddings for semantic indexing
FastAPI backend with a semantic search endpoint
An AI reranking layer that scores results by likelihood of genuine customer signal

The result: instant search across years of real demand.