🚀 What Inspired Us
Every founder has spent time manually digging through Reddit trying to figure out if anyone actually wants what they're building. It works — but it's slow, incomplete, and impossible to quantify. You leave with a feeling, not a signal.
The spark came when we discovered the Academic Torrents Reddit dataset — a complete 15-year archive of hundreds of millions of posts and comments, freely available to download. That flipped the question from:
“How do I find relevant threads?”
to
“What if I could search all of them at once?”
The second insight was about thread longevity.
A 2019 post asking “why is time tracking so painful?” might still be collecting comments today because it ranks on Google. These aren’t just threads — they’re living documents of sustained demand. No one had surfaced them systematically.
🛠 How We Built It
We downloaded a targeted slice of Reddit data covering 28 high-signal subreddits, parsed the raw JSONL files, and reconstructed individual comment chains — each post plus one top-level comment thread — instead of full threads.
This mattered.
A 500-comment thread isn’t one conversation — it’s 20 parallel conversations.
Embedding the whole thing creates noise.
Embedding each chain creates precision.
Our stack:
- Postgres + pgvector for vector storage
- OpenAI embeddings for semantic indexing
- FastAPI backend with a semantic search endpoint
- An AI reranking layer that scores results by likelihood of genuine customer signal
The result: instant search across years of real demand.
🧱 The Biggest Challenges
1️⃣ Embedding Granularity
Our first approach embedded full threads.
The results felt random.
Switching to chain-level embeddings required a mid-build schema redesign — but the quality improvement was dramatic. Precision went up immediately.
2️⃣ Scale
Even just 28 subreddits produced tens of gigabytes of data.
To ingest, reconstruct, and embed everything overnight, we:
- Parallelised the pipeline across all four CPU cores
- Optimised reconstruction logic
- Monitored and babysat the process through the night
This wasn’t a toy dataset. It was infrastructure.
🧠 What We Learned
The dataset is the product.
The interface, the activity ratio, the agent — they all sit on top of a deeper insight:
Reddit’s historical archive is the largest free alternative dataset in existence —
and nobody has built the right interface to query it.
That’s what we built.
Built With
- embedding
- next.js
- python
Log in or sign up for Devpost to join the conversation.