RabbitHole — Project Description


Inspiration

Every researcher knows the feeling: you have a rough idea, maybe a few papers that seem relevant, and a gnawing sense that someone might have already solved your problem — or that a specific corner of the field is completely untouched. Finding out which one is true takes weeks.

You trace citations by hand, open 40 tabs, build a mental model of how subfields relate, and eventually arrive at a hunch about where the gaps are. Then you repeat it when the hunch turns out to be wrong.

Tools like Connected Papers and ResearchRabbit made citation graphs beautiful. But they stopped there — you still have to do all the reasoning yourself. Elicit and Consensus let you ask questions over papers, but they answer what's known, not what's missing.

We wanted to close that loop. Not just "here is the graph of what exists" — but "here is what is missing, here is why it matters, and here is the most viable path forward." That reasoning layer does not exist as a product today. RabbitHole is our attempt to build it.


What It Does

RabbitHole takes a research topic and a small set of seed papers as input, and delivers a prioritized map of research gaps with suggested next steps.

The flow:

You drop in the exact names of your papers, or pull relevant papers from the global pool — the app resolves each title live so you know you've got the right paper. Before the pipeline even starts, Gemini reads your seed abstracts and asks 3–4 clarifying questions tailored specifically to your seeds and topic: "Are you focused on theoretical foundations or practical applications?" "Do you want recent work only, or historical evolution?" Your answers become intent constraints that shape every downstream decision.

Then the pipeline runs. Starting from your seeds, it crawls the citation graph outward via Semantic Scholar — both forward (papers that cite your seeds) and backward (papers your seeds cite) — up to 200 papers across 2 hops. That raw candidate pool goes through two pruning stages: a cosine similarity filter that scores every abstract against your seed centroid and drops the bottom 40%, then a Maximal Marginal Relevance re-ranking that balances relevance against diversity, ensuring a single popular sub-topic can't flood the analysis with 30 near-identical papers.

The ~90 survivors go to Gemini. It clusters them into research directions, names each cluster, identifies its strengths and weaknesses, evaluates whether the cluster is actively growing or stagnant, and then reasons across clusters to surface gaps — specific things the field has not addressed — ranked by research viability. Each gap maps to a concrete suggested research path grounded in specific papers.

The result is a force-directed citation graph (Cytoscape.js) with node size encoding citation count, color encoding cluster, and seed papers pinned and highlighted. A tabbed right panel shows cluster summaries, ranked gaps, and an open chat interface where you can drill down on anything. A transparency log at the bottom shows exactly what was kept, what was dropped, and why — every cutoff score, every λ value.

After the analysis you can annotate any paper with notes and tags, tweak the MMR λ parameter to shift the diversity/relevance balance, export the full report as PDF or Markdown, or trigger a partial re-run with new seeds or BFS parameters.


How We Built It

Backend — FastAPI + Python pipeline

The backend is a FastAPI application with a five-stage async pipeline running as a BackgroundTask. The stages are seed_fetch → bfs_crawl → cosine_filter → mmr_rerank → gemini_analysis. Stages communicate through MongoDB — each stage writes a checkpoint document on completion, so a failed pipeline can resume from the last successful stage without re-crawling. No data passes between stages as return values; the orchestrator passes only the session_id.
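The checkpoint-resume pattern described above can be sketched as follows. This is an illustrative sketch, not the actual implementation: a plain dict stands in for the MongoDB pipeline_checkpoints collection, and the stage functions are placeholders.

```python
# Sketch of checkpoint-based resume: each stage reads/writes MongoDB itself,
# so the orchestrator only passes the session_id and tracks completion.
STAGES = ["seed_fetch", "bfs_crawl", "cosine_filter", "mmr_rerank", "gemini_analysis"]

checkpoints = {}  # (session_id, stage) -> checkpoint document (stand-in for MongoDB)

def run_pipeline(session_id, stage_fns):
    """Run each stage in order, skipping stages that already checkpointed."""
    for stage in STAGES:
        if (session_id, stage) in checkpoints:
            continue  # resume: this stage already completed in a prior run
        stage_fns[stage](session_id)
        checkpoints[(session_id, stage)] = {"stage": stage, "done": True}

# Usage: a run that failed after bfs_crawl resumes without re-crawling.
ran = []
fns = {s: (lambda sid, s=s: ran.append(s)) for s in STAGES}
checkpoints[("s1", "seed_fetch")] = {"stage": "seed_fetch", "done": True}
checkpoints[("s1", "bfs_crawl")] = {"stage": "bfs_crawl", "done": True}
run_pipeline("s1", fns)
```

Because stages never pass data as return values, a resumed stage sees exactly the same MongoDB state a first run would.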

Paper data comes from Semantic Scholar's API first, with OpenAlex as a per-paper fallback. All external calls use tenacity for exponential backoff retry. BFS traversal uses NetworkX, stored as serialized graph JSON in the checkpoint. Cosine similarity runs via scikit-learn with a percentile-based cutoff (top 60%) rather than a fixed threshold, because fixed thresholds break across domains with different embedding distributions. Embeddings are Google's text-embedding-004 (768-dim), stored globally per paper so they're never recomputed across sessions.
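The percentile-based cutoff amounts to sorting by score and keeping a fixed fraction, which adapts to whatever distribution a given domain produces. A minimal sketch (the function and field names here are hypothetical, not the project's actual code):

```python
def percentile_filter(scores, keep_fraction=0.60):
    """Keep the top `keep_fraction` of papers by cosine score against the
    seed centroid. Percentile-based rather than a fixed threshold, so the
    cutoff adapts to each domain's score distribution."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    n_keep = max(1, round(len(ranked) * keep_fraction))
    cutoff = ranked[n_keep - 1][1]          # the score reported in the transparency log
    kept = {pid for pid, _ in ranked[:n_keep]}
    return kept, cutoff

scores = {"p1": 0.91, "p2": 0.84, "p3": 0.60, "p4": 0.42, "p5": 0.31}
kept, cutoff = percentile_filter(scores)
# keeps the top 60% of 5 papers (p1, p2, p3); cutoff is 0.60
```

The same code behaves sensibly whether a clinical topic's scores cluster around 0.8 or a materials-science pool spreads from 0.2 to 0.6.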

MMR re-ranking anchors its first selection to the seed centroid deterministically — eliminating the primary source of order-dependence. The λ parameter (relevance vs. diversity weight) is auto-set from the pre-flight intent answers and exposed as a user-adjustable slider.
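The deterministic anchoring can be seen in a compact MMR sketch (illustrative only; assumes embeddings as plain vectors and omits the project's actual data structures):

```python
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def mmr_select(candidates, centroid, k, lam=0.7):
    """Greedy MMR with a deterministic first pick: the paper closest to the
    seed centroid. `candidates` maps paper id -> embedding vector."""
    remaining = dict(candidates)
    first = max(remaining, key=lambda p: cos(remaining[p], centroid))
    selected = [first]                       # anchored pick: reproducible runs
    del remaining[first]
    while remaining and len(selected) < k:
        def score(p):
            relevance = cos(remaining[p], centroid)
            redundancy = max(cos(remaining[p], candidates[s]) for s in selected)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected
```

With a high λ the second pick follows relevance; with a low λ the redundancy penalty pulls in a paper from a different sub-topic, which is exactly the diversity/relevance trade the slider exposes.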

Gemini receives the full set of retained abstracts, cluster assignments, and intent constraints in one prompt call. Structured JSON output is enforced so we can parse clusters, gaps, and research paths reliably.

Progress streams to the frontend via Server-Sent Events. One asyncio.Queue per session; a None sentinel closes the connection when the pipeline finishes.
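The queue-and-sentinel lifecycle can be sketched like this (a simplified model, not the production code; in FastAPI the generator would feed a StreamingResponse):

```python
import asyncio

queues: dict[str, asyncio.Queue] = {}       # one queue per session

async def publish(session_id: str, event):
    """Pipeline side: push a progress event; None is the close sentinel."""
    await queues[session_id].put(event)

async def event_stream(session_id: str):
    """SSE endpoint side: yield events until the sentinel arrives."""
    q = queues[session_id]
    try:
        while True:
            event = await q.get()
            if event is None:               # sentinel: pipeline finished or failed
                break
            yield f"data: {event}\n\n"
    finally:
        queues.pop(session_id, None)        # cleanup runs on every exit path

async def demo():
    queues["s1"] = asyncio.Queue()
    out = []
    async def consume():
        async for chunk in event_stream("s1"):
            out.append(chunk)
    task = asyncio.create_task(consume())
    await publish("s1", {"stage": "bfs_crawl", "pct": 40})
    await publish("s1", None)
    await task
    return out

events = asyncio.run(demo())
```

The key property is that the `finally` block removes the queue whether the stream ends via the sentinel, a client disconnect, or an exception.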

Database — MongoDB Atlas + Atlas Vector Search

Five collections: papers (global cache with embeddings), citation_edges (global citation graph, upserted as world facts), sessions (retained paper scores, clusters, gaps, annotations, pruning report), session_chats (extracted to prevent unbounded session document growth), and pipeline_checkpoints (per-stage resilience). Every MongoDB read uses an explicit projection — full document reads are treated as bugs.

Frontend — Next.js + Cytoscape.js

The frontend is a Next.js App Router application in TypeScript. Cytoscape.js handles the citation graph with CoSE layout. The dashboard renders progressively — the graph becomes available after mmr_rerank completes, so users see nodes and edges while Gemini is still running. The right panel unlocks on pipeline_complete. Chat streaming uses a per-message EventSource. All API calls are typed and centralized in services/api.ts.

Deployment — Docker Compose

docker-compose up runs the full stack locally. API keys never leave the user's machine.


Challenges We Ran Into

Pruning quality was harder than expected. A fixed cosine similarity threshold that works for NLP papers is too aggressive for materials science and too loose for highly specific clinical topics. The switch to percentile-based cutoff (top 60% of the score distribution) solved this — it adapts to whatever the actual distribution looks like for a given paper set.

MMR order-dependence. A standard MMR implementation selects papers greedily, and the first pick determines every subsequent pick's diversity penalty. If that first pick is random, two identical runs can produce meaningfully different outputs. We fixed this by anchoring the first selection to the seed centroid — it's always deterministic, and every downstream pick is reproducible.

SSE lifecycle management. One queue per session, one sentinel to close it, and it has to close on both success and failure paths through the pipeline. Getting the cleanup right across reconnects, failed stages, and the orchestrator error boundary took several iterations.

Gemini output reliability. Asking Gemini to produce clusters, gaps, and research paths in a single structured JSON response occasionally produced malformed output on edge-case paper sets (single-topic seeds, very sparse citation graphs). We added structured output mode enforcement and a validation pass that falls back to a reduced prompt if the full response fails to parse.
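The validate-then-fall-back pattern can be sketched without any Gemini specifics. Here `REQUIRED_KEYS` and the callback shape are assumptions for illustration, not the project's actual schema:

```python
import json

REQUIRED_KEYS = {"clusters", "gaps", "research_paths"}   # hypothetical schema

def parse_analysis(raw: str):
    """Validate a structured JSON response; return None on any failure so
    the caller can retry with a reduced prompt."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    return data

def analyze_with_fallback(call_model):
    """call_model(reduced=...) -> raw response text. Try the full prompt
    once; if its output fails validation, retry with a reduced prompt."""
    result = parse_analysis(call_model(reduced=False))
    if result is None:
        result = parse_analysis(call_model(reduced=True))
    return result
```

Treating "malformed" and "schema-incomplete" identically keeps the fallback path simple: one boolean flag on the model call.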

MongoDB write patterns. Early implementations wrote retained papers incrementally — one document update per paper as MMR produced results. At 90 papers this caused write amplification and race conditions in the checkpoint logic. The fix was batch writes: the entire retained_papers array and retained_paper_ids list are written in one atomic $set after MMR completes.
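The batch write amounts to building one $set document after MMR finishes instead of ~90 incremental updates. A sketch of the update construction (field names are from the description above; the helper itself is hypothetical):

```python
def build_retention_update(retained):
    """Build the single atomic $set written after MMR completes, replacing
    the earlier one-update-per-paper pattern."""
    return {
        "$set": {
            "retained_papers": retained,                          # full score docs
            "retained_paper_ids": [p["paper_id"] for p in retained],
        }
    }

papers = [{"paper_id": "p1", "score": 0.91}, {"paper_id": "p2", "score": 0.84}]
update = build_retention_update(papers)
# one sessions.update_one({"_id": session_id}, update) call replaces ~90 writes
```

Because both fields land in a single update, the checkpoint logic can never observe a half-written retained set.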


Accomplishments That We're Proud Of

The intent capture system genuinely improves output quality. Gemini generating questions from the actual seed abstracts — not generic research questions — means the constraints are topically precise. "Your seeds span both sparse attention and linear attention variants — which direction is your primary focus?" is a meaningfully different question from "What is your research focus?" and it produces meaningfully different pruning weights.

The pruning transparency report exists, and it's detailed. Researchers are skeptical — rightfully — when an algorithm trims their candidate pool from 200 to 90 papers. Showing the full score distribution, the cutoff score, exactly which papers were dropped and why, makes the system auditable. That matters for research tooling.

Progressive graph rendering feels right in practice. Having the citation graph appear and be explorable while Gemini is still thinking fills a dead spot in the UX that would otherwise just be a loading spinner for 30+ seconds.

The checkpoint-based pipeline resilience works cleanly. A session that fails at gemini_analysis after a 3-minute BFS crawl can be retried from the Gemini stage in seconds. This was essential for building confidence during development.


What We Learned

Diversity in the pruned paper set matters more than relevance alone. Without MMR, Gemini's gap analysis consistently over-indexed on the most-cited cluster because it received 25 near-identical papers about one sub-topic and 3 papers about everything else. The diversity constraint produces a more balanced input and a more interesting analysis.

λ is not a tuning knob — it's a research intent signal. Setting it automatically from pre-flight answers ("exploratory" → 0.5, "focused" → 0.75) and exposing it as a user-adjustable slider rather than a hidden parameter turned it from an arbitrary hyperparameter into a meaningful tool.
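The intent-to-λ mapping is small enough to sketch directly. The 0.5 and 0.75 values come from the answers above; the fallback default and clamping are assumptions added for illustration:

```python
LAMBDA_BY_INTENT = {"exploratory": 0.5, "focused": 0.75}

def lambda_for(intent: str, user_override=None) -> float:
    """Resolve the MMR λ: the pre-flight intent sets the default, and the
    UI slider can override it. Fallback default (0.7) is an assumption."""
    if user_override is not None:
        return min(1.0, max(0.0, user_override))   # clamp slider input to [0, 1]
    return LAMBDA_BY_INTENT.get(intent, 0.7)
```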

Separating world facts from session facts at the data model level pays off immediately. Papers and citation edges are global — they are facts about the world, not facts about one user's session. Storing them globally means the second session that touches a paper gets it from cache with its embedding already computed. By the time you have 1,000 sessions, the cache hit rate on BFS traversal becomes significant.

The pre-flight step isn't overhead — it's signal extraction. The time spent asking clarifying questions before the pipeline runs saves time downstream by narrowing the BFS traversal and sharpening the cosine scoring centroid.


What's Next for RabbitHole

PathRAG (v2 pruning stage). PathRAG (Chen et al., 2025) models the citation graph as a resource-flow network and extracts the highest-signal relational paths between papers. As Stage 3 after MMR, it further reduces the paper set before the Gemini prompt and — more importantly — preserves the logical research lineage rather than treating papers as independent documents. The data model already anticipates this: citation_edges is a global collection designed for PathRAG traversal, and the checkpoint system slots a new pathrag_pruning stage in without touching anything upstream.

Session persistence (v1). Right now sessions live only as long as the user has the tab open. Adding name and user_id fields to the session document — the only schema change required — enables save and resume. Session history means a researcher can return to an analysis from last week, annotate further, and re-run with updated seeds as a new paper drops.

Research momentum scoring. The Gemini analysis already produces a momentum label (active vs. stagnant) with reasoning. The next step is computing this quantitatively from citation velocity data — papers per year, citations per year, recency of last publication — rather than relying on Gemini's inference from abstracts alone.
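One way such a quantitative score could look — purely a sketch of the idea, with weights, the three-year window, and the "stagnant" threshold all made-up assumptions:

```python
def momentum_score(pub_years, citations_per_year, current_year=2025):
    """Score a cluster's momentum from citation velocity.

    pub_years: publication year of each paper in the cluster
    citations_per_year: {year: citations received by the cluster that year}
    Weights, window, and threshold are illustrative assumptions.
    """
    if not pub_years:
        return 0.0, "stagnant"
    window = current_year - 3
    papers_velocity = len([y for y in pub_years if y >= window]) / max(1, len(pub_years))
    recent_cites = sum(c for y, c in citations_per_year.items() if y >= window)
    cite_velocity = recent_cites / max(1, sum(citations_per_year.values()))
    recency_penalty = min(1.0, (current_year - max(pub_years)) / 5)
    score = 0.5 * papers_velocity + 0.5 * cite_velocity - 0.3 * recency_penalty
    return round(score, 3), ("active" if score >= 0.4 else "stagnant")
```

A cluster publishing steadily with rising citations scores high; one whose last paper is a decade old bottoms out regardless of lifetime citation count, which is the signal Gemini currently has to infer from abstracts alone.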

Timeline view. A second visualization mode that plots the citation graph as a temporal arc — research directions by publication year — to show how the field evolved and which directions accelerated or stalled.

BibTeX / reference list as seed input. Right now seeds are individual DOIs or arXiv IDs. A researcher's existing bibliography or the references section of a paper they are reading is a natural and richer seed set. Parsing a .bib file or a copy-pasted reference list as the input method removes the friction of manual ID entry.
