Inspiration
In 2025, NeurIPS, one of the premier machine learning conferences, was flooded with AI-generated slop submissions. Reviewers were overwhelmed: fabricated citations, exaggerated claims, and statistically impossible results slipped past overloaded peer review committees. The scale of the problem made it clear that manual review alone can no longer safeguard scientific integrity. We wanted to build a tool that could act as an automated first line of defence. It checks every citation against real databases and every claim against its cited source, so human reviewers can focus on substance instead of detective work.
What it does
CiteSite is an automated peer review verification system. You upload an academic PDF, and it:
- Extracts every citation from the bibliography, every empirical claim tied to a citation, and every statistical assertion (p-values, effect sizes, confidence intervals).
- Resolves each citation against Semantic Scholar, CrossRef, and OpenAlex to confirm the paper actually exists and to retrieve its abstract (a rough resolution sketch follows this list).
- Verifies each claim with chain-of-thought reasoning, checking whether the cited source actually supports what the paper says it does and flagging overstated, contradicted, or out-of-scope attributions.
- Reports an integrity score and renders the original PDF with colored inline annotations (like Grammarly, but for research papers). Click any highlight to see the verdict, confidence level, and explanation.
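For the curious, here's roughly what the resolution step looks like. The endpoints are the public Semantic Scholar, CrossRef, and OpenAlex search APIs; parameter names are simplified, and in CiteSite these calls actually go through Lava Gateway:

```python
# Illustrative sketch of resolving one citation against open bibliographic APIs.
# Parameter and field names are simplified; error handling is minimal.
import httpx

async def resolve_citation(title: str) -> dict | None:
    """Try Semantic Scholar, then CrossRef, then OpenAlex; return the first hit."""
    async with httpx.AsyncClient(timeout=10) as client:
        # Semantic Scholar Graph API: title search, pulling back the abstract
        r = await client.get(
            "https://api.semanticscholar.org/graph/v1/paper/search",
            params={"query": title, "fields": "title,abstract,year", "limit": 1},
        )
        data = r.json().get("data", []) if r.status_code == 200 else []
        if data:
            return data[0]

        # CrossRef: bibliographic query over works
        r = await client.get(
            "https://api.crossref.org/works",
            params={"query.bibliographic": title, "rows": 1},
        )
        items = r.json().get("message", {}).get("items", []) if r.status_code == 200 else []
        if items:
            return items[0]

        # OpenAlex: search over works
        r = await client.get("https://api.openalex.org/works",
                             params={"search": title, "per-page": 1})
        results = r.json().get("results", []) if r.status_code == 200 else []
        return results[0] if results else None  # None => unresolved citation gets flagged
```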
How we built it
The backend is a four-stage LangGraph pipeline with stateful checkpointing. Hermes 4 70B (Nous Research) powers all extraction tasks: parsing unstructured bibliography text into structured JSON, scanning the full paper for cited claims, and pulling out statistical assertions. Citations are resolved concurrently through Lava Gateway, which proxies requests to Semantic Scholar, CrossRef, and OpenAlex with usage metering. Claim verification uses Hermes with deep chain-of-thought reasoning to generate verdicts. The web interface is FastAPI on the backend with PDF.js on the frontend, rendering the actual PDF with precise coordinate-mapped highlights computed via PyMuPDF. Everything runs asynchronously with live progress tracking streamed to the UI.
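In broad strokes, the pipeline shape looks like this (node names, the state schema, and the stub functions are illustrative, not our exact code):

```python
# Rough shape of the four-stage LangGraph pipeline with checkpointing.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class ReviewState(TypedDict, total=False):
    pdf_path: str
    citations: list[dict]   # parsed bibliography entries
    claims: list[dict]      # cited claims and statistical assertions
    resolved: list[dict]    # citations matched to external records
    verdicts: list[dict]    # per-claim verification results and integrity score

def extract(state: ReviewState) -> ReviewState: ...   # Hermes 4 70B extraction
def resolve(state: ReviewState) -> ReviewState: ...   # concurrent lookups via Lava Gateway
def verify(state: ReviewState) -> ReviewState: ...    # chain-of-thought verdicts
def report(state: ReviewState) -> ReviewState: ...    # score + highlight coordinates

graph = StateGraph(ReviewState)
for name, fn in [("extract", extract), ("resolve", resolve),
                 ("verify", verify), ("report", report)]:
    graph.add_node(name, fn)
graph.add_edge(START, "extract")
graph.add_edge("extract", "resolve")
graph.add_edge("resolve", "verify")
graph.add_edge("verify", "report")
graph.add_edge("report", END)

# The checkpointer is what makes a long run resumable mid-pipeline
app = graph.compile(checkpointer=MemorySaver())
# app.invoke({"pdf_path": "paper.pdf"},
#            config={"configurable": {"thread_id": "run-1"}})
```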
Challenges we ran into
The hardest part was the idea itself: figuring out what form this tool should take. We knew the problem was real, but there were a hundred ways to approach it: a CLI tool, a dashboard, a browser extension. Eventually we landed on the Grammarly model, rendering the actual PDF and overlaying colored annotations inline, because it matches how reviewers already read papers: click a highlighted claim, read the explanation, all without leaving the document. Getting PyMuPDF to map extracted claim text back to precise PDF coordinates was also a significant engineering challenge, since academic PDFs have wildly inconsistent formatting, hyphenation, and column layouts.
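The happy path of that coordinate mapping is short; most of the real work is in normalizing hyphenated and column-wrapped text before the search. A minimal sketch (the function name and return format are ours for illustration):

```python
# Locate an extracted claim on a page with PyMuPDF and return highlight
# rectangles for the PDF.js frontend. Hyphenation/column handling omitted.
import fitz  # PyMuPDF

def locate_claim(pdf_path: str, page_number: int, claim_text: str) -> list[list[float]]:
    doc = fitz.open(pdf_path)
    page = doc[page_number]
    # search_for returns one rectangle per line the text spans,
    # which maps directly onto inline highlights across wrapped lines
    rects = page.search_for(claim_text)
    doc.close()
    return [[r.x0, r.y0, r.x1, r.y1] for r in rects]
```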
Accomplishments that we're proud of
We're proud that CiteSite correctly identified 8 out-of-scope citations and 4 overstated claims in a real, known-fraudulent submission that had passed initial review at a journal. The system gave it a 57/100 integrity score, flagging exactly the kinds of problems that human reviewers missed. We're also proud of the bibliography chunking system that handles reference lists of any size without output truncation, and the truncated-JSON salvage logic that recovers partial model output instead of failing, both of which came from hitting real edge cases during testing. Honestly, the fact that the annotated PDF viewer feels good to use (highlights, sidebar, click-to-explain) is something we didn't expect to nail in a hackathon.
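The salvage idea in a nutshell, as a simplified sketch (the production version is more defensive about nesting and escaping):

```python
# If the model's JSON array was cut off mid-element, keep everything up to the
# last fully closed object and re-terminate the array instead of discarding it all.
import json

def salvage_json_array(raw: str) -> list:
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        end = raw.rfind("}")
        if end == -1:
            return []
        try:
            return json.loads(raw[: end + 1] + "]")
        except json.JSONDecodeError:
            return []
```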
What we learned
We learned that LLMs are surprisingly good at structured extraction from messy academic text when you give them tight JSON schemas and good examples in the system prompt. We also learned the hard way that academic citation formats are a nightmare, and that chunking long bibliographies is essential to avoid truncation. On the product side, we learned that the interface matters as much as the model: a verification result is only useful if it's presented in context, right on the PDF where the reviewer is already looking.
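To make "tight JSON schema plus good examples" concrete, here is a toy version of the kind of extraction prompt we mean (the field names are placeholders, not our exact schema):

```python
# Illustrative system prompt for bibliography extraction: schema + one worked example.
SYSTEM_PROMPT = """You extract bibliography entries from raw reference-list text.
Return ONLY a JSON array, no prose. Each element must match:
{
  "index": <int>,
  "authors": [<string>, ...],
  "title": <string>,
  "venue": <string or null>,
  "year": <int or null>
}

Example input:
[12] A. Smith and B. Jones. "Verifying cited claims at scale." NeurIPS, 2024.

Example output:
[{"index": 12, "authors": ["A. Smith", "B. Jones"], "title": "Verifying cited claims at scale", "venue": "NeurIPS", "year": 2024}]"""
```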
What's next for CiteSite
We want to add full-text source retrieval (not just abstracts) for higher-confidence verdicts, support for batch processing so journals can screen entire submission queues, and a reviewer collaboration mode where multiple people can annotate and discuss flagged claims. Longer term, we see CiteSite integrating directly into submission portals like OpenReview, running automatically on every new submission as a pre-screening step before human review begins.
Built With
- fastapi
- hermes
- langgraph
- lava
- python
- semanticscholar