Inspiration
AI projects are being built and launched faster than ever. Every day, people share new demos, hackathon submissions, open-source tools, startup prototypes, and AI agent projects that claim to use advanced systems like RAG, multi-agent workflows, MCP, computer vision, voice agents, or custom model routing.
But the people evaluating these projects — judges, sponsors, investors, recruiters, potential users, and even teammates — usually do not have time to manually inspect every repository and verify every technical claim.
A demo can look impressive. A README can sound convincing. A pitch can say the project uses multiple advanced technologies. But the real question is:
Is the claim actually backed by implementation evidence in the code?
We started BuildProof in the hackathon setting because this problem is especially visible there, but the broader problem is technical trust. As AI-generated projects and rapid prototypes become more common, we need faster ways to verify whether a project’s story matches its implementation.
What it does
BuildProof is an AI authenticity auditor that checks whether a project’s technical claims are supported by its GitHub code.
A user can provide a project description, Devpost-style writeup, README text, or GitHub repository. BuildProof extracts the project's technical claims, scans the GitHub repository, and gathers evidence from source files, dependencies, package files, file structure, and README content. It also records where an expected implementation signal is missing.
It then produces an authenticity report showing:
- What the project claims to do
- What evidence exists in the repository
- Whether the evidence comes from real source code, dependencies, file structure, README text, or absence of implementation
- A weighted authenticity score
- LLM judge reasoning
- TokenRouter / MiniMax-M3 powered claim extraction and judging
- Optional Anthropic comparison to show where judges agree or disagree
BuildProof is designed for hackathon judges, sponsors, investors, recruiters, and anyone who needs to quickly evaluate whether a technical project is real without spending 30 minutes manually reading the repo.
How we built it
BuildProof is built as a Next.js app with a TypeScript audit pipeline.
The pipeline has several stages:
First, BuildProof extracts technical claims from project descriptions, Devpost-style writeups, or README-style input. When an API key is configured, it uses TokenRouter with MiniMax-M3 for LLM-based claim extraction. When keys are missing, it falls back to a deterministic extraction path.
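The extraction stage with its fallback can be sketched roughly as follows. This is a minimal illustration, not BuildProof's actual code: the `Claim` shape, the keyword patterns, and the function names `extractClaimsDeterministic` and `extractClaims` are all assumptions made for the example.

```typescript
// Hypothetical sketch: categories, patterns, and names are illustrative assumptions.
type ClaimCategory = "mcp" | "rag" | "multi-agent";

interface Claim {
  category: ClaimCategory;
  text: string;
  extractedBy: "llm" | "fallback";
}

// Assumed keyword patterns used only by the deterministic path.
const CLAIM_PATTERNS: Record<ClaimCategory, RegExp> = {
  mcp: /\bmcp\b|model context protocol/i,
  rag: /\brag\b|retrieval[- ]augmented/i,
  "multi-agent": /\bmulti[- ]agent/i,
};

// Deterministic path: split the text into sentences and tag each sentence
// with every category whose pattern it matches.
function extractClaimsDeterministic(description: string): Claim[] {
  const sentences = description.match(/[^.!?]+[.!?]*/g) ?? [];
  const claims: Claim[] = [];
  for (const raw of sentences) {
    const text = raw.trim();
    if (text.length === 0) continue;
    for (const category of Object.keys(CLAIM_PATTERNS) as ClaimCategory[]) {
      if (CLAIM_PATTERNS[category].test(text)) {
        claims.push({ category, text, extractedBy: "fallback" });
      }
    }
  }
  return claims;
}

// Any LLM-backed extractor can be plugged in here; it is optional on purpose.
type LlmExtractor = (description: string) => Promise<Claim[]>;

// Use the LLM extractor when one is configured, otherwise (or on failure)
// fall back to the deterministic path so the audit still produces claims.
async function extractClaims(
  description: string,
  llm?: LlmExtractor,
): Promise<Claim[]> {
  if (!llm) return extractClaimsDeterministic(description);
  try {
    return await llm(description);
  } catch {
    return extractClaimsDeterministic(description);
  }
}
```

Keeping the LLM extractor as an optional argument means the same call site works with or without provider keys.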
Second, BuildProof scans the GitHub repository. It collects repository metadata, README content, package files, dependency information, file-tree signals, and source-file snippets.
Third, BuildProof runs implementation-signal detectors. We upgraded the detector layer from simple keyword matching to reusable implementation-signal analysis across MCP, RAG, and multi-agent claims. The detectors distinguish between weak README-only mentions and stronger implementation evidence such as dependencies, imports, tool registration patterns, retrieval code, vector database usage, agent orchestration patterns, and source-file usage.
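As a rough sketch of how a detector might separate README-only mentions from implementation evidence, consider the MCP case below. The `RepoSnapshot` and `DetectorSpec` shapes, the regular expressions, and `detectSignals` are hypothetical and chosen for illustration; the only real-world name is the `@modelcontextprotocol/sdk` npm package.

```typescript
// Hypothetical sketch: types, patterns, and names are assumptions for illustration.
type SignalSource = "source" | "package" | "readme" | "absence";

interface Signal {
  source: SignalSource;
  detail: string;
}

interface RepoSnapshot {
  readme: string;
  dependencies: string[];
  files: { path: string; content: string }[];
}

interface DetectorSpec {
  readmePattern: RegExp;
  packages: string[];
  sourcePatterns: RegExp[];
}

// Assumed MCP detector: a README keyword, the SDK package, and two source patterns.
const MCP_DETECTOR: DetectorSpec = {
  readmePattern: /\bmcp\b|model context protocol/i,
  packages: ["@modelcontextprotocol/sdk"],
  sourcePatterns: [
    /from\s+["']@modelcontextprotocol\/sdk/, // import of the SDK
    /\.(registerTool|tool)\s*\(/, // tool registration call
  ],
};

function detectSignals(repo: RepoSnapshot, spec: DetectorSpec): Signal[] {
  const signals: Signal[] = [];
  const wantedPackages = new Set(spec.packages);

  // Strongest signal: a source file that imports or uses the technology.
  for (const file of repo.files) {
    if (spec.sourcePatterns.some((pattern) => pattern.test(file.content))) {
      signals.push({ source: "source", detail: file.path });
    }
  }
  // Medium signal: the dependency is declared.
  for (const dep of repo.dependencies) {
    if (wantedPackages.has(dep)) signals.push({ source: "package", detail: dep });
  }
  // Weak signal: the README mentions it.
  if (spec.readmePattern.test(repo.readme)) {
    signals.push({ source: "readme", detail: "README mentions the technology" });
  }
  // Record absence explicitly instead of silently returning less evidence.
  const hasImplementation = signals.some(
    (s) => s.source === "source" || s.source === "package",
  );
  if (!hasImplementation) {
    signals.push({ source: "absence", detail: "no dependency or source usage found" });
  }
  return signals;
}
```

The same `DetectorSpec` shape could be filled in for RAG or multi-agent claims, which is one way to read "reusable" here.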
Fourth, BuildProof scores evidence with a weighted model. Source code evidence counts more than package.json evidence, which in turn counts more than file-tree evidence, and README-only evidence is treated as weaker than implementation evidence. Missing evidence is surfaced as absence evidence instead of being ignored.
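One way such a weighted model could look is sketched below. The numeric weights, the choice to score a claim by its strongest evidence, and the averaging step are all assumptions for illustration. The writeup only establishes the ordering of evidence sources, so README-only evidence is placed lowest among positive signals here as a guess.

```typescript
// Hypothetical sketch: the weights and aggregation rule are illustrative assumptions.
type EvidenceSource = "source" | "package" | "file-tree" | "readme" | "absence";

// Assumed weights on a 0-100 scale; only the ordering comes from the writeup.
const EVIDENCE_WEIGHTS: Record<EvidenceSource, number> = {
  source: 100,
  package: 60,
  "file-tree": 30,
  readme: 10,
  absence: 0,
};

interface ClaimEvidence {
  claim: string;
  sources: EvidenceSource[];
}

// A claim is scored by its strongest piece of evidence.
function scoreClaim(sources: EvidenceSource[]): number {
  return sources.reduce((best, s) => Math.max(best, EVIDENCE_WEIGHTS[s]), 0);
}

// The overall score is the rounded mean of the per-claim scores.
function authenticityScore(claims: ClaimEvidence[]): number {
  if (claims.length === 0) return 0;
  const total = claims.reduce((sum, c) => sum + scoreClaim(c.sources), 0);
  return Math.round(total / claims.length);
}

const example: ClaimEvidence[] = [
  { claim: "Uses RAG", sources: ["source", "readme"] }, // scores 100
  { claim: "Uses MCP", sources: ["readme", "absence"] }, // scores 10
];
console.log(authenticityScore(example)); // 55
```

Under these assumptions a project with one well-supported claim and one README-only claim lands in the middle of the scale rather than passing or failing outright.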
Finally, BuildProof uses an LLM judge to produce a verdict and rationale. TokenRouter / MiniMax-M3 is the primary model integration. We also added an optional judge comparison mode that runs Anthropic and TokenRouter on the same compressed evidence context and displays agreement or disagreement per claim.
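The per-claim comparison between two judges might be computed along these lines. This is a sketch under assumptions: the `Verdict` labels, the `JudgeVerdict` and `ComparisonRow` shapes, and the function names are hypothetical, and the model calls that would produce the verdicts are left out.

```typescript
// Hypothetical sketch: verdict labels, shapes, and names are illustrative assumptions.
type Verdict = "supported" | "partial" | "unsupported";

interface JudgeVerdict {
  claimId: string;
  verdict: Verdict;
}

interface ComparisonRow {
  claimId: string;
  primary: Verdict | null; // null when that judge returned nothing for the claim
  secondary: Verdict | null;
  agree: boolean;
}

// Line up two judges' verdicts per claim and flag agreement.
function compareJudges(
  primary: JudgeVerdict[],
  secondary: JudgeVerdict[],
): ComparisonRow[] {
  const primaryById = new Map<string, Verdict>();
  const secondaryById = new Map<string, Verdict>();
  for (const v of primary) primaryById.set(v.claimId, v.verdict);
  for (const v of secondary) secondaryById.set(v.claimId, v.verdict);

  // Collect claim IDs from both judges, keeping first-seen order.
  const claimIds: string[] = [];
  for (const v of [...primary, ...secondary]) {
    if (claimIds.indexOf(v.claimId) === -1) claimIds.push(v.claimId);
  }

  return claimIds.map((claimId) => {
    const p = primaryById.get(claimId) ?? null;
    const s = secondaryById.get(claimId) ?? null;
    // A missing verdict on either side never counts as agreement.
    return { claimId, primary: p, secondary: s, agree: p !== null && p === s };
  });
}

// Fraction of compared claims on which both judges agree.
function agreementRate(rows: ComparisonRow[]): number {
  if (rows.length === 0) return 0;
  return rows.filter((r) => r.agree).length / rows.length;
}
```

Treating a missing verdict as disagreement keeps the panel conservative: a claim only shows as agreed when both providers actually judged it.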
Challenges we ran into
One major challenge was avoiding a simple “LLM reads README and guesses” product. That would be easy to build, but not very trustworthy. We wanted BuildProof to ground its judgments in concrete repository evidence, so we had to build a real scanning and evidence pipeline.
Another challenge was evidence quality. A README mention is not the same as source code implementation. A dependency is not the same as actual usage. A file name is not the same as a working feature. We had to design a weighted scoring system and detector layer that treats different evidence sources differently.
We also had to make the app work even when LLM keys are missing. BuildProof includes deterministic fallback paths so the core audit still works without relying entirely on external model calls.
Another challenge was sponsor integration depth. We wanted TokenRouter to be more than a single API call, so we added provider abstraction, smoke tests, safe logging, fallback behavior, model configuration, and a judge comparison panel that lets users compare TokenRouter / MiniMax-M3 against Anthropic on the same evidence.
Accomplishments that we're proud of
We are proud that BuildProof became a real evidence-based audit pipeline instead of just a wrapper around an LLM.
Some highlights:
- End-to-end GitHub repo scanning
- TokenRouter / MiniMax-M3 integration for claim extraction and judging
- Optional Anthropic vs TokenRouter judge comparison
- Weighted authenticity scoring based on evidence strength
- Implementation-signal detectors for MCP, RAG, and multi-agent claims
- Safe fallback behavior when LLM providers are unavailable
- Compression-aware evidence context before judging
- A clear report UI showing claims, evidence, verdicts, scores, traces, and integration status
- A growing test suite covering provider selection, scoring, judge comparison, compression behavior, and implementation-signal detection
We are especially proud of the judge comparison feature because it lets users “audit the auditor” by seeing where two model providers agree or disagree.
What we learned
We learned that verifying AI project claims is harder than simply asking an LLM if something sounds real.
The most important part is evidence quality. A project can mention “RAG,” “multi-agent,” or “MCP” in a README, but the real question is whether the repository contains dependencies, imports, source usage, configuration files, and implementation patterns that support those claims.
We also learned that this problem is bigger than hackathons. As more AI prototypes, agent demos, and open-source projects appear, the cost of technical due diligence increases. People need faster ways to separate real implementation from polished descriptions.
Finally, we learned that transparency matters. Users should not just see a score; they should see why the score exists, what evidence was used, which provider judged it, and where the system was uncertain.
What's next for BuildProof
Next, we want to make BuildProof more useful for anyone who needs fast technical due diligence.
Planned improvements include:
- Deeper static analysis for more claim categories
- Better Devpost and project-page ingestion
- Stronger repository analysis beyond file snippets, including import graphs and AST-level signals
- More model comparison options through TokenRouter
- Better report sharing for judges, investors, recruiters, and teams
- A public benchmark of real, exaggerated, and unsupported project claims
- Optional CI integration so teams can run BuildProof before publishing or submitting a project
- Organization-level dashboards for reviewing many projects at once
Long term, we imagine BuildProof as a trust layer for technical demos: a fast way to check whether a project’s story is supported by the code behind it.
Built With
- anthropic
- api
- browserbase
- css
- github
- minimax-m3
- next.js
- node.js
- react
- tailwind
- tokenrouter
- typescript