Inspiration
ProofForge started from a mistake we recognized in our own building process: we were moving fast, but not always proving the right thing.
It is easy for a student founder to feel productive. You can open a code editor, generate a roadmap, design a landing page, write a pitch, and call it progress. But none of that answers the question that matters most at the beginning:
Which belief could kill this idea if it turns out to be false?
That question became the foundation of ProofForge.
We imagined a student founder named Priya. She is 21, technical, excited about a startup idea from a class project, and working with almost no validation budget. She does not need another business-plan template. She does not need a chatbot that confidently says her idea is good. She needs to know what assumption is most dangerous, what evidence is missing, and what small test she can run before spending three weeks building the wrong product.
ProofForge was built for that moment.
It is a responsible AI workspace that helps student founders move from vague ideas to structured proof. Instead of helping users generate more tasks, it helps them make one better decision: what to test next.
ProofForge helps student founders think clearly before they build: AI structures the uncertainty, humans confirm what is true, and one honest test comes next.
Our belief is simple:
The best founders are not the ones who build the fastest. They are the ones who learn the truth fastest.
What It Does
ProofForge helps student founders and aspiring entrepreneurs prove one risky assumption before they build.
The product turns an uncertain startup idea into a guided proof workflow:
1. Plant Idea: The founder enters a messy idea, audience, constraints, goals, and available time.
2. Proof Map: AI converts the idea into a visual map of users, buyers, problems, assumptions, evidence gaps, and next actions.
3. Reality Scan: The system scans external signals, alternatives, competitors, and market context.
4. Assumption Arena: The riskiest assumptions are ranked using impact, uncertainty, cost-if-wrong, and evidence strength.
5. Commitment Lab: ProofForge proposes a small pilot, usage test, pricing test, or commitment test.
6. Build Slice: The founder defines the smallest artifact needed to test the assumption.
7. Pilot Evidence: The founder records real behavioral evidence from users.
8. Decision Gate: The founder chooses to Continue, Pivot, Run Another Test, or Stop.
The AI structures the thinking, but the human owns the decision.
ProofForge is not a roadmap generator. It is a proof discipline system. It helps founders stop asking, “What can I build?” and start asking, “What do I need to prove?”
Quick Summary
| Area | ProofForge Answer |
|---|---|
| User | Student founders and aspiring entrepreneurs |
| Core problem | Founders build before proving the riskiest assumption |
| Product | Responsible AI workspace for startup validation |
| AI role | Extracts assumptions, maps uncertainty, explains risk, suggests tests |
| Human role | Reviews, approves evidence, and owns the final decision |
| Key guardrail | AI output starts as approved: false |
| Final output | Proof Map, ranked assumption, pilot test, evidence log, Decision Gate, Founder Pack |
One-line pitch: ProofForge helps student founders think clearly before they build by turning messy startup ideas into structured proof, ranked assumptions, small tests, and human-owned decisions.
The Problem
Most early-stage founders do not fail because they cannot build.
They fail because they build before proving the right thing.
Early founders often operate with incomplete information. They are trying to answer questions like:
- Who exactly has this problem?
- Is the pain strong enough?
- Will anyone commit time, money, or attention?
- What existing behavior proves this matters?
- Which assumption should be tested first?
- What should be built now, and what should wait?
The existing workflow is usually scattered across chatbots, notes, spreadsheets, landing pages, and mentor feedback. That creates a new problem: the founder has information, but not a clear reasoning system.
A generic AI chatbot can produce confident advice. A spreadsheet can store assumptions. A landing page can measure interest. A mentor can give feedback. But none of these tools creates a complete proof loop that connects idea, assumption, evidence, test, and decision.
ProofForge solves that by turning startup uncertainty into a structured proof loop.
$$ \text{Idea} \rightarrow \text{Assumptions} \rightarrow \text{Evidence} \rightarrow \text{Pilot} \rightarrow \text{Decision} $$
The goal is not to help founders build more. The goal is to help them learn the truth sooner.
Why This Needs AI
ProofForge uses AI because early ideas are messy.
A founder rarely starts with clean data. They start with fragments: a user they think exists, a problem they think matters, a solution they want to build, and a fear they have not fully named.
A rules-only tool can ask form questions, but it cannot reason well over messy natural language. A generic chatbot can give advice, but it usually does not enforce evidence, approval, or decision discipline.
ProofForge combines both approaches.
The AI handles ambiguity:
- Extracting users, buyers, problems, and assumptions from raw text
- Suggesting proof-map cards
- Summarizing reality signals
- Explaining if-wrong impact
- Drafting pilot ideas
- Coaching the founder through uncertainty
The deterministic system handles control:
- Schema validation
- Risk scoring
- Evidence approval
- Phase gates
- Security checks
- Decision ownership
This split is the core technical and responsible AI design.
$$ \text{AI} = \text{reasoning support} $$
$$ \text{Rules} = \text{system control} $$
$$ \text{Founder} = \text{final judgment} $$
ProofForge uses AI where AI is strong and rules where trust matters most. AI can interpret, summarize, and suggest. The system validates, scores, gates, and protects. The founder decides.
What Makes ProofForge Different
ProofForge is different because it does not treat AI output as truth.
Many AI startup tools can generate ideas, business plans, landing pages, pitch decks, roadmaps, or growth strategies. But early founders do not mainly need more output. They need help deciding which belief is dangerous, what evidence is missing, and what test should happen next.
ProofForge is built around one responsible AI rule:
AI can propose, but humans must approve.
Every AI-generated card enters the system as a suggestion, not a fact. It cannot reduce risk, influence the evidence state, or become part of the founder’s proof system until the founder reviews and approves it.
That design makes ProofForge safer and more useful than a generic chatbot. It does not simply tell founders what to do. It creates a structured reasoning system where uncertainty is visible, assumptions are testable, and decisions stay human-owned.
| Generic AI Tool | ProofForge |
|---|---|
| Generates advice | Structures proof |
| Sounds confident | Shows uncertainty |
| Produces more tasks | Finds the riskiest assumption |
| Treats output like an answer | Treats output like a proposal |
| Pushes momentum | Protects founder judgment |
| Helps users build faster | Helps users learn faster |
ProofForge helps founders build less of the wrong thing and prove more of what matters.
How It Works
ProofForge follows a strict operating model:
Messy founder idea
↓
AI extraction
↓
Proof Map generation
↓
Human approval
↓
Reality Scan
↓
Risk scoring
↓
Commitment test
↓
Pilot evidence
↓
Human Decision Gate
↓
Founder Pack export
The most important product rule is this:
AI can propose evidence, but it cannot approve evidence.
Every AI-generated card enters the system as unapproved:
{
"aiProposed": true,
"approved": false
}
That small flag changed the whole product. It means AI suggestions do not silently become trusted facts. The founder must review and approve them before they influence the proof workflow.
This is the core safety layer of ProofForge. The AI is useful, but it is not the authority. The human review step prevents startup guesses from being treated as validated evidence just because they were written clearly by an AI system.
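A minimal sketch of how that flag could behave in code. Only the `aiProposed` and `approved` fields come from the snippet above; the `ProofCard` shape and the `proposeCard`, `approveCard`, and `trustedCards` helpers are hypothetical names used for illustration, not ProofForge's actual implementation.

```typescript
// Hypothetical card shape: only `aiProposed` and `approved` appear in the write-up.
interface ProofCard {
  id: string;
  text: string;
  aiProposed: boolean;
  approved: boolean;
}

// AI output can only ever create an unapproved proposal.
function proposeCard(id: string, text: string): ProofCard {
  return { id, text, aiProposed: true, approved: false };
}

// Approval is a separate step that must be attributed to a founder.
function approveCard(card: ProofCard, founderId: string): ProofCard {
  if (founderId.trim() === "") {
    throw new Error("Approval requires a founder identity");
  }
  return { ...card, approved: true };
}

// Scoring and phase gates would read only from this filtered view.
function trustedCards(cards: ProofCard[]): ProofCard[] {
  return cards.filter((card) => card.approved);
}

const draft = proposeCard("c1", "Student founders will use this before writing code.");
console.log(trustedCards([draft]).length); // 0

const reviewed = approveCard(draft, "founder-1");
console.log(trustedCards([reviewed]).length); // 1
```

Because there is no code path from `proposeCard` to an approved card, trust can only be granted through the explicit review step.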
Demo Walkthrough
In the demo, a founder starts with a messy startup idea and moves through the ProofForge loop:
1. Enter a raw idea: The founder describes the idea, audience, constraints, and goals in natural language.
2. Generate a Proof Map: ProofForge extracts the user, buyer, problem, assumptions, evidence gaps, and next actions.
3. Review AI proposals: AI-generated cards appear as unapproved suggestions. The founder must approve them before they become trusted evidence.
4. Run Reality Scan: ProofForge summarizes external signals, alternatives, competitors, and weak evidence.
5. Rank assumptions: The Assumption Arena identifies which belief is most dangerous if wrong.
6. Choose a commitment test: ProofForge suggests a small pilot, usage test, pricing test, or commitment test.
7. Record evidence: The founder logs real behavioral evidence from users.
8. Make a decision: The Decision Gate summarizes what is known, what is still uncertain, and lets the founder choose Continue, Pivot, Run Another Test, or Stop.
The demo shows the full product loop:
Idea → Proof Map → Approval → Reality Scan → Risk Ranking → Pilot → Evidence → Decision
Risk Scoring Model
We wanted the risk system to be transparent enough for a founder to understand, but structured enough to match how ProofForge actually works.
ProofForge does not let the AI directly decide whether an idea is good. Instead, it integrates AI reasoning, external signals, project memory, human approval, and deterministic scoring.
$$\mathcal{P}=\int_{\mathcal{C}}\left[\alpha A(c)+\beta S(c)+\gamma M(c)+\delta H(c)\right]\,dc$$
Where:
- 𝓟 = ProofForge proof state
- 𝓒 = founder context space
- A(c) = AI-extracted reasoning signal
- S(c) = external signal from Reality Scan
- M(c) = stored project memory
- H(c) = human approval signal
- α, β, γ, δ = weights for each signal
Each startup idea is converted into testable assumptions:
$$A_i=\Phi\left(\int_{\mathcal{C}}\left[\operatorname{User}_i(c)+\operatorname{Problem}_i(c)+\operatorname{Belief}_i(c)+\operatorname{Evidence}_i(c)\right]\,dc\right)$$
For each assumption, ProofForge calculates raw risk from impact, uncertainty, and cost-if-wrong:
$$\mathcal{R}(A_i)=\int_{0}^{1}I_i(t)U_i(t)C_i(t)\,dt$$
Where:
- Iᵢ(t) = impact if the assumption is wrong
- Uᵢ(t) = uncertainty
- Cᵢ(t) = cost-if-wrong
Evidence only reduces risk after founder approval:
$$\mathcal{E}(A_i)=\int_{0}^{1}W_i(t)H_i(t)\,dt$$
Here, \(W_i(t)\) is the evidence strength, \(H_i(t)=1\) if the founder approves the evidence, and \(H_i(t)=0\) if the evidence is not approved.
So unapproved AI output has no scoring power.
The final risk score is:
$$R_i=\mathcal{R}(A_i)\left(1-0.045\int_{0}^{1}W_i(t)H_i(t)\,dt\right)$$
Expanded:
$$R_i=\left(\int_{0}^{1}I_i(t)U_i(t)C_i(t)\,dt\right)\left(1-0.045\int_{0}^{1}W_i(t)H_i(t)\,dt\right)$$
The riskiest assumption is selected as:
$$A^{*}=\arg\max_{A_i\in\mathcal{A}}R_i$$
Then ProofForge chooses the cheapest useful test by minimizing test cost and maximizing expected evidence gain:
$$T^{*}=\arg\min_{T_i\in\mathcal{T}}\int_{0}^{1}\left[\operatorname{Cost}(T_i,t)-\operatorname{EvidenceGain}(T_i,t)\right]\,dt$$
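As a rough illustration of this selection rule, the sketch below picks the candidate with the lowest cost minus expected evidence gain. The `CandidateTest` shape, the `pickTest` helper, and the sample numbers are all assumptions made for this example; the write-up does not specify how cost or evidence gain are measured.

```typescript
// Hypothetical candidate test: cost and expected evidence gain share one arbitrary scale.
interface CandidateTest {
  name: string;
  cost: number;
  expectedEvidenceGain: number;
}

// Net cost of a test: lower is better.
function netCost(test: CandidateTest): number {
  return test.cost - test.expectedEvidenceGain;
}

// Returns the test with the lowest net cost, or undefined for an empty list.
function pickTest(tests: CandidateTest[]): CandidateTest | undefined {
  let best: CandidateTest | undefined;
  for (const test of tests) {
    if (best === undefined || netCost(test) < netCost(best)) {
      best = test;
    }
  }
  return best;
}

// Illustrative values only.
const candidates: CandidateTest[] = [
  { name: "Landing page", cost: 2, expectedEvidenceGain: 1 },
  { name: "3-day pilot", cost: 3, expectedEvidenceGain: 4 },
  { name: "Full MVP", cost: 9, expectedEvidenceGain: 5 },
];

console.log(pickTest(candidates)?.name); // "3-day pilot"
```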
The final decision is derived from the riskiest assumption, the risk score, the selected test, and approved evidence:
$$D=\Psi\left(A^{*},R_i,T^{*},\int_{0}^{1}W_i(t)H_i(t)\,dt\right)$$
$$D\in\{\text{Continue},\text{Pivot},\text{Run Another Test},\text{Stop}\}$$
Complete model:
$$\boxed{D=\Psi\left(\arg\max_{A_i\in\mathcal{A}}\left[\left(\int_{0}^{1}I_i(t)U_i(t)C_i(t)\,dt\right)\left(1-0.045\int_{0}^{1}W_i(t)H_i(t)\,dt\right)\right],T^{*},H\right)}$$
In simpler implementation terms, the same scoring logic becomes:
$$\text{rawScore}=I\times U\times C$$
$$R=\text{rawScore}\times(1-W\times0.045)$$
Example:
$$I=9,\quad U=5,\quad C=5,\quad W=2$$
$$\text{rawScore}=9\times5\times5=225$$
$$R=225\times(1-2\times0.045)$$
$$R=225\times0.91=204.75$$
This tells the founder that the assumption is still high-risk: evidence of strength 2 removes only 9% of the raw score, so the risk stays close to the raw value of 225.
The important part is not just the formula. The important part is the product behavior: AI-generated evidence cannot lower risk until a human approves it. That makes the scoring model transparent, useful, and safer for early-stage decision-making.
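The simplified formula translates almost directly into TypeScript. The sketch below is one possible reading, not the project's actual risk engine: the `Assumption` field names and the `riskScore` and `riskiestAssumption` helpers are hypothetical, while the 0.045 factor and the sample values come from the example above.

```typescript
// Hypothetical field names; the letters match the symbols in the formula.
interface Assumption {
  label: string;
  impact: number;            // I
  uncertainty: number;       // U
  costIfWrong: number;       // C
  evidenceStrength: number;  // W
  evidenceApproved: boolean; // H: true only after founder approval
}

const EVIDENCE_DISCOUNT = 0.045;

// R = (I * U * C) * (1 - W * 0.045), where unapproved evidence counts as W = 0.
function riskScore(a: Assumption): number {
  const rawScore = a.impact * a.uncertainty * a.costIfWrong;
  const countedEvidence = a.evidenceApproved ? a.evidenceStrength : 0;
  return rawScore * (1 - countedEvidence * EVIDENCE_DISCOUNT);
}

// The riskiest assumption is the one with the highest score (arg max).
function riskiestAssumption(list: Assumption[]): Assumption | undefined {
  return list.reduce<Assumption | undefined>(
    (best, a) => (best === undefined || riskScore(a) > riskScore(best) ? a : best),
    undefined,
  );
}

const approvedCase: Assumption = {
  label: "approved evidence",
  impact: 9, uncertainty: 5, costIfWrong: 5,
  evidenceStrength: 2, evidenceApproved: true,
};
const unapprovedCase: Assumption = {
  ...approvedCase,
  label: "unapproved evidence",
  evidenceApproved: false,
};

console.log(riskScore(approvedCase).toFixed(2));   // "204.75"
console.log(riskScore(unapprovedCase).toFixed(2)); // "225.00"
console.log(riskiestAssumption([approvedCase, unapprovedCase])?.label); // "unapproved evidence"
```

With identical inputs, the only thing that lowers the score is the approval flag, which is the behavior the guardrail is meant to guarantee.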
Evidence Model
ProofForge organizes founder thinking into four evidence zones.
| Zone | Meaning | Example |
|---|---|---|
| Know | Confirmed facts or approved evidence | “Five student founders reported delaying validation because they did not know what to test.” |
| Believe | Assumptions that may be true | “Student founders will use this before writing code.” |
| Learn | Open questions or evidence gaps | “Will founders complete a pilot without mentor pressure?” |
| Next | Actions that create evidence | “Run a 3-day pilot with 10 student founders.” |
This structure helps founders avoid one of the most common early-stage mistakes: treating a belief as if it were already proven.
It also makes the product easier to use. Instead of staring at a long AI-generated plan, the founder can see what is known, what is believed, what must be learned, and what action comes next.
Technical Architecture
ProofForge is built as a layered Next.js monolith.
We chose a layered monolith because the project needed one production deploy, fast iteration, and clean engineering boundaries. A microservice architecture would have added complexity without improving the proof workflow. A single unstructured app would have been faster at first, but it would have mixed AI calls, database logic, UI state, and domain rules too tightly.
The layered monolith gave us both speed and discipline.
| Layer | Responsibility | Main Tools |
|---|---|---|
| Presentation | User interface, Proof Map canvas, AI coach | Next.js, React, Tailwind CSS, React Flow |
| API | Request validation, routing, authentication checks | Next.js API routes, Zod |
| Application | Orchestrates product workflows | Extraction, Reality Scan, Map, Decision services |
| Domain | Pure business rules | Risk engine, phase gates, evidence states |
| Infrastructure | External systems and adapters | Supabase, Gemini, pgvector, Tavily, SSRF guard |
The most important architecture rule was:
Domain rules should not depend on Next.js, Supabase, or Gemini.
That made the risk engine, evidence rules, and phase gates easier to test and harder to accidentally bypass.
This architecture also helped us separate AI behavior from product truth. The AI can generate structured suggestions, but the domain layer decides what is valid, what is approved, what can influence scoring, and what phase the founder is allowed to enter.
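To make the idea of framework-free domain rules concrete, here is a small sketch of a phase gate written as a pure function. The phase names follow the workflow described earlier; the `ProjectState` shape, the `canEnterPhase` helper, and the specific rule that Reality Scan requires at least one approved card are assumptions for illustration.

```typescript
// Phase names follow the workflow in this write-up; the gate rules are illustrative assumptions.
const PHASES = [
  "Plant Idea", "Proof Map", "Reality Scan", "Assumption Arena",
  "Commitment Lab", "Build Slice", "Pilot Evidence", "Decision Gate",
] as const;

type Phase = (typeof PHASES)[number];

interface ProjectState {
  currentPhase: Phase;
  approvedCardCount: number;
}

interface GateResult {
  allowed: boolean;
  nextStep?: string; // what the founder can do to pass the gate
}

// Pure function: no imports from Next.js, Supabase, or Gemini.
function canEnterPhase(state: ProjectState, target: Phase): GateResult {
  const from = PHASES.indexOf(state.currentPhase);
  const to = PHASES.indexOf(target);

  // Rule 1: phases cannot be skipped.
  if (to > from + 1) {
    return { allowed: false, nextStep: `Complete "${PHASES[from + 1]}" first.` };
  }
  // Rule 2 (assumed): Reality Scan needs at least one founder-approved card.
  if (target === "Reality Scan" && state.approvedCardCount === 0) {
    return { allowed: false, nextStep: "Review and approve at least one Proof Map card." };
  }
  return { allowed: true };
}

console.log(canEnterPhase({ currentPhase: "Proof Map", approvedCardCount: 0 }, "Reality Scan").allowed); // false
console.log(canEnterPhase({ currentPhase: "Proof Map", approvedCardCount: 2 }, "Reality Scan").allowed); // true
```

Since the function takes plain data and returns plain data, it can be unit tested without a database, a model call, or a running server.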
AI Pipeline
ProofForge separates probabilistic AI reasoning from deterministic software control.
| Stage | Input | AI Role | System Control | Output |
|---|---|---|---|---|
| Ingestion | Raw idea text | Extract user, buyer, problem, and assumptions | Zod validates JSON | Structured idea brief |
| Proof Map | Extracted entities | Suggest cards and relationships | Cards start unapproved | Visual proof graph |
| Reality Scan | Idea and assumptions | Summarize signals and alternatives | Weak-signal labels | Evidence ledger |
| Assumption Arena | Assumption list | Explain if-wrong impact | TypeScript risk scoring | Ranked assumptions |
| Commitment Lab | Riskiest assumption | Suggest small tests | Founder selects test | Pilot protocol |
| Decision Gate | Evidence snapshot | Summarize tradeoffs | No AI default choice | Human decision |
This architecture keeps the AI useful without letting it become the authority.
The AI pipeline is designed to support the founder’s reasoning, not replace it. At every important point, ProofForge either validates the AI output, labels it as weak, requires human approval, or forces the final decision to stay with the founder.
Responsible AI
Responsible AI is not a separate feature in ProofForge. It is built into the workflow.
The main risk is over-reliance. A student founder may trust AI-generated advice because it sounds structured and confident. That is dangerous in startup validation because the data is incomplete, personal context matters, and the wrong decision can waste weeks of work.
ProofForge reduces this risk through concrete controls:
| Risk | Mitigation |
|---|---|
| AI sounds too confident | Outputs are framed as suggestions, not answers |
| Founder treats guesses as facts | AI cards start with approved: false |
| Weak web signals look stronger than they are | Reality Scan labels them as weak until confirmed |
| AI pushes one decision | Decision Gate shows equal-weight options |
| Invalid AI output enters the database | Zod schema validation blocks persistence |
| User skips proof steps | Phase gates enforce workflow order |
| Unsafe URL fetching | SSRF guard blocks localhost, private IPs, and metadata endpoints |
The final decision is always human-owned.
ProofForge does not choose Continue, Pivot, Run Another Test, or Stop. The founder must decide because that choice depends on constraints AI cannot fully own: time, motivation, ethics, team capacity, budget, personal risk, and long-term goals.
The responsible AI idea is simple:
AI should help the founder think better.
It should not become the founder’s source of truth.
Human-in-the-Loop Design
The most important human-in-the-loop moment is the Decision Gate.
At this point, the AI summarizes:
- What was assumed
- What was tested
- What evidence was collected
- What remains uncertain
- What tradeoffs exist
Then the founder chooses one of four options:
| Decision | When It Makes Sense | Next Action |
|---|---|---|
| Continue | Key assumptions have enough evidence | Build or scale the next version |
| Pivot | Evidence contradicts a core belief | Change the user, problem, offer, or channel |
| Run Another Test | Signals are still inconclusive | Design a sharper or cheaper test |
| Stop | A fatal flaw is confirmed | Archive the learning and avoid wasted effort |
The AI never preselects the option.
That design choice matters because ProofForge is not trying to replace founder judgment. It is trying to improve the evidence behind that judgment.
Human-in-the-loop design is not only a safety feature here. It is also a better product experience. Founders do not want to be told what to do by a black-box system. They want clearer reasoning, better evidence, and a decision they can trust.
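One way to encode "the AI never preselects the option" is to make the empty selection the only state the system can create. The sketch below assumes a hypothetical `DecisionGate` shape and helper names; only the four decision labels come from the table above.

```typescript
// The four options come from the write-up; everything else is a hypothetical shape.
const DECISIONS = ["Continue", "Pivot", "Run Another Test", "Stop"] as const;
type Decision = (typeof DECISIONS)[number];

interface DecisionGate {
  aiSummary: string;             // AI may summarize tradeoffs
  options: readonly Decision[];  // always the full list, in a fixed order
  selected: Decision | null;     // null until a founder chooses
  decidedBy: string | null;
}

// Opening the gate never sets a default choice.
function openDecisionGate(aiSummary: string): DecisionGate {
  return { aiSummary, options: DECISIONS, selected: null, decidedBy: null };
}

// A decision is only recorded when it is attributed to a founder.
function recordDecision(gate: DecisionGate, choice: Decision, founderId: string): DecisionGate {
  if (founderId.trim() === "") {
    throw new Error("A decision must be attributed to a founder");
  }
  return { ...gate, selected: choice, decidedBy: founderId };
}

const gate = openDecisionGate("Pilot signals are mixed; pricing is still untested.");
console.log(gate.selected); // null

const decided = recordDecision(gate, "Run Another Test", "founder-1");
console.log(decided.selected); // "Run Another Test"
```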
What We Built
During the hackathon, we built a working ProofForge MVP with:
- Idea intake
- AI extraction
- Proof Map generation
- Card approval states
- Reality Scan workflow
- Assumption ranking
- Commitment test generation
- Pilot evidence tracking
- Decision Gate
- Founder Pack export
- Authentication
- Private project storage
- Knowledge memory
- Production deployment
The product is designed around one repeated loop:
Propose → Review → Test → Record → Decide
That loop keeps the founder moving, but not blindly.
ProofForge is not just a prompt wrapper. It has product state, workflow gates, a scoring model, evidence approval, private project storage, and a complete decision cycle from idea to Founder Pack.
Accomplishments We Are Proud Of
We are proud that ProofForge became more than a polished interface around an AI prompt.
We built a full proof workflow with real product logic:
- A visual Proof Map for startup reasoning
- AI-generated cards that start unapproved
- Human approval before evidence affects the system
- Risk scoring based on impact, uncertainty, cost-if-wrong, and evidence strength
- A Reality Scan for external signals and alternatives
- A Commitment Lab for small validation tests
- A Pilot Evidence workflow for behavioral proof
- A Decision Gate where the founder, not the AI, chooses what happens next
- A Founder Pack export that preserves the reasoning process
The biggest accomplishment is that responsible AI is not just described in the submission. It is built into the product data model, workflow, and decision system.
Challenges We Faced
OAuth Redirect Loops
Authentication became more complex than expected. Google OAuth, Supabase Auth, local development, and the Vercel deployment all needed matching redirect behavior.
The issue was that Google redirects to Supabase first, then Supabase redirects back to the app. If those callback URLs are not separated correctly, users can get stuck in redirect loops or fail to receive session cookies.
We fixed this by separating the provider callback from the app callback and making the server exchange the auth code for a session.
This taught us that authentication is not just a login button. It is a chain of redirects, cookies, domains, and environment variables.
AI Output Sounded Too Final
Our first AI responses were helpful, but they sounded too authoritative.
That was a serious product problem. If ProofForge tells a founder what to do with too much confidence, it becomes the exact kind of AI system we were trying to avoid.
We solved this by changing the data model and workflow. AI-generated cards became proposals, not evidence. Every card had to be approved by the founder before it could influence the project.
This changed the product from “AI gives advice” to “AI prepares reasoning for human review.”
Zod Caught Bad AI Structure
Structured AI output is powerful, but it can fail in subtle ways. Sometimes the model returned valid-looking content that drifted away from the founder’s actual idea or included template contamination.
We added strict Zod validation and extraction sanitization. If the AI output did not match the schema, the system rejected it instead of saving broken reasoning into the project.
This made the product more reliable and made the AI safer to use inside a real workflow.
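The project uses Zod for this check. The sketch below is a dependency-free stand-in that shows the same reject-before-persist behavior; the `ExtractedIdea` fields are taken from the ingestion stage described earlier, but the exact schema, the `parseExtraction` helper, and the error messages are assumptions.

```typescript
// Assumed shape of the structured idea brief (user, buyer, problem, assumptions).
interface ExtractedIdea {
  user: string;
  buyer: string;
  problem: string;
  assumptions: string[];
}

type ParseResult =
  | { ok: true; value: ExtractedIdea }
  | { ok: false; errors: string[] };

function isNonEmptyString(value: unknown): value is string {
  return typeof value === "string" && value.trim().length > 0;
}

// Parses raw model output and returns either a typed value or a list of problems.
function parseExtraction(rawJson: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(rawJson);
  } catch {
    return { ok: false, errors: ["output is not valid JSON"] };
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return { ok: false, errors: ["expected a JSON object"] };
  }

  const record = data as Record<string, unknown>;
  const errors: string[] = [];
  for (const key of ["user", "buyer", "problem"] as const) {
    if (!isNonEmptyString(record[key])) errors.push(`${key} must be a non-empty string`);
  }
  const assumptions = record["assumptions"];
  if (!Array.isArray(assumptions) || assumptions.length === 0 || !assumptions.every(isNonEmptyString)) {
    errors.push("assumptions must be a non-empty array of strings");
  }
  if (errors.length > 0) return { ok: false, errors };

  return {
    ok: true,
    value: {
      user: record["user"] as string,
      buyer: record["buyer"] as string,
      problem: record["problem"] as string,
      assumptions: assumptions as string[],
    },
  };
}

const bad = parseExtraction('{"user": "student founders", "assumptions": []}');
console.log(bad.ok); // false, so nothing would be saved
```

A check like this only catches structural problems. Content that is well-formed but drifts from the founder's idea still needs the separate sanitization and human review steps.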
SSRF Protection Was Required
Reality Scan needed to work with external sources, but fetching URLs can create security risk. A user or model-generated URL could point to localhost, private networks, or cloud metadata endpoints.
We added an SSRF guard that blocks unsafe destinations before any scrape happens.
That was an important lesson: once AI can suggest or process URLs, security boundaries must be explicit.
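A simplified sketch of such a guard is shown below. It is not ProofForge's actual implementation: the `isBlockedHost` and `isSafeUrl` names are hypothetical, the check only inspects the URL itself, and it conservatively rejects every IPv6 literal. A production guard would also need to resolve DNS and re-check redirect targets, since a public hostname can still point to a private address.

```typescript
// Returns true for hosts that should never be fetched by a server-side scraper.
function isBlockedHost(hostname: string): boolean {
  // Strip IPv6 brackets and a trailing dot before comparing.
  const host = hostname.toLowerCase().replace(/^\[|\]$/g, "").replace(/\.$/, "");

  if (host === "localhost" || host.endsWith(".localhost")) return true;
  if (host === "metadata.google.internal") return true; // cloud metadata hostname
  if (host.includes(":")) return true; // conservative: reject all IPv6 literals

  const match = host.match(/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/);
  if (!match) return false; // a regular hostname; DNS checks would happen elsewhere

  const a = Number(match[1]);
  const b = Number(match[2]);
  return (
    a === 0 ||                          // "this network"
    a === 10 ||                         // 10.0.0.0/8 private
    a === 127 ||                        // loopback
    (a === 169 && b === 254) ||         // link-local, includes 169.254.169.254 metadata
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12 private
    (a === 192 && b === 168)            // 192.168.0.0/16 private
  );
}

// Only plain http(s) URLs with a non-blocked host are allowed through.
function isSafeUrl(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false;
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  return !isBlockedHost(url.hostname);
}

console.log(isSafeUrl("https://example.com/pricing"));              // true
console.log(isSafeUrl("http://localhost:3000/admin"));              // false
console.log(isSafeUrl("http://169.254.169.254/latest/meta-data/")); // false
```

Running the check before any fetch keeps the decision in deterministic code, regardless of whether the URL came from a user or from model output.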
The Hardest UX Problem Was Slowing Users Down
Most tools reward speed. ProofForge sometimes needs to tell a founder, “Not yet. You do not have enough evidence.”
That can feel frustrating unless the product gives a useful next step.
We solved this by making every gate action-oriented. Instead of blocking the user with a dead end, ProofForge shows what evidence is missing and what small test can create it.
What We Learned
The biggest lesson was that responsible AI is an engineering decision, not a disclaimer.
It is easy to add a warning that says AI can be wrong. It is much harder, and much more valuable, to design the system so AI cannot silently become the source of truth.
The approved: false flag became the clearest example of that lesson. It looks small, but it changes the power dynamic of the product. AI can help prepare the founder’s thinking, but the founder must decide what becomes accepted evidence.
We also learned that founders do not need more output. They need better focus.
A generic AI tool can generate endless tasks, strategies, and ideas. But early-stage founders are usually not suffering from a lack of possible actions. They are suffering from a lack of decision clarity.
ProofForge taught us this:
$$ \text{More Output} \neq \text{Better Thinking} $$
A useful AI system should sometimes reduce options, expose uncertainty, and force the user to test the belief that matters most.
We also learned that deterministic rules and AI reasoning work best together. The AI is strong at reading messy human language and explaining tradeoffs. The rule system is strong at enforcing gates, validating data, and keeping the workflow honest.
Most importantly, we learned that the best AI products do not always make users move faster. Sometimes they help users slow down at the right moment so they can make a better decision.
Impact
ProofForge helps student founders move from confusion to clarity to action.
$$ \text{Confusion} \rightarrow \text{Clarity} \rightarrow \text{Action} $$
The impact is not just faster planning. The impact is better decision-making.
For a student founder, time and confidence are limited resources. Spending three weeks building the wrong thing does not only waste time; it can make a founder lose momentum. ProofForge protects that momentum by helping users find the riskiest assumption, run a small pilot, collect evidence, and make a grounded decision.
That makes ProofForge useful beyond one project. It teaches a repeatable way to think:
Do not ask, “What can I build?”
Ask, “What do I need to prove?”
ProofForge can also support startup classes, incubators, student clubs, university entrepreneurship programs, and mentor-led founder teams. It gives everyone a shared language for assumptions, evidence, tests, and decisions.
The long-term impact is a healthier founder culture: less blind building, less AI overconfidence, and more evidence-driven progress.
Future Roadmap
Next, we want to make ProofForge stronger as a real founder workspace.
Team Collaboration
Student founders often work in teams. We want to add shared projects, roles, comments, and decision history so teams can reason from the same evidence instead of scattered notes.
Mentor Review
We want mentors, professors, and incubator leads to review Founder Packs and leave structured feedback tied to assumptions, pilots, and decisions.
Stronger Pilot Analytics
We want to improve how pilots track behavioral signals, commitment strength, user friction, and evidence quality.
Multi-Provider AI
We want to reduce dependency on a single AI provider by supporting multiple model backends while keeping the same deterministic control layer.
Proof History
We want to add Git-style history for assumptions, tests, pivots, and decision changes so founders can see how their thinking evolved over time.
Classroom and Incubator Mode
We want to support entrepreneurship classes, startup clubs, and incubator programs where mentors can review Founder Packs, compare assumptions across teams, and help students learn a repeatable validation process.
ProofForge is built around a simple rule:
Do not build until you know what you are trying to prove.
That rule matters for student founders. Budgets are small. Time is limited. Motivation is fragile. A fast build can feel like progress, but if the core assumption is wrong, speed only helps the founder reach the wrong answer faster.
ProofForge gives founders a structured way to slow down before they speed up. It uses AI to organize uncertainty, deterministic rules to enforce proof discipline, and human decision gates to keep ownership where it belongs.
The result is not just a startup tool.
It is a system for clearer thinking under uncertainty.
Build less. Prove more.
Built With
- ai
- google-gemini
- nextjs
- pgvector
- playwright
- postgresql
- react
- supabase
- tailwindcss
- typescript
- vercel
- vitest
- zod
