ProofForge

Inspiration

ProofForge started from a mistake we recognized in our own building process: we were moving fast, but not always proving the right thing.

It is easy for a student founder to feel productive. You can open a code editor, generate a roadmap, design a landing page, write a pitch, and call it progress. But none of that answers the question that matters most at the beginning:

Which belief could kill this idea if it turns out to be false?

That question became the foundation of ProofForge.

We imagined a student founder named Priya. She is 21, technical, excited about a startup idea from a class project, and working with almost no validation budget. She does not need another business-plan template. She does not need a chatbot that confidently says her idea is good. She needs to know which assumption is most dangerous, what evidence is missing, and what small test she can run before spending three weeks building the wrong product.

ProofForge was built for that moment.

It is a responsible AI workspace that helps student founders move from vague ideas to structured proof. Instead of helping users generate more tasks, it helps them make one better decision: what to test next.

ProofForge helps student founders think clearly before they build: AI structures the uncertainty, humans confirm what is true, and one honest test comes next.

Our belief is simple:

The best founders are not the ones who build the fastest. They are the ones who learn the truth fastest.

What It Does

ProofForge helps student founders and aspiring entrepreneurs prove one risky assumption before they build.

The product turns an uncertain startup idea into a guided proof workflow:

Plant Idea: The founder enters a messy idea, audience, constraints, goals, and available time.
Proof Map: AI converts the idea into a visual map of users, buyers, problems, assumptions, evidence gaps, and next actions.
Reality Scan: The system scans external signals, alternatives, competitors, and market context.
Assumption Arena: The riskiest assumptions are ranked using impact, uncertainty, cost-if-wrong, and evidence strength.
Commitment Lab: ProofForge proposes a small pilot, usage test, pricing test, or commitment test.
Build Slice: The founder defines the smallest artifact needed to test the assumption.
Pilot Evidence: The founder records real behavioral evidence from users.
Decision Gate: The founder chooses to Continue, Pivot, Run Another Test, or Stop.

The AI structures the thinking, but the human owns the decision.

ProofForge is not a roadmap generator. It is a proof discipline system. It helps founders stop asking, “What can I build?” and start asking, “What do I need to prove?”

Quick Summary

Area	ProofForge Answer
User	Student founders and aspiring entrepreneurs
Core problem	Founders build before proving the riskiest assumption
Product	Responsible AI workspace for startup validation
AI role	Extracts assumptions, maps uncertainty, explains risk, suggests tests
Human role	Reviews, approves evidence, and owns the final decision
Key guardrail	AI output starts as `approved: false`
Final output	Proof Map, ranked assumption, pilot test, evidence log, Decision Gate, Founder Pack

One-line pitch: ProofForge helps student founders think clearly before they build by turning messy startup ideas into structured proof, ranked assumptions, small tests, and human-owned decisions.

The Problem

Most early-stage founders do not fail because they cannot build.

They fail because they build before proving the right thing.

Early founders often operate with incomplete information. They are trying to answer questions like:

Who exactly has this problem?
Is the pain strong enough?
Will anyone commit time, money, or attention?
What existing behavior proves this matters?
Which assumption should be tested first?
What should be built now, and what should wait?

The existing workflow is usually scattered across chatbots, notes, spreadsheets, landing pages, and mentor feedback. That creates a new problem: the founder has information, but not a clear reasoning system.

A generic AI chatbot can produce confident advice. A spreadsheet can store assumptions. A landing page can measure interest. A mentor can give feedback. But none of these tools creates a complete proof loop that connects idea, assumption, evidence, test, and decision.

ProofForge solves that by turning startup uncertainty into a structured proof loop.

$$ \text{Idea} \rightarrow \text{Assumptions} \rightarrow \text{Evidence} \rightarrow \text{Pilot} \rightarrow \text{Decision} $$

The goal is not to help founders build more. The goal is to help them learn the truth sooner.

Why This Needs AI

ProofForge uses AI because early ideas are messy.

A founder rarely starts with clean data. They start with fragments: a user they think exists, a problem they think matters, a solution they want to build, and a fear they have not fully named.

A rules-only tool can ask form questions, but it cannot reason well over messy natural language. A generic chatbot can give advice, but it usually does not enforce evidence, approval, or decision discipline.

ProofForge combines both approaches.

The AI handles ambiguity:

Extracting users, buyers, problems, and assumptions from raw text
Suggesting proof-map cards
Summarizing reality signals
Explaining if-wrong impact
Drafting pilot ideas
Coaching the founder through uncertainty

The deterministic system handles control:

Schema validation
Risk scoring
Evidence approval
Phase gates
Security checks
Decision ownership

This split is the core technical and responsible AI design.

$$ \text{AI} = \text{reasoning support} $$

$$ \text{Rules} = \text{system control} $$

$$ \text{Founder} = \text{final judgment} $$

ProofForge uses AI where AI is strong and rules where trust matters most. AI can interpret, summarize, and suggest. The system validates, scores, gates, and protects. The founder decides.

What Makes ProofForge Different

ProofForge is different because it does not treat AI output as truth.

Many AI startup tools can generate ideas, business plans, landing pages, pitch decks, roadmaps, or growth strategies. But early founders do not mainly need more output. They need help deciding which belief is dangerous, what evidence is missing, and what test should happen next.

ProofForge is built around one responsible AI rule:

AI can propose, but humans must approve.

Every AI-generated card enters the system as a suggestion, not a fact. It cannot reduce risk, influence the evidence state, or become part of the founder’s proof system until the founder reviews and approves it.

That design makes ProofForge safer and more useful than a generic chatbot. It does not simply tell founders what to do. It creates a structured reasoning system where uncertainty is visible, assumptions are testable, and decisions stay human-owned.

Generic AI Tool	ProofForge
Generates advice	Structures proof
Sounds confident	Shows uncertainty
Produces more tasks	Finds the riskiest assumption
Treats output like an answer	Treats output like a proposal
Pushes momentum	Protects founder judgment
Helps users build faster	Helps users learn faster

ProofForge helps founders build less of the wrong thing and prove more of what matters.

How It Works

ProofForge follows a strict operating model:

Messy founder idea
        ↓
AI extraction
        ↓
Proof Map generation
        ↓
Human approval
        ↓
Reality Scan
        ↓
Risk scoring
        ↓
Commitment test
        ↓
Pilot evidence
        ↓
Human Decision Gate
        ↓
Founder Pack export

The most important product rule is this:

AI can propose evidence, but it cannot approve evidence.

Every AI-generated card enters the system as unapproved:

{
  "aiProposed": true,
  "approved": false
}

That small flag changed the whole product. It means AI suggestions do not silently become trusted facts. The founder must review and approve them before they influence the proof workflow.

This is the core safety layer of ProofForge. The AI is useful, but it is not the authority. The human review step prevents startup guesses from being treated as validated evidence just because they were written clearly by an AI system.
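
A minimal TypeScript sketch of this rule follows. Only the `aiProposed` and `approved` flags come from the product; the `ProofCard` shape and the helper names are hypothetical and chosen for illustration.

```typescript
// Hypothetical card shape; only `aiProposed` and `approved` appear in the write-up.
interface ProofCard {
  id: string;
  text: string;
  aiProposed: boolean;
  approved: boolean;
}

// AI output always enters the system as an unapproved proposal.
function proposeFromAI(id: string, text: string): ProofCard {
  return { id, text, aiProposed: true, approved: false };
}

// Only an explicit founder action flips the flag. Returns a new card, no mutation.
function approveByFounder(card: ProofCard): ProofCard {
  return { ...card, approved: true };
}

// Downstream steps (scoring, phase gates) should read only approved cards.
function trustedCards(cards: ProofCard[]): ProofCard[] {
  return cards.filter((card) => card.approved);
}

const draft = proposeFromAI("c1", "Student founders will use this before writing code.");
const reviewed = approveByFounder(draft);

console.log(trustedCards([draft]).length);    // 0: the proposal is not trusted yet
console.log(trustedCards([reviewed]).length); // 1: trusted only after founder approval
```

In this sketch there is no code path where the AI sets `approved` to `true`; approval is a separate function that only the review step would call.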

Demo Walkthrough

In the demo, a founder starts with a messy startup idea and moves through the ProofForge loop:

Enter a raw idea: The founder describes the idea, audience, constraints, and goals in natural language.
Generate a Proof Map: ProofForge extracts the user, buyer, problem, assumptions, evidence gaps, and next actions.
Review AI proposals: AI-generated cards appear as unapproved suggestions. The founder must approve them before they become trusted evidence.
Run Reality Scan: ProofForge summarizes external signals, alternatives, competitors, and weak evidence.
Rank assumptions: The Assumption Arena identifies which belief is most dangerous if wrong.
Choose a commitment test: ProofForge suggests a small pilot, usage test, pricing test, or commitment test.
Record evidence: The founder logs real behavioral evidence from users.
Make a decision: The Decision Gate summarizes what is known, what is still uncertain, and lets the founder choose Continue, Pivot, Run Another Test, or Stop.

The demo shows the full product loop:

Idea → Proof Map → Approval → Reality Scan → Risk Ranking → Pilot → Evidence → Decision

Risk Scoring Model

We wanted the risk system to be transparent enough for a founder to understand, but structured enough to match how ProofForge actually works.

ProofForge does not let the AI directly decide whether an idea is good. Instead, it integrates AI reasoning, external signals, project memory, human approval, and deterministic scoring.

$$\mathcal{P}=\int_{\mathcal{C}}\left[\alpha A(c)+\beta S(c)+\gamma M(c)+\delta H(c)\right]\,dc$$

Where:

𝓟 = ProofForge proof state
𝓒 = founder context space
A(c) = AI-extracted reasoning signal
S(c) = external signal from Reality Scan
M(c) = stored project memory
H(c) = human approval signal
α, β, γ, δ = weights for each signal

Each startup idea is converted into testable assumptions:

$$A_i=\Phi\left(\int_{\mathcal{C}}\left[\operatorname{User}_i(c)+\operatorname{Problem}_i(c)+\operatorname{Belief}_i(c)+\operatorname{Evidence}_i(c)\right]\,dc\right)$$

For each assumption, ProofForge calculates raw risk from impact, uncertainty, and cost-if-wrong:

$$\mathcal{R}(A_i)=\int_{0}^{1}I_i(t)U_i(t)C_i(t)\,dt$$

Where:

Iᵢ(t) = impact if the assumption is wrong
Uᵢ(t) = uncertainty
Cᵢ(t) = cost-if-wrong

Evidence only reduces risk after founder approval:

$$\mathcal{E}(A_i)=\int_{0}^{1}W_i(t)H_i(t)\,dt$$

Here, \(W_i(t)\) is the strength of the evidence, \(H_i(t)=1\) if the founder approves the evidence, and \(H_i(t)=0\) if the evidence is not approved.

So unapproved AI output has no scoring power.

The final risk score is:

$$R_i=\mathcal{R}(A_i)\left(1-0.045\int_{0}^{1}W_i(t)H_i(t)\,dt\right)$$

Expanded:

$$R_i=\left(\int_{0}^{1}I_i(t)U_i(t)C_i(t)\,dt\right)\left(1-0.045\int_{0}^{1}W_i(t)H_i(t)\,dt\right)$$

The riskiest assumption is selected as:

$$A^{*}=\arg\max_{A_i\in\mathcal{A}}R_i$$

Then ProofForge chooses the cheapest useful test by minimizing test cost and maximizing expected evidence gain:

$$T^{*}=\arg\min_{T_i\in\mathcal{T}}\int_{0}^{1}\left[\operatorname{Cost}(T_i,t)-\operatorname{EvidenceGain}(T_i,t)\right]\,dt$$

The final decision is derived from the riskiest assumption, the risk score, the selected test, and approved evidence:

$$D=\Psi\left(A^{*},R_i,T^{*},\int_0^1 W_i(t)H_i(t)\,dt\right)$$

$$D\in\{\text{Continue},\text{Pivot},\text{Run Another Test},\text{Stop}\}$$

Complete model:

$$\boxed{D=\Psi\left(\arg\max_{A_i\in\mathcal{A}}\left[\left(\int_{0}^{1}I_i(t)U_i(t)C_i(t)\,dt\right)\left(1-0.045\int_{0}^{1}W_i(t)H_i(t)\,dt\right)\right],T^{*},H\right)}$$

In simpler implementation terms, the same scoring logic becomes:

$$\text{rawScore}=I\times U\times C$$

$$R=\text{rawScore}\times(1-W\times0.045)$$

Example:

$$I=9,\quad U=5,\quad C=5,\quad W=2$$

$$\text{rawScore}=9\times5\times5=225$$

$$R=225\times(1-2\times0.045)$$

$$R=225\times0.91=204.75$$

This tells the founder that the assumption is still high-risk because the evidence is weak.

The important part is not just the formula. The important part is the product behavior: AI-generated evidence cannot lower risk until a human approves it. That makes the scoring model transparent, useful, and safer for early-stage decision-making.
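
The simplified scoring above can be sketched in TypeScript as follows. This is an illustration under stated assumptions, not the production risk engine: the `Assumption` and `CandidateTest` shapes, the field names, and the helper names are hypothetical, and the cost and gain units are arbitrary. Only the formula and the 0.045 constant come from the model above.

```typescript
// Hypothetical input shape; field names are illustrative.
interface Assumption {
  label: string;
  impact: number;            // I
  uncertainty: number;       // U
  costIfWrong: number;       // C
  evidenceStrength: number;  // W
  evidenceApproved: boolean; // H: true only after founder approval
}

const EVIDENCE_DISCOUNT = 0.045; // constant taken from the formula above

function riskScore(a: Assumption): number {
  const rawScore = a.impact * a.uncertainty * a.costIfWrong;
  // Unapproved evidence contributes nothing (H = 0).
  const countedEvidence = a.evidenceApproved ? a.evidenceStrength : 0;
  return rawScore * (1 - countedEvidence * EVIDENCE_DISCOUNT);
}

// arg max over the assumption list; assumes the list is non-empty.
function riskiest(assumptions: Assumption[]): Assumption {
  return assumptions.reduce((worst, a) => (riskScore(a) > riskScore(worst) ? a : worst));
}

// Hypothetical test candidate with cost and expected evidence gain on the same scale.
interface CandidateTest {
  name: string;
  cost: number;
  evidenceGain: number;
}

// arg min of (cost - evidenceGain); assumes the list is non-empty.
function cheapestUsefulTest(tests: CandidateTest[]): CandidateTest {
  return tests.reduce((best, t) =>
    t.cost - t.evidenceGain < best.cost - best.evidenceGain ? t : best
  );
}

const approved: Assumption = {
  label: "Founders will validate before coding",
  impact: 9, uncertainty: 5, costIfWrong: 5, evidenceStrength: 2, evidenceApproved: true,
};
const unapproved: Assumption = { ...approved, evidenceApproved: false };

console.log(riskScore(approved));   // approximately 204.75, matching the worked example
console.log(riskScore(unapproved)); // 225: the same evidence has no effect until approved
```

The second call shows the guardrail in numbers: identical evidence leaves the score at the raw value of 225 until the founder approves it.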

Evidence Model

ProofForge organizes founder thinking into four evidence zones.

Zone	Meaning	Example
Know	Confirmed facts or approved evidence	“Five student founders reported delaying validation because they did not know what to test.”
Believe	Assumptions that may be true	“Student founders will use this before writing code.”
Learn	Open questions or evidence gaps	“Will founders complete a pilot without mentor pressure?”
Next	Actions that create evidence	“Run a 3-day pilot with 10 student founders.”

This structure helps founders avoid one of the most common early-stage mistakes: treating a belief as if it were already proven.

It also makes the product easier to use. Instead of staring at a long AI-generated plan, the founder can see what is known, what is believed, what must be learned, and what action comes next.

Technical Architecture

ProofForge is built as a layered Next.js monolith.

We chose a layered monolith because the project needed one production deploy, fast iteration, and clean engineering boundaries. A microservice architecture would have added complexity without improving the proof workflow. A single unstructured app would have been faster at first, but it would have mixed AI calls, database logic, UI state, and domain rules too tightly.

The layered monolith gave us both speed and discipline.

Layer	Responsibility	Main Tools
Presentation	User interface, Proof Map canvas, AI coach	Next.js, React, Tailwind CSS, React Flow
API	Request validation, routing, authentication checks	Next.js API routes, Zod
Application	Orchestrates product workflows	Extraction, Reality Scan, Map, Decision services
Domain	Pure business rules	Risk engine, phase gates, evidence states
Infrastructure	External systems and adapters	Supabase, Gemini, pgvector, Tavily, SSRF guard

The most important architecture rule was:

Domain rules should not depend on Next.js, Supabase, or Gemini.

That made the risk engine, evidence rules, and phase gates easier to test and harder to accidentally bypass.

This architecture also helped us separate AI behavior from product truth. The AI can generate structured suggestions, but the domain layer decides what is valid, what is approved, what can influence scoring, and what phase the founder is allowed to enter.

AI Pipeline

ProofForge separates probabilistic AI reasoning from deterministic software control.

Stage	Input	AI Role	System Control	Output
Ingestion	Raw idea text	Extract user, buyer, problem, and assumptions	Zod validates JSON	Structured idea brief
Proof Map	Extracted entities	Suggest cards and relationships	Cards start unapproved	Visual proof graph
Reality Scan	Idea and assumptions	Summarize signals and alternatives	Weak-signal labels	Evidence ledger
Assumption Arena	Assumption list	Explain if-wrong impact	TypeScript risk scoring	Ranked assumptions
Commitment Lab	Riskiest assumption	Suggest small tests	Founder selects test	Pilot protocol
Decision Gate	Evidence snapshot	Summarize tradeoffs	No AI default choice	Human decision

This architecture keeps the AI useful without letting it become the authority.

The AI pipeline is designed to support the founder’s reasoning, not replace it. At every important point, ProofForge either validates the AI output, labels it as weak, requires human approval, or forces the final decision to stay with the founder.

Responsible AI

Responsible AI is not a separate feature in ProofForge. It is built into the workflow.

The main risk is over-reliance. A student founder may trust AI-generated advice because it sounds structured and confident. That is dangerous in startup validation because the data is incomplete, personal context matters, and the wrong decision can waste weeks of work.

ProofForge reduces this risk through concrete controls:

Risk	Mitigation
AI sounds too confident	Outputs are framed as suggestions, not answers
Founder treats guesses as facts	AI cards start with `approved: false`
Weak web signals look stronger than they are	Reality Scan labels them as weak until confirmed
AI pushes one decision	Decision Gate shows equal-weight options
Invalid AI output enters the database	Zod schema validation blocks persistence
User skips proof steps	Phase gates enforce workflow order
Unsafe URL fetching	SSRF guard blocks localhost, private IPs, and metadata endpoints

The final decision is always human-owned.

ProofForge does not choose Continue, Pivot, Run Another Test, or Stop. The founder must decide because that choice depends on constraints AI cannot fully own: time, motivation, ethics, team capacity, budget, personal risk, and long-term goals.

The responsible AI idea is simple:

AI should help the founder think better.
It should not become the founder’s source of truth.

Human-in-the-Loop Design

The most important human-in-the-loop moment is the Decision Gate.

At this point, the AI summarizes:

What was assumed
What was tested
What evidence was collected
What remains uncertain
What tradeoffs exist

Then the founder chooses one of four options:

Decision	When It Makes Sense	Next Action
Continue	Key assumptions have enough evidence	Build or scale the next version
Pivot	Evidence contradicts a core belief	Change the user, problem, offer, or channel
Run Another Test	Signals are still inconclusive	Design a sharper or cheaper test
Stop	A fatal flaw is confirmed	Archive the learning and avoid wasted effort

The AI never preselects the option.

That design choice matters because ProofForge is not trying to replace founder judgment. It is trying to improve the evidence behind that judgment.

Human-in-the-loop design is not only a safety feature here. It is also a better product experience. Founders do not want to be told what to do by a black-box system. They want clearer reasoning, better evidence, and a decision they can trust.
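
One way to express "the AI never preselects" in types is sketched below. The four options and the five summary items come from this section; the `GateSummary` and `GateState` shapes and the function names are hypothetical.

```typescript
type Decision = "Continue" | "Pivot" | "Run Another Test" | "Stop";

const DECISION_OPTIONS: readonly Decision[] = [
  "Continue",
  "Pivot",
  "Run Another Test",
  "Stop",
];

// What the AI may contribute: a summary only. There is deliberately
// no "recommended" field anywhere in this shape.
interface GateSummary {
  assumed: string[];
  tested: string[];
  evidence: string[];
  uncertain: string[];
  tradeoffs: string[];
}

interface GateState {
  summary: GateSummary;
  options: readonly Decision[];
  selected: Decision | null; // stays null until the founder chooses
}

// Opening the gate never fills in a choice.
function openGate(summary: GateSummary): GateState {
  return { summary, options: DECISION_OPTIONS, selected: null };
}

// The only way to set `selected` is an explicit founder choice.
function recordFounderDecision(state: GateState, choice: Decision): GateState {
  return { ...state, selected: choice };
}

const gate = openGate({
  assumed: ["Student founders will use this before writing code."],
  tested: ["3-day pilot"],
  evidence: [],
  uncertain: ["Will founders complete a pilot without mentor pressure?"],
  tradeoffs: [],
});
const decided = recordFounderDecision(gate, "Run Another Test");
```

Because the summary type has no field for a recommendation, a default choice cannot be introduced without changing the data model itself.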

What We Built

During the hackathon, we built a working ProofForge MVP with:

Idea intake
AI extraction
Proof Map generation
Card approval states
Reality Scan workflow
Assumption ranking
Commitment test generation
Pilot evidence tracking
Decision Gate
Founder Pack export
Authentication
Private project storage
Knowledge memory
Production deployment

The product is designed around one repeated loop:

Propose → Review → Test → Record → Decide

That loop keeps the founder moving, but not blindly.

ProofForge is not just a prompt wrapper. It has product state, workflow gates, a scoring model, evidence approval, private project storage, and a complete decision cycle from idea to Founder Pack.

Accomplishments We Are Proud Of

We are proud that ProofForge became more than a polished interface around an AI prompt.

We built a full proof workflow with real product logic:

A visual Proof Map for startup reasoning
AI-generated cards that start unapproved
Human approval before evidence affects the system
Risk scoring based on impact, uncertainty, cost-if-wrong, and evidence strength
A Reality Scan for external signals and alternatives
A Commitment Lab for small validation tests
A Pilot Evidence workflow for behavioral proof
A Decision Gate where the founder, not the AI, chooses what happens next
A Founder Pack export that preserves the reasoning process

The biggest accomplishment is that responsible AI is not just described in the submission. It is built into the product data model, workflow, and decision system.

Challenges We Faced

OAuth Redirect Loops

Authentication became more complex than expected. Google OAuth, Supabase Auth, local development, and the Vercel deployment all needed matching redirect behavior.

The issue was that Google redirects to Supabase first, then Supabase redirects back to the app. If those callback URLs are not separated correctly, users can get stuck in redirect loops or fail to receive session cookies.

We fixed this by separating the provider callback from the app callback and making the server exchange the auth code for a session.

This taught us that authentication is not just a login button. It is a chain of redirects, cookies, domains, and environment variables.

AI Output Sounded Too Final

Our first AI responses were helpful, but they sounded too authoritative.

That was a serious product problem. If ProofForge tells a founder what to do with too much confidence, it becomes the exact kind of AI system we were trying to avoid.

We solved this by changing the data model and workflow. AI-generated cards became proposals, not evidence. Every card had to be approved by the founder before it could influence the project.

This changed the product from “AI gives advice” to “AI prepares reasoning for human review.”

Zod Caught Bad AI Structure

Structured AI output is powerful, but it can fail in subtle ways. Sometimes the model returned valid-looking content that drifted away from the founder’s actual idea or included template contamination.

We added strict Zod validation and extraction sanitization. If the AI output did not match the schema, the system rejected it instead of saving broken reasoning into the project.

This made the product more reliable and made the AI safer to use inside a real workflow.
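
The reject-before-save behavior can be illustrated with a small sketch. The project uses Zod for this; to keep the example dependency-free, the hand-written check below is a stand-in for a strict schema, and the `ExtractedIdea` field names are hypothetical.

```typescript
// Hypothetical extraction shape based on the entities named in the write-up.
interface ExtractedIdea {
  user: string;
  buyer: string;
  problem: string;
  assumptions: string[];
}

type ParseResult =
  | { ok: true; value: ExtractedIdea }
  | { ok: false; error: string };

const ALLOWED_KEYS = ["user", "buyer", "problem", "assumptions"];

const isNonEmptyString = (v: unknown): v is string =>
  typeof v === "string" && v.trim().length > 0;

// Validates raw model output before anything is persisted.
function parseExtraction(rawJson: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(rawJson);
  } catch {
    return { ok: false, error: "not valid JSON" };
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return { ok: false, error: "expected a JSON object" };
  }
  const record = data as Record<string, unknown>;
  // Strict mode: unknown keys are rejected rather than silently dropped.
  const extra = Object.keys(record).filter((k) => ALLOWED_KEYS.indexOf(k) === -1);
  if (extra.length > 0) {
    return { ok: false, error: "unexpected keys: " + extra.join(", ") };
  }
  const { user, buyer, problem, assumptions } = record;
  if (!isNonEmptyString(user) || !isNonEmptyString(buyer) || !isNonEmptyString(problem)) {
    return { ok: false, error: "user, buyer, and problem must be non-empty strings" };
  }
  if (!Array.isArray(assumptions) || assumptions.length === 0 || !assumptions.every(isNonEmptyString)) {
    return { ok: false, error: "assumptions must be a non-empty array of strings" };
  }
  return { ok: true, value: { user, buyer, problem, assumptions } };
}

const good = parseExtraction(
  '{"user":"student founder","buyer":"incubator","problem":"builds too early","assumptions":["will test first"]}'
);
const bad = parseExtraction('{"user":"student founder","assumptions":"will test first"}');
```

A structural check like this only catches malformed output. Content that is well-formed but drifts from the founder's idea still needs the separate sanitization and human review steps.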

SSRF Protection Was Required

Reality Scan needed to work with external sources, but fetching URLs can create security risk. A user or model-generated URL could point to localhost, private networks, or cloud metadata endpoints.

We added an SSRF guard that blocks unsafe destinations before any scrape happens.

That was an important lesson: once AI can suggest or process URLs, security boundaries must be explicit.
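
A simplified sketch of such a guard is shown below. The function name `isBlockedUrl` is hypothetical, and this version only inspects the URL text. It does not resolve DNS or follow redirects, both of which a production guard would also need to handle. As a conservative choice for the sketch, it rejects every IPv6 literal.

```typescript
// Returns true when the URL should NOT be fetched.
function isBlockedUrl(rawUrl: string): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return true; // unparseable input is rejected
  }

  // Only plain web schemes are allowed.
  if (url.protocol !== "http:" && url.protocol !== "https:") return true;

  const host = url.hostname.toLowerCase();

  // Loopback names and a well-known cloud metadata hostname.
  if (/(^|\.)localhost$/.test(host)) return true;
  if (host === "metadata.google.internal") return true;

  // IPv6 literals appear in brackets; this sketch blocks all of them.
  if (host.charAt(0) === "[") return true;

  // IPv4 literals: block loopback, private, and link-local ranges.
  const m = host.match(/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/);
  if (m) {
    const a = Number(m[1]);
    const b = Number(m[2]);
    if (a === 0 || a === 10 || a === 127) return true;  // 0.0.0.0/8, 10.0.0.0/8, 127.0.0.0/8
    if (a === 172 && b >= 16 && b <= 31) return true;   // 172.16.0.0/12
    if (a === 192 && b === 168) return true;            // 192.168.0.0/16
    if (a === 169 && b === 254) return true;            // link-local, includes 169.254.169.254
  }
  return false;
}

console.log(isBlockedUrl("http://localhost:3000/admin"));              // true
console.log(isBlockedUrl("http://169.254.169.254/latest/meta-data/")); // true
console.log(isBlockedUrl("https://example.com/pricing"));              // false
```

The check runs before any request is made, so a blocked URL never reaches the scraper regardless of whether a user or the model supplied it.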

The Hardest UX Problem Was Slowing Users Down

Most tools reward speed. ProofForge sometimes needs to tell a founder, “Not yet. You do not have enough evidence.”

That can feel frustrating unless the product gives a useful next step.

We solved this by making every gate action-oriented. Instead of blocking the user with a dead end, ProofForge shows what evidence is missing and what small test can create it.
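
An action-oriented gate can be sketched as a pure function that returns either permission or a concrete next step. The `ProjectSnapshot` fields, the thresholds, and the message wording below are hypothetical.

```typescript
// Hypothetical summary of project state that a gate would inspect.
interface ProjectSnapshot {
  approvedCards: number;
  rankedAssumptions: number;
  pilotEvidenceEntries: number;
}

// A blocked gate always carries what is missing and what to do about it.
type GateResult =
  | { allowed: true }
  | { allowed: false; missing: string; nextStep: string };

function canEnterDecisionGate(p: ProjectSnapshot): GateResult {
  if (p.approvedCards === 0) {
    return {
      allowed: false,
      missing: "No approved Proof Map cards",
      nextStep: "Review the AI-proposed cards and approve the ones you can stand behind.",
    };
  }
  if (p.rankedAssumptions === 0) {
    return {
      allowed: false,
      missing: "No ranked assumptions",
      nextStep: "Score your assumptions in the Assumption Arena.",
    };
  }
  if (p.pilotEvidenceEntries === 0) {
    return {
      allowed: false,
      missing: "No pilot evidence recorded",
      nextStep: "Run a small pilot and log what users actually did.",
    };
  }
  return { allowed: true };
}

const blocked = canEnterDecisionGate({ approvedCards: 4, rankedAssumptions: 3, pilotEvidenceEntries: 0 });
const open = canEnterDecisionGate({ approvedCards: 4, rankedAssumptions: 3, pilotEvidenceEntries: 2 });
```

Because the function has no framework or database imports, it fits the domain layer described earlier and can be unit-tested in isolation.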

What We Learned

The biggest lesson was that responsible AI is an engineering decision, not a disclaimer.

It is easy to add a warning that says AI can be wrong. It is much harder, and much more valuable, to design the system so AI cannot silently become the source of truth.

The `approved: false` flag became the clearest example of that lesson. It looks small, but it changes the power dynamic of the product. AI can help prepare the founder’s thinking, but the founder must decide what becomes accepted evidence.

We also learned that founders do not need more output. They need better focus.

A generic AI tool can generate endless tasks, strategies, and ideas. But early-stage founders are usually not suffering from a lack of possible actions. They are suffering from a lack of decision clarity.

ProofForge taught us this:

$$ \text{More Output} \neq \text{Better Thinking} $$

A useful AI system should sometimes reduce options, expose uncertainty, and force the user to test the belief that matters most.

We also learned that deterministic rules and AI reasoning work best together. The AI is strong at reading messy human language and explaining tradeoffs. The rule system is strong at enforcing gates, validating data, and keeping the workflow honest.

Most importantly, we learned that the best AI products do not always make users move faster. Sometimes they help users slow down at the right moment so they can make a better decision.

Impact

ProofForge helps student founders move from confusion to clarity to action.

$$ \text{Confusion} \rightarrow \text{Clarity} \rightarrow \text{Action} $$

The impact is not just faster planning. The impact is better decision-making.

For a student founder, time and confidence are limited resources. Spending three weeks building the wrong thing does not only waste time; it can make a founder lose momentum. ProofForge protects that momentum by helping users find the riskiest assumption, run a small pilot, collect evidence, and make a grounded decision.

That makes ProofForge useful beyond one project. It teaches a repeatable way to think:

Do not ask, “What can I build?”
Ask, “What do I need to prove?”

ProofForge can also support startup classes, incubators, student clubs, university entrepreneurship programs, and mentor-led founder teams. It gives everyone a shared language for assumptions, evidence, tests, and decisions.

The long-term impact is a healthier founder culture: less blind building, less AI overconfidence, and more evidence-driven progress.

Future Roadmap

Next, we want to make ProofForge stronger as a real founder workspace.

Team Collaboration

Student founders often work in teams. We want to add shared projects, roles, comments, and decision history so teams can reason from the same evidence instead of scattered notes.

Mentor Review

We want mentors, professors, and incubator leads to review Founder Packs and leave structured feedback tied to assumptions, pilots, and decisions.

Stronger Pilot Analytics

We want to improve how pilots track behavioral signals, commitment strength, user friction, and evidence quality.

Multi-Provider AI

We want to reduce dependency on a single AI provider by supporting multiple model backends while keeping the same deterministic control layer.

Proof History

We want to add Git-style history for assumptions, tests, pivots, and decision changes so founders can see how their thinking evolved over time.

Classroom and Incubator Mode

We want to support entrepreneurship classes, startup clubs, and incubator programs where mentors can review Founder Packs, compare assumptions across teams, and help students learn a repeatable validation process.

ProofForge is built around a simple rule:

Do not build until you know what you are trying to prove.

That rule matters for student founders. Budgets are small. Time is limited. Motivation is fragile. A fast build can feel like progress, but if the core assumption is wrong, speed only helps the founder reach the wrong answer faster.

ProofForge gives founders a structured way to slow down before they speed up. It uses AI to organize uncertainty, deterministic rules to enforce proof discipline, and human decision gates to keep ownership where it belongs.

The result is not just a startup tool.

It is a system for clearer thinking under uncertainty.

Build less. Prove more.

Built With

ai
google-gemini
nextjs
pgvector
playwright
postgresql
react
supabase
tailwindcss
typescript
vercel
vitest
zod