FORGE: The AI-powered gauntlet for early-stage founders

About Forge

What Inspired Us

Forge was born from a simple observation: most startups fail not because they can't build, but because they build the wrong thing.

The traditional startup planning process is broken. Founders spend months building products based on assumptions that should have been validated in weeks. They dive into execution without confronting fundamental questions: What if this fails? What are we assuming? Are we aligned as cofounders?

The AI tools available to founders — chatbots like ChatGPT, Claude, or Gemini — are great for brainstorming but terrible for structured thinking. They give you 500 words of advice when you need a 5-field decision matrix. They validate your ideas when they should challenge them. They're too conversational, not strategic enough.

We wanted something different: an AI-powered strategist that asks the hard questions before you commit to building. A tool that combines the analytical rigor of a pre-mortem, the strategic clarity of a 90-day roadmap, and the validation mindset of lean startup methodology — all in one fluid experience.

What We Learned

1. Human-in-the-Loop is Non-Negotiable

Early prototypes had the AI "recommend" the best strategic posture. Users followed it blindly. We realized this was dangerous — the AI can't know your risk tolerance, your financial runway, or your gut instincts about the market.

The fix: The AI now generates 3 equal options with clear trade-offs. The founder chooses. We also added hard mitigation gates — if the AI identifies a HIGH risk, the workflow locks until the founder writes their own mitigation strategy. This intentional friction prevents blind reliance on AI.
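A minimal sketch of how such a mitigation gate could be checked. The `Risk` shape, the function names, and the 20-character minimum are illustrative assumptions, not Forge's actual code.

```typescript
// Hypothetical types: the real Forge schema may differ.
type RiskLevel = "HIGH" | "MEDIUM" | "LOW";

interface Risk {
  id: string;
  title: string;
  level: RiskLevel;
  // Written by the founder, never pre-filled by the AI.
  founderMitigation?: string;
}

// Assumed threshold so that a single word cannot unlock the gate.
const MIN_MITIGATION_LENGTH = 20;

// Returns the HIGH risks that still block the workflow.
function unresolvedHighRisks(risks: Risk[]): Risk[] {
  return risks.filter(
    (r) =>
      r.level === "HIGH" &&
      (r.founderMitigation ?? "").trim().length < MIN_MITIGATION_LENGTH
  );
}

// The workflow may advance only when no HIGH risk is left unmitigated.
function canAdvance(risks: Risk[]): boolean {
  return unresolvedHighRisks(risks).length === 0;
}
```

Returning the list of blocking risks, rather than only a boolean, lets the UI point the founder at exactly which risks still need a written mitigation.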

2. Cofounders Are Often Misaligned (and Don't Know It)

We initially built Forge for solo founders. Then we added multi-founder support and were shocked by how often cofounders had conflicting expectations — one wanted "ship fast," the other wanted "build moat"; one could give 20 hours/week, the other only 5. These conversations weren't happening until it was too late.

The insight: The most valuable part of Forge for many teams is the alignment report — it surfaces difficult conversations (equity splits, exit timelines, time commitment) before founders start building.

3. Ideas Are Iterative, Not Static

Founders don't have one idea — they have half-ideas, variations, and pivots. We learned that the tool needed to support:

Multi-idea synthesis (analyze 2-3 ideas together to find the core bet)
Idea versioning (branch at the strategy stage, compare roadmaps, merge insights)
Live experiment tracking (log real-world results, update risk levels, pivot when evidence says to)

This turned Forge from a one-shot planning tool into a living system that evolves with the founder's journey.

4. Privacy Matters More Than We Thought

Founders are sharing their raw, unfiltered ideas. Some are in stealth mode. Others are exploring sensitive pivots while employed elsewhere. We learned that client-side processing with localStorage wasn't just a technical choice — it was a privacy requirement. No server logs, no training on user data, no database.

How We Built It

Architecture: Pipeline-First Design

Forge is built around a pipeline architecture (P0-P10) rather than a traditional CRUD app. Each stage does one thing well:

$$P0: \text{Cofounder Alignment} \rightarrow P1: \text{Idea Intake} \rightarrow P2: \text{Risk Analysis} \rightarrow P3: \text{Strategic Posturing} \rightarrow P4: \text{Roadmap Synthesis}$$

$$\rightarrow P5: \text{Experiment Design} \rightarrow P6: \text{Resource Mapping} \rightarrow P7: \text{Resource Retrieval} \rightarrow P8: \text{Execution Drafts}$$

$$\rightarrow P9: \text{Re-Risk Analysis} \rightarrow P10: \text{Pivot Detection}$$

This pipeline approach means:

Clear separation of concerns: each stage is a standalone module.
Easy testing: any stage can be mocked in isolation.
Parallel processing: multi-idea synthesis runs P1 in parallel across all ideas.
State machine: users can't skip ahead to experiment design (P5) before completing risk analysis (P2).
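The state-machine property can be sketched as a simple ordering guard. This assumes a strictly linear pipeline in which every earlier stage must be complete; the real app may relax this (for example, letting solo founders skip P0). The names `canRun` and `nextStage` are hypothetical.

```typescript
// Stage ids follow the P0-P10 pipeline described above.
const STAGES = [
  "P0", "P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8", "P9", "P10",
] as const;
type StageId = (typeof STAGES)[number];

// A stage may run only when every earlier stage has completed.
function canRun(stage: StageId, completed: ReadonlySet<StageId>): boolean {
  const index = STAGES.indexOf(stage);
  return STAGES.slice(0, index).every((s) => completed.has(s));
}

// The first stage that has not completed yet, or undefined when all are done.
function nextStage(completed: ReadonlySet<StageId>): StageId | undefined {
  return STAGES.find((s) => !completed.has(s));
}
```

With P0 and P1 complete, `canRun("P5", ...)` is false because P2 through P4 are still missing, while `nextStage` points the user to P2.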

Tech Stack Choices

Why Genkit? We needed structured JSON output from AI, not conversational text. Genkit's Zod schema integration validates every response at runtime: if the AI returns a missing or mistyped field, the entire call fails rather than producing malformed data.
Why GLM-4.5-Air? We tested multiple models and found that this mixture-of-experts model, with roughly 12B active parameters, hit the sweet spot: fast enough for real-time UX (<2s per stage) but smart enough for strategic reasoning. Larger models added latency without better outputs.

The confidence score calculation for risk updates follows:

$$ C_{new} = C_{base} + \sum_{i=1}^{n} w_i \cdot \Delta_i $$

Where $C_{base}$ is baseline confidence, $w_i$ is experiment weight, and $\Delta_i$ is the confidence shift from experiment $i$.
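A direct transcription of the formula as code might look like the sketch below. The `ExperimentResult` shape is hypothetical, and clamping the result to [0, 1] is an added assumption, since the formula itself does not specify bounds.

```typescript
interface ExperimentResult {
  weight: number; // w_i: how much this experiment counts
  delta: number;  // Δ_i: signed confidence shift from this experiment
}

function updateConfidence(
  base: number,
  experiments: ExperimentResult[]
): number {
  // Sum of w_i * Δ_i over all logged experiments.
  const shift = experiments.reduce((sum, e) => sum + e.weight * e.delta, 0);
  // Assumption: confidence is a fraction, so keep it within [0, 1].
  return Math.min(1, Math.max(0, base + shift));
}
```

For example, with a baseline of 0.4, one experiment of weight 1 and shift +0.25, and another of weight 0.5 and shift -0.5, the two contributions cancel (0.25 - 0.25) and the confidence stays at 0.4.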

Why localStorage? Privacy-first, no backend required, works offline after initial load. We implemented a v2 → v3 migration system so existing users wouldn't lose progress when we added multi-founder support.
Why shadcn/ui? We needed accessible components out of the box without building a design system from scratch. The components are customizable, well-documented, and include proper ARIA labels.
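The v2 → v3 localStorage migration mentioned above could look roughly like this. Only the version step and its reason (multi-founder support) come from the text; the state shapes and function names are hypothetical.

```typescript
// Hypothetical state shapes, for illustration only.
interface Founder { name: string }
interface StateV2 { version: 2; founder: Founder; ideas: string[] }
interface StateV3 { version: 3; founders: Founder[]; ideas: string[] }

// Wrap the single v2 founder in the v3 founders array.
function migrateV2toV3(old: StateV2): StateV3 {
  return { version: 3, founders: [old.founder], ideas: old.ideas };
}

// Parses a stored JSON string and upgrades it if needed.
// Returns null for missing, unparseable, or unknown-version data.
// Note: this sketch trusts the stored shape once the version matches.
function loadState(raw: string | null): StateV3 | null {
  if (raw === null) return null;
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const version = (parsed as { version?: unknown }).version;
  if (version === 3) return parsed as StateV3;
  if (version === 2) return migrateV2toV3(parsed as StateV2);
  return null;
}
```

In the browser, the input would come from something like `localStorage.getItem(key)`, where the storage key is whatever the app uses. Keeping the migration a pure function of the stored string makes it testable without a browser.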

The Four Innovative Features

We implemented four features that make Forge unique:

Cofounder Alignment (P0): Capture 2-4 founder profiles, detect conflicts in goals/time/expectations, generate conversation starters.
Multi-Idea Synthesis (P1): Input 2-3 ideas, AI finds core bet/conflicts/complementarity, recommends merging or pursuing separately.
Idea Versioning (P3): Branch into 3 strategic postures, compare roadmaps side-by-side, merge best phases.
Live Experiment Loop (P8-P10): Log real-world metrics, risk levels animate from HIGH → MEDIUM → LOW, AI suggests pivots after 3+ experiments.

Each feature required new pipeline stages, UI components, and state management — but the pipeline architecture made integration straightforward.
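As an illustration of the first feature, conflict detection between founder profiles could be sketched as a pairwise comparison. The profile fields, the two priority labels, and the "twice the hours" rule are assumptions chosen to mirror the examples earlier in this writeup, not the actual detection logic.

```typescript
// Hypothetical profile shape; field names are assumptions.
interface FounderProfile {
  name: string;
  hoursPerWeek: number;
  priority: "ship-fast" | "build-moat";
}

interface Conflict {
  kind: "time" | "priority";
  founders: [string, string];
  detail: string;
}

// Assumed rule: flag a time conflict when one founder commits
// at least twice the hours of another.
const HOURS_RATIO_THRESHOLD = 2;

function findConflicts(profiles: FounderProfile[]): Conflict[] {
  const conflicts: Conflict[] = [];
  // Compare every unordered pair of founders exactly once.
  profiles.forEach((a, i) => {
    for (const b of profiles.slice(i + 1)) {
      const more = Math.max(a.hoursPerWeek, b.hoursPerWeek);
      const less = Math.min(a.hoursPerWeek, b.hoursPerWeek);
      if (more > 0 && more >= HOURS_RATIO_THRESHOLD * less) {
        conflicts.push({
          kind: "time",
          founders: [a.name, b.name],
          detail: `${more}h/week vs ${less}h/week`,
        });
      }
      if (a.priority !== b.priority) {
        conflicts.push({
          kind: "priority",
          founders: [a.name, b.name],
          detail: `${a.priority} vs ${b.priority}`,
        });
      }
    }
  });
  return conflicts;
}
```

Each detected conflict carries enough detail to seed a conversation starter, for instance the 20 vs 5 hours/week gap described earlier.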

Challenges We Faced

Challenge 1: AI Output Consistency

Problem: Early versions of the risk analysis would sometimes generate "Market risk" as HIGH in one run and MEDIUM in another, even with the same input. Temperature 0.7 caused too much variance.
Solution: We dropped temperature to 0.2 and added deterministic field mappings. For risk levels, we now use a strict enum (HIGH/MEDIUM/LOW) rather than free text. We also added few-shot examples in prompts to show the AI exactly what "HIGH" vs "MEDIUM" looks like.
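The strict enum for risk levels can be illustrated with a small parser that rejects anything outside the three allowed values. This is a dependency-free sketch with a hypothetical `parseRiskLevel` name; in a Zod-based codebase the same constraint would typically be expressed with an enum schema.

```typescript
const RISK_LEVELS = ["HIGH", "MEDIUM", "LOW"] as const;
type RiskLevel = (typeof RISK_LEVELS)[number];

// Accepts only the exact enum values; free text such as "high" or
// "Moderate" is rejected instead of being silently coerced.
function parseRiskLevel(value: unknown): RiskLevel {
  const match = RISK_LEVELS.find((level) => level === value);
  if (match === undefined) {
    throw new Error(`Invalid risk level: ${JSON.stringify(value)}`);
  }
  return match;
}
```

Rejecting near-misses rather than normalizing them is a deliberate choice in this sketch: it surfaces prompt drift immediately instead of hiding it.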

Challenge 2: Type Safety Across 11 Pipeline Stages

Problem: With 10+ pipeline stages passing JSON between them, a single schema change could break the entire chain. Catching type errors at runtime was painful.
Solution: We went all-in on TypeScript + Zod. Every pipeline stage has a Zod schema that validates AI output. If runP2Risks returns malformed data, it fails fast with a clear error message rather than propagating corruption downstream.
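The fail-fast pattern can be sketched without any dependencies, as below. Here a hand-written parser stands in for the Zod schema (in the real app that role would be played by a schema's `parse` method), and the `P2Output` shape and `validateStageOutput` helper are hypothetical.

```typescript
type RiskLevel = "HIGH" | "MEDIUM" | "LOW";
interface P2Output { risks: { title: string; level: RiskLevel }[] }

// A parser either returns typed data or throws.
type Parser<T> = (raw: unknown) => T;

function isRiskLevel(value: unknown): value is RiskLevel {
  return value === "HIGH" || value === "MEDIUM" || value === "LOW";
}

// Hand-written stand-in for a schema describing the P2 output.
const parseP2Output: Parser<P2Output> = (raw) => {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("expected an object");
  }
  const risks = (raw as { risks?: unknown }).risks;
  if (!Array.isArray(risks)) throw new Error("risks must be an array");
  return {
    risks: risks.map((item: unknown, i: number) => {
      const r = (item ?? {}) as { title?: unknown; level?: unknown };
      const title = r.title;
      const level = r.level;
      if (typeof title !== "string") {
        throw new Error(`risks[${i}].title must be a string`);
      }
      if (!isRiskLevel(level)) {
        throw new Error(`risks[${i}].level must be HIGH, MEDIUM, or LOW`);
      }
      return { title, level };
    }),
  };
};

// Validates one stage's raw AI output and names the stage in the error,
// so corruption is reported where it happens instead of downstream.
function validateStageOutput<T>(
  stageName: string,
  rawOutput: unknown,
  parse: Parser<T>
): T {
  try {
    return parse(rawOutput);
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    throw new Error(`${stageName}: invalid AI output (${reason})`);
  }
}
```

Because every downstream stage receives a `P2Output` that has already been checked, a schema change shows up as one clearly labeled failure at the stage boundary.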

Challenge 3: UX for "Pursue Separately" Recommendation

Problem: When multi-idea synthesis recommended pursuing ideas separately, users were confused about which idea to advance. The UI just showed a generic "choose one" message.
Solution: We redesigned the synthesis stage to show side-by-side idea cards with clear "Pursue Idea 1" / "Pursue Idea 2" buttons. We also added a "merge as one" option for when the AI recommends pursuing the ideas separately but the user disagrees.

Challenge 4: Experiment Loop Friction

Problem: Users completed the full pipeline but never logged experiments because it required leaving the app and coming back. The "live feedback loop" we envisioned wasn't happening.
Solution: We added persistent state so users can close the browser, return weeks later, and their progress is preserved. We also redesigned the experiment tracker to show a confidence progress bar — users see their confidence % increase with each logged experiment, which creates a gamified feedback loop.

Challenge 5: API Rate Limits & Timeout Handling

Problem: Z.ai's coding endpoint is rate-limited, and generating 3 branches in parallel in idea versioning mode caused timeouts.
Solution: We implemented progressive loading with optimistic UI updates. Each branch shows a loading spinner independently; if one fails, the others complete. We also added retry logic with exponential backoff for transient failures.
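A sketch of retry with exponential backoff combined with independent branch completion. The function names, the default of 3 attempts, and the 500 ms base delay are illustrative assumptions. For brevity this version retries on any error, whereas a production version would retry only transient failures such as rate-limit or timeout responses.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `task`, retrying failed attempts with growing delays.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 3,
  baseMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      // No wait after the final failed attempt.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt, baseMs));
      }
    }
  }
  throw lastError;
}

type BranchResult<T> = { ok: true; value: T } | { ok: false; error: unknown };

// Runs each branch independently: one branch failing after all retries
// does not prevent the others from completing.
async function generateBranches<T>(
  tasks: Array<() => Promise<T>>,
  maxAttempts = 3,
  baseMs = 500
): Promise<BranchResult<T>[]> {
  return Promise.all(
    tasks.map((task) =>
      withRetry(task, maxAttempts, baseMs).then(
        (value): BranchResult<T> => ({ ok: true, value }),
        (error: unknown): BranchResult<T> => ({ ok: false, error })
      )
    )
  );
}
```

Each `BranchResult` maps naturally onto one branch card in the UI: a spinner while pending, content on `ok: true`, and a per-branch error state otherwise.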

The Future

Forge is far from complete. We're exploring:

Collaborative mode for remote cofounder alignment (currently single-device only)
Resource marketplace for curated tools, templates, and startup perks
Mobile app for on-the-go experiment logging
Integration with productivity tools (Notion, Linear, Slack) for roadmap sync

But the core philosophy won't change: challenge assumptions, validate ideas, execute with clarity. The AI is the strategist, not the oracle — the founder remains firmly in the loop.