ForgeFlow: Multi-Agent Idea-to-Plan AI

Inspiration

Most people with a startup, side hustle, or class project idea don’t fail because they lack ambition. They fail because they never turn a vague idea into a concrete plan. Generic AI chatbots make this worse: one prompt, one wall of text, no structure, no risks, and no clear first step.

We built ForgeFlow for founders and builders who need decision support, not another chatbot. The inspiration was simple: if AI can help you think, it should show its work, clarify what you mean, stress-test the plan, surface risks, and let you choose the strategy. The human stays in control at every gate.

What it does

ForgeFlow turns a one-line idea into a structured execution plan in three steps. First, you pick a type (startup, class project, or side hustle) and submit free text. Second, on the Clarify step, the Assessor reviews feasibility and the Clarifier asks up to five choice-based questions about gaps that would materially change the plan, such as budget, timeline, market, and MVP scope. The UI won’t let you generate a plan until every question is answered; ForgeFlow does not guess on your behalf. If the Assessor flags concerns, you can refine your idea or choose to continue anyway: the gate informs you, but it doesn’t hard-block you. Third, after the agents build a baseline plan, the Synthesizer surfaces two or three strategic paths (for example, Lean MVP vs. Feature-Rich Launch). You pick one on a dedicated Step 3 screen, or skip to view the baseline plan; when you commit to a path, the Path Adapter reshapes your timeline, tasks, and first action.
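The "answer everything before generating" gate described above can be sketched in a few lines of TypeScript. This is a minimal illustration, not ForgeFlow's actual code: the ClarifyQuestion shape and the function names unansweredQuestionIds and canGeneratePlan are assumptions made up for the example.

```typescript
// Hypothetical shapes; field names are illustrative, not ForgeFlow's real schema.
interface ClarifyQuestion {
  id: string;
  prompt: string;
  choices: string[];
}

type ClarifyAnswers = Record<string, string>;

// Returns the ids of questions that still lack a non-empty answer.
function unansweredQuestionIds(
  questions: ClarifyQuestion[],
  answers: ClarifyAnswers
): string[] {
  return questions
    .filter((q) => {
      const answer = answers[q.id];
      return answer === undefined || answer.trim() === "";
    })
    .map((q) => q.id);
}

// The "generate plan" action is enabled only when nothing is missing.
function canGeneratePlan(
  questions: ClarifyQuestion[],
  answers: ClarifyAnswers
): boolean {
  return unansweredQuestionIds(questions, answers).length === 0;
}
```

Returning the missing ids, rather than only a boolean, would let a UI highlight exactly which questions still need an answer.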

On the result dashboard, you get a phased timeline with milestones and key activities; risks, weak assumptions, and failure modes from an adversarial Stress Tester; a visible pipeline trace showing what each agent produced; a plan chat for refining the roadmap conversationally via the Plan Refiner agent; and Markdown export of your plan. Every screen reinforces the same principle: decision support, not advice.

How we built it

We used Next.js for the frontend, Express for the API, OpenAI GPT-4o as the default model (with optional Claude and Gemini via a multi-provider LLM layer), LangGraph for orchestration, and structured JSON outputs at every agent step. The system has seven specialized agents, each with a narrow job: the Assessor runs an early feasibility check before clarification; the Clarifier structures the idea and generates choice-based questions; the Planner builds a phased execution plan from confirmed answers; the Stress Tester adversarially challenges assumptions and flags risks; the Synthesizer merges everything into a final roadmap plus path options; the Path Adapter reshapes the plan when the user picks a strategic path; and the Plan Refiner applies conversational edits through the chat panel.

The core pipeline (Planner, then Stress Tester, then Synthesizer) runs inside a LangGraph StateGraph, with each node appending to a shared pipelineTrace. Post-plan agents (Path Adapter and Plan Refiner) extend that trace when the user interacts, so judges can see the system evolve rather than just a one-shot response. The Assessor runs in parallel with the Clarifier before planning but is shown as a separate review panel, not as a pipeline-trace stage.
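The trace-appending pattern can be illustrated without any framework. The sketch below deliberately uses plain function composition instead of LangGraph's StateGraph, and the stub nodes return canned data instead of calling an LLM. The PipelineState fields other than pipelineTrace, and all summaries, are assumptions for illustration.

```typescript
// Hypothetical state shape; the real state likely carries more fields.
interface TraceEntry {
  agent: string;
  summary: string;
}

interface PipelineState {
  idea: string;
  phases: string[];
  risks: string[];
  pipelineTrace: TraceEntry[];
}

type PipelineNode = (state: PipelineState) => PipelineState;

// Each stub node returns a new state and appends one entry to the shared trace.
const plannerNode: PipelineNode = (state) => ({
  ...state,
  phases: ["Validate demand", "Build MVP", "Launch"],
  pipelineTrace: [...state.pipelineTrace, { agent: "Planner", summary: "Drafted 3 phases" }],
});

const stressTesterNode: PipelineNode = (state) => ({
  ...state,
  risks: ["Demand is unverified"],
  pipelineTrace: [...state.pipelineTrace, { agent: "Stress Tester", summary: "Flagged 1 risk" }],
});

const synthesizerNode: PipelineNode = (state) => ({
  ...state,
  pipelineTrace: [
    ...state.pipelineTrace,
    { agent: "Synthesizer", summary: "Merged plan and risks into a roadmap" },
  ],
});

// Runs the nodes in order, threading the state through each one.
function runPipeline(nodes: PipelineNode[], initial: PipelineState): PipelineState {
  return nodes.reduce((state, node) => node(state), initial);
}

const pipelineResult = runPipeline([plannerNode, stressTesterNode, synthesizerNode], {
  idea: "Campus laundry pickup",
  phases: [],
  risks: [],
  pipelineTrace: [],
});
```

Because every node copies the trace instead of mutating it, a post-plan agent can extend the same trace later by running one more node against the stored state.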

Human-in-the-loop design shows up throughout the product. Clarification uses choice questions (with an optional “type your own answer” per question), not open-ended free text. Path selection is Step 3 of 3 before the full result, with a skip option for the baseline plan. Chat refinement shows visible “Updated: timeline, first action…” badges when the Plan Refiner changes the plan. The Stress Tester is explicitly instructed never to treat user-confirmed answers as weak assumptions.
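One way to derive an "Updated: ..." badge is to compare the plan before and after a refinement and list the sections that differ. The sketch below is an assumption about how that could work, not a description of ForgeFlow's implementation; the PlanSections shape, the label table, and both function names are hypothetical.

```typescript
// Hypothetical plan shape; section names are illustrative.
interface PlanSections {
  timeline: string[];
  firstAction: string;
  risks: string[];
}

// Maps each section key to the human-readable label shown in the badge.
const SECTION_LABELS: { key: keyof PlanSections; label: string }[] = [
  { key: "timeline", label: "timeline" },
  { key: "firstAction", label: "first action" },
  { key: "risks", label: "risks" },
];

// Lists labels of sections whose serialized content differs between two plans.
function changedSectionLabels(before: PlanSections, after: PlanSections): string[] {
  return SECTION_LABELS
    .filter(({ key }) => JSON.stringify(before[key]) !== JSON.stringify(after[key]))
    .map(({ label }) => label);
}

// Builds the badge text, or returns null when the refinement changed nothing.
function updateBadge(before: PlanSections, after: PlanSections): string | null {
  const labels = changedSectionLabels(before, after);
  return labels.length > 0 ? `Updated: ${labels.join(", ")}` : null;
}
```

Comparing serialized sections is crude but sufficient for plain strings and string arrays; a plan with nested objects would need a key-order-stable comparison.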

On the engineering side, all agents use response_format: json_object with schema-guided prompts. A shared normalization layer (planUtils, reasoningUtils) keeps LLM output consistent for the frontend. validateFinalPlan runs with one retry before delivery. The LLM layer includes retries, timeouts, and multi-model fallback. We have 20 unit tests covering validation, normalization, and merge logic.
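The normalization and validate-with-one-retry ideas can be sketched together. This is a simplified illustration under stated assumptions, not the contents of planUtils or validateFinalPlan: the DraftPlan shape, the candidate text keys ("text", "description", "name"), and every function name below are hypothetical, and the generator is synchronous for brevity where a real LLM call would be asynchronous.

```typescript
// Coerces one LLM-returned item to a display string.
// Assumption: objects carry their text under "text", "description", or "name".
function toDisplayString(item: unknown): string {
  if (typeof item === "string") return item.trim();
  if (typeof item === "number" || typeof item === "boolean") return String(item);
  if (item !== null && typeof item === "object") {
    const record = item as Record<string, unknown>;
    for (const key of ["text", "description", "name"]) {
      const value = record[key];
      if (typeof value === "string" && value.trim() !== "") return value.trim();
    }
    // Fall back to JSON so the frontend always receives a string.
    return JSON.stringify(item);
  }
  return "";
}

// Normalizes a field that should be a list of strings, dropping empty entries.
function normalizeStringList(raw: unknown): string[] {
  const items: unknown[] = Array.isArray(raw)
    ? raw
    : raw === undefined || raw === null
      ? []
      : [raw];
  return items.map(toDisplayString).filter((s) => s !== "");
}

// Hypothetical minimal plan shape.
interface DraftPlan {
  phases: string[];
  assumptions: string[];
}

// Returns a list of problems; an empty list means the plan is deliverable.
function validateDraftPlan(plan: DraftPlan): string[] {
  const problems: string[] = [];
  if (plan.phases.length === 0) problems.push("plan has no phases");
  return problems;
}

// Generates a plan, validates it, and regenerates at most once on failure.
function generateWithOneRetry(generate: () => unknown): DraftPlan {
  let problems: string[] = [];
  for (let attempt = 0; attempt < 2; attempt++) {
    const raw = generate() as Record<string, unknown> | null;
    const plan: DraftPlan = {
      phases: normalizeStringList(raw?.phases),
      assumptions: normalizeStringList(raw?.assumptions),
    };
    problems = validateDraftPlan(plan);
    if (problems.length === 0) return plan;
  }
  throw new Error(`Plan failed validation after one retry: ${problems.join("; ")}`);
}
```

Normalizing before validating means the validator only ever sees one shape, so its rules stay short and the retry is reserved for plans that are actually incomplete.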

Challenges we ran into

Single-prompt plans feel smart but aren’t trustworthy: Early prototypes returned long paragraphs that looked impressive but weren’t actionable. We split the work across agents and enforced structured JSON at every step.
LLMs return inconsistent shapes: Assumptions and dependencies sometimes came back as objects instead of strings, breaking React rendering. We built a normalization layer and applied it in both backend agents and frontend components.
Making AI reasoning visible without overwhelming the user: We iterated on the Pipeline tab, starting with a duplicate “reasoning” section and ending with a single merged timeline where each agent card shows its actual output inline.
Path selection felt cosmetic: Radio buttons on a buried tab didn’t feel like real control. We moved path choice to Step 3 of 3 in the user flow and wired the Path Adapter to actually reshape the roadmap via a dedicated API call.
Latency during plan generation: Running 3–4 LLM calls sequentially takes 30–60 seconds. We added a generation overlay with stage progress and pre-tested a golden demo path so live demos stay predictable.

Accomplishments that we're proud of

Real multi-agent architecture: not a wrapper around one ChatGPT call; LangGraph orchestration with traceable intermediate outputs
Mandatory human gates: clarify questions and path selection are required steps, not optional UI
Adversarial Stress Tester: explicitly challenges plan assumptions while respecting user-confirmed answers
Visible pipeline: judges and users can see Clarifier > Planner > Stress Tester > Synthesizer > Path Adapter > Plan Refiner in the UI
Production-minded LLM layer: retries, timeouts, multi-model fallback, and schema validation before delivery
Polished 3-step UX: idea > clarify > choose path, then explore the result, with export and chat refinement

What we learned

Specialized agents beat one big prompt when each agent has a narrow job and a strict JSON schema.
Human-in-the-loop only works if the UI enforces it: optional text fields get skipped; choice questions and mandatory gates get answered.
Responsible AI is a design choice, not a disclaimer. For example, the Stress Tester’s rule to never flag user answers as weak assumptions is encoded in the agent prompt and visible in the Risks tab.
Transparency builds trust: showing the pipeline trace and agent reasoning matters as much as the final plan.
Normalize early: LLM output variability is guaranteed, and normalization layers save hours of frontend debugging.

What's next for ForgeFlow: Multi-Agent Idea-to-Plan AI

Collaborative plans: share a plan link, comment on phases, and track assumptions over time. Server-sent events during plan generation, so progress reflects real agent completion instead of a timer. Persistent chat memory across page refreshes within a session. These are planned improvements, not shipped features.