Inspiration

In 2026, AI can build your product in a weekend, and that's the trap. The cheapest mistake to make is now building the wrong thing fast. About 42% of startups fail because nobody actually needed what they built, and every new tool makes it easier to skip straight to building without ever checking whether the idea holds up. We wanted the opposite of a build-faster tool: something that helps you find out what's worth building at all, which turns out to be a thinking problem, not a typing one.

What it does

Project Pulse is a reasoning partner for early-stage builders. You describe your idea in plain language, and instead of helping you build it, Pulse surfaces the hidden assumptions buried in how you framed it (the beliefs you never stopped to question) and works out which one, if it's wrong, would kill the whole thing. Then it designs the cheapest test to check that belief before you write a line of code. As you bring back evidence, Pulse weighs how much it actually means: three lukewarm interviews, two excited friends, and a landing page nobody signed up for don't add up to demand, even though on paper they look like progress. It flags when "evidence" is really just people being nice, updates where your idea stands, and at key moments lays out an honest persevere, pivot, or stop picture, with the case for each, then hands that decision to you. A memory layer keeps the whole journey, so every session builds on the last.

How we built it

We built Pulse as a structured reasoning loop in Python with a Streamlit chat interface, using a GPT model through Microsoft Foundry. The reasoning runs on four decision-science methods: the Riskiest Assumption Test, pre-mortems, kill criteria, and a decision journal. A local memory store holds the validation ledger (the idea, its assumptions, the evidence, and each session's snapshot), and a plain change-detector compares sessions to surface what's new. We were deliberate about dividing the labor: the parts that are really judgment, like pulling unstated assumptions out of a sentence, deciding which belief is most fatal, and reading whether messy qualitative evidence is real or wishful, run on the model, while the mechanical parts like storing state and diffing snapshots run on ordinary code. Every action the agent takes is gated behind explicit user approval.
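The "plain change-detector" idea can be sketched in a few lines of ordinary Python. This is a minimal illustration, not our actual implementation: the `Snapshot` shape, the `diff_snapshots` name, and the tutoring example data are all assumptions made up for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """One session's view of the ledger (hypothetical shape)."""
    assumptions: dict[str, str] = field(default_factory=dict)  # assumption text -> evidence tier
    evidence: list[str] = field(default_factory=list)          # free-text evidence notes


def diff_snapshots(previous: Snapshot, current: Snapshot) -> dict[str, list]:
    """Compare two sessions with plain code; no model call is involved."""
    new_assumptions = [a for a in current.assumptions if a not in previous.assumptions]
    tier_changes = [
        (a, previous.assumptions[a], tier)
        for a, tier in current.assumptions.items()
        if a in previous.assumptions and previous.assumptions[a] != tier
    ]
    seen = set(previous.evidence)
    new_evidence = [e for e in current.evidence if e not in seen]
    return {
        "new_assumptions": new_assumptions,
        "tier_changes": tier_changes,
        "new_evidence": new_evidence,
    }


# Made-up example data for two consecutive sessions.
before = Snapshot(assumptions={"students struggle to find tutors": "Untested"})
after = Snapshot(
    assumptions={
        "students struggle to find tutors": "Weak signal",
        "students will pay for matching": "Untested",
    },
    evidence=["2 coursemates said they liked the idea"],
)
changes = diff_snapshots(before, after)
```

Nothing here needs judgment, which is why it stays outside the model: the diff only reports what changed, and deciding what the change means is left to the reasoning steps.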

Challenges we ran into

The first real challenge was philosophical, not technical: deciding where an LLM actually earns its place. It would have been easy to wrap every step in a model call and call it AI, but that produces a tool a rules engine could fake. We forced ourselves to draw a hard line. Extracting unstated assumptions, judging how fatal a belief is if it is wrong, reading whether messy evidence is a false positive, and synthesizing a persevere or pivot call all genuinely need a reasoner. Storing the ledger, diffing sessions, and tracking evidence tiers do not. Splitting the system this way took real debate, but it is what kept the LLM as the reasoner instead of expensive glue.

Representing confidence honestly was harder than expected. Our first instinct was a percentage, but we could never answer "why 25 percent and not 35," and a fake-precise number quietly violates the no-false-certainty principle we were building the whole product around. We replaced percentages entirely with evidence-tiered bands (Untested, Weak signal, Contested, Supported), where each label is defined by what evidence actually exists. That turned an honesty problem into a feature, because every tier can now justify itself out loud, for example "Weak signal because the only two positives came from coursemates."
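To show what "each label is defined by what evidence actually exists" could look like, here is a rough sketch. The four tier names come from our design; the `Evidence` fields, the `assign_tier` function, and especially the thresholds (such as three independent positives for Supported) are illustrative assumptions, not the rules Pulse actually applies.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    supports: bool     # does this item point toward the assumption being true?
    independent: bool  # False for friends, coursemates, or other biased sources


def assign_tier(items: list[Evidence]) -> tuple[str, str]:
    """Return (tier, reason). The thresholds are assumptions for illustration."""
    if not items:
        return "Untested", "no evidence has been collected yet"
    positives = [e for e in items if e.supports]
    negatives = [e for e in items if not e.supports]
    independent_positives = [e for e in positives if e.independent]
    if negatives:
        return "Contested", f"{len(negatives)} of {len(items)} signals contradict the assumption"
    if len(independent_positives) >= 3:
        return "Supported", f"{len(independent_positives)} positives came from independent sources"
    return (
        "Weak signal",
        f"only {len(independent_positives)} of {len(positives)} positives came from independent sources",
    )


# Two enthusiastic coursemates: positive, but not independent.
tier, reason = assign_tier([Evidence(True, False), Evidence(True, False)])
```

Because every branch returns a reason alongside the label, a tier can always explain itself, which is the property a bare percentage could not give us.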

Getting the prompts to reason instead of list was a recurring fight. Early versions of the assumption-extraction prompt just restated the founder's idea back as if it were a hidden belief. We had to push it to dig beneath the framing and surface the buried, often fatal assumption that the problem the founder sees is not the problem the market has. The evidence-interpretation step had the same trap: it is not enough to notice that two people liked the idea. The prompt has to explain why those two yeses do not count: they came from friends, which makes them a biased sample. Catching the why, not just the what, was the whole point.

The most instructive challenge came from the risk-ranking step, and it taught us something about testing AI. The prompt was supposed to pick the single riskiest assumption to test first, and on our canonical scenario it would sometimes land perfectly and sometimes phrase the same underlying belief differently, so our test flagged a failure. We almost "fixed" it by hardcoding the prompt to prefer the answer we wanted, then realized that would overfit to our demo and make the tool look rigged the moment anyone tried a different idea. Instead we added a general reasoning principle (for discovery, matching, or search tools, the spine is usually whether the user's hard part is really the finding step itself), which mirrors logic the extraction prompt already used and generalizes to any idea. The deeper lesson was about verification: one passing run does not prove a probabilistic prompt is fixed. We ran the full pipeline twelve times and confirmed it landed on the core-pain spine every time before we trusted it. We also had to loosen our own test, which was matching on exact keywords and falsely failing perfectly good paraphrases, while keeping it strict enough to still reject a genuinely wrong answer.
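The two testing ideas in this paragraph, tolerating paraphrases and requiring repeated passes, can be sketched roughly as follows. Everything named here is hypothetical: `REQUIRED_CONCEPTS`, `matches_spine`, and `verify_repeatedly` are illustrative, the concept groups are not our real test data, and the fake runner stands in for the actual model call.

```python
import itertools
from typing import Callable

# Each tuple is one concept; any listed phrasing satisfies it.
# These groups are made up for illustration.
REQUIRED_CONCEPTS = [
    ("find", "discover", "search"),
    ("hard", "pain", "struggle", "difficult"),
]


def matches_spine(answer: str) -> bool:
    """Accept any paraphrase that touches every required concept."""
    text = answer.lower()
    return all(any(term in text for term in group) for group in REQUIRED_CONCEPTS)


def verify_repeatedly(run_once: Callable[[], str], runs: int = 12) -> list[str]:
    """Run a probabilistic step `runs` times and return the answers that failed."""
    answers = [run_once() for _ in range(runs)]
    return [a for a in answers if not matches_spine(a)]


# Stand-in for the real pipeline: cycles through paraphrases of the same belief.
paraphrases = itertools.cycle([
    "Finding a tutor is the hard part for students",
    "Students struggle to discover a suitable tutor",
    "The real pain is the search for a tutor",
])
failures = verify_repeatedly(lambda: next(paraphrases))

# A genuinely different assumption should still be rejected.
wrong_answer_rejected = not matches_spine("Students will pay a monthly subscription")
```

Substring matching on concept groups is a crude stand-in for semantic comparison, but it captures the trade-off: loose enough to pass honest paraphrases, strict enough that an answer about a different belief still fails.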

Finally, two smaller but important calls. We had originally planned to auto-detect whether the user was starting a session or wrapping one up from how they phrased their message, and we cut it: it earned no points and could misfire live in front of judges, so we made the mode an explicit choice and removed a failure point for free. And coordinating a two-person repo meant a few divergent-history and rejected-push moments, which we handled by rebasing cleanly so the shared branch stayed linear and neither person's work was lost.

Accomplishments that we're proud of

We built something that does the genuinely hard part of validation, the judgment, rather than handing back a checklist. It catches the difference between real signal and a supportive friend, which is the exact thing founders fool themselves on. It remembers a multi-week arc and can tell you how your riskiest assumption shifted over time. And it refuses to make the one call that should always be the founder's: whether to keep going, change course, or walk away. We're proud that representing uncertainty honestly became a feature you can point at, not a disclaimer we tacked on.

What we learned

Use AI only where judgment can't be faked: reasoning on the model, data-shuffling on plain code. Honest beats precise: we ditched fake-precise confidence percentages for simple evidence tiers. One passing test proves nothing: we trusted our key prompt only after twelve right answers in a row. Fix the reasoning, not the demo: a general rule, never a hardcoded answer for our own example.

What's next for Project Pulse

Richer evidence handling (importing real interview notes or survey results), more test types beyond interviews and fake-door pages, and longer-horizon memory so Pulse can coach a builder across months instead of a few sessions. Longer term, a way to bring a mentor or teammate into the persevere/pivot moment, keeping the human decision central while widening whose judgment informs it.

Built With

  • azure-openai
  • claude
  • github
  • gpt
  • json
  • microsoft-foundry
  • openai
  • python
  • python-dotenv
  • streamlit