KNE-Guards — Stress-Testing Student Products Before You Build Them
Inspiration
It started with a question: why do some student products stick, and most don't?
We kept coming back to the same observation — the student graveyard is full of products that seemed reasonable on paper. Another flashcard app. Another focus timer. Another study planner. They all had users at launch and almost none of them had users six months later.
So we started mapping what the survivors had in common. Duolingo pulls you back every day through streaks. WhatsApp is useless without your friends, so you recruit them. Grammarly sits inside the essay you're already writing. Notion becomes the place where your entire academic life lives.
The pattern kept resolving into the same structural dimensions: does the product replace something students already do? Does it pull them back habitually? Is it embedded in an existing workflow? Does it spread through word of mouth? Is it discoverable organically?
We called these the R-U-W-F-M mechanism scores. And once we had the framework, we asked: what if we could evaluate any student product idea against these dimensions — automatically, adversarially, and before a single line of code was written?
That became KNE-Guards.
What We Built
KNE-Guards is a two-stage evaluation engine for student product ideas:
Stage 1: Adversarial Critique
A pitch goes in. An AI critic trained to be a skeptical seed-stage investor tears it apart, surfacing kill shots, challenging pricing assumptions, stress-testing each feature, and scoring the product across five structural dimensions:
$$S = \sum_{d \in \{R,U,W,F,M\}} w_d \cdot \alpha_d \cdot x_d$$
where $w_d$ is the strategy weight for dimension $d$, $\alpha_d$ is the archetype-specific sensitivity multiplier, and $x_d$ is the raw mechanism score assigned by the AI.
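As a worked illustration of the weighted sum above, here is a minimal Python sketch. The function name and every numeric value are hypothetical and chosen only to make the arithmetic easy to follow; the project's actual weights and multipliers are not stated here.

```python
DIMENSIONS = ("R", "U", "W", "F", "M")


def mechanism_score(weights, sensitivities, raw_scores):
    """Return S = sum over d of w_d * alpha_d * x_d for the five dimensions."""
    return sum(weights[d] * sensitivities[d] * raw_scores[d] for d in DIMENSIONS)


# Hypothetical inputs, not taken from the project.
weights = {"R": 0.3, "U": 0.3, "W": 0.2, "F": 0.1, "M": 0.1}        # w_d
sensitivities = {"R": 1.0, "U": 1.2, "W": 0.8, "F": 1.0, "M": 0.5}  # alpha_d
raw_scores = {"R": 0.5, "U": 0.5, "W": 0.5, "F": 0.5, "M": 0.5}     # x_d

score = mechanism_score(weights, sensitivities, raw_scores)
# 0.15 + 0.18 + 0.08 + 0.05 + 0.025 = 0.485 (up to floating-point rounding)
print(round(score, 3))
```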
Stage 2: Behavioral Simulation
100 synthetic student personas (grinders, explorers, burnouts, budgeters, and social followers), each with a distinct behavioral profile, are run through 30 days of adoption. Each day, every persona decides: keep using the product, abandon it, or switch to a substitute. The simulation tracks retention curves, churn spikes, and archetype-level drop-off.
The two stages combine into a single survivability score and a final verdict: Build, Iterate, or Drop.
How We Built It
The backend is a pure Python HTTP server with no framework dependencies: just the standard library, SQLite for persistence, and the OpenAI API for the critic and expression helper.
The simulation engine models each persona as a stateful agent. On each day, satisfaction updates based on the product's mechanism scores, distraction costs vary by phase (onboarding → retention → habit), and decisions are drawn stochastically against per-archetype thresholds.
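The daily loop described above could look roughly like the sketch below. It is not the project's engine: the phase boundaries, distraction costs, per-archetype thresholds, the update rule, and the single `pull` number standing in for the mechanism scores are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass

# Hypothetical per-phase distraction costs.
PHASE_COST = {"onboarding": 0.04, "retention": 0.08, "habit": 0.02}

# Hypothetical per-archetype thresholds: (abandon_below, switch_below).
ARCHETYPES = {
    "grinder": (0.15, 0.25),
    "explorer": (0.25, 0.40),
    "burnout": (0.35, 0.45),
    "budgeter": (0.20, 0.40),
    "social_follower": (0.25, 0.35),
}


def phase_for_day(day):
    """Map a 1-based day to a phase; the day boundaries are assumed."""
    if day <= 7:
        return "onboarding"
    if day <= 21:
        return "retention"
    return "habit"


@dataclass
class Persona:
    archetype: str
    satisfaction: float = 0.5
    active: bool = True


def step(persona, day, pull, rng):
    """Update satisfaction for one day and return 'keep', 'abandon', or 'switch'."""
    cost = PHASE_COST[phase_for_day(day)]
    updated = persona.satisfaction + 0.1 * pull - cost
    persona.satisfaction = min(1.0, max(0.0, updated))
    # Noise makes the decision stochastic rather than a hard cutoff.
    felt = persona.satisfaction + rng.uniform(-0.1, 0.1)
    abandon_below, switch_below = ARCHETYPES[persona.archetype]
    if felt < abandon_below:
        return "abandon"
    if felt < switch_below:
        return "switch"
    return "keep"


def simulate(pull, per_archetype=20, days=30, seed=0):
    """Return the fraction of personas still active at the end of each day."""
    rng = random.Random(seed)
    personas = [Persona(a) for a in ARCHETYPES for _ in range(per_archetype)]
    retention = []
    for day in range(1, days + 1):
        for p in personas:
            if p.active and step(p, day, pull, rng) != "keep":
                p.active = False
        retention.append(sum(p.active for p in personas) / len(personas))
    return retention


curve = simulate(pull=0.5)  # 5 archetypes x 20 personas = 100 agents over 30 days
```

In this sketch a persona that abandons or switches never returns, so the retention curve can only fall; a fuller model might allow re-activation.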
The survivability model computes a weighted score per archetype and aggregates them:
$$S_{\text{aggregate}} = \frac{1}{|\mathcal{A}|} \sum_{a \in \mathcal{A}} S_a$$
where each $S_a$ is penalised multiplicatively for active killers, that is, dimensions that are load-bearing for the product's strategy but fall below a critical threshold:
$$S_a = S_{a,\text{base}} \times (1 - \kappa)^{k}$$
with $\kappa = 0.35$ per active killer and $k$ the number of killers triggered.
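To make the penalty concrete: one active killer scales a score by 0.65, and two scale it by 0.65² = 0.4225. The sketch below applies that rule and then averages across archetypes. The base scores, the weight cutoff that marks a dimension as load-bearing, and the critical threshold are hypothetical; only the 0.35 penalty comes from the text.

```python
KAPPA = 0.35  # penalty per active killer, as stated in the text


def count_killers(weights, raw_scores, load_bearing_weight=0.25, critical=0.3):
    """Count dimensions that are load-bearing but score below the critical threshold.

    Both cutoffs are illustrative assumptions.
    """
    return sum(
        1
        for d in weights
        if weights[d] >= load_bearing_weight and raw_scores[d] < critical
    )


def penalised_score(base, killers):
    """S_a = S_a_base * (1 - kappa) ** k."""
    return base * (1 - KAPPA) ** killers


def aggregate(per_archetype):
    """Unweighted mean of the per-archetype scores."""
    return sum(per_archetype.values()) / len(per_archetype)


# Hypothetical strategy weights and AI-assigned scores.
weights = {"R": 0.4, "U": 0.3, "W": 0.1, "F": 0.1, "M": 0.1}
raw_scores = {"R": 0.2, "U": 0.1, "W": 0.1, "F": 0.9, "M": 0.9}
k = count_killers(weights, raw_scores)  # R and U qualify, so k == 2

base_scores = {"grinder": 0.8, "explorer": 0.6}  # hypothetical S_a_base values
per_archetype = {a: penalised_score(s, k) for a, s in base_scores.items()}
total = aggregate(per_archetype)  # (0.338 + 0.2535) / 2 = 0.29575
```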
The frontend is vanilla JavaScript — no framework — with a custom SPA router, Supabase auth, and a results column that surfaces the summary verdict, untapped strengths, kill shots, and steelman side by side.
Challenges
Calibrating the AI scorer was the hardest problem. The AI critic is also responsible for assigning the R-U-W-F-M scores, but its default behaviour was to conflate skepticism with low scores, compressing everything toward 0.3–0.5 regardless of the product's actual structural position. We went through several iterations of anchor-based calibration, adding explicit penalty rules for dominant incumbents, UX complexity, forced adoption, and novelty-driven engagement, before scores became meaningfully differentiated.
Coordinating the two-stage pipeline required careful state management. The mechanism scores produced by the critic need to flow into the simulation so the behavioral model reflects the AI's structural assessment, not just the raw spec. Getting this handoff right across the frontend, API, and simulation engine took more wiring than expected.
Managing the AI agents in the persona simulation required balancing stochasticity with reproducibility. Too much randomness and results were noisy; too little and the model stopped surfacing meaningful variance across archetypes. We settled on seeded random number generators with per-persona behavioral profiles to get consistent but differentiated outputs.
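One way to get that balance with the standard library is to give each persona its own generator, derived from a run-level seed plus the persona's identifier. This is a sketch of the general technique rather than the project's code, and `persona_rng` and `draws` are hypothetical names.

```python
import random


def persona_rng(run_seed, persona_id):
    """Return a private generator for one persona.

    Seeding random.Random with a string is deterministic across runs,
    so the same (run_seed, persona_id) pair always yields the same stream.
    """
    return random.Random(f"{run_seed}:{persona_id}")


def draws(run_seed, persona_id, n=3):
    """Return the first n uniform draws for a persona."""
    rng = persona_rng(run_seed, persona_id)
    return [rng.random() for _ in range(n)]


same_a = draws(42, 7)
same_b = draws(42, 7)      # identical to same_a: the run is reproducible
neighbour = draws(42, 8)   # a different persona gets a different stream
```

Because each persona owns its stream, adding or removing one persona does not shift the random draws seen by the others.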
Auth and infrastructure: wiring Supabase auth to a custom Python HTTP server with no framework meant implementing the JWT validation layer by hand, managing session cookies carefully, and reconciling what the frontend expected with what the backend provided.
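For a sense of what hand-rolled validation involves, here is a minimal standard-library sketch. It assumes tokens are signed with HS256 and a shared secret, which may not match a given Supabase project's configuration. The function names are hypothetical, `sign_hs256` exists only so the example can be exercised locally, and a production check would typically also validate claims such as audience and issuer.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_decode(segment):
    """Decode base64url, restoring the padding that JWTs strip."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw):
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def sign_hs256(payload, secret):
    """Build a token for local testing only; real tokens come from the auth provider."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url_encode(json.dumps(payload).encode())
    sig = hmac.new(secret, f"{header}.{body}".encode("ascii"), hashlib.sha256).digest()
    return f"{header}.{body}.{_b64url_encode(sig)}"


def verify_hs256(token, secret, now=None):
    """Return the payload if the signature and expiry check out, else raise ValueError."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        payload = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(sig_b64)
    except ValueError as exc:
        raise ValueError("malformed token") from exc
    # Pin the algorithm so a token cannot choose its own verification method.
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(signature, expected):
        raise ValueError("bad signature")
    current = time.time() if now is None else now
    exp = payload.get("exp") if isinstance(payload, dict) else None
    if not isinstance(exp, (int, float)) or exp <= current:
        raise ValueError("token expired or missing exp")
    return payload


secret = b"test-secret"  # placeholder, not a real key
token = sign_hs256({"sub": "user-1", "exp": 2000}, secret)
claims = verify_hs256(token, secret, now=1000)
```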
What We Learned
The framework itself turned out to be the insight. Once you have the R-U-W-F-M lens, you start seeing every product through it, and the failures become obvious in retrospect. Google+ had the wrong strategy for its mechanism strengths. Quibi had no structural return trigger. BeReal had viral spread but no depth beneath the novelty.
The hardest part of building an evaluation tool is that you need ground truth to calibrate against. We ended up building a 34-product benchmark — spanning WhatsApp to Google Wave, Duolingo to Chegg — and using it to tune both the AI scoring prompts and the survivability thresholds until decisions matched real-world outcomes at a level we trusted.
Built With
- css
- html
- javascript
- python
- supabase