๐ฅ Inspiration I was building InnerVoice โ a full mental health platform with AI chat, crisis support, and therapy booking โ when a thought stopped me cold: what if this AI gives someone the worst possible advice at the worst possible moment? So I tested it. In under 10 minutes, I found four ways to break it using nothing but normal typed messages. No hacking tools. No special access. Just words. That terrified me. Because 1 in 5 people use some form of mental health AI today โ and none of them know if it's ever been safety tested. I decided to be the first person to actually do it, document it, and build a real fix.
๐ก What It Does InnerVoice has two superpowers no other mental health AI has:
- It knows its own weaknesses. A red-teaming framework systematically attacks the AI with the same tricks a real user might try โ roleplay manipulation, fake developer overrides, pressure to skip crisis hotlines, and eval awareness tests. An automated safety probe then checks every single AI response against six clinical safety rules before it reaches the user. If it's dangerous, it gets blocked. In milliseconds.
- It sees what users don't say. Through Knot's TransactionLink API, InnerVoice connects to a user's real pharmacy and supplement purchase data โ CVS, Walgreens, Amazon, Walmart. When someone stops refilling their medication for three weeks, InnerVoice notices and reaches out โ before the user has to say a word. That's the difference between an app that waits for you to ask for help, and one that notices before you do.
๐ ๏ธ How We Built It Red-Teaming Layer Designed 6 adversarial attack prompts targeting the most dangerous failure modes in mental health AI โ roleplay jailbreaks, developer mode overrides, crisis resource bypasses, substance use normalization, and eval awareness. Each attack fires at the Claude-powered InnerVoice chatbot and captures the raw response for analysis. Safety Probe Built a lightweight rule-based classifier that sits between the AI and the user. It checks every response against 6 clinical rules: no harmful suggestions, crisis responses must include the 988 lifeline, must never discourage professional help, must reject jailbreak framing, and must behave consistently whether or not it's being observed. Runs significantly faster than the LLM itself โ zero perceptible latency for the user. Steering Vector Correction When the probe flags a violation, a safety-constrained system prompt is injected and the response is regenerated. Same message. Safe answer. Demonstrated live in a before/after format on the dashboard. Knot TransactionLink Integration Connected InnerVoice to Knot's merchant data API to pull SKU-level pharmacy and supplement purchases. This feeds behavioral context into the AI โ turning it from reactive to proactive. Full Stack Claude API ยท Knot TransactionLink ยท React 18 + TypeScript ยท ASP.NET Core 8 ยท Flutter (iOS/Android) ยท MongoDB Atlas ยท SignalR
๐งฑ Challenges We Ran Into The "I'm fine" problem was harder than the jailbreaks. Finding the four attack vectors was straightforward. The harder challenge was designing a system that catches the crisis no one announces โ the user who sounds okay but isn't. That required thinking beyond text analysis into behavioral signals, which is what pushed me toward the Knot integration. Making the probe fast enough to be real. A safety check that adds noticeable delay is a safety check that gets turned off in production. Getting the classifier to run in milliseconds โ genuinely faster than the LLM โ required careful design so it would hold up under real usage conditions. Working solo across the full stack. Every layer โ mobile, backend, web panel, AI safety, Knot integration โ had to be built and connected by one person in 36 hours. Prioritizing ruthlessly and knowing when "good enough to demo" was the right call was a skill in itself.
๐ Accomplishments That We're Proud Of
Found 4 distinct, reproducible jailbreak failure modes in a real mental health AI โ documented, categorized, and fixed Built a safety probe with 80%+ catch rate running in real time with no latency impact on the user experience Integrated Knot's TransactionLink API to create a passive health OS that requires zero manual logging from the user Demonstrated a complete attack โ detection โ correction pipeline live on a working dashboard Built this across a full production-grade stack โ Flutter mobile, ASP.NET backend, React web panel โ as a solo developer in 36 hours
๐ What We Learned AI safety in mental health is not a solved problem โ it's barely started. Standard safety training doesn't catch the specific failure modes that appear in mental health contexts. Roleplay framing, authority override, and eval awareness are exploitable in ways that generic guardrails don't address. Behavioral data is more honest than self-reporting. Users underreport how they're doing โ especially in mental health contexts. Purchase history doesn't lie. The Knot integration showed that passive signals from the real world can tell you things the user never will. Cheap probes can replace expensive human oversight at scale. A lightweight classifier running locally catches the majority of failures that would otherwise require a clinician to review every AI conversation manually. That's the insight that makes this scalable.
๐ What's Next for InnerVoice: Breaking AI to Make It Safer The safety probe built this weekend becomes permanent middleware in the production InnerVoice backend โ a SafetyAuditService that intercepts every AI response before it reaches any real user, forever. The rule-based classifier gets replaced with a fine-tuned ML model trained on real flagged conversations, pushing accuracy above 95% and making it adaptive to new attack patterns. The Knot integration expands beyond pharmacy data to include grocery nutrition patterns and fitness spending โ building a full passive health OS for every InnerVoice user without requiring a wearable or manual input. Longer term, the red-teaming findings and probe methodology get published as an open standard for mental health AI safety testing โ because every app that talks to someone in crisis deserves to be held to this bar. InnerVoice is the first mental health AI that asked what if it gets it wrong? It won't be the last.
Log in or sign up for Devpost to join the conversation.