🧠 MoodLens — Catching the Storm Before It Hits
Passive Depression Screening & Tiered Mental Health Intervention | Hacklytics 2026
Project Technical Documentation
💡 Inspiration
It started with a question none of us could shake:
Why does mental healthcare only show up after someone is already in crisis?
We'd all seen it: a friend who seemed fine until suddenly they weren't. A statistic showing that 40% of teens who experience a major depressive episode receive zero care. A projected $16.3 trillion cost to the global economy by 2030 [1]. The gap isn't just a healthcare problem. It's a timing problem.
Depression doesn't appear overnight. It leaves a trail of digital fingerprints for weeks before anyone notices: disrupted sleep, fewer steps, increasingly withdrawn messages. But no one is watching that trail. Clinicians see patients at scheduled appointments. Crisis hotlines field calls mid-breakdown. There is almost no infrastructure for the critical middle ground: the weeks before someone reaches a breaking point.
We thought: the smartwatch is already on your wrist. What if this device could quietly notice what you can't yet see in yourself?
That question became MoodLens.
🔍 What We Built
MoodLens is a passive, always-on depression screening system that fuses two independent data streams (biometric signals from a wearable and linguistic signals from messages) into a single PHQ-9-aligned risk score, then automatically routes users to the appropriate tier of support.
No surveys. No manual check-ins. Zero active effort from the user.
| Biometric PHQ Score | Message Sentiment | Severity | Response |
|---|---|---|---|
| 0 – 9 | Neutral | Low | On-demand AI Audio & Text Companion |
| 10 – 19 | Anxiety, Depression, PTSD, Stress | Moderate | AI Voice Coach (ElevenLabs + LLM) |
| 20 – 27 | Any Sentiment | High | 🚨 Human-in-the-Loop Responder Protocol |
| Any score | Suicide | High | 🚨 Human-in-the-Loop Responder Protocol |
Three tiers. Four AI agents. Two data streams. One goal: reach people in the weeks before a breaking point — not the day after.
🛠️ How We Built It

Model 1: Biometrics (XGBoost)
We integrated with Google Health Connect to pull passive signals continuously from compatible wearables:
- Sleep duration & efficiency
- Heart Rate Variability (HRV / RMSSD)
- Resting heart rate
- Daily step count & active minutes
- Blood oxygen saturation (SpO2)
The challenge here was training data scarcity. Ethically labeled datasets pairing wearables with clinical PHQ-9 scores essentially don't exist. So we went back to the literature. A 2021 JMIR study by Rykov et al. published Spearman correlation coefficients between biometric features and depression severity. We used those coefficients to generate a statistically consistent synthetic training corpus of 10,000 samples via Cholesky decomposition of the covariance matrix, then trained an XGBoost classifier on top of it.
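The Cholesky trick above can be sketched in a few lines: draw independent Gaussian samples, then multiply by the Cholesky factor of the target correlation matrix so the columns end up with the published correlations. The coefficients, means, and standard deviations below are illustrative placeholders, not the actual Rykov et al. values.

```python
import numpy as np

# Target correlation matrix between wearable features and PHQ-9 severity.
# Values here are placeholders, NOT the actual Rykov et al. coefficients.
features = ["sleep_hours", "hrv_rmssd", "phq9"]
corr = np.array([
    [ 1.00,  0.30, -0.35],   # more sleep tracks with lower PHQ-9
    [ 0.30,  1.00, -0.30],   # higher HRV tracks with lower PHQ-9
    [-0.35, -0.30,  1.00],
])

rng = np.random.default_rng(42)
L = np.linalg.cholesky(corr)              # corr == L @ L.T
z = rng.standard_normal((10_000, len(features)))
samples = z @ L.T                          # columns now follow the target correlations

# Rescale to plausible units (means/stds are also placeholders).
means = np.array([6.8, 42.0, 7.5])
stds = np.array([1.2, 15.0, 5.5])
data = samples * stds + means
phq_labels = np.clip(np.round(data[:, -1]), 0, 27)  # PHQ-9 range is 0-27
```

The resulting `(features, phq_labels)` pairs can then be fed straight into a tabular classifier.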
XGBoost earned its place here for three reasons: robust performance on tabular data, graceful handling of missing wearable sync gaps, and SHAP-based interpretability.
Model 2: Linguistics (multiMentalRoBERTa)
We fine-tuned **mental-roberta-base**, a domain-adapted checkpoint pre-trained on mental health corpora, on a combined corpus built from the Reddit Depression dataset, the Dreaddit dataset, the Stress Annotated Dataset, the Reddit Suicide and Depression dataset, and a neutral data collection, spanning Depression, Anxiety, PTSD, Stress, Suicidal Ideation, and a Neutral control. The 6-class setup hit an **F1 of 0.835**, outperforming MentalBERT (0.826) and few-shot GPT-5o (0.561).
🔀 Score Fusion & Ensemble Decision Logic
Combining a biometric sub-score and a linguistic sub-score into one coherent PHQ-9 number required careful calibration. We built a weighted fusion layer and, critically, a two-function ensemble decision system with hard safety overrides that operates across three distinct tiers.
Tier 0: Mild (PHQ 0–9): The clinical anchor holds. The biometric score is the primary driver, and no active intervention is pushed. The user has on-demand access to supportive agents but is not flagged for escalation.
Tier 1: Moderate (PHQ 10–19): The ensemble watches for masked depression, cases where a user's self-reported or biometric score sits at Tier 0, but the NLP stream is detecting elevated linguistic markers of depression. In these cases, the system escalates to Tier 1 rather than deferring to the lower score. This is the catch for users who are struggling but not yet showing it in every signal.
Tier 2: Severe (PHQ 20–27 or Safety Override): Any detection of suicidal ideation by the NLP model triggers an immediate Tier 2 escalation, completely independent of the aggregate score. PHQ Item 9 "thoughts of being better off dead" functions as an identical hard override. There is no averaging, no weighting, and no threshold to clear. A safety signal at this level bypasses the entire scoring pipeline.
This three-layer logic ensures that no safety signal is ever silently absorbed by an aggregate score, which is exactly the failure mode that matters most in a mental health context.
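The three-layer logic above can be condensed into a short routing function. This is a minimal sketch: the 0.6/0.4 fusion weights and the label strings are hypothetical stand-ins, not our calibrated values.

```python
# Sketch of the two-function ensemble: weighted score fusion plus hard
# safety overrides. Weights and label names are illustrative placeholders.
BIO_WEIGHT, NLP_WEIGHT = 0.6, 0.4

def fused_phq(bio_phq: float, nlp_phq: float) -> int:
    """Weighted fusion of the two sub-scores onto the PHQ-9 0-27 scale."""
    return round(BIO_WEIGHT * bio_phq + NLP_WEIGHT * nlp_phq)

def route_tier(bio_phq: float, nlp_phq: float, nlp_label: str) -> int:
    # Hard override: suicidal ideation bypasses the scoring pipeline entirely.
    if nlp_label == "suicide":
        return 2
    score = fused_phq(bio_phq, nlp_phq)
    if score >= 20:
        return 2
    # Masked depression: the aggregate looks mild, but language is elevated.
    if score <= 9 and nlp_phq >= 10 and nlp_label in {"depression", "anxiety", "ptsd", "stress"}:
        return 1
    return 1 if score >= 10 else 0
```

Note that the suicide check happens before any arithmetic: no weighting can ever dilute it, which is the invariant the design depends on.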
🤖 Meet the Agents (Eleven Labs)
Once the ensemble locks in a tier, it hands off to one of four purpose-built AI agents, each calibrated for its specific clinical context and risk level.
🟢 Audio Companion & Text Companion — Tier 0 At low severity, no intervention is pushed. Instead, users have on-demand access to two supportive agents: a voice-based Audio Companion and a chat-based Text Companion. These agents deliver personalized affirmations, psychoeducation snippets, and gentle mood check-ins. Users can also tap into a curated library of audiobooks and calming background music, content chosen specifically to support emotional regulation and quiet mental noise without requiring any active effort. The goal is proactive resilience-building, meeting users in a good moment, giving them tools to reach for, so the foundation is already there if things get harder.
🟡 AI Voice Coach — Tier 1 The user receives a notification and is connected to an AI Voice Coach powered by a large language model and ElevenLabs voice synthesis. The Coach uses RAG over a curated knowledge base of proven clinical strategies: 5-4-3-2-1 grounding, cognitive reframing, guided breathing, and evidence-based approaches for stress, depression, anxiety, and PTSD, pulling the most relevant technique for the moment in real time. Empathetic, stigma-free, and always available, it is explicitly designed as a bridge to professional care. The Coach is not therapy. It is the thing that gets someone to therapy.
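The retrieval step the Coach relies on can be illustrated with a toy version: score each technique document against the user's utterance and return the best match. A real deployment would use dense embeddings in a vector DB; the bag-of-words cosine similarity and the technique texts below are simplifications for illustration.

```python
import math
from collections import Counter

# Toy RAG retrieval: pick the clinical technique whose description best
# matches the user's utterance. Technique texts are illustrative stubs.
TECHNIQUES = {
    "5-4-3-2-1 grounding": "name five things you can see four things you can touch three you can hear",
    "cognitive reframing": "identify the automatic negative thought challenge it and reframe it",
    "guided breathing": "slow your breathing in through the nose for four seconds out for six",
}

def embed(text: str) -> Counter:
    """Bag-of-words term counts (a stand-in for a dense embedding)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(query: str) -> str:
    q = embed(query)
    return max(TECHNIQUES, key=lambda name: cosine(q, embed(TECHNIQUES[name])))
```

Swapping `embed` for a sentence-embedding model and `TECHNIQUES` for a vector DB of clinical documents gives the production shape of the same pipeline.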
🔴 Responder Agent — Tier 2 The most critical agent in the system. When Tier 2 is triggered, whether by score or by NLP suicide detection, the Responder Agent activates a full human-in-the-loop protocol. Emergency contacts are alerted via call or email using the message tool. Volunteer mental health professionals or responders gain access to a report containing PHQ trends, sentiment analysis, and the current state. The Responder Agent's job is not to have a conversation. Its job is to get the right humans involved as fast as possible.
🤝 Human-in-the-Loop: The Heart of Tier 2
Most AI mental health tools stop at a score. We didn't think that was good enough.
When MoodLens detects a severe PHQ score (20–27) — or when suicidal ideation is flagged at any score — it doesn't just send a push notification and hope for the best. It activates a full human-in-the-loop Responder Protocol:
- Emergency contacts are notified — pre-designated people the user has chosen and consented to in advance
- Mental health volunteers or responders gain access to a real-time PHQ trend dashboard showing the user's trajectory over preceding weeks
- The user is simultaneously informed that their contacts have been alerted and is given crisis resources
- Responders see context, not content — the dashboard surfaces PHQ trends and flagged language themes, never raw messages
The philosophy here was deliberate: speed without surveillance. A responder who can see that someone's PHQ score has climbed steadily for three weeks is far better equipped to help than one walking in cold. But that same responder has no business reading someone's private messages word-for-word.
The AI Voice Coach in Tier 1 is also designed as a bridge, not a replacement. It provides empathetic, structured conversation — behavioral activation prompts, cognitive reframing, guided breathing — while actively guiding users toward professional care when the conversation warrants it. We were careful that no agent ever positions itself as therapy.
🔒 Data Privacy: Built In, Not Bolted On
Analyzing someone's health metrics and messages is one of the most sensitive things software can do. We treated privacy as a first-principles design constraint, not a compliance checkbox.
What we collect and how:
- Biometric data is pulled via Google Health Connect with explicit user permission
- Message content is analyzed locally or in fully anonymized form — raw text never leaves the device in identifiable form
- PHQ sub-scores and trend data are stored on Databricks; raw message content is never stored at all
What responders see vs. what they don't:
| Responders CAN see | Responders CANNOT see |
|---|---|
| PHQ score trend over time | Raw message content |
| Severity tier history | Individual biometric readings |
| Flagged language themes | Specific conversations |
| Escalation timestamps | Any personally identifiable text |
User control is non-negotiable:
- Users can view, edit, or remove their emergency contact list at any time
- Consent is granular — biometric access and message analysis are separate opt-ins
- The system is designed so that turning off either data stream degrades gracefully rather than failing
We believe that a mental health tool which requires users to surrender their privacy in exchange for help isn't actually helping them.
⚡ Challenges
Getting training data without compromising ethics. Real PHQ-labeled wearable datasets are rare and sensitive. Synthesizing data from published correlation structures felt like a necessary and defensible tradeoff — but validating that the synthetic distribution matched clinical reality took serious iteration.
Calibrating score fusion across two totally different modalities. A biometric sub-score and a linguistic sub-score live in different statistical worlds. Getting the weighted fusion to produce a PHQ distribution matching clinical norms required going back to reference prevalence data and tuning weights against labeled anchor cases.
The Stress class problem. We built a 6-class NLP classifier before realizing one class was actively hurting us. Running cosine similarity analysis to diagnose why performance plateaued — and having the courage to drop a class rather than paper over it — was one of the more satisfying debugging moments of the weekend.
Designing Tier 2 without causing harm. The moment you alert someone's emergency contacts, you're making a high-stakes intervention in their life. A false positive isn't a minor inconvenience — it can damage trust, relationships, and the user's willingness to keep using the system. We spent significant time stress-testing the ensemble thresholds, the responder dashboard's information architecture, and the exact language used in every alert.
📚 What We Learned
Read the papers before you touch the keyboard. The Rykov et al. correlation structure handed us our training data. The PHQ-9 framework handed us our scoring logic. Standing on published clinical research let us skip months of data collection and focus on the system design that actually matters.
The hardest part of mental health tech isn't the model — it's the intervention. Knowing someone is struggling is only useful if you can reach them in a way that doesn't make things worse. Every design decision downstream of the score matters just as much as the score itself.
Privacy and utility are not opposites. Every privacy constraint we added — local analysis, anonymized storage, theme-not-content dashboards — made us think harder about what data we actually needed. The result was a cleaner, more defensible architecture.
Human oversight isn't a fallback. It's a feature. We could have made MoodLens fully automated end-to-end. We chose not to. Real people in Tier 2 situations deserve real human contact. The AI's job is to get the right information to the right human as fast as possible — not to replace that human.
🔭 What's Next
- Real-world validation with actual PHQ-labeled wearable data to move beyond synthetic training
- Longitudinal trend modeling — shifting from point-in-time scoring to trajectory-based risk detection
- Clinician-in-the-loop integration for Tier 1/2 boundary cases, connecting users to licensed professionals
- Full privacy audit with security researchers before any production deployment
- User-controlled sensitivity tuning — letting users adjust how aggressively the system escalates based on their own preferences and history
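The trajectory-based risk detection on the roadmap could be prototyped as simply as a slope fit over recent weekly fused scores. A sketch, not part of the current system; the 1.5 points-per-week threshold is an illustrative choice, not a clinically validated one.

```python
import numpy as np

# Sketch of trajectory-based risk: fit a line to recent weekly PHQ scores
# and flag a sustained upward trend. Threshold is illustrative only.
def trend_flag(weekly_phq: list[float], slope_threshold: float = 1.5) -> bool:
    weeks = np.arange(len(weekly_phq))
    slope = np.polyfit(weeks, weekly_phq, deg=1)[0]  # PHQ points per week
    return bool(slope >= slope_threshold)

trend_flag([6, 8, 11, 13])   # steady climb -> True
trend_flag([9, 8, 9, 8])     # stable -> False
```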
🧰 Tech Stack
| Layer | Technology |
|---|---|
| Frontend | React (Vite) + Google Health Connect API |
| Data Analysis | Sphinx |
| ML Infrastructure | Databricks (model serving + PHQ trend DB) |
| Biometric Model | XGBoost on synthetic data (Rykov et al. 2021) |
| Linguistic Model | Fine-tuned RoBERTa (mental-roberta-base) |
| Scoring Framework | PHQ-9 clinical alignment |
| Conversational AI | ElevenLabs (4 agents) |
| Score Fusion | Weighted ensemble with safety override logic |
| RAG | Vector DB of documents covering professional help techniques |
| LLM Layer | Gemini, backing the conversational agents' conversation engine |
| Privacy | Local or anonymized message analysis; responders receive PHQ trends, not raw content |
📖 References
- [1] Trautmann, S., Rehm, J. & Wittchen, H. The economic costs of mental disorders. EMBO Rep 17, 1245–1249 (2016). https://doi.org/10.15252/embr.201642951
- [2] Liu Y, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692. 2019.
- [3] Kroenke K, Spitzer RL, Williams JB. The PHQ-9: Validity of a Brief Depression Severity Measure. J Gen Intern Med. 2001.
- [4] Annie E. Casey Foundation. 2024 Kids Count Data Book: Child Mental Health.
- [5] CDC. Mental Health Conditions & Care Data, 2024.
- [6] Rykov Y, et al. Digital Biomarkers for Depression Screening With Wearable Devices. JMIR Mhealth Uhealth. 2021.
MoodLens — Hacklytics 2026 | Built with care, clinical grounding, and a deep respect for the people this is meant to help.