Inspiration
Mental health doesn't break overnight — it drifts. Slowly, over days and weeks, in ways that are hard to notice until they're already a problem. Most tools catch this too late: therapy is weekly, questionnaires are monthly, journaling gets abandoned after three days. We wanted to build something that required almost no effort but generated a real signal. Thirty seconds a day. Just talk.
What it does
Voice Pulse takes a short daily voice note and extracts three independent signals from it:
- Vocal biomarkers — speaking pace, energy variance, and silence ratio, extracted directly from the audio in the browser using the Web Audio API. No server needed for this part; a sketch follows this list.
- Transcript sentiment — the recording is transcribed and analysed for emotional state, concern level, and sentiment score.
- ML emotion detection — a HuggingFace distilroberta model classifies the transcript across seven emotions (anger, sadness, joy, fear, disgust, surprise, neutral).
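To make the client-side part concrete, here's a minimal sketch of the in-browser metering: per-frame RMS energy from an AnalyserNode, from which energy variance and silence ratio fall out. The 50 ms frame interval and 0.01 silence threshold are illustrative values, not our tuned ones.

```js
// A minimal sketch, not our exact code: per-frame RMS energy from an AnalyserNode.
const audioCtx = new AudioContext();
const analyser = audioCtx.createAnalyser();
analyser.fftSize = 2048;

async function startMetering() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  audioCtx.createMediaStreamSource(stream).connect(analyser);

  const buf = new Float32Array(analyser.fftSize);
  const frames = [];                                  // one RMS value per frame

  const timer = setInterval(() => {
    analyser.getFloatTimeDomainData(buf);
    let sum = 0;
    for (const s of buf) sum += s * s;
    frames.push(Math.sqrt(sum / buf.length));         // RMS energy of this frame
  }, 50);

  // Returns a stop() function that yields the summary metrics.
  return () => {
    clearInterval(timer);
    const mean = frames.reduce((a, b) => a + b, 0) / frames.length;
    const variance = frames.reduce((a, b) => a + (b - mean) ** 2, 0) / frames.length;
    const silent = frames.filter(e => e < 0.01).length;
    return { energyVariance: variance, silenceRatio: silent / frames.length };
  };
}
```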
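And a hedged sketch of the emotion classification call, assuming the standard HuggingFace Inference API shape; the checkpoint name here is a guess at a typical distilroberta emotion model with those seven labels, not necessarily the exact one we deployed.

```js
// Sketch: classify the transcript via the HF Inference API (server-side Node).
// Model name and env var are assumptions for illustration.
async function classifyEmotions(transcript) {
  const res = await fetch(
    'https://api-inference.huggingface.co/models/j-hartmann/emotion-english-distilroberta-base',
    {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${process.env.HF_API_KEY}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ inputs: transcript }),
    }
  );
  // The response is a nested list of { label, score } pairs, one per emotion.
  const [scores] = await res.json();
  return Object.fromEntries(scores.map(({ label, score }) => [label, score]));
}
```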
All three signals combine into a daily emotional state flag — stable, elevated, or high — with personalised feedback and a 5-activity wellness programme that adapts based on whether the activities are actually improving your trendline. There's also a Talk mode: a conversational AI companion that conducts a structured voice check-in, streams responses sentence by sentence with text-to-speech, and feeds the full conversation into the same analysis pipeline.
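The fusion step itself can be as simple as a weighted score with two cut-offs. The weights and thresholds below are placeholders to show the shape of the idea, not our calibrated values:

```js
// Sketch: fuse the three signals into a daily flag. All numbers are illustrative.
function dailyFlag({ vocal, sentiment, emotions }) {
  const distress =
    0.3 * vocal.silenceRatio +                      // long silences
    0.3 * (1 - sentiment.score) +                   // assumes score in [0,1], 1 = positive
    0.4 * ((emotions.sadness ?? 0) + (emotions.fear ?? 0) + (emotions.anger ?? 0));
  if (distress > 0.7) return 'high';
  if (distress > 0.4) return 'elevated';
  return 'stable';
}
```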
How we built it
- Frontend: Vanilla HTML, CSS, and JavaScript. MediaRecorder API for capture, Web Audio API for real-time vocal metrics, browser SpeechRecognition for live transcript drafts.
- Backend: Node.js and Express. Multer for audio uploads.
- AI pipeline: Whisper for transcription, a language model for emotional analysis, HuggingFace Inference API for emotion classification, TTS for Talk mode audio responses.
- Persistence: Browser localStorage for 7-day history, streak tracking, and daily programme completions — no backend database needed for the MVP. Sketches of the capture, transcription, and persistence pieces follow this list.
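A sketch of the capture path; the `/api/analyze` route and the `audio` field name are illustrative:

```js
// Sketch: 30-second capture with MediaRecorder, then upload to the Express backend.
async function recordAndUpload() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  const chunks = [];
  recorder.ondataavailable = e => chunks.push(e.data);

  const done = new Promise(resolve => {
    recorder.onstop = () => resolve(new Blob(chunks, { type: recorder.mimeType }));
  });

  recorder.start();
  setTimeout(() => recorder.stop(), 30_000);          // the 30-second pulse

  const form = new FormData();
  form.append('audio', await done, 'pulse.webm');
  return fetch('/api/analyze', { method: 'POST', body: form });
}
```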
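On the server, the matching route is roughly this. We're assuming the hosted Whisper endpoint via the official `openai` SDK here; a self-hosted Whisper would slot into the same place. Error handling is elided.

```js
// Sketch: Express route receiving the upload via Multer, transcribing with Whisper.
const express = require('express');
const multer = require('multer');
const OpenAI = require('openai');
const fs = require('fs');

const upload = multer({ dest: 'uploads/' });
const openai = new OpenAI();              // reads OPENAI_API_KEY from the environment
const app = express();

app.post('/api/analyze', upload.single('audio'), async (req, res) => {
  const transcription = await openai.audio.transcriptions.create({
    file: fs.createReadStream(req.file.path),
    model: 'whisper-1',
  });
  res.json({ transcript: transcription.text });
});

app.listen(3000);
```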
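And persistence really is just localStorage. A sketch with illustrative key names:

```js
// Sketch: rolling 7-day history plus a streak counter, all client-side.
function saveDay(entry) {
  const history = JSON.parse(localStorage.getItem('vp_history') || '[]');
  history.push({ date: new Date().toISOString().slice(0, 10), ...entry });
  localStorage.setItem('vp_history', JSON.stringify(history.slice(-7))); // keep 7 days
}

function streak() {
  const history = JSON.parse(localStorage.getItem('vp_history') || '[]');
  let days = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    const expected = new Date(Date.now() - days * 864e5).toISOString().slice(0, 10);
    if (history[i].date !== expected) break;          // gap found: streak ends
    days++;
  }
  return days;
}
```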
Challenges we ran into
Getting the Talk mode to feel like a real conversation was the hardest part. The naive approach — wait for the full AI response, then play it — introduced 3–4 seconds of silence that broke the illusion completely. We solved it by flushing the response sentence by sentence over SSE and queuing each sentence into an audio pipeline immediately, so playback starts on the first sentence while the rest is still generating. The latency now feels close to a real call.
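In sketch form, the client side looks something like this (endpoint names are illustrative). The key trick is that the TTS fetch for sentence N+1 starts immediately on arrival, while a promise chain keeps playback strictly in order:

```js
// Sketch: each SSE message carries one sentence; TTS is fetched eagerly,
// playback is serialised so audio starts on sentence one.
const source = new EventSource('/api/talk/stream');
let playback = Promise.resolve();         // chains clips in arrival order

source.onmessage = ({ data }) => {
  const { sentence } = JSON.parse(data);
  const clipReady = fetch('/api/tts', {   // starts now, overlaps current playback
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: sentence }),
  }).then(r => r.blob());

  playback = playback.then(async () => {
    const audio = new Audio(URL.createObjectURL(await clipReady));
    await audio.play();
    await new Promise(res => (audio.onended = res));
  });
};
```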
The other challenge was silence detection in Talk mode. Browser SpeechRecognition fires `onend` aggressively on short pauses, which meant the system kept cutting the user off mid-thought. We ended up with a two-tier timer: a longer initial wait before speech begins, then a shorter post-speech pause threshold that scales with how much the user has already said.
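A simplified version of that timer logic; all the thresholds here are illustrative rather than our tuned values:

```js
// Sketch: two-tier silence timer around browser SpeechRecognition.
const SR = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SR();
recognition.continuous = true;
recognition.interimResults = true;

const INITIAL_WAIT = 8000;        // generous window before any speech at all
let timer;
let words = 0;

// Longer answers earn a shorter pause before silence means "done talking".
const pauseThreshold = () => Math.max(1200, 3000 - words * 100);

function arm(ms) {
  clearTimeout(timer);
  timer = setTimeout(() => recognition.stop(), ms);   // hand the turn to the AI
}

recognition.onresult = e => {
  const text = Array.from(e.results).map(r => r[0].transcript).join(' ');
  words = text.trim().split(/\s+/).length;
  arm(pauseThreshold());          // every new result resets the pause clock
};

recognition.onstart = () => arm(INITIAL_WAIT);
recognition.start();
```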
Accomplishments that we're proud of
Three genuinely independent signals from a single 30-second input — and the whole vocal analysis runs client-side with no round-trip. The Talk mode streaming pipeline working in real time was the moment the project went from a prototype to something that actually felt alive. And the adaptive activity programme — the feedback loop where the system notices if activities aren't working and replaces them — is the piece we think has the most legs as a real product.
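As a sketch, that adaptation check can be a single pass over the week's scores; `pickReplacement` and the scoring fields are hypothetical stand-ins for our actual heuristics:

```js
// Sketch: if the 7-day trendline isn't improving, swap the least-completed
// activity for a fresh one from the same category. pickReplacement is hypothetical.
function adaptProgramme(programme, history) {
  const trend = history.at(-1).score - history[0].score;  // week-over-week delta
  if (trend >= 0) return programme;                       // improving: leave it alone
  const worst = programme.reduce((a, b) =>
    a.completionRate <= b.completionRate ? a : b);
  return programme.map(act => (act === worst ? pickReplacement(act.category) : act));
}
```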
What we learned
Streaming matters more than raw quality for perceived responsiveness. A slightly less polished response that starts playing in one second feels dramatically better than a perfect one that takes four. We also learned that the habit loop — the streak, the daily programme, the trendline — is where the actual product value lives. The AI signals are the engine, but the reason someone comes back tomorrow is the 30-second habit.
Built With
- ai
- audio-streaming
- browser-speechrecognition
- css
- distilroberta-base
- dotenv
- express.js
- html
- huggingface-inference-api
- javascript
- localstorage
- mediarecorder-api
- multer
- node.js
- npm
- openai-whisper
- rest-api
- server-sent-events
- text-to-speech
- web-audio-api
- web-speech-api