Clarity

Speak with intention. Land with impact.


Inspiration

Most people who want to communicate better, whether they're shy in casual conversation, preparing for a job interview, or about to pitch to investors, practice alone in their head with no real feedback.

The tools that do exist are either corporate HR platforms (cold, institutional, expensive) or simple transcript analyzers that miss everything non-verbal. We wanted to build something different: a calm, evidence-based rehearsal loop that tells you specifically what to work on before it counts.


What It Does

Clarity is a real-time communication coach. You record yourself practicing, and it analyzes your delivery across three simultaneous signals — audio, video, and language — then talks back to you through a simulated conversation partner, spoken critiques, and plain-English next steps you can act on before your next session.

Three practice modes:

Emotion Sprint -- A short scenario-based drill targeting one of eight core emotional registers. Your delivery is scored against the target affect using real-time facial emotion, tone, and sentiment analysis.

Conversation -- A multi-turn session simulating an interview, negotiation, or anything else you can think of. The AI plays the other side and keeps the exchange grounded in the topic at hand, pushing back, redirecting, and responding the way a real conversation partner would. This is not just fluency practice. It is social literacy training. Users learn to read the room, stay on topic under pressure, and develop the conversational instincts that make communication land in real situations.

Free Speaking -- Uninterrupted recording that scores your pacing, filler words, repetition, and emotional arc — then breaks everything down with graphs, stats, and a word-level transcript you can replay and audit. Detected speech patterns get flagged automatically, and plain-English next steps tell you exactly what to fix before your next session.
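The Free Speaking metrics above can be computed directly from a word-level transcript. Here is a minimal sketch, assuming a transcript shaped as a list of words with start/end timestamps; the field names (`word`, `start`, `end`) and the filler list are illustrative, not Clarity's actual schema:

```python
# Sketch: pacing and filler-word scoring over a word-level transcript.
# Field names and the filler set are assumptions for illustration.

FILLERS = {"um", "uh", "like", "basically", "actually"}  # single tokens only

def score_transcript(words: list[dict]) -> dict:
    """Return words-per-minute, filler rate, and flagged filler positions."""
    if not words:
        return {"wpm": 0.0, "filler_rate": 0.0, "fillers": []}
    duration_min = (words[-1]["end"] - words[0]["start"]) / 60
    fillers = [
        (i, w["word"]) for i, w in enumerate(words)
        if w["word"].lower().strip(".,") in FILLERS
    ]
    return {
        "wpm": len(words) / duration_min if duration_min else 0.0,
        "filler_rate": len(fillers) / len(words),
        "fillers": fillers,  # (index, word) pairs for transcript flagging
    }

words = [
    {"word": "So", "start": 0.0, "end": 0.3},
    {"word": "um", "start": 0.4, "end": 0.6},
    {"word": "thanks", "start": 0.8, "end": 1.2},
    {"word": "for", "start": 1.2, "end": 1.4},
    {"word": "coming", "start": 1.4, "end": 1.9},
    {"word": "um", "start": 2.2, "end": 2.4},
]
result = score_transcript(words)
print(result["filler_rate"])  # 2 fillers out of 6 words
```

Keeping the flagged positions as word indices is what lets the replay view highlight fillers inline in the transcript.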


How We Built It

Backend: Python/FastAPI with WebSocket support for live session coordination. We structured a shared AI integration layer (backend/shared/ai) so all three practice modes call the same service abstractions regardless of which model or API handles a given task. Session recordings are saved server-side and uploaded to Imentiv for video emotion analysis. Results are normalized and streamed back through WebSocket events.
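The shared-abstraction idea can be sketched with structural typing; the interface and function names below are ours for illustration, not the actual contents of backend/shared/ai:

```python
# Sketch of a shared AI service layer: each practice mode depends on
# small Protocol interfaces, so the concrete provider (ElevenLabs,
# Gemma, a mock) can be swapped without touching mode logic.
from typing import Protocol

class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class FeedbackModel(Protocol):
    def summarize(self, transcript: str) -> str: ...

class EchoFeedback:
    """Trivial stand-in implementation, useful in tests and mock mode."""
    def summarize(self, transcript: str) -> str:
        return f"Summary of: {transcript[:40]}"

def build_scorecard(stt: SpeechToText, fb: FeedbackModel, audio: bytes) -> dict:
    """Mode-agnostic scoring path: any conforming providers will do."""
    transcript = stt.transcribe(audio)
    return {"transcript": transcript, "summary": fb.summarize(transcript)}
```

Because `Protocol` checks shape rather than inheritance, real clients and mocks satisfy the same interface without a shared base class.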

Frontend: Next.js (TypeScript). The UI is designed around two distinct contexts: an immersive, chrome-free recording flow, and a clear, scannable scorecard/replay view. These are handled through separate route structures rather than toggled state in a single shell.

AI stack:

  • Gemma via Google AI Studio -- Drives the conversational AI in Conversation mode; generates feedback summaries and next-step recommendations across all modes.
  • ElevenLabs -- STT (Scribe v1) for high-accuracy transcription; TTS (Multilingual v2) for the AI voice in Conversation mode.
  • Imentiv -- Facial emotion and expression analysis from recorded video, aligned to audio windows for per-segment scoring.

Persistence: MongoDB Atlas for durable session history powering the progress dashboard. Live WebSocket state stays in memory. For offline or no-network demos, both Imentiv and MongoDB can be mocked out with environment flags.
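The environment-flag pattern for mocking might look like the following; the flag name `CLARITY_MOCK_IMENTIV` and the client classes are hypothetical stand-ins, not the real configuration:

```python
# Sketch: selecting a real or mock external client via an env flag,
# so the full product loop runs with no network. Names are illustrative.
import os

class MockImentivClient:
    """Returns canned emotion scores so the full UI works offline."""
    def analyze(self, video_path: str) -> dict:
        return {"segments": [{"t": 0.0, "emotion": "neutral", "score": 0.9}]}

class ImentivClient:
    """Thin wrapper around the real API (constructor only, for the sketch)."""
    def __init__(self, api_key: str):
        self.api_key = api_key

def get_emotion_client():
    """Pick the mock when the flag is set; otherwise require a real key."""
    if os.getenv("CLARITY_MOCK_IMENTIV", "0") == "1":
        return MockImentivClient()
    return ImentivClient(api_key=os.environ["IMENTIV_API_KEY"])
```

Centralizing the flag check in one factory keeps mode code unaware of whether it is talking to the real service.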


Challenges We Ran Into

Multimodal alignment. Getting three async data streams -- transcript, audio features, and video emotion frames -- to align meaningfully in time was the core technical challenge. Audio and video clock drift across a session meant naive timestamp alignment produced nonsensical per-segment scores. We had to implement windowed alignment at the backend layer before normalization.
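One way to implement windowed alignment (a sketch of the idea, not Clarity's exact code): instead of matching raw timestamps across drifting clocks, bucket each stream into fixed windows and join on the window index, which tolerates any drift smaller than the window:

```python
# Sketch: windowed alignment of two timestamped streams. Samples are
# (timestamp_seconds, value) pairs; window size is an assumption.
from collections import defaultdict

WINDOW = 2.0  # seconds per alignment window

def bucket(frames, window=WINDOW):
    """Group (timestamp, value) samples by integer window index."""
    buckets = defaultdict(list)
    for ts, value in frames:
        buckets[int(ts // window)].append(value)
    return buckets

def align(audio_frames, video_frames):
    """Join per-window audio and video features on shared window indices."""
    a, v = bucket(audio_frames), bucket(video_frames)
    return {
        idx: {"audio": a[idx], "video": v[idx]}
        for idx in sorted(a.keys() & v.keys())
    }
```

Windows present in only one stream are simply dropped, which is usually preferable to scoring a segment on partial evidence.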

CV model latency. ML computer vision models on video are slow by nature. Imentiv's analysis pipeline introduced significant end-to-end latency that would have made the post-session scorecard feel broken. We addressed this by decoupling the CV processing from the user-facing session flow entirely. Rather than blocking on Imentiv's result, the session closes optimistically, the upload and analysis run asynchronously in the background, and the scorecard hydrates progressively via WebSocket events as results come in. Users see transcript and audio scores immediately, with video emotion scores filling in as they land.
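The decoupling pattern reduces to a small amount of asyncio; this is a simplified sketch with invented names (`send_event`, the event types) rather than the production session state machine:

```python
# Sketch: optimistic session close with progressive scorecard hydration.
# Fast signals go out immediately; slow CV results arrive as a later event.
import asyncio

async def close_session(session_id: str, send_event) -> None:
    # 1. Respond immediately with the fast signals (transcript, audio).
    await send_event({"type": "scorecard.partial", "session": session_id,
                      "scores": {"transcript": True, "audio": True}})
    # 2. Kick off slow CV analysis without blocking the session close.
    asyncio.create_task(hydrate_video_scores(session_id, send_event))

async def hydrate_video_scores(session_id: str, send_event) -> None:
    scores = await run_cv_analysis(session_id)  # slow external pipeline
    await send_event({"type": "scorecard.video", "session": session_id,
                      "scores": scores})

async def run_cv_analysis(session_id: str) -> dict:
    await asyncio.sleep(0.01)  # stands in for upload + analysis latency
    return {"emotion": "confident"}
```

The frontend treats each event type as a patch to the scorecard, so the partial and full states are both first-class UI states rather than a loading screen.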

Imentiv API integration. The API required recordings to be saved server-side, uploaded with proper consent headers, and polled for results, all while keeping the frontend WebSocket connection feeling live and responsive. Building the mock infrastructure early (Imentiv mock mode, MongoDB-disabled mode) saved significant demo-day stress and let us develop the full scoring UI before the integration was stable.
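The submit-then-poll flow can be captured in a small helper; the status shape (`state`, `result`) is an assumption for illustration, not the real Imentiv response format:

```python
# Sketch: generic polling loop for a submit-then-poll API, with a
# deadline so a stuck job cannot hang the scorecard pipeline forever.
import time

def poll_for_result(fetch_status, timeout=120.0, interval=2.0):
    """Call fetch_status() until it reports completion or we time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status.get("state") == "done":
            return status["result"]
        if status.get("state") == "failed":
            raise RuntimeError(f"analysis failed: {status}")
        time.sleep(interval)
    raise TimeoutError("analysis did not complete in time")
```

Passing `fetch_status` as a callable is also what makes the mock mode trivial: the mock just returns a canned "done" status on the first call.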


Accomplishments We're Proud Of

Building a coherent, three-signal analysis pipeline in a hackathon window. The system detects mismatches between what you are saying, how you sound, and what your face is doing. That is exactly the kind of feedback a human coach gives that no simple transcript tool can produce.

The Conversation mode is something we are especially proud of. Most speaking tools treat communication as a solo performance. Conversation mode treats it as what it actually is: a two-way exchange that rewards social awareness. The AI holds a real position, responds to what you actually said rather than what you meant to say, and surfaces where your communication broke down in context. Building something that genuinely improves social literacy, not just fluency, felt meaningful.

We are also proud of the design system. Clarity explicitly rejects the aesthetic shortcuts most AI hackathon projects reach for: no purple gradients, no glassmorphism, no gamified streaks. It is calm, warm, and built for a person who is already focused and a little anxious. That was a deliberate product decision and we held it throughout the full build.


What We Learned

Decoupling slow processes is a product decision, not just an engineering one. The choice to hydrate the scorecard progressively -- showing transcript and audio scores immediately while CV results load in -- changed what the product felt like. A technically correct but blocking implementation would have felt broken. Framing async processing as a feature rather than a limitation required rethinking the session state machine mid-build, and it taught us to design for the slow path from the start.

Social literacy is harder to measure than fluency. Building the Conversation mode forced us to think carefully about what "doing well" actually means in a real exchange. Word count, pacing, and filler rates are easy to quantify. Staying on topic under pressure, recovering from a redirect, and knowing when to stop talking are not. We ended up using the LLM not just to play the other side but to evaluate the quality of engagement, which required much more careful prompting than we anticipated.
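The evaluation-prompt shape we converged on looks roughly like this; the rubric dimensions and wording below are an illustrative reconstruction, not the exact production prompt:

```python
# Sketch: a rubric-style prompt template asking the LLM to score one
# user turn on engagement quality, not just fluency. Dimensions are
# illustrative. Double braces escape the literal JSON braces in .format().
EVAL_PROMPT = """You are scoring one user turn in a practice conversation.
Context: {scenario}
Last AI turn: {ai_turn}
User turn: {user_turn}

Score 1-5 on each dimension and justify briefly:
- on_topic: did the reply address the actual question?
- recovery: if redirected, did the user adapt?
- concision: did the user stop when the point was made?
Return JSON: {{"on_topic": int, "recovery": int, "concision": int}}"""

prompt = EVAL_PROMPT.format(
    scenario="salary negotiation",
    ai_turn="That number is above our band.",
    user_turn="I hear you; here is why I am worth it.",
)
```

Constraining the output to a small JSON rubric made the scores far more stable across runs than asking for free-form judgment.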

Mock infrastructure is not a shortcut, it is a foundation. Building Imentiv mock mode and the MongoDB-disabled path early meant we could develop and test the full product loop before any external integration was stable. At a hackathon, external APIs go down, rate limits get hit, and latency spikes at the worst moments. The teams that build mock layers early ship; the ones that do not spend demo day debugging network errors.

The hardest design constraint is restraint. Every instinct at a hackathon is to add more: more metrics, more feedback panels, more features. Holding to the question "what does this user actually need to know before their next practice session" was genuinely difficult and genuinely worth it. A focused scorecard people can act on beats a comprehensive dashboard nobody reads.


What's Next

  • Deploy to production: containerize the backend, set up CI/CD, and get Clarity running on a stable public URL
  • Production hardening: rate limiting, auth, session security, and proper error handling throughout
  • Real-time mid-session feedback rather than only end-of-session scoring
  • Longitudinal progress tracking across sessions with trend lines on pacing, filler rate, and emotional range
  • Shareable session recaps for coaches or accountability partners
  • Mobile support (the recording flow is already designed for it)
