Inspiration
Approximately 50% to 70% of neurodivergent people experience difficulty interpreting emotional tone in everyday communication, especially when emotion is subtle, mixed, or implied. When speech is converted into text, even more emotional context is lost, making it harder to understand intent or nuance.
Sono is inspired by the idea that text does not have to erase emotion. It explores how giving emotional identity to text can support understanding, self-expression, and accessibility without labeling, miscommunicating, or judging emotion. It brings life back into plain words, turning subtitles into a mirror of the speaker's true intent.
What it does
Sono is a live transcription experience designed to help users, particularly neurodivergent individuals, better interpret the emotional context of speech in real-time.
Unlike traditional speech-to-text tools that only process what is said, Sono analyzes how it is said. It utilizes a multimodal approach to decoding communication:
Audio Analysis (Pitch & Tone): As the user speaks, Sono’s engine listens to the raw audio waveform, detecting nuances in pitch, volume, and cadence. This lets the system catch non-verbal cues that text alone would miss, like the drop in pitch associated with sarcasm or the rapid tempo of excitement (a rough sketch of this step appears after this list).
Semantic Interpretation (Text): Simultaneously, the system processes the transcribed text to understand the sentiment and meaning of the words themselves.
Visual Synthesis: These two data streams merge to drive a dynamic visual interface. As words appear on the screen, Sono applies subtle, fluid changes to typography and colour, turning "angry" words red and sharp, or "calm" words blue and soft.
The result is a set of "living subtitles" that retain the emotional identity of the voice, helping users process intent instantly and intuitively.
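To make the audio-analysis step concrete, here is a rough sketch of how pitch, volume, and cadence could be estimated from a short clip, assuming the librosa library; the feature names and proxies are illustrative, not our exact engine, which works on live streamed audio rather than files.

```python
# Illustrative sketch of the audio-analysis step, not the production engine.
# Assumes librosa and numpy are installed; feature names are approximate.
import librosa
import numpy as np

def extract_prosody(wav_path: str) -> dict:
    """Estimate rough pitch, loudness, and tempo cues from a short audio clip."""
    y, sr = librosa.load(wav_path, sr=16000)

    # Fundamental frequency (pitch) via probabilistic YIN; unvoiced frames are NaN.
    f0, _, _ = librosa.pyin(
        y, sr=sr, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7")
    )
    pitch_mean = float(np.nanmean(f0)) if np.any(~np.isnan(f0)) else 0.0

    # RMS energy as a proxy for volume, onset rate as a proxy for speaking tempo.
    energy = float(np.mean(librosa.feature.rms(y=y)))
    onsets = librosa.onset.onset_detect(y=y, sr=sr)
    onsets_per_sec = len(onsets) / (len(y) / sr)

    return {"pitch_hz": pitch_mean, "energy": energy, "onsets_per_sec": onsets_per_sec}
```

Features like these are what get merged with the text-level sentiment to drive the typography and colour shifts described above.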
How we built it
We built Sono as a responsive web-based prototype using React, Python (Flask), and Tailwind CSS, with a strong focus on accessibility, clarity, and reducing cognitive overload. The interface is intentionally calm and minimal, ensuring that emotional cues enhance the text without overwhelming the user.
The core of Sono's intelligence is powered by Google's Gemini API. We integrated Gemini not just as a text generator, but as a reasoning engine to perform real-time sentiment analysis.
Frontend: We used React to capture live audio from the browser and manage the real-time state of the subtitles. Tailwind CSS allowed us to create a fluid, high-contrast design system where fonts and colours could transition dynamically based on emotional data.
Backend: A lightweight Python Flask server acts as the bridge between the client and the AI. It processes incoming data streams and manages the API requests.
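As a concrete illustration, a bridge server in this style can stay very small; the `/analyze` route, payload shape, and placeholder classifier below are hypothetical rather than our exact API.

```python
# Minimal Flask bridge sketch; the route name and payload shape are hypothetical.
from flask import Flask, jsonify, request

app = Flask(__name__)

def classify_emotion(text: str) -> str:
    # Placeholder: the real version calls Gemini (see the AI Engine sketch below).
    return "Neutral"

@app.route("/analyze", methods=["POST"])
def analyze():
    # The client posts a short transcript segment; we return it with an emotion label.
    segment = request.get_json(force=True)
    text = segment.get("text", "")
    return jsonify({"text": text, "emotion": classify_emotion(text)})

if __name__ == "__main__":
    app.run(port=5000, debug=True)
```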
The AI Engine: We leveraged the Gemini 2.5 Flash model for its speed and multimodal capabilities. By feeding Gemini live conversation segments, we utilized its advanced reasoning to classify speech into specific emotional categories (e.g., Sarcastic, Happy, Anxious) with high accuracy. This allowed us to go beyond simple keyword matching and understand the true context of the conversation.
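Here is a hedged sketch of what that classification call can look like, assuming the google-genai Python SDK; the prompt wording and label set are examples, not the exact ones shipped in Sono.

```python
# Illustrative Gemini classification call; prompt and labels are examples only.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

LABELS = ["Happy", "Sad", "Angry", "Anxious", "Sarcastic", "Calm", "Neutral"]

def classify_emotion(segment: str) -> str:
    prompt = (
        "Classify the dominant emotion of this conversation segment. "
        f"Answer with exactly one of: {', '.join(LABELS)}.\n\n"
        f"Segment: {segment}"
    )
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt,
    )
    answer = (response.text or "").strip()
    return answer if answer in LABELS else "Neutral"

print(classify_emotion("Oh great, ANOTHER meeting. I'm thrilled."))  # likely "Sarcastic"
```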
Challenges we ran into
One of the biggest challenges was representing emotion in a way that feels supportive rather than prescriptive. Emotional interpretation is highly personal, and overly explicit cues can feel stressful or judgmental, especially for neurodivergent users.
From a technical perspective, we also had to work out how real-time speech processing and emotion inference could stay accurate, low-latency, and ethically responsible.
Accomplishments that we are proud of
- Designing emotional cues that support interpretation without diagnosis
- Aligning with neurodivergent needs through optional, adjustable UI signals
- Creating a calm interface that reduces cognitive load
- Reframing transcription as an expression of emotional identity rather than accuracy alone
What we learned
Accessibility is about curation, not volume: We realized that "less is often more." Overloading the user with data increased anxiety, while presenting abstract, ambient signals (like color shifts) reduced cognitive load and felt more supportive.
Agency builds trust: We discovered that giving users control over how emotion is visualized transforms the tool from a judgmental "monitor" into an empowering extension of their own expression.
What is next for Sono
Speech-to-Speech Translation: We want to build a translation pipeline that translates feelings, not just words. By using expressive AI synthesis, Sono will read out translations with clearer, slightly exaggerated emotional inflections—helping users hear the "vibe" of the conversation as clearly as they hear the translation.
Universal Overlay Extension: We plan to develop a browser extension that can float our emotive subtitles over any active tab. This would bring Sono's accessibility features to Zoom calls, movies, and YouTube videos seamlessly.
Multi-Voice Tracking: We are working on distinguishing between multiple speakers in a single audio stream, allowing Sono to visually separate and identify different voices in group conversations.
Participatory Design: We aim to partner directly with neurodivergent communities for beta testing, ensuring that our design choices align with lived experiences and truly support the people we are building for.
Smarter, Context-Aware Models: We aim to refine our AI to understand context better, reducing latency and improving accuracy in distinguishing between nuanced emotions (like excitement vs. anxiety).
Built With
- gemini
- javascript
- python
- react
- tailwind

