Inspiration

Born and raised in India, I have always loved connecting with people. But moving to the US changed everything. Because English isn't my first language, I suddenly struggled to hold engaging conversations, catch rapid-fire accents, or process cultural slang and references mid-sentence. These barriers bred a sudden, isolating social nervousness.

I built WingIt to solve this. It acts as an invisible audio wingman that listens to live conversations and whispers contextual hints, slang breakdowns, or full responses right into your earbud. Beyond helping immigrants bridge language, cultural, and confidence gaps, WingIt could also serve as an accessibility aid for neurodivergent people and introverts, and as a practical tool for sales teams and people at networking events.

What it does

WingIt listens to your in-person conversations through your phone mic, understands the context in real time, and whispers smart suggestions into your earbud — helping you be more engaging, witty, and confident.

You wear a single earbud, keep your phone nearby, and just talk. WingIt captures both speakers, transcribes the conversation with speaker diarization, detects when the other person finishes speaking, and generates a contextual suggestion using an LLM — personalized to your style via a "Second Brain" profile. The suggestion is converted to speech and played back through your earbud, all within ~2 seconds. It also handles noisy environments gracefully: if auto-detection fails, you can always tap "Suggest Now" for an instant suggestion.

How we built it

WingIt was built entirely with Devin by Cognition, with Opus 4.8 as the LLM under the hood. The core principle was streaming everywhere: no step waits for the previous step to fully complete.

Pipeline:

$$\text{Mic} \xrightarrow{\text{WebSocket}} \text{STT} \xrightarrow{} \text{Turn Detection} \xrightarrow{} \text{LLM} \xrightarrow{} \text{TTS} \xrightarrow{\text{WebSocket}} \text{Earbud}$$

Backend: Python + FastAPI with async WebSockets. Each WebSocket connection is an independent conversation session.

Speech-to-Text: Deepgram Nova 3 with live streaming and speaker diarization (diarize=true, utterance_end_ms=1500, endpointing=300).

Turn Detection: Three strategies running in parallel: silence-based (via Deepgram's utterance_end event), semantic (Claude Haiku asks "did this person finish a thought expecting a reply?"), and manual button press. First trigger wins, then a 5-second cooldown prevents duplicates.

LLM: Claude Sonnet for generating suggestions (streaming), with the user's "Second Brain" profile and last 20 conversation turns injected into the system prompt.

Text-to-Speech: Deepgram Aura, streamed at sentence boundaries from the LLM output, so we don't wait for the full LLM response.

Mobile App: React Native (Expo) with expo-av for audio capture (16kHz, 16-bit PCM mono) and playback routed to earpiece/Bluetooth.
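To make the LLM step concrete, here is a minimal sketch of how a per-connection session might keep the last 20 turns and fold them into the system prompt alongside the "Second Brain" profile. The `Session` class, its field names, and the prompt wording are assumptions for illustration, not the project's actual code.

```python
from collections import deque
from dataclasses import dataclass, field

MAX_TURNS = 20  # the write-up says the last 20 turns are injected


@dataclass
class Session:
    """Per-connection conversation state (hypothetical structure)."""

    profile: str  # the user's "Second Brain" profile text
    turns: deque = field(default_factory=lambda: deque(maxlen=MAX_TURNS))

    def add_turn(self, speaker: int, text: str) -> None:
        # deque(maxlen=...) drops the oldest turn automatically once full.
        self.turns.append((speaker, text))

    def system_prompt(self) -> str:
        # Prompt wording is illustrative only.
        history = "\n".join(f"Speaker {s}: {t}" for s, t in self.turns)
        return (
            "You suggest short, natural replies for the user in a live conversation.\n"
            f"User profile:\n{self.profile}\n\n"
            f"Recent conversation:\n{history}"
        )
```

A bounded deque keeps the prompt size roughly constant no matter how long the conversation runs, which helps keep LLM latency predictable.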

Challenges we ran into

The Latency Budget: Keeping end-to-end latency under 2 seconds forced us to use async generators throughout. TTS begins synthesizing each sentence as soon as it is complete, before the LLM has finished the rest of its response.
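The sentence-level hand-off can be sketched with a small async generator that sits between the LLM stream and TTS. This is an illustration under assumptions: `sentences`, `fake_llm`, and the punctuation-based splitting rule are hypothetical and not taken from the project.

```python
import asyncio
import re
from typing import AsyncIterator

# A sentence ends at ., ! or ? followed by whitespace (a deliberately simple rule).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


async def sentences(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    async for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush the remainder once the stream ends


async def fake_llm() -> AsyncIterator[str]:
    """Stand-in for a streaming LLM response."""
    for token in ["Nice", " one!", " Ask", " about", " the", " trip."]:
        yield token
        await asyncio.sleep(0)


async def collect() -> list[str]:
    return [s async for s in sentences(fake_llm())]


if __name__ == "__main__":
    print(asyncio.run(collect()))
```

In a real pipeline each yielded sentence would be passed to the TTS provider immediately, so audio for the first sentence can start playing while later tokens are still arriving.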

Environment Noise: Silence detection breaks in loud spaces. We resolved this by running an LLM semantic analyzer in parallel to verify if a thought is complete, backed by a manual UI trigger.
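One way to combine the parallel detectors with the "first trigger wins, then cooldown" rule from the build notes is sketched below. `TurnGate` and `first_trigger` are hypothetical names, and the detectors are modeled as plain coroutines that return when they fire; the real detectors would wrap Deepgram events, an LLM call, and a button press.

```python
import asyncio
import time

COOLDOWN_S = 5.0  # the write-up mentions a 5-second cooldown


class TurnGate:
    """Lets one trigger through, then rejects further triggers during the cooldown."""

    def __init__(self, cooldown: float = COOLDOWN_S, clock=time.monotonic):
        self._cooldown = cooldown
        self._clock = clock  # injectable so the gate can be tested without sleeping
        self._last_fired = None

    def try_fire(self) -> bool:
        now = self._clock()
        if self._last_fired is not None and now - self._last_fired < self._cooldown:
            return False  # still cooling down: treat as a duplicate
        self._last_fired = now
        return True


async def first_trigger(detectors: dict) -> str:
    """Run detector coroutines concurrently and return the name of the first to finish."""
    tasks = {asyncio.create_task(coro, name=name) for name, coro in detectors.items()}
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # the losing detectors are no longer needed for this turn
    await asyncio.gather(*pending, return_exceptions=True)
    return next(iter(done)).get_name()
```

The gate is kept separate from the race so that the manual "Suggest Now" button can reuse the same cooldown logic without going through the detector race.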

State Cancellation: If the user speaks while a suggestion is playing, the backend must instantly terminate the active LLM stream, empty the audio buffers, and cancel any in-flight network tasks cleanly.
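A minimal sketch of this interruption path with asyncio is shown below. `SuggestionPlayer`, its queue, and the `demo` coroutine are hypothetical; the real backend presumably also has to tell the mobile client to stop playback, which is not shown here.

```python
import asyncio
import contextlib


class SuggestionPlayer:
    """Owns at most one in-flight suggestion task plus its outgoing audio queue."""

    def __init__(self):
        self.audio_out: asyncio.Queue = asyncio.Queue()
        self._task = None

    @property
    def active(self) -> bool:
        return self._task is not None and not self._task.done()

    def start(self, coro) -> None:
        self._task = asyncio.create_task(coro)

    async def interrupt(self) -> None:
        """Called when the user starts speaking mid-suggestion."""
        task, self._task = self._task, None
        if task is not None and not task.done():
            task.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await task  # wait until the cancelled task has unwound
        # Drop any audio that was queued but not yet sent to the client.
        while not self.audio_out.empty():
            self.audio_out.get_nowait()


async def demo():
    player = SuggestionPlayer()

    async def suggestion():
        # Stand-in for the LLM -> TTS stream producing audio chunks.
        for i in range(100):
            await player.audio_out.put(f"chunk-{i}")
            await asyncio.sleep(0.01)

    player.start(suggestion())
    await asyncio.sleep(0.05)  # let a few chunks queue up
    queued_before = player.audio_out.qsize()
    await player.interrupt()
    return queued_before, player.audio_out.qsize(), player.active
```

Awaiting the cancelled task before draining the queue matters: otherwise the task could enqueue one more chunk after the buffer has been emptied.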

Accomplishments & Lessons

True Zero-Blocking Pipeline: Maintained total end-to-end latency below the 2-second natural conversation threshold.

Modular Engineering: Built a swappable provider layout; switching LLMs or voice models requires only a single environment variable change.
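A swappable provider layer of this kind is often built as a small registry keyed by an environment variable. The sketch below assumes a variable named WINGIT_LLM_PROVIDER and two placeholder providers; both the variable name and the classes are invented for illustration.

```python
import os
from typing import Callable, Dict, Protocol


class LLMProvider(Protocol):
    def suggest(self, prompt: str) -> str: ...


class EchoProvider:
    """Placeholder provider; a real one would call a vendor SDK."""

    def suggest(self, prompt: str) -> str:
        return f"echo: {prompt}"


class ShoutProvider:
    """Second placeholder, only here to show that swapping works."""

    def suggest(self, prompt: str) -> str:
        return prompt.upper()


_REGISTRY: Dict[str, Callable[[], LLMProvider]] = {
    "echo": EchoProvider,
    "shout": ShoutProvider,
}


def load_llm(env=os.environ) -> LLMProvider:
    """Pick a provider from the (hypothetical) WINGIT_LLM_PROVIDER variable."""
    name = env.get("WINGIT_LLM_PROVIDER", "echo")
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(
            f"unknown LLM provider {name!r}; choose from {sorted(_REGISTRY)}"
        ) from None
```

The rest of the pipeline depends only on the `LLMProvider` protocol, so adding a new vendor means writing one class and one registry entry. The same pattern would apply to STT and TTS providers.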

Mobile Audio Realities: Heterogeneous mobile hardware (especially Android Bluetooth routing) presents severe fragmentation that requires rigorous real-device testing.

What's next

Hardware Triggers: Map suggestion prompts directly to earbud tap events.

Vector RAG: Upgrade the static profile to a Vector Database that syncs live with personal notes and documents.

On-Device STT: Port speech recognition to local Whisper models to eliminate network overhead and maximize privacy.
