Inspiration
Every presenter faces the same challenge: nervousness changes how you sound.
When you're nervous, your voice becomes hesitant ("uh, so basically..."), lacking confidence ("I think maybe we could try..."), filled with verbal tics and filler words ("um, like, you know..."), and less impactful ("here's a feature" vs. "here's an EXCEPTIONAL feature").
The irony? Your core message is solid, but how you deliver it makes all the difference.
I wanted to solve this: What if your audience heard an enhanced, confident version of your words while you spoke naturally? No acting required. No extra effort. Just speak and let AI amplify your impact.
AWS Bedrock Nova Pro became the perfect tool for this. Unlike traditional audio processing, Nova Pro understands meaning, emotional intent, and context. It could transform plain text into powerful statements while preserving the authentic message.
Sonic Shadow was born: Your authentic voice, amplified with confidence.
What It Does
Sonic Shadow is a real-time speech enhancement platform that transforms how presentations sound to the audience.
The speaker delivers naturally while talking into their microphone. The system transcribes the speech, enhances the text using AI (Nova Pro), synthesizes it to speech, and broadcasts the enhanced version to listeners. Each listener hears a more confident, energetic version of what the speaker said—with better word choice, stronger phrasing, and professional polish.
Key features:
- One speaker, unlimited listeners
- Real-time transcription and enhancement
- Synchronized audio delivery across all listeners
- Side-by-side comparison (original vs. enhanced text)
- Works on phones, tablets, and computers
- Local network or public deployment options
The enhancement preserves the speaker's authentic message while boosting confidence and impact. A casual "here's something cool" becomes "here's an exceptional feature." A hesitant explanation becomes clear and persuasive.
How I Built It
The architecture consists of two parallel threads to avoid latency bottlenecks:
Capture Thread (runs continuously):
- Reads microphone audio using sounddevice
- Detects speech using RMS (Root Mean Square) loudness calculation
- Queues detected utterances when 0.3 seconds of silence is detected
- Speaker experiences <100ms microphone latency
Processing Thread (runs sequentially):
- Gets audio from queue (non-blocking)
- Transcribes using Google Speech Recognition API
- Enhances text using AWS Bedrock Nova Pro with prompt: "ENERGETIC, CONFIDENT, PROFESSIONALLY ENGAGING"
- Synthesizes enhanced text to MP3 using gTTS
- Broadcasts base64-encoded MP3 to all connected listeners via WebSocket
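The enhancement step above can be sketched as a small helper. This is a minimal sketch, assuming boto3's bedrock-runtime Converse API and the amazon.nova-pro-v1:0 model ID from the tech stack; the system prompt wording beyond the quoted "ENERGETIC, CONFIDENT, PROFESSIONALLY ENGAGING" phrase is hypothetical.

```python
# Sketch of the Nova Pro enhancement step. The system prompt around the
# quoted phrase is an assumption, not the project's exact wording.
SYSTEM_PROMPT = (
    "Rewrite the user's sentence so it sounds ENERGETIC, CONFIDENT, and "
    "PROFESSIONALLY ENGAGING. Preserve the original meaning and return "
    "only the rewritten sentence."
)

def build_converse_request(text: str) -> dict:
    """Build keyword arguments for the bedrock-runtime Converse API."""
    return {
        "modelId": "amazon.nova-pro-v1:0",
        "system": [{"text": SYSTEM_PROMPT}],
        "messages": [{"role": "user", "content": [{"text": text}]}],
        "inferenceConfig": {"maxTokens": 200, "temperature": 0.7},
    }

def enhance(text: str) -> str:
    """Send one transcript to Nova Pro and return the enhanced wording."""
    import boto3  # imported here so the request builder works without AWS
    client = boto3.client("bedrock-runtime")
    response = client.converse(**build_converse_request(text))
    return response["output"]["message"]["content"][0]["text"]
```

Keeping the request construction separate from the network call makes the prompt and parameters easy to inspect and test without AWS credentials.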
This parallel architecture means the speaker can deliver rapid utterances without waiting for processing. Listeners hear continuous, enhanced speech.
Frontend:
- HTML5 with Socket.IO JavaScript client
- Speaker interface shows original and enhanced text side-by-side, with listener count
- Listener interface includes audio unlock overlay (required by browser autoplay policy)
- One-click listener link sharing
Backend:
- Flask 3.1.2 with Flask-SocketIO for WebSocket communication
- SpeakerSession class manages state: audio buffer, queue, threads, listeners
- Room-based broadcasting (all listeners in a room receive synchronized events)
- Base64-encoded MP3 transmission for cross-browser compatibility
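The room broadcast can be sketched as follows, assuming Flask-SocketIO's `emit` with a `to=` room target and an `enhanced_audio` event name; the payload field names are hypothetical, but the base64 encoding matches the cross-browser transport described above.

```python
import base64

def make_audio_event(mp3_bytes: bytes, original: str, enhanced: str) -> dict:
    """Package one utterance as a JSON-safe Socket.IO payload.

    Field names here are illustrative, not the project's exact schema.
    """
    return {
        "audio_b64": base64.b64encode(mp3_bytes).decode("ascii"),
        "original_text": original,
        "enhanced_text": enhanced,
    }

def broadcast_utterance(socketio, room: str, mp3_bytes: bytes,
                        original: str, enhanced: str) -> None:
    """Emit one enhanced utterance to every listener joined to the room."""
    socketio.emit("enhanced_audio",
                  make_audio_event(mp3_bytes, original, enhanced),
                  to=room)
```

Because every listener in the room receives the identical base64 payload in one emit, there is no per-client encoding path to drift out of sync.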
Tech Stack:
- Python 3.13 with sounddevice (for microphone capture)
- Google Speech Recognition API (transcription)
- AWS Bedrock Nova Pro v1:0 (text enhancement)
- gTTS 2.3.0 (text-to-speech synthesis)
- Flask + Socket.IO (WebSocket server)
- HTML5 + JavaScript (frontend)
- ngrok (public deployment tunneling)
Speech detection uses the RMS calculation: if a chunk's loudness exceeds the threshold (100), speech is detected, and processing triggers after 0.3 seconds of silence.
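The RMS gate reduces to a few lines. This sketch assumes integer (int16-range) microphone samples, which is what a threshold of 100 implies; the actual dtype used with sounddevice is an assumption.

```python
import numpy as np

RMS_THRESHOLD = 100   # loudness threshold from the write-up
SILENCE_SECS = 0.3    # silence duration that ends an utterance

def rms(chunk: np.ndarray) -> float:
    """Root-mean-square loudness of one audio chunk."""
    return float(np.sqrt(np.mean(chunk.astype(np.float64) ** 2)))

def is_speech(chunk: np.ndarray) -> bool:
    """True while the chunk is louder than the threshold."""
    return rms(chunk) > RMS_THRESHOLD
```

A chunk of constant amplitude 200 has RMS 200 and trips the gate; a silent chunk has RMS 0 and does not.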
Architecture Diagram
┌─ SPEAKER EXPERIENCE ─────────────────────────────
│
│  Microphone (sounddevice)
│      ↓
│  Capture Thread (continuous, non-blocking)
│    • RMS-based speech detection
│    • Silence threshold = 0.3 seconds
│    • Queue detected utterances
│      ↓
│  [Queue] ← audio waiting for processing
│      ↓
│  Processing Thread (sequential)
│    • Consume from queue
│    • Transcribe (Google Speech Recognition)
│    • Enhance (AWS Bedrock Nova Pro)
│    • Synthesize (gTTS → MP3)
│    • Broadcast via WebSocket
│
│  Display: [Original Text] [Enhanced Text]
│  Connected Listeners: X
└──────────────────────────────────────────────────
          ↑ ↓
     WebSocket (Socket.IO)
          ↑ ↓
┌─ LISTENER EXPERIENCE ────────────────────────────
│
│  [Audio Unlock Overlay] ← tap to enable
│      ↓ (after click)
│  Receive enhanced_audio events
│    • Decode base64 MP3
│    • Play audio automatically
│    • Display enhanced text
│    • Show history of utterances
└──────────────────────────────────────────────────
Challenges I Ran Into
Challenge 1: Package Compatibility with Python 3.13
PyAudio does not support Python 3.13, so the initial setup failed. I switched to sounddevice, which is better maintained and cross-platform. I also ran into the SpeechRecognition package's naming inconsistency: it is installed as SpeechRecognition but imported as speech_recognition.
Challenge 2: Latency in Sequential Processing
Initial approach: Speaker speaks → Full pipeline runs (2-3 seconds) → Next utterance delayed
This created terrible UX. The speaker would start speaking, nothing would play for two seconds, and by the time the audio finally arrived, the speaker had already moved on.
The solution was separating capture (fast, continuous) from processing (slower, running in parallel). Audio gets queued immediately when speech ends, not when processing finishes, so the speaker gets immediate microphone feedback while processing happens asynchronously in the background.
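The capture/processing split can be sketched with a queue and a worker thread. Here process_utterance is a hypothetical stand-in for the real transcribe → enhance → synthesize → broadcast pipeline; the point is that the capture side only enqueues and never waits.

```python
import queue
import threading

utterances: "queue.Queue[bytes]" = queue.Queue()
results = []

def process_utterance(audio: bytes) -> None:
    # Placeholder for transcribe -> enhance -> synthesize -> broadcast.
    results.append(len(audio))

def processing_worker() -> None:
    """Drain the queue sequentially; capture never blocks on this."""
    while True:
        audio = utterances.get()   # blocks until capture enqueues audio
        if audio is None:          # sentinel value shuts the worker down
            break
        process_utterance(audio)

worker = threading.Thread(target=processing_worker, daemon=True)
worker.start()

# Capture side: enqueue each finished utterance immediately and keep
# recording instead of waiting 2-3 seconds for the pipeline.
for utterance in (b"\x00" * 100, b"\x00" * 200):
    utterances.put(utterance)

utterances.put(None)  # stop signal for this demo
worker.join()
```

Because `put` returns immediately, back-to-back utterances pile up in the queue rather than delaying the microphone loop.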
Challenge 3: Browser Autoplay Policy Blocking Audio
Listeners connected, audio data received, but nothing played. No console errors—it just failed silently. Modern browsers require user interaction before playing audio (security feature).
Implemented audio unlock overlay: Black screen with "TAP TO ENABLE AUDIO" button. One click initializes the Web Audio API (satisfies the user interaction requirement) and unmutes the audio element. All subsequent audio plays automatically.
Challenge 4: ngrok Free Tier Endpoint Locking
ngrok's free tier reserves endpoints on startup; restarting ngrok reports the endpoint as "already online" and prevents new connections. The workaround was to use the local WiFi IP (192.168.0.101:5000) for testing and demos instead, which turned out to be superior for hackathon demos: faster, more reliable, and no bandwidth limits.
Challenge 5: Synchronization Between Multiple Listeners
How to ensure all listeners hear audio at the same time with varying network latency?
Solution: WebSocket room-based broadcasting. All listeners in the same room receive the event simultaneously. Base64-encoded MP3 ensures identical payload. No client-side buffering—play immediately upon receive. Result: All listeners hear enhanced audio within ~20ms of each other (imperceptible).
Accomplishments That I'm Proud Of
End-to-End Pipeline: Built a complete system from microphone capture through transcription, enhancement, synthesis, and broadcast—all working in real-time.
Queue-Based Architecture: Eliminated latency bottlenecks by parallelizing capture and processing. Speaker can deliver naturally without technical delays.
Multi-Listener Synchronization: Multiple listeners receive perfectly synchronized audio despite varying network conditions.
User Experience: Solved the browser autoplay problem with an elegant one-tap unlock overlay. Invisible complexity turned into simple UX.
Nova Pro Integration: Successfully leveraged AWS Bedrock Nova Pro for meaningful text enhancement. A single emotional prompt transforms mundane statements into confident, professional language.
Cross-Device Compatibility: Works on phones, tablets, laptops. Listeners can join with a simple link share. No installation or login required.
Rapid Iteration: Built from concept to working deployment in the hackathon timeframe.
What I Learned
Text Enhancement Beats Audio Processing: Nova Pro's strength is understanding what you meant to say and expressing it with impact. AI that understands meaning beats audio filters every time.
Architecture Matters More Than Optimization: Queue-based parallel processing solved latency better than any code optimization. Separating concerns equals better performance.
WebSocket Broadcasting is Essential: HTTP polling was too slow. WebSocket with room-based broadcasting provides the real-time sync required for multi-listener scenarios.
User Experience is Non-Obvious: The audio unlock overlay was a friction point I discovered and solved. Many technical challenges have elegant UX solutions.
Test Early with Real Constraints: Python 3.13 compatibility, ngrok free tier limitations, browser autoplay policies—each constraint forced better design decisions.
Local is Fast: Do not overlook local WiFi for demos. It beats cloud tunneling every time (fewer hops, full bandwidth).
What's Next for Sonic Shadow
Potential enhancements:
Multiple Enhancement Styles: Not just "energetic," but "professional," "conversational," "authoritative" prompts for different contexts.
Real-Time Waveform Display: Visual feedback showing speech detection and processing status.
Session Recording: Save speaker and enhanced audio versions for review and improvement.
Speech Analytics: Metrics on speech patterns—filler word frequency, pace, confidence levels before and after enhancement.
Mobile Native Apps: iOS and Android apps for speakers and listeners with better audio handling.
Production Deployment: AWS AppRunner or EC2 with persistent storage, custom domains, and scalability for large presentations.
Emotional Range Control: Let speakers dial in the "level" of confidence—more natural for casual presentations, more amplified for high-stakes pitches.
Speaker Queue: Multiple speakers in sequence, with listeners staying connected across sessions.
Voice Cloning: Add the original speaker's voice to the enhanced audio.
Sonic Shadow: Your authentic voice, amplified with confidence.
Built With
- awsbedrocknova1.0
- css3
- flask
- gtts
- html5
- javascript
- python