Inspiration

As someone who spends a lot of time watching streams, I kept noticing the same problem: streamers missing questions, chat flying by unanswered, and the audience slowly disengaging. I wondered: what if AI could act as a co-host that actually sees what's being streamed, reads the chat, and responds naturally in real time? When I came across the Gemini Live API's multimodal capabilities, I realized I could build something beyond a basic chatbot: an AI companion that genuinely understands the context of a stream as it happens.

What it does

StreamBuddy is an AI co-host for live streamers that:

  • Reads YouTube Live chat in real time and understands viewer questions, comments, and sentiment
  • Listens to the streamer - you can talk to it directly and it responds naturally, just like a real co-host
  • Responds via voice using Gemini Live API's natural audio output
  • Engages with your audience by answering questions, reacting to gameplay, and keeping the conversation flowing
  • Adapts its personality - adjust humor, supportiveness, and verbosity to match your streaming style

How we built it

Architecture:

  • Frontend: React + Vite
  • Backend: Python FastAPI on Google Cloud Run with WebSocket support for real-time bidirectional streaming
  • AI Engine: Google GenAI SDK with Gemini 2.5 Flash Live API for multimodal understanding
  • Integrations: YouTube Data API v3 for live chat capture, OAuth 2.0 for secure authentication
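The writeup doesn't show the wire format used over the WebSocket, but a minimal sketch of one workable convention is a JSON envelope carrying base64-encoded PCM chunks. The `type`/`seq`/`data` field names below are illustrative assumptions, not StreamBuddy's actual protocol:

```python
import base64
import json


def encode_audio_message(pcm_chunk: bytes, seq: int) -> str:
    """Wrap a raw PCM chunk in a JSON envelope for the WebSocket.

    Field names ('type', 'seq', 'data') are hypothetical, chosen only
    to illustrate one reasonable framing scheme.
    """
    return json.dumps({
        "type": "audio",
        "seq": seq,
        "data": base64.b64encode(pcm_chunk).decode("ascii"),
    })


def decode_audio_message(raw: str) -> tuple[int, bytes]:
    """Parse the envelope back into (sequence number, PCM bytes)."""
    msg = json.loads(raw)
    if msg["type"] != "audio":
        raise ValueError(f"unexpected message type: {msg['type']}")
    return msg["seq"], base64.b64decode(msg["data"])
```

Sequence numbers make dropped or reordered chunks detectable on either end, which matters once audio flows in both directions over the same socket.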

Key Technical Implementations:

  1. Real-time Audio Streaming: Implemented bidirectional WebSocket audio streaming with browser-based microphone capture and AI response playback using Web Audio API

  2. Personality System: Created a configurable personality engine that adjusts AI responses based on humor level, supportiveness, playfulness, and verbosity settings

  3. Chat Analysis: Implemented background chat analysis and topic extraction to help the AI understand audience mood and interests
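The personality engine in (2) can be sketched as a function that folds the slider settings into the model's system prompt. The parameter names, ranges, and phrasing below are assumptions for illustration, not StreamBuddy's actual implementation:

```python
from dataclasses import dataclass


@dataclass
class Personality:
    """Sliders in the range 0.0-1.0; names are illustrative."""
    humor: float = 0.5
    supportiveness: float = 0.5
    playfulness: float = 0.5
    verbosity: float = 0.5


def build_system_prompt(p: Personality) -> str:
    """Translate slider values into natural-language style instructions."""
    def level(x: float) -> str:
        return "low" if x < 0.34 else "moderate" if x < 0.67 else "high"

    return (
        "You are a live-stream co-host. "
        f"Humor: {level(p.humor)}. "
        f"Supportiveness: {level(p.supportiveness)}. "
        f"Playfulness: {level(p.playfulness)}. "
        + ("Keep replies to one or two short sentences."
           if p.verbosity < 0.5 else
           "You may give fuller, multi-sentence replies.")
    )
```

Baking the sliders into the system prompt keeps personality changes cheap: reconfiguring the co-host is just a new session config rather than any model-side change.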
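The background chat analysis in (3) can be approximated with simple keyword counting; a production system would likely lean on the model itself, but this stdlib sketch shows the idea of surfacing what chat is talking about:

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real one would be much larger.
STOPWORDS = {"the", "a", "an", "is", "it", "this", "that", "and", "or",
             "to", "of", "in", "on", "for", "you", "i", "we", "so", "lol"}


def extract_topics(messages: list[str], top_n: int = 3) -> list[str]:
    """Return the most frequent non-stopword tokens across chat messages."""
    counts: Counter[str] = Counter()
    for msg in messages:
        for token in re.findall(r"[a-z']+", msg.lower()):
            if token not in STOPWORDS and len(token) > 2:
                counts[token] += 1
    return [word for word, _ in counts.most_common(top_n)]
```

The extracted topics can then be fed into the AI's context periodically, so its unprompted remarks track what the audience actually cares about.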

Deployment:

  • Containerized with Docker and deployed to Cloud Run

Challenges we ran into

1. Audio Latency & Synchronization: Initially, I experienced 30+ second delays in AI responses. I discovered that the proactive audio mode was adding unnecessary latency. Switching to responsive mode and optimizing audio chunk sizes (20-40ms, as recommended by Google) brought latency down to 2-4 seconds.
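For reference, the chunk sizes involved are easy to compute. Assuming 16 kHz, 16-bit mono PCM input audio (the format the Live API expects for microphone input), a 20-40 ms chunk works out to:

```python
def chunk_bytes(duration_s: float, sample_rate: int = 16_000,
                bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Size in bytes of one PCM audio chunk of the given duration."""
    return int(sample_rate * bytes_per_sample * channels * duration_s)


# 20 ms and 40 ms chunks of 16 kHz 16-bit mono PCM:
print(chunk_bytes(0.02))  # 640
print(chunk_bytes(0.04))  # 1280
```

Smaller chunks mean more WebSocket messages per second but let the backend react to speech (and interruptions) with finer granularity, which is the trade-off behind the 20-40 ms recommendation.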

2. WebSocket Reconnection Logic: Users experienced unwanted microphone prompts after stopping a session. I added a shouldReconnectRef flag that gates reconnection, preventing auto-reconnect when a user explicitly stops streaming.

Accomplishments that we're proud of

  • Real-time multimodal AI: Successfully integrated audio streaming and chat monitoring into a single coherent AI experience
  • Production-ready deployment: Built a scalable cloud-native architecture that handles multiple concurrent streaming sessions
  • Natural voice interaction: Achieved conversational AI responses with minimal latency using Gemini Live API
  • Seamless OAuth flow: Implemented secure YouTube authentication with proper token management per user

What we learned

  • Gemini Live API's proactive vs responsive modes have significant latency trade-offs
  • WebSocket audio streaming requires careful buffer management and format conversion

What's next for StreamBuddy

  • Webcam and screen video capture - add the streamer's face and reactions to the AI's context for more natural interactions. For PC streamers such as gamers and artists, I plan to capture their screen video and pass it to the agent.
  • Support for other platforms to expand beyond YouTube Live

Built With

  • React + Vite
  • Python + FastAPI
  • Google GenAI SDK (Gemini 2.5 Flash Live API)
  • YouTube Data API v3
  • Docker + Google Cloud Run
