Inspiration

As someone who spends a lot of time watching streams, I kept noticing the same problem: streamers missing questions, chat flying by unanswered, and the audience slowly disengaging. I wondered: what if AI could act as a co-host that actually sees what's being streamed, reads the chat, and responds naturally in real time? When I came across the Gemini Live API's multimodal capabilities, I realized I could build something beyond a basic chatbot: an AI companion that genuinely understands the context of a stream as it happens.

What it does

StreamBuddy is an AI co-host for live streamers that:

  • Reads YouTube Live chat in real time and understands viewer questions, comments, and sentiment
  • Listens to the streamer - you can talk to it directly and it responds naturally, just like a real co-host
  • Responds via voice using Gemini Live API's natural audio output
  • Engages with your audience by answering questions, reacting to gameplay, and keeping the conversation flowing
  • Adapts its personality - adjust humor, supportiveness, and verbosity to match your streaming style

How we built it

Architecture:

  • Frontend: React + Vite
  • Backend: Python FastAPI on Google Cloud Run with WebSocket support for real-time bidirectional streaming
  • AI Engine: Google GenAI SDK with Gemini 2.5 Flash Live API for multimodal understanding
  • Integrations: YouTube Data API v3 for live chat capture, OAuth 2.0 for secure authentication
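The writeup doesn't show the wire format used over the WebSocket, but a minimal sketch of one workable convention is a JSON envelope carrying base64-encoded PCM chunks. The `type`/`seq`/`data` field names below are illustrative assumptions, not StreamBuddy's actual protocol:

```python
import base64
import json


def encode_audio_message(pcm_chunk: bytes, seq: int) -> str:
    """Wrap a raw PCM chunk in a JSON envelope for the WebSocket.

    Field names ('type', 'seq', 'data') are hypothetical, chosen only
    to illustrate one reasonable framing scheme.
    """
    return json.dumps({
        "type": "audio",
        "seq": seq,
        "data": base64.b64encode(pcm_chunk).decode("ascii"),
    })


def decode_audio_message(raw: str) -> tuple[int, bytes]:
    """Parse the envelope back into (sequence number, PCM bytes)."""
    msg = json.loads(raw)
    if msg["type"] != "audio":
        raise ValueError(f"unexpected message type: {msg['type']}")
    return msg["seq"], base64.b64decode(msg["data"])
```

Sequence numbers make dropped or reordered chunks detectable on either end, which matters once audio flows in both directions over the same socket.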

Key Technical Implementations:

  1. Real-time Audio Streaming: Implemented bidirectional WebSocket audio streaming with browser-based microphone capture and AI response playback using Web Audio API

  2. Personality System: Created a configurable personality engine that adjusts AI responses based on humor level, supportiveness, playfulness, and verbosity settings

  3. Chat Analysis: Implemented background chat analysis and topic extraction to help the AI understand audience mood and interests
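The personality engine in (2) can be sketched as a function that folds the slider settings into the model's system prompt. The parameter names, ranges, and phrasing below are assumptions for illustration, not StreamBuddy's actual implementation:

```python
from dataclasses import dataclass


@dataclass
class Personality:
    """Sliders in the range 0.0-1.0; names are illustrative."""
    humor: float = 0.5
    supportiveness: float = 0.5
    playfulness: float = 0.5
    verbosity: float = 0.5


def build_system_prompt(p: Personality) -> str:
    """Translate slider values into natural-language style instructions."""
    def level(x: float) -> str:
        return "low" if x < 0.34 else "moderate" if x < 0.67 else "high"

    return (
        "You are a live-stream co-host. "
        f"Humor: {level(p.humor)}. "
        f"Supportiveness: {level(p.supportiveness)}. "
        f"Playfulness: {level(p.playfulness)}. "
        + ("Keep replies to one or two short sentences."
           if p.verbosity < 0.5 else
           "You may give fuller, multi-sentence replies.")
    )
```

Baking the sliders into the system prompt keeps personality changes cheap: reconfiguring the co-host is just a new session config rather than any model-side change.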
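The background chat analysis in (3) can be approximated with simple keyword counting; a production system would likely lean on the model itself, but this stdlib sketch shows the idea of surfacing what chat is talking about:

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real one would be much larger.
STOPWORDS = {"the", "a", "an", "is", "it", "this", "that", "and", "or",
             "to", "of", "in", "on", "for", "you", "i", "we", "so", "lol"}


def extract_topics(messages: list[str], top_n: int = 3) -> list[str]:
    """Return the most frequent non-stopword tokens across chat messages."""
    counts: Counter[str] = Counter()
    for msg in messages:
        for token in re.findall(r"[a-z']+", msg.lower()):
            if token not in STOPWORDS and len(token) > 2:
                counts[token] += 1
    return [word for word, _ in counts.most_common(top_n)]
```

The extracted topics can then be fed into the AI's context periodically, so its unprompted remarks track what the audience actually cares about.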

Deployment:

  • Containerized with Docker and deployed to Cloud Run

Challenges we ran into

1. Audio Latency & Synchronization: Initially, I experienced 30+ second delays in AI responses. I discovered that the proactive audio mode was adding unnecessary latency. Switching to responsive mode and optimizing audio chunk sizes (20-40ms, as recommended by Google) brought latency down to 2-4 seconds.
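For reference, the chunk sizes involved are easy to compute. Assuming 16 kHz, 16-bit mono PCM input audio (the format the Live API expects for microphone input), a 20-40 ms chunk works out to:

```python
def chunk_bytes(duration_s: float, sample_rate: int = 16_000,
                bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Size in bytes of one PCM audio chunk of the given duration."""
    return int(sample_rate * bytes_per_sample * channels * duration_s)


# 20 ms and 40 ms chunks of 16 kHz 16-bit mono PCM:
print(chunk_bytes(0.02))  # 640
print(chunk_bytes(0.04))  # 1280
```

Smaller chunks mean more WebSocket messages per second but let the backend react to speech (and interruptions) with finer granularity, which is the trade-off behind the 20-40 ms recommendation.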

2. WebSocket Reconnection Logic: Users experienced unwanted microphone prompts after stopping a session. I added a shouldReconnectRef flag that gates reconnection, preventing auto-reconnect when a user explicitly stops streaming.

Accomplishments that we're proud of

  • Real-time multimodal AI: Successfully integrated audio streaming and chat monitoring into a single coherent AI experience
  • Production-ready deployment: Built a scalable cloud-native architecture that handles multiple concurrent streaming sessions
  • Natural voice interaction: Achieved conversational AI responses with minimal latency using Gemini Live API
  • Seamless OAuth flow: Implemented secure YouTube authentication with proper token management per user

What we learned

  • Gemini Live API's proactive vs responsive modes have significant latency trade-offs
  • WebSocket audio streaming requires careful buffer management and format conversion

What's next for StreamBuddy

  • Webcam and screen video capture - add the streamer's face and reactions to the AI's context for more natural interactions. For PC streamers such as gamers and artists, I plan to capture their screen video and pass it to the agent.
  • Support for other platforms to expand beyond YouTube Live

Built With

  • React + Vite
  • Python + FastAPI
  • Google GenAI SDK (Gemini 2.5 Flash Live API)
  • YouTube Data API v3
  • Docker + Google Cloud Run
