Inspiration
Working in the jewelry manufacturing industry, I witnessed firsthand how live streaming became the lifeline of modern sales. However, I noticed a recurring problem: many streamers—whether selling gold rings, sneakers, or fashion items—struggle to retain viewers but have no idea why. They are experts in their products but amateurs in broadcasting.
Hiring a professional director to whisper cues like "fix your lighting" or "speak with more energy" is too expensive for most. I realized that with the speed and multimodal capabilities of the new Gemini 3 Flash Preview, we could democratize this. I wanted to build an AI "Stream Coach" that acts as a personal director, giving objective, data-driven feedback to help anyone become a better streamer.
What it does
StreamCoach AI is a context-aware performance auditor for live streamers. It allows users to upload a recording of their session and receive a comprehensive "Stream Health Score" (0-100) within seconds.
Key capabilities include:
- Multimodal Analysis: It doesn't just read transcripts. It "watches" video frames to judge lighting, product focus, and gestures, and "listens" to the full audio track to analyze vocal energy and sales persuasion.
- Context-Adaptive: Users can select their category (e.g., Jewelry, Fashion, Gaming). The AI dynamically adjusts its critique criteria—focusing on "sparkle and detail" for jewelry, or "outfit coordination" for fashion.
- Timeline Flagging: It pinpoints exact timestamps where engagement dropped, such as moments of "dead air," blurry product shots, or low energy.
- Actionable Coaching: Instead of generic advice, it provides specific fixes, e.g., "At 02:15, your hand covered the ring details. Consider using a velvet stand."
How we built it
We prioritized speed and privacy ("Bring Your Own Key" architecture).
- AI Engine: We utilized Gemini 3 Flash Preview via the Google GenAI SDK. Its superior multimodal reasoning and expanded context window allowed us to feed disjointed image frames alongside continuous audio tracks without losing context.
- Backend: Built with Golang for its raw performance and concurrency. We used FFmpeg to process video uploads—intelligently sampling frames every few seconds while extracting high-fidelity audio for tonal analysis.
- Frontend: Developed using Vue.js and Tailwind CSS for a responsive, clean interface that visualizes the AI's complex JSON output into easy-to-read charts and timelines.
- Integration: We devised a custom prompt engineering strategy that injects specific "personas" (e.g., Jewelry Expert vs. Fashion Stylist) into Gemini based on user selection.
Challenges we ran into
- Synchronizing Modalities: Ensuring Gemini understood that the audio at 00:15 corresponds to the visual frame at 00:15 was tricky. We solved this by strictly timestamping our image samples and instructing the model to correlate specific visual cues with the audio sentiment at that exact second.
- Prompt Nuance: Initially, the AI was too nice. We had to iterate on the system instructions to make Gemini 3 Flash more critical and "strict," like a real sales manager, so the feedback would be genuinely useful rather than just complimentary.
- Handling Context: Creating a truly "General Purpose" tool meant the prompt structure had to be modular. Hardcoding criteria for jewelry broke the logic for sneakers. We had to build a dynamic prompt generator in Go.
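A modular prompt generator of the kind described above could look roughly like this. It is a minimal sketch, not our production code: the `Persona` type, the `personas` table, the fallback persona, and all prompt wording are hypothetical names and text chosen for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Persona holds the category-specific parts of the system prompt.
// The type, its fields, and the wording below are illustrative assumptions.
type Persona struct {
	Role     string
	Criteria []string
}

// personas maps a user-selected category to its critique persona.
var personas = map[string]Persona{
	"jewelry": {Role: "a strict jewelry sales manager", Criteria: []string{"sparkle and detail", "product focus"}},
	"fashion": {Role: "a strict fashion stylist", Criteria: []string{"outfit coordination", "lighting"}},
}

// fallback is used for categories without a dedicated persona,
// so that criteria for one category never leak into another.
var fallback = Persona{Role: "a strict live-sales manager", Criteria: []string{"lighting", "vocal energy", "product focus"}}

// BuildPrompt assembles the system prompt for the given category.
func BuildPrompt(category string) string {
	p, ok := personas[strings.ToLower(strings.TrimSpace(category))]
	if !ok {
		p = fallback
	}
	var b strings.Builder
	fmt.Fprintf(&b, "You are %s.\n", p.Role)
	b.WriteString("Be critical and specific.\n")
	b.WriteString("Judge the stream on:\n")
	for _, c := range p.Criteria {
		fmt.Fprintf(&b, "- %s\n", c)
	}
	return b.String()
}

func main() {
	fmt.Print(BuildPrompt("Jewelry"))
}
```

Keeping the criteria in a lookup table means adding a new category is a data change rather than a change to the prompt-building logic.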
Accomplishments that we're proud of
- Speed: Using Gemini 3 Flash Preview, a 5-minute video is analyzed in under 30 seconds.
- Holistic Auditing: We are proud that StreamCoach doesn't just look at text. It successfully flags "visual silence" (bad framing) and "audio boredom" (monotone voice), which are usually invisible to standard NLP tools.
- Agnostic Architecture: We moved beyond a niche jewelry tool to a platform that can audit any type of visual sales stream.
What we learned
- Audio is Underrated: We learned that vocal intonation often carries more weight in sales conversion than the visual quality itself. Gemini 3 Flash's ability to pick up on "hesitation" or "confidence" in the voice was a revelation.
- The Power of Preview Models: Using the bleeding-edge Gemini 3 Flash Preview gave us reasoning capabilities that felt significantly more human-like compared to previous generations, especially in understanding the intent behind a gesture.
What's next for StreamCoach AI: Multimodal Live Stream Auditor
- Real-Time RTMP Analysis: Moving from "Post-stream Audit" to "Live Assistance," where the AI intercepts the RTMP stream and gives feedback during the broadcast (e.g., "Smile more!" popping up on the host's screen).
- Competitor Analysis: Allowing users to upload a competitor's video to see how their own performance score compares.
- Mobile App: A dedicated mobile version for on-the-go streamers using on-device processing where possible.
🌍 Global Impact
1. Democratizing Professional Sales Coaching: Historically, only large enterprises could afford professional directors to critique their live streams. StreamCoach AI levels the playing field, giving solo entrepreneurs and MSMEs (UMKM) access to high-end, data-driven coaching at near-zero cost. This empowers small businesses to compete with big brands.
2. Accelerating the Creator Economy: The global live commerce market is projected to reach trillions of dollars, yet many new streamers fail due to a lack of guidance. By providing instant feedback on lighting, energy, and product presentation, we help creators increase their conversion rates and build sustainable incomes.
3. Bridging the Soft-Skill Gap: Public speaking and digital presence are critical skills in the modern workforce. StreamCoach AI doesn't just improve sales; it acts as a personal tutor that helps users master confidence, eye contact, and articulation, skills that are valuable far beyond live streaming.
Architecture Overview
The system follows a modern, decoupled architecture designed for speed and privacy. It leverages Golang for high-performance backend processing and Google Gemini 3 Flash Preview for multimodal AI analysis.
1. Frontend Layer (User Interface)
- Tech Stack: Vue.js (Framework) & Tailwind CSS (Styling).
- Function:
- Provides a responsive dashboard for users to upload live stream recordings.
- Privacy-First Security: The user's Gemini API Key is stored securely in the browser's Local Storage. It is never saved to our database, ensuring a "Bring Your Own Key" (BYOK) architecture.
- Visualizes the JSON analysis data into interactive charts, timelines, and scorecards.
2. Backend Orchestration
- Tech Stack: Golang (Go).
- Function:
- Acts as the central orchestrator, handling API requests from the frontend.
- Manages Temporary Storage to briefly hold uploaded video files during the processing stage.
- Utilizes Go’s concurrency model (Goroutines) to handle multiple analysis requests efficiently without blocking the server.
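One common way to run several analyses at once without overloading the server is a goroutine per job, capped by a buffered channel used as a semaphore. The sketch below assumes that pattern; the `Job` type, the `analyze` stand-in, and the `RunAll` helper are hypothetical and only illustrate the idea.

```go
package main

import (
	"fmt"
	"sync"
)

// Job is a hypothetical unit of work: one uploaded recording to analyze.
type Job struct {
	ID   int
	Path string
}

// analyze stands in for the real FFmpeg + Gemini pipeline.
func analyze(j Job) string {
	return fmt.Sprintf("report for %s", j.Path)
}

// RunAll analyzes jobs concurrently with at most limit running at once.
// Results are returned in the same order as the input jobs.
func RunAll(jobs []Job, limit int) []string {
	if limit < 1 {
		limit = 1 // an unbuffered semaphore would block forever
	}
	results := make([]string, len(jobs))
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for i, j := range jobs {
		wg.Add(1)
		sem <- struct{}{} // waits here while limit jobs are already running
		go func(i int, j Job) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when this job finishes
			results[i] = analyze(j)  // each goroutine writes its own index
		}(i, j)
	}
	wg.Wait()
	return results
}

func main() {
	jobs := []Job{{ID: 1, Path: "a.mp4"}, {ID: 2, Path: "b.mp4"}, {ID: 3, Path: "c.mp4"}}
	for _, r := range RunAll(jobs, 2) {
		fmt.Println(r)
	}
}
```

The cap matters because each job spawns an FFmpeg process and holds a video file in temporary storage, so unbounded concurrency would trade blocking for memory and disk pressure.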
3. Media Processing Engine
- Tool: FFmpeg.
- Workflow:
- Once the video reaches the backend, Golang executes FFmpeg commands to split the media into two modalities:
- Visual Sampling: Extracts image frames at specific intervals (e.g., every 5-10 seconds) to reduce payload size while retaining visual context (lighting, gestures, product focus).
- Audio Extraction: Separates the full audio track to ensure the AI can analyze vocal intonation, pitch, and energy continuity without interruption.
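The two FFmpeg passes can be sketched as argument lists handed to `os/exec`. This is an assumption about how such a split is typically done, not a copy of our exact commands: the output file pattern, the JPEG quality setting, and the AAC encoder and bitrate are illustrative choices, and `frameArgs`, `audioArgs`, and `runFFmpeg` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"os/exec"
	"path/filepath"
)

// frameArgs builds FFmpeg arguments that write roughly one JPEG
// every interval seconds into outDir, using the fps video filter.
func frameArgs(input, outDir string, interval int) []string {
	if interval < 1 {
		interval = 1 // fps=1/0 would be invalid
	}
	return []string{
		"-i", input,
		"-vf", fmt.Sprintf("fps=1/%d", interval),
		"-q:v", "2", // JPEG quality (lower is better); illustrative value
		filepath.Join(outDir, "frame_%04d.jpg"),
	}
}

// audioArgs builds FFmpeg arguments that drop the video stream (-vn)
// and encode the full audio track. Codec and bitrate are assumptions.
func audioArgs(input, output string) []string {
	return []string{"-i", input, "-vn", "-c:a", "aac", "-b:a", "192k", output}
}

// runFFmpeg executes ffmpeg with the given arguments and returns
// its combined output so that failures can be logged.
func runFFmpeg(args []string) ([]byte, error) {
	return exec.Command("ffmpeg", args...).CombinedOutput()
}

func main() {
	// Print the commands instead of running them, so the sketch
	// works on machines without FFmpeg installed.
	fmt.Println("ffmpeg", frameArgs("stream.mp4", "frames", 5))
	fmt.Println("ffmpeg", audioArgs("stream.mp4", "audio.m4a"))
}
```

Passing arguments as a slice rather than a shell string avoids quoting problems with user-supplied file names.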
4. AI Analysis Layer (The Brain)
- Model: Google Gemini 3 Flash Preview.
- Workflow:
- The Backend constructs a Multimodal Prompt containing the sampled image frames, the full audio file, and the user-selected context (e.g., "Jewelry Sales" or "Fashion").
- This payload is sent to the Gemini API.
- Gemini processes the inputs simultaneously ("watching" the frames and "listening" to the audio) to generate a holistic audit.
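To keep frames and audio aligned, each sampled frame can be preceded by a text label carrying its timestamp before the audio is appended. The sketch below shows only that ordering. The `Part` type is a local stand-in, not the Google GenAI SDK's type, and would need converting to whatever part representation the SDK in use expects; `BuildParts` and `timestamp` are hypothetical helpers, and the timestamps are nominal (frame index times sampling interval), so the frame FFmpeg actually picked may be offset slightly.

```go
package main

import "fmt"

// Part is a local stand-in for one piece of a multimodal request:
// either a text part (Text set) or a media part (MIMEType and Data set).
type Part struct {
	Text     string
	MIMEType string
	Data     []byte
}

// timestamp formats a number of seconds as mm:ss.
func timestamp(sec int) string {
	return fmt.Sprintf("%02d:%02d", sec/60, sec%60)
}

// BuildParts orders the request as: prompt, then for each frame a
// timestamp label followed by the image, then the full audio track.
// interval is the sampling interval in seconds used when extracting frames.
func BuildParts(prompt string, frames [][]byte, interval int, audio []byte, audioMIME string) []Part {
	parts := []Part{{Text: prompt}}
	for i, f := range frames {
		parts = append(parts,
			Part{Text: "Frame at " + timestamp(i*interval)},
			Part{MIMEType: "image/jpeg", Data: f},
		)
	}
	return append(parts, Part{MIMEType: audioMIME, Data: audio})
}

func main() {
	frames := [][]byte{{0x01}, {0x02}, {0x03}} // placeholder bytes, not real JPEGs
	for _, p := range BuildParts("Audit this stream.", frames, 5, []byte{0x09}, "audio/mp4") {
		if p.Text != "" {
			fmt.Println("text: ", p.Text)
		} else {
			fmt.Println("media:", p.MIMEType, len(p.Data), "bytes")
		}
	}
}
```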
5. Data Output & Visualization
- Result: Gemini returns a structured JSON response containing:
- Overall Performance Score (0-100).
- Timestamped flags for issues (e.g., "Blurry product at 02:15").
- Actionable coaching tips.
- The Golang backend forwards this JSON to the Frontend, which renders it into the user-friendly "Stream Health Report."
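Before forwarding, the backend can decode and sanity-check the model's JSON. The sketch below assumes a response shape with a score, a list of timestamped flags, and a list of tips; the field names (`score`, `flags`, `timestamp`, `issue`, `tips`) and the `Report`, `Flag`, and `ParseReport` identifiers are hypothetical, since the exact schema is not spelled out here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Flag is one timestamped issue, e.g. a blurry product shot at 02:15.
type Flag struct {
	Timestamp string `json:"timestamp"`
	Issue     string `json:"issue"`
}

// Report mirrors the assumed JSON shape of the audit.
type Report struct {
	Score int      `json:"score"`
	Flags []Flag   `json:"flags"`
	Tips  []string `json:"tips"`
}

// ParseReport decodes the model output and rejects scores outside 0-100,
// so the frontend never renders an out-of-range Stream Health Score.
func ParseReport(raw []byte) (Report, error) {
	var r Report
	if err := json.Unmarshal(raw, &r); err != nil {
		return Report{}, fmt.Errorf("decode report: %w", err)
	}
	if r.Score < 0 || r.Score > 100 {
		return Report{}, fmt.Errorf("score %d outside 0-100", r.Score)
	}
	return r, nil
}

func main() {
	raw := []byte(`{"score":72,"flags":[{"timestamp":"02:15","issue":"Blurry product"}],"tips":["Use a velvet stand."]}`)
	r, err := ParseReport(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(r.Score, r.Flags[0].Timestamp, r.Tips[0])
}
```

Validating on the backend keeps malformed model output from reaching the charts and timelines in the frontend.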