Echuu-AI Virtual Streamer

Inspiration

We wanted anyone to be able to run a live stream with an AI-driven virtual host: no green screen, no mocap suit, just a webcam and a browser. The idea was to combine real-time webcam motion capture for a 3D avatar with an AI that can script and speak in multiple languages, so one product could power both “VTuber-style” streaming and AI-hosted shows.

What it does

Echuu is an AI virtual streamer that runs in the browser. You pick a topic and persona, hit “Go Live,” and the app generates a script, drives a 3D VRM avatar from your webcam (face, pose, and hands via MediaPipe), and speaks the lines with multilingual TTS (Qwen/CosyVoice). Viewers get a single-page live experience with avatar, captions, and audio, in English, Chinese, or Japanese. The frontend is a Next.js app (echuu.xyz); the live engine and TTS run on a FastAPI backend (e.g. on Railway).

How we built it

Frontend: Next.js 14 (App Router), React Three Fiber + Three.js for the 3D avatar, MediaPipe Holistic for webcam pose/face/hands, and Kalidokit to turn landmarks into VRM bone rotations. State is managed with Zustand; we keep high-frequency mocap data out of React state so rendering stays at 60 fps.
Backend: FastAPI service that runs the “Echuu” live engine: script generation (LLM), TTS (DashScope Qwen/CosyVoice), and WebSocket streaming of script steps and base64 audio. We support multiple LLM backends (e.g. Gemini, Claude) and configurable voice/language.
Deploy: Next.js on Vercel; backend Dockerized and deployed on Railway, with API keys supplied through environment variables and CORS/COOP/COEP headers configured for MediaPipe.
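
As a rough illustration of the "script steps and base64 audio" streaming mentioned above, here is a minimal sketch of how one step could be packed into a single JSON text frame. The field names (type, index, language, text, audio_b64) and the function names are assumptions made for this example, not the actual Echuu wire format.

```python
import base64
import json
from typing import Tuple


def encode_step(index: int, text: str, audio: bytes, language: str = "en") -> str:
    """Pack one script step and its audio into a JSON text frame.

    All field names here are hypothetical; the real protocol may differ.
    """
    return json.dumps({
        "type": "step",
        "index": index,
        "language": language,
        "text": text,
        # Raw audio bytes are base64-encoded so they fit in a text frame.
        "audio_b64": base64.b64encode(audio).decode("ascii"),
    })


def decode_step(frame: str) -> Tuple[int, str, bytes]:
    """Unpack a frame produced by encode_step into (index, text, audio)."""
    msg = json.loads(frame)
    if msg.get("type") != "step":
        raise ValueError("unexpected message type: %r" % msg.get("type"))
    return msg["index"], msg["text"], base64.b64decode(msg["audio_b64"])
```

Carrying the caption text and the audio in the same frame keeps them paired on arrival, at the cost of base64's roughly one-third size overhead compared with binary frames.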

Challenges we ran into

VRM orientation and axis mapping: VRM 0.x and VRM 1.0 models face opposite directions, so we had to normalize VRM 0.x models with VRMUtils.rotateVRM0 and tune the axis settings until mocap rotations mapped correctly onto the model.
TTS in production: The backend originally looked for the TTS module under workflow/backend, but in Docker the app runs from /app/backend. We added a fallback that loads tts_client.py from the current working directory, so the same code works locally and on Railway.
Language and audio sync: Getting “English in → English out” and reliable audio delivery to the client required passing a language hint from the frontend and hardening the WebSocket/audio enqueue logic so the player doesn’t skip or duplicate chunks.
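
The "no skipped or duplicated chunks" requirement can be sketched as a small reordering buffer. This is only an illustration of the idea, written in Python for consistency with the backend; the real player runs in the browser, and both the class name and the assumption that every chunk carries a sequence number are ours.

```python
class OrderedAudioQueue:
    """Release audio chunks strictly in sequence order, dropping duplicates.

    Assumes (hypothetically) that each chunk arrives with an integer
    sequence number starting at 0.
    """

    def __init__(self):
        self._next = 0      # sequence number the player expects next
        self._pending = {}  # chunks that arrived early, keyed by sequence number

    def push(self, seq, chunk):
        """Accept one chunk and return the chunks that are now ready to play."""
        if seq < self._next or seq in self._pending:
            return []  # already played or already buffered: ignore the duplicate
        self._pending[seq] = chunk
        ready = []
        # Drain every consecutive chunk starting at the expected number.
        while self._next in self._pending:
            ready.append(self._pending.pop(self._next))
            self._next += 1
        return ready
```

A chunk that arrives early is held until the gap before it is filled, so playback never jumps ahead, and a chunk that is delivered twice is silently ignored.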

Accomplishments that we're proud of

One stack for both real-time VTuber-style mocap and AI-hosted live streams.
Multilingual pipeline (topic/persona + optional language) so the stream can stay in English, Chinese, or Japanese end-to-end.
Shipping a working live experience (echuu.xyz) with 3D avatar, TTS, and script generation, deployable via GitHub → Railway with minimal config.

What we learned

How to keep React and Three.js in sync without re-renders on every frame (refs, object pooling, memo).
Integrating DashScope’s realtime TTS and designing a simple protocol for script + audio over WebSockets.
The importance of matching deploy layout (e.g. Root Directory and Dockerfile path) to how the code resolves modules so TTS and engines load correctly in production.
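
The module-resolution lesson can be illustrated with a fallback loader along these lines. It is a sketch of the general technique using the standard importlib machinery, not the project's actual code; the function name load_tts_module and the candidate-directory list are assumptions.

```python
import importlib.util
from pathlib import Path


def load_tts_module(candidates, filename="tts_client.py"):
    """Load `filename` from the first candidate directory that contains it.

    `candidates` is an ordered list of directories to try, for example the
    local development path first and the current working directory second.
    """
    for directory in candidates:
        path = Path(directory) / filename
        if path.is_file():
            spec = importlib.util.spec_from_file_location(path.stem, path)
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)  # run the file to populate the module
            return module
    tried = ", ".join(str(d) for d in candidates)
    raise ImportError("%s not found in any of: %s" % (filename, tried))
```

A call such as load_tts_module([Path("workflow/backend"), Path.cwd()]) would try the local layout first and then fall back to wherever the container starts the app. Raising an ImportError that lists the directories tried makes a layout mismatch visible at startup instead of at the first TTS request.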

What's next for Echuu-AI Virtual Streamer

Richer persona and memory so the AI can refer to past streams and viewer context.
Optional recording/playback and clipping of live segments.
More avatar and scene options, and tuning finger/hand mapping for VRM so gestures look more natural.