Inspiration

We wanted to lower the barrier to running a live stream with an AI-driven virtual host: no green screen, no mocap suit, just a webcam and a browser. The idea was to combine real-time 3D avatar motion capture with an AI that can script and speak in multiple languages, so one product could power both “VTuber-style” streaming and AI-hosted shows.

What it does

Echuu is an AI virtual streamer that runs in the browser. You pick a topic and persona, hit “Go Live,” and the app generates a script, drives a 3D VRM avatar from your webcam (face, pose, and hands tracked via MediaPipe), and speaks the lines with multilingual TTS (Qwen/CosyVoice). Viewers see a single-page live experience: avatar, captions, and audio, with an optional choice of English, Chinese, or Japanese. The frontend is a Next.js app (echuu.xyz); the live engine and TTS run on a FastAPI backend (e.g. on Railway).

How we built it

  • Frontend: Next.js 14 (App Router), React Three Fiber + Three.js for the 3D avatar, MediaPipe Holistic for webcam pose/face/hands, and Kalidokit to turn landmarks into VRM bone rotations. State is managed with Zustand; we keep high-frequency mocap data out of React state so rendering holds 60 fps.
  • Backend: FastAPI service that runs the “Echuu” live engine: script generation (LLM), TTS (DashScope Qwen/CosyVoice), and WebSocket streaming of script steps and base64 audio. We support multiple LLM backends (e.g. Gemini, Claude) and configurable voice/language.
  • Deploy: Next.js on Vercel; backend Dockerized and deployed on Railway, with env-based API keys and CORS/COOP/COEP for MediaPipe.
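
To illustrate the WebSocket streaming of script steps and base64 audio described above, here is a minimal sketch of how one message could be encoded and decoded. The ScriptStep schema, its field names, and the "script_step" type tag are assumptions made for illustration; they are not taken from the Echuu codebase.

```python
import base64
import json
from dataclasses import dataclass, asdict


@dataclass
class ScriptStep:
    """One generated script line plus its audio (hypothetical schema)."""
    seq: int        # position of this step in the stream
    text: str       # caption text shown to viewers
    language: str   # language code, e.g. "en", "zh", "ja"
    audio: bytes    # raw audio bytes returned by the TTS engine


def encode_step(step: ScriptStep) -> str:
    """Serialize a step to a JSON text frame, base64-encoding the audio."""
    payload = asdict(step)
    payload["type"] = "script_step"  # assumed message tag
    payload["audio"] = base64.b64encode(step.audio).decode("ascii")
    # ensure_ascii=False keeps Chinese/Japanese captions readable on the wire
    return json.dumps(payload, ensure_ascii=False)


def decode_step(frame: str) -> ScriptStep:
    """Parse a JSON text frame back into a ScriptStep."""
    payload = json.loads(frame)
    if payload.get("type") != "script_step":
        raise ValueError(f"unexpected message type: {payload.get('type')!r}")
    return ScriptStep(
        seq=payload["seq"],
        text=payload["text"],
        language=payload["language"],
        audio=base64.b64decode(payload["audio"]),
    )
```

In a FastAPI WebSocket handler, a frame like this could be sent with `await websocket.send_text(encode_step(step))`. Base64 inflates the audio by roughly a third, which is the price of keeping script and audio in a single JSON text frame.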

Challenges we ran into

  • VRM orientation and axis mapping: VRM 0.x and VRM 1.0 models face opposite directions by default, so we normalized orientation with VRMUtils.rotateVRM0 and tuned axis settings until mocap rotations mapped correctly onto the model.
  • TTS in production: The backend originally looked for the TTS module under workflow/backend, but in Docker the app runs from /app/backend. We added a fallback that loads tts_client.py from the current working directory, so the same code works locally and on Railway.
  • Language and audio sync: Getting “English in → English out” and reliable audio delivery to the client required passing a language hint from the frontend and hardening the WebSocket/audio enqueue logic so the player doesn’t skip or duplicate chunks.
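
One way to keep a player from skipping or duplicating chunks is to number each chunk and release them strictly in order. The sketch below shows that idea in Python for consistency with the backend; the AudioChunkBuffer class and the per-chunk sequence number are assumptions for illustration, not a description of how Echuu's enqueue logic actually works.

```python
from typing import Dict, List


class AudioChunkBuffer:
    """Releases audio chunks in sequence order and drops duplicates."""

    def __init__(self, first_seq: int = 0) -> None:
        self._next_seq = first_seq            # next sequence number to play
        self._pending: Dict[int, bytes] = {}  # chunks that arrived early

    def push(self, seq: int, chunk: bytes) -> List[bytes]:
        """Accept a chunk; return every chunk now ready to play, in order."""
        if seq < self._next_seq or seq in self._pending:
            return []  # already played or already queued: ignore duplicate
        self._pending[seq] = chunk
        ready: List[bytes] = []
        # Drain the contiguous run starting at the next expected number.
        while self._next_seq in self._pending:
            ready.append(self._pending.pop(self._next_seq))
            self._next_seq += 1
        return ready


buffer = AudioChunkBuffer()
buffer.push(1, b"second")  # arrives early, held back
buffer.push(0, b"first")   # releases both chunks, in order
buffer.push(0, b"first")   # duplicate, ignored
```

A buffer like this stalls if a chunk never arrives, so a real player would likely pair it with a timeout or a resend request.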

Accomplishments that we're proud of

  • One stack for both real-time VTuber-style mocap and AI-hosted live streams.
  • Multilingual pipeline (topic/persona + optional language) so the stream can stay in English, Chinese, or Japanese end-to-end.
  • Shipping a working live experience (echuu.xyz) with 3D avatar, TTS, and script generation, deployable via GitHub → Railway with minimal config.

What we learned

  • How to keep React and Three.js in sync without re-rendering on every frame (refs, object pooling, memoization).
  • Integrating DashScope’s realtime TTS and designing a simple protocol for script + audio over WebSockets.
  • The importance of matching deploy layout (e.g. Root Directory and Dockerfile path) to how the code resolves modules so TTS and engines load correctly in production.
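
The module-resolution lesson can be made concrete with a small loader that tries several directories in turn. This is a sketch under stated assumptions: the load_module_from_candidates helper is hypothetical, and only the workflow/backend path, the tts_client.py filename, and the working-directory fallback come from the writeup above.

```python
import importlib.util
from pathlib import Path
from types import ModuleType
from typing import Iterable, Optional


def load_module_from_candidates(
    name: str, filename: str, directories: Iterable[Path]
) -> Optional[ModuleType]:
    """Load `filename` from the first directory that contains it.

    Returns None if no candidate directory has the file, so the caller
    can decide whether a missing module is fatal.
    """
    for directory in directories:
        path = Path(directory) / filename
        if not path.is_file():
            continue  # try the next candidate location
        spec = importlib.util.spec_from_file_location(name, path)
        if spec is None or spec.loader is None:
            continue
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # runs the module's top-level code
        return module
    return None
```

Calling `load_module_from_candidates("tts_client", "tts_client.py", [Path("workflow/backend"), Path.cwd()])` would try the local layout first and then fall back to the working directory, which is /app/backend in the Docker setup described earlier.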

What's next for Echuu-AI Virtual Streamer

  • Richer persona and memory so the AI can refer to past streams and viewer context.
  • Optional recording/playback and clipping of live segments.
  • More avatar and scene options, and tuning finger/hand mapping for VRM so gestures look more natural.
