Inspiration

We were inspired by using Character AI and wondering how those interactions could feel more immersive. Text chat was engaging, but it still felt one step removed from a real presence. We wanted to build something where the character could not only respond intelligently, but also speak, emote, and exist inside a scene so the interaction felt more alive.

What We Built

We built a real-time anime avatar platform where users can talk naturally with AI characters through voice. The system combines speech-to-text, language generation, text-to-speech, facial expression control, animation playback, and 3D scene rendering so the character can respond in a more embodied way.

How We Built It

We built the project as a full pipeline connecting frontend and backend systems.

On the frontend, we used a 3D web scene with VRM characters, animation control, live transcription UI, and scene-specific interaction logic.

On the backend, we connected STT, LLM, and TTS services into a single conversation flow. We also added session handling and interaction-state logic so speech, playback, and animation stay synchronized.

Challenges

The hardest part was making the pipeline feel reliable in real time. Voice interaction is sensitive to timing, browser audio policies, scene transitions, and state resets. We had to debug issues where STT worked but TTS did not, where live transcription failed after scene changes, and where different interaction modes needed separate behavior without breaking the shared pipeline.

What We Learned

We learned that immersive AI is not just about model output. It depends just as much on systems design, latency control, browser behavior, animation state management, and careful debugging across the entire stack. We also learned how much small interaction details matter when trying to make an AI character feel present rather than just functional.

Built With

  • 11lab
  • codex
  • vscode
Share this project:

Updates