Interactive Real Time 3D Avatar for Web Visitor Assistance

Inspiration

Every website has the same engagement problem: visitors land, scan, and leave. Chatbots tried to fix this, but they're text boxes: invisible, impersonal, and easy to ignore. We asked: what if your website had a person on it? Not a cartoon mascot, but a photorealistic AI that can talk to visitors in real time, see what they're looking at, navigate the page for them, and push interactive content into the conversation. We wanted to obliterate the boundary between "using a website" and "talking to someone who knows the website." The Gemini Live API made this possible with real-time, interruptible voice conversation and multimodal understanding, and we realized we could give that intelligence a body.

What it does

Nyx is an embeddable chat widget that adds a photorealistic 3D avatar to any website with a single <script> tag. The avatar holds natural, real-time voice conversations powered by the Gemini Live API — users talk, interrupt, and ask questions just like speaking to a human.

What makes Nyx different from a voice chatbot:

She has a face. A 3D Gaussian Splatting avatar with 52 ARKit blendshapes synchronized to speech at 30 FPS — real lip-sync, real eye blinks, real expressions.
She can see the page. When a user says "what am I looking at?", Nyx captures the viewport and sends it to Gemini as visual context, with no DOM integration or API work required from the host site.
She can navigate the site. She scrolls to sections, highlights content, and guides the user through the page by voice.
She can act. Nyx pushes interactive rich content (link cards, data tables, product carousels) directly into the chat, triggers visual effects, and executes any custom client action the host site registers.
She knows your domain. A pluggable knowledge base (local file or URL) grounds the avatar in your business context — products, FAQs, policies — so she speaks with authority, not hallucinations.

The result: visitors don't read your website. They talk to it.

How we built it

The system is two open-source components:

Backend — Avatar Chat Server (Python / FastAPI)

Receives user audio over WebSocket and forwards it to the Gemini Live API via the Google GenAI SDK for real-time, full-duplex voice conversation.
Runs a custom Wav2Arkit ONNX model on CPU that converts the AI's audio response into 52 ARKit-compatible facial blendshapes in real time.
Pairs each audio chunk with its corresponding blendshape frame and streams the pairs as sync_frame packets at 30 FPS, so audio and facial animation always arrive together and lip-sync holds up under network jitter.
Handles tool calls from Gemini (request_screen_context, send_rich_content, navigate_to_section) by forwarding them as trigger_action events to the client.
Supports pluggable agents via an abstract BaseAgent interface, so you can swap between Gemini, OpenAI, or a custom remote agent with a single environment variable.
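
To make the sync_frame pairing concrete, here is a minimal Python sketch, not the server's actual code. It assumes 16-bit mono PCM at 24 kHz, so each 1/30 s slice is 1600 bytes; the SyncFrame fields and the pair_sync_frames helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence

SAMPLE_RATE = 24_000   # assumed output sample rate in Hz
BYTES_PER_SAMPLE = 2   # assumed 16-bit mono PCM
FPS = 30               # blendshape frame rate stated in the write-up
BYTES_PER_FRAME = SAMPLE_RATE // FPS * BYTES_PER_SAMPLE  # 1600 bytes = 1/30 s


@dataclass
class SyncFrame:
    """Hypothetical packet layout: one audio slice plus its blendshape frame."""
    index: int                # frame number within the current response
    audio: bytes              # exactly 1/30 s of PCM audio
    blendshapes: List[float]  # 52 ARKit weights for that slice


def pair_sync_frames(
    pcm: bytes, blendshapes: Sequence[Sequence[float]]
) -> Iterator[SyncFrame]:
    """Yield one packet per complete audio slice that has a matching frame.

    Trailing audio shorter than one slice, or frames without audio,
    are left for the caller to carry over to the next chunk.
    """
    n_frames = min(len(pcm) // BYTES_PER_FRAME, len(blendshapes))
    for i in range(n_frames):
        start = i * BYTES_PER_FRAME
        yield SyncFrame(i, pcm[start:start + BYTES_PER_FRAME], list(blendshapes[i]))
```

Because each packet carries both halves, the client never has to match an audio chunk to an animation frame that arrived on a different schedule.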

Frontend — Avatar Chat Widget (TypeScript)

A fully encapsulated web component using Shadow DOM — zero CSS conflicts with the host page.
Renders the avatar using 3D Gaussian Splatting (@myned-ai/gsplat-flame-avatar-renderer) for photorealistic quality directly in the browser.
A SyncPlayback engine uses audio as the master clock — audio time drives blendshape application, ensuring frame-perfect lip-sync.
Implements the screen capture flow: intercepts request_screen_context tool calls, captures the viewport with html2canvas, and sends the image back to the server as an attachment for Gemini's vision input.
Renders rich content pushed by the AI (link cards, tables, forms) via a registry of pluggable renderers.
Lazy-loads the entire 3D engine in the background — initial page load impact is near zero.
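
The audio-as-master-clock idea behind SyncPlayback can be modeled in a few lines. The sketch below is a Python illustration only (the widget itself is TypeScript), and SyncPlaybackSketch with its methods is a hypothetical name. The audio_time argument stands in for the browser's AudioContext.currentTime: on every render tick, the frame to display is derived from how much audio has actually played.

```python
from typing import List, Optional, Sequence


class SyncPlaybackSketch:
    """Toy model: the audio clock, not the wall clock, selects the frame."""

    def __init__(self, fps: int = 30) -> None:
        self.fps = fps
        self.frames: List[Sequence[float]] = []
        self.start_time: Optional[float] = None

    def enqueue(self, blendshapes: Sequence[float]) -> None:
        """Store the blendshape frame that arrived with an audio slice."""
        self.frames.append(blendshapes)

    def start(self, audio_time: float) -> None:
        """Record the audio-clock time at which playback begins."""
        self.start_time = audio_time

    def frame_at(self, audio_time: float) -> Optional[Sequence[float]]:
        """Return the frame matching the current audio position, if any."""
        if self.start_time is None:
            return None
        elapsed = audio_time - self.start_time
        if elapsed < 0:
            return None
        index = int(elapsed * self.fps)
        if index >= len(self.frames):
            return None  # audio has run past the buffered frames
        return self.frames[index]
```

If rendering stalls for a few ticks, the next call to frame_at simply skips ahead to the frame that matches the audio position, so the face catches up instead of lagging behind the voice.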

Infrastructure

Dockerized with multi-stage builds and deployed to Google Cloud Run. Infrastructure-as-code (IaC) templates are included for automated deployment.

Challenges we ran into

Lip-sync timing across network jitter. Streaming audio and blendshapes separately caused desynchronization. We solved this by pairing them server-side into atomic sync_frame packets and using audio playback time as the single source of truth on the client — the SyncPlayback engine schedules blendshape application relative to AudioContext.currentTime, not wall-clock time.
Gemini Live API interruption handling. When the user interrupts mid-response, Gemini cancels its output, but audio already buffered on the client keeps playing. We implemented a two-phase interruption: first the server sends an interrupt event with a cutoff timestamp; then the client stops audio playback immediately, truncates the transcript to match, and resets the avatar to idle, all within a single render frame.
Deferred tool execution for screen context. When Gemini calls request_screen_context, it expects an image back before continuing, but the round trip (server → client capture → server → Gemini) takes time. We solved this by having the Gemini agent cancel its current response, store the pending tool call ID, and resume only after the screenshot arrives as an attachment. This keeps the AI silent, so it doesn't hallucinate while waiting.
3D rendering performance in a widget. Gaussian Splatting is GPU-intensive. We added lazy loading (the 3D engine only initializes when the widget opens), visibility-based throttling, and pre-allocated object pools to keep the frame budget under control without impacting the host page.
VAD sensitivity tuning. Gemini's voice activity detection sometimes triggered on avatar audio playback (echo). We exposed configurable VAD start/end sensitivity parameters and implemented per-session dynamic adjustment.
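
The deferred tool execution for screen context can be sketched with an asyncio future keyed by tool call ID. This is a minimal illustration under assumptions, not the project's actual agent code: PendingToolCalls, its method names, and the 10-second timeout are all hypothetical.

```python
import asyncio
from typing import Dict


class PendingToolCalls:
    """Park a tool call until the client delivers its result."""

    def __init__(self) -> None:
        self._pending: Dict[str, asyncio.Future] = {}

    async def wait_for_result(self, call_id: str, timeout: float = 10.0) -> bytes:
        """Store the call ID and suspend until resolve() is called for it."""
        future = asyncio.get_running_loop().create_future()
        self._pending[call_id] = future
        try:
            return await asyncio.wait_for(future, timeout)
        finally:
            self._pending.pop(call_id, None)

    def resolve(self, call_id: str, image: bytes) -> bool:
        """Hand the screenshot to the waiting call; False if nothing is waiting."""
        future = self._pending.get(call_id)
        if future is None or future.done():
            return False
        future.set_result(image)
        return True


async def demo() -> bytes:
    calls = PendingToolCalls()
    # The agent side parks the request_screen_context call.
    waiter = asyncio.create_task(calls.wait_for_result("call-1"))
    await asyncio.sleep(0)  # let the waiter register its future
    # The WebSocket handler receives the screenshot attachment later.
    calls.resolve("call-1", b"fake-png-bytes")
    return await waiter
```

The timeout matters in practice: if the client never returns a screenshot, the agent should fail the tool call rather than stay silent forever.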

Accomplishments that we're proud of

One script tag, full embodied AI. Any website — WordPress, Wix, a static HTML page — gets a photorealistic avatar concierge with zero framework dependencies.
Sub-100ms perceived lip-sync latency. The server-paired sync_frame protocol with client-side audio-clock-driven blendshape scheduling delivers lip-sync that feels instant.
True multimodal loop. The agent sees (screen capture → Gemini vision), speaks (Gemini Live → audio), animates (Wav2Arkit → blendshapes), and acts (tool calls → client actions) — all in a single, fluid interaction. This isn't bolted-on multimodality; it's a closed loop.
The screen context flow. Asking "what's on my screen?" and having a 3D avatar look at your page, understand it visually, and talk you through it — without any DOM integration from the host site — still feels like magic.
Fully open-source and production-grade. JWT auth, rate limiting, CORS, health checks, structured logging, Docker, IaC — this isn't a demo. It's deployable.

What we learned

Audio is the master clock. Every attempt to synchronize lip-sync using server timestamps or frame counters failed under real network conditions. The only reliable approach is making the client's AudioContext the single source of truth and deriving all visual timing from it.
The Gemini Live API is remarkably capable but requires careful orchestration. Interruption handling, tool call deferral, and VAD configuration aren't documented use cases, so we had to reverse-engineer the behavior through extensive testing.
Gaussian Splatting is ready for production. With proper lazy loading and visibility management, photorealistic neural rendering works in an embeddable widget without destroying page performance. The visual leap over mesh-based avatars is worth the engineering effort.
The "website as agent body" framing changes everything. Once the agent can see the page and trigger client actions, it stops being a chatbot and starts being a concierge. This shifts the product conversation from "add chat to your site" to "give your site a presence."

What's next for Interactive Real Time 3D Avatar for Web Visitor Assistance

Agentic browsing. Extend beyond single-page context — let the avatar navigate across pages, fill forms, and complete multi-step workflows on behalf of the user.
Personalized avatars. Allow businesses to create custom avatar identities from a single photo using Gaussian Splatting reconstruction — your brand, your face.
Emotion-aware responses. Use Gemini's audio understanding to detect user sentiment (frustration, confusion, excitement) and adjust the avatar's expression and tone dynamically.
Analytics dashboard. Track what visitors ask, where they get stuck, and which actions the avatar triggers — turning conversation data into product insights.
Multi-language support. Leverage Gemini's multilingual capabilities to serve a single avatar that speaks to visitors in their native language.
ADK migration. Transition from the raw GenAI SDK to Google's Agent Development Kit for more sophisticated multi-agent orchestration and built-in grounding.