Relio

Inspiration

We kept coming back to the same frustration: the physical world is 3D, but selling it online is still flat.

E-commerce return rates hover around 20–30%, and research consistently shows that the number one reason people return products is that the item "looked different than expected." Meanwhile, companies like IKEA and Shopify have proven that 3D product views and AR previews reduce returns by up to 40% and increase conversions by 94%. The technology works — but creating 3D assets is still painfully expensive and slow. A single product scan through a professional service costs $200–500 and takes days.

We asked ourselves: what if anyone with a phone could create a production-ready 3D digital twin just by talking to their camera?

That question led us to Gemini's Live API. The moment we saw real-time voice + vision working together — an AI that can simultaneously see what you're showing it and talk you through a process — we realized this wasn't just a chatbot problem. This was an entirely new interaction paradigm. You don't type commands. You don't click buttons. You hold up an object, have a conversation, and a digital twin materializes.

We were also inspired by the broader potential beyond e-commerce. Industrial facilities spend six figures on digital twin surveys. Real estate agents need virtual tours. Urban planners need 3D maps. The same core technology — voice-guided spatial capture — applies to all of them. Relio started as an e-commerce tool but was designed with a platform mindset from day one.


What it does

Relio is a voice-first AI agent that turns physical objects and spaces into 3D digital twins through natural conversation.

The core experience works like this:

  1. You speak. Open Relio, point your camera at an object, and say "I want to scan this." No menus, no tutorials, no setup.
  2. Relio sees and guides. The agent analyzes your camera feed in real-time and gives spoken instructions: "I can see the front and left side — slowly rotate to show me the back." It tracks coverage, evaluates lighting, detects blur, and tells you exactly what it still needs.
  3. You have a conversation. Interrupt anytime: "How much more do I need?" or "The lighting is bad here, should I move?" Relio responds naturally and picks up where it left off. It handles barge-in gracefully — no awkward pauses or lost context.
  4. A 3D model builds in real-time. As you scan, a point cloud assembles on screen. When coverage is sufficient, Relio says "Looking great — I have enough to build your model. Want me to start processing?"
  5. One voice command exports everywhere. Say "Create a Shopify listing" and Relio uploads the GLB model with AR preview. Say "Make a product video" and Veo 3.1 generates a cinematic showcase. Say "Generate marketing images" and Nano Banana creates lifestyle photography in multiple styles.

The entire flow — from physical object to live e-commerce listing with 3D viewer, product video, and marketing images — happens in a single conversation.
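The coverage tracking behind step 2 can be sketched as a toy function. Assuming each accepted frame is tagged with an estimated camera azimuth, the agent can name which slices of the circle it has not yet seen. The function name and the flat eight-sector model are illustrative only; real coverage would also account for elevation and frame overlap.

```python
def missing_sectors(seen_azimuths_deg, sectors=8):
    """Return which of `sectors` equal azimuth slices have no frames yet.

    Toy version of the coverage tracking the agent narrates ("show me the
    back"); a hypothetical sketch, not Relio's actual implementation.
    """
    size = 360 / sectors
    covered = {int(a % 360 // size) for a in seen_azimuths_deg}
    return sorted(set(range(sectors)) - covered)
```

With four sectors, frames at 0°, 50°, 100°, and 170° cover the front half, so the agent would ask for the two back sectors next.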

Relio supports three scanning modes:

  • Product Scan — Optimized for e-commerce objects. Outputs Shopify-ready GLB/USDZ with AR Quick Look support.
  • Space Scan — For rooms, properties, and interiors. Generates navigable 3D environments.
  • Industrial Scan — For facilities and equipment. Produces annotated digital twins with measurement data.

How we built it

Relio is a four-service architecture connected by three real-time protocols, all orchestrated so the interaction has no perceptible latency.

Frontend — relio-web (Next.js 15 + Bun + TypeScript + Three.js)

The browser captures camera and microphone via WebRTC and streams media to the backend with minimal latency. Three.js renders the 3D model progressively as point cloud data arrives over WebSockets. We used CSS Modules for styling with a custom dark theme built around the Geist variable font. Every icon is a hand-crafted SVG component — no icon libraries. The UI is designed to disappear: during active scanning, the interface is just the camera feed, a progress ring, and the agent's spoken guidance. No buttons needed.

Backend — relio-back (Rust + Axum + Tokio)

The Rust backend serves three roles: WebSocket server for real-time client events, WebRTC signaling server for media negotiation, and gRPC client for communicating with the AI service. We chose Rust for its async performance — it handles concurrent WebSocket connections and media frame routing with microsecond overhead. The backend maintains session state, coordinates frame delivery to the AI service at 1–3 FPS (matching Gemini's processing rate), and relays scanning guidance back to the client in real-time.

Database — relio-back-db (Rust + PostgreSQL)

A dedicated data service manages users, scanning sessions, captured frames, generated assets, and export jobs. SQLx provides compile-time checked queries. The schema tracks the full lifecycle: from initial scan through 3D reconstruction to final Shopify product listing.

AI Service — relio-ai (Python + Google ADK + GenAI SDK)

This is where the magic happens. The AI service runs a multi-agent ADK architecture:

  • Coordinator Agent — The main personality. Uses the Gemini Live API (gemini-2.5-flash-native-audio-preview-12-2025) for bidirectional voice + vision streaming. It maintains the conversational flow, interprets camera frames, provides scanning guidance, and dispatches to sub-agents for specialized tasks.
  • Scan Analyst — A tool-calling sub-agent that evaluates frame coverage, lighting quality, and object dimensions. It determines when enough data has been captured and identifies which angles are still missing.
  • Export Agent — Handles post-processing exports: Shopify upload via GraphQL Admin API, Veo 3.1 video generation for product showcases, and Nano Banana (Gemini 3 Pro Image) for marketing asset creation.

The 3D reconstruction pipeline uses Open3D for point cloud processing (voxel downsampling, statistical outlier removal, normal estimation) and Poisson surface reconstruction for mesh generation. Trimesh handles GLB export. The pipeline runs server-side on Cloud Run with 4 vCPUs and 4GB RAM.
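To make the first pipeline stage concrete, here is a pure-Python illustration of voxel downsampling: every point falling in the same grid cell is collapsed into the cell's centroid. The real pipeline does this (and the outlier removal, normal estimation, and Poisson steps) through Open3D; this standalone version just shows the idea.

```python
import math
from collections import defaultdict

def voxel_downsample(points, voxel_size):
    """Collapse all points that fall in the same voxel into their centroid.

    Illustrative stand-in for the Open3D voxel-downsampling step.
    """
    buckets = defaultdict(list)
    for x, y, z in points:
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        buckets[key].append((x, y, z))
    # One averaged point per occupied voxel.
    return [
        tuple(sum(coord) / len(pts) for coord in zip(*pts))
        for pts in buckets.values()
    ]
```

Downsampling first keeps the later, more expensive stages (outlier removal, Poisson reconstruction) within the 4 vCPU / 4GB Cloud Run budget.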

Communication protocols — why three?

  • WebRTC (client ↔ backend): Camera and microphone streams need the lowest possible latency. WebRTC gives us sub-100ms media delivery with SRTP encryption and automatic quality adaptation.
  • WebSockets (client ↔ backend): Reliable, ordered delivery of control events — session state, progress updates, scanning guidance text, UI commands.
  • gRPC (backend ↔ AI): Strongly typed, streaming-native, efficient binary serialization. Perfect for service-to-service communication where we're shuttling video frames and audio chunks at high throughput.

Gemini integration details:

We use the Live API's native audio model with the "Sage" voice for Relio's persona — warm and encouraging, like a patient photography instructor. Google Search grounding is enabled so the agent can provide accurate product category information and material descriptions. Asynchronous function calling (behavior: NON_BLOCKING) ensures that tool calls (like checking coverage or triggering exports) don't interrupt the voice stream. The Live API's built-in voice activity detection and barge-in handling let users interrupt mid-sentence without any custom logic.
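The non-blocking tool setup looks roughly like the following dict-style declaration. The tool names here (`check_coverage`, `start_export`) are hypothetical, and the field names are assumptions based on the google-genai SDK's dict-style tool declarations; treat this as a sketch rather than Relio's actual config.

```python
# Illustrative Live API tool declarations; names are hypothetical and field
# layout is an assumption, not verbatim from Relio's codebase.
scan_tools = {
    "function_declarations": [
        {
            "name": "check_coverage",   # hypothetical tool name
            "description": "Report which viewing angles are still missing.",
            "behavior": "NON_BLOCKING", # don't stall the voice stream
        },
        {
            "name": "start_export",     # hypothetical tool name
            "description": "Kick off Shopify/Veo/Nano Banana export jobs.",
            "behavior": "NON_BLOCKING",
        },
    ]
}
```

The point of `NON_BLOCKING` is that the model keeps talking while the tool runs, which is what lets coverage checks and export triggers happen mid-conversation.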

Infrastructure:

Everything deploys to Google Cloud Run via Terraform (infrastructure-as-code) and a single deploy.sh script. Cloud Storage hosts generated 3D models, images, and videos. Each service has its own Dockerfile and docker-compose for local development, plus a root docker-compose that orchestrates everything.


Challenges we ran into

LiDAR is locked behind native iOS — no browser access.

Our original vision included iPhone LiDAR for depth-enhanced scanning. We quickly discovered that Safari provides zero JavaScript API access to LiDAR data — Apple restricts it to ARKit in native Swift apps. We pivoted to a camera-only photogrammetry approach for the web demo, which works surprisingly well with Gemini's vision guiding users to capture optimal angles. The LiDAR integration remains in the architecture for a future native app.

Gemini Live API processes video at 1 frame per second.

We initially tried streaming at 15+ FPS and immediately hit rate limits. The solution was client-side intelligence: we capture at 2 FPS, run local blur detection and motion analysis, and only send the sharpest, most informative frames. This actually improved quality — fewer frames, but each one is useful.
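The selection logic reduces to "keep the sharpest frame per time window." A minimal sketch (the real code runs in the browser; `sharpness` stands in for a blur metric such as the variance of the Laplacian):

```python
def select_keyframes(frames, window=0.5):
    """From (timestamp, sharpness) pairs, keep the sharpest frame per window.

    Illustrative sketch of the client-side frame selection described above.
    """
    best = {}
    for ts, sharpness in frames:
        bucket = int(ts // window)  # which time window this frame falls in
        if bucket not in best or sharpness > best[bucket][1]:
            best[bucket] = (ts, sharpness)
    return [best[b] for b in sorted(best)]
```

Capturing at 2 FPS and keeping one winner per window is what brings the stream down to the rate Gemini can actually consume, while biasing toward the most informative frames.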

Coordinating three real-time protocols without perceptible lag.

Having WebRTC, WebSocket, and gRPC all running simultaneously created synchronization challenges. Audio from the agent (arriving via gRPC → WebSocket) needed to align with visual guidance overlays. We solved this with a unified session clock and event sequencing in the Rust backend — every message carries a timestamp, and the frontend reconciles based on the session timeline.
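The reconciliation step amounts to merging per-channel event streams by their session timestamp. A Python sketch of the idea (the real sequencing lives in the Rust backend; names are illustrative):

```python
import heapq

def reconcile(*channels):
    """Merge per-channel event streams into one session timeline.

    Each channel is a list of (session_ts, payload) tuples already sorted by
    timestamp; the frontend replays the merged stream in order. Hypothetical
    sketch of the sequencing idea, not the Rust implementation.
    """
    return list(heapq.merge(*channels, key=lambda event: event[0]))
```

Because every message carries the shared session clock, agent audio and guidance overlays that took different paths (gRPC vs. WebSocket) still land in the right order on screen.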

3D reconstruction quality from phone cameras.

Professional photogrammetry uses calibrated camera rigs. We have a handheld phone with an unknown lens. The key insight was using Gemini itself as a quality gate: the agent evaluates each frame for blur, occlusion, and overlap before adding it to the reconstruction queue. Bad frames get rejected with spoken feedback: "That was a bit blurry — hold steady and I'll try again."

Maintaining conversational flow during computationally expensive operations.

When the user says "process my model," the 3D reconstruction can take 30–60 seconds. Dead silence would break the experience. We solved this by having the Coordinator Agent chat naturally during processing — asking about export preferences, explaining what's happening, or even commenting on the object: "That's a really nice design on the handle, by the way. While I'm building the model, do you want me to set up a Shopify listing?"

Veo 3.1 video generation is asynchronous and slow.

Video generation takes several minutes. We couldn't block the user experience, so we implemented a job queue: the user requests a video, gets an immediate acknowledgment, and receives a WebSocket notification when it's ready. The agent says "I've started generating your product video — I'll let you know when it's done. Usually takes about 2 minutes."
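The acknowledge-now, notify-later pattern can be sketched with a minimal job queue. This is an illustrative stand-in: the real service pushes the completion notification over the session WebSocket rather than calling a local callback.

```python
import threading
import uuid

class JobQueue:
    """Fire-and-forget jobs with a completion callback.

    Hypothetical sketch of the async-export pattern described above.
    """

    def __init__(self, notify):
        self.notify = notify  # called as notify(job_id, result) when done

    def submit(self, work, *args):
        job_id = str(uuid.uuid4())

        def run():
            self.notify(job_id, work(*args))

        threading.Thread(target=run, daemon=True).start()
        return job_id  # immediate acknowledgment to the caller
```

The caller gets a job id back right away (the agent's "I've started generating your product video"), and the notification fires whenever Veo finishes, minutes later.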


Accomplishments that we're proud of

Zero-text interaction. From launch to exported Shopify listing, the user never types a single character. Everything is voice and camera. This isn't a chatbot with a camera bolted on — it's a fundamentally new interaction paradigm where the AI sees, speaks, and acts in real-time.

The "scanning guidance" experience. Watching Relio guide someone through a scan feels like having an expert photographer directing you. It notices things: "You're getting some glare from that window — shift about a foot to your left." When it works, people forget they're talking to an AI.

End-to-end pipeline completeness. Most hackathon projects demo one capability. Relio delivers a complete workflow: scan → 3D model → Shopify listing → product video → marketing images. Each step flows naturally from the conversation.

Sub-200ms voice response latency. The combination of WebRTC for media transport and Gemini's native audio model means Relio responds almost instantly. Interruptions feel natural — you can talk over it mid-sentence and it adapts, just like a human conversation partner.

Production-grade infrastructure. Terraform IaC, Docker containers for every service, automated Cloud Run deployment, proper environment variable management, and a clean separation of concerns across four repositories. This isn't demo-ware — it's architected to scale.


What we learned

Voice-first design requires rethinking everything. UI patterns we take for granted — progress bars, error messages, confirmation dialogs — all need voice equivalents. "Your scan is 73% complete" becomes "We're about three-quarters done — just need the top and bottom." Every state transition needs a spoken narrative.
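As a toy example of that mapping, a percentage-to-phrase table looks like this. Relio's real phrasing comes from the Coordinator Agent's prompt rather than a lookup table; this only illustrates the translation every UI state needs.

```python
def speak_progress(pct):
    """Map a numeric completion percentage to a spoken phrase.

    Toy illustration of voice-first state narration; thresholds and
    wording are invented for this example.
    """
    if pct >= 95:
        return "We're just about done."
    if pct >= 70:
        return "We're about three-quarters done."
    if pct >= 45:
        return "We're roughly halfway there."
    if pct >= 20:
        return "We're off to a good start."
    return "Just getting started."
```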

The Gemini Live API is more capable than we expected. Affective dialog, barge-in handling, voice activity detection, and real-time vision analysis all work together remarkably well. The native audio model produces voice that sounds genuinely warm and natural — not robotic. The hardest part was writing system prompts that kept responses short enough for real-time interaction.

ADK's multi-agent architecture is the right abstraction. Having specialized sub-agents for scanning analysis, 3D processing, and export handling keeps the code clean and the conversation natural. The Coordinator Agent doesn't need to know how Shopify's GraphQL API works — it just calls the export tool and narrates the result.

Rust + Python is a powerful hackathon combo. Rust handles the high-throughput, low-latency networking layer (WebSockets, WebRTC signaling, gRPC routing) where performance matters. Python handles the AI orchestration, 3D processing, and API integrations where ecosystem richness matters. gRPC bridges them cleanly.

Grounding matters more than we thought. When Relio describes a scanned object — "this looks like a ceramic mug, probably handmade" — grounding with Google Search ensures the description is accurate rather than hallucinated. This becomes critical when auto-generating Shopify product descriptions.

The demo video is not an afterthought. We spent nearly 30% of our time on the demo, and it was worth it. Judges may never run the code, but they will watch the video. Every second needs to earn its place.


What's next for Relio

Native iOS app with LiDAR integration. The web demo proves the concept; a native app with ARKit LiDAR will dramatically improve scan quality with true depth data, producing sub-millimeter accurate digital twins.

Real-time collaborative scanning. Multiple people with phones scanning the same large space simultaneously, with Relio coordinating coverage: "Alex, you take the north wall. Sam, I need you on the ceiling."

Industrial digital twin platform. Expand from product scanning to full facility documentation — factories, warehouses, construction sites. Integration with BIM (Building Information Modeling) systems and safety compliance databases.

Shopify app marketplace. Package the e-commerce pipeline as a Shopify app that merchants install directly. One-tap 3D listing creation from their phone.

AR commerce layer. Use the generated 3D models to power AR try-before-you-buy experiences. Scan a product in-store → instantly see it in your home via AR → purchase with one voice command.

Multi-language scanning guidance. Gemini's Live API supports 70+ languages. Relio should guide scanning in any language with automatic detection — a French merchant says "numérise ce produit" ("scan this product") and the entire experience switches to French seamlessly.

The vision is simple: every physical object should have a digital twin, and creating one should be as easy as having a conversation. Relio is the first step.
