IRIS - Interactive Realtime Intelligence System
Inspiration
IRIS was inspired by a simple but powerful vision: creating an assistant that doesn’t just respond to commands, but truly understands the user’s environment in real time.
With the rise of multimodal AI, we wanted to break the barrier between voice interaction and visual context. The goal was to build an assistant capable of "seeing" what the user sees, processing that information instantly, and engaging in a natural, flowing conversation. IRIS sits at the intersection of real-time communication, multimodal reasoning, and scalable cloud infrastructure, which we see as a blueprint for the next generation of human-AI interaction.
What it does
IRIS is a high-performance multimodal voice assistant that bridges the gap between sight and sound. By combining live screen sharing with real-time audio streaming, IRIS can:
- Analyze context: Understand code, UI designs, or documents shared on-screen.
- Engage naturally: Communicate via warm, low-latency voice responses.
- Reason step-by-step: Use Gemini’s multimodal power to solve complex problems with the user.
Whether it’s assisting a developer through a debugging session or providing real-time guidance on a presentation, IRIS acts as an intelligent partner that witnesses and understands your work.
How we built it
IRIS uses a modular three-tier architecture (agent, backend, frontend) running on LiveKit Cloud, so each tier can be deployed and scaled independently:
🧠 AI Agent (The Brain)
- Developed with LiveKit Agents SDK.
- Gemini 3 Flash: Orchestrates multimodal reasoning and vision.
- Deepgram & Cartesia: Provide fast speech-to-text and high-quality voice synthesis.
- Containerized with Docker for parity between development and production environments.
⚙️ Backend (The Orchestrator)
- FastAPI server deployed on Render.
- Handles session management, secure token generation, and room configuration.
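As a rough illustration of the secure token generation step, the sketch below mints and verifies a short-lived, HMAC-signed room token using only the Python standard library. It is not the project's actual code: the function names `mint_room_token` and `verify_room_token` are hypothetical, and the claim layout (`iss`, `sub`, `exp`, and a `video` grant with `room` and `roomJoin`) reflects our understanding of how LiveKit-style access tokens are shaped, so treat it as an assumption. In a real deployment the LiveKit server SDK would normally generate the token.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _sign(signing_input: str, api_secret: str) -> str:
    """HMAC-SHA256 signature over the header and claims segments."""
    digest = hmac.new(api_secret.encode(), signing_input.encode("ascii"), hashlib.sha256).digest()
    return _b64url(digest)


def mint_room_token(api_key, api_secret, identity, room, ttl_seconds=600, now=None):
    """Hypothetical helper: build an HS256 JWT that lets `identity` join `room`."""
    issued = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "iss": api_key,                  # which API key issued the token
        "sub": identity,                 # participant identity
        "nbf": issued,
        "exp": issued + ttl_seconds,     # short lifetime limits the impact of a leak
        "video": {"room": room, "roomJoin": True},  # assumed grant layout
    }
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, claims)
    )
    return f"{signing_input}.{_sign(signing_input, api_secret)}"


def verify_room_token(token, api_secret, now=None):
    """Check the signature and expiry, then return the decoded claims."""
    signing_input, _, signature = token.rpartition(".")
    if not hmac.compare_digest(signature, _sign(signing_input, api_secret)):
        raise ValueError("bad signature")
    payload = signing_input.split(".")[1]
    padded = payload + "=" * (-len(payload) % 4)  # restore stripped padding
    claims = json.loads(base64.urlsafe_b64decode(padded))
    current = time.time() if now is None else now
    if current >= claims["exp"]:
        raise ValueError("token expired")
    return claims
```

The point of the design is that the API secret stays on the backend: the frontend only receives a signed, expiring token scoped to a single room.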
💻 Frontend (The Interface)
- React + Vite for a fast, polished UX.
- Custom-built audio visualizers and seamless screen-sharing integration.
- Deployed on Vercel.
🌐 Infrastructure
- LiveKit Cloud for global real-time communication and managed agent hosting.
Challenges we ran into
The primary challenge was orchestrating the multimodal pipeline. Synchronizing a live video stream (screen share) with a voice conversation while maintaining low latency required extensive experimentation with frame buffering and STT/TTS synchronization.
Additionally, the documentation for real-time AI frameworks is cutting-edge and sometimes non-linear, so we had to build parts of the implementation from scratch through iterative testing and architecture redesigns.
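To make the frame buffering idea concrete, here is a minimal sketch of one common approach: keep only the newest screen-share frame, release at most one frame per interval, and drop frames that are too old to match the current conversation. This is an assumption about the general technique, not a description of the buffering IRIS actually ships; the `Frame` and `LatestFrameBuffer` names and the interval and age values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Frame:
    timestamp: float  # capture time in seconds
    data: bytes       # encoded image payload


class LatestFrameBuffer:
    """Holds only the newest frame and releases at most one per interval."""

    def __init__(self, min_interval: float = 1.0, max_age: float = 2.0):
        self.min_interval = min_interval  # minimum seconds between released frames
        self.max_age = max_age            # frames older than this are discarded
        self._latest: Optional[Frame] = None
        self._last_emit: Optional[float] = None

    def push(self, frame: Frame) -> None:
        # Overwrite rather than queue: a backlog of old frames only adds latency.
        self._latest = frame

    def take(self, now: float) -> Optional[Frame]:
        frame = self._latest
        if frame is None:
            return None
        if now - frame.timestamp > self.max_age:
            # Too stale to describe what the user is currently looking at.
            self._latest = None
            return None
        if self._last_emit is not None and now - self._last_emit < self.min_interval:
            # Throttled: keep the frame so a later call can still use it.
            return None
        self._latest = None
        self._last_emit = now
        return frame
```

A capture loop would call `push` for every incoming frame, while the agent calls `take` whenever it is about to send context to the model. Overwriting instead of queueing trades completeness for freshness, which suits a live conversation.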
Accomplishments that we're proud of
- Seamless Multimodality: Building an assistant that reacts to visual cues during a conversation.
- Scalable Architecture: Successfully separating the frontend, backend, and agent into a robust 3-tier system.
- Zero-Latency Feel: Optimizing the pipeline for near-instant responses.
- Production-Ready Deployment: Orchestrating a fully working cloud-based agent on LiveKit Cloud.
What we learned
This project significantly deepened our expertise in real-time AI engineering. We learned how to manage high-bandwidth multimodal data, the intricacies of agentic system design, and the critical importance of a decoupled architecture for cloud-based AI. We realized that in the world of real-time AI, stability and latency management are just as important as the LLM's intelligence.
What's next for IRIS
IRIS is just the beginning. Our roadmap includes:
- Dynamic Context Injection: Allowing users to customize system prompts through the UI.
- Persistent Memory: Session-to-session memory for long-term project support.
- Multimodal RAG: Connecting IRIS to private knowledge bases for enterprise-level visual support.
- Interactive Action Layer: Enabling IRIS to take actions within the user's environment.
Built With
- cartesia
- deepgram
- docker
- fastapi
- google-gemini-3
- livekit
- python
- react
- render
- typescript
- vercel
- vite