About the Project
This project was inspired by the recent wave of real-time interactive AI systems, especially demos like Thinking Machines Lab’s live agents. I wanted to explore what it would feel like if an AI assistant were not just a chatbot waiting for turns, but a responsive multimodal partner that can listen, see, reason, and react in a low-latency loop.
The core idea is to build a real-time interactive agent that combines audio, vision, and language reasoning. Instead of treating voice input, visual context, and agent planning as separate steps, the system maintains a shared context where different streams can update the agent continuously. This makes the interaction feel more natural: the user can interrupt, ask follow-up questions, point to visual information, or change goals while the agent is still working.
What I Learned
The most important lesson was that real-time AI is not only a model problem. Even with strong models, the experience depends heavily on system design: streaming, latency control, memory management, event scheduling, and interruption handling.
I also learned that multimodal interaction needs careful coordination. Audio and vision can often be processed in parallel, but the agent still needs a unified state representation. A useful abstraction is to think of the interaction state as:
$$ S_t = f(S_{t-1}, A_t, V_t, U_t) $$
where:
S_t represents the shared context at time t
A_t represents the audio input stream
V_t represents the visual input stream
U_t represents the user’s explicit instruction or interaction event
This helped me think about the system as an event-driven architecture rather than a simple request-response chatbot.
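To make the state update concrete, here is a minimal Python sketch of one possible f. The `State` fields and the `update` function are illustrative assumptions, not the project's actual data model. Each call produces a new immutable state from the previous one, and any stream with nothing new at time t is passed as `None`.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class State:
    """Shared context S_t (field names are illustrative assumptions)."""
    transcript: tuple = ()          # accumulated audio-derived text
    scene: Optional[str] = None     # latest visual description
    goal: Optional[str] = None      # latest explicit user instruction
    step: int = 0                   # time index t


def update(prev: State, audio: Optional[str] = None,
           vision: Optional[str] = None,
           user: Optional[str] = None) -> State:
    """One application of S_t = f(S_{t-1}, A_t, V_t, U_t)."""
    return replace(
        prev,
        # Audio is appended; vision and user input overwrite the latest value.
        transcript=prev.transcript + ((audio,) if audio is not None else ()),
        scene=vision if vision is not None else prev.scene,
        goal=user if user is not None else prev.goal,
        step=prev.step + 1,
    )


s0 = State()
s1 = update(s0, audio="what is on the table?")
s2 = update(s1, vision="a red mug next to a laptop")
s3 = update(s2, user="describe the mug")
```

Keeping the state immutable is a design choice in this sketch: every event yields a fresh snapshot, so the reasoning loop can read one version while the input streams are already producing the next.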
How I Built It
I built the project around a streaming architecture with three main components:
Streaming input layer
This layer receives user audio, visual frames, and interaction events. Audio and vision are processed independently so that the system can react quickly without waiting for every modality to finish.
Shared context and event queue
All intermediate results are written into a shared context. An event queue manages updates such as user speech, visual changes, interruptions, tool calls, and agent responses.
Agent reasoning loop
The agent reads from the shared context, decides what action to take next, and streams responses back to the user. The goal is to keep the loop fast enough for natural interaction while still allowing deeper reasoning when needed.
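The interplay between the event queue and the reasoning loop can be sketched with Python's `asyncio`. This is a simplified illustration under stated assumptions, not the project's code: `stream_response` is a stand-in for a streaming model call, the event kinds (`"speech"`, `"vision"`, `"interrupt"`, `"stop"`) are made-up names, and the shared context is a plain dict. The point it shows is interruption handling: new user speech cancels the response that is still streaming.

```python
import asyncio


async def stream_response(prompt: str, out: list) -> None:
    """Stand-in for a streaming model call: emits one word at a time."""
    for word in f"answer to {prompt}".split():
        out.append(word)
        await asyncio.sleep(0.05)   # simulated per-token latency


async def agent_loop(events: asyncio.Queue, context: dict, out: list) -> None:
    """Consume events, update the shared context, and react to each one."""
    current = None   # the in-flight response task, if any
    while True:
        kind, payload = await events.get()
        if kind == "stop":
            break
        context[kind] = payload   # write the update into the shared context
        if kind in ("speech", "interrupt"):
            # New user input supersedes whatever is still being streamed.
            if current is not None and not current.done():
                current.cancel()
                out.append("[cancelled]")
            if kind == "speech":
                current = asyncio.create_task(stream_response(payload, out))
    if current is not None:
        # Let the last response finish (or absorb its cancellation).
        await asyncio.gather(current, return_exceptions=True)


async def demo() -> tuple:
    events, context, out = asyncio.Queue(), {}, []
    loop_task = asyncio.create_task(agent_loop(events, context, out))
    await events.put(("speech", "first question"))
    await asyncio.sleep(0.01)     # the first response starts streaming
    await events.put(("vision", "user points at a chart"))
    await events.put(("speech", "second question"))   # user barges in
    await events.put(("stop", None))
    await loop_task
    return context, out
```

Running `asyncio.run(demo())` returns the final context and the streamed words. The first answer is cut off when the second question arrives, and the second answer streams to completion.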
The simplified architecture looks like this:
flowchart TD
A[Audio Stream] --> C[Shared Context]
B[Vision Stream] --> C
D[User Events] --> E[Event Queue]
E --> C
C --> F[Agent Reasoning Loop]
F --> G[Streaming Response]
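
The two direct paths from the audio and vision streams into the shared context can be sketched as independent `asyncio` tasks. The processing functions and their delays below are hypothetical placeholders for real speech and vision models. The sketch shows that a fast modality becomes visible in the context before a slow one finishes, so the agent can already act on partial information.

```python
import asyncio


async def process_audio(context: dict) -> None:
    await asyncio.sleep(0.02)     # simulated fast speech recognition
    context["audio"] = "transcribed speech"


async def process_vision(context: dict) -> None:
    await asyncio.sleep(0.3)      # simulated slower frame analysis
    context["vision"] = "frame description"


async def run_inputs() -> tuple:
    context = {}
    # Each modality runs as its own task and writes to the context when ready.
    tasks = [
        asyncio.create_task(process_audio(context)),
        asyncio.create_task(process_vision(context)),
    ]
    await asyncio.sleep(0.1)      # the agent reads before vision is done
    early = dict(context)         # snapshot: only audio has arrived so far
    await asyncio.gather(*tasks)  # wait for the remaining modality
    return early, context
```

Calling `asyncio.run(run_inputs())` returns an early snapshot that contains only the audio result and a final context that contains both.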
