Inspiration
Despite how powerful computers have become, interacting with them still feels surprisingly inefficient. We constantly jump between tabs, apps, keyboards, and mice just to complete simple tasks. At the same time, most AI assistants can answer questions but rarely help you actually do anything on your computer. We were inspired by the idea of making interaction with computers feel more natural, more like collaborating with a helpful assistant than operating a machine. AIRA was created to explore what happens when an AI can listen, see what's on your screen, understand your intent, and take action instantly. Our goal was to build an assistant that feels less like a chatbot and more like a real partner for getting things done.
What it does
AIRA is a real-time multimodal desktop assistant that turns natural conversation into real actions. Users can interact with AIRA through voice or chat, and the system responds instantly while understanding both spoken commands and on-screen context. With AIRA, users can:

• Control their computer using natural voice commands
• Interrupt conversations naturally without breaking the interaction flow
• Ask AIRA to search the web and filter results intelligently
• Share their screen so AIRA can read and analyze documents
• Receive explanations for assignments or questions directly in chat
• Navigate their screen using simple hand gestures
• Communicate in multiple languages
• Automatically open apps, play media, and execute tasks across different services

AIRA acts as a real AI agent operating directly on the user's desktop, transforming everyday workflows into simple conversations.
How we built it
AIRA is built as a real-time multimodal agent architecture. At its core is the Gemini Live API, which processes continuous audio streams and enables low-latency conversational interaction. The system works in several layers:

• Voice Interaction: User speech is captured and streamed over WebSockets to a FastAPI backend. The audio stream is forwarded to Gemini Live, which processes the input and returns real-time responses as both text and audio.
• Agent Orchestration: An internal AIRA Agent module interprets Gemini's responses and decides which actions to execute. A goal-planning component enables multi-step tasks such as searching, filtering, and launching applications.
• Screen Understanding: When users share their screen, AIRA analyzes the visible content to understand documents, questions, or interfaces and responds with contextual assistance.
• Task Execution: Using Playwright and system automation tools, AIRA performs real actions such as opening applications, navigating browsers, and interacting with web services.
• Frontend Interface: The React frontend provides a live surface for voice interaction, chat responses, and visual feedback such as gesture detection and task progress.
• Deployment: The system is containerized with Docker and deployed on Google Cloud infrastructure for scalability and reliability.
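To give a flavor of the orchestration layer described above, here is a minimal sketch of mapping a model reply to executable steps. The JSON action format, the `plan_actions` helper, and the stubbed tool registry are all hypothetical illustrations, not AIRA's actual code; in the real system the tools would wrap Playwright and desktop automation.

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    name: str
    args: dict

# Hypothetical tool registry; real tools would drive Playwright / the desktop.
TOOLS: dict[str, Callable[..., str]] = {
    "open_app": lambda app: f"opened {app}",
    "web_search": lambda query: f"searched for {query}",
}

def plan_actions(model_reply: str) -> list[Action]:
    """Parse a (hypothetical) JSON action list emitted by the model."""
    payload = json.loads(model_reply)
    return [Action(step["tool"], step["args"]) for step in payload["steps"]]

def execute(actions: list[Action]) -> list[str]:
    """Run each planned step in order, skipping tools we don't know."""
    results = []
    for action in actions:
        tool = TOOLS.get(action.name)
        if tool is None:
            results.append(f"unknown tool: {action.name}")
            continue
        results.append(tool(**action.args))
    return results

reply = ('{"steps": [{"tool": "web_search", "args": {"query": "fastapi websockets"}},'
         ' {"tool": "open_app", "args": {"app": "browser"}}]}')
print(execute(plan_actions(reply)))
# → ['searched for fastapi websockets', 'opened browser']
```

Keeping planning (what to do) separate from execution (how to do it) is what makes multi-step goals like "search, filter, then open the result" composable.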
Challenges we ran into
• Reducing latency for real-time voice interaction
• Handling interruptions smoothly during conversations
• Maintaining context during multi-step tasks
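The interruption problem above has a common shape: run the assistant's audio playback as a cancellable task that a new user utterance can pre-empt. The sketch below shows that pattern with `asyncio` task cancellation; the `Speaker` class and the simulated playback are illustrative stand-ins, not AIRA's actual implementation.

```python
import asyncio

async def speak(text: str, chunk_delay: float = 0.01) -> None:
    # Stand-in for streaming TTS audio to the client, one chunk at a time.
    for _word in text.split():
        await asyncio.sleep(chunk_delay)  # simulate sending one audio chunk

class Speaker:
    """Tracks the current playback task so a new utterance can cancel it."""
    def __init__(self) -> None:
        self._task: asyncio.Task | None = None
        self.interrupted = False

    async def say(self, text: str) -> None:
        if self._task and not self._task.done():
            self._task.cancel()          # barge-in: stop the old response
            self.interrupted = True
        self._task = asyncio.create_task(speak(text))

async def demo() -> bool:
    speaker = Speaker()
    await speaker.say("a very long answer that the user talks over mid stream")
    await asyncio.sleep(0.02)            # user interrupts mid-playback
    await speaker.say("the new answer")
    await speaker._task                  # let the replacement finish
    return speaker.interrupted

print(asyncio.run(demo()))  # → True: the first response was cancelled
```

Cancelling at the chunk boundary is what keeps the interaction feeling responsive: the old audio stops within one chunk's worth of latency instead of playing to the end.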
Accomplishments that we're proud of
• Building a real-time voice agent powered by Gemini Live
• Creating a seamless experience where users can interrupt and redirect conversations naturally
• Integrating screen awareness, allowing AIRA to read and analyze documents
• Implementing gesture-based navigation for hands-free interaction
What we learned
Building AIRA showed us that creating a useful AI assistant is not just about having a powerful model; it's about designing interactions that feel natural for humans. Real-time responsiveness, handling interruptions, and understanding context turned out to be just as important as the AI itself. We also learned a lot about building systems that combine AI with real-world actions. Connecting voice interaction, screen understanding, and automation into one seamless experience pushed us to think carefully about architecture, latency, and user experience. Most importantly, this project showed us how exciting the future of real-time AI agents can be.
What's next for AIRA
Next, we plan to expand AIRA's ability to interact with more applications and services, enabling users to automate complex multi-step workflows across their desktop and the web. We also aim to improve AIRA's visual understanding so it can interpret interfaces, diagrams, and on-screen content more intelligently, and to introduce personalization so the assistant can adapt to individual users' habits and preferences. Our long-term vision is for AIRA to become a true AI desktop companion: one that doesn't just respond to commands, but actively helps users get things done.
Built With
- alembic
- asyncpg
- docker
- fastapi
- gemini-2.0-flash
- gemini-live-api
- google-cloud-run
- playwright
- postgresql
- python
- react
- typescript
- vite
- websockets
- xdotool