Nova-CUA: The Beginning of a JARVIS-like System
Inspiration
The inspiration comes from the dream of creating a real-life JARVIS — an AI that doesn’t just chat, but actively helps by planning, coding, and directly interacting with a computer. Instead of a simple assistant, the goal is to build a true autonomous colleague, capable of working alongside humans seamlessly.
What it does
Nova-CUA is a multi-agent system, called CoAct-1, built from three cooperating agents:
- Orchestrator: The brain, powered by Gemini 2.5-Flash, decomposes tasks into steps and assigns them.
- Programmer: The engineer, also using Gemini 2.5-Flash, writes and executes code to complete steps.
- GUI Operator: The hands and eyes, using a composed_grounded model (Gemini 2.5-Flash + InterVL-4B grounding model), controls the screen by clicking, typing, and dragging precisely where needed.
For example, if you say “Open Firefox and search the weather in New York”:
- The Orchestrator plans the steps.
- The Programmer launches Firefox.
- The GUI Operator grounds the coordinates of the search bar (via InterVL-4B) and types in the query.
- Within seconds, the result is shown.
This is more than a chatbot — it’s the first step toward JARVIS.
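The hand-off in the Firefox example above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Step` type, `plan`, `run_programmer`, `run_gui_operator`, and `orchestrate` are hypothetical names, and the hard-coded plan stands in for whatever the Orchestrator's model would produce.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    agent: str        # which agent handles the step: "programmer" or "gui_operator"
    instruction: str  # natural-language description of what to do


def plan(task: str) -> List[Step]:
    """Stand-in for the Orchestrator (assumption: a real system would ask
    the language model to decompose `task` instead of returning a fixed plan)."""
    return [
        Step("programmer", "Launch Firefox"),
        Step("gui_operator", "Click the search bar and type: weather in New York"),
    ]


def run_programmer(instruction: str) -> str:
    # Hypothetical stub: a real Programmer would write and execute code here.
    return f"[programmer] executed: {instruction}"


def run_gui_operator(instruction: str) -> str:
    # Hypothetical stub: a real GUI Operator would ground coordinates and act.
    return f"[gui_operator] performed: {instruction}"


# Routing table from agent name to its handler.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "programmer": run_programmer,
    "gui_operator": run_gui_operator,
}


def orchestrate(task: str) -> List[str]:
    """Plan the task, then route each step to the agent assigned to it."""
    return [HANDLERS[step.agent](step.instruction) for step in plan(task)]
```

Calling `orchestrate("Open Firefox and search the weather in New York")` returns one log line per step, in plan order. Keeping the routing in a plain dictionary is only one possible design; it makes it easy to see which agent owns which kind of step.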
How we built it
- Designed a multi-agent architecture where roles are clearly defined (planner, coder, GUI operator).
- Integrated Gemini 2.5-Flash for reasoning, planning, and code execution.
- Integrated InterVL-4B, a 4B parameter grounding model, to translate descriptions into precise pixel coordinates.
- Built a composed_grounded pipeline: the reasoning model describes what to do, the grounding model gives exact coordinates, and the system performs the action (e.g., click(100, 200)).
- Connected the agents through a control layer so they can act cooperatively on real applications.
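The composed_grounded idea described above can be sketched as a single function that chains two models. This is a sketch under stated assumptions: `describe_target` stands in for the reasoning model and `ground` for the grounding model, and both of them, along with the `Action` type and the stub return values, are hypothetical. The real model calls and the code that actually performs the click are not shown in this write-up.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Action:
    kind: str  # e.g. "click"
    x: int     # pixel column
    y: int     # pixel row

    def render(self) -> str:
        return f"{self.kind}({self.x}, {self.y})"


def composed_grounded_step(
    goal: str,
    screenshot: bytes,
    describe_target: Callable[[str, bytes], str],
    ground: Callable[[str, bytes], Tuple[int, int]],
) -> Action:
    # 1. The reasoning model says, in words, which element to act on.
    description = describe_target(goal, screenshot)
    # 2. The grounding model turns that description into pixel coordinates.
    x, y = ground(description, screenshot)
    # 3. The result is a concrete action that an executor could perform.
    return Action("click", x, y)


# Stub models for illustration only (assumptions, not real model calls).
def fake_describe(goal: str, screenshot: bytes) -> str:
    return "the search bar at the top of the window"


def fake_ground(description: str, screenshot: bytes) -> Tuple[int, int]:
    return (100, 200)


action = composed_grounded_step(
    "Search the weather in New York", b"", fake_describe, fake_ground
)
print(action.render())  # prints: click(100, 200)
```

Passing the two models in as callables keeps the reasoning and grounding stages separate, so either one could be swapped out without touching the other.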
Challenges we ran into
- Getting multiple agents to communicate and coordinate smoothly.
- Bridging the gap between high-level reasoning and low-level GUI actions.
- Ensuring the GUI operator could handle diverse application layouts without brittle rules.
- Debugging multi-step processes where one agent’s small mistake cascaded into the others.
Accomplishments that we're proud of
- Successfully created a working demo where the system plans, codes, and interacts with real applications.
- Implemented the composed_grounded pipeline that combines abstract reasoning with precise grounding.
- Built something that feels like the first glimpse of a real JARVIS system.
What we learned
- The power of multi-agent design: breaking down tasks makes AI systems more reliable and flexible.
- Grounding models are essential — reasoning alone isn’t enough when you need to act on a real screen.
- Designing for robust coordination between agents is just as important as making each agent smart individually.
What's next for Nova-CUA
- Expand support for more complex workflows (e.g., file editing, app automation).
- Improve robustness so it can handle unexpected screen states gracefully.
- Explore voice integration for more natural interaction.
- Scale the grounding model for even more accurate GUI control.
- Long-term: push Nova-CUA closer to the vision of a true JARVIS — a fully capable AI partner across planning, reasoning, coding, and real-world computer use.