AI computer-use

Nova-CUA: The Beginning of a JARVIS-like System

Inspiration

Nova-CUA grew out of the dream of building a real-life JARVIS: an AI that doesn’t just chat, but actively helps by planning, coding, and directly operating a computer. Rather than a simple assistant, the goal is an autonomous colleague that works alongside humans.

What it does

Nova-CUA is a multi-agent system, called CoAct-1, made up of three cooperating agents:

Orchestrator: The brain, powered by Gemini 2.5-Flash, decomposes tasks into steps and assigns them.
Programmer: The engineer, also using Gemini 2.5-Flash, writes and executes code to complete steps.
GUI Operator: The hands and eyes, using a composed_grounded model (Gemini 2.5-Flash + InterVL-4B grounding model), controls the screen by clicking, typing, and dragging precisely where needed.

For example, if you say “Open Firefox and search the weather in New York”:

The Orchestrator plans the steps.
The Programmer launches Firefox.
The GUI Operator grounds the coordinates of the search bar (via InterVL-4B) and types in the query.
Within seconds, the result is shown.
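The flow in this example can be sketched roughly as below. This is a minimal illustration, not the project's actual code: the `Step` type, the `plan`, `run_programmer`, `run_gui_operator`, and `orchestrate` functions, and the fixed two-step plan are all hypothetical stand-ins, and the model calls are replaced by stubs so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    agent: str        # which agent should handle the step (hypothetical labels)
    instruction: str  # natural-language instruction for that agent


def plan(task: str) -> List[Step]:
    """Stand-in for the Orchestrator's model call.

    A real planner would ask the reasoning model to decompose `task`;
    here a fixed plan for the Firefox example is returned.
    """
    return [
        Step("programmer", "launch Firefox"),
        Step("gui_operator", "type 'weather in New York' into the search bar"),
    ]


def run_programmer(instruction: str) -> str:
    # Stub: a real Programmer would write and execute code for the step.
    return f"[programmer] ran code for: {instruction}"


def run_gui_operator(instruction: str) -> str:
    # Stub: a real GUI Operator would ground coordinates and act on screen.
    return f"[gui_operator] performed: {instruction}"


# Routing table from agent label to handler (assumed design, for illustration).
AGENTS: Dict[str, Callable[[str], str]] = {
    "programmer": run_programmer,
    "gui_operator": run_gui_operator,
}


def orchestrate(task: str) -> List[str]:
    """Plan the task, then dispatch each step to the agent assigned to it."""
    log = []
    for step in plan(task):
        handler = AGENTS[step.agent]
        log.append(handler(step.instruction))
    return log


if __name__ == "__main__":
    for line in orchestrate("Open Firefox and search the weather in New York"):
        print(line)
```

Keeping the routing in a plain dictionary is only one possible design; it makes it easy to see which agent handled which step when a multi-step run goes wrong.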

This is more than a chatbot — it’s the first step toward JARVIS.

How we built it

Designed a multi-agent architecture where roles are clearly defined (planner, coder, GUI operator).
Integrated Gemini 2.5-Flash for reasoning, planning, and code execution.
Integrated InterVL-4B, a 4B-parameter grounding model, to translate natural-language descriptions of on-screen elements into precise pixel coordinates.
Built a composed_grounded pipeline: the reasoning model describes what to do, the grounding model gives exact coordinates, and the system performs the action (e.g., click(100, 200)).
Connected the agents through a control layer so they can act cooperatively on real applications.
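As a rough sketch of the composed_grounded idea described above, the pipeline can be thought of as two model calls followed by an action. Everything named here (`Action`, `describe_target`, `ground`, `composed_grounded`, `render`) is a hypothetical illustration, and both model calls are stubbed with fixed return values; the real system's interfaces may look quite different.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Action:
    kind: str  # e.g. "click"
    x: int     # pixel column
    y: int     # pixel row


def describe_target(goal: str) -> str:
    # Stub for the reasoning model: turns a goal into a description
    # of the on-screen element to act on.
    return "the search bar at the top of the browser window"


def ground(description: str, screenshot: bytes) -> Tuple[int, int]:
    # Stub for the grounding model: maps a description plus a screenshot
    # to pixel coordinates. Fixed values are returned for illustration.
    return (100, 200)


def composed_grounded(
    goal: str,
    screenshot: bytes,
    describe: Callable[[str], str] = describe_target,
    ground_fn: Callable[[str, bytes], Tuple[int, int]] = ground,
) -> Action:
    """Reasoning model says what to act on; grounding model says where."""
    description = describe(goal)
    x, y = ground_fn(description, screenshot)
    return Action("click", x, y)


def render(action: Action) -> str:
    # Format the action the way the writeup shows it, e.g. click(100, 200).
    return f"{action.kind}({action.x}, {action.y})"


if __name__ == "__main__":
    action = composed_grounded("click the search bar", screenshot=b"")
    print(render(action))
```

Passing the two model calls in as parameters is an assumption made for this sketch; it keeps the reasoning and grounding stages independently swappable and easy to test with stubs.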

Challenges we ran into

Getting multiple agents to communicate and coordinate smoothly.
Bridging the gap between high-level reasoning and low-level GUI actions.
Ensuring the GUI operator could handle diverse application layouts without brittle rules.
Debugging multi-step processes where one agent’s small mistake cascaded into the others.

Accomplishments that we're proud of

Created a working demo in which the system plans, codes, and interacts with real applications.
Implemented the composed_grounded pipeline that combines abstract reasoning with precise grounding.
Built something that feels like the first glimpse of a real JARVIS system.

What we learned

The power of multi-agent design: splitting a task across specialized agents made our system more reliable and flexible.
Grounding models are essential — reasoning alone isn’t enough when you need to act on a real screen.
Designing for robust coordination between agents is just as important as making each agent smart individually.

What's next for Nova-CUA

Expand support for more complex workflows (e.g., file editing, app automation).
Improve robustness so it can handle unexpected screen states gracefully.
Explore voice integration for more natural interaction.
Scale the grounding model for even more accurate GUI control.
Long-term: push Nova-CUA closer to the vision of a true JARVIS — a fully capable AI partner across planning, reasoning, coding, and real-world computer use.