Nova-CUA: The Beginning of a JARVIS-like System

Inspiration

Nova-CUA is inspired by the dream of a real-life JARVIS: an AI that doesn’t just chat, but actively helps by planning, coding, and interacting directly with a computer. The goal is not a simple assistant but an autonomous colleague that can work alongside humans.

What it does

Nova-CUA is a multi-agent system, called CoAct-1, made up of three agents:

  • Orchestrator: The brain, powered by Gemini 2.5-Flash, decomposes tasks into steps and assigns them.
  • Programmer: The engineer, also using Gemini 2.5-Flash, writes and executes code to complete steps.
  • GUI Operator: The hands and eyes, using a composed_grounded model (Gemini 2.5-Flash + InterVL-4B grounding model), controls the screen by clicking, typing, and dragging precisely where needed.

For example, if you say “Open Firefox and search the weather in New York”:

  • The Orchestrator plans the steps.
  • The Programmer launches Firefox.
  • The GUI Operator locates the search bar’s coordinates (via InterVL-4B) and types the query.
  • The result appears within seconds.
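
The flow in this example can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not Nova-CUA's actual code: `Step`, `plan`, `run`, and the stub agents are hypothetical names, and the plan is hard-coded where the real Orchestrator would ask Gemini 2.5-Flash to decompose the task.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    agent: str        # which agent handles the step (hypothetical role names)
    instruction: str  # natural-language instruction for that agent


def plan(task: str) -> List[Step]:
    """Stand-in for the Orchestrator.

    Assumption: a real planner would call the reasoning model here.
    This stub returns a fixed plan for the Firefox example only.
    """
    return [
        Step("programmer", "launch Firefox"),
        Step("gui_operator", "click the search bar and type: weather in New York"),
    ]


def run(task: str, agents: Dict[str, Callable[[str], str]]) -> List[str]:
    """Send each planned step to the agent it was assigned to, in order."""
    results = []
    for step in plan(task):
        handler = agents[step.agent]
        results.append(handler(step.instruction))
    return results


# Stub agents that only echo their instruction; real agents would
# execute code or drive the screen.
agents = {
    "programmer": lambda instruction: f"[programmer] {instruction}",
    "gui_operator": lambda instruction: f"[gui_operator] {instruction}",
}

log = run("Open Firefox and search the weather in New York", agents)
for line in log:
    print(line)
```

Keeping the agents behind a plain name-to-callable mapping means the Orchestrator only needs to know role names, not how each agent does its work.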

This is more than a chatbot — it’s the first step toward JARVIS.

How we built it

  • Designed a multi-agent architecture where roles are clearly defined (planner, coder, GUI operator).
  • Integrated Gemini 2.5-Flash for reasoning, planning, and code execution.
  • Integrated InterVL-4B, a 4B-parameter grounding model, to translate text descriptions of on-screen elements into precise pixel coordinates.
  • Built a composed_grounded pipeline: the reasoning model describes what to do, the grounding model gives exact coordinates, and the system performs the action (e.g., click(100, 200)).
  • Connected the agents through a control layer so they can act cooperatively on real applications.
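
The composed_grounded step described above can be sketched like this. It is a minimal sketch under assumptions, not the project's implementation: `Action`, `composed_grounded_step`, `fake_describe`, and `fake_ground` are hypothetical names, and the two stubs stand in for calls to the reasoning model and the grounding model.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Action:
    kind: str  # e.g. "click"
    x: int     # pixel coordinates returned by the grounding stage
    y: int

    def render(self) -> str:
        return f"{self.kind}({self.x}, {self.y})"


def composed_grounded_step(
    instruction: str,
    screenshot: bytes,
    describe: Callable[[str], Tuple[str, str]],
    ground: Callable[[bytes, str], Tuple[int, int]],
) -> Action:
    """Compose the two stages: describe what to act on, then locate it."""
    # Stage 1 (reasoning model): choose an action type and describe the target in words.
    kind, target = describe(instruction)
    # Stage 2 (grounding model): turn that description into pixel coordinates.
    x, y = ground(screenshot, target)
    return Action(kind, x, y)


def fake_describe(instruction: str) -> Tuple[str, str]:
    # Assumption: a real implementation would query the reasoning model.
    return "click", "the search bar at the top of the window"


def fake_ground(screenshot: bytes, target: str) -> Tuple[int, int]:
    # Assumption: a real implementation would query the grounding model
    # with the screenshot and the description. Fixed values for illustration.
    return 100, 200


action = composed_grounded_step("Focus the search bar", b"", fake_describe, fake_ground)
print(action.render())  # prints: click(100, 200)
```

Separating the two stages keeps the reasoning model free of pixel-level detail, and either stage can be swapped without touching the other.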

Challenges we ran into

  • Getting multiple agents to communicate and coordinate smoothly.
  • Bridging the gap between high-level reasoning and low-level GUI actions.
  • Ensuring the GUI operator could handle diverse application layouts without brittle rules.
  • Debugging multi-step processes where one agent’s small mistake cascaded into the others.

Accomplishments that we're proud of

  • Built a working demo in which the system plans, codes, and interacts with real applications.
  • Implemented the composed_grounded pipeline that combines abstract reasoning with precise grounding.
  • Built something that feels like the first glimpse of a real JARVIS system.

What we learned

  • The power of multi-agent design: breaking down tasks makes AI systems more reliable and flexible.
  • Grounding models are essential — reasoning alone isn’t enough when you need to act on a real screen.
  • Designing for robust coordination between agents is just as important as making each agent smart individually.

What's next for Nova-CUA

  • Expand support for more complex workflows (e.g., file editing, app automation).
  • Improve robustness so it can handle unexpected screen states gracefully.
  • Explore voice integration for more natural interaction.
  • Scale the grounding model for even more accurate GUI control.
  • Long-term: push Nova-CUA closer to the vision of a true JARVIS — a fully capable AI partner across planning, reasoning, coding, and real-world computer use.
