BrushOS: Gemini 3 in the Physical World

Inspiration

A lot of “multimodal” demos stop at describing images. We wanted to go one step further: can a model see a real workspace and take reliable actions in it? So we gave Gemini 3 eyes and hands (a camera and a robot arm) and built BrushOS, a closed-loop system that draws with real ink on real paper, not pixels.

What it does

BrushOS is a text-to-physical-drawing agent. Gemini 3 observes the workspace through a camera, generates stroke geometry as big continuous brush strokes, executes those strokes with a real robot arm, captures new images, and iteratively improves the drawing from that visual feedback. The tool calls are orchestrated through Google’s ADK.

We included evidence as both images and run transcripts showing structured tool calls for stroke construction, concatenation, arm execution, and camera captures.

Why Gemini 3 was essential

This project only works if the model can reliably do four things at once.

First, ground decisions in live vision. Gemini 3 looks at the camera frame and reasons about the physical layout of the paper area, the brush position, and the paint bowl, so its next tool call is based on what is actually there, not what it assumes.

Second, plan across multiple tool calls without losing the thread. Drawing is a multistep process: build strokes, transform and concatenate, execute, capture, evaluate, revise. Gemini 3 stays consistent across these steps, keeping the goal and constraints intact.

Third, choose actions, not just describe. Instead of outputting instructions, Gemini 3 produces structured tool calls with concrete parameters for stroke geometry and robot execution. That turns multimodal understanding into physical action.

Fourth, refine through a feedback loop. After each pass, Gemini 3 compares the new camera capture with the intended composition and decides what to add or correct next. This is where the system becomes autonomous: observe, act, re-observe, improve.

In short, Gemini 3’s multimodal grounding plus long-horizon tool use is what makes BrushOS stable in the messy physical world.
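To make the "structured tool calls inside a feedback loop" idea concrete, here is a minimal sketch in plain Python. It is not the BrushOS code and does not use the ADK or robot APIs: `ToolCall`, `run_closed_loop`, and the `capture`, `decide`, and `tools` arguments are all hypothetical names, with `decide` standing in for the model.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class ToolCall:
    """A structured action: a tool name plus concrete parameters (hypothetical type)."""
    name: str
    args: Dict[str, Any] = field(default_factory=dict)


def run_closed_loop(
    capture: Callable[[], Any],
    decide: Callable[[Any, List[ToolCall]], Optional[ToolCall]],
    tools: Dict[str, Callable[..., Any]],
    max_steps: int = 20,
) -> List[ToolCall]:
    """Observe, decide, act, and repeat until `decide` returns None
    or the step budget is used up. Returns the executed calls in order."""
    history: List[ToolCall] = []
    for _ in range(max_steps):
        frame = capture()              # observe the current workspace
        call = decide(frame, history)  # stand-in for the model choosing an action
        if call is None:               # the decider considers the drawing done
            break
        if call.name not in tools:
            raise KeyError(f"unknown tool: {call.name}")
        tools[call.name](**call.args)  # act with concrete parameters
        history.append(call)
    return history


if __name__ == "__main__":
    # Stub workspace: the "camera" just reports how many strokes exist so far.
    canvas: List[list] = []
    demo_tools = {"execute_stroke": lambda points: canvas.append(points)}

    def demo_decide(frame, history):
        if frame >= 2:
            return None
        return ToolCall("execute_stroke", {"points": [(0.1, 0.1), (0.9, 0.5)]})

    executed = run_closed_loop(lambda: len(canvas), demo_decide, demo_tools)
    print(len(executed), "strokes executed")
```

The step budget is a deliberate design choice in this sketch: with a physical arm, a loop that cannot terminate on its own should still stop.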

Why it’s different

Most creative AI is text to image. BrushOS is text to drawing in the physical world. The model does not output pixels. It outputs actions, then checks its work and refines it.

Painting is the fun surface area. The deeper point is a reusable pattern for embodied autonomy: perceive, plan, act, verify, correct.

How we built it

We built BrushOS in four steps.

1. Pose teaching. A learning-mode tool captures the paper corners and paint bowl poses into a file, so the agent can draw in a consistent workspace frame.

2. Primitives and tools. We implemented normalized stroke execution and refill logic designed around a real thick brush, plus camera tools for perception and arm tools for movement.

3. Autonomous agent. Gemini 3 acts as the decision engine, producing structured tool calls instead of free-form instructions.

4. Closed-loop refinement. The agent captures images after major strokes, and Gemini 3 decides what to add or correct next.

High-level loop: camera → Gemini 3 reasoning → tool calls → robot execution → camera feedback → refinement.
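As an illustration of how taught paper corners can define a workspace frame for normalized strokes, the sketch below maps a point in the unit square onto the paper by bilinear interpolation. This is one plausible approach, not necessarily what BrushOS does. It assumes flat paper, 2D robot XY coordinates, and a corner order of top-left, top-right, bottom-right, bottom-left; `normalized_to_paper` and `lerp` are hypothetical names.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def lerp(a: Point, b: Point, t: float) -> Point:
    """Linear interpolation between two 2D points."""
    return (a[0] + (b[0] - a[0]) * t, a[1] + (b[1] - a[1]) * t)


def normalized_to_paper(u: float, v: float, corners: Sequence[Point]) -> Point:
    """Map normalized (u, v) in [0, 1] x [0, 1] to workspace XY.

    `corners` is assumed to be ordered top-left, top-right, bottom-right,
    bottom-left, as captured during pose teaching. u runs left to right,
    v runs top to bottom.
    """
    if not (0.0 <= u <= 1.0 and 0.0 <= v <= 1.0):
        raise ValueError("normalized coordinates must lie in [0, 1]")
    if len(corners) != 4:
        raise ValueError("expected exactly four corners")
    tl, tr, br, bl = corners
    top = lerp(tl, tr, u)        # point along the top edge
    bottom = lerp(bl, br, u)     # matching point along the bottom edge
    return lerp(top, bottom, v)  # blend between the two edges


if __name__ == "__main__":
    # Made-up corner values, in meters, for a slightly skewed sheet.
    paper = [(0.20, 0.10), (0.20, -0.10), (0.35, -0.11), (0.35, 0.10)]
    stroke = [(0.1, 0.5), (0.5, 0.4), (0.9, 0.5)]
    print([normalized_to_paper(u, v, paper) for u, v in stroke])
```

Rejecting out-of-range coordinates up front keeps the brush on the paper even if a generated stroke is slightly off.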

Challenges we hit and how we fixed them

  • Thick brush kills fine detail. We constrained the system to big continuous strokes and frequent refills.
  • Pose accuracy dominates quality. We standardized on a four-corner paper frame so drawing is stable from run to run.
  • Feedback is mandatory. We capture and evaluate after major actions so the agent can correct drift and composition.
  • Robot software mismatch. We pinned the robot client version and documented it.
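A simple way to think about "frequent refills" is an ink budget measured in stroke length: dip the brush whenever the next stroke would run it dry. The sketch below is an assumption about how such logic could look, not the BrushOS implementation. `plan_with_refills`, `stroke_length`, and the `ink_budget` parameter are hypothetical, and the budget would need tuning on a real brush.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Action = Tuple[str, Optional[Sequence[Point]]]


def stroke_length(points: Sequence[Point]) -> float:
    """Total polyline length of a stroke."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def plan_with_refills(strokes: Sequence[Sequence[Point]], ink_budget: float) -> List[Action]:
    """Interleave ("refill", None) actions with ("stroke", points) actions.

    The brush starts dry. A refill is inserted before any stroke that is
    longer than the ink remaining. Strokes are never split, so a stroke
    longer than the whole budget is still drawn in one pass after a refill.
    """
    if ink_budget <= 0:
        raise ValueError("ink_budget must be positive")
    plan: List[Action] = []
    remaining = 0.0
    for points in strokes:
        length = stroke_length(points)
        if length > remaining:
            plan.append(("refill", None))
            remaining = ink_budget
        plan.append(("stroke", points))
        remaining = max(0.0, remaining - length)
    return plan


if __name__ == "__main__":
    demo = [[(0, 0), (1, 0)], [(0, 0), (0, 1)], [(0, 0), (1, 0)]]
    for kind, _ in plan_with_refills(demo, ink_budget=2.0):
        print(kind)
```

Keeping strokes whole matches the big-continuous-stroke constraint: a mid-stroke refill would leave a visible break in the line.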

What we learned

Multimodal becomes real when the model is forced to output tools, geometry, and actions, not just text. In the physical world, constraints beat cleverness: limiting stroke primitives and enforcing calibration made the system reliable. The biggest unlock was not better planning. It was a tight observe, act, re-observe loop.

What’s next

  • Multicolor palettes, different brushes, more physical tools.
  • Better composition objectives like spacing, balance, and coverage, plus faster correction passes.
  • Generalize from painting to other camera-guided manipulation tasks that need precision and verification.

Built With

  • adk
  • gemini
  • gemini3
  • pyniryo
  • robot
  • uv