Inspiration

Work is split across apps and windows, but user goals are continuous workflows. Prism was inspired by the need for an assistant that can understand what is on screen, take safe actions across applications, and reduce context switching without becoming a black box.

What it does

Prism is a Gemini 3-powered desktop agent that can read your screen, plan steps, and execute verified actions such as click, type, scroll, and navigate across apps. It streams progress as task steps, keeps a local visual session memory, and adds safety controls that keep the user in charge.

Core features

  • Cross-app automation with verification and recovery
  • Visual DOM mapping for stable targeting
  • Local visual session memory with retrieval
  • Privacy-first redaction before any frame is sent to the Gemini API
  • Safe Mode confirmations and dry-run target highlighting
  • Real-time streaming of plans, actions, and outcomes

How we built it

System layers

  • Desktop overlay and UI using Electron plus Next.js
  • Orchestration backend using FastAPI and SSE streaming
  • Automation tools for desktop input and browser control
  • Gemini 3 for multimodal perception, planning, grounding, and structured outputs

Control loop

  1. Capture current screen and context
  2. Ask Gemini 3 for intent, plan, and grounded targets
  3. Preview targets in dry run mode if enabled
  4. Request Safe Mode confirmation for high risk actions
  5. Execute actions locally
  6. Verify outcome and recover if needed
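
The steps above can be sketched as a generic loop. The injected callables (`capture`, `plan`, `execute`, `verify`, `confirm`) stand in for Prism's real subsystems; this is an illustrative shape under those assumptions, not the actual implementation:

```python
def run_control_loop(goal, capture, plan, execute, verify, confirm,
                     max_retries=3, dry_run=False):
    """One task iteration: capture -> plan -> (preview/confirm) -> execute -> verify."""
    for attempt in range(max_retries + 1):
        frame = capture()                      # 1. current screen and context
        action = plan(goal, frame)             # 2. intent, plan, grounded target
        if dry_run:                            # 3. highlight the target, do not act
            return ("preview", action)
        if action.get("high_risk") and not confirm(action):
            return ("cancelled", action)       # 4. Safe Mode gate
        execute(action)                        # 5. act locally
        if verify(action):                     # 6. verified outcome
            return ("ok", action)
    return ("failed", None)                    # retry budget exhausted
```

Keeping execution and verification as separate injected steps makes the loop easy to test with fakes and keeps the Safe Mode gate in one place.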

Visual DOM math and targeting equations

Coordinate normalization

We convert pixel coordinates into normalized screen coordinates so targeting survives resolution changes. Let screen width be \(W\) and height be \(H\). A pixel point \(p = (x, y)\) becomes \(n(p) = (x/W,\; y/H)\).

$$ n(p) = \left(\frac{x}{W}, \frac{y}{H}\right) $$

A bounding box \(b = [y_{min}, x_{min}, y_{max}, x_{max}]\) is normalized as

$$ n(b) = \left[\frac{y_{min}}{H}, \frac{x_{min}}{W}, \frac{y_{max}}{H}, \frac{x_{max}}{W}\right] $$
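
A minimal sketch of both normalizations, using the same \([y_{min}, x_{min}, y_{max}, x_{max}]\) box order as above:

```python
def normalize_point(x, y, W, H):
    """Pixel point (x, y) -> resolution-independent (x/W, y/H)."""
    return (x / W, y / H)

def normalize_box(box, W, H):
    """Pixel box [y_min, x_min, y_max, x_max] -> normalized box, same order."""
    y0, x0, y1, x1 = box
    return [y0 / H, x0 / W, y1 / H, x1 / W]
```

At 1920x1080, the screen center (960, 540) normalizes to (0.5, 0.5), and the same normalized point resolves correctly after a resolution change.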

Visual DOM element model

Each detected element is represented as a structured node \(e_i\):

$$ e_i = \langle t_i,\; r_i,\; b_i,\; s_i \rangle $$

where \(t_i\) is the element type such as button or input, \(r_i\) is its role or label text, \(b_i\) is its bounding box, and \(s_i\) is a confidence score.

The Visual DOM for a frame is the set

$$ D = \{ e_1, e_2, \dots, e_n \} $$
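
One way to represent the node \(e_i = \langle t_i, r_i, b_i, s_i \rangle\) in code; the field names are illustrative, not Prism's actual schema:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Element:
    type: str                               # t_i: "button", "input", ...
    label: str                              # r_i: role or visible label text
    box: Tuple[float, float, float, float]  # b_i: normalized [y_min, x_min, y_max, x_max]
    confidence: float                       # s_i: detection confidence

# The Visual DOM D for one frame is just the collection of detected elements.
visual_dom = [
    Element("button", "Submit", (0.80, 0.40, 0.86, 0.60), 0.97),
    Element("input", "Email address", (0.30, 0.20, 0.36, 0.80), 0.91),
]
```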

Target selection score

We select the best element for an action using a weighted score:

$$ \mathrm{Score}(e_i) = \alpha \cdot \mathrm{sim}(q, r_i) + \beta \cdot s_i + \gamma \cdot \mathrm{prior}(e_i) $$

where \(q\) is the user intent or action description, \(sim\) is the text similarity between the intent and the element label, and \(prior\) encodes context such as the active app and recent actions. The weights \(\alpha, \beta, \gamma\) are tuned.
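
A toy scoring implementation. Here `sim` is a simple token-overlap (Jaccard) similarity purely for illustration, the weights are arbitrary, and the real system could just as well use embeddings:

```python
def token_sim(q, label):
    """Jaccard similarity over lowercase tokens: a stand-in for sim(q, r_i)."""
    a, b = set(q.lower().split()), set(label.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def target_score(q, elem, prior=0.0, alpha=0.6, beta=0.3, gamma=0.1):
    """Score(e_i) = alpha * sim(q, r_i) + beta * s_i + gamma * prior(e_i)."""
    return alpha * token_sim(q, elem["label"]) + beta * elem["confidence"] + gamma * prior

elements = [
    {"label": "Submit", "confidence": 0.97},
    {"label": "Cancel", "confidence": 0.95},
]
best = max(elements, key=lambda e: target_score("click the submit button", e))
```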

Verification using intersection over union

After an action, we verify that the intended UI state appears by comparing expected and observed boxes using intersection over union.

$$ IoU(A,B) = \frac{|A \cap B|}{|A \cup B|} $$

We accept a match if

$$ IoU(A,B) \ge \tau $$

where \(\tau\) is a threshold chosen for the UI density.
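
A direct IoU implementation over normalized boxes in the \([y_{min}, x_{min}, y_{max}, x_{max}]\) order used above; the default \(\tau\) here is just an example value:

```python
def iou(a, b):
    """Intersection over union of two boxes [y_min, x_min, y_max, x_max]."""
    y0, x0 = max(a[0], b[0]), max(a[1], b[1])
    y1, x1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, y1 - y0) * max(0.0, x1 - x0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def verified(expected, observed, tau=0.5):
    """Accept the match when IoU(A, B) >= tau."""
    return iou(expected, observed) >= tau
```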

Self healing retry budget

We cap retries to avoid infinite loops. Let \(k\) be the current retry count and \(K\) be the maximum budget. We enforce \(k \le K\).

$$ k \le K $$

Recovery triggers when verification fails:

$$ \text{fail} \Rightarrow \text{recapture} \Rightarrow \text{relocalize} \Rightarrow \text{retry} $$
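
The recovery chain can be sketched as a bounded retry loop; the injected callables (`act`, `verify`, `recapture`, `relocalize`) are placeholders for Prism's subsystems:

```python
def act_with_recovery(act, verify, recapture, relocalize, K=3):
    """First attempt plus at most K retries (k <= K); on each failure:
    recapture the screen, relocalize the target, then retry."""
    target = relocalize(recapture())      # initial grounding
    for k in range(K + 1):
        act(target)
        if verify():
            return True
        target = relocalize(recapture())  # fail => recapture => relocalize
    return False
```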

Challenges we ran into

  • Desktop variability such as display scaling, multiple monitors, focus changes, and popups
  • Perceived latency when model time dominates the pipeline
  • Building safety controls that are visible, simple, and enforceable
  • Privacy handling for sensitive on-screen content

The latency model we used for profiling decomposes the total time \(T_{total}\) into capture, model, execution, and verification stages:

$$ T_{total} = T_{capture} + T_{model} + T_{execute} + T_{verify} $$

We optimized by trimming the context sent to the model to reduce \(T_{model}\), streaming partial updates, and using lower thinking settings for fast steps.
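
A small context-manager profiler along these lines is enough to attribute \(T_{total}\) to its stages (the stage names and bodies here are illustrative):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def stage(name):
    """Accumulate wall-clock time per pipeline stage."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - t0

with stage("capture"):
    pass  # screenshot + context assembly would go here
with stage("model"):
    pass  # Gemini 3 call would go here

t_total = sum(timings.values())  # T_total as the sum of stage times
```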

Accomplishments that we are proud of

  • End-to-end computer-use loop with planning, execution, verification, and recovery
  • Safe Mode confirmations and dry-run highlighting that improve trust
  • Visual DOM mapping that enables more stable targeting than raw coordinates
  • Local visual memory timeline with retrieval-based recall
  • Real-time streaming that makes the agent's behavior transparent

What we learned

  • Reliability beats feature count, both in judging and in user trust
  • Structured control such as browser automation should be tried before falling back to vision
  • Verification must be explicit, measurable, and bounded
  • Minimal context with retrieval improves both accuracy and speed

What’s next for Prism

  • Move from vision-first to accessibility-first targeting where available
  • Improve the Visual DOM with richer roles, better priors, and stronger verification
  • Add policy rules for allowed actions and required confirmations
  • Expand workflow presets for repeatable real-world tasks
  • Tighten memory retrieval to return fewer, higher-signal frames with better summaries
  • Eventually build a desktop OS powered by Gemini

Built With

  • antigravity
  • chromium
  • electron
  • fastapi
  • google-gemini-3-api
  • playwright
  • python
  • server-sent-events
  • uvicorn