Inspiration
Work is split across apps and windows, but user goals are continuous workflows. Prism was inspired by the need for an assistant that can understand what is on screen, take safe actions across applications, and reduce context switching without becoming a black box.
What it does
Prism is a Gemini 3-powered desktop agent that can read your screen, plan steps, and execute verified actions such as click, type, scroll, and navigation across apps. It streams progress as task steps, supports local visual session memory, and adds safety controls that keep the user in charge.
Core features
- Cross-app automation with verification and recovery
- Visual DOM mapping for stable targeting
- Local visual session memory with retrieval
- Privacy-first redaction before any frame is sent to the Gemini API
- Safe Mode confirmations and dry-run target highlighting
- Real-time streaming of plans, actions, and outcomes
How we built it
System layers
- Desktop overlay and UI using Electron plus Next.js
- Orchestration backend using FastAPI and SSE streaming
- Automation tools for desktop input and browser control
- Gemini 3 for multimodal perception, planning, grounding, and structured outputs
Control loop
- Capture current screen and context
- Ask Gemini 3 for intent, plan, and grounded targets
- Preview targets in dry-run mode if enabled
- Request Safe Mode confirmation for high risk actions
- Execute actions locally
- Verify outcome and recover if needed
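The steps above can be sketched as a single loop. This is a minimal illustration, not Prism's actual implementation: the `env` object and its callables (`capture`, `plan`, `execute`, `verify`, `recover`, `confirm`, `highlight`) are hypothetical stand-ins for the real components.

```python
def run_task(goal, env, max_steps=20, safe_mode=True, dry_run=False):
    """One pass of the control loop per iteration; env supplies the
    capture/plan/execute/verify/recover/confirm/highlight callables."""
    for _ in range(max_steps):
        frame = env.capture()              # 1. capture current screen and context
        step = env.plan(goal, frame)       # 2. ask the model for the next grounded step
        if step is None:                   # planner reports the goal is complete
            return True
        if dry_run:
            env.highlight(step)            # 3. preview the target without acting
            continue
        if safe_mode and step.get("high_risk") and not env.confirm(step):
            return False                   # 4. Safe Mode gate for high-risk actions
        env.execute(step)                  # 5. execute the action locally
        if not env.verify(step):           # 6. verify outcome, recover if needed
            env.recover(step)
    return False
```

The callback-style `env` keeps the loop testable: the same control flow runs against real automation tools or against stubs.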
Visual DOM math and targeting equations
Coordinate normalization
We convert pixel coordinates into normalized screen coordinates so targeting survives resolution changes. Let screen width be \(W\) and height be \(H\). A pixel point \(p = (x, y)\) becomes \(n(p) = (x/W,\; y/H)\).
$$ n(p) = \left(\frac{x}{W}, \frac{y}{H}\right) $$
A bounding box \(b = [y_{min}, x_{min}, y_{max}, x_{max}]\) is normalized as
$$ n(b) = \left[\frac{y_{min}}{H}, \frac{x_{min}}{W}, \frac{y_{max}}{H}, \frac{x_{max}}{W}\right] $$
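As a concrete sketch, both normalizations are one-liners (function names are ours for illustration):

```python
def normalize_point(x, y, W, H):
    """Map a pixel point to resolution-independent [0, 1] coordinates."""
    return (x / W, y / H)

def normalize_box(box, W, H):
    """Normalize a [y_min, x_min, y_max, x_max] box the same way."""
    y0, x0, y1, x1 = box
    return [y0 / H, x0 / W, y1 / H, x1 / W]
```

Because the results live in [0, 1], a target located on a 1920x1080 capture still resolves correctly after a resolution or scaling change.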
Visual DOM element model
Each detected element is represented as a structured node \(e_i\):
$$ e_i = \langle t_i,\; r_i,\; b_i,\; s_i \rangle $$
Where \(t_i\) is the type such as button or input, \(r_i\) is the role or label text, \(b_i\) is the bounding box, and \(s_i\) is a confidence score.
The Visual DOM for a frame is the set
$$ D = \{ e_1, e_2, \dots, e_n \} $$
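In code, a node can be modeled roughly like this (a hypothetical sketch: the field names are ours, and we use a list where the math uses a set):

```python
from dataclasses import dataclass

@dataclass
class Element:
    type: str      # t_i, e.g. "button" or "input"
    role: str      # r_i, role or label text
    box: list      # b_i, normalized [y_min, x_min, y_max, x_max]
    score: float   # s_i, detection confidence

# The Visual DOM for one frame is the collection of detected elements.
visual_dom = [
    Element("button", "Submit", [0.80, 0.40, 0.85, 0.60], 0.97),
    Element("input", "Email", [0.30, 0.20, 0.35, 0.80], 0.91),
]
```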
Target selection score
We select the best element for an action using a weighted score:
$$ Score(e_i) = \alpha \cdot sim(q, r_i) + \beta \cdot s_i + \gamma \cdot prior(e_i) $$
Where \(q\) is the user intent or action description, \(sim\) is text similarity between the intent and the element label, and \(prior\) encodes context like active app and recent actions. \(\alpha, \beta, \gamma\) are tuned weights.
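A sketch of the selection rule, using `difflib.SequenceMatcher` as a stand-in for the real text-similarity measure (the weights, helper names, and dict fields here are illustrative, not Prism's tuned values):

```python
from difflib import SequenceMatcher

def similarity(q, label):
    """Stand-in for sim(q, r_i): normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, q.lower(), label.lower()).ratio()

def score(q, element, prior, alpha=0.6, beta=0.3, gamma=0.1):
    """Score(e_i) = alpha*sim(q, r_i) + beta*s_i + gamma*prior(e_i)."""
    return (alpha * similarity(q, element["role"])
            + beta * element["score"]
            + gamma * prior(element))

def best_target(q, elements, prior=lambda e: 0.0):
    """Pick the highest-scoring element for the given intent."""
    return max(elements, key=lambda e: score(q, e, prior))
```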
Verification using intersection over union
After an action, we verify that the intended UI state appears by comparing expected and observed boxes using intersection over union.
$$ IoU(A,B) = \frac{|A \cap B|}{|A \cup B|} $$
We accept a match if
$$ IoU(A,B) \ge \tau $$
Where \(\tau\) is a threshold chosen for UI density.
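A minimal IoU check over normalized `[y_min, x_min, y_max, x_max]` boxes (the default threshold here is illustrative):

```python
def iou(a, b):
    """Intersection over union of two [y_min, x_min, y_max, x_max] boxes."""
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def verified(expected, observed, tau=0.5):
    """Accept the match when IoU(expected, observed) >= tau."""
    return iou(expected, observed) >= tau
```

A denser UI calls for a higher `tau`, since nearby elements produce partially overlapping boxes.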
Self healing retry budget
We cap retries to avoid infinite loops. Let \(k\) be the current retry count and \(K\) be the maximum budget. We enforce \(k \le K\).
$$ k \le K $$
Recovery triggers when verification fails:
$$ \text{fail} \Rightarrow \text{recapture} \Rightarrow \text{relocalize} \Rightarrow \text{retry} $$
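The budget and the recovery chain can be sketched together (the callable names are hypothetical placeholders for Prism's internals):

```python
def act_with_recovery(execute, verify, recapture, relocalize, K=3):
    """Attempt an action with at most K retries (enforcing k <= K).
    Each verification failure triggers recapture -> relocalize -> retry."""
    target = None
    for k in range(K + 1):              # initial attempt plus up to K retries
        execute(target)
        if verify():
            return True
        frame = recapture()             # fail => recapture the screen
        target = relocalize(frame)      # => relocalize the target, then retry
    return False                        # budget exhausted; surface the failure
```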
Challenges we ran into
- Desktop variability such as display scaling, multiple monitors, focus changes, and popups
- Perceived latency when model time dominates the pipeline
- Building safety controls that are visible, simple, and enforceable
- Privacy handling for sensitive on-screen content
The latency model we used for profiling decomposes total time \(T_{total}\) into capture, model, execution, and verification:
$$ T_{total} = T_{capture} + T_{model} + T_{execute} + T_{verify} $$
We optimized by trimming the context sent to the model to reduce \(T_{model}\), streaming partial updates, and using lower thinking settings for fast steps.
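A sketch of how time can be attributed to each term of the decomposition (helper names are ours; the real profiler is more involved):

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    t0 = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - t0

def profile_step(capture, model, execute, verify):
    """Decompose one loop iteration into the four latency terms."""
    timings = {}
    frame, timings["capture"] = timed(capture)
    plan, timings["model"] = timed(model, frame)
    _, timings["execute"] = timed(execute, plan)
    _, timings["verify"] = timed(verify, plan)
    timings["total"] = sum(timings.values())   # T_total as the sum of the terms
    return timings
```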
Accomplishments that we are proud of
- End-to-end computer-use loop with planning, execution, verification, and recovery
- Safe Mode confirmations and dry-run highlighting that improve trust
- Visual DOM mapping that enables more stable targeting than raw coordinates
- Local visual memory timeline with retrieval-based recall
- Real-time streaming that makes the agent's behavior transparent
What we learned
- Reliability beats feature count in judging and in user trust
- Structured control like browser automation should be used before vision fallback
- Verification must be explicit, measurable, and bounded
- Minimal context with retrieval improves accuracy and speed
What’s next for Prism
- Move from vision-first to accessibility-first targeting where available
- Improve Visual DOM with richer roles, better priors, and stronger verification
- Add policy rules for allowed actions and required confirmations
- Expand workflow presets for repeatable real world tasks
- Tighten memory retrieval to return fewer, higher signal frames with better summaries
- Eventually make a Desktop OS powered by Gemini