Inspiration
Every time I wanted to automate something on my computer filling a form, searching a website, adding something to cart. I had two options: do it manually, or write a Selenium script. I wanted to build something where you just say what you want and the computer does it like having an assistant who can see your screen.
What it does
ScreenPilot lets you control any application or website on your computer using plain English commands no scripting, no setup, no technical knowledge required.
You type a command like "Open Nike and add a shoe to cart" or *"Open YouTube" and the agent takes over. It captures your screen, analyzes what's visible using Gemini 2.5 Flash's vision capabilities, decides what to click or type, executes the action, and then looks at the screen again to verify before moving to the next step.
How we built it
ScreenPilot is built around a see → decide → act → verify loop:
- The user types a command in the browser dashboard
- FastAPI receives it and triggers the ADK agent
- The agent calls
capture_screen()this captures a screenshot and injects it as raw image bytes directly into the next Gemini message. - Gemini 2.5 Flash analyzes the screenshot and decides which tool to call (click, type, scroll, etc.)
- pyautogui executes the action on the real screen
- The agent captures the screen again to verify, then repeats
Every tool call is streamed live to the dashboard via Server-Sent Events so you can watch the agent work in real time.
Challenges we ran into
- Coordinate precision screenshots are resized to 1024px for token efficiency but pyautogui clicks at native resolution. Keeping these consistent required careful handling of the capture pipeline
- Repeat loop detection the agent would get stuck clicking the same element. Detection had to persist across session recycles, otherwise the counter reset and the loop continued
- One tool per turn enforcing this purely through prompting was unreliable. Had to enforce it at the infrastructure level by breaking event collection after the first tool call
- Streaming across threads the ADK agent runs in a thread pool while FastAPI is async. Bridging SSE events between the two required thread-safe queue management.
## Accomplishments that we're proud of
the core breakthrough was figuring out that ADK serializes tool results as plain text, meaning screenshot data never reached the model as real pixels. Bypassing the tool system entirely and injecting raw image bytes directly into the message loop was the fix that made the whole project possible.
the agent successfully navigates complex, dynamic sites like Nike and YouTube end-to-end, handling popups, size selectors, search bars, and cart flows without any site-specific code.
## What we learned
The biggest insight was that vision-based agents are only as good as their screenshot pipeline. Getting images into the model correctly was the hardest part not the AI reasoning.
Returning an image dict from a tool never reached the model as actual pixels. The fix was intercepting the
capture_screencall in the agent loop and injecting raw bytes directly. Gemini would sometimes call multiple tools in one response despite prompt instructions. Required both prompt-level rules and early-breaking the event loop in code. Each screenshot is 2–4k tokens. After 10 turns that's 40k+ tokens just from images. Solved by pruning old screenshots from session history and recycling sessions every 8 turns. ## What's next for ScreenPilot Voice input- speak commands hands-free instead of typing, making it truly zero-touch automation Task scheduling- run commands on a schedule, e.g. "every morning open my email and summarize unread messages" Remote machine control- control a machine over the network so one person can pilot another machine entirely Failure recovery-smarter handling when a website layout changes or an unexpected popup appears mid-task Multi-step task chaining- queue multiple commands that execute sequentially, e.g. "open Gmail, read the latest email, then summarize it in a Google Doc"
Built With
- fastapi
- gemini-2.5-flash
- google-adk
- pillow
- pyautogui
- python
- vertex-ai
Log in or sign up for Devpost to join the conversation.