ScreenGenie — AI Screen Guide for Any App

Home screen with glassmorphism UI. Tap "Start Genie" to launch the floating AI overlay that works on top of any Android app.
Genie is active and ready. The overlay persists across apps — switch to any app and the genie stays on top.
Genie bubble floating over Chrome. Tap the genie to expand full-screen input mode and ask your question.
Guidance mode: Gemini 2.5 Flash analyzes the screenshot, the overlay highlights the target with a spotlight, and the genie speaks the instruction.
Architecture

What inspired you

Every week, my mom calls me asking the same question: "Where do I tap to log in?" She's not alone — millions of older adults struggle with unfamiliar app interfaces, repeatedly asking family members for help with tasks that seem simple to digital natives. Existing solutions mean switching apps to watch tutorials or scrolling through help docs — but what if an AI agent could see your screen and point to the exact button for you?

We built ScreenGenie — a visual UI agent that observes any Android screen, understands the interface purely from pixels (no APIs, no DOM access), and guides users to the right action in real time.

How you built your project

What it does

ScreenGenie is an AI agent that floats on top of any Android app as an overlay. It observes the screen visually and acts as the user's guide:

Tap the floating genie — it expands to a full-screen input mode with suggestion chips
Describe your intent — "How do I sign in?" or "Search by image"
The agent sees your screen — ScreenGenie captures a screenshot and sends it to Gemini 2.5 Flash multimodal, which visually interprets the UI layout, identifies the exact target element, and returns precise coordinates — all without any API or DOM access to the underlying app
Follow the spotlight — A dark overlay highlights the exact button or menu with the genie agent floating beside it, delivering step-by-step action instructions

A built-in Safety Gate classifies each action's risk level — warning users before irreversible actions like deleting accounts or making payments.
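As a rough illustration of how such a gate could sit between the model's answer and the spotlight, here is a minimal sketch. It assumes the model has already returned one of the three risk labels (low/medium/high) named in this write-up. The `GateDecision` type, the `safety_gate` function, and the "warn from medium upward" threshold are hypothetical and not taken from the ScreenGenie code.

```python
from dataclasses import dataclass

# Risk levels named in the write-up, ordered from least to most risky.
RISK_LEVELS = ("low", "medium", "high")


@dataclass
class GateDecision:
    risk: str            # normalized risk label
    needs_warning: bool  # whether to warn the user before guiding them


def safety_gate(model_risk: str, warn_from: str = "medium") -> GateDecision:
    """Turn a model-supplied risk label into a warn / no-warn decision.

    `warn_from` is an assumed threshold, not the project's actual policy.
    """
    risk = model_risk.strip().lower()
    if risk not in RISK_LEVELS:
        # Fail closed: an unrecognized label is treated as the highest risk.
        risk = "high"
    needs_warning = RISK_LEVELS.index(risk) >= RISK_LEVELS.index(warn_from)
    return GateDecision(risk=risk, needs_warning=needs_warning)
```

Treating an unknown label as high risk is a deliberate choice in this sketch: a malformed model answer then produces an extra warning rather than a silent pass.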

Architecture & Implementation

Architecture: Flutter overlay → MediaProjection screenshot → Gemini 2.5 Flash multimodal analysis → Action instruction + Spotlight overlay render

Gemini 2.5 Flash (multimodal) via the official google_generative_ai Dart SDK — visually interprets screenshots, identifies UI elements, and outputs actionable instructions with precise coordinates in ~5 seconds. Zero DOM/API dependency — works purely from visual understanding
Flutter + flutter_overlay_window — system-wide floating overlay agent that works on top of any app
Custom coordinate pipeline — Gemini returns 0-999 normalized coordinates → denormalized to physical pixels → converted to logical dp for precise overlay positioning
PathFillType.evenOdd spotlight rendering, because the standard saveLayer + BlendMode.clear approach doesn't render in the Android overlay's TextureView; the spotlight hole is cut with an even-odd path fill instead
MediaProjection API (native Kotlin via MethodChannel) for real-time screen capture beneath the overlay
Google Cloud Run — FastAPI backend containerized and deployed on Google Cloud for scalable agent hosting
Safety Gate — AI-powered risk classification (low/medium/high) prevents the agent from guiding users toward destructive actions without warning
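The coordinate pipeline in the list above can be sketched in a few lines. This is a minimal illustration, not the app's Dart code: it assumes each axis is converted independently, that 999 maps to the far edge of the screen, and that the device pixel ratio is known. The function name `norm_to_logical_dp` is hypothetical.

```python
def norm_to_logical_dp(c_norm: float, size_physical_px: float,
                       device_pixel_ratio: float) -> float:
    """Convert one normalized 0-999 coordinate to logical dp along one axis."""
    if not 0 <= c_norm <= 999:
        raise ValueError("normalized coordinate must be within 0-999")
    if device_pixel_ratio <= 0:
        raise ValueError("device pixel ratio must be positive")
    # Step 1: denormalize to physical pixels along this axis.
    physical_px = c_norm / 999 * size_physical_px
    # Step 2: convert physical pixels to logical dp for overlay positioning.
    return physical_px / device_pixel_ratio
```

For example, on a hypothetical screen 1080 physical pixels wide with a device pixel ratio of 3.0, a normalized x of 999 maps to 360 dp and a normalized x of 0 maps to 0 dp.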

Accomplishments that we're proud of

A visual UI agent that works on top of any Android app — no app-specific APIs, no DOM parsing, purely visual understanding
~5 second response time from screenshot to actionable guidance overlay
Zero app-specific training — fully generalizes to unseen UIs through Gemini's multimodal vision capabilities
Genie agent character floating near the spotlight with contextual speech bubble delivering action instructions
Adaptive layout: speech bubble positions horizontally next to genie with 3-step fallback (right → left → below)
Robust fallback chain: MediaProjection → demo screenshot → mock response
Tested across Chrome, Settings, Gmail, and third-party apps, with no app-specific integration
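The 3-step speech bubble fallback (right → left → below) mentioned above might look roughly like the sketch below. All geometry is in logical dp. The `Rect` type, the `place_bubble` function, the 8 dp gap, and the final clamping step are assumptions made for illustration and do not come from the project.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    width: float
    height: float

    @property
    def right(self) -> float:
        return self.left + self.width

    @property
    def bottom(self) -> float:
        return self.top + self.height


def place_bubble(genie: Rect, bubble_w: float, bubble_h: float,
                 screen_w: float, screen_h: float,
                 gap: float = 8.0) -> Tuple[str, Rect]:
    """Try right, then left, then below; return the first fully visible spot."""
    candidates = [
        ("right", Rect(genie.right + gap, genie.top, bubble_w, bubble_h)),
        ("left", Rect(genie.left - gap - bubble_w, genie.top, bubble_w, bubble_h)),
        ("below", Rect(genie.left, genie.bottom + gap, bubble_w, bubble_h)),
    ]
    for side, rect in candidates:
        fits = (rect.left >= 0 and rect.top >= 0
                and rect.right <= screen_w and rect.bottom <= screen_h)
        if fits:
            return side, rect
    # Nothing fits: keep the last candidate ("below") and clamp it on-screen.
    side, rect = candidates[-1]
    left = min(max(rect.left, 0.0), max(screen_w - bubble_w, 0.0))
    top = min(max(rect.top, 0.0), max(screen_h - bubble_h, 0.0))
    return side, Rect(left, top, bubble_w, bubble_h)
```

On a hypothetical 360 x 800 dp screen, a genie near the left edge gets its bubble on the right, a genie near the right edge gets it on the left, and a bubble too wide for either side drops below.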

What's next for ScreenGenie

Voice interaction — Ask questions by speaking using Gemini Live API
Multi-step guidance — The agent guides users through complex multi-screen flows with sequential steps
Auto-execution — Use AccessibilityService to tap the highlighted element on user confirmation, completing the agent loop from observation to action
Cross-app workflows — Automate tasks that span multiple apps (e.g., "Copy this address and navigate there")

The challenges you faced

flutter_overlay_window v0.4.5 height bug: matchParent calculates the overlay height incorrectly. We traced it to an inverted condition in OverlayService.java:244 and worked around it with explicit dp values
Rendering issue: saveLayer + BlendMode.clear doesn't render on the overlay's transparent FlutterTextureView; solved with PathFillType.evenOdd
Coordinate conversion math: converting between Gemini's normalized 0-999 space, physical pixels, and logical dp required careful math with the device pixel ratio: $$\text{Logical dp} = \left( \frac{C_{\text{norm}}}{999} \times S_{\text{physical}} \right) \times \frac{1}{\text{Device Pixel Ratio}}$$ Here $C_{\text{norm}}$ is the normalized coordinate returned by Gemini and $S_{\text{physical}}$ is the screen size in physical pixels along the same axis
Initialization timing: the overlay engine starts before the Android OverlayService registers the MethodChannel; solved with retry logic and error handling
Prompt engineering for spatial accuracy: getting Gemini to consistently return accurate coordinates required iterating on the system prompt, including explicit 0-999 coordinate format instructions and a structured JSON output schema. Adding screen context significantly improved targeting precision
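Structured JSON output still benefits from validation on the receiving side before anything is drawn. The sketch below shows one way to do that, assuming the model replies with a JSON object. The field names `x`, `y`, and `instruction` are hypothetical, since the actual schema is not given here, and clamping out-of-range values into 0-999 (instead of rejecting them) is also an assumption.

```python
import json
from typing import Dict, Union


def parse_target(raw: str) -> Dict[str, Union[int, str]]:
    """Parse and validate a model reply holding 0-999 coordinates."""
    data = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    result: Dict[str, Union[int, str]] = {}
    for key in ("x", "y"):  # hypothetical field names
        value = data.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"missing or non-numeric field: {key}")
        # Clamp small overshoots (for example 1000) into the 0-999 range.
        result[key] = min(max(int(round(value)), 0), 999)

    instruction = data.get("instruction")  # hypothetical field name
    if not isinstance(instruction, str) or not instruction.strip():
        raise ValueError("missing instruction text")
    result["instruction"] = instruction.strip()
    return result
```

A reply that fails validation can then be routed to the same retry or fallback path as any other failed request, so the spotlight never gets drawn at a bogus position.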

What you learned

Android overlay rendering has fundamentally different constraints than regular app rendering — many standard Flutter painting techniques simply don't work
Gemini 2.5 Flash is remarkably capable at spatial reasoning — it accurately identifies UI elements, understands hierarchical layouts, and returns precise coordinates even on complex, cluttered screens, all from a single screenshot
The google_generative_ai Dart SDK enabled direct on-device integration without a backend roundtrip, reducing latency and simplifying architecture
Building a visual UI agent that overlays other apps requires careful UX thinking — the agent must be helpful without being obstructive
Pure visual understanding is powerful — by not depending on DOM or accessibility APIs, the agent works universally across any app without per-app integration

Built With

android-mediaprojection-api
cloud-run
dart
fastapi
flutter
gemini-2.5-flash
google-genai-sdk
kotlin
python

Updates

Minji Kim started this project — Mar 15, 2026 10:03 AM EDT
