Inspiration

I noticed how Google is integrating Gemini across its platforms (Gmail, YouTube, Colab, Docs), and it raised a question: why isn't there a single AI that works across ALL screens? Not just inside Google apps, but everywhere on your computer. That's the gap this project aims to close.

What It Does

Overlay AI Assistant is an always-on AI companion that stays with you across all apps, screens, and windows. It offers:

  • Instant chat with screenshot attachment for context-aware answers
  • Live screen awareness that understands what you're looking at
  • Guided instructions that visually highlight exactly where to click: not just telling you, but showing you

How We Built It

  • PyQt5 for the glassmorphism UI with transparent overlays
  • Google Gemini API for multimodal AI (understands both text and images)
  • Tesseract OCR to read text from screenshots and locate UI elements
  • 6-step pipeline that separates logical reasoning from screen analysis for efficiency
  • PIL/Pillow for fast screenshot capture
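
The pipeline can be sketched as a chain of six injected stages. Everything here is illustrative: the stage names, signatures, and the `GuidanceResult` type are assumptions for the sketch, with the real stages calling Pillow, Gemini, Tesseract, and PyQt5 respectively.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class GuidanceResult:
    answer: str                 # the AI's textual instruction
    target_word: Optional[str]  # UI label to highlight, if any

def run_pipeline(
    capture: Callable[[], bytes],          # 1. grab the screen (PIL/Pillow)
    ocr: Callable[[bytes], list],          # 2. read on-screen text (Tesseract)
    reason: Callable[[str], str],          # 3. text-only reasoning (Gemini)
    analyze: Callable[[bytes, str], GuidanceResult],  # 4. screen analysis (Gemini vision)
    locate: Callable[[list, str], tuple],  # 5. find the target's coordinates (OCR data)
    draw: Callable[[tuple], None],         # 6. draw the highlight overlay (PyQt5)
    question: str,
) -> GuidanceResult:
    shot = capture()
    words = ocr(shot)
    plan = reason(question)            # cheap text reasoning first...
    result = analyze(shot, plan)       # ...expensive vision call second
    if result.target_word:
        draw(locate(words, result.target_word))
    return result
```

Splitting step 3 (logical reasoning) from step 4 (screen analysis) is what keeps the pipeline efficient: the multimodal call only runs once the text-only model has framed the task.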

Challenges We Ran Into

The biggest challenge was overlay alignment. The AI could identify what to click, but the highlight rectangle kept appearing in the wrong place. We solved this by:

  1. Separating OCR coordinate systems from screen coordinates
  2. Implementing DPI-aware scaling
  3. Using a hybrid approach: AI identifies the target word, OCR finds its exact position
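
The DPI-scaling part of the fix boils down to one conversion: Tesseract reports boxes in the screenshot's physical pixels, while PyQt5 positions overlays in logical (DPI-scaled) coordinates, so each box must be divided by the device pixel ratio (available in Qt as `QScreen.devicePixelRatio()`). A minimal sketch, with the function name being our own for illustration:

```python
def ocr_box_to_screen(box, device_pixel_ratio):
    """Map an OCR bounding box (physical screenshot pixels) to the
    logical screen coordinates used for overlay placement.

    box: (left, top, width, height), as in Tesseract's image_to_data output.
    device_pixel_ratio: e.g. 2.0 on a HiDPI display.
    """
    left, top, width, height = box
    return (
        round(left / device_pixel_ratio),
        round(top / device_pixel_ratio),
        round(width / device_pixel_ratio),
        round(height / device_pixel_ratio),
    )
```

On a standard display (ratio 1.0) this is a no-op, which is exactly why the misalignment only showed up once we tested on HiDPI screens.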

Accomplishments We're Proud Of

The guided learning system is our biggest achievement. It can help both non-technical users (who struggle to find settings) and technical users (who want quick navigation in unfamiliar software). It's like having a patient teacher who never gets tired of showing you where to click.

What We Learned

  • Multimodal AI is powerful, but it needs a structured pipeline to be efficient
  • OCR is surprisingly accurate for UI element detection
  • Sometimes the best UX is the simplest: just draw a rectangle around what matters
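
The "AI names the word, OCR finds it" hand-off works even when Tesseract slightly misreads a label, because the lookup can be fuzzy. This is a sketch, not our exact code: the 0.8 threshold and the simplified `{"text", "box"}` dict shape (a flattened view of `pytesseract.image_to_data` output) are assumptions.

```python
from difflib import SequenceMatcher

def locate_target(ocr_words, target, min_ratio=0.8):
    """Return the bounding box of the OCR word best matching `target`.

    ocr_words: list of {"text": str, "box": (left, top, width, height)} dicts.
    Returns the best box, or None if nothing is similar enough.
    """
    best_box, best_ratio = None, min_ratio
    for word in ocr_words:
        # Compare case-insensitively so "settings" matches the "Settings" label.
        ratio = SequenceMatcher(None, word["text"].lower(), target.lower()).ratio()
        if ratio >= best_ratio:
            best_box, best_ratio = word["box"], ratio
    return best_box
```

For example, an OCR misread like "Setings" still resolves to the "Settings" target, while unrelated words fall below the threshold and return None, so no rectangle is drawn at all rather than a wrong one.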

What's Next for Overlay AI Assistant

Three major upgrades are planned:

  1. Voice interaction – Ask questions and receive guidance via audio
  2. Cursor control – Optional auto-click
  3. MCP server integration – Connect the AI assistant to external tools and APIs for expanded capabilities

Built With

python · pyqt5 · google-gemini-api · tesseract-ocr · pillow