Inspiration

I was learning Blender a few weeks ago. I had a 3D model open, I wanted to export it, and I had absolutely no idea where to start. I Googled it. Got a YouTube video from 2019. Opened it. Skipped through 8 minutes of intro. Found the right section. Went back to Blender. Forgot what they said. Went back to YouTube.

That loop — software open, tutorial open, switching back and forth, pausing, rewinding — felt completely broken. The answer was right there on my screen. Why couldn't something just look at what I was doing and tell me?

That frustration became Reality Copilot.


What it does

Reality Copilot is an AI assistant that watches your screen and guides you in real time.

You share your screen. You ask a question — by voice or by typing. The AI sees your UI, understands what you're looking at, and responds with clear step-by-step instructions spoken aloud.

It works with anything: Blender, Excel, VS Code, Photoshop, government forms, unfamiliar websites. No tutorials. No Googling. You just ask.

Core features:

  • 🖥 Screen capture via WebRTC getDisplayMedia — the AI sees exactly what you see
  • 🎤 Voice input via Web Speech API — ask naturally, hands-free
  • 👁 Multimodal vision — Gemini 2.0 Flash reads your UI and identifies buttons, menus, and errors
  • 📋 Step-by-step guidance — structured, numbered instructions every time
  • 🔊 Voice output — responses are spoken aloud so you never have to look away from your work
  • ⚡ Real-time — screenshot to spoken response in ~2 seconds

How we built it

The stack came together fast because every piece had one job:

Frontend — React + Vite. The browser draws a frame from the shared screen onto a canvas and serializes it with canvas.toDataURL(), records voice input using the Web Speech API, and speaks responses back with SpeechSynthesisUtterance. Everything visual lives here.

Backend — Python + FastAPI, deployed on Google Cloud Run. It receives the question, the base64-encoded screenshot, and conversation history. It builds a multimodal prompt and sends it to Gemini.
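
In rough terms the endpoint looks like this (a minimal sketch: field names like screenshot_b64 and the run_gemini helper are illustrative, not the exact production code):

    # Minimal sketch of the /api/ask endpoint (field and helper names illustrative)
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class AskRequest(BaseModel):
        question: str              # what the user asked, typed or transcribed
        screenshot_b64: str        # base64-encoded JPEG frame of the shared screen
        history: list[dict] = []   # prior question/answer turns for context

    @app.post("/api/ask")
    async def ask(req: AskRequest):
        # run_gemini (sketched below) builds the multimodal prompt and returns the parsed JSON
        return run_gemini(req.question, req.screenshot_b64, req.history)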

AI — Gemini 2.0 Flash via the Google GenAI Python SDK. The model receives the screenshot as an inline image alongside the user's question and is prompted to return structured JSON: a plain-language explanation, numbered steps, and a UI hint pointing to the exact element to interact with next.
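
The Gemini call itself is roughly this, a sketch using the google-genai SDK: the system prompt is paraphrased and the role/content shape of the history turns is an assumption, but the inline-image-plus-question structure matches what the backend sends:

    # Sketch of the multimodal Gemini call (system prompt paraphrased)
    import base64
    import json

    from google import genai
    from google.genai import types

    client = genai.Client()  # picks up the API key / project from the environment

    SYSTEM_PROMPT = (
        "You are a screen-aware assistant. Identify UI element names, menu labels, "
        "and button text in the screenshot, then answer as JSON with keys "
        "text, steps, and ui_note."
    )

    def run_gemini(question: str, screenshot_b64: str, history: list[dict]) -> dict:
        image_bytes = base64.b64decode(screenshot_b64)
        # Fold prior turns into the prompt so the model keeps conversational context
        context = "\n".join(f"{t['role']}: {t['content']}" for t in history)
        response = client.models.generate_content(
            model="gemini-2.0-flash",
            contents=[
                types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
                f"{context}\n\nUser question: {question}",
            ],
            config=types.GenerateContentConfig(
                system_instruction=SYSTEM_PROMPT,
                response_mime_type="application/json",  # ask for raw JSON, no prose
            ),
        )
        return json.loads(response.text)  # {"text": ..., "steps": [...], "ui_note": ...}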

Deployment — single Docker container on Cloud Run. The frontend proxies API calls to the same origin in production. Zero infrastructure to manage.

User asks question
       ↓
Screenshot captured (canvas → base64 JPEG)
       ↓
POST /api/ask → Cloud Run (FastAPI)
       ↓
Gemini 2.0 Flash (image + question + history)
       ↓
JSON { text, steps, ui_note }
       ↓
Render in chat + spoken aloud via TTS
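
For the Blender export question from the intro, that JSON might look something like this (values are illustrative):

    {
      "text": "You can export the model from Blender's File menu.",
      "steps": [
        "Click File in the top-left menu bar",
        "Hover over Export",
        "Choose glTF 2.0 (.glb/.gltf) and confirm the export dialog"
      ],
      "ui_note": "File menu, top-left corner of the Blender window"
    }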

Challenges I ran into

Getting Gemini to always return clean JSON was harder than expected. Early on it would add markdown fences, preamble text, or just ignore the format instructions entirely. The fix was combining response_mime_type: "application/json" in the generation config with a very explicit system prompt — and stripping any accidental fences as a fallback.
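
The fence-stripping fallback is only a few lines (a sketch of the idea, not the exact code):

    # Strip accidental ```json fences before parsing (fallback only)
    import json

    def parse_model_json(raw: str) -> dict:
        cleaned = raw.strip()
        if cleaned.startswith("```"):
            cleaned = cleaned.strip("`")            # drop the surrounding backticks
            cleaned = cleaned.removeprefix("json")  # drop a leading "json" language tag
        return json.loads(cleaned.strip())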

Screenshot timing was tricky. Capturing too early meant the frame wasn't ready. Capturing too late felt sluggish. Landing on "capture at the moment the user sends their message" gave the best balance of accuracy and speed.

Windows deployment — the deploy scripts used bash-style \ line continuations, which CMD doesn't support. Had to manually run each gcloud command as a single line. Small thing, cost real time.

Model name changes — gemini-2.0-flash-exp was deprecated mid-build. Had to run ListModels to find the current stable name. Worth checking this early in any Gemini project.
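
Checking is quick with the google-genai SDK (client.models.list() is the ListModels call):

    # Print the model names currently available to this key/project
    from google import genai

    client = genai.Client()
    for model in client.models.list():
        print(model.name)   # e.g. "models/gemini-2.0-flash"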


Accomplishments that I am proud of

  • The full loop — share screen, ask by voice, hear the answer — works in under 3 seconds end to end
  • Gemini accurately identifies UI elements from screenshots, including specific button names, menu items, and error messages it was never explicitly trained on
  • Successfully deployed on Google Cloud Run with zero DevOps experience going in
  • The experience genuinely feels like having a patient expert sitting next to you — not like using a chatbot

What I learned

  • Multimodal prompting is an art. Telling Gemini how to look at an image matters as much as giving it the image. Specifying "identify UI element names, menu labels, and button text" dramatically improved response quality.
  • Structured output is non-negotiable for apps. Free-form AI responses are great for chat. For an app that needs to render steps and speak them aloud, JSON output with a strict schema is the only reliable approach (see the sketch after this list).
  • Cloud Run is genuinely fast to ship on. From zero to a public HTTPS URL was under 10 minutes once the Dockerfile was clean.
  • Voice makes AI feel alive. Adding SpeechSynthesisUtterance took 4 lines of code and completely changed how the product feels to use.
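
One way to make that schema strict (not necessarily what the app does today, which leans on response_mime_type plus an explicit system prompt) is to pass a pydantic model as response_schema; the field names below match the app's JSON shape, the rest is a sketch:

    # Sketch: constraining Gemini's output to the app's JSON shape with pydantic
    from pydantic import BaseModel
    from google.genai import types

    class Guidance(BaseModel):
        text: str          # plain-language explanation
        steps: list[str]   # numbered instructions, read aloud by TTS
        ui_note: str       # the UI element to interact with next

    config = types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=Guidance,   # the SDK enforces this shape on the response
    )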

What's next for Reality Copilot

  • Bounding box overlay — draw a highlight directly on the screen around the exact button or menu item the AI is pointing to
  • Gemini Live API — move from screenshot-per-question to a continuous live video stream for truly real-time guidance
  • App detection — automatically detect which application is open and pre-load context about its UI patterns
  • Multi-step task tracking — let the AI guide users through long workflows (e.g. "set up a React project from scratch") with progress tracking across multiple questions
  • Mobile screen sharing — extend to Android/iOS for on-device app guidance
