Inspiration
I got tired of the copy-paste loop — grabbing text from my IDE, uploading screenshots to ChatGPT, switching tabs to explain what I was looking at. I was the bottleneck stitching context together across every app I used.
What if AI could just see what I see? No copying, no uploading. An always-on copilot that watches my screen, remembers everything across tabs, and answers instantly.
What it does
ScreenMind is a desktop overlay that silently captures your screen, understands it with vision AI, and stores that understanding in a knowledge graph. Ask a
question by typing or voice — it retrieves the most relevant screen context and streams an answer in under a second. Discord to VS Code to Google Docs — it
follows along and is ready when you need it.
How I built it
- Reka Vision analyzes screenshots in the background — identifying apps, visible content, and user activity
- Fastino GLiNER2 extracts entities (apps, technologies, projects) locally on-device
- Neo4j as a RAG system — every screen description stored as a node, linked to entities via graph relationships. Queries use hybrid retrieval (graph traversal
- full-text search)
- Groq (LLaMA 4 Scout) generates streamed responses from retrieved context via SSE
- Tavily powers web search when the user needs real-time information
- Desktop client built with Python + Tkinter with push-to-talk voice input
Challenges
The first version took 8-9 seconds per answer — every query triggered a fresh vision API call. The breakthrough: background captures were already analyzing
the screen but throwing away the descriptions. I rebuilt it as a RAG pipeline — store every description in Neo4j, skip the vision API on queries, retrieve
cached context instead. Response time dropped from ~8s to under 1s for the first token.
I also fought LLM hallucination — early versions called Discord "Slack" and fabricated details. Fixed through model switching, structured vision prompts, and strict grounding rules.
What I learned
- RAG isn't just for documents — a knowledge graph of screen states gives AI persistent memory of your workflow
- The biggest wins came from eliminating unnecessary API calls, not optimizing them
- Streaming transforms perceived latency — first word in under a second makes it feel instant
Log in or sign up for Devpost to join the conversation.