ScreenMind

Inspiration

I got tired of the copy-paste loop — grabbing text from my IDE, uploading screenshots to ChatGPT, switching tabs to explain what I was looking at. I was the bottleneck stitching context together across every app I used.

What if AI could just see what I see? No copying, no uploading. An always-on copilot that watches my screen, remembers everything across tabs, and answers instantly.

What it does

ScreenMind is a desktop overlay that silently captures your screen, understands it with vision AI, and stores that understanding in a knowledge graph. Ask a
question by typing or voice — it retrieves the most relevant screen context and streams an answer in under a second. Discord to VS Code to Google Docs — it
follows along and is ready when you need it.

How I built it

Reka Vision analyzes screenshots in the background — identifying apps, visible content, and user activity
Fastino GLiNER2 extracts entities (apps, technologies, projects) locally on-device
Neo4j as a RAG system — every screen description stored as a node, linked to entities via graph relationships. Queries use hybrid retrieval (graph traversal
- full-text search)
Groq (LLaMA 4 Scout) generates streamed responses from retrieved context via SSE
Tavily powers web search when the user needs real-time information
Desktop client built with Python + Tkinter with push-to-talk voice input

Challenges

The first version took 8-9 seconds per answer — every query triggered a fresh vision API call. The breakthrough: background captures were already analyzing
the screen but throwing away the descriptions. I rebuilt it as a RAG pipeline — store every description in Neo4j, skip the vision API on queries, retrieve
cached context instead. Response time dropped from ~8s to under 1s for the first token.

I also fought LLM hallucination — early versions called Discord "Slack" and fabricated details. Fixed through model switching, structured vision prompts, and strict grounding rules.

What I learned

RAG isn't just for documents — a knowledge graph of screen states gives AI persistent memory of your workflow
The biggest wins came from eliminating unnecessary API calls, not optimizing them
Streaming transforms perceived latency — first word in under a second makes it feel instant

Built With

fastapi
fastino-gliner2
groq-(llama-4-scout)
neo4j
openai-sdk
python
reka-vision-api
server-sent-events
tavily-search-api
tkinter
uvicorn

Updates

StrangeStorm243-bit Shankar started this project — Feb 27, 2026 07:41 PM EST

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.