Inspiration

I got tired of the copy-paste loop — grabbing text from my IDE, uploading screenshots to ChatGPT, switching tabs to explain what I was looking at. I was the bottleneck stitching context together across every app I used.

What if AI could just see what I see? No copying, no uploading. An always-on copilot that watches my screen, remembers everything across tabs, and answers instantly.

What it does

ScreenMind is a desktop overlay that silently captures your screen, understands it with vision AI, and stores that understanding in a knowledge graph. Ask a
question by typing or by voice, and it retrieves the most relevant screen context and starts streaming an answer in under a second. From Discord to VS Code to
Google Docs, it follows along and is ready when you need it.

How I built it

  • Reka Vision analyzes screenshots in the background — identifying apps, visible content, and user activity
  • Fastino GLiNER2 extracts entities (apps, technologies, projects) locally on-device
  • Neo4j as a RAG system: every screen description stored as a node, linked to entities via graph relationships. Queries use hybrid retrieval (graph traversal
    + full-text search)
  • Groq (LLaMA 4 Scout) generates streamed responses from retrieved context via SSE
  • Tavily powers web search when the user needs real-time information
  • Desktop client built with Python + Tkinter, with push-to-talk voice input
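
To make the hybrid retrieval idea concrete, here is a minimal in-memory sketch. It is not the project's Neo4j code: the ScreenNode class, the hybrid_search function, and the 2x weighting on entity matches are all hypothetical stand-ins, with a set intersection playing the role of graph traversal and simple word overlap playing the role of full-text search.

```python
from dataclasses import dataclass, field


@dataclass
class ScreenNode:
    """One stored screen description plus the entities linked to it."""
    description: str
    entities: set = field(default_factory=set)


def hybrid_search(nodes, query, known_entities, top_k=2):
    """Rank nodes by entity links (graph side) plus keyword overlap (text side)."""
    lowered = query.lower()
    words = set(lowered.split())
    # Entities named in the query; a stand-in for following graph relationships.
    query_entities = {e for e in known_entities if e.lower() in lowered}
    scored = []
    for node in nodes:
        graph_score = len(query_entities & node.entities)
        text_score = len(words & set(node.description.lower().split()))
        score = 2 * graph_score + text_score  # weighting is an arbitrary assumption
        if score > 0:
            scored.append((score, node))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [node for _, node in scored[:top_k]]


nodes = [
    ScreenNode("VS Code showing a FastAPI route in main.py", {"VS Code", "FastAPI"}),
    ScreenNode("Discord channel discussing the demo schedule", {"Discord"}),
    ScreenNode("Google Docs draft of the project pitch", {"Google Docs"}),
]
entities = {"VS Code", "FastAPI", "Discord", "Google Docs"}
results = hybrid_search(nodes, "what was said in discord about the demo", entities)
print(results[0].description)
```

In a real graph database the entity match would be a relationship traversal and the keyword match a full-text index lookup; the sketch only shows how the two signals can be combined into one ranking.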

Challenges

The first version took 8-9 seconds per answer — every query triggered a fresh vision API call. The breakthrough: background captures were already analyzing
the screen but throwing away the descriptions. I rebuilt it as a RAG pipeline — store every description in Neo4j, skip the vision API on queries, retrieve
cached context instead. Response time dropped from ~8s to under 1s for the first token.
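
The shape of that change can be sketched in a few lines. This is an illustrative toy rather than the actual pipeline: ScreenMemory, fake_vision, and the in-memory list are assumed names standing in for the background capture loop, the vision API, and Neo4j.

```python
class ScreenMemory:
    """Keeps descriptions from background captures so queries can reuse them."""

    def __init__(self, describe_screen):
        self._describe_screen = describe_screen  # the slow vision call
        self._descriptions = []

    def capture(self, screenshot):
        """Background path: analyze once and keep the result."""
        self._descriptions.append(self._describe_screen(screenshot))

    def context_for_query(self, limit=3):
        """Query path: return recent stored descriptions, no vision call."""
        return self._descriptions[-limit:]


vision_calls = []  # records every invocation of the placeholder vision function


def fake_vision(screenshot):
    # Placeholder for the real vision API (hypothetical).
    vision_calls.append(screenshot)
    return f"description of {screenshot}"


memory = ScreenMemory(fake_vision)
for shot in ["frame-1", "frame-2"]:
    memory.capture(shot)

calls_before = len(vision_calls)
context = memory.context_for_query()
calls_after = len(vision_calls)
print(context, calls_after - calls_before)
```

The query path only reads what the background path already paid for, so the expensive call count does not change when a question is asked.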

I also fought LLM hallucination: early versions called Discord "Slack" and fabricated details. I fixed this by switching models, structuring the vision prompts, and adding strict grounding rules.

What I learned

  • RAG isn't just for documents — a knowledge graph of screen states gives AI persistent memory of your workflow
  • The biggest wins came from eliminating unnecessary API calls, not optimizing them
  • Streaming transforms perceived latency — first word in under a second makes it feel instant
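
As a rough illustration of the streaming point, the sketch below wraps tokens in Server-Sent Events frames as they arrive. The sse_events and fake_llm_stream names, the JSON payload shape, and the [DONE] sentinel are assumptions made for the example, not the project's actual wire format.

```python
import json


def sse_events(tokens):
    """Wrap each token in a Server-Sent Events frame as soon as it arrives."""
    for token in tokens:
        payload = json.dumps({"token": token})
        # An SSE frame is a "data:" line followed by a blank line.
        yield f"data: {payload}\n\n"
    yield "data: [DONE]\n\n"  # end-of-stream sentinel (a convention, assumed here)


def fake_llm_stream():
    # Placeholder for a streaming LLM response (hypothetical).
    yield from ["Screen", "Mind", " ready"]


frames = list(sse_events(fake_llm_stream()))
first_frame = next(sse_events(fake_llm_stream()))
print(first_frame, end="")
```

Because the generator yields the first frame before the rest of the response exists, the client can render the first word while later tokens are still being produced.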

Built With

  • fastapi
  • fastino-gliner2
  • groq-(llama-4-scout)
  • neo4j
  • openai-sdk
  • python
  • reka-vision-api
  • server-sent-events
  • tavily-search-api
  • tkinter
  • uvicorn