The Problem
A couple of years ago, I got interested in cosmetic chemistry and enrolled in an online skincare formulation course. I quickly hit a wall: the material was dense PDFs filled with scientific terminology (ceramide ratios, skin barrier pH, ingredient interactions), and English isn't my first language. Every unfamiliar term meant switching to a browser, searching, losing my place, and then trying to pick up where I had left off.
This constant context switching breaks focus. And it's not unique to cosmetic chemistry. Anyone studying research papers, technical documentation, or unfamiliar software faces the same friction: the AI assistant lives in a browser tab, but your work doesn't.
The Solution
I wanted an assistant that could see what I was reading and answer questions in context - without leaving the material. So I built one.
Screenwise is a Windows desktop companion powered by Gemini. It sees what you see, hears what you ask, and responds with answers grounded in your actual screen content.
Architecture

Features & Gemini Integration
| Feature | Description | Gemini Model |
|---|---|---|
| Screen Capture + Chat | Capture region/window/full screen, attach to chat, ask "What is this?" or "Where do I click?" | Gemini 3 Flash (Vision) |
| Interactive Study | Load PDFs, YouTube transcripts, or screenshots; view materials on left, chat with AI on right | Gemini 3 Flash (Vision + Chat) |
| Voice Input | Record a spoken question, get a text reply with screen context | Gemini 3 Flash (Audio) |
| Text AI | Floating assistant: fix grammar, rewrite, summarize, translate, expand/shorten | Gemini 3 Flash (Text) |
| Image Generation | Create flashcards, diagrams, visual study aids | Gemini 3 Pro Image |
| Auto-Read (TTS) | AI reads responses aloud for hands-free learning | Gemini 2.5 Flash TTS |
| Live Voice Sessions | Real-time voice call with AI in Ask, Quiz, Debate, or Teach mode | Gemini 2.5 Flash Live |
| Real-time Screen Sharing | Share your screen during live calls; AI sees what you see as you navigate | Gemini 2.5 Flash Live |
| Live Translation | Real-time transcription + translation of system audio | Deepgram STT + Gemini 3 Flash |
Gemini powers the core of Screenwise. The primary workflow is: capture screen content → Gemini 3 Flash analyzes it with vision → ask questions by text or voice → get context-aware answers. For text editing, the floating Text AI handles grammar fixes, rewrites, and translations. For hands-free use, Gemini 2.5 Flash TTS reads responses aloud. For deeper learning, Gemini 2.5 Flash Live enables real-time voice sessions.
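As a rough illustration of the capture-then-ask step, the sketch below pairs a base64-encoded screenshot with a question in a single request to the Gemini REST `generateContent` endpoint. It is not Screenwise's actual code: the function names `buildVisionRequest` and `askAboutScreen` are hypothetical, and `MODEL_ID` is a placeholder to replace with whichever Gemini model ID your key can access.

```javascript
// Placeholder: substitute a real Gemini model ID (assumption, not from the project).
const MODEL_ID = "MODEL_ID_PLACEHOLDER";
const API_BASE = "https://generativelanguage.googleapis.com/v1beta";

// Build the request body: one user turn holding the screenshot and the question.
function buildVisionRequest(pngBase64, question) {
  if (!pngBase64 || !question) {
    throw new Error("Both a screenshot and a question are required");
  }
  return {
    contents: [
      {
        role: "user",
        parts: [
          { inline_data: { mime_type: "image/png", data: pngBase64 } },
          { text: question },
        ],
      },
    ],
  };
}

// Send the request and return the text of the first candidate (empty if none).
async function askAboutScreen(apiKey, pngBase64, question) {
  const url = `${API_BASE}/models/${MODEL_ID}:generateContent`;
  const response = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json", "x-goog-api-key": apiKey },
    body: JSON.stringify(buildVisionRequest(pngBase64, question)),
  });
  if (!response.ok) {
    throw new Error(`Gemini request failed: ${response.status}`);
  }
  const data = await response.json();
  const parts = data.candidates?.[0]?.content?.parts ?? [];
  return parts.map((p) => p.text ?? "").join("");
}
```

Keeping the body builder separate from the network call makes the payload easy to inspect without an API key. In a Tauri app the request might just as well be issued from the Rust side so the key never reaches the webview.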
Tech Stack
| Component | Technology |
|---|---|
| Framework | Tauri 2 (Rust backend) |
| Frontend | Vanilla JavaScript |
| AI Engine | Gemini API |
| Real-time Voice | Gemini 2.5 Flash Live (native audio) |
| Live Transcription | Deepgram STT |
| PDF Rendering | PDF.js + Rust pdf-extract |
Challenges
Getting live translation working reliably was harder than I expected. I tested multiple speech-to-text options and kept running into stability issues, so I switched system-audio transcription to Deepgram while keeping Gemini 3 Flash as the translation engine.
Multi-monitor DPI scaling was another pain point, because region capture can drift when displays use different resolutions and scaling settings. I had to iterate on coordinate mapping and capture logic to make selection feel consistent across monitors.
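One way to picture the coordinate-mapping problem is the sketch below. It assumes the selection is drawn in logical (CSS) pixels on an overlay covering one monitor, and that each monitor is described by a scale factor plus a position and size in physical pixels. That shape is loosely modelled on what Tauri reports for monitors, but the object layout and the name `toPhysicalRect` are assumptions for illustration, not the project's capture logic.

```javascript
function clamp(value, min, max) {
  return Math.min(Math.max(value, min), max);
}

// selection: { x, y, width, height } in logical pixels, relative to the monitor's top-left.
// monitor:   { scaleFactor, position: { x, y }, size: { width, height } } in physical pixels.
function toPhysicalRect(selection, monitor) {
  const { scaleFactor, position, size } = monitor;

  // Scale the edges rather than width/height so rounding cannot shift the far edge.
  const left = Math.round(selection.x * scaleFactor);
  const top = Math.round(selection.y * scaleFactor);
  const right = Math.round((selection.x + selection.width) * scaleFactor);
  const bottom = Math.round((selection.y + selection.height) * scaleFactor);

  // Keep the rectangle inside this monitor's physical bounds.
  const l = clamp(left, 0, size.width);
  const t = clamp(top, 0, size.height);
  const r = clamp(right, 0, size.width);
  const b = clamp(bottom, 0, size.height);

  // Offset by the monitor's origin in the virtual desktop.
  return { x: position.x + l, y: position.y + t, width: r - l, height: b - t };
}
```

For example, a 200×100 logical selection at (100, 50) on a monitor with a 1.5 scale factor whose origin is at physical x = 1920 maps to a 300×150 physical rectangle at (2070, 75). Using each monitor's own scale factor, rather than a single global one, is the point of the exercise.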
For screen sharing, the tricky part was tuning image size and streaming rate. If frames are too large or too frequent, performance drops; if they are too small or too infrequent, the assistant loses useful context.
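The two knobs described above can be sketched as a pair of small helpers: one that caps a frame's longest edge while preserving aspect ratio, and one that lets a frame through only if enough time has passed since the last one. The names `fitWithin` and `createFrameGate` are hypothetical, and the values in the usage note are illustrative, not the settings Screenwise ended up with.

```javascript
// Scale (width, height) down so the longest edge is at most maxEdge; never upscale.
function fitWithin(width, height, maxEdge) {
  const scale = Math.min(1, maxEdge / Math.max(width, height));
  return {
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale)),
  };
}

// Returns a function that answers "should this frame be sent?" for a given timestamp.
function createFrameGate(minIntervalMs) {
  let lastSent = -Infinity;
  return function shouldSend(nowMs) {
    if (nowMs - lastSent < minIntervalMs) {
      return false; // too soon after the previous frame, drop this one
    }
    lastSent = nowMs;
    return true;
  };
}
```

With a hypothetical cap of 1024 pixels, `fitWithin(2560, 1440, 1024)` yields 1024×576, and a gate created with `createFrameGate(1000)` passes at most one frame per second when fed `Date.now()` or `performance.now()` timestamps. Tuning then comes down to adjusting those two numbers and watching latency and answer quality.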
What's Next
- Saved study sessions: Export materials and conversations as reusable packages
- Spaced repetition: Track weak areas and resurface them over time
Built With
- antigravity
- bolt
- claude
- gemini
- tauri
