Screenwise - Learn from anything on-screen, by conversation

The Problem

A couple of years ago, I got interested in cosmetic chemistry and enrolled in an online skincare formulation course. I quickly hit a wall. Dense PDFs filled with scientific terminology - ceramide ratios, skin barrier pH, ingredient interactions - and English isn't my first language. Every unfamiliar term meant switching to a browser, searching, losing my place, then trying to reconnect with where I left off.

This constant context switching breaks focus. And it's not unique to cosmetic chemistry. Anyone studying research papers, technical documentation, or unfamiliar software faces the same friction: the AI assistant lives in a browser tab, but your work doesn't.

The Solution

I wanted an assistant that could see what I was reading and answer questions in context - without leaving the material. So I built one.

Screenwise is a Windows desktop companion powered by Gemini. It sees what you see, hears what you ask, and responds with answers grounded in your actual screen content.

Architecture

Gemini diagram

Features & Gemini Integration

Feature	Description	Gemini Model
Screen Capture + Chat	Capture region/window/full screen, attach to chat, ask "What is this?" or "Where do I click?"	Gemini 3 Flash (Vision)
Interactive Study	Load PDFs, YouTube transcripts, or screenshots; view materials on left, chat with AI on right	Gemini 3 Flash (Vision + Chat)
Voice Input	Record a spoken question, get a text reply with screen context	Gemini 3 Flash (Audio)
Text AI	Floating assistant: fix grammar, rewrite, summarize, translate, expand/shorten	Gemini 3 Flash (Text)
Image Generation	Create flashcards, diagrams, visual study aids	Gemini 3 Pro Image
Auto-Read (TTS)	AI reads responses aloud for hands-free learning	Gemini 2.5 Flash TTS
Live Voice Sessions	Real-time voice call with AI in Ask, Quiz, Debate, or Teach mode	Gemini 2.5 Flash Live
Real-time Screen Sharing	Share your screen during live calls; AI sees what you see as you navigate	Gemini 2.5 Flash Live
Live Translation	Real-time transcription + translation of system audio	Deepgram STT + Gemini 3 Flash

Gemini powers the core of Screenwise. The primary workflow: capture screen content → Gemini 3 Flash analyzes with vision → ask questions via text or voice → get context-aware answers. For text editing, the floating Text AI handles grammar, rewrites, and translations. For hands-free use, Gemini 2.5 Flash TTS reads responses aloud. For deeper learning, Gemini 2.5 Flash Live enables real-time voice sessions.

Tech Stack

Component	Technology
Framework	Tauri 2 (Rust backend)
Frontend	Vanilla JavaScript
AI Engine	Gemini API
Real-time Voice	Gemini 2.5 Flash Live (native audio)
Live Transcription	Deepgram STT
PDF Rendering	PDF.js + Rust pdf-extract

Challenges

Getting live translation working reliably was harder than I expected. I tested multiple speech-to-text options and kept running into stability issues, so I switched system-audio transcription to Deepgram while keeping Gemini 3 Flash as the translation engine.

Multi-monitor DPI scaling was another pain point, because region capture can drift when displays use different resolutions and scaling settings. I had to iterate on coordinate mapping and capture logic to make selection feel consistent across monitors.

For screen sharing, the tricky part was tuning image size and streaming rate. Too large or too frequent and performance drops, too small or too slow and the assistant loses useful context.

What's Next

Saved study sessions: Export materials and conversations as reusable packages
Spaced repetition: Track weak areas and resurface them over time

Built With

antigravity
bolt
claude
gemini
tauri

Updates

Wen Z started this project — Feb 04, 2026 07:34 AM EST

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.