The Problem

A couple of years ago, I got interested in cosmetic chemistry and enrolled in an online skincare formulation course. I quickly hit a wall. Dense PDFs filled with scientific terminology - ceramide ratios, skin barrier pH, ingredient interactions - and English isn't my first language. Every unfamiliar term meant switching to a browser, searching, losing my place, then trying to reconnect with where I left off.

This constant context switching breaks focus. And it's not unique to cosmetic chemistry. Anyone studying research papers, technical documentation, or unfamiliar software faces the same friction: the AI assistant lives in a browser tab, but your work doesn't.

The Solution

I wanted an assistant that could see what I was reading and answer questions in context - without leaving the material. So I built one.

Screenwise is a Windows desktop companion powered by Gemini. It sees what you see, hears what you ask, and responds with answers grounded in your actual screen content.


Architecture

Gemini diagram

Features & Gemini Integration

Feature Description Gemini Model
Screen Capture + Chat Capture region/window/full screen, attach to chat, ask "What is this?" or "Where do I click?" Gemini 3 Flash (Vision)
Interactive Study Load PDFs, YouTube transcripts, or screenshots; view materials on left, chat with AI on right Gemini 3 Flash (Vision + Chat)
Voice Input Record a spoken question, get a text reply with screen context Gemini 3 Flash (Audio)
Text AI Floating assistant: fix grammar, rewrite, summarize, translate, expand/shorten Gemini 3 Flash (Text)
Image Generation Create flashcards, diagrams, visual study aids Gemini 3 Pro Image
Auto-Read (TTS) AI reads responses aloud for hands-free learning Gemini 2.5 Flash TTS
Live Voice Sessions Real-time voice call with AI in Ask, Quiz, Debate, or Teach mode Gemini 2.5 Flash Live
Real-time Screen Sharing Share your screen during live calls; AI sees what you see as you navigate Gemini 2.5 Flash Live
Live Translation Real-time transcription + translation of system audio Deepgram STT + Gemini 3 Flash

Gemini powers the core of Screenwise. The primary workflow: capture screen content → Gemini 3 Flash analyzes with vision → ask questions via text or voice → get context-aware answers. For text editing, the floating Text AI handles grammar, rewrites, and translations. For hands-free use, Gemini 2.5 Flash TTS reads responses aloud. For deeper learning, Gemini 2.5 Flash Live enables real-time voice sessions.


Tech Stack

Component Technology
Framework Tauri 2 (Rust backend)
Frontend Vanilla JavaScript
AI Engine Gemini API
Real-time Voice Gemini 2.5 Flash Live (native audio)
Live Transcription Deepgram STT
PDF Rendering PDF.js + Rust pdf-extract

Challenges

Getting live translation working reliably was harder than I expected. I tested multiple speech-to-text options and kept running into stability issues, so I switched system-audio transcription to Deepgram while keeping Gemini 3 Flash as the translation engine.

Multi-monitor DPI scaling was another pain point, because region capture can drift when displays use different resolutions and scaling settings. I had to iterate on coordinate mapping and capture logic to make selection feel consistent across monitors.

For screen sharing, the tricky part was tuning image size and streaming rate. Too large or too frequent and performance drops, too small or too slow and the assistant loses useful context.


What's Next

  • Saved study sessions: Export materials and conversations as reusable packages
  • Spaced repetition: Track weak areas and resurface them over time

Built With

  • antigravity
  • bolt
  • claude
  • gemini
  • tauri
Share this project:

Updates