Inspiration
I've always been frustrated by how siloed AI tools are: you open a chat window, describe your problem from scratch, and get a generic answer that has no idea what's actually on your screen. I wanted to build something that felt like a genuinely smart friend sitting next to you at your computer, one that could see what you're looking at, remember what you've been working on all day, and respond in a way that is visible and physical, not just text in a box.
What It Does
Nova is a floating macOS AI assistant that activates with a global hotkey (Ctrl+Option) from any app. It captures your screen, passes it to Gemini 2.5 Flash along with a rolling memory of your session, and responds by:
- Speaking an answer out loud
- Animating a cursor to the most relevant element on screen
- Sliding in a glassmorphic response panel for anything worth reading
- Updating a live Context Web — a graph that maps your session and draws edges between queries that Gemini determines are genuinely semantically related
How I Built It
The app is written in Swift using SwiftUI and native macOS frameworks — ScreenCaptureKit for screen capture, CGEvent for cursor animation, AVSpeechSynthesizer for text-to-speech, and NSVisualEffectView for the glassmorphic UI. The AI backbone is Gemini 2.5 Flash, which handles vision, conversation, and session graph reasoning all in a single call.
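As a rough illustration of the cursor animation, here is a minimal sketch of how the intermediate positions could be computed. The `Point` type, `ease`, and `cursorPath` are hypothetical names, and the smoothstep easing is an assumption; the write-up does not say which curve Nova uses. Each point would then be handed to whatever API actually moves or draws the cursor (CGEvent in Nova's case), which is left out here.

```swift
import Foundation

// Hypothetical point type so the sketch stays framework-free.
struct Point: Equatable {
    let x: Double
    let y: Double
}

// Smoothstep easing: maps 0 to 0 and 1 to 1, with zero slope at both ends,
// so the cursor starts and stops gently. (Assumed curve, not Nova's.)
func ease(_ t: Double) -> Double {
    return t * t * (3 - 2 * t)
}

// Returns steps + 1 positions from start to end, inclusive of both.
func cursorPath(from start: Point, to end: Point, steps: Int) -> [Point] {
    precondition(steps >= 1, "need at least one step")
    return (0...steps).map { i in
        let t = ease(Double(i) / Double(steps))
        return Point(x: start.x + (end.x - start.x) * t,
                     y: start.y + (end.y - start.y) * t)
    }
}
```

Calling `cursorPath(from: Point(x: 0, y: 0), to: Point(x: 100, y: 50), steps: 4)` yields five points, with the cursor passing through the midpoint on the third.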
Every response from Gemini returns a structured action payload: spoken text, display text, cursor coordinates, and metadata that powers the Context Web — including which prior session entries are semantically connected to the current query and why.
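To make the payload concrete, here is a minimal sketch of how such a response could be decoded in Swift. The field names, the snake_case JSON keys, and the `ActionPayload` and `SessionLink` types are all assumptions for illustration; the actual schema Nova uses is not shown in this write-up.

```swift
import Foundation

// Hypothetical link from the current query to a prior session entry.
struct SessionLink: Codable, Equatable {
    let entryIndex: Int   // index of the related prior entry
    let reason: String    // the model's stated reason for the link
}

// Hypothetical action payload; optional fields may be absent or null.
struct ActionPayload: Codable {
    let spokenText: String
    let displayText: String?
    let cursorX: Double?
    let cursorY: Double?
    let links: [SessionLink]
}

// Decodes a JSON string, mapping snake_case keys to camelCase properties.
func decodePayload(_ json: String) throws -> ActionPayload {
    let decoder = JSONDecoder()
    decoder.keyDecodingStrategy = .convertFromSnakeCase
    return try decoder.decode(ActionPayload.self, from: Data(json.utf8))
}

// Made-up sample response, not real model output.
let samplePayload = """
{
  "spoken_text": "The build error is on line 12.",
  "display_text": null,
  "cursor_x": 412.0,
  "cursor_y": 230.5,
  "links": [
    {"entry_index": 3, "reason": "Same project as the earlier build question"}
  ]
}
"""
```

Keeping `displayText` and the cursor coordinates optional lets a single type cover responses that only speak, only point, or do both.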
Challenges
The biggest challenge was making the Context Web actually meaningful. An early version connected session entries based on time and keyword overlap, which looked active but was basically noise. The breakthrough was delegating the connection logic entirely to Gemini: on each query, it reviews the indexed session history, decides which prior entries are related, and returns a specific reason for each link. That turned the graph from decoration into something that reflects how the session actually fits together.
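The graph side of that approach can be sketched as follows. This is an illustration under assumptions, not Nova's actual code: `ContextWeb`, `Edge`, and `addEntry` are hypothetical names, and the index check reflects one plausible division of labor, where the model proposes links and plain code refuses any that point at entries that do not exist.

```swift
import Foundation

// One edge per model-proposed link, carrying the model's stated reason.
struct Edge: Hashable {
    let from: Int
    let to: Int
    let reason: String
}

struct ContextWeb {
    private(set) var nodes: [String] = []   // query text per session entry
    private(set) var edges: Set<Edge> = []

    // Adds a new entry and connects it to the prior entries the model named.
    // Links to unknown or not-yet-existing indices are dropped, not trusted.
    @discardableResult
    mutating func addEntry(_ query: String,
                           links: [(index: Int, reason: String)]) -> Int {
        let newIndex = nodes.count
        nodes.append(query)
        for link in links where link.index >= 0 && link.index < newIndex {
            edges.insert(Edge(from: newIndex, to: link.index, reason: link.reason))
        }
        return newIndex
    }

    // Indices of all entries connected to the given one, in ascending order.
    func neighbors(of index: Int) -> [Int] {
        return edges
            .filter { $0.from == index || $0.to == index }
            .map { $0.from == index ? $0.to : $0.from }
            .sorted()
    }
}
```

Unlike the time-and-keyword version, nothing here decides relatedness; the code only records and validates what the model returned.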
What I Learned
Gemini 2.5 Flash's 1M-token context window is underutilized in most apps. You can pass a full screen capture, rich session history, and a detailed system prompt in a single call, and in my experience the quality of reasoning improved with how much context I gave it. The real design challenge isn't the model's capacity; it's deciding what to ask the model to reason about versus what to handle in code.
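One example of something that can stay in code is bounding the rolling session memory before it goes into the prompt. The sketch below is hypothetical: `SessionEntry`, `rollingMemory`, and `historyPrompt` are illustrative names, and counting characters is a crude stand-in for counting tokens.

```swift
import Foundation

// Hypothetical session record: a stable index plus a short summary.
struct SessionEntry {
    let index: Int
    let summary: String
}

// Keeps the most recent entries whose combined summary length fits the
// budget, preserving chronological order in the result.
func rollingMemory(_ history: [SessionEntry], characterBudget: Int) -> [SessionEntry] {
    var kept: [SessionEntry] = []
    var used = 0
    for entry in history.reversed() {
        let cost = entry.summary.count
        if used + cost > characterBudget { break }
        kept.append(entry)
        used += cost
    }
    return kept.reversed()
}

// Renders entries with their indices so the model can refer back to them.
func historyPrompt(_ entries: [SessionEntry]) -> String {
    return entries.map { "[\($0.index)] \($0.summary)" }.joined(separator: "\n")
}
```

Printing the original index next to each summary is what would let the model name specific prior entries when it proposes links.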