Project Name: Lumino

Model: Gemini 3 Flash Preview

Overview

This project introduces the Universal Visual Interface (UVI). Powered by Gemini 3 Flash Preview, Lumino gives your computer the ability to "see" and understand your screen. Instead of waiting for developers to build custom API integrations between disconnected apps, Lumino bridges them visually. It acts as a set of eyes for your operating system, breaking down the barriers between software and making any application automatable through natural language and simulated keystrokes. Lumino also watches how you work: show it how to perform a repetitive task once, and it will figure out the logic and complete the rest for you.

The Problem: Disconnected Apps and Complex Automation

Apps Are Isolated: Most software programs are closed off from one another. Moving information from a PDF invoice into an accounting system usually requires typing it by hand. Connecting the two automatically requires programming interfaces called APIs, which often do not exist.
Automation is for Programmers: Creating shortcuts for repetitive computer tasks usually requires coding knowledge. Regular users do not have an easy way to simply show their computer a task and ask it to repeat the process.
The Digital Divide: Many older or specialized programs are incredibly difficult for people with visual or motor impairments to use. Because these programs lack modern accessibility features, disabled users are often locked out of using them entirely.

The Solution: Multimodal Understanding (Vision + Input)

Lumino solves these bottlenecks by utilizing Multimodal AI. By feeding the screen's visual state along with the user's captured keystrokes directly into Gemini 3 Flash Preview, Lumino sees the interface and understands the user's exact goal simultaneously.

Visual App Bridging: Lumino connects disconnected apps simply by looking at them. It can visually analyze a source document and transfer that data to a destination form using simulated keystrokes, completely bypassing the need for custom API integrations or backend scripts.
Learn by Watching: Instead of writing code to automate a task, you simply show Lumino how to do it once. It watches your screen, captures your keystrokes to understand the pattern, and then repeats the task for you automatically.
Universal Accessibility: Lumino instantly makes any app easier to use. Instead of forcing users to navigate confusing menus just to export, reformat, or copy data, Lumino allows them to draw custom floating buttons on their screen (Phantom UI) to instantly extract and type out the exact information they need. It acts as an intelligent overlay, making data entry and extraction accessible to everyone.
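
As a rough illustration of the Visual App Bridging step described above, the sketch below turns a set of already-extracted fields into an ordered list of simulated keyboard actions. It is not Lumino's actual engine. The names KeyAction and build_keystroke_plan, the field names, and the assumption that the destination form advances with the Tab key are all hypothetical. The vision step (asking Gemini to read the source document) and the OS-level key injection are deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class KeyAction:
    """One simulated keyboard action (hypothetical type for this sketch)."""
    kind: str   # "type" for literal text, "press" for a named key
    value: str


def build_keystroke_plan(
    extracted: Dict[str, str],
    form_order: List[str],
    submit: bool = False,
) -> List[KeyAction]:
    """Map extracted source fields onto a destination form.

    `extracted` stands in for the field/value pairs the model read from
    the source document. `form_order` is the order in which the
    destination form visits its fields when Tab is pressed (an
    assumption about the target application).
    """
    plan: List[KeyAction] = []
    for index, field in enumerate(form_order):
        value = extracted.get(field, "")
        if value:
            plan.append(KeyAction("type", value))
        # Fields missing from the source are left blank, but Tab is
        # still pressed so later values land in the right boxes.
        if index < len(form_order) - 1:
            plan.append(KeyAction("press", "tab"))
    if submit:
        plan.append(KeyAction("press", "enter"))
    return plan


if __name__ == "__main__":
    invoice = {"invoice_number": "INV-001", "total": "42.50"}
    for action in build_keystroke_plan(invoice, ["invoice_number", "vendor", "total"]):
        print(action.kind, action.value)
```

Keeping the plan as plain data, separate from the code that injects keys, means it can be inspected or shown to the user for confirmation before anything is typed.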

How to Try It (Live Demo)

Note on Testing: The live link currently runs in a Google AI Studio preview environment. Because Lumino operates as an OS-level visual layer, a web preview can only showcase part of its functionality. To test the full feature set locally, please clone the GitHub repository.

The Innovation (Differentiation)

Unlike traditional automation tools or screen recorders that rely on brittle code selectors, hidden accessibility tags, or fragile coordinate maps, our engine uses a combination of Visual Understanding and Input Tracking. It sees the UI as a human does, while capturing the user's intent via keyboard input. Because it depends only on what is rendered on screen, the system is unaffected by underlying code changes that leave the visible interface intact, and it is compatible with any application, from modern web apps to legacy virtual desktop infrastructure and mainframe terminals.

Core Gemini 3 Features

Mimic Mode (One-Shot Imitation Learning): We leverage Gemini’s multimodal capabilities to watch a user perform a complex task once. It instantly synthesizes the visual logic with the captured keystrokes and generates a robust automation strategy for batch execution, democratizing automation for non-technical users.
Phantom Bridge (Semantic Interoperability): Gemini acts as a live translator between disconnected windows. It visually analyzes a "Source" (like a PDF invoice) and a "Destination" (like a database form), semantically mapping data fields and executing the transfer without a single line of integration code.
Reality Hook (Cognitive Monitoring): Instead of simple pixel-matching, this mode uses Gemini to monitor live screen states for abstract concepts (such as "Alert me when the market sentiment turns bearish"), giving the OS cognitive awareness.
Ghost Mode (Direct Neural Input): Gemini operates as a "Ghost Typist" directly within the keyboard interface. By focusing on any text field and using a hotkey, users can input natural language instructions for Gemini to generate and inject context-aware text via OS-level simulation.
Rewind Mode (Visual Memory): This mode utilizes a secure, local circular buffer of screen history to serve as a visual memory. Leveraging Gemini's long-context window, users can query past screen states to retrieve specific details like flight numbers or previous correspondence from up to 10 minutes prior.
Visual Mode (Instant Analysis): Functioning as a "Lens" for the desktop, this mode allows for the instant capture of any screen region for immediate Q&A. Gemini provides high-speed vision analysis to translate text, extract code, or rephrase content from arbitrary UI elements in real-time.
Neural Link (The Knowledge Drawer): This persistent "Brain" serves as the agent's local database, ingesting raw data files (.txt, .md, .rs, .ts). Gemini employs Priority Reasoning to cross-reference this documentation, ensuring every action or text generation remains grounded in your unique project data.
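
The circular buffer behind Rewind Mode can be pictured with a short sketch like the one below. This is an illustrative assumption, not the project's implementation: the class name RewindBuffer, the Frame type, the frame cap, and the use of raw bytes for screenshots are all hypothetical. Only the 10-minute window comes from the description above. The sketch bounds memory two ways, by a maximum frame count and by age.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass(frozen=True)
class Frame:
    """A captured screen state (hypothetical type for this sketch)."""
    timestamp: float  # seconds, e.g. from time.time()
    image: bytes      # encoded screenshot data


class RewindBuffer:
    """Fixed-size, time-bounded history of recent frames."""

    def __init__(self, window_seconds: float = 600.0, max_frames: int = 1200):
        self.window_seconds = window_seconds
        # deque(maxlen=...) silently discards the oldest frame once full,
        # which gives circular-buffer behaviour.
        self._frames: Deque[Frame] = deque(maxlen=max_frames)

    def push(self, frame: Frame) -> None:
        """Add a frame, then drop anything older than the window."""
        self._frames.append(frame)
        while self._frames and (
            frame.timestamp - self._frames[0].timestamp > self.window_seconds
        ):
            self._frames.popleft()

    def since(self, now: float, seconds_back: float) -> List[Frame]:
        """Return frames captured within the last `seconds_back` seconds."""
        cutoff = now - min(seconds_back, self.window_seconds)
        return [f for f in self._frames if f.timestamp >= cutoff]

    def __len__(self) -> int:
        return len(self._frames)
```

In a design like this, the frames returned by since() would be the ones passed to the model along with the user's question, so older screen content is never sent because it is no longer in memory.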

Hackathon Requirements & Alignment

1. The Value Guarantee

Functionality is King: Lumino is not a UI mockup; it is a fully functional operating system layer. The visual reasoning, input tracking, and automated keystroke injections have been engineered to run locally and reliably.
Original Intent: The concept of a Universal Visual Interface and the orchestration of Gemini's multimodal capabilities to bypass traditional APIs is entirely original. The core architectural logic and semantic mapping engine are custom-built for this submission.

2. Social Good Alignment

Purpose-Driven Mission: Lumino was built to tackle Digital Accessibility and the Digital Divide.
Real-World Logic: Millions of users with motor or cognitive impairments are locked out of legacy software that lacks modern accessibility compliance. Because Lumino sees the screen as a human does, it retrofits any application with natural language control and automated workflows. A user who struggles to use a mouse to navigate a cluttered interface can use Lumino to completely bypass those menus. Instead of hunting for specific buttons, they can use Phantom Bridge or Mimic Mode to instantly extract information and execute data entry purely through simulated keystrokes, actively removing barriers to employment and digital independence.

Future Scaling & Roadmap

Autonomous OS Agents (Proactive AI): Evolving Lumino from user-triggered commands to a background cognitive agent. By analyzing continuous desktop activity securely, Lumino could proactively identify repetitive tasks and suggest creating an automated script to save the user time.
Enterprise RPA Disruption: Scaling from a personal productivity tool to a decentralized enterprise robotic process automation (RPA) platform. Organizations could deploy Lumino to automate complex, multi-software legacy workflows without hiring expensive software integration engineers.
Cross-Device & Mobile Ecosystems: Expanding the UVI to mobile operating systems (Android/iOS). Mobile apps are heavily restricted by "sandboxing," making traditional automation almost impossible. By using purely visual context and simulated touches, Lumino could eventually bypass these restrictions to bring true, cross-app automation to smartphones.
Self-Healing Automation Scripts: Implementing a visual feedback loop: if an application's UI undergoes a major redesign, Lumino detects the task failure, visually scans the new layout, and updates its own automation logic without human intervention.
Local Edge Deployment (Privacy-First): Optimizing the multimodal engine to run entirely on-device via local NPU (Neural Processing Unit) hardware. This keeps all screen data on the device and removes network round-trip latency, making Lumino viable for highly secure sectors like healthcare and finance.
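
The self-healing feedback loop in the roadmap above could take roughly the following shape. This is a speculative sketch of a feature that is not yet built. The function run_with_self_healing and its execute and replan callbacks are hypothetical placeholders: execute stands in for running a plan and visually checking that it succeeded, and replan stands in for asking the model to rebuild the plan from the new layout.

```python
from typing import Callable, List, Tuple

Plan = List[str]  # a plan is modelled here as a list of step descriptions


def run_with_self_healing(
    plan: Plan,
    execute: Callable[[Plan], bool],
    replan: Callable[[Plan], Plan],
    max_repairs: int = 2,
) -> Tuple[bool, Plan]:
    """Run a plan and repair it when it fails.

    Returns (succeeded, plan_that_was_last_tried). The number of repair
    attempts is capped so a permanently broken task cannot loop forever.
    """
    current = list(plan)
    for attempt in range(max_repairs + 1):
        if execute(current):
            return True, current
        if attempt < max_repairs:
            # Failure detected: derive an updated plan before retrying.
            current = replan(current)
    return False, current
```

Returning the repaired plan lets the caller store it, so the next run starts from the updated logic rather than repeating the repair.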