Neural OS Actuator

Inspiration

For a non-disabled person, the computer screen is a window to the world. For the 300 million visually impaired people globally, it is often a wall. Current accessibility tools are fragmented: screen readers mechanically read out the underlying markup (the DOM) but fail on complex visuals, while voice control tools force users to memorize exhausting grid commands.

We asked ourselves: What if the computer could actually see what you see?

We wanted to build a Neural Nervous System for the PC—an agent that connects Ears (Voice Input), Eyes (Vision AI), and a Voice (Human-like Audio) into one seamless experience. Our goal was to create a digital companion that doesn't just "read text," but understands the desktop environment and acts on it, bridging the gap between human intent and digital execution.

What it does

Neural OS Actuator is a multimodal AI agent designed to give full operating system control to users who cannot see or use their hands. Currently prototyped as a logic core, it simulates a complete accessibility loop:

It Listens: It accepts natural language instructions.
It Sees: It ingests raw screenshots of the user's desktop state.
It Thinks (The Brain): Using Google Gemini 1.5 Flash, it visually analyzes the UI to identify elements that traditional screen readers miss—like icons, graphs, or unlabeled buttons.
It Plans Action: It calculates the precise X,Y coordinates needed to click or the text needed to type.
It Speaks: Using ElevenLabs, it responds with a hyper-realistic, empathetic voice, confirming actions to reassure the user.
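The five steps above can be sketched as a single turn of a loop. This is only an illustrative outline, not the project's actual code: the `Plan` fields, the `run_turn` function, and the five callables are hypothetical names, and each stage is passed in as a plain function so that the real Gemini, ElevenLabs, or input-driver calls could be swapped in later.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Plan:
    """One planned step: what to do and what to say afterwards."""
    action: str        # "click" or "type" (hypothetical vocabulary)
    x: int = 0         # target coordinates for a click
    y: int = 0
    text: str = ""     # text to enter for a "type" action
    speech: str = ""   # confirmation sentence for the voice layer


def run_turn(
    listen: Callable[[], str],
    see: Callable[[], bytes],
    think: Callable[[str, bytes], Plan],
    act: Callable[[Plan], None],
    speak: Callable[[str], None],
) -> Plan:
    """Run one listen -> see -> think/plan -> act -> speak cycle."""
    instruction = listen()                 # Ears: natural language instruction
    screenshot = see()                     # Eyes: raw screenshot bytes
    plan = think(instruction, screenshot)  # Brain: vision model picks an action
    act(plan)                              # Execution (simulated in the prototype)
    speak(plan.speech)                     # Voice: confirm what was done
    return plan


if __name__ == "__main__":
    spoken = []
    result = run_turn(
        listen=lambda: "open the settings menu",
        see=lambda: b"<screenshot bytes>",
        think=lambda text, image: Plan("click", x=40, y=12, speech="Opening settings."),
        act=lambda plan: None,  # no real mouse movement in this sketch
        speak=spoken.append,
    )
    print(result, spoken)
```

Keeping every stage behind a plain function makes it possible to test the loop with stubs, which matches the prototype's scope of proving the logic before any OS integration exists.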

How we built it

We focused our hackathon time on the hardest part of the stack: the Intelligence Layer. Instead of getting bogged down in OS driver compatibility, we built the Neural Core entirely within Google AI Studio.

The Brain (Gemini 1.5 Flash): We engineered complex System Instructions to force Gemini to act as an OS operator. We taught it to ignore decorative elements and output strictly structured JSON commands for mouse control.
The Voice (ElevenLabs): We integrated the concept of ElevenLabs to give the agent a specific "Companion Persona." We focused on latency and tone, ensuring the AI sounds helpful rather than robotic.
The Architecture: We simulated a "Hybrid Edge" model. In our design, the heavy visual reasoning happens in the cloud (via Gemini), while the future local client would handle the privacy-sensitive execution.
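To illustrate what "strictly structured JSON commands" could mean in practice, here is a minimal validation sketch. The schema is an assumption made for this example (an `action` field that is either `click` with integer `x` and `y`, or `type` with non-empty `text`); the write-up does not specify the real format, and `parse_command` is a hypothetical helper.

```python
import json

# Hypothetical action vocabulary; the real prompt may define a different one.
ALLOWED_ACTIONS = {"click", "type"}


def parse_command(raw: str) -> dict:
    """Parse a model reply and reject anything outside the assumed schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")

    action = data.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")

    if action == "click":
        x, y = data.get("x"), data.get("y")
        # bool is a subclass of int in Python, so exclude it explicitly.
        for value in (x, y):
            if not isinstance(value, int) or isinstance(value, bool):
                raise ValueError("click needs integer x and y")
        return {"action": "click", "x": x, "y": y}

    text = data.get("text")
    if not isinstance(text, str) or not text:
        raise ValueError("type needs non-empty text")
    return {"action": "type", "text": text}


if __name__ == "__main__":
    print(parse_command('{"action": "click", "x": 120, "y": 48}'))
```

Returning a fresh dictionary with only the expected keys drops any extra fields the model may add, so later stages see only validated input.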

Challenges we ran into

The "Local vs. Cloud" Dilemma: We originally attempted to run open-source vision models locally on our laptops to ensure 100% privacy. However, we found that consumer hardware was too slow for a real-time assistive tool (waiting 15 seconds for a click is not accessible). We pivoted to Gemini 1.5 Flash in the cloud to achieve the necessary speed.
Spatial Hallucinations: Large Language Models are great at text but often bad at pixels. We had to iterate extensively on our prompts to stop the model from "guessing" coordinates and force it to accurately map the screen grid.
Defining "Success": Since we couldn't build the full OS integration in time, we had to carefully scope our project to prove the logic works in AI Studio, rather than shipping a buggy desktop app.
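One common way to reduce coordinate guessing is to ask the model for positions on a fixed, resolution-independent grid and convert them to pixels in code. The sketch below assumes a 0 to 1000 grid on both axes; that grid size and the `grid_to_pixels` helper are assumptions for illustration, not a description of the prompts actually used.

```python
def grid_to_pixels(gx: float, gy: float, screen_w: int, screen_h: int,
                   grid: int = 1000) -> tuple[int, int]:
    """Convert a point on an assumed 0..grid scale to on-screen pixel coordinates."""
    # Clamp first, so an out-of-range answer cannot produce an off-screen click.
    gx = min(max(gx, 0), grid)
    gy = min(max(gy, 0), grid)
    # Scale to the screen and keep the result inside the last valid pixel.
    px = min(round(gx / grid * screen_w), screen_w - 1)
    py = min(round(gy / grid * screen_h), screen_h - 1)
    return px, py


if __name__ == "__main__":
    # Centre of the grid on a 1920x1080 display.
    print(grid_to_pixels(500, 500, 1920, 1080))
```

Doing the scaling in code means the model never needs to know the real screen resolution, and the clamp gives a hard bound on where a click can land.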

Accomplishments that we're proud of

Proving the Concept: We demonstrated that Gemini 1.5 Flash is capable of navigating a messy, unstructured desktop environment purely through screenshots.
Human-Centric Design: By prioritizing ElevenLabs for audio, we showed that accessibility tools don't have to sound robotic. The emotional connection significantly improves the user experience.
Privacy-First Thinking: Even though we used the cloud, we designed the architecture to support "Edge Computing" principles, minimizing data transfer to protect user privacy.

What we learned

Vision > Code: Computer Vision is a more robust solution for accessibility than parsing HTML code, because it works on any application (Zoom, Photoshop, Games), not just web browsers.
Prompt Engineering is Engineering: Getting a probabilistic AI to output deterministic, safe JSON commands for mouse control requires rigorous testing and precise language.

What's next for Neural OS Actuator

From Cloud to Edge: Our immediate next step is to take the logic proven in AI Studio and wrap it in a local Python client (PyAutoGUI) to actually drive the mouse and keyboard.
Biometric Security: We plan to implement facial and voice recognition so the "Actuator" only responds to the authorized user, preventing accidental commands from others in the room.
Tactile Feedback: Integrating with refreshable Braille displays to offer a dual-sensory experience for deaf-blind users.
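The planned local client could start as a thin dispatcher between validated commands and PyAutoGUI. The sketch below is one possible shape, not an existing implementation: the command dictionary layout, `InputBackend`, `execute`, and `real_backend` are hypothetical names, while `pyautogui.click`, `pyautogui.write`, and `pyautogui.FAILSAFE` are real PyAutoGUI features. The backend is injected so the logic can be exercised without moving a real mouse.

```python
from typing import Any, Protocol


class InputBackend(Protocol):
    """Anything offering click/write; the pyautogui module itself fits this shape."""
    def click(self, x: int, y: int) -> Any: ...
    def write(self, text: str) -> Any: ...


def execute(command: dict, backend: InputBackend) -> str:
    """Carry out one command and return a sentence the voice layer can speak."""
    action = command.get("action")
    if action == "click":
        backend.click(x=command["x"], y=command["y"])
        return f"Clicked at {command['x']}, {command['y']}."
    if action == "type":
        backend.write(command["text"])
        return f"Typed {len(command['text'])} characters."
    raise ValueError(f"unsupported action: {action!r}")


def real_backend() -> InputBackend:
    """Return PyAutoGUI as the backend (third-party: pip install pyautogui)."""
    import pyautogui  # imported lazily so the sketch loads without a display
    pyautogui.FAILSAFE = True  # moving the pointer to a screen corner aborts
    return pyautogui


if __name__ == "__main__":
    class PrintBackend:
        """Dry-run backend that only prints what would happen."""
        def click(self, x: int, y: int) -> None:
            print(f"would click at ({x}, {y})")

        def write(self, text: str) -> None:
            print(f"would type {text!r}")

    print(execute({"action": "click", "x": 960, "y": 540}, PrintBackend()))
```

Swapping `PrintBackend()` for `real_backend()` would drive the actual mouse and keyboard, so that step is best tried only after the command validation has been tested.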