Inspiration

For most of us, using a computer is just muscle memory. We click and type without even thinking about it. But I realized that for people with paralysis, ALS, or limited motor skills, a mouse and a keyboard aren't helpful tools; they are massive walls. I wanted to build something that would tear those walls down. I was inspired to create an assistant that doesn't just talk to you, but actually works for you, giving anyone with a voice the power to control their digital world.

What it does

V.O.I.C.E. is a hands-free bridge between a person and their computer. It allows a user to control their entire operating system using only their voice. It can open applications, type out long emails, search the web, and even "see" the screen. By using Gemini 3's vision, the assistant can find specific buttons or icons on a website and move the mouse to click them, allowing someone who cannot hold a mouse to browse the internet as effectively as anyone else.

How we built it

I used Python as the core language and connected it to the Google Gemini 3 Flash Preview API for the "brain" and "eyes."

  • I used speech recognition to capture vocal commands.
  • I built a vision pipeline that takes screenshots and overlays a labelled grid to help the AI map out the screen.
  • I used PyAutoGUI to handle the physical mouse movements and keystrokes.
  • I integrated Edge-TTS to give the agent a high-quality, professional voice that explains its actions in real time.

Rough sketches of each of these pieces follow.
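Here's a minimal sketch of the listening step, assuming the SpeechRecognition package with its free Google Web Speech recognizer; the real project may tune thresholds and error handling differently.

```python
# Hedged sketch: capture one spoken command with the SpeechRecognition package.
import speech_recognition as sr

def listen_for_command() -> str:
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        # Adapt to background noise before listening.
        recognizer.adjust_for_ambient_noise(source, duration=0.5)
        audio = recognizer.listen(source)
    try:
        # Free Google Web Speech endpoint; swap in another recognizer if needed.
        return recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        return ""

if __name__ == "__main__":
    print("Heard:", listen_for_command())
```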
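The vision-and-click loop looked roughly like the sketch below. The model name string, grid spacing, prompt wording, and the assumption that the model replies with clean JSON coordinates are all illustrative placeholders, not the exact production code.

```python
# Hedged sketch: screenshot -> grid overlay -> Gemini vision -> PyAutoGUI click.
import json

import google.generativeai as genai
import pyautogui
from PIL import ImageDraw

genai.configure(api_key="YOUR_GEMINI_API_KEY")            # placeholder key
model = genai.GenerativeModel("gemini-3-flash-preview")   # model id assumed from this write-up

GRID_STEP = 100  # pixels between grid lines, purely illustrative

def annotated_screenshot():
    """Take a screenshot and draw a labelled grid so the model can name coordinates."""
    img = pyautogui.screenshot()
    draw = ImageDraw.Draw(img)
    for x in range(0, img.width, GRID_STEP):
        draw.line([(x, 0), (x, img.height)], fill="red")
        draw.text((x + 2, 2), str(x), fill="red")
    for y in range(0, img.height, GRID_STEP):
        draw.line([(0, y), (img.width, y)], fill="red")
        draw.text((2, y + 2), str(y), fill="red")
    return img

def locate_and_click(target: str):
    """Ask the model where a UI element is, then move the mouse there and click."""
    img = annotated_screenshot()
    prompt = (
        f"Find the '{target}' element in this screenshot. "
        'Reply with JSON like {"x": 123, "y": 456} using the red grid labels.'
    )
    response = model.generate_content([prompt, img])
    coords = json.loads(response.text)  # assumes the model returns clean JSON
    pyautogui.moveTo(coords["x"], coords["y"], duration=0.3)
    pyautogui.click()

# Example: locate_and_click("Search button")
```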
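Narration is the simplest piece. Here's a sketch assuming edge-tts for synthesis and pygame (also listed below) for playback; the voice name and temporary file are assumed placeholders.

```python
# Hedged sketch: narrate the agent's actions with edge-tts and play the audio via pygame.
import asyncio

import edge_tts
import pygame

VOICE = "en-US-GuyNeural"  # assumed voice; edge-tts ships many others

async def speak(text: str, path: str = "narration.mp3"):
    # Synthesize speech to an mp3 file.
    await edge_tts.Communicate(text, VOICE).save(path)
    # Play it back with pygame's mixer and wait until it finishes.
    pygame.mixer.init()
    pygame.mixer.music.load(path)
    pygame.mixer.music.play()
    while pygame.mixer.music.get_busy():
        await asyncio.sleep(0.1)

# Example: asyncio.run(speak("Opening your email now."))
```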

Challenges we ran into

The biggest technical hurdle was accuracy. Standard mouse scripts often miss small buttons because of Windows DPI scaling, which makes screenshot coordinates drift away from real screen coordinates. I had to write a specific calibration fix to make sure a "click" landed exactly on the pixel the AI was looking at. I also struggled with API latency; sending large images to the cloud takes time. I solved this by converting screenshots to WebP and resizing them so the response felt fast and snappy. Finally, I had to build a failover system that rotates through multiple API keys to handle the heavy rate limiting during the hackathon.
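Both fixes are small in code. Here's a sketch of the idea, assuming the standard Windows DPI-awareness calls via ctypes and Pillow's WebP encoder; the target resolution and quality values are guesses, not the project's exact settings.

```python
# Hedged sketch of the DPI calibration and image-size optimizations described above.
import ctypes
import io

import pyautogui

def make_dpi_aware():
    """Tell Windows not to virtualize coordinates, so screenshot pixels match click pixels."""
    try:
        ctypes.windll.shcore.SetProcessDpiAwareness(2)  # per-monitor DPI aware (Win 8.1+)
    except (AttributeError, OSError):
        ctypes.windll.user32.SetProcessDPIAware()       # older Windows fallback

def compress_screenshot(max_width: int = 1280) -> bytes:
    """Downscale the screenshot and encode it as WebP to shrink the API payload."""
    img = pyautogui.screenshot()
    if img.width > max_width:
        ratio = max_width / img.width
        img = img.resize((max_width, int(img.height * ratio)))
    buf = io.BytesIO()
    img.save(buf, format="WEBP", quality=80)
    return buf.getvalue()
```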
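The failover is equally simple in concept: keep a list of keys and move to the next one whenever a request fails. A sketch under the assumption that the google-generativeai client is reconfigured per attempt; the key names and model id are placeholders.

```python
# Hedged sketch: rotate through several API keys when requests get rate-limited.
import google.generativeai as genai

API_KEYS = ["KEY_1", "KEY_2", "KEY_3"]  # placeholder keys

def generate_with_failover(parts):
    """Try each key in turn and return the first successful response."""
    last_error = None
    for key in API_KEYS:
        try:
            genai.configure(api_key=key)
            model = genai.GenerativeModel("gemini-3-flash-preview")  # model id assumed
            return model.generate_content(parts)
        except Exception as err:  # e.g. a 429 rate-limit error
            last_error = err
    raise RuntimeError("All API keys are rate-limited") from last_error
```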

Accomplishments that we're proud of

Most importantly, I’m proud to have built a tool that actually has the potential to change lives. Knowing that this could help someone with ALS or paralysis navigate a computer as effectively as anyone else is the most rewarding part of this entire journey. It’s not just code; it’s a way to give people their digital independence back.

What we learned

I learned that the next big step for AI isn't just "chatting"; it's "agency." There is a big difference between an AI that tells you how to do something and an AI that actually does it for you. I learned how to combine vision, speech, and OS-level automation into one single, cohesive unit. Most importantly, I learned that when you design technology for the most limited users, you end up making a better product for everyone.

What's next for V.O.I.C.E. (Visual Operative Intelligent Cybernetic Entity)

My vision is to expand V.O.I.C.E. beyond the PC. I want this logic to live in every piece of tech, from phones and tablets to smart devices, so that no one is limited by their physical condition and every person has an equal opportunity to use technology. At the same time, I plan to make the system even safer and more secure, adding "Confirmation Vows" so that the AI only performs sensitive tasks after explicitly confirming the user's intent.

Built With

  • computer-vision
  • edge-tts
  • gemini-3-flash-preview
  • google-gemini-3-api
  • os-automation
  • pyautogui
  • pygame
  • python
  • speech-recognition
  • tkinter