Inspiration
The inspiration for Harvey the Helper came from our teammate's grandma. She constantly needs help to perform tasks like checking email, opening a website, or adjusting a setting. Witnessing her frustration and the difficulty she faced navigating her computer using the mouse really made us want to help her. We wanted to build a tool that removed those barriers, allowing her and anyone else with similar challenges, including the elderly and disabled, to control their computer with just their voice.
Harvey tries to make computers more inclusive and accessible for everyone by making computer use as simple as talking to a friend!
What it does
Harvey the Helper is an AI agent designed to take control of a user's computer based on their verbal requests. He translates the user's speech into direct actions on the computer. A user can simply talk to Harvey and ask him to:
Open an application (e.g., "Open my email.")
Play a video (e.g., "Find some cat videos on YouTube")
Change system settings (e.g., "Turn down the screen brightness.")
After processing the request, Harvey executes the task and shows the user exactly how it is done, so that they learn along the way!
How we built it
The project's architecture is centered around three main stages:
Conversational Interface: We used the Gemini API to run a seamless, low-latency conversation between the user and Harvey, and the Gemini 2.5 Flash native text-to-speech (TTS) model to give Harvey his voice.
Action Item Extraction: After the conversation concluded, we fed the resulting JSON transcript back into Gemini 2.5 Flash. This step analyzed the dialogue to identify and extract clear action items (e.g., "open Safari and go to youtube.com"). These clear action items help guide Harvey to take actions seamlessly.
Execution and Feedback Loop: The action items, along with real-time screenshots of the user's screen, are sent to Gemini 2.5 Flash (Flash was used to stay under the rate limits) in a prompt asking for instructions on how the task should be executed. Gemini's instructions then drive a Python library (Quartz) to perform the necessary keyboard presses and mouse movements on the operating system.
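As a minimal sketch of stages 2 and 3 (our exact prompts and JSON schema differ; the fence-stripping logic and the `click_at` helper shown here are illustrative assumptions), the pipeline parses the model's JSON reply into action items and, on macOS, synthesizes clicks through Quartz:

```python
import json

def parse_action_items(model_reply: str) -> list[str]:
    """Extract a JSON array of action strings from the model's reply.

    Models often wrap JSON in markdown fences, so strip those first.
    """
    text = model_reply.strip()
    if text.startswith("```"):
        # Drop the opening fence (with optional language tag) and the closing fence.
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    actions = json.loads(text)
    return [str(a) for a in actions]

def click_at(x: int, y: int) -> None:
    """Synthesize a left click at screen coordinates (x, y) via Quartz.

    Quartz (pyobjc) is macOS-only, so we import it lazily.
    """
    import Quartz
    for kind in (Quartz.kCGEventLeftMouseDown, Quartz.kCGEventLeftMouseUp):
        event = Quartz.CGEventCreateMouseEvent(
            None, kind, (x, y), Quartz.kCGMouseButtonLeft
        )
        Quartz.CGEventPost(Quartz.kCGHIDEventTap, event)

# Example with a canned reply instead of a live Gemini call:
reply = '```json\n["open Safari and go to youtube.com"]\n```'
print(parse_action_items(reply))  # → ['open Safari and go to youtube.com']
```

In the real loop, the coordinates passed to `click_at` come from Gemini's reading of the latest screenshot rather than from hard-coded values.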
Challenges we ran into
Accuracy of Mouse Clicks: A major challenge was getting Gemini to accurately click on screen elements. Since each action is based on analyzing a screenshot, a slight misinterpretation of coordinates or a visual change on the desktop could lead to incorrect actions. We solved this by overlaying a grid system on the screenshot.
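The grid idea can be sketched in a few lines (a hedged illustration with NumPy; our actual grid spacing and labeling scheme may differ): draw lines every `step` pixels so the model can answer with a coarse cell instead of exact pixels, then map the cell back to a click point at its center.

```python
import numpy as np

def overlay_grid(screenshot: np.ndarray, step: int = 100) -> np.ndarray:
    """Draw white grid lines every `step` pixels on an RGB screenshot array.

    Gemini is asked to answer with a grid cell (row, col) rather than raw
    pixel coordinates, which is more robust to small misreadings.
    """
    img = screenshot.copy()
    img[::step, :, :] = 255  # horizontal lines
    img[:, ::step, :] = 255  # vertical lines
    return img

def cell_to_pixels(row: int, col: int, step: int = 100) -> tuple[int, int]:
    """Map a grid cell back to the pixel at the center of that cell."""
    return (col * step + step // 2, row * step + step // 2)

shot = np.zeros((400, 600, 3), dtype=np.uint8)  # stand-in screenshot
gridded = overlay_grid(shot)
print(cell_to_pixels(2, 3))  # → (350, 250)
```

Clicking the center of a cell tolerates an off-by-one in the model's coordinate estimate, at the cost of needing finer grids for small UI elements.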
Seamless Voice Interaction: Achieving a voice interaction that felt natural and conversational was difficult. We initially tried to use Gemini Live, which would have let the conversation flow even more smoothly, but we quickly exhausted our request quota and ran into API key rate limits.
Accomplishments that we're proud of
We are most proud of successfully creating a proof of concept that works surprisingly well given the tight deadline and is genuinely a helpful tool. We are also proud of using Google Gemini in an innovative way to make computers more accessible and inclusive.
What we learned
We learned how to use the Google Gemini Live API and how to do complex prompt engineering.
What's next for Harvey the Helper
The future of Harvey the Helper includes several key improvements:
Improved Accuracy: We plan to implement sophisticated Computer Vision (CV) models to better identify and locate UI elements on the screen, which will dramatically improve the accuracy of mouse clicks.
Broader System Integration: We want to expand Harvey's capabilities beyond simple actions to include complex, multi-step tasks (e.g., "Find a recipe for chicken tikka and email it to my son"). We also want Harvey to narrate his actions as he takes them, in order for the user to follow along and learn with confidence.
How Was Gemini Used
We used the Gemini 2.5 Flash native TTS model to generate natural-sounding speech when talking to the user. Additionally, we used Gemini 2.5 Flash to understand what Harvey is seeing and predict what Harvey should do to achieve the goal. To our surprise, Google Gemini was very well equipped to understand interfaces.
Built With
- applescript
- gemini
- geminitts
- google-gen-ai
- numpy
- pyaudio
- python
- quartz
- sounddevice
- soundfile


