Inspiration

We were inspired by in-class discussions about accessible tech and by recent advancements in AI. We felt that AI could transform the way visually-impaired and mobility-impaired people interact with their computers, and we wanted to build a proof of concept for that.

What it does

We allow LLMs to autonomously control computers through function calling. This is a space that’s gotten a bit more popular in recent months, but the applications in accessibility tech are still almost completely unexplored.

Specifically, we allow the user to talk to their computer and give it a browser-based or application-based task. The LLM will then complete it.

How we built it

In terms of machine learning, we rely pretty heavily on OpenAI's models. We use the OpenAI Assistants API to run tasks with agentic behavior. Specifically, we use GPT-3.5 for the function calling, and we occasionally call GPT-4 with vision to parse screenshots.
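To make the function-calling setup concrete, here's a minimal sketch of what a tool schema plus a local dispatcher could look like. The tool name `click_element` and its parameters are our illustration, not Gru's actual schema.

```python
# Hypothetical tool schema in the format the OpenAI tools/function-calling
# API expects. The specific tool here is an illustrative example.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "click_element",
            "description": "Click the on-screen element with the given index.",
            "parameters": {
                "type": "object",
                "properties": {"index": {"type": "integer"}},
                "required": ["index"],
            },
        },
    },
]

def dispatch(name, args, handlers):
    """Route a tool call emitted by the model to the matching Python handler."""
    if name not in handlers:
        # Returning an error payload lets the model recover instead of crashing.
        return {"error": f"unknown tool: {name}"}
    return handlers[name](**args)
```

In a real run, the model's response would contain tool calls whose `name` and JSON arguments get fed into `dispatch`, and the handler's return value is sent back to the model as the tool result.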

The non-ML components used industry-standard techniques. Web browsing was automated through Playwright, a popular automation library for browser testing, and we injected vanilla JS for some DOM manipulation. Automation of other apps was done through a Python wrapper for Microsoft's UI Automation (UIA) accessibility API. Because of this, we generalize to any app written in Qt5 and many apps written in WinForms.
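As a hedged sketch of the JS-injection step (the selector list, function names, and prompt format below are our assumptions, not the project's exact code): inject vanilla JS through Playwright's `page.evaluate` to enumerate interactive elements, then flatten them into a compact list the LLM can reference by index.

```python
# `page` is assumed to be a Playwright Page (playwright.sync_api).
# The injected JS enumerates likely-interactive DOM elements.
JS_LIST_INTERACTIVE = """
() => Array.from(document.querySelectorAll('a, button, input, select, [role=button]'))
        .map((el, i) => ({i, tag: el.tagName.toLowerCase(),
                          text: (el.innerText || el.value || '').trim().slice(0, 60)}))
"""

def snapshot_interactive(page):
    """Run the injected JS and return the page's interactive elements."""
    return page.evaluate(JS_LIST_INTERACTIVE)

def format_for_prompt(elements):
    """One line per element, e.g. '[3] button: Submit', for the LLM context."""
    return "\n".join(f"[{e['i']}] {e['tag']}: {e['text']}" for e in elements)
```

The indexed list is what lets a text-only model like GPT-3.5 "see" the page cheaply, falling back to screenshots only when the DOM alone isn't enough.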

Challenges we ran into

One challenge we ran into was integration with existing screen readers. We really wanted to build this as an add-on to existing screen readers to minimize adoption friction, but JAWS (the most popular paid screen reader) is closed-source, and NVDA (the most popular free one) is still on an older version of Python that doesn't support many libraries we need.

Another challenge was design. For this project, we had to intentionally make our UI as accessible as possible, which for us meant building it around voice control.

Accomplishments that we're proud of

We’re very proud that we ended with a working demo! The project is still a bit low-fidelity, but it’s a fantastic proof of concept and really highlights how we envision the future of accessible tech.

We think that a future, high-fidelity version of Gru could improve a lot of people’s lives.

What we learned

We learned a lot about accessible design, LLMs, and automation. It was a blast!

What's next for Gru

We plan to continue working on Gru. We’d like to build this out more formally as an add-on, and the upcoming update for NVDA to Python 3.11 will probably help with that.

We’d also like to improve the fidelity of our pipeline. Longer term, we’d like to build and use our own multimodal model that understands UI hierarchy. In the shorter term, we need to improve error handling, create more alternate flows, make heavier use of few-shot and chain-of-thought prompting, and refactor our data pipeline.
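The error-handling work could start somewhere simple. As a sketch (this helper is our illustration, not part of Gru today), UI automation steps are flaky by nature, so retrying a failed step with backoff before escalating to an alternate flow is a natural first layer:

```python
import time

def with_retries(step, attempts=3, delay=0.1):
    """Re-run a flaky automation step with exponential backoff before giving up."""
    last_error = None
    for i in range(attempts):
        try:
            return step()
        except Exception as e:  # in practice, catch narrower automation errors
            last_error = e
            time.sleep(delay * (2 ** i))  # back off: delay, 2*delay, 4*delay, ...
    raise last_error
```

Only once retries are exhausted would the pipeline fall back to an alternate flow, such as re-prompting the LLM with the error context.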

We’d also like to begin caching solutions to completed tasks as reusable “skills,” and use retrieval-augmented generation (RAG) to apply those skills to future problems.
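One way the skill cache could look (a toy sketch under our own assumptions; the class, the embedding function, and the similarity threshold are all hypothetical): embed each solved task's description, store the solution steps alongside it, and retrieve the nearest past skill as a few-shot example for new tasks.

```python
import math

def _cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SkillCache:
    """Toy RAG-style cache: store solved tasks keyed by an embedding,
    retrieve the closest past solution to seed the prompt for a new task."""

    def __init__(self, embed):
        self.embed = embed   # callable: str -> list[float] (e.g. an embeddings API)
        self.skills = []     # list of (vector, task_description, solution_steps)

    def add(self, task, steps):
        self.skills.append((self.embed(task), task, steps))

    def retrieve(self, task, min_sim=0.5):
        """Return (task, steps) of the most similar cached skill, or None."""
        query = self.embed(task)
        best, best_sim = None, min_sim
        for vec, t, steps in self.skills:
            sim = _cosine(query, vec)
            if sim > best_sim:
                best, best_sim = (t, steps), sim
        return best
```

In practice `embed` would be a real embedding model rather than the bag-of-words stand-in we test with, and the retrieved steps would be inserted into the LLM prompt as a worked example.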
