Inspiration
We noticed a lack of resources for people with physical disabilities and were inspired to help address this pressing issue.
What it does
It helps people who can't use standard input devices browse the web.
How we built it
The project consists of three parts:
- Voice: Used OpenAI’s Realtime API over a WebSocket to enable real-time audio conversations between users and LLMs. For audio transcription, we used the Whisper API.
- Vision: Using Python with libraries such as dlib and MediaPipe, we built an eye/head-tracking algorithm to move the cursor. We then added a speech recognition extension so the user can say “click” or “tap” for a left click, “right” for a right click, “double” for a double click, and “enter” to press Enter.
- Computer Use: Uses Anthropic’s Computer Use API. The API returns a tool and an action, and using macOS shell commands we simulate mouse movement, mouse clicks, and typing. It understands the screen through a vision LLM.
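For the voice part, a minimal sketch of the Realtime API session setup. The event shape follows OpenAI's documented `session.update` message; the instruction text and helper name here are our own placeholders:

```python
import json

def make_session_update(instructions: str, voice: str = "alloy") -> str:
    """Build a Realtime API `session.update` event enabling audio in/out
    with Whisper-based input transcription."""
    event = {
        "type": "session.update",
        "session": {
            "modalities": ["audio", "text"],
            "instructions": instructions,
            "voice": voice,
            "input_audio_transcription": {"model": "whisper-1"},
        },
    }
    return json.dumps(event)

# Sent once over the open WebSocket, e.g.:
# ws.send(make_session_update("You are a hands-free browsing assistant."))
```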
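The command-word mapping in the vision part can be sketched as a small lookup table (the function and action names are illustrative, not from the project code):

```python
def word_to_action(word: str):
    """Map a recognized speech command to a mouse/keyboard action,
    or return None for unrecognized words."""
    commands = {
        "click": "left_click",
        "tap": "left_click",
        "right": "right_click",
        "double": "double_click",
        "enter": "press_enter",
    }
    return commands.get(word.strip().lower())
```

Normalizing case means the recognizer can emit “Click” or “Tap” and still trigger a left click.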
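And for the computer-use part, a minimal sketch of turning an action returned by the API into a macOS shell command, assuming the `cliclick` utility is installed. The action names follow Anthropic's computer tool; the helper itself is ours:

```python
def action_to_shell(action: dict) -> str:
    """Translate a computer-use action dict into a cliclick command string."""
    kind = action["action"]
    if kind == "mouse_move":
        x, y = action["coordinate"]
        return f"cliclick m:{x},{y}"
    if kind == "left_click":
        return "cliclick c:."   # click at the current cursor position
    if kind == "right_click":
        return "cliclick rc:."
    if kind == "double_click":
        return "cliclick dc:."
    if kind == "type":
        return f"cliclick t:{action['text']}"
    raise ValueError(f"unsupported action: {kind}")

# The resulting string can then be run with e.g.
# subprocess.run(action_to_shell(action), shell=True)
```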
Challenges we ran into
Integrating all the different parts, configuring the servers for real-time voice conversation, and adapting Anthropic’s Computer Use API to our own browser and device.
Accomplishments that we're proud of
Innovating with new tools such as the Computer Use API and the Realtime API, and developing a novel concept.
What we learned
OpenAI Realtime API integration, eye tracking in Python, and computer control through Anthropic's Computer Use API and the Claude Sonnet model.
What's next for Baymax AI
Full integration of the three tools (voice, vision, and computer use) and a more polished user experience.
Built With
- anthropic
- computer-use-api
- openai
- python
- realtime-api