Inspiration
We noticed a lack of resources for people with physical disabilities and were inspired to help address this pressing issue.
What it does
It helps people who can't use standard input devices browse the web.
How we built it
The project consists of three parts:
- Voice: Used OpenAI’s Realtime API over a WebSocket to enable real-time audio conversations between users and LLMs. For audio transcription, we used the Whisper API.
- Vision: Using Python with libraries such as dlib and MediaPipe, we built an eye/head-tracking algorithm to move the cursor. We then added a speech recognition extension so the user can say “click” or “tap” for a left click, “right” for a right click, “double” for a double click, and “enter” to press Enter.
- Computer Use: Uses Anthropic’s Computer Use API. The API returns a tool and an action, and using macOS shell commands we simulate mouse movement, mouse clicks, and typing. It understands the screen through a vision LLM.
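For the voice part, a minimal sketch of the Realtime API session setup. The event shape follows OpenAI's documented `session.update` message; the instruction text and helper name here are our own placeholders:

```python
import json

def make_session_update(instructions: str, voice: str = "alloy") -> str:
    """Build a Realtime API `session.update` event enabling audio in/out
    with Whisper-based input transcription."""
    event = {
        "type": "session.update",
        "session": {
            "modalities": ["audio", "text"],
            "instructions": instructions,
            "voice": voice,
            "input_audio_transcription": {"model": "whisper-1"},
        },
    }
    return json.dumps(event)

# Sent once over the open WebSocket, e.g.:
# ws.send(make_session_update("You are a hands-free browsing assistant."))
```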
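The command-word mapping in the vision part can be sketched as a small lookup table (the function and action names are illustrative, not from the project code):

```python
def word_to_action(word: str):
    """Map a recognized speech command to a mouse/keyboard action,
    or return None for unrecognized words."""
    commands = {
        "click": "left_click",
        "tap": "left_click",
        "right": "right_click",
        "double": "double_click",
        "enter": "press_enter",
    }
    return commands.get(word.strip().lower())
```

Normalizing case means the recognizer can emit “Click” or “Tap” and still trigger a left click.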
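And for the computer-use part, a minimal sketch of turning an action returned by the API into a macOS shell command, assuming the `cliclick` utility is installed. The action names follow Anthropic's computer tool; the helper itself is ours:

```python
def action_to_shell(action: dict) -> str:
    """Translate a computer-use action dict into a cliclick command string."""
    kind = action["action"]
    if kind == "mouse_move":
        x, y = action["coordinate"]
        return f"cliclick m:{x},{y}"
    if kind == "left_click":
        return "cliclick c:."   # click at the current cursor position
    if kind == "right_click":
        return "cliclick rc:."
    if kind == "double_click":
        return "cliclick dc:."
    if kind == "type":
        return f"cliclick t:{action['text']}"
    raise ValueError(f"unsupported action: {kind}")

# The resulting string can then be run with e.g.
# subprocess.run(action_to_shell(action), shell=True)
```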
Challenges we ran into
Integrating all the different parts, configuring the servers for real-time voice conversation, and adapting Anthropic’s Computer Use API to our own browser and device.
Accomplishments that we're proud of
Innovating with new tools such as the Computer Use API and the Realtime API, and developing a novel concept.
What we learned
OpenAI Realtime API integration, eye tracking in Python, and computer control through Anthropic's Computer Use API and the Claude Sonnet model.
What's next for Baymax AI
Full integration of the three tools (voice, vision, and computer use) and a more polished user experience.
Built With
- anthropic
- computer-use-api
- openai
- python
- realtime-api