Inspiration
We wanted advanced AI models to operate a computer the way a person does, taking visual and voice input and responding with mouse and keyboard actions. The same approach should work across different multimodal models such as Gemini Pro Vision, GPT-4o, Claude 3, and LLaVa.
What it does
A framework that allows AI models to:
- View computer screens
- Interpret visual information
- Execute mouse and keyboard actions
- Operate computers through voice commands
- Support multiple AI models with different capabilities
- Provide Optical Character Recognition (OCR) and Set-of-Mark (SoM) prompting modes
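The capabilities listed above amount to a loop: look at the screen, ask a model for the next step, carry it out, and repeat. The following is a minimal sketch of that loop, not ORCA's actual code. The `Action` fields, `run_loop`, and the `capture`, `decide`, and `execute` callables are all hypothetical names chosen for illustration, and the demo uses stand-ins instead of a real screenshot, model, or input driver.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """One step requested by the model. All field names are hypothetical."""
    kind: str        # "click", "type", or "done"
    x: float = 0.0   # click position as a fraction of screen width
    y: float = 0.0   # click position as a fraction of screen height
    text: str = ""   # text to type for "type" actions


def run_loop(objective: str,
             capture: Callable[[], bytes],
             decide: Callable[[str, bytes, List[Action]], Action],
             execute: Callable[[Action], None],
             max_steps: int = 10) -> List[Action]:
    """Capture the screen, ask the model for one action, execute it, repeat."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture()
        action = decide(objective, screenshot, history)
        history.append(action)
        if action.kind == "done":
            break  # the model reports that the objective is complete
        execute(action)
    return history


def demo() -> List[Action]:
    """Drive the loop with a scripted 'model' and record what gets executed."""
    script = iter([Action("click", 0.5, 0.5),
                   Action("type", text="hello"),
                   Action("done")])
    executed: List[Action] = []
    run_loop("focus a text field and type hello",
             capture=lambda: b"",  # stand-in for a real screenshot
             decide=lambda objective, shot, hist: next(script),
             execute=executed.append)
    return executed
```

Passing the three stages in as callables keeps the loop independent of any one model or operating system, which is one plausible way to support several models behind the same interface. The `max_steps` cap is a simple guard against a model that never reports completion.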
How we built it
- Developed a flexible framework compatible with multiple multimodal AI models
- Integrated screen capture and accessibility technologies
- Implemented voice input capabilities
- Created model-specific integrations (Gemini, Claude 3, GPT-4, LLaVa)
- Added advanced visual interpretation techniques like OCR and SoM prompting
- Ensured cross-platform compatibility (Mac, Windows, Linux)
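To illustrate how an OCR or SoM mode can turn a model's answer into a click, here is a small sketch. It assumes some detector has already produced labelled bounding boxes (OCR text, or the mark number drawn on the element for SoM), and that the model replies with a label instead of raw coordinates. `Box`, `find_by_label`, and `center_fraction` are hypothetical names, not part of ORCA or of any OCR library.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Box:
    """A detected screen element. `label` is OCR text or a SoM mark id."""
    label: str
    left: int
    top: int
    width: int
    height: int


def find_by_label(boxes: List[Box], query: str) -> Optional[Box]:
    """Return the first exact (case-insensitive) match, else the first partial match."""
    q = query.strip().lower()
    if not q:
        return None
    for box in boxes:
        if box.label.strip().lower() == q:
            return box
    for box in boxes:
        if q in box.label.lower():
            return box
    return None


def center_fraction(box: Box, screen_w: int, screen_h: int) -> Tuple[float, float]:
    """Center of the box as fractions of the screen size, so it is resolution independent."""
    cx = box.left + box.width / 2
    cy = box.top + box.height / 2
    return (cx / screen_w, cy / screen_h)
```

For example, with a 2000x1100 screen and a box `Box("Submit", 900, 500, 200, 100)`, `find_by_label(boxes, "submit")` returns that box and `center_fraction` gives `(0.5, 0.5)`. Asking the model for a label and looking up the coordinates locally avoids relying on the model to estimate pixel positions.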
Challenges we ran into
- Achieving consistent accuracy across different AI models
- Managing complex visual interpretation tasks
- Handling diverse computer interfaces
- Implementing reliable voice command processing
- Integrating multiple AI model APIs
- Managing high error rates with local models like LLaVa
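One common way to cope with unreliable model output is to validate every reply and ask again when it is unusable. The sketch below shows that idea only. It assumes the model is prompted to answer with a JSON object containing an `operation` field, which is an assumption made for this example and not a description of ORCA's protocol; `parse_action`, `ask_with_retry`, and `ALLOWED_OPERATIONS` are hypothetical names.

```python
import json
from typing import Callable, Optional

# Hypothetical set of operations the executor understands.
ALLOWED_OPERATIONS = {"click", "type", "done"}


def parse_action(raw: str) -> Optional[dict]:
    """Extract and validate a JSON action from a model reply; None if unusable."""
    # Models often wrap JSON in extra prose, so take the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if data.get("operation") not in ALLOWED_OPERATIONS:
        return None
    return data


def ask_with_retry(ask: Callable[[], str], attempts: int = 3) -> Optional[dict]:
    """Call the model up to `attempts` times until a valid action comes back."""
    for _ in range(attempts):
        action = parse_action(ask())
        if action is not None:
            return action
    return None  # the caller decides how to handle repeated failure
```

Rejecting unknown operations before anything is executed means a malformed reply costs one extra model call and never triggers an unintended mouse or keyboard action.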
Accomplishments that we're proud of
- Successfully created a universal framework for AI computer operation
- Supported multiple cutting-edge multimodal models
- Implemented voice and visual input modes
- Developed advanced visual interpretation techniques
- Created an easily installable and configurable system
What we learned
- Complexity of translating visual information into computer actions
- Importance of flexible model integration
- Challenges in developing cross-platform AI interaction tools
- Nuances of different multimodal AI model capabilities
What's next for ORCA
- Expand model compatibility
- Improve accuracy and reliability
- Develop Agent-1-Vision for more precise interactions
- Release API access for advanced users
- Enhance voice and visual interpretation capabilities
- Reduce error rates in local model operations