Inspiration
The vision behind the Self-Operating Computer Framework was to revolutionize how users interact with their computers, making hands-free operation a reality for everyone. The idea stemmed from the need to increase accessibility and efficiency, using multimodal AI models to interpret screen content and control computer functions autonomously. By integrating AI models that combine vision, language, and interaction, we aimed to create a system that could perform tasks just like a human operator—navigating and interacting with software purely through the model’s interpretation of the screen.
What it does
The Self-Operating Computer uses multimodal models to control a computer by simulating human interaction. It interprets the visual content on the screen, processes language inputs, and autonomously performs actions like moving the mouse, clicking buttons, and typing, all while achieving specific goals set by the user. This allows for hands-free computing, making tasks such as browsing, document editing, and other digital activities possible without the need for direct physical input.
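At its core, this loop means translating a model's textual reply into a concrete mouse or keyboard action. The parser below is a rough illustration only; the action format and function name are hypothetical, not the framework's actual protocol:

```python
def parse_action(response: str) -> dict:
    """Parse a model's action string into a structured command.

    Hypothetical action forms, assumed for illustration:
      CLICK 0.52 0.31   -> click at fractional screen coordinates
      TYPE hello world  -> type the given text
      DONE              -> objective complete
    """
    tokens = response.strip().split(maxsplit=1)
    op = tokens[0].upper()
    if op == "CLICK":
        x, y = map(float, tokens[1].split())
        return {"operation": "click", "x": x, "y": y}
    if op == "TYPE":
        return {"operation": "type", "text": tokens[1]}
    if op == "DONE":
        return {"operation": "done"}
    raise ValueError(f"Unrecognized action: {response!r}")
```

An execution layer (for example, a GUI-automation library such as pyautogui) would then carry out the returned operation on the real screen.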
How we built it
We built this framework by integrating cutting-edge multimodal models like Gemini Pro Vision, GPT-4, Claude 3, and LLaVA, each bringing powerful capabilities in vision and language understanding. The project uses Python for core development, with YOLOv8 for button detection and OCR (Optical Character Recognition) to identify clickable elements on the screen. The backend supports multiple operating systems (macOS, Windows, Linux), and APIs like OpenAI and Google AI Studio provide model access and interaction. The framework allows users to switch between models via simple terminal commands.
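The OCR step boils down to resolving a target label to a click point: given OCR output as text/bounding-box pairs, find the on-screen element whose text matches and return its center. The helper below is an illustrative sketch; the function name and the matching heuristic are assumptions, not the project's actual code:

```python
def find_click_point(ocr_results, target, screen_w, screen_h):
    """Return fractional screen coordinates of the center of the OCR-detected
    element whose text best matches `target`, or None if nothing matches.

    ocr_results: list of (text, (left, top, width, height)) tuples in pixels.
    """
    target = target.lower()
    candidates = [
        (text, box) for text, box in ocr_results if target in text.lower()
    ]
    if not candidates:
        return None
    # Prefer the shortest matching text, i.e. the tightest match.
    text, (left, top, w, h) = min(candidates, key=lambda c: len(c[0]))
    return ((left + w / 2) / screen_w, (top + h / 2) / screen_h)
```

Normalizing to fractional coordinates keeps the result independent of screen resolution, which matters for cross-platform support.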
Challenges we ran into
- Model Integration: Combining multiple models (vision, language, interaction) and making them work together seamlessly was a significant challenge.
- API Permissions: Configuring Google AI Studio and ensuring the correct API key setup took time.
- OCR Performance: Improving the accuracy of the OCR system at detecting clickable elements across diverse screen layouts required extensive testing and fine-tuning.
- Cross-Platform Compatibility: Ensuring that the framework worked smoothly across macOS, Windows, and Linux, each with different screen-recording and accessibility settings, was complex.
Accomplishments that we're proud of
- Successfully integrating Gemini Pro Vision and GPT-4 with OCR to allow the system to click and interact with elements based on screen text.
- Implementing Set-of-Mark (SoM) Prompting for enhanced visual grounding, giving the AI better context to make decisions about on-screen actions.
- Building a robust, extensible framework that supports future AI models, with clear instructions for user contributions.
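Set-of-Mark prompting works by overlaying numbered marks on detected elements so the model can answer with a mark index instead of raw pixel coordinates. A minimal sketch of the idea, where the prompt wording and data shapes are assumptions for illustration:

```python
def build_som_prompt(boxes, objective):
    """Assign numeric marks to detected UI-element boxes and describe them
    in the prompt, so the model replies with a mark number to click.

    boxes: list of (left, top, width, height) tuples in pixels.
    Returns (mark -> box mapping, prompt text).
    """
    marks = {i + 1: box for i, box in enumerate(boxes)}
    lines = [f"[{i}] element at x={b[0]}, y={b[1]}, w={b[2]}, h={b[3]}"
             for i, b in marks.items()]
    prompt = (
        f"Objective: {objective}\n"
        "The screenshot has numbered marks on interactive elements:\n"
        + "\n".join(lines)
        + "\nReply with the mark number to click."
    )
    return marks, prompt
```

Answering with a small integer is an easier grounding task for the model than emitting coordinates, which is why SoM improves decision quality.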
What we learned
- Multimodal AI Synergy: We gained deep insights into how vision and language models can complement each other in practical applications like computer control.
- User-Centric Design: Ensuring that the framework is user-friendly and accessible for those unfamiliar with the underlying technology was essential to its success.
- Performance Optimization: We learned that OCR combined with model-specific optimizations significantly improves performance over traditional vision models alone.
What's next for Self Operating Computer
- Model Expansion: We plan to support more AI models, improving both the range of tasks the system can handle and its overall efficiency.
- Advanced Interaction Capabilities: Adding voice input to complement the multimodal capabilities, providing a more natural and seamless hands-free experience.
- Community Contributions: Encouraging users to contribute new detection models and enhance the existing framework.
- Mobile Support: Extending compatibility to mobile operating systems for fully autonomous mobile device operation.
- AI Training Enhancements: Implementing more sophisticated AI training models to improve accuracy and decision-making speed.
