
Horizon - AI Computer Control Assistant

An intelligent AI assistant that can control your computer directly using advanced vision, natural language processing, and speech capabilities. Horizon combines GPT-OSS orchestration with Gemini vision, Whisper Turbo speech-to-text, and PlayAI-TTS text-to-speech to understand your screen and voice and execute complex tasks autonomously.

Inspiration

The inspiration for Horizon came from the need to bridge the gap between natural language instructions and precise computer control. Traditional automation tools require specific programming or scripting knowledge, but Horizon allows users to describe what they want to accomplish in plain English or by speaking. By combining cutting-edge vision and speech models with intelligent orchestration, we created an assistant that can "see" your screen, "hear" your commands, and take appropriate actions just like a human would.

What it does

Horizon is a comprehensive AI assistant that can:

Execute Complex Tasks: Break down high-level requests into actionable steps
Visual Screen Analysis: Understand what's currently displayed on your screen
Precise UI Interaction: Click buttons, fill forms, and navigate applications using precise on-screen coordinates obtained by combining OmniParser with a vision model
Multi-Modal Understanding: Process both text and voice commands, as well as visual information
Speech-to-Text: Use Whisper Turbo for accurate and fast voice command recognition
Text-to-Speech: Respond to users with a natural-sounding voice using PlayAI-TTS
Adaptive Learning: Use the history of past actions and known user details to improve future performance

Key Capabilities:

Screen Understanding: "What's on my screen?" - Get a detailed description of the current display
Application Control: "Reply to Rohit in Discord" - Launch and control applications
Form Automation: "Fill out this form with my information (if available in history)" - Automate data entry
Web Navigation: "Find the login button and click it" - Navigate websites intelligently
Multi-Step Workflows: "Send an email to John about the meeting" - Execute complex multi-step tasks
Voice Interaction: Speak commands and receive spoken responses for hands-free operation

How we built it

Architecture Overview

Horizon uses a three-tier architecture (orchestration, vision, and action), supported by a speech module:

GPT-OSS Orchestrator (orchestrator.py)
- High-level task planning and decision making
- Determines whether to execute tasks or answer questions
- Coordinates between vision, speech, and action modules
- GPT-OSS suits this role because of its strong reasoning and its built-in safety features
Gemini Vision (vision/vision.py)
- Advanced screen analysis using Gemini API
- OmniParser integration for precise UI element detection
- YOLO-based object detection for accurate coordinates
Action Execution (action/action.py)
- PyAutoGUI-based computer control
- Mouse, keyboard, and system interactions
Speech Module
- Whisper Turbo for real-time speech-to-text
- PlayAI-TTS for natural text-to-speech responses
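
The way these modules fit together can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Horizon's actual code: the Orchestrator class, its callable fields, and the "task"/"question" labels are all hypothetical stand-ins for the real orchestrator, vision, and action modules.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Orchestrator:
    """Hypothetical coordinator; the real orchestrator.py may differ."""
    classify: Callable[[str], str]          # returns "task" or "question"
    describe_screen: Callable[[], str]      # stand-in for the vision module
    plan: Callable[[str, str], List[str]]   # (request, screen) -> action steps
    act: Callable[[str], None]              # stand-in for the action module
    answer: Callable[[str, str], str]       # (request, screen) -> reply text
    log: List[str] = field(default_factory=list)

    def handle(self, request: str) -> str:
        # Look at the screen first so both branches share the same context.
        screen = self.describe_screen()
        if self.classify(request) == "question":
            return self.answer(request, screen)
        # Otherwise treat the request as a task: plan, then execute each step.
        steps = self.plan(request, screen)
        for step in steps:
            self.act(step)
            self.log.append(step)  # keep a history of executed actions
        return f"Done: {len(steps)} step(s) executed"
```

In a setup like this, the returned string would be passed to the text-to-speech module, and the request itself would come from the speech-to-text module. Passing the modules in as plain callables keeps each one testable in isolation.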

Why GPT-OSS?

We chose GPT-OSS for its safety, reasoning, and privacy properties. GPT-OSS is trained to refuse dangerous or harmful commands, which makes it a good fit for an AI assistant that controls your computer. Its reasoning ability allows it to break down complex instructions into safe, actionable steps, while its strong context understanding supports reliable orchestration of Horizon’s components. Because GPT-OSS is an open-weight model, it can also be self-hosted when user data must stay on the user's own machine, and it allows deep customisation and close integration with our vision, speech, and action modules.

Technical Stack:

Models: GPT-OSS-20B, Gemini 2.0 Flash, Whisper Turbo, PlayAI-TTS
Vision: YOLO + OmniParser for UI element detection
APIs: Groq and Gemini API for model access
Automation: PyAutoGUI, platform-specific system calls
Speech: Whisper Turbo (STT), PlayAI-TTS (TTS)

Accomplishments that we're proud of

1. Advanced Vision Integration

Successfully integrated OmniParser with YOLO for precise UI element detection
Achieved accurate coordinate mapping from visual elements to actionable commands
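
As an illustration of what such a coordinate mapping involves, here is a minimal sketch. It assumes the detector reports each element as a bounding box (x1, y1, x2, y2) normalised to the range 0 to 1; the function name bbox_center_px and that box format are assumptions for this example, not Horizon's confirmed implementation.

```python
from typing import Tuple


def bbox_center_px(
    bbox: Tuple[float, float, float, float],
    screen_w: int,
    screen_h: int,
) -> Tuple[int, int]:
    """Map a normalised (x1, y1, x2, y2) box to the pixel at its centre."""
    x1, y1, x2, y2 = bbox
    cx = (x1 + x2) / 2 * screen_w
    cy = (y1 + y2) / 2 * screen_h
    # Clamp so a box touching the screen edge still yields a valid pixel.
    px = min(max(int(round(cx)), 0), screen_w - 1)
    py = min(max(int(round(cy)), 0), screen_h - 1)
    return px, py
```

The resulting pair could then be handed to a click call such as pyautogui.click(x, y). Clicking the centre of the box rather than a corner leaves some margin for small detection errors.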

2. Intelligent Orchestration

Created a sophisticated decision-making system using GPT-OSS
Implemented proper tool calling and context management
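
A tool-calling layer of this kind can be sketched as a small registry plus a dispatcher. The example below is an assumption-laden illustration: the tool names (click, type_text), the JSON shape with "name" and "arguments" keys, and the dispatch function are hypothetical, and the tools only return strings instead of touching the real mouse or keyboard.

```python
import json
from typing import Any, Callable, Dict

# Registry of callable tools, keyed by the name the model refers to.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function so the dispatcher can find it by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def click(x: int, y: int) -> str:
    return f"clicked ({x}, {y})"  # placeholder for a real mouse click


@tool
def type_text(text: str) -> str:
    return f"typed {text!r}"  # placeholder for real keyboard input


def dispatch(call_json: str) -> str:
    """Run one model-issued tool call and return a result string."""
    call = json.loads(call_json)
    name = call.get("name")
    if name not in TOOLS:
        # Unknown tools are reported back instead of raising.
        return f"error: unknown tool {name!r}"
    try:
        return TOOLS[name](**call.get("arguments", {}))
    except TypeError as exc:
        # Missing or unexpected arguments end up here.
        return f"error: bad arguments for {name}: {exc}"
```

Returning errors as strings lets the orchestrating model see what went wrong and issue a corrected call, and only registered functions can ever be executed.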

3. Robust Action System

Built comprehensive automation capabilities covering mouse, keyboard, and system control
Added intelligent retry mechanisms and error handling
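
One common way to build such a retry mechanism is to re-run an action with exponential backoff until a verification check passes. The sketch below assumes this pattern; run_with_retry, its parameters, and the verify callback (for example, a fresh screenshot check) are hypothetical names, and Horizon's own retry logic may work differently.

```python
import time
from typing import Callable


def run_with_retry(
    action: Callable[[], None],
    verify: Callable[[], bool],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run action until verify() passes or the attempts are used up."""
    for attempt in range(attempts):
        try:
            action()
            if verify():
                return True
        except Exception:
            # Any failure counts as a failed attempt; a real system
            # would log the exception here.
            pass
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    return False
```

The sleep function is injectable so the backoff can be tested without real delays. A caller would typically fall back to re-planning or to asking the user when the function returns False.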

4. Seamless Speech Integration

Enabled hands-free operation with Whisper Turbo speech-to-text
Provided natural spoken feedback using PlayAI-TTS

What we learned

Technical Insights:

Vision Model Capabilities: Gemini excels at understanding UI contexts and spatial relationships
Orchestration Patterns: GPT-OSS provides excellent high-level reasoning for task decomposition, and its built-in safety features work seamlessly to ensure secure and reliable automation.
Action Reliability: Combining visual verification with action execution significantly improves success rates
Speech Integration: Whisper Turbo and PlayAI-TTS enable intuitive, accessible, and hands-free user experiences

Development Lessons:

Modular Architecture: Separating vision, orchestration, action, and speech modules enables better testing and maintenance
Error Handling: Computer control requires robust error handling and recovery mechanisms
User Experience: Clear feedback and status updates are crucial for user trust in AI automation

What's next for Horizon

Performance Optimisation: Reduce latency through model optimisation and caching
Application-Specific Plugins: Develop specialised modules for popular applications (browsers, office suites, etc.)
Cross-Platform Support: Expand compatibility to multiple operating systems (Windows, macOS, Linux)
Workflow Automation: Create and save complex multi-step workflows for repeated tasks
Parallel Agent Execution: Enable running multiple agents in parallel to handle concurrent tasks efficiently
Local Model Execution: Support running models locally for enhanced privacy and data security
Multi-Screen Support: Handle multiple monitors and complex display configurations

Built With

c#
computer-vision
groq
porcupine
python
unity

Submitted to

OpenAI Open Model Hackathon

Created by

I led the project and implemented OmniParser, a Microsoft-developed tool that is critical for enabling the vision model to determine the next actions to perform. I also improved the system's structure, ensured thorough testing, defined the feature set, and took responsibility for the overall architecture.

Salo Soja Edwin
Tony Stark was able to build this in a cave with a box of scraps!
I worked on the Action Module. I also added short-term memory and prompt updating to the Vision Module, and then replaced the Vision Module's model with the better Gemini 2.0 Flash.

Rohit Francis
rm -rf .git
I worked on the project's desktop overlay front-end, built with Unity. I also worked on integrating the system with the front-end.

Govind S Sarath
I implemented the main loop, in which the base model works together with speech-to-text and text-to-speech. I also worked on the tool-calling feature in the orchestrator.

Arjun G Ravi

Updates

Salo Soja Edwin started this project — Sep 11, 2025 04:42 PM EDT
