Inspiration
The vision for this project was to create a solution that simulates real users interacting with Windows applications. The goal was to build an agent capable of exploring software autonomously, analyzing screen states, and making decisions based on specific user personas. By adopting predefined personas like the "Power User" or "Senior" user, the agent can test advanced features or act as a novice to identify usability issues.
This project aims to promote the use of AI in application testing and to catch bugs early, before alpha/beta testers ever encounter them. Combining AI and human testing allows for the development of more robust software.
What it does
One-Sigma is an AI-powered QA agent that operates within sandboxed virtual machines. It captures the application's interface and uses a perception-action loop to decide the next logical step. The agent can perform actions such as:
- Clicking: Identifying and interacting with elements on the screen.
- Typing: Entering text into input fields.
- Wait/Done: Deciding when to pause or when a test run is successfully finished.
Every run produces a structured report containing action logs and the reasoning traces behind the AI's decisions.
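The report schema itself isn't shown here, but conceptually each loop iteration yields a record pairing the chosen action with the model's reasoning. A hypothetical sketch (the field names are illustrative, not One-Sigma's actual schema):

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Hypothetical sketch of one perception-action step; the real schema may
# differ. Each record pairs the chosen action with the model's reasoning
# so the final report can replay the agent's logic step by step.
@dataclass
class AgentStep:
    action: Literal["click", "type", "wait", "done"]
    target_cell: Optional[int] = None   # grid cell ID, used by "click"
    text: Optional[str] = None          # payload, used by "type"
    reasoning: str = ""                 # why the model chose this action

run_log = [
    AgentStep(action="click", target_cell=42,
              reasoning="The 'File' menu is the logical entry point."),
    AgentStep(action="type", text="quarterly_report.docx",
              reasoning="The save dialog expects a filename."),
    AgentStep(action="done",
              reasoning="The document saved without errors."),
]
```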
How we built it
The project is architected with a React and Electron frontend for an intuitive dark-themed UI and a FastAPI backend to manage agent processes. The agent itself is written in Python and uses a "Brain" module powered by the GPT-5.2 model.
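As a rough sketch of how such a Brain call can work (using the OpenAI-compatible client that OpenRouter exposes; the model ID, prompt, and function shape are illustrative placeholders, not our exact code):

```python
import base64
from openai import OpenAI  # OpenRouter exposes an OpenAI-compatible API

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")

def decide_next_action(screenshot_path: str, persona: str) -> str:
    """Ask the model for the next action given the current gridded screenshot."""
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="openai/gpt-4o",  # placeholder model ID
        messages=[
            {"role": "system",
             "content": f"You are a {persona} testing a Windows app. "
                        "Reply with one action: click <cell>, type <text>, "
                        "wait, or done."},
            {"role": "user", "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ]},
        ],
    )
    return response.choices[0].message.content
```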
To bridge the gap between AI visual analysis and precise cursor control, we implemented a coordinate translation system. We capture the screen and overlay a red grid, creating distinct cells numbered in row-major order. The conversion from a grid ID `n` (where `n` counts cells left to right, top to bottom, across `cols` columns of `cell_w × cell_h` pixels each) to the screen coordinates of that cell's center is:

`x = (n mod cols) * cell_w + cell_w / 2`

`y = floor(n / cols) * cell_h + cell_h / 2`
This mathematical approach allows the model to specify a cell ID, which the "Hands" module then translates into an exact click position using PyAutoGUI.
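As a minimal sketch of that translation (assuming row-major numbering and a fixed cell size; the constants are illustrative, not One-Sigma's actual configuration):

```python
import pyautogui

COLS = 20                 # assumed grid width in cells
CELL_W, CELL_H = 96, 54   # assumed cell size in pixels (1920x1080 / 20x20)

def cell_to_screen(cell_id: int) -> tuple[int, int]:
    """Map a row-major grid cell ID to the pixel center of that cell."""
    col = cell_id % COLS
    row = cell_id // COLS
    return (col * CELL_W + CELL_W // 2,
            row * CELL_H + CELL_H // 2)

def click_cell(cell_id: int) -> None:
    """Move the cursor to the cell's center and click, as the 'Hands' module does."""
    x, y = cell_to_screen(cell_id)
    pyautogui.click(x, y)
```

Targeting the cell's center makes the agent tolerant of small localization errors: the model only has to name the right cell, not pixel-perfect coordinates.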
Challenges we ran into
During development, we faced several technical hurdles:
- Rate Limiting: We repeatedly hit the Gemini API's rate limits, which forced us to optimize our calls and eventually pivot to OpenRouter to expand our capabilities.
- Accuracy: The model initially struggled to locate buttons accurately, which led us to develop the grid overlay technique.
- Environment Shifts: We pivoted the project between Linux and Windows environments twice before settling on the current Windows-focused implementation.
- Frontend Integration: We had to significantly rework the frontend to support launching applications from local file paths.
Accomplishments that we're proud of
We successfully developed a working perception-action loop that allows the agent to "think" before it acts. We are particularly proud of the grid perception system, which turned a high-level visual model into a precise automation tool. Additionally, the structured report generation provides clear visibility into the agent's logic, making it a viable tool for actual QA workflows.
What we learned
This project taught us the importance of grounding AI decisions in a structured physical space. We learned how to handle asynchronous processes between a FastAPI backend and a Python-based agent loop while broadcasting logs in real-time via WebSockets. We also gained experience in designing AI personas that effectively modify the agent's behavior and decision-making speed.
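A minimal sketch of that log-streaming pattern (the endpoint path and queue-based hand-off are illustrative, not our exact implementation):

```python
import asyncio
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
log_queue: asyncio.Queue[str] = asyncio.Queue()  # agent loop pushes log lines here

@app.websocket("/ws/logs")
async def stream_logs(websocket: WebSocket):
    """Push agent log lines to the connected frontend as they arrive."""
    await websocket.accept()
    try:
        while True:
            line = await log_queue.get()
            await websocket.send_text(line)
    except WebSocketDisconnect:
        pass  # client closed the UI; drop the connection quietly
```

The agent loop, running as a background task, simply calls `log_queue.put_nowait(...)` whenever it records a step, and the frontend renders each line as it arrives.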
What's next for One-Sigma
The next steps for the project involve expanding its utility and safety:
- Cross-platform Compatibility: Moving beyond Windows to support Linux and macOS applications.
- Default VM Sandboxing: Integrating VirtualBox by default to ensure all tests run in a completely isolated environment.
- Advanced Personas: Allowing users to define even more granular custom behaviors for the agent.