Aura - OS Automation Agent

Initial Look

Inspiration

This project was born from two very different, yet equally powerful, inspirations. On a personal level, I watched my parents struggle with everyday digital tasks. Seeing their difficulty with modern devices, I wanted to create something that could bridge that gap: an assistant that lets them accomplish what they want with a simple voice or text command, removing the technical barriers and anxieties. Aura is the first step toward that goal, a tool designed to make technology accessible to everyone, regardless of technical skill.

The second inspiration came from the world of science fiction: Jarvis from Iron Man. The idea of a seamless, intelligent AI that could understand intent, interact with multiple systems, and proactively assist with complex tasks was a powerful motivator. I wanted to capture a piece of that magic and build an agent that feels less like a tool and more like a capable partner.

How It Was Built

Aura is built on a client-server architecture, separating the lightweight local agent from the resource-intensive language model. This design allows the agent to be responsive and efficient on a local machine while leveraging the power of a dedicated server for AI inference.

The Local Agent: The "hands" of the agent, running on the user's desktop.

Core Logic: Written in Python, it orchestrates all the actions.
GUI: The holographic interface is built with PyQt6, providing a visually engaging way to interact with the agent through voice and text.
Tools: A suite of powerful libraries is used to give Aura its capabilities:
- Web Automation: Playwright is used for robust browser control.
- GUI Automation: PyAutoGUI and PyGetWindow provide low-level control of the mouse, keyboard, and application windows.
- Screen Understanding: Tesseract, EasyOCR, and OpenCV allow Aura to read text and identify elements on the screen.
- Document Creation: python-docx and python-pptx are used to generate Word documents and PowerPoint presentations.
- Data Operations: Pandas is used for data analysis and manipulation.

The Cloud Server: The "brain" of the agent.

API: A FastAPI server exposes the language model to the local agent.
Language Model: The server hosts a large language model from the Hugging Face Hub using the Transformers library.
Infrastructure: To meet the model's computational demands, I hosted it on a cloud service rather than on the local machine. This decision was critical to making the project feasible.
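
As a rough illustration of the split described above, the local agent's side of the exchange might look like the sketch below. It uses only the Python standard library. The `/generate` route, the `prompt` field, and both function names are assumptions made for this example; the write-up does not document the real API.

```python
import json
import urllib.request

# Hypothetical endpoint path; the write-up does not name the real route.
GENERATE_PATH = "/generate"


def build_inference_request(server_url: str, user_command: str) -> urllib.request.Request:
    """Build the HTTP POST the local agent would send to the model server."""
    # Assumed payload shape: a single "prompt" field carrying the user's command.
    payload = json.dumps({"prompt": user_command}).encode("utf-8")
    return urllib.request.Request(
        url=server_url.rstrip("/") + GENERATE_PATH,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_server(server_url: str, user_command: str, timeout: float = 60.0) -> dict:
    """Send the command to the server and return its decoded JSON reply."""
    request = build_inference_request(server_url, user_command)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from the network call means the payload can be checked without a running server.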

Challenges Faced

Building Aura was a journey filled with complex challenges, from making the AI reliable to integrating all the moving parts.

Prompt Engineering and JSON Structuring: This was one of the most significant hurdles. The heart of the project lies in the model's ability to understand a user's request and respond with a precise JSON object that the local agent can execute. I went through countless variations of the system prompt, refining the rules and examples to force the model to adhere to this strict structure. It was a constant battle against the model's tendency to generate conversational text instead of the required JSON.
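
One common defensive measure on the agent side is to pull the first well-formed JSON object out of the reply even when the model wraps it in chatter. The sketch below shows that idea. The required keys (`action`, `parameters`) and the function name are hypothetical, since the write-up does not publish Aura's actual schema.

```python
import json

# Hypothetical schema: the real keys Aura expects are not given in the write-up.
REQUIRED_KEYS = {"action", "parameters"}


def extract_action(model_reply: str) -> dict:
    """Return the first JSON object in the reply that has all required keys."""
    decoder = json.JSONDecoder()
    start = model_reply.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores any trailing conversational text.
            candidate, _ = decoder.raw_decode(model_reply, start)
        except json.JSONDecodeError:
            candidate = None
        if isinstance(candidate, dict) and REQUIRED_KEYS.issubset(candidate):
            return candidate
        # Not valid or not matching the schema: try the next opening brace.
        start = model_reply.find("{", start + 1)
    raise ValueError("no valid action object found in model reply")
```

A reply that contains no matching object raises `ValueError`, which gives the agent a clear point at which to re-prompt the model rather than execute something malformed.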

System Integration: Linking all the components together was a major challenge. The GUI, the various operational modules, and the AI's decision-making process all had to communicate flawlessly. Handling background tasks, such as generating content for a Word document while the user can still interact with the GUI, required careful management of threading and signals.
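
The background-task pattern can be sketched with the standard library alone. This example swaps Qt signals for a `queue.Queue` so that it runs without PyQt6; it is not the project's actual code, and `generate_document_text` is a placeholder for the slow work.

```python
import queue
import threading


def generate_document_text(topic: str) -> str:
    """Placeholder for a slow job such as asking the model for document content."""
    return f"Draft report about {topic}"


def start_background_job(topic: str, results: queue.Queue) -> threading.Thread:
    """Run the slow job off the main thread and post its outcome to a queue."""

    def worker() -> None:
        try:
            results.put(("done", generate_document_text(topic)))
        except Exception as error:  # report failures instead of dying silently
            results.put(("error", str(error)))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

In a GUI, the main thread would poll the queue from a timer (or, with PyQt6, receive a signal emitted by the worker) so the interface never blocks while the document is being generated.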

Robust GUI Automation: GUI automation is inherently brittle. I had to build a system that was not only capable of clicking and typing but also resilient to variations in application layouts. This led to the implementation of a "self-correction" mechanism, where Aura can fall back on visual AI (using OCR and the language model) to find on-screen elements when standard methods fail.
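
The fallback idea can be expressed as an ordered chain of locator strategies. The sketch below is an assumption about the shape of such a mechanism, not Aura's implementation. Each locator is a hypothetical callable that returns screen coordinates or `None`, and in practice the later entries would wrap OCR or model-assisted search.

```python
from typing import Callable, Optional, Sequence, Tuple

Point = Tuple[int, int]
Locator = Callable[[str], Optional[Point]]


def locate_with_fallback(
    label: str, locators: Sequence[Tuple[str, Locator]]
) -> Tuple[str, Point]:
    """Try each locator in order; return the winning strategy's name and the point."""
    for name, locate in locators:
        try:
            point = locate(label)
        except Exception:
            point = None  # treat a crashing strategy like a miss
        if point is not None:
            return name, point
    raise LookupError(f"could not find {label!r} on screen")
```

Returning the strategy name alongside the coordinates makes it easy to log how often the cheaper methods fail and the visual fallback is needed.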

What I Learned

This project was an incredible learning experience. Here are some of the key takeaways:

The Power of a Well-Defined Architecture: Separating the local agent from the cloud server was crucial. It allowed me to focus on building a rich set of tools for the agent without being constrained by the hardware limitations of a local machine.
Prompt Engineering is an Art: I learned that prompt engineering is not just about asking the right questions; it's about creating a "constitution" for the AI to follow, complete with rules, examples, and a clear structure to ensure reliable, machine-readable output.
The Future is Agents: This project solidified my belief that the future of AI lies in agents that can take action in the digital world, not just provide information. The ability to combine a powerful language model with a rich set of tools opens up a world of possibilities for automation and making technology more accessible for everyone, just as I had hoped for my parents.

Built With

accelerate
bitsandbytes
easyocr
fastapi
huggingface
lightningai
numpy
opencv
openpyxl
pandas
pillow
playwright
psutil
pyaudio
pyautogui
pygetwindow
pyqt6
python
python-docx
python-pptx
pytorch
pyttsx3
pywin32
speech-recognition
tesseract
transformers
uvicorn
winreg

Updates

Cherin Yacoob Wattacheril started this project — Sep 11, 2025 01:37 PM EDT
