Inspiration
Our inspiration for AURA came from a simple, universal frustration: the digital workflow is broken. We have incredibly powerful AI models like gpt-oss, yet we are still forced into a tedious loop of typing into one window, copying the result, and pasting it into another. Our computers have voice assistants, but they are siloed and lack true awareness of our tasks. We wanted to build something better.
We were inspired by the vision of a true AI partner—an agent that doesn't just respond, but assists. An agent that can see what we see, understand our context, and act on our behalf directly within our environment. The release of OpenAI's gpt-oss models provided the final, critical piece: the reasoning "brain" powerful enough to drive such an agent locally and privately.
Furthermore, we were deeply moved by the potential for this technology to serve as a next-generation accessibility tool. We envisioned an agent that could empower users with motor or visual impairments to navigate the digital world as fluidly as anyone else, transforming a convenience into a necessity.
What it does
AURA (Autonomous User-side Robotic Assistant) is a context-aware local agent that transforms how you interact with your computer. Powered by the gpt-oss model, it gives you a voice-driven, conversational partner that can see your screen, understand your workflow, and execute complex tasks across any application on your desktop.
At its core, AURA is built on four layers of context:
- Stateful Task Context: AURA can generate complex content like code or an email and hold it in a "deferred action" state, waiting for you to simply click where you want it placed. This separates content creation from placement, creating a seamless workflow.
- Conversational Context: AURA remembers the last few turns of your conversation, allowing for natural follow-up commands. You can ask it to write a function in PyTorch, and then simply say, "Now write that in TensorFlow," and it will understand.
- Visual & Accessibility Context: AURA perceives what you're focused on. You can highlight any text in any application—a browser, a PDF, a code editor—and ask, "Explain the selected text," turning AURA into a powerful, on-the-fly research and learning assistant.
- Application Context: AURA is aware of the application you are currently using and can proactively pre-load UI information in the background, making its responses faster and more efficient.
Ultimately, you no longer have to bring your work to the AI; AURA brings the AI to your work.
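The conversational layer described above can be thought of as a small rolling window of recent turns that is sent to the model alongside each new command. A minimal sketch in Python (the class and method names here are illustrative, not AURA's actual code):

```python
from collections import deque

class ConversationContext:
    """Keep the last few user/assistant turns so follow-ups like
    'Now write that in TensorFlow' can be resolved by the model."""

    def __init__(self, max_turns=5):
        # Each turn contributes two entries (user + assistant);
        # older turns automatically fall off the left of the deque.
        self.turns = deque(maxlen=max_turns * 2)

    def add(self, role, text):
        self.turns.append({"role": role, "content": text})

    def build_messages(self, new_command):
        # Prior turns plus the new command form the prompt context.
        return list(self.turns) + [{"role": "user", "content": new_command}]

ctx = ConversationContext(max_turns=5)
ctx.add("user", "Write a ReLU function in PyTorch")
ctx.add("assistant", "def relu(x): return torch.clamp(x, min=0)")
messages = ctx.build_messages("Now write that in TensorFlow")
```

Because the full recent history rides along with every request, "that" in the follow-up unambiguously refers to the previously generated function.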
How we built it
AURA is a Python-based application built on a modular, orchestrated architecture designed for extensibility and reliability.
- The Orchestrator (`orchestrator.py`): This is the central nervous system of the application. It manages the entire workflow, from receiving a transcribed voice command to routing it through the appropriate modules and handlers.
- The Reasoning Core (`gpt-oss-120b`): We use the `gpt-oss-120b` model as our primary reasoning engine. It powers our intent recognition, action plan generation, and conversational responses. We interface with it using the Ollama Python client, allowing for both local and cloud-based execution.
- Hybrid Perception System: AURA "sees" the screen using a robust two-tiered system:
  - Fast Path (Accessibility API): Its primary method is to use macOS's native accessibility APIs to directly inspect an application's UI tree. This is incredibly fast and accurate.
  - Vision Fallback (Local Vision Model): If the accessibility path fails, AURA seamlessly falls back to a vision-based workflow. It takes a screenshot and uses a local vision model (like LLaVA or Phi-3-Vision running in LM Studio) to identify and locate UI elements.
- Modular Design (`modules/` & `handlers/`): All core capabilities are encapsulated in specialized modules (e.g., `AudioModule`, `AutomationModule`). The logic for handling different user intents is further separated into handlers (`GUIHandler`, `ConversationHandler`, `DeferredActionHandler`), making the system clean and easy to extend.
- Platform & Tools: The application is currently built for macOS, using the `cliclick` command-line tool for reliable automation and the `pynput` library for global mouse event listening to enable our deferred action workflow.
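The two-tiered perception flow can be sketched as a simple "try the fast path first" function. The helper callables below are placeholders for AURA's real accessibility and vision modules, not its actual API:

```python
def locate_element(description, accessibility_api, vision_model):
    """Find a UI element, preferring the fast accessibility path.

    accessibility_api and vision_model are stand-ins for the real
    modules; each takes a description and returns (x, y) coordinates.
    """
    try:
        # Fast path: walk the app's native UI tree via the accessibility API.
        coords = accessibility_api(description)
        if coords is not None:
            return coords, "accessibility"
    except Exception:
        pass  # Non-standard app: the UI tree may be unavailable.

    # Fallback: screenshot the screen and ask a local vision model.
    coords = vision_model(description)
    return coords, "vision"

# Usage with stubbed backends: the accessibility path fails here,
# so the vision fallback answers instead.
def broken_ax(desc):
    raise RuntimeError("no accessibility tree")

def stub_vision(desc):
    return (640, 360)

coords, path = locate_element("Submit button", broken_ax, stub_vision)
```

Keeping the fallback behind the same function signature means the rest of the pipeline never needs to know which perception path produced the coordinates.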
Challenges we ran into
- Brittleness of Pure Vision Automation: Our initial prototypes relied solely on vision. We quickly discovered this was unreliable. UI elements that look identical to a human can be ambiguous to a model, and minor changes in resolution or theme could break the system. This led us to develop the much more robust hybrid perception system with the accessibility "fast path."
- Complex State Management: The "deferred action" workflow was a significant architectural challenge. We had to implement a thread-safe state machine within the `Orchestrator` that could enter a "waiting" state, start a global mouse listener without interfering with other operations, and gracefully handle timeouts or cancellation by a new command. This required careful use of threading locks and event handlers.
- Reliable Intent Recognition: Distinguishing between a command to do something ("write a function") and a command to talk about something ("tell me about functions") is a nuanced problem. We iterated extensively on the `INTENT_RECOGNITION_PROMPT`, adding specific examples and clear instructions to steer the `gpt-oss` model toward accurately classifying user intent and producing structured, predictable responses.
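A stripped-down version of that waiting state might look like the sketch below, with a lock guarding transitions and an event for timeout handling. This is a simplified illustration of the pattern, not AURA's actual `Orchestrator` code, and the click callback stands in for what the real `pynput` listener would invoke:

```python
import threading

class DeferredAction:
    """Hold generated content until the user clicks a placement target,
    with thread-safe cancellation and timeout support."""

    def __init__(self):
        self._lock = threading.Lock()
        self._done = threading.Event()
        self.state = "idle"
        self.content = None

    def start(self, content):
        with self._lock:
            self.state = "waiting"
            self.content = content
            self._done.clear()

    def on_click(self, x, y):
        # Called from the global mouse listener thread.
        with self._lock:
            if self.state != "waiting":
                return None  # Stale click: nothing pending.
            self.state = "idle"
            self._done.set()
            return (self.content, x, y)

    def cancel(self):
        # A new voice command cancels any pending placement.
        with self._lock:
            self.state = "idle"
            self.content = None
            self._done.set()

    def wait(self, timeout):
        # True if a click or cancel arrived before the timeout expired.
        return self._done.wait(timeout)

action = DeferredAction()
action.start("print('hello')")
result = action.on_click(100, 200)  # simulate the user's placement click
```

The lock ensures the listener thread and the command pipeline never race on the state, and the event lets the orchestrator block with a timeout instead of polling.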
Accomplishments that we're proud of
- The Context-Aware Engine: We are incredibly proud of the multi-layered contextual understanding we've built. The seamless interplay between conversational, stateful, and visual context is what elevates AURA from a simple script to a truly intelligent agent. The ability to follow up on commands naturally is a game-changer.
- The Deferred Action Workflow: This is a novel human-computer interaction pattern that we believe is a glimpse into the future of AI collaboration. It perfectly blends the generative power of the AI with the user's precise, contextual control, creating a workflow that is both powerful and intuitive.
- Building a Resilient Hybrid System: Creating the dual-path perception system (Accessibility + Vision) was a major accomplishment. It allows AURA to be incredibly fast and efficient when possible, but also robust and universally functional when faced with non-standard applications. This makes the agent far more reliable in real-world use.
What we learned
- Open Models are Ready for the Desktop: The `gpt-oss` models are not just powerful, but also efficient enough to be the core reasoning engine for a complex, real-time local agent. This project proved to us that the future of AI assistance can be private, local, and open-source.
- Context is Everything: We learned that the true potential of LLMs is unlocked not just by better prompts, but by providing them with rich, real-time context. An agent's usefulness is directly proportional to its awareness of the user's current environment and conversational state.
- The "Last Mile" is Crucial: The most powerful AI in the world is useless if it can't bridge the "last mile" to the user's application. We learned that the most critical part of an agent is its ability to seamlessly integrate into a user's existing workflow, which is why we focused so heavily on the automation and accessibility modules.
What's next for AURA
We are incredibly excited about the future of AURA and see this hackathon as just the beginning. Our roadmap is focused on making AURA an even more indispensable partner:
- Cross-Platform Support: Our immediate next step is to abstract the `AutomationModule` and `AccessibilityModule` to support Windows and Linux, making AURA a universally available local agent.
- Deeper Application Integration: We plan to move beyond surface-level GUI automation and integrate with application-specific APIs (e.g., for calendars, email clients, and IDEs) to perform more complex, high-level tasks.
- Self-Correction and Learning: We want to empower AURA to learn from its mistakes. If an action plan fails, we plan to implement a feedback loop where AURA analyzes the error, re-evaluates the screen, and generates a new, corrected plan.
- Proactive Assistance: The ultimate goal is for AURA to become a proactive assistant. By understanding the user's context and habits over time, AURA could anticipate needs and offer suggestions, such as automating repetitive tasks or summarizing a newly opened document before being asked.