Inspiration

For millions of users in the Global South, the smartphone is a lifeline, but the apps inside it are a maze. Low digital literacy is a major barrier, and apps are cluttered with ads, nested menus, and complex, English-heavy UIs.

Voice assistants today are fundamentally broken for these users: they can answer trivia, but they can't do the work. A gig worker who is driving, or an elderly user trying to navigate a complex train-booking app, doesn't need instructions read aloud. They need an agent with agency.

We built Aura because we believe the operating system of the future should adapt to the user, not the other way around. Aura translates the complex, high-friction internet into a frictionless, voice-native experience.

What it does

Aura is a fully autonomous, voice-native Android OS co-pilot. Instead of waiting for questions, Aura lives inside the operating system and executes multi-step workflows autonomously based on visual UI feedback.

Autonomous App Orchestration: You give Aura a compound command (e.g., "Open Swiggy and order a Masala Dosa"), and she takes over the device. She opens the app, waits for it to load, finds the search bar, types the query, handles out-of-stock items, scrolls through lists, and navigates all the way to the checkout screen.

Clutter-to-Calm Abstraction: Aura acts as an invisible bridge. Users don't need to learn how to use confusing apps; they just speak naturally, and Aura's "invisible hands" do the rapid scrolling, searching, and navigating behind the scenes.

How we built it

To win the Orion Hackathon, we knew a standard Python API wrapper wouldn't cut it. We built a hyper-optimized, "Zero-Hop" Native Edge Architecture:

The Nervous System (Raw WebSockets): We bypassed heavy middleware SDKs and built a raw OkHttp WebSocket client natively in Kotlin. It streams 16 kHz bidirectional audio directly between the Android device and the live AI model, which keeps conversational turn latency very low.
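
As a rough illustration of the audio side, the sketch below shows how raw PCM might be packed and split into frames before being handed to a WebSocket. Only the 16 kHz sample rate comes from the write-up; the 16-bit mono format, the 100 ms frame length used in the usage note, and all function names are assumptions made for this example.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

const val SAMPLE_RATE_HZ = 16_000   // from the write-up
const val BYTES_PER_SAMPLE = 2      // assumption: 16-bit PCM
const val CHANNELS = 1              // assumption: mono

/** Number of bytes needed to hold [millis] milliseconds of audio in the format above. */
fun chunkSizeBytes(millis: Int): Int =
    SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * millis / 1000

/** Packs 16-bit samples into little-endian bytes. */
fun toLittleEndianBytes(samples: ShortArray): ByteArray {
    val buffer = ByteBuffer
        .allocate(samples.size * BYTES_PER_SAMPLE)
        .order(ByteOrder.LITTLE_ENDIAN)
    for (sample in samples) buffer.putShort(sample)
    return buffer.array()
}

/** Splits a PCM byte stream into frames of at most [frameBytes] bytes each. */
fun frames(pcm: ByteArray, frameBytes: Int): List<ByteArray> {
    require(frameBytes > 0) { "frameBytes must be positive" }
    return (pcm.indices step frameBytes).map { start ->
        pcm.copyOfRange(start, minOf(start + frameBytes, pcm.size))
    }
}
```

Under these assumptions one second of audio is 32,000 bytes, so `chunkSizeBytes(100)` yields 3,200-byte frames. Each frame would then be sent as one binary or encoded WebSocket message, depending on what the model endpoint expects.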

The Eyes (Semantic Flattening): We utilized Android's AccessibilityService to pull the active UI tree. To conserve LLM context windows, we built a SemanticFlattener that strips out redundant XML and compresses the screen state into lightweight, tokenized JSON strings while preserving exact physical bounds.
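
The flattening step could look roughly like the sketch below. It is not the project's SemanticFlattener: `UiNode` and `Bounds` are hypothetical stand-ins for Android's accessibility node type so the example runs off-device, and the short JSON keys (`t`, `c`, `b`) are an assumed schema.

```kotlin
/** Hypothetical stand-in for a node's on-screen rectangle. */
data class Bounds(val left: Int, val top: Int, val right: Int, val bottom: Int)

/** Hypothetical stand-in for one node of the accessibility tree. */
data class UiNode(
    val text: String? = null,
    val clickable: Boolean = false,
    val bounds: Bounds,
    val children: List<UiNode> = emptyList(),
)

/** Minimal JSON string escaping; control characters are replaced by spaces. */
fun escape(s: String): String = buildString {
    for (c in s) when (c) {
        '"' -> append("\\\"")
        '\\' -> append("\\\\")
        else -> append(if (c < ' ') ' ' else c)
    }
}

/**
 * Walks the tree depth-first and keeps only nodes that carry text or are
 * clickable, emitting one compact JSON object per kept node with its bounds.
 */
fun flatten(root: UiNode): List<String> {
    val out = mutableListOf<String>()
    fun visit(node: UiNode) {
        val label = node.text?.trim().orEmpty()
        if (label.isNotEmpty() || node.clickable) {
            val b = node.bounds
            out += """{"t":"${escape(label)}","c":${node.clickable},"b":[${b.left},${b.top},${b.right},${b.bottom}]}"""
        }
        for (child in node.children) visit(child)
    }
    visit(root)
    return out
}
```

Dropping layout-only containers is where most of the savings would come from, since they usually have no text and are not interactive.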

The Brain (ReAct Loop): We engineered a strict "Reason + Act" System Prompt. Aura doesn't guess; she evaluates the UI state, calls a specific OS tool, waits for the screen to update, and loops until the user's ultimate goal is achieved.
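
The control flow of such a loop can be sketched as follows. The model and the device are represented by plain function parameters (`observe`, `decide`, `act`), and the `Decision` types and the `maxSteps` cap are assumptions for illustration; in Aura the reasoning step is driven by the system prompt rather than by local code.

```kotlin
/** What the model asks for on each turn (hypothetical types). */
sealed class Decision
data class CallTool(val name: String, val args: Map<String, String> = emptyMap()) : Decision()
data class Done(val summary: String) : Decision()

/**
 * Observe the screen, let the model decide, act, and repeat until the model
 * reports the goal as done or the step budget runs out (returns null then).
 */
fun runLoop(
    goal: String,
    observe: () -> String,
    decide: (goal: String, screen: String, history: List<String>) -> Decision,
    act: (CallTool) -> String,
    maxSteps: Int = 20,
): String? {
    val history = mutableListOf<String>()
    for (step in 1..maxSteps) {
        // The screen is re-read on every iteration, after the previous action.
        when (val decision = decide(goal, observe(), history)) {
            is Done -> return decision.summary
            is CallTool -> history.add("${decision.name} -> ${act(decision)}")
        }
    }
    return null
}
```

Passing the goal into every `decide` call is what keeps the final objective in view across intermediate steps, and the step cap guards against an agent that never converges.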

The Hands (Fuzzy Actuation): To translate LLM intent into physical taps, we wrote a custom TargetResolver using pure Kotlin Jaro-Winkler fuzzy matching. This allows Aura to successfully tap elements even if the AI slightly hallucinates the button text, and dynamically calculates the exact X/Y center coordinates for the dispatchGesture API.
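
To make the matching step concrete, here is a minimal pure-Kotlin sketch of Jaro-Winkler similarity and a resolver built on it. The `Target` type, the `resolve` function, the case-insensitive comparison, and the 0.85 acceptance threshold are assumptions for this example, not details of the project's TargetResolver.

```kotlin
/** Jaro similarity in [0, 1]. */
fun jaro(a: String, b: String): Double {
    if (a == b) return 1.0
    if (a.isEmpty() || b.isEmpty()) return 0.0
    val window = maxOf(0, maxOf(a.length, b.length) / 2 - 1)
    val aMatched = BooleanArray(a.length)
    val bMatched = BooleanArray(b.length)
    var matches = 0
    for (i in a.indices) {
        val lo = maxOf(0, i - window)
        val hi = minOf(b.length - 1, i + window)
        for (j in lo..hi) {
            if (!bMatched[j] && a[i] == b[j]) {
                aMatched[i] = true
                bMatched[j] = true
                matches++
                break
            }
        }
    }
    if (matches == 0) return 0.0
    // Count matched characters that appear in a different order.
    var k = 0
    var outOfOrder = 0
    for (i in a.indices) {
        if (!aMatched[i]) continue
        while (!bMatched[k]) k++
        if (a[i] != b[k]) outOfOrder++
        k++
    }
    val m = matches.toDouble()
    val transpositions = outOfOrder / 2
    return (m / a.length + m / b.length + (m - transpositions) / m) / 3.0
}

/** Jaro-Winkler: boosts the Jaro score for a shared prefix of up to 4 characters. */
fun jaroWinkler(a: String, b: String, prefixScale: Double = 0.1): Double {
    val j = jaro(a, b)
    val prefix = a.zip(b).takeWhile { (x, y) -> x == y }.size.coerceAtMost(4)
    return j + prefix * prefixScale * (1 - j)
}

/** Hypothetical on-screen element with its label and pixel bounds. */
data class Target(val label: String, val left: Int, val top: Int, val right: Int, val bottom: Int) {
    val centerX: Int get() = (left + right) / 2
    val centerY: Int get() = (top + bottom) / 2
}

/** Returns the best-scoring candidate at or above [threshold], or null. */
fun resolve(query: String, candidates: List<Target>, threshold: Double = 0.85): Target? =
    candidates
        .map { it to jaroWinkler(query.lowercase(), it.label.lowercase()) }
        .filter { it.second >= threshold }
        .maxByOrNull { it.second }
        ?.first
```

With this sketch, a slightly wrong label such as "Serach" still resolves to a "Search" button, while an unrelated label falls below the threshold and returns null instead of tapping the wrong element. The resolved target's `centerX` and `centerY` are the coordinates a gesture would be dispatched to.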

Challenges we ran into

Building an autonomous agent that navigates third-party apps is an engineering minefield:

The Asynchronous Desync: In a bidi-streaming environment, the AI would sometimes cancel tool calls while the Android app was halfway through calculating coordinates. We had to implement thread-safe ConcurrentHashMap state management to silently drop stale payloads and prevent fatal state errors.
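
One way to drop stale payloads with a ConcurrentHashMap is sketched below. The `ToolCallTracker` class, its method names, and the idea of keying on a call ID are assumptions; the write-up only states that ConcurrentHashMap-based state management was used.

```kotlin
import java.util.concurrent.ConcurrentHashMap

/** Tracks in-flight tool calls so results for cancelled calls can be discarded. */
class ToolCallTracker {
    private val pending = ConcurrentHashMap<String, Boolean>()

    /** Registers a tool call when the model requests it. */
    fun start(callId: String) {
        pending[callId] = true
    }

    /** Called when the model cancels a call mid-flight. */
    fun cancel(callId: String) {
        pending.remove(callId)
    }

    /**
     * Atomically claims the call. Returns true exactly once for a call that is
     * still live, and false if it was cancelled, already completed, or unknown.
     */
    fun tryComplete(callId: String): Boolean = pending.remove(callId) != null

    /** Sends [result] only if the call is still live; otherwise drops it silently. */
    fun <T> deliverIfLive(callId: String, result: T, send: (T) -> Unit): Boolean {
        if (!tryComplete(callId)) return false
        send(result)
        return true
    }
}
```

Because `remove` is atomic, a cancellation and a completion racing on different threads cannot both succeed for the same call.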

The "Blind Spot" Loading Times: When Aura opens an app, Android takes hundreds of milliseconds to draw the UI. Initially, Aura would "look" too early, see a blank screen, and crash. We implemented a deterministic 3x 800 ms retry loop that waits for rootInActiveWindow to populate before feeding context back to the model.
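
A bounded retry of this kind might be written as follows. The function name, the injectable `sleep` parameter, and the choice to probe first and wait only between attempts are assumptions; on a device the probe would be something like reading `rootInActiveWindow`, and inside a coroutine a suspending delay would replace `Thread.sleep`.

```kotlin
/**
 * Calls [probe] up to [attempts] times, waiting [delayMs] between attempts,
 * and returns the first non-null result, or null if every attempt came back empty.
 */
fun <T : Any> awaitNonNull(
    attempts: Int = 3,
    delayMs: Long = 800,
    sleep: (Long) -> Unit = { Thread.sleep(it) },
    probe: () -> T?,
): T? {
    for (attempt in 1..attempts) {
        val value = probe()
        if (value != null) return value
        // No point in waiting after the final attempt.
        if (attempt < attempts) sleep(delayMs)
    }
    return null
}
```

Returning null instead of throwing lets the caller report "screen not ready" back to the model as an ordinary tool result.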

Custom View Nightmares: We discovered that apps like Instagram Reels ignore standard Accessibility ACTION_SCROLL commands. We had to build a fallback physical gesture dispatcher that scales swipe coordinates to 65% of the device's screen height (metrics.heightPixels), so that scrolling behaves consistently across device sizes.
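
The coordinate arithmetic for such a fallback swipe can be sketched as below. The write-up gives only the 65% figure; treating it as the swipe distance, centering the swipe vertically, and the `Swipe` type itself are assumptions made for this example.

```kotlin
/** Start and end points of a vertical swipe, in pixels. */
data class Swipe(val x: Float, val startY: Float, val endY: Float)

/**
 * Computes a swipe that scrolls content down: the finger starts low on the
 * screen and moves up by [fraction] of the screen height, centered vertically.
 */
fun scrollDownSwipe(widthPx: Int, heightPx: Int, fraction: Float = 0.65f): Swipe {
    require(fraction > 0f && fraction <= 1f) { "fraction must be in (0, 1]" }
    val distance = heightPx * fraction
    val startY = (heightPx + distance) / 2f
    val endY = (heightPx - distance) / 2f
    return Swipe(x = widthPx / 2f, startY = startY, endY = endY)
}
```

For a 1080 x 2000 px screen this gives a swipe from roughly y = 1650 to y = 350 at x = 540. On a device, these points would define the path of the stroke passed to dispatchGesture.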

The Keyboard Trap: When an AI types into a text field, it doesn't automatically press "Search" on the software keyboard. We had to inject a deterministic "submitted": "true/false" flag directly into the tool response so Aura actively knows she has to hunt down the physical search icon to proceed.
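
A tool response carrying that flag might be built like this. Only the `"submitted"` field with string values comes from the write-up; the `tool` and `hint` fields, the function names, and the hint wording are hypothetical.

```kotlin
/** Serializes flat string fields to JSON. Values here are fixed and need no escaping. */
fun toJson(fields: Map<String, String>): String =
    fields.entries.joinToString(",", "{", "}") { (k, v) -> "\"$k\":\"$v\"" }

/** Builds the response for a text-entry tool call. */
fun typeTextResult(submitted: Boolean): String {
    val fields = linkedMapOf(
        "tool" to "type_text",
        "submitted" to submitted.toString(),
    )
    // When the text was typed but not submitted, tell the model what is still missing.
    if (!submitted) fields["hint"] = "Text entered but not submitted; tap the search control."
    return toJson(fields)
}
```

Making the unsubmitted state explicit in the tool result means the model does not have to infer it from an unchanged screen.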

Accomplishments that we're proud of

True Autonomy: Breaking past the "single-shot" clicker paradigm to build a persistent cognitive loop where the AI actually remembers the user's final goal while navigating intermediate steps.

Edge-Native Integration: Successfully handling raw PCM audio byte manipulation and concurrent coroutines purely in Kotlin, proving that powerful AI agents can run entirely on the edge without PC-based Python relays.

The "Breadcrumb" UX: Implementing strategic audio feedback (e.g., "Finding the search bar now...") to mask UI parsing latency, making the agent feel incredibly alive and responsive.

What we learned

We learned that building AI for the real world is about handling the "Unhappy Path." An AI's logic is only as good as its contextual grounding. We learned how to meticulously manage state, handle unpredictable UI popups, and force the agent into a "Trust, but Verify" loop before executing physical actions on a user's device.

What's next for Aura

This is just the foundation. Our immediate roadmap includes:

The Optical Nerve (Spatial Mode): Integrating CameraX to stream 1 FPS video to the Live API, allowing users to point their phone at a physical poster or object and have Aura autonomously execute digital actions based on it.

The Digital Immune System: Upgrading the Accessibility Service to proactively detect and physically block (via SYSTEM_ALERT_WINDOW overlays) malicious phishing links and UPI scams before the user can tap them.

Zero-Touch SOS Protocol: A dedicated emergency trigger that locks the OS and silently broadcasts live location and audio data.

Built With

  • accessibilityservice
  • android
  • api
  • apis:
  • backend
  • cloud
  • coroutines
  • databases:
  • execution
  • fastapi
  • firebase
  • flash
  • for
  • frameworks
  • gemini
  • gesturedescription
  • google
  • jaro-winkler
  • key
  • kotlin
  • languages:
  • libraries
  • lifecycle
  • live
  • manager
  • matching
  • multimodal
  • node.js
  • okhttp
  • parsing
  • platform
  • platforms:
  • python
  • retrofit
  • routing)
  • run
  • sdk
  • secret
  • services:
  • similarity
  • stream
  • swarm
  • tools:
  • tree
  • ui
  • websocket
  • x