Inspiration

My inspiration came from three things:

  1. Operator and Computer Use by OpenAI/Anthropic
  2. Guilt over using my phone too much
  3. My mom's inability to use a mobile phone effectively

So I wanted to build something that uses your phone on your behalf.

What it does

This agent uses your phone on your behalf to:

  1. Perform boring tasks like applying for jobs on LinkedIn, replying on WhatsApp, and scrolling reels,
  2. Run search operations for you and return a summary (using Tavily),
  3. Remember you, memories basically (using Mem0),
  4. Do content moderation for the content you want to keep out of your brain's RAM
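The moderation step (item 4) boils down to a simple decision shape: content in, verdict out. A minimal sketch in Kotlin, with the caveat that all names are illustrative and the actual Judge agent asks an LLM rather than matching keywords:

```kotlin
// Illustrative only: the real Judge agent is LLM-based, but the
// decision shape is the same: content + user's avoid-list -> verdict.
data class Verdict(val blocked: Boolean, val reason: String?)

fun judge(content: String, avoidTopics: List<String>): Verdict {
    val hit = avoidTopics.firstOrNull { content.contains(it, ignoreCase = true) }
    return if (hit != null) Verdict(true, "matches avoided topic: $hit")
    else Verdict(false, null)
}
```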

How we built it

Panda is a Kotlin-based, multi-agent system for Android:

  1. Eyes & Hands: The Android Accessibility Service reads the screen and performs gestures.
  2. Brain: Google's Gemini models power all planning, reasoning, and analysis.
  3. Knowledge: Tavily Search provides real-time web access.
  4. Memory: Mem0 provides a persistent, long-term memory layer.

This is orchestrated by a team of specialised agents (Manager, Operator, Reflector, etc.) that handle everything from high-level planning to action execution and reflection.

My agents were:

  1. Manager: makes a high-level plan
  2. Operator: executes the task
  3. Reflector: reflects on the Operator's actions
  4. NoteTaker: takes important notes
  5. Judge: handles the content-moderation part
  6. DeepSearch: decides when to deep-search and what to search for, then forms a reply
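The core plan-act-reflect cycle among the first three agents can be sketched as follows. All class and method bodies here are illustrative stubs, not Panda's actual implementation:

```kotlin
// Hedged sketch of the plan -> act -> reflect loop.
data class Step(val description: String)
data class Outcome(val step: Step, val success: Boolean)

class Manager {
    // Breaks a goal into ordered steps (the real agent asks Gemini).
    fun plan(goal: String): List<Step> =
        listOf(Step("open the right app for: $goal"), Step("carry out: $goal"))
}

class Operator {
    // Stub: the real Operator performs gestures via the Accessibility Service.
    fun execute(step: Step): Outcome = Outcome(step, success = true)
}

class Reflector {
    // Checks whether the executed steps achieved their intent.
    fun review(outcomes: List<Outcome>): Boolean = outcomes.all { it.success }
}

fun runTask(goal: String): Boolean {
    val steps = Manager().plan(goal)
    val outcomes = steps.map { Operator().execute(it) }
    return Reflector().review(outcomes) // a failed review would trigger a re-plan
}
```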

Challenges we ran into

API Rate Limiting: Solved by implementing an intelligent key management system that rotates through a pool of 11 Gemini API keys.
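The rotation idea can be sketched like this; the class and method names are illustrative, and the real system may track per-key quotas more carefully:

```kotlin
import java.util.concurrent.atomic.AtomicInteger

// Minimal round-robin key pool that skips keys marked as rate-limited.
class KeyPool(private val keys: List<String>) {
    private val cursor = AtomicInteger(0)
    private val cooling = mutableSetOf<String>()

    @Synchronized
    fun next(): String {
        repeat(keys.size) {
            val k = keys[Math.floorMod(cursor.getAndIncrement(), keys.size)]
            if (k !in cooling) return k
        }
        error("all keys are rate-limited")
    }

    @Synchronized
    fun markRateLimited(key: String) { cooling += key } // real impl: add a cool-down timer
}
```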

LLM Hallucination: Reduced visual misinterpretations (e.g., calling the Amazon icon a "cat") by updating prompts to require cross-referencing visual data with the text-based accessibility hierarchy.
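The grounding fix amounts to putting the accessibility hierarchy into the prompt and instructing the model to cross-check against it. A sketch of such a prompt builder, where the node shape and wording are assumptions rather than Panda's actual prompt:

```kotlin
// Illustrative: feed the text-based accessibility hierarchy alongside the
// screenshot so the model must ground icon guesses in real node labels.
data class UiNode(val id: String, val text: String, val bounds: String)

fun groundingPrompt(goal: String, nodes: List<UiNode>): String = buildString {
    appendLine("Goal: $goal")
    appendLine("Before naming any on-screen element, cross-check it against these accessibility nodes:")
    nodes.forEach { appendLine("- [${it.id}] \"${it.text}\" at ${it.bounds}") }
    append("Only act on elements whose labels appear in the list above.")
}
```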

Accomplishments that we're proud of

We've built one of the few open-source, on-device phone-automation agents available.

While all LLM operators are still evolving, Panda has a robust foundation and can reliably perform a wide range of tasks, proving the multi-agent architecture is effective.

What we learned

This project was a deep dive into building a complete agentic system. I gained practical expertise in:

  1. Multi-agent design,
  2. Low-level Android UI automation via the Accessibility Service,
  3. Strategies for grounding LLMs to improve reliability in real-world scenarios.

What's next for Panda

  1. Ability for the agent to ask questions (currently the LLM just assumes that my brother's name is Jeff; it cannot ask). This is the biggest problem with automation agents (e.g. Browser-Use).

  2. Implement voice input and good-quality output with the help of ElevenLabs, instead of the current robotic sounds (no money).

  3. Non-vision mode (no screenshots), where we just use element descriptions in text form via the accessibility API. It's like rendering the DOM, but for a mobile phone.

  4. Give agents access to Android APIs that automate steps faster. For example, instead of finding the Amazon app icon and clicking it, just use an Android API to open the app.

  5. Better vision labelling; the current method is expensive and slow.

  6. Understand video: explore Yann LeCun's JEPA models to understand not just screenshots, but also screen recordings and animations.
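The non-vision mode in item 3 could serialize the accessibility tree as indented text, much like rendering a DOM. A minimal sketch, where the Node shape is an assumption (the real accessibility API exposes richer nodes):

```kotlin
// Sketch of a non-vision mode: render an element tree as indented text
// that an LLM can read instead of a screenshot.
data class Node(val role: String, val label: String, val children: List<Node> = emptyList())

fun render(node: Node, depth: Int = 0): String = buildString {
    appendLine("  ".repeat(depth) + "${node.role}: ${node.label}")
    node.children.forEach { append(render(it, depth + 1)) }
}
```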
