Inspiration
My inspiration came from three things:
- Operator/Computer-Use by OpenAI/Anthropic
- Guilt over my own excessive phone use
- My mom's inability to use her mobile phone effectively
So I wanted to make something that uses your phone on your behalf.
What it does
This agent uses your phone on your behalf to:
- Perform boring tasks like applying to jobs on LinkedIn, replying on WhatsApp, or scrolling reels,
- Run search operations for you and summarise the results (using Tavily),
- Remember things about you, basically memories (using Mem0),
- Moderate content you want to keep out of your brain's RAM.
How we built it
Panda is a Kotlin-based, multi-agent system for Android:
- Eyes & Hands: The Android Accessibility Service provides the ability to read the screen and perform gestures.
- Brain: Google's Gemini models power all planning, reasoning, and analysis.
- Knowledge: Tavily Search provides real-time web access.
- Memory: Mem0 provides a persistent, long-term memory layer.
This is orchestrated by a team of specialised agents (Manager, Operator, Reflector, etc.) that handle everything from high-level planning to action execution and reflection.
My agents were:
- Manager: makes a high-level plan
- Operator: executes the task
- Reflector: reflects on the Operator's actions
- NoteTaker: takes important notes
- Judge: handles the content-moderation part
- DeepSearch: decides when to deep-search and what to search, and forms a reply
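The plan-act-reflect loop above can be sketched in plain Kotlin. The class names match the agent roles, but the interfaces and logic here are simplified assumptions for illustration, not Panda's actual implementation:

```kotlin
// Illustrative Manager -> Operator -> Reflector loop.
// All method signatures here are assumptions, not Panda's real API.

data class Step(val description: String)

class Manager {
    // Break a high-level goal into concrete steps.
    fun plan(goal: String): List<Step> =
        listOf(Step("Open app for: $goal"), Step("Perform: $goal"))
}

class Operator {
    // Execute one step; in the real system this drives the
    // Accessibility Service to tap, scroll, and type.
    fun execute(step: Step): String = "done: ${step.description}"
}

class Reflector {
    // Check whether the Operator's result actually advanced the plan.
    fun verify(result: String): Boolean = result.startsWith("done")
}

fun runTask(goal: String): List<String> {
    val manager = Manager()
    val operator = Operator()
    val reflector = Reflector()
    return manager.plan(goal).map { step ->
        val result = operator.execute(step)
        check(reflector.verify(result)) { "Reflector rejected: $result" }
        result
    }
}
```

Splitting planning, execution, and verification into separate agents means each one gets a narrow prompt, which is what keeps the loop reliable.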
Challenges we ran into
API Rate Limiting: Solved by implementing an intelligent key management system that rotates through a pool of 11 Gemini API keys.
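The rotation idea can be sketched as a small round-robin pool. Only the "pool of keys, rotate on failure" concept comes from the write-up; the class name, retry policy, and method names below are assumptions:

```kotlin
// Minimal sketch of round-robin API key rotation to dodge rate limits.
// ApiKeyPool and its policy are illustrative, not Panda's real code.

class ApiKeyPool(private val keys: List<String>) {
    init { require(keys.isNotEmpty()) { "need at least one key" } }
    private var index = 0

    // Hand out the next key in round-robin order.
    fun next(): String {
        val key = keys[index]
        index = (index + 1) % keys.size
        return key
    }

    // On a rate-limit error, advance to the next key and retry.
    fun <T> withRetry(maxAttempts: Int = keys.size, call: (String) -> T): T {
        var lastError: Exception? = null
        repeat(maxAttempts) {
            try { return call(next()) }
            catch (e: Exception) { lastError = e }
        }
        throw lastError ?: IllegalStateException("no attempts made")
    }
}
```

With 11 keys, a request that hits a 429 on one key simply retries on the next, so the per-key rate limit effectively multiplies by the pool size.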
LLM Hallucination: Reduced visual misinterpretations (e.g., calling the Amazon icon a "cat") by updating prompts to require cross-referencing visual data with the text-based accessibility hierarchy.
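The grounding fix amounts to feeding the model both the screenshot and the text hierarchy, and instructing it to cross-check one against the other. A hedged sketch of how such a prompt could be assembled (the `UiNode` shape and the prompt wording are illustrative assumptions):

```kotlin
// Sketch of grounding: pair each on-screen element's accessibility
// text with its bounds so the model can cross-check what it "sees"
// against what the view hierarchy says. Data shape is illustrative.

data class UiNode(val text: String, val bounds: String)

fun buildGroundedPrompt(goal: String, nodes: List<UiNode>): String =
    buildString {
        appendLine("Goal: $goal")
        appendLine("Cross-reference the screenshot with these accessibility nodes.")
        appendLine("Never name an icon that has no matching node text:")
        nodes.forEach { appendLine("- \"${it.text}\" at ${it.bounds}") }
    }
```

Because the accessibility hierarchy carries the real labels ("Amazon", not "cat"), requiring a textual match prunes most visual misreads before they reach the action step.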
Accomplishments that we're proud of
We've built one of the only open-source, on-device phone automation agents available.
While all LLM operators are still evolving, Panda has a robust foundation and can reliably perform a wide range of tasks, proving the multi-agent architecture is effective.
What we learned
This project was a deep dive into building a complete agentic system. I gained practical expertise in:
- Multi-agent design,
- Low-level Android UI automation via the Accessibility Service,
- Strategies for grounding LLMs to improve reliability in real-world scenarios.
What's next for Panda
- Ability for the agent to ask questions (currently the LLM just assumes that my brother's name is Jeff; it cannot ask). This is the biggest problem with automation agents (e.g. Browser-Use).
- Implement voice input and high-quality output with the help of ElevenLabs, instead of the current robotic-sounding voices (no money).
- Non-vision mode (no screenshots), where we use only text element descriptions from the Accessibility API. It's like rendering the DOM, but for a mobile phone.
- Give agents access to Android APIs that automate processes faster: for example, instead of finding the Amazon app icon and clicking it, just use an Android API to open the app.
- Better vision labelling: the current method is expensive and slow.
- Understand video: explore Yann LeCun's JEPA models to understand not just screenshots, but also screen recordings and animations.
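The "skip the icon hunt" idea maps to a single Android call: asking `PackageManager` for an app's launch intent instead of visually locating and tapping its icon. A hedged sketch (requires an Android `Context`; the fallback behaviour is an assumption about how Panda would use it):

```kotlin
// Launch an app directly via PackageManager instead of finding and
// tapping its icon on screen. Needs to run inside an Android app.

import android.content.Context
import android.content.Intent

fun openApp(context: Context, packageName: String): Boolean {
    val intent: Intent? = context.packageManager
        .getLaunchIntentForPackage(packageName)
    return if (intent != null) {
        intent.addFlags(Intent.FLAG_ACTIVITY_NEW_TASK)
        context.startActivity(intent)
        true   // launched in one call, no screenshots or taps needed
    } else {
        false  // not installed; fall back to the visual flow
    }
}

// e.g. openApp(context, "com.amazon.mShop.android.shopping")
```

One API call replaces an entire screenshot-analyse-locate-tap cycle, which is both faster and immune to icon-recognition hallucinations.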
Built With
- android
- android-studio
- figma
- gemini
- kotlin
- mem0
- native
- tavily