My uncle uses WhatsApp every day, yet last year he spent two hours on a government form, with my help. One confusing field, one wrong answer, and he had to start over. Millions of people hit the same wall, not because they lack intelligence, but because forms are designed for people who already know the system. I wanted to build something that could sit with anyone the way I sat with him.

Agentigator is a mobile voice agent that guides users through forms field by field. The user speaks; the agent captures the screen, reads the visible fields visually without DOM access, explains each one in plain language, confirms the answer, and moves on. If a reCAPTCHA appears, it pauses, explains what the user needs to do, and resumes when they are ready. Nothing is submitted until the user has heard every answer read back.

The biggest challenge was reliable visual field detection, since form layouts are inconsistent across sites. Barge-in handling was the second hurdle: keeping the form state intact when a user interrupts mid-instruction required careful session management. reCAPTCHA was a wall we did not expect until we tested on real government sites.

Getting a full form completed end to end, voice only, no typing, on a real mobile device was the moment the project felt real. Latency was another thing we worked hard on, because the conversation needed to feel live, not like a request-and-response cycle.

We learned that the harder problem is not the technology but the language. Explaining a government form field to someone unfamiliar with the terminology, in plain words and without being condescending, took more iteration than any part of the infrastructure.

Next, we want to improve the agentic behaviour through fine-tuning, since some edge cases go beyond what prompt engineering alone can resolve. We also plan to test with real users across different form types and iterate on real scenarios. One feature already in scope is document capture: when a form requires identity details such as a CNIC or passport number, the agent will surface a card prompting the user to either upload an image or capture one with the camera, removing the need to dictate sensitive details field by field.
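For the technically curious, a few rough sketches of the pieces described above. First, field detection: each screen capture goes to a multimodal model that returns the visible inputs as structured data. This is a minimal sketch using the google-genai SDK; the model name, prompt wording, and field schema are illustrative, not our exact production setup.

```python
# Minimal sketch of screenshot-based field detection with the google-genai SDK.
# The model name and the requested schema are illustrative placeholders.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

PROMPT = (
    "You are helping someone fill in a web form. List every visible input "
    "field in this screenshot as a JSON array with keys: label, type "
    "(text/date/select/checkbox), and a one-sentence plain-language "
    "explanation a first-time filer would understand."
)

def detect_fields(screenshot_png: bytes) -> str:
    """Return a JSON string describing the form fields visible in a screenshot."""
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[
            types.Part.from_bytes(data=screenshot_png, mime_type="image/png"),
            PROMPT,
        ],
        config=types.GenerateContentConfig(response_mime_type="application/json"),
    )
    return response.text  # one JSON entry per visible input
```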
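Second, barge-in. The idea that made it work was separating confirmed answers from the in-flight turn, so an interruption cancels only the current utterance, never the session. A simplified FastAPI sketch; the message shapes and the names `FormSession` and `barge_in` are hypothetical, for illustration only.

```python
# Sketch of the session-state idea behind barge-in handling: confirmed answers
# live outside the speech turn, so interrupting the agent never loses them.
from dataclasses import dataclass, field
from fastapi import FastAPI, WebSocket

app = FastAPI()

@dataclass
class FormSession:
    answers: dict[str, str] = field(default_factory=dict)  # committed answers
    current_field: str | None = None                        # the only "in flight" state

sessions: dict[str, FormSession] = {}

@app.websocket("/ws/{session_id}")
async def handle_turn(ws: WebSocket, session_id: str):
    await ws.accept()
    session = sessions.setdefault(session_id, FormSession())
    while True:  # a real handler would also catch WebSocketDisconnect
        msg = await ws.receive_json()
        if msg["type"] == "barge_in":
            # User spoke over the agent: drop only the pending utterance.
            session.current_field = None
            continue
        if msg["type"] == "confirm":
            # An answer is committed only after the user hears it read back.
            session.answers[msg["field"]] = msg["value"]
        await ws.send_json({"state": session.answers})
```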
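Third, latency. What makes the conversation feel live is streaming audio both ways over one persistent connection instead of a request-and-response cycle. A rough sketch of that loop, assuming the google-genai Live API; the model name, audio format, and the mic/speaker helpers are placeholders.

```python
# Rough sketch of a bidirectional audio loop over the Gemini Live API.
# mic_chunks and play_audio stand in for real capture/playback plumbing.
import asyncio
from google import genai
from google.genai import types

client = genai.Client()

async def converse(mic_chunks, play_audio):
    """mic_chunks: async iterator of raw 16 kHz PCM; play_audio: plays PCM bytes."""
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001",
        config={"response_modalities": ["AUDIO"]},
    ) as session:

        async def uplink():
            # Stream microphone audio continuously; the server handles turn
            # detection and barge-in, so there is no request/response boundary.
            async for chunk in mic_chunks:
                await session.send_realtime_input(
                    audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
                )

        async def downlink():
            # Play agent audio as it arrives instead of waiting for a full reply.
            async for response in session.receive():
                if response.data:
                    play_audio(response.data)

        await asyncio.gather(uplink(), downlink())
```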
Built With
- fastapi
- gemini-live-api
- google-cloud-run
- google-genai-sdk
- python
- react-native
- websockets