FormWhisper
Inspiration
Filling out government forms is hard enough when you're comfortable with technology and fluent in written English. For elderly individuals, people with low literacy, or anyone who finds bureaucratic paperwork overwhelming, a single government application or benefits form can feel impossible. We wanted to remove every barrier between a person and the help they deserve. No typing, no reading dense legalese, no confusion about which box to fill in.
What it does
FormWhisper turns any government PDF form into a friendly voice conversation. You upload a PDF, and FormWhisper automatically reads every fillable field, generates plain-language spoken questions, and walks you through the form one question at a time. You answer out loud. FormWhisper transcribes your voice, verifies your answer makes sense for the field, and fills the correct box in the PDF. When you're done, you download a completed, ready-to-submit document. No typing, no reading, no confusion.
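The answer-verification step can be sketched with simple per-field checks (the field types and regex rules below are illustrative assumptions for the sketch, not FormWhisper's actual validation logic):

```python
import re

# Illustrative per-field-type patterns; FormWhisper describes verification
# only at a high level, so these rules are assumptions for the sketch.
VALIDATORS = {
    "date": re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$"),
    "ssn": re.compile(r"^\d{3}-?\d{2}-?\d{4}$"),
    "zip": re.compile(r"^\d{5}(-\d{4})?$"),
}

def verify_answer(field_type, transcript):
    """Return (ok, cleaned) for a transcribed spoken answer.

    Unknown field types fall back to "any non-empty answer", so the
    pipeline never blocks on a free-text field like a name.
    """
    cleaned = transcript.strip()
    pattern = VALIDATORS.get(field_type)
    if pattern is None:
        return (bool(cleaned), cleaned)
    return (bool(pattern.match(cleaned)), cleaned)
```

If a check fails, the app can re-ask the question aloud instead of silently writing a malformed value into the PDF.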
How we built it
- Frontend: React + Vite with a clean, accessible UI. The browser captures microphone audio using the Web Audio API with silence detection to auto-stop recording after the user finishes speaking.
- Backend: Python FastAPI serving a REST API for PDF upload, form analysis, transcription, answer verification, and PDF filling.
- Vision LLM: Qwen2.5-VL-32B-Instruct (hosted on AMD hardware) visually reads each page of the uploaded PDF and extracts every fillable field, generating a conversational question for each one.
- Speech-to-Text: OpenAI Whisper for accurate voice transcription.
- Text-to-Speech: ElevenLabs for natural, warm audio question delivery.
- PDF Filling: PyMuPDF (fitz) maps each answer back to the correct AcroForm widget using positional matching, handling both text fields and checkboxes.
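The positional matching in the last step can be sketched in plain Python (the dict shapes are illustrative; in FormWhisper the widget coordinates would come from each fitz widget's `.rect` and the question coordinates from the VLM output):

```python
def pair_by_position(questions, widgets, row_tol=5.0):
    """Pair extracted questions with form widgets by reading order.

    Both lists carry on-page coordinates; sorting each top-to-bottom
    (bucketing y so fields on the same visual row compare equal) and
    then left-to-right lets us zip them index-for-index, so internal
    names like 'TextField1[6]' never need to be hardcoded.
    """
    def reading_order(item):
        # Bucket y into rows of `row_tol` points, then sort by x in a row.
        return (round(item["y"] / row_tol), item["x"])

    qs = sorted(questions, key=reading_order)
    ws = sorted(widgets, key=reading_order)
    return list(zip(qs, ws))

# Example: two questions and two widgets on the same row of a page.
questions = [
    {"label": "Last name", "x": 300, "y": 101},
    {"label": "First name", "x": 50, "y": 100},
]
widgets = [
    {"name": "TextField1[1]", "x": 298, "y": 99},
    {"name": "TextField1[0]", "x": 52, "y": 102},
]
for q, w in pair_by_position(questions, widgets):
    print(q["label"], "->", w["name"])
# First name -> TextField1[0]
# Last name -> TextField1[1]
```

Each paired widget can then be filled via fitz by setting `widget.field_value` and calling `widget.update()`, branching on `widget.field_type` for checkboxes vs. text fields.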
Challenges we ran into
- PDF field mapping: AcroForm fields use internal names like `TextField1[6]` with no relation to their visual label. Matching a user's answer to the right box required positional sorting of both VLM-extracted questions and fitz widgets by their on-page coordinates.
- Checkbox handling: Checkboxes in XFA-based PDFs store full internal paths. fitz resolves these differently than pypdf, requiring short-name fallback matching.
- Silence detection: Keeping the microphone open for the right amount of time without cutting the user off mid-sentence or waiting forever required tuning a Web Audio RMS-based VAD with per-field thresholds.
- VLM prompt engineering: Getting the vision model to reliably distinguish fillable fields from authorization statements, legal disclaimers, and instructional text took many iterations.
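The RMS-based silence check described above runs in the browser via the Web Audio API; translated to Python for illustration (frame sizes, thresholds, and function names here are assumed values, not the actual frontend code):

```python
import math

def is_silence(samples, threshold=0.01):
    """True if the RMS energy of one audio frame falls below threshold.

    `samples` are floats in [-1.0, 1.0], as a Web Audio AnalyserNode
    would deliver them; the threshold is tunable per field, since a
    yes/no answer warrants a shorter wait than a full street address.
    """
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms < threshold

def should_stop(frames, threshold=0.01, min_silent_frames=30):
    """Stop recording once the last `min_silent_frames` frames are all
    silent (~30 frames of ~23 ms each is roughly 0.7 s of silence),
    so a mid-sentence pause does not cut the user off."""
    tail = frames[-min_silent_frames:]
    return len(tail) == min_silent_frames and all(
        is_silence(f, threshold) for f in tail
    )
```

Requiring a run of consecutive silent frames, rather than a single quiet one, is what keeps the recorder from stopping on a breath between words.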
Accomplishments that we're proud of
- A fully working end-to-end pipeline: upload a PDF → hear spoken questions → answer by voice → download a filled PDF, all in under 5 minutes.
- Generic design: FormWhisper works on any AcroForm PDF, not just one specific form type.
- Clean, accessible UI that a non-technical user could navigate without any instructions.
- Robust positional field assignment that requires zero hardcoded field names.
What we learned
- Vision-language models are surprisingly capable at reading and interpreting scanned government forms, but need careful prompting to avoid hallucinating non-existent fields.
- PDF internals (AcroForm vs XFA, field name paths, widget types) are far more complex than they appear from the outside.
- Accessibility-first design forced us to make better engineering decisions: if it's too complicated to explain to a first-time user in one sentence, it's too complicated.
What's next for FormWhisper
- Multi-language support: serve users who speak Spanish, Mandarin, Vietnamese, and other languages common in underserved communities.
- Mobile app: a phone-native version so users can fill forms anywhere without needing a laptop.
- Form library: pre-analyzed versions of the most common government forms so analysis is instant.
- Assisted mode: a caregiver or caseworker can sit with a user and guide them through the form together in the same session.
- Offline mode: local Whisper + smaller LLM for use in areas with limited connectivity after a disaster.
Built With
- amd
- elevenlabs
- fastapi
- huggingface
- javascript
- python
- react.js