Inspiration

Many people struggle to understand essential documents such as government orders (GOs), medical bills, and legal notices because of literacy challenges or language barriers. DocuVoice gives users independence by translating documents, explaining them in simple terms, extracting key actions, and reading them aloud in the user's native language.

What it does

  • Multimodal Upload - Upload images or PDFs of any document
  • Smart Translation - Translate to 10+ Indian languages with simplified summaries
  • Voice Reader - High-quality TTS with custom speed controls
  • Action Items - Auto-extract tasks and deadlines as checklists
  • Official Document Detection - Extract GO Numbers, Department Names, and Dates
  • Location Grounding - Display addresses on interactive Google Maps

How we built it

  • Frontend - React 19, Tailwind CSS, Lucide React
  • AI Models - gemini-3-flash-preview for text extraction, gemini-2.5-flash-preview-tts for audio
  • Audio Engine - Custom Web Audio API decoder for raw PCM data
  • Maps - gemini-2.5-flash with Google Maps grounding
  • Deep Reasoning - Dynamic thinkingConfig with a 2048-token budget for complex forms
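
The Audio Engine bullet can be illustrated with a minimal sketch of the conversion step. It assumes the TTS audio arrives as signed 16-bit little-endian PCM that has already been base64-decoded into a Uint8Array. The function name pcm16ToFloat32 and the channel handling are hypothetical, not the project's actual decoder.

```typescript
// Hypothetical helper: converts raw 16-bit little-endian PCM bytes into
// one Float32Array per channel, with samples scaled to the range [-1, 1).
// Assumption: samples are interleaved by channel (mono when channels = 1).
function pcm16ToFloat32(bytes: Uint8Array, channels = 1): Float32Array[] {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  // Two bytes per sample; any trailing partial frame is ignored.
  const frameCount = Math.floor(bytes.byteLength / 2 / channels);

  const output: Float32Array[] = [];
  for (let c = 0; c < channels; c++) {
    output.push(new Float32Array(frameCount));
  }

  for (let frame = 0; frame < frameCount; frame++) {
    for (let c = 0; c < channels; c++) {
      const byteOffset = (frame * channels + c) * 2;
      const sample = view.getInt16(byteOffset, true); // true = little-endian
      output[c][frame] = sample / 32768;
    }
  }
  return output;
}
```

In a browser, each returned array could then be copied into an AudioBuffer with copyToChannel and played through an AudioBufferSourceNode. The sample rate passed to createBuffer (24000 Hz is a common choice for this kind of output) is an assumption to verify against the response metadata.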

Challenges we ran into

  • Raw Audio Decoding - Built custom TypeScript decoder for raw PCM data from Gemini TTS
  • JSON Reliability - Extensive prompt engineering for consistent structured outputs across RTL languages
  • Official Data Extraction - Used Gemini 3's reasoning to distinguish GO Numbers from generic text
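
Prompt engineering reduces malformed output, but a defensive parser on the client side is a common complement. The sketch below is one possible approach, not the project's code: the DocumentAnalysis shape, its field names, and parseModelJson are all hypothetical. It tolerates code fences or stray characters around the JSON object and rejects anything that does not match the expected structure.

```typescript
// Hypothetical response shape; the real schema is not given in this write-up.
interface ActionItem {
  task: string;
  deadline: string | null;
}

interface DocumentAnalysis {
  summary: string;
  actionItems: ActionItem[];
}

// Returns a validated object, or null if the model output is unusable.
function parseModelJson(raw: string): DocumentAnalysis | null {
  // Keep only the outermost {...} span, which drops code fences and
  // any stray marks the model may emit around the JSON body.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return null;

  let data: unknown;
  try {
    data = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;

  const record = data as Record<string, unknown>;
  const summary = record["summary"];
  const items = record["actionItems"];
  if (typeof summary !== "string" || !Array.isArray(items)) return null;

  const actionItems: ActionItem[] = [];
  for (const item of items) {
    if (typeof item !== "object" || item === null) return null;
    const task = (item as Record<string, unknown>)["task"];
    const deadline = (item as Record<string, unknown>)["deadline"];
    if (typeof task !== "string") return null;
    // A missing or non-string deadline is normalised to null.
    actionItems.push({ task, deadline: typeof deadline === "string" ? deadline : null });
  }
  return { summary, actionItems };
}
```

Returning null instead of throwing lets the caller decide whether to retry the request or show a fallback message.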

Accomplishments that we're proud of

  • Native Indian Language Support - RTL layouts for Urdu and Arabic, plus broad Indic script support
  • Intent Extraction - Action items feature that goes beyond translation
  • Custom Audio Player - Variable playback speeds (0.5x-2.0x) with raw buffer handling
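
Two of the points above come down to small pieces of UI logic: choosing a text direction per language and keeping the playback speed inside the 0.5x-2.0x range. The helpers below are an illustrative sketch under stated assumptions. The names textDirection and clampPlaybackRate are hypothetical, and the RTL set covers only the two languages named in this write-up.

```typescript
// Assumption: only Urdu and Arabic need right-to-left layout here.
const RTL_LANGUAGES = new Set<string>(["ur", "ar"]);

// Maps a language tag such as "ur" or "ur-IN" to a value suitable
// for the HTML dir attribute.
function textDirection(languageTag: string): "rtl" | "ltr" {
  const [primary = ""] = languageTag.toLowerCase().split("-");
  return RTL_LANGUAGES.has(primary) ? "rtl" : "ltr";
}

const MIN_RATE = 0.5;
const MAX_RATE = 2.0;

// Keeps a requested playback speed inside the supported range and
// falls back to normal speed for NaN or infinite input.
function clampPlaybackRate(requested: number): number {
  if (!Number.isFinite(requested)) return 1.0;
  return Math.min(MAX_RATE, Math.max(MIN_RATE, requested));
}
```

With the Web Audio API, the clamped value would typically be assigned to the playbackRate parameter of an AudioBufferSourceNode. Changing that rate also shifts pitch, which may or may not be acceptable for speech.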

What we learned

  • Leveraging thinkingConfig in Gemini 3 for complex document analysis
  • Chaining models (Gemini 3 + Gemini 2.5) for multimodal grounding
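
As a rough illustration of the dynamic thinkingConfig idea, the sketch below builds a request config that enables the 2048-token thinking budget only when a document looks complex. The complexity heuristic, the DocumentHints type, and the threshold are hypothetical. The thinkingConfig.thinkingBudget field follows the Gemini SDK's config shape, but whether a given model honours it should be checked against the current API documentation.

```typescript
// Hypothetical hints produced by an earlier, cheaper pass over the document.
interface DocumentHints {
  fieldCount: number; // number of form fields detected
  hasTables: boolean; // whether tabular data was detected
}

// Subset of a generation config; only the part relevant to thinking.
interface ThinkingRequestConfig {
  thinkingConfig?: { thinkingBudget: number };
}

const COMPLEX_FORM_BUDGET = 2048; // token budget from this write-up
const COMPLEX_FIELD_THRESHOLD = 10; // assumption: illustrative cut-off

function buildThinkingConfig(hints: DocumentHints): ThinkingRequestConfig {
  const isComplex = hints.hasTables || hints.fieldCount >= COMPLEX_FIELD_THRESHOLD;
  // Simple documents leave thinkingConfig unset so the model default applies.
  return isComplex ? { thinkingConfig: { thinkingBudget: COMPLEX_FORM_BUDGET } } : {};
}
```

The returned object would be merged into the config passed with the extraction request, so simple notices avoid the extra latency while dense forms get the larger reasoning budget.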

What's next for DocuVoice

  • PWA - Offline support for rural users with spotty internet
  • Chat with Document - Voice Q&A feature for spoken queries
  • Form Filling - AI-assisted form completion through dictation
