Inspiration

Indian citizens interact with government documents every day: Aadhaar cards, PAN cards, land records, court notices, ration cards. But almost all of these exist as blurry scans, in regional languages, with inconsistent formatting that no existing OCR system can reliably process. A village CSC operator manually types data from hundreds of documents daily because no tool understands Hindi, Tamil, or Telugu documents properly. A lawyer cannot search across scanned court orders. A citizen cannot verify what their land record actually says. We built DocuRec AI to change that.

What it does

DocuRec AI is a full-stack agentic platform that converts any Indian government document into structured, searchable, and actionable intelligence. Core capabilities:

  • Upload any scan, however blurry: JPEG, PNG, TIFF, and PDF are supported
  • Tesseract OCR extracts raw text locally, then Google Gemini AI cleans, structures, and understands it
  • Document type is automatically identified: Aadhaar, PAN, land record, court notice, ration card, and 10 more types
  • 12+ Indian languages detected: Hindi, Tamil, Telugu, Bengali, Marathi, Gujarati, and more
  • Every extracted field gets a confidence score
  • PII is automatically detected and masked: Aadhaar number, phone number, address, date of birth
  • Chat with your documents in natural language: ask "What is the DOB in this Aadhaar?" and get a direct answer
  • Compare two versions of a document and see exactly what changed at the field level
  • Translate extracted text to any Indian language in one click
  • Human-in-the-loop review system for low-confidence extractions
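To make the PII masking capability above concrete, here is a minimal sketch of pattern-based masking for two of the listed fields. It is an illustration under assumptions, not DocuRec AI's actual detection logic: the regular expressions, the `maskPii` and `maskDigits` names, and the choice to keep the last four digits visible are all hypothetical.

```typescript
// Hypothetical sketch: naive regex-based masking for Aadhaar and phone numbers.
// The real platform may detect PII differently (for example via the AI step).

type PiiKind = "aadhaar" | "phone";

interface PiiMatch {
  kind: PiiKind;
  original: string;
  masked: string;
}

// Aadhaar numbers have 12 digits, often printed as three groups of four.
const AADHAAR_PATTERN = /\b\d{4}[ -]?\d{4}[ -]?\d{4}\b/g;
// Indian mobile numbers have 10 digits and start with 6, 7, 8, or 9.
const PHONE_PATTERN = /\b[6-9]\d{9}\b/g;

// Replace every digit except the last `visible` ones with "X",
// leaving separators such as spaces and hyphens untouched.
function maskDigits(value: string, visible: number = 4): string {
  const totalDigits = value.replace(/\D/g, "").length;
  let seen = 0;
  return value.replace(/\d/g, (digit) => {
    seen += 1;
    return seen > totalDigits - visible ? digit : "X";
  });
}

function maskPii(text: string): { masked: string; matches: PiiMatch[] } {
  const matches: PiiMatch[] = [];
  const apply = (input: string, pattern: RegExp, kind: PiiKind): string =>
    input.replace(pattern, (original) => {
      const masked = maskDigits(original);
      matches.push({ kind, original, masked });
      return masked;
    });

  // Aadhaar first, so its digit groups are already masked before the phone pass.
  const afterAadhaar = apply(text, AADHAAR_PATTERN, "aadhaar");
  const masked = apply(afterAadhaar, PHONE_PATTERN, "phone");
  return { masked, matches };
}

const demo = maskPii("Aadhaar: 1234 5678 9012, Mobile: 9876543210");
console.log(demo.masked);
```

Addresses and dates of birth do not follow a fixed digit pattern, so a sketch like this would not cover them; those fields would need the structured extraction step to label them first.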

Upload a blurry Aadhaar scan. In 15 to 20 seconds you get structured data, masked PII, and a document you can actually talk to.

How we built it

Frontend: React 18, Vite, TailwindCSS, Framer Motion, shadcn/ui, Redux Toolkit for global state, and React Query for server state management.

Backend: Node.js and Express.js with a full production middleware stack: Helmet for HTTP security headers, express-rate-limit with per-route limits, JWT authentication, bcrypt password hashing, CORS whitelist, and express-validator for all inputs.

AI Pipeline: a deliberate two-step design:

  1. Tesseract.js runs locally on the server for fast raw text extraction
  2. Google Gemini 1.5 Flash post-processes the output: correcting OCR errors, detecting languages, classifying document type, and extracting structured fields as validated JSON

Database and Caching: MongoDB with Mongoose for document storage with full text indexing, Redis for response caching and session management, Winston with daily log rotation for structured logging.

Storage: ImageKit for CDN-based image storage, automatic compression, and fast delivery.

Architecture: MVP production folder structure with separation of controllers, services, models, middleware, and utilities. Versioned API routes at /api/v1/, typed error classes, global error middleware, and graceful shutdown handling.
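Step 2 of the AI pipeline depends on the model returning structured fields as validated JSON. The sketch below shows one way such validation could look. The response shape (`documentType`, `language`, `fields` with `name`, `value`, `confidence`) and the function names are assumptions made for illustration; the source does not specify the actual schema.

```typescript
// Hypothetical shape of the structured output requested from the model.
interface ExtractedField {
  name: string;
  value: string;
  confidence: number; // assumed to be on a 0-100 scale
}

interface Extraction {
  documentType: string;
  language: string;
  fields: ExtractedField[];
}

// Models sometimes wrap JSON in a Markdown code fence; unwrap it if present.
function stripCodeFence(raw: string): string {
  const trimmed = raw.trim();
  const match = trimmed.match(/^`{3}(?:json)?\s*([\s\S]*?)\s*`{3}$/);
  return (match?.[1] ?? trimmed).trim();
}

function isField(value: unknown): value is ExtractedField {
  if (typeof value !== "object" || value === null) return false;
  const f = value as Record<string, unknown>;
  return (
    typeof f.name === "string" &&
    typeof f.value === "string" &&
    typeof f.confidence === "number" &&
    f.confidence >= 0 &&
    f.confidence <= 100
  );
}

// Parse the raw model text and reject anything that does not match the shape.
function parseExtraction(raw: string): Extraction {
  const parsed: unknown = JSON.parse(stripCodeFence(raw));
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Model output is not a JSON object");
  }
  const { documentType, language, fields } = parsed as Record<string, unknown>;
  if (typeof documentType !== "string" || typeof language !== "string") {
    throw new Error("Missing documentType or language");
  }
  if (!Array.isArray(fields) || !fields.every(isField)) {
    throw new Error("fields must be an array of {name, value, confidence}");
  }
  return { documentType, language, fields };
}
```

Rejecting malformed output at this boundary means a bad model response surfaces as an explicit error rather than as half-filled fields further down the pipeline.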

Challenges we ran into

Blurry image accuracy: Low-resolution Indian document scans gave Tesseract very poor raw output, and Gemini would return low confidence scores, causing documents to get stuck in needs_review status. We rewrote the Gemini prompt to be aggressive about partial extraction and lowered the needs_review threshold from 50 to 35.

Silent save failures: Documents were uploading successfully but OCR results were not being saved to the database. The error was a Mongoose enum mismatch: the ocrEngine field only accepted tesseract, paddleocr, and hybrid, but the service was saving tesseract+gemini. One-line fix, hours to find.

Chat returning no documents: The chat service queried only status: completed documents, but blurry scans were being marked needs_review. All three document queries were updated to include both statuses.

Chat sessions disappearing on navigation: React Query was garbage-collecting chat session data when the user navigated away. Fixed by setting gcTime to 10 minutes and chat session staleTime to Infinity.

No real-time processing feedback: Document status was not updating without a manual page refresh. Added refetchInterval polling every 3 seconds that automatically stops when processing completes or fails, plus toast notifications on status change.
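Three of the fixes above come down to small pieces of status logic, sketched here as plain functions. This is an illustration under assumptions: the `processing` and `failed` status names, the `userId` filter field, and the idea of a single overall confidence value are hypothetical, while `completed`, `needs_review`, the threshold of 35, and the 3-second interval come from the text.

```typescript
// "completed" and "needs_review" appear in the writeup; the other two are assumed.
type DocumentStatus = "processing" | "completed" | "needs_review" | "failed";

// Threshold after the change described above (previously 50).
const NEEDS_REVIEW_THRESHOLD = 35;

// Assumption: a document falls into review when its confidence is below the threshold.
function statusFromConfidence(overallConfidence: number): DocumentStatus {
  return overallConfidence < NEEDS_REVIEW_THRESHOLD ? "needs_review" : "completed";
}

// Chat should see documents in either finished state, not only "completed".
const CHATTABLE_STATUSES: DocumentStatus[] = ["completed", "needs_review"];

// MongoDB-style filter using the $in operator; `userId` is a hypothetical field.
function chatDocumentFilter(userId: string) {
  return { userId, status: { $in: CHATTABLE_STATUSES } };
}

const POLL_INTERVAL_MS = 3000;

// Returns the next polling delay, or false to stop once processing has ended.
function pollingInterval(status: DocumentStatus | undefined): number | false {
  return status === undefined || status === "processing" ? POLL_INTERVAL_MS : false;
}
```

Assuming TanStack Query v5, where `refetchInterval` may be a function of the query, the last helper could be wired in as `refetchInterval: (query) => pollingInterval(query.state.data?.status)`, so polling stops by itself once a terminal status arrives.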

The most dangerous bug was completely silent. The app said the upload succeeded and the user saw a document card, but none of the extracted data was actually saved. One enum value caused it all.
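One way to guard against this class of bug is to keep the allowed engine values in a single list and fail loudly when a value falls outside it. The sketch below is plain TypeScript rather than the project's Mongoose schema, and it assumes the fix was to add tesseract+gemini to the allowed values; `assertOcrEngine`, `saveOcrResult`, and the `persist` callback are hypothetical names.

```typescript
// Single source of truth for allowed engines. The first three values are the ones
// the writeup says the schema accepted; the fourth is the value the service saved.
const OCR_ENGINES = ["tesseract", "paddleocr", "hybrid", "tesseract+gemini"] as const;
type OcrEngine = (typeof OCR_ENGINES)[number];

// Validate at the service boundary so a mismatch cannot pass unnoticed.
function assertOcrEngine(value: string): OcrEngine {
  if ((OCR_ENGINES as readonly string[]).indexOf(value) === -1) {
    throw new Error(
      `Unknown ocrEngine "${value}"; expected one of: ${OCR_ENGINES.join(", ")}`
    );
  }
  return value as OcrEngine;
}

interface SaveResult {
  ok: boolean;
  error?: string;
}

// `persist` stands in for the real database write (hypothetical).
// Any failure is returned to the caller instead of being swallowed.
function saveOcrResult(engine: string, persist: (engine: OcrEngine) => void): SaveResult {
  try {
    persist(assertOcrEngine(engine));
    return { ok: true };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, error: message };
  }
}
```

The same list could also feed the schema's enum option, so the service and the database layer cannot drift apart again, and the upload flow can report success only when the save result is ok.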

Accomplishments that we are proud of

  • A fully production-grade full-stack application built end to end during a hackathon
  • Real multilingual Indian document processing that works on genuinely blurry, real-world scans
  • A two-layer AI pipeline combining local Tesseract OCR and cloud Gemini AI for the right balance of speed and accuracy
  • Automatic PII detection and masking with no manual field configuration required
  • Polished responsive UI with purposeful animations that feel like a real product
  • Complete mid-project migration from Anthropic to Google Gemini without any feature regression

What we learned

  • Prompt engineering matters more than model selection. A precisely written prompt on Gemini Flash consistently outperformed a vague prompt on Gemini Pro.
  • Redis caching is not optional at scale. It reduced repeated analytics queries from 276ms to under 10ms.
  • Indian OCR is a genuinely hard problem. Blurry scans, mixed scripts, and code-mixed text where Hindi and English appear in the same sentence require AI post-processing — raw OCR alone is not enough.
  • Production security must be set up from day one. Retrofitting Helmet, rate limiting, and input validation into an existing codebase is significantly harder than building it in from the start.
  • Real-time UI feedback is not a nice-to-have. Processing banners, auto-polling, and toast notifications are what make an app feel trustworthy rather than broken.
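The caching point above can be illustrated with the cache-aside pattern: check the cache first, and run the expensive query only on a miss. This is a minimal sketch, not the project's code. An in-memory Map stands in for Redis so the example runs on its own, and the `TtlCache` class, the key name, and the 60-second TTL are all hypothetical.

```typescript
interface CacheEntry<T> {
  value: T;
  expiresAt: number;
}

// In-memory stand-in for a Redis cache with per-key expiry.
class TtlCache<T> {
  private entries = new Map<string, CacheEntry<T>>();
  private ttlMs: number;
  private now: () => number;

  // `now` is injectable so expiry can be tested without waiting.
  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Cache-aside: return a fresh cached value, otherwise compute and store it.
  getOrCompute(key: string, compute: () => T): T {
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > this.now()) {
      return hit.value;
    }
    const value = compute();
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }

  // Drop a key when the underlying data changes (for example after a new upload).
  invalidate(key: string): void {
    this.entries.delete(key);
  }
}

// Usage: the second call within the TTL skips the expensive computation.
const analyticsCache = new TtlCache<number>(60_000);
let queryRuns = 0;
const runAnalyticsQuery = (): number => {
  queryRuns += 1; // stands in for a slow database aggregation
  return 42;
};
analyticsCache.getOrCompute("analytics:summary", runAnalyticsQuery);
analyticsCache.getOrCompute("analytics:summary", runAnalyticsQuery);
console.log(queryRuns);
```

The trade-off is staleness: cached analytics can lag behind new uploads by up to the TTL unless the key is invalidated when documents change.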

What is next for DocuRec AI

  • PaddleOCR integration for higher accuracy on complex regional scripts like Tamil and Telugu
  • Auto form-fill — upload a blank government form and have it automatically filled from your document library
  • Offline processing mode — run the entire pipeline locally without sending data to any cloud service
  • Government API verification — verify Aadhaar and PAN numbers directly through official UIDAI and income tax APIs
  • Multi-user organizations — allow CSC operators to manage documents across multiple citizens under one account
