DocuRec AI — Intelligent Indian Document Processing

Inspiration

Indian citizens interact with government documents every day: Aadhaar cards, PAN cards, land records, court notices, ration cards. But almost all of these exist as blurry scans, in regional languages, with inconsistent formatting that existing OCR systems struggle to process reliably. A village Common Service Centre (CSC) operator manually types data from hundreds of documents daily because no tool handles Hindi, Tamil, or Telugu documents properly. A lawyer cannot search across scanned court orders. A citizen cannot verify what their land record actually says. We built DocuRec AI to change that.

What it does

DocuRec AI is a full-stack agentic platform that converts any Indian government document into structured, searchable, and actionable intelligence. Core capabilities:

Upload any scan, even a blurry one: JPEG, PNG, TIFF, and PDF are supported
Tesseract OCR extracts raw text locally; Google Gemini AI cleans, structures, and interprets it
Document type is identified automatically: Aadhaar, PAN, land record, court notice, ration card, and 10 more types
12+ Indian languages are detected, including Hindi, Tamil, Telugu, Bengali, Marathi, and Gujarati
Every extracted field gets a confidence score
PII is automatically detected and masked: Aadhaar number, phone number, address, date of birth
Chat with your documents in natural language: ask "What is the DOB in this Aadhaar?" and get a direct answer
Compare two versions of a document and see exactly what changed at the field level
Translate extracted text to any Indian language in one click
Human-in-the-loop review for low-confidence extractions
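
To make the masking capability concrete, here is a minimal JavaScript sketch of pattern-based masking for two of the PII types listed above. It is not DocuRec AI's actual implementation, since this writeup does not say how detection works. The function names maskAadhaar, maskPhone, and maskPii are hypothetical, and keeping only the last four Aadhaar digits is an assumption made for the example.

```javascript
// Hypothetical helper: masks a 12-digit Aadhaar number, keeping the last four digits.
// Accepts the digits either unbroken or grouped 4-4-4 with spaces or hyphens.
function maskAadhaar(text) {
  const pattern = /\b(\d{4})[\s-]?(\d{4})[\s-]?(\d{4})\b/g;
  return text.replace(pattern, (_match, _first, _second, last) => `XXXX XXXX ${last}`);
}

// Hypothetical helper: masks a standalone 10-digit mobile number starting with 6-9,
// keeping the last two digits. Country-code prefixes are not handled in this sketch.
function maskPhone(text) {
  const pattern = /(?<!\d)[6-9]\d{9}(?!\d)/g;
  return text.replace(pattern, (match) => "X".repeat(8) + match.slice(-2));
}

// Aadhaar masking runs first so a 12-digit run is never partially treated as a phone number.
function maskPii(text) {
  return maskPhone(maskAadhaar(text));
}

console.log(maskPii("Aadhaar: 2345 6789 0123, Mobile: 9876543210"));
```

Regular expressions only catch well-formed numbers; free-form PII such as addresses would need a different approach, for example the model-based extraction described above.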

Upload a blurry Aadhaar scan. In 15 to 20 seconds you get structured data, masked PII, and a document you can actually talk to.

How we built it

Frontend: React 18, Vite, TailwindCSS, Framer Motion, shadcn/ui, Redux Toolkit for global state, and React Query for server state management.

Backend: Node.js and Express.js with a full production middleware stack that includes Helmet for HTTP security headers, express-rate-limit with per-route limits, JWT authentication, bcrypt password hashing, a CORS whitelist, and express-validator for all inputs.

AI Pipeline (a deliberate two-step design):

Tesseract.js runs locally on the server for fast raw text extraction
Google Gemini 1.5 Flash post-processes the output: correcting OCR errors, detecting languages, classifying document type, and extracting structured fields as validated JSON

Database and Caching: MongoDB with Mongoose for document storage with full text indexing, Redis for response caching and session management, and Winston with daily log rotation for structured logging.

Storage: ImageKit for CDN-based image storage, automatic compression, and fast delivery.

Architecture: MVP production folder structure with separation of controllers, services, models, middleware, and utilities. Versioned API routes at /api/v1/, typed error classes, global error middleware, and graceful shutdown handling.
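
The second pipeline step returns structured fields as validated JSON. A minimal sketch of what such a validation step could look like is shown below. The response shape (documentType, fields, name, value, confidence) and the parseExtraction function are assumptions for illustration, not the project's actual schema, and the 0 to 100 confidence scale is inferred from the thresholds mentioned in this writeup.

```javascript
// Hypothetical validator for the model's text response.
// Returns { ok: true, value } on success or { ok: false, error } on failure,
// so callers can route bad responses to review instead of saving them.
function parseExtraction(rawModelText) {
  let parsed;
  try {
    parsed = JSON.parse(rawModelText);
  } catch {
    return { ok: false, error: "response is not valid JSON" };
  }
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    return { ok: false, error: "response is not a JSON object" };
  }
  if (typeof parsed.documentType !== "string" || parsed.documentType === "") {
    return { ok: false, error: "documentType must be a non-empty string" };
  }
  if (!Array.isArray(parsed.fields)) {
    return { ok: false, error: "fields must be an array" };
  }
  for (const field of parsed.fields) {
    const validField =
      field !== null &&
      typeof field === "object" &&
      typeof field.name === "string" &&
      typeof field.value === "string" &&
      Number.isFinite(field.confidence) &&
      field.confidence >= 0 &&
      field.confidence <= 100;
    if (!validField) {
      return { ok: false, error: "each field needs name, value, and a 0-100 confidence" };
    }
  }
  return { ok: true, value: parsed };
}

// Usage with a made-up response string.
const sample = '{"documentType":"aadhaar","fields":[{"name":"dob","value":"01/01/1990","confidence":82}]}';
console.log(parseExtraction(sample).ok);
```

Returning a result object rather than throwing keeps the decision about what to do with a malformed response (retry, mark for review, fail) in the calling service.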

Challenges we ran into

Blurry image accuracy: Low-resolution Indian document scans gave Tesseract very poor raw output, and Gemini would return low confidence scores, leaving documents stuck in needs_review status. We rewrote the Gemini prompt to be aggressive about partial extraction and lowered the needs_review threshold from 50 to 35.

Silent save failures: Documents were uploading successfully but OCR results were not being saved to the database. The error was a Mongoose enum mismatch: the ocrEngine field only accepted tesseract, paddleocr, and hybrid, but the service was saving tesseract+gemini. One-line fix, hours to find.

Chat returning no documents: The chat service queried only status: completed documents, but blurry scans were being marked needs_review. All three document queries were updated to include both statuses.

Chat sessions disappearing on navigation: React Query was garbage collecting chat session data when the user navigated away. Fixed by setting gcTime to 10 minutes and chat session staleTime to Infinity.

No real-time processing feedback: Document status was not updating without a manual page refresh. Added refetchInterval polling every 3 seconds that automatically stops when processing completes or fails, plus toast notifications on status change.
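
Several of these fixes come down to a few lines of configuration. The sketch below shows roughly what they could look like; it is an illustration under stated assumptions, not the project's code. The 35 threshold, the two chat statuses, the 10-minute gcTime, the Infinity staleTime, and the 3-second interval come from the text above. The function names, the userId field, the "failed" status name, the 0 to 100 confidence scale, and treating needs_review as a state where polling stops are all assumptions.

```javascript
// Confidence below this value routes a document to human review (lowered from 50).
const REVIEW_THRESHOLD = 35;

function statusForConfidence(confidence) {
  return confidence < REVIEW_THRESHOLD ? "needs_review" : "completed";
}

// MongoDB filter so chat sees documents in either status.
// The userId field name is a hypothetical example.
const CHAT_VISIBLE_STATUSES = ["completed", "needs_review"];

function chatDocumentFilter(userId) {
  return { userId, status: { $in: CHAT_VISIBLE_STATUSES } };
}

// React Query v5 options for chat sessions: never stale, kept in cache
// for 10 minutes after the last observer unmounts.
const chatSessionQueryOptions = {
  staleTime: Infinity,
  gcTime: 10 * 60 * 1000,
};

// Statuses after which polling should stop (assumed set).
const TERMINAL_STATUSES = new Set(["completed", "needs_review", "failed"]);

// React Query v5 passes the query object to a refetchInterval function;
// returning false stops polling, returning a number polls at that interval in ms.
function documentRefetchInterval(query) {
  const status = query.state.data?.status;
  return TERMINAL_STATUSES.has(status) ? false : 3000;
}

console.log(statusForConfidence(40), chatDocumentFilter("user-1"));
```

In a component, documentRefetchInterval would be passed as the refetchInterval option of useQuery, and chatSessionQueryOptions would be spread into the chat session query.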

The most dangerous bug was completely silent. The app said upload succeeded, the user saw a document card, but nothing was actually extracted and saved. One enum value caused it all.
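
One way to guard against this class of bug is to keep the allowed engine names in a single shared constant and check values before they reach the database. The sketch below assumes that approach; OCR_ENGINES and assertOcrEngine are hypothetical names, and the writeup does not say whether the actual fix was structured this way.

```javascript
// Single source of truth for allowed engine names, including the value
// that was missing from the schema in the bug described above.
const OCR_ENGINES = Object.freeze(["tesseract", "paddleocr", "hybrid", "tesseract+gemini"]);

// With Mongoose, the schema field could reference the same array:
//   ocrEngine: { type: String, enum: OCR_ENGINES }

// Hypothetical guard for the service layer: fails loudly with a descriptive
// message instead of letting a validation error be swallowed further down.
function assertOcrEngine(value) {
  if (!OCR_ENGINES.includes(value)) {
    throw new RangeError(
      `Unknown ocrEngine "${value}"; expected one of: ${OCR_ENGINES.join(", ")}`
    );
  }
  return value;
}

console.log(assertOcrEngine("tesseract+gemini"));
```

Because the schema and the service read from the same array, adding a new engine is a change in one place.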

Accomplishments that we are proud of

A fully production-grade full-stack application built end to end during a hackathon
Real multilingual Indian document processing that works on genuinely blurry, real-world scans
A two-layer AI pipeline combining local Tesseract OCR and cloud Gemini AI for the right balance of speed and accuracy
Automatic PII detection and masking with no manual field configuration required
Polished responsive UI with purposeful animations that feel like a real product
Complete mid-project migration from Anthropic to Google Gemini without any feature regression

What we learned

Prompt engineering matters more than model selection. A precisely written prompt on Gemini Flash consistently outperformed a vague prompt on Gemini Pro.
Redis caching is not optional at scale. It reduced repeated analytics queries from 276ms to under 10ms.
Indian OCR is a genuinely hard problem. Blurry scans, mixed scripts, and code-mixed text where Hindi and English appear in the same sentence require AI post-processing — raw OCR alone is not enough.
Production security must be set up from day one. Retrofitting Helmet, rate limiting, and input validation into an existing codebase is significantly harder than building it in from the start.
Real-time UI feedback is not a nice-to-have. Processing banners, auto-polling, and toast notifications are what make an app feel trustworthy rather than broken.
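
The caching lesson above follows the common cache-aside pattern: check the cache, fall back to the database on a miss, then store the result with a time to live. Here is a minimal sketch of that pattern. The cached helper and the in-memory createMemoryCache stand-in are hypothetical; a real deployment would put a Redis client behind the same get/set shape, and the key name and 60-second TTL are made up for the example.

```javascript
// Cache-aside helper. `cache` is any object with async get(key) and
// async set(key, value, ttlSeconds); values are stored as JSON strings.
async function cached(cache, key, ttlSeconds, loader) {
  const hit = await cache.get(key);
  if (hit !== null && hit !== undefined) {
    return JSON.parse(hit);
  }
  const fresh = await loader();
  await cache.set(key, JSON.stringify(fresh), ttlSeconds);
  return fresh;
}

// In-memory stand-in for Redis so the sketch runs without a server.
function createMemoryCache() {
  const store = new Map();
  return {
    async get(key) {
      const entry = store.get(key);
      if (!entry || entry.expiresAt <= Date.now()) {
        store.delete(key);
        return null;
      }
      return entry.value;
    },
    async set(key, value, ttlSeconds) {
      store.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
    },
  };
}

// Usage: the second call is served from the cache, so the loader runs once.
async function demo() {
  const cache = createMemoryCache();
  let loaderCalls = 0;
  const loadAnalytics = async () => {
    loaderCalls += 1;
    return { totalDocuments: 3 };
  };
  await cached(cache, "analytics:summary", 60, loadAnalytics);
  const second = await cached(cache, "analytics:summary", 60, loadAnalytics);
  return { loaderCalls, second };
}

demo().then((result) => console.log(result));
```

The TTL is the main tuning knob: a short value keeps analytics close to live, while a longer one saves more database work.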

What is next for DocuRec AI

PaddleOCR integration for higher accuracy on complex regional scripts like Tamil and Telugu
Auto form-fill — upload a blank government form and have it automatically filled from your document library
Offline processing mode — run the entire pipeline locally without sending data to any cloud service
Government API verification — verify Aadhaar and PAN numbers directly through official UIDAI and income tax APIs
Multi-user organizations — allow CSC operators to manage documents across multiple citizens under one account

Built With

axios
bullmq
express.js
framer-motion
google-gemini-ai
helmet
imagekit
jwt
mongodb
mongoose
node.js
react
react-query
redis
redux-toolkit
shadcn/ui
tailwindcss
tesseract.js
vite
winston

Updates

Priyanshu Pandey started this project — Jun 03, 2026 10:14 AM EDT
