AI Document Analyzer

Inspiration

As a second-year IT student building projects entirely from my phone, I constantly struggled with long PDFs, research papers, and dense documents. There was no quick way to extract what actually mattered.

I built this for every student, researcher, and professional who doesn't have hours to read — but still needs to understand.

What It Does

An AI-powered document analysis API that supports:

📄 PDF & DOCX parsing with full text extraction
🖼️ OCR for scanned/image-based documents
📝 Summarization — key points in seconds
🏷️ Named Entity Recognition (NER) — people, places, orgs
💬 Sentiment Analysis — tone detection across content
🗂️ Topic Classification — auto-categorize documents
🔍 Document Comparison — diff two docs intelligently
❓ Q&A — ask any question, get answers from your document

How I Built It

Built entirely on Google Colab from my phone — no laptop, no desktop.

Tech Stack:

Python — core logic
Groq API (llama-3.3-70b-versatile) — LLM backbone
PyMuPDF / pdfplumber — PDF extraction
python-docx — DOCX parsing
pytesseract — OCR for scanned files
FastAPI — REST API layer
HuggingFace Transformers — initial NER pipeline

The Q&A feature works by chunking the document into segments, passing the relevant chunks plus the user's question to the LLM, and returning an answer grounded in that text, which limits hallucination from outside context.

Mathematically, for a document $D$ split into chunks $\{c_1, c_2, \ldots, c_n\}$, the retrieval step selects the most relevant chunk:

$$c^* = \arg\max_{c_i} \text{sim}(q, c_i)$$

where $q$ is the user query and $\text{sim}$ is cosine similarity over embeddings. The answer is then generated conditioned on $c^*$:

$$A = \text{LLM}(q \mid c^*)$$
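
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: `embed` here is a toy bag-of-words counter standing in for a real embedding model, and the names `chunk_text`, `retrieve`, and `build_prompt` as well as the prompt wording are assumptions made for the example.

```python
from __future__ import annotations

import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 50) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase word counts instead of a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str]) -> str:
    """Return c* = argmax over c_i of sim(q, c_i)."""
    q = embed(query)
    return max(chunks, key=lambda c: cosine(q, embed(c)))


def build_prompt(query: str, chunk: str) -> str:
    """Assemble the text the LLM is conditioned on: retrieved chunk, then question."""
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{chunk}\n\n"
        f"Question: {query}"
    )
```

Calling `retrieve(question, chunk_text(document))` and passing the result through `build_prompt` gives the string that would be sent to the LLM. Swapping `embed` for a sentence-embedding model changes the quality of the match but not the shape of the pipeline.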

Challenges

🔴 Model Deprecations: I started with local HuggingFace models but ran into memory crashes on Colab's free tier, so I switched to the Groq API for reliability and speed.

🔴 API Credit Exhaustion: I hit rate limits mid-build during the hackathon and had to restructure calls to batch efficiently and reduce redundant requests.
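
One way to cut redundant requests and batch calls, sketched under assumptions: `call_llm` is a hypothetical function that sends one prompt to the API and returns its reply, and the character budget `max_chars` is an illustrative stand-in for whatever limit the real API imposes. The actual restructuring in this project may have looked different.

```python
from __future__ import annotations

import hashlib
from typing import Callable


class CachedLLM:
    """Wrap an LLM call so that identical prompts are only sent once."""

    def __init__(self, call_llm: Callable[[str], str]) -> None:
        self._call_llm = call_llm  # hypothetical function that hits the API
        self._cache: dict[str, str] = {}
        self.api_calls = 0  # number of requests that actually went out

    def ask(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.api_calls += 1
            self._cache[key] = self._call_llm(prompt)
        return self._cache[key]


def batch_chunks(chunks: list[str], max_chars: int) -> list[str]:
    """Greedily pack consecutive chunks into batches of at most max_chars.

    A single chunk longer than max_chars still gets a batch of its own.
    """
    batches: list[str] = []
    current = ""
    for chunk in chunks:
        candidate = f"{current}\n\n{chunk}" if current else chunk
        if current and len(candidate) > max_chars:
            batches.append(current)
            current = chunk
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches
```

With this shape, each batch becomes one request instead of one request per chunk, and repeated questions against the same document are answered from the cache.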

🔴 OCR Accuracy: Scanned documents with low DPI returned garbled text, so I added preprocessing (grayscale + threshold) before passing images to Tesseract.
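
The grayscale + threshold step can be illustrated on raw pixel values. This is a dependency-free sketch, not the project's code: the image is assumed to be a list of rows of `(R, G, B)` tuples, the luminance weights are the standard ITU-R 601 ones, and the fixed cutoff of 128 is an assumption that real scans may need tuned. With Pillow, the same two steps are usually done with `Image.convert("L")` followed by `Image.point(...)`, before handing the image to `pytesseract.image_to_string`.

```python
def to_grayscale(pixels):
    """Convert rows of (R, G, B) tuples to rows of 0-255 luminance values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in pixels
    ]


def binarize(gray, threshold=128):
    """Map each pixel to pure black (0) or pure white (255).

    threshold=128 is an illustrative default, not a tuned value.
    """
    return [[255 if value >= threshold else 0 for value in row] for row in gray]
```

Binarizing removes the faint gray background noise that low-DPI scans tend to carry, which gives the OCR engine cleaner character edges to work with.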

🔴 Endpoint Format Mismatches: FastAPI schema mismatches caused silent failures. I debugged them purely from Colab output logs, with no browser dev tools on mobile.
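
A schema mismatch is easier to spot when the payload is checked explicitly and rejected with a clear message. The sketch below uses only the standard library. The `AnalyzeRequest` shape, its `text` and `task` fields, and the `ALLOWED_TASKS` set are all hypothetical, chosen to mirror the features listed earlier rather than the project's real endpoints. In FastAPI itself, declaring a Pydantic model as the request body performs comparable checks and answers with a 422 response on failure.

```python
from dataclasses import dataclass

# Hypothetical task names, for illustration only.
ALLOWED_TASKS = {"summarize", "ner", "sentiment", "classify", "qa"}


@dataclass(frozen=True)
class AnalyzeRequest:
    """Hypothetical request body for an analysis endpoint."""
    text: str
    task: str


def parse_request(payload: dict) -> AnalyzeRequest:
    """Validate a decoded JSON body, raising ValueError with a specific reason."""
    expected = {"text", "task"}
    missing = sorted(expected - set(payload))
    if missing:
        raise ValueError(f"missing fields: {missing}")
    unknown = sorted(set(payload) - expected)
    if unknown:
        raise ValueError(f"unexpected fields: {unknown}")
    if not isinstance(payload["text"], str) or not payload["text"].strip():
        raise ValueError("'text' must be a non-empty string")
    if payload["task"] not in ALLOWED_TASKS:
        raise ValueError(f"'task' must be one of {sorted(ALLOWED_TASKS)}")
    return AnalyzeRequest(text=payload["text"], task=payload["task"])
```

Because every rejection names the offending field, the cause shows up directly in the server log, which matters when logs are the only debugging tool available.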

🔴 Dual Deadline Pressure: This was submitted for both a GUVI hackathon and an HCL hackathon on the same day. I built, tested, and submitted it solo under pressure.