Inspiration

As a second-year IT student building projects entirely from my phone, I constantly struggled with long PDFs, research papers, and dense documents. There was no quick way to extract what actually mattered.

I built this for every student, researcher, and professional who doesn't have hours to read — but still needs to understand.


What It Does

An AI-powered document analysis API that supports:

  • 📄 PDF & DOCX parsing with full text extraction
  • 🖼️ OCR for scanned/image-based documents
  • 📝 Summarization — key points in seconds
  • 🏷️ Named Entity Recognition (NER) — people, places, orgs
  • 💬 Sentiment Analysis — tone detection across content
  • 🗂️ Topic Classification — auto-categorize documents
  • 🔍 Document Comparison — diff two docs intelligently
  • Q&A — ask any question, get answers from your document

How I Built It

Built entirely on Google Colab from my phone — no laptop, no desktop.

Tech Stack:

  • Python — core logic
  • Groq API (llama-3.3-70b-versatile) — LLM backbone
  • PyMuPDF / pdfplumber — PDF extraction
  • python-docx — DOCX parsing
  • pytesseract — OCR for scanned files
  • FastAPI — REST API layer
  • HuggingFace Transformers — initial NER pipeline

The Q&A feature chunks the document into segments, passes the relevant chunks plus the user's question to the LLM, and returns an answer grounded in that context, which limits hallucination from outside the document.

Mathematically, for a document $D$ split into chunks $\{c_1, c_2, \ldots, c_n\}$, the model retrieves the most relevant chunk:

$$c^* = \arg\max_{c_i} \text{sim}(q, c_i)$$

where $q$ is the user query and $\text{sim}$ is cosine similarity over embeddings. The answer is then generated conditioned on $c^*$:

$$A = \text{LLM}(q \mid c^*)$$
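The retrieval step above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the function names (`chunk_text`, `embed`, `retrieve_best_chunk`, `build_prompt`) are hypothetical, and the "embedding" here is a plain word-count vector standing in for a real embedding model so the example runs with the standard library alone.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 50) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase word counts instead of a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_sim(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_best_chunk(query: str, chunks: list[str]) -> str:
    """c* = argmax over chunks of sim(q, c_i)."""
    q = embed(query)
    return max(chunks, key=lambda c: cosine_sim(q, embed(c)))


def build_prompt(query: str, chunk: str) -> str:
    """Hypothetical prompt template that conditions the answer on the retrieved chunk."""
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{chunk}\n\n"
        f"Question: {query}"
    )
```

The string returned by `build_prompt` would then be sent to the LLM, which corresponds to $A = \text{LLM}(q \mid c^*)$. Swapping `embed` for a real embedding model changes the quality of the ranking but not the structure of the pipeline.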


Challenges

🔴 Model Deprecations: Started with local HuggingFace models but ran into memory crashes on Colab's free tier. Switched to the Groq API for reliability and speed.

🔴 API Credit Exhaustion: Hit rate limits mid-build during the hackathon. Had to restructure calls to batch efficiently and reduce redundant requests.
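The writeup does not say exactly how calls were restructured, so the following is only one plausible shape for it: combine several analysis tasks into a single prompt, and cache responses so an identical request never hits the API twice. `CachedLLM` and `build_batched_prompt` are hypothetical names, and `call_llm` is any function that takes a prompt and returns text.

```python
import hashlib
from typing import Callable


def build_batched_prompt(text: str, tasks: list[str]) -> str:
    """Fold several tasks into one prompt so one request replaces len(tasks) requests."""
    numbered = "\n".join(f"{i}. {task}" for i, task in enumerate(tasks, start=1))
    return f"Perform each task on the document below.\n{numbered}\n\nDocument:\n{text}"


class CachedLLM:
    """Wraps an LLM call and reuses the stored response for repeated prompts."""

    def __init__(self, call_llm: Callable[[str], str]):
        self._call_llm = call_llm          # assumption: prompt in, text out
        self._cache: dict[str, str] = {}
        self.api_calls = 0                 # counts real (uncached) requests

    def ask(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.api_calls += 1
            self._cache[key] = self._call_llm(prompt)
        return self._cache[key]
```

With this wrapper, asking for a summary of the same document twice costs one request, and summary, entities, and sentiment can share a single batched request.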

🔴 OCR Accuracy: Scanned documents with low DPI returned garbled text. Added preprocessing (grayscale + threshold) before passing to Tesseract.
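A minimal sketch of the grayscale + threshold step using Pillow, followed by the pytesseract call. The cutoff of 150 and the function names are assumptions for illustration; the right threshold depends on the scan. `ocr_image` imports pytesseract lazily because it also needs the Tesseract binary installed.

```python
from PIL import Image


def preprocess_for_ocr(img: Image.Image, threshold: int = 150) -> Image.Image:
    """Convert to grayscale, then push every pixel to pure black or pure white."""
    gray = img.convert("L")  # "L" = 8-bit grayscale
    # Pixels brighter than the threshold become white (255), the rest black (0).
    return gray.point(lambda p: 255 if p > threshold else 0)


def ocr_image(path: str, threshold: int = 150) -> str:
    """Run Tesseract on the preprocessed image and return the extracted text."""
    import pytesseract  # imported here so preprocessing works without Tesseract

    with Image.open(path) as img:
        return pytesseract.image_to_string(preprocess_for_ocr(img, threshold))
```

Binarizing removes the gray noise that low-DPI scans carry around characters, which is usually what confuses the OCR engine.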

🔴 Endpoint Format Mismatches: FastAPI schema mismatches caused silent failures. Debugged purely from Colab output logs, since there are no browser dev tools on mobile.

🔴 Dual Deadline Pressure: This was submitted for both a GUVI hackathon and an HCL hackathon on the same day. Built, tested, and submitted solo under pressure.


What I Learned

  • Groq's free tier is a lifesaver for hackathon-scale LLM apps
  • OCR pipelines need image preprocessing; raw low-DPI scans come back garbled
  • Chunking strategy directly impacts Q&A accuracy
  • Building mobile-only teaches you to be ruthlessly efficient
  • Dual deadlines are brutal — but doable with clean architecture

What's Next

  • [ ] Vector DB integration (FAISS) for smarter chunk retrieval
  • [ ] Multi-document Q&A across a folder
  • [ ] Streamlit frontend for non-technical users
  • [ ] Hindi + Tamil language document support

Built With

  • fastapi
  • github
  • google-colab
  • groqapi
  • pillow
  • pymupdf
  • pytesseract
  • python
  • python-docx
  • swagger