Inspiration

I wanted to build something that could help people understand documents faster. Reading through long PDFs is tedious, so I thought - why not let AI do the heavy lifting?

What it does

DocuMind is a multi-agent document analysis system. You upload a PDF, and it:

  • Extracts text using PaddleOCR
  • Analyzes the content with ERNIE
  • Generates summaries
  • Lets you ask questions about the document

How we built it

Started with the warm-up task (PDF to web converter) to get familiar with the APIs. Then built out the multi-agent system using CAMEL-AI patterns. Each agent has a specific job - OCR, analysis, summary, QA - and the coordinator manages them all.

The backend is FastAPI, frontend is plain HTML/CSS/JS. Nothing too fancy.

Challenges we ran into

  • ERNIE API auth format changed, took a while to figure out the new Bearer token setup
  • PaddleOCR needs Poppler for PDF conversion, ended up using PyPDF2 as fallback
  • Getting the agents to work together smoothly required some trial and error

Accomplishments that we're proud of

  • Got the full pipeline working end-to-end
  • The QA feature actually gives useful answers
  • Warm-up task works reliably

What we learned

  • How to use ERNIE API properly
  • Multi-agent system design patterns
  • PaddleOCR is pretty powerful for document extraction

What's next for DocuMind

  • Add support for more document types (Word, images)
  • Improve the analysis accuracy
  • Maybe add a Chrome extension

Built With

Share this project:

Updates