Inspiration

We wanted a tool that could translate PDFs between Mandarin and English while preserving their layout. Existing solutions break tables, distort formatting, or fail on scanned documents, which inspired us to build a dual-pipeline system with one path for digital PDFs and one for scans.

What it does

LingoPDF automatically detects whether a PDF is digital or scanned, extracts text by direct parsing or OCR accordingly, translates it, and reconstructs a new PDF that preserves fonts, tables, spacing, and visual structure.
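As a rough illustration of the digital-versus-scanned decision, here is a minimal sketch. It assumes per-page statistics (extracted character count and image coverage) have already been gathered by whatever extraction library is in use; `PageStats`, `is_scanned_page`, `choose_pipeline`, and all thresholds are hypothetical names and values chosen for this example, not LingoPDF's actual detection logic.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    char_count: int        # characters returned by direct text extraction
    image_coverage: float  # fraction of the page area covered by images (0.0 to 1.0)


def is_scanned_page(stats, min_chars=20, min_image_coverage=0.8):
    """Guess that a page is scanned: almost no extractable text, mostly image.

    Both thresholds are illustrative assumptions and would need tuning.
    """
    return stats.char_count < min_chars and stats.image_coverage >= min_image_coverage


def choose_pipeline(pages, scanned_ratio=0.5):
    """Return "ocr" when more than `scanned_ratio` of pages look scanned, else "direct"."""
    if not pages:
        raise ValueError("document has no pages")
    scanned = sum(is_scanned_page(p) for p in pages)
    return "ocr" if scanned / len(pages) > scanned_ratio else "direct"


if __name__ == "__main__":
    digital = [PageStats(1500, 0.1), PageStats(900, 0.0)]
    scan = [PageStats(0, 1.0), PageStats(3, 0.95)]
    print(choose_pipeline(digital), choose_pipeline(scan))
```

A per-page decision rather than a per-document one may be preferable for mixed PDFs; the majority vote here is only the simplest option.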

How we built it

We used PyMuPDF for digital text extraction, PPStructureV3 + PaddleOCR for layout analysis and OCR, MarianMT (Helsinki-NLP) for translation, and ReportLab/PyMuPDF for reconstruction. A custom layout engine adjusts font size, wrapping, and alignment so the translated text matches the original formatting.
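The font-size and wrapping adjustment can be sketched as a shrink-to-fit loop: wrap the translated text to the box width, and reduce the font size until the wrapped lines fit the box height. This is a simplified sketch, not the project's layout engine. The function name `fit_text` and its parameters are hypothetical, and the text width is approximated with a fixed per-character ratio; a real implementation would measure strings with actual font metrics (for example ReportLab's `pdfmetrics.stringWidth`).

```python
import textwrap


def fit_text(text, box_width, box_height, start_size=12.0, min_size=6.0,
             step=0.5, char_width_ratio=0.5, line_height_ratio=1.2):
    """Return (font_size, lines) so that the wrapped text fits the box if possible.

    Width is estimated as font_size * char_width_ratio per character, which is
    a crude assumption. If nothing fits, the result at min_size is returned.
    """
    size = start_size
    while True:
        # How many characters fit on one line at this font size.
        chars_per_line = max(1, int(box_width / (size * char_width_ratio)))
        lines = textwrap.wrap(text, width=chars_per_line) or [""]
        total_height = len(lines) * size * line_height_ratio
        if total_height <= box_height or size <= min_size:
            return size, lines
        size = max(min_size, size - step)


if __name__ == "__main__":
    size, lines = fit_text("word " * 19 + "word", box_width=120, box_height=40)
    print(size)
    for line in lines:
        print(line)
```

Shrinking in small steps keeps the result close to the original size; a binary search over sizes would be faster for large documents.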

Challenges we ran into

  • Preserving table structure during translation
  • Handling text expansion/shrinkage after translation
  • Accurately detecting layout regions in noisy scans
  • Rebuilding PDFs without breaking alignment or spacing
  • Inpainting text inside figure regions cleanly
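One way to keep table structure intact during translation is to translate only the text inside the cells and pass every tag through unchanged. The sketch below assumes the table is available as HTML markup and uses Python's standard `html.parser`. `CellTranslator` and `translate_table_html` are hypothetical names, and the `translate` callable is a stand-in for a real translation model.

```python
from html import escape
from html.parser import HTMLParser


class CellTranslator(HTMLParser):
    """Re-emit HTML unchanged except for text nodes, which are translated."""

    def __init__(self, translate):
        super().__init__(convert_charrefs=True)
        self.translate = translate  # callable: str -> str (assumed, e.g. an MT model wrapper)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Rebuild the tag with its attributes so rowspan/colspan survive.
        attr_text = "".join(
            f' {name}="{escape(value, quote=True)}"' if value is not None else f" {name}"
            for name, value in attrs
        )
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if data.strip():
            # Translate the cell text only; surrounding whitespace is dropped.
            self.out.append(escape(self.translate(data.strip()), quote=False))
        else:
            self.out.append(data)


def translate_table_html(html, translate):
    parser = CellTranslator(translate)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    # A dictionary lookup stands in for the translation model in this example.
    glossary = {"名称": "Name", "价格": "Price"}
    source = '<table><tr><td colspan="2">名称</td></tr><tr><td>价格</td><td>10</td></tr></table>'
    print(translate_table_html(source, lambda s: glossary.get(s, s)))
```

Because rows, cells, and span attributes are copied verbatim, the translated table keeps the same grid even when the cell text grows or shrinks.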

Accomplishments that we're proud of

  • Bidirectional translation between Mandarin and English
  • Seamless handling of both digital and scanned PDFs
  • True layout preservation including tables, images, and fonts
  • Automated table reconstruction from HTML table markup, rendered with ReportLab
  • Consistent, clean inpainting and text overlay on scanned pages

What we learned

We learned how complex document layout analysis is, how difficult OCR becomes on mixed-quality scans, and how translation affects spacing and typography. We also deepened our understanding of PDF internals and generative inpainting.

What's next for LingoPDF

  • Add support for more languages
  • Build a GUI and API service
  • Enable batch processing and cloud deployment
  • Add handwriting OCR support
  • Improve speed using model quantization and GPU acceleration
