LingoPDF

Inspiration

We wanted a tool that could translate PDFs between Mandarin and English while preserving layout. Existing solutions break tables, distort formatting, or fail on scanned documents, which inspired us to build a dual-pipeline system: one path for digital PDFs and one for scanned ones.

What it does

LingoPDF automatically detects whether a PDF is digital or scanned, extracts text using OCR or direct parsing, translates it, and reconstructs a new PDF that preserves fonts, tables, spacing, and visual structure.
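
One way the digital-versus-scanned check could work is to look at how much text a page exposes directly. The sketch below is an illustration under stated assumptions, not LingoPDF's actual code: the `PageStats` type, the function names, and the 20-character threshold are all hypothetical. Only `fitz.open`, `page.get_text()`, and `page.get_images()` are real PyMuPDF calls.

```python
from dataclasses import dataclass

# Hypothetical threshold: pages with fewer extractable characters than this
# are treated as scanned. A real cutoff would need tuning on sample documents.
MIN_CHARS_PER_PAGE = 20


@dataclass
class PageStats:
    char_count: int   # characters returned by direct text extraction
    image_count: int  # embedded images found on the page


def is_scanned_page(stats: PageStats, min_chars: int = MIN_CHARS_PER_PAGE) -> bool:
    """A page with almost no text layer but at least one image is likely a scan."""
    return stats.char_count < min_chars and stats.image_count > 0


def classify_pdf(pages: list[PageStats]) -> str:
    """Return 'scanned' if more than half the pages look scanned, else 'digital'."""
    if not pages:
        return "digital"
    scanned = sum(is_scanned_page(p) for p in pages)
    return "scanned" if scanned * 2 > len(pages) else "digital"


def collect_stats(path: str) -> list[PageStats]:
    """Gather per-page statistics with PyMuPDF (imported lazily)."""
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return [
            PageStats(len(page.get_text().strip()), len(page.get_images()))
            for page in doc
        ]
```

Calling `classify_pdf(collect_stats("input.pdf"))` would then choose between the OCR path and the direct-parsing path. Keeping the decision logic separate from the PyMuPDF call makes the threshold easy to test without a real PDF.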

How we built it

We used PyMuPDF for digital text extraction, PPStructureV3 + PaddleOCR for layout and OCR, Marian MT (Helsinki-NLP) for translation, and ReportLab/PyMuPDF for reconstruction. A custom layout engine adjusts font size, wrapping, and alignment to match the original formatting.
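
To illustrate the kind of font-size adjustment a layout engine like this performs, here is a minimal shrink-to-fit sketch. It is not the project's engine. It assumes an average glyph width of half the font size and a line height of 1.2 times the font size, both hypothetical constants, where a real implementation would measure text with actual font metrics. All function names are made up for the example.

```python
import textwrap

# Assumption: average glyph width is about half the font size. A real
# implementation would measure text with the actual font metrics instead.
AVG_CHAR_WIDTH_RATIO = 0.5
LINE_HEIGHT_RATIO = 1.2


def wrap_for_box(text: str, box_width: float, font_size: float) -> list[str]:
    """Wrap text to the estimated number of characters that fit on one line."""
    chars_per_line = max(1, int(box_width / (font_size * AVG_CHAR_WIDTH_RATIO)))
    return textwrap.wrap(text, width=chars_per_line) or [""]


def fit_font_size(text, box_width, box_height, start_size, min_size=4.0, step=0.5):
    """Shrink from start_size until the wrapped text fits the box height.

    Returns (font_size, wrapped_lines). If nothing fits, min_size is returned
    and the caller has to accept some overflow.
    """
    size = start_size
    while size > min_size:
        lines = wrap_for_box(text, box_width, size)
        if len(lines) * size * LINE_HEIGHT_RATIO <= box_height:
            return size, lines
        size -= step
    return min_size, wrap_for_box(text, box_width, min_size)
```

For example, `fit_font_size("Hello", 100, 20, 12)` keeps the original 12 pt size because one line fits, while a long translated paragraph in the same box is wrapped and stepped down in half-point increments. The word-based wrapping suits English output; Chinese output has no spaces between words and would need character-based wrapping instead.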

Challenges we ran into

Preserving table structure during translation
Handling text expansion/shrinkage after translation
Accurately detecting layout regions in noisy scans
Rebuilding PDFs without breaking alignment or spacing
Inpainting text inside figure regions cleanly
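
The first challenge, keeping table structure intact, can be approached by translating only the text inside each cell and leaving the markup alone. The sketch below assumes the table is available as simple HTML and uses only Python's standard `html.parser`. The class and function names are hypothetical, and the `translate` callable stands in for whatever translation model is plugged in.

```python
from html import escape
from html.parser import HTMLParser
from typing import Callable


class CellTranslator(HTMLParser):
    """Re-emit table HTML unchanged except for text nodes, which are translated."""

    def __init__(self, translate: Callable[[str], str]):
        super().__init__(convert_charrefs=True)
        self.translate = translate
        self.out: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Rebuild the tag with its attributes (colspan, rowspan, ...) intact.
        attr_text = "".join(f' {name}="{escape(value or "")}"' for name, value in attrs)
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Whitespace between tags is kept as-is; only real text is translated.
        if data.strip():
            self.out.append(escape(self.translate(data.strip())))
        else:
            self.out.append(data)


def translate_table_html(html: str, translate: Callable[[str], str]) -> str:
    """Translate cell text in simple table markup while preserving the tags."""
    parser = CellTranslator(translate)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because rows, columns, and `colspan`/`rowspan` attributes pass through untouched, the translated table keeps the same grid as the original. This handles plain table markup only; self-closing tags and inline formatting inside cells would need extra care.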

Accomplishments that we're proud of

Bidirectional translation between Mandarin and English
Seamless handling of both digital and scanned PDFs
True layout preservation including tables, images, and fonts
Automated table reconstruction from HTML, rendered with ReportLab
Consistent, clean inpainting and text overlay on scanned pages
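
As a rough illustration of bidirectional translation with the Helsinki-NLP Marian models, the sketch below picks a checkpoint per direction and runs a batch through it with the Hugging Face `transformers` library. The helper names are hypothetical, and the two checkpoint names are an assumption about which models would be used; the project description only says "Marian MT (Helsinki-NLP)".

```python
from functools import lru_cache

# Assumption: one Marian checkpoint per direction from the Hugging Face Hub.
MODEL_NAMES = {
    ("zh", "en"): "Helsinki-NLP/opus-mt-zh-en",
    ("en", "zh"): "Helsinki-NLP/opus-mt-en-zh",
}


def model_name_for(src: str, tgt: str) -> str:
    """Look up the checkpoint for a translation direction."""
    try:
        return MODEL_NAMES[(src, tgt)]
    except KeyError:
        raise ValueError(f"unsupported direction: {src} -> {tgt}") from None


@lru_cache(maxsize=2)
def load_model(name: str):
    """Load and cache the tokenizer and model so each direction loads once."""
    # Imported lazily so that importing this module does not require transformers.
    from transformers import MarianMTModel, MarianTokenizer

    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)


def translate(texts: list[str], src: str = "zh", tgt: str = "en") -> list[str]:
    """Translate a batch of text blocks in the given direction."""
    tokenizer, model = load_model(model_name_for(src, tgt))
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Translating text blocks in batches rather than one at a time keeps the model busy and pairs naturally with per-region extraction, since each layout region can be one entry in `texts`.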

What we learned

We learned how complex document layout analysis is, how hard OCR becomes on mixed-quality scans, and how translation affects spacing and typography. We also deepened our understanding of PDF internals and generative inpainting.

What's next for LingoPDF

Add support for more languages
Build a GUI and API service
Enable batch processing and cloud deployment
Add handwriting OCR support
Improve speed using model quantization and GPU acceleration

Built With

paddleocr
paddlepaddle
pymupdf
python
pytorch

Updates

Aaryaman Bisht started this project — Nov 16, 2025 06:59 AM EST
