Inspiration
We wanted a tool that could translate Mandarin PDFs into English while preserving their layout. Existing solutions break tables, distort formatting, or fail on scanned documents, which inspired us to build a dual-pipeline system with one path for digital PDFs and one for scanned PDFs.
What it does
LingoPDF automatically detects whether a PDF is digital or scanned, extracts the text by direct parsing (digital) or OCR (scanned), translates it, and reconstructs a new PDF that preserves fonts, tables, spacing, and visual structure.
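As a rough illustration of the digital-versus-scanned check, here is a minimal sketch that classifies a document from the number of extractable characters on each page. The function name `classify_pdf` and both thresholds are assumptions made for this example, not LingoPDF's actual logic; the per-page counts could come from something like the length of PyMuPDF's `page.get_text()` output.

```python
from typing import Sequence


def classify_pdf(
    chars_per_page: Sequence[int],
    min_chars: int = 20,          # assumed: fewer characters than this means "no real text layer"
    min_text_ratio: float = 0.5,  # assumed: share of pages that must carry text
) -> str:
    """Return 'digital' if enough pages have an extractable text layer, else 'scanned'."""
    if not chars_per_page:
        raise ValueError("document has no pages")
    # Count the pages whose extracted text is long enough to be a real text layer.
    text_pages = sum(1 for n in chars_per_page if n >= min_chars)
    if text_pages / len(chars_per_page) >= min_text_ratio:
        return "digital"
    return "scanned"


if __name__ == "__main__":
    print(classify_pdf([1200, 950, 0]))  # two of three pages have text
    print(classify_pdf([0, 3, 0]))       # almost no extractable text
```

A page-level version of the same check would let mixed documents send each page down the appropriate path.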
How we built it
We used PyMuPDF for digital text extraction, PPStructureV3 with PaddleOCR for layout analysis and OCR, MarianMT (Helsinki-NLP) for translation, and ReportLab and PyMuPDF for reconstruction. A custom layout engine adjusts font size, wrapping, and alignment to match the original formatting.
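The font-size and wrapping adjustment can be sketched as a shrink-to-fit loop: wrap the translated text to the width of the original box, and reduce the font size until the wrapped lines fit the box height. This is an illustrative sketch, not the project's layout engine. The function `fit_text`, the average character width factor, and the line height factor are all assumptions; a real engine would measure text with actual font metrics.

```python
import textwrap
from typing import List, Tuple


def fit_text(
    text: str,
    box_width: float,
    box_height: float,
    start_size: float = 12.0,
    min_size: float = 5.0,
    step: float = 0.5,
    char_width_factor: float = 0.5,   # assumed average glyph width as a fraction of font size
    line_height_factor: float = 1.2,  # assumed line spacing as a multiple of font size
) -> Tuple[float, List[str]]:
    """Return the largest font size (down to min_size) whose wrapped lines fit the box."""
    size = start_size
    while True:
        # Estimate how many characters fit on one line at this font size.
        chars_per_line = max(1, int(box_width / (size * char_width_factor)))
        lines = textwrap.wrap(text, width=chars_per_line) or [""]
        needed_height = len(lines) * size * line_height_factor
        # Stop when the text fits, or when shrinking further would hurt legibility.
        if needed_height <= box_height or size <= min_size:
            return size, lines
        size = max(min_size, size - step)


if __name__ == "__main__":
    size, lines = fit_text("The quarterly report is attached below.", 120, 30)
    print(size, lines)
```

Stopping at `min_size` means very long translations can still overflow; that case has to be handled separately, for example by enlarging the box or truncating.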
Challenges we ran into
- Preserving table structure during translation
- Handling text expansion/shrinkage after translation
- Accurately detecting layout regions in noisy scans
- Rebuilding PDFs without breaking alignment or spacing
- Inpainting text inside figure regions cleanly
Accomplishments that we're proud of
- Bidirectional translation between Mandarin and English
- Seamless handling of both digital and scanned PDFs
- True layout preservation including tables, images, and fonts
- Automated table reconstruction from HTML using ReportLab
- Consistent, clean inpainting and text overlay on scanned pages
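To illustrate the HTML-based table reconstruction mentioned above, here is a minimal sketch that turns an HTML table string into a list of rows, the list-of-lists shape that ReportLab's `Table` flowable accepts as data. It assumes the layout stage emits each table as a simple HTML string; the names `TableRows` and `html_table_to_rows` are hypothetical, and merged cells (`colspan`, `rowspan`) are ignored.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None
        self._cell: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_rows(html: str) -> List[List[str]]:
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    html = "<table><tr><th>Name</th><th>Qty</th></tr><tr><td>Tea</td><td>2</td></tr></table>"
    print(html_table_to_rows(html))
```

Translating each cell string before building the table keeps the row and column structure untouched, which is one way to avoid breaking tables during translation.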
What we learned
We learned how complex document layout analysis is, how difficult OCR becomes on mixed-quality scans, and how translation affects spacing and typography. We also deepened our understanding of PDF internals and generative inpainting.
What's next for LingoPDF
- Add support for more languages
- Build a GUI and API service
- Enable batch processing and cloud deployment
- Add handwriting OCR support
- Improve speed using model quantization and GPU acceleration