Bidirectional OCR Translation on PDFs: Mandarin ↔ English

Project: Layout-Preserving Polyglot (LPP)

1. The Inspiration & Core Idea

The project was inspired by a persistent, frustrating problem: translating complex, visually rich documents (such as financial reports) without destroying their layout. Existing tools failed, either losing the formatting or attempting a flawed rebuild of the document.

The Core Idea: Achieve "perfect" layout preservation by treating the document structure as untouchable. We used an analogy: instead of rebuilding a house (the document), we simply "repaint the walls" (the text) while keeping the structure (the layout) intact. This led to the image-as-background architecture.

2. How We Built It: The "Paint-Over" Architecture

The LPP system is a structurally simple, non-traditional process:

Rendering as Foundation (PyMuPDF): The PDF page is first rendered to a 300 DPI PNG image, which becomes the immovable, layout-preserving background.

Extraction (PaddleOCR): State-of-the-art OCR extracts each text fragment and, crucially, its precise bounding box coordinates: $\text{Text}_i, [x_{i0}, y_{i0}, x_{i1}, y_{i1}]$.

Local Translation: We used secure, local Hugging Face models ($\text{Helsinki-NLP/opus-mt-zh-en}$) for fast, batch translation.

The Layout Loop ("Paint-Over"): The original background image is inserted into the output page. A white-filled rectangle is drawn over each original text bounding box to mask it. The translated text is then placed into the same bounding box using page.insert_textbox(), which wraps the text inside the rectangle to preserve the visual flow. Note that insert_textbox() does not shrink the font on its own: it returns a negative value and writes nothing when the text does not fit, so the font size has to be reduced until it does.
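The render and paint-over steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the helper names (`px_to_pt`, `render_page_png`, `paint_over`), the Helvetica font, and the shrink-to-fit loop are hypothetical. It also assumes the OCR boxes are in pixels of the 300 DPI render, so they must be scaled by 72/300 to reach PDF points before drawing.

```python
DPI = 300          # render resolution used before OCR
PT_PER_INCH = 72   # PDF user space uses 72 points per inch


def px_to_pt(bbox_px, dpi=DPI):
    """Convert an OCR box (x0, y0, x1, y1) from image pixels to PDF points."""
    return tuple(v * PT_PER_INCH / dpi for v in bbox_px)


def render_page_png(page, dpi=DPI):
    """Render a PyMuPDF page to PNG bytes at the given DPI."""
    return page.get_pixmap(dpi=dpi).tobytes("png")


def paint_over(page, background_png, items, start_size=11, min_size=4):
    """Insert the background, mask each OCR box, write the translation.

    page:           an empty PyMuPDF page with the same size as the source page
    background_png: PNG bytes of the rendered source page
    items:          iterable of (translated_text, bbox_px) pairs
    """
    import fitz  # PyMuPDF, imported lazily so px_to_pt works without it

    # The rendered page is the untouched, layout-preserving background.
    page.insert_image(page.rect, stream=background_png)

    for text, bbox_px in items:
        rect = fitz.Rect(*px_to_pt(bbox_px))
        # White-filled rectangle masks the original text.
        page.draw_rect(rect, color=(1, 1, 1), fill=(1, 1, 1))
        # insert_textbox wraps text but does not resize it: a negative
        # return value means nothing was written, so retry smaller.
        size = start_size
        while size >= min_size:
            if page.insert_textbox(rect, text, fontsize=size, fontname="helv") >= 0:
                break
            size -= 1
```

In use, one would open the source with `fitz.open(...)`, create an output document, add a page with `new_page(width=..., height=...)` matching each source page, and call `paint_over` with that page's OCR results. If a box is too small even at `min_size`, this sketch leaves it masked but empty; a real pipeline would need a fallback.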

Challenges & Lessons

| Challenge | Solution & Implementation | Lesson Learned |
| --- | --- | --- |
| OCR Quality on PDFs | Forcing a 300 DPI image render before OCR. | Standardization is the best preprocessing step. |
| Character Confusion | Implemented a user toggle for lang='japan' vs. lang='chinese_cht'. | A "one size fits all" model is insufficient for East Asian languages. |
| The "White-Out" Flaw | Slightly expanding the white rectangle by a small margin $\epsilon$ (e.g., $\pm 1$ pixel) around the bounding box. | Geometric operations require a small, empirical buffer ($\epsilon$) for reliable coverage. |
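The "White-Out" fix amounts to replacing each box $[x_{i0}, y_{i0}, x_{i1}, y_{i1}]$ with $[x_{i0}-\epsilon, y_{i0}-\epsilon, x_{i1}+\epsilon, y_{i1}+\epsilon]$ before drawing the mask. A minimal sketch, assuming boxes are plain 4-tuples; the function name `expand_bbox` and the optional clamping to the page bounds are illustrative additions, not part of the original project:

```python
def expand_bbox(bbox, eps=1.0, bounds=None):
    """Grow a box (x0, y0, x1, y1) by eps on every side.

    bounds: optional (width, height); if given, the result is clamped
            so the mask never extends past the page or image edge.
    """
    x0, y0, x1, y1 = bbox
    x0, y0, x1, y1 = x0 - eps, y0 - eps, x1 + eps, y1 + eps
    if bounds is not None:
        width, height = bounds
        x0, y0 = max(0, x0), max(0, y0)
        x1, y1 = min(width, x1), min(height, y1)
    return (x0, y0, x1, y1)
```

The value of `eps` is empirical: too small and glyph edges of the original text peek out from under the mask, too large and the mask starts covering neighboring lines or rules.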
Key Takeaways

The project demonstrated that an architectural "hack" can outperform brute-force deep learning for layout-preserving document translation.

Simplicity Wins: Treating the layout as an image avoided the error-prone task of document structure recreation.

Bounding Boxes are Gold: The precise coordinates ($\text{bbox}$) are the foundation that grounds the translated text in the original layout.

The Power of Local Models: Using local transformer models ensured security, speed, and scalability for sensitive documents.

Built With

python

Updates

Kartik Sharma started this project — Nov 16, 2025 08:28 AM EST
