Inspiration

This project began with a simple observation: English and Mandarin behave very differently on the page. Mandarin is dense and compact, while English tends to expand, often causing text overflow, broken layouts, and mismatched page lengths when translating PDFs. After repeatedly seeing official documents, academic papers, and certificates lose their structure during translation, I wanted to build a system that could preserve not only meaning, but also design. That idea, “your PDF, just in another language,” became the core inspiration behind ManEn.

What it does

ManEn is a bidirectional, OCR-based PDF translation system that translates English ↔ Mandarin while preserving the original layout, spacing, and formatting. It detects whether a PDF is scanned or text-based, extracts text and layout metadata, identifies the language of each segment, translates it while compensating for text that overflows or underfills its box, and reconstructs a new PDF using the original geometry. The output is a clean, faithful version of the document in the target language, with tables, spacing, alignment, and fonts preserved.

How I built it

I built ManEn as a multi-stage pipeline. First, the system checks if the PDF is text-based (via PyMuPDF) or scanned (requiring OCR using PaddleOCR). Next, layout detection tools capture bounding boxes, reading order, fonts, and table structures. Language detection via langdetect tags multilingual segments, and translation is handled using opus-mt models and Google’s NMT. I then apply layout-sensitive logic like font scaling and spacing adjustments to address text expansion or contraction depending on the language direction. Finally, the PDF is reconstructed using ReportLab with embedded CJK fonts, redrawn tables, and precisely positioned text blocks that mirror the original design.
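The first stage, deciding between direct extraction and OCR, can be sketched roughly as follows. This is a minimal illustration rather than ManEn's actual code: the function name `classify_pdf` and both thresholds are assumptions. It takes the number of extractable characters on each page; with PyMuPDF, such counts could come from something like `len(page.get_text().strip())` for each page.

```python
from typing import Sequence


def classify_pdf(
    page_char_counts: Sequence[int],
    min_chars_per_page: int = 20,   # assumed threshold, tune per corpus
    text_page_ratio: float = 0.5,   # assumed threshold, tune per corpus
) -> str:
    """Return "text" if enough pages carry a text layer, else "scanned".

    page_char_counts holds the number of characters that could be
    extracted directly from each page (0 for a pure image page).
    """
    if not page_char_counts:
        # Nothing extractable at all: fall back to the OCR path.
        return "scanned"
    text_pages = sum(1 for n in page_char_counts if n >= min_chars_per_page)
    if text_pages / len(page_char_counts) >= text_page_ratio:
        return "text"
    return "scanned"
```

A per-page decision may work better for documents that mix digital and scanned pages; the whole-document vote here just keeps the sketch short.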

Challenges I ran into

One of the biggest challenges was managing text length differences: English translations often overflow their boxes, while Mandarin translations sometimes leave large gaps. Handling mixed-language documents also complicated language detection. Reconstructing tables cleanly was another major hurdle, as PDFs rarely store table structure explicitly. I also faced issues with font compatibility: many fonts don’t support Chinese characters, so fallback fonts had to be embedded. Finally, OCR inaccuracies sometimes propagated into translation, making evaluation and correction essential.
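The overflow problem can be illustrated with a small font-fitting sketch. It is an assumption-laden approximation, not ManEn's implementation: `estimate_width` and `fit_font_size` are hypothetical helpers, and the width model (1 em for wide East Asian characters, 0.5 em for everything else) is a crude stand-in for real font metrics such as those ReportLab exposes through `pdfmetrics.stringWidth`.

```python
import unicodedata


def estimate_width(text: str, font_size: float) -> float:
    """Rough text width in points.

    Assumed model: wide/fullwidth characters (most CJK) take 1 em,
    all other characters take 0.5 em.
    """
    ems = 0.0
    for ch in text:
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        ems += 1.0 if wide else 0.5
    return ems * font_size


def fit_font_size(
    text: str,
    box_width: float,
    original_size: float,
    min_size: float = 6.0,  # assumed legibility floor
) -> float:
    """Shrink the font just enough for text to fit box_width.

    Text that already fits keeps its original size; the result never
    drops below min_size, so very long text may still overflow.
    """
    width = estimate_width(text, original_size)
    if width <= box_width:
        return original_size
    scaled = original_size * box_width / width
    return max(scaled, min_size)


# "毕业证书" fits a 99 pt box at 12 pt; its English rendering does not.
zh_size = fit_font_size("毕业证书", box_width=99, original_size=12)
en_size = fit_font_size("Graduation Certificate", box_width=99, original_size=12)
```

When the floor is reached and the text still does not fit, other adjustments such as tighter spacing or line wrapping would have to take over.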

Accomplishments that I'm proud of

I'm proud that ManEn doesn’t just translate PDFs; it preserves their form. The system successfully maintains layout fidelity across languages, handles both scanned and digital PDFs, and manages text overflow through dynamic scaling and spacing. Rebuilding complex documents with tables, multi-column layouts, and mixed-language pages was particularly rewarding. Bringing OCR, translation, and PDF engineering together into a single consistent workflow feels like a significant achievement.

What I learned

I discovered how nuanced cross-language typography can be, especially when working with Chinese fonts, punctuation, and spacing. Ultimately, I learned how to combine OCR, NLP, and document reconstruction into a cohesive system that respects both meaning and design.

What's next for ManEn

Next, I plan to expand ManEn into a more robust, user-friendly service. This includes adding support for more languages, integrating a web interface for drag-and-drop PDF translation, improving table reconstruction with ML-based structure detection, and refining overflow/contraction strategies using a layout prediction model. I also aim to incorporate confidence scoring for OCR and translation, allowing users to review and correct low-confidence segments. Eventually, ManEn could evolve into a full document-localization platform: accurate, layout-faithful, and accessible to everyone.

Built With

  • deep-translator
  • langdetect
  • paddleocr
  • pymupdf
  • python
  • transformers