Inspiration

The idea came from noticing how poorly existing tools translate technical PDFs. Equations break, tables lose alignment, and scanned documents become unusable. We wanted a system that preserves structure while delivering accurate multilingual translation.

What it does

DocTransFlow detects layout elements with DocLayout-YOLO, preprocesses text blocks, applies OCR to scanned PDFs, translates the content with LLMs or Hugging Face models, and reconstructs the document with tables, paragraphs, and equations in their original positions. It handles text expansion after translation through bounding-box resizing and adaptive font scaling.
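
To make the adaptive font scaling idea concrete, here is a minimal sketch rather than the project's actual code. It assumes a fixed average character width of half the font size and a line height of 1.2 times the font size; the function name `fit_font_size` and all of its parameters are hypothetical. A real renderer would measure glyph widths from the embedded font instead.

```python
import textwrap


def _fits(text, box_width, box_height, size, char_width_ratio, line_height_ratio):
    """Check whether text wraps into the box at the given font size."""
    # Assumed average glyph width: size * char_width_ratio (a rough heuristic).
    chars_per_line = int(box_width // (size * char_width_ratio))
    if chars_per_line < 1:
        return False
    lines = textwrap.wrap(text, width=chars_per_line) or [""]
    return len(lines) * size * line_height_ratio <= box_height


def fit_font_size(text, box_width, box_height, base_size=11.0, min_size=6.0,
                  char_width_ratio=0.5, line_height_ratio=1.2, step=0.5):
    """Return the largest size <= base_size at which text fits the box.

    Falls back to min_size when nothing fits; a caller could then
    enlarge the bounding box instead of shrinking the text further.
    """
    size = base_size
    while size >= min_size:
        if _fits(text, box_width, box_height, size,
                 char_width_ratio, line_height_ratio):
            return size
        size -= step
    return min_size


short = fit_font_size("Hello", box_width=200, box_height=20)
expanded = fit_font_size("word " * 40, box_width=200, box_height=40)
print(short, expanded)
```

Shrinking in small steps keeps the translated text as close as possible to the original size, and the `min_size` floor is where bounding-box resizing would take over.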

How we built it

We combined three components:

  1. Layout Detection: DocLayout-YOLO trained on DocSynth300K identifies tables, paragraphs, and equations.
  2. Text and OCR Processing: We clean the text, fix line-break hyphenation, normalize encoding, and apply PaddleOCR or TrOCR to scanned PDFs, followed by tokenization and segmentation.
  3. Translation and Reconstruction: Translation with LLMs or Hugging Face models, combined with layout-aware resizing and LaTeX-preserving extraction, rebuilds the final structured PDF.
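
The text cleanup and LaTeX-preserving steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline's real implementation: math is assumed to be delimited by `$...$` or `$$...$$`, the `[[MATHn]]` placeholder format is invented for this example, and the helper names (`clean_text`, `mask_math`, `unmask_math`) are hypothetical. Only the standard library is used.

```python
import re
import unicodedata

# Assumption: equations appear as $...$ or $$...$$ in the extracted text.
MATH_PATTERN = re.compile(r"\$\$.+?\$\$|\$.+?\$", re.DOTALL)
# A word split by a hyphen at a line break, e.g. "trans-\nlation".
HYPHEN_BREAK = re.compile(r"(\w)-\s*\n\s*(\w)")


def mask_math(text):
    """Replace each math span with a placeholder; return text and spans."""
    spans = []

    def _stash(match):
        spans.append(match.group(0))
        return f"[[MATH{len(spans) - 1}]]"

    return MATH_PATTERN.sub(_stash, text), spans


def clean_text(raw):
    """Normalize encoding, rejoin hyphenated line breaks, collapse spaces."""
    text = unicodedata.normalize("NFKC", raw)  # e.g. ligatures to plain letters
    text = HYPHEN_BREAK.sub(r"\1\2", text)
    return re.sub(r"\s+", " ", text).strip()


def unmask_math(text, spans):
    """Put the original math spans back after translation."""
    for i, span in enumerate(spans):
        text = text.replace(f"[[MATH{i}]]", span)
    return text


raw = "The \ufb01nal trans-\nlation keeps $E = mc^2$ intact."
masked, spans = mask_math(raw)      # protect math before any cleanup
cleaned = clean_text(masked)        # this string would go to the translator
restored = unmask_math(cleaned, spans)
print(restored)
```

Masking before cleanup keeps the equation source untouched by normalization. Two caveats: the hyphen rule also joins genuine compounds that happen to break at a line end, and a translation model may alter placeholders, so a real system would need to verify they survive.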

Challenges we ran into

The toughest problems were managing text expansion after translation, ensuring OCR accuracy on low-quality scans, maintaining layout integrity across complex documents, and preserving equations through both OCR and translation. Reconstructing PDF pages also required careful handling of bounding-box coordinates and styles.
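
One common source of the coordinate trouble is that layout detectors report boxes in image pixels with the origin at the top-left, while the PDF default user space is measured in points from the bottom-left. The sketch below shows that conversion under simple assumptions: the page image covers the whole page, and there is no page rotation or crop-box offset. The function name `pixel_box_to_pdf` is hypothetical.

```python
def pixel_box_to_pdf(box_px, image_size_px, page_size_pt):
    """Map an (x0, y0, x1, y1) pixel box to PDF points.

    Input uses a top-left origin (y grows downward); output uses a
    bottom-left origin (y grows upward), so the y axis is flipped.
    """
    x0, y0, x1, y1 = box_px
    img_w, img_h = image_size_px
    page_w, page_h = page_size_pt
    sx = page_w / img_w  # points per pixel, horizontal
    sy = page_h / img_h  # points per pixel, vertical
    # The pixel box's bottom edge (y1) becomes the lower y in PDF space.
    return (x0 * sx, page_h - y1 * sy, x1 * sx, page_h - y0 * sy)


# US Letter page (612 x 792 pt) rendered at twice the resolution.
pdf_box = pixel_box_to_pdf((100, 200, 300, 400), (1224, 1584), (612, 792))
print(pdf_box)
```

Libraries that already expose a top-left origin would not need the flip, only the scaling.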

Accomplishments that we're proud of

We built a unified pipeline that works on both digital and scanned PDFs, preserves complex layouts, and maintains mathematical expressions without distortion. The system reliably outputs translation-ready pages with structural fidelity.

What we learned

We gained a deeper understanding of layout detection models, OCR systems, multilingual translation behavior, PDF coordinate systems, and the complexities of bounding-box aware rendering. We also learned how important preprocessing is for both OCR quality and translation accuracy.

What's next for DocTransFlow

We plan to add automated table structure translation, improved math OCR using hybrid vision-LaTeX models, support for right-to-left languages, and an interactive editor for manual correction. A cloud-based API for batch document translation is also on the roadmap.

Built With

  • deepseek
  • engines
  • hugging face
  • llm-based
  • ocr
  • paddleocr (pp-ocrv5, pp-structurev3)
  • pdf/a
  • preprocessing
  • tesseract
  • translation
  • trocr
  • yolov10-doclayout-docsynth300k