Inspiration

Dimsum AI was inspired by a recurring frustration while working with multilingual documents. Most translation tools extract a document's text with OCR, translate it linearly, and then struggle to rebuild the original layout. Visual meaning gets lost: headings drift, tables collapse, font hierarchies vanish, and diagrams become detached from their context.

Alongside this, augmented reality showed how digital layers can blend with the physical world. The idea emerged naturally: what if translation could respect spatial structure, and what if the translated text could appear directly on the physical page?

This curiosity to preserve both meaning and form became the foundation of Dimsum AI.

What it does

Dimsum AI is a multimodal, offline document translation system that:

Understands the 2D visual structure of a document using a Layout Graph Neural Network (LGNN).

Translates the document without using OCR, through a layout-aware diffusion model.

Regenerates the translated version with fonts, spacing, tables, and visuals preserved.

Anchors the translated content back onto the real document using AR plane tracking.

Works offline, ensuring privacy, speed, and reliability in secure environments.

The result is a translation experience where the document looks as if it were originally written in the target language.

How we built it

  1. Synthetic Bilingual Dataset (Syntho-Gen)

Real layout-aligned Mandarin–English datasets are scarce, so we built our own. A generator creates parallel text pairs, a layout engine produces structured templates, and the system renders them into two images:

Image_CN (Mandarin)

Image_EN (English)

This becomes training data for layout-aware translation.
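A minimal sketch of the rendering step, assuming a ready-made parallel text pair and one fixed template (the real generator also varies fonts, tables, and multi-column layouts); the font file names are placeholders:

```python
# Hypothetical Syntho-Gen rendering step: draw the same template twice,
# once with Mandarin text and once with English text, at identical
# coordinates so the pair stays layout-aligned.
from PIL import Image, ImageDraw, ImageFont

def render_pair(zh_text: str, en_text: str, size=(640, 480)):
    images = {}
    for lang, text, font_file in [
        ("cn", zh_text, "NotoSansSC-Regular.otf"),  # placeholder fonts
        ("en", en_text, "NotoSans-Regular.ttf"),
    ]:
        img = Image.new("RGB", size, "white")
        draw = ImageDraw.Draw(img)
        font = ImageFont.truetype(font_file, 24)
        # Identical bounding box in both renders keeps the pair aligned.
        draw.text((40, 40), text, font=font, fill="black")
        images[lang] = img
    return images["cn"], images["en"]

image_cn, image_en = render_pair("项目报告", "Project Report")
image_cn.save("image_cn.png")
image_en.save("image_en.png")
```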

  2. Layout Graph Neural Network (LGNN)

Each text block becomes a node in a graph, connected by spatial relationships like proximity, indentation, alignment, and font size. The GNN produces a layout embedding that encodes the entire structure of the document.
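As a concrete sketch, here is how such a graph might be built and pooled with PyTorch Geometric; the library choice, the node features ([x, y, width, height, font size]), and mean pooling are our assumptions, not confirmed details of the project:

```python
# Hypothetical layout graph: each text block is a node with geometric
# features; edges link spatially related blocks (here, simple proximity).
import torch
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv, global_mean_pool

class LayoutGNN(torch.nn.Module):
    def __init__(self, in_dim=5, hidden=64, out_dim=128):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, out_dim)

    def forward(self, data):
        h = self.conv1(data.x, data.edge_index).relu()
        h = self.conv2(h, data.edge_index)
        # Mean-pool node embeddings into one document-level layout vector.
        batch = torch.zeros(h.size(0), dtype=torch.long)
        return global_mean_pool(h, batch)

# Three blocks: a heading (large font) above two aligned body lines.
x = torch.tensor([[0.10, 0.05, 0.80, 0.08, 24.0],
                  [0.10, 0.20, 0.80, 0.05, 12.0],
                  [0.10, 0.27, 0.80, 0.05, 12.0]])
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]])  # bidirectional proximity edges
layout_emb = LayoutGNN()(Data(x=x, edge_index=edge_index))  # shape [1, 128]
```

The pooled vector is the document-level layout embedding that later guides the diffusion model.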

  3. Layout-Aware Diffusion Translator (LADT)

Instead of:

OCR → Translate → Reassemble

we perform direct pixel translation:

Image_CN → Image_EN

The diffusion model generates the translated document using LGNN embeddings as structure guidance, preserving typography and layout.
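A minimal sketch of one training step with Hugging Face diffusers; conditioning by concatenating Image_CN onto the noisy target, and every hyperparameter below, are our assumptions, since the write-up only states that LGNN embeddings guide the model:

```python
# Hypothetical LADT training step: the UNet sees the noisy English page
# concatenated with the Mandarin source page (6 input channels) and is
# conditioned on the LGNN layout embedding via cross-attention.
import torch
import torch.nn.functional as F
from diffusers import UNet2DConditionModel, DDPMScheduler

unet = UNet2DConditionModel(
    sample_size=64, in_channels=6, out_channels=3,
    block_out_channels=(64, 128), layers_per_block=1,
    down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
    up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
    cross_attention_dim=128,  # matches the LGNN embedding width above
)
scheduler = DDPMScheduler(num_train_timesteps=1000)

image_cn = torch.randn(1, 3, 64, 64)   # source: Mandarin render
image_en = torch.randn(1, 3, 64, 64)   # target: English render
layout_emb = torch.randn(1, 1, 128)    # document-level LGNN embedding

noise = torch.randn_like(image_en)
t = torch.randint(0, 1000, (1,))
noisy = scheduler.add_noise(image_en, noise, t)

# Predict the noise added to the English page, given source + layout.
pred = unet(torch.cat([noisy, image_cn], dim=1), t,
            encoder_hidden_states=layout_emb).sample
loss = F.mse_loss(pred, noise)
loss.backward()
```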

  4. AR Anchoring

Using SLAM and homography tracking, the translated layout is projected onto the physical page. The overlay remains stable even if the camera moves away, thanks to persistent spatial anchors.
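The homography half of this step can be sketched with OpenCV's ORB features; SLAM tracking and persistent anchors (ARCore/ARKit territory) are omitted here, and the file names are placeholders:

```python
# Hypothetical anchoring step: match ORB features between the reference
# document image and the current camera frame, estimate a homography
# with RANSAC, and warp the translated page into camera space.
import cv2
import numpy as np

ref = cv2.imread("image_cn.png", cv2.IMREAD_GRAYSCALE)        # known page
frame = cv2.imread("camera_frame.png", cv2.IMREAD_GRAYSCALE)  # live view
overlay = cv2.imread("image_en.png")                          # translation

orb = cv2.ORB_create(2000)
kp1, des1 = orb.detectAndCompute(ref, None)
kp2, des2 = orb.detectAndCompute(frame, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:200]

src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
h, w = frame.shape
warped = cv2.warpPerspective(overlay, H, (w, h))  # overlay in camera space
```

Re-estimating H each frame keeps the warped translation registered to the page as the camera moves.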

  5. Offline Optimization

Through pruning, quantization, and LoRA fine-tuning, all components run locally without cloud services.
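Two of these three techniques can be sketched in plain PyTorch (pruning and the actual modules being adapted are omitted; layer sizes are illustrative):

```python
# Hypothetical offline-optimization steps: 8-bit dynamic quantization
# and a hand-rolled LoRA adapter (frozen base weight plus a trainable
# low-rank update W + (alpha/r)*BA).
import torch

model = torch.nn.Sequential(               # stand-in for a pipeline component
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 128)
)

# (1) Dynamic quantization: Linear weights stored and executed as int8.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# (2) LoRA: freeze the base layer, fine-tune only the low-rank factors.
class LoRALinear(torch.nn.Module):
    def __init__(self, base: torch.nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False         # only the adapters are trained
        self.A = torch.nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

adapted = LoRALinear(torch.nn.Linear(128, 128))
out = adapted(torch.randn(4, 128))          # base output + low-rank update
```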

Challenges we ran into

Generating high-quality bilingual dataset pairs with perfectly aligned layouts.

Designing a GNN that could reliably understand spatial logic instead of treating documents as linear text.

Training a diffusion model to preserve typography while translating language.

Achieving stable AR anchoring in variable lighting conditions and camera angles.

Keeping the full pipeline efficient enough for offline deployment on mobile hardware.

Accomplishments that we're proud of

Built a fully OCR-free translation pipeline — rare in document AI.

Created a synthetic bilingual dataset that mirrors real-world document structures.

Achieved layout-preserving generation, maintaining tables, fonts, and spacing accurately.

Developed an AR overlay system that aligns translations with real paper persistently.

Ensured complete offline capability, making the system usable in sensitive or remote environments.

What we learned

Layout must be modeled spatially; documents are maps, not paragraphs.

Synthetic data, when carefully designed, can be more effective than real datasets for structured tasks.

Diffusion models can handle translation when guided with structural embeddings.

AR stability relies heavily on consistent geometric features and plane detection.

Offline deployment forces thoughtful design choices that improve the system overall.

What's next for Dimsum AI

Expanding to additional languages like Japanese, Korean, and Hindi.

Adding speech-based AR translation for real-time read-aloud.

Integrating the system with smart glasses for hands-free use.

Introducing collaborative AR editing features for teams.

Optimizing the model further for low-power edge devices like Raspberry Pi.

Built With

  • 8-bit quantization
  • CRAFT
  • diffusers
  • OpenCV
  • EAST
  • ARCore / ARKit
  • ONNX Runtime
  • Hugging Face Transformers
  • JavaScript
  • PyTorch
  • JSON
  • GitHub
  • Python
  • LoRA
  • PIL
  • NumPy