Inspiration

I wanted to understand how modern OCR systems go beyond simple text extraction and move toward document understanding. PaddleOCR-VL caught my attention because it combines OCR with vision-language models to extract structured content from complex documents like newspapers and reports. This project started as a warm-up to explore that capability in a hands-on way.

What it does

This project takes a document image as input and uses PaddleOCR-VL to:

- Detect text and layout elements
- Understand document structure
- Convert the content into structured, readable Markdown

The result is a clean representation of the document that can be reused for analysis, search, or publishing.
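The conversion step can be illustrated with a toy renderer. The element types and dict layout below are hypothetical stand-ins invented for this sketch, not PaddleOCR-VL's actual output schema:

```python
# Toy illustration: render parsed layout elements as Markdown.
# The {"type": ..., "text": ...} schema is a hypothetical stand-in,
# not PaddleOCR-VL's real output format.

def to_markdown(elements):
    """Render a list of layout elements as a Markdown string."""
    lines = []
    for el in elements:
        kind, text = el["type"], el["text"]
        if kind == "title":
            lines.append(f"# {text}")
        elif kind == "heading":
            lines.append(f"## {text}")
        elif kind == "list_item":
            lines.append(f"- {text}")
        else:  # plain paragraph
            lines.append(text)
        lines.append("")  # blank line between blocks
    return "\n".join(lines).rstrip() + "\n"

doc = [
    {"type": "title", "text": "Morning Edition"},
    {"type": "paragraph", "text": "Local news and weather."},
    {"type": "list_item", "text": "Headline one"},
]
print(to_markdown(doc))
```

The real pipeline does far more (reading order, tables, formulas), but the core idea is the same: layout-typed elements map naturally onto Markdown constructs.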

How we built it

- Used PaddleOCR-VL for document parsing
- Ran inference in Baidu AI Studio (PaddlePaddle environment)
- Processed a real multilingual document image
- Exported the output into Markdown
- Published the results using GitHub Pages

Challenges we ran into

- Environment setup and dependency compatibility
- Understanding where outputs were generated
- Differentiating between "code running" and "content being published"
- Learning how GitHub Pages works for the first time

Each issue helped clarify how ML pipelines and deployment actually work.

Accomplishments that we're proud of

- Successfully ran PaddleOCR-VL end-to-end
- Extracted structured content from a real document
- Published a live project page using GitHub Pages
- Completed a full workflow: model → output → public demo
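The inference and export steps above can be sketched in a few lines. This is an assumption-laden sketch following the PaddleOCR 3.x quickstart style (the `PaddleOCRVL` class, `predict`, and the `save_to_markdown` result method; exact names can vary by version), and running it requires the paddleocr package plus a model download, so treat it as illustrative rather than copy-paste ready:

```python
# Sketch of the parsing workflow, assuming the PaddleOCR 3.x
# quickstart-style API (names may differ by version).
from paddleocr import PaddleOCRVL

pipeline = PaddleOCRVL()
results = pipeline.predict("newspaper_page.png")  # hypothetical input file

for res in results:
    res.save_to_json(save_path="output")      # structured layout + text
    res.save_to_markdown(save_path="output")  # Markdown export
```

In our workflow, the generated Markdown was what ultimately got published through GitHub Pages.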

What we learned

- OCR today is not just text recognition but document understanding
- Vision-language models can extract structure, not just words
- Deployment and presentation are as important as the model itself
- Debugging confusion is part of real learning, not a failure

What's next for PaddleOCR-VL Warmup

- Add more document samples
- Compare OCR output formats (Markdown vs. JSON)
- Explore layout-aware post-processing
- Extend this into a small document-analysis pipeline
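The Markdown-vs-JSON comparison can be previewed by serializing the same parsed elements both ways. The element schema here is invented for the sketch, not PaddleOCR-VL's real output:

```python
import json

# Hypothetical parsed elements (invented schema, for comparison only).
elements = [
    {"type": "title", "text": "Quarterly Report"},
    {"type": "paragraph", "text": "Revenue grew in Q3."},
]

# JSON keeps the full structure: machine-readable, easy to post-process.
as_json = json.dumps(elements, indent=2)

# Markdown flattens structure into human-readable text.
as_markdown = "\n\n".join(
    f"# {el['text']}" if el["type"] == "title" else el["text"]
    for el in elements
)

print(as_json)
print(as_markdown)
```

The trade-off in miniature: JSON round-trips losslessly for downstream analysis, while Markdown is the better publishing format.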
