About the Project
Motivation
This project was inspired by a practical question:
how can dense, research-level machine learning papers be transformed into something more accessible, readable, and reusable on the web?
While studying modern language models, I repeatedly encountered the paper Attention Is All You Need. Although it is foundational, reading it in PDF form makes it difficult to extract structure, reuse content, or integrate it into interactive learning materials. I wanted to explore whether modern OCR, vision–language models, and large language models could be combined into a pipeline that converts a research paper into a clean, web-ready document.
What I Learned
Through this project, I learned both conceptual and engineering-level lessons:
- How Vision–Language OCR models (PaddleOCR-VL) differ from traditional OCR by jointly extracting text and layout structure.
- That real-world ML pipelines rarely return “clean” data structures; understanding returned formats (e.g. `list` vs `dict`) is critical.
- How to safely integrate LLMs as post-processors, using them as converters rather than content generators.
- Why strong prompt constraints are necessary to prevent LLMs from hallucinating or replacing input content.
- How to manage long-running notebook workflows without restarting kernels or recomputing expensive steps.
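The last point above can be sketched with a small caching helper. This is a minimal illustration, not the project's actual code: the idea is that any expensive result (such as OCR output) is pickled to disk, so a notebook can resume after a kernel restart without recomputing it.

```python
import pickle
from pathlib import Path

def cached(path, compute):
    """Load a previously saved result, or compute and save it.

    Lets a notebook resume after a kernel restart without
    re-running expensive steps such as OCR inference.
    """
    p = Path(path)
    if p.exists():
        with p.open("rb") as f:
            return pickle.load(f)
    result = compute()
    with p.open("wb") as f:
        pickle.dump(result, f)
    return result

# The expensive step runs at most once; later calls load from disk.
blocks = cached("ocr_blocks.pkl", lambda: ["Title", "Paragraph 1"])
```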
On the modeling side, revisiting the Transformer architecture also reinforced key ideas such as self-attention, multi-head attention, and positional encoding:
\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \]
Understanding these concepts at the same time as building a practical system made the theory feel much more concrete.
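The equation above translates almost line for line into NumPy. This is a bare-bones sketch of scaled dot-product attention for a single head, without masking or batching:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the keys, with max-subtraction for numerical stability.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

Q = np.random.randn(4, 8)   # 4 queries, d_k = 8
K = np.random.randn(6, 8)   # 6 keys
V = np.random.randn(6, 16)  # 6 values, d_v = 16
out = attention(Q, K, V)    # one output row per query, shape (4, 16)
```

Each output row is a convex combination of the value vectors, weighted by how strongly the corresponding query matches each key.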
How I Built the Project
The project was built as a multi-stage pipeline:
1. Document Processing: A PDF document is processed using PaddleOCR-VL, which extracts text blocks along with layout information such as titles and paragraphs.
2. Markdown Generation: The OCR output is converted into structured Markdown, preserving headings and readable flow.
3. LLM-Based Conversion: The Markdown is passed to an ERNIE large language model, constrained strictly to convert content into HTML without adding or rewriting information.
4. Web Output: The resulting HTML is saved locally and can be previewed or downloaded as a standalone web page.
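The "constrained strictly" part of the conversion stage is the interesting bit. The exact prompt used in the project is not reproduced here, but a sketch of the kind of constraints involved looks like this (`build_conversion_prompt` is a hypothetical helper):

```python
def build_conversion_prompt(markdown: str) -> str:
    """Wrap Markdown in a prompt that forbids the LLM from
    adding, removing, or rewriting content (illustrative only)."""
    rules = (
        "Convert the Markdown below to HTML.\n"
        "- Do NOT add, remove, summarize, or rewrite any content.\n"
        "- Preserve heading levels and paragraph order exactly.\n"
        "- Output HTML only, with no commentary or explanation.\n"
    )
    return rules + "\n---\n" + markdown

prompt = build_conversion_prompt("# Attention Is All You Need\n\nAbstract text.")
```

Treating the model as a pure format converter, rather than a content generator, is what keeps the output faithful to the source paper.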
Each stage was designed to be modular, so expensive steps (like OCR inference) do not need to be re-run when adjusting later formatting or presentation.
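The modular design can be sketched end to end. The stage functions below are hypothetical stand-ins for the real PaddleOCR-VL and ERNIE calls; the point is the shape of the pipeline, where the expensive OCR stage is skipped on re-runs if its output file already exists:

```python
from pathlib import Path

# Stand-ins for the real model calls, so the pipeline shape is runnable.
def run_ocr(pdf_path):
    return [("title", "Attention Is All You Need"), ("text", "First paragraph.")]

def blocks_to_markdown(blocks):
    return "\n\n".join(f"# {t}" if kind == "title" else t for kind, t in blocks)

def markdown_to_html(md):
    # In the real pipeline this is the strictly constrained ERNIE call.
    return "<article>\n" + md + "\n</article>"

def run_pipeline(pdf_path, out_dir="out"):
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    md_file = out / "paper.md"
    if not md_file.exists():  # expensive OCR stage runs at most once
        md_file.write_text(blocks_to_markdown(run_ocr(pdf_path)))
    html_file = out / "paper.html"
    html_file.write_text(markdown_to_html(md_file.read_text()))
    return html_file

page = run_pipeline("paper.pdf")
```

Because each stage reads and writes plain files, later stages (Markdown styling, HTML output) can be tweaked freely without touching the OCR results.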
Challenges Faced
Several challenges emerged during development:
- Library and API mismatches: PaddleOCR 3.x introduced breaking changes compared to earlier versions, requiring careful inspection of supported parameters and return types.
- Long-running inference management: OCR-VL inference is computationally expensive. Restarting a notebook kernel would mean losing results and recomputing everything, so I had to adapt workflows to continue safely without restarts.
- Unexpected LLM behavior: When prompts were under-specified or inputs were incorrect, the LLM returned generic tutorial content instead of transforming the provided text. This highlighted the importance of strict prompt design and input validation.
- File handling pitfalls: Writing intermediate outputs to incorrect paths (e.g., overwriting a PDF) can silently corrupt data, reinforcing the need for careful file management.
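The file-handling pitfall can be guarded against with a small helper. This is an illustrative sketch, not the project's code: it refuses to overwrite any listed input file and writes atomically via a temp file, so a crash mid-write cannot leave a half-written output.

```python
import os
import tempfile
from pathlib import Path

def safe_write_text(path, text, protected=()):
    """Write text atomically, refusing to overwrite protected inputs
    (e.g. the source PDF). Illustrative guard, not the project's code."""
    target = Path(path).resolve()
    for p in protected:
        if target == Path(p).resolve():
            raise ValueError(f"refusing to overwrite input file: {target}")
    # Write to a temp file in the same directory, then atomically replace.
    fd, tmp = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(text)
    os.replace(tmp, target)
```

A call like `safe_write_text("out/paper.html", html, protected=["paper.pdf"])` fails loudly instead of silently corrupting the source document.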
Outcome
The final result is a reproducible workflow that transforms a research paper into a clean, readable HTML document suitable for learning, sharing, or further analysis. More importantly, the project provided hands-on experience in combining OCR, layout understanding, and large language models into a single, coherent system.
This project represents both a technical exploration and a learning journey into how modern AI tools can be composed thoughtfully to solve real-world problems.
Built With
- ernie
- openai
- python