Inspiration

PDFs are still one of the most common ways to share information, but they are often difficult to read, navigate, and publish online, especially on mobile devices. We wanted to explore how AI could bridge the gap between static documents and modern web content by automatically transforming PDFs into clean, accessible websites.

What it does

Doc2Web automatically converts a PDF into a fully deployable static website. It extracts text and layout information from the PDF, restructures the content, and generates a clean, responsive web page that can be hosted on GitHub Pages with no manual editing required.

How we built it

PaddleOCR-VL is used to extract text and layout information from the input PDF.

The extracted content is converted into structured Markdown while preserving headings and sections.

The Markdown is sent to ERNIE via API to improve readability and generate semantically structured HTML.

The final output is assembled into a static website using HTML and CSS, ready for deployment on GitHub Pages.

The entire pipeline is automated and beginner-friendly.
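The non-AI ends of the pipeline above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's actual code: the `(label, text)` block shape and the label names `title`, `section`, and `subsection` are hypothetical and do not reflect the real PaddleOCR-VL output schema, and the `docs` output folder is simply one of the locations GitHub Pages can publish from.

```python
import html
from pathlib import Path

# Hypothetical block shape: (label, text). The labels are assumptions made
# for this sketch, not the actual PaddleOCR-VL output format.
Block = tuple[str, str]

HEADING_PREFIX = {"title": "# ", "section": "## ", "subsection": "### "}


def blocks_to_markdown(blocks: list[Block]) -> str:
    """Turn labelled text blocks into Markdown, keeping heading levels."""
    lines = []
    for label, text in blocks:
        text = " ".join(text.split())  # collapse line breaks and extra spaces
        if not text:
            continue  # skip blocks that contain only whitespace
        lines.append(HEADING_PREFIX.get(label, "") + text)
    return "\n\n".join(lines) + "\n"


PAGE_TEMPLATE = """<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>{title}</title>
<link rel="stylesheet" href="style.css">
</head>
<body>
<main>
{body}
</main>
</body>
</html>
"""

DEFAULT_CSS = "main { max-width: 42rem; margin: 0 auto; padding: 1rem; line-height: 1.6; }\n"


def write_site(body_html: str, title: str, out_dir: str = "docs") -> Path:
    """Write index.html and style.css into out_dir and return the page path."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    page = PAGE_TEMPLATE.format(title=html.escape(title), body=body_html)
    (out / "index.html").write_text(page, encoding="utf-8")
    (out / "style.css").write_text(DEFAULT_CSS, encoding="utf-8")
    # An empty .nojekyll file tells GitHub Pages to serve the files as-is.
    (out / ".nojekyll").write_text("", encoding="utf-8")
    return out / "index.html"
```

In this sketch, the HTML returned by the model step would be passed to `write_site` as `body_html`, and the resulting folder could be committed and published without further editing.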

Challenges we ran into

Preserving document structure from PDFs with inconsistent layouts

Ensuring the generated HTML followed a logical heading hierarchy

Balancing automation with readability so the output didn’t feel “AI-generated”

Keeping the pipeline simple while meeting all Warm-Up Task requirements
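The heading-hierarchy challenge listed above lends itself to an automated check on the generated HTML. The following is a minimal sketch using only Python's standard library; the two rules it enforces (exactly one `<h1>`, no skipped levels) and the function name `heading_problems` are assumptions chosen for illustration, not the checks the project necessarily ran.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Record the level of every <h1>..<h6> start tag in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; "hr" is excluded by the digit test.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def heading_problems(html_text):
    """Return a list of human-readable heading-hierarchy problems."""
    parser = HeadingCollector()
    parser.feed(html_text)
    parser.close()
    levels = parser.levels

    problems = []
    h1_count = levels.count(1)
    if h1_count != 1:
        problems.append(f"expected exactly one <h1>, found {h1_count}")

    previous = 0  # 0 stands for "no heading seen yet"
    for position, level in enumerate(levels, start=1):
        if level > previous + 1:
            origin = f"<h{previous}>" if previous else "start of document"
            problems.append(
                f"heading {position}: <h{level}> skips a level after {origin}"
            )
        previous = level
    return problems
```

An empty list means the page passes both rules; a non-empty list could be used to reject the output or to re-prompt the model.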

Accomplishments that we're proud of

Successfully built a full PDF → Website pipeline using PaddleOCR-VL and ERNIE

Generated a deployable website with no manual content cleanup

Created a clear, reproducible workflow suitable for beginners

Completed the official Warm-Up Task requirements end-to-end

What we learned

OCR quality strongly affects downstream AI generation

Prompt design plays a major role in turning raw text into clean HTML

ERNIE is effective at restructuring and improving extracted document content

Clear documentation and simple architecture matter as much as technical depth
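To make the point about prompt design concrete, here is a hedged sketch of how a Markdown-to-HTML prompt might be assembled and its reply cleaned up. The prompt wording is hypothetical, not the prompt used in this project, and `call_model` is a placeholder for whatever function sends text to ERNIE and returns its reply; no real ERNIE API call is shown or assumed.

```python
from typing import Callable

# Hypothetical prompt wording, written for this sketch only.
PROMPT_TEMPLATE = (
    "You are converting a document to a web page.\n"
    "Rewrite the Markdown below as semantic HTML.\n"
    "Rules:\n"
    "- Use exactly one <h1>, and do not skip heading levels.\n"
    "- Keep every fact from the source; do not add new content.\n"
    "- Return only the HTML fragment for <main>, with no commentary.\n\n"
    "Markdown:\n{markdown}\n"
)

FENCE = "`" * 3  # a Markdown code fence, which models often wrap replies in


def build_prompt(markdown: str) -> str:
    """Insert the extracted Markdown into the instruction template."""
    return PROMPT_TEMPLATE.format(markdown=markdown.strip())


def strip_code_fence(reply: str) -> str:
    """Remove a leading and trailing code fence from a model reply, if present."""
    lines = reply.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]
    return "\n".join(lines).strip()


def markdown_to_html(markdown: str, call_model: Callable[[str], str]) -> str:
    """Build the prompt, call the caller-supplied model function, clean the reply."""
    return strip_code_fence(call_model(build_prompt(markdown)))
```

Passing the model call in as a function keeps the prompt logic testable with a stub, which is useful when comparing prompt variants.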

What's next for Doc2Web: PDF to Website Generator

Support for multi-page navigation instead of a single page

Better table and image handling

Theme customization for generated websites

Optional multilingual output using ERNIE
