Inspiration
This project was inspired by the need to convert complex PDFs into clean, visually appealing web content using AI. Traditional OCR tools often produce messy or inconsistent output, making it difficult to transform documents into usable digital experiences. With Baidu’s OCR and ERNIE’s generative capabilities, I wanted to build a fully automated pipeline that converts a PDF into a modern, styled website with no manual editing.
What it does
The project takes a PDF document, extracts all its text using Baidu’s OCR API, processes and cleans the content, and then feeds it to ERNIE 3.5, which generates a complete website consisting of an index.html and styles.css.
In short:
PDF → OCR → Clean Text → ERNIE → Fully Generated Website
It outputs:
- Clean, structured text extracted from every page
- Page-by-page OCR preview images
- A complete, styled, modern HTML website created automatically
How we built it
The pipeline is built using Python and Baidu APIs:
OCR Extraction (ocr_api.py)
- Sends the PDF (base64-encoded) to Baidu’s OCR endpoint
- Extracts rec_texts from every page
- Saves:
  - output/output.txt (clean combined text)
  - output/*.jpg (OCR preview images)
  - output_resultados_detallados.json (full OCR data)
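The extraction step can be sketched roughly as follows. This is a minimal illustration, not the actual contents of ocr_api.py: the function names are made up for this example, the HTTP call is left out because the endpoint and authentication details are not described here, and the only thing assumed about the response is that recognized lines sit in lists under a rec_texts key somewhere in the nested JSON.

```python
import base64
from pathlib import Path


def encode_pdf(path):
    """Return the file's bytes as a base64 ASCII string, ready for a JSON body."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


def collect_rec_texts(node):
    """Walk a nested OCR response and gather every string under a 'rec_texts' key.

    The exact response layout is an assumption; walking recursively avoids
    depending on how deeply the per-page results are nested.
    """
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "rec_texts" and isinstance(value, list):
                found.extend(t for t in value if isinstance(t, str))
            else:
                found.extend(collect_rec_texts(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_rec_texts(item))
    return found


def clean_text(lines):
    """Strip surrounding whitespace, drop empty lines, and join with newlines."""
    return "\n".join(line.strip() for line in lines if line.strip())
```

Walking the structure recursively means fields such as dt_polys, rec_boxes, and scores are skipped without naming them, so the text extraction keeps working if the nesting differs between responses.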
HTML Generation (ernie_api.py)
- Reads the clean text
- Sends a carefully crafted prompt to ERNIE 3.5
- ERNIE returns a JSON containing:
  - html
  - css
- These files are saved into the site/ directory
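Saving the generated files might look like the sketch below. It assumes the reply has already been reduced to a JSON object with html and css string fields, as described above; the function name save_site and the validation logic are illustrative and not taken from ernie_api.py.

```python
import json
from pathlib import Path


def save_site(payload, out_dir="site"):
    """Write the 'html' and 'css' fields of a JSON reply to index.html and styles.css."""
    data = json.loads(payload) if isinstance(payload, str) else payload
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object with 'html' and 'css' fields")
    # Fail early if either field is absent or not a string.
    missing = [k for k in ("html", "css") if not isinstance(data.get(k), str)]
    if missing:
        raise ValueError(f"response is missing fields: {missing}")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "index.html").write_text(data["html"], encoding="utf-8")
    (out / "styles.css").write_text(data["css"], encoding="utf-8")
    return out
```

Checking both fields before writing anything avoids leaving a half-written site/ directory when the model omits one of them.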
Deployment
- The site/ folder is deployed to GitHub Pages to automatically host the generated website.
Challenges we ran into
- OCR JSON complexity: Baidu’s OCR response includes many nested structures (dt_polys, rec_boxes, scores, metadata), so correctly extracting clean text required filtering and multiple iterations.
- Handling base64 PDF encoding: Large PDFs required careful encoding and error handling.
- ERNIE formatting variations: Sometimes ERNIE returned HTML wrapped in Markdown or broken JSON. I had to sanitize the output to extract valid JSON consistently.
- Ensuring the HTML and CSS were always functional: The LLM sometimes produced styles or paths that needed correction, so I refined the prompt until results were reliable.
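One way to handle the formatting variations is to ignore everything around the JSON and decode the first object that parses. The sketch below is an assumption about how such sanitizing could be done, not the project's actual code; extract_json is a hypothetical helper. It copes with Markdown fences and extra prose around the object, but it does not repair JSON that is itself malformed; in that case it raises an error so the caller can retry the request.

```python
import json


def extract_json(raw):
    """Return the first JSON object embedded in an LLM reply.

    Tries each '{' in turn and lets the standard decoder stop at the end of
    the object, so Markdown fences or trailing commentary are ignored.
    """
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(raw, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON from this brace; try the next one
        start = raw.find("{", start + 1)
    raise ValueError("no valid JSON object found in reply")
```

Raising on failure rather than returning partial data keeps a bad reply from being written to the site/ directory.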
Accomplishments that we're proud of
- Successfully transformed a full PDF into a modern, aesthetic website purely through OCR + LLM processing.
- Created code that is modular, easy to reuse, and ready for future expansion.
- Overcame parsing, encoding, and JSON validation challenges.
- Achieved a clean result compatible with GitHub Pages deployment.
- Made a system that could, in principle, be applied to large volumes of PDFs in real-world use cases.
What we learned
- How to work deeply with Baidu’s OCR and ERNIE 3.5 APIs.
- How to handle complex JSON responses from OCR systems.
- Best practices for cleaning noisy OCR data.
- How to prompt ERNIE to produce valid JSON reliably.
- How to automate multi-step AI workflows in Python.
- How to deploy generated content automatically with GitHub Pages.
Built With
- ernie
- ocr