Warm-up Task

Inspiration

This project was inspired by the need to convert complex PDFs into clean, visually appealing web content using AI. Traditional OCR tools often produce messy or inconsistent output, making it difficult to transform documents into usable digital experiences. With Baidu’s powerful OCR and ERNIE’s generative capabilities, I wanted to build a fully automated pipeline that converts a PDF into a modern, styled website with zero manual editing.

What it does

The project takes a PDF document, extracts all its text using Baidu’s OCR API, processes and cleans the content, and then feeds it to ERNIE 3.5, which generates a complete website consisting of an index.html and styles.css.

In short:
PDF → OCR → Clean Text → ERNIE → Fully Generated Website

It outputs:

- Clean, structured text extracted from every page
- Page-by-page OCR preview images
- A complete, styled, modern HTML website created automatically

How we built it

The pipeline is built using Python and Baidu APIs:

OCR Extraction (ocr_api.py)
- Sends the PDF (base64-encoded) to Baidu’s OCR endpoint
- Extracts rec_texts from every page
- Saves:
  - output/output.txt (clean combined text)
  - output/*.jpg (OCR preview images)
  - output_resultados_detallados.json (full OCR data)
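The OCR step above can be sketched roughly as follows. This is a minimal illustration, not the actual ocr_api.py: the helper names (encode_pdf, collect_rec_texts, clean_text) and the shape of sample_response are assumptions made for the example. Only the rec_texts key and the base64 encoding come from the description above, and the real request to Baidu's endpoint is left out.

```python
import base64
from pathlib import Path


def encode_pdf(pdf_path):
    """Return the PDF file's bytes as a base64 ASCII string."""
    return base64.b64encode(Path(pdf_path).read_bytes()).decode("ascii")


def collect_rec_texts(node):
    """Recursively gather every string stored under a 'rec_texts' key.

    Walking the whole structure avoids hard-coding how deeply the OCR
    response nests its pages (the nesting here is an assumption).
    """
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "rec_texts" and isinstance(value, list):
                found.extend(t for t in value if isinstance(t, str))
            else:
                found.extend(collect_rec_texts(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_rec_texts(item))
    return found


def clean_text(lines):
    """Strip surrounding whitespace and drop empty lines."""
    return "\n".join(line.strip() for line in lines if line.strip())


# Hypothetical response shape, used only to exercise the helpers.
sample_response = {
    "result": {
        "pages": [
            {"rec_texts": ["Title ", "", "First paragraph"], "rec_scores": [0.99, 0.1, 0.97]},
            {"rec_texts": ["Second page"], "dt_polys": []},
        ]
    }
}

text = clean_text(collect_rec_texts(sample_response))
```

The recursive walk ignores dt_polys, scores, and other metadata, so only recognized text ends up in the combined output.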
HTML Generation (ernie_api.py)
- Reads the clean text
- Sends a carefully crafted prompt to ERNIE 3.5
- ERNIE returns a JSON containing:
  - html
  - css
- These files are saved into the site/ directory
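Saving the result could look something like this. It is a sketch under assumptions: save_site is a hypothetical helper name, and the payload is assumed to be an already-parsed dict with string html and css fields. The index.html and styles.css file names follow the description above.

```python
from pathlib import Path


def save_site(payload, site_dir="site"):
    """Write the 'html' and 'css' fields of payload into site_dir.

    Raises ValueError if either field is missing or not a string, so a
    half-formed model response never overwrites a working site.
    """
    missing = [key for key in ("html", "css") if not isinstance(payload.get(key), str)]
    if missing:
        raise ValueError(f"payload is missing fields: {missing}")
    out = Path(site_dir)
    out.mkdir(parents=True, exist_ok=True)
    html_path = out / "index.html"
    css_path = out / "styles.css"
    html_path.write_text(payload["html"], encoding="utf-8")
    css_path.write_text(payload["css"], encoding="utf-8")
    return html_path, css_path
```

Validating both fields before touching the disk keeps the site/ folder in a deployable state even when a generation run fails.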
Deployment
- The site/ folder is deployed to GitHub Pages to automatically host the generated website.

Challenges we ran into

OCR JSON complexity:
Baidu’s OCR response includes many nested structures (dt_polys, rec_boxes, scores, metadata), so correctly extracting clean text required filtering and multiple iterations.
Handling base64 PDF encoding:
Large PDFs required careful encoding and error handling.
ERNIE formatting variations:
Sometimes ERNIE returned HTML wrapped in Markdown or broken JSON. I had to sanitize the output to extract valid JSON consistently.
Ensuring the HTML and CSS were always functional:
The LLM sometimes produced styles or paths that needed correction, so I refined the prompt until results were reliable.
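For the ERNIE formatting issue, one way to sanitize the output is sketched below. This is an illustrative approach, not necessarily the exact code in ernie_api.py: extract_json is a hypothetical name, and the strategy (unwrap a Markdown fence if present, then parse from the first "{" to the last "}") is an assumption about what "sanitize" involved. It does not repair truly broken JSON; it only rejects it with an error so the request can be retried.

```python
import json
import re

# Built from single backticks so this snippet can itself sit inside a fence.
TICKS = "`" * 3
FENCE = re.compile(TICKS + r"(?:json)?\s*(.*?)" + TICKS, re.DOTALL)


def extract_json(raw):
    """Return the JSON object embedded in a model reply.

    Handles replies that wrap the JSON in a Markdown code fence or
    surround it with extra prose. Raises ValueError when no parseable
    object is found (json.JSONDecodeError is a ValueError subclass).
    """
    match = FENCE.search(raw)
    candidate = match.group(1) if match else raw
    start = candidate.find("{")
    end = candidate.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(candidate[start:end + 1])


# Example reply wrapped in a Markdown fence with a leading sentence.
wrapped = (
    "Here is your site:\n"
    + TICKS + "json\n"
    + '{"html": "<h1>Hi</h1>", "css": "h1 { color: red; }"}\n'
    + TICKS
)
site = extract_json(wrapped)
```

Because every failure surfaces as a ValueError, the caller can catch one exception type and simply re-send the prompt.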

Accomplishments that we're proud of

- Successfully transformed a full PDF into a modern, aesthetic website purely through OCR + LLM processing.
- Created code that is modular, easy to reuse, and ready for future expansion.
- Overcame parsing, encoding, and JSON validation challenges.
- Achieved a clean result compatible with GitHub Pages deployment.
- Built a pipeline that could, in principle, be applied to large batches of PDFs in real-world use cases, although it has not yet been tested at that scale.

What we learned

- How to work deeply with Baidu’s OCR and ERNIE 3.5 APIs.
- How to handle complex JSON responses from OCR systems.
- Best practices for cleaning noisy OCR data.
- How to prompt ERNIE to produce valid JSON reliably.
- How to automate multi-step AI workflows in Python.
- Deployment techniques using GitHub Pages for auto-hosting generated content.

Built With

ernie
ocr

Updates

Luis Eduardo De León Barrientos started this project — Dec 05, 2025 08:20 PM EST
