OCR-Based PDF Web Publisher

PDF text extracted with PaddleOCR and published as a web page

About the Project

Inspiration

The inspiration for this project came from the need to make information stored in PDF documents more accessible and shareable on the web. PDFs are commonly used for documentation, but their content is not always easy to view, search, or integrate into web platforms. This motivated me to explore how PDF content could be converted into a simple web-based format.

What I Learned

Through this project, I learned:

How Optical Character Recognition (OCR) can be used to extract text from PDF documents
The basics of using PaddleOCR for text extraction
How to structure extracted content into clean and readable HTML
How to deploy a static website using GitHub Pages

This project helped me understand the end-to-end flow from raw document input to a deployed web output.
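
To illustrate the extraction step, here is a minimal sketch of turning one page of OCR output into ordered text lines. It assumes the nested result shape returned by PaddleOCR 2.x, where each entry is `[box, (text, confidence)]` and `box` is four `[x, y]` corner points; other releases may return a different structure. The function name `extract_text_lines` and the `min_confidence` threshold are hypothetical and are not part of PaddleOCR or of this project's code.

```python
def extract_text_lines(page_result, min_confidence=0.5):
    """Return the recognized strings for one page in rough reading order.

    page_result is assumed to look like PaddleOCR 2.x output for one page:
    a list of [box, (text, confidence)] entries, or None for an empty page.
    """
    kept = []
    for box, (text, confidence) in page_result or []:
        # Skip low-confidence detections and empty strings.
        if confidence < min_confidence or not text.strip():
            continue
        top = min(point[1] for point in box)
        left = min(point[0] for point in box)
        kept.append((top, left, text.strip()))

    # Sort top to bottom, then left to right. This is a simple heuristic
    # and does not handle multi-column layouts.
    kept.sort(key=lambda item: (item[0], item[1]))
    return [text for _, _, text in kept]
```

Sorting by the top-left corner is the simplest possible ordering; text boxes on the same visual row with slightly different heights may still come out in the wrong order.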

How I Built the Project

The project was built in the following steps:

A PDF document was used as the input source.
Text was extracted from the PDF using PaddleOCR.
The extracted text was manually structured into an HTML file for clarity and readability.
The final HTML page was deployed as a static website using GitHub Pages.

The focus of the project was on demonstrating the transformation of PDF content into a live web page rather than building a complex processing pipeline.
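
The HTML for this project was structured by hand. As a rough illustration of how that step could be automated, the sketch below wraps plain-text paragraphs in a minimal standalone page. The function name `render_page` and the page layout are assumptions made for this example, not the markup actually used in the project.

```python
from html import escape


def render_page(title, paragraphs):
    """Build a minimal standalone HTML document from plain-text paragraphs."""
    # Escape every piece of extracted text so characters such as < and &
    # are displayed literally instead of being parsed as markup.
    body = "\n".join(f"    <p>{escape(p)}</p>" for p in paragraphs)
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n'
        "<head>\n"
        '  <meta charset="utf-8">\n'
        f"  <title>{escape(title)}</title>\n"
        "</head>\n"
        "<body>\n"
        "  <main>\n"
        f"    <h1>{escape(title)}</h1>\n"
        f"{body}\n"
        "  </main>\n"
        "</body>\n"
        "</html>\n"
    )
```

Writing the returned string to an `index.html` file at the root of the publishing source is one common way to have GitHub Pages serve it as the site's home page.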

Challenges Faced

One of the main challenges was understanding how to properly format OCR-extracted text so that it looks clean and readable on a web page. Another challenge was learning how GitHub Pages works for hosting static websites. Resolving these challenges helped strengthen my understanding of both document processing and web deployment.
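
Much of the formatting difficulty comes from OCR returning one string per visual line rather than per paragraph. The sketch below shows one possible cleanup pass, assuming the input is a list of line strings in which an empty string marks a paragraph break. That convention and the name `lines_to_paragraphs` are hypothetical, chosen only for this example.

```python
import re


def lines_to_paragraphs(lines):
    """Merge hard-wrapped OCR lines into paragraphs.

    An empty (or whitespace-only) line is treated as a paragraph break.
    """
    paragraphs = []
    current = ""
    for raw in lines:
        # Collapse runs of whitespace left over from the page layout.
        line = re.sub(r"\s+", " ", raw).strip()
        if not line:
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if current.endswith("-") and line[0].islower():
            # Rejoin a word that was hyphenated across a line break.
            current = current[:-1] + line
        elif current:
            current += " " + line
        else:
            current = line
    if current:
        paragraphs.append(current)
    return paragraphs
```

The hyphen rule is a heuristic: it will also merge genuine compounds that happen to break at the hyphen (for example "well-" followed by "known"), so the output still benefits from a manual read-through.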

Conclusion

This warm-up project demonstrates a simple but practical workflow for converting PDF content into a deployable web page. It highlights how OCR and basic web technologies can be combined to improve content accessibility.

Built With

css
github
githubpages
google-colab
html
paddleocr

Updates

ishita Makdiya started this project — Dec 20, 2025 02:46 PM EST
