Inspiration

Many people receive PDF files for school assignments, reports, or official purposes without knowing what is inside them. Most everyday documents do not need interactive features or embedded scripts. However, users cannot easily verify whether a file contains hidden elements.

This inspired me to build a simple tool that prioritizes safety over complexity.

What It Does

CleanFile rebuilds PDF documents into a clean, script-free version.

Instead of trying to detect malware, it extracts only the readable text and reconstructs a brand-new static document, leaving every interactive and executable element behind.

The result is a lightweight, safe version suitable for everyday use.

How I Built It

The prototype was built using:

  • Python
  • Streamlit (for the demo interface)
  • PyPDF (for text extraction)
  • ReportLab (for rebuilding a new PDF)

The system extracts readable text from the original file and generates a completely new PDF using safe default fonts. The original internal structure is not reused.
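The extract-and-rebuild step described above might look roughly like the sketch below. It is not the actual CleanFile code: the function names `wrap_lines` and `rebuild_pdf`, the A4 page size, the margins, and the 90-character line width are all assumptions chosen for illustration. It uses pypdf's `PdfReader` for extraction and ReportLab's `Canvas` with the built-in Helvetica font for output.

```python
import textwrap


def wrap_lines(text, width=90):
    """Split extracted text into lines of at most `width` characters."""
    lines = []
    for raw in text.splitlines():
        # textwrap.wrap returns [] for blank input, so keep an empty line
        # to preserve paragraph spacing.
        lines.extend(textwrap.wrap(raw, width=width) or [""])
    return lines


def rebuild_pdf(src_path, dst_path, font="Helvetica", size=11):
    """Write a new text-only PDF from the readable text in src_path."""
    # Imported here so wrap_lines stays usable without the PDF libraries.
    from pypdf import PdfReader
    from reportlab.lib.pagesizes import A4
    from reportlab.pdfgen import canvas

    _, page_height = A4
    margin = 50            # assumed margin, in points
    leading = size + 3     # assumed line spacing, in points

    out = canvas.Canvas(dst_path, pagesize=A4)
    for page in PdfReader(src_path).pages:
        out.setFont(font, size)
        y = page_height - margin
        # extract_text() can return an empty string for image-only pages.
        for line in wrap_lines(page.extract_text() or ""):
            if y < margin:
                # Page is full: start a new one and set the font again,
                # because showPage() resets the canvas state.
                out.showPage()
                out.setFont(font, size)
                y = page_height - margin
            out.drawString(margin, y, line)
            y -= leading
        out.showPage()
    out.save()
```

Only plain strings cross from the original file into the new one, so nothing from the source's internal object structure is carried over. The 90-character width is a rough fit for 11 pt Helvetica on A4 and would need tuning for other fonts or page sizes.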

Challenges

One major challenge was handling layout and formatting. PDF files are not structured like regular documents: they store content as positioned elements.

Rebuilding the document while keeping it readable, but without inheriting any potentially unsafe structures, required balancing safety and visual clarity.
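To illustrate what "positioned elements" means for readability, here is a small sketch of turning positioned text fragments back into lines. The `(x, y, text)` tuples, the function name `fragments_to_lines`, and the `y_tolerance` value are hypothetical; this is not how pypdf works internally, nor necessarily what CleanFile does, only a picture of the general problem.

```python
from collections import defaultdict


def fragments_to_lines(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into lines of text in reading order."""
    rows = defaultdict(list)
    for x, y, text in fragments:
        # Crude bucketing: fragments whose y values round to the same
        # bucket are treated as sitting on the same line.
        rows[round(y / y_tolerance)].append((x, text))

    lines = []
    # In PDF coordinates y grows upward, so the largest y is the top line.
    for key in sorted(rows, reverse=True):
        # Within a row, sort left to right by x.
        lines.append(" ".join(text for _, text in sorted(rows[key])))
    return lines


# Hypothetical fragments, deliberately out of reading order.
sample = [(120, 700, "world"), (50, 700, "Hello"), (50, 680, "Second line")]
print(fragments_to_lines(sample))
```

Even this toy version has to guess which fragments belong to the same line, which is why a rebuilt document can lose some of the original layout.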

What I Learned

I learned how PDF documents store content internally and how removing structural complexity can improve safety.

This project showed that sometimes the simplest security approach, rebuilding from scratch, can be more practical than trying to detect every possible threat.

Future Improvements

In the future, CleanFile could be developed into a standalone desktop application or integrated into document management systems for safer file sharing.
