Inspiration

With sensitive personal data increasingly stored in local files such as scanned IDs, insurance documents, and tax forms, I wanted a tool that works silently in the background and catches exposure risks before they become a problem. Most privacy tools are cloud-based or reactive; PrivacyGuard is local-first and proactive. Beyond accidental exposure, unprotected sensitive files sitting in easily accessible folders are a prime target for attackers: malware, ransomware, or anyone with unauthorized access can silently exfiltrate documents containing credit card numbers, government IDs, or credentials without the user ever knowing. PrivacyGuard acts as a last line of defense, detecting those files the moment they land and locking them away in AES-256 encrypted archives before they can be read or stolen.

What it does

PrivacyGuard runs in the Windows system tray and watches one or more folders for new files. When a file arrives, it extracts text (via OCR for images, PDF parsing, DOCX reading, or plain text) and scans it against regex patterns for sensitive data types: credit card numbers, SSNs, passport numbers, API keys, passwords, health information, and more. It also runs a keyword-cluster check specifically for government-issued ID documents. If anything is found, a popup alerts the user and offers to encrypt the file into a password-protected AES-256 ZIP archive automatically.
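
The regex scanning step described above can be sketched roughly as follows. This is a minimal illustration, not PrivacyGuard's actual code: the `PATTERNS` table, the `scan_text` and `luhn_valid` names, and the two example patterns are all assumptions made for this sketch. The Luhn checksum filter is one common way to cut false positives on card-like digit runs; the write-up does not say whether PrivacyGuard uses it.

```python
import re

# Hypothetical pattern table; the real PrivacyGuard patterns are not shown here.
PATTERNS = {
    # US Social Security number written as 123-45-6789
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # 13 to 19 digits, optionally separated by single spaces or hyphens
    "credit_card": re.compile(r"\b\d(?:[ -]?\d){12,18}\b"),
}


def luhn_valid(digits: str) -> bool:
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan_text(text: str) -> dict[str, list[str]]:
    """Return a mapping of data type -> matched strings found in text."""
    findings: dict[str, list[str]] = {}
    for label, pattern in PATTERNS.items():
        matches = [m.group(0) for m in pattern.finditer(text)]
        if label == "credit_card":
            # Assumed extra filter: drop digit runs that fail the checksum.
            matches = [m for m in matches if luhn_valid(re.sub(r"\D", "", m))]
        if matches:
            findings[label] = matches
    return findings
```

Calling `scan_text()` on the text extracted from a file returns an empty dict when nothing matches, so the caller can decide whether to raise the alert popup with a simple truthiness check.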

How I built it

  • Python with tkinter + ttk for the settings UI and alert popups (light-mode, clam theme)
  • pystray for the system tray icon
  • watchdog for real-time folder monitoring
  • pytesseract + Pillow for OCR on images
  • PyPDF2 and python-docx for document text extraction
  • pyzipper for AES-256 encrypted ZIP creation
  • config.json for persistent settings, with the archive password intentionally kept in memory only and never written to disk
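
One way to keep the archive password in memory only is to make the serialization step build its output from an explicit list of persistable fields, so the password can never reach config.json by accident. The sketch below assumes a hypothetical `Settings` class and `serialize`/`save_config` helpers; the actual module layout and field names in PrivacyGuard are not described in this write-up.

```python
import json


class Settings:
    """Runtime settings; names here are illustrative, not PrivacyGuard's."""

    def __init__(self, watch_folders, archive_password=None):
        self.watch_folders = list(watch_folders)
        # Held in memory for the lifetime of the process only.
        self.archive_password = archive_password

    def to_persistable(self) -> dict:
        # Allow-list of fields that may be written to disk.
        # The password is deliberately absent.
        return {"watch_folders": self.watch_folders}


def serialize(settings: Settings) -> str:
    """Produce the JSON text that would be written to config.json."""
    return json.dumps(settings.to_persistable(), indent=2)


def save_config(settings: Settings, path: str = "config.json") -> None:
    """Write only the persistable fields to disk."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(serialize(settings))
```

Using an allow-list rather than deleting the password from a full dump means a newly added secret field stays off disk by default.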

Challenges I ran into

  • Keeping the archive password secure: ensuring it is never serialized to disk while still being accessible across modules at runtime
  • Building reliable sensitive data detection using only regex was harder than expected. Data appears in inconsistent formats with varying separators and surrounding context, requiring significant iteration to avoid both false positives and false negatives
  • A scanned ID card or image often contains no structured data patterns at all, so I had to build a keyword-cluster fallback where co-occurring terms like "DOB", "exp", "height", and "issued" together signal a government ID even when no regex fires
  • Windows path separator inconsistencies silently broke file operations across multiple libraries, requiring defensive os.path.normpath() normalization everywhere
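
The keyword-cluster fallback mentioned above could look something like this. It is a sketch under stated assumptions: the `ID_KEYWORDS` set extends the four terms named in the list with two hypothetical extras, and the threshold of three co-occurring terms is a guess, not a value taken from PrivacyGuard.

```python
import re

# "dob", "exp", "height", "issued" come from the write-up;
# "sex" and "class" are hypothetical additions for illustration.
ID_KEYWORDS = {"dob", "exp", "height", "issued", "sex", "class"}

# Assumed threshold: how many distinct keywords must co-occur.
MIN_HITS = 3


def looks_like_government_id(text: str, min_hits: int = MIN_HITS) -> bool:
    """Flag text when enough ID-related keywords appear together."""
    # Lowercase and split into alphabetic tokens so "DOB:" matches "dob".
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return len(ID_KEYWORDS & tokens) >= min_hits
```

Requiring several distinct terms keeps a single stray word such as "exp" in an unrelated document from triggering an alert, while OCR output from an ID card, which tends to contain many of these labels at once, still gets caught.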

Accomplishments that I'm proud of

  • A fully local, zero-cloud privacy scanner that works silently without interrupting the user's workflow
  • AES-256 encryption integrated directly into the alert flow: one click from detection to protection
  • A clean, modern light-mode UI built entirely with tkinter/ttk, which is notoriously difficult to make look polished
  • The "Guarded Folder" concept: a persistent encrypted archive that files can be routed to automatically without any user interaction beyond initial setup

What I learned

  • Tesseract OCR can read text from images entirely offline. I didn't know this was possible without an LLM or cloud API. This was a key discovery: sending images to a service like Google Vision to detect sensitive content would completely defeat the purpose of a privacy tool. Tesseract gives us the same capability locally, for free
  • Regex alone cannot reliably classify sensitive documents. Real-world sensitive data is messy, and layered detection, combining pattern matching with keyword clustering, is necessary to catch what structured patterns miss
  • How to structure a multi-module desktop app cleanly, separating scanning logic, UI, encryption, and config into distinct responsibilities
  • That a genuinely useful security tool doesn't require a cloud backend: everything here runs locally, processes nothing externally, and still covers a broad range of real threat scenarios

What's next for PrivacyGuard

More scalability and a wider net for catching sensitive files. This may include a better-trained model, or even a small local offline LLM, for better detection.
