PrivacyShield — Local-First PDF Redaction with Reversible Encryption

Inspiration

These days, we’re constantly sharing documents — over email, through cloud platforms, or with third-party tools. Many of these documents contain sensitive information like names, social security numbers, medical details, or financial records.

The problem is, most redaction tools either:

require uploading your files somewhere (which raises privacy concerns), or
permanently destroy the original information once redacted

We wanted to build something that felt safer and more practical — a tool that:

works entirely on your device
is easy enough for non-technical users
and doesn’t force you to lose your data forever

That’s how PrivacyShield came about.

What We Built

PrivacyShield is a local-first PDF redaction tool that automatically finds and hides sensitive information — while still letting you recover it later if needed.

Here’s what it does:

Automatically detects PII (Personally Identifiable Information) using a multilingual NLP pipeline (English, German, French, Italian, Spanish)
Redacts sensitive content by placing black boxes directly over it — without breaking the document layout
Encrypts the original data into a secure .privacyshield file
Supports reversible redaction — only the document owner can restore the original information
Works with text PDFs, scanned PDFs, and mixed documents
Runs completely locally — nothing leaves your machine
Provides a simple REST API (/redact and /unredact) for integration

The system follows a clean pipeline architecture:

PDF Input
    ↓
Analyzer  →  classify each page (text / scanned / mixed)
    ↓
Text pages    → Extractor → NER Engine → Redactor → PDF Rebuilder
Scanned pages → pypdfium2 → PaddleOCR  → Image Redactor
Mixed pages   → both pipelines run and results are merged
    ↓
Key Manager → encrypt token map → .privacyshield file
    ↓
Redacted PDF + Encryption Key (shown once to user)
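
As a rough illustration of the Analyzer step in the diagram above, the sketch below classifies a page from two numbers: how many characters its text layer holds and how many raster images it embeds. The PageStats type, the classify_page name, and the min_chars threshold are assumptions made for this example, not the project's actual code.

```python
from dataclasses import dataclass
from enum import Enum


class PageKind(Enum):
    TEXT = "text"
    SCANNED = "scanned"
    MIXED = "mixed"


@dataclass
class PageStats:
    """Hypothetical per-page summary produced by an analyzer step."""
    char_count: int   # characters found in the page's text layer
    image_count: int  # raster images embedded in the page


def classify_page(stats: PageStats, min_chars: int = 20) -> PageKind:
    """Classify a page from its text-layer size and embedded images.

    The min_chars threshold is an illustrative guess, not a tuned value.
    """
    has_text = stats.char_count >= min_chars
    has_images = stats.image_count > 0
    if has_text and has_images:
        return PageKind.MIXED
    if has_text:
        return PageKind.TEXT
    # No usable text layer: treat the page as scanned so OCR gets a chance.
    return PageKind.SCANNED
```

A page with a large text layer and no images would go down the text pipeline, while a page with almost no extractable characters falls through to OCR.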

How We Built It

NLP Layer — Presidio + spaCy

We used Microsoft Presidio as the NER backbone, extended with custom pattern recognizers for:

Swiss AHV/AVS numbers (756.XXXX.XXXX.XX)
IBANs with mod-97 checksum validation
UUIDs, tax IDs, RF creditor references
Policy numbers, invoice numbers, and any value following a label word like "number", "no.", or "#"
Multilingual support via spaCy models for DE, FR, IT, ES, EN
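
To make the IBAN checksum concrete, here is a minimal standalone sketch of the mod-97 check. It only verifies the general shape and the checksum; per-country length rules are left out, and the function name iban_is_valid is chosen for this example. A check like this could plausibly sit behind a pattern recognizer as a validation step, though how the project wires it in is not shown here.

```python
import re

# Two-letter country code, two check digits, then 11 to 30 alphanumerics.
IBAN_SHAPE = re.compile(r"^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$")


def iban_is_valid(candidate: str) -> bool:
    """Return True if the candidate passes the mod-97 checksum."""
    iban = re.sub(r"\s+", "", candidate).upper()
    if not IBAN_SHAPE.match(iban):
        return False
    # Move the country code and check digits to the end.
    rearranged = iban[4:] + iban[:4]
    # Replace letters with numbers: A=10, B=11, ..., Z=35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    # A valid IBAN leaves a remainder of exactly 1.
    return int(digits) % 97 == 1
```

Because any single-digit substitution changes the remainder, the checksum filters out most random digit strings that merely look like IBANs.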

We run NER line by line to prevent entities from spanning across newlines, then apply a post-processing pipeline to remove false positives: duration expressions like "30 days", company names, and label words like "Email" or "Phone" that get misdetected as person names.
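
The line-by-line approach can be sketched as follows. The detector is passed in as a plain function so the example stays self-contained; toy_detector is a regex stand-in for the real NER call, and the Span type and detect_per_line name are assumptions for illustration.

```python
import re
from typing import Callable, Iterable, NamedTuple


class Span(NamedTuple):
    start: int  # offset into the full document text
    end: int
    label: str


# A detector takes one line and yields (start, end, label) relative to it.
Detector = Callable[[str], Iterable[tuple[int, int, str]]]


def detect_per_line(text: str, detect: Detector) -> list[Span]:
    """Run `detect` on each line and shift its spans to document offsets."""
    spans: list[Span] = []
    offset = 0
    # keepends=True keeps the newline characters so offsets stay exact.
    for line in text.splitlines(keepends=True):
        content = line.rstrip("\r\n")
        for start, end, label in detect(content):
            spans.append(Span(offset + start, offset + end, label))
        offset += len(line)
    return spans


def toy_detector(line: str):
    """Stand-in for the real NER call: flags runs of capitalized words."""
    for m in re.finditer(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+", line):
        yield m.start(), m.end(), "PERSON"
```

Since each call only ever sees one line, an entity cannot swallow text from the next line, and slicing the original text with a returned span gives back exactly the detected string.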

PDF Layer — pdfplumber + PyMuPDF

pdfplumber extracts text with character-level coordinates. PyMuPDF then searches for each detected PII string and draws a permanent black redaction box over its bounding rectangle on the page, with a white [TOKEN_ID] label on top so reviewers know what was redacted without exposing the original value.
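
To show what character-level coordinates look like in practice, the sketch below merges per-character boxes into one rectangle for a detected span. Note that this is a different route from the one described above, which locates boxes through PyMuPDF's text search; it is included only to make the coordinate format concrete. The dictionaries mimic the x0, x1, top, and bottom keys that pdfplumber attaches to each character, and span_bbox is a name invented for this example.

```python
from typing import Sequence


def span_bbox(
    chars: Sequence[dict], start: int, end: int
) -> tuple[float, float, float, float]:
    """Bounding box (x0, top, x1, bottom) covering chars[start:end].

    Each char is assumed to carry "x0", "x1", "top" and "bottom" keys,
    with top and bottom measured downward from the top of the page.
    """
    selected = chars[start:end]
    if not selected:
        raise ValueError("empty character span")
    return (
        min(c["x0"] for c in selected),
        min(c["top"] for c in selected),
        max(c["x1"] for c in selected),
        max(c["bottom"] for c in selected),
    )
```

A span that wraps onto a second line would need one rectangle per line; this sketch assumes the span sits on a single line.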

Image Layer — pypdfium2 + PaddleOCR + PIL

For scanned pages, pypdfium2 converts each page to a high-resolution PIL image. PaddleOCR then extracts text with pixel-level bounding boxes. PIL draws black boxes directly on the image layer over detected PII regions, with token labels drawn in white text on top of each box.
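
OCR engines commonly report each text line as a four-point polygon rather than an upright rectangle. Assuming that shape for the OCR output, the sketch below turns a polygon into a padded, clamped integer box of the kind an image drawing call can fill. The quad_to_box name and the default padding are illustrative choices, not values from the project.

```python
import math
from typing import Sequence

Point = Sequence[float]


def quad_to_box(
    quad: Sequence[Point], image_size: tuple[int, int], pad: int = 2
) -> tuple[int, int, int, int]:
    """Convert a four-point polygon to (left, top, right, bottom) pixels.

    The box is grown by `pad` pixels so glyph edges are fully covered,
    then clamped to the image bounds.
    """
    width, height = image_size
    xs = [p[0] for p in quad]
    ys = [p[1] for p in quad]
    left = max(0, math.floor(min(xs)) - pad)
    top = max(0, math.floor(min(ys)) - pad)
    right = min(width - 1, math.ceil(max(xs)) + pad)
    bottom = min(height - 1, math.ceil(max(ys)) + pad)
    return left, top, right, bottom
```

Rounding outward and padding errs on the side of covering slightly too much, which is the safer failure mode for redaction.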

Encryption Layer — Fernet

The token map ({"NAME_1": "John Smith", "SSN_1": "123-45-6789"}) is serialized to JSON, encrypted using Fernet symmetric encryption, and saved as a .privacyshield file. The encryption key is shown to the user exactly once and never stored — following the principle that only the document owner can decrypt their own data.

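Fernet combines AES-128 in CBC mode for confidentiality with HMAC-SHA256 for authentication. The 32-byte key $K$ is split into a signing half $K_s$ and an encryption half $K_e$, and the authentication tag $T$ covers the version byte, timestamp, and IV as well as the ciphertext:

$$C = \text{AES}_{128\text{-CBC}}(K_e, IV, M), \qquad T = \text{HMAC-SHA256}(K_s,\ \text{version} \,\|\, \text{timestamp} \,\|\, IV \,\|\, C)$$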
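
A minimal round trip with the cryptography package's Fernet class might look like the sketch below. The function names encrypt_token_map and decrypt_token_map are invented for this example, and writing the blob to a .privacyshield file is left out.

```python
import json

from cryptography.fernet import Fernet, InvalidToken


def encrypt_token_map(token_map: dict[str, str]) -> tuple[bytes, bytes]:
    """Return (key, encrypted_blob); the key is generated fresh each time."""
    key = Fernet.generate_key()
    payload = json.dumps(token_map, ensure_ascii=False).encode("utf-8")
    return key, Fernet(key).encrypt(payload)


def decrypt_token_map(key: bytes, blob: bytes) -> dict[str, str]:
    """Decrypt and parse the token map; raises InvalidToken on a bad key."""
    plaintext = Fernet(key).decrypt(blob)
    return json.loads(plaintext.decode("utf-8"))
```

Because the token is authenticated, a wrong key or a tampered blob fails with InvalidToken instead of returning garbage.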

API Layer — FastAPI

A FastAPI backend exposes two core endpoints:

POST /redact — accepts a PDF, runs the full pipeline, returns redacted PDF + encryption key
POST /unredact — accepts redacted PDF + key, restores the original document

The restore flow embeds the encrypted original PDF bytes directly into the redacted file using a payload marker, so the user only needs to keep two things: the redacted PDF and their key.
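
One way such a payload marker could work is to append the encrypted bytes after the end of the PDF as a base64-encoded comment line. The marker string and both function names below are hypothetical, and the sketch assumes that readers tolerate trailing bytes after the final %%EOF, which should be verified against the viewers you care about.

```python
import base64
from typing import Optional

# Hypothetical marker; the leading "%" makes the line a PDF comment.
MARKER = b"\n%PRIVACYSHIELD-PAYLOAD:"


def embed_payload(redacted_pdf: bytes, encrypted: bytes) -> bytes:
    """Append the encrypted payload to the redacted PDF bytes."""
    return redacted_pdf + MARKER + base64.b64encode(encrypted) + b"\n"


def extract_payload(pdf_bytes: bytes) -> Optional[bytes]:
    """Return the embedded payload, or None if no marker is present."""
    index = pdf_bytes.rfind(MARKER)
    if index == -1:
        return None
    encoded = pdf_bytes[index + len(MARKER):].strip()
    return base64.b64decode(encoded, validate=True)
```

Base64 keeps the payload free of newline and percent characters, so the marker cannot occur inside the payload itself.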

Challenges We Faced

1. Coordinate System Mismatch

pdfplumber and PyMuPDF both work with PDF pages, but neither reports bounding boxes in the native PDF coordinate space, whose origin is the bottom-left corner. We spent significant time debugging why black boxes appeared in the wrong position before discovering that both libraries measure y from the top of the page, so no coordinate flipping was needed between them.

2. NER Spanning Across Newlines

Presidio's analyzer treated multi-line text as a single string, causing entities like "John Smith\nSSN" to be detected as a single PERSON. We solved this by running NER line-by-line and tracking character offsets to map results back to the original document.

3. False Positives at Scale

Testing on 100 synthetic documents revealed numerous false positives: "Email" detected as a person name, "30 days" as a date, company names flagged as organizations to redact, and the SWIFT/BIC pattern matching ordinary words like "Paziente". We built a multi-layer false positive filter, along with a context-based number detector that catches any value following label words like "policy number" or "invoice #".
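
The two ideas in this section can be sketched with plain regular expressions. The word lists, the patterns, and the function names are illustrative guesses rather than the project's actual rules; PERSON and DATE_TIME follow Presidio's entity naming.

```python
import re

# Label words that should never be treated as a person's name.
LABEL_WORDS = {"email", "e-mail", "phone", "name", "address"}
DURATION = re.compile(r"^\d+\s+(?:day|week|month|year)s?$", re.IGNORECASE)

# A label word, an optional colon, then a value containing at least one digit.
LABELED_VALUE = re.compile(
    r"(?:\bnumber|\bno\.|#)\s*:?\s*"
    r"(?P<value>(?=[A-Za-z0-9/-]*\d)[A-Za-z0-9][A-Za-z0-9/-]{3,})",
    re.IGNORECASE,
)


def keep_finding(entity_type: str, text: str) -> bool:
    """Return False for findings that match a known false-positive shape."""
    value = text.strip()
    if entity_type == "PERSON" and value.lower().rstrip(":") in LABEL_WORDS:
        return False
    if entity_type == "DATE_TIME" and DURATION.match(value):
        return False
    return True


def find_labeled_values(text: str) -> list[str]:
    """Collect values that directly follow a label such as 'number' or '#'."""
    return [m.group("value") for m in LABELED_VALUE.finditer(text)]
```

Requiring a digit in the captured value keeps phrases like "number of days" from being picked up as identifiers.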

4. Multilingual PII Formats

Each European country has its own ID formats: Swiss AHV numbers, RF creditor references, IBANs in 30+ country formats. We built custom recognizers for each and scoped them by language through the recognizer's supported_language parameter to prevent cross-language false positives.
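
As one example of a format-specific recognizer, the sketch below matches the Swiss AHV layout 756.XXXX.XXXX.XX. The check-digit step is an addition for illustration and is not mentioned in this writeup: AHV numbers follow the EAN-13 scheme, so the last digit can be verified. Language scoping is omitted, and the function names are invented for this example.

```python
import re

AHV_PATTERN = re.compile(r"\b756\.\d{4}\.\d{4}\.\d{2}\b")


def ahv_check_digit_ok(number: str) -> bool:
    """Verify the EAN-13 check digit of a 13-digit AHV number."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) != 13:
        return False
    # EAN-13 weights alternate 1, 3, 1, 3, ... over the first 12 digits.
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == digits[12]


def find_ahv_numbers(text: str) -> list[str]:
    """Return AHV-shaped strings whose check digit is valid."""
    return [
        m.group()
        for m in AHV_PATTERN.finditer(text)
        if ahv_check_digit_ok(m.group())
    ]
```

Combining the fixed 756 prefix with the check digit makes accidental matches on other dotted numbers unlikely.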

5. Windows Compatibility for OCR

pypdfium2 was used instead of pdf2image because poppler — a dependency of pdf2image — is not natively available on Windows. pypdfium2 bundles its own PDF rendering engine and works cross-platform without any system-level dependencies.

6. Package Size and Deployment

PaddleOCR adds ~2.5 GB to the deployment footprint, which ruled out cloud deployment on free tiers. We worked around this by running the application locally with an ngrok tunnel for demonstration, and by structuring the code so PaddleOCR is initialized lazily, only when a scanned page is actually encountered.
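
Lazy initialization of a heavy dependency can be sketched with a small wrapper that calls its factory on first use only. The Lazy class is written for this example, and build_ocr_engine is a hypothetical stand-in for wherever the real OCR engine would be constructed.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class Lazy(Generic[T]):
    """Build an expensive object on first use and reuse it afterwards."""

    def __init__(self, factory: Callable[[], T]) -> None:
        self._factory = factory
        self._value: Optional[T] = None
        self._ready = False
        self._lock = threading.Lock()

    def get(self) -> T:
        if not self._ready:
            # Double-checked locking so concurrent requests build it once.
            with self._lock:
                if not self._ready:
                    self._value = self._factory()
                    self._ready = True
        return self._value  # type: ignore[return-value]


def build_ocr_engine() -> object:
    # Hypothetical stand-in; the real project would construct PaddleOCR here.
    return object()


ocr_engine = Lazy(build_ocr_engine)


def process_page(kind: str) -> str:
    """Only scanned pages touch the OCR engine, triggering its creation."""
    if kind == "scanned":
        ocr_engine.get()
        return "ocr"
    return "text"
```

With this structure, a run over text-only PDFs never pays the OCR startup cost.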

What We Learned

Building privacy-preserving tools requires thinking adversarially — what can an attacker infer from a redacted document alone?
NER models trained on news corpora behave very differently on structured documents like tax forms and insurance policies
Coordinate systems in PDF processing are surprisingly inconsistent across libraries
Context is everything — the same string ("Miller") may or may not be PII depending on what surrounds it
Mixed PDFs (containing both text layers and embedded scanned images) require two separate redaction pipelines running in parallel

What's Next

Surname-only redaction — "James Miller" → "James [NAME_1]" per GDPR minimization principles
Face detection — blur profile photos and ID card photos using OpenCV
Browser extension — redact PDFs directly in the browser before download
Audit trail — cryptographically signed redaction log showing what was redacted, when, and by whom
Lightweight deployment — replace PaddleOCR with a lighter OCR engine to enable cloud hosting on free tiers