Inspiration

The starting point was a simple frustration: using Claude to help prepare a tax statement meant handing over the documents themselves. Every time you share a file with an AI model, whether through a chat interface or a coding assistant like Claude Code, the raw content, including all personal data, is transmitted to the model provider's servers.

Existing tools to strip sensitive data before sharing turned out to be either complex enterprise software or cloud-based services that missed the point entirely. Nothing was simple, local, and reversible. So PII Bye Bye was built.

What it does

PII Bye Bye redacts personal information from PDFs entirely on your device. No data leaves your machine.

Upload a PDF, review what the tool detected, and download a clean redacted copy that is safe to share with any AI service. Every detected field is replaced by a typed token like [NAME_1] or [IBAN_1]. A key file encrypted with AES-256-GCM lets you restore the original values at any time with your password, so only you can unlock it.
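The typed-token scheme can be sketched in a few lines. This is a minimal illustration, not the project's actual tokeniser: the class name, method name, and attribute names are hypothetical, and it assumes that a repeated value should reuse the token it was given the first time.

```python
from collections import defaultdict


class Tokeniser:
    """Hypothetical sketch: maps detected values to typed tokens such as [NAME_1]."""

    def __init__(self):
        self._counters = defaultdict(int)  # entity type -> last index issued
        self._by_value = {}                # (entity type, value) -> token
        self.mapping = {}                  # token -> original value (what a key file would store)

    def token_for(self, entity_type, value):
        key = (entity_type, value.strip())
        # Deduplication: a value seen before gets the token it already has.
        if key in self._by_value:
            return self._by_value[key]
        self._counters[entity_type] += 1
        token = f"[{entity_type}_{self._counters[entity_type]}]"
        self._by_value[key] = token
        self.mapping[token] = value.strip()
        return token


tokeniser = Tokeniser()
print(tokeniser.token_for("NAME", "Anna Muster"))    # [NAME_1]
print(tokeniser.token_for("NAME", "Beat Beispiel"))  # [NAME_2]
print(tokeniser.token_for("NAME", "Anna Muster"))    # [NAME_1] again
```

Counting per entity type keeps tokens readable: a model that sees [NAME_1] and [IBAN_1] can still reason about what kind of field was removed.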

How we built it

Built bottom-up, one focused component at a time.

  • The extractor pulls character-level bounding boxes from PDFs using PyMuPDF, with an OCR fallback stub for image-based pages.
  • The detector runs Presidio with spaCy NER, extended with custom recognisers for Swiss-specific identifiers: AHV/AVS numbers, insurance numbers, patient IDs, and ICD-10 diagnosis codes.
  • The tokeniser maps each detected entity to a typed token ([NAME_1], [IBAN_1]), with deduplication so the same value always gets the same token.
  • The reviewer presents a detection summary table and asks for confirmation before anything is changed.
  • The redactor draws black boxes at the detected coordinates and prints the token text on top of them.
  • The keystore encrypts the token-to-value mapping with AES-256-GCM, using a key derived from the password with PBKDF2.
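The key-derivation half of the keystore can be illustrated with the standard library alone. The sketch below is an assumption-laden example rather than the project's code: the function name, the SHA-256 hash, the 16-byte salt, and the iteration count are all chosen for illustration. The AES-256-GCM encryption step is left out because it needs a third-party cryptography library.

```python
import hashlib
import os


def derive_key(password: str, salt: bytes, iterations: int = 600_000) -> bytes:
    """Derive a 32-byte key (the size AES-256 needs) from a password with PBKDF2.

    Hash choice and iteration count are assumptions for this sketch.
    """
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations, dklen=32
    )


# A fresh random salt per key file; it is not secret and can be stored
# next to the ciphertext so the same key can be derived again later.
salt = os.urandom(16)
key = derive_key("correct horse battery staple", salt)
print(len(key))  # 32
```

The derived key would then be passed to an AES-256-GCM implementation together with a unique nonce for each encryption.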

A significant challenge was unredaction fidelity. The first version reconstructed pages by re-rendering token text — but font metrics rarely match perfectly. This was reworked to capture the original PDF content streams before redaction and store them base64-encoded in the key file. Unredaction now restores pages byte-for-byte.
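As a rough sketch of why this round trip is lossless, the example below packs raw content streams into a JSON payload and recovers them unchanged. The function names and the "pages" field are hypothetical, and reading the streams out of the PDF (for instance with PyMuPDF) is assumed to have happened already.

```python
import base64
import json


def pack_streams(streams):
    """Encode raw per-page content streams so they fit in a JSON key file."""
    return json.dumps(
        {"pages": [base64.b64encode(s).decode("ascii") for s in streams]}
    )


def unpack_streams(payload):
    """Recover the exact original bytes of every page stream."""
    return [base64.b64decode(s) for s in json.loads(payload)["pages"]]


original = [b"BT /F1 12 Tf (Anna Muster) Tj ET", bytes(range(256))]
restored = unpack_streams(pack_streams(original))
print(restored == original)  # True
```

Because base64 is a reversible encoding of arbitrary bytes, nothing about fonts or layout has to be reconstructed on the way back.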

Date-of-birth detection required its own dedicated pass. Presidio's generic DATE_TIME entity catches too much — payment dates, document dates, everything. A hybrid approach replaced it: regex patterns for known DOB label formats (born, né/née, Geburtsdatum, etc.), with a spaCy embedding similarity fallback for unlabelled dates. This reuses the model already loaded by Presidio to avoid a second heavyweight load.
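The labelled half of that hybrid could look roughly like the sketch below. It only covers the labels named above; the accepted date formats and the optional connecting words (le, am, on) are assumptions, and the embedding similarity fallback is not shown.

```python
import re

# Labels taken from the text above; the real pattern list is presumably longer.
DOB_LABEL = r"(?:born|née|né|Geburtsdatum)"
# Assumed date shapes: 12.03.1985, 4/7/1990, 1985-03-12.
DATE = r"(?:\d{1,2}[./-]\d{1,2}[./-]\d{2,4}|\d{4}-\d{2}-\d{2})"

DOB_PATTERN = re.compile(
    rf"\b{DOB_LABEL}\b\s*(?:le|am|on)?\s*:?\s*({DATE})",
    re.IGNORECASE,
)


def find_labelled_dobs(text):
    """Return only the dates that directly follow a date-of-birth label."""
    return [m.group(1) for m in DOB_PATTERN.finditer(text)]


sample = "Geburtsdatum: 12.03.1985\nZahlungsdatum: 01.02.2024\nnée le 02.11.1978"
print(find_labelled_dobs(sample))  # ['12.03.1985', '02.11.1978']
```

The payment date is ignored because it has no date-of-birth label in front of it, which is exactly the false positive a generic date entity would produce.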

The web UI was added with Gradio — a few lines of code for a UI non-technical users can navigate. The CLI uses Click. Language support for German, French, Italian, and Spanish was added last, packaged as optional install extras so users only download the spaCy models they need.

Challenges we ran into

Coordinate precision. When Presidio detects a multi-word entity spanning multiple PDF text spans, stitching those spans back to pixel-accurate rectangles without over- or under-redacting required careful work in the extractor and redactor.
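One way to picture the stitching step: take the boxes of every character that belongs to an entity and union them into one rectangle per text line. The sketch below is an illustration under assumptions, not the project's algorithm; the function name, the (x0, y0, x1, y1) tuple layout, and the line tolerance are hypothetical.

```python
def merge_boxes(boxes, line_tolerance=2.0):
    """Union character boxes (x0, y0, x1, y1) into one rectangle per text line.

    Boxes whose top edges lie within line_tolerance of each other are
    treated as the same line.
    """
    lines = []
    for x0, y0, x1, y1 in sorted(boxes, key=lambda b: (b[1], b[0])):
        for line in lines:
            if abs(line[1] - y0) <= line_tolerance:
                # Same line: grow the rectangle to cover this box too.
                line[0] = min(line[0], x0)
                line[1] = min(line[1], y0)
                line[2] = max(line[2], x1)
                line[3] = max(line[3], y1)
                break
        else:
            # No existing line is close enough: start a new rectangle.
            lines.append([x0, y0, x1, y1])
    return [tuple(line) for line in lines]


# An entity that wraps onto a second line yields two rectangles, not one big one.
boxes = [(10, 100, 20, 112), (20, 100, 30, 112), (10, 115, 25, 127)]
print(merge_boxes(boxes))  # [(10, 100, 30, 112), (10, 115, 25, 127)]
```

Keeping one rectangle per line avoids the over-redaction that a single bounding box around a wrapped entity would cause.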

Date-of-birth vs. other dates. Generic date detection produces too many false positives. The hybrid regex + embedding approach brought precision up significantly, but edge cases — especially unlabelled dates in tables — remain the hardest category.

Byte-identical unredaction. The naive approach of re-rendering text from bounding boxes failed because font metrics differ. Storing and restoring original content streams solved this but required a v2 key file format with backwards-compatible fallback for v1 files.
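A backwards-compatible loader can be as simple as a version check. The sketch below is hypothetical: the "version" and "pages" field names and the two strategy labels are assumptions about what such a key file might contain, not the project's actual format.

```python
def choose_restore_strategy(key_file: dict) -> str:
    """Pick how to unredact, based on what the decrypted key file contains."""
    # Assumption: files written before versioning existed carry no version field.
    version = key_file.get("version", 1)
    if version >= 2 and "pages" in key_file:
        return "restore_content_streams"  # byte-identical path
    return "rerender_from_tokens"         # v1 fallback, approximate layout


print(choose_restore_strategy({"version": 2, "pages": ["..."], "tokens": {}}))
print(choose_restore_strategy({"tokens": {"[NAME_1]": "Anna Muster"}}))
```

Treating a missing version field as v1 means old key files keep working without any migration step.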

Packaging multilingual models. Each spaCy language model is 50–700 MB. Making these optional install extras (pip install "piibyebye[de,fr]") without complicating the default install path took iteration on pyproject.toml and the first-run auto-download hook.

Accomplishments that we're proud of

  • Zero data transmission — the entire pipeline runs locally with no outbound calls
  • Byte-identical unredaction — original documents are restored exactly, not approximated
  • Swiss-specific PII coverage out of the box — AHV/AVS, Swiss IBANs, insurance numbers
  • Claude Code hook integration — PDFs are automatically redacted before the model ever reads them
  • Installable in one command: pip install piibyebye && pii web
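To give a flavour of what the Swiss-specific coverage listed above involves, here is a standalone sketch of AHV/AVS number detection. It relies on the fact that these 13-digit numbers start with 756 and end in an EAN-13 check digit. It uses only the standard library and is not the project's Presidio recogniser; the function names and the regex are illustrative.

```python
import re

# 756.XXXX.XXXX.XX, with the dots treated as optional.
AHV_PATTERN = re.compile(r"\b756\.?\d{4}\.?\d{4}\.?\d{2}\b")


def ahv_checksum_ok(candidate: str) -> bool:
    """Validate the EAN-13 check digit of a 13-digit AHV/AVS number."""
    digits = [int(c) for c in candidate if c.isdigit()]
    if len(digits) != 13:
        return False
    # EAN-13 weights alternate 1, 3, 1, 3, ... over the first twelve digits.
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == digits[12]


def find_ahv_numbers(text):
    """Return pattern matches whose check digit is valid."""
    return [
        m.group(0)
        for m in AHV_PATTERN.finditer(text)
        if ahv_checksum_ok(m.group(0))
    ]


print(find_ahv_numbers("AHV-Nr. 756.9217.0769.85, Ref. 756.9217.0769.84"))
# ['756.9217.0769.85']
```

Checking the digit as well as the shape filters out reference numbers that merely look like an AHV number, which helps precision without lowering recall.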

What we learned

Detection is a precision/recall trade-off with no perfect answer. Every threshold decision is a judgement call about which failure mode is worse: leaking a field or annoying the user with a false positive.

The hardest ongoing tension is convenience vs. capability. spaCy language models are 50–700 MB each — essential for good detection quality, but a significant download for a user who just wants to try the tool. Optional install extras addressed this, but the same tension will return with OCR support: Tesseract models and image processing libraries add hundreds of megabytes more. Getting that trade-off right — keeping the tool approachable without sacrificing coverage — is a design constraint that will keep coming back.

What's next for PII Bye Bye

  • Scanned document support via OCR (the extractor already has a fallback stub)
  • AI tool support beyond Claude Code (Gemini CLI, Codex, etc.)

Built With

  • aes-256-gcm
  • click
  • gradio
  • presidio
  • pymupdf
  • python
  • spacy