Inspiration
We kept seeing the same headline: company leaks sensitive data from a "redacted" PDF because someone drew a black box over the text in Acrobat and called it a day. The text was still there. Select-all, copy, paste — boom, full names, social security numbers, medical records, all of it. It's 2026 and we're still doing this.
We're based in Switzerland, where privacy isn't just a nice-to-have — it's the law. GDPR, FADP, sector-specific regulations in healthcare and insurance. People handle sensitive documents every day — HR departments, law firms, clinics — and the tools available are either expensive enterprise software, cloud services that upload your documents to who-knows-where, or the Acrobat black-box trick that doesn't actually work.
We wanted something that runs on your machine, actually removes the data, and doesn't require a procurement process to use.
What it does
Redactly detects and removes personal information from PDFs — for real. Not "draw a rectangle over it" real. Content-stream-level removal real.
Drop in a PDF. Redactly extracts the text with per-word bounding boxes, OCRs any embedded images, then runs two detection passes: regex patterns for structured data (IBANs, Swiss AHV/AVS numbers, emails, phones, credit cards, SSNs) and a multilingual BERT NER model for names, locations, and organizations across five languages. A priority merge system resolves overlapping detections — so a Swiss social security number never gets misclassified as a phone number.
You review everything in the UI. Toggle entities on or off, see confidence scores, flip through page previews. When you're satisfied, hit redact. You get back a clean PDF and an AES-256-GCM encrypted .gocalma key file. The key file lets you — and only you — recover the original values later if you need them.
Everything stays on localhost. No cloud. No telemetry. No trust required.
How we built it
Backend: FastAPI serving the API and static files. PyMuPDF does the heavy lifting — text extraction with word-level bounding boxes, true content-stream redaction via add_redact_annot + apply_redactions, and page rendering for previews.
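As a rough illustration of the two PyMuPDF calls named above, here is a minimal sketch, not Redactly's actual code. The function names, the `boxes_by_page` shape, and the save options are assumptions for the example; only `add_redact_annot`, `apply_redactions`, `fitz.open`, `fitz.Rect`, and `save` are PyMuPDF API.

```python
from typing import Dict, Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def apply_redaction_boxes(doc, boxes_by_page: Dict[int, Iterable[Box]]) -> int:
    """Mark every box for redaction, then apply page by page.

    `doc` is expected to be an opened PyMuPDF document (indexable by page
    number). Returns the number of boxes that were marked.
    """
    marked = 0
    for page_no, boxes in boxes_by_page.items():
        page = doc[page_no]
        for box in boxes:
            # Only marks the area; nothing is removed until apply_redactions().
            page.add_redact_annot(box, fill=(0, 0, 0))
            marked += 1
        # Removes the text under the marked areas from the page content.
        page.apply_redactions()
    return marked


def redact_file(src: str, dst: str, boxes_by_page: Dict[int, Iterable[Box]]) -> int:
    """Hypothetical convenience wrapper: open, redact, save a cleaned copy."""
    import fitz  # PyMuPDF, imported lazily so the helper above needs no install

    doc = fitz.open(src)
    try:
        rects = {p: [fitz.Rect(b) for b in boxes] for p, boxes in boxes_by_page.items()}
        count = apply_redaction_boxes(doc, rects)
        # garbage=4 drops unreferenced objects; deflate=True compresses streams.
        doc.save(dst, garbage=4, deflate=True)
    finally:
        doc.close()
    return count
```

Keeping the marking loop separate from file handling means it can be exercised against any object that offers the same two page methods.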
Detection: Two-pass pipeline. Pass one is regex: instant, deterministic pattern matching with validation (Luhn checksums for credit cards, check digits for IBANs, format validation for AHV numbers). Pass two is Davlan/bert-base-multilingual-cased-ner-hrl from HuggingFace, a multilingual BERT model fine-tuned for named entity recognition in 10 high-resource languages. Text gets chunked at 512 tokens with proper word-boundary splits so nothing falls between the cracks.
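The validation step in pass one could look something like the sketch below. It uses the standard Luhn and IBAN mod-97 checks; the function names, the length limits, and the exact AHV pattern are assumptions for illustration rather than Redactly's real rules.

```python
import re

# Assumed AHV layout: 756.XXXX.XXXX.XX (format check only, no check digit).
AHV_FORMAT = re.compile(r"^756\.\d{4}\.\d{4}\.\d{2}$")


def luhn_valid(number: str) -> bool:
    """Luhn checksum for card numbers; spaces and hyphens are ignored."""
    s = re.sub(r"[\s-]", "", number)
    if not (s.isascii() and s.isdigit()) or not 12 <= len(s) <= 19:
        return False
    total = 0
    for i, ch in enumerate(reversed(s)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """IBAN mod-97 check: move the first four characters to the end,
    map letters to numbers (A=10 ... Z=35), and expect a remainder of 1."""
    compact = re.sub(r"\s+", "", iban).upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", compact):
        return False
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


def ahv_format_ok(value: str) -> bool:
    """Format-only check for a Swiss AHV/AVS number."""
    return AHV_FORMAT.match(value) is not None
```

Running a checksum after the regex match is what lets a pattern pass discard look-alike digit runs without involving the NER model.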
The tricky part was coordinate mapping. NER gives you character offsets in plain text. PDF redaction needs bounding boxes. During extraction we build a char_offset → word_index map, then union the bounding boxes of matched words to get the precise rectangle to redact.
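A minimal sketch of that mapping, assuming the extracted words are joined with single spaces and each word carries an (x0, y0, x1, y1) box. The function names and data shapes here are hypothetical, not the project's own.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Word = Tuple[str, Box]


def build_text_and_map(words: Sequence[Word]) -> Tuple[str, List[int]]:
    """Return the plain text plus a char_offset -> word_index list.

    Separator characters map to -1 because they belong to no word.
    """
    parts: List[str] = []
    char_to_word: List[int] = []
    for idx, (word, _box) in enumerate(words):
        if idx > 0:
            parts.append(" ")
            char_to_word.append(-1)
        parts.append(word)
        char_to_word.extend([idx] * len(word))
    return "".join(parts), char_to_word


def span_to_box(start: int, end: int, char_to_word: Sequence[int],
                words: Sequence[Word]) -> Optional[Box]:
    """Union the boxes of every word touched by the character span [start, end)."""
    indices = sorted({i for i in char_to_word[start:end] if i >= 0})
    if not indices:
        return None
    boxes = [words[i][1] for i in indices]
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )
```

A span that wraps onto a second line would produce one tall rectangle with this simple union, so a fuller version would probably group the matched words by line first.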
Image redaction: Embedded images get extracted by xref, OCR'd with Tesseract, and PII regions get painted over with Pillow at the pixel level before the image is replaced back into the PDF.
Encryption: The key file is 16 bytes salt + 12 bytes nonce + ciphertext. Key derivation is PBKDF2-HMAC-SHA256 with 100k iterations. The plaintext contains a document hash, version tag, and the full redaction map.
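To make the layout concrete, here is a hedged sketch that splits a key file into its three parts and derives the AES-256 key with the standard library. It assumes the key is derived from a user passphrase and that the GCM tag travels inside the ciphertext bytes; neither detail is stated above. The AES-GCM decryption itself is left out because it needs a third-party library.

```python
import hashlib
from typing import NamedTuple

SALT_LEN = 16                 # bytes, per the layout described above
NONCE_LEN = 12                # bytes, per the layout described above
PBKDF2_ITERATIONS = 100_000
KEY_LEN = 32                  # 256-bit key for AES-256


class KeyFileParts(NamedTuple):
    salt: bytes
    nonce: bytes
    ciphertext: bytes  # assumed to include the GCM authentication tag


def split_key_file(blob: bytes) -> KeyFileParts:
    """Split salt + nonce + ciphertext out of the raw key file bytes."""
    header = SALT_LEN + NONCE_LEN
    if len(blob) <= header:
        raise ValueError("key file too short")
    return KeyFileParts(blob[:SALT_LEN], blob[SALT_LEN:header], blob[header:])


def derive_key(passphrase: str, salt: bytes) -> bytes:
    """PBKDF2-HMAC-SHA256 with 100k iterations, producing a 32-byte key."""
    return hashlib.pbkdf2_hmac(
        "sha256", passphrase.encode("utf-8"), salt, PBKDF2_ITERATIONS, dklen=KEY_LEN
    )
```

Storing the salt and nonce in the clear next to the ciphertext is conventional: they need to be unique, not secret, and the decrypting side needs both.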
Frontend: Single-page HTML with Tailwind CSS and vanilla JS. No framework, no build step. Dark-themed UI with real-time page previews and entity badges.
Optional LLM: You can swap the BERT model for any OpenAI-compatible endpoint — Ollama, LM Studio, OpenAI — via the settings gear. The UI shows a privacy warning if the URL points somewhere non-local. Regex always runs regardless.
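The non-local check behind that warning could be as small as the sketch below. It is written in Python to match the backend, although the real check may live in the frontend JavaScript. Treating only `localhost` and loopback addresses as local is an assumption; the function name is hypothetical.

```python
import ipaddress
from urllib.parse import urlparse


def is_local_endpoint(url: str) -> bool:
    """True when the URL's host is localhost or a loopback IP address."""
    host = urlparse(url).hostname  # lowercased, brackets stripped for IPv6
    if not host:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Any other DNS name is treated as remote without resolving it.
        return False
```

Under this rule a LAN address such as 192.168.x.x would still trigger the warning, which errs on the side of caution.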
Challenges we ran into
Coordinate mapping was a nightmare. NLP models think in characters and tokens. PDFs think in page coordinates and content streams. Bridging those two worlds — building the char-offset-to-bounding-box pipeline and getting it to work reliably across different PDF layouts, fonts, and encodings — took the most iteration by far.
Scanned PDFs are a different beast. A "normal" PDF has a text layer you can extract. A scanned PDF is just a pile of images. We had to build a separate pipeline: extract embedded images, OCR them with Tesseract, get per-word bounding boxes in pixel space, then transform those coordinates back to PDF page points using the image's placement matrix. Two completely different redaction paths (PyMuPDF content-stream removal vs. Pillow pixel painting) that need to produce consistent results.
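The pixel-to-page step can be sketched as a simple scale and offset. This version assumes the image is placed axis-aligned, with no rotation or skew, so the placement is just a rectangle on the page; a full placement matrix would also need to handle those cases. The names are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def pixel_box_to_page(box_px: Box, image_size_px: Tuple[int, int],
                      placement: Box) -> Box:
    """Map an OCR box in image pixels to PDF page points.

    `placement` is the rectangle, in page points, that the image occupies.
    """
    img_w, img_h = image_size_px
    if img_w <= 0 or img_h <= 0:
        raise ValueError("image size must be positive")
    px0, py0, px1, py1 = placement
    sx = (px1 - px0) / img_w  # page points per pixel, horizontally
    sy = (py1 - py0) / img_h  # page points per pixel, vertically
    x0, y0, x1, y1 = box_px
    return (px0 + x0 * sx, py0 + y0 * sy, px0 + x1 * sx, py0 + y1 * sy)
```

For example, a 1000×2000 pixel scan placed in a 500×1000 point rectangle gives a scale of 0.5 on both axes, so every pixel coordinate is halved and then shifted by the rectangle's top-left corner.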
Overlapping detections. Regex and NER don't know about each other. An AHV number like 756.1234.5678.90 might get flagged as a phone number by one pass and correctly identified by the other. We built a priority system (specific identifiers rank higher than generic entity types) with confidence-based tiebreaking to merge results cleanly.
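One way to express such a merge is a greedy pass over detections ranked by priority and then confidence. The sketch below is illustrative only: the priority values, label names, and the `Detection` shape are assumptions, not the project's real table.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Detection:
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str
    score: float


# Hypothetical priorities: specific identifiers outrank generic entity types.
PRIORITY: Dict[str, int] = {
    "AHV": 100, "IBAN": 100, "CREDIT_CARD": 90, "EMAIL": 80,
    "PHONE": 50, "PERSON": 30, "LOCATION": 30, "ORGANIZATION": 30,
}


def merge_detections(detections: List[Detection]) -> List[Detection]:
    """Keep the best detection for each overlapping group.

    Candidates are ranked by priority, then confidence; each one is kept
    only if it does not overlap anything already kept.
    """
    ranked = sorted(
        detections,
        key=lambda d: (-PRIORITY.get(d.label, 0), -d.score, d.start),
    )
    kept: List[Detection] = []
    for cand in ranked:
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda d: d.start)
```

With this ordering an AHV match always beats a PHONE match on the same span, while two equally ranked labels such as PERSON and LOCATION fall back to the higher confidence score.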
Model size vs. startup time. The multilingual BERT model is ~680 MB. First launch downloads it from HuggingFace, which is a rough first impression. We cache it locally so subsequent starts are fast, but that initial wait is something we want to improve.
Accomplishments that we're proud of
It actually redacts. Not cosmetically — at the content stream level. page.apply_redactions() removes the underlying text data from the PDF. You can't copy-paste it back. You can't extract it with a script. It's gone.
Reversible by design. Most redaction tools are one-way. We encrypt the original values into a key file that only the document owner can decrypt. You get the safety of redaction with the flexibility of recovery — without trusting a third party to hold your secrets.
Zero network dependencies for core functionality. After the one-time model download on first launch, nothing needs the network: no API keys, no cloud account, and no data leaves your machine. Install it, run it, done.
Multilingual out of the box. Swiss documents mix German, French, Italian, and English — sometimes in the same paragraph. The BERT model handles all of them without configuration.
The priority merge system. It sounds small, but getting overlapping detections to resolve correctly — so that a Swiss social security number is always labeled as an AHV number and never eaten by a phone number pattern — makes the difference between a tool people trust and one they don't.
What we learned
Building a "simple" redaction tool is deceptively complex. PDFs are not text files — they're page description programs with content streams, embedded fonts, image xrefs, and coordinate systems that vary per page. Getting NLP output to map cleanly onto that world required understanding both sides deeply.
We also learned that privacy tools have a higher trust bar. If your redaction tool sends data to a server, or if the redacted text is still recoverable, or if the encryption is weak — you've made the problem worse, not better. Every design decision had to be evaluated through that lens.
On the ML side: a fine-tuned multilingual BERT model is surprisingly capable for NER, but regex still catches things ML misses (and vice versa). The hybrid approach — deterministic patterns plus learned models, merged with explicit priority rules — turned out to be significantly more reliable than either approach alone.
What's next for Redactly
- Batch processing — drop a folder of PDFs and redact them all with consistent settings
- Custom PII patterns — let users define their own regex rules for domain-specific identifiers (patient IDs, case numbers, internal codes)
- Full document restoration — right now the key file lets you view original values; we want to support reconstructing the original PDF entirely
- Smaller/faster models — explore distilled NER models to cut the 680 MB download and speed up inference, especially on machines without a GPU
- Browser extension — redact PDFs directly in the browser before downloading, for people who don't want to install anything
- Audit trail — generate a compliance-ready log of what was redacted, when, and by whom, for regulated industries that need documentation
Built With
- aes-256-gcm
- fastapi
- huggingface-transformers
- javascript
- pillow
- pymupdf
- python
- tailwind-css
- tesseract-ocr