Inspiration

Do you happily share sensitive documents - contracts, tax forms, insurance papers - with OpenAI, Anthropic, and Google?

Reclaim control over your data with GoCalma Redact: easily hide sensitive data before your documents are sent to AI servers.

What it does

GoCalma Redact allows you to:

  • Upload PDFs, scans, and photos.
  • Automatically detect personal data (PII) using GenAI.
  • Review and adjust redactions interactively.
  • Generate a fully redacted version.

All processing happens locally; no data ever leaves your machine.

Deployment Options

Key Features:

  • 100% local processing (zero external APIs)
  • Interactive PDF editor (click-to-fix redactions)
  • Dual detection pipeline (NER + LLM verification)
  • Swiss & EU-specific PII detection (AHV, IBAN, CH IDs, etc.)
  • Multilingual OCR (90+ languages)
  • 7 redaction methods (mask, hash, encrypt, etc.)
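
To illustrate what Swiss-specific detection can involve, here is a minimal sketch of checksum validation for AHV numbers and IBANs. It is not taken from the GoCalma Redact code base: the function names and the regular expression are assumptions made for this example. It relies only on two public facts: an AHV number is a 13-digit EAN-13 code starting with 756, and an IBAN passes the ISO 13616 mod-97 check. Validating checksums like this is one plausible way to cut false positives from pattern matching.

```python
import re

# Usual printed form is 756.XXXX.XXXX.XX; the dots are treated as optional here.
AHV_PATTERN = re.compile(r"\b756\.?\d{4}\.?\d{4}\.?\d{2}\b")


def ean13_check_digit(first12: str) -> int:
    """Return the EAN-13 check digit for a string of 12 digits."""
    # Weights alternate 1, 3, 1, 3, ... starting from the leftmost digit.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def is_valid_ahv(candidate: str) -> bool:
    """Check length, the 756 country prefix, and the EAN-13 check digit."""
    digits = re.sub(r"\D", "", candidate)
    if len(digits) != 13 or not digits.startswith("756"):
        return False
    return ean13_check_digit(digits[:12]) == int(digits[12])


def is_valid_iban(candidate: str) -> bool:
    """Check an IBAN with the mod-97 rule (letters map to 10..35)."""
    iban = re.sub(r"\s", "", candidate).upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", iban):
        return False
    rearranged = iban[4:] + iban[:4]  # move country code and check digits to the end
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


def find_ahv_numbers(text: str) -> list[str]:
    """Return only those AHV-shaped matches whose checksum is valid."""
    return [m.group(0) for m in AHV_PATTERN.finditer(text) if is_valid_ahv(m.group(0))]
```

A pattern match alone would flag any 13-digit string starting with 756; the checksum step keeps only candidates that are structurally valid.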

Bonus Features:

  • GoCalma Redact handles scanned documents reliably
  • Proof-of-concept iOS app: the approach works with Apple Intelligence, which ships on the latest iPhones.
  • Reversible redactions with an encrypted key file

Technical Details

GitHub Repos:

Main product: https://github.com/alallaqi/go-calma-redact
Desktop App: https://github.com/alallaqi/gocalma-redact-desktop-app
POC for iOS App: https://github.com/Ben-Zahler/go-calma-redact-ios

How we built it

  • Modular pipeline: OCR → NER → optional LLM → review → redaction
  • Realistic dataset: Swiss documents (real, synthetic, dummy)
  • Coverage: tax, insurance, contracts, invoices, letters
  • Formats: scans, photos, PDFs
  • Languages: EN, DE, FR, IT
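
The modular pipeline above (OCR → NER → optional LLM → review → redaction) can be sketched as a chain of interchangeable stages. This is an illustrative outline only, not the project's actual code: `Entity`, `run_pipeline`, and `mask` are hypothetical names, and each stage is passed in as a plain callable so that any OCR engine, NER backend, or LLM verifier could be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Entity:
    """A detected span of sensitive text (character offsets, end exclusive)."""
    start: int
    end: int
    label: str


def mask(text: str, entities: list[Entity]) -> str:
    """Replace each entity with a [LABEL] placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e.start, reverse=True):
        text = text[:ent.start] + f"[{ent.label}]" + text[ent.end:]
    return text


def run_pipeline(
    document: bytes,
    ocr: Callable[[bytes], str],
    ner: Callable[[str], list[Entity]],
    llm_verify: Optional[Callable[[str, list[Entity]], list[Entity]]] = None,
    review: Optional[Callable[[str, list[Entity]], list[Entity]]] = None,
) -> str:
    """Run OCR -> NER -> optional LLM -> optional review -> redaction."""
    text = ocr(document)
    entities = ner(text)
    if llm_verify is not None:  # the LLM pass is optional
        entities = llm_verify(text, entities)
    if review is not None:  # hook for human corrections before redaction
        entities = review(text, entities)
    return mask(text, entities)
```

Keeping each stage behind a simple function signature is what lets the LLM step be skipped and the review step accept manual corrections without changing the rest of the chain.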

Security:

  • Key files encrypted with a key derived via PBKDF2-HMAC-SHA256 (480k iterations)
  • Salted HMAC hashing (resistant to rainbow table attacks)
  • Scanned PDF pixel flattening (forensic recovery prevention)
  • LLM prompt injection guards
  • 130 automated tests, including security regression tests
  • Zero external network calls — verified, no CDN/fonts/analytics
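
As a rough illustration of the first two items, the sketch below derives a 256-bit key with PBKDF2-HMAC-SHA256 at 480,000 iterations and computes a salted HMAC of a sensitive value, using only the Python standard library. The function names, the 16-byte salt size, and the use of the salt as the HMAC key are assumptions made for this example; the project's actual key-file format and hashing scheme may differ.

```python
import hashlib
import hmac
import secrets

PBKDF2_ITERATIONS = 480_000  # iteration count stated in the list above


def new_salt(size: int = 16) -> bytes:
    """Generate a random salt (the 16-byte size is an assumption)."""
    return secrets.token_bytes(size)


def derive_key(password: str, salt: bytes) -> bytes:
    """Derive a 32-byte key from a password with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, PBKDF2_ITERATIONS, dklen=32
    )


def salted_hash(value: str, salt: bytes) -> str:
    """Return a hex HMAC-SHA256 of the value, keyed with a secret salt."""
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()
```

Because the digest depends on a random secret salt, a precomputed table of hashes for common names or ID numbers does not help an attacker who lacks the salt. The derived key would then be used with a symmetric cipher to encrypt the key file; that step is left out here because it needs a third-party library.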

Challenges we ran into

  • Reliable detection on low-quality scans
  • Balancing recall vs false positives
  • Supporting multilingual + Swiss-specific formats
  • Performance of local models

Accomplishments that we're proud of

  • Strong performance on real Swiss documents for 11 Swiss-specific entity types
  • Seamless integration of OCR, LLMs, and 9 NER backends
  • Interactive redaction that works even on scanned PDFs, with 7 redaction approaches
  • Two redaction modes: flattened PDFs (permanent) and reversible redactions
  • Reliability demonstrated through 130 tests

What we learned

  • Real-world documents are messy — synthetic data is not enough
  • Combining models (NER + LLM) significantly improves recall
  • UX matters: users must trust and verify the system
  • Performance constraints shape design decisions

What's next for GoCalma Redact

  • Faster, lighter LLM verification (<3s per page)
  • Batch processing for multiple documents
  • Smarter filtering to reduce false positives
  • Broader European document support

Built With

  • bert-base
  • huggingface
  • mistral
  • ocr
  • ollama
  • phi
  • presidio
  • python
  • qwen
  • spacy
  • streamlit
  • surya
  • swissbert

Updates

For this project, we used Surya for OCR instead of Tesseract, keeping Tesseract as a fallback when Surya is unavailable. Compared with Tesseract, Surya:

  • Uses a transformer-based (deep learning) approach rather than a traditional computer vision pipeline
  • Supports 90+ languages with automatic detection
  • Performs significantly better on mixed-language documents and on complex layouts (rotated or noisy scans), where traditional methods often struggle
  • Natively returns word-level bounding boxes
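
The fallback behaviour described in this update can be sketched as trying a list of OCR backends in order. This is a generic pattern, not the project's code: `Word`, `OcrBackend`, and `ocr_with_fallback` are hypothetical names, and the real Surya and Tesseract calls are deliberately left out; each backend is assumed to be wrapped in a function that takes image bytes and returns words with bounding boxes.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Word:
    """One recognized word with its bounding box (x0, y0, x1, y1) in pixels."""
    text: str
    bbox: tuple[float, float, float, float]


OcrBackend = Callable[[bytes], list[Word]]


def ocr_with_fallback(
    image: bytes, backends: Sequence[tuple[str, OcrBackend]]
) -> tuple[str, list[Word]]:
    """Try each backend in order; return the name and words of the first that works."""
    errors = []
    for name, backend in backends:
        try:
            return name, backend(image)
        except Exception as exc:  # e.g. ImportError when an engine is not installed
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no OCR backend available: " + "; ".join(errors))
```

Returning the backend name alongside the words makes it easy to log which engine handled a page, which matters when the fallback produces lower-quality boxes.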
