GoCalma

Local-First PDF PII Redaction For Real Documents

GoCalma is an open-source browser application that detects and redacts sensitive information in PDFs without sending document contents to any server. Everything runs locally: PDF parsing, text extraction, PII detection, OCR, redaction, and restoration. The document never leaves the user's device.

Live demo: https://gocalme.ketchalegend.me/

Why This Matters

Sensitive PDFs containing names, government IDs, medical data, and financial details are routinely uploaded to cloud redaction tools. Existing workflows either send documents to servers (privacy risk), require manual effort (error-prone), or have poor recall on real documents (unreliable).

GoCalma provides a safer path: upload a PDF locally, detect PII, review the findings, export a redacted copy, and keep an encrypted key for reversible restoration.

Core Results

  • Core text-PDF macro recall: 100.00% (above the 90% challenge threshold)
  • 301 automated tests (299 unit + 2 Playwright end-to-end)
  • 0 production CVEs across 441 transitive dependencies
  • 409 ms page load, 602 KB total transfer
  • All 5 development phases complete and independently shippable

What GoCalma Detects

30+ PII types across six categories, all detected locally:

Category | Types | Validation
Personal | Names, addresses, dates of birth | NER + regex + context rules
Government IDs | AHV/AVS, French NIR, Italian CF, Spanish DNI, Dutch BSN, UK NINO, UK NHS, German Steuer-ID, Belgian NN, Portuguese NIF, Austrian SVNr, US SSN, US/UK passports | Checksum validation per type
Financial | IBAN, credit cards, account numbers | MOD-97, Luhn
Contact | Email, phone (7 country-specific formats + generic) | Format validation
Technology | AWS access keys, API keys (OpenAI, Stripe, GitHub, GitLab, Slack), PEM private keys | Pattern + prefix matching
Visual | Faces/photos, handwritten signatures | Canvas-based heuristic detection

Multilingual support: EN, DE, FR, IT, ES documents.

How It Works

GoCalma combines six layered detection strategies:

  1. Contextual regex with per-type confidence scores and mathematical checksum validators (IBAN MOD-97, Luhn, 10 European ID checksums)
  2. Local NER via Transformers.js with WebGPU acceleration and automatic WASM fallback
  3. Local OCR via tesseract.js with per-page routing (auto-classifies each page as text vs image)
  4. Face and signature detection via canvas-based skin-tone density and ink density heuristics
  5. Context scoring with proximity-based confidence adjustment and keyword-gating on broad patterns
  6. Allow/deny lists with fuzzy matching for OCR error tolerance (Levenshtein distance)
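
The checksum validators named in the first strategy follow well-known public algorithms. A minimal sketch of two of them, Luhn for card numbers and MOD-97 for IBANs, is shown here. The function names and the length limits in the regular expressions are assumptions for illustration and are not taken from GoCalma's source.

```typescript
// Luhn check: double every second digit counted from the right,
// subtract 9 from any result above 9, and require sum % 10 === 0.
function luhnValid(input: string): boolean {
  const digits = input.replace(/[\s-]/g, "");
  if (!/^\d{12,19}$/.test(digits)) return false; // assumed length bounds
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = digits.charCodeAt(digits.length - 1 - i) - 48;
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}

// IBAN MOD-97: move the first four characters to the end, map letters
// to numbers (A=10 ... Z=35), and require the remainder mod 97 to be 1.
function ibanValid(input: string): boolean {
  const iban = input.replace(/\s/g, "").toUpperCase();
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$/.test(iban)) return false;
  const rearranged = iban.slice(4) + iban.slice(0, 4);
  let remainder = 0;
  for (let i = 0; i < rearranged.length; i++) {
    const code = rearranged.charCodeAt(i);
    const value = code >= 65 ? code - 55 : code - 48; // letter or digit
    // Fold in one or two decimal digits at a time to avoid big integers.
    remainder = (remainder * (value > 9 ? 100 : 10) + value) % 97;
  }
  return remainder === 1;
}
```

A regex match that fails its checksum can be dropped or given a lower confidence, which is one plausible way such validators reduce false positives on random digit strings.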

Multi-pass detection runs strategies in priority order with range locking to prevent duplicate detections.
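
Range locking can be sketched as follows, assuming each detection carries character offsets and a numeric priority where a lower number means the strategy ran earlier. The Detection shape and the lockRanges name are hypothetical, chosen only to illustrate the idea.

```typescript
// Hypothetical detection record; field names are illustrative.
interface Detection {
  type: string;
  start: number; // inclusive character offset
  end: number; // exclusive character offset
  confidence: number;
  priority: number; // assumption: lower value = earlier, more trusted pass
}

// Accept candidates in priority order. Once a range is accepted it is
// "locked", and any later candidate overlapping it is discarded.
function lockRanges(candidates: Detection[]): Detection[] {
  const ordered = candidates
    .slice()
    .sort((a, b) => a.priority - b.priority || b.confidence - a.confidence);
  const locked: Detection[] = [];
  for (const candidate of ordered) {
    const overlaps = locked.some(
      (l) => candidate.start < l.end && l.start < candidate.end
    );
    if (!overlaps) locked.push(candidate);
  }
  // Return in document order for display and redaction.
  return locked.sort((a, b) => a.start - b.start);
}
```

With this approach a digit run inside a checksum-validated IBAN is not reported a second time as a phone number, because the IBAN's range is already locked when the broader phone pattern runs.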

User Controls

  • Allow list: Terms that should never be flagged (e.g., "Acme Corporation", "Public Info")
  • Deny list: Terms that should always be flagged, even if auto-detection misses them (e.g., "Martin Muller", "ABC-12345")
  • Compliance profiles: Swiss DPA, EU GDPR, Healthcare, Financial, Full Scan — each enables the PII types relevant to that framework
  • Per-type redaction: Choose a different mode for each PII type (e.g., redact names, mask IBANs, synthesize addresses)
  • Five redaction modes: Redact (black box), mask (random characters), hash (#### replacement), highlight (visual marker), synthesize (consistent fake data)
  • Explain API: Every detection shows why it was flagged — which strategy found it, the pattern that matched, and the confidence score
  • Privacy risk score: 0-100 score with severity levels and per-type breakdown, shown before export
  • Face and signature toggle: Enable or disable visual PII detection independently
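
Deny-list matching with OCR error tolerance could look roughly like the sketch below, which uses Levenshtein distance over a sliding window. The function names, the default distance of 1, and the fixed window size are assumptions for illustration. A fixed window mainly catches character substitutions such as "l" read as "1".

```typescript
// Classic dynamic-programming Levenshtein distance (two-row variant).
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

interface DenyHit {
  term: string;
  start: number;
  end: number;
}

// Slide a window the size of each deny term across the text and report
// windows within maxDistance edits of the term (case-insensitive).
function findDenyMatches(
  text: string,
  denyList: string[],
  maxDistance = 1
): DenyHit[] {
  const hits: DenyHit[] = [];
  const haystack = text.toLowerCase();
  for (const term of denyList) {
    const needle = term.toLowerCase();
    if (needle.length === 0) continue; // ignore empty entries
    for (let start = 0; start + needle.length <= haystack.length; start++) {
      const window = haystack.slice(start, start + needle.length);
      if (levenshtein(window, needle) <= maxDistance) {
        hits.push({ term, start, end: start + needle.length });
        start += needle.length - 1; // continue after this hit
      }
    }
  }
  return hits;
}
```

An allow list can reuse the same distance function in reverse: a detection whose text is within the threshold of an allowed term is suppressed instead of flagged.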

Secure Redaction

Drawing black rectangles over text in a PDF does not remove the underlying content. The text layer stays selectable and searchable, and any tool that extracts PDF text, including operating-system search and copy-paste features, can still read it.

GoCalma's approach: Each affected page is rendered to an image, redaction regions are painted over at the exact coordinates, and the page is replaced. The original text layer is fully removed. This is pixel-level redaction — not cosmetic.
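
The painting step can be sketched on a raw RGBA buffer, which is the layout a canvas ImageData object exposes in the browser. The Rect shape and the paintRedactions name are illustrative assumptions. Rendering the page to a canvas and swapping the image back into the PDF are outside this sketch.

```typescript
// Hypothetical region type in pixel coordinates.
interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Overwrite every pixel inside each region with opaque black.
// `pixels` holds RGBA bytes row by row, 4 bytes per pixel.
function paintRedactions(
  pixels: Uint8ClampedArray,
  width: number,
  height: number,
  regions: Rect[]
): void {
  for (const r of regions) {
    // Round outward and clamp so partial pixels are covered and
    // out-of-bounds regions cannot write outside the image.
    const x0 = Math.max(0, Math.floor(r.x));
    const y0 = Math.max(0, Math.floor(r.y));
    const x1 = Math.min(width, Math.ceil(r.x + r.width));
    const y1 = Math.min(height, Math.ceil(r.y + r.height));
    for (let y = y0; y < y1; y++) {
      for (let x = x0; x < x1; x++) {
        const i = (y * width + x) * 4;
        pixels[i] = 0; // red
        pixels[i + 1] = 0; // green
        pixels[i + 2] = 0; // blue
        pixels[i + 3] = 255; // alpha: fully opaque
      }
    }
  }
}
```

Because the original pixel values are overwritten rather than covered by a separate layer, nothing underneath the box survives in the exported image.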

Additional security measures:

  • AES-GCM encryption for all .gocalma key file payloads with optional password protection
  • PDF metadata scrubbing strips author, title, and timestamps during redaction
  • Consistent anonymization in synthesize mode — the same original value always maps to the same fake replacement
  • Audit trail generation with entity type counts, severity classification, and redaction mode per document
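
Consistent anonymization amounts to remembering each replacement the first time it is issued. The sketch below is a minimal illustration under that assumption. The class name is hypothetical, and it emits simple numbered placeholders where a real synthesize mode would presumably generate realistic fake values.

```typescript
// Hypothetical helper: the same (type, original) pair always returns
// the same replacement within one session.
class ConsistentSynthesizer {
  private mapping = new Map<string, string>();
  private counters = new Map<string, number>();

  replace(type: string, original: string): string {
    // Normalise so trivial variations map to the same replacement.
    const key = type + ":" + original.trim().toLowerCase();
    const known = this.mapping.get(key);
    if (known !== undefined) return known;

    const next = (this.counters.get(type) || 0) + 1;
    this.counters.set(type, next);
    const fake = type + "_" + next; // placeholder, e.g. "NAME_1"
    this.mapping.set(key, fake);
    return fake;
  }

  // The accumulated mapping is what a key file would need to store
  // in order to reverse the replacements later.
  entries(): [string, string][] {
    return Array.from(this.mapping.entries());
  }
}
```

Keeping replacements stable means a redacted document still reads coherently: every mention of one person becomes the same stand-in rather than a different one each time.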

Reversible Restoration

Alongside the redacted PDF, GoCalma exports an encrypted .gocalma key file containing the token mappings and optionally the original PDF. Users can restore the original document later by uploading both files. Key file versions (1.0.0 through 1.2.0) track format evolution and support backward compatibility.
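
A password-protected key file could be sealed with the browser's Web Crypto API along the following lines. This is a sketch under stated assumptions: the SealedKeyFile layout, the hex encoding, and the PBKDF2 parameters are hypothetical and do not describe the actual .gocalma format. Only the password-protected case is covered.

```typescript
// Hypothetical container; the real .gocalma layout is not documented here.
interface SealedKeyFile {
  salt: string; // hex, PBKDF2 salt
  iv: string; // hex, 96-bit AES-GCM nonce
  ciphertext: string; // hex, includes the GCM authentication tag
}

const toHex = (bytes: Uint8Array) =>
  Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");

const fromHex = (hex: string) => {
  const out = new Uint8Array(hex.length / 2);
  for (let i = 0; i < out.length; i++) {
    out[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return out;
};

// Derive a 256-bit AES-GCM key from the password (assumed parameters).
async function deriveKey(password: string, saltHex: string) {
  const material = await crypto.subtle.importKey(
    "raw", new TextEncoder().encode(password), "PBKDF2", false, ["deriveKey"]
  );
  return crypto.subtle.deriveKey(
    { name: "PBKDF2", salt: fromHex(saltHex), iterations: 100000, hash: "SHA-256" },
    material,
    { name: "AES-GCM", length: 256 },
    false,
    ["encrypt", "decrypt"]
  );
}

async function sealKeyFile(
  mappings: Record<string, string>,
  password: string
): Promise<SealedKeyFile> {
  const salt = toHex(crypto.getRandomValues(new Uint8Array(16)));
  const iv = crypto.getRandomValues(new Uint8Array(12)); // fresh per file
  const key = await deriveKey(password, salt);
  const plaintext = new TextEncoder().encode(JSON.stringify(mappings));
  const encrypted = await crypto.subtle.encrypt({ name: "AES-GCM", iv }, key, plaintext);
  return { salt, iv: toHex(iv), ciphertext: toHex(new Uint8Array(encrypted)) };
}

async function openKeyFile(
  file: SealedKeyFile,
  password: string
): Promise<Record<string, string>> {
  const key = await deriveKey(password, file.salt);
  // decrypt() rejects if the password is wrong or the data was altered.
  const decrypted = await crypto.subtle.decrypt(
    { name: "AES-GCM", iv: fromHex(file.iv) }, key, fromHex(file.ciphertext)
  );
  return JSON.parse(new TextDecoder().decode(decrypted));
}
```

AES-GCM authenticates as well as encrypts, so a tampered or mismatched key file fails to open instead of restoring wrong values.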

Security Audit

GoCalma was independently security-audited (CSO mode):

  • 0 critical CVEs across 441 transitive dependencies
  • 0 production vulnerabilities in the shipping bundle
  • AES-GCM encryption for all key file payloads
  • Pixel-level redaction fully removes text layer
  • PDF metadata scrubbing strips author, title, timestamps
  • CSO audit: 15 candidates scanned, 1 design-level finding (raw key storage when no password is set), 0 exploitable vulnerabilities

Development Phases

All five phases are complete and independently shippable:

Phase | Focus | Key Deliverables
1: Foundation | Safety and quick wins | Safe regex execution (match limits, ReDoS prevention), IBAN MOD-97 validation, allow/deny lists
2: Detection Engine | Pattern expansion | 10 European ID validators with checksums, US SSN/passports, tech secret detection, 7 country-specific phone formats, context scoring
3: Pipeline Intelligence | OCR and matching | Per-page OCR routing, 4-tier coordinate matching, fuzzy matching for OCR errors, NER on OCR text
4: User Features | Controls and transparency | Compliance profiles, per-type redaction strategies, consistent anonymization, explain API
5: Next-Gen | Advanced detection | WebGPU-accelerated NER, face and signature detection, UI redesign
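
The "safe regex execution" deliverable from Phase 1 can be illustrated with a bounded matching loop. This is a sketch, not GoCalma's implementation: the safeMatchAll name and both default limits are assumptions. Capping the input length bounds the work a pattern can do, though it does not make a pathological pattern safe on its own.

```typescript
interface SafeMatch {
  text: string;
  index: number;
}

// Run a pattern with a hard cap on input size and on match count,
// and guard against the infinite loop that empty matches can cause.
function safeMatchAll(
  pattern: RegExp,
  input: string,
  maxMatches = 1000, // assumed default
  maxInputLength = 100000 // assumed default
): SafeMatch[] {
  const flags =
    pattern.flags.indexOf("g") >= 0 ? pattern.flags : pattern.flags + "g";
  const re = new RegExp(pattern.source, flags); // private copy, own lastIndex
  const text = input.slice(0, maxInputLength);
  const out: SafeMatch[] = [];
  let m: RegExpExecArray | null;
  while (out.length < maxMatches && (m = re.exec(text)) !== null) {
    out.push({ text: m[0], index: m.index });
    if (m[0].length === 0) re.lastIndex++; // step past empty matches
  }
  return out;
}
```

Copying the pattern before use also prevents shared lastIndex state from leaking between pages when the same RegExp object is reused.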

Challenge Fit

GoCalma directly addresses the core challenge requirements:

  • Open source — full prototype repository on GitHub
  • Working redaction flow — upload, detect, review, export
  • Local-first privacy guarantee — zero data transmission, all processing in-browser
  • User review — every detection confirmed before export
  • Reversible — encrypted key files for un-redaction
  • Security-audited — 0 production CVEs, AES-GCM crypto, pixel-level redaction
  • Extended capability — scanned PDFs, phone captures, face/signature detection, multilingual support

Summary

GoCalma is a practical, browser-based privacy tool with 100% core recall, 30+ PII types with checksum validation, five redaction modes with per-type configuration, user-configurable allow/deny lists, face and signature detection, a 301-test automated suite, and a security-audited zero-transmission architecture.

The document stays on the user's machine from upload to export. That is the product guarantee.

Updates

GoCalma

Local-First PDF PII Redaction For Real Documents

GoCalma is an open-source browser application that detects and redacts sensitive information in PDFs without sending document contents to any server. It is built for the GoCalma challenge and focuses on a practical privacy guarantee: the document stays on the user’s machine from upload to export.

Why This Matters

Users regularly paste or upload highly sensitive PDFs into cloud tools before realizing those files contain names, addresses, IDs, account numbers, medical references, and other personal data. Existing workflows are either too risky, too fragmented, or too hard to use.

GoCalma provides a safer path:

  • upload a PDF locally,
  • detect likely PII,
  • review the findings,
  • export a redacted PDF,
  • keep an encrypted key for reversible restoration.

What Makes GoCalma Submission-Worthy

1. Fully local execution

The core workflow runs entirely in the browser:

  • PDF parsing
  • text extraction
  • PII detection
  • OCR for scanned/image-heavy files
  • redaction
  • restoration

No plain document payload is sent to third parties.

2. Strong benchmark performance

The repository’s current evaluator clears the challenge gate on the core benchmark:

  • Core text-PDF macro recall: 100.00%

This is above the stated challenge threshold of 90% recall for the core text-PDF path. The codebase includes 299 unit tests and Playwright end-to-end browser tests covering the full upload-detect-review-export flow.

3. Handles more than clean digital PDFs

The project was improved using a broader set of realistic sample documents, including:

  • clean digital PDFs,
  • scanned PDFs,
  • phone-captured document images converted into PDF,
  • noisier OCR-heavy forms.

That broader sample set materially improved robustness and helped push practical accuracy higher on messy real-world inputs, not just ideal text-layer PDFs.

4. Human review before final redaction

GoCalma is optimized for high recall and transparent review. Instead of silently missing risky fields, it presents detections to the user for confirmation before export.

That is the right product choice for privacy-sensitive redaction.

5. Reversible redaction

Alongside the redacted PDF, GoCalma exports an encrypted .gocalma key file. This preserves the mapping needed to restore the original values later without weakening the local-first privacy model.

How It Works

GoCalma combines several local detection methods:

  • contextual regex and rule-based matching,
  • layout-aware heuristics,
  • local NER enrichment with Transformers.js,
  • local OCR with tesseract.js,
  • post-processing and deduplication.

This layered approach is why the system performs well across structured forms, letters, notices, invoices, and lower-quality scanned inputs.

Security

GoCalma was independently security-audited:

  • 0 critical CVEs across 441 transitive dependencies,
  • 0 production vulnerabilities in the shipping bundle,
  • AES-GCM encryption for all key file payloads,
  • pixel-level redaction fully removes the text layer (no hidden selectable data),
  • PDF metadata scrubbing strips author, title, and timestamps,
  • CSO audit: 15 candidates scanned, 1 design-level finding, 0 exploitable vulnerabilities.

Challenge Fit

GoCalma directly addresses the core challenge requirements:

  • open-source prototype repository,
  • working redaction flow,
  • local-first privacy guarantee,
  • user review before export,
  • encrypted reversible un-redaction,
  • security-audited with 0 production CVEs,
  • support for scanned and image-heavy PDFs as an extended capability.

Summary

GoCalma is not just a demo. It is a practical, browser-based privacy tool with strong core recall, a reviewable workflow, reversible redaction, and support for both standard PDFs and tougher scanned or phone-captured documents.

The result is a credible local-first redaction product that is aligned with the challenge and ready for submission.
