GoCalma
Local-First PDF PII Redaction For Real Documents
GoCalma is an open-source browser application that detects and redacts sensitive information in PDFs without sending document contents to any server. Everything runs locally: PDF parsing, text extraction, PII detection, OCR, redaction, and restoration. The document never leaves the user's device.
Live demo: https://gocalme.ketchalegend.me/
Why This Matters
Sensitive PDFs containing names, government IDs, medical data, and financial details are routinely uploaded to cloud redaction tools. Existing workflows send documents to servers (privacy risk), require manual effort (error-prone), or have poor recall on real documents (unreliable).
GoCalma provides a safer path: upload a PDF locally, detect PII, review the findings, export a redacted copy, and keep an encrypted key file so the redaction can be reversed later.
Core Results
- Core text-PDF macro recall: 100.00% (above the 90% challenge threshold)
- 301 automated tests (299 unit + 2 Playwright end-to-end)
- 0 production CVEs across 441 transitive dependencies
- 409 ms page load, 602 KB total transfer
- All 5 development phases complete and independently shippable
What GoCalma Detects
30+ PII types across six categories, all detected locally:
| Category | Types | Validation |
|---|---|---|
| Personal | Names, addresses, dates of birth | NER + regex + context rules |
| Government IDs | AHV/AVS, French NIR, Italian CF, Spanish DNI, Dutch BSN, UK NINO, UK NHS, German Steuer-ID, Belgian NN, Portuguese NIF, Austrian SVNr, US SSN, US/UK passports | Checksum validation per type |
| Financial | IBAN, credit cards, account numbers | MOD-97, Luhn |
| Contact | Email, phone (7 country-specific formats + generic) | Format validation |
| Technology | AWS access keys, API keys (OpenAI, Stripe, GitHub, GitLab, Slack), PEM private keys | Pattern + prefix matching |
| Visual | Faces/photos, handwritten signatures | Canvas-based heuristic detection |
Multilingual support: English, German, French, Italian, and Spanish documents.
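The checksum column in the table above is what lets a regex hit be confirmed or rejected mathematically. The two financial checks can be sketched as follows. This is a minimal illustration of the standard Luhn and IBAN MOD-97 algorithms, not GoCalma's actual code; the function names `luhnValid` and `ibanValid` and the length limits in the regular expressions are assumptions.

```typescript
// Luhn check for card numbers: walking from the right, double every second
// digit, subtract 9 from results above 9, and require the sum to end in 0.
function luhnValid(input: string): boolean {
  const digits = input.replace(/[\s-]/g, "");
  if (!/^\d{12,19}$/.test(digits)) return false; // assumed length bounds
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = digits.charCodeAt(digits.length - 1 - i) - 48;
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}

// IBAN MOD-97: move the first four characters to the end, map letters to
// 10..35, and require the resulting number to leave remainder 1 modulo 97.
function ibanValid(input: string): boolean {
  const iban = input.replace(/\s/g, "").toUpperCase();
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$/.test(iban)) return false;
  const rearranged = iban.slice(4) + iban.slice(0, 4);
  let remainder = 0;
  for (let i = 0; i < rearranged.length; i++) {
    const code = rearranged.charCodeAt(i);
    const value = code >= 65 ? code - 55 : code - 48; // 'A' -> 10, '0' -> 0
    // Letters contribute two decimal digits, digits contribute one.
    remainder = (value > 9 ? remainder * 100 + value : remainder * 10 + value) % 97;
  }
  return remainder === 1;
}
```

Computing the remainder incrementally avoids big-integer arithmetic, which matters because a full IBAN does not fit in a JavaScript number.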
How It Works
GoCalma combines six layered detection strategies:
- Contextual regex with per-type confidence scores and mathematical checksum validators (IBAN MOD-97, Luhn, 10 European ID checksums)
- Local NER via Transformers.js with WebGPU acceleration and automatic WASM fallback
- Local OCR via tesseract.js with per-page routing (auto-classifies each page as text vs image)
- Face and signature detection via canvas-based skin-tone density and ink density heuristics
- Context scoring with proximity-based confidence adjustment and keyword-gating on broad patterns
- Allow/deny lists with fuzzy matching for OCR error tolerance (Levenshtein distance)
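To illustrate the last strategy, here is a minimal sketch of Levenshtein-based list matching that tolerates a single OCR misread. The helper names `levenshtein` and `fuzzyListMatch`, the case-insensitive comparison, and the default threshold of one edit are assumptions made for this example, not details taken from GoCalma.

```typescript
// Classic edit distance computed with a single rolling row.
function levenshtein(a: string, b: string): number {
  const row: number[] = [];
  for (let j = 0; j <= b.length; j++) row[j] = j;
  for (let i = 1; i <= a.length; i++) {
    let diagonal = row[0]; // distance for (i-1, j-1)
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]; // distance for (i-1, j)
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diagonal + cost);
      diagonal = above;
    }
  }
  return row[b.length];
}

// Returns the first list term within maxDistance edits of the candidate,
// or undefined when nothing is close enough.
function fuzzyListMatch(
  candidate: string,
  list: string[],
  maxDistance: number = 1 // assumed threshold
): string | undefined {
  const needle = candidate.trim().toLowerCase();
  for (const term of list) {
    if (levenshtein(needle, term.toLowerCase()) <= maxDistance) return term;
  }
  return undefined;
}
```

With a deny list of `["Martin Muller"]`, the OCR output "Martin Mu1ler" still matches because it is one substitution away, while an unrelated name does not.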
Multi-pass detection runs strategies in priority order with range locking to prevent duplicate detections.
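One way to read "range locking" is that the first strategy to claim a character range owns it, and later, lower-priority strategies cannot report anything that overlaps it. The sketch below works under that assumption. The `Detection` shape, the `runMultiPass` function, and the two example passes are hypothetical.

```typescript
interface Detection {
  type: string;
  start: number; // inclusive character offset
  end: number;   // exclusive character offset
  confidence: number;
}

type Strategy = (text: string) => Detection[];

// Runs strategies in the given priority order. A hit is kept only if it does
// not overlap a range already locked by an earlier strategy.
function runMultiPass(text: string, strategies: Strategy[]): Detection[] {
  const locked: Detection[] = [];
  for (const strategy of strategies) {
    for (const hit of strategy(text)) {
      const overlaps = locked.some((l) => hit.start < l.end && l.start < hit.end);
      if (!overlaps) locked.push(hit);
    }
  }
  return locked.sort((a, b) => a.start - b.start);
}

// Hypothetical passes: a specific Swiss IBAN pattern and a broad digit run.
const ibanPass: Strategy = (t) => {
  const m = /CH\d{19}/.exec(t);
  return m ? [{ type: "IBAN", start: m.index, end: m.index + m[0].length, confidence: 0.99 }] : [];
};
const digitRunPass: Strategy = (t) => {
  const m = /\d{6,}/.exec(t);
  return m ? [{ type: "ACCOUNT", start: m.index, end: m.index + m[0].length, confidence: 0.4 }] : [];
};

const detections = runMultiPass("Pay to CH9300762011623852957 today", [ibanPass, digitRunPass]);
```

Because the IBAN pass runs first, the broad digit pattern cannot re-flag the same digits as a second, lower-confidence detection.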
User Controls
- Allow list: Terms that should never be flagged (e.g., "Acme Corporation", "Public Info")
- Deny list: Terms that should always be flagged, even if auto-detection misses them (e.g., "Martin Muller", "ABC-12345")
- Compliance profiles: Swiss DPA, EU GDPR, Healthcare, Financial, Full Scan — each enables the PII types relevant to that framework
- Per-type redaction: Choose a different mode for each PII type (e.g., redact names, mask IBANs, synthesize addresses)
- Five redaction modes: Redact (black box), mask (random characters), hash (#### replacement), highlight (visual marker), synthesize (consistent fake data)
- Explain API: Every detection shows why it was flagged — which strategy found it, the pattern that matched, and the confidence score
- Privacy risk score: 0-100 score with severity levels and per-type breakdown, shown before export
- Face and signature toggle: Enable or disable visual PII detection independently
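Per-type redaction can be pictured as a small policy object: a default mode plus overrides keyed by PII type. The sketch below is only an illustration of that idea. The `RedactionPolicy` shape, the `modeFor` helper, and the type keys such as `NAME` and `IBAN` are assumptions, while the five mode names come from the list above.

```typescript
type RedactionMode = "redact" | "mask" | "hash" | "highlight" | "synthesize";

interface RedactionPolicy {
  defaultMode: RedactionMode;
  perType: { [piiType: string]: RedactionMode | undefined };
}

// Hypothetical policy: redact names, mask IBANs, synthesize addresses.
const examplePolicy: RedactionPolicy = {
  defaultMode: "redact",
  perType: { NAME: "redact", IBAN: "mask", ADDRESS: "synthesize" },
};

// Looks up the mode for a PII type, falling back to the policy default.
function modeFor(policy: RedactionPolicy, piiType: string): RedactionMode {
  return policy.perType[piiType] ?? policy.defaultMode;
}
```

A compliance profile could then be expressed as a preset policy of the same shape, although how GoCalma stores its profiles is not stated here.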
Secure Redaction
Simply drawing black rectangles over text in a PDF does not remove the underlying content. The text layer remains selectable, searchable, and readable by any tool or operating-system feature that extracts text from the file.
GoCalma's approach: Each affected page is rendered to an image, redaction regions are painted over at the exact coordinates, and the page is replaced with that image. The original text layer is fully removed. This is pixel-level redaction, not a cosmetic overlay.
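The painting step of that approach can be sketched on a raw RGBA pixel buffer, which is the layout a browser returns from `CanvasRenderingContext2D.getImageData(...).data`. This is a simplified illustration rather than GoCalma's code. The `Rect` type and `paintRedactions` function are hypothetical, and a real implementation could equally call `fillRect` on the canvas.

```typescript
interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Overwrites every pixel inside each rectangle with opaque black.
// `pixels` holds 4 bytes (R, G, B, A) per pixel in row-major order.
function paintRedactions(
  pixels: Uint8ClampedArray,
  imgWidth: number,
  imgHeight: number,
  rects: Rect[]
): void {
  for (const r of rects) {
    // Round outward and clamp so the box never under-covers or overruns.
    const x0 = Math.max(0, Math.floor(r.x));
    const y0 = Math.max(0, Math.floor(r.y));
    const x1 = Math.min(imgWidth, Math.ceil(r.x + r.width));
    const y1 = Math.min(imgHeight, Math.ceil(r.y + r.height));
    for (let y = y0; y < y1; y++) {
      for (let x = x0; x < x1; x++) {
        const i = (y * imgWidth + x) * 4;
        pixels[i] = 0;       // red
        pixels[i + 1] = 0;   // green
        pixels[i + 2] = 0;   // blue
        pixels[i + 3] = 255; // alpha: fully opaque
      }
    }
  }
}
```

Once the original pixels are overwritten, nothing remains under the box to recover. The text-layer removal comes from replacing the page with this image, as described above.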
Additional security measures:
- AES-GCM encryption for all .gocalma key file payloads with optional password protection
- PDF metadata scrubbing strips author, title, and timestamps during redaction
- Consistent anonymization in synthesize mode — the same original value always maps to the same fake replacement
- Audit trail generation with entity type counts, severity classification, and redaction mode per document
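Consistent anonymization, as listed above, amounts to remembering each replacement the first time it is chosen. A minimal sketch follows, assuming a per-type pool of fake values and a numbered fallback when the pool runs out. `createSynthesizer`, the pool contents, and the whitespace and case normalization are all hypothetical.

```typescript
// Returns a function that maps (type, original) to a stable fake value.
function createSynthesizer(pools: { [piiType: string]: string[] | undefined }) {
  const mapping = new Map<string, string>();  // normalized original -> fake
  const counters = new Map<string, number>(); // fakes handed out per type

  return function synthesize(piiType: string, original: string): string {
    // Normalize so "Martin Muller" and " martin muller " share one entry.
    const key = piiType + "\u0000" + original.trim().toLowerCase();
    const known = mapping.get(key);
    if (known !== undefined) return known;

    const n = counters.get(piiType) ?? 0;
    counters.set(piiType, n + 1);
    const pool = pools[piiType] ?? [];
    // Use the next unused pool entry, or a numbered placeholder once exhausted.
    const fake = n < pool.length ? pool[n] : `${piiType}-${n + 1}`;
    mapping.set(key, fake);
    return fake;
  };
}
```

Handing out each pool entry only once keeps two different originals from collapsing into the same fake value, so relationships between entities in the document stay readable after redaction.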
Reversible Restoration
Alongside the redacted PDF, GoCalma exports an encrypted .gocalma key file containing the token mappings and optionally the original PDF. Users can restore the original document later by uploading both files. Key file versions (1.0.0 through 1.2.0) track format evolution and support backward compatibility.
Security Audit
GoCalma was independently security-audited (CSO mode):
- 0 critical CVEs across 441 transitive dependencies
- 0 production vulnerabilities in the shipping bundle
- AES-GCM encryption for all key file payloads
- Pixel-level redaction fully removes text layer
- PDF metadata scrubbing strips author, title, timestamps
- CSO audit: 15 candidates scanned, 1 design-level finding (raw key storage when no password set), 0 exploitable vulnerabilities
Development Phases
All five phases are complete and independently shippable:
| Phase | Focus | Key Deliverables |
|---|---|---|
| 1: Foundation | Safety and quick wins | Safe regex execution (match limits, ReDoS prevention), IBAN MOD-97 validation, allow/deny lists |
| 2: Detection Engine | Pattern expansion | 10 European ID validators with checksums, US SSN/passports, tech secret detection, 7 country-specific phone formats, context scoring |
| 3: Pipeline Intelligence | OCR and matching | Per-page OCR routing, 4-tier coordinate matching, fuzzy matching for OCR errors, NER on OCR text |
| 4: User Features | Controls and transparency | Compliance profiles, per-type redaction strategies, consistent anonymization, explain API |
| 5: Next-Gen | Advanced detection | WebGPU-accelerated NER, face and signature detection, UI redesign |
Challenge Fit
GoCalma directly addresses the core challenge requirements:
- Open source — full prototype repository on GitHub
- Working redaction flow — upload, detect, review, export
- Local-first privacy guarantee — zero data transmission, all processing in-browser
- User review — every detection confirmed before export
- Reversible — encrypted key files for un-redaction
- Security-audited — 0 production CVEs, AES-GCM crypto, pixel-level redaction
- Extended capability — scanned PDFs, phone captures, face/signature detection, multilingual support
Summary
GoCalma is a practical, browser-based privacy tool with 100% core recall, 30+ PII types with checksum validation, five redaction modes with per-type configuration, user-configurable allow/deny lists, face and signature detection, a 301-test automated suite, and a security-audited zero-transmission architecture.
The document stays on the user's machine from upload to export. That is the product guarantee.
Built With
- pdf-lib
- pdfjs-dist
- react
- tesseract.js
- transformerjs
- typescript
- vite
- xenova