Inspiration

Data privacy is one of the biggest bottlenecks in enterprise AI adoption. Every day, professionals upload sensitive documents containing PII (Personally Identifiable Information) to AI tools like ChatGPT or Claude, losing control over their data. The GoCalma challenge inspired us to ask: how can we enable safe, powerful AI usage without compromising data ownership? Our answer is local, proactive redaction before the data ever leaves the device.

What it does

GoCalma Shield is a 100% local, lightweight middleware designed to neutralize privacy risks.

  • Dual Input Modes: Users can batch-process files (PDF/TXT) or use the text area for quick copy-pasting.
  • Offline Anonymization: It instantly identifies and masks sensitive information (Names, Phones, Emails, IDs, Bank Cards, etc.) entirely offline.
  • Format-Preserving: The output remains highly readable, ensuring LLMs can still understand the business context.
  • Dual-Track AI Interaction: Beginners can use quick links to jump to native ChatGPT/Claude/Kimi web interfaces, while developers can use the "API Geek Mode" to stream LLM responses directly within the app using the sanitized data.

How we built it

We built the application using Python and Streamlit for a rapid, interactive UI. The core NLP engine is powered by Microsoft Presidio and offline spaCy large models (en_core_web_lg and zh_core_web_lg). To improve accuracy and reduce false positives from the NLP model, we implemented a two-pass engine: a high-priority Regex pre-pass for rigid formats (like SSNs, IDs, and credit cards), followed by line-by-line NLP analysis for contextual entities.
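
The two-pass idea can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code that ships in GoCalma Shield: the patterns in RIGID_PATTERNS, the <LABEL> placeholder format, and the demo_nlp_pass stand-in (which takes the place of the Presidio/spaCy analysis) are all assumptions made for the example.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start, end, label)

# Hypothetical patterns for rigid formats; production patterns would need
# extra validation (for example a Luhn check on card numbers).
RIGID_PATTERNS = [
    ("US_SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("CREDIT_CARD", re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
]


def overlaps(a: Span, b: Span) -> bool:
    """True if the two character ranges share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def regex_pass(line: str) -> List[Span]:
    """Pass 1: collect spans for rigid formats, earlier patterns win."""
    spans: List[Span] = []
    for label, pattern in RIGID_PATTERNS:
        for m in pattern.finditer(line):
            span = (m.start(), m.end(), label)
            if not any(overlaps(span, kept) for kept in spans):
                spans.append(span)
    return spans


def redact_line(line: str, nlp_pass: Callable[[str], List[Span]]) -> str:
    """Run the regex pass, then accept NLP spans only where regex found nothing."""
    accepted = regex_pass(line)
    for span in nlp_pass(line):
        if not any(overlaps(span, kept) for kept in accepted):
            accepted.append(span)
    # Replace from right to left so earlier offsets stay valid.
    for start, end, label in sorted(accepted, reverse=True):
        line = line[:start] + f"<{label}>" + line[end:]
    return line


def redact_text(text: str, nlp_pass: Callable[[str], List[Span]]) -> str:
    """Pass 2 runs line by line, mirroring the line-by-line NLP analysis."""
    return "\n".join(redact_line(line, nlp_pass) for line in text.split("\n"))


def demo_nlp_pass(line: str) -> List[Span]:
    """Stand-in for the NLP engine: finds 'Alice' and mislabels part of an SSN."""
    spans: List[Span] = []
    i = line.find("Alice")
    if i >= 0:
        spans.append((i, i + 5, "PERSON"))
    j = line.find("123-45")
    if j >= 0:
        spans.append((j, j + 6, "PHONE_NUMBER"))
    return spans


if __name__ == "__main__":
    print(redact_text("Alice, SSN 123-45-6789, mail a@b.com", demo_nlp_pass))
```

Giving the regex spans priority means a contextual model can never split or relabel a value that already matched a rigid format; the NLP pass only fills in what the patterns could not see.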

Challenges we ran into

  1. Cross-Language Support: Handling mixed Chinese and English text was tricky. We implemented an auto-detection mechanism to route text to the correct underlying spaCy model.
  2. Environment & Dependency Hell: While building the offline NLP engine, we encountered severe C++ compilation errors (blis wheel build failures) because we were initially using Python 3.13. Once we recognized that parts of the NLP stack did not yet support that version, we downgraded to the stable Python 3.11 and successfully deployed the offline models.
  3. Preventing Double-Redaction: Sometimes the NLP model would try to re-redact data already masked by our Regex. We engineered a "Fragment Protection Mechanism" to freeze already-masked entities.
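
Two of the challenges above lend themselves to a short sketch: routing text to a language model, and freezing fragments that are already masked. The write-up does not describe how either was implemented, so everything here is an assumption for illustration: the CJK character-ratio heuristic, the 0.3 threshold, the <LABEL> placeholder format, and the helper names pick_model and drop_frozen.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label)

# Assumption: Chinese text is detected by the share of CJK Unified Ideographs.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
# Assumption: masked values look like <PERSON>, <US_SSN>, and so on.
MASK = re.compile(r"<[A-Z_]+>")


def pick_model(text: str, threshold: float = 0.3) -> str:
    """Return the spaCy model name to use, based on the ratio of CJK characters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return "en_core_web_lg"
    ratio = sum(1 for c in chars if CJK_CHAR.match(c)) / len(chars)
    return "zh_core_web_lg" if ratio >= threshold else "en_core_web_lg"


def frozen_ranges(text: str) -> List[Tuple[int, int]]:
    """Character ranges occupied by placeholders from the earlier regex pass."""
    return [(m.start(), m.end()) for m in MASK.finditer(text)]


def drop_frozen(spans: List[Span], text: str) -> List[Span]:
    """Discard NLP spans that touch an already-masked fragment."""
    frozen = frozen_ranges(text)
    return [
        s for s in spans
        if not any(s[0] < end and start < s[1] for start, end in frozen)
    ]


if __name__ == "__main__":
    print(pick_model("请联系张伟，电话如下"))
    masked = "Call <PERSON> at the office"
    print(drop_frozen([(0, 4, "ORG"), (6, 12, "PERSON")], masked))
```

A per-line call to pick_model is one way to handle documents that switch languages between lines; a character-ratio check is crude, and a dedicated language-identification library may be a better fit for heavily mixed text.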

Accomplishments that we're proud of

We successfully built a Zero-Cloud-API anonymization pipeline, proving that robust privacy protection doesn't require sending data to yet another third-party server first. We also love our seamless UI, which caters to both non-technical users and developers.

What we learned

We learned that relying solely on Regex or solely on NLP is flawed. Regex misses context, and NLP models produce false positives and misclassifications on rigid formats. In our experience, the hybrid approach (Whitelist Regex + NLP) is the more reliable foundation for enterprise-grade data anonymization.

What's next for GoCalma Shield

  • Integration with local offline LLMs (for example, Llama 3 running under Ollama) to create a pipeline that is 100% disconnected from the internet.
  • Supporting more complex document layouts (Word, Excel) while maintaining formatting.
