PII-Vault

Inspiration

We see friends, family, and colleagues upload private information into ChatGPT to "get help," not realizing that their sensitive data is now sitting in someone else's database. We wondered whether we could build something that lets people easily use AI on private documents without giving away their data. We haven't seen an open-source tool that does this well, so we decided to try building one ourselves.

What it does

PII-Vault is our experiment in privacy-preserving document redaction. It detects personal information and replaces it with placeholders like <PERSON_1>.

We encrypt the mapping between placeholders and real values. Keep the key, restore the original later. Lose the key, and the redaction becomes permanent.
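The redact-and-restore cycle can be sketched in a few lines of Python. This is a minimal illustration, not the PII-Vault implementation: the function names `redact` and `restore` are hypothetical, the spans are assumed to come from a detector as non-overlapping (start, end, label) tuples, and the mapping is returned in plain form where the real tool would encrypt it. A plain dictionary keeps the same value on the same placeholder.

```python
from collections import defaultdict


def redact(text, spans):
    """Replace each (start, end, label) span with a placeholder like <PERSON_1>.

    Returns the redacted text and the placeholder -> original value mapping.
    Spans are assumed to be non-overlapping (hypothetical detector output).
    """
    counters = defaultdict(int)  # next index per entity label
    by_value = {}                # (label, original value) -> placeholder
    mapping = {}                 # placeholder -> original value
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        value = text[start:end]
        key = (label, value)
        if key not in by_value:
            counters[label] += 1
            by_value[key] = f"<{label}_{counters[label]}>"
            mapping[by_value[key]] = value
        pieces.append(text[cursor:start])
        pieces.append(by_value[key])
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), mapping


def restore(redacted, mapping):
    """Put the original values back; without the mapping this is not possible."""
    for placeholder, value in mapping.items():
        redacted = redacted.replace(placeholder, value)
    return redacted


text = "John Smith met Ann Lee. John Smith paid."
spans = [(0, 10, "PERSON"), (15, 22, "PERSON"), (24, 34, "PERSON")]
redacted, mapping = redact(text, spans)
print(redacted)
print(restore(redacted, mapping) == text)
```

In this sketch the mapping is the only thing that links placeholders to real values, which is why encrypting it (and keeping or discarding the key) decides whether the redaction is reversible.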

Dedicated UI programs allow users to manage their redacted encrypted data and the keys they were encrypted with, including exporting them. Once exported and sent through any secure channel, a different UI allows third parties to unredact only the documents in their possession that match the export they were given.

We also built a Chrome extension that redacts file uploads on the fly, directly in the browser, including uploads to AI providers like ChatGPT, Claude, and Gemini. It works through a connector to the locally installed program, so vaults and keys are stored together no matter where the redaction happens.
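For readers unfamiliar with how a browser extension talks to a local program: Chrome's native messaging frames each message as a 32-bit length in native byte order followed by UTF-8 encoded JSON, exchanged over the host's standard input and output. The sketch below shows only that framing. The function names and the message fields (`action`, `filename`) are hypothetical and are not taken from the PII-Vault code.

```python
import io
import json
import struct


def encode_message(payload):
    """Frame a message: 32-bit length (native byte order) + UTF-8 JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return struct.pack("=I", len(body)) + body


def read_message(stream):
    """Read one framed message from a binary stream; return None at end of stream."""
    header = stream.read(4)
    if len(header) < 4:
        return None
    (length,) = struct.unpack("=I", header)
    return json.loads(stream.read(length).decode("utf-8"))


# Round trip through an in-memory stream; a real host would read from
# sys.stdin.buffer and write to sys.stdout.buffer instead.
frame = encode_message({"action": "redact", "filename": "cv.pdf"})
print(read_message(io.BytesIO(frame)))
```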

In addition to integrating state-of-the-art models, we developed our own model for PII recognition. Since it had to run on mobile devices such as laptops and potentially phones, and we found small LLMs still too unreliable, we instead fine-tuned a ModernBERT model for the more focused task of classifying token sequences rather than generating them, while still making use of the whole-text understanding enabled by the attention mechanism. The model is codenamed "multihead" because the breakthrough to an interesting level of performance came from training it to recognize three things independently: whether a span might be interesting, what type it is, and whether the information is actually private (for example, "Alexa play Despacito" contains two names but no private information).

The open Presidio base allows easy extension. In addition to training on more, and more diverse, data, we plan to make it easy to generate configuration files for each use case and locale, each potentially codifying a complex strategy, so that users only have to choose one, or can even have a model recognize the best strategy.

How we built it

We started with Microsoft's Presidio and added a reversible encryption layer with AES-256.

Results of our own model (English)

Docs: 100
True positives: 145
Merged true positives (predicted span is the union of more than one golden span): 2
False positives: 8
False negatives: 2
Predicted is a subset of golden: 16
Predicted is a superset of golden: 4

Observations

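The quality is high, though not perfect. Most of the errors and partial matches happen on addresses, which are long, complex, and highly variable strings. If partial redaction is acceptable, the error rate drops significantly: 22 of the non-exact results are partial matches (merged, subset, or superset), and only 10 are outright errors (8 false positives and 2 false negatives).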
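To make the effect of partial matches concrete, the counts reported for the English run can be turned into precision, recall, and F1 under two scoring conventions. Both conventions are assumptions made for illustration, not necessarily how the project scores its runs: the lenient one counts every partial match as a true positive, and the strict one counts every partial match as one false positive plus one false negative (a simplification, since a merged span covers more than one golden span).

```python
def scores(tp, fp, fn):
    """Return (precision, recall, f1) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


# Counts reported for the English run.
exact, merged, false_pos, false_neg, subset, superset = 145, 2, 8, 2, 16, 4
partial = merged + subset + superset

# Lenient (assumption): every partial match counts as a true positive.
lenient = scores(exact + partial, false_pos, false_neg)

# Strict (assumption): every partial match counts as one false positive
# and one false negative.
strict = scores(exact, false_pos + partial, false_neg + partial)

for name, (p, r, f1) in (("lenient", lenient), ("strict", strict)):
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Under these assumptions F1 is about 0.97 with lenient scoring and about 0.84 with strict scoring, which shows how much of the gap comes from partial matches and not from outright misses.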

International all-models benchmarking

Synthetic data generation for benchmarking

Licensing and privacy constraints make it challenging to collect sufficient data, especially in non-English locales, so we built a synthetic data generator using Faker. It creates realistic Swiss documents with known PII labels (names, addresses, IBANs, and phone numbers in local formats) and exact character positions for each entity.
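The core idea of labeled synthetic data, filling a template while recording where each value lands, can be sketched as follows. This is an illustration under assumptions, not the project's generator: the `{LABEL}` slot syntax, the `fill_template` name, and the hard-coded example values are all hypothetical, and in the real pipeline the values would come from Faker.

```python
import re

# Slots look like {PERSON} or {PHONE}; this syntax is an assumption of the sketch.
SLOT = re.compile(r"\{([A-Z_]+)\}")


def fill_template(template, values):
    """Fill each slot and record (start, end, label) offsets in the output text."""
    pieces, entities = [], []
    cursor = 0  # read position in the template
    length = 0  # length of the output built so far
    for match in SLOT.finditer(template):
        literal = template[cursor:match.start()]
        label = match.group(1)
        value = values[label]
        pieces.append(literal)
        length += len(literal)
        entities.append((length, length + len(value), label))
        pieces.append(value)
        length += len(value)
        cursor = match.end()
    pieces.append(template[cursor:])
    return "".join(pieces), entities


# Hard-coded stand-ins for generated values.
values = {"PERSON": "Anna Müller", "PHONE": "+41 44 555 01 23"}
template = "Sehr geehrte Frau {PERSON}, bitte rufen Sie uns unter {PHONE} an."
text, entities = fill_template(template, values)
for start, end, label in entities:
    print(label, start, end, repr(text[start:end]))
```

Because the offsets are computed while the text is assembled, the golden labels are exact by construction and no manual annotation is needed.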

International benchmarking results

We tested four models across German, French, and Italian:

Model	German F1	French F1	Italian F1	Avg Time/doc
GLiNER	45.8%	34.2%	36.4%	434-1290ms
Stanza	44.5%	33.3%	38.9%	818-1577ms
XLM-R	32.2%	35.7%	31.7%	549-844ms
Local Multihead (ours)	1.6%	1.3%	1.1%	381-1155ms

Challenges we ran into

Local model training. Our ModernBERT model's F1 of 1.1% to 1.6% is understandable: we trained it only on English data, and this benchmark requires exact span matches.
Language variance. German works OK. Italian and French need more processing time for worse results.
Speed. This solution needs to work on edge devices, so only small models are feasible.
Entity consistency. Getting the same entity to have the same placeholder throughout a document required building a consistent hashing scheme. Early versions would call "John Smith" <PERSON_1> on page 1 and <PERSON_2> on page 3.
Engine abstraction. Each NLP engine (spaCy, Stanza, GLiNER, our local model) has different output formats, confidence scores, and entity type names. Building a unified interface that normalizes all of these into consistent placeholders while preserving each engine's strengths is non-trivial.
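One way to approach the engine abstraction described above is to convert every engine's output into a common span type and then resolve overlaps. The sketch below is illustrative only: the engine names, raw dictionary format, and label tables are hypothetical stand-ins (each real engine's output format and label set would need to be checked), and keeping the highest-scoring span is just one possible merge policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str   # unified label, e.g. PERSON
    score: float
    engine: str


# Hypothetical label tables, one per engine.
LABEL_MAP = {
    "engine_a": {"PER": "PERSON", "LOC": "LOCATION", "ORG": "ORGANIZATION"},
    "engine_b": {"person": "PERSON", "address": "LOCATION"},
}


def normalize(engine, raw):
    """Convert raw detections (dicts) to Spans; unknown labels are dropped."""
    table = LABEL_MAP[engine]
    spans = []
    for item in raw:
        label = table.get(item["label"])
        if label is None:
            continue
        score = float(item.get("score", 1.0))  # engines without scores default to 1.0
        spans.append(Span(item["start"], item["end"], label, score, engine))
    return spans


def merge(spans):
    """Among overlapping spans keep the highest-scoring one; return in text order."""
    kept = []
    for span in sorted(spans, key=lambda s: -s.score):
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)


a = normalize("engine_a", [{"start": 0, "end": 10, "label": "PER", "score": 0.6}])
b = normalize("engine_b", [{"start": 0, "end": 10, "label": "person", "score": 0.9}])
print(merge(a + b))
```

Once all detections share one type and one label vocabulary, placeholder assignment no longer needs to know which engine produced a span.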

Accomplishments that we're proud of

Reversible redaction works end-to-end.
Chrome extension with native messaging bridge from browser to local Python.
We built a benchmark framework with synthetic data generation across 3 languages.
We learned more about NER pipelines and multilingual models than we expected.

What we learned

Benchmarking is key. Without precision, recall, and timing measurements, we wouldn't know which models work best in terms of both accuracy and speed.
Training your own model is hard. Our low F1 taught us how hard it is to train small models.
Multilingual NLP is uneven. The training data imbalance between German and Italian/French is real.
This is a prototype. Benchmarking shows clear gaps — we need more tests and improvements.