Inspiration
Every time you copy a credit card number, Social Security number, or password, that data sits silently in your clipboard — readable by any app on your device with clipboard access. There is no warning, no audit trail, and no way to know which app just harvested it.
Cloud-based scanners exist, but they defeat their own purpose: to check if your sensitive data is sensitive, you have to send it to a server. We wanted a solution with zero data-off-device — ever.
That's what inspired PrivacyGuard: on-device PII detection that runs entirely in hardware, with the internet permission explicitly stripped from the app manifest.
What We Learned
Building PrivacyGuard forced a deep dive into several areas we hadn't previously combined:
- WordPiece tokenization — matching BERT's 30,522-token vocabulary on-device without any network access, ensuring the tokenizer output exactly mirrors what the NER model was trained on
- BIO tag span reconstruction — the model emits per-subword labels (B-CREDIT_CARD, I-CREDIT_CARD, O, etc.); mapping these back to character-level spans in the original string requires careful offset tracking when words are split across multiple subword tokens
- Melange SDK internals — how ZeticMLangeModel manages the NPU session, how to feed ByteBuffer inputs for token IDs and attention masks, and how to read logit output buffers back into label distributions
- Android security architecture — using tools:node="remove" to strip the INTERNET permission, EncryptedSharedPreferences with AES-256-GCM via the Android Keystore, and foreground service lifecycle management
How We Built It
The pipeline has four stages:
Stage 1 — Pre-screen (< 1 ms)
A RegexScreener applies lightweight pattern checks (Luhn algorithm for card
numbers, SSA area/group/serial rules for SSNs, E.164 format for phones) before
invoking the ML model. This filters ~80% of clean clipboard content immediately.
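The Luhn step of the pre-screen can be sketched in a few lines. This is a minimal illustration, not the actual RegexScreener code, and it assumes the screener has already reduced a candidate to a digits-only string:

```kotlin
// Minimal Luhn checksum validator of the kind a card-number pre-screen
// could apply before waking the ML model. Assumes a digits-only input.
fun passesLuhn(digits: String): Boolean {
    if (digits.length !in 13..19 || !digits.all { it.isDigit() }) return false
    var sum = 0
    // Walk right to left, doubling every second digit (subtract 9 on overflow).
    digits.reversed().forEachIndexed { i, c ->
        var d = c - '0'
        if (i % 2 == 1) {
            d *= 2
            if (d > 9) d -= 9
        }
        sum += d
    }
    return sum % 10 == 0
}
```

Because the checksum rejects most random digit strings, a failure here lets clean clipboard content skip Stages 2–4 entirely.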
Stage 2 — Tokenization (< 5 ms)
The PIITokenizer runs a pure-Kotlin WordPiece implementation against the
bundled vocab.txt (identical to bert-base-uncased), producing token ID and
attention mask ByteBuffers.
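The core of WordPiece is greedy longest-match against the vocabulary. A condensed sketch of that loop, using a toy vocabulary in place of the bundled vocab.txt (this is the general algorithm, not PIITokenizer's exact code):

```kotlin
// Greedy longest-match WordPiece for a single pre-tokenized word.
// Non-initial pieces carry the "##" continuation prefix; if no prefix
// of the remainder is in the vocabulary, the whole word becomes [UNK].
fun wordPiece(word: String, vocab: Set<String>, unk: String = "[UNK]"): List<String> {
    val pieces = mutableListOf<String>()
    var start = 0
    while (start < word.length) {
        var end = word.length
        var piece: String? = null
        // Try the longest remaining substring first, shrinking until a vocab hit.
        while (start < end) {
            val candidate = (if (start > 0) "##" else "") + word.substring(start, end)
            if (candidate in vocab) { piece = candidate; break }
            end--
        }
        if (piece == null) return listOf(unk)
        pieces += piece
        start = end
    }
    return pieces
}
```

With a vocabulary containing "master" and "##card", `wordPiece("mastercard", vocab)` yields `["master", "##card"]`, the split discussed under Challenges below.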
Stage 3 — Melange Inference (< 50 ms on NPU)
PrivacyModel wraps ZeticMLangeModel from Melange SDK 1.2.2, loading the
Team_ZETIC/TextAnonymizer NER model. The model performs Named Entity
Recognition with 21 BIO labels:
$$\hat{y}_i = \arg\max_{k} \, \text{softmax}(\mathbf{W} \cdot \mathbf{h}_i)_k$$
where \( \mathbf{h}_i \) is the hidden state for token \( i \) and \( k \in \{ B\text{-}PER, I\text{-}PER, B\text{-}CREDIT\_CARD, \ldots, O \} \).
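In code, this per-token argmax is simpler than the formula suggests: softmax is monotonic, so taking the argmax over raw logits gives the same label. A hedged sketch (the real output-buffer shapes come from the Melange runtime, not this toy signature):

```kotlin
// Per-token label decoding: for each row of logits, pick the index of
// the largest value and map it to its BIO label string. Softmax is
// order-preserving, so it can be skipped when only the argmax is needed.
fun decodeLabels(logits: Array<FloatArray>, labels: List<String>): List<String> =
    logits.map { row ->
        labels[row.withIndex().maxByOrNull { it.value }!!.index]
    }
```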
Stage 4 — Span Decoding & Alert (< 5 ms)
OutputDecoder reconstructs entity spans from the per-token BIO predictions,
merges subword tokens back into surface forms, applies confidence thresholds, and
fires an overlay alert via AlertOverlayService if any PII is detected.
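The BIO-grouping part of span reconstruction can be illustrated compactly. This is a simplified stand-in for OutputDecoder (subword merging and confidence thresholds are elided): a B- tag opens an entity, matching I- tags extend it, and anything else closes it.

```kotlin
// Group aligned (token, BIO-tag) pairs into typed entity spans.
data class Entity(val type: String, val tokens: List<String>)

fun decodeSpans(tokens: List<String>, tags: List<String>): List<Entity> {
    val entities = mutableListOf<Entity>()
    var type: String? = null
    val buf = mutableListOf<String>()
    fun flush() {
        type?.let { entities += Entity(it, buf.toList()) }
        type = null
        buf.clear()
    }
    for ((tok, tag) in tokens.zip(tags)) {
        when {
            tag.startsWith("B-") -> { flush(); type = tag.removePrefix("B-"); buf += tok }
            tag.startsWith("I-") && tag.removePrefix("I-") == type -> buf += tok
            else -> flush()   // "O" or a mismatched I- tag closes any open span
        }
    }
    flush()
    return entities
}
```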
The entire pipeline runs in a 300 ms debounced coroutine on Dispatchers.Default.
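The debounce wiring can be expressed with a kotlinx.coroutines Flow. A sketch under assumptions (clipboard events arrive as a Flow of strings; `scan` stands in for the four-stage pipeline):

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*

// Debounced clipboard watcher: rapid successive copies are coalesced,
// and only a 300 ms quiet period triggers the scan pipeline, off the
// main thread on Dispatchers.Default.
@OptIn(FlowPreview::class)
fun CoroutineScope.watchClipboard(clips: Flow<String>, scan: suspend (String) -> Unit): Job =
    clips
        .debounce(300)
        .onEach { scan(it) }
        .flowOn(Dispatchers.Default)
        .launchIn(this)
```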
Challenges We Faced
1. WordPiece tokenizer parity
The NER model was trained on BERT-style tokenization. Any mismatch in
tokenization produces garbage output. Getting the pure-Kotlin tokenizer to
exactly reproduce bert-base-uncased behavior — including unknown-token
handling, ## continuation prefixes, and punctuation splitting — required
extensive validation against reference outputs.
2. BIO tag → character offset mapping
A single word like "Visa" tokenizes to ["visa"] (one token), but
"MasterCard" tokenizes to ["master", "##card"] (two tokens). The BIO label
is on the first subword token. Reconstructing the character-level span requires
tracking byte offsets through the tokenization step and back.
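The first half of that tracking, recording where each pre-token came from in the original string, looks roughly like this (a character-offset sketch for the whitespace pre-tokenization stage, not the full byte-accurate implementation):

```kotlin
// Whitespace pre-tokenization that keeps each token's character span,
// so a BIO label landing on the first subword of "MasterCard" can be
// mapped back to its position in the clipboard text.
data class Tok(val text: String, val start: Int, val end: Int)

fun preTokenizeWithOffsets(text: String): List<Tok> {
    val toks = mutableListOf<Tok>()
    var i = 0
    while (i < text.length) {
        if (text[i].isWhitespace()) { i++; continue }
        val start = i
        while (i < text.length && !text[i].isWhitespace()) i++
        toks += Tok(text.substring(start, i), start, i)
    }
    return toks
}
```

Subword pieces produced later inherit the parent token's span, which is what lets the alert highlight the exact characters.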
3. Sub-120 ms end-to-end on a mid-range device
Achieving interactive latency required: (a) routing inference through the NPU
via Melange's hardware backend, (b) keeping the model warm in memory via a
ModelLifecycleManager foreground service, and (c) pre-allocating ByteBuffer
instances to avoid GC pressure on the hot path.
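Point (c) amounts to allocating direct buffers once and rewinding them per inference. A sketch under assumptions (the 128-token capacity and int32 token IDs are illustrative, not the model's confirmed input spec):

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Pre-allocated direct buffer, reused across inferences to avoid
// per-call allocation and GC pressure on the hot path.
const val MAX_TOKENS = 128

val inputIds: ByteBuffer =
    ByteBuffer.allocateDirect(MAX_TOKENS * Int.SIZE_BYTES).order(ByteOrder.nativeOrder())

fun fillInputIds(ids: IntArray): ByteBuffer {
    inputIds.clear()                                   // reset position/limit, keep capacity
    ids.take(MAX_TOKENS).forEach { inputIds.putInt(it) }
    while (inputIds.position() < inputIds.capacity()) inputIds.putInt(0)  // zero-pad
    inputIds.flip()                                    // ready for the runtime to read
    return inputIds
}
```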
4. Zero network constraint
Stripping INTERNET with tools:node="remove" is a one-line manifest change,
but it exposed that several transitive dependencies assumed network access for
license checks, telemetry, or update polling. Auditing and suppressing these was
non-trivial.
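For reference, the manifest override itself is a single merger directive of this shape (illustrative fragment; the tools namespace must be declared on the manifest root):

```xml
<!-- Remove INTERNET even if a library's manifest requests it:
     the manifest merger drops the permission at build time. -->
<uses-permission
    android:name="android.permission.INTERNET"
    tools:node="remove" />
```

The hard part was everything after this line: verifying at runtime that no dependency could fall back to another process or component with network access.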