Inspiration

Every time you copy a credit card number, Social Security number, or password to paste it somewhere, that data silently sits in your clipboard — readable by any app on your device with clipboard access. There is no warning, no audit trail, and no way to know which app just harvested it.

Cloud-based scanners exist, but they defeat their own purpose: to find out whether your data is sensitive, you first have to send it to a server. We wanted a solution where no data ever leaves the device.

That's what inspired PrivacyGuard: PII detection that runs entirely on-device, accelerated on the NPU, with the INTERNET permission explicitly stripped from the app manifest.


What We Learned

Building PrivacyGuard forced a deep dive into several areas we hadn't previously combined:

  • WordPiece tokenization — matching BERT's 30,522-token vocabulary on-device without any network access, ensuring the tokenizer output exactly mirrors what the NER model was trained on
  • BIO tag span reconstruction — the model emits per-subword labels (B-CREDIT_CARD, I-CREDIT_CARD, O, etc.); mapping these back to character-level spans in the original string requires careful offset tracking when words are split across multiple subword tokens
  • Melange SDK internals — how ZeticMLangeModel manages the NPU session, how to feed ByteBuffer inputs for token IDs and attention masks, and how to read logit output buffers back into label distributions
  • Android security architecture — using tools:node="remove" to strip the INTERNET permission, EncryptedSharedPreferences with AES-256-GCM via the Android Keystore, and foreground service lifecycle management

How We Built It

The pipeline has four stages:

Stage 1 — Pre-screen (< 1 ms)

A RegexScreener applies lightweight pattern checks (Luhn algorithm for card numbers, SSA area/group/serial rules for SSNs, E.164 format for phones) before invoking the ML model. This filters ~80% of clean clipboard content immediately.
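The Luhn step of the pre-screen can be sketched as below. This is an illustrative standalone function, not the actual `RegexScreener` internals, whose structure the write-up doesn't show:

```kotlin
// Luhn checksum: walk the digits right-to-left, double every second digit,
// subtract 9 from any doubled value above 9, and require the sum to be ≡ 0 mod 10.
fun luhnValid(number: String): Boolean {
    val digits = number.filter { it.isDigit() }
    if (digits.length !in 13..19) return false  // card numbers are 13–19 digits
    var sum = 0
    digits.reversed().forEachIndexed { i, c ->
        var d = c - '0'
        if (i % 2 == 1) {
            d *= 2
            if (d > 9) d -= 9
        }
        sum += d
    }
    return sum % 10 == 0
}
```

Because this rejects most non-card strings in microseconds, the ML model only ever sees candidates that already look like PII.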

Stage 2 — Tokenization (< 5 ms)

The PIITokenizer runs a pure-Kotlin WordPiece implementation against the bundled vocab.txt (identical to bert-base-uncased), producing token ID and attention mask ByteBuffers.
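The core of WordPiece is a greedy longest-match loop over each whitespace/punctuation-split word. A minimal sketch, assuming the vocabulary has already been loaded from the bundled vocab.txt into a set (the function name is illustrative, not the actual `PIITokenizer` API):

```kotlin
// Greedy longest-match WordPiece for a single word: repeatedly take the longest
// vocab entry matching the remaining characters, prefixing continuations with "##".
// If no prefix matches at some position, the whole word maps to [UNK] — this
// mirrors bert-base-uncased behavior.
fun wordPiece(word: String, vocab: Set<String>, unk: String = "[UNK]"): List<String> {
    val tokens = mutableListOf<String>()
    var start = 0
    while (start < word.length) {
        var end = word.length
        var match: String? = null
        while (end > start) {
            var piece = word.substring(start, end)
            if (start > 0) piece = "##$piece"   // continuation subwords get ## prefix
            if (piece in vocab) { match = piece; break }
            end--
        }
        if (match == null) return listOf(unk)   // whole-word fallback to [UNK]
        tokens += match
        start = end
    }
    return tokens
}
```

For example, with "master" and "##card" in the vocabulary, "mastercard" splits into `["master", "##card"]`.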

Stage 3 — Melange Inference (< 50 ms on NPU)

PrivacyModel wraps ZeticMLangeModel from Melange SDK 1.2.2, loading the Team_ZETIC/TextAnonymizer NER model. The model performs Named Entity Recognition with 21 BIO labels:

$$\hat{y}_i = \arg\max_{k} \, \mathrm{softmax}(\mathbf{W} \mathbf{h}_i)_k$$

where \( \mathbf{h}_i \) is the hidden state for token \( i \) and \( k \in \{B\text{-}PER,\, I\text{-}PER,\, B\text{-}CREDIT\_CARD,\, \ldots,\, O\} \).
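Since softmax is monotonic, taking the argmax over raw logits selects the same label as the formula above without computing probabilities. A sketch of that per-token decode, with an illustrative label list (the real model has 21 BIO labels):

```kotlin
// Pick the highest-scoring label for each token's logit row.
// Softmax preserves ordering, so argmax over logits == argmax over probabilities.
fun decodeLabels(logits: Array<FloatArray>, labels: List<String>): List<String> =
    logits.map { row -> labels[row.indices.maxByOrNull { i -> row[i] }!!] }
```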

Stage 4 — Span Decoding & Alert (< 5 ms)

OutputDecoder reconstructs entity spans from the per-token BIO predictions, merges subword tokens back into surface forms, applies confidence thresholds, and fires an overlay alert via AlertOverlayService if any PII is detected.

The entire pipeline runs in a 300 ms debounced coroutine on Dispatchers.Default.


Challenges We Faced

1. WordPiece tokenizer parity

The NER model was trained on BERT-style tokenization. Any mismatch in tokenization produces garbage output. Getting the pure-Kotlin tokenizer to exactly reproduce bert-base-uncased behavior — including unknown-token handling, ## continuation prefixes, and punctuation splitting — required extensive validation against reference outputs.

2. BIO tag → character offset mapping

A single word like "Visa" tokenizes to ["visa"] (one token), but "MasterCard" tokenizes to ["master", "##card"] (two tokens). The BIO label is on the first subword token. Reconstructing the character-level span requires tracking byte offsets through the tokenization step and back.
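Once each token carries the character range it came from, span reconstruction is a linear scan over the BIO labels. A hedged sketch with illustrative types (the actual `OutputDecoder` shapes are not shown in this write-up):

```kotlin
// Each token records the character range it covers in the original string,
// plus its BIO label ("B-X" opens an entity, "I-X" extends it, "O" closes it).
data class Tok(val start: Int, val end: Int, val label: String)
data class Span(val start: Int, val end: Int, val type: String)

fun toSpans(toks: List<Tok>): List<Span> {
    val spans = mutableListOf<Span>()
    var cur: Span? = null
    for (t in toks) {
        val c = cur
        when {
            t.label.startsWith("B-") -> {
                if (c != null) spans += c          // close any open entity
                cur = Span(t.start, t.end, t.label.removePrefix("B-"))
            }
            t.label.startsWith("I-") && c != null &&
                c.type == t.label.removePrefix("I-") ->
                cur = c.copy(end = t.end)          // extend entity over the subword
            else -> {                              // "O", or a stray I- with no open entity
                if (c != null) spans += c
                cur = null
            }
        }
    }
    cur?.let { spans += it }
    return spans
}
```

With "MasterCard" tokenized as `["master", "##card"]` labeled `B-CREDIT_CARD`, `I-CREDIT_CARD`, the two subword ranges merge into one character-level span covering the full surface form.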

3. Sub-120 ms end-to-end on a mid-range device

Achieving interactive latency required: (a) routing inference through the NPU via Melange's hardware backend, (b) keeping the model warm in memory via a ModelLifecycleManager foreground service, and (c) pre-allocating ByteBuffer instances to avoid GC pressure on the hot path.
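Point (c) can be sketched as below: buffers are allocated once at a fixed sequence length and refilled per inference. The sizes and int32 layout here are assumptions for illustration; the real buffers must match whatever dtypes and shapes the deployed model expects:

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Pre-allocated direct buffers for token IDs and attention mask, reused across
// inferences so the hot path allocates nothing and creates no GC pressure.
class InputBuffers(seqLen: Int) {
    val tokenIds: ByteBuffer = ByteBuffer
        .allocateDirect(seqLen * Int.SIZE_BYTES)
        .order(ByteOrder.nativeOrder())
    val attentionMask: ByteBuffer = ByteBuffer
        .allocateDirect(seqLen * Int.SIZE_BYTES)
        .order(ByteOrder.nativeOrder())

    fun fill(ids: IntArray) {
        tokenIds.clear(); attentionMask.clear()
        for (i in 0 until tokenIds.capacity() / Int.SIZE_BYTES) {
            tokenIds.putInt(ids.getOrElse(i) { 0 })           // pad with 0
            attentionMask.putInt(if (i < ids.size) 1 else 0)  // mask out padding
        }
        tokenIds.flip(); attentionMask.flip()
    }
}
```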

4. Zero network constraint

Stripping INTERNET with tools:node="remove" is a one-line manifest change, but it exposed that several transitive dependencies assumed network access for license checks, telemetry, or update polling. Auditing and suppressing these was non-trivial.
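The one-line change itself is the standard tools-namespace manifest-merger pattern:

```xml
<!-- AndroidManifest.xml: remove INTERNET even if a library manifest requests it -->
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    xmlns:tools="http://schemas.android.com/tools">
    <uses-permission
        android:name="android.permission.INTERNET"
        tools:node="remove" />
</manifest>
```

The merger then strips the permission from the final merged manifest, which is what surfaced the network-assuming transitive dependencies at runtime.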

Built With

  • actions
  • aes-256-gcm
  • android
  • compose
  • coroutines
  • github
  • jetpack
  • keystore
  • kotlin
  • material3
  • melange
  • mockk
  • robolectric
  • sdk
  • stateflow
  • team-zetic/textanonymizer