VeilVault: Our Hackathon Story (TikTok TechJam 2025 · AI for Privacy)

TL;DR: We built a privacy pipeline that lets anyone tap to blur or unblur exactly what they want—faces, plates, emails, IDs—in images and video. Under the hood, it’s a three-stage system: (1) open-vocabulary visual grounding + segmentation, (2) OCR + PII detection, and (3) geo-adversarial noise to defeat location inference. For video, we engineered a perceptual-hash key-frame strategy plus CSRT tracking to avoid running heavy AI on every frame.


What inspired us

Every privacy tool we tried felt blunt: either blur everything (overkill) or miss what matters (underkill). We wanted user autonomy—a mask for each detected entity so the user can choose what to hide or reveal. That’s the core of VeilVault: precision (per-entity masks) and control (one-tap toggle), while keeping things fast enough to run on realistic hardware for long videos.


What we built

A three-stage pipeline

  1. See the scene: open-vocabulary visual grounding + segmentation. We use GroundingDINO to propose regions for prompts like “face,” “license plate,” “screen,” “ID card,” and “signboard.” Then SAM snaps these proposals into pixel-accurate masks. This anchors downstream text detection to real objects, not just letter-shaped specks.

  2. Read & reason: OCR + PII detection. We run OCR (Tesseract/EasyOCR, depending on config) to extract tokens and boxes, then Microsoft Presidio to classify spans (EMAIL_ADDRESS, PHONE_NUMBER, PERSON, CREDIT_CARD, LICENSE_PLATE, …). We reconcile spans ↔ tokens ↔ regions from Stage 1 and output only:

  • entity_name

  • mask (binary, same H×W as the input; PNG or COCO-RLE encoded)

Design choice: No raw text has to leave the server. Clients still get pixel-perfect control.

  3. Hide the world: geo-adversarial noise. Backgrounds can leak where you are (buildings, skylines, unique signs). We add light, targeted adversarial noise to location-revealing regions so state-of-the-art geolocation models stumble—without humans noticing.
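To make Stage 2's mask contract concrete: COCO-RLE is a column-major run-length encoding of the binary mask. Here is a minimal pure-NumPy sketch of that idea (illustrative only; a production pipeline would likely use pycocotools):

```python
import numpy as np

def rle_encode(mask: np.ndarray) -> dict:
    """Encode a binary HxW mask as COCO-style column-major run lengths.
    Runs alternate values starting with a zero-run, as in COCO-RLE."""
    flat = mask.flatten(order="F").astype(np.uint8)  # column-major, like COCO
    change = np.flatnonzero(np.diff(flat)) + 1       # positions where value flips
    bounds = np.concatenate(([0], change, [flat.size]))
    runs = np.diff(bounds).tolist()
    if flat[0] == 1:                                 # RLE must start with a zero-run
        runs = [0] + runs
    return {"size": list(mask.shape), "counts": runs}

def rle_decode(rle: dict) -> np.ndarray:
    h, w = rle["size"]
    flat = np.zeros(h * w, dtype=np.uint8)
    pos, val = 0, 0
    for run in rle["counts"]:
        if val:
            flat[pos:pos + run] = 1
        pos += run
        val ^= 1                                     # runs alternate 0/1
    return flat.reshape((h, w), order="F")

mask = np.zeros((4, 6), dtype=np.uint8)
mask[1:3, 2:5] = 1                                   # a small rectangular "entity"
rle = rle_encode(mask)
assert np.array_equal(rle_decode(rle), mask)         # lossless round trip
```

For large, mostly-empty masks this is far leaner on the wire than shipping raw pixels, which is why the contract allows RLE alongside PNG.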

Video without the pain: perceptual-hash key frames + CSRT tracking

Instead of processing every frame, we:

  1. Detect pivotal scene changes with perceptual image hashing. We compute a hash $h_i$ per frame (e.g., aHash, pHash, dHash, or wHash) and take a key frame when the Hamming distance to the previous frame’s hash exceeds a threshold:

$$ d_i \;=\; \operatorname{Hamming}(h_i, h_{i-1}); \quad \text{select frame } i \text{ if } d_i > \tau. $$

Perceptual hashes are designed so visually similar frames map to similar hashes—perfect for fast scene-change sampling. ([PyPI][1], [PyImageSearch][2], [vframe.io][3])
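To illustrate, here is a difference hash (dHash) written directly in NumPy rather than with the imagehash library we actually use; in the toy clip below, only frame 0 and the hard cut clear the threshold:

```python
import numpy as np

def dhash(frame: np.ndarray, size: int = 8) -> np.ndarray:
    """Difference hash: box-downsample to size x (size+1) grayscale cells,
    then record whether each cell is brighter than its left neighbour."""
    gray = frame.mean(axis=2) if frame.ndim == 3 else frame.astype(float)
    h, w = gray.shape
    ys = np.linspace(0, h, size + 1, dtype=int)
    xs = np.linspace(0, w, size + 2, dtype=int)
    small = np.array([[gray[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean()
                       for j in range(size + 1)] for i in range(size)])
    return (small[:, 1:] > small[:, :-1]).flatten()   # size*size boolean bits

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    return int(np.count_nonzero(a != b))

# Toy clip: three identical frames, then a hard cut (mirrored gradient).
ramp = np.tile(np.arange(64, dtype=np.uint8), (64, 1))
frames = [ramp, ramp, ramp, ramp[:, ::-1], ramp[:, ::-1]]
hashes = [dhash(f) for f in frames]
dists = [hamming(hashes[i], hashes[i - 1]) for i in range(1, len(frames))]
keys = [0] + [i for i in range(1, len(frames)) if dists[i - 1] > 10]  # tau = 10
```

Small motion or lighting drift barely moves the hash, while a cut flips most of its bits, which is exactly the separation the threshold $\tau$ exploits.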

  2. Track entities between sampled frames with the OpenCV CSRT tracker. Once entities are detected on a key frame, we follow them across intermediate frames using CSRT (Discriminative Correlation Filter with Channel & Spatial Reliability). This preserves IDs and masks while avoiding repeated heavy inference. ([docs.opencv.org][4], [learnopencv.com][5])

Net effect: we run the full 3-stage pipeline only when the perceptual hash signals a meaningful change, and rely on CSRT to bridge the gaps—cutting compute while keeping masks stable and responsive.


How we built it

  • Backend: FastAPI with background jobs; endpoints return just {entity_name, mask} per detection.
  • Vision: GroundingDINO prompts → SAM refinement → polygon masks.
  • Text & PII: OCR tokens/boxes → Presidio entities → span↔token alignment → mask clipping.
  • Video: imagehash-based key frames (Hamming distance threshold) → 3-stage run on key frames → CSRT tracking across in-between frames for continuity and reduced overhead. ([PyPI][1], [docs.opencv.org][4])
  • Performance: PNG/RLE mask encoding, micro-batch OCR where possible, strict response filtering to keep payloads lean.

  • Security: No need to return raw text; optional geo-noise on backgrounds.
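The “strict response filtering” above can be as simple as whitelisting the contract fields before serialization. A hypothetical sketch (the raw-detection field names other than `entity_name` and `mask` are illustrative, not our actual internals):

```python
from typing import Any

# The response contract: everything else (raw OCR text, confidences,
# model internals) is stripped before the payload leaves the server.
ALLOWED_FIELDS = ("entity_name", "mask")

def filter_response(detections: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Whitelist fields so raw text never ships to the client."""
    return [{k: d[k] for k in ALLOWED_FIELDS if k in d} for d in detections]

raw = [
    {"entity_name": "EMAIL_ADDRESS", "mask": "<png-bytes>", "text": "a@b.com",
     "confidence": 0.97},
    {"entity_name": "LICENSE_PLATE", "mask": "<png-bytes>", "text": "SGX1234A"},
]
lean = filter_response(raw)
assert all(set(d) == {"entity_name", "mask"} for d in lean)
assert not any("text" in d for d in lean)  # no raw PII in the payload
```

Whitelisting (rather than deleting known-bad fields) means a new debug field added upstream can never leak by default.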

Challenges we faced (and what we learned)

1) OCR that laughs at orientation

Challenge: Mobile photos are chaotic—rotations, perspective, slant, curved pages. OCR would detect scratches as letters or miss tilted text.

What we did

  • EXIF-aware rotation + fallback rotations $[0^\circ, 90^\circ, 180^\circ, 270^\circ]$.
  • Skew estimation using minAreaRect + deskew; slope heuristics for mild slants.
  • Token reconciliation: match Presidio spans to OCR tokens with tolerant alignment (token overlap / small Levenshtein window).
  • Line ordering: robust left-to-right, top-to-bottom sorting using y-cluster + x-sort, with a baseline proximity threshold to avoid cross-line merges.
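The span↔token reconciliation reduces to character-interval overlap between a PII span and per-token offsets (the tolerant Levenshtein window is omitted here for brevity; the token tuple layout is a hypothetical simplification, not Presidio's or the OCR engine's actual types):

```python
def align_span_to_tokens(span, tokens):
    """Return (text, box) for tokens whose character interval overlaps the span.

    `span` is (start, end) character offsets into the OCR'd text;
    `tokens` is a list of (text, start, end, box) tuples.
    """
    s_start, s_end = span
    matched = []
    for text, t_start, t_end, box in tokens:
        overlap = min(s_end, t_end) - max(s_start, t_start)
        if overlap > 0:  # any character overlap claims the token
            matched.append((text, box))
    return matched

# "Contact: alice@example.com today" with per-token offsets and boxes
tokens = [
    ("Contact:", 0, 8, (10, 10, 80, 24)),
    ("alice@example.com", 9, 26, (95, 10, 160, 24)),
    ("today", 27, 32, (260, 10, 50, 24)),
]
span = (9, 26)  # EMAIL_ADDRESS span reported by the PII detector
assert align_span_to_tokens(span, tokens) == [("alice@example.com", (95, 10, 160, 24))]
```

The matched boxes are then intersected with the Stage-1 object masks, which is what kills the “scratches read as letters” false positives.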

Lesson: Pairing object masks (Stage 1) with text spans reduces false positives drastically—letters live where the object says they should.


2) Running heavy AI on every video frame

Challenge: Running OCR + detectors per frame is prohibitively slow.

What we did

  • Perceptual-hash key frames so heavy AI runs only on pivotal frames. ([PyPI][1], [PyImageSearch][2])
  • CSRT tracking to propagate entity locations between key frames, avoiding redundant inference. ([docs.opencv.org][4])

Lesson: The winning recipe is not a bigger model—it’s less work per video by sampling with perceptual hashes and tracking the rest.


3) Figuring out the best key-frame policy

Challenge: Too sparse → miss events. Too dense → waste compute.

What we did

  • Tuned the Hamming-distance threshold $\tau$ on varied clips so that “obvious” scene changes are always captured.
  • Added a small minimum spacing to avoid bursty selections when content flickers.
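The resulting policy fits in a few lines; `tau` and `min_gap` are the two knobs we tuned (the values below are illustrative defaults, not our tuned settings):

```python
def select_key_frames(distances, tau=10, min_gap=5):
    """Choose key-frame indices from successive perceptual-hash distances.

    distances[i - 1] holds Hamming(h_i, h_{i-1}); frame 0 is always kept.
    A frame becomes a key frame when its distance exceeds tau AND it is at
    least min_gap frames after the previous key frame (anti-flicker spacing).
    """
    keys = [0]
    for i, d in enumerate(distances, start=1):
        if d > tau and i - keys[-1] >= min_gap:
            keys.append(i)
    return keys

# Flickery cut around frames 6-8 (selected once), clean cut at frame 20.
distances = [0, 1, 0, 2, 1, 30, 28, 31, 0, 1] + [0] * 9 + [40]
assert select_key_frames(distances) == [0, 6, 20]
```

Without `min_gap`, the flicker at frames 6–8 would trigger three full pipeline runs back to back; with it, one run plus CSRT tracking covers the burst.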

Lesson: A perceptual-hash policy is simple, explainable, and fast, and pairs naturally with tracker-based interpolation. ([PyPI][1], [vframe.io][3])


4) We didn’t have time for a pretty UI (and we’re okay with that)

Challenge: We focused on the pipeline and ran out of runway for a polished app.

What we shipped

  • A working backend with clean APIs and test scripts.
  • A minimal client that demonstrates tap-to-blur/unblur using returned masks.
  • Clear README + examples for quick replication.

Our hope: that the judging rewards innovation and engineering—the parts that are hardest to fake: the three-stage design, the mask contract, and the video speedups.


What we learned (big takeaways)

  • Precision masks > blanket filters. Autonomy beats automation for privacy UX.
  • Open-vocabulary grounding + PII semantics are complementary: objects explain where text should be; PII explains what it means.
  • Compute is a product problem. The best way to go faster was to do less: perceptual-hash sampling + tracking.
  • Security by design. Returning entity_name + mask avoids shipping raw text—less data, fewer risks.
  • Adversarial noise is practical when it’s targeted and subtle—we can frustrate location inference without ruining the photo.

Final thoughts

VeilVault is our attempt to turn privacy from a blunt instrument into a scalpel. It’s fast, precise, and—most importantly—under the user’s control. We’ll keep polishing the UI, but we’re proud that the hard parts—the ideas and the engineering—are already working.


Built With

  • ai-based-audio-processing
  • fastapi
  • geoclip
  • ocr
  • pii-detection
  • python
  • transformers
  • visual-grounding-ai-models