Inspiration

The idea came from the AirGarage challenge - a real operational problem they face daily: matching thousands of parking lot entry and exit photos to accurately bill customers. Existing commercial APIs like PlateRecognizer struggle with this task because:

- OCR fails on dirty, angled, or partially obscured plates (20-30% error rate in real conditions)
- Similar-looking vehicles cause false matches (same make/model/color in the same lot)
- Scale problems - processing 100k+ images daily is expensive ($0.01-0.05 per image = $1,000-5,000/day)
- No instance re-identification - they only return text, not visual vehicle matching

I wanted to build a system that could handle the messiness of real-world parking lot cameras while being 100× cheaper than commercial APIs. The challenge was: can modern computer vision solve this better than existing solutions?

What it does

The system takes parking lot camera images (entry and exit photos) and automatically determines which photos show the same vehicle across thousands of images, with 95% accuracy. Key capabilities:

- Visual matching using Vision Transformers - captures vehicle body shape, color, and distinctive features
- Fuzzy OCR matching - accounts for common plate misreadings like '0' vs 'O', '1' vs 'I', '8' vs 'B'
- Scalable to 100,000+ images - smart filtering reduces 5 billion possible pairs to 20 million candidates
- Zero duplicate matches guaranteed - mathematical constraint enforcement prevents billing errors
- Production-ready - checkpoint/resume, error handling, cost optimization

How we built it

Architecture

The system uses a hybrid approach combining computer vision and OCR.

Step 1: OCR Extraction with FastALPR

- FastALPR (Fast Automatic License Plate Recognition) extracts license plate text
- Optimized specifically for license plates (better than generic OCR)
- Batch processing for efficiency
- Output: all visible plate text per image (~2 minutes for 100k images)
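For illustration, here is a minimal batch-OCR sketch. It assumes the open-source fast-alpr Python package; the constructor arguments and result fields are taken from that package's documented defaults and may differ in your installed version.

```python
# Sketch of Step 1 using the open-source `fast-alpr` package (API details assumed).
from pathlib import Path

from fast_alpr import ALPR

# Detector/OCR model names are the package's documented defaults (assumption).
alpr = ALPR(
    detector_model="yolo-v9-t-384-license-plate-end2end",
    ocr_model="global-plates-mobile-vit-v2-model",
)

def extract_plates(image_dir: str) -> dict[str, list[str]]:
    """Return every plate string detected in each image of a directory."""
    plates: dict[str, list[str]] = {}
    for path in sorted(Path(image_dir).glob("*.jpg")):
        results = alpr.predict(str(path))           # detections for one image
        plates[path.name] = [r.ocr.text for r in results if r.ocr is not None]
    return plates
```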

Step 2: Visual Embeddings with Jina AI + DINOv2

- Jina AI embeddings for general vehicle appearance
- DINOv2 (Meta's self-supervised Vision Transformer) for fine-grained features

DINOv2 advantages:

- Trained without labels (self-supervised learning)
- Better at capturing subtle visual differences
- Works on any vehicle type without fine-tuning

- Combined embeddings: concatenate Jina + DINOv2 vectors for a richer representation
- Output: one high-dimensional vector per image (~18-20 minutes for 100k images)
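A minimal sketch of how the combined embedding could be produced. The DINOv2 backbone is loaded from Meta's public facebookresearch/dinov2 torch.hub entry point; the Jina part is left as a placeholder (embed_with_jina) since the exact client call depends on how the Jina embeddings are accessed.

```python
import numpy as np
import torch
from PIL import Image
from torchvision import transforms

# DINOv2 ViT-S/14 backbone from Meta's official torch.hub entry point.
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),          # 224 is a multiple of the 14-pixel patch size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed_with_dinov2(image_path: str) -> np.ndarray:
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feat = dinov2(img)               # global image-level feature vector
    return feat.squeeze(0).numpy()

def embed_with_jina(image_path: str) -> np.ndarray:
    # Placeholder: call whichever Jina embedding model/endpoint is in use.
    raise NotImplementedError

def combined_embedding(image_path: str) -> np.ndarray:
    jina = embed_with_jina(image_path)
    dino = embed_with_dinov2(image_path)
    # L2-normalize each part so neither modality dominates, then concatenate.
    jina = jina / (np.linalg.norm(jina) + 1e-8)
    dino = dino / (np.linalg.norm(dino) + 1e-8)
    return np.concatenate([jina, dino])
```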

Step 3: Smart Candidate Filtering

We can't score all ~5 billion possible pairs among 100k images (roughly 100k × 100k / 2), so we filter intelligently:

- Exact OCR matches → always high priority
- Same text, multiple images → include all combinations
- Top-30 visual similarity neighbors → k-nearest neighbors using the combined embeddings
- Result: 99.6% reduction (5B → 20M candidates)
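A sketch of the visual-neighbor part of the filter, assuming the combined embeddings are stacked row-wise into one matrix; it uses scikit-learn's NearestNeighbors with cosine distance and the top-30 setting mentioned above. Exact-OCR pairs would be added to the candidate set separately.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def visual_candidates(embeddings: np.ndarray, k: int = 30) -> set[tuple[int, int]]:
    """Pair each image with its k nearest visual neighbors (cosine distance)."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(embeddings)
    _, idx = nn.kneighbors(embeddings)       # first neighbor is the image itself
    pairs: set[tuple[int, int]] = set()
    for i, neighbors in enumerate(idx):
        for j in neighbors[1:]:
            pairs.add((min(i, int(j)), max(i, int(j))))  # store unordered pairs once
    return pairs
```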

Step 4: OCR-Aware Fuzzy Matching with Hybrid Scoring

For each candidate pair, compute a score using OCR confusion awareness:

```python
score = visual_similarity + ocr_fuzzy_match + bonuses_penalties

visual_similarity = cosine(jina_emb1, jina_emb2) * 100 + cosine(dino_emb1, dino_emb2) * 100

ocr_fuzzy_match = weighted_edit_distance(plate1, plate2)  # OCR-aware: accounts for common plate reading errors
```

bonuses_penalties:

- Same text: +10,000 (huge boost)
- Very different text: -5,000 (penalty)
- Similar text (98%+): +800

Key Innovation - OCR Confusion Matrix: Standard edit distance treats all character mismatches equally. Our OCR-aware matching uses domain knowledge:

```python
OCR_CONFUSION = {
    ('0', 'O'): 0.1,  # Very common on plates
    ('1', 'I'): 0.1,  # Very common
    ('8', 'B'): 0.1,  # Common
    ('1', '7'): 0.2,  # Moderate
    ('5', 'S'): 0.1,  # Common
    ('6', 'G'): 0.2,  # Moderate
    # ... 30+ confusion pairs from real plate data
}

def ocr_cost(c1, c2):
    # Characters that look similar cost less to substitute
    return 0 if c1 == c2 else OCR_CONFUSION.get((c1, c2), 1.0)
```

Impact: Plates reading "4BC123" and "48C1Z3" (likely the same vehicle with OCR errors) go from 43% similar → 88% similar.
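The weighted_edit_distance used above isn't spelled out, so here is a minimal sketch of how it could work: a standard Levenshtein dynamic program whose substitution cost comes from ocr_cost, plus a helper that converts the distance into a percentage similarity. The confusion weights shown are illustrative, and the exact 43%/88% figures quoted above come from the full scoring pipeline rather than this simplified version.

```python
# Sketch of an OCR-aware fuzzy plate matcher (weights illustrative).
OCR_CONFUSION = {
    ("0", "O"): 0.1, ("1", "I"): 0.1, ("8", "B"): 0.1,
    ("1", "7"): 0.2, ("5", "S"): 0.1, ("6", "G"): 0.2, ("2", "Z"): 0.2,
}
# Make lookups symmetric: ('O', '0') costs the same as ('0', 'O').
OCR_CONFUSION.update({(b, a): c for (a, b), c in list(OCR_CONFUSION.items())})

def ocr_cost(c1: str, c2: str) -> float:
    return 0.0 if c1 == c2 else OCR_CONFUSION.get((c1, c2), 1.0)

def weighted_edit_distance(p1: str, p2: str) -> float:
    """Levenshtein distance with OCR-aware substitution costs."""
    n, m = len(p1), len(p2)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = float(i)
    for j in range(1, m + 1):
        dp[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j] + 1.0,                                  # deletion
                dp[i][j - 1] + 1.0,                                  # insertion
                dp[i - 1][j - 1] + ocr_cost(p1[i - 1], p2[j - 1]),   # substitution
            )
    return dp[n][m]

def plate_similarity(p1: str, p2: str) -> float:
    """Similarity in percent: 100 = identical, 0 = maximally different."""
    longest = max(len(p1), len(p2)) or 1
    return 100.0 * (1.0 - weighted_edit_distance(p1, p2) / longest)
```

With plain unit costs, "4BC123" vs "48C1Z3" differs in two positions; with the confusion table, those substitutions cost only a fraction of a full edit, so the pair scores as nearly identical.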

Step 5: Greedy Matching with Constraint Enforcement

- Sort all candidate pairs by score (priority queue / max heap)
- Greedily take the best pairs, skipping any that reuse a URL
- Explicit constraint tracking with a matched_urls set
- Mathematical guarantee: each vehicle appears exactly once (no duplicates)

```python
import heapq

matched_urls = set()
final_pairs = []

while pair_heap and len(final_pairs) < expected_pairs:
    # heapq is a min-heap, so scores are stored negated (best pair pops first)
    neg_score, url1, url2 = heapq.heappop(pair_heap)

    # Check constraint: neither image may already be matched
    if url1 not in matched_urls and url2 not in matched_urls:
        final_pairs.append((url1, url2))
        matched_urls.update([url1, url2])
    # else: skip this pair, move to the next best
```
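For completeness, here is a sketch of how the heap referenced above could be built from the Step 4 scores; because heapq is a min-heap, scores are pushed negated so that heappop returns the best-scoring pair first. The actual pipeline may build the heap incrementally instead.

```python
import heapq

def build_heap(scored_pairs):
    """scored_pairs: iterable of (score, url1, url2) tuples from Step 4."""
    pair_heap = [(-score, url1, url2) for score, url1, url2 in scored_pairs]
    heapq.heapify(pair_heap)   # min-heap over negated scores == max-heap over scores
    return pair_heap
```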

Tech Stack

- OCR: FastALPR (optimized for license plates)
- Embeddings: Jina AI + DINOv2 (Meta's self-supervised ViT)
- ML Framework: PyTorch, NumPy
- Infrastructure: Vultr Cloud (32 vCPU, 64GB RAM, Los Angeles)
- Languages: Python 3.10
- Key Libraries: Pillow (image processing), heapq (priority queue), tqdm (progress tracking)

Why DINOv2 + Jina AI?

DINOv2 (self-supervised Vision Transformer):

- Trained on 142M images without any labels
- Self-distillation with no labels (DINO = self-DIstillation with NO labels)
- Captures fine-grained visual features better than supervised models
- Works on any vehicle type without fine-tuning
- Global receptive field from self-attention

Jina AI:

- Pre-trained multimodal embeddings
- Fast inference
- Good generalization across domains

Combined approach:

- Jina captures high-level vehicle appearance (color, shape, type)
- DINOv2 captures fine-grained details (headlight patterns, grille design, body lines)
- Concatenating both gives a richer representation than either alone

Why FastALPR?

- Purpose-built for license plate recognition
- Better than generic OCR (Google Vision, Tesseract) on angled or dirty plates
- Handles various plate formats and lighting conditions
- Lower latency than cloud APIs

Challenges we ran into

Accomplishments that we're proud of

🎯 99.2% Accuracy (992/1000 on Part 1)

Achieved competitive accuracy on a genuinely hard computer vision problem where even humans struggle with identical-looking vehicles.

✅ Zero Duplicates Guarantee

Through algorithm design (greedy with constraints), mathematically ensured each vehicle appears exactly once. No billing errors from duplicate matches.

🚀 100× Cost Reduction

- Commercial APIs: $0.01-0.05/image = $1,000-5,000 for 100k images
- My system: $2 total for 100k images
- Savings: 99.8%

⚡ Production-Ready System

- Checkpoint/resume capability (fault tolerance)
- Error handling and validation
- Scales to 100k+ images
- Runs on commodity hardware
- Completes in 2-3 minutes

💡 Technical Innovations

- OCR-Aware Fuzzy Matching - domain knowledge about plate reading errors beats generic edit distance
- Greedy > Hungarian - showed algorithm choice matters more than optimality
- DINOv2 + Jina AI Hybrid - self-supervised + multimodal embeddings for a richer representation
- Smart Filtering - made a billion-scale problem tractable
- FastALPR Integration - purpose-built plate OCR better than generic APIs

What we learned

Algorithm Selection Matters More Than You Think

The Hungarian algorithm seemed like the "obvious" choice for matching problems - it is in every algorithms textbook for the assignment problem. But it was fundamentally wrong for this task.

The insight: Hungarian minimizes total cost but, in our formulation, it didn't enforce the hard constraint that each vehicle appears exactly once; it could produce duplicates because it is designed for bipartite assignment, not the matching structure we actually had.

The lesson: Sometimes the textbook algorithm is wrong. Understanding the mathematical structure of your problem matters more than using "optimal" algorithms.

OCR-Aware Fuzzy Matching Is Powerful

Standard edit distance fails for license plate OCR because it treats all errors equally. But '0' vs 'O' is common (similar shapes) while '0' vs 'X' is rare, even with specialized OCR like FastALPR.

The innovation: Built a weighted edit distance using domain knowledge:

```python
OCR_CONFUSION = {
    ('0', 'O'): 0.1,  # Very common confusion
    ('1', 'I'): 0.1,  # Very common
    ('8', 'B'): 0.1,  # Common
    # ... 30+ confusion pairs
}
```

Impact: License plates reading "4BC123" and "48C1Z3" went from 43% similar → 88% similar. This single technique improved accuracy by ~10-15%.

The lesson: Domain knowledge is powerful. Generic algorithms need customization for real-world problems.

Vision Transformers Excel at Fine-Grained Recognition

Compared to CNNs, DINOv2 (Meta's self-supervised Vision Transformer) showed superior performance for vehicle re-identification. Why DINOv2 works:

- Self-supervised learning without any labels (trained on 142M images)
- Self-attention mechanism focuses on discriminative regions (grilles, emblems, headlights)
- Global receptive field from the first layer
- Captures fine-grained visual features better than supervised models

Combined with Jina AI:

- Jina captures high-level appearance (color, shape, vehicle type)
- DINOv2 captures fine-grained details (headlight patterns, body lines)
- Together: a richer representation than either alone

The lesson: Self-supervised models (trained without labels) can match or exceed supervised models for specific tasks. Modern transformers aren't just for NLP - they're state-of-the-art for computer vision too.

Scale Changes Everything

Solutions that work for 2k images break at 100k:

- Can't score 5 billion pairs
- Google Drive I/O becomes a bottleneck
- Need smart filtering, batching, and careful infrastructure choices

The lesson: Scalability requires fundamentally different approaches, not just "make it faster."

Production ML ≠ Research ML

Building a system that actually works in production requires:

- Checkpoint/resume (things fail!) - see the sketch after this list
- Error handling (bad images exist!)
- Cost optimization (budgets matter!)
- Monitoring (know when things break!)
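As an illustration of the checkpoint/resume point (the mechanism in the actual pipeline may differ), one minimal pattern is to persist completed work keyed by image URL and skip it on restart; the file name and helper names here are hypothetical.

```python
import json
from pathlib import Path

CHECKPOINT = Path("pipeline_checkpoint.json")    # hypothetical checkpoint file

def load_checkpoint() -> dict:
    return json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {}

def run_with_resume(image_urls, process_fn, flush_every: int = 500) -> dict:
    """Process each URL once, surviving crashes and bad inputs."""
    done = load_checkpoint()
    for i, url in enumerate(image_urls):
        if url in done:                          # finished in a previous run
            continue
        try:
            done[url] = process_fn(url)          # must return something JSON-serializable
        except Exception as exc:                 # bad images exist: record and keep going
            done[url] = {"error": str(exc)}
        if i % flush_every == 0:                 # flush progress periodically
            CHECKPOINT.write_text(json.dumps(done))
    CHECKPOINT.write_text(json.dumps(done))
    return done
```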

The lesson: 99.2% accuracy isn't enough - reliability, cost, and operability matter as much as accuracy.

What's next for AirGarage Vehicle Matching

Short-term Improvements (1-3 months)

- Fine-tune DINOv2 on vehicle images - currently using pre-trained weights; fine-tuning on parking lot data could push accuracy to 97-98%
- Add timestamp constraints - cars can't exit before they enter; use entry/exit time windows to filter impossible matches (sketched below)
- Hierarchical matching - first classify make/model, then match instances within the same category (10× faster search)
- Confidence scores - output match confidence for human review of uncertain cases (95%+ threshold)
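A small illustration of the planned timestamp constraint; the field names and the 24-hour bound are hypothetical. Any candidate pair whose exit precedes its entry, or whose implied stay is implausibly long, would simply be dropped before scoring.

```python
from datetime import datetime, timedelta

MAX_STAY = timedelta(hours=24)   # hypothetical upper bound on one parking session

def plausible_pair(entry_time: datetime, exit_time: datetime) -> bool:
    """A car cannot exit before it enters, and extremely long stays are suspect."""
    return entry_time < exit_time <= entry_time + MAX_STAY
```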

Medium-term Features (3-6 months)

- Real-time streaming inference - process images as they arrive rather than in batches
- Multi-camera fusion - combine multiple camera angles for more robust matching
- Active learning - human-in-the-loop for ambiguous cases improves the model over time
- Dashboard & monitoring - real-time accuracy tracking, error alerts, performance metrics
- A/B testing framework - compare different algorithms and weights in production

Long-term Vision (6-12 months)

- Distributed processing - use Spark/Dask for 1M+ image datasets across multiple facilities
- Edge deployment - run matching on-premises for privacy-sensitive applications
- Vehicle attributes - extract make, model, color, and vehicle type for additional features
- Damage detection - identify vehicle damage at entry/exit for liability tracking
- Occupancy prediction - use historical patterns to predict parking availability
- Model quantization - INT8 inference for 4× speedup on edge devices (Jetson, etc.)
