
1. Executive Overview

We’re building a real-time intelligent visual tracker that can detect, identify, and remember a target object (like “red chair”) in a live video.

Unlike normal trackers that only follow what they see now, this one learns and refines what the object looks like over time.

Think of it as a human-like visual memory system — it keeps updating its understanding as new angles and lighting appear.

2. Initialization and Seed Embedding

2.1 User Input & Initialization

The user defines a target, e.g. “red chair” or “black backpack”, or selects an example object in the first frame. We extract two kinds of embeddings:

- Text embedding: from a multimodal model such as CLIP or LLaVA.
- Visual embedding: from the visual backbone of Grounded-DINO 2 or SAM 2, applied to the selected region.

The two are aligned in the same latent space, so both represent “what we’re looking for”.

2.2 The Seed Embedding

The seed embedding is the first unified representation of the target object. It is created by combining:

- the textual description (semantic meaning such as “chair”, “red”), and
- the visual features (color, texture, edges, spatial layout).

Mathematically, you can imagine:

$$E_{\text{seed}} = \alpha E_{\text{text}} + (1-\alpha) E_{\text{visual}}$$

where $\alpha$ balances language and vision confidence.
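As a minimal sketch of how the seed embedding could be produced, assuming CLIP is used for both the text and the visual encoder (the model name, crop logic, and default $\alpha$ below are illustrative choices, not fixed parts of the design):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative choice: one CLIP model provides both encoders, so the text
# and visual embeddings already live in the same latent space.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def seed_embedding(text: str, first_frame: Image.Image, box, alpha: float = 0.5):
    """Blend a text embedding and a visual embedding of the selected region."""
    with torch.no_grad():
        # Visual embedding: encode only the user-selected crop of the first frame.
        crop = first_frame.crop(box)  # box = (left, top, right, bottom)
        img_inputs = processor(images=crop, return_tensors="pt")
        e_visual = model.get_image_features(**img_inputs)

        # Text embedding: encode the target description, e.g. "red chair".
        txt_inputs = processor(text=[text], return_tensors="pt", padding=True)
        e_text = model.get_text_features(**txt_inputs)

    # Normalize both so the blend and later cosine comparisons are well behaved.
    e_visual = e_visual / e_visual.norm(dim=-1, keepdim=True)
    e_text = e_text / e_text.norm(dim=-1, keepdim=True)

    # E_seed = alpha * E_text + (1 - alpha) * E_visual
    e_seed = alpha * e_text + (1 - alpha) * e_visual
    return e_seed / e_seed.norm(dim=-1, keepdim=True)
```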

This seed embedding acts like a “prototype” of the object, the first fingerprint of what to search for. It is stored in memory and used to guide:

- the initial detection (Grounded-DINO uses it to localize the object), and
- the future tracking steps (as a reference template).

2.3 Initial Detection (Grounding Phase)

Once we have $E_{\text{seed}}$, the system runs Grounded-DINO 2 (or Grounding-SAM 2) on the first frame. The model receives the frame image and the text query or embedding $E_{\text{seed}}$, and it outputs bounding boxes with confidence scores for the target, plus visual embeddings for each detected region. The most confident detection becomes our anchor object: the initial position and visual reference for tracking.
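A minimal sketch of this grounding phase, assuming the detector is hidden behind a hypothetical `grounding_detect()` helper that returns parallel lists of boxes, scores, and region embeddings (the helper name and return format are ours, not a library API):

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

@dataclass
class Anchor:
    box: tuple              # (x1, y1, x2, y2) in pixel coordinates
    score: float            # detector confidence
    embedding: np.ndarray   # visual embedding of the detected region

def select_anchor(frame, query: str, grounding_detect) -> Optional[Anchor]:
    """Run open-vocabulary detection on the first frame and keep the
    most confident hit as the anchor object for tracking."""
    # grounding_detect is a hypothetical wrapper around Grounded-DINO 2 /
    # Grounding-SAM 2 that scores regions against the text query.
    boxes, scores, embeddings = grounding_detect(frame, query)
    if len(boxes) == 0:
        return None  # nothing in this frame matched the query
    best = int(np.argmax(scores))
    return Anchor(box=tuple(boxes[best]),
                  score=float(scores[best]),
                  embedding=embeddings[best])
```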

3. Compositional Object Memory (COM)

3.1 Memory Structure

A memory bank is created to store embeddings of the object as it changes. Each entry stores:

- the visual embedding (from the region in the new frame),
- the frame timestamp,
- the confidence, and
- viewpoint or other metadata (angle, size, etc.).

So memory = { (embedding₁, confidence₁), (embedding₂, confidence₂), ... }.
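A minimal sketch of such a memory bank, assuming embeddings are L2-normalized so that a dot product equals cosine similarity (the class names, thresholds, and blending weights below are illustrative):

```python
import time
from dataclasses import dataclass, field

import numpy as np

@dataclass
class MemoryEntry:
    embedding: np.ndarray   # L2-normalized visual embedding of the region
    confidence: float
    timestamp: float
    meta: dict = field(default_factory=dict)  # viewpoint, box size, etc.

class ObjectMemory:
    """Compositional Object Memory: stores distinct views of one target."""

    def __init__(self, new_view_threshold: float = 0.85, max_entries: int = 32):
        self.entries: list[MemoryEntry] = []
        self.new_view_threshold = new_view_threshold  # below this = new view
        self.max_entries = max_entries

    def best_match(self, embedding: np.ndarray) -> tuple[int, float]:
        """Return (index, cosine similarity) of the closest stored view."""
        if not self.entries:
            return -1, -1.0
        sims = [float(embedding @ e.embedding) for e in self.entries]
        idx = int(np.argmax(sims))
        return idx, sims[idx]

    def write(self, embedding: np.ndarray, confidence: float, meta: dict | None = None):
        """Add a new view, or reinforce an existing one by temporal averaging."""
        idx, sim = self.best_match(embedding)
        if sim < self.new_view_threshold:
            # Sufficiently different: treat as a new pose / lighting condition.
            if len(self.entries) < self.max_entries:
                self.entries.append(
                    MemoryEntry(embedding, confidence, time.time(), meta or {}))
        else:
            # Similar to a stored view: blend it in to keep memory compact.
            old = self.entries[idx]
            blended = 0.7 * old.embedding + 0.3 * embedding
            old.embedding = blended / np.linalg.norm(blended)
            old.confidence = max(old.confidence, confidence)
            old.timestamp = time.time()
```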

3.2 Real-Time Tracking Loop (Working Logic)

Every frame that arrives after initialization goes through this loop:

Step A: Tracker Prediction. A lightweight tracker (e.g. CTTrack) predicts where the object will be based on past motion, giving a predicted box and a confidence score.

Step B: Saliency Verification. A saliency map (from CLIP-Saliency or SAM 2 attention) checks which regions attract high attention given the text query. If the predicted box overlaps a salient region, tracking confidence increases; if not, re-detection via Grounded-DINO is triggered.

Step C: Embedding Update & Memory Write. Extract the visual embedding $E_{\text{frame}}$ from the detected box (using the same vision backbone) and compare it with the stored embeddings using cosine similarity. If $E_{\text{frame}}$ is too different, it represents a new view of the object and is added to memory; if it is similar, the older entry is updated by temporal averaging. This keeps memory compact yet adaptive.

Step D: Memory-Based Refinement. If tracking fails (e.g. during occlusion), the model scans the frame for regions similar to any stored embedding. This makes the system resilient: it can rediscover the target after disappearance or rotation.

Step E: Confidence Fusion. The final object confidence for the frame is computed as a weighted fusion:

$$C_{\text{final}} = w_1 C_{\text{tracker}} + w_2 C_{\text{saliency}} + w_3 C_{\text{memory}}$$

If this exceeds a threshold, the detection is accepted and displayed.
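The whole loop can be summarized in a short sketch. The `tracker`, `saliency`, `detector`, `memory`, and `encoder` objects below are hypothetical interfaces standing in for the real components, and the fusion weights and threshold are illustrative, not tuned values:

```python
# Illustrative fusion weights and acceptance threshold.
W_TRACKER, W_SALIENCY, W_MEMORY = 0.5, 0.2, 0.3
ACCEPT_THRESHOLD = 0.6

def process_frame(frame, tracker, saliency, detector, memory, encoder, query):
    # Step A: tracker predicts the next box from past motion.
    box, c_tracker = tracker.predict(frame)

    # Step B: verify the prediction against a query-conditioned saliency map.
    c_saliency = saliency.overlap(frame, box, query)   # overlap score in [0, 1]
    if c_saliency < 0.1:
        # Prediction fell in a non-salient region: fall back to re-detection
        # (this branch also covers Step D, memory-guided rediscovery).
        box, c_tracker = detector.redetect(frame, query)

    # Step C: embed the region, compare with memory, and write it back.
    e_frame = encoder.embed_region(frame, box)          # L2-normalized embedding
    _, c_memory = memory.best_match(e_frame)            # cosine similarity to memory
    memory.write(e_frame, confidence=c_tracker)

    # Step E: fuse the three confidence sources.
    c_final = W_TRACKER * c_tracker + W_SALIENCY * c_saliency + W_MEMORY * c_memory
    return (box, c_final) if c_final >= ACCEPT_THRESHOLD else (None, c_final)
```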

4. Real-Time Backend Flow (Inference Time)

Let’s break down the live loop (running at 30+ FPS):

1. Frame Capture: the video feed is preprocessed (resize, normalize, possibly convert to tensor).
2. Tracker Prediction: the last known state is used to predict the new location (fast, ~2 ms).
3. Saliency Calculation: quick attention map (~10 ms on GPU).
4. Memory Check: compare the predicted region embedding with memory and retrieve the nearest match using a FAISS vector index (fast cosine similarity search).
5. Re-detection Trigger: if confidence is low, run Grounded-DINO 2 (~50 ms); otherwise skip it.
6. Memory Update: if a new pose/view appears, add its embedding; otherwise reinforce the old one.
7. Visualization: draw the bounding box, confidence, and an optional saliency overlay.

Everything runs asynchronously: the tracker runs continuously, the detector runs conditionally, and memory retrieval runs in parallel threads using FAISS for sub-millisecond embedding lookups.
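A minimal sketch of the FAISS-backed memory lookup, assuming embeddings are L2-normalized so that inner product equals cosine similarity (the embedding dimension is an illustrative value):

```python
import faiss                 # pip install faiss-cpu (or faiss-gpu)
import numpy as np

DIM = 512  # illustrative embedding dimension (e.g. CLIP ViT-B/32)

# Inner-product index; with L2-normalized vectors this is cosine similarity.
index = faiss.IndexFlatIP(DIM)

def add_view(embedding: np.ndarray):
    """Store one L2-normalized view embedding of shape (DIM,)."""
    index.add(embedding.reshape(1, -1).astype(np.float32))

def nearest_view(embedding: np.ndarray):
    """Return (similarity, entry id) of the closest stored view, or (None, None)."""
    if index.ntotal == 0:
        return None, None
    sims, ids = index.search(embedding.reshape(1, -1).astype(np.float32), k=1)
    return float(sims[0, 0]), int(ids[0, 0])
```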

5. System Summary

5.1 Advantages of COM Integration

| Feature | Normal Tracker | COM-Based Tracker |
|---|---|---|
| Adaptation | Static template | Learns new poses & lighting |
| Occlusion recovery | Often fails | Recovered from memory bank |
| Long-term tracking | Weak | Strong (continual update) |
| Open vocabulary | Limited | Strong via Grounded-DINO |
| Efficiency | Fixed loop | Adaptive saliency schedule |
| Real-time | 20–30 FPS | 25–40 FPS (optimized) |

5.2 Backend Architecture Summary

| Module | Role | Model | Output |
|---|---|---|---|
| Input Preprocessor | Converts frame to tensor | OpenCV / CUDA | normalized frame |
| Target Encoder | Creates text & visual embeddings | CLIP / LLaVA | $E_{\text{seed}}$ |
| Detector | Initial grounding | Grounded-DINO 2 / SAM 2 | bbox + visual feature |
| Tracker | Predicts object motion | CTTrack / SMILETrack | next bbox |
| Saliency | Attention guidance | CLIP-Saliency / AIM | saliency mask |
| Memory Bank | Stores embeddings, retrieves nearest matches | FAISS | embedding set |
| Fusion Module | Combines confidences | Weighted formula | final bbox + score |
| Renderer | Visual overlay | OpenCV | real-time feed |

5.3 Intuitive Picture

You show the system a red chair. It builds an initial mental image (the seed embedding). It spots the chair (Grounded-DINO). It starts following it (CTTrack). Each time the lighting or angle changes, it remembers the new version in memory. If the chair disappears behind someone, the system uses memory recall to find it again. Saliency ensures it only works hard when necessary. All of this happens live, GPU-optimized and running asynchronously.
