Now You See Me – Real-Time Vision–Language Object Tracking
Overview
Now You See Me is a real-time, intelligent, vision-based object tracking system that not only detects and tracks objects in a live video feed but also understands natural language instructions. The system combines Memory-Augmented Transformers, Vision–Language Models, and agent-based reasoning to allow users to interact with video streams using human-like commands.
Example: "Track this object for 3 minutes and alert me if it leaves the frame."
About the Project
Inspiration
The project was inspired by the idea of building a system that can see and understand simultaneously. Traditional object trackers work well but are passive—they cannot interpret human instructions.
The goal was to build a system where users could:
- Talk to the tracker
- Ask questions
- Give tasks
- Receive meaningful responses
This vision of a more human-like visual AI inspired Now You See Me.
What I Learned
Multi-Object Tracking (MOT)
- Understanding transformer-based tracking systems
- Working with object queries
- Handling challenges like ID switches, occlusion, and re-identification
Memory-Augmented Transformers (MeMOTR)
- Learning how long-term memory banks store historical embeddings
- Preserving identity across frames
- Using memory attention mechanisms
At the core of these memory attention mechanisms is scaled dot-product attention:
$$ \text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V $$
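To make the formula concrete, here is a minimal PyTorch sketch of scaled dot-product attention, with object queries attending over a bank of memory embeddings; the tensor shapes and dimensions are illustrative, not taken from the project.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (..., num_queries, num_keys)
    weights = F.softmax(scores, dim=-1)             # attention weights over the keys
    return weights @ v                              # weighted sum of the values

# Example: 8 object queries attending over a memory bank of 32 stored embeddings.
queries = torch.randn(1, 8, 256)
memory_bank = torch.randn(1, 32, 256)
out = scaled_dot_product_attention(queries, memory_bank, memory_bank)
print(out.shape)  # torch.Size([1, 8, 256])
```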
Vision–Language Models (VLMs)
Working with BLIP-2, Grounded-SAM, and LLaVA-Next taught me:
- Cross-modal alignment
- Instruction-based video interaction
- Visual question answering
Natural Language Understanding (NLU)
Using models like T5 or DistilGPT showed me how to:
- Extract intent
- Convert instructions into structured commands
- Enable task-level reasoning
How I Built the Project
Core Tracker – MeMOTR Architecture
- Used a Swin Transformer or MobileViT backbone for feature extraction
- Implemented object queries in the transformer decoder
- Added a long-term memory module
- Used memory cross-attention for stable identity tracking
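For concreteness, here is a simplified decoder-layer sketch showing where memory cross-attention sits relative to query self-attention and frame cross-attention; the class, dimensions, and layer ordering are illustrative, not the actual MeMOTR implementation.

```python
import torch
import torch.nn as nn

class MemoryDecoderLayer(nn.Module):
    """Illustrative decoder layer: object queries attend to frame features, then to the memory bank."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.memory_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

    def forward(self, queries, frame_feats, memory_bank):
        # Self-attention among object queries.
        q = self.norms[0](queries + self.self_attn(queries, queries, queries)[0])
        # Cross-attention to the current frame's encoded features.
        q = self.norms[1](q + self.frame_attn(q, frame_feats, frame_feats)[0])
        # Memory cross-attention: retrieve identity cues from stored embeddings.
        q = self.norms[2](q + self.memory_attn(q, memory_bank, memory_bank)[0])
        return self.norms[3](q + self.ffn(q))

# 8 object queries, 900 frame tokens, 32 memory slots, all 256-dimensional.
layer = MemoryDecoderLayer()
out = layer(torch.randn(1, 8, 256), torch.randn(1, 900, 256), torch.randn(1, 32, 256))
print(out.shape)  # torch.Size([1, 8, 256])
```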
Language-Driven Agent Layer
User commands are parsed into structured tasks, for example:
{
  "action": "track",
  "target": "selected_object",
  "duration": 180,
  "condition": "out_of_frame",
  "response": "alert"
}
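Once a command is in this form, the decision agent can act on it frame by frame. The sketch below assumes hypothetical `tracker.locate` and `notify` helpers and only illustrates how the fields map to behavior.

```python
import time
from dataclasses import dataclass

@dataclass
class Task:
    action: str      # e.g. "track"
    target: str      # e.g. "selected_object"
    duration: float  # seconds
    condition: str   # e.g. "out_of_frame"
    response: str    # e.g. "alert"

def run_task(task: Task, tracker, notify):
    """Illustrative decision-agent loop: watch the target until the duration expires."""
    start = time.time()
    while time.time() - start < task.duration:
        box = tracker.locate(task.target)  # hypothetical tracker API returning a box or None
        if task.condition == "out_of_frame" and box is None:
            if task.response == "alert":
                notify(f"{task.target} left the frame")
            break
        time.sleep(0.03)  # roughly once per frame at ~30 FPS

# Example: run_task(Task(**parsed_command), tracker=my_tracker, notify=print)
```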
Vision–Language Integration
- Visual encoding done with ViT/Swin
- Text instructions fused with visual context
- Used for both task execution and visual questions
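For the visual-question-answering path, here is a hedged example of what querying a frame with BLIP-2 can look like using the public Hugging Face checkpoints; the checkpoint name and prompt format follow the published BLIP-2 examples, not necessarily the project's exact integration.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Off-the-shelf BLIP-2 checkpoint (assumed for illustration); requires a CUDA GPU as written.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

def answer_visual_question(frame: Image.Image, question: str) -> str:
    inputs = processor(
        images=frame, text=f"Question: {question} Answer:", return_tensors="pt"
    ).to("cuda", torch.float16)
    out_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.batch_decode(out_ids, skip_special_tokens=True)[0].strip()

print(answer_visual_question(Image.open("frame.png"), "What color is the tracked car?"))
```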
Real-Time Pipeline
- Capture video frame
- Encode it using the vision backbone
- Transformer decoder predicts bounding boxes and identities
- Memory module updates past embeddings
- NLU agent parses instructions
- Decision agent executes tasks in real time
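A skeleton of this loop with OpenCV is sketched below; `encode_frame`, `track`, `update_memory`, and `agent_step` are placeholder stubs standing in for the components described above, not real APIs.

```python
import cv2

def encode_frame(frame):                 # placeholder for the Swin/MobileViT backbone
    return frame

def track(feats, memory):                # placeholder for the decoder + identity assignment
    return [], []                        # (boxes, ids)

def update_memory(memory, feats, ids):   # placeholder for the long-term memory update
    return memory

def agent_step(frame, boxes, ids):       # placeholder for NLU parsing + decision agent
    pass

memory = []
cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    feats = encode_frame(frame)          # encode the frame with the vision backbone
    boxes, ids = track(feats, memory)    # predict bounding boxes and identities
    memory = update_memory(memory, feats, ids)
    agent_step(frame, boxes, ids)        # parse instructions and execute active tasks
    for (x1, y1, x2, y2), obj_id in zip(boxes, ids):
        cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
        cv2.putText(frame, str(obj_id), (int(x1), int(y1) - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 1)
    cv2.imshow("Now You See Me", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```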
Challenges Faced
Occlusion & Re-Identification
Objects getting blocked or leaving the frame caused identity mismatches. Building a stable memory retrieval system was difficult.
Real-Time Performance
Transformers are computationally heavy, so achieving real-time speed required optimization.
Aligning Language With Vision
Mapping human instructions to specific objects required careful integration of visual and textual models.
Dataset & Annotation Issues
Preparing sequence-level tracking data was challenging and time-consuming.
Features
- Real-time multi-object tracking
- Memory-augmented identity preservation
- Natural-language commands
- Vision–language understanding
- Visual question answering
- Robust tracking under occlusion
Applications
- Smart surveillance
- Industrial safety
- Museum artifact monitoring
- Traffic and vehicle analysis
- Human–robot interaction
Tech Stack
- Transformers: MeMOTR
- Backbones: Swin Transformer, MobileViT
- VLM: BLIP-2, LLaVA-Next, Grounded-SAM
- NLU: T5, DistilGPT, GPT-4o-mini
- Languages: Python, PyTorch
- Tools: OpenCV, NumPy
Mathematical Foundations
Memory cross-attention:
$$ \text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V $$
Transformer object-query update:
$$ x_{t+1} = \text{Decoder}(x_t, E_t, M) $$
where $x_t$ are the object queries at frame $t$, $E_t$ the encoded features of the current frame, and $M$ the long-term memory bank.
Identity matching:
$$ \text{ID} = \arg\max_i \; \text{cosine\_similarity}(x_t, M_i) $$
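In code, this is an argmax over cosine similarities between the current query embedding and the memory bank; a minimal sketch with illustrative shapes:

```python
import torch
import torch.nn.functional as F

def match_identity(query_embed, memory_bank):
    """Return the index of the stored embedding most cosine-similar to the current query."""
    sims = F.cosine_similarity(query_embed.unsqueeze(0), memory_bank, dim=-1)  # (num_identities,)
    return int(torch.argmax(sims))

memory_bank = torch.randn(10, 256)                # one stored embedding per tracked identity
query = memory_bank[3] + 0.05 * torch.randn(256)  # a noisy re-observation of identity 3
print(match_identity(query, memory_bank))         # almost certainly prints 3
```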
Conclusion
Now You See Me is a multimodal AI system that:
- Sees
- Remembers
- Understands
- Responds
It represents a meaningful step toward AI systems that perceive and reason like humans.
Built With
- blip-2
- distilgpt
- gpt-4o-mini
- grounded-sam
- llava-next
- memotr
- mobilevit
- opencv
- python
- pytorch
- swin-transformer
- t5