Now You See Me – Real-Time Vision–Language Object Tracking

Overview

Now You See Me is a real-time, intelligent, vision-based object tracking system that not only detects and tracks objects in a live video feed but also understands natural language instructions. The system combines Memory-Augmented Transformers, Vision–Language Models, and agent-based reasoning to allow users to interact with video streams using human-like commands.

Example: "Track this object for 3 minutes and alert me if it leaves the frame."


About the Project

Inspiration

The project was inspired by the idea of building a system that can see and understand simultaneously. Traditional object trackers work well, but they are passive: they cannot interpret human instructions.

The goal was to build a system where users could:

  • Talk to the tracker
  • Ask questions
  • Give tasks
  • Receive meaningful responses

This vision of a more human-like visual AI inspired Now You See Me.


What I Learned

Multi-Object Tracking (MOT)

  • Understanding transformer-based tracking systems
  • Working with object queries
  • Handling challenges like ID switches, occlusion, and re-identification

Memory-Augmented Transformers (MeMOTR)

  • Learning how long-term memory banks store historical embeddings
  • Preserving identity across frames
  • Using memory attention mechanisms

Central to these memory attention mechanisms is the scaled dot-product attention formula:

$$ \text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V $$
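As a concrete reference, here is a minimal PyTorch sketch of scaled dot-product attention; the shapes are illustrative and not taken from the MeMOTR codebase:

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # similarity logits
    return F.softmax(scores, dim=-1) @ V           # weighted sum of values

# Example: 8 object queries attending to a 64-slot memory bank
Q = torch.randn(8, 256)
K = torch.randn(64, 256)
V = torch.randn(64, 256)
out = scaled_dot_product_attention(Q, K, V)  # shape: (8, 256)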

Vision–Language Models (VLMs)

Studying BLIP-2, Grounded-SAM, and LLaVA-Next taught me:

  • Cross-modal alignment
  • Instruction-based video interaction
  • Visual question answering

Natural Language Understanding (NLU)

Using models like T5 or DistilGPT showed me how to:

  • Extract intent
  • Convert instructions into structured commands
  • Enable task-level reasoning

How I Built the Project

Core Tracker – MeMOTR Architecture

  • Used a Swin Transformer or MobileViT backbone for feature extraction
  • Implemented object queries in the transformer decoder
  • Added a long-term memory module
  • Used memory cross-attention for stable identity tracking
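A simplified sketch of the memory cross-attention step, built on PyTorch's nn.MultiheadAttention rather than MeMOTR's own implementation (dimensions are illustrative):

import torch
import torch.nn as nn

class MemoryCrossAttention(nn.Module):
    # Object queries attend to a long-term memory bank of past embeddings.
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, queries, memory):
        # queries: (batch, n_queries, d_model) current object queries
        # memory:  (batch, n_slots, d_model) stored identity embeddings
        attended, _ = self.attn(queries, memory, memory)
        return self.norm(queries + attended)  # residual update

layer = MemoryCrossAttention()
queries = torch.randn(1, 8, 256)  # 8 tracked objects
memory = torch.randn(1, 64, 256)  # 64 memory slots
updated = layer(queries, memory)  # identity-aware queries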

Language-Driven Agent Layer

User commands are parsed into structured tasks, for example:

{
  "action": "track",
  "target": "selected_object",
  "duration": 180,
  "condition": "out_of_frame",
  "response": "alert"
}
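On the implementation side, one natural way to carry these tasks through the pipeline is a small dataclass mirroring the JSON schema (the class itself is an illustrative assumption, not the project's exact code):

import json
from dataclasses import dataclass

@dataclass
class Task:
    action: str     # e.g. "track"
    target: str     # e.g. "selected_object"
    duration: int   # seconds
    condition: str  # trigger, e.g. "out_of_frame"
    response: str   # e.g. "alert"

raw = '{"action": "track", "target": "selected_object", "duration": 180, "condition": "out_of_frame", "response": "alert"}'
task = Task(**json.loads(raw))  # typed access, e.g. task.duration == 180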

Vision–Language Integration

  • Visual encoding done with ViT/Swin
  • Text instructions fused with visual context
  • Used for both task execution and visual questions
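For the visual-question-answering path, a minimal sketch with the Hugging Face BLIP-2 interface (the model choice and prompt format are illustrative, not the project's exact setup; assumes a CUDA device):

import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

frame = Image.open("frame.jpg")  # a captured video frame
question = "Question: What color is the tracked car? Answer:"
inputs = processor(images=frame, text=question, return_tensors="pt").to("cuda", torch.float16)
answer = processor.decode(model.generate(**inputs)[0], skip_special_tokens=True)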

Real-Time Pipeline

  1. Capture video frame
  2. Encode it using the vision backbone
  3. Transformer decoder predicts bounding boxes and identities
  4. Memory module updates past embeddings
  5. NLU agent parses instructions
  6. Decision agent executes tasks in real time
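A skeleton of that loop with OpenCV; the encode/track/memory/agent functions here are hypothetical stubs standing in for the modules described above:

import cv2

def encode(frame): return frame             # 2. vision backbone (stub)
def track(features): return [], []          # 3. boxes and identities (stub)
def update_memory(ids, features): pass      # 4. long-term memory bank (stub)
def run_agent(boxes, ids): pass             # 5-6. NLU + decision agents (stub)

cap = cv2.VideoCapture(0)                   # 1. capture the live feed
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    features = encode(frame)
    boxes, ids = track(features)
    update_memory(ids, features)
    run_agent(boxes, ids)
    if cv2.waitKey(1) & 0xFF == ord("q"):   # press q to stop
        break
cap.release()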

Challenges Faced

Occlusion & Re-Identification

Objects being occluded or leaving the frame caused identity mismatches, and building a memory-retrieval system stable enough to recover those identities was difficult.

Real-Time Performance

Transformers are computationally heavy, so achieving real-time speed required optimization.

Aligning Language With Vision

Mapping human instructions to specific objects required careful integration of visual and textual models.

Dataset & Annotation Issues

Preparing sequence-level tracking data was challenging and time-consuming.


Features

  • Real-time multi-object tracking
  • Memory-augmented identity preservation
  • Natural-language commands
  • Vision–language understanding
  • Visual question answering
  • Robust tracking under occlusion

Applications

  • Smart surveillance
  • Industrial safety
  • Museum artifact monitoring
  • Traffic and vehicle analysis
  • Human–robot interaction

Tech Stack

  • Transformers: MeMOTR
  • Backbones: Swin Transformer, MobileViT
  • VLM: BLIP-2, LLaVA-Next, Grounded-SAM
  • NLU: T5, DistilGPT, GPT-4o-mini
  • Languages: Python, PyTorch
  • Tools: OpenCV, NumPy

Mathematical Foundations

Memory cross-attention:

$$ \text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V $$

Transformer object-query update, where $x_t$ denotes the object queries at frame $t$, $E_t$ the encoded frame features, and $M$ the memory bank:

$$ x_{t+1} = \text{Decoder}(x_t, E_t, M) $$

Identity matching:

$$ \text{ID} = \arg\max_i \; \frac{x_t^\top M_i}{\lVert x_t \rVert \, \lVert M_i \rVert} $$
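In code, this matching step reduces to an argmax over cosine similarities against the memory bank (a minimal sketch; the actual association logic is more involved):

import torch
import torch.nn.functional as F

memory = F.normalize(torch.randn(64, 256), dim=-1)  # memory bank M, one row per identity
query = F.normalize(torch.randn(256), dim=-1)       # current object embedding x_t
object_id = torch.argmax(memory @ query).item()     # index of the best-matching identity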


Conclusion

Now You See Me is a multimodal AI system that:

  • Sees
  • Remembers
  • Understands
  • Responds

It represents a meaningful step toward AI systems that perceive and reason like humans.

Built With

  • blip-2
  • distilgpt
  • gpt-4o-mini
  • grounded-sam
  • llava-next
  • memotr
  • mobilevit
  • opencv
  • python
  • pytorch
  • swin-transformer
  • t5