Probabilistic Vehicle Tracking

Inspiration

Modern cities have dense networks of traffic cameras, but most systems treat each feed independently. I wanted to explore whether it is possible to infer vehicle movement across disconnected camera views without relying on explicit identifiers such as license plates. The core motivation was to model identity as something uncertain and probabilistic rather than fixed.

I was also interested in whether computer vision embeddings could be used not just for detection, but for reasoning about continuity across space and time in visual data.

Although tracking of this kind raises privacy concerns for drivers, it can be very useful for urban safety and efficiency. For instance, in hit-and-run cases where no license plate was captured, the system could reconstruct the most likely flight path of a vehicle from its visual profile, aiding emergency response without requiring a city-wide database of drivers.

It could also help urban planners conduct detailed origin-destination studies aimed at reducing traffic congestion and carbon emissions. Because matching relies on abstract embedding vectors rather than high-resolution imagery of individuals, the system can provide the insights needed for a "smart city" while keeping a layer of digital anonymity that traditional tracking systems do not offer.

What it does

Probabilistic Vehicle Tracking detects vehicles in traffic videos, tracks them locally within each scene using temporary IDs, and then estimates how likely it is that vehicles observed in different scenes correspond to the same underlying entity.
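
To make the local tracking step concrete, here is a minimal sketch of greedy nearest-centroid tracking with temporary IDs. It is an illustration rather than the project's actual code: the `Tracklet` and `CentroidTracker` names, the `max_distance` gating radius, and the `max_missed` patience value are all hypothetical, and boxes are assumed to be `(x, y, w, h)` tuples in pixels.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Tracklet:
    """Short-term trajectory of one vehicle within a single video."""
    track_id: int
    boxes: list = field(default_factory=list)  # one (x, y, w, h) box per matched frame
    missed: int = 0                            # consecutive frames without a match

    def centroid(self):
        x, y, w, h = self.boxes[-1]
        return (x + w / 2, y + h / 2)


class CentroidTracker:
    def __init__(self, max_distance=50.0, max_missed=10):
        self.max_distance = max_distance  # hypothetical gating radius in pixels
        self.max_missed = max_missed      # hypothetical patience before a track is dropped
        self.next_id = 0
        self.active = {}                  # track_id -> Tracklet

    def update(self, detections):
        """Match this frame's boxes to active tracklets and return {track_id: box}."""
        # Collect every (distance, tracklet, detection) pair inside the gating radius.
        pairs = []
        for tid, tracklet in self.active.items():
            cx, cy = tracklet.centroid()
            for i, (x, y, w, h) in enumerate(detections):
                dist = math.hypot(cx - (x + w / 2), cy - (y + h / 2))
                if dist <= self.max_distance:
                    pairs.append((dist, tid, i))
        pairs.sort()  # closest pairs are assigned first (greedy matching)

        assigned, used_tracks, used_dets = {}, set(), set()
        for _, tid, i in pairs:
            if tid in used_tracks or i in used_dets:
                continue
            self.active[tid].boxes.append(detections[i])
            self.active[tid].missed = 0
            used_tracks.add(tid)
            used_dets.add(i)
            assigned[tid] = detections[i]

        # Age unmatched tracklets and drop the ones missing for too long.
        for tid in list(self.active):
            if tid not in used_tracks:
                self.active[tid].missed += 1
                if self.active[tid].missed > self.max_missed:
                    del self.active[tid]

        # Every unmatched detection starts a new tracklet with a fresh temporary ID.
        for i, box in enumerate(detections):
            if i not in used_dets:
                self.active[self.next_id] = Tracklet(self.next_id, [box])
                assigned[self.next_id] = box
                self.next_id += 1
        return assigned
```

Calling `update` once per frame with that frame's detections returns the temporary ID assigned to each box. These IDs only have meaning within a single video, which is why cross-scene identity has to be estimated separately.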

Instead of assigning absolute identities, the system produces probability distributions over possible matches between vehicle tracklets across different camera views. These probabilities are then visualized as connections on a map-based interface.

How we built it

The system is built in Python and leverages AI models for key tasks: a YOLO-based model for vehicle detection and CLIP embeddings from Hugging Face Transformers for visual representation. Local tracking is handled with OpenCV and NumPy, while Ursina is used for map-based visualization.

The pipeline consists of:

1. Vehicle Detection: Vehicles are detected in each frame using a YOLO-based model through OpenCV’s DNN module.
2. Local Tracking: A centroid-based tracking method assigns temporary IDs to vehicles within each video stream, forming short-term trajectories (tracklets).
3. Tracklet Construction: Each tracklet stores bounding box history and cropped vehicle images over time.
4. Visual Embeddings: CLIP is used to convert vehicle crops into embedding vectors representing appearance.
5. Cross-Scene Matching: Tracklets from different scenes are compared using cosine similarity over embeddings, combined with simple visual heuristics to produce match scores.
6. Probabilistic Modeling: Match scores are normalized into probabilities using a softmax-style transformation, producing a ranked set of likely correspondences.
7. Map-Based Representation: An Ursina interface renders intersections as nodes, with probabilistic links representing likely vehicle transitions between locations.
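
The matching and normalization steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's implementation: each tracklet is assumed to be summarized by a single embedding vector (for example, the mean of its CLIP crop embeddings), the additional visual heuristics are left out, and the function names and the `temperature` parameter are hypothetical. Plain Python lists stand in for embedding arrays to keep the example self-contained.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def softmax(scores, temperature=0.1):
    """Turn raw scores into probabilities that sum to 1."""
    if not scores:
        return []
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def match_probabilities(query, candidates, temperature=0.1):
    """Rank candidate tracklets from another scene for one query tracklet.

    query:      embedding of the tracklet seen in scene A
    candidates: {tracklet_id: embedding} for tracklets seen in scene B
    Returns [(tracklet_id, probability)] sorted from most to least likely.
    """
    ids = list(candidates)
    scores = [cosine_similarity(query, candidates[c]) for c in ids]
    probs = softmax(scores, temperature)
    return sorted(zip(ids, probs), key=lambda pair: pair[1], reverse=True)
```

The temperature controls how decisive the result is: a lower value concentrates probability on the best-scoring candidate, while a higher value spreads it across candidates and keeps more of the ambiguity visible.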

Challenges we ran into

A key challenge was the lack of ground-truth identity across different camera views. Without license plates or synchronized camera networks, identity must be inferred indirectly, which introduces ambiguity. Another difficulty was designing a system that balances tracking stability within a single video with uncertainty across multiple videos. Local tracking is relatively reliable, but cross-scene matching is inherently noisy. Computational efficiency was also a constraint, because detection, embedding extraction, and matching all run in the same pipeline.

Another practical challenge we ran into was the user interface. Ursina, while powerful for 3D visualization, proved tricky to work with for more complex interactions. Its event handling and UI layout system sometimes felt unintuitive, making even straightforward interface elements harder to implement.

Integrating Ursina with OpenCV added another layer of complexity. OpenCV handles real-time video frames efficiently, but combining its 2D frame display with Ursina’s 3D scene required careful synchronization. We had to manage differences in rendering pipelines, frame rates, and input handling between the two libraries, which significantly increased development overhead. Balancing real-time performance with a usable interface became a major pain point.

Accomplishments that we're proud of

A major accomplishment was building a full pipeline that links detection, tracking, similarity-based matching, and probabilistic reasoning into a single working system.

Rather than assigning a single fixed identity to each vehicle, the system maintains a distribution over likely identities, which makes the uncertainty explicit.

Another achievement was adding a visual layer that shows these probability estimates on a map or scene, making the abstract data easier to interpret.
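
One way the tracklet-level probabilities could be turned into map links is to sum them for each pair of locations and hide the weakest links. The sketch below assumes that approach for illustration only; `build_links`, the `min_weight` threshold, and the `(source, destination, probability)` input format are hypothetical and not taken from the project's code.

```python
from collections import defaultdict


def build_links(matches, min_weight=0.05):
    """Aggregate match probabilities into weighted links between locations.

    matches: iterable of (source_location, destination_location, probability)
    Returns {(source, destination): summed probability}, keeping only links
    whose total weight is at least min_weight.
    """
    totals = defaultdict(float)
    for src, dst, prob in matches:
        totals[(src, dst)] += prob
    return {edge: weight for edge, weight in totals.items() if weight >= min_weight}
```

Under this scheme a link's weight can be read roughly as the expected number of vehicles that moved between the two locations, so it could drive the thickness or opacity of the line drawn on the map.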

What we learned

I learned that vehicle identity in visual systems is often uncertain and that assuming a single fixed identity can lead to errors in real-world settings.

I also learned how to combine different computer vision components such as detection, tracking, and embedding models to support reasoning about movement and matching across cameras.

Finally, I gained experience building multi-stage pipelines where each component generates intermediate outputs that feed into probabilistic decision-making.