What it does
YOLOcate lets a user type any natural language query, and the system:
Understands the query using CLIP text embeddings
Scans the video frame using YOLO to generate candidate bounding boxes
Matches each YOLO crop against the query using CLIP image-text similarity
Selects the most semantically relevant box using score fusion
Switches to tracking mode (ByteTrack/any lightweight tracker) when confident
Re-localises periodically to correct drift or recover lost objects
The system can both locate and track almost any described object without training a custom model.
How we built it
1. Natural Language Encoding (CLIP)
We convert the user's text query into a CLIP text embedding.
This embedding represents the semantic meaning of the query.
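A minimal sketch of this step, assuming the OpenAI `clip` package and a ViT-B/32 checkpoint (any CLIP variant would work the same way):

```python
import torch
import clip  # openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

def encode_query(query: str) -> torch.Tensor:
    """Turn a natural-language query into a unit-norm CLIP text embedding."""
    tokens = clip.tokenize([query]).to(device)
    with torch.no_grad():
        text_emb = clip_model.encode_text(tokens)
    return text_emb / text_emb.norm(dim=-1, keepdim=True)

query_emb = encode_query("man in red shirt")
```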
2. YOLO for Candidate Proposals
YOLO runs on the frame and outputs all detections.
Rather than filtering with an LLM, we simply take every YOLO box as a candidate.
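With the Ultralytics API, candidate generation is a few lines (the checkpoint name here is illustrative):

```python
from ultralytics import YOLO

yolo = YOLO("yolov8n.pt")  # any Ultralytics detection checkpoint

def propose_boxes(frame):
    """Run YOLO once and keep every detection as a candidate, unfiltered."""
    result = yolo(frame, verbose=False)[0]
    boxes = result.boxes.xyxy.cpu().numpy()  # (N, 4) in x1, y1, x2, y2
    confs = result.boxes.conf.cpu().numpy()  # (N,) detection confidences
    return boxes, confs
```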
3. CLIP for Semantic Matching
For each YOLO box:
Crop the region
Compute CLIP image embedding
Compute similarity with the text embedding
The similarity score measures how well a YOLO detection matches the user's query.
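A sketch of the matching loop, reusing `clip_model`, `preprocess`, and `device` from step 1 and assuming BGR frames from OpenCV:

```python
import cv2
import torch
from PIL import Image

def clip_similarities(frame, boxes, query_emb):
    """Cosine similarity between each YOLO crop and the query embedding."""
    rgb = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    crops = [preprocess(rgb.crop(tuple(int(v) for v in box))) for box in boxes]
    batch = torch.stack(crops).to(device)
    with torch.no_grad():
        img_emb = clip_model.encode_image(batch)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ query_emb.T).squeeze(-1)  # one score per box
```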
4. Score Fusion
Each box receives a combined score built from two signals:
YOLO detection confidence
CLIP similarity
The box with the highest combined score is the localisation result.
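One simple fusion rule is a weighted sum; the weight below is illustrative and was hand-tuned in practice:

```python
ALPHA = 0.7  # illustrative weight on semantic similarity

def fuse_and_pick(boxes, yolo_confs, clip_sims):
    """Blend detection confidence with CLIP similarity; return the best box."""
    scores = ALPHA * clip_sims.cpu().numpy() + (1 - ALPHA) * yolo_confs
    best = scores.argmax()
    return boxes[best], float(scores[best])
```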
5. Control Logic
To maintain real-time performance (the loop sketch after step 6 ties these rules together):
Localise every K frames instead of every frame
Track in between using ByteTrack
If tracking fails, return to localisation mode
If localisation confidence is low → skip ahead & retry
6. Tracking Stage
Once confident:
Initialise ByteTrack
Perform fast, per-frame tracking
Every R frames, run localisation in a smaller search region to fix drift
This creates a Search → Track → Correct cycle.
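A skeleton of that cycle, tying steps 5 and 6 together (`init_tracker` is a placeholder for our ByteTrack wrapper, and the intervals and threshold are illustrative):

```python
import cv2

K, R = 30, 60        # localise every K frames in Search; correct every R in Track
CONF_SWITCH = 0.55   # illustrative confidence needed to enter Track mode

cap = cv2.VideoCapture("input.mp4")
mode, tracker, frame_idx = "search", None, 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    if mode == "search":
        if frame_idx % K == 0:  # skip ahead between localisation attempts
            boxes, confs = propose_boxes(frame)
            if len(boxes):
                sims = clip_similarities(frame, boxes, query_emb)
                box, score = fuse_and_pick(boxes, confs, sims)
                if score >= CONF_SWITCH:
                    tracker = init_tracker(frame, box)  # hypothetical ByteTrack wrapper
                    mode = "track"
    else:
        box = tracker.update(frame)                     # fast per-frame tracking
        if box is None or frame_idx % R == 0:
            mode = "search"  # lost the target, or time to correct drift
    frame_idx += 1
```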
Challenges we ran into
Ambiguity in user queries: Natural language often doesn't map cleanly to YOLO classes (e.g., "bag" could mean backpack, handbag, or luggage).
Many irrelevant YOLO detections: Without LLM-based filtering, CLIP must score every crop, which slows things down in crowded scenes.
Maintaining real-time performance: CLIP is expensive; we had to design frame-skipping and adaptive localisation logic.
Drift in long-term tracking: ByteTrack can drift when the object is occluded; periodic re-localisation was essential.
Ensuring stable confidence thresholds: It took experimentation to find reliable confidence switch-points between Search and Track modes.
Accomplishments that we're proud of
Created a fully general natural language → object localisation system without training any new models.
Combined YOLO + CLIP + lightweight tracking into a smooth, working pipeline.
Achieved real-time performance using adaptive localisation intervals.
Built a system flexible enough to handle colors, attributes, and textual descriptions ("blue bottle", "man in red shirt", etc.).
Solved object re-identification without explicit re-ID models, using CLIP similarity alone.
What we learned
CLIP is incredibly strong at semantic matching, even when YOLO classes are vague.
Tracking alone is not enough: systems need periodic re-localisation to stay accurate in long videos.
Real-time vision systems require careful optimisation, not just accurate models.
Natural language search in video becomes feasible once you combine:
- a fast detector (YOLO)
- a semantic encoder (CLIP)
- a lightweight tracker (ByteTrack)
LLMs are optional: the system still works well without Gemini, relying purely on CLIP similarity.
What's next for YOLOcate
Add attribute-level segmentation (color masks) when color queries are very specific.
Introduce spatial reasoning ("near the table", "left of the screen") using lightweight heuristics.
Add multi-object search ("find all people wearing red").
Support long-term re-identification across shots, not just continuous video.
Deploy as a web demo with a simple textbox + video input pipeline.
Optional: bring back LLMs later for advanced reasoning when time and resources permit.
Built With
- python
- pytorch
- tensorflow
- ultralytics