What it does
YOLOcate lets a user type any natural language query, and the system:
Understands the query using CLIP text embeddings
Scans the video frame using YOLO to generate candidate bounding boxes
Matches each YOLO crop against the query using CLIP image-text similarity
Selects the most semantically relevant box using score fusion
Switches to tracking mode (ByteTrack/any lightweight tracker) when confident
Re-localises periodically to correct drift or recover lost objects
The system can both locate and track almost any described object without training a custom model.
How we built it
1. Natural Language Encoding (CLIP)
We convert the user's text query into a CLIP text embedding.
This embedding represents the semantic meaning of the query.
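A minimal sketch of this step, assuming the OpenAI `clip` package and a ViT-B/32 checkpoint (any CLIP variant would work the same way):

```python
import torch
import clip  # openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

def encode_query(query: str) -> torch.Tensor:
    """Turn a natural-language query into a unit-norm CLIP text embedding."""
    tokens = clip.tokenize([query]).to(device)
    with torch.no_grad():
        text_emb = clip_model.encode_text(tokens)
    return text_emb / text_emb.norm(dim=-1, keepdim=True)

query_emb = encode_query("man in red shirt")
```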
2. YOLO for Candidate Proposals
YOLO runs on the frame and outputs all detections.
Rather than filtering with an LLM, we simply take every YOLO box as a candidate.
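With the Ultralytics API, candidate generation is a few lines (the checkpoint name here is illustrative):

```python
from ultralytics import YOLO

yolo = YOLO("yolov8n.pt")  # any Ultralytics detection checkpoint

def propose_boxes(frame):
    """Run YOLO once and keep every detection as a candidate, unfiltered."""
    result = yolo(frame, verbose=False)[0]
    boxes = result.boxes.xyxy.cpu().numpy()  # (N, 4) in x1, y1, x2, y2
    confs = result.boxes.conf.cpu().numpy()  # (N,) detection confidences
    return boxes, confs
```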
3. CLIP for Semantic Matching
For each YOLO box:
Crop the region
Compute CLIP image embedding
Compute similarity with the text embedding
The similarity score measures how well a YOLO detection matches the user's query.
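A sketch of the matching loop, reusing `clip_model`, `preprocess`, and `device` from step 1 and assuming BGR frames from OpenCV:

```python
import cv2
import torch
from PIL import Image

def clip_similarities(frame, boxes, query_emb):
    """Cosine similarity between each YOLO crop and the query embedding."""
    rgb = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    crops = [preprocess(rgb.crop(tuple(int(v) for v in box))) for box in boxes]
    batch = torch.stack(crops).to(device)
    with torch.no_grad():
        img_emb = clip_model.encode_image(batch)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ query_emb.T).squeeze(-1)  # one score per box
```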
4. Score Fusion
Each box receives a combined score built from two signals:
YOLO detection confidence
CLIP similarity
The box with the highest combined score is the localisation result.
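One simple fusion rule is a weighted sum; the weight below is illustrative and was hand-tuned in practice:

```python
ALPHA = 0.7  # illustrative weight on semantic similarity

def fuse_and_pick(boxes, yolo_confs, clip_sims):
    """Blend detection confidence with CLIP similarity; return the best box."""
    scores = ALPHA * clip_sims.cpu().numpy() + (1 - ALPHA) * yolo_confs
    best = scores.argmax()
    return boxes[best], float(scores[best])
```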
5. Control Logic
To maintain real-time performance (the loop sketch after step 6 ties these rules together):
Localise every K frames instead of every frame
Track in between using ByteTrack
If tracking fails, return to localisation mode
If localisation confidence is low → skip ahead & retry
6. Tracking Stage
Once confident:
Initialise ByteTrack
Perform fast, per-frame tracking
Every R frames, run localisation in a smaller search region to fix drift
This creates a Search → Track → Correct cycle.
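A skeleton of that cycle, tying steps 5 and 6 together (`init_tracker` is a placeholder for our ByteTrack wrapper, and the intervals and threshold are illustrative):

```python
import cv2

K, R = 30, 60        # localise every K frames in Search; correct every R in Track
CONF_SWITCH = 0.55   # illustrative confidence needed to enter Track mode

cap = cv2.VideoCapture("input.mp4")
mode, tracker, frame_idx = "search", None, 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    if mode == "search":
        if frame_idx % K == 0:  # skip ahead between localisation attempts
            boxes, confs = propose_boxes(frame)
            if len(boxes):
                sims = clip_similarities(frame, boxes, query_emb)
                box, score = fuse_and_pick(boxes, confs, sims)
                if score >= CONF_SWITCH:
                    tracker = init_tracker(frame, box)  # hypothetical ByteTrack wrapper
                    mode = "track"
    else:
        box = tracker.update(frame)                     # fast per-frame tracking
        if box is None or frame_idx % R == 0:
            mode = "search"  # lost the target, or time to correct drift
    frame_idx += 1
```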
Challenges we ran into
Ambiguity in user queries: Natural language often doesn't map cleanly to YOLO classes (e.g., "bag" could mean backpack, handbag, or luggage).
Many irrelevant YOLO detections: Without LLM-based filtering, CLIP must score every crop, which slows things down in crowded scenes.
Maintaining real-time performance: CLIP is expensive; we had to design frame-skipping and adaptive localisation logic.
Drift in long-term tracking: ByteTrack can drift when the object is occluded; periodic re-localisation was essential.
Ensuring stable confidence thresholds: It took experimentation to find reliable confidence switch-points between Search and Track modes.
Accomplishments that we're proud of
Created a fully general natural language → object localisation system without training any new models.
Combined YOLO + CLIP + lightweight tracking into a smooth, working pipeline.
Achieved real-time performance using adaptive localisation intervals.
Built a system flexible enough to handle colors, attributes, and textual descriptions ("blue bottle", "man in red shirt", etc.).
Solved object re-identification without explicit re-ID models, using CLIP similarity alone.
What we learned
CLIP is incredibly strong at semantic matching, even when YOLO classes are vague.
Tracking alone is not enough: systems need periodic re-localisation to stay accurate in long videos.
Real-time vision systems require careful optimisation, not just accurate models.
Natural language search in video becomes feasible once you combine:
- a fast detector (YOLO)
- a semantic encoder (CLIP)
- a lightweight tracker (ByteTrack)
LLMs are optional: the system still works well without Gemini, relying purely on CLIP similarity.
What's next for YOLOcate
Add attribute-level segmentation (color masks) when color queries are very specific.
Introduce spatial reasoning ("near the table", "left of the screen") using lightweight heuristics.
Add multi-object search ("find all people wearing red").
Support long-term re-identification across shots, not just continuous video.
Deploy as a web demo with a simple textbox + video input pipeline.
Optional: bring back LLMs later for advanced reasoning when time and resources permit.
Built With
- python
- pytorch
- tensorflow
- ultralytics