What it does

YOLOcate lets a user type any natural language query, and the system:

  1. Understands the query using CLIP text embeddings

  2. Scans the video frame using YOLO to generate candidate bounding boxes

  3. Matches each YOLO crop with the query using CLIP image-text similarity

  4. Selects the most semantically correct box using score fusion

  5. Switches to tracking mode (ByteTrack or another lightweight tracker) when confident

  6. Re-localises periodically to correct drift or recover lost objects

The system can both locate and track almost any described object without training a custom model.


How we built it

1. Natural Language Encoding (CLIP)

  • We convert the user's text query into a CLIP text embedding.

  • This embedding represents the semantic meaning of the query.
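
A minimal sketch of this step, assuming OpenAI's CLIP package and the ViT-B/32 checkpoint (both illustrative choices):

```python
# Minimal sketch of the text-encoding step; model choice is illustrative.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def encode_query(query: str) -> torch.Tensor:
    """Return an L2-normalised CLIP text embedding for a natural language query."""
    tokens = clip.tokenize([query]).to(device)
    with torch.no_grad():
        text_emb = model.encode_text(tokens)
    return text_emb / text_emb.norm(dim=-1, keepdim=True)
```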

2. YOLO for Candidate Proposals

  • YOLO runs on the frame and outputs all detections.

  • Instead of filtering with an LLM, we simply take all YOLO boxes as candidates.
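
A sketch of the proposal step using the Ultralytics API; the yolov8n.pt checkpoint is an illustrative choice:

```python
# Sketch of the proposal step; any pretrained YOLO checkpoint works as a proposal source.
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")

def propose_boxes(frame):
    """Run YOLO once and return every (xyxy box, confidence) pair as a candidate."""
    result = detector(frame, verbose=False)[0]
    boxes = result.boxes.xyxy.cpu().numpy()  # (N, 4) pixel coordinates
    confs = result.boxes.conf.cpu().numpy()  # (N,) detection confidences
    return list(zip(boxes, confs))
```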

3. CLIP for Semantic Matching

For each YOLO box:

  • Crop the region

  • Compute CLIP image embedding

  • Compute similarity with the text embedding

The similarity score measures how well a YOLO detection matches the user's query.
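
A sketch of the matching step; it reuses model, preprocess, and device from the encoding sketch above, and assumes the frame is an RGB numpy array:

```python
# Sketch of per-box semantic matching; crop handling is simplified.
import torch
from PIL import Image

def clip_similarity(frame_rgb, box, text_emb) -> float:
    """Cosine similarity between a cropped detection and the query embedding."""
    x1, y1, x2, y2 = map(int, box)
    crop = Image.fromarray(frame_rgb[y1:y2, x1:x2])
    image = preprocess(crop).unsqueeze(0).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(image)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ text_emb.T).item()
```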

4. Score Fusion

Each box receives a combined score:

  • YOLO detection confidence

  • CLIP similarity

The box with the highest combined score is the localisation result.
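
A sketch of the fusion; the weight ALPHA below is a tunable assumption, not a value from our final configuration:

```python
# Sketch of score fusion between detector confidence and semantic similarity.
ALPHA = 0.5  # illustrative balance between the two signals

def best_box(candidates, frame_rgb, text_emb):
    """Return (fused_score, box) for the highest-scoring candidate."""
    scored = [
        (ALPHA * conf + (1 - ALPHA) * clip_similarity(frame_rgb, box, text_emb), box)
        for box, conf in candidates
    ]
    return max(scored, key=lambda s: s[0])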

5. Control Logic

To maintain real-time performance:

  • Localise every K frames instead of every frame

  • Track in between using ByteTrack

  • If tracking fails, return to localisation mode

  • If localisation confidence is low → skip ahead & retry
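
A simplified sketch of this loop; K, MIN_SCORE, and the init_tracker / tracker.update interface are illustrative assumptions rather than the exact implementation:

```python
# Simplified Search -> Track -> Correct loop over a frame iterator.
K = 15            # re-localise every K frames (illustrative value)
MIN_SCORE = 0.30  # confidence switch-point between Search and Track (illustrative)

def run(video_frames, query):
    text_emb = encode_query(query)
    tracker = None
    for frame_idx, frame in enumerate(video_frames):
        if tracker is None or frame_idx % K == 0:
            # Search mode: full localisation pass
            candidates = propose_boxes(frame)
            result = best_box(candidates, frame, text_emb) if candidates else None
            if result and result[0] >= MIN_SCORE:
                tracker = init_tracker(frame, result[1])  # hypothetical tracker wrapper
                yield result[1]
            else:
                tracker = None  # low confidence: skip ahead and retry
                yield None
        else:
            # Track mode: fast per-frame update
            ok, box = tracker.update(frame)
            if ok:
                yield box
            else:
                tracker = None  # tracking lost: return to Search mode
                yield None
```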

6. Tracking Stage

Once confident:

  • Initialise ByteTrack

  • Perform fast, per-frame tracking

  • Every R frames, run localisation in a smaller search region to fix drift

This creates a Search → Track → Correct cycle.
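
An illustrative version of the correction step; R, the padding factor, and the coordinate mapping are assumptions:

```python
# Every R frames, re-run localisation inside a padded window around the
# tracked box and map the result back to full-frame coordinates.
R, PAD = 30, 1.5  # illustrative values

def correct_drift(frame_rgb, tracked_box, text_emb):
    """Re-localise in a search region around the current box to cancel drift."""
    x1, y1, x2, y2 = tracked_box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    hw, hh = (x2 - x1) * PAD / 2, (y2 - y1) * PAD / 2
    rx1, ry1 = max(0, int(cx - hw)), max(0, int(cy - hh))
    rx2, ry2 = int(cx + hw), int(cy + hh)
    region = frame_rgb[ry1:ry2, rx1:rx2]
    candidates = propose_boxes(region)
    if not candidates:
        return tracked_box  # nothing found: keep the tracker's estimate
    _, box = best_box(candidates, region, text_emb)
    # map region-local coordinates back to full-frame coordinates
    return (box[0] + rx1, box[1] + ry1, box[2] + rx1, box[3] + ry1)
```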


Challenges we ran into

  • Ambiguity in user queries: Natural language often doesn't map cleanly to YOLO classes (e.g., "bag" could be backpack, handbag, or luggage).

  • Many irrelevant YOLO detections: Without LLM-based filtering, CLIP must compare every crop, which can be slow in crowded scenes.

  • Maintaining real-time performance: CLIP is expensive, so we had to design frame-skipping and adaptive localisation logic.

  • Drift in long-term tracking: ByteTrack can drift when the object is occluded; periodic re-localisation was essential.

  • Ensuring stable confidence thresholds: It took experimentation to find reliable confidence switch-points between Search and Track modes.


Accomplishments that we're proud of

  • Created a fully general natural language → object localisation system without training any new models.

  • Combined YOLO + CLIP + lightweight tracking into a smooth, working pipeline.

  • Achieved real-time performance by using adaptive localisation intervals.

  • Built a system flexible enough to handle colors, attributes, and textual descriptions ("blue bottle", "man in red shirt", etc.).

  • Solved object re-identification without explicit re-ID models, using CLIP similarity alone.


What we learned

  • CLIP is incredibly strong at semantic matching, even when YOLO classes are vague.

  • Tracking is not enough: systems need periodic re-localisation to stay accurate in long videos.

  • Real-time vision systems require careful optimisation, not just accurate models.

  • Natural language search in video becomes feasible once you combine:

    • a fast detector (YOLO)
    • a semantic encoder (CLIP)
    • a lightweight tracker (ByteTrack)
  • LLMs are optional: the system still works great without Gemini by relying purely on CLIP similarity.


What's next for YOLOcate

  • Add attribute-level segmentation (color masks) when color queries are very specific.

  • Introduce spatial reasoning ("near the table", "left of the screen") using lightweight heuristics.

  • Add multi-object search ("find all people wearing red").

  • Support long-term re-identification across shots, not just continuous video.

  • Deploy as a web demo with a simple textbox + video input pipeline.

  • Optional: bring back LLMs later for advanced reasoning when time and resources permit.

Built With

  • python
  • pytorch
  • tensorflow
  • ultralytics