Inspiration

To solve the data problem for training humanoid robots, i.e., make it easy to collect the data needed to train them.

How we built it

The app is a real-time scene-understanding tool for robotics: it turns live video into actionable world state. In real time it shows:

  • detected objects (with optional tracking IDs)
  • segmentation masks
  • hand-object interactions (when available)
  • a VLM summary of object state and possible robot actions
  • live timing/performance signals
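The per-frame flow above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the helper functions are hypothetical stand-ins for the real model calls.

```python
# Minimal sketch of a per-frame perception loop. detect_objects and
# segment are hypothetical stand-ins for the real model calls.
import time

def detect_objects(frame):
    # Stand-in for an object detector (e.g. a YOLO call).
    return [{"id": 1, "label": "cup", "box": (10, 10, 50, 50)}]

def segment(frame, detections):
    # Stand-in for a segmentation model seeded with detection boxes.
    return [{"id": d["id"], "mask": None} for d in detections]

def process_frame(frame):
    t0 = time.perf_counter()
    detections = detect_objects(frame)
    masks = segment(frame, detections)
    latency_ms = (time.perf_counter() - t0) * 1000
    # World state plus a live timing signal, as shown in the UI.
    return {"objects": detections, "masks": masks, "latency_ms": latency_ms}

state = process_frame(frame=None)
print(sorted(state))  # ['latency_ms', 'masks', 'objects']
```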

Challenges we ran into

Latency at every layer: segmentation, object detection, and the VLM.

  • YOLO was good for object detection; its detections were fed into SAM3 Fast or RF-DETR-Seg.
  • Segmentation with SAM3 Fast and RF-DETR-Seg (real-time) made the segmentation layer fast and gave accurate borders.
  • SigLIP helped detect actions quickly.
  • For the VLM we used Gemini 3. It gave much better accuracy, but we weren't able to reduce latency much here; the HTTP calls were slow.
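One way to keep a slow HTTP-based VLM from stalling the real-time layers is to run it off the hot path and reuse its latest result every frame. The sketch below illustrates that pattern with Python's standard `threading` and `queue`; `slow_vlm_call` is a hypothetical stand-in for the actual Gemini request.

```python
# Sketch: keep a slow VLM call off the per-frame hot path by running
# it in a background thread and reusing the last available summary.
# slow_vlm_call is a hypothetical stand-in for an HTTP request.
import queue
import threading
import time

vlm_in = queue.Queue(maxsize=1)          # at most one pending frame
latest_summary = {"text": "no summary yet"}

def slow_vlm_call(frame):
    time.sleep(0.05)                     # stands in for HTTP latency
    return {"text": f"summary of frame {frame}"}

def vlm_worker():
    while True:
        frame = vlm_in.get()
        if frame is None:                # shutdown sentinel
            break
        latest_summary.update(slow_vlm_call(frame))

threading.Thread(target=vlm_worker, daemon=True).start()

for frame in range(3):
    # Drop frames when the VLM is busy instead of blocking the stream.
    try:
        vlm_in.put_nowait(frame)
    except queue.Full:
        pass
    # Fast layers (detection/segmentation) would run here every frame,
    # overlaying whatever summary is currently available.
    print(latest_summary["text"])

vlm_in.put(None)                         # stop the worker
```

The fast models run every frame; the VLM summary simply lags a little behind, which trades freshness for a smooth stream.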

Accomplishments that we're proud of

We were able to extract much of the data needed for training humanoids.

  • Real-time multi-model perception: runs live object detection, segmentation, tracking, and hand-object interaction analysis in one stream.
  • Actionable world-state reasoning: uses a VLM to infer object state (for example open/closed/unknown) and suggest robot-feasible actions.
  • Live + post-analysis workflow: records live sessions, then auto-processes them into visual overlays, temporal changes, events, and structured JSON outputs for evaluation and downstream robotics use.
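To make the structured-JSON idea concrete, here is a hypothetical example of what one record from a processed session might look like. The field names are illustrative, not the app's actual schema.

```python
# Hypothetical example of a structured session record; field names
# are illustrative only, not the app's real output schema.
import json

session_record = {
    "timestamp": 12.4,                       # seconds into the session
    "objects": [
        {
            "track_id": 3,
            "label": "drawer",
            "state": "open",                 # inferred by the VLM
            "suggested_actions": ["close_drawer"],
        },
    ],
    "events": [
        {"t": 11.9, "type": "hand_object_contact", "object": "drawer"},
    ],
}

print(json.dumps(session_record, indent=2))
```

Records like this are what make the output usable downstream: they can be diffed over time for temporal changes and fed directly into evaluation or training pipelines.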

What we learned

Latency is the number one problem in real-time robot perception, and it was fun optimizing for it.

What's next for HUMOS - Humanoid Unified Model Orchestration System

Collect the other data necessary, and perhaps actually use it to train a Unitree or Optimus robot by mapping motions to sensors.

Built With

  • gemini3
  • python
  • sam3
  • siglip
  • streamlit
  • yolov