Inspiration
To solve the data problem for training humanoids, i.e. to make collecting the data humanoids learn from easy.
How we built it
The app is a real-time scene-understanding tool for robotics: it turns live video into actionable world state. In real time it shows:

- detected objects (with optional tracking IDs)
- segmentation masks
- hand-object interactions (when available)
- a VLM summary of object state and possible robot actions
- live timing/performance signals
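A rough sketch of what such a live loop can look like, assuming Ultralytics YOLO for detection and tracking; the `segment` and `summarize_state` functions below are hypothetical placeholders standing in for the segmentation and VLM stages, not our exact code:

```python
# Illustrative live loop: detect/track per frame, stub out the slower stages.
import time

import cv2
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")  # any YOLO detection checkpoint

def segment(frame, boxes):
    """Placeholder for the segmentation stage (SAM3 fast / rf-detr-seg)."""
    return []

def summarize_state(frame, boxes):
    """Placeholder for the VLM stage: object state + feasible robot actions."""
    return {"state": "unknown", "actions": []}

cap = cv2.VideoCapture(0)  # live video source
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    t0 = time.perf_counter()
    results = detector.track(frame, persist=True, verbose=False)  # boxes + IDs
    masks = segment(frame, results[0].boxes)
    state = summarize_state(frame, results[0].boxes)
    fps = 1.0 / (time.perf_counter() - t0)  # live performance signal
    # masks/state would be drawn into the overlay here; plot() shows boxes + IDs
    overlay = results[0].plot()
    cv2.putText(overlay, f"{fps:.1f} FPS", (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
    cv2.imshow("world state", overlay)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```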
Challenges we ran into
Latency at every layer: object detection, segmentation, and the VLM.

- YOLO worked well for object detection; its detections were fed into SAM3 fast or rf-detr-seg.
- Segmentation with SAM3 fast and rf-detr-seg (real-time) kept the segmentation layer fast and gave accurate mask borders.
- SigLIP helped detect actions quickly.
- For the VLM we used Gemini 3: it gives better accuracy, but we weren't able to reduce its latency much. The HTTP calls were slow, though the accuracy was much better.
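One generic pattern for hiding slow VLM HTTP latency (an illustration under assumptions, not necessarily how HUMOS handles it) is to keep the per-frame loop off the network entirely: a worker thread owns the VLM call, and the loop always renders the most recent completed summary. `call_gemini` below is a hypothetical stand-in for the real request:

```python
# Per-frame loop never waits on the network; the newest frame is offered to a
# background worker, and stale frames are simply dropped.
import queue
import threading

frame_q = queue.Queue(maxsize=1)          # holds at most the newest frame
latest_summary = {"state": "unknown", "actions": []}

def call_gemini(frame):
    """Hypothetical stand-in for the real Gemini 3 HTTP request."""
    return {"state": "unknown", "actions": []}

def vlm_worker():
    global latest_summary
    while True:
        frame = frame_q.get()                # blocks until a fresh frame arrives
        latest_summary = call_gemini(frame)  # slow HTTP call, off the hot path

threading.Thread(target=vlm_worker, daemon=True).start()

def submit(frame):
    """Offer the newest frame to the worker; drop it if the worker is busy."""
    try:
        frame_q.put_nowait(frame)
    except queue.Full:
        pass  # the detection/segmentation loop keeps running at full rate
```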
Accomplishments that we're proud of
We were able to extract much of the data needed for training humanoids.
- **Real-time multi-model perception:** runs live object detection, segmentation, tracking, and hand-object interaction analysis in one stream.
- **Actionable world-state reasoning:** uses a VLM to infer object state (for example open/closed/unknown) and suggest robot-feasible actions.
- **Live + post-analysis workflow:** records live sessions, then auto-processes them into visual overlays, temporal changes, events, and structured JSON outputs for evaluation and downstream robotics use.
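As an illustration only, a single per-frame record in such a structured output might look like the following; the schema is hypothetical, not the project's actual format:

```python
# Hypothetical example of one structured world-state record.
import json

record = {
    "frame": 1042,
    "timestamp_s": 34.73,
    "objects": [
        {
            "track_id": 3,
            "label": "cabinet door",
            "bbox": [412, 188, 655, 540],
            "state": "closed",  # open/closed/unknown
            "possible_actions": ["open", "grasp handle"],
        }
    ],
    "hand_object_interactions": [
        {"hand": "right", "object_track_id": 3, "contact": True}
    ],
    "events": ["cabinet door: open -> closed"],
}
print(json.dumps(record, indent=2))
```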
What we learned
Latency is the number one problem, and it was fun optimizing it for robots.
What's next for HUMOS - Humanoid Unified Model Orchestration System
Collect the other necessary data, and maybe actually use it to train a Unitree or Optimus robot by mapping motions to sensors.
Built With
- gemini3
- python
- sam3
- siglip
- streamlit
- yolov