Inspiration

We built OneShot around a simple idea: teaching a robot should feel more like showing than programming. Instead of writing task-specific control logic or collecting lots of demonstrations, we wanted to see how far we could get with just a single human demo video. Our goal was to turn that video into something a real robot could understand and execute.

How we built it

OneShot is a two-part system.

The first part is a video-to-motion pipeline. We used GEM-X to extract 3D human motion from video and RTMW3D / MMPose for precise hand tracking. We then refined and smoothed the motion and packaged it into a reusable bundle. This part runs on a VM, since joint detection needs a pretty beefy A100 GPU to finish in a reasonable amount of time.
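The refine-and-package step can be sketched as a simple smoothing pass over the extracted 3D keypoints, followed by saving the result as a bundle. This is a minimal sketch, not our exact pipeline: the function names (`smooth_keypoints`, `package_bundle`), the moving-average filter, and the `.npz` bundle format are all illustrative assumptions.

```python
import numpy as np

def smooth_keypoints(traj, window=5):
    """Moving-average smoothing of a (T, J, 3) keypoint trajectory.

    traj: T frames x J joints x 3D positions. Edge frames are repeated
    so the output has the same length as the input (window must be odd).
    """
    pad = window // 2
    padded = np.concatenate(
        [traj[:1].repeat(pad, axis=0), traj, traj[-1:].repeat(pad, axis=0)],
        axis=0,
    )
    kernel = np.ones(window) / window
    out = np.empty_like(traj)
    # Filter each joint/axis time series independently.
    for j in range(traj.shape[1]):
        for k in range(3):
            out[:, j, k] = np.convolve(padded[:, j, k], kernel, mode="valid")
    return out

def package_bundle(traj, fps, path="motion_bundle.npz"):
    """Save the smoothed motion plus metadata as a reusable bundle."""
    np.savez(path, keypoints=traj, fps=fps)

# Demo: a slow sine trajectory for 2 joints, corrupted with jitter.
clean = np.sin(np.linspace(0, 4, 30))[:, None, None] * np.ones((30, 2, 3))
noisy = clean + 0.05 * np.random.default_rng(0).normal(size=clean.shape)
smoothed = smooth_keypoints(noisy)
```

The same bundle is what the edge runtime later imports, so keeping it a plain array-plus-metadata file makes the VM-to-robot handoff trivial.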

The second part is the edge robot runtime. We used AMD's mini PC for fast, real-world edge execution: it imports the motion bundle from the VM, uses OpenCV ArUco markers to calibrate the table and robot setup, and then retargets the motion into the SO-101 robot frame. From there, we used Pinocchio, Pink, and inverse-kinematics-based solvers to generate feasible joint trajectories and execute them directly on the robot.
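The retargeting step boils down to composing rigid transforms: the ArUco calibration gives the table's pose in the camera frame, and a one-time measurement gives the robot base's pose on the table. The sketch below assumes those two transforms are already available as rotation-plus-translation pairs (in practice the marker pose would come from OpenCV's ArUco detection); the helper names are hypothetical.

```python
import numpy as np

def to_homogeneous(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def camera_point_to_robot(p_cam, T_cam_table, T_table_robot):
    """Map a 3D point from the camera frame into the robot base frame.

    T_cam_table: table pose in the camera frame (e.g. from an ArUco marker).
    T_table_robot: robot base pose in the table frame (measured once).
    """
    T_cam_robot = T_cam_table @ T_table_robot
    p = np.append(p_cam, 1.0)                 # homogeneous coordinates
    return (np.linalg.inv(T_cam_robot) @ p)[:3]

# Demo: table 1 m in front of the camera, robot base 0.3 m along the
# table's x axis. A camera-frame point sitting exactly at the robot base
# should map to the robot-frame origin.
T_cam_table = to_homogeneous(np.eye(3), np.array([0.0, 0.0, 1.0]))
T_table_robot = to_homogeneous(np.eye(3), np.array([0.3, 0.0, 0.0]))
p_robot = camera_point_to_robot(np.array([0.3, 0.0, 1.0]),
                                T_cam_table, T_table_robot)
```

Once every wrist keypoint in the bundle is expressed in the robot base frame, it can be handed to the IK solvers as an end-effector target.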

Challenges we faced

The hardest part was bridging the gap between perception and robotics. A human pose estimate is not directly usable by a robot, so we had to build a clean path from noisy video outputs to stable robot motion.

We also ran into issues with jitter, calibration, and reachability. Small pose errors become much more noticeable on a real robot, and human arm motion does not always map cleanly to the SO-101’s geometry. That meant we needed smoothing, workspace constraints, scene grounding, and reachability-aware IK to make playback reliable.
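One of the workspace constraints can be illustrated as projecting each retargeted target onto an approximation of the arm's reachable region before handing it to IK. This is a simplified sketch: modeling the SO-101 workspace as a spherical shell around the base is an assumption, and the radii below are placeholder values, not measured limits.

```python
import numpy as np

def clamp_to_workspace(p, base=None, r_min=0.10, r_max=0.40):
    """Project a target point onto a spherical-shell workspace approximation.

    base: robot base position; r_min/r_max: inner/outer reach radii
    (placeholder values, not the SO-101's real limits).
    """
    if base is None:
        base = np.zeros(3)
    v = p - base
    d = np.linalg.norm(v)
    if d < 1e-9:
        # Degenerate case: target on the base; pick an arbitrary direction.
        return base + np.array([r_min, 0.0, 0.0])
    r = np.clip(d, r_min, r_max)          # pull into [r_min, r_max]
    return base + v * (r / d)

# A target far outside the reach gets pulled back to the outer radius;
# a target already inside the shell passes through unchanged.
far = clamp_to_workspace(np.array([1.0, 0.0, 0.0]))
inside = clamp_to_workspace(np.array([0.2, 0.1, 0.0]))
```

Running this on every frame before IK keeps the solver from chasing unreachable targets, which was a major source of unstable playback.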

What we learned

We learned that building an embodied AI system is much more than running a vision model. The most important pieces were the ones between stages: using a good intermediate representation, calibrating the physical scene, and enforcing robot constraints during execution.

Most of all, we learned how to turn separate components like motion extraction, retargeting, calibration, and playback into one end-to-end system.

Why we are proud of it

What makes OneShot exciting to us is that it connects the full pipeline: single-video demonstration, 3D motion extraction, scene grounding, robot retargeting, and real hardware playback. Instead of stopping at simulation or pose estimation, we built a system that carries a human demo all the way to robot execution.
