Inspiration
Standard robotic teleoperation is clunky. Relying on joysticks, keyboards, or heavy physical controllers creates a disconnect between the operator and the machine. We wanted to build a system that makes human-machine interaction feel entirely natural, where the robot acts as a direct, fluid extension of the human body without requiring any wearables or specialized depth-sensing hardware.
What it does
Gesture-Updated Robotic Telemetry (GURT) is a heterogeneous control system for remote robotic manipulation. It uses a standard 2D webcam to track a user's hand movements in 3D space, then translates those spatial coordinates and distinct hand gestures (like opening the hand or making a fist) into smooth, deterministic movements on a physical robotic arm in real time.
How we built it
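We built the vision pipeline using Python, OpenCV, and MediaPipe to extract 21-point hand landmarks. Because a standard webcam only provides a 2D matrix of pixels, we engineered a custom Dimensionality Extraction Layer. The system calculates a dynamic Z-axis (depth) by measuring the relative Euclidean distance between static palm joints (the wrist and knuckles). These joints keep nearly the same physical spacing however the fingers move, so their apparent separation in the image shrinks as the hand moves away from the camera and grows as it approaches. That makes the measurement a depth signal that stays stable regardless of finger articulation.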
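As a rough illustration of the palm-size idea, here is a minimal sketch, not our exact implementation. It assumes the landmarks arrive as a list of 21 (x, y) pairs in MediaPipe's hand landmark order (0 = wrist, 5 = index knuckle, 9 = middle knuckle, 17 = pinky knuckle). The function names and the `reference_size` calibration value are hypothetical, and the inverse-proportional depth model is a simplifying pinhole-camera assumption.

```python
import math

# Landmark indices in MediaPipe's 21-point hand model.
WRIST = 0
INDEX_MCP = 5
MIDDLE_MCP = 9
PINKY_MCP = 17


def palm_size(landmarks):
    """Return an apparent palm size from joints that do not move with the fingers.

    `landmarks` is a sequence of 21 (x, y) pairs in image coordinates.
    The size is the mean of the palm length (wrist to middle knuckle)
    and the palm width (index knuckle to pinky knuckle).
    """
    length = math.dist(landmarks[WRIST], landmarks[MIDDLE_MCP])
    width = math.dist(landmarks[INDEX_MCP], landmarks[PINKY_MCP])
    return (length + width) / 2.0


def relative_depth(landmarks, reference_size):
    """Estimate depth relative to a calibration pose (hypothetical helper).

    `reference_size` is the palm size recorded at a known "neutral" distance.
    Under a pinhole-camera assumption, apparent size is inversely
    proportional to distance, so 1.0 means the neutral distance and
    2.0 means roughly twice as far away.
    """
    size = palm_size(landmarks)
    if size <= 0:
        raise ValueError("palm landmarks are degenerate (zero apparent size)")
    return reference_size / size
```

The value is unitless. In this sketch it would still need to be scaled and clamped before being mapped onto a joint range.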
To control the physical hardware, we utilized the LeRobot API to abstract the serial bus communication, translating our Python coordinate dictionaries into the specific hexadecimal data packets required by the Feetech serial bus servos.
Challenges we ran into
Our two biggest hurdles were software environment management and physical control theory.
First, we ran into severe dependency collisions when bridging modern machine learning libraries with legacy C++ bindings, requiring careful downgrading and environment isolation to get OpenCV, MediaPipe, and NumPy to communicate smoothly.
Second, raw optical data is highly volatile. Passing raw frame-by-frame coordinates directly to the servos caused the robotic arm to shake violently. We had to implement a Digital Signal Processing (DSP) layer using an Exponential Moving Average filter to act as a mathematical shock absorber. We also built a temporal state machine with hysteresis to "debounce" the gesture commands, ensuring the robotic claw only actuates when a human gesture is fully intentional.
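The two ideas in that paragraph can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than our production code: the class names, the `alpha` smoothing factor, the thresholds, and the `openness` score (a hypothetical 0-to-1 measure of how open the hand is) are all made up for the example.

```python
class EmaFilter:
    """Exponential Moving Average: y[n] = alpha * x[n] + (1 - alpha) * y[n-1]."""

    def __init__(self, alpha):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.value = None  # no output until the first sample arrives

    def update(self, sample):
        if self.value is None:
            # Seed the filter with the first sample to avoid a startup jump.
            self.value = sample
        else:
            self.value = self.alpha * sample + (1.0 - self.alpha) * self.value
        return self.value


class GestureDebouncer:
    """Claw state machine with hysteresis and a temporal hold.

    The claw closes only after `openness` stays below `close_below` for
    `hold_frames` consecutive frames, and reopens only after it stays
    above `open_above` for the same number of frames. The gap between
    the two thresholds is the hysteresis band.
    """

    def __init__(self, close_below=0.3, open_above=0.5, hold_frames=3):
        self.close_below = close_below
        self.open_above = open_above
        self.hold_frames = hold_frames
        self.closed = False
        self._count = 0  # consecutive frames requesting a state change

    def update(self, openness):
        if self.closed:
            wants_flip = openness > self.open_above
        else:
            wants_flip = openness < self.close_below
        # Any frame that does not request a change resets the counter.
        self._count = self._count + 1 if wants_flip else 0
        if self._count >= self.hold_frames:
            self.closed = not self.closed
            self._count = 0
        return self.closed
```

A smaller `alpha` gives smoother but laggier motion, and a larger `hold_frames` rejects more accidental gestures at the cost of a slower claw response, so both are tuning choices rather than fixed values.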
Accomplishments that we're proud of
We are incredibly proud of extracting highly stable 3D depth telemetry from a purely 2D optical sensor. By combining our DSP noise-filtering with a custom kinematic blending engine—which dynamically adjusts the shoulder and elbow ratios to maximize the physical reach envelope rather than using a rigid 1:1 joint mapping—we achieved a level of fluid, deterministic control that usually requires much more expensive hardware.
What we learned
We learned what it takes to bridge high-level computer vision arrays with physical hardware. We gained deep, hands-on experience with core control-theory and signal-processing techniques (filtering, smoothing, and temporal debouncing) and their application to volatile real-world sensor data. We also learned how difficult the ideation stage is and how much experimentation is required before settling on a project.
What's next for Gesture-Updated Robotic Telemetry (GURT)
We want to push this from a tethered software integration project into a standalone, industrial-grade edge device. Our roadmap includes hardware acceleration by migrating the vision pipeline off a standard CPU and onto a Zynq-7000 SoC (ZedBoard). An AXI VDMA can be used to stream video frames directly into the FPGA fabric, with custom Verilog providing true hardware/software co-design and much lower tracking latency. We would also like to include multiple cameras to build a more accurate 3D picture of each gesture.