Inspiration

Physical AI companies raise hundreds of millions to solve robotic manipulation through imitation learning: collect demonstrations, train a vision-language-action (VLA) model, deploy a policy. This works sometimes, but the result is slow, brittle, and needs retraining for every new object and environment.

We walked in with a 3D-printed leader-follower teleoperation rig, 12 bus servos, and dual cameras, and assumed we'd do the same. Then we asked a different question.

Why teach a robot to fold through imitation when everyday tasks such as folding clothes are fundamentally geometric?

A fold is a shape transformation. Pick two corners, drag them to the target, verify alignment. The geometry is explicit, computable, deterministic. If that holds for folding, it holds for grasping, stacking, assembly, and deformation. Physical tasks have intrinsic geometric structure. You don't need memorized trajectories. You need to reason about structure.

That insight is Infer.


What It Does

Infer is a closed-loop robotic manipulation system that reasons about physical tasks geometrically in real time. Zero training data. No pre-trained models. No demonstrations.

Folding clothes is the proof of concept. The real discovery: physical manipulation tasks have intrinsic geometric structure, and reasoning about that structure directly generalizes in ways data-driven approaches cannot.

Perception: Every frame, the system automatically identifies the cloth against the background, finds its edges, and scores every point on the boundary for how "graspable" it is. Sharp corners and extremities score highest. The result: a precise map of manipulation-relevant points on any object, updated in real time, in under 10 ms.
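A minimal sketch of the scoring idea, assuming the cloth boundary has already been extracted as an ordered, closed polygon of pixel coordinates. The function name `score_boundary` and the scoring formula (turning-angle sharpness multiplied by distance from the centroid) are illustrative assumptions, not Infer's actual scoring function.

```python
import math

def score_boundary(points, step=1):
    """Score each boundary point; sharper and more extreme points score higher.

    points: ordered list of (x, y) pixels along a closed boundary.
    step:   how many points away the neighbours used for the angle are.
    Returns one float in [0, 1] per point.
    """
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    dists = [math.hypot(x - cx, y - cy) for x, y in points]
    max_dist = max(dists) or 1.0  # avoid dividing by zero on a degenerate boundary

    scores = []
    for i, (x, y) in enumerate(points):
        px, py = points[(i - step) % n]
        nx, ny = points[(i + step) % n]
        ax, ay = px - x, py - y          # vector to previous neighbour
        bx, by = nx - x, ny - y          # vector to next neighbour
        na, nb = math.hypot(ax, ay), math.hypot(bx, by)
        if na == 0 or nb == 0:
            scores.append(0.0)
            continue
        cos_t = max(-1.0, min(1.0, (ax * bx + ay * by) / (na * nb)))
        angle = math.acos(cos_t)         # pi on a straight edge, smaller at a corner
        sharpness = 1.0 - angle / math.pi
        extremity = dists[i] / max_dist
        scores.append(sharpness * extremity)
    return scores
```

On a square sampled at its corners and edge midpoints, the four corners score 0.5 and the midpoints score 0, so taking the highest-scoring points recovers the corners.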

Fold planning: Rather than following a memorized sequence of moves, Infer reasons about the fold geometrically. It identifies where the cloth is, where it needs to go, and computes the motion required. Every attempt re-evaluates the cloth's current state from scratch. When the fold is within tolerance, it is declared mathematically complete.
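One way to make the fold geometry concrete, as a sketch: treat a half fold as a reflection of the moving corners across the cloth's centre line, then call the fold complete when every moved corner is within a pixel tolerance of its reflected target. The function names, the corner ordering, and the 5-pixel tolerance are assumptions for illustration.

```python
import math

def reflect_across_line(point, a, b):
    """Reflect `point` across the infinite line through points a and b."""
    px, py = point
    ax, ay = a
    dx, dy = b[0] - ax, b[1] - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        raise ValueError("fold line needs two distinct points")
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    fx, fy = ax + t * dx, ay + t * dy    # foot of the perpendicular on the line
    return (2 * fx - px, 2 * fy - py)

def plan_half_fold(corners):
    """Plan folding the left edge onto the right edge.

    corners: (top_left, top_right, bottom_right, bottom_left) in pixels.
    Returns one (grasp_point, target_point) pair per arm.
    """
    tl, tr, br, bl = corners
    mid_top = ((tl[0] + tr[0]) / 2, (tl[1] + tr[1]) / 2)
    mid_bottom = ((bl[0] + br[0]) / 2, (bl[1] + br[1]) / 2)
    return [(p, reflect_across_line(p, mid_top, mid_bottom)) for p in (tl, bl)]

def fold_error(moved_corners, targets):
    """Largest pixel distance between a moved corner and its target."""
    return max(math.dist(c, t) for c, t in zip(moved_corners, targets))

def fold_complete(moved_corners, targets, tol_px=5.0):
    return fold_error(moved_corners, targets) <= tol_px
```

For a 200 x 100 pixel rectangle, the plan sends (0, 0) to (200, 0) and (0, 100) to (200, 100), and the error before any motion is 200 pixels.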

Control loop: Detect the cloth. Plan the motion. Execute with both arms simultaneously. Wait for the cloth to settle. Verify the fold worked. If it did not, find the nearest corner to where the gripper slipped and try again. The loop runs until the task is geometrically verified done.
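The loop above can be sketched as a small function that takes each stage as a callable. Everything here is hypothetical scaffolding: `run_fold_loop`, its parameters, and the `SlippyCloth` toy (a one-dimensional stand-in where each drag covers only part of the remaining distance) are not part of Infer's codebase.

```python
def run_fold_loop(detect, plan, execute, settle, measure_error,
                  tol_px=5.0, max_attempts=5):
    """Detect, plan, execute, settle, verify; repeat until within tolerance.

    Returns (success, history) where history holds the error after each attempt.
    """
    history = []
    for _ in range(max_attempts):
        state = detect()            # fresh state every attempt, nothing remembered
        motion = plan(state)
        execute(motion)
        settle()
        err = measure_error(detect())
        history.append(err)
        if err <= tol_px:
            return True, history
    return False, history

class SlippyCloth:
    """Toy 1-D cloth edge: the gripper slips, so each drag falls short."""
    def __init__(self, edge_x=0.0, target_x=200.0, grip=0.8):
        self.edge_x, self.target_x, self.grip = edge_x, target_x, grip
    def detect(self):
        return self.edge_x
    def plan(self, edge_x):
        return self.target_x - edge_x        # distance still to drag
    def execute(self, dx):
        self.edge_x += self.grip * dx        # only part of the motion sticks
    def settle(self):
        pass                                 # a real system would wait here
    def error(self, edge_x):
        return abs(self.target_x - edge_x)

cloth = SlippyCloth()
ok, history = run_fold_loop(cloth.detect, cloth.plan, cloth.execute,
                            cloth.settle, cloth.error)
```

With these toy numbers the error shrinks on every attempt and the loop stops on the third. The point of the design is that the loop never trusts its previous plan: each attempt starts from a new detection.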

Remote override: When the algorithm encounters a configuration it cannot resolve, a remote operator picks up the leader arms and physically corrects the cloth through teleoperation in real time. They put the leaders down. Infer resumes autonomously. One operator can supervise dozens of deployed arms simultaneously, stepping in only on exception — the same way a senior surgeon guides a remote procedure.

Generalization: Tested on towels, shirts, and pillowcases with identical settings and zero retraining.


How We Built It

Hardware: Four physical robotic arms — two that a human operates by hand (leaders), and two that mirror those movements precisely in real time (followers). When the human steps back, the algorithm takes over the follower arms directly. 1080p overhead camera for global cloth state. 480p wrist-mounted camera for close-range precision.

Arm control: Arms are driven through HuggingFace's open-source LeRobot framework. Calibration is saved to disk and survives machine changes, location changes, and software updates. No recalibration needed between runs.

Perception: Built entirely on classical computer vision — no neural networks on the critical path. The system is fast enough to update corner positions 30 times per second on a standard laptop CPU.

Fold geometry: A set of geometric primitives that any part of the system can call: find the rectangle, plan the fold, measure how complete the fold is, find the nearest corner after a slip. Each primitive is a standalone, testable unit. Swapping in a new task means writing a new geometric constraint, not collecting new training data.

Calibration: A one-time setup that maps camera pixels to physical table coordinates. Saved as a file. The full system runs in simulation mode without it, so every layer was developed and tested before the hardware was touched.
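As an illustration of a pixel-to-table mapping, the sketch below assumes the table is planar, so a 3x3 homography is enough, and that the matrix is stored as JSON. The file format, the function names, and the pass-through behaviour when no file exists are assumptions; the source only states that calibration is saved as a file and that simulation mode runs without it.

```python
import json
import os

def save_calibration(path, H):
    """Write a 3x3 homography (nested lists) to disk."""
    with open(path, "w") as f:
        json.dump({"homography": H}, f)

def load_calibration(path):
    """Return the stored homography, or None when no calibration file exists."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)["homography"]

def pixel_to_table(H, u, v):
    """Map pixel (u, v) to table coordinates.

    With H=None (no calibration, e.g. simulation), pixels pass through unchanged.
    """
    if H is None:
        return (float(u), float(v))
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if w == 0:
        raise ValueError("pixel maps to infinity under this homography")
    return (x / w, y / w)
```

Estimating the matrix itself from four or more pixel/table point pairs is left out; any standard homography fit would fill that step.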


Challenges We Ran Into

Corner identity across folds. After the first fold the cloth has moved, and the system's corner labels have reshuffled. A naive system tries to grab the wrong corner on the second fold. Fix: re-derive every grasp target from the cloth's current geometry on every attempt, never from a remembered label.
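A sketch of what "derive from current geometry" can look like for a four-corner cloth: assign roles from coordinates alone, so the order in which a detector returns points never matters. The sum/difference trick below is a common heuristic, assumed here for illustration; it holds for roughly axis-aligned rectangles in image coordinates (y pointing down) and breaks down near 45 degrees of rotation.

```python
def order_corners(points):
    """Return four points as (top_left, top_right, bottom_right, bottom_left).

    Roles come only from the current coordinates, never from a stored label.
    """
    if len(points) != 4:
        raise ValueError("expected exactly four corner points")
    by_sum = sorted(points, key=lambda p: p[0] + p[1])
    top_left, bottom_right = by_sum[0], by_sum[3]   # smallest and largest x + y
    # Of the two remaining points, top-right has the larger x - y.
    top_right, bottom_left = sorted(by_sum[1:3],
                                    key=lambda p: p[0] - p[1],
                                    reverse=True)
    return (top_left, top_right, bottom_right, bottom_left)
```

Calling this on every frame gives the same labelling regardless of how the points were shuffled by the previous fold.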

Cloth slip. The gripper picks a corner, the cloth slips partway through the drag, and the original target is now wrong. Fix: after every failed attempt, find the corner closest to where the gripper intended to go and re-grasp from there. Converges in 2-3 tries with no force sensors.
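The recovery rule reduces to a nearest-neighbour lookup over the freshly detected corners. A minimal sketch, with `nearest_corner` as a hypothetical name and all points assumed to be in the same pixel frame:

```python
import math

def nearest_corner(corners, intended):
    """Return the detected corner closest to where the gripper intended to go."""
    if not corners:
        raise ValueError("no corners detected")
    return min(corners, key=lambda c: math.dist(c, intended))

# After a slip, re-detect corners and re-grasp from the one nearest the target.
corners_after_slip = [(10, 10), (150, 20), (190, 95), (12, 98)]
regrasp_point = nearest_corner(corners_after_slip, intended=(200, 100))
```

Here the re-grasp point is (190, 95), the corner that ended up closest to the intended target of (200, 100).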

Developer environment instability. The robotics software stack lived on a cloud-synced drive that silently corrupted Python's module loader, breaking all imports with no error message. Diagnosed and routed around entirely in code.

Serial port conflicts. The arm configuration software and the control scripts compete for the same hardware connections. Documented and scripted away in the setup runbook.

Dual-arm coordination. Two arms moving independently pulled the cloth in opposing directions mid-fold. Fix: both arms must reach the lift apex before either descends, enforced by a synchronization barrier.
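A sketch of the barrier using Python's standard `threading.Barrier`, with one thread per arm. The `move(arm, phase)` callable is a hypothetical blocking motion command, and the phase names and timeout are assumptions.

```python
import threading

def fold_with_barrier(arm_names, move, timeout_s=5.0):
    """Lift each arm on its own thread; no arm descends until all are at the apex.

    Returns an ordered log of (arm, event) tuples.
    """
    barrier = threading.Barrier(len(arm_names))
    log, lock = [], threading.Lock()

    def worker(arm):
        move(arm, "lift")
        with lock:
            log.append((arm, "at_apex"))
        # Blocks until every arm has arrived; raises BrokenBarrierError
        # if the timeout expires first.
        barrier.wait(timeout=timeout_s)
        with lock:
            log.append((arm, "descend"))
        move(arm, "descend")

    threads = [threading.Thread(target=worker, args=(a,)) for a in arm_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Even if one arm finishes its lift much earlier, every "at_apex" entry lands in the log before any "descend" entry, which is the ordering that stops the arms from pulling the cloth against each other.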

Cable management. USB cables routed without slack caused mid-motion disconnects. Fixed with careful routing and cable management on the rig frame.


Accomplishments That We're Proud Of

The fold converges by geometry, not by luck. The distance between the cloth's moving edge and the fold target drops from roughly 200 pixels to zero across 2-3 attempts, and when it hits zero the fold is visually correct every time.

A fully working bimanual leader-follower rig with calibration that survives hardware changes, room changes, and software updates. Plug in anywhere and run.

Real-time perception on a laptop CPU with no neural network and no GPU. Adapts its corner count automatically to whatever object is in front of it.

Zero training data. Zero demonstrations. Zero retraining. The same algorithm folded every cloth type we tested. Generalization came for free because the system reasons about geometry, not examples.

The teleop rig is not a demo artifact. It is the human fallback layer of a production-grade manipulation system — the answer to what happens when autonomy fails.


What We Learned

Geometric reasoning generalizes where learned policies do not. If you can express a task as a geometric constraint, you never need training data for it.

Closed-loop iteration is the right primitive for deformable objects. Cloth moves unpredictably. Re-evaluating the world state after every attempt and correcting from there handles variability that no training set can fully anticipate.

Hardware bring-up takes longer than the algorithm. The core reasoning system took hours to write. Getting four physical arms, two cameras, and a robotics software stack running reliably on a hackathon timeline took longer. A clear setup runbook is as important as the code.

The teleop rig changes the deployment story entirely. Every autonomous system fails sometimes. The question is what happens next. Human override via teleoperation, followed by seamless handoff back to autonomy, is the answer that makes real-world deployment viable.


What's Next for Infer

Visual grasp confirmation. After every grasp, re-check that the target corner actually moved with the arm before committing to the drag. Catches failures earlier without any additional sensors.
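Since this check is not built yet, the following is only a sketch of one possible form: compare how far the tracked corner moved with how far the gripper moved over a short probe motion. It assumes both positions are available in the same pixel frame; the function name and both thresholds are hypothetical.

```python
import math

def grasp_confirmed(corner_before, corner_after, gripper_before, gripper_after,
                    min_motion_px=10.0, max_mismatch_px=15.0):
    """Check whether the corner followed the gripper.

    Returns True if the displacements match, False if the corner stayed behind,
    and None if the gripper has not moved far enough to judge.
    """
    gdx = gripper_after[0] - gripper_before[0]
    gdy = gripper_after[1] - gripper_before[1]
    if math.hypot(gdx, gdy) < min_motion_px:
        return None
    cdx = corner_after[0] - corner_before[0]
    cdy = corner_after[1] - corner_before[1]
    return math.hypot(cdx - gdx, cdy - gdy) <= max_mismatch_px
```

A False result early in the drag would let the control loop abort and re-grasp before the full motion is wasted.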

Semantic grasp point detection. The current system finds geometrically extremal points. A small trained model would find semantically meaningful ones — sleeve tips, collar points, hem corners — unlocking full generalization to arbitrary garments.

Scaled remote supervision. One operator, many deployed arms across many locations. Step in via teleop on exception, hand back to autonomy. The architecture already supports it.

Generalization beyond folding. Stacking, assembly, sorting, surface deformation. Any physical task that can be expressed as a geometric constraint runs on the same planner without retraining.
