GRAB - Gripper with Rapid Adaptive Behavior

Inspiration

You pick up a wine glass differently than you pick up a brick. You don't think about it - your brain sees the object, your hand adjusts before you even touch it.

Most robots can't do this. A typical robot arm ships with one gripper. One grip for a peach and a power drill. One grip for a glass vial and a cardboard box. It's a big reason warehouse robots still crush fragile inventory. It's a big reason assistive robots can't be trusted to hand someone their medication. The arm is fine. The hand is the problem.

We built GRAB to give robots what humans take for granted - the ability to look at something, understand how it should be held, and adapt before making contact. One arm, swappable grippers, zero hesitation.

What it does

GRAB is a robot arm built on Hugging Face's LeRobot framework that sees an object, figures out what kind of gripper it needs, swaps its own end-effector from a tool rack, picks the object up, and places it where it needs to go. No human in the loop. No manual tool changes. No compromise grippers.

Place a sponge ball on the table. The camera sees it. CLIP - a vision-language model that has never been trained on your specific objects - classifies it as soft and deformable. The system selects the petal gripper: flexible orange fingers that wrap around irregular shapes without crushing them. The arm docks onto the petal claw from the rack, picks up the ball, drops it in the box.

Now place a rigid box on the table. Same camera, same system, different conclusion. CLIP says rigid. The arm returns the petal gripper to the rack, picks up the stock claw with rigid parallel fingers, and grabs the box cleanly.

One arm. Two claws. The right one, every time, chosen by AI that has never seen your objects before.

How we built it

Perception pipeline: An overhead camera feeds frames into YOLOv8n (3.2M params, 30+ FPS inference) for object detection. Detected regions are cropped and passed to CLIP (ViT-B/32), which performs zero-shot material classification by scoring each crop against text prompts like "a soft deformable object" and "a rigid solid object" and taking the best match. No fine-tuning, no custom dataset. CLIP handles novel objects out of the box.
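A minimal sketch of that final decision step, assuming the image crop and the two text prompts have already been turned into CLIP embeddings. The function names, the prompt-to-gripper mapping, and the logit scale of 100 are illustrative assumptions, not code from the GRAB repository.

```python
import math

# Hypothetical mapping from text prompt to gripper name.
PROMPT_TO_GRIPPER = {
    "a soft deformable object": "petal",
    "a rigid solid object": "stock",
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def choose_gripper(image_emb, text_embs, logit_scale=100.0):
    """Pick the gripper whose prompt embedding best matches the image.

    image_emb: embedding of the cropped detection (list of floats).
    text_embs: dict mapping prompt text to its embedding.
    logit_scale: assumed temperature applied before the softmax.
    Returns (gripper_name, probability_of_best_prompt).
    """
    prompts = list(text_embs)
    logits = [logit_scale * cosine(image_emb, text_embs[p]) for p in prompts]
    probs = softmax(logits)
    best = max(range(len(prompts)), key=probs.__getitem__)
    return PROMPT_TO_GRIPPER[prompts[best]], probs[best]
```

In a real pipeline the embeddings would come from the CLIP image and text encoders; the toy vectors used to exercise this sketch stand in for them.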

Manipulation policy: Using LeRobot's teleoperation framework, we collected 25 demonstrations with a leader-follower arm setup. Two cameras (overhead + wrist-mounted) and six joint encoders recorded synchronized observations at 30 Hz. This data trained an ACT (Action Chunking with Transformers) policy through LeRobot's training pipeline: 52M parameters, ResNet18 vision backbone, 100-step action chunking. The policy takes raw camera images and joint states as input and outputs six joint position targets.

Wireless gripper control: Each claw has its own ESP32-S3 and servo motor (MG90S for the soft claw, STS3215 for the stock claw). A master ESP32 on the table connects to the PC via USB serial and relays open/close commands to the active claw over ESP-NOW (Espressif's connectionless peer-to-peer protocol on the 2.4 GHz Wi-Fi radio, 1-5 ms latency, no router required). Claw selection is MAC-address based.
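The PC side of that relay could look roughly like the sketch below. The wire format (one ASCII line per command: target MAC, a space, then OPEN or CLOSE), the MAC addresses, and the function names are all assumptions made for illustration; the actual GRAB framing is not described here.

```python
# Placeholder MAC addresses; real boards report their own.
CLAW_MACS = {
    "petal": "AA:BB:CC:DD:EE:01",
    "stock": "AA:BB:CC:DD:EE:02",
}
VALID_ACTIONS = {"open", "close"}


def encode_command(claw, action):
    """Build the line sent over USB serial to the master ESP32.

    The master is assumed to parse the MAC and forward the action
    to that peer over ESP-NOW.
    """
    if claw not in CLAW_MACS:
        raise ValueError(f"unknown claw: {claw!r}")
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return f"{CLAW_MACS[claw]} {action.upper()}\n".encode("ascii")


def decode_command(line):
    """Inverse of encode_command; returns (mac, action)."""
    mac, action = line.decode("ascii").strip().split(" ")
    return mac, action.lower()
```

With pyserial, the resulting bytes could be passed to the `write` method of an open `serial.Serial` port; the port name and baud rate would depend on the setup.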

Mechanical swap system: 3D-printed twist-lock mount on the arm wrist. The arm approaches a tool rack, pushes into the claw connector, rotates motor 5 (wrist roll) to lock, and lifts. Undocking is the reverse rotation. No fasteners, no magnets, no manual intervention.
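The dock and undock motions described above can be written down as ordered step lists. This is only a sketch: the step names, the `Step` type, and the 45-degree lock angle are hypothetical, and a real sequence would also carry calibrated positions for the other joints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    wrist_roll_deg: float  # target angle for motor 5 (wrist roll)


LOCK_ANGLE_DEG = 45.0  # assumed twist needed to engage the lock


def dock_sequence(lock_angle_deg=LOCK_ANGLE_DEG):
    """Approach the rack, push in, rotate wrist roll to lock, lift."""
    return [
        Step("approach_rack", 0.0),
        Step("push_into_connector", 0.0),
        Step("rotate_to_lock", lock_angle_deg),
        Step("lift", lock_angle_deg),
    ]


def undock_sequence(lock_angle_deg=LOCK_ANGLE_DEG):
    """Lower the claw into the rack, reverse the rotation, retreat."""
    return [
        Step("lower_into_rack", lock_angle_deg),
        Step("rotate_to_unlock", 0.0),
        Step("retreat", 0.0),
    ]
```

Keeping the sequences as data makes it easy to log each step and to check that an undock always ends with the wrist back at its unlocked angle.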

Challenges we ran into

The biggest headache was dynamic motor configuration. The stock claw uses an STS3215 servo on the arm's Feetech bus (6 motors total), but the soft claw uses an MG90S controlled wirelessly via ESP-NOW, which leaves only 5 motors on the bus. LeRobot expects a fixed motor count at startup. We had to build a swap system that dynamically edits the LeRobot source code and swaps calibration files on the fly, maintaining separate 5-motor and 6-motor calibration JSONs that get hot-swapped during dock and undock sequences.
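The calibration half of that swap can be sketched as a small file-copy helper. The file names, the `swap_calibration` function, and the idea of a single "active" calibration file are assumptions for illustration; the part that edits LeRobot's source config is not shown.

```python
import shutil
from pathlib import Path

# Hypothetical file names for the two calibration variants.
MODES = {
    "stock": "calibration_6motor.json",  # stock claw on the Feetech bus
    "soft": "calibration_5motor.json",   # soft claw driven over ESP-NOW
}


def swap_calibration(mode, calib_dir, active_name="active_calibration.json"):
    """Copy the calibration for `mode` over the active calibration file.

    The copy goes to a temporary file first and is then renamed, so a
    crash mid-copy never leaves a half-written active file behind.
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    calib_dir = Path(calib_dir)
    source = calib_dir / MODES[mode]
    if not source.is_file():
        raise FileNotFoundError(source)
    active = calib_dir / active_name
    tmp = active.with_suffix(".tmp")
    shutil.copyfile(source, tmp)
    tmp.replace(active)  # rename over the old active file
    return active
```

Calling `swap_calibration("soft", some_dir)` before docking the petal claw and `swap_calibration("stock", some_dir)` after returning it would mirror the dock and undock hot-swap described above.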

Our first training run failed completely. We recorded 40 episodes with the arm picking objects in both directions, table-to-box and box-to-table, under one task label. The ACT policy learned two conflicting motions and froze mid-air, unable to decide which way to go. We scrapped the data, re-recorded 25 clean unidirectional episodes, and the new policy worked on the first test.
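A toy illustration of how mixed-direction data can produce this kind of stall. It assumes a plain least-squares regressor predicting a single constant action and an even 20/20 split between directions; both are simplifications, and this is not ACT's actual loss or architecture.

```python
def best_constant_action(actions):
    """Constant prediction that minimizes mean squared error: the mean."""
    return sum(actions) / len(actions)


# +1.0 stands for "move toward the box", -1.0 for "move toward the table".
table_to_box = [+1.0] * 20
box_to_table = [-1.0] * 20

mixed = best_constant_action(table_to_box + box_to_table)
one_way = best_constant_action(table_to_box)
```

Here `mixed` comes out as 0.0, meaning no motion in either direction, while `one_way` is 1.0. Under a single task label the averaged target cancels out, which matches the behavior we saw before re-recording unidirectional episodes.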

Camera bandwidth was a constant battle. Running two USB cameras simultaneously through a hub caused timeout errors that crashed the recording pipeline. We had to manage camera lifecycle carefully, killing stale processes between runs and tuning frame capture timing.

The MG90S servo in the soft claw had different PWM characteristics than documented. We manually swept through pulse widths in 50-microsecond increments to find the exact open/close range for our specific claw geometry.
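A sweep like that can be generated with a few lines of Python. The 50 Hz PWM frequency is the usual rate for hobby servos; the 16-bit duty resolution and any start and stop values passed in are assumptions, and the helper names are hypothetical.

```python
def sweep_pulse_widths(start_us, stop_us, step_us=50):
    """Pulse widths in microseconds from start to stop, inclusive."""
    if step_us <= 0 or stop_us < start_us:
        raise ValueError("need step_us > 0 and stop_us >= start_us")
    return list(range(start_us, stop_us + 1, step_us))


def pulse_to_duty(pulse_us, freq_hz=50, resolution_bits=16):
    """Convert a pulse width to a duty value for a PWM peripheral.

    At 50 Hz the period is 20,000 microseconds, so a 1500 us pulse
    is a 7.5% duty cycle.
    """
    period_us = 1_000_000 / freq_hz
    return round(pulse_us / period_us * (2**resolution_bits - 1))
```

Stepping through `sweep_pulse_widths(...)` one value at a time and noting where the claw stops moving gives the usable open and close limits for a given claw geometry.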

Accomplishments that we're proud of

We got teleoperation working within hours of unboxing the arm. Leader-follower mirroring across all joints, both cameras streaming, data recording at 30 Hz. That foundation made everything else possible.

The docking and undocking system works mechanically and in software. The 3D-printed twist-lock mount engages cleanly with a wrist rotation. The swap_mode script toggles between 5-motor and 6-motor configurations, swapping calibration files and editing LeRobot's source config automatically. Dock, grab, place, undock, all sequenced.

Autonomous pick and place runs reliably on the objects used in training. The ACT policy picks up the object from the table and places it in the box without human input. 25 clean demonstrations were enough to train a working policy, and training took under 3 hours on AMD Radeon hardware.

Zero-shot material classification works without any custom training. CLIP classifies objects it has never seen by matching camera images against text descriptions of material properties. Soft objects get the petal gripper. Rigid objects get the stock gripper. No labeled dataset, no fine-tuning, just language.

The entire swappable gripper system was built with off-the-shelf components and 3D-printed parts, keeping the hardware accessible and reproducible.

What we learned

Data quality is everything. Not data quantity - data quality. 25 slow, careful, identical demonstrations outperformed 40 messy ones. In imitation learning, your robot is only as good as the human teacher, and the human teacher needs to be boringly consistent.

Modularity saved us. Because YOLO/CLIP, ACT, and ESP-NOW were completely independent systems, we could develop them in parallel, debug them separately, and connect them at the end. When the camera system broke, the gripper kept working. When the policy was training, we were flashing ESP32s. Three people, three systems, one integration step.

Foundation models are ready for robotics. Not for end-to-end control - not yet - but for perception and classification, they're shockingly good. CLIP's zero-shot material classification worked on objects we'd never tested with, in lighting conditions we hadn't planned for. The gap between research papers and hackathon implementations is closing fast.

What's next for GRAB

More end-effectors: a hook for handles, a suction cup for flat surfaces, tweezers for electronics. Each snaps into the same twist-lock mount.

The real next step is making the swap smarter. Right now the docking positions are calibrated once at setup. We want to train a policy that can find and dock onto any claw in any rack configuration - the way a human can grab any tool from a messy workbench.

Beyond that, multi-object sorting: dump a bin of mixed items on the table, and GRAB sorts them - selecting the right gripper for each, placing fragile items carefully and tossing rigid ones quickly. That's the warehouse use case. That's the eldercare use case. That's the disaster-response use case.

One arm. The right hand for the job. Every time.