The robot, controlled by the LLM, grabbing the specified object. It can position itself and move accurately enough to accomplish the task.
An overall picture of the hardware stack: the LeRobot arm, two overhead cameras, the AprilTag setup, and the mat that represents the workspace.
The vision model creating bounding boxes and coordinates in the robot's space. AprilTags and a PnP solve handle the translation from image coordinates.
The main bottleneck in general‑purpose robotics isn’t hardware or sensing—it’s data. Current approaches like imitation learning and reinforcement learning demand hundreds of training episodes just to master a single task. In a household environment with countless possible tasks, gathering enough teleoperation or simulation data becomes practically impossible.
UPIC is an embodied AI that accomplishes tasks without any task-specific training. We built a set of tools: a world vision model to detect the environment, inverse kinematics (IK) to interact with it, and Claude API calls that use these tools to work out how to move the arm. You give it a goal in plain English, and the system interprets it, breaks it into steps, and uses its tools to achieve it zero-shot, in a way that makes its thinking, reasoning, and movement transparent and easy to follow.
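The goal-to-tools flow described above can be sketched roughly as follows. This is a minimal illustration, not our actual code: the tool names (`detect_objects`, `move_to`, `grasp`), the `state` dictionary, and the `scripted_planner` function are all hypothetical, and the planner is a hard-coded stand-in for the Claude API call so the example runs offline.

```python
from typing import Any, Callable, Dict, List, Tuple

Step = Tuple[str, Dict[str, Any]]  # (tool name, keyword arguments)


def detect_objects(state: dict) -> Dict[str, Tuple[float, float]]:
    """Stand-in vision tool: label -> (x, y) position on the mat, in metres."""
    return dict(state["objects"])


def move_to(state: dict, x: float, y: float) -> str:
    """Stand-in motion tool: a real version would call the IK controller."""
    state["gripper_xy"] = (x, y)
    return f"moved to ({x:.2f}, {y:.2f})"


def grasp(state: dict, label: str) -> str:
    """Stand-in gripper tool: succeeds only if the gripper is over the object."""
    if state["objects"].get(label) != state["gripper_xy"]:
        return f"failed: gripper is not over {label}"
    state["holding"] = label
    return f"grasped {label}"


# Registry the planner's steps are dispatched through (hypothetical names).
TOOLS: Dict[str, Callable[..., Any]] = {
    "detect_objects": detect_objects,
    "move_to": move_to,
    "grasp": grasp,
}


def scripted_planner(goal: str, scene: Dict[str, Tuple[float, float]]) -> List[Step]:
    """Placeholder for the LLM call: picks the first detected label named in the goal."""
    for label, (x, y) in scene.items():
        if label in goal.lower():
            return [("move_to", {"x": x, "y": y}), ("grasp", {"label": label})]
    return []  # nothing in the scene matches the goal


def run(goal: str, state: dict) -> List[str]:
    """Observe, plan, then execute each step, keeping a readable transcript."""
    scene = TOOLS["detect_objects"](state)
    transcript = [f"goal: {goal}", f"scene: {sorted(scene)}"]
    for name, kwargs in scripted_planner(goal, scene):
        result = TOOLS[name](state, **kwargs)
        transcript.append(f"{name}({kwargs}) -> {result}")
    return transcript


state = {"objects": {"mug": (0.10, 0.25), "spoon": (0.30, 0.05)},
         "gripper_xy": (0.0, 0.0), "holding": None}
log = run("Pick up the mug", state)
print("\n".join(log))
```

Keeping every tool call and its result in a transcript is one simple way to get the transparency mentioned above: each step the planner chose is visible alongside what actually happened.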
We first created a 3D-printed jig with attachments for a LeRobot SO-101 robotic arm, 2 standard webcams, and 2 AprilTags. We clamped it to a table and brought in a range of household items, from common to rare, for testing. Once the setup was complete, we used Grounding DINO's open-vocabulary object detector to label the objects in the workspace. Next, we calibrated the cameras against the AprilTags to recover each camera's exact height and angle, then solved a Perspective-n-Point (PnP) problem to get near-exact xy positions, and later size and orientation. Finally,
While this project may look like 3 simple steps, a lot of work and code went into each of them, and each came with its own problems. First, Grounding DINO was not our first choice of image detector. We initially tried YOLO World and Meta AI's SAM library, but YOLO's predictions were not precise or confident enough, and SAM did not fit our project pipeline well. Grounding DINO's image processing also ran at extremely low frame rates at first, which required GPU acceleration to resolve, but it worked in the end. Another challenge was troubleshooting the PnP algorithm. We first tried a stereo setup with the cameras, but since the robot arm often obstructed one camera's view, we couldn't consistently identify all objects in both lenses. This led us to use AprilTags and PnP instead. With this approach, objects in the air could not be located properly, and the size and height values were not as good as we would have liked, but they were sufficient for the LLM to process requests...(add IK + LLM struggles).
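A short sketch may help show why a single calibrated camera locates objects on the mat well but misplaces objects in the air. Once the camera pose is known, a pixel only defines a viewing ray, and a position comes from intersecting that ray with an assumed plane (here the mat at z = 0). The code below is an illustration under stated assumptions, not our implementation: the intrinsics, the straight-down camera pose, and the function names are made up for the example. If the pose comes from a PnP solve that returns a world-to-camera rotation R and translation t, the camera-to-world rotation is the transpose of R and the camera center is -Rᵀt.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def mat_vec(m: Sequence[Sequence[float]], v: Sequence[float]) -> Vec3:
    """Multiply a 3x3 matrix by a 3-vector."""
    return tuple(sum(m[r][c] * v[c] for c in range(3)) for r in range(3))


def pixel_to_plane(u, v, fx, fy, cx, cy, r_cam_to_world, cam_center,
                   plane_z=0.0) -> Vec3:
    """Intersect the viewing ray through pixel (u, v) with the plane z = plane_z."""
    ray_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)  # ray direction, camera frame
    ray_world = mat_vec(r_cam_to_world, ray_cam)   # same ray, world frame
    if abs(ray_world[2]) < 1e-12:
        raise ValueError("ray is parallel to the plane")
    s = (plane_z - cam_center[2]) / ray_world[2]   # distance along the ray
    if s <= 0:
        raise ValueError("plane is behind the camera")
    return tuple(cam_center[i] + s * ray_world[i] for i in range(3))


def project(point, fx, fy, cx, cy, r_cam_to_world, cam_center):
    """Pinhole projection of a world point to pixel coordinates."""
    rel = [point[i] - cam_center[i] for i in range(3)]
    # World -> camera uses the transpose of the camera-to-world rotation.
    r_t = [[r_cam_to_world[c][r] for c in range(3)] for r in range(3)]
    x, y, z = mat_vec(r_t, rel)
    return (fx * x / z + cx, fy * y / z + cy)


# Assumed example camera: 1 m above the mat origin, looking straight down
# (camera x = world x, camera y = world -y, camera z = world -z).
CAM = dict(fx=800.0, fy=800.0, cx=320.0, cy=240.0,
           r_cam_to_world=[[1, 0, 0], [0, -1, 0], [0, 0, -1]],
           cam_center=(0.0, 0.0, 1.0))

# An object lying on the mat is recovered correctly.
on_mat = pixel_to_plane(*project((0.1, 0.0, 0.0), **CAM), **CAM)

# The same object held 0.5 m above the mat projects to a different pixel;
# intersecting that ray with z = 0 puts it too far from the camera axis.
u, v = project((0.1, 0.0, 0.5), **CAM)
in_air_wrong = pixel_to_plane(u, v, **CAM)
in_air_right = pixel_to_plane(u, v, plane_z=0.5, **CAM)

print(on_mat, in_air_wrong, in_air_right)
```

In this toy setup the lifted object is read at twice its true distance from the camera axis, and the estimate is only corrected when the true height is supplied. That missing height is exactly what stereo or another depth sensor would provide.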
Overall, we were very impressed with how UPIC reasoned about new, unfamiliar tasks. Its solutions often matched what we had in mind, and it occasionally came up with solutions we had overlooked, some of which worked better. We were also impressed with the accuracy of the AprilTag/PnP coordinate readings. We were disappointed that we couldn't implement stereo depth perception, but this alternative served as a good stopgap. Finally, we were amazed at the Claude API's approach to problem solving and the creativity of its solutions. We anticipated that this would be the hardest part of the project, but Claude's ability to reason well and realistically did most of the heavy lifting in determining arm actions and action sequences.
We ultimately realized how powerful modern code-writing models are at speeding up the software side of ambitious projects. Contrary to our initial belief, around 75% of our hacking time went into connecting the arm and cameras to the LLM framework rather than tuning the LLM's decision-making, which needed little work thanks to its strong generalization capabilities. Claude Code's efficient debugging features also helped clean up parts of our scripts that would otherwise have required hours of tedious work. We also learned the importance of comparing different problem-solving methods and libraries: our decisions to switch to Grounding DINO and to import a custom IK controller paid off significantly in time saved and program efficiency. Finally, we learned the importance of communication and coordination. We were constantly pushing to and pulling from our GitHub repository, so merging properly, avoiding overwrites, and making sure we were working with up-to-date files were important for keeping the project safe, and they are skills that industry especially requires.
While we are very proud of UPIC's capabilities, we recognize several areas where performance and efficiency could improve significantly. First, as discussed above, we would like to implement standard depth perception with either stereo cameras or possibly LiDAR to get better object locations, especially for objects above the ground. If we go with a stereo camera setup, we think it would be better to mount it on the arm itself for a better view of the workspace and to prevent arm occlusion. Additionally, while Grounding DINO's object detection worked well, it sometimes failed to identify more niche items and ran at relatively low frame rates even with GPU acceleration. We worked around this by passing classification of niche objects directly to our LLM, but a more optimized classification method could significantly reduce runtime and enable smoother LLM generalization. Most importantly, the training process for UPIC does not stop after this hackathon. We will continue to train it on various tasks with more niche items, and we are excited about the possibility of applying UPIC's methodology to more advanced robotic systems like humanoids or robotic animals, as well as encouraging robotics companies to advance general robot technology.