Inspiration
Our team brings together a group of ambitious CS and engineering students with a shared passion for high school robotics. We are driven by a singular observation: tech is not just advancing; it is exploding. Breakthroughs in foundation models and computer vision are happening weekly. We saw a unique opportunity to stop building simple, hard-coded demos and instead take a real stab at solving one of the most critical, unsolved challenges in generalized robotics: how to build a seamless bridge between human language, computer vision, and physical execution. BRIDGE is our ambitious answer.
What it does
BRIDGE is a language-to-action pipeline that allows any user to control complex robots using intuitive, natural language commands. Rather than telling a robotic arm, "Move end effector to [0.5, 0.2, 0.1]", our system allows you to say, "Grab the coffee mug by the handle."
The magic is in how our system translates that abstract intent. It decomposes a 3D environment into a structured map of geometric primitives and "affordances." By using real-time AI reasoning, BRIDGE can perceive previously unseen objects, determine how they can be physically manipulated, and generate the code to execute the task, bridging the gap between human intent and robot action.
How we built it
We structured our pipeline into four core phases to solve the problem chronologically, from the spoken command to the final motor movement.
Phase 1: Semantic 3D Modeling (The "What is this?")
When a user tells the robot to "Grab the coffee mug by the handle," the Large Language Model (LLM) understands the request, but the robot arm is blind. To give it vision, we use a spinning Kinect sensor that captures a full, unstructured 3D point cloud of the environment.
We then perform Semantic Injection. We use vision-language and segmentation models (like CLIP and Segment Anything 3D) to "paint" meaning onto this raw data, highlighting a cluster of points and identifying it as "the Mug." Finally, our system performs Geometric Decomposition, mathematically breaking that cluster down into simple shapes: the mug becomes a hollow cylinder attached to a half-torus (the handle).
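As a rough illustration of Geometric Decomposition, the sketch below fits a cylinder to a labeled cluster of points. It is not our actual fitting code: the `Cylinder` record, the function names, and the synthetic data are all hypothetical, and it assumes the cylinder axis is parallel to z and that the points cover the wall evenly all the way around.

```python
import math
from dataclasses import dataclass


@dataclass
class Cylinder:
    """Hypothetical primitive record: a cylinder whose axis is parallel to z."""
    cx: float
    cy: float
    radius: float
    z_min: float
    height: float


def fit_vertical_cylinder(points):
    """Estimate a z-aligned cylinder from (x, y, z) points on its side wall.

    Assumes the points cover the wall roughly evenly all the way around;
    a one-sided scan would bias the centroid toward the sensor.
    """
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    # Mean distance from the axis is the radius estimate.
    radius = sum(math.hypot(p[0] - cx, p[1] - cy) for p in points) / n
    z_min = min(p[2] for p in points)
    z_max = max(p[2] for p in points)
    return Cylinder(cx, cy, radius, z_min, z_max - z_min)


def synthetic_mug_wall(cx=0.5, cy=0.2, radius=0.04, height=0.10,
                       rings=5, per_ring=36):
    """Generate noise-free sample points on a cylinder wall for the demo."""
    pts = []
    for i in range(rings):
        z = height * i / (rings - 1)
        for j in range(per_ring):
            a = 2 * math.pi * j / per_ring
            pts.append((cx + radius * math.cos(a),
                        cy + radius * math.sin(a), z))
    return pts


mug_body = fit_vertical_cylinder(synthetic_mug_wall())
print(mug_body)
```

A real scan is noisy and partial, so a robust fit (for example, a RANSAC-style search over candidate axes) would likely replace the simple averages used here.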
Phase 2: Affordance Mapping (The "How do I interact with this?")
The robot now knows a cylinder and torus are on the table, but how does it know what to do with them? This is where Affordance Mapping is key. We define an affordance as a physical property of an object that dictates how it can be used.
Our system maps these rules to the geometric primitives. It knows that "torus shapes" afford hooking or pinching, while "cylinders with flat tops" afford pushing or wrapping. By mapping these to the mug's decomposed model, the robot now understands that the torus geometry (the handle) is the best place to initiate a grasp.
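The rule lookup described above can be sketched as a small table keyed by primitive type. The rule table contents mirror the examples in the text, but the dictionary keys ("name", "primitive", "center"), the function names, and the coordinates are hypothetical and chosen only for illustration.

```python
# Hypothetical rule table: primitive type -> actions it affords.
AFFORDANCE_RULES = {
    "torus": ["hook", "pinch"],
    "flat_top_cylinder": ["push", "wrap"],
}


def build_affordance_map(parts):
    """Attach affordances to each decomposed part.

    `parts` is a list of dicts with hypothetical keys "name", "primitive",
    and "center" (x, y, z in meters). Unknown primitives get an empty list.
    """
    return [
        {**part, "affordances": AFFORDANCE_RULES.get(part["primitive"], [])}
        for part in parts
    ]


def candidates_for(affordance_map, action):
    """Return the parts that afford the requested action."""
    return [p for p in affordance_map if action in p["affordances"]]


# Illustrative decomposition of the mug; the coordinates are made up.
mug_parts = [
    {"name": "body", "primitive": "flat_top_cylinder", "center": (0.50, 0.20, 0.05)},
    {"name": "handle", "primitive": "torus", "center": (0.55, 0.20, 0.05)},
]
affordance_map = build_affordance_map(mug_parts)
pinchable = candidates_for(affordance_map, "pinch")
print([p["name"] for p in pinchable])
```

Keeping the rules in plain data means new primitive types can be added without touching the lookup logic.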
Phase 3: The LLM Orchestrator ("Code as Policies")
The LLM does not perform complex math or calculate motor angles. Instead, it acts as the system's high-level logic engine. It takes the user's prompt, analyzes the generated affordance map, and writes a Python script at run time by pulling from our "Bank of Primitives."
For the mug, the LLM’s logic determines that the best grasp point is the 'Torus' geometry. It then writes a script calling the pinch_grasp() primitive from our skill bank and populates it with the precise 3D coordinates of the target geometry.
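A minimal sketch of this "Code as Policies" pattern follows. Only the name pinch_grasp() comes from our skill bank; its signature, the `push` primitive, the registry, and the hard-coded script standing in for LLM output are assumptions made for the example.

```python
SKILL_BANK = {}
executed = []  # log of primitive calls, standing in for real robot commands


def skill(fn):
    """Register a function in the bank of primitives."""
    SKILL_BANK[fn.__name__] = fn
    return fn


@skill
def pinch_grasp(target, width=0.02):
    """Placeholder: a real version would command the gripper."""
    executed.append(("pinch_grasp", tuple(target), width))


@skill
def push(target, direction):
    """Placeholder for a second, hypothetical primitive."""
    executed.append(("push", tuple(target), tuple(direction)))


def run_generated_script(script):
    """Execute LLM-written code with only the skill bank in scope.

    Emptying __builtins__ limits what the script can name, but it is
    not a security sandbox; untrusted code needs real isolation.
    """
    exec(script, {"__builtins__": {}, **SKILL_BANK})


# Stand-in for LLM output given "Grab the coffee mug by the handle".
generated = "pinch_grasp(target=(0.55, 0.20, 0.05))"
run_generated_script(generated)
print(executed)
```

Restricting the script to registered primitives keeps the LLM at the level of choosing and parameterizing skills, which matches its role as a logic engine rather than a motion controller.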
Phase 4: Execution (IK + RL in Isaac Sim)
This is where the software brain translates into physical motor movement. We use a modular approach trained within NVIDIA Isaac Sim.
Inverse Kinematics (IK) handles the Macro-Movement: IK solves for the shoulder and elbow joint angles along a collision-free path that moves the gripper through empty space, stopping it exactly one inch from the mug handle.
Reinforcement Learning (RL) handles the Micro-Movement: Once the gripper is positioned, we hand control to RL. Closed-form math cannot anticipate friction or slipping, but our RL primitives were trained on millions of virtual iterations in sim. The RL policy dynamically adjusts the motors to achieve a stable squeeze on the handle, even reacting to live camera feedback if the mug shifts.
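The handoff between the two stages can be illustrated with a toy planar arm. This is a simplified sketch, not our Isaac Sim setup: it assumes a two-link arm in 2D with made-up link lengths, ignores collision checking, and uses a hypothetical `rl_fine_grasp` placeholder where the trained policy would take over.

```python
import math

INCH = 0.0254  # one inch in meters


def pre_grasp_point(target, approach_dir, standoff=INCH):
    """Back off from the target along the approach direction."""
    norm = math.hypot(*approach_dir)
    return tuple(t - standoff * d / norm for t, d in zip(target, approach_dir))


def two_link_ik(x, y, l1, l2):
    """Return (shoulder, elbow) angles in radians that reach (x, y).

    Picks the solution with a non-negative elbow angle;
    raises ValueError if the target is out of reach.
    """
    d2 = x * x + y * y
    cos_elbow = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if not -1.0 <= cos_elbow <= 1.0:
        raise ValueError("target out of reach")
    elbow = math.acos(cos_elbow)
    shoulder = math.atan2(y, x) - math.atan2(
        l2 * math.sin(elbow), l1 + l2 * math.cos(elbow)
    )
    return shoulder, elbow


def forward_kinematics(shoulder, elbow, l1, l2):
    """Gripper position for the given joint angles (used to check the IK)."""
    x = l1 * math.cos(shoulder) + l2 * math.cos(shoulder + elbow)
    y = l1 * math.sin(shoulder) + l2 * math.sin(shoulder + elbow)
    return x, y


def rl_fine_grasp(start_pose):
    """Placeholder for the learned policy that closes the final gap."""
    return {"handoff_pose": start_pose, "controller": "rl"}


handle = (0.55, 0.20)  # illustrative handle position in meters
pre = pre_grasp_point(handle, approach_dir=(1.0, 0.0))
shoulder, elbow = two_link_ik(*pre, l1=0.4, l2=0.4)
reached = forward_kinematics(shoulder, elbow, 0.4, 0.4)
handoff = rl_fine_grasp(reached)
print(pre, reached)
```

The forward-kinematics round trip is a cheap sanity check that the IK solution actually lands on the pre-grasp point before the learned controller takes over.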
Challenges we ran into
The path to generalized robotic manipulation is not straightforward. We immediately faced massive compute needs, requiring high-performance processing to manage the simultaneous simulation, computer vision, and LLM reasoning. Mathematically, navigating an arm through empty space from any given starting position means choosing among infinitely many approach angles, creating a massive pathfinding puzzle.
Simulating precise physical interaction is inherently difficult, as even small, unpredictable variables in friction, collision physics, and sensor noise play a big role in a system's success or failure. We quickly realized why this remains an unsolved problem in the field: small errors compound quickly, and the system's behavior is hard to predict.
Accomplishments that we're proud of
We are most proud of taking a credible, functional stab at an unsolved problem. We didn't just build a specific-object-picking demo. We designed and implemented a generalizable pipeline, from 3D perception to physical squeeze, for manipulating unseen objects based purely on natural language, all within the intense constraints of a short hackathon. This is a meaningful step toward the future of human-robot interaction.
What we learned
Generalized object manipulation is a humbling engineering challenge. Handling even simple, organic shapes requires a level of adaptive reasoning that we often take for granted as humans. We learned that chaos exists in all complex systems, and tiny variables (from virtual sensor calibration errors to physics simulation flaws) can play a disproportionate role in the outcome. A critical lesson was how to integrate the K2 reasoning model to drive our core LLM orchestrator.
What's next for BRIDGE
Our roadmap for BRIDGE is ambitious. We plan to keep improving our affordance and segmentation models. Spending more time in simulation should yield more robust RL policies and more accurate actions. We are focused on implementing Action Chunking and diffusion policies to replace modular IK/RL with a single, massive neural network capable of predicting continuous, human-like movements. Finally, our ultimate goal is deployment on physical, real-world robotic systems, moving from virtual Isaac Sim residuals to managing the physics of a physical robot.