Inspiration

The inspiration for this project came from the idea of a beach-cleaning robot.
The robot’s main task is to move across the beach and collect litter.
Since it can also communicate, it doubles as a helpful advisor, answering common questions such as those about the weather.

What it does

Face Detection and Interaction

  • The robot remains in a waiting position until it recognizes a human face.
  • Once detected, it greets the person by saying:
    “Hello, my human friend.”
  • You can then speak to the robot and ask questions.
  • Questions are answered by the GPT-OSS 20B model, which is:
    • Configured with agentic capabilities
    • Equipped with tools such as Wikipedia and DuckDuckGo search
  • The response is delivered through the robot’s text-to-speech system.
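The wait-greet-listen flow above can be sketched as a small state machine. This is a minimal illustration, not the project's actual code: `next_action` and the boolean `face_present` input are hypothetical stand-ins (in the real robot, Mediapipe's face detection and the text-to-speech system fill these roles).

```python
# Minimal sketch of the wait -> greet -> listen flow described above.
# face_present stands in for Mediapipe face detection; the returned
# utterance would be spoken through the robot's text-to-speech system.

GREETING = "Hello, my human friend."

def next_action(state, face_present):
    """Return (new_state, utterance) for one tick of the interaction loop."""
    if state == "waiting" and face_present:
        return "listening", GREETING   # greet once when a face appears
    if state == "listening" and not face_present:
        return "waiting", None         # person left; go back to waiting
    return state, None                 # otherwise keep the current state

# Example: a face appears while the robot is waiting.
state, utterance = next_action("waiting", face_present=True)
```

Keeping the loop this explicit makes it easy to bolt the speech-recognition step onto the "listening" state later.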

Scene Understanding and Cleanup

  • The GPT-OSS model has access to a specialized tool that detects whether the surroundings are messy.
  • When called:
    1. A vision-language model (VLM) interprets the scene.
    2. It returns an assessment of cleanliness.
    3. GPT-OSS evaluates the result.
    4. If the environment is messy, the robot initiates cleanup.
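One way to expose such a check to an agentic model is as a callable tool whose result the model then reasons over. The sketch below is an assumption about how this could look, using an OpenAI-style tool schema; the tool name `check_messy` and the keyword-based `interpret_assessment` helper are illustrative, not the project's actual implementation.

```python
# Sketch of a "check_messy" tool for the agent. In the real system, a
# vision-language model interprets the camera frame and returns a free-text
# cleanliness assessment; the agent then decides whether to start cleanup.

CHECK_MESSY_TOOL = {
    "type": "function",
    "function": {
        "name": "check_messy",
        "description": "Look at the surroundings and report whether they are messy.",
        "parameters": {"type": "object", "properties": {}},
    },
}

def interpret_assessment(text):
    """Map the VLM's free-text assessment to a cleanup decision."""
    messy_words = ("messy", "litter", "trash", "dirty", "cluttered")
    return any(word in text.lower() for word in messy_words)

# Example: the VLM's answer triggers cleanup.
should_clean = interpret_assessment("The beach is cluttered with plastic trash.")
```

In practice the agent itself can judge the VLM's text directly; a deterministic keyword check like this is just a cheap guardrail.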

Object Detection and Manipulation

  • YOLOv5 is used to localize trash items (e.g., a toothbrush).
  • The robot calculates their relative positions.
  • These coordinates are sent via TCP socket to the myCobot280 6-DOF robotic arm.
  • Running on the NVIDIA Jetson platform, the robot uses the ROS2 MoveIt inverse kinematics solver to compute a valid trajectory that guides the gripper toward the target.
  • Once the target is reached, the end effector grasps the object and removes it, effectively cleaning the environment.
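The detection-to-arm handoff above can be sketched as two small helpers: one converts a YOLOv5 bounding box into a relative position, the other packs it into a TCP message. The newline-delimited JSON wire format is an assumption for illustration; in the real system the detections come from a YOLOv5 model (e.g. one loaded via `torch.hub.load("ultralytics/yolov5", "yolov5s")`).

```python
import json

# Sketch: turn a YOLOv5 bounding box into a relative position and a TCP
# message for the arm. The JSON-lines wire format here is an assumption.

def bbox_center_offset(bbox, frame_w, frame_h):
    """Return the box center as offsets in [-1, 1] from the frame center."""
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    return (2 * cx / frame_w - 1, 2 * cy / frame_h - 1)

def encode_target(label, offset):
    """Pack one detection as a newline-delimited JSON message for the arm."""
    dx, dy = offset
    return (json.dumps({"label": label, "dx": dx, "dy": dy}) + "\n").encode()

# Example: a toothbrush detected dead-center in a 640x480 frame.
msg = encode_target("toothbrush", bbox_center_offset((300, 220, 340, 260), 640, 480))
# msg would then be sent with socket.sendall(msg) to the myCobot280 host.
```

Normalizing to the frame center keeps the message independent of camera resolution, which simplifies the calibration math on the arm side.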

How I built it

I started by building the robot's TCP/IP interface and connecting it to the ROS2 inverse kinematics solver. Next, I integrated Mediapipe so the robot could detect nearby human faces, and added speech recognition and text-to-speech output. The robot then learned to detect objects with YOLOv5 and to interpret scenes with the multimodal VLM. Finally, I added the powerful agentic capabilities of gpt-oss together with the tools the model has access to.

Challenges I ran into

One of the biggest challenges I faced was fine-tuning the inverse kinematics solver.
The difficulty lies in ensuring that the gripper can move correctly in every possible pose within 3D space.
Accurate camera-to-end-effector calibration is essential here.
It has been both challenging and fun to experiment with different solutions for the wide variety of scenarios.
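The calibration step boils down to expressing a point seen in the camera frame in the arm's base frame via a fixed homogeneous transform. A minimal numpy sketch, with a made-up calibration matrix standing in for the robot's actual one:

```python
import numpy as np

# Sketch of camera-to-base calibration: a point detected in the camera frame
# is mapped into the arm's base frame by a fixed 4x4 homogeneous transform.
# This transform is illustrative (camera rotated 180 degrees about Z and
# offset 0.2 m along base X), not the robot's actual calibration.

T_base_cam = np.array([
    [-1.0,  0.0, 0.0, 0.2],
    [ 0.0, -1.0, 0.0, 0.0],
    [ 0.0,  0.0, 1.0, 0.0],
    [ 0.0,  0.0, 0.0, 1.0],
])

def camera_to_base(point_cam, T=T_base_cam):
    """Transform a 3D point from camera coordinates to base coordinates."""
    p = np.append(np.asarray(point_cam, dtype=float), 1.0)  # homogeneous form
    return (T @ p)[:3]

# A target 0.1 m right, 0.05 m down, 0.3 m ahead of the camera
# lands at (0.1, -0.05, 0.3) in the base frame under this transform.
target_base = camera_to_base([0.1, 0.05, 0.3])
```

With the target expressed in the base frame, it can be handed straight to the MoveIt solver as the goal pose for the gripper.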

Accomplishments that I am proud of

I am proud of what I’ve learned about inverse kinematics and how to control a robotic arm in three-dimensional space.

What I learned

Through this project, I learned that gpt-oss offers strong agentic tool-calling capabilities while still being able to run on relatively old hardware, like an RTX 3090, alongside other models required for robotic movement.

What's next for gpt-oss 20b powered agentic 6-DoF robot

If I win the robotics challenge, I would dedicate my time to implementing a larger Behavior Model on a more powerful GPU, enabling more advanced and complex robotic movements.
