Inspiration
Robot Operating System 2 (ROS2) is one of the most popular software development kits in robotics, if not the most popular. As the software subteam of VT CRO's SoutheastCon design team, we use ROS2 for nearly all of our tasks, which typically involve simulation, navigation, localization, and hardware communication. It is therefore important that every subteam member is familiar with the SDK. Unfortunately, one thing we've typically struggled with is getting new members up to speed quickly, since onboarding usually requires many days of tediously working through documentation, which can sap interest and motivation. We therefore wanted to create a project that demonstrates the power and usefulness of ROS2 across multiple domains, while also showing that learning ROS2 can be fun.
What it does
Our mini robot uses the Google Gemini API to detect objects through an Intel RealSense depth camera and announces what it sees via speech. The core of this project is ROS2, which we integrated with the Gemini API to create a robot that spins on a gimbal and essentially tells you what it sees. This goes beyond a baseline detection model: our project combines a robotics stack with a speech generation model and analyzes near-real-time images to communicate a verbal understanding of the scene. The hardware also has a partner website, ROS Garden (a play on "rose garden"), where anyone interested can learn more about Robot Operating System 2. The Gemini API assisted with the website's full-stack development as well. ROS Garden's learning style is very hands-on: after an intuitive concept overview, learners are assigned exercises with specific tasks. We believe this is a much more engaging way to learn the most important SDK on our design team.
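The capture-describe-speak loop above can be sketched roughly as follows. This is a minimal illustration only: the ROS2 node wiring (rclpy), the RealSense driver, and the real Gemini client are omitted, and the names `Frame`, `sweep_angles`, `narrate`, `describe_fn`, and `speak_fn` are hypothetical stand-ins, not names from our actual code.

```python
# Hypothetical sketch of the robot's core loop: sweep the gimbal,
# describe each frame, and speak the description aloud.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Frame:
    """A camera frame; real code would carry RGB + depth from the RealSense."""
    data: bytes


def sweep_angles(start: int = -90, stop: int = 90, step: int = 15) -> List[int]:
    """Gimbal servo positions (degrees) for one left-to-right sweep."""
    return list(range(start, stop + 1, step))


def narrate(frame: Frame,
            describe_fn: Callable[[Frame], str],
            speak_fn: Callable[[str], None]) -> str:
    """Describe one frame and speak the result; returns the spoken text.

    In the real robot, describe_fn would call the Gemini API on the image
    and speak_fn would play a synthesized .wav file.
    """
    description = describe_fn(frame)
    sentence = f"I see {description}."
    speak_fn(sentence)
    return sentence


# Example run with stand-ins for the Gemini and TTS calls:
spoken: List[str] = []
text = narrate(Frame(b""),
               describe_fn=lambda f: "a red cup on the table",
               speak_fn=spoken.append)
```

Injecting `describe_fn` and `speak_fn` as callables keeps the loop testable without a camera, an API key, or a sound device.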
Challenges we ran into
Our number one problem was text-to-speech from our images. Since the entire project is dockerized, the challenge was figuring out how to play .wav files from inside Docker containers; we kept hitting permission errors and interrupts that prevented audio from playing. Another challenge was dealing with hardware, since our entire group comes from a software background. We had to refamiliarize ourselves with CAD to design a gimbal and servo holder, which took much longer than expected due to our lack of experience. We also ran into typical software bugs and spent another large portion of our time learning how to use the AI tooling and integrating all of the project's components.
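For context on the audio issue: containers cannot see the host's sound hardware unless it is passed through explicitly, which is one common source of the permission errors described above. The fragment below is a hypothetical docker-compose sketch of that passthrough, not our exact configuration; device paths and group names vary by host, and PulseAudio-based setups need a socket mount instead.

```yaml
# Hypothetical compose fragment: expose host ALSA devices to the container.
services:
  robot:
    build: .
    devices:
      - /dev/snd:/dev/snd   # share the host's sound devices
    group_add:
      - audio               # let the container user access them
```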
Accomplishments that we're proud of
Our biggest accomplishment was successfully integrating the Google Gemini API into a working robot with a camera and servo. While it may not function perfectly as intended, we made very good progress over 36 hours and are satisfied with the fruit of our labor. We learned a lot from the experience, as described in the next section, and it was fun to build a project that wasn't purely software.
What we learned
We gained much more experience with open source APIs and learned just how much more capable our systems can become by leveraging artificial intelligence. We also gained experience in hardware design, CAD, and embedded systems.
What's next
After this hackathon, we plan to improve our ROS2 learning system further by adding interactive activities that give users feedback, creating an almost game-like experience for better immersion. We also plan to explore more open source libraries for future robotics competitions.