Inspiration
As a passionate roboticist, I've witnessed firsthand the transformation unfolding in our field. The industry is generating increasingly rich egocentric datasets for Physical AI development, which traditionally feed computationally intensive reinforcement learning policies for task execution. This conventional approach, however, remains prohibitively expensive and resource-demanding. That challenge inspired us to explore more efficient alternatives that could democratize advanced robotics capabilities while preserving the sophisticated functionality needed for real-world applications.
What it does
Our humanoid robot executes complex, multi-step everyday tasks through advanced spatial understanding and reasoning capabilities. When instructed to "clean up the table" or "sort kitchen items," it intelligently distinguishes between trash and valuable objects, properly disposing of waste while organizing items that should be preserved. The robot demonstrates contextual intelligence by selecting appropriate cleaning tools—knowing when a situation calls for a paper towel versus a brush. This practical reasoning enables it to handle varied household scenarios without specific programming for each task variation, making it truly useful for daily living assistance.
How we built it
We integrated a Nao Humanoid Robot with a custom controller system to enable walking and arm manipulation capabilities essential for embodied AI tasks. The robot processes multimodal inputs through Gemini's API, accepting both voice commands and visual information from its built-in RGB camera and microphone. When users interact with the robot using natural language, our system breaks down complex commands into manageable action sequences. Each subtask is evaluated using Gemini's spatial understanding capabilities to determine the optimal approach. For example, when asked to clean a coffee spill, the robot:
1. Scans the environment to identify suitable cleaning items
2. Ranks potential tools (prioritizing paper towels and cloths over brushes or scrubbers)
3. Decomposes the task into precise motion sequences (walking to the item, reaching, grasping)
4. Executes actions through Gemini's function calling feature, which triggers the appropriate low-level robot movements
This architecture creates a seamless bridge between high-level understanding and physical execution without requiring task-specific training.
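To make the bridge concrete, here is a minimal sketch of how robot-action stubs can be wired into Gemini function calling with the google-generativeai Python SDK. The model name and the three tool functions (walk_to, grasp, wipe_surface) are illustrative placeholders rather than our exact controller interface; in the real pipeline each tool dispatches to the NAO's low-level motion controller.

```python
# Sketch: map a natural-language command to robot actions via Gemini
# function calling. The tool functions below are illustrative stubs; in the
# real pipeline they dispatch to the NAO's low-level controller.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumes an API key is available


def walk_to(x: float, y: float) -> str:
    """Walk to a point (x, y), in meters, in the robot's base frame."""
    # Here we would issue a locomotion command to the robot controller.
    return f"walking to ({x:.2f}, {y:.2f})"


def grasp(object_name: str) -> str:
    """Reach toward and grasp the named object."""
    # Here we would run the reach/grasp motion sequence.
    return f"grasping {object_name}"


def wipe_surface(tool: str) -> str:
    """Wipe the spill using the selected cleaning tool."""
    return f"wiping with {tool}"


model = genai.GenerativeModel(
    model_name="gemini-1.5-pro",  # placeholder model name
    tools=[walk_to, grasp, wipe_surface],
)

# Automatic function calling lets the SDK execute whichever tool functions
# the model selects and feed their results back into the conversation.
chat = model.start_chat(enable_automatic_function_calling=True)
response = chat.send_message(
    "There is a coffee spill on the table. Clean it up with the most suitable tool."
)
print(response.text)
```

In practice, the visual context from the robot's camera is passed alongside the text prompt so the model can ground its tool choices in what it actually sees.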
Challenges we ran into
Working with the Nao robot presented significant technical obstacles. The platform's age and outdated infrastructure, which requires Python 2.7 and offers minimal documentation, created fundamental difficulties throughout development. Our challenges intensified when the robot's firmware became corrupted during implementation, effectively bricking the device and forcing us to develop a custom firmware-flashing solution. This unexpected setback required extensive troubleshooting under tight time constraints.

The mathematics of translating visual perception into physical action proved particularly demanding. We needed inverse kinematics to convert pixel coordinates from visual input into precise joint angles for movement, and the coordinate transformation between the camera frame and the robot frame introduced another layer of complexity that required careful modeling to ensure accurate physical responses to visual stimuli.
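To illustrate the frame bookkeeping, the sketch below carries a 3D point observed in the camera frame into the robot's base frame with a homogeneous transform. The rotation and translation values are placeholders rather than our calibrated extrinsics; the resulting target is what an inverse-kinematics solver would then turn into joint angles.

```python
# Sketch: transform a point from the camera frame into the robot base frame
# using a 4x4 homogeneous transform. Values are placeholders, not our
# calibrated extrinsics.
import numpy as np

# Columns are the camera's x, y, z axes expressed in the base frame
# (a typical forward-looking mount: optical axis along base +x,
# image x along base -y, image y along base -z).
R_base_cam = np.array([
    [0.0,  0.0, 1.0],
    [-1.0, 0.0, 0.0],
    [0.0, -1.0, 0.0],
])
t_base_cam = np.array([0.05, 0.0, 0.45])  # camera mounted ~45 cm above the base

T_base_cam = np.eye(4)
T_base_cam[:3, :3] = R_base_cam
T_base_cam[:3, 3] = t_base_cam

# A 3D point in the camera frame (e.g. from back-projecting a detected pixel).
p_cam = np.array([0.10, 0.05, 0.60, 1.0])  # homogeneous coordinates, meters

p_base = T_base_cam @ p_cam
print("target in robot base frame:", p_base[:3])
# p_base[:3] is the reach target we would hand to an inverse-kinematics
# solver to obtain joint angles for the arm.
```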
Accomplishments that we're proud of
In a remarkably short timeframe, we successfully created a complete robot foundation pipeline for embodied AI that bypasses computationally intensive approaches like reinforcement learning or imitation learning. Despite confronting numerous technical obstacles—including outdated Python dependencies and sparse documentation—we remained persistent in developing alternative solutions without compromising our vision. We're particularly proud of our innovative implementation of Gemini for physical embodiment, positioning ourselves at the forefront of the emerging Physical AI and Embodied AI revolution. By harnessing Gemini's capabilities in this cutting-edge application, we've demonstrated how large foundation models can bridge the gap between perception and physical action without traditional computational burdens. This creative approach to robotics represents exactly where the field is heading—toward more flexible, adaptive systems that can understand and interact with the physical world through advanced AI reasoning rather than rigid programming.
What we learned
Our project provided invaluable hands-on experience with the mathematical foundations of robotics. We gained a practical understanding of coordinate transformations and their critical role in translating between different reference frames. Camera calibration became second nature, and we came to appreciate how strongly the focal length parameters influence spatial estimates. On the practical side, we learned some hard lessons about hardware management, specifically that interrupting firmware updates can have catastrophic consequences. This experience taught us to always have recovery tools prepared, particularly a bootable flash drive for emergency firmware restoration. These technical insights and practical workflows will prove invaluable for future robotics development work.
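As a concrete illustration of why calibration mattered, the sketch below back-projects a pixel with known depth into the camera frame using pinhole intrinsics. The intrinsic values are placeholders, not the NAO camera's calibrated parameters; they show how the focal length directly scales pixel offsets into metric offsets.

```python
# Sketch: back-project a pixel (u, v) with known depth into the camera frame
# using pinhole intrinsics. Intrinsic values are placeholders, not the NAO
# camera's calibrated parameters.
import numpy as np

fx, fy = 550.0, 550.0   # focal lengths in pixels (placeholder)
cx, cy = 320.0, 240.0   # principal point for a 640x480 image (placeholder)

def pixel_to_camera(u: float, v: float, depth_m: float) -> np.ndarray:
    """Return the 3D point in the camera frame for a pixel at the given depth."""
    x = (u - cx) * depth_m / fx   # focal length scales pixel offset to meters
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

# A detection 80 px right of the image center, 0.5 m away:
print(pixel_to_camera(400.0, 240.0, 0.5))
# If fx were off by 10%, this lateral estimate would be off by roughly 10%
# as well, which is why calibration mattered so much for accurate reaching.
```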
What's next for Cena Can See
We plan to expand our foundation model pipeline by testing it across different embodied robot platforms to evaluate its generalization capabilities. This cross-platform validation will help us refine our approach and confirm its flexibility beyond our current implementation. Additionally, we aim to broaden the range of everyday tasks our system can handle, systematically testing and improving its performance across various household scenarios. These practical tests will help us identify edge cases and opportunities for enhancement. A key focus of our future work will be improving the robot's control dynamics and motion quality. By refining these fundamental movement capabilities, we'll create smoother, more natural interactions that better approximate human-like task execution while maintaining our computationally efficient approach.