Project: Automated Office Assistant (AOA)

Inspiration

The transition from "Self-Driving Cars" to "Self-Driving Everything." While openpilot has mastered the highway, the comma body v2 represents the frontier of indoor robotics. We wanted to see if the same "End-to-End" philosophy that keeps a car in its lane could allow a robot to navigate a chaotic office environment—without expensive LIDAR or pre-baked maps—just like a human does: by looking for landmarks.

What it does

AOA is a zero-map, semantic navigation suite for the comma body. Instead of giving it coordinates, you give it a mission: "Find the coffee machine" or "Deliver this to the lead engineer."

It uses the comma four’s triple-camera stack to "see" the office.

It leverages an eGPU-hosted Vision-Language Model (VLM) to identify objects and intent.

It autonomously balances and drives the body v2 to the target while avoiding "static obstacles" (chairs) and "dynamic obstacles" (busy hackers).

How we built it

The Brain: A comma four running AGNOS. We tapped into the cameraState and modelV2 cereal messages to get raw vision data.

The Muscle: The comma body v2 platform. We used the bodyjim Python API to send velocity and heading commands via the internal CAN bus.
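The command path can be sketched as a small helper that turns a desired forward speed and yaw rate into clamped left/right wheel speeds. The exact action format bodyjim expects, and the track width and speed limit used here, are assumptions for illustration:

```python
def to_wheel_speeds(v, omega, track_width=0.44, max_speed=1.0):
    """Convert forward speed v (m/s) and yaw rate omega (rad/s) into
    left/right wheel speeds for a differential-drive base.

    track_width and max_speed are illustrative values, not body v2 specs.
    """
    left = v - omega * track_width / 2.0
    right = v + omega * track_width / 2.0
    # If either wheel saturates, scale both to preserve the turn ratio.
    peak = max(abs(left), abs(right))
    if peak > max_speed:
        left *= max_speed / peak
        right *= max_speed / peak
    return left, right
```

A pair like this is what ends up in the bodyjim action each control tick.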

The Logic: We offloaded frames to an eGPU running a quantized Moondream2 VLM. The VLM acts as the "High-Level Planner," outputting directional vectors (e.g., "The fridge is 30 degrees to the right").
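Since the planner's output is plain text, the bridge to the controller is just parsing. A minimal sketch (the exact phrasing Moondream2 produces varies, so the pattern here is an assumption):

```python
import re

def parse_direction(answer):
    """Parse a VLM answer like 'The fridge is 30 degrees to the right'
    into a signed heading offset in degrees (right positive, left
    negative). Returns None when no bearing is found."""
    m = re.search(
        r"(\d+(?:\.\d+)?)\s*degrees?\s*(?:to\s+the\s+)?(left|right)",
        answer, re.IGNORECASE)
    if m is None:
        return None
    angle = float(m.group(1))
    return angle if m.group(2).lower() == "right" else -angle
```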

The Controller: A custom PID loop written in Python that translates the VLM’s "semantic desires" into torque commands for the body’s motors, ensuring smooth movement without tipping the robot.
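The core of that controller is a textbook PID on heading error with a clamped output; the gains and limit below are placeholders, not the values we shipped:

```python
class PID:
    """Minimal PID controller with output clamping."""

    def __init__(self, kp, ki, kd, out_limit):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_limit = out_limit
        self.integral = 0.0
        self.prev_err = None

    def update(self, err, dt):
        self.integral += err * dt
        deriv = 0.0 if self.prev_err is None else (err - self.prev_err) / dt
        self.prev_err = err
        out = self.kp * err + self.ki * self.integral + self.kd * deriv
        # Clamp so the balancing base never sees a step input.
        return max(-self.out_limit, min(self.out_limit, out))
```

Feeding it the heading error from the VLM each tick yields a bounded yaw command rather than a raw step.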

Challenges we ran into

The "Indoor Sun" Problem: Office lighting and reflections on polished floors confused the standard openpilot vision model, which expects asphalt. We had to implement a custom "floor-plane" filter to distinguish between a flat walkway and a glass wall.
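The filter is a heuristic, not learned: seed an intensity model from the image rows nearest the robot (almost always floor), then reject pixels that deviate from it or are near-saturated specular highlights. A stdlib-only sketch on a 2D grid of grayscale values, with illustrative thresholds:

```python
from statistics import mean, pstdev

def floor_mask(gray, specular_thresh=240, tol=2.0):
    """gray: 2D list of 0-255 grayscale intensities (row 0 = top).

    Returns a same-shaped boolean mask: True where the pixel looks like
    walkable floor, False for specular glare or off-plane surfaces."""
    h = len(gray)
    # Seed statistics from the bottom eighth of the image (assumed floor).
    seed = [p for row in gray[-max(1, h // 8):] for p in row]
    mu, sigma = mean(seed), pstdev(seed) + 1e-6
    return [[abs(p - mu) <= tol * sigma and p < specular_thresh
             for p in row] for row in gray]
```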

Latency: Sending high-res frames to the eGPU and waiting for a VLM response created roughly 500 ms of lag. We solved this by using the VLM for "Global Planning" every 2 seconds, while a faster, local OpenCV-based "Local Planner" handled immediate obstacle avoidance.
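The split is just two planners ticking at different rates and sharing one goal state. A sketch of the scheduling (the class and function names are ours, not an openpilot API):

```python
class TwoRatePlanner:
    """Run a slow global planner every `period` seconds and a fast local
    planner every tick; the local planner steers toward the most recent
    global heading while reacting to immediate obstacles."""

    def __init__(self, global_plan, local_plan, period=2.0):
        self.global_plan = global_plan  # frame -> goal heading (slow, VLM)
        self.local_plan = local_plan    # (frame, heading) -> command (fast)
        self.period = period
        self.next_global_t = 0.0
        self.goal_heading = 0.0

    def tick(self, t, frame):
        if t >= self.next_global_t:
            self.goal_heading = self.global_plan(frame)
            self.next_global_t = t + self.period
        return self.local_plan(frame, self.goal_heading)
```

With this shape, the VLM round trip only stalls the slow path; the fast path keeps issuing commands from the cached heading.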

Balance vs. Precision: The body v2 is a balancing robot. Stopping too quickly to look at a landmark caused the "pitch" to oscillate, blurring the camera feed. We had to tune the acceleration curves to keep the "eyes" stable.
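The fix amounts to slew-rate limiting the speed command so each tick can only change it by a bounded amount; the limit value is something to tune per robot, not a body v2 constant:

```python
def slew_limit(target, current, max_delta):
    """Move `current` toward `target` by at most `max_delta` per tick,
    so the balancing base never sees a step input that would pitch it
    (and blur the camera)."""
    delta = target - current
    delta = max(-max_delta, min(max_delta, delta))
    return current + delta
```

Applying this to the speed command before it reaches the motors turns a hard stop into a short ramp.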

Accomplishments that we're proud of

Mapless Navigation: We successfully navigated from the entrance to the pizza table without a single line of SLAM (Simultaneous Localization and Mapping) code.

VLM Integration: Getting a local vision model to talk to the cereal messaging system in real-time on comma hardware.

Zero-Intervention Demo: Watching the body v2 "decide" to turn around when it hit a dead-end and successfully re-locate its target.

What we learned

Robotics is harder than cars: In a car, you stay in a lane. In an office, there are no lanes—the "state space" is infinite.

Hardware Constraints are Teachers: Working within the power and thermal limits of the comma four taught us to be ruthless with code efficiency.

The Power of the Stack: The comma ecosystem (panda, cereal, AGNOS) is incredibly robust for rapid prototyping if you know how to talk to the CAN bus correctly.

What's next for Automated Office Assistant

Multi-Agent Coordination: Using two bodies to "relay" items across the office.

Voice Integration: Adding a microphone to the comma four so you can literally tell the robot where to go.

Open-Source Port: Cleaning up our bodyjim wrappers to contribute back to the comma/openpilot community so others can use semantic navigation on their own bodies.
