Inspiration

Guide dogs are incredible, but access to them is limited. They take years to train, cost a lot to support, and only reach a small fraction of the people who could benefit from that kind of assistance.

We started with a simple question: could a robot dog provide some of that spatial guidance indoors?

Not as a replacement for guide dogs, but as another kind of assistive tool: something that can understand speech, recognize objects, remember places, and help someone move through an unfamiliar space.

We had a Jueying Lite3 in the lab, an RGB-D camera, and access to Bedrock. That was enough to try.

What it does

Lite3 is a voice-controlled guide assistant running on a quadruped robot. You talk to it through a push-to-talk phone app. There are no buttons to learn and no robot commands to memorize — just conversation.

You can ask things like:

“What’s in front of me?” It uses its RGB-D camera and Claude’s vision capabilities to describe the scene.

“Follow me.” It uses YOLO to track a person and follow them through the room.

“Find the bottle and bring me to it.” It rotates in place, watches its live object-detection feed, stops when it sees the bottle, walks toward it, and stops at a safe distance.

“Go back to the desk.” It uses a persistent semantic map to navigate back to something it saw earlier, even after walking away.

The robot also speaks back naturally. Instead of saying something like “goal status reached” or “object ID detected,” it says things like:

“Found the door, about two meters ahead — walking over now.”

That mattered to us because the user may never see a screen. The voice is the interface.

How we built it

Hardware

We used a Deep Robotics Jueying Lite3 quadruped, an Intel RealSense D435i RGB-D camera, and a laptop as the main compute host.

AI stack

AWS Bedrock with Claude Sonnet is the reasoning and vision engine. The planner uses the Bedrock Converse streaming API to decide what to do, call tools, and respond to the user. YOLO runs on the robot for real-time object and person detection. Whisper handles speech transcription on the laptop, and macOS say handles text-to-speech back to the phone.
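As a rough sketch of how the planner talks to Bedrock, the snippet below streams a response through boto3's converse_stream with a tool definition attached. The model ID, region, and the speak_to_user schema are illustrative stand-ins, not the exact configuration we shipped.

```python
import boto3

# Bedrock runtime client; region and model ID are example values.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

tool_config = {
    "tools": [{
        "toolSpec": {
            "name": "speak_to_user",  # one of the planner's tools, schema simplified here
            "description": "Speak a short message aloud to the user.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {"text": {"type": "string"}},
                "required": ["text"],
            }},
        }
    }]
}

messages = [{"role": "user", "content": [{"text": "What's in front of me?"}]}]

# Stream the response; text and tool-use arguments arrive as contentBlockDelta events.
response = bedrock.converse_stream(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example model ID
    messages=messages,
    toolConfig=tool_config,
)

for event in response["stream"]:
    if "contentBlockDelta" in event:
        delta = event["contentBlockDelta"]["delta"]
        if "text" in delta:
            print(delta["text"], end="", flush=True)   # streamed assistant text
        elif "toolUse" in delta:
            print(delta["toolUse"]["input"], end="")   # streamed tool arguments (partial JSON)
```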

ROS setup

The robot uses two ROS stacks: ROS Noetic for the camera and ROS 2 Foxy for motion and Nav2. Instead of trying to merge everything, we exposed both stacks through separate rosbridge WebSocket connections — one for ROS 1 and one for ROS 2 — so the laptop agent could subscribe and publish without needing a local ROS install.
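A minimal sketch of that dual-bridge setup with roslibpy is below; the host, ports, and topic names are assumptions for illustration.

```python
import roslibpy

def handle_frame(msg):
    # Our perception tools decode the JPEG payload; this sketch just acknowledges it.
    print("got color frame,", len(msg["data"]), "bytes (base64)")

# Two independent rosbridge connections: one to the ROS 1 (Noetic) camera stack,
# one to the ROS 2 (Foxy) motion/Nav2 stack.
ros1 = roslibpy.Ros(host="lite3.local", port=9090)   # ROS 1 rosbridge (example port)
ros2 = roslibpy.Ros(host="lite3.local", port=9091)   # ROS 2 rosbridge (example port)
ros1.run()
ros2.run()

# Subscribe to the RealSense color stream on the ROS 1 side.
color = roslibpy.Topic(ros1, "/camera/color/image_raw/compressed",
                       "sensor_msgs/CompressedImage")
color.subscribe(handle_frame)

# Publish velocity commands on the ROS 2 side.
cmd_vel = roslibpy.Topic(ros2, "/cmd_vel", "geometry_msgs/Twist")
cmd_vel.publish(roslibpy.Message({
    "linear": {"x": 0.2, "y": 0.0, "z": 0.0},
    "angular": {"x": 0.0, "y": 0.0, "z": 0.3},
}))
```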

Agent architecture

An Orchestrator sends memory and robot state to a PlannerAgent. The planner runs a streaming tool-use loop through Bedrock Converse and has access to more than 30 tools across perception, movement, following, map navigation, and memory. Read-only tools can run in parallel, which helps keep the interaction responsive.
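The parallel dispatch for read-only tools looks roughly like the sketch below, assuming a simple registry that marks which tools are safe to run concurrently; the tool names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical registry: read-only tools are safe to run concurrently,
# mutating tools (anything that moves the robot) run one at a time.
READ_ONLY = {"get_pose", "get_detections", "query_map"}

def dispatch(tool_calls, tools):
    """Run read-only tool calls in parallel, mutating ones sequentially."""
    results = {}
    parallel = [c for c in tool_calls if c["name"] in READ_ONLY]
    serial = [c for c in tool_calls if c["name"] not in READ_ONLY]

    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {c["id"]: pool.submit(tools[c["name"]], **c["input"]) for c in parallel}
        results.update({cid: f.result() for cid, f in futures.items()})

    for c in serial:
        results[c["id"]] = tools[c["name"]](**c["input"])
    return results
```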

Semantic world map

A background daemon runs once per second. It takes every YOLO detection, uses the RealSense depth frame to back-project it into 3D, and transforms that point into the robot’s odometry frame using the current pose. The result is a persistent object map. When the user says “go to the bottle,” the agent can check whether it already knows where the bottle is before searching again.
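In essence, each detection goes through a pinhole back-projection and a planar pose transform, along the lines of the sketch below; the intrinsics and the camera-to-body alignment are simplified assumptions, not the exact calibration we use.

```python
import numpy as np

def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with its depth into the camera frame (pinhole model)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

def camera_to_odom(point_cam, robot_x, robot_y, robot_yaw):
    """Transform a camera-frame point into the odometry frame.
    Assumes a forward-facing camera near the body origin; on the real robot
    a fixed camera-to-body offset would be applied here as well."""
    body = np.array([point_cam[2], -point_cam[0]])   # camera z -> body x, camera -x -> body y
    c, s = np.cos(robot_yaw), np.sin(robot_yaw)
    rot = np.array([[c, -s], [s, c]])
    return rot @ body + np.array([robot_x, robot_y])  # (x, y) of the object in odom

# Example: a "bottle" detected at pixel (400, 260), 1.8 m away (illustrative intrinsics).
p_cam = backproject(400, 260, 1.8, fx=615.0, fy=615.0, cx=320.0, cy=240.0)
print(camera_to_odom(p_cam, robot_x=0.5, robot_y=1.0, robot_yaw=0.3))
```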

Voice UI

A Flask server on the laptop hosts the push-to-talk phone interface. The phone records audio, Whisper transcribes it, the agent runs, and the spoken response streams back to the phone. End-to-end latency is usually around 3–6 seconds.
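A compressed sketch of that flow is below; the route name and the run_agent helper are illustrative placeholders, and it assumes a local openai-whisper install plus macOS say for speech synthesis.

```python
import subprocess, tempfile
from flask import Flask, request, send_file
import whisper

app = Flask(__name__)
model = whisper.load_model("base")   # local Whisper model on the laptop

def run_agent(text):
    # Placeholder for the Bedrock planner loop described above.
    return f"You said: {text}"

@app.route("/ask", methods=["POST"])   # route name is illustrative
def ask():
    # 1. Save the push-to-talk recording uploaded by the phone.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        request.files["audio"].save(f.name)
        text = model.transcribe(f.name)["text"]

    # 2. Run the planner agent on the transcript.
    reply = run_agent(text)

    # 3. Synthesize speech with macOS `say` and stream the audio back to the phone.
    out = tempfile.NamedTemporaryFile(suffix=".aiff", delete=False)
    subprocess.run(["say", "-o", out.name, reply], check=True)
    return send_file(out.name, mimetype="audio/aiff")
```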

Challenges we ran into

Running ROS 1 and ROS 2 together

The camera stack runs on ROS Noetic, while motion and navigation run on ROS 2 Foxy. Sourcing both environments in the same shell caused subtle failures, so we kept the two systems strictly separated and bridged them through independent rosbridge sockets.

RealSense reliability

Under sustained load, the D435i color stream would sometimes silently drop to 0 Hz while depth kept working. Restarting the ROS node was not enough because the issue was in the kernel UVC driver. The fix was to reload uvcvideo and restart the service.
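Our recovery routine boils down to something like the sketch below; the service name is an example, and the commands assume passwordless sudo on the host driving the camera.

```python
import subprocess, time

def recover_realsense():
    """Recovery for the D435i color stream silently dropping to 0 Hz.
    Restarting the ROS node alone is not enough; the kernel UVC driver
    has to be reloaded first. The service name here is an example."""
    subprocess.run(["sudo", "modprobe", "-r", "uvcvideo"], check=True)
    time.sleep(1.0)
    subprocess.run(["sudo", "modprobe", "uvcvideo"], check=True)
    subprocess.run(["sudo", "systemctl", "restart", "realsense-camera.service"], check=True)
```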

WiFi bandwidth

Streaming raw RGB at 30 Hz was too heavy for the network. We built a fetch pattern where tools subscribe, grab what they need, then unsubscribe, with client-side rate limiting so multiple tools do not overload rosbridge at the same time.
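A sketch of that fetch pattern with roslibpy is below; the rate-limit interval is an example value, not a tuned constant.

```python
import threading, time
import roslibpy

_last_fetch = {}
_FETCH_MIN_INTERVAL = 0.5  # seconds per topic; example client-side rate limit

def fetch_once(ros, topic_name, msg_type, timeout=2.0):
    """Subscribe, wait for a single message, then unsubscribe immediately,
    so raw RGB never streams continuously over WiFi."""
    now = time.time()
    if now - _last_fetch.get(topic_name, 0.0) < _FETCH_MIN_INTERVAL:
        time.sleep(_FETCH_MIN_INTERVAL)   # back off instead of hammering rosbridge
    _last_fetch[topic_name] = time.time()

    result, got = {}, threading.Event()
    topic = roslibpy.Topic(ros, topic_name, msg_type)

    def _cb(msg):
        result["msg"] = msg
        got.set()

    topic.subscribe(_cb)
    got.wait(timeout)
    topic.unsubscribe()
    return result.get("msg")
```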

Accomplishments that we’re proud of

The find_and_go_to behavior felt surprisingly fluid. The robot rotates continuously while a background thread polls YOLO at around 6 Hz. As soon as the target appears, the robot stops rotating and starts walking toward it. It feels much more natural than a stop-check-step loop.
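Stripped down, the behavior looks like the sketch below, where the movement and perception callbacks stand in for our actual tools; rates and timeouts are illustrative.

```python
import threading, time

def find_and_go_to(label, get_detections, rotate, walk_forward, stop):
    """Rotate continuously while a background thread polls YOLO (~6 Hz);
    stop rotating and walk toward the target as soon as it appears."""
    found = threading.Event()

    def poll():
        while not found.is_set():
            if any(d["label"] == label for d in get_detections()):
                found.set()
            time.sleep(1 / 6)    # ~6 Hz polling of the detection feed

    threading.Thread(target=poll, daemon=True).start()

    rotate(angular_z=0.4)        # keep turning in place while we search
    found.wait(timeout=30.0)
    stop()

    if found.is_set():
        walk_forward()           # approach, then stop at a safe distance
```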

The voice also made a huge difference. We told the model that the user is visually impaired and only hears the robot through speak_to_user. That single constraint changed the output from sensor readouts into useful guidance.
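The constraint itself is just a few lines of system prompt; the wording below is a paraphrase of what we gave the model, not the exact prompt.

```python
SYSTEM_PROMPT = (
    "The user is visually impaired and cannot see any screen. "
    "The only thing they perceive is what you say through the speak_to_user tool. "
    "Describe distances, directions, and what you are doing in plain spoken language, "
    "never raw sensor readouts or status codes."
)
```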

The robot did not come with navigation and SLAM integrated. We had to wire up the mapping and odometry ourselves so both could run simultaneously, which is what lets the robot explore new spaces autonomously.

What we learned

We learned how to bridge ROS 1 and ROS 2 without doing a full migration, and how far you can get with RealSense depth, YOLO, and odometry before needing full SLAM.

We also learned that tool design matters as much as model choice. The LLM is only useful if it has the right actions available: look, remember, plan, follow, stop, speak.

And for assistive robotics, voice output is not an afterthought. The way the robot explains what it is doing changes whether the interaction feels like a machine or like a guide.

What’s next for Lite3

Persist semantic memory and waypoints across sessions.

Run the YOLO tracker continuously in the background for faster map updates.

Replace the current depth-based walking advisory with full Nav2 costmap-based collision avoidance.

Test the interaction design with blind and low-vision users to understand what guidance is actually useful.
