ORBI

Not another voice assistant trapped in a speaker. Orbi is what happens when an LLM gets a body - and nothing it does is scripted.


Inspiration

Every AI companion we've used lives behind a screen or inside a speaker. You speak, it answers, it goes back to sleep. But humans don't connect with software that way; we connect with things that notice us, remember us, and move through the world alongside us.

We kept asking: what would it feel like if an AI had a body? Not just a moving object running a scripted routine, but something with genuine agency, deciding when to look, when to speak, when to come closer, and when to just stay quiet. A robot that doesn't wait to be told.

The inspiration wasn't a better Alexa. It was Rocky from Project Hail Mary, a mind that happens to have a form. A friend, not a tool.

And we wanted to build it with the messy, unreliable, off-the-shelf hardware anyone can buy, because the future of embodied AI shouldn't require a research lab.


What it does

Orbi is an agentic AI companion with a body. You talk to it. It decides what to do.

  • It listens, constantly, through a mic - no wake word, no button. Voice activity detection handles the silence.
  • It sees through a camera, taking a look whenever it decides a visual would help answer you.
  • It moves on four wheels around its environment, going where it judges it needs to go.
  • It remembers - facts about you, things you've said, events worth keeping - across conversations, indefinitely.
  • It speaks with a voice shaped for character, not utility.

Every turn, an LLM gets the user's speech, a tool menu (see, move, remember, recall), and a personality. The LLM decides whether to reply, move, look, save a memory, or combine these - no magic words, no hand-coded rules. When you say "Is my notebook on the couch?", Orbi chooses to look. When you say "I've had a rough day," Orbi chooses to remember how you sounded when you said it.

That decision layer is the project. Everything else is plumbing.
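For the curious, here is a minimal sketch of what that decision layer can look like, assuming the google-generativeai SDK's automatic function calling. The tool names come from Orbi; the signatures, stub bodies, and personality string are illustrative, not the project's actual code.

```python
# Sketch only: register the four tools with Gemini and let the model decide,
# turn by turn, which (if any) to call. Signatures and stubs are assumptions.
import os
import google.generativeai as genai

def tool_see(question: str) -> str:
    """Capture a camera frame and describe what is visible, focused on the question."""
    return "stub: description of the current camera frame"

def tool_move(direction: str, distance_cm: int) -> str:
    """Drive the wheels (forward/back/left/right) roughly distance_cm centimeters."""
    return "stub: moved"

def tool_remember(fact: str) -> str:
    """Append a fact worth keeping to long-term memory."""
    return "stub: remembered"

def tool_recall(query: str) -> str:
    """Search long-term memory and return the most relevant entries."""
    return "stub: nothing found"

ORBI_PERSONALITY = "You are Orbi: brief, warm, curious, slightly playful, quiet when uncertain."

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel(
    "gemini-2.0-flash",
    tools=[tool_see, tool_move, tool_remember, tool_recall],
    system_instruction=ORBI_PERSONALITY,
)
chat = model.start_chat(enable_automatic_function_calling=True)

# One turn: the transcribed speech goes in; the model chooses to answer,
# look, move, remember, recall, or some combination of these.
print(chat.send_message("Is my notebook on the couch?").text)
```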


How we built it

The software stack

Orbi runs on a single Python process with one idea at its center: every turn is a decision made by an LLM with full access to its body.

The loop:

  1. VAD-gated mic - always open, captures audio only when speech is detected, stops on silence
  2. Whisper STT (local, faster-whisper base model) - transcribes on CPU, fast enough for conversation
  3. Gemini 2.0 Flash - primary reasoning layer with native function-calling across four tools: tool_see, tool_move, tool_remember, tool_recall
  4. Gemma 4 via Ollama - local fallback for offline operation, same tool contract, same personality
  5. ElevenLabs TTS - Orbi's voice, pitch-shifted and layered to give it an otherworldly, Rocky-inspired character
  6. JSON-lines over USB serial - ESP32-S3 receives movement commands and acks back (sketched below)
  7. JSONL memory log - append-only long-term memory, searchable by the agent itself via a tool

The personality is encoded in a single system prompt - not hand-coded behaviors. Change the prompt, change the friend.
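The serial bridge from item 6, sketched in Python with pyserial. The port, baud rate, and field names are assumptions; the shape of the protocol (one JSON object per line, one ack line back) is the part that matters.

```python
# Sketch of the JSON-lines serial bridge between the Jetson and the ESP32-S3.
import json
import serial  # pyserial

ser = serial.Serial("/dev/ttyUSB0", 115200, timeout=2)  # assumed port and baud

def send_command(cmd: dict) -> dict:
    """Write one JSON command as a single line and block until the ESP32 acks."""
    ser.write((json.dumps(cmd) + "\n").encode())
    line = ser.readline().decode().strip()  # e.g. {"ack": "forward"}
    return json.loads(line) if line else {"error": "timeout"}

send_command({"cmd": "forward", "duration_ms": 600})
send_command({"cmd": "stop"})
```

That simplicity is what makes the bridge debuggable with nothing more than a serial monitor.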

The mic mutes automatically while Orbi speaks; a shared threading event keeps the loop from hearing itself.
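A sketch of that self-muting behaviour, assuming a shared threading.Event called speaking; the function names are illustrative. The capture thread drops audio while the event is set, so the loop never transcribes its own voice.

```python
import threading

speaking = threading.Event()  # set for the duration of TTS playback

def on_audio_chunk(chunk: bytes, buffer: list) -> None:
    if speaking.is_set():   # Orbi is talking: ignore the mic
        return
    buffer.append(chunk)    # otherwise hand the chunk to VAD / Whisper

def speak(play_tts) -> None:
    speaking.set()
    try:
        play_tts()          # blocking playback of the synthesized audio
    finally:
        speaking.clear()
```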

The hardware stack

We built Orbi around two power domains and two compute domains that cooperate through a thin serial bridge:

  • Jetson Orin Nano 8GB (JetPack 6.x) - the brain. Runs the agent loop, Gemma 4 local inference, Whisper, camera capture, and TTS playback. Powered from a wall adapter via USB-C PD, which in turn powers the ESP32-S3 over USB.
  • ESP32-S3 Dev Module - the body controller. Receives JSON commands over USB serial and drives servos and DC motors through an L298N motor driver. The L298N runs on its own battery and shares only ground with the ESP32 - logic and motor power isolated, one common ground.
  • Four DC motors for the wheels, cross-wired diagonally so each diagonal pair acts as one virtual wheel. This dramatically simplifies the drive firmware (two channels instead of four) and produces smoother motion than naive four-wheel control.
  • Camera - a USB webcam with a built-in mic, plugged directly into the Jetson. Vision inference runs on Gemini or Gemma, depending on mode.
  • Robotic arm (v1: camera gimbal) - an articulated, servo-driven mount that reframes the camera. A hardware stub for a future two-hand manipulator.
  • Double-decker chassis - separates the higher-voltage motor level from the low-voltage compute level, keeping motor noise away from the logic and giving us physical room for a 3-cell battery, the motor driver, and the ESP32. Everything routes on purpose: motor current never touches the logic rail, the Jetson talks to the motor domain only over a serial line, and a single stop command halts it. A failure in one domain doesn't brick the other.

Challenges we ran into

Honestly, the hardest part wasn't the AI. It was everything touching voltage.

Dual-powering the ESP32. Our first wiring attempt had both the Jetson (via USB) and the L298N (via +5V output) feeding the ESP32 simultaneously. Two 5V rails tied together, fighting each other. We caught it before damage, disconnected the L298N's 5V wire to the ESP32, and kept grounds commoned. Lesson: one power source per logic chip, always.

Getting the Jetson to even boot. NVIDIA ships Orin Nanos with factory firmware that's incompatible with JetPack 6.x, which requires a two-step flash: JetPack 5.1.3 first to update the firmware, then 6.x. We learned this the hard way after a blank-screen boot cycle.

Making Gemma 4 fast enough on an 8GB Orin. E4B (~10GB on disk, ~6GB VRAM) was too heavy: the system froze under concurrent load with the desktop environment running. We downgraded to E2B, killed the graphical desktop entirely (systemctl set-default multi-user.target), added swap, and forced MAXN Super power mode. Responses dropped from ~15 seconds to ~3-5 seconds.

Gemini API free-tier quota weirdness. A new project on Google Cloud gave us limit: 0 errors — a whole evening of "why is my key rate-limited the moment I try it?" The fix: create the key through AI Studio directly, not Cloud Console. Two different free tiers; only one works.

Firmware/software command mismatch. Our Python side was sending {"cmd": "forward", "distance_cm": 30}. Our ESP32 firmware expected {"cmd": "forward", "duration_ms": 600}. Manual serial tests worked, Orbi's tool calls didn't, and it took us too long to notice the difference. We converted centimeters to milliseconds on the Python side rather than reflashing the ESP32 during the crunch.
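One way to write that shim is below; the calibration constant is an assumption, picked here so that 30 cm maps to the 600 ms from the example above.

```python
# Sketch: translate the agent's distance_cm into the duration_ms the firmware expects.
CM_PER_SECOND = 50.0  # assumed drive speed, measured by timing the robot over a known distance

def to_firmware_cmd(cmd: dict) -> dict:
    if "distance_cm" in cmd:
        cmd["duration_ms"] = max(int(cmd.pop("distance_cm") / CM_PER_SECOND * 1000), 1)
    return cmd

to_firmware_cmd({"cmd": "forward", "distance_cm": 30})  # -> {"cmd": "forward", "duration_ms": 600}
```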

Hardware that kept acting differently across networks. Hotel wifi blocking device-to-device traffic, hackathon wifi flaking under load, the Jetson getting a different IP every time we moved - we cycled through SSH setups more times than we'd like to admit. A phone hotspot became our reliable fallback.

The speaker problem. Our passive 2W speaker needed an amplifier, which we didn't have time to wire. We routed audio through the portable monitor's built-in speakers over HDMI - a workaround that ended up being cleaner than the original plan.

Each of these was a half-hour or more of debugging on top of the actual building. That's the honest shape of building an embodied AI system in 12 hours: 30% building, 70% integration.


Accomplishments that we're proud of

Tool-calling that actually feels intelligent. Orbi never waits for magic words. A plain "can you check the kitchen?" triggers the agent to call move, then see, then respond - all decided by the LLM from tool descriptions alone. No keyword matching, no if-else ladder. When it works, it feels alive.

Genuine offline fallback. Gemma 4 E2B runs locally on the Jetson and handles the full agent contract - same tools, same personality, same memory. Unplug the Wi-Fi, and Orbi keeps going. That's the pitch for a companion robot: one that doesn't stop being your friend when the network drops.
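The fallback itself can be as simple as a try/except around the cloud call. A sketch, assuming the ollama Python client and a locally pulled Gemma model; gemini_turn and the model tag are placeholders, not the project's actual code.

```python
import ollama  # assumes the Ollama daemon is running with a Gemma model pulled

LOCAL_MODEL = "gemma-e2b"  # placeholder tag for whatever model is installed locally

def gemini_turn(user_text: str) -> str:
    """Cloud path (stub). The real version calls Gemini with the full tool menu."""
    raise ConnectionError("pretend the network is down")

def agent_turn(user_text: str, system_prompt: str) -> str:
    try:
        return gemini_turn(user_text)        # primary, cloud-backed path
    except Exception:                        # network down, quota exhausted, etc.
        resp = ollama.chat(
            model=LOCAL_MODEL,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_text},
            ],
        )
        return resp["message"]["content"]
```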

A memory system that persists forever. Append-only JSONL at ~/.orbi/memory.jsonl. Every remembered fact, every conversation. Restart Orbi tomorrow, and it still knows you love lemon cake. This is what distinguishes a companion from a session.
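A minimal sketch of that log, with assumed field names and a naive keyword recall standing in for whatever search the real tool does:

```python
import json
import time
from pathlib import Path

MEMORY = Path.home() / ".orbi" / "memory.jsonl"

def remember(fact: str) -> None:
    """Append one memory as a single JSON line (never rewritten, never truncated)."""
    MEMORY.parent.mkdir(parents=True, exist_ok=True)
    with MEMORY.open("a") as f:
        f.write(json.dumps({"ts": time.time(), "fact": fact}) + "\n")

def recall(query: str, limit: int = 5) -> list[str]:
    """Return the most recent entries whose text mentions the query."""
    if not MEMORY.exists():
        return []
    entries = (json.loads(line) for line in MEMORY.open())
    return [e["fact"] for e in entries if query.lower() in e["fact"].lower()][-limit:]

remember("User loves lemon cake.")
print(recall("lemon"))  # ['User loves lemon cake.']
```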

A personality that's more than a prompt. By shaping the system prompt around character rather than instructions - brief, warm, curious, slightly playful, quiet when uncertain - Orbi stops feeling like a chatbot. Add an alien, chord-like voice inspired by Rocky from Project Hail Mary, and suddenly it's a creature, not a tool.

Two compute domains, one robot. The Jetson and ESP32 cooperate through a serial protocol simple enough to debug with echo, strong enough to drive real hardware. That isolation is what let us keep shipping even when half the system was on fire.


What we learned

The model chooses when to act - don't try to choose for it. Our early instinct was to add rules: "when the user says X, call tool Y." Every rule we added made Orbi feel less alive. Once we deleted all the rules and let Gemini decide from tool descriptions and personality alone, everything got better. Agency is a design choice, not an emergent property.

Small models are often the right call. E4B was more "capable" than E2B on paper. But a capable model that freezes the Jetson is worth less than a slightly dumber model that responds in 3 seconds. On resource-constrained hardware, latency is part of intelligence.

Hardware bugs hide inside software. The ESP32 command mismatch looked like a software bug for an hour. Shared-ground mistakes look like random crashes. Power supply weakness looks like "the OS is buggy." When you're touching voltage, always suspect the physical layer first.

Voice is the whole experience. Judges don't pore over a pitch - they hear Orbi speak. The first 3 seconds of TTS output matter more than the entire system prompt behind it. We spent extra time on voice quality, and it disproportionately shaped how demos felt.

The LLM-as-brain pattern scales to bodies. This was the biggest conceptual takeaway. We didn't write a state machine, a behavior tree, or a control loop. We wrote a handful of tools and a personality. The rest is the LLM choosing, turn after turn. That architecture generalizes - to new hardware, new tools, new capabilities - without rewriting the core.


What's next for ORBI

We're building Orbi OS - the next generation.

Where everyone else in robotics is chasing computer vision and autonomous navigation algorithms, we're betting on a different primitive: LLM agents that can discover, reason about, and use new hardware on their own.

Orbi OS is an operating system for embodied AI that:

  • Adapts to any hardware the user provides. Plug in a new motor, a new sensor, a new arm, a new camera. Orbi OS introspects the device, learns its capabilities through a reflection protocol, and exposes it as a new tool the agent can call. No firmware rewrites. No retraining. The robot grows its body at runtime.
  • Learns new tools on its own. When Orbi encounters a task it can't solve with existing tools, it writes a new one - Python code that extends its capabilities, tested in a sandbox, committed to its toolset if useful. Self-modifying in a bounded, auditable way.
  • Scales memory from session to lifetime. The JSONL memory in v1 is a proof of concept. Orbi OS will have a multi-vault memory architecture - episodic, semantic, procedural, self - searchable semantically, with automatic summarization and recall prioritization.
  • Runs on anything. Jetson today, Raspberry Pi tomorrow, an ARM laptop the day after. As long as the brain can run a local LLM and talk to a body over some bus, Orbi OS makes it a companion. The goal: a friend that grows with you, on hardware that's truly yours, private and local, forever.

Tune in for Orbi OS.


Orbi v1 was built in under 36 hours.
