Inspiration

Today's robots are brilliant at perceiving a scene and amnesiac about it a second later. A vision model can tell you "that's a mug," but it can't be taught "this is my mug" without retraining, and it has no memory of where it saw things. We wanted to build the missing piece for physical AI: a memory layer, instant, teachable, and spatial, that sits between a robot's eyes and its actions. The north-star question we kept asking: "Would a real robotics team actually use this?"

What it does

Engram gives a Booster K1 humanoid a cloud brain and instant memory:

  • Perceives with a 72-billion-parameter vision model (Qwen2.5-VL) running on Nebius GPUs. It captions whatever the robot's camera sees.
  • Remembers every object as a vector in Redis, with sub-2 ms nearest-neighbor recall over its whole memory bank.
  • Learns in one shot. Show it a brand-new object it's never seen, teach it once (one vector written to Redis), and it recognizes it instantly. No retraining.
  • Recognizes and attends. It turns its head to look at the object it recalls, tracking the object's position continuously.
  • Has spatial memory. Walk it through a scene and it captures every object and the heading where it saw it.
  • Fetches on command. Say "go to the red soda can" and it first checks whether it even remembers one, recalls where it was, walks toward it (closed-loop on odometry), stops at a safe distance using the robot's depth camera so it doesn't crash into the table, faces it, and waves. You can watch the robot's-eye view and its live decision (recalled object, confidence, Redis search latency, the Nebius model) on a real-time dashboard.
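
The teach-once, recall-instantly idea above can be sketched in a few lines. This is an illustrative in-process stand-in, not the Redis-backed implementation: `MemoryBank`, `teach`, `recall`, the 0.8 similarity threshold, and the 3-d toy vectors are all hypothetical (the real embeddings are 4096-d).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class MemoryBank:
    """Toy stand-in for the vector index (hypothetical, in-process only)."""

    def __init__(self):
        self.entries = []  # list of (label, vector, heading_rad)

    def teach(self, label, vector, heading_rad):
        # One-shot learning: a single write, no model retraining.
        self.entries.append((label, list(vector), heading_rad))

    def recall(self, query, min_score=0.8):
        # Brute-force nearest neighbor, the same idea as a FLAT index.
        best = None
        for label, vector, heading in self.entries:
            score = cosine(query, vector)
            if best is None or score > best[1]:
                best = (label, score, heading)
        if best is None or best[1] < min_score:
            return None  # nothing remembered that is close enough
        return best


bank = MemoryBank()
bank.teach("red soda can", [0.9, 0.1, 0.0], heading_rad=1.2)
bank.teach("green bottle", [0.0, 0.2, 0.9], heading_rad=-0.4)

hit = bank.recall([0.8, 0.2, 0.1])   # close to the soda can vector
miss = bank.recall([0.0, 1.0, 0.0])  # unlike anything taught so far
```

Returning `None` below the threshold mirrors the "first checks whether it even remembers one" step: the fetch only proceeds when recall produces a confident match and a stored heading.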

How we built it

Architecture, a thin robot plus a cloud brain:

K1 camera ──JPEG──▶ laptop "brain" ──▶ Nebius Qwen2.5-VL-72B (caption)
                                  ├──▶ Nebius Qwen3-Embedding-8B (vector)
                                  ├──▶ Redis vector KNN (recall + spatial memory)
                                  └──▶ {label, center_x, confidence, heading} ──▶ K1 acts
  • Perception (Nebius Token Factory): OpenAI-compatible API. Qwen/Qwen2.5-VL-72B-Instruct turns a camera frame into a short caption; Qwen/Qwen3-Embedding-8B turns the caption into a 4096-d vector. One VLM call also returns the object's horizontal center for tracking.
  • Memory (Redis Stack): a FLAT cosine vector index over the embeddings, plus numeric fields storing the robot's pose (x, y, θ) where each object was seen. KNN recall runs in ~2 ms over a 124-object bank. A text query ("is the green bottle here?") embeds and searches the same index.
  • Robot (Booster K1): ROS2 Humble + booster_robotics_sdk_python. RGB from /booster_video_stream, metric depth from /StereoNetNode/stereonet_depth (mono16, millimeters), odometry, and high-level motion (RotateHead, Move, WaveHand).
  • Closed-loop locomotion: turns and walks are odometry-feedback controlled (we calibrated a slip factor because the K1's odometry over-reports rotation ~1.4× from foot slip), so "turn 180°" actually turns 180°.
  • Depth-gated safe approach: during the final walk, we sample the nearest obstacle in the depth image and stop the robot a safe distance from the object before it can hit the table.
  • Robust recognition: an object captured from several angles gets several captions, so we match the live view to the target by semantic similarity, not exact strings.
  • Live demo dashboard: the brain serves an HTML/JS page showing the robot's view, the recalled label + confidence, a gaze marker, the Redis search latency, and the Nebius model in use. This is the "decision on screen."
  • Laptop ↔ robot connected over Tailscale so the robot can be untethered.
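
The closed-loop turn described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the code that ran on the K1: `FakeRobot`, `turn_by`, and the rate and tolerance values are hypothetical, and the slip model simply assumes reported rotation equals true rotation times the ~1.4 factor mentioned above.

```python
import math

SLIP_FACTOR = 1.4  # from the write-up: odometry over-reports rotation ~1.4x


def wrap(angle):
    """Wrap an angle in radians to [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


class FakeRobot:
    """Hypothetical simulated robot whose odometry over-reports rotation."""

    def __init__(self):
        self.true_yaw = 0.0
        self._odom_yaw = 0.0

    def rotate(self, rate, dt):
        self.true_yaw += rate * dt
        self._odom_yaw += rate * dt * SLIP_FACTOR

    def read_odom_yaw(self):
        # Odometry yaw is reported wrapped, as on most real robots.
        return wrap(self._odom_yaw)


def turn_by(robot, target_rad, slip_factor=SLIP_FACTOR,
            rate=0.5, dt=0.01, tol=0.01, max_steps=5000):
    """Rotate until slip-corrected odometry reports target_rad turned."""
    prev = robot.read_odom_yaw()
    turned = 0.0
    for _ in range(max_steps):
        error = target_rad - turned
        if abs(error) < tol:
            break
        robot.rotate(math.copysign(rate, error), dt)
        now = robot.read_odom_yaw()
        # Accumulate small wrapped increments so turns past pi still work.
        turned += wrap(now - prev) / slip_factor
        prev = now
    return turned


calibrated = FakeRobot()
turn_by(calibrated, math.pi)                    # corrected for slip

uncorrected = FakeRobot()
turn_by(uncorrected, math.pi, slip_factor=1.0)  # trusts raw odometry
```

In this simulation the calibrated robot ends within the tolerance of 180°, while the one that trusts raw odometry stops early, at roughly 180° / 1.4.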

Challenges we ran into

  • Firmware reality: the robot's SDK build didn't implement the Cartesian arm-control API (returned 501 Not Implemented), so our planned "point at it" pivoted to a head-look + walk-up + wave that the firmware does support.

  • Odometry slip: open-loop turns were wildly inaccurate; we switched to closed-loop odometry feedback and calibrated the rotation scale so turns land on target.

  • Depth that reads the wrong thing: a naive center-of-frame depth sample looked over the table at the far wall, so the robot didn't stop. We moved to a nearest-obstacle sample over a wider, lower band.

  • One object, many captions: the same object captioned differently from different angles broke exact-label matching during approach, fixed with semantic matching.

  • Networking gauntlet: captive portals, a flaky Ethernet link, and hotspot/Tailscale relays. We ended up on Tailscale so the robot could roam.

  • Staging perception: the VLM captions the dominant object, so objects have to fill enough of the frame, a real lesson in demo design.
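
The depth-sampling problem above is easy to reproduce on a synthetic frame. The sketch below is illustrative only: the function names, band fractions, the low-percentile choice (used instead of a strict minimum to resist single-pixel noise), the 700 mm stop distance, and the convention that 0 marks an invalid pixel are all assumptions, not values from the robot.

```python
def center_depth_mm(depth, half=1):
    """Naive sample: median of a small window at the image center."""
    rows, cols = len(depth), len(depth[0])
    r0, c0 = rows // 2, cols // 2
    window = sorted(
        depth[r][c]
        for r in range(r0 - half, r0 + half + 1)
        for c in range(c0 - half, c0 + half + 1)
        if depth[r][c] > 0
    )
    return window[len(window) // 2] if window else None


def nearest_obstacle_mm(depth, top_frac=0.5, side_frac=0.2, percentile=0.05):
    """Near-percentile depth over a wide band in the lower part of the frame."""
    rows, cols = len(depth), len(depth[0])
    r_start = int(rows * top_frac)
    c_start, c_end = int(cols * side_frac), int(cols * (1 - side_frac))
    valid = sorted(
        depth[r][c]
        for r in range(r_start, rows)
        for c in range(c_start, c_end)
        if depth[r][c] > 0  # assumption: 0 marks an invalid pixel
    )
    if not valid:
        return None
    return valid[int(len(valid) * percentile)]


# Synthetic 10x10 frame in millimeters: far wall on top, table edge below.
frame = [[3000] * 10 for _ in range(7)] + [[600] * 10 for _ in range(3)]
frame[8][3] = 0  # one invalid pixel

STOP_MM = 700  # hypothetical safe-stop distance
center = center_depth_mm(frame)
nearest = nearest_obstacle_mm(frame)
should_stop = nearest is not None and nearest <= STOP_MM
```

On this frame the center sample reads the far wall, so a robot gated on it would keep walking, while the lower-band sample reads the table edge and triggers the stop.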

Accomplishments that we're proud of

  • One-shot live learning that actually works. Teach a brand-new object once and it's known instantly. A vision model alone can't do this.

  • A genuinely large model, 72B parameters, running in the live robot loop on Nebius, not a toy API call.

  • Sub-2 ms memory recall in Redis over the whole object bank.

  • A full spatial fetch: check memory, recall heading, walk, depth-safe stop, wave. All of it was built and tuned on real hardware in a single hackathon.

  • Closed-loop humanoid navigation calibrated from scratch in hours.

What we learned

  • Caption → embed → vector store is a remarkably robust, nearly-free perception path. Matching depends more on the captions being consistent from view to view than on their being accurate.

  • Closed-loop beats open-loop for everything on a real robot. Trust the sensor, not the command.

  • Depth needs careful region sampling. "What's directly ahead of me" is not "the median of the center of the image."

  • The memory layer is the differentiator: perception is a commodity; a teachable, instant, spatial memory is what makes a robot feel like it understands its space.

What's next for Engram

  • Image embeddings (Path B): CLIP-style image vectors on a Nebius GPU to tell lookalikes apart (two similar cans).

  • Brain on the robot: run the whole pipeline on the K1's onboard Jetson for full autonomy, untethered.

  • True spatial navigation / SLAM: persistent maps so it can be walked through a building and sent anywhere.

  • Depth for 3D grasping: use the RGBD stream to reach and pick, not just walk up and wave.

  • Scale the memory: thousands of objects, with recall still in the low-millisecond range.
