Inspiration

Fire-fighting is exactly the kind of dangerous, time-critical job we'd want a robot to take on — but a humanoid that can walk into a hazard is only useful if it can perceive that hazard cheaply and continuously. We kept hitting the same tension: every always-on sensor and computation on a robot competes for power, money, and cooling. Running a full neural network around the clock just to watch for fire is expensive overkill. So Ember became two halves solving one problem: a fire-fighting humanoid trained in simulation, and a custom FPGA perception accelerator that does the constant, low-power watching — a cheap always-on "first line of detection" that only wakes heavier compute when there's a real reason to.

What it does

Ember is a fire-fighting humanoid with a hardware-accelerated fire-detection front end.

The perception layer runs on an FPGA. As camera pixels stream in, it flags fire-colored pixels, filters out false positives by checking whether each flagged pixel is surrounded by other fire pixels (a real fire is a solid region, not scattered specks), locates the fire, and targets the base of the flame — where it would actually be fought. It outputs just a coordinate and a fire/no-fire flag per frame, not a whole image, over UART.

The humanoid layer is a Unitree humanoid trained in MuJoCo. Locomotion and approach behaviors are driven by reinforcement-learned policies (PPO), with A* for higher-level path planning toward the detected fire. The detected fire location from the FPGA is the target the humanoid stack acts on.

A home-base server ties it together: it reads the FPGA's UART output, visualizes detections live on a dashboard served over WiFi, and keeps the architecture open for multi-robot coordination and on-demand heavier AI.

How we built it

Perception (FPGA, Verilog): a streaming pipeline on a Xilinx Zynq — raster scanner → YCbCr color threshold → morphological erosion using two line buffers and a 3×3 sliding window (so all nine neighbors are available in a single clock cycle) → an accumulator that computes the centroid and flame-base aim point → a UART transmitter sending a compact binary packet per frame. We kept division off the fabric by deferring it downstream, and verified the hardware against a pixel-for-pixel Python golden model before programming the board.
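The pipeline stages above can be sketched in plain Python, in the spirit of a golden model. This is a minimal illustration and not the project's verifier: the threshold constants, the function names, and the choice of "lowest surviving row" as the flame base are all assumptions made for the example.

```python
# Hypothetical thresholds; the real YCbCr limits are not given in the writeup.
Y_MIN, CB_MAX, CR_MIN = 120, 120, 150


def rgb_to_ycbcr(r, g, b):
    """Integer approximation of the full-range BT.601 (JPEG-style) conversion."""
    y = (77 * r + 150 * g + 29 * b) >> 8
    cb = ((-43 * r - 85 * g + 128 * b) >> 8) + 128
    cr = ((128 * r - 107 * g - 21 * b) >> 8) + 128
    return y, cb, cr


def threshold(image):
    """Map rows of (r, g, b) tuples to a 0/1 fire-color mask."""
    mask = []
    for row in image:
        out_row = []
        for r, g, b in row:
            y, cb, cr = rgb_to_ycbcr(r, g, b)
            out_row.append(int(y >= Y_MIN and cb <= CB_MAX and cr >= CR_MIN))
        mask.append(out_row)
    return mask


def erode3x3(mask):
    """Keep a pixel only if all nine pixels in its 3x3 window are set.
    Border pixels have no full window and are cleared."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = int(all(mask[y + dy][x + dx]
                                for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
    return out


def accumulate(mask):
    """Return raw sums (sum_x, sum_y, count, base_y) with no division.
    base_y is the lowest image row containing a surviving pixel
    (assumed here to stand in for the flame base), or -1 if none."""
    sum_x = sum_y = count = 0
    base_y = -1
    for y, row in enumerate(mask):
        for x, v in enumerate(row):
            if v:
                sum_x += x
                sum_y += y
                count += 1
                base_y = max(base_y, y)
    return sum_x, sum_y, count, base_y


def detect(image):
    """Full chain: color threshold, erosion, accumulation."""
    return accumulate(erode3x3(threshold(image)))
```

Returning raw sums instead of a centroid mirrors the idea of keeping division off the fabric: whoever consumes the result computes sum_x / count and sum_y / count.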

Humanoid (MuJoCo + PPO + A*): we set up the Unitree humanoid in MuJoCo and trained locomotion/approach policies with PPO, using A* for planning toward a target. Getting stable, useful behavior took substantial tuning: shaping rewards so the humanoid stayed balanced while moving toward a goal instead of collapsing or learning degenerate gaits, and adjusting training so the learned policy responded sensibly to an externally supplied target coordinate.
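For readers unfamiliar with the planning half, here is a minimal A* sketch on a 4-connected occupancy grid. The grid representation, unit step cost, and Manhattan heuristic are assumptions for illustration; the writeup does not say how the project's planner represents the world.

```python
import heapq


def astar(grid, start, goal):
    """Shortest 4-connected path on a grid where grid[r][c] == 1 is blocked.
    Returns a list of (row, col) cells from start to goal, or None."""
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Manhattan distance: admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]  # entries are (f = g + h, g, cell)
    best_g = {start: 0}
    came_from = {}

    while open_heap:
        _, g, cur = heapq.heappop(open_heap)
        if cur == goal:
            path = [cur]
            while cur in came_from:
                cur = came_from[cur]
                path.append(cur)
            return path[::-1]
        if g > best_g[cur]:
            continue  # stale heap entry; a cheaper route was already found
        r, c = cur
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best_g.get(nxt, float("inf")):
                    best_g[nxt] = ng
                    came_from[nxt] = cur
                    heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
    return None
```

In a setup like the one described, the goal cell would come from the detected fire location, and the resulting waypoints would be handed to the learned policy one target at a time.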

Software glue (Python, Flask): an image-to-memory converter, a golden-reference verifier, a live serial reader, a Flask web dashboard (phone-viewable over WiFi), and a video annotator that runs the exact FPGA algorithm frame-by-frame on real footage. We used Cursor and Claude throughout the build.
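To make the serial hand-off concrete, here is a sketch of decoding one per-frame packet on the host and doing the deferred division there. The packet layout is entirely hypothetical (sync byte, flag byte, three 32-bit sums, a 16-bit base row, and an XOR checksum); the writeup only says the packet is compact and binary.

```python
import struct
from functools import reduce

HEADER = 0xA5                      # hypothetical sync byte
BODY = struct.Struct("<BBIIIH")    # header, flags, sum_x, sum_y, count, base_y


def xor_checksum(data):
    """XOR of all bytes; a simple integrity check assumed for this sketch."""
    return reduce(lambda a, b: a ^ b, data, 0)


def encode(fire, sum_x, sum_y, count, base_y):
    """Build a packet the way the hardware might (used here for testing)."""
    body = BODY.pack(HEADER, int(fire), sum_x, sum_y, count, base_y)
    return body + bytes([xor_checksum(body)])


def decode(packet):
    """Return a detection dict, or None if the packet is malformed."""
    if len(packet) != BODY.size + 1:
        return None
    body, checksum = packet[:-1], packet[-1]
    header, flags, sum_x, sum_y, count, base_y = BODY.unpack(body)
    if header != HEADER or xor_checksum(body) != checksum:
        return None
    if not (flags & 1) or count == 0:
        return {"fire": False, "centroid": None, "target": None}
    # The division deferred from the FPGA happens here, on the host.
    cx, cy = sum_x // count, sum_y // count
    return {"fire": True, "centroid": (cx, cy), "target": (cx, base_y)}
```

In practice the bytes would arrive from a serial port reader; decoding is kept separate from I/O here so it can be checked without hardware.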

Challenges we ran into

The hardest problems were at the integration seams between two very different systems:

  • Bridging perception to the policy. The FPGA emits a raw coordinate over a serial link; the humanoid policy expects a target in its own frame. Getting that hand-off working (serial packet → server → a target the trained policy could actually act on) was a real integration effort. We ran out of time to fully close the loop into the live sim, so the home-base server acts as the connecting layer.
  • Tuning the humanoid policies. PPO didn't just work out of the box — balancing locomotion stability against goal-seeking took repeated reward and hyperparameter tuning, and the policy had to stay robust when handed a target it hadn't seen during training.
  • FPGA pipeline alignment. BRAM read latency and the line-buffer window each add delay, so coordinates and frame-boundary signals needed careful re-alignment — we chased a one-pixel coordinate bias down to the exact register stage.
  • Morphology tradeoffs. Erosion removed noise but over-shrank thin fires; we attempted morphological opening, hit a dilation bug, and made the call to ship reliable erosion-only rather than risk a working demo.
  • Bring-up gremlins. A camera/ESP8266 path we ultimately scoped out, stale bitstreams, simulation runs too short for a UART packet to finish transmitting, cached dashboards: the classic "sim is right but the board isn't" debugging across both hardware and the sim toolchain.
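The pipeline-alignment issue is easier to see in a software model of the streaming window. The sketch below imitates two line buffers and a 3×3 window over a raster stream; it is an illustrative model with assumed names, and it does not model BRAM read latency. The key detail is that when the pixel at (x, y) arrives, the window is centered on (x-1, y-1), so reporting the incoming coordinate instead produces exactly the kind of one-pixel bias described above.

```python
from collections import deque


def stream_erode(mask_rows, width):
    """Yield (center_x, center_y, eroded_bit) for a raster-ordered 0/1 mask.

    lb1 holds the previous row and lb2 the row before that, like two
    hardware line buffers. win holds the three most recent columns,
    each a (top, middle, bottom) triple.
    """
    lb1 = [0] * width
    lb2 = [0] * width
    win = deque([(0, 0, 0)] * 3, maxlen=3)
    for y, row in enumerate(mask_rows):
        for x, pix in enumerate(row):
            # New column: rows y-2, y-1, y at this x position.
            win.append((lb2[x], lb1[x], pix))
            # Shift the line buffers down by one row at this x position.
            lb2[x], lb1[x] = lb1[x], pix
            # The window is only fully inside the image from x >= 2, y >= 2.
            if x >= 2 and y >= 2:
                eroded = int(all(all(col) for col in win))
                # Center lags the incoming pixel by one column and one row.
                yield x - 1, y - 1, eroded
```

Comparing the coordinates this model emits against a frame-based reference is one way to catch an off-by-one before it reaches hardware.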

Accomplishments that we're proud of

  • A complete fire-detection pipeline running on real silicon, verified end-to-end: image → color detection → real-time neighborhood filtering → localization → UART → live dashboard.
  • Real-time morphological filtering with line buffers: the piece that genuinely justifies "why an FPGA," producing one neighborhood result per clock with deterministic latency, which a small always-on CPU would struggle to sustain at frame rate on the same power budget.
  • A trained humanoid that learned to locomote and move toward a goal in MuJoCo, with policies tuned to stay stable.
  • Bringing two hard, separate systems — custom hardware perception and a learned humanoid controller — into one coherent fire-fighting story.
  • Knowing when to scope down to protect a working demo.

What we learned

  • Why FPGAs win for streaming, per-pixel work: dedicated hardware per stage, neighbors on wires instead of fetched from memory, deterministic latency with no cache jitter — and that much of FPGA design is timing alignment, not the logic itself.
  • How brittle RL policies can be, and how much reward shaping and tuning it takes to get stable, goal-directed humanoid behavior in MuJoCo.
  • That the real work in a multi-part robot is the integration between subsystems, not just each subsystem alone.
  • The value of a software golden model for trusting hardware output.
  • The strongest framing isn't "FPGA beats GPU" — it's a tiered system where a cheap, always-on FPGA gate guards expensive compute that runs only on demand.

What's next for Ember

  • Close the loop fully: FPGA detection → policy target → humanoid response, live and end-to-end.
  • Live camera input via a parallel camera module straight into the FPGA fabric.
  • Richer perception on the same pipeline: morphological opening to preserve thin fires, Sobel texture analysis to reject flat orange surfaces like sunsets, and temporal flicker detection (fire pulses at a few hertz; steady light doesn't) — each drops into the existing line-buffer foundation, as does thermal imaging.
  • More robust policies: further PPO tuning, domain randomization for sim-to-real, and training the humanoid on actual suppression behaviors rather than just approach.
  • The home-base server as an opening for multi-robot coordination and an on-demand heavier confirmer (e.g. a YOLO-style model) that the FPGA gate wakes only when it flags something.
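As a reference point for the planned morphological opening, here is a frame-based sketch: erosion followed by dilation with the same 3×3 window. It is a software illustration with assumed function names, not the hardware design. Opening removes specks as erosion does, then grows surviving regions back to roughly their original extent; regions too thin to contain a full 3×3 window are still removed.

```python
def erode3x3(mask):
    """Keep a pixel only if its whole 3x3 window is set; borders are cleared."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = int(all(mask[y + dy][x + dx]
                                for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
    return out


def dilate3x3(mask):
    """Set a pixel if any in-bounds pixel in its 3x3 window is set."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = int(any(
                mask[y + dy][x + dx]
                for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                if 0 <= y + dy < h and 0 <= x + dx < w))
    return out


def open3x3(mask):
    """Morphological opening: erosion, then dilation."""
    return dilate3x3(erode3x3(mask))
```

In a streaming implementation the dilation stage would need its own pair of line buffers and its own coordinate delay, which is where alignment bugs tend to appear.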
