Inspiration

One of us has a friend who constantly uses LLMs and generative AI tools for everything, from deciding what to wear on dates to planning his meals and shopping lists. We wanted to lean into the irony a bit and build a system that playfully gestures at the primary appeal of these tools: reinforcing whatever the user already prefers.

What it does

M8 is a robotic system that takes user input and responds, much like the Magic 8 Ball that inspired it. Its robotic arms perform idle stretches and small passive movements; just like humans, they can't stay still. M8's main functionality is an API-less chatbot, with emphasis on the bot: a user asks M8 a question, and it responds positively, negatively, or, ideally when the query is classified as such, with uncertainty.
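The response logic can be sketched roughly like this: the classifier produces logits over the three answer types, and if no class is confident enough, M8 falls back to the uncertain answer. The label names and threshold below are illustrative assumptions, not the actual implementation.

```python
import math

# Hypothetical label names for M8's three answer classes.
LABELS = ["positive", "negative", "uncertain"]

def softmax(logits):
    """Convert raw logits into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_response(logits, threshold=0.5):
    """Pick the most likely response; fall back to 'uncertain'
    when no class clears the confidence threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best] < threshold:
        return "uncertain"
    return LABELS[best]
```

For example, `pick_response([2.0, 0.1, -1.0])` returns `"positive"`, while a flat logit vector like `[0.1, 0.1, 0.1]` falls back to `"uncertain"`.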

How we built it

We took Facebook's wav2vec 2.0 base model (~94.4M parameters) as an audio encoder and fine-tuned it on a large set of diffusion-generated samples. For the decoder and classification head, to meet our device's constraints, we used a lightweight Mamba architecture to capture temporal features and long-range dependencies, with the added benefit of linear (rather than quadratic) time complexity in sequence length. A simple classification head on top outputs logits for the likelihood that a query belongs to each of M8's response types.
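The efficiency argument for an SSM like Mamba comes down to its recurrent form: each timestep updates a fixed-size hidden state in a single pass, so cost grows linearly with sequence length, unlike attention's pairwise comparisons. A toy scalar state-space recurrence (not the actual Mamba selective scan, which uses input-dependent parameters) illustrates the idea:

```python
def ssm_scan(xs, a=0.9, b=0.5, c=1.0):
    """Minimal diagonal state-space recurrence:
        h_t = a * h_{t-1} + b * x_t
        y_t = c * h_t
    One sweep over the sequence -> O(T) time and O(1) state,
    versus O(T^2) for full self-attention."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys
```

Feeding in an impulse shows the state decaying geometrically: `ssm_scan([1.0, 0.0, 0.0])` yields `[0.5, 0.45, 0.405]`.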

Challenges we ran into

  • Audio drivers suck. Many hours were spent on JACK/PulseAudio nonsense.
  • Most open-source models cannot run on edge hardware.
  • lerobot was too messy for what we wanted to do, so we wrote literally all of the functionality ourselves (IK, FK, servo control, etc.). Special shoutout to lerobot's Feetech driver for causing a lot of headaches.
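To give a flavor of the kinematics we ended up writing ourselves, here is a hedged sketch for a simplified 2-link planar arm (our arms are 5DoF; link lengths and the elbow-down IK branch below are illustrative assumptions):

```python
import math

def fk(theta1, theta2, l1=1.0, l2=1.0):
    """Forward kinematics: joint angles -> end-effector (x, y)."""
    x = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    y = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x, y

def ik(x, y, l1=1.0, l2=1.0):
    """Closed-form inverse kinematics (elbow-down solution)."""
    d2 = x * x + y * y
    # Law of cosines gives the elbow angle.
    cos_t2 = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    cos_t2 = max(-1.0, min(1.0, cos_t2))  # clamp for numerical safety
    theta2 = math.acos(cos_t2)
    theta1 = math.atan2(y, x) - math.atan2(l2 * math.sin(theta2),
                                           l1 + l2 * math.cos(theta2))
    return theta1, theta2
```

A quick sanity check is the round trip: `fk(*ik(1.2, 0.5))` should land back on `(1.2, 0.5)` for any reachable target.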

Accomplishments that we're proud of

  • implemented diffusion for on-the-fly sample generation
  • used an SSM for temporal context and efficiency
  • arms can dance, nod, and shake their "heads"!
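Gestures like nodding can be driven by simple keyframe trajectories: a sinusoidal sweep of one joint around its neutral pose. The amplitude, cycle count, and step count below are hypothetical parameters for illustration, not the tuned values on the real arms:

```python
import math

def nod_trajectory(amplitude_deg=15.0, cycles=2, steps=40):
    """Generate joint-angle keyframes (in degrees) for a 'nod':
    a sinusoidal sweep around the neutral (zero) pose."""
    return [amplitude_deg * math.sin(2 * math.pi * cycles * i / steps)
            for i in range(steps + 1)]
```

Streaming these keyframes to the head-pitch servo at a fixed rate produces a smooth nod that starts and ends at the neutral pose.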

What we learned

We learned a lot about on-device deployment and interfacing with quantization libraries, and we gained more experience working with 5DoF arms (for one of us, it was actually the first time!). We also got very familiar with the Claude API, and spent a long time with the fish-audio API attempting to use it for ASR.

What's next for M8

Our immediate goal is to resolve the audio driver issues and support native, lag-free audio input for more streamlined interaction. Further down the road, we intend to implement a passive wake word, so that the entire process is completely plug-and-play and M8 can fit into any ecosystem it needs to. Finally, more powerful architectures and longer training will likely yield a better-aligned model, so we aim to scale up M8's backbone with better compute.
