Milo — The Story

Inspiration

It started with a conversation none of us expected.

During a hospital visit, we met a patient — we'll call him Karim — a retired engineer, sharp as ever, who had been living with Parkinson's for six years. He wasn't angry. He wasn't asking for pity. He just said, very quietly:

"The hardest part isn't the shaking. The hardest part is needing someone to hand me my pill every morning. Like I'm a child."

That sentence didn't leave us.

Parkinson's disease affects over 10 million people worldwide. It is a progressive neurological disorder that destroys the brain's dopamine-producing neurons, causing tremors, rigidity, and loss of fine motor control. The medication — primarily Levodopa/Carbidopa — works. Decades of neuroscience back it up. But the medication only works if it is taken on time, every time.

The cruel irony: the very disease that requires precise, timed medication is the same disease that makes self-administration nearly impossible.

We did not set out to build a robot. We set out to give Karim his mornings back.


What We Learned

The neuroscience of timing

Parkinson's medication management is not forgiving. The therapeutic window for Levodopa is narrow. When plasma concentration drops below the effective threshold, patients enter what clinicians call an "off" episode — a state of severe motor freezing that can last minutes to hours.

The pharmacokinetic model that governs this is approximately:

$$C(t) = \frac{F \cdot D \cdot k_a}{V_d(k_a - k_e)} \left(e^{-k_e t} - e^{-k_a t}\right)$$

Where:

  • $C(t)$ is plasma drug concentration at time $t$
  • $D$ is the dose
  • $k_a$ is the absorption rate constant
  • $k_e$ is the elimination rate constant
  • $V_d$ is the volume of distribution
  • $F$ is bioavailability

The implication: a missed or delayed dose does not just mean discomfort. It means the concentration curve stays below the therapeutic threshold for hours, and the patient suffers a predictable, preventable neurological crisis.

This is not a comfort problem. It is a safety-critical systems problem.
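To make that concrete, the model above can be evaluated numerically. A minimal sketch, using illustrative (not clinical) parameter values, shows how every minute of delay postpones the moment the concentration first crosses the effective threshold:

```python
import math

def concentration(t, D=100.0, F=0.3, ka=0.05, ke=0.01, Vd=50.0):
    """One-compartment oral-dose model C(t), t in minutes.
    All parameter values here are illustrative, not clinical."""
    if t <= 0:
        return 0.0
    coeff = (F * D * ka) / (Vd * (ka - ke))
    return coeff * (math.exp(-ke * t) - math.exp(-ka * t))

def minutes_until_effective(threshold=0.2, delay=0.0, horizon=480):
    """First minute at which plasma concentration exceeds the
    effective threshold, for a dose taken `delay` minutes late."""
    for t in range(horizon):
        if concentration(t - delay) > threshold:
            return t
    return None

print(minutes_until_effective())          # → 9  (on-time dose)
print(minutes_until_effective(delay=60))  # → 69 (every minute of delay is
                                          #      a minute of "off" time)
```

With these toy constants a 60-minute delay translates one-for-one into 60 extra sub-therapeutic minutes, which is exactly the "off" episode described above.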

Tremor as a signal, not just a symptom

Parkinsonian resting tremor oscillates at a characteristic frequency of $4$–$6\ \text{Hz}$. This is distinct from essential tremor ($8$–$12\ \text{Hz}$) and cerebellar tremor ($< 4\ \text{Hz}$).

$$f_{\text{Parkinson}} \in [4, 6]\ \text{Hz}$$

This frequency signature is measurable, predictable, and — crucially — something a robot can be designed around. The arm does not fight the tremor. It operates independently of it.
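That separability can be exploited directly: estimate the dominant frequency of a wrist sensor trace and check which band it falls in. A minimal sketch with a naive DFT and a synthetic signal (band edges as above; a real system would use accelerometer data and an FFT library):

```python
import math

def dominant_frequency(samples, sample_rate):
    """Dominant frequency of a real-valued signal via a naive DFT (O(n^2))."""
    n = len(samples)
    best_k, best_power = 1, 0.0
    for k in range(1, n // 2):  # skip DC, stop below Nyquist
        re = sum(s * math.cos(2 * math.pi * k * i / n) for i, s in enumerate(samples))
        im = sum(s * math.sin(2 * math.pi * k * i / n) for i, s in enumerate(samples))
        power = re * re + im * im
        if power > best_power:
            best_k, best_power = k, power
    return best_k * sample_rate / n

def classify_tremor(freq_hz):
    """Band classification using the frequency ranges quoted above."""
    if freq_hz < 4:
        return "cerebellar"
    if freq_hz <= 6:
        return "parkinsonian"
    if 8 <= freq_hz <= 12:
        return "essential"
    return "unclassified"

# Synthetic 5 Hz resting tremor, sampled at 50 Hz for 2 seconds
rate = 50
signal = [math.sin(2 * math.pi * 5 * i / rate) for i in range(rate * 2)]
f = dominant_frequency(signal, rate)
print(f, classify_tremor(f))  # → 5.0 parkinsonian
```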

Imitation learning is the right paradigm for assistive robotics

The most important design decision we made was choosing imitation learning over hard-coded trajectories.

A scripted arm follows a fixed path. The real world is not fixed — cups shift, tables are different heights, patients move. Imitation learning, specifically the Action Chunking with Transformers (ACT) policy, learns a distribution over actions $\pi_\theta(a_t \mid o_t)$ by maximizing the likelihood of human demonstrations:

$$\theta^* = \arg\max_\theta \sum_{i=1}^{N} \log \pi_\theta\left(a^{(i)} \mid o^{(i)}\right)$$

Where $o_t$ is the observation (camera + joint states) and $a_t$ is the action (joint targets). The model learns to generalize across the variance in the demonstrations — exactly what you need when every patient's environment is slightly different.

AI voice is not a feature — it is the product

We initially thought the voice layer was secondary. We were wrong.

When we tested a version of Milo that simply moved the arm without speaking, patients reported feeling watched, not helped. The machine felt alien. Threatening.

The moment Milo said "Good morning — ready for your pill? I'll bring it to you slowly" — the entire dynamic shifted. Patients relaxed. They engaged. One tester said: "It feels like someone is there."

We learned that for vulnerable populations, trust is the interface. Everything else is implementation detail.


How We Built It

The hardware — ElRobot

We built a custom 7+1 DOF robotic arm called ElRobot, designed specifically for close-proximity assistive tasks. The arm is low-cost, open-source, and built on Feetech servo motors. It connects via serial to a host computer running the full Milo stack.

The arm has:

  • 6 rotational joints for reach and orientation
  • 1 wrist rotation joint
  • 1 gripper (the "+1")

The AI stack

Microphone
    │
    ▼
WebRTC VAD          ← detects speech onset/offset in real time
    │
    ▼
Whisper (faster-whisper, tiny, CPU)   ← on-device STT, ~1s latency
    │
    ▼
Mistral AI (mistral-small-latest)     ← intent parsing + response generation
    │               │
    │               ▼
    │         ElevenLabs TTS (Rachel voice)  ← speaks back to patient
    │
    ▼ (if trigger detected)
ACT Policy (PyTorch, trained on 50 demos)
    │
    ▼
ElRobot Arm  ←  pick up pill → deliver to patient
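The routing step in that diagram — "if trigger detected" — can be sketched as a small dispatch loop. Everything below is a hypothetical stand-in: the real stack uses Mistral for intent parsing, not keywords, and the `speak`/`run_policy` callables stand in for the ElevenLabs and ACT integrations:

```python
# Hypothetical sketch of the trigger-routing step; component names
# are stand-ins, not the production Milo implementations.
TRIGGER_WORDS = {"pill", "medication", "medicine", "dose"}

def wants_pill(transcript: str) -> bool:
    """Cheap keyword trigger; the real stack asks the LLM for intent."""
    words = transcript.lower().replace("?", "").replace(".", "").split()
    return any(w in TRIGGER_WORDS for w in words)

def handle_utterance(transcript, speak, run_policy):
    """speak/run_policy are injected so the loop is testable offline."""
    if wants_pill(transcript):
        speak("I'll bring your pill to you slowly.")  # announce before motion
        run_policy()                                  # ACT pick-and-deliver
        return "arm"
    speak("I'm here if you need anything.")           # conversational path
    return "chat"

log = []
print(handle_utterance("Time for my medication?", log.append, lambda: log.append("MOVE")))
# → arm
```

Injecting the side-effecting components as arguments keeps the routing logic testable without a microphone, a TTS account, or the physical arm.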

The voice personality — Milo's system prompt

Mistral is given a carefully engineered system prompt that defines Milo's character: warm, brief, never clinical, always encouraging. Every response is generated fresh — not scripted — which means Milo adapts to what the patient actually says, not what we predicted they would say.

The training pipeline

We recorded 50 human demonstrations of the pick-and-deliver task using a leader-follower teleoperation setup. The leader arm (operated by a human) records joint positions and camera frames. The follower arm mirrors the motion.

These demonstrations are stored as a LeRobotDataset and used to train the ACT policy:

$$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \log \pi_\theta(a_t^{(i)} | o_t^{(i)})$$

After training, the policy runs autonomously — no human in the loop.
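For intuition, the loss above can be sketched with a toy Gaussian policy over a single joint: the average negative log-likelihood of each demonstrated action under the policy's predicted mean. (Toy values throughout; the real ACT model is a transformer predicting chunks of actions, not a scalar Gaussian.)

```python
import math

def gaussian_nll(action, mean, std=0.1):
    """-log N(action | mean, std^2): per-step behavior-cloning loss."""
    return 0.5 * math.log(2 * math.pi * std**2) + (action - mean) ** 2 / (2 * std**2)

def bc_loss(demos, policy_mean):
    """L(theta) = -(1/N) sum_i sum_t log pi_theta(a_t | o_t),
    for a policy whose action distribution is centered at policy_mean(o)."""
    total, count = 0.0, 0
    for episode in demos:               # episode: list of (obs, action) pairs
        for obs, action in episode:
            total += gaussian_nll(action, policy_mean(obs))
            count += 1
    return total / count

# Toy demos: observation is the cup position, action tracks it closely
demos = [[(0.2, 0.21), (0.4, 0.39)], [(0.6, 0.62)]]
good = bc_loss(demos, policy_mean=lambda o: o)    # imitates the demos
bad = bc_loss(demos, policy_mean=lambda o: 0.0)   # ignores the observation
print(good < bad)  # → True
```

Minimizing this loss pushes the policy's action distribution toward whatever the demonstrators did in each observed situation, which is why demonstration variance translates into generalization.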


Challenges We Faced

1. Latency in the voice pipeline

The patient should never feel like they are waiting for a computer. We targeted end-to-end latency (speech → response → arm motion start) under 2 seconds.

The biggest bottleneck was TTS synthesis. We solved it by:

  • Switching to eleven_turbo_v2_5 (ElevenLabs' lowest-latency model)
  • Streaming audio chunks as they arrive rather than waiting for the full clip
  • Running Whisper in int8 quantized mode on CPU
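The streaming change matters because perceived latency is time-to-first-audio, not total synthesis time. A toy illustration with a simulated chunk generator (not the ElevenLabs API; sleep times are stand-ins for synthesis work):

```python
import time

def synthesize_chunks(n_chunks=5, chunk_time=0.05):
    """Simulated TTS: yields audio chunks as they are 'synthesized'."""
    for _ in range(n_chunks):
        time.sleep(chunk_time)   # stand-in for per-chunk synthesis work
        yield b"\x00" * 1024     # fake PCM chunk

def time_to_first_audio(stream):
    start = time.monotonic()
    next(iter(stream))           # play as soon as chunk 1 arrives
    return time.monotonic() - start

def time_if_buffered(stream):
    start = time.monotonic()
    list(stream)                 # wait for the full clip first
    return time.monotonic() - start

streaming = time_to_first_audio(synthesize_chunks())
buffered = time_if_buffered(synthesize_chunks())
print(streaming < buffered)  # → True: speech starts after 1 chunk, not 5
```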

2. Robustness of the arm in real environments

The arm was trained in our lab. Patient environments are messier — different lighting, different table heights, cups that are not perfectly centered.

We addressed this through data augmentation during training (random crops, brightness shifts) and by adding a wrist camera that gives the policy a close-up view of the target object, reducing sensitivity to global scene variation.
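Those augmentations are simple image-space transforms applied per training frame. A minimal stdlib sketch, with frames as nested lists of floats (the real pipeline operates on camera tensors inside the training loop):

```python
import random

def random_crop(img, crop_h, crop_w, rng=random):
    """Random crop: teaches the policy to tolerate small camera shifts."""
    h, w = len(img), len(img[0])
    top = rng.randrange(h - crop_h + 1)
    left = rng.randrange(w - crop_w + 1)
    return [row[left:left + crop_w] for row in img[top:top + crop_h]]

def brightness_shift(img, max_delta=0.1, rng=random):
    """Additive brightness jitter, clamped to [0, 1]."""
    delta = rng.uniform(-max_delta, max_delta)
    return [[min(1.0, max(0.0, px + delta)) for px in row] for row in img]

frame = [[0.5] * 8 for _ in range(6)]   # fake 6x8 grayscale frame
aug = brightness_shift(random_crop(frame, 4, 4))
print(len(aug), len(aug[0]))  # → 4 4
```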

3. Designing for cognitive and emotional vulnerability

This was the hardest challenge and the one no engineering textbook covers.

Parkinson's patients often have anxiety, cognitive fatigue, and variable attention. A voice interface designed for a healthy adult — fast, information-dense, efficient — fails completely with this population.

We ran five user testing sessions and rewrote Milo's personality three times. Key lessons:

  • Sentences must be under 15 words
  • Always announce motion before it happens
  • Never use medical jargon
  • Celebrate small wins explicitly ("Perfect. That's your morning dose done.")
  • Give the patient a way to stop at any time

4. The trust problem

Patients were initially afraid of the arm. This was not a technical problem. It was a psychological one.

We solved it not by making the arm look less robotic, but by making Milo's voice warm enough that the arm felt supervised. The AI companion acts as a social proxy — it signals that someone (even an AI) is paying attention and cares about what happens.

This insight reshaped our entire product philosophy: the voice is not a wrapper around the robot. The voice is the reason the robot is trusted.


What's Next

  • Clinical pilot with a partner neurology clinic
  • Medication schedule integration (automatic reminders at prescribed times)
  • Caregiver dashboard — remote monitoring of adherence
  • Multi-medication support (pill organizer + vision model for pill identification)
  • On-device Mistral inference for full offline operation in low-connectivity homes

Built at the intersection of robotics, AI, and human dignity. For Karim, and the 10 million people like him.

Built With

  • elevenlabs
  • mistral