Autonomous Triage Kiosk
Inspiration
Emergency rooms and urgent care clinics across the world share one painful truth: triage is a bottleneck. The average U.S. ER wait is over 2.5 hours, and during that wait, a nurse is the only person who can decide who gets seen first. Critical patients sit next to common colds, often for hours, and lives are lost in those lobbies every day.
We asked a simple question: what if the very first point of contact in a clinic wasn't a human at all, but an autonomous robot that could greet, listen, measure, and prioritize every patient the second they walked through the door? One robot arm, one sensor, and a smart intake loop — available 24/7, consistent, and tireless.
What it does
The Autonomous Triage Kiosk is a robotic nurse's assistant that performs end-to-end intake before a human provider is ever involved.
- Presence detection: An Arduino-connected ultrasonic sensor detects when a patient approaches the kiosk.
- Greeting: The SO-101 robot arm waves hello while the voice assistant says "Hello, I am your intake assistant."
- Voice intake: Using Google Speech Recognition, the kiosk asks three questions — reason for visit, symptoms, and duration — and listens to the patient's spoken answers.
- Triage classification: A rule-based engine categorizes the case as CRITICAL, URGENT, or NON-URGENT based on keyword patterns (e.g. "chest pain" → critical, "fever" → urgent, "mild headache" → non-urgent).
- Vitals presentation: Based on reported symptoms, the arm picks up and presents the relevant instrument to the patient:
  - Pulse oximeter (black box) for cardiac/respiratory symptoms
  - Thermometer (white box) for fever
  - Blood pressure cuff for headaches / hypertension
- Token handoff: The arm picks up a colored priority token — red (critical), yellow (urgent), or green (non-urgent) — and hands it to the patient, who proceeds accordingly.
- Graceful handoff: The kiosk thanks the patient and resets, ready for the next arrival.
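The rule-based classification step above can be sketched as a small keyword matcher. The keyword lists below are illustrative placeholders (only "chest pain", "fever", and "mild headache" come from this write-up), not the project's actual lists:

```python
# Illustrative sketch of a keyword-based triage classifier.
# Keyword lists are assumptions, not the project's real ones.
CRITICAL_KEYWORDS = {"chest pain", "can't breathe", "unconscious", "severe bleeding"}
URGENT_KEYWORDS = {"fever", "vomiting", "dizzy", "broken"}

def classify(transcript: str) -> str:
    """Map a spoken intake transcript to a priority level."""
    text = transcript.lower()
    if any(kw in text for kw in CRITICAL_KEYWORDS):
        return "CRITICAL"
    if any(kw in text for kw in URGENT_KEYWORDS):
        return "URGENT"
    return "NON-URGENT"
```

Checking critical keywords first means a transcript mentioning both "chest pain" and "fever" resolves to the higher priority.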
How we built it
- Hardware: SO-101 follower + leader arm pair, two USB webcams (top + side view), Arduino Uno with HC-SR04 ultrasonic distance sensor.
- Robotics framework: LeRobot for recording and replaying teleoperated motions. We recorded ~14 episodes across two Hugging Face datasets (`kiosk_motions` and `pick_drop_right`) — one episode per macro motion (greeting, each vitals instrument, each colored token).
- Two-agent architecture: `intake_agent.py` reads the Arduino over serial, runs the voice intake loop, classifies triage, and orchestrates actions. `arm_agent.py` is a TCP server on port 8765 that receives action names (e.g. `PRESENT_PULSE_OX`, `HAND_RED_TOKEN`) and invokes `lerobot-replay` as a subprocess to play the corresponding demonstration episode on the follower arm.
- Voice: macOS `say` for text-to-speech, `SpeechRecognition` + `pyaudio` for speech-to-text.
- Concurrency: The greeting wave fires in a background thread so the arm moves in parallel with the welcome speech, while vitals and token handoffs run sequentially for timing clarity.
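The two-agent split can be sketched as a small action server. The `ACTION_CATALOG` entries, dataset repo IDs, and `lerobot-replay` flag names below are illustrative assumptions — only the command itself, port 8765, and `--play_sounds=false` come from this write-up:

```python
import json
import socket
import subprocess

# Hypothetical catalog mapping action names to (dataset repo, episode index).
# Repo IDs and indices are placeholders, not the project's real values.
ACTION_CATALOG = {
    "WAVE_HELLO": ("user/kiosk_motions", 0),
    "PRESENT_PULSE_OX": ("user/kiosk_motions", 1),
    "HAND_RED_TOKEN": ("user/pick_drop_right", 2),
}

def handle_request(raw: bytes) -> dict:
    """Decode one JSON request and build the replay command for it."""
    try:
        action = json.loads(raw)["action"]
        repo_id, episode = ACTION_CATALOG[action]
    except (json.JSONDecodeError, KeyError, TypeError):
        return {"status": "error", "reason": "unknown action"}
    # Flag spellings are assumptions; the real CLI arguments depend on
    # the installed LeRobot version.
    cmd = ["lerobot-replay",
           f"--dataset.repo_id={repo_id}",
           f"--dataset.episode={episode}",
           "--play_sounds=false"]
    return {"status": "ok", "cmd": cmd}

def serve(port: int = 8765) -> None:
    """Accept one JSON request per connection and replay the episode."""
    with socket.socket() as srv:
        srv.bind(("127.0.0.1", port))
        srv.listen()
        while True:
            conn, _ = srv.accept()
            with conn:
                reply = handle_request(conn.recv(4096))
                if reply["status"] == "ok":
                    subprocess.run(reply["cmd"])  # blocks until replay ends
                conn.sendall(json.dumps(reply).encode())
```

Because the intake agent only ever sends an action name and reads back a status, either side can be rewritten (e.g. swapping replay for a learned policy) without touching the other.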
Challenges we ran into
- Episode index management: Every time we re-recorded a motion, the dataset appended a new episode rather than replacing the old one. We had to carefully track which index corresponded to which motion and update a central `ACTION_CATALOG` in `arm_agent.py` on every iteration.
- Consistent start poses: Replayed trajectories only look right if the arm starts from the same home pose as when recorded. A few early attempts drifted because the home pose wasn't consistent across recordings.
- Audio cues from LeRobot: `lerobot-replay` speaks "Replaying episode" by default, which broke the illusion of a medical assistant. Fixed with `--play_sounds=false`.
- PyAudio on macOS: Getting `pyaudio` to build on Apple Silicon required installing `portaudio` via Homebrew first.
- Parallelism vs. determinism: Deciding which actions should run concurrently (greeting + speech) versus sequentially (vitals + narration) was a UX problem more than a technical one.
- Keyword-based triage: Getting the triage keyword lists right for natural speech was iterative — real patients don't phrase symptoms the way regex expects.
Accomplishments that we're proud of
- A fully autonomous, end-to-end demo in a single hackathon: presence → greeting → voice intake → triage → vitals → token handoff, with no human in the loop.
- Seven distinct motions demonstrated, recorded, and replayed cleanly on a single physical arm.
- Parallelized greeting — the arm waves while the assistant speaks, which is a small detail that makes the interaction feel genuinely alive.
- Clean two-agent separation: the intake brain and the motion brain talk over a simple JSON socket protocol, so each can be swapped or upgraded independently.
- Voice-driven conversation that works hands-free, making the demo feel real rather than scripted.
- A working, live system that survives being demo'd.
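The parallelized greeting can be sketched in a few lines, assuming hypothetical `send_action` (the socket call to the arm agent) and `speak` (TTS) helpers:

```python
import threading

def greet(send_action, speak) -> None:
    """Wave and speak at the same time, then wait for the arm to finish.

    send_action and speak are hypothetical callables standing in for the
    arm-agent socket client and the macOS `say` wrapper.
    """
    wave = threading.Thread(target=send_action, args=("WAVE_HELLO",), daemon=True)
    wave.start()                                  # arm starts waving...
    speak("Hello, I am your intake assistant.")   # ...while TTS speaks
    wave.join()                                   # sync up before intake begins
```

The `join()` is what keeps the rest of the pipeline sequential: intake questions only start once the wave has completed.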
What we learned
- Teleop demonstrations scale further than you'd think. With just one recorded example per macro, replay is reliable enough for a real demo — no policy training needed for a proof of concept.
- Cold infrastructure matters: keeping the arm at a consistent home pose, the tokens in consistent positions, and the cameras bolted down is the difference between a demo that works and one that flails.
- Timing sells the illusion. Making the arm greet while the voice speaks is the single change that made users react to the system as a living assistant.
- Agent architectures are cheap and powerful. Splitting perception/conversation from motion into two processes kept each file short, readable, and debuggable in isolation.
- Graceful fallbacks matter. Voice input degrades to keyboard, unknown actions return errors cleanly, and the arm agent is restart-safe.
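The voice-to-keyboard fallback can be sketched with the voice path abstracted behind a `listen` callable — in the real system that callable would wrap `SpeechRecognition`'s Google recognizer, but here it is a hypothetical parameter so the degradation logic stands alone:

```python
def ask(question: str, listen=None, fallback=input) -> str:
    """Ask a question by voice if possible, else fall back to the keyboard.

    listen: hypothetical callable returning a transcript (e.g. wrapping
            SpeechRecognition's recognize_google); may raise on mic or
            API failure.
    fallback: keyboard input function, swappable for testing.
    """
    print(question)  # stand-in for the macOS `say` TTS call
    if listen is not None:
        try:
            return listen()
        except Exception:
            pass  # no mic, timeout, or recognition error: degrade gracefully
    return fallback(f"{question} > ")
```

Catching the failure at the question level means one bad microphone read costs a single typed answer, not the whole intake session.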
What's next for Autonomous Triage Kiosk
- LLM-driven triage — replace the keyword classifier with GPT-4 or a fine-tuned medical LLM that can reason about symptom combinations, follow up with clarifying questions, and generate a structured EHR note.
- Computer vision — use the cameras to recognize patient demographics, visible distress signals, skin color changes, and injury locations to augment the triage signal.
- Actual vitals capture — integrate real Bluetooth-enabled pulse oximeters, thermometers, and BP cuffs so the arm doesn't just present the instrument, it reads the measurements and feeds them into triage.
- Multi-language support — Spanish, Mandarin, Hindi, Arabic. Voice intake should work for every patient regardless of language.
- Electronic handoff to staff — push the patient's intake record and triage score directly to the nursing station's dashboard, so the right nurse knows about the right patient the instant they walk in.
- Learned policies — replace macro replay with a learned visuomotor policy (ACT, Diffusion Policy) so the arm can adapt to varying token positions, lighting, and patient placement.
- Deployment: pilot at a student health center or rural clinic where wait times are longest and staffing is thinnest.
