Inspiration
What if AI could reach beyond the screen and touch the physical world? I wanted to explore embodied AI through the metaphor of a possessed robotic arm, a spectral intelligence trapped in silicon, manifesting through mechanical limbs. The Halloween theme wasn’t just aesthetic; ghosts are disembodied entities seeking physical form, making it the perfect metaphor for an LLM controlling a robot. Mortis became my exploration of how abstract thoughts become concrete actions, blurring the line between predictable robotics and the unpredictable nature of a spectral intelligence.
What it does
Mortis is an interactive AI experience where users chat with a mischievous Halloween spirit that responds through both text and physical gestures. Powered by Google's Gemini API, Mortis analyzes user messages and generates in-character responses paired with emotional moods and physical actions.
A SeeedStudio SO101 robotic arm—controlled via Hugging Face's LeRobot framework—executes synchronized gestures such as waving, pointing, grabbing, and dropping, all reflecting Mortis’s personality. The system includes a Gradio web interface with Halloween theming, multi-modal capabilities (text, voice, vision), and advanced manipulation through SmolVLA vision-language-action models for complex real-world tasks.
How I built it
Core Architecture (3-Layer Design)
The system is built on three interconnected layers:
1. The Brain (LLM Intelligence)
I used Google Gemini with structured function calling. Instead of parsing free-form text, I constrained the model to return JSON with three fields:
- message: in-character dialogue
- mood: one of 8 predefined emotional states
- gesture: one of 6 defined physical actions
This eliminated parsing errors and ensured reliable robot control.
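A minimal sketch of that contract, assuming the google-generativeai SDK's JSON output mode; the specific mood and gesture names below are placeholders for the project's 8 moods and 6 gestures, not the actual vocabulary:

```python
import json
import google.generativeai as genai  # assumes genai.configure(api_key=...) has run

MOODS = {"playful", "spooky", "grumpy", "curious",
         "excited", "sleepy", "mischievous", "dramatic"}      # 8 states (names assumed)
GESTURES = {"wave", "point", "grab", "drop", "nod", "shake"}  # 6 actions (names assumed)

SYSTEM = (
    "You are Mortis, a mischievous Halloween spirit. Reply ONLY with JSON: "
    '{"message": <in-character dialogue>, '
    f'"mood": one of {sorted(MOODS)}, "gesture": one of {sorted(GESTURES)}}}'
)

model = genai.GenerativeModel(
    "gemini-2.5-flash",
    system_instruction=SYSTEM,
    generation_config={"response_mime_type": "application/json"},
)

def ask_mortis(user_text: str) -> dict:
    """Return a validated {message, mood, gesture} dict."""
    reply = json.loads(model.generate_content(user_text).text)
    if reply["mood"] not in MOODS or reply["gesture"] not in GESTURES:
        raise ValueError(f"Out-of-vocabulary reply: {reply}")
    return reply
```

Constraining the vocabulary in the prompt and re-validating in code means a malformed reply fails fast instead of ever reaching the arm.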
2. The Soul (Intent Routing)
I developed a custom IntentRouter class that validates and interprets Gemini outputs. It separates conversational intents (simple gestures) from manipulation intents (complex VLA tasks). Manipulation requests are validated against 6 trained SmolVLA tasks. Any invalid or ambiguous command safely degrades into a fallback gesture.
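In rough outline, a router like this could look as follows; the task names and fallback gesture are hypothetical stand-ins for the project's actual values:

```python
from dataclasses import dataclass

FALLBACK_GESTURE = "wave"  # safe default (assumed)
GESTURES = {"wave", "point", "grab", "drop", "nod", "shake"}  # conversational gestures
VLA_TASKS = {"pick_place_candy", "stack_cups", "open_box",
             "press_button", "hand_over", "sort_objects"}     # 6 tasks (names hypothetical)

@dataclass
class Intent:
    kind: str    # "gesture" or "manipulation"
    action: str  # validated gesture or SmolVLA task name

class IntentRouter:
    """Validate LLM output and route it to the matching executor."""

    def route(self, llm_output: dict) -> Intent:
        requested = llm_output.get("gesture", "")
        # Manipulation intents must match a trained SmolVLA task exactly.
        if requested in VLA_TASKS:
            return Intent("manipulation", requested)
        # Conversational intents map to simple scripted gestures.
        if requested in GESTURES:
            return Intent("gesture", requested)
        # Anything invalid or ambiguous degrades to a safe fallback.
        return Intent("gesture", FALLBACK_GESTURE)
```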
3. The Body (Robot Control)
I created a MortisArm class to control a SeeedStudio SO101 6-DOF arm through LeRobot. Gestures are implemented via a GESTURES dictionary mapping names to precise servo sequences with timing logic.
SmolVLA manipulation tasks use trained encoders and action policies for object interaction.
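The gesture table's shape might look like this minimal sketch, assuming joint-angle waypoints in degrees with per-step hold times; the values, joint ordering, and the move_to method are illustrative, not the calibrated SO101 sequences:

```python
import time

GESTURES = {
    # name: [(six_joint_angles_deg, hold_seconds), ...]
    "wave": [
        ((0, -30, 60, 0, 45, 0), 0.4),
        ((0, -30, 60, 0, -45, 0), 0.4),
        ((0, -30, 60, 0, 45, 0), 0.4),
    ],
    "point": [((20, -10, 40, 0, 0, 0), 1.0)],
}

def play_gesture(arm, name: str) -> None:
    """Step through a gesture's waypoints with timed holds."""
    for angles, hold in GESTURES[name]:
        arm.move_to(angles)  # hypothetical MortisArm method
        time.sleep(hold)
```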
Technical Stack
- Language: Python 3.12+ with type hints
- Package Manager: uv for fast dependency resolution
- Web Framework: Gradio 5.49+ with custom CSS + base64 Halloween background
- Robotics: LeRobot 0.4.0+
- AI Models: Gemini 2.5 Flash (LLM), SmolVLA for VLA manipulation
- Build System: Makefile with 15+ commands (install, run, calibrate, train…)
- Environment: python-dotenv for configuration
- Architecture: Hybrid async execution (threads + LeRobot async API); a minimal sketch follows this list
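To make the hybrid-execution point concrete, here is a minimal sketch of the thread hand-off idea: gestures are queued to a daemon worker so Gradio callbacks and LLM calls never block on servo motion. All names here are illustrative rather than the project's actual code.

```python
import queue
import threading

class GestureExecutor:
    """Run blocking servo motion on a worker thread, off the UI path."""

    def __init__(self, arm):
        self.arm = arm
        self.jobs: queue.Queue = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self) -> None:
        while True:
            name = self.jobs.get()       # blocks until a gesture arrives
            self.arm.play_gesture(name)  # blocking servo motion (hypothetical)
            self.jobs.task_done()

    def submit(self, name: str) -> None:
        """Enqueue a gesture without blocking a Gradio callback."""
        self.jobs.put(name)
```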
Challenges I ran into
- Robot calibration was extremely sensitive and required iterating physical offsets.
- Synchronization across UI updates, LLM responses, gestures, and VLA tasks created race conditions that forced me to design a hybrid async executor.
- SmolVLA integration was difficult: mismatched observation formats, discretization issues, 100–200ms inference latency, and camera alignment problems.
- Intent routing needed strict validation to prevent the LLM from generating unsafe or undefined gestures.
- Voice integration uncovered unexpected complexity in audio formats, latency, and UX flow.
- Documentation debt grew until I disciplined myself to track progress with structured notes, specs, and user guides.
Accomplishments I’m proud of
- I built a fully working embodied-AI character with synchronized dialogue, emotion, gestures, and manipulation.
- I designed a clean 3-layer architecture (Brain–Soul–Body) that separates reasoning, routing, and actuation.
- I implemented a spec-driven development workflow using Kiro to automate boilerplate code and enforce consistency.
- I created a safe fallback system preventing invalid robot actions.
- I integrated SmolVLA end-to-end with real hardware.
- I wrapped everything in a polished Halloween-themed Gradio UI.
What I learned
- Embodied AI breaks assumptions: language, vision, and control need synchronized timing.
- Structured outputs (JSON) dramatically improve LLM reliability.
- VLA models require precise observation pipelines, not just good weights.
- Async robotics is harder than async web code—timing matters.
- Specs + Kiro accelerate development and reduce design drift.
- Real hardware forces constraints that don’t exist in simulation.
- UX matters even in robotics—humans want personality, not raw motors.
What’s next for Mortis
Near-Term Improvements (VLA & Manipulation)
- Retraining SmolVLA with domain-specific data
- Expanding task vocabulary & multi-object reasoning
- Improving observation pipelines (multi-camera, depth, tactile feedback)
- More refined continuous-control action spaces
- Tool-use and compliant manipulation
Mid-Term Vision (Embodied AI)
- Autonomous planning & proactive behavior
- Emotional state machines and personality consistency
- Gesture recognition, gaze following, and richer multi-modal interaction
- Social robotics behaviors like joint attention, haptics, and shared goals
Long-Term Research Directions
- World models, sim-to-real transfer, cross-embodiment generalization
- Hierarchical reasoning, chain-of-thought for robotics, causal inference
- Robust safety systems: constrained policies, fallback verification, human-in-the-loop
- Multi-agent robots, teamwork, emergent behaviors, and swarms