Inspiration
What if AI could reach beyond the screen and touch the physical world? I wanted to explore embodied AI through the metaphor of a possessed robotic arm, a spectral intelligence trapped in silicon, manifesting through mechanical limbs. The Halloween theme wasn’t just aesthetic; ghosts are disembodied entities seeking physical form, making it the perfect metaphor for an LLM controlling a robot. Mortis became my exploration of how abstract thoughts become concrete actions, blurring the line between predictable robotics and the unpredictable nature of a spectral intelligence.
What it does
Mortis is an interactive AI experience where users chat with a mischievous Halloween spirit that responds through both text and physical gestures. Powered by Google's Gemini API, Mortis analyzes user messages and generates in-character responses paired with emotional moods and physical actions.
A SeeedStudio SO101 robotic arm—controlled via Hugging Face's LeRobot framework—executes synchronized gestures such as waving, pointing, grabbing, and dropping, all reflecting Mortis’s personality. The system includes a Gradio web interface with Halloween theming, multi-modal capabilities (text, voice, vision), and advanced manipulation through SmolVLA vision-language-action models for complex real-world tasks.
How I built it
Core Architecture (3-Layer Design)
The system is built on three interconnected layers:
1. The Brain (LLM Intelligence)
I used Google Gemini with structured function calling. Instead of parsing free-form text, I constrained the model to return JSON with three fields:
- message: in-character dialogue
- mood: one of 8 predefined emotional states
- gesture: one of 6 defined physical actions
This eliminated parsing errors and ensured reliable robot control.
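A minimal sketch of that contract, assuming the google-generativeai SDK's JSON output mode; the specific mood and gesture names below are placeholders for the project's 8 moods and 6 gestures, not the actual vocabulary:

```python
import json
import google.generativeai as genai  # assumes genai.configure(api_key=...) has run

MOODS = {"playful", "spooky", "grumpy", "curious",
         "excited", "sleepy", "mischievous", "dramatic"}      # 8 states (names assumed)
GESTURES = {"wave", "point", "grab", "drop", "nod", "shake"}  # 6 actions (names assumed)

SYSTEM = (
    "You are Mortis, a mischievous Halloween spirit. Reply ONLY with JSON: "
    '{"message": <in-character dialogue>, '
    f'"mood": one of {sorted(MOODS)}, "gesture": one of {sorted(GESTURES)}}}'
)

model = genai.GenerativeModel(
    "gemini-2.5-flash",
    system_instruction=SYSTEM,
    generation_config={"response_mime_type": "application/json"},
)

def ask_mortis(user_text: str) -> dict:
    """Return a validated {message, mood, gesture} dict."""
    reply = json.loads(model.generate_content(user_text).text)
    if reply["mood"] not in MOODS or reply["gesture"] not in GESTURES:
        raise ValueError(f"Out-of-vocabulary reply: {reply}")
    return reply
```

Constraining the vocabulary in the prompt and re-validating in code means a malformed reply fails fast instead of ever reaching the arm.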
2. The Soul (Intent Routing)
I developed a custom IntentRouter class that validates and interprets Gemini outputs. It separates conversational intents (simple gestures) from manipulation intents (complex VLA tasks). Manipulation requests are validated against 6 trained SmolVLA tasks. Any invalid or ambiguous command safely degrades into a fallback gesture.
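In rough outline, a router like this could look as follows; the task names and fallback gesture are hypothetical stand-ins for the project's actual values:

```python
from dataclasses import dataclass

FALLBACK_GESTURE = "wave"  # safe default (assumed)
GESTURES = {"wave", "point", "grab", "drop", "nod", "shake"}  # conversational gestures
VLA_TASKS = {"pick_place_candy", "stack_cups", "open_box",
             "press_button", "hand_over", "sort_objects"}     # 6 tasks (names hypothetical)

@dataclass
class Intent:
    kind: str    # "gesture" or "manipulation"
    action: str  # validated gesture or SmolVLA task name

class IntentRouter:
    """Validate LLM output and route it to the matching executor."""

    def route(self, llm_output: dict) -> Intent:
        requested = llm_output.get("gesture", "")
        # Manipulation intents must match a trained SmolVLA task exactly.
        if requested in VLA_TASKS:
            return Intent("manipulation", requested)
        # Conversational intents map to simple scripted gestures.
        if requested in GESTURES:
            return Intent("gesture", requested)
        # Anything invalid or ambiguous degrades to a safe fallback.
        return Intent("gesture", FALLBACK_GESTURE)
```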
3. The Body (Robot Control)
I created a MortisArm class to control a SeeedStudio SO101 6-DOF arm through LeRobot. Gestures are implemented via a GESTURES dictionary mapping names to precise servo sequences with timing logic.
SmolVLA manipulation tasks use trained encoders and action policies for object interaction.
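The gesture table's shape might look like this minimal sketch, assuming joint-angle waypoints in degrees with per-step hold times; the values, joint ordering, and the move_to method are illustrative, not the calibrated SO101 sequences:

```python
import time

GESTURES = {
    # name: [(six_joint_angles_deg, hold_seconds), ...]
    "wave": [
        ((0, -30, 60, 0, 45, 0), 0.4),
        ((0, -30, 60, 0, -45, 0), 0.4),
        ((0, -30, 60, 0, 45, 0), 0.4),
    ],
    "point": [((20, -10, 40, 0, 0, 0), 1.0)],
}

def play_gesture(arm, name: str) -> None:
    """Step through a gesture's waypoints with timed holds."""
    for angles, hold in GESTURES[name]:
        arm.move_to(angles)  # hypothetical MortisArm method
        time.sleep(hold)
```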
Technical Stack
- Language: Python 3.12+ with type hints
- Package Manager: uv for fast dependency resolution
- Web Framework: Gradio 5.49+ with custom CSS + base64 Halloween background
- Robotics: LeRobot 0.4.0+
- AI Models: Gemini 2.5 Flash (LLM), SmolVLA for VLA manipulation
- Build System: Makefile with 15+ commands (install, run, calibrate, train…)
- Environment: python-dotenv for configuration
- Architecture: Hybrid async execution (threads + LeRobot async API); a minimal sketch follows this list
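To make the hybrid-execution point concrete, here is a minimal sketch of the thread hand-off idea: gestures are queued to a daemon worker so Gradio callbacks and LLM calls never block on servo motion. All names here are illustrative rather than the project's actual code.

```python
import queue
import threading

class GestureExecutor:
    """Run blocking servo motion on a worker thread, off the UI path."""

    def __init__(self, arm):
        self.arm = arm
        self.jobs: queue.Queue = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self) -> None:
        while True:
            name = self.jobs.get()       # blocks until a gesture arrives
            self.arm.play_gesture(name)  # blocking servo motion (hypothetical)
            self.jobs.task_done()

    def submit(self, name: str) -> None:
        """Enqueue a gesture without blocking a Gradio callback."""
        self.jobs.put(name)
```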
Challenges I ran into
- Robot calibration was extremely sensitive and required iterating physical offsets.
- Synchronization across UI updates, LLM responses, gestures, and VLA tasks created race conditions that forced me to design a hybrid async executor.
- SmolVLA integration was difficult: mismatched observation formats, discretization issues, 100–200ms inference latency, and camera alignment problems.
- Intent routing needed strict validation to prevent the LLM from generating unsafe or undefined gestures.
- Voice integration uncovered unexpected complexity in audio formats, latency, and UX flow.
- Documentation debt grew until I disciplined myself to track progress with structured notes, specs, and user guides.
Accomplishments I’m proud of
- I built a fully working embodied-AI character with synchronized dialogue, emotion, gestures, and manipulation.
- I designed a clean 3-layer architecture (Brain–Soul–Body) that separates reasoning, routing, and actuation.
- I implemented a spec-driven development workflow using Kiro to automate boilerplate code and enforce consistency.
- I created a safe fallback system preventing invalid robot actions.
- I integrated SmolVLA end-to-end with real hardware.
- I wrapped everything in a polished Halloween-themed Gradio UI.
What I learned
- Embodied AI breaks assumptions: language, vision, and control need synchronized timing.
- Structured outputs (JSON) dramatically improve LLM reliability.
- VLA models require precise observation pipelines, not just good weights.
- Async robotics is harder than async web code—timing matters.
- Specs + Kiro accelerate development and reduce design drift.
- Real hardware forces constraints that don’t exist in simulation.
- UX matters even in robotics—humans want personality, not raw motors.
What’s next for Mortis
Near-Term Improvements (VLA & Manipulation)
- Retraining SmolVLA with domain-specific data
- Expanding task vocabulary & multi-object reasoning
- Improving observation pipelines (multi-camera, depth, tactile feedback)
- More refined continuous-control action spaces
- Tool-use and compliant manipulation
Mid-Term Vision (Embodied AI)
- Autonomous planning & proactive behavior
- Emotional state machines and personality consistency
- Gesture recognition, gaze following, and richer multi-modal interaction
- Social robotics behaviors like joint attention, haptics, and shared goals
Long-Term Research Directions
- World models, sim-to-real transfer, cross-embodiment generalization
- Hierarchical reasoning, chain-of-thought for robotics, causal inference
- Robust safety systems: constrained policies, fallback verification, human-in-the-loop
- Multi-agent robots, teamwork, emergent behaviors, and swarms