Text -> Robot Co-Design with Gemini + autoResearch
Open-source robotics workspace that turns a task prompt into grounded robot designs, simulated artifacts, and an autoresearch-style morphology/controller optimization loop.
What it does
We built a local-first robotics design system that starts from a natural-language task, grounds that task in motion references, generates multiple robot embodiments, compiles engineering-facing artifacts, and then launches an autonomous co-design loop to improve both the robot body and controller over time. Instead of stopping at concept generation, the system carries the task all the way through typed task interpretation, robot schema generation, render/MuJoCo artifacts, human approval checkpoints, and iterative training/evaluation.
In the live flow, a user describes a task such as carrying a box upstairs or climbing a wall. The backend first converts that prompt into a structured task representation, retrieves motion references (falling back to an alternate source when retrieval is weak), proposes multiple robot designs, scores and reranks them against task-specific hardrails, and then starts an autoresearch-style loop that edits the morphology and controller code, runs simulation trials, and keeps the best-performing iteration.
Why this is novel
The novelty is not just using generative AI to describe robots. We use the Gemini model family as part of a robotics compiler-and-search stack:
- Gemini 2.5 family models are used for low-latency structured reasoning: task parsing, motion-reference planning, and compact robot schema generation.
- Gemini Robotics-ER 1.6 is used as an embodied evaluator over rollout video, turning task completion into a machine-usable fitness signal instead of relying only on kinematic error.
- An autoresearch-style agent loop then treats robot design as an iterative optimization problem over both embodiment and control, rather than a one-shot generation problem.
The result is a system that combines structured generation, motion grounding, learned control, and agentic optimization in one loop.
Creative use of the Gemini model family
We deliberately split responsibilities across Gemini models instead of using a single general-purpose model for everything.
1. Gemini 2.5 family for structured robotics reasoning
We use Gemini 2.5 for the high-entropy reasoning steps where the output still needs to be machine-actionable:
- task ingestion from free-form user prompts
- generation of search queries for human motion references
- structured robot candidate generation
- drafting program.md, the research agenda that conditions the autonomous evolution loop
A key implementation detail is that robot generation does not ask the model to emit the full internal schema directly. The provider-facing output is intentionally compact, then expanded deterministically into the internal robot representation. This lets us keep the generative step creative while keeping the downstream system typed and verifiable.
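As a rough illustration of that split, the provider-facing proposal can be a compact Pydantic model that is expanded deterministically into the internal representation. The field names below are hypothetical, not the project's actual schema:

```python
# Hypothetical sketch of "compact proposal -> internal robot spec" expansion.
# Field names are illustrative, not the project's actual schema.
from pydantic import BaseModel

class CompactProposal(BaseModel):
    """What the model is asked to emit: small, typed, easy to validate."""
    name: str
    limb_count: int
    limb_lengths_m: list[float]
    actuator_type: str  # e.g. "torque" or "position"

class RobotSpec(BaseModel):
    """Internal representation consumed by the render/MuJoCo compile path."""
    name: str
    links: list[dict]
    joints: list[dict]

def expand(proposal: CompactProposal) -> RobotSpec:
    """Deterministically expand the compact proposal into a full spec."""
    links, joints = [], []
    for i, length in enumerate(proposal.limb_lengths_m[: proposal.limb_count]):
        links.append({"id": f"link_{i}", "length_m": length})
        joints.append({"id": f"joint_{i}", "type": "hinge",
                       "actuation": proposal.actuator_type})
    return RobotSpec(name=proposal.name, links=links, joints=joints)
```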
2. Gemini Robotics-ER 1.6 for embodied evaluation
We use Gemini Robotics-ER 1.6 as a task-success oracle over simulation video. After each trial, the system renders a rollout video and asks ER 1.6 to judge whether the robot actually completed the task. This gives us a semantic success signal that complements raw tracking error.
That is important because in robotics, “looks like the reference motion” and “actually completes the task” are not the same objective. ER 1.6 gives us a way to score task completion directly from visual rollout evidence.
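A minimal sketch of what this judgement call can look like with the google-genai SDK; the model id, prompt, and response schema below are placeholders rather than the project's exact code:

```python
# Sketch only: the model id and schema are assumptions, not the project's code.
from google import genai
from pydantic import BaseModel

class SuccessJudgement(BaseModel):
    success_probability: float  # 0.0-1.0, later folded into the fitness score
    rationale: str

client = genai.Client()  # reads GEMINI_API_KEY from the environment

def judge_rollout(video_path: str, task: str) -> SuccessJudgement:
    # Upload the rendered rollout; longer videos may need a short wait until the
    # file finishes server-side processing (poll client.files.get if needed).
    video = client.files.upload(file=video_path)
    response = client.models.generate_content(
        model="gemini-robotics-er",  # placeholder: use the ER model id available to you
        contents=[video,
                  f"Did the robot in this video complete the task: '{task}'? "
                  "Respond with a success probability and a short rationale."],
        config={"response_mime_type": "application/json",
                "response_schema": SuccessJudgement},
    )
    return response.parsed
```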
Model training: autoResearch + MorphologyAgnosticGNN
The core optimization loop is inspired by Karpathy’s autoresearch, but adapted from language-model training to robot co-design.
Instead of editing only a training script, our agent edits two files:
- train.py: controller architecture and training hyperparameters
- morphology_factory.py: morphology sampling and body-generation logic
This means the loop can change both how the robot is controlled and what robot gets built.
Controller
The trainable policy is a MorphologyAgnosticGNN, a graph neural network controller whose weights are shared across robot morphologies. For each generated body, we build a graph from the URDF/MJCF structure, encode node and edge features, and predict torque outputs in a morphology-conditioned way.
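An illustrative PyTorch Geometric version of such a controller; the layer sizes and feature choices here are assumptions, not the exact architecture:

```python
# Illustrative morphology-agnostic GNN torque policy (not the exact architecture).
import torch
from torch import nn
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

class MorphologyAgnosticGNN(nn.Module):
    def __init__(self, node_dim: int = 16, hidden: int = 64):
        super().__init__()
        self.conv1 = GCNConv(node_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.torque_head = nn.Linear(hidden, 1)  # one torque per joint node

    def forward(self, data: Data) -> torch.Tensor:
        # data.x: per-node features (joint state, link geometry, ...)
        # data.edge_index: connectivity parsed from the URDF/MJCF tree
        h = torch.relu(self.conv1(data.x, data.edge_index))
        h = torch.relu(self.conv2(h, data.edge_index))
        return self.torque_head(h).squeeze(-1)  # shape: [num_nodes]
```

Because the weights act on per-node features and graph structure rather than a fixed joint ordering, the same parameters can be applied to any sampled morphology.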
Supervision objective
Inside each trial, the controller is trained by imitation learning. We retarget the reference motion to the current morphology, generate target joint trajectories, and then compute a PD-derived supervisory torque target. The GNN is optimized with MSE loss between predicted torques and that target torque sequence.
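A condensed sketch of one inner training step under those assumptions; the PD gains and tensor shapes are illustrative:

```python
# Sketch of the imitation-learning inner step; gains and shapes are illustrative.
import torch
import torch.nn.functional as F

def pd_target_torque(q, qd, q_ref, qd_ref, kp=50.0, kd=2.0):
    """PD-derived supervisory torque toward the retargeted reference trajectory."""
    return kp * (q_ref - q) + kd * (qd_ref - qd)

def imitation_step(policy, graph, q, qd, q_ref, qd_ref, optimizer):
    tau_pred = policy(graph)                           # predicted joint torques
    tau_target = pd_target_torque(q, qd, q_ref, qd_ref)
    loss = F.mse_loss(tau_pred, tau_target)            # MSE against the PD target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```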
Outer-loop objective
At the trial level, the agent is optimizing a composite robotics objective:
fitness = 0.6 * (1 - tracking_error) + 0.4 * er16_success_probability
This balances:
- kinematic fidelity: does the robot reproduce the retargeted motion?
- task completion: does the rollout video appear to succeed on the task?
The agent then keeps the best iteration, records diffs, and uses the accumulated history to decide what to try next.
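In code, the per-trial scoring and best-iteration bookkeeping amount to roughly the following (the bookkeeping fields are illustrative):

```python
# Direct transcription of the composite fitness; bookkeeping fields are illustrative.
def fitness(tracking_error: float, er16_success_probability: float) -> float:
    return 0.6 * (1.0 - tracking_error) + 0.4 * er16_success_probability

best = {"fitness": float("-inf"), "iteration": None}

def record_trial(iteration: int, tracking_error: float, success_prob: float) -> float:
    score = fitness(tracking_error, success_prob)
    if score > best["fitness"]:
        best.update(fitness=score, iteration=iteration)  # keep the best iteration
    return score
```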
Video ingestion and parsing
The system is grounded in video, not just text.
Ingest pipeline
Given a task prompt, the backend (see the sketch after this list):
- uses Gemini 2.5 to convert the prompt into a typed task plan
- generates targeted YouTube search queries for real human analogs of the task
- filters low-quality or irrelevant candidate videos
- reviews candidate videos and selects the best reference
- dispatches GVHMR-based motion extraction when available
- falls back to DROID-style structured trajectory retrieval if YouTube/GVHMR is unavailable or weak
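A condensed sketch of that pipeline; every helper here (plan_task, search_youtube, is_relevant, pick_best, extract_motion_gvhmr, retrieve_droid) is a hypothetical stand-in for the project's actual modules:

```python
# Hypothetical sketch of the ingest pipeline and its fallback path.
class MotionExtractionError(Exception):
    """Raised when GVHMR extraction fails or produces a weak result."""

def build_motion_reference(prompt: str):
    task = plan_task(prompt)                                      # Gemini 2.5 -> typed task plan
    candidates = [v for q in task.reference_queries for v in search_youtube(q)]
    candidates = [v for v in candidates if is_relevant(v, task)]  # drop low-quality videos
    best_video = pick_best(candidates, task)                      # model-assisted review
    if best_video is not None:
        try:
            return extract_motion_gvhmr(best_video)               # preferred: GVHMR extraction
        except MotionExtractionError:
            pass
    return retrieve_droid(task)                                   # fallback: DROID-style retrieval
```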
Why this matters
This gives the system a concrete motion prior before robot generation even begins. Rather than generating morphology from language alone, it first asks: what does successful task execution look like in motion space?
That motion reference then conditions both robot design and controller training.
Agent orchestration
The project also includes an explicit orchestration layer rather than a single monolithic generation step.
program.md as an approval boundary
Before the evolution loop starts, the system drafts a program.md research agenda. This document captures:
- what morphology directions to explore
- what controller changes to try
- what failure modes to avoid
- how progress should be measured
A human can approve or edit this once, and then the loop runs autonomously.
Evolution loop
After approval, the orchestrator:
- reads program.md
- edits train.py and/or morphology_factory.py
- dispatches a trial to Modal
- trains the controller for the current sampled morphology
- generates replay video and artifacts
- scores the result with tracking error + ER 1.6 success probability
- logs the iteration and updates the best result
This gives us a real agentic research loop over robot design, not just a static generator.
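Condensed into Python, one iteration of the loop looks roughly like this; the agent interface, the Modal dispatch helper, and the result fields are hypothetical stand-ins for the real infrastructure:

```python
# Hypothetical sketch of one evolution-loop iteration.
from pathlib import Path

def run_iteration(i: int, agent, history: list[dict]) -> dict:
    agenda = Path("program.md").read_text()
    # The agent proposes edits to the two editable files, given the agenda + history.
    edits = agent.propose_edits(agenda, history,
                                files=["train.py", "morphology_factory.py"])
    edits.apply()
    # Remote trial: sample a morphology, train the GNN controller, render a replay.
    result = run_trial_on_modal(train_script="train.py",
                                morphology_script="morphology_factory.py")
    score = 0.6 * (1 - result["tracking_error"]) + 0.4 * result["er16_success"]
    record = {"iteration": i, "fitness": score, "diff": edits.diff(),
              "video": result["replay_video_url"]}
    history.append(record)  # logged to Supabase in the real system
    return record
```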
System architecture
High-level flow:
Prompt -> structured task -> motion reference -> robot schemas -> hardrails/reranking -> render + telemetry + validation -> autoresearch loop -> best iteration
The important detail is that every stage has typed interfaces and deterministic post-processing around the generative steps. Generation is used where ambiguity is useful; deterministic logic is used where correctness and reproducibility matter.
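That pattern can be sketched generically as a deterministic wrapper around a generative call; the retry budget and validator signature below are illustrative, not the project's exact guardrails:

```python
# Sketch of wrapping a generative step in deterministic validation and retries.
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")

def generate_validated(generate: Callable[[], T],
                       validate: Callable[[T], Tuple[bool, str]],
                       max_attempts: int = 3) -> T:
    last_error = "no attempts made"
    for _ in range(max_attempts):
        candidate = generate()                 # creative, ambiguous step
        ok, last_error = validate(candidate)   # deterministic, typed check
        if ok:
            return candidate
    raise ValueError(f"generation failed validation: {last_error}")
```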
Technical stack
- FastAPI for backend orchestration and APIs
- Next.js for the interactive workspace UI
- Gemini 2.5 family for structured task and design reasoning
- Gemini Robotics-ER 1.6 for rollout-video success evaluation
- GVHMR for human motion extraction
- DROID fallback retrieval for structured motion references
- MuJoCo for simulation and replay generation
- PyTorch + PyTorch Geometric for MorphologyAgnosticGNN
- Modal for per-trial GPU execution
- Supabase for artifacts and iteration state
Current limitations
This version is intentionally ambitious but still incomplete in a few places:
- the morphology space is still parametric rather than fully free-form
- the canonical IR/export path is still less mature than the proposal/runtime path
- the evolution loop depends on external infrastructure such as Modal and artifact storage
- the current rollout/evaluation path is stronger as a research loop than as a production robotics training stack
But even in this form, the system demonstrates a compelling pattern: use multimodal Gemini models for structured embodied reasoning, then wrap them in deterministic compilers, simulation, and an autoresearch-style optimization loop to search over robot body + controller jointly.
Bottom line
This project is a text-to-robot-co-design system. Gemini helps interpret tasks, ground them in video, generate structured embodiments, and evaluate whether rollouts actually succeed. autoResearch turns that into an iterative search process over robot morphology and controller code. The result is a more technical and more robotics-native use of generative AI: not just generating descriptions, but driving a closed loop that proposes, trains, simulates, scores, and improves robot designs.
Built With
- fastapi
- gemini
- mujoco
- nextjs
- pydantic
- python
- react
- tailwindcss
- typescript