Text -> Robot Co-Design with Gemini + autoResearch
Open-source robotics workspace that turns a task prompt into grounded robot designs, simulated artifacts, and an autoresearch-style morphology/controller optimization loop.
What it does
We built a local-first robotics design system that starts from a natural-language task, grounds that task in motion references, generates multiple robot embodiments, compiles engineering-facing artifacts, and then launches an autonomous co-design loop to improve both the robot body and controller over time. Instead of stopping at concept generation, the system carries the task all the way through typed task interpretation, robot schema generation, render/MuJoCo artifacts, human approval checkpoints, and iterative training/evaluation.
In the live flow, a user describes a task such as carrying a box upstairs or climbing a wall. The backend first converts that prompt into a structured task representation, retrieves motion references (falling back to an alternate source when retrieval is weak), proposes multiple robot designs, scores and reranks them against task-specific hardrails, and then starts an autoresearch-style loop that edits the morphology and controller code, runs simulation trials, and keeps the best-performing iteration.
Why this is novel
The novelty is not just using generative AI to describe robots. We use the Gemini model family as part of a robotics compiler-and-search stack:
- Gemini 2.5 family models are used for low-latency structured reasoning: task parsing, motion-reference planning, and compact robot schema generation.
- Gemini Robotics-ER 1.6 is used as an embodied evaluator over rollout video, turning task completion into a machine-usable fitness signal instead of relying only on kinematic error.
- An autoresearch-style agent loop then treats robot design as an iterative optimization problem over both embodiment and control, rather than a one-shot generation problem.
The result is a system that combines structured generation, motion grounding, learned control, and agentic optimization in one loop.
Creative use of the Gemini model family
We deliberately split responsibilities across Gemini models instead of using a single general-purpose model for everything.
1. Gemini 2.5 family for structured robotics reasoning
We use Gemini 2.5 for the high-entropy reasoning steps where the output still needs to be machine-actionable:
- task ingestion from free-form user prompts
- generation of search queries for human motion references
- structured robot candidate generation
- drafting program.md, the research agenda that conditions the autonomous evolution loop
A key implementation detail is that robot generation does not ask the model to emit the full internal schema directly. The provider-facing output is intentionally compact, then expanded deterministically into the internal robot representation. This lets us keep the generative step creative while keeping the downstream system typed and verifiable.
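As a rough illustration of that split, the provider-facing proposal can be a compact Pydantic model that is expanded deterministically into the internal representation. The field names below are hypothetical, not the project's actual schema:

```python
# Hypothetical sketch of "compact proposal -> internal robot spec" expansion.
# Field names are illustrative, not the project's actual schema.
from pydantic import BaseModel

class CompactProposal(BaseModel):
    """What the model is asked to emit: small, typed, easy to validate."""
    name: str
    limb_count: int
    limb_lengths_m: list[float]
    actuator_type: str  # e.g. "torque" or "position"

class RobotSpec(BaseModel):
    """Internal representation consumed by the render/MuJoCo compile path."""
    name: str
    links: list[dict]
    joints: list[dict]

def expand(proposal: CompactProposal) -> RobotSpec:
    """Deterministically expand the compact proposal into a full spec."""
    links, joints = [], []
    for i, length in enumerate(proposal.limb_lengths_m[: proposal.limb_count]):
        links.append({"id": f"link_{i}", "length_m": length})
        joints.append({"id": f"joint_{i}", "type": "hinge",
                       "actuation": proposal.actuator_type})
    return RobotSpec(name=proposal.name, links=links, joints=joints)
```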
2. Gemini Robotics-ER 1.6 for embodied evaluation
We use Gemini Robotics-ER 1.6 as a task-success oracle over simulation video. After each trial, the system renders a rollout video and asks ER 1.6 to judge whether the robot actually completed the task. This gives us a semantic success signal that complements raw tracking error.
That is important because in robotics, “looks like the reference motion” and “actually completes the task” are not the same objective. ER 1.6 gives us a way to score task completion directly from visual rollout evidence.
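A minimal sketch of what this judgement call can look like with the google-genai SDK; the model id, prompt, and response schema below are placeholders rather than the project's exact code:

```python
# Sketch only: the model id and schema are assumptions, not the project's code.
from google import genai
from pydantic import BaseModel

class SuccessJudgement(BaseModel):
    success_probability: float  # 0.0-1.0, later folded into the fitness score
    rationale: str

client = genai.Client()  # reads GEMINI_API_KEY from the environment

def judge_rollout(video_path: str, task: str) -> SuccessJudgement:
    # Upload the rendered rollout; longer videos may need a short wait until the
    # file finishes server-side processing (poll client.files.get if needed).
    video = client.files.upload(file=video_path)
    response = client.models.generate_content(
        model="gemini-robotics-er",  # placeholder: use the ER model id available to you
        contents=[video,
                  f"Did the robot in this video complete the task: '{task}'? "
                  "Respond with a success probability and a short rationale."],
        config={"response_mime_type": "application/json",
                "response_schema": SuccessJudgement},
    )
    return response.parsed
```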
Model training: autoResearch + MorphologyAgnosticGNN
The core optimization loop is inspired by Karpathy’s autoresearch, but adapted from language-model training to robot co-design.
Instead of editing only a training script, our agent edits two files:
- train.py: controller architecture and training hyperparameters
- morphology_factory.py: morphology sampling and body-generation logic
This means the loop can change both how the robot is controlled and what robot gets built.
Controller
The trainable policy is a MorphologyAgnosticGNN, a graph neural network controller whose weights are shared across robot morphologies. For each generated body, we build a graph from the URDF/MJCF structure, encode node and edge features, and predict torque outputs in a morphology-conditioned way.
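An illustrative PyTorch Geometric version of such a controller; the layer sizes and feature choices here are assumptions, not the exact architecture:

```python
# Illustrative morphology-agnostic GNN torque policy (not the exact architecture).
import torch
from torch import nn
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

class MorphologyAgnosticGNN(nn.Module):
    def __init__(self, node_dim: int = 16, hidden: int = 64):
        super().__init__()
        self.conv1 = GCNConv(node_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.torque_head = nn.Linear(hidden, 1)  # one torque per joint node

    def forward(self, data: Data) -> torch.Tensor:
        # data.x: per-node features (joint state, link geometry, ...)
        # data.edge_index: connectivity parsed from the URDF/MJCF tree
        h = torch.relu(self.conv1(data.x, data.edge_index))
        h = torch.relu(self.conv2(h, data.edge_index))
        return self.torque_head(h).squeeze(-1)  # shape: [num_nodes]
```

Because the weights act on per-node features and graph structure rather than a fixed joint ordering, the same parameters can be applied to any sampled morphology.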
Supervision objective
Inside each trial, the controller is trained by imitation learning. We retarget the reference motion to the current morphology, generate target joint trajectories, and then compute a PD-derived supervisory torque target. The GNN is optimized with MSE loss between predicted torques and that target torque sequence.
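A condensed sketch of one inner training step under those assumptions; the PD gains and tensor shapes are illustrative:

```python
# Sketch of the imitation-learning inner step; gains and shapes are illustrative.
import torch
import torch.nn.functional as F

def pd_target_torque(q, qd, q_ref, qd_ref, kp=50.0, kd=2.0):
    """PD-derived supervisory torque toward the retargeted reference trajectory."""
    return kp * (q_ref - q) + kd * (qd_ref - qd)

def imitation_step(policy, graph, q, qd, q_ref, qd_ref, optimizer):
    tau_pred = policy(graph)                           # predicted joint torques
    tau_target = pd_target_torque(q, qd, q_ref, qd_ref)
    loss = F.mse_loss(tau_pred, tau_target)            # MSE against the PD target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```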
Outer-loop objective
At the trial level, the agent is optimizing a composite robotics objective:
fitness = 0.6 * (1 - tracking_error) + 0.4 * er16_success_probability
This balances:
- kinematic fidelity: does the robot reproduce the retargeted motion?
- task completion: does the rollout video appear to succeed on the task?
The agent then keeps the best iteration, records diffs, and uses the accumulated history to decide what to try next.
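In code, the per-trial scoring and best-iteration bookkeeping amount to roughly the following (the bookkeeping fields are illustrative):

```python
# Direct transcription of the composite fitness; bookkeeping fields are illustrative.
def fitness(tracking_error: float, er16_success_probability: float) -> float:
    return 0.6 * (1.0 - tracking_error) + 0.4 * er16_success_probability

best = {"fitness": float("-inf"), "iteration": None}

def record_trial(iteration: int, tracking_error: float, success_prob: float) -> float:
    score = fitness(tracking_error, success_prob)
    if score > best["fitness"]:
        best.update(fitness=score, iteration=iteration)  # keep the best iteration
    return score
```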
Video ingestion and parsing
The system is grounded in video, not just text.
Ingest pipeline
Given a task prompt, the backend (see the sketch after this list):
- uses Gemini 2.5 to convert the prompt into a typed task plan
- generates targeted YouTube search queries for real human analogs of the task
- filters low-quality or irrelevant candidate videos
- reviews candidate videos and selects the best reference
- dispatches GVHMR-based motion extraction when available
- falls back to DROID-style structured trajectory retrieval if YouTube/GVHMR is unavailable or weak
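A condensed sketch of that pipeline; every helper here (plan_task, search_youtube, is_relevant, pick_best, extract_motion_gvhmr, retrieve_droid) is a hypothetical stand-in for the project's actual modules:

```python
# Hypothetical sketch of the ingest pipeline and its fallback path.
class MotionExtractionError(Exception):
    """Raised when GVHMR extraction fails or produces a weak result."""

def build_motion_reference(prompt: str):
    task = plan_task(prompt)                                      # Gemini 2.5 -> typed task plan
    candidates = [v for q in task.reference_queries for v in search_youtube(q)]
    candidates = [v for v in candidates if is_relevant(v, task)]  # drop low-quality videos
    best_video = pick_best(candidates, task)                      # model-assisted review
    if best_video is not None:
        try:
            return extract_motion_gvhmr(best_video)               # preferred: GVHMR extraction
        except MotionExtractionError:
            pass
    return retrieve_droid(task)                                   # fallback: DROID-style retrieval
```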
Why this matters
This gives the system a concrete motion prior before robot generation even begins. Rather than generating morphology from language alone, it first asks: what does successful task execution look like in motion space?
That motion reference then conditions both robot design and controller training.
Agent orchestration
The project also includes an explicit orchestration layer rather than a single monolithic generation step.
program.md as an approval boundary
Before the evolution loop starts, the system drafts a program.md research agenda. This document captures:
- what morphology directions to explore
- what controller changes to try
- what failure modes to avoid
- how progress should be measured
A human can approve or edit this once, and then the loop runs autonomously.
Evolution loop
After approval, the orchestrator:
- reads program.md
- edits train.py and/or morphology_factory.py
- dispatches a trial to Modal
- trains the controller for the current sampled morphology
- generates replay video and artifacts
- scores the result with tracking error + ER 1.6 success probability
- logs the iteration and updates the best result
This gives us a real agentic research loop over robot design, not just a static generator.
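Condensed into Python, one iteration of the loop looks roughly like this; the agent interface, the Modal dispatch helper, and the result fields are hypothetical stand-ins for the real infrastructure:

```python
# Hypothetical sketch of one evolution-loop iteration.
from pathlib import Path

def run_iteration(i: int, agent, history: list[dict]) -> dict:
    agenda = Path("program.md").read_text()
    # The agent proposes edits to the two editable files, given the agenda + history.
    edits = agent.propose_edits(agenda, history,
                                files=["train.py", "morphology_factory.py"])
    edits.apply()
    # Remote trial: sample a morphology, train the GNN controller, render a replay.
    result = run_trial_on_modal(train_script="train.py",
                                morphology_script="morphology_factory.py")
    score = 0.6 * (1 - result["tracking_error"]) + 0.4 * result["er16_success"]
    record = {"iteration": i, "fitness": score, "diff": edits.diff(),
              "video": result["replay_video_url"]}
    history.append(record)  # logged to Supabase in the real system
    return record
```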
System architecture
High-level flow:
Prompt -> structured task -> motion reference -> robot schemas -> hardrails/reranking -> render + telemetry + validation -> autoresearch loop -> best iteration
The important detail is that every stage has typed interfaces and deterministic post-processing around the generative steps. Generation is used where ambiguity is useful; deterministic logic is used where correctness and reproducibility matter.
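That pattern can be sketched generically as a deterministic wrapper around a generative call; the retry budget and validator signature below are illustrative, not the project's exact guardrails:

```python
# Sketch of wrapping a generative step in deterministic validation and retries.
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")

def generate_validated(generate: Callable[[], T],
                       validate: Callable[[T], Tuple[bool, str]],
                       max_attempts: int = 3) -> T:
    last_error = "no attempts made"
    for _ in range(max_attempts):
        candidate = generate()                 # creative, ambiguous step
        ok, last_error = validate(candidate)   # deterministic, typed check
        if ok:
            return candidate
    raise ValueError(f"generation failed validation: {last_error}")
```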
Technical stack
- FastAPI for backend orchestration and APIs
- Next.js for the interactive workspace UI
- Gemini 2.5 family for structured task and design reasoning
- Gemini Robotics-ER 1.6 for rollout-video success evaluation
- GVHMR for human motion extraction
- DROID fallback retrieval for structured motion references
- MuJoCo for simulation and replay generation
- PyTorch + PyTorch Geometric for MorphologyAgnosticGNN
- Modal for per-trial GPU execution
- Supabase for artifacts and iteration state
Current limitations
This version is intentionally ambitious but still incomplete in a few places:
- the morphology space is still parametric rather than fully free-form
- the canonical IR/export path is still less mature than the proposal/runtime path
- the evolution loop depends on external infrastructure such as Modal and artifact storage
- the current rollout/evaluation path is stronger as a research loop than as a production robotics training stack
But even in this form, the system demonstrates a compelling pattern: use multimodal Gemini models for structured embodied reasoning, then wrap them in deterministic compilers, simulation, and an autoresearch-style optimization loop to search over robot body + controller jointly.
Bottom line
This project is a text-to-robot-co-design system. Gemini helps interpret tasks, ground them in video, generate structured embodiments, and evaluate whether rollouts actually succeed. autoResearch turns that into an iterative search process over robot morphology and controller code. The result is a more technical and more robotics-native use of generative AI: not just generating descriptions, but driving a closed loop that proposes, trains, simulates, scores, and improves robot designs.
Built With
- fastapi
- gemini
- mujoco
- nextjs
- pydantic
- python
- react
- tailwindcss
- typescript