Inspiration
Robots are still in the pre-GPT era. They're trained task-by-task, don't generalize, and are expensive to teach. Language models showed us that large, world-scale pretraining is the key to generalization. We borrowed that approach, using the emergent world model of an internet-scale pretrained video diffusion model to control robots in a general and self-improving way.
What it does
Daydreamer imagines itself succeeding at a task in video, executes the plan in the real world, and self-improves from VLM feedback.
- Imagines a short video of a successful outcome (video diffusion world model).
- Translates the imagined frames into robot joint poses (we trained a video-to-pose model).
- Executes via inverse kinematics and low-level control.
- Evaluates the outcome and keeps only successful rollouts as new training data for the world model.
How we built it
World model: an open-source video diffusion model conditioned on the current frame and a text instruction to generate short "success" clips, then RL-tuned so that what it imagines corresponds to actions that work in the real world.
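A minimal sketch of the "imagine a success clip" step, assuming an image+text-conditioned video diffusion pipeline from diffusers (the I2VGen-XL checkpoint below is an illustrative stand-in, not necessarily the exact model used):

```python
# Sketch: condition on the current camera frame plus a text instruction and
# sample a short imagined "success" clip. The checkpoint is an illustrative
# stand-in for whichever open-source video diffusion model is used.
import torch
from PIL import Image
from diffusers import I2VGenXLPipeline

pipe = I2VGenXLPipeline.from_pretrained(
    "ali-vilab/i2vgen-xl", torch_dtype=torch.float16
).to("cuda")

current_frame = Image.open("camera_frame.png")  # latest frame from the robot camera
instruction = "stack the red block on top of the blue block"

out = pipe(
    prompt=instruction,
    image=current_frame,
    num_frames=16,            # short clip of the imagined successful outcome
    num_inference_steps=50,
)
imagined_frames = out.frames[0]  # list of PIL frames fed to the video-to-pose model
```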
Video-to-pose model: a CNN trained from scratch on synthetic robot trajectories to map each imagined frame to end-effector/arm joint poses.
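A minimal sketch of what that frame-to-pose CNN can look like (the layer sizes and the 7-dimensional pose output are assumptions, not the exact architecture):

```python
import torch
import torch.nn as nn

class FrameToPose(nn.Module):
    """Maps one imagined RGB frame to an arm pose.
    The 7-dim output (e.g. joint angles of a 7-DoF arm) is an assumption;
    it could equally be an end-effector position + orientation vector."""
    def __init__(self, pose_dim: int = 7):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, pose_dim)
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, 3, H, W) -> poses: (B, pose_dim)
        return self.head(self.encoder(frames))

# Trained as plain regression against synthetic ground-truth poses:
model = FrameToPose()
criterion = nn.MSELoss()
```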
Control: an inverse-kinematics (IK) controller reproduces the poses coming out of the world model → pose model pipeline.
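A sketch of the control step using robosuite's IK_POSE controller (robosuite 1.4-style API; the Stack environment and Panda arm are assumptions matching the block-stacking benchmark):

```python
import numpy as np
import robosuite as suite
from robosuite import load_controller_config

# IK_POSE lets us command end-effector pose deltas; robosuite solves the IK.
controller_config = load_controller_config(default_controller="IK_POSE")

env = suite.make(
    "Stack",                        # block-stacking task, matching the benchmark
    robots="Panda",                 # assumed arm; any supported arm works
    controller_configs=controller_config,
    has_renderer=False,
    use_camera_obs=False,
)

obs = env.reset()
# Placeholder targets; in the real pipeline these come from the video-to-pose model.
predicted_deltas = np.zeros((16, 6))    # per-frame (dx, dy, dz, rotation delta)
for delta in predicted_deltas:
    action = np.zeros(env.action_dim)   # pose delta + gripper command
    action[: len(delta)] = delta
    obs, reward, done, info = env.step(action)
```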
Evaluator: Gemini 2.5 Flash verifies that each real-world rollout actually completed the task.
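A sketch of the evaluator, assuming the google-genai Python client and a simple YES/NO prompt (the exact prompt wording and frame sampling are illustrative):

```python
from PIL import Image
from google import genai

client = genai.Client()  # reads the Gemini API key from the environment

def rollout_succeeded(rollout_frames: list[Image.Image], task: str) -> bool:
    """Ask Gemini 2.5 Flash whether a real-world rollout completed the task."""
    prompt = (
        f"These frames show a robot arm attempting to: {task}. "
        "Answer with exactly YES or NO: was the task completed successfully?"
    )
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[prompt, *rollout_frames],  # frames sampled from the rollout video
    )
    return response.text.strip().upper().startswith("YES")
```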
Self-training loop: thousands of rollouts are filtered by the evaluator into an outcome-filtered video dataset, which is used to finetune the video diffusion model. Repeat and improve!
Benchmark task: stacking blocks on a table.
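Putting the pieces together, one iteration of the loop looks roughly like this (every argument is a placeholder callable for one of the components above, not a real API):

```python
import random

def self_improvement_iteration(
    imagine,          # (current frame, task text) -> imagined video frames
    frames_to_poses,  # imagined frames -> arm pose targets (video-to-pose model)
    execute,          # pose targets -> recorded real-world rollout video (IK + control)
    is_success,       # (rollout video, task text) -> bool (VLM evaluator)
    finetune,         # list of successful (task, rollout video) pairs -> None
    get_frame,        # () -> current camera frame
    tasks,
    num_rollouts=1000,
):
    """One pass of the outcome-filtered self-training loop."""
    kept = []
    for _ in range(num_rollouts):
        task = random.choice(tasks)
        frame = get_frame()
        imagined = imagine(frame, task)    # world model "daydreams" a success clip
        poses = frames_to_poses(imagined)  # translate frames to poses
        rollout = execute(poses)           # run on the robot, record video
        if is_success(rollout, task):      # keep only verified successes
            kept.append((task, rollout))
    finetune(kept)                         # finetune the video diffusion model
    return kept
```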
Challenges we ran into
It took a LOT of compute to do the training runs and rollouts. When things didn't work, we kept doubling compute and data; in the end, we spent more than $2,000 on GPUs 🤣.
Accomplishments that we're proud of
- Designed a novel way to train robots that we believe may solve robotics if scaled.
- RL-tuned a video diffusion model to “dream” better control sequences for a robotic arm.
- Built an end-to-end loop that leads to self-improvement.
- The robot can execute prompted commands in its environment!
What we learned
Planning directly in video leverages broad, pretrained world knowledge, and connecting it to a video-to-pose inverse map embodies that world model on a real robot.
Outcome filtering with a VLM enables self-improvement (produce a video, try to execute it in the real world, and train on the data where it works).
What's next for Daydreamer: The GPT Moment for Robotics
Larger diffusion model: more emergent properties; Sora-scale models can dream almost anything.
Joint diffusion-pose architecture: merge the video-to-pose model and the diffusion model into one model that outputs video and pose data together.
Built With
- pytorch
- robosuite
- transformers