Inspiration
Robots are still in the pre-GPT era. They're trained task-by-task, don't generalize, and are expensive to teach. Language models showed us that large, world-scale pretraining is the key to generalization. We borrowed that approach, using the emergent world model of an internet-scale pretrained video diffusion model to control robots in a general and self-improving way.
What it does
Daydreamer imagines itself succeeding at a task in video, executes the plan in the real world, and self-improves from VLM feedback.
- Imagines a short video of a successful outcome (video diffusion world model).
- Translates the imagined frames into robot joint poses (we trained a video-to-pose model).
- Executes via inverse kinematics and low-level control.
- Evaluates the outcome and keeps only successful rollouts as new training data for the world model.
How we built it
World model: an open-source video diffusion model conditioned on the current frame and a text instruction to generate short "success" clips, then RL-tuned so that what it imagines corresponds to actions that work in the real world.
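A minimal sketch of the "imagine a success clip" step, assuming an image+text-conditioned video diffusion pipeline from diffusers (the I2VGen-XL checkpoint below is an illustrative stand-in, not necessarily the exact model used):

```python
# Sketch: condition on the current camera frame plus a text instruction and
# sample a short imagined "success" clip. The checkpoint is an illustrative
# stand-in for whichever open-source video diffusion model is used.
import torch
from PIL import Image
from diffusers import I2VGenXLPipeline

pipe = I2VGenXLPipeline.from_pretrained(
    "ali-vilab/i2vgen-xl", torch_dtype=torch.float16
).to("cuda")

current_frame = Image.open("camera_frame.png")  # latest frame from the robot camera
instruction = "stack the red block on top of the blue block"

out = pipe(
    prompt=instruction,
    image=current_frame,
    num_frames=16,            # short clip of the imagined successful outcome
    num_inference_steps=50,
)
imagined_frames = out.frames[0]  # list of PIL frames fed to the video-to-pose model
```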
Video-to-pose model: a CNN trained from scratch on synthetic robot trajectories to map each imagined frame to end-effector/arm joint poses.
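A minimal sketch of what that frame-to-pose CNN can look like (the layer sizes and the 7-dimensional pose output are assumptions, not the exact architecture):

```python
import torch
import torch.nn as nn

class FrameToPose(nn.Module):
    """Maps one imagined RGB frame to an arm pose.
    The 7-dim output (e.g. joint angles of a 7-DoF arm) is an assumption;
    it could equally be an end-effector position + orientation vector."""
    def __init__(self, pose_dim: int = 7):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, pose_dim)
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, 3, H, W) -> poses: (B, pose_dim)
        return self.head(self.encoder(frames))

# Trained as plain regression against synthetic ground-truth poses:
model = FrameToPose()
criterion = nn.MSELoss()
```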
Control: an inverse-kinematics (IK) controller reproduces the poses coming out of the world model → pose model pipeline.
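A sketch of the control step using robosuite's IK_POSE controller (robosuite 1.4-style API; the Stack environment and Panda arm are assumptions matching the block-stacking benchmark):

```python
import numpy as np
import robosuite as suite
from robosuite import load_controller_config

# IK_POSE lets us command end-effector pose deltas; robosuite solves the IK.
controller_config = load_controller_config(default_controller="IK_POSE")

env = suite.make(
    "Stack",                        # block-stacking task, matching the benchmark
    robots="Panda",                 # assumed arm; any supported arm works
    controller_configs=controller_config,
    has_renderer=False,
    use_camera_obs=False,
)

obs = env.reset()
# Placeholder targets; in the real pipeline these come from the video-to-pose model.
predicted_deltas = np.zeros((16, 6))    # per-frame (dx, dy, dz, rotation delta)
for delta in predicted_deltas:
    action = np.zeros(env.action_dim)   # pose delta + gripper command
    action[: len(delta)] = delta
    obs, reward, done, info = env.step(action)
```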
Evaluator: Gemini 2.5 Flash verifies that each real-world rollout actually completed the task.
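A sketch of the evaluator, assuming the google-genai Python client and a simple YES/NO prompt (the exact prompt wording and frame sampling are illustrative):

```python
from PIL import Image
from google import genai

client = genai.Client()  # reads the Gemini API key from the environment

def rollout_succeeded(rollout_frames: list[Image.Image], task: str) -> bool:
    """Ask Gemini 2.5 Flash whether a real-world rollout completed the task."""
    prompt = (
        f"These frames show a robot arm attempting to: {task}. "
        "Answer with exactly YES or NO: was the task completed successfully?"
    )
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[prompt, *rollout_frames],  # frames sampled from the rollout video
    )
    return response.text.strip().upper().startswith("YES")
```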
Self-training loop: thousands of rollouts are filtered by the evaluator into an outcome-filtered video dataset, which is used to finetune the video diffusion model. Repeat and improve!
Benchmark task: stacking blocks on a table.
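Putting the pieces together, one iteration of the loop looks roughly like this (every argument is a placeholder callable for one of the components above, not a real API):

```python
import random

def self_improvement_iteration(
    imagine,          # (current frame, task text) -> imagined video frames
    frames_to_poses,  # imagined frames -> arm pose targets (video-to-pose model)
    execute,          # pose targets -> recorded real-world rollout video (IK + control)
    is_success,       # (rollout video, task text) -> bool (VLM evaluator)
    finetune,         # list of successful (task, rollout video) pairs -> None
    get_frame,        # () -> current camera frame
    tasks,
    num_rollouts=1000,
):
    """One pass of the outcome-filtered self-training loop."""
    kept = []
    for _ in range(num_rollouts):
        task = random.choice(tasks)
        frame = get_frame()
        imagined = imagine(frame, task)    # world model "daydreams" a success clip
        poses = frames_to_poses(imagined)  # translate frames to poses
        rollout = execute(poses)           # run on the robot, record video
        if is_success(rollout, task):      # keep only verified successes
            kept.append((task, rollout))
    finetune(kept)                         # finetune the video diffusion model
    return kept
```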
Challenges we ran into
It took a LOT of compute to do the training runs and rollouts. When things didn't work, we kept doubling compute and data; in the end, we spent more than $2,000 on GPUs 🤣.
Accomplishments that we're proud of
- Designed a novel way to train robots that we believe may solve robotics if scaled.
- RL-tuned a video diffusion model to “dream” better control sequences for a robotic arm.
- Built an end-to-end loop that leads to self-improvement.
- The robot can execute prompted commands in its environment!
What we learned
Planning directly in video leverages broad, pretrained world knowledge, and connecting it to a video-to-pose inverse map embodies that world model on a real robot.
Outcome filtering with a VLM enables self-improvement (produce a video, try to execute it in the real world, and train on the data where it works).
What's next for Daydreamer: The GPT Moment for Robotics
Larger diffusion model: more emergent properties; Sora-scale models can dream almost anything.
Joint diffusion-pose architecture: merge the video-to-pose model and the diffusion model into one model that outputs video and pose data together.
Built With
- pytorch
- robosuite
- transformers