Who: Yuki Zang, Sebastián Varón, Calvin Luo
Introduction:
In this work, we study whether such large-scale text-to-image diffusion models also naturally encode an understanding of the time dimension, potentially demonstrated through their encoding of verbs. Despite being trained only on static images, a diffusion model appears to understand "actions": when asked to generate an image of someone running, it may produce a person with their legs spread in a running stride, and the image may be slightly blurred to suggest a passing scene; when asked for an image of someone jumping, it may depict a person in mid-air. Furthermore, the textual data that large-scale language models are pretrained on may include descriptions of such verbs and their effects; a natural question to investigate is whether, when such textual understanding is unified with static images, these models can also learn the essence of videos.
Related Work:
Prior work on video diffusion models has already demonstrated that text-to-image models can generate still images that represent verb terms, and that, with some simple modifications, they can generate multiple images concurrently that exhibit content consistency (https://arxiv.org/abs/2212.11565). Another motivating work is DreamFusion, a model that learns 3D volumes from text input by leveraging frozen pretrained diffusion models (https://arxiv.org/abs/2209.14988).
In the following, we briefly summarize the papers mentioned above. In "Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation" (https://arxiv.org/abs/2212.11565), the authors introduce a novel approach to Text-to-Video (T2V) generation, dubbed "One-Shot Video Tuning". They extend a state-of-the-art Text-to-Image (T2I) diffusion model to generate frames that maintain temporal consistency using their "Tune-A-Video" method. Recognizing that naive fine-tuning strategies can compromise the performance of the T2I model, the authors integrate a spatio-temporal attention mechanism in which each frame attends only to the first and the preceding video frames, and adopt a tuning strategy that updates only the projection matrices within the attention blocks.
The paper "DreamFusion: Text-to-3D Using 2D Diffusion" (https://arxiv.org/abs/2209.14988) presents a novel approach to text-to-3D synthesis that leverages a pretrained 2D text-to-image diffusion model, sidestepping the need for large-scale labeled 3D datasets and efficient 3D denoising architectures. The researchers introduce a loss function based on probability density distillation, so that the 2D diffusion model serves as a prior for optimizing a parametric image generator. They optimize a randomly initialized 3D model (a Neural Radiance Field, or NeRF) so that its 2D renderings from random viewpoints achieve a low loss. This method requires no 3D training data and no modifications to the image diffusion model.
Data:
We plan to evaluate the model on the OpenAI Gym suite.
Methodology:
In this work we propose Text-Aware Diffusion Policies (TADPols), where a policy is trained to align with human text captions using diffusion-model score distillation. We plan to use Soft Actor-Critic (SAC) as our agent, applied to OpenAI Gym tasks. For the score-distillation signal, we plan to use an established pretrained text-to-image diffusion model such as Stable Diffusion. At each timestep, we replace the environment's reward signal with a score-distillation signal queried from the diffusion model. Separately, we ablate the choice of text-image foundation model by also utilizing CLIP.
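As a rough illustration of this idea, the sketch below shows how an environment's per-step reward could be swapped for a score-distillation-style signal. The `denoiser` argument stands in for a frozen, text-conditioned diffusion model's noise predictor; its interface and the single-noise-level scheme are assumptions for illustration, not the exact training setup.

```python
import numpy as np

def score_distillation_reward(frame, text_embedding, denoiser,
                              noise_scale=0.1, seed=0):
    """Score a rendered frame against a text prompt with a frozen model.

    `denoiser(noisy_frame, text_embedding)` is a stand-in for a pretrained
    text-conditioned noise predictor (hypothetical interface). The reward
    is higher when the model's predicted noise matches the injected noise,
    i.e. when the frame looks like a plausible image for the prompt.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(scale=noise_scale, size=np.shape(frame))
    noisy_frame = frame + noise
    predicted_noise = denoiser(noisy_frame, text_embedding)
    # Negative noise-prediction error, in the spirit of score distillation.
    return -float(np.mean((predicted_noise - noise) ** 2))
```

In the actual agent loop, this value would take the place of the environment reward in each SAC transition tuple, leaving the rest of the algorithm unchanged.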
Metrics:
Because the text prompt now serves as the reward signal, we perform qualitative analysis of the videos produced by the policy, since it can learn flexible behaviors beyond the previously rigid environment-specified reward functions. We also evaluate our approach quantitatively by using prompts that aim to reproduce the original behavior intended by the environment.
Base goal: Let the diffusion model generate the motion depicted in the text prompt.
Target goal: Complicate the text prompts to see if the model can capture multiple motions.
Stretch goal: Compare and interpret the performances of models with CLIP/Diffusion in place of the traditional reward function.
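For the stretch goal, the CLIP variant of the reward can be sketched as a cosine similarity between the embeddings of a rendered frame and the prompt. The image and text encoders themselves (e.g. CLIP's) are assumed to be provided; only the reward computation is shown.

```python
import numpy as np

def clip_reward(image_embedding, text_embedding):
    """Cosine similarity between a rendered frame's embedding and the
    prompt's embedding, used in place of the environment reward."""
    img = image_embedding / np.linalg.norm(image_embedding)
    txt = text_embedding / np.linalg.norm(text_embedding)
    return float(img @ txt)
```

Comparing policies trained with this similarity signal against those trained with the score-distillation signal is what allows the two foundation models to be interpreted side by side.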
Ethics:
We do not foresee our work alone having any negative ethical implications. Rather, our work attempts to shed some potentially interesting light on how diffusion models may encode actions and verbs despite being trained only on static images and plain text. However, considering video generation tasks in general, we respond to the following two questions:
Why is deep learning a good approach to this problem? Deep learning enables the learning of complex policies that can visualize the kind of action information stored in large pretrained image-text models such as CLIP and diffusion models.
What broader societal issues are relevant to your chosen problem space? As image-to-video models improve, the potential for creating realistic "deepfake" videos grows: a single image and some text may be sufficient to fabricate a plausible video, which can be used to spread misinformation, defame individuals, or even pose national-security threats. Hence, it is crucial to develop ethical guidelines and detection tools alongside these models.
Division of Labor:
We plan to divide the labor for this work evenly.
Deliverables:
Checkin3 reflection: https://docs.google.com/document/d/1W7XUjJLXXLVhj-Uysa_PgT7_AigzMXjqu366dV4d4jA/edit?usp=sharing
Writeup: https://drive.google.com/file/d/14jh1oeuHJxqU0b8h_wtJQYKwF9qbPOra/view?usp=sharing
Writeup w/ visual results: https://drive.google.com/file/d/1yo9VwWseysJl8jQxmtKQVqkNxfEPQmbB/view?usp=sharing
Built With
- keras
- python
- tensorflow