Introduction: What problem are you trying to solve and why?
In this report, we explore new techniques for training diffusion models by studying how combining multiple captions affects the detail and quality of generated images. The inspiration stems from "Towards Language-Free Training for Text-to-Image Generation" [1], which describes using CLIP models to train GANs on captionless data for text-to-image generation. Each person sees the world from a unique perspective, shaped by their experiences, education, and life path, and that perspective in turn affects the descriptions they provide and the details they focus on. A single description will always miss details that could have added more depth to an image. Deep learning models express their own perspective through the embeddings they generate for a given input, yet modern caption-based diffusion models condition image generation on a single embedded caption. When people collaborate, they combine everyone's unique perspectives to find a better solution. We believe that by combining multiple models' embeddings, we can combine multiple models' perspectives, allowing the diffusion model to create a more detailed and accurate representation than a single perspective would.
To properly test this hypothesis, we developed three models: two trained on CIFAR-10 and one on COCO. These models incorporate class embeddings, CLIP image embeddings, and CLIP text embeddings, allowing us to assess whether combining multiple model embeddings improves the quality of the final image generated by a diffusion model.
Related Work:
The main objective of the CVPR paper "Shifted Diffusion for Text-to-Image Generation" [2] is to introduce Corgi, a novel and flexible text-to-image diffusion model. The paper aims to train a better generative model by leveraging prior knowledge from pre-trained CLIP models and enabling efficient, effective generation through self-supervised learning, so the model can be used even when images lack captions. Additional related work is listed below.
Related Work List:
[1] Y. Zhou, R. Zhang, C. Chen, C. Li, C. Tensmeyer, T. Yu, J. Gu, J. Xu, and T. Sun, "Towards Language-Free Training for Text-to-Image Generation"
[2] Y. Zhou, B. Liu, Y. Zhu, X. Yang, C. Chen, and J. Xu, "Shifted Diffusion for Text-to-Image Generation"
[3] J. Ho, A. Jain, and P. Abbeel, "Denoising Diffusion Probabilistic Models"
[4] J. Ho and T. Salimans, "Classifier-Free Diffusion Guidance"
[5] A. Dosovitskiy et al., "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale"
The Data We Are Using:
- COCO (MS-COCO): Microsoft Common Objects in Context
- CIFAR-10: labeled subsets of the 80 Million Tiny Images dataset collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton
- CC3M: Google Conceptual Captions 3M dataset (dismissed due to resource constraints; see below)
- How big is it? Will you need to do significant preprocessing?
- These are very big datasets! The COCO dataset has roughly 300k images with 1.5 million detected objects, and CC3M has 3 million image-text pairs.
- When we tried to download CC3M, every entry failed with an [Errno 48] error caused by the server taking too long to respond, so the images could not be fetched. After further searching, we found that the only method that would make it work required a download of at least two days, making CC3M difficult to use. We therefore dismissed CC3M and decided to use the COCO dataset instead.
- Additionally, we used the CIFAR-10 classification dataset for faster training and for class-based generation. A sketch of how both datasets can be loaded appears below.
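As a point of reference, here is a minimal loading sketch using torchvision. The paths, image size, and transforms are illustrative assumptions, not the project's fixed configuration:

```python
# Minimal dataset-loading sketch with torchvision (paths are assumptions).
import torchvision.transforms as T
from torchvision.datasets import CIFAR10, CocoCaptions

# CIFAR-10: 32x32 images with one of ten class labels (fast to train on).
cifar = CIFAR10(root="data/cifar10", train=True, download=True,
                transform=T.ToTensor())

# COCO: larger photos, each paired with several human-written captions
# (requires the pycocotools package).
coco = CocoCaptions(
    root="data/coco/train2017",
    annFile="data/coco/annotations/captions_train2017.json",
    transform=T.Compose([T.Resize((64, 64)), T.ToTensor()]),
)

image, captions = coco[0]  # `captions` is a list of strings for this image
cifar_image, label = cifar[0]
```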
Methodology: What is our model's architecture?
We are using the U-Net architecture originally proposed in "Denoising Diffusion Probabilistic Models" [3], with one change: in addition to the time embedding at timestep t, we pass in the caption data y. We encode each caption by running the string through CLIP, taking the output, and adding it to the time embedding. This is done either by passing the encoded caption through a fully connected layer so its dimensions match the time embedding, or by changing our time embedding to match the dimensions of the text encoder. When using multiple captions, we concatenate the caption embeddings and pass them through a larger fully connected layer to make the embedding dimensions match. At sampling time, we simultaneously generate a conditioned and an unconditioned prediction and combine them by linear interpolation, following classifier-free guidance [4]; we use a guidance weight of 3 to generate the final image.
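To make this concrete, here is a minimal PyTorch sketch of the conditioning and guidance steps described above. It is illustrative only: the names (`CaptionConditioning`, `guided_prediction`) and the dimensions are our assumptions, not fixed parts of the architecture.

```python
import torch
import torch.nn as nn

class CaptionConditioning(nn.Module):
    """Project one or more CLIP caption embeddings into the time-embedding space."""
    def __init__(self, clip_dim=512, time_dim=256, num_captions=2):
        super().__init__()
        # Concatenated caption embeddings -> the U-Net's time-embedding width.
        self.proj = nn.Linear(clip_dim * num_captions, time_dim)

    def forward(self, t_emb, caption_embs):
        # t_emb: (batch, time_dim); caption_embs: (batch, num_captions, clip_dim)
        y = self.proj(caption_embs.flatten(start_dim=1))
        return t_emb + y  # condition by adding to the time embedding

def guided_prediction(unet, x_t, cond_emb, uncond_emb, w=3.0):
    """Classifier-free guidance: combine conditioned/unconditioned predictions."""
    eps_cond = unet(x_t, cond_emb)      # prediction with the caption(s)
    eps_uncond = unet(x_t, uncond_emb)  # prediction with a null embedding
    # Linear interpolation/extrapolation between the two, guidance weight 3.
    return eps_uncond + w * (eps_cond - eps_uncond)
```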
In addition to caption-based generation, we also implemented a class-based generation model. The class-based model reduces training time because it uses the small CIFAR-10 dataset, while still receiving a specific conditioning signal: by keeping the class embedding, we keep the classes/objects separate, and because CLIP maps images and text into a shared space, the CLIP image embedding acts as a stand-in for a detailed caption, giving the model a more specific description than the class label alone.
Accordingly, we have three models in total: (1) a class-based generation V1 model that uses a dense layer to combine the image embedding and class embedding; (2) a class-based generation V2 model that simply adds the image embedding and class embedding to prevent any loss of information; and (3) a caption-based generation model that shares the V2 architecture but trains on the COCO dataset, which is larger and also contains captions. The two combination strategies are sketched below.
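For concreteness, here is a sketch of the two embedding-combination strategies. As before, the class names and dimensions are our illustrative assumptions:

```python
import torch
import torch.nn as nn

class CombineV1(nn.Module):
    """V1: learn the combination with a dense layer over the concatenation."""
    def __init__(self, img_dim=512, cls_dim=512, out_dim=256):
        super().__init__()
        self.fc = nn.Linear(img_dim + cls_dim, out_dim)

    def forward(self, img_emb, cls_emb):
        return self.fc(torch.cat([img_emb, cls_emb], dim=-1))

class CombineV2(nn.Module):
    """V2: add the embeddings directly so no information is squeezed
    through a learned projection (assumes matching dimensions)."""
    def forward(self, img_emb, cls_emb):
        return img_emb + cls_emb
```

The trade-off: V1 can learn how to weight the two signals but may discard information in the projection, while V2 preserves both embeddings intact at the cost of requiring them to share a dimension.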
Metrics: What constitutes “success?”
What experiments do you plan to run?
- First, we will train a "baseline" model that uses only one text-encoded caption; this will be our reference.
- Second, we will train a model with two text-encoded captions on our CIFAR-10 and COCO datasets: CIFAR-10 for the class-based generation model and COCO for the caption-based generation model.
- Third, we will train the model with one text-encoded caption and one CLIP image encoding.
- Finally, we will experiment with different combinations of text-encoded captions and encoded images (i.e., using the CLIP image embedding as an additional caption); a sketch of how we obtain both kinds of embeddings follows this list.
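The sketch below shows one way to obtain the CLIP text and image embeddings used in these experiments, here via the Hugging Face transformers wrappers (an assumption on our part; the original OpenAI clip package works equally well). The captions and image path are placeholders:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captions = ["a dog playing in the snow", "a puppy outside in winter"]
image = Image.open("example.jpg")  # illustrative path

with torch.no_grad():
    text_inputs = processor(text=captions, return_tensors="pt", padding=True)
    text_embs = model.get_text_features(**text_inputs)    # shape (2, 512)
    image_inputs = processor(images=image, return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)  # shape (1, 512)

# One caption embedding plus one image embedding, stacked as the condition.
cond = torch.cat([text_embs[:1], image_emb], dim=0)
```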
For most of our assignments, we have looked at the accuracy of the model. Does the notion of “accuracy” apply for your project, or is some other metric more appropriate?
- FID (Fréchet Inception Distance) score: a metric used to assess the quality of images created by a generative model
If you are doing something new, explain how you will assess your model’s performance.
- We plan to compare our FID (Fréchet Inception Distance) scores on CIFAR-10 and COCO, and we can also compare our performance to state-of-the-art models. A sketch of the FID computation appears below.
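Here is a minimal sketch of computing FID, assuming the torchmetrics implementation (which requires the torch-fidelity package); the random tensors stand in for real and generated image batches:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

# torchmetrics expects uint8 images shaped (N, 3, H, W) by default;
# these random tensors are placeholders for real and generated batches.
real_images = torch.randint(0, 256, (64, 3, 32, 32), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (64, 3, 32, 32), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(f"FID: {fid.compute().item():.2f}")  # lower is better
```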
What are your base, target, and stretch goals?
- Our base goal is to build our first version of our CLIP and diffusion architecture that takes in multiple captions (true and generated).
- Our target goal is to build two or more versions of our CLIP and diffusion architecture that take in multiple captions (true and generated) with different implementations and compare their results.
- Our stretch goal is to incorporate multiple approaches together to generate the optimal CLIP and diffusion architecture with the best performance.
Ethical Questions:
What broader societal issues are relevant to your chosen problem space?
- Generating fake photos (Deep Fakes)
- Using artists' copyrighted work in training datasets without permission
What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain?
- Our datasets are publicly licensed, so we are not using anyone's work without permission. It is possible to scrape the internet for training data, but we restrict ourselves to properly licensed material. Our data can include people, so we will have to be aware of social and historical biases. Additionally, we are using OpenAI's CLIP model, which was trained on scraped web data and can therefore carry those biases into our pipeline.
Division of labor:
- Julien: Embed input data using pre-trained CLIP and also endeavor to implement a T5 model, preprocess with dataloader, and perform/improve diffusion model training
- Roger: Adapt existing diffusion architecture, data prep, and perform diffusion model training
- Seik: Embed input data using pre-trained CLIP and also endeavor to implement a T5 model, preprocess with dataloader, and perform/improve diffusion model training
- Ziyan: Adapt existing diffusion architecture, data prep, and perform diffusion model training