Title: Across the Spider-Verse Style Transfer

Arjan Chakravarthy - achakr33
Abhijith Chandran - achand40
George Chemmala - gchemmal
Siddu Sitaraman - sksitara

Final Report: https://www.overleaf.com/read/jbdnxtnmvmtd#a7a5f7

Github Repo: https://github.com/AzureCoral/across-the-spiderverse-style-transfer

Final Poster: https://github.com/AzureCoral/across-the-spiderverse-style-transfer/blob/main/report%26poster/poster.png

Introduction: What problem are you trying to solve and why? The primary objective of our project is to explore and enhance the capabilities of image style transfer, a powerful and creative application of convolutional neural networks (CNNs). Style transfer involves applying the artistic style of one image to the content of another, allowing for the creation of unique, stylized images. In our case, we chose to apply this to the movie Across the Spider-Verse. Our interest in this area was sparked by seminal works such as those by Gatys et al., which introduced the neural algorithm of artistic style transfer, and subsequent improvements by Dumoulin et al. that proposed methods for more versatile style applications.

Our project seeks to re-implement and extend these foundational approaches. The reason for choosing this particular area of research stems from its blend of art and technology, presenting a fascinating challenge in computer vision and machine learning. Specifically, our focus is on optimizing the style transfer process for better speed and accuracy, potentially expanding its application to video content. This project will primarily involve elements of structured prediction and unsupervised learning, as the task does not require labeled data in the traditional sense but rather leverages the intrinsic properties of the images themselves.

Related Work: Are you aware of any, or is there any prior work that you drew on to do your project? Please read and briefly summarize (no more than one paragraph) at least one paper/article/blog relevant to your topic beyond the paper you are re-implementing/novel idea you are researching. In this section, also include URLs to any public implementations you find of the paper you’re trying to implement. Please keep this as a “living list”–if you stumble across a new implementation later down the line, add it to this list.

We’ve drawn inspiration for our project from various sources (listed below). We are interested in utilizing style transfer, which is quite well documented. The first two sources (Gatys et al. and Dumoulin et al.) broadly introduce the concept of image style transfer using CNNs. Beyond these original reference papers, we have found other sources that show novel applications and extensions of style transfer that may be helpful. For example, we have looked into cutting-edge improvements to model speed and accuracy in papers such as Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization (link), which uses adaptive instance normalization to quickly change styles based on learned feature patterns. We have also been looking at new approaches to quantifying stylized image loss, as described in A Style-Aware Content Loss for Real-time HD Style Transfer (link). Finally, we have the goal of potentially extending our work to videos and even real-time transfer, as detailed in the prior paper and many others.

OG Paper - https://arxiv.org/abs/1508.06576 (better sequel: https://arxiv.org/pdf/1610.07629.pdf)
Reference in Textbook - https://d2l.ai/chapter_computer-vision/neural-style.html
Other papers in the field - https://github.com/neuralchen/awesome_style_transfer

Data: What data are you using (if any)? If you’re using a standard dataset (e.g. MNIST), you can just mention that briefly. Otherwise, say something more about where your data come from (especially if there’s anything interesting about how you will gather it). How big is it? Will you need to do significant preprocessing?

Since we are planning on using style transfer with images from the movie Across the Spider-Verse, we plan on collecting clips from the movie and sampling frames from each clip. There are certain scenes and parts of the movie where the style is extremely rich, and we plan on writing a Python script to extract frames from these scenes to form our training data.
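As a sketch of that script, the snippet below samples every n-th frame from a clip using OpenCV; the paths and the sampling stride are placeholders, not our actual settings.

```python
# Hypothetical frame-extraction sketch using OpenCV; paths and the
# sampling stride below are placeholders, not the project's settings.
import os
import cv2

def extract_frames(video_path: str, out_dir: str, every_n: int = 24) -> int:
    """Save every `every_n`-th frame of `video_path` as a PNG in `out_dir`."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of clip
            break
        if idx % every_n == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{idx:06d}.png"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Example: roughly one frame per second from a 24 fps clip.
# extract_frames("clips/scene_01.mp4", "data/scene_01", every_n=24)
```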

Methodology: What is the architecture of your model? How are you training the model? If you are implementing an existing paper, detail what you think will be the hardest part about implementing the model here. If you are doing something new, justify your design. Also note some backup ideas you may have to experiment with if you run into issues.

Style transfer relies on a CNN-based model that learns to identify features in an image in order to characterize style, which emerges as a combination of many features. Therefore, we are planning on the conventional method of using certain layers from the VGG-19 model to extract features as part of the architecture of our model.
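As a rough sketch of that setup, the snippet below collects intermediate activations from a pretrained VGG-19 using PyTorch/torchvision (the framework choice here is an assumption); the layer indices follow the conv1_1 through conv5_1 style layers and conv4_2 content layer commonly used following Gatys et al., not necessarily our final picks.

```python
# Minimal sketch of VGG-19 feature extraction, assuming PyTorch/torchvision.
import torch
from torchvision.models import vgg19, VGG19_Weights

vgg = vgg19(weights=VGG19_Weights.DEFAULT).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)  # the network stays frozen; only the image is optimized

STYLE_LAYERS = [0, 5, 10, 19, 28]   # conv1_1, conv2_1, conv3_1, conv4_1, conv5_1
CONTENT_LAYERS = [21]               # conv4_2

def get_features(img: torch.Tensor) -> dict[int, torch.Tensor]:
    """Run `img` through VGG-19 and collect activations at the chosen layers."""
    feats, x = {}, img
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in STYLE_LAYERS or i in CONTENT_LAYERS:
            feats[i] = x
    return feats
```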

Rather than training network weights, we optimize the generated image against each style and content pair. This is done by creating a loss function for the style and a loss function for the content, which we use to adjust our image through gradient descent until we obtain an image that contains both the content and the style.
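Continuing the sketch above (and assuming its get_features, STYLE_LAYERS, and CONTENT_LAYERS names), the snippet below defines the two losses and the pixel-level optimization loop; the Gram matrix summarizes style as channel correlations, and the alpha/beta weights and step count are illustrative values, not tuned ones.

```python
# Hedged sketch of the optimization loop; assumes the previous snippet's
# helpers and illustrative (untuned) weights.
import torch
import torch.nn.functional as F

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    """Channel-wise correlations that summarize style at one layer."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def total_loss(img, content_feats, style_grams, alpha=1.0, beta=1e4):
    feats = get_features(img)
    content = sum(F.mse_loss(feats[i], content_feats[i]) for i in CONTENT_LAYERS)
    style = sum(F.mse_loss(gram_matrix(feats[i]), style_grams[i]) for i in STYLE_LAYERS)
    return alpha * content + beta * style

# Start from the content image and update its pixels directly;
# content_feats and style_grams are precomputed from the two source images.
# generated = content_img.clone().requires_grad_(True)
# opt = torch.optim.Adam([generated], lr=0.02)
# for step in range(500):
#     opt.zero_grad()
#     loss = total_loss(generated, content_feats, style_grams)
#     loss.backward()
#     opt.step()
```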

Metrics: What constitutes “success?” What experiments do you plan to run? For most of our assignments, we have looked at the accuracy of the model. Does the notion of “accuracy” apply for your project, or is some other metric more appropriate? If you are implementing an existing project, detail what the authors of that paper were hoping to find and how they quantified the results of their model. If you are doing something new, explain how you will assess your model’s performance. What are your base, target, and stretch goals?

For style transfer models, success is based on how well the model transfers the style while preserving the content of the original image. The main difficulty is that style and content are not completely independent, so we need the model to determine how to differentiate style from content.

For the experiments we plan to run, it will be helpful to create some sample images and then run qualitative tests. For example, we can have human judges evaluate the quality of the transfer and the creativity of the images. On a more quantitative note, we can use MSE to check whether the content of the images is similar. Our loss function needs to consider both the content and the style of the image and scale them appropriately.

We also plan to use the Structural Similarity Index (SSIM), which measures the structural similarity between two images. A lower SSIM between the input and output images generally signifies that our model produces a significantly different image from the input; a higher SSIM signifies that the output remains close to the original image.
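A minimal evaluation sketch for both metrics, assuming scikit-image; content_img and stylized_img are placeholder names for same-sized RGB arrays scaled to [0, 1].

```python
# Sketch of content-preservation metrics, assuming scikit-image.
import numpy as np
from skimage.metrics import structural_similarity, mean_squared_error

def evaluate(content_img: np.ndarray, stylized_img: np.ndarray) -> dict:
    """Compare content preservation between the input and the stylized output."""
    mse = mean_squared_error(content_img, stylized_img)
    ssim = structural_similarity(content_img, stylized_img,
                                 channel_axis=-1, data_range=1.0)
    return {"mse": mse, "ssim": ssim}
```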

Along the way, we will adjust the hyperparameters of the model to see if we can achieve better results, as measured by the aforementioned qualitative tests.

The authors of the original paper evaluated success through these kinds of qualitative judgments, assessing how convincingly the generated images rendered the content of a photograph in the style of a given artwork.

Base - create model that does style transfer
Target - create model that does multi-style transfer
Reach - create model that does style transfer with video

Ethics: Choose 2 of the following bullet points to discuss; not all questions will be relevant to all projects so try to pick questions where there’s interesting engagement with your project. (Remember that there’s not necessarily an ethical/unethical binary; rather, we want to encourage you to think critically about your problem setup.)

Why is Deep Learning a good approach to this problem? Deep learning is a good approach to this problem since we are doing a creative task that is not suited to a strict series of steps. With style transfer, in general, it is difficult to algorithmically change an input image into the style of another image. Based on the features of the content image, the colors and style that get applied may vary. Therefore, we can use deep learning to learn the necessary features and style from a set of training data and apply this style to new images.

What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain? Our dataset is stills from Across the Spider-Verse that give us a good grasp of the styles we want to replicate. One issue is that we are “copying” styles that the animators created without necessarily attributing any credit to them. However, it can be argued that what we are doing does no harm to the creators of the movie and is not much more than fan art, since we are not commercializing it. We would still want to consider how our model is used and how its output images are used and distributed.

Division of labor: We will all work together on the data collection part of the project, as capturing the right stills and parts of the movie is important for our model. For the architecture of our model, we will all work on finding the important layers of the VGG-19 model and will then build these layers sequentially to form our model. We will also all work together in defining our loss functions and the other metrics that will be used to evaluate our model.

Project 3 Check-in Document: https://docs.google.com/document/d/1S3DYCIfw-RpS2An7FeWPwPyZzrc0tkXrNlIQmXu333E/edit?usp=sharing
