Title
Compositional Sketch Generator
Authors
- Jiahao Liu (jliu265)
- Zheyuan Zhou (zzhou118)
- Yun Li (yli482)
Links
Final deliverables
Check-ins
Introduction
We study the sketch generation problem. Sketch-RNN proposed a recurrent neural network (RNN) to construct sketches of common objects. The model is trained on a dataset of human-drawn sketches spanning many different classes. In their paper, a sketch is represented as a sequence of moving points, and a sequence-to-sequence (seq2seq) VAE architecture is trained end to end. However, sketches, as a special kind of image data, contain rich visual and spatial information. To capture these visual features, we propose to apply a convolutional neural network that provides auxiliary information to the system. Moreover, RNNs, and even LSTMs, empirically struggle with extremely long sequences, while human-drawn sketches often contain more than 400 points. This makes training difficult. We therefore propose to decompose each sketch into strokes (usually no more than 20), and then to generate the sketch by composing these strokes.
Related Work
Prior work has explored sketch generation. One example, DoodlerGAN, is a part-based Generative Adversarial Network (GAN) that generates unseen compositions of novel sketches. The model consists of two components: a part generator and a part selector. The part generator completes an image given a partial image, while the part selector predicts the next part to be generated. During inference, starting from a random initial stroke, the part selector and part generators work iteratively to complete the sketch.
Data
If you’re using a standard dataset (e.g. MNIST), you can just mention that briefly. Otherwise, say something more about where your data come from (especially if there’s anything interesting about how you will gather it).
We will use the Quick, Draw! dataset, a collection of vector drawings obtained from Quick, Draw!, an online game in which players are asked to draw objects belonging to a particular class in less than 20 seconds.
How big is it? Will you need to do significant preprocessing?
The full dataset contains 50 million drawings across 345 classes. Each class provides 70K training samples, plus 2.5K validation and 2.5K test samples.
Data preprocessing is required for our project, and it mainly consists of four parts:
1. Drop Poor-Quality Data: Some drawings are so poorly drawn that even a human cannot recognize the intended category. We believe such training data would hinder model performance, so we will manually filter the abundant dataset for good-quality samples.
2. Control Point Count: A stroke is made up of multiple points, and the number of points per stroke varies across our dataset. We will therefore bound the sequence length by eliminating extremely long and extremely short strokes.
3. Format Conversion: The original dataset is in ndjson format, where each sketch contains a list of strokes, each stored as a 3-row array: the first row holds the points' x positions, the second row their y positions, and the third row the timestamps. We need to convert this into a sequence of points, each consisting of an x offset, a y offset, and a pen state.
4. Control Stroke Count: A key point of our method is that we use the stroke as the basic unit, which mitigates the long-range vanishing-gradient problem of LSTMs and RNNs. We therefore need to decide when to stop stroke generation during training and testing, and to do so we will restrict the dataset to sketches whose stroke counts fall within a certain range.
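The format conversion in step 3 can be sketched as follows. This is a minimal illustration, assuming each raw sketch is a list of strokes stored as [xs, ys, timestamps] arrays; the timestamps are dropped and a pen state of 1 marks the last point of each stroke:

```python
def to_offset_format(strokes):
    """Convert a raw Quick, Draw! sketch (list of [xs, ys, ts] strokes)
    into a sequence of (dx, dy, pen_state) triples.
    pen_state is 1 when the pen lifts at the end of a stroke, else 0."""
    points = []
    prev_x, prev_y = 0, 0
    for stroke in strokes:
        xs, ys = stroke[0], stroke[1]  # timestamps (stroke[2]) are ignored
        for i, (x, y) in enumerate(zip(xs, ys)):
            pen_up = 1 if i == len(xs) - 1 else 0  # lift pen after last point
            points.append((x - prev_x, y - prev_y, pen_up))
            prev_x, prev_y = x, y
    return points

# Example: a sketch with two strokes
sketch = [[[0, 5, 10], [0, 0, 5], [0, 1, 2]],
          [[10, 20], [10, 10], [3, 4]]]
seq = to_offset_format(sketch)
# seq is [(0, 0, 0), (5, 0, 0), (5, 5, 1), (0, 5, 0), (10, 0, 1)]
```

The first offset is taken relative to the origin, so the sketch's absolute position is discarded and only relative pen movements remain.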
Methodology
The original design of Sketch-RNN treats a sketch as a sequence of points and learns it with a seq2seq VAE, which has two main weaknesses: (1) sketches contain rich visual information that RNNs cannot fully capture; (2) human-drawn sketches often form extremely long sequences, which makes training a seq2seq model difficult. To address the first problem, we will apply a convolutional neural network as an auxiliary encoder to capture spatial information. For the second, we exploit the compositionality of sketches: we first decompose each sketch into strokes, train one variational autoencoder to generate plausible strokes, and train a second variational autoencoder to learn the relative relationships between strokes and thereby compose them into a complete sketch.
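The decomposition step above can be illustrated with a small hypothetical helper, assuming the offset format described in the Data section, where each point is a (dx, dy, pen_state) triple and pen_state = 1 marks the end of a stroke:

```python
def split_into_strokes(seq):
    """Split a (dx, dy, pen_state) point sequence into individual strokes.
    A stroke ends at each point whose pen_state is 1 (pen lift)."""
    strokes, current = [], []
    for point in seq:
        current.append(point)
        if point[2] == 1:       # pen lifted: close the current stroke
            strokes.append(current)
            current = []
    if current:                 # trailing points without a final pen lift
        strokes.append(current)
    return strokes

seq = [(0, 0, 0), (5, 0, 0), (5, 5, 1), (0, 5, 0), (10, 0, 1)]
strokes = split_into_strokes(seq)
# strokes has 2 elements, one list of points per stroke
```

Each resulting stroke (a short sequence, typically far under 400 points) becomes one training sample for the stroke-level VAE, while the sequence of strokes becomes the input to the composition VAE.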
Metrics
What experiments do you plan to run?
We will run sketch generation experiments to check whether the model can generate plausible sketches. We will also interpolate in the latent space to see how the generated sketches change with different input latent vectors.
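Latent-space interpolation amounts to blending two encoded latent vectors and decoding each intermediate point. A minimal numpy sketch, with the encoder and decoder themselves omitted and placeholder latent codes standing in for real encodings:

```python
import numpy as np

def interpolate_latents(z0, z1, n_steps=10):
    """Return n_steps latent vectors linearly interpolated
    between z0 and z1, including both endpoints."""
    ts = np.linspace(0.0, 1.0, n_steps)
    return [(1.0 - t) * z0 + t * z1 for t in ts]

z0 = np.zeros(128)   # latent code of sketch A (placeholder)
z1 = np.ones(128)    # latent code of sketch B (placeholder)
path = interpolate_latents(z0, z1, n_steps=5)
# each z in path would be fed to the decoder to render an intermediate sketch
```

Linear interpolation is the simplest choice; spherical interpolation (slerp) is sometimes preferred for VAE latents because it stays closer to the typical norm of samples from the Gaussian prior.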
For most of our assignments, we have looked at the accuracy of the model. Does the notion of “accuracy” apply for your project, or is some other metric more appropriate?
The metric of “accuracy” does not apply to our model. Since we are training a generative model, the likelihood of samples in the dataset is a better metric for evaluation.
If you are doing something new, explain how you will assess your model’s performance.
We will assess performance using the log likelihood of data samples, supplemented by human evaluation.
What are your base, target, and stretch goals?
- Base: the model generates sketches that contain the features of the given data, and a human can identify the corresponding category.
- Target: the model's loss reaches a value similar to that reported in the original Sketch-RNN paper on the same dataset.
- Stretch goals: (1) the loss outperforms that of the Sketch-RNN paper. (2) Qualitatively more plausible sketches can be observed.
Ethics
What broader societal issues are relevant to your chosen problem space?
There are quite a number of ethical issues related to the field of our study. Here we list and briefly discuss two of them.
The copyright issue. The datasets contain drawings, sketches, and writings by many individuals. If these data fall into the wrong hands, the works could be passed off as someone else's own and sold without the original creators' permission. In this situation, the copyrights of the original creators may be infringed.
The imposture issue. Much work in this field aims at mimicking or reproducing graphics in a dataset. In the wrong hands, the generated graphics could be passed off as the original works for profit. For example, Paul the Robot can mimic artists' styles when drawing, so Paul's drawings might be sold illegally as famous artists' work. As another example, an ML model that mimics someone's handwriting could be used to produce fake signatures.
Why is Deep Learning a good approach to this problem?
The objective of our project is to build a model that generates sketches of objects the way humans draw them. Deep learning suits this purpose for several reasons.
This problem can be solved by an algorithm. All relevant features (the number of strokes, the paths of strokes, and the placements of strokes in each drawing) can be encoded numerically and used directly as training data. We can also define a clear metric of success: log likelihood.
This problem is suitable for an algorithmic solution. Unlike topics involving medical diagnostics or autonomous driving, nothing serious is at stake if our model produces slightly worse results than expected. Although we should still pay close attention to the broader societal issues discussed earlier, our project has comparatively weak ethical implications, and we can feel comfortable using an algorithm to implement it.
Deep learning is a suitable machine learning approach for this problem. Since we are trying to understand how humans abstractly visualize and sketch objects, deep learning models, which are loosely inspired by the human brain, are a natural fit. We also have abundant data for training and testing (over 50 million drawings), which is crucial for building deep learning models.
Division of Labour
- Jiahao Liu: Data preprocessing. Implement CNN VAE.
- Yun Li: Literature Review. Implement RNN VAE to generate stroke.
- Zheyuan Zhou: Report writing. Implement stroke composition VAE.



