Title: "It's a little sketchy, but..."
Final Submission:
Paper: https://docs.google.com/document/d/143HLm9OzQfoo-RmeU_wPeJ7VsydzRyxJ2Y4piczJDDE/edit?usp=sharing
Video: https://drive.google.com/file/d/1GHuuUCiVrFuaGiIACn8tVEOE7BAQxxEa/view?usp=sharing
Who: Jackie Luke (jluke) and Rachel Fuller (rfuller1)
Introduction:
The goal of our model is to identify hand-drawn sketches made by non-artists. We can imagine this kind of identification being useful, for example, for recognizing the sketches people draw as they take notes in class or in the field. The task is challenging because the data, hand-drawn sketches by non-artists, is by nature imprecise and varies with the skill of the person drawing. Sketches of this type offer less information than a photo (they are black and white and contain only a few lines), so they can be difficult to identify.
We created an implementation of a sketch image classifier similar to the model featured in "Free-hand Sketch Recognition Classification" (Lu and Tran, 2016). In that paper, the authors built an image classifier using deep convolutional neural networks (DCNNs) to show that DCNN architectures, primarily used for photos, can also be applied to sketches. They used the TU-Berlin sketch database assembled by Eitz et al., featuring 20,000 hand-drawn images across 250 classes. Lu and Tran used residual neural networks, which address the vanishing-gradient and overfitting problems observed in typical CNN architectures. In our project, we attempt to emulate their residual architecture.
Data:
For our project, we used the database created in the paper "The Sketchy Database: Learning to Retrieve Badly Drawn Bunnies" (Sangkloy et al.). This dataset contains human-drawn sketches spanning 125 categories of animals and inanimate objects. The drawings were made by non-artists working from provided reference photographs, so the Sketchy Database provides a slightly different kind of image than the Eitz et al. database used by Lu and Tran. The Eitz et al. sketches were generated solely from the participants' memory and mind's eye, which means the pictures are not necessarily representative of the object itself so much as of the way people picture it in their minds. For the Sketchy Database, participants were shown a reference image and then directed to draw the object from memory. This produces sketches of objects in a variety of positions and angles, not just the viewpoint most easily recalled and drawn; Sangkloy et al. point out, for example, that the Eitz et al. collection does not contain a single sketch of a duck in flight. We chose this database in order to include these variations in pose, position, and angle in our classification task.
Methodology:
The project features two models. The first is a basic model with traditional CNN architecture: three convolutional layers separated by batch normalization and dropout layers, followed by two dense layers with ReLU activation. The second is an implementation of the ResNet model used by Lu and Tran. The ResNet model reduces the vanishing-gradient problem by adding the initial input to the output of the two convolutional layers within each residual block.
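For reference, the basic model can be sketched in Keras roughly as follows. The filter counts, dropout rates, and pooling layers here are illustrative placeholders, not our exact hyperparameters:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Sketch of the basic model: three convolutions separated by batch
# normalization and dropout, then two dense ReLU layers and a softmax.
# Hyperparameter values are placeholders, not the tuned ones.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(128, 128, 1)),   # grayscale sketch input
    layers.Conv2D(32, 3, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.25),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.25),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),  # 10 sketch categories
])
```

The model is compiled with categorical cross-entropy loss, the same loss used for the ResNet variant.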
Results:
The basic model performed with high accuracy, averaging around 73% over 10 classes. When the top 3 predictions are taken into account, accuracy averages around 93%. The ResNet model, however, reached only about 10% accuracy over 10 classes, the same as random guessing. This may be due to the structure of the ResNet architecture modelled on Lu and Tran, whose more than 20 convolutional layers may be overkill for the amount of data we are processing (around 500 128x128 images per class). The excess capacity produces logits that converge to the same set of values for every image, causing the model to assign the same label to each sketch. Other labelling challenges may come from the dataset itself, which contains many different poses and positions of each object, as well as many drawings that are hard to recognize.
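Top-3 accuracy, the metric behind the 93% figure, can be computed from the model's output probabilities as below. This is a small illustrative helper, not our exact evaluation code:

```python
import numpy as np

def top_k_accuracy(probs, labels, k=3):
    """Fraction of examples whose true label appears among the k
    highest-probability predictions."""
    top_k = np.argsort(probs, axis=1)[:, -k:]   # indices of the k largest
    hits = [label in row for label, row in zip(labels, top_k)]
    return float(np.mean(hits))

# Toy batch: 2 examples over 4 classes.
probs = np.array([[0.5, 0.3, 0.1, 0.1],
                  [0.1, 0.2, 0.3, 0.4]])
labels = np.array([0, 2])
print(top_k_accuracy(probs, labels, k=1))  # 0.5: only the first top guess is right
print(top_k_accuracy(probs, labels, k=3))  # 1.0: both labels are in the top 3
```

Keras also provides `tf.keras.metrics.TopKCategoricalAccuracy`, which computes the same quantity during training.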
Challenges:
As is evident from the accuracy data above, the ResNet model we implemented based on our selected paper did very poorly. At first we suspected an error in our loss function, our accuracy function, or the model itself. We double-checked our loss function, categorical cross-entropy, and it matched both what the paper describes and what we had used in previous projects for this class. We then examined our implementation of the model but found no bugs, and we saw no obvious warning signs (for example, earlier in development we got a "gradients do not exist" warning from TensorFlow due to an implementation error, which we corrected). We also do not see negative learning, where the model actually gets worse at identifying sketches, which would be another sign of a problem. Instead, we see variation between batches but overall accuracy approximately the same as a random guess: with 10 categories, we consistently saw accuracy around 0.1, even though loss initially decreased and then plateaued.
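The failure mode we observed, every sketch receiving the same label, is easy to check for directly: if the logits are (nearly) identical across inputs, the argmax is constant. A hypothetical diagnostic helper along these lines (not part of our submitted code):

```python
import numpy as np

def predictions_collapsed(logits, tol=1e-6):
    """True if the network assigns (nearly) identical logits to every
    input -- the failure mode where every sketch gets the same label.
    np.ptp gives the max-minus-min range per class across the batch."""
    return bool(np.all(np.ptp(logits, axis=0) < tol))

healthy = np.array([[2.0, 0.1],
                    [0.3, 1.9],
                    [1.5, 0.2]])
collapsed = np.array([[0.7, 0.7],
                      [0.7, 0.7],
                      [0.7, 0.7]])
print(predictions_collapsed(healthy))    # False
print(predictions_collapsed(collapsed))  # True
```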
Therefore, we think the problem is a mismatch between the model and our data. The model described in the paper has 25 convolutional layers, not counting the additional convolutions used to project previous outputs to the correct size when the convolution layer sizes change. That is far more convolutional layers than we have used for any project in this class; our more successful basic model used only three. The architecture was designed to classify 250 different classes, while for our project we tried to identify only 10 classes, both to keep runtime manageable and because we expected accuracy to increase with fewer categories to learn. But the architecture is designed to start from a 128x128x1 image and progress to 250 categories, halving the feature map and doubling the number of filters every three residual layers: (128x128x1) → (64x64x64) → (32x32x128) → (16x16x256) → (8x8x512) → 250 in the original model. For our model, we changed the last step to map to our 10 categories instead of 250. This architecture gives us a very large number of convolutional filters, so we are trying to learn to detect many, many features. With 250 categories, this was likely more effective because there were more distinguishing features to detect. For us, having only 10 or 25 classes may not have provided enough information to learn so many features effectively; 512 filters in the last convolutional layer was likely overkill for our use case.
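The downsampling schedule above, adapted to 10 output classes, can be sketched in Keras as follows. The block internals (kernel sizes, where the stride goes, the 1x1 projection on the skip path) are assumptions based on standard ResNets, not an exact reproduction of Lu and Tran's layers:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_stage(x, filters, blocks=3):
    """`blocks` residual blocks; the first halves the feature map and
    doubles the channels via a strided convolution, with a 1x1 projection
    on the skip path so the shapes match for the addition."""
    for i in range(blocks):
        stride = 2 if i == 0 else 1
        shortcut = x
        y = layers.Conv2D(filters, 3, strides=stride, padding="same",
                          activation="relu")(x)
        y = layers.Conv2D(filters, 3, padding="same")(y)
        if stride != 1 or shortcut.shape[-1] != filters:
            shortcut = layers.Conv2D(filters, 1, strides=stride,
                                     padding="same")(shortcut)
        x = layers.Activation("relu")(layers.Add()([y, shortcut]))
    return x

inputs = tf.keras.Input(shape=(128, 128, 1))
x = layers.Conv2D(64, 3, strides=2, padding="same")(inputs)  # -> 64x64x64
x = residual_stage(x, 128)                                   # -> 32x32x128
x = residual_stage(x, 256)                                   # -> 16x16x256
x = residual_stage(x, 512)                                   # -> 8x8x512
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(10, activation="softmax")(x)          # 10 classes, not 250
model = tf.keras.Model(inputs, outputs)
print(model.output_shape)  # (None, 10)
```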
We also wonder if part of the issue is having too few examples, regardless of the number of categories. We only have a few hundred examples of each category, as opposed to the thousands of examples per category that we’ve had for other projects. So few examples may not have been enough to learn weights for the more than 20 convolutional layers we used in the resnet model. We discuss in more detail other issues with the dataset in the following section.
Overall, we think a ResNet is clearly a useful approach to this problem in general, but our particular case was not well suited to using one.
We also potentially had issues with the dataset. By its nature, we were using sketches made by non-artists, and drawing, for example, a recognizable hermit crab is a difficult task; some of the images were hard for us to identify even as humans. Another challenge is that within the animal classes, the animals are drawn from different perspectives. The sketches within a category are not all drawn from the same reference image or from people's general idea of the category; instead, participants were given a variety of reference images with the animal in different orientations. So a bear sketch could show the bear from either side, from the front, standing up, lying down, the whole body, just the head, and so on.
Submission for Check-In 1
Introduction:
Our project aims to identify hand drawn images. We plan to accomplish this by following the techniques used in the paper "Free-hand Sketch Recognition Classification" by Lu and Tran of Stanford University. We also plan to explore some of the potential techniques and applications addressed and utilize the dataset from "The Sketchy Database: Learning to Retrieve Badly Drawn Bunnies" by Sangkloy et al.
We decided to explore the processing of 2D hand drawn images because we are interested in exploring the relationship between human generated content and computer generated content. Being able to process and identify human generated content may provide insight and techniques that can be used to implement computer generated content that can mimic human generated content, which may be something we look to explore.
Related Work:
- http://cs231n.stanford.edu/reports/2017/pdfs/420.pdf
- https://www.cc.gatech.edu/~hays/tmp/sketchy-database.pdf
- http://cybertron.cg.tu-berlin.de/eitz/pdf/2012_siggraph_classifysketch.pdf
Data:
We will be using the "Sketchy Database" created in Sangkloy et al., which pairs 12,500 reference photographs spanning 125 categories of objects with human-generated drawings of each one.
Methodology:
We will use convolutional neural networks, experimenting with the "wide" and "normal" resnet architectures discussed in the Stanford paper by Lu and Tran. We may have to take special care to account for translation and variance in the drawings, potentially rescaling and reshaping them. The first section of the project will perform the graphical processing and the second section will address the classification.
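A minimal sketch of the rescaling/reshaping step, assuming 128x128 grayscale inputs (the exact target size and channel handling are still to be decided):

```python
import tensorflow as tf

def preprocess(image, size=128):
    """Collapse a sketch to one grayscale channel, resize it to a fixed
    square, and scale pixel values to [0, 1]."""
    if image.shape[-1] == 3:                       # drop color if present
        image = tf.image.rgb_to_grayscale(image)
    image = tf.image.resize(image, (size, size))   # uniform shape for the CNN
    return tf.cast(image, tf.float32) / 255.0

sketch = tf.random.uniform((256, 256, 3), maxval=255)  # stand-in for a drawing
print(preprocess(sketch).shape)  # (128, 128, 1)
```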
Metrics:
- Base goal: 0.56 accuracy (lowest from Stanford paper)
- Target goal: 0.65 (best from Stanford paper)
- Stretch goal: 0.73 (human accuracy)
Ethics:
Why is deep learning a good approach to this problem? Deep learning is a good approach because identifying the content of hand-drawn images requires pattern recognition that can both pick out the key features of an object and accommodate the human quirks of each drawing. These patterns may be difficult to simplify and require image processing over many data samples.
How are you planning to quantify or measure error or success? What implications does your quantification have?
We plan to quantify success by the accuracy with which the model is able to predict the images' classes. The metrics we chose are the accuracy goals listed above under Metrics.
We may also prioritize speed; we think it would be interesting to have the model predict the images in real time, even before the drawing is completed, like Google's Quick, Draw! project.
Division of Labor:
The project is split into a few main sections that we will divide between the two of us:
1) Preprocessing
2) Graphical processing
- CNN
- feature identification?
- embedding?
3) Identification
- model architecture
- "wide" and "normal" models
4) Training and testing
- optimization and tuning
5) Output
- visualizer
- possible live demo where people can draw on a tablet and submit their input to the model
Submission for Check-In 2
Challenges: The hardest part of this project has been processing the images because of the large size of the files and the fact that there are many files and classes of images that must be individually uploaded and processed. Another challenge has been implementing the ResNet because there are differences between ResNets as typically implemented and the ResNet architecture described in the paper we were inspired by.
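One way to address the large-file problem is to stream sketches from disk in batches with tf.data rather than loading every file into memory at once. This is a hypothetical sketch: the `<class>/<id>.png` folder layout assumed here is for illustration, not the Sketchy Database's exact structure:

```python
import tensorflow as tf

def make_dataset(pattern, class_names, size=128, batch=32):
    """Stream PNG sketches matching `pattern` (e.g. "data/*/*.png"),
    deriving each one-hot label from its parent folder name."""
    def load(path):
        img = tf.io.decode_png(tf.io.read_file(path), channels=1)
        img = tf.image.resize(img, (size, size)) / 255.0
        label = tf.strings.split(path, "/")[-2]            # folder name = class
        onehot = tf.cast(tf.constant(class_names) == label, tf.float32)
        return img, onehot
    return (tf.data.Dataset.list_files(pattern)
            .map(load, num_parallel_calls=tf.data.AUTOTUNE)
            .batch(batch)
            .prefetch(tf.data.AUTOTUNE))
```

Because images are decoded lazily inside the pipeline, only one batch needs to be in memory at a time.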
Insights: So far we do not have concrete results because we have not yet finished the preprocessing portion, so we cannot run our model.
Plan: We need to finish the preprocessing and begin running our model because we don't know yet whether it trains at all or if the architecture is correct. We are slightly behind schedule.
Built With
- python
- tensorflow