Poster

Free-Hand Sketch Recognition and Image Retrieval

Final Writeup

Outline

Summary: Implementation of a novel deep learning classification model using CNN and Transformer for the recognition and image retrieval task of free-hand sketches

Who: Yiqing Liang (yliang51) / Shijie Mao (smao5) / Ziyin Li (zli255)

Introduction:

The free-hand sketch is widely used in daily life as a simple yet powerful illustrative tool for various scenarios such as communication, recording, and design. It has attracted more and more attention to recognize sketches due to the prevalence of touchscreen devices that make acquiring sketch data much easier than ever in recent years. It has also been studied in computer vision and pattern recognition fields due to the rapid development of deep learning techniques that are achieving state-of-the-art performance in diverse artificial intelligence tasks.

Free-hand sketches are intrinsically different from natural photos which are pixel-perfect copies of the real world. Though sharing certain similarities with hand-written characters, they are fundamentally different given that free-hand sketch has a highly abstract and free-style nature while alphabetical hand-writing is subject to specific rules and a teaching process. In fact, free-hand sketch provides a special data modality/domain which has both domain-unique challenges and advantages, making learning meaningful representations of it remains a challenging task given its signal sparsity and the high-level abstraction despite the advantages of its lack of background and use of iconic representations.

For this project, we will focus on free-hand sketch recognition which is a classification problem aiming to predict the class label of a given sketch. It is a fundamental task in computer vision and has a variety of practical applications including interactive drawing systems, sketch-based science education, games, etc. Specifically, we will focus on exploiting the static nature of free-hand sketches by building an offline draw-guess sketch recognition system. It takes the whole sketch as an input image processed as static pixel space and predicts a class label based on the complete sketch. We will use CNNs to extract abstract visual concepts from the sketches with Transformers applied afterward for spatial long-term memory, aiming to achieve a better accuracy compared to previous work with mere CNNs. We will also try to retrieve images corresponding to the class label of each sketch from our database.

Related Work:

Recently, some researchers have conducted a comprehensive survey of the deep learning techniques oriented at free-hand sketch data and the applications that they enable (https://arxiv.org/pdf/2001.02600.pdf). They have reviewed the developments of free-hand sketch research in the deep learning era by surveying existing datasets, research topics, and state-of-the-art methods through a detailed taxonomy and experimental evaluation. Some other researchers have constructed a hybrid CNN composed of A-Net and S-Net which describe appearance information and shape information, respectively for sketch recognition as well as image retrieval (https://cse.sc.edu/~songwang/document/prl20.pdf). While the previous study only uses CNNs, Transformers have also been implemented in a Multi-Graph Transformer (MGT) which is a novel GNN for learning representations of sketches from multiple graphs (https://arxiv.org/pdf/1912.11258.pdf). In this study, researchers have tried to move from Euclidean to a topological analysis by representing the free-hand sketches as graphs and applying GNNs for sketch recognition.

Data:

We will be using Sketchy, a fine-grained sketch-based image retrieval (SBIR) dataset as our database for this project (https://dl.acm.org/doi/pdf/10.1145/2897824.2925954). It is the first large-scale collection of sketch-photo pairs with 75,471 sketches of 12,500 objects from 125 categories which can give us fine-grained associations between particular photos and sketches, enabling training cross-domain CNNs which embed sketches and photographs in a common feature space. This database can be used as a benchmark for fine-grained sketch-based image retrieval.

Methodology:

The general structure of our model will be a combination of a two-branch CNN and a Transformer.

Since sketches can naturally be described by both appearance and shape information, we will first implement a two-branch CNN where one branch named appearance CNN (A-Net) is to extract the appearance features, and the other branch (S-Net) is to extract the shape features. After combining the two types of extracted features, we will feed them into a Transformer for sketch recognition tasks, including sketch classification and sketch-based image retrieval.

More specifically, in the first place, the original sketch will be fed into the A-Net as input. Then we will extract feature vectors as appearance representation using convolutional layers. For the S-Net, in order to obtain distinctive shape features, we will first sample a number of points from a sketch and then use stacked layers to extract feature vectors as shape representation. These two networks will be trained separately and their outputs will be concatenated as the hybrid features that will be later fed into the Transformer.

Metrics:

The Sketchy dataset will be split into training, validation, and testing sets. We will train the model on the training set, pick the best combination of parameters for the model using the validation set, and then test the selected model on the testing set.

For model performance evaluation, we will choose the top-5 predictions with the highest probabilities to calculate corresponding rank accuracy as the metric.

We will also compare with the baseline method ResNet-18 and perform an ablation study to show the effectiveness of our own method. The baseline method has been implemented and its 5 rank accuracies are shown as below:

Rank Accuracy (%)	Rank-1	Rank-2	Rank-3	Rank-4	Rank-5
ResNet-18	78.4	87.3	90.9	92.8	94.0

Our base goal is to achieve a greater rank-1 accuracy compared to that of the baseline method as well as the Hybrid CNN model from a previous study. For the target goal, we are hoping that we can achieve better results for all of the 5 rank accuracies. Furthermore, for the stretch goal, we may try to tackle image retrieval.

Ethics:

Why is Deep Learning a good approach to this problem?

Deep learning allows machines to identify and extract features from images automatically. Deep neural networks can learn the features to look for in images by analyzing lots of pictures out of which our large dataset Sketchy is suitable. Also, deeper neural networks tend to achieve better accuracy in the image recognition area where our free-hand sketch recognition task falls in.

Free-hand sketches are often highly abstract and can vary based on the artistic style and drawing ability of different people. Hence, it is necessary to apply deep learning methods to automatically recognize the intent of a drawer with high accuracy while allowing him or her to draw in an unconstrained manner at the same time.

What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain?

Our dataset for the project is called Sketchy which has been created in 2016 by some researchers aiming to learn a cross-domain representation for sketches and photos which is reliable not just at retrieving objects of the correct category but also objects with fine-grained similarity to the sketch (https://dl.acm.org/doi/pdf/10.1145/2897824.2925954).

Sketches in the dataset are thousands of photographic objects spanning 125 categories drawn by crowd workers. 644 qualified individuals have been chosen to be participants in this study and have collectively spent 3,921 hours sketching for the dataset over 6 months. For sketch collection, the participants were not allowed to trace photos; instead, the researchers revealed and then hid photos from them so that they must sketch from memory similar to the way in which a user of a sketch-based image retrieval system would be drawing based on some mental image of the desired object. In this way, they would produce more diverse and realistic sketches, making the dataset more representative.

However, we believe that certain biases could be inevitably introduced during both the participant qualification test and the sketch collection process. Since the researchers manually reviewed the qualification results to ensure each potential participant had understood the sketching process, their judgment could be biased due to the variation of understanding of sketches for different people. Also, this group of 644 participants might have a biased distribution of sex, age, race, religion, etc., which could also influence the sketches they have drawn and introduce biases.

Besides, the researchers have acknowledged a limitation of their dataset. Their data collection method assumed that there were only three discrete levels of similarity between sketches and photos – the same instance, the same category, or completely dissimilar – while sketch-photo similarity as perceived by a human would be more continuous. Collecting more pairs of perceptual distances between the sketches and photos would probably help to have more annotations, especially within categories where there could be several photos matching one sketch reasonably well. However, out of the considerations for sustainable deep learning, we think it is not economical and necessary to try to increase the sketch-photo perceptual distance pairs for the dataset since the number of images in it is quite large and it is already representative enough for our work.