Introduction
In our project, we tackle the problem of image classification using the Residual Network (ResNet), introduced by a Microsoft Research team in 2015. The paper addresses the degradation problem in deep convolutional networks: as depth increases, accuracy saturates and then degrades rapidly. The authors' solution is residual learning, which uses "skip connections" (identity mappings) that add the output of an earlier layer to the output of a layer further ahead. We will apply residual learning to self-designed convolutional networks that follow the principles of existing well-performing networks. We chose this paper because we were fascinated by the CNN assignment on two-class image classification, so we wanted to design a model that recognizes a broader pool of objects and performs better. To accomplish these two goals, we decided to implement a deep CNN with ResNet.
Original paper: He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. https://arxiv.org/pdf/1512.03385.pdf
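The core idea is compact enough to sketch directly. A residual block learns a residual function F(x), and the skip connection adds the block's input back, so the block outputs F(x) + x. Below is a minimal NumPy sketch of that addition (the toy function F stands in for the block's convolution layers; it is not our actual model code):

```python
import numpy as np

def residual_output(x, F):
    """Output of a residual block: the learned residual F(x) plus the identity shortcut x."""
    return F(x) + x

# Toy residual function: a fixed scaling standing in for the block's conv layers.
F = lambda x: 0.5 * x

x = np.array([2.0, 4.0])
y = residual_output(x, F)  # F(x) + x = [1, 2] + [2, 4] = [3, 6]
```

Because the shortcut is an identity mapping, a block can fall back to passing its input through unchanged (F(x) ≈ 0), which is what makes very deep stacks easier to optimize.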
Related Works
Introduction to ResNet: https://towardsdatascience.com/introduction-to-resnets-c0a830a288a4
This article begins with the motivation behind ResNet, discussing the problems of deep networks. It then goes into detail about the residual block and its mathematical meaning. Finally, the author gives an example of the VGG network with and without residual learning, along with an implementation of the model in Keras.
VGG16: https://neurohive.io/en/popular-networks/vgg16/
This article gives an overview of the VGG-16 network, which we use as the basis for our implementation of the Plain-16 network.
Data
We are planning to gradually increase the complexity of the dataset. We will start with CIFAR-2 (a two-class subset of CIFAR-10), then use the full CIFAR-10, and finally, if possible, try the ImageNet dataset. All of them are available in TensorFlow Datasets, so no extensive preprocessing is needed.
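For the CIFAR-2 stage, the subset can be built by masking the CIFAR-10 arrays. The sketch below uses toy stand-in arrays in place of the real `tf.keras.datasets.cifar10.load_data()` output; the class indices 3 and 5 are cat and dog in CIFAR-10's label scheme:

```python
import numpy as np

CAT, DOG = 3, 5  # CIFAR-10 label indices for cat and dog

def two_class_subset(images, labels, class_a=CAT, class_b=DOG):
    """Keep only two classes and relabel them 0/1 for binary classification."""
    labels = labels.reshape(-1)
    mask = (labels == class_a) | (labels == class_b)
    return images[mask], (labels[mask] == class_b).astype(np.int64)

# Toy stand-in for (x_train, y_train) from CIFAR-10:
images = np.zeros((6, 32, 32, 3))
labels = np.array([3, 0, 5, 5, 1, 3])
x2, y2 = two_class_subset(images, labels)
# x2.shape -> (4, 32, 32, 3); y2 -> [0, 1, 1, 0]
```

The same function then applies unchanged when we swap in the real CIFAR-10 arrays.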
Methodology
Our goal is to design a ResNet-34, as shown in the architecture figure (Figure 3) of the original paper. Considering the computational cost, we will start with ResNet-22 to see how the model performs and then increase the number of layers. We will train our model on the CIFAR-10 dataset. At first, we will train locally and classify cat/dog only, to test whether the model runs reasonably well. After that, we will move the model to the Google Cloud Platform or a department machine with GPUs to train on the full CIFAR-10.

The hardest part of implementing the model is finding hyperparameters, e.g. the batch size and learning rate, that achieve a satisfying accuracy. The residual block uses a shortcut to add the output of the previous block to the output of the current block, so another challenge will be choosing convolution parameters that match the dimensionalities of the connected layers and make the addition in the residual block well-defined. A further possible problem is limited computational power: since there are many parameters to learn, the model may run slowly, and we will need to speed up training with GPUs.

We plan to train on CIFAR-10 and measure classification accuracy. On CIFAR-10, the authors of the paper report an error of 8.75% with ResNet-20 and 7.51% with ResNet-32, comparing each residual network against a plain network of the same depth. The results in the paper show that deep plain nets suffer from increased depth and exhibit higher training error when going deeper, while ResNets overcome the optimization difficulty and demonstrate accuracy gains as depth increases. We also want to see accuracy increase as our ResNet goes deeper; if possible, we will increase the depth to 50 layers to see further accuracy gains.
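The dimension-matching challenge above can be illustrated with the paper's parameter-free shortcut (its "option A"): when a block halves the spatial size and doubles the channel count, the shortcut subsamples the input with stride 2 and zero-pads the extra channels so the addition type-checks. A NumPy sketch of that shortcut (not our training code, which will use convolutional layers in TensorFlow):

```python
import numpy as np

def identity_shortcut(x, out_channels, stride=2):
    """Parameter-free shortcut for dimension changes (option A in the ResNet paper):
    subsample spatially with the given stride and zero-pad the new channels."""
    # x has shape (batch, height, width, channels)
    x = x[:, ::stride, ::stride, :]                        # spatial downsampling
    pad = out_channels - x.shape[-1]                       # extra channels needed
    return np.pad(x, ((0, 0), (0, 0), (0, 0), (0, pad)))  # zero-pad channel axis

x = np.ones((1, 8, 8, 16))
shortcut = identity_shortcut(x, out_channels=32)
# shortcut.shape -> (1, 4, 4, 32), matching a stride-2, 32-filter conv branch,
# so conv_branch_output + shortcut is a valid elementwise addition.
```

The paper's alternative ("option B") replaces the zero-padding with a learned 1x1 convolution of stride 2, which is how dimension-changing shortcuts are usually written in Keras.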
The base goal is to implement ResNet-22 on CIFAR-10, classifying only two classes. The target goal is to implement ResNet-22 on CIFAR-10, classifying all ten classes. The stretch goal is to implement ResNet-34 on CIFAR-10 with all ten classes.
Ethics: What broader societal issues are relevant to your chosen problem space?
The most important societal issue in our chosen problem space, image classification, is bias. Biases are innate to deep learning models due to the nature of the algorithms and data used to train them. In reality, as more and more DL image classification programs enter the market, it has become evident that these programs can actually hurt some groups of people, especially minorities, with unfair predictions. In fact, models from Microsoft, IBM, and Face++ were shown to perform much worse on images of darker-skinned women than on images of lighter-skinned men. If such image classification models are used in policing or the job market, they will only exacerbate discrimination. Fortunately, the companies modified their models to even out performance across image sets of different social groups, but the incident exposed real shortcomings of DL image classification. Another societal issue caused by automated image classification is over-reliance on technology. For example, the car industry is actively incorporating DL models to build better self-driving cars. While the power of DL opens new opportunities, DL is far from perfect, so relying on it can backfire in serious ways. This raises several questions that, in our opinion, we need to answer before actually using DL image classification in practice: "What if the model fails?", "What if passersby manipulate the model to hurt others?", "Who is responsible for the model's failures?".
Ethics: Why is Deep Learning a good approach to this problem?
An alternative way to do image classification would be a hard-coded approach that recognizes patterns in images using if-else statements. While this approach could work in theory, a program that classifies an arbitrary input image would need an enormous number of if-statements to account for every possible input, which makes it practically impossible to write. On top of that, even if we created a non-learning algorithm for image classification, we would not be able to reuse or extend that code for different tasks. In contrast to the hard-coded approach, deep learning programs are comparatively simple and intuitive to write. The biggest advantage of DL for image classification is that we, the developers, only have to specify the rules for the training process, and the program finds the desired parameters (weights), which define the classification function, on its own. Furthermore, the concepts and models in deep learning are reusable and extendable for other tasks and improvements. Those are the reasons why deep learning is a good approach to this problem.
Division of labor
We will both read the paper and come up with the model structure that we want to implement. Then, Min Seong is in charge of data preprocessing and writing the residual-block layer. Yingjie will build the model, i.e. fill in the call function, train function, and test function. Finally, we will both train our model on CIFAR-10 using ResNet-22 and ResNet-34 and tune our hyperparameters to get better accuracy.
Final Writeup
https://docs.google.com/document/d/1URF8K-AhSye58Da_lR8cy9qCGrWzIexFtZx6zmW_C1U/edit?usp=sharing
Built With
- python
- tensorflow