Introduction
In our project, we tackle the problem of image classification using the Residual Network (ResNet), introduced by a Microsoft Research team in 2015. The paper addresses the degradation problem in deep convolutional networks: as depth increases, accuracy saturates and then degrades rapidly. The authors' solution is residual learning, which uses "skip connections" (identity mappings) that add the output of an earlier layer to the output of a layer further ahead. We will apply residual learning to self-designed convolutional networks that follow the principles of existing well-performing networks. We chose this paper because we were fascinated by the CNN assignment on two-class image classification, so we wanted to design a model that recognizes a broader pool of objects and performs better. To accomplish these two goals, we decided to implement a deep CNN with ResNet.
Original paper: He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. https://arxiv.org/pdf/1512.03385.pdf
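The core idea is compact enough to sketch directly. A residual block learns a residual function F(x), and the skip connection adds the block's input back, so the block outputs F(x) + x. Below is a minimal NumPy sketch of that addition (the toy function F stands in for the block's convolution layers; it is not our actual model code):

```python
import numpy as np

def residual_output(x, F):
    """Output of a residual block: the learned residual F(x) plus the identity shortcut x."""
    return F(x) + x

# Toy residual function: a fixed scaling standing in for the block's conv layers.
F = lambda x: 0.5 * x

x = np.array([2.0, 4.0])
y = residual_output(x, F)  # F(x) + x = [1, 2] + [2, 4] = [3, 6]
```

Because the shortcut is an identity mapping, a block can fall back to passing its input through unchanged (F(x) ≈ 0), which is what makes very deep stacks easier to optimize.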
Related Works
Introduction to ResNet: https://towardsdatascience.com/introduction-to-resnets-c0a830a288a4
This article begins with the motivation behind ResNet, discussing the problems of deep networks. It then goes into detail about the residual block and its mathematical meaning. Finally, the author gives an example of the VGG network with and without residual learning, along with an implementation of the model in Keras.
VGG16: https://neurohive.io/en/popular-networks/vgg16/
This article gives an overview of the VGG-16 network, which we use as the basis for our implementation of the Plain-16 network.
Data
We are planning to gradually increase the complexity of the dataset. We will start with CIFAR-2 (a two-class subset of CIFAR-10), then use the full CIFAR-10, and finally, if possible, try the ImageNet dataset. All of them are available in TensorFlow Datasets, so no extensive preprocessing is needed.
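For the CIFAR-2 stage, the subset can be built by masking the CIFAR-10 arrays. The sketch below uses toy stand-in arrays in place of the real `tf.keras.datasets.cifar10.load_data()` output; the class indices 3 and 5 are cat and dog in CIFAR-10's label scheme:

```python
import numpy as np

CAT, DOG = 3, 5  # CIFAR-10 label indices for cat and dog

def two_class_subset(images, labels, class_a=CAT, class_b=DOG):
    """Keep only two classes and relabel them 0/1 for binary classification."""
    labels = labels.reshape(-1)
    mask = (labels == class_a) | (labels == class_b)
    return images[mask], (labels[mask] == class_b).astype(np.int64)

# Toy stand-in for (x_train, y_train) from CIFAR-10:
images = np.zeros((6, 32, 32, 3))
labels = np.array([3, 0, 5, 5, 1, 3])
x2, y2 = two_class_subset(images, labels)
# x2.shape -> (4, 32, 32, 3); y2 -> [0, 1, 1, 0]
```

The same function then applies unchanged when we swap in the real CIFAR-10 arrays.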
Methodology
Our goal is to design a ResNet-34, as shown in the architecture figure (Figure 3) of the original paper. Considering the computational cost, we will start with ResNet-22 to see how the model performs and then increase the number of layers. We will train our model on the CIFAR-10 dataset. At first, we will train locally and classify cat/dog only, to test whether the model runs reasonably well. After that, we will move the model to the Google Cloud Platform or a department machine with GPUs to train on the full CIFAR-10.

The hardest part of implementing the model is finding hyperparameters, e.g. the batch size and learning rate, that achieve a satisfying accuracy. The residual block uses a shortcut to add the output of the previous block to the output of the current block, so another challenge will be choosing convolution parameters that match the dimensionalities of the connected layers and make the addition in the residual block well-defined. A further possible problem is limited computational power: since there are many parameters to learn, the model may run slowly, and we will need to speed up training with GPUs.

We plan to train on CIFAR-10 and measure classification accuracy. On CIFAR-10, the authors of the paper report an error of 8.75% with ResNet-20 and 7.51% with ResNet-32, comparing each residual network against a plain network of the same depth. The results in the paper show that deep plain nets suffer from increased depth and exhibit higher training error when going deeper, while ResNets overcome the optimization difficulty and demonstrate accuracy gains as depth increases. We also want to see accuracy increase as our ResNet goes deeper; if possible, we will increase the depth to 50 layers to see further accuracy gains.
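The dimension-matching challenge above can be illustrated with the paper's parameter-free shortcut (its "option A"): when a block halves the spatial size and doubles the channel count, the shortcut subsamples the input with stride 2 and zero-pads the extra channels so the addition type-checks. A NumPy sketch of that shortcut (not our training code, which will use convolutional layers in TensorFlow):

```python
import numpy as np

def identity_shortcut(x, out_channels, stride=2):
    """Parameter-free shortcut for dimension changes (option A in the ResNet paper):
    subsample spatially with the given stride and zero-pad the new channels."""
    # x has shape (batch, height, width, channels)
    x = x[:, ::stride, ::stride, :]                        # spatial downsampling
    pad = out_channels - x.shape[-1]                       # extra channels needed
    return np.pad(x, ((0, 0), (0, 0), (0, 0), (0, pad)))  # zero-pad channel axis

x = np.ones((1, 8, 8, 16))
shortcut = identity_shortcut(x, out_channels=32)
# shortcut.shape -> (1, 4, 4, 32), matching a stride-2, 32-filter conv branch,
# so conv_branch_output + shortcut is a valid elementwise addition.
```

The paper's alternative ("option B") replaces the zero-padding with a learned 1x1 convolution of stride 2, which is how dimension-changing shortcuts are usually written in Keras.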
The base goal is to implement ResNet-22 on CIFAR-10, classifying only two classes. The target goal is to implement ResNet-22 on CIFAR-10, classifying all ten classes. The stretch goal is to implement ResNet-34 on CIFAR-10 with all ten classes.
Ethics: What broader societal issues are relevant to your chosen problem space?
The most important societal issue in our chosen problem space, image classification, is bias. Biases are innate to deep learning models due to the nature of the algorithms and data used to train them. In reality, as more and more DL image classification programs enter the market, it has become evident that these programs can actually hurt some groups of people, especially minorities, with unfair predictions. In fact, models from Microsoft, IBM, and Face++ were shown to perform much worse on images of darker-skinned women than on images of lighter-skinned men. If such image classification models are used in policing or the job market, they will only exacerbate discrimination. Fortunately, the companies modified their models to even out performance across image sets of different social groups, but the incident exposed real shortcomings of DL image classification. Another societal issue caused by automated image classification is over-reliance on technology. For example, the car industry is actively incorporating DL models to build better self-driving cars. While the power of DL opens new opportunities, DL is far from perfect, so relying on it can backfire in serious ways. This raises several questions that, in our opinion, we need to answer before actually using DL image classification in practice: "What if the model fails?", "What if passersby manipulate the model to hurt others?", "Who is responsible for the model's failures?".
Ethics: Why is Deep Learning a good approach to this problem?
An alternative way to do image classification would be a hard-coded approach that recognizes patterns in images using if-else statements. While this approach could work in theory, a program that classifies an arbitrary input image would need an enormous number of if-statements to account for every possible input, which makes it practically impossible to write. On top of that, even if we created a non-learning algorithm for image classification, we would not be able to reuse or extend that code for different tasks. In contrast to the hard-coded approach, deep learning programs are comparatively simple and intuitive to write. The biggest advantage of DL for image classification is that we, the developers, only have to specify the rules for the training process, and the program finds the desired parameters (weights), which define the classification function, on its own. Furthermore, the concepts and models in deep learning are reusable and extendable for other tasks and improvements. Those are the reasons why deep learning is a good approach to this problem.
Division of labor
We will both read the paper and come up with the model structure that we want to implement. Then, Min Seong is in charge of data preprocessing and writing the residual-block layer. Yingjie will build the model, i.e. fill in the call function, train function, and test function. Finally, we will both train our model on CIFAR-10 using ResNet-22 and ResNet-34 and tune our hyperparameters to get better accuracy.
Final Writeup
https://docs.google.com/document/d/1URF8K-AhSye58Da_lR8cy9qCGrWzIexFtZx6zmW_C1U/edit?usp=sharing
Built With
- python
- tensorflow