Title: The Detection & Classification of Traffic Signs using Convolutional Neural Networks

Who: Sahil Bansal (sbansa12), Shreyas Raman (ssunda11), Emily Zhang (ezhang52)

Paper: https://www.researchgate.net/publication/311610541_Traffic-Sign_Detection_and_Classification_in_the_Wild

Final Writeup: https://docs.google.com/document/d/14BhSDrJvWzFaDc4MqUGe8yBz8A7wVeeDw5J0qN1dND0/edit?usp=sharing

Introduction:

The paper we are trying to implement detects and classifies traffic signs using multi-layered CNNs, particularly in noisy images (e.g. under different lighting or weather conditions). The paper extends the traditional application of CNNs in image processing by jointly detecting traffic signs and classifying them (based on shape, color and structure). The model stands out by taking input images in which the target object occupies a small (approx. 80x80 pixel) region of a large (2000x2000 pixel) image. Whilst previous object-detection models (e.g. those using Support Vector Machines) have been effective at detecting target objects that occupy a large portion of the pixel space, the multi-layer CNN in this paper is reported to far outperform them.

We selected this paper because it lets us expand our understanding of image classification with CNNs through a branched 3-stream architecture for pixel classification, bounding boxes (detection) and labels (classification). Object detection, as opposed to classification alone, goes beyond the content covered in the course; we were therefore interested in exploring how it can be achieved with a CNN architecture and how it affects model performance compared with the standard multi-layer CNNs covered in lecture. Image detection and recognition can be used in several real-life scenarios, e.g. autonomous vehicles recognizing traffic signs while driving, or smart glasses and cameras that auto-focus using object detection.

Related Work:

We were not aware of any specific examples of traffic sign detection and classification algorithms prior to starting this project. Traffic sign detection is, however, a widely covered subject in the field of autonomous vehicles. This article (https://phys.org/news/2019-05-traffic-recognition-influential-decade.html) touches upon traffic sign recognition in this field.

The article mentions that traffic signs are much simpler to categorize than other, more complex objects, as they all use a simple, relatively standard set of colors, shapes, and symbols. Another interesting point made is that an autonomous vehicle must oftentimes rely on “real-time feeds” of what the camera can see. The article essentially describes how traffic sign detection is becoming less of a convenience and more of a necessity as autonomous vehicles become more popular.

Public Implementations:

-https://cg.cs.tsinghua.edu.cn/traffic-sign/ (Source Code of Paper, Written in Caffe)
-https://lijiancheng0614.github.io/2019/04/16/2019_04_16_TT100K/#architecture (Code of Model in Paper)
-https://github.com/asyncbridge/tsinghua-tencent-100k (Code of Model, Uses Caffe)
-https://github.com/JunshengFu/traffic-sign-recognition (Traffic Sign Recognition Model, Not the Paper’s Model)
-https://github.com/jacobssy/Traffic_Sign_detection (Traffic Sign Detection Model, Not the Paper’s Model)
-https://github.com/vamsiramakrishnan/TrafficSignRecognition (Traffic Sign Recognition Model, Not the Paper’s Model)

Data: We have found the following datasets that can be used:

https://www.kaggle.com/valentynsichkar/traffic-signs-preprocessed
A little fewer than 87000 examples in the training dataset, 43 classes. Has 9 pickle files containing the train, validation, and testing datasets. Files 0-3 have RGB images, files 4-8 are greyscale. All preprocessing has been done.

https://www.kaggle.com/meowmeowmeowmeowmeow/gtsrb-german-traffic-sign
The German Traffic Sign Recognition Benchmark is a standardised dataset that has over 50000 images with 43 classes. It has been preprocessed to include classID, shapeID, colorID, sign ID.

https://sid.erda.dk/public/archives/ff17dc924eba88d5d01a807357d6614c/published-archive.html
German Traffic Sign Detection Benchmark with 600 training images, 300 testing images, and ground truth for both the train and test sets (the ground truth for the test set is in a separate file).

https://cg.cs.tsinghua.edu.cn/traffic-sign/
Original dataset for the paper. The researchers annotated the images by hand, recording the bounding box, boundary vertices, and class label for each sign. Has 100000 images.

We may need to weed out classes with too few instances. For example, the paper states that classes with fewer than 100 instances were weeded out, while classes with fewer than 1000 instances were augmented (by randomly rotating the “standard template” for a class) to give them more than 1000. We may also need to add random images without street signs for additional noise. These datasets have already been preprocessed, so we would only need to read in the data and perhaps augment it through rotations, size changes, removing some classes, or similar.
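The class-filtering rule described above (drop classes below 100 instances, augment classes below 1000) can be sketched as a small planning step. This is only an illustration: the function name, the class labels and the counts are hypothetical, and the thresholds are taken from the description above rather than from any released code.

```python
def plan_class_balancing(counts, min_keep=100, target=1000):
    """Decide per class whether to drop it, augment it, or keep it unchanged.

    counts:   dict mapping class label -> number of annotated instances
    min_keep: classes with fewer instances than this are dropped
    target:   classes below this are augmented until they reach it
    """
    plan = {}
    for label, n in counts.items():
        if n < min_keep:
            plan[label] = ("drop", 0)
        elif n < target:
            # Minimum number of extra (e.g. rotated-template) samples needed.
            plan[label] = ("augment", target - n)
        else:
            plan[label] = ("keep", 0)
    return plan


# Hypothetical class labels and counts, for illustration only.
counts = {"class_a": 2500, "class_b": 430, "class_c": 12}
plan = plan_class_balancing(counts)
for label, (action, extra) in plan.items():
    print(label, action, extra)
```

Keeping the decision separate from the augmentation itself makes it easy to inspect how many classes survive before any images are generated.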

Methodology:

The backbone architecture of our model is a multi-layered Convolutional Neural Network (CNN). The object detection part of the network (i.e. bounding box generation and object localization for multiple signs) looks like it will be the hardest to implement. We are thinking of following the paper’s use of Selective Search, Edge Boxes and BING; however, the paper does not describe its implementation of these methods in detail, so we are also looking into adapting the paper’s model by implementing a Mask R-CNN or Fast R-CNN network as an initial working component for traffic sign detection.

Within the overall CNN pipeline, the paper splits the network into three branches after the 6th layer: a pixel layer (the probability that a 4x4 pixel region contains the target object), a bounding-box layer (the distances between a 4x4 pixel region and the four sides of the target object’s predicted bounding box), and a label layer (a classification vector, similar to a logit vector, with the probability of belonging to each subclass of traffic signs). The paper’s description of the label layer is also slightly unclear, as the layer seems to output a single probability vector even though an image can contain more than one target (traffic sign). How the model handles images with multiple signs is yet to be determined.
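To make the pixel and bounding-box outputs more concrete, here is a minimal sketch of how one 4x4 cell's outputs could be turned into a box. It rests on assumptions the paper does not confirm: that the four distances are ordered (left, top, right, bottom), that they are measured in pixels from the centre of the cell, and that a threshold of 0.5 on the pixel probability selects cells. The function names are hypothetical.

```python
CELL = 4  # each output cell covers a 4x4 pixel region of the input


def decode_cell(row, col, distances, cell=CELL):
    """Convert one cell's side distances into (x_min, y_min, x_max, y_max).

    Assumption: distances = (left, top, right, bottom), in pixels,
    measured from the centre of the cell at grid position (row, col).
    """
    left, top, right, bottom = distances
    cx = col * cell + cell / 2
    cy = row * cell + cell / 2
    return (cx - left, cy - top, cx + right, cy + bottom)


def detect(pixel_probs, bbox_dists, threshold=0.5):
    """Return (probability, box) for every cell whose probability passes the threshold."""
    boxes = []
    for r, prob_row in enumerate(pixel_probs):
        for c, p in enumerate(prob_row):
            if p >= threshold:
                boxes.append((p, decode_cell(r, c, bbox_dists[r][c])))
    return boxes


# Tiny 2x2 grid of made-up outputs.
pixel_probs = [[0.1, 0.9], [0.2, 0.3]]
bbox_dists = [[(0, 0, 0, 0), (2, 2, 10, 10)], [(0, 0, 0, 0), (0, 0, 0, 0)]]
print(detect(pixel_probs, bbox_dists))
```

Under this reading, every cell above the threshold proposes its own box, so several signs in one image would simply produce several groups of boxes. Whether the paper's label layer works the same way per cell is not something the text settles.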

The general layout of the model will be 8 convolution layers: layers 1, 2 and 5 will have pooling and stride; layers 1 and 2 will have an additional local response normalization (LRN) layer; layers 6 and 7 will have additional dropout. We will fork after the 6th layer, so that layer 7 consists of 3 separate convolution layers with dropout, and layer 8 consists of 3 parallel branches connected to those in layer 7: the bounding box, pixel, and label layers.
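The layout above can be written down as a plain data structure before committing to any framework. The sketch below only records which layers pool, normalize, use dropout, or belong to a branch; kernel sizes, strides and channel counts are deliberately left out because they are not stated here, and the layer names are hypothetical.

```python
def build_layer_plan():
    """Describe the 8-layer layout: a shared trunk (1-6), then three branches (7-8)."""
    pooled = {1, 2, 5}    # layers followed by pooling
    lrn = {1, 2}          # layers followed by local response normalization
    dropout = {6, 7}      # layers followed by dropout
    branches = ("pixel", "bounding_box", "label")

    plan = []
    for i in range(1, 7):  # shared trunk: conv1 .. conv6
        plan.append({"name": f"conv{i}", "branch": None,
                     "pool": i in pooled, "lrn": i in lrn,
                     "dropout": i in dropout})
    for i in (7, 8):       # after the fork: one conv7 and one conv8 per branch
        for b in branches:
            plan.append({"name": f"conv{i}_{b}", "branch": b,
                         "pool": False, "lrn": False,
                         "dropout": i in dropout})
    return plan


plan = build_layer_plan()
for layer in plan:
    print(layer)
```

A description like this can later be walked to create the actual layers, and it makes the fork point easy to check against the paper's diagram.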

We will split the data into training and testing samples in a 2:1 ratio. The paper suggests training the CNNs with Hinge Loss Stochastic Gradient Descent (HLSGD).
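A 2:1 split is simple to get wrong by an off-by-one, so here is a minimal sketch. It assumes the samples fit in a list and that a fixed shuffle seed is acceptable for reproducibility; the function name is hypothetical.

```python
import random


def split_two_to_one(samples, seed=0):
    """Shuffle the samples and split them into (train, test) at a 2:1 ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


train, test = split_two_to_one(range(900))
print(len(train), len(test))
```

For a detection dataset it is usually safer to split by image rather than by individual sign, so that signs from the same photograph never land on both sides of the split.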

Metrics:

The paper’s model has an accuracy of 84% and a recall of 94% at a Jaccard similarity coefficient of 0.5. The paper also follows the Microsoft COCO benchmark in dividing the traffic signs according to the pixel size of the target in the image: small objects (area < 32^2 pixels), medium objects (32^2 < area < 96^2 pixels) and large objects (area > 96^2 pixels), thereby differentiating the model’s performance on targets of different sizes.
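Both metrics mentioned here are easy to state in code. The sketch below computes the Jaccard similarity coefficient (intersection over union) of two axis-aligned boxes and assigns a box to a COCO size bucket (small below 32^2 pixels of area, large above 96^2, medium in between). Boxes are assumed to be (x_min, y_min, x_max, y_max) tuples, areas exactly on a boundary are counted as medium, and the function names are hypothetical.

```python
def jaccard(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def coco_size_bucket(box):
    """Classify a box as 'small', 'medium' or 'large' by its pixel area."""
    x0, y0, x1, y1 = box
    area = (x1 - x0) * (y1 - y0)
    if area < 32 ** 2:
        return "small"
    if area <= 96 ** 2:  # boundary values are treated as medium (assumption)
        return "medium"
    return "large"


predicted = (0, 0, 10, 10)
ground_truth = (5, 0, 15, 10)
print(jaccard(predicted, ground_truth))    # overlap 50 over union 150
print(coco_size_bucket((0, 0, 80, 80)))    # an 80x80 sign
```

A detection counts as correct at a Jaccard threshold of 0.5 when `jaccard(predicted, ground_truth) >= 0.5`; accuracy and recall can then be reported separately for each size bucket.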

Our base goal and metric will encompass the general accuracy of the model. We want to be able to detect and classify street signs with an accuracy of 65-75% (with perhaps an 80-85% on the German Dataset).

If we run into problems with training on such a large dataset, or cannot get detection working with the original dataset, we may have to narrow the goal to classifying street signs and work with a simpler dataset that only includes traffic signs with labels. In that case we may also try to incorporate the German Detection Dataset if possible.

Our target goal is to be able to detect and classify street signs. Time permitting, we aim to be able to detect and classify medium to large objects (as defined by Microsoft COCO) and segment our accuracies between these differently sized images.

If we are able to achieve our target goal, we will work on improving accuracy on smaller objects. Another stretch goal, once we reach our desired accuracy, is to evaluate our model with other metrics such as the Jaccard similarity coefficient.

Ethics:

What broader societal issues are relevant to your chosen problem space?

Traffic sign detection can be useful for people who struggle with driving (perhaps because of vision problems or another disability). On the whole, autonomous vehicles can be used by people who are unable to drive or who need assistance when driving, and traffic sign detection is an integral part of such vehicles. Traffic sign detection may also be used in other applications. For example, if a driver often misses traffic signs (e.g. an elderly driver), an application could announce the presence of a traffic sign ahead with an audio cue. This contributes to overall road safety, as a traffic sign detection algorithm would reduce human error (like missing a sign or driving past a stop sign). An example of this would be a semi-autonomous car stopping at a traffic sign that a human driver might drive past.

Who are the major “stakeholders” in this problem, and what are the consequences of mistakes made by your algorithm?

The algorithm would mainly be used by automobile companies; when we looked up traffic sign detection or classification, several of the results were from car companies like Nissan. These companies use such algorithms for autonomous and semi-autonomous vehicles, which means that the algorithm plays an important role in the vehicles' decision making. Mistakes in the algorithm could have terrible consequences, as they could result in injury or death if the autonomous vehicle performs the wrong action in response to a wrongly classified street sign. It is important not to assume that the algorithm is always accurate: understanding the flaws in our algorithm lets us know the limitations of any application of the model.

Division of labor:

-Emily: Writing/Poster Creation and Helping With Coding

-Shreyas: Dealing with Bounding Boxes and Visualizer

-Sahil: Coding Model

Updates

Challenges: We have had challenges converting the model in the paper into code. The paper is very vague about its architecture (especially the three sets of layers that branch after the sixth convolution layer, and the OverFeat architecture) and barely mentions the distinction between the bounding box, pixel, and label layers. It has therefore been difficult for us to build the parts of the model involving these layers. We also still do not understand how the model in the paper implements multi-object detection and label handling. Additionally, because the paper's model is written in Caffe, some Caffe functionality cannot be directly translated into TensorFlow. For example, Caffe allows separate layers to have different learning rates, but Keras does not offer this directly, so we needed to find out how to set a learning rate for each layer. The solution we decided on was to use multiple optimizers and to split the variables from different layers into separate lists, one for each optimizer to update. Other potential solutions we found were: writing our own SGD optimizer that lets us set a learning rate multiplier for each layer (and also lets us change the learning rate decay, although we cannot control how many steps the learning rate decays over); using @tf.custom_gradient to define a function with a custom gradient; collecting learning rate multipliers for each variable and applying them to the gradients before the gradients are applied; or writing code to set the learning rate specifically for one layer. If our current solution does not work, we will turn to one of the other solutions listed.
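One of the alternatives listed above, collecting a learning rate multiplier for each variable and applying it at update time, can be sketched without any framework. This is a plain-Python illustration of the idea rather than TensorFlow code: variables are lists of floats, the update is vanilla SGD, and the variable names and multiplier values are made up.

```python
def sgd_step(params, grads, base_lr, lr_mult):
    """One SGD step where each named variable uses base_lr * its own multiplier.

    params, grads: dicts mapping variable name -> list of floats
    lr_mult:       dict mapping variable name -> multiplier (default 1.0)
    """
    updated = {}
    for name, values in params.items():
        lr = base_lr * lr_mult.get(name, 1.0)
        updated[name] = [w - lr * g for w, g in zip(values, grads[name])]
    return updated


# Illustrative values only: the bias gets twice the base learning rate.
params = {"conv1/kernel": [1.0, 2.0], "conv1/bias": [0.5]}
grads = {"conv1/kernel": [0.5, -1.0], "conv1/bias": [1.0]}
lr_mult = {"conv1/kernel": 1.0, "conv1/bias": 2.0}

new_params = sgd_step(params, grads, base_lr=0.1, lr_mult=lr_mult)
print(new_params)
```

For plain SGD, scaling the gradient by the multiplier before a single shared optimizer applies it gives the same update as scaling the learning rate. With momentum or adaptive optimizers the two are not guaranteed to match, which is one reason separate optimizers per variable group can be the more predictable choice.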

Insights: We have coded what is covered in the paper, but have realized that the information in the paper does not match what we have found in our research on OverFeat. We currently have all of the layers of the multi-layer CNN completed (i.e. everything in the diagram in the paper). We have also completed the majority of our loss and accuracy functions (excluding code relating to the bounding box layer) and most of our train and test loop. However, we do not have any concrete results yet, because we are still adding OverFeat code to our model after researching it (it was not covered in depth in the paper). We have also been unable to run our model because of problems with the dataset. These issues and potential solutions are discussed in the following section (Plan). Some further tweaks will be made once the dataset is chosen.

Plan: Of the multiple approaches to “detection” presented in the paper (and its background), we are attempting to implement object detection with either OverFeat or Fast R-CNN. Whilst the former is what the paper implements, the latter has a better-defined scope and more literature to rely upon. Additionally, the paper itself was not able to offer comparisons to Fast R-CNN approaches, due to lack of source code. Because OverFeat (with EdgeBoxes and BING) has the advantage of relying on unsupervised mechanisms, we are attempting to implement it first and will move on to Fast R-CNN if OverFeat does not work.

Further research needs to be done on the losses and mechanisms governing the EdgeBoxes and BING methods, as well as on how they fit into an OverFeat architecture, so that we can integrate them into our model. One concern we have is the size of the dataset we are planning to use; it is extremely large (100GB). We need to consider whether we can use part of the current dataset or whether we need to use a different one. If we use a different dataset, we may need additional preprocessing (e.g. superimposing images of traffic signs onto different backgrounds to mimic the dataset used in the paper). We have not implemented this yet, because the large dataset used in the paper has already been preprocessed and we are undecided on which dataset to use. We may also need to change the size of the output and the sizes of the first and last layers if the dataset changes.
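The superimposing step mentioned above could look roughly like the following. It is a minimal sketch under simplifying assumptions: images are 2D lists of grey values rather than real image arrays, the sign is pasted as an opaque rectangle without scaling or blending, and the function name is hypothetical. The useful part is that the bounding box annotation comes for free from the paste position.

```python
import random


def paste_sign(background, sign, rng):
    """Paste `sign` at a random position on a copy of `background`.

    Returns (image, box), where box = (x_min, y_min, x_max, y_max) in pixels.
    """
    bg_h, bg_w = len(background), len(background[0])
    s_h, s_w = len(sign), len(sign[0])
    top = rng.randint(0, bg_h - s_h)    # randint is inclusive on both ends
    left = rng.randint(0, bg_w - s_w)

    image = [row[:] for row in background]  # copy so the background is reusable
    for dy in range(s_h):
        image[top + dy][left:left + s_w] = sign[dy]
    return image, (left, top, left + s_w, top + s_h)


rng = random.Random(0)
background = [[0] * 64 for _ in range(48)]  # 64 wide, 48 high, all zeros
sign = [[1] * 8 for _ in range(8)]          # 8x8 patch of ones
image, box = paste_sign(background, sign, rng)
print(box)
```

A fuller version would also vary the scale, rotation and brightness of the sign so that the synthetic images are closer to real street scenes.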
