
Introduction: The paper we are implementing detects and classifies traffic signs using multi-layered CNNs, particularly for noisy images (e.g. different lighting or weather conditions). The paper extends the traditional application of CNNs in image processing by jointly detecting and classifying several traffic signs (based on shape, color, and structure). The model stands out by taking input images where the target object occupies only a small region (approx. 80x80 pixels) of a large (2000x2000 pixel) image. While previous object-detection models (e.g. those using Support Vector Machines) have been effective when the target object occupies a large portion of the pixel space, the multi-layer CNNs in this paper far outperform them. We selected this paper because it offers a chance to expand our understanding of image classification with CNNs by introducing a branched three-stream architecture for pixel classification, bounding boxes (detection), and label display (classification). Object 'detection' (beyond classification) goes beyond the content covered in the course, so we were interested in how it could be achieved with a CNN architecture and what additional impact it would have on model performance compared to the standard multi-layer CNNs covered in lecture. Detection and recognition functionalities appear in several real-life scenarios, e.g. autonomous vehicles recognizing traffic signs while driving, or smart glasses and cameras that auto-focus using object detection.
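To make the three-stream idea concrete, here is a minimal Keras sketch of the kind of model we are building: a shared convolutional trunk (the paper branches after the sixth convolution layer) feeding three heads. The filter counts, pooling choices, and head shapes are our assumptions, not the paper's, since the paper does not specify them.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_branched_model(num_classes, input_shape=(2000, 2000, 3)):
    # Shared trunk: six conv layers (filter counts are assumed).
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for filters in (32, 32, 64, 64, 128, 128):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)

    # Stream 1: per-location sign/background probability map (downsampled).
    pixel = layers.Conv2D(1, 1, activation="sigmoid", name="pixel")(x)
    # Stream 2: bounding-box regression, 4 coordinates per location.
    bbox = layers.Conv2D(4, 1, name="bbox")(x)
    # Stream 3: class-label scores per location.
    label = layers.Conv2D(num_classes, 1, activation="softmax", name="label")(x)

    return tf.keras.Model(inputs, [pixel, bbox, label])
```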

Challenges: We have had challenges converting the model in the paper into code. The paper is very vague about its architecture (especially the three sets of layers that branch after the sixth convolution layer, and the OverFeat architecture). It barely distinguishes between the bounding box, pixel, and label layers, so it has been difficult for us to build the parts of the model involving them. We also still do not understand how the model in the paper implemented multi-object detection and label handling. Additionally, because the model is written in Caffe, some Caffe functionalities cannot be directly translated into TensorFlow. For example, Caffe allows separate layers to have different learning rates, which is not a built-in Keras feature, so we needed to find a way to incorporate a learning rate into each layer. The solution we decided on is to use multiple optimizers and to split the variables from different layers into separate lists that each optimizer runs on (see the sketch below). Other potential solutions we found were: writing our own SGD optimizer that lets us set learning-rate multipliers per layer (and also lets us change the learning-rate decay, though we cannot control how many steps the learning rate decays over), using @tf.custom_gradient to define a function with a custom gradient, collecting learning-rate multipliers for each variable and applying them before applying gradients, or writing code to set the learning rate for one specific layer. If our current solution does not work, we will turn to one of the other solutions listed.
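Here is a minimal sketch of the multiple-optimizer approach we chose, emulating Caffe's per-layer learning-rate multipliers in TensorFlow. The base learning rate and the 10x multiplier are placeholder values; how the variables are split between `backbone_vars` and `head_vars` depends on the final model.

```python
import tensorflow as tf

base_lr = 1e-3
opt_backbone = tf.keras.optimizers.SGD(learning_rate=base_lr)
opt_heads = tf.keras.optimizers.SGD(learning_rate=base_lr * 10)  # assumed multiplier

@tf.function
def train_step(model, x, y, loss_fn, backbone_vars, head_vars):
    with tf.GradientTape() as tape:
        preds = model(x, training=True)
        loss = loss_fn(y, preds)
    # Compute all gradients in one pass, then route each variable group
    # to the optimizer that carries its learning rate.
    grads = tape.gradient(loss, backbone_vars + head_vars)
    backbone_grads = grads[:len(backbone_vars)]
    head_grads = grads[len(backbone_vars):]
    opt_backbone.apply_gradients(zip(backbone_grads, backbone_vars))
    opt_heads.apply_gradients(zip(head_grads, head_vars))
    return loss
```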

Insights: We have coded what is covered in the paper, but have realized that the information in the paper does not match what we have researched about OverFeat. We currently have all of the Multi-Layer CNN's layers complete (i.e. everything in the diagram in the paper). We have also completed the majority of our loss and accuracy functions (excluding code relating to the bounding box layer) and most of our train and test loop. However, we do not yet have concrete results because we are still adding OverFeat code to our model after researching it (it was not covered in depth in the paper). We have also been unable to run our model because of problems with the dataset; our dataset issues and potential solutions are discussed in the following section (Plan). Some further tweaks will be made once the dataset is chosen.

Plan: Of the multiple approaches to “detection” presented in the paper (and the background literature), we are attempting to implement object detection with either OverFeat or Fast R-CNN. While the former is what the paper implements, the latter has a more well-defined scope with literature to rely on. Additionally, the paper itself was not able to offer comparisons to Fast R-CNN approaches due to a lack of source code. Because OverFeat's proposal mechanisms (EdgeBoxes and BING) are unsupervised, we are attempting OverFeat first and will move to Fast R-CNN if it does not work.
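The core OverFeat trick we plan to borrow is expressing the classifier's dense layers as convolutions, so the same network trained on small (80x80) crops can slide over a large scene in one pass and emit a spatial grid of class scores. A minimal sketch, with filter counts and strides of our own choosing:

```python
import tensorflow as tf
from tensorflow.keras import layers

def fully_conv_classifier(num_classes):
    return tf.keras.Sequential([
        layers.Conv2D(32, 5, strides=2, activation="relu"),
        layers.Conv2D(64, 5, strides=2, activation="relu"),
        # This conv spans the full receptive field of an 80x80 crop,
        # playing the role of a fully connected layer.
        layers.Conv2D(256, 17, activation="relu"),
        layers.Conv2D(num_classes, 1, activation="softmax"),
    ])

model = fully_conv_classifier(num_classes=10)
scores_on_crop = model(tf.zeros((1, 80, 80, 3)))    # -> (1, 1, 1, 10): one prediction
scores_on_scene = model(tf.zeros((1, 400, 400, 3))) # -> (1, 81, 81, 10): a grid of predictions
```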

Further research needs to be done on the losses and mechanisms governing the EdgeBoxes and BING methods, and on how they integrate into an OverFeat architecture, so that we can fold them into our model. One concern is the size of the dataset we are planning to use: it is extremely large (100GB). We need to decide whether we can use part of the current dataset or whether we need a different one. If we use a different dataset, we may need to do additional preprocessing (i.e. superimpose images of traffic signs onto different backgrounds to mimic the dataset used in the paper). We have not implemented this preprocessing yet because the large dataset used in the paper is already preprocessed and we are still undecided on which dataset to use; a sketch of the superimposition step is below. We may also need to change the size of the output and the sizes of the first and last layers if the dataset changes.
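If we do end up superimposing signs onto backgrounds, the step could look something like this sketch using PIL. The file paths, sign size, and scene size are placeholders matching the paper's 80x80-in-2000x2000 setup, not a finished pipeline.

```python
import random
from PIL import Image

def superimpose(sign_path, background_path, scene_size=(2000, 2000)):
    # Paste a small traffic-sign crop onto a large background at a
    # random location and record the resulting bounding box as a label.
    background = Image.open(background_path).convert("RGB").resize(scene_size)
    sign = Image.open(sign_path).convert("RGBA").resize((80, 80))

    # Random top-left corner that keeps the sign fully inside the scene.
    x = random.randint(0, scene_size[0] - sign.width)
    y = random.randint(0, scene_size[1] - sign.height)
    background.paste(sign, (x, y), mask=sign)  # alpha mask preserves transparency

    bbox = (x, y, x + sign.width, y + sign.height)  # (left, top, right, bottom)
    return background, bbox
```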
