Traffic Sign Detection

Introduction: Traffic signs play a crucial role in maintaining order and safety on roads, serving as essential visual communication tools for pedestrians and drivers. Specifically, they help prevent accidents, regulate traffic flow, and optimize transportation systems. Drivers can make better-informed decisions when they know in advance which traffic signs lie ahead, and deep learning can help provide that information. In our project, we used the CLIP model from an OpenAI paper to detect traffic signs on roads and highways. Our goal is not only to locate traffic signs in images but also to identify their meanings. Once trained, the model outputs the images that best match an input caption describing a particular traffic sign. Unlike the original implementation in PyTorch, ours is a re-implementation in the TensorFlow framework.

Related Work: Are you aware of any, or is there any prior work that you drew on to do your project?

Relevant paper: The paper “Understanding Cities with machine eyes: a review of deep computer vision in urban analytics” surveys areas where computer vision techniques can be applied, including object detection and classification. It highlights the benefits of using deep learning techniques such as CNNs, RNNs, and GANs to improve urban planning and design, optimize transportation systems, and support smart city initiatives. It is a valuable resource for researchers and policy-makers working at the intersection of urban studies and deep learning.
The paper we want to implement (a tutorial implementation of OpenAI's CLIP model): https://towardsdatascience.com/simple-implementation-of-openai-clip-model-a-tutorial-ace6ff01d9f2
Potential Dataset: “Chinese Traffic Sign Dataset”: https://www.kaggle.com/datasets/dmitryyemelyanov/chinese-traffic-signs?select=annotations.csv

Data: What data are you using (if any)? Our dataset comes from the Chinese Traffic Sign Detection Database. It consists of 5998 traffic sign images in 58 categories. Each image is a zoomed-in view of a single traffic sign. The annotations record each image's file name, width, and height, together with the coordinates of the traffic sign within the image and its category. Since the original dataset does not contain captions, we manually created a text file that briefly describes each image, such as “red and white circle no car sign.”
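As a rough illustration of how the annotations and the hand-written captions could be joined into image-caption pairs, here is a minimal sketch. The column names (`file_name`, `x1`, `y1`, `x2`, `y2`, `category`), the two sample rows, and the category-to-caption table are all assumptions made for the example, not values taken from our dataset or caption file. The sketch also assumes one caption per category, which may not match how the caption file is actually organized.

```python
import csv
import io

# Hypothetical excerpt of an annotations file; column names and rows are assumed.
SAMPLE_CSV = """file_name,width,height,x1,y1,x2,y2,category
sign_a.png,134,128,19,7,120,117,0
sign_b.png,98,92,10,8,88,85,1
"""

# Hypothetical hand-written captions, keyed by category id.
CAPTIONS = {
    0: "red and white circle no car sign",
    1: "blue circle turn left sign",
}


def load_pairs(csv_text, captions):
    """Return one record per image: file name, bounding box, and caption."""
    pairs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        category = int(row["category"])
        pairs.append({
            "file_name": row["file_name"],
            # Corner coordinates of the sign inside the image.
            "box": tuple(int(row[key]) for key in ("x1", "y1", "x2", "y2")),
            "caption": captions[category],
        })
    return pairs


pairs = load_pairs(SAMPLE_CSV, CAPTIONS)
```

Each record can then be handed to whatever dataset class feeds the image and text encoders.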

Methodology: What is the architecture of your model?

The model paper introduced the CLIP (Contrastive Language-Image Pre-training) model, which retrieves the most relevant images for an input sentence. This model is powerful because, instead of classifying into a fixed set of categories, it uses natural language to make its predictions flexible. The implementation we re-implemented trains the model on the Flickr 8k dataset, a standard dataset for image-captioning tasks. It consists of seven classes: CFG, AvgMeter, CLIPDataset, ImageEncoder, TextEncoder, ProjectionHead, and CLIPModel. For ImageEncoder, we used ResNet-50, a convolutional neural network architecture widely used in computer vision tasks such as image classification and object detection. For TextEncoder, we used DistilBERT, a smaller distilled version of BERT that is widely used in natural language processing, to encode the captions so that they can be matched with the corresponding images. The ProjectionHead class remains the same: it projects the image and text encodings to the same dimension so that they can be compared.
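To make the contrastive objective concrete, here is a minimal framework-free sketch of the symmetric loss as it is usually described for CLIP: embeddings are normalized, pairwise similarities form a logits matrix, and cross-entropy is applied in both directions with the matching pair on the diagonal as the target. This is an illustration under assumptions, not our TensorFlow code; the function names and the default `temperature` are chosen for the example, and the tutorial we followed may construct its targets differently.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one row of logits, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[target_index]


def clip_loss(image_embs, text_embs, temperature=1.0):
    """Symmetric contrastive loss; pair i is (image_embs[i], text_embs[i])."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] is the scaled cosine similarity of caption i and image j.
    logits = [
        [sum(a * b for a, b in zip(t, im)) / temperature for im in images]
        for t in texts
    ]
    n = len(logits)
    # Each caption should pick out its own image (rows of the matrix).
    text_to_image = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Each image should pick out its own caption (columns of the matrix).
    columns = [list(col) for col in zip(*logits)]
    image_to_text = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (text_to_image + image_to_text) / 2
```

With correctly paired embeddings the loss is small, and it grows when captions are paired with the wrong images, which is the signal that pulls matching images and captions together during training.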

Metrics: What constitutes “success?”

What experiments do you plan to run?

We plan to experiment with different hyperparameters, especially those related to image classification, such as the convolution kernel size and the number of convolution filters. We also plan to use data augmentation and other techniques to create variations in lighting and weather conditions.

For most of our assignments, we have looked at the accuracy of the model. Does the notion of “accuracy” apply to your project, or is some other metric more appropriate?

We think accuracy is an appropriate metric. Other metrics worth considering are precision, recall, and F-score, which we have not covered in class but can learn and implement.
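For reference, the metrics named above can be computed from true and predicted category labels as in the sketch below. It treats one category at a time as the positive class; the function names and the integer labels in the usage lines are hypothetical and only illustrate the definitions.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that equal the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def precision_recall_f1(true_labels, predicted_labels, positive):
    """Precision, recall, and F1 for a single category treated as positive."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical category ids for four test images.
true_labels = [0, 0, 1, 0]
predicted_labels = [0, 1, 0, 0]
overall = accuracy(true_labels, predicted_labels)
precision, recall, f1 = precision_recall_f1(true_labels, predicted_labels, positive=0)
```

With 58 categories, the per-category scores could be averaged to obtain a single summary number.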

If you are implementing an existing project, could you explain what the authors of that paper were hoping to find and how they quantified their model results?

The authors of that paper used accuracy and loss to quantify their results. In other words, they examined how well the model could select the images whose features are most relevant to the user's input. For example, if the input sentence were “red and white circle no car sign,” we would expect the model to output the image with those features.
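The retrieval step in that example can be sketched as ranking image embeddings by cosine similarity to the caption embedding. In the sketch below the two-dimensional vectors and the file names are made up for illustration; in the real model they would come from the image and text encoders followed by the projection heads.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_emb, image_embs, k=1):
    """Return the k file names whose embeddings best match the query."""
    ranked = sorted(
        image_embs,
        key=lambda name: cosine(query_emb, image_embs[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical embeddings standing in for encoder outputs.
gallery = {
    "no_car.png": [0.9, 0.1],
    "turn_left.png": [0.1, 0.9],
}
query = [1.0, 0.2]  # stands in for the encoded caption
top_match = retrieve(query, gallery, k=1)
```

Success on a test caption then means that the image the caption was written for appears among the top results.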

What are your base, target, and stretch goals?

Our base goal is to re-implement their model with a different framework and dataset and to reproduce their results. Our target goal is to improve on their results by adding new features and/or tuning hyperparameters, and to test the model on more complicated data (such as videos). Our stretch goal is to modify the model architecture or explore new approaches that increase accuracy further.

Ethics: Choose 2 of the following bullet points to discuss

What broader societal issues are relevant to your chosen problem space?

This problem concerns important social issues such as transportation safety and efficiency. Improving the accuracy of traffic sign detection helps keep cities organized and helps drivers and pedestrians follow the clear rules that traffic signs lay out.

Why is Deep Learning a good approach to this problem?

Deep learning is a good approach to this problem because an image recognition model can learn to recognize patterns and features in complex data accurately, provided there is enough well-annotated training data. It is also a good approach because new datasets from different regions are published every year, and we can use them to train and improve the model.

Challenges: The main challenge lies in finding appropriate datasets. Initially, we planned to implement the paper “CueCAn: Cue-driven Contextual Attention for Identifying Missing Traffic Signs on Unconstrained Roads.” That paper requires a dataset of images and videos with missing traffic signs and appropriate captions. We did not realize this until we started implementing the code. Images and videos with missing traffic signs are easy to find, but ones with captions stating which traffic sign is missing are not. Thus, we changed our topic from “detecting missing traffic signs” to “traffic sign detection.”

Division of labor: Briefly outline who will be responsible for which part(s) of the project
Data processing: Yuechuan Yang, Yifan Zhang
Model training: Xilin Wang, Yuechuan Yang, Yifan Zhang
Evaluation: Yifan Zhang, Xilin Wang

DL Day Slides https://docs.google.com/presentation/d/1WIwhQo3EDpG-Irp1ZXGH6CI921Tt0wvWbR3HfuAG5HE/edit?usp=sharing

Final Writeup (code link included in the document) https://docs.google.com/document/d/1G4o9BUdHY3I_Ksu04KOa_O197wMlvtTkO_xlL6JRbgY/edit