Re-Implementation of "End-to-End Object Detection with Transformers"
github link: https://github.com/chelsea97/CS1470-project
public github link: https://github.com/Visual-Behavior/detr-tensorflow
google drive link: https://drive.google.com/drive/folders/19qTBVLCtruv0grLENKR91IFvZFfS_eJu?usp=sharing
Who
Yuxuan Zhao (yzhao153) Zhuo Wang (zwang302) Xinhui Zhan (xzhan4)
Introduction
The goal of this project is to put the paper "End-to-End Object Detection with Transformers" into practice. The purpose of object detection is to predict a set of bounding boxes and category labels for each object of interest. The paper views object detection as a direct set prediction problem and proposes a new framework, the DEtection TRansformer (DETR), built on two main ingredients: a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture. Unlike many other modern detectors, this model is conceptually much simpler to implement and requires no specialized library. Moreover, on the COCO object detection dataset, DETR achieves accuracy and runtime performance comparable to the well-established Faster R-CNN baseline. DETR can also be generalized to produce panoptic segmentation, which is widely used nowadays. These are the main reasons why our group wants to gain a deeper understanding of this paper and reimplement the model with a different framework and dataset.
Related Work
Non-local Neural Network (Wang, X., Girshick, R.B., Gupta, A., He, K.) https://arxiv.org/pdf/1711.07971.pdf
Convolutional and recurrent operations are building blocks that process one local neighborhood at a time. This related paper presents the non-local operation as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method in computer vision, the non-local operation computes the response at a position as a weighted sum of the features at all positions. Moreover, this building block can be plugged into many different computer vision architectures. In terms of performance, non-local models compete with previous competition winners on both the Kinetics and Charades datasets, and on the COCO dataset they improve object detection/segmentation and pose estimation on static images. This non-local operation is essentially the self-attention used in the transformer layers of DETR.
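To make the connection to self-attention concrete, here is a minimal TensorFlow sketch of an embedded-Gaussian non-local block, where the response at each position is a softmax-weighted sum over all positions. The layer sizes and names are our own illustrative choices, not the paper's reference code.

```python
import tensorflow as tf

class NonLocalBlock(tf.keras.layers.Layer):
    """Minimal embedded-Gaussian non-local block (sketch, not reference code).

    Computes the response at each position as a softmax-weighted sum of the
    features at all positions -- the same computation as scaled dot-product
    self-attention, minus the scaling.
    """

    def __init__(self, channels, **kwargs):
        super().__init__(**kwargs)
        self.theta = tf.keras.layers.Dense(channels // 2)  # query embedding
        self.phi = tf.keras.layers.Dense(channels // 2)    # key embedding
        self.g = tf.keras.layers.Dense(channels // 2)      # value embedding
        self.out = tf.keras.layers.Dense(channels)         # restore channel count

    def call(self, x):
        # x: (batch, positions, channels), e.g. a flattened CNN feature map.
        q, k, v = self.theta(x), self.phi(x), self.g(x)
        attn = tf.nn.softmax(tf.matmul(q, k, transpose_b=True), axis=-1)
        y = tf.matmul(attn, v)   # weighted sum over all positions
        return x + self.out(y)   # residual connection, as in the paper
```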
Data
In many ways, Open Images is the largest annotated image dataset for training deep convolutional neural networks for computer vision tasks. It is a 9-million-image dataset annotated with image-level labels, object bounding boxes, object segmentation masks, and visual relationships. With a total of 16 million bounding boxes for 600 object classes on 1.9 million images, it is the world's largest collection with object location annotations. To ensure accuracy and consistency, the boxes were mostly drawn by hand by professional annotators.
Visual relationship annotations are also available in Open Images, indicating pairs of objects in particular relations, object attributes, and human actions. In total, it contains 3.3 million annotations from 1,466 distinct relationship triplets. Altogether, the dataset comprises 9 million images with 36 million image-level labels, 15.8 million bounding boxes, 2.8 million instance segmentations, and 391 thousand visual relationships. Recent advances in object detection, instance segmentation, and visual relationship detection have been driven by the Open Images challenges as well as by the dataset itself.
Open Images V6 introduces 675k localized narratives: multimodal image descriptions that combine synchronized voice, text, and mouse traces over the objects being described. In V6, localized narratives are available not only for the training split but for the validation and test splits as well.
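To give a concrete sense of how these annotations look in practice, here is a small sketch of inspecting the Open Images bounding-box CSV with pandas. The filename and column names below follow the public Open Images release, but they should be verified against the downloaded version; this is illustrative, not our final data pipeline.

```python
import pandas as pd

# Sketch: inspect Open Images box annotations (filename and columns assumed
# from the public V6 release; verify against the files you download).
boxes = pd.read_csv("oidv6-train-annotations-bbox.csv")

# Box coordinates are normalized to [0, 1], so they are independent of
# the image resolution.
first_image = boxes[boxes.ImageID == boxes.ImageID.iloc[0]]
print(first_image[["LabelName", "XMin", "XMax", "YMin", "YMax"]])
```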
Methodology
TensorFlow will be our primary tool for building and training the DEtection TRansformer (DETR). To train DETR, we use a convolutional neural network to extract essential features from the original image, then feed those features into a transformer encoder-decoder, a popular architecture for sequence prediction. Transformers' self-attention mechanisms explicitly model all pairwise interactions between elements in a sequence, which makes them particularly well suited to the constraints of set prediction, such as removing duplicate detections.
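To make the pipeline concrete, below is a minimal TensorFlow sketch of the forward pass: a CNN backbone, a flattened feature sequence, learned object queries, and per-query class/box heads. This is our simplified illustration, not the reference implementation: we use a single Keras MultiHeadAttention layer for the encoder and for the decoder, and we omit the positional encodings that the real DETR requires.

```python
import tensorflow as tf

class DETRSketch(tf.keras.Model):
    """A simplified DETR forward pass (our sketch, not the paper's code).

    Assumes a ResNet-50 backbone and one attention layer each for encoder
    and decoder; the real DETR stacks 6 of both and adds positional encodings.
    """

    def __init__(self, num_classes, num_queries=100, d_model=256):
        super().__init__()
        self.backbone = tf.keras.applications.ResNet50(
            include_top=False, weights=None)
        self.project = tf.keras.layers.Conv2D(d_model, 1)  # 2048 -> d_model
        self.encoder = tf.keras.layers.MultiHeadAttention(8, d_model // 8)
        self.decoder = tf.keras.layers.MultiHeadAttention(8, d_model // 8)
        # N learned object queries, one slot per potential detection.
        self.queries = self.add_weight(
            name="queries", shape=(num_queries, d_model), trainable=True)
        self.class_head = tf.keras.layers.Dense(num_classes + 1)  # +1: "no object"
        self.box_head = tf.keras.layers.Dense(4, activation="sigmoid")  # cx, cy, w, h

    def call(self, images):
        x = self.project(self.backbone(images))        # (B, H', W', d_model)
        b = tf.shape(x)[0]
        seq = tf.reshape(x, (b, -1, x.shape[-1]))      # flatten to a sequence
        memory = self.encoder(seq, seq)                # self-attention over pixels
        q = tf.tile(self.queries[None], (b, 1, 1))     # broadcast queries per image
        hs = self.decoder(q, memory)                   # queries attend to the image
        return self.class_head(hs), self.box_head(hs)  # per-query class + box
```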
DETR predicts all objects in parallel and is trained end-to-end with a set loss function that performs bipartite matching between predicted and ground-truth objects. DETR streamlines the detection pipeline by dropping hand-crafted components that encode prior knowledge, such as spatial anchors and non-maximum suppression. Unlike most other detection methods, DETR requires no customized layers and can therefore be easily reproduced in any framework that provides standard CNN and transformer classes. The combination of the bipartite matching loss with transformers and parallel decoding is one of DETR's key features; previous work, by contrast, focused on RNN-based autoregressive decoding. The DETR matching loss is invariant to permutations of the predicted objects and assigns each prediction to at most one ground-truth object.
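The matching itself can be computed with the Hungarian algorithm. Below is a sketch for one image using SciPy's `linear_sum_assignment`; to keep it short, the cost uses only a class-probability term and an L1 box term, whereas the paper's matching cost also includes a generalized-IoU term. The `l1_weight` value is illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes, l1_weight=5.0):
    """Sketch of DETR's bipartite matching for one image (simplified cost).

    pred_probs: (N, num_classes + 1) softmax scores; pred_boxes: (N, 4) and
    gt_boxes: (M, 4), both in normalized cxcywh; gt_labels: (M,) class ids.
    """
    # Cost of assigning prediction i to ground truth j: low probability of
    # the true class, plus the L1 distance between the boxes.
    class_cost = -pred_probs[:, gt_labels]                           # (N, M)
    box_cost = np.abs(pred_boxes[:, None] - gt_boxes[None]).sum(-1)  # (N, M)
    cost = class_cost + l1_weight * box_cost
    pred_idx, gt_idx = linear_sum_assignment(cost)  # optimal 1-to-1 assignment
    return pred_idx, gt_idx
```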
DETR's training settings differ from those of ordinary object detectors in several ways. The model requires an extra-long training schedule and benefits from the transformer's auxiliary decoding losses. We go through all the components necessary for the reported performance in detail. In our opinion, the most difficult part of developing the model is tuning the model parameters and choosing an appropriate number of object queries N so that predictions can be matched well against the real classes and boxes. Furthermore, defining and regressing the correct bounding box size is critical: compared with the manually labeled test images, DETR must not only predict the right class but also draw a bounding box that fits the shape of each object.
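Once the matching is fixed, the set loss supervises matched queries with their ground-truth class and box, and pushes the remaining queries toward the extra "no object" class. The sketch below shows this structure for one image; down-weighting the "no object" class follows the paper, but the specific weights are illustrative and the GIoU box term is again omitted for brevity.

```python
import tensorflow as tf

def detr_loss_sketch(pred_logits, pred_boxes, gt_labels, gt_boxes,
                     pred_idx, gt_idx, num_classes, no_obj_weight=0.1):
    """Simplified set-prediction loss for one image, given a matching.

    pred_logits: (N, num_classes + 1); pred_boxes: (N, 4) in cxcywh;
    pred_idx/gt_idx come from the bipartite matching above.
    """
    num_queries = tf.shape(pred_logits)[0]
    # Every query defaults to the "no object" class (index num_classes) ...
    target = tf.fill((num_queries,), num_classes)
    # ... except matched queries, which get their ground-truth label.
    target = tf.tensor_scatter_nd_update(
        target, pred_idx[:, None],
        tf.cast(tf.gather(gt_labels, gt_idx), target.dtype))
    # Down-weight "no object" so the many empty slots do not dominate.
    weights = tf.where(tf.equal(target, num_classes), no_obj_weight, 1.0)
    class_loss = tf.reduce_mean(
        weights * tf.keras.losses.sparse_categorical_crossentropy(
            target, pred_logits, from_logits=True))
    # L1 box loss only on matched (prediction, ground truth) pairs.
    box_loss = tf.reduce_mean(tf.abs(
        tf.gather(pred_boxes, pred_idx) - tf.gather(gt_boxes, gt_idx)))
    return class_loss + 5.0 * box_loss
```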
Metrics
What experiments do you plan to run? We plan to compare our model with Faster R-CNN quantitatively.
For most of our assignments, we have looked at the accuracy of the model. Does the notion of "accuracy" apply to your project, or is some other metric more appropriate? Yes: following the paper, we will evaluate AP (Average Precision) and compare performance across different numbers of decoder layers.
If you are implementing an existing project, detail what the authors of that paper were hoping to find and how they quantified the results of their model. The authors propose DETR to detect objects in images using transformers and bipartite matching, and show that the model is comparable in AP to an optimized Faster R-CNN on the COCO dataset. At the same time, DETR is easy to implement and extensible to other tasks such as panoptic segmentation, and it performs better than Faster R-CNN on large objects.
What are your base, target, and stretch goals? Our base goal is to implement a working DETR model on a different platform, namely TensorFlow. Our target and stretch goals are to reach an AP of around 40 with no more than 12 transformer layers.
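AP is built on the IoU between predicted and ground-truth boxes: a detection counts as a true positive when its IoU with a not-yet-matched ground-truth box exceeds a threshold, and COCO-style AP averages precision over thresholds from 0.5 to 0.95. A standard pairwise IoU computation in TensorFlow, which our evaluation can build on, looks like this:

```python
import tensorflow as tf

def box_iou_xyxy(boxes1, boxes2):
    """Pairwise IoU between two sets of boxes in (xmin, ymin, xmax, ymax).

    boxes1: (N, 4), boxes2: (M, 4); returns an (N, M) IoU matrix.
    """
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
    lt = tf.maximum(boxes1[:, None, :2], boxes2[None, :, :2])  # intersection top-left
    rb = tf.minimum(boxes1[:, None, 2:], boxes2[None, :, 2:])  # intersection bottom-right
    wh = tf.maximum(rb - lt, 0.0)                              # clamp empty overlaps
    inter = wh[..., 0] * wh[..., 1]
    union = area1[:, None] + area2[None, :] - inter
    return inter / union
```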
Ethics
Our aim is to build a model that detects objects' sizes and positions in a picture. Here are some possible ethical issues related to this topic.
- The dataset used probably covers only a restricted set of object categories, so there is a real risk that new objects are forced into existing, wrong categories and marked with inappropriate bounding boxes. The domain of the dataset will therefore limit the precision and performance of the model in actual use.
- We should be careful about bias in our training set: the classes and labels are assigned by people, so different annotators will inevitably describe the same things differently, and in the worst case biased or offensive labels may appear in the dataset. If we use this model to detect objects in new pictures, these stereotypes and biases will be carried forward.
Division of labor
All members plan to cooperate evenly on the final project coding and its writeup/reflection.
- Zhuo Wang: poster of the project
- Yuxuan Zhao: check-in #2 reflection
- Xinhui Zhan: record an oral presentation
Built With
- tensorflow