Title:
Enhancing Out-of-Distribution Object Detection with CLIP: A Vision-Language Approach
Team Members:
- Yuxiang Wang, ywan1084
- Xueru Ma, xma75
- Kangyu Zhu, kzhu37
- Yingwei Song, ysong137
Introduction :
In the field of artificial intelligence, one of the primary challenges is ensuring that models perform robustly under unexpected conditions. Traditional object detection models such as Faster R-CNN have shown promising performance in recognizing and localizing objects in images; however, these detectors lack the capability to detect Out-of-Distribution (OOD) data. When they encounter data that deviate from their training sets, they produce incorrect detections or misclassifications. To tackle this issue, our group has developed a Contrastive Learning Based Out-of-Distribution Unified Detector (CLOUD). This approach is designed to improve the resilience of models in scenarios with unexpected data, specifically targeting the challenges that OOD data pose for object detection. Our project is inspired by key insights from the study 'CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No', which introduces a decision-making process that enables models to distinguish between OOD and in-distribution samples. By combining detection capabilities with selective decision-making, our model aims to recognize OOD items and accurately detect objects with bounding boxes, enhancing both their efficacy and reliability.
Our contributions include:
A novel method for dataset partitioning specifically designed for the Out-of-Distribution (OOD) object detection task.
A region-text matching scheme that aligns image regions with textual descriptions.
A joint training pipeline integrating Region Proposal Networks (RPN) and an OOD detection network, tailored for the object OOD detection task.
Related Work:
What is OOD?
It is well known that most successful deep learning models are trained under a closed-world assumption, i.e., on a fixed set of categories given by the training dataset; when these models are deployed in real-world applications, however, they often suffer from poor generalization and suboptimal performance. This is partly due to the large number of "unseen categories" in the real world, which are difficult for a model to detect and recognize since they are never explicitly seen during training. The Out-of-Distribution (OOD) detection task is defined to address this problem: a model trained on an in-distribution (ID) dataset must classify whether input images come from unknown classes.
In the paper "CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No", the authors found that the original CLIP model never learns a "negative" logic, i.e., it lacks the ability to say "no". They address this by introducing a "no" prompt and a "no" text encoder: a learnable negation prompt incorporates negation semantics into the text, and the "no" text encoder captures the corresponding negation semantics of the image, enabling CLIP to express the meaning of "no".
Data :
We use the COCO dataset, which is a large-scale image dataset that provides 80 categories, with more than 330,000 images, of which 200,000 have annotations for masks or bounding boxes. The dataset encompasses over 1.5 million individual instances. COCO supports various visual tasks, including object detection and segmentation. The specific dataset and annotations can be downloaded from COCO - Common Objects in Context (cocodataset.org).
We will use this dataset for our region-text match training. Additionally, we will generate OOD image-text pairs based on this dataset for subsequent OOD detection. Specifically, COCO already provides substantial image-text pairs in which the texts are standard positive prompts. As suggested by the CLIPN work, we define a series of "no" prompts to complement the original texts. For each image Xi and positive text prompt Ti, we handcraft a negative text prompt Ti' using negation words, e.g., turning 'a photo with dog' into 'a photo without dog'. In this way, a modified dataset built upon COCO is created for training our OOD network. Data pairs can be described as {(image, [standard annotations], ['no' annotations])}.
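The negative-prompt construction described above can be sketched as a small helper. The template list and the exact pairing format are our assumptions for illustration, not something fixed by COCO or CLIPN:

```python
# Sketch of building (image, positive text, 'no' texts) triples from a
# COCO class name. Templates are assumed examples, not the final prompt set.
NEGATIVE_TEMPLATES = [
    "a photo without {}",
    "a photo with no {}",
    "a photo that does not contain {}",
]

def make_pairs(class_name, image_id):
    """Build one (image, standard annotation, 'no' annotation) training triple."""
    positive = f"a photo with {class_name}"
    negatives = [t.format(class_name) for t in NEGATIVE_TEMPLATES]
    return {"image_id": image_id, "positive": positive, "negatives": negatives}

pair = make_pairs("dog", image_id=42)
print(pair["positive"])      # a photo with dog
print(pair["negatives"][0])  # a photo without dog
```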
Methodology :
Our model consists of multiple modules: an image encoder, a text encoder, a region proposal generating network, and an OOD detection network.
The basic architecture of the OOD detection network follows 3 parts - a “no” text encoder; an image encoder; and a text encoder. The image encoder and the text encoder have the same structure and parameters as the pre-trained CLIP encoder, while the “no” text encoder’s input is negative text describing an image with opposite semantics.
Inference Process: The model takes inputs of two modalities: text containing object nouns, and images. The image first passes through a region proposal network to generate multiple candidate bounding boxes. Each candidate box goes into the image encoder to produce an image embedding, while the nouns are processed through the text encoder and the OOD decision network to generate corresponding text embeddings. The image embeddings are matched against the text embeddings from the two networks. Based on the resulting logits, a judgment is made as to whether a noun is out-of-distribution (OOD) relative to the image; if it is OOD, no bounding box is drawn; otherwise, the highest-scoring candidate box for that category is output.
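The inference decision above can be sketched as follows, operating on pre-computed embeddings. The cosine-similarity matching and the 0.5 decision threshold are our assumptions for illustration:

```python
import numpy as np

def detect(region_embs, text_embs, no_text_embs, ood_threshold=0.5):
    """For each class noun, decide ID vs. OOD and pick the best region.

    region_embs:  (R, D) embeddings of the RPN candidate boxes
    text_embs:    (C, D) standard-prompt text embeddings
    no_text_embs: (C, D) 'no'-prompt text embeddings
    Returns one entry per class: None if the noun is judged OOD,
    otherwise the index of the highest-scoring candidate box.
    """
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    r, t, n = norm(region_embs), norm(text_embs), norm(no_text_embs)
    yes_logits = r @ t.T  # (R, C) match with standard prompts
    no_logits = r @ n.T   # (R, C) match with 'no' prompts

    results = []
    for c in range(t.shape[0]):
        best = int(np.argmax(yes_logits[:, c]))           # best box for class c
        pair = np.array([yes_logits[best, c], no_logits[best, c]])
        p_yes = np.exp(pair[0]) / np.exp(pair).sum()      # softmax over yes/no
        results.append(best if p_yes >= ood_threshold else None)
    return results
```

A class whose "no" prompt matches the image better than its standard prompt is suppressed (returned as None), so no box is drawn for it.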
Training Process: Our training is multi-phased. To enhance sensitivity to image regions, we train the image encoder, which may use a ResNet architecture, with contrastive loss. The text encoder uses a CLIP pre-trained encoder, and its weights are frozen. This training aims to produce a lightweight image encoder for our project and enhance matching between text and image regions, preparing for subsequent detection tasks.
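The first training phase above, contrastive matching between trainable image features and frozen CLIP text features, could look roughly like this. The symmetric InfoNCE form and the temperature value are assumptions borrowed from CLIP pre-training, not the finalized loss:

```python
import torch
import torch.nn.functional as F

def region_text_contrastive_loss(region_feats, text_feats, tau=0.07):
    """InfoNCE-style contrastive loss for the region-text matching phase.

    region_feats: (B, D) features from the trainable image encoder (e.g. ResNet)
    text_feats:   (B, D) features from the frozen CLIP text encoder,
    where row i of each tensor describes the same region.
    tau is an assumed temperature hyperparameter.
    """
    r = F.normalize(region_feats, dim=1)
    t = F.normalize(text_feats, dim=1)
    logits = r @ t.t() / tau            # (B, B) similarity matrix
    targets = torch.arange(r.size(0))   # matched pairs sit on the diagonal
    # Symmetric cross-entropy over rows and columns, as in CLIP pre-training
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```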
Training of the OOD Detection Network:
1. Input: each input pair, as explained in the Data section above, consists of an image, a standard text, and a "no" text from COCO. The input pair can be written as (X, T, T'), where X is the matrix of images and T and T' are the matrices of the corresponding standard and 'no' texts, respectively.
2. During the forward pass: the inputs are passed into the corresponding components of the OOD network: image features F = ImageEncoder(X); text features G = TextEncoder(T); 'no' text features G' = NoTextEncoder(T'). Inner products between F and G, and between F and G', then measure the match-ness between image and text, and between image and 'no' text, respectively. The matched probability between the i-th image and the j-th "no" text is produced by a subsequent softmax layer.
3. Loss function: (1) Image-Text Binary-Opposite (ITBO) Loss: the binary-opposite value m(i,j) is a 0-1 value that encodes the relation between image Xi and "no" text Tj': 0 if the 'no' text is irrelevant to the image, and 1 if the "no" text has the opposite semantics to the image. The loss is defined as the negative average of the matched probabilities weighted by m(i,j) (or 1 - m(i,i)). (2) Text Semantic-Opposite (TSO) Loss: G and G' should be far from each other, since the two feature spaces are produced from prompts with opposite semantics; this can be roughly described as maximizing the L2 distance between G and G'. The ITBO loss helps the model learn where it should respond with 'no', and the TSO loss is designed to let the network understand the meaning of 'no'. The total loss is therefore the sum of these two losses, jointly optimizing our OOD detection network.
4. Output: We finally obtain a vision-language model equipped with OOD detection ability. The output match-ness probabilities can then be used to measure OOD detection performance on the benchmarks mentioned in the Metrics section.
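The forward pass and the two losses described in steps 2-3 can be sketched for a batch of matched triples. The temperature, the cross-entropy form of ITBO, and the unit-norm distance form of TSO are our assumptions; the exact formulation follows CLIPN only loosely:

```python
import torch

def clipn_style_losses(F_img, G_txt, G_no, tau=0.07):
    """Sketch of the ITBO + TSO training objective for one batch.

    F_img, G_txt, G_no: (B, D) L2-normalized features, where row i of each
    tensor comes from the same (image, text, 'no' text) triple.
    tau is an assumed temperature hyperparameter.
    """
    # ITBO: for triple i, the 'no' text i opposes image i (m(i,i)=1), while
    # other 'no' texts in the batch are irrelevant (m(i,j)=0). Maximizing the
    # diagonal matched probabilities gives a cross-entropy-style loss.
    logits_no = F_img @ G_no.t() / tau          # (B, B) image-'no' text matchness
    p_no = logits_no.softmax(dim=1)             # row-wise matched probabilities
    diag = torch.arange(F_img.size(0))
    itbo = -torch.log(p_no[diag, diag]).mean()

    # TSO: push G and G' apart. For unit-norm features the L2 distance is at
    # most 2, so minimizing (2 - distance) maximizes the distance.
    tso = (2.0 - (G_txt - G_no).norm(dim=1)).mean()

    return itbo + tso
```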
Backup Ideas: If issues arise in some of the training and inference steps mentioned above, we might first use some existing pretrained model checkpoints to simplify the training process. In addition, we could reconfigure our OOD decision network to make its interaction with image embeddings simpler and see how this affects the OOD detection results.
Metrics :
We provide the model with texts and images. The model can mark the bounding boxes of nouns mentioned in the text on the image. We evaluate the model's performance based on its ability to detect OOD and the precision of object detection. We expect it can accurately detect in-distribution (ID) object nouns as well as correctly identify OOD object nouns. Therefore, we use metrics related to object detection and OOD detection.
Object Detection Metric:
For the object detection task, we use Intersection over Union (IoU) as a metric to indicate the degree of overlap between the predicted bounding boxes and the ground-truth bounding boxes. A commonly used threshold is 0.5 (i.e., a detection is considered correct if IoU > 0.5). We can also use overall accuracy, the proportion of correctly detected objects out of the total number of detected objects. Additionally, we can use precision, the ratio of correctly detected positive instances (such as correctly detected objects) to the total number of instances detected as positive; it measures the model's ability to avoid false positives.
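The IoU computation is straightforward for axis-aligned boxes in (x1, y1, x2, y2) format, a minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 0.14285714285714285, i.e. 1/7 -> below the 0.5 threshold
```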
Out of Distribution Detection Metric:
For large-scale datasets, CLIPN-A, which uses learnable 'no' prompts, outperforms the previous best method, MCM. It scores higher on two key measures, AUROC and FPR95, across all OOD datasets. Specifically, with a ViT-B-16 backbone, the method in the paper improves AUROC by at least 2.34% and FPR95 by 11.64% (ViT-B-16 average AUROC: 93.10 (+2.34), FPR95: 31.10 (-11.64)). We therefore plan to follow the metrics used in this paper to measure the effectiveness of our OOD network, and to compare the results with the previously mentioned related work and other classic OOD models.
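Both metrics can be computed from per-sample confidence scores (higher = more in-distribution). A minimal sketch, ignoring tie-handling; in practice a library routine such as scikit-learn's `roc_curve` would be used:

```python
import numpy as np

def auroc_and_fpr95(id_scores, ood_scores):
    """AUROC and FPR at 95% TPR from ID/OOD confidence scores.

    id_scores / ood_scores: 1-D arrays of scores where higher means
    "more in-distribution". Assumes no tied scores for simplicity.
    """
    labels = np.concatenate([np.ones(len(id_scores)), np.zeros(len(ood_scores))])
    scores = np.concatenate([id_scores, ood_scores])
    labels = labels[np.argsort(-scores)]  # sweep threshold from high to low

    tpr = np.concatenate([[0.0], np.cumsum(labels) / labels.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(1 - labels) / (1 - labels).sum()])

    # Trapezoidal area under the ROC curve
    auroc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)
    # FPR at the first threshold where TPR reaches 0.95
    fpr95 = fpr[np.searchsorted(tpr, 0.95)]
    return float(auroc), float(fpr95)
```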
Goals to Achieve
Base goals: We want our model to have the ability to distinguish OOD cases as well as ID cases.
Target goals: The model could draw the bounding boxes for the correct text descriptions and ignore the box for the OOD case accurately.
Stretch goals: The model could handle the zero-shot setting and achieve good performance on other datasets with many different labels.
Ethics :
What broader societal issues are relevant to your chosen problem space?
The broader societal issues relevant to our chosen problem space could be the risk of infringing on individual privacy. In the application of image recognition technologies, privacy and consent are significant ethical issues. Cameras in public and private places may capture images without the explicit consent of the individuals involved. This data can be analyzed and stored, sometimes without the knowledge of those captured, which infringes on the privacy of the individuals. Moreover, the potential risks are heightened if this information is used through improper channels, such as when these technologies are used to track individual activities or analyze personal characteristics.
Who are the major “stakeholders” in this problem, and what are the consequences of mistakes made by your algorithm?
The major "stakeholders" in this problem involve developers and researchers who create and refine the technology, and end-users including companies that use this system for identifying and categorizing digital images. Additionally, it includes the general public, as some societal applications incorporating this technology may change people's ways of living, thereby indirectly impacting their lives. The capacity of our model to detect and segment objects may perpetuate or even exacerbate existing biases in the training data. A significant issue relates to how the model represents gender. Currently, the system only recognizes binary gender categories, which could result in the misrepresentation or failure to recognize non-binary or transgender individuals in images. Such limitations can lead to continued exclusion and discrimination, reinforcing existing societal biases.
Division of labor:
- Yuxiang Wang: Preprocessing the COCO dataset; constructing the OOD (Out-of-Distribution) dataset.
- Xueru Ma: Organizing, collecting, and analyzing experimental data.
- Kangyu Zhu: Setting up the training pipeline, carrying out training, and fine-tuning.
- Yingwei Song: Building the OOD decision network.
Deliverables
Final Written Report : link
Submitted Code on Github: link
Presentation Slides: link
Previous Reflections
Project Check-in #2: link
Project Check-in #3: link