Captioning Images Through Identifiers

Introduction

We are implementing a paper that uses a transformer to identify the items in an image and then caption the image based on the items that are present. The task combines classification with structured prediction; this combination piqued our interest and is part of the reason we chose to implement this paper. The goal of the paper is to identify up to 80 objects in an image and then, using the relevance of each object, craft a caption that appropriately describes the image.

Related Work

Before the transformer revolution, most tasks of this kind were handled by feeding the output of a CNN into an RNN, which then generated the sentence. This architecture was made out of date by the transformer, which condenses the two models into one unit. We find that a lot of the related work using this two-piece model follows the intuition for how this task might be approached.

Data

The dataset is Microsoft Common Objects in Context, or COCO for short. It contains 328 thousand images that range in content across a variety of topics. Given both the nature of the model and the standardized format of the dataset, we anticipate that we won't have much preprocessing to take care of.
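As a rough illustration of the little preprocessing we do expect, the sketch below groups reference captions by image. It assumes the COCO caption annotation layout, in which each entry under "annotations" carries an "image_id" and a "caption"; the sample record and the function name captions_by_image are made up for illustration and are not taken from the paper.

```python
from collections import defaultdict


def captions_by_image(annotation_data):
    """Group caption strings by the id of the image they describe.

    annotation_data is assumed to follow the COCO caption layout:
    a dict with an "annotations" list of {"image_id", "caption", ...}.
    """
    grouped = defaultdict(list)
    for ann in annotation_data["annotations"]:
        # Strip stray whitespace so captions tokenize consistently.
        grouped[ann["image_id"]].append(ann["caption"].strip())
    return dict(grouped)


# Made-up sample in the assumed layout (not real COCO data).
sample = {
    "images": [{"id": 1, "file_name": "example_1.jpg"}],
    "annotations": [
        {"id": 10, "image_id": 1, "caption": "A dog catching a frisbee. "},
        {"id": 11, "image_id": 1, "caption": "A dog jumps in a park."},
    ],
}

grouped = captions_by_image(sample)
print(grouped[1])
```

Keeping every reference caption for an image together makes it straightforward to train against several captions per image or to sample one per epoch.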

Methodology

Training comes in two parts: first we train a model for sentence generation, then we apply this model as the decoder of a transformer. The transformer takes images as input and classifies them; the classification output then becomes the input to the decoder, which uses the sentence generation model and the features to create a caption. Our base goals are to have a model that runs and can at least accurately classify the features of the image. We hope to get a model that can caption the image based on at least some combination of the features in the image. Our stretch goals are to increase the accuracy of the captioning, including consistently captioning images based on their primary features. We anticipate that the most difficult part of the model will be the decoder, along with basing the captions on what is in the image in a cohesive context.
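To make the hand-off between the classification stage and the decoder more concrete, here is a minimal sketch under our own assumptions. It takes per-class scores, keeps the classes above a threshold, orders them by relevance, and builds a token sequence that a decoder could be conditioned on. The threshold of 0.5, the "<sep>" and "<start>" tokens, and both function names are hypothetical; the paper may pass features to the decoder in a different form.

```python
def select_objects(scores, threshold=0.5, max_objects=80):
    """Return labels whose score meets the threshold, most relevant first.

    scores maps a class label to a score in [0, 1]. The threshold is an
    assumed value; max_objects mirrors the 80-object limit of the task.
    """
    kept = [(label, score) for label, score in scores.items() if score >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [label for label, _ in kept[:max_objects]]


def build_decoder_input(labels, sep_token="<sep>", start_token="<start>"):
    """Prefix the caption start token with the detected labels.

    The special tokens are hypothetical placeholders for whatever
    vocabulary the sentence generation model ends up using.
    """
    return list(labels) + [sep_token, start_token]


# Made-up classifier output for a single image.
scores = {"dog": 0.92, "frisbee": 0.71, "car": 0.08}
labels = select_objects(scores)
print(build_decoder_input(labels))
```

Ordering the labels by score is one simple way to expose the relevance of each object to the decoder; a learned attention over image features would be an alternative.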

Metrics

Success in this instance is a little difficult to quantify, but generally we are looking for the model to produce sensible captions. These captions should be grammatical, and should correctly identify both what is in the image and what its main component is. As for more specific numeric metrics, we will report perplexity and accuracy. Our base goal is a perplexity of 200, to establish basic understanding of the image and some sort of learning. Our target is a perplexity of 60, and our stretch goal is a much lower perplexity of 30.
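For reference, perplexity is the exponential of the average negative log-probability the model assigns to each reference token, so a perplexity of 200 corresponds to a model that is as uncertain as a uniform choice among 200 tokens. The sketch below shows one way the two metrics could be computed; it assumes natural-log token probabilities and a padding id of 0, and the function names are ours rather than the paper's.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from the natural-log probabilities of the reference tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def token_accuracy(predicted_ids, reference_ids, pad_id=0):
    """Fraction of non-padding reference tokens that were predicted exactly.

    pad_id=0 is an assumed convention for padded positions.
    """
    pairs = [(p, r) for p, r in zip(predicted_ids, reference_ids) if r != pad_id]
    if not pairs:
        return 0.0
    return sum(p == r for p, r in pairs) / len(pairs)


# A model that gives every reference token probability 1/200.
uniform_log_probs = [math.log(1 / 200)] * 5
print(perplexity(uniform_log_probs))
print(token_accuracy([5, 7, 9, 0], [5, 8, 9, 0]))
```

Masking out padded positions matters for both metrics, since counting padding would make short captions look easier than they are.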

Ethics

One relevant social issue is making sure that no biased information enters the model's captions of images of certain people. For example, we don't want captions that unjustly label people as "criminals" or "losers" based on their skin color. Naturally, this could also be a problem for human-labelled photos; however, discrepancies could be significantly worse when a computer is not being actively monitored. Unlike many datasets, the captions in the COCO dataset we will be working with were not scraped directly from the internet. Instead, the people captioning the images were paid for the work through Amazon's Mechanical Turk. While this doesn't absolve the labels of bias and unethical language, it does give us somewhat greater confidence than we would have in internet-sourced datasets, which are plagued by historical biases. On top of these workers, the dataset also included some "expert" labelings, which came from those with greater proficiency and were weighted more heavily than the Mechanical Turk responses. We will be quantifying success based on perplexity, along with the subjective coherence of the sentences produced; we don't believe that this will create any type of bias in our measurement.