Generating Image Descriptions
Check Point 2
We are implementing an existing paper whose objective is to generate text descriptions of images. Unlike previous work that retrieved image descriptions by using GPS metadata to access relevant text documents, the method in this paper first detects objects, modifiers (adjectives), and spatial relationships (prepositions) in the image, and then uses either an n-gram language model or a simple template-based approach to generate the sentences/captions. This method puts more emphasis on the objects and their positions/spatial orientation and attributes. The problem is both a classification problem and a natural language processing problem.
We chose this paper because it combines many of our interests: computer vision, classification, and natural language processing. Generating image descriptions that closely resemble human speech, while also providing positional orientation and descriptors for the objects found in the picture, effectively mimics what it is like to see the image. For people who are visually impaired, effective text descriptions of images will allow them to better understand what is going on in images.
One challenge was storing and moving large files, specifically features.pkl, which was produced in preprocessing. Since features.pkl was over the size limit for pushing to Git, we learned how to use Git LFS (Large File Storage) to move the file into the remote repository.
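For reference, the Git LFS workflow we used looks roughly like this (run once in the repository; requires git-lfs to be installed, and the branch name `main` is an assumption):

```shell
# Install LFS hooks in this repository (git-lfs must be installed first).
git lfs install

# Tell LFS to manage the large pickle file; this writes a .gitattributes entry.
git lfs track "features.pkl"

# Commit the tracking rule and the file itself, then push as usual.
git add .gitattributes features.pkl
git commit -m "Track features.pkl with Git LFS"
git push origin main
```

After this, pushes upload features.pkl to LFS storage and the repository only holds a small pointer file.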
Another challenge we have encountered is dealing with the differences between the original paper and our project. We chose to make our base and target goals simpler than the paper's, because the paper uses certain architectures that we have not yet learned. Our target goal is to have our model recognize primary objects in an image along with their modifiers; our captions would then consist of just the objects and their modifiers. Because of this simplification, we have been drawing on other related works as our primary sources for the architecture of this project, since parsing the original research paper into implementation-specific details relevant to our current goals was nearly impossible. Integrating different sources makes designing our model more difficult, because we need to continuously analyze the differences in their implementations and purposes.
According to the paper “Where to put the Image in an Image Caption Generator”, these are the ranges of “good” BLEU scores at each stage:
- BLEU-1: 0.401 to 0.578
- BLEU-2: 0.176 to 0.390
- BLEU-3: 0.099 to 0.260
- BLEU-4: 0.059 to 0.170
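Once our testing code works, we can score generated captions against reference captions. As a sanity check independent of any library, here is a minimal pure-Python sketch of sentence-level BLEU (uniform weights and a shortest-reference brevity penalty, which is a simplification; real evaluations typically use NLTK's `corpus_bleu`):

```python
import math
from collections import Counter

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU sketch: candidate and references are token lists."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        # Clip each candidate n-gram count by its max count in any reference.
        max_ref = Counter()
        for ref in references:
            ref_ngrams = Counter(tuple(ref[i:i + n])
                                 for i in range(len(ref) - n + 1))
            for gram, count in ref_ngrams.items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram])
                      for gram, count in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(clipped / total)
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty against the shortest reference (a simplification;
    # standard BLEU uses the closest reference length).
    ref_len = min(len(ref) for ref in references)
    bp = 1.0 if len(candidate) >= ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

BLEU-1 corresponds to `max_n=1`, BLEU-4 to the default; a perfect match scores 1.0.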
Although we are aiming for these scores, we do not have BLEU scores at the moment because we are still debugging our testing code. However, we have a concrete plan for how to debug it and should be able to see scores soon. As for other concrete results, we have produced 1) descriptions.txt, a file storing the dictionary of image identifiers mapped to descriptions, and 2) features.pkl, a file storing the dictionary of extracted features for each image.
We are on track so far. We need to dedicate more time to researching other similar research papers and projects that will help us bridge the difference between our current implementation and the one used in the original research paper we referenced. We also need to dedicate more time to trying potential ways to bridge this gap.
We are thinking of changing the way our captions are generated. Our current implementation uses a dataset of pre-generated descriptions: given an image, the model matches it to one of these descriptions. However, we want our model to generate sentences using an n-gram model, drawing from a dictionary of individual words rather than pre-generated sentences.
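A minimal sketch of that n-gram direction, using a bigram model over the training captions (the function names here are ours, not from the paper):

```python
import random
from collections import defaultdict

def train_bigrams(captions):
    """Count bigram transitions over tokenized captions with start/end markers."""
    counts = defaultdict(lambda: defaultdict(int))
    for caption in captions:
        tokens = ["<s>"] + caption.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, max_len=10, rng=None):
    """Sample a sentence word by word, weighting by bigram frequency."""
    rng = rng or random.Random(0)
    word, out = "<s>", []
    for _ in range(max_len):
        nxt = counts.get(word)
        if not nxt:
            break
        choices, weights = zip(*nxt.items())
        word = rng.choices(choices, weights=weights)[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)
```

In the full system, the candidate vocabulary at each step would be restricted to the object, attribute, and preposition labels detected in the image rather than the whole dictionary.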
Check Point 1
Our project will generate image descriptions by analyzing the objects in each input image and recording their attributes and orientation. It will then generate a description of the image scene based on the objects and attributes recorded.
tl;dr: take in images as input and generate sentences.
Ria Rajesh [rrajesh], Mandy He [mhe26], and Sophia Liu [sliu176]
We found no public implementations of the paper. However, in a web article titled "How to Develop a Deep Learning Photo Caption Generator from Scratch", we found code related to our topic written in TensorFlow. The article walks through how to obtain the photo and caption dataset, how to prepare the photo data with the Oxford Visual Geometry Group's (VGG) pre-trained model and pre-compute photo features, how to prepare the text data through data cleaning, and how to develop the deep learning model. It also explains how to evaluate the model and how to generate new captions.
We plan to use training images from Flickr, Google, the attribute dataset provided by Farhadi et al., and ImageNet, as used in the paper. For image descriptions, we will collect them by querying the Flickr API with the objects' categories. We aim to gather upwards of 50,000 descriptions and parse them with the Stanford dependency parser.
The pipeline works as follows:
1. Primary object (person, bus, etc.) and surrounding object (grass, sky, etc.) detectors find candidate objects.
2. Each candidate region is processed by a set of attribute classifiers.
3. Each pair of candidate regions is processed by prepositional relationship functions.
4. A CRF (conditional random field) is constructed that incorporates the unary image potentials computed in the steps above and higher-order text-based potentials computed from large document corpora.
5. A labeling of the graph is predicted.
6. Sentences are generated using an n-gram model based on the labeling.
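To keep the stages straight in our own code, the pipeline can be sketched as plain functions. The detectors and classifiers below are hard-coded stand-ins (not real models), and the CRF inference step is elided; a real system would plug trained models in at each stage:

```python
def detect_objects(image):
    # Stand-in for the primary/surrounding object detectors (step 1).
    return [{"label": "dog", "box": (10, 20, 80, 90)},
            {"label": "grass", "box": (0, 60, 200, 120)}]

def classify_attributes(image, region):
    # Stand-in for the per-region attribute classifiers (step 2).
    return {"dog": ["black"], "grass": ["green"]}.get(region["label"], [])

def predict_preposition(region_a, region_b):
    # Stand-in for the pairwise prepositional functions (step 3):
    # crude vertical-position heuristic on the bounding boxes.
    return "on" if region_a["box"][1] < region_b["box"][1] else "under"

def caption(image):
    regions = detect_objects(image)
    labels = []
    for region in regions:
        attrs = classify_attributes(image, region)
        labels.append(" ".join(attrs + [region["label"]]))
    prep = predict_preposition(regions[0], regions[1])
    # A real system would run CRF inference over the potentials first
    # (steps 4-5), then hand the chosen labeling to an n-gram or template
    # generator (step 6); here we template directly from the stand-ins.
    return f"the {labels[0]} is {prep} the {labels[1]}"
```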
How are we training the CRF? With 100 hand-labeled images (see Section 4.2 of the paper).
The CRF is used to predict the best labeling for an image: nodes correspond to objects (things or stuff), attributes describing the appearance of each object, and prepositions describing the spatial relationships between objects.
Since preposition nodes describe the relationship between a preposition label and two object labels (obj-prep-obj), they are most naturally modeled with trinary potential functions. However, we must convert each trinary potential into a set of unary and pairwise potentials through an auxiliary z-node.
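Concretely, the conversion introduces an auxiliary node whose states are whole triples, turning the trinary potential into one unary potential plus pairwise consistency potentials:

```latex
% Trinary potential over two object nodes and their preposition node:
%   \psi(o_i, p_{ij}, o_j)
% Introduce an auxiliary node z_{ij} whose states range over all triples
% (o_i, p_{ij}, o_j). The trinary potential becomes a unary potential on z:
\phi(z_{ij}) = \psi\bigl(o_i(z_{ij}),\, p_{ij}(z_{ij}),\, o_j(z_{ij})\bigr)
% Pairwise indicator potentials then force z_{ij} to agree with each of
% the original nodes (shown for o_i; p_{ij} and o_j are analogous):
\phi(z_{ij}, o_i) = \mathbb{1}\bigl[\, o_i(z_{ij}) = o_i \,\bigr]
```

The cost is that the z-node's state space is the cross product of the three label sets, which is why the label vocabularies are kept small.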
Backup ideas
If we run into issues, we plan to simplify the project by removing the spatial relationships of objects and just generating captions describing objects and their attributes. We can also use the methods described in the papers of related projects (see related work section).
We plan to perform experiments where we input an image into the model and manually examine the output caption. We then determine whether the caption was generated correctly given the image contents. The notion of accuracy does apply to our project, but perhaps not as strictly as it applies to a project that just classifies objects in images. Although classifying images is a part of our project, our project's goal is to generate captions, and there are many potentially accurate captions for a given image. In the existing paper, the authors manually scored the generated image captions in three categories (quality of image parsing, language model-based generation, and template-based generation), hoping to find high average scores. The scores were on a scale of 1 to 4, where 4 was perfect without error, 3 was good with some errors, 2 was many errors, and 1 was failure. The results were an average score of 3.49 for the quality of generated sentences (very high score!) and a relatively high score of 2.85 for predicting image content.
Goal of the authors of this paper
The authors of this paper sought to generate captions rather than just retrieve and summarize captions from textual documents related to the images like previous similar projects/papers have done. Many previous projects leading up to this paper put little emphasis on the spatial relationships between objects. The authors of this paper also wanted spatial relationships between objects to be an important part of their project.
Base, target, and stretch goals
Base goal (what we definitely can achieve by the final due date): Recognize primary objects in an image. Our captions would then consist of just the objects.
Target goal (what we should be able to achieve by the due date): Recognize primary objects in an image along with their modifiers. Our captions would then consist of the objects and their modifiers.
Stretch goal (what we want to do if we exceed our target goal): Include spatial relationships in captions. This would mean our results fully match those described in the paper.
Why is Deep Learning a good approach to this problem?
Deep learning is a good approach to this problem because it is infeasible for a human to write text descriptions for every image. A successful deep-learning technique for generating these descriptions will be much easier and more practical in the long run, for example for use by a screen reader for people who are visually impaired.
How are you planning to quantify or measure error or success? What implications does your quantification have?
We are planning to measure error and success based on whether the output sentences are an accurate representation of the input image. For example, if we input a picture of a black dog, we would expect the output sentence to describe the image as including a black dog. Because our algorithm has to categorize objects in an image, it could make assumptions about those objects that increase specificity. We want to limit such assumptions as much as possible. For example, given an image of a person, we would not want the algorithm to describe them as a "man" or a "woman", to avoid potentially harmful stereotypes; we would rather describe the image as containing a "person". We also noticed that in the paper, one of the described images included the phrase "black person" for a person who was wearing black clothing. We would want to avoid categorizing this as a "black person".
Division of labor
This project is quite the challenge for each of us individually. Moreover, this project divides naturally into two heavy parts (classification and caption generation). Our group agreed to meet and tackle each part together during group meetings to bounce ideas off each other and work as a group through discussion and pair-programming. We plan to swap who is the driver each meeting.