Title: Find a specific photo in the album

Who: Han Cai, Ce Zhang

Inspiration (Introduction):

In daily life, we have all encountered the situation where it is hard to find a specific photo in our album. We may forget the exact location or time of shooting and remember only a few key details, so we have to look through the whole album to find the photo we want, which is very time-consuming. We therefore want to develop a model that can find the specific photo we want from only a few words or a short description!

Related Work:

In this paper, the researchers study the problem of image-text matching. They introduce a method that infers the latent semantic alignment between objects or other salient regions (e.g., snow, sky, lawn) and the corresponding words in sentences, which captures the fine-grained interplay between vision and language and makes image-text matching more interpretable.

In this paper, the researchers demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn state-of-the-art image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks.

Data:

We used the Flickr30k dataset in this project, a large corpus of 30K images and 150K descriptive captions (five per image). We split the dataset into a train set, a validation set, and a test set: there are 1,000 images in each of the validation and test sets and 29,783 images in the train set. We stored the information for all images in three .json files, one per split, where each image name is a dictionary key and the corresponding value is a list containing the image's five captions and its location.
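As an illustration of the layout described above, a single entry in one of the split files might look like the toy dictionary below. The image name, captions, and path here are made-up stand-ins, not actual dataset contents:

```python
# Toy example of the described JSON structure: each image name keys a list
# of that image's five captions plus the image's location on disk.
# (The file name, captions, and path below are illustrative assumptions.)
album = {
    "example_0001.jpg": [
        "Two young men are standing outside near bushes.",
        "Two men in green shirts are in a yard.",
        "A man in a blue shirt stands in a garden.",
        "Two friends enjoy time spent together outdoors.",
        "Two people talk next to a hedge.",
        "flickr30k_images/example_0001.jpg",
    ],
}

def captions_of(entry):
    # The first five elements of an entry are its captions.
    return entry[:5]

def location_of(entry):
    # The last element is the image's location.
    return entry[-1]
```

Keeping the captions and the location in one list keeps each split file self-contained, at the cost of relying on position to distinguish the two kinds of field.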

Methodology:

  • Architecture: We first used a pre-trained Faster R-CNN object detection model to detect the objects in each image. We then used a pre-trained ResNet-34 model to generate an object embedding for each detected object. Next, for each image, we passed the object embeddings through an image encoder and each word of the caption through a text encoder. Finally, we calculated a similarity score for each image-caption pair. The full model architecture is shown in the Final Writeup.
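The dataflow above can be sketched as follows. This is a minimal stand-in, not the trained model: the encoders are reduced to mean-pooling plus a linear projection, and all dimensions (512-d ResNet features, 300-d word vectors, a 256-d joint space) are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 256  # joint embedding size (an assumption; the writeup does not fix it)

def image_encoder(object_embeddings, W_img):
    # Stand-in encoder: mean-pool the per-object features, project into
    # the joint space, and L2-normalize so dot products are cosines.
    pooled = object_embeddings.mean(axis=0)
    v = pooled @ W_img
    return v / np.linalg.norm(v)

def text_encoder(word_embeddings, W_txt):
    # Same scheme on the text side: pool per-word vectors, project, normalize.
    pooled = word_embeddings.mean(axis=0)
    t = pooled @ W_txt
    return t / np.linalg.norm(t)

# Toy inputs: 36 detected objects with 512-d features (as a ResNet-34
# backbone might produce) and a 12-word caption with 300-d word vectors.
objects = rng.standard_normal((36, 512))
words = rng.standard_normal((12, 300))
W_img = rng.standard_normal((512, EMBED_DIM))
W_txt = rng.standard_normal((300, EMBED_DIM))

v = image_encoder(objects, W_img)
t = text_encoder(words, W_txt)
similarity = float(v @ t)  # cosine similarity of this image-caption pair
```

Because both encoders normalize their outputs, the similarity score is bounded in [-1, 1], which makes scores comparable across pairs.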

  • How are you training the model: Given the similarity of each image-caption pair in a batch of N pairs, the model learns a multi-modal embedding space by jointly training the image encoder and the text encoder to maximize the similarity of the embeddings of the N real pairs while minimizing the similarity of the embeddings of the N^2 - N incorrect pairings. We optimized the model by minimizing a symmetric cross-entropy loss over these similarity scores.
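The symmetric cross-entropy objective can be written down concretely: within a batch, each row and each column of the N x N similarity matrix is treated as a classification problem whose correct answer is the diagonal entry, and the two directions (image-to-text and text-to-image) are averaged. The sketch below is a numpy rendering of that idea; the temperature value is an assumption:

```python
import numpy as np

def symmetric_ce_loss(image_emb, text_emb, temperature=0.07):
    """Contrastive loss over a batch of N matched (image, text) embeddings.

    Logits are scaled cosine similarities; the target for row/column i is
    the diagonal entry i (the real pair). The 0.07 temperature is an
    illustrative default, not a value from the writeup.
    """
    # Normalize so dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # shape (N, N)

    def ce_diagonal(l):
        # Cross entropy where the correct class for row i is column i.
        l = l - l.max(axis=1, keepdims=True)  # for numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()

    # Average the image->text and text->image directions.
    return 0.5 * (ce_diagonal(logits) + ce_diagonal(logits.T))
```

Minimizing this loss pushes the N real pairs' similarities up and the N^2 - N mismatched similarities down simultaneously, which is exactly the behavior described above.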

Metrics:

  • We will evaluate the performance of the model on the Flickr30k test set.
  • Base goal: R@1, R@5, R@10 = 24%, 54%, 65% on Flickr30k
  • Target goal: R@1, R@5, R@10 = 33%, 62%, 71% on Flickr30k
  • Stretch goal: R@1, R@5, R@10 = 43%, 72%, 80% on Flickr30k
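The R@K numbers above measure how often the correct image appears among a query's top-K retrieved results. A small sketch of how they might be computed from a caption-by-image similarity matrix (assuming, for simplicity, one caption per image with caption i matching image i):

```python
import numpy as np

def recall_at_k(similarity, ks=(1, 5, 10)):
    """Recall@K from a similarity matrix.

    similarity[i, j] is the score of caption i against image j; the ground
    truth is assumed to be the diagonal (caption i <-> image i).
    """
    n = similarity.shape[0]
    # Rank images for each caption from most to least similar.
    ranking = np.argsort(-similarity, axis=1)
    # Zero-based position of the correct image in each caption's ranking.
    correct_rank = np.array(
        [np.where(ranking[i] == i)[0][0] for i in range(n)]
    )
    # R@K = fraction of captions whose correct image ranks in the top K.
    return {k: float((correct_rank < k).mean()) for k in ks}
```

With five captions per image, as in Flickr30k, the ground-truth bookkeeping is slightly more involved (five query rows map to each image), but the top-K counting is the same.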

Ethics:

  • Why is Deep Learning a good approach to this problem?
    Deep learning can learn complex, high-level features from both images and text and discover the relationships between them. It is also an end-to-end method, training a possibly complex learning system as a single model and bypassing hand-engineered intermediate stages.

  • What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative?
    We will use Flickr30k to train our model. The photos were collected from the Internet, so they may include personal or private content. The dataset is also not fully representative: it covers only 44,518 object categories.

Division of labor:

  • Han: responsible for data preprocessing and object detection
  • Ce: responsible for training the image and text encoders and improving the model

Project Check in #2:

One-page reflection can be found here: https://docs.google.com/document/d/1FUHWcYWE49VzS2C8BrPIk1583Re8EXnV/edit?usp=sharing&ouid=116398483904985608085&rtpof=true&sd=true

Final Writeup/Reflection:

Our final writeup can be found here: https://docs.google.com/document/d/1cVsk5P0fYlyQjy05dqlTOunm5MycKCBoMQqiX8fE7tA/edit?usp=sharing
