Watch our video presentation HERE

we apologize for the poor audio quality of our video, wearing masks created unintended white noise in our recording

Take a look at our final writeup HERE

Take a look at our final poster HERE

Take a look at our code HERE

Title: Neural Image Caption Generator

Who: Lila Zimbalist (lzimbali), Udayveer Sodhi (usodhi)

Introduction: We are planning to implement an existing paper (https://arxiv.org/pdf/1411.4555.pdf). The paper solves the problem of structured prediction and classification. It ’s objective is to use a neural network which processes images to generate a caption for the image. We chose this paper because this seemed like an interesting and practical way to use the skills we’ve learnt in this class. Who hasn’t taken a picture and wished that it was automatically given a really objective caption!

Related Work: Microsoft’s Rich Image Captioning in the Wild also uses neural networks to generate captions for images, with a machine translation-inspired encoder-decoder framework. It consists of four main components: a model for image feature extraction, a language model for generation of potential captions and rankings, an entity recognition unit for landmarks and celebrities, and a classifier to estimate the confidence score (which estimates how much the generated caption could be trusted.

There are many existing implementations of this professor that we were able to find https://github.com/nikhilmaram/Show_and_Tell https://github.com/muggin/show-and-tell https://github.com/jazzsaxmafia/show_and_tell.tensorflow https://github.com/vedal/show-and-tell https://github.com/serhii-havrylov/ShowAndTell

Data: We will be using the flickr8k dataset. It has 8,000 images total, and we will be training on 6,000 of those images. This will require pretty comparable levels of preprocessing to the rest of the projects we’ve done so far in this class.

Methodology: What is the architecture of your model? We will train the model by using a combination of CNNs to represent the images and embeddings to represent the words. We think the hardest part of implementing our model is going to be figuring out the exact right model architecture. We’re going to have to figure out how all of the pieces of this model fit together to be as accurate an efficient as possible.

Github: https://github.com/usodhi/bottlecommared

Built With

Share this project:

Updates

posted an update

Check in #2 Introduction We are planning to implement an existing paper (https://arxiv.org/pdf/1411.4555.pdf). The paper solves the problem of structured prediction and classification. Its objective is to use a neural network which processes images to generate a caption for the image. We chose this paper because this seemed like an interesting and practical way to use the skills we’ve learnt in this class. Who hasn’t taken a picture and wished that it was automatically given a really objective caption!

Challenges So far our work has been focused on preprocessing our data into a usable format. The Flickr8k dataset we’re using contains the data in a somewhat disorganised manner, so the work we’ve done so far has focused on making sure the data (images, captions) are organised in a logical manner that will make training and testing easier for us. We’ve also begun the process of collating the data for usage in Python, which has proven slightly more difficult we expected as the images in this dataset are in a different format than what we’ve seen in class so far. We may yet have to make changes to how this is organised, but we’re satisfied with the way it’s been set up so far.

Insights At the moment, we don’t have anything concrete to display in terms of a model. We do, however, have some of our preprocessing completed to set up the data in Python in a similar manner as that recommended in the paper.

Plan We think that we’re probably slightly behind on implementing the paper, but both our schedules are more free in this coming week than they have been in the past, so we’re sure that we have the time to devote to it in the near future. We need to finish our preprocessing, then begin work on the components of our model (the CNNS, embeddings) and the architecture of the model. We aren’t thinking of changing anything in particular at the moment. In the next few days, we plan on finishing preprocessing, designing, implementing and training our model, then testing and making changes as necessary. Post that, we will work on our ablation studies and qualitative evaluation.

Log in or sign up for Devpost to join the conversation.