Watch our video presentation HERE
we apologize for the poor audio quality of our video, wearing masks created unintended white noise in our recording
Take a look at our final writeup HERE
Take a look at our final poster HERE
Take a look at our code HERE
Title: Neural Image Caption Generator
Who: Lila Zimbalist (lzimbali), Udayveer Sodhi (usodhi)
Introduction: We are planning to implement an existing paper (https://arxiv.org/pdf/1411.4555.pdf). The paper solves the problem of structured prediction and classification. It ’s objective is to use a neural network which processes images to generate a caption for the image. We chose this paper because this seemed like an interesting and practical way to use the skills we’ve learnt in this class. Who hasn’t taken a picture and wished that it was automatically given a really objective caption!
Related Work: Microsoft’s Rich Image Captioning in the Wild also uses neural networks to generate captions for images, with a machine translation-inspired encoder-decoder framework. It consists of four main components: a model for image feature extraction, a language model for generation of potential captions and rankings, an entity recognition unit for landmarks and celebrities, and a classifier to estimate the confidence score (which estimates how much the generated caption could be trusted.
There are many existing implementations of this professor that we were able to find https://github.com/nikhilmaram/Show_and_Tell https://github.com/muggin/show-and-tell https://github.com/jazzsaxmafia/show_and_tell.tensorflow https://github.com/vedal/show-and-tell https://github.com/serhii-havrylov/ShowAndTell
Data: We will be using the flickr8k dataset. It has 8,000 images total, and we will be training on 6,000 of those images. This will require pretty comparable levels of preprocessing to the rest of the projects we’ve done so far in this class.
Methodology: What is the architecture of your model? We will train the model by using a combination of CNNs to represent the images and embeddings to represent the words. We think the hardest part of implementing our model is going to be figuring out the exact right model architecture. We’re going to have to figure out how all of the pieces of this model fit together to be as accurate an efficient as possible.
Built With
- python
- tensorflow
Log in or sign up for Devpost to join the conversation.