Introduction:
Paper Name: The Role of Attention Mechanism and Multi-Feature in Image Captioning
Paper objectives:
- Develop a novel approach for multi-feature image captioning.
- Compare two different approaches for image captioning using the BLEU score.
Report Link:
https://docs.google.com/document/d/1niSxw-DUPxB5PyPb_M6Ods9S6VBbAGcjtjwb8zVwUb8/edit?usp=sharing
Why we chose this paper:
Every day, thousands of images are uploaded to the internet, via social media sites like Facebook, Instagram, etc. This results in a large, unorganized collection of image data, making it difficult for a user or researcher to retrieve a specific kind of image. This creates the need for a system that can provide a description, or caption, for each image. These descriptions can be used to index the images and make image retrieval more efficient. Secondly, an image captioning system could be integrated into an application to provide real-time captions, helping visually impaired individuals better understand their surroundings.
Type of Problem:
Sequence Modelling (Classification)
Related Work:
No prior work was drawn on for this project.
Dataset:
The Flickr8k dataset consists of approximately 8,000 images, each with an average of 5 captions associated with it. The original Flickr8k dataset does not include a train/test split, so we first split the images and their captions into training and testing sets, using an 80:20 train:test ratio.
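The split itself can be sketched in a few lines (a minimal sketch; the filenames and random seed here are placeholders, not the actual dataset values):

```python
import random

def split_dataset(image_ids, seed=42, train_frac=0.8):
    """Shuffle the image IDs and split them 80:20 into train and test sets."""
    ids = sorted(image_ids)           # deterministic order before shuffling
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_frac)
    return ids[:cut], ids[cut:]

# Dummy IDs standing in for the Flickr8k image filenames.
train_ids, test_ids = split_dataset([f"img_{i}.jpg" for i in range(8000)])
```

Captions are then partitioned by looking up each image ID in its split.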
Data pre-processing:
We used Keras's preprocess_input to preprocess the images, then used VGG16 and InceptionV3 to extract the image features. The preprocessing steps in preprocess_input are similar for the two models but differ in detail: for InceptionV3, the returned NumPy array has pixel values scaled between -1 and 1, and the input shape is 299x299x3, while the input shape for VGG16 is 224x224x3. The features are stored in a dictionary where the key is the image name and the value is that image's feature vector. The two dictionaries obtained from the two pretrained models are stored in separate pickle files.
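The two preprocessing conventions can be written out in plain NumPy for clarity. These helpers mirror the arithmetic that Keras's preprocess_input applies for each model (the channel means are the standard ImageNet values Keras uses; this is an illustrative re-implementation, not the library code):

```python
import numpy as np

def vgg16_preprocess(x):
    """VGG16 ('caffe') convention: convert RGB -> BGR, then subtract
    the per-channel ImageNet means."""
    x = x[..., ::-1].astype("float64")            # RGB -> BGR
    x -= np.array([103.939, 116.779, 123.68])     # ImageNet channel means (BGR)
    return x

def inception_preprocess(x):
    """InceptionV3 ('tf') convention: scale pixels from [0, 255] to [-1, 1]."""
    return x.astype("float64") / 127.5 - 1.0

img = np.zeros((1, 224, 224, 3))                  # a dummy black image
print(inception_preprocess(img).min())            # -1.0
```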
The captions were preprocessed using the NLTK library. The preprocessing steps were: Tokenize -> Stop-word removal -> Add START and STOP tokens -> Padding.
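The caption pipeline can be sketched in pure Python. Here the NLTK tokenizer and stop-word list are replaced with simple stand-ins for illustration, and the maximum length and token names are placeholders:

```python
STOP_WORDS = {"a", "an", "the", "is", "on", "in"}   # tiny stand-in stop-word list

def preprocess_caption(caption, max_len=10, pad_token="<PAD>"):
    tokens = caption.lower().split()                        # 1. tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]     # 2. remove stop words
    tokens = ["<START>"] + tokens + ["<STOP>"]              # 3. add START/STOP tokens
    tokens += [pad_token] * (max_len - len(tokens))         # 4. pad to max_len
    return tokens[:max_len]

print(preprocess_caption("a dog is running on the grass"))
```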
Methodology:
The proposed model uses two approaches for predicting captions for the images: a single-feature approach and a multi-feature approach.
Single Feature Approach:
We use two pretrained models, VGG16 and InceptionV3, to extract features and predict captions from those features, and we compare their performance on image captioning. VGG16 usually represents the objects in the images, while InceptionV3 usually provides middle-level features.
While processing the images with VGG16, we remove the last layer so that the network outputs high-level features rather than class predictions. The extracted image features are then passed through two fully connected dense layers. The captions are passed into an embedding layer and processed with an LSTM. The image and caption features are concatenated and passed into a dense layer, whose output is used to predict the captions.
A similar approach is used with InceptionV3.
Multiple Feature Approach:
In this approach, there are two sub-approaches: a fusion approach and an attention model.
Fusion Approach (InceptionV3 + VGG16 + LSTM)
For this approach, the image features are extracted using both pretrained models. The last layer of VGG16 and of InceptionV3 is removed so that we obtain feature vectors rather than class predictions. The VGG16 features and InceptionV3 features are passed into separate dense layers, and the outputs of the two dense layers are concatenated. These concatenated features are further concatenated with the caption features from the LSTM, then passed through a dense layer (with 'relu' activation) to predict the caption for each image.
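A minimal Keras sketch of the fusion architecture follows. The layer widths, caption length, vocabulary size, and the final softmax layer are our assumptions; the 4096 and 2048 dimensions are the penultimate-layer sizes of VGG16 and InceptionV3 respectively:

```python
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Concatenate
from tensorflow.keras.models import Model

VOCAB_SIZE, MAX_LEN = 5000, 34          # placeholder hyper-parameters
VGG_DIM, INC_DIM = 4096, 2048           # penultimate-layer feature sizes

vgg_in = Input(shape=(VGG_DIM,))        # precomputed VGG16 features
inc_in = Input(shape=(INC_DIM,))        # precomputed InceptionV3 features
v = Dense(256, activation="relu")(vgg_in)
i = Dense(256, activation="relu")(inc_in)
img = Concatenate()([v, i])             # fused image representation

cap_in = Input(shape=(MAX_LEN,))
c = Embedding(VOCAB_SIZE, 256, mask_zero=True)(cap_in)
c = LSTM(256)(c)                        # caption features

z = Concatenate()([img, c])             # fuse image and caption features
z = Dense(256, activation="relu")(z)
out = Dense(VOCAB_SIZE, activation="softmax")(z)
model = Model(inputs=[vgg_in, inc_in, cap_in], outputs=out)
```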
Attention Approach (InceptionV3 + VGG16 + ATTENTION)
According to the research paper, the architecture of the attention model is as follows: the InceptionV3, VGG16, and embedding-layer features are concatenated (as in the fusion approach) and passed through an LSTM layer. The output features are concatenated with the InceptionV3 features again, so that the middle-level features are re-injected into the model. The final concatenated features are passed through another LSTM layer and a softmax layer, whose output is used to predict the captions.
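One possible Keras reading of this description is sketched below. The RepeatVector broadcast of the image features across caption timesteps, the layer widths, and the vocabulary size are our assumptions; the paper may realize the concatenation differently:

```python
from tensorflow.keras.layers import (Input, Dense, Embedding, LSTM,
                                     Concatenate, RepeatVector)
from tensorflow.keras.models import Model

VOCAB_SIZE, MAX_LEN = 5000, 34
VGG_DIM, INC_DIM = 4096, 2048

vgg_in = Input(shape=(VGG_DIM,))
inc_in = Input(shape=(INC_DIM,))
cap_in = Input(shape=(MAX_LEN,))

# Broadcast each image feature vector across all caption timesteps.
v_seq = RepeatVector(MAX_LEN)(Dense(256, activation="relu")(vgg_in))
i_seq = RepeatVector(MAX_LEN)(Dense(256, activation="relu")(inc_in))
c = Embedding(VOCAB_SIZE, 256)(cap_in)

# First LSTM over the concatenated image + caption features.
h = LSTM(256, return_sequences=True)(Concatenate()([v_seq, i_seq, c]))
h = Concatenate()([h, i_seq])       # re-inject the InceptionV3 features
h = LSTM(256)(h)                    # second LSTM
out = Dense(VOCAB_SIZE, activation="softmax")(h)
model = Model(inputs=[vgg_in, inc_in, cap_in], outputs=out)
```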
Metrics:
We used the BLEU score to measure the quality of the model's captions. The BLEU score compares the n-grams of the predicted caption to those of the reference captions, and its value ranges between 0 and 1. A BLEU score closer to 1 indicates that the model generates captions very similar to the ones in the dataset.
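The unigram (BLEU-1) core of this metric can be sketched in pure Python. This is a simplified version: full BLEU also combines higher-order n-grams and applies a brevity penalty, both omitted here:

```python
from collections import Counter

def bleu1(reference, candidate):
    """Modified unigram precision -- the core of the BLEU-1 score.
    Each candidate word is credited at most as many times as it
    appears in the reference (the 'clipping' step)."""
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    overlap = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    return overlap / max(len(candidate), 1)

ref = "a dog runs on the grass".split()
cand = "a dog runs on grass".split()
print(bleu1(ref, cand))   # 1.0 -- every candidate word appears in the reference
```

In practice, a library implementation such as NLTK's sentence_bleu is used so that all n-gram orders and the brevity penalty are handled.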
Base Goal:
Implementing the two approaches on the Flickr8k dataset.
Target Goal:
Implementing the two approaches on the Flickr30k dataset.
Stretch Goal:
Modifying approach 1: replacing the LSTM with an attention model.
Ethics:
Who are the major stakeholders in this problem? What are the consequences of mistakes made by the algorithm? The major stakeholders for our project are visually impaired people. The model could be integrated into an application used by these stakeholders. It could potentially cause physical harm if the generated caption is wrong and the user misunderstands their surroundings or situation.
Why is deep learning a good approach for solving this problem? Deep learning works well for image captioning because the model has the flexibility to extract and learn the features that best predict a caption for each image.
Challenges we ran into:
Multiple Captions:
In the Flickr8k dataset, each image has 5 captions associated with it. It was challenging to embed all 5 captions for each image. Our approach was to loop through each caption and generate input and output sequences for it; the sequences and image features were then passed into the models.
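The loop described above can be sketched as follows. The word_to_id mapping and the toy vocabulary are placeholders; each caption yields one (prefix, next-word) training pair per position, as in teacher forcing:

```python
def make_training_pairs(captions, word_to_id):
    """For each caption, emit (input-prefix, next-word) pairs."""
    pairs = []
    for caption in captions:                 # loop over all 5 captions per image
        ids = [word_to_id[w] for w in caption]
        for t in range(1, len(ids)):
            pairs.append((ids[:t], ids[t]))  # prefix -> next token
    return pairs

vocab = {"<START>": 1, "dog": 2, "runs": 3, "<STOP>": 4}
pairs = make_training_pairs([["<START>", "dog", "runs", "<STOP>"]], vocab)
# pairs: ([1], 2), ([1, 2], 3), ([1, 2, 3], 4)
```

The corresponding image feature vector is repeated alongside every pair generated from that image's captions.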
Run Time:
Since we are using the VGG16 and InceptionV3 convolutional models, the number of parameters in our model is very high (over 4 million). As a result, the model takes around an hour to train and another hour to test for a single epoch. This made it difficult to run and evaluate a large number of epochs.
What we learned:
The LSTM mechanism, the attention mechanism, the VGG16 and InceptionV3 architectures, and the BLEU score.
Division of labor:
Aakansha: preprocessing of images and captions, Single Feature Approach.
Hiloni: Fusion Approach and Attention Approach.

