
Title:
A natural language processing model that generates intelligible Arabic captions for images.

Introduction:
While there has been a large amount of work on generating English captions for images, work using deep learning to generate Arabic captions has been limited. As a result, we decided to pursue a project in this underexplored area. The task is a supervised learning problem. We will be re-implementing the 2020 paper entitled “Resources and End-to-End Neural Network Models for Arabic Image Captioning.” It is also worth noting that this paper builds on the 2018 paper entitled “Automatic Arabic Image Captioning using RNN-LSTM-Based Language Model and CNN,” which uses a slightly different model architecture for the same task, so we may also draw upon that paper’s approach. The motivations for both of these papers align with our personal desire to explore Arabic image captioning more deeply and bring this beneficial service to a wider audience.

Challenges:
There have been two main challenges in implementing our project so far. First, our CNN outputs a 256-dimensional feature vector, which is passed into our caption-generating LSTM along with the words generated so far toward the Arabic caption. The difficulty was in determining how the LSTM should produce its first word. We decided to use teacher forcing to supply this first word for our training data, but were unsure what to do for our test data, since test examples represent new images the model has never seen. Based on an EdStem post, we eventually decided to look into one of several decoding strategies: beam search, greedy decoding, or random decoding (sketched below). The second main challenge has been figuring out what window_size means in our LSTM model and how to set the length of the generated captions. Ideally, we would like to avoid a fixed caption length, since the length needed to produce a semantically sound caption can vary widely by image.

Insights:
Are there any concrete results you can show at this point?
Most of the concrete results we can show at this point involve building our training-data vocabulary, tokenizing it, and creating word embeddings to pass into our LSTM (sketched below). Most of the implementation for the CNN and image feature extraction using the VGG16 model subpackage is also done. Since we have not completely worked out teacher forcing for our LSTM, we have not yet been able to run the complete model and compare its BLEU score to an expected score.
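
To make the vocabulary and embedding steps concrete, here is a minimal sketch using the Keras Tokenizer. The Arabic captions, the <start>/<end> markers, and the 256 embedding dimension are illustrative assumptions, not our actual data or final hyperparameters:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding

# Illustrative training captions, wrapped in <start>/<end> markers.
captions = [
    "<start> قطة تجلس على الأريكة <end>",
    "<start> كلب يركض في الحديقة <end>",
]

# filters="" so the <start>/<end> markers are not stripped as punctuation.
tokenizer = Tokenizer(oov_token="<unk>", filters="")
tokenizer.fit_on_texts(captions)
vocab_size = len(tokenizer.word_index) + 1  # +1 for the padding index 0

# Integer-encode the captions and pad them to a common length.
sequences = pad_sequences(tokenizer.texts_to_sequences(captions), padding="post")

# Trainable word embeddings fed into the LSTM.
embedding = Embedding(input_dim=vocab_size, output_dim=256, mask_zero=True)
```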
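
For the feature-extraction piece, a minimal sketch of how VGG16 can be used from tensorflow.keras.applications, assuming ImageNet weights and the fc2 layer as the feature output (our actual layer choice may differ):

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

# Pretrained VGG16, cut at the fc2 layer so each image yields a
# 4096-dimensional feature vector instead of class probabilities.
base = VGG16(weights="imagenet")
extractor = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

def extract_features(image_path):
    """Return the VGG16 fc2 features for a single image file."""
    img = load_img(image_path, target_size=(224, 224))  # VGG16 input size
    x = img_to_array(img)[np.newaxis, ...]              # shape (1, 224, 224, 3)
    x = preprocess_input(x)                             # VGG16 preprocessing
    return extractor.predict(x, verbose=0)[0]
```

A dense layer can then project these 4096-dimensional features down to the 256-dimensional vector our LSTM consumes.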
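
To make the teacher-forcing setup concrete: at training time, the ground-truth caption, prefixed with a <start> token and shifted by one position against the targets, is fed as the LSTM input, so the model never has to invent its first word during training. A minimal sketch with made-up token ids:

```python
import numpy as np

def make_teacher_forcing_pair(caption_ids, start_id, end_id):
    """Build (decoder input, target) sequences for one tokenized caption.

    With teacher forcing, <start> is always the input at step 0, and the
    ground-truth word (not the model's own prediction) is the input at
    every subsequent step.
    """
    decoder_input = np.array([start_id] + caption_ids)  # <start>, w1, ..., wn
    target = np.array(caption_ids + [end_id])           # w1, ..., wn, <end>
    return decoder_input, target

# Made-up token ids for a four-word caption.
dec_in, dec_out = make_teacher_forcing_pair([7, 42, 3, 15], start_id=1, end_id=2)
# dec_in  -> [ 1  7 42  3 15]
# dec_out -> [ 7 42  3 15  2]
```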
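
At test time, the simplest of the decoding options is greedy decoding: start from the <start> token and repeatedly feed back the most probable word. The model interface below is a placeholder assumption about what our final model will expose, not code from our repo:

```python
import numpy as np

def greedy_decode(model, image_features, tokenizer, start_id, end_id, max_len=30):
    """Greedily decode a caption for one unseen image.

    `model` is assumed to map (image features, partial caption ids) to a
    probability distribution over the vocabulary for the next word.
    """
    seq = [start_id]
    for _ in range(max_len):  # max_len is a safety cap, not a fixed length
        probs = model.predict(
            [image_features[np.newaxis, :], np.array([seq])], verbose=0)[0]
        next_id = int(np.argmax(probs))  # greedy: always take the argmax
        if next_id == end_id:
            break
        seq.append(next_id)
    id_to_word = {i: w for w, i in tokenizer.word_index.items()}
    return " ".join(id_to_word.get(i, "<unk>") for i in seq[1:])
```

Beam search generalizes this by keeping the k most probable partial captions at each step rather than only one, and random decoding samples the next word from the distribution instead of taking the argmax.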

Plan:
Are you on track with your project?
Yes! We are on track to complete an initial model based on the approach we defined earlier, i.e., generating an Arabic caption directly instead of first generating an English caption and then translating it into Arabic. Once we have the initial implementation down, we might look into implementing the second approach, as well as a PyTorch version of our current approach.

What do you need to dedicate more time to?
We need to dedicate more time to teacher forcing and figuring out how to produce the first word in our captions for the test data. We also need to put more time toward debugging and toward computing a BLEU-1 score to measure the accuracy of our model (see the sketch below).

What are you thinking of changing, if anything?
We are currently not planning on changing anything!
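
For reference, computing BLEU-1 with NLTK would look something like the following; the tokenized captions here are made-up examples, not our data:

```python
from nltk.translate.bleu_score import corpus_bleu

# One entry per test image: a list of tokenized reference captions,
# plus the model's tokenized hypothesis caption for that image.
references = [[["قطة", "تجلس", "على", "الأريكة"]]]
hypotheses = [["قطة", "على", "الأريكة"]]

# BLEU-1 uses unigram precision only: full weight on 1-grams.
bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))
print(f"BLEU-1: {bleu1:.3f}")
```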
