
This is our initial project pitch for reference.

Instagram Caption Generator

By: Eyal Levin and Tali Bers

Introduction:

Instagram may be the most popular social media application of our generation. A common question before posting a picture is what the caption should be. There are many types of captions, including descriptive ones, inside jokes, puns, quotes, and short phrases. Our goal is to create a model that, given a picture, outputs a suitable Instagram caption. To do this, we will train the model on real Instagram pictures and their captions so that it learns which kinds of captions suit which pictures.

Related Work:

The two main papers we will be drawing on are:

- https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/, by Jason Brownlee
- https://towardsdatascience.com/do-it-for-the-gram-instagram-style-caption-generator-4e7044766e34, by Bowman et al.

The first paper describes the architectures, methods, and other features we would draw upon more thoroughly, while the second paper focuses specifically on Instagram captions, which is more relevant to our work. Therefore, we will try to follow the second paper with guidance from the first.

Data:

We will get our data directly from Instagram. It will consist of single pictures with their corresponding captions. We will scrape the accounts of 100 verified users in the United States (so that captions are likely to be in English) and collect 100 of each user's most recent posts. We use verified users for two reasons: they write captions for the widest audience, so the captions are less likely to be inside jokes, and they are the best-known Instagram users, so other people tend to copy their posting styles.

We will use one of the following scrapers:

- Open-source Python Instagram scraper: https://github.com/realsirjoe/instagram-scraper
- Instagram scraping APIs:
  - https://stevesie.com/apps/instagram-api
  - https://apify.com/jaroslavhejlek/instagram-scraper
  - https://github.com/timgrossmann/instagram-profilecrawl

Once we have the data, we will clean and preprocess it. First we will remove all posts that have more than one image, that have no caption, or whose caption is longer than 60 characters, which keeps training fast. We may also resize the images so that they are smaller and easier to process. Then we will tokenize the captions by character, keeping punctuation and hashtags, since this lets us generate more realistic Instagram captions that include abbreviations and hashtags.
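The filtering and character-level tokenization described above could look roughly like the sketch below. It is only an illustration: the `Post` record, its field names, and the `<pad>`, `<s>`, and `</s>` marker tokens are assumptions made for the example, not the output format of any of the scrapers listed.

```python
from dataclasses import dataclass

MAX_CAPTION_CHARS = 60  # caption length cutoff from the pitch

# Marker tokens are an assumption of this sketch, not part of the pitch.
PAD, START, END = "<pad>", "<s>", "</s>"


@dataclass
class Post:
    """Hypothetical record for one scraped post; field names are assumed."""
    image_paths: list
    caption: str


def keep_post(post):
    """Keep single-image posts with a non-empty caption of at most 60 characters."""
    caption = post.caption.strip()
    return len(post.image_paths) == 1 and 0 < len(caption) <= MAX_CAPTION_CHARS


def build_vocab(captions):
    """Map every character seen (letters, punctuation, '#', ...) to an integer id."""
    chars = sorted({ch for caption in captions for ch in caption})
    return {token: i for i, token in enumerate([PAD, START, END] + chars)}


def encode(caption, vocab):
    """Turn a caption into a list of character ids wrapped in start/end markers."""
    return [vocab[START]] + [vocab[ch] for ch in caption] + [vocab[END]]


posts = [
    Post(["a.jpg"], "beach day #sunset"),
    Post(["b.jpg", "c.jpg"], "more than one image"),  # dropped: multiple images
    Post(["d.jpg"], ""),                               # dropped: no caption
    Post(["e.jpg"], "x" * 61),                         # dropped: caption too long
]
kept = [p for p in posts if keep_post(p)]
vocab = build_vocab(p.caption for p in kept)
ids = encode(kept[0].caption, vocab)
```

Note that Python iterates over strings by code point, so some emojis built from several code points would be split into multiple tokens under this scheme.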

## Methodology:

We plan to follow an architecture based on Marc Tanti’s ‘merge model’.

More info on the model here: https://machinelearningmastery.com/caption-generation-inject-merge-architectures-encoder-decoder-model/

The model can be split into 3 parts:

- Captions: Character embeddings are found for the characters in the caption (including hashtags, emojis, etc.). The embedded captions are passed into an RNN (with an LSTM layer of output size 256).
- Images: Images are passed into VGG-16 (minus the last layer). The result is passed as input to a resizing dense layer (256).
- Merge: The resized outputs of the images and captions are added, then passed through a dense layer with softmax activation.
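To make the merge step concrete, here is a framework-free sketch of that part only. It assumes the caption LSTM and the image dense layer have already produced two 256-dimensional vectors; the random vectors, the vocabulary size of 40, and the function names are all placeholders for illustration. A real implementation would use a deep learning library with trained weights.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merge_and_predict(image_vec, caption_vec, weights, bias):
    """Add the two same-sized vectors, then apply a dense layer with softmax.

    `weights` has one row per vocabulary entry; each row is as long as the
    merged vector. The result is a probability for each possible next character.
    """
    if len(image_vec) != len(caption_vec):
        raise ValueError("image and caption vectors must have the same size")
    merged = [i + c for i, c in zip(image_vec, caption_vec)]
    scores = [
        sum(w * m for w, m in zip(row, merged)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(scores)


# Toy stand-ins: 256 matches the layer size in the pitch, 40 is an assumed vocab size.
random.seed(0)
HIDDEN, VOCAB = 256, 40
image_vec = [random.uniform(-1, 1) for _ in range(HIDDEN)]    # resized VGG-16 output
caption_vec = [random.uniform(-1, 1) for _ in range(HIDDEN)]  # LSTM output
weights = [[random.uniform(-0.1, 0.1) for _ in range(HIDDEN)] for _ in range(VOCAB)]
bias = [0.0] * VOCAB

probs = merge_and_predict(image_vec, caption_vec, weights, bias)
next_char_id = max(range(VOCAB), key=probs.__getitem__)  # most likely next character
```

Adding the two vectors only works because both branches are resized to the same width, which is why the image branch needs its own dense layer before the merge.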

We will run our data through the model and use perplexity, the exponential of the average per-character cross-entropy, to measure the loss. The hardest part will be making the perplexity informative enough, given that Instagram captions are likely to be very hard to predict.
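For reference, perplexity is conventionally computed as the exponential of the average negative log-probability the model assigns to each true next character. The sketch below shows that calculation on hand-picked probabilities; the function name and the input format (a flat list of probabilities) are assumptions for illustration.

```python
import math


def perplexity(true_char_probs):
    """exp of the mean negative log-probability assigned to the true characters."""
    if not true_char_probs:
        raise ValueError("need at least one probability")
    mean_nll = -sum(math.log(p) for p in true_char_probs) / len(true_char_probs)
    return math.exp(mean_nll)


# A model that spreads its guess evenly over 4 characters has perplexity of about 4.
guessing = perplexity([0.25] * 8)
# A more confident model scores lower.
confident = perplexity([0.9, 0.8, 0.95, 0.7])
```

Read this way, a perplexity of k means the model is, on average, about as uncertain as if it were choosing uniformly among k characters, which gives a baseline to compare against: a model guessing uniformly over the whole vocabulary scores a perplexity equal to the vocabulary size.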

Metrics:

Success for our project is determined by how consistently the generated captions resemble plausible human captions. We will measure the perplexity of our model to gauge how well it generates captions. However, Instagram captions can themselves be quite perplexing, so the perplexity we calculate may not be that informative. The best way to measure success is therefore human verification. Depending on how much time we have left at the end, we can manually go through the model's output on a certain number of Instagram pictures and rate how realistic each generated caption is on a scale of 1-10.

Base goals:

- Measure perplexity.
- For most images, generate a caption that seems to have some relation to the image (words, themes, etc.).

Target goals:

- Generate a coherent caption.
- Generate a caption that clearly makes sense for the image. Whether it is an inside joke, a description of the image, or an inspirational quote does not matter, as long as it looks like it could plausibly match a human caption. This need not be true for all images, but definitely for most.

Stretch goals:

- Generate DL Instagram captions that are virtually indistinguishable from human Instagram captions, for various kinds of images.
- Experiment by converting the model to use transformers, which could significantly boost performance.

Ethics:

Why is Deep Learning a good approach to this problem?

There is as much data as we could possibly need, and it is easy to parse and filter. A lot of work has been done on text generation and image classification, and this problem combines the two approaches in a very convenient way. There are too many variables and too much noise for this to be done manually, so we hope that a sufficiently large and well-trained network will be able to handle the problem.

What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain?

Our dataset is made up of around 10k image-caption pairs from the most influential Instagram users. Since all of these users have public accounts, there is not a privacy concern. The biggest concern our dataset raises is about representation. Many of the most popular Instagram users are wealthy and white, so our generated captions will not be representative of all Instagram users, who have diverse socio-economic and racial backgrounds. Many of the influencers post advertisements and expensive vacations, which the average user cannot afford. Additionally, our model may generate better captions for images containing people who look like the users we trained on, since their posts are mostly individual or group pictures of other white and wealthy individuals.

Division of Labor:

Tali:

- Preprocessing/tokenizing
- Caption side of the model

Eyal:

- Data scraping
- Image + Merge side of the model

Both:

- Analysis of results
- Hyperparameter tweaking
- Other parts of the model (loss, accuracy, etc.)
- Presentation
