By: Eyal Levin and Tali Bers

Introduction:

Despite Instagram's popularity, Instagram captions are confusing and often have nothing to do with the photo. Some captions “make sense” for the photo and others don't, and for humans this distinction is often fairly obvious. So we set out on a journey to decode Instagram captions. In this project, we trained a caption generator on image/caption pairs and generated captions for our favorite photos. The goal is not to output the “correct” caption, but one that could plausibly work for the photo.

Data:

We got our data directly from Instagram using instalooter, a program that downloads Instagram pictures without any API access. We manually ran instalooter on 58 accounts to get around 100 of each account's most recent posts where the first item wasn't a video. Posts can include up to ten pictures/videos and a caption. The 58 accounts are all verified accounts with many posts and followers; there is a balance between male and female users, and although most are personal accounts, a few belong to sports teams or other brands. These are the accounts we scraped from:

["arianagrande", "therock", "kyliejenner", "selenagomez", "kimkardashian", "beyonce", "justinbieber", "natgeo", "kendalljenner", "taylorswift", "jlo", "nickiminaj", "khloekardashian", "nike", "mileycyrus", "katyperry", "kourtneykardash", "kevinhart4real", "ddlovato", "theellenshow", "badgalriri", "virat.kohli", "zendaya", "iamcardib", "kingjames", "chrisbrownofficial", "champagnepapi", "billieeilish", "shakira", "victoriassecret", "vindiesel", "championsleague", "davidbeckham", "nasa", "gigihadid", "justintimberlake", "emmawatson", "priyankachopra", "shawnmendes", "shraddhakapoor", "snoopdogg", "dualipa", "9gag", "nba", "camila_cabello", "willsmith", "aliaabhatt", "marvel", "nehakakkar", "hudabeauty", "robertdowneyjr", "leonardodicaprio", "gal_gadot", "katrinakaif", "chrishemsworth", "ladygaga", "zacefron", "michelleobama"]

We then preprocessed the data by taking the first picture from every post that had a caption and running the pictures through VGG-16 to extract features. We tokenized all the captions by character so that emojis, hashtags, and tags could all be reproduced. We replaced characters appearing fewer than 50 times with an UNK token to make sure we were only learning relevant characters. Finally, we fixed each caption at 150 characters, either truncating or padding, and wrapped each caption in start and stop characters.
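
As a concrete illustration, the caption preprocessing described above might be sketched as follows (the special-token spellings, vocabulary layout, and helper names are our assumptions, not the project's actual code):

```python
from collections import Counter

# Assumed special tokens; the project's actual choices aren't shown.
PAD, START, STOP, UNK = "<pad>", "<s>", "</s>", "<unk>"
MAX_LEN = 150    # fixed caption length used in the project
MIN_COUNT = 50   # characters appearing fewer times than this are UNKed

def build_vocab(captions):
    # Keep only characters that appear at least MIN_COUNT times.
    counts = Counter(ch for cap in captions for ch in cap)
    kept = sorted(c for c, n in counts.items() if n >= MIN_COUNT)
    return {tok: i for i, tok in enumerate([PAD, START, STOP, UNK] + kept)}

def encode(caption, vocab):
    # Truncate to MAX_LEN, wrap in start/stop, then pad to a fixed width.
    ids = [vocab.get(ch, vocab[UNK]) for ch in caption[:MAX_LEN]]
    ids = [vocab[START]] + ids + [vocab[STOP]]
    return ids + [vocab[PAD]] * (MAX_LEN + 2 - len(ids))
```

Every encoded caption then has the same length (152 ids: up to 150 characters plus start and stop), which makes batching straightforward.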

Methodology:

We followed an architecture based on Marc Tanti’s ‘merge model’:

More info on the model here: https://machinelearningmastery.com/caption-generation-inject-merge-architectures-encoder-decoder-model/

The model can be split into 3 parts:

Captions:

Character embeddings are learned for the characters in the caption (including hashtags, emojis, etc.). The embedded captions are passed into an RNN (an LSTM layer with output size 128).

Images:

Images are passed into VGG-16 (minus the last layer). The result is passed as input to a resizing dense layer (output size 128).

Merge:

The resized outputs of the image and caption branches are added together, then passed through two dense layers: the first with ReLU activation, the second with softmax.
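
Putting the three parts together, the architecture might look like the following Keras sketch (the embedding size, VGG feature dimension, and vocabulary size are placeholder assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 100   # assumed character vocabulary size
SEQ_LEN = 151      # the padded caption minus its last character (150 chars + start token)
VGG_DIM = 4096     # assumed size of VGG-16's last retained fully connected layer

# Caption branch: character embeddings -> LSTM with output size 128.
cap_in = layers.Input(shape=(SEQ_LEN,), name="caption")
cap = layers.Embedding(VOCAB_SIZE, 64)(cap_in)   # embedding dim 64 is an assumption
cap = layers.LSTM(128, return_sequences=True)(cap)

# Image branch: precomputed VGG-16 features -> resizing dense layer (128).
img_in = layers.Input(shape=(VGG_DIM,), name="image")
img = layers.Dense(128, activation="relu")(img_in)

# Merge: add the image vector to every timestep of the caption sequence,
# then two dense layers (ReLU, then softmax over the character vocabulary).
merged = layers.Add()([cap, layers.RepeatVector(SEQ_LEN)(img)])
hidden = layers.Dense(128, activation="relu")(merged)
out = layers.Dense(VOCAB_SIZE, activation="softmax")(hidden)

model = Model([cap_in, img_in], out)
```

The softmax output gives a distribution over the character vocabulary at every timestep, which is what the per-character loss and the character-by-character generator both consume.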

We train in batches of 10, passing in an image with its corresponding caption minus the last character, for 30 epochs. Our loss function uses the caption shifted by one (everything except the first character) as the label and computes sparse categorical cross-entropy against the model's predictions. A mask excludes padding from the loss so that we don't encourage padding in generated captions. A generate-caption function uses the trained model and a given image to produce a caption character by character.
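
The masked loss can be illustrated with a small NumPy equivalent of sparse categorical cross-entropy (the padding id of 0 and the exact reduction are our assumptions):

```python
import numpy as np

def masked_loss(labels, probs, pad_id=0):
    # labels: (batch, seq_len) int character ids; probs: (batch, seq_len, vocab) softmax output.
    batch, seq_len = labels.shape
    # Pick out the probability assigned to each true character.
    picked = probs[np.arange(batch)[:, None], np.arange(seq_len)[None, :], labels]
    per_char = -np.log(picked)
    # Zero out padded positions so they carry no loss.
    mask = (labels != pad_id).astype(per_char.dtype)
    return (per_char * mask).sum() / max(mask.sum(), 1.0)
```

With the padded positions zeroed out, padding timesteps contribute neither loss nor gradient, so the model is never rewarded for emitting padding.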

Results:

Success for our project is determined by how consistently the generated caption resembles a plausible human caption. We originally considered measuring accuracy through perplexity; however, Instagram captions can themselves be quite perplexing, so the perplexity we calculate may not be very informative. The best way to measure success is therefore human verification, which is why we did not include an accuracy or testing function. We tested manually by passing in particular images, generating captions, and comparing them to the original captions and to what we think a caption should look like from our experience with Instagram. Our model generates captions that mostly contain real English words, with the occasional hashtag or tag. A generated caption never matched the original, but most Instagram captions seem fairly random and unrelated to the image anyway, so it would be unrealistic to expect the model to learn patterns that just aren't in the data. Oftentimes there were a few stray characters that didn't form full words, but despite this the captions are legible.

Challenges:

Our first challenge was getting our data off Instagram. We tried a few different scrapers and APIs, making sure they could fetch both pictures and captions. One problem was that Instagram locks you out if you make too many requests in a short amount of time, which meant we could only run the scraper on around 5 accounts at a time before hitting an error. This made scraping very time consuming, since we had to start and stop the scraper often. We ended up scraping 100 posts each from 58 accounts, given our time constraints and the desire to start on preprocessing and the model.

Another challenge was that when we first ran our model, it kept picking one character seemingly at random and outputting it over and over as the caption. In other words, the model wasn't learning and our gradients were zero. We checked the model architecture and all the preprocessing, which seemed fine. The error turned out to be that the image features, when added, were overpowering the characters; the model was getting too much information from the image and couldn't learn anything from the text. To fix the issue we tuned the weight given to the image when adding it to the caption states until we found a balance that produced reasonable gradients. This change, along with other adjustments, improved the model enough to output real words strung together. Although the sentence structure isn't perfect and some words are just random characters, the captions are at least readable; this reflects training data that doesn't always have perfect grammar and doesn't always make sense as a caption for a given image. Our loss does decrease, but it stops decreasing after around 30 epochs and settles into a minimum that it oscillates around.
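
The weighting fix described above might be sketched like this (the 0.1 weight is purely hypothetical; the write-up says the actual balance was found by trial and error):

```python
import numpy as np

def weighted_merge(char_states, img_vec, img_weight=0.1):
    # char_states: (seq_len, 128) LSTM outputs; img_vec: (128,) resized image features.
    # Scaling the image vector down keeps it from drowning out the character signal,
    # which was the cause of the zero-gradient, one-character-forever behavior.
    return char_states + img_weight * img_vec
```
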

Reflection:

Ultimately we are proud of our model for producing actual words and stringing them together, even though they don't necessarily correlate with the image. We ended up not calculating accuracy as we originally planned, since it just didn't make sense as a metric once we saw our results: we had intended to compare predicted captions to the actual captions, but since they were nowhere close, the accuracy would have been useless. Moreover, the fact that the generated captions aren't similar to the originals isn't so bad, as long as they look like some kind of caption. As we worked on the project we realized that Instagram captions don't follow a formula, so most caption/image pairs don't have a reproducible pattern. The most important thing is that the captions are written in an Instagram tone, meaning users might believe they were written by people, which we believe our model is close to achieving.

If we had to do the project over again, we would pick training data more carefully, or take from only certain types of accounts whose captions relate more directly to what is in their pictures, so that there is more correlation between images and captions for the model to learn from. Simply having more data could also drastically improve the model, so we would spend more time finding a reliable way to scrape Instagram, or use a pre-made dataset. With more time we could also tune the hyperparameters and image weights to produce better captions, and scrape and train on more data.

Throughout this project we learned more about how different model architectures work, and that there isn't just one way to solve a problem. We used an RNN, but maybe a transformer would have been better. We added the image and caption representations together, but maybe we could have passed the image in as the LSTM's initial state instead. There are many ways to approach the problem. We also learned how challenging gathering data can be, even from one of the most popular social media sites, and that just getting data isn't enough: you need to be very specific about the type of data gathered and think about its implications for the model. For example, if we had only gathered data from sports accounts, our model might produce very good captions for sports images but random results for everything else. There are trade-offs between that and our current model, which has information from so many different types of accounts that it struggled to recognize patterns between image/caption pairs. Despite the challenge of actually getting data, it is interesting to think about how much data there is in the world around us and all the things (good and bad) it can be used for; this is something we thought about while creating our model. What are other people using these scrapers for?

Updates

This is our initial project pitch, included for reference.

Instagram Caption Generator

Introduction:

Instagram might be the most popular social media application of our generation. A common question asked before posting a picture is what the caption should be. There are many different types of captions, including descriptive ones, inside jokes, puns, quotes, and short phrases. Our goal is to create a model that, given a picture, can output a suitable Instagram caption. To do this we will train a model on real Instagram pictures and corresponding captions to learn what sort of captions are most appropriate for which pictures.

Related Work:

The two main articles we will be drawing on are:

https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/, by Jason Brownlee

https://towardsdatascience.com/do-it-for-the-gram-instagram-style-caption-generator-4e7044766e34, by Bowman et al.

While the first article more effectively describes the architectures, methods, and other features we would draw upon, the second focuses specifically on Instagram captions, which is more relevant to our work. We will therefore try to follow the second with guidance from the first.

Data:

We will get our data directly from Instagram; it will be composed of single pictures with corresponding captions. We will scrape the Instagram accounts of 100 verified users in the United States (to make sure captions are in English) to get 100 of their most recent posts. We will use verified users because they write captions for the widest audience, so the captions are less likely to be inside jokes, and because they are the most well-known Instagram users, so people tend to copy their posting styles. We will use one of the following scrapers:

Instagram scraper (open-source Python): https://github.com/realsirjoe/instagram-scraper

Instagram scraping APIs: https://stevesie.com/apps/instagram-api https://apify.com/jaroslavhejlek/instagram-scraper https://github.com/timgrossmann/instagram-profilecrawl

Once we have the data, we will clean and preprocess it by removing all posts that have more than one image, images that don't have captions, and captions over 60 characters (to make training faster). We may resize the images so that they are smaller and easier to process. We then have to preprocess the captions by tokenizing by letter, including punctuation and hashtags, since this will enable us to generate more realistic Instagram captions that include abbreviations and hashtags.

Methodology:

We plan to follow an architecture based on Marc Tanti’s ‘merge model’:

More info on the model here: https://machinelearningmastery.com/caption-generation-inject-merge-architectures-encoder-decoder-model/

The model can be split into 3 parts:

Captions: Character embeddings are found for the characters in the caption (including hashtags, emojis, etc.); the embedded captions are passed into an RNN (an LSTM layer with output size 256).

Images: Passed into VGG-16 (minus the last layer); the result is passed as input to a resizing dense layer (256).

Merge: The resized outputs of images and captions are added, then passed through a dense layer with softmax activation.

We will run our data through the model and use perplexity to calculate the loss. The hardest part will be finding a way for the perplexity to be informative enough, given that Instagram captions are likely very hard to predict.

Metrics:

Success for our project is determined by how consistently the generated caption resembles a possible human caption. We will measure the perplexity of our model to gauge how well it generated a caption. However, Instagram captions themselves can be quite perplexing, so the perplexity we calculate may not be that informative; the best way to measure success is through human verification. Depending on how much time we have left at the end, we can manually go through the output of our model on a number of Instagram pictures, determine how realistic the generated captions are, and rate them on a scale of 1-10.

Base goals: Measure perplexity. Have a caption generated which seems to have some relation to the image (words, themes, etc.) for most images.

Target goals: Have a coherent caption generated. Generate a caption that clearly seems as if it makes sense for the image. Whether this is an inside joke, a description of the image, or an inspirational quote does not matter, as long as it looks like it could plausibly match a human caption to some degree. This need not be true for all images, but definitely for most.

Stretch goals: Generate deep-learning Instagram captions that are virtually indistinguishable from human Instagram captions, for various kinds of images. Experiment by converting the model to use transformers, which could significantly boost performance.

Ethics:

Why is Deep Learning a good approach to this problem? There is as much data as we could possibly need, and it is easy to parse and filter. Lots of work has been done on text generation and image classification, and this problem combines the two approaches in a very convenient way. There are too many variables and too much noise for this to be done manually; a sufficiently large network should be able to handle the problem for that reason.

What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain? Our dataset is made up of around 10k image/caption pairs from the most influential Instagram users. Since all of the users have public accounts, there is not a privacy concern. The biggest concern our dataset raises is about representation. The most popular Instagram users are wealthy white people, so our generated captions will not be representative of all Instagram users, who come from diverse socio-economic and racial backgrounds. Many of the influencers post advertisements and expensive vacations, which is not what the average user can afford. Additionally, our model may generate better captions for images containing people that look like the users we trained on, since their posts are mostly individual or group pictures of other wealthy white individuals.

Division of Labor:

Tali: preprocessing/tokenizing; caption side of the model

Eyal: data scraping; image + merge side of the model

Both: analysis of results; hyperparameter tweaking; other parts of the model (loss, accuracy, etc.); presentation

Checkpoint 1

Challenges: What has been the hardest part of the project you’ve encountered so far?

So far we have worked on getting our data off Instagram and are beginning to parse it. We looked into a few APIs to scrape images and captions from Instagram. We want to get data from the top 100 verified English-language accounts, but it has been challenging to ensure we are scraping from the specific accounts we want and getting a sufficient amount of data from each account, so that after cleaning we still have enough information. We believe that parsing the captions will be the most difficult part, as we have to tokenize on letters. Cleaning the data is also challenging, as it requires removing pictures without captions and posts that have more than one picture.

Insights: Are there any concrete results you can show at this point? How is your model performing compared with expectations?

We don't yet have any concrete results from running the model. Since all we have really done is gather data and work out the best way to parse it, the only results we have are the Instagram data itself and confirmation that we can access it correctly.

Plan: Are you on track with your project? What do you need to dedicate more time to? What are you thinking of changing, if anything?

Yes, we believe we are on track with the project. By far the most difficult part of the project should be the gathering and parsing of the data, so once we are sure that is done, we think we will be in very good shape to finish. We have planned out the model architecture which combines our work from past assignments. After completing that, all we would then have to do to reach our base goal is likely print out some results. We then plan to play with the architecture a little and tweak hyperparameters to get to the target goal and then maybe the stretch goal, if we think we can.
