Inspiration

Image descriptions are a crucial tool for blind and low vision social media users. Deep neural networks can be used to generate these descriptions from an image. Existing tools for creating AI-generated image descriptions usually intervene at the screen reader level, generating descriptions for images that have already been posted by other users. However, some disability advocates have noted that AI-generated image descriptions can be far less accurate than image descriptions written by humans. This doesn’t mean that AI-generated image descriptions can’t be helpful – we believe that they are simply being incorporated at the wrong step in the process. An AI-generated image description should not be used as a final product, but instead as a tool that will prompt more users to include image descriptions in their posts by making the description-writing process as convenient as possible. This is where A.I.D.A. comes in.

What it does

A.I.D.A. (short for AI-generated Image Description Assistant) suggests image descriptions for images uploaded during the tweet creation process. In our current implementation, the user can allow the program to automatically suggest AI-generated descriptions for every photo in a tweet, then make any desired alterations to each description before posting. By making description writing as convenient as possible, A.I.D.A. streamlines a process that accommodates blind and low vision users, promoting inclusivity and enhancing the experience of the social media user base.

How we built it

To generate the image descriptions, we implemented a hybrid CNN/LSTM model in Google Colaboratory using Python, NumPy, and Keras with a TensorFlow backend. For the convolutional neural network (CNN), we chose the VGG16 architecture, created by K. Simonyan and A. Zisserman (University of Oxford) in 2014, which notably achieved 92.7% top-5 accuracy on ImageNet. Specifically, we used a VGG16 model that had been pre-trained on the ImageNet dataset, which contains 14 million images sorted into 1000 classes. VGG16 is typically used for image classification, so we adapted it to our description generation task by removing the final softmax layer, leaving a network that outputs a feature vector rather than class probabilities. We then applied this model to the Flickr 8K dataset via transfer learning. We chose Flickr 8K, which contains 8000 images (a training set of 6000, a validation set of 1000, and a test set of 1000), because it is a smaller, more manageable version of Flickr30K, an industry-standard image description dataset. Each image in the dataset is accompanied by 5 reference descriptions for training purposes. With the modified VGG16 in place, we used it to extract a feature vector from each image in the dataset.
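To make the feature-extraction step concrete, here is a minimal sketch of how the final softmax layer can be dropped from a pre-trained VGG16 in Keras; the image path and helper name are illustrative, not our exact code.

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

# Load VGG16 pre-trained on ImageNet, then drop the final softmax layer
# so the network outputs a 4096-d feature vector instead of 1000 class
# probabilities.
base = VGG16(weights="imagenet")
feature_extractor = Model(inputs=base.inputs, outputs=base.layers[-2].output)

def extract_features(image_path):
    # VGG16 expects 224x224 RGB input
    image = load_img(image_path, target_size=(224, 224))
    array = img_to_array(image)
    array = np.expand_dims(array, axis=0)    # add a batch dimension
    array = preprocess_input(array)          # VGG-specific normalization
    return feature_extractor.predict(array)  # shape (1, 4096)

features = extract_features("example.jpg")  # hypothetical path
```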

The other major component of our description-generating model is the LSTM layer. Each image description is generated word by word: the LSTM takes the previous n-1 words of the description as input and uses them to predict the nth word in the sequence. The LSTM's output is then combined with the image feature vector to generate a complete image description.
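One common way to wire up this kind of merge architecture in Keras is sketched below; vocab_size, max_length, and the tokenizer are placeholders for values that would be derived from the Flickr 8K captions, and the greedy decoding loop illustrates the word-by-word process rather than reproducing our exact code.

```python
import numpy as np
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout, add
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size = 7579  # illustrative: size of the caption vocabulary
max_length = 34    # illustrative: longest caption in the training set

# Image branch: compress the 4096-d VGG16 feature vector.
image_input = Input(shape=(4096,))
img = Dense(256, activation="relu")(Dropout(0.5)(image_input))

# Text branch: embed the n-1 words generated so far and run an LSTM over them.
text_input = Input(shape=(max_length,))
txt = Embedding(vocab_size, 256, mask_zero=True)(text_input)
txt = LSTM(256)(Dropout(0.5)(txt))

# Merge both branches and predict a distribution over the nth word.
merged = Dense(256, activation="relu")(add([img, txt]))
output = Dense(vocab_size, activation="softmax")(merged)

model = Model(inputs=[image_input, text_input], outputs=output)
model.compile(loss="categorical_crossentropy", optimizer="adam")

def generate_description(model, tokenizer, photo_features):
    """Greedy word-by-word decoding from 'startseq' until 'endseq'."""
    text = "startseq"
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        probs = model.predict([photo_features, seq], verbose=0)
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == "endseq":
            break
        text += " " + word
    return text.removeprefix("startseq ")
```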

Because we could not access the (abstracted) Twitter source code, we had to carry out automation on the user side, automating the description-adding process on the Twitter webpage with scripted mouse movements and keystrokes. To respond dynamically to the web page, we used OpenCV, NumPy, and Matplotlib to take screenshots and analyze the resulting images via template matching. This let the program wait for each step to load, determine where it needed to click, and count how many photos the user had uploaded. Each uploaded image was run through the description model, and the resulting generated description was injected into the corresponding description text box using PyAutoGUI and Pynput. Once every photo had been processed, the program saved the changes to the tweet draft so the user could look them over and tweet when ready.
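A simplified sketch of the screen-automation loop is shown below; the reference image filename, threshold, and placeholder caption are hypothetical, and the real program also handles multiple photos and uses Pynput for keyboard input.

```python
import time
import cv2
import numpy as np
import pyautogui

def locate_on_screen(template_path, threshold=0.8):
    # Screenshot the desktop and search it for the reference image.
    screenshot = cv2.cvtColor(np.array(pyautogui.screenshot()), cv2.COLOR_RGB2BGR)
    template = cv2.imread(template_path)
    result = cv2.matchTemplate(screenshot, template, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(result)
    if max_val < threshold:
        return None  # UI element not on screen yet
    h, w = template.shape[:2]
    return (max_loc[0] + w // 2, max_loc[1] + h // 2)  # center of the match

def wait_and_click(template_path):
    # Poll until the UI element appears, then click its center.
    location = None
    while location is None:
        location = locate_on_screen(template_path)
        time.sleep(0.5)
    pyautogui.click(location[0], location[1])

wait_and_click("add_description_button.png")  # hypothetical reference image
pyautogui.typewrite("An AI-generated description goes here.")
```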

Though the reference images used for template matching are specific to Twitter, the code itself can easily be repurposed for other social media sites.

Challenges we ran into

The main obstacle we ran into was Twitter's abstracted source code. Our original intention was to use web scraping rather than image recognition to automate the captioning process; however, this required access to the dynamically changing source code, which Twitter (like Instagram and Facebook) does not expose. Though that method might have been slightly quicker, its functionality would ultimately have been very similar to our actual implementation.

Accomplishments that we're proud of

We maintained a quick but careful workflow that allowed us to finish in a timely manner without running into too many bugs, and we combined the separate parts of the project (the automation and the AI) into a seamless, functional flow.

What we learned

Rae implemented process automation in Python for the first time and learned to use multiple template matching, as well as how to take screenshots and manipulate the properties of images. Erin, who did not have prior experience with natural language processing, learned how to modify a CNN intended for image classification and use it for feature extraction, as well as how to implement an LSTM.

What's next for A.I.D.A. (AI Image Description Assistant)

Given more time, we believe we could improve the accuracy of the description-generating model through hyperparameter tuning, such as adjusting the batch size and learning rate, and by training the model for more epochs. We would also be interested in trying other convolutional neural network architectures to see whether any of them yield better results than the VGG16 architecture we used. Furthermore, we would like to expand the functionality of our application by adding a feature that generates GIF descriptions, creating a version for the Twitter mobile app, and building versions for other social media platforms that allow users to post images, such as Instagram or Facebook.

Built With

Python, NumPy, Keras, TensorFlow, OpenCV, Matplotlib, PyAutoGUI, Pynput, Google Colaboratory
