Link to Final Writeup

This is the link to the final write-up. The content below comes from our first deliverable; please refer to the final write-up for the most up-to-date details about this project.


We aim to build a model that can provide an accurate natural-language answer given an image and a natural-language question about that image. Such a model can help people (e.g., the visually impaired) acquire visual information without relying on others.

Related Work

Paper implemented

Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh. “VQA: Visual Question Answering”. arXiv:1505.00468.

The paper presents a free-form, open-ended Visual Question Answering (VQA) system, which takes as input an image and a free-form, open-ended, natural-language question about the image and produces a natural-language answer as output. We chose this paper because it was written by pioneers of VQA and thoroughly explains the basic structure of such a system. It lays a good foundation for us to build on.
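The paper's baseline fuses a CNN image embedding with an LSTM question encoding by pointwise multiplication and classifies over a fixed answer set. Below is a minimal sketch of that idea, assuming PyTorch; the class and argument names are ours, and the layer sizes only loosely follow the paper's "deeper LSTM Q + norm I" variant.

```python
import torch
import torch.nn as nn

class VQABaseline(nn.Module):
    """Sketch of the CNN-image + LSTM-question fusion model from the paper.

    img_feat is assumed to be a precomputed 4096-d fc7 feature vector;
    question_ids is a padded batch of word indices.
    """
    def __init__(self, vocab_size, num_answers, embed_dim=300,
                 hidden_dim=512, img_dim=4096, common_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        # question encoding = concatenated hidden and cell states of both layers
        self.q_proj = nn.Linear(2 * 2 * hidden_dim, common_dim)
        self.i_proj = nn.Linear(img_dim, common_dim)
        self.classifier = nn.Sequential(
            nn.Linear(common_dim, 1000), nn.Tanh(), nn.Dropout(0.5),
            nn.Linear(1000, num_answers))

    def forward(self, img_feat, question_ids):
        # L2-normalize the image feature, as in the paper's "norm I" variant
        img_feat = img_feat / (img_feat.norm(dim=1, keepdim=True) + 1e-8)
        _, (h, c) = self.lstm(self.embed(question_ids))
        q = torch.cat([h.transpose(0, 1).flatten(1),
                       c.transpose(0, 1).flatten(1)], dim=1)
        # pointwise (elementwise) multiplication fuses the two modalities
        fused = torch.tanh(self.q_proj(q)) * torch.tanh(self.i_proj(img_feat))
        return self.classifier(fused)
```

A forward pass with a batch of 2 images and 5-token questions, `VQABaseline(vocab_size=100, num_answers=10)`, returns logits of shape `(2, 10)`.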


Aishwarya Agrawal (2019). Visual question answering and beyond. Georgia Institute of Technology.



Potential applications include providing aid to visually impaired people, teaching children through interactive demos, and making visual social-media content more accessible.


The main dataset we used is MS COCO. Its images were collected from Flickr, and its annotations were gathered from crowd workers with funding provided by Microsoft. The objective was to collect images of various scene types, and the curators made sure that each category has a significant number of instances. However, some categories (e.g., windows, doors) have significantly more images than others, which might bias a model trained on the data toward the over-represented categories.
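One simple way to quantify this kind of imbalance is to count instances per category and compare the largest count to the smallest. The sketch below uses a simplified stand-in for COCO's annotation records (a list of dicts with a `category` key, which is our assumption, not COCO's exact schema) and toy counts:

```python
from collections import Counter

def category_imbalance(annotations):
    """Count instances per category and report the max/min imbalance ratio.

    `annotations` is assumed to be a list of dicts with a 'category' key,
    a simplified stand-in for COCO's real annotation format.
    """
    counts = Counter(a["category"] for a in annotations)
    ordered = counts.most_common()
    most, least = ordered[0], ordered[-1]
    return counts, most[1] / least[1]

# Toy data: 'window' and 'door' dominate a rare category.
anns = [{"category": c}
        for c in ["window"] * 50 + ["door"] * 40 + ["toaster"] * 2]
counts, ratio = category_imbalance(anns)
print(ratio)  # 25.0 -- 'window' has 25x as many instances as 'toaster'
```

In practice one would run this over the real COCO annotation file and consider re-weighting or re-sampling the under-represented categories.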


VQA models are driven by language priors in the training data and therefore inherit biases from the questions and answers in the training set.
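A concrete way to see this prior is the "most common answer per question type" baseline from the VQA paper: without looking at the image at all, predict the training set's majority answer for each question type. The sketch below approximates the question type by its first two words; the function name and the `(question, answer)` pair format are our assumptions.

```python
from collections import Counter, defaultdict

def per_question_type_prior(train_qa):
    """Language-prior baseline: map each question 'type' (here, its first
    two lowercased words) to the most common training answer for that type.

    train_qa: list of (question, answer) pairs -- a hypothetical format.
    """
    by_type = defaultdict(Counter)
    for question, answer in train_qa:
        qtype = " ".join(question.lower().split()[:2])
        by_type[qtype][answer] += 1
    return {t: c.most_common(1)[0][0] for t, c in by_type.items()}

train = [("What color is the sky?", "blue"),
         ("What color is the grass?", "green"),
         ("What color is the sea?", "blue"),
         ("Is there a dog?", "yes"),
         ("Is there a cat?", "yes")]
prior = per_question_type_prior(train)
print(prior["what color"])  # blue (2 of the 3 'what color' answers)
print(prior["is there"])    # yes
```

That such an image-blind baseline scores well on VQA benchmarks is exactly the symptom of the language bias described above.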

Division of Labor

The entire project consists of four parts: data preprocessing, image feature extraction (CNN), language feature encoding (LSTM), and merging (a feed-forward network). While data preprocessing and merging concern the model as a whole, the CNN and LSTM are two separate blocks, so we assigned two group members to each: Yue Yang and Jayden Yi work on the CNN block, and Linghai Liu and Jacob Yu work on the LSTM block. We expect all members to collaborate on data preprocessing and the final merging step. Later labor adjustments and slight re-assignments are possible and will be disclosed here.
