As normal human beings, it is easy for us to see an image and answer questions about it using our commonsense knowledge. However, there are also scenarios, for instance a visually impaired user or an intelligence analyst, where someone wants to actively elicit visual information from an image. Hence, we built an AI system that takes as input an image and a free-form, open-ended natural language question about the image, and produces a natural language answer as output. If properly implemented, the system can help automate many image description processes, but the main reason we built it is to extend it into an application for blind people.
What it does
Given any image and a question about it, our system localizes the part of the image pertaining to the question and answers the question in natural language. For example, if you upload a picture of a cow eating grass and ask a question like "What animal is in the picture?" or "What is the animal eating?", the system localizes the region relevant to that particular question and gives you an appropriate answer.
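The flow above can be sketched as a toy pipeline: encode the question, fuse it with image features, and classify over a fixed answer vocabulary. This is a minimal illustration, not our trained network; the vocabulary, random weights, and elementwise fusion here are all placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

ANSWERS = ["cow", "grass", "dog", "yes", "no"]  # toy answer vocabulary (placeholder)
QUESTION_VOCAB = {"what": 0, "animal": 1, "is": 2, "the": 3, "eating": 4}

def encode_question(question, dim):
    """Bag-of-words question encoding projected to a fixed dimension."""
    bow = np.zeros(len(QUESTION_VOCAB))
    for tok in question.lower().rstrip("?").split():
        if tok in QUESTION_VOCAB:
            bow[QUESTION_VOCAB[tok]] += 1
    W_q = rng.standard_normal((dim, len(QUESTION_VOCAB)))  # placeholder weights
    return W_q @ bow

def answer(image_features, question):
    """Fuse image and question features, then score each candidate answer."""
    q = encode_question(question, dim=image_features.shape[0])
    fused = image_features * q  # elementwise fusion of the two modalities
    W_out = rng.standard_normal((len(ANSWERS), fused.shape[0]))  # placeholder
    logits = W_out @ fused
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return ANSWERS[int(np.argmax(probs))], probs
```

In the real system, `image_features` would come from a pretrained CNN and the weight matrices would be learned, but the shape of the computation is the same.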
How we built it
The dataset for the entire model can be found at https://visualqa.org/download.html. There were a total of 82,783 training images, 40,504 validation images, and 81,434 testing images.
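The VQA download ships the questions as JSON files that pair each question with a COCO image id. As a sketch, loading them looks roughly like this; the `sample` below is a tiny in-memory excerpt mimicking that schema, not the real file (the actual files from visualqa.org are much larger, and their exact filenames vary by release).

```python
import json

# Toy excerpt mimicking the schema of the VQA questions JSON files.
sample = {
    "questions": [
        {"image_id": 9, "question": "What animal is in the picture?", "question_id": 90},
        {"image_id": 9, "question": "What is the animal eating?", "question_id": 91},
    ]
}

def load_questions(raw_json):
    """Map question_id -> (image_id, question text)."""
    data = json.loads(raw_json)
    return {q["question_id"]: (q["image_id"], q["question"])
            for q in data["questions"]}

questions = load_questions(json.dumps(sample))
```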
Challenges we ran into
1. The model took about 8 hours to train, and we had to train it 3 to 4 times to get decent accuracy.
2. Google Colab disconnects a runtime after about 90 minutes of inactivity. This was a big problem while training, since training took a long time and we had to make sure the runtime stayed active. Someone had to stay awake overnight to keep the runtime from expiring, so basically we didn't sleep properly xD.
3. For the natural language processing part, we used an n-gram smoothing technique, whose implementation was difficult to figure out.
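Our exact smoothing setup isn't spelled out above, but add-one (Laplace) smoothing for bigrams, one common variant, gives the idea: unseen word pairs get a small nonzero probability instead of zero.

```python
from collections import Counter

def bigram_laplace(tokens):
    """Return an add-one (Laplace) smoothed bigram probability function.

    P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + V), where V is the
    vocabulary size. Unseen bigrams thus get probability 1 / (count(w1) + V).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = len(unigrams)

    def prob(w1, w2):
        return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)

    return prob

tokens = "the cow eats the grass".split()
p = bigram_laplace(tokens)
```

For this toy corpus, p("the", "cow") is (1 + 1) / (2 + 4) = 1/3, while the unseen pair p("cow", "grass") still gets (0 + 1) / (1 + 4) = 0.2 rather than zero.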
Accomplishments that we're proud of
When we first thought of this project, we were skeptical about whether we would be able to train this big a model in three days, so when we successfully trained it, it felt like a big achievement. Visual Question Answering is an active research topic in AI, and there are not many models that work properly, so when we saw the results we were happy about it.
What we learned
We learned a lot about deep learning and model training, and we learned how to deploy a deep learning model in a Flask application, which we would not have done if not for this hackathon. We also learned how natural language processing actually works and how it impacts an AI application.
What's next for Visual Question Answering
Now that we've built the model as a Flask application, the next plan is to turn it into a mobile application with more training. If successful, we will try to extend it into visual support for blind people, where a blind person can take a photo and ask what the picture is about; we would use a voice-to-text API to capture the question and return an answer.