It took a while to settle on the final idea. We decided to use image captioning as the base model since it is relatively mature now. When we finally thought of using Algolia to index the captions for our images, it was like Lego pieces fitting into each other :dancer:
What it does
- The Android application lets the user click a picture and get a description of what is in the image from our model.
- The Android application also lets you search for previous clicks based on the description that was generated.
- The web UI also lets you search through the images clicked during the hackathon based on the descriptions generated by our model.
How we built it
- We found weights pre-trained on the COCO dataset (since training the model from scratch in a day wasn't feasible)
- We found PyTorch models that implement the CNN-RNN architecture (links to everything are in our repo)
- We got the model up and running and wrote a wrapper around it so that a single function call returns a caption
- We built an API endpoint using Python and Flask
- We uploaded all the images to Google Cloud Storage using its pip package (from within the server)
- We also added Algolia indices in the Python server
- We built an Android application that sends the image to the server as a base64 string
- We integrated Algolia's InstantSearch SDK in our Android app
- We deployed our server on Google Cloud.
- It is live! :bomb:
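The server-side steps above (decode the base64 image, caption it, store it, index the caption) can be sketched roughly as follows. This is a minimal illustration, not our actual server code: `handle_upload`, `caption_image`, `upload_image`, and `index_record` are hypothetical names, the three helpers are passed in as plain callables so that the real Flask route, Google Cloud Storage client, and Algolia client are deliberately left out, and the `.jpg` extension is an assumption.

```python
import base64
import binascii
import uuid
from typing import Callable, Dict


def handle_upload(
    image_b64: str,
    caption_image: Callable[[bytes], str],       # hypothetical wrapper around the captioning model
    upload_image: Callable[[str, bytes], str],   # hypothetical storage helper, returns a URL
    index_record: Callable[[Dict[str, str]], None],  # hypothetical search-index helper
) -> Dict[str, str]:
    """Decode a base64 image, caption it, store it, and index the caption."""
    try:
        # validate=True rejects characters outside the base64 alphabet.
        image_bytes = base64.b64decode(image_b64, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise ValueError("payload is not valid base64") from exc

    caption = caption_image(image_bytes)
    object_id = uuid.uuid4().hex  # unique ID shared by the stored file and the record
    image_url = upload_image(f"{object_id}.jpg", image_bytes)  # ".jpg" is an assumption

    # Algolia identifies records by an "objectID" attribute.
    record = {"objectID": object_id, "caption": caption, "image_url": image_url}
    index_record(record)
    return record


if __name__ == "__main__":
    # Stand-in helpers so the sketch runs without a model or any cloud service.
    indexed = []
    payload = base64.b64encode(b"fake image bytes").decode("ascii")
    result = handle_upload(
        payload,
        caption_image=lambda data: "a placeholder caption",
        upload_image=lambda name, data: f"https://example.invalid/{name}",
        index_record=indexed.append,
    )
    print(result["caption"])
```

Passing the model, storage, and index helpers in as arguments keeps the flow easy to test with stand-ins; in a real server each would wrap the corresponding library call.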
Challenges we ran into
- While deploying on Google Cloud, I kept getting a "couldn't connect to instance 255" error after some time.
- The solution was to use a machine without GPUs (weird, right?)
Accomplishments that we're proud of
- We came up with an ambitious pipeline and completed it. It actually works now! Click an image on the Android app and you get a caption, which is immediately available in the search UI.
- We even added the ability to read the caption out loud for people with visual disabilities (they can just ask "What am I looking at?" to know what's in front of them)
What we learned
- Caffeine does wonders
What's next for WhoDat
Expand the model
- The current model successfully recognizes different objects present in the images, but there have been a few cases where it fails to differentiate between objects like a carton box and a luggage bag. We plan to build a pipeline where the user can correct the caption produced for an image, and the model in turn learns from the corrected caption and optimizes its weights.
Implement a streaming service for visually impaired people
- We plan to build another application, possibly using Google Glass, based on a similar model and idea, which could be used to assist visually impaired people. The idea is to implement a streaming service that constantly monitors the external environment and continuously updates the user about their surroundings.