Inspiration
I had the inspiration for this app 20 days ago: I was looking at the window of a clothing store when a blind woman came up to me, asking what that store was selling. At that moment I thought that an app that could describe, in general terms, what it is seeing could help thousands of people all around the world.
What it does
Pic2Speech is a mobile application, currently available on Google Play, that vocally describes the content of pictures taken with the smartphone's camera. The core of this project is a deep neural network, trained on nearly 90,000 images and 400,000 captions, that automatically generates a description for a given image.
How I built it
The model I built is mostly inspired by the paper "Show and Tell: A Neural Image Caption Generator" by Vinyals et al. I built the neural network using Keras on an Azure Data Science VM with GPU support. After training, I deployed the model as an Azure web service, without leaving my Python notebook, using the Azure ML API for Python. To develop the mobile app I used Flutter. The entire source code for this project is available on my GitHub; take a look if you like.
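The Show-and-Tell idea can be sketched in Keras as a two-branch model: a CNN image feature vector and a partial caption are merged, and the network predicts the next word. This is a minimal illustrative sketch, not the actual Pic2Speech code; the vocabulary size, embedding dimension, maximum caption length, and 2048-d feature size are all assumed placeholders.

```python
# Hedged sketch of a Show-and-Tell-style caption model in Keras.
# All sizes below are illustrative assumptions, not Pic2Speech's values.
import numpy as np
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, add
from tensorflow.keras.models import Model

VOCAB_SIZE = 5000   # assumed vocabulary size
MAX_LEN = 34        # assumed maximum caption length
EMBED_DIM = 256     # assumed embedding dimension

# Image branch: a pre-extracted CNN feature vector (e.g. 2048-d from a
# pretrained encoder) projected into the embedding space.
img_input = Input(shape=(2048,))
img_embed = Dense(EMBED_DIM, activation='relu')(img_input)

# Text branch: the partial caption generated so far, as word indices.
txt_input = Input(shape=(MAX_LEN,))
txt_embed = Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(txt_input)
txt_state = LSTM(EMBED_DIM)(txt_embed)

# Merge both branches and predict a distribution over the next word.
merged = add([img_embed, txt_state])
output = Dense(VOCAB_SIZE, activation='softmax')(merged)

model = Model(inputs=[img_input, txt_input], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```

At inference time, the caption is built one word at a time: the predicted word is appended to the text input and the model is queried again until an end token appears.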
Challenges I ran into
To train the neural network I used the MS COCO 2019 dataset, and the biggest challenge was manipulating such a large dataset, which weighs around 15 GB before processing and something like 150 GB after processing (damn one-hot encoding!). I wrote a generator to process the data in batches at training time, finding a good balance between memory usage and training time.
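The idea behind the generator can be sketched as follows: instead of materialising the full one-hot-encoded target array (which is what blows the dataset up to ~150 GB), each batch is encoded on the fly, so only `batch_size × vocab_size` floats live in memory at once. Names and sizes here are illustrative assumptions, not the actual Pic2Speech code.

```python
# Hedged sketch of per-batch one-hot encoding at training time.
import numpy as np

def batch_generator(features, sequences, targets, vocab_size, batch_size):
    """Yield ((image features, caption prefixes), one-hot targets) batches
    indefinitely, as expected by Keras' fit with a generator."""
    n = len(targets)
    while True:
        for start in range(0, n, batch_size):
            end = min(start + batch_size, n)
            # One-hot encoding happens here, per batch, instead of
            # pre-encoding the whole dataset up front.
            y = np.zeros((end - start, vocab_size), dtype='float32')
            y[np.arange(end - start), targets[start:end]] = 1.0
            yield (features[start:end], sequences[start:end]), y
```

The trade-off mentioned above is visible here: a larger `batch_size` means fewer Python-level iterations (faster training) but a bigger one-hot array in memory at once.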
Accomplishments that I'm proud of
The model I built is not state-of-the-art, but I'm really proud of it because I did it in just two weeks, without ever having used Azure ML Services before. I think that with time I can improve it, and I hope to make Pic2Speech truly helpful for the visually impaired.
What I learned
I learned a lot in these two weeks. First of all, I learned to use Azure ML Services to train and deploy a machine learning model, all from inside a Jupyter Notebook. I also strengthened my skills with Keras and learned how to use its Functional API.
What's next for Pic2Speech
The neural network that powers Pic2Speech could be widely improved, but to do that I need more computing resources (so Microsoft, please give me some more Azure credits :D). I'd like to implement an attention mechanism, as described in the paper "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention"; it would dramatically improve the accuracy of the model. On the app side, I'd like to give users the ability to manually caption their own pictures or pictures taken by other users, with the aim of using this information to constantly improve the model.
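The soft attention idea from "Show, Attend and Tell" can be sketched in a few lines of numpy: at each decoding step, spatial CNN features are weighted by their relevance to the current decoder state, and the caption model attends to the weighted sum. This is only a conceptual sketch under assumed shapes; the scoring function and all names here are illustrative, not part of Pic2Speech today.

```python
# Hedged numpy sketch of additive (Bahdanau-style) soft attention.
import numpy as np

def soft_attention(features, hidden, W_f, W_h, v):
    """features: (L, D) spatial CNN features, one row per image location.
    hidden: (H,) current decoder state.
    W_f: (D, A), W_h: (H, A), v: (A,) are learned parameters (assumed shapes).
    Returns (context, alphas): the attended feature vector and the weights."""
    # One score per spatial location, from a small additive scoring network.
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v   # (L,)
    # Softmax over locations (shifted by the max for numerical stability).
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()                                # (L,), sums to 1
    # Context vector: relevance-weighted sum of the spatial features.
    context = alphas @ features                           # (D,)
    return context, alphas
```

The weights `alphas` are also interpretable: they show which image regions the model "looked at" when emitting each word, which is part of why attention helps caption accuracy.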

