This project is inspired by the technology driving speech translation systems and an urge to design for accessibility. The question we asked before starting was: if one form of speech can be translated into another, why can't we do the same for images? If we could translate images into speech, we might be able to positively impact the lives of visually impaired people. So we decided to combine existing technology with recent advances in AI to build LiveCap, a mobile app that produces text or voice captions from live images or video.
What it does
The app has two settings. It can produce captions either for live footage from the mobile camera or for images stored in the device's gallery. The former setting, coupled with the text-to-speech functionality, was designed with people with visual impairments in mind. The latter is included in this prototype to demonstrate the accuracy and power of our prediction system.
How we built it
We created an attention-based sequence-to-sequence translator in PyTorch consisting of an encoder (ResNet) and a decoder (RNN). The trained weights of this model were exported to PyTorch Mobile so that they could be used in an Android app. The app's backend was written in Java and feeds images or video from either the camera or storage to the trained model. The model then generates a caption that is displayed on the screen and sent to the Google Text-to-Speech API to provide users with voice captions.
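To make the attention idea concrete, here is a minimal sketch of a single attention step, written in plain Python rather than PyTorch. It is not our actual model: the function names (`softmax`, `attend`), the toy vectors, and the use of plain dot-product scoring are all assumptions chosen for illustration, and a trained captioner would typically use learned scoring layers instead. At each decoding step, the decoder state is compared against the encoder's feature vector for each image region, and the resulting weights decide how much each region contributes to the next word.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, region_features):
    """One attention step (illustrative dot-product scoring).

    decoder_state:   list of floats, the decoder's current hidden state
    region_features: list of equal-length float lists, one per image region
    Returns (weights, context), where context is the weighted average
    of the region features.
    """
    scores = [
        sum(h * f for h, f in zip(decoder_state, region))
        for region in region_features
    ]
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [
        sum(w * region[i] for w, region in zip(weights, region_features))
        for i in range(dim)
    ]
    return weights, context


# Toy example: three image regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [2.0, 0.0]
weights, context = attend(state, regions)
print(weights)  # the first and third regions receive the most weight
print(context)
```

In a full captioning model, the context vector would be fed into the RNN decoder together with the previously generated word, and the step would repeat until an end-of-caption token is produced.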
Challenges we ran into
Exporting the trained model to PyTorch Mobile was quite difficult because the framework is sparsely documented and changing rapidly. This was a major and quite unexpected bottleneck. Once the model was exported, we faced several challenges interfacing it with our application, due to our lack of experience, a time crunch, and confusing documentation.
Accomplishments that we're proud of
We are proud that we were able to bring our ideas to fruition during this hackathon, even though the final prototype has a rather simple user interface. We started off as students with limited practical ML and app-development experience and have learnt a skill set that we didn't even know existed. Most importantly, we are proud that this idea represents our core design values (usability and accessibility) and has the potential to change lives.
What we learned
Apart from technical skills, we learnt about the importance of teamwork, communication, and planning ahead (especially given that communication is currently tougher than ever before). We also learnt that steps that seem easy during the planning phase (such as exporting model weights) can often prove to be severe bottlenecks. We should therefore always plan for the worst case and leave buffer time between tasks.
What's next for LiveCap
We would love to deploy this app on other edge devices, such as Google Glass, so that it is significantly easier to use for our target population.