The idea came to us when we wanted to match audio with video footage, a problem with a wide variety of applications, including aiding the deaf.
What it does
It uses a CNN-LSTM neural network (a deep learning model) to identify what someone is saying just by reading their lips. It achieves reasonable results and can understand a speaker from video alone, using only computer vision with no audio present.
How we built it
We trained a CNN-LSTM neural network on the Google Cloud Platform to form our classifier. The model is based on a research paper in which a CNN-LSTM classifier predicts words. We also spent time building our dataset by extracting images of lips and the corresponding words from official news sources.
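For readers unfamiliar with the architecture, here is a minimal sketch of what a CNN-LSTM word classifier can look like. It is not our actual model: the use of PyTorch, the layer sizes, the grayscale 32x32 mouth crops, and the class name `CnnLstmLipReader` are all assumptions made for illustration. A small CNN turns each frame into a feature vector, an LSTM reads the sequence of vectors, and a linear layer scores each candidate word.

```python
import torch
from torch import nn


class CnnLstmLipReader(nn.Module):
    """Hypothetical sketch: per-frame CNN features -> LSTM -> word scores."""

    def __init__(self, num_words: int, hidden_size: int = 128):
        super().__init__()
        # Small CNN applied independently to every grayscale mouth crop.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # one 32-dim vector per frame
        )
        # The LSTM reads the per-frame vectors in temporal order.
        self.lstm = nn.LSTM(input_size=32, hidden_size=hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_words)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips has shape (batch, frames, channels, height, width).
        batch, frames, channels, height, width = clips.shape
        # Fold time into the batch dimension so the CNN sees single frames.
        features = self.cnn(clips.reshape(batch * frames, channels, height, width))
        features = features.reshape(batch, frames, -1)
        # Use the final hidden state of the last layer as the clip summary.
        _, (last_hidden, _) = self.lstm(features)
        return self.classifier(last_hidden[-1])  # (batch, num_words)


# Two dummy clips of 12 frames each, scored against a 10-word vocabulary.
model = CnnLstmLipReader(num_words=10)
logits = model(torch.zeros(2, 12, 1, 32, 32))
```

Taking the argmax over the last dimension of `logits` would give the predicted word index for each clip.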
Challenges we ran into
Developing the CNN-LSTM neural network and extracting the dataset were the main challenges. We had to come up with a proper way to approach the problem of lip reading, and to form an authentic dataset by extracting lip-movement images and subtitles from news sources.
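To give a sense of the dataset problem, here is a rough sketch of one way to pair subtitle words with video frames. This is not our pipeline: it assumes SRT-style timestamps, a known frame rate, and (crudely) that the words in a subtitle cue are spoken at an even pace. The names `Cue`, `parse_timestamp`, and `word_frame_ranges` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One subtitle entry: its text and its start/end times in seconds."""
    text: str
    start: float
    end: float


def parse_timestamp(stamp: str) -> float:
    """Convert an SRT-style 'HH:MM:SS,mmm' timestamp to seconds."""
    clock, millis = stamp.split(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return hours * 3600 + minutes * 60 + seconds + int(millis) / 1000


def word_frame_ranges(cue: Cue, fps: float) -> list[tuple[str, int, int]]:
    """Assign each word in a cue an inclusive (first_frame, last_frame) range.

    Assumption: the cue's duration is split evenly across its words,
    which is only a rough approximation of real speech timing.
    """
    words = cue.text.split()
    if not words:
        return []
    step = (cue.end - cue.start) / len(words)
    ranges = []
    for index, word in enumerate(words):
        first = int((cue.start + index * step) * fps)
        last = int((cue.start + (index + 1) * step) * fps) - 1
        # Guard against very short words that span less than one frame.
        ranges.append((word.lower(), first, max(first, last)))
    return ranges


cue = Cue(
    text="Good evening",
    start=parse_timestamp("00:00:01,000"),
    end=parse_timestamp("00:00:02,000"),
)
labels = word_frame_ranges(cue, fps=25)
```

Each tuple in `labels` names a word and the frames from which its lip images would be cropped. An even split is easy to compute but noisy, which is part of why building a clean dataset took so long.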
What we learned
Good data takes time.