Inspiration
Understanding what the person in front of you is saying is a privilege. I realised this when I became friends with a girl who struggled because of a severe hearing impairment. Her hearing aids and assistive devices worked, but they failed whenever there was background noise.
What it does
Visual Speech to Text Conversion: lipreading is most useful in noisy environments where the audio signal is corrupted. Automatic lipreading aims to recognise the content of speech by watching video alone, and it has applications in both noisy and quiet settings. However, several factors make it a challenging task, including lighting conditions, the speaker's age, make-up, viewpoint, and the size of the vocabulary to predict.

Fortunately, recent progress on two fronts makes automatic lipreading practical. First, deep learning techniques have proven effective on closely related problems such as action recognition and sequence modelling; published approaches range from fully connected networks to convolutional frontends followed by recurrent layers, and more recently attention and self-attention architectures. Second, several large-scale lipreading datasets, such as Lip Reading in the Wild (LRW), have been released in recent years, providing massive amounts of data with significant variation. By taking full advantage of both, recent lipreading systems have shown several appealing results.
How we built it
After preprocessing the videos, we passed them to the training architecture: the network converts each clip into a feature vector, and passing that vector through a softmax layer gives the desired word prediction. A rough sketch of this pipeline is shown below.
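The following is a minimal, hypothetical PyTorch sketch of the kind of pipeline described above, not our exact training code: a small 3D-convolutional frontend turns preprocessed mouth-crop frames into per-frame features, a temporal backend summarises them into a clip-level vector, and a linear layer plus softmax produces word probabilities. All layer sizes, names, and the 500-word vocabulary are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LipReadingNet(nn.Module):
    """Sketch: video frames -> visual features -> word probabilities."""
    def __init__(self, num_words=500, feat_dim=256):
        super().__init__()
        # Visual frontend: a 3D convolution over (time, height, width) of the
        # grayscale mouth crops, then spatial pooling to a per-frame feature.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep the time axis, pool space away
        )
        self.proj = nn.Linear(64, feat_dim)
        # Temporal backend: summarises the frame features into one clip-level vector.
        self.backend = nn.GRU(feat_dim, feat_dim, batch_first=True, bidirectional=True)
        # Classifier: the clip vector is mapped to word logits; softmax gives probabilities.
        self.classifier = nn.Linear(2 * feat_dim, num_words)

    def forward(self, clips):
        # clips: (batch, 1, frames, height, width) grayscale mouth crops
        x = self.frontend(clips)                       # (batch, 64, frames, 1, 1)
        x = x.squeeze(-1).squeeze(-1).transpose(1, 2)  # (batch, frames, 64)
        x = self.proj(x)                               # (batch, frames, feat_dim)
        _, h = self.backend(x)                         # final hidden states of the GRU
        clip_vec = torch.cat([h[0], h[1]], dim=-1)     # forward + backward states
        return self.classifier(clip_vec)               # logits; softmax for probabilities

model = LipReadingNet()
dummy = torch.randn(2, 1, 29, 88, 88)  # e.g. 29 frames of 88x88 mouth crops
probs = torch.softmax(model(dummy), dim=-1)
print(probs.shape)                     # torch.Size([2, 500])
```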
Challenges we ran into
The biggest challenge we faced was dealing with homophones: different words that share the same pronunciation and therefore produce nearly identical lip movements. Computational power and cost were also factors that significantly impacted the project. In addition, we use some code from the MS-TCN research paper (https://github.com/yabufarha/ms-tcn), because it is recent work that has not yet been packaged in any library, so we had to take the raw code; the alternative, LS-TCN, would have been extremely slow.
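For illustration, here is a minimal dilated temporal-convolution block in the spirit of TCN-style backends such as MS-TCN. This is not the code from the linked repository; the channel count and number of layers are assumptions made only for the sketch.

```python
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """One dilated temporal convolution with a residual connection,
    the basic building block of TCN-style backends."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.relu = nn.ReLU(inplace=True)
        self.out = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):  # x: (batch, channels, frames)
        return x + self.out(self.relu(self.conv(x)))

class TemporalConvBackend(nn.Module):
    """Stack of dilated blocks; the receptive field grows exponentially with depth,
    which lets the backend model long lip movements without recurrence."""
    def __init__(self, channels=256, num_layers=6):
        super().__init__()
        self.layers = nn.Sequential(
            *[DilatedResidualBlock(channels, dilation=2 ** i) for i in range(num_layers)]
        )

    def forward(self, x):
        return self.layers(x)

backend = TemporalConvBackend()
frame_feats = torch.randn(2, 256, 29)  # (batch, feat_dim, frames) from the frontend
print(backend(frame_feats).shape)      # torch.Size([2, 256, 29])
```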
Accomplishments that we're proud of
We increased the accuracy using Attention Models and dealt with homophones, which is a serious issue in this domain.
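One common way to introduce attention in this setting is attention pooling over the per-frame features, so that the frames carrying the most visual information dominate the clip-level vector. The sketch below is a generic illustration of that idea under assumed dimensions, not our exact model.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Learns a weight for each frame and takes a weighted average of the
    frame features, instead of treating every frame equally."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, frame_feats):  # (batch, frames, feat_dim)
        weights = torch.softmax(self.score(frame_feats), dim=1)  # (batch, frames, 1)
        return (weights * frame_feats).sum(dim=1)                # (batch, feat_dim)

pool = AttentionPooling()
feats = torch.randn(2, 29, 256)
print(pool(feats).shape)  # torch.Size([2, 256])
```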
What we learned
This being the first hackathon for most of us, we learnt more than we could have from a semester of tutorials. We learnt how to build what was on our minds with just enough knowledge, without prior expertise in any particular tech stack.
What's next for Visual Speech to Text
In the future, we can make further use of attention models, which can increase accuracy and also predict full sentences with grammatical correction.