SpeakVision

What it does

Our project proposes a deep learning model that generates audio from lip movement. The model is built on a convolutional neural network (CNN) architecture. One application is reconstructing speech from CCTV footage of a crime scene, which could help law enforcement decipher dialogue that was previously inaudible or unclear.
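To give a rough idea of what a CNN does with a lip image, here is a minimal sketch in plain Python. It is not the SpeakVision model: the function names, the toy 4x4 frame, the two filters and the projection weights are all assumptions made up for illustration. It shows one convolution layer, a ReLU, global average pooling and a linear projection that turns one grayscale lip crop into a small vector of audio features.

```python
# Illustrative sketch only, not the SpeakVision implementation.
# All names, sizes and weights below are hypothetical.

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Replace negative activations with zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def global_average(feature_map):
    """Collapse a feature map to a single number (its mean)."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


def frame_to_audio_features(frame, kernels, projection):
    """Map one grayscale lip crop to a vector of audio features.

    kernels:    list of 2D filters; each yields one pooled activation.
    projection: one row of weights per output audio feature.
    """
    pooled = [global_average(relu(conv2d_valid(frame, k))) for k in kernels]
    return [sum(w * p for w, p in zip(row, pooled)) for row in projection]


# Toy 4x4 "lip crop" with a bright 2x2 region in the middle (assumed data).
frame = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],    # responds to diagonal structure
    [[1.0, -1.0], [1.0, -1.0]],  # responds to vertical edges
]
projection = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]  # 3 output features

features = frame_to_audio_features(frame, kernels, projection)
print(features)
```

In a real system the filters and projection weights would be learned from paired video and audio, and the output would be something like one frame of a spectrogram that is later converted to a waveform.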

Challenges we ran into

The major problem we faced was the limited availability of relevant datasets. We also had trouble optimizing the dataset, which reduced the model's accuracy.

What we learned

We learned how to use a CNN for image recognition, in our case for lip movement detection. We also came to better understand the need for pre-trained models, since we had a hard time finding suitable ones.

What's next for Speak Vision

Lip reading is still a relatively new field, and there is much research left to be done. We would like to find more real-life applications for our project and to optimize the data to produce more accurate results.