Inspiration
With the mission of using technology to build better and more inclusive communities, I took up the challenge of addressing UN Sustainable Development Goal 10 (Reduced Inequalities), specifically target 10.2, which aims to empower and promote the social, economic and political inclusion of all, irrespective of age, sex, disability, race, ethnicity, origin, religion or economic or other status. My solution is catered specifically to the hearing- and speech-disabled population and helps reduce the communication barrier between them and able-bodied individuals.
What it does
The solution empowers the hearing- and speech-disabled communities to carry out their day-to-day activities independently and ensures their inclusion in our communities. It takes a live video feed as input and plots face landmarks, hand landmarks, and pose-estimation contours on the video stream. A trained LSTM machine learning model then analyzes the sign sequence in the video and outputs an accurate interpretation as text, which can in turn be converted to speech with a Python package.
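A minimal sketch of the live landmark-plotting step, assuming OpenCV and the MediaPipe Holistic solution; the trained LSTM and the text-to-speech package (e.g. pyttsx3) would sit downstream of this loop:

```python
import cv2
import mediapipe as mp

mp_holistic = mp.solutions.holistic
mp_drawing = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)  # live video feed from the default webcam
with mp_holistic.Holistic(min_detection_confidence=0.5,
                          min_tracking_confidence=0.5) as holistic:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB; OpenCV delivers BGR
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        # Plot face, pose, and both hand landmark sets on the frame
        mp_drawing.draw_landmarks(frame, results.face_landmarks,
                                  mp_holistic.FACEMESH_CONTOURS)
        mp_drawing.draw_landmarks(frame, results.pose_landmarks,
                                  mp_holistic.POSE_CONNECTIONS)
        mp_drawing.draw_landmarks(frame, results.left_hand_landmarks,
                                  mp_holistic.HAND_CONNECTIONS)
        mp_drawing.draw_landmarks(frame, results.right_hand_landmarks,
                                  mp_holistic.HAND_CONNECTIONS)

        cv2.imshow("Deep Motion", frame)
        if cv2.waitKey(5) & 0xFF == ord("q"):
            break
cap.release()
cv2.destroyAllWindows()
```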
How we built it
The solution is based primarily on computer vision, image processing, and machine learning. Computer vision is used to capture accurate sign sequences, and the MediaPipe package is used to draw the contours. Many existing models consider only the hands for sign language prediction, which is inaccurate because sign language also depends on body orientation, shoulder positioning, and facial expressions. We built a more robust solution by taking these parameters into account, using the MediaPipe Holistic solution to plot the complete sign sequence. These sequences were then used to train an LSTM-based time-series model that learns the relationships within sign sequences and predicts an accurate interpretation.
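A sketch of the kind of LSTM classifier described above, using Keras; the sequence length (30 frames), the per-frame keypoint count (1662 values from the Holistic pose, face, and hand landmarks), and the sign vocabulary are illustrative assumptions rather than the project's exact values:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

SEQUENCE_LENGTH = 30   # assumed frames per sign sample
NUM_KEYPOINTS = 1662   # pose 33*4 + face 468*3 + two hands at 21*3 each
signs = np.array(["hello", "thanks", "help"])  # placeholder vocabulary

# Stacked LSTM layers learn the temporal relationship between frames,
# followed by dense layers that map to a probability per sign.
model = Sequential([
    LSTM(64, return_sequences=True, activation="relu",
         input_shape=(SEQUENCE_LENGTH, NUM_KEYPOINTS)),
    LSTM(128, return_sequences=True, activation="relu"),
    LSTM(64, return_sequences=False, activation="relu"),
    Dense(64, activation="relu"),
    Dense(len(signs), activation="softmax"),
])
model.compile(optimizer="Adam", loss="categorical_crossentropy",
              metrics=["categorical_accuracy"])
# model.fit(X_train, y_train, epochs=200)
# X_train shape: (num_samples, 30, 1662); y_train: one-hot sign labels
```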
Challenges we ran into
The major challenge was working with dynamic sequences that depend on time. The initial stage involved static images such as alphabets, numbers, and a few words, for which I trained a convolutional neural network. The model gave accurate results but failed functionally because it did not handle dynamic sequences well. I concluded that I needed a time-dependent model and had to learn about recurrent neural networks in a very short time. The model was then retrained as an LSTM-based RNN that handles dynamic sequences and performs far better in practice.
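The switch from static images to time-dependent sequences comes down to how a sample is represented: each frame's Holistic landmarks are flattened into one vector, and consecutive frames are stacked into a sequence for the LSTM. A hedged sketch of that step (the zero-padding sizes match the standard Holistic landmark counts; the 30-frame window is an assumption):

```python
import numpy as np

def extract_keypoints(results):
    """Flatten one frame's MediaPipe Holistic landmarks into a fixed-length
    vector, zero-padding any body part that was not detected."""
    pose = (np.array([[lm.x, lm.y, lm.z, lm.visibility]
                      for lm in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 4))
    face = (np.array([[lm.x, lm.y, lm.z]
                      for lm in results.face_landmarks.landmark]).flatten()
            if results.face_landmarks else np.zeros(468 * 3))
    left = (np.array([[lm.x, lm.y, lm.z]
                      for lm in results.left_hand_landmarks.landmark]).flatten()
            if results.left_hand_landmarks else np.zeros(21 * 3))
    right = (np.array([[lm.x, lm.y, lm.z]
                       for lm in results.right_hand_landmarks.landmark]).flatten()
             if results.right_hand_landmarks else np.zeros(21 * 3))
    return np.concatenate([pose, face, left, right])

# One sign sample = the last 30 frame vectors stacked into shape (30, 1662),
# which the LSTM consumes instead of a single static image.
```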
What we learned
Technology used in the right way can empower people and communities, and this project was a clear demonstration of that. Taking on very difficult tasks to build solutions that can make a difference is inspiring, and that is my biggest takeaway from this activity.
What's next for Deep Motion
I envision Deep Motion being deployed in institutions such as banks, marketplaces, and educational centers, which would help eliminate the communication and inclusion barriers we face today. I will strive to turn this project into a complete product even after the hackathon, starting by collecting more data, training the current model further, and improving the RNN architecture. The future scope also includes incorporating Natural Language Processing (NLP) models to produce seamless, grammatically correct output.