Inspiration
Today, blind people watching a YouTube video can follow the content through voice, music, and sound effects, but they miss important visual information whenever audio description is unavailable. Audio description is produced manually by people, so it is costly and time-consuming. Voice-Over Vision was created as a solution to this problem. Recently, Google released a Pixel update where the phone describes a photo taken by the user, and this was our main source of inspiration (https://www.youtube.com/watch?v=wYPTZIFQoDQ).
What it does
Like a friend sitting next to you, this Chrome extension narrates the unseen parts of a video, filling in the blanks where audio alone falls short. It sifts through the video, picks out details you might otherwise miss, and uses text-to-speech technology to bring those visuals to life through vivid descriptions.
How we built it
We combined multi-modal embeddings, video processing, LLMs, Django, retrieval-augmented generation (RAG), silence detection, a Chrome extension, text-to-speech, speech-to-text, and WebSockets to make the narration real time.
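Silence detection is what lets the narrator speak only in the gaps of the original soundtrack. As a minimal sketch (the actual thresholds, window sizes, and audio library we used are not detailed here), a silent window can be found by scanning the waveform and flagging stretches whose RMS energy falls below a threshold:

```python
import math

def find_silent_windows(samples, window=1600, threshold=0.01):
    """Return (start, end) sample indices of stretches whose RMS energy
    stays below `threshold`. `samples` is a list of floats in [-1, 1];
    at 16 kHz, window=1600 is a 100 ms analysis frame."""
    gaps = []
    start = None
    for i in range(0, len(samples) - window + 1, window):
        chunk = samples[i:i + window]
        rms = math.sqrt(sum(s * s for s in chunk) / window)
        if rms < threshold:
            if start is None:  # a silent stretch begins here
                start = i
        else:
            if start is not None:  # silent stretch just ended
                gaps.append((start, i))
                start = None
    if start is not None:  # audio ends while still silent
        gaps.append((start, len(samples)))
    return gaps
```

A description generated for a video segment can then be scheduled into the longest gap near that segment, so the narration never talks over the original audio.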
Accomplishments that we're proud of
We were also able to create a new feature called "Ask-The-Video", which helps visually impaired users ask questions about parts of the video they find unclear.
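Conceptually, "Ask-The-Video" is a RAG lookup: the user's question is embedded and matched against embeddings of per-frame descriptions, and the best-matching moments are handed to the LLM as context. A minimal sketch of that retrieval step (the embedding model, index structure, and function names here are illustrative, not our actual code) could use plain cosine similarity:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve_moments(question_vec, frame_index, k=2):
    """frame_index: list of (timestamp_sec, description, embedding).
    Returns the k entries whose embeddings best match the question."""
    ranked = sorted(frame_index,
                    key=lambda f: cosine(question_vec, f[2]),
                    reverse=True)
    return ranked[:k]
```

The retrieved descriptions (with their timestamps) are then inserted into the LLM prompt so the answer is grounded in what was actually on screen.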
What we learned
Most of what we learned came from building the Chrome extension, which was new territory for us, but we also learned about working with ChatGPT (prompt engineering, how the API works, and so on).
What's next for Voice-Over Vision
There are some small fixes still to be made, and we would also like to run experiments with blind users to see how our solution performs in a real scenario.