Inspiration

Today, blind people watching YouTube videos can follow the content through voice, music, sound effects, and so on, but they miss important visual information whenever audio description is unavailable. Audio description is produced manually by people, so it is costly and time-consuming. Voice-Over Vision came about as a solution to this problem. Recently, Google released an update for Pixel phones in which the phone describes a photo taken by the user, and this was our main source of inspiration (https://www.youtube.com/watch?v=wYPTZIFQoDQ).

What it does

Like a friend sitting next to you, this Chrome Extension narrates the unseen parts of a video, filling in the blanks where audio alone falls short. It smartly sifts through videos, picking out details that you might otherwise miss, and uses text-to-speech technology to bring those visuals to life through vivid descriptions.

How we built it

We combined multi-modal embeddings, advanced video processing, LLMs, Django, RAG, silence detection, a Chrome extension, text-to-speech, speech-to-text, and WebSockets to make it work in real time.
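Of the pieces above, silence detection is the one that decides where narration can be inserted without talking over the video's own audio. A minimal sketch of one way to do it is below; the frame size, energy threshold, and minimum gap length are illustrative assumptions, not the project's actual values.

```python
# Hypothetical silence-detection sketch: scan an audio waveform in fixed-size
# frames, flag frames whose RMS energy falls below a threshold, and merge
# consecutive quiet frames into (start, end) gaps long enough for narration.

def rms(samples):
    """Root-mean-square energy of a window of audio samples."""
    return (sum(s * s for s in samples) / len(samples)) ** 0.5

def find_silences(samples, sample_rate, frame_ms=100, threshold=0.01, min_gap_s=1.0):
    """Return (start_s, end_s) intervals quieter than `threshold`
    and at least `min_gap_s` seconds long."""
    frame = int(sample_rate * frame_ms / 1000)
    quiet = [rms(samples[i:i + frame]) < threshold
             for i in range(0, len(samples) - frame + 1, frame)]
    gaps, start = [], None
    for idx, q in enumerate(quiet):
        if q and start is None:
            start = idx                      # a quiet run begins
        elif not q and start is not None:
            gaps.append((start, idx))        # the quiet run ends
            start = None
    if start is not None:
        gaps.append((start, len(quiet)))     # quiet run reaches end of audio
    frame_s = frame_ms / 1000
    return [(a * frame_s, b * frame_s)
            for a, b in gaps if (b - a) * frame_s >= min_gap_s]
```

In the real pipeline these gaps would be matched against the generated scene descriptions, and the text-to-speech output scheduled into them.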

Accomplishments that we're proud of

We were also able to create a new feature called "Ask-The-Video", which lets visually impaired users ask questions about parts of the video they find unclear.

What we learned

Most of what we learned came from building the Chrome extension, which was new to us, as well as from working with ChatGPT (prompt engineering, how the API works, and so on).

What's next for Voice-Over Vision

There are some small fixes that still need to be made, and we would also like to run experiments with blind users to see how our solution performs in a real scenario.
