ListenIn

Inspiration

Video relies heavily on visual cues, so people who are blind or have low vision often miss much of what happens on screen, which makes movies and other entertainment hard to follow and enjoy. According to the American Foundation for the Blind, findings from the 2022 National Health Interview Survey (NHIS) data release indicate that an estimated 50.18 million adult Americans reported either having trouble seeing, even when wearing glasses or contact lenses, or being blind or unable to see at all. Our project improves accessibility for visually impaired individuals by providing frame-by-frame captions and audio descriptions, so that they can follow video content more fully.

What it does

Our project supports visually impaired individuals by generating detailed video captions that are synchronized with the frames they describe and delivered as audio. This helps them follow the visual elements and narrative flow of movies and other entertainment content. By combining AI models for image captioning, text-to-speech conversion, and speech-to-text transcription of the original audio track, we provide a richer and more inclusive viewing experience for those with visual impairments.

How we built it

Development Tools Used to Build the Project

The primary programming language utilized for developing the application is Python, chosen for its versatility and extensive library support. We employed Jupyter Notebook for the development and testing of individual components, allowing for an interactive and iterative development process. VSCode (Visual Studio Code) was used for code editing, project management, and version control, providing an integrated environment for managing our codebase. Additionally, Google Colab was leveraged for testing machine learning models and scripts in an accessible cloud environment, enabling collaborative development and easy resource sharing.

APIs Used in the Project

To achieve the project's goals, we utilized several APIs. OpenAI GPT-4o was employed to refine image descriptions generated from video frames, ensuring accurate and coherent captions. For generating audio narrations from text descriptions, we used Google Text-to-Speech (gTTS), which offers natural-sounding speech synthesis. Whisper by OpenAI was utilized for high-accuracy speech recognition, enhancing the quality of transcriptions. Additionally, Deepgram served as an alternative API for speech-to-text conversion, ensuring robustness in our audio processing pipeline.

Assets and Datasets Used in the Project

The project involved processing various video files as source material. These videos were analyzed and processed to extract meaningful content. Pre-trained AI models were also essential, particularly for image captioning and speech recognition tasks. These models, trained on vast datasets, provided the foundation for accurate and reliable content generation.

Libraries Used in the Project

Several libraries were crucial to the project's development. OpenCV (Open Source Computer Vision Library) was utilized for video processing, including frame extraction, which allowed us to analyze video content frame by frame. The Transformers library from Hugging Face was used for leveraging the pre-trained BLIP model for image captioning, providing detailed and contextually accurate descriptions of video frames. PIL (Python Imaging Library, also known as Pillow) was employed for handling and manipulating image files, facilitating the integration of visual content. gTTS (Google Text-to-Speech) was used to convert text descriptions into audio files, adding an auditory dimension to the captions. MoviePy was utilized for merging audio tracks with video files, ensuring synchronized playback of audio and video content. FFmpeg was used for merging subtitle files with the video to display captions, providing a seamless viewing experience.
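
The frame extraction and captioning steps described above could look roughly like the following. This is a minimal sketch, not our exact code: the function names, the one-frame-per-second sampling interval, and the `Salesforce/blip-image-captioning-base` checkpoint are all assumptions chosen for illustration.

```python
from typing import Iterator, List, Tuple


def sample_frame_indices(total_frames: int, fps: float, every_seconds: float = 1.0) -> List[int]:
    """Return indices of frames spaced roughly `every_seconds` apart."""
    if total_frames <= 0 or fps <= 0 or every_seconds <= 0:
        return []
    step = max(1, round(fps * every_seconds))  # never step by zero frames
    return list(range(0, total_frames, step))


def caption_video(path: str, every_seconds: float = 1.0) -> Iterator[Tuple[float, str]]:
    """Yield (timestamp_seconds, caption) pairs for the sampled frames."""
    # Heavy dependencies are imported lazily so the helper above
    # stays usable without them installed.
    import cv2
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    model_name = "Salesforce/blip-image-captioning-base"  # assumed checkpoint
    processor = BlipProcessor.from_pretrained(model_name)
    model = BlipForConditionalGeneration.from_pretrained(model_name)

    capture = cv2.VideoCapture(path)
    try:
        fps = capture.get(cv2.CAP_PROP_FPS)
        total = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
        for index in sample_frame_indices(total, fps, every_seconds):
            capture.set(cv2.CAP_PROP_POS_FRAMES, index)
            ok, frame = capture.read()
            if not ok:
                break
            # OpenCV decodes frames as BGR; convert to RGB for the model.
            image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            inputs = processor(images=image, return_tensors="pt")
            output_ids = model.generate(**inputs, max_new_tokens=30)
            caption = processor.decode(output_ids[0], skip_special_tokens=True)
            yield index / fps, caption
    finally:
        capture.release()
```

Sampling a subset of frames rather than every frame keeps captioning time manageable; the interval is a tunable trade-off between detail and speed.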

Challenges we ran into

One of our key challenges was generating accurate captions that adapt to diverse video content, rather than working only for specific categories. For instance, we added error handling for videos that lack an original soundtrack or contain only background music, since either case could otherwise lead to inaccurate captions. Looking ahead, we plan to improve our monitoring and testing processes so that the system handles a wide range of video types more reliably.
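
As an illustration of the kind of guard described above, here is a minimal sketch. The function name `transcribe_if_audio` and the segment shape (dictionaries with a `text` key) are assumptions for this example; the `transcribe` argument stands in for whichever speech-to-text call is used.

```python
from typing import Callable, List, Optional


def transcribe_if_audio(
    audio: Optional[object],
    transcribe: Callable[[object], List[dict]],
) -> List[dict]:
    """Return transcript segments, or an empty list when there is nothing usable."""
    # With MoviePy, for example, VideoFileClip(path).audio is None
    # when the video has no audio track.
    if audio is None:
        return []
    try:
        segments = transcribe(audio)
    except (RuntimeError, ValueError) as error:
        # A failed transcription should not abort the captioning pipeline.
        print(f"Transcription skipped: {error}")
        return []
    # Drop segments with no text, which a music-only track may produce.
    return [s for s in segments if s.get("text", "").strip()]
```

Returning an empty list in every failure case lets the rest of the pipeline continue with image captions alone.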

Accomplishments that we're proud of

We generate coherent and continuous captions for our videos by leveraging advanced AI models and meticulous synchronization techniques.

Image Captioning: Utilizing the BLIP model, our system processes video files to extract frames and generate detailed captions for each, ensuring that important visual information is conveyed accurately.

Text-to-Speech and Speech-to-Text: The original audio track of the video is transcribed using Whisper by OpenAI, and the transcription is adjusted to fit the video's duration. This text is then converted into audio narration via gTTS, giving an auditory representation of the video.

Synchronization and Subtitles: The entire process is designed to maintain synchronization between the audio and video elements. The final output merges these components using MoviePy, and subtitles are incorporated using FFmpeg to enhance clarity and comprehension.
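
The subtitle step above can be sketched as follows. This is an illustrative outline under stated assumptions, not our exact implementation: segments are assumed to be dictionaries with `start`, `end` (in seconds), and `text` keys, and all function names and file names are hypothetical. The FFmpeg command uses the `subtitles` video filter to burn captions into the picture.

```python
from typing import Dict, List


def clamp_segments(segments: List[Dict], duration: float) -> List[Dict]:
    """Trim segments to [0, duration] and drop any that fall outside it."""
    clamped = []
    for seg in segments:
        start = max(0.0, seg["start"])
        end = min(duration, seg["end"])
        if end <= start:
            continue  # segment lies entirely outside the video
        clamped.append({"start": start, "end": end, "text": seg["text"]})
    return clamped


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Dict]) -> str:
    """Render segments as the text of an SRT subtitle file."""
    blocks = []
    for number, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{number}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


def burn_in_command(video: str, srt: str, output: str) -> List[str]:
    """Build an FFmpeg argument list that renders subtitles onto the video."""
    return ["ffmpeg", "-y", "-i", video, "-vf", f"subtitles={srt}", "-c:a", "copy", output]
```

The argument list can be passed to `subprocess.run(..., check=True)` once the SRT text has been written to disk. Clamping before rendering avoids subtitles that run past the end of the video.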

What we learned

Through this project, we gained crucial insights into the integration of complex AI technologies and the importance of user-centric design in accessibility tools. We learned to fine-tune AI models like BLIP for image captioning and Whisper for speech recognition to meet specific user needs effectively. This project reinforced our commitment to using technology to enhance accessibility and opened new avenues for innovation in support of the visually impaired community.

What's next for ListenIn

Currently, our project focuses on converting video content into enriched audio for the visually impaired. Looking ahead, we aim to expand our focus to real-time sign language translation, enabling hearing-impaired individuals to create content in sign language that remains accessible to the general public. To achieve this, we plan to use machine learning techniques for classification and segmentation, coupled with large language models to improve translation accuracy.