Track 2: Inspiring Creative with Generative AI
When creating and consuming streaming media content, generative AI technologies can be applied to content optimisation, information extraction and style transformation, refining content across various media platforms. The transformative power of generative AI can unite communities by inclusively addressing diverse consumption needs across communication mediums and beyond. With these technologies, we can cater to the preferences of diverse audiences and help creators produce higher-quality content more efficiently.
Team Problem Statement
How can TikTok content be made accessible and inclusive for people with visual impairments, while preserving creators' creative freedom of expression?
Background
Many TikTok videos rely on viral audio that does not provide sufficient context, creating barriers for visually impaired viewers and hindering their ability to fully enjoy TikTok's immersive and engaging nature.
Introducing AudioSight, a revolutionary project designed to make TikTok more inclusive for the visually impaired.
AudioSight automatically generates audio narration for videos where visual context is essential (e.g. This Viral TikTok Video). Our solution harnesses advanced generative AI concepts to ensure accuracy and accessibility:
Accessibility Features
- Keyframe Recognition: By utilising structural similarity comparisons, we optimise and extract keyframes to identify crucial moments in the video.
- Scene Detection and Explanation: Leveraging large language models (LLMs), we explain scenes through detailed descriptions, providing accurate and comprehensive audio transcripts for enhanced accessibility.
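As a sketch of the keyframe step, the structural-similarity comparison can be reduced to a simplified global SSIM computed over whole frames (the windowed SSIM in libraries like scikit-image is more robust; the function names and the 0.7 threshold here are illustrative assumptions, not the exact production code):

```python
import numpy as np

def ssim_global(a, b, data_range=255.0):
    """Simplified single-window SSIM over whole grayscale frames
    (no sliding window, unlike scikit-image's implementation)."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    a, b = a.astype(np.float64), b.astype(np.float64)
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def extract_keyframes(frames, threshold=0.7):
    """Keep a frame whenever its similarity to the last kept keyframe
    drops below `threshold`, i.e. the scene has changed noticeably."""
    if len(frames) == 0:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        if ssim_global(frames[kept[-1]], frames[i]) < threshold:
            kept.append(i)
    return kept
```

On a real video the frames would come from a decoder (e.g. OpenCV) and be converted to grayscale first; identical frames score 1.0 and a hard cut scores near 0, so only the frames around scene changes are forwarded to the LLM.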


Accessible Content: Visually impaired audiences can seamlessly toggle the accessibility feature through distinctive gestures that sit alongside existing TikTok gesture interactions. This non-intrusive method lets them first enjoy the content as the creator intended, then turn on the accessibility tool for enhanced comprehension and clarity. This dual approach preserves an authentic viewing experience while offering a contextual explanation of the content's visuals.
Inclusive Content Creation: Content creators can leverage the AI-enabled scene detection feature in their creation process. The feature samples unique frames from the content and provides LLM-generated contextual explanations, which save creators time by offering a solid starting point that needs only minor adjustments. This Human-in-the-Loop approach lets creators make a final check that the explanations accurately reflect the intended portrayal of the content, so the accessibility toggle can be fully enjoyed by visually impaired audiences, enhancing their experience like never before.
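As a sketch of how the LLM draft reaches the creator, the snippet below assembles an OpenAI-style vision prompt from sampled keyframes and applies a trivial human-in-the-loop review step. The payload shape follows OpenAI's chat-completions image format, but the prompt wording and function names are our assumptions, not the production code:

```python
import base64

def build_scene_prompt(frame_jpegs):
    """Assemble a chat-completions message asking a vision LLM (e.g. GPT-4o)
    to describe sampled keyframes for a visually impaired listener."""
    content = [{
        "type": "text",
        "text": ("Describe what happens in these video frames, concisely, "
                 "for a listener who cannot see them."),
    }]
    for jpeg_bytes in frame_jpegs:
        b64 = base64.b64encode(jpeg_bytes).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    return [{"role": "user", "content": content}]

def review_description(llm_draft, creator_edit=None):
    """Human-in-the-loop: the creator accepts the draft or replaces it."""
    return creator_edit if creator_edit else llm_draft
```

The returned messages would be sent to the chat-completions endpoint; the creator then sees the draft during upload, and only the reviewed text is attached to the video.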
Through these innovative techniques, AudioSight transforms video content, making it accessible and enjoyable for everyone.


Technical Implementation

- Data Collection: We scraped a sample of videos matching our target content style (mainly travel videos with little or no captioning).
- Frontend Development and FastAPI: We built our client-facing application with the JavaScript frameworks ReactJS and NextJS, and used FastAPI to connect the frontend with our middleware and backend. We chose FastAPI because it makes building well-typed Python APIs fast and straightforward.
- Prompt Engineering: Using sequential prompting techniques, we ensured that our video-to-speech pipeline gave accurate yet succinct responses. This relies on LLM APIs such as OpenAI's TTS and GPT-4o.
- Evaluation: We used second LLM APIs (Gemini 1.5 Pro and GPT-3.5 Turbo) as judges to evaluate the accuracy of the generated narration.
- Backend Support: We used MongoDB, a NoSQL document database that supports storing vector embeddings, together with Amazon S3 as our storage server.
Development Tools
- Frontend: ReactJS, NextJS
- Middleware: OpenAI API, FastAPI, Gemini API
- Backend: Python, MongoDB, Amazon S3
