Media-to-Text Q&A System
Inspiration
We were inspired by the growing need to quickly access information from audio and video content.
With so much content out there whether it’s educational videos, podcasts, or corporate training
materials we realized how helpful it would be to extract the information as text and make it
searchable. Our goal was to build a system that could take any video or audio file and turn it into
something users could easily query for answers.
What it does
The system takes an uploaded file and determines whether it is a video or audio.
- For Videos: It extracts the audio and converts it into text.
- For Audio: It directly transcribes the content to text.
Once the text is extracted, it is processed using Retrieval-Augmented Generation (RAG), a
technique that retrieves relevant information and generates context-aware responses. The text is
stored in ChromaDB, a vector database, which allows for efficient search and retrieval of
information. Users can continue asking unlimited questions about the uploaded document until
they choose to end the session.
How we built it
1. File Type Identification: The system first determines whether the uploaded file is audio or
video.
2. Conversion:
a. Video Files: The system extracts the audio from video files and converts it to text.
b. Audio Files: These are directly transcribed to text.
3. Text Storage: The transcribed text is stored in ChromaDB, a vector database that organizes the
text for fast and efficient retrieval.
4. Q&A Processing: When a user asks a question, RAG (Retrieval-Augmented Generation) is used
to retrieve the most relevant information from the stored text. The system then utilizes gpt-4o
to generate context-aware and accurate responses based on the retrieved content.
5. User Interaction: Users can ask an unlimited number of questions about the uploaded content.
The system continues to provide answers until the user chooses to end the session, ensuring
an interactive and dynamic experience.
Challenges we ran into
1. Accurate Transcription: Achieving high transcription accuracy wasn’t easy because of varying
speech patterns, accents, and background noise. We had to experiment with different speech-
to-text models to ensure better results.
2. Efficient Data Retrieval: Fine-tuning the retrieval mechanism in ChromaDB was an iterative
process. We wanted to make sure the system pulled out the most relevant information quickly
for every user query, which took a lot of adjustments and testing.
3. Scalability: Managing large multimedia files and multiple user sessions without slowing down
the system was a significant hurdle. We had to optimize our processes to handle larger loads
efficiently.
4. Session Management: Allowing users to ask unlimited questions in a session while keeping
responses accurate and context-aware was challenging. We carefully designed the session
handling to maintain continuity.
5. Using gpt-4o in RAG: Integrating gpt-4o into our RAG process was exciting but not without its
challenges. We needed to balance retrieval and generation so that gpt-4o could provide well-
informed, contextually accurate answers based on the document text. Fine-tuning this
process to ensure smooth, relevant responses took a lot of trial and error.
Accomplishments that we're proud of
1. Successfully created a fully functional system for converting video and audio files to text and
providing context-aware answers.
2. Integrated ChromaDB and RAG to deliver quick and accurate responses to user questions.
3. Enabled users to ask unlimited questions until they decide to end the session, ensuring a
smooth and interactive experience.
4. Optimized the system to handle larger files and faster retrieval with minimal latency.
What we learned
1. Speech Recognition: We gained experience in handling speech-to-text conversions and
ensuring transcription accuracy across different conditions.
2. RAG Implementation: We learned how to combine information retrieval and AI generation for
contextually accurate responses.
3. Vector Databases: Using ChromaDB taught us how to store and retrieve data efficiently for
question-answering systems.
4. User Experience Design: Managing unlimited user questions required improving session
handling and interaction flow.
What's next for Interactive Audio or Video-Based Q&A System
1. Improve Accuracy: Further enhance transcription accuracy and retrieval to provide better
responses.
2. Handle Complex Queries: Allow the system to process more complex questions that span
multiple sections of the document.
3. Add Multi-Language Support: Support multiple languages for transcription and user queries.
4. Enhanced Scalability: Ensure the system can handle larger files, more users, and more
simultaneous sessions with ease.
5. Summarization: Add a feature to generate summaries of the uploaded content to give users a
quick overview before diving into questions.
Log in or sign up for Devpost to join the conversation.