Inspiration
Honestly, lectures are boring to listen to. I realized I spend so much time staring at the clock instead of actually retaining anything. I wanted a tool that could listen to audio for me, transcribe it, and give me a quick summary so I could get the key points without sitting through the whole thing.
What I Learned
While building this, I picked up a lot:
Using Whisper for transcribing audio into text.
How to work with Gemini AI to summarize long texts clearly.
Setting up a Flask backend to handle uploads and processing.
Making a Streamlit frontend that talks to the backend.
Some practical lessons on prompt design, debugging, and handling edge cases.
How I Built It
Backend
Flask app with an /api/upload route.
Saved uploaded MP3s locally.
Used Whisper to transcribe the audio into timestamped text segments.
Sent those segments to Gemini AI for summarization.
Returned JSON containing both segment summaries and an overall summary.
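The backend steps above can be sketched as a small pipeline. The function names here (transcribe_segments, summarize) are hypothetical stand-ins for the real Whisper and Gemini calls, so the shape of the JSON response is the part to focus on:

```python
# Hypothetical sketch of the backend flow: transcribe, summarize per
# segment, then build the JSON payload the /api/upload route returns.

def transcribe_segments(audio_path):
    # Stand-in for whisper.load_model(...).transcribe(audio_path)["segments"]
    return [
        {"start": 0.0, "end": 30.0, "text": "Intro to the topic."},
        {"start": 30.0, "end": 60.0, "text": "Main argument and examples."},
    ]

def summarize(text):
    # Stand-in for a Gemini API call with a summarization prompt.
    return f"Summary: {text}"

def process_upload(audio_path):
    """Return per-segment summaries plus an overall summary."""
    segments = transcribe_segments(audio_path)
    segment_summaries = [
        {"start": s["start"], "end": s["end"], "summary": summarize(s["text"])}
        for s in segments
    ]
    overall = summarize(" ".join(s["text"] for s in segments))
    return {"segments": segment_summaries, "overall_summary": overall}
```

In the real app, process_upload runs inside the Flask route and the dict is returned with flask.jsonify.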
Frontend
Streamlit interface lets users upload MP3s.
Shows file info, plays audio, and sends it to the backend.
Displays segment summaries with timestamps and the overall summary.
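For the timestamped display, a small formatting helper (illustrative, not the project's exact code) turns Whisper's float seconds into the labels shown next to each segment summary:

```python
# Convert Whisper's float seconds into MM:SS labels for the Streamlit UI.

def fmt_timestamp(seconds: float) -> str:
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"

def label_segment(start: float, end: float, summary: str) -> str:
    # e.g. shown via st.markdown(label_segment(...)) in the frontend
    return f"[{fmt_timestamp(start)}-{fmt_timestamp(end)}] {summary}"
```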
AI Summarization
Used prompts that push Gemini AI to pull out the main ideas, not just reword the transcript.
Tried different prompt styles until summaries actually captured the gist.
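One prompt style along these lines worked better than plain "summarize this": ask explicitly for a bounded number of key points and forbid restating the transcript. The wording below is illustrative, not the project's exact prompt:

```python
# Build a summarization prompt that asks for main ideas, not paraphrase.

def build_summary_prompt(transcript: str, max_points: int = 3) -> str:
    return (
        f"Summarize the lecture transcript below in at most {max_points} "
        "key points. Extract the main ideas and conclusions; do not simply "
        "rephrase sentences from the transcript.\n\n"
        f"Transcript:\n{transcript}"
    )
```

The resulting string is what gets sent to Gemini for each segment and for the overall summary.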
Challenges
Transcription issues: Whisper sometimes got words wrong or messed up timestamps.
Summaries felt repetitive: Initially, AI just rephrased sentences. I had to tweak prompts to get actual concise insights.
Setup headaches: Installing Whisper, PyTorch, and the Gemini AI SDK was messy; virtual environments fixed it.
File handling: Making sure MP3 uploads were read and saved correctly in Streamlit took some care.
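The upload handling reduces to something like the sketch below: sanitize the uploaded filename and write the raw bytes to an uploads directory. The names are illustrative; Streamlit's file_uploader yields an object whose .name and .getvalue() map onto these arguments:

```python
import os
import re

def save_upload(filename: str, data: bytes, upload_dir: str = "uploads") -> str:
    """Write uploaded MP3 bytes to disk and return the saved path."""
    os.makedirs(upload_dir, exist_ok=True)
    # Keep only safe characters so the filename can't escape the uploads dir.
    safe_name = re.sub(r"[^A-Za-z0-9._-]", "_", os.path.basename(filename))
    path = os.path.join(upload_dir, safe_name)
    with open(path, "wb") as f:
        f.write(data)
    return path
```

In the app, the saved path is what gets handed to the transcription step.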
Future Improvements
Add speaker recognition.
Include multilingual summarization.
Let users download summaries as PDF or Markdown.
Make bullet-point or structured summaries.