Inspiration
We're drowning in a sea of dry, boring text—from dense articles to tedious reports. Standard text-to-speech is robotic and unengaging. We were inspired to answer a simple question:
"Why can't essential information also be fun?"
We wanted to build a tool that doesn't just process information, but gives it a personality, making learning and consumption memorable and enjoyable.
What it does
AI Talks Back is an interactive web application that transforms any content into a concise, narrated summary. Users can paste text, enter a URL, or upload a file (PDF, DOCX, TXT). The application intelligently summarizes the information into key bullet points and then, using high-quality text-to-speech from ElevenLabs, narrates the summary in one of several distinct AI personas—from a 'grandma GG' to a 'sassy teen'. It's a tool for turning any document into an engaging, listenable experience.
How we built it
The project is built on a modern Python stack, architected for scalability and a clean user experience.
- Frontend: The user interface is a fully interactive web application built with Streamlit.
- Backend: A robust backend server powered by FastAPI handles the core processing, including API calls and audio management.
- AI Core: We integrated two powerful AI services:
  - OpenAI's GPT-4o for intelligent text summarization and content generation.
  - ElevenLabs API for generating expressive, human-like speech with distinct personalities.
- Data Handling: We used PyMuPDF and python-docx to parse uploaded files and the requests library for fetching web content.
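The routing between the three input modes can be sketched with a small dispatcher. This is a minimal illustration, not the app's actual code; `detect_source` is a hypothetical helper, and the real extractors (requests for URLs, PyMuPDF and python-docx for files) are only referenced in comments.

```python
from pathlib import Path

def detect_source(user_input: str) -> str:
    """Hypothetical router mirroring the app's three input modes:
    pasted text, a URL, or an uploaded file."""
    if user_input.startswith(("http://", "https://")):
        return "url"   # fetched with the requests library in the real app
    suffix = Path(user_input).suffix.lower()
    if suffix in {".pdf", ".docx", ".txt"}:
        return "file"  # parsed with PyMuPDF (PDF) or python-docx (DOCX)
    return "text"      # treated as raw pasted text

print(detect_source("https://example.com/post"))  # url
print(detect_source("report.pdf"))                # file
print(detect_source("just some pasted prose"))    # text
```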
Challenges we ran into
Integrating multiple, distinct AI APIs seamlessly was a primary challenge. We had to manage the flow of data from user input -> summarization by OpenAI -> narration by ElevenLabs. Another hurdle was creating a responsive user experience in Streamlit; we implemented live streaming of the text response from GPT-4o to give users immediate feedback. Finally, engineering the prompts to consistently coax the right personality and tone out of the models required significant iteration.
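The live-streaming idea above reduces to feeding a token generator straight to the UI. A minimal sketch, with a simulated generator standing in for the real OpenAI stream (in the actual app, the tokens would come from iterating a `stream=True` chat completion, and Streamlit's `st.write_stream` would render them as they arrive):

```python
from typing import Iterator

def stream_tokens(text: str) -> Iterator[str]:
    """Simulated token stream; the real app yields content deltas
    from an OpenAI chat completion opened with stream=True."""
    for word in text.split():
        yield word + " "

# Streamlit can consume the generator directly, so the user sees
# the summary appear word by word instead of waiting for the whole
# response: st.write_stream(stream_tokens(...))
summary = "".join(stream_tokens("Key point one. Key point two."))
print(summary.strip())
```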
Accomplishments that we're proud of
We are proud of creating a polished, end-to-end application that is both highly functional and genuinely fun to use. Our biggest accomplishment is the core concept itself: transforming text into personality-driven audio. We didn't just build another summarizer; we created a new way to engage with information. We're also proud of the versatile multi-modal input system that allows users to process content from virtually anywhere.
What we learned
This project highlighted the power of combining specialized AI models. While GPT-4o is great at reasoning, combining it with a dedicated voice synthesis model like ElevenLabs creates an experience that neither could achieve alone. We learned the critical importance of prompt engineering in controlling the AI's tone and style. Architecturally, we validated the strength of separating the UI (Streamlit) from the backend processing (FastAPI), which makes the system more robust and maintainable.
What's next for AI Talks Back
Our requirements.txt file hints at our next big step: implementing a full Retrieval-Augmented Generation (RAG) system.
- Deeper Context: By integrating vector databases and embeddings (using the currently unused qdrant-client and fastembed libraries), we'll allow the AI to answer deep, specific questions about very large documents or entire websites.
- Expanded Persona Library: We plan to add more AI voices and characters, potentially allowing users to create their own.
- Conversation Mode: We want to enable users to have a follow-up conversation with the AI persona about the content it just summarized, making the experience even more interactive.
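The retrieval step at the heart of the planned RAG system can be sketched with a toy nearest-neighbour search. Everything here is a stand-in: in the planned build, fastembed would produce the vectors and Qdrant would perform the search; this pure-Python version just shows the ranking idea.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec: list[float], docs: list[dict], k: int = 2) -> list[str]:
    """Toy retrieval: rank document chunks by similarity to the query.
    In the planned system, Qdrant's ANN index replaces this linear scan."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["text"] for d in ranked[:k]]

chunks = [
    {"text": "Chapter on pricing",  "vec": [1.0, 0.0]},
    {"text": "Chapter on history",  "vec": [0.0, 1.0]},
    {"text": "Pricing FAQ section", "vec": [1.0, 1.0]},
]
print(retrieve([1.0, 0.0], chunks))  # ['Chapter on pricing', 'Pricing FAQ section']
```

The retrieved chunks would then be fed to GPT-4o as context, letting the persona answer specific questions about documents far larger than a single prompt.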
Built With
- elevenlabs
- fastapi
- githubpages
- gpt4o
- huggingface
- python