Inspiration

We're drowning in a sea of dry, boring text—from dense articles to tedious reports. Standard text-to-speech is robotic and unengaging. We were inspired to answer a simple question:

"Why can't essential information also be fun?"

We wanted to build a tool that doesn't just process information, but gives it a personality, making learning and consumption memorable and enjoyable.

What it does

AI Talks Back is an interactive web application that transforms any content into a concise, narrated summary. Users can paste text, enter a URL, or upload a file (PDF, DOCX, TXT). The application intelligently summarizes the information into key bullet points and then, using high-quality text-to-speech from ElevenLabs, narrates the summary in one of several distinct AI personas, from 'Grandma GG' to a sassy teen. It's a tool for turning any document into an engaging, listenable experience.

How we built it

The project is built on a modern Python stack, with the user interface and the backend processing split into separate services to keep the system maintainable and the user experience clean.

  • Frontend: The user interface is a fully interactive web application built with Streamlit.
  • Backend: A robust backend server powered by FastAPI handles the core processing, including API calls and audio management.
  • AI Core: We integrated two powerful AI services (a minimal pipeline sketch follows this list):
    • OpenAI's GPT-4o for intelligent text summarization and content generation.
    • ElevenLabs API for generating expressive, human-like speech with distinct personalities.
  • Data Handling: We used PyMuPDF and python-docx to parse uploaded files and the requests library for fetching web content (also sketched below).
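
The heart of that AI core is a short summarize-then-narrate pipeline. Below is a minimal sketch of it, assuming the official OpenAI Python SDK and ElevenLabs' REST text-to-speech endpoint; the persona prompts, voice IDs, and helper names are illustrative placeholders, not our exact production code.

```python
# Sketch: summarize with GPT-4o, then narrate the summary with ElevenLabs.
import os
import requests
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

PERSONA_PROMPTS = {
    # Illustrative persona prompts; the real ones took many iterations.
    "grandma_gg": "Summarize the text in short bullet points as warm, storytelling Grandma GG.",
    "sassy_teen": "Summarize the text in short bullet points as a sassy, eye-rolling teen.",
}

def summarize(text: str, persona: str) -> str:
    """Ask GPT-4o for a persona-flavored bullet-point summary."""
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": PERSONA_PROMPTS[persona]},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

def narrate(summary: str, voice_id: str) -> bytes:
    """Convert the summary to audio via ElevenLabs' text-to-speech endpoint."""
    response = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        json={"text": summary, "model_id": "eleven_multilingual_v2"},
        timeout=60,
    )
    response.raise_for_status()
    return response.content  # raw audio bytes for the frontend to play
```

The FastAPI backend handles calls like these and manages the resulting audio, which the Streamlit frontend then plays back to the user.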
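
Getting content into that pipeline is the job of the data-handling layer. Here is a rough sketch of how the three input types might be reduced to plain text with PyMuPDF, python-docx, and requests; the helper names are ours for illustration.

```python
# Sketch: reduce the three input types (PDF, DOCX, URL) to plain text.
import fitz                # PyMuPDF
import requests
from docx import Document  # python-docx

def text_from_pdf(data: bytes) -> str:
    """Extract text from an uploaded PDF given its raw bytes."""
    with fitz.open(stream=data, filetype="pdf") as doc:
        return "\n".join(page.get_text() for page in doc)

def text_from_docx(path: str) -> str:
    """Extract the paragraph text from a DOCX file."""
    return "\n".join(p.text for p in Document(path).paragraphs)

def text_from_url(url: str) -> str:
    """Fetch a web page as raw HTML (the app may strip markup before summarizing)."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text
```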

Challenges we ran into

Integrating multiple, distinct AI APIs seamlessly was a primary challenge. We had to manage the flow of data from user input -> summarization by OpenAI -> narration by ElevenLabs. Another hurdle was creating a responsive user experience in Streamlit; we implemented live streaming of the text response from GPT-4o to give users immediate feedback. Finally, engineering the prompts to consistently coax the right personality and tone out of the models required significant iteration.
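
The streaming piece is small but made a big difference to perceived responsiveness. A rough sketch, assuming Streamlit's st.write_stream helper (available in recent Streamlit releases) and the OpenAI SDK's streaming mode; the prompt and widget labels are placeholders:

```python
# Sketch: stream GPT-4o's summary into the Streamlit UI as it is generated.
import streamlit as st
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def stream_summary(text: str, system_prompt: str):
    """Yield text deltas from GPT-4o as they arrive."""
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text},
        ],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta

user_text = st.text_area("Paste some text to summarize")
persona_prompt = "Summarize the text as warm, storytelling Grandma GG."  # illustrative

if user_text:
    # st.write_stream renders the tokens live and returns the full string,
    # which we keep for the ElevenLabs narration step.
    summary = st.write_stream(stream_summary(user_text, persona_prompt))
```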

Accomplishments that we're proud of

We are proud of creating a polished, end-to-end application that is both highly functional and genuinely fun to use. Our biggest accomplishment is the core concept itself: transforming text into personality-driven audio. We didn't just build another summarizer; we created a new way to engage with information. We're also proud of the versatile multi-modal input system that allows users to process content from virtually anywhere.

What we learned

This project highlighted the power of combining specialized AI models. While GPT-4o is great at reasoning, combining it with a dedicated voice synthesis model like ElevenLabs creates an experience that neither could achieve alone. We learned the critical importance of prompt engineering in controlling the AI's tone and style. Architecturally, we validated the strength of separating the UI (Streamlit) from the backend processing (FastAPI), which makes the system more robust and maintainable.

What's next for AI Talks Back

Our requirements.txt file hints at our next big step: implementing a full Retrieval-Augmented Generation (RAG) system.

  • Deeper Context: By integrating vector databases and embeddings (using the currently unused qdrant-client and fastembed libraries), we'll allow the AI to answer deep, specific questions about very large documents or entire websites; a first sketch follows this list.
  • Expanded Persona Library: We plan to add more AI voices and characters, potentially allowing users to create their own.
  • Conversation Mode: We want to enable users to have a follow-up conversation with the AI persona about the content it just summarized, making the experience even more interactive.
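
A first sketch of what that RAG layer might look like, assuming qdrant-client's fastembed-powered convenience methods (.add and .query); the collection name, chunking, and prompts are placeholders rather than settled design:

```python
# Sketch: index document chunks in Qdrant (embedded via fastembed),
# then retrieve context for a follow-up question before asking GPT-4o.
from openai import OpenAI
from qdrant_client import QdrantClient

qdrant = QdrantClient(":memory:")  # in-process for prototyping; production would persist
openai_client = OpenAI()

def index_document(text: str, chunk_size: int = 1000) -> None:
    """Split the document into chunks and embed them with fastembed's default model."""
    chunks = [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]
    qdrant.add(collection_name="docs", documents=chunks)

def answer(question: str, persona_prompt: str) -> str:
    """Retrieve the most relevant chunks and let GPT-4o answer in persona."""
    hits = qdrant.query(collection_name="docs", query_text=question, limit=3)
    context = "\n\n".join(hit.document for hit in hits)
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": persona_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```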

Built With

  • elevenlabs
  • fastapi
  • githubpages
  • gpt4o
  • huggingface
  • python