Inspiration

Have you ever wished you could quickly recall specific thoughts from your extensive audio journals? Finding precise feelings or events in your recorded entries can be like searching for a needle in a haystack. How can we access these important memories effortlessly?

What it does

We’ve developed a cutting-edge platform that not only converts audio and video into text using Deepgram's technology but goes a step further by transforming this text into searchable, interactive data through vector embeddings. But what truly sets our platform apart is how we deliver these answers. Using an open-source text to audio cloning model VALL-E, we create a personalized voice model for each user. This means responses aren’t just text; they're spoken back in the user’s own voice, enhancing the learning and interaction experience, making it deeply personal and much more engaging. Imagine the possibilities—students learning in their instructor’s voice long after the lecture ends, or journalists retrieving statements from interviews without tedious searching. We’re not just simplifying information retrieval; we’re personalizing it. We invite you to join us in revolutionizing how we interact with recorded knowledge.

How we built it

We built Twin AI using a robust stack of advanced technologies. At the core, Deepgram's speech-to-text API translates audio inputs into accurate text transcriptions. We then use natural language processing techniques to convert these transcriptions into vector embeddings, allowing for nuanced data retrieval. The personalized voice output is powered by VALL-E, which synthesizes speech that mimics the user’s own voice based on initial voice samples and ongoing inputs, creating a seamless and personalized user experience.

Challenges we ran into

One of our biggest challenges was integrating VALL-E's text-to-speech technology, particularly due to the limitations of creating accurate custom voices from minimal input. This process significantly increased latency, as generating spoken responses takes longer than delivering text. Balancing voice personalization with acceptable response times was both complex and critical for user experience.

Accomplishments that we're proud of

We are particularly proud of developing a seamless integration that not only transcribes but also interprets and responds in the user's own voice. Our platform’s ability to deliver personalized audio responses in real-time marks a significant step forward in how AI can enhance personal learning and data interaction. Seeing it help users retrieve information effortlessly and interact with their past experiences in their own voice has been incredibly rewarding.

What we learned

Throughout this project, we've gained a deeper understanding of advanced AI technologies including NLP for text analysis, vector embeddings for data retrieval, and neural networks for voice synthesis. We also learned about the complexities of maintaining user privacy and data security when handling sensitive audio data, pushing us to implement robust security measures.

What's next for Twin AI

Looking ahead for Twin AI, we're focusing on deepening personalization, transforming the bot into a digital avatar that mirrors you. Our goal is to enhance response times to match the speed of human thought, offering instant answers.

Built With

  • deepgram
  • faiss
  • groq
  • huggingface
  • langchain
  • llama3
  • llamaindex
  • valle-x
Share this project:

Updates