Inspiration
Our team has been journaling for over 10 years. Over that time, journaling has helped us pause and reflect, and it has also served as a record of how we thought, felt, and made sense of difficult moments in the past. At the same time, we’ve run into its limitations. Journaling often requires the right time and environment, can feel hard to return to when you’re overwhelmed, and makes it difficult to surface the specific memory you need in the moment.
We wanted to make reflection easier and more accessible by using voice, which feels more natural and immediate than writing for many people. We also wanted to explore how AI could help bridge the present moment with your own past experiences, rather than replacing them. That idea eventually became Echoes.
What it does
Echoes is a voice-based reflection system grounded in a user’s own memories. Users record short voice reflections as moments happen in their lives, similar to talking things through with a trusted friend. These recordings are transcribed and saved to a personal timeline.
Later, when a user returns feeling uncertain or stuck, they can speak again or ask a question. Echoes responds by surfacing relevant past recordings of moments where the user experienced similar emotions or situations and playing them back as audio.
Echoes does not generate advice or tell users what to do. Instead, it helps users hear something they already said, but from a different point in time.
How we built it
We started with a Vercel-provided template that combined Next.js with Supabase for data storage and authentication, which made setting up the web app much faster. For voice capabilities, we relied on ElevenLabs for high-quality, low-latency speech-to-text transcription and text-to-speech synthesis, which helped create a smooth and responsive experience.
For language understanding and reasoning, we used Gemini. Its strong reasoning abilities and large context windows made it straightforward to define tools and guide how the model should interact with our grounding data, which included both audio and text. To retrieve relevant moments from past recordings, we used Gemini’s embedding models to generate vector representations of each clip. These embeddings were stored in Supabase and queried efficiently to support low-latency semantic search.
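In production, the similarity search runs inside Supabase (so the database does the ranking), but the core idea behind embedding-based retrieval can be sketched in plain TypeScript. The names and two-dimensional vectors below are illustrative, not from our codebase; Gemini embeddings are much higher-dimensional.

```typescript
// Each stored clip carries the embedding generated from its transcript.
interface Clip {
  id: string;
  transcript: string;
  embedding: number[];
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank stored clips against a query embedding and keep the top k.
function topKClips(query: number[], clips: Clip[], k: number): Clip[] {
  return [...clips]
    .sort(
      (x, y) =>
        cosineSimilarity(query, y.embedding) -
        cosineSimilarity(query, x.embedding),
    )
    .slice(0, k);
}
```

Pushing this ranking into the database (e.g. via a vector-distance index) is what keeps the search low-latency as the timeline grows, since only the query embedding crosses the network.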
Challenges we ran into
One of the main challenges was working within the usage limits of free-tier products, especially since machine learning models can be expensive to run! For example, we would have liked to clone a user’s voice for responses, but that capability isn’t available on the free tier of ElevenLabs.
We also found that semantic search became less reliable for longer audio clips. When a single recording covered many topics, it was harder to consistently evaluate its relevance to a specific query.

Additionally, this was the first time either of us had used Next.js, which came with a learning curve. Despite that, its abstractions ultimately helped us iterate quickly.
Finally, feature creep was a real challenge. As development progressed, we came up with many new ideas, and it was difficult to decide which ones to prioritize given the limited time available.
Accomplishments that we’re proud of
We’re proud that we implemented and deployed a fully working web application with authentication and basic security considerations in place. We also successfully learned and integrated new tools, including Next.js and the Gemini and ElevenLabs APIs, within a short time frame. Perhaps most encouraging was building a smooth user experience on limited resources: transcription and text processing ended up being much faster than we initially expected.
What we learned
Working with voice inputs highlighted how much latency and transcription quality affect the overall experience. Even small delays or inaccuracies in speech-to-text noticeably degraded usability, which made low-latency transcription and fast post-processing essential rather than optional.
We learned that grounding AI responses in user-owned data requires careful constraints on model behavior. Clear tool boundaries and explicit instructions were necessary to ensure the model only retrieved and replayed past recordings instead of generating new content or advice. This made prompt design and tool orchestration a central part of the system, not just an implementation detail.
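One way to enforce that kind of boundary, sketched here with hypothetical tool names (our actual definitions differ), is to route every tool call the model emits through an explicit whitelist, so the model can only trigger retrieval and playback and nothing else:

```typescript
// The only actions the model may trigger. Anything else is rejected before it
// reaches application code, so the model cannot invent new behaviors.
const ALLOWED_TOOLS = new Set(["search_past_recordings", "play_recording"]);

type ToolCall = { name: string; args: Record<string, unknown> };
type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

// Stub handlers for illustration; in a real app these would query the
// database and stream audio back to the client.
const handlers: Record<string, ToolHandler> = {
  search_past_recordings: async (args) => ({ clipIds: [], query: args.query }),
  play_recording: async (args) => ({ playing: args.clipId }),
};

// Dispatch a model-issued tool call, rejecting anything off the whitelist.
async function dispatchToolCall(call: ToolCall): Promise<unknown> {
  if (!ALLOWED_TOOLS.has(call.name)) {
    throw new Error(`Tool not permitted: ${call.name}`);
  }
  return handlers[call.name](call.args);
}
```

Combined with system instructions that tell the model to answer only by selecting past recordings, this keeps generation out of the response path by construction rather than by hoping the prompt holds.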
Semantic retrieval over long, unstructured voice transcripts turned out to be more challenging than expected. Embedding quality varied depending on transcript length and topic density, which pushed us to think about chunking strategies, metadata enrichment, and potential hybrid retrieval approaches to improve relevance.
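A simple chunking strategy of the kind mentioned above is to split each transcript into overlapping word windows and embed every window separately, so a multi-topic recording contributes several focused vectors instead of one diluted one. The window and overlap sizes below are illustrative defaults, not tuned values:

```typescript
// Split a transcript into overlapping chunks of roughly `chunkSize` words.
// The overlap keeps context that straddles a boundary retrievable from
// either neighboring chunk.
function chunkTranscript(
  transcript: string,
  chunkSize = 120,
  overlap = 20,
): string[] {
  const words = transcript.split(/\s+/).filter(Boolean);
  if (words.length <= chunkSize) {
    return words.length ? [words.join(" ")] : [];
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break;
  }
  return chunks;
}
```

At query time, matches on any chunk point back to the parent recording, which can then be played in full.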
What’s next for Echoes
Looking ahead, we want to explore detecting emotional themes and patterns across voice reflections, while also giving users more control over their timelines through features like deleting, re-recording, and reorganizing entries. We’re interested in turning Echoes into a mobile app to support true on-the-go reflection and refining the UI so it feels less like a traditional app and more like a calm, intentional moment. Throughout all of this, we plan to maintain a strong focus on privacy, consent, and user-controlled memory.
Built With
- elevenlabs
- gemini
- next.js
- supabase
- typescript
- vercel
