Inspiration

Communication generates vast amounts of information, yet much of it is lost or difficult to visualize in real time. Echo was born from the need for better ways to organize and interpret spoken interactions. We wanted to create a tool that not only captures conversations but also enhances the way we interact with them, using AR to display real-time data. In doing so, we aim to bridge the gap between spoken communication and actionable insights, ensuring that no key information is missed.

What it does

Echo captures live conversations using Snap Spectacles and turns them into a real-time transcription in augmented reality. As the conversation progresses, users can select words or phrases to define them, or bookmark important segments. At the end of the session, Echo generates a detailed report supercharged by Gemini, including a transcription, a summary, and action items, all accessible through our dashboard. The system is perfect for meetings, lectures, or any conversation you'd want to remember, offering a new way to interact with and organize spoken data.
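The report step can be sketched roughly as follows. The function name, prompt wording, and bookmark format here are illustrative assumptions, not Echo's actual code:

```python
# Hypothetical sketch of assembling Echo's end-of-session report prompt.
# Names and prompt wording are assumptions, not the project's real code.

def build_report_prompt(transcript: str, bookmarks: list[str]) -> str:
    """Compose a prompt asking Gemini for a summary and action items."""
    bookmark_lines = "\n".join(f"- {b}" for b in bookmarks) or "- (none)"
    return (
        "Summarize the following conversation, then list action items.\n\n"
        f"Transcript:\n{transcript}\n\n"
        f"Bookmarked segments:\n{bookmark_lines}"
    )

# The prompt would then be sent to Gemini, e.g. via the
# google-generativeai SDK:
#   model = genai.GenerativeModel("gemini-1.5-flash")
#   report = model.generate_content(build_report_prompt(text, marks)).text
```

Keeping prompt assembly separate from the API call makes the report format easy to test without touching the Gemini endpoint.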

How we built it

We built Echo using Snap Spectacles for live video capture, integrated with Deepgram's speech recognition API to convert spoken words into transcriptions. The transcriptions are rendered in a 3D space using Lens Studio, allowing users to interact with the words in real time. The backend is powered by Python and JavaScript, ensuring smooth communication between the AR environment, the dashboard, and Gemini. Reflex handles user interactions and the backend connection for our dashboard, which summarizes the session and organizes the transcription, action steps, and definitions for future use.
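As a rough illustration of the transcription step: the endpoint and response shape below follow Deepgram's v1 `/listen` REST API, but the helper names and the simplified error handling are our own assumptions, not Echo's actual pipeline:

```python
import json
import urllib.request

DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"

def transcribe(audio_bytes: bytes, api_key: str) -> dict:
    """POST raw audio to Deepgram and return the parsed JSON response."""
    req = urllib.request.Request(
        DEEPGRAM_URL,
        data=audio_bytes,
        headers={"Authorization": f"Token {api_key}",
                 "Content-Type": "audio/wav"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def extract_words(response: dict) -> list[tuple[str, float]]:
    """Pull (word, start-time) pairs so an AR layer can place each word."""
    alt = response["results"]["channels"][0]["alternatives"][0]
    return [(w["word"], w["start"]) for w in alt.get("words", [])]
```

Word-level start times are what let the 3D rendering stay in sync with the speaker rather than dumping the whole transcript at once.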

Challenges we ran into

We faced several challenges throughout the development process. Initially, we wanted to use Lens Studio's built-in VoiceML for speech-to-text, but quickly realized that it didn't allow us to interact with third-party APIs. This limitation led us to Deepgram's API, which offered more flexibility for integrating real-time transcriptions, along with easier external data storage and more comprehensive analysis.

Additionally, we ran into difficulties with data retrieval. Since we couldn't make direct calls to our MongoDB database from the front end, we had to build and deploy a Flask server to fetch data in real time. This server let us store and retrieve live transcription data from MongoDB, something Deepgram alone couldn't handle. These challenges pushed us to rethink our system architecture, but ultimately led to a more robust solution.
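A minimal sketch of what that Flask layer might look like, assuming a route, field names, and an injected collection that are placeholders rather than Echo's actual schema (in practice the collection would come from pymongo's MongoClient):

```python
from flask import Flask, jsonify

def serialize(doc: dict) -> dict:
    """MongoDB documents carry a non-JSON-serializable ObjectId; drop it."""
    return {k: v for k, v in doc.items() if k != "_id"}

def create_app(transcripts):
    """Build the server around an injected MongoDB-style collection."""
    app = Flask(__name__)

    @app.route("/sessions/<session_id>/transcript")
    def get_transcript(session_id):
        # Hypothetical schema: one document per transcribed segment.
        docs = transcripts.find({"session_id": session_id})
        return jsonify([serialize(d) for d in docs])

    return app
```

Injecting the collection keeps the endpoint testable without a live database, which matters when the front end can only reach data through this server.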

Accomplishments that we're proud of

We’re proud to have successfully integrated Snap Spectacles with 3D AR transcription, a feat that involved overcoming multiple technical hurdles. While sending and retrieving speech data and analysis took time and was tough to implement, the reward of bringing real-time definitions and analysis to the Snap Spectacles was well worth the effort. Another accomplishment is creating an intuitive dashboard that complements the AR experience, providing users with actionable insights like summaries and action steps, and allowing their actions in AR to carry impact even after they take off the Spectacles.

What we learned

This project taught us the importance of creative problem-solving when working with newer software. While our vision was straightforward, we ran into numerous hurdles that we overcame using our knowledge of backend development and API usage. We also learned how to use Lens Studio to create augmented reality environments that change in real time, and how to build smooth interactions and user-friendly interfaces that can serve students, professionals, and anyone interested in organizing spoken information.

What's next for Echo

We plan to introduce contextual summarization, where real-time bullet points and conversation prompts will appear to help guide discussions and highlight key topics as they emerge. We are also exploring the idea of categorizing conversations into specific topics to emphasize important points automatically. Real-time data visualization is another exciting prospect, making relevant graphs and data available on demand as they come up in conversation. Collaborative AR environments would allow multiple users to view and interact with the same data, creating a real-time collaboration experience. Finally, we aim to implement multilingual support, enabling real-time translation for all types of conversations.
