Inspiration
We’ve all experienced moments where we meet someone new, have an interesting conversation, and then forget the details moments later. Networking events, conferences, or even casual meetups can be overwhelming when trying to remember people and the context of your interactions.
Conversa was inspired by the desire to augment human memory without replacing presence — to help people remember meaningful details from conversations so they can build authentic connections, not just take notes or record meetings. We wanted a tool that’s calm, trustworthy, and human-centered, unlike typical “always-on” AI tools. Also, we wanted to prove this idea can work on ultra low-cost hardware (under $15), not just expensive smart glasses.
What it does
Conversa is a smart wearable + mobile app system that helps you remember and recall conversations:

Smart Glasses: Detect faces and trigger recording when the user speaks a natural phrase like “Hi, my name is Kevin” or a custom wake word.
AI Summaries: After the conversation, the companion app generates concise summaries, key points, and action items — so you can review interactions quickly.
Person-Centric Memory: Each contact has a timeline of past conversations, linking interactions across events.
Search & Filter: Easily search by person, date, or topic to recall past interactions.
Optional Transcript: Users can expand a conversation to view the full transcript for reference.

All of this is designed to help you focus on the moment, rather than getting distracted by note-taking or trying to remember everything manually.
How we built it
Frontend: React 18 + TypeScript with Vite and Tailwind CSS. We designed a calming UI with soft colors (lavender, mint, sage) to feel trustworthy and human-centered.
Backend API: FastAPI with PostgreSQL and SQLAlchemy. Our schema links People, Conversations, and ActionItems to build person-centric timelines. Face embeddings are stored as 128-dimensional vectors for biometric matching.
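The person-centric schema can be sketched roughly as below. This is a simplified illustration, not our production models: the table and column names are hypothetical, and for brevity it uses SQLite with the 128-dimensional face embedding serialized as JSON rather than PostgreSQL's vector storage.

```python
# Hypothetical sketch of the People -> Conversations -> ActionItems schema.
import json
from sqlalchemy import Column, Integer, String, Text, ForeignKey, create_engine
from sqlalchemy.orm import declarative_base, relationship, Session

Base = declarative_base()

class Person(Base):
    __tablename__ = "people"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    face_embedding = Column(Text)  # JSON-encoded list of 128 floats
    conversations = relationship("Conversation", back_populates="person")

class Conversation(Base):
    __tablename__ = "conversations"
    id = Column(Integer, primary_key=True)
    person_id = Column(Integer, ForeignKey("people.id"))
    summary = Column(Text)
    transcript = Column(Text)
    person = relationship("Person", back_populates="conversations")
    action_items = relationship("ActionItem", back_populates="conversation")

class ActionItem(Base):
    __tablename__ = "action_items"
    id = Column(Integer, primary_key=True)
    conversation_id = Column(Integer, ForeignKey("conversations.id"))
    text = Column(Text, nullable=False)
    conversation = relationship("Conversation", back_populates="action_items")

# Build a one-contact timeline in an in-memory database.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
with Session(engine) as s:
    kevin = Person(name="Kevin", face_embedding=json.dumps([0.0] * 128))
    conv = Conversation(person=kevin, summary="Met at hackathon")
    conv.action_items.append(ActionItem(text="Email slides"))
    s.add(kevin)  # conversation and action item cascade via relationships
    s.commit()
    timeline = s.query(Conversation).filter_by(person_id=kevin.id).all()
    n_convs = len(timeline)
    first_item = timeline[0].action_items[0].text
```

Keying conversations to a `Person` row is what makes the per-contact timeline a simple filtered query rather than a join across events.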
Recording Pipeline: Custom Python service with real-time Voice Activity Detection using PyAudio and NumPy. When speech is detected, we capture synchronized video (OpenCV) and audio with a 15-second rolling buffer so we never miss conversation starts.
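The rolling buffer idea can be sketched with a bounded deque (chunk sizes here are assumed, and the real service feeds this from PyAudio callbacks): old chunks fall off automatically, so when VAD fires we still have the seconds of audio from *before* speech was confirmed.

```python
# Minimal sketch of a 15-second rolling pre-buffer (illustrative constants).
from collections import deque

SAMPLE_RATE = 16_000           # local mic rate
CHUNK = 1024                   # samples per audio chunk
BUFFER_SECONDS = 15
MAX_CHUNKS = (SAMPLE_RATE * BUFFER_SECONDS) // CHUNK

pre_buffer = deque(maxlen=MAX_CHUNKS)  # oldest chunks dropped automatically

def on_audio_chunk(chunk, recording, out):
    """Route each incoming chunk: live recording, or the rolling pre-buffer."""
    if recording:
        out.append(chunk)
    else:
        pre_buffer.append(chunk)

def start_recording(out):
    """VAD confirmed speech: flush the buffered lead-in, then record live."""
    out.extend(pre_buffer)
    pre_buffer.clear()
```

Because `deque(maxlen=...)` evicts from the front, the buffer always holds the most recent 15 seconds without any manual trimming.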
AI Pipeline: We combine multiple services for multimodal understanding:
Snowflake AI_TRANSCRIBE for speech-to-text with speaker diarization
Google Gemini 2.5 Pro for video analysis — extracting key frames to identify who said what based on visual cues
face_recognition (dlib) for recognizing returning contacts across conversations

Smart Glasses Integration: WebRTC streaming via aiortc captures video at 30fps and audio at 48kHz from the glasses, with all AI processing happening on our backend.
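The face-matching step for returning contacts can be sketched as a nearest-neighbor search over stored 128-dimensional embeddings. The threshold below follows the face_recognition library's conventional default of 0.6 for Euclidean distance; the function and variable names are illustrative, not our actual code.

```python
# Sketch: match a new face embedding against known contacts' embeddings.
import numpy as np

THRESHOLD = 0.6  # face_recognition's conventional same-person distance

def match_face(embedding, known):
    """known: dict of name -> 128-d numpy array. Returns best match or None."""
    best_name, best_dist = None, float("inf")
    for name, ref in known.items():
        dist = float(np.linalg.norm(embedding - ref))  # Euclidean distance
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < THRESHOLD else None
```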
The full flow: Glasses → VAD → Recording → Transcription → Video Analysis → Face Matching → Database → Mobile App
Challenges we ran into
Network connectivity with the camera module: Our biggest hurdle was getting the real-time camera to connect reliably. UBC's enterprise network blocked our WebRTC streams due to firewall restrictions and network congestion from thousands of connected devices. We spent hours debugging before realizing we needed to fall back to a personal hotspot for stable streaming.
Another big hurdle was our hardware limitations: we used a cheap microphone attachment to capture the conversation around the user. This setup works fine in quiet environments but struggles in loud or busy areas. The low input level also caused transcription problems, since the audio was sometimes too quiet for our AI to detect speech at all.
Voice Activity Detection tuning: Getting the VAD to feel natural was tricky. Too sensitive, and it triggered on background noise. Too conservative, and it missed the start of conversations. We had to carefully tune the energy thresholds, zero-crossing rates, and timing parameters (300ms to confirm speech, 1.5s of silence to stop) and then re-tune everything for WebRTC's 48kHz audio vs. local mic at 16kHz.
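The tuned state machine described above can be sketched as follows. The energy threshold is illustrative rather than our final value, and for brevity this version checks only RMS energy (the real pipeline also uses zero-crossing rates); re-tuning for 48 kHz vs. 16 kHz mostly means recomputing the frames-per-interval constants.

```python
# Sketch of the VAD state machine: 300 ms of speech to confirm a start,
# 1.5 s of silence to confirm a stop. Threshold values are illustrative.
import numpy as np

FRAME_MS = 30
ENERGY_THRESHOLD = 0.02              # RMS on float32 audio in [-1, 1]
CONFIRM_FRAMES = 300 // FRAME_MS     # 300 ms to confirm speech
SILENCE_FRAMES = 1500 // FRAME_MS    # 1.5 s of silence to stop

class VAD:
    def __init__(self):
        self.speaking = False
        self.speech_run = 0
        self.silence_run = 0

    def process(self, frame):
        """frame: float32 numpy array. Returns 'start', 'stop', or None."""
        rms = float(np.sqrt(np.mean(frame ** 2)))
        is_speech = rms > ENERGY_THRESHOLD
        if not self.speaking:
            # Require a sustained run of speech frames before triggering,
            # so brief background noise does not start a recording.
            self.speech_run = self.speech_run + 1 if is_speech else 0
            if self.speech_run >= CONFIRM_FRAMES:
                self.speaking, self.speech_run = True, 0
                return "start"
        else:
            # Require sustained silence before stopping, so natural pauses
            # mid-conversation do not cut the recording short.
            self.silence_run = 0 if is_speech else self.silence_run + 1
            if self.silence_run >= SILENCE_FRAMES:
                self.speaking, self.silence_run = False, 0
                return "stop"
        return None
```

The two asymmetric timers are what make it feel natural: starting is eager (with the rolling buffer covering the confirmation delay), while stopping is patient.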
Speaker diarization accuracy: Matching audio segments to the right person in frame required combining multiple signals: Snowflake's speaker labels, Gemini's visual analysis of lip movement, and face recognition. Getting these to agree consistently took significant iteration.
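The reconciliation logic can be pictured as a simple vote per audio segment. This is a toy sketch with hypothetical labels and a hypothetical tie-break rule (deferring to the visual signal), not our exact fusion code.

```python
# Toy sketch: reconcile three per-segment speaker signals by majority vote,
# with the visual (lip-movement) label breaking three-way ties.
from collections import Counter

def resolve_speaker(snowflake_label, gemini_label, face_label):
    votes = Counter([snowflake_label, gemini_label, face_label])
    top, count = votes.most_common(1)[0]
    return top if count >= 2 else gemini_label
```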
Accomplishments that we're proud of
We’re proud that we built a complete end-to-end system that actually works in real time and not just a concept demo. We took a hacked Temu security camera and turned it into a smart-glasses prototype that can stream live audio and video into our backend, detect when a real conversation is happening, and automatically capture it without missing the start using a rolling buffer + tuned VAD. From there, we integrated multiple AI services (Snowflake transcription + speaker diarization, Gemini video understanding, and face recognition) to connect who was speaking with who was in frame, and store it as a person-centric memory timeline with summaries and action items. Most importantly, we brought hardware, networking, backend systems, and a calm mobile UI together into one cohesive experience even after fighting real-world issues like enterprise (UBC) Wi-Fi blocking WebRTC streams.
What we learned
Even simple ideas around memory and recall involve complex design trade-offs around privacy, user trust, and clarity. Humans care about meaning and context, not just raw data — summaries are far more valuable than full transcripts for most users. Visual design and emotional tone are critical for wearables; users must trust the system before adopting it. Iterating rapidly across hardware, software, and UI requires tight coordination and a strong design system.
What's next for Conversa
Multi-person conversation tracking: Detect and summarize group discussions.
Pre-conversation prompts: Notify users of past interactions before they meet someone again.
On-device AI summarization: Reduce dependency on cloud processing and improve privacy.
Relationship memory intelligence: Surface insights about recurring topics, action items, or trends in conversations.
Expanded personalization: Users can customize triggers, visual style, and how memories are surfaced.