Inspiration

Modern digital life is fragmented. Our memory is spread across Gmail threads, random Chrome searches, screenshots, and voice notes that are rarely revisited. Existing AI assistants are great at general knowledge but struggle with the specifics of your life. Vocalis was built on a simple premise: your digital footprint already contains your memory; it just needs a more natural, voice-first interface to unlock it.

What it does

Vocalis is a multimodal, voice-first personal memory agent that allows you to query your life in plain English.

  • "Did I get an email from Google Ventures this morning?"
  • "What was that idea I screenshotted during the design meeting?"
  • "Find the key takeaway from that audio memo I recorded yesterday."

By indexing Gmail, Chrome history, photos, and audio, Vocalis creates a unified memory layer. It doesn't just search; it reasons across your past to give you the exact answer you need through a high-fidelity voice interface.

How we built it

We focused on a "Multimodal-to-Voice" pipeline designed for speed and precision:

  • Voice-Native Interaction (ElevenLabs): The project is built on the ElevenLabs AI Voice Agents platform, which provides the primary conversational layer and handles low-latency voice synthesis and natural language interaction.
  • Multimodal Interpretation (Google Cloud Vertex AI): We used the latest Gemini 3 Flash model to bridge the gap between visual/auditory data and a searchable format. Gemini 3 Flash processes raw screenshots and audio recordings, converting them into concise, descriptive textual representations that capture the core intent and context of each "memory."
  • Intelligent Retrieval (RAG): The system leverages the native text-based RAG and vector storage capabilities of the ElevenLabs platform. This allows the agent to instantly retrieve the most relevant "memories" from the data processed by Gemini.
  • Refinement & Personality: Once context is retrieved, we use Gemini 3 Flash to refine the response, extracting only the most relevant facts and keeping the agent's personality classy, concise, and professional.
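
The retrieval and refinement steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's implementation: the `Memory` record, the bag-of-words `cosine` scoring, and the `build_refinement_prompt` helper are all hypothetical names, and the toy scoring stands in for the ElevenLabs platform's built-in RAG and vector storage. The descriptions are assumed to be the text that Gemini produced from each email, screenshot, or audio memo.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Memory:
    """Hypothetical record for one indexed item."""
    source: str       # e.g. "gmail", "chrome", "screenshot", "audio"
    description: str  # text produced by the interpretation step


def tokenize(text: str) -> Counter:
    # Lowercase bag-of-words; a toy stand-in for a real embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, memories: list[Memory], k: int = 2) -> list[Memory]:
    # Rank every memory against the query and keep the top k.
    q = tokenize(query)
    ranked = sorted(
        memories,
        key=lambda m: cosine(q, tokenize(m.description)),
        reverse=True,
    )
    return ranked[:k]


def build_refinement_prompt(query: str, hits: list[Memory]) -> str:
    # Assemble the text a refinement model would receive (assumed wording).
    context = "\n".join(f"- [{m.source}] {m.description}" for m in hits)
    return (
        "Answer the question in one concise, professional sentence, "
        "using only the memories below.\n"
        f"Question: {query}\n"
        f"Memories:\n{context}"
    )


if __name__ == "__main__":
    memories = [
        Memory("gmail", "Email from Google Ventures about a follow-up meeting"),
        Memory("screenshot", "Whiteboard sketch of an onboarding flow from the design meeting"),
        Memory("audio", "Voice memo: key takeaway is to ship the demo first"),
    ]
    question = "Did I get an email from Google Ventures this morning?"
    print(build_refinement_prompt(question, retrieve(question, memories, k=1)))
```

Because every modality is reduced to a text description before indexing, one text-only retrieval step can serve emails, screenshots, and audio alike.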

Challenges we ran into

  • Modality Mapping: Translating visual screenshots into text that is optimized for a RAG system required significant prompt engineering with Gemini.
  • Real-Time Latency: Syncing multimodal processing with a live voice agent is a race against the clock. Gemini 3 Flash on Vertex AI was critical here because it balances response speed with reasoning quality.
  • Signal vs. Noise: Personal history is cluttered. We used Gemini to filter out the "digital exhaust" and keep only the meaningful signal for the agent's memory.
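
To make the signal-versus-noise idea concrete, here is a small sketch of a cheap pre-filter that could run before any model-based filtering. It is an assumption for illustration only: the project describes using Gemini for this step, whereas this hypothetical `prefilter` function uses plain heuristics (a minimum word count and exact-duplicate removal after whitespace and case normalization).

```python
import re


def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivial variants compare equal.
    return re.sub(r"\s+", " ", text.strip().lower())


def prefilter(entries: list[str], min_words: int = 3) -> list[str]:
    """Drop very short entries and repeats; keep the first occurrence of each."""
    seen: set[str] = set()
    kept: list[str] = []
    for entry in entries:
        key = normalize(entry)
        if len(key.split()) < min_words:
            continue  # too short to carry meaning, e.g. "New Tab"
        if key in seen:
            continue  # already kept an equivalent entry
        seen.add(key)
        kept.append(entry)
    return kept


if __name__ == "__main__":
    history = [
        "New Tab",
        "Pricing page for ACME design tool",
        "pricing page for  ACME design tool ",
        "",
    ]
    print(prefilter(history))
```

A heuristic pass like this only removes obvious clutter; deciding which of the remaining entries are meaningful still needs a model or a human.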

Accomplishments that we're proud of

  • Unified Memory: Successfully unified text, images, and audio into a single conversational interface.
  • High-Fidelity Voice: Leveraged ElevenLabs to create a "chief of staff" experience that sounds human and intuitive.
  • Efficient Extraction: Used Gemini 3 Flash to distill large amounts of personal data into a single, to-the-point spoken sentence.

What we learned

  • Preprocessing is the Bridge: Multimodal AI is only as good as the translation between modalities. Gemini 3 Flash is a powerful "translator" for sensory data.
  • Voice is Contextual: A voice agent shouldn't just read data; it needs to understand the user's intent, which is why the refinement layer is so essential.
  • The Power of the Stack: Combining Gemini's reasoning with ElevenLabs' voice creates a "best-of-both-worlds" AI experience.

What's next for Vocalis

  • Proactive Recall: Having the agent surface relevant past memories before the user even asks.
  • Expanded Integrations: Adding Slack, Notion, and local file system support to the multimodal pipeline.
  • Privacy-First Encryption: Implementing end-to-end encryption for the textual representation of the digital footprint.

Demo credentials

  • Username: elevencloud
  • Password: elevencloud#1
