MANAS is a voice-first personal AI assistant that lets users interact entirely through speech: talk naturally, get an intelligent response, and hear it back in a high-quality human voice.

This project was built specifically to combine:

  • Google Cloud AI for real-time speech understanding and intelligence
  • ElevenLabs for natural, human-like voice output and personality

In practice, MANAS uses a low-latency pipeline:

  1. User speaks in the web app
  2. Google Cloud Speech-to-Text transcribes the audio (WEBM_OPUS) with streaming recognition
  3. Gemini handles intent classification + reasoning and generates the response
  4. ElevenLabs Text-to-Speech speaks the response back in a selectable voice (MP3), so MANAS feels conversational and human
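The four steps above can be sketched as a single turn function. Everything here is illustrative: `transcribe`, `respond`, and `synthesize` are hypothetical stand-ins for the real Google Cloud Speech-to-Text, Gemini, and ElevenLabs calls.

```python
# Minimal sketch of one MANAS voice turn. The three stages are stubbed;
# in the real app they call Google Cloud STT, Gemini, and ElevenLabs.

def transcribe(audio: bytes) -> str:
    """Stand-in for streaming STT on WEBM_OPUS audio."""
    return "add milk to my shopping list"

def respond(transcript: str) -> str:
    """Stand-in for Gemini intent classification + response generation."""
    return f"Done. I added that to your tasks: {transcript!r}."

def synthesize(text: str) -> bytes:
    """Stand-in for ElevenLabs TTS returning MP3 bytes."""
    return text.encode("utf-8")  # placeholder for real audio frames

def voice_turn(audio: bytes) -> bytes:
    """One full listen -> understand -> act -> speak turn."""
    reply = respond(transcribe(audio))
    return synthesize(reply)
```

The real pipeline streams where it can, but the shape of the loop is the same: audio in, audio out.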

What inspired us

Many “assistants” are still text-first: they require typing, clicking, and context switching. We wanted to prove that a voice-native experience can do real work—fast—by making the whole loop (listen → understand → act → speak) feel like a natural conversation.

What we built (voice-native features)

  • Voice-driven UX: microphone-first interaction; the UI is built around speaking and listening.
  • Intent-based intelligence: Gemini classifies what the user wants (tasks, calendar, email, news, learning, etc.) and routes to the right capability.
  • Human voice output: ElevenLabs TTS produces natural speech with configurable voices.
  • Action + memory (beyond chatting):
    • task management (Firestore)
    • calendar operations (Google Calendar via OAuth)
    • email workflows (Gmail via OAuth)
    • news briefings (News API + optional Gemini summarization)
    • learning mode with citations (You.com search + Gemini)
    • long-term memory (Mem0 + Qdrant-backed vector search)
    • document/image analysis via file upload (Gemini grounded on provided files)
    • Fitbit health data integration
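The long-term memory feature boils down to embedding text and retrieving the nearest stored memories. A toy sketch of the idea (this is not the Mem0 or Qdrant API; the embeddings and memory texts below are made up):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy "memory store" of (text, embedding) pairs. Real embeddings come
# from an embedding model and live in a Qdrant collection.
MEMORIES = [
    ("User's dog is named Biscuit", [0.9, 0.1, 0.0]),
    ("User prefers morning meetings", [0.1, 0.9, 0.2]),
]

def recall(query_vec, k=1):
    """Return the k memories most similar to the query embedding."""
    ranked = sorted(MEMORIES, key=lambda m: cosine(query_vec, m[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

Qdrant performs this nearest-neighbor search at scale; the sketch only shows why a vector store makes "remember things about the user" possible.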

How we built it (Google Cloud + ElevenLabs integration)

Google Cloud: speech understanding + trusted infrastructure

  • Google Cloud Speech-to-Text is the entry point for all voice interactions.
    • We optimized for web audio with WEBM_OPUS encoding, automatic punctuation, and streaming recognition.
    • The goal is fast, accurate transcripts that can power reliable intent detection.
  • Gemini provides the “brain”:
    • Intent classification: quick, structured JSON output so routing is predictable.
    • Conversational responses: short, spoken-friendly replies by default.
    • Document grounding: when a user uploads files, the system forces a focused analysis mode and answers based on the file content.
  • Firestore / Firebase provides secure persistence:
    • user-scoped task storage
    • user profile preferences (like preferred voice)
    • OAuth credential storage for Google Calendar + Gmail
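The "structured JSON output" routing can be sketched as follows. The intent names and handler table are hypothetical; the point is that the model's reply is parsed as JSON and dispatched deterministically, with a conversational fallback when parsing fails.

```python
import json

# Hypothetical intent names; the real set covers tasks, calendar,
# email, news, learning, and memory.
HANDLERS = {
    "create_task": lambda p: f"Task created: {p.get('title', '')}",
    "get_news":    lambda p: "Here is your news briefing.",
}

def route(gemini_reply: str) -> str:
    """Parse the model's JSON intent and dispatch to a handler.

    Falls back to plain conversation if the reply is not valid JSON
    or names an unknown intent, so a malformed model output never
    breaks the voice loop.
    """
    try:
        intent = json.loads(gemini_reply)
        handler = HANDLERS[intent["intent"]]
    except (json.JSONDecodeError, KeyError, TypeError):
        return "Let's just chat."  # conversational fallback
    return handler(intent.get("params", {}))
```

Keeping routing this dumb is deliberate: the model decides *what* the user wants, but plain code decides *which* capability runs.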

ElevenLabs: voice + personality

  • ElevenLabs Text-to-Speech turns every assistant response into natural audio output (MP3).
  • Users can choose from multiple voice IDs, enabling different personalities and tones.
  • We also include a streaming-ready pathway to reduce perceived latency for longer responses.
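The streaming-ready pathway amounts to handing audio chunks to the player as they arrive instead of waiting for the full MP3. A sketch with a fake chunk generator (the real source would be the ElevenLabs streaming response):

```python
from typing import Iterable, Iterator, List

def fake_tts_stream(text: str, chunk_size: int = 8) -> Iterator[bytes]:
    """Stand-in for a streaming TTS response yielding audio chunks."""
    data = text.encode("utf-8")  # placeholder for MP3 frames
    for i in range(0, len(data), chunk_size):
        yield data[i:i + chunk_size]

def play_streaming(chunks: Iterable[bytes]) -> List[bytes]:
    """Consume chunks as they arrive; playback can begin after the
    first chunk rather than after full synthesis completes."""
    played = []
    for chunk in chunks:
        played.append(chunk)  # real code would feed an audio sink here
    return played
```

For long responses this cuts perceived latency to roughly the time-to-first-chunk, which is why the pathway exists even though short replies are fine as a single MP3.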

Design (UX)

MANAS is designed as a voice-first interface, not a chatbot with a microphone icon.

  • Immediate feedback loop: the UI shows listening/thinking/speaking states so users always know what’s happening.
  • Spoken-first responses: replies are kept short and clear so they work when played aloud.
  • Structured cards: when the assistant performs an action or returns structured data (news, citations, tasks), the UI presents it cleanly while the voice summarizes.
  • Analysis mode: users can drop a document/image; the UI confirms “analysis mode active” and the assistant focuses on that content.
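The listening/thinking/speaking feedback loop is effectively a small state machine. A hypothetical sketch of the transitions the UI exposes (the state and event names are illustrative):

```python
# UI feedback states and the events that move between them.
TRANSITIONS = {
    ("idle", "mic_pressed"):    "listening",
    ("listening", "speech_end"): "thinking",
    ("thinking", "reply_ready"): "speaking",
    ("speaking", "audio_done"):  "idle",
}

def next_state(state: str, event: str) -> str:
    """Advance the UI state; unexpected events leave the state unchanged,
    so a stray event can never put the interface in an unknown state."""
    return TRANSITIONS.get((state, event), state)
```

Making the states explicit is what lets the UI always show the user what is happening, even when a backend call is slow.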

Challenges we faced

  • End-to-end latency across STT → Gemini → tools → ElevenLabs TTS:
    • Voice apps feel “broken” if they pause too long, so we used fast classification, short spoken responses, and streaming-friendly paths.
  • Reliability of intent routing:
    • We needed deterministic routing (JSON intent) while still sounding natural in conversation.
  • Voice UX edge cases:
    • Handling short/empty audio, microphone permission issues, and graceful fallbacks without breaking flow.
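The edge-case handling can be sketched as a guard in front of the pipeline. The threshold and fallback messages here are illustrative, not the production values:

```python
from typing import Optional

MIN_AUDIO_BYTES = 1024  # illustrative "too short to transcribe" threshold

def guard_transcript(audio: bytes, transcript: str) -> Optional[str]:
    """Return a spoken fallback message, or None when the turn is usable.

    Catching these cases before routing keeps the conversation flowing
    instead of surfacing an error or a silent failure.
    """
    if len(audio) < MIN_AUDIO_BYTES:
        return "I didn't catch that. Could you say it again?"
    if not transcript.strip():
        return "I heard audio but couldn't make out any words."
    return None
```

Because the fallback is itself spoken through TTS, a failed turn still feels like conversation rather than a broken app.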

What we learned

  • Voice is the product: quality isn’t only model accuracy—latency, feedback states, and the “feel” of the voice matter just as much.
  • Structured orchestration is key: pairing Gemini reasoning with explicit tool routing makes the system useful beyond Q&A.
  • Personality must be intentional: ElevenLabs provides the human voice, but the system must constrain responses to be spoken-friendly.

Potential impact

MANAS can meaningfully help communities where typing and screens are a barrier:

  • Accessibility: voice-first workflows for users with visual or motor impairments.
  • Field work + mobility: hands-free task management, scheduling, and information retrieval.
  • Productivity: reduce friction for managing daily tasks, email triage, and quick planning.

Why the idea is unique

MANAS isn’t “STT + chatbot + TTS.” It’s a voice-native system built around:

  • Google Cloud Speech-to-Text for real-time understanding
  • Gemini for intent routing + reasoning
  • ElevenLabs for natural, human-like voice output
  • Tools + memory so the assistant can act, not just talk
