EchoMind: Voice‑Driven AI Assistant
Inspiration
The inspiration for EchoMind came from the desire to make AI conversations feel more natural and human‑like. While text chatbots are powerful, they often lack the warmth of spoken interaction. By combining Gemini's reasoning capabilities with ElevenLabs' lifelike voice synthesis, we wanted to create an assistant that not only responds intelligently but also speaks back, bridging the gap between human conversation and AI.
What it does
EchoMind is a pastel‑themed, voice‑enabled AI assistant that:
- Accepts user input via text or speech recognition 🎤
- Generates intelligent replies using Google’s Gemini model
- Converts those replies into natural speech using ElevenLabs TTS
- Displays messages in a clean, scroll‑free React UI with gradient backgrounds
- Provides a friendly, conversational experience that feels both professional and approachable
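The speech‑input side of a flow like this is usually wired through the browser's Web Speech API. As a rough sketch (the helper names and wiring here are illustrative, not EchoMind's actual code), a recognition result event can be reduced to a transcript string and handed to the chat input:

```javascript
// Sketch of browser speech input via the Web Speech API.
// extractTranscript and startRecognition are illustrative names,
// not EchoMind's actual implementation.

// Pure helper: concatenate the final transcripts from a
// SpeechRecognition result event.
function extractTranscript(event) {
  let text = "";
  for (let i = event.resultIndex; i < event.results.length; i++) {
    if (event.results[i].isFinal) {
      text += event.results[i][0].transcript;
    }
  }
  return text.trim();
}

// Browser-only wiring, guarded so the module also loads outside a browser.
function startRecognition(onText) {
  const SR = globalThis.SpeechRecognition || globalThis.webkitSpeechRecognition;
  if (!SR) return null; // unsupported browser
  const recognition = new SR();
  recognition.lang = "en-US";
  recognition.interimResults = false; // only deliver finalized phrases
  recognition.onresult = (event) => onText(extractTranscript(event));
  recognition.start();
  return recognition;
}
```

In the UI, `onText` would simply set the React input state, so dictated and typed messages go through the same send path.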
How we built it
- Frontend: Built with React + Tailwind CSS for responsive design and pastel gradients. State management ensures smooth transitions (e.g., input bar moving after the first message).
- Backend: Node.js + Express server that integrates:
- Google Cloud Vertex AI (Gemini model) for text generation
- ElevenLabs API for text‑to‑speech conversion
- Integration: Axios handles communication between the frontend and backend. Audio blobs are streamed back and played in the browser via the HTML Audio API.
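On the backend, the ElevenLabs step boils down to one authenticated POST that returns MPEG audio. A minimal sketch of how that request could be assembled, assuming the standard ElevenLabs REST endpoint (the helper name, env variable names, and model ID are placeholders, not EchoMind's exact code):

```javascript
// Sketch of the ElevenLabs text-to-speech request the backend sends.
// buildTtsRequest is an illustrative helper; the env var names and
// model_id are placeholders that may differ per account.
function buildTtsRequest(text, voiceId, apiKey) {
  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}`,
    options: {
      method: "POST",
      headers: {
        "xi-api-key": apiKey, // keep this server-side in an env var, never in the client
        "Content-Type": "application/json",
        Accept: "audio/mpeg", // request MPEG so the browser can play it directly
      },
      body: JSON.stringify({ text, model_id: "eleven_multilingual_v2" }),
    },
  };
}

// Inside the Express route, the Gemini reply text would be fed through
// this request and the bytes returned with res.set("Content-Type", "audio/mpeg").
async function synthesize(text) {
  const { url, options } = buildTtsRequest(
    text,
    process.env.ELEVENLABS_VOICE_ID,
    process.env.ELEVENLABS_API_KEY
  );
  const resp = await fetch(url, options);
  if (!resp.ok) throw new Error(`TTS request failed: ${resp.status}`);
  return Buffer.from(await resp.arrayBuffer());
}
```

Keeping the API key on the server is what makes the Axios round trip necessary: the browser only ever sees the finished audio blob.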
Challenges we ran into
- Audio playback issues: Browsers block autoplay unless playback is triggered by a user gesture. Fixing this required awaiting audio.play() and ensuring proper MIME types (audio/mpeg).
- Credential management: Handling the Google Cloud service account JSON securely and ensuring environment variables were correctly set.
- UI polish: Designing a responsive, scroll‑free layout with pastel gradients while keeping it professional.
- Debugging async flows: Ensuring that text replies and audio playback were synchronized without race conditions.
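The autoplay fix above amounts to two things: only calling play() from a user‑initiated call stack, and handling the rejected promise it returns when the browser still blocks it. A minimal sketch (safePlay is an illustrative name, not EchoMind's exact code):

```javascript
// Browsers reject play() with NotAllowedError when no user gesture is
// active. Awaiting the promise lets us detect that and degrade gracefully.
// safePlay is an illustrative helper, not EchoMind's exact code.
async function safePlay(audio) {
  try {
    await audio.play(); // play() returns a promise in modern browsers
    return true;
  } catch (err) {
    // Autoplay blocked: show a "tap to hear the reply" button instead.
    console.warn("Playback blocked:", err.name);
    return false;
  }
}
```

Calling this from inside a click or submit handler (e.g. `sendButton.onclick = () => safePlay(new Audio(blobUrl))`) keeps the gesture context alive, which is what satisfies the autoplay policy.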
Accomplishments that we're proud of
- Successfully integrated two advanced APIs (Gemini + ElevenLabs) into a seamless workflow.
- Built a polished, pastel‑themed UI that feels welcoming and professional.
- Overcame tricky browser audio restrictions to deliver consistent voice playback.
- Learned to debug and refactor backend audio responses for cross‑browser compatibility.
What we learned
- The importance of MIME types and headers in audio streaming.
- How to manage environment variables securely in full‑stack projects.
- Practical experience with speech recognition and text‑to‑speech APIs.
- Reinforced knowledge of async programming in JavaScript and React state management.
- That even small UI details (like input bar transitions) greatly affect user experience.
What's next for EchoMind
- Queued audio playback: Ensuring multiple replies play sequentially without overlap.
- Multilingual support: Expanding beyond English to support Telugu and other languages.
- Personalization: Allowing users to choose different voices, tones, or speaking speeds.
- Deployment: Hosting EchoMind on Vercel/Render with secure backend integration for wider accessibility.
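The queued‑playback idea can be sketched as a promise chain: each clip's play function is appended to the tail of the chain, so clips start strictly in arrival order and never overlap. (AudioQueue is an illustrative name for a feature that is still planned, not shipped.)

```javascript
// Sketch of sequential audio playback for the planned queue feature.
// Each enqueued function starts only after the previous one settles,
// so overlapping replies are impossible. Illustrative, not shipped code.
class AudioQueue {
  constructor() {
    this.tail = Promise.resolve();
  }

  // playFn: () => Promise<void>, e.g. () => someAudioElement.play()
  enqueue(playFn) {
    // Chain onto the tail; swallow errors so one failed clip
    // doesn't block everything queued after it.
    this.tail = this.tail.then(playFn).catch(() => {});
    return this.tail;
  }
}
```

Each incoming reply would call `queue.enqueue(() => audio.play())`; because the chain serializes the promises, no locking or manual state tracking is needed.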
Built With
- axios
- elevenlabs-api
- express.js
- google-auth-library
- google-cloud-vertex-ai
- javascript
- jsx
- node.js
- react
- tailwindcss
- web-speech-api