Inspiration
Digital interfaces often create barriers for the elderly or people who are not tech-savvy. Our vision was to build a "Human-First" system. We realized that by combining the powerful reasoning of Google Cloud with the emotional depth of ElevenLabs voices, we could create a Zero-UI experience. This project is an effort to democratize technology by making interaction as simple as having a conversation.
What it does
VocalCompute AI is a state-of-the-art conversational interface that eliminates the limitations of traditional text-based chatbots.
It processes user speech in real-time with high accuracy.
It leverages Google Gemini 2.5 Flash to generate instant, logical, and context-aware responses.
Using ElevenLabs Agents, it converts those responses into a natural human tone, including realistic breath pauses and inflections.
How we built it
We built this project on a modular architecture to ensure scalability:
LLM Orchestration: We selected Google Gemini 2.5 Flash as the core "brain" because of its massive context window and lightning-fast inference speed, which is perfect for low-latency requirements.
Speech Synthesis Layer: We integrated the ElevenLabs Conversational AI agent. We tuned the stability and similarity settings to ensure the voice sounds empathetic rather than robotic.
Frontend Integration: The frontend is a lightweight implementation using HTML5 and JavaScript. By embedding the ElevenLabs ConvAI widget, we ensured socket-based streaming to keep buffering to a minimum.
System Prompting: We engineered a custom instruction set that gives the agent a "Professional Personal Assistant" persona—efficient yet respectful.
Challenges we ran into
Latency Synchronization: The biggest technical hurdle was syncing Gemini's "thinking time" with ElevenLabs' "voice generation." We solved this by optimizing the Gemini 2.5 Flash model, which provides near-instantaneous responses.
Context Preservation: Ensuring the AI remains on track during long conversations required rigorous testing of the System Prompt to maintain a professional tone at all times.
Accomplishments that we're proud of
We successfully built a 100% cloud-powered agent that runs in any standard browser without heavy local setup.
Achieving real-time interaction speeds using Gemini 2.5 Flash, closing the gap often found in AI voice applications.
What we learned
We gained deep insights into the scalability of the Google Cloud AI ecosystem.
We mastered the principles of Conversational Design using ElevenLabs' synthesis APIs.
We learned that even with minimal code (HTML/JS), one can build enterprise-level tools if the underlying APIs are powerful.
What's next for VocalCompute AI
Multilingual Expansion: Expanding support to over 29+ international languages.
Knowledge Base Integration: Allowing the agent to answer questions based on specific enterprise databases or uploaded PDFs.
Sentiment Detection: Enabling the agent to detect user frustration or happiness through voice pitch and adjusting its response tone accordingly.
Built With
- 2.5
- ai
- conversational
- css3
- elevenlabs
- gemini
- html5
- javascript
- websocket
Log in or sign up for Devpost to join the conversation.