The Gemini Voice Agent operates as a Voice-to-Voice (V2V) assistant that allows you to speak commands and receive synthesized audio replies, even executing system actions like launching applications. The process begins in the browser's voice_agent.html frontend, which captures your speech using the microphone and sends the transcribed text to the cli_voice_agent.py Python backend. The backend uses the Gemini API to determine the appropriate response, notably utilizing Function Calling to execute tools, such as the open_system_app function, if you request an action like "Open Calculator." After the action is executed or a standard response is formulated, the final text is converted into audio using the Gemini Text-to-Speech (TTS) model, and this audio data is streamed back to the browser for immediate playback.

Built With

Share this project:

Updates