Jarvis Voice agent

Slide 6
Slide 5
Slide 4
Slide 2
Slide 1
Slide 3

The Gemini Voice Agent operates as a Voice-to-Voice (V2V) assistant that allows you to speak commands and receive synthesized audio replies, even executing system actions like launching applications. The process begins in the browser's voice_agent.html frontend, which captures your speech using the microphone and sends the transcribed text to the cli_voice_agent.py Python backend. The backend uses the Gemini API to determine the appropriate response, notably utilizing Function Calling to execute tools, such as the open_system_app function, if you request an action like "Open Calculator." After the action is executed or a standard response is formulated, the final text is converted into audio using the Gemini Text-to-Speech (TTS) model, and this audio data is streamed back to the browser for immediate playback.

Built With

Updates

aMMAAR Shareef started this project — Nov 06, 2025 09:16 PM EST

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.