Xtension

Xtension – Hands-Free Browser Extension with a Cloned Voice!

Inspiration

I was inspired by the idea of making the web completely hands-free. I wanted to create a browser extension that not only automates websites with voice commands but also responds intelligently in a cloned voice that feels personal. Most assistants sound generic, but by combining Gemini with ElevenLabs, I could build one that’s both smart and uniquely mine.

What it does

Xtension is a Chrome extension that:

Automates browsing with voice commands (scroll, click, navigate).
Listens for the keyword “Gemini” to activate AI responses.
Clones and uses a personal voice for replies via ElevenLabs TTS.
Lets me clone a new voice or paste an existing voice_id directly from a simple webpage.
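
The voice_id chosen on the webpage has to reach whatever process does the speaking. As a minimal sketch of one way that hand-off could work, the snippet below assumes the backend and the assistant share a small JSON file. The file layout and the helper names `save_voice_id` and `load_voice_id` are hypothetical and not taken from the project.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_voice_id(path: Path, voice_id: str) -> None:
    """Write the active voice_id atomically so a reader never sees a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then swap it into place.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump({"voice_id": voice_id}, handle)
    os.replace(tmp_name, path)


def load_voice_id(path: Path, default: Optional[str] = None) -> Optional[str]:
    """Return the stored voice_id, or `default` if the file is missing or unreadable."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return default
    if not isinstance(data, dict):
        return default
    return data.get("voice_id", default)


if __name__ == "__main__":
    # Hypothetical location; any path both processes can read would do.
    state_file = Path(tempfile.gettempdir()) / "xtension_demo" / "voice.json"
    save_voice_id(state_file, "demo-voice-id")
    print(load_voice_id(state_file))
```

Writing to a temporary file and swapping it in with `os.replace` means the assistant can poll the file while the server updates it without ever reading a half-written value.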

How I built it

Extension (HTML/JS): Captures voice, handles browser automation, and manages the cloned voice_id.
Flask server (Python): Provides backend support for uploading and cloning voices.
Assistant script (Python):
- Records microphone input.
- Transcribes with ElevenLabs STT (scribe_v1).
- Responds through Gemini when triggered.
- Speaks back using the active cloned voice.
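
The four assistant steps above can be sketched as a single cycle. This is an illustration under stated assumptions, not the project's actual script: the `AssistantStages` container and `run_once` function are hypothetical, each stage is passed in as a plain callable, and the real ElevenLabs and Gemini calls are replaced by stand-ins so the example runs without a microphone or API keys.

```python
from dataclasses import dataclass
from typing import Callable, Optional

WAKE_WORD = "gemini"  # the trigger keyword named in the write-up


@dataclass
class AssistantStages:
    """Hypothetical bundle of the four stages; each would wrap a real API call."""
    record: Callable[[], bytes]          # one microphone chunk as raw audio bytes
    transcribe: Callable[[bytes], str]   # e.g. ElevenLabs STT with scribe_v1
    respond: Callable[[str], str]        # e.g. a Gemini text response
    speak: Callable[[str, str], None]    # (text, voice_id) -> plays TTS audio


def run_once(stages: AssistantStages, voice_id: str) -> Optional[str]:
    """Run one record/transcribe/respond/speak cycle.

    Returns the spoken reply, or None when the wake word was not heard.
    """
    audio = stages.record()
    transcript = stages.transcribe(audio).strip()
    # Assumption: a simple substring check is enough to detect the trigger.
    if WAKE_WORD not in transcript.lower():
        return None  # not addressed to the assistant, so stay silent
    reply = stages.respond(transcript)
    stages.speak(reply, voice_id)
    return reply


if __name__ == "__main__":
    spoken = []
    demo = AssistantStages(
        record=lambda: b"fake-audio",
        transcribe=lambda audio: "Gemini, what is a browser extension?",
        respond=lambda prompt: "A small program that customizes the browser.",
        speak=lambda text, voice_id: spoken.append((voice_id, text)),
    )
    print(run_once(demo, voice_id="demo-voice-id"))
```

Passing the stages in as callables keeps the loop testable on its own, and swapping the active cloned voice only means calling `run_once` with a different `voice_id`.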

Challenges I ran into

Linking the Chrome extension with the Python assistant to share the voice_id.
Converting browser audio recordings (WebM) into formats the ElevenLabs APIs accept.
Debugging API rate limits and errors in real time during voice cloning.
Making the assistant respond only when “Gemini” is invoked while still supporting web automation commands.
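
The last challenge, keeping AI replies and automation commands apart, comes down to routing each transcript. The sketch below shows one possible approach, assuming the wake word must open the utterance and that automation commands follow a few fixed phrasings. The `route` function and the patterns are hypothetical; the extension's real command vocabulary may differ.

```python
import re
from typing import Tuple

# Wake word at the start of the utterance, followed by the prompt.
WAKE_PATTERN = re.compile(r"gemini\b[\s,.:!?]*(.*)", re.IGNORECASE | re.DOTALL)

# Hypothetical command phrasings for the three actions named in the write-up.
AUTOMATION_PATTERNS = {
    "scroll": re.compile(r"scroll (up|down)"),
    "click": re.compile(r"click (?:on )?(.+)"),
    "navigate": re.compile(r"(?:go to|open|navigate to) (.+)"),
}


def route(transcript: str) -> Tuple[str, str]:
    """Classify a transcript as ('ai', prompt), (command, argument), or ('ignore', '')."""
    text = transcript.strip()
    # The wake word takes priority: anything addressed to "Gemini" goes to the AI.
    wake = WAKE_PATTERN.match(text)
    if wake:
        return "ai", wake.group(1).strip()
    # Otherwise try the automation commands on a normalized copy.
    command = text.lower().rstrip(".!?")
    for name, pattern in AUTOMATION_PATTERNS.items():
        match = pattern.fullmatch(command)
        if match:
            return name, match.group(1)
    return "ignore", ""


if __name__ == "__main__":
    for phrase in ["Gemini, what's the weather?", "Scroll down.", "Go to example.com", "hmm"]:
        print(phrase, "->", route(phrase))
```

Checking the wake word before the command patterns means a question such as "Gemini, open source means what?" is answered by the AI rather than treated as a navigation command.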

Accomplishments that I’m proud of

Built a working hands-free extension that combines voice automation, AI answers, and cloned speech.
Successfully integrated browser-level JavaScript with backend Python services.
Made the assistant feel more human by giving it a customizable cloned voice.

What I learned

I learned how to:

Build and debug a full Chrome extension with persistent settings.
Work with real-time audio capture and APIs like ElevenLabs and Gemini.
Bridge a web interface and a Python process cleanly.
Think about accessibility: hands-free browsing and personalized voices make the web more usable.

What’s next for Xtension

Add support for continuous streaming transcription instead of fixed chunks.
Allow multiple voice profiles for different moods or contexts.
Expand automation features for more complex multi-step browsing tasks.
Package the project for release on the Chrome Web Store so anyone can use it.