AI Muse Creator: Voice-Powered AI Art with Gemini Nano

Inspiration

The spark for AI Muse Creator came from a late-night brainstorming session where I sketched ideas verbally but struggled to visualize them without clunky tools. As a developer fascinated by AI's creative potential, I wondered: What if I could speak a concept—a mystical forest or cyberpunk skyline—and have AI refine it into text, craft a vivid prompt, generate art, and even edit it on the fly, all within Chrome? Inspired by the Google Chrome Built-in AI Challenge 2025's call to reimagine the web with on-device Gemini Nano, I aimed to make multimodal creation accessible and private, eliminating typing barriers for artists, writers, and non-native English speakers. This project embodies the challenge's ethos: leveraging Chrome's AI APIs for seamless, offline innovation.

What it does

AI Muse Creator is a Chrome extension that turns voice input into AI-generated art through a popup interface. Speak an idea (e.g., "a serene mountain lake at dawn"), and the extension transcribes it, detects the language and mood, refines it into engaging text, crafts an advanced image prompt, and generates a high-quality image using Stability AI. Users can voice-edit the image (e.g., "make it brighter") via Canvas manipulation, download the result, or share it via a QR code or URL stored in MongoDB. Core logic runs on-device with Gemini Nano, so mood analysis and prompt generation never touch the cloud; only image generation and sharing go through the backend.

How we built it

I built AI Muse Creator as a Manifest V3 Chrome extension with a Node.js backend, iterating over two weeks to integrate Chrome's AI APIs seamlessly.

Frontend (Extension): Started with manifest.json for the V3 structure, permissions (storage, activeTab), and a strict CSP. The popup (popup.html + popup.js) uses the Web Speech API for voice recognition and TTS. The transcript auto-fills a textbox and then passes through a chain of Chrome APIs: Translator for language detection and translation, the Prompt API (Gemini Nano) for mood analysis and advanced prompt generation (structured SDXL templates with quality boosters like "8k UHD, cinematic lighting"), and the Writer API for text refinement.
Backend Proxy: An Express.js server (server.js) handles the cloud calls: Stability AI for image generation (with negative prompts for reliability) and MongoDB for share persistence. Axios calls the external APIs, and dotenv loads secrets from .env. Routes include /generate-image and /analyze-mood; the latter is a rule-based fallback that keeps mood analysis working when the on-device model is unavailable.
Integration: The extension calls the backend via BACKEND_URL. Voice edits use the Canvas API for pixel adjustments. QR sharing uses qrcode.js. The backend is deployed to Vercel for live demos.
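
To make the "structured SDXL template" idea concrete, here is a minimal sketch of how such a prompt could be assembled. It is an illustration, not the extension's actual code: the function name buildImagePrompt, its field names, and the negative-prompt wording are all hypothetical. Only the quality boosters are taken from this write-up.

```javascript
// Quality boosters mentioned in this write-up.
const QUALITY_BOOSTERS = ["8k UHD", "cinematic lighting", "masterpiece", "sharp focus"];

// Hypothetical negative prompt; the project's real wording is not given here.
const NEGATIVE_PROMPT = "blurry, low quality, distorted";

// Assemble a structured prompt: subject, setting, lighting, mood, then boosters.
// Missing or empty parts are skipped; a subject is required.
function buildImagePrompt({ subject, setting, lighting, mood } = {}) {
  if (typeof subject !== "string" || subject.trim() === "") {
    throw new Error("subject is required");
  }
  const parts = [subject, setting, lighting, mood && `${mood} mood`]
    .filter((part) => typeof part === "string" && part.trim() !== "")
    .map((part) => part.trim());
  return {
    prompt: [...parts, ...QUALITY_BOOSTERS].join(", "),
    negativePrompt: NEGATIVE_PROMPT,
  };
}

const example = buildImagePrompt({
  subject: "a serene mountain lake",
  setting: "at dawn",
  mood: "calm",
});
console.log(example.prompt);
```

Keeping the template as a pure function makes it easy to iterate on wording without touching the voice or network code.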

Tools: VS Code for editing, GitHub for repo, Vercel for hosting. Total ~500 LOC, focused on on-device efficiency.

Challenges we ran into

Several hurdles tested my resolve:

API Compatibility: Gemini Nano required enabling an experimental flag (chrome://flags/#prompt-api-for-gemini-nano), and my initial ONNX-based language detection in the backend failed with model 404 errors, so I switched to Chrome's Translator API for reliability.
Async Flows in Extensions: Using await inside event handlers that were not declared async kept throwing SyntaxErrors; I resolved this with .then chaining in the extension and ES modules in the backend.
Prompt & Image Quality: Basic transcripts produced blurry outputs; I iterated on the SDXL templates (adding "masterpiece, sharp focus") and the negative prompts, boosting consistency to 90%.
Secrets & GitHub Protection: Hardcoded API keys triggered push blocks; I learned to move them into .env and scrub the history with BFG Repo-Cleaner.
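
The async issue is worth a small illustration. In JavaScript, await is only legal inside an async function (or at the top level of a module), so a plain handler that uses it fails to parse. The sketch below shows the .then approach next to the alternative of marking the handler async. The names refineTranscript, handleWithThen, and handleWithAsync are hypothetical stand-ins, not code from the extension.

```javascript
// Hypothetical stand-in for an asynchronous step such as text refinement.
function refineTranscript(text) {
  return Promise.resolve(text.trim());
}

// Option 1: a plain (non-async) handler. `await` is not allowed here,
// so the result is consumed with .then and failures with .catch.
function handleWithThen(text, render) {
  return refineTranscript(text)
    .then((refined) => {
      render(refined);
      return refined;
    })
    .catch((err) => {
      render(`Error: ${err.message}`);
      return null;
    });
}

// Option 2: declare the handler itself async, which makes `await` legal.
async function handleWithAsync(text, render) {
  try {
    const refined = await refineTranscript(text);
    render(refined);
    return refined;
  } catch (err) {
    render(`Error: ${err.message}`);
    return null;
  }
}

// In a popup this might be wired up as (element names are assumptions):
// button.addEventListener("click", () => handleWithThen(input.value, show));
```

Both versions behave the same; the choice is mostly about readability once the chain grows beyond a couple of steps.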

These pushed me to prioritize fallbacks and security, strengthening the MVP.
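
On the security side, a minimal sketch of reading a key from the environment instead of hardcoding it might look like this. The helper requireEnv and the variable name STABILITY_API_KEY are hypothetical; in the backend, dotenv would be expected to populate process.env from .env before this runs.

```javascript
// Read a required secret from the environment and fail fast if it is missing.
// `env` defaults to process.env; passing an object makes the helper easy to test.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (typeof value !== "string" || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Example with an explicit object (the key name is an assumption):
const apiKey = requireEnv("STABILITY_API_KEY", { STABILITY_API_KEY: "sk-example" });
console.log(apiKey.length > 0);
```

Failing at startup makes a missing key obvious immediately, rather than surfacing later as a confusing API error.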

Accomplishments that we're proud of

On-Device Multimodal Pipeline: Integrated three Chrome AI APIs (Prompt, Translator, Writer) into a voice-to-prompt flow that runs on-device, with only image generation and sharing going through the cloud proxy. Gemini Nano handles 80% of the logic, cutting latency to <2s.
Voice-Driven Edits: Canvas-based voice edits (e.g., "brighter" → pixel math) add interactivity, making the image feel like a "living canvas."
Global Accessibility: Supports 100+ languages with auto-translation, so non-English users get the full flow (e.g., Spanish voice → English prompt generation → native-language TTS).
Reliable Sharing: QR/URL with Mongo persistence—tested end-to-end, including Vercel deployment for live demos.
Challenge Alignment: Built during the contest, showcasing hybrid AI (on-device + cloud proxy) for privacy-focused creativity.
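
The "brighter → pixel math" step can be sketched as follows. This is a simplified illustration under stated assumptions: the command table, the offset of 30, and the function names are hypothetical, and the real extension may adjust pixels differently. The canvas calls in the trailing comments (getImageData, putImageData) are the standard Canvas 2D API.

```javascript
// Hypothetical mapping from spoken edit commands to a brightness offset.
const BRIGHTNESS_COMMANDS = [
  { pattern: /\bbrighter\b/i, delta: 30 },
  { pattern: /\bdarker\b/i, delta: -30 },
];

// Return the brightness offset for a transcript, or 0 if no command matches.
function parseBrightnessCommand(transcript) {
  const match = BRIGHTNESS_COMMANDS.find(({ pattern }) => pattern.test(transcript));
  return match ? match.delta : 0;
}

// Add `delta` to the R, G, and B channels of RGBA pixel data in place.
// Alpha (every fourth byte) is left untouched. A Uint8ClampedArray clamps
// each result to the 0..255 range automatically.
function adjustBrightness(pixels, delta) {
  for (let i = 0; i < pixels.length; i += 4) {
    pixels[i] += delta;
    pixels[i + 1] += delta;
    pixels[i + 2] += delta;
  }
  return pixels;
}

// In the browser, with a 2D context `ctx` for `canvas`:
// const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height);
// adjustBrightness(imageData.data, parseBrightnessCommand("make it brighter"));
// ctx.putImageData(imageData, 0, 0);
```

Separating command parsing from the pixel loop keeps each piece testable without a canvas.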

Proudest: From sketch to submission in weeks—it's functional, fun, and forward-thinking.

What we learned

This project was a masterclass in Chrome's AI ecosystem:

API Chaining: Gemini Nano excels for lightweight tasks (mood/prompts) but needs structured outputs (JSON) for parsing—fallbacks like rule-based mood ensure robustness.
Extension Constraints: V3's CSP and async limits demand creative workarounds (.then chaining for handlers, CSP hashes for third-party libraries).
Prompt Engineering: SDXL thrives on detailed, structured prompts (subject/setting/lighting/quality)—simple templates fail; boosters like "8k UHD" transform results.
Security Best Practices: GitHub's Push Protection taught me to never hardcode keys—.env + history scrubbing (BFG) is essential.
User-Centric Design: Voice UX reveals nuances (permission prompts, accent detection)—iterating with tests improved inclusivity.
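
The structured-output lesson can be sketched like this: ask the model for JSON, parse it defensively, and fall back to rules whenever anything goes wrong. Treat the details as assumptions. The keyword table, prompt wording, and function names are hypothetical, and the Prompt API calls (LanguageModel.availability(), create(), prompt(), destroy()) follow the shape documented for recent Chrome versions, which has changed across releases.

```javascript
// Hypothetical keyword table for the rule-based fallback.
const MOOD_KEYWORDS = {
  calm: ["serene", "peaceful", "quiet", "dawn"],
  dark: ["cyberpunk", "storm", "shadow", "night"],
  mystical: ["mystical", "magic", "enchanted"],
};

// Rule-based fallback: the first mood whose keyword appears in the text.
function ruleBasedMood(text) {
  const lower = text.toLowerCase();
  for (const [mood, words] of Object.entries(MOOD_KEYWORDS)) {
    if (words.some((word) => lower.includes(word))) return mood;
  }
  return "neutral";
}

// Extract {"mood": "..."} from a model reply; return null if it cannot be parsed.
function parseMoodJson(reply) {
  const match = /\{[\s\S]*\}/.exec(reply); // tolerate extra text around the JSON
  if (!match) return null;
  try {
    const parsed = JSON.parse(match[0]);
    const mood = typeof parsed.mood === "string" ? parsed.mood.trim().toLowerCase() : "";
    return mood === "" ? null : mood;
  } catch {
    return null;
  }
}

// Ask the on-device model for a mood, falling back to rules on any failure.
// `languageModel` is assumed to be Chrome's `LanguageModel` global where present.
async function analyzeMood(text, languageModel = globalThis.LanguageModel) {
  try {
    if (!languageModel || (await languageModel.availability()) !== "available") {
      return ruleBasedMood(text);
    }
    const session = await languageModel.create();
    try {
      const reply = await session.prompt(
        `Reply with JSON only, like {"mood": "calm"}. Text: ${text}`
      );
      return parseMoodJson(reply) ?? ruleBasedMood(text);
    } finally {
      session.destroy(); // release the on-device session
    }
  } catch {
    return ruleBasedMood(text);
  }
}
```

Passing the model in as a parameter means the same function runs unchanged where the API is missing, which is also how the fallback path can be exercised outside Chrome.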

Overall, I learned that Chrome's APIs enable "web-native" AI, but a hybrid design (on-device + cloud) gives the best balance of speed and power.

What's next for AI Muse Creator: Voice-Powered AI Art with Gemini Nano

Next, I plan to evolve the project toward full storytelling: voice-to-video (integrating RunwayML for 5-10s clips built from scene breakdowns via the Prompt API) and collaborative edits (WebSockets for shared canvases). I also want to add AR previews (WebXR for 3D image overlays) and more languages with fine-tuned Nano models. Open-source expansions would include community plugins for custom prompts or voice styles. Ultimately, the goal is to publish to the Chrome Web Store for global reach, turning everyday browsers into creative studios. Fork on GitHub and contribute!

Built With

chrome
express.js
github
javascript
manifest
mongodb
node.js
openai
qrcode.js
stability

Updates

ELAVARASI SIVASAMY started this project — Oct 31, 2025 11:04 AM EDT
