Inspiration

Think of UXIGuide as the "GPS for Software." Just as a GPS doesn't merely show you a map but tracks your real-time position to give turn-by-turn voice instructions, UXIGuide transforms static web interfaces into interactive, guided experiences. By combining Gemini's vision understanding to "see" the UI with Live audio to "hear" user frustration, it moves beyond rigid, "click-next" tutorials. It becomes a proactive digital mentor that understands intent, respects privacy through real-time redaction, and navigates complex workflows alongside the user, making even the most sophisticated SaaS feel as simple as a conversation.

What it does

UXIGuide is an AI-powered UI Navigator that transforms static web onboarding into a live, interactive conversation. Instead of forcing users through rigid "click-next" tutorials, UXIGuide acts as a real-time co-pilot that sees the screen and hears the user’s intent to provide personalized, step-by-step guidance.

  • Multimodal Live Guidance: Users speak naturally to the agent to ask for help. The client-side script captures the current UI state and sends it to Gemini, which interprets the visual layout to provide verbal instructions and visual highlights on exactly where the user should interact.
  • Privacy-First Vision: To ensure total security, the script performs local redaction before any data leaves the browser. Sensitive information like passwords or PII is masked, ensuring the model only receives the context it needs to be helpful, never the data it shouldn't see.
  • Dynamic Action Mapping: Instead of following a hard-coded path, the agent uses DOM Mapping and visual reasoning to create "action plans" on the fly. If a user gets lost or a UI element moves, the agent re-evaluates the screen and adapts its instructions instantly.
  • Plug-and-Play Integration: Developers simply add a lightweight script tag to their index.html. This script handles the complex work of capturing screenshots, managing the Gemini Live API connection, and rendering the interactive FAB (Floating Action Button) and overlays.
  • Fluid Interaction & Interruption: Built for real-time engagement, the agent handles interruptions gracefully. If a user stops the agent mid-walkthrough to ask a follow-up, the agent pauses, re-scans the screen, and shifts its strategy to match the new context.
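The local redaction step above can be sketched as a simple pattern pass over the serialized UI snapshot. The real script runs in the browser, but the idea translates directly; the patterns and placeholder labels below are illustrative assumptions, not the shipped rule set (which would also blank `<input type="password">` fields outright).

```python
import re

# Hypothetical PII patterns and replacement labels -- illustrative only.
PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED-SSN]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[REDACTED-EMAIL]"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[REDACTED-CARD]"),
]

def redact(snapshot_text: str) -> str:
    """Mask sensitive values in a UI snapshot before it leaves the client."""
    for pattern, replacement in PII_PATTERNS:
        snapshot_text = pattern.sub(replacement, snapshot_text)
    return snapshot_text
```

Because the masking happens before transmission, the model only ever sees the placeholder labels, which still give it enough layout context to guide the user.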

How we built it

The Brain: Python & Google ADK

The core agent logic is built in Python using the Google Agent Development Kit (ADK).

  • Stateful Reasoning: The ADK manages the conversation flow, allowing the agent to remember the user’s ultimate goal (e.g., "Complete the checkout process") even if they ask side questions.
  • Multimodal Integration: The ADK orchestrates the inputs from the Gemini Live API, synthesizing the visual context from screenshots with the audio stream to generate an actionable "Next Step."
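The goal-tracking behavior described above can be illustrated with a minimal stand-in for the session state the ADK manages for us. The class and field names here are hypothetical; the point is that the original goal persists across side questions rather than being overwritten by each new utterance.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class GuideSession:
    """Hypothetical sketch of per-session state (the ADK manages this)."""
    goal: Optional[str] = None       # e.g. "Complete the checkout process"
    history: List[str] = field(default_factory=list)

    def handle_turn(self, utterance: str) -> str:
        self.history.append(utterance)
        if self.goal is None:
            # The first request becomes the standing goal.
            self.goal = utterance
            return f"Working on: {self.goal}"
        # Side questions are answered without discarding the goal.
        return f"Answering your question; still tracking: {self.goal}"
```

In the real agent, the synthesized "Next Step" is generated against this standing goal plus the latest screenshot, so a detour never derails the walkthrough.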

The Engine: FastAPI & WebSockets

We chose FastAPI to handle the high-concurrency requirements of real-time audio and image streaming.

  • Persistent Connection: The frontend script maintains a WebSocket connection to our Python backend. This allows for sub-second latency when the agent "sees" a UI change and needs to immediately move the visual highlight or provide a voice correction.
  • Asynchronous Flow: Using Python’s asyncio, the backend concurrently manages the upstream connection to Gemini while processing downstream UI updates to the user.
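The concurrent up/down flow can be sketched with plain `asyncio` queues: one task streams frames up toward Gemini while another relays the agent's responses back down, without either blocking the other. `mock_gemini` below is a stand-in for the Live API connection, not the real client.

```python
import asyncio

async def upstream(frames, to_gemini: asyncio.Queue):
    """Stream captured audio/screenshot frames up toward Gemini."""
    for frame in frames:
        await to_gemini.put(frame)
    await to_gemini.put(None)  # end-of-stream sentinel

async def downstream(from_gemini: asyncio.Queue, sent_to_user: list):
    """Relay highlights/voice cues back to the browser over the WebSocket."""
    while (update := await from_gemini.get()) is not None:
        sent_to_user.append(update)

async def mock_gemini(to_gemini: asyncio.Queue, from_gemini: asyncio.Queue):
    """Stand-in for the Live API: emits one guidance step per frame."""
    while (frame := await to_gemini.get()) is not None:
        await from_gemini.put(f"highlight:{frame}")
    await from_gemini.put(None)

async def session(frames):
    to_gemini, from_gemini, out = asyncio.Queue(), asyncio.Queue(), []
    # All three coroutines run concurrently in one event loop.
    await asyncio.gather(
        upstream(frames, to_gemini),
        mock_gemini(to_gemini, from_gemini),
        downstream(from_gemini, out),
    )
    return out
```

This is the same shape the backend uses inside each WebSocket handler: the event loop interleaves the streams, which is what keeps highlight corrections at sub-second latency.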

Data & Configuration: Firestore (Firebase)

Project settings, theme configurations, and persona definitions are stored in Firestore.

  • Dynamic Loading: When a connection is initiated, the backend pulls the specific project metadata from Firestore to "prime" the agent’s personality and visual theme, ensuring the experience feels native to the host website.
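Conceptually, "priming" amounts to turning the stored project document into a per-session agent config. The sketch below uses an in-memory dict in place of the real Firestore fetch (which would go through `google.cloud.firestore`), and the field names are assumptions.

```python
# Stand-in for the Firestore 'projects' collection; field names are
# illustrative, not the real schema.
PROJECTS = {
    "acme-dashboard": {
        "persona": "friendly onboarding coach",
        "theme": {"accent": "#4F46E5"},
        "allowed_origin": "https://app.acme.example",
    }
}

def prime_agent(project_id: str) -> dict:
    """Build the session config that primes the agent for one host site."""
    meta = PROJECTS.get(project_id)
    if meta is None:
        raise KeyError(f"unknown project: {project_id}")
    return {
        "system_instruction": f"You are a {meta['persona']} for this site.",
        "theme": meta["theme"],
    }
```

Loading this at connection time is what lets one backend serve many host websites, each with its own persona and look.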

Security & Fraud Prevention

To ensure the service is only used by authorized developers, we implemented a multi-layered security approach:

  • Origin Whitelisting: The WebSocket server performs a strict handshake check. If the connection request origin does not match the whitelisted domain stored in our database, the connection is immediately terminated.
  • Scalable Protection: This serves as our primary layer of defense against unauthorized script embedding, with architecture in place to support private API key headers for future enterprise-grade authentication.

What's next for UXIGuide

Our roadmap focuses on moving from reactive navigation to proactive, grounded intelligence, ensuring UXIGuide becomes the definitive "Knowledge Layer" for any web application.

  • RAG-Powered Grounding (Knowledge Base): We are implementing a Knowledge Base where developers can upload documentation, FAQs, and complex process definitions. By using Retrieval-Augmented Generation (RAG), the agent will move beyond visual reasoning to provide answers and instructions that are strictly grounded in the software's official documentation, eliminating hallucinations.
  • Learning Mode: We are developing a Learning Mode for developers. Instead of writing manual guides, a developer can simply "perform" a complex workflow while narrating. UXIGuide will capture the visual steps and audio context to automatically generate structured knowledge base articles and navigation paths.
  • Enhanced UI Tooling & Accessibility: To support all user environments, we are expanding our visual toolkit. This includes Smart Tooltips for high-visibility guidance and a Multimodal Command Center: a hybrid modal where users can toggle between voice, typing, and "Quick-Action" suggestions pre-defined by the developer for common hurdles.
  • Teacher Mode & Proactive Pedagogy: We aim to shift from a "helper" to a "tutor." In Teacher Mode, the agent won't just do the work; it will challenge the user to find the next step, providing proactive hints only when it detects hesitation. This utilizes Affective Dialog to sense user frustration and adjust the teaching pace accordingly.
  • Enterprise-Grade Security: We are committed to evolving our security layer from Domain Whitelisting to a full OAuth 2.0 and API Key infrastructure, alongside deeper automated redaction models to ensure UXIGuide meets the strictest global privacy standards (GDPR/SOC2).

Built With

  • Python & Google Agent Development Kit (ADK)
  • FastAPI & WebSockets
  • Firestore (Firebase)
  • Gemini Live API