Devpost Submission — Point & Say


Inspiration

Every frontend developer knows the pain: you see a button that needs to be yellow instead of green, and it turns into a 5-minute task — find the file, locate the component, edit the JSX, save, verify. For a 2-second visual tweak, that workflow feels absurdly slow.

I wanted to build something that felt like magic — point at an element on a live web app, speak what you want changed, and watch it happen instantly. No code editor. No terminal. Just your voice and a pointer.

When I saw the Amazon Nova AI Hackathon, I knew Nova's multimodal capabilities — vision, language, and speech — could make this real. The idea of combining Nova Premier for code reasoning, Nova 2 Lite for visual understanding, and Nova Sonic for voice interaction into a single seamless pipeline was too compelling to pass up.


What it does

Point & Say is an AI-powered UI automation tool that lets you modify live web interfaces using voice commands.

The workflow is dead simple:

  1. Point at any UI element in a live React app
  2. Say what you want changed — "make this button yellow", "change the title to Welcome", "remove the second arrow"
  3. Watch the AI identify the component, generate the code change, apply it via hot reload, verify it visually, and confirm with a spoken response

The entire pipeline — from voice command to live change — takes about 12–15 seconds. Every step is transparent: the AI reasoning panel shows exactly what's happening at each stage. And every change is reversible with a single undo.


How I built it

The system is built as a 3-layer architecture:

Frontend (React + Vite + TypeScript)

  • A playground UI with file explorer, live preview, and AI reasoning panel
  • Web Speech API for voice capture
  • Component picker overlay for DOM element selection
  • Diff modal for viewing code changes
  • Real-time status updates via the pipeline status bar

Backend (Python + FastAPI)

  • Grounding Service — Uses Nova 2 Lite to analyze screenshots and identify which React component the user is pointing at
  • Code Generation Service — Sends the full source file plus the user command to Nova Premier, which returns a JSON response containing an explanation and the complete modified code
  • Verification Service — After HMR applies the change, captures a new screenshot and uses Nova 2 Lite Vision to verify the change was applied correctly
  • Undo System — Tracks a history stack of up to 20 changes, allowing instant rollback
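
As a minimal sketch of that undo system, assuming each history entry only needs the file path and its previous contents (the real service may track more metadata):

```python
from collections import deque
from pathlib import Path

MAX_HISTORY = 20  # matches the 20-change limit described above

# Each entry remembers a file's contents before a change was applied.
# With maxlen set, the oldest entry is dropped automatically once the cap is hit.
history: deque[tuple[Path, str]] = deque(maxlen=MAX_HISTORY)

def apply_change(path: Path, new_code: str) -> None:
    """Record the current file contents, then write the modified code for Vite HMR to pick up."""
    history.append((path, path.read_text()))
    path.write_text(new_code)

def undo_last_change() -> bool:
    """Restore the most recently changed file; returns False if there is nothing to undo."""
    if not history:
        return False
    path, previous_code = history.pop()
    path.write_text(previous_code)
    return True
```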

Voice Layer (Node.js + Nova 2 Sonic)

  • A dedicated microservice handles bidirectional streaming with Nova 2 Sonic for natural-sounding TTS confirmations
  • Amazon Polly serves as a reliable fallback when Sonic is unavailable
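
The fallback path is essentially a single Polly synthesize call. The voice service itself runs in Node.js, but as an illustration, the equivalent call in Python (the voice and engine choices are assumptions) looks like this:

```python
import boto3

polly = boto3.client("polly")

def synthesize_fallback(text: str) -> bytes:
    """Return MP3 audio for a spoken confirmation when the Sonic stream is unavailable."""
    response = polly.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId="Joanna",   # assumed voice
        Engine="neural",    # assumed engine
    )
    return response["AudioStream"].read()
```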

The Pipeline Flow

Point → Screenshot → Nova 2 Lite (ground) → Nova Premier (codegen) → File Write → Vite HMR → Nova 2 Lite (verify) → Nova Sonic (confirm)
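
A condensed sketch of how the backend chains those stages; the `services` helpers and their signatures are illustrative placeholders, not the project's actual API:

```python
import asyncio
from dataclasses import dataclass

HMR_DELAY_SECONDS = 2.0  # wait for Vite HMR to settle before verifying (see Challenges below)

@dataclass
class PipelineResult:
    applied: bool
    verified: bool
    explanation: str

async def handle_command(command: str, pointer_xy: tuple[int, int], services) -> PipelineResult:
    """Run one voice command through the full pipeline using injected stage helpers."""
    before = await services.capture_screenshot()                        # Point -> Screenshot
    target = await services.ground(before, pointer_xy)                  # Nova 2 Lite (ground)
    change = await services.generate_code(target.source_path, command)  # Nova Premier (codegen)
    services.write_file(target.source_path, change.modified_code)       # File write -> Vite HMR
    await asyncio.sleep(HMR_DELAY_SECONDS)
    after = await services.capture_screenshot()
    verified = await services.verify(after, command)                    # Nova 2 Lite (verify)
    await services.speak(change.explanation)                            # Nova Sonic (confirm)
    return PipelineResult(applied=True, verified=verified, explanation=change.explanation)
```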

Challenges I ran into

1. Nova Premier's JSON responses aren't always clean

The model sometimes appends explanations after the JSON closing brace, or includes escape sequences that break json.loads(). I built a 3-stage parser: direct parse → escape-fix → brace-matching extraction. This eliminated 100% of parsing failures.
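
A simplified version of that 3-stage parser; the exact regex and error handling in the real service may differ, but the approach is the same:

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Parse Nova Premier output that should be JSON but is sometimes slightly malformed."""
    # Stage 1: direct parse (the happy path).
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass

    # Stage 2: escape stray backslashes that are not valid JSON escapes, then retry.
    cleaned = re.sub(r'\\(?!["\\/bfnrtu])', r'\\\\', raw)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        pass

    # Stage 3: extract the outermost brace-matched object, ignoring trailing commentary.
    # (Naive: does not account for braces inside string values, which is fine as a sketch.)
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    depth = 0
    for i, ch in enumerate(raw[start:], start=start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(raw[start:i + 1])
    raise ValueError("unbalanced braces in model output")
```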

2. Bidirectional streaming with Nova Sonic

Sonic requires a continuous silent audio stream to keep the connection alive, even for text-only TTS. Getting the event sequence right — session start, system prompt, audio stream, user text with interactive: true, cleanup — took significant debugging. The AWS Python SDK doesn't support bidirectional streaming for Sonic, so I built the TTS layer in Node.js using the AWS JS SDK.

3. HMR timing for verification

After writing modified code to disk, Vite's HMR needs ~500ms to apply the change. If the verification screenshot is captured too early, it shows the old UI and verification fails. I added a configurable delay (default 2s) between applying the code and capturing the verification screenshot.
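
A sketch of that delay-then-capture pattern, assuming Playwright drives the verification screenshot (the screenshot tooling and function name here are assumptions; only the timing pattern is taken from above):

```python
import asyncio

from playwright.async_api import async_playwright

HMR_SETTLE_SECONDS = 2.0  # configurable delay; default 2s as described above

async def capture_after_hmr(preview_url: str, out_path: str) -> str:
    """Wait for Vite HMR to settle, then screenshot the live preview for verification."""
    await asyncio.sleep(HMR_SETTLE_SECONDS)
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(preview_url)
        await page.screenshot(path=out_path, full_page=True)
        await browser.close()
    return out_path
```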

4. Component grounding accuracy

Getting the AI to correctly identify which React component corresponds to a clicked pixel position was tricky. I combined DOM analysis with visual grounding — the screenshot includes a pointer indicator, and the backend cross-references this with the project's component tree.
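
A rough sketch of the visual half of that grounding step, sending the pointer-annotated screenshot plus the component tree to Nova Lite through Bedrock's Converse API (the model ID, prompt wording, and response handling here are assumptions, not the project's actual code):

```python
import boto3

bedrock = boto3.client("bedrock-runtime")
NOVA_LITE_MODEL_ID = "amazon.nova-lite-v1:0"  # assumed ID; substitute the Nova Lite variant you use

def ground_component(screenshot_png: bytes, component_tree: str) -> str:
    """Ask Nova Lite which React component sits under the pointer indicator."""
    response = bedrock.converse(
        modelId=NOVA_LITE_MODEL_ID,
        messages=[{
            "role": "user",
            "content": [
                {"image": {"format": "png", "source": {"bytes": screenshot_png}}},
                {"text": (
                    "The screenshot contains a pointer indicator. Using this component tree, "
                    "return only the name of the component under the pointer:\n" + component_tree
                )},
            ],
        }],
        inferenceConfig={"maxTokens": 300, "temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"].strip()
```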


Accomplishments that I'm proud of

  • End-to-end pipeline in ~12 seconds — from voice command to verified, live UI change
  • Zero parsing failures after implementing the robust JSON parser
  • Real code changes — not mocks. The AI modifies actual .tsx files on disk, and Vite hot-reloads them
  • Full transparency — every AI decision is visible in the reasoning panel
  • Natural voice confirmation — Nova Sonic speaks back naturally, confirming what changed
  • Undo support — every change is reversible, up to 20 steps back

What I learned

  • Nova Premier excels at code generation — when given the full source file as context, it preserves imports, respects component boundaries, and generates drop-in replacements
  • Bidirectional streaming is powerful but complex — Nova Sonic's event-based protocol requires careful orchestration of concurrent audio and text streams
  • LLMs need guardrails for structured output — never trust raw JSON from any model. Always build fallback parsers
  • Multimodal AI pipelines are the future — combining vision (grounding + verification), language (code generation), and speech (voice I/O) into a single workflow creates experiences that feel genuinely magical

What's next for Point & Say

  • MediaPipe Hands integration — actual finger tracking via webcam, replacing mouse clicks
  • Multi-file edits — generate changes across related components in a single command
  • External project bridging — inject a bridge script into any Vite/React project to make it Point & Say compatible (already scaffolded)
  • Conversation memory — let the AI remember previous changes for contextual follow-up commands ("now make it bigger")

Built With

  • React, Vite, TypeScript (frontend playground)
  • Python, FastAPI (backend services)
  • Node.js (voice microservice)
  • Amazon Nova Premier, Nova 2 Lite, Nova Sonic
  • Amazon Polly (TTS fallback)
  • Web Speech API (voice capture)
