Nova Voice Commander

Hands-free business operations. Speak it and Nova does it.

A voice-first agentic system that lets you speak complex multi-step business tasks and have them executed across real websites.


The Problem

Small business owners spend hours daily on repetitive web tasks: checking analytics dashboards, managing promotions on delivery platforms, booking meetings. Current voice assistants can answer questions but can't act across arbitrary web applications, and current browser automation tools require scripting knowledge.

Nova Voice Commander bridges the gap: speak a task in natural language, and an AI pipeline reasons about the steps, automates the browser, and speaks back the results, all in a single voice conversation.

Demo

"How are my analytics looking?" → Nova opens Google Analytics, reads key metrics, and reports back: "You've got 22,000 active users this week, up 8.4%. Sessions are at 26k..."

"Put some deals on Just Eat." → Nova opens the supplier portal, finds inactive promotions, toggles them on: "I've activated the Lunch Special, the Burger BOGOF, and the Family Meal Deal."

"Schedule a meeting with my head of marketing to discuss this." → Nova opens Cal.com, creates a booking: "Done! I've booked a 30-minute meeting for tomorrow afternoon."

Three voice commands. Three different platforms. One continuous conversation. No typing, no clicking, no tab-switching.

Architecture

┌─────────────────────────────────────────────────────────┐
│                      FRONTEND                            │
│           React App (Push-to-Talk Interface)              │
│     ┌──────────┐   ┌──────────┐   ┌───────────────┐    │
│     │ Voice UI │   │ Task Log │   │  Status Bar   │    │
│     │  Button  │   │ Display  │   │               │    │
│     └──────────┘   └──────────┘   └───────────────┘    │
└────────────────────────┬─────────────────────────────────┘
                         │ WebSocket (audio + task events)
                         ▼
┌─────────────────────────────────────────────────────────┐
│                  BACKEND (Python/FastAPI)                 │
│                                                          │
│  ┌────────────────────────────────────────────────┐     │
│  │          1. NOVA 2 SONIC (Speech)               │     │
│  │   • Bidirectional audio stream via Bedrock      │     │
│  │   • Understands speech → extracts intent        │     │
│  │   • Speaks results back to the user             │     │
│  │   • Triggers tools mid-conversation             │     │
│  └─────────────────────┬──────────────────────────┘     │
│                        │ Tool call: run_task              │
│                        ▼                                  │
│  ┌────────────────────────────────────────────────┐     │
│  │       2. NOVA 2 LITE (Reasoning/Planning)       │     │
│  │   • Decomposes intent into atomic steps         │     │
│  │   • Routes: browser automation vs direct tools  │     │
│  │   • Summarises results for voice output         │     │
│  └─────────────────────┬──────────────────────────┘     │
│                        │ Structured task plan (JSON)      │
│                        ▼                                  │
│  ┌────────────────────────────────────────────────┐     │
│  │           3. AGENT ORCHESTRATOR                  │     │
│  │   • Executes steps sequentially                 │     │
│  │   • Persistent browser session (5-min idle)     │     │
│  │   • Real-time status updates to frontend        │     │
│  └──────┬──────────────────────┬─────────────────┘     │
│         │                      │                         │
│         ▼                      ▼                         │
│  ┌──────────────┐  ┌──────────────────────────┐         │
│  │  NOVA ACT    │  │     DIRECT TOOLS         │         │
│  │  (Browser    │  │  • Date/time lookups     │         │
│  │  Automation) │  │  • Calculations          │         │
│  └──────────────┘  └──────────────────────────┘         │
│                                                          │
│  ┌────────────────────────────────────────────────┐     │
│  │     4. NOVA MULTIMODAL EMBEDDINGS               │     │
│  │   • Embeds query + extracted data               │     │
│  │   • Cosine similarity → relevance score         │     │
│  │   • Verifies automation got the right data      │     │
│  └────────────────────────────────────────────────┘     │
└─────────────────────────────────────────────────────────┘
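The orchestrator stage in the diagram can be pictured with a minimal Python sketch. The Step fields, the "browser"/"direct" labels, and run_plan are hypothetical names chosen for illustration, not the project's actual plan schema. The browser executor and direct tools are passed in as plain callables so the sketch runs without Nova Act.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    kind: str    # hypothetical routing label: "browser" or "direct"
    action: str  # browser instruction, or the name of a direct tool


def run_plan(
    steps: List[Step],
    browser: Callable[[str], str],
    tools: Dict[str, Callable[[], str]],
    on_status: Callable[[str], None] = lambda message: None,
) -> List[str]:
    """Run steps in order, routing each to the browser or a direct tool."""
    results = []
    for index, step in enumerate(steps, start=1):
        # Report progress before each step (stands in for frontend updates).
        on_status(f"step {index}/{len(steps)}: {step.action}")
        if step.kind == "browser":
            results.append(browser(step.action))
        elif step.kind == "direct":
            results.append(tools[step.action]())
        else:
            raise ValueError(f"unknown step kind: {step.kind!r}")
    return results


if __name__ == "__main__":
    plan = [Step("direct", "today"), Step("browser", "Open the dashboard")]
    print(run_plan(plan, browser=lambda a: f"did: {a}",
                   tools={"today": lambda: "2025-01-01"}, on_status=print))
```

Passing the executors in as arguments keeps the routing logic testable with stubs; a real orchestrator would presumably hand browser steps to the persistent Nova Act session instead.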

Nova Services Used

#  | Service                    | Model ID                                  | Role
---|----------------------------|-------------------------------------------|------------------------------
1  | Nova 2 Sonic               | amazon.nova-2-sonic-v1:0                  | Speech-to-speech - bidirectional audio streaming with barge-in and tool use
2  | Nova 2 Lite                | us.amazon.nova-2-lite-v1:0                | Reasoning - task decomposition into structured plans and result summarization
3  | Nova Act                   | nova-act SDK                              | Browser automation - navigates and interacts with real websites
4  | Nova Multimodal Embeddings | amazon.nova-2-multimodal-embeddings-v1:0  | Semantic verification - scores relevance between intent and extracted data
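The semantic verification role in row 4 comes down to a cosine similarity between two embedding vectors. The sketch below assumes the embeddings have already been fetched as plain lists of floats. The is_relevant helper and its 0.5 threshold are illustrative assumptions, not values taken from the project.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as "no similarity"
    return dot / (norm_a * norm_b)


def is_relevant(query_vec: Sequence[float], data_vec: Sequence[float],
                threshold: float = 0.5) -> bool:
    """Hypothetical gate: accept extracted data only above a threshold."""
    return cosine_similarity(query_vec, data_vec) >= threshold
```

A low score would suggest the automation scraped something unrelated to the spoken request; what the project does with a low score is not stated in the source.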

Key Technical Decisions

  • Persistent browser sessions - Nova Act keeps the browser open for 5 minutes between tasks. Follow-up commands reuse the same window, avoiding the 15-second cold-start per task. This makes multi-step conversations feel natural.
  • Bidirectional streaming - Audio flows in both directions simultaneously via Bedrock's InvokeModelWithBidirectionalStream. The user can interrupt (barge-in) while Nova is speaking.
  • Tool use within speech - Nova 2 Sonic's native tool-use capability triggers the task pipeline mid-conversation. No speech-to-text → text-to-speech workaround needed.
  • Silent keepalive - During browser automation (which can take 30-60 seconds), the backend feeds silent audio to Sonic to prevent its 55-second timeout, while muting real mic input to avoid accidental triggers.
  • Pre-defined workflow templates - The three demo flows use reliable step templates rather than LLM-generated plans, ensuring consistent demo performance. Unrecognized intents fall back to Lite-generated plans.
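The silent keepalive decision can be sketched as follows. The 16 kHz rate matches the mic resampling noted in the project structure; 16-bit mono PCM, the 100 ms interval, and the function names are assumptions for illustration. The send callable stands in for whatever writes audio into the Sonic stream.

```python
import asyncio
from typing import Awaitable, Callable

SAMPLE_RATE_HZ = 16_000  # 16 kHz, as in the frontend's mic resampling
BYTES_PER_SAMPLE = 2     # assumption: 16-bit mono PCM


def silent_frame(duration_ms: int) -> bytes:
    """Return duration_ms of PCM silence (all-zero samples)."""
    samples = SAMPLE_RATE_HZ * duration_ms // 1000
    return bytes(samples * BYTES_PER_SAMPLE)


def gate_mic(mic_frame: bytes, automation_running: bool) -> bytes:
    """While automation runs, swap mic audio for silence of equal length."""
    return bytes(len(mic_frame)) if automation_running else mic_frame


async def keepalive(send: Callable[[bytes], Awaitable[None]],
                    stop: asyncio.Event, interval_ms: int = 100) -> None:
    """Send silent frames at a fixed interval until stop is set."""
    while not stop.is_set():
        await send(silent_frame(interval_ms))
        await asyncio.sleep(interval_ms / 1000)
```

The orchestrator would start keepalive as a background task when a browser step begins and set the stop event once the step returns, so the speech session never sees a gap long enough to time out.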

Demo Scenarios

Scenario       | Voice Command                       | What Happens
---------------|-------------------------------------|------------------------------
Analytics      | "How are my analytics looking?"     | Opens Google Analytics → reads active users, sessions, trends, top countries → speaks summary
Supplier Deals | "Put some deals on Just Eat"        | Opens Just Eat partner portal → finds inactive promotions → toggles them on → confirms
Calendar       | "Schedule a meeting with marketing" | Opens Cal.com → creates a new booking → confirms date and time
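One way to picture how these scenarios map to the pre-defined templates, with a fallback for everything else, is the sketch below. Keyword matching, the template names, and the step wording are assumptions for illustration; the source does not say how intents are matched. The fallback_planner callable stands in for the Nova 2 Lite planner.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical step templates, paraphrased from the scenario table.
TEMPLATES: Dict[str, List[str]] = {
    "analytics": ["Open Google Analytics", "Read key metrics", "Summarise"],
    "supplier_deals": ["Open Just Eat partner portal",
                       "Find inactive promotions", "Toggle them on"],
    "calendar": ["Open Cal.com", "Create a new booking", "Confirm the time"],
}

# Hypothetical trigger words; a real matcher may work differently.
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "analytics": ("analytics",),
    "supplier_deals": ("deal", "just eat"),
    "calendar": ("meeting", "schedule"),
}


def plan_for(command: str,
             fallback_planner: Callable[[str], List[str]]) -> List[str]:
    """Return a fixed template when a keyword matches, else ask the planner."""
    text = command.lower()
    for name, words in KEYWORDS.items():
        if any(word in text for word in words):
            return TEMPLATES[name]
    return fallback_planner(command)
```

Templates trade flexibility for repeatability: the three demo commands always produce the same steps, while anything unrecognised still gets a generated plan.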

Prerequisites

  • Python 3.11+
  • Node.js 18+
  • AWS account with Bedrock access (Nova 2 Sonic, Nova 2 Lite, Nova Multimodal Embeddings enabled in us-east-1)
  • Nova Act API key (from nova.amazon.com)
  • Google Chrome installed

Quick Start

1. Clone and install

git clone https://github.com/devtoship/NovaCommand.git
cd NovaCommand

# Backend
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Frontend
cd frontend && npm install && cd ..

2. Configure environment

cp .env.example .env
# Edit .env with your AWS credentials and Nova Act API key

3. Set up browser profile (one-time)

python scripts/setup_browser_profile.py
# Log into your target websites (Cal.com, Google Analytics, etc.)
# in the browser that opens, then close it

4. Run

# Terminal 1: Backend
source .venv/bin/activate
uvicorn backend.main:app --host 0.0.0.0 --port 8001

# Terminal 2: Frontend
cd frontend && npm run dev

Open http://localhost:3000, click Connect, hold the microphone button, and speak your command.

Authentication & Security

Nova Act uses a pre-authenticated Chromium user profile stored locally on the backend server. During setup, users log into target platforms in a Chrome window managed by setup_browser_profile.py. Nova Act reuses these saved sessions, so no credentials are stored in application code.

Security boundaries:

  • The React frontend handles audio capture and playback only; it never touches browser state, cookies, or credentials
  • All browser automation runs server-side; the Nova Act Chromium instance is not exposed to the frontend
  • All sensitive configuration (AWS credentials, API keys) is managed through environment variables
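A minimal sketch of environment-based configuration, in the spirit of backend/config.py, might look like the following. The Settings fields and the AWS_REGION and NOVA_ACT_API_KEY variable names are assumptions here; check .env.example for the names the project actually uses. The us-east-1 default mirrors the region listed under Prerequisites.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    aws_region: str
    nova_act_api_key: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from the environment and fail fast if a secret is missing."""
    required = ("NOVA_ACT_API_KEY",)  # assumed variable name
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return Settings(
        aws_region=env.get("AWS_REGION", "us-east-1"),
        nova_act_api_key=env["NOVA_ACT_API_KEY"],
    )
```

Failing at startup, rather than on the first voice command, surfaces a missing key before any browser session is opened.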

Production considerations: a production deployment would add OAuth flows for account connections, AWS Secrets Manager for credential storage, isolated browser contexts per user, role-based access controls, and encrypted session storage.

Project Structure

├── backend/
│   ├── main.py                 # FastAPI + WebSocket server
│   ├── config.py               # Environment configuration
│   ├── voice/
│   │   ├── sonic_handler.py    # Nova 2 Sonic bidirectional streaming
│   │   ├── sonic_events.py     # Sonic protocol event builders
│   │   └── audio_utils.py      # PCM audio helpers
│   ├── reasoning/
│   │   ├── planner.py          # Nova 2 Lite task decomposition
│   │   └── prompts.py          # Centralized system prompts
│   ├── agents/
│   │   ├── orchestrator.py     # Multi-step task coordination
│   │   └── act_executor.py     # Nova Act persistent browser session
│   ├── embeddings/
│   │   └── embedder.py         # Nova Multimodal Embeddings
│   └── tools/
│       └── direct_tools.py     # Non-browser utilities
├── frontend/
│   ├── src/
│   │   ├── App.jsx             # Main layout + WebSocket URL
│   │   ├── components/
│   │   │   ├── VoiceButton.jsx # Push-to-talk microphone
│   │   │   ├── TaskLog.jsx     # Real-time step progress
│   │   │   └── StatusBar.jsx   # Connection status
│   │   └── hooks/
│   │       ├── useWebSocket.js # Sonic protocol bridge
│   │       └── useAudioCapture.js # Mic capture + 16kHz resampling
│   └── package.json
├── workflows/                   # Demo flow step templates
│   ├── supplier_check.py       # Just Eat partner portal
│   ├── calendar_booking.py     # Cal.com booking management
│   └── analytics_summary.py   # Google Analytics dashboard
├── demo/
│   └── supplier_portal.html    # Local Just Eat demo page
├── scripts/
│   └── setup_browser_profile.py
├── requirements.txt
└── .env.example

Tech Stack

Layer                 | Technology
----------------------|------------------------------
Frontend              | React 18 + Tailwind CSS + Vite
Communication         | WebSocket (bidirectional audio + events)
Backend               | Python 3.11 + FastAPI
Speech                | Nova 2 Sonic via Bedrock bidirectional stream
Reasoning             | Nova 2 Lite via Bedrock Converse API
Browser Automation    | Nova Act SDK
Semantic Verification | Nova Multimodal Embeddings via Bedrock
