Nova Voice Commander

Hands-free business operations. Speak it and Nova does it.

A voice-first agentic system that lets you speak complex multi-step business tasks and have them executed across real websites.

The Problem

Small business owners spend hours each day on repetitive web tasks: checking analytics dashboards, managing promotions on delivery platforms, and booking meetings. Current voice assistants can answer questions but cannot act across arbitrary web applications, and current browser automation tools require scripting knowledge.

Nova Voice Commander bridges the gap: speak a task in natural language, and an AI pipeline reasons about the steps, automates the browser, and speaks back the results, all in a single voice conversation.

Demo

"How are my analytics looking?" → Nova opens Google Analytics, reads key metrics, and reports back: "You've got 22,000 active users this week, up 8.4%. Sessions are at 26k..."

"Put some deals on Just Eat." → Nova opens the supplier portal, finds inactive promotions, toggles them on: "I've activated the Lunch Special, the Burger BOGOF, and the Family Meal Deal."

"Schedule a meeting with my head of marketing to discuss this." → Nova opens Cal.com, creates a booking: "Done! I've booked a 30-minute meeting for tomorrow afternoon."

Three voice commands. Three different platforms. One continuous conversation. No typing, no clicking, no tab-switching.

Architecture

┌─────────────────────────────────────────────────────────┐
│                      FRONTEND                            │
│           React App (Push-to-Talk Interface)              │
│     ┌──────────┐   ┌──────────┐   ┌───────────────┐    │
│     │ Voice UI │   │ Task Log │   │  Status Bar   │    │
│     │  Button  │   │ Display  │   │               │    │
│     └──────────┘   └──────────┘   └───────────────┘    │
└────────────────────────┬─────────────────────────────────┘
                         │ WebSocket (audio + task events)
                         ▼
┌─────────────────────────────────────────────────────────┐
│                  BACKEND (Python/FastAPI)                 │
│                                                          │
│  ┌────────────────────────────────────────────────┐     │
│  │          1. NOVA 2 SONIC (Speech)               │     │
│  │   • Bidirectional audio stream via Bedrock      │     │
│  │   • Understands speech → extracts intent        │     │
│  │   • Speaks results back to the user             │     │
│  │   • Triggers tools mid-conversation             │     │
│  └─────────────────────┬──────────────────────────┘     │
│                        │ Tool call: run_task              │
│                        ▼                                  │
│  ┌────────────────────────────────────────────────┐     │
│  │       2. NOVA 2 LITE (Reasoning/Planning)       │     │
│  │   • Decomposes intent into atomic steps         │     │
│  │   • Routes: browser automation vs direct tools  │     │
│  │   • Summarises results for voice output         │     │
│  └─────────────────────┬──────────────────────────┘     │
│                        │ Structured task plan (JSON)      │
│                        ▼                                  │
│  ┌────────────────────────────────────────────────┐     │
│  │           3. AGENT ORCHESTRATOR                  │     │
│  │   • Executes steps sequentially                 │     │
│  │   • Persistent browser session (5-min idle)     │     │
│  │   • Real-time status updates to frontend        │     │
│  └──────┬──────────────────────┬─────────────────┘     │
│         │                      │                         │
│         ▼                      ▼                         │
│  ┌──────────────┐  ┌──────────────────────────┐         │
│  │  NOVA ACT    │  │     DIRECT TOOLS         │         │
│  │  (Browser    │  │  • Date/time lookups     │         │
│  │  Automation) │  │  • Calculations          │         │
│  └──────────────┘  └──────────────────────────┘         │
│                                                          │
│  ┌────────────────────────────────────────────────┐     │
│  │     4. NOVA MULTIMODAL EMBEDDINGS               │     │
│  │   • Embeds query + extracted data               │     │
│  │   • Cosine similarity → relevance score         │     │
│  │   • Verifies automation got the right data      │     │
│  └────────────────────────────────────────────────┘     │
└─────────────────────────────────────────────────────────┘
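As a rough illustration of steps 2 and 3 in the diagram, the sketch below walks a structured plan and routes each step to either a browser executor or a direct tool. The plan schema (`type`, `action`, `args`), the tool names, and the `fake_browser` stand-in are assumptions made for illustration; the project's actual plan format is not shown on this page.

```python
import json
from datetime import date
from typing import Callable

# Hypothetical direct tools; the names are illustrative only.
DIRECT_TOOLS: dict[str, Callable[..., str]] = {
    "today": lambda: date.today().isoformat(),
    "add": lambda a, b: str(a + b),
}


def run_plan(plan_json: str, browser: Callable[[str], str]) -> list[dict]:
    """Execute plan steps in order, routing each to the browser or a direct tool."""
    results = []
    for index, step in enumerate(json.loads(plan_json)["steps"], start=1):
        if step["type"] == "browser":
            output = browser(step["action"])
        elif step["type"] == "direct":
            output = DIRECT_TOOLS[step["action"]](**step.get("args", {}))
        else:
            raise ValueError(f"unknown step type: {step['type']}")
        results.append({"step": index, "type": step["type"], "output": output})
    return results


def fake_browser(instruction: str) -> str:
    """Stand-in for a browser automation call; it only echoes the instruction."""
    return f"done: {instruction}"


# Assumed plan shape: an ordered list of steps, each tagged with its route.
example_plan = json.dumps({
    "steps": [
        {"type": "browser", "action": "open the analytics dashboard"},
        {"type": "direct", "action": "add", "args": {"a": 2, "b": 3}},
    ]
})
```

Calling `run_plan(example_plan, fake_browser)` returns one result record per step, in order, which is the kind of list an orchestrator could stream to the frontend as status updates.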

Nova Services Used

#	Service	Model ID	Role
1	Nova 2 Sonic	`amazon.nova-2-sonic-v1:0`	Speech-to-speech - bidirectional audio streaming with barge-in and tool use
2	Nova 2 Lite	`us.amazon.nova-2-lite-v1:0`	Reasoning - task decomposition into structured plans and result summarization
3	Nova Act	`nova-act` SDK	Browser automation - navigates and interacts with real websites
4	Nova Multimodal Embeddings	`amazon.nova-2-multimodal-embeddings-v1:0`	Semantic verification - scores relevance between intent and extracted data
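The relevance score in row 4 comes down to cosine similarity between two embedding vectors. A minimal sketch in plain Python, assuming the embeddings have already been fetched as lists of floats; the `is_relevant` helper and its 0.5 threshold are hypothetical and not taken from the project.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated rather than dividing by zero
    return dot / (norm_a * norm_b)


def is_relevant(query_vec: list[float], data_vec: list[float],
                threshold: float = 0.5) -> bool:
    """Hypothetical check: accept extracted data only if it is close to the query."""
    return cosine_similarity(query_vec, data_vec) >= threshold
```

Vectors pointing the same way score 1.0 and orthogonal vectors score 0.0, so a low score would suggest the automation extracted data unrelated to what the user asked for.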

Key Technical Decisions

Persistent browser sessions - Nova Act keeps the browser open for 5 minutes between tasks. Follow-up commands reuse the same window, avoiding the 15-second cold-start per task. This makes multi-step conversations feel natural.
Bidirectional streaming - Audio flows in both directions simultaneously via Bedrock's InvokeModelWithBidirectionalStream. The user can interrupt (barge-in) while Nova is speaking.
Tool use within speech - Nova 2 Sonic's native tool-use capability triggers the task pipeline mid-conversation. No speech-to-text → text-to-speech workaround needed.
Silent keepalive - During browser automation (which can take 30-60 seconds), the backend feeds silent audio to Sonic to prevent its 55-second timeout, while muting real mic input to avoid accidental triggers.
Pre-defined workflow templates - The three demo flows use reliable step templates rather than LLM-generated plans, ensuring consistent demo performance. Unrecognized intents fall back to Lite-generated plans.
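The silent keepalive described above can be sketched with `asyncio`: while a long-running task is awaited, a background loop keeps sending frames of digital silence. This is a sketch under stated assumptions, not the project's code: `send_audio` is a hypothetical callback standing in for whatever writes audio to the Sonic stream, 16-bit mono PCM is assumed, and the interval and frame length are illustrative. Muting the real microphone is not shown.

```python
import asyncio
import contextlib
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

SAMPLE_RATE_HZ = 16_000  # matches the 16 kHz capture rate used by the frontend
BYTES_PER_SAMPLE = 2     # assumption: 16-bit mono PCM


def silent_frame(duration_ms: int) -> bytes:
    """Return duration_ms of digital silence as raw PCM bytes."""
    samples = SAMPLE_RATE_HZ * duration_ms // 1000
    return b"\x00" * (samples * BYTES_PER_SAMPLE)


async def run_with_keepalive(
    task: Awaitable[T],
    send_audio: Callable[[bytes], Awaitable[None]],  # hypothetical stream writer
    interval_s: float = 1.0,
    frame_ms: int = 100,
) -> T:
    """Await task while periodically sending silence so the stream stays active."""

    async def keepalive() -> None:
        while True:
            await send_audio(silent_frame(frame_ms))
            await asyncio.sleep(interval_s)

    pinger = asyncio.create_task(keepalive())
    try:
        return await task
    finally:
        # Stop the silence loop as soon as the task finishes or fails.
        pinger.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await pinger
```

Wrapping the browser automation call this way keeps the keepalive scoped to the task: the silence stops whether the task succeeds or raises.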

Demo Scenarios

Scenario	Voice Command	What Happens
Analytics	"How are my analytics looking?"	Opens Google Analytics → reads active users, sessions, trends, top countries → speaks summary
Supplier Deals	"Put some deals on Just Eat"	Opens Just Eat partner portal → finds inactive promotions → toggles them on → confirms
Calendar	"Schedule a meeting with marketing"	Opens Cal.com → creates a new booking → confirms date and time
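One way to picture how the scenarios in the table could map to pre-defined templates, with unrecognized commands falling back to a generated plan, is a simple keyword router. The keyword lists and the `generate_plan` callback below are assumptions for illustration; only the template names mirror the files under `workflows/`.

```python
from typing import Callable, Optional

# Hypothetical keyword lists; the keys mirror the files in workflows/.
TEMPLATES: dict[str, tuple[str, ...]] = {
    "analytics_summary": ("analytics",),
    "supplier_check": ("deal", "just eat", "promotion"),
    "calendar_booking": ("schedule", "meeting", "book"),
}


def match_template(utterance: str) -> Optional[str]:
    """Return the first template whose keywords appear in the utterance, if any."""
    text = utterance.lower()
    for name, keywords in TEMPLATES.items():
        if any(keyword in text for keyword in keywords):
            return name
    return None


def choose_plan(utterance: str, generate_plan: Callable[[str], list]) -> dict:
    """Prefer a fixed template; otherwise fall back to a generated plan."""
    name = match_template(utterance)
    if name is not None:
        return {"source": "template", "workflow": name}
    # generate_plan stands in for a call to the reasoning model.
    return {"source": "planner", "plan": generate_plan(utterance)}
```

Substring matching is deliberately crude here; a real router would more likely rely on the intent extracted by the speech model.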

Prerequisites

Python 3.11+
Node.js 18+
AWS account with Bedrock access (Nova 2 Sonic, Nova 2 Lite, Nova Multimodal Embeddings enabled in us-east-1)
Nova Act API key (from nova.amazon.com)
Google Chrome installed

Quick Start

1. Clone and install

git clone https://github.com/devtoship/NovaCommand.git
cd NovaCommand

# Backend
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Frontend
cd frontend && npm install && cd ..

2. Configure environment

cp .env.example .env
# Edit .env with your AWS credentials and Nova Act API key

3. Set up browser profile (one-time)

python scripts/setup_browser_profile.py
# Log into your target websites (Cal.com, Google Analytics, etc.)
# in the browser that opens, then close it

4. Run

# Terminal 1: Backend
source .venv/bin/activate
uvicorn backend.main:app --host 0.0.0.0 --port 8001

# Terminal 2: Frontend
cd frontend && npm run dev

Open http://localhost:3000, click Connect, hold the microphone button, and speak your command.

Authentication & Security

Nova Act uses a pre-authenticated Chromium user profile stored locally on the backend server. During setup, users log into target platforms in a browser window managed by setup_browser_profile.py. Nova Act reuses these saved sessions, so no credentials are stored in application code.

Security boundaries:

The React frontend handles audio capture and playback only; it never touches browser state, cookies, or credentials
All browser automation runs server-side; the Nova Act Chromium instance is not exposed to the frontend
All sensitive configuration (AWS credentials, API keys) is managed through environment variables
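To illustrate the last point, a configuration module might read its settings from the environment and fail fast when a required value is missing. The variable names `NOVA_ACT_API_KEY` and `AWS_REGION`, and the `Settings` class itself, are assumptions; the project's actual `.env.example` keys are not listed on this page. The `us-east-1` default follows the region named under Prerequisites.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    aws_region: str
    nova_act_api_key: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables (assumed names), failing fast."""
    required = ("NOVA_ACT_API_KEY",)
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return Settings(
        aws_region=env.get("AWS_REGION", "us-east-1"),
        nova_act_api_key=env["NOVA_ACT_API_KEY"],
    )
```

Passing the environment in as a mapping keeps the loader easy to test without touching the real process environment.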

Production considerations: a real deployment would add OAuth flows for account connections, AWS Secrets Manager for credential storage, isolated browser contexts per user, role-based access controls, and encrypted session storage.

Project Structure

├── backend/
│   ├── main.py                 # FastAPI + WebSocket server
│   ├── config.py               # Environment configuration
│   ├── voice/
│   │   ├── sonic_handler.py    # Nova 2 Sonic bidirectional streaming
│   │   ├── sonic_events.py     # Sonic protocol event builders
│   │   └── audio_utils.py      # PCM audio helpers
│   ├── reasoning/
│   │   ├── planner.py          # Nova 2 Lite task decomposition
│   │   └── prompts.py          # Centralized system prompts
│   ├── agents/
│   │   ├── orchestrator.py     # Multi-step task coordination
│   │   └── act_executor.py     # Nova Act persistent browser session
│   ├── embeddings/
│   │   └── embedder.py         # Nova Multimodal Embeddings
│   └── tools/
│       └── direct_tools.py     # Non-browser utilities
├── frontend/
│   ├── src/
│   │   ├── App.jsx             # Main layout + WebSocket URL
│   │   ├── components/
│   │   │   ├── VoiceButton.jsx # Push-to-talk microphone
│   │   │   ├── TaskLog.jsx     # Real-time step progress
│   │   │   └── StatusBar.jsx   # Connection status
│   │   └── hooks/
│   │       ├── useWebSocket.js # Sonic protocol bridge
│   │       └── useAudioCapture.js # Mic capture + 16kHz resampling
│   └── package.json
├── workflows/                   # Demo flow step templates
│   ├── supplier_check.py       # Just Eat partner portal
│   ├── calendar_booking.py     # Cal.com booking management
│   └── analytics_summary.py   # Google Analytics dashboard
├── demo/
│   └── supplier_portal.html    # Local Just Eat demo page
├── scripts/
│   └── setup_browser_profile.py
├── requirements.txt
└── .env.example

Tech Stack

Layer	Technology
Frontend	React 18 + Tailwind CSS + Vite
Communication	WebSocket (bidirectional audio + events)
Backend	Python 3.11 + FastAPI
Speech	Nova 2 Sonic via Bedrock bidirectional stream
Reasoning	Nova 2 Lite via Bedrock Converse API
Browser Automation	Nova Act SDK
Semantic Verification	Nova Multimodal Embeddings via Bedrock

Built With

amazon
amazon-nova-act
amazon-nova-lite
amazon-nova-sonic
aws-bedrock
embeddings
fastapi
multimodal
nova
python
react
tailwind-css
websockets

Updates

Max M posted an update — Mar 23, 2026 01:03 PM EDT

I looked through some of the other projects that have been submitted. Competition looks tough but I'm hoping the use case will let this project stand out :) Good luck everyone!


Max M started this project — Mar 15, 2026 09:25 PM EDT

Leave feedback in the comments!
