Web accessibility tools often require screen-reader expertise or browser extensions that users can’t install on shared machines. We wanted a friction-free, speech-only layer that works on any single-page app, costs (almost) nothing to run, and shows how far AWS’s serverless stack—and Bedrock—can go in democratizing access.

What it does

  1. Record voice in the browser.
  2. Upload clip to S3.
  3. Lambda #1 – StoreConn-Voice
    ($connect / $disconnect routes) saves & prunes WebSocket connectionIds in DynamoDB.
  4. Lambda #2 – Transcribe Trigger
    S3 event starts an Amazon Transcribe async job.
  5. Transcribe drops JSON in transcribe-output/ → Lambda #3 – Bedrock Intent runs.
  6. Bedrock (Claude-3 Sonnet) returns an intent such as
    {"action":"click","selector":"#nav-book"}.
  7. Lambda #3 fans out the intent via the API Gateway WebSocket to each open tab.
  8. Front-end JavaScript (served from Amplify hosting) clicks, navigates, or inputs text for hands-free control.
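Steps 5 to 7 can be sketched roughly as below. This is a minimal illustration, not our exact Lambda: the helper names (management_endpoint, extract_transcript, fan_out) are made up for this example, and the client is passed in so the logic can be exercised without AWS. In a real Lambda the client would come from boto3.client("apigatewaymanagementapi", endpoint_url=...).

```python
import json
from urllib.parse import urlparse


def management_endpoint(wss_url: str) -> str:
    """Turn the client-facing wss:// URL into the https:// management endpoint.

    The SDK appends /@connections/{id} itself, so the endpoint stops at the stage.
    """
    parts = urlparse(wss_url)
    stage = parts.path.strip("/").split("/")[0]
    base = f"https://{parts.netloc}"
    return f"{base}/{stage}" if stage else base


def extract_transcript(transcribe_output: dict) -> str:
    """Pull the plain-text transcript out of a Transcribe result document."""
    transcripts = transcribe_output.get("results", {}).get("transcripts", [])
    return transcripts[0]["transcript"].strip() if transcripts else ""


def fan_out(client, connection_ids, intent: dict) -> list:
    """Send one intent to every connection; return the IDs that failed."""
    payload = json.dumps(intent).encode("utf-8")
    stale = []
    for connection_id in connection_ids:
        try:
            client.post_to_connection(ConnectionId=connection_id, Data=payload)
        except Exception:  # e.g. the tab was closed and the connection is gone
            stale.append(connection_id)
    return stale
```

Returning the failed IDs lets the caller delete them from DynamoDB right away instead of waiting for TTL to expire them.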

How we built it
Three Python 3.12 Lambdas:

  1. StoreConn-Voice (WS connect/disconnect)
  2. Transcribe Trigger (start job)
  3. Bedrock Intent (parse transcript → Bedrock → broadcast)

Minimal SPA: plain HTML/JS; no front-end framework.
Prompt-engineered Bedrock to output strict JSON and whitelisted selectors.
Live-reloading Amplify hosting for rapid UX tweaking.
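As an illustration of the second Lambda, here is a rough sketch of how an S3 upload event might be turned into a Transcribe job. The "en-US" language code, the job-naming scheme, and the choice to write output back to the same bucket are assumptions for this example; only the transcribe-output/ prefix comes from the pipeline described above.

```python
import re
from urllib.parse import unquote_plus

OUTPUT_PREFIX = "transcribe-output/"  # prefix named in the pipeline description


def build_job_params(record: dict, language_code: str = "en-US") -> dict:
    """Map one S3 event record to start_transcription_job keyword arguments."""
    bucket = record["s3"]["bucket"]["name"]
    key = unquote_plus(record["s3"]["object"]["key"])  # S3 event keys are URL-encoded
    stem = key.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    # Job names only allow letters, digits, '.', '_' and '-'.
    job_name = re.sub(r"[^0-9A-Za-z._-]", "-", stem)
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": f"s3://{bucket}/{key}"},
        "LanguageCode": language_code,
        "OutputBucketName": bucket,  # assumption: results go back to the same bucket
        "OutputKey": f"{OUTPUT_PREFIX}{job_name}.json",
    }


def handler(event, context, transcribe=None):
    """Start one async Transcribe job per uploaded clip."""
    if transcribe is None:
        import boto3  # imported lazily so build_job_params can be tested offline
        transcribe = boto3.client("transcribe")
    started = []
    for record in event.get("Records", []):
        params = build_job_params(record)
        transcribe.start_transcription_job(**params)
        started.append(params["TranscriptionJobName"])
    return {"started": started}
```

Two caveats with this sketch: job names must be unique, so clip filenames would need to be unique too, and if output lands in the same bucket the upload trigger should be filtered to the upload prefix so that Transcribe's own JSON does not start another job.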

Challenges we ran into
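Unescaped {} in a Python .format prompt template killed Bedrock calls (KeyError "action"); literal JSON braces have to be doubled as {{ }}.
API Gateway WS management URL vs. client WSS URL: the management endpoint is the https:// form of the URL, and adding “@connections” to it ourselves (the SDK already appends it) produced a doubled path and 404s.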
Bedrock sometimes hallucinated selectors; solved with an alias map and stricter prompt.
Transcribe async adds ~45 s latency; streaming wasn’t available in the free tier.
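The brace-escaping and hallucinated-selector fixes can be sketched together as follows. The selector whitelist, the alias entries, and the action names other than "click" are hypothetical placeholders, and the prompt wording is illustrative rather than our production prompt.

```python
import json

# Hypothetical whitelist and aliases; a real page would supply its own.
ALLOWED_SELECTORS = {"#nav-book", "#nav-home", "#search-input"}
SELECTOR_ALIASES = {"#book": "#nav-book", "#booking": "#nav-book", "#home": "#nav-home"}
ALLOWED_ACTIONS = {"click", "navigate", "input"}

# Literal braces are doubled so str.format only fills {selectors} and {transcript}.
PROMPT_TEMPLATE = (
    "Reply with one JSON object and nothing else, for example "
    '{{"action":"click","selector":"#nav-book"}}.\n'
    "Allowed selectors: {selectors}\n"
    "User said: {transcript}"
)


def build_prompt(transcript: str) -> str:
    """Fill the template; the JSON example survives because its braces are doubled."""
    return PROMPT_TEMPLATE.format(
        selectors=", ".join(sorted(ALLOWED_SELECTORS)), transcript=transcript
    )


def parse_intent(model_text: str):
    """Return a validated intent dict, or None if the reply is unusable."""
    try:
        intent = json.loads(model_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(intent, dict):
        return None
    action, selector = intent.get("action"), intent.get("selector")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return None
    if not isinstance(selector, str):
        return None
    selector = SELECTOR_ALIASES.get(selector, selector)  # map near-misses to real IDs
    if selector not in ALLOWED_SELECTORS:
        return None
    return {**intent, "selector": selector}
```

Dropping anything that is not on the whitelist means a hallucinated selector never reaches the browser; the alias map only rescues the near-misses seen often enough to be worth listing.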

Accomplishments we’re proud of

  • Zero servers—full pipeline idles at $0.18/mo (S3 + Dynamo storage).
  • Works on any SPA without code changes—just drop in app.js.
  • Live demo navigates between tabs and fills forms with voice only.
  • Added TTL pruning → no zombie WebSocket IDs, no manual clean-up.
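A minimal sketch of the TTL idea, assuming a table keyed on connectionId with TTL enabled on an expiresAt attribute. The table name, attribute names, and two-hour lifetime are assumptions for illustration, not our exact configuration.

```python
import time

# Assumed lifetime; API Gateway closes a WebSocket connection after at most 2 hours.
TTL_SECONDS = 2 * 60 * 60


def connection_item(connection_id: str, now=None) -> dict:
    """Build the DynamoDB item; the TTL attribute is epoch seconds stored as a number."""
    now = int(time.time()) if now is None else int(now)
    return {"connectionId": connection_id, "expiresAt": now + TTL_SECONDS}


def handler(event, context, table=None):
    """Handle the $connect and $disconnect routes."""
    if table is None:
        import boto3  # imported lazily so the logic can be tested offline
        table = boto3.resource("dynamodb").Table("VoiceNavConnections")  # hypothetical name
    ctx = event["requestContext"]
    if ctx["routeKey"] == "$connect":
        table.put_item(Item=connection_item(ctx["connectionId"]))
    elif ctx["routeKey"] == "$disconnect":
        table.delete_item(Key={"connectionId": ctx["connectionId"]})
    return {"statusCode": 200}
```

DynamoDB removes expired items in the background rather than at the exact expiry second, so the broadcaster should still tolerate a stale ID now and then.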

What we learned

  • Bedrock’s Claude-3 is shockingly good at structured JSON if you remind it in every prompt.
  • API Gateway WebSockets + Dynamo TTL = nearly effortless real-time fan-out.
  • Small UX touches (on-screen log overlay, auto-WS reconnect) make demos rock-solid.

What’s next for VoiceNavAI
Transcribe Streaming to cut first-response time to <3 s.
Multilingual commands (update prompt + language-auto-detect).
ARIA role introspection: auto-generate selector whitelist per page.
Chrome extension wrapper to inject app.js on any site, no dev changes required.
