Abyss: A Voice-First AI Assistant Built for the Way You Actually Live

Inspiration

The world is quietly crossing a threshold. AI coding tools are making software creation accessible to nearly anyone — and as that barrier falls, the new bottleneck becomes something more fundamental: the ability to think clearly, move fast, and act decisively, even when you're away from your desk.

We kept coming back to a simple observation: the most capable computer most people own is the one in their pocket. And yet almost every serious AI assistant is still desktop-first, chat-first, and session-first. It forgets you between conversations. It can't act locally on your behalf. It waits for you to come to it.

We're living in the early days of what Jarvis actually looks like — Open Interpreter, OpenClaw, real-time voice models, autonomous agents. The pieces are converging. We wanted to build something that took that vision seriously: a voice-first assistant that travels with you, learns you over time, and earns the access it asks for.

What It Does

Abyss is an iPhone-native AI assistant backed by a Node.js/TypeScript WebSocket conductor and a paired macOS bridge for privileged local actions.

From your phone, by voice, you can:

  • Triage Gmail and draft replies that surface for your approval before sending
  • Create and manage Google Calendar events
  • Check Canvas LMS assignments and deadlines
  • Spawn Cursor Cloud Agents to work on your codebase while you're in transit
  • Search the web, execute shell commands, read and write files on your Mac

The system routes requests through Amazon Bedrock, dynamically selecting between Nova Lite and Nova Pro based on task complexity. Real-time voice is powered by Nova Sonic. A context graph built on Amazon Neptune Analytics and Amazon Titan Embeddings persists knowledge across sessions — your preferences, past decisions, open threads — so Abyss genuinely learns you over time.
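The routing decision can be sketched as a small pure function. The tool names below are illustrative, not Abyss's actual identifiers, though the model IDs follow Bedrock's public naming; treat this as a sketch of the escalation rule described under How We Built It, not the production code.

```typescript
// Sketch of Lite/Pro routing. Tool names are illustrative assumptions;
// the rule: escalate whenever a "heavy" tool could run in this request.
type ToolName =
  | "gmail" | "calendar" | "canvas" | "web_search"
  | "bridge_exec" | "cursor_agent" | "browser";

const HEAVY_TOOLS: ReadonlySet<ToolName> = new Set<ToolName>([
  "bridge_exec", "cursor_agent", "browser",
]);

function selectModel(toolsInScope: ToolName[]): string {
  // Any heavy tool in scope -> Nova Pro; otherwise the cheaper,
  // faster Nova Lite handles the request.
  const needsPro = toolsInScope.some((t) => HEAVY_TOOLS.has(t));
  return needsPro ? "amazon.nova-pro-v1:0" : "amazon.nova-lite-v1:0";
}
```

Because the rule depends only on which tools are in scope, the same request always routes to the same model, which keeps costs predictable.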

How We Built It

The architecture is three layers:

  1. iPhone client — SwiftUI, AVFoundation audio pipeline, WhisperKit on-device transcription, ElevenLabs TTS, URLSession WebSockets
  2. Conductor server — Node.js + TypeScript WebSocket server on AWS ECS Fargate, orchestrating tool calls, streaming responses, and session state via Amazon Bedrock
  3. macOS bridge — Swift app with individually toggled permissions; isolated from the phone client and server, constrained to user-selected workspace roots

All inter-component communication uses a strict EventEnvelope schema with deterministic SHA-256 IDs. The context graph runs hybrid vector + graph search on Neptune Analytics, with embeddings generated by Titan Text Embeddings V2 (256 dimensions). Dynamic model routing is deterministic: if any "heavy" tool (bridge execution, Cursor agents, browser automation) is in scope, the request escalates to Nova Pro.
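A deterministic envelope ID can be derived by hashing a stable serialization of the envelope's fields. The field set and canonicalization below are assumptions for illustration; the real EventEnvelope schema is not reproduced here.

```typescript
import { createHash } from "node:crypto";

// Hypothetical envelope shape; field names are illustrative.
interface EventEnvelope {
  type: string;      // e.g. "tool.call"
  sessionId: string;
  seq: number;       // per-session sequence number
  payload: unknown;
}

// Hash a stable serialization so the same logical event maps to the
// same ID on both the Swift client and the TypeScript server.
function envelopeId(env: EventEnvelope): string {
  const canonical = JSON.stringify([env.type, env.sessionId, env.seq, env.payload]);
  return createHash("sha256").update(canonical).digest("hex");
}
```

Deterministic IDs make duplicate delivery over the WebSocket harmless: a replayed event hashes to an ID the receiver has already seen.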

Context summarization kicks in after 30 conversation turns, compressing older history into a 3–6 sentence summary that gets prepended to every subsequent request — keeping the effective context window meaningful without inflating token cost.
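The trigger logic is simple to sketch. The summarize callback stands in for a Bedrock request that compresses history into 3–6 sentences, and the number of recent turns kept verbatim is an assumption:

```typescript
// Sketch of the turn-count summarization trigger.
interface Turn { role: "user" | "assistant"; text: string; }

const SUMMARIZE_AFTER = 30; // turns kept verbatim before compression kicks in
const KEEP_RECENT = 10;     // assumed: recent turns preserved verbatim

function buildContext(
  turns: Turn[],
  summarize: (older: Turn[]) => string, // stand-in for a Bedrock call
): { summary: string | null; recent: Turn[] } {
  if (turns.length <= SUMMARIZE_AFTER) {
    return { summary: null, recent: turns };
  }
  const cut = turns.length - KEEP_RECENT;
  // Older turns collapse into a short summary that gets prepended
  // to every subsequent request.
  return { summary: summarize(turns.slice(0, cut)), recent: turns.slice(cut) };
}
```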

Challenges

Security without friction. Giving an AI assistant real local access is genuinely dangerous if done naively. We spent significant time designing the permission model — individual toggles per capability, workspace root constraints, confirmation cards before any mutation is finalized. Getting that to feel natural rather than bureaucratic was harder than it sounds.
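The workspace root constraint boils down to one check. The bridge itself is Swift, so this is only a TypeScript sketch of the idea: resolve the candidate path first so `../` traversal can't escape a user-selected root.

```typescript
import * as path from "node:path";

// Sketch of the bridge's workspace-root constraint. A file operation is
// only allowed if the path, after resolution, lands inside a root the
// user explicitly selected. Resolving before comparing defeats "../"
// traversal out of the workspace.
function isInsideWorkspace(roots: string[], candidate: string): boolean {
  const resolved = path.resolve(candidate);
  return roots.some((root) => {
    const r = path.resolve(root);
    return resolved === r || resolved.startsWith(r + path.sep);
  });
}
```

Note the `r + path.sep` suffix: a plain prefix check would wrongly accept `/tmp/wsx` as inside `/tmp/ws`.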

Voice latency on mobile. Chaining on-device Whisper transcription → WebSocket → Bedrock inference → ElevenLabs TTS introduced compounding latency at every hop. We optimized aggressively: streaming responses back as they generate, push-to-talk as a low-latency alternative to VAD, and Nova Sonic for end-to-end voice when latency matters most.

Context that actually transfers. Building a memory system that feels like genuine understanding rather than a key-value store required the full Neptune + Titan stack. Getting hybrid vector + graph retrieval to surface the right context — not just recent context — took careful tuning of the neighborhood traversal depth and vector similarity threshold.

iOS/server protocol consistency. Keeping the Swift and TypeScript protocol libraries in sync across a fast-moving codebase required strict shared JSON schemas and discipline about versioning. A single mismatched field in EventEnvelope would silently break entire tool flows.
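On the TypeScript side, that discipline means validating every inbound envelope at the boundary. The writeup doesn't name a validation approach, so this is a hand-rolled guard with illustrative field names; the point is that a mismatched field fails loudly instead of silently breaking a tool flow.

```typescript
// Runtime guard mirroring a shared JSON schema. Field names and the
// version constant are assumptions for illustration.
const PROTOCOL_VERSION = 1;

interface Envelope {
  v: number;
  id: string;
  type: string;
  payload: Record<string, unknown>;
}

function parseEnvelope(raw: string): Envelope {
  const o = JSON.parse(raw);
  if (
    typeof o !== "object" || o === null ||
    o.v !== PROTOCOL_VERSION ||
    typeof o.id !== "string" ||
    typeof o.type !== "string" ||
    typeof o.payload !== "object" || o.payload === null
  ) {
    throw new Error(`malformed EventEnvelope: ${raw.slice(0, 80)}`);
  }
  return o as Envelope;
}
```

A mirrored `Decodable` guard on the Swift side, generated from the same schema file, is what keeps the two libraries from drifting.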

What We Learned

Building on Amazon Bedrock gave us something we didn't fully anticipate: the ability to treat model selection as a runtime decision rather than a design-time commitment. Dynamic routing between Nova Lite and Pro — purely based on which tools are in scope — let us keep the system fast and cheap for everyday tasks while still having real horsepower available when the work demands it.

Nova Sonic changed how we think about voice interfaces. Bidirectional streaming audio with tool-calling capability means voice doesn't have to be a thin wrapper over a text model — it can be a first-class execution surface.

And perhaps most importantly: security-first design is not the same as security-as-afterthought. The permission model we built — isolated bridge, per-capability toggles, confirmation before mutation — made the system feel more useful, not less, because users trusted it with more.

What's Next

  • Android client
  • Apple Watch surface for truly ambient interaction
  • Richer Neptune graph traversal for longer-horizon memory
  • Expanded integration set (Notion, Linear, Slack)
  • User-controlled memory deletion and retention policies

The north star hasn't changed: the most capable, most trusted, most personally attuned assistant you've ever had — and it fits in your pocket.
