Vox — Private Voice-First Visual Intelligence Agent

Inspiration

Voice data is biometric. Images contain locations, faces, and documents. Creative prompts reveal intent and ideation. With centralized AI providers, this data can be stored, analyzed, and potentially leaked. We wanted an AI agent that could see, listen, speak, and create without ever knowing who you are. That is only possible when inference itself is private, which led us to Venice AI's zero-data-retention APIs, with the agent itself deployed on Akash decentralized compute.

What it does

Vox is a privacy-first autonomous multimodal AI agent:

Voice Interaction — Speak to it, it speaks back (Venice Whisper ASR + Kokoro TTS, with 3-layer hallucination defense)
Visual Perception — Upload images for intelligent analysis via Venice Vision
Creative Generation — Generate images from natural language descriptions
Web Search — Real-time information retrieval with citations
Semantic Memory — Persistent vector memory with embedding-based recall across conversations
Autonomous Monitoring — Scheduled background tasks (URL monitoring, periodic reports) that run even when you're offline
Complete Privacy — All multimodal inference goes through Venice AI with zero data retention, core reasoning runs on AkashML, and the agent is deployed on Akash decentralized compute

How we built it

Dual AI provider architecture:

AkashML (Qwen/Qwen3-30B-A3B) handles core agent reasoning and tool calling
Venice AI provides all 6 multimodal capabilities: chat/vision, ASR, TTS, image generation, embeddings, and web search — all with zero data retention

Agent core: LangGraph StateGraph with a 4-node reflection loop (agent → tools → reflect → respond). The loop follows an actor-critic pattern: the agent drafts a response, a reflector critiques it, and the agent revises the draft before responding, with the aim of higher-quality outputs.
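The control flow of that loop can be sketched without any framework. The version below is a dependency-free illustration, not the LangGraph API and not the project's actual graph: `AgentState`, `run_graph`, `MAX_ITERATIONS`, and the three demo callables are all hypothetical names, and the stub functions stand in for real model and tool calls.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_ITERATIONS = 3  # hypothetical safety limit on agent/reflect cycles


@dataclass
class AgentState:
    question: str
    draft: str = ""
    critique: str = ""
    tool_results: list[str] = field(default_factory=list)
    iterations: int = 0


def run_graph(
    state: AgentState,
    agent: Callable[[AgentState], str],
    tools: Callable[[AgentState], list[str]],
    reflect: Callable[[AgentState], str],
) -> str:
    """Cycle agent -> tools -> reflect until the critic approves or the limit is hit."""
    while True:
        state.iterations += 1
        state.draft = agent(state)                # agent node: draft or revise
        state.tool_results.extend(tools(state))   # tools node: gather evidence
        state.critique = reflect(state)           # reflect node: critique the draft
        if not state.critique or state.iterations >= MAX_ITERATIONS:
            return state.draft                    # respond node: emit the answer


# Stub nodes standing in for real model and tool calls.
def demo_agent(state: AgentState) -> str:
    if state.critique:  # a critique exists, so revise using the latest tool result
        return f"{state.draft} (source: {state.tool_results[-1]})"
    return "Paris is the capital of France"


def demo_tools(state: AgentState) -> list[str]:
    return ["example.org/france"]


def demo_reflect(state: AgentState) -> str:
    return "" if "source:" in state.draft else "cite a source"


demo_state = AgentState(question="What is the capital of France?")
final_answer = run_graph(demo_state, demo_agent, demo_tools, demo_reflect)
```

In this toy run the critic rejects the first draft and accepts the second, so the loop ends after two iterations. The iteration cap is what keeps a critic that never approves from looping indefinitely.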

8 built-in tools, including web search, image generation, image analysis, monitoring task creation, activity logging, and semantic remember/recall (Venice embeddings + numpy cosine similarity).
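As a rough illustration of the recall step, the sketch below ranks stored memories by cosine similarity with numpy. It is a minimal example under assumptions: the `recall` function and its parameters are hypothetical, toy 3-dimensional vectors stand in for real embeddings, and no embedding API is called.

```python
import numpy as np


def recall(
    query_vec: np.ndarray,
    memory_vecs: np.ndarray,
    texts: list[str],
    k: int = 2,
) -> list[tuple[str, float]]:
    """Return the k stored texts whose embeddings are closest to the query."""
    # Normalise to unit length so a dot product equals cosine similarity.
    # Assumes no vector is all zeros.
    q = query_vec / np.linalg.norm(query_vec)
    m = memory_vecs / np.linalg.norm(memory_vecs, axis=1, keepdims=True)
    scores = m @ q
    top = np.argsort(scores)[::-1][:k]  # indices of the highest scores first
    return [(texts[i], float(scores[i])) for i in top]


# Toy vectors stand in for embeddings returned by an embedding model.
texts = ["likes jazz", "lives in Oslo", "allergic to nuts"]
memory = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
hits = recall(np.array([0.9, 0.1, 0.0]), memory, texts, k=2)
```

A brute-force scan like this is usually adequate for a per-user memory of modest size. A dedicated vector index only becomes worthwhile once the store grows large.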

Stack: Python 3.12, FastAPI + Uvicorn (REST + WebSocket), LangGraph, AsyncSqliteSaver checkpointing, SQLite WAL mode, APScheduler for autonomous tasks, vanilla HTML/CSS/JS frontend. Dockerized and deployed on Akash Network with persistent beta3 storage.
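For readers unfamiliar with SQLite's WAL mode, the sketch below shows one way to enable it with Python's standard `sqlite3` module. The `open_wal_db` helper, the `activity_log` table, and the `synchronous=NORMAL` setting are assumptions made for illustration and are not taken from the project.

```python
import os
import sqlite3
import tempfile


def open_wal_db(path: str) -> sqlite3.Connection:
    """Open a SQLite database file and switch it to write-ahead logging."""
    conn = sqlite3.connect(path)
    # The pragma returns the journal mode actually in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        raise RuntimeError(f"WAL not enabled, got {mode!r}")
    # Assumption: NORMAL is a common pairing with WAL; the source does not state it.
    conn.execute("PRAGMA synchronous=NORMAL")
    return conn


# Demo against a throwaway file (WAL needs a real file, not ":memory:").
db_path = os.path.join(tempfile.mkdtemp(), "vox_demo.db")
conn = open_wal_db(db_path)
conn.execute(
    "CREATE TABLE IF NOT EXISTS activity_log "
    "(id INTEGER PRIMARY KEY, message TEXT NOT NULL)"
)
with conn:  # commits on success, rolls back on exception
    conn.execute("INSERT INTO activity_log (message) VALUES (?)", ("agent started",))
rows = conn.execute("SELECT message FROM activity_log").fetchall()
journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
```

WAL lets readers proceed while a write is in progress, which suits a service where scheduled tasks and request handlers share one database file.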

Research foundation

Every core design decision is grounded in published research. We conducted deep research (489 learnings from 423 sources) and studied 4 key papers:

Paper	Key Insight Applied
AIOpsLab (2501.06706)	Autonomous incident lifecycle → persistent task queue, URL monitoring, activity logging
ReVision (2502.14780)	Privacy-preserving visual instruction rewriting → Venice zero-retention for all vision/voice/image calls
ReliabilityBench (2601.06112)	ReAct outperforms Reflexion under stress; backoff + jitter critical → retry with exponential backoff on all API calls, iteration safety limits
FAME (2601.14735)	Planner/Actor/Evaluator decomposition → LangGraph 4-node agent graph, AsyncSqliteSaver checkpointing, semantic memory persistence
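The retry policy mentioned in the table can be sketched as follows. This is a minimal asyncio example using the "full jitter" variant of exponential backoff. The `retry_with_backoff` name, its default values, and the `flaky` demo call are hypothetical, and a real client would catch only errors known to be transient rather than every exception.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    call: Callable[[], Awaitable[T]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
) -> T:
    """Await call(), retrying failures with capped exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return await call()
        except Exception:  # illustration only; narrow this in real code
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            # Full jitter: sleep a random time up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            await asyncio.sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")


# Demo: a call that fails twice with a transient error, then succeeds.
attempts = 0


async def flaky() -> str:
    global attempts
    attempts += 1
    if attempts < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = asyncio.run(retry_with_backoff(flaky, base_delay=0.01))
```

Randomising the delay spreads out retries from concurrent callers, so they do not all hit a recovering API at the same moment.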

Challenges we faced

Whisper hallucination on silence — Whisper generates phantom transcriptions ("Thank you", "Bye") on silent audio. We built a 3-layer defense: WAV RMS energy detection, prompt anchoring, and post-transcription filtering.
Dual-provider coordination — Routing reasoning to AkashML and multimodal calls to Venice while keeping a unified tool-calling interface required careful abstraction in the client layer.
Fault tolerance on decentralized infra — SQLite WAL mode, asyncio.shield for critical writes, retry with jitter, and Docker health checks ensure the agent survives Akash container restarts without data loss.
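Two of the three silence-defense layers described above, the RMS energy gate and the post-transcription filter, can be sketched with the standard library alone. Everything named here is an assumption for illustration: the threshold value, the phrase list (seeded with the two examples above), and the helper names. Prompt anchoring is omitted because it depends on the ASR request itself.

```python
import io
import math
import struct
import wave

SILENCE_RMS_THRESHOLD = 500.0           # hypothetical; tune on real recordings
PHANTOM_PHRASES = {"thank you", "bye"}  # seeded from the examples above


def wav_rms(wav_bytes: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit PCM WAV payload."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM")
        frames = wf.readframes(wf.getnframes())
    count = len(frames) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frames)  # WAV PCM is little-endian
    return math.sqrt(sum(s * s for s in samples) / count)


def is_silent(wav_bytes: bytes, threshold: float = SILENCE_RMS_THRESHOLD) -> bool:
    """Layer 1: skip transcription when the clip carries too little energy."""
    return wav_rms(wav_bytes) < threshold


def is_phantom(transcript: str) -> bool:
    """Layer 3: drop transcripts that are only a known phantom phrase."""
    normalised = transcript.strip().strip(".!?").strip().lower()
    return normalised in PHANTOM_PHRASES


def make_wav(samples: list[int], rate: int = 16000) -> bytes:
    """Build a mono 16-bit WAV in memory (demo helper)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buf.getvalue()


silence = make_wav([0] * 1600)
tone = make_wav([int(10000 * math.sin(2 * math.pi * 440 * n / 16000)) for n in range(1600)])
```

Checking energy before calling the ASR endpoint also avoids an API call for clips that contain no speech.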

What we learned

Privacy and capability need not be trade-offs: Venice's OpenAI-compatible API gave us full multimodal AI with zero data retention on the provider side. Deploying on Akash's decentralized compute also means the agent does not depend on a single centralized cloud host.

What's next

Expanding Vox's autonomous capabilities: multi-agent collaboration, richer scheduling primitives, and streaming voice responses for real-time conversation.