Inspiration
Voice data is biometric. Images contain locations, faces, and documents. Creative prompts reveal intent and ideation. With centralized AI providers, this data can be stored, analyzed, and potentially leaked. We wanted an AI agent that could see, listen, speak, and create without ever knowing who you are. That is only possible when inference itself is private, which led us to Venice AI's zero-data-retention APIs, with the agent itself deployed on Akash decentralized compute.
What it does
Vox is a privacy-first autonomous multimodal AI agent:
- Voice Interaction — Speak to it, it speaks back (Venice Whisper ASR + Kokoro TTS, with 3-layer hallucination defense)
- Visual Perception — Upload images for intelligent analysis via Venice Vision
- Creative Generation — Generate images from natural language descriptions
- Web Search — Real-time information retrieval with citations
- Semantic Memory — Persistent vector memory with embedding-based recall across conversations
- Autonomous Monitoring — Scheduled background tasks (URL monitoring, periodic reports) that run even when you're offline
- Complete Privacy — All inference via Venice AI with zero data retention; deployed on Akash decentralized compute
How we built it
Dual AI provider architecture:
- AkashML (Qwen/Qwen3-30B-A3B) handles core agent reasoning and tool calling
- Venice AI provides all 6 multimodal capabilities: chat/vision, ASR, TTS, image generation, embeddings, and web search — all with zero data retention
Agent core: LangGraph StateGraph with a 4-node reflection loop (agent → tools → reflect → respond). This is an actor-critic pattern: the agent generates a response, a reflector critiques it, and the agent revises, which is intended to produce higher-quality outputs.
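The control flow of that loop can be sketched without any framework. This is a dependency-free illustration in plain Python, not LangGraph code and not the project's implementation; `State`, `run_graph`, `MAX_REVISIONS`, and the three stub nodes are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field

MAX_REVISIONS = 2  # hypothetical safety limit on reflect -> agent loops


@dataclass
class State:
    question: str
    draft: str = ""
    tool_results: list = field(default_factory=list)
    critique: str = ""
    revisions: int = 0


def run_graph(state, agent, tools, reflect):
    """Walk agent -> tools -> reflect until the critic accepts, then respond."""
    while True:
        state.draft = agent(state)         # actor proposes (or revises) a draft
        state.tool_results = tools(state)  # run whatever tools the draft needs
        state.critique = reflect(state)    # critic returns "" when satisfied
        if not state.critique or state.revisions >= MAX_REVISIONS:
            return state.draft             # respond node: hand back the draft
        state.revisions += 1               # otherwise loop back to the agent


# Stub nodes standing in for real model and tool calls.
def agent(s):
    return f"draft v{s.revisions + 1} for {s.question!r}"


def tools(s):
    return []


def reflect(s):
    return "" if s.revisions >= 1 else "add a citation"


answer = run_graph(State("what is WAL?"), agent, tools, reflect)
```

With these stubs the critic rejects the first draft and accepts the second, so the loop runs exactly one revision. The revision cap is what keeps a stubborn critic from looping forever.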
8 built-in tools, including web search, image generation, image analysis, monitoring task creation, activity logging, and semantic remember/recall (Venice embeddings + numpy cosine similarity).
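As a rough illustration of the recall step, here is a minimal sketch of cosine-similarity ranking with numpy. It assumes embeddings are already available as plain float vectors; the `recall` function, its parameters, and the toy 3-dimensional vectors are hypothetical and stand in for real embedding output.

```python
import numpy as np


def recall(query_vec, memory_vecs, texts, top_k=3):
    """Return the top_k stored texts ranked by cosine similarity to the query."""
    q = np.asarray(query_vec, dtype=float)
    m = np.asarray(memory_vecs, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms.
    denom = np.linalg.norm(m, axis=1) * np.linalg.norm(q)
    sims = (m @ q) / np.maximum(denom, 1e-12)  # guard against zero vectors
    order = np.argsort(-sims)[:top_k]          # indices, most similar first
    return [(texts[i], float(sims[i])) for i in order]


# Toy vectors in place of real embeddings.
memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
notes = ["likes tea", "lives in Oslo", "tea shop in Oslo"]
hits = recall([1.0, 0.1, 0.0], memory, notes, top_k=2)
```

A brute-force scan like this is usually adequate for a personal memory store of a few thousand entries; a dedicated vector index only starts to matter at much larger scale.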
Stack: Python 3.12, FastAPI + Uvicorn (REST + WebSocket), LangGraph, AsyncSqliteSaver checkpointing, SQLite WAL mode, APScheduler for autonomous tasks, vanilla HTML/CSS/JS frontend. Dockerized and deployed on Akash Network with persistent beta3 storage.
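For readers unfamiliar with WAL mode, the sketch below shows one way to enable it using Python's standard `sqlite3` module. The `open_wal` helper, the `activity` table, and the file name are hypothetical, and the `synchronous=NORMAL` setting is a common pairing with WAL rather than something the project states it uses.

```python
import os
import sqlite3
import tempfile


def open_wal(path):
    """Open a SQLite database and switch it to write-ahead logging."""
    conn = sqlite3.connect(path)
    # The pragma returns the journal mode actually in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        raise RuntimeError(f"WAL not enabled, got {mode!r}")
    # Assumption: NORMAL is a common durability/speed trade-off under WAL.
    conn.execute("PRAGMA synchronous=NORMAL")
    return conn


db_path = os.path.join(tempfile.mkdtemp(), "vox_demo.db")  # hypothetical name
conn = open_wal(db_path)
conn.execute("CREATE TABLE IF NOT EXISTS activity (id INTEGER PRIMARY KEY, msg TEXT)")
with conn:  # commits on success, rolls back if the block raises
    conn.execute("INSERT INTO activity (msg) VALUES (?)", ("agent started",))
rows = conn.execute("SELECT msg FROM activity").fetchall()
```

WAL lets readers proceed while a writer is active, which suits a process where scheduled background tasks write while the API serves reads.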
Research foundation
Every core design decision is grounded in published research. We conducted deep research (489 learnings from 423 sources) and studied 4 key papers:
| Paper | Key Insight Applied |
|---|---|
| AIOpsLab (2501.06706) | Autonomous incident lifecycle → persistent task queue, URL monitoring, activity logging |
| ReVision (2502.14780) | Privacy-preserving visual instruction rewriting → Venice zero-retention for all vision/voice/image calls |
| ReliabilityBench (2601.06112) | ReAct outperforms Reflexion under stress; backoff + jitter critical → retry with exponential backoff on all API calls, iteration safety limits |
| FAME (2601.14735) | Planner/Actor/Evaluator decomposition → LangGraph 4-node agent graph, AsyncSqliteSaver checkpointing, semantic memory persistence |
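The backoff-and-jitter idea in the ReliabilityBench row can be sketched as follows. This is a generic "full jitter" retry helper under stated assumptions, not the project's code: `retry_with_backoff`, its defaults, and the `flaky` stub are hypothetical, and a real client would catch only transient error types rather than every `Exception`.

```python
import random
import time


def retry_with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Call `call()` until it succeeds or max_attempts is exhausted."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:  # sketch only: narrow this to transient errors
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential ceiling: base * 2^attempt, bounded by max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: wait a random duration in [0, cap].
            sleep(random.uniform(0, cap))


# A stub that fails twice, then succeeds.
attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = retry_with_backoff(flaky, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. The jitter spreads retries out so that many clients recovering at once do not hit the API in lockstep.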
Challenges we faced
- Whisper hallucination on silence — Whisper generates phantom transcriptions ("Thank you", "Bye") on silent audio. We built a 3-layer defense: WAV RMS energy detection, prompt anchoring, and post-transcription filtering.
- Dual-provider coordination — Routing reasoning to AkashML and multimodal calls to Venice while keeping a unified tool-calling interface required careful abstraction in the client layer.
- Fault tolerance on decentralized infra — SQLite WAL mode, asyncio.shield for critical writes, retry with jitter, and Docker health checks ensure the agent survives Akash container restarts without data loss.
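Two of the three Whisper defense layers from the list above, the energy gate and the post-transcription filter, can be sketched with the standard library alone. This is an illustration under assumptions: the threshold value, the function names, and the `transcribe` callable are hypothetical, the phrase list contains only the two examples mentioned above, and prompt anchoring (the second layer) would live inside the ASR request and is not shown.

```python
import io
import math
import struct
import wave

SILENCE_RMS = 200.0                      # hypothetical threshold for 16-bit PCM
PHANTOM_PHRASES = {"thank you", "bye"}   # examples from the text, not exhaustive


def wav_rms(wav_bytes):
    """Root-mean-square amplitude of a 16-bit PCM WAV payload."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch handles 16-bit PCM only")
        frames = wf.readframes(wf.getnframes())
    count = len(frames) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frames)
    return math.sqrt(sum(s * s for s in samples) / count)


def is_phantom(text):
    """True if the transcript is just a known phantom phrase."""
    return text.strip().strip(".!?").lower() in PHANTOM_PHRASES


def accept_transcript(wav_bytes, transcribe):
    """Return a transcript, or None if the audio or text looks like silence."""
    if wav_rms(wav_bytes) < SILENCE_RMS:  # layer 1: skip ASR on silent audio
        return None
    text = transcribe(wav_bytes)          # layer 2 (prompt anchoring) goes here
    return None if is_phantom(text) else text  # layer 3: filter phantoms


def make_wav(samples, rate=16000):
    """Build an in-memory mono 16-bit WAV for the demo below."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buf.getvalue()


silent = make_wav([0] * 1600)
tone = make_wav([int(8000 * math.sin(2 * math.pi * 440 * n / 16000))
                 for n in range(1600)])
```

Gating on energy first also saves an API call for every silent clip, since the transcriber is never invoked when the RMS falls below the threshold.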
What we learned
Privacy and capability don't have to be trade-offs: Venice's OpenAI-compatible API provides full multimodal AI with zero data retention, and deploying on Akash avoids relying on a single centralized host.
What's next
Expanding Vox's autonomous capabilities: multi-agent collaboration, richer scheduling primitives, and streaming voice responses for real-time conversation.

