Inspiration
AR has enormous untapped potential — but it's stayed niche because building useful AR is genuinely hard. You have to fuse spatial tracking, plane detection, and 3D rendering, and even then you're left with experiences that mostly show things rather than understand you.
We had a simple thesis: AI is the missing interface layer for AR. Instead of forcing people to learn AR controls — or worse, learn to "prompt" — what if you could just talk to your reality? Say "Hey jARvis, what am I looking at?", "show me the latest news," or "put a book on that table," and an agent figures out the intent, reaches for the right tool, and augments the world around you.
So we built jARvis: a thin, conversational AI layer on top of AR that lets anyone pull the full power of multi-agent AI into their physical space. The agent treats spatial data as just another tool it can reach for on demand — turning AR from a hard engineering problem into a natural conversation.
What it does
- Just talk — no prompting. Wake-word voice ("Hey jARvis") → speech-to-text → agent → spoken reply, fully hands-free.
- Scene understanding. Ask "what do you see?" or "where am I?" and a vision-language model reads the live camera frame and answers.
- Spatial placement. "Add a book on the table" → the agent picks the object + surface type, and you tap a detected plane to anchor it in AR.
- Live caption translation. Streams audio segments to ElevenLabs Scribe v2 Realtime and pushes translated English subtitles back in near real time.
- AR information cards. "Latest news on X" or "show me images of Y" renders live cards floating in your space.
How we built it
Frontend — Expo SDK 55 / React Native 0.83
@reactvision/react-viro on ARKit for the AR scene: ViroARPlaneSelector for tap-to-place, ViroARCamera, image/text nodes for cards. expo-audio for mic capture + playback; expo-glass-effect for a Liquid-Glass HUD; reanimated for the assistant's animated states.
Backend — FastAPI + LangGraph multi-agent orchestrator
The core is a LangGraph state machine that classifies one utterance and routes it to a specialized agent:
START → classify ─┬─ news   → news_agent  ┐
                  ├─ image  → image_agent ├→ synthesize → END
                  ├─ chat   → chat_agent  ┘
                  ├─ vision → vision_node (asks client for a frame)
                  ├─ place  → place_node (emits placement command)
                  └─ clear  → clear_node
- Routing uses a deterministic keyword pass first, falling back to an LLM with structured output (langchain-openai) only for ambiguous input.
- Vision uses a separate vision-language model over the base64 camera frame.
- Translation relays WAV→PCM segments over a WebSocket to ElevenLabs realtime STT, then LLM-translates committed transcripts.
- Voice I/O proxies ElevenLabs STT and TTS server-side so the API key never touches the device.
- Tools (news/image search via ddgs) are cached with fresh/stale TTLs so DuckDuckGo throttling never leaves the view blank.
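The keyword-first routing described above can be sketched in a few lines of plain Python. This is an illustration, not the project's actual router: the regex patterns, the `classify` function, and the `llm_fallback` callable (a stand-in for the structured-output LLM call) are all assumptions. Only the intent names come from the graph.

```python
import re

# Intent names mirror the graph's branches; the patterns are illustrative guesses.
VALID_INTENTS = {"news", "image", "chat", "vision", "place", "clear"}

KEYWORD_RULES = [
    ("clear", re.compile(r"\b(clear|reset)\b")),
    ("place", re.compile(r"\b(add|put|place)\b.*\bon\b")),
    ("vision", re.compile(r"\b(what do you see|what am i looking at|where am i)\b")),
    ("news", re.compile(r"\b(news|headlines)\b")),
    ("image", re.compile(r"\b(images?|pictures?|photos?)\b")),
]


def classify(utterance, llm_fallback=None):
    """Return an intent label for one utterance.

    Unambiguous keywords win first. The optional llm_fallback (hypothetical:
    any callable mapping text to a label) is consulted only when no rule
    matches, and anything it gets wrong or fails on degrades to "chat".
    """
    text = utterance.lower().strip()
    for intent, pattern in KEYWORD_RULES:
        if pattern.search(text):
            return intent
    if llm_fallback is not None:
        try:
            guess = llm_fallback(text)
        except Exception:
            guess = None  # no API key, timeout, etc.: fall through to chat
        if guess in VALID_INTENTS:
            return guess
    return "chat"


if __name__ == "__main__":
    for phrase in ["Add a cup on the laptop", "Latest news on Mars", "Tell me a joke"]:
        print(phrase, "->", classify(phrase))
```

Because the fallback is optional and its failures are swallowed, the same function still returns a usable label with no LLM configured at all.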
What we learned
- AI makes AR approachable. Letting the agent treat spatial actions (place, look) as tools collapsed a huge amount of AR UX complexity into plain language.
- Agents need an eye and a hand. The cleanest pattern was a tool-call back to the device: the agent can't grab a camera frame or pick a plane itself, so it emits a look or place action and the client completes it. The model decides what; the device owns where.
- Determinism beats a clever router. A small LLM router mis-filed obvious intents ("latest news," "add a cup on the laptop") as chat, so trusting unambiguous keywords first was both more reliable and cheaper.
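One way to picture the tool-call back to the device is as a small request/reply exchange. The sketch below is hypothetical throughout: the `ClientAction` shape, its field names (`kind`, `payload`, `action_id`), and the helper functions are assumptions made for illustration, not the message format jARvis actually uses.

```python
import json
import uuid
from dataclasses import dataclass, field, asdict


@dataclass
class ClientAction:
    """An action the agent cannot perform itself and hands to the device."""
    kind: str  # "look" (grab a camera frame) or "place" (anchor an object)
    payload: dict = field(default_factory=dict)
    action_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def to_json(self):
        return json.dumps(asdict(self))


def make_look_action():
    # The model decides that it needs to see; the device supplies the frame.
    return ClientAction("look")


def make_place_action(obj, surface):
    # The model decides what to place and on which surface type;
    # the device (via the user's tap on a detected plane) decides where.
    return ClientAction("place", {"object": obj, "surface": surface})


def complete_action(action, client_reply_json):
    """Match a client reply to the pending action and return its result."""
    reply = json.loads(client_reply_json)
    if reply.get("action_id") != action.action_id:
        raise ValueError("reply does not match the pending action")
    return reply.get("result", {})


if __name__ == "__main__":
    action = make_place_action("book", "table")
    print(action.to_json())  # what the server would send to the client
    reply = json.dumps({"action_id": action.action_id, "result": {"anchored": True}})
    print(complete_action(action, reply))
```

Tagging each action with an id keeps the server stateless about where things end up in the room: it only needs to know that the device finished the step it asked for.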
Challenges we faced
- Viro + New Architecture gates. The AR native module silently fails to build unless newArchEnabled and RCT_NEW_ARCH_ENABLED=1 are both set: a quiet, hard-to-diagnose footgun.
- Real-time translation latency. True ~150 ms streaming needs continuous PCM from a native audio module. We stayed on expo-audio by streaming short WAV segments to the realtime API instead, trading ~1–2 s of subtitle lag for a far simpler, cross-platform pipeline.
- Search reliability. DuckDuckGo throttles rapid calls and intermittently returns nothing, so we added fresh/stale caching and multi-phrasing fallbacks so the AR view never goes blank.
- Graceful degradation. Every LLM-dependent node has a deterministic fallback, so the whole graph still runs with no API key configured.
- On-device iOS builds. We had to patch @expo/cli to get run:ios --device working against lockdownd.
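The fresh/stale caching mentioned under search reliability can be sketched as follows. This is a minimal illustration under assumptions, not the project's code: the `FreshStaleCache` class, its TTL values, and the `fetch` callable (standing in for a ddgs search call) are all hypothetical.

```python
import time


class FreshStaleCache:
    """Serve fresh results without refetching; fall back to stale ones
    when the upstream search fails or comes back empty."""

    def __init__(self, fresh_ttl=60.0, stale_ttl=900.0, clock=time.monotonic):
        self.fresh_ttl = fresh_ttl  # seconds during which no refetch happens
        self.stale_ttl = stale_ttl  # seconds during which old results may still be shown
        self._clock = clock
        self._store = {}  # key -> (stored_at, value)

    def get(self, key, fetch):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.fresh_ttl:
            return entry[1]  # fresh hit: skip the network entirely
        try:
            value = fetch()
        except Exception:
            value = None  # throttled or failed: treat like an empty result
        if value:
            self._store[key] = (now, value)
            return value
        if entry is not None and now - entry[0] < self.stale_ttl:
            return entry[1]  # stale but better than a blank view
        return []


if __name__ == "__main__":
    cache = FreshStaleCache(fresh_ttl=5, stale_ttl=60)
    print(cache.get("news:mars", lambda: ["headline one", "headline two"]))
    # A throttled call inside the fresh window never reaches the network.
    print(cache.get("news:mars", lambda: []))
```

Only non-empty results overwrite the cache, so an intermittent empty response cannot evict the cards that are already on screen.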
Built with
Expo SDK 55 · React Native · @reactvision/react-viro (ARKit) · expo-audio · expo-glass-effect · FastAPI · LangGraph · LangChain · OpenAI-compatible LLM + vision model · ElevenLabs (STT / TTS / Scribe v2 Realtime) · DuckDuckGo Search