Inspiration

AR has enormous untapped potential — but it's stayed niche because building useful AR is genuinely hard. You have to fuse spatial tracking, plane detection, and 3D rendering, and even then you're left with experiences that mostly show things rather than understand you.

We had a simple thesis: AI is the missing interface layer for AR. Instead of forcing people to learn AR controls (or worse, learn to "prompt"), what if you could just talk to your reality? Say "Hey jARvis, what am I looking at?", "show me the latest news", or "put a book on that table", and an agent figures out the intent, reaches for the right tool, and augments the world around you.

So we built jARvis: a thin, conversational AI layer on top of AR that lets anyone pull the full power of multi-agent AI into their physical space. The agent treats spatial data as just another tool it can reach for on demand — turning AR from a hard engineering problem into a natural conversation.

What it does

  • Just talk — no prompting. Wake-word voice ("Hey jARvis") → speech-to-text → agent → spoken reply, fully hands-free.
  • Scene understanding. Ask "what do you see?" or "where am I?" and a vision-language model reads the live camera frame and answers.
  • Spatial placement. "Add a book on the table" → the agent picks the object + surface type, and you tap a detected plane to anchor it in AR.
  • Live caption translation. Streams short audio segments to ElevenLabs Scribe v2 Realtime and pushes translated English subtitles back with roughly 1–2 s of lag.
  • AR information cards. "Latest news on X" or "show me images of Y" renders live cards floating in your space.

How we built it

Frontend — Expo SDK 55 / React Native 0.83

  • @reactvision/react-viro on ARKit for the AR scene: ViroARPlaneSelector for tap-to-place, ViroARCamera, image/text nodes for cards.
  • expo-audio for mic capture + playback; expo-glass-effect for a Liquid-Glass HUD; reanimated for the assistant's animated states.

Backend — FastAPI + LangGraph multi-agent orchestrator

The core is a LangGraph state machine that classifies each utterance and routes it to a specialized agent:

        START → classify ─┬─ news   → news_agent  ┐
                          ├─ image  → image_agent ├→ synthesize → END
                          ├─ chat   → chat_agent  ┘
                          ├─ vision → vision_node  (asks client for a frame)
                          ├─ place  → place_node   (emits placement command)
                          └─ clear  → clear_node

  • Routing uses a deterministic keyword pass first, falling back to an LLM with structured output (langchain-openai) only for ambiguous input.
  • Vision uses a separate vision-language model over the base64 camera frame.
  • Translation relays WAV→PCM segments over a WebSocket to ElevenLabs realtime STT, then LLM-translates committed transcripts.
  • Voice I/O proxies ElevenLabs STT and TTS server-side so the API key never touches the device.
  • Tools (news/image search via ddgs) are cached with fresh/stale TTLs so DuckDuckGo throttling never leaves the view blank.
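
To make the keyword-first routing idea concrete, here is a minimal Python sketch. It is an illustration under assumptions, not our actual router: the keyword patterns, the classify function, and the llm_fallback callable are all hypothetical names, and only the six intent labels come from the graph above.

```python
import re

# Hypothetical keyword table; the write-up does not list the real patterns.
# Order matters: the first matching pattern wins.
KEYWORD_ROUTES = [
    ("clear", re.compile(r"\bclear\b")),
    ("place", re.compile(r"\b(add|put|place)\b.*\b(on|onto)\b")),
    ("news", re.compile(r"\b(news|headlines)\b")),
    ("image", re.compile(r"\b(images?|pictures?|photos?)\b")),
    ("vision", re.compile(r"\b(what do you see|what am i looking at|where am i)\b")),
]
VALID_INTENTS = {"news", "image", "chat", "vision", "place", "clear"}


def classify(utterance, llm_fallback=None):
    """Return an intent label, trying cheap keyword rules before any LLM call."""
    text = utterance.lower()
    for intent, pattern in KEYWORD_ROUTES:
        if pattern.search(text):
            return intent
    # Only ambiguous input reaches the (assumed) LLM classifier.
    if llm_fallback is not None:
        try:
            guess = llm_fallback(utterance)
        except Exception:
            guess = None  # degrade to chat if the model call fails
        if guess in VALID_INTENTS:
            return guess
    return "chat"
```

In this sketch, an utterance such as "add a cup on the laptop" is resolved by the keyword pass alone, so the fallback is never called for it. Passing llm_fallback=None also shows how the router could keep working with no API key configured.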

What we learned

  • AI makes AR approachable. Letting the agent treat spatial actions (place, look) as tools collapsed a huge amount of AR UX complexity into plain language.
  • Agents need an eye and a hand. The cleanest pattern was a tool-call back to the device: the agent can't grab a camera frame or pick a plane itself, so it emits a look or place action and the client completes it. The model decides what; the device owns where.
  • Determinism beats a clever router. A small LLM router mis-filed obvious intents ("latest news," "add a cup on the laptop") as chat — so trusting unambiguous keywords first was both more reliable and cheaper.
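
The "eye and hand" round trip can be sketched roughly as follows. This is a simplified Python illustration under assumptions: the field names (type, item, surface, frame_b64, anchor) and both function names are hypothetical, and the real payloads may look different. The only parts taken from the text are the look and place action types and the split of responsibility between agent and device.

```python
import json


def emit_action(intent, slots=None):
    """Server side: turn a routed intent into an action the device must complete."""
    slots = slots or {}
    if intent == "vision":
        action = {"type": "look"}  # ask the client for a camera frame
    elif intent == "place":
        # The model decides what to place and on which kind of surface.
        action = {"type": "place", "item": slots["item"], "surface": slots["surface"]}
    else:
        raise ValueError(f"intent {intent!r} needs no device action")
    return json.dumps(action)


def complete_action(action_json, client_reply):
    """Server side: combine the pending action with what the device sent back."""
    action = json.loads(action_json)
    if action["type"] == "look":
        # The device supplied the frame; hand it to the vision model next.
        return {"next": "vision_model", "frame_b64": client_reply["frame_b64"]}
    if action["type"] == "place":
        # The device owns where: the anchor comes from the tapped plane.
        return {"next": "done", "item": action["item"], "anchor": client_reply["anchor"]}
    raise ValueError(f"unknown action type {action['type']!r}")
```

The design point is that the server never guesses coordinates or fabricates a frame. It only describes what it needs, and the client reply fills in the part that exists only on the device.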

Challenges we faced

  • Viro + New Architecture gates. The AR native module silently fails to build unless newArchEnabled and RCT_NEW_ARCH_ENABLED=1 are both set, which makes the failure hard to diagnose.
  • Real-time translation latency. True ~150 ms streaming needs continuous PCM from a native audio module. We stayed on expo-audio by streaming short WAV segments to the realtime API instead — trading ~1–2 s of subtitle lag for a far simpler, cross-platform pipeline.
  • Search reliability. DuckDuckGo throttles rapid calls and intermittently returns nothing. We added fresh/stale caching and multi-phrasing fallbacks so the AR view never goes blank.
  • Graceful degradation. Every LLM-dependent node has a deterministic fallback, so the whole graph still runs with no API key configured.
  • On-device iOS builds. We had to patch @expo/cli to get run:ios --device working against lockdownd.
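
For the search reliability point above, a fresh/stale cache can be sketched in a few lines of Python. This is an illustration under assumptions rather than our actual code: the class name, the TTL values, and the get_or_fetch signature are hypothetical, and fetch stands in for whatever function performs the search.

```python
import time


class FreshStaleCache:
    """Serve fresh entries directly; fall back to stale ones when a fetch fails."""

    def __init__(self, fresh_ttl=60.0, stale_ttl=900.0, clock=time.monotonic):
        self.fresh_ttl = fresh_ttl  # seconds during which no refetch is attempted
        self.stale_ttl = stale_ttl  # seconds during which an old value may still be shown
        self._clock = clock
        self._store = {}  # key -> (stored_at, value)

    def get_or_fetch(self, key, fetch):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] <= self.fresh_ttl:
            return entry[1]  # fresh hit: skip the network entirely
        try:
            value = fetch()
        except Exception:
            value = None  # treat a throttled or failed call like an empty result
        if value:
            self._store[key] = (now, value)
            return value
        if entry is not None and now - entry[0] <= self.stale_ttl:
            return entry[1]  # stale but better than a blank view
        return []
```

A caller would wrap each search in get_or_fetch with the query as the key. An empty or failed response then only produces an empty list when nothing usable has been cached within the stale window.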

Built with

Expo SDK 55 · React Native · @reactvision/react-viro (ARKit) · expo-audio · expo-glass-effect · FastAPI · LangGraph · LangChain · OpenAI-compatible LLM + vision model · ElevenLabs (STT / TTS / Scribe v2 Realtime) · DuckDuckGo Search
