Inspiration

bunq's December 2025 update shipped Finn, a multimodal AI agent that already handles ~97% of customer support across 35 languages, with image recognition and speech-to-speech translation. But Finn lives inside a phone — and bunq's actual user, the location-independent professional (digital nomads, expats, cross-border SMEs), almost never has their hands free. They're carrying luggage, a passport, sometimes a child, and crossing a currency border every week.

Our pitch line:

bunq already built the AI agent for 20M borderless users. It's trapped in a phone. We put it where your eyes are.

Every concept we built is grounded in bunq's real 2026 direction (Updates 30 and 31) — no sci-fi, just existing Finn capabilities given a body through Snap Spectacles.

What it does

A heads-up "welcome to bunq" hub, world-locked 1.2 m in front of you, with four feature surfaces accessed via pinch gestures:

We couldn't implement every surface in this round, so we focused on Finn.

  • menu · fx — Look at a foreign menu, prices overlay in your home currency. (Pending)
  • bunq.me — Glance at a friend's QR, voice-pay in any currency. (Pending)
  • baskets — Investment portfolios as floating spatial cards. (Pending)
  • finn — A breathing orb that listens, sees, and moves money on confirmation.

The Finn screen is the headline: voice → GPT-4o → typed intent → confirm-gated transfer → mock ledger update → Sasha speaks back. Plus camera scan: point at a receipt, GPT-4o vision categorizes it into the right basket.
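
A minimal sketch of that confirm gate over the mock ledger; the Ledger and PendingTransfer names are illustrative rather than the project's actual classes, and the starting balances are made up:

```typescript
// Confirm-gated mock ledger: Finn proposes, the user confirms, only then do balances move.
interface PendingTransfer {
  source: string;
  dest: string;
  amount: number;
}

class Ledger {
  private balances = new Map<string, number>([["Main", 2500], ["Travel", 800]]);
  private pending: PendingTransfer | null = null;

  // Finn proposes: store the intent, mutate nothing yet.
  propose(t: PendingTransfer): string {
    this.pending = t;
    return `Move ${t.amount} from ${t.source} to ${t.dest}. Say "confirm" to proceed.`;
  }

  // Only an explicit "confirm" utterance is routed here.
  confirm(): string {
    if (!this.pending) return "Nothing to confirm.";
    const { source, dest, amount } = this.pending;
    this.balances.set(source, (this.balances.get(source) ?? 0) - amount);
    this.balances.set(dest, (this.balances.get(dest) ?? 0) + amount);
    this.pending = null;
    return `Done. Moved ${amount} to ${dest}.`;
  }

  cancel(): string {
    this.pending = null;
    return "Cancelled.";
  }
}
```

Whatever the LLM says, nothing touches the balances until the controller routes an explicit "confirm" to confirm().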

How we built it

Stack. Lens Studio 5.15.3, Snap Spectacles (2024), TypeScript. Voice via VoiceMLModule, TTS via TextToSpeechModule (Sasha), camera via CameraModule, GPT-4o (text + vision) via Snap's Remote Service Gateway.

Architecture decisions worth naming:

  • Finn proposes, user confirms. Single safety rule: we never go from utterance straight to a ledger mutation. Every action is two human moments — Finn proposes, user says "confirm."
  • Typed FinnIntent envelope between the LLM and the state machine: interpret(text) → { action, source, dest, amount, speak, confidence }. The controller stays dumb; the LLM owns parsing (see the envelope sketch after this list).
  • Buffer-on-settle for fragmented voice finals. The VAD finalizes a segment after every ~700 ms of silence, chopping conversational speech into fragments. We accumulate fragments for \( \Delta t = 1.2\,\text{s} \) and then dispatch one full utterance. Control verbs ("confirm", "cancel") fast-path past the buffer for instant response (sketched after this list).
  • Keyword pre-filter before any LLM call: transcripts that don't contain move / transfer / basket / euros / finn / … get dropped silently. No paid LLM call, no random Sasha reply on ambient chatter.
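
A sketch of the last two points, assuming a plain onVoiceFinal callback from the voice module; the setTimeout-style timer stands in for whatever scheduler the Lens runtime provides (e.g. a DelayedCallbackEvent):

```typescript
// Buffer-on-settle with a control-verb fast path and a keyword pre-filter.
const SETTLE_MS = 1200;                       // Δt before dispatching a full utterance
const CONTROL_VERBS = ["confirm", "cancel"];  // never buffered, act instantly
const KEYWORDS = ["move", "transfer", "basket", "euros", "finn"];

let buffer: string[] = [];
let settleTimer: ReturnType<typeof setTimeout> | null = null;

function onVoiceFinal(fragment: string, dispatch: (utterance: string) => void): void {
  const text = fragment.trim().toLowerCase();

  // Fast path: control verbs bypass the buffer for instant response.
  if (CONTROL_VERBS.some((v) => text.includes(v))) {
    dispatch(text);
    return;
  }

  // Accumulate fragments; restart the settle timer on each new final.
  buffer.push(fragment.trim());
  if (settleTimer !== null) clearTimeout(settleTimer);
  settleTimer = setTimeout(() => {
    const utterance = buffer.join(" ");
    buffer = [];
    // Keyword pre-filter: ambient chatter never triggers a paid LLM call.
    if (KEYWORDS.some((k) => utterance.toLowerCase().includes(k))) {
      dispatch(utterance);
    }
  }, SETTLE_MS);
}
```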
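
And the typed envelope itself; the field names come from the bullet above, while the FinnAction union and the JSON-parsing details are assumptions:

```typescript
// FinnIntent: the only thing the controller ever accepts from the LLM.
type FinnAction = "transfer" | "balance" | "confirm" | "cancel" | "none";

interface FinnIntent {
  action: FinnAction;
  source?: string;     // e.g. "Main"
  dest?: string;       // e.g. "Travel"
  amount?: number;     // in the user's home currency
  speak: string;       // the line Sasha reads back
  confidence: number;  // 0..1; low-confidence parses are declined
}

// The controller stays dumb: it validates the envelope and nothing else.
function parseIntent(raw: string): FinnIntent | null {
  try {
    const obj = JSON.parse(raw);
    if (typeof obj.action !== "string" || typeof obj.speak !== "string") return null;
    if (typeof obj.confidence !== "number") return null;
    return obj as FinnIntent;
  } catch {
    return null; // malformed LLM output never reaches the ledger
  }
}
```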

Visual. A procedurally generated bunq rainbow backdrop (stripe colors sampled from the actual logo, regenerated as a clean stripe-only PNG), a SUIKit Frame-wrapped movable window, and a breathing orb whose scale pulses on a two-second cycle,

$$ s(t) = s_0 \cdot \left(1 + 0.05\,\sin\!\left(\tfrac{2\pi t}{2\,\text{s}}\right)\right) $$

and whose color lerps on state transitions (magenta → cyan → green), plus color-coded cards (Main green, Travel cyan, Tax red, Tech orange, Emergency yellow).
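
A minimal Lens Studio sketch of the breathing-plus-lerp behavior, assuming the orb uses a standard material with a writable mainPass.baseColor; the component and input names are illustrative:

```typescript
// Breathing orb: scale pulses on a 2 s cycle, color eases toward the current state's color.
@component
export class BreathingOrb extends BaseScriptComponent {
  @input
  orb: SceneObject;
  @input
  orbMaterial: Material;

  private baseScale = new vec3(1, 1, 1);
  private fromColor = new vec4(1, 0, 1, 1); // magenta (idle)
  private toColor = new vec4(1, 0, 1, 1);
  private lerpT = 1;

  onAwake() {
    this.baseScale = this.orb.getTransform().getLocalScale();
    this.createEvent("UpdateEvent").bind(() => this.onUpdate());
  }

  // Called on state transitions, e.g. setTarget(new vec4(0, 1, 0, 1)) when a transfer settles.
  setTarget(color: vec4): void {
    this.fromColor = this.orbMaterial.mainPass.baseColor;
    this.toColor = color;
    this.lerpT = 0;
  }

  private onUpdate(): void {
    // s(t) = s0 * (1 + 0.05 * sin(2*pi*t / 2 s))
    const s = 1 + 0.05 * Math.sin((2 * Math.PI * getTime()) / 2);
    this.orb.getTransform().setLocalScale(this.baseScale.uniformScale(s));

    // Ease the color toward the target over ~0.5 s.
    this.lerpT = Math.min(1, this.lerpT + getDeltaTime() * 2);
    this.orbMaterial.mainPass.baseColor = this.mix(this.fromColor, this.toColor, this.lerpT);
  }

  private mix(a: vec4, b: vec4, t: number): vec4 {
    return new vec4(
      a.x + (b.x - a.x) * t,
      a.y + (b.y - a.y) * t,
      a.z + (b.z - a.z) * t,
      a.w + (b.w - a.w) * t
    );
  }
}
```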

Unusual workflow note: the entire scene scaffold — hub, four screens, navigation routing, ~30 SceneObjects, ~15 materials, all the wiring — was built via Lens Studio's MCP server driving the editor programmatically from Claude Code. We wrote zero click-through guides; we wrote prompts.

Challenges we ran into

  • VAD fragmentation. VoiceMLModule.ListeningOptions exposes no endpoint-silence-threshold knob. We discovered mid-build that the module is deprecated and that ASRModule exposes silenceUntilTerminationMs directly, but the migration didn't fit the timebox. Buffer-on-settle was the workaround.
  • TTS silently fails on non-ASCII. Em-dashes and ellipses fire err 13 with no audible cue, a perfect way for a demo to degrade on stage without anyone realizing. Defense in depth: sanitize at the speak boundary (sketched after this list) AND tell the LLM in its system prompt to emit plain ASCII.
  • Spectacles back-face culling. Rotated planes and Text3Ds vanish at certain head angles unless twoSided=true on every material; easy to miss in the editor, brutal on-device (see the sweep sketch after this list).
  • OnEnableEvent doesn't propagate. A SceneObject's enable event only fires when its own enabled flips, not its ancestor's. So we couldn't start the mic on screen-enter; voice has to start in onAwake and stay always-on.
  • Frame component vs content positioning. SUIKit's Frame defaults autoScaleContent: true and relativeZ: true, both of which silently rescale child Z values and re-introduce z-fighting. Both must be disabled.
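
A minimal sketch of the sanitize-at-the-speak-boundary rule; the replacement table and function name are assumptions:

```typescript
// Everything headed to TTS passes through here; non-ASCII otherwise kills playback with err 13.
const ASCII_REPLACEMENTS: [RegExp, string][] = [
  [/[\u2013\u2014]/g, "-"],   // en dash, em dash
  [/\u2026/g, "..."],         // ellipsis
  [/[\u2018\u2019]/g, "'"],   // curly single quotes
  [/[\u201C\u201D]/g, '"'],   // curly double quotes
];

function toSpeakableAscii(text: string): string {
  let out = text;
  for (const [pattern, replacement] of ASCII_REPLACEMENTS) {
    out = out.replace(pattern, replacement);
  }
  // Drop anything else outside printable ASCII rather than risk a silent failure.
  return out.replace(/[^\x20-\x7E]/g, "");
}
```

The system prompt repeats the same rule, so most responses arrive clean before the sanitizer has to rewrite anything.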
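
And the back-face fix as a programmatic sweep rather than a per-material checkbox; a sketch covering RenderMeshVisuals only (text components would need the same treatment):

```typescript
// Force twoSided on every material under a root object so nothing vanishes on-device.
function forceTwoSided(root: SceneObject): void {
  const visual = root.getComponent("Component.RenderMeshVisual") as RenderMeshVisual;
  if (visual) {
    for (let i = 0; i < visual.getMaterialsCount(); i++) {
      visual.getMaterial(i).mainPass.twoSided = true;
    }
  }
  // Recurse into children so nested screens get the same fix.
  for (let i = 0; i < root.getChildrenCount(); i++) {
    forceTwoSided(root.getChild(i));
  }
}
```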

What we learned

  • Spectacles AR isn't a phone with extra steps — it's a different surface where glanceability beats density and voice + confirm beats touch.
  • The hard part of an LLM agent isn't the LLM call; it's the boundary code — the buffer, the gate, the typed envelope, the safety confirm. The LLM is one line.
  • Lens Studio's MCP server is genuinely transformative for AR development — letting an agent drive scene composition lets you iterate on layout at the speed of language instead of the speed of clicks.

Built with

Tag list (paste one-per-line into Devpost's "Built with" field):

typescript
lens-studio
snap-spectacles
spectacles-interaction-kit
spectacles-ui-kit
voiceml-module
text-to-speech-module
camera-module
remote-service-gateway
openai
gpt-4o
gpt-4o-vision
mcp
claude-code
powershell
bunq-api
