Inspiration
Banking apps are quiet. You make a decision out loud — "I want to plan a weekend for Sara", "lock in this month's rent and gym", "I'm flying to Tokyo Friday" — and then you spend the next twenty minutes context-switching between seventeen tabs and the bunq app to actually do it. The intent is one sentence. The execution is a workflow.
We wanted to close that gap. The hackathon brief asked for an AI that solves a real banking problem with a non-text modality, so we built a voice-first agent that doesn't just answer questions — it does the work. You speak the mission like you'd say it to a friend, and the agent plans, executes, narrates, and bills it all on real bunq sandbox APIs while you watch.
What it does
Trip Agent for bunq is a voice-driven, multi-modal financial concierge. Tap the mic, speak a one-sentence mission, and watch the dashboard render every action the agent takes — in real time, with real bunq calls.
Sustainability round-up — every mission ends with the agent asking out loud whether to round the day's spend up with a 3–5 % donation to Trees for All (travel/weekend) or Just Diggit (payday). The line is dynamic — pulled from a curated 47-line bunq-tone pool, never the same twice. You say "yeah" or "skip", and a real bunq payment to the cause fires on yes.
The dashboard is a single non-scrolling viewport: lowercase bunq wordmark, balance pill, live SSE indicator, action cascade on the left, browser-agent screenshots in the centre, mission-aware Sources sidebar on the right (real, clickable links to booking.com, hotels.com, TheFork, Michelin Guide, DUWO portal, Spotify Premium etc.), and the bunq rainbow strip pinned to the bottom of the viewport. All animated with framer-motion, themed with the Press Kit palette (deep black, bunq orange, Montserrat).
How we built it
Backend — FastAPI + threading bridge
A FastAPI server with an SSE event bus. Every step the agent takes — step_started, step_finished, narrate_audio, awaiting_donation, tax_extracted, awaiting_tax_confirm, mission_complete — fans out to the dashboard live.
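The bus itself is tiny. A minimal sketch (class name and payloads here are illustrative; the real server streams these frames to the dashboard over a FastAPI StreamingResponse):

```python
import json
import queue

# Event names from the real bus; anything else is rejected.
EVENT_TYPES = {"step_started", "step_finished", "narrate_audio",
               "awaiting_donation", "tax_extracted",
               "awaiting_tax_confirm", "mission_complete"}

class EventBus:
    """Fan out mission events to every connected SSE subscriber."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self):
        # One queue per dashboard connection; the SSE endpoint drains it.
        q = queue.Queue()
        self.subscribers.append(q)
        return q

    def publish(self, event_type: str, payload: dict):
        assert event_type in EVENT_TYPES
        # Standard SSE wire format: event line, data line, blank line.
        frame = f"event: {event_type}\ndata: {json.dumps(payload)}\n\n"
        for q in self.subscribers:
            q.put(frame)

bus = EventBus()
sub = bus.subscribe()
bus.publish("step_started", {"step": "book_hotel"})
frame = sub.get_nowait()
```

Because the worker thread only ever touches `publish`, the HTTP side never blocks on a slow mission step.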
Mission cascades run in a daemon worker thread; voice-confirmation steps block on a threading.Event while the dashboard captures the user's audio and fires /missions/donate/confirm (or /tax/confirm). bunq client is a hand-rolled RSA-signed Python wrapper covering payment, draft-payment, schedule-payment, request-inquiry, bunq.me-tab, card status, monetary-account-bank, IBAN payments, and the notification-filter-url webhook flow.

Brain — Anthropic Claude
Claude Haiku 4.5 drives the tool-use loop on every mission. The system prompt + the 16-tool catalog are marked cache_control: ephemeral so prompt caching slashes per-iteration latency. Claude Sonnet 4.6 handles the receipt scanner — handwriting OCR is markedly more reliable on Sonnet than on Haiku. A separate cheap Haiku call routes free-text voice commands to one of the four missions when keywords don't disambiguate.

Voice — ElevenLabs
Scribe (scribe_v1) for STT — both for mission commands and yes/no confirmations. ElevenLabs TTS (Rachel voice, eleven_multilingual_v2) with per-call jittered prosody so identical phrases don't sound karaoke-identical.
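The per-call jitter is nothing exotic: small random perturbations to the ElevenLabs voice_settings so repeated phrases render slightly differently. A sketch with illustrative ranges (not our tuned values):

```python
import random

def jittered_voice_settings(rng=random):
    """Perturb voice_settings per TTS call so identical phrases don't
    sound karaoke-identical. Ranges below are illustrative only."""
    return {
        "stability": round(rng.uniform(0.35, 0.55), 2),
        "similarity_boost": round(rng.uniform(0.70, 0.85), 2),
        "style": round(rng.uniform(0.20, 0.45), 2),
    }

settings = jittered_voice_settings()
```

Each synthesis request gets a fresh dict, so even back-to-back identical lines drift in delivery.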
Voice activity detection runs on the frontend, reusing the same AnalyserNode that powers the meter. RMS over a smoothed window decides "speech" vs "silence" — 3 seconds below threshold after the user has begun speaking auto-stops the recorder.

Browser agent — Playwright + Claude Vision
A generic _drive_booking_flow powers all three browser missions. Claude Vision sees a screenshot, picks a tool (click_text, wait, complete), and the agent acts. We inject an animated red cursor into the page and stream multi-frame screenshots back to the dashboard at ~14 fps. Mock sites are real HTML — restaurant data comes from Google Places when keys are present (with a fixture fallback); hotel and subscription pages use branded inline HTML.

Frontend — React 18, Vite 6, Tailwind v4, ShadCN, framer-motion
Tailwind v4 @theme block holds the bunq palette as native colour tokens: bunq-orange, bunq-purple, bunq-green, etc. ShadCN semantic tokens (primary, card, border, ring) are re-mapped through bunq. Three independent mic recorders (mission, donation, tax) — each with its own upload function, all sharing the silence-detection hook.
Double-buffered screenshot rendering in the BrowserPanel — no flicker between frames. A voice-reactive MicDialog that scales the mic icon with RMS, cycles through the bunq palette while listening, and shows a dedicated browser-specific recovery view if getUserMedia is denied.
Challenges we ran into
Audio echo into the user's mic. The agent would speak a question via TTS, the dashboard would auto-open the user's mic, and the mic would record the agent's own voice as input. The transcript came back garbled, the donation classifier returned unsure, and missions silently ended in "skipped today, no worries". Fix: an audioQueue.waitForDrain() Promise that resolves only when the queue is fully empty and quiet — confirmation mics gate on that drain plus a 300 ms reverb buffer.
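The gate generalises to any playback queue. A sketch of the drain logic (in Python for illustration — the real gate is the JS audioQueue; names here are hypothetical):

```python
import asyncio

class AudioQueue:
    """Drain gate: the confirmation mic opens only after every queued TTS
    clip has finished playing, plus a short reverb buffer so the tail of
    the agent's own voice isn't recorded as user input."""
    def __init__(self):
        self._pending = 0
        self._drained = asyncio.Event()
        self._drained.set()  # an empty queue counts as drained

    def enqueue_clip(self):
        self._pending += 1
        self._drained.clear()

    def clip_finished(self):
        self._pending -= 1
        if self._pending == 0:
            self._drained.set()

    async def wait_for_drain(self, reverb_buffer=0.3):
        await self._drained.wait()
        await asyncio.sleep(reverb_buffer)  # let the room go quiet

async def demo():
    q = AudioQueue()
    q.enqueue_clip()
    q.enqueue_clip()
    loop = asyncio.get_running_loop()
    # simulate the two clips finishing playback shortly after
    loop.call_later(0.01, q.clip_finished)
    loop.call_later(0.02, q.clip_finished)
    await q.wait_for_drain(reverb_buffer=0.01)
    return q._pending

pending_after_drain = asyncio.run(demo())
```

The key property: enqueueing a new clip while waiting re-clears the event, so a late narration line can't slip past the gate.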
Claude looping on earlier steps. The donation prompt said "Call that total_spent" — we meant "label the variable", Claude read "make a tool call". On Travel runs the agent would re-invoke book_hotel after the donation step. Fix: rewrote the language as "Compute total_spent as a NUMBER you derive yourself — do NOT make a tool call", plus a hard-rules block that explicitly forbids re-calling browser-driven tools after confirm_donation.
Donation phrasing converging. Even with explicit "vary every run, do NOT use the template 'Spent X on Y'" instructions and four worked-out example styles, Claude landed on the same template repeatedly. Fix: moved generation out of the model entirely. The agent passes a placeholder prompt_line; the server overrides it with a random pick from a 47-line curated pool tagged by cause/mission flavour.
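The server-side override is deliberately dumb: a tagged list and a random pick. A sketch with three illustrative entries standing in for the 47-line pool:

```python
import random

# Illustrative entries only — the real pool has 47 curated bunq-tone lines.
DONATION_POOL = [
    {"cause": "trees_for_all", "mission": "weekend",
     "line": "quick one — round today up a few percent for Trees for All?"},
    {"cause": "trees_for_all", "mission": "travel",
     "line": "before we wrap: shall we plant something with the round-up?"},
    {"cause": "just_diggit", "mission": "payday",
     "line": "payday's sorted. send the spare change to Just Diggit?"},
]

def pick_prompt_line(cause: str, mission: str, rng=random) -> str:
    """Ignore the model's placeholder prompt_line; pick a tagged line
    from the curated pool instead."""
    matches = [e["line"] for e in DONATION_POOL
               if e["cause"] == cause and e["mission"] == mission]
    return rng.choice(matches)

line = pick_prompt_line("just_diggit", "payday")
```

Variety becomes a data problem instead of a prompting problem, which is exactly why it stopped converging.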
Safari user-gesture context. NotAllowedError: The request is not allowed by the user agent. Not a permission denial — Safari's gesture window had closed because we awaited two fetch calls before reaching mic.start(). Fix: call mic.start() synchronously inside the click handler; fire the TTS-opening fetch fire-and-forget after.
bunq sandbox rate limits. 5 POSTs / 3 s. Mission cascades with 6+ bunq calls back-to-back hit 429 immediately. Fix: paced sleeps between successive bunq calls and snapshot-balance batching.
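The pacing reduces to a sliding-window throttle. A sketch (class name hypothetical; injected clock/sleep make it testable):

```python
import time

class BunqPacer:
    """Throttle for the sandbox limit of 5 POSTs per 3 seconds: keep a
    sliding window of send times and sleep until a slot frees up."""
    def __init__(self, max_calls=5, window=3.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls, self.window = max_calls, window
        self.clock, self.sleep = clock, sleep
        self.sent = []

    def wait_for_slot(self):
        now = self.clock()
        self.sent = [t for t in self.sent if now - t < self.window]
        if len(self.sent) >= self.max_calls:
            # sleep until the oldest call in the window expires
            self.sleep(self.window - (now - self.sent[0]))
            now = self.clock()
            self.sent = [t for t in self.sent if now - t < self.window]
        self.sent.append(now)

# deterministic demo: a fake clock that only advances when we sleep
class FakeClock:
    def __init__(self): self.t = 0.0
    def now(self): return self.t
    def sleep(self, d): self.t += d

fc = FakeClock()
pacer = BunqPacer(clock=fc.now, sleep=fc.sleep)
for _ in range(6):
    pacer.wait_for_slot()  # the 6th call must wait out the window
```

Every bunq POST goes through `wait_for_slot()` first, so a 6-call cascade paces itself instead of eating a 429.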
Browser-panel screenshot flicker. The screenshot image element re-mounted on every frame, briefly showing nothing. Fix: double-buffered image elements with a CSS opacity crossfade — the active image always stays mounted, the new one decodes in the hidden buffer, then we flip on requestAnimationFrame.
Accomplishments that we're proud of
A voice command really does run real bunq calls. No mocked-out side effects, no recorded video. The user speaks, the agent calls monetary-account-bank/.../payment for the actual amount, and the dashboard shows the bunq response. The bunq sandbox shows the same payments in its console.

Receipt scanning that handles handwriting. Claude Sonnet Vision reads handwritten IBANs and recipient names off a phone-camera frame, the agent speaks the question, the user says yes, real bunq pays. Reliably enough that we'd demo it cold.
Voice that feels like a friend, not a bank. No "executing transaction", no "kindly approve". The narration style guide forbids twenty corporate phrases and pushes Claude toward varied openers, contractions, fragments. The donation pool extends that — every pitch sounds different, none sound like a charity ad.

Audio choreography that doesn't echo. TTS-first, drain-the-queue, then open the mic — the user never hears their own input come back jumbled. It's a small thing that makes the whole loop feel real.

Brand-faithful UI. The bunq Press Kit palette, the lowercase wordmark, the Montserrat extrabold, the 6 px rainbow strip pinned to the bottom of the viewport, the boxy 3 px / 6 px / 8 px radii. Every primitive (Button, Card, MicDialog) is themed through the same tokens — no off-brand hex codes anywhere.
What we learned
Voice UX is system design, not a sticker. Echo prevention, gesture preservation, silence detection, audio-queue gating, recovery views for permission errors — every one of those was load-bearing. Skip any and the demo dies.
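Concretely, the silence-detection piece reduces to a tiny state machine. Sketched here in Python for illustration (the real one runs per-frame on the browser's AnalyserNode; thresholds below are illustrative):

```python
class SilenceDetector:
    """Auto-stop rule: once speech has started, enough consecutive
    below-threshold RMS frames (about 3 s in the app) stop the recorder."""
    def __init__(self, threshold=0.05, silence_frames=90, alpha=1.0):
        self.threshold = threshold
        self.silence_frames = silence_frames  # ~3 s at ~30 frames/s
        self.alpha = alpha                    # exponential smoothing factor
        self.smoothed = 0.0
        self.speech_started = False
        self.quiet = 0

    def push(self, rms: float) -> bool:
        """Feed one RMS sample; True means 'stop the recorder now'."""
        self.smoothed = self.alpha * rms + (1 - self.alpha) * self.smoothed
        if self.smoothed >= self.threshold:
            self.speech_started = True
            self.quiet = 0          # any speech resets the silence run
        elif self.speech_started:
            self.quiet += 1
            if self.quiet >= self.silence_frames:
                return True
        return False

# short demo: silence before speech never stops; silence after speech does
det = SilenceDetector(threshold=0.05, silence_frames=3, alpha=1.0)
stops = [det.push(r) for r in [0.0, 0.4, 0.4, 0.0, 0.0, 0.0]]
```

Note the asymmetry: leading silence is ignored entirely, which is what lets users pause before speaking without the recorder giving up.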
LLMs converge. When the requirement is variety, the right move is often to take the writing out of the model — a curated random pool beats a creative prompt every single run.
Prompt clarity > prompt length. A single ambiguous phrase ("Call that total_spent") caused real production loop bugs across two missions. Re-writing it with a no-tool-call rule fixed both.
Prompt caching is the difference between feeling slow and feeling alive. Marking the system prompt and tool catalog as ephemeral cache turned a 12-iteration mission from sluggish to brisk.
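In practice that means marking the static prefix of every request. A sketch of the request shape (placeholder prompt and tools; the real catalog has 16 entries):

```python
SYSTEM_PROMPT = "you are the trip agent for bunq..."  # placeholder text

TOOL_CATALOG = [
    {"name": "book_hotel",
     "description": "Book a hotel via the browser agent",
     "input_schema": {"type": "object", "properties": {}}},
    # ...the real catalog has 15 more tools
]

def build_request(user_text: str) -> dict:
    """Mark the static prefix (system prompt + tool catalog) with
    cache_control so only the changing user turn is re-processed."""
    tools = [dict(t) for t in TOOL_CATALOG]
    # a marker on the last tool caches the entire tool-catalog prefix
    tools[-1]["cache_control"] = {"type": "ephemeral"}
    return {
        "model": "claude-haiku-4-5",
        "max_tokens": 1024,
        "system": [{"type": "text", "text": SYSTEM_PROMPT,
                    "cache_control": {"type": "ephemeral"}}],
        "tools": tools,
        "messages": [{"role": "user", "content": user_text}],
    }

req = build_request("plan a weekend for Sara")
```

The dict is what gets passed to the Anthropic messages API; iterations 2 through 12 of a mission then hit the cache instead of re-reading the whole prefix.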
Safari has opinions. getUserMedia after an await is a dead permission grant — the user-gesture window has expired. Plan to stay synchronous all the way up to the API call.
The bunq sandbox is friendly, but pace yourself. Rate limits are real but predictable; designing for them is part of the job.
What's next for Trip Agent for bunq
Persistent memory across sessions. The agent should remember Sara is a partner, Tokyo is the dream trip, DUWO is your landlord — without you saying it twice.

The Council (Money has feelings). A prototype we explored: each bunq sub-account becomes a voiced personality with goals, a deadline, and an opinion about your purchase. The math-graded sub-accounts argue out loud, and the user picks a side by speaking. We have the architecture mapped — sub-account auto-genesis, voice-per-archetype mapping, persona pick override on the user's vote — and want to ship it as a polish layer.

Mission packs. Groceries (auto-rotating supplier picks), kids' activity bookings, gifting cycles, holiday savings.

Multi-language voice. Dutch, German, French — bunq's actual region. ElevenLabs supports it; it's a TTS voice + Scribe language switch.