About Finn 2.0
Inspiration
We started with a simple question: what is the most dangerous moment in digital banking?
Not a hacked password. Not a stolen card. The most dangerous moment is when a real person, fully authenticated, is being manipulated into sending money they don't want to send. Elderly users coerced by phone scammers. Authorised push payment (APP) fraud — where the victim willingly initiates the transfer — is the fastest-growing fraud category in Europe, with billions lost annually.
Standard authentication solves the wrong problem. A PIN confirms who you are. It cannot confirm whether you are safe. We wanted to build something that sits between the user's intent and the transaction execution — an AI layer that verifies identity in real time before any money moves.
That became Finn 2.0.
How We Built It
The system is split into two processes that talk to each other:
- Streamlit frontend (app.py) — a custom HTML/CSS/JS phone UI embedded in Streamlit, running on port 8501
- FastAPI backend (finn/backend.py) — the execution engine, running on port 8000
Voice Pipeline
The user speaks a request. The browser captures audio and transcribes it using the Web Speech API. The transcript is posted to /query on the backend.
For richer audio analysis, we also built an Amazon Nova Sonic STT path: raw audio is converted to 16 kHz mono PCM via ffmpeg, then streamed to Nova Sonic over a bidirectional WebSocket, reading back only the USER-role transcript events.
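In sketch form, the conversion step looks like this (the function name and error handling are ours, not necessarily what's in the repo):

```python
import subprocess

def to_pcm16_mono(audio_bytes: bytes, rate: int = 16_000) -> bytes:
    """Convert browser-captured audio (webm/ogg/wav) to raw 16 kHz mono
    16-bit little-endian PCM, the format streamed to Nova Sonic."""
    proc = subprocess.run(
        [
            "ffmpeg",
            "-i", "pipe:0",    # read the uploaded audio from stdin
            "-ar", str(rate),  # resample to 16 kHz
            "-ac", "1",        # downmix to mono
            "-f", "s16le",     # raw signed 16-bit little-endian samples
            "pipe:1",          # write raw PCM to stdout
        ],
        input=audio_bytes,
        capture_output=True,
        check=True,
    )
    return proc.stdout
```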
Intent Recognition
The transcript is sent to Amazon Bedrock (Nova Pro) via the Converse API with native tool use — five tools defined:
$$\mathcal{T} = \{\texttt{make\_payment},\ \texttt{request\_money},\ \texttt{create\_payment\_link},\ \texttt{list\_accounts},\ \texttt{list\_transactions}\}$$
The LLM selects $t \in \mathcal{T}$ and returns a fully populated parameter object. If the model returns nothing actionable, a regex fallback extracts amount, recipient, and intent from the raw text.
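A minimal sketch of that Converse call with one of the five tool specs (the schema shown here is abbreviated and illustrative; the real definitions live in finn/backend.py):

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

MAKE_PAYMENT_SPEC = {
    "toolSpec": {
        "name": "make_payment",
        "description": "Send money from the user's account to a named recipient.",
        "inputSchema": {"json": {
            "type": "object",
            "properties": {
                "amount": {"type": "number"},
                "recipient": {"type": "string"},
                "description": {"type": "string"},
            },
            "required": ["amount", "recipient"],
        }},
    }
}

def recognise_intent(transcript: str) -> dict | None:
    """Ask Nova Pro to pick a tool and fill its parameters; None if no toolUse."""
    response = bedrock.converse(
        modelId="amazon.nova-pro-v1:0",
        messages=[{"role": "user", "content": [{"text": transcript}]}],
        toolConfig={"tools": [MAKE_PAYMENT_SPEC]},  # the real app registers all five
    )
    for block in response["output"]["message"]["content"]:
        if "toolUse" in block:
            return block["toolUse"]  # {"toolUseId": ..., "name": ..., "input": {...}}
    return None
```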
Face Verification
Before any transaction executes, the user must pass a live face check. The frontend captures a webcam photo and POSTs it to /face/verify. The backend sends both images — the stored reference and the live capture — to Amazon Bedrock Nova Lite (vision model) via the Converse API.
The verification prompt enforces a strict confidence threshold:
$$P(\text{match}) > 0.95 \implies \texttt{YES},\quad \text{otherwise} \implies \texttt{NO}$$
The model compares face shape, eye spacing, nose, mouth, jawline, and distinctive features. It replies with a single word — YES or NO. Anything other than YES is treated as a rejection.
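Roughly, the gate looks like this (the prompt text below is a paraphrase, not the exact prompt in the repo):

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

def verify_face(reference_jpeg: bytes, live_jpeg: bytes) -> bool:
    """Binary face gate: ask Nova Lite to compare the two images and answer
    with a single word. Anything other than YES counts as a rejection."""
    prompt = (
        "Compare the person in image 1 with the person in image 2. "
        "Consider face shape, eye spacing, nose, mouth, jawline and "
        "distinctive features. Reply YES only if you are more than 95% "
        "confident they are the same person, otherwise reply NO. "
        "Reply with exactly one word."
    )
    response = bedrock.converse(
        modelId="amazon.nova-lite-v1:0",
        messages=[{
            "role": "user",
            "content": [
                {"image": {"format": "jpeg", "source": {"bytes": reference_jpeg}}},
                {"image": {"format": "jpeg", "source": {"bytes": live_jpeg}}},
                {"text": prompt},
            ],
        }],
    )
    answer = response["output"]["message"]["content"][0]["text"].strip().upper()
    return answer == "YES"
```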
Banking Execution
Once face-verified, the matched function calls the real bunq REST API:
| Intent | Endpoint |
|---|---|
| Send money | POST /user/{id}/monetary-account/{id}/payment |
| Request money | POST /user/{id}/monetary-account/{id}/request-inquiry |
| Payment link | POST /user/{id}/monetary-account/{id}/bunqme-tab |
| Balances | GET /user/{id}/monetary-account |
| History | GET /user/{id}/monetary-account/{id}/payment |
Every bunq request is RSA-signed with a dynamically generated key pair registered at session start. In sandbox mode, recipients are spun up as fresh sandbox users and transfers use IBAN-based routing.
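The signing step, sketched with the cryptography library under the assumption of bunq's body-signing scheme (SHA-256 with PKCS#1 v1.5 over the raw JSON body):

```python
import base64
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# Fresh key pair generated at session start; the public half is registered
# with bunq via the installation endpoint.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_pem = private_key.public_key().public_bytes(
    serialization.Encoding.PEM,
    serialization.PublicFormat.SubjectPublicKeyInfo,
).decode()

def sign_body(body: bytes) -> str:
    """Base64-encoded SHA-256 / PKCS#1 v1.5 signature of the raw request
    body, sent in the X-Bunq-Client-Signature header."""
    signature = private_key.sign(body, padding.PKCS1v15(), hashes.SHA256())
    return base64.b64encode(signature).decode()
```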
The result is formatted into a short spoken sentence and returned to the frontend for TTS playback.
What We Learned
Multimodal AI is genuinely useful at boundaries. Using a vision model as a binary face gate — rather than a traditional ML pipeline — let us ship real biometric verification in under 100 lines of code. The prompt engineering matters more than the model size: forcing a single-word YES/NO response with an explicit confidence threshold made the output deterministic.
Tool use changes how you design LLM integrations. Instead of parsing free-text model output, we defined a strict schema for each banking action. The model's job is to select a tool and fill its parameters — nothing more. This made the pipeline dramatically more reliable and removed an entire class of output-parsing bugs.
Bidirectional streaming is non-trivial. The Nova Sonic STT path requires managing a full async send/receive loop with named content blocks, session lifecycle events, and chunked audio. Getting this right across Python's event loop constraints (Streamlit runs in a thread, FastAPI in another) required isolating each STT call in its own thread with a fresh event loop.
Sandbox banking has sharp edges. The bunq sandbox does not support email-based counterparties — only IBAN. We had to dynamically create sandbox recipient users, fetch their IBANs, and cache them. This added complexity that production mode doesn't need.
Challenges
1. Event loop isolation for async STT
Streamlit and FastAPI both manage their own event loops. Running asyncio coroutines for Nova Sonic from inside a FastAPI endpoint required wrapping each call in a ThreadPoolExecutor with a brand-new event loop:
```python
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    return pool.submit(_run_in_new_loop, _nova_sonic_stt(...)).result()
```
Without this, the bidirectional stream would deadlock or throw RuntimeError: no running event loop.
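The helper itself is small; a sketch consistent with the call above:

```python
import asyncio

def _run_in_new_loop(coro):
    """Run a coroutine to completion on a brand-new event loop owned by
    this worker thread, so it never touches Streamlit's or FastAPI's loop."""
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        return loop.run_until_complete(coro)
    finally:
        loop.close()
```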
2. LLM reliability on ambiguous input
Natural speech is messy. "Send fifty to Sriram" might produce a toolUse block, or it might produce a text reply. We layered three fallback strategies:
$$\text{result} = \begin{cases} \text{LLM tool use} & \text{if Converse returns } \texttt{toolUse} \\ \text{regex extraction} & \text{if the LLM returns text with payment intent} \\ \text{session memory replay} & \text{if a prior turn had a pending tool} \end{cases}$$
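The regex layer is intentionally dumb. An illustrative sketch (this version handles digit amounts only; spoken numbers like "fifty" would need normalisation first):

```python
import re

# Illustrative pattern only; the real fallback in finn/backend.py may differ.
PAYMENT_RE = re.compile(
    r"\b(?:send|pay|transfer)\b.*?(\d+(?:[.,]\d{1,2})?).*?\bto\s+([A-Za-z]+)",
    re.IGNORECASE,
)

def regex_fallback(text: str) -> dict | None:
    """Extract amount and recipient from free text when no toolUse came back."""
    m = PAYMENT_RE.search(text)
    if not m:
        return None
    return {
        "name": "make_payment",
        "input": {
            "amount": float(m.group(1).replace(",", ".")),
            "recipient": m.group(2),
        },
    }
```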
3. Face verification latency
Sending two full-resolution JPEG images to a vision model on every transaction adds ~2–3 seconds of latency. We mitigated this by keeping the face check as a separate frontend step that runs once per session, passing the face_verified flag in subsequent /query calls rather than re-verifying on every message.
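In sketch form, the flag simply travels with each query (field names here are illustrative, not the repo's exact schema):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    text: str                    # the voice transcript
    face_verified: bool = False  # set once per session after /face/verify

@app.post("/query")
async def query(req: QueryRequest):
    # Hypothetical handler shape; the real one lives in finn/backend.py.
    if not req.face_verified:
        return {"speech": "Please complete face verification first."}
    ...
```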
4. Keeping it honest
The temptation at a hackathon is to describe the product you want to build, not the one you built. We deliberately scoped down the README and presentation to match exactly what is in the code — a voice + face verified banking assistant with real bunq API integration — rather than overclaiming a risk engine that doesn't exist yet.
Built With
Python 3.12 · FastAPI · Streamlit · Amazon Bedrock (Nova Pro, Nova Lite, Nova Sonic) · bunq REST API · ffmpeg · uv