BunqShield
Inspiration
bunq already does an incredible job protecting users from fraud — transaction monitoring with NVIDIA RAPIDS, deepfake detection during KYC via DuckDuckGoose, and Sardine for rule-based fraud prevention. But while researching bunq's stack, we noticed something missing: nobody actually looks at the invoice image itself.
Fraudsters today routinely Photoshop amounts, splice fake totals onto real receipts, or generate entirely synthetic invoices using AI image generators. Once a tampered invoice gets uploaded, the existing rule engines have no visual evidence to flag it, so the fraud sneaks through.
That's the moment BunqShield was born. With Hackathon 7.0 explicitly asking for a multi-modal AI that doesn't just respond, but acts, the alignment was perfect: an autonomous agent that sees the document, hears the user, decides, and intervenes before money leaves the account.
What it does
BunqShield is an autonomous multi-modal AI agent that protects bunq users from invoice fraud in real time:
- Sees the invoice — runs six forensic computer vision methods (Error Level Analysis, Copy-Move Detection, Noise Inconsistency, Font Consistency, Metadata Forensics, Edge Coherence) plus a Dual-Stream Vision Transformer to score the document on a 0–100 scale; an ELA sketch appears after this list
- Reasons about findings — Claude LLM interprets the forensic evidence in a ReAct loop, deciding the appropriate action
- Speaks the verdict — gTTS auto-explains the decision out loud, so the user understands why without reading anything
- Listens to follow-ups — Whisper transcribes microphone input for natural conversation
- Acts on the bank — autonomously converts suspicious payments to draft-payments via bunq's API, requiring manual approval before the money moves
The user experience: drop an invoice → hear the AI explain it → see the verdict → trust the decision. No manual review, no rule-writing, no waiting.
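To make the "sees" step concrete, here's a minimal sketch of Error Level Analysis, the first of the six forensic methods: recompress the image at a known JPEG quality and look for regions whose re-save error deviates from the rest. The function and the percentile-based score are illustrative, not our exact implementation.

```python
import io

import numpy as np
from PIL import Image, ImageChops

def ela_score(path: str, quality: int = 90) -> float:
    """Crude ELA score: spliced or overwritten regions tend to
    recompress differently from the rest of the image."""
    original = Image.open(path).convert("RGB")
    # Re-save at a known quality, then reload the compressed copy.
    buf = io.BytesIO()
    original.save(buf, "JPEG", quality=quality)
    buf.seek(0)
    resaved = Image.open(buf)
    # Per-pixel absolute difference is the "error level".
    diff = np.asarray(ImageChops.difference(original, resaved), dtype=np.float32)
    # A tampered patch shows error levels far from the image's norm; compare
    # the 99th percentile against the mean as a one-number summary.
    return float(np.percentile(diff, 99) / (diff.mean() + 1e-6))
```

In the full pipeline, each of the six methods contributes evidence like this, which is then normalized into the 0–100 score.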
How we built it
Backend (Python):
- FastAPI server exposing 11+ endpoints for analysis, voice, and bunq integration
- OpenCV + scikit-image + NumPy for the 6 classical CV detection methods
- PyTorch + timm for the Dual-Stream Vision Transformer architecture (ViT-Tiny/16 backbone, RGB + ELA streams, cross-attention fusion, patch-level attention heatmaps)
- OpenAI Whisper for speech-to-text transcription
- Google Text-to-Speech (gTTS) for natural speech synthesis; the Whisper-to-gTTS round-trip is sketched after this list
- Anthropic Claude SDK powering the ReAct agent's reasoning loop
- Official bunq Python SDK for live sandbox connection (RSA signing, session management, draft-payment creation)
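To show how the audio bullets fit together, here's a condensed sketch of a Whisper-in, gTTS-out endpoint. The `/voice` route and the echoed reply are placeholders; in the real system the transcript goes to the ReAct agent.

```python
import tempfile

import whisper
from fastapi import FastAPI, UploadFile
from fastapi.responses import FileResponse
from gtts import gTTS

app = FastAPI()
stt = whisper.load_model("base")  # loaded once at startup

@app.post("/voice")
async def voice(audio: UploadFile):
    # Whisper reads from disk, so persist the upload first.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(await audio.read())
        wav_path = f.name
    question = stt.transcribe(wav_path)["text"]
    reply = f"You asked: {question}"  # placeholder for the agent's answer
    gTTS(text=reply, lang="en").save("reply.mp3")
    return FileResponse("reply.mp3", media_type="audio/mpeg")
```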
Frontend (React + Vite):
- Drag-and-drop invoice upload zone
- Inline microphone capture via Web Audio API
- Auto-playing voice responses with replay button
- Animated score ring + per-method breakdown
- Dark emerald theme inspired by bunq's brand
ML Pipeline:
- PDF → PNG conversion script using `pdf2image` and Poppler
- Synthetic forgery generator with 4 attack types (copy-move, splice, amount overwrite, recompression)
- Authentic image augmentation (rotation, blur, color jitter) for class balancing
- Balanced training script with class weights, weighted random sampling, label smoothing, cosine annealing, and early stopping
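Those balancing tricks look roughly like this in PyTorch; `train_dataset`, `model`, `val_loader`, and the two helper functions are placeholders for pieces of our actual script.

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

labels = torch.as_tensor(train_dataset.targets)     # 0 = authentic, 1 = forged
class_weights = len(labels) / (2.0 * torch.bincount(labels).float())

# Weighted random sampling: each batch comes out roughly class-balanced.
sampler = WeightedRandomSampler(class_weights[labels], num_samples=len(labels))
loader = DataLoader(train_dataset, batch_size=16, sampler=sampler)

# Class-weighted loss with label smoothing, cosine-annealed AdamW.
criterion = torch.nn.CrossEntropyLoss(weight=class_weights, label_smoothing=0.1)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

best, patience = 0.0, 10
for epoch in range(50):
    train_one_epoch(model, loader, criterion, optimizer)  # placeholder helper
    scheduler.step()
    bal_acc = balanced_accuracy(model, val_loader)        # placeholder helper
    # Early-stop on *balanced* accuracy so "always predict forged" can't win.
    if bal_acc > best:
        best, patience = bal_acc, 10
        torch.save(model.state_dict(), "best.pt")
    elif (patience := patience - 1) == 0:
        break
```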
Architecture pattern: The agent uses the ReAct (Reasoning + Acting) loop: it thinks about what to do, calls a tool, observes the result, and decides the next action, all autonomously, without a human in the critical path.
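In code, the loop is roughly the following; we show one hypothetical `check_invoice` tool where the real agent registers six, and `run_tool` stands in for the dispatcher that calls the CV pipeline and bunq actions.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
tools = [{
    "name": "check_invoice",
    "description": "Run the forensic CV pipeline on an uploaded invoice.",
    "input_schema": {"type": "object",
                     "properties": {"path": {"type": "string"}},
                     "required": ["path"]},
}]

def run_tool(name: str, args: dict) -> str:
    # Placeholder dispatcher for the agent's six real tools.
    return '{"score": 87, "verdict": "likely forged"}'

messages = [{"role": "user", "content": "Is invoice.png safe to pay?"}]
while True:
    # Reason: Claude decides whether to answer or to call a tool.
    response = client.messages.create(
        model="claude-3-5-sonnet-latest", max_tokens=1024,
        tools=tools, messages=messages)
    if response.stop_reason != "tool_use":
        break  # final verdict reached
    messages.append({"role": "assistant", "content": response.content})
    # Act + observe: execute each requested tool, feed results back.
    results = [{"type": "tool_result", "tool_use_id": block.id,
                "content": run_tool(block.name, block.input)}
               for block in response.content if block.type == "tool_use"]
    messages.append({"role": "user", "content": results})

print(response.content[0].text)
```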
Challenges we ran into
- ViT training on tiny data: We trained a Dual-Stream ViT-Tiny on ~150 synthetic images. The model achieved 89% balanced validation accuracy but didn't generalize well to held-out real invoices, a known limitation of Vision Transformers on small datasets. We pivoted: kept the architecture and training pipeline production-ready in the repo, and focused our live demo on the classical CV pipeline, which produces meaningful results without training data.
- bunq API authentication: bunq's API requires RSA key pair signing, installation registration, device-server creation, and session tokens, a non-trivial multi-step handshake. Our first attempt with a custom HTTP client returned a `connected` status but with `null` user/account IDs because the RSA signing wasn't implemented correctly. We switched to the official bunq Python SDK, which solved it cleanly and gave us a real live connection (verified `user_id=3629568`, `account_id=3621708`); a connection sketch appears after this list.
- Class imbalance after augmentation: Our first balanced run still had imbalance issues: the model learned to always predict "forged" because that was the majority class. We added weighted sampling, class-weighted loss, and balanced accuracy as the early-stopping criterion to fix this.
- Browser autoplay restrictions: Modern browsers block auto-playing audio without user interaction. We solved it by triggering the audio playback inside the user's drag-and-drop event handler, which counts as a user gesture.
- Time pressure: Building a full multi-modal pipeline (vision + audio + agent + bunq integration + frontend) within hackathon hours required ruthless prioritization. We focused on what could be demonstrated end-to-end and what would survive a live demo without breaking.
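For reference, the handshake that replaced our hand-rolled client, plus the draft-payment intervention, looks roughly like this (names follow the official bunq Python SDK's documented pattern; exact signatures may vary between SDK versions):

```python
from bunq import ApiEnvironmentType
from bunq.sdk.context.api_context import ApiContext
from bunq.sdk.context.bunq_context import BunqContext
from bunq.sdk.model.generated import endpoint, object_

# ApiContext.create performs the whole multi-step handshake: RSA key pair,
# installation registration, device-server creation, and session token.
api_context = ApiContext.create(
    ApiEnvironmentType.SANDBOX, "YOUR_SANDBOX_API_KEY", "BunqShield")
api_context.save("bunq.conf")  # reuse the session on later runs
BunqContext.load_api_context(api_context)

# A non-null user ID here is the "really connected" check we relied on.
print(BunqContext.user_context().user_id)

# The agent's intervention: hold a suspicious payment as a draft that
# requires manual approval before any money moves.
entry = object_.DraftPaymentEntry(
    amount=object_.Amount("149.99", "EUR"),
    counterparty_alias=object_.Pointer("EMAIL", "supplier@example.com"),
    description="Invoice flagged by BunqShield; approve to pay")
endpoint.DraftPayment.create(entries=[entry], number_of_required_accepts=1)
```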
Accomplishments that we're proud of
- Live bunq sandbox integration that actually works — not mocked, not simulated. The agent really connects, really fetches payments, really creates draft-payments.
- A genuinely multi-modal pipeline — vision (CV + ViT), text (Claude reasoning), and audio (Whisper + gTTS) all working together in a single autonomous loop. This isn't three separate features bolted on; it's one integrated system.
- Six classical CV methods that produce real, explainable results — every score is traceable back to a specific forensic technique, with confidence levels and supporting evidence.
- Production-grade ViT architecture — Dual-Stream design with cross-attention fusion is a research-paper-level architecture, not just a textbook model. Training pipeline includes class weights, weighted sampling, early stopping, and label smoothing. A condensed sketch follows this list.
- An autonomous agent with real authority — not a chatbot. The ReAct agent calls 6 tools and takes actions on the user's bank account.
- A frontend that feels finished — animated score rings, per-method bars with risk-color coding, microphone integration, auto-play voice responses, and a dark-emerald aesthetic that fits bunq's brand.
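For the curious, here is a condensed sketch of the Dual-Stream idea; the layer sizes are illustrative and the trained weights and full model live in the repo.

```python
import timm
import torch
import torch.nn as nn

class DualStreamViT(nn.Module):
    """Two ViT-Tiny/16 streams (RGB + ELA) fused with cross-attention."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.rgb = timm.create_model("vit_tiny_patch16_224", pretrained=True, num_classes=0)
        self.ela = timm.create_model("vit_tiny_patch16_224", pretrained=True, num_classes=0)
        dim = self.rgb.embed_dim  # 192 for ViT-Tiny
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=3, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, rgb: torch.Tensor, ela: torch.Tensor) -> torch.Tensor:
        rgb_tokens = self.rgb.forward_features(rgb)   # (B, tokens, 192)
        ela_tokens = self.ela.forward_features(ela)
        # RGB tokens attend to ELA tokens; the attention weights over
        # patches double as a tamper heatmap.
        fused, attn_weights = self.cross_attn(rgb_tokens, ela_tokens, ela_tokens)
        return self.head(fused.mean(dim=1))

logits = DualStreamViT()(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224))
```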
What we learned
- Architectural complexity can outpace data. Our Dual-Stream ViT design is sophisticated, but on a tiny dataset the data, not the architecture, decides the outcome. We learned firsthand that ViTs need tens of thousands of images to generalize, and there's no shortcut. Knowing when to fall back to classical methods is a real engineering skill.
- Official SDKs save hours. Trying to implement bunq's RSA-signed API by hand was a rabbit hole. The moment we switched to the official Python SDK, everything worked. Lesson: trust the maintained tooling.
- Multi-modal isn't multi-feature. Anyone can add a microphone button to an app. True multi-modal AI is when the modalities reinforce each other in a single decision loop; that's what makes BunqShield different from a chatbot with a voice gimmick.
- Honest limitations beat exaggerated claims. Documenting the ViT's training limitations upfront in the README is more credible than hiding them. Judges can verify code; they can't verify hype.
- The ReAct agent pattern is incredibly powerful. Giving an LLM a small set of well-defined tools and letting it orchestrate them produces emergent, useful behavior with surprisingly little code.
What's next for BunqShield
- Train the ViT on DocTamper — the 170k-image document tampering dataset. With proper data and GPU training, we expect 90%+ accuracy and meaningful patch-level heatmaps.
- Fine-tune on bunq-specific invoice patterns — partner with bunq to access anonymized real invoice/receipt data for domain-specific tampering detection.
- Production webhook deployment — register the `notification-filter-url` callback in production so every new invoice upload triggers BunqShield automatically without user action.
- Multi-language voice support — extend Whisper and gTTS to Dutch, German, French, and Spanish to match bunq's European user base.
- Active learning loop — when the agent flags an invoice and a human reviewer confirms or rejects it, that signal feeds back into model retraining.
- Native mobile integration — embed BunqShield directly in the bunq mobile app, so the security check happens on-device before the invoice is even uploaded to the cloud.
- AI-generated invoice detection — extend the pipeline specifically to detect invoices created by Stable Diffusion, DALL-E, and ChatGPT, the next frontier of fraud.
Built With
- amazon-web-services
- fastapi
- pycharm
- python
- react
- vision-transformer