FTx Hunter

Inspiration

Fraud review at most payments companies is still a manual, unstructured process. An analyst opens a flat spreadsheet of thousands of transactions, has no idea where to start, and has no context for what "normal" looks like for any given card. A $340 Best Buy charge is unremarkable in isolation — it only looks suspicious once you know that card has never spent more than $22 in its entire history.

We wanted to build the tool that analyst actually needs: one that surfaces the highest-confidence fraud first, tells the reviewer why each transaction was flagged in plain English, and gets out of the way so they can move fast. The challenge brief gave us 1,000 transactions and 24 hours.

What it does

FTx Hunter ingests a CSV of credit card transactions, detects fraud using a combination of rule-based pattern matching and a trained XGBoost classifier, and serves a keyboard-driven review queue where a human analyst can triage every flagged transaction.

Detection — Four fraud patterns are implemented: card-testing bursts (rapid small charges on a new device), merchant cashout waves (multiple stolen cards hitting the same merchant simultaneously), high-value account takeovers (a single charge wildly above a card's normal spend), and gift-card/electronics liquidation (resellable assets purchased at multiples of the card's median). Every flagged transaction gets a plain-English reason string alongside its fraud score.
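As an illustration of what one of these detectors might look like, here is a minimal sketch of the card-testing rule. The field names (`id`, `card`, `device`, `amount`, `ts`) and all thresholds (charges of $5.00 or less, three or more within ten minutes) are assumptions made for the example, not the values used in the project.

```python
from datetime import datetime, timedelta

def card_testing_flags(txns, known_devices, max_amount=5.00,
                       window=timedelta(minutes=10), min_count=3):
    """Flag bursts of small charges made on a device the card has never used.

    All thresholds here are illustrative assumptions, not the project's values.
    """
    # Keep only small charges on devices missing from the card's history.
    candidates = {}
    for t in txns:
        is_small = t["amount"] <= max_amount
        is_new_device = t["device"] not in known_devices.get(t["card"], set())
        if is_small and is_new_device:
            candidates.setdefault((t["card"], t["device"]), []).append(t)

    flags = {}
    for (card, device), group in candidates.items():
        group.sort(key=lambda t: t["ts"])
        start = 0
        for end in range(len(group)):
            # Shrink the window until it spans at most `window` of time.
            while group[end]["ts"] - group[start]["ts"] > window:
                start += 1
            count = end - start + 1
            if count >= min_count:
                minutes = int(window.total_seconds() // 60)
                reason = (f"{count} charges of ${max_amount:.2f} or less "
                          f"within {minutes} min on new device {device}")
                for t in group[start:end + 1]:
                    flags[t["id"]] = reason
    return flags

t0 = datetime(2026, 5, 30, 12, 0)
txns = [
    {"id": "t1", "card": "c1", "device": "d9", "amount": 1.00, "ts": t0},
    {"id": "t2", "card": "c1", "device": "d9", "amount": 2.50, "ts": t0 + timedelta(minutes=2)},
    {"id": "t3", "card": "c1", "device": "d9", "amount": 0.99, "ts": t0 + timedelta(minutes=4)},
    {"id": "t4", "card": "c1", "device": "d1", "amount": 18.00, "ts": t0 + timedelta(minutes=5)},
]
flags = card_testing_flags(txns, known_devices={"c1": {"d1"}})
print(flags)
```

The returned dictionary pairs each flagged transaction with its reason string, which is the shape the review queue needs.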

Review queue — Flagged transactions are sorted by fraud score, highest first. The analyst sees one transaction at a time with full context: the score, the pattern that fired, the reason string, and the card's baseline figures. They hit Enter to approve, Backspace to dismiss and Space to escalate. No mouse required.

Audit log — Every decision is timestamped and stored. The /reviewed page shows the full history sorted by review time, giving compliance teams a verifiable record of every call made.

Re-scoring — Uploading a new CSV from the start page re-runs the full detection pipeline and repopulates the queue without any manual database work.

How we built it

We had no ground-truth labels for the live dataset, so we bootstrapped supervised learning with a weak-labelling approach:

1. Write the rules first. Four hand-crafted detectors encode exactly what each fraud pattern looks like. Each emits a boolean flag and a human-readable reason string.
2. Use rule outputs as training labels. Those flags become the target variable for XGBoost. No labelled dataset required.
3. Engineer per-card features. All anomaly signals are relative to each card's own baseline: median amount, MAD-based standard deviation, and historical device set. A $300 charge means nothing without knowing whether that card normally spends $15 or $280.
4. Train and score. XGBoost trains on 18 features with 5-fold stratified cross-validation, then scores all 1,000 transactions. Per-pattern scores are combined under an independence assumption, so a transaction matching two patterns scores higher than one matching either alone.
5. Serve via Flask. A lightweight Flask backend reads from MongoDB and serves Jinja2 templates. app.py never calls the model directly; it only reads scored results. The detection pipeline and the review UI are fully decoupled.
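The write-up does not give the exact formula for combining scores. One common way to apply an independence assumption is a noisy-OR, sketched below. The function name `combine_scores` is hypothetical, and whether the project used this exact form is an assumption.

```python
from math import prod

def combine_scores(pattern_scores):
    """Noisy-OR: probability that at least one pattern is a true hit,
    treating the per-pattern probabilities as independent.
    This is one plausible reading of the independence assumption."""
    return 1.0 - prod(1.0 - p for p in pattern_scores)

single = combine_scores([0.6])
double = combine_scores([0.6, 0.5])
print(single, double)
```

With two patterns at 0.6 and 0.5, the combined score is 1 - (0.4 × 0.5) = 0.8, which is higher than either score alone.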

Stack: Python, Flask, XGBoost, MongoDB, scikit-learn, pandas, pytest.

Challenges we ran into

No ground truth. The live dataset has no is_fraud column. Weak labelling worked, but it means our XGBoost model is only as good as our rules. Any fraud that doesn't match one of the four defined patterns would be missed entirely — a known blind spot.

Pattern 2 requires cross-card thinking. The merchant cashout wave (the second pattern listed above) is invisible when you look at transactions card by card, because every individual charge looks normal for its card. Catching it required joining across all cards, grouped by merchant and time window, a query structure that didn't fit the standard per-row feature engineering loop.
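A rough sketch of the cross-card grouping is shown here in plain Python. The 15-minute window, the three-card minimum, and the field names are assumptions for illustration; the project's actual query may differ.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def merchant_burst_flags(txns, window=timedelta(minutes=15), min_cards=3):
    """Return ids of transactions where several distinct cards hit the
    same merchant inside one time window (thresholds are assumptions)."""
    by_merchant = defaultdict(list)
    for t in txns:
        by_merchant[t["merchant"]].append(t)

    flagged = set()
    for group in by_merchant.values():
        group.sort(key=lambda t: t["ts"])
        start = 0
        for end in range(len(group)):
            # Slide the left edge so the window never exceeds `window`.
            while group[end]["ts"] - group[start]["ts"] > window:
                start += 1
            in_window = group[start:end + 1]
            # Count distinct cards, not transactions.
            if len({t["card"] for t in in_window}) >= min_cards:
                flagged.update(t["id"] for t in in_window)
    return flagged

t0 = datetime(2026, 5, 30, 9, 0)
txns = [
    {"id": "a", "card": "c1", "merchant": "m1", "ts": t0},
    {"id": "b", "card": "c2", "merchant": "m1", "ts": t0 + timedelta(minutes=2)},
    {"id": "c", "card": "c3", "merchant": "m1", "ts": t0 + timedelta(minutes=5)},
    {"id": "d", "card": "c4", "merchant": "m1", "ts": t0 + timedelta(hours=5)},
    {"id": "e", "card": "c5", "merchant": "m2", "ts": t0},
    {"id": "f", "card": "c5", "merchant": "m2", "ts": t0 + timedelta(minutes=1)},
    {"id": "g", "card": "c5", "merchant": "m2", "ts": t0 + timedelta(minutes=2)},
]
burst = merchant_burst_flags(txns)
print(sorted(burst))
```

Three charges from one card at `m2` are not flagged, because the signal is the number of distinct cards, which is exactly what a per-card view cannot see.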

Class imbalance. With ~7% fraud, a naive classifier that always predicts "not fraud" scores 93% accuracy while catching nothing. We set scale_pos_weight=16 to upweight the fraud class and switched the eval metric from AUC-ROC to AUCPR (area under the precision-recall curve), which reflects performance on the minority class far more directly.
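The arithmetic behind this can be checked in a few lines. The sketch assumes exactly 70 flagged rows out of 1,000 (the write-up says roughly 7%). The parameter dictionary uses real XGBoost parameter names, but the `objective` value is an assumption and the dictionary is not passed to a model here.

```python
# Assumed split: 70 fraud labels and 930 non-fraud labels (about 7%).
labels = [1] * 70 + [0] * 930

# A classifier that always answers "not fraud".
predictions = [0] * len(labels)
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

# The usual starting point for scale_pos_weight is negatives / positives.
ratio = labels.count(0) / labels.count(1)

# The team's tuned value of 16 sits a little above that starting ratio.
xgb_params = {
    "objective": "binary:logistic",  # assumption, not stated in the write-up
    "eval_metric": "aucpr",
    "scale_pos_weight": 16,
}
print(accuracy, recall, round(ratio, 2))
```

Accuracy comes out at 0.93 with a recall of zero, which is why accuracy alone says nothing useful at this class balance.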

Per-card baselines on short histories. Some cards have very few transactions, making their median and standard deviation unstable. A card with two prior transactions of $5 each has zero spread, so a single $200 charge produces an effectively unbounded z-score. We added an absolute floor ($300) to the account-takeover (ATO) detector and used a MAD-based std so that outliers inflate the spread less.
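A minimal sketch of a MAD-based baseline with an absolute floor follows. The $300 floor comes from the text above; the z-score threshold of 6 and the `min_scale` guard against a zero spread are assumptions added so the example runs.

```python
from statistics import median

MAD_TO_STD = 1.4826  # scales MAD to a std estimate for normal data

def robust_baseline(amounts):
    """Return (median, MAD-based std) for a card's past amounts."""
    med = median(amounts)
    mad = median([abs(a - med) for a in amounts])
    return med, MAD_TO_STD * mad

def ato_flag(amount, history, z_threshold=6.0, abs_floor=300.0, min_scale=1.0):
    """Flag only when the charge clears the absolute floor AND is far
    above the card's baseline. z_threshold and min_scale are assumptions."""
    med, scale = robust_baseline(history)
    z = (amount - med) / max(scale, min_scale)  # avoid dividing by zero
    return (amount >= abs_floor and z >= z_threshold), z

short_history = ato_flag(200.0, [5.0, 5.0])
takeover = ato_flag(340.0, [12.0, 15.0, 18.0, 22.0, 14.0])
big_spender = ato_flag(310.0, [280.0, 300.0, 295.0, 320.0, 260.0])
print(short_history, takeover, big_spender)
```

The $200 charge on the two-transaction card still gets a very large z-score, but the floor keeps it out of the ATO queue, while the $340 charge on a low-spend card is flagged and the $310 charge on a high-spend card is not.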

Accomplishments that we're proud of

CV F1 ≈ 0.91 on the weakly labelled training set, with precision ≈ 0.89 and recall ≈ 0.93, without a single manually labelled example from the live dataset.
Pattern 2 detection. Cross-card merchant burst detection is the hardest pattern to catch and the one most teams miss. Getting it right required rethinking the entire feature engineering loop.
Explainability on every flag. Every flagged transaction has a plain-English reason string generated by the rule that fired. A reviewer never has to wonder why something was flagged.
Full audit trail. Approve, dismiss, escalate, and undo — all timestamped and persisted. Most prototype fraud tools skip this entirely.
End-to-end in 24 hours. One command seeds the database, trains the model, scores all 1,000 transactions, and leaves the Flask server ready to serve the review queue.

What we learned

Per-card baselines change everything. Using per-card medians instead of global thresholds eliminated almost all false positives on high-spend cards and almost all false negatives on low-spend cards. A global "$300 = suspicious" threshold is useless; "14× this card's median = suspicious" is not.
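The contrast is easy to see on two made-up cards. The $300 threshold and the 14× multiple come from the paragraph above; the spending histories and function names are invented for illustration.

```python
from statistics import median

def global_rule(amount, threshold=300.0):
    """Same cutoff for every card."""
    return amount >= threshold

def per_card_rule(amount, history, multiple=14.0):
    """Cutoff relative to this card's own median spend."""
    return amount >= multiple * median(history)

low_spender = [12.0, 15.0, 18.0, 14.0, 16.0]        # median 15
high_spender = [260.0, 280.0, 300.0, 275.0, 290.0]  # median 280

# $250 on the low-spend card: missed globally, caught per card.
missed = (global_rule(250.0), per_card_rule(250.0, low_spender))
# $310 on the high-spend card: flagged globally, ignored per card.
noisy = (global_rule(310.0), per_card_rule(310.0, high_spender))
print(missed, noisy)
```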

Cross-card signals are invisible until you build for them. The entire mental model of fraud detection shifts when you stop thinking per-transaction and start thinking about patterns across cards. Pattern 2 would have been a zero-recall miss if we had treated each card in isolation.

Reviewer experience is a first-class product concern. A fraud score without a reason string is close to useless for a human reviewer. The reason string is what makes the difference between a 10-second decision and a 3-minute investigation.

What's next for FTx Hunter

Feedback retraining. Reviewer decisions (especially dismissals) are already logged to MongoDB. The missing piece is wiring those decisions back into the training loop — turning the analyst's judgement into labelled data that improves the model over time.

Streaming pipeline. Replace batch scoring with a real-time consumer that scores each transaction as it arrives and pushes it into the review queue immediately. The review UI stays the same; only the ingestion layer changes.

Concept drift monitoring. Retrain nightly, log F1 over time, and alert when it drops by more than five points (0.05 in F1). Fraud patterns evolve, and a static model degrades silently without this.
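One way the alerting could work is sketched below. It assumes that "five points" means 0.05 in F1 and that the drop is measured against the best earlier run; neither detail is fixed by the plan above, and `drift_alerts` is a hypothetical helper.

```python
def drift_alerts(f1_history, max_drop=0.05):
    """Return (run_index, f1, baseline) for each run whose F1 falls more
    than max_drop below the best F1 seen in earlier runs."""
    alerts = []
    best = None
    for i, f1 in enumerate(f1_history):
        if best is not None and best - f1 > max_drop:
            alerts.append((i, f1, best))
        best = f1 if best is None else max(best, f1)
    return alerts

nightly_f1 = [0.91, 0.90, 0.88, 0.85, 0.92]
alerts = drift_alerts(nightly_f1)
print(alerts)
```

Comparing against the best earlier run rather than the previous night catches slow declines that never drop five points in a single step.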

Built With

python

Updates

Magalie Nicolas started this project — May 30, 2026 07:42 PM EDT
