Inspiration

During a guest lecture at VNIT Nagpur, a banking professional said something that stopped us cold:

"The average time to detect insider fraud in Indian banks is 197 days. By the time we catch it, the money is already gone."

That number became our north star.

The deeper problem isn't missing data — every login, every transaction, every file download is already logged. The problem is that existing systems use static global rules. Flag anyone who accesses more than 50 accounts. Flag anyone who logs in after 8 PM. Fraudsters know these rules. They access 49 accounts. They log in at 7:59 PM. Every single day. For months.

We asked one question: what if the system stopped comparing employees to each other — and instead compared each employee to their own history?

That question became SentinelAI.


What It Does

SentinelAI is a real-time AI system that detects fraud committed by bank employees, specifically the internal and privileged-user fraud that static rules cannot catch.

Every login session is scored in real time using a three-layer AI ensemble:

Layer 1: Autoencoder (Structural Anomaly). Trained only on normal behaviour. When it sees a fraud session, it cannot reconstruct it accurately. That reconstruction error is our first signal.

Layer 2: LSTM (Temporal Anomaly). Looks at 7 consecutive days of behaviour. Catches the "slow escalator" fraudster who increases transaction amounts by just 10% per day: completely invisible in a single session, unmistakable across a week.

Layer 3: CatBoost (Final Risk Score). Fuses both model outputs with 20 engineered features into a final risk score from 0 to 100.

Every alert includes SHAP-explained reasons — the top 5 factors that triggered the flag, in plain English. No black box. Fully auditable for RBI compliance.
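
As a rough sketch of how the top 5 factors could be selected once per-feature contributions are available: the feature names, the numbers, and the top_factors helper below are all hypothetical, and the snippet does not call the SHAP library itself.

```python
def top_factors(contributions: dict[str, float], k: int = 5) -> list[str]:
    """Return the k features with the largest absolute contribution,
    phrased as short plain-English reasons."""
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    reasons = []
    for name, value in ranked[:k]:
        direction = "raised" if value > 0 else "lowered"
        reasons.append(f"{name} {direction} the risk score ({value:+.2f})")
    return reasons


# Hypothetical per-feature contributions for one flagged session.
example = {
    "download_mb_zscore": 2.10,
    "is_off_hours": 1.40,
    "accounts_accessed_zscore": 0.90,
    "cross_department_access": 0.35,
    "hour_sin": -0.05,
    "txn_amount_zscore": -0.60,
    "login_count": 0.02,
}
reasons = top_factors(example)
for reason in reasons:
    print(reason)
```

Ranking by absolute value keeps factors that pushed the score down in view as well, which matters when an auditor asks why a session was not scored higher.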

The moment a session is flagged, SentinelAI calls the Claude API and auto-generates a court-ready investigation report in under 2 seconds — Alert Summary, Suspicious Behaviour, Risk Assessment, and Recommended Actions. What used to take an investigator 2–4 hours now happens before they open their laptop.
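
A minimal sketch of how the report request might be assembled before it is sent to the LLM. The four section names come from the description above; the function name, the field names, and the prompt wording are assumptions, and the API call itself is left out.

```python
REPORT_SECTIONS = [
    "Alert Summary",
    "Suspicious Behaviour",
    "Risk Assessment",
    "Recommended Actions",
]


def build_report_prompt(employee_id: str, risk_score: int, reasons: list[str]) -> str:
    """Assemble the text sent to the LLM. The model only writes the report;
    the risk score and reasons are computed upstream by the ML ensemble."""
    reason_lines = "\n".join(f"- {r}" for r in reasons)
    section_lines = "\n".join(f"{i}. {s}" for i, s in enumerate(REPORT_SECTIONS, 1))
    return (
        "Write an investigation report for the flagged session below.\n"
        f"Employee ID: {employee_id}\n"
        f"Risk score (0-100): {risk_score}\n"
        f"Top factors:\n{reason_lines}\n\n"
        f"Use exactly these sections, in order:\n{section_lines}\n"
        "Do not change the risk score or add factors that are not listed."
    )


# Hypothetical inputs for one alert.
prompt = build_report_prompt("EMP-0042", 91, ["Login at 03:10", "1,800 MB downloaded"])
print(prompt)
```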

We detect 5 fraud archetypes that static rules miss entirely:

Archetype            | Pattern
Off-Hours Access     | Login at 2–4 AM when no supervisor is present
Bulk Download        | 60–150 accounts, 500–2,000 MB in one session
Privilege Escalation | Accessing systems outside own department
Slow Escalator       | Transaction amounts growing 10% per day over 90 days
Dormant Abuse        | Targeting accounts with no recent activity

The full system includes a FastAPI backend with 6 REST endpoints, a React dashboard with live risk heatmap and alert queue, and Docker deployment — one command brings the entire stack live.


How We Built It

Data Pipeline. We generated 5,099 synthetic bank activity records across 100 employees over 90 days, with 5 fraud archetypes injected for 6 flagged users. The key design decision was computing per-user rolling z-scores over a 30-day window with .shift(1) to guarantee zero data leakage: today's value is never in the baseline used to score today.

Feature Engineering. 20 features per session: 6 rolling z-scores, 4 time-based features (off-hours, late-night, cyclic hour encoding), 5 access pattern flags, 2 categorical encodings, and 3 raw counts. All features are scaled with StandardScaler. The encoding maps are fixed and sorted in both training and inference code to guarantee they never diverge.
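
To illustrate the cyclic hour encoding: the text does not say which form was used, so the sine/cosine pair below is an assumption based on common practice, and encode_hour and distance are hypothetical helper names. The point is that 23:00 and 00:00 end up close together, which a raw hour number cannot express.

```python
import math


def encode_hour(hour: int) -> tuple[float, float]:
    """Map an hour (0-23) to a point on the unit circle."""
    angle = 2 * math.pi * (hour % 24) / 24
    return math.sin(angle), math.cos(angle)


def distance(a: int, b: int) -> float:
    """Euclidean distance between two encoded hours."""
    return math.dist(encode_hour(a), encode_hour(b))


print(distance(23, 0))   # small: adjacent hours across midnight
print(distance(0, 12))   # largest possible: opposite sides of the clock
```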

Model Training

  • Autoencoder: trained on normal records only, threshold at 95th percentile of normal reconstruction errors
  • LSTM: 7-day sliding window sequences, one per user, trained on normal users only
  • CatBoost: class weights {0:1, 1:10} to handle the 6% fraud ratio without oversampling
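
The autoencoder threshold in the list above can be sketched with the standard library alone. The error values here are random stand-ins rather than real reconstruction errors, and anomaly_threshold is a hypothetical name.

```python
import random
import statistics


def anomaly_threshold(normal_errors: list[float]) -> float:
    """95th percentile of reconstruction errors on normal sessions."""
    # quantiles(n=20) returns 19 cut points at 5%, 10%, ..., 95%.
    return statistics.quantiles(normal_errors, n=20, method="inclusive")[-1]


random.seed(0)
# Stand-in errors; in the real pipeline these would come from the autoencoder.
normal_errors = [abs(random.gauss(0.02, 0.01)) for _ in range(1000)]
threshold = anomaly_threshold(normal_errors)

flagged_share = sum(e > threshold for e in normal_errors) / len(normal_errors)
print(f"threshold={threshold:.4f}, normal sessions above it={flagged_share:.1%}")
```

By construction, roughly 5% of normal sessions also exceed this threshold, so the reconstruction error is better treated as one input to the final score than as a verdict on its own.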

Backend. FastAPI with a lifespan context manager for startup, request-timing middleware that logs every call with its duration and request ID, and a 5-minute cache of the scored DataFrame, so /alerts and /dashboard-stats return in under 10 ms instead of re-scoring 5,099 records on every call.

Testing. 222 automated pytest tests across 7 files, covering config, data validation, model architecture, all 6 API endpoints, encoding consistency, cache behaviour, and feature vector construction.

Stack: Python 3.11 · TensorFlow 2.20 · CatBoost · SHAP · FastAPI · React · Chart.js · Docker · Anthropic Claude API


Challenges We Ran Into

Data leakage in rolling z-scores. Our first implementation included today's value in its own baseline. A clerk suddenly downloading 2 GB would get a heavily dampened z-score because the spike contaminated the mean and standard deviation used to score it. We fixed this with .shift(1) and wrote a dedicated test to prove the fix works: the test injects a spike on day 31 and asserts the z-score is above 5.
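
The effect of the leak can be reproduced without pandas. This is a simplified sketch for a single user with made-up download sizes; zscore_leaky and zscore_shifted are hypothetical names, and the shifted version mirrors what .shift(1) achieves by building the baseline from prior days only.

```python
import statistics


def zscore_leaky(history: list[float], today: float, window: int = 30) -> float:
    """Baseline includes today's value (the original bug)."""
    base = (history + [today])[-window:]
    return (today - statistics.mean(base)) / statistics.stdev(base)


def zscore_shifted(history: list[float], today: float, window: int = 30) -> float:
    """Baseline is built from prior days only, the effect of .shift(1)."""
    base = history[-window:]
    return (today - statistics.mean(base)) / statistics.stdev(base)


# 30 ordinary days of about 50 MB, then a 2,000 MB spike on day 31.
history = [50.0 + (day % 5) for day in range(30)]
spike = 2000.0

leaky = zscore_leaky(history, spike)
shifted = zscore_shifted(history, spike)
print(f"leaky:   {leaky:.2f}")
print(f"shifted: {shifted:.2f}")
```

With a window of n values and the sample standard deviation, a score that includes its own value can never exceed (n-1)/sqrt(n), about 5.3 for n = 30, so the size of the spike is lost. The shifted score scales with the spike.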

Training and inference encoding mismatch. scikit-learn's LabelEncoder assigns codes from the sorted set of categories it sees when it is fitted. If inference fits a fresh encoder on a different set of categories, such as a single incoming session, the codes silently differ and every fraud score is wrong. We fixed this by replacing LabelEncoder with a fixed sorted dictionary that lives identically in both feature_engineering.py and api/routers/score.py. A test explicitly asserts the two maps are equal, so if anyone changes one without the other, CI fails immediately.
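
A small illustration of the failure mode and the fix. The department names are hypothetical and fit_fresh_codes is a stand-in for refitting an encoder; however the codes are assigned, refitting on a different set of categories shifts them.

```python
# Hypothetical category list; the real one lives in the project code.
DEPARTMENTS = ["treasury", "loans", "it_admin", "retail"]

# Fixed map built once from the full, sorted category list.
DEPARTMENT_CODES = {name: code for code, name in enumerate(sorted(DEPARTMENTS))}


def fit_fresh_codes(seen: list[str]) -> dict[str, int]:
    """Mimics refitting an encoder on whatever categories happen to be seen."""
    return {name: code for code, name in enumerate(sorted(set(seen)))}


training_codes = fit_fresh_codes(DEPARTMENTS)
inference_codes = fit_fresh_codes(["treasury", "retail"])  # one small batch

print(DEPARTMENT_CODES["treasury"])  # code from the fixed map
print(inference_codes["treasury"])   # code from the refitted encoder
```

The same department gets two different codes, and nothing raises an error. A shared fixed map removes the dependency on what the encoder happened to see.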

The slow escalator is invisible to single-session models. A 10% daily increase looks almost normal on any given day. The autoencoder missed it entirely. The LSTM caught it because it sees the upward trend across 7 consecutive days, and that sequence breaks the pattern the model learned for normal users.
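
The arithmetic behind this is easy to check. The starting amount below is hypothetical; the 10% growth rate and the 7-day window come from the text.

```python
DAILY_GROWTH = 0.10   # 10% per day, as in the slow escalator archetype
WINDOW_DAYS = 7       # the LSTM's sequence length

start_amount = 10_000.0  # hypothetical starting transaction amount
amounts = [start_amount * (1 + DAILY_GROWTH) ** day for day in range(WINDOW_DAYS)]

day_over_day = amounts[-1] / amounts[-2] - 1
across_window = amounts[-1] / amounts[0] - 1

print(f"change vs. yesterday:   {day_over_day:.0%}")
print(f"change across the week: {across_window:.0%}")
```

Each day is only 10% above the previous one, but day 7 is about 77% above day 1 (1.1 to the power 6 is roughly 1.77). That compounding is the signal a sequence model can pick up.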

Slow endpoints from re-scoring on every request. The /alerts and /dashboard-stats endpoints were each running the full ML pipeline on all 5,099 records on every request, roughly 1 second per call. We fixed this with a 5-minute in-memory cache on the scored DataFrame, dropping response time to under 10 ms. A test guards the cache by asserting that two consecutive calls to get_scored_df() return the exact same Python object.
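
One way such a cache and its identity check might look. This is a sketch, not the project's get_scored_df(): the ScoredCache class, the injectable clock, and the stand-in scoring function are all assumptions made so the behaviour can be shown without waiting five minutes.

```python
import time
from typing import Any, Callable

CACHE_TTL_SECONDS = 5 * 60


class ScoredCache:
    """Holds one scored result and recomputes it only after the TTL expires."""

    def __init__(self, compute: Callable[[], Any], clock: Callable[[], float] = time.monotonic):
        self._compute = compute
        self._clock = clock
        self._value: Any = None
        self._expires_at = float("-inf")

    def get(self) -> Any:
        now = self._clock()
        if now >= self._expires_at:
            self._value = self._compute()  # the expensive scoring run
            self._expires_at = now + CACHE_TTL_SECONDS
        return self._value


# Stand-in for the scoring pipeline: returns a new list object on every run.
calls = []


def score_all_records() -> list[int]:
    calls.append(1)
    return [0] * 5099


fake_now = [0.0]
cache = ScoredCache(score_all_records, clock=lambda: fake_now[0])

first = cache.get()
second = cache.get()
print(first is second, len(calls))  # same object, pipeline ran once

fake_now[0] += CACHE_TTL_SECONDS + 1
third = cache.get()
print(third is first, len(calls))   # new object after expiry
```

Checking object identity rather than equality is what proves the pipeline did not run a second time: two separate scoring runs could produce equal results but never the same object.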

Class imbalance. Fraud makes up roughly 5–6% of sessions in our dataset. Naive training reaches 94% accuracy by predicting "normal" for everything. We used CatBoost's class_weights = {0:1, 1:10}, which penalises a missed fraud 10× more than a false positive, rather than oversampling, so the original data distribution is preserved.
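
A quick worked example of why accuracy misleads here and what the 1:10 weighting changes. It uses a made-up set of 1,000 labels at a 6% fraud ratio and does not call CatBoost.

```python
FRAUD_RATIO = 0.06
CLASS_WEIGHTS = {0: 1, 1: 10}  # same weights as in the text

n = 1000
n_fraud = round(n * FRAUD_RATIO)
labels = [1] * n_fraud + [0] * (n - n_fraud)
always_normal = [0] * n  # a model that never flags anything

accuracy = sum(p == y for p, y in zip(always_normal, labels)) / n
recall = sum(p == 1 and y == 1 for p, y in zip(always_normal, labels)) / n_fraud

# Share of the total training weight carried by the fraud class.
fraud_weight = n_fraud * CLASS_WEIGHTS[1]
normal_weight = (n - n_fraud) * CLASS_WEIGHTS[0]
fraud_share = fraud_weight / (fraud_weight + normal_weight)

print(f"accuracy={accuracy:.2f}, fraud recall={recall:.2f}")
print(f"fraud share of training weight={fraud_share:.2f}")
```

The do-nothing model scores 94% accuracy while catching no fraud at all. With the weights applied, the fraud class carries about 39% of the total training weight instead of 6%, without duplicating any rows.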


Accomplishments That We're Proud Of

The LSTM catches what the Autoencoder misses. Two models catching different fraud types in a genuine ensemble — not just stacking for stacking's sake.

222 tests, 86% coverage, one real bug caught. The merge collision bug (name_x/name_y columns in the alert response) was caught by the test suite before it ever reached a judge's browser. That's what a test suite is for.

The encoding consistency test. A single test that would have prevented a class of silent production bugs that are notoriously hard to debug. We're proud of that one.

A full investigation report in under 2 seconds. Taking what costs an investigator 2–4 hours and delivering it automatically — that's the feature that makes this genuinely useful to a real fraud team, not just technically impressive.

Production-grade from day one. Lifespan startup handling, request-timing middleware with request IDs, structured logging, Docker multi-stage build, non-root container user, health checks, and a one-command deployment. Not a prototype.


What We Learned

Per-user baselines beat global rules. Every fraud detection system should ask "normal for whom?" rather than "normal globally." This insight drove every design decision.

Explainability is not optional in regulated industries. SHAP was designed into the system from the start, not added at the end. Without it, no compliance team would trust the output.

Tests find real bugs, not just verify working code. The test suite caught the merge collision, and dedicated tests now guard the data leakage fix, the encoding maps, and the scoring cache. Not one of these issues was found by manually testing the UI.

LLMs as a communication layer, not a decision layer. Using Claude to write the investigation report — not to make the fraud decision — is the correct architecture. ML makes the call. The LLM explains it in language investigators can act on.

Temporal models and structural models are complementary. The Autoencoder catches sessions that look wrong in isolation. The LSTM catches sessions that look fine in isolation but wrong as part of a sequence. You need both.


What's Next for SentinelAI

Real CBS Integration. Connect directly to Union Bank's Core Banking System log export format instead of synthetic data. The feature engineering pipeline is already designed to handle real log schemas with minimal changes.

Shadow Mode Deployment. Run SentinelAI silently alongside the existing system for 30 days before going live, generating alerts but not acting on them. This validates the false-positive rate on real employee behaviour before the fraud team sees a single alert.

Continuous Retraining. Add a scheduled retraining pipeline that rebuilds the rolling baselines every 30 days as employees' normal behaviour evolves. A manager promoted to a new role should not be flagged for doing their new job.

Network Analysis Layer. Add a fourth model: a Graph Neural Network that looks at which employees access the same accounts, systems, and files. Collusion between two employees leaves a network signature that no per-user baseline can detect.

Expansion to All 12 PSBs. File for DFS certification and deploy across all 12 public sector banks under the Department of Financial Services, with no changes to the architecture, since the models learn each bank's patterns from that bank's own employee data.

Mobile Alert Dashboard. A mobile app for fraud investigators so they receive push notifications for HIGH risk alerts and can approve or dismiss them from their phone with a full report attached.

Team SentinelAI · VNIT Nagpur · iDEA 2.0 · Union Bank of India
