AI-Powered Insider Threat Detection System

Inspiration

Modern organizations face a growing class of cybersecurity risks that traditional perimeter-based security cannot detect: insider threats.

These threats arise from:

  • Compromised credentials
  • Malicious employees
  • Privilege misuse
  • Abnormal data exfiltration

Unlike external attacks, insider threats operate within trusted boundaries. They rarely match a known attack signature; instead, they appear as subtle statistical deviations from a user's normal behavior.

This project builds a behavior-driven anomaly detection system that models user activity patterns and flags deviations in real time using machine learning.

Instead of rule-based detection, we use:

Behavioral Baselining + Statistical Anomaly Detection


What It Does

The system:

  • Generates realistic user activity logs
  • Builds per-user behavioral baselines
  • Detects anomalies using Isolation Forest
  • Assigns dynamic risk levels (Low / Moderate / High)
  • Stores alerts with feature-level explainability
  • Provides real-time dashboard visualization
  • Displays last 30-day user activity trends
  • Simulates both normal and attack behaviors
  • Uses a separate ML microservice for inference

Every login event passes through ML inference before being stored.


System Architecture

Technology Stack:

Frontend: React + Vite
Backend: Node.js + Express
ML Service: FastAPI + scikit-learn
Database: MongoDB Atlas

System Flow:

User Activity Event
→ Frontend (React)
→ Backend (Node.js/Express)
→ ML Microservice (FastAPI)
→ Backend (Risk Calibration + Storage)
→ MongoDB Atlas
→ Frontend Dashboard

Separation of concerns:

  • Backend handles orchestration and persistence
  • ML service handles statistical inference
  • Frontend handles visualization and UX

Feature Engineering

Each login event is transformed into a statistical feature vector.

Baseline Mean

$$ \mu_x = \frac{1}{n} \sum_{i=1}^{n} x_i $$

Baseline Standard Deviation

$$ \sigma_x = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_x)^2} $$

Standardized Z-Score

$$ z = \frac{x - \mu}{\sigma} $$

Where:

  • $x$ = observed value
  • $\mu$ = baseline mean
  • $\sigma$ = baseline standard deviation
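
As a concrete illustration of the three formulas above, here is a minimal sketch that computes a baseline and a z-score using only Python's standard library. The sample history and the function names are hypothetical. `pstdev` is used because the formula above divides by $n$ (population standard deviation).

```python
from statistics import fmean, pstdev

def baseline(values):
    """Return (mean, population standard deviation) of a user's history."""
    return fmean(values), pstdev(values)

def z_score(x, mu, sigma):
    """Standardize an observation against the baseline; assumes sigma > 0."""
    return (x - mu) / sigma

# Hypothetical history: files accessed per session by one user.
history = [2, 4, 4, 4, 5, 5, 7, 9]
mu, sigma = baseline(history)
z = z_score(11, mu, sigma)
print(mu, sigma, z)  # 5.0 2.0 3.0
```

An observation of 11 files sits three standard deviations above this user's baseline.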

Binary Flags

New IP detection:

$$ \mathrm{new\_ip\_flag} = \begin{cases} 1 & \text{if IP not in trusted set} \\ 0 & \text{otherwise} \end{cases} $$

New device detection:

$$ \mathrm{new\_device\_flag} = \begin{cases} 1 & \text{if device not recognized} \\ 0 & \text{otherwise} \end{cases} $$


Final Feature Vector

$$ X = \left[ z_{\text{login}}, z_{\text{files}}, z_{\text{download}}, \mathrm{new\_ip\_flag}, \mathrm{new\_device\_flag}, \mathrm{sensitive\_flag} \right] $$
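
The sketch below shows one way a raw event could be turned into this six-element vector. The event field names (`login_hour`, `files_accessed`, `download_mb`, `ip`, `device_id`, `sensitive_access`), the baseline layout, and the sample values are all assumptions made for illustration; the project's actual schema may differ.

```python
def z(x, mu, sigma):
    """Standardized z-score; assumes sigma > 0."""
    return (x - mu) / sigma

def build_feature_vector(event, baseline, trusted_ips, known_devices):
    """Return [z_login, z_files, z_download, new_ip, new_device, sensitive]."""
    return [
        z(event["login_hour"], *baseline["login_hour"]),
        z(event["files_accessed"], *baseline["files_accessed"]),
        z(event["download_mb"], *baseline["download_mb"]),
        0 if event["ip"] in trusted_ips else 1,            # new IP flag
        0 if event["device_id"] in known_devices else 1,   # new device flag
        1 if event["sensitive_access"] else 0,             # sensitive flag
    ]

# Hypothetical per-user baseline: metric -> (mean, standard deviation).
user_baseline = {
    "login_hour": (9.0, 1.0),
    "files_accessed": (20.0, 5.0),
    "download_mb": (50.0, 10.0),
}
event = {
    "login_hour": 3, "files_accessed": 45, "download_mb": 120,
    "ip": "203.0.113.7", "device_id": "unknown-laptop",
    "sensitive_access": True,
}
vector = build_feature_vector(
    event, user_baseline,
    trusted_ips={"198.51.100.10"}, known_devices={"work-laptop"},
)
print(vector)  # [-6.0, 5.0, 7.0, 1, 1, 1]
```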


Model

We use Isolation Forest for unsupervised anomaly detection.

Configuration

$$ \text{IsolationForest}(\mathrm{n\_estimators} = 100,\ \mathrm{contamination} = 0.05) $$

Anomaly Score

$$ \text{score}(x) = \mathrm{decision\_function}(x) $$

Prediction Rule

$$ \text{prediction} = \begin{cases} -1 & \text{anomaly} \\ 1 & \text{normal} \end{cases} $$
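
The configuration, score, and prediction rule above map directly onto scikit-learn's `IsolationForest`. The sketch below uses synthetic training data (standard-normal z-scores with rarely set flags) and a fixed `random_state`; both are assumptions for reproducibility, not the project's real dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic "normal" training set with the six-feature layout of X.
z_part = rng.normal(0.0, 1.0, size=(500, 3))           # z-scores near 0
flag_part = (rng.random((500, 3)) < 0.05).astype(float)  # flags rarely set
X_train = np.hstack([z_part, flag_part])

model = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
model.fit(X_train)

normal_event = np.array([[0.1, -0.3, 0.2, 0, 0, 0]])
attack_event = np.array([[6.0, 5.0, 7.0, 1, 1, 1]])

for name, x in [("normal", normal_event), ("attack", attack_event)]:
    score = model.decision_function(x)[0]  # lower means more anomalous
    label = model.predict(x)[0]            # -1 = anomaly, 1 = normal
    print(f"{name}: score={score:.3f} prediction={label}")
```

`decision_function` returns negative values for points the model treats as anomalies, which is why `predict` yields -1 for them.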

Risk levels are calibrated from anomaly scores into:

  • Low
  • Moderate
  • High

Normal vs Attack Simulation

Normal logs are sampled from the training distribution to maintain statistical alignment.

Attack simulations introduce:

  • Extreme z-score deviations
  • External IP addresses
  • Unknown devices
  • Elevated file and download activity
  • Sensitive access enabled

Each event is processed in real time by ML inference.
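
The two simulation modes described above could be sketched as follows. `TRAINING_ROWS`, the function names, and the 4 to 8 range for attack z-scores are hypothetical choices for illustration only.

```python
import random

# Hypothetical training rows in the feature-vector layout:
# [z_login, z_files, z_download, new_ip, new_device, sensitive_flag]
TRAINING_ROWS = [
    [0.2, -0.5, 0.1, 0, 0, 0],
    [-1.1, 0.3, 0.4, 0, 0, 0],
    [0.7, 0.9, -0.2, 0, 0, 1],
]

def simulate_normal(rng):
    """Resample a training row so normal traffic matches the training data."""
    return list(rng.choice(TRAINING_ROWS))

def simulate_attack(rng):
    """Push every z-score far from baseline and raise all three flags."""
    z_scores = [rng.uniform(4.0, 8.0) for _ in range(3)]
    return z_scores + [1, 1, 1]

rng = random.Random(42)
normal_event = simulate_normal(rng)
attack_event = simulate_attack(rng)
print(normal_event)
print(attack_event)
```

Resampling from the training rows is the simplest way to guarantee that simulated normal events cannot drift away from what the model was trained on.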


Challenges

1. Baseline Drift

Normal simulations initially generated anomalies due to distribution mismatch.

Solution:
Aligned the runtime simulation with the training dataset so that both draw from the same distribution.

2. Score Calibration

Isolation Forest anomaly scores are relative and tightly clustered.

Solution:
Recalibrated thresholds to produce meaningful Low / Moderate / High segmentation.
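
The writeup does not state how the thresholds were chosen. One common option, shown here purely as an assumption, is to take cut-offs from low quantiles of the training scores, so the levels adapt to however tightly the scores cluster. The quantile values and the synthetic scores are hypothetical.

```python
def calibrate_thresholds(train_scores, high_q=0.01, moderate_q=0.05):
    """Return (high_cut, moderate_cut) taken from low quantiles of the scores."""
    ordered = sorted(train_scores)

    def quantile(p):
        return ordered[int(p * (len(ordered) - 1))]

    return quantile(high_q), quantile(moderate_q)

def risk_level(score, high_cut, moderate_cut):
    """Map an anomaly score (lower = more anomalous) to a risk label."""
    if score <= high_cut:
        return "High"
    if score <= moderate_cut:
        return "Moderate"
    return "Low"

# Synthetic, tightly clustered scores standing in for decision_function output.
train_scores = [i / 1000 for i in range(-50, 150)]
high_cut, moderate_cut = calibrate_thresholds(train_scores)

for s in (-0.2, -0.045, 0.1):
    print(s, risk_level(s, high_cut, moderate_cut))
```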

3. Microservice Communication

Handled:

  • CORS issues
  • Environment variables
  • Production URLs
  • Cold start behavior

4. Statistical Sensitivity

Small standard deviations caused inflated z-scores.

Solution:
Refined variance scaling.
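
The exact scaling change is not described here. A common guard against this problem, sketched below as an assumption rather than the project's actual fix, is to put a floor under the standard deviation before dividing. The `min_sigma` parameter and the sample numbers are hypothetical.

```python
def safe_z_score(x, mu, sigma, min_sigma=0.5):
    """Z-score with a floor on sigma so tiny baselines cannot inflate it."""
    return (x - mu) / max(sigma, min_sigma)

# A user who always logs in at almost exactly 09:00 has a very small sigma.
mu, sigma = 9.0, 0.05
raw = (9.5 - mu) / sigma               # a 30-minute shift looks extreme
floored = safe_z_score(9.5, mu, sigma)
print(raw, floored)
```

With the floor in place, a half-hour shift for a very regular user scores 1 rather than roughly 10.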


What’s Next

Real-Time Behavioral Drift

$$ \Delta_{\text{behavior}} = |\mu_{\text{current}} - \mu_{\text{baseline}}| $$
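
Since drift tracking is a planned feature, the following is only a sketch of how the formula above could be evaluated for a single metric. The function name and the sample windows are hypothetical.

```python
from statistics import fmean

def behavior_drift(baseline_values, recent_values):
    """Absolute difference between the recent mean and the baseline mean."""
    return abs(fmean(recent_values) - fmean(baseline_values))

baseline_window = [10, 12, 11, 9, 13]  # older sessions, mean 11
recent_window = [18, 20, 19]           # most recent sessions, mean 19
drift = behavior_drift(baseline_window, recent_window)
print(drift)  # 8.0
```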

Planned enhancements:

  • Email notification system
  • Role-based risk modeling (IT / Finance / HR)
  • Rolling 30-day auto-retraining
  • Geo-location anomaly detection
  • Sequence modeling (LSTM)
  • Per-user risk trend graphs

Conclusion

This project demonstrates a statistically principled insider threat detection system built using:

  • Behavioral modeling
  • Unsupervised anomaly detection
  • Microservice ML inference
  • Real-time risk calibration

Together, these components form a scalable foundation for enterprise-grade behavioral security systems.
