AI-Powered Insider Threat Detection System
Inspiration
Modern organizations face a growing class of cybersecurity risks that traditional perimeter-based security cannot detect: insider threats.
These threats arise from:
- Compromised credentials
- Malicious employees
- Privilege misuse
- Abnormal data exfiltration
Unlike external attacks, insider threats operate within trusted boundaries. They rarely match a known attack signature; instead they appear as subtle statistical deviations from a user's normal behavior.
This project builds a behavior-driven anomaly detection system that models user activity patterns and flags deviations in real time using machine learning.
Instead of rule-based detection, we use:
Behavioral Baselining + Statistical Anomaly Detection
What It Does
The system:
- Generates realistic user activity logs
- Builds per-user behavioral baselines
- Detects anomalies using Isolation Forest
- Assigns dynamic risk levels (Low / Moderate / High)
- Stores alerts with feature-level explainability
- Provides real-time dashboard visualization
- Displays last 30-day user activity trends
- Simulates both normal and attack behaviors
- Uses a separate ML microservice for inference
Every login event passes through ML inference before being stored.
System Architecture
Technology Stack:
Frontend: React + Vite
Backend: Node.js + Express
ML Service: FastAPI + scikit-learn
Database: MongoDB Atlas
System Flow:
User Activity Event
→ Frontend (React)
→ Backend (Node.js/Express)
→ ML Microservice (FastAPI)
→ Backend (Risk Calibration + Storage)
→ MongoDB Atlas
→ Frontend Dashboard
Separation of concerns:
- Backend handles orchestration and persistence
- ML service handles statistical inference
- Frontend handles visualization and UX
Feature Engineering
Each login event is transformed into a statistical feature vector.
Baseline Mean
$$ \mu_x = \frac{1}{n} \sum_{i=1}^{n} x_i $$
Baseline Standard Deviation
$$ \sigma_x = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_x)^2} $$
Standardized Z-Score
$$ z = \frac{x - \mu}{\sigma} $$
Where:
- $x$ = observed value
- $\mu$ = baseline mean
- $\sigma$ = baseline standard deviation
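As a minimal sketch of the three formulas above, the baseline and z-score can be computed with the Python standard library. The sample history and the function names are illustrative assumptions, not taken from the project code.

```python
from statistics import fmean, pstdev


def baseline(values):
    """Return (mean, population standard deviation) of a user's history."""
    return fmean(values), pstdev(values)


def z_score(x, mu, sigma):
    """Standardize an observation against the user's baseline."""
    return (x - mu) / sigma


# Hypothetical history: files accessed per session by one user.
history = [10, 12, 11, 9, 13]
mu, sigma = baseline(history)

# A session with 25 file accesses is far above this user's baseline.
z = z_score(25, mu, sigma)
print(mu, sigma, z)
```

`pstdev` divides by $n$, which matches the population form of $\sigma_x$ given above.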
Binary Flags
New IP detection:
$$ \text{new_ip_flag} = 1 \quad \text{if IP not in trusted set} $$
$$ \text{new_ip_flag} = 0 \quad \text{otherwise} $$
New device detection:
$$ \text{new_device_flag} = 1 \quad \text{if device not recognized} $$
$$ \text{new_device_flag} = 0 \quad \text{otherwise} $$
Final Feature Vector
$$ X = \left[ z_{\text{login}}, z_{\text{files}}, z_{\text{download}}, \text{new_ip_flag}, \text{new_device_flag}, \text{sensitive_flag} \right] $$
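The mapping from a raw event to $X$ might look like the sketch below. The dictionary layout, the key names, and the reading of $z_{\text{login}}$ as a login-hour z-score are all assumptions made for illustration; the project's actual schema is not shown in this write-up.

```python
def build_feature_vector(event, profile):
    """Map one raw login event to the six-element vector X.

    `event` and `profile` are hypothetical dictionaries; their keys are
    illustrative and not taken from the project's schema.
    """
    def z(key):
        mu, sigma = profile["baseline"][key]
        return (event[key] - mu) / sigma

    return [
        z("login_hour"),                                       # z_login (assumed meaning)
        z("files_accessed"),                                   # z_files
        z("download_mb"),                                      # z_download
        int(event["ip"] not in profile["trusted_ips"]),        # new_ip_flag
        int(event["device"] not in profile["known_devices"]),  # new_device_flag
        int(event["sensitive_access"]),                        # sensitive_flag
    ]


# Hypothetical per-user profile: (mean, std) per metric plus trusted sets.
profile = {
    "baseline": {
        "login_hour": (9.0, 1.0),
        "files_accessed": (10.0, 2.0),
        "download_mb": (50.0, 10.0),
    },
    "trusted_ips": {"10.0.0.5"},
    "known_devices": {"laptop-01"},
}

event = {
    "login_hour": 3,
    "files_accessed": 30,
    "download_mb": 400,
    "ip": "203.0.113.7",
    "device": "unknown-phone",
    "sensitive_access": True,
}

X = build_feature_vector(event, profile)
print(X)
```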
Model
We use Isolation Forest for unsupervised anomaly detection.
Configuration
$$ \text{IsolationForest}(\text{n_estimators} = 100,\ \text{contamination} = 0.05) $$
Anomaly Score
$$ \text{score}(x) = \text{decision_function}(x) $$
Prediction Rule
$$ \text{prediction} = \begin{cases} -1 & \text{anomaly} \\ 1 & \text{normal} \end{cases} $$
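The configuration, score, and prediction rule above correspond to scikit-learn calls roughly as follows. This is a sketch on synthetic data: the training matrix, the two sample events, and the fixed `random_state` are assumptions added for reproducibility and are not the project's dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic stand-in for baseline traffic: z-scores near 0, all flags 0.
X_train = np.hstack([rng.normal(0.0, 1.0, size=(500, 3)), np.zeros((500, 3))])

model = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
model.fit(X_train)

# Hypothetical events in the order [z_login, z_files, z_download, flags...].
normal_event = np.array([[0.1, -0.2, 0.3, 0, 0, 0]])
attack_event = np.array([[6.0, 10.0, 35.0, 1, 1, 1]])

for name, x in [("normal", normal_event), ("attack", attack_event)]:
    score = model.decision_function(x)[0]  # lower means more anomalous
    label = model.predict(x)[0]            # -1 = anomaly, 1 = normal
    print(name, round(float(score), 3), int(label))
```

In scikit-learn, `decision_function` returns negative values for points the model treats as outliers, and `contamination` sets the offset that separates the two classes.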
Risk levels are calibrated from anomaly scores into:
- Low
- Moderate
- High
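One simple way to express this bucketing is a pair of score cutoffs. The sketch below is in Python for consistency with the ML examples, although the write-up places risk calibration in the Node.js backend. Both cutoff values are hypothetical placeholders, not the project's tuned thresholds.

```python
def risk_level(score, moderate_cutoff=0.0, high_cutoff=-0.1):
    """Bucket an Isolation Forest decision_function score into a risk level.

    Lower scores are more anomalous. Both cutoffs are hypothetical
    placeholders and would need tuning against real score distributions.
    """
    if score <= high_cutoff:
        return "High"
    if score <= moderate_cutoff:
        return "Moderate"
    return "Low"


for s in (0.12, -0.03, -0.2):
    print(s, risk_level(s))
```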
Normal vs Attack Simulation
Normal logs are sampled from the training distribution to maintain statistical alignment.
Attack simulations introduce:
- Extreme z-score deviations
- External IP addresses
- Unknown devices
- Elevated file and download activity
- Sensitive access enabled
Each event is processed in real time by ML inference.
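A simulator along these lines could be sketched as below. Modeling normal z-scores as standard normal draws and attack z-scores as uniform draws between 4 and 10 is an assumption for illustration; the write-up states only that normal logs are sampled from the training distribution.

```python
import random


def simulate_event(kind, rng):
    """Generate one synthetic feature vector.

    Layout: [z_login, z_files, z_download, new_ip, new_device, sensitive].
    The distributions used here are illustrative assumptions.
    """
    if kind == "normal":
        z = [rng.gauss(0.0, 1.0) for _ in range(3)]     # close to baseline
        flags = [0, 0, 0]                               # trusted IP and device
    elif kind == "attack":
        z = [rng.uniform(4.0, 10.0) for _ in range(3)]  # extreme deviations
        flags = [1, 1, 1]                               # external IP, unknown device, sensitive access
    else:
        raise ValueError(f"unknown event kind: {kind!r}")
    return z + flags


rng = random.Random(0)  # seeded so runs are repeatable
normal = simulate_event("normal", rng)
attack = simulate_event("attack", rng)
print(normal)
print(attack)
```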
Challenges
1. Baseline Drift
Normal simulations initially generated anomalies due to distribution mismatch.
Solution:
Aligned the runtime simulation with the training dataset so that both draw from the same distribution.
2. Score Calibration
Isolation Forest anomaly scores are relative and tightly clustered.
Solution:
Recalibrated thresholds to produce meaningful Low / Moderate / High segmentation.
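Because the scores are relative, one plausible way to recalibrate is to derive cutoffs from percentiles of the training scores rather than from fixed values. The sketch below assumes the 1st and 5th percentiles as the High and Moderate boundaries; those percentages, the function name, and the sample scores are hypothetical and not the project's actual settings.

```python
from statistics import quantiles


def calibrate_cutoffs(train_scores, high_pct=1, moderate_pct=5):
    """Derive risk cutoffs from percentiles of training anomaly scores.

    Lower scores are more anomalous, so the High cutoff sits at a lower
    percentile than the Moderate cutoff. The percentages are assumptions.
    """
    cuts = quantiles(train_scores, n=100)  # 99 cut points; cuts[k-1] is the k-th percentile
    return {"high": cuts[high_pct - 1], "moderate": cuts[moderate_pct - 1]}


# Hypothetical, tightly clustered training scores.
train_scores = [i / 1000 for i in range(-50, 150)]
cutoffs = calibrate_cutoffs(train_scores)
print(cutoffs)
```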
3. Microservice Communication
Handled:
- CORS issues
- Environment variables
- Production URLs
- Cold start behavior
4. Statistical Sensitivity
Small standard deviations caused inflated z-scores.
Solution:
Refined variance scaling.
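The write-up does not say how the scaling was refined. One common approach, shown here purely as an assumption, is to put a floor under $\sigma$ so that a near-constant baseline cannot inflate the z-score. The `min_sigma` value is a hypothetical placeholder.

```python
def safe_z_score(x, mu, sigma, min_sigma=1.0):
    """Z-score with a floor on sigma.

    `min_sigma` is a hypothetical floor; a suitable value depends on the
    units of the metric being standardized.
    """
    return (x - mu) / max(sigma, min_sigma)


# A user whose file count barely varies: sigma = 0.05.
raw = (11 - 10) / 0.05                  # one extra file yields a z-score near 20
floored = safe_z_score(11, 10, 0.05)    # the floor keeps it at 1.0
print(raw, floored)
```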
What’s Next
Real-Time Behavioral Drift
$$ \Delta_{\text{behavior}} = |\mu_{\text{current}} - \mu_{\text{baseline}}| $$
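As a sketch of how this planned metric might be computed for a single feature, the drift is the absolute difference between the mean of a recent window and the baseline mean. The window contents and the function name are illustrative assumptions.

```python
from statistics import fmean


def behavior_drift(baseline_values, recent_values):
    """Absolute shift between the recent mean and the baseline mean."""
    return abs(fmean(recent_values) - fmean(baseline_values))


# Hypothetical files-per-session counts for one user.
baseline_window = [10, 12, 11, 9, 13]
recent_window = [14, 16, 15]

drift = behavior_drift(baseline_window, recent_window)
print(drift)
```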
Planned enhancements:
- Email notification system
- Role-based risk modeling (IT / Finance / HR)
- Rolling 30-day auto-retraining
- Geo-location anomaly detection
- Sequence modeling (LSTM)
- Per-user risk trend graphs
Conclusion
This project demonstrates a statistically principled insider threat detection system built using:
- Behavioral modeling
- Unsupervised anomaly detection
- Microservice ML inference
- Real-time risk calibration
It provides a scalable foundation for enterprise-grade behavioral security systems.