Data-Duo: Fraud Detection for Financial Security

🧠 Inspiration

Credit card fraud is a growing concern in digital finance, with losses amounting to billions of dollars annually. We were inspired by the idea of using machine learning to help financial institutions detect fraudulent transactions early, reduce losses, and protect customers. The extreme class imbalance and the need for high-recall models made this a compelling and challenging real-world problem to tackle.

💡 What it does

This project uses machine learning algorithms to classify credit card transactions as either legitimate (0) or fraudulent (1). It leverages preprocessing, resampling techniques, and multiple models to identify fraudulent patterns within anonymized transaction data.

🛠️ How we built it

Dataset: We used the popular Kaggle Credit Card Fraud Detection Dataset, which contains 284,807 transactions with only 492 labeled as fraud.
Preprocessing:
- Checked for missing values and infinities
- Scaled Time and Amount using StandardScaler
Handling Imbalanced Data:
- Used SMOTE (Synthetic Minority Over-sampling Technique) to balance classes in the training set
Modeling:
- Trained and evaluated models including Logistic Regression, Random Forest, and XGBoost
- Used GridSearchCV to tune hyperparameters for optimal performance
Evaluation:
- Measured metrics: Precision, Recall, F1-score, ROC-AUC, and PR-AUC
- Emphasized high recall to minimize false negatives (missed fraud)
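
The resampling step above can be illustrated with a minimal sketch of the interpolation idea behind SMOTE. This is not the project's code and not the imbalanced-learn implementation. The function name `smote_sketch`, its parameters, and the toy points are assumptions made for illustration. Each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours.

```python
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by interpolating between neighbours.

    minority: list of equal-length numeric tuples (the fraud-class feature rows).
    Illustrative only; a real project would use a tested library implementation.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical toy minority class: four points on the unit square.
toy_minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(toy_minority, n_new=6, k=2)
```

Because every synthetic point lies between two real minority samples, it stays inside the region the minority class already occupies. Applying the resampling to the training set only, as described above, keeps synthetic points out of the data used for evaluation.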

🚧 Challenges we ran into

Extreme class imbalance: Fraud represents less than 0.2% of all transactions, so accuracy was misleading. A model that labels every transaction as legitimate would score about 99.8% accuracy while catching no fraud.
Slow model tuning: GridSearchCV with cross-validation on large datasets took significant time and resources.
Overfitting risk: Balancing the trade-off between high recall and generalizability was tricky, especially with synthetic data from SMOTE.
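
The class counts from the dataset description (284,807 transactions, 492 of them fraud) make the accuracy problem easy to check. The sketch below scores a trivial baseline that predicts "legitimate" for everything. The helper name `evaluate` and the convention of returning 0.0 when a denominator is zero are assumptions for this illustration.

```python
def evaluate(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for the positive (fraud) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


N_TOTAL, N_FRAUD = 284_807, 492
y_true = [1] * N_FRAUD + [0] * (N_TOTAL - N_FRAUD)
always_legit = [0] * N_TOTAL  # baseline that never flags fraud

baseline = evaluate(y_true, always_legit)
print(f"accuracy={baseline['accuracy']:.4f} recall={baseline['recall']:.2f}")
```

The baseline reaches roughly 99.8% accuracy with zero recall, which is why recall, precision, and F1 on the fraud class are the more informative numbers here.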

🎉 Accomplishments that we're proud of

Achieved high Recall (~0.88) on the fraud class while keeping Precision reasonable
Successfully implemented multiple models and compared them using meaningful metrics
Created strong visualizations to understand data distributions and class separation

📚 What we learned

The importance of choosing the right evaluation metrics when dealing with imbalanced data
How SMOTE works and when to use it effectively
Grid search vs randomized search and the practical trade-offs in computational cost
Why Precision-Recall AUC is often more useful than ROC-AUC in fraud detection tasks
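
The last point can be seen on a small invented example. The sketch below computes ROC-AUC as the fraction of (fraud, legitimate) pairs ranked correctly, and average precision as one common summary of the precision-recall curve. The scores are made up for illustration and are not results from this project. Both helpers assume there are no tied scores.

```python
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Invented toy data: 990 legitimate and 10 fraudulent transactions.
# Only 20 legitimate transactions outscore the fraud cases.
low_neg = [i / 1000 for i in range(970)]           # scores 0.000 .. 0.969
pos = [0.980 + j / 1000 for j in range(10)]        # scores 0.980 .. 0.989
high_neg = [0.990 + j / 2000 for j in range(20)]   # scores 0.990 .. 0.9995

scores = low_neg + pos + high_neg
labels = [0] * len(low_neg) + [1] * len(pos) + [0] * len(high_neg)

auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(f"ROC-AUC={auc:.3f} average precision={ap:.3f}")
```

In this toy case ROC-AUC stays close to 0.98 because the 20 false alarms are a tiny share of the 990 legitimate transactions. Average precision falls to about 0.21 because those same 20 false alarms outnumber the 10 fraud cases at the top of the ranking.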

🔮 What's next for Data-Duo – Fraud Detection for Financial Security

Deploy the model as an API that can scan real-time transactions
Add explainability using SHAP values or feature importance plots to understand what triggers a fraud prediction
Experiment with deep learning models like autoencoders or LSTMs for anomaly detection
Monitor model drift and create alerts if the distribution of transactions changes significantly over time
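
One possible way to approach the drift-monitoring idea is the Population Stability Index (PSI), which compares the binned distribution of a feature between a reference window and a recent window. PSI is an assumption here, not something the project has chosen. The function name `psi`, the bin count, the `eps` floor for empty bins, and the 0.25 alert threshold (a commonly quoted rule of thumb) are all illustrative.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a recent sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero bin width for constant data

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values to edge bins
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm below is defined.
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


ALERT_THRESHOLD = 0.25  # hypothetical cut-off for raising an alert

# Invented example: transaction amounts in a reference window vs. a shifted window.
reference_amounts = [i / 100 for i in range(1000)]
shifted_amounts = [x + 5 for x in reference_amounts]

drift_score = psi(reference_amounts, shifted_amounts)
if drift_score > ALERT_THRESHOLD:
    print(f"drift alert: PSI={drift_score:.2f}")
```

In practice a check like this would run per feature on a schedule, and the threshold would need tuning against how often the data shifts for harmless reasons.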

Updates

Elysee IRADUKUNDA started this project — Jul 25, 2025 09:03 AM EDT
