Inspiration
Over this past semester, Alp and I took the same data science class and became really interested in how data can be applied through various statistical methods. Wanting to use this knowledge in a real-world application, we decided to create a prediction model using machine learning. This would let us apply the concepts we learned in class and learn more about the algorithms and methods used to create more accurate predictions.
What it does
This project takes a dataset of over 280,000 real-life credit card transactions made by European cardholders over a two-day period in September 2013. Each transaction carries a label, also known as the ground truth, indicating whether it was fraudulent. After conducting exploratory data analysis, we split the dataset into training and testing data and trained several classification algorithms on the training data. We then measured how accurately each algorithm performed on the testing data to determine the best-performing algorithm.
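The split-train-evaluate workflow described above can be sketched roughly as follows. This is a minimal illustration, not our actual notebook: the data is a synthetic stand-in generated with scikit-learn's `make_classification`, the class ratio is an assumption chosen only to mimic an imbalanced fraud label, and Logistic Regression stands in for any of the classifiers.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the transaction data (assumption: NOT the real dataset).
# About 2% of rows are labeled 1 ("fraud"); the rest are 0 ("not fraud").
X, y = make_classification(
    n_samples=5000,
    n_features=10,
    n_informative=5,
    weights=[0.98, 0.02],
    random_state=0,
)

# stratify=y keeps the fraud ratio nearly identical in both splits,
# which matters when the positive class is this rare.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit on the training split only, then score on the held-out testing split.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.3f}")
```

Keeping the testing data untouched until the final scoring step is what makes the reported number an estimate of performance on unseen transactions.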
How we built it
We built it in Python using Jupyter notebooks, where we imported the libraries we needed for plotting, visualizing and modeling the dataset. We began with exploratory data analysis to understand the class imbalance in the data and the different variables, and discovered that the meaning of several variables was withheld from us due to customer confidentiality. To reduce the dimensionality of the dataset, we applied principal component analysis (PCA) and focused our analysis on the only two variables whose meaning was known to us, the amount and time of each transaction. Because the majority of the transactions were not fraudulent, we then balanced the data using the SMOTE technique so that the training data contained an equal proportion of fraudulent and non-fraudulent transactions, which the models need in order to learn to detect fraud reliably. After that, we trained six classification algorithms on the training data: Naive Bayes, Decision Tree, Random Forest, K-Nearest Neighbor, Logistic Regression and XGBoost. We then applied the trained models to the testing data and observed how accurately each one predicted fraudulent transactions. We also cross-validated each algorithm, evaluating it across multiple subsets of the dataset to guard against overfitting. Finally, we used evaluation metrics such as accuracy, precision, recall and F1 score to compare which algorithm performed best at predicting fraudulent transactions.
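The balancing and evaluation steps can be sketched as below. This is a simplified illustration under stated assumptions, not our actual code: `smote_like_oversample` is a hypothetical helper that mimics the core idea of SMOTE (interpolating between a minority row and one of its nearest minority neighbors) and is not the imbalanced-learn implementation, the data is synthetic, and Random Forest is used as one example of the six classifiers. The helper assumes a binary label and at least two minority rows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors


def smote_like_oversample(X, y, minority_label=1, k=5, seed=0):
    """Simplified SMOTE-style oversampling (hypothetical helper).

    Adds synthetic minority rows, each placed at a random point on the line
    segment between a minority row and one of its k nearest minority
    neighbors, until both classes have the same number of rows.
    """
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    n_needed = int((y != minority_label).sum() - len(X_min))
    if n_needed <= 0:
        return X, y

    # Ask for k + 1 neighbors because each row is its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(X_min))).fit(X_min)
    neighbor_idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]

    # Pick a random base row and a random neighbor of it for each new row.
    base = rng.integers(0, len(X_min), size=n_needed)
    column = rng.integers(0, neighbor_idx.shape[1], size=n_needed)
    picked = neighbor_idx[base, column]
    gap = rng.random((n_needed, 1))
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])

    X_bal = np.vstack([X, synthetic])
    y_bal = np.concatenate([y, np.full(n_needed, minority_label)])
    return X_bal, y_bal


# Synthetic, imbalanced stand-in data (assumption: NOT the real dataset).
X, y = make_classification(
    n_samples=4000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Oversample the training split only; the testing split keeps its imbalance.
X_bal, y_bal = smote_like_oversample(X_train, y_train)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_bal, y_bal)
y_pred = clf.predict(X_test)

precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

On data this imbalanced, accuracy alone can look high even when most fraud is missed, which is why precision, recall and F1 score are worth reporting alongside it.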
Challenges we ran into
The biggest challenge was the sheer amount of research and trial and error required to build this model. As this was our first time building a prediction model, we had to do a lot of reading to understand the various steps and concepts needed to clean and explore the dataset, as well as the theory and mathematical concepts behind the classification algorithms in order to model the data and check for accuracy.
Accomplishments that we're proud of
We are very proud that we were able to create a working model that predicts fraudulent transactions with very high accuracy, especially since this was the first major ML model we have built.
What we learned
We learned a lot about the process of building a machine learning application, such as cleaning data, conducting exploratory data analysis, creating a balanced sample, and modeling the dataset using various classification strategies to find the model with the highest accuracy.
What's next for Credit Card Fraud Detection
We want to do more research into the theory and concepts behind the modeling process, especially the classification strategies, as we work towards fine-tuning this model and building more machine learning projects in the future.
Built With
- kaggle
- machine-learning
- numpy
- pandas
- python
- scikit-learn
- seaborn