We chose the project proposed by Credit Suisse on fraud detection in credit card transactions.
What it does
It is a machine learning model that analyzes a large number of transactions (a million) and tries to predict which of them are suspicious and could be fraudulent.
How we built it
We wanted to use machine learning to solve this problem. We started by loading and exploring the data, then did some pre-processing to prepare it for our models: we created dummy variables for the categorical features, normalized some features and changed the dataset format. We did this with the help of the Pandas library.
The first model we tried was a Logistic Regression model, mainly using the scikit-learn Python library. We tried different parameters to find the best ones and optimize the model. We then also considered using a RandomForest Classifier instead. Again, tuning the parameters was the main part of the work. As fraud was a rare event, we had to use an oversampling algorithm to give more importance to the fraud cases.
Finally, we tried to get better accuracy by analyzing the distributions of the different transaction features and identifying the most important ones using their correlation with the target. We also tried using neural networks, but we didn't get sufficiently satisfying results with them.
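The steps described above can be sketched roughly as follows. This is a minimal illustration and not our actual hackathon code: the data is synthetic, the column names (`amount`, `channel`, `is_fraud`) are hypothetical, the model parameters are untuned, and plain random oversampling with `sklearn.utils.resample` is assumed as a stand-in for the oversampling algorithm.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the transaction data (hypothetical columns).
rng = np.random.default_rng(0)
n = 20_000
df = pd.DataFrame({
    "amount": rng.exponential(50.0, n),
    "channel": rng.choice(["online", "pos", "atm"], n),
})
# Make fraud rare and loosely tied to large online payments.
risky = (df["amount"] > 150) & (df["channel"] == "online")
df["is_fraud"] = (rng.random(n) < 0.001 + 0.2 * risky).astype(int)

# Dummy variables for the categorical feature.
X = pd.get_dummies(df.drop(columns="is_fraud"), columns=["channel"], dtype=float)
y = df["is_fraud"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Normalize the numeric feature, fitting the scaler on training data only.
scaler = StandardScaler()
X_train, X_test = X_train.copy(), X_test.copy()
X_train["amount"] = scaler.fit_transform(X_train[["amount"]]).ravel()
X_test["amount"] = scaler.transform(X_test[["amount"]]).ravel()

# Random oversampling: repeat fraud rows until both classes are the same size.
train = X_train.assign(is_fraud=y_train)
legit = train[train["is_fraud"] == 0]
fraud = train[train["is_fraud"] == 1]
fraud_up = resample(fraud, replace=True, n_samples=len(legit), random_state=0)
balanced = pd.concat([legit, fraud_up])
X_bal, y_bal = balanced.drop(columns="is_fraud"), balanced["is_fraud"]

# Fit both model families on the balanced data and compare recall on fraud.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
recalls = {}
for name, model in models.items():
    model.fit(X_bal, y_bal)
    recalls[name] = recall_score(y_test, model.predict(X_test))
    print(f"{name}: fraud recall = {recalls[name]:.2f}")
```

Only the training split is oversampled here, so the test set keeps the original class imbalance and the recall figures are not inflated by duplicated fraud rows.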
Challenges we ran into
It was not a classical machine learning classification problem. The probability of fraud was very low, less than 1 in a thousand transactions. The models had trouble learning and would at first simply predict that there was never any fraud, since that alone gave an accuracy score of over 99%, with every fraud case hidden in the small remaining fraction.
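The effect is easy to reproduce with a little arithmetic. The sketch below assumes a fraud rate of exactly 1 in 1,000 (an upper bound on the rate mentioned above, not the real figure) and scores a "model" that always predicts no fraud.

```python
# Hypothetical numbers: one million transactions, 1 in 1,000 fraudulent.
n_transactions = 1_000_000
n_fraud = 1_000

y_true = [1] * n_fraud + [0] * (n_transactions - n_fraud)
y_pred = [0] * n_transactions  # always predict "no fraud"

# Accuracy: share of transactions labelled correctly.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / n_transactions

# Recall: share of actual frauds that were caught.
caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = caught / n_fraud

print(f"accuracy = {accuracy:.1%}")  # accuracy = 99.9%
print(f"recall   = {recall:.1%}")    # recall   = 0.0%
```

This is why accuracy alone is a misleading score on such imbalanced data, and why a metric that looks at the fraud class specifically, such as recall, tells a more honest story.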