Group 129 NUS Datathon

Inspiration

Our initial analysis of the data showed that most of the relevant columns are binary, so we first opted for a decision tree classifier for its interpretability. However, we later found the dataset to be heavily imbalanced, with only 700 positive cases out of 18,992 rows (about 3.7%), which led us to switch to a Random Forest model. We were drawn to this model because its ability to build multiple decision trees and aggregate their predictions reduces the risk of overfitting and, when combined with class weights, copes better with imbalanced classes. In the end, this model helped us produce a more accurate solution for predicting customer satisfaction in the insurance acquisition process.

How we built it

We built the product by trying out different types of classification algorithms. After testing several, we found that Decision Trees suited our choice of TRUE/FALSE columns. As we tested further, however, we realised the accuracy was not up to our standard, so we started looking for variants of this algorithm. This is how we arrived at the Random Forest classification method.

After selecting the columns we wanted and converting 'stat_flag' into three separate TRUE/FALSE columns, we dropped all rows containing NAs and trained the model with class weights suited to the problem. We then tested it on a randomly selected test dataframe and used a confusion matrix to check our accuracy.
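The steps above can be sketched roughly as follows. This is a minimal illustration rather than our exact code: it assumes pandas and scikit-learn, runs on synthetic stand-in data, and every column name and category value other than 'stat_flag' is hypothetical. The class_weight="balanced" setting and the stratified split are also assumptions, since the exact weights and split used are not recorded here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data. Only 'stat_flag' comes from the writeup;
# the other column names and all category values are hypothetical.
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "stat_flag": rng.choice(["ACTIVE", "LAPSED", "MATURED"], size=n),
    "flag_a": rng.choice([True, False], size=n),
    "flag_b": rng.choice([True, False, None], size=n, p=[0.45, 0.45, 0.10]),
    "target": rng.choice([0, 1], size=n, p=[0.96, 0.04]),
})

# Turn 'stat_flag' into three separate TRUE/FALSE columns, then drop rows with NAs.
df = pd.get_dummies(df, columns=["stat_flag"])
df = df.dropna()

X = df.drop(columns="target").astype(int)
y = df["target"]

# Hold out a random test set. stratify=y (an assumption) keeps the rare
# positive class represented in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# class_weight="balanced" (an assumption) reweights classes inversely
# to their frequency in the training data.
model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# For binary labels [0, 1], ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

Because the synthetic features are random noise, the printed counts say nothing about the real dataset; the sketch only shows the shape of the pipeline.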

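In this state, the results on the test set were:
True Negative (TN): 3018
False Positive (FP): 226
False Negative (FN): 117
True Positive (TP): 35

We evaluated the model on its positive predictions using the formula True Positives / (False Positives + True Positives). Strictly speaking, this quantity is the precision rather than the accuracy; overall accuracy, (TP + TN) / total, is dominated by the majority class in a dataset this imbalanced.

Therefore, the precision was 35/261 = 0.134 (rounded).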
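The standard metrics can be recomputed directly from the four counts reported above. The short sketch below uses only those counts and the textbook definitions; note that TP / (TP + FP) is conventionally called precision, while accuracy is (TP + TN) divided by all cases.

```python
# Confusion-matrix counts as reported in this writeup.
tn, fp, fn, tp = 3018, 226, 117, 35

total = tn + fp + fn + tp

# Share of all test cases classified correctly.
accuracy = (tp + tn) / total
# Share of predicted positives that were truly positive.
precision = tp / (tp + fp)
# Share of actual positives that the model found.
recall = tp / (tp + fn)

print(f"accuracy  = {accuracy:.3f}")   # 0.899
print(f"precision = {precision:.3f}")  # 0.134
print(f"recall    = {recall:.3f}")     # 0.230
```

The gap between the high accuracy and the low precision and recall reflects the class imbalance: a model can be right on most rows simply by predicting the negative class.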

Challenges we ran into

Choosing a suitable machine learning model was one of the main challenges we faced, as we realised that some of the models we considered at the beginning did not work well with the dataset. Finding the best model through trial and error took a significant amount of time, which was far from ideal given the short duration of the datathon. The dataset also contained very few positive cases, which made it harder to split it into training and test data for our model.

What we learned

Over the past 3 days, my team and I were able to apply what we have learnt in the classroom, experimenting with different classifiers and figuring out the strengths and limitations of each. We tried many classifiers, such as KNN, SVM, Decision Tree and Random Forest, and found through trial and error that, for example, Decision Tree was not the most suitable in this context due to the skewed nature of the dataset. We eventually settled on Random Forest, as comparing the accuracy of each model showed it to be the most suitable for this dataset.
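A comparison of this kind might look like the sketch below. It is illustrative only: it assumes scikit-learn, uses a synthetic imbalanced dataset from make_classification rather than the datathon data, and scores each model with balanced accuracy under 5-fold cross-validation, which is a swapped-in choice and not necessarily the comparison we ran.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 4% positives) standing in for the real dataset.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.96, 0.04], random_state=0
)

# The four classifier families named in the writeup, with default-style settings.
models = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(class_weight="balanced"),
    "Decision Tree": DecisionTreeClassifier(class_weight="balanced", random_state=0),
    "Random Forest": RandomForestClassifier(class_weight="balanced", random_state=0),
}

# Balanced accuracy averages recall over both classes, so a model cannot
# score well just by always predicting the majority class.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy").mean()
    for name, model in models.items()
}

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:14s} {score:.3f}")
```

The ranking on synthetic data will not necessarily match the ranking on the real dataset; the point is only the side-by-side evaluation under a metric that accounts for imbalance.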

We also learned how different factors, such as status and purchase history, can drastically affect a customer's propensity. We saw this by trying different variables and comparing how the model behaved on the training data and on the test data. This helped us understand how each factor in the insurance industry can influence the consumer's decision, and taught us the importance of data analytics and machine learning in this industry.