Prudential Risk Evaluation: Modeling & Visualizations

pirateplot(BMI ~ LowestRisk)

Inspiration

Machine prediction challenges have been some of my favorite projects since I was introduced to the R language and its intricacies. So when Prudential announced their machine learning challenge, I leapt at the opportunity.

What it does

Beyond simply plugging the data into a formula, I've tried to hone the process by using visualizations to test exactly how to improve the model. I've also developed a quick command-line Python app that returns the risk value for a given ID.
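For illustration, here is a minimal sketch of what such an ID lookup could look like. It is not the app itself: the `Id` and `Response` column names, the sample rows, and the function names are all assumptions made for the example.

```python
import argparse
import csv
import io

# Hypothetical predictions table (assumed column names: Id, Response).
SAMPLE_PREDICTIONS = """Id,Response
1,8
2,4
5,6
"""


def load_predictions(handle):
    """Read Id -> Response pairs from an open CSV handle."""
    reader = csv.DictReader(handle)
    return {int(row["Id"]): int(row["Response"]) for row in reader}


def lookup_risk(predictions, applicant_id):
    """Return the predicted risk value for an ID, or None if it is unknown."""
    return predictions.get(applicant_id)


def main(argv=None):
    parser = argparse.ArgumentParser(
        description="Look up a predicted risk value by ID."
    )
    parser.add_argument("id", type=int, help="ID to look up")
    parser.add_argument(
        "--predictions",
        default=None,
        help="path to a predictions CSV; the built-in sample is used if omitted",
    )
    args = parser.parse_args(argv)

    if args.predictions:
        with open(args.predictions, newline="") as handle:
            predictions = load_predictions(handle)
    else:
        predictions = load_predictions(io.StringIO(SAMPLE_PREDICTIONS))

    risk = lookup_risk(predictions, args.id)
    if risk is None:
        print(f"No prediction found for ID {args.id}")
        return 1
    print(f"ID {args.id}: risk {risk}")
    return 0
```

Calling `main(["2"])` looks up ID 2 in the built-in sample; passing `--predictions` with a file path would read a real predictions file instead.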

How I built it

I started by determining key variables through an rpart model, which led to key visualizations. These visualizations revealed patterns that shaped how I handled the data prior to the final model. I then used the randomForest library to finalize the model.

Challenges I ran into

One issue I had was handling the dataset's large size. Due to the sheer mass of provided information, simply running test models proved time-consuming and taxing on my poor laptop. Therefore, it became vital to boil the dataset down to key variables, ignoring the variables that have little to no effect on the risk factor.
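As a rough sketch of that trimming step, the snippet below streams a CSV row by row and writes out only a chosen set of columns, so the full table never has to sit in memory. The column list is a placeholder (only `BMI` is taken from this write-up; `Id`, `Response`, and `Wt` are assumed names), and the real list would come from the variable-importance step.

```python
import csv
import io

# Hypothetical list of key variables; only BMI is named in the write-up.
KEY_COLUMNS = ["Id", "BMI", "Response"]


def keep_columns(source, destination, columns):
    """Copy rows from source to destination, keeping only the named columns.

    Rows are processed one at a time. Returns the number of data rows written.
    """
    reader = csv.DictReader(source)
    writer = csv.DictWriter(
        destination,
        fieldnames=columns,
        extrasaction="ignore",  # silently drop every column not listed
        lineterminator="\n",
    )
    writer.writeheader()
    rows = 0
    for row in reader:
        writer.writerow(row)
        rows += 1
    return rows


# Tiny in-memory example standing in for the real training file.
raw = io.StringIO("Id,BMI,Wt,Response\n1,0.32,0.15,8\n2,0.27,0.13,4\n")
trimmed = io.StringIO()
row_count = keep_columns(raw, trimmed, KEY_COLUMNS)
print(trimmed.getvalue())
```

With real files, `source` and `destination` would be handles opened with `newline=""`, as the `csv` module recommends.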

Accomplishments that I'm proud of

Ultimately, I'm pretty pleased with how the visualizations turned out. Pirateplot was an enormous help in seeing the minute detail that bar charts of averages hide. These plots shaped my final model, as key variables could be identified easily. Additionally, this was my first exposure to Python, which proved to be a fruitful experience.
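To make the point about averages concrete, here is a small sketch that computes, per group, the three things such a plot shows side by side: the raw values, the mean, and an interval around the mean. It does not reproduce pirateplot; the interval here is a simple normal-approximation 95% interval chosen for the example, and the sample values and the 0/1 coding of `LowestRisk` are made up.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def rdi_summary(groups, values, z=1.96):
    """Summarize values per group: raw points, mean, and an approximate interval.

    The interval is mean +/- z * standard error (normal approximation),
    which is an assumption of this sketch.
    """
    by_group = defaultdict(list)
    for group, value in zip(groups, values):
        by_group[group].append(value)

    summary = {}
    for group, raw in by_group.items():
        centre = mean(raw)
        # stdev needs at least two points; fall back to a zero-width interval.
        half_width = z * stdev(raw) / sqrt(len(raw)) if len(raw) > 1 else 0.0
        summary[group] = {
            "raw": sorted(raw),
            "mean": centre,
            "interval": (centre - half_width, centre + half_width),
        }
    return summary


# Made-up data: both groups have the same mean BMI but very different spread.
lowest_risk = [0, 0, 0, 1, 1, 1]
bmi = [0.4, 0.5, 0.6, 0.1, 0.5, 0.9]
summary = rdi_summary(lowest_risk, bmi)
for group, stats in summary.items():
    print(group, stats)
```

A bar chart of averages would draw two identical bars for these groups, while the raw points and intervals show that one group is far more spread out than the other.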

What I learned

While looking for more effective ways of displaying and visualizing data, I came across pirateplot, a library that eschews the overused and underwhelming barplot in favor of RDI plots (raw data, descriptive statistics, and inferential statistics), which better reflect exactly how the data is spread. Using this tool, I was able to better identify areas where

What's next for Prudential Risk Evaluation: Modeling & Visualizations

Ultimately, I was only able to achieve roughly 81% accuracy on the training dataset. Given more time, this mark could certainly be improved with further testing and fine-tuning. Perhaps other models and solutions could be attempted with processing power that I unfortunately lacked. Needless to say, though, the challenge was fun, engaging, and a great mental exercise.
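The write-up does not say how the roughly 81% figure was computed. One plausible reading, assumed here, is the share of rows where the predicted risk value exactly matches the actual one. A minimal sketch with made-up values:

```python
def accuracy(actual, predicted):
    """Return the fraction of predictions that exactly match the actual values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("cannot score an empty set of predictions")
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)


# Made-up risk values purely for illustration.
actual_risk = [8, 4, 6, 1]
predicted_risk = [8, 4, 6, 2]
score = accuracy(actual_risk, predicted_risk)
print(f"{score:.0%}")
```

A score measured on the same rows the model was trained on tends to be optimistic, so scoring a held-out portion of the data would give a fairer picture of any improvement.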

Built With

r
rstudio
the-blood-of-tsm

Updates

Andrew Li started this project — Oct 15, 2017 05:45 AM EDT

