As part of the UTD Hackathon 2019 Fannie Mae Challenge, we built a risk analytics engine using machine learning and data science. The system predicts which mortgages will be foreclosed on or paid off early.
- The end product was an app in which a user can enter values related to an acquisition, such as its loan-to-value ratio, property state, and crime statistics. These values are passed into the trained machine learning model to predict the viability of that property acquisition.
- The risk analytics engine, driven by machine learning, achieved accuracies of up to 98% for foreclosure decisions (random forest), 97% for zero balance (naive Bayes), and 87% for delinquency states (support vector machines)
How we built it
The dataset provided by Fannie Mae contained two tables: acquisition and performance. Fannie Mae told us they were interested in learning about patterns in the acquisition data that would help predict the following: 1) Foreclosure (whether a particular acquisition will lead to a foreclosure), 2) Delinquency (the level of delinquency), and 3) the ZeroBalance value.
As a result, those three performance features were selected as output, and we prepared the data in the following way:
- Randomly sampled 180k lines of data from the original dataset for the sake of time
- Selected the unique LoanID values in the performance sample and used them to filter the acquisition table for the corresponding rows
- Created three CSV files by appending the value of the corresponding performance column to the acquisition table: foreclosure_data.csv, delinquency_data.csv, zerobal.csv
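The preparation steps above can be sketched with pandas. The toy tables and column names below (e.g. `Foreclosure`, `LTV`) are illustrative stand-ins, not the real Fannie Mae schema; only `LoanID` as the join key comes from the writeup:

```python
import pandas as pd

# Toy stand-ins for the Fannie Mae performance and acquisition tables.
performance = pd.DataFrame({
    "LoanID": [1, 1, 2, 3, 3, 4],
    "Foreclosure": [0, 0, 1, 0, 0, 1],
})
acquisition = pd.DataFrame({
    "LoanID": [1, 2, 3, 4, 5],
    "LTV": [80, 95, 70, 97, 60],
})

# 1) Randomly sample performance rows (180k in the real dataset).
sample = performance.sample(n=4, random_state=42)

# 2) Keep only acquisition rows whose LoanID appears in the sample.
loan_ids = sample["LoanID"].unique()
acq_subset = acquisition[acquisition["LoanID"].isin(loan_ids)]

# 3) Append the target column from performance to build one labeled file.
labels = sample.drop_duplicates("LoanID")[["LoanID", "Foreclosure"]]
foreclosure_data = acq_subset.merge(labels, on="LoanID")
foreclosure_data.to_csv("foreclosure_data.csv", index=False)
```

The same join, repeated with the delinquency and zero-balance columns, yields the other two CSVs.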
We started by analyzing the 25 features using WEKA and plotly to visualize individual patterns, as shown in the pictures. We also generated scatter plots with matplotlib and plotly. With those tools, we narrowed the number of features from 25 to 12, then to 8. In addition, we researched and experimented with different models and parameters in WEKA. Results for the foreclosure data are below:
Model (10-fold cross validation) : Accuracy (%)
- Naive Bayes: 97.0564
- Decision Tree: 98.7735
- Random Forest: 98.7735
- SVM (non-linear kernel): 98.7735
We repeated the same analysis for zero balance and delinquency data.
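A rough scikit-learn equivalent of that WEKA comparison looks like the sketch below. The synthetic data stands in for the 8 selected acquisition features, so the printed accuracies will differ from the WEKA numbers above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the 8 selected acquisition features.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
}

# 10-fold cross-validation, mirroring the WEKA setup.
results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X, y, cv=10).mean()
    print(f"{name}: {results[name] * 100:.2f}%")
```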
The next task was to integrate the model into an app. This was accomplished using Bokeh inside a Jupyter notebook. Once the model is trained, values newly entered in the app act as test data, giving live predictions of the viability of that particular acquisition!
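The prediction step behind the app can be sketched as follows. The feature names, toy training data, and `predict_viability` helper are hypothetical; in the real app, Bokeh text-input widgets feed a callback that does essentially the same one-row prediction:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy training data standing in for the prepared acquisition features.
rng = np.random.default_rng(0)
X_train = rng.random((200, 3))               # e.g. LTV, credit score, rate
y_train = (X_train[:, 0] > 0.5).astype(int)  # toy foreclosure label
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

def predict_viability(ltv, credit_score, rate):
    """Mimics the app callback: user inputs become a one-row test sample."""
    sample = np.array([[ltv, credit_score, rate]])
    return int(model.predict(sample)[0])

# A new acquisition entered through the app widgets:
print(predict_viability(0.9, 0.4, 0.3))
```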
Challenges we ran into
- Risk of imbalanced data owing to random sampling
- Tradeoff between building a full-fledged standalone app vs. an in-browser application
- Tradeoff between choosing a simple, explainable model vs. a more complex model with longer training time
Accomplishments that we're proud of
- We were able to connect with the customer (Fannie Mae) and ask them important questions about what they wanted to learn from the data, what would make a good user interface for them, and how they wanted to use it
- We understood the business side of the solution despite having no prior knowledge of the house-financing field
- A well-researched and complete data science solution to mortgage risk analysis
- Decision making at various points was well documented and explained to judges using visualizations
- I learned Bokeh and used it to implement the in-browser app
- My teammate learned visualization tools like plotly and matplotlib
What's next for Risk Analytics Fannie Mae
The app can be extended to answer similar questions. It is ready to be deployed, and data can be tested in the browser against an existing trained model. The next step would be to allow more compute time in order to train on the complete dataset rather than just a random sample.