We were interested in learning how women participate in formal financial systems in India, how their roles differ from those of their male counterparts, and how this information can be used to further empower women to hold more equal roles in society. We used data from the Women in Data Science Datathon 2018, provided by the Bill and Melinda Gates Foundation.
What it does
Our model predicts, with over 90% accuracy, whether a survey respondent is female based on their answers to a wide range of survey questions.
Example of a survey question: Who is the main income earner in your household?
How we built it
We built our model using Microsoft Azure Machine Learning Studio. We ran a Random Forest model to get an importance ranking of over 1,000 features and chose the top 20 to include in our model. We tried several different models and ultimately chose a two-class boosted decision tree over alternatives such as a two-class SVM, a two-class averaged perceptron, and two-class logistic regression, because the boosted decision tree yielded the highest accuracy.
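The two-step workflow above (rank features with a Random Forest, then train a boosted tree on the top 20) can be sketched outside Azure as well. Below is a minimal Python sketch using scikit-learn; the data is synthetic and stands in for the survey responses, so the columns, sizes, and accuracy here are purely illustrative, not results from the real dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_samples, n_features = 2000, 100  # illustrative stand-in for the survey's feature count

# Synthetic "survey" where the target depends on only a few features
X = rng.normal(size=(n_samples, n_features))
y = (X[:, 0] + 0.5 * X[:, 3] - X[:, 7]
     + rng.normal(scale=0.5, size=n_samples) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: Random Forest importance ranking; keep the 20 most important features
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
top20 = np.argsort(rf.feature_importances_)[::-1][:20]

# Step 2: train a boosted decision tree on just those features
bdt = GradientBoostingClassifier(random_state=0).fit(X_train[:, top20], y_train)
acc = accuracy_score(y_test, bdt.predict(X_test[:, top20]))
print(f"boosted tree accuracy on top-20 features: {acc:.3f}")
```

In Azure ML Studio the same two steps map onto separate modules wired together on the canvas, which is what made the side-by-side model comparison so easy.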
Before bringing the data into Azure, we cleaned it in R by removing variables with over 20% missing values (NAs), which took our feature count down from over 1,000 to 262.
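We did this step in R, but the same 20%-NA threshold is easy to express in Python/pandas; the tiny DataFrame below is illustrative, not the real survey data:

```python
import numpy as np
import pandas as pd

# Illustrative survey frame; NaN marks an unanswered question
df = pd.DataFrame({
    "mostly_answered": [1, 2, np.nan, 4, 5],      # 20% missing -> kept
    "rarely_answered": [np.nan, np.nan, np.nan, 4, np.nan],  # 80% missing -> dropped
})

# Keep only columns where at most 20% of values are missing
threshold = 0.20
cleaned = df.loc[:, df.isna().mean() <= threshold]
print(list(cleaned.columns))  # → ['mostly_answered']
```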
Challenges we ran into
The dataset had over 1,000 features to begin with, which was intimidating coming into the project: how were we going to sift through all of them to find the features with the most predictive power? This was, as it turned out, a problem for a machine. Random Forest's importance ranking helped us overcome this challenge quickly.
Additionally, one challenge was getting accustomed to Azure's ML Studio. It was a new system for both of us, and slightly uncomfortable at first. However, we are now huge believers! We loved how we could visualize the model-building process (import data, clean data, split data, train, predict, etc.) and compare multiple models side by side.
Accomplishments that we're proud of
We are so happy to be at AthenaHacks this year, working on data that tells women's stories. By examining the dataset, we found some interesting patterns:
One of the biggest predictors of gender was whether a person said they were the head of their household or classified themselves as "Spouse." As you might guess, women were MUCH more likely to say "Spouse."
Another big predictor was the question: Who decides on who should have a phone in your household? Women were much more likely to say their spouses made that decision rather than themselves. This raises the issue of power dynamics within the household, and suggests that women with mobile phones should be studied further to see whether phone ownership makes a difference in their financial autonomy.
What's next for Female Finances
More feature engineering, more Kaggle submissions, more exploring of this very cool dataset!