It’s no secret: a high school diploma matters to individuals, communities, and society. United States high school graduates are more likely to be employed and less likely to engage in criminal behavior. They also enjoy better health and longer life expectancy, and are more likely to be engaged in their communities.

To meet the GradNation Campaign goal of a 90% graduation rate by 2020, we need to identify every opportunity that can create success for our nation’s youth. Income, demographics, family dynamics and zip code are all known factors that can play a significant part in a student’s ability to graduate. But what are the unknown factors? Does bullying play a part? Crime? What about local gas prices, weather or transportation factors? Help us figure it out.

This challenge provides the opportunity to study Census and demographic data sets, which can be used to identify the key drivers of low graduation rates. These variables can include socio-economic status, language and culture of the home, school size and size of family.

What it does

In simple terms, the model predicts which school districts will have a graduation rate above 90%, based on the variables passed in.

How I built it

My approach to analyzing the data was a three-step process:

Selecting the Variables

Splitting the Data

Creating the Model

When selecting the variables, I used the Local Education Agency ID (LEAID) as the unique identifier and kept only variables that were at least 70% complete, thereby reducing the number of NAs. From those, specific variables were chosen using the Boruta feature selection algorithm. A new binary variable, indicating whether a district's graduation rate was 90% or greater, was created and became the dependent variable that all the models were built to predict.
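The completeness filter and the 90% target can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the rows, the LEAID values and the column names (`pct_poverty`, `median_income`, `grad_rate`, `high_grad`) are made up for the example and are not the actual Census fields, and the Boruta selection step is not reproduced here.

```python
# Hypothetical toy records; column names and values are illustrative only.
# None stands in for NA.
rows = [
    {"LEAID": "0100005", "pct_poverty": 12.0, "median_income": None, "grad_rate": 93.0},
    {"LEAID": "0100006", "pct_poverty": 30.5, "median_income": None, "grad_rate": 81.0},
    {"LEAID": "0100007", "pct_poverty": None, "median_income": 41000.0, "grad_rate": 90.0},
    {"LEAID": "0100008", "pct_poverty": 18.2, "median_income": None, "grad_rate": 88.5},
]


def completeness(rows, column):
    """Fraction of rows in which `column` is present (not None)."""
    present = sum(1 for row in rows if row.get(column) is not None)
    return present / len(rows)


def keep_complete_columns(rows, min_complete=0.70):
    """Return the column names that are at least `min_complete` filled in."""
    columns = list(rows[0].keys())
    return [c for c in columns if completeness(rows, c) >= min_complete]


def add_target(rows, rate_column="grad_rate", threshold=90.0):
    """Add a 0/1 target: 1 when the graduation rate is at or above the threshold."""
    return [dict(row, high_grad=int(row[rate_column] >= threshold)) for row in rows]


kept = keep_complete_columns(rows)   # median_income is only 25% complete, so it is dropped
labeled = add_target(rows)           # each row gains a high_grad field
```

In this toy data `pct_poverty` is 75% complete and survives the filter, while `median_income` does not.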

Moving on to splitting the data, the data was split into a training set and a testing set while maintaining equal proportions of high and low graduation rates in each. The model is fit on the training set, and the testing set is used to check that it predicts correctly on data it has not seen.
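A split that keeps class proportions equal is usually called a stratified split. The sketch below shows one way to do it with the Python standard library. The 30% test fraction, the seed and the toy labels are assumptions made for the example; the original split ratio is not stated.

```python
import random


def stratified_split(labels, test_fraction=0.3, seed=42):
    """Split row indices into train/test so each class keeps the same share in both sets."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def share_of_ones(labels, indices):
    """Fraction of the selected rows whose label is 1."""
    return sum(labels[i] for i in indices) / len(indices)


# Toy labels: 1 = district at or above 90% graduation, 0 = below.
labels = [1] * 40 + [0] * 60
train_idx, test_idx = stratified_split(labels)
```

Because each class is sampled separately, both sets keep the 40/60 mix of the full toy data set.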

Two different models were created:

Logistic Regression Model - Prediction Accuracy of 70.4%

CART Model - Prediction Accuracy of 74.4%
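For readers unfamiliar with CART, the core idea is to pick the split that gives the lowest weighted Gini impurity, then predict the majority class on each side. The sketch below shows a single such split and the accuracy calculation in plain Python. It is not the model built for this project: the feature, the labels and the helper names are made up, a real tree repeats the split recursively, and accuracy here is computed on the same toy rows purely to show the arithmetic.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Find the threshold on one feature that minimizes weighted Gini impurity."""
    best_threshold, best_score = None, float("inf")
    for threshold in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        if not left or not right:
            continue  # a split must put rows on both sides
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold


def majority(labels):
    """Most common 0/1 label (ties go to 1)."""
    return int(2 * sum(labels) >= len(labels))


def accuracy(predicted, actual):
    """Share of predictions that match the actual labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Made-up feature (a poverty percentage) and 0/1 target per district.
poverty = [5.0, 8.0, 12.0, 15.0, 22.0, 30.0, 35.0, 40.0]
high_grad = [1, 1, 1, 0, 0, 1, 0, 0]

threshold = best_split(poverty, high_grad)
left_class = majority([y for x, y in zip(poverty, high_grad) if x <= threshold])
right_class = majority([y for x, y in zip(poverty, high_grad) if x > threshold])
predictions = [left_class if x <= threshold else right_class for x in poverty]
toy_accuracy = accuracy(predictions, high_grad)
```

On this toy data the best split is at 12.0, and the resulting one-split rule gets 7 of the 8 rows right.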

Challenges I ran into

The main challenges were handling redundancies in the Census data, cleaning up the data, and choosing pertinent variables. It was a hassle to look through all 550 available variables to identify the ones that could be useful.

Accomplishments that I'm proud of

The high accuracy of the model predictions shows just how sensitive graduation rates are to certain variables. I am also proud of implementing new machine learning algorithms that I recently learned to use in class.

What I learned

Insights from the Logistic Regression Model are grouped into three categories, based on who has the ability to actually create change:

Household: parents and family members can facilitate change in students’ lives right from the home.

School Administration: teachers, principals and guidance counsellors would be the chief proponents of these new initiatives.

Socio-economic: state government and local school district educational boards would be responsible for leading these initiatives. Please see the slide deck for the key actionable insights we recommended.


Statistical modeling was used to predict whether a district reaches the 90% graduation rate threshold, with roughly 70% accuracy. The key insights found from the data were divided into school, household and socio-economic initiatives. The recommendations include diversifying the cohort, because more diverse cohorts showed better results, and avoiding smaller schools.

What's next for Finding Drivers for Higher Graduation Rates

Possible next steps include adding more prediction models, such as a random forest, and implementing the AdaBoost algorithm to combine their predictive capability.
