Dropping out of high school is a critical issue in today's society that contributes to many of the long term issues in the nation. To tackle this issue, we need to understand the precise cause and effect relationships.

Our goal is to use data to derive actionable recommendations to help increase the national graduation rate to 90% by 2020.

We mapped environmental theory to derive various hypotheses and evaluated each hypothesis. Using a sophisticated technique of extracting non-linear directional impact of the various features towards grad rate at a nationwide level from noisy data at the district level, we found that the most important features for improving nationwide graduation rates are household stability and economic security. We also found that weather plays an important role. On the other hand, we found that school spending and low food access may not be very important for improving graduation rates and have very diminishing returns relative to investments to improve them.

** Our best model had 0.68 R^2 but with 150 features. Our final selected model was with 0.6 R^2 with 7 features. Those are, in order of most important to least:

  1. Percentage of Economically Disadvantaged students in the cohort
  2. Percentage of unmarried households in the cohort
  3. Percentage of people below poverty in the cohort
  4. Average of the daily minimum temperature
  5. Instructional salaries and wages per pupil
  6. Average of the daily maximum temperature **

Environmental Theory

Interesting Insights!!!:

  1. Coming from educated households were NOT strong predictors of whether you would graduate high school.
  2. We found that the percent of people living under the poverty line was a much stronger predictor than median household income. This implies that it's not necessary to raise incomes for everyone, but just for the neediest families. This helps to target policy in the right direction.
  3. Salaries and Wages were impactful investments to grad rates but had very diminishing returns but NOTHING else. Investments need to be in the right place based on analysis such as ours !
  4. Food Access is NOT a big impact to grad rates.
  5. Weather in your area IS fairly impactful to grad rates !

What We Did

  • Data Clean up. The dataset provided, although detailed had a major flaw in it. Schools with cohort sizes under 60 students, had graduation rates with a very large margin of error and broad bucket.

  • Integration of many third party datasets at the tract or school district level. E.g. weather, food availability, school financials.

  • Deep analysis of the various hypotheses based off the environmental theory.

  • Prediction of what specific investments to impact the key predictive features on the graduation rate in order to understand the best area to invest in with the maximum impact. This was done at the national level in order to help facilitate specific policy decisions using quantitative what-if analysis. This approach could also be used to set KPI's for governmental agencies to ensure that the specific policies do indeed meet their intent.

  • Interactive visualization tools to explore the final merged dataset, understand the key predictive features and deep dive into the analysis we performed.

How we built it

Most of the work was in the analysis and identifying actionable recommendations at the national level. In addition, we built a set of interactive visualization tools on a web platform in order to help the user understand what we accomplished and for policy makers to identify policies that have the biggest impact to nationwide graduation rate.

In terms of the actionable recommendations, we wanted to build an analytical framework that would enable policy makers to identify :

  1. what metrics to focus on when developing specific policies (i.e. biggest bang for the buck)
  2. directional relationship on how much to change the specific metric by in order to optimize for national graduation rate

We initially did this at a district level but found the results to have unclear actionable content especially since policy decisions tend to be at the state or national level. Instead we used an interesting technique of predicting the impact of % change of a specific feature for all districts and then aggregating the results at the national level. By using this technique, we can visually see on a chart what is the predicted impact on graduation rate at the national level relative to changes to specific features. Below is an example of this technique being applied to instructional spending on salary. The graph shows that incremental spending from current levels has diminishing returns although it is very important if spending is significantly below the nominal.

Instruction Spending on Salary vs. National Grad Rate


The data required significant cleaning and deep understanding. Firstly, the school districts for cohort sizes under 60 students had graduation rates that were in large buckets. In fact for 6-15 student cohorts, the rate was either <50% or ≥50%. This could cause significant inaccuracies in our model, hence we decided to leave these cohorts outside of our modeling. Nevertheless, we tested our model with the overall data to determine robustness and we confirmed that the conclusions were the same.

Next, we used the unmerged dataset and we needed to merge the specific tracts per school into district level data. The technique mentioned in the documentation assumed that all metrics are treated the same way, which is incorrect. We only applied the weighted average on absolute value metrics and recalculated the % value metrics.


This project entailed an end to end data science methodology and implementation. Building a comprehensive data story, actionable recommendations using predictive analytics and finally a interactive visualization web platform to communicate the story.

Recommendations (More in the notebook on the site)

Given that Household Stability and Economic Security are the biggest factors, here are a few actions to take:

1) Continued investment in community development corporations (CDCs) to strengthen communities from a financial and social perspective.

2) Initiatives like the Low Income Housing Tax and the New Markets Tax Credit have brought vital services to low-income communities. Check out this article for more information.

School spending

1) We found that more instructional spending tended to increase graduation rates, while other types tended to decrease it. Thus, all else being equal, it seems that schools should allocate more money towards instructional spending rather than support spending or administrative spending.

Inclement weather

1) Colder areas tend to have lower graduation rates than warmer areas. One way to design around this problem is to have alternate school schedules for colder areas. It might be better to have a longer break during winter months and a shorter break during summer months for these schools

Future Work

Look into a model for graduation rates per race to understand what features makes each race's graduation rate to be different.

Share this project: