We wanted to take a deep dive into a dataset on public school test performance to help parents and educators better understand geographic patterns in how today's youth are performing across the country. We also wanted to see how geospatial APIs could offer insight into where people hoping to improve the country's education levels could focus their efforts most efficiently.
What it does
We built an interactive map that lets users browse public schools in the United States and view associated metadata.
We also compiled a local database of the 88,300+ schools that participated in standardized testing in the United States from 2012-2015, which can support educational policy analysis by future data enthusiasts.
How we built it
We made a Flask app that powers an interactive D3-based map interface for viewing data about schools and their surrounding regions. The app leverages the Pitney Bowes GeoLife API to provide contextual demographic and economic data for each school.
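A minimal sketch of the kind of Flask endpoint that could back such a map; the route, field names, and the in-memory `SCHOOLS` store are illustrative stand-ins, not the app's actual code.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Stand-in for the local schools database described above.
SCHOOLS = {
    "0600001": {"name": "Example Elementary", "state": "CA",
                "lat": 34.05, "lng": -118.24},
}

@app.route("/api/schools/<school_id>")
def school_metadata(school_id):
    """Return metadata for one school, for the D3 front end to render."""
    school = SCHOOLS.get(school_id)
    if school is None:
        return jsonify({"error": "unknown school"}), 404
    return jsonify(school)
```

The D3 front end can then fetch this JSON per school and bind it to map markers.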
In addition to using the test scores dataset, we scraped the National Education Statistics website for metadata about the schools, used the geocoder Python library to call the Esri/ArcGIS geocoding service to enrich the text addresses with lat/long coordinates, and used pandas to munge and clean the data so that it could be used by the application. Data exploration and quality control were conducted using Jupyter and Beaker notebooks.
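The pandas cleaning step can be sketched roughly like this, assuming hypothetical column values; it shows coercing the mixed string/range cells (described under the challenges below) into numeric values.

```python
import pandas as pd

# Illustrative raw cells: score columns arrive as strings, ranges
# ("50-59"), and suppression codes ("PS") mixed into numeric columns.
df = pd.DataFrame({"pct_proficient": ["62", "50-59", "PS", "71"]})

def to_numeric_midpoint(value):
    """Coerce a raw score cell to a float: plain numbers pass through,
    'lo-hi' ranges become their midpoint, everything else becomes NaN."""
    if isinstance(value, str) and "-" in value:
        lo, _, hi = value.partition("-")
        try:
            return (float(lo) + float(hi)) / 2
        except ValueError:
            return float("nan")
    return pd.to_numeric(value, errors="coerce")

df["pct_proficient"] = df["pct_proficient"].map(to_numeric_midpoint)
# df["pct_proficient"] is now [62.0, 54.5, NaN, 71.0]
```

Rows left as NaN can then be counted and reviewed rather than silently breaking aggregations.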
Challenges we ran into
- Beaker notebook cloud was good for data preparation, but we were unable to export the data, so we switched back to Jupyter
- A huge range of data quality issues to address in the main dataset, including strings and range-based values placed in otherwise numerical columns. For example, the math results dataset has 231 columns, 228 of which are supposed to be numerical, but only 3 of which were numerical before any cleaning. In other cases, states were omitted entirely (Kansas in 2013 and Nevada in 2014), and datasets listed on public pages turned out to be nonexistent (test scores from before 2012).
- There were 89 pages of documentation just to understand one of the datasets for one year. Fair state-to-state comparisons are difficult because of the different data quality issues particular states had in individual years.
- Some data quality issues weren't reported in the docs, such as a single school having 9 unique identifier codes because it listed its service to different (sometimes overlapping) grade ranges.
- Python's lxml and html.parser libraries use different methods to walk the DOM tree; we learned to use each for different sorts of pages to cope with invalid HTML
- State FIPS codes do not simply count up from 1-50; the skipped numbers (3, 7, etc.) are quite haphazard, and we wonder if there's a story behind it!
- Finicky table layouts without any classes or IDs to help walk the DOM tree
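For pages like these, the stdlib html.parser approach can be sketched as below: tracking position in the tree rather than selecting by class or ID, and tolerating invalid HTML (the sample input and its missing `</td>` tags are illustrative, not a real page).

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect cell text from a bare <table> with no classes or IDs,
    tracking position in the tree instead of selecting by attribute."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.rows, self.current = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.current = []
        elif tag in ("td", "th"):
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self.current:
            self.rows.append(self.current)
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.current.append(data.strip())

# Invalid HTML: the closing </td> tags are missing entirely.
parser = TableScraper()
parser.feed("<table><tr><td>School<td>State</tr>"
            "<tr><td>Example Elementary<td>CA</tr></table>")
# parser.rows -> [['School', 'State'], ['Example Elementary', 'CA']]
```

Because html.parser fires events in document order without repairing the tree, the scraper still recovers both rows despite the unclosed cells.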
Accomplishments that we're proud of
- Learned to use object models to make scrapers more organized and maintainable
- Learned to use GNU parallel to run scrapers efficiently
- Gained experience with d3.js and mapping data, learning to pass output from Python scripts to JS for visuals
- Ran the largest scrape we've ever done, in 12 hours (almost 190,000 pages crawled and parsed)
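The object-model pattern mentioned above can be sketched like this; the class, URL template, and stubbed parse logic are hypothetical, not the real NCES layout.

```python
class SchoolPage:
    """One page type owns its own URL scheme and parsing, keeping
    scraper code organized and easy to maintain."""
    URL_TEMPLATE = "https://example.org/school/{school_id}"  # hypothetical

    def __init__(self, school_id):
        self.school_id = school_id

    @property
    def url(self):
        return self.URL_TEMPLATE.format(school_id=self.school_id)

    def parse(self, html_text):
        """Extract the fields this page type owns (stubbed here)."""
        return {"id": self.school_id, "raw_length": len(html_text)}

page = SchoolPage("0600001")
record = page.parse("<html>...</html>")
```

With the per-page logic isolated like this, a small driver script can take one ID per invocation, which is what makes fanning out with GNU parallel straightforward.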
What I learned
- Preparing the data takes much longer than you think
- Keeping geocode mappings up-to-date is difficult and important! However, accuracy is not perfect (one Californian address got mapped to Canada)
- Putting additional data layers on top of map datasets is heavily dependent on the accuracy of the underlying geocoding
- Working with spatial datasets is challenging and fun!
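One cheap guard against mis-geocodes like the California-to-Canada example is a per-state bounding-box check; this is a sketch with approximate coordinates for California only, not a complete or precise table.

```python
# Approximate per-state bounding boxes (California only, for illustration).
# A point geocoded from a CA address that falls outside the CA box gets
# flagged for review instead of silently landing on the map.
STATE_BBOX = {
    "CA": {"lat": (32.5, 42.0), "lng": (-124.5, -114.1)},
}

def plausible(state, lat, lng):
    """Return False when a geocoded point falls outside its state's box."""
    box = STATE_BBOX.get(state)
    if box is None:
        return True  # no box on file; don't flag
    return (box["lat"][0] <= lat <= box["lat"][1]
            and box["lng"][0] <= lng <= box["lng"][1])

plausible("CA", 34.05, -118.24)  # Los Angeles: True
plausible("CA", 49.28, -123.12)  # Vancouver, BC: False
```

Bounding boxes are coarse (they include slivers of neighboring states), but they reliably catch gross errors like a wrong-country hit.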
What's next for MapPub: Interactive Map of American Public School Data
- Sharing the tool with teachers and parents to learn which additional fields they'd like to explore
- Adding tooltips, sparklines, and histograms to the map to provide additional context
- Moving API calls to the backend for better performance
- Adding more datasets, such as the Common Core of Data