Deployed at http://frozen-taiga-1666.herokuapp.com.
We focused on the Wage and Hour Compliance Action Data from the U.S. Department of Labor. This data set is a list of descriptions of investigations into alleged wage and benefits violations (employers paying late, underpaying, etc.). The prompt originally called for something "predictive", regarding which employers will violate next. However, it became clear quickly that predicting which employer will violate next wasn't in this data - the features just weren't there. But there might be an opportunity to say something about which employers will violate again.
We thought the Wage and Benefits data set would be interesting and productive for a few reasons.
First, with wage growth stagnant in the US, ensuring workers receive the pay they are entitled to is hugely important.
Second, we assumed government data would be cleaner (addresses normalized, for example). We were wrong. The Department of Labor is not the NSA.
With sufficiently developed analysis, the Department of Labor, potential employees, and any business based in the U.S. could benefit. The DoL can alter inspection patterns, and job seekers can avoid toxic employers if they have the luxury to do so.
- The recidivism rate drops significantly (current data state suggests around 50%) for employers whose Civil Monetary Penalty was in the top quartile of all initial penalties vs the bottom quartile. This suggests, at first glance, that stiffer penalties may help prevent repeat offenses (at least for smaller businesses).
- Recidivism is lower than expected. The current state of the data puts the recidivism rate above 6%, but this number is likely to rise as more correct groupings are identified.
- A few employers, especially the United States Postal Service, Walmart, USProtect Corporation, X-Press Sweeping, AT&T, and Comcast (to name a few from a long list), are frequent offenders and account for a significant fraction of repeat violations.
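The quartile comparison in the first finding can be sketched roughly as follows. This is a minimal illustration, not the code we ran: the function name, the `(initial_penalty, num_cases)` tuple layout, and the toy numbers are all assumptions made up for the example, and an employer counts as a repeat offender here simply if it has more than one case.

```python
from statistics import quantiles


def recidivism_by_penalty_quartile(entities):
    """Compare repeat-offense rates for low vs. high initial penalties.

    entities: list of (initial_penalty, num_cases) tuples, one per employer
    (a hypothetical layout for this sketch).
    Returns (bottom_rate, top_rate): the share of employers with more than
    one case among those in the bottom / top quartile of initial penalties.
    """
    penalties = [penalty for penalty, _ in entities]
    # Quartile cut points of the initial penalties.
    q1, _, q3 = quantiles(penalties, n=4, method="inclusive")

    def rate(group):
        if not group:
            return 0.0
        return sum(1 for _, cases in group if cases > 1) / len(group)

    bottom = [e for e in entities if e[0] <= q1]
    top = [e for e in entities if e[0] >= q3]
    return rate(bottom), rate(top)


# Toy data, invented for illustration only (not DoL figures).
sample = [(100, 2), (200, 3), (300, 1), (400, 2),
          (500, 1), (600, 1), (700, 1), (800, 2)]
bottom_rate, top_rate = recidivism_by_penalty_quartile(sample)
```

Any such comparison is only as good as the entity grouping underneath it: if two rows for the same employer are not joined, a repeat offense is silently counted as two first offenses.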
Process and Issues
Frankly, it's surprising how messy the data is. Although the DoL does offer a functioning search page to find wage violations by business name (awesome), the messiness of the data and the lack of more detailed information make it difficult for even a motivated searcher to remain informed.
Addresses are not normalized, typos exist in all fields (even in dates), and legal names are inconsistent. We very aggressively normalized and stripped the legal name fields. We created an address field by assembling and parsing the included geographic information into a standard form. We developed a process to identify rows where 2 of the 3 "essential" fields (including address) were shared and join them as one entity.
We tried to normalize legal_name using Levenshtein distance, and then Jaro-Winkler distance, as similarity measures, but they didn't work as well as simply removing punctuation and spaces. In retrospect, a slightly modified Jaro-Winkler (weighting both prefix and suffix) on top of the aggressive normalization would probably have worked well.
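As a rough illustration of the two steps described above (aggressive normalization, then joining rows that share 2 of the 3 "essential" fields), here is a minimal sketch. It is not the script we actually ran: `legal_name` and `address` come from the data, but `trade_name` as the third field, the function names, and the sample rows are assumptions for the example. Rows are linked with a small union-find so that matches chain together.

```python
import re
from itertools import combinations

# legal_name and address are from the data set; trade_name is an assumed
# stand-in for the third "essential" field.
FIELDS = ("legal_name", "trade_name", "address")


def normalize(value):
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", value.lower())


def group_entities(rows):
    """Join rows that share at least 2 of the 3 normalized fields.

    Returns one entity id per row: the smallest row index in its group.
    """
    parent = list(range(len(rows)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    seen = {}  # (field_a, field_b, value_a, value_b) -> first row index
    for idx, row in enumerate(rows):
        norm = {f: normalize(row.get(f, "")) for f in FIELDS}
        for a, b in combinations(FIELDS, 2):
            if not norm[a] or not norm[b]:
                continue  # blank fields should not count as a match
            key = (a, b, norm[a], norm[b])
            if key in seen:
                union(idx, seen[key])
            else:
                seen[key] = idx
    return [find(i) for i in range(len(rows))]


# Invented sample rows, for illustration only.
rows = [
    {"legal_name": "Acme Sweeping, Inc.", "trade_name": "Acme",
     "address": "12 Main St."},
    {"legal_name": "ACME SWEEPING INC", "trade_name": "Acme Sweep",
     "address": "12 main st"},
    {"legal_name": "Other Co", "trade_name": "Other",
     "address": "99 Elm Ave"},
]
entity_ids = group_entities(rows)
```

Exact matching on normalized strings is deliberately blunt. It misses typos inside a name, which is where a similarity measure layered on top of the normalization might have helped.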
A few thousand rows were also cleaned by hand while scripts were running.
About 70% of our total group time was spent cleaning the data to a point where we could group entities with a fair degree of confidence, and with some more time (and sleep) we're confident we could have produced a few more interesting graphs.
If the data set were cleaned to the state it ought to be in, there's quite a lot of interesting exploratory research that could be performed. It would also be much easier to join this data with another interesting set to try to get at the more "predictive" angle of interest to the DoL. Here are some questions that came to mind despite 20 hours of high color-temperature light cleaning this data:
- How does the violation amount (total backwages) change for repeat violations over time?
- Do repeat offenders offend more often or for more money the next time?
- Do penalties have a greater effect on larger businesses or smaller businesses? Does the size of the penalty matter?
- Do repeat offenders continually violate the same laws (FLSA, MSPA, etc.)?
- Are certain kinds of violations (FLSA, MSPA, etc.) more likely to co-occur with each other?
- How do changes in a state or county's economic climate affect the likelihood of a wage violation?
- Do the inspectors of the DoL have preferences or biases with regard to assessing different industries or businesses with penalties?
... and many more.