Debiasing Dataset Features for Crime Prediction

NYC map data by Zipcode
Word cloud

Inspiration

Recent events in the US have brought to the forefront a long-standing racial bias pervading law enforcement. We sought to analyze historical arrest data from two major cities, NYC and Chicago, over the past 3 years, since AI developers might use such data for crime prediction. The data is published on government websites and is therefore likely to be widely used without its veracity being questioned. Furthermore, since financial distress is a commonly cited driver of crime, we look for a correlation between the arrest statistics and the median household income of the neighborhoods where the arrests occur.

What it does

Our analysis reveals a clear skew in the arrest data towards people of African American and Hispanic origin, even though they make up a much smaller share of the population. A predictive model trained on this data that used perpetrator race as a feature to estimate the likelihood of crime would be highly disadvantageous to these communities. When comparing income, we saw that, on average, areas with a lower median income had higher arrest counts. We also observed that median income tends to be lower in areas with a larger African American population. This could stem from systemic racial bias that hindered employment opportunities and kept African Americans out of more affluent neighborhoods. We conclude that predictive modeling for crime prevention cannot be unbiased if it is based solely on historical data, because deep-rooted biases are reflected in any information collected from the past.
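One way to quantify the skew described above is a disparity ratio: a group's share of arrests divided by its share of the population. The sketch below is only an illustration of that idea, not the exact computation we ran. The function name and the toy numbers are made up, and the group labels "A" and "B" are placeholders rather than real categories from the datasets.

```python
from collections import Counter


def disparity_ratios(arrest_groups, population_by_group):
    """Return arrest share divided by population share for each group.

    A ratio above 1 means the group is over-represented among arrests
    relative to its share of the population.
    """
    arrest_counts = Counter(arrest_groups)
    total_arrests = sum(arrest_counts.values())
    total_population = sum(population_by_group.values())
    ratios = {}
    for group, population in population_by_group.items():
        arrest_share = arrest_counts.get(group, 0) / total_arrests
        population_share = population / total_population
        ratios[group] = arrest_share / population_share
    return ratios


# Made-up illustrative numbers, NOT figures from the NYC or Chicago data.
toy_arrest_groups = ["A"] * 6 + ["B"] * 4
toy_population = {"A": 300, "B": 700}
toy_ratios = disparity_ratios(toy_arrest_groups, toy_population)
```

In this toy input, group "A" accounts for 60% of arrests but 30% of the population, so its ratio is about 2, while group "B" comes out below 1.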

How we built it

We found the crime data for NYC here and for Chicago here. We retrieved demographic and income information for both cities from the US Census website. We used Pandas, Tableau, and OpenRefine for data gathering, cleaning, and analysis.
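Arrest exports with millions of rows do not have to be loaded into memory at once to be cleaned and counted. The sketch below streams a CSV row by row with Python's standard library, skips rows with missing fields, and normalizes the counted value. It stands in for the Pandas and OpenRefine steps rather than reproducing them, and the column names (`offense`, `latitude`, `longitude`) are hypothetical, not the real dataset headers.

```python
import csv
import io
from collections import Counter


def count_by_column(lines, column, required=()):
    """Stream CSV rows and count the values of `column`.

    Rows with an empty `column` or an empty field named in `required`
    are skipped and tallied separately.
    """
    counts = Counter()
    skipped = 0
    for row in csv.DictReader(lines):
        needed = (column, *required)
        # DictReader yields None for missing trailing fields, so coerce to "".
        if any(not (row.get(name) or "").strip() for name in needed):
            skipped += 1
            continue
        # Normalize case and whitespace so "theft " and "Theft" match.
        counts[row[column].strip().upper()] += 1
    return counts, skipped


# Tiny in-memory stand-in for an arrest export (illustrative values only).
toy_csv = io.StringIO(
    "offense,latitude,longitude\n"
    "Theft,40.71,-74.00\n"
    "theft ,40.73,-73.99\n"
    "Assault,,\n"
    "assault,40.65,-73.95\n"
)
toy_counts, toy_skipped = count_by_column(
    toy_csv, "offense", required=("latitude", "longitude")
)
```

For a real file, the same function can be given an open file handle, for example `open(path, newline="")`, so only one row is held in memory at a time.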

Challenges we ran into

Finding the datasets and performing data cleaning.
Enormous dataset size (3M+ arrests in NYC over the past 3 years alone) and a lack of computing resources to process it.
Framing the problem statement.
Time constraints prevented us from creating our own unbiased predictive models.
The datasets lacked the zip codes required to link the crime data with the demographic data, so we had to send each arrest's latitude and longitude to the OpenStreetMap API to retrieve the location's zip code.
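The coordinate-to-zip-code lookup can be sketched as follows. This is a minimal illustration, assuming the public Nominatim reverse-geocoding endpoint of OpenStreetMap, which returns a `postcode` field inside `address` when one is known. The class name `ZipLookup`, the User-Agent string, the rounding precision, and the one-second delay are all assumptions made for this example. Coordinates are rounded and cached so that nearby arrests reuse a single request, which matters with millions of rows and a rate-limited public service.

```python
import json
import time
import urllib.parse
import urllib.request

NOMINATIM_REVERSE = "https://nominatim.openstreetmap.org/reverse"


def build_reverse_url(lat, lon):
    """Build a reverse-geocoding URL for one coordinate pair."""
    query = urllib.parse.urlencode(
        {"format": "jsonv2", "lat": f"{lat:.5f}", "lon": f"{lon:.5f}",
         "addressdetails": 1}
    )
    return f"{NOMINATIM_REVERSE}?{query}"


def extract_postcode(payload):
    """Return the postcode from a response dict, or None if absent."""
    return payload.get("address", {}).get("postcode")


def fetch_json(url):
    """Perform the real HTTP request (User-Agent value is a placeholder)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "crime-debias-demo/0.1"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


class ZipLookup:
    """Cache zip codes by rounded coordinates to limit API requests."""

    def __init__(self, fetch=fetch_json, precision=3, delay_seconds=1.0):
        self.fetch = fetch
        self.precision = precision
        self.delay_seconds = delay_seconds
        self.cache = {}
        self.calls = 0

    def zip_for(self, lat, lon):
        key = (round(lat, self.precision), round(lon, self.precision))
        if key not in self.cache:
            if self.calls and self.delay_seconds:
                time.sleep(self.delay_seconds)  # stay polite to the service
            self.calls += 1
            self.cache[key] = extract_postcode(self.fetch(build_reverse_url(*key)))
        return self.cache[key]


def fake_fetch(url):
    # Canned response for an offline demo; the postcode is illustrative.
    return {"address": {"postcode": "10007"}}


lookup = ZipLookup(fetch=fake_fetch, delay_seconds=0)
first_zip = lookup.zip_for(40.71274, -74.00597)
second_zip = lookup.zip_for(40.71281, -74.00612)  # rounds to the same cache key
```

Passing the fetch function in as a parameter keeps the demo runnable without network access. Leaving the default `fetch_json` in place would send real requests instead.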

Accomplishments that we're proud of

This was the first hackathon for all 3 of us, and completing the submission before the deadline was definitely an accomplishment.
Exploring the intersection of AI and justice has been a novel and motivating experience for all of us.
We were able to find correlations between the median income of an area and its arrest count.
In addition, we concluded that certain races are more likely to be arrested because of the historical disadvantages they have faced.
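The relationship between median income and arrests per area can be summarized with a Pearson correlation coefficient. The sketch below computes it by hand so that it runs on any Python 3 version. It is an illustration under assumed inputs rather than our original analysis code, and the per-zip-code numbers are invented for the example.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    if variance_x == 0 or variance_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(variance_x * variance_y)


# Invented per-zip-code values, NOT results from the NYC or Chicago data.
toy_median_income = [30_000, 45_000, 60_000, 90_000]
toy_arrest_counts = [420, 310, 250, 90]
toy_r = pearson_r(toy_median_income, toy_arrest_counts)
```

A value close to -1 means arrest counts fall as median income rises. The coefficient measures association only and says nothing about what causes it.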