EDA Patrol – Crime in India (2001–2014)

What Inspired Us

The rising concern over public safety, coupled with limited granular insights on regional crime in India, inspired us to explore district-level crime data from 2001 to 2014. We wanted to investigate patterns, identify high-risk zones, and support strategic policymaking using data as a tool for social impact. By visualizing crime trends, we aimed to make this data accessible, informative, and actionable for both authorities and the general public.

What We Learned

Throughout the course of this project, we deepened our understanding of:

Geospatial data handling using GeoJSON and fuzzy matching.
Data wrangling techniques to clean, normalize, and merge large datasets.
Exploratory Data Analysis (EDA) to derive actionable insights.
Visualization tools like Streamlit for creating interactive dashboards.
Machine learning models such as Random Forests and Linear Regression to classify districts and forecast crime trends.

How We Built It

We started by importing the district-wise IPC crime dataset and a GeoJSON file containing district boundaries of India.

Cleaned and standardized the data

Removed aggregated rows like “Total”.
Normalized district/state names to match across datasets using fuzzy matching.
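
As a rough illustration of the cleaning steps above, here is a minimal sketch of the name matching. It uses Python's standard-library difflib as a stand-in for whatever fuzzy-matching library the project actually used. The function names, the 0.8 cutoff, and the sample district names are assumptions for illustration only.

```python
import difflib

def normalize(name: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(name.lower().split())

def match_districts(crime_names, geo_names, cutoff=0.8):
    """Map each crime-data district name to its closest GeoJSON name.

    Names with no candidate above the cutoff map to None so they can be
    checked by hand. The cutoff value is an assumed starting point.
    """
    geo_lookup = {normalize(g): g for g in geo_names}
    mapping = {}
    for raw in crime_names:
        key = normalize(raw)
        if key == "total":
            continue  # aggregated row, not a real district
        hits = difflib.get_close_matches(key, list(geo_lookup), n=1, cutoff=cutoff)
        mapping[raw] = geo_lookup[hits[0]] if hits else None
    return mapping

# Illustrative names only, not taken from the project's files.
crime_names = ["ADILABAD", "Anantpur", "Total", "Nowhere Land"]
geo_names = ["Adilabad", "Anantapur", "Chittoor"]
mapping = match_districts(crime_names, geo_names)
```

Entries that come back as None are the ones that would go to manual verification.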

Created new indicators

crime_risk_index using a weighted formula of serious crimes.
Binary label for high vs. low crime zones.
Most common crime per district.
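
The three indicators above could be computed along these lines. This is only a sketch: the weights, the crime categories, the median threshold for the high/low label, and the sample counts are all placeholders, since the write-up does not state the actual formula.

```python
import statistics

# Hypothetical weights for serious crimes; the project's real weights are not given.
WEIGHTS = {"murder": 5.0, "rape": 4.0, "robbery": 3.0, "theft": 1.0}

def crime_risk_index(counts, weights=WEIGHTS):
    """Weighted sum of crime counts for one district."""
    return sum(w * counts.get(crime, 0) for crime, w in weights.items())

def most_common_crime(counts):
    """Crime category with the highest count in one district."""
    return max(counts, key=counts.get)

def label_high_crime(indices):
    """1 if a district's index is above the median of all districts, else 0.

    Using the median as the threshold is an assumption.
    """
    threshold = statistics.median(indices.values())
    return {d: int(v > threshold) for d, v in indices.items()}

# Made-up counts for three placeholder districts.
districts = {
    "District A": {"murder": 10, "rape": 5, "robbery": 20, "theft": 300},
    "District B": {"murder": 2, "rape": 1, "robbery": 40, "theft": 30},
    "District C": {"murder": 30, "rape": 12, "robbery": 60, "theft": 900},
}

indices = {d: crime_risk_index(c) for d, c in districts.items()}
labels = label_high_crime(indices)
top_crime = {d: most_common_crime(c) for d, c in districts.items()}
```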

Performed in-depth EDA

Identified top crime-contributing states and districts.
Analyzed urban vs. non-urban patterns.
Built heatmaps, choropleths, and time-series plots.

We then developed a Random Forest classifier that predicts high-crime districts with 95% accuracy, built a linear regression model to forecast total crimes in future years, and deployed an interactive Streamlit dashboard to present the insights.
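
To make the forecasting step concrete, here is a minimal sketch of fitting a straight line to yearly totals with ordinary least squares and extrapolating to a later year. It is written in plain Python rather than with the modelling library used in the project, and the yearly totals are synthetic toy values, not the real crime counts.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic, perfectly linear toy series standing in for yearly totals.
years = list(range(2001, 2015))
totals = [100 + 10 * (year - 2001) for year in years]

slope, intercept = fit_line(years, totals)
forecast_2019 = slope * 2019 + intercept
```

A straight-line extrapolation like this assumes the 2001–2014 trend continues unchanged, so forecasts several years out should be read as rough projections.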

Challenges We Faced

Data Inconsistency

Mismatched or misspelled district names across files required fuzzy matching and manual verification.

Lack of Urban/Rural Indicators

We had to infer urbanization using keywords like “commr” or “city”.
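
A keyword check of this kind can be sketched in a few lines. The function name and the sample district names below are illustrative assumptions; only the two keywords come from the project.

```python
URBAN_KEYWORDS = ("commr", "city")

def is_urban(district_name, keywords=URBAN_KEYWORDS):
    """Return True if the district name contains any urban keyword."""
    name = district_name.lower()
    return any(keyword in name for keyword in keywords)

# Illustrative names only.
samples = ["HYDERABAD CITY", "PUNE COMMR.", "ADILABAD"]
flags = {name: is_urban(name) for name in samples}
```

Because this is plain substring matching, it is only a heuristic: a large urban district whose name lacks both keywords would be counted as non-urban.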

Data Volume

Processing multiple years of crime data at the district level needed optimized data handling.

Model Tuning

Balancing model complexity and interpretability while avoiding overfitting was a learning curve.

Geo-Visualization

Integrating statistical output with the GeoJSON district boundaries for clean choropleth rendering took iterative refinement.
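
One way to do this join is to write each district's value into the properties of the matching GeoJSON feature before plotting. The sketch below assumes a property key named "district" and a value key named "crime_risk_index" on each feature; both are hypothetical, as are the tiny sample geometries.

```python
import copy

def attach_values(geojson, values, name_key="district", value_key="crime_risk_index"):
    """Return a copy of a FeatureCollection with one value per feature.

    Features without a matching value get None and are reported back,
    so gaps in the choropleth can be traced to name mismatches.
    """
    out = copy.deepcopy(geojson)
    unmatched = []
    for feature in out["features"]:
        name = feature["properties"].get(name_key)
        if name in values:
            feature["properties"][value_key] = values[name]
        else:
            feature["properties"][value_key] = None
            unmatched.append(name)
    return out, unmatched

# Tiny placeholder FeatureCollection with two square "districts".
geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"district": "District A"},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]}},
        {"type": "Feature", "properties": {"district": "District B"},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[1, 0], [2, 0], [2, 1], [1, 1], [1, 0]]]}},
    ],
}

joined, unmatched = attach_values(geojson, {"District A": 430.0})
```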

Final Output

Interactive Dashboard - Streamlit App

The dashboard allows users to:

Explore crime intensity across states and districts.
View geospatial heatmaps for different crime types.
See the most common crime in each district.
Analyze trends over time and forecast future crimes.

We also successfully trained:

Random Forest Classifier with 95% accuracy to classify districts into high/low crime zones.
Linear Regression model that projects ~3 million IPC crimes by 2019.

Conclusion

This project provided a comprehensive look into the crime landscape of India at the district level from 2001 to 2014. Through data cleaning, exploratory analysis, geospatial visualization, and machine learning, we uncovered critical patterns: urban-rural disparities, high-risk zones, and the prevalence of specific crimes like theft and grievous hurt. The crime risk index and predictive models offer a data-driven foundation for targeted interventions and informed policy decisions. Our findings reinforce the importance of granular, localized analysis in understanding crime trends and demonstrate how technology can empower public safety initiatives.