Urban vehicle break-ins are a persistent and growing problem in dense cities, especially in high-traffic and tourist-heavy areas like San Francisco. Many drivers are forced to rely on intuition or anecdotal advice when choosing where and when to park, often leading to avoidable losses and anxiety. We were motivated by a simple question: can historical crime data be transformed into clear, actionable guidance for everyday parking decisions? ParkWise was created to bridge this gap. Our goal was to build a data-driven decision-support system that estimates the risk of vehicle break-ins at specific locations and times, empowering drivers to make safer, more informed choices.

ParkWise is built on publicly available historical crime data from the San Francisco Police Department, accessed through the Socrata REST API. The dataset spans multiple years (2018–2024) and contains detailed records of reported vehicle burglaries, including timestamps and geographic coordinates. We began by cleaning and preprocessing the raw data. Duplicate and incomplete entries were removed, and all spatial and temporal fields were standardized for consistency. To prevent data leakage and ensure realistic evaluation, we used a strict 2024 hold-out test set, training only on data from 2018–2023.
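The year-based hold-out described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the record layout and the field name `incident_datetime` are hypothetical stand-ins for whatever the cleaned dataset uses.

```python
from datetime import datetime

HOLDOUT_YEAR = 2024  # from the write-up: 2024 is held out, 2018-2023 is used for training


def split_by_year(records, holdout_year=HOLDOUT_YEAR):
    """Split incident records into (train, test) by the year of their timestamp."""
    train, test = [], []
    for rec in records:
        year = datetime.fromisoformat(rec["incident_datetime"]).year
        if year == holdout_year:
            test.append(rec)
        elif year < holdout_year:
            train.append(rec)
        # Records dated after the hold-out year are dropped.
    return train, test


# Toy records; the field names are hypothetical, not the dataset's real schema.
records = [
    {"incident_datetime": "2019-06-01T22:15:00", "latitude": 37.8080, "longitude": -122.4177},
    {"incident_datetime": "2023-12-31T23:59:00", "latitude": 37.7694, "longitude": -122.4862},
    {"incident_datetime": "2024-01-01T00:05:00", "latitude": 37.8024, "longitude": -122.4058},
]
train, test = split_by_year(records)
```

Splitting on the calendar year rather than at random keeps every test incident strictly later than every training incident, which is what makes the evaluation resemble real use.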

Feature engineering was central to model performance. Temporal features included cyclical encodings for hour of day and day of week, weekend indicators, and seasonal patterns. Spatial features included latitude, longitude, proximity to major tourist landmarks, and localized crime density measures. To better capture geographic structure, incidents were aggregated using the H3 hexagonal spatial indexing system at a 175-meter resolution, enabling the model to learn neighborhood-level hotspot behavior.
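To illustrate the cyclical encodings mentioned above, here is a small sketch that maps hour of day and day of week onto a circle with sine and cosine, so that 23:00 and 00:00 end up close together. The feature names and the helper `hour_distance` are illustrative assumptions, not names from the project.

```python
import math
from datetime import datetime


def temporal_features(ts: datetime) -> dict:
    """Encode hour of day and day of week as points on a unit circle."""
    hour_angle = 2 * math.pi * ts.hour / 24
    dow_angle = 2 * math.pi * ts.weekday() / 7  # Monday = 0, Sunday = 6
    return {
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
        "dow_sin": math.sin(dow_angle),
        "dow_cos": math.cos(dow_angle),
        "is_weekend": int(ts.weekday() >= 5),
    }


def hour_distance(a: dict, b: dict) -> float:
    """Euclidean distance between two encoded hours (hypothetical helper)."""
    return math.hypot(a["hour_sin"] - b["hour_sin"], a["hour_cos"] - b["hour_cos"])


late = temporal_features(datetime(2024, 3, 9, 23, 0))      # Saturday, 23:00
midnight = temporal_features(datetime(2024, 3, 10, 0, 0))  # Sunday, 00:00
noon = temporal_features(datetime(2024, 3, 11, 12, 0))     # Monday, 12:00
```

With a raw integer hour, 23 and 0 would look maximally far apart. In the encoded space, `hour_distance(late, midnight)` is small, while `hour_distance(midnight, noon)` is the largest possible value.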

Because the dataset records only reported break-ins, and break-ins are rare compared to uneventful parking events, we applied negative sampling. This involved generating spatially and temporally shifted non-incident examples to balance the dataset and improve model robustness.
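One way such negative sampling might look is sketched below. Each known incident is shifted by a random offset in space and time, and the shifted point is kept as a non-incident example only if it does not land on a known incident. The shift sizes, the coarse collision key, and the function names are all assumptions made for illustration; the write-up does not specify them.

```python
import random
from datetime import datetime, timedelta


def cell_key(lat, lon, ts, precision=3):
    """Coarse (location, hour) key used to detect collisions with real incidents."""
    hour = ts.replace(minute=0, second=0, microsecond=0)
    return (round(lat, precision), round(lon, precision), hour)


def sample_negatives(positives, per_positive=1, max_shift_deg=0.01,
                     max_shift_hours=72, max_tries=20, seed=0):
    """Generate shifted non-incident examples around each known incident."""
    rng = random.Random(seed)
    occupied = {cell_key(*p) for p in positives}
    negatives = []
    for lat, lon, ts in positives:
        made, tries = 0, 0
        while made < per_positive and tries < max_tries:
            tries += 1
            candidate = (
                lat + rng.uniform(-max_shift_deg, max_shift_deg),
                lon + rng.uniform(-max_shift_deg, max_shift_deg),
                ts + timedelta(hours=rng.uniform(-max_shift_hours, max_shift_hours)),
            )
            # Reject candidates that coincide with a recorded incident.
            if cell_key(*candidate) not in occupied:
                negatives.append(candidate)
                made += 1
    return negatives


# Toy incidents as (latitude, longitude, timestamp) tuples.
positives = [
    (37.8080, -122.4177, datetime(2023, 7, 4, 21, 30)),
    (37.7694, -122.4862, datetime(2023, 7, 5, 14, 0)),
]
negatives = sample_negatives(positives)
labelled = [(p, 1) for p in positives] + [(n, 0) for n in negatives]
```

A sampled negative is only "no reported incident here at this time", not a guarantee of safety, so the ratio of negatives to positives directly shapes the scale of the scores the model produces.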

For prediction, we trained an XGBoost gradient-boosted decision tree model with 200 trees and a maximum depth of six. Early stopping was used during validation to reduce overfitting and improve generalization.
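A configuration along these lines could be expressed with the XGBoost scikit-learn wrapper. The tree count and depth come from the description above. The early-stopping patience, the evaluation metric, and the function name `train_risk_model` are assumptions for illustration, and passing `early_stopping_rounds` to the constructor assumes xgboost 1.6 or newer.

```python
# Hyperparameters stated in the write-up: 200 trees, maximum depth of six.
# early_stopping_rounds and eval_metric are assumed values for illustration.
XGB_PARAMS = {
    "n_estimators": 200,          # upper bound on trees; early stopping may use fewer
    "max_depth": 6,
    "early_stopping_rounds": 20,  # assumption: stop after 20 rounds without improvement
    "eval_metric": "aucpr",       # assumption: area under the precision-recall curve
}


def train_risk_model(X_train, y_train, X_val, y_val, params=XGB_PARAMS):
    """Fit a gradient-boosted classifier, monitoring a validation set for early stopping."""
    # Imported here so the configuration can be inspected without xgboost installed.
    from xgboost import XGBClassifier

    model = XGBClassifier(**params)
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
    return model
```

After fitting, `model.predict_proba(X)[:, 1]` gives the predicted break-in probability for each row. To stay consistent with the hold-out design, the validation set would presumably be drawn from the training years rather than from 2024.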

One major challenge was dealing with class imbalance, since most locations and times do not result in break-ins. Without correction, the model would be biased toward predicting low risk everywhere. Negative sampling and careful evaluation using precision-recall metrics helped address this issue. Another challenge was ensuring geospatial consistency. Raw latitude and longitude alone failed to capture neighborhood-level patterns, which led us to adopt hexagonal spatial indexing. Selecting the right spatial resolution required experimentation to balance locality with statistical reliability.
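The following toy example shows why accuracy alone is misleading under class imbalance and why precision and recall are more informative. The data is synthetic and the function names are illustrative; none of it comes from the ParkWise results.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, treating 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Synthetic, heavily imbalanced data: 5 break-ins among 100 parking events.
y_true = [1] * 5 + [0] * 95

# A degenerate model that predicts "low risk" everywhere.
always_low = [0] * 100

# A model that catches 4 of the 5 break-ins and raises 1 false alarm.
useful = [1, 1, 1, 1, 0] + [1] + [0] * 94

acc_low = accuracy(y_true, always_low)                        # 0.95
prec_low, rec_low = precision_recall(y_true, always_low)      # (0.0, 0.0)
acc_useful = accuracy(y_true, useful)                         # 0.98
prec_useful, rec_useful = precision_recall(y_true, useful)    # (0.8, 0.8)
```

The degenerate model scores 95% accuracy while finding no break-ins at all, which precision and recall expose immediately.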

Finally, we had to account for real-world data limitations. The model relies on reported crime data, which may underrepresent true incident rates and reflect reporting biases. We addressed this by focusing on relative risk rather than absolute predictions and validating calibration carefully.

Results and Evaluation: The final model demonstrated strong predictive performance, with an AUC-ROC of 83.2%, a precision-recall AUC of 91.7%, and a precision of 95.1%. High precision was especially important, as it ensured that high-risk predictions were rarely false positives. Calibration analysis showed that predicted risk scores closely matched observed outcomes. Feature importance analysis revealed that recent incident trends and time-of-day variables were the strongest predictors, reinforcing the idea that vehicle break-ins follow predictable patterns rather than random behavior.
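A calibration check of the kind mentioned above can be sketched as a reliability table: predictions are grouped into probability bins, and the mean predicted risk in each bin is compared with the observed incident rate. The bin count, the function names, and the toy data are assumptions for illustration, not the project's actual analysis.

```python
def reliability_table(y_true, y_prob, n_bins=5):
    """Return rows of (bin_index, count, mean_predicted, observed_rate) for non-empty bins."""
    bins = [[] for _ in range(n_bins)]
    for t, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((t, p))
    rows = []
    for i, items in enumerate(bins):
        if not items:
            continue
        mean_pred = sum(p for _, p in items) / len(items)
        observed = sum(t for t, _ in items) / len(items)
        rows.append((i, len(items), mean_pred, observed))
    return rows


def expected_calibration_error(rows):
    """Count-weighted average gap between predicted and observed rates."""
    total = sum(n for _, n, _, _ in rows)
    return sum(n * abs(mp - ob) for _, n, mp, ob in rows) / total


# Synthetic, well-calibrated toy scores: 10% of the 0.1 group and 90% of the
# 0.9 group are actual incidents.
y_prob = [0.1] * 10 + [0.9] * 10
y_true = [0] * 9 + [1] + [1] * 9 + [0]

rows = reliability_table(y_true, y_prob)
ece = expected_calibration_error(rows)
```

A well-calibrated model produces rows where the last two columns nearly agree, giving an expected calibration error near zero. Large gaps in the high-risk bins would mean the displayed risk scores overstate or understate the observed rates.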

This project highlighted how machine learning can turn historical public safety data into actionable insight when paired with thoughtful feature engineering and evaluation. We learned that spatial structure and temporal context are just as important as model choice, and that precision-focused metrics are critical for real-world decision support. We also gained experience balancing technical rigor with practical usability. ParkWise is deployed through a simple web-based interface that allows users to drop a pin on a map, select a time, and instantly receive a risk score, demonstrating how complex models can be translated into intuitive tools.

Beyond individual drivers, ParkWise has broader implications for urban safety, including informing proactive resource allocation and hotspot awareness. While limited by reliance on reported crime data, the project demonstrates how data science can meaningfully improve everyday decision-making and contribute to safer cities.
