There's so much data in the world - we want to make some sense of it.
I want to start participating in Kaggle competitions - this was a perfect learning experience :D
What it does
It uses
- NYC Hourly Traffic data from https://catalog.data.gov/dataset/hourly-traffic-on-metropolitan-transportation-authority-mta-bridges-and-tunnels-beginning- and
- NYC Hourly Vehicle Accidents data from https://catalog.data.gov/dataset/motor-vehicle-crashes-vehicle-information-beginning-2009
to make classification models that predict the range the number of accidents falls in, given the hour, the month and the amount of traffic.
See images of input data and scores of prediction attached above.
We are predicting AccidentCountRangeOf5. If AccidentCountRangeOf5 is an integer x, that means there will be 5x, 5x+1, 5x+2, 5x+3 or 5x+4 accidents for the given hour, month and number of vehicles on the road.
Classification models: this is a classification problem because we aren't predicting the exact number of accidents but rather the range of 5 that the number of accidents belongs to. For example, if our model predicts 10, there will be between 10*5 and (10*5) + 4 accidents, so the number of accidents for that set of inputs will be 50, 51, 52, 53 or 54.
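The mapping between a raw accident count and its class label can be sketched in a few lines of Python. This is a minimal illustration, not code from the notebook: the function names and the BUCKET_WIDTH constant are made up for the example, and it assumes the label is the count divided by 5, rounded down, as in the example with 10 above.

```python
BUCKET_WIDTH = 5  # each class label covers five consecutive accident counts


def to_range_label(accident_count: int) -> int:
    """Map a raw accident count to its AccidentCountRangeOf5 class label."""
    return accident_count // BUCKET_WIDTH


def label_to_range(label: int) -> range:
    """Return the accident counts covered by a class label."""
    start = label * BUCKET_WIDTH
    return range(start, start + BUCKET_WIDTH)


print(to_range_label(52))        # 10
print(list(label_to_range(10)))  # [50, 51, 52, 53, 54]
```

Predicting the label instead of the count turns a regression target into a small set of classes, at the cost of only knowing the count to within 5.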
Note for future hackers: We have thoroughly documented our entire data processing workflow and code in this Jupyter Notebook, so that it becomes a bit easier for you to process government data: https://github.com/DeeptanshuM/HackHarvard2018/blob/master/DataCleaningandProcessing.ipynb
How we built it
We used Pandas to clean and process data. We used Microsoft Azure Machine Learning Studio to build ML classification models and determine their accuracy.
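As a rough idea of what the Pandas step can look like, here is a small sketch that counts accidents per hour, joins them to hourly traffic, and derives the label. The column names (timestamp, date, hour, vehicles, accident_count) and the toy rows are assumptions made for this example; the real datasets and the notebook linked above use their own schema and steps.

```python
import pandas as pd

# Hypothetical raw rows: one row per recorded accident, with a timestamp.
accidents = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2018-01-05 08:15", "2018-01-05 08:40", "2018-01-05 09:05",
        "2018-02-10 08:30",
    ]),
})

# Hypothetical hourly traffic: vehicle counts per date and hour.
traffic = pd.DataFrame({
    "date": pd.to_datetime(
        ["2018-01-05", "2018-01-05", "2018-02-10", "2018-02-10"]
    ),
    "hour": [8, 9, 8, 9],
    "vehicles": [1200, 950, 1100, 800],
})

# Count accidents per (date, hour).
accidents["date"] = accidents["timestamp"].dt.normalize()
accidents["hour"] = accidents["timestamp"].dt.hour
hourly_accidents = (
    accidents.groupby(["date", "hour"]).size().reset_index(name="accident_count")
)

# Join on date and hour; hours with traffic but no accidents get a count of 0.
merged = traffic.merge(hourly_accidents, on=["date", "hour"], how="left")
merged["accident_count"] = merged["accident_count"].fillna(0).astype(int)

# Derive the model inputs and the range-of-5 class label.
merged["month"] = merged["date"].dt.month
merged["AccidentCountRangeOf5"] = merged["accident_count"] // 5

features = merged[["hour", "month", "vehicles", "AccidentCountRangeOf5"]]
print(features)
```

A table shaped like features (hour, month, traffic volume, label) is the kind of input a classification model can be trained on.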
What we learned
- Discovered and became comfortable using Microsoft Azure Machine Learning Studio
- Gained experience cleaning, processing and preparing real-world raw government data for ML
- Fix the NaN value for the macro-averaged precision metric :D
- Compare different types of ML models
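On the NaN value for macro-averaged precision mentioned above: one common way this happens is that some class is never predicted, so its precision is 0/0 and the average over classes becomes NaN. Whether that is the cause in this project is an assumption. The sketch below uses plain Python with made-up labels and helper names to show the effect.

```python
import math


def per_class_precision(y_true, y_pred, label):
    """Precision for one class: TP / (TP + FP); NaN if the class is never predicted."""
    # True labels of every sample the model assigned to this class.
    predicted = [t for t, p in zip(y_true, y_pred) if p == label]
    if not predicted:
        return math.nan  # 0 / 0 is undefined
    true_positives = sum(1 for t in predicted if t == label)
    return true_positives / len(predicted)


def macro_precision(y_true, y_pred):
    """Unweighted mean of per-class precision over all classes seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = [per_class_precision(y_true, y_pred, c) for c in labels]
    return sum(scores) / len(scores)


y_true = [0, 0, 1, 2]
y_pred = [0, 0, 1, 1]  # class 2 is never predicted

print(macro_precision(y_true, y_pred))  # nan
```

With many narrow range-of-5 classes, rare ranges are easy for a model to never predict, which is why this situation is plausible here.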