Imagine you are a taxi driver in a metropolitan area looking for your next passengers. Wouldn't it be nice if you could predict the pickup location yielding the highest fare for your current location and point in time? As a passenger, wouldn't it be nice if you could predict when to start your trip so as to minimize travel time? With the increasing amount of publicly available data sets as well as today's cluster computing and analytics tools, this should be possible. And that is exactly what we did using Apache Spark.
What it does
We are using parts of the data sets published by the New York City Taxi & Limousine Commission, which cover over 1.1 billion individual taxi trips within New York City from 2009 through 2015. Each trip record contains pickup and drop-off locations as well as itemized fare amounts. We cleaned and analyzed the data with Apache Spark on an AWS cluster and used it to train a predictive model with Spark's machine learning libraries. The model predicts a fare amount for a given location and point in time with some degree of certainty. We developed a small application that can be used to query the model and that continuously updates the prediction values for each of the 195 "Neighborhood Tabulation Areas" (NTAs) of New York City for the current time. The final map is generated with CartoDB, backed by a dynamically updated prediction table.
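The continuous per-NTA refresh described above can be sketched as follows. This is only an illustration: `predict_fare`, the feature layout (NTA index, hour, weekday), and the returned values are hypothetical stand-ins for the real query against the trained Spark model.

```python
# Sketch of the periodic refresh of the per-NTA prediction table.
# `predict_fare` is a placeholder for the real model lookup served by
# the web service; the feature layout is an assumption.
from datetime import datetime

NUM_NTAS = 195  # Neighborhood Tabulation Areas in New York City


def predict_fare(nta_index: int, hour: int, weekday: int) -> float:
    """Placeholder for querying the trained model (dummy formula here)."""
    return 10.0 + 0.1 * nta_index + 0.2 * hour


def refresh_prediction_table(now: datetime) -> dict:
    """Recompute the predicted fare for every NTA at the given time."""
    return {nta: predict_fare(nta, now.hour, now.weekday())
            for nta in range(NUM_NTAS)}


table = refresh_prediction_table(datetime(2014, 7, 4, 18, 0))
print(len(table))  # one entry per NTA -> 195
```

In the deployed system, a table like this would be pushed to the backing store that the CartoDB map reads from, and regenerated as the current time advances.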
How we built it
For the initial data acquisition and cleaning we used AWS S3 to store both the raw input data (CSV) and the cleaned output data (Parquet). The current model is trained on the 2014 data, as we were unable to find a complete source for the 2015 data. The Parquet format allowed for smaller storage size as well as faster queries and analysis. For the initial analysis and ad-hoc queries we used a Databricks Spark cluster on AWS. The machine learning model itself is based on a random forest algorithm and uses a subset of the input columns as features. The optimal model was selected using cross-validation and persisted to S3. The web service was implemented with Spring Boot and deployed to an EC2 instance; the frontend is a CartoDB map.
What's next for taxi-assistant
Obviously there is still work to be done: build a native app, allow for more fine-grained predictions, incorporate dynamic data such as weather and traffic, and train on data covering longer time periods.