On Broad St, a major artery of Richmond, the GRTC Pulse system is being built. "GRTC Pulse is a modern, high quality, high capacity rapid transit system that will serve a 7.6 mile route along Broad Street and Main Street, from Rocketts landing in the City of Richmond to Willow Lawn in Henrico County." (Source: http://ridegrtc.com/brt/)
While this will connect the east and west ends of Richmond to the city center, does the system serve the areas that need public transport the most? Identifying these areas would let the system fill a need of the city while also becoming a more attractive option for people who would not otherwise have used it, increasing public transport ridership and lowering emissions from people driving their own cars.
This work is inspired by the Leaders of the New South, who spoke at the VCU Presidential Listening Forum on October 13, 2016.
"Leaders Of the New South began as a movement to confront southern racism by turning a rallying symbol of REBELlion and oppression into a rallying symbol of REBELlion and liberation and soon evolved into a movement to enable US to be Leaders in our family, Leaders in our community, and Leaders in our nation.
By developing and utilizing a network of LEADERS who take ACTION, we drive community empowerment and policy engagement." (Source: http://www.leadersofthenewsouth.org/home.html)
What it does
Using public datasets and open-source software, machine learning models were built to target areas of low-income housing whose residents are more likely to use public transport. The project also used Flickr data to identify neighborhoods and attractions in the city of Richmond.
How I built it
This project used the Konstanz Information Miner (KNIME). "KNIME is a free and open-source data analytics, reporting and integration platform. KNIME integrates various components for machine learning and data mining through its modular data pipelining concept." (Source: https://en.wikipedia.org/wiki/KNIME) Using this GUI, a pipeline was built to extract data from public datasets and cluster it with different machine learning algorithms.
Datasets were collected from Flickr, the U.S. Department of Housing and Urban Development, arcgis.com, census.gov, and data.gov. Datasets from the U.S. Department of Housing and Urban Development included Public Housing Buildings, Public Housing Development, and HOME Program Activity.
DBSCAN is a density-based clustering algorithm that grows clusters outward from a starting point _p_. If another point _q_ lies within a reachable distance of _p_ (set by the parameter epsilon), the two points are clustered together. If a further point _z_ is within reach of _q_, it is added to the cluster containing _p_ and _q_. A cluster must contain a minimum number of points; otherwise its points are considered noise. Because clusters grow point by point, they can take amorphous, nonspherical shapes, which allows neighborhoods that run straight down a block or sit in a cul-de-sac to be identified.
Datasets from the U.S. Department of Housing and Urban Development were combined into a single dataset to draw clusters. The clusters were computed with the DBSCAN algorithm, using an epsilon of 0.01 and a minimum of 5 points per cluster.
For the Flickr datasets, the area was also confined to Richmond, and only data from the last 10 years was used.
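As a rough illustration of the clustering step described above, here is a minimal pure-Python sketch of DBSCAN. It is not the KNIME node used in the project. The function names, the sample coordinates, and the choice of plain Euclidean distance on raw longitude/latitude values are all assumptions made for the example; only the epsilon of 0.01 and the minimum of 5 points come from the project.

```python
from math import hypot

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (including i)."""
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if hypot(px - qx, py - qy) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already visited
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        # points[i] is a core point: start a new cluster and grow it.
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point
    return labels


# Hypothetical (longitude, latitude) pairs: six addresses in a row along
# one street, plus one isolated address far away.
points = [(-77.450 + 0.004 * k, 37.550) for k in range(6)] + [(-77.300, 37.600)]
labels = dbscan(points, eps=0.01, min_pts=5)
print(labels)
```

In this example the six points in a row end up in one elongated cluster, even though the two ends of the row are farther apart than epsilon, while the isolated point is labeled as noise.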
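The filtering described above could look something like the following sketch. The record layout (`lat`, `lon`, `taken`), the bounding-box values, and the reference date are hypothetical and chosen only for illustration; the project itself did this step inside the KNIME pipeline.

```python
from datetime import datetime, timedelta

# Hypothetical bounding box roughly around Richmond, VA (illustrative values).
RICHMOND_BBOX = {
    "min_lat": 37.45, "max_lat": 37.62,
    "min_lon": -77.60, "max_lon": -77.38,
}


def filter_photos(photos, bbox, now, years=10):
    """Keep photos inside bbox that were taken within the last `years` years."""
    cutoff = now - timedelta(days=365.25 * years)  # approximate year length
    return [
        p for p in photos
        if bbox["min_lat"] <= p["lat"] <= bbox["max_lat"]
        and bbox["min_lon"] <= p["lon"] <= bbox["max_lon"]
        and p["taken"] >= cutoff
    ]


# Made-up sample records; the field names are assumptions.
photos = [
    {"id": "a", "lat": 37.54, "lon": -77.43, "taken": datetime(2015, 6, 1)},  # inside, recent
    {"id": "b", "lat": 37.54, "lon": -77.43, "taken": datetime(2004, 6, 1)},  # inside, too old
    {"id": "c", "lat": 38.90, "lon": -77.04, "taken": datetime(2015, 6, 1)},  # outside the box
]
now = datetime(2017, 1, 1)  # arbitrary reference date for the example
kept = filter_photos(photos, RICHMOND_BBOX, now)
print([p["id"] for p in kept])
```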
Challenges I ran into
Finding publicly available data is an issue every data scientist runs into. I would have liked to include the locations of grocery stores and supermarkets to help combat the food desert effect in Richmond.
Accomplishments that I'm proud of
I'm proud that I was able to adapt a pipeline provided to me in a class into a project that can potentially help thousands of people in my community.
What I learned
I learned how to effectively scour websites for data that answers the questions I am asking while making sure it is clear, clean, and concise.
What's next for Improving GRTC with Machine Learning
I would love to see this pipeline used by GRTC, as they have access to consumer data such as the addresses of frequent riders and of others in enrollment plans.
This project is an adaptation of a Data Mining course project at the Institut national des sciences appliquées de Lyon, part of their Information Science and Technology Semester. Code was initially supplied by Mehdi Kaytoue (@MehdiKaytoue).