We are a group of students from the Cooper Union interested in the environment!
Current debates on global warming and the environment have made us aware of the pollution around us, and we wondered: what are we breathing every day? Where do all these pollutants come from?
What it does
Based on pollution data from the past 15 years, we modelled the flow of pollutants with a Gaussian dispersion model, assuming that every sample location is both an affected area and a source.
We used methods such as interpolation and randomized sampling to estimate pollutant levels in areas where they were not recorded.
To examine up-to-date pollution, we implemented a machine learning algorithm to generate more data points, which we also run through the dispersion algorithm to trace their sources.
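As a rough illustration of the Gaussian dispersion idea, here is a minimal sketch of the standard steady-state Gaussian plume formula. The function name, the parameters, and the linear spread coefficients `a_y` and `a_z` are assumptions made for this example (uncalibrated placeholders), not values from our actual code.

```python
import math


def plume_concentration(q, u, x, y, z=0.0, h=0.0, a_y=0.08, a_z=0.06):
    """Steady-state Gaussian plume concentration at a receptor.

    q: emission rate of the source (g/s)
    u: wind speed (m/s)
    x: downwind distance from source to receptor (m)
    y: crosswind distance from the plume centerline (m)
    z: receptor height (m); h: effective source height (m)
    a_y, a_z: illustrative placeholder coefficients; the plume spread is
              assumed to grow linearly with downwind distance.
    """
    if x <= 0 or u <= 0:
        # Receptor is upwind of the source, or there is no wind to carry the plume.
        return 0.0
    sigma_y = a_y * x  # horizontal spread (m)
    sigma_z = a_z * x  # vertical spread (m)
    crosswind = math.exp(-y ** 2 / (2 * sigma_y ** 2))
    # The second exponential is the "image source" term that models
    # reflection of the plume off the ground.
    vertical = (math.exp(-(z - h) ** 2 / (2 * sigma_z ** 2))
                + math.exp(-(z + h) ** 2 / (2 * sigma_z ** 2)))
    return q / (2 * math.pi * u * sigma_y * sigma_z) * crosswind * vertical
```

In this sketch the concentration is highest on the plume centerline and falls off with crosswind distance, and a receptor upwind of the source receives nothing.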
How we built it
First, we implemented multiple web-crawling programs to collect elevation and geocoding data from the Google APIs. We also wrote programs to extract useful information from the EPA and NOAA: monthly pollutant concentration levels from the former, and wind direction and magnitude from the latter.
Afterwards, we developed an algorithm based on the Gaussian dispersion model in Python and built a dataframe to run through the whole algorithm, generating a final CSV file with the analyzed information.
We exported everything to JSON files for data representation.
For current data, we implemented a backend script in Python. When an HTTP request is sent, backend.py runs the same algorithm as before, but uses concentrations generated by the machine learning code, to return real-time pollution and pollution source data.
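The source-tracing step can be sketched roughly as follows: treat every station as a potential source, work out how far downwind and crosswind the receptor sits from it, and normalize the resulting contributions into percentages before writing them out as CSV. Everything here is an assumption for illustration: the function names, the station dictionary layout, the use of a station's measured level as a stand-in for its emission rate, the placeholder spread coefficients, and the standard-library `csv` module in place of a dataframe.

```python
import csv
import math


def wind_frame(source_xy, receptor_xy, wind):
    """Return (downwind, crosswind, speed) for a source/receptor pair.

    wind is a vector (m/s) pointing in the direction the air moves toward.
    """
    speed = math.hypot(wind[0], wind[1])
    if speed == 0:
        raise ValueError("wind speed must be non-zero")
    ux, uy = wind[0] / speed, wind[1] / speed
    dx = receptor_xy[0] - source_xy[0]
    dy = receptor_xy[1] - source_xy[1]
    return dx * ux + dy * uy, -dx * uy + dy * ux, speed


def ground_plume(q, speed, downwind, crosswind, a_y=0.08, a_z=0.06):
    """Ground-level Gaussian plume from a ground-level source (placeholder spreads)."""
    if downwind <= 0:
        return 0.0  # the receptor is upwind of this source
    sigma_y, sigma_z = a_y * downwind, a_z * downwind
    return (q / (math.pi * speed * sigma_y * sigma_z)
            * math.exp(-crosswind ** 2 / (2 * sigma_y ** 2)))


def source_shares(stations, receptor_name, wind):
    """Percentage of the modelled concentration at a receptor due to each other station."""
    receptor = stations[receptor_name]
    contributions = {}
    for name, station in stations.items():
        if name == receptor_name:
            continue
        downwind, crosswind, speed = wind_frame(station["xy"], receptor["xy"], wind)
        # Assumption: the measured level stands in for the emission rate.
        contributions[name] = ground_plume(station["level"], speed, downwind, crosswind)
    total = sum(contributions.values())
    return {name: (100.0 * value / total if total > 0 else 0.0)
            for name, value in contributions.items()}


def write_shares(shares, out_file):
    """Write the shares to an open text file as CSV rows."""
    writer = csv.writer(out_file)
    writer.writerow(["source", "percent"])
    for name, percent in sorted(shares.items()):
        writer.writerow([name, f"{percent:.2f}"])
```

For example, with wind blowing along the positive x axis, a station upwind of the receptor receives a positive share while a station downwind of it receives none.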
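To make the request flow concrete, here is a minimal sketch of what such a backend could look like using only Python's standard `http.server` module. The `lat` and `lon` query parameters, the response fields, and the two stub functions `predict_concentration` and `trace_sources` are hypothetical stand-ins for the machine learning code and the dispersion algorithm; this is not the actual backend.py.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def predict_concentration(lat, lon):
    """Hypothetical stand-in for the machine learning prediction."""
    return 0.0


def trace_sources(lat, lon, concentration):
    """Hypothetical stand-in for the dispersion-based source tracing."""
    return []


def build_response(query):
    """Turn a query string such as 'lat=40.7&lon=-74.0' into a response dict."""
    params = parse_qs(query)
    lat = float(params["lat"][0])
    lon = float(params["lon"][0])
    concentration = predict_concentration(lat, lon)
    return {
        "lat": lat,
        "lon": lon,
        "concentration": concentration,
        "sources": trace_sources(lat, lon, concentration),
    }


class PollutionHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            body = build_response(urlparse(self.path).query)
            status = 200
        except (KeyError, ValueError):
            # Missing or non-numeric parameters.
            body = {"error": "expected numeric lat and lon query parameters"}
            status = 400
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def run(host="127.0.0.1", port=8000):
    """Serve requests until interrupted."""
    HTTPServer((host, port), PollutionHandler).serve_forever()
```

Calling `run()` starts the server; a GET request with `lat` and `lon` in the query string then returns the prediction and its traced sources as JSON.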
Challenges we ran into
Getting Spark to work locally
Uploading code to Bluemix and integrating the back end and front end
Accomplishments that we're proud of
We successfully ran the algorithm to retrace the pollutants: where they came from and what percentage each source contributed. The machine learning code also worked, predicting pollution based on wind direction and geographical location.
What we learned
Apache Spark and big data processing
Machine learning algorithms, specifically linear regression. We evaluated whether to use stochastic gradient descent in the regression, but decided against it because it would take longer to process while the user is using the app.
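To illustrate the trade-off, here is a minimal sketch of linear regression solved in closed form (ordinary least squares) for a single feature. It needs one pass over the data and no iteration, whereas stochastic gradient descent would repeat updates over many passes. The function name and the single-feature setup are assumptions for this example; our actual model used more inputs.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of the x/y cross deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Evaluate the fitted line at x."""
    return slope * x + intercept
```

Once the slope and intercept are fitted, each prediction is a single multiplication and addition, which keeps the response fast for the user.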
What's next for Polldentify
Polldentify is simply a proof of concept of what could be done on a global scale. By tracing pollutants back to their origins, it could help environmentalists, academia, and the government make better public decisions and policies, and help citizens become more aware of the pollution around them.