90% of people living in urban areas are exposed to air quality that exceeds WHO limits. Air pollution has been linked to chronic lung disease, heart disease, cancer and diabetes, the major killers in most countries. Pollution levels vary widely across time and space, but data on local pollution sources is still lacking. During my PhD, I used commercial sensors to develop a portable sensor array that accurately measures air pollutants (NOx, CO, CO2, PM2.5, O3) at a fraction of the cost of existing devices. While I was using the device to study air pollution in Toronto neighborhoods, many people approached me for advice on interpreting the sensor data and on ways to protect their children's health. The devices themselves were useful, but interpreting low-cost sensor data is difficult because the sensors are non-specific: they detect the combined response from several different pollutants at once. These problems have been highlighted by others using the same sensors for citizen science (https://www.nature.com/news/validate-personal-air-pollution-sensors-1.20195). While some efforts have been made to make air quality datasets publicly available (http://aqicn.org/here/), only outdoor air pollution data is included, and the datasets are unannotated and too coarse to be useful for training models.
What it does
I realized that there is a need for better models and analysis pipelines that give people not just raw sensor data but actionable insights (e.g., what source is causing the spike in sensor readings? Is the gas stove leaking natural gas? When is a good time to open the window to let in fresh air while avoiding exposure to air pollution?). To build such classification models, more open-source training data is needed, with sensors tested under a greater range of conditions. This is all the more important because sensors are affected by interferences (temperature, humidity, other contaminants) and by drift over time. One major challenge, however, is the lack of well-documented, annotated open-source datasets that contain not just sensor readings but also the sensor model number, age, location, major local sources, and the time variability of those sources.
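To make the annotation requirements concrete, here is a minimal sketch of what one annotated record could look like. The field names, the sensor model number, and all values are illustrative placeholders, not a fixed LedgAir schema:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class SensorRecord:
    """One annotated reading; all field names are illustrative, not a final schema."""
    timestamp: datetime
    pollutant: str                # e.g. "NO2", "PM2.5"
    raw_value: float              # uncorrected sensor output
    sensor_model: str             # manufacturer model number
    sensor_age_days: int          # time since deployment, a proxy for drift
    temperature_c: float          # interferents needed for later correction
    humidity_pct: float
    location: str
    local_sources: list = field(default_factory=list)  # major nearby sources

# Example record (all values invented for illustration)
record = SensorRecord(
    timestamp=datetime(2023, 5, 1, 8, 30),
    pollutant="NO2",
    raw_value=42.7,
    sensor_model="MiCS-2714",
    sensor_age_days=210,
    temperature_c=21.5,
    humidity_pct=55.0,
    location="Toronto, kitchen",
    local_sources=["gas stove"],
)
```

Capturing interferents (temperature, humidity) and sensor age alongside each reading is what would let downstream models correct for cross-sensitivity and drift.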
How I built it
My goal is to develop a curated, well-documented open database of sensor responses under a range of environmental conditions and source profiles. Other researchers and citizen scientists would be able to upload their own documented datasets as well, which would be automatically checked for quality against a set of criteria I developed over the last 8 years of working with these sensors. https://www.dropbox.com/s/7lgjwfs17rvphzs/LedgAir.pdf?dl=0 https://www.dropbox.com/s/xgkl5o1bpaerhzc/LeadgAir.mp4?dl=0
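An automated quality gate for uploads could be sketched as follows. The two checks shown (no long gaps in the time series, no physically implausible values) and their thresholds are hypothetical examples standing in for the actual criteria:

```python
def passes_quality_checks(readings, max_gap_minutes=60, valid_range=(0.0, 2000.0)):
    """Illustrative quality gate for an uploaded time series.

    `readings` is a list of (timestamp_minutes, value) pairs sorted by time.
    The checks and thresholds are placeholders, not the real criteria.
    """
    if len(readings) < 2:
        return False  # too short to assess
    lo, hi = valid_range
    # Reject physically implausible values anywhere in the series
    if any(not (lo <= value <= hi) for _, value in readings):
        return False
    # Reject long gaps between consecutive samples
    times = [t for t, _ in readings]
    if any(t1 - t0 > max_gap_minutes for t0, t1 in zip(times, times[1:])):
        return False
    return True

print(passes_quality_checks([(0, 10.0), (30, 12.0), (60, 11.5)]))  # prints True
print(passes_quality_checks([(0, 10.0), (200, 12.0)]))             # prints False (gap)
```

A real gate would add per-pollutant ranges, drift and calibration checks, and metadata completeness rules, but the shape of the pipeline would be the same.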
What's next for LedgAir
The datasets would then be used to develop libraries of classification models (i.e., fully reproducible model specifications) that identify pollution sources and forecast future pollution levels. These models would also be openly available, curated and well documented, enabling people across the world to deploy them on their own datasets and to build tools and APIs on top of these data sources. Reliable open datasets and models would be invaluable for developing better policies on urban development, source control and climate change mitigation. They would also let us test the outcomes of small-scale interventions (e.g., no-idling laws, electric and biofuel vehicles, walkable city designs), which is currently hard to do.
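As a toy illustration of source identification, a nearest-centroid classifier over multi-pollutant signatures might look like this. The source profiles and their values are entirely invented; a real model library would ship trained, versioned specifications learned from the annotated database rather than hard-coded means:

```python
import math

# Toy centroid profiles: mean (NO2, CO, PM2.5) signature per source.
# All numbers are invented for illustration, not measured data.
SOURCE_PROFILES = {
    "traffic":   (40.0, 1.2, 25.0),
    "gas_stove": (80.0, 2.5, 10.0),
    "wildfire":  (15.0, 0.8, 120.0),
}

def classify_source(no2, co, pm25):
    """Assign a reading to the source whose profile is nearest (Euclidean)."""
    reading = (no2, co, pm25)
    return min(SOURCE_PROFILES, key=lambda s: math.dist(reading, SOURCE_PROFILES[s]))

print(classify_source(78.0, 2.4, 12.0))  # prints gas_stove: closest to that profile
```

The point of publishing models as reproducible specifications is that anyone could re-run exactly this kind of classification on their own sensor stream and audit how the answer was reached.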