We chose the topic "Reduce impact on climate change" because we believe that raising awareness of CO2's impact on the climate will help corporations and consumers change their production and consumption models.
And because the message seems hard to accept coming from Greta Thunberg, we hope that satellite data from Copernicus (the European Union's Earth Observation Program) will bring the missing credibility to the cause.
What it does
The data is provided through two applications:
- a prediction model that simulates environmental changes under projected CO2 levels
- a self-service model in Power BI that empowers users and gives transparency on the data used
How we built it
Please find all the details in the GitHub README file.
Challenges we ran into
We faced a lot of challenges, from dataset comprehension to technical ones:
- The data format from the CDS (Climate Data Store) isn't ready to use with Apache Spark. We had to convert the datasets to Parquet, which wasn't simple given the data volume and the lack of parallelism in the existing libraries.
- Data quality issues
- Our wish to make the raw data available to end users, with satisfactory performance and full analytics capabilities
- The difficulty of finding relevant prediction variables
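To give an idea of the conversion step, here is a minimal sketch of how per-file parallelism can work around single-threaded readers. It assumes the CDS files are NetCDF, that xarray, pandas, and pyarrow are available, and that the file paths are hypothetical; it is an illustration, not our exact pipeline.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def convert_one(nc_path, out_dir):
    """Convert a single NetCDF file to one Parquet file.

    Hypothetical per-file conversion: open the file with xarray,
    flatten the gridded variables to a table, and write Parquet.
    """
    import xarray as xr  # imported lazily so the module loads without it

    ds = xr.open_dataset(nc_path)
    df = ds.to_dataframe().reset_index()  # lat/lon/time become columns
    out_path = Path(out_dir) / (Path(nc_path).stem + ".parquet")
    df.to_parquet(out_path)  # requires pyarrow (or fastparquet)
    return out_path


def convert_all(nc_paths, out_dir, workers=4):
    """Convert many files at once.

    The NetCDF readers we tried were effectively single-threaded, so the
    parallelism comes from converting whole files in separate processes.
    """
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert_one, nc_paths, [out_dir] * len(nc_paths)))
```

Per-file process parallelism keeps each worker simple at the cost of one Parquet file per input; a Spark job can then read the whole output directory at once.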
Accomplishments that we are proud of
The self-service analytical application could be really useful for involving citizens in the data-crunching process and, in doing so, countering the climate-sceptic tendencies spread by some politicians and lobbyists. Even without any transformation of the data, we can see the changes in climate measurements from the satellites.
Our ability to work as a team: we used each member's strengths to industrialize the pipeline, make the data accessible, configure the clusters, and implement the ML algorithms.
What we learned
While an intuitive approach was feasible at our level, it certainly takes a great amount of knowledge of the datasets' underlying characteristics to avoid mistakes. Sadly, we lacked much of that knowledge. Databricks allowed us to implement complex logic simply, capitalizing on full Python support on one side and a fully managed Spark cluster for computation over tens of billions of rows on the other.
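As a sketch of the kind of Spark computation we mean, the following PySpark snippet aggregates a Parquet dataset down to yearly means. The column names (`time`, `co2`) are hypothetical placeholders; the actual CDS variable names depend on the dataset downloaded.

```python
def yearly_co2_means(parquet_path):
    """Reduce billions of measurement rows to one mean per year.

    A minimal PySpark sketch; "time" and "co2" are assumed column
    names, not the real CDS schema.
    """
    from pyspark.sql import SparkSession  # lazy import: pyspark optional
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("copernicus-sketch").getOrCreate()
    df = spark.read.parquet(parquet_path)
    return (
        df.withColumn("year", F.year("time"))
        .groupBy("year")
        .agg(F.avg("co2").alias("mean_co2"))
        .orderBy("year")
    )
```

On a managed cluster, Spark distributes both the Parquet scan and the aggregation, which is what made this scale feasible for us.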
What's next for When Greta meets Copernicus
We hope to continue working on democratizing the data from Copernicus, a reliable, non-financial source of data with a lot more to say than our first basic analysis shows.