Inspiration

We initially wanted to do something cool with elevation data (predicting Strava running times from elevation!), but we ran straight into a problem: elevation data is too large. The standard ASTER GDEM data (https://asterweb.jpl.nasa.gov/gdem.asp) for the whole world comes to around 400GB, even after NASA's own compression in the TIFF format. This quickly led us to our project idea: what if elevation data could be compressed using geographic insight into the dataset, so that services and developers (like us!) could use this data more efficiently?

What it does

We train a convolutional neural network pipeline to compress each image file. Given Fourier inputs based on where you are on the globe, the neural network outputs the 3600x3600 height map of that tile, which covers a 1 degree by 1 degree (in latitude/longitude) portion of the globe. The residuals (the differences between the prediction and the true heights) are then captured by a custom compression algorithm, which makes the whole scheme fully lossless.
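To illustrate what "Fourier inputs" based on position could look like, here is a minimal sketch in plain Python. The function name `fourier_features`, the number of bands, and the normalisation are assumptions made for this example, not the project's actual code.

```python
import math

def fourier_features(lat_deg, lon_deg, num_bands=4):
    """Encode a position as sin/cos pairs at doubling frequencies.

    Hypothetical helper: the band count and scaling are illustrative.
    """
    # Normalise to [-1, 1] so the lowest band completes one period
    # across the full latitude or longitude range.
    coords = (lat_deg / 90.0, lon_deg / 180.0)
    features = []
    for value in coords:
        for band in range(num_bands):
            angle = math.pi * (2 ** band) * value
            features.append(math.sin(angle))
            features.append(math.cos(angle))
    return features

features = fourier_features(lat_deg=51.5, lon_deg=-0.1)
print(len(features))  # 2 coordinates * 4 bands * 2 (sin, cos) = 16
```

The higher bands change quickly with position, which is what lets a small network tell nearby locations apart.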

How we built it

The convolutional neural network was trained on a mere laptop CPU (we only had time to train on one image) to minimise the entropy of the output relative to the original image. The input consists of Fourier features of the tile's position on the globe as well as the pixel position itself, similar to how positional encoding is performed for transformers. The resulting network weights are only 1.3MB, and the network predicts elevation with an average error of ~20m. The remaining residuals are then passed to a custom PNG-based compression algorithm.

This algorithm differs from standard PNG compression in two ways that make it better suited to terrain elevation data. Firstly, it zig-zag encodes the prediction residuals, so that small negative deltas become small positive numbers, which improves entropy coding. PNG does not do this because it performs prediction on individual bytes, which would split our 16-bit elevation values so that the sign is meaningless. Secondly, it uses a true 2D predictor rather than the Paeth predictor (a heuristic that selects the most similar neighbour as the prediction). PNG uses Paeth because images often have sharp edges, but a true 2D predictor is better suited to terrain data, which is locally smooth in all directions.
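The two ideas above can be sketched together in a few lines of Python. The writeup does not say which 2D predictor was used, so this example assumes the common planar predictor (left + up - up-left). All function names are hypothetical, and the toy tile holds raw heights rather than CNN residuals.

```python
def zigzag_encode(n):
    """Map signed ints to unsigned ones: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * n if n >= 0 else -2 * n - 1

def zigzag_decode(z):
    """Invert zigzag_encode."""
    return z // 2 if z % 2 == 0 else -((z + 1) // 2)

def predict(grid, r, c):
    """Planar predictor (an assumption): left + up - up_left, with edge fallbacks."""
    if r == 0 and c == 0:
        return 0
    if r == 0:
        return grid[r][c - 1]
    if c == 0:
        return grid[r - 1][c]
    return grid[r][c - 1] + grid[r - 1][c] - grid[r - 1][c - 1]

def encode(grid):
    """Replace each height with its zig-zag encoded prediction residual."""
    return [[zigzag_encode(grid[r][c] - predict(grid, r, c))
             for c in range(len(grid[0]))]
            for r in range(len(grid))]

def decode(codes):
    """Rebuild the heights in scan order, using only already-decoded neighbours."""
    rows, cols = len(codes), len(codes[0])
    grid = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            grid[r][c] = predict(grid, r, c) + zigzag_decode(codes[r][c])
    return grid

tile = [[100, 102, 104],
        [101, 103, 105],
        [99, 101, 104]]
codes = encode(tile)
print(codes)                  # [[200, 4, 4], [2, 0, 0], [3, 0, 2]]
print(decode(codes) == tile)  # True: the round trip is lossless
```

On a smooth slope most codes collapse to small non-negative integers, which is the kind of skewed distribution an entropy coder compresses well.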

Impact

This could have a huge impact on data modelling in the field, such as lava flow modelling (which GDEM data is already used for!), where internet connectivity is limited or expensive and speed is critical. It could also have an impact on commercial uses, such as an offline route finder with limited service or storage.

While we did not manage to fully integrate the CNN with the custom compression algorithm in the time given, both already show effective compression ratios.

The Content Mixing CNN achieves roughly a 50% entropy reduction, with a 1.3MB model.
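For readers unfamiliar with the metric, entropy reduction can be measured by comparing the empirical Shannon entropy of the raw heights with that of the prediction residuals. The sketch below uses made-up toy numbers (not project data) and a hypothetical `entropy_bits` helper.

```python
import math
from collections import Counter

def entropy_bits(values):
    """Empirical Shannon entropy in bits per symbol."""
    counts = Counter(values)
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Toy data for illustration only: true heights and a hypothetical prediction.
heights = [100, 103, 107, 112, 118, 125, 133, 142]
predicted = [101, 103, 106, 112, 119, 125, 132, 142]
residuals = [h - p for h, p in zip(heights, predicted)]

print(entropy_bits(heights))    # 3.0 (eight distinct, equally likely values)
print(entropy_bits(residuals))  # 1.5 (residuals cluster around zero)
```

The better the prediction, the more the residuals cluster around zero and the fewer bits per value an entropy coder needs.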

The custom compression algorithm achieves a compression ratio of 0.286267, a saving of 71.37% over the raw data. This beats the competing state-of-the-art compression algorithms below (all tested on a random sample of 60 granules):

LZW: ratio=0.3092, savings=69.08%
ZSTD: ratio=0.2964, savings=70.36%
DEFLATE: ratio=0.3007, savings=69.93%
LERC: ratio=0.4317, savings=56.83%

It also strongly beats the original .tif compression used by NASA, which had a compression ratio of 0.5499, meaning the 400GB of data could be almost halved with our compression algorithm.
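The "almost halved" figure follows from simple arithmetic on the ratios quoted above. The sketch below assumes that the 0.5499 ratio measured on the 60-granule sample applies to the whole 400GB archive, which is an extrapolation rather than a measured result.

```python
def savings_percent(ratio):
    """Convert a compression ratio (compressed / raw) to a percentage saving."""
    return (1.0 - ratio) * 100.0

nasa_ratio = 0.5499    # NASA .tif ratio from the sample
ours_ratio = 0.286267  # custom algorithm ratio from the sample
archive_gb = 400.0     # approximate size of the compressed archive

# Assumption: the sample ratios hold for the full dataset.
raw_gb = archive_gb / nasa_ratio  # estimated uncompressed size
ours_gb = raw_gb * ours_ratio     # estimated size with the custom algorithm

print(round(savings_percent(ours_ratio), 2))  # 71.37
print(round(raw_gb))                          # 727
print(round(ours_gb))                         # 208
```

Under that assumption the archive would shrink from about 400GB to about 208GB, a little over half its current size.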

Challenges we ran into

Downloading the data was a challenge in itself. We had to modify a NASA-provided bash script to filter the download down to a smaller dataset for testing, because our laptops could not initially handle the full dataset!

Accomplishments that we're proud of

Our custom compression algorithm beats the NASA .tif compression and other state-of-the-art compression algorithms, and our CNN model is extremely small.

What's next

Currently our model only trains on one of the ~22000 elevation data files, yet we still see a significant file size reduction with our small model. A future project would ideally train on all the data files. This would take a substantial investment of CPU time, but it would only need to be done once, by one computer, and would bring even greater file size reductions. We could also test for the optimal size of geographical area per network: the larger the area, the less storage we need for the networks, but the less specialised each network becomes.

Updates

Just wanted to update that while writing the Devpost, we were testing the final improvements to the custom compression algorithm (switching from Paeth to a true 2D predictor), and achieved a compression ratio of 0.245080. That is a saving of 75.49%, beating our previous best.