Our data platform is a hybrid setup. Most of our data storage and processing runs on-premises, while secondary workloads run in the cloud. Naturally, this requires moving a lot of data between the two environments.

While Delta Lake features (the Structured Streaming source and sink) have made this data flow much easier and more reliable, there are still use cases where we need HDFS distcp-like functionality, usually when we need duplicate copies of a dataset both on-prem and in the cloud. A distcp built specifically for Delta datasets would be nice here.
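To illustrate why a Delta-aware copy differs from a plain directory copy: a Delta table's current contents are defined by its transaction log (`_delta_log`), where each commit is newline-delimited JSON containing `add` and `remove` actions. The sketch below replays such commits to find the data files that are still referenced, which is the set a copy tool would need to transfer. It is a simplified illustration, not how Delta Distcp is implemented: the function name `live_data_files` and the sample file names are hypothetical, and checkpoints, metadata actions, and path encoding are ignored.

```python
import json
from typing import Iterable, Set


def live_data_files(commits: Iterable[str]) -> Set[str]:
    """Replay commit file contents (oldest first) and return the data file
    paths that are still referenced by the table.

    Hypothetical helper for illustration only; it ignores checkpoints and
    every action type other than `add` and `remove`.
    """
    live: Set[str] = set()
    for commit in commits:
        for line in commit.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


# Made-up two-commit log: the second commit rewrites one file.
commit_0 = "\n".join([
    json.dumps({"add": {"path": "part-0000.parquet"}}),
    json.dumps({"add": {"path": "part-0001.parquet"}}),
])
commit_1 = "\n".join([
    json.dumps({"remove": {"path": "part-0000.parquet"}}),
    json.dumps({"add": {"path": "part-0002.parquet"}}),
])

files_to_copy = sorted(live_data_files([commit_0, commit_1]))
print(files_to_copy)
```

A plain distcp of the table directory would also copy `part-0000.parquet`, even though the latest table version no longer references it.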

What it does

How we built it

Challenges we ran into

Accomplishments that we're proud of

What we learned

What's next for Delta Distcp

Built With
