Inspiration

As somewhat experienced in Machine Learning, one of the challenges I frequently ran into was availability of data. More specifically, access to data to train my models. At a time where data doubles every 2 years, access to training ML models with reliable datasets should only increase, and not stay constant!

However, there are real reasons as to why a lot of data isn't being open-sourced to the public. For one, a lot of individual/consumer data is restricted by data preserving rules such as GDPR, California Privacy Law, COPA, and also various forms of health industry privacy laws such as HIPPA. This makes access to these datasets very difficult and often infeasible for individuals without licenses (e.g. students). This is where our approach to help fix this HUGE problem comes into play.

What it does

Assume an organization has their data stored in one of two major cloud providers (GCP or Microsoft Azure). Based on pre-agreed conditions by these organizations, the companies share the trained models of their data (based on another agreed upon Machine Learning model) to a parent node. This parent node has access to encrypted versions of individually trained weights of each file (using differential privacy via Ornstein-Uhlenbeck noise process) and runs an algorithm known as federated averaging (which we built) to convert individual weight files into one meaningful weight file and send this back to all of its children/client nodes. This process is repeated several time (10-100) until the model starts to converge.

Using this process, we can TRAIN any Machine Learning/Deep Learning model without having all the data in one concentric place. We've proven (check our video) that using this technique, we outperform traditional techniques by nearly 2x!

How we built it

PyTorch, GCP, Microsoft Azure, Cloud-Storage, Blob-Storage, a lot of cloud stuff!!!

Challenges we ran into

a lot... running out of time, so can't explain rn!

Accomplishments that we're proud of

it worked!!! We really focused on interoperability as a key feature for this type of system. We envision our platform being used by a range of organizations, which each have their own technology stack. Whether it's Tensorflow or PyTorch, GCP or Azure, local vs cloud, we knew that this solution needed to be flexible to eliminate any friction from the collaboration.

What we learned

that this proof of concept works!

What's next for Federated Learning as a service

everything :D We wanted to apply the techniques to the COVID-CT dataset to demonstrate the value federated learning can provide in the current circumstances. We also wanted to implement differential privacy to add another layer of security to the model parameters. We also wanted to make it a full-fledged enterprise implementation by using private networks to secure the sensitive data.

Proof our technique is better than traditional ML:

Accuracy

Built With

Share this project:

Updates