Inspiration
Machine learning has a steep learning curve, and even experienced practitioners spend a lot of time prototyping and executing their ideas. As machine learning engineers, we realized that building a custom model pipeline is a grueling, redundant, and time-consuming task.
After talking to senior data scientists, we realized how much of a blocker it is to deploy and test every single model each time they experiment with A/B testing, ensembling, or chaining of machine learning models. Eliminating these redundant tasks would let people in the industry focus on what matters more. Even for experienced engineers and data scientists, building model pipelines is not easy, and for people without any background in computer science it is nearly impossible. This leaves a huge void in industry and academia: educators, students, researchers, entrepreneurs, and enterprises with great ideas are held back by the high barrier to entry. So we came up with the idea of a platform that lets you build machine learning models without writing a single line of code.
What it does
SimpliSmart is a universal platform that allows anyone to create deep learning models. It uses AutoML and transfer learning to find the best configuration for your data. With Habana DL1 instances, we have reduced the best-configuration search time by orders of magnitude. The platform also lets users supply a custom configuration and train a model in a single click.
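The best-configuration search can be sketched as a loop over candidate configurations scored on validation data. This is a minimal, hypothetical sketch; the search space, the sampling strategy, and the scoring function here are illustrative stand-ins, not the platform's actual AutoML internals.

```python
import random

# Illustrative search space; the real platform derives this from the data.
SEARCH_SPACE = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "batch_size": [16, 32, 64],
    "hidden_units": [64, 128, 256],
}

def sample_config(space, rng):
    """Draw one candidate configuration from the search space."""
    return {name: rng.choice(values) for name, values in space.items()}

def search_best_config(score_fn, space, trials=30, seed=0):
    """Return the configuration with the highest validation score."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(trials):
        config = sample_config(space, rng)
        score = score_fn(config)  # e.g. validation accuracy of a trained model
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

In practice each `score_fn` call trains and evaluates a candidate model, which is where running the trials on Gaudi accelerators cuts the wall-clock time of the search.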
How we built it
We used Angular to build the UI and Python/Django for the backend server. We leveraged Gaudi accelerators to speed up AutoML and to train large image, text, and multi-modal models.
We wrote adapters to expose a universal interface for all machine learning models in our platform. These models can then further be used for dynamic model pipeline generation. We built an easy-to-use UI that makes it intuitive for the end-users to build and visualize complex pipelines irrespective of their level of expertise in machine learning.
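The adapter idea can be sketched as follows: every framework-specific model is wrapped behind one interface, and a pipeline of adapters is itself an adapter, which is what makes dynamic chaining possible. The class and method names below are illustrative, not the platform's actual API.

```python
from abc import ABC, abstractmethod

class ModelAdapter(ABC):
    """The single interface the pipeline builder sees, whatever the backend."""

    @abstractmethod
    def predict(self, inputs):
        ...

class FunctionAdapter(ModelAdapter):
    """Wraps any callable model; a stand-in for real framework adapters."""

    def __init__(self, fn):
        self.fn = fn

    def predict(self, inputs):
        return self.fn(inputs)

class Pipeline(ModelAdapter):
    """Chains adapters; a pipeline is itself usable as a single model."""

    def __init__(self, stages):
        self.stages = stages

    def predict(self, inputs):
        for stage in self.stages:
            inputs = stage.predict(inputs)
        return inputs
```

Because `Pipeline` implements the same interface as its stages, pipelines can be nested, ensembled, or chained without the UI layer knowing anything about the underlying frameworks.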
Challenges we ran into
- Many of the TensorFlow ops we needed to build a declarative deep learning framework were not supported. We worked around this by implementing custom operations for some of the crucial parts of the platform. For instance, we rewrote masking operations so that the internal operations composing them can run on Habana DL1.
- We also struggled with Horovod, since gradient-clipping operations weren't supported on Habana DL1 instances. This made it difficult for Horovod to slice and broadcast the gradients in our custom loss function.
- GradientTape, which our existing distributed optimizer relied on, wasn't supported either, so we had to discard it and optimize the loss using only operations supported on Habana.
- We had to shift from dedicated to on-demand Habana DL1 instances, since running them 24x7 would have been very costly. We used boto3 to exchange datasets, configurations, and metrics between our server instance and the Habana DL1 instances.
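The masking workaround above boils down to one trick, shown here in plain Python for illustration (our actual code used TensorFlow ops): instead of a boolean gather, which changes tensor shape and may not be supported on an accelerator, keep every position in place and zero out the masked entries with an elementwise multiply, which is a widely supported primitive.

```python
def masked_sum_gather(values, mask):
    """Gather-style masking: drops masked-out entries, then reduces.

    This mirrors ops like boolean_mask that were unsupported for us.
    """
    return sum(v for v, m in zip(values, mask) if m)

def masked_sum_multiply(values, mask):
    """Multiply-based masking: zero out masked entries, then reduce.

    Uses only elementwise multiply and a reduction, both of which are
    broadly supported primitives on accelerators.
    """
    return sum(v * int(m) for v, m in zip(values, mask))
```

Both produce the same reduction result, but the multiply form avoids the shape-changing gather entirely.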
Accomplishments that we're proud of
We are proud to have built:
- A general-purpose platform that lets users create, within minutes, model pipelines best suited to their machine learning needs; building the same pipelines by hand would take a good machine learning engineer days.
- An easy-to-use platform that significantly lowers the barrier to entry for creating and using machine learning pipelines, especially for users without a strong machine learning background, thus democratizing the power of the Gaudi accelerator.
- An adaptive and extensible platform that lets users train any model in a distributed fashion on Habana DL1 instances without worrying about cloud setup or migrating their models to supported ops.
- A robust and resilient system that can consume any valid user specification, and generate a stable and optimized model.
- A highly scalable containerized system that can be easily deployed and horizontally/vertically scaled.
What we learned
- How to migrate a traditional model to train on Habana DL1 instances.
- How to leverage Horovod to train a model across multiple Gaudi accelerators and multiple Habana DL1 machines.
- How to make the most of a minimal compute unit by optimizing OS and web server configurations.
- The intricacies of the Python interpreter, local context management, and variable scoping needed to dynamically generate resilient code blocks in real time.
What's next for SimpliSmart
- Extend support of AutoML and transfer learning for image data.
- Use Elastic Horovod to build a more robust Habana DL1 cluster to reduce cost and increase resilience.
- Integrate MLFlow with Habana to conveniently monitor the training of models.
- Extend support for PyTorch on our platform.
- Make the user experience intuitive for users at any level of machine learning expertise, so they can configure models to their requirements, while keeping the platform extensible enough for experienced users.
- Single-click auto-scaling deployment of models.
Built With
- amazon-web-services
- angular.js
- boto
- django
- habana
- horovod
- javascript
- python
