When I was working at the ABS, we wanted to deploy an ML model, but we were not allowed to use the public cloud (AWS, GCP, etc.) due to privacy concerns. Additionally, investing in infrastructure for a single use case made zero business sense.
What it does
Our platform solves the issues above, along with one other important problem: lowering the barrier to entry for ML. We accept custom machine learning models, but we can also train a model from a single data upload.
This means ANYONE is capable of using and deploying ML models.
We have implemented Excel extensions so that the barrier to entry is even lower. Simply click and use.
We automatically infer the best model to use, along with the preprocessing that needs to be applied.
How we built it
We use existing solutions for automatic training, largely because this is a theoretically and engineering-heavy problem in its own right. We used Ludwig to infer the preprocessing that needs to be done.
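To give a feel for what Ludwig handles for us, here is a simplified sketch of automatic feature-type inference from raw column values. This is illustrative only; the function name, thresholds, and logic are our own simplification, not Ludwig's actual implementation.

```python
# Simplified sketch of automatic feature-type inference, in the spirit of
# what Ludwig does for us (illustrative only; not Ludwig's actual logic).
def infer_feature_type(values, category_threshold=10):
    """Guess a feature type from a column of raw string values."""
    non_null = [v for v in values if v not in ("", None)]
    if not non_null:
        return "text"
    if all(v.lower() in ("true", "false", "0", "1") for v in non_null):
        return "binary"
    try:
        [float(v) for v in non_null]
        return "numerical"
    except ValueError:
        pass
    # Few distinct values -> treat as a category; otherwise free text.
    if len(set(non_null)) <= category_threshold:
        return "category"
    return "text"

# Hypothetical columns from a single data upload.
columns = {
    "age": ["34", "28", "51"],
    "churned": ["true", "false", "true"],
    "plan": ["basic", "pro", "basic"],
}
config = {name: infer_feature_type(vals) for name, vals in columns.items()}
print(config)  # → {'age': 'numerical', 'churned': 'binary', 'plan': 'category'}
```

An inferred mapping like this is what lets a non-technical user go from a spreadsheet upload to a trained model without hand-writing a configuration.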
Scalr provides a working, easy, and quickly deployable PaaS for on-premises or cloud use.
Our architecture can be split into three logical components:
- User Interface, providing an easy way to monitor and deploy machine learning workloads
- Control Layer, enabling dynamic routing, load balancing and metrics
- Execution Layer, processing incoming requests, training and applying machine learning models
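The Control Layer's routing job can be sketched as a small round-robin router that skips unhealthy Execution Layer nodes. Node names and the routing policy here are illustrative assumptions, not our actual Golang controller code:

```python
from itertools import cycle

# Toy sketch of the Control Layer's job: route incoming requests across
# healthy Execution Layer nodes (names and policy are illustrative).
class Router:
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(nodes)
        self._rr = cycle(self.nodes)

    def mark_down(self, node):
        # Health checks would call this when a node stops responding.
        self.healthy.discard(node)

    def route(self):
        # Round-robin over all nodes, skipping any marked unhealthy.
        for _ in range(len(self.nodes)):
            node = next(self._rr)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy execution nodes")

router = Router(["exec-1", "exec-2", "exec-3"])
router.mark_down("exec-2")
picked = [router.route() for _ in range(4)]
print(picked)  # → ['exec-1', 'exec-3', 'exec-1', 'exec-3']
```

Keeping this logic in one layer is what lets the UI and the Python workers stay simple: neither needs to know which nodes are currently alive.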
It was important that Scalr kept developer complexity to a minimum, and central to that goal is reliability. To ensure that machine learning workloads remain accessible in spite of arbitrary failures, we implemented a peer-to-peer gossip model for replicating machine learning models across nodes. This greatly increases uptime and throughput capacity.
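The replication idea can be sketched as an anti-entropy gossip loop: each round, every node exchanges its model inventory with one random peer, and newer versions win. This is a simplified simulation under assumed names, not our production code:

```python
import random

# Minimal sketch of gossip-based model replication (hypothetical simulation;
# node and model names are illustrative, not Scalr's actual code).
class Node:
    def __init__(self, name):
        self.name = name
        self.models = {}  # model_name -> version number held locally

    def gossip_with(self, peer):
        # Anti-entropy exchange: each side pulls any model the other
        # holds at a newer version (or that it lacks entirely).
        for src, dst in ((self.models, peer.models),
                         (peer.models, self.models)):
            for model, version in src.items():
                if dst.get(model, -1) < version:
                    dst[model] = version

def gossip_round(nodes):
    # Each node contacts one random peer per round.
    for node in nodes:
        peer = random.choice([n for n in nodes if n is not node])
        node.gossip_with(peer)

nodes = [Node(f"node-{i}") for i in range(8)]
nodes[0].models["churn-model"] = 3     # a new model lands on one node

for _ in range(6):                     # a few rounds spread it cluster-wide
    gossip_round(nodes)

replicated = sum("churn-model" in n.models for n in nodes)
print(f"{replicated}/8 nodes hold the model")
```

Because the spread is epidemic, the model typically reaches every node within a handful of rounds, so any single node failure leaves plenty of replicas to serve requests.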
- User interface: React, Bootstrap
- Controller API: Golang, Docker
- Execution: Python, Sanic, Ludwig, Docker
Challenges we ran into
Creating reliable software is hard. Creating reliable distributed software that runs on many machines is very hard. In scaling our network from a single server to nearly a dozen nodes for testing, we encountered plenty of headaches. Fortunately, through this trial by fire, we had a chance to smooth out the rough edges.
Accomplishments that we're proud of
Building something from start to finish.
What we learned
A whole lot more about CORS than we wanted to know. How to create a bespoke distributed machine learning system. The value of shopping your idea to as many mentors as you can, and the non-technical considerations that should be front of mind.
What's next for Scalr
Scale! Build out monitoring, reporting, and metrics so that Scalr becomes the solution that ticks all the boxes. Expand our UI for non-technical users and double down on user experience, with established SLAs for response time and deployment uptime.