The critical event any IT service provider cares about is a service outage (e.g., an HTTP 5xx response). But there are just too many variables to watch: uptime, Java GC counts/time, CPU, I/O utilization, swap usage, and so on.
Whenever you try to detect a service outage before it happens, you get buried under those countless variables.
One day, I noticed that Apache Cassandra uses a very simple metric to detect a node failure, based on the phi accrual failure detector theory.
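For reference, the idea behind Cassandra's detector can be sketched as follows. This is a minimal illustration, not Cassandra's or Cloud Sonar's actual code, and the class and method names are made up for the example; it assumes an exponential distribution of the observed intervals, as Cassandra's simplified implementation does:

```python
import math

class PhiAccrualDetector:
    """Minimal sketch of the phi accrual failure detector idea used by
    Cassandra (Hayashibara et al.). Illustrative only -- not Cassandra's
    or Cloud Sonar's actual code. Assumes the observed intervals follow
    an exponential distribution, as Cassandra's simplified version does."""

    def __init__(self, window=100):
        self.samples = []   # recent observations (intervals or response times)
        self.window = window

    def record(self, observation):
        """Add one observation, keeping only the most recent `window`."""
        self.samples.append(observation)
        if len(self.samples) > self.window:
            self.samples.pop(0)

    def phi(self, elapsed):
        """phi = -log10(P(observation > elapsed)). Under the exponential
        model P(x > t) = exp(-t / mean), this reduces to
        elapsed / (mean * ln(10))."""
        mean = sum(self.samples) / len(self.samples)
        return elapsed / (mean * math.log(10))
```

With this definition, phi = 1 means roughly a 10% likelihood that the current observation is normal given the history, phi = 2 means 1%, and so on, with each additional point of phi being another factor of 10.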
This inspired me to try HTM (Hierarchical Temporal Memory) on simple PING response times to predict the kind of high load that could lead to errors.
What it does
It periodically polls a given set of nodes with PING, records the response times, and feeds them to PhiFailureDetector and HTMAnomalyDetector to detect important events to watch out for.
Please refer to the Concept page of my project for details.
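The polling loop described above can be sketched roughly like this. This is an illustrative sketch, not the actual Cloud Sonar code; the `ping_fn` callback and the detector interfaces (`record`/`compute`) are assumptions made for the example:

```python
import math
import time

def monitor(nodes, ping_fn, phi_detectors, htm_detectors,
            interval_s=1.0, iterations=None):
    """Illustrative main loop (not the actual Cloud Sonar code).
    `ping_fn(node)` is assumed to return the PING round-trip time in
    microseconds, or None on failure; phi_detectors[node].record() and
    htm_detectors[node].compute() are assumed detector interfaces."""
    done = 0
    while iterations is None or done < iterations:
        for node in nodes:
            rt_us = ping_fn(node)
            if rt_us is not None:
                # The raw response time goes to the phi failure detector.
                phi_detectors[node].record(rt_us)
                # HTM gets log10 of the response time in microseconds.
                htm_detectors[node].compute(math.log10(rt_us))
        done += 1
        if iterations is None or done < iterations:
            time.sleep(interval_s)
```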
The case study I investigated was done on a 6-node Cloudian HyperStore (S3 object storage) cluster. You can simply think of it as a distributed file system with a web interface. All HTTP requests were routed to cloudian-node1 and then distributed among the 6 storage services.
The traffic pattern I used was constant high traffic for this cluster, at a level where adding any more requests would produce errors.
To overload it, a temporary random traffic surge was added every hour. Then, as expected, 6 errors were observed as follows.
Now, the first question is whether this tool could have provided useful information, based on the HTM anomaly score, before those errors happened.
The following are examples for node1, measured from node3.
The HTM inputs, log10 of the response time in microseconds, looked like this.
There seems to be no direct indication of the errors here. For example, around 6:50 there were four 503 errors, but no particular spike at that point. There could be a hidden pattern, but maybe not. Let's see.
And this is the distribution; 2.5 corresponds to roughly 300 microseconds (10^2.5 ≈ 316).
PhiFailureDetector shows server health. When the server is fine, the value stays below 1.0; the lower, the better. You can set a horizontal line (a threshold) to identify a bad node.
Here, as you can see, there are many values above 1.0. Some spikes even go beyond 10, i.e., a likelihood on the order of 10^-10 that the observation is normal. This indicates that the server was under high load.
A phi value is a good indication of server health, but simply lowering the threshold doesn't really help. For example, with a threshold of 2.0, the first alarm would have been raised around 22:10, followed by many more alerts. You might try 5.0 to avoid too many alarms, but you still have no idea whether the situation is actually OK.
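The trouble with a fixed horizontal threshold can be made concrete with a small sketch (illustrative only, not part of Cloud Sonar): a rising-edge alarm over a phi series fires often at 2.0 and less often at 5.0, but neither number by itself tells you whether the node is actually in trouble.

```python
def alarms(phi_series, threshold):
    """Return the timestamps where phi first rises above the threshold
    (rising edges only). Illustrative sketch, not part of Cloud Sonar.
    `phi_series` is a list of (timestamp, phi) pairs."""
    fired, above = [], False
    for t, phi in phi_series:
        if phi >= threshold and not above:
            fired.append(t)
        above = phi >= threshold
    return fired

# A synthetic series: a low threshold fires on every bump, a high one
# fires rarely -- but neither maps cleanly onto "an error is coming".
series = [(0, 0.5), (1, 2.5), (2, 3.0), (3, 0.4), (4, 6.0), (5, 6.5)]
```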
HTMAnomalyDetector gives you a vertical line instead of a horizontal one (a threshold): it marks an anomaly, the moment when an unexpected pattern is observed relative to HTM's prediction (the maximum over the next 30 seconds).
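The "max in the next 30 seconds" aggregation mentioned above can be sketched as follows. This is illustrative only; the actual aggregation and the HTM prediction itself live inside HTMAnomalyDetector and HTM.java:

```python
def window_max(samples, window_s=30):
    """Aggregate (timestamp_seconds, value) samples into the maximum per
    fixed window -- an illustrative sketch of the "max in the next 30
    seconds" value compared against HTM's prediction. Not the actual
    Cloud Sonar code."""
    buckets = {}
    for t, v in samples:
        buckets.setdefault(int(t // window_s), []).append(v)
    return [(k * window_s, max(vs)) for k, vs in sorted(buckets.items())]
```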
As you can see, HTM raised an anomaly whenever the pattern changed.
Cloud Sonar is not an oracle, so it cannot tell you when an error will happen. But it has been confirmed that Cloud Sonar raises an anomaly whenever the pattern changes.
If you look back at the pattern from when your server got overloaded, you'll find the first point where it rose above the usual pattern. That is the earliest possible point at which you could have been notified.
So the best way to deal with an unexpected event is to learn about the change as soon as possible, and that is what Cloud Sonar provides.
How I built it
Built with HTM.java, using swarming to find optimal configurations.
Challenges I ran into
HTM parameter tuning.
=> Let swarming find the parameters.
==> It takes a long time...
HTM input scale.
Accomplishments that I'm proud of
A very simple method to monitor server health and detect anomalies.
What I learned
I had been trying to collect more and more features. But a single well-chosen value contains a great deal of information. The hardest part is finding that value and extracting the valuable information from it.
What's next for CloudSonar
- more patterns, like daily/weekly/monthly/holiday/seasonal
==> the results will be presented at the Data Tech conference in December in Tokyo
- serialization (contribution to htm.java)
- a Flume/Fluentd appender to upload CSV files to S3 storage for analytics