Introduction
Our proposed project will apply deep learning methods (CNNs, RNNs, Transformer blocks) to system logs (human-readable print statements) to predict failures in critical systems before they happen. We plan to use an open-source collection of logs that record important runtime information for system troubleshooting and behavior understanding; the sheer volume of these logs suggests that deep learning-based solutions may be well suited to detecting anomalies in them. As part of our initial research, we discovered the survey paper "Deep Learning-based System Log Analysis for Anomaly Detection" (https://arxiv.org/abs/2107.05908?context=cs.LG) and its accompanying implementation (https://github.com/logpai/deep-loglizer), which we will explore further as we commence our project. Ultimately, we plan to build a general-purpose framework capable of handling different log types (i.e., one that is not domain specific) in order to predict production issues and anomalies, and to extend the binary classification employed in the paper to multi-class classification. We plan for our model to feed a dashboard that uses graphics (maps, graphs, matrices, etc.) to present what has been discovered about the logs in a user-friendly manner. As a stretch goal, we also aspire to explore integrating NLP techniques into this model.
Challenges
The hardest challenge in the project so far has been data acquisition and preprocessing. Our initial plans included acquiring a dataset from Intuitive Surgical robots, but that timeline no longer appears workable given the amount of legal work required to grant us the rights to use such data. Although public log datasets exist and are available for use, they do not contain large amounts of data, and in particular not much variety, which is a problem because one of our goals is a general-purpose framework for anomaly detection across many different log types. The lack of large and varied datasets also means that existing models converge and overfit in under 10 epochs; deep learning is advantageous precisely when large quantities of data are available. Another challenge concerns NLP: language must be vectorized through embedding layers based on an existing vocabulary, yet log data contains many unexpected tokens such as memory addresses and code snippets. It is almost impossible to find a vocabulary that covers every token we might encounter in a log trace from an arbitrary system.
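One common way to bound the vocabulary problem described above is to mask volatile tokens with fixed placeholders before any embedding layer sees the text. The patterns and placeholder names below are a minimal illustrative sketch, not the project's actual preprocessing rules:

```python
import re

# Hypothetical masking rules: each pattern replaces a class of volatile tokens
# (addresses, paths, numeric IDs) with a single placeholder, so distinct raw
# lines collapse onto a shared template with a small, closed vocabulary.
MASKS = [
    (re.compile(r"0x[0-9a-fA-F]+"), "<ADDR>"),        # memory addresses
    (re.compile(r"(/[\w.\-]+)+"), "<PATH>"),          # file-system paths
    (re.compile(r"\b\d+\.\d+\.\d+\.\d+\b"), "<IP>"),  # IPv4 addresses
    (re.compile(r"\b\d+\b"), "<NUM>"),                # bare numbers / IDs
]

def mask_log_line(line: str) -> str:
    """Replace variable tokens so distinct raw lines map to one template."""
    for pattern, placeholder in MASKS:
        line = pattern.sub(placeholder, line)
    return line

# Two lines differing only in an address now yield the same template:
# mask_log_line("segfault at 0x7f3a in /usr/bin/app pid 4231")
#   -> "segfault at <ADDR> in <PATH> pid <NUM>"
```

Ordering matters: the address pattern must run before the bare-number pattern, otherwise the hex prefix would be split apart.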
Insights
Since the Deep-Loglizer DataLoader is written in PyTorch, the team reimplemented the data preprocessing and data pipelines in TensorFlow. The team implemented a Generative Adversarial Network with an LSTM for log analysis, which is currently running on Google Colab using the HDFS and BGL datasets provided by LogHub. The HDFS and BGL datasets contain only binary labels (Normal/Abnormal); since the goal of this project is to generalize the framework to handle multi-class labels, the team decided to use RAS (Reliability, Availability, and Serviceability) logs from Intrepid Blue Gene, a high-end computing system at Argonne National Laboratory, provided by Zheng et al. [1]. The dataset contains 15 attributes per log entry, such as Processor Information, Node Information, Block Number, Physical Location, Error Code, Flags, Component, and Message, with a time span of six months. The decompressed RAS log has several encoding issues, so the team had to write dedicated parser scripts to convert the log into a processable form.
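A sketch of the kind of encoding-tolerant reader such a parser script needs, assuming line-oriented entries with whitespace-separated leading fields and a free-text message at the end. The field names and column layout here are hypothetical, not the RAS log's actual schema:

```python
# Illustrative field layout; the real RAS attributes (Error Code, Component,
# Physical Location, etc.) would be mapped the same way once the true column
# order is known.
FIELDS = ["record_id", "event_time", "component", "error_code", "location"]

def parse_ras_line(raw: bytes) -> dict:
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising,
    # so one malformed entry cannot halt a pass over the whole log.
    text = raw.decode("utf-8", errors="replace").strip()
    parts = text.split(None, len(FIELDS))  # maxsplit keeps the message intact
    record = dict(zip(FIELDS, parts))
    record["message"] = parts[len(FIELDS)] if len(parts) > len(FIELDS) else ""
    return record
```

Capping `split` at `len(FIELDS)` is the key detail: the trailing message can itself contain whitespace, so it must be kept as a single field rather than fragmented.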
Plan
The next step in our plan of action is to explore further datasets and finalize one. We also plan to implement a GAN to generate additional structured data based on the existing dataset and to try training our model on the generated data. Once the data is finalized, we will devote more time to refining our model so that it can outperform the existing one. We also plan to experiment with different types of NLP models, such as RNNs, autoencoders, and language CNNs, to find the best fit for our data. Another key step is to ensure that our model performs well at multi-class classification of the logs, so that we can be more precise in predicting the cause of the anomalies in our log data. Finally, we plan to integrate all of this into a general-purpose framework and, if time permits, to connect our model to a graphical dashboard that displays the various metrics it discovers from the log data.
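The sequence models considered above (RNNs and related forecasting models, as in the cited survey) typically consume fixed-length windows of log-event IDs and predict the event (or, in the multi-class case, the label) that follows. A minimal sketch of that windowing step, with illustrative event IDs and window size:

```python
def make_windows(event_ids, h):
    """Return (history, next_event) pairs for next-event prediction.

    event_ids: the sequence of parsed log-event IDs for one session.
    h: window size, i.e. how many past events the model sees at once.
    """
    pairs = []
    for i in range(len(event_ids) - h):
        pairs.append((event_ids[i:i + h], event_ids[i + h]))
    return pairs

# make_windows([1, 2, 3, 4, 5], 2)
#   -> [([1, 2], 3), ([2, 3], 4), ([3, 4], 5)]
```

At inference time, a forecasting model's top-k predictions for the next event are compared against the event that actually occurred; a miss flags the window as anomalous. For the multi-class goal, the second element of each pair could instead be an anomaly-category label.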
References
Z. Zheng, L. Yu, W. Tang, Z. Lan, R. Gupta, N. Desai, S. Coghlan, and D. Buettner, "Co-Analysis of RAS Log and Job Log on Blue Gene/P," in Proc. of the IEEE International Parallel & Distributed Processing Symposium (IPDPS'11), Anchorage, AK, USA, 2011.