Supplementary materials for paper "On the Effectiveness of Log Representation for Log-based Anomaly Detection"


Log Representation - Supplemental Materials

This repository contains the detailed results and the replication package for the paper "On the Effectiveness of Log Representation for Log-based Anomaly Detection".

Introduction

The overall framework of our experiments and our research questions:

Framework

We organize this repository into the following folders:

  1. 'models' contains the studied anomaly detection models, both traditional and deep-learning ones.
    • Traditional models (i.e., SVM, decision tree, logistic regression, random forest)
    • Deep-learning models (i.e., MLP, CNN, LSTM)
  2. 'logrep' contains the code we used to generate all the studied log representations.
    • Feature generation
    • Feature aggregation (from event-level to sequence-level)
  3. 'results' contains the experimental results that could not be included in the paper due to space limits.

Dependencies

We recommend using an Anaconda environment with Python 3.9. The following Python requirements should be met:

  • NumPy 1.20.3
  • PyTorch 1.10.1
  • scikit-learn 0.24.2
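One way to set up such an environment is sketched below. This is an assumption-laden convenience snippet, not the repository's documented procedure: the conda workflow and the pip package names (`torch` is the pip name for PyTorch; `scikit-learn` provides the `sklearn` module) are our own choices for the versions listed above.

```shell
# Create and activate a conda environment with the recommended Python version
conda create -n logrep python=3.9 -y
conda activate logrep

# Install the pinned dependencies (pip package names)
pip install numpy==1.20.3 torch==1.10.1 scikit-learn==0.24.2
```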

Dataset

Source

We use the HDFS, BGL, Spirit, and Thunderbird datasets. The original datasets can be obtained from the LogHub project. (We do not provide the generated log representations because of their large size; please generate them with the code we provide.)

Due to computational limitations, we used subsets of the Spirit and Thunderbird datasets in our experiments. These subsets are available on Zenodo.

Extra regular expressions passed to the Drain parser

We used Drain to parse the studied datasets. We adopted the default parameters from the following paper for parsing.

Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R. Lyu. Drain: An Online Log Parsing Approach with Fixed Depth Tree, Proceedings of the 24th International Conference on Web Services (ICWS), 2017.

However, with the default settings the Drain parser generated too many templates because it failed to spot some dynamic fields. We passed the following regular expressions to reduce the number of templates.

For BGL dataset:

The configuration used in our experiments:

regex      = [r'core\.\d+',
              r'(?<=r)\d{1,2}',
              r'(?<=fpr)\d{1,2}',
              r'(0x)?[0-9a-fA-F]{8}',
              r'(?<=\.\.)0[xX][0-9a-fA-F]+',
              r'(?<=\.\.)\d+(?!x)',
              r'\d+(?=:)',
              r'^\d+$',  #only numbers
              r'(?<=\=)\d+(?!x)',
              r'(?<=\=)0[xX][0-9a-fA-F]+',
              r'(?<=\ )[A-Z][\+|\-](?= |$)',
              r'(?<=:\ )[A-Z](?= |$)',
              r'(?<=\ [A-Z]\ )[A-Z](?= |$)'
              ]

We refined the regular expressions for more accurate parsing as follows:

regex      = [r'core\.\d+',
              r'(?<=:)(\ [A-Z][+-]?)+(?![a-z])', # match X+ A C Y+......
              r'(?<=r)\d{1,2}',
              r'(?<=fpr)\d{1,2}',
              r'(0x)?[0-9a-fA-F]{8}',
              r'(?<=\.\.)0[xX][0-9a-fA-F]+',
              r'(?<=\.\.)\d+(?!x)',
              r'\d+(?=:)',
              r'^\d+$',  #only numbers
              r'(?<=\=)\d+(?!x)',
              r'(?<=\=)0[xX][0-9a-fA-F]+'  # for hexadecimal
              ]

For Spirit dataset:

regex      = [r'^\d+$',  #only numbers
              r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}[^0-9]',   # IP address
              r'^([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})$',   # MAC address
              r'\d{14}(.)[0-9A-Z]{10,}',   # message id
              r'(?<=@#)(?<=#)\d+',   #  message id special format
              r'[0-9A-Z]{10,}', # id
              r'(?<=:|=)(\d|\w+)(?=>|,| |$|\\)'   # parameter after:|=
             ]

For Thunderbird dataset:

regex      = [
             r'(\d+\.){3}\d+',
             r'((a|b|c|d)n(\d){2,}\ ?)+', # a|b|c|dn+number
             r'\d{14}(.)[0-9A-Z]{10,}@tbird-#\d+#', # message id
             r'(?![0-9]+\W)(?![a-zA-Z]+\W)(?<!_|\w)[0-9A-Za-z]{8,}(?!_)',      # char+numbers,
             r'(/|)([0-9]+\.){3}[0-9]+(:[0-9]+|)(:|)', # ip address
             r'\d{8,}',   # numbers + 8
             r'(?<=:)(\d+)(?= )',    # parameter after :
             r'(?<=pid=)(\d+)(?= )',   # pid=XXXXX
             r'(?<=Lustre: )(\d+)(?=:)', # Lustre:
             r'(?<=,)(\d+)(?=\))'
             ]
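To illustrate how such a regex list works, here is a minimal, self-contained sketch of the preprocessing step: each match is replaced with a wildcard token before template mining, so dynamic fields do not inflate the template count. The mask token `<*>`, the two sample patterns, and the example lines are illustrative assumptions; in the LogPAI logparser implementation of Drain, lists like the ones above are typically supplied via the parser's `rex` argument.

```python
import re

# Two patterns in the style of the BGL list above (illustrative subset):
bgl_style_regexes = [
    r'(0x)?[0-9a-fA-F]{8}',   # 8-character (optionally 0x-prefixed) hex values
    r'\d+(?=:)',              # numbers immediately followed by a colon
]

def mask_dynamic_fields(line, regexes, token='<*>'):
    """Replace every match of every regex with a wildcard token."""
    for pattern in regexes:
        line = re.sub(pattern, token, line)
    return line

print(mask_dynamic_fields('data parity error at deadbeef', bgl_style_regexes))
# -> data parity error at <*>
```

With the dynamic fields masked, two raw lines that differ only in such values map to the same template, which is exactly what keeps the template count down.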

Experiments

The general process to replicate our results is:

  1. Generate the structured, parsed dataset in JSON format using loglizer with the Drain parser.
  2. Split the dataset into training and testing sets and save them in NPZ format with the keys x_train, y_train, x_test, and y_test.
  3. Generate the selected log representations with the corresponding code in the logrep folder and save them in NPY or NPZ format.
  4. If a studied technique generates event-level representations, use aggregation.py in the logrep folder to merge them into sequence-level representations for the models that require sequence-level input.
  5. Load the generated representations and the corresponding labels, and run the models in the models folder to obtain the results.
  • Sample parsed data and split data are provided in the samples folder.
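Step 2 can be sketched as follows. This is a minimal illustration, not the repository's exact code: the toy data, the file name, and the 80/20 split ratio are assumptions; only the NPZ format and the four key names come from the steps above.

```python
import os
import tempfile

import numpy as np

def split_and_save(sequences, labels, path, train_ratio=0.8):
    """Split (sequences, labels) into train/test and save as a single NPZ."""
    n_train = int(len(sequences) * train_ratio)
    np.savez(
        path,
        x_train=sequences[:n_train], y_train=labels[:n_train],
        x_test=sequences[n_train:], y_test=labels[n_train:],
    )

# Toy data: 10 fixed-length event-ID sequences with binary anomaly labels.
x = np.arange(30).reshape(10, 3)
y = np.array([0, 0, 1, 0, 0, 1, 0, 0, 0, 1])

path = os.path.join(tempfile.gettempdir(), 'hdfs_split.npz')
split_and_save(x, y, path)

data = np.load(path)
print(sorted(data.files))  # ['x_test', 'x_train', 'y_test', 'y_train']
```

Downstream code can then load the four arrays by key, which matches the x_train/y_train/x_test/y_test layout the models expect.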

Network details for CNN and LSTM

CNN

| Layer  | Parameters | Output |
| ------ | ---------- | ------ |
| Input  | win_size × Embedding_size | N/A |
| FC     | Embedding_size × 50 | win_size × 50 |
| Conv 1 | kernel_size=[3, 50], stride=[1, 1], padding=valid, MaxPool2D=[win_size - 2, 1], LeakyReLU | 50 × 1 × 1 |
| Conv 2 | kernel_size=[4, 50], stride=[1, 1], padding=valid, MaxPool2D=[win_size - 3, 1], LeakyReLU | 50 × 1 × 1 |
| Conv 3 | kernel_size=[5, 50], stride=[1, 1], padding=valid, MaxPool2D=[win_size - 4, 1], LeakyReLU | 50 × 1 × 1 |
| Concat | concatenate the feature maps of Conv 1, Conv 2, Conv 3; Dropout(0.5) | 150 × 1 × 1 |
| FC     | 150 × 2 | 2 |
| Output | Softmax | N/A |
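The branch shapes can be sanity-checked with a little arithmetic. The sketch below assumes an illustrative window size of 20 and that each branch's pooling kernel is win_size - (kernel height - 1), so a 'valid' convolution followed by the max-pool collapses the window axis to height 1 in every branch.

```python
def branch_height(win_size, kernel_h, pool_h):
    """Output height of one conv branch: 'valid' conv, then max-pool (stride 1)."""
    conv_h = win_size - kernel_h + 1   # 'valid' convolution output height
    return conv_h - pool_h + 1         # max-pool output height

win_size = 20  # illustrative window size (assumption)
heights = [
    branch_height(win_size, 3, win_size - 2),  # Conv 1 branch
    branch_height(win_size, 4, win_size - 3),  # Conv 2 branch
    branch_height(win_size, 5, win_size - 4),  # Conv 3 branch
]
print(heights)                 # each branch collapses to height 1
channels = 50 * len(heights)   # concatenated feature maps: 150
```

Concatenating the three 50-channel maps yields the 150 × 1 × 1 tensor fed to the final fully connected layer.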

LSTM

| Layer  | Parameters | Output |
| ------ | ---------- | ------ |
| Input  | win_size × Embedding_size | N/A |
| LSTM   | hidden_dim = 8 | Embedding_size × 8 |
| FC     | 8 × 2 | 2 |
| Output | Softmax | N/A |

Acknowledgements

Our implementation is based on, or contains references to, the following repositories:

Citing & Contacts

Please cite our work if you find it helpful to your research.

Wu, X., Li, H. & Khomh, F. On the effectiveness of log representation for log-based anomaly detection. Empir Software Eng 28, 137 (2023). https://doi.org/10.1007/s10664-023-10364-1

@article{wu2023effectiveness,
  author  = {Wu, X. and Li, H. and Khomh, F.},
  title   = {On the effectiveness of log representation for log-based anomaly detection},
  journal = {Empirical Software Engineering},
  year    = {2023},
  month   = {10},
  volume  = {28},
  pages   = {137},
  doi     = {10.1007/s10664-023-10364-1}
}
