This repository contains the detailed results and the replication package for the paper "On the Effectiveness of Log Representation for Log-based Anomaly Detection".
(Figure: the overall framework of our experiments and our research questions.)
We organize this repository into the following folders:
- `models` contains our studied anomaly detection models, both traditional and deep learning models.
  - Traditional models (i.e., SVM, decision tree, logistic regression, random forest)
  - Deep learning models (i.e., MLP, CNN, LSTM)
- `logrep` contains the code we used to generate all the studied log representations.
  - Feature generation
  - Feature aggregation (from event-level to sequence-level; a minimal pooling sketch follows this list)
- `results` contains the experimental results that are not included in the paper due to space limits.
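For illustration, the event-to-sequence aggregation can be as simple as pooling the event-level vectors of a sequence into a single vector. Below is a minimal sketch of that idea; the function name and options here are ours, and the actual implementation lives in `aggregation.py` in the `logrep` folder.

```python
import numpy as np

def aggregate_to_sequence(event_reprs, method="mean"):
    """Pool event-level representations of shape (n_events, dim) into a
    single sequence-level vector. A hypothetical stand-in for aggregation.py."""
    event_reprs = np.asarray(event_reprs)
    if method == "mean":
        return event_reprs.mean(axis=0)  # average pooling over events
    if method == "sum":
        return event_reprs.sum(axis=0)   # additive pooling over events
    raise ValueError(f"unknown aggregation method: {method}")
```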
We recommend using an Anaconda environment with Python 3.9, and the following Python requirements should be met:
- NumPy 1.20.3
- PyTorch 1.10.1
- scikit-learn 0.24.2
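As a quick sanity check, the pinned versions can be verified inside the environment with a trivial snippet (not part of the replication package):

```python
import numpy, sklearn, torch

# Expected versions: numpy 1.20.3, torch 1.10.1, scikit-learn 0.24.2
for name, module in [("numpy", numpy), ("torch", torch), ("scikit-learn", sklearn)]:
    print(f"{name}: {module.__version__}")
```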
We use the HDFS, BGL, Spirit, and Thunderbird datasets. The original datasets can be obtained from the LogHub project. (We do not provide the generated log representations due to their large size; please generate them with the code we provide.)
Due to computational limitations, we used subsets of the Spirit and Thunderbird datasets in our experiments. These subsets are available on Zenodo.
We used Drain to parse the studied datasets, adopting the default parameters from the following paper:

Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R. Lyu. Drain: An Online Log Parsing Approach with Fixed Depth Tree, Proceedings of the 24th International Conference on Web Services (ICWS), 2017.

However, with the default settings, the Drain parser generated too many templates because it failed to spot some dynamic fields. We therefore passed the following regular expressions to the parser to reduce the number of templates.
For the BGL dataset:

The configuration used in our experiments:
```python
regex = [r'core\.\d+',
         r'(?<=r)\d{1,2}',
         r'(?<=fpr)\d{1,2}',
         r'(0x)?[0-9a-fA-F]{8}',
         r'(?<=\.\.)0[xX][0-9a-fA-F]+',
         r'(?<=\.\.)\d+(?!x)',
         r'\d+(?=:)',
         r'^\d+$',  # only numbers
         r'(?<=\=)\d+(?!x)',
         r'(?<=\=)0[xX][0-9a-fA-F]+',
         r'(?<=\ )[A-Z][\+|\-](?= |$)',
         r'(?<=:\ )[A-Z](?= |$)',
         r'(?<=\ [A-Z]\ )[A-Z](?= |$)'
         ]
```
We refined the regular expressions for more accurate parsing as follows:
```python
regex = [r'core\.\d+',
         r'(?<=:)(\ [A-Z][+-]?)+(?![a-z])',  # match X+ A C Y+ ...
         r'(?<=r)\d{1,2}',
         r'(?<=fpr)\d{1,2}',
         r'(0x)?[0-9a-fA-F]{8}',
         r'(?<=\.\.)0[xX][0-9a-fA-F]+',
         r'(?<=\.\.)\d+(?!x)',
         r'\d+(?=:)',
         r'^\d+$',  # only numbers
         r'(?<=\=)\d+(?!x)',
         r'(?<=\=)0[xX][0-9a-fA-F]+'  # for hexadecimal
         ]
```
For the Spirit dataset:

```python
regex = [r'^\d+$',  # only numbers
         r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}[^0-9]',  # IP address
         r'^([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})$',  # MAC address
         r'\d{14}(.)[0-9A-Z]{10,}',  # message id
         r'(?<=@#)(?<=#)\d+',  # message id, special format
         r'[0-9A-Z]{10,}',  # id
         r'(?<=:|=)(\d|\w+)(?=>|,| |$|\\)'  # parameter after : or =
         ]
```
For the Thunderbird dataset:

```python
regex = [r'(\d+\.){3}\d+',
         r'((a|b|c|d)n(\d){2,}\ ?)+',  # a|b|c|d + "n" + number
         r'\d{14}(.)[0-9A-Z]{10,}@tbird-#\d+#',  # message id
         r'(?![0-9]+\W)(?![a-zA-Z]+\W)(?<!_|\w)[0-9A-Za-z]{8,}(?!_)',  # chars + numbers
         r'(/|)([0-9]+\.){3}[0-9]+(:[0-9]+|)(:|)',  # IP address
         r'\d{8,}',  # numbers with 8+ digits
         r'(?<=:)(\d+)(?= )',  # parameter after :
         r'(?<=pid=)(\d+)(?= )',  # pid=XXXXX
         r'(?<=Lustre: )(\d+)(?=:)',  # Lustre:
         r'(?<=,)(\d+)(?=\))'
         ]
```
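For illustration, these expressions are passed to Drain through the `rex` argument of the LogPAI `logparser` implementation. The snippet below is a sketch, not the repository's exact script: the directories are hypothetical, the BGL header format and the `st`/`depth` values follow the LogHub benchmark settings, and the import path may differ across `logparser` versions.

```python
from logparser import Drain  # older logparser API; newer: from logparser.Drain import LogParser

# BGL header format as used in the LogHub benchmark (assumption)
log_format = '<Label> <Timestamp> <Date> <Node> <Time> <NodeRepeat> <Type> <Component> <Level> <Content>'

regex = [
    r'core\.\d+',
    r'(0x)?[0-9a-fA-F]{8}',
    # ... plus the remaining BGL expressions listed above
]

parser = Drain.LogParser(log_format,
                         indir='./logs/BGL/',     # hypothetical input directory
                         outdir='./parsed/BGL/',  # hypothetical output directory
                         depth=4, st=0.5,         # LogHub benchmark settings for BGL
                         rex=regex)
parser.parse('BGL.log')
```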
The general process to replicate our results is:
- Generate the structured parsed dataset in JSON format using loglizer with the Drain parser.
- Split the dataset into training and testing sets and save them in NPZ format with the keys `x_train`, `y_train`, `x_test`, and `y_test`.
- Generate the selected log representations with the corresponding code in the `logrep` folder, and save the representations in NPY or NPZ format.
- If the studied technique generates event-level representations, use `aggregation.py` in the `logrep` folder to merge them into sequence-level representations for the models that demand sequence-level input.
- Load the generated representations and the corresponding labels, and run the models in the `models` folder to get the results (a minimal sketch of this step follows the list).
- Sample parsed data and split data are provided in the `samples` folder.
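For the last step, loading a split NPZ file and running one of the studied traditional models with scikit-learn could look as follows; the file name is hypothetical, and the actual training scripts live in the `models` folder.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Hypothetical file produced by the splitting/representation steps above
data = np.load('bgl_representations.npz')
x_train, y_train = data['x_train'], data['y_train']
x_test, y_test = data['x_test'], data['y_test']

clf = LogisticRegression(max_iter=1000)  # one of the studied traditional models
clf.fit(x_train, y_train)
print('F1 score:', f1_score(y_test, clf.predict(x_test)))
```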
The structure of the CNN model:

| Layer | Parameters | Output |
|---|---|---|
| Input | win_size * Embedding_size | N/A |
| FC | Embedding_size * 50 | win_size * 50 |
| Conv 1 | kernel_size=[3, 50], stride=[1, 1], padding=valid, MaxPool2D: [win_size - 2, 1], LeakyReLU | 50 * 1 * 1 |
| Conv 2 | kernel_size=[4, 50], stride=[1, 1], padding=valid, MaxPool2D: [win_size - 3, 1], LeakyReLU | 50 * 1 * 1 |
| Conv 3 | kernel_size=[5, 50], stride=[1, 1], padding=valid, MaxPool2D: [win_size - 4, 1], LeakyReLU | 50 * 1 * 1 |
| Concat | Concatenate feature maps of Conv 1, Conv 2, and Conv 3; Dropout(0.5) | 150 * 1 * 1 |
| FC | [150 * 2] | 2 |
| Output | Softmax | 2 |
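A minimal PyTorch sketch consistent with the table above (the class name and layer wiring are ours, not the repository's; the pooling window is written generically as win_size - k + 1, the width of each convolution's feature map):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNDetector(nn.Module):
    """TextCNN-style detector following the layer table above (sketch)."""

    def __init__(self, win_size, embedding_size):
        super().__init__()
        self.fc_embed = nn.Linear(embedding_size, 50)  # FC: Embedding_size -> 50
        # Three parallel convolutions with kernel heights 3, 4, 5 over the window axis
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, 50, kernel_size=(k, 50)) for k in (3, 4, 5)]
        )
        self.dropout = nn.Dropout(0.5)
        self.fc_out = nn.Linear(150, 2)  # FC: 150 -> 2 classes

    def forward(self, x):                  # x: (batch, win_size, embedding_size)
        x = self.fc_embed(x).unsqueeze(1)  # -> (batch, 1, win_size, 50)
        pooled = []
        for conv in self.convs:
            h = F.leaky_relu(conv(x))      # -> (batch, 50, win_size - k + 1, 1)
            h = F.max_pool2d(h, kernel_size=(h.size(2), 1))  # -> (batch, 50, 1, 1)
            pooled.append(h.flatten(1))    # -> (batch, 50)
        z = self.dropout(torch.cat(pooled, dim=1))  # Concat -> (batch, 150)
        return F.softmax(self.fc_out(z), dim=1)     # class probabilities
```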
The structure of the LSTM model:

| Layer | Parameters | Output |
|---|---|---|
| Input | [win_size * Embedding_size] | N/A |
| LSTM | Hidden_dim = 8 | win_size * 8 |
| FC | [8 * 2] | 2 |
| Output | Softmax | 2 |
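Likewise, a minimal PyTorch sketch of the LSTM model in the table (the class name is ours; the final step's hidden state feeds the 8 * 2 fully connected layer):

```python
import torch.nn as nn
import torch.nn.functional as F

class LSTMDetector(nn.Module):
    """LSTM detector following the layer table above (sketch)."""

    def __init__(self, embedding_size, hidden_dim=8):
        super().__init__()
        self.lstm = nn.LSTM(embedding_size, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 2)  # FC: 8 -> 2 classes

    def forward(self, x):                # x: (batch, win_size, embedding_size)
        out, _ = self.lstm(x)            # -> (batch, win_size, 8)
        logits = self.fc(out[:, -1, :])  # last step's hidden state -> 2 classes
        return F.softmax(logits, dim=1)
```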
Our implementation is based on, or contains references to, the following repositories:
Please cite our work if you find it helpful to your research.
Wu, X., Li, H. & Khomh, F. On the effectiveness of log representation for log-based anomaly detection. Empir Software Eng 28, 137 (2023). https://doi.org/10.1007/s10664-023-10364-1
```bibtex
@article{wu2023effectiveness,
  author  = {Wu, Xingfang and Li, Heng and Khomh, Foutse},
  title   = {On the effectiveness of log representation for log-based anomaly detection},
  journal = {Empirical Software Engineering},
  volume  = {28},
  number  = {137},
  year    = {2023},
  month   = {10},
  doi     = {10.1007/s10664-023-10364-1}
}
```