The primary goal of the Anomaly Detection project is to implement Parzen window estimation and Gaussian mixture model for anomaly detection. The secondary goal is to train these models on train dataset (only normal samples), set hyper-parametres on validation dataset (only normal samples) by maximizing likelihood of the model (proxy-parametr) and calculate roc-AUC on test data. The tertiary goal is to compare models with robust statistical test and decide which model is better. The project was implemented in Julia language and its key points are as follows:
- Parzen window estimation.
- Gaussian mixture model.
- Expectation–maximization algorithm
- Wilcoxon signed-rank test.
The repository is organised as follow In the folder src are implemented Gaussian mixture model, Parzen window estimation, EM algorithm, Wilcoxon signed-rank test and evaluation report (including roc-AUC). In the folder data_anomalyproject are stored datasets, in the folder auc_statistics are stored CSV files with roc-AUC for models (each CSV for one dataset, each line of CSV corresponds to one evaluation of learning and testing the model and each column of CSV corresponds to one model). In the file main.jl are functions to learn and compare models. In the folder examples are codes to evaluate models on random generated dataset with visualisation (human eye-check such as charts bellow).
The Parzen window estimation is defined as
where h is a window size hyper-parametr, N is count of train samples and k is a kernel function defined as follow
The Gaussian mixture model is defined as
where K is a number of components hyper-parametr, are component weights, are component means and are component covariance matrices. Gaussian component is defined as follow
And the sum of component weights is equal one.
Expectation maximization (EM) algorithm is a numerical technique for maximum likelihood estimation by iterative updating the model parameters. The first step, known as the expectation step or E step, consists of calculating the expectation of the component assignments for each data point given the model parameters. The second step the maximization (M) step, consists of maximizing the expectations calculated in the E step with respect to the model parameters. M step consists of updating the parametres.
Initialization Step:
- Randomly assign samples without replacement from the dataset to the component mean estimates.
- Set all component covariance estimates to the sample covariance.
- Set all component distribution prior estimates to the uniform distribution 1/K
Expectation (E) Step:
Maximization (M) Step:
If the determinant of the covariance matrix is less than 10e-10, then an Identity matrix multiplied by 10e-5 is added to the covariance matrix. This prevents the zero determinant of the covariance matrix.
The Wilcoxon signed-rank test is used to decide if difference between pair follows a symmetric distribution around zero. It is used as a robust statistical test, because unlike Student's t-test, the Wilcoxon signed-rank test does not assume that the differences between paired samples are normally distributed. On a large dataset it has greater statistical power than Student's t-test and is more likely to produce a statistically significant result. Steps are follow:
- For each pair of values compute its difference and discard zero differences.
- Sort the differences in ascending order and assign them a rank.
- Compute test statistic as
- where are ranks.
- The z-score is defined as
where
Hypothesis are:
- Null hypothesis: the differences of observations are symmetric about zero mean.
- Alternative hypothesis (two-sided): the differences of observations are not symmetric about zero mean.
The metric to compare models by Wilcoxon test was roc-AUC.
The evaluation of the model took place on each dataset as follows. Normal samples were randomly shuffled. Half of the normal samples were used as a training set, 25% of normal samples were used as a validation set, and 25% of normal samples and all anomalous samples were used as a test set. Both models (GMM and PWE) were trained on training set. The hyperparameters (number of components for GMM and window-size for PWE) were set by maximizing likelihood of the model on validation dataset. GMM maximized likelihood for the number of components 2 to 10, with the EM algorithm always was performed with 60 iterations. PWE maximized the likelihood for window sizes from 0.01 to 10 with 0.01 increments. The roc-AUC was calculated on the test data set for the best model (best GMM and best PWE) based on likelihood. The procedure described above was repeated ten times on each dataset.
The dataframe (two columns for models, 90 rows for models roc-AUC on 9 datasets) was the input observation for Wilcoxon test. The goal of the test was to decide if there is a statistical difference between means of roc-AUC results of the models. The test was performed with the type I error rate (significance level) 0.05. Z-score was -0.92746, which is less than 1.96 and greater than -1.96. The null hypothesis was accepted.
There is no statistical significance of the difference between means of roc-AUC observations of the models. According to the obtained results on the given datasets with roc-AUC metrics, the models give the same results. This may be due to the fact that, due to the high time required, the EM algorithm was limited to 60 iterations, which was not enough for some datasets.
- https://cw.fel.cvut.cz/wiki/_media/courses/b4m36san/robust.pdf
- https://cw.fel.cvut.cz/wiki/_media/courses/b4m36san/san_anomaly.pdf
- https://brilliant.org/wiki/gaussian-mixture-model/
- https://www.ics.uci.edu/~smyth/courses/cs274/notes/EMnotes.pdf
- http://users.sussex.ac.uk/~grahamh/RM1web/WilcoxonHandoout2011.pdf
- https://pro.arcgis.com/en/pro-app/2.8/tool-reference/spatial-statistics/what-is-a-z-score-what-is-a-p-value.htm
- https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test
MIT