On the interplay between performance, efficiency and interpretability of Machine Learning for Malware Detection
This project is part of my Bachelor's Thesis.
The Dataset folder contains the exact NumPy arrays used for the Nested Cross-Validation.
The selected dataset is the EMBER malware dataset, specifically the 2018 version. The training and test samples were merged (excluding unlabeled data), shuffled, and a subset of 20K samples was taken.
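As a rough sketch (not the exact thesis preprocessing), the merge/filter/shuffle/subsample steps could look like the following in NumPy. Array names are illustrative; the `-1` unlabeled marker follows the EMBER label convention (0 = benign, 1 = malicious, -1 = unlabeled):

```python
import numpy as np

def subsample(X_train, y_train, X_test, y_test, n_samples=20_000, seed=0):
    """Merge the train/test splits, drop unlabeled rows (label == -1 in
    EMBER), shuffle, and keep the first n_samples rows."""
    X = np.concatenate([X_train, X_test])
    y = np.concatenate([y_train, y_test])
    labeled = y != -1              # EMBER marks unlabeled samples with -1
    X, y = X[labeled], y[labeled]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))[:n_samples]
    return X[idx], y[idx]
```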
To install the Conda environment used for this project:
- amd64: `conda env create -f amd64env.yml`
- arm64: `conda env create -f arm64env.yml`
The K-Fold Nested Cross-Validation with Grid Search is performed to select the best-performing model on the EMBER-2018 dataset.
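In scikit-learn terms, nested cross-validation amounts to a `GridSearchCV` (inner loop) wrapped in `cross_val_score` (outer loop). A minimal sketch with toy data and an illustrative grid, not the thesis configuration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy data in place of the EMBER feature matrix.
X, y = make_classification(n_samples=200, random_state=0)

# Inner loop: GridSearchCV picks hyperparameters on each training split.
param_grid = {"criterion": ["gini", "entropy"], "max_depth": [None, 10]}
inner = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=3)

# Outer loop: cross_val_score yields an unbiased estimate of the tuned model.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```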
This directory contains:
- Nested_CV.py: Nested Cross-Validation script.
- CustomModels folder: contains custom models adapted for compatibility with scikit-learn.
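Making a model scikit-learn-compatible mostly means following the estimator contract, so that `GridSearchCV` can clone and refit it. A toy (hypothetical) example of that contract:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class MajorityClassifier(BaseEstimator, ClassifierMixin):
    """Toy estimator illustrating the sklearn contract: __init__ only stores
    hyperparameters, fit() learns attributes with a trailing underscore,
    and predict() uses only that learned state."""

    def __init__(self, strategy="majority"):
        self.strategy = strategy  # stored verbatim so get_params()/clone() work

    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)
```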
Nested Cross Validation results are saved in the Results directory in two ways:
- Results.txt: Readable log.
- Results.pkl: Python pickle containing a dictionary with all the results in binary form.
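The pickle can be loaded back for further analysis; a minimal sketch (the default path is illustrative — the actual file is produced by Nested_CV.py):

```python
import pickle

def load_results(path="Results/Results.pkl"):
    """Load the nested-CV results dictionary produced by Nested_CV.py."""
    with open(path, "rb") as f:
        return pickle.load(f)
```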
The following models are evaluated:
- Gaussian Naive Bayes
- K-Nearest Neighbours
- Support Vector Machine
- Decision Tree
- LightGBM
- XGBoost
- Random Forest
- Extreme Random Forest
- Gradient Boosted Machine
- AdaBoost
- MultiLayer Perceptron
- Ensemble Deep RVFL
- Broad Learning System
- Stochastic Configuration Network
- Extreme Learning Machine
- Skope Rules
- GPLearn
The following metrics are recorded for each model:
- Balanced Accuracy
- F1-Score
- Recall
- Precision
- Matthews Correlation Coefficient
- Cohen Kappa
- Fit time (s)
- Predict time (s)
- Memory size (MB)
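The classification metrics map directly onto `sklearn.metrics` functions; fit/predict times and memory size would be measured separately (e.g. with `time.perf_counter` and the serialized model size). A toy example with made-up labels:

```python
from sklearn.metrics import (balanced_accuracy_score, cohen_kappa_score,
                             f1_score, matthews_corrcoef, precision_score,
                             recall_score)

# Made-up labels and predictions for illustration only.
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]

metrics = {
    "Balanced Accuracy": balanced_accuracy_score(y_true, y_pred),
    "F1-Score": f1_score(y_true, y_pred),
    "Recall": recall_score(y_true, y_pred),
    "Precision": precision_score(y_true, y_pred),
    "Matthews Correlation Coefficient": matthews_corrcoef(y_true, y_pred),
    "Cohen Kappa": cohen_kappa_score(y_true, y_pred),
}
```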
| Algorithm | Hyperparameter | Grid-search values |
|---|---|---|
| K-Nearest Neighbors | n_neighbors | 1, 3, 5, 7, 9 |
| | weights | uniform, distance |
| | p | 1, 2, 3 |
| Support Vector Machine | C | 0.1, 1, 10, 100, 1000 |
| | gamma | 1e-04, 1e-03, 1e-02, 1e-01, 1 |
| Decision Tree | criterion | gini, entropy |
| | splitter | best, random |
| | max_depth | None, 10, 20, 50 |
| Random Forest | n_estimators | 50, 100, 200 |
| | criterion | gini, entropy |
| | max_depth | None, 10, 20, 50 |
| Extreme Random Forest | n_estimators | 50, 100, 200 |
| | criterion | gini, entropy |
| | max_depth | None, 10, 20, 50 |
| Gradient Boosted Machine | loss | deviance, exponential |
| | learning_rate | 0.01, 0.1, 1 |
| | n_estimators | 50, 100, 200 |
| MultiLayer Perceptron | hidden_layer_sizes | (1000, 500, 200), (500, 200, 100), (200, 100, 50) |
| | activation | identity, logistic, relu |
| | alpha | 1e-05, 1e-04, 1e-03 |
| | learning_rate_init | 1e-03, 1e-02, 1e-01 |
| AdaBoost | n_estimators | 200, 500, 1000 |
| | learning_rate | 1e-02, 1e-01, 1 |
| Gaussian Naive Bayes | var_smoothing | 50 evenly spaced values from 1e-09 to 1 |
| Extreme Learning Machine | n_neurons | 200, 500, 1000 |
| | alpha | 1e-04, 1e-03, 1e-02, 1e-01 |
| Ensemble Deep RVFL | n_nodes | 40, 80, 100, 200, 300 |
| | n_layer | 1, 2, 3, 4, 5 |
| | activation | relu, sigmoid |
| LightGBM | num_leaves | 50, 500, 1000 |
| | max_depth | 3, 6, 12 |
| Skope Rules | max_depth_duplication | None, 2, 3 |
| | n_estimators | 200, 500, 1000 |
| GPLearn | population_size | 100, 1000, 10000 |
| | generations | 20, 50, 200 |
| | tournament_size | 20, 50, 200 |
| XGBoost | n_estimators | 50, 100, 200 |
| | max_depth | 2, 4, 6 |
| | eval_metric | error |
| Broad Learning System | C | 1e-04, 1e-03, 1e-02, 1e-01 |
| | N1 | 50, 100, 200, 500 |
| Stochastic Configuration Network | L_max | 10, 100, 1000 |
| | T_max | 10, 100, 1000 |
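Each algorithm's rows in the table translate into a `GridSearchCV` parameter grid. For example, the K-Nearest Neighbors rows (with toy data standing in for the EMBER features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Grid taken from the K-Nearest Neighbors rows of the table above.
param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9],
    "weights": ["uniform", "distance"],
    "p": [1, 2, 3],
}
X, y = make_classification(n_samples=60, random_state=0)  # toy data
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3).fit(X, y)
print(search.best_params_)
```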
The Plotting directory contains two scripts:
- Acc-Time_Plot.py: accuracy vs. time plot with interactive labels.
- DoubleBox_Plot.py: double-box plot combining accuracy, time, and interpretability.
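The actual scripts add interactive labels; a minimal static sketch of the accuracy-vs-time idea with matplotlib (the model names and numbers below are invented):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# Illustrative (balanced accuracy, fit time in seconds) pairs, not real results.
models = {"KNN": (0.92, 1.3), "SVM": (0.95, 210.0), "LightGBM": (0.97, 4.2)}

fig, ax = plt.subplots()
for name, (acc, fit_s) in models.items():
    ax.scatter(fit_s, acc)
    ax.annotate(name, (fit_s, acc), textcoords="offset points", xytext=(5, 5))
ax.set_xscale("log")            # fit times span orders of magnitude
ax.set_xlabel("Fit time (s)")
ax.set_ylabel("Balanced accuracy")
fig.savefig("acc_time.png")
```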