On the interplay between performance, efficiency and interpretability of Machine Learning for Malware Detection
This project is part of my Bachelor's Thesis.
The Dataset folder contains the exact NumPy arrays used for the Nested Cross-Validation.
The selected dataset is the EMBER malware dataset, specifically the 2018 version. The training and test samples were merged (excluding unlabeled data), shuffled, and a subset of 20K samples was taken.
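As a rough sketch (not the exact thesis preprocessing), the merge/filter/shuffle/subsample steps could look like the following in NumPy. Array names are illustrative; the `-1` unlabeled marker follows the EMBER label convention (0 = benign, 1 = malicious, -1 = unlabeled):

```python
import numpy as np

def subsample(X_train, y_train, X_test, y_test, n_samples=20_000, seed=0):
    """Merge the train/test splits, drop unlabeled rows (label == -1 in
    EMBER), shuffle, and keep the first n_samples rows."""
    X = np.concatenate([X_train, X_test])
    y = np.concatenate([y_train, y_test])
    labeled = y != -1              # EMBER marks unlabeled samples with -1
    X, y = X[labeled], y[labeled]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))[:n_samples]
    return X[idx], y[idx]
```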
To install the Conda environment used for this project:
- amd64: `conda env create -f amd64env.yml`
- arm64: `conda env create -f arm64env.yml`
The K-Fold Nested Cross-Validation with Grid Search is performed to select the best-performing model on the EMBER-2018 dataset.
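In scikit-learn terms, nested cross-validation amounts to a `GridSearchCV` (inner loop) wrapped in `cross_val_score` (outer loop). A minimal sketch with toy data and an illustrative grid, not the thesis configuration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy data in place of the EMBER feature matrix.
X, y = make_classification(n_samples=200, random_state=0)

# Inner loop: GridSearchCV picks hyperparameters on each training split.
param_grid = {"criterion": ["gini", "entropy"], "max_depth": [None, 10]}
inner = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=3)

# Outer loop: cross_val_score yields an unbiased estimate of the tuned model.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```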
This directory contains:
- Nested_CV.py: Nested Cross-Validation script.
- CustomModels folder: contains custom models adapted for compatibility with scikit-learn.
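Making a model scikit-learn-compatible mostly means following the estimator contract, so that `GridSearchCV` can clone and refit it. A toy (hypothetical) example of that contract:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class MajorityClassifier(BaseEstimator, ClassifierMixin):
    """Toy estimator illustrating the sklearn contract: __init__ only stores
    hyperparameters, fit() learns attributes with a trailing underscore,
    and predict() uses only that learned state."""

    def __init__(self, strategy="majority"):
        self.strategy = strategy  # stored verbatim so get_params()/clone() work

    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)
```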
Nested Cross Validation results are saved in the Results directory in two ways:
- Results.txt: Readable log.
- Results.pkl: Python pickle containing a dictionary with all the results in binary form.
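The pickle can be loaded back for further analysis; a minimal sketch (the default path is illustrative — the actual file is produced by Nested_CV.py):

```python
import pickle

def load_results(path="Results/Results.pkl"):
    """Load the nested-CV results dictionary produced by Nested_CV.py."""
    with open(path, "rb") as f:
        return pickle.load(f)
```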
The following models are evaluated:
- Gaussian Naive Bayes
- K-Nearest Neighbours
- Support Vector Machine
- Decision Tree
- LightGBM
- XGBoost
- Random Forest
- Extreme Random Forest
- Gradient Boosted Machine
- AdaBoost
- MultiLayer Perceptron
- Ensemble Deep RVFL
- Broad Learning System
- Stochastic Configuration Network
- Extreme Learning Machine
- Skope Rules
- GPLearn
The following metrics are recorded for each model:
- Balanced Accuracy
- F1-Score
- Recall
- Precision
- Matthews Correlation Coefficient
- Cohen Kappa
- Fit time (s)
- Predict time (s)
- Memory size (MB)
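The classification metrics map directly onto `sklearn.metrics` functions; fit/predict times and memory size would be measured separately (e.g. with `time.perf_counter` and the serialized model size). A toy example with made-up labels:

```python
from sklearn.metrics import (balanced_accuracy_score, cohen_kappa_score,
                             f1_score, matthews_corrcoef, precision_score,
                             recall_score)

# Made-up labels and predictions for illustration only.
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]

metrics = {
    "Balanced Accuracy": balanced_accuracy_score(y_true, y_pred),
    "F1-Score": f1_score(y_true, y_pred),
    "Recall": recall_score(y_true, y_pred),
    "Precision": precision_score(y_true, y_pred),
    "Matthews Correlation Coefficient": matthews_corrcoef(y_true, y_pred),
    "Cohen Kappa": cohen_kappa_score(y_true, y_pred),
}
```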
| Algorithm | Hyperparameter | Grid-search values |
|---|---|---|
| K-Nearest Neighbors | n_neighbors | 1, 3, 5, 7, 9 |
| | weights | uniform, distance |
| | p | 1, 2, 3 |
| Support Vector Machine | C | 0.1, 1, 10, 100, 1000 |
| | gamma | 1e-04, 1e-03, 1e-02, 1e-01, 1 |
| Decision Tree | criterion | gini, entropy |
| | splitter | best, random |
| | max_depth | None, 10, 20, 50 |
| Random Forest | n_estimators | 50, 100, 200 |
| | criterion | gini, entropy |
| | max_depth | None, 10, 20, 50 |
| Extreme Random Forest | n_estimators | 50, 100, 200 |
| | criterion | gini, entropy |
| | max_depth | None, 10, 20, 50 |
| Gradient Boosted Machine | loss | deviance, exponential |
| | learning_rate | 0.01, 0.1, 1 |
| | n_estimators | 50, 100, 200 |
| MultiLayer Perceptron | hidden_layer_sizes | (1000, 500, 200), (500, 200, 100), (200, 100, 50) |
| | activation | identity, logistic, relu |
| | alpha | 1e-05, 1e-04, 1e-03 |
| | learning_rate_init | 1e-03, 1e-02, 1e-01 |
| AdaBoost | n_estimators | 200, 500, 1000 |
| | learning_rate | 1e-02, 1e-01, 1 |
| Gaussian Naive Bayes | var_smoothing | 50 evenly spaced values from 1e-09 to 1 |
| Extreme Learning Machine | n_neurons | 200, 500, 1000 |
| | alpha | 1e-04, 1e-03, 1e-02, 1e-01 |
| Ensemble Deep RVFL | n_nodes | 40, 80, 100, 200, 300 |
| | n_layer | 1, 2, 3, 4, 5 |
| | activation | relu, sigmoid |
| LightGBM | num_leaves | 50, 500, 1000 |
| | max_depth | 3, 6, 12 |
| Skope Rules | max_depth_duplication | None, 2, 3 |
| | n_estimators | 200, 500, 1000 |
| GPLearn | population_size | 100, 1000, 10000 |
| | generations | 20, 50, 200 |
| | tournament_size | 20, 50, 200 |
| XGBoost | n_estimators | 50, 100, 200 |
| | max_depth | 2, 4, 6 |
| | eval_metric | error |
| Broad Learning System | C | 1e-04, 1e-03, 1e-02, 1e-01 |
| | N1 | 50, 100, 200, 500 |
| Stochastic Configuration Network | L_max | 10, 100, 1000 |
| | T_max | 10, 100, 1000 |
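Each algorithm's rows in the table translate into a `GridSearchCV` parameter grid. For example, the K-Nearest Neighbors rows (with toy data standing in for the EMBER features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Grid taken from the K-Nearest Neighbors rows of the table above.
param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9],
    "weights": ["uniform", "distance"],
    "p": [1, 2, 3],
}
X, y = make_classification(n_samples=60, random_state=0)  # toy data
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3).fit(X, y)
print(search.best_params_)
```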
The Plotting directory contains two scripts:
- Acc-Time_Plot.py: accuracy vs. time plot with interactive labels.
- DoubleBox_Plot.py: double-box plot combining accuracy, time, and interpretability.
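The actual scripts add interactive labels; a minimal static sketch of the accuracy-vs-time idea with matplotlib (the model names and numbers below are invented):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# Illustrative (balanced accuracy, fit time in seconds) pairs, not real results.
models = {"KNN": (0.92, 1.3), "SVM": (0.95, 210.0), "LightGBM": (0.97, 4.2)}

fig, ax = plt.subplots()
for name, (acc, fit_s) in models.items():
    ax.scatter(fit_s, acc)
    ax.annotate(name, (fit_s, acc), textcoords="offset points", xytext=(5, 5))
ax.set_xscale("log")            # fit times span orders of magnitude
ax.set_xlabel("Fit time (s)")
ax.set_ylabel("Balanced accuracy")
fig.savefig("acc_time.png")
```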