
On the interplay between performance, efficiency and interpretability of Machine Learning for Malware Detection. This project is part of my Bachelor's Thesis.


g2jz/MalwareDetectionBenchmark


On the interplay between performance, efficiency and interpretability of Machine Learning for Malware Detection

Context

This project is part of my Bachelor's Thesis.

Dataset

The Dataset folder contains the exact NumPy arrays used for the nested cross-validation.

The dataset is the 2018 version of the Ember malware dataset. The training and test samples were merged (excluding unlabeled samples) and shuffled, and a subset of 20,000 samples was drawn.
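The preparation steps described above can be sketched with NumPy as follows. This is a minimal illustration, not the script used in this project: the function name `build_subset`, the seed, and the synthetic arrays are hypothetical, and the sketch assumes that unlabeled samples carry the label -1.

```python
import numpy as np


def build_subset(X_train, y_train, X_test, y_test, n_samples=20_000, seed=0):
    """Merge train/test splits, drop unlabeled rows, shuffle, and subsample."""
    X = np.concatenate([X_train, X_test], axis=0)
    y = np.concatenate([y_train, y_test], axis=0)

    # Assumption: unlabeled samples are marked with the label -1.
    labeled = y != -1
    X, y = X[labeled], y[labeled]

    # Shuffle features and labels with the same permutation so rows stay aligned.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    X, y = X[order], y[order]

    # Keep at most n_samples rows.
    return X[:n_samples], y[:n_samples]


if __name__ == "__main__":
    # Tiny synthetic stand-in for the real feature matrices.
    rng = np.random.default_rng(1)
    X_tr, y_tr = rng.normal(size=(60, 8)), rng.integers(-1, 2, size=60)
    X_te, y_te = rng.normal(size=(40, 8)), rng.integers(0, 2, size=40)
    X_sub, y_sub = build_subset(X_tr, y_tr, X_te, y_te, n_samples=50)
    print(X_sub.shape, y_sub.shape)
```

Fixing the seed makes the subset reproducible, which matters when the same arrays are reused across all models.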

Conda environment installation

To install the Conda environment used for this project:

  • amd64:
conda env create -f amd64env.yml
  • arm64:
conda env create -f arm64env.yml

NestedCV

K-fold nested cross-validation with grid search is used to select the best-performing model for the Ember-2018 dataset.
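As a rough illustration of the idea, nested cross-validation with grid search can be expressed in scikit-learn by wrapping a `GridSearchCV` (inner loop, hyperparameter selection) inside `cross_validate` (outer loop, performance estimation). The sketch below is not the contents of Nested_CV.py: the helper name `nested_cv`, the fold counts, the scoring choice, and the synthetic data are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier


def nested_cv(estimator, param_grid, X, y, outer_splits=5, inner_splits=3, seed=0):
    """Estimate performance with an outer CV loop around an inner grid search."""
    inner = StratifiedKFold(n_splits=inner_splits, shuffle=True, random_state=seed)
    outer = StratifiedKFold(n_splits=outer_splits, shuffle=True, random_state=seed)

    # Inner loop: pick the best hyperparameters on each outer training fold.
    search = GridSearchCV(estimator, param_grid, cv=inner, scoring="balanced_accuracy")

    # Outer loop: score the tuned model on data the search never saw.
    return cross_validate(search, X, y, cv=outer, scoring="balanced_accuracy")


if __name__ == "__main__":
    # Synthetic data standing in for the real feature arrays.
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    grid = {"n_neighbors": [1, 3, 5], "weights": ["uniform", "distance"]}
    scores = nested_cv(KNeighborsClassifier(), grid, X, y)
    print(scores["test_score"].mean())
```

Keeping the grid search inside the outer loop prevents the hyperparameter choice from seeing the outer test folds, so the reported score is not optimistically biased by the tuning.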

This directory contains:

  • Nested_CV.py: Nested Cross-Validation script.
  • CustomModels folder: Custom models adapted to be compatible with scikit-learn.
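To show what scikit-learn compatibility typically involves, here is a minimal sketch of a custom classifier that works with `GridSearchCV`. It is not one of the models in CustomModels: `SimpleELMClassifier` is a hypothetical, simplified Extreme Learning Machine (random hidden layer, ridge-regularized output weights) written only for this example. The key points are that `__init__` stores its arguments unchanged and that fitted attributes end with an underscore.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.utils.validation import check_array, check_is_fitted, check_X_y


class SimpleELMClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical minimal ELM; illustrates the estimator interface only."""

    def __init__(self, n_neurons=200, alpha=1e-2, random_state=None):
        # Store parameters as-is so get_params/set_params and clone work.
        self.n_neurons = n_neurons
        self.alpha = alpha
        self.random_state = random_state

    def _hidden(self, X):
        # Random, untrained hidden layer with tanh activation.
        return np.tanh(X @ self.W_ + self.b_)

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        self.classes_, y_idx = np.unique(y, return_inverse=True)
        rng = np.random.default_rng(self.random_state)
        self.W_ = rng.normal(size=(X.shape[1], self.n_neurons))
        self.b_ = rng.normal(size=self.n_neurons)
        H = self._hidden(X)
        T = np.eye(len(self.classes_))[y_idx]  # one-hot targets
        # Ridge solution for the output weights: (H'H + alpha*I) beta = H'T.
        A = H.T @ H + self.alpha * np.eye(self.n_neurons)
        self.beta_ = np.linalg.solve(A, H.T @ T)
        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)
        scores = self._hidden(X) @ self.beta_
        return self.classes_[np.argmax(scores, axis=1)]


if __name__ == "__main__":
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    grid = {"n_neurons": [20, 50], "alpha": [1e-2, 1e-1]}
    search = GridSearchCV(SimpleELMClassifier(random_state=0), grid, cv=3).fit(X, y)
    print(search.best_params_)
```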

Nested cross-validation results are saved in the Results directory in two formats:

  • Results.txt: Readable log.
  • Results.pkl: Python pickle containing a dictionary with all the binary results.
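A pickled dictionary like Results.pkl can be read back with the standard `pickle` module. The sketch below only shows the mechanics: the layout of the real dictionary is not documented here, so the helper names and the placeholder dictionary used in the demo are hypothetical. Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.

```python
import pickle
import tempfile
from pathlib import Path


def load_results(path):
    """Load a pickled results dictionary from disk."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def summarize(results):
    """Return each top-level key with the type name of its stored value."""
    return {key: type(value).__name__ for key, value in results.items()}


if __name__ == "__main__":
    # Placeholder dictionary; the keys and values are NOT the project's results.
    placeholder = {"example_model": {"example_metric": [0.0, 0.0]}}
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "Results.pkl"
        with open(path, "wb") as fh:
            pickle.dump(placeholder, fh)
        print(summarize(load_results(path)))
```

Inspecting the top-level keys first is a cheap way to discover the actual structure before writing any analysis code against it.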

Tested models

  • Gaussian Naive Bayes
  • K-Nearest Neighbors
  • Support Vector Machine
  • Decision Tree
  • LightGBM
  • XGBoost
  • Random Forest
  • Extreme Random Forest
  • Gradient Boosted Machine
  • AdaBoost
  • MultiLayer Perceptron
  • Ensemble Deep RVFL
  • Broad Learning System
  • Stochastic Configuration Network
  • Extreme Learning Machine
  • Skope Rules
  • GPLearn

Scoring

  • Balanced Accuracy
  • F1-Score
  • Recall
  • Precision
  • Matthews Correlation Coefficient
  • Cohen Kappa
  • Fit time (s)
  • Predict time (s)
  • Memory size (MB)
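The metrics listed above can all be computed with scikit-learn and the standard library. The following sketch shows one possible way to collect them for a single train/test split. It is an assumption-laden example rather than the project's measurement code: the helper name `evaluate` is hypothetical, and memory size is approximated here by the size of the pickled model, which may differ from how it is measured in Nested_CV.py.

```python
import pickle
import time

from sklearn.datasets import make_classification
from sklearn.metrics import (
    balanced_accuracy_score,
    cohen_kappa_score,
    f1_score,
    matthews_corrcoef,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit a model and return quality, timing, and size measurements."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_time = time.perf_counter() - start

    start = time.perf_counter()
    y_pred = model.predict(X_test)
    predict_time = time.perf_counter() - start

    # Assumption: memory size is approximated by the pickled model size.
    memory_mb = len(pickle.dumps(model)) / 1e6

    return {
        "balanced_accuracy": balanced_accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "mcc": matthews_corrcoef(y_test, y_pred),
        "cohen_kappa": cohen_kappa_score(y_test, y_pred),
        "fit_time_s": fit_time,
        "predict_time_s": predict_time,
        "memory_mb": memory_mb,
    }


if __name__ == "__main__":
    # Synthetic binary data standing in for the real feature arrays.
    X, y = make_classification(n_samples=300, n_features=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
    for name, value in evaluate(DecisionTreeClassifier(random_state=0), X_tr, y_tr, X_te, y_te).items():
        print(f"{name}: {value:.4f}")
```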

List of Hyperparameters

| Algorithm | Hyperparameter | Range of hyperparameter values for grid search |
| --- | --- | --- |
| K-Nearest Neighbors | n_neighbors | 1, 3, 5, 7, 9 |
| | weights | uniform, distance |
| | p | 1, 2, 3 |
| Support Vector Machine | C | 0.1, 1, 10, 100, 1000 |
| | gamma | 1e-04, 1e-03, 1e-02, 1e-01, 1 |
| Decision Tree | criterion | gini, entropy |
| | splitter | best, random |
| | max_depth | None, 10, 20, 50 |
| Random Forest | n_estimators | 50, 100, 200 |
| | criterion | gini, entropy |
| | max_depth | None, 10, 20, 50 |
| Extreme Random Forest | n_estimators | 50, 100, 200 |
| | criterion | gini, entropy |
| | max_depth | None, 10, 20, 50 |
| Gradient Boosted Machine | loss | deviance, exponential |
| | learning_rate | 0.01, 0.1, 1 |
| | n_estimators | 50, 100, 200 |
| MultiLayer Perceptron | hidden_layer_size | (1000, 500, 200), (500, 200, 100), (200, 100, 50) |
| | activation | identity, logistic, relu |
| | alpha | 1e-05, 1e-04, 1e-03 |
| | learning_rate_init | 1e-03, 1e-02, 1e-01 |
| AdaBoost | n_estimators | 200, 500, 1000 |
| | learning_rate | 1e-02, 1e-01, 1 |
| Gaussian Naive Bayes | var_smoothing | from 1e-9 to 1, 50 evenly spaced numbers |
| Extreme Learning Machine | n_neurons | 200, 500, 1000 |
| | alpha | 1e-04, 1e-03, 1e-02, 1e-01 |
| Ensemble Deep RVFL | n_nodes | 40, 80, 100, 200, 300 |
| | n_layer | 1, 2, 3, 4, 5 |
| | activation | relu, sigmoid |
| LightGBM | num_leaves | 50, 500, 1000 |
| | max_depth | 3, 6, 12 |
| Skope Rules | max_depth_duplication | None, 2, 3 |
| | n_estimators | 200, 500, 1000 |
| GPLearn | population_size | 100, 1000, 10000 |
| | generations | 20, 50, 200 |
| | tournament_size | 20, 50, 200 |
| XGBoost | n_estimators | 50, 100, 200 |
| | max_depth | 2, 4, 6 |
| | eval_metric | error |
| Broad Learning System | C | 1e-04, 1e-03, 1e-02, 1e-01 |
| | N1 | 50, 100, 200, 500 |
| Stochastic Configuration Network | L_max | 10, 100, 1000 |
| | T_max | 10, 100, 1000 |
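For reference, a few of the grids listed above can be written as scikit-learn parameter grids, and `ParameterGrid` can count how many combinations each grid search has to evaluate. This is a sketch covering only four of the algorithms. The dictionary name `param_grids` and the helper `grid_sizes` are hypothetical, and "50 evenly spaced numbers" is read here as linear spacing (`np.linspace`); a logarithmic grid (`np.logspace`) is another possible reading.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Subset of the grids from the hyperparameter list, keyed by algorithm name.
param_grids = {
    "K-Nearest Neighbors": {
        "n_neighbors": [1, 3, 5, 7, 9],
        "weights": ["uniform", "distance"],
        "p": [1, 2, 3],
    },
    "Support Vector Machine": {
        "C": [0.1, 1, 10, 100, 1000],
        "gamma": [1e-4, 1e-3, 1e-2, 1e-1, 1],
    },
    "Decision Tree": {
        "criterion": ["gini", "entropy"],
        "splitter": ["best", "random"],
        "max_depth": [None, 10, 20, 50],
    },
    "Gaussian Naive Bayes": {
        # Assumption: linear spacing between 1e-9 and 1.
        "var_smoothing": np.linspace(1e-9, 1, 50).tolist(),
    },
}


def grid_sizes(grids):
    """Return the number of hyperparameter combinations per algorithm."""
    return {name: len(ParameterGrid(grid)) for name, grid in grids.items()}


if __name__ == "__main__":
    for name, size in grid_sizes(param_grids).items():
        print(f"{name}: {size} combinations")
```

In a nested cross-validation, each combination is fitted once per inner fold and per outer fold, so the grid size directly drives the total run time.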

Plotting

The Plotting directory contains two scripts:

  • Acc-Time_Plot.py: Accuracy vs time plot with interactive labels.
  • DoubleBox_Plot.py: DoubleBox-type plot showing accuracy, time and interpretability.
