Ensemble neural network for static malware classification using multiple representations

MSc Dissertation: Artificial Intelligence
School of Electronic Engineering and Computer Science - Department of Computer Science
Queen Mary University of London, United Kingdom
Supervisor: Gianni Antichi

Abstract

This dissertation proposes a new ensemble neural network architecture for static malware classification. The ensemble combines three models, each of which takes a different representation of the malware as input. Because the representations differ, the models can complement each other, which improves the accuracy of the ensemble. The individual models are based on previously proposed models, which have been replicated. The ensemble achieves a test accuracy of 98.7% on the Microsoft BIG 2015 dataset, even though it combines three models with accuracies as low as 87%.
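
To illustrate why combining models with different views of the same sample can help, here is a minimal sketch of one common combination rule, weighted soft voting over class probabilities. This is an assumption for illustration only: the abstract does not say how the ensemble merges its three models, and the dissertation's architecture may well use a different (for example, learned) combiner. The names `soft_vote`, `predict_class`, `model_a`, `model_b`, `model_c` and all probability values are hypothetical.

```python
from typing import List, Optional, Sequence


def soft_vote(probabilities: Sequence[Sequence[float]],
              weights: Optional[Sequence[float]] = None) -> List[float]:
    """Combine per-class probabilities from several models by weighted averaging."""
    if not probabilities:
        raise ValueError("need at least one model output")
    n_classes = len(probabilities[0])
    if any(len(p) != n_classes for p in probabilities):
        raise ValueError("all models must predict the same number of classes")
    if weights is None:
        # Assumption: every model gets the same weight unless told otherwise.
        weights = [1.0] * len(probabilities)
    if len(weights) != len(probabilities):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    return [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]


def predict_class(probabilities: Sequence[Sequence[float]],
                  weights: Optional[Sequence[float]] = None) -> int:
    """Return the index of the class with the highest combined probability."""
    combined = soft_vote(probabilities, weights)
    return max(range(len(combined)), key=combined.__getitem__)


# Hypothetical outputs of three models for one sample and three classes.
model_a = [0.6, 0.3, 0.1]  # on its own, this model would pick class 0
model_b = [0.2, 0.5, 0.3]
model_c = [0.1, 0.6, 0.3]

print(predict_class([model_a, model_b, model_c]))
```

In this made-up example one model disagrees with the other two, and the averaged probabilities follow the majority. Real gains depend on the models making different mistakes, which is the motivation for feeding each one a different representation.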

Structure

The repository consists of the following six folders.

data This folder will contain the dataset and the splits used for training, validation and testing.
data_processing This folder contains all scripts required to download, preprocess and prepare the data for training.
mlflow server This folder contains the docker-compose file for the MLflow server used for tracking, including all supporting files.
models This folder contains the scripts to train the individual models and the ensemble model. It also includes the trained weights for all four models.
notebooks This folder contains the notebooks that were used to train the models on Google Colab. They are logically identical to the scripts in the models folder but are structured differently because of the notebook format.
report This folder contains the LaTeX files used to create the report and the PDF of the report.

Instructions

To train the models, the following steps for preparation are required:

Throughout this process, you will be prompted twice to install multiple Python dependencies. Although not required, it is recommended to create a dedicated Python environment for this project. Miniconda is a good choice for this due to its simplicity. The installation guide can be found here. Instructions on how to create a conda environment can be found here.
Download the dataset from Kaggle and preprocess it. The instructions for this process can be found in the data_processing folder.
Start the MLflow server, which is used to track the training of the models. (Both the notebooks and the Python scripts are designed not to execute without access to an MLflow server.) The instructions can be found in the mlflow server folder.
(Google Colab only) Set up the connection to the MLflow server. The instructions for this process can be found in the notebooks folder.
Train the models in the local environment or in Colab, or use the trained models from the folder models/.weights/
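
Because training does not run without an MLflow server, it can save time to confirm that the server is reachable before starting a script or notebook. The following is a minimal sketch that only tests whether a TCP connection to the server can be opened; it does not use the MLflow API and is not part of this repository. The function name `tracking_server_reachable` is hypothetical, and the address `http://localhost:5000` is an assumption (5000 is MLflow's default port, but the docker-compose setup in the mlflow server folder may use a different one).

```python
import socket
from urllib.parse import urlparse


def tracking_server_reachable(tracking_uri: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the server's host and port succeeds."""
    parsed = urlparse(tracking_uri)
    if parsed.hostname is None:
        # Not a URI with a host part, so there is nothing to connect to.
        return False
    # Fall back to the scheme's standard port when the URI does not name one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    uri = "http://localhost:5000"  # assumed address; adjust to your setup
    if tracking_server_reachable(uri):
        print(f"MLflow server reachable at {uri}")
    else:
        print(f"No server reachable at {uri}; start it before training")
```

A successful connection only shows that something is listening on that port, not that it is MLflow, so treat this as a quick sanity check rather than a full health check.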

devasworski/Malware_Classification_Ensemble
