Welcome to kipple! kipple is a set of resources that accompany my entry in the 2021 ML Security Evasion Competition. While kipple only scored third place in the defensive track, I hope that publishing the materials behind it will help inspire other researchers in the space and make the topic more accessible for newcomers.
kipple materials are divided into four components:
- The data that kipple was built from, hosted in the kipple-data submodule;
- The models built during the construction of the 2021 MLSEC kipple entry, hosted in the kipple-models submodule;
- Scripts used to build and evaluate kipple, hosted in this repository; and
- Resources -- i.e., papers and presentations -- for understanding kipple, hosted in this repository.
This project is a work in progress! While my hope is to update it occasionally (see below), it is also a personal project, and so updates will likely be sporadic.
kipple's components are presently stored in separate GitHub repositories; because the data and models are each quite large (~300MB and ~500MB respectively), I want to ensure users can download only the pieces they need. To download everything, you can use the following command:
```
git clone https://github.com/aapplebaum/kipple.git --recursive
```
Almost all of the scripts and code associated with kipple reference the EMBER project -- you can access the data and install it here: https://github.com/elastic/ember.
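For example, once the dataset has been downloaded and vectorized per the EMBER documentation, the train/test features can be loaded directly; the path below is a placeholder for your local copy:

```python
import ember

# Placeholder path to a local, vectorized copy of the EMBER 2018 dataset.
X_train, y_train, X_test, y_test = ember.read_vectorized_features("/path/to/ember2018")
```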
This repository is home to three scripts that aim to make training a robust model easier. Each script is heavily commented and, hopefully, written clearly enough that its intention is obvious and others can modify it as they see fit. Within the kipple-models submodule there are two files that show how to use the models as well as the data.
`train.py` shows an example of how to build a GBDT model using the EMBER data alongside the data within `kipple-data`. Some of the commented-out code shows how to run different configurations.
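As a rough illustration of that approach (not the exact logic in `train.py`), the sketch below appends a hypothetical array of kipple adversarial-variant features to the labeled EMBER training data and fits a LightGBM GBDT; the paths and the `variants.npy` filename are placeholders:

```python
import ember
import lightgbm as lgb
import numpy as np

# Placeholder paths; point these at your local EMBER and kipple-data copies.
X_train, y_train, _, _ = ember.read_vectorized_features("/path/to/ember2018")
X_variants = np.load("kipple-data/variants.npy")  # hypothetical filename

# EMBER marks unlabeled rows with -1; keep only labeled samples.
labeled = y_train != -1
X = np.vstack([X_train[labeled], X_variants])
y = np.concatenate([y_train[labeled], np.ones(len(X_variants))])  # variants are malicious

# Fit a GBDT in the same family as the EMBER baseline model.
model = lgb.train({"objective": "binary"}, lgb.Dataset(X, y))
```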
`get_individual_thresholds.py` iterates through each model within `kipple-models`, computes the numeric threshold for a set of false positive rates, and then computes each model's accuracy at each threshold against the EMBER malware test data as well as a folder of malware of your choosing.
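The core computation is straightforward to sketch: given a model's scores on a benign corpus, pick the smallest threshold whose benign flag rate stays at or below the target false positive rate. This is a simplified version of the idea, not the script itself:

```python
import numpy as np

def threshold_for_fpr(benign_scores, target_fpr):
    """Smallest threshold whose false positive rate on the benign corpus
    is (absent score ties) at most target_fpr."""
    scores = np.sort(np.asarray(benign_scores))
    # Allow at most target_fpr of benign scores to sit at or above the cutoff.
    cutoff = min(int(np.ceil(len(scores) * (1.0 - target_fpr))), len(scores) - 1)
    return scores[cutoff]

# Accuracy on malware at that threshold is then just the detection rate, e.g.:
# tau = threshold_for_fpr(model.predict(X_benign), 0.01)
# detection_rate = (model.predict(X_malware) >= tau).mean()
```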
`size_three_portfolio.py` runs through a set of model combinations to identify thresholds that yield a 1% false positive rate.
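In spirit, that search is a brute-force sweep: for each combination of per-model thresholds, compute the OR-combined false positive rate on a benign corpus and keep the combinations that land at or under 1%. A simplified sketch with stand-in scores (the real script works over the kipple-models files):

```python
import itertools
import numpy as np

def combined_fpr(benign_scores, thresholds):
    """False positive rate of an OR-portfolio: a benign file counts as a
    false positive if any model's score meets its threshold."""
    n_files = len(next(iter(benign_scores.values())))
    flags = np.zeros(n_files, dtype=bool)
    for name, tau in thresholds.items():
        flags |= benign_scores[name] >= tau
    return flags.mean()

# Stand-in scores for three models over a 1,000-file benign corpus.
names = ["initial", "variants-all", "undetect-benign"]
benign_scores = {name: np.random.rand(1_000) for name in names}

# Coarse grid sweep; keep combinations at or under a 1% portfolio FPR.
grid = np.linspace(0.0, 1.0, 21)
keepers = [dict(zip(names, combo))
           for combo in itertools.product(grid, repeat=3)
           if combined_fpr(benign_scores, dict(zip(names, combo))) <= 0.01]
```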
The kipple entry into MLSEC 2021 used a portfolio approach of three models:
- `initial` with a threshold of 0.898
- `variants-all` with a threshold of 0.028
- `undetect-benign` with a threshold of 0.85
In addition to the static detection with the models above, the entry also leveraged the default stateful implementation from the sample defender provided as part of the competition. The only tweak was to add a prediction step that used all three models, and then to store malware if and only if it violated `variants-all`, modifying this line.
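Putting those pieces together, the decision logic amounts to something like the following; the function and variable names are illustrative, not the competition code:

```python
THRESHOLDS = {"initial": 0.898, "variants-all": 0.028, "undetect-benign": 0.85}

def predict(scores):
    """scores: model name -> score for one file. Flag the file as malicious
    if any model in the portfolio meets its threshold."""
    return any(scores[name] >= tau for name, tau in THRESHOLDS.items())

def maybe_remember(scores, features, history):
    """Stateful tweak: only files flagged by variants-all itself (not merely
    by the overall portfolio) are stored for the stateful defense."""
    if scores["variants-all"] >= THRESHOLDS["variants-all"]:
        history.append(features)
```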
The initial kipple entry had a high false positive rate on the local benign corpus I was using -- this turned out to be because the msfvenom detector (`undetect-benign`) was flagging all of the benign binaries. Digging deeper, the cause was that the msfvenom script had used those binaries as templates, and so the classifier had been trained on samples that looked very much like those specific binaries.
To fix this, the final submission ultimately used an unnecessarily large threshold of 0.85 for the `undetect-benign` classifier and hardcoded a set of MD5s of known-benign files.
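The allowlist itself is a simple lookup; a sketch with a placeholder hash, not the real list:

```python
import hashlib

# Placeholder entry; the actual submission hardcoded MD5s of the benign
# binaries that msfvenom had used as templates.
KNOWN_BENIGN_MD5S = {"0123456789abcdef0123456789abcdef"}

def is_allowlisted(file_bytes: bytes) -> bool:
    """Pass known-benign files regardless of model scores."""
    return hashlib.md5(file_bytes).hexdigest() in KNOWN_BENIGN_MD5S
```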
- Extend the `train.py` script to show how to train over a local set of binaries.
- Add an example script showing how to save a memmap'd array for quicker analysis (a rough sketch of the idea appears after this list).
- Upload an alternative representation of the adversarial samples not hardcoded to the memmap'd array.
- Upload scripts used to generate adversarial variants (maybe).
- Upload data and models based on other obfuscation techniques (e.g., crypters, packers).
- Add more information on retraining on evasive adversarial samples (not just all the samples).
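For the memmap item above, the idea is roughly the following: write the feature matrix to disk once, then reopen it on later runs without loading everything into RAM. Paths and shapes are placeholders; 2381 is the EMBER v2 feature dimension:

```python
import numpy as np

# One-time: persist a large feature matrix to a memory-mapped file.
X = np.random.rand(10_000, 2381).astype(np.float32)  # stand-in features
mm = np.memmap("features.dat", dtype=np.float32, mode="w+", shape=X.shape)
mm[:] = X
mm.flush()

# Later runs: reopen cheaply; the shape must be known ahead of time.
X_view = np.memmap("features.dat", dtype=np.float32, mode="r", shape=(10_000, 2381))
```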
If you want to cite kipple in your work, the following citation (or a variant of it) should work:
A. Applebaum, "kipple: Towards robust, accessible malware classification", CAMLIS, 2021.
And if you do use kipple -- please feel free to let me know!
There are many good and helpful references in this space! The following tools in particular were used to help construct the data behind kipple:
- EMBER
- Malware RL
- SecML Malware
- VirusShare
- SoReL 20M
- msfvenom
- The 2021 MLSEC default model implementation
Some other cool resources I haven't finished tinkering with include:
Lastly, check out the blog posts of some of the other competitors in the MLSEC 2021 competition: