This repository houses the data associated with the kipple project. It has two primary folders:
- data, which contains files of zipped memmap'd feature arrays for adversarial malware, and
- records, which contains a list of the associated md5/sha256 values (more below) for each dat file.
Note that each dat file in data has an associated txt file in records, which lists the md5/sha256 values of the samples encoded in the array.
In total, there are 13 data stores, matching the following table:
Name | Description | Count |
---|---|---|
msf_normal | Randomly generated implants from msfvenom, no added-code parameter | 5884 |
msf_sorel | Randomly generated implants from msfvenom, added-code from the SoReL dataset | 33633 |
msf_vs | Randomly generated implants from msfvenom, added-code from VirusShare | 7614 |
sorel_malware_rl | Adversarial malware generated using Malware RL over the SoReL dataset | 37553 |
sorel_sml_gamma | Adversarial malware generated using the GAMMA attack from SecML Malware on the SoReL dataset | 5167 |
sorel_small_pad | Adversarial malware generated using the padding attack with a small pad from SecML Malware on the SoReL dataset | 225 |
sorel_large_pad | Adversarial malware generated using the padding attack with a large pad from SecML Malware on the SoReL dataset | 277 |
sorel_header_ev | Adversarial malware generated using the DOS Header attack from SecML Malware on the SoReL dataset | 2590 |
vs_malware_rl | Adversarial malware generated using Malware RL over malware from VirusShare | 24581 |
vs_sml_gamma | Adversarial malware generated using the GAMMA attack from SecML Malware on malware from VirusShare | 5629 |
vs_small_pad | Adversarial malware generated using the padding attack with a small pad from SecML Malware on malware from VirusShare | 2347 |
vs_large_pad | Adversarial malware generated using the padding attack with a large pad from SecML Malware on malware from VirusShare | 2815 |
vs_header_ev | Adversarial malware generated using the DOS Header attack from SecML Malware on malware from VirusShare | 2814 |
This data is zipped. The main kipple repo assumes you will unzip it -- we strongly recommend unzipping once you download the repo. The zip is only to make sure we're in line with file size requirements.
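
If you would rather unzip everything from Python than by hand, the following is a minimal sketch. It assumes the archives are standard `.zip` files sitting directly inside the data folder; the function name `unzip_all` is hypothetical, and the pattern needs adjusting if the archives use another extension or layout.

```python
import zipfile
from pathlib import Path


def unzip_all(data_dir="data"):
    """Extract every .zip archive found directly inside data_dir, in place.

    Assumption: archives are plain .zip files; returns the extracted member names.
    """
    extracted = []
    for archive in sorted(Path(data_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            # Extract next to the archive so the .dat files end up in data_dir
            zf.extractall(archive.parent)
            extracted.extend(zf.namelist())
    return extracted


if __name__ == "__main__":
    print(unzip_all("data"))
```

The original archives are left in place, so you can delete them afterwards if disk space is a concern.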
The records directory contains files listing the file hashes associated with each data array. Due to the different data sources, and some small code hiccups, there are some nuances in the naming convention:
- All hashes under the "msf" category are the MD5 file hashes of the implant generated by msfvenom.
- All hashes under the "vs" category are the MD5 file hashes of the original malware downloaded from VirusShare.
  - In some cases, multiple variants of the same original sample were created; after the first variant, each subsequent one has a "-ABC-N.exe" suffix, where N is the variant number.
  - In some cases, a sha256 value may be used in place of an MD5.
- All hashes under the "sorel" category are the hashes of the original malware.
  - SoReL modifies the malware binaries to be non-executable, giving them a different hash than the "active"/original malware.
  - The sha256 values correspond to the original version.
- There may be some names solely consisting of "-".
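
Because of these nuances, it can help to normalize record entries before matching them against other datasets. The sketch below is one possible approach, not part of kipple itself: the helper `classify_record` is hypothetical and assumes each entry starts with the hex digest (32 hex characters for MD5, 64 for SHA256), optionally followed by a suffix such as a variant marker.

```python
import re

# A leading run of hex digits: 32 characters for MD5, 64 for SHA256.
_HEX_PREFIX = re.compile(r"^[0-9a-fA-F]+")


def classify_record(name):
    """Return (kind, digest) for one line of a records file.

    kind is "md5", "sha256", "placeholder" (only "-" characters) or "unknown".
    """
    name = name.strip()
    if name and set(name) == {"-"}:
        return "placeholder", None
    match = _HEX_PREFIX.match(name)
    digest = match.group(0).lower() if match else ""
    if len(digest) == 32:
        return "md5", digest
    if len(digest) == 64:
        return "sha256", digest
    return "unknown", None


if __name__ == "__main__":
    with open("records/vs_malware_rl.txt") as f:
        kinds = [classify_record(line)[0] for line in f]
    print({kind: kinds.count(kind) for kind in set(kinds)})
```

Anything the helper reports as "unknown" is worth inspecting by hand rather than discarding.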
The memmap'd storage format probably isn't ideal; it would be better to store and share the malware as feature sets, similar to how EMBER stores its data. However, to save time during testing, we appended all newly generated malware samples to the existing memmap'd set, which let us run quicker tests. Hopefully at some point in the future I'll go through and revise the storage format.
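
To illustrate why this layout makes adding samples cheap, here is a rough sketch of appending new feature rows to a store. It is not the code kipple actually uses; `append_samples` and its arguments are hypothetical, and it assumes the dat files are raw row-major float32 arrays with one record line per row, as in the loading example in this README.

```python
import numpy as np


def append_samples(dat_path, txt_path, features, names):
    """Append feature rows to a raw float32 .dat file and names to its record file."""
    features = np.ascontiguousarray(features, dtype=np.float32)
    if features.ndim != 2 or features.shape[0] != len(names):
        raise ValueError("need a 2-D array with one feature row per name")
    # A memmap'd array with no header is just raw row-major bytes, so
    # appending rows is the same as appending bytes to the end of the file.
    with open(dat_path, "ab") as dat:
        dat.write(features.tobytes())
    # Keep the record file in step: one line per appended row.
    with open(txt_path, "a") as txt:
        for name in names:
            txt.write(name + "\n")
```

The trade-off is that the file carries no metadata, so the row count and feature dimension have to come from elsewhere (here, the record file and the EMBER extractor).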
Assuming you've already unzipped, the following code would be an example of running a classifier over the kipple data:
import gzip

import lightgbm as lgb
import numpy as np
from ember.features import PEFeatureExtractor

# Load the EMBER feature extractor to get the number of feature dimensions
extractor = PEFeatureExtractor(feature_version=2, print_feature_warning=False)
ndim = extractor.dim

# Load the data store we want to use; the records file has one line per sample
target_data = "msf_normal"
with open("records/" + target_data + ".txt") as f:
    num_entries = sum(1 for _ in f)
malware_data = np.memmap("data/" + target_data + ".dat", dtype=np.float32, mode="r", shape=(num_entries, ndim))

# Load a local model (adjust this path to wherever your gzipped LightGBM model lives)
model_location = "/exes/kipple_repo/kipple/models/initial.txt.gz"
with gzip.open(model_location, "rb") as f:
    md = f.read().decode("ascii")
mdl = lgb.Booster(model_str=md)

# Every sample is malware, so a score above the 0.85 threshold counts as a correct detection
num_correct = 0
for i in range(num_entries):
    if mdl.predict([malware_data[i]])[0] > 0.85:
        num_correct += 1
print(num_correct / num_entries)
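
Since the array shape is derived from the record file rather than stored in the dat file, a quick size check before loading can catch a partial unzip or mismatched files. This is a minimal sketch under the same layout assumptions as the example above (float32 values, one record line per row); `check_store` is a hypothetical helper, and `ndim` should be the value reported by the EMBER extractor.

```python
import os

FLOAT32_BYTES = 4  # the arrays are read with dtype=np.float32


def check_store(name, ndim, root="."):
    """Return (num_entries, ok); ok is True when the .dat size matches the records."""
    txt_path = os.path.join(root, "records", name + ".txt")
    dat_path = os.path.join(root, "data", name + ".dat")
    with open(txt_path) as f:
        num_entries = sum(1 for _ in f)
    # Each record line should correspond to one row of ndim float32 values
    expected_bytes = num_entries * ndim * FLOAT32_BYTES
    return num_entries, os.path.getsize(dat_path) == expected_bytes
```

For example, `check_store("msf_normal", extractor.dim)` should report the count from the table above alongside `True` if the files are intact.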
There are more examples in the primary kipple directory.
- Malware RL: https://github.com/bfilar/malware_rl
- SoReL 20M: https://github.com/sophos-ai/SOREL-20M
- SecML Malware: https://github.com/pralab/secml_malware/
- msfvenom: https://www.offensive-security.com/metasploit-unleashed/msfvenom/
- EMBER: https://github.com/elastic/ember
- VirusShare: https://virusshare.com/