ML-OT is a toolbox for applying different algorithms and analyses to network data.
- Different options for experiments can be set from the command line and easily reproduced. The open-source project Hydra from Facebook Research provides this functionality.
- Datasets produced by different parsers for network data are supported, e.g. tshark and nProbe.
- Automatic evaluation and analysis, including statistics, explainable AI, and more.
Install PyTorch following the instructions on the official website https://pytorch.org/get-started/locally/. The exact command depends on your operating system, hardware and software.
pip install torch --extra-index-url https://download.pytorch.org/whl/cu116
Install the remaining libraries from the requirements file.
pip install -r requirements.txt
This command turns on tab completion for Hydra.
eval "$(python main.py -sc install=bash)"
When you use a virtual environment, you can add this command to the activation script.
This will automatically enable tab completion when entering your virtual environment.
Use the absolute path to the main.py script.
echo 'eval "$(python <path_main_script> -sc install=bash)"' | tee -a env/bin/activate
With tab completion enabled, you can smoothly select different options for experiments. This recording is a short showcase of some simple experiments.
Using this procedure you can explore the different configurations.
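Besides tab completion, Hydra's built-in flags can help you inspect the available options and the composed configuration. A small illustration, assuming a recent Hydra version:
# List the available config groups (dataset, model, ...) and their options
python main.py --help
# Print the fully composed config for a run without executing it
python main.py dataset=tshark model=randomforest --cfg job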
To apply one or more algorithms to one or more datasets, add the --multirun argument to the command:
python main.py --multirun dataset=tshark,nprobe model=randomforest,XGB
You can also do a grid search or sweep on a single algorithm:
python main.py --multirun dataset=tshark model=randomforest model.n_estimators=2,5,10 model.max_depth=5,10,20
To perform a smarter hyperparameter search, make use of hp_search.py:
python hp_search.py --multirun dataset=tshark model=randomforest 'model.n_estimators=range(10,200)' 'model.max_depth=range(5,20)'
This makes use of the Nevergrad plugin for Hydra to search the parameter space and returns the parameters with the best average F1-score. You can find more information on how to use Nevergrad here.
Hydra uses a collection of config files in the background for the tab completion in the terminal. These config files can be adapted directly for more fine-grained control over an experiment.
The configs directory contains three important subdirectories:
- dataset: contains the different options for datasets, such as test size, data directory and whether labels should be binary or multi-class.
- model: contains the different hyperparameters for each model.
- analysis: contains the different analysis functions to apply after training for each model.
Note that there are two types of models: classification and anomaly. Anomaly models can only handle binary labels, while classification models can handle both binary and multi-class labels.
To adapt the parameter search space used by the hp_search.py script, edit the config file configs/hydra/sweeper/hyperparam_search.yaml.
The default options here are for a randomforest model with the number of trees between 10 and 150.
parametrization:
  model: randomforest
  model.n_estimators:
    lower: 10
    upper: 150
    integer: True
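As an illustration (not part of the default config), the search space could be extended with additional hyperparameters using the same lower/upper/integer schema of the Nevergrad sweeper, for example a max_depth range matching the sweep shown earlier:
parametrization:
  model: randomforest
  model.n_estimators:
    lower: 10
    upper: 150
    integer: True
  model.max_depth:
    lower: 5
    upper: 20
    integer: True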
You can also create a new config file, e.g. my_search_space, under the same directory and adapt the top-level config file configs/config_hpsearch.yaml to:
defaults:
  - _self_
  - dataset: tshark
  - model: randomforest
  - override hydra/sweeper: my_search_space
This way you can maintain different experiments easily.
In the outputs directory you can find the results of experiments run with a single configuration:
- A main.log file containing all the executed steps.
- An evaluation report for the model with the scores for the different classes.
- An analysis folder with the different produced graphs.
These graphs are produced by different analysis functions, which can be changed through the config files in the configs/analysis directory; each model has its own config file there.
In the multirun directory you can find the results of experiments with multiple config options or sweeps. For grid-search sweeps it contains the same output as a normal experiment, but in separate subdirectories. For sweeps with Nevergrad it contains the scores for each hyperparameter setting and the best hyperparameters found.
Datasets, models and analysis functions can be added to this project.
Dataset classes extend an interface.
# Dataset base class assumed to come from PyTorch
from torch.utils.data import Dataset


class InterfaceData(Dataset):
    def __init__(self, data_path: str, test_size: float, label_type: str, shuffle: bool):
        """Initialize dataset object with properties"""
        self.data_path = data_path
        self.test_size = test_size
        self.label_type = label_type
        self.shuffle = shuffle
        self.data = None
        self.x_train, self.x_test = None, None
        self.y_train, self.y_test = None, None

    def __len__(self):
        """Method to retrieve the length of the data contained"""
        return len(self.data)

    def __getitem__(self, idx):
        """Method to retrieve a row of data"""
        return self.data.iloc[idx, :]

    def load(self) -> None:
        """Load in the dataset and store it in the object"""
        pass

    def preprocess(self) -> None:
        """Apply a specified preprocessing function and store the result in the object"""
        pass

    def postprocess(self) -> None:
        """Any processing applied after splitting the data, such as rescaling; store in the object"""
        pass
To add a new dataset, extend the InterfaceData class in a Python file under src/datasets/.
Write separate functions for loading, preprocessing and postprocessing. You could also adapt the constructor, but most of the time this will not be necessary. A minimal sketch is shown below.
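For illustration only, a subclass could look roughly like this; the class name, the import path of the interface, the CSV layout and the 'label' column are assumptions, not part of the project:
import pandas as pd
from sklearn.model_selection import train_test_split

from src.datasets.interface import InterfaceData  # hypothetical import path for the interface


class MyPcapData(InterfaceData):
    def load(self) -> None:
        # Assumes the parser output was stored as a single CSV file
        self.data = pd.read_csv(self.data_path)

    def preprocess(self) -> None:
        # Example: drop rows with missing values, then split into train and test sets
        self.data = self.data.dropna()
        x = self.data.drop(columns=["label"])  # 'label' column name is an assumption
        y = self.data["label"]
        self.x_train, self.x_test, self.y_train, self.y_test = train_test_split(
            x, y, test_size=self.test_size, shuffle=self.shuffle
        )

    def postprocess(self) -> None:
        # Example: standardize features using statistics from the training split only
        mean, std = self.x_train.mean(), self.x_train.std()
        self.x_train = (self.x_train - mean) / std
        self.x_test = (self.x_test - mean) / std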
Add a new YAML file in the directory configs/dataset.
It should target the new dataset class you created and provide a data path, test size, label_type and shuffle argument.
The tshark config looks like this:
_target_: src.datasets.tshark.TsharkData
data_path: 'Data/tshark/scenario1'
label_type: 'binary'
test_size: 0.3
shuffle: False
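A config for the hypothetical dataset sketched above could then look like this (class path and data directory are placeholders):
_target_: src.datasets.my_pcap.MyPcapData
data_path: 'Data/my_pcap/scenario1'
label_type: 'binary'
test_size: 0.3
shuffle: False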
Model classes extend an interface.
from typing import Union

from pandas import DataFrame

# InterfaceData is the dataset interface shown above


class InterfaceModel:
    def __init__(self, **kwargs):
        """Initialize model_type as 'supervised', 'semi-supervised' or 'unsupervised'
        and pass kwargs as the hyperparameters of the model"""
        self.model_type = None

    def fit(self, dataset: InterfaceData) -> None:
        """Apply the model specific fit function"""
        pass

    def predict(self, dataset: InterfaceData) -> Union[tuple[DataFrame, DataFrame], DataFrame]:
        """Return one or two pandas dataframes with predictions"""
        pass

    def evaluate(self, dataset: InterfaceData, preds) -> None:
        """Calculate standard evaluation scores for the model and save them in outputs"""
        pass

    def score(self, dataset: InterfaceData) -> float:
        """Get a single score on the relevant piece of data, depending on the model type"""
        pass

    def get_class_names(self, dataset: InterfaceData) -> list:
        """Return the original class names in a list, or [0, 1] for binary datasets"""
        pass
To add a new model, extend the InterfaceModel class in a Python file under src/models/.
Implement the different functions and the constructor of your subclass, for example as in the sketch below.
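As a rough sketch (the class name, the wrapped scikit-learn estimator and the import path of the interface are illustrative assumptions, not part of the project):
from pandas import DataFrame
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score

from src.models.interface import InterfaceModel  # hypothetical import path for the interface


class LogReg(InterfaceModel):
    def __init__(self, **kwargs):
        # Supervised classification model; kwargs are passed straight to scikit-learn
        self.model_type = 'supervised'
        self.model = LogisticRegression(**kwargs)

    def fit(self, dataset) -> None:
        self.model.fit(dataset.x_train, dataset.y_train)

    def predict(self, dataset) -> DataFrame:
        return DataFrame(self.model.predict(dataset.x_test))

    def evaluate(self, dataset, preds) -> None:
        # preds may be a single-column dataframe; squeeze it to a 1-D series of labels.
        # In the project this report would be written to the output directory.
        print(classification_report(dataset.y_test, DataFrame(preds).squeeze()))

    def score(self, dataset) -> float:
        # A single score on the test split, e.g. the macro-averaged F1-score
        preds = self.model.predict(dataset.x_test)
        return f1_score(dataset.y_test, preds, average='macro')

    def get_class_names(self, dataset) -> list:
        return list(self.model.classes_)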
Add a new YAML file in the directory configs/model.
It should target the new model class you created and list the hyperparameters of the model.
The randomforest config looks like this:
_target_: src.models.classification.RandomForest #name of class
verbose: 0
n_estimators: 10
criterion: "gini"
max_depth: null #null converted to None by Hydra
max_features: "sqrt"
max_leaf_nodes: null
n_jobs: null
class_weight: null
Analysis functions all follow the same interface:
def new_analysis_func(dataset: InterfaceData, model: InterfaceModel, **kwargs):
    ...
Each analysis function should accept a dataset object, a model object and any additional arguments that are needed. To add a new function, define it using the interface described above; src.explain already contains a number of analysis functions. A sketch of a possible analysis function follows below.
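For illustration, a hypothetical analysis function could plot a confusion matrix for a supervised model. The function name is an assumption, and it assumes predict() returns a single dataframe of labels:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay


def confusion_matrix_plot(dataset, model, out_file: str = 'confusion_matrix.png', **kwargs):
    # Predictions on the test split; squeeze a single-column dataframe to 1-D labels
    preds = model.predict(dataset).squeeze()
    ConfusionMatrixDisplay.from_predictions(dataset.y_test, preds, **kwargs)
    # Save the figure in the current working directory, which with Hydra's defaults
    # is typically the per-run output folder
    plt.savefig(out_file)
    plt.close()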
Add the analysis function to the config of a model under configs/analysis/<model_name>.yaml.
It should target your newly made analysis function with _target_.
It should also contain a dataset and a model argument set to ???; these are filled in by the dataset and model objects chosen for the experiment. If the function expects any additional arguments, you can add them under the model argument.
For example, the randomforest config contains a function that explains feature importance with the SHAP library.
shaptotal can be any chosen name and will be displayed in the logs.
shaptotal:
  _target_: src.explain.shap_importance
  dataset: ???
  model: ???
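If a function takes extra arguments, they can be listed alongside the dataset and model keys; for the hypothetical confusion-matrix function sketched earlier this could look like:
confusionmatrix:
  _target_: src.explain.confusion_matrix_plot
  dataset: ???
  model: ???
  out_file: 'confusion_matrix.png'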
Datasets
- nprobe UNSW'15 https://staff.itee.uq.edu.au/marius/NIDS_datasets/#RA4
- tshark lemay https://github.com/antoine-lemay/Modbus_dataset
- Proprietary Datasets generated for the GAICIA project, more information link
Papers
- link Providing SCADA Network Data Sets for Intrusion Detection Research. (A. Lemay & J. M. Fernandez)
- link Evaluation of machine learning based anomaly detection algorithms on Modbus/TCP data set. (S. D. Anton & S. Kanoor)
- link NetFlow Datasets for Machine Learning-based Network Intrusion Detection Systems. (M. Sarhan & S. Layeghy)