Getting Started


To run ADP, execute the following command:

python dmonadp.py <args>

There are currently two ways of configuring ADP: through command line arguments or through the configuration file.

Command Line arguments

$ python dmonadp.py -h
  • h -> This argument will list a short help message detailing basic usage of ADP
$ python dmonadp.py -f <file_location>
  • f -> This argument will ensure that the selected configuration file is loaded.
$ python dmonadp.py -e <es_endpoint>
  • e -> This argument sets the Elasticsearch endpoint.

NOTE: In future versions ADP will be integrated with DMon and will be able to query the DMon query endpoint, not just the Elasticsearch one.
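
For example, assuming a configuration file named dmonadp.ini (the file name is a placeholder), ADP would be started as:

$ python dmonadp.py -f dmonadp.ini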

$ python dmonadp.py -a <es_query> -t -m <method_name> -v <folds> -x <model_name>
  • a -> This represents the query to be issued to Elasticsearch. The resulting data will be used for training. The query is a standard Elasticsearch query that also contains the desired timeframe for the data.

  • t -> This instructs ADP to initiate the training of the predictive model.

  • m -> This represents the name of the method used to create the predictive model.

  • v -> This instructs ADP to run cross-validation on the selected model for a set number of folds.

  • x -> This allows exporting the predictive model in PMML format.

NOTE: The last two arguments, v and x, are optional.
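
Putting these arguments together, a training run might look like the following sketch. The method name skm is one of the methods listed in the Detect section below; the model name mymodel is a placeholder, and the query is assumed to be passed as a standard Elasticsearch JSON query string (the exact format expected by -a may differ):

$ python dmonadp.py -a '{"query":{"range":{"@timestamp":{"gte":"now-2h"}}}}' -t -m skm -v 10 -x mymodel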

$ python dmonadp.py -a <query> -d <model_name>
  • d -> This enables the detection of anomalies using a specified pre-trained predictive model (identified by its name).
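
A matching detection run, reusing the model exported above (query and model name are again placeholders), might look like this:

$ python dmonadp.py -a '{"query":{"range":{"@timestamp":{"gte":"now-15m"}}}}' -d mymodel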

Configuration File

The configuration file allows the definition of all of the arguments already listed. Here is an example:

[Connector]
ESEndpoint:85.120.206.27
ESPort:9200
DMonPort:5001
From:1479105362284
To:1479119769978
Query:yarn:cluster, nn, nm, dfs, dn, mr;system
Nodes:
QSize:0
QInterval:10s

[Mode]
Training:true
Validate:false
Detect:false

[Filter]
Columns:colname;colname2;colname3
Rows:ld:145607979;gd:145607979
DColumns:colname;colname2;colname3


[Detect]
Method:skm
Type:clustering
Export:test1
Load:test1

[MethodSettings]
n:10
s:10


[Point]
Memory: cached:gd:231313;buffered:ld:312123;used:ld:12313;free:gd:23123
Load: shortterm:gd:2.0;midterm:ld:0.1;longterm:gd:1.0
Network: tx:gd:34344;rx:ld:323434

[Misc]
heap:512m
checkpoint:false
delay:2m
interval:15m
resetindex:false

Connector

The Connector section sets the parameters used for connecting to and querying DMON:

  • ESEndpoint -> sets the current endpoint for DMON; it can also be a list if DMON uses more than one Elasticsearch instance
  • ESPort -> sets the port for the elasticsearch instances (NOTE: Only used for development and testing)
  • DMonPort -> sets the port for DMON
  • From -> sets the first timestamp for query (NOTE: Can use time arithmetic of the form "now-2h")
  • To -> sets the second timestamp for query
  • Query -> defines which metric contexts to query from DMON; each metric context is divided into subfields as follows:
    • yarn-> cluster, nn, nm, dfs, dn, mr
    • system -> memory, load, network
    • spark -> spark metrics
    • storm -> storm metrics
  • Nodes -> list of desired nodes; if nothing is specified, all available nodes are used
  • QSize -> sets the query size (number of instances); if set to 0, no limit is applied
  • QInterval -> sets the aggregation interval

NOTE: Each large context is delimited by ";" while each subfield is delimited by ",". Also, QInterval must be set to the largest value if the query contains both system and platform-specific metrics. If the values are too far apart it may cause issues while merging the metrics!
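
As an illustration of these delimiter rules, a hypothetical query for only the NodeManager and DataNode contexts plus all system metrics, aggregated over a common 15s interval, would be:

[Connector]
Query:yarn:nm,dn;system
QInterval:15s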

Mode

The Mode section defines the mode in which ADP will run. The options are as follows:

  • Training -> If set to True the selected method will be trained using metrics collected from DMON
  • Validate -> If set to True the trained methods are compared and validated
  • Detect -> If set to True the trained model is used to decide if a given data instance is an anomaly or not.
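
For example, to run detection only, using a model trained and exported in an earlier run (named via the Load option in the Detect section), the Mode section would be:

[Mode]
Training:false
Validate:false
Detect:true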

Filter

The Filter section is used to filter the collected data. The options are as follows:

  • Columns -> Defines the columns to be used during training and/or detecting. Columns are delimited by ";"
  • Rows -> Defines the minimum (using ld) and maximum (using gd) values of the metrics. The time format used is UTC.
  • DColumns -> Defines the columns to be removed from the dataset.
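
A sketch of a Filter section with hypothetical column names; the row bounds are illustrative and follow the same format as the example configuration above:

[Filter]
Columns:memory_cached;memory_used;load_shortterm
Rows:ld:145607979;gd:145608979
DColumns:network_tx;network_rx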

Detect

The Detect section is used for selecting the anomaly detection methods for both training and detecting as follows:

  • Method -> sets the desired anomaly detection method to be used (available 'skm', 'em', 'dbscan')
  • Type -> the type of anomaly detection method (currently clustering)
  • Export -> name of the exported predictive/clustering model
  • Load -> name of the predictive/clustering model to be loaded

NOTE: Currently we support only clustering methods: Simple KMeans, Expectation Maximization and DBSCAN. If Export and Load are set to the same value, then once a model is trained it will be automatically loaded and used to detect anomalies.
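
For example, to train an Expectation Maximization model and have it used immediately for detection (with Training and Detect both enabled in the Mode section), Export and Load can be set to the same placeholder name:

[Detect]
Method:em
Type:clustering
Export:emtest1
Load:emtest1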

Point

The Point section is used to set threshold values for memory, load and network metrics to be used during point anomaly detection. This type of anomaly detection runs even if Training and Detect are set to False.
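
A sketch of an alternative Point section; the threshold values are purely illustrative and follow the subfield:bound:value format of the example configuration above:

[Point]
Memory: used:ld:16000000;free:gd:100000
Load: shortterm:gd:4.0;longterm:gd:2.0
Network: tx:gd:50000;rx:ld:10000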

Misc

The Misc section is used to set miscellaneous settings, which are as follows:

  • heap -> sets the maximum heap space value
  • checkpoint -> If set to false, all metrics will be saved as csv files into the data directory, otherwise all data will be kept in memory. It is important to note that if the data is kept in memory, processing will be much faster.
  • delay -> sets the query delay for point anomalies
  • interval -> sets the query interval for complex anomalies
  • resetindex -> if set to True, the anomalies index will be reset and all previously detected anomalies will be deleted

Method settings

The MethodSettings section of the configuration file allows the setting of different parameters of the chosen training method. These parameters can't be set using the command line arguments.
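
For example, when Simple KMeans (skm) is selected in the Detect section, the MethodSettings from the example configuration above apply; presumably n and s correspond to Weka SimpleKMeans' number of clusters and random seed, though that mapping is an assumption:

[Detect]
Method:skm
Type:clustering

[MethodSettings]
n:10
s:10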

DBSCAN

DBSCAN is a density-based data clustering algorithm that marks outliers based on the density of the region they are located in. For this algorithm we support two versions, one from Weka and one from scikit-learn, with the following method settings:

DBSCAN Weka

  • E -> epsilon which denotes the maximum distance between two samples for them to be considered as in the same neighborhood (default 0.9)
  • M -> the number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself (default 6)
  • D -> distance measure (default weka.clusterers.forOPTICSAndDBScan.DataObjects.EuclideanDataObject)
  • I -> index (default weka.clusterers.forOPTICSAndDBScan.Databases.SequentialDatabase)
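
A MethodSettings fragment for the Weka version, filled in with the defaults listed above:

[MethodSettings]
E:0.9
M:6
D:weka.clusterers.forOPTICSAndDBScan.DataObjects.EuclideanDataObject
I:weka.clusterers.forOPTICSAndDBScan.Databases.SequentialDatabase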

DBSCAN scikit-learn

  • eps -> epsilon which denotes the maximum distance between two samples for them to be considered as in the same neighborhood (default 0.5)
  • min_samples -> The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself (default 5)
  • metric -> metric used when calculating distance between instances in a feature array (default euclidean)
  • algorithm -> the algorithm used by the nearest-neighbour module to compute pointwise distance and find nearest neighbour (default auto)
  • leaf_size -> leaf size passed to BallTree or cKDTree, this can affect the speed of the construction and query, as well as the memory required to store the tree (default 30)
  • p -> the power of the Minkowski metric used to calculate distance between points (default None)
  • n_jobs -> the number of parallel jobs to run (default 1, if -1 all cores are used)
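
A MethodSettings fragment for the scikit-learn version, using the defaults listed above except for n_jobs, which is set to -1 here to use all cores:

[MethodSettings]
eps:0.5
min_samples:5
metric:euclidean
algorithm:auto
leaf_size:30
n_jobs:-1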

IsolationForest

The IsolationForest 'isolates' observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node. This path length, averaged over a forest of such random trees, is a measure of normality and our decision function. Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produce shorter path lengths for particular samples, they are highly likely to be anomalies.

Official Documentation

  • n_estimators -> number of base estimators in the ensemble (default 100)
  • max_samples -> number of samples to draw to train each base estimator (default auto)
  • contamination -> the amount of contamination of the dataset, used when fitting to define the threshold on the decision function (default 0.1)
  • max_features -> the number of features to draw to train each base estimator (default 1.0)
  • bootstrap -> if true, individual trees are fit on random subsets of the training data sampled with replacement; if false, sampling without replacement is performed (default false)
  • n_jobs -> the number of jobs to run in parallel for both fit and predict, (default 1, if -1 all cores are used)
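
A MethodSettings sketch for IsolationForest, using the defaults listed above apart from an illustrative, lower contamination value:

[MethodSettings]
n_estimators:100
max_samples:auto
contamination:0.05
max_features:1.0
bootstrap:false
n_jobs:-1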

NOTE

This tool is still a work in progress. All commands and their behaviours are subject to change. Please consult the repository changelog to see any significant changes.