
Getting Started

Gabriel Iuhasz edited this page Nov 21, 2016 · 18 revisions

To run ADP, execute the following command:

python dmonadp.py <args>

There are currently two ways of configuring ADT: via command line arguments and via the configuration file.

#Command Line arguments

$ python dmonadp.py -h
  • h -> This argument will list a short help message detailing basic usage of ADT
$ python dmonadp.py -f <file_location>
  • f -> This argument will ensure that the selected configuration file is loaded.
$ python dmonadp.py -e <es_endpoint>
  • e -> This argument sets the Elasticsearch endpoint.

NOTE: In future versions ADT will be integrated with DMon and will be able to query the DMon query endpoint, not just the Elasticsearch one.

$ python dmonadp.py -a <es_query> -t -m <method_name> -v <folds> -x <model_name>
  • a -> This represents the query to be issued to Elasticsearch. The resulting data will be used for training. The query is a standard Elasticsearch query which also contains the desired timeframe for the data.

  • t -> This instructs ADT to initiate the training of the predictive model.

  • m -> This represents the name of the method used to create the predictive model.

  • v -> This instructs ADT to run cross validation on the selected model for a set of defined folds.

  • x -> This allows the exporting of the predictive model in PMML format.

NOTE: The last two arguments, v and x, are optional.

$ python dmonadp.py -a <query> -d <model_name>
  • d -> This enables the detection of anomalies using a specified pre-trained predictive model (identified by its name).

#Configuration File

The configuration file allows the definition of all of the arguments already listed. Here is an example:

[Connector]
ESEndpoint:85.120.206.27
ESPort:9200
DMonPort:5001
From:1479105362284
To:1479119769978
Query:yarn:cluster, nn, nm, dfs, dn, mr;system
Nodes:
QSize:0
QInterval:10s

[Mode]
Training:true
Validate:false
Detect:false

[Filter]
Columns:colname;colname2;colname3
Rows:ld:145607979;gd:145607979
DColumns:colname;colname2;colname3


[Detect]
Method:skm
Type:clustering
Export:test1
Load:test1

[MethodSettings]
n:10
s:10


[Point]
Memory: cached:gd:231313;buffered:ld:312123;used:ld:12313;free:gd:23123
Load: shortterm:gd:2.0;midterm:ld:0.1;longterm:gd:1.0
Network: tx:gd:34344;rx:ld:323434

[Misc]
heap:512m
checkpoint:false
delay:2m
interval:15m
resetindex:false

##Connector

The Connector section sets the parameters for use in connecting to and querying DMON:

  • ESEndpoint -> sets the current endpoint for DMON; it can also be given as a list if DMON uses more than one Elasticsearch instance

  • ESPort -> sets the port for the elasticsearch instances (NOTE: Only used for development and testing)

  • DMonPort -> sets the port for DMON

  • From -> sets the first timestamp for query (NOTE: Can use time arithmetic of the form "now-2h")

  • To -> sets the second timestamp for query

  • Query -> defines which metric contexts to query from DMON; each metric context is divided into subfields as follows:

    • yarn -> cluster, nn, nm, dfs, dn, mr
    • system -> memory, load, network
    • spark -> not in this version (v0.1.0)
    • storm -> not in this version (v0.1.0)

  • Nodes -> list of desired nodes; if nothing is specified, all available nodes are used

  • QSize -> sets the query size (number of instances), if set to 0 then no limit is set

  • QInterval -> sets aggregation interval

NOTE: Each large context is delimited by ";" while each subfield is delimited by ",". Also, QInterval must be set to the largest value if the query contains both system and platform-specific metrics. If the values are too far apart it may cause issues while merging the metrics!
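
The delimiter rules above can be sketched in Python. This is a hypothetical helper, not part of ADT, assuming contexts are split on ";" and subfields on ",":

```python
def parse_query(query):
    """Split an ADT Query string into metric contexts and their subfields.

    Contexts are delimited by ';', subfields by ','. A context without
    subfields (e.g. 'system') maps to an empty list.
    """
    contexts = {}
    for context in query.split(";"):
        if ":" in context:
            name, subfields = context.split(":", 1)
            contexts[name.strip()] = [s.strip() for s in subfields.split(",")]
        else:
            contexts[context.strip()] = []
    return contexts

# The Query value from the example configuration file above:
print(parse_query("yarn:cluster, nn, nm, dfs, dn, mr;system"))
# {'yarn': ['cluster', 'nn', 'nm', 'dfs', 'dn', 'mr'], 'system': []}
```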

##Mode

The Mode section defines the mode in which ADP will run. The options are as follows:

  • Training -> If set to True the selected method will be trained using metrics collected from DMON
  • Validate -> If set to True the trained methods are compared and validated
  • Detect -> If set to True the trained model is used to decide if a given data instance is an anomaly or not.

##Filter

The Filter section is used to filter the collected data. The options are as follows:

  • Columns -> Defines the columns to be used during training and/or detecting. Columns are delimited by ";"
  • Rows -> Defines the minimum (using ld) and maximum (using gd) timestamps for the metrics. The time format used is UTC.
  • DColumns -> Defines the columns to be removed from the dataset.
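
One possible reading of these filter semantics, sketched on a list of metric dicts. The timestamp column name ("key") and the function itself are assumptions for illustration, not part of ADT:

```python
def filter_rows(rows, columns=None, dcolumns=None, ld=None, gd=None):
    """Sketch of the Filter section semantics.

    columns  -- if given, keep only these columns
    dcolumns -- if given, drop these columns
    ld / gd  -- minimum / maximum UTC timestamp, as described above
    """
    out = []
    for row in rows:
        ts = row.get("key")  # assumed name of the timestamp column
        if ld is not None and ts < ld:
            continue
        if gd is not None and ts > gd:
            continue
        if columns is not None:
            row = {c: row[c] for c in columns if c in row}
        if dcolumns is not None:
            row = {c: v for c, v in row.items() if c not in dcolumns}
        out.append(row)
    return out

rows = [{"key": 100, "a": 1, "b": 2}, {"key": 200, "a": 3, "b": 4}]
print(filter_rows(rows, dcolumns=["b"], ld=150))
# [{'key': 200, 'a': 3}]
```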

##Detect

The Detect section is used for selecting the anomaly detection methods for both training and detecting as follows:

  • Method -> sets the desired anomaly detection method to be used (available 'skm', 'em', 'dbscan')
  • Type -> type of anomaly detection
  • Export -> name of the exported predictive/clustering model
  • Load -> name of the predictive/clustering model to be loaded

NOTE: Currently only clustering methods are supported: Simple KMeans, Expectation Maximization and DBSCAN. If Export and Load are set to the same value, then once a model is trained it will be automatically loaded and used to detect anomalies.
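
The train-export-load cycle described above can be sketched with scikit-learn's KMeans ('skm' stands for Simple KMeans). Note that ADT exports PMML; the pickle round trip and the model name path here are only illustrative assumptions:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of points as stand-in training data.
X = np.array([[0.0, 0.0], [0.2, 0.1], [9.9, 10.0], [10.1, 10.0]])

# Train ('Training:true') and export under the model name 'test1'.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
path = os.path.join(tempfile.gettempdir(), "test1.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Because Load names the same model, it is read back and used for detection.
with open(path, "rb") as f:
    loaded = pickle.load(f)
labels = loaded.predict(np.array([[0.1, 0.1], [10.0, 10.0]]))
```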

##Point

The Point section is used to set threshold values for memory, load and network metrics, used during point anomaly detection. This type of anomaly detection runs even if Training and Detect are set to false.
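
A minimal sketch of how such threshold lines could be parsed and applied, assuming 'gd' flags values greater than the threshold and 'ld' values less than it (the helpers are hypothetical, not ADT code):

```python
def parse_point(spec):
    """Parse a Point line such as 'cached:gd:231313;used:ld:12313'
    into {metric: (op, threshold)}."""
    rules = {}
    for part in spec.split(";"):
        metric, op, value = part.split(":")
        rules[metric] = (op, float(value))
    return rules

def is_point_anomaly(sample, rules):
    """Flag a sample dict if any of its metrics crosses its threshold."""
    for metric, (op, threshold) in rules.items():
        value = sample.get(metric)
        if value is None:
            continue
        if op == "gd" and value > threshold:
            return True
        if op == "ld" and value < threshold:
            return True
    return False

rules = parse_point("shortterm:gd:2.0;midterm:ld:0.1")
print(is_point_anomaly({"shortterm": 3.5}, rules))  # True
```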

##Misc

The Misc section is used to set miscellaneous settings, which are as follows:

  • heap -> sets the maximum heap space value
  • checkpoint -> If set to false, all metrics will be saved as csv files in the data directory; otherwise all data will be kept in memory. Note that if the data is kept in memory, processing will be much faster.
  • delay -> sets the query delay for point anomalies
  • interval -> sets the query interval for complex anomalies
  • resetindex -> if set to true, the anomalies index will be reset and all previously detected anomalies will be deleted

##Method Settings

The MethodSettings section of the configuration file allows setting different parameters of the chosen training method. These parameters cannot be set using the command line arguments.

DBSCAN is a density-based data clustering algorithm that marks outliers based on the density of the region in which they are located. Two versions of this algorithm are supported, with the following method settings:

##DBSCAN Weka

  • E -> epsilon which denotes the maximum distance between two samples for them to be considered as in the same neighborhood (default 0.9)
  • M -> the number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself (default 6)
  • D -> distance measure (default weka.clusterers.forOPTICSAndDBScan.DataObjects.EuclideanDataObject)
  • I -> index (default weka.clusterers.forOPTICSAndDBScan.Databases.SequentialDatabase)

##DBSCAN scikit-learn

  • eps -> epsilon which denotes the maximum distance between two samples for them to be considered as in the same neighborhood (default 0.5)
  • min_samples -> The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself (default 5)
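
The scikit-learn defaults documented above can be exercised directly; DBSCAN labels outliers as -1. A minimal sketch, independent of ADT:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Six tightly packed points plus one far-away outlier.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [0.1, 0.1], [0.05, 0.05], [0.0, 0.05],
              [10.0, 10.0]])

# eps and min_samples match the scikit-learn defaults listed above.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(labels)  # the last point is labelled -1 (noise)
```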

NOTE

This tool is still a work in progress. All commands and their behaviours are subject to change. Please consult the repository changelog for any significant changes.
