Update with riskloc and more baselines
Showing 30 changed files with 2,541 additions and 2 deletions.
@@ -1,2 +1,112 @@
# multi-dim-baselines
Baselines for multi-dimensional RCA
# RiskLoc
Code for the paper RiskLoc: Localization of Multi-dimensional Root Causes by Weighted Risk ([link](https://arxiv.org/abs/2205.10004)).
Contains the implementation of RiskLoc and all baseline multi-dimensional root cause localization methods.

## Requirements
- pandas
- numpy
- scipy
- kneed (for squeeze)
- loguru (for squeeze)
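These can be installed with pip, for example (a minimal sketch assuming a standard Python 3 environment; kneed and loguru are only needed for the squeeze baseline):
```
pip install pandas numpy scipy kneed loguru
```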
## How to run

To run, use the `run.py` file. There are two options: run a single file, or run all files in a directory (including all of its subdirectories).

Example of running a single file using riskloc in debug mode:
```
python run.py riskloc --run-path /data/B0/B_cuboid_layer_1_n_ele_1/1450653900.csv --debug
```

Example of running all files in a particular setting for a dataset (with `--derived` set to True):
```
python run.py riskloc --run-path /data/D/B_cuboid_layer_3_n_ele_3 --derived
```

Example of running all files in a dataset:
```
python run.py riskloc --run-path /data/B0
```

Example of running all datasets with 20 threads:
```
python run.py riskloc --n-threads 20
```

Changing `riskloc` to any of the other supported algorithms will run that algorithm instead; see the list below.
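For example, running the adtributor baseline on the B0 dataset would look like this (an illustrative command; `adtributor` and `--run-path` appear in the algorithm list and shared options below):
```
python run.py adtributor --run-path /data/B0
```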

## Algorithms
The supported algorithms are:
```
$ python run.py --help
usage: run.py [-h] {riskloc,autoroot,squeeze,old squeeze,hotspot,r_adtributor,adtributor} ...

RiskLoc

positional arguments:
  {riskloc,autoroot,squeeze,old squeeze,hotspot,r_adtributor,adtributor}
                    algorithm specific help
    riskloc         riskloc help
    autoroot        autoroot help
    squeeze         squeeze help
    hotspot         autoroot help
    r_adtributor    r_adtributor help
    adtributor      adtributor help

optional arguments:
  -h, --help        show this help message and exit
```
The code for Squeeze is adapted from the code released with the original publication: https://github.com/NetManAIOps/Squeeze.

To see the algorithm-specific arguments, run `python run.py 'algorithm' --help`. For example, for RiskLoc:
```
$ python run.py riskloc --help
usage: run.py riskloc [-h] [--data-root DATA_ROOT] [--run-path RUN_PATH] [--derived [DERIVED]] [--n-threads N_THREADS] [--output-suffix OUTPUT_SUFFIX] [--debug [DEBUG]] [--risk-threshold RISK_THRESHOLD] [--pep-threshold PEP_THRESHOLD] [--prune-elements [PRUNE_ELEMENTS]]

optional arguments:
  -h, --help                         show this help message and exit
  --data-root DATA_ROOT              root directory for all datasets (default ./data/)
  --run-path RUN_PATH                directory or file to be run;
                                     if a directory, any subdirectories will be considered as well;
                                     must contain data-path as a prefix
  --derived [DERIVED]                derived dataset (defaults to True for the D dataset and False for others)
  --n-threads N_THREADS              number of threads to run
  --output-suffix OUTPUT_SUFFIX      suffix for output file
  --debug [DEBUG]                    debug mode
  --risk-threshold RISK_THRESHOLD    risk threshold
  --pep-threshold PEP_THRESHOLD      proportional explanatory power threshold
  --prune-elements [PRUNE_ELEMENTS]  use element pruning (True/False)
```

The `risk-threshold` and `pep-threshold` arguments are specific to RiskLoc, while the rest are shared by all algorithms. To see the algorithm-specific arguments for other algorithms, simply run them with the `--help` flag or check the code in `run.py`.

## Datasets
The semi-synthetic datasets can be downloaded from: https://github.com/NetManAIOps/Squeeze.
To run these, place them within the data/ directory and name them: A, B0, B1, B2, B3, B4, and D, respectively.
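For reference, a possible on-disk layout after downloading (illustrative only, assuming the default `--data-root ./data/` and showing the file used in the run examples above):
```
data/
├── A/
├── B0/
│   └── B_cuboid_layer_1_n_ele_1/
│       └── 1450653900.csv
├── B1/  B2/  B3/  B4/
└── D/
```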

The three synthetic datasets used in the paper can be generated using `generate_dataset.py` as follows.

S dataset:
```
python generate_dataset.py --num 1000 --dataset-name S --seed 121
```
L dataset:
```
python generate_dataset.py --num 1000 --dataset-name L --seed 122 --dims 10 24 10 15 --noise-level 0.0 0.1 --anomaly-severity 0.5 1.0 --anomaly-deviation 0.0 0.0 --num-anomaly 1 5 --num-anomaly-elements 1 1 --only-last-layer
```
H dataset:
```
python generate_dataset.py --num 100 --dataset-name H --seed 123 --dims 10 5 250 20 8 12
```

In addition, new, interesting datasets can be created using `generate_dataset.py` for extended empirical verification and research purposes. The supported input arguments can be found at the beginning of the `generate_dataset.py` file or by using the `--help` flag.
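As an illustration, a small custom dataset could be generated by varying the same arguments used for the L and H datasets above (the values here are arbitrary and only meant as an example):
```
python generate_dataset.py --num 200 --dataset-name custom --seed 42 --dims 8 12 6 --num-anomaly 1 3
```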

## Citation
```
@article{riskloc,
  title={RiskLoc: Localization of Multi-dimensional Root Causes by Weighted Risk},
  author={Kalander, Marcus},
  journal={arXiv preprint arXiv:2205.10004},
  year={2022}
}
```
@@ -0,0 +1,43 @@
import numpy as np
import pandas as pd
from utils.element_scores import add_explanatory_power, add_surpise


def merge_dimensions(df, dimensions, derived):
    # Aggregate the data per dimension so that each attribute value becomes a single 'element' row.
    elements = pd.DataFrame(columns=list(set(df.columns) - set(dimensions)), dtype=float)
    for d in dimensions:
        dim = df.groupby(d).sum().reset_index()
        dim['element'] = dim[d]
        dim['dimension'] = d
        dim = dim.drop(columns=d)
        elements = pd.concat([elements, dim], axis=0, sort=False)

    if derived:
        # Derived measures are computed as the ratio of two fundamental measures.
        elements['predict'] = elements['predict_a'] / elements['predict_b']
        elements['real'] = elements['real_a'] / elements['real_b']

    elements = elements.reset_index(drop=True)
    return elements


def adtributor(df, dimensions, teep=0.1, tep=0.1, k=3, derived=False):
    elements = merge_dimensions(df, dimensions, derived)
    elements = add_explanatory_power(elements, derived)
    elements = add_surpise(elements, derived, merged_divide=len(dimensions))

    candidate_set = []
    for d in dimensions:
        dim_elems = elements.loc[elements['dimension'] == d].set_index('element')
        dim_elems = dim_elems.sort_values('surprise', ascending=False)
        # Keep elements whose explanatory power exceeds teep and accumulate until tep is exceeded.
        cumulative_ep = dim_elems.loc[dim_elems['ep'] > teep, 'ep'].cumsum()
        if np.any(cumulative_ep > tep):
            idx = (cumulative_ep > tep).idxmax()
            candidate = {'elements': cumulative_ep[:idx].index.values.tolist(),
                         'explanatory_power': cumulative_ep[idx],
                         'surprise': dim_elems.loc[:idx, 'surprise'].sum(),
                         'dimension': d}
            candidate_set.append(candidate)

    # Sort by surprise and return the top k
    candidate_set = sorted(candidate_set, key=lambda t: t['surprise'], reverse=True)[:k]
    return candidate_set
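A minimal usage sketch for this function (illustrative only; it assumes a non-derived DataFrame with `real` and `predict` columns plus one column per dimension, that this file is importable as `adtributor`, and that `utils.element_scores` from this repository is on the path):
```
import pandas as pd
from adtributor import adtributor  # hypothetical import path for this file

# Toy data: two dimensions with an anomaly concentrated in region 'eu'.
df = pd.DataFrame({
    'region':  ['eu', 'eu', 'us', 'us'],
    'device':  ['mobile', 'desktop', 'mobile', 'desktop'],
    'predict': [100.0, 100.0, 100.0, 100.0],
    'real':    [10.0, 20.0, 95.0, 105.0],
})

candidates = adtributor(df, dimensions=['region', 'device'], teep=0.1, tep=0.1, k=3)
print(candidates)  # list of dicts with 'elements', 'explanatory_power', 'surprise', 'dimension'
```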
@@ -0,0 +1,170 @@
import numpy as np
import pandas as pd
from itertools import combinations
from utils.element_scores import add_deviation_score
from scipy.stats import gaussian_kde
from scipy.signal import argrelextrema


def get_unique_elements(df, cuboid):
    return np.vstack(list({tuple(row) for row in df[cuboid].values}))


def get_elements_mask(df, cuboid, elements):
    return np.logical_and.reduce(np.logical_or.reduce([(df[cuboid] == e).values for e in elements], axis=0), axis=1)


def nps(selection, non_selection):
    sel_real, sel_pred = selection['real'], selection['predict']
    non_sel_real, non_sel_pred = non_selection['real'], non_selection['predict']

    with np.errstate(divide='ignore', invalid='ignore'):
        selection_a = np.nan_to_num(sel_pred * (sel_real.sum() / sel_pred.sum()))

        a = np.mean(np.nan_to_num(np.abs(sel_real - selection_a) / sel_real, posinf=0, neginf=0, nan=0))
        b = np.mean(np.nan_to_num(np.abs(sel_real - sel_pred) / sel_real, posinf=0, neginf=0, nan=0))
        c = np.mean(np.nan_to_num(np.abs(non_sel_real - non_sel_pred) / non_sel_real, posinf=0, neginf=0, nan=0))
    return 1 - ((a + c) / (b + c))


def kde_clustering(df):
    values = df['deviation'].values

    if len(np.unique(values)) == 1:
        df['cluster'] = 1
        return df

    # Estimate the density of the deviation scores; cluster boundaries are its local minima.
    kernel = gaussian_kde(values, bw_method='silverman')

    s = np.linspace(-2, 2, 400)
    e = kernel.evaluate(s)
    mi = argrelextrema(e, np.less)[0]

    # All ends in reverse order
    ends = sorted(np.concatenate((s[mi], [np.inf])), reverse=True)
    for i, end in enumerate(ends):
        df.loc[df['deviation'] <= end, 'cluster'] = i
    return df


def is_subset(parent, child):
    return all([any([p.issubset(c) for p in parent]) for c in child])


def remove_crc(cluster_root_causes, elem_to_remove):
    def filter_crc(crc):
        root_cause_set = set([frozenset(elem) for elem in crc['elements']])
        return root_cause_set == elem_to_remove

    return [crc for crc in cluster_root_causes if not filter_crc(crc)]


def remove_same_layer(cluster_root_causes):
    # Merge if exactly the same root cause
    duplicates = []
    for p, c in combinations(enumerate(cluster_root_causes), 2):
        if p[1]['layer'] == c[1]['layer']:
            parent_set = set([frozenset(elems) for elems in p[1]['elements']])
            child_set = set([frozenset(elems) for elems in c[1]['elements']])
            if is_subset(parent_set, child_set):
                duplicates.append(p[0])
    mask = np.full(len(cluster_root_causes), True, dtype=bool)
    mask[duplicates] = False
    cluster_root_causes = np.array(cluster_root_causes)[mask].tolist()
    return cluster_root_causes


def merge_root_causes(cluster_root_causes, max_layer=4):
    cluster_root_causes = remove_same_layer(cluster_root_causes)

    # Drop higher-layer root causes that are subsumed by a root cause at a lower layer.
    for layer in range(max_layer - 1, 0, -1):
        layer_root_causes = [set([frozenset(elems) for elems in crc['elements']]) for crc in cluster_root_causes if
                             crc['layer'] == layer]
        higher_layer_root_causes = [set([frozenset(elems) for elems in crc['elements']]) for crc in cluster_root_causes
                                    if crc['layer'] > layer]

        for child in higher_layer_root_causes:
            for parent in layer_root_causes:
                if is_subset(parent, child):
                    print('parent', parent, 'child', child)
                    cluster_root_causes = remove_crc(cluster_root_causes, child)
    return cluster_root_causes


def search_cluster(df, df_cluster, attributes, delta_threshold, debug=False):
    z = len(df_cluster)

    best_root_cause = {'avg': -1.0}
    for layer in range(1, len(attributes) + 1):
        if debug: print('Layer:', layer)
        cuboids = [list(c) for c in combinations(attributes, layer)]
        for cuboid in cuboids:
            if debug: print('Cuboid:', cuboid)

            # Way too many to go through. This is probably not what is done.
            # elements = get_unique_elements(df_cluster, cuboid)
            # splits = [t for r in range(1, len(elements) + 1) for t in list(combinations(elements, r))]

            # if last layer, we only run if CF can be above the threshold
            best_candidate = {'NPS': -1.0}
            if layer == len(attributes):
                CF = 1 / len(df_cluster)
                if CF <= delta_threshold:
                    continue

            xs = df_cluster.groupby(cuboid)['real'].count()
            xs = xs.loc[(xs / z) > delta_threshold]
            xs.name = 'x'

            ys = df.groupby(cuboid)['real'].count()
            ys.name = 'y'
            splits = pd.concat([xs, ys], axis=1, join='inner')
            splits['LF'] = splits['x'] / splits['y']
            splits = splits.loc[splits['LF'] > delta_threshold]

            for s, row in splits.iterrows():
                split = [s] if layer == 1 else s
                mask = get_elements_mask(df, cuboid, split)

                selection = df.loc[mask]
                non_selection = df.loc[~mask]
                nps_score = nps(selection, non_selection)
                if nps_score > best_candidate['NPS']:
                    CF = row['x'] / z
                    avg_score = (nps_score + row['LF'] + CF) / 3
                    candidate = {'elements': [split], 'layer': layer, 'cuboid': cuboid,
                                 'LF': row['LF'], 'CF': CF, 'NPS': nps_score, 'avg': avg_score}
                    best_candidate = candidate.copy()

            if 'elements' in best_candidate and best_candidate['avg'] > best_root_cause['avg']:
                best_root_cause = best_candidate.copy()

    if 'elements' not in best_root_cause:
        return None
    return best_root_cause


def autoroot(df, attributes, delta_threshold=0.1, debug=False):
    df = add_deviation_score(df)

    # Filter away the uninteresting elements with a score [-0.2,0.2].
    # (The deviation score here uses a multiple 2.)
    df_relevant = df.loc[df['deviation'].abs() > 0.2].copy()

    df_relevant = kde_clustering(df_relevant)
    clusters = df_relevant['cluster'].unique()
    if debug: print('clusters:', clusters)

    cluster_root_causes = []
    for cluster in clusters:
        if debug: print("Cluster:", cluster)
        df_cluster = df_relevant.loc[df_relevant['cluster'] == cluster].copy()

        root_cause = search_cluster(df, df_cluster, attributes, delta_threshold, debug)
        if root_cause is not None:
            root_cause['cluster'] = cluster
            cluster_root_causes.append(root_cause)

    if debug: print('root causes before merge:', cluster_root_causes)
    cluster_root_causes = merge_root_causes(cluster_root_causes, max_layer=len(attributes))
    return cluster_root_causes
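A minimal usage sketch for `autoroot` under the same assumptions as above (a DataFrame with `real` and `predict` columns plus attribute columns, this file importable as `autoroot`, and `utils.element_scores` on the path):
```
from autoroot import autoroot  # hypothetical import path for this file

# df is a DataFrame with columns ['real', 'predict'] plus attribute columns such as ['region', 'device'].
root_causes = autoroot(df, attributes=['region', 'device'], delta_threshold=0.1, debug=True)
for rc in root_causes:
    print(rc['cluster'], rc['layer'], rc['cuboid'], rc['elements'], rc['avg'])
```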