Commit: Update with riskloc and more baselines
shaido987 authored Jun 1, 2022
1 parent b2a35ba commit d000e7c
Showing 30 changed files with 2,541 additions and 2 deletions.
114 changes: 112 additions & 2 deletions README.md
@@ -1,2 +1,112 @@
# multi-dim-baselines
Baselines for multi-dimensional RCA
# RiskLoc
Code for the paper RiskLoc: Localization of Multi-dimensional Root Causes by Weighted Risk ([link](https://arxiv.org/abs/2205.10004)).
Contains the implementation of RiskLoc and all baseline multi-dimensional root cause localization methods.

## Requirements
- pandas
- numpy
- scipy
- kneed (for squeeze)
- loguru (for squeeze)
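
All dependencies are available on PyPI and can be installed with pip, for example:
```
pip install pandas numpy scipy kneed loguru
```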

## How to run

To run, use the `run.py` file. There are two options: run a single file, or run all files in a directory (including all subdirectories).

Example of running a single file using riskloc in debug mode:
```
python run.py riskloc --run-path /data/B0/B_cuboid_layer_1_n_ele_1/1450653900.csv --debug
```

Example of running all files for a particular setting of a dataset (with `--derived` set to True):
```
python run.py riskloc --run-path /data/D/B_cuboid_layer_3_n_ele_3 --derived
```

Example of running all files in a dataset:
```
python run.py riskloc --run-path /data/B0
```

Example of running all datasets with 20 threads:
```
python run.py riskloc --n-threads 20
```

Changing `riskloc` to any of the other supported algorithms (see below) will run that algorithm instead.
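
For instance, a hypothetical invocation of the `squeeze` baseline on the B0 dataset follows the same pattern:
```
python run.py squeeze --run-path /data/B0
```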

## Algorithms
The supported algorithms are:
```
$ python run.py --help
usage: run.py [-h] {riskloc,autoroot,squeeze,old squeeze,hotspot,r_adtributor,adtributor} ...

RiskLoc

positional arguments:
  {riskloc,autoroot,squeeze,old squeeze,hotspot,r_adtributor,adtributor}
                        algorithm specific help
    riskloc             riskloc help
    autoroot            autoroot help
    squeeze             squeeze help
    hotspot             hotspot help
    r_adtributor        r_adtributor help
    adtributor          adtributor help

optional arguments:
  -h, --help            show this help message and exit
```
The code for Squeeze is adapted from the code released by the authors of the original publication: https://github.com/NetManAIOps/Squeeze.

To see the algorithm-specific arguments, run `python run.py 'algorithm' --help`. For example, for RiskLoc:
```
$ python run.py riskloc --help
usage: run.py riskloc [-h] [--data-root DATA_ROOT] [--run-path RUN_PATH] [--derived [DERIVED]] [--n-threads N_THREADS]
                      [--output-suffix OUTPUT_SUFFIX] [--debug [DEBUG]] [--risk-threshold RISK_THRESHOLD]
                      [--pep-threshold PEP_THRESHOLD] [--prune-elements [PRUNE_ELEMENTS]]

optional arguments:
  -h, --help            show this help message and exit
  --data-root DATA_ROOT
                        root directory for all datasets (default ./data/)
  --run-path RUN_PATH   directory or file to be run;
                        if a directory, any subdirectories will be considered as well;
                        must contain data-root as a prefix
  --derived [DERIVED]   derived dataset (defaults to True for the D dataset and False for others)
  --n-threads N_THREADS
                        number of threads to run
  --output-suffix OUTPUT_SUFFIX
                        suffix for output file
  --debug [DEBUG]       debug mode
  --risk-threshold RISK_THRESHOLD
                        risk threshold
  --pep-threshold PEP_THRESHOLD
                        proportional explanatory power threshold
  --prune-elements [PRUNE_ELEMENTS]
                        use element pruning (True/False)
```

The `risk-threshold` and `pep-threshold` arguments are specific to RiskLoc, while the rest are shared by all algorithms. To see the algorithm-specific arguments for other algorithms, run them with the `--help` flag or check the code in `run.py`.
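
As an illustration, the RiskLoc-specific thresholds can be overridden on the command line (the values below are purely illustrative, not the defaults):
```
python run.py riskloc --run-path /data/B0 --risk-threshold 0.5 --pep-threshold 0.02
```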

## Datasets
The semi-synthetic datasets can be downloaded from: https://github.com/NetManAIOps/Squeeze.
To run these, place them within the data/ directory and name them: A, B0, B1, B2, B3, B4, and D, respectively.
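
For reference, the resulting layout would look roughly as follows (the subdirectory and file names are taken from the `--run-path` examples above and are only illustrative):
```
data/
├── A/
├── B0/
│   └── B_cuboid_layer_1_n_ele_1/
│       └── 1450653900.csv
├── B1/
├── B2/
├── B3/
├── B4/
└── D/
```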

The three synthetic datasets used in the paper can be generated using `generate_dataset.py` as follows.

S dataset:
```
python generate_dataset.py --num 1000 --dataset-name S --seed 121
```
L dataset:
```
python generate_dataset.py --num 1000 --dataset-name L --seed 122 --dims 10 24 10 15 --noise-level 0.0 0.1 --anomaly-severity 0.5 1.0 --anomaly-deviation 0.0 0.0 --num-anomaly 1 5 --num-anomaly-elements 1 1 --only-last-layer
```
H dataset:
```
python generate_dataset.py --num 100 --dataset-name H --seed 123 --dims 10 5 250 20 8 12
```

In addition, new, interesting datasets can be created using `generate_dataset.py` for extended empirical verification and research purposes. The supported input arguments can be found at the beginning of the `generate_dataset.py` file or by using the `--help` flag.

## Citation
```
@article{riskloc,
title={RiskLoc: Localization of Multi-dimensional Root Causes by Weighted Risk},
author={Kalander, Marcus},
journal={arXiv preprint arXiv:2205.10004},
year={2022}
}
```
43 changes: 43 additions & 0 deletions algorithms/adtributor.py
@@ -0,0 +1,43 @@
import numpy as np
import pandas as pd
from utils.element_scores import add_explanatory_power, add_surpise


def merge_dimensions(df, dimensions, derived):
    # Aggregate the measure columns per element of every dimension and stack the per-dimension
    # results into a single frame with 'element' and 'dimension' columns.
    elements = pd.DataFrame(columns=list(set(df.columns) - set(dimensions)), dtype=float)
for d in dimensions:
dim = df.groupby(d).sum().reset_index()
dim['element'] = dim[d]
dim['dimension'] = d
dim = dim.drop(columns=d)
elements = pd.concat([elements, dim], axis=0, sort=False)

if derived:
elements['predict'] = elements['predict_a'] / elements['predict_b']
elements['real'] = elements['real_a'] / elements['real_b']

elements = elements.reset_index(drop=True)
return elements


def adtributor(df, dimensions, teep=0.1, tep=0.1, k=3, derived=False):
elements = merge_dimensions(df, dimensions, derived)
elements = add_explanatory_power(elements, derived)
elements = add_surpise(elements, derived, merged_divide=len(dimensions))

    candidate_set = []
    for d in dimensions:
        # Within each dimension: sort elements by surprise, keep those with explanatory power above teep,
        # and accumulate their ep until it exceeds the tep threshold below.
        dim_elems = elements.loc[elements['dimension'] == d].set_index('element')
        dim_elems = dim_elems.sort_values('surprise', ascending=False)
        cumulative_ep = dim_elems.loc[dim_elems['ep'] > teep, 'ep'].cumsum()
if np.any(cumulative_ep > tep):
idx = (cumulative_ep > tep).idxmax()
candidate = {'elements': cumulative_ep[:idx].index.values.tolist(),
'explanatory_power': cumulative_ep[idx],
'surprise': dim_elems.loc[:idx, 'surprise'].sum(),
'dimension': d}
candidate_set.append(candidate)

# Sort by surprise and return the top k
candidate_set = sorted(candidate_set, key=lambda t: t['surprise'], reverse=True)[:k]
return candidate_set
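
A minimal usage sketch for the function above (not part of the commit): it assumes the repository root is on the import path and that the scoring utilities in `utils.element_scores` operate on the `real` and `predict` measure columns used throughout the code.

```python
import pandas as pd
from algorithms.adtributor import adtributor

# Toy data with two dimensions; the drop in 'real' is concentrated in region 'r1'.
df = pd.DataFrame({
    'region':  ['r1', 'r1', 'r2', 'r2'],
    'device':  ['d1', 'd2', 'd1', 'd2'],
    'predict': [100.0, 100.0, 100.0, 100.0],
    'real':    [10.0, 20.0, 100.0, 100.0],
})

# Returns at most k candidates, each a dict with the selected elements, their
# cumulative explanatory power, surprise, and dimension.
candidates = adtributor(df, dimensions=['region', 'device'], teep=0.1, tep=0.5, k=3)
print(candidates)
```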
170 changes: 170 additions & 0 deletions algorithms/autoroot.py
@@ -0,0 +1,170 @@
import numpy as np
import pandas as pd
from itertools import combinations
from utils.element_scores import add_deviation_score
from scipy.stats import gaussian_kde
from scipy.signal import argrelextrema


def get_unique_elements(df, cuboid):
return np.vstack(list({tuple(row) for row in df[cuboid].values}))


def get_elements_mask(df, cuboid, elements):
return np.logical_and.reduce(np.logical_or.reduce([(df[cuboid] == e).values for e in elements], axis=0), axis=1)


def nps(selection, non_selection):
    # Potential score of a candidate selection: a is the mean relative deviation left after scaling the
    # selection's predictions to match its real total, b is its original deviation, and c is the deviation
    # of the non-selected rows. A score close to 1 means the selection explains most of the anomaly.
    sel_real, sel_pred = selection['real'], selection['predict']
non_sel_real, non_sel_pred = non_selection['real'], non_selection['predict']

with np.errstate(divide='ignore', invalid='ignore'):
selection_a = np.nan_to_num(sel_pred * (sel_real.sum() / sel_pred.sum()))

a = np.mean(np.nan_to_num(np.abs(sel_real - selection_a) / sel_real, posinf=0, neginf=0, nan=0))
b = np.mean(np.nan_to_num(np.abs(sel_real - sel_pred) / sel_real, posinf=0, neginf=0, nan=0))
c = np.mean(np.nan_to_num(np.abs(non_sel_real - non_sel_pred) / non_sel_real, posinf=0, neginf=0, nan=0))
return 1 - ((a + c) / (b + c))


def kde_clustering(df):
    # Cluster elements by deviation score: fit a Gaussian KDE and use its local minima as cluster boundaries.
    values = df['deviation'].values

if len(np.unique(values)) == 1:
df['cluster'] = 1
return df

kernel = gaussian_kde(values, bw_method='silverman')

s = np.linspace(-2, 2, 400)
e = kernel.evaluate(s)
mi = argrelextrema(e, np.less)[0]

# All ends in reverse order
ends = sorted(np.concatenate((s[mi], [np.inf])), reverse=True)
for i, end in enumerate(ends):
df.loc[df['deviation'] <= end, 'cluster'] = i
return df


def is_subset(parent, child):
    # True if every element set in 'child' is a superset of some element set in 'parent'.
    return all([any([p.issubset(c) for p in parent]) for c in child])


def remove_crc(cluster_root_causes, elem_to_remove):
def filter_crc(crc):
root_cause_set = set([frozenset(elem) for elem in crc['elements']])
return root_cause_set == elem_to_remove

return [crc for crc in cluster_root_causes if not filter_crc(crc)]


def remove_same_layer(cluster_root_causes):
# Merge if exactly the same root cause
duplicates = []
for p, c in combinations(enumerate(cluster_root_causes), 2):
if p[1]['layer'] == c[1]['layer']:
parent_set = set([frozenset(elems) for elems in p[1]['elements']])
child_set = set([frozenset(elems) for elems in c[1]['elements']])
if is_subset(parent_set, child_set):
duplicates.append(p[0])
mask = np.full(len(cluster_root_causes), True, dtype=bool)
mask[duplicates] = False
cluster_root_causes = np.array(cluster_root_causes)[mask].tolist()
return cluster_root_causes


def merge_root_causes(cluster_root_causes, max_layer=4):
    # Drop duplicate root causes within a layer, then remove root causes in higher layers that are
    # already covered by a more general root cause in a lower layer.
    cluster_root_causes = remove_same_layer(cluster_root_causes)

for layer in range(max_layer - 1, 0, -1):
layer_root_causes = [set([frozenset(elems) for elems in crc['elements']]) for crc in cluster_root_causes if
crc['layer'] == layer]
higher_layer_root_causes = [set([frozenset(elems) for elems in crc['elements']]) for crc in cluster_root_causes
if crc['layer'] > layer]

for child in higher_layer_root_causes:
for parent in layer_root_causes:
if is_subset(parent, child):
print('parent', parent, 'child', child)
cluster_root_causes = remove_crc(cluster_root_causes, child)
return cluster_root_causes


def search_cluster(df, df_cluster, attributes, delta_threshold, debug=False):
    # Search the cuboids layer by layer for the element split that best explains this cluster,
    # ranking candidates by the average of their LF, CF, and NPS scores.
    z = len(df_cluster)

best_root_cause = {'avg': -1.0}
for layer in range(1, len(attributes) + 1):
if debug: print('Layer:', layer)
cuboids = [list(c) for c in combinations(attributes, layer)]
for cuboid in cuboids:
if debug: print('Cuboid:', cuboid)

# Way too many to go through. This is probably not what is done.
# elements = get_unique_elements(df_cluster, cuboid)
# splits = [t for r in range(1, len(elements) + 1) for t in list(combinations(elements, r))]

# if last layer, we only run if CF can be above the threshold
best_candidate = {'NPS': -1.0}
if layer == len(attributes):
CF = 1 / len(df_cluster)
if CF <= delta_threshold:
continue

xs = df_cluster.groupby(cuboid)['real'].count()
xs = xs.loc[(xs / z) > delta_threshold]
xs.name = 'x'

ys = df.groupby(cuboid)['real'].count()
ys.name = 'y'
splits = pd.concat([xs, ys], axis=1, join='inner')
splits['LF'] = splits['x'] / splits['y']
splits = splits.loc[splits['LF'] > delta_threshold]

for s, row in splits.iterrows():
split = [s] if layer == 1 else s
mask = get_elements_mask(df, cuboid, split)

selection = df.loc[mask]
non_selection = df.loc[~mask]
nps_score = nps(selection, non_selection)
if nps_score > best_candidate['NPS']:
CF = row['x'] / z
avg_score = (nps_score + row['LF'] + CF) / 3
candidate = {'elements': [split], 'layer': layer, 'cuboid': cuboid,
'LF': row['LF'], 'CF': CF, 'NPS': nps_score, 'avg': avg_score}
best_candidate = candidate.copy()

if 'elements' in best_candidate and best_candidate['avg'] > best_root_cause['avg']:
best_root_cause = best_candidate.copy()

if 'elements' not in best_root_cause:
return None
return best_root_cause


def autoroot(df, attributes, delta_threshold=0.1, debug=False):
    # AutoRoot pipeline: score per-leaf deviations, cluster the deviating leaves with a KDE,
    # search each cluster for its best root-cause split, and merge overlapping root causes across layers.
    df = add_deviation_score(df)

# Filter away the uninteresting elements with a score [-0.2,0.2].
# (The deviation score here uses a multiple 2.)
df_relevant = df.loc[df['deviation'].abs() > 0.2].copy()

df_relevant = kde_clustering(df_relevant)
clusters = df_relevant['cluster'].unique()
if debug: print('clusters:', clusters)

cluster_root_causes = []
for cluster in clusters:
if debug: print("Cluster:", cluster)
df_cluster = df_relevant.loc[df_relevant['cluster'] == cluster].copy()

root_cause = search_cluster(df, df_cluster, attributes, delta_threshold, debug)
if root_cause is not None:
root_cause['cluster'] = cluster
cluster_root_causes.append(root_cause)

if debug: print('root causes before merge:', cluster_root_causes)
cluster_root_causes = merge_root_causes(cluster_root_causes, max_layer=len(attributes))
return cluster_root_causes
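
Similarly, a hypothetical end-to-end call of `autoroot` (assuming `utils.element_scores.add_deviation_score` adds the `deviation` column used above and the same `real`/`predict` columns as in the adtributor sketch):

```python
import pandas as pd
from algorithms.autoroot import autoroot

# Toy data: the anomaly is concentrated in the ('r1', 'd1') leaf element.
df = pd.DataFrame({
    'region':  ['r1', 'r1', 'r2', 'r2'],
    'device':  ['d1', 'd2', 'd1', 'd2'],
    'predict': [100.0, 100.0, 100.0, 100.0],
    'real':    [10.0, 100.0, 100.0, 100.0],
})

# Returns one root-cause dict per detected cluster, after cross-layer merging.
root_causes = autoroot(df, attributes=['region', 'device'], delta_threshold=0.1, debug=True)
print(root_causes)
```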
