Update with riskloc and more baselines
Showing 30 changed files with 2,541 additions and 2 deletions.
@@ -1,2 +1,112 @@
# multi-dim-baselines
Baselines for multi-dimensional RCA
# RiskLoc
Code for the paper RiskLoc: Localization of Multi-dimensional Root Causes by Weighted Risk ([link](https://arxiv.org/abs/2205.10004)).
Contains the implementation of RiskLoc and all baseline multi-dimensional root cause localization methods.

## Requirements
- pandas
- numpy
- scipy
- kneed (for squeeze)
- loguru (for squeeze)
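These can be installed with pip, for example (a minimal sketch assuming a standard Python 3 environment; kneed and loguru are only needed for the squeeze baseline):
```
pip install pandas numpy scipy kneed loguru
```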
## How to run

To run, use the `run.py` file. There are two options: run a single file, or run all files in a directory (including all of its subdirectories).

Example of running a single file using riskloc in debug mode:
```
python run.py riskloc --run-path /data/B0/B_cuboid_layer_1_n_ele_1/1450653900.csv --debug
```

Example of running all files in a particular setting for a dataset (with `--derived` set to True):
```
python run.py riskloc --run-path /data/D/B_cuboid_layer_3_n_ele_3 --derived
```

Example of running all files in a dataset:
```
python run.py riskloc --run-path /data/B0
```

Example of running all datasets with 20 threads:
```
python run.py riskloc --n-threads 20
```

Changing `riskloc` to any of the other supported algorithms will run that algorithm instead; see the list below.
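For example, running the adtributor baseline on the B0 dataset would look like this (an illustrative command; `adtributor` and `--run-path` appear in the algorithm list and shared options below):
```
python run.py adtributor --run-path /data/B0
```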

## Algorithms
The supported algorithms are:
```
$ python run.py --help
usage: run.py [-h] {riskloc,autoroot,squeeze,old squeeze,hotspot,r_adtributor,adtributor} ...

RiskLoc

positional arguments:
  {riskloc,autoroot,squeeze,old squeeze,hotspot,r_adtributor,adtributor}
                    algorithm specific help
    riskloc         riskloc help
    autoroot        autoroot help
    squeeze         squeeze help
    hotspot         autoroot help
    r_adtributor    r_adtributor help
    adtributor      adtributor help

optional arguments:
  -h, --help        show this help message and exit
```
The code for Squeeze is adapted from the code released with the original publication: https://github.com/NetManAIOps/Squeeze.

To see the algorithm-specific arguments, run `python run.py 'algorithm' --help`. For example, for RiskLoc:
```
$ python run.py riskloc --help
usage: run.py riskloc [-h] [--data-root DATA_ROOT] [--run-path RUN_PATH] [--derived [DERIVED]] [--n-threads N_THREADS] [--output-suffix OUTPUT_SUFFIX] [--debug [DEBUG]] [--risk-threshold RISK_THRESHOLD] [--pep-threshold PEP_THRESHOLD] [--prune-elements [PRUNE_ELEMENTS]]

optional arguments:
  -h, --help                         show this help message and exit
  --data-root DATA_ROOT              root directory for all datasets (default ./data/)
  --run-path RUN_PATH                directory or file to be run;
                                     if a directory, any subdirectories will be considered as well;
                                     must contain data-path as a prefix
  --derived [DERIVED]                derived dataset (defaults to True for the D dataset and False for others)
  --n-threads N_THREADS              number of threads to run
  --output-suffix OUTPUT_SUFFIX      suffix for output file
  --debug [DEBUG]                    debug mode
  --risk-threshold RISK_THRESHOLD    risk threshold
  --pep-threshold PEP_THRESHOLD      proportional explanatory power threshold
  --prune-elements [PRUNE_ELEMENTS]  use element pruning (True/False)
```

The `risk-threshold` and `pep-threshold` arguments are specific to RiskLoc, while the rest are shared by all algorithms. To see the algorithm-specific arguments for other algorithms, simply run them with the `--help` flag or check the code in `run.py`.

## Datasets
The semi-synthetic datasets can be downloaded from: https://github.com/NetManAIOps/Squeeze.
To run these, place them within the data/ directory and name them: A, B0, B1, B2, B3, B4, and D, respectively.
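For reference, a possible on-disk layout after downloading (illustrative only, assuming the default `--data-root ./data/` and showing the file used in the run examples above):
```
data/
├── A/
├── B0/
│   └── B_cuboid_layer_1_n_ele_1/
│       └── 1450653900.csv
├── B1/  B2/  B3/  B4/
└── D/
```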

The three synthetic datasets used in the paper can be generated using `generate_dataset.py` as follows.

S dataset:
```
python generate_dataset.py --num 1000 --dataset-name S --seed 121
```
L dataset:
```
python generate_dataset.py --num 1000 --dataset-name L --seed 122 --dims 10 24 10 15 --noise-level 0.0 0.1 --anomaly-severity 0.5 1.0 --anomaly-deviation 0.0 0.0 --num-anomaly 1 5 --num-anomaly-elements 1 1 --only-last-layer
```
H dataset:
```
python generate_dataset.py --num 100 --dataset-name H --seed 123 --dims 10 5 250 20 8 12
```

In addition, new, interesting datasets can be created using `generate_dataset.py` for extended empirical verification and research purposes. The supported input arguments can be found at the beginning of the `generate_dataset.py` file or by using the `--help` flag.
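As an illustration, a small custom dataset could be generated by varying the same arguments used for the L and H datasets above (the values here are arbitrary and only meant as an example):
```
python generate_dataset.py --num 200 --dataset-name custom --seed 42 --dims 8 12 6 --num-anomaly 1 3
```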

## Citation
```
@article{riskloc,
  title={RiskLoc: Localization of Multi-dimensional Root Causes by Weighted Risk},
  author={Kalander, Marcus},
  journal={arXiv preprint arXiv:2205.10004},
  year={2022}
}
```
@@ -0,0 +1,43 @@
import numpy as np
import pandas as pd
from utils.element_scores import add_explanatory_power, add_surpise


def merge_dimensions(df, dimensions, derived):
    # Aggregate the data per dimension so that each attribute value becomes a single 'element' row.
    elements = pd.DataFrame(columns=list(set(df.columns) - set(dimensions)), dtype=float)
    for d in dimensions:
        dim = df.groupby(d).sum().reset_index()
        dim['element'] = dim[d]
        dim['dimension'] = d
        dim = dim.drop(columns=d)
        elements = pd.concat([elements, dim], axis=0, sort=False)

    if derived:
        # Derived measures are computed as the ratio of two fundamental measures.
        elements['predict'] = elements['predict_a'] / elements['predict_b']
        elements['real'] = elements['real_a'] / elements['real_b']

    elements = elements.reset_index(drop=True)
    return elements


def adtributor(df, dimensions, teep=0.1, tep=0.1, k=3, derived=False):
    elements = merge_dimensions(df, dimensions, derived)
    elements = add_explanatory_power(elements, derived)
    elements = add_surpise(elements, derived, merged_divide=len(dimensions))

    candidate_set = []
    for d in dimensions:
        dim_elems = elements.loc[elements['dimension'] == d].set_index('element')
        dim_elems = dim_elems.sort_values('surprise', ascending=False)
        # Keep elements whose explanatory power exceeds teep and accumulate until tep is exceeded.
        cumulative_ep = dim_elems.loc[dim_elems['ep'] > teep, 'ep'].cumsum()
        if np.any(cumulative_ep > tep):
            idx = (cumulative_ep > tep).idxmax()
            candidate = {'elements': cumulative_ep[:idx].index.values.tolist(),
                         'explanatory_power': cumulative_ep[idx],
                         'surprise': dim_elems.loc[:idx, 'surprise'].sum(),
                         'dimension': d}
            candidate_set.append(candidate)

    # Sort by surprise and return the top k
    candidate_set = sorted(candidate_set, key=lambda t: t['surprise'], reverse=True)[:k]
    return candidate_set
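A minimal usage sketch for this function (illustrative only; it assumes a non-derived DataFrame with `real` and `predict` columns plus one column per dimension, that this file is importable as `adtributor`, and that `utils.element_scores` from this repository is on the path):
```
import pandas as pd
from adtributor import adtributor  # hypothetical import path for this file

# Toy data: two dimensions with an anomaly concentrated in region 'eu'.
df = pd.DataFrame({
    'region':  ['eu', 'eu', 'us', 'us'],
    'device':  ['mobile', 'desktop', 'mobile', 'desktop'],
    'predict': [100.0, 100.0, 100.0, 100.0],
    'real':    [10.0, 20.0, 95.0, 105.0],
})

candidates = adtributor(df, dimensions=['region', 'device'], teep=0.1, tep=0.1, k=3)
print(candidates)  # list of dicts with 'elements', 'explanatory_power', 'surprise', 'dimension'
```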
@@ -0,0 +1,170 @@
import numpy as np
import pandas as pd
from itertools import combinations
from utils.element_scores import add_deviation_score
from scipy.stats import gaussian_kde
from scipy.signal import argrelextrema


def get_unique_elements(df, cuboid):
    return np.vstack(list({tuple(row) for row in df[cuboid].values}))


def get_elements_mask(df, cuboid, elements):
    return np.logical_and.reduce(np.logical_or.reduce([(df[cuboid] == e).values for e in elements], axis=0), axis=1)


def nps(selection, non_selection):
    sel_real, sel_pred = selection['real'], selection['predict']
    non_sel_real, non_sel_pred = non_selection['real'], non_selection['predict']

    with np.errstate(divide='ignore', invalid='ignore'):
        selection_a = np.nan_to_num(sel_pred * (sel_real.sum() / sel_pred.sum()))

        a = np.mean(np.nan_to_num(np.abs(sel_real - selection_a) / sel_real, posinf=0, neginf=0, nan=0))
        b = np.mean(np.nan_to_num(np.abs(sel_real - sel_pred) / sel_real, posinf=0, neginf=0, nan=0))
        c = np.mean(np.nan_to_num(np.abs(non_sel_real - non_sel_pred) / non_sel_real, posinf=0, neginf=0, nan=0))
    return 1 - ((a + c) / (b + c))


def kde_clustering(df):
    values = df['deviation'].values

    if len(np.unique(values)) == 1:
        df['cluster'] = 1
        return df

    # Estimate the density of the deviation scores; cluster boundaries are its local minima.
    kernel = gaussian_kde(values, bw_method='silverman')

    s = np.linspace(-2, 2, 400)
    e = kernel.evaluate(s)
    mi = argrelextrema(e, np.less)[0]

    # All ends in reverse order
    ends = sorted(np.concatenate((s[mi], [np.inf])), reverse=True)
    for i, end in enumerate(ends):
        df.loc[df['deviation'] <= end, 'cluster'] = i
    return df


def is_subset(parent, child):
    return all([any([p.issubset(c) for p in parent]) for c in child])


def remove_crc(cluster_root_causes, elem_to_remove):
    def filter_crc(crc):
        root_cause_set = set([frozenset(elem) for elem in crc['elements']])
        return root_cause_set == elem_to_remove

    return [crc for crc in cluster_root_causes if not filter_crc(crc)]


def remove_same_layer(cluster_root_causes):
    # Merge if exactly the same root cause
    duplicates = []
    for p, c in combinations(enumerate(cluster_root_causes), 2):
        if p[1]['layer'] == c[1]['layer']:
            parent_set = set([frozenset(elems) for elems in p[1]['elements']])
            child_set = set([frozenset(elems) for elems in c[1]['elements']])
            if is_subset(parent_set, child_set):
                duplicates.append(p[0])
    mask = np.full(len(cluster_root_causes), True, dtype=bool)
    mask[duplicates] = False
    cluster_root_causes = np.array(cluster_root_causes)[mask].tolist()
    return cluster_root_causes


def merge_root_causes(cluster_root_causes, max_layer=4):
    cluster_root_causes = remove_same_layer(cluster_root_causes)

    # Drop higher-layer root causes that are subsumed by a root cause at a lower layer.
    for layer in range(max_layer - 1, 0, -1):
        layer_root_causes = [set([frozenset(elems) for elems in crc['elements']]) for crc in cluster_root_causes if
                             crc['layer'] == layer]
        higher_layer_root_causes = [set([frozenset(elems) for elems in crc['elements']]) for crc in cluster_root_causes
                                    if crc['layer'] > layer]

        for child in higher_layer_root_causes:
            for parent in layer_root_causes:
                if is_subset(parent, child):
                    print('parent', parent, 'child', child)
                    cluster_root_causes = remove_crc(cluster_root_causes, child)
    return cluster_root_causes


def search_cluster(df, df_cluster, attributes, delta_threshold, debug=False):
    z = len(df_cluster)

    best_root_cause = {'avg': -1.0}
    for layer in range(1, len(attributes) + 1):
        if debug: print('Layer:', layer)
        cuboids = [list(c) for c in combinations(attributes, layer)]
        for cuboid in cuboids:
            if debug: print('Cuboid:', cuboid)

            # Way too many to go through. This is probably not what is done.
            # elements = get_unique_elements(df_cluster, cuboid)
            # splits = [t for r in range(1, len(elements) + 1) for t in list(combinations(elements, r))]

            # if last layer, we only run if CF can be above the threshold
            best_candidate = {'NPS': -1.0}
            if layer == len(attributes):
                CF = 1 / len(df_cluster)
                if CF <= delta_threshold:
                    continue

            xs = df_cluster.groupby(cuboid)['real'].count()
            xs = xs.loc[(xs / z) > delta_threshold]
            xs.name = 'x'

            ys = df.groupby(cuboid)['real'].count()
            ys.name = 'y'
            splits = pd.concat([xs, ys], axis=1, join='inner')
            splits['LF'] = splits['x'] / splits['y']
            splits = splits.loc[splits['LF'] > delta_threshold]

            for s, row in splits.iterrows():
                split = [s] if layer == 1 else s
                mask = get_elements_mask(df, cuboid, split)

                selection = df.loc[mask]
                non_selection = df.loc[~mask]
                nps_score = nps(selection, non_selection)
                if nps_score > best_candidate['NPS']:
                    CF = row['x'] / z
                    avg_score = (nps_score + row['LF'] + CF) / 3
                    candidate = {'elements': [split], 'layer': layer, 'cuboid': cuboid,
                                 'LF': row['LF'], 'CF': CF, 'NPS': nps_score, 'avg': avg_score}
                    best_candidate = candidate.copy()

            if 'elements' in best_candidate and best_candidate['avg'] > best_root_cause['avg']:
                best_root_cause = best_candidate.copy()

    if 'elements' not in best_root_cause:
        return None
    return best_root_cause


def autoroot(df, attributes, delta_threshold=0.1, debug=False):
    df = add_deviation_score(df)

    # Filter away the uninteresting elements with a score [-0.2,0.2].
    # (The deviation score here uses a multiple 2.)
    df_relevant = df.loc[df['deviation'].abs() > 0.2].copy()

    df_relevant = kde_clustering(df_relevant)
    clusters = df_relevant['cluster'].unique()
    if debug: print('clusters:', clusters)

    cluster_root_causes = []
    for cluster in clusters:
        if debug: print("Cluster:", cluster)
        df_cluster = df_relevant.loc[df_relevant['cluster'] == cluster].copy()

        root_cause = search_cluster(df, df_cluster, attributes, delta_threshold, debug)
        if root_cause is not None:
            root_cause['cluster'] = cluster
            cluster_root_causes.append(root_cause)

    if debug: print('root causes before merge:', cluster_root_causes)
    cluster_root_causes = merge_root_causes(cluster_root_causes, max_layer=len(attributes))
    return cluster_root_causes
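A minimal usage sketch for `autoroot` under the same assumptions as above (a DataFrame with `real` and `predict` columns plus attribute columns, this file importable as `autoroot`, and `utils.element_scores` on the path):
```
from autoroot import autoroot  # hypothetical import path for this file

# df is a DataFrame with columns ['real', 'predict'] plus attribute columns such as ['region', 'device'].
root_causes = autoroot(df, attributes=['region', 'device'], delta_threshold=0.1, debug=True)
for rc in root_causes:
    print(rc['cluster'], rc['layer'], rc['cuboid'], rc['elements'], rc['avg'])
```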