Giskard Scan crashes when tested with a large number of features #1974

dzaridis · 2024-07-02T10:23:18Z

Issue Type

Bug

Source

source

Giskard Library Version

2.14.0

Giskard Hub Version

2.14.0

OS Platform and Distribution

Linux Ubuntu 20.04

Python version

3.9

Installed python packages

numpy==1.23.5
pandas==2.0.3
pyarrow==16.0.0
openpyxl==3.1.2
scikit-learn==1.3.1
xgboost==1.7.6
featurewiz==0.3.2

Current Behaviour?

I ran the scan on a test dataset of 100 samples and ~3700 features, and an out-of-memory (OOM) error occurred.
The pipeline consists of data transformers, featurewiz feature selection, and an XGBoost model.
I have run the library in 2 other use cases with fewer than 100 features, and it ran smoothly without any issue, so I suspect the problem is related to the large number of features.

Running on 48 GB of RAM.
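
For scale, here is a rough back-of-the-envelope estimate of the raw table size. It assumes every feature is stored as a 64-bit float, which the report does not state, and the helper name `raw_table_bytes` is purely illustrative. Under that assumption the data itself is only a few megabytes, which suggests the memory is consumed by work done during the scan and not by the dataset.

```python
def raw_table_bytes(n_samples: int, n_features: int, bytes_per_value: int = 8) -> int:
    """Size of a dense numeric table, ignoring index and container overhead."""
    return n_samples * n_features * bytes_per_value


if __name__ == "__main__":
    # 100 samples and roughly 3700 features, as described in the report.
    # bytes_per_value=8 is an assumption (float64 columns).
    size = raw_table_bytes(n_samples=100, n_features=3700)
    print(f"{size / 2**20:.2f} MiB")
```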

Standalone code OR list down the steps to reproduce the issue

import pandas as pd
import numpy as np
from giskard import Dataset, Model, scan

# Class to create the model and Dataset
class VulnerabilityDetection:
    def __init__(self, df: pd.DataFrame, model_instance):
        self.model_instance = model_instance
        self.df = df
    
    def gisk_dataset(self):
        # Treat every object-dtype column as categorical
        CATEGORICAL_COLUMNS = list(self.df.select_dtypes(include="object").columns)
        giskard_dataset = Dataset(
            df=self.df,
            target="Target",
            name="",
            cat_columns=CATEGORICAL_COLUMNS,
        )
        return giskard_dataset
    
    def gisk_model(self):
        model_inst = self.model_instance

        def prediction_function(df: pd.DataFrame) -> np.ndarray:
            return model_inst.predict_proba(df)
        
        giskard_model = Model(
            model=prediction_function,
            model_type="classification",
            name="Vulnerability Detection Model",
            classification_labels=model_inst.classes_,
            feature_names=self.df.columns
        )
        return giskard_model

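# Execution
import pickle

df = pd.read_csv("MyData")
# Load the fitted pipeline (transformers + featurewiz + XGBoost)
with open("XGBoost_pipeline.pkl", "rb") as file:
    xg_pipeline = pickle.load(file)
vd = VulnerabilityDetection(df, xg_pipeline)
gisk_dataset = vd.gisk_dataset()
gisk_model = vd.gisk_model()
# Running the scan is the step that runs out of memory
scan_results = scan(gisk_model, gisk_dataset)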

Relevant log output

The VS Code notebook itself crashed with an OOM error.

dzaridis · 2024-07-02T11:26:09Z

The dataset I am using comes from radiomics (medical imaging), where all the features contribute to the model's decision, so I cannot isolate specific features. Updating the logic behind the scan might be beneficial.
For instance, for a large number of features, the scan could proceed with batch processing and merge the per-batch results into a single report at the end.
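
The batching idea above could be sketched roughly as follows. This is a minimal illustration and not Giskard's implementation: `feature_batches`, `scan_in_batches`, and the `run_scan` callable are hypothetical names introduced here. `run_scan` stands for whatever restricts a scan to a subset of features and returns a list of findings; whether the installed Giskard version lets `scan` be limited to specific features should be checked against its documentation.

```python
from typing import Callable, Iterator, List, Sequence


def feature_batches(features: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` feature names."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(features), batch_size):
        yield list(features[start:start + batch_size])


def scan_in_batches(
    features: Sequence[str],
    batch_size: int,
    run_scan: Callable[[List[str]], list],
) -> list:
    """Call `run_scan` once per feature batch and concatenate the findings.

    `run_scan` is a hypothetical callable: it receives the feature names of
    one batch and returns a list of findings for those features only.
    """
    merged = []
    for batch in feature_batches(features, batch_size):
        merged.extend(run_scan(batch))
    return merged


if __name__ == "__main__":
    names = [f"feat_{i}" for i in range(3700)]
    # Stand-in scanner that reports one dummy finding per batch.
    findings = scan_in_batches(names, 100, lambda batch: [(batch[0], len(batch))])
    print(len(findings))  # 37 batches of 100 features each
```

Processing one batch at a time keeps only that batch's intermediate results in memory, at the cost of missing any issue that only shows up when features from different batches are examined together.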
