Giskard Scan crashes when tested with a large number of features #1974

Open
dzaridis opened this issue Jul 2, 2024 · 1 comment

dzaridis commented Jul 2, 2024

Issue Type

Bug

Source

source

Giskard Library Version

2.14.0

Giskard Hub Version

2.14.0

OS Platform and Distribution

Linux Ubuntu 20.04

Python version

3.9

Installed python packages

numpy==1.23.5
pandas==2.0.3
pyarrow==16.0.0
openpyxl==3.1.2
scikit-learn==1.3.1
xgboost==1.7.6
featurewiz==0.3.2

Current Behaviour?

I ran the scan on a test dataset of 100 samples and ~3700 features, and an out-of-memory (OOM) error occurred.
The pipeline consists of data transformers, featurewiz feature selection, and an XGBoost model.
I have run the library on two other use cases with fewer than 100 features and it ran smoothly, so I suspect the issue is related to the large number of features.

The machine has 48 GB of RAM.
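
For scale, a rough back-of-the-envelope estimate of the raw table size, assuming every value is stored as a 64-bit float (the exact dtypes and the exact feature count are not given above, and `dense_table_bytes` is a hypothetical helper):

```python
def dense_table_bytes(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> int:
    """Bytes needed to hold a dense n_rows x n_cols table of fixed-size values."""
    return n_rows * n_cols * bytes_per_value


# Assumption: 100 rows, 3700 float64 columns (the issue says "~3700")
size = dense_table_bytes(100, 3700)
print(f"{size} bytes = {size / 1024**2:.2f} MiB")  # 2960000 bytes = 2.82 MiB
```

Under that assumption the input data is only a few MiB, so the memory growth would have to come from work done per feature during the scan rather than from the dataset itself.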

Standalone code OR list down the steps to reproduce the issue

import pickle

import numpy as np
import pandas as pd
from giskard import Dataset, Model, scan


# Helper class that wraps a DataFrame and a fitted pipeline as Giskard objects
class VulnerabilityDetection:
    def __init__(self, df: pd.DataFrame, model_instance):
        self.model_instance = model_instance
        self.df = df

    def gisk_dataset(self) -> Dataset:
        # Treat every object-dtype column as categorical
        categorical_columns = list(self.df.select_dtypes(include="object").columns)
        return Dataset(
            df=self.df,
            target="Target",
            name="",
            cat_columns=categorical_columns,
        )

    def gisk_model(self) -> Model:
        model_inst = self.model_instance

        def prediction_function(df: pd.DataFrame) -> np.ndarray:
            # Return class probabilities, one column per label
            return model_inst.predict_proba(df)

        return Model(
            model=prediction_function,
            model_type="classification",
            name="Vulnerability Detection Model",
            classification_labels=list(model_inst.classes_),
            # Exclude the target column so the model only receives input features
            feature_names=[c for c in self.df.columns if c != "Target"],
        )


# Execution
df = pd.read_csv("MyData")
with open("XGBoost_pipeline.pkl", "rb") as file:
    xg_pipeline = pickle.load(file)
vd = VulnerabilityDetection(df, xg_pipeline)
gisk_dataset = vd.gisk_dataset()
gisk_model = vd.gisk_model()

# The scan call that triggers the OOM crash described above
results = scan(gisk_model, gisk_dataset)

Relevant log output

The notebook running in VS Code crashed with an OOM error.
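
Since the crash leaves no log, one way to gather numbers would be to run the scan on smaller column subsets and record peak memory for each. The following is only a sketch using the standard-library `tracemalloc` module; `run_with_peak_memory` is a hypothetical helper, not part of Giskard.

```python
import tracemalloc
from typing import Any, Callable, Tuple


def run_with_peak_memory(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, int]:
    """Run fn and return (result, peak bytes traced by tracemalloc while it ran)."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


# Stand-in workload; in practice fn would be the scan call on a reduced dataset
values, peak = run_with_peak_memory(lambda n: [0] * n, 100_000)
print(f"peak: {peak / 1024**2:.2f} MiB")
```

`tracemalloc` only sees allocations made through Python's allocator. NumPy reports its buffers to it, but memory allocated natively (for example inside XGBoost) may not be counted, so the figures are a lower bound.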
dzaridis commented Jul 2, 2024

The dataset I am using comes from radiomics (medical imaging), where all the features contribute to the model's decision, so I cannot isolate specific features. Updating the logic behind the scan might be beneficial.
For instance, for a large number of features, the scan could proceed with batch processing and merge the per-batch results into the overall report at the end.
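
The batching idea could be sketched roughly as follows. This is not how Giskard works internally: `feature_batches` and `scan_in_batches` are hypothetical helpers, and the sketch assumes the scan function accepts a `features` keyword that restricts which features are scanned (to be checked against the installed Giskard version).

```python
from typing import Any, Callable, Iterator, List, Sequence


def feature_batches(feature_names: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of the feature list, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(feature_names), batch_size):
        yield list(feature_names[start:start + batch_size])


def scan_in_batches(
    scan_fn: Callable[..., Any],
    model: Any,
    dataset: Any,
    feature_names: Sequence[str],
    batch_size: int = 100,
) -> List[Any]:
    """Call scan_fn once per feature batch and collect the per-batch reports."""
    reports = []
    for batch in feature_batches(feature_names, batch_size):
        # Assumption: scan_fn takes a `features` keyword limiting the scan
        reports.append(scan_fn(model, dataset, features=batch))
    return reports
```

With the objects from the reproduction code, a call might look like `scan_in_batches(scan, gisk_model, gisk_dataset, feature_names, batch_size=100)`. The model still receives all columns on every call; only the set of features being probed changes per batch. Merging the per-batch reports into a single result is left open here, since no API for that is shown in this issue.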
