Sometimes Counterfactuals generated with random method have wrong class #417

benediktsatalia opened this issue Nov 3, 2023 · 1 comment

benediktsatalia commented Nov 3, 2023

I encountered situations where the returned counterfactuals do not have the desired class. It happens only sometimes, so I had to experiment with seeds to get a reproducible example. I boiled it down to a simple example based on the getting-started notebook.

This is the output the code produces:

Query instance (original outcome : 0)
   age workclass education marital_status    occupation   race gender  hours_per_week  income
0   32   Private   HS-grad        Married  White-Collar  White   Male              60       0

Diverse Counterfactual set (new outcome: 1)
   age workclass  education marital_status    occupation   race gender  hours_per_week  income
0   61   Private    HS-grad        Married  Professional  White   Male              60       0
1   32   Private  Bachelors        Married  White-Collar  White   Male              60       1

The code to reproduce:

# Sklearn imports
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

# DiCE imports
import dice_ml
from dice_ml.utils import helpers  # helper functions

dataset = helpers.load_adult_income_dataset()
dataset = dataset.sample(1000, random_state=1)

y_train = dataset["income"]
x_train = dataset.drop('income', axis=1)

# Step 1: dice_ml.Data
d = dice_ml.Data(dataframe=dataset, continuous_features=['age', 'hours_per_week'], outcome_name='income')


numerical = ["age", "hours_per_week"]
categorical = x_train.columns.difference(numerical)

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())])

categorical_transformer = Pipeline(steps=[("onehot", OneHotEncoder(handle_unknown="ignore"))])

transformations = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numerical),
        ("cat", categorical_transformer, categorical),
    ]
)

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(
    steps=[("preprocessor", transformations), ("classifier", RandomForestClassifier(random_state=1))]
)
model = clf.fit(x_train, y_train)

# Using sklearn backend
m = dice_ml.Model(model=model, backend="sklearn")
# Using method=random for generating CFs
exp = dice_ml.Dice(d, m, method="random")

e1 = exp.generate_counterfactuals(x_train[4:5], total_CFs=2, desired_class="opposite", random_seed=6)
e1.visualize_as_dataframe()
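To make the failure easy to spot without reading the printed table, one can re-check each returned counterfactual's class against the desired class. A minimal sketch of that check, using a toy stand-in predictor and hand-written rows mirroring the output above (`toy_predict` and `check_counterfactual_classes` are illustrative names, not part of DiCE's API):

```python
# Sketch: verify that every generated counterfactual actually has the desired class.
# `predict` is any callable mapping a feature dict to a class label; in the real
# script one would wrap clf.predict and iterate over the rows of
# e1.cf_examples_list[0].final_cfs_df instead.

def check_counterfactual_classes(predict, cf_rows, desired_class):
    """Return the counterfactual rows whose predicted class is NOT the desired one."""
    return [row for row in cf_rows if predict(row) != desired_class]

# Toy stand-in model: class 1 iff education is Bachelors.
toy_predict = lambda row: 1 if row["education"] == "Bachelors" else 0

cfs = [
    {"age": 61, "education": "HS-grad"},    # mirrors CF 0 above (still class 0)
    {"age": 32, "education": "Bachelors"},  # mirrors CF 1 above (class 1)
]

bad = check_counterfactual_classes(toy_predict, cfs, desired_class=1)
print(len(bad))  # 1 -> one counterfactual failed to flip the class
```

In the buggy run above, such a check would flag the first counterfactual (income still 0) immediately.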

benediktsatalia (Author) commented:

I tested it further and it also happens with method="genetic". It is a bit harder to catch, since random_seed=... only works for the random method (which is, by the way, not documented, so I consider that a bug too). But the method still has some randomness, so to find occurrences of this bug I run generate_counterfactuals repeatedly until the bug occurs:

# Sklearn imports
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

# DiCE imports
import dice_ml
from dice_ml.utils import helpers  # helper functions

dataset = helpers.load_adult_income_dataset()
dataset = dataset.sample(1000, random_state=1)

y_train = dataset["income"]
x_train = dataset.drop('income', axis=1)

# Step 1: dice_ml.Data
d = dice_ml.Data(dataframe=dataset, continuous_features=['age', 'hours_per_week'], outcome_name='income')


numerical = ["age", "hours_per_week"]
categorical = x_train.columns.difference(numerical)

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())])

categorical_transformer = Pipeline(steps=[("onehot", OneHotEncoder(handle_unknown="ignore"))])

transformations = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numerical),
        ("cat", categorical_transformer, categorical),
    ]
)

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(
    steps=[("preprocessor", transformations), ("classifier", RandomForestClassifier(random_state=1))]
)
model = clf.fit(x_train, y_train)

# Using sklearn backend
m = dice_ml.Model(model=model, backend="sklearn")
# Using method=genetic for generating CFs
exp = dice_ml.Dice(d, m, method="genetic")

for i in range(1000):
    e1 = exp.generate_counterfactuals(x_train[4:5], total_CFs=10, desired_class="opposite")
    print(i)
    if (e1.cf_examples_list[0].final_cfs_df["income"].nunique() > 1):
        e1.visualize_as_dataframe()
        break

If you run this script, it will eventually produce a counterfactual set in which at least one counterfactual has the wrong class.
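Until this is fixed upstream, a defensive workaround is to filter out invalid counterfactuals after generation. A minimal sketch, assuming the counterfactual rows have been converted to dicts (e.g. via `final_cfs_df.to_dict("records")`); `filter_valid_cfs` is an illustrative helper name, not a DiCE function:

```python
# Sketch of a post-hoc workaround: keep only counterfactuals whose recorded
# outcome matches the desired class before using them downstream.

def filter_valid_cfs(cf_rows, outcome_key, desired_class):
    """Keep only rows whose outcome column equals the desired class."""
    return [row for row in cf_rows if row[outcome_key] == desired_class]

# Hand-written rows mirroring the buggy output above.
final_cfs = [
    {"age": 61, "income": 0},  # invalid: wrong class slipped through
    {"age": 32, "income": 1},  # valid
]

valid = filter_valid_cfs(final_cfs, "income", desired_class=1)
print(len(valid))  # 1 -> only the genuinely flipped counterfactual remains
```

This does not address the root cause, but it prevents wrong-class counterfactuals from silently reaching downstream code.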
