
Adding SKLearn OrdinalEncoder as a Transformer #157

Open · wants to merge 6 commits into main

Conversation

gketronDS
Member


What does this PR do?

Adds Scikit-Learn's Ordinal Encoder as a transformer search space component.

Where should the reviewer start?

Based on the One Hot Encoder but with the encoder switched out.
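
For quick reviewer reference, this is the underlying scikit-learn behavior being swapped in (a minimal standalone sketch, not the TPOT wrapper itself):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

    X = np.array([["red"], ["green"], ["blue"], ["green"]])

    # OneHotEncoder adds one new column per category (3 columns here).
    onehot = OneHotEncoder(sparse_output=False)  # sparse_output needs sklearn >= 1.2
    print(onehot.fit_transform(X).shape)   # (4, 3)

    # OrdinalEncoder keeps a single integer-coded column, no feature blow-up.
    ordinal = OrdinalEncoder()
    print(ordinal.fit_transform(X).shape)  # (4, 1)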

How should this PR be tested?

I've used it in my experiments so far, and it works as intended. There may be bugs I haven't encountered from swapping the OneHotEncoder for the OrdinalEncoder, so a quick double check couldn't hurt, but it probably isn't necessary since it runs fine on my copy and on the HPC.

Any background context you want to provide?

My experiments were running a lot longer because of the extra features added by the OneHotEncoder, so I added this to constrain the number of features I imputed. It is also useful for preventing possible future neural network components (like a potential GAN- or VAE-based imputer) from exploding the runtime.

What are the relevant issues?

N/A

Screenshots (if appropriate)

Questions:

  • Do the docs need to be updated?
  • Does this PR add new (Python) dependencies?

@perib
Collaborator

perib commented Oct 8, 2024

I don't think we should add an ordinal encoder to the default search space, or include it in the list of "transformers". It seems unlikely to contribute meaningfully to the pipeline when randomly inserted. After the first layer, all TPOT operators essentially return floats (maybe sometimes ints). By default it only applies to features with fewer than 10 unique values. But in the case of floats, it would just remove the useful scale information by encoding them as ints from 0 to 9 (e.g. the column [1, 100, 1000] would get encoded as [0, 1, 2]). In the case of ints, they should already be encoded anyway (for example, predict should already return a list of ints per category). Strings should only be present in the initial input, in which case they should be handled either by the user before TPOT, or in the fixed, initial preprocessing step. Even then, TPOT would have to order the categories arbitrarily, which may not be desirable.
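
A quick illustration of the scale-loss point (a minimal sketch using scikit-learn directly, outside TPOT):

    import numpy as np
    from sklearn.preprocessing import OrdinalEncoder

    # A numeric column whose magnitudes carry information.
    X = np.array([[1], [100], [1000]])

    # OrdinalEncoder replaces each value with its sorted-category index,
    # discarding the scale entirely.
    print(OrdinalEncoder().fit_transform(X).ravel())  # [0. 1. 2.]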

To be fair, for similar reasons I'm not fully convinced on the inclusion of the ColumnOneHotEncoder in the "transformers" list either. I mainly included it because it was already in the original TPOT1 search space. Both classes are odd in that "auto" defaults to an arbitrary hard-coded 10 unique values per column, which may not be desirable (maybe the columns were already ordinally encoded!). Chains of one hot encoders can potentially blow up the data size too. However, in the case of few unique values, I could see the potential for one hot encoding to benefit the pipeline in some cases, mainly if it were only used in the first layer. Should the minimum number of unique values be a learned parameter? Slightly off topic, but I initially planned to concatenate a "learned" preprocessing step with the user-provided search space, so that we could learn the OneHotEncoder params before passing to the desired search space (i.e. SequentialPipeline([OneHotEncoder_search_space, user_search_space])).
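
A rough sketch of that learned-preprocessing idea, assuming tpot2's SequentialPipeline accepts a list of search spaces and that the module path and group name below are right:

    import tpot2
    from tpot2.search_spaces.pipelines import SequentialPipeline

    # Learn the one-hot encoding hyperparameters as a fixed first step,
    # then continue with whatever search space the user provided.
    onehot_space = tpot2.config.get_search_space("ColumnOneHotEncoder")
    user_space = tpot2.config.get_search_space("classifiers")  # assumed group name

    learned_preprocessing = SequentialPipeline([onehot_space, user_space])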

There may be a potential use for the OrdinalEncoder/OneHotEncoder in TPOT, but they are probably most useful as a fixed preprocessing step before learning. Currently TPOT has a preprocessing pipeline (via the preprocessing parameter) that will one hot encode categorical columns automatically. This should be sufficient, and the one hot encoder could be removed from the default search space and the "transformers" list. We could add ordinal encoding as an option here as well; however, it is not possible to know which order the category values should take. In my opinion, it is simpler and easier for users to do ordinal encoding on their own if they need to.

side note: these methods, along with the "ColumnSimpleImputer" and the fixed preprocessing pipelines, could probably be coded more simply with the ColumnTransformer.
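
For instance, a minimal sketch of that ColumnTransformer approach (the column selectors and imputation strategies here are illustrative assumptions, not TPOT's current behavior):

    from sklearn.compose import ColumnTransformer, make_column_selector
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    # One fixed step: impute and one-hot encode categorical columns,
    # impute numeric columns, and pass everything else through untouched.
    preprocessing = ColumnTransformer(
        transformers=[
            ("categorical", Pipeline([
                ("impute", SimpleImputer(strategy="most_frequent")),
                ("onehot", OneHotEncoder(handle_unknown="ignore")),
            ]), make_column_selector(dtype_include=object)),
            ("numeric", SimpleImputer(strategy="mean"),
             make_column_selector(dtype_include="number")),
        ],
        remainder="passthrough",
    )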

but maybe I missed something, curious to hear other thoughts.

@perib
Collaborator

perib commented Oct 8, 2024

In the learned preprocessing example, TPOT could also learn to select either ordinal encoding or one hot encoding (ChoicePipeline) before following with the user/default search space.
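
A hedged sketch of that, where "ColumnOrdinalEncoder" is the hypothetical config name this PR would register, and the ChoicePipeline/SequentialPipeline paths are assumed:

    import tpot2
    from tpot2.search_spaces.pipelines import ChoicePipeline, SequentialPipeline

    # Learn to pick exactly one encoder, then follow with the user's space.
    encoder_choice = ChoicePipeline([
        tpot2.config.get_search_space("ColumnOneHotEncoder"),
        tpot2.config.get_search_space("ColumnOrdinalEncoder"),  # hypothetical name
    ])
    user_space = tpot2.config.get_search_space("classifiers")  # assumed group name

    search_space = SequentialPipeline([encoder_choice, user_space])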

@gketronDS
Member Author

gketronDS commented Oct 8, 2024

I think it would be best to take both encoders out and put them into a new preprocessing option, similar to how we handle "transformers" or "selectors", so we can use them in sequential pipelines in the future and iterate. The one hot encoding invoked by the default preprocessing flag is hard coded, and we don't have a non-hard-coded way to do categorical encoding right now other than the onehot transformer object. I'm not sure how much support for learning preprocessing pipelines we want to give, but I feel like having it as an option is in line with the overall goal of automation, plus a non-hardcoded option gives more leeway to take advantage of the new pipeline architectures like choice or sequential.

@gketronDS
Member Author

From Pedro on Slack: Yeah, I'm not sure whether to have a "builtin" learned preprocessing thing, or have it be something the users decide to do. For TPOTClassifier and TPOTRegressor it could be useful and would automate an extra few steps. It would be helpful for things like imputation.
I was thinking something like:

    SequentialPipeline(
        [
            # TPOT automatically adds these when needed
            ...,
            ...,
            # Default always included
            ...,
        ]
    )
We can already access individual methods with get_search_space, for example tpot2.config.get_search_space("ColumnOneHotEncoder").

From me: Since we can just grab from the config anyway, maybe it's worth pulling OneHot out of "transformers", adding the OrdinalEncoder to get_search_space, and deciding later if we want to make a new preprocessor section for automated preprocessing? I needed it for my project, so someone else may need that functionality in the future too.

From Pedro: I think both ideas make sense.

@gketronDS
Member Author

Pedro, are you able to combine these, since you made changes to the config recently? I'm not sure why this is causing a merge error, or how to save the changes I make.

@nickotto
Contributor

nickotto commented Nov 8, 2024

This code would remove one hot encoding from the transformers, and the user would have to manually add column_encoders to their search space. Is this something you agree with?

@perib
Collaborator

perib commented Nov 8, 2024

perhaps we could benchmark both options to see if it makes a large difference in time/performance/memory?
