
Adding SKLearn OrdinalEncoder as a Transformer #157

Open · wants to merge 6 commits into main

Conversation

gketronDS
Member


What does this PR do?

Adds Scikit-Learn's Ordinal Encoder as a transformer search space component.

Where should the reviewer start?

Based on the One Hot Encoder but with the encoder switched out.
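
For quick reviewer reference, this is the underlying scikit-learn behavior being swapped in (a minimal standalone sketch, not the TPOT wrapper itself):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

    X = np.array([["red"], ["green"], ["blue"], ["green"]])

    # OneHotEncoder adds one new column per category (3 columns here).
    onehot = OneHotEncoder(sparse_output=False)  # sparse_output needs sklearn >= 1.2
    print(onehot.fit_transform(X).shape)   # (4, 3)

    # OrdinalEncoder keeps a single integer-coded column, no feature blow-up.
    ordinal = OrdinalEncoder()
    print(ordinal.fit_transform(X).shape)  # (4, 1)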

How should this PR be tested?

I've used it in my experiments so far, and it works as intended. There may be bugs I haven't encountered from swapping the OneHotEncoder for the OrdinalEncoder, so a quick double check couldn't hurt, but it probably isn't necessary since it runs fine on my copy and on the HPC.

Any background context you want to provide?

My experiments were running a lot longer because of the extra features added by the OneHotEncoder, so I added this to constrain the number of features I imputed. It is also useful for preventing possible future neural network components (like a potential GAN- or VAE-based imputer) from exploding the runtime.

What are the relevant issues?

N/A

Screenshots (if appropriate)

Questions:

  • Do the docs need to be updated?
  • Does this PR add new (Python) dependencies?

@perib
Collaborator

perib commented Oct 8, 2024

I don't think we should add an ordinal encoder to the default search space, or include it in the list of "transformers". It seems unlikely to contribute meaningfully to the pipeline when randomly inserted. After the first layer, all TPOT operators essentially return floats (maybe sometimes ints). By default it only applies to features with fewer than 10 unique values. But in the case of floats, it would just remove the useful scale information by encoding them as ints from 0 to 9 (e.g. the column [1, 100, 1000] would get encoded as [0, 1, 2]). In the case of ints, they should already be encoded anyway (for example, predict should already return a list of ints per category). Strings should only be present in the initial input, in which case they should be handled either by the user before TPOT, or in the fixed, initial preprocessing step. Even then, TPOT would have to order the categories arbitrarily, which may not be desirable.
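
A quick illustration of the scale-loss point (a minimal sketch using scikit-learn directly, outside TPOT):

    import numpy as np
    from sklearn.preprocessing import OrdinalEncoder

    # A numeric column whose magnitudes carry information.
    X = np.array([[1], [100], [1000]])

    # OrdinalEncoder replaces each value with its sorted-category index,
    # discarding the scale entirely.
    print(OrdinalEncoder().fit_transform(X).ravel())  # [0. 1. 2.]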

To be fair, for similar reasons I'm not fully convinced on the inclusion of the ColumnOneHotEncoder in the "transformers" list either. I mainly included it because it was already in the original TPOT1 search space. Both classes are odd in that "auto" defaults to an arbitrary hard-coded 10 unique values per column, which may not be desirable (maybe the columns were already ordinally encoded!). Chains of one hot encoders can potentially blow up the data size too. However, in the case of few unique values, I could see the potential for one hot encoding to benefit the pipeline in some cases, mainly if it were only used in the first layer. Should the minimum number of unique values be a learned parameter? Slightly off topic, but I initially planned to concatenate a "learned" preprocessing step with the user-provided search space, so that we could learn the OneHotEncoder params before passing to the desired search space (i.e. SequentialPipeline([OneHotEncoder_search_space, user_search_space])).
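
A rough sketch of that learned-preprocessing idea, assuming tpot2's SequentialPipeline accepts a list of search spaces and that the module path and group name below are right:

    import tpot2
    from tpot2.search_spaces.pipelines import SequentialPipeline

    # Learn the one-hot encoding hyperparameters as a fixed first step,
    # then continue with whatever search space the user provided.
    onehot_space = tpot2.config.get_search_space("ColumnOneHotEncoder")
    user_space = tpot2.config.get_search_space("classifiers")  # assumed group name

    learned_preprocessing = SequentialPipeline([onehot_space, user_space])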

There may be a potential use for the OrdinalEncoder/OneHotEncoder in TPOT, but they are probably most useful as a fixed preprocessing step before learning. Currently TPOT has a preprocessing pipeline (via the preprocessing parameter) that will one hot encode categorical columns automatically. This should be sufficient, and the one hot encoder could be removed from the default search space and the "transformers" list. We could add ordinal encoding as an option here as well; however, it is not possible to know which order the category values should take. In my opinion, it is simpler and easier for users to do ordinal encoding on their own if they need to.

side note: these methods, along with the "ColumnSimpleImputer" and the fixed preprocessing pipelines, could probably be coded more simply with the ColumnTransformer.
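
For instance, a minimal sketch of that ColumnTransformer approach (the column selectors and imputation strategies here are illustrative assumptions, not TPOT's current behavior):

    from sklearn.compose import ColumnTransformer, make_column_selector
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    # One fixed step: impute and one-hot encode categorical columns,
    # impute numeric columns, and pass everything else through untouched.
    preprocessing = ColumnTransformer(
        transformers=[
            ("categorical", Pipeline([
                ("impute", SimpleImputer(strategy="most_frequent")),
                ("onehot", OneHotEncoder(handle_unknown="ignore")),
            ]), make_column_selector(dtype_include=object)),
            ("numeric", SimpleImputer(strategy="mean"),
             make_column_selector(dtype_include="number")),
        ],
        remainder="passthrough",
    )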

but maybe I missed something, curious to hear other thoughts.

@perib
Collaborator

perib commented Oct 8, 2024

In the learned preprocessing example, TPOT could also learn to select either ordinal encoding or one hot encoding (ChoicePipeline) before following with the user/default search space.
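
A hedged sketch of that, where "ColumnOrdinalEncoder" is the hypothetical config name this PR would register, and the ChoicePipeline/SequentialPipeline paths are assumed:

    import tpot2
    from tpot2.search_spaces.pipelines import ChoicePipeline, SequentialPipeline

    # Learn to pick exactly one encoder, then follow with the user's space.
    encoder_choice = ChoicePipeline([
        tpot2.config.get_search_space("ColumnOneHotEncoder"),
        tpot2.config.get_search_space("ColumnOrdinalEncoder"),  # hypothetical name
    ])
    user_space = tpot2.config.get_search_space("classifiers")  # assumed group name

    search_space = SequentialPipeline([encoder_choice, user_space])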

@gketronDS
Member Author

gketronDS commented Oct 8, 2024

I think it would be best to take both encoders out and put them into a new preprocessing option, similar to how we handle "transformers" or "selectors", so we can use them in sequential pipelines in the future and iterate. The one hot encoding invoked by the default preprocessing flag is hard coded, and we don't have a non-hard-coded way to do categorical encoding right now other than the onehot transformer object. I'm not sure how much support for learning preprocessing pipelines we want to give, but I feel like having it as an option is in line with the overall goal of automation, plus a non-hardcoded option gives more leeway to take advantage of the new pipeline architectures like choice or sequential.

@gketronDS
Member Author

From Pedro on Slack: Yeah, I'm not sure whether to have a "builtin" learned preprocessing thing, or have it be something the users decide to do. For TPOTClassifier and TPOTRegressor it could be useful and would automate an extra few steps. It would be helpful for things like imputation.
I was thinking something like:

    SequentialPipeline(
        [
            # TPOT automatically adds these when needed
            ...,
            ...,
            # Default always included
            ...,
        ]
    )
We can already access individual methods with get_search_space, for example tpot2.config.get_search_space("ColumnOneHotEncoder").

From me: Since we can just grab from the config anyway, maybe it's worth pulling OneHot out of "transformers", adding the OrdinalEncoder to get_search_space, and deciding later if we want to make a new preprocessor section for automated preprocessing? I needed it for my project, so someone else may need that functionality in the future too.

From Pedro: I think both ideas make sense.

@gketronDS
Member Author

Pedro, are you able to combine these, since you made changes to the config recently? I'm not sure why this is causing a merge error, or how to save the changes I make.

@nickotto
Contributor

nickotto commented Nov 8, 2024

This code would remove one hot encoding from the transformers, and the user would have to manually add column_encoders to their search space. Is this something you agree with?

@perib
Collaborator

perib commented Nov 8, 2024

perhaps we could benchmark both options to see if it makes a large difference in time/performance/memory?
