Adding SKLearn OrdinalEncoder as a Transformer #157
base: main
Conversation
I don't think we should add an ordinal encoder to the default search space, or include it in the list of "transformers". It seems unlikely to contribute meaningfully to the pipeline when randomly inserted: after the first layer, all TPOT operators essentially return floats (maybe sometimes ints). By default it only applies to features with fewer than 10 unique values. In the case of floats, it would just discard the useful scale information by encoding them as ints from 0 to 9 (e.g. the list [1, 100, 1000] would get encoded as [0, 1, 2]). In the case of ints, they should already be encoded anyway (for example, predict should already return a list of ints per category). Strings should only be present in the initial input, in which case they should be handled either by the user before TPOT or in the fixed, initial preprocessing step. TPOT would also have to order the categories arbitrarily, which may not be desirable.

To be fair, for similar reasons I'm not fully convinced about the inclusion of the ColumnOneHotEncoder in the "transformers" list either. I mainly included it because it was already part of the original TPOT1 search space. Both classes are odd in that "auto" defaults to an arbitrary hard-coded 10 unique values per column, which may not be desirable (maybe the values were already ordinally encoded!). Chains of one hot encoders can also potentially blow up the data size. However, in the case of few unique values, I could see one hot encoding benefiting the pipeline in some cases, mainly if it were only used in the first layer. Should the minimum number of unique values be a learned parameter?

Slightly off topic, but I initially planned to concatenate a "learned" preprocessing step with the user-provided search space, so that we could learn the one hot encoder params before passing to the desired search space (i.e. SequentialPipeline([OneHotEncoder_search_space, user_search_space])). There may be a potential use for the OrdinalEncoder/OneHotEncoder in TPOT, but they may primarily be useful as a fixed preprocessing step before learning. Currently TPOT already has a fixed preprocessing pipeline (via the preprocessing flag).

Side note: these methods, along with the ColumnSimpleImputer and the fixed preprocessing pipelines, could probably be coded more simply with the ColumnTransformer. But maybe I missed something; curious to hear other thoughts.
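For reference, a quick illustration of the scale-loss point above, using plain scikit-learn (not TPOT-specific): OrdinalEncoder maps each column's sorted unique values to 0..n-1, so relative magnitudes of numeric features are discarded.

```python
# Minimal demo of the concern: [1, 100, 1000] collapses to [0, 1, 2].
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X = np.array([[1], [100], [1000]])
print(OrdinalEncoder().fit_transform(X))
# [[0.]
#  [1.]
#  [2.]]
```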
In the learned preprocessing example, TPOT could also learn to select either ordinal encoding or one hot encoding (ChoicePipeline) before continuing with the user/default search space. See the sketch below.
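A rough sketch of that idea, assuming TPOT2 exposes ChoicePipeline/SequentialPipeline under tpot2.search_spaces.pipelines and that get_search_space can look up the individual encoders by name; all names below are illustrative and the exact API may differ from the current config:

```python
# Hypothetical sketch of the "learned preprocessing" idea described above.
# Assumes tpot2.config.get_search_space plus the ChoicePipeline /
# SequentialPipeline search spaces; names are illustrative only.
import tpot2
from tpot2.search_spaces.pipelines import ChoicePipeline, SequentialPipeline

# let evolution pick ONE encoding for the first step
encoder_choice = ChoicePipeline(search_spaces=[
    tpot2.config.get_search_space("OneHotEncoder"),
    tpot2.config.get_search_space("OrdinalEncoder"),  # the component this PR adds
])

# the user's (or default) search space for everything after the encoder
user_search_space = tpot2.config.get_search_space(["transformers", "classifiers"])

learned_preprocessing_space = SequentialPipeline(search_spaces=[
    encoder_choice,
    user_search_space,
])
```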
I think it would be best to take both encoders out and put them into a new preprocessing option, similar to how we handle "transformers" or "selectors", so we can use them in sequential pipelines in the future and iterate. The one hot step called by the default preprocessing flag was hard coded, and right now we don't have a non-hard-coded way to do categorical encoding other than the one hot transformer object. I don't know how much support for learning preprocessing pipelines we want to give, but having it as an option feels in line with the overall goal of automation, and a non-hardcoded option gives more leeway to take advantage of the new pipeline architectures like choice or sequential.
From Pedro on Slack: yeah, I'm not sure whether to have a "builtin" learned preprocessing thing, or have it be something the users decide to do. For TPOTClassifier and TPOTRegressor it could be useful and automate an extra few steps. It would be helpful for things like imputation.

From me: Since we can just grab from the config anyway, maybe it's worth pulling OneHot out of transformers, adding the OrdinalEncoder to get_search_space, and deciding later whether we want to make a new preprocessor section for automated preprocessing? I needed it for my project, so someone else may need that functionality in the future too.

From Pedro: I think both ideas make sense.
…group along with the ordinal encoder.
Pedro, are you able to combine these since you made changes to the config recently? Not sure why this is causing a merge error, or how to save the changes I make.
This code would take one hot encoding out of the transformers, and the user would have to manually add column_encoders to their search space. Is this something you agree with?
Perhaps we could benchmark both options to see whether it makes a large difference in time/performance/memory?
What does this PR do?
Adds scikit-learn's OrdinalEncoder as a transformer search space component.
Where should the reviewer start?
The new component is based on the existing ColumnOneHotEncoder, with the underlying encoder swapped out for OrdinalEncoder.
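For reviewers unfamiliar with that pattern, here is a minimal sketch of what such a column-wise ordinal encoder can look like in plain scikit-learn terms. The class name, the 10-unique-value threshold, and the parameter names are illustrative (taken from the discussion above), not the actual code in this diff:

```python
# Illustrative sketch only -- not the code in this PR. Mirrors the
# ColumnOneHotEncoder idea: encode only low-cardinality columns and
# pass everything else through unchanged.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder


class ColumnOrdinalEncoderSketch(BaseEstimator, TransformerMixin):
    """Ordinal-encode columns with fewer than `min_unique` unique values."""

    def __init__(self, min_unique=10):
        self.min_unique = min_unique

    def fit(self, X, y=None):
        X = np.asarray(X)
        # columns treated as "categorical" by the cardinality heuristic
        self.encoded_cols_ = [
            i for i in range(X.shape[1])
            if len(np.unique(X[:, i])) < self.min_unique
        ]
        passthrough_cols = [
            i for i in range(X.shape[1]) if i not in self.encoded_cols_
        ]
        self.transformer_ = ColumnTransformer([
            ("ordinal",
             OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
             self.encoded_cols_),
            ("passthrough", "passthrough", passthrough_cols),
        ])
        self.transformer_.fit(X)
        return self

    def transform(self, X):
        return self.transformer_.transform(np.asarray(X))
```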
How should this PR be tested?
I've used it in my experiments so far, and it works as intended. There may be bugs I haven't encountered from swapping the OneHotEncoder for the OrdinalEncoder, so a quick double check couldn't hurt, but it probably isn't necessary since it runs fine on my copy and on the HPC.
Any background context you want to provide?
My experiments were running a lot longer because of the extra features added by the OneHotEncoder, so I added this to constrain the number of features being imputed. It is also useful for preventing possible future neural network components (like a potential GAN- or VAE-based imputer) from exploding the runtime.
What are the relevant issues?
N/A
Screenshots (if appropriate)
Questions: