
Fit preprocessor just once with tune_bayes #955

Open
asb2111 opened this issue Oct 31, 2024 · 5 comments · May be fixed by #958

asb2111 commented Oct 31, 2024

Feature

Currently, tune_bayes appears to re-fit the entire preprocessor on every iteration, even when the preprocessor has nothing to tune. This can lead to a substantial amount of unnecessary computation, since the preprocessor should only need to be fit once per resample and could then be reused across all iterations.
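For concreteness, a minimal sketch of the scenario being described (the data, engine, and parameter choices are illustrative, not from the original report): the recipe carries no `tune()` parameters, only the model does, yet the recipe is still re-prepped for every candidate that `tune_bayes()` evaluates.

```r
library(tidymodels)

# Illustrative setup: the recipe has nothing to tune...
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors())

# ...only the model does.
spec <- boost_tree(trees = tune()) |>
  set_engine("xgboost") |>
  set_mode("regression")

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(spec)

# Each of the 5 Bayesian iterations currently re-fits the recipe
# on every resample, even though its result never changes.
set.seed(1)
res <- tune_bayes(wf, resamples = vfold_cv(mtcars, v = 5), iter = 5)
```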

topepo (Member) commented Oct 31, 2024

This is an excellent point.

Once a candidate is created by the Gaussian process model, we pass that to tune_grid(). It fits the new candidate for each resample, makes appropriate predictions, and gets metrics.
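The loop described above can be sketched in hedged pseudo-R; the function names here are illustrative placeholders, not tune's actual internals:

```r
# Pseudo-R sketch of the search loop described above (hypothetical names):
# history <- results_from_initial_grid
# for (i in seq_len(iter)) {
#   gp        <- fit_gaussian_process(history)
#   candidate <- pick_next_candidate(gp)
#   # The candidate is handed to tune_grid(), which re-fits the
#   # preprocessor *and* the model on every resample:
#   res       <- fit_via_tune_grid(workflow, resamples, candidate)
#   history   <- append_metrics(history, res)
# }
```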

We would have to make substantial changes to tune_grid() to skip the preprocessor. We'd also have to have it save each of the fitted models from the previous fit (assuming that the preprocessor is the same), take those as input, and then start the process at the point where the supervised model is trained (for each resample).

We'll have to think this over to see if there is a less invasive approach than the one described above (I don't think the approach above is feasible).

asb2111 (Author) commented Oct 31, 2024

If there is nothing to tune in the preprocessor, could it be 'baked' prior to starting the tuning process altogether? Then the workflow gets modified to use the baked data and no preprocessor, the tuning is conducted, and then everything gets repackaged at the end? Maybe this is too much work for too little gain in a special case.

topepo (Member) commented Oct 31, 2024

Unless you are using a validation set, we would not want to fit the preprocessor on the entire training set and then fit the model on a potentially different data set (i.e., one that was a resample).

asb2111 (Author) commented Oct 31, 2024

I'm thinking if we are passing in resamples we could bake the preprocessor on each resample in advance, or something like that.
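One way the suggestion might look, as a rough sketch (assuming a recipe with nothing to tune; the data and step are placeholders): prep and bake the recipe once per resample up front, so that later candidate fits could reuse the processed data instead of re-prepping each time.

```r
library(tidymodels)

# A tune-free recipe; step_lincomb() removes linear combinations.
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_lincomb(all_numeric_predictors())

folds <- vfold_cv(mtcars, v = 5)

# Prep the recipe once on each resample's analysis set and bake both
# the analysis and assessment data in advance.
baked <- lapply(folds$splits, function(split) {
  prepped <- prep(rec, training = analysis(split))
  list(
    analysis   = bake(prepped, new_data = NULL),
    assessment = bake(prepped, new_data = assessment(split))
  )
})
```

Note that, as the next comment works out, the baked analysis sets can end up with different columns across resamples, which is what breaks this approach.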

asb2111 (Author) commented Oct 31, 2024

Ok, I've tried hacking this together and I see why it won't work. The workflow expects the data in each resample to look alike (same columns, etc.), but if we preprocess each resample and glue them back together, the resamples can end up with different columns, and that breaks things downstream.

FWIW, I brought all this up because I have a workflow that involves step_lincomb, which takes a very long time to run.

@simonpcouch simonpcouch linked a pull request Nov 1, 2024 that will close this issue