
Fit preprocessor just once with tune_bayes #955

Open
asb2111 opened this issue Oct 31, 2024 · 5 comments · May be fixed by #958

asb2111 commented Oct 31, 2024

Feature

Currently, tune_bayes appears to re-fit the entire preprocessor on every iteration, even when the preprocessor has nothing to tune. This can lead to a substantial amount of unnecessary computation, since the preprocessor should only need to be fit once per resample and could then be reused across all iterations.
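For concreteness, a minimal sketch of the scenario being described (the data, engine, and parameter choices are illustrative, not from the original report): the recipe carries no `tune()` parameters, only the model does, yet the recipe is still re-prepped for every candidate that `tune_bayes()` evaluates.

```r
library(tidymodels)

# Illustrative setup: the recipe has nothing to tune...
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors())

# ...only the model does.
spec <- boost_tree(trees = tune()) |>
  set_engine("xgboost") |>
  set_mode("regression")

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(spec)

# Each of the 5 Bayesian iterations currently re-fits the recipe
# on every resample, even though its result never changes.
set.seed(1)
res <- tune_bayes(wf, resamples = vfold_cv(mtcars, v = 5), iter = 5)
```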

topepo (Member) commented Oct 31, 2024

This is an excellent point.

Once a candidate is created by the Gaussian process model, we pass that to tune_grid(). It fits the new candidate for each resample, makes appropriate predictions, and gets metrics.
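The loop described above can be sketched in hedged pseudo-R; the function names here are illustrative placeholders, not tune's actual internals:

```r
# Pseudo-R sketch of the search loop described above (hypothetical names):
# history <- results_from_initial_grid
# for (i in seq_len(iter)) {
#   gp        <- fit_gaussian_process(history)
#   candidate <- pick_next_candidate(gp)
#   # The candidate is handed to tune_grid(), which re-fits the
#   # preprocessor *and* the model on every resample:
#   res       <- fit_via_tune_grid(workflow, resamples, candidate)
#   history   <- append_metrics(history, res)
# }
```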

We would have to make substantial changes to tune_grid() to skip the preprocessor. We'd also have to have it save each of the fitted models from the previous fit (assuming that the preprocessor is the same), take those as input, and then start the process at the point where the supervised model is trained (for each resample).

We'll have to think this over to see if there is a less invasive approach than the one described above (I don't think the approach above is feasible).

asb2111 (Author) commented Oct 31, 2024

If there is nothing to tune in the preprocessor, could it be 'baked' prior to starting the tuning process altogether? Then the workflow gets modified to use the baked data and no preprocessor, the tuning is conducted, and then everything gets repackaged at the end? Maybe this is too much work for too little gain in a special case.

topepo (Member) commented Oct 31, 2024

Unless you are using a validation set, we would not want to fit the preprocessor on the entire training set and then fit the model on a potentially different data set (i.e., one that was a resample).

asb2111 (Author) commented Oct 31, 2024

I'm thinking if we are passing in resamples we could bake the preprocessor on each resample in advance, or something like that.
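One way the suggestion might look, as a rough sketch (assuming a recipe with nothing to tune; the data and step are placeholders): prep and bake the recipe once per resample up front, so that later candidate fits could reuse the processed data instead of re-prepping each time.

```r
library(tidymodels)

# A tune-free recipe; step_lincomb() removes linear combinations.
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_lincomb(all_numeric_predictors())

folds <- vfold_cv(mtcars, v = 5)

# Prep the recipe once on each resample's analysis set and bake both
# the analysis and assessment data in advance.
baked <- lapply(folds$splits, function(split) {
  prepped <- prep(rec, training = analysis(split))
  list(
    analysis   = bake(prepped, new_data = NULL),
    assessment = bake(prepped, new_data = assessment(split))
  )
})
```

Note that, as the next comment works out, the baked analysis sets can end up with different columns across resamples, which is what breaks this approach.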

asb2111 (Author) commented Oct 31, 2024

Ok, I've tried hacking this together and I see why it won't work. The workflow expects the data in each resample to look alike (same columns, etc.), but if we preprocess each resample and glue them back together, the resamples can end up with different columns, and that breaks things downstream.

FWIW, I brought all this up because I have a workflow that involves step_lincomb, which takes a very long time to run.

@simonpcouch simonpcouch linked a pull request Nov 1, 2024 that will close this issue