edit sort and loop for fitted pipeline #159
Merged
What does this PR do?
Previously, the best fitted pipeline was selected by taking the single pipeline with the maximum value of the first objective function. If that pipeline failed to fit, TPOT crashed without a fitted_estimator_.
Two changes
a) Pipelines are now sorted by all objective functions, in order. When multiple pipelines have the same first score, ties are broken by the second score, and so on. Previously, a random pipeline among those with the best first score was selected, which may not have been the optimal pipeline given the other scores.
b) There is a very rare but not impossible chance that a pipeline will run correctly in the objective functions but fail on the full dataset. For example, a selector that happens to select only columns with positive values when evaluated on the CV folds might select a different column, one that includes negative values, when trained on the full dataset. If the final estimator is MultinomialNB, the pipeline will run correctly in the objective function but throw an error on the full dataset, because MultinomialNB cannot accept negative values. This could cause TPOT to crash.
To resolve this, TPOT now loops through the pipelines from best to worst. If a pipeline fails to fit, TPOT catches the error and tries the next best pipeline. This prevents a terminal error from occurring.
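For reviewers, the two changes can be illustrated with a minimal standalone sketch. This is not the code in the diff: `rank_candidates`, `fit_first_working`, the `weights` tuple (+1 to maximize an objective, -1 to minimize it) and the toy `fit` callable are all hypothetical names chosen for illustration, and the candidates are plain strings rather than real pipelines.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple


def rank_candidates(scores: Sequence[Sequence[float]],
                    weights: Sequence[float]) -> List[int]:
    """Return candidate indices best-first, comparing objectives in order.

    Assumption: weights[k] is +1 if objective k is maximized, -1 if minimized.
    """
    def key(i: int) -> Tuple[float, ...]:
        # Negate the weighted score so an ascending sort puts the best first.
        # Tuples compare element by element, so ties on the first objective
        # fall through to the second objective, and so on.
        return tuple(-w * s for s, w in zip(scores[i], weights))

    return sorted(range(len(scores)), key=key)


def fit_first_working(candidates: Sequence[Any],
                      scores: Sequence[Sequence[float]],
                      weights: Sequence[float],
                      fit: Callable[[Any], Any]) -> Optional[Tuple[int, Any]]:
    """Try candidates from best to worst; return the first that fits."""
    for i in rank_candidates(scores, weights):
        try:
            return i, fit(candidates[i])
        except Exception as err:
            # A failure here is not fatal: report it and try the next best.
            print(f"candidate {i} failed on the full dataset: {err!r}")
    return None  # every candidate failed


# Toy data: objective 0 is maximized, objective 1 is minimized.
scores = [(0.9, 5), (0.9, 3), (0.8, 1)]
weights = (1, -1)
candidates = ["a", "b", "c"]


def fit(name: str) -> str:
    # Hypothetical stand-in for fitting on the full dataset; "b" fails.
    if name == "b":
        raise ValueError("negative values in input")
    return f"fitted:{name}"


order = rank_candidates(scores, weights)
result = fit_first_working(candidates, scores, weights, fit)
```

In this toy run, candidates 0 and 1 tie on the first objective, so the second objective puts candidate 1 first. Its fit raises, the error is caught, and candidate 0 is returned instead of the whole run crashing.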
Where should the reviewer start?
How should this PR be tested?
Double-check that the sort order is correct and that the code runs without issue on test data.