
[MRG] Accept pre-binned data in fit #74

Merged 10 commits into ogrisel:master on Dec 18, 2018

Conversation

NicolasHug (Collaborator)

Closes #68

This PR makes it possible to pass pre-binned data to fit(), detected from the dtype of X; in that case the BinMapper is bypassed.

If the estimator was fitted with binned data, it raises an error when predict or predict_proba is used with real-valued data.

I renamed bin_thresholds to numerical_thresholds to avoid confusion and to be consistent with the TreePredictor fields.
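
For context, here is a minimal usage sketch of the behaviour described above, based on the API used in the benchmark later in this thread (exact error messages may differ):

import numpy as np
from sklearn.datasets import make_regression
from pygbm import GradientBoostingRegressor
from pygbm.binning import BinMapper

X, y = make_regression(n_samples=100, n_features=5)
X_binned = BinMapper().fit_transform(X)
assert X_binned.dtype == np.uint8

gb = GradientBoostingRegressor()
gb.fit(X_binned, y)   # uint8 dtype: the BinMapper is bypassed
gb.predict(X_binned)  # OK: same pre-binned representation as in fit
gb.predict(X)         # raises: the model was fitted on pre-binned data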

codecov bot commented Dec 15, 2018

Codecov Report

Merging #74 into master will decrease coverage by 0.03%.
The diff coverage is 97.82%.


@@            Coverage Diff             @@
##           master      #74      +/-   ##
==========================================
- Coverage   97.17%   97.14%   -0.04%     
==========================================
  Files          10       10              
  Lines        1028     1051      +23     
==========================================
+ Hits          999     1021      +22     
- Misses         29       30       +1
Impacted Files               Coverage Δ
pygbm/predictor.py           100% <100%> (ø) ⬆️
pygbm/grower.py              92.81% <100%> (ø) ⬆️
pygbm/binning.py             92.98% <100%> (ø) ⬆️
pygbm/gradient_boosting.py   97.61% <96.66%> (-0.28%) ⬇️
pygbm/utils.py               93.54% <0%> (+0.21%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ogrisel (Owner) left a comment

Some comments:

tests/test_gradient_boosting.py (outdated, resolved)
pygbm/gradient_boosting.py (outdated, resolved)
predictor.predict, X_binned
)

predictor.predict_binned(X) # No error
ogrisel (Owner):

Shouldn't that be:

predictor.predict_binned(X_binned)  # No error

instead?

Furthermore, I would have expected predictor.predict_binned to raise a ValueError when called with np.float32 or np.float64 input data. But maybe such a check in the nested private API is redundant and would introduce unnecessary overhead.

NicolasHug (Collaborator, Author):

Yes, you're right, I've added the check.

And yes, all the checks in PredictorNode are redundant, since GradientBoosting.predict() would only call predict_binned() on uint8 data and predict() on non-uint8 data.

My guess is that the overhead is negligible compared to the prediction time, but I have no strong opinion; we can remove them and just add a comment.
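
For illustration, here is a minimal sketch of the kind of dtype guard being discussed (the helper name is illustrative, not the actual pygbm internals):

import numpy as np

def _check_binned(X_binned):
    # Redundant with the dtype dispatch in GradientBoosting.predict(),
    # but cheap: pre-binned data must be uint8.
    if X_binned.dtype != np.uint8:
        raise ValueError('X_binned dtype should be uint8')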

tests/test_predictor.py (outdated, resolved)
@ogrisel (Owner) left a comment

Could you please run a quick benchmark predicting with a GB estimator on both pre-binned and numerical data, with a very small batch (e.g. 1 sample with 100 features) but a large number of trees (e.g. 300 trees with max_leaf_nodes=31)?

Then enable and disable the input checks in the predictor.predict_binned / predictor.predict methods to check whether or not those input checks are actually negligible.

if self.max_bins < max_bin_index + 1:
    raise ValueError(
        f'Data is pre-binned and max_bins={self.max_bins}, '
        f'but data has {max_bin_index + 1} bins.'
ogrisel (Owner):

In retrospect I find the phrasing confusing. May I suggest the following:

                 raise ValueError(
                     f'max_bins is set to {self.max_bins} but the data is '
                     f'pre-binned with {max_bin_index + 1} bins.'

pygbm/grower.py (outdated)
        The actual threshold values of each bin.
    numerical_thresholds : array-like of floats, optional (default=None)
        The actual threshold values of each bin. None if the training data
        was pre-binned.
ogrisel (Owner):

Maybe add that the values of the array are expected to be sorted in increasing order.

)

if X.dtype == np.uint8:
    raise ValueError('X dtype should be float32 or float64')
ogrisel (Owner):

Maybe be more explicit:

         if X.dtype == np.uint8:
             raise ValueError('X has uint8 dtype: use grower.predict_binned(X) '
                              'if X is pre-binned, or convert X to a float32 '
                              'dtype to be treated as numerical data')

@NicolasHug (Collaborator, Author):

without checks:

predict took 4.687ms on avg
predict_binned took 4.692ms on avg

with checks

predict took 4.934ms on avg
predict_binned took 4.860ms on avg
from time import time

import numpy as np
from sklearn.datasets import make_regression
from pygbm import GradientBoostingRegressor
from pygbm.binning import BinMapper


max_iter = 500

X, y = make_regression(n_samples=5, n_features=100)
X_binned = BinMapper().fit_transform(X)
assert X_binned.dtype == np.uint8

gb = GradientBoostingRegressor(max_iter=max_iter, scoring=None,
                               n_iter_no_change=None)
# Compiling
gb.fit(X, y)
assert len(gb.predictors_) == max_iter
gb.predict(X)
gb.predict(X_binned)

n_exp = 100
predict_time = 0
predict_binned_time = 0
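# Accumulate wall-clock time for both prediction paths over n_exp runs.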
for _ in range(n_exp):

    tic = time()
    gb.predict(X)
    toc = time()
    predict_time += toc - tic
    print(f'predict took {toc - tic:.3f}s')

    tic = time()
    gb.predict(X_binned)
    toc = time()
    predict_binned_time += toc - tic
    print(f'predict_binned took {toc - tic:.3f}s')

print('-' * 10)
print(f'predict took {predict_time / n_exp * 1000:.3f}ms on avg')
print(f'predict_binned took {predict_binned_time / n_exp * 1000:.3f}ms on avg')

This is a very minimal overhead, right?
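
(For reference: the checks add about 0.25 ms per predict call and 0.17 ms per predict_binned call here, i.e. roughly 4-5% with max_iter=500.)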

@ogrisel (Owner) left a comment

Thanks for the benchmark. +1 for merge once the previous comments are addressed.

@@ -6,7 +6,7 @@ skip_missing_interpreters=True
 [testenv]
 deps =
     numpy
-    scipy
+    scipy == 1.1.0
NicolasHug (Collaborator, Author):

@ogrisel I pinned the scipy version here for Travis as a temporary fix for #82.

ogrisel (Owner):

Please add a comment with the URL of issue #82.

NicolasHug (Collaborator, Author):

I needed to remove it; it was causing Travis to fail :/

tox.ini (outdated)
@@ -6,7 +6,7 @@ skip_missing_interpreters=True
 [testenv]
 deps =
     numpy
-    scipy == 1.1.0
+    scipy == 1.1.0 # temporary fix for issue #82
ogrisel (Owner):

I think you should remove the whitespace around "==".
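
That is, keeping the inline issue reference from the current diff:

scipy==1.1.0 # temporary fix for issue #82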

NicolasHug merged commit 0d1e68f into ogrisel:master on Dec 18, 2018
@NicolasHug (Collaborator, Author):

Thanks for the reviews!
