[MRG] Accept pre-binned data in fit #74
Conversation
Codecov Report
@@ Coverage Diff @@
## master #74 +/- ##
==========================================
- Coverage 97.17% 97.14% -0.04%
==========================================
Files 10 10
Lines 1028 1051 +23
==========================================
+ Hits 999 1021 +22
- Misses 29 30 +1
Continue to review full report at Codecov.
Some comments:
tests/test_predictor.py
Outdated
    predictor.predict, X_binned
)

predictor.predict_binned(X)  # No error
Shouldn't that be:
predictor.predict_binned(X_binned)  # No error
instead?
Furthermore, I would have expected that predictor.predict_binned would raise a ValueError when called with np.float32 or np.float64 input data. But maybe such a check in the nested private API is redundant and would introduce unnecessary overhead.
Yes, you're right, I've added the check.
And yes, all the checks in PredictorNode are redundant, since GradientBoosting.predict() would only call predict_binned() on uint8 data and predict() on non-uint8 data.
My guess is that the overhead is negligible compared to the prediction time, but I have no strong opinion; we can remove them and just add a comment.
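For illustration, a minimal sketch of that dtype-based dispatch (a hypothetical wrapper; the actual GradientBoosting.predict() implementation may differ):

import numpy as np

def dispatch_predict(predictor, X):
    # uint8 data is assumed to be pre-binned (values are bin indices);
    # anything else is treated as numerical data.
    if X.dtype == np.uint8:
        return predictor.predict_binned(X)
    return predictor.predict(X)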
Could you please run a quick benchmark: predict with a GB estimator on both pre-binned and numerical data, with a very small batch (e.g. 1 sample with 100 features) but a large number of trees (e.g. 300 trees with max_leaf_nodes=31)?
Then enable and disable the input checks in the predictor.predict_binned / predictor.predict methods to check whether or not those input checks are actually negligible.
pygbm/gradient_boosting.py
Outdated
if self.max_bins < max_bin_index + 1:
    raise ValueError(
        f'Data is pre-binned and max_bins={self.max_bins}, '
        f'but data has {max_bin_index + 1} bins.'
In retrospect I find the phrasing confusing. May I suggest the following:
raise ValueError(
    f'max_bins is set to {self.max_bins} but the data is pre-binned with '
    f'{max_bin_index + 1} bins.'
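Folded into the check from the gradient_boosting.py diff above, the complete statement with the suggested wording would read:

if self.max_bins < max_bin_index + 1:
    raise ValueError(
        f'max_bins is set to {self.max_bins} but the data is pre-binned '
        f'with {max_bin_index + 1} bins.')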
pygbm/grower.py
Outdated
    The actual threshold values of each bin.
numerical_thresholds : array-like of floats, optional (default=None)
    The actual threshold values of each bin. None if the training data
    was pre-binned.
Maybe add that the values of the array are expected to be sorted in increasing order.
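For example, a hypothetical validation helper for that constraint (assuming numerical_thresholds is a per-feature sequence of 1D arrays; not part of the PR):

import numpy as np

def check_increasing(numerical_thresholds):
    # Each feature's bin thresholds are expected to be sorted in
    # increasing order.
    for feature_idx, thresholds in enumerate(numerical_thresholds):
        if np.any(np.diff(thresholds) < 0):
            raise ValueError(f'Thresholds of feature {feature_idx} are '
                             f'not sorted in increasing order.')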
pygbm/predictor.py
Outdated
)

if X.dtype == np.uint8:
    raise ValueError('X dtype should be float32 or float64')
Maybe be more explicit:
if X.dtype == np.uint8:
    raise ValueError('X has uint8 dtype: use grower.predict_binned(X) if X '
                     'is pre-binned, or convert X to a float32 dtype to be '
                     'treated as numerical data')
without checks:
with checks:
from time import time
import numpy as np
from sklearn.datasets import make_regression
from pygbm import GradientBoostingRegressor
from pygbm.binning import BinMapper
max_iter = 500
X, y = make_regression(n_samples=5, n_features=100)
X_binned = BinMapper().fit_transform(X)
assert X_binned.dtype == np.uint8
gb = GradientBoostingRegressor(max_iter=max_iter, scoring=None,
n_iter_no_change=None)
# Warm-up: trigger numba JIT compilation before timing
gb.fit(X, y)
assert len(gb.predictors_) == max_iter
gb.predict(X)
gb.predict(X_binned)
n_exp = 100
predict_time = 0
predict_binned_time = 0
for _ in range(n_exp):
tic = time()
gb.predict(X)
toc = time()
predict_time += toc - tic
print(f'predict took {toc - tic:.3f}s')
tic = time()
gb.predict(X_binned)
toc = time()
predict_binned_time += toc - tic
print(f'predict_binned took {toc - tic:.3f}s')
print('-' * 10)
print(f'predict took {predict_time / n_exp * 1000:.3f}ms on avg')
print(f'predict_binned took {predict_binned_time / n_exp * 1000:.3f}ms on avg')

This is a very minimal overhead, right?
Thanks for the benchmark. +1 for merge once the previous comments are addressed.
tox.ini

@@ -6,7 +6,7 @@ skip_missing_interpreters=True
 [testenv]
 deps =
     numpy
-    scipy
+    scipy == 1.1.0
Please add a comment with the URL of issue #82.
I needed to remove it; it was causing Travis to fail :/
tox.ini
Outdated
@@ -6,7 +6,7 @@ skip_missing_interpreters=True
 [testenv]
 deps =
     numpy
-    scipy == 1.1.0
+    scipy == 1.1.0 # temporary fix for issue #82
I think you should remove the whitespace around "==".
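With the whitespace removed, the pinned line would read:

scipy==1.1.0  # temporary fix for issue #82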
Thanks for the reviews!
Closes #68
This PR makes it possible to pass pre-binned data in fit() based on the dtype of X, in which case the BinMapper is bypassed. If the estimator was fitted with binned data, it raises an error when predict or predict_proba is used with real-valued data.

I renamed bin_thresholds into numerical_thresholds to avoid confusion and to be consistent with the TreePredictor fields.
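A minimal usage sketch of the new behavior (based on the BinMapper and GradientBoostingRegressor calls shown in the benchmark above; dataset sizes are arbitrary):

import numpy as np
from sklearn.datasets import make_regression
from pygbm import GradientBoostingRegressor
from pygbm.binning import BinMapper

X, y = make_regression(n_samples=1000, n_features=10)

# Pre-bin the data ourselves; fit() detects the uint8 dtype and
# bypasses its internal BinMapper.
X_binned = BinMapper().fit_transform(X)
assert X_binned.dtype == np.uint8

gb = GradientBoostingRegressor()
gb.fit(X_binned, y)

gb.predict(X_binned)  # OK: same pre-binned representation
# gb.predict(X)  # would raise an error: estimator was fitted on binned data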