
[MRG] Accept pre-binned data in fit #74

Merged 10 commits into ogrisel:master on Dec 18, 2018

Conversation

NicolasHug (Collaborator)

Closes #68

This PR makes it possible to pass pre-binned data to fit(), detected from the dtype of X; in that case the BinMapper is bypassed.

If the estimator was fitted with binned data, it raises an error when predict or predict_proba is used with real-valued data.

I renamed bin_thresholds to numerical_thresholds to avoid confusion and to be consistent with the TreePredictor fields.
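
For context, here is a minimal usage sketch of the behaviour described above, based on the API used in the benchmark later in this thread (exact error messages may differ):

import numpy as np
from sklearn.datasets import make_regression
from pygbm import GradientBoostingRegressor
from pygbm.binning import BinMapper

X, y = make_regression(n_samples=100, n_features=5)
X_binned = BinMapper().fit_transform(X)
assert X_binned.dtype == np.uint8

gb = GradientBoostingRegressor()
gb.fit(X_binned, y)   # uint8 dtype: the BinMapper is bypassed
gb.predict(X_binned)  # OK: same pre-binned representation as in fit
gb.predict(X)         # raises: the model was fitted on pre-binned data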

codecov bot commented Dec 15, 2018

Codecov Report

Merging #74 into master will decrease coverage by 0.03%.
The diff coverage is 97.82%.


@@            Coverage Diff             @@
##           master      #74      +/-   ##
==========================================
- Coverage   97.17%   97.14%   -0.04%     
==========================================
  Files          10       10              
  Lines        1028     1051      +23     
==========================================
+ Hits          999     1021      +22     
- Misses         29       30       +1
Impacted Files               Coverage Δ
pygbm/predictor.py           100% <100%> (ø) ⬆️
pygbm/grower.py              92.81% <100%> (ø) ⬆️
pygbm/binning.py             92.98% <100%> (ø) ⬆️
pygbm/gradient_boosting.py   97.61% <96.66%> (-0.28%) ⬇️
pygbm/utils.py               93.54% <0%> (+0.21%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ogrisel (Owner) left a comment

Some comments:

tests/test_gradient_boosting.py (outdated, resolved)
pygbm/gradient_boosting.py (outdated, resolved)
predictor.predict, X_binned
)

predictor.predict_binned(X) # No error
ogrisel (Owner):

Shouldn't that be:

predictor.predict_binned(X_binned)  # No error

instead?

Furthermore, I would have expected predictor.predict_binned to raise a ValueError when called with np.float32 or np.float64 input data. But maybe such a check in the nested private API is redundant and would introduce unnecessary overhead.

NicolasHug (Collaborator, Author):

Yes, you're right, I've added the check.

And yes, all the checks in PredictorNode are redundant, since GradientBoosting.predict() would only call predict_binned() on uint8 data and predict() on non-uint8 data.

My guess is that the overhead is negligible compared to the prediction time, but I have no strong opinion; we can remove them and just add a comment.
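
For illustration, here is a minimal sketch of the kind of dtype guard being discussed (the helper name is illustrative, not the actual pygbm internals):

import numpy as np

def _check_binned(X_binned):
    # Redundant with the dtype dispatch in GradientBoosting.predict(),
    # but cheap: pre-binned data must be uint8.
    if X_binned.dtype != np.uint8:
        raise ValueError('X_binned dtype should be uint8')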

tests/test_predictor.py (outdated, resolved)
@ogrisel (Owner) left a comment

Could you please run a quick benchmark predicting with a GB estimator on both pre-binned and numerical data, with a very small batch (e.g. 1 sample with 100 features) but a large number of trees (e.g. 300 trees with max_leaf_nodes=31)?

Then enable and disable the input checks in the predictor.predict_binned / predictor.predict methods to check whether or not those input checks are actually negligible.

if self.max_bins < max_bin_index + 1:
    raise ValueError(
        f'Data is pre-binned and max_bins={self.max_bins}, '
        f'but data has {max_bin_index + 1} bins.'
ogrisel (Owner):

In retrospect I find the phrasing confusing. May I suggest the following:

                 raise ValueError(
                     f'max_bins is set to {self.max_bins} but the data is '
                     f'pre-binned with {max_bin_index + 1} bins.'

pygbm/grower.py (outdated)
        The actual threshold values of each bin.
    numerical_thresholds : array-like of floats, optional (default=None)
        The actual threshold values of each bin. None if the training data
        was pre-binned.
ogrisel (Owner):

Maybe add that the values of the array are expected to be sorted in increasing order.

)

if X.dtype == np.uint8:
    raise ValueError('X dtype should be float32 or float64')
ogrisel (Owner):

Maybe be more explicit:

         if X.dtype == np.uint8:
             raise ValueError('X has uint8 dtype: use grower.predict_binned(X) '
                              'if X is pre-binned, or convert X to a float32 '
                              'dtype to be treated as numerical data')

@NicolasHug (Collaborator, Author):

without checks:

predict took 4.687ms on avg
predict_binned took 4.692ms on avg

with checks

predict took 4.934ms on avg
predict_binned took 4.860ms on avg
from time import time

import numpy as np
from sklearn.datasets import make_regression
from pygbm import GradientBoostingRegressor
from pygbm.binning import BinMapper


max_iter = 500

X, y = make_regression(n_samples=5, n_features=100)
X_binned = BinMapper().fit_transform(X)
assert X_binned.dtype == np.uint8

gb = GradientBoostingRegressor(max_iter=max_iter, scoring=None,
                               n_iter_no_change=None)
# Compiling
gb.fit(X, y)
assert len(gb.predictors_) == max_iter
gb.predict(X)
gb.predict(X_binned)

n_exp = 100
predict_time = 0
predict_binned_time = 0
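# Accumulate wall-clock time for both prediction paths over n_exp runs.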
for _ in range(n_exp):

    tic = time()
    gb.predict(X)
    toc = time()
    predict_time += toc - tic
    print(f'predict took {toc - tic:.3f}s')

    tic = time()
    gb.predict(X_binned)
    toc = time()
    predict_binned_time += toc - tic
    print(f'predict_binned took {toc - tic:.3f}s')

print('-' * 10)
print(f'predict took {predict_time / n_exp * 1000:.3f}ms on avg')
print(f'predict_binned took {predict_binned_time / n_exp * 1000:.3f}ms on avg')

This is a very minimal overhead, right?
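
(For reference: the checks add about 0.25 ms per predict call and 0.17 ms per predict_binned call here, i.e. roughly 4-5% with max_iter=500.)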

@ogrisel (Owner) left a comment

Thanks for the benchmark. +1 for merge once the previous comments are addressed.

@@ -6,7 +6,7 @@ skip_missing_interpreters=True
 [testenv]
 deps =
     numpy
-    scipy
+    scipy == 1.1.0
NicolasHug (Collaborator, Author):

@ogrisel I pinned the scipy version here for Travis as a temporary fix for #82.

ogrisel (Owner):

Please add a comment with the URL of issue #82.

NicolasHug (Collaborator, Author):

I needed to remove it; it was causing Travis to fail :/

tox.ini (outdated)
@@ -6,7 +6,7 @@ skip_missing_interpreters=True
 [testenv]
 deps =
     numpy
-    scipy == 1.1.0
+    scipy == 1.1.0 # temporary fix for issue #82
ogrisel (Owner):

I think you should remove the whitespace around "==".
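
That is, keeping the inline issue reference from the current diff:

scipy==1.1.0 # temporary fix for issue #82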

NicolasHug merged commit 0d1e68f into ogrisel:master on Dec 18, 2018
@NicolasHug (Collaborator, Author):

Thanks for the reviews!
