Add SHAP calculation to GBT regression (#1399)

* rename gbt_convertors.pyx -> *.py

* use dataclasses for Node

* isort/black

* refactor get_gbt_model_from_xgboost() with improved Node classes

* refactor: put new NodeList and related classes in module namespace

* add cover to gbt regression nodes

* simplify xgboost tree parser

* Refactor gbt model parser for speed and add tests

* feat: provide pred_contribs/pred_interactions kwargs in GBT _predict_regression

* re-enable mb tests

* Return pred_interactions in correct shape

* clean up inference APIs and versioning

* Fix SHAP interaction output shape

* align tree clf/reg APIs

* update copyright

* fix: remove loading xgb only for a type hint

* Update LightGBM Model Builder for TreeView

* chore: rename model builders test file and remove ancient version check

* Start cleaning up model builder tests, fix some failing tests

* Add exhaustive model builder testing

* chore: merge test_xgboost_mb.py and test_model_builders.py

* fix: support XGBoost models trained with early stopping

* refactor: simplify early stopping test case

* fix: add SHAP to requirements-test

* chore: update oneDAL version for _gbt_inference_api_version 2

* Add GBT model builder API version descriptions

* Fix typo in pred_interactions test

* fix: remove local backup file

* fix: remove local backup file

* Start work on fixing LightGBM model builder test cases

* Properly use XGBoost's base_score parameter

* fix: parse enums declared with bit shifting

* refactor: SHAP prediction replace boolean parameters with DAAL_UINT64 flag

* chore: fix typos and add another classification test

* feat: add more tests for LightGBM models

* fix LightGBM model conversion

* feat: provide XGBoost SHAP example

* clean imports

* Include SHAP description

* typos

* chore: move model builder examples to dedicated directory

* rename model_builders -> mb

* Apply suggestions from code review

Co-authored-by: Alexandra <alexandra.epanchinzeva@intel.com>

* add reg/clf leaf node wrappers for backwards compatibility

* fix: model retrieve API

* chore: remove requirements-test-optional.txt

* Update CODEOWNERS after removing requirements-test-optional.txt

* fix: add new mb path to test_examples sys.path

* feat: add xgboost_shap example to testing for 2024.0.1

* fix: add shap to test requirements

* Skip SHAP checks for older versions

* fixup: skip shap tests if *not* daal_check_version(...)

* Let main() accept args and kwargs

* fix: only request resultsToCompute with compatible versions

* fixup: better error reporting

* use pytest for main()

* fix: use unittest.skipIf

* fix: typo 2023 -> 2024

* Drop 3.12 requirement

Co-authored-by: Nikolay Petrov <nikolay.a.petrov@intel.com>

* cleanup after rebase

* Skip SHAP install & tests on 3.12

* Install catboost on all python versions

* Skip catboost install & tests on 3.12

* chore: add fixmes for catboost and shap support on 3.12

---------

Co-authored-by: Alexandra <alexandra.epanchinzeva@intel.com>
Co-authored-by: Nikolay Petrov <nikolay.a.petrov@intel.com>
3 people committed Oct 27, 2023
1 parent 1fe0df1 commit 6d95372
Showing 22 changed files with 1,776 additions and 815 deletions.
2 changes: 1 addition & 1 deletion .ci/pipeline/build-and-test-lnx.yml
@@ -45,7 +45,7 @@ steps:
. /usr/share/miniconda/etc/profile.d/conda.sh
conda activate CB
bash .ci/scripts/setup_sklearn.sh $(SKLEARN_VERSION)
pip install --upgrade -r requirements-test.txt -r requirements-test-optional.txt
pip install --upgrade -r requirements-test.txt
pip install $(python .ci/scripts/get_compatible_scipy_version.py)
if [ $(echo $(PYTHON_VERSION) | grep '3.8\|3.9\|3.10') ]; then conda install -q -y -c intel dpnp; fi
pip list
2 changes: 1 addition & 1 deletion .ci/pipeline/build-and-test-mac.yml
@@ -40,7 +40,7 @@ steps:
- script: |
source activate CB
bash .ci/scripts/setup_sklearn.sh $(SKLEARN_VERSION)
pip install --upgrade -r requirements-test.txt -r requirements-test-optional.txt
pip install --upgrade -r requirements-test.txt
pip install $(python .ci/scripts/get_compatible_scipy_version.py)
pip list
displayName: 'Install testing requirements'
2 changes: 1 addition & 1 deletion .ci/pipeline/build-and-test-win.yml
@@ -43,7 +43,7 @@ steps:
set PATH=C:\msys64\usr\bin;%PATH%
call activate CB
bash .ci/scripts/setup_sklearn.sh $(SKLEARN_VERSION)
pip install --upgrade -r requirements-test.txt -r requirements-test-optional.txt
pip install --upgrade -r requirements-test.txt
cd ..
for /f "delims=" %%c in ('python s\.ci\scripts\get_compatible_scipy_version.py') do set SCIPY_VERSION=%%c
pip install %SCIPY_VERSION%
2 changes: 1 addition & 1 deletion .ci/pipeline/nightly.yml
@@ -64,7 +64,7 @@ jobs:
conda activate CB
pip install -r dependencies-dev
pip install -r requirements-doc.txt
pip install -r requirements-test.txt -r requirements-test-optional.txt
pip install -r requirements-test.txt
pip install jupyter matplotlib requests
displayName: 'Install requirements'
- script: |
5 changes: 2 additions & 3 deletions .github/CODEOWNERS
@@ -13,17 +13,16 @@ requirements-doc.txt @maria-Petrova @napetrov @aepanchi @Alexsandruss
onedal/ @Alexsandruss @samir-nasibli @KulikovNikita
sklearnex/ @Alexsandruss @samir-nasibli @KulikovNikita

# Examples
# Examples
examples/ @maria-Petrova @Alexsandruss @samir-nasibli @napetrov

# Dependencies
setup.py @napetrov @Alexsandruss @samir-nasibli
requirements* @napetrov @Alexsandruss @samir-nasibli @homksei @ahuber21 @ethanglaser
conda-recipe/ @napetrov @Alexsandruss
conda-recipe/ @napetrov @Alexsandruss

# Model builders
*model_builders* @razdoburdin @ahuber21 @avolkov-intel
requirements-test-optional.txt @razdoburdin @ahuber21 @avolkov-intel

# Forests
*ensemble* @ahuber21 @icfaust
54 changes: 49 additions & 5 deletions daal4py/mb/model_builders.py
@@ -200,7 +200,9 @@ def _predict_classification(self, X, fptype, resultsToEvaluate):
else:
return predict_result.probabilities

def _predict_regression(self, X, fptype):
def _predict_regression(
self, X, fptype, pred_contribs=False, pred_interactions=False
):
if X.shape[1] != self.n_features_in_:
raise ValueError("Shape of input is different from what was seen in `fit`")

@@ -212,22 +214,64 @@ def _predict_regression(self, X, fptype):
).format(type(self).__name__)
)

# Prediction
try:
return self._predict_regression_with_results_to_compute(
X, fptype, pred_contribs, pred_interactions
)
except TypeError as e:
if "unexpected keyword argument 'resultsToCompute'" in str(e):
if pred_contribs or pred_interactions:
# SHAP values requested, but not supported by this version
raise TypeError(
f"{'pred_contribs' if pred_contribs else 'pred_interactions'} not supported by this version of daal4py"
) from e
else:
# unknown type error
raise

# fallback to calculation without `resultsToCompute`
predict_algo = d4p.gbt_regression_prediction(fptype=fptype)
predict_result = predict_algo.compute(X, self.daal_model_)

return predict_result.prediction.ravel()

def _predict_regression_with_results_to_compute(
self, X, fptype, pred_contribs=False, pred_interactions=False
):
"""Assume daal4py supports the resultsToCompute kwarg"""
resultsToCompute = ""
if pred_contribs:
resultsToCompute = "shapContributions"
elif pred_interactions:
resultsToCompute = "shapInteractions"

predict_algo = d4p.gbt_regression_prediction(
fptype=fptype, resultsToCompute=resultsToCompute
)
predict_result = predict_algo.compute(X, self.daal_model_)

if pred_contribs:
return predict_result.prediction.ravel().reshape((-1, X.shape[1] + 1))
elif pred_interactions:
return predict_result.prediction.ravel().reshape(
(-1, X.shape[1] + 1, X.shape[1] + 1)
)
else:
return predict_result.prediction.ravel()


class GBTDAALModel(GBTDAALBaseModel):
def __init__(self):
pass

def predict(self, X):
def predict(self, X, pred_contribs=False, pred_interactions=False):
fptype = getFPType(X)
if self._is_regression:
return self._predict_regression(X, fptype)
return self._predict_regression(X, fptype, pred_contribs, pred_interactions)
else:
if pred_contribs or pred_interactions:
raise NotImplementedError(
f"{'pred_contribs' if pred_contribs else 'pred_interactions'} is not implemented for classification models"
)
return self._predict_classification(X, fptype, "computeClassLabels")

def predict_proba(self, X):
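
The try/except in _predict_regression above degrades gracefully on daal4py builds whose
gbt_regression_prediction does not accept resultsToCompute. A minimal, self-contained sketch of
that fallback pattern (old_kernel and new_kernel are hypothetical stand-ins for the daal4py call,
not real API):

def old_kernel(X):  # stand-in for an old build without resultsToCompute support
    return [0.0] * len(X)


def new_kernel(X, resultsToCompute=""):  # stand-in for a build that supports SHAP output
    return [[0.0, 0.5]] * len(X) if resultsToCompute else [0.0] * len(X)


def predict(X, kernel, pred_contribs=False):
    try:
        return kernel(X, resultsToCompute="shapContributions" if pred_contribs else "")
    except TypeError as e:
        if "unexpected keyword argument 'resultsToCompute'" not in str(e):
            raise  # unrelated TypeError: surface it unchanged
        if pred_contribs:
            # SHAP requested, but the installed version cannot compute it
            raise TypeError("pred_contribs not supported by this version") from e
    # fall back to plain prediction without resultsToCompute
    return kernel(X)


print(predict([[1.0]], new_kernel, pred_contribs=True))  # [[0.0, 0.5]]
print(predict([[1.0]], old_kernel))  # [0.0]
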
56 changes: 42 additions & 14 deletions doc/daal4py/model-builders.rst
@@ -24,17 +24,17 @@ Model Builders for the Gradient Boosting Frameworks

Introduction
------------------
Gradient boosting on decision trees is one of the most accurate and efficient
machine learning algorithms for classification and regression.
The most popular implementations of it are:
Gradient boosting on decision trees is one of the most accurate and efficient
machine learning algorithms for classification and regression.
The most popular implementations of it are:

* XGBoost*
* LightGBM*
* CatBoost*

daal4py Model Builders deliver the accelerated
models inference of those frameworks. The inference is performed by the oneDAL GBT implementation tuned
for the best performance on the Intel(R) Architecture.
models inference of those frameworks. The inference is performed by the oneDAL GBT implementation tuned
for the best performance on the Intel(R) Architecture.

Conversion
---------
@@ -61,22 +61,49 @@ CatBoost::
Classification and Regression Inference
----------------------------------------

The API is the same for classification and regression inference.
Based on the original model passed to the ``convert_model``, ``d4p_prediction`` is either the classification or regression output.
The API is the same for classification and regression inference.
Based on the original model passed to the ``convert_model()``, ``d4p_prediction`` is either the classification or regression output.

::

d4p_prediction = d4p_model.predict(test_data)

Here, the ``predict()`` method of ``d4p_model`` is being used to make predictions on the ``test_data`` dataset.
The ``d4p_prediction`` variable stores the predictions made by the ``predict()`` method.
The ``d4p_prediction`` variable stores the predictions made by the ``predict()`` method.
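
For illustration, a complete round trip from a trained XGBoost regressor to daal4py predictions
might look as follows (a sketch: ``xgb_model`` and ``test_data`` are assumed to exist already)::

    import daal4py as d4p

    d4p_model = d4p.mb.convert_model(xgb_model.get_booster())
    d4p_prediction = d4p_model.predict(test_data)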

SHAP Value Calculation for Regression Models
------------------------------------------------------------

SHAP contribution and interaction value calculations are natively supported by models created with daal4py Model Builders.
For these models, the ``predict()`` method takes additional keyword arguments:

::

d4p_model.predict(test_data, pred_contribs=True) # for SHAP contributions
d4p_model.predict(test_data, pred_interactions=True) # for SHAP interactions

The returned prediction has the shape:

* ``(n_rows, n_features + 1)`` for SHAP contributions
* ``(n_rows, n_features + 1, n_features + 1)`` for SHAP interactions

Here, ``n_rows`` is the number of rows (i.e., observations) in
``test_data``, and ``n_features`` is the number of features in the dataset.

The prediction result for SHAP contributions includes a feature attribution value for each feature and a bias term for each observation.

The prediction result for SHAP interactions comprises ``(n_features + 1) x (n_features + 1)`` values for all possible
feature combinations, along with their corresponding bias terms.

.. note:: The shapes of SHAP contributions and interactions are consistent with the XGBoost results.
In contrast, the `SHAP Python package <https://shap.readthedocs.io/en/latest/>`_ drops bias terms, resulting
in SHAP contributions (SHAP interactions) with one fewer column (one fewer column and row) per observation.
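
For example, with a converted regression model and a two-dimensional ``test_data`` array, the
shapes can be verified as follows (an illustrative sketch; the variable names are placeholders)::

    contribs = d4p_model.predict(test_data, pred_contribs=True)
    interactions = d4p_model.predict(test_data, pred_interactions=True)

    n_rows, n_features = test_data.shape
    assert contribs.shape == (n_rows, n_features + 1)
    assert interactions.shape == (n_rows, n_features + 1, n_features + 1)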

Scikit-learn-style Estimators
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can also use the scikit-learn-style classes ``GBTDAALClassifier`` and ``GBTDAALRegressor`` to convert and infer your models. For example:

::
::

from daal4py.sklearn.ensemble import GBTDAALRegressor
reg = xgb.XGBRegressor()
@@ -88,16 +115,17 @@ Limitations
------------------
Model Builders support only base inference with prediction and probabilities prediction. The functionality is to be extended.
Therefore, there are the following limitations:
- The categorical features are not supported for conversion and prediction.
- The categorical features are not supported for conversion and prediction.
- The multioutput models are not supported for conversion and prediction.
- The tree SHAP calculations are not supported.
- SHAP values can be calculated for regression models only.


Examples
---------------------------------
Model Builders models conversion

- `XGBoost model conversion <https://github.com/intel/scikit-learn-intelex/blob/master/examples/daal4py/model_builders_xgboost.py>`_
- `SHAP value prediction from an XGBoost model <https://github.com/intel/scikit-learn-intelex/blob/master/examples/daal4py/model_builders_xgboost_shap.py>`_
- `LightGBM model conversion <https://github.com/intel/scikit-learn-intelex/blob/master/examples/daal4py/model_builders_lightgbm.py>`_
- `CatBoost model conversion <https://github.com/intel/scikit-learn-intelex/blob/master/examples/daal4py/model_builders_catboost.py>`_

File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
80 changes: 80 additions & 0 deletions examples/mb/model_builders_xgboost_shap.py
@@ -0,0 +1,80 @@
# ==============================================================================
# Copyright 2023 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# daal4py Gradient Boosting Regression model creation and SHAP value
# prediction example

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

import daal4py as d4p


def main(*args, **kwargs):
    # create data
    X, y = make_regression(n_samples=10000, n_features=10, random_state=42)
    X_train, X_test, y_train, _ = train_test_split(X, y, random_state=42)

    # train the model
    xgb_model = xgb.XGBRegressor(
        max_depth=6, n_estimators=100, random_state=42, base_score=0.5
    )
    xgb_model.fit(X_train, y_train)

    # Conversion to daal4py
    daal_model = d4p.mb.convert_model(xgb_model.get_booster())

    # SHAP contributions
    daal_contribs = daal_model.predict(X_test, pred_contribs=True)

    # SHAP interactions
    daal_interactions = daal_model.predict(X_test, pred_interactions=True)

    # XGBoost reference values
    xgb_contribs = xgb_model.get_booster().predict(
        xgb.DMatrix(X_test), pred_contribs=True, validate_features=False
    )
    xgb_interactions = xgb_model.get_booster().predict(
        xgb.DMatrix(X_test), pred_interactions=True, validate_features=False
    )

    return (
        daal_contribs,
        daal_interactions,
        xgb_contribs,
        xgb_interactions,
    )


if __name__ == "__main__":
    daal_contribs, daal_interactions, xgb_contribs, xgb_interactions = main()
    print(f"XGBoost SHAP contributions shape: {xgb_contribs.shape}")
    print(f"daal4py SHAP contributions shape: {daal_contribs.shape}")

    print(f"XGBoost SHAP interactions shape: {xgb_interactions.shape}")
    print(f"daal4py SHAP interactions shape: {daal_interactions.shape}")

    contribution_rmse = np.sqrt(
        np.mean((daal_contribs.reshape(-1, 1) - xgb_contribs.reshape(-1, 1)) ** 2)
    )
    print(f"SHAP contributions RMSE: {contribution_rmse:.2e}")

    interaction_rmse = np.sqrt(
        np.mean(
            (daal_interactions.reshape(-1, 1) - xgb_interactions.reshape(-1, 1)) ** 2
        )
    )
    print(f"SHAP interactions RMSE: {interaction_rmse:.2e}")
10 changes: 8 additions & 2 deletions generator/parse.py
@@ -283,8 +283,14 @@ def parse(self, elem, ctxt):
ctxt.enum = False
return True
regex = (
r"^\s*(\w+)(?:\s*=\s*((\(int\))?\w(\w|:|\s|\+)*))?"
+ r"(\s*,)?\s*((/\*|//).*)?$"
# capture group for value name
r"^\s*(\w+)"
# capture group for value (different possible formats, 123, 0x1, (1 << 5), etc.)
+ r"(?:\s*=\s*((\(int\))?(\w|:|\s|\+|\(?\d+\s*<<\s*\d+\)?)*))?"
# comma after the value, plus possible comments
+ r"(\s*,)?\s*((/\*|//).*)?"
# EOL
+ r"$"
)
me = re.match(regex, elem)
if me and not me.group(1).startswith("last"):
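
To see what the relaxed pattern accepts, the regex above can be exercised on a couple of made-up
enum lines (the declarations below are illustrative, not actual oneDAL enum values):

import re

regex = (
    r"^\s*(\w+)"
    + r"(?:\s*=\s*((\(int\))?(\w|:|\s|\+|\(?\d+\s*<<\s*\d+\)?)*))?"
    + r"(\s*,)?\s*((/\*|//).*)?"
    + r"$"
)

for elem in ("computeClassLabels = 0x1,", "shapContributions = (1 << 16), // bit-shifted"):
    me = re.match(regex, elem)
    print(me.group(1), "->", me.group(2))  # value name and raw value string
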
4 changes: 0 additions & 4 deletions requirements-test-optional.txt

This file was deleted.

5 changes: 5 additions & 0 deletions requirements-test.txt
@@ -6,3 +6,8 @@ scikit-learn==1.2.2 ; python_version == '3.8'
scikit-learn==1.3.1 ; python_version >= '3.9'
pandas==2.0.1 ; python_version == '3.8'
pandas==2.1.1 ; python_version >= '3.9'
xgboost==1.7.6; python_version <= '3.9'
xgboost==2.0.0; python_version >= '3.10'
lightgbm==4.1.0
catboost==1.2.2; python_version <= '3.11' # FIXME: Add as soon as 3.12 is supported
shap==0.42.1; python_version <= '3.11' # FIXME: Add as soon as 3.12 is supported