fix: SGDClassifier post-processing with multi-class and improve linear models' predict method #585
Conversation
@@ -1605,31 +1605,11 @@ def test_predict_correctness(

@pytest.mark.parametrize("model_class, parameters", MODELS_AND_DATASETS)
@pytest.mark.parametrize(
there's no point in simulating here (we are testing `encrypt` + `run` + `decrypt`)
    # make sure we only compile below that bit-width.
    # Additionally, prevent computations in FHE with too many bits
    @pytest.mark.parametrize(
        "n_bits",
there's no real point in testing multiple `n_bits` values, as we are only testing the API + comparing FHE vs simulation
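A rough, hypothetical sketch of the reduced parametrization (the test name and value are made up, not taken from the PR):

```python
import pytest

# One representative bit-width is enough when the test only exercises the API
# and compares FHE execution against simulation.
@pytest.mark.parametrize("n_bits", [8])
def test_fhe_matches_simulation(n_bits):
    # Placeholder body: the real test would compile the model, run it in FHE
    # and in simulation, and compare the two outputs.
    assert n_bits > 0
```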
@@ -1735,38 +1735,6 @@ class SklearnLinearRegressorMixin(SklearnLinearModelMixin, sklearn.base.Regresso
    """


class SklearnSGDRegressorMixin(SklearnLinearRegressorMixin):
just moved it for better readability
@@ -1815,6 +1783,48 @@ def predict_proba(self, X: Data, fhe: Union[FheMode, str] = FheMode.DISABLE) ->
        y_proba = self.post_processing(y_logits)
        return y_proba

    # In scikit-learn, the argmax is done on the scores directly, not the probabilities
    def predict(self, X: Data, fhe: Union[FheMode, str] = FheMode.DISABLE) -> numpy.ndarray:
this fixes https://github.com/zama-ai/concrete-ml-internal/issues/4344 (the reason is explained in the main comment, but basically this is how scikit-learn does it)
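For context, a minimal sketch (a hypothetical helper, not the code added in this diff) of the scikit-learn-style `predict` for linear classifiers, i.e. an argmax taken over the decision-function scores rather than over probabilities:

```python
import numpy

def predict_from_scores(y_scores: numpy.ndarray) -> numpy.ndarray:
    """Turn decision-function scores into class indices, the way scikit-learn does."""
    if y_scores.ndim == 1:
        # Binary case: one score per sample, positive class when the score is > 0
        return (y_scores > 0).astype(numpy.int64)
    # Multi-class case: argmax over the raw scores, not over the probabilities
    return numpy.argmax(y_scores, axis=1)
```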
@@ -253,27 +252,12 @@ def __init__(
                "Setting 'parameter_range' is mandatory if FHE training is enabled "
                f"({fit_encrypted=}). Got {parameters_range=}"
            )

    def post_processing(self, y_preds: numpy.ndarray) -> numpy.ndarray:
this post_processing was wrong + I moved it below
@@ -835,61 +819,24 @@ def partial_fit(
            # FIXME: https://github.com/zama-ai/concrete-ml-internal/issues/4184
            raise NotImplementedError("Partial fit is not currently supported for clear training.")

    # This method is taken directly from scikit-learn
    def _predict_proba_lr(self, X: Data, fhe: Union[FheMode, str]) -> numpy.ndarray:
this part should be included in our `post_processing` method
    def predict_proba(self, X: Data, fhe: Union[FheMode, str] = FheMode.DISABLE) -> numpy.ndarray:
        """Probability estimates.

    def post_processing(self, y_preds: numpy.ndarray) -> numpy.ndarray:
basically we don't need to define the `predict_proba` method as it is in sklearn, we only need to re-define `post_processing` with sklearn's implementation
@@ -133,27 +133,23 @@ def test_fit_error_if_non_binary_targets(n_bits, max_iter, parameter_min_max):
        model.partial_fit(x, y, fhe="disable")


@pytest.mark.parametrize("loss", ["log_loss", "modified_huber"])
here we are testing that errors are raised for specific arguments, we don't need all these inputs
@@ -651,7 +648,11 @@ def check_separated_inference(model, fhe_circuit, x, check_float_array_equal):
        is_classifier_or_partial_classifier(model)
        and get_model_name(model) != "KNeighborsClassifier"
    ):
        y_pred = numpy.argmax(y_pred, axis=-1)
        # For linear classifiers, the argmax is done on the scores directly, not the probabilities
        if is_model_class_in_a_list(model, _get_sklearn_linear_models()):
as mentioned in the main comment, let's do it like sklearn
Awesome! Thanks a lot for the fixes + cleaning. Great explanations as well.
@@ -420,7 +421,7 @@ def check_accuracy():
    """Fixture to check the accuracy."""

    def check_accuracy_impl(expected, actual, threshold=0.9):
        accuracy = numpy.mean(expected == actual)
        accuracy = accuracy_score(expected, actual)
better imo
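Both expressions agree on plain label arrays; `accuracy_score` is just more explicit and validates its inputs. A tiny check with made-up values:

```python
import numpy
from sklearn.metrics import accuracy_score

expected = numpy.array([0, 1, 1, 0])
actual = numpy.array([0, 1, 0, 0])

# Three of the four labels match, so both computations give 0.75
assert numpy.mean(expected == actual) == accuracy_score(expected, actual) == 0.75
```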
        y_preds = self.output_quantizers[0].dequant(q_y_preds)

        # If the preds have shape (n, 1), squeeze it to shape (n,) like in scikit-learn
        if y_preds.ndim == 2 and y_preds.shape[1] == 1:
like in sklearn (same for the following ones)
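For reference, a small hypothetical sketch of the squeeze step shown in the hunk above (the helper name is assumed):

```python
import numpy

def squeeze_single_output(y_preds: numpy.ndarray) -> numpy.ndarray:
    """If the predictions have shape (n, 1), return them with shape (n,), as scikit-learn does."""
    if y_preds.ndim == 2 and y_preds.shape[1] == 1:
        return y_preds.ravel()
    return y_preds
```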
# "Method 'decision_function' outputs different shapes between scikit-learn and " | ||
# f"Concrete ML in FHE (fhe={fhe})" | ||
# ) | ||
assert y_scores_sklearn.shape == y_scores_fhe.shape, ( |
put back the assert here and below
@@ -1912,7 +1892,7 @@ def test_rounding_consistency_for_regular_models(
    else:
        # Check `predict` for regressors
        predict_method = model.predict
        metric = check_accuracy
        metric = check_r2_score
regressors should be tested with r2 score
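A tiny illustration with made-up values of why R² is the right metric for regressors while exact-match accuracy is not:

```python
import numpy
from sklearn.metrics import r2_score

expected = numpy.array([3.0, -0.5, 2.0, 7.0])
actual = numpy.array([2.5, 0.0, 2.0, 8.0])

# Exact-match accuracy is nearly meaningless for continuous outputs (0.25 here),
# while R² reflects how close the predictions are (about 0.95 here)
assert numpy.mean(expected == actual) == 0.25
assert r2_score(expected, actual) > 0.9
```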
Coverage passed ✅
(for info, the last commits are about fixing https://github.com/zama-ai/concrete-ml-internal/issues/4029)
Thanks!
First step towards a green weekly CI.

So there are a few things to note in this PR:
- the `post_processing` method was not properly integrated: instead, everything was done in `predict_proba`, which interferes with the way we build and test our API (notably the `test_separated_inference` test)
- the `predict` method: while trees / QNNs usually do `predict_proba` + `argmax`, for linear models sklearn does `decision_function` + `argmax`. In theory this should not change anything, as `predict_proba` basically does a sigmoid/softmax or normalization, but in practice it made the argmax behave differently because of slight floating point errors (basically the same as in https://github.com/zama-ai/concrete-ml-internal/issues/3369; see the small sketch at the end of this description)

Why are we discovering this only recently? A few reasons:
- for the `predict` method, because of the float issues mentioned above, we consider that the most important fact is making sure that quantized == quantized. The only place where this seems to happen is the hyper-parameter test, which is where @jfrery detected the issue
- for `test_separated_inference`: our custom `post_processing` function was still correct for 1D arrays (tested in regular CIs) with the "log_loss" loss (default). Only the weekly CI tests 2D arrays, which is where we found the error

I also took the liberty to clean up some parts/tests (no breaking changes).
refs https://github.com/zama-ai/concrete-ml-internal/issues/4030
closes https://github.com/zama-ai/concrete-ml-internal/issues/4344
closes https://github.com/zama-ai/concrete-ml-internal/issues/4252
closes https://github.com/zama-ai/concrete-ml-internal/issues/4029
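As a hedged illustration (not code from this PR), here is how two nearly-equal scores can make the argmax over probabilities disagree with the argmax over the raw scores in float64:

```python
import numpy

# Score 1 is (barely) the largest, by about 1e-14
scores = numpy.array([10.0, 10.0 + 1e-14])

# The sigmoid is monotone, so in exact arithmetic the argmax would be preserved
probabilities = 1.0 / (1.0 + numpy.exp(-scores))

print(numpy.argmax(scores))         # 1: the class with the highest raw score
print(numpy.argmax(probabilities))  # typically 0 in float64: both sigmoids round
                                    # to the same value, so the tie goes to index 0
```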