Commit a441130: documentation update

gykovacs committed Nov 6, 2023
1 parent 0b5bc38 commit a441130
Showing 5 changed files with 110 additions and 89 deletions.
18 changes: 16 additions & 2 deletions README.rst
@@ -125,8 +125,11 @@ For further information, see
* ReadTheDocs full documentation: https://mlscorecheck.readthedocs.io/en/latest/
* The preprint: https://arxiv.org/abs/2310.12527

The requirements of the consistency tests
=========================================
Preliminaries
=============

Requirements
------------

In general, there are three inputs to the consistency testing functions:

@@ -210,6 +213,17 @@ A note on the *Score of Means* and *Mean of Scores* aggregations

When it comes to the aggregation of scores (either over multiple folds, multiple datasets or both), there are two approaches in the literature. In the *Mean of Scores* (MoS) scenario, the scores are calculated for each fold/dataset, and the mean of the scores is determined as the score characterizing the entire experiment. In the *Score of Means* (SoM) approach, first the overall confusion matrix is determined, and then the scores are calculated based on these total figures. The advantage of the MoS approach over SoM is that it is possible to estimate the standard deviation of the scores; its disadvantage is that the average of non-linear scores might be distorted and some scores might become undefined when the folds are extremely small (typically in the case of small and imbalanced data).

The ``mlscorecheck`` package supports both approaches; however, by design and to increase awareness, different functions are provided for the two approaches, usually indicated by the ``_mos`` or ``_som`` suffix in the function names.

The types of tests
------------------

The consistency tests can be grouped into three classes; the problem and the experimental setup determine which internal implementation is applicable:

- Exhaustive enumeration: primarily applied for binary and multiclass classification, when the scores are calculated from one single confusion matrix. The calculations are sped up by interval computing techniques. These tests support all 20 performance scores of binary classification.
- Linear programming: when averaging is involved in the calculation of performance scores, the non-linearity of most scores means the averaging cannot be simplified away, and the extremely large parameter space rules out exhaustive enumeration. In these scenarios, linear integer programming is exploited. These tests usually support only the accuracy, sensitivity, specificity and balanced accuracy scores.
- Checking the relation of scores: mainly used for regression, when the domain of the performance scores is continuous, preventing inference from the discrete values.

Binary classification
=====================

92 changes: 91 additions & 1 deletion docs/01a_requirements.rst
@@ -1,8 +1,98 @@
Preliminaries
=============

Requirements
************
------------

In general, there are three inputs to the consistency testing functions:

* **the specification of the experiment**;
* **the collection of available (reported) performance scores**;
* **the estimated numerical uncertainty**: the performance scores are usually shared with some finite precision, being rounded/ceiled/floored to ``k`` decimal places. The numerical uncertainty estimates the maximum difference between the reported score and its true value. For example, having the accuracy score 0.9489 published (4 decimal places), one can suppose that it is rounded, therefore, the numerical uncertainty is 0.00005 (10^(-4)/2). To be more conservative, one can assume that the score was ceiled or floored; in this case, the numerical uncertainty becomes 0.0001 (10^(-4)). A small sketch of this calculation follows the list.
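
As an illustrative sketch (not part of the package API), the numerical uncertainty corresponding to a reported score can be derived from its number of decimal places:

.. code-block:: Python

    # a hypothetical reported score with 4 decimal places
    reported = "0.9489"
    decimals = len(reported.split(".")[1])

    eps_rounding = 10.0 ** (-decimals) / 2  # assuming rounding: 0.00005
    eps_truncation = 10.0 ** (-decimals)    # conservative (floor/ceil): 0.0001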

Specification of the experimental setup
---------------------------------------

In this subsection, we illustrate the various ways the experimental setup can be specified.

Specification of one testset or dataset
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are multiple ways to specify datasets and entire experiments consisting of multiple datasets evaluated with different cross-validation schemes.

A simple binary classification testset consisting of ``p`` positive samples (usually labelled 1) and ``n`` negative samples (usually labelled 0) can be specified as

.. code-block:: Python

    testset = {"p": 10, "n": 20}

One can also specify a commonly used dataset by its name and the package will look up the ``p`` and ``n`` counts of the dataset from its internal registry (based on the representations in the ``common-datasets`` package):

.. code-block:: Python

    dataset = {"dataset_name": "common_datasets.ADA"}

To see the list of supported datasets and corresponding counts, issue

.. code-block:: Python

    from mlscorecheck.experiments import dataset_statistics
    print(dataset_statistics)

Specification of a folding
^^^^^^^^^^^^^^^^^^^^^^^^^^

The specification of foldings is needed when the scores are computed in cross-validation scenarios. We distinguish two main cases: in the first case, the number of positive and negative samples in the folds is known, or can be derived from the attributes of the dataset (for example, by stratification); in the second case, the statistics of the folds are not known, but the number of folds and potential repetitions are known.

In the first case, when the folds are known, one can specify them by listing them:

.. code-block:: Python

    folding = {"folds": [{"p": 5, "n": 10},
                         {"p": 4, "n": 10},
                         {"p": 5, "n": 10}]}

This folding can represent the evaluation of a dataset with 14 positive and 30 negative samples in a 3-fold stratified cross-validation scenario.

If the folding is known to be derived by some standard stratification technique, one can specify just the parameters of the folding:

.. code-block:: Python

    folding = {"n_folds": 3, "n_repeats": 1, "strategy": "stratified_sklearn"}

In this specification, it is assumed that the samples are distributed into the folds according to the ``sklearn`` stratification implementation.
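
For illustration, the fold-level ``p`` and ``n`` counts implied by this strategy can be reproduced with ``sklearn`` directly. A minimal sketch (assuming the 14 positive and 30 negative samples of the previous example; not part of the package API):

.. code-block:: Python

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    y = np.array([1] * 14 + [0] * 30)  # 14 positive and 30 negative labels
    X = np.zeros((len(y), 1))          # dummy features, only the labels matter

    for _, test_idx in StratifiedKFold(n_splits=3).split(X, y):
        fold = y[test_idx]
        print({"p": int(fold.sum()), "n": int((fold == 0).sum())})

This should yield the same fold counts as in the earlier listing, up to the order of the folds.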

Finally, if neither the folds nor the folding strategy is known, one can simply specify the folding with its parameters (assuming a repeated k-fold scheme):

.. code-block:: Python

    folding = {"n_folds": 3, "n_repeats": 2}

Note that not all consistency testing functions support the latter case (not knowing the exact structure of the folds).

Specification of an evaluation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A dataset and a folding constitute an *evaluation*, and many of the test functions take evaluations as parameters describing the scenario:

.. code-block:: Python

    evaluation = {"dataset": {"p": 10, "n": 50},
                  "folding": {"n_folds": 5, "n_repeats": 1,
                              "strategy": "stratified_sklearn"}}

A note on the *Score of Means* and *Mean of Scores* aggregations
----------------------------------------------------------------

When it comes to the aggregation of scores (either over multiple folds, multiple datasets or both), there are two approaches in the literature. In the *Mean of Scores* (MoS) scenario, the scores are calculated for each fold/dataset, and the mean of the scores is determined as the score characterizing the entire experiment. In the *Score of Means* (SoM) approach, first the overall micro-figures (e.g. the overall confusion matrix in classification, the overall squared error in regression) are determined, and then the scores are calculated based on these total figures. The advantage of the MoS approach over SoM is that it is possible to estimate the standard deviation of the scores; its disadvantage is that the average of non-linear scores might be distorted and some scores might become undefined when the folds are extremely small (typically in the case of small and imbalanced data).

The ``mlscorecheck`` package supports both approaches; however, by design and to increase awareness, different functions are provided for the two approaches, usually indicated by the ``_mos`` or ``_som`` suffix in the function names.
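
The difference between the two aggregations is easy to reproduce by hand. A minimal sketch (with made-up fold figures, not using the package API), computing the sensitivity both ways:

.. code-block:: Python

    # two hypothetical folds: true positives and total positives
    folds = [{"tp": 9, "p": 10}, {"tp": 1, "p": 2}]

    # Mean of Scores: one sensitivity per fold, then their mean
    mos = sum(f["tp"] / f["p"] for f in folds) / len(folds)        # (0.9 + 0.5) / 2 = 0.7

    # Score of Means: pooled counts first, then a single sensitivity
    som = sum(f["tp"] for f in folds) / sum(f["p"] for f in folds)  # 10 / 12 = 0.8333...

The tiny second fold distorts the MoS figure, which is exactly the phenomenon described above.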

The types of tests
------------------

The consistency tests can be grouped into three classes; the problem and the experimental setup determine which internal implementation is applicable:

- Exhaustive enumeration: primarily applied for binary and multiclass classification, when the scores are calculated from one single confusion matrix. The calculations are sped up by interval computing techniques (see the sketch after this list). These tests support all 20 performance scores of binary classification.
- Linear programming: when averaging is involved in the calculation of performance scores, the non-linearity of most scores means the averaging cannot be simplified away, and the extremely large parameter space rules out exhaustive enumeration. In these scenarios, linear integer programming is exploited. These tests usually support only the accuracy, sensitivity, specificity and balanced accuracy scores.
- Checking the relation of scores: mainly used for regression, when the domain of the performance scores is continuous, preventing inference from the discrete values.
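
To give an intuition for the first class of tests, the following sketch enumerates all confusion matrices compatible with a set of reported scores (a simplified brute-force version with hypothetical figures; the package itself prunes the search with interval computing):

.. code-block:: Python

    p, n = 10, 20                        # positives and negatives in the testset
    acc, sens, spec = 0.6333, 0.9, 0.5   # hypothetical scores, rounded to 4 decimals
    eps = 1e-4 / 2                       # numerical uncertainty of rounding

    feasible = [(tp, tn)
                for tp in range(p + 1)
                for tn in range(n + 1)
                if abs(tp / p - sens) <= eps
                and abs(tn / n - spec) <= eps
                and abs((tp + tn) / (p + n) - acc) <= eps]

    inconsistent = not feasible
    print(feasible, inconsistent)  # [(9, 10)] False
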
70 changes: 0 additions & 70 deletions docs/01b_specifying_setup.rst

This file was deleted.

18 changes: 3 additions & 15 deletions docs/01c_consistency_checking.rst
@@ -1,17 +1,5 @@
Testing the consistency of performance scores
---------------------------------------------

Numerous experimental setups are supported by the package. In this section we go through them one by one, giving examples of possible use cases.

We emphasize again that the tests are designed to detect inconsistencies. If the resulting ``inconsistency`` flag is ``False``, the scores could still have been calculated in non-standard ways. However, **if the resulting ``inconsistency`` flag is ``True``, it conclusively indicates that inconsistencies have been detected, and the reported scores could not be the outcome of the presumed experiment**.

A note on the *Score of Means* and *Mean of Scores* aggregations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When it comes to the aggregation of scores (either over multiple folds, multiple datasets or both), there are two approaches in the literature. In the *Mean of Scores* (MoS) scenario, the scores are calculated for each fold/dataset, and the mean of the scores is determined as the score characterizing the entire experiment. In the *Score of Means* (SoM) approach, first the overall confusion matrix is determined, and then the scores are calculated based on these total figures. The advantage of the MoS approach over SoM is that it is possible to estimate the standard deviation of the scores; its disadvantage is that the average of non-linear scores might be distorted and some scores might become undefined when the folds are extremely small (typically in the case of small and imbalanced data).

Binary classification
~~~~~~~~~~~~~~~~~~~~~
=====================

Depending on the experimental setup, the consistency tests developed for binary classification problems support a variety of performance scores: when aggregated performance scores (averages on folds or datasets) are reported, only accuracy (``acc``), sensitivity (``sens``), specificity (``spec``) and balanced accuracy (``bacc``) are supported; when cross-validation is not involved in the experimental setup, the list of supported scores reads as follows (with abbreviations in parentheses):

@@ -426,7 +414,7 @@ The setup is consistent. However, if the balanced accuracy is changed to 0.9, th
# True
Multiclass classification
~~~~~~~~~~~~~~~~~~~~~~~~~
=========================

In multiclass classification scenarios, single testsets and k-fold cross-validation on a single dataset are supported with both the micro-averaging and macro-averaging aggregation strategies. The list of supported scores depends on the experimental setup: when applicable, all 20 scores listed for binary classification are supported; when the test operates in terms of linear programming, only accuracy, sensitivity, specificity and balanced accuracy are supported.
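
As a quick illustration of the two aggregation strategies (with a made-up 3-class confusion matrix, not using the package API):

.. code-block:: Python

    import numpy as np

    # hypothetical 3-class confusion matrix (rows: true class, columns: predicted class)
    cm = np.array([[9, 1, 0],
                   [2, 6, 2],
                   [1, 1, 3]])

    # macro-averaging: one recall per class, then their mean
    macro_recall = np.mean(np.diag(cm) / cm.sum(axis=1))  # (0.9 + 0.6 + 0.6) / 3 = 0.7

    # micro-averaging: pool the counts first, then a single score
    micro_recall = np.diag(cm).sum() / cm.sum()            # 18 / 25 = 0.72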

@@ -620,7 +608,7 @@ As the results show, there are no inconsistencies in the configuration. However,
# True
Regression
~~~~~~~~~~
==========

From the point of view of consistency testing, regression is the hardest problem, as the predictions can produce any performance score values. The tests implemented in the package allow checking the relations of the *mean squared error* (``mse``), *root mean squared error* (``rmse``), *mean absolute error* (``mae``) and *r^2 scores* (``r2``).
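
A minimal sketch of such a relation check (with hypothetical figures, not the package API): the root mean squared error must be the square root of the mean squared error up to the numerical uncertainty, and the mean absolute error can never exceed the root mean squared error.

.. code-block:: Python

    import math

    scores = {"mse": 2.5874, "rmse": 1.6085, "mae": 1.2031}  # assumed rounded to 4 decimals
    eps = 1e-4 / 2

    # interval of feasible true values behind each reported score
    iv = {key: (value - eps, value + eps) for key, value in scores.items()}

    rmse_from_mse = (math.sqrt(iv["mse"][0]), math.sqrt(iv["mse"][1]))
    consistent = (rmse_from_mse[0] <= iv["rmse"][1]
                  and iv["rmse"][0] <= rmse_from_mse[1]
                  and iv["mae"][0] <= iv["rmse"][1])
    print(consistent)  # True for these figures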

1 change: 0 additions & 1 deletion docs/index.rst
@@ -20,7 +20,6 @@ mlscorecheck: testing the consistency of machine learning performance scores
:caption: Consistency testing

01a_requirements
01b_specifying_setup
01c_consistency_checking

.. toctree::
