Commit
Merge pull request #2 from FalseNegativeLab/feature/multiclass
Feature/multiclass
gykovacs authored Nov 6, 2023
2 parents e287552 + a441130 commit 0cbb9e1
Showing 215 changed files with 17,779 additions and 10,762 deletions.
2 changes: 2 additions & 0 deletions .flake8
@@ -0,0 +1,2 @@
[flake8]
max-line-length = 100
463 changes: 223 additions & 240 deletions README.rst


41 changes: 28 additions & 13 deletions docs/00_introduction.rst
@@ -2,23 +2,35 @@
:file: ga4.html

Introduction
============

The purpose
-----------

Performance scores of a machine learning technique (binary/multiclass classification, regression) are reported on a dataset and look suspicious (exceptionally high scores possibly due to a typo, uncommon evaluation methodology, data leakage in preparation, incorrect use of statistics, etc.). With the tools implemented in the package ``mlscorecheck``, one can test if the reported performance scores are consistent with each other and the assumptions on the experimental setup.

Testing is as simple as the following example illustrates. Suppose the accuracy, sensitivity and specificity scores are reported for a binary classification testset consisting of p=100 and n=200 samples. All this information is supplied to the suitable test function and the result shows that inconsistencies were identified: the scores could not be calculated from the confusion matrix of the testset:

.. code-block:: Python

    from mlscorecheck.check.binary import check_1_testset_no_kfold

    result = check_1_testset_no_kfold(testset={'p': 100, 'n': 200},
                                      scores={'acc': 0.9567, 'sens': 0.8545, 'spec': 0.9734},
                                      eps=1e-4)
    result['inconsistency']
    # True

The consistency tests are numerical and **not** statistical: if inconsistencies are identified, it means that either the assumptions on the experimental setup or the reported scores are incorrect.

In more detail
--------------

The evaluation of the performance of machine learning techniques, whether for original theoretical advancements or applications in specific fields, relies heavily on performance scores (https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers). Although reported performance scores are employed as primary indicators of research value, they often suffer from methodological problems, typos, and insufficient descriptions of experimental settings. These issues contribute to the replication crisis (https://en.wikipedia.org/wiki/Replication_crisis) and ultimately entire fields of research ([RV]_, [EHG]_). Even systematic reviews can suffer from using incomparable performance scores for ranking research papers [RV]_.

In practice, the performance scores cannot take arbitrary values independently of each other: the scores reported for the same experiment are constrained by the experimental setup and need to express some internal consistency. For many commonly used experimental setups it is possible to develop numerical techniques to test if the scores could be the outcome of the presumed experiment on the presumed dataset. This package implements such consistency tests for some common experimental setups. We highlight that the developed tests cannot guarantee that the scores are surely calculated by some standards or a presumed evaluation protocol. However, *if the tests fail and inconsistencies are detected, it means that the scores are not calculated by the presumed protocols with certainty*. In this sense, the specificity of the tests is 1.0: any inconsistency that is detected is a genuine one.
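
The core idea can be illustrated by a brute-force sketch (not the package's implementation, which relies on much more efficient techniques): enumerate all confusion matrices compatible with the testset and check whether any of them reproduces the reported scores within the numerical uncertainty ``eps``. Using the figures from the introductory example:

.. code-block:: Python

    p, n = 100, 200
    scores = {"acc": 0.9567, "sens": 0.8545, "spec": 0.9734}
    eps = 1e-4

    def consistent(p, n, scores, eps):
        # enumerate all confusion matrices with p positives and n negatives
        for tp in range(p + 1):
            for tn in range(n + 1):
                acc = (tp + tn) / (p + n)
                sens = tp / p
                spec = tn / n
                if (abs(acc - scores["acc"]) <= eps
                        and abs(sens - scores["sens"]) <= eps
                        and abs(spec - scores["spec"]) <= eps):
                    return True
        return False

    print(consistent(p, n, scores, eps))  # False: no compatible confusion matrix exists

The package implements far more efficient versions of this check (interval computations and linear programming), but the question being answered is the same.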

For further information, see the preprint: https://arxiv.org/abs/2310.12527

Citation
========
@@ -28,33 +40,36 @@ If you use the package, please consider citing the following paper:
.. code-block:: BibTex

    @article{mlscorecheck,
        author={Attila Fazekas and Gy\"orgy Kov\'acs},
        title={Testing the Consistency of Performance Scores Reported for Binary Classification Problems},
        year={2023}
    }

Latest news
===========

* the 1.0.1 version of the package is released;
* the paper describing the numerical techniques is available as a preprint at: https://arxiv.org/abs/2310.12527
* 10 test bundles including retina image processing datasets, preterm delivery prediction from electrohysterograms and skin lesion classification have been added;
* multiclass and regression tests added.

Installation
============

Requirements
------------

The package has only basic requirements when used for consistency testing:

* ``numpy``
* ``pulp``
* ``scikit-learn``

.. code-block:: bash

    > pip install numpy pulp

In order to execute the unit tests for the computer algebra components or reproduce the algebraic solutions, either ``sympy`` or ``sage`` needs to be installed. The installation of ``sympy`` can be done in the usual way. To install ``sage`` in a ``conda`` environment, one needs to add the ``conda-forge`` channel first:

.. code-block:: bash
118 changes: 92 additions & 26 deletions docs/01a_requirements.rst
@@ -1,32 +1,98 @@
Preliminaries
=============

Requirements
------------

In general, there are three inputs to the consistency testing functions:

* **the specification of the experiment**;
* **the collection of available (reported) performance scores**: when aggregated performance scores (averages on folds or datasets) are reported, only accuracy (``acc``), sensitivity (``sens``), specificity (``spec``) and balanced accuracy (``bacc``) are supported; when cross-validation is not involved in the experimental setup, the list of supported scores reads as follows (with abbreviations in parentheses):

* accuracy (``acc``),
* sensitivity (``sens``),
* specificity (``spec``),
* positive predictive value (``ppv``),
* negative predictive value (``npv``),
* balanced accuracy (``bacc``),
* f1(-positive) score (``f1``),
* f1-negative score (``f1n``),
* f-beta positive (``fbp``),
* f-beta negative (``fbn``),
* Fowlkes-Mallows index (``fm``),
* unified performance measure (``upm``),
* geometric mean (``gm``),
* markedness (``mk``),
* positive likelihood ratio (``lrp``),
* negative likelihood ratio (``lrn``),
* Matthews correlation coefficient (``mcc``),
* bookmaker informedness (``bm``),
* prevalence threshold (``pt``),
* diagnostic odds ratio (``dor``),
* Jaccard index (``ji``),
* Cohen's kappa (``kappa``);

* **the estimated numerical uncertainty**: the performance scores are usually shared with some finite precision, being rounded/ceiled/floored to ``k`` decimal places. The numerical uncertainty estimates the maximum difference between the reported score and its true value. For example, having the accuracy score 0.9489 published (4 decimal places), one can suppose that it is rounded, therefore, the numerical uncertainty is 0.00005 (10^(-4)/2). To be more conservative, one can assume that the score was ceiled or floored. In this case, the numerical uncertainty becomes 0.0001 (10^(-4)); see the sketch below.
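
A small illustration of estimating ``eps`` from the number of reported decimal places (the variable names are only for the example):

.. code-block:: Python

    # estimating the numerical uncertainty from k reported decimal places
    k = 4                          # e.g. an accuracy reported as 0.9489
    eps_rounded = 10.0**(-k) / 2   # 0.00005, assuming rounding
    eps_conservative = 10.0**(-k)  # 0.0001, assuming flooring/ceiling
    print(eps_rounded, eps_conservative)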

Specification of the experimental setup
---------------------------------------

In this subsection, we illustrate the various ways the experimental setup can be specified.

Specification of one testset or dataset
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are multiple ways to specify datasets and entire experiments consisting of multiple datasets evaluated with different cross-validation schemes.

A simple binary classification testset consisting of ``p`` positive samples (usually labelled 1) and ``n`` negative samples (usually labelled 0) can be specified as

.. code-block:: Python

    testset = {"p": 10, "n": 20}

One can also specify a commonly used dataset by its name and the package will look up the ``p`` and ``n`` counts of the datasets from its internal registry (based on the representations in the ``common-datasets`` package):

.. code-block:: Python

    dataset = {"dataset_name": "common_datasets.ADA"}

To see the list of supported datasets and corresponding counts, issue

.. code-block:: Python

    from mlscorecheck.experiments import dataset_statistics
    print(dataset_statistics)

Specification of a folding
^^^^^^^^^^^^^^^^^^^^^^^^^^

The specification of foldings is needed when the scores are computed in cross-validation scenarios. We distinguish two main cases: in the first case, the numbers of positive and negative samples in the folds are known, or can be derived from the attributes of the dataset (for example, by stratification); in the second case, the statistics of the folds are not known, but the number of folds and potential repetitions are known.

In the first case, when the folds are known, one can specify them by listing them:

.. code-block:: Python

    folding = {"folds": [{"p": 5, "n": 10},
                         {"p": 4, "n": 10},
                         {"p": 5, "n": 10}]}

This folding can represent the evaluation of a dataset with 14 positive and 30 negative samples in a 3-fold stratified cross-validation scenario.

Knowing that the folding is derived by some standard stratification techniques, one can just specify the parameters of the folding:

.. code-block:: Python

    folding = {"n_folds": 3, "n_repeats": 1, "strategy": "stratified_sklearn"}

In this specification, it is assumed that the samples are distributed into the folds according to the ``sklearn`` stratification implementation.
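
As a quick sanity check, a sketch assuming ``sklearn``'s ``StratifiedKFold`` (the class the ``stratified_sklearn`` strategy name refers to; the dataset sizes are taken from the example above) shows how the fold-level ``p`` and ``n`` counts can be derived:

.. code-block:: Python

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    p, n = 14, 30
    y = np.array([1] * p + [0] * n)  # labels of a hypothetical dataset
    X = np.zeros((p + n, 1))         # dummy features, only the labels matter

    for _, test_idx in StratifiedKFold(n_splits=3).split(X, y):
        fold_y = y[test_idx]
        print({"p": int(fold_y.sum()), "n": int(len(fold_y) - fold_y.sum())})
    # e.g. {'p': 5, 'n': 10}, {'p': 5, 'n': 10}, {'p': 4, 'n': 10}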

Finally, if neither the folds nor the folding strategy is known, one can simply specify the folding with its parameters (assuming a repeated k-fold scheme):

.. code-block:: Python

    folding = {"n_folds": 3, "n_repeats": 2}

Note that not all consistency testing functions support the latter case (not knowing the exact structure of the folds).

Specification of an evaluation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A dataset and a folding constitute an *evaluation*, and many of the test functions take evaluations as parameters describing the scenario:

.. code-block:: Python

    evaluation = {"dataset": {"p": 10, "n": 50},
                  "folding": {"n_folds": 5, "n_repeats": 1,
                              "strategy": "stratified_sklearn"}}

A note on the *Score of Means* and *Mean of Scores* aggregations
----------------------------------------------------------------

When it comes to the aggregation of scores (either over multiple folds, multiple datasets or both), there are two approaches in the literature. In the *Mean of Scores* (MoS) scenario, the scores are calculated for each fold/dataset, and the mean of the scores is determined as the score characterizing the entire experiment. In the *Score of Means* (SoM) approach, first the overall micro-figures (e.g. the overall confusion matrix in classification, the overall squared error in regression) are determined, and then the scores are calculated based on these total figures. The advantage of the MoS approach over SoM is that it is possible to estimate the standard deviation of the scores; however, its disadvantage is that the average of non-linear scores might be distorted and some scores might become undefined when the folds are extremely small (typically in the case of small and imbalanced data).
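
The difference between the two aggregations can be illustrated with a small sketch (the fold-level figures are hypothetical, chosen only for the example):

.. code-block:: Python

    # hypothetical fold-level true positive / false negative counts
    folds = [{"tp": 4, "fn": 1}, {"tp": 1, "fn": 3}]

    # MoS: average the per-fold sensitivities
    mos = sum(f["tp"] / (f["tp"] + f["fn"]) for f in folds) / len(folds)

    # SoM: compute the sensitivity from the pooled (total) confusion matrix
    tp = sum(f["tp"] for f in folds)
    fn = sum(f["fn"] for f in folds)
    som = tp / (tp + fn)

    print(round(mos, 4), round(som, 4))  # 0.525 0.5556 -- the two aggregations differ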

The ``mlscorecheck`` package supports both approaches; however, by design, to increase awareness, different functions are provided for the two approaches, usually indicated by the ``_mos`` or ``_som`` suffixes in the function names.

The types of tests
------------------

The consistency tests can be grouped into three classes; the problem and the experimental setup determine which internal implementation is applicable:

- Exhaustive enumeration: primarily applied for binary and multiclass classification, when the scores are calculated from one single confusion matrix. The calculations are sped up by interval computing techniques. These tests support all 20 performance scores of binary classification.
- Linear programming: when averaging is involved in the calculation of performance scores, due to the non-linearity of most scores, the operation cannot be simplified and the extremely large parameter space prevents exhaustive enumeration. In these scenarios, linear integer programming is exploited (see the sketch after this list). These tests usually support only the accuracy, sensitivity, specificity and balanced accuracy scores.
- Checking the relation of scores: mainly used for regression, where the domain of the performance scores is continuous, which prevents the kind of inference possible for discrete values.
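
A minimal sketch of the linear-programming idea, using ``pulp`` (one of the package's requirements), not the package's internals; the fold structure, the reported accuracy and ``eps`` are made up for the example. The goal is to find fold-level confusion-matrix entries whose mean-of-scores accuracy matches the reported value within the numerical uncertainty:

.. code-block:: Python

    import pulp

    folds = [{"p": 5, "n": 10}, {"p": 4, "n": 10}, {"p": 5, "n": 10}]
    acc_reported, eps = 0.8, 1e-4

    prob = pulp.LpProblem("consistency")
    tps = [pulp.LpVariable(f"tp_{i}", 0, f["p"], cat="Integer")
           for i, f in enumerate(folds)]
    tns = [pulp.LpVariable(f"tn_{i}", 0, f["n"], cat="Integer")
           for i, f in enumerate(folds)]

    # the mean of the fold-level accuracies is linear in tp_i and tn_i,
    # because the fold sizes p_i + n_i are known constants
    mean_acc = pulp.lpSum((tp + tn) * (1.0 / (len(folds) * (f["p"] + f["n"])))
                          for tp, tn, f in zip(tps, tns, folds))
    prob += mean_acc >= acc_reported - eps
    prob += mean_acc <= acc_reported + eps

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    # "Optimal" means a compatible configuration exists, "Infeasible" signals inconsistency
    print(pulp.LpStatus[prob.status])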
70 changes: 0 additions & 70 deletions docs/01b_specifying_setup.rst

This file was deleted.

