diff --git a/manuscript/01-paper.md b/manuscript/01-paper.md index f6bb47c..8aa346d 100644 --- a/manuscript/01-paper.md +++ b/manuscript/01-paper.md @@ -23,16 +23,16 @@ abbreviations: +++ {"part": "abstract"} Predictive modeling is a key approach to improve the understanding of complex biological systems and to develop novel tools for translational medical research. However, complex machine learning approaches and extensive data pre-processing and feature engineering pipelines can result in overfitting and poor generalizability. Unbiased evaluation of predictive models requires external validation, which involves testing the finalized model on independent data. Due to the high cost and time required for the acquisition of additional data, often no external validation is performed or the independence of the validation set from the training procedure is hard to evaluate. -Here we propose that model discovery and validation should be separated by the public disclosure (e.g. preregistration) of feature processing steps and model weights. Furthermore, we introduce a novel approach to optimize the trade-off between efforts spent on training and external validation in such studies. We show on data involving more than 3000 participants from four different datasets that, for any "sample size budget", the proposed adaptive splitting approach can successfully identify the optimal time to stop model discovery so that predictive performance is maximized without risking a low powered, and thus inconclusive, external validation. +Here we propose that model discovery and validation should be separated by the public disclosure (e.g. pre-registration) of feature processing steps and model weights. Furthermore, we introduce a novel approach to optimize the trade-off between efforts spent on training and external validation in such studies. 
We show on data involving more than 3000 participants from four different datasets that, for any "sample size budget", the proposed adaptive splitting approach can successfully identify the optimal time to stop model discovery so that predictive performance is maximized without risking a low-powered, and thus inconclusive, external validation. The proposed design and splitting approach (implemented in the Python package "AdaptiveSplit") may contribute to addressing issues of replicability, effect size inflation and generalizability in predictive modeling studies. +++ ## Introduction Multivariate predictive models integrate information across multiple variables to construct predictions of a specific outcome and hold promise for delivering more accurate estimates than traditional univariate methods ([](https://doi.org/10.1038/nn.4478)). For instance, in the case of predicting individual behavioral and psychometric characteristics from brain data, such models can provide higher statistical power and better replicability, as compared to conventional mass-univariate analyses ([](https://doi.org/10.1038/s41586-023-05745-x)). Predictive models can utilize a variety of algorithms, ranging from simple linear regression-based models to complex deep neural networks. With increasing model complexity, the model will be more prone to overfit its training dataset, resulting in biased, overly optimistic in-sample estimates of predictive performance and often decreased generalizability to data not seen during model fit ([](https://doi.org/10.1016/j.neubiorev.2020.09.036)). Internal validation approaches, like cross-validation (cv), provide means for an unbiased evaluation of predictive performance during model discovery by repeatedly holding out parts of the discovery dataset for testing purposes ([](https://doi.org/10.1201/9780429246593); [](doi:10.1001/jamapsychiatry.2019.3671)). 
-However, internal validation approaches, in practice, still tend to yield overly optimistic performance estimates ([](https://doi.org/10.1080/01621459.1983.10477973); [](https://doi.org/10.1016/j.biopsych.2020.02.016); [](https://doi.org/10.1038/s41746-022-00592-y)). There are several reasons for this kind of effect sie inflation. First, predictive modelling approaches typically display a high level of "analytical flexibility" and pose a large number of possible methodological choices in terms of feature preprocessing and model architecture, which emerge as uncontrolled (e.g. not cross-validated) "hyperparameters" during model discovery. Seemingly 'innocent' adjustments of such parameters can also lead to overfitting, if it happens outside of the cv loop. The second reason for inflated internally validated performance estimates is 'leakage' of information from the test dataset to the training dataset ([](https://doi.org/10.1016/j.patter.2023.100804)). Information leakage has many faces. It can be a consequence of, for instance, feature standardization in a non cv-compliant way or, in medical imaging, the co-registration of brain data to a study-specific template. Therefore, it is often very hard to notice, especially in complex workflows. +However, internal validation approaches, in practice, still tend to yield overly optimistic performance estimates ([](https://doi.org/10.1080/01621459.1983.10477973); [](https://doi.org/10.1016/j.biopsych.2020.02.016); [](https://doi.org/10.1038/s41746-022-00592-y)). There are several reasons for this kind of effect size inflation. First, predictive modelling approaches typically display a high level of "analytical flexibility" and pose a large number of possible methodological choices in terms of feature pre-processing and model architecture, which emerge as uncontrolled (e.g. not cross-validated) "hyperparameters" during model discovery. 
Seemingly 'innocent' adjustments of such parameters can also lead to overfitting if they happen outside of the cv loop. The second reason for inflated internally validated performance estimates is 'leakage' of information from the test dataset to the training dataset ([](https://doi.org/10.1016/j.patter.2023.100804)). Information leakage has many faces. It can be a consequence of, for instance, feature standardization in a non cv-compliant way or, in medical imaging, the co-registration of brain data to a study-specific template. Therefore, it is often very hard to notice, especially in complex workflows. Another reason for overly optimistic internal validation results may be that even the highest quality discovery datasets can only yield an imperfect representation of the real world. Therefore, predictive models might capitalize on associations that are specific to the dataset at hand and simply fail to generalize "out-of-the-distribution", e.g. to different populations. Finally, some models might also be overly sensitive to unimportant characteristics of the training data, like subtle differences between batches of data acquisition or center-effects ([](https://doi.org/10.1038/s42256-020-0197-y); [](https://doi.org/10.1093/gigascience/giac082)). -The obvious solution for these problems is *external validation*; that is, to evaluate the model's predictive performance on independent ('external') data that is guaranteed to be unseen during the the whole model discovery procedure. There is a clear agreement in the community that external validation is critical for establishing machine learning model quality ([](https://doi.org/10.1186/1471-2288-14-40); [](https://doi.org/10.1016/j.patter.2020.100129); [](https://doi.org/10.1148/ryai.210064); [](https://doi.org/10.1038/s41586-023-05745-x); [](doi:10.1001/jamapsychiatry.2019.3671)). 
However, the amount of data to be used for model discovery and external validation can have crucial implications on the predictive power, replicability and validity of predictive models and is, therefore, subject of intense discussion ([](https://doi.org/10.1002/sim.9025); [](https://doi.org/10.1038/s41586-022-04492-9); [](https://doi.org/10.1038/s41586-023-05745-x); [](https://doi.org/10.1038/s41593-022-01110-9); [](10.52294/51f2e656-d4da-457e-851e-139131a68f14); [](https://doi.org/10.1101/2023.06.16.545340); [](#supplementary-table-1)). Finding the optimal sample sizes is especially challenging for biomedical research, where this trade-off needs to consider both ethical and economic reasons. As a consequence, to date only around 10\% of predictive modeling studies include an external validation of the model ([](https://doi.org/10.1093/jamia/ocac002)). Those few studies performing true external validation often perform it on retrospective data (like [](https://doi.org/10.1038/s41591-020-1142-7) or [](10.31219/osf.io/utkbv)) or in separate, prospective studies ([](https://doi.org/10.1038/s41467-019-13785-z); [](10.31219/osf.io/utkbv)). Both approaches can result in a suboptimal use of data and may slow down the dissemination process of new results. +The obvious solution for these problems is *external validation*; that is, to evaluate the model's predictive performance on independent ('external') data that is guaranteed to be unseen during the whole model discovery procedure. There is a clear agreement in the community that external validation is critical for establishing machine learning model quality ([](https://doi.org/10.1186/1471-2288-14-40); [](https://doi.org/10.1016/j.patter.2020.100129); [](https://doi.org/10.1148/ryai.210064); [](https://doi.org/10.1038/s41586-023-05745-x); [](doi:10.1001/jamapsychiatry.2019.3671)). 
However, the amount of data to be used for model discovery and external validation can have crucial implications on the predictive power, replicability and validity of predictive models and is, therefore, subject of intense discussion ([](https://doi.org/10.1002/sim.9025); [](https://doi.org/10.1038/s41586-022-04492-9); [](https://doi.org/10.1038/s41586-023-05745-x); [](https://doi.org/10.1038/s41593-022-01110-9); [](10.52294/51f2e656-d4da-457e-851e-139131a68f14); [](https://doi.org/10.1101/2023.06.16.545340); [](#supplementary-table-1)). Finding the optimal sample sizes is especially challenging for biomedical research, where this trade-off needs to consider both ethical and economic reasons. As a consequence, to date only around 10\% of predictive modeling studies include an external validation of the model ([](https://doi.org/10.1093/jamia/ocac002)). Those few studies performing true external validation often perform it on retrospective data (like [](https://doi.org/10.1038/s41591-020-1142-7) or [](10.31219/osf.io/utkbv)) or in separate, prospective studies ([](https://doi.org/10.1038/s41467-019-13785-z); [](10.31219/osf.io/utkbv)). Both approaches can result in a suboptimal use of data and may slow down the dissemination process of new results. In this manuscript we argue that maximal reliability and transparency during external validation can be achieved with prospective data acquisition preceded by "freezing" and publicly depositing (e.g. pre-registering) the whole feature processing workflow and all model weights. Furthermore, we present a novel adaptive design for predictive modeling studies with prospective data acquisition that optimizes the trade-off between efforts spent on training and external validation. 
We evaluate the proposed approach on data involving more than 3000 participants from four different datasets to illustrate that for any "sample size budget", it can successfully identify the optimal time to stop model discovery, so that predictive performance is maximized without risking a low-powered, and thus inconclusive, external validation. @@ -42,7 +42,7 @@ In this manuscript we argue that maximal reliability and transparency during ext Let us consider the following scenario: a research group plans to involve a fixed number of participants in a study with the aim of constructing a predictive model and, at the same time, evaluating its external validity. How many participants should they allocate for model discovery and how many for external validation to get the highest-performing model as well as conclusive validation results? -In most cases it is very hard to make an educated guess about the optimal split of the total sample size into discovery and external validation samples prior to data acquisition. A possible approach is to use simplistic rules-of-thumb. Splitting data with a 80-20\% ratio (a.k.a Pareto-split, [](https://doi.org/10.1080/00207390802213609)) is probably the most common method, but a 90-10\% or a 50-50\% may also be plausible choices ([](10.1007/978-3-319-23528-8_1)). However, as illustrated on {numref}`fig1`, such prefixed sample sizes are likely sub-optimal in many cases and the optimal strategy is actually determined by the dependence of the model performance on training sample size, that is, the "learning curve". For instance, in case of a significant but generally low model performance ({numref}`fig1`A: flat learning curve) the model does not benefit a lot from adding more data to the training set but, on the other hand, it may require a larger external validation set for conclusive evaluation, due to the lower predictive effect size. 
This is visualized by the "power curve" on {numref}`fig1`, which shows the statistical power of external validation with the remaining sample as a function of sample size used for model discovery. The optimal strategy will be different, however, if the learning curve shows a persistent increase, without a strong saturation effect, meaning that predictive performance can be significantly enhanced by training the model on larger samples ({numref}`fig1`B). +In most cases it is very hard to make an educated guess about the optimal split of the total sample size into discovery and external validation samples prior to data acquisition. A possible approach is to use simplistic rules-of-thumb. Splitting data with an 80-20\% ratio (a.k.a. Pareto-split, [](https://doi.org/10.1080/00207390802213609)) is probably the most common method, but a 90-10\% or a 50-50\% may also be plausible choices ([](10.1007/978-3-319-23528-8_1)). However, as illustrated in {numref}`fig1`, such pre-fixed sample sizes are likely sub-optimal in many cases and the optimal strategy is actually determined by the dependence of the model performance on training sample size, that is, the "learning curve". For instance, in the case of a significant but generally low model performance ({numref}`fig1`A: flat learning curve) the model does not benefit a lot from adding more data to the training set but, on the other hand, it may require a larger external validation set for conclusive evaluation, due to the lower predictive effect size. This is visualized by the "power curve" on {numref}`fig1`, which shows the statistical power of external validation with the remaining samples as a function of the sample size used for model discovery. The optimal strategy will be different, however, if the learning curve shows a persistent increase, without a strong saturation effect, meaning that predictive performance can be significantly enhanced by training the model on a larger sample ({numref}`fig1`B). 
In this case, the stronger predictive performance achievable with a larger training sample allows a smaller external validation sample to remain conclusive. Finally, in some situations, model performance may increase rapidly and reach a plateau at a relatively low sample size ({numref}`fig1`C). In such cases, the optimal strategy might be to stop the discovery phase early and allocate resources for a more powerful external validation. @@ -55,21 +55,21 @@ In this case, a larger external validation sample (for more robust external perf #### Transparent reporting of external validation: registered models A key criterion for external validation is the independence of the external data from the data used during model discovery ([](https://doi.org/10.1016/j.jclinepi.2015.04.005); [](10.1186/1471-2288-14-40); [](10.1038/s41586-023-05745-x)). Regardless of the splitting strategy, an externally validated predictive modelling study must provide strong guarantees for this independence criterion. -Preregistration, i.e. the public disclosure of study plans before the start of the study, is an increasingly popular way of enhancing transparency and replicability in biomedical research ([](https://doi.org/10.1016/j.tics.2019.07.009); [](10.1038/s41586-023-05745-x)) ({numref}`fig2`A), which could also be used to ensure the independence of the external validation sample. +Pre-registration, i.e. the public disclosure of study plans before the start of the study, is an increasingly popular way of enhancing transparency and replicability in biomedical research ([](https://doi.org/10.1016/j.tics.2019.07.009); [](10.1038/s41586-023-05745-x)) ({numref}`fig2`A), which could also be used to ensure the independence of the external validation sample. 
-However, as the concept of preregistration was originally developed for confirmatory research, it does not fit well to exploratory nature of the model discovery phase in typical predictive modelling endeavors. Specifically, while preregistration necessitates that as many parameters of the analysis as possible are fixed before data acquisition, predictive modelling studies often involve a large number of hyperparameters (e.g. model architecture, feature preprocessing steps, regularization parameters, etc.) that are not known in advance and need to be optimized during the model discovery phase. This is especially true for complex machine learning models, like deep neural networks, where the number of free parameters can easily reach tens of thousands or even more. In such cases, the preregistration of the discovery phase would require a large number of assumptions or simplifications, which would make the the process ineffective and less transparent. +However, as the concept of pre-registration was originally developed for confirmatory research, it does not fit well with the exploratory nature of the model discovery phase in typical predictive modelling endeavors. Specifically, while pre-registration necessitates that as many parameters of the analysis as possible are fixed before data acquisition, predictive modelling studies often involve a large number of hyperparameters (e.g. model architecture, feature pre-processing steps, regularization parameters, etc.) that are not known in advance and need to be optimized during the model discovery phase. This is especially true for complex machine learning models, like deep neural networks, where the number of free parameters can easily reach tens of thousands or even more. In such cases, the pre-registration of the discovery phase would require a large number of assumptions or simplifications, which would make the process ineffective and less transparent. 
-Therefore, we propose to perform the preregistration after the model discovery phase, but before the external validation {numref}`fig2`B). In this case, more freedom is granted for the discovery phase, while the external validation remains equally conclusive, as long as the pre-registration of the external validation includes all details of the *finalized* model (including the feature pre-processing workflow). This can easily be done by attaching the data and the reproducible analysis code used during the discovery phase or, alternatively, a serialized version of the fitted model (i.e. a file that contains all model weight, see e.g. [](10.1038/s41467-019-13785-z) and [](10.31219/osf.io/utkbv)). We refer to such models as **registered models**. +Therefore, we propose to perform the pre-registration after the model discovery phase, but before the external validation ({numref}`fig2`B). In this case, more freedom is granted for the discovery phase, while the external validation remains equally conclusive, as long as the pre-registration of the external validation includes all details of the *finalized* model (including the feature pre-processing workflow). This can easily be done by attaching the data and the reproducible analysis code used during the discovery phase or, alternatively, a serialized version of the fitted model (i.e. a file that contains all model weights, see e.g. [](10.1038/s41467-019-13785-z) and [](10.31219/osf.io/utkbv)). We refer to such models as **registered models**. :::{figure} figures/fig2.png :name: fig2 **The registered model design and the proposed adaptive sample splitting procedure for prospective predictive modeling studies.** \ - **(A)** Predictive modelling combined with conventional preregistration. In this case the pre-registration precedes data acquisition and requires that as many details of the analysis are fixed as possible. 
Given the potentially large number of coefficients to be optimized and the importance of hyperparameter optimization, conventional preregistration exhibits a limited compatibility with predictive modelling studies. **(B)** Here we propose that in case of predictive modelling studies, public registration should only happen after the model is trained and finalized. The registration step in this case includes publicly depositing the finalized model, with all its parameters as well as all feature preprocessing steps. External validation is performed with the resulting *registered model*. This practice ensures a transparent, clear separation of model discovery and external validation. **(C)** The "registered model" design allows a flexible, adaptive splitting of the "sample size budget" into discovery and external validation phases. The proposed adaptive sample splitting procedure starts with fixing (and potentially pre-registering) a stopping rules (R1). During the training phase, one or more candidate models are trained and the splitting rule is repeatedly evaluated as the data acquisition proceeds. When the splitting rule "activates", the model gets finalized (e.g. by being fit on the whole training sample) and publicly deposited/registered (R2). Finally, data acquisition continues and the prospective external validation is performed on the newly acquired data. + **(A)** Predictive modelling combined with conventional pre-registration. In this case the pre-registration precedes data acquisition and requires fixing as many details of the analysis as possible. Given the potentially large number of coefficients to be optimized and the importance of hyperparameter optimization, conventional pre-registration exhibits a limited compatibility with predictive modelling studies. **(B)** Here we propose that in case of predictive modelling studies, public registration should only happen after the model is trained and finalized. 
The registration step in this case includes publicly depositing the finalized model, with all its parameters as well as all feature pre-processing steps. External validation is performed with the resulting *registered model*. This practice ensures a transparent, clear separation of model discovery and external validation. **(C)** The "registered model" design allows a flexible, adaptive splitting of the "sample size budget" into discovery and external validation phases. The proposed adaptive sample splitting procedure starts with fixing (and potentially pre-registering) a stopping rule (R1). During the training phase, one or more candidate models are trained and the splitting rule is repeatedly evaluated as the data acquisition proceeds. When the splitting rule "activates", the model gets finalized (e.g. by being fit on the whole training sample) and publicly deposited/registered (R2). Finally, data acquisition continues and the prospective external validation is performed on the newly acquired data. ::: #### The adaptive splitting design -Even with registered models, the the amount of data to be used for model discovery and external validation can have crucial implications on the predictive power, replicability and validity of predictive models. Here, we introduce a novel design for prospective predictive modeling studies that leverages the flexibility of model discovery granted by the registered model design. Our approach aims to adaptively determine an optimal splitting strategy during data acquisition. This strategy balances the model performance and the statistical power of the external validation ({numref}`fig2`C). The proposed design involves continuous model fitting and hyperparameter tuning throughout the discovery phase, for example, after every 10 new participants, and evaluating a 'stopping rule' to determine if the desired compromise between model performance and statistical power of the external validation has been achieved. 
This marks the end of the discovery phase and the start of the external validation phase, as well as the point at which the model must be publicly and transparently deposited or preregistered. Importantly, the preregistration should precede the continuation of data acquisition, i.e., the start of the external validation phase. +Even with registered models, the amount of data to be used for model discovery and external validation can have crucial implications on the predictive power, replicability and validity of predictive models. Here, we introduce a novel design for prospective predictive modeling studies that leverages the flexibility of model discovery granted by the registered model design. Our approach aims to adaptively determine an optimal splitting strategy during data acquisition. This strategy balances the model performance and the statistical power of the external validation ({numref}`fig2`C). The proposed design involves continuous model fitting and hyperparameter tuning throughout the discovery phase, for example, after every 10 new participants, and evaluating a 'stopping rule' to determine if the desired compromise between model performance and statistical power of the external validation has been achieved. This marks the end of the discovery phase and the start of the external validation phase, as well as the point at which the model must be publicly and transparently deposited or pre-registered. Importantly, the pre-registration should precede the continuation of data acquisition, i.e., the start of the external validation phase. In the present work, we propose and evaluate a concrete, customizable implementation for the splitting rule. 
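The acquisition loop described above can be sketched as follows. This is an illustrative sketch only; the function and parameter names are hypothetical and do not reflect the actual AdaptiveSplit API:

```python
import numpy as np

def run_adaptive_design(acquire_batch, stopping_rule, n_total, batch_size=10):
    """Sketch of the adaptive splitting loop: acquire data in small batches,
    re-evaluate after each batch, and end discovery when the rule fires."""
    X, y = [], []
    while len(y) < n_total:
        X_new, y_new = acquire_batch(batch_size)  # e.g. 10 new participants
        X.extend(X_new)
        y.extend(y_new)
        # candidate model(s) would be re-fit here; the stopping rule sees
        # all data acquired so far (the n_act observations)
        if stopping_rule(np.asarray(X), np.asarray(y)):
            break
    # discovery ends here: finalize the model on all n_act samples and
    # register it publicly before acquiring the external validation data
    return len(y)  # n_act, the discovery sample size
```

The registered model would then be evaluated exactly once on the remaining `n_total - n_act` prospectively acquired participants.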
## Methods and Implementation @@ -82,7 +82,7 @@ The stopping rule of the proposed adaptive splitting design can be formalized as S_\Phi(\mathbf{X}_{act}, \mathbf{y}_{act}, \mathcal{M}) \quad \quad S: \mathbb{R}^2 \longrightarrow \{True, False\} ::: -where $\Phi$ denotes customizable parameters of the rule (detailed in the next paragraph), $\mathbf{X}_{act} \in \mathbb{R}^2$ and $\mathbf{y}_{act} \in \mathbb{R}$ is the data (a matrix consisting of $n_{act} > 0$ observations and an fixed number of features $p$) and prediction target, respectively, as acquired so far and $\mathcal{M}$ is the machine learning model to be trained. The discovery phase ends if and only if the stopping rule returns $True$. +where $\Phi$ denotes customizable parameters of the rule (detailed in the next paragraph), $\mathbf{X}_{act} \in \mathbb{R}^{n_{act} \times p}$ and $\mathbf{y}_{act} \in \mathbb{R}^{n_{act}}$ are the data (a matrix consisting of $n_{act} > 0$ observations and a fixed number of features $p$) and the prediction target, respectively, as acquired so far, and $\mathcal{M}$ is the machine learning model to be trained. The discovery phase ends if and only if the stopping rule returns $True$. ##### **Hard sample size thresholds** @@ -100,12 +100,12 @@ Specifically: \text{Max-rule:} \quad n_{act} \geq n_{total} - v_{min} ::: -where $n_{act}$ and $n_{total}$ are the actual sample size (e.g. participants measured so far) and the total sample sizes (i.e. the sample size budget), respectively, so that $n_{total} >= n_{act} > 0$. +where $n_{act}$ and $n_{total}$ are the actual sample size (e.g. participants measured so far) and the total sample size (i.e. the "sample size budget"), respectively, so that $n_{total} \geq n_{act} > 0$. 
Setting $t_{min}$ and $v_{min}$ may be useful to prevent early stopping at the beginning of the training procedure, where predictive performance and validation power estimates are not yet reliable due to the small $n_{act}$, or to ensure a minimal validation sample size even if the stopping criteria are never met. If $t_{min}$ and $v_{min}$ are set so that $t_{min} + v_{min} = n_{total}$, then our approach falls back to training a registered model with predefined training and validation sample sizes. ##### **Forecasting Predictive Performance via Learning Curve Analysis** -Taking internally validated performance estimates of the candidate model as a function of training sample size, also known as learning curve analysis, is a widely used approach to gain deeper insights into model training dynamics (see examples on {numref}`fig1`). In the proposed stopping rule, we will rely on learning curve analysis to provide estimates of the current predictive performance and the expected gain when adding new data to the discovery sample. +Examining internally validated performance estimates of the candidate model as a function of training sample size, also known as learning curve analysis, is a widely used approach to gain deeper insights into model training dynamics (see examples on {numref}`fig1`). In the proposed stopping rule, we will rely on learning curve analysis to provide estimates of the current predictive performance and the expected gain when adding new data to the discovery sample. Performance estimates can be unreliable or noisy in many cases, for instance with low sample sizes or when using leave-one-out cross-validation ([](https://doi.org/10.1016/j.neuroimage.2017.06.061)). To obtain stable and reliable learning curves, we propose to calculate multiple cross-validated performance estimates from sub-samples sampled without replacement from the actual data set. The proposed procedure is detailed in {numref}`alg-learning-curve`. 
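The sub-sampling idea can be sketched roughly as follows; this is not the exact procedure of {numref}`alg-learning-curve`, and the model (ridge regression), the evaluated sizes and the number of sub-samples are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def stable_learning_curve(X, y, sizes, n_subsamples=20, seed=0):
    """Cross-validated performance at each training size, averaged over
    random sub-samples drawn without replacement for stability."""
    rng = np.random.default_rng(seed)
    curve = []
    for n in sizes:
        scores = []
        for _ in range(n_subsamples):
            # draw a sub-sample of size n without replacement
            idx = rng.choice(len(y), size=n, replace=False)
            scores.append(cross_val_score(Ridge(), X[idx], y[idx], cv=5).mean())
        curve.append(float(np.mean(scores)))
    return np.asarray(curve)
```

Averaging over sub-samples trades extra compute for a smoother curve, which matters because the stopping rule extrapolates from the curve's most recent points.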
@@ -137,7 +137,7 @@ The learning curve analysis allows the discovery phase to be stopped if the expe \text{Performance-rule:} \quad \hat{s}_{total} - s_{act} \leq \epsilon_{s} ::: -where $s_{act}$ is the actual bootstrapped predictive performance score (i.e. the last element of $\textbf{l}_{act}$, as returned by {numref}`alg-learning-curve`, $\hat{s}_{total}$ is a estimate of the (unknown) predictive performance $s_{total}$ (i.e. the predictive performance of the model trained the whole sample) and $\epsilon_{s}$ is the smallest predictive effect of interest. Note that, setting $\epsilon_{s} = -\infty$ deactivates the *Performance-rule* ([Eq. %s](#eq-perf-rule)). +where $s_{act}$ is the actual bootstrapped predictive performance score (i.e. the last element of $\textbf{l}_{act}$, as returned by {numref}`alg-learning-curve`), $\hat{s}_{total}$ is an estimate of the (unknown) predictive performance $s_{total}$ (i.e. the predictive performance of the model trained on the whole sample) and $\epsilon_{s}$ is the smallest predictive effect of interest. Note that setting $\epsilon_{s} = -\infty$ deactivates the *Performance-rule* ([Eq. %s](#eq-perf-rule)). While $s_{total}$ is typically unknown at the time of evaluating the stopping rule $S$, there are various approaches to obtaining an estimate $\hat{s}_{total}$. In the base implementation of AdaptiveSplit, we stick to a simple method: we extrapolate the learning curve $l_{act}$ based on its tangent line at $n_{act}$, i.e. assuming that the latest growth rate will remain constant for the remaining samples. While in most scenarios this is an overly optimistic estimate, it still provides a useful upper bound for the maximally achievable predictive performance with the given sample size and can successfully detect if the learning curve has already reached a flat plateau (like on {numref}`fig1`C). 
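A minimal sketch of this tangent-line extrapolation (illustrative names, not the package API):

```python
def extrapolate_performance(sizes, scores, n_total):
    """Optimistic estimate of s_total: extend the learning curve from its
    last point using the most recent growth rate (the tangent at n_act)."""
    slope = (scores[-1] - scores[-2]) / (sizes[-1] - sizes[-2])
    return scores[-1] + slope * (n_total - sizes[-1])
```

With a flat plateau (zero recent slope) the estimate equals the current score, so the expected gain $\hat{s}_{total} - s_{act}$ is zero and the Performance-rule can trigger stopping.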
@@ -153,7 +153,7 @@ Specifically, the stopping rule $S$ will return $True$ if the *Performance-rule* ::: where $POW_\alpha(s, n)$ is the power of a validation sample of size $n$ to detect an effect size of $s$ and $n_{val} = n_{total}-n_{act}$ is the size of the validation sample when stopping, i.e. the number of remaining (not yet measured) participants in the experiment. -Given that machine learning model predictions are often non-normally distributed ([](https://doi.org/10.1093/gigascience/giac082)), our implementation is based on a bootstrapped power analysis for permutation tests [refs], as shown in {numref}`alg-power-rule`. Our implementation is, however, simple to extend with other parametric or non-parametric estimators like Pearson Correlation and Spearman Rank Correlation. +Given that machine learning model predictions are often non-normally distributed ([](https://doi.org/10.1093/gigascience/giac082)), our implementation is based on a bootstrapped power analysis for permutation tests, as shown in {numref}`alg-power-rule`. Our implementation is, however, easy to extend with other parametric or non-parametric estimators, such as the Pearson correlation or the Spearman rank correlation. 
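A bootstrapped permutation-test power analysis could look like the following sketch. It is a simplified stand-in for {numref}`alg-power-rule`, with illustrative names and defaults; the correlation between observed and predicted outcomes serves as the test statistic:

```python
import numpy as np

def permutation_pvalue(y_true, y_pred, n_perm=500, rng=None):
    """One-sided permutation p-value of the prediction-outcome correlation."""
    rng = np.random.default_rng(rng)
    observed = np.corrcoef(y_true, y_pred)[0, 1]
    null = np.array([np.corrcoef(rng.permutation(y_true), y_pred)[0, 1]
                     for _ in range(n_perm)])
    return (np.sum(null >= observed) + 1) / (n_perm + 1)

def bootstrap_power(y_true, y_pred, n_val, n_boot=100, alpha=0.05, rng=None):
    """Estimated power of a validation sample of size n_val: the fraction
    of bootstrap resamples of the current predictions whose permutation
    test reaches significance."""
    rng = np.random.default_rng(rng)
    significant = 0
    for _ in range(n_boot):
        # resample with replacement to mimic a validation set of size n_val
        idx = rng.integers(0, len(y_true), size=n_val)
        if permutation_pvalue(y_true[idx], y_pred[idx], rng=rng) < alpha:
            significant += 1
    return significant / n_boot
```

Because the resampling makes no distributional assumptions about the predictions, the same machinery works for any test statistic plugged into `permutation_pvalue`.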
:::{prf:algorithm} Calculation of the Power-rule
:label: alg-power-rule
@@ -192,7 +192,7 @@ Calculating the validation power ({numref}`alg-power-rule`) for all available sa
#### Stopping Rule
-Our proposed stopping rule integrates the $\text{Min-rule}$, the $\text{Min-rule}$, the $\text{Peerformance-rule}$ and the $\text{Power-rule}$ in the following way:
+Our proposed stopping rule integrates the $\text{Min-rule}$, the $\text{Max-rule}$, the $\text{Performance-rule}$ and the $\text{Power-rule}$ in the following way:
\begin{equation}
\begin{split}
@@ -212,43 +212,43 @@ We have implemented the proposed stopping rule in the Python package “*adaptiv
#### Empirical evaluation
-We evaluate the proposed stopping rule, as implemented in the package *adaptivesplit*, in four publicly available datasets; the Autism Brain Imaging Data Exchange (ABIDE) [](https://doi.org/10.1038/mp.2013.78), the Human Connectome Project (HCP; [](https://doi.org/10.1016/j.neuroimage.2013.05.041)), the Information eXtraction from Images (IXI)[^ixi] and the Breast Cancer Wisconsin (BCW; [](10.1117/12.148698)) datasets (Fig. 3).
+We evaluate the proposed stopping rule, as implemented in the package *adaptivesplit*, in four publicly available datasets: the Autism Brain Imaging Data Exchange (ABIDE; [](https://doi.org/10.1038/mp.2013.78)), the Human Connectome Project (HCP; [](https://doi.org/10.1016/j.neuroimage.2013.05.041)), the Information eXtraction from Images (IXI)[^ixi] and the Breast Cancer Wisconsin (BCW; [](10.1117/12.148698)) datasets ({numref}`fig3`).
##### **ABIDE**
-We obtained preprocessed data from Autism Brain Imaging Data Exchange (ABIDE) dataset [](https://doi.org/10.1038/mp.2013.78) involving 866 participants (Autism Spectrum Disorder: 402, neurotypical control: 464).
Preprocessed regional time-series data were obtained as shared[^abide-data] by [](https://doi.org/10.1016/j.neuroimage.2019.02.062), which were based on preprocessed image data provided by the Preprocessed Connectome Project ([](10.3389/conf.fninf.2013.09.00041)).
-Tangent correlation across the time series of the n=122 regions of the BASC brain parcellation (Multi-level bootstrap analysis of stable clusters; [](https://doi.org/10.1016/j.neuroimage.2010.02.082)) was computed with nilearn[^nilearn].
-The resulting functional connectivity estimates were considered features for a predictive model of autism diagnosis.
+We obtained preprocessed data from the Autism Brain Imaging Data Exchange (ABIDE) dataset ([](https://doi.org/10.1038/mp.2013.78)) involving 866 participants (Autism Spectrum Disorder: 402, neurotypical control: 464). Preprocessed regional time-series data were obtained as shared[^abide-data] by [](https://doi.org/10.1016/j.neuroimage.2019.02.062), which were based on preprocessed image data provided by the Preprocessed Connectomes Project ([](10.3389/conf.fninf.2013.09.00041)).
+Tangent correlation across the time series of the n=122 regions of the BASC brain parcellation (Multi-level bootstrap analysis of stable clusters; [](https://doi.org/10.1016/j.neuroimage.2010.02.082)) was computed with nilearn[^nilearn]. The resulting functional connectivity estimates were considered features for a predictive model of autism diagnosis.

##### **HCP**
-The Human Connectome Project dataset contains imaging and behavioral data of approximately 1,200 healthy subjects ([](10.1016/j.neuroimage.2013.05.041)).
Preprocessed resting state functional magnetic resonance imaging (fMRI) connectivity data (partial correlation matrices; [](10.1016/j.neuroimage.2013.04.127)) as published with the HCP1200 release (N= 999 participants with functional connectivity data) were used to build models that predict individual fluid intelligence scores (Gf), measured with Penn Progressive Matrices ([](10.1126/science.289.5478.457)).
+The Human Connectome Project dataset contains imaging and behavioral data of approximately 1,200 healthy subjects ([](10.1016/j.neuroimage.2013.05.041)). Preprocessed resting state functional magnetic resonance imaging (fMRI) connectivity data (partial correlation matrices; [](10.1016/j.neuroimage.2013.04.127)) as published with the HCP1200 release (N=999 participants with functional connectivity data) were used to build models that predict individual fluid intelligence scores (Gf), measured with Penn Progressive Matrices ([](10.1126/science.289.5478.457)).

##### **IXI**
-The IXI dataset is published by the Neuroimage Analysis Center, from Imperial College London, in the United Kingdom, and it is part of the project Brain Development. It consists of approximately 600 structural MRI images from a diverse population of healthy individuals, including both males and females across a wide age range. The dataset contains high-resolution brain images form three different MRI scanners (Philips Intera 3T, Philips Gyroscan Intera 1.5T and GE 1.5T) and associated demographic information, making it suitable for studying age-related changes in brain structure and function. We used gray matter probability maps generated from T1–weighted MR images with Freesurfer ([](https://doi.org/10.1016/j.neuroimage.2012.01.021)) as features for a predictive model of age.
+The IXI dataset was published by the Neuroimage Analysis Center at Imperial College London, United Kingdom, as part of the Brain Development project.
It consists of approximately 600 structural MRI images from a diverse population of healthy individuals, including both males and females across a wide age range. The dataset contains high-resolution brain images from three different MRI scanners (Philips Intera 3T, Philips Gyroscan Intera 1.5T and GE 1.5T) and associated demographic information, making it suitable for studying age-related changes in brain structure and function. We used gray matter probability maps generated from T1-weighted MR images with FreeSurfer ([](https://doi.org/10.1016/j.neuroimage.2012.01.021)) as features for a predictive model of age.

##### **BCW**
The Breast Cancer Wisconsin (BCW, [](10.1117/12.148698)) dataset contains diagnostic features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. The dataset includes 30 different features such as the mean radius, mean texture, mean perimeter, mean area, mean smoothness, mean compactness, mean concavity, etc. The target variable for predictive modeling in this dataset is the diagnosis (M = malignant, B = benign).
-The chosen datasets include both classification and regression tasks, and span a wide range in terms the number of participants, number of predictive features, achievable predictive effect size and data homogeneity (see [Supplementary Figures 1-6](#supplementary-figures)).
+The chosen datasets include both classification and regression tasks, and span a wide range in terms of the number of participants, the number of predictive features, the achievable predictive effect size and data homogeneity (see [Supplementary Figures 1-6](#supplementary-figures)).
Our analyses aimed to contrast the proposed adaptive splitting method with the application of fixed training and validation sample sizes, specifically using 50, 80 or 90% of the total sample size for training and the rest for external validation. We simulated various "sample size budgets" (total sample sizes, $n_{total}$) with random sampling without replacement.
For a given total sample size, we simulated the prospective data acquisition procedure by incrementing $n_{act}$, starting with 10\% of the total sample size and going up with increments of five. In each step, the stopping rule was evaluated with "AdaptiveSplit", fitting a Ridge model (for regression tasks) or an L2-regularized logistic regression (for classification tasks). Model fit always consisted of a cross-validated fine-tuning of the regularization parameter, resulting in a nested cv estimate of prediction performance and validation power. Robust estimates (and confidence intervals) were obtained with bootstrapping, as described in {numref}`alg-learning-curve` and {numref}`alg-power-rule`. This procedure was iterated until the stopping rule returned True. The corresponding sample size was then considered the final training sample. With all four splitting approaches (adaptive, Pareto, half-split, 90-10\% split), we trained the previously described Ridge or regularized logistic regression model on the training sample and obtained predictions for the sample left out for external validation. This whole procedure was repeated 100 times for each simulated sample size budget in each dataset, to estimate the confidence intervals for the model's performance in the external validation and its statistical significance.
-In all analyzes, the adaptive splitting procedure is performed with a target power of $v_{pow} = 0.8$, an $alpha = 0.05$, $t_{tmin} = n_{total}/3$, $v_{min}=12$, $s_{min}=-\infty$. P-values were calculated using a permutation test with 5000 permutations.
+In all analyses, the adaptive splitting procedure was performed with a target power of $v_{pow} = 0.8$, $\alpha = 0.05$, $t_{min} = n_{total}/3$, $v_{min}=12$, $s_{min}=-\infty$. P-values were calculated using a permutation test with 5000 permutations.
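The simulated acquisition loop described above can be sketched as follows (a schematic with a hypothetical stand-in stopping rule; the actual rule evaluates the bootstrapped learning and power curves at each increment):

```python
def simulate_prospective_split(n_total, stopping_rule, start_frac=0.10, step=5):
    """Replay prospective data acquisition: begin the discovery phase
    with 10% of the sample size budget and grow it in increments of
    five until the stopping rule returns True; the samples that
    remain form the external validation set."""
    n_act = max(1, int(start_frac * n_total))
    while n_act < n_total and not stopping_rule(n_act, n_total):
        n_act += step
    n_act = min(n_act, n_total)
    return n_act, n_total - n_act  # (training size, validation size)

def toy_rule(n_act, n_total):
    # hypothetical stand-in: require at least a third of the budget for
    # training (Min-rule) and stop once training reaches 80% of the budget
    return n_act >= n_total / 3 and n_act >= 0.8 * n_total

n_train, n_val = simulate_prospective_split(500, toy_rule)  # -> (400, 100)
```

In the simulations reported here, `toy_rule` is replaced by the full stopping rule, re-fitting the Ridge or logistic regression model at every increment.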
## Results

The results of our empirical analyses of four large, openly available datasets confirmed that the proposed adaptive splitting approach can successfully identify the optimal time to stop acquiring data for training and maintain a good compromise between maximizing both predictive performance and external validation power with any sample size budget.
-In all three samples, the applied models yielded a statistically significant predictive performance at much lower sample sizes than the total size of the dataset, i.e. all datasets were well powered for the analysis. Trained on the full sample size with cross-validation, the models displayed the following performances: functional brain connectivity from the HCP dataset explained 13% of the variance in cognitive abilities; structural MRI data (gray matter probability maps) in the IXI dataset explained 48% in age; c**lassification accuracy was 65.5% for autism diagnosis (functional brain connectivity) in the ABIDE dataset and 92% for breast cancer diagnosis in the BCW dataset.
+In all four samples, the applied models yielded a statistically significant predictive performance at much lower sample sizes than the total size of the dataset, i.e. all datasets were well powered for the analysis. Trained on the full sample size with cross-validation, the models displayed the following performances: functional brain connectivity from the HCP dataset explained 13% of the variance in cognitive abilities; structural MRI data (gray matter probability maps) in the IXI dataset explained 48% of the variance in age; classification accuracy was 65.5% for autism diagnosis (functional brain connectivity) in the ABIDE dataset and 92% for breast cancer diagnosis in the BCW dataset.
The datasets varied not only in the achievable predictive performance but also in the shape of the learning curve and in sample size; thus, they provided a good opportunity to evaluate the performance of our stopping rule in various circumstances ([Supplementary Figures 1-6](#supplementary-figures)).
-We found that adaptively splitting the data provided external validation performances that were comparable to the commonly used Pareto split (80-20%) in most cases ({numref}`fig3` left column). As expected half-split tended to provide worse predictive performance due to the smaller training sample. In contrast, 90-10% tended to display only slightly higher performances than the Pareto and the Adaptive splitting techniques, in most cases.
-This small achievement came with a big cost in terms of the statistical power in the external validation sample, where the 90-10% split very often gave inconclusive results ($p\geq0.05$) (Fig. {numref}`fig3`right column), especially with low sample size budgets.
+We found that adaptively splitting the data provided external validation performances that were comparable to the commonly used Pareto split (80-20%) in most cases ({numref}`fig3`, left column). As expected, half-split tended to provide worse predictive performance due to the smaller training sample. In contrast, the 90-10% split tended to display only slightly higher performances than the Pareto and the adaptive splitting techniques in most cases.
+This small gain came at a large cost in terms of statistical power in the external validation sample, where the 90-10% split very often gave inconclusive results ($p\geq0.05$)
+({numref}`fig3`, right column), especially with low sample size budgets.
Although to a lesser degree, the Pareto split also frequently failed to yield a conclusive external validation with small total sample sizes. Adaptive splitting (as well as half-split) provided sufficient statistical power for the external validation in most cases.
Focusing only on cases with a successful, conclusive external validation, the proposed adaptive splitting strategy always provided equally good or better predictive performance than the fixed splitting strategies (as shown by the 95% confidence intervals on {numref}`fig3`).
@@ -256,16 +256,16 @@ Focusing only on cases with a successful, conclusive external validation, the pr
:::{figure} figures/fig3.png
:name: fig3
**The proposed adaptive splitting approach provides a good compromise between predictive performance and statistical power of the external validation.** \
-The left and right column shows the comparison of splitting methods on external validation performance and p-values, respectively, at various $n_{total}$. Confidence intervals are based on 100 repetitions of the analyses. The adaptive splitting approach (blue) provides a good compromise between predictive performance and statistical power of the external validation. The Pareto split (orange) provides similar external validation performances to adaptive splitting, however it often fails to provide conclusive results due to an insufficieent sample size during external validation, especially in case of a limited sample size budget. The 90-10% split (green) provides only slightly higher performances than the Pareto and the Adaptive splitting techniques, but it very often gives inconclusive results ($p\geq0.05$) in the external validation sample. Half-split (red) tends to provide worse predictive performance due to the too small training sample.
+The left and right columns show the comparison of splitting methods on external validation performance and p-values, respectively, at various $n_{total}$. Confidence intervals are based on 100 repetitions of the analyses. The adaptive splitting approach (blue) provides a good compromise between predictive performance and statistical power of the external validation.
The Pareto split (orange) provides similar external validation performances to adaptive splitting; however, it often fails to provide conclusive results due to an insufficient sample size during external validation, especially in the case of a limited sample size budget. The 90-10% split (green) provides only slightly higher performances than the Pareto and the adaptive splitting techniques, but it very often gives inconclusive results ($p\geq0.05$) in the external validation sample. Half-split (red) tends to provide worse predictive performance due to the smaller training sample.
:::

## Discussion

-Here we have proposed "registered models", a novel design for prospective predictive modeling studies that allows flexible model discovery and trustworthy prospective external validation by fixing and publicly depositing the model after the discovery phase. Furthermore, capitalizing on the flexibility during model discovery with the registered model design, we have proposed a stopping rule for adaptively splitting of the sample size budget into discovery and external validation phases. These approaches together provide a robust and flexible framework for predictive modeling studies and address several common issues in the field, including overfitting, effect size inflation as well as the lack of reliability and reproducibility.
+Here we have proposed "registered models", a novel design for prospective predictive modeling studies that allows flexible model discovery and trustworthy prospective external validation by fixing and publicly depositing the model after the discovery phase. Furthermore, capitalizing on the flexibility during model discovery with the registered model design, we have proposed a stopping rule for adaptively splitting the sample size budget into discovery and external validation phases.
These approaches together provide a robust and flexible framework for predictive modeling studies and address several common issues in the field, including overfitting, effect size inflation, and the lack of reliability and reproducibility.
Registered models provide a clear and transparent separation between the discovery and external validation phases, which is essential for ensuring the independence of the external validation data. Thereby, they provide a straightforward solution to several of the widely discussed issues and pitfalls of predictive model development ([](https://doi.org/10.1080/01621459.1983.10477973); [](https://doi.org/10.1016/j.biopsych.2020.02.016); [](https://doi.org/10.1038/s41746-022-00592-y); [](10.1038/s41586-022-04492-9); [](10.1038/s41586-023-05745-x)). With registered models, external validation estimates are guaranteed to be free of information leakage ([](https://doi.org/10.1016/j.patter.2023.100804)) and to provide an unbiased estimate of the model's predictive performance. Nevertheless, these performance estimates will still be subject to sampling variance, which can be reduced by increasing the sample size of the external validation set.
-The question of how many participants should be involved in the discovery and external validity remains of central importance for the optimal use of available resources (scanning time, budget, limitations in participant recruitment) ([](https://doi.org/10.1002/sim.8766); [](https://doi.org/10.1002/sim.9025); [](https://doi.org/10.1038/s41586-022-04492-9); [](https://doi.org/10.1038/s41586-023-05745-x); [](https://doi.org/10.1038/s41593-022-01110-9); [](10.52294/51f2e656-d4da-457e-851e-139131a68f14); [](https://doi.org/10.1101/2023.06.16.545340); [](#supplementary-table-1)). Optimal sample sizes are often challenging to determine prior to the study.
The proposed adpative splitting procedure promises to provide a solution in such cases by allowing the sample size to be adjusted during the data acquisition process, based on the observed performance of the model trained on the already available data.
+The question of how many participants should be involved in the discovery and external validation phases remains of central importance for the optimal use of available resources (scanning time, budget, limitations in participant recruitment) ([](https://doi.org/10.1002/sim.8766); [](https://doi.org/10.1002/sim.9025); [](https://doi.org/10.1038/s41586-022-04492-9); [](https://doi.org/10.1038/s41586-023-05745-x); [](https://doi.org/10.1038/s41593-022-01110-9); [](10.52294/51f2e656-d4da-457e-851e-139131a68f14); [](https://doi.org/10.1101/2023.06.16.545340); [](#supplementary-table-1)). Optimal sample sizes are often challenging to determine prior to the study. The proposed adaptive splitting procedure promises to provide a solution in such cases by allowing the sample size to be adjusted during the data acquisition process, based on the observed performance of the model trained on the already available data.

We performed a thorough evaluation of the proposed adaptive splitting procedure on data from more than 3000 participants from four publicly available datasets. We found that the proposed adaptive splitting approach can successfully identify the optimal time to stop acquiring data for training and maintain a good compromise between maximizing both predictive performance and external validation power with any "sample size budget". When contrasting splitting approaches based on fixed validation size with the proposed adaptive splitting technique, using the latter was always the preferable strategy to maximize power and statistical significance during external validation. Adaptively splitting the data acquisition into training and validation provides the largest benefit in lower sample size regimes.
In the case of larger sample sizes, the fixed Pareto split (80-20%) also provided good results, giving similar external validation performances to adaptive splitting, without having to repeatedly re-train the model during data acquisition. Thus, for moderate to large sample sizes and well powered models, the Pareto split might be a good alternative to the adaptive splitting approach, especially if the computational resources for re-training the model are limited.
diff --git a/manuscript/02-supplementary.md b/manuscript/02-supplementary.md
index 2e5f180..6e7590a 100644
--- a/manuscript/02-supplementary.md
+++ b/manuscript/02-supplementary.md
@@ -12,14 +12,14 @@ exports:
:name: si-bcw-scatter
:align: center
:width: 40%
-Predictive performance (confusion matrix) of the model trained on in the BCW dataset to predict diagnosis. The model was trained on the whole dataset with nested cross-validation.
+Predictive performance (confusion matrix) of the model trained on the BCW dataset to predict diagnosis. The model was trained on the whole dataset with nested cross-validation.
:::

:::{figure} figures/si-bcw-lc.png
:name: si-bcw-lc
:align: center
:width: 50%
-Learning curve (top) and power curve (bottom) of the model trained oon in the BCW dataset to predict diagnosis. The maximum sample size (i.e. the whole dataset) was considered as the "sample size budget". X-axis: $n_{act}$; y-axis (learning curve): Accuracy as a measure of predictive performance; y-axis (power curve): statistical power of the remaining sample to confirm the model's validity.
+Learning curve (top) and power curve (bottom) of the model trained on the BCW dataset to predict diagnosis. The maximum sample size (i.e. the whole dataset) was considered as the "sample size budget". X-axis: $n_{act}$; y-axis (learning curve): Accuracy as a measure of predictive performance; y-axis (power curve): statistical power of the remaining sample to confirm the model's validity.
::: :::{figure} figures/si-ixi-scatter.png diff --git a/manuscript/myst.yml b/manuscript/myst.yml index cd8d7a0..ccc5750 100644 --- a/manuscript/myst.yml +++ b/manuscript/myst.yml @@ -8,12 +8,13 @@ project: authors: - name: Giuseppe Gallitto affiliations: - - Department of Diagnostic and Interventional Radiology and Neuroradiology, University Medicine Essen,Germany + - Center for Translational Neuro- and Behavioral Sciences (C-TNBS), University Medicine Essen, Germany + - Department of Neurology, University Medicine Essen, Germany email: giuseppe.gallitto@uk-essen.de - name: Robert Englert affiliations: - - Department of Diagnostic and Interventional Radiology and Neuroradiology, University Medicine Essen, Germany + - Department of Diagnostic and Interventional Radiology and Neuroradiology, University Medicine Essen, Germany - name: Balint Kincses affiliations: @@ -27,6 +28,10 @@ project: affiliations: - Department of Neurology, University Medicine Essen, Germany - Max Planck School of Cognition, Leipzig, Germany + + - name: Kevin Hoffschlag + affiliations: + - Department of Neurology, University Medicine Essen, Germany - name: Ulrike Bingel affiliations: @@ -35,7 +40,7 @@ project: - name: Tamas Spisak affiliations: - - Department of Diagnostic and Interventional Radiology and Neuroradiology, University Medicine Essen, Germany + - Department of Diagnostic and Interventional Radiology and Neuroradiology, University Medicine Essen, Germany - Center for Translational Neuro- and Behavioral Sciences (C-TNBS), University Medicine Essen, Germany orcid: 0000-0002-2942-0821 email: tamas.spisak@uk-essen.de