diff --git a/manuscript/01-paper.md b/manuscript/01-paper.md
index 82333f4..9493b32 100644
--- a/manuscript/01-paper.md
+++ b/manuscript/01-paper.md
@@ -137,13 +137,13 @@ The learning curve analysis allows the discovery phase to be stopped if the expe
\text{Performance-rule:} \quad \hat{s}_{total} - s_{act} \leq \epsilon_{s}
:::
-where $s_{act}$ is the actual bootstrapped predictive performance score (i.e. the last element of $\textbf{l}_{act}$, as returned by {numref}`alg-learning-curve`, $\hat{s}_{total}$ is a estimate of the (unknown) predictive performance $s_{total}$ (i.e. the predictive performance of the model trained on the whole sample size) and $\epsilon_{s}$ is the smallest predictive effect of interest. Note that, setting $\epsilon_{s} = -\infty$ deactivates the *Performance-rule* ([Eq. %s](#eq-perf-rule)).
+where $s_{act}$ is the actual bootstrapped predictive performance score (i.e. the last element of $\textbf{l}_{act}$, as returned by {numref}`alg-learning-curve`), $\hat{s}_{total}$ is an estimate of the (unknown) predictive performance $s_{total}$ (i.e. the predictive performance of the model trained on the whole sample size) and $\epsilon_{s}$ is the smallest predictive effect of interest. Note that setting $\epsilon_{s} = -\infty$ deactivates the *Performance-rule* ([Eq. %s](#eq-perf-rule)).

While $s_{total}$ is typically unknown at the time of evaluating the stopping rule $S$, there are various approaches to obtaining an estimate $\hat{s}_{total}$. In the base implementation of AdaptiveSplit, we stick to a simple method: we extrapolate the learning curve $\textbf{l}_{act}$ based on its tangent line at $n_{act}$, i.e. assuming that the latest growth rate will remain constant for the remaining samples.
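As a sketch, this tangent-line extrapolation amounts to the following (illustrative function and variable names, not the actual adaptivesplit API):

```python
def extrapolate_s_total(sample_sizes, scores, n_total):
    """Tangent-line estimate of s_total from an observed learning curve.

    Illustrative sketch: assumes the latest observed growth rate of the
    learning curve remains constant for the remaining samples.
    """
    # growth rate of the last observed segment of the learning curve
    slope = (scores[-1] - scores[-2]) / (sample_sizes[-1] - sample_sizes[-2])
    # extend the tangent line from n_act (the last observed size) to n_total
    return scores[-1] + slope * (n_total - sample_sizes[-1])
```

On a plateauing curve the slope approaches zero, so the estimate collapses to $s_{act}$ and the Performance-rule fires.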
While in most scenarios this is an overly optimistic estimate, it still provides a useful upper bound for the maximally achievable predictive performance with the given sample size and can successfully detect if the learning curve has already reached a flat plateau (as in {numref}`fig1`C).

##### **Statistical power of the external validation sample**
-Even if the learning curve did not reach a plateau, we still need to make sure that we stop the training phase early enough to save a sufficient amount of data for a successful external validation from our sample size budget. Given the actual predictive performance estimate $s_{act}$ and the size of the remaining, to-be-acquired sample $s_{total} - s{act}$, we can estimate the probability that the external validation correctly rejects the null hypothesis (i.e. zero predictive performance).
+Even if the learning curve did not reach a plateau, we still need to make sure that we stop the training phase early enough to save a sufficient amount of data for a successful external validation from our sample size budget. Given the actual predictive performance estimate $s_{act}$ and the size of the remaining, to-be-acquired sample $n_{total} - n_{act}$, we can estimate the probability that the external validation correctly rejects the null hypothesis (i.e. zero predictive performance).

This type of analysis, known as power calculation, allows us to determine the optimal stopping point that guarantees the desired statistical power during the external validation. Specifically, the stopping rule $S$ will return $True$ if the *Performance-rule* ([Eq. %s](#eq-perf-rule)) is $False$ and the following is true:
@@ -153,7 +153,7 @@ Specifically, the stopping rule $S$ will return $True$ if the *Performance-rule*
:::
where $POW_\alpha(s, n)$ is the power of a validation sample of size $n$ to detect an effect size of $s$ and $n_{val} = n_{total}-n_{act}$ is the size of the validation sample if stopping, i.e.
the number of remaining (not yet measured) participants in the experiment.
-Given that machine learning model predictions are often non-normally distributed ([](https://doi.org/10.1093/gigascience/giac082)), our implementation is based on a bootstrapped power analysis for permutation tests, as shown in {numref}`alg-power-rule`. Our implementation is, however, simple to extend with other parametric or non-parametric estimators like Pearson Correlation and Spearman Rank Correlation.
+Given that machine learning model predictions are often non-normally distributed ([](https://doi.org/10.1093/gigascience/giac082)), our implementation is based on a bootstrapped power analysis for permutation tests, as shown in {numref}`alg-power-rule`. It is, however, simple to extend with other parametric or non-parametric power calculation techniques.

:::{prf:algorithm} Calculation of the Power-rule
:label: alg-power-rule
@@ -188,7 +188,7 @@ Given that machine learning model predictions are often non-normally distributed
Note that depending on the aim of external validation, the *Power-rule* can be swapped to, or extended with, other conditions. For instance, if we are interested in accurately estimating the predictive effect size, we could condition the stopping rule on the width of the confidence interval for the prediction performance.
-Calculating the validation power ({numref}`alg-power-rule`) for all available sample sizes ($n = 1 \dots n_{act}$) defines the so-called "validation power curve" (see {numref}`fig1`), that represents the expected ratio of true positive statistical tests on increasing sample size calculated on the external validation set. Various extrapolations of the power curve can predict the expected stopping point during the course of the experiment.
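A bootstrapped power analysis for a permutation test of zero predictive performance might be sketched as follows (Pearson correlation serves as an assumed score here; names and defaults are ours, not necessarily those of adaptivesplit):

```python
import numpy as np

def permutation_power(y_true, y_pred, n_val, alpha=0.05,
                      n_bootstrap=100, n_perm=500, seed=None):
    """Estimate the power of a permutation test on a validation set of size n_val.

    Sketch: resample hypothetical validation sets of size n_val (bootstrap),
    run a one-sided permutation test of zero association on each, and return
    the fraction of significant tests.
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_bootstrap):
        # simulate one validation sample of size n_val by resampling
        idx = rng.integers(0, len(y_true), n_val)
        yt, yp = y_true[idx], y_pred[idx]
        observed = np.corrcoef(yt, yp)[0, 1]
        # permutation null distribution: shuffle the predictions
        null = np.array([np.corrcoef(yt, rng.permutation(yp))[0, 1]
                         for _ in range(n_perm)])
        p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
        rejections += p_value < alpha
    return rejections / n_bootstrap
```

Evaluating this for a grid of candidate validation sizes yields a power curve of the kind described above.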
+Calculating the validation power ({numref}`alg-power-rule`) for all available sample sizes ($n = 1 \dots n_{act}$) defines the so-called "validation power curve" (see {numref}`fig1` and [Supplementary Figures %s](#si-bcw-lc), [%s](#si-ixi-lc) and [%s](#si-hcp-lc)), which represents the expected ratio of true positive statistical tests with increasing sample size, calculated on the external validation set. Various extrapolations of the power curve can predict the expected stopping point during the course of the experiment.

#### Stopping Rule
@@ -208,7 +208,7 @@ Our proposed stopping rule integrates the $\text{Min-rule}$, the $\text{Max-rule
where $\Phi$ denotes the parameters of the stopping rule: minimum training sample size, minimum validation sample size, minimum effect of interest, target power for the external validation and the significance threshold, respectively.
-We have implemented the proposed stopping rule in the Python package “*adaptivesplit*[^adaptive-split]. The software can be used together with a wide variety of machine learning tools and provides an easy-to-use interface to work with scikit-learn ([](https://doi.org/10.48550/arXiv.1201.0490)) models.
+We have implemented the proposed stopping rule in the Python package *adaptivesplit*[^adaptive-split]. The package can be used together with a wide variety of machine learning tools and provides an easy-to-use interface to work with scikit-learn ([](https://doi.org/10.48550/arXiv.1201.0490)) models.

#### Empirical evaluation
@@ -219,7 +219,7 @@ We obtained preprocessed data from Autism Brain Imaging Data Exchange (ABIDE) da
Tangent correlation across the time series of the n=122 regions of the BASC brain parcellation (Multi-level bootstrap analysis of stable clusters; [](https://doi.org/10.1016/j.neuroimage.2010.02.082)) was computed with nilearn[^nilearn]. The resulting functional connectivity estimates were considered features for a predictive model of autism diagnosis.
##### **HCP**
-The Human Connectome Project dataset contains imaging and behavioral data of approximately 1,200 healthy subjects ([](10.1016/j.neuroimage.2013.05.041)). Pre-processed resting state functional magnetic resonance imaging (fMRI) connectivity data (partial correlation matrices; [](10.1016/j.neuroimage.2013.04.127)) as published with the HCP1200 release (N= 999 participants with functional connectivity data) were used to build models that predict individual fluid intelligence scores (Gf), measured with Penn Progressive Matrices ([](10.1126/science.289.5478.457)).
+The Human Connectome Project dataset contains imaging and behavioral data of approximately 1,200 healthy subjects ([](10.1016/j.neuroimage.2013.05.041)). Pre-processed resting state functional magnetic resonance imaging (fMRI) connectivity data (partial correlation matrices; [](10.1016/j.neuroimage.2013.04.127)) as published with the HCP1200 release (N=999 participants with functional connectivity data) were used to build models that predict individual fluid intelligence scores (Gf), measured with Penn Progressive Matrices ([](10.1126/science.289.5478.457)).

##### **IXI**
@@ -232,7 +232,8 @@ The Breast Cancer Wisconsin (BCW, [](10.1117/12.148698)) dataset contains diagno
The chosen datasets include both classification and regression tasks, and span a wide range in terms of number of participants, number of predictive features, achievable predictive effect size and data homogeneity (see [Supplementary Figures 1-6](#supplementary-figures)). Our analyses aimed to contrast the proposed adaptive splitting method with the application of fixed training and validation sample sizes, specifically using 50, 80 or 90% of the total sample size for training and the rest for external validation. We simulated various "sample size budgets" (total sample sizes, $n_{total}$) with random sampling without replacement.
-For a given total sample size, we simulated the prospective data acquisition procedure by incrementing $n_{act}$; starting with 10\% of the total sample size and going up with increments of five. In each step, the stopping rule was evaluated with "AdaptiveSplit", but fitting a Ridge model (for regression tasks) or a L2-regularized logistic regression (for classification tasks). Model fit always consisted of a cross-validated fine-tuning of the regularization parameter, resulting in a nested cv estimate of prediction performance and validation power. Robust estimates (and confidence intervals) were obtained with bootstrapping, as described in {numref}`alg-learning-curve` and {numref}`alg-power-rule`.
+For a given total sample size, we simulated the prospective data acquisition procedure by incrementing $n_{act}$; starting with 10\% of the total sample size and going up with increments of five.
+In each step, the stopping rule was evaluated with "AdaptiveSplit", fitting a Ridge model (for regression tasks) or an L2-regularized logistic regression (for classification tasks). Model fit always consisted of a cross-validated fine-tuning of the regularization parameter, resulting in a nested cross-validation estimate of prediction performance and validation power. Robust estimates (and confidence intervals) were obtained with bootstrapping, as described in {numref}`alg-learning-curve` and {numref}`alg-power-rule`.

This procedure was iterated until the stopping rule returned True. The corresponding sample size was then considered the final training sample. With all four splitting approaches (adaptive, Pareto, Half-split, 90-10\% split), we trained the previously described Ridge or regularized logistic regression model on the training sample and obtained predictions for the sample left out for external validation.
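The simulated acquisition loop can be sketched as follows (the `stopping_rule` callable is a hypothetical stand-in for the full AdaptiveSplit evaluation, i.e. model fitting plus learning- and power-curve bootstrapping):

```python
def simulate_acquisition(n_total, stopping_rule, start_frac=0.10, step=5):
    """Grow the 'acquired' sample until the stopping rule fires.

    Starts at 10% of the sample size budget and increments by five,
    mirroring the simulation described in the text.
    """
    n_act = max(1, round(start_frac * n_total))
    while n_act < n_total and not stopping_rule(n_act, n_total):
        n_act += step  # acquire five more (simulated) participants
    return min(n_act, n_total)  # final training sample size
```

If the rule never fires, the whole budget is consumed by the discovery phase.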
This whole procedure was repeated 100 times for each simulated sample size budget in each dataset, to estimate the confidence intervals for the model's performance in the external validation and its statistical significance.
@@ -243,14 +244,13 @@ In all analyses, the adaptive splitting procedure is performed with a target pow
The results of our empirical analyses of four large, openly available datasets confirmed that the proposed adaptive splitting approach can successfully identify the optimal time to stop acquiring data for training and maintain a good compromise between maximizing both predictive performance and external validation power with any sample size budget. In all four samples, the applied models yielded a statistically significant predictive performance at much lower sample sizes than the total size of the dataset, i.e. all datasets were well powered for the analysis.
-In all three samples, the applied models yielded a statistically significant predictive performance at much lower sample sizes than the total size of the dataset, i.e. all datasets were well powered for the analysis.
Trained on the full sample size with cross-validation, the models displayed the following performances: functional brain connectivity from the HCP dataset explained 13% of the variance in cognitive abilities; structural MRI data (gray matter probability maps) in the IXI dataset explained 48% of the variance in age; classification accuracy was 65.5% for autism diagnosis (functional brain connectivity) in the ABIDE dataset and 92% for breast cancer diagnosis in the BCW dataset.

The datasets varied not only in the achievable predictive performance but also in the shape of the learning curve, with different sample sizes, and thus provided a good opportunity to evaluate the performance of our stopping rule in various circumstances ([Supplementary Figures 1-6](#supplementary-figures)). We found that adaptively splitting the data provided external validation performances that were comparable to the commonly used Pareto split (80-20%) in most cases ({numref}`fig3`, left column). As expected, half-split tended to provide worse predictive performance due to the smaller training sample. In contrast, 90-10% tended to display only slightly higher performances than the Pareto and the adaptive splitting techniques in most cases. This small gain came at a high cost in terms of the statistical power in the external validation sample, where the 90-10% split very often gave inconclusive results ($p\geq0.05$) ({numref}`fig3`, right column), especially with low sample size budgets.
-Although to a lesser degree, Pareto split also frequently failed to yield a conclusive external validation with small total sample sizes. Adaptive splitting (as well as half-split) provided sufficient statistical power for the external validation in most cases.
+Although to a lesser degree, Pareto split also frequently failed to yield a conclusive external validation with small total sample sizes.
Adaptive splitting (as well as half-split) provided sufficient statistical power for the external validation in most cases. This was achieved by applying different strategies in different scenarios. In the case of low total sample sizes, it retained a larger proportion of the sample for the external validation phase in order to achieve sufficient power, up to using 79% of the data for external validation. On the other hand, if the total sample size budget allowed it, adaptive splitting let the predictive model benefit from larger training samples, retaining as little as 8% of the data for external validation in such cases. Focusing only on cases with a successful, conclusive external validation, the proposed adaptive splitting strategy always provided equally good or better predictive performance than the fixed splitting strategies (as shown by the 95% confidence intervals on {numref}`fig3`).
@@ -264,14 +264,16 @@ The left and right column shows the comparison of splitting methods on external
Here we have proposed "registered models", a novel design for prospective predictive modeling studies that allows flexible model discovery and trustworthy prospective external validation by fixing and publicly depositing the model after the discovery phase. Furthermore, capitalizing on the flexibility during model discovery with the registered model design, we have proposed a stopping rule for adaptively splitting the sample size budget into discovery and external validation phases. These approaches together provide a robust and flexible framework for predictive modeling studies and address several common issues in the field, including overfitting, effect size inflation, and the lack of reliability and reproducibility.
-Registered models provide a clear and transparent separation between the discovery and external validation phases, which is essential for ensuring the independence of the external validation data.
Thereby, they provide a straightforward solution to several of the widely discussed issues and pitfalls of predictive model development ([](https://doi.org/10.1080/01621459.1983.10477973); [](https://doi.org/10.1016/j.biopsych.2020.02.016); [](https://doi.org/10.1038/s41746-022-00592-y); [](10.1038/s41586-022-04492-9); [](10.1038/s41586-023-05745-x)). With registered models, external validation estimates are guaranteed to be free of information leakage ([](https://doi.org/10.1016/j.patter.2023.100804)) and to provide an unbiased estimate of the model's predictive performance. Nevertheless, these performance estimates will still be subject of sampling variance, which can be reduced by increasing the sample size of the external validation set.
+Registered models provide a clear and transparent separation between the discovery and external validation phases, which is essential for ensuring the independence of the external validation data. Thereby, they provide a straightforward solution to several of the widely discussed issues and pitfalls of predictive model development ([](https://doi.org/10.1080/01621459.1983.10477973); [](https://doi.org/10.1016/j.biopsych.2020.02.016); [](https://doi.org/10.1038/s41746-022-00592-y); [](10.1038/s41586-022-04492-9); [](10.1038/s41586-023-05745-x)). With registered models, external validation estimates are guaranteed to be free of information leakage ([](https://doi.org/10.1016/j.patter.2023.100804)) and provide an unbiased estimate of the model's predictive performance.
-The question of how many participants should be involved in the discovery and external validity remains of central importance for the optimal use of available resources (scanning time, budget, limitations in participant recruitment) ([](https://doi.org/10.1002/sim.8766); [](https://doi.org/10.1002/sim.9025); [](https://doi.org/10.1038/s41586-022-04492-9); [](https://doi.org/10.1038/s41586-023-05745-x); [](https://doi.org/10.1038/s41593-022-01110-9); [](10.52294/51f2e656-d4da-457e-851e-139131a68f14); [](https://doi.org/10.1101/2023.06.16.545340); [](#supplementary-table-1)). Optimal sample sizes are often challenging to determine prior to the study. The proposed adaptive splitting procedure promises to provide a solution in such cases by allowing the sample size to be adjusted during the data acquisition process, based on the observed performance of the model trained on the already available data.
+With registered models, the question of how the total sample size budget should be distributed between the discovery and external validation phases remains of central importance for the optimal use of available resources (scanning time, budget, limitations in participant recruitment) ([](https://doi.org/10.1002/sim.8766); [](https://doi.org/10.1002/sim.9025); [](https://doi.org/10.1038/s41586-022-04492-9); [](https://doi.org/10.1038/s41586-023-05745-x); [](https://doi.org/10.1038/s41593-022-01110-9); [](10.52294/51f2e656-d4da-457e-851e-139131a68f14); [](https://doi.org/10.1101/2023.06.16.545340); [](#supplementary-table-1)). Optimal sample sizes are often challenging to determine prior to the study. The proposed adaptive splitting procedure promises to provide a solution in such cases by allowing the sample size to be adjusted during the data acquisition process, based on the observed performance of the model trained on the already available data.
We performed a thorough evaluation of the proposed adaptive splitting procedure on data from more than 3000 participants from four publicly available datasets. We found that the proposed adaptive splitting approach can successfully identify the optimal time to stop acquiring data for training and maintain a good compromise between maximizing both predictive performance and external validation power with any "sample size budget". When contrasting splitting approaches based on fixed validation size with the proposed adaptive splitting technique, using the latter was always the preferable strategy to maximize power and statistical significance during external validation. Adaptively splitting the data acquisition into training and validation phases provides the largest benefit in lower sample size regimes.
-In case of larger sample sizes, the fixed Pareto split (20-80%) provided also good results, giving similar external validation performances to adaptive splitting, without having to repeatedly re-train the model during data acquisition.
+In case of larger total sample size budgets, the fixed Pareto split (80-20%) also provided good results, giving similar external validation performances to adaptive splitting, without having to repeatedly re-train the model during data acquisition.

Thus, for moderate to large sample sizes and well powered models, the Pareto split might be a good alternative to the adaptive splitting approach, especially if the computational resources for re-training the model are limited.
Finally, it provides a flexible approach to data splitting, which can be adjusted according to the specific needs of the study.
+Of note, the presented implementation of adaptive data splitting aims to maximize the training sample (and minimize the external validation sample) in order to achieve the highest possible performance together with a conclusive (statistically significant) external validation. However, the resulting external performance estimates will still be subject to sampling variance. If the aim is to provide more reliable estimates of the predictive effect size in the external validation, the power-rule in the proposed approach can be modified so that it stops the discovery phase when a desired confidence interval width for the external effect size estimate is reached.
+
+The proposed adaptive splitting design can advance the development of predictive models in several ways. Firstly, it provides a simple way to perform both model discovery and initial external validation in a single study. Furthermore, it promotes the public deposition (registration) of models at an early stage of the study, enhancing transparency, reliability and replicability. Finally, it provides a flexible approach to data splitting, which can be adjusted according to the specific needs of the study.

In conclusion, registered models provide a simple approach to guaranteeing the independence of model discovery and external validation. For the development and initial evaluation of registered models with unknown power, the introduced adaptive splitting procedure provides a robust and flexible approach to determining the optimal ratio of data to be used for model discovery and external validation. Together, registered models and the adaptive splitting procedure address several common issues in the field, including overfitting and cross-validation failure, and boost reliability and reproducibility.
diff --git a/manuscript/02-supplementary.md b/manuscript/02-supplementary.md
index 34884cc..0c54377 100644
--- a/manuscript/02-supplementary.md
+++ b/manuscript/02-supplementary.md
@@ -56,27 +56,27 @@ Learning curve (red) and power curve (blue) of the model trained on resting stat
### Supplementary Table 1
Manuscripts, commentaries, and editorials on the topic of brain-behavior associations and their reproducibility, related to [](https://doi.org/10.1038/s41586-022-04492-9). See the up-to-date list here: https://spisakt.github.io/BWAS_comment/
-| Authors | Title | Where |
-|---------|-------|-------|
-| Nature editorial | Cognitive neuroscience at the crossroads | [Nature](https://www.nature.com/articles/d41586-022-02283-w)
-| Spisak et al. | Multivariate BWAS can be replicable with moderate sample sizes | [Nature](https://doi.org/10.1038/s41586-023-05745-x) |
-| [Nat. Neurosci. editorial ] | Revisiting doubt in neuroimaging research | [Nat. Neurosci.](https://doi.org/10.1038/s41593-022-01125-2) |
+| Authors | Title | Where |
+|---------|-------|-------|
+| Nature editorial | Cognitive neuroscience at the crossroads | [Nature](https://www.nature.com/articles/d41586-022-02283-w) |
+| Spisak et al. | Multivariate BWAS can be replicable with moderate sample sizes | [Nature](https://doi.org/10.1038/s41586-023-05745-x) |
+| Nat. Neurosci. editorial | Revisiting doubt in neuroimaging research | [Nat. Neurosci.](https://doi.org/10.1038/s41593-022-01125-2) |
| Monica D. Rosenberg and Emily S.
Finn | How to establish robust brain–behavior relationships without thousands of individuals | [Nat. Neurosci.](https://doi.org/10.1038/s41593-022-01110-9) | -| Bandettini P et al. | The challenge of BWAS: Unknown Unknowns in Feature Space and Variance | [Med](http://www.thebrainblog.org/2022/07/04/the-challenge-of-bwas-unknown-unknowns-in-feature-space-and-variance/) | -| Gratton C. et al. | Brain-behavior correlations: Two paths toward reliability | [Neuron](https://doi.org/10.1016/j.neuron.2022.04.018) | -| Cecchetti L. and Handjaras G. | Reproducible brain-wide association studies do not necessarily require thousands of individuals | [psyArXiv](10.31234/osf.io/c8xwe) | -| Winkler A. et al. | We need better phenotypes | [brainder.org](https://brainder.org/2022/05/04/we-need-better-phenotypes/) | -| DeYoung C. et al. | Reproducible between-person brain-behavior associations do not always require thousands of individuals | [psyArXiv](10.31234/osf.io/sfnmk) | -| Gell M et al. | The Burden of Reliability: How Measurement Noise Limits Brain-Behaviour Predictions | [bioRxiv](https://doi.org/10.1101/2023.02.09.527898) | -| Tiego J. et al. | Precision behavioral phenotyping as a strategy for uncovering the biological correlates of psychopathology | [OSF](10.31219/osf.io/geh6q) | -| Chakravarty MM. | Precision behavioral phenotyping as a strategy for uncovering the biological correlates of psychopathology | [Nature Mental Health](https://doi.org/10.1038/s44220-023-00057-5) | +| Bandettini P et al. | The challenge of BWAS: Unknown Unknowns in Feature Space and Variance | [Med](http://www.thebrainblog.org/2022/07/04/the-challenge-of-bwas-unknown-unknowns-in-feature-space-and-variance/) | +| Gratton C. et al. | Brain-behavior correlations: Two paths toward reliability | [Neuron](https://doi.org/10.1016/j.neuron.2022.04.018) | +| Cecchetti L. and Handjaras G. 
| Reproducible brain-wide association studies do not necessarily require thousands of individuals | [psyArXiv](10.31234/osf.io/c8xwe) | +| Winkler A. et al. | We need better phenotypes | [brainder.org](https://brainder.org/2022/05/04/we-need-better-phenotypes/) | +| DeYoung C. et al. | Reproducible between-person brain-behavior associations do not always require thousands of individuals | [psyArXiv](10.31234/osf.io/sfnmk) | +| Gell M et al. | The Burden of Reliability: How Measurement Noise Limits Brain-Behaviour Predictions | [bioRxiv](https://doi.org/10.1101/2023.02.09.527898) | +| Tiego J. et al. | Precision behavioral phenotyping as a strategy for uncovering the biological correlates of psychopathology | [OSF](10.31219/osf.io/geh6q) | +| Chakravarty MM. | Precision behavioral phenotyping as a strategy for uncovering the biological correlates of psychopathology | [Nature Mental Health](https://doi.org/10.1038/s44220-023-00057-5) | | White T. | Behavioral phenotypes, stochastic processes, entropy, evolution, and individual variability: Toward a unified field theory for neurodevelopment and psychopathology | [OHBM Aperture Neuro](https://doi.org/10.52294/c900ce20-3ffd-4545-8c15-3ec532b2ee3b) | -| Bandettini P. | Lost in transformation: fMRI power is diminished by unknown variability in methods and people | [OHBM Aperture Neuro](10.52294/725139d7-0b8a-49dc-a81d-ba2ca64ff6d9) | -| Thirion B. | On the statistics of brain/behavior associations | [OHBM Aperture Neuro](10.52294/51f2e656-d4da-457e-851e-139131a68f14) | -| Tiego J., Fornito A. | Putting behaviour back into brain–behaviour correlation analyses | [OHBM Aperture Neuro](10.52294/2f9c5854-d10b-44ab-93fa-d485ef5b24f1) | +| Bandettini P. | Lost in transformation: fMRI power is diminished by unknown variability in methods and people | [OHBM Aperture Neuro](10.52294/725139d7-0b8a-49dc-a81d-ba2ca64ff6d9) | +| Thirion B. 
| On the statistics of brain/behavior associations | [OHBM Aperture Neuro](10.52294/51f2e656-d4da-457e-851e-139131a68f14) | +| Tiego J., Fornito A. | Putting behaviour back into brain–behaviour correlation analyses | [OHBM Aperture Neuro](10.52294/2f9c5854-d10b-44ab-93fa-d485ef5b24f1) | | Lucina QU. | Brain-behavior associations depend heavily on user-defined criteria | [OHBM Aperture Neuro](https://doi.org/10.52294/5ba14033-72bb-4915-81a3-fa221302818a) | -| Valk SL., Hettner MD. | Commentary on ‘Reproducible brain-wide association studies require thousands of individuals’ | [OHBM Aperture Neuro](10.52294/de841a29-d684-4707-9042-5bbd3d764f84) | -| Kong XZ., et al. | Scanning reproducible brain-wide associations: sample size is all you need? | [Psychoradiology](https://doi.org/10.1093/psyrad/kkac010) | +| Valk SL., Hettner MD. | Commentary on ‘Reproducible brain-wide association studies require thousands of individuals’ | [OHBM Aperture Neuro](10.52294/de841a29-d684-4707-9042-5bbd3d764f84) | +| Kong XZ., et al. | Scanning reproducible brain-wide associations: sample size is all you need? | [Psychoradiology](https://doi.org/10.1093/psyrad/kkac010) | | J. Goltermann, et al. | Cross-validation for the estimation of effect size generalizability in mass-univariate brain-wide association studies | [BioRxiv](https://doi.org/10.1101/2023.03.29.534696) | | Kang K., et al. | Study design features that improve effect sizes in cross-sectional and longitudinal brain-wide association studies | [BioRxiv](https://doi.org/10.1101/2023.05.29.542742) | | Makowski C., et al. | Reports of the death of brain-behavior associations have been greatly exaggerated |[BioRxiv]( https://doi.org/10.1101/2023.06.16.545340) |