PLOS Biology. 2025 Oct 28;23(10):e3003451. doi: 10.1371/journal.pbio.3003451

Brain-age models with lower age prediction accuracy have higher sensitivity for disease detection

Marc-Andre Schulz 1,2,3,4,*,#, Nys Tjade Siegel 1,2,3,4,#, Kerstin Ritter 1,2,3,4
Editor: Laura D Lewis
PMCID: PMC12633945  PMID: 41150707

Abstract

This study critically reevaluates the utility of brain-age models within the context of detecting neurological and psychiatric disorders, challenging the conventional emphasis on maximizing chronological age prediction accuracy. Our analysis of T1 MRI data from 46,381 UK Biobank participants reveals that simpler machine learning models, and notably those with excessive regularization, demonstrate superior sensitivity to disease-relevant changes compared to their more complex counterparts, despite being less accurate in chronological age prediction. This counterintuitive discovery suggests that models traditionally deemed inferior might, in fact, offer a more meaningful biomarker for brain health by capturing variations pertinent to disease states. These findings challenge the assumption that accuracy-optimized brain-age models serve as useful normative models of brain aging. Optimizing for age accuracy appears misaligned with normative aims: it drives models to rely on low-variance aging features and to deemphasize higher-variance signals that are more informative about brain health and disease. Consequently, we propose a recalibration of focus towards models that, while less accurate in conventional terms, yield brain-age gaps with larger patient-control effect sizes, offering greater utility in early disease detection and understanding the multifaceted nature of brain aging.


There is increasing interest in developing imaging-based models to characterize 'brain-age', but how accurate are these for predicting neurological disease? This study shows that simpler models, which have lower age prediction accuracy, are paradoxically more sensitive to disease-related brain changes, suggesting that the field should re-evaluate its modeling priorities.

Introduction

The advancement of neuroimaging techniques has paved the way for innovative measures of brain aging, providing insights into biological aging processes that extend beyond mere chronological metrics [1]. Among these, brain-age prediction models, underpinned by sophisticated machine learning algorithms, have emerged as promising tools for identifying deviations from normative aging trajectories [2]. Such deviations, manifesting as discrepancies between predicted and actual brain-age, termed the brain-age gap, may signal the early onset of neurological or psychiatric disorders [3].

The brain-age gap has been used to investigate various diseases of the central nervous system, including Alzheimer’s [4], schizophrenia [5], and major depressive disorder [6]. Evidence suggests that a wider brain-age gap correlates with more severe disease symptoms or faster progression, positioning brain-age as a potentially sensitive biomarker for these conditions [7].

The pursuit of ever-more precise brain-age models, from linear regression to convolutional neural networks to cutting-edge vision transformers, raises critical questions about the relationship between model accuracy and clinical utility. Initial observations suggest that less optimized models might paradoxically reveal more pronounced brain-age gaps, challenging the assumption that increased accuracy inherently enhances clinical value [8–10]. Furthermore, recent work by Vidal-Pineiro and colleagues [11] has demonstrated that cross-sectional brain-age estimates may primarily reflect stable individual differences rather than ongoing aging processes or pathological changes. This finding calls into question the interpretation of brain-age gaps as direct measures of accelerated aging or disease-related change.

In light of these emerging complexities and apparent contradictions in the field of brain-age modeling, there is a clear need for a comprehensive reevaluation of our approach to developing and applying these models in clinical contexts. This study aims to: (1) demonstrate that simpler and over-regularized models can be more effective biomarkers for detecting neurological and psychiatric disorders, despite lower age prediction accuracy; (2) challenge the conventional focus on maximizing age prediction accuracy in brain-age modeling; and (3) explore the mechanisms underlying the superior performance of simpler models in detecting disease-related brain changes. By “better” or “more effective”, we specifically mean generating brain-age gaps that are more sensitive to various conditions, as quantified by larger effect sizes in group comparisons between patients and matched controls.

Results

Our investigation centers on discerning under which conditions brain-age models generate clinically useful brain-age gaps. This study analyzed T1 MRI images from 46,381 UK Biobank participants [12]. We focused on a spectrum of psychiatric disorders—alcohol dependency, bipolar disorder, depression—and neurological disorders, such as Parkinson’s disease, epilepsy, and sleep disorders, alongside cognitive and environmental factors including fluid intelligence, recent severe stress exposure, and levels of social support. To ensure rigorous comparison, we applied propensity score matching to generate matched healthy cohorts [13], excluding individuals with any diagnosed neurological or psychiatric condition. Matching criteria included sex, age, education level, household income, the Townsend deprivation index, and genetic principal components, serving as an ethnicity proxy. Fig 1 illustrates our analysis workflow. Detailed information on targets and covariates is provided in the Methods section.

Fig 1. Illustrative overview of brain-age prediction model development and analysis.

Fig 1

(a) Development of brain-age prediction models using T1 brain images from a healthy control group. (b) Comparison of model complexity and accuracy, showing Resnet-50 CNN and SWIN vision transformer models surpassing a regularized linear model in age estimation precision. (c) Creation of matched patient and healthy control cohorts to examine brain-age disparities across various neurological and psychiatric conditions. (d) An inverse relationship was observed between the machine learning models’ prediction accuracy and the magnitude of brain-age gap effect sizes in comparing patient and control groups. (e and f) Exploration of the balance between model regularization strength, age prediction accuracy, and the ability to discriminate between patient and control groups, indicating that over-regularization, while reducing age prediction accuracy, enhances patient-control discrimination.

We trained three distinct models to explore these dynamics: a Ridge regularized linear regression model [14], a 3D ResNet-50 CNN [15,16], and a 3D SWIN transformer [16–18], each chosen for its unique approach to learning from imaging data. The Ridge regression model was selected for its ability to handle high-dimensional data with built-in regularization, making it well-suited for the imaging-derived phenotypes (IDPs; [19]). The ResNet-50 CNN was chosen for its proven effectiveness in image analysis tasks, particularly its ability to learn hierarchical features from raw image data. The SWIN transformer was included as a state-of-the-art model capable of capturing long-range dependencies in image data, which may be particularly relevant for whole-brain analysis. The linear model leveraged precomputed imaging-derived phenotypes, while the deep learning models utilized raw, minimally processed T1 images, aiming to capture the most granular features possible.

Increased model accuracy may diminish biomarker effectiveness

We examined how varying degrees of complexity and fine-tuning of brain-age models influence their ability to detect brain-age gaps associated with selected psychiatric and neurological conditions. We posited that simpler models, despite their potential limitations in capturing complex patterns, might exhibit heightened sensitivity to variations relevant to disease states. Conversely, more expressive models could potentially mask these signals by fitting to fine-grained but disease-invariant aging features.

Our findings corroborate and extend previous observations on the relationship between model complexity and biomarker effectiveness: the simplest model, based on Ridge regression (1,400 parameters), demonstrated the largest brain-age gap effect sizes. Effect sizes appeared to diminish for the more complex and more accurate SWIN transformer (10.1 million trainable parameters) and CNN (46.2 million trainable parameters) models (see Fig 2). To further investigate this phenomenon, we conducted additional analyses varying key parameters of the model training process. These analyses, presented in Fig 3, explore how effect sizes change under varying training set sizes, different random seeds for model initialization, and increasing numbers of training epochs. The trend of inverse relationships between age prediction accuracy and disease detection sensitivity appears to hold across varying conditions. In sum, we find that conventional methods of improving accuracy (e.g., more expressive models, increasing training data, or training duration) can actually decrease the model’s sensitivity to disease-related brain changes.

Fig 2. Visualization of model complexity vs. sensitivity in detecting brain-age gaps.

Fig 2

This figure illustrates the nuanced relationship between the complexity of machine learning models and their ability to detect brain-age gaps associated with psychiatric and neurological conditions. It highlights a key observation: models with lower complexity, such as Ridge regression, exhibited a heightened sensitivity in identifying deviations that are clinically significant, compared to their more sophisticated counterparts like CNNs and SWIN Transformers. This unexpected finding challenges conventional expectations, suggesting that simplicity in model architecture may enhance biomarker efficacy by preserving sensitivity to critical, disease-relevant variations. Error bars indicate standard deviation (SD) over multiple model training runs (with the exception of the deterministic Ridge model). The data underlying this figure can be found in S1 Data.

Fig 3. More accurate models can yield worse biomarker effect size.

Fig 3

Conventional ways of improving model accuracy, such as increasing the amount of training data (left), choosing the best-performing model out of multiple randomly initialized training runs (middle), and increasing training duration (right), negatively impacted brain-age gap effect size for depression for both CNN and SWIN models. We focus on depression as it represents our largest clinical sample, offering the most robust statistical foundation. Similar patterns were observed for a majority of other conditions (see Fig F in S1 Text). The shaded area represents the confidence interval for the regression estimate. The data underlying this figure can be found in S2 Data.

This counterintuitive result underscores a crucial insight: simpler models may possess a greater ability to function as effective biomarkers, hinting that the expressive capacity of complex models exacerbates the impact of an optimization objective that is not fully aligned with normative modeling.

Over-regularization of ML models increases biomarker sensitivity

We investigated how increasing regularization in Ridge regression models affects their ability to generate clinically useful brain-age gaps, hypothesizing that excessive regularization might decrease age prediction accuracy but enhance the model’s utility as a biomarker by improving its sensitivity to disease-relevant anomalies.

We systematically adjusted the L2 regularization strength in a sequence of Ridge regression models tasked with brain-age prediction, intentionally exceeding the regularization level that typically optimizes age prediction accuracy (Fig 4). The findings confirmed that while over-regularization reduced the models’ age estimation precision, it notably increased their sensitivity to the brain-age gap associated with selected psychiatric and neurological conditions.

Fig 4. Impact of L2 regularization on brain-age gap detection sensitivity.

Fig 4

This figure depicts how adjustments in L2 regularization levels within Ridge regression models influence their ability to identify brain-age gaps associated with selected psychiatric and neurological conditions (effect size, left y-axis, blue) and their age prediction accuracy (mean absolute error, right y-axis, red). The x-axis represents increasing regularization strength (α). The vertical dashed lines demarcate three regions: optimal age prediction (dashed red), transition (middle), and optimal disease detection (dashed blue). In the transition region, we observe an inverse relationship between prediction accuracy and disease detection sensitivity. Horizontal dashed lines indicate effect sizes at α values maximizing prediction accuracy (red) and effect size (blue), i.e., the potential sensitivity gain when prioritizing effect size over prediction accuracy. Patterns vary across conditions, reflecting differences in associated brain changes. Shaded regions indicate standard error derived from bootstrap resampling. The data underlying this figure can be found in S3 Data.

Our analysis revealed three distinct regions of regularization strength: optimal age prediction, transition, and optimal disease detection. In the transition region, we observed an inverse relationship between prediction accuracy and disease detection sensitivity across all conditions. As regularization increased, age prediction error rose while the effect size for detecting condition-related brain-age gaps improved. The regularization strength that optimized age prediction accuracy (typically at lower α values) differed from the one that maximized discriminative ability (often at higher α values). This difference represents a potential gain in biomarker sensitivity when prioritizing effect size over prediction accuracy.
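The regularization sweep underlying this analysis can be sketched in scikit-learn terms. The following is an illustrative simplification, not the study's code: function and variable names are ours, and the bias correction and bootstrap procedure described in the Methods are omitted for brevity.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

def cohens_d(a, b):
    """Cohen's d between two samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1))
                     / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / pooled

def sweep_regularization(X_train, y_train, X_pat, y_pat, X_ctl, y_ctl, alphas):
    """Fit one Ridge model per regularization strength; record age-prediction
    MAE on healthy controls and the patient-vs-control effect size of the
    (uncorrected) brain-age gap."""
    rows = []
    for alpha in alphas:
        model = Ridge(alpha=alpha).fit(X_train, y_train)
        gap_pat = model.predict(X_pat) - y_pat   # brain-age gap, patients
        gap_ctl = model.predict(X_ctl) - y_ctl   # brain-age gap, controls
        rows.append((alpha,
                     mean_absolute_error(y_ctl, model.predict(X_ctl)),
                     cohens_d(gap_pat, gap_ctl)))
    return rows
```

Plotting MAE and effect size against α on a logarithmic axis would display the two quantities compared in Fig 4; in the over-regularized regime, MAE rises while the patient-control effect size can continue to improve.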

Interestingly, the effect size often peaks at an intermediate level of regularization, rather than at the maximum level. This pattern suggests an optimal balance between model simplicity and feature retention. This balance point varies across conditions, potentially reflecting differences in the nature and extent of associated brain changes.

The results suggest that increasing the level of regularization, effectively simplifying the model’s representation of aging patterns, enhances its ability to identify variations pertinent to clinical conditions, distinct from conventional aging. This loss in precision for age prediction, counterintuitively, did not diminish the model’s value; rather, it bolstered its performance as a biomarker. Such findings support the notion that model complexity negatively affects how well the brain-age gap encompasses deviations from a normative aging process, affirming the potential of strategically simplified models to serve as more effective tools in disease detection and characterization.

Over-regularized models focus on global gray matter volume

To gain insight into the mechanisms underlying the improved biomarker performance of over-regularized models, we analyzed the feature importances of Ridge regression models with varying levels of regularization. We employed SHapley Additive exPlanations (SHAP) values to quantify the contribution of each feature to the model’s predictions [20].
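For a linear model such as Ridge regression, SHAP values under an assumption of feature independence have a simple closed form, so the feature-importance analysis can be illustrated without the full SHAP machinery. The sketch below uses names of our choosing and is not the study's implementation.

```python
import numpy as np

def linear_shap(coef, X, X_background):
    """Exact SHAP values for a linear model under feature independence:
    phi[i, j] = coef[j] * (X[i, j] - mean of feature j in the background data)."""
    return coef * (X - X_background.mean(axis=0))

def global_importance(phi):
    """Global feature importance: mean absolute SHAP value per feature."""
    return np.abs(phi).mean(axis=0)
```

For a linear model, each participant's SHAP values sum to that participant's prediction minus the background mean prediction; ranking `global_importance` across models fitted at different regularization strengths yields a feature-importance comparison of the kind shown in Fig 5.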

Our analysis revealed a clear trend: as regularization increased, the models increasingly relied on global measures of brain structure, particularly total gray matter volume. This shift in focus from localized features to global indicators aligns with our understanding of many neurological and psychiatric conditions, which often manifest as widespread alterations in brain structure rather than changes confined to specific regions [21–23].

When comparing the top features for both accuracy-optimized and disease-detection-optimized models, we observed distinct patterns of feature importance. The accuracy-optimized model prioritizes features such as specific regional volumes and contrasts, which are strong predictors of age [24]. In contrast, the disease-detection-optimized model focuses more on global measures like total gray matter volume and ventricular size, which may be more sensitive to a range of pathological changes. These differences are visually represented in Fig 5.

Fig 5. Feature importances for brain-age prediction using Ridge regression.

Fig 5

The heat maps illustrate the top-10 features for both the accuracy-optimized model (top panel) and the biomarker-optimized model (bottom panel), with color intensity representing the relative SHAP values. The x-axis denotes increasing regularization strength (α). The accuracy-optimized model (α ≈ 10³), trained for maximum age prediction accuracy, highlights features such as the volume of the pons, gray-white contrast in the inferior parietal region, and volume of the cerebrospinal fluid (CSF). Conversely, the biomarker-optimized model (α ≈ 10⁵), trained to enhance the brain-age gap effect size for a majority of conditions (cf. Methods and Fig E in S1 Text) vs. controls, emphasizes features like the volume of gray matter, volume of peripheral cortical gray matter, and mean intensity of the third ventricle. The dashed vertical lines indicate the regularization strength (α) at which each model was optimized. This comparison underscores the distinct feature importances between models focused on accuracy vs. those optimized for sensitivity to disease-relevant changes, supporting the manuscript’s thesis that traditional accuracy-optimized models may not provide the best biomarkers for disease detection. The data underlying this figure can be found in S4 Data.

This finding offers a plausible explanation for the superior performance of over-regularized models as biomarkers. By focusing on global brain characteristics, these models may be more sensitive to the diffuse structural changes associated with various brain disorders. Conversely, more complex models that can capture intricate, localized patterns may paradoxically be less effective at detecting these broad, disease-related alterations.

Our interpretability analysis thus provides mechanistic insights into why simpler, over-regularized models can outperform more complex ones in disease detection. By prioritizing global brain measures, these models appear to capture broader indicators of brain health that are particularly relevant to a range of neurological and psychiatric conditions. This underscores the importance of aligning model complexity with the intended application, suggesting that in the context of disease detection, strategically simplified models may offer advantages over those optimized solely for age prediction accuracy.

Discussion

Our study reassesses the efficacy of brain-age models in detecting neurological and psychiatric disorders, revealing that simpler or more regularized models often demonstrate enhanced sensitivity to disease-related variations compared to more complex counterparts. This finding challenges the conventional assumption that higher age prediction accuracy necessarily leads to better biomarkers, suggesting a need to reevaluate criteria for assessing brain-age models in clinical and research settings.

The inverse relationship between age prediction accuracy and disease detection sensitivity persisted across various experimental conditions, including different training set sizes, random initializations, and training durations. Our results align with and extend previous work [8,10], providing systematic evidence across a broader range of models and conditions.

These dynamics help reconcile prior mixed findings about what brain-age reflects [7,11,25–27]: simpler or more constrained models produce larger patient-control effect sizes, indicating that the brain-age gap integrates both age-normative and pathology-sensitive variation, with the balance governed by training choices.

This work also resonates with recent studies questioning the universal superiority of deep learning over simpler linear models in brain imaging analyses [28–31]. Our results provide further empirical support for the potential advantages of simpler or deliberately constrained models in certain neuroimaging tasks.

Further, our findings align with a methodological shift in the parallel field of epigenetic clocks, where clinical utility has been enhanced by moving beyond the singular goal of predicting chronological age. An analogue to our “brain-age paradox” was demonstrated by Zhang and colleagues [32], who found that as an epigenetic clock’s age prediction accuracy improved, its association with mortality risk attenuated. These convergent findings are reinforced by evidence showing that “next-generation” epigenetic clocks like DunedinPACE and GrimAge, which are less correlated with age, consistently outperform their predecessors in predicting clinical outcomes, including dementia [33], adverse brain structure [34], and frailty and mortality in vulnerable populations [35]. Furthermore, these advanced clocks show greater sensitivity to socioeconomic and lifestyle risk factors even in younger adults [36]. Our work suggests that simpler, over-regularized brain-age models function analogously, de-emphasizing pure age prediction to better capture the clinically relevant signals of pathology.

The preference for simpler or more strongly regularized models as biomarkers can be explained by how the optimization objective interacts with population variance. In probabilistic normative modeling, one estimates the distribution of selected features conditional on covariates (e.g., age, sex, and site) and then quantifies each individual’s deviation from that distribution. This paradigm relies on leveraging cross-sectional variation in features that are sensitive to brain health and disease. Brain-age models, by contrast, compress many features into a single scalar trained to minimize chronological age error. To reduce average prediction error, they place more weight on features with high signal-to-noise for age and low residual variability across individuals. As a result, features with greater cross-sectional variability, often those most sensitive to pathology, are down-weighted or ignored. This can yield excellent age prediction accuracy while producing brain-age gaps that are less informative for detecting disease-related deviations. In short, optimization for age accuracy can attenuate precisely the variance that normative modeling aims to characterize.

This misalignment can be amplified by cohort contamination. If the “healthy” training set includes undiagnosed cases or uses broad inclusion criteria, disease-linked variation appears as noise with respect to age. Because the training procedure seeks features that predict age consistently across the sample, the model will preferentially rely on disease-invariant aging signals over features whose age trajectory is perturbed by illness, further reducing sensitivity to pathology at test time. Complete elimination of such contamination is challenging in large population cohorts, as many disorders manifest as tail deviations on otherwise continuous neurobiological dimensions.

Furthermore, regularization encourages a focus on robust features, potentially corresponding to larger-scale structural changes more likely associated with disease processes. This suggests an optimal balance point in the bias-variance trade-off that differs for disease detection versus age prediction, where moderate regularization constraints maintain sensitivity to disease-relevant patterns while avoiding fitting to fine-grained age-specific features. Simpler models might exhibit greater robustness to the heterogeneity of brain changes in disease, capturing general indicators of abnormality across diverse presentations. Collectively, these attributes suggest that the constraints imposed by simpler or more regularized models may inadvertently align with the nature of the biological signals most relevant to disease detection, explaining their enhanced performance as biomarkers despite lower age prediction accuracy.

Our findings provide empirical support for these theoretical advantages. Feature importance analysis (Fig 5) reveals that as regularization increases in the Ridge regression model, it tends to rely more heavily on global measures of brain structure, particularly total gray matter volume. This shift towards global features aligns well with the widespread alterations often seen in neurological and psychiatric conditions [21–23]. While our feature importance analysis demonstrates this for linear models, the exact mechanisms in nonlinear models remain to be explored. More complex models might learn to prioritize fine-grained, localized patterns that excel in age prediction but are less sensitive to diffuse pathological changes, though directly testing this hypothesis requires further research.

It is important to note that our findings, while reminiscent of the “loose fitting” hypothesis proposed by Bashyam and colleagues [8], differ in crucial ways that address the concerns raised by Hahn [9]. Unlike “loose fitting”, or underfitting, which potentially violates normative modeling principles, our approach focuses on model simplicity and regularization rather than intentional undertraining. Undertraining deep networks can yield partially random feature spaces that may coincidentally capture disease-relevant patterns, similar to how randomly initialized CNNs can sometimes provide useful embeddings. In contrast, our regularized models are fully trained but deliberately constrained in their capacity to combine age-relevant features. This means they still learn age-specific patterns, but are limited in how many features they can utilize or how complexly they can combine them. The systematic relationship we observe between regularization strength and biomarker sensitivity suggests that this controlled constraint of age-related feature combinations, rather than random or incomplete feature learning, drives the improved disease detection.

Our study used both minimally processed T1 images and imaging-derived phenotypes. While raw images retain more information, they require significant computational resources and engineering expertise. IDPs offer reduced computational demands, easier interpretability, and lower expertise requirements, but may lose some fine-grained information. Notably, our IDP-based model outperformed those using minimally processed images in detecting disease-related changes, suggesting that pre-extracted features may be sufficient and even advantageous for identifying broad, disease-related patterns.

This highlights a key trade-off in brain-age modeling between information retention and feature robustness. Our results indicate that for developing sensitive biomarkers, IDPs with simpler, more regularized models may offer a good balance of performance, interpretability, and accessibility.

However, this study is not without its limitations. Our analysis was confined to a limited set of target phenotypes and utilized a limited range of machine learning models. Future research should aim to replicate these findings across a more diverse array of diseases and model architectures to validate the generalizability of our conclusions. Additionally, the mechanisms by which simpler or more regularized models achieve greater sensitivity to pathological signals warrant further investigation. Disentangling the features and patterns these models prioritize could offer valuable insights into the biological underpinnings of the brain-age gap phenomenon.

Furthermore, the time elapsed between diagnosis and brain imaging varies across conditions and individuals, potentially affecting the observed effect sizes differently for progressive versus stable conditions. While fluid intelligence showed notably different patterns compared to clinical conditions, the underlying mechanisms for these differences remain to be explored. Future work should investigate how the temporal dynamics of different conditions interact with brain-age predictions.

Future work could also explore whether systematic regularization of nonlinear deep learning models yields similar or potentially even stronger biomarker properties, though this presents significant computational challenges given the high dimensionality of these models’ parameter spaces.

In conclusion, our work contributes to a growing body of evidence that questions the exclusive focus on chronological age prediction accuracy in brain-age modeling. By highlighting the potential of simpler and more regularized models to serve as effective biomarkers for neurological and psychiatric disorders, we advocate for a shift towards models that prioritize clinical relevance and interpretability. This paradigm shift could pave the way for more effective diagnostic tools and intervention strategies, ultimately enhancing patient care in the realm of neurology and psychiatry.

Methods

Dataset and preprocessing

The study utilized T1-weighted magnetic resonance imaging data from 46,381 participants enrolled in the UK Biobank project [12]. This large-scale biomedical database and research resource contains in-depth genetic and health information from half a million UK participants, with a subset undergoing extensive brain imaging. Our analysis used T1 imaging-derived-phenotypes (IDPs; regional gray and white matter volumes, cortical thickness, and surface area) provided by the UK Biobank [19], specifically 1,425 descriptors defined in UK Biobank category 110. Psychiatric disorders (alcohol dependency, bipolar disorder, and depression) and neurology-related disorders (Parkinson’s disease, epilepsy, and sleep disorders), as well as cognitive and environmental factors (fluid intelligence, exposure to severe stress in the last 2 years, and level of social support), served as prediction targets. Disorders refer to ICD-10 codes F10, F31, F32, G20, G40, and G47, with the time of first diagnosis preceding the date of imaging. Fluid intelligence refers to UKB-field 20016, binarized into top and bottom quartile. Exposure to severe stress refers to UKB-field 6145, binarized into having or not having an adverse life event in the last 2 years. Level of social support, specifically “able to confide”, field UKB-2110, was binarized into the majority category almost daily versus anything less frequent. For model training, we excluded all participants who had received any neurological or behavioral (ICD-10 category F or G) diagnosis at any point during follow-up, including diagnoses made after imaging. For the analysis of disease effects, we considered participants “healthy controls” if they had no relevant diagnosis at the time of MRI measurement. For nondisease targets (fluid intelligence, stress, and social support), we relaxed this strict comorbidity exclusion to maximize available sample sizes. 
All features were standard scaled; participants with missing data were excluded on an analysis-by-analysis basis. For details on methodological choices, please refer to Note 1 in S1 Text.

Matching

To investigate the association between brain-age gaps and various psychiatric and neurological conditions, we employed propensity score matching [13], with a caliper of 0.25 standard deviations. This technique allowed us to create balanced cohorts of participants with and without specific diagnoses, thereby reducing confounding factors. Matching criteria included sex (field 31), age (field 21003), years of education (field 6138, translated into years of education via the 1997 International Standard Classification of Education), household income (field 738), Townsend deprivation index (field 22189), and the first three genetic principal components (field 22009) as proxies for ethnicity. The resulting sample sizes after matching were: alcohol dependency (n = 495), bipolar disorder (n = 126), depression (n = 1,146), Parkinson’s disease (n = 95), epilepsy (n = 327), sleep disorders (n = 1,007), fluid intelligence (n = 3,285), severe stress (n = 6,528), and social support (n = 4,250). Healthy controls were drawn from the pool of participants who were not used for training in the respective experiments.
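As an illustration, a greedy 1:1 nearest-neighbour matching scheme on the propensity score can be sketched as follows. This is a simplified sketch with names of our choosing, not the study's implementation: it applies the 0.25 SD caliper on the logit of the propensity score (one common convention; the scale used in the study is not specified here) and uses scikit-learn's logistic regression, which is L2-regularized by default, as the propensity model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def propensity_match(X_cov, is_case, caliper_sd=0.25, seed=0):
    """Greedy 1:1 nearest-neighbour matching on the logit propensity score,
    rejecting pairs farther apart than caliper_sd * SD of the logit score.
    X_cov holds the matching covariates (sex, age, education, income, ...);
    is_case is a boolean array marking diagnosed participants."""
    ps = LogisticRegression(max_iter=1000).fit(X_cov, is_case).predict_proba(X_cov)[:, 1]
    logit = np.log(ps) - np.log1p(-ps)
    caliper = caliper_sd * logit.std()
    controls = list(np.flatnonzero(~is_case))   # unmatched control pool
    pairs = []
    rng = np.random.default_rng(seed)
    for i in rng.permutation(np.flatnonzero(is_case)):
        if not controls:
            break
        d = np.abs(logit[controls] - logit[i])
        k = int(np.argmin(d))
        if d[k] <= caliper:                      # accept only within caliper
            pairs.append((int(i), int(controls.pop(k))))
    return pairs
```

Each control is used at most once, so the returned pairs form balanced patient-control cohorts; cases with no control inside the caliper remain unmatched, which is why matched sample sizes fall below the raw case counts.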

Brain-age gap calculation

Brain-age gaps were calculated by training a machine learning model to predict participants’ age from the T1 image in a healthy cohort (see sections below), then using this model to predict held-out participants’ ages. The difference between predicted and true age was corrected for sample-level linear bias [37]: the raw gaps were regressed on chronological age across all test data, with subsequent analyses performed on the residuals, referred to as the participant’s (corrected) “brain-age gap”. Throughout this study, we distinguish between the brain-age gap itself (the difference between predicted and chronological age) and the ability of these gaps to discriminate between patient and control groups (quantified by effect sizes).
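The bias correction amounts to regressing the raw gap on chronological age and keeping the residuals; a minimal sketch (function and variable names are ours):

```python
import numpy as np

def corrected_gap(age_true, age_pred):
    """Sample-level linear bias correction of the brain-age gap [37]:
    regress (predicted - true) age on true age over the full test set
    and return the residuals."""
    age_true = np.asarray(age_true, dtype=float)
    gap = np.asarray(age_pred, dtype=float) - age_true
    A = np.column_stack([age_true, np.ones(len(age_true))])  # slope + intercept
    coef, *_ = np.linalg.lstsq(A, gap, rcond=None)
    return gap - A @ coef
```

By construction, the corrected gaps have zero mean and zero linear correlation with chronological age in the test sample, so downstream group comparisons are not driven by the well-known regression-to-the-mean bias of age predictors.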

Effect size calculation

As a proxy for the practical usefulness of a brain-age model we calculated effect sizes for our binary target variables (disease status, binarized fluid intelligence, etc.; see above). Specifically, we created a matched control group for each variable via propensity score matching (see above), calculated the brain-age gaps for participants in both groups, then used these brain-age gaps to calculate the effect size (Cohen’s d) of the given group comparison.
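Cohen's d for the patient-control comparison can be computed with the pooled standard deviation as below; the exact pooling convention is our assumption, as the text does not specify it.

```python
import numpy as np

def cohens_d(gaps_patients, gaps_controls):
    """Cohen's d of the group difference in brain-age gaps,
    using the pooled (n-weighted) standard deviation."""
    a = np.asarray(gaps_patients, dtype=float)
    b = np.asarray(gaps_controls, dtype=float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)
```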

Experiment 1: Model training and architecture specification

Three distinct machine learning models were trained for brain-age prediction: a regularized linear regression model [38], a 3D ResNet-50 convolutional neural network [15,16,39,40], and a 3D SWIN transformer [16–18]. The linear model utilized precomputed IDPs, while the deep learning models were trained on minimally processed (skull-stripped and linearly registered; mask and transformation matrix provided by the UK Biobank) T1 images. Data were split into healthy participants (no neurological or behavioral, i.e., ICD-10 category F or G, diagnoses) and patients (any ICD-10 F or G diagnosis). Patients (n = 17,671) and 1,000 random healthy participants were kept as a held-out test set; patients with diagnoses not investigated in this work were discarded, and the remaining healthy participants constituted the train set (n = 27,538). Brain-age prediction accuracy was computed on the healthy test set participants only.

Architecture details and training hyperparameters of the CNN and the transformer are described in detail in [16]. Both deep architectures were trained using AdamW, optimizing the mean squared error loss, with a one-cycle learning rate policy and a maximal learning rate of 10⁻² for the ResNet and 10⁻⁴ for the SWIN transformer. Both were trained for 150,000 gradient update steps with an effective batch size of 8. Each architecture was trained 6 times with different random initializations. We generally report end-of-training performance and, unless otherwise specified, the average performance over retrained instances.

The linear model relied on the scikit-learn [41] RidgeCV implementation (α from 10⁻⁵ to 10⁵, 100 values, log-spaced).
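The stated grid corresponds to the following scikit-learn call; wrapping the estimator in a pipeline with the standard scaling mentioned above is our assumption about how the pieces fit together, not a detail given in the text.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 100 log-spaced ridge penalties from 1e-5 to 1e5, as stated in the text.
alphas = np.logspace(-5, 5, 100)

# Standard-scale the IDP features, then select alpha by internal cross-validation.
model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
# model.fit(idp_features, age)  # idp_features, age: hypothetical training arrays
```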

Experiment 2: Over-regularization and biomarker sensitivity

We systematically varied the regularization strength in the Ridge regression brain-age model to study its impact on age prediction accuracy and on the detection of brain-age gaps. This experiment aimed to understand how over-regularization, i.e., intentionally reducing model complexity, might influence the model’s utility as a biomarker for disease. The train set size was reduced to 5,000 to ensure that both over-regularization and under-regularization effects could be visualized and easily distinguished. The train set was randomly sampled 10 times, so that values in Fig 3 refer to the mean over brain-age models trained on different sets of participants, with error bars referring to the standard error (SE) over these repetitions.
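A stripped-down version of this sweep looks as follows. The data here are synthetic stand-ins (the paper used 5,000 UK Biobank participants, IDP features, and 10 resamplings); only the pattern of sweeping the penalty and recording accuracy is taken from the text.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for the brain-age regression problem.
rng = np.random.default_rng(0)
n_train, n_test, p = 500, 200, 50
beta = rng.normal(size=p)
X_tr, X_te = rng.normal(size=(n_train, p)), rng.normal(size=(n_test, p))
y_tr = X_tr @ beta + rng.normal(scale=5.0, size=n_train)
y_te = X_te @ beta + rng.normal(scale=5.0, size=n_test)

# Sweep the regularization strength and record test-set age prediction error.
maes = {}
for alpha in np.logspace(-5, 5, 11):
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    maes[alpha] = mean_absolute_error(y_te, model.predict(X_te))
# Strongly over-regularized models shrink toward the mean and lose accuracy;
# in the paper, their corresponding brain-age gaps are then compared via
# patient-control effect sizes rather than by accuracy alone.
```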

Experiment 3: Interpretability analysis

To further elucidate why simpler, over-regularized models are more effective in detecting disease-relevant changes, we conducted an interpretability experiment comparing the feature importances of Ridge models with increasing regularization strength. Models were trained in the same manner as in Experiment 2. Feature importances were derived using SHAP [20], averaged over the 10 random train set resamplings. In Fig 4, we present the top-10 important features for the accuracy-maximizing regularization strength and for the regularization strength that maximizes effect sizes for a majority of conditions (alcohol dependence, depression, epilepsy, sleep disorders, and stress). Top-10 important features for higher levels of regularization are presented in Fig E in S1 Text.
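For a linear model with (approximately) independent features, SHAP values have the closed form φⱼ(x) = wⱼ(xⱼ − E[xⱼ]) [20], and a global importance score is the mean absolute SHAP value per feature. The sketch below implements this closed form directly; whether the authors used this or the sampling-based estimator is not stated, and the function name is ours.

```python
import numpy as np

def linear_shap_importance(coef, X):
    """Per-feature global importance for a linear model: mean |SHAP| over
    samples, using the independent-features closed form
    phi_j(x) = coef[j] * (x[j] - mean(X[:, j]))."""
    phi = np.asarray(coef) * (X - X.mean(axis=0))   # (n, p) SHAP values
    return np.abs(phi).mean(axis=0)
```

A useful sanity check is SHAP's additivity property: for each sample, the feature attributions sum to the prediction minus the mean prediction.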

Statistical analysis details

Estimation of statistical uncertainties was performed via repeated random sub-sampling validation; reported uncertainties refer to variability over models trained on different subsets of the data. Mean Absolute Error served as our primary metric for assessing age prediction accuracy, while Cohen’s d was used to quantify the effect sizes in a manner that is both standardized and interpretable. Analyses were conducted using Python 3.10.8 with the following key library versions: Scikit-Learn 1.3.0, PyTorch 1.12.1, numpy 1.25.2, and scipy 1.11.1. Deep learning models were trained using NVIDIA A100 GPUs with CUDA 12.2.
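The uncertainty estimation described above can be sketched as repeated random sub-sampling with the standard error taken over repetitions. The data and model below are illustrative stand-ins; only the resampling-and-SE pattern reflects the text.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 30))
y = X @ rng.normal(size=30) + rng.normal(scale=3.0, size=1000)

maes = []
for _ in range(10):                      # 10 random train/test splits
    idx = rng.permutation(len(y))
    tr, te = idx[:600], idx[600:]
    model = Ridge(alpha=1.0).fit(X[tr], y[tr])
    maes.append(mean_absolute_error(y[te], model.predict(X[te])))

mae_mean = float(np.mean(maes))
# SE over models trained on different subsets of the data.
mae_se = float(np.std(maes, ddof=1) / np.sqrt(len(maes)))
```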

Ethical considerations

This study was conducted under the umbrella of the UK Biobank’s ethics agreements. The UK Biobank has received ethical approval from the National Health Service National Research Ethics Service (16/NW/0274) to collect and distribute data to approved researchers in compliance with the Declaration of Helsinki. As our study utilized de-identified data provided by the UK Biobank, individual consent from participants was covered under the UK Biobank’s broad consent model.

Supporting information

S1 Text. Supplementary Online Material. (PDF, pbio.3003451.s001.pdf, 688.6 KB)

S1 Data. Data underlying Fig 2. (CSV, pbio.3003451.s002.csv, 5.2 KB)

S2 Data. Data underlying Fig 3. (CSV, pbio.3003451.s003.csv, 4.7 KB)

S3 Data. Data underlying Fig 4. (CSV, pbio.3003451.s004.csv, 158.4 KB)

S4 Data. Data underlying Fig 5. (CSV, pbio.3003451.s005.csv, 5.8 KB)

Acknowledgments

We thank Roshan Rane, Habakuk Hain, Ruben Brandhofer, and Marcel Jühling for feedback on the manuscript. We thank the UKBB participants for their voluntary commitment and the UKBB team for their work in collecting, processing, and disseminating these data for analysis. Research was conducted using the UKBB resource under project-ID 33073. The authors acknowledge the Scientific Computing of the IT Division at the Charité – Universitätsmedizin Berlin for providing computational resources that have contributed to the research results reported in this paper, as well as support from Gemeinnützige Hertie Stiftung.

Abbreviations

CSF: cerebrospinal fluid

IDPs: imaging-derived phenotypes

SE: standard error

SHAP: SHapley Additive exPlanations

STD: standard deviation

Data Availability

The dataset analyzed during the current study is sourced from the UK Biobank, a large-scale biomedical database and research resource containing in-depth genetic and health information from half a million UK participants. Due to the UK Biobank’s data sharing policy, the data used in this study are available upon application to the UK Biobank by qualified researchers for approved purposes (https://www.ukbiobank.ac.uk/use-our-data/apply-for-access/). The data underlying our figures is available in files figN_data.csv. The code developed for this study, including data preprocessing, model training, and evaluation is available at https://doi.org/10.5281/zenodo.17244945.

Funding Statement

The project was funded by Deutsche Forschungsgemeinschaft (DFG; project-ID 414984028, 389563835, 402170461, 459422098, and 442075332 to K.R.), the Brain & Behavior Research Foundation (NARSAD young investigator grant to K.R.), the Manfred and Ursula-Müller Stiftung (K.R.), and a DMSG research award (K.R.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Cole JH, Franke K. Predicting age using neuroimaging: innovative brain ageing biomarkers. Trends Neurosci. 2017;40(12):681–90. doi: 10.1016/j.tins.2017.10.001 [DOI] [PubMed] [Google Scholar]
  • 2.Baecker L, Garcia-Dias R, Vieira S, Scarpazza C, Mechelli A. Machine learning for brain age prediction: Introduction to methods and clinical applications. EBioMedicine. 2021;72:103600. doi: 10.1016/j.ebiom.2021.103600 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Chen C-L, Hwang T-J, Tung Y-H, Yang L-Y, Hsu Y-C, Liu C-M, et al. Detection of advanced brain aging in schizophrenia and its structural underpinning by using normative brain age metrics. Neuroimage Clin. 2022;34:103003. doi: 10.1016/j.nicl.2022.103003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Gaser C, Franke K, Klöppel S, Koutsouleris N, Sauer H, Alzheimer’s Disease Neuroimaging Initiative. BrainAGE in mild cognitive impaired patients: predicting the conversion to Alzheimer’s disease. PloS One. 2013;8(6):e67346. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Koutsouleris N, Davatzikos C, Borgwardt S, Gaser C, Bottlender R, Frodl T, et al. Accelerated brain aging in schizophrenia and beyond: a neuroanatomical marker of psychiatric disorders. Schizophr Bull. 2014;40(5):1140–53. doi: 10.1093/schbul/sbt142 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Han LKM, Dinga R, Hahn T, Ching CRK, Eyler LT, Aftanas L, et al. Brain aging in major depressive disorder: results from the ENIGMA major depressive disorder working group. Mol Psychiatry. 2021;26(9):5124–39. doi: 10.1038/s41380-020-0754-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Cole JH, Marioni RE, Harris SE, Deary IJ. Brain age and other bodily “ages”: implications for neuropsychiatry. Mol Psychiatry. 2019;24(2):266–81. doi: 10.1038/s41380-018-0098-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Bashyam VM, Erus G, Doshi J, Habes M, Nasrallah IM, Truelove-Hill M, et al. MRI signatures of brain age and disease over the lifespan based on a deep brain network and 14 468 individuals worldwide. Brain. 2020;143(7):2312–24. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Hahn T, Fisch L, Ernsting J, Winter NR, Leenings R, Sarink K, et al. From ‘loose fitting’ to high-performance, uncertainty-aware brain-age modelling. Brain. 2021;144(3):e31. [DOI] [PubMed] [Google Scholar]
  • 10.Jirsaraie RJ, Gorelik AJ, Gatavins MM, Engemann DA, Bogdan R, Barch DM, et al. A systematic review of multimodal brain age studies: uncovering a divergence between model accuracy and utility. Patterns. 2023;4(4). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Vidal-Pineiro D, Wang Y, Krogsrud SK, Amlien IK, Baaré WF, Bartres-Faz D, et al. Individual variations in ‘brain age’ relate to early-life factors more than to longitudinal brain change. eLife. 2021;10:e69995. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015;12(3):e1001779. doi: 10.1371/journal.pmed.1001779 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Caliendo M, Kopeinig S. Some practical guidance for the implementation of propensity score matching. J Econ Surv. 2008;22(1):31–72. doi: 10.1111/j.1467-6419.2007.00527.x [DOI] [Google Scholar]
  • 14.Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: data mining, inference, and prediction. New York, NY: Springer; 2009. [Google Scholar]
  • 15.Fisch L, Ernsting J, Winter NR, Holstein V, Leenings R, Beisemann M, et al. Predicting brain-age from raw T1-weighted magnetic resonance imaging data using 3D convolutional neural networks. arXiv. 2021. http://arxiv.org/abs/2103.11695 [Google Scholar]
  • 16.Siegel NT, Kainmueller D, Deniz F, Ritter K, Schulz M-A. Do transformers and CNNs learn different concepts of brain age? Hum Brain Mapp. 2025;46(8):e70243. doi: 10.1002/hbm.70243 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Hatamizadeh A, Nath V, Tang Y, Yang D, Roth HR, Xu D. Swin UNETR: swin transformers for semantic segmentation of brain tumors in MRI images. In: Brainlesion: glioma, multiple sclerosis, stroke and traumatic brain injuries. 2022;272–84. [Google Scholar]
  • 18.Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin transformer: hierarchical vision transformer using shifted windows. arXiv. 2021. http://arxiv.org/abs/2103.14030 [Google Scholar]
  • 19.Alfaro-Almagro F, Jenkinson M, Bangerter NK, Andersson JLR, Griffanti L, Douaud G, et al. Image processing and quality control for the first 10,000 brain imaging datasets from UK Biobank. Neuroimage. 2018;166:400–24. doi: 10.1016/j.neuroimage.2017.10.034 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Advances in neural information processing systems. 2017. p. 30. [Google Scholar]
  • 21.Pini L, Pievani M, Bocchetta M, Altomare D, Bosco P, Cavedo E, et al. Brain atrophy in Alzheimer’s disease and aging. Ageing Res Rev. 2016;30:25–48. doi: 10.1016/j.arr.2016.01.002 [DOI] [PubMed] [Google Scholar]
  • 22.Schmaal L, Hibar DP, Sämann PG, Hall GB, Baune BT, Jahanshad N, et al. Cortical abnormalities in adults and adolescents with major depression based on brain scans from 20 cohorts worldwide in the ENIGMA Major Depressive Disorder Working Group. Mol Psychiatry. 2017;22(6):900–9. doi: 10.1038/mp.2016.60 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.van Erp TGM, Walton E, Hibar DP, Schmaal L, Jiang W, Glahn DC, et al. Cortical brain abnormalities in 4474 individuals with schizophrenia and 5098 control subjects via the enhancing neuro imaging genetics through meta analysis (enigma) consortium. Biol Psychiatry. 2018;84(9):644–54. doi: 10.1016/j.biopsych.2018.04.023 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Raz N, Ghisletta P, Rodrigue KM, Kennedy KM, Lindenberger U. Trajectories of brain aging in middle-aged and older adults: regional and individual differences. Neuroimage. 2010;51(2):501–11. doi: 10.1016/j.neuroimage.2010.03.020 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Anatürk M, Kaufmann T, Cole JH, Suri S, Griffanti L, Zsoldos E, et al. Prediction of brain age and cognitive age: quantifying brain and cognitive maintenance in aging. Hum Brain Mapp. 2021;42(6):1626–40. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Nguyen HD, Clément M, Mansencal B, Coupé P. Brain structure ages—a new biomarker for multi‐disease classification. Hum Brain Mapp. 2024;45(1):e26558. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Wrigglesworth J, Harding IH, Ward P, Woods RL, Storey E, Fitzgibbon B, et al. Factors influencing change in brain-predicted age difference in a cohort of healthy older individuals. J Alzheimers Dis Rep. 2022;6(1):163–76. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Schulz MA, Yeo BT, Vogelstein JT, Mourao-Miranada J, Kather JN, Kording K, et al. Different scaling of linear models and deep learning in UKBiobank brain images versus machine-learning datasets. Nat commun. 2020 Aug 25;11(1):4238. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Schulz MA, Bzdok D, Haufe S, Haynes JD, Ritter K. Performance reserves in brain-imaging-based phenotype prediction. Cell Rep. 2024;43(1). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Han J, Kim SY, Lee J, Lee WH. Brain age prediction: a comparison between machine learning models using brain morphometric data. Sensors. 2022;22(20):8077. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.He T, Kong R, Holmes AJ, Nguyen M, Sabuncu MR, Eickhoff SB, et al. Deep neural networks and kernel regression achieve comparable accuracies for functional connectivity prediction of behavior and demographics. NeuroImage. 2020;206:116276. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Zhang Q, Vallerga CL, Walker RM, Lin T, Henders AK, Montgomery GW, et al. Improved precision of epigenetic clock estimates across tissues and its implication for biological ageing. Genome Med. 2019;11(1):54. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Sugden K, Caspi A, Elliott ML, Bourassa KJ, Chamarti K, Corcoran DL, et al. Association of pace of aging measured by blood-based DNA methylation with age-related cognitive impairment and dementia. Neurology. 2022;99(13):e1402-13. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Whitman ET, Ryan CP, Abraham WC, Addae A, Corcoran DL, Elliott ML, et al. A blood biomarker of the pace of aging is associated with brain structure: replication across three cohorts. Neurobiol Aging. 2024;136:23–33. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Guida JL, Hyun G, Belsky DW, Armstrong GT, Ehrhardt MJ, Hudson MM, et al. Associations of seven measures of biological age acceleration with frailty and all-cause mortality among adult survivors of childhood cancer in the St. Jude Lifetime Cohort. Nat cancer. 2024;5(5):731–41. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Harris KM, Levitt B, Gaydosh L, Martin C, Meyer JM, Mishra AA, et al. Sociodemographic and lifestyle factors and epigenetic aging in US young adults: NIMHD social epigenomics program. JAMA Netw Open. 2024;7(7):e2427889. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Smith SM, Vidaurre D, Alfaro-Almagro F, Nichols TE, Miller KL. Estimation of brain age delta from brain imaging. Neuroimage. 2019;200:528–39. doi: 10.1016/j.neuroimage.2019.06.017 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: data mining, inference, and prediction. New York, NY: Springer; 2009. [Google Scholar]
  • 39.He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, p. 770–8. doi: 10.1109/cvpr.2016.90 [DOI] [Google Scholar]
  • 40.Kolbeinsson A, Filippi S, Panagakis Y, Matthews PM, Elliott P, Dehghan A, et al. Accelerated MRI-predicted brain ageing and its associations with cardiometabolic and brain disorders. Sci Rep. 2020;10(1):19940. doi: 10.1038/s41598-020-76518-z [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. JMLR. 2011;12(Oct):2825–30. [Google Scholar]
