Alzheimer's & Dementia. 2025 Aug 1;21(8):e70508. doi: 10.1002/alz.70508

Machine learning diagnosis of cognitive impairment and dementia in harmonized older adult cohorts

Dan Mungas 1, Brandon Gavett 1, L Paloma Rojas‐Saunero 2, Yixuan Zhou 2,3, Eleanor Hayes‐Larson 4, Crystal Shaw 5, Sarah Tomaszewski Farias 1, Keith Widaman 6, Evan Fletcher 1, Maria M Corrada 7,8, Paola Gilsanz 9, Maria Glymour 10, John Olichney 1, Charles DeCarli 1, Rachel Whitmer 1,11, Elizabeth Rose Mayeda 2
PMCID: PMC12314546  PMID: 40747592

Abstract

INTRODUCTION

Clinical diagnosis (normal cognition, mild cognitive impairment [MCI], dementia) is critical for understanding cognitive impairment and dementia but can be resource intensive and subject to inconsistency due to the complex clinical judgments required. Machine learning approaches might provide meaningful additions and/or alternatives to traditional clinical diagnosis.

METHODS

The study sample was composed of three harmonized longitudinal cohorts of demographically diverse older adults. We used the XGBoost extreme gradient boosting platform to predict clinical diagnosis using different feature sets.

RESULTS

Measures of cognition were especially important predictive features of clinical diagnosis. Prediction accuracy was higher in a sample that had longer follow‐up, better balance across diagnostic outcomes, and both self‐ and informant‐report independent function measures.

DISCUSSION

Algorithmic diagnosis might be a meaningful substitute for clinical diagnosis in studies in which clinical evaluation and diagnosis are not feasible for all participants and may provide a standardized alternative when clinical diagnosis is available.

Highlights

  • A machine learning algorithm was used to diagnose cognitive impairment and dementia.

  • Measures of cognition were strongest predictive features for clinical diagnosis.

  • Algorithm accuracy was improved by informant‐report independent function measures.

  • Algorithmic diagnosis might be an alternative if clinical diagnosis is not feasible.

  • Standardization is an important advantage of algorithmic diagnosis.

Keywords: algorithmic diagnosis, clinical assessment, cognitive impairment, dementia diagnosis, longitudinal cohorts, machine learning, mild cognitive impairment, neuropsychological measures, predictive modeling, XGBoost algorithm

1. BACKGROUND

Cognitive decline and dementia in older adults are major public health problems that have large quality of life and economic impacts. Clinical diagnosis is used to distinguish levels of cognitive impairment ranging from normal cognition to mild cognitive impairment (MCI) to dementia and is standard in clinical practice and research because it provides important information about real‐life function, quality of life, and clinical care needs of older adults. However, clinical diagnosis is resource intensive. It requires specialized evaluations by highly trained clinicians, and ongoing training and quality control are critical to ensure standardization and objectivity of diagnosis. High‐quality clinical diagnosis may be difficult to achieve in primary care clinical settings and can be resource intensive in research settings.

Machine learning methods have been used increasingly in recent years to estimate/predict a broad range of health outcomes, 1 , 2 , 3 , 4 , 5 including dementia and progression of cognitive impairment. 6 , 7 , 8 , 9 , 10 , 11 These methods have the potential to improve standardization and objectivity and to reduce patient/participant burden and resource use in health care and research settings. They have been developed to efficiently process large datasets and are optimized for identifying complex associations of high‐dimensional feature sets that can be used to predict complex outcomes. Machine learning has been criticized for being a “black box” method, but a recent emphasis in the field on interpretability of predictive models provides new methods for improved understanding of model results and transparency about decisions.

There is a rapidly developing literature using machine learning approaches to estimate/predict clinical diagnosis of cognitive impairment. 6 , 9 , 10 , 12 Different predictive feature sets, including cognitive test results; clinical findings; demographic characteristics; and high‐dimensional neuroimaging, biomarker, and genetic data, have been evaluated using a variety of machine learning methods. Accuracy of models in terms of correspondence of model‐predicted diagnosis with observed clinical diagnosis has varied across studies, but very high accuracy (≥ 95%) has been achieved in a number of studies. This is true for studies with predominantly clinical and cognitive predictive features as well as studies that include high‐dimension neuroimaging and biomarker features. 9 A limitation of previous machine learning diagnosis studies is that the samples have often been highly selected for inclusion of well‐differentiated clinical cases and have not been representative of the older adult population.

We used machine learning methodology in this project to develop a model to predict clinical cognitive syndrome diagnosis. We used results from three longitudinal cohort studies that have common assessment and clinical diagnosis protocols and in combination provide a sample of > 10,000 assessments that have generated a clinical diagnosis. These studies emphasize representative community recruitment methods, resulting in racially/ethnically diverse samples that address a limitation of previous research. Predictive features were variables that could be obtained in a 1‐ to 2‐hour assessment and included cognitive tests, self‐ and informant reports of independent function, and demographic characteristics. We also included longitudinal change in cognitive and independent function measures as predictive features, another novel contribution of this study.

An important motivation for this project was to develop a model that can be used to predict clinical diagnosis for all assessments in the one included study that did not have clinical evaluations at every assessment. This will have direct practical implications for that study. A second major motivation was to provide standardized and objective (algorithmic) diagnosis of cognitive syndrome for every assessment in the included studies. This can subsequently be compared to actual clinical diagnosis and to external validation variables, including neuroimaging and longitudinal clinical and cognitive change, to evaluate how well algorithmic diagnosis captures relevant biological and clinical processes underlying cognitive impairment and dementia. We used multiple methods to address model interpretability, and this will contribute to a better understanding of what diagnosis represents. These results will form a foundation for future studies comparing the reliability and validity of clinical and algorithmic diagnosis and will provide practical information to inform the development of algorithmic diagnosis in different samples using different predictive variables.

2. METHODS

2.1. Overview of study design

This study used data from three longitudinal cohort studies of older adults to develop and validate an algorithm for assigning clinical diagnosis. The studies were the University of California Davis Alzheimer's Disease Research Center Longitudinal Cohort (ADRC, N individuals = 1318, N assessments = 5445), the Kaiser Healthy Aging and Diverse Life Experiences Study (KHANDLE, N individuals = 628, N assessments = 1700), and the LifeAfter90 Study (LA90, N individuals = 824, N assessments = 3476). These studies all emphasized recruitment and longitudinal follow‐up of demographically diverse older adults. They used common clinical diagnostic criteria and had clinical evaluation components that produced a multiclass, cognitive syndrome diagnosis (normal, MCI, dementia). The clinical evaluation protocols did differ somewhat across studies, and there were also relevant differences in inclusion and exclusion criteria that notably resulted in differences in diagnostic outcomes, especially prevalence of dementia. Merging these three datasets substantially increased the sample size and maximized heterogeneity of clinical diagnosis and of features used for algorithmic diagnosis. Features used to predict diagnosis included: (1) high‐quality cognitive assessments (Spanish and English Neuropsychological Assessment Scales [SENAS] 13 , 14 ), (2) demographic variables, (3) study cohort indicators, and (4) measures of independent function. Cognitive, demographic, and self‐report measures of independent function were shared by and available for all studies, but informant‐report measures were available only in the ADRC cohort. ADRC and LA90 had clinical evaluations and clinical diagnosis at all assessments; KHANDLE participants received clinical evaluations and diagnosis by study design in ≈ 29% of assessments. An extreme gradient boosting (XGBoost) 15 modeling platform was used to generate an algorithmic diagnosis that optimally recapitulated the clinical diagnosis.

2.2. Sample characteristics

The three study cohorts are described in more detail in Appendix S1.1 in supporting information. ADRC enrollment was initiated in 2002, using a rolling enrollment recruitment design such that new participants are added on an ongoing basis intermixed with follow‐ups occurring on average every 1.32 years. KHANDLE began enrollment in 2017. The original cohort was recruited and evaluated in Wave 1 and received three additional follow‐up assessments between 2017 and 2024, with an average time between assessments of 1.68 years. A refresher cohort of ≈ 500 individuals was added beginning in 2022, and some of these individuals are included in this study. LA90 began enrollment in 2018 and has used rolling enrollment with new enrollments intermixed with follow‐up assessments. The average time between assessments for LA90 is 0.63 years.

RESEARCH IN CONTEXT

  1. Systematic review: We reviewed the literature on studies of machine learning diagnosis of cognitive impairment and dementia using traditional (e.g., PubMed) sources.

  2. Interpretation: Algorithmic diagnosis based on cognitive test results, measures of independent function, and demographic variables showed moderate to strong associations with clinical diagnosis in three demographically diverse cohorts of older adults. Algorithmic diagnosis might be a viable alternative when resources for comprehensive evaluation and expert adjudication needed for clinical diagnosis are not available. Machine learning methods can contribute to understanding what clinical diagnosis represents.

  3. Future directions: Future research will compare algorithmic and clinical diagnoses with respect to rates of longitudinal progression, association with future cognitive decline, and association with biomarkers of cognitive impairment and dementia. These studies will address questions about the potential utility of machine learning approaches to diagnosis and will help to define the contexts and scope of utility of these methods.

2.3. Clinical evaluation and diagnosis

2.3.1. Clinical evaluation components

The clinical evaluation protocol for all three cohorts included a clinical exam, administered by a trained physician, that was composed of a clinical history, a neurological exam, and clinical testing of cognitive function. An informant was interviewed when available, and elements of the National Alzheimer's Coordinating Center Uniform Dataset (NACC UDS) 16 were collected according to standardized UDS guidelines. ADRC and KHANDLE clinical evaluations included clinical neuropsychological assessment using the UDS Neuropsychological Battery. 17 The Modified Mini‐Mental State Exam (3MS) 18 was used to assess cognition in LA90 clinical evaluations. The examining physician made a provisional clinical diagnosis of normal cognition versus MCI versus dementia for all three cohorts. Clinical components are summarized in Table S1 in supporting information.

2.3.2. Clinical diagnosis

The same clinical diagnostic criteria were used for the three cohorts. Dementia was diagnosed using Diagnostic and Statistical Manual of Mental Disorders III‐R criteria for dementia, modified such that dementia could be diagnosed in the absence of memory impairment if there was significant impairment in two or more cognitive domains. MCI was diagnosed according to standard clinical criteria following UDS guidelines. 19 Normal cognitive function was diagnosed if there was no clinically significant cognitive impairment. Clinical diagnosis for the ADRC was made in a multidisciplinary case conference in which all clinical evaluation findings incorporated in UDS instruments were available except the Clinical Dementia Rating (CDR 20 ), which was excluded from consideration in the case conference. UDS neuropsychological test results were considered.

KHANDLE and LA90 diagnoses were derived by applying standardized, rational‐intuitive decision rules to the data elements collected during the clinical evaluation. This was designed to recapitulate the provisional clinical syndrome diagnostic criteria used by clinicians. The elements used to create this diagnosis were the summarized results of the clinical neuropsychological assessment, the CDR, and the syndrome diagnosis made by the examining clinician. In KHANDLE, neuropsychological test findings were used to establish the presence of clinically significant cognitive impairment in each of five domains (memory, executive function, language, attention, and spatial ability). Impairment of an individual test was defined as a score at or below the 10th percentile based on the KHANDLE cognitively normal reference sample, and impairment of an individual domain was defined as impairment on ≥ 50% of the tests in that domain. In LA90, a confirmatory factor analysis identified a four‐factor model that explained test results corresponding to four domains: memory, executive, language, and spatial. Factor scores were generated for these domains, and a domain was considered impaired if the factor score fell at or below the 10th percentile in an LA90 reference sample that had an examining physician diagnosis of normal cognition. The CDR Sum of Boxes (CDRSum) was used in both studies to identify the presence of functional impairment. An examining physician's diagnosis of dementia also was used as a proxy for functional impairment. Information about cognitive and functional impairment was merged to arrive at a final diagnosis.

2.4. Non‐diagnostic cognition and independent function assessments

All three studies had a common design feature—they included measures of cognition and independent function (SENAS, Everyday Cognition [ECog] 21 ) that were excluded from consideration in the clinical evaluation process. This was implemented when these cohorts were initiated so that diagnosis would not be contaminated by results of these measures and associations of diagnosis with these measures could be examined without circularity. There is overlap in the domains measured by SENAS and ECog and by the measures that were part of the clinical evaluation (e.g., UDS Neuropsychology and Functional Activities Questionnaire [FAQ]), but the tests are entirely different. In KHANDLE and LA90, SENAS and ECog were administered as part of the survey administered to all participants at each wave, and the clinical evaluation components were administered in a different assessment by a different assessment team.

Three cognitive domains (verbal episodic memory, semantic memory, and executive function) were assessed by the SENAS, a battery of cognitive tests that has previously undergone extensive development for measurement of cognitive function and change in diverse racial/ethnic groups and English and Spanish language administrations. 13 , 14 , 22 , 23 , 24 , 25 , 26 , 27 The same SENAS measures were used in the three cohorts. The measures are all scored using item response theory (IRT) methods. Analyses of differential item function associated with race/ethnicity and language of administration have been performed on the measures included in this study, and adjustment for non‐invariance of items across race/ethnicity and linguistic groups has been incorporated into the IRT scoring. SENAS scores for all three studies were calculated using the same scoring program. Independent function was assessed using the ECog questionnaire. 21 , 28 , 29 , 30 It assesses change in everyday/real‐world independent function in multiple domains. Detailed information about the SENAS and ECog is in Appendix S1.2.

2.5. Data analysis

2.5.1. Predictive features

Features that were used in algorithmic diagnosis are listed in Table S2 in supporting information. Cognition (SENAS) and independent function (ECog) measures were excluded from consideration in the clinical diagnosis process. CDR 20 variables were included as predictive features in secondary analyses. CDR scores were an explicit part of the diagnostic criteria for the KHANDLE and LA90 cohorts, raising the likelihood of criterion contamination bias in these cohorts. The CDR was not used in the diagnosis process of the ADRC cohort. We estimated models with the CDR predictive feature to evaluate the incremental impact of the CDR, especially in the ADRC cohort, and to provide a direct comparison to other machine learning/algorithmic diagnosis studies that have used the CDR.

2.5.2. XGBoost modeling

XGBoost (R xgboost version 1.7.8.1 with tidymodels version 1.2.0) modeling was used to train models to predict clinical diagnosis from available features and to generate cross‐validation/out‐of‐sample predictions of diagnosis in held‐out test datasets. Although many machine learning approaches exist for classification tasks, XGBoost was selected for several reasons. First, it is not only capable of dealing with missing data, it treats missing data as a potential source of information (e.g., if participants have missing data on some ECog items because they stopped performing certain activities of daily living, that type of missingness could be diagnostically informative). This ensures that the entire sample can be used, reducing selection bias. Second, unlike some other machine learning methods, XGBoost does not require all features to be standardized, which allows for the use of categorical variables as predictors (using indicator coding or one‐hot encoding) and facilitates interpretability of results on the native scales of the features. Third, because XGBoost uses a decision‐tree approach to classification, interactions between features are accounted for implicitly, and no assumptions (e.g., linearity) are made about the functional form relating the predictive features to the outcome. Finally, XGBoost is regarded for its accuracy in numerous applications, including but not limited to biomedical research.
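As a concrete illustration of these points, the sketch below fits a multiclass XGBoost model directly in R; the feature matrix `X`, the 0/1/2 outcome coding, and the hyperparameter values are assumptions for illustration only, not the study's actual configuration:

```r
library(xgboost)

# Assumed inputs: numeric feature matrix X (NAs allowed) and integer outcome y
# coded 0 = normal, 1 = MCI, 2 = dementia.
dtrain <- xgb.DMatrix(data = X, label = y, missing = NA)  # NAs get learned "default" branch directions

bst <- xgb.train(
  params = list(
    objective   = "multi:softprob",  # per-class probabilities summing to 1
    num_class   = 3,
    eval_metric = "mlogloss",        # multiclass mean log-loss
    eta         = 0.1,               # illustrative learning rate
    max_depth   = 4                  # illustrative tree depth
  ),
  data    = dtrain,
  nrounds = 200
)

# Predictions return as a flat vector; reshape to one row per observation.
prob <- matrix(predict(bst, dtrain), ncol = 3, byrow = TRUE)
pred_class <- max.col(prob) - 1  # class (0/1/2) with the highest probability
```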

An XGBoost model is built on decision trees. A decision tree starts by identifying the predictive feature that is most strongly associated with the outcome (here, clinical syndrome diagnosis: normal, MCI, or dementia) and then finds an optimal cutpoint such that differences in the outcome of groups above and below the cutoff are maximized. It then splits the resulting two groups by identifying the feature most strongly associated with the outcome within each of the initial two groups and finding an optimal cutpoint for those features. Additional trees are subsequently added; this “boosting” is mathematically guided by the XGBoost algorithm so that new trees optimally correct errors of the preceding ensemble of trees. There are several important aspects of this process: (1) the dichotomization of the features selected for a tree is based on the natural range of the feature variable and does not require a common scale across features, (2) the process of selective partitioning incorporates complex interaction effects of the predictive features selected at each step, and (3) the boosting process amalgamates an ensemble of decision trees into a final prediction, and in this way, individual decision trees that might have weak associations with the outcome can be merged to create a strong prediction. See Ninja 31 for a more detailed description of the basic features of XGBoost.
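For readers who want to inspect these splits directly, R's xgboost package can dump the fitted ensemble as a table; a minimal sketch using the `bst` model from the previous sketch follows (exact column names may vary slightly across xgboost versions):

```r
library(xgboost)

# One row per tree node: the splitting feature, the cutpoint ("Split"),
# the child/missing branch IDs, and the gain attributable to the split.
tree_table <- xgb.model.dt.tree(model = bst)
head(tree_table[tree_table$Tree == 0, ])  # splits and cutpoints of the first tree
```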

The modeling process consisted of two phases (summarized in Figure 1). In step 1, model development, a series of XGBoost models started with SENAS variables and systematically added demographic, cohort, self‐reported independent function, informant‐reported independent function, time‐lagged cognition, and time‐lagged independent function variables. Models within this feature evaluation phase were trained on a random 50% of the sample and tested/validated on the other 50%. Hyperparameters were tuned in model training using a 5‐fold cross‐validation design in which tuning was performed with a random 80% of the training sample and testing with the remaining 20%, repeating this process for each of the five random folds. The training model from this process that had the highest accuracy was then applied to the held‐out 50% sample to generate cross‐validated model predictions of the diagnosis outcome. The feature set that yielded the best cross‐validated accuracy in the 50% held‐out sample was identified. In step 2, out‐of‐sample prediction, 10‐fold cross‐validation was used. Training, with 5‐fold hyperparameter tuning, was done on a random 90% of the sample, and the best‐trained model was applied to the held‐out 10% to generate an out‐of‐sample prediction of clinical diagnosis. This process was repeated for each of the 10 folds so that an out‐of‐sample predicted diagnosis was obtained for every assessment in the dataset. These cross‐validated predicted diagnoses were used in subsequent analyses to evaluate out‐of‐sample diagnostic accuracy in the full sample. More detail about the implementation of XGBoost modeling is included in Appendix S1.3.
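To make step 1 concrete, here is a minimal sketch using the tidymodels and xgboost packages named above; the data frame `dat`, its outcome column `dx`, the grid size, and the random seed are illustrative assumptions rather than the study's actual configuration (which is detailed in Appendix S1.3):

```r
library(tidymodels)

# Step 1 sketch (assumed data): `dat` holds one row per person-assessment with
# factor outcome `dx` (normal/MCI/dementia) and the predictive features.
set.seed(2024)
split <- initial_split(dat, prop = 0.5)   # 50% train / 50% held-out test
train <- training(split)
test  <- testing(split)

spec <- boost_tree(trees = tune(), tree_depth = tune(), learn_rate = tune()) |>
  set_engine("xgboost") |>
  set_mode("classification")

wf <- workflow() |> add_formula(dx ~ .) |> add_model(spec)

tuned <- tune_grid(
  wf,
  resamples = vfold_cv(train, v = 5),            # 5-fold hyperparameter tuning
  grid      = 20,                                # illustrative grid size
  metrics   = metric_set(mn_log_loss, roc_auc)   # primary and secondary metrics
)

# Finalize with the best hyperparameters, refit on the full training half,
# and generate cross-validated predictions for the held-out 50%.
wf_final <- finalize_workflow(wf, select_best(tuned, metric = "mn_log_loss"))
fit_50   <- fit(wf_final, data = train)
probs    <- predict(fit_50, test, type = "prob")
```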

FIGURE 1. Data analysis overview (r = randomly selected). CV, cross‐validation.

2.5.3. Model accuracy metrics

Model training used multiclass mean‐log‐loss as the primary metric of model fit with multiclass receiver operating characteristic (ROC) area under the curve (AUC) as a secondary metric. Loss functions in machine learning evaluate the accuracy of prediction. Log‐loss is a commonly used classification model metric that evaluates how well the predicted class corresponds to the actual observed class (diagnosis in this study). Lower values indicate better prediction accuracy, with 0 = perfect prediction. Log‐loss rewards correct predictions that are made with high confidence (e.g., dementia correctly predicted with a 0.95 probability: log‐loss = 0.05) and penalizes incorrect predictions (e.g., actual dementia predicted with 0.05 probability: log‐loss = 3.00). Log‐loss for an individual observation in a multiclass classification model is defined as:

$$\mathrm{logloss}_{\mathrm{obs}} = -\log\left(\hat{P}_{\mathrm{true\ class}}\right)$$

where $\mathrm{logloss}_{\mathrm{obs}}$ is the log‐loss value for an individual observation and $\hat{P}_{\mathrm{true\ class}}$ is the model‐predicted probability of the true (observed) class in the data. Multiclass mean‐log‐loss is the average over observations of the log‐loss values for each observation.
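As a small worked illustration of this metric (reproducing the numeric examples above), consider the following R function with assumed inputs:

```r
# Multiclass mean log-loss as defined above.
# prob:  N x K matrix of predicted class probabilities (rows sum to 1)
# truth: integer vector of length N giving the column index of the observed class
mean_log_loss <- function(prob, truth) {
  p_true <- prob[cbind(seq_len(nrow(prob)), truth)]  # probability assigned to the true class
  mean(-log(pmax(p_true, 1e-15)))                    # clip so log(0) cannot occur
}

mean_log_loss(matrix(c(0.95, 0.04, 0.01), nrow = 1), 1)  # ~0.05: confident and correct
mean_log_loss(matrix(c(0.05, 0.90, 0.05), nrow = 1), 1)  # ~3.00: confident and wrong
```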

XGBoost generated probabilities of normal, MCI, and dementia diagnoses (which summed to 1.0) for each person‐assessment. The predicted multiclass diagnosis was the diagnosis class with the highest probability. Prediction accuracy in held‐out samples was evaluated by comparing multiclass clinical diagnosis to multiclass predicted diagnosis using the R yardstick package (version 1.2.0) to evaluate accuracy, kappa (weighted‐quadratic), AUC (hand‐till), and multiclass (macro‐weighted) versions of sensitivity, specificity, positive predictive value, negative predictive value, and balanced accuracy. Balanced accuracy is the average of sensitivity and specificity. Bootstrap resampling methods with 10,000 draws were used to generate confidence intervals for accuracy metrics and for differences in accuracy across different models. Multiclass accuracy metrics were examined in the full (combined) sample and in each individual study cohort.
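These metrics correspond to standard yardstick calls; a minimal sketch with assumed column names in a results tibble `res` is:

```r
library(yardstick)

# Assumed: `res` has truth `dx` (factor: normal/MCI/dementia), hard prediction
# `.pred_class`, and probability columns .pred_normal, .pred_MCI, .pred_dementia.
accuracy(res, truth = dx, estimate = .pred_class)
kap(res, truth = dx, estimate = .pred_class, weighting = "quadratic")
roc_auc(res, truth = dx, .pred_normal, .pred_MCI, .pred_dementia)  # Hand-Till for multiclass
sens(res, truth = dx, estimate = .pred_class, estimator = "macro_weighted")
spec(res, truth = dx, estimate = .pred_class, estimator = "macro_weighted")
bal_accuracy(res, truth = dx, estimate = .pred_class, estimator = "macro_weighted")
```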

We also examined the accuracy of two dichotomous diagnoses: (1) cognitive impairment (MCI or dementia) versus normal cognition and (2) dementia versus no dementia (MCI or normal cognition). This also was done for the combined sample and separately for each study cohort. The prevalence of the specific diagnosis in the baseline assessment of a specific cohort/sample was calculated, and the optimal cutoff for the predicted probability of that diagnosis in that sample was identified as the (1 – prevalence) quantile of the distribution of the predicted probability of the diagnosis in the baseline assessment for the sample. This optimal cutoff was then applied to subsequent assessments for the sample.
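Written out, the cutoff rule is a single quantile computation; the sketch below uses assumed column names:

```r
# Prevalence-matched cutoff for the dichotomous dementia vs. no-dementia diagnosis.
# Assumed: `baseline` and `followup` data frames with observed diagnosis `dx`
# and predicted probability of dementia `p_dementia`.
prev   <- mean(baseline$dx == "dementia")                  # baseline prevalence
cutoff <- quantile(baseline$p_dementia, probs = 1 - prev)  # (1 - prevalence) quantile
followup$pred_dementia <- followup$p_dementia >= cutoff    # apply to later assessments
```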

2.5.4. Model interpretability

We evaluated feature importance using XGBoost importance and Shapley Additive exPlanations (Shapley) values. 32 , 33 We also used Shapley values to show how features were combined to yield predicted diagnosis. Shapley values show the incremental contribution of each predictive feature to each model prediction apart from all other features. This study used a classification model that has a categorical, multiclass outcome. There is a separate set of Shapley values for each outcome class (normal, MCI, dementia). The set for an outcome (e.g., normal) has a row for each person‐assessment and a column for each predictive feature. The columns contain Shapley values that show how each feature contributes to the predicted probability of the outcome (e.g., normal) for that row (assessment). Shapley values are on a log‐odds scale; XGBoost‐predicted probability is transformed to log‐odds using the logit transformation:

$$\text{log-odds} = \log\left(\frac{P}{1-P}\right)$$

where P = probability. A 50% probability of the diagnosis corresponds to a log‐odds of 0 (10% probability = log‐odds of −2.20, 90% probability = log‐odds of 2.20, 25% probability = log‐odds of −1.10, 75% probability = log‐odds of 1.10). The sum of Shapley values across predictive features plus an intercept term (the log‐odds of the specific diagnosis in the overall sample) is an estimate of the log‐odds of the diagnosis (e.g., normal) for the specific person and assessment represented in the row. This is expressed (for normal diagnosis) in the formula:

$$\sum_{\mathrm{features}} \mathrm{SHAP} + \mathrm{Intercept} = \log\left(\frac{P_{\mathrm{Normal}}}{1-P_{\mathrm{Normal}}}\right)$$

where SHAP = Shapley value, Intercept = log‐odds of a normal diagnosis in the overall sample, and $P_{\mathrm{Normal}}$ = probability that this individual is cognitively normal at this assessment.

As a concrete example relevant to this study, if the sum of Shapley values for dementia for a person‐assessment is 1.10 (corresponding probability = 0.75) and the log‐odds of dementia in the sample is −2.20 (probability = 0.10), then the log‐odds of dementia for that assessment would be 1.10 + −2.20 = −1.10 (probability = 0.25). The Shapley value for a specific feature for a person‐assessment (row) shows how the value of the feature observed for that assessment changes the probability of the outcome when it is included in the prediction versus not included. The individual Shapley value for each feature shows both the direction (positive versus negative) and magnitude of the contribution of that feature to the predicted diagnosis. Shapley importance for a diagnosis class can be calculated by averaging the absolute Shapley values for each feature across all assessment records for that class. More detail about these methods is provided in Appendix S1.4.
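The worked example above, and the importance calculation, can be expressed in a few lines of base R; the `shap` matrix is an assumed (person-assessments by features) matrix of Shapley values for one diagnosis class:

```r
# Reproduce the worked example on the log-odds scale.
shap_sum  <- 1.10            # sum of Shapley values for dementia (probability 0.75)
intercept <- qlogis(0.10)    # log-odds of dementia in the sample: -2.20
plogis(shap_sum + intercept) # 0.25, the predicted probability of dementia

# Shapley importance for a diagnosis class: mean absolute Shapley value per feature.
importance <- sort(colMeans(abs(shap)), decreasing = TRUE)
```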

3. RESULTS

3.1. Sample characteristics

Table 1 shows sample characteristics by study cohort. Baseline prevalence of normal cognition was highest in KHANDLE (67.0% vs. 60.7% in LA90 and 56.4% in ADRC). MCI baseline prevalence was 29.5% in the combined sample and was roughly equal across study cohorts. Dementia prevalence was markedly lower in KHANDLE (3.0% vs. 10.7% in LA90 and 13.7% in ADRC). Sex distribution was similar across cohorts. Baseline age was much greater in LA90, as would be expected, and was ≈ 2 years greater on average in KHANDLE than in ADRC. Average education ranged from 13.5 years in ADRC to 14.7 in KHANDLE, with LA90 intermediate at 13.9. KHANDLE and LA90 had relatively good balance across the four main race/ethnicity groups, whereas ADRC included proportionally more Whites (46% of sample) and relatively few Asians (4% of sample). ADRC and LA90 had ≈ 4 assessments per person on average, with ≈ 3 for KHANDLE. Average time of follow‐up was the same in ADRC and KHANDLE (3.8 years) but was substantially less in LA90 (1.9 years). The ADRC sample included 458 individuals who had > 5 assessments, LA90 had 320, and the maximum number of assessments in KHANDLE was 4.

TABLE 1.

Sample characteristics by study cohort.

Variable                 | ADRC (N = 1318)   | KHANDLE (N = 628) | LA90 (N = 824)    | Overall (N = 2770)
Baseline diagnosis       |                   |                   |                   |
  Normal                 | 743 (56.4%)       | 421 (67.0%)       | 500 (60.7%)       | 1664 (60.1%)
  MCI                    | 394 (29.9%)       | 188 (29.9%)       | 236 (28.6%)       | 818 (29.5%)
  Dementia               | 181 (13.7%)       | 19 (3.0%)         | 88 (10.7%)        | 288 (10.4%)
  n                      | 1318              | 628               | 824               | 2770
Sex                      |                   |                   |                   |
  Male                   | 507 (38.5%)       | 266 (43.3%)       | 323 (39.2%)       | 1096 (39.8%)
  Female                 | 811 (61.5%)       | 349 (56.7%)       | 501 (60.8%)       | 1661 (60.2%)
Baseline age (years)     |                   |                   |                   |
  Mean (SD)              | 75.2 (7.32)       | 77.1 (7.27)       | 92.4 (2.34)       | 80.7 (9.88)
  Median [Min, Max]      | 75.0 [49.0, 97.0] | 76.3 [65.4, 99.0] | 91.5 [90.1, 105]  | 80.0 [49.0, 105]
Education (years)        |                   |                   |                   |
  Mean (SD)              | 13.5 (4.48)       | 14.7 (3.14)       | 13.9 (3.35)       | 13.9 (3.92)
  Median [Min, Max]      | 14.0 [0, 20.0]    | 14.0 [2.00, 20.0] | 13.0 [0, 20.0]    | 14.0 [0, 20.0]
Race/ethnicity           |                   |                   |                   |
  Asian                  | 58 (4.4%)         | 154 (25.0%)       | 203 (25.3%)       | 415 (15.2%)
  Black                  | 331 (25.2%)       | 133 (21.6%)       | 203 (25.3%)       | 667 (24.4%)
  LatinX                 | 320 (24.4%)       | 171 (27.8%)       | 160 (19.9%)       | 651 (23.8%)
  Native American        | 4 (0.3%)          | 0 (0%)            | 4 (0.5%)          | 8 (0.3%)
  White                  | 599 (45.7%)       | 157 (25.5%)       | 233 (29.0%)       | 989 (36.2%)
Number of assessments    |                   |                   |                   |
  Mean (SD)              | 4.04 (3.45)       | 3.20 (1.03)       | 4.21 (2.21)       | 3.90 (2.74)
  Median [Min, Max]      | 3.00 [1.00, 18.0] | 4.00 [1.00, 4.00] | 4.00 [1.00, 8.00] | 3.00 [1.00, 18.0]
Follow‐up time (years)   |                   |                   |                   |
  Mean (SD)              | 3.84 (4.32)       | 3.84 (1.98)       | 1.98 (1.28)       | 3.28 (3.31)
  Median [Min, Max]      | 2.48 [0, 20.5]    | 4.47 [0, 6.64]    | 1.73 [0, 4.31]    | 2.64 [0, 20.5]

Note: Column totals are numbers of individuals enrolled in each cohort.

Abbreviations: ADRC, Alzheimer's Disease Research Center; KHANDLE, Kaiser Healthy Aging and Diverse Life Experiences Study; LA90, LifeAfter90 Study; MCI, mild cognitive impairment; SD, standard deviation.

3.2. Feature set comparison

We started with a predictive feature set that included only cognitive measures (feature set 1), and then systematically added additional groups of features in a predetermined order. Accuracy metrics for different feature sets are presented in Table 2. Adding demographic variables to the cognition‐only feature set (feature set 2) modestly improved model fit, as did adding cohort indicators (3), but model fit did not substantially improve after adding self‐reported independent function (4). Adding lagged cognition features (5) yielded relatively small incremental improvement. Adding informant‐reported independent function resulted in clear but modest incremental improvement (6), but model fit did not improve after adding lagged independent function (7).

TABLE 2.

Comparison of model accuracy across feature sets.

FS | Cog       | Dem | Coh | IF       | CDR | acc   | kapwq | auc   | sens  | spec  | ppv   | npv   | bal_acc
1  | Cur       |     |     |          |     | 0.694 | 0.604 | 0.802 | 0.694 | 0.700 | 0.668 | 0.802 | 0.697
2  | Cur       | Yes |     |          |     | 0.720 | 0.659 | 0.825 | 0.720 | 0.744 | 0.702 | 0.815 | 0.732
3  | Cur       | Yes | All |          |     | 0.730 | 0.672 | 0.835 | 0.730 | 0.754 | 0.713 | 0.826 | 0.742
4  | Cur       | Yes | All | S        |     | 0.727 | 0.671 | 0.837 | 0.727 | 0.762 | 0.713 | 0.818 | 0.745
5  | Cur + Lag | Yes | All | S        |     | 0.734 | 0.673 | 0.841 | 0.734 | 0.755 | 0.717 | 0.830 | 0.744
6  | Cur + Lag | Yes | All | SI       |     | 0.750 | 0.695 | 0.858 | 0.750 | 0.764 | 0.736 | 0.839 | 0.757
7  | Cur + Lag | Yes | All | SI + Lag |     | 0.752 | 0.702 | 0.858 | 0.752 | 0.770 | 0.738 | 0.840 | 0.761
8  | Cur + Lag | Yes | All | SI       | Yes | 0.801 | 0.794 | 0.914 | 0.801 | 0.812 | 0.793 | 0.864 | 0.806
9  | Cur + Lag | Yes | A   | SI       | Yes | 0.812 | 0.818 | 0.910 | 0.812 | 0.834 | 0.800 | 0.896 | 0.823
10 | Cur + Lag | Yes | L   | SI       | Yes | 0.807 | 0.809 | 0.921 | 0.807 | 0.792 | 0.798 | 0.856 | 0.800
11 | Cur + Lag | Yes | K   | SI       | Yes | 0.757 | 0.608 | 0.871 | 0.757 | 0.761 | 0.761 | 0.781 | 0.759

Notes: Each row represents a specific model applied to a specific sample. Rows are roughly ordered according to model accuracy (increasing for combined sample results, decreasing for cohort‐specific results). Bolded values indicate that the value was significantly different from the previous row (bootstrap confidence interval for difference did not include 0). FS, feature set sequence number. Features: Cog—Cur, SENAS concurrent with diagnosis; Cur + Lag, concurrent SENAS plus 2 previous assessments. Dem—Yes, age + sex/gender + education + race/ethnicity. Coh—All, ADRC + KHANDLE + LA90; A, ADRC; K, KHANDLE; L, LA90. IF—S, independent function (self‐reported) concurrent with diagnosis; SI, independent function (self‐reported + informant‐reported) concurrent with diagnosis; SI + Lag, independent function (self‐reported + informant‐reported) concurrent plus 2 previous assessments. CDR—Yes, CDR included. Metrics: acc, accuracy; kapwq, weighted kappa (quadratic); auc, area under curve (multiclass); sens, sensitivity (multiclass); spec, specificity (multiclass); ppv, positive predictive value (multiclass); npv, negative predictive value (multiclass); bal_acc, balanced accuracy (multiclass).

ADRC, Alzheimer's Disease Research Center; CDR, Clinical Dementia Rating; KHANDLE, Kaiser Healthy Aging and Diverse Life Experiences Study; LA90, LifeAfter90 Study; SENAS, Spanish and English Neuropsychological Assessment Scales.

Models from secondary analyses that included CDR features had the best correspondence with observed diagnosis. Accuracy in the ADRC cohort was significantly higher than that in the combined sample (9 vs. 8) for all metrics except ROC AUC. ADRC accuracy (9) was higher than for LA90 (10) for specificity and negative predictive value. The KHANDLE cohort (11) had the lowest accuracy for this feature set. All metrics had smaller values for KHANDLE compared to ADRC (not shown), and compared to LA90, all metrics except specificity had smaller values. Accuracy for both KHANDLE and LA90 may be overestimated because the CDR is incorporated into the diagnostic decision process for those studies.

The feature set that included cognition, demographics, cohort, self‐ and informant‐independent function, and lagged cognition was selected as the best feature set for subsequent analyses. The feature set including the CDR was not selected because of the circularity of CDR and diagnosis in KHANDLE and LA90 cohorts.

3.3. Modeling decisions

A series of analyses were performed to evaluate alternative strategies at specific modeling decision points. Details about these analyses and results are in Appendix S2.1 in supporting information. Appendix S2.1.1 addresses how the training and testing samples were defined. Appendix S2.1.2 describes comparisons of cohort‐specific training versus using the combined cohort, and Table S3 in supporting information shows model accuracy associated with different approaches. Appendix S2.1.3 addresses the use of model weights for outcome imbalance/differential prevalence across groups, and Appendix S2.1.4 describes comparisons of accuracy from models that did and did not include independent function predictive features that were missing by design in the KHANDLE and LA90 cohorts. Informed by these results, we (1) used the combined sample for model estimation, (2) trained on a random selection of participants from the combined cohort and tested on the unselected participants, (3) used informant‐reported independent function features that were systematically missing in the KHANDLE and LA90 cohorts, and (4) did not use weights to address differential prevalence of outcomes.

3.4. Model interpretability

The out‐of‐sample prediction phase of the analysis was used to examine how features contributed to model estimation overall and to specific model predictions. We used 10‐fold cross‐validation to generate out‐of‐sample predictions of the diagnosis outcome for every assessment record. This resulted in 10 different models being trained, each with a different 10% fold held out from model training. We used these results to evaluate feature importance, to compare and examine accuracy across follow‐up assessments, and to show how features impacted predicted diagnosis for selected cases and groups.
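A minimal sketch of this step is below, reusing the `wf_final` workflow from the step 1 sketch in Section 2.5.2; for brevity it reuses one tuned configuration rather than re-tuning hyperparameters within each of the 10 folds as the study did:

```r
library(tidymodels)
library(purrr)

# Step 2 sketch: 10-fold cross-validated out-of-sample predictions, so every
# assessment record receives a prediction from a model that never saw it in training.
set.seed(2024)
folds <- vfold_cv(dat, v = 10)

oos <- map_dfr(folds$splits, function(s) {
  fit_90 <- fit(wf_final, data = analysis(s))      # train on the 90% analysis set
  held   <- assessment(s)                          # the held-out 10%
  bind_cols(held,
            predict(fit_90, held, type = "prob"),  # class probabilities
            predict(fit_90, held))                 # predicted diagnosis class
})
```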

3.4.1. Feature importance

Figure 2 shows Shapley value importance for the 20 most influential features and breaks overall importance down into components for each diagnosis class. We arbitrarily selected training results from the sample that left out the 10th fold—results were similar for other folds. The cognition variables were the most influential, especially verbal episodic memory (vrmem). The KHANDLE indicator contributed little to dementia predictions, which would be expected given the low prevalence of dementia in KHANDLE. Several of the self‐reported (memcncrn, excheckp) and informant‐reported independent function measures (ecogmem, ecogorg) contributed primarily to dementia predictions. This too fits with conceptual expectations because impaired independent function is a defining feature that differentiates dementia from MCI and normal cognition. Changes in vrmem from one and two prior assessments were in the top 20 features and contributed to all three diagnosis classes. XGBoost feature importance results (averaged across the 10 folds) were similar and are described in Figure S1 in supporting information.

FIGURE 2. Shapley value importance by cognitive impairment outcome (normal, MCI, dementia) for the 20 most influential features. Overall length of bars shows the incremental contribution of each feature to all diagnosis classes, and the colored segments show the contribution to specific diagnosis classes. Results are from the trained Phase 2 model that held out fold 10. aget89, age (top coded at 89); cat, Category Fluency; catch1, Category Fluency Change Lag 1; ecogmem, ECog Informant Memory Summary Score; ecogorg, ECog Informant Organization Summary Score; excheckp, ECog Self "Balance the checkbook without error"; female, female sex/gender; khan, Kaiser Healthy Aging and Diverse Life Experiences Study; la, LatinX race/ethnicity; la90, LifeAfter90 Study; MCI, mild cognitive impairment; memcncrn, ECog Self "Concerned about memory"; mdate, ECog Informant "Remember current date"; mdatep, ECog Self "Remember current date"; phon, Phonemic/Letter Fluency; sem, Semantic Memory; vrmem, Verbal Episodic Memory; vrmemch1, Verbal Episodic Memory Change Lag 1; wh, White race/ethnicity; wm, Working Memory.

We used Shapley values to illustrate the independent contributions to diagnosis prediction of verbal episodic memory (vrmem) from the current and two previous assessments (vrmemch1 and vrmemch2). The three panels of Figure 3 are Shapley dependence plots for vrmem, vrmemch1, and vrmemch2 and show how these features contribute to log‐odds of a normal diagnosis. Figure 3A (vrmem) shows a strong and relatively linear association of vrmem with log‐odds of a normal diagnosis across a vrmem range from ≈ −2.0 to 1.0 (corresponding to a change in probability of a normal diagnosis from 0.56 to 0.92), but also apparent non‐linearity such that values below the −2.0 threshold and above the 1.0 threshold did not further modify the probability of a normal diagnosis. Figures 3B and 3C are Shapley dependence plots for change in vrmem from 1 (vrmemch1) and 2 (vrmemch2) previous assessments, respectively. The magnitude of the independent effect of vrmem was ≈ 3.7 times larger than that for vrmemch1 (Shapley value range of 2.2 vs. 0.6) and 11 times larger than that for vrmemch2 (Shapley value range of 2.2 vs. 0.2). Probability of a normal diagnosis changed from 0.76 to 0.85 across the range of vrmemch1 and from 0.76 to 0.79 for vrmemch2. Positive values of vrmemch1 and vrmemch2 indicate decline from the previous assessments. The color coding in Figures 3B and 3C shows that lower current vrmem is associated with more decline. After controlling for vrmem and all the other features in the model, greater decline from the two prior assessments was associated with a slightly higher probability of a normal diagnosis. A similar but reversed pattern of results was present when dementia was chosen as the outcome class (not shown). To summarize, independent contributions to prediction of current diagnosis dropped off substantially from vrmem to vrmemch1 to vrmemch2, higher vrmem was associated with a higher probability of a normal diagnosis, and greater vrmem change from the previous assessment was associated with a slightly higher probability of a normal diagnosis at the current assessment.

FIGURE 3. Shapley dependence plots for verbal episodic memory (vrmem) from the assessment concurrent with diagnosis and from two previous assessments in a sample with 4+ assessments. Panel (A) shows how the SHAP value varies as a function of concurrent vrmem. The color coding of points in (B) and (C) corresponds to the concurrent vrmem values. For example, high concurrent vrmem generally corresponds to less vrmem decline from the previous assessment (negative value of vrmemch1). SHAP, Shapley; vrmem, verbal episodic memory from assessment concurrent with predicted diagnosis; vrmemch1, change in vrmem from 1‐prior assessment to concurrent assessment (1‐prior assessment − concurrent); vrmemch2, change in vrmem from 2‐prior assessment to concurrent assessment (2‐prior assessment − concurrent).

3.4.2. Feature contribution to outcomes

Shapley values can also be used to show how the different features contribute to the prediction of specific diagnoses, and this can occur at the level of individual assessments but also for aggregated groups of assessments. Figure 4 shows an aggregated group‐level comparison of younger and older persons diagnosed with dementia. Assessments resulting in a diagnosis of dementia with ages ≤ 70 comprised one group, and dementia diagnosis and ages ≥ 85 defined the other. The younger group had a higher predicted average probability of dementia (0.899 versus 0.823), suggesting that they had greater impairment. The overall pattern of contributions was similar in the two groups, and verbal episodic memory made the largest contribution. Shapley values can be used to show how model predictions were obtained for individual assessments. Feature contributions for two individuals, one age 70 and one age 85, are described in Appendix S2.3 and presented in Figure S2 in supporting information.

FIGURE 4. Shapley waterfall plots for two groups with clinical diagnoses of dementia, one composed of individuals ≤ 70 years of age and the second of individuals ≥ 85 years of age. These plots show averaged independent contributions of features to the outcome (log‐odds of dementia) in each group. f(x) ∼ average log‐odds of dementia in depicted groups (corresponding probabilities of dementia: ≤ 70 = 0.90, ≥ 85 = 0.82); E(f[x]) ∼ average log‐odds of dementia in full sample (corresponding probability of dementia = 0.38). aget89, age (top coded at 89); cat, Category Fluency; ecogmem, ECog Informant Memory Summary Score; ecogorg, ECog Informant Organization Summary Score; excheckp, ECog Self "Balance the checkbook without error"; memcncrn, ECog Self "Concerned about memory"; mdate, ECog Informant "Remember current date"; mdatep, ECog Self "Remember current date"; phon, Phonemic/Letter Fluency; sem, Semantic Memory; vrmem, Verbal Episodic Memory; wm, Working Memory.

3.5. Study cohort comparison

We used the out‐of‐sample predictions of diagnosis to evaluate prediction accuracy in the combined sample as well as differences across study cohorts. First, we evaluated multiclass accuracy metrics in the different samples (Table S4 in supporting information). Accuracy was highest for the ADRC cohort and was somewhat higher than for the combined sample. Accuracy was lower for KHANDLE compared to ADRC and was lowest with respect to most metrics for LA90. We also examined the accuracy of two specific dichotomous diagnoses—cognitive impairment (MCI or dementia) versus normal cognition and dementia versus no dementia (normal cognition or MCI). ROC curves for these two comparisons are presented in Figure 5. Again, the highest accuracy was observed for ADRC. Better accuracy in ADRC is expected because informant‐reported independent function measures were available only in that cohort. The ADRC cohort also had the longest longitudinal follow‐up, and this might have contributed to improved prediction accuracy.

FIGURE 5. ROC curves for study cohorts comparing clinical diagnosis with the out‐of‐sample prediction of clinical diagnosis. Plot A shows results for a diagnosis of dementia versus non‐dementia (normal cognition or MCI); Plot B is for a diagnosis of cognitive impairment (MCI or dementia) versus normal cognition. AUC, ROC area under the curve (95% confidence interval). ADRC, Alzheimer's Disease Research Center; KHANDLE, Kaiser Healthy Aging and Diverse Life Experiences Study; LA90, LifeAfter90 Study; MCI, mild cognitive impairment; ROC, receiver operating characteristic.

3.6. Follow‐up assessment comparison

We used the out‐of‐sample predictions of diagnosis to test whether prediction accuracy differed across longitudinal follow‐ups. Lag‐1 and lag‐2 cognition values were missing by definition for Wave 1, and lag‐2 cognition values were missing for Wave 2. We compared the correspondence of diagnosis to the out‐of‐sample predictions in Waves 1 through 4 in a sample that had completed 4+ assessments (Table 3). Accuracy improved across sequential waves for most metrics, but the Wave 1 versus Wave 2 difference was significant only for negative predictive value. The Wave 2 versus Wave 3 difference was significant for kappa and ROC AUC, and the Wave 3 versus Wave 4 difference was significant for kappa, specificity, and balanced accuracy. Improved accuracy for Waves 3+ could be due to the availability of lagged cognition data for later follow‐up assessments. To evaluate this hypothesis, we compared models with and without lagged cognition variables, and accuracy did not substantially differ (not shown). This suggests that factors other than the inclusion of lagged cognition features influenced increasing accuracy across assessments.

TABLE 3.

Accuracy metrics comparing clinical diagnosis to the out‐of‐sample prediction of clinical diagnosis across assessment Waves 1–4.

Wave   | acc   | kapwq | auc   | sens  | spec  | ppv   | npv   | bal_acc
Wave 1 | 0.767 | 0.589 | 0.811 | 0.767 | 0.703 | 0.753 | 0.774 | 0.735
Wave 2 | 0.754 | 0.625 | 0.833 | 0.754 | 0.716 | 0.735 | 0.824 | 0.735
Wave 3 | 0.750 | 0.680 | 0.855 | 0.750 | 0.737 | 0.729 | 0.841 | 0.743
Wave 4 | 0.768 | 0.737 | 0.867 | 0.768 | 0.780 | 0.754 | 0.849 | 0.774

Note: Wave 1 had no lagged cognitive variables as features, Wave 2 had lag‐1 cognitive features, and Waves 3 and 4 had lag‐1 and lag‐2 cognitive features. Bolded values indicate that the value was significantly different from the previous row (bootstrap confidence interval for difference did not include 0). Metrics: acc, accuracy; kapwq, weighted kappa (quadratic); auc, area under curve (multiclass); sens, sensitivity (multiclass); spec, specificity (multiclass); ppv, positive predictive value (multiclass); npv, negative predictive value (multiclass); bal_acc, balanced accuracy (multiclass).

4. DISCUSSION

This project leveraged data from three longitudinal cohort studies of older adults to develop models for predicting clinical diagnosis using the XGBoost modeling platform. A primary motivation for this project was to determine whether clinical diagnosis could be algorithmically assigned without the necessity of a comprehensive clinical exam and expert clinician adjudication. Diagnostic accuracy in the combined sample was ≈ 75% with a weighted kappa value of 0.70 and multiclass ROC AUC of 0.86, values that indicate moderate to strong agreement. Agreement was strongest in the ADRC cohort; salient differences of this cohort from KHANDLE and LA90 were that informant‐based measures of independent function were available and there was longer longitudinal follow‐up.

Secondary analyses that incorporated CDR variables as features yielded the highest diagnostic accuracy. These results are especially relevant for the ADRC cohort because the CDR is not used in the clinical diagnosis process. Diagnostic accuracy of the CDR model in this cohort was strong and compares favorably to other studies; for example, Yi et al. 34 used XGBoost with an Alzheimer's Disease Neuroimaging Initiative (ADNI) sample and reported 87.6% multiclass accuracy (compared to 81.2% in this study) and 0.906 ROC AUC (0.914 in this study). Criterion contamination is a concern for ADNI‐based studies because CDR results, used as predictive features, are an explicit part of the diagnostic criteria in ADNI. These ADNI studies also used a broader range of features, including magnetic resonance imaging (MRI) and positron emission tomography neuroimaging, genetics, and cerebrospinal fluid biomarkers. The current ADRC results indicate that the CDR can be a valuable predictive feature and show that a high level of accuracy can be achieved without multimodal biomarkers in a setting in which CDR results are independent of clinical diagnosis.

Studies using machine learning methods to predict clinical diagnosis have proliferated in recent years. 6 Many studies have used the ADNI dataset, and the majority have included MRI variables as predictive features. Diagnostic accuracy has ranged from modest (≈ 50%) to very high (95+%) depending on the specific diagnostic comparison, the sample, and the predictive features. Recent, comprehensive studies 9 , 10 used the large National Alzheimer's Coordinating Center (NACC) database along with other multimodal clinical studies. Criterion contamination might be a concern for these studies; for example, neuropsychological test variables used as features are also used for clinical diagnosis at some data collection sites. Another limitation of these studies is that recruitment tends to be guided by clinical protocols, and consequently, generalizability to the older adult population is not an emphasis.

The current study addresses an important knowledge gap—it shows how algorithmic diagnosis might function in a more representative sample of the older adult population in the absence of a comprehensive clinical evaluation and specialized imaging and biomarker measurement. The algorithmic diagnosis in this study provides substantial information about the clinical status of individual participants. There are potential advantages of algorithmic diagnosis that may make it a useful tool. First, standardization is a major advantage. Once a model is developed, the algorithm will apply the same decision rules for all future individuals and assessments. Second, algorithmic diagnosis does not require a comprehensive clinical evaluation and expert clinicians, and this can reduce resource use/cost and burden to patients/participants.

Interpretability of machine learning models has been increasingly emphasized. The feature importance and Shapley value results in this study help to clarify how model‐based predictions are made and, indirectly, contribute to understanding of what clinical diagnosis represents. For example, results of this study show that clinical diagnosis was most strongly associated with cognitive functioning, with episodic memory predominating over other cognitive domains. Moreover, this general pattern held for diagnosing normal cognition, MCI, and dementia. Demographic characteristics, especially education and age, were influential, but to a lesser extent than cognition. Independent function also was influential, but less so than cognition, and appeared to be more important for a diagnosis of dementia than for normal and MCI diagnoses. These results correspond to clinical expectations and could be generated with more traditional statistical methods that do not involve machine learning. However, an advantage of the XGBoost decision tree–based algorithm is that it efficiently searches for complex interactions among features and is able to incorporate these interactions into decision models in a way that would be cumbersome with traditional multivariable regression models.

We incorporated longitudinal change from previous assessments into our predictive algorithm, an approach that has not been well represented in previous literature. This allowed us to examine how change from prior assessments contributes to clinical diagnosis beyond contributions of concurrent predictive features. Concurrent variables, especially tests of episodic memory, were the strongest features for predicting clinical diagnosis, though change from previous assessments added to the prediction. Higher episodic memory scores in previous assessments slightly increased the probability of a less impaired diagnosis over what would be expected based on the concurrent value alone. On the surface, this finding seems counterintuitive—it suggests that greater decline from previous assessments is associated with a lower likelihood of cognitive impairment. However, the vrmem change effects were independent of the much stronger concurrent vrmem effect. In addition, the positive impact of the lagged change across two assessments was substantially attenuated compared to the one‐lag change. These results merit further study to understand how cognitive level and rate of decline jointly contribute to clinical diagnosis and show how the model generated in this study can lead to new understanding of complex associations among variables.

We were also able to examine how longitudinal change prior to diagnosis contributed to diagnosis prediction. Prediction accuracy improved across follow‐up assessments, but this did not seem to result from the lagged cognition features available for follow‐up assessments because results did not differ when lagged cognition variables were or were not included in the feature set. An alternative explanation is that diagnosis is more predictable when more follow‐up is available. Access to follow‐up data provides more information and the ability to observe longitudinal trends, and this may increase both the accuracy and reliability of clinical diagnosis, which in turn increases the ability of the model to predict diagnosis. This is a hypothesis that merits further study.

4.1. Strengths and limitations

A major strength of this study was the availability of three longitudinal cohorts with common data components. All three cohorts were composed of demographically diverse older adults, so this study captures heterogeneity of demographic differences better than much of the previous literature on machine learning approaches to diagnosis. There was also heterogeneity of clinical diagnosis. All three cohorts included the same high‐quality, non‐diagnostic cognitive assessments that anchored the diagnosis prediction models, and there was substantial longitudinal follow‐up. The resulting combined sample consisted of > 10,000 clinical assessments that could be used to develop prediction models, and the inclusion of different cohorts allowed us to incorporate cohort differences into this model. An important limitation was that two of the cohorts did not collect informant‐reported measures of independent function. This reflects realities of community‐based research in which identifying an informant and coordinating additional assessments can be challenging and requires additional resources. The low prevalence of dementia in the KHANDLE cohort was a limitation. The clinical diagnosis process differed somewhat across cohorts, and this is a potential limitation. Most notably, all ADRC diagnoses were made by clinicians in the context of a traditional multidisciplinary case conference, while the KHANDLE and LA90 diagnoses used standardized decision rules applied to quantitative data features that were collected in the clinical evaluations of those projects. Finally, these results may be specific to the features used in this study. Future work is needed to validate the general pattern of results in different samples using different cognitive measures.

4.2. Implications for future studies

This study is the first phase of a project to address whether machine learning methods can provide a practical and meaningful alternative to clinical evaluation and diagnosis. This report describes model development and shows correspondence of model‐predicted diagnosis and clinical diagnosis but does not address whether the predicted diagnosis can serve as an alternative when clinical diagnosis is not available. Clinical diagnosis summarizes the individual's clinical status at the time of assessment, and this conveys important information about underlying brain integrity and prognosis for future clinical progression. The correspondence of clinical and algorithmic diagnosis in this study was less than perfect. Does this difference mean that the model‐predicted diagnosis does not capture relevant disease status information inherent in clinical diagnosis, or might it reflect unreliability in clinical diagnosis, predicted diagnosis, or both? The next phase of this project will compare model‐predicted diagnosis with clinical diagnosis to determine (1) how closely longitudinal change in diagnosis tracks across the two forms of diagnosis, (2) how well model‐predicted diagnosis predicts future clinical progression with respect to both diagnosis change and continuous cognitive decline, and (3) how the two forms of diagnosis relate to biomarkers of degenerative brain diseases. These results will provide information for evaluating the potential utility of machine learning approaches to diagnosis and defining the contexts and scope of utility of these methods.

CONFLICT OF INTEREST STATEMENT

The authors declare no conflicts of interest. Author disclosures are available in the supporting information.

CONSENT STATEMENT

Participants in all three studies provided written informed consent, and all human subjects involvement was overseen by institutional review boards at UC Davis and Kaiser Permanente Northern California.

Supporting information

Supporting Information: ALZ-21-e70508-s001.pdf (3.2 MB)

Supporting Information: ALZ-21-e70508-s002.docx (844.4 KB)

ACKNOWLEDGMENTS

This work was supported by research grants from the National Institute on Aging: P30AG072972 (DeCarli, Whitmer PIs), R01AG052132 (Whitmer, Gilsanz, Glymour, Mayeda, PIs), R01AG056519 (Whitmer, Corrada, Gilsanz, PIs), R01AG031563 (Farias, Fletcher, PIs), K99AG075317 (Hayes‐Larson, PI). The funding source was not involved in study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the article for publication.

Mungas D, Gavett B, Rojas‐Saunero LP, et al. Machine learning diagnosis of cognitive impairment and dementia in harmonized older adult cohorts. Alzheimer's Dement. 2025;21:e70508. 10.1002/alz.70508

