Abstract
INTRODUCTION
Digital speech biomarkers (DSBs) can support the detection and monitoring of Alzheimer's disease (AD) in Latinos. However, they have not been benchmarked against standard cognitive and neuroimaging measures, leaving a critical validation milestone unmet.
METHODS
Thirty‐three AD patients and 33 healthy controls completed verbal fluency tasks, episodic memory and executive tests, and magnetic resonance imaging (MRI) (volume) and functional MRI (fMRI) (connectivity) scans. Between‐group machine learning classification was compared among fluency‐derived DSBs, episodic and executive test scores, MRI, and fMRI measures.
RESULTS
The fluency classifier's performance (area under the curve [AUC] = 0.84) was comparable (p > 0.14) to the episodic (AUC = 0.90), executive (AUC = 0.79), and structural (AUC = 0.90) classifiers and superior to the functional classifier (AUC = 0.65, p = 0.002). Top discriminating features were word length and frequency, both associated with right (pre)frontal volume upon adjusting for sociodemographic factors.
DISCUSSION
DSBs appear non‐inferior to standard cognitive and imaging measures, supporting scalable AD assessments in Latinos.
Highlights
We examined digital speech biomarkers (DSBs) for detecting AD in Latinos.
DSBs were benchmarked against cognitive and neuroimaging features.
DSB‐based classifiers matched or outperformed cognitive and brain classifiers.
Top DSBs included word length, phonological neighborhood, and frequency.
Word length and frequency correlated with right (pre)frontal brain volume.
Keywords: Alzheimer's disease, automated speech and language analysis, cognitive assessments, digital biomarkers, machine learning, magnetic resonance imaging
1. BACKGROUND
Alzheimer's disease (AD) is characterized by primary episodic memory and executive deficits, 1 medial temporal lobe atrophy, 2 aberrant default mode network (DMN) connectivity, 3 and amyloid beta (Aβ) and tau pathology. 4 A leading cause of disability and death, 1 AD reduces quality of life for patients and caregivers, 1 affecting physical, social, and mental well‐being and generating massive annual (health)care costs. 1 Its burden is particularly severe in Latin America, 5 where cases multiply amid a lack of biofluid and neuroimaging devices, culturally adapted tests, and specialized clinicians. 6 The need has arisen for scalable markers enabling AD detection and monitoring in this population. 7
Prominent among these are digital speech biomarkers (DSBs) – acoustic and linguistic anomalies derived automatically from verbal production. These low‐cost, time‐efficient metrics allow for detecting AD, 8 , 9 predicting its onset preclinically, 10 , 11 and differentiating it from other dementias. 12 , 13 Moreover, they capture cognitive symptom severity, 12 , 14 underlying brain atrophy patterns, 12 , 15 and Aβ status, 15 reinforcing clinical utility. 7 While most evidence comes from English speakers, 14 accumulating research on Latinos 7 has revealed that (semi)spontaneous speech features are sensitive to AD but not to other disorders 13 , 16 and that some support their detection even when models are trained on English speakers. 17 Additional results come from verbal fluency tasks, with word property (WP) 12 and word organization 18 metrics proving sensitive to AD. Promisingly, apps capturing these patterns are being deployed across Latin American healthcare settings. 19 , 20
A good balance between sensitivity and interpretability is achieved by metrics tapping into semantic memory, which is typically affected by AD. 21 First, speech timing (ST) features – quantifying pauses and spoken segments – capture word retrieval efficiency. Individuals with or at heightened risk for AD exhibit more and longer pauses than healthy controls (HCs), 22 reaching classification accuracies of up to ∼0.85. 23 , 24 , 25 , 26 Second, WP features index vocabulary search patterns, 12 , 27 , 28 revealing more frequent, less specific terms in AD patients than in HCs, yielding area under the receiver operating characteristic curve (AUC) scores as high as 0.89, 9 , 12 , 27 and tracking cognitive symptom severity. 12 Findings from diverse populations, including Latinos, show that both ST and WP features predict temporal/frontal atrophy 11 , 12 and abnormal DMN connectivity. 12 Importantly, these metrics can be reliably captured from verbal fluency tasks, which are brief, engaging, and widely used worldwide. Thus, they represent powerful candidate markers for Latinos. 7
However, ST and WP metrics lack validation against clinical and neuroimaging benchmarks, a key milestone to prevent advocating suboptimal measures and increasing inequities in Latin America. 29 Classification using WP and ST features approximates that of episodic, executive, and other cognitive tests (AUC = 0.80 to 0.97; accuracy = 0.77 to 0.96), 30 , 31 , 32 as well as that of MRI features such as hippocampal volume (AUC = 0.87 to 0.99; accuracy = 0.85 to 0.99). 33 Yet, these figures stem from different populations and methodologies, limiting comparability and interpretability. Only a few studies, on non‐Latino participants, have contrasted digital speech metrics with cognitive outcomes in the same cohort (revealing similar or better discrimination), 10 , 15 , 34 , 35 and none has done so against neuroimaging measures in any population, let alone targeting ST and WP metrics. Briefly, a critical validation milestone is missing for this thriving field.
Here, we benchmarked ST and WP metrics against cognitive and imaging measures in Latinos. ST and WP metrics were extracted from two 1‐min verbal fluency tasks. 36 Episodic memory features were derived from immediate and delayed recall tests, while executive features came from short‐term memory, working memory, attention, and mental set‐shifting tests. Volumetric and connectivity features were extracted from MRI and fMRI recordings, respectively. Machine learning classifiers were independently trained on each feature set to discriminate between patients and HCs, and their outcomes were statistically compared. The most discriminative fluency features were further examined via brain–behavior correlations.
We hypothesized that classification via ST and WP metrics would approximate that of episodic, executive, and imaging measures. Also, we expected classification to be driven by pause‐related features, 11 , 24 word frequency, granularity, and phonological neighborhood. 12 Finally, we anticipated that such top features would correlate with the volume and connectivity of temporal and frontal brain regions. 11 , 12 This approach fills a key validation gap for speech biomarkers of AD in Latinos.
RESEARCH IN CONTEXT
Systematic review: We reviewed studies on DSBs in AD using PubMed and Google Scholar. We considered papers in any language from Quartile 1 and 2 journals. References in each hit were then scanned for additional works. Few studies compared AD classification performance using DSBs versus cognitive metrics. None of them compared DSBs and neuroimaging measures, let alone in Latin American samples. All pertinent studies were duly referenced.
Interpretation: Our study suggests that AD classification through DSBs is non‐inferior to that achieved using standard measures.
Future directions: Replications should be conducted in larger, more diverse samples. Also, larger feature sets should be used, employing further cognitive and neuroimaging measures, and including social determinants of health and biofluid biomarkers. Finally, our approach should be tested across different languages to meet recent calls for cross‐linguistic equity in the field.
2. METHODS
The study's methods are summarized in Figure 1.
FIGURE 1.

Experimental design. (A) Data acquisition. We recruited (A1) 33 individuals with AD and 33 HCs, who completed (A2) phonemic and semantic fluency tasks, (A3) cognitive tests tapping on episodic memory (Craft story and FCSRT) and executive function (TMT‐A, TMT‐B, forward digit span, and backward digit span), and (A4) MRI and fMRI brain scans. (B) Feature extraction. (B1) Fluency features consisted of word property and speech timing metrics from each fluency task. (B2) Episodic features were derived from normalized immediate and delayed scores from the Craft story task and FCSRT. (B3) Executive features were derived from TMT‐A and B time and errors, and digit forward and backward span. (B4) Structural features consisted of AD‐implicated ROI volume derived from MRI scans. (B5) Functional features were obtained from the internal connectivity of canonical networks, derived from fMRI scans. (C) Machine learning pipeline. We trained multiple models for each feature group through nested cross‐validation (with 11 inner folds and three outer folds), using 10 random seeds and bootstrapping with 1000 resamples. We then evaluated performance in terms of AUC and accuracy, comparing the best classifier of each category. Finally, the most important features of the best fluency classifier were correlated with gray matter volume in a whole‐brain VBM analysis in SPM12. AD, Alzheimer's disease; AUC, area under receiver operating characteristic curve; FCSRT, Free and Cued Selective Reminding Test; fMRI, functional magnetic resonance imaging; HC, healthy control; MRI, magnetic resonance imaging; ROI, region of interest; SPM12, statistical parametric mapping 12; TMT, trail making test; VBM, voxel‐based morphometry.
2.1. Participants
Participants were selected, as in previous research, 12 from a larger pool of 73 individuals (38 AD patients, 35 HCs) through an automated algorithm that ensured balanced groups and statistical matching for sex, handedness, age, years of education, and acquisition center (Table 1). The final sample comprised 66 native Spanish speakers (33 with AD, 33 HCs), providing an adequate feature‐to‐sample ratio for machine learning analysis under the n − 1 criterion 37 (Section 2.6). Participants were recruited from two centers of the Multi‐Partner Consortium to Expand Dementia Research in Latin America (ReDLat), following unified procedures that included extensive neurological and neuropsychological interviews and evaluations. 38 , 39 A proportion of these participants (11 of the 33 AD patients, and 13 of the 33 HCs) were part of a previous report by Ferrante et al. 12
TABLE 1.
Participants’ sociodemographic and cognitive profiles.
| Variable | AD patients (n = 33) | HCs (n = 33) | Stats | p‐value |
|---|---|---|---|---|
| Sociodemographic profile | ||||
| Sex (F:M) a | 21:12 | 17:16 | 0.558 | 0.455 |
| Handedness (L:R) a | 3:30 | 0:33 | 1.397 | 0.237 |
| Age, years b | 72.18 (8.62) | 70.18 (8.43) | 0.953 | 0.344 |
| Education, years b | 14.12 (4.12) | 15.24 (4.04) | −1.115 | 0.269 |
| Acquisition center (1:2) a | 12:21 | 16:17 | 0.558 | 0.455 |
| Cognitive profile | ||||
| MMSE c | 22.45 (3.75) | 29.03 (1.33) | −9.488 | <0.001* |
| Normalized immediate recall score b | 0.18 (0.16) | 0.60 (0.16) | −10.987 | <0.001* |
| Normalized delayed recall score d | 0.11 (0.16) | 0.65 (0.23) | 34.0 | <0.001* |
| Digits forward d | 4.76 (1.39) | 5.39 (1.12) | 379.0 | 0.030* |
| Digits backward d | 3.18 (1.01) | 4.12 (1.11) | 300.5 | 0.001* |
| TMT‐A time d | 104.97 (54.27) | 52.94 (21.75) | 823.5 | <0.001* |
| TMT‐A errors d | 0.26 (0.63) | 0.28 (0.72) | 497.0 | 0.991 |
| TMT‐B time d | 409.12 (308.78) | 134.59 (76.34) | 893.0 | <0.001* |
| TMT‐B errors d | 1.89 (2.06) | 1.00 (1.76) | 605.5 | 0.015* |
Note: Data presented as mean (SD), except for sex and handedness.
Abbreviations: AD, Alzheimer's disease; F, female; HC, healthy control; L, left‐handed; M, male; MMSE, Mini‐Mental State Examination; R, right‐handed; TMT‐A, Trail‐Making Test Part A; TMT‐B, Trail‐Making Test Part B.
a P values calculated via chi‐squared test (χ2).
b P values calculated with independent samples Student's t‐test.
c P values calculated with independent samples Welch's t‐test.
d P values calculated with independent samples Mann‐Whitney U test.
* Significant at p < 0.05.
Scores were normalized via the min‐max method as some participants were tested with the Craft story retelling task and others with the Free and Cued Selective Reminding Test.
AD participants were diagnosed by expert neurologists in line with standardized, validated protocols, 40 following clinical criteria from the National Institute of Neurological and Communicative Disorders and Stroke 41 as well as the Alzheimer's Disease and Related Disorders Association. 42 Biomarker data were unavailable for this cohort. Analysis of MRI and fMRI scans (Supporting Information 1) revealed bilateral hippocampal atrophy (Supporting Information 2) and reduced connectivity between (para)hippocampal and frontal regions as well as within the DMN (Supporting Information 3) in AD patients. They were also impaired in overall cognitive skills, as measured by the Mini‐Mental State Examination (MMSE) 43 ; episodic memory, examined with the Craft story retelling task 44 and the Free and Cued Selective Reminding Test (FCSRT) 45 ; short‐term and working memory, assessed through standard forward and backward digit span tasks, respectively 43 ; and attention and cognitive flexibility, as evaluated via the Trail‐Making Test Parts A and B (TMT‐A, TMT‐B), 43 respectively – see Sections 2.3 and 2.4 for task details. HCs were recruited through an outreach program and invitations extended to eligible caregivers of patients. They were confirmed to be functionally autonomous and cognitively preserved, with MMSE scores above the local mean of 26. 46 No participant reported other neurological disorders, psychiatric conditions, primary language deficits, or substance abuse.
All participants provided written informed consent in accordance with the Declaration of Helsinki. The study was approved by the institutional ethics committee.
2.2. Fluency tasks and features
All participants completed phonemic and semantic fluency tasks (Figure 1A2), requiring them to utter as many words as possible starting with the phoneme /p/ and fitting the category “animals,” respectively. A certified neuropsychologist administered these tasks in a counterbalanced order in a silent testing room. Following reported procedures, 12 , 43 participants had 60 s per task and were instructed to avoid proper names, numbers, repetitions, words from the same family, or morphological variations of the same word. Instructions were given orally, including examples of invalid responses.
All responses were audio‐recorded and automatically transcribed using Google's Speech‐to‐Text Application Programming Interface. Transcriptions were then checked and edited by at least two examiners following a standardized protocol. 12 Unintelligible words, metalinguistic comments, and interventions from the evaluators were discarded. As in recent work, 12 features were extracted from all remaining words, regardless of their validity in terms of standard scoring criteria. All participants produced at least five words per task.
2.2.1. Speech timing features
Audio recordings were preprocessed with the Toolkit to Examine Lifelike Language (TELL) app. 19 , 20 They were converted to .wav format, downmixed to mono (single channel), encoded in GSM Full Rate, resampled to 16 kHz, compressed with a factor of 8, bandpass‐filtered from 200 Hz to 3.4 kHz, loudness‐normalized using the FFmpeg multimedia framework, and denoised through the FullSubNet model. Finally, voice activity was detected to segment audio signals into speech and non‐speech regions.
Then, using a PRAAT script, 47 we automatically extracted the following ST features from each recording: number of phonated segments (peaks in intensity preceded and followed by dips in intensity), number of pauses (silent intervals), articulation rate (number of syllables over phonation time), mean and standard deviation of phonated segment duration, and mean and standard deviation of pause duration. Therefore, each participant had seven ST features per task (Figure 1B).
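As a concrete sketch, the timing features above can be derived from a voice‐activity segmentation such as the one produced in preprocessing. The toy function below summarizes a per‐frame speech/non‐speech sequence; it is an illustrative re‐implementation under simplified assumptions (the study used a Praat script on TELL‐preprocessed audio), and all names are hypothetical.

```python
from statistics import mean, pstdev

def speech_timing_features(vad, frame_s=0.01, n_syllables=None):
    """Summarize a per-frame voice-activity sequence (True = speech).

    Returns counts and duration statistics for phonated segments and pauses;
    articulation rate is computed only if a syllable count is supplied.
    Illustrative sketch, not the original Praat script.
    """
    runs = []  # run-length encoding: [is_speech, n_frames]
    for frame in vad:
        if runs and runs[-1][0] == frame:
            runs[-1][1] += 1
        else:
            runs.append([frame, 1])
    seg_durs = [n * frame_s for is_sp, n in runs if is_sp]
    # Count only internal silences as pauses (ignore leading/trailing silence)
    inner = runs[1:-1] if len(runs) > 2 else []
    pause_durs = [n * frame_s for is_sp, n in inner if not is_sp]
    feats = {
        "n_segments": len(seg_durs),
        "n_pauses": len(pause_durs),
        "segment_dur_mean": mean(seg_durs) if seg_durs else 0.0,
        "segment_dur_sd": pstdev(seg_durs) if len(seg_durs) > 1 else 0.0,
        "pause_dur_mean": mean(pause_durs) if pause_durs else 0.0,
        "pause_dur_sd": pstdev(pause_durs) if len(pause_durs) > 1 else 0.0,
    }
    if n_syllables is not None and seg_durs:
        feats["articulation_rate"] = n_syllables / sum(seg_durs)
    return feats
```

With mean and SD of segment and pause durations plus the three counts/rates, this yields the seven ST features per task described above.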
2.2.2. Word properties
As in prior studies, 12 we extracted six lexico‐semantic properties from each of the participants’ transcribed responses. First, each word's granularity was determined using the Python library NLTK to access WordNet, a hierarchical graph of nodes representing increasingly specific concepts leading from the broadest hypernym (“entity”) to more specialized terms (e.g., “animal,” “dog,” “bulldog”). Granularity refers to the number of nodes between a word and the top‐node “entity.” For instance, bin‐3 words are closer to “entity” than bin‐10 words, with the former indicating more general concepts. As in previous research, 12 , 16 Spanish responses were automatically translated into English for this analysis.
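The granularity metric can be illustrated with a minimal stand‐in for WordNet: a hand‐built hypernym chain in which each word points to its parent concept. In the actual pipeline, NLTK's WordNet interface supplies this hierarchy; the mini‐graph and function below are hypothetical and for illustration only.

```python
# Toy hypernym graph standing in for WordNet: child -> parent concept.
HYPERNYMS = {
    "bulldog": "dog",
    "dog": "canine",
    "canine": "mammal",
    "mammal": "animal",
    "animal": "organism",
    "organism": "entity",
}

def granularity(word, hypernyms=HYPERNYMS):
    """Number of nodes between a word and the top node 'entity'.

    Higher values indicate more specific (finer-grained) concepts,
    lower values more general ones."""
    depth = 0
    node = word
    while node != "entity":
        node = hypernyms[node]  # raises KeyError for words outside the graph
        depth += 1
    return depth
```

Under this scheme, “bulldog” receives a higher granularity than “dog,” which in turn exceeds “animal,” mirroring the bin logic described above.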
Additionally, we used the EsPal database 48 to extract each response's word frequency (logarithmic frequency per million), phonological neighborhood (number of words generated by substituting, adding, or omitting a phoneme), syllabic length (number of syllables per word), phonemic length (number of phonemes per word), and concreteness (rated on a scale from 1: not concrete to 7: highly concrete). For word concreteness, missing values were imputed by retrieving the 20 closest words using the 150‐dimensional FastText model for Spanish and subsequently selecting the concreteness value from the closest word available in the EsPal database using a Python script. We calculated the mean and standard deviation (SD) of each variable per subject. Therefore, each participant had 12 WP features per task (Figure 1B).
2.3. Episodic memory tasks and features
Episodic memory was assessed through the Craft story retelling task 44 in 19 AD patients and 17 HCs and through the FCSRT 45 in 14 AD patients and 16 HCs. For the Craft story task, participants were read a short narrative and asked to recall it immediately and again after a 20‐min delay, with performance quantified as the number of story units correctly recalled at each time point. 44 For the FCSRT, participants learned words under controlled encoding with semantic cues and completed recall immediately and after a 20‐min delay, with performance quantified using total recall scores. 45 Both tasks yielded an immediate recall score and a delayed recall score, with higher scores indicating better episodic memory performance. To render outcomes comparable across tasks, immediate and delayed recall scores were normalized between tasks via min–max normalization. Episodic classifiers were trained with two features: normalized immediate recall score and normalized delayed recall score (Figure 1B2).
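A minimal sketch of the min–max step: scores are rescaled within each task before pooling, so that Craft story and FCSRT outcomes share a common [0, 1] range. The score arrays below are hypothetical, not the study's data.

```python
def min_max_normalize(scores):
    """Rescale one task's raw recall scores to [0, 1] so that scores from
    different instruments (here, Craft story vs FCSRT) become comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]  # degenerate case: no spread
    return [(s - lo) / (hi - lo) for s in scores]

# Hypothetical raw immediate-recall scores, one list per instrument
craft_immediate = [10, 14, 22, 30]
fcsrt_immediate = [20, 28, 44, 48]
# Normalize within each task, then pool across participants
pooled = min_max_normalize(craft_immediate) + min_max_normalize(fcsrt_immediate)
```

After this step, a participant's normalized score reflects their standing relative to others tested with the same instrument, which is what allows the two recall features to be pooled into a single episodic feature set.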
2.4. Executive tasks and features
Executive assessments were conducted to evaluate four core domains affected by AD. Short‐term and working memory were assessed via forward and backward digit span tests, respectively. 43 The former requires participants to recall a sequence of numbers in the same order, while the latter involves recalling the sequence in reverse order. In both tasks, sequence length increases progressively. Participants must correctly recall each sequence twice to advance, whereas two consecutive incorrect responses result in task termination. Performance is scored based on the maximum sequence length correctly recalled in two trials, with no upper limit. 43 Attention and mental set shifting were evaluated with the TMT‐A and TMT‐B, two tests with good specificity and sensitivity. 43 In TMT‐A, participants are asked to connect 25 numbers randomly arranged on a page in ascending order. In TMT‐B, they must alternate between 25 numbers and letters (e.g., 1‐A‐2‐B). Scoring is based on the completion time of each task, which ranges from 1 to 300 seconds, with both tasks taking approximately 10 min to complete. 43 Executive classifiers were trained using six features: the short‐term memory score, the working memory score, TMT‐A time and errors, and TMT‐B time and errors (Figure 1B3).
2.5. Neuroimaging protocol and brain features
2.5.1. MRI and fMRI data acquisition and preprocessing
MRI and fMRI recordings were performed in two scanners, with minimally varying parameters (Supporting Information 1). As in previous research, 12 acquisition center was introduced as a covariate in all analyses.
Participants’ gray matter volume was estimated via voxel‐based morphometry, following reported protocols. 49 Preprocessing was performed with the CAT12 toolbox for SPM12 49 running on MATLAB R2021a. Scans were automatically set with origins to the anterior commissure and reoriented. Images were segmented between gray and white matter, normalized to the MNI space based on a Dartel estimation, and smoothed. Total intracranial volume (TIV) was estimated using native‐space tissue maps obtained via a CAT12 module.
For fMRI preprocessing, scans were manually set with origins to the anterior commissure and reoriented, and the first five volumes were removed. 12 Functional and anatomical data were then preprocessed using a standard preprocessing pipeline in the CONN version 22.a toolbox. 50 This included (1) realignment with correction of susceptibility distortion interactions; (2) outlier detection using ART (global signal z‐value threshold = 9, subject‐motion mm threshold = 2); (3) direct segmentation, MNI‐space normalization and resampling to 3 mm isotropic voxels; (4) smoothing with a Gaussian kernel of 8 mm full width at half maximum; (5) denoising (regression of potential confounding effects characterized by white matter, cerebrospinal fluid, realignment, and scrubbing); and (6) bandpass filtering between 0.008 and 0.09 Hz. Quality control measures did not differ between groups (Supporting Information 1). Still, motion artifacts were regressed out during the default denoising step implemented in the CONN toolbox.
2.5.2. Structural brain features
For the structural features, based on the Automated Anatomical Labelling 3 (AAL3) atlas, 51 we estimated the volumes of regions of interest (ROIs) typically atrophied in AD. 2 We used the CAT12 “ROI estimation” function to extract volumetric measures for the following ROIs: hippocampus, parahippocampal gyrus; superior, middle, and inferior temporal gyri; temporal pole (superior and middle temporal gyri); precuneus; inferior frontal gyrus (opercular and triangular parts); rolandic operculum; superior frontal gyrus (medial orbital); gyrus rectus; insula; anterior cingulate cortex (subgenual, pregenual, and supracallosal parts); middle cingulate and paracingulate gyri; and posterior cingulate gyrus. Each ROI was analyzed separately for the left and right hemispheres, yielding a total of 38 structural features, with which the structural classifiers were trained (Figure 1B4).
2.5.3. Functional features
Functional brain features were derived using ROI‐to‐ROI connectivity analysis in the CONN toolbox, based on the AAL3 atlas. 51 Fisher‐transformed bivariate Pearson correlation coefficient matrices were extracted as absolute values using custom Python 3.9 scripts with the SciPy library, capturing the strength of functional interactions irrespective of sign. 52 Eight resting‐state networks were examined: (1) default mode, (2) language, (3) executive control, (4) salience/ventral attention, (5) visual, (6) somatomotor, (7) limbic, and (8) memory networks (Supporting Information 3). The global correlation value was also included as an additional feature, yielding a total of nine functional features, with which the functional classifiers were trained (Figure 1B5).
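Conceptually, each network feature reduces to the mean absolute Fisher‐transformed correlation among that network's ROIs. A minimal sketch of this reduction follows; the function name and inputs are hypothetical, and the actual values came from CONN's ROI‐to‐ROI analysis.

```python
import math

def network_connectivity(corr_matrix, roi_idx):
    """Mean absolute Fisher-z connectivity among a network's ROIs.

    corr_matrix: symmetric matrix of Pearson r values between all ROIs.
    roi_idx: indices of the ROIs belonging to one network.
    Absolute values capture interaction strength irrespective of sign."""
    zs = []
    for a in range(len(roi_idx)):
        for b in range(a + 1, len(roi_idx)):  # upper triangle only
            r = corr_matrix[roi_idx[a]][roi_idx[b]]
            zs.append(abs(math.atanh(r)))  # Fisher r-to-z transform
    return sum(zs) / len(zs)
```

Applying this to each of the eight networks, plus a global correlation value, yields the nine functional features described above.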
2.6. Machine learning
Machine learning analyses were conducted to classify between patients and HCs based on (a) fluency features, (b) episodic memory features, (c) executive features, (d) structural brain features, and (e) functional brain features (Figure 1C). Fluency classifiers were trained using (i) ST features alone, (ii) WP features alone, and (iii) the combination of both. Episodic memory classifiers were trained with normalized immediate and delayed recall scores. Executive classifiers were trained using a combination of digit span forward and backward scores and TMT‐A and TMT‐B outcomes (completion time and errors). Structural brain classifiers were based on the volume of the aforementioned AD‐relevant ROIs. Functional brain classifiers were trained considering connectivity within each of the networks listed above. Additional structural and functional brain classifiers were run with alternative feature sets (Supporting Information 4).
For each feature set, we employed a stratified nested cross‐validation procedure, where label balance is maintained across folds. 53 The outer loop used three folds for model evaluation, while the inner loop used 11 folds for hyperparameter tuning and feature selection. Following prior studies, 12 inside the inner loop, feature values were standardized using z‐score normalization, 54 and missing data were imputed using a k‐nearest neighbors approach (k = 5). This entire process was repeated across 10 random seeds. We tested three classification algorithms: XGBoost (a gradient‐boosting ensemble method that combines the predictions of several weak learners, with regularization to reduce the risk of overfitting), 55 k‐nearest neighbors, and logistic regression, to consider both linear and non‐linear decision boundaries. Hyperparameter tuning was carried out using Bayesian optimization, 56 with 15 initial sampling points and 50 optimization iterations. Feature selection was performed via recursive feature elimination, 57 using the drop in AUC as the criterion for feature removal. Classification performance of the optimal models was evaluated using 95% confidence intervals (CIs) computed via bootstrap resampling (N = 1000), pooling the scores across all random seeds. 58 For each feature set, the best‐performing model was defined based on AUC. All machine learning analyses were implemented in Python 3.11.9 using the Scikit‐learn library (version 1.4.1). Custom scripts were developed to automate feature selection, hyperparameter tuning, and model evaluation. The full pipeline was executed on a high‐performance computing cluster, enabling parallel processing to optimize model training and evaluation efficiency.
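The stratified fold structure underlying this nested procedure can be sketched in a few lines. This is a pure‐Python illustration of fold construction only; the study's pipeline relied on Scikit‐learn equivalents and additionally performed tuning, feature selection, and imputation inside the inner loop.

```python
import random

def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with near-equal label balance,
    as in stratified cross-validation. Sketch of fold construction only."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in set(labels):
        idx = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)  # deal class members round-robin
    return folds

# Nested use: each outer training set is itself re-split for tuning
labels = ["AD"] * 33 + ["HC"] * 33
outer = stratified_folds(labels, k=3)       # 3 outer evaluation folds
for held_out in outer:
    train = [i for i in range(len(labels)) if i not in held_out]
    inner = stratified_folds([labels[i] for i in train], k=11)  # 11 tuning folds
```

With 33 participants per group, each outer fold holds out 22 participants (11 AD, 11 HC), so group prevalence is identical across evaluation splits.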
Model performance was evaluated using multiple metrics, including AUC, accuracy, F1 score, sensitivity, precision, normalized expected cost, 59 and normalized cross‐entropy, 60 following recent recommendations. 61 Feature importance was assessed for each top classifier after training them on the entire dataset. Finally, we compared the best‐performing models against each other and against null models generated by label shuffling, based on the 95% CIs for the difference in their performance metrics, computed using the bootstrap method. This approach enabled us to assess the statistical significance of performance differences and to confirm that the models captured meaningful patterns rather than spurious associations or chance‐level variability. The CIs obtained in this way were used to compare our best fluency classifier with (i) the best episodic memory classifier, (ii) the best executive classifier, (iii) the best structural brain classifier, and (iv) the best functional brain classifier. P values were estimated as the fraction of bootstrap differences less than 0, after sign alignment. Differences were deemed significant at an alpha level of 0.05.
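The bootstrap comparison can be sketched as follows, assuming two pooled score distributions (e.g., per‐fold AUCs from two classifiers). The simplified function below resamples each distribution independently; the function name and details are illustrative rather than the study's exact implementation.

```python
import random

def bootstrap_diff(scores_a, scores_b, n_boot=1000, seed=0):
    """95% CI and sign-aligned p value for the mean difference between two
    pooled score distributions. Simplified sketch of the comparison above."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        a = [rng.choice(scores_a) for _ in scores_a]  # resample with replacement
        b = [rng.choice(scores_b) for _ in scores_b]
        diffs.append(sum(a) / len(a) - sum(b) / len(b))
    diffs.sort()
    ci = (diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)])
    # p value: fraction of bootstrap differences crossing zero,
    # after aligning with the sign of the observed difference
    observed = sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b)
    sign = 1 if observed >= 0 else -1
    p = sum(1 for d in diffs if sign * d < 0) / n_boot
    return ci, p
```

A difference is declared significant when the resulting p value falls below 0.05, matching the alpha level stated above.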
2.7. Top fluency features and neural correlates
Neural signatures of the top five discriminatory fluency features were examined through brain–behavior correlations, collapsing AD and HC groups (Figure 1C4), in line with prior approaches. 12 First, associations with gray matter volume were examined via whole‐brain multiple regressions on SPM12, controlling for TIV, age, education, sex, handedness, acquisition center, and diagnostic group (HC or AD). 62 Statistical significance was assessed using threshold‐free cluster enhancement (TFCE) with 5000 permutations, 12 with an alpha level of p < 0.05, family‐wise error (FWE) corrected. 63 Second, we examined partial Spearman correlations with the internal connectivity of resting‐state fMRI networks yielding significant between‐group differences. Eight networks were considered: (1) default mode, (2) language, (3) executive control, (4) salience/ventral attention, (5) visual, (6) somatomotor, (7) limbic, and (8) memory (Supporting Information 3). Analyses were covaried for sex, age, handedness, years of education, and acquisition center. Statistical significance was set at p < 0.05, FWE‐corrected. 12
3. RESULTS
3.1. Group identification across classifiers
As shown in Figure 2, maximal classification was obtained via XGBoost for fluency features (AUC = 0.84 [0.73, 0.90], accuracy = 0.77 [0.67, 0.83]), XGBoost for episodic scores (AUC = 0.90 [0.80, 0.96], accuracy = 0.87 [0.79, 0.93]), logistic regression for executive scores (AUC = 0.79 [0.69, 0.87], accuracy = 0.75 [0.65, 0.83]), XGBoost for structural (ROI volume) features (AUC = 0.90 [0.81, 0.95], accuracy = 0.83 [0.74, 0.89]), and logistic regression for functional (connectivity) features (AUC = 0.65 [0.54, 0.74], accuracy = 0.59 [0.51, 0.68]). Full results are provided in Supporting Information 5.
FIGURE 2.

Machine learning results and brain‐behavior correlations. (A) Violin plots show the AUC distribution for each classifier, based on bootstrapped cross‐validation outputs. (B) Feature importance of the best fluency classifier. (C) Clusters showing significant brain–behavior correlations with the top five fluency features. (C1) A negative correlation was observed between word frequency of the semantic task and gray matter volume in right orbitofrontal regions. (C2) A positive correlation was observed between word phonemic length of the semantic task and right orbitofrontal regions. AUC, area under receiver operating characteristic curve.
Fluency‐based classification outcomes did not differ significantly from those obtained via episodic, executive, or structural brain features, as shown by comparing their 95% CIs for AUC values (fluency vs episodic = −0.06 [−0.19, 0.05]; fluency vs executive = 0.05 [−0.07, 0.18]; fluency vs structural = −0.06 [−0.18, 0.06]; all p > 0.14) and for accuracy values (fluency vs episodic = −0.10 [−0.21, 0.00]; fluency vs executive = 0.02 [−0.09, 0.13]; fluency vs structural = −0.06 [−0.19, 0.05]; all p > 0.16). Also, fluency features performed significantly better than the best functional connectivity features (AUC difference = 0.19 [0.05, 0.33], p = 0.002; accuracy difference = 0.17 [0.04, 0.28], p = 0.003).
All feature sets yielded chance‐level results upon randomly shuffling participants’ labels, and the five selected classifiers performed significantly better than their respective shuffled versions. For details, see Supporting Information 5.
3.2. Feature‐level analyses
The top five features for the best fluency classifier were word phonemic length, phonological neighborhood, frequency, and granularity in the semantic task and word concreteness in the phonemic task (Figure 2B). Complementary statistical analyses were performed to understand the direction of feature‐level between‐group differences. In the semantic task, relative to HCs, AD patients produced fewer phonated segments, fewer but longer pauses, and a lower speech rate and used words that were significantly more frequent, shorter, more neighbor‐dense, and more conceptually vague (all p < 0.04). No significant differences were observed for the remaining fluency features (all p > 0.063). In the phonemic task, AD patients likewise produced fewer phonated segments, longer pauses, and a lower speech rate and used words that were significantly shorter and more concrete (all p < 0.042), with no other significant differences observed (all p > 0.169). For details, see Supporting Information 6.
3.3. Brain‐behavior correlations
Two of the top five features of the best fluency classifier significantly correlated with participants’ gray matter volume. Word frequency in the semantic task was negatively correlated with the volume of the right orbitofrontal cortex and right inferior frontal gyrus (Figure 2C1, Table 2). Phonemic length in the semantic task was positively correlated with the volume of the right orbitofrontal cortex (Figure 2C2, Table 2). We did not find a significant association between word phonological neighborhood or granularity of the semantic task or word concreteness of the phonemic task and gray matter volume.
TABLE 2.
Brain clusters with significant correlations with the top five fluency features.
| Fluency feature | Size (voxels) | Peak region | x | y | z | TFCE | Peak p (FWE) |
|---|---|---|---|---|---|---|---|
| Negative: word logarithmic frequency (semantic task) | 1947 | Right gyrus rectus | 9 | 33 | −26 | 1413 | 0.006 |
| | | Right superior frontal gyrus, orbital part | 9 | 44 | −24 | 1400 | 0.006 |
| | | Right middle frontal gyrus, orbital part | 39 | 53 | 0 | 1382 | 0.007 |
| Positive: word phonemic length (semantic task) | 351 | Right middle frontal gyrus, orbital part | 33 | 51 | −14 | 1064 | 0.026 |
| | | | 41 | 45 | −14 | 945 | 0.041 |
| | | | 41 | 54 | −2 | 902 | 0.048 |
Abbreviations: FWE, family‐wise error; TFCE, threshold‐free cluster enhancement.
None of the top five fluency features showed significant partial correlations with intrinsic connectivity of the studied networks after controlling for sex, age, handedness, years of education, and acquisition center and after applying FWE correction (all p > 0.09). For details, see Supporting Information 7.
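Covariate‐adjusted associations of this kind can be computed by residualizing both variables on the covariates before correlating. A minimal numpy sketch (illustrative only; the covariates named above and the FWE correction step are omitted here):

```python
import numpy as np

def partial_corr(x, y, covars):
    """Pearson correlation between x and y after regressing out covariates (OLS residuals)."""
    Z = np.column_stack([np.ones(len(x)), covars])       # design matrix with intercept
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]    # residual of x given covariates
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]    # residual of y given covariates
    return np.corrcoef(rx, ry)[0, 1]
```

In practice `covars` would stack sex, age, handedness, years of education, and acquisition center, and the resulting statistics would then undergo multiple‐comparison correction.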
4. DISCUSSION
We benchmarked digital speech biomarkers of AD against standard cognitive and imaging measures. Unlike ST metrics, WP features from fluency tasks robustly discriminated between patients and HCs. AUC and accuracy metrics did not differ significantly from those obtained via episodic, executive, or neuroanatomical features, and they surpassed those obtained via functional connectivity features. The two top‐performing fluency features (word length and frequency) were associated with right frontal/orbitofrontal volume. Below we discuss these findings.
The fluency classifier successfully discriminated AD patients from HCs, yielding an AUC of 0.84. This result closely mirrors prior WP research on AD, 12 resembling or surpassing the discriminatory power of multiple digital speech analyses in this population. 9 , 14 , 23 , 64 Importantly, reports of higher classification performance in the speech biomarker literature 8 , 65 are often based on uninterpretable features from large language models or complex embeddings. By contrast, our approach can be linked to semantic memory difficulties (a hallmark of AD), indicating that patients favor the most retrievable items during vocabulary search. Thus, our fluency framework meets the imperative of balancing sensitivity and interpretability. 65
Crucially, outcomes of the fluency classifier were non‐significantly different from those of the best episodic (AUC = 0.90), executive (AUC = 0.79), and structural brain (AUC = 0.90) classifiers, which fell within reported ranges. 30 , 31 , 32 , 33 Of note, the episodic and executive classifiers were trained on several AD‐relevant tasks (targeting immediate and delayed recall, on the one hand, and short‐term memory, working memory, attention, and mental set‐shifting, on the other), while structural classifiers relied on meta‐analytically derived volumetric features. Thus, the three feature sets created stringent benchmarks for comparison, underscoring the robustness of the fluency metrics. Also, the fluency classifier significantly outperformed the functional classifier (AUC = 0.65), which captured resting‐state networks affected in AD. 3 Briefly, then, digital fluency markers meet the non‐inferiority validation criterion relative to well‐established cognitive and imaging measures.
Fluency‐based classification was driven by phonemic length and neighborhood, frequency, granularity, and concreteness. This aligns with works showing that AD patients favor words that are short, neighbor‐dense, frequent, conceptually unspecific, and concrete – namely, items that are more easily retrievable within semantic memory. 12 , 16 , 66 More particularly, the presence of phonological neighborhood, frequency, and granularity among the top discriminatory features, especially when derived from a semantic fluency task, echoes previous findings, 12 reinforcing the notion that word access in AD is distinctly characterized by a preference for phonologically crowded, commonly occurring, and conceptually vague words. Also, the preference for shorter words in AD patients has been widely reported, 67 and word length alone can discriminate between AD patients and HCs. 12 More generally, these results underscore WP features as core linguistic signatures of AD.
Two of those features, indeed, correlated with neural patterns. Higher word frequency and lower phonemic length were associated with reduced gray matter volume in the right orbitofrontal cortex, while frequency further correlated with right inferior frontal volume. Both features were previously linked to left‐sided prefrontal activation in other tasks, 68 with frequency in word production also yielding associations with the volume of left 12 , 69 and right 69 prefrontal regions in AD and right prefrontal regions in behavioral variant frontotemporal dementia. 12 Although the effects are typically weaker than those observed in the left hemisphere, right prefrontal regions support lexical access and controlled retrieval processes 70 and are commonly affected in mild cognitive impairment and AD. 71 Adding to these antecedents, our findings suggest that atypical word processing in AD may partly reflect right prefrontal disturbances, inviting more studies on these areas’ specific contributions to lexico‐semantic choices in the disorder.
Our results carry clinical implications. Though central to diagnosis, standard cognitive tests typically involve lengthy sessions, highly trained examiners, and expensive licenses, limiting their global deployment, especially in overstretched primary care settings. 72 Likewise, MRI evaluations, as well as biofluid tests, prove costly, stressful, and inaccessible in many low‐ and middle‐income countries. 73 Conversely, much like other digital speech biomarkers, the present fluency framework is automated, time‐efficient, affordable, patient‐friendly, and remotely deployable, attesting to its scalability. 7 , 14 Indeed, our ST and WP features are already integrated into apps currently used in clinical centers worldwide. 19 , 20 Thus, just as novel blood tests are maximizing the scalability of pathological markers, 74 so too can fluency measures boost the global deployment of cognitive markers – all while matching the discriminability of standard tools. This is particularly relevant in Latin America and other underresourced regions, where structural limitations and high costs substantially restrict access to traditional diagnostic methods. 5 , 6 Overall, by focusing on Latino participants, this study fosters equity and diversity in speech‐based AD research, beyond its predominant focus on English‐speaking cohorts. 7
Despite its strengths, this study has several limitations. Although our sample size (n = 66) was similar to the median of previous digital speech marker studies (median n = 70, derived from a systematic review), 64 replications with larger cohorts for robust generalizability tests are needed. This limitation also restricted the exploration of broader feature sets, given feature‐to‐sample ratio constraints. 37 Future research with more participants may evaluate classifiers trained on more comprehensive speech, cognitive, and imaging feature sets. Such replications would also benefit from additional measures absent in our protocol. These include validated tools drawing on social determinants of health 75 and biofluid markers known to modulate clinical outcomes in AD. 4 Moreover, patients were in moderate disease stages, calling for replications in mild‐stage cohorts to better establish the utility of WP and ST markers for early detection. Also, although the similarities with Ferrante et al.’s (2024) study 12 are noteworthy, roughly 36% of our sample overlapped with the one reported therein. Future studies should replicate our approach in a fully distinct sample to better ascertain the generalizability of the observed patterns. Finally, the episodic memory classifier was based on tests that differed among participants. While this was tackled via well‐established normalization methods, future works should replicate our approach with consistent (and, ideally, additional) episodic memory measures.
In sum, our findings reinforce digital fluency metrics as robust AD markers, on par with standard cognitive and imaging measures. Importantly, relative to such benchmarks, this framework proves more affordable, time‐efficient, and scalable, thereby addressing dire needs of low‐resource, underserved settings. Future extensions of this approach, especially in broader primary care cohorts, could yield additional validation for its widespread real‐world deployment.
AUTHOR CONTRIBUTIONS
Iván Caro: Methodology, formal analysis, writing – original draft. Gonzalo Pérez: Methodology, software, formal analysis, writing – original draft. Joaquín Valdés Bize: Formal analysis, writing – original draft. Joaquín Ponferrada: Software, writing – review and editing. Franco Ferrante: Software, writing – review and editing. Alejandro Sosa Welford: Software, writing – review and editing. Lara Gauder: Data curation, writing – review and editing. Loreto Olavarría: Validation, writing – review and editing. Fernando Henríquez: Validation, writing – review and editing. Teresita Ramos: Validation, writing – review and editing. Cristina Besnier: Validation, writing – review and editing. Luciana Ferrer: Methodology, writing – review and editing. Maria Luisa Gorno‐Tempini: Writing – review and editing. Andrea Slachevsky: Resources, data curation, writing – review and editing. Agustín Ibañez: Resources, funding acquisition, writing – review and editing. Adolfo M. García: Conceptualization, writing – review and editing, supervision, project administration, funding acquisition.
CONFLICT OF INTEREST STATEMENT
Adolfo M. García is CSO of TELL. All other authors report no conflicts of interest. Author disclosures are available in the Supporting Information.
CONSENT STATEMENT
All participants of the study provided written informed consent in accordance with the Declaration of Helsinki. The study was approved by the Institutional Ethics Committee CEC SSM Oriente under code 29092020.
Supporting information
ACKNOWLEDGMENTS
We thank all participants and their families for their time and excellent predisposition. Andrea Slachevsky is supported by ANID (FONDAP ID15150012, Fondecyt Regular 1231839, and PIA Anillos ACT210096) and the Multi‐Partner Consortium to Expand Dementia Research in Latin America (ReDLat, supported by the National Institutes of Health [NIH], National Institute on Aging [NIA] [R01 AG057234], the Alzheimer's Association [SG‐20‐725707], the Rainwater Charitable Foundation – Tau Consortium, and the Global Brain Health Institute). Agustín Ibañez is supported by grants from the Multi‐Partner Consortium to Expand Dementia Research in Latin America (ReDLat, supported by the Fogarty International Center [FIC], NIH, NIA [R01 AG057234, R01 AG075775, R01 AG21051, R01 AG083799, CARDS‐NIH], the Alzheimer's Association [SG‐20‐725707], the Rainwater Charitable Foundation – The Bluefield Project to Cure FTD, and the Global Brain Health Institute); ANID/FONDECYT Regular (1250091, 1250317, and 1220995); ANID/PIA/ANILLOS ACT210096; JPI JPND‐Care, DISCeRN 2025 – Health and Social Care Research with a Focus on the Moderate and Late Stages of Neurodegenerative Diseases; FONDEF ID20I10152; ANID/FONDAP 15150012; a Wellcome Trust award for BRAIN‐CLIMA: Investigating the Combined Impact of Heat and Air Pollution on Blood‐Brain Barrier Integrity and Brain Aging in Latin America (335293/Z/25/Z); the Wellcome Leap CARE Program (Grant CARE‐2025‐0883490149) for the project “Advancing Female‐Specific Predictive Models and Risk Assessment Tools for Alzheimer's Disease in the US and Latin America”; and CliCBrain (Horizon ID: 101236426; https://doi.org/10.3030/101236426, Marie Skłodowska‐Curie Actions – MSCA). Adolfo M. García is partially supported by the NIA of the NIH (R01AG075775, R01AG083799, 2P01AG019724); ANID (FONDECYT Regular 1250317, 1250091); and Agencia Nacional de Promoción Científica y Tecnológica (01‐PICTE‐2022‐05‐00103).
DATA AVAILABILITY STATEMENT
All anonymized data and scripts supporting the findings of this study are freely available in the following repository: https://osf.io/ad8jb/overview?view_only=e69e33184e334282ade387b1c4da32a9.
REFERENCES
- 1. 2025 Alzheimer's disease facts and figures. Alzheimers Dement. 2025;21(4):e70235. doi:10.1002/alz.70235
- 2. Wang WY, Yu JT, Liu Y, et al. Voxel‐based meta‐analysis of grey matter changes in Alzheimer's disease. Transl Neurodegener. 2015;4(1):6. doi:10.1186/s40035-015-0027-z
- 3. Ibrahim B, Suppiah S, Ibrahim N, et al. Diagnostic power of resting‐state fMRI for detection of network connectivity in Alzheimer's disease and mild cognitive impairment: a systematic review. Hum Brain Mapp. 2021;42(9):2941‐2968. doi:10.1002/hbm.25369
- 4. Hansson O, Blennow K, Zetterberg H, Dage J. Blood biomarkers for Alzheimer's disease in clinical practice and trials. Nat Aging. 2023;3(5):506‐519. doi:10.1038/s43587-023-00403-3
- 5. Wimo A, Guerchet M, Ali GC, et al. The worldwide costs of dementia 2015 and comparisons with 2010. Alzheimers Dement. 2017;13(1):1‐7. doi:10.1016/j.jalz.2016.07.150
- 6. Parra MA, Baez S, Allegri R, et al. Dementia in Latin America: assessing the present and envisioning the future. Neurology. 2018;90(5):222‐231. doi:10.1212/WNL.0000000000004897
- 7. García AM, de Leon J, Tee BL, Blasi DE, Gorno‐Tempini ML. Speech and language markers of neurodegeneration: a call for global equity. Brain. 2023;146(12):4870‐4879. doi:10.1093/brain/awad253
- 8. García AM. Speech and language in healthy aging and Alzheimer's dementia. Nat Rev Psychol. In press.
- 9. Fraser KC, Meltzer JA, Rudzicz F. Linguistic features identify Alzheimer's disease in narrative speech. J Alzheimers Dis. 2016;49(2):407‐422. doi:10.3233/JAD-150520
- 10. Eyigoz E, Mathur S, Santamaria M, Cecchi G, Naylor M. Linguistic markers predict onset of Alzheimer's disease. eClinicalMedicine. 2020;28:100583. doi:10.1016/j.eclinm.2020.100583
- 11. De Looze C, Dehsarvi A, Crosby L, et al. Cognitive and structural correlates of conversational speech timing in mild cognitive impairment and mild‐to‐moderate Alzheimer's disease: relevance for early detection approaches. Front Aging Neurosci. 2021;13:637404. doi:10.3389/fnagi.2021.637404
- 12. Ferrante FJ, Migeot J, Birba A, et al. Multivariate word properties in fluency tasks reveal markers of Alzheimer's dementia. Alzheimers Dement. 2024;20(2):925‐940. doi:10.1002/alz.13472
- 13. Lopes da Cunha P, Ruiz F, Ferrante F, et al. Automated free speech analysis reveals distinct markers of Alzheimer's and frontotemporal dementia. PLoS One. 2024;19(6):e0304272. doi:10.1371/journal.pone.0304272
- 14. Ding K, Chetty M, Noori Hoshyar A, Bhattacharya T, Klein B. Speech based detection of Alzheimer's disease: a survey of AI techniques, datasets and challenges. Artif Intell Rev. 2024;57(12):325. doi:10.1007/s10462-024-10961-6
- 15. Hajjar I, Okafor M, Choi JD, et al. Development of digital voice biomarkers and associations with cognition, cerebrospinal biomarkers, and neural representation in early Alzheimer's disease. Alzheimers Dement (Amst). 2023;15(1):e12393. doi:10.1002/dad2.12393
- 16. Sanz C, Carrillo F, Slachevsky A, et al. Automated text‐level semantic markers of Alzheimer's disease. Alzheimers Dement (Amst). 2022;14(1):e12276. doi:10.1002/dad2.12276
- 17. Pérez‐Toro PA, Ferrante FJ, Pérez G, et al. Automated speech markers of Alzheimer dementia: test of cross‐linguistic generalizability. J Med Internet Res. 2025;27(1):e74200. doi:10.2196/74200
- 18. Bertola L, Mota NB, Copelli M, et al. Graph analysis of verbal fluency test discriminate between patients with Alzheimer's disease, mild cognitive impairment and normal elderly controls. Front Aging Neurosci. 2014;6:185. doi:10.3389/fnagi.2014.00185
- 19. García AM, Johann F, Echegoyen R, et al. Toolkit to examine lifelike language (TELL): an app to capture speech and language markers of neurodegeneration. Behav Res Methods. 2024;56(4):2886‐2900. doi:10.3758/s13428-023-02240-z
- 20. García AM, Ferrante FJ, Pérez G, et al. Toolkit to examine lifelike language v.2.0: optimizing speech biomarkers of neurodegeneration. Dement Geriatr Cogn Disord. 2024;54(2):96‐108. doi:10.1159/000541581
- 21. Marra C, Piccininni C, Masone Iacobucci G, et al. Semantic memory as an early cognitive marker of Alzheimer's disease: role of category and phonological verbal fluency tasks. J Alzheimers Dis. 2021;81(2):619‐627. doi:10.3233/JAD-201452
- 22. Pistono A, Jucla M, Barbeau EJ, et al. Pauses during autobiographical discourse reflect episodic memory processes in early Alzheimer's disease. J Alzheimers Dis. 2016;50(3):687‐698. doi:10.3233/JAD-150408
- 23. Meilán JJG, Martínez‐Sánchez F, Carro J, López DE, Millian‐Morell L, Arana JM. Speech in Alzheimer's disease: can temporal and acoustic parameters discriminate dementia? Dement Geriatr Cogn Disord. 2014;37(5‐6):327‐334. doi:10.1159/000356726
- 24. Gonzalez‐Moreira E, Torres‐Boza D, Kairuz HA, et al. Automatic prosodic analysis to identify mild dementia. BioMed Res Int. 2015;2015:916356. doi:10.1155/2015/916356
- 25. Yuan J, Bian Y, Cai X, Huang J, Ye Z, Church K. Disfluencies and fine‐tuning pre‐trained language models for detection of Alzheimer's disease. In: Proc Interspeech 2020. 2020:2162‐2166. doi:10.21437/Interspeech.2020-2516
- 26. Roark B, Mitchell M, Hosom JP, Hollingshead K, Kaye J. Spoken language derived measures for detecting mild cognitive impairment. IEEE Trans Audio Speech Lang Process. 2011;19(7):2081‐2090. doi:10.1109/TASL.2011.2112351
- 27. Vonk JMJ, Flores RJ, Rosado D, et al. Semantic network function captured by word frequency in nondemented APOE ε4 carriers. Neuropsychology. 2019;33(2):256‐262. doi:10.1037/neu0000508
- 28. Vonk JMJ, Morin BT, Pillai J, et al. Automated speech analysis to differentiate frontal and right anterior temporal lobe atrophy in frontotemporal dementia. Neurology. 2025;104(9):e213556. doi:10.1212/WNL.0000000000213556
- 29. Scott IA. Non‐inferiority trials: determining whether alternative treatments are good enough. Med J Aust. 2009;190(6):326‐330. doi:10.5694/j.1326-5377.2009.tb02425.x
- 30. Maito MA, Santamaría‐García H, Moguilner S, et al. Classification of Alzheimer's disease and frontotemporal dementia using routine clinical and cognitive measures across multicentric underrepresented samples: a cross sectional observational study. Lancet Reg Health Am. 2022;17:100387. doi:10.1016/j.lana.2022.100387
- 31. Gupta A, Kahali B. Machine learning‐based cognitive impairment classification with optimal combination of neuropsychological tests. Alzheimers Dement Transl Res Clin Interv. 2020;6(1):e12049. doi:10.1002/trc2.12049
- 32. Battista P, Salvatore C, Castiglioni I. Optimizing neuropsychological assessments for cognitive, behavioral, and functional impairment classification: a machine learning study. Behav Neurol. 2017;2017(1):1850909. doi:10.1155/2017/1850909
- 33. Menagadevi M, Devaraj S, Madian N, Thiyagarajan D. Machine and deep learning approaches for Alzheimer disease detection using magnetic resonance images: an updated review. Measurement. 2024;226:114100. doi:10.1016/j.measurement.2023.114100
- 34. Zhang L, Ngo A, Thomas JA, et al. Neuropsychological test validation of speech markers of cognitive impairment in the Framingham Cognitive Aging Cohort. Explor Med. 2021;2(3):232‐252. doi:10.37349/emed.2021.00044
- 35. Alexopoulou ZS, Köhler S, Mallick E, et al. Speech‐based digital cognitive assessments for detection of mild cognitive impairment: validation against paper‐based neurocognitive assessment scores. J Alzheimers Dis. 2025;108(1 Suppl):S118‐S131. doi:10.1177/13872877251343296
- 36. Taler V, Phillips NA. Language performance in Alzheimer's disease and mild cognitive impairment: a comparative review. J Clin Exp Neuropsychol. 2008;30(5):501‐556. doi:10.1080/13803390701550128
- 37. Hua J, Xiong Z, Lowey J, Suh E, Dougherty ER. Optimal number of features as a function of sample size for various classification rules. Bioinformatics. 2005;21(8):1509‐1515. doi:10.1093/bioinformatics/bti171
- 38. Ibanez A, Yokoyama JS, Possin KL, et al. The multi‐partner consortium to expand dementia research in Latin America (ReDLat): driving multicentric research and implementation science. Front Neurol. 2021;12:631722. doi:10.3389/fneur.2021.631722
- 39. Slachevsky A, Zitko P, Martínez‐Pernía D, et al. GERO Cohort Protocol, Chile, 2017‐2022: community‐based cohort of functional decline in subjective cognitive complaint elderly. BMC Geriatr. 2020;20(1):505. doi:10.1186/s12877-020-01866-4
- 40. New manual aims to create common standards for dementia diagnosis across Latin America. Alzheimers Dement. 2020;16(7):1099. doi:10.1002/alz.12147
- 41. Dubois B, Feldman HH, Jacova C, et al. Research criteria for the diagnosis of Alzheimer's disease: revising the NINCDS‐ADRDA criteria. Lancet Neurol. 2007;6(8):734‐746. doi:10.1016/S1474-4422(07)70178-3
- 42. McKhann GM, Knopman DS, Chertkow H, et al. The diagnosis of dementia due to Alzheimer's disease: recommendations from the National Institute on Aging‐Alzheimer's Association workgroups on diagnostic guidelines for Alzheimer's disease. Alzheimers Dement. 2011;7(3):263‐269. doi:10.1016/j.jalz.2011.03.005
- 43. Sherman E, Tan J, Hrabok M. A Compendium of Neuropsychological Tests: Fundamentals of Neuropsychological Assessment and Test Reviews for Clinical Practice. Oxford University Press; 2023.
- 44. Weintraub S, Besser L, Dodge HH, et al. Version 3 of the Alzheimer disease centers’ neuropsychological test battery in the uniform data set (UDS). Alzheimer Dis Assoc Disord. 2018;32(1):10. doi:10.1097/WAD.0000000000000223
- 45. Grober E, Lipton RB, Hall C, Crystal H. Memory impairment on free and cued selective reminding predicts dementia. Neurology. 2000;54(4):827‐832. doi:10.1212/wnl.54.4.827
- 46. Molina‐Donoso M, González‐Hernández J, Delgado C, et al. Development of normative data for the mini mental state examination (MMSE) in the elderly population of Chile: a multi‐city study. Rev Médica Chile. 2023;151(11):1464‐1470. doi:10.4067/s0034-98872023001101464
- 47. de Jong NH, Wempe T. Praat script to detect syllable nuclei and measure speech rate automatically. Behav Res Methods. 2009;41(2):385‐390. doi:10.3758/BRM.41.2.385
- 48. Duchon A, Perea M, Sebastián‐Gallés N, Martí A, Carreiras M. EsPal: one‐stop shopping for Spanish word properties. Behav Res Methods. 2013;45(4):1246‐1258. doi:10.3758/s13428-013-0326-1
- 49. Gaser C, Dahnke R, Thompson PM, Kurth F, Luders E, the Alzheimer's Disease Neuroimaging Initiative. CAT: a computational anatomy toolbox for the analysis of structural MRI data. GigaScience. 2024;13:giae049. doi:10.1093/gigascience/giae049
- 50. Whitfield‐Gabrieli S, Nieto‐Castanon A. Conn: a functional connectivity toolbox for correlated and anticorrelated brain networks. Brain Connect. 2012;2(3):125‐141. doi:10.1089/brain.2012.0073
- 51. Rolls ET, Huang CC, Lin CP, Feng J, Joliot M. Automated anatomical labelling atlas 3. Neuroimage. 2020;206:116189. doi:10.1016/j.neuroimage.2019.116189
- 52. Rubinov M, Sporns O. Complex network measures of brain connectivity: uses and interpretations. Neuroimage. 2010;52(3):1059‐1069. doi:10.1016/j.neuroimage.2009.10.003
- 53. Mosteller F, Tukey JW. Data analysis, including statistics. Handb Soc Psychol. 1968;2:80‐203.
- 54. Moguilner S, Birba A, Fittipaldi S, et al. Multi‐feature computational framework for combined signatures of dementia in underrepresented settings. J Neural Eng. 2022;19(4). doi:10.1088/1741-2552/ac87d0
- 55. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016:785‐794. doi:10.1145/2939672.2939785
- 56. Brochu E, Cora VM, de Freitas N. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599. 2010. doi:10.48550/arXiv.1012.2599
- 57. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer; 2009.
- 58. Efron B. Better bootstrap confidence intervals. J Am Stat Assoc. 1987;82(397):171‐185. doi:10.2307/2289144
- 59. Ferrer L. No need for ad‐hoc substitutes: the expected cost is a principled all‐purpose classification metric. Trans Mach Learn Res. 2024. https://openreview.net/forum?id=5PPbvCExZs
- 60. Ferrer L, Ramos D. Evaluating posterior probabilities: decision theory, proper scoring rules, and calibration. arXiv preprint arXiv:2408.02841. 2024. doi:10.48550/arXiv.2408.02841
- 61. Ferrer L, Scharenborg O, Bäckström T. Good practices for evaluation of machine learning systems. arXiv preprint arXiv:2412.03700. 2024. doi:10.48550/arXiv.2412.03700
- 62. Hazelton JL, Migeot J, Gonzalez‐Gomez R, et al. Cardiovascular risk factors and the allostatic interoceptive network in dementia. Cardiovasc Res. 2025;121(14):2222‐2232. doi:10.1093/cvr/cvaf185
- 63. Smith SM, Nichols TE. Threshold‐free cluster enhancement: addressing problems of smoothing, threshold dependence and localisation in cluster inference. Neuroimage. 2009;44(1):83‐98. doi:10.1016/j.neuroimage.2008.03.061
- 64. de la Fuente Garcia S, Ritchie CW, Luz S. Artificial intelligence, speech, and language processing approaches to monitoring Alzheimer's disease: a systematic review. J Alzheimers Dis. 2020;78(4):1547‐1574. doi:10.3233/JAD-200888
- 65. Shankar R, Goh Z, Devi F, Xu Q. A systematic review of explainable artificial intelligence methods for speech‐based cognitive decline detection. NPJ Digit Med. 2025;8:724. doi:10.1038/s41746-025-02105-z
- 66. Cho S, Cousins KAQ, Shellikeri S, et al. Lexical and acoustic speech features relating to Alzheimer disease pathology. Neurology. 2022;99(4):e313‐e322. doi:10.1212/WNL.0000000000200581
- 67. Gumus M, Koo M, Studzinski CM, Bhan A, Robin J, Black SE. Linguistic changes in neurodegenerative diseases relate to clinical symptoms. Front Neurol. 2024;15:1373341. doi:10.3389/fneur.2024.1373341
- 68. Yarkoni T, Speer NK, Balota DA, McAvoy MP, Zacks JM. Pictures of a thousand words: investigating the neural mechanisms of reading with extremely rapid event‐related fMRI. Neuroimage. 2008;42(2):973‐987. doi:10.1016/j.neuroimage.2008.04.258
- 69. Li Q, Koehler S, Koenig A, et al. Associations between digital speech features of automated cognitive tasks and trajectories of brain atrophy and cognitive decline in early Alzheimer's disease. J Alzheimers Dis. 2025;107(1):154‐169. doi:10.1177/13872877251359967
- 70. Ries SK, Dronkers NF, Knight RT. Choosing words: left hemisphere, right hemisphere, or both? Perspective on the lateralization of word retrieval. Ann N Y Acad Sci. 2016;1369(1):111‐131. doi:10.1111/nyas.12993
- 71. Zhao H, Li X, Wu W, et al. Atrophic patterns of the frontal‐subcortical circuits in patients with mild cognitive impairment and Alzheimer's disease. PLoS One. 2015;10(6):e0130017. doi:10.1371/journal.pone.0130017
- 72. Whitehead JC, Gambino SA, Richter JD, Ryan JD. Focus group reflections on the current and future state of cognitive assessment tools in geriatric health care. Neuropsychiatr Dis Treat. 2015;11:1455‐1466. doi:10.2147/NDT.S82881
- 73. Chávez‐Fumagalli MA, Shrivastava P, Aguilar‐Pineda JA, et al. Diagnosis of Alzheimer's disease in developed and developing countries: systematic review and meta‐analysis of diagnostic test accuracy. J Alzheimers Dis Rep. 2021;5(1):15‐30. doi:10.3233/ADR-200263
- 74. Palmqvist S, Warmenhoven N, Anastasi F, et al. Plasma phospho‐tau217 for Alzheimer's disease diagnosis in primary and secondary care using a fully automated platform. Nat Med. 2025;31(6):2036‐2043. doi:10.1038/s41591-025-03622-w
- 75. Santamaria‐Garcia H, Sainz‐Ballesteros A, Hernandez H, et al. Factors associated with healthy aging in Latin American populations. Nat Med. 2023;29(9):2248‐2258. doi:10.1038/s41591-023-02495-1