Abstract
INTRODUCTION
Cognitive screening to detect mild cognitive impairment (MCI) and dementia in primary care settings has proven to be a challenging task. The ideal solution would be a brief, yet sensitive, tool appropriate for use with individuals from diverse educational and cultural backgrounds that requires limited time and expertise from clinic staff. The purpose of this project was (1) to develop an automated cognitive screening tool incorporating cognitive and speech/language data using machine learning techniques for potential use in primary care settings and (2) to compare its classification accuracy to an established cognitive screening measure.
METHODS
Participants were 53 cognitively normal and 51 cognitively impaired older adults. Each completed a working memory (WM) task and four speaking tasks, followed by a second administration of the WM task to investigate the added utility of practice effects. Bayesian additive regression trees were used to test nine models, and the Quick Mild Cognitive Impairment screen was administered as a comparator.
RESULTS
The top feature set consisted of both administrations of the WM task and a personal narrative task and achieved a cross‐validated classification accuracy (area under the receiver operating characteristics curve) of 0.86, which was slightly better than the comparator.
DISCUSSION
Combining WM and acoustic and linguistic variables derived from connected speaking tasks discriminated cognitively normal from cognitively impaired groups with a high degree of accuracy.
Highlights
Working memory and speaking tasks were used for detection of cognitive impairment.
This combination distinguished cognitively normal from impaired older adults.
This automated tool may overcome barriers to cognitive screening in primary care.
Keywords: cognitive screening, dementia, early detection, mild cognitive impairment, older adults, primary care
1. BACKGROUND
Under‐diagnosis of mild cognitive impairment (MCI) and dementia in primary care settings is widely recognized, leading to recommendations for routine brief cognitive assessments of older adults. 1 There are many barriers to routine cognitive screening in primary care, however, including lack of time and variability in confidence and knowledge about detecting and managing dementia. 2 , 3 , 4 Most primary care providers (PCPs) remain uncertain about which patients to assess, which tools to use, and how to communicate results. 1 In addition, many PCPs are concerned about the accuracy of brief cognitive assessments and their own lack of training in the administration, scoring, and interpretation of these assessments. These concerns are supported by studies finding cultural, linguistic, and educational biases 5 and incorrect scoring or reporting in 25%–33% of cases. 6 , 7
Research shows that the Mini‐Mental State Examination (MMSE) remains the most commonly used brief cognitive assessment tool by PCPs, followed by the Clock Drawing Test and Mini‐Cog, 1 , 8 although the Montreal Cognitive Assessment (MoCA) and St. Louis University Mental Status exam are also frequently used. While electronic versions of these tools may exist, they do not reflect the vast amount of psychometric knowledge gained from studies in preclinical Alzheimer's disease (AD) that have taken place over the past 20 years, particularly with regard to tasks that are most likely to detect cognitive impairment years ahead of a formal diagnosis. 9 , 10 Nor do they incorporate advances in technologies, such as speech and voice analysis 11 , 12 , 13 or statistical modeling and machine learning techniques. 14
The overarching goal of this project is to develop a brief (< 5 minutes), automated, low‐burden assessment tool sensitive to MCI that could be deployed in primary care settings. The goals of the current study were: (1) to develop a classification algorithm using as input a 90‐second computerized processing speed/working memory measure and four speaking tasks to differentiate cognitively normal from cognitively impaired older adults, and (2) to compare its classification accuracy to an established short cognitive screening task, the Quick Mild Cognitive Impairment screen, 15 which was the only short (i.e., < 5 minutes) cognitive screening measure shown to be valid to detect MCI based on a systematic review. 16 The classification algorithm was chosen to exploit modern machine learning techniques, in particular, ones that hold promise in relatively small samples.
2. METHODS
2.1. Participants
Participants were 104 older adults (52% male), 53 who were cognitively normal (CN) and 51 who were cognitively impaired (CI), recruited from two sites that specialize in AD and related dementias, the Comprehensive Memory Center at The University of Texas Health‐Austin in Austin, Texas, and the Institute for Dementia Research and Prevention at Pennington Biomedical Research Center in Baton Rouge, Louisiana. Participants were recruited from a database of individuals who presented to these sites within the past year and completed a cognitive workup that included neuropsychological testing assessing language, visuospatial abilities, attention/executive functioning, and memory. All participants were required to be 60 years or older and fluent in English and could not have severe chronic psychiatric illness or active substance abuse that may impact cognition. To confirm CN status at the time of study enrollment, CN participants had to obtain a score of 24 or higher on the MMSE, which has been shown to be an effective cutoff for ruling out cognitive impairment, 17 and indicate they were independent with all activities of daily living. Additional exclusion criteria for CN participants were any neurologic condition associated with cognitive impairment (e.g., seizure disorder), prior traumatic brain injury with loss of consciousness for more than 15 minutes, and/or significant current depressive symptomatology as indicated by a score of ≥ 5 on the Geriatric Depression Scale – Short Form 18 (GDS‐SF). Participants with CI met established criteria for MCI (n = 35) or dementia (n = 16) according to the National Institute on Aging–Alzheimer's Association (NIA‐AA) workgroup diagnostic guidelines. 19 , 20 To make classification accuracy between CN and CI more challenging, participants diagnosed with dementia were required to obtain an MMSE score ≥ 20 during the screening visit to ensure they were in a mild stage of dementia. 21
RESEARCH IN CONTEXT
Systematic review: Literature review focused on detection of cognitive impairment in primary care, digital cognitive screening tools, and speech analysis to detect cognitive impairment. No automated cognitive screening tool designed specifically for primary care settings was identified, and the utility of combining cognitive tests with speaking tasks for detection of cognitive impairment was not addressed. In addition, while several speech/language tasks demonstrated the ability to distinguish between cognitively normal and cognitively impaired older adults, direct comparisons of the classification accuracy of various speaking tasks were lacking.
Interpretation: Combining cognitive performance with connected speech variables using machine learning techniques showed good classification accuracy of older adults with and without cognitive impairment and was comparable to an existing tool recommended for primary care settings.
Future directions: Further validation of this digital cognitive screening tool in larger, diverse samples, and within busy primary care clinic settings is needed.
2.2. Measures
2.2.1. Processing speed/working memory
Digit symbol substitution tasks are brief, yet sensitive indicators of cognitive impairment that are minimally affected by sociocultural factors and thus widely used in dementia research. 8 , 9 , 22 , 23 In this study, a Web‐based digit symbol substitution task, Speeded Matching (SM), was used. In this task, participants are shown nine symbols with corresponding numbers at the top of the screen. Then numbers without the corresponding symbols are presented in a row beneath the key, and participants are instructed to tap on the symbol that goes in the highlighted empty box above the corresponding number. The symbol will appear in the box after the participant taps on it and then the next empty box will be highlighted and so on. After completion of one row, a new row appears. The total number of correctly matched symbol–number pairs after 90 seconds is the score. SM demonstrated good 2‐week test–retest reliability (r = 0.73) and construct validity when compared to the paper‐and‐pencil version of the Preclinical Alzheimer's Cognitive Composite in 155 CN older adults. 24 Because research suggests that lack of practice effect may indicate cognitive impairment 25 or be predictive of future cognitive decline, 26 participants in the present study completed SM twice, before (SM1) and after (SM2) the speaking tasks described below, to enable investigation of the added value of prior exposure (i.e., practice effect) to classification accuracy.
2.2.2. Speaking tasks
Four speaking tasks were examined in the current study: (1) picture description using the “cat rescue” picture, 27 (2) procedural discourse (i.e., how would you make a peanut butter and jelly sandwich), (3) an open‐ended personal narrative sample (the “important event” question from Aphasia Bank 28 ), and (4) counting backward from 305 to 285. 12 Each audio file (three connected speaking tasks and one counting backward task per participant) was evaluated for quality and manually edited to remove silent periods and any experimenter speech (at beginning and end of sample) prior to further processing. Subsequently, a custom Health Insurance Portability and Accountability Act (HIPAA) ‐compliant speech‐language analysis pipeline was used to extract acoustic (all tasks) and linguistic (all tasks excluding counting backward) features for each sample.
2.2.3. Cognitive screening
The Quick Mild Cognitive Impairment (Qmci) screen 15 was administered to compare classification accuracy of our new cognitive screening algorithm to a validated brief cognitive assessment tool. The Qmci consists of six subtests: Orientation, Word Registration, Clock Drawing, Delayed Recall, Verbal Fluency, and Logical Memory. It has an average administration time of 4.5 min with high sensitivity (90%) and specificity (87%) for detecting cognitive impairment. 29 The Qmci was chosen as a comparator due to its brevity and higher accuracy for detecting cognitive impairment than other widely used screening measures, including the MoCA and standardized MMSE. 15 , 29 Total scores range from 0 to 100, and a cutoff score of <67 (for CI) is recommended to maximize specificity and minimize false positives for identification of MCI. 30
2.3. Study procedures
The study was approved by The University of Texas at Austin Institutional Review Board (IRB), which served as the single IRB for both sites. Participants who had previously been evaluated at one of the two study sites were invited to participate in the current study. Enrollment occurred from August 2021 through November 2022. After providing written informed consent, participants were screened for exclusionary conditions and administered the MMSE and GDS‐SF by trained research personnel in a quiet, private room. Participants who met inclusion and exclusion criteria were administered SM1, the four speaking tasks, and SM2 on an Apple iPad, followed by the Qmci. Speaking tasks were recorded using the standard microphone in the Apple iPad. Upon completion of the study visit, participants received a $25 electronic gift card.
2.4. Data analysis
Means, standard deviations, and proportions were calculated to describe the study sample and dependent variables. Independent samples t‐tests and chi‐square tests were computed to examine group differences in interval and categorical variables, respectively.
Before modeling cognitive group classification (CI vs. CN), we selected a set of 26 acoustic and 30 linguistic features as candidate predictors, based on expert opinion and a review of the extant literature (Table 1). The 26‐feature acoustic variable set comprises speech and voice parameters derived using openSMILE 31 and speech timing measures derived using Praat software. 32 , 33 After transcription of the audio file using an automatic speech recognition tool, 34 the 30 linguistic features were derived using SpaCy 35 and SPLAT 36 tools.
TABLE 1.
Acoustic and linguistic features included in machine learning model in a study of working memory and connected speech.
| Feature | Type | Description | Tool |
|---|---|---|---|
| 1. Fundamental frequency | Acoustic | Fundamental frequency expressed on a semitone scale relative to 27.5 Hz (F0semitoneFrom27.5Hz_sma3nz…); corresponds to perceptual feature of pitch | OpenSMILE 31 |
| a. Mean | | | OpenSMILE |
| b. Standard deviation | | | OpenSMILE |
| c. 20th percentile | | | OpenSMILE |
| d. 50th percentile | | | OpenSMILE |
| e. 80th percentile | | | OpenSMILE |
| f. Percentile range | | | OpenSMILE |
| g. Rising slope mean | | | OpenSMILE |
| h. Rising slope standard deviation | | | OpenSMILE |
| i. Falling slope mean | | | OpenSMILE |
| j. Falling slope standard deviation | | | OpenSMILE |
| k. Cycle‐to‐cycle variability (jitter) mean | | | OpenSMILE |
| l. Cycle‐to‐cycle variability (jitter) standard deviation | | | OpenSMILE |
| 2. Formant frequencies | Acoustic | Resonant frequency determined by configuration of vocal tract | OpenSMILE |
| a. F1 mean | | First formant frequency (F1frequency_sma3nz…); corresponds to perception of vowel | OpenSMILE |
| b. F1 standard deviation | | | OpenSMILE |
| c. F2 mean | | Second formant frequency (F2frequency_sma3nz…); corresponds to perception of vowel | OpenSMILE |
| d. F2 standard deviation | | | OpenSMILE |
| e. F3 mean | | Third formant frequency (F3frequency_sma3nz…); corresponds to perception of voice quality | OpenSMILE |
| f. F3 standard deviation | | | OpenSMILE |
| 3. Speaking rate: Pseudosyllable rate | Acoustic | No. of continuous voiced regions per second (VoicedSegmentsPerSec) | OpenSMILE |
| 4. Syllable duration mean | Acoustic | Mean length and the standard deviation of continuously voiced regions (F0 > 0) (VoicedSegmentLengthSec) | OpenSMILE |
| 5. Syllable duration standard deviation | Acoustic | | OpenSMILE |
| 6. Pause duration mean | Acoustic | Mean length and the standard deviation of unvoiced regions (F0 = 0; approximating pauses) (UnvoicedSegmentLength) | OpenSMILE |
| 7. Pause duration standard deviation | Acoustic | | OpenSMILE |
| 8. Speaking rate: Syllables per second | Acoustic | No. of syllables per second (speechrate.nsyll.dur.) | Parselmouth*,† |
| 9. Articulation rate | Acoustic | No. of syllables per second, excluding pause time (articulation.rate.nsyll.phonationtime) | Parselmouth |
| 10. Speech‐to‐pause ratio | Acoustic | Ratio of speaking time to pause time: total duration of voicing / [(total duration of sound file) – (total duration of voicing)] (speakingtot / (originaldur – speakingtot)) | Parselmouth |
| 11. Grammatical complexity index | Linguistic | Proportion of complex grammatical relations (CSUBJ, COMP, CPRED, CPOBJ, COBJ, CJCT, XJCT, CMOD from dependency parses) / total grammatical relations, adapted from CLAN 37 | SpaCy 35 |
| 12. Noun‐verb ratio | Linguistic | No. of nouns / no. of verbs | SpaCy |
| 13. No. of unique words | Linguistic | Count of unique words in a sample | SpaCy |
| 14. Type–token ratio | Linguistic | Measure of lexical diversity: No. of unique words / no. of words | SpaCy |
| 15. Moving‐average type–token ratio ‐ 5 | Linguistic | Type‐token ratio applied with a moving window of 5 words | SpaCy |
| 16. Moving‐average type–token ratio ‐ 15 | Linguistic | Type–token ratio applied with a moving window of 15 words | SpaCy |
| 17. Propositional density | Linguistic | Percentage of words labeled with the following part of speech tags: verbs, adjectives, adverbs, prepositions, and conjunctions / no. of words | SpaCy |
| 18. Part of speech tag: Adjectives | Linguistic | Percentage of words labeled as adjectives / no. of words | SpaCy |
| 19. Part of speech tag: Adpositions | Linguistic | Percentage of words labeled as adpositions / no. of words | SpaCy |
| 20. Part of speech tag: Adverbs | Linguistic | Percentage of words labeled as adverbs / no. of words | SpaCy |
| 21. Part of speech tag: Auxiliary verbs | Linguistic | Percentage of words labeled as auxiliary verbs / no. of words | SpaCy |
| 22. Part of speech tag: Determiners | Linguistic | Percentage of words labeled as determiners / no. of words | SpaCy |
| 23. Part of speech tag: Interjections | Linguistic | Percentage of words labeled as interjections / no. of words | SpaCy |
| 24. Part of speech tag: Nouns | Linguistic | Percentage of words labeled as nouns / no. of words | SpaCy |
| 25. Part of speech tag: Numerals | Linguistic | Percentage of words labeled as numerals / no. of words | SpaCy |
| 26. Part of speech tag: Particles | Linguistic | Percentage of words labeled as particles / no. of words | SpaCy |
| 27. Part of speech tag: Pronouns | Linguistic | Percentage of words labeled as pronouns / no. of words | SpaCy |
| 28. Part of speech tag: Proper nouns | Linguistic | Percentage of words labeled as proper nouns / no. of words | SpaCy |
| 29. Part of speech tag: Verbs | Linguistic | Percentage of words labeled as verbs / no. of words | SpaCy |
| 30. Part of speech tag: Conjunctions | Linguistic | Percentage of words labeled as conjunctions / no. of words | SpaCy |
| 31. Proportion of fillers | Linguistic | No. of fillers / no. of words. Fillers: okay, um, uh, er, eh | SpaCy |
| 32. Repetitions per word | Linguistic | No. of single‐word repetitions / no. of words | SpaCy |
| 33. Content‐to‐function word ratio | Linguistic | No. of content words / no. of function words. | SPLAT |
| 34. Mean word frequency | Linguistic | ‡ Average: Unigram log probability of word, from the SUBTLEX database a | |
| 35. Mean word length in phonemes | Linguistic | ‡ Average: No. of phonemes in a word, from the CMU Pronouncing Dictionary b | |
| 36. Mean word semantic diversity | Linguistic | ‡ Average: Diversity of context within which a word appears c | |
| 37. Mean word prevalence | Linguistic | ‡ Average: Proportion of people familiar with a word, based on survey data d | |
| 38. Mean word concreteness | Linguistic | ‡ Average: Rating of degree to which a concept denoted by a word refers to a perceptible entity e | |
| 39. Mean word age of acquisition | Linguistic | Average: Approximate age when a word is learned f | |
| 40. Mean word length in morphemes | Linguistic | No. of morphemes in a word | Polyglot's English morfessor model g |
*Jadoul Y, Thompson B, de Boer B. Introducing Parselmouth: A Python interface to Praat. Journal of Phonetics, 2018;71:1‐15. Retrieved from https://parselmouth.readthedocs.io/en/stable/.
†Feinberg DR. Parselmouth praat scripts in python. Open Science Framework, 10; 2019.
‡Average calculated for open‐class words.
a Brysbaert M, New B. Moving beyond Kucera and Francis: a critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behav Res Methods. 2009;41(4):977‐990. doi:10.3758/BRM.41.4.977.
b Carnegie Mellon University. (n.d.). CMU Pronouncing Dictionary. Retrieved March 5, 2025, from http://www.speech.cs.cmu.edu/cgi‐bin/cmudict.
c Hoffman P, Lambon Ralph MA, Rogers TT. Semantic diversity: a measure of semantic ambiguity based on variability in the contextual usage of words. Behav Res Methods. 2013;45(3):718‐730. doi:10.3758/s13428‐012‐0278‐x.
d Brysbaert M, Mandera P, McCormick SF, Keuleers E. Word prevalence norms for 62,000 English lemmas. Behav Res Methods. 2019;51(2):467‐479. doi:10.3758/s13428‐018‐1077‐9.
e Brysbaert M, Warriner AB, Kuperman V. Concreteness ratings for 40 thousand generally known English word lemmas. Behav Res Methods. 2014;46(3):904‐911. doi:10.3758/s13428‐013‐0403‐5.
f Kuperman V, Stadthagen‐Gonzalez H, Brysbaert M. Age‐of‐acquisition ratings for 30,000 English words [published correction appears in Behav Res Methods. 2013 Sep;45(3):900]. Behav Res Methods. 2012;44(4):978‐990. doi:10.3758/s13428‐012‐0210‐4.
g McAuliffe M, Stengel‐Eskin E, Socolof M, Sonderegger M. Polyglot and Speech Corpus Tools: A system for representing, integrating, and querying speech corpora. In Proceedings of Interspeech 2017. pp. 3887‐3891; doi:10.21437/Interspeech.2017‐1390.
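Several of the features in Table 1 reduce to simple ratios. The following Python sketch illustrates four of them under simplifying assumptions: tokens come from lowercase whitespace splitting rather than from SpaCy, and all function names are hypothetical illustrations, not those of the study pipeline.

```python
# Filler list taken from Table 1 (feature 31).
FILLERS = {"okay", "um", "uh", "er", "eh"}


def type_token_ratio(tokens):
    """No. of unique words / no. of words (feature 14)."""
    return len(set(tokens)) / len(tokens)


def moving_average_ttr(tokens, window):
    """Mean type-token ratio over every run of `window` consecutive words."""
    if len(tokens) < window:
        # Assumption: fall back to the plain ratio for very short samples.
        return type_token_ratio(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


def filler_proportion(tokens):
    """No. of fillers / no. of words (feature 31)."""
    return sum(token in FILLERS for token in tokens) / len(tokens)


def speech_to_pause_ratio(speaking_total, original_duration):
    """Speaking time / pause time, both in seconds (feature 10)."""
    return speaking_total / (original_duration - speaking_total)


# Illustrative (invented) sample, not study data.
tokens = "um we went to the lake and uh we swam in the lake".split()
ttr = type_token_ratio(tokens)
fillers = filler_proportion(tokens)
```

Because the type–token ratio falls as samples get longer, the windowed version (features 15 and 16) is less sensitive to sample length, which matters when speakers produce narratives of very different durations.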
With this pool of candidate predictors selected, we then used Bayesian additive regression trees (BART) 38 , 39 to develop classification models and to determine the relative importance of demographic variables, SM1 and SM2 – SM1 (SMDiff), and the selected acoustic and linguistic features from the speaking tasks. Rather than learning (fitting) a single classification or regression tree, BART fits many such trees, each one relatively parsimonious and none of them very good by itself, and then combines their contributions to get a much more accurate classifier. Each tree is evolved as a fitted sequence of binary classification decision rules on either the full data (for the first split) or each subset of the data defined by the foregoing splits (for the second and later splits). Each classification rule is based on which single feature would give the largest improvement in fit, and for non‐binary features, the algorithm selects the splitting threshold as well as the splitting feature. We used the bartMachine package in R in our analysis. 40 Overall model fit for each classifier was assessed via the area under the receiver operating characteristic (AUROC) curve, often (and hereafter) referred to as the c‐statistic.
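The c‐statistic has a direct interpretation: it is the probability that a randomly chosen CI participant receives a higher predicted score than a randomly chosen CN participant. A minimal Python sketch of that pairwise definition follows; the scores and labels are invented for illustration, and the function name is hypothetical.

```python
def c_statistic(scores, labels):
    """AUROC as the share of (impaired, normal) pairs ranked correctly.

    labels: 1 = cognitively impaired, 0 = cognitively normal.
    Tied pairs count as half a correct ranking.
    """
    impaired = [s for s, y in zip(scores, labels) if y == 1]
    normal = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in impaired:
        for n in normal:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(impaired) * len(normal))


# Invented predicted probabilities of impairment for six participants.
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = c_statistic(scores, labels)  # 8 of 9 pairs are ranked correctly
```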
2.4.1. Model inputs and cross‐validation
Because a main objective of the project was to develop a diagnostic tool that can be more rapidly administered, we did not try to include all features from the five tasks (i.e., SM and four speaking tasks) in a single fitting algorithm run of bartMachine. Rather, we considered 9 sets of inputs separately: (1) SM1 and SMDiff by themselves; (2)–(5) each of the four speaking tasks by themselves; and (6)–(9) each of the four speaking tasks together with SM1 and SMDiff. In addition, all predictor sets included sex, age, and education (categorized). In preliminary analyses, the models with the picture description and procedural discourse tasks materially underperformed and were subsequently dropped, leaving five candidate sets of predictors, the results of which we present here.
Like many machine learning algorithms, bartMachine has a number of tuning parameters—two in this case—that need to either be pre‐specified or chosen outside of the core fitting routine. These include the number of trees, m, allowed in the ensemble, and k, which determines the relative dispersion of the predicted probabilities (of CI vs. CN) across the leaves resulting from a fitted tree. In addition, in our algorithm, we also needed to select from among the five candidate feature sets described above. We accomplished both the feature set selection and the tuning parameter specification (a total of M = 30 specifications) using leave‐one‐out cross‐validation before refitting the winning model on the entire dataset. Model selection was based on the highest c‐statistic, with a minor modification designed to break near ties. In preliminary discussions, it was decided that the “important event” task had more clinical utility than “counting backward”. As such, and recognizing that a difference in c‐statistic of 0.01 is negligible, a score was defined as the c‐statistic plus a “bonus” of 0.01 for the “important event + SM” feature set; final selection was based on this score. A brief description of our approach follows, with details in the appendix.
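The tie‐breaking rule can be written in a few lines. In this Python sketch the bonus of 0.01 and the preferred feature set come from the text above, whereas the candidate c‐statistics and the (m, k) values are invented placeholders chosen only to show the bonus changing the outcome.

```python
BONUS = 0.01                         # bonus stated in the text
PREFERRED = "important event + SM"   # feature set judged to have more clinical utility


def selection_score(feature_set, c_stat):
    """c-statistic plus the bonus when the preferred feature set is used."""
    return c_stat + (BONUS if feature_set == PREFERRED else 0.0)


# Hypothetical specifications: (feature set, m, k) -> cross-validated c-statistic.
candidates = {
    ("important event + SM", 50, 2): 0.874,
    ("counting backward + SM", 200, 2): 0.880,
    ("SM only", 200, 2): 0.860,
}

best = max(
    candidates,
    key=lambda spec: selection_score(spec[0], candidates[spec]),
)
```

With these placeholder values the raw c‐statistic favors the counting task by 0.006, and the bonus moves the selection to the preferred set, which is the near‐tie situation the rule was designed for.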
2.4.2. Model validation
To obtain unbiased estimates quantifying the classification ability of the predictive model, we deployed leave‐one‐out cross‐validation. For each of the approximately 100 datasets formed by dropping one observation, we ran the entire process described above, including the leave‐one‐out process for selecting among the M = 30 specifications, and also fitting the winning specification on the data set. We generated and saved the predicted classification probability for the left‐out observation. Based on this final set of predicted probabilities, we estimated the c‐statistic and the entire ROC curve in an unbiased manner.
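The structure of this nested procedure can be sketched in Python as follows. The sketch is illustrative only: a toy nearest‐mean scorer on a single feature stands in for BART, a "specification" is reduced to the choice of one feature column, and the data are invented.

```python
def c_statistic(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_predict(train_x, train_y, test_x):
    """Toy stand-in for a classifier: higher score = closer to the CI mean."""
    ci = [x for x, y in zip(train_x, train_y) if y == 1]
    cn = [x for x, y in zip(train_x, train_y) if y == 0]
    mean_ci = sum(ci) / len(ci)
    mean_cn = sum(cn) / len(cn)
    return abs(test_x - mean_cn) - abs(test_x - mean_ci)


def loo_scores(rows, labels, column):
    """Leave-one-out predictions using one feature column."""
    scores = []
    for i in range(len(rows)):
        train_x = [r[column] for j, r in enumerate(rows) if j != i]
        train_y = [y for j, y in enumerate(labels) if j != i]
        scores.append(fit_predict(train_x, train_y, rows[i][column]))
    return scores


def select_column(rows, labels, columns):
    """Inner loop: pick the specification with the best LOO c-statistic."""
    return max(columns, key=lambda c: c_statistic(loo_scores(rows, labels, c), labels))


def nested_loo(rows, labels, columns):
    """Outer loop: repeat the whole selection with each observation held out."""
    held_out_scores = []
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        best = select_column(train_rows, train_labels, columns)
        train_x = [r[best] for r in train_rows]
        held_out_scores.append(fit_predict(train_x, train_labels, rows[i][best]))
    return held_out_scores


# Invented data: column 0 is informative, column 1 is noise.
rows = [(1.0, 5.0), (2.0, 1.0), (1.5, 4.0), (2.5, 2.0),
        (6.0, 3.0), (7.0, 5.5), (6.5, 1.5), (8.0, 4.5)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
outer_scores = nested_loo(rows, labels, columns=[0, 1])
outer_auc = c_statistic(outer_scores, labels)
```

The key design point is that the held‐out observation never influences which specification is chosen, so the resulting c‐statistic is not inflated by the selection step.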
2.4.3. Model interpretation
With many machine learning algorithms, interpretation of a fitted classifier can add additional challenges to the model‐fitting process. To address this, we executed three post‐hoc analyses (the first two available in bartMachine) to interpret fitted classifiers on a feature‐by‐feature basis: variable importance calculations (presented in bar charts), permutation based p‐values, and logistic regression modeling of the fitted predictions on the most important features.
Finally, a complementary ROC analysis was conducted to examine classification accuracy of the Qmci as a point of comparison; the posterior distribution of the confusion matrix between our fitted classifier and the Qmci was also generated (see appendix for methodological details).
3. RESULTS
3.1. Demographic and clinical characteristics
The CI and CN groups did not differ significantly in age or years of education completed, averaging about 74 years and 17 years, respectively (Table 2). The CI group comprised a higher proportion of men (i.e., 65% vs. 40%), while both groups were primarily non‐Hispanic White individuals. The CN group performed better on all cognitive measures, with Cohen's D effect sizes ranging from 0.8 to 1.4 (Table 2).
TABLE 2.
Cognitively impaired versus cognitively normal participants in a study of working memory and connected speech: Demographic and clinical characteristics by group.
| Parameter | CI (n = 51) | CN (n = 53) | Cohen's D |
|---|---|---|---|
| Age, mean (SD) years old | 74.2 (6.8) | 74.3 (6.4) | |
| Education, mean (SD) years completed a | 16.9 (2.7) | 16.9 (2.1) | |
| Sex (% male)* | 65% | 40% | |
| Ethnicity/race (% non‐Hispanic white) | 88% | 91% | |
| MMSE**, mean (SD) | 27.3 (2.7) | 29.3 (0.9) | 1.0 |
| Qmci**, mean (SD) | 58.2 (14.0) | 75.8 (10.4) | 1.4 |
| SM1**, mean (SD) | 17.0 (9.1) | 31.1 (14.5) | 1.2 |
| SMDiff**, mean (SD) | 1.9 (4.3) | 8.0 (9.9) | 0.8 |
Abbreviations: CN, cognitively normal; CI, cognitively impaired; SD, standard deviation; MMSE, Mini‐Mental State Examination; Qmci, Quick Mild Cognitive Impairment screen; SM1, Speeded Matching, first attempt; SMDiff, Speeded Matching difference (second attempt – first attempt).
a For the modeling process, education is organized into discrete categories: No formal education (0 years), primary/elementary education (1–8 years), secondary education/GED/high school graduate (9–12 years), some college/Associate's degree (13–15 years), Bachelor's degree (16 years), Master's/Doctorate degree (17–20 years).
*p = 0.01;
**p < 0.001.
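The effect sizes in Table 2 can be approximately reproduced from the reported means and standard deviations. The Python sketch below assumes the simple pooled standard deviation (the root mean of the two variances), which is reasonable here because the groups are nearly equal in size; the exact formula used in the study is not stated, so this choice is an assumption.

```python
from math import sqrt


def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Absolute standardized mean difference using an equal-weight pooled SD."""
    pooled_sd = sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return abs(mean_a - mean_b) / pooled_sd


# (CI mean, CI SD, CN mean, CN SD) as reported in Table 2.
table_2 = {
    "MMSE": (27.3, 2.7, 29.3, 0.9),
    "Qmci": (58.2, 14.0, 75.8, 10.4),
    "SM1": (17.0, 9.1, 31.1, 14.5),
    "SMDiff": (1.9, 4.3, 8.0, 9.9),
}

effect_sizes = {name: round(cohens_d(*values), 1) for name, values in table_2.items()}
```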
3.2. Comparison of classification models
As noted above, in preliminary analyses, feature sets that included picture description and procedural discourse tasks materially under‐performed relative to models based on the “important event” or the “counting backward” tasks. Thus, we reduced the problem to five feature sets: each of “important event” or “counting backward,” with or without the two SM measures, and a set with just the two SM measures, each model containing the same set of demographic predictors (listed in Table 2).
Among those five feature sets, each crossed with the six combinations of m and k, we present the best‐performing specification (over all m and k) for each feature set, based on unadjusted c‐statistic, in Table 3. The top two models, in descending order, were: (1) “important event” together with the two SM measures, SM1 and SMDiff (k = 2; m = 200 trees); and (2) “counting backward” together with the two SM measures (k = 2; m = 50 trees). These two were fairly close to one another; the others were considerably lower in c‐statistic. For the final top feature set, Table 3 presents the unadjusted and adjusted (bias‐corrected) estimates, the latter with 95% confidence intervals (CIs), of the c‐statistic and of specificity at sensitivity = 80% and 90%, along with, for technical completeness, the values of m and k. The bias‐corrected ROC curve for this top feature set is presented in Figure 1, along with both 95% and 90% CIs. The narrower interval is useful because an argument could be made that we only care about 95% coverage in one direction, namely away from (above) the 45‐degree line; hence, the upper bound of the 90% interval could be ignored and the lower bound interpreted as a one‐sided 95% lower confidence bound.
TABLE 3.
Predictive model results in a study of working memory and connected speech: Aggregate results of BART model development.
| Model description | m and k | Unadjusted statistic | Bias‐corrected statistic * (95% CI) |
|---|---|---|---|
| Top feature set: “Important event”, plus 2 SM tasks, plus demographics | m = 50; k = 2 | ||
| Top feature set: Key statistics | |||
| c‐Statistic † | | 0.88 | 0.86 (0.69, 0.95) |
| Specificity at sensitivity = 80% | | 0.84 | 0.82 (0.17, 0.99) |
| Specificity at sensitivity = 90% | | 0.72 | 0.62 (0.07, 0.97) |
| Competing feature sets (c‐Statistics) | |||
| “Count backward”, plus 2 SM tasks, plus demographics a | m = 200; k = 2 | 0.88 | |
| 2 SM tasks, plus demographics | m = 200; k = 2 | 0.86 | |
| “Count backward”, plus demographics | m = 50; k = 2 | 0.83 | |
| “Important event”, plus demographics | m = 200; k = 2 | 0.82 | |
| Qmci: Key statistics | |||
| c‐Statistic | | | 0.83 (0.74, 0.92) |
| Specificity at sensitivity = 80% | | | 0.80 (0.40, 0.92) |
| Specificity at sensitivity = 90% | | | 0.46 (0.11, 0.86) |
Abbreviations: BART, Bayesian additive regression trees; CI, confidence interval; Qmci, Quick Mild Cognitive Impairment screen; SM, Speeded Matching.
*“Bias‐corrected” implies corrected for optimism in model development (vs. “Unadjusted”).
c‐Statistic for top feature set = 0.8805; for runner‐up = 0.8777. Tie‐breaker algorithm served for further separation.
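The "specificity at sensitivity" entries in Table 3 answer the question: if the threshold is set so that at least 80% (or 90%) of CI participants are flagged, what share of CN participants is correctly left unflagged? A minimal Python sketch of that calculation follows, using invented scores; the function name is hypothetical and the study's own computation (including bias correction and interval estimation) is described in its appendix.

```python
def specificity_at_sensitivity(scores, labels, target_sensitivity):
    """Best specificity among thresholds that reach the target sensitivity.

    A participant is flagged as impaired when score >= threshold.
    labels: 1 = cognitively impaired, 0 = cognitively normal.
    """
    impaired = [s for s, y in zip(scores, labels) if y == 1]
    normal = [s for s, y in zip(scores, labels) if y == 0]
    best = 0.0
    for threshold in sorted(set(scores)):
        sensitivity = sum(s >= threshold for s in impaired) / len(impaired)
        specificity = sum(s < threshold for s in normal) / len(normal)
        if sensitivity >= target_sensitivity:
            best = max(best, specificity)
    return best


# Invented predicted probabilities for five CI and five CN participants.
scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.5, 0.4, 0.3, 0.1, 0.05]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

spec_80 = specificity_at_sensitivity(scores, labels, 0.8)
spec_100 = specificity_at_sensitivity(scores, labels, 1.0)
```

In this toy example, demanding that every impaired participant be caught forces the threshold down to the lowest impaired score, and specificity drops accordingly; the same trade‐off is visible between the 80% and 90% rows of Table 3.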
FIGURE 1.

Predictive model results in a study of working memory and connected speech. Best‐fitting BART model with two Speeded Matching variables (SM1 and SMDiff), linguistic and acoustic variables from the “important event” task, and demographic variables. Estimated and bias‐corrected ROC curve. AUROC is in Table 3. The bands are pointwise 95% and 90% (narrower) confidence bands (see Methods and Supplemental material). AUROC, area under the receiver operating characteristics curve; BART, Bayesian additive regression trees; ROC, receiver operating characteristics.
3.3. Study of top feature set
Variable importance (VI) values for the top feature set are presented in Figure 2 in terms of inclusion proportions 38 and permutation p‐values. From the bartMachine vignettes preprint, “the inclusion proportion for any given predictor represents the proportion of times that variable is chosen as a splitting rule out of all splitting rules among the posterior draws of the sum of trees model”. 39 , 40 The four variables presented are the top performers. Although the VI ordering is consistent, the inclusion proportions are small, suggesting that no single feature dominates the prediction, even though the corresponding permutation p‐values, which ranged from < 0.0005 to 0.035, are nominally significant. As described in the appendix, for a given predictor, X, these p‐values represent the proportion of times a classifier with a randomly scrambled version of X fit better than the original model; small values thus indicate a greater role for X in the fitted classifier. The full variable importance plot (not shown) included well over 20 predictors from the pool, each with an inclusion proportion well under 2%. This pattern, visible in Figure 2, where even the fourth most important variable (SM difference) appears in barely more than 2% of the splitting rules, suggests that there is considerable incremental additive value across the entire pool of predictors, with each one contributing small but non‐negligible information to the overall prediction to achieve a strong AUROC value of 0.86.
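The logic of a permutation p‐value can be illustrated with a deliberately small Python sketch. It is not the bartMachine procedure: a one‐split "stump" on a single feature stands in for the full classifier, fit is measured by the in‐sample Brier score, the data are invented, and all function names are hypothetical.

```python
import random


def stump_brier(x, y):
    """Fit the best single split on feature x; return its in-sample Brier score."""
    best = None
    for threshold in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        p_left = sum(left) / len(left)
        p_right = sum(right) / len(right) if right else p_left
        brier = sum(
            ((p_left if xi <= threshold else p_right) - yi) ** 2
            for xi, yi in zip(x, y)
        ) / len(y)
        best = brier if best is None else min(best, brier)
    return best


def permutation_p_value(x, y, n_permutations=200, seed=0):
    """Share of scrambled versions of x that fit strictly better than x itself."""
    rng = random.Random(seed)
    observed = stump_brier(x, y)
    better = 0
    for _ in range(n_permutations):
        shuffled = x[:]
        rng.shuffle(shuffled)
        if stump_brier(shuffled, y) < observed:
            better += 1
    return better / n_permutations


# Invented data: the feature separates the two groups perfectly.
y = [0, 0, 0, 0, 1, 1, 1, 1]
informative = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
noise = [3.0, 7.0, 1.0, 8.0, 2.0, 6.0, 4.0, 5.0]

p_informative = permutation_p_value(informative, y)
p_noise = permutation_p_value(noise, y)
```

Because the informative feature already achieves a Brier score of zero, no scrambled version can fit strictly better, so its p‐value is zero; a feature unrelated to group membership would be beaten by its own permutations much more often.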
FIGURE 2.

Predictive model results in a study of working memory and connected speech. Variable importance and statistical significance from the best‐fitting BART model with two Speeded Matching variables (SM1 and SMDiff), linguistic and acoustic variables from the “important event” task, and demographic variables. Variable importance showing the consistent top four predictors in the modelling procedure. Included p‐values arise from the permutation tests based on the Brier score. BART, Bayesian additive regression trees.
Finally, to aid in human interpretation of the fitted classifier, we fitted a series of logistic models, regressing the classifier probability on sets of the most important features (see appendix). We repeated this for each BART posterior sample of predictions to obtain 95% credible intervals for the resulting logistic regression coefficients.
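The idea of summarizing a fitted classifier by regressing its output on a few features, once per posterior sample, can be sketched as follows. This is a deliberately simplified stand-in and not the procedure in the appendix: it uses ordinary least squares on the logit of the predicted probability with a single feature, and takes empirical percentiles of the resulting slopes as a credible interval. All names and the toy posterior draws are hypothetical.

```python
import math

def logit(p, eps=1e-6):
    """Log-odds of p, clipped away from 0 and 1 for numerical safety."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx

def percentile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list, 0 <= q <= 1."""
    position = q * (len(sorted_values) - 1)
    lower, upper = math.floor(position), math.ceil(position)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction

def slope_credible_interval(feature, posterior_probabilities, level=0.95):
    """One slope per posterior draw, then the central `level` interval of those slopes."""
    slopes = sorted(
        ols_slope(feature, [logit(p) for p in draw]) for draw in posterior_probabilities
    )
    alpha = (1 - level) / 2
    return percentile(slopes, alpha), percentile(slopes, 1 - alpha)

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Hypothetical feature values and three posterior draws of predicted probabilities,
# generated with known slopes 0.5, 1.0, and 1.5 on the logit scale.
feature = [0.0, 1.0, 2.0, 3.0]
draws = [[sigmoid(-1.0 + b * x) for x in feature] for b in (0.5, 1.0, 1.5)]
lower, upper = slope_credible_interval(feature, draws)
print(lower, upper)
```

With several features, the single-feature slope would be replaced by a multivariable fit, but the per-draw refitting and percentile step stay the same.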
3.4. Qmci classification accuracy
Figure 3 depicts the ROC curve for the Qmci, along with the 95% CIs. The adjusted AUC of 0.83 (0.74, 0.92) was lower than the unadjusted and adjusted AUC values for the top feature set combining SM and acoustic and linguistic variables from the “important event” task (Table 3), although the associated CI is narrower because the Qmci did not require fitting a complex model. With regard to specificity at 80% and 90% sensitivity, results were slightly stronger for the top feature set than for the Qmci. In addition, we calibrated the fitted classifier to the Qmci (i.e., chose a classifier threshold that yields the same number of cases as the Qmci) and quantified the concordance and confusion between the classifier and the Qmci (see appendix).
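The calibration step described above can be sketched in a few lines of Python. This is an illustrative reading of the description and not the appendix procedure: the function names and toy data are assumptions, and ties at the threshold (which could flag more cases than intended) are not handled.

```python
def calibrate_threshold(probabilities, n_cases):
    """Threshold such that the n_cases highest probabilities are flagged (no tie handling)."""
    if n_cases <= 0:
        return float("inf")
    ranked = sorted(probabilities, reverse=True)
    return ranked[n_cases - 1]

def cross_table(probabilities, threshold, comparator_flags):
    """2x2 counts keyed by (classifier flag, comparator flag)."""
    table = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for p, flag in zip(probabilities, comparator_flags):
        table[(int(p >= threshold), int(flag))] += 1
    return table

def concordance(table):
    """Proportion of people on whom the two screens agree."""
    return (table[(0, 0)] + table[(1, 1)]) / sum(table.values())

# Hypothetical classifier probabilities and comparator (e.g., Qmci) screen results.
classifier_probs = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1]
comparator_flags = [1, 0, 1, 1, 0, 0]

threshold = calibrate_threshold(classifier_probs, sum(comparator_flags))
table = cross_table(classifier_probs, threshold, comparator_flags)
agreement = concordance(table)
print(threshold, table, agreement)
```

Matching the number of flagged cases puts the two screens on an equal footing before comparing them, so that disagreement reflects who is flagged and not how many.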
FIGURE 3.

Receiver operating characteristics curve for the Quick Mild Cognitive Impairment (Qmci) screen with 95% confidence intervals.
4. DISCUSSION
To our knowledge, this is the first study to examine the utility of combining cognitive performance and connected speech tasks in the development of an automated tool to screen for cognitive impairment. Findings indicate that a combination of working memory/processing speed, acoustic and linguistic variables from connected speaking tasks, and the ability to benefit from prior exposure to the working memory task can discriminate CN from CI groups with a high degree of accuracy, comparable to an established measure. Two of the top four contributors to the model were the first and second administrations of SM, confirming the utility of this widely used cognitive test and the additive benefit of examining practice effects. The other two top variables measured the lexical diversity of a person's speech (i.e., type–token ratio) and subtle acoustic features of voice quality, such as fine articulatory control (i.e., third formant frequency mean). Similar features in spontaneous speech samples have been identified as useful for discriminating between MCI and AD groups, 37 but specific feature importance for the purposes of classification accuracy has not been widely studied. These combinations of tasks assess key cognitive and communication abilities known to be affected early in a variety of neurodegenerative diseases (e.g., attention/executive functioning, episodic and semantic memory, and phonological processing). They include both verbal and nonverbal tasks, which may be important for individuals unable to do one type of task or the other due to speech production, visual, or motor deficits. Importantly, individuals with varying levels of cognitive functioning, including mild dementia, were able to use the tool successfully.
While these initial results are promising, they must be interpreted within the context of study limitations. First, recruitment took place during the global coronavirus disease 2019 (COVID‐19) pandemic, limiting access to the critical subpopulations of interest for this study (e.g., older adults, those who are cognitively impaired, and underrepresented ethnic/racial groups) who were necessarily minimizing social contact. Despite planned community outreach efforts that served predominantly Hispanic/Latinx individuals in Austin, Texas, and site selection to include Pennington Biomedical Research Center in Baton Rouge, Louisiana, which has a strong track record of recruitment of African American/Black participants for AD research, the study sample was primarily non‐Hispanic White, highly educated, and likely well‐resourced. Thus, findings cannot be generalized with confidence to groups who do not have similar demographic features and further validation is needed.
Despite years of methodological development in machine learning, there remain challenges in obtaining valid interpretations of learned (fitted) models. Issues relevant to this study include obtaining valid simultaneous confidence regions for ROC curves that account for the model‐selection process (here via cross‐validation) without being overly conservative. Our sense is that the regions presented in Figure 1, while nominally pointwise, are in fact closer to simultaneous and may even be conservative in that sense, but we are not able to make that claim mathematically. Some of the implications of this uncertainty emerge when comparing the top feature set to the Qmci, where confidence intervals are in some places wider for the fitted top feature set than for the Qmci. This does not necessarily mean that the top feature set is a less predictive model, but rather that the high‐dimensional statistical process followed in learning the model, when honestly assessed, carries inherent uncertainty. New studies with independent data are the most robust way to resolve these issues.
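To make the term "pointwise" concrete, the sketch below builds a percentile bootstrap band for an ROC curve, where each false‐positive rate on a grid gets its own interval. This is a generic technique substituted for illustration, not the bias‐corrected procedure described in the Supplemental material, and it does not account for model selection. The function names and toy scores are hypothetical.

```python
import random

def tpr_at_fpr(scores, labels, fpr_target):
    """Highest sensitivity reachable with a false-positive rate <= fpr_target."""
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    allowed = int(fpr_target * len(negatives))  # negatives permitted above the cutoff
    if allowed >= len(negatives):
        return 1.0
    cutoff = negatives[allowed]  # flag only scores strictly above this value
    return sum(s > cutoff for s in positives) / len(positives)

def pointwise_band(scores, labels, fpr_grid, n_boot=200, level=0.95, seed=0):
    """Percentile bootstrap interval for sensitivity, computed separately at each grid point."""
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    curves = []
    for _ in range(n_boot):
        # Resample within each group so both groups keep their original sizes.
        idx = rng.choices(pos_idx, k=len(pos_idx)) + rng.choices(neg_idx, k=len(neg_idx))
        s = [scores[i] for i in idx]
        y = [labels[i] for i in idx]
        curves.append([tpr_at_fpr(s, y, f) for f in fpr_grid])
    alpha = (1 - level) / 2
    band = []
    for j in range(len(fpr_grid)):
        column = sorted(curve[j] for curve in curves)
        band.append((column[int(alpha * (n_boot - 1))],
                     column[int((1 - alpha) * (n_boot - 1))]))
    return band

# Hypothetical classifier scores: eight impaired (1) and eight normal (0) participants.
scores = [0.95, 0.9, 0.85, 0.8, 0.7, 0.65, 0.6, 0.4,
          0.75, 0.5, 0.45, 0.35, 0.3, 0.25, 0.2, 0.1]
labels = [1] * 8 + [0] * 8
fpr_grid = [0.0, 0.25, 0.5]
band = pointwise_band(scores, labels, fpr_grid)
print(band)
```

Because each interval is calibrated on its own, the chance that the whole curve lies inside the band is generally lower than the nominal level; a simultaneous band would need to be wider.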
Results of this study add to the growing literature demonstrating the potential utility of speech/language data to screen for cognitive impairment and decline. Although many studies have analyzed audio recordings of neuropsychological testing, 11 , 41 collection and analysis of connected speech, such as recounting a personal memory, feels less like a “test” and may be less stressful or threatening, resulting in a more natural, ecologically valid assessment of cognition. Moreover, the ability to collect brief speaking samples via the telephone broadens the potential reach of these tools, adding to the attraction of pursuing additional validation research. The SM task used to assess working memory/processing speed in this study has already demonstrated feasibility and validity as a Web‐based, remote self‐assessment tool. 24
Although the cognitive screening tool in this study shows promise to be used remotely, the primary goal is to develop an automated tool that can be deployed by untrained clinic staff, such as medical assistants, and completed in less than 5 minutes as part of a primary care visit. Therefore, the tool was modified to include only the SM, personal narrative, and counting backward tasks and is currently being studied in three primary care clinics to further examine its validity, utility, and usability in a larger, diverse group of older adults. Because it is not necessary or feasible to screen all older adults, and research demonstrates that cognitive screening tools are more accurate when applied to individuals at high risk, the tool under study also includes a risk assessment 42 conducted prior to the cognitive and speaking tasks and follow‐up instructions for the PCP post‐screening. Of note, PCP review of output and associated clinical decision making is not included in the < 5‐minute estimate to complete cognitive screening.
In closing, findings from the current study highlight that applying technology and modern statistical techniques to the vast knowledge accumulated over the past 2 decades about assessments sensitive to cognitive decline may be beneficial to PCPs, who are the first line of medical care and represent a critical piece of the solution for early detection of cognitive impairment. The tool evaluated in this study was designed specifically for use in primary care and holds several advantages over established methods. Specifically, the tasks included are less biased by cultural and educational background yet still assess cognitive abilities susceptible to early decline in older adults. The tool is automated, allowing it to be self‐administered so that staff time and training are not required, administration and scoring errors are avoided, and scores do not have to be manually entered into the electronic health record. The final version will provide evidence‐based guidance about what to say and do when patients screen positive, serving as a clinical decision support tool. Thus, it may be an effective intervention to increase cognitive screening in primary care settings. Implementation studies in a variety of primary care settings with unique workflows, seeking feedback from a wide range of healthcare providers, are important next steps, and investigation of classification accuracy compared to and in combination with biomarkers in blood, cerebrospinal fluid, and neuroimaging is needed.
CONFLICT OF INTEREST STATEMENT
The authors have nothing to disclose.
CONSENT STATEMENT
All human subjects provided written informed consent.
Supporting information
ACKNOWLEDGMENTS
The authors sincerely thank the study participants without whom this research would not be possible.
Portions of this manuscript were presented at the 2023 Alzheimer's Association International Conference.
Research reported in this publication was supported by the National Institutes of Health National Institute on Aging Award Numbers R61AG069780 and P30AG066546 and National Institute on Deafness and Other Communication Disorders Award R01DC016291.
Hilsabeck RC, Keller JN, Henry ML, et al. Development and classification accuracy of an automated cognitive screening tool combining working memory and connected speech tasks for early detection of cognitive impairment in primary care. Alzheimer's Dement. 2025;11:e70145. 10.1002/trc2.70145
REFERENCES
- 1. 2019 Alzheimer's disease facts and figures. Alzheimers Dement. 2019;15:321‐387. doi: 10.1016/j.jalz.2019.01.010
- 2. Aminzadeh F, Molnar FJ, Dalziel WB, Ayotte D. A review of barriers and enablers to diagnosis and management of persons with dementia in primary care. Can Geriatr J. 2012;15(3):85‐94. doi: 10.5770/cgj.15.42
- 3. Harris DP, Chodosh J, Vassar SD, Vickrey BG, Shapiro MF. Primary care providers' views of challenges and rewards of dementia care relative to other conditions. J Am Geriatr Soc. 2009;57(12):2209‐2216. doi: 10.1111/j.1532-5415.2009.02572.x
- 4. Iliffe S, Robinson L, Brayne C, et al. Primary care and dementia: 1. diagnosis, screening and disclosure. Int J Geriatr Psychiatry. 2009;24(9):895‐901. doi: 10.1002/gps.2204
- 5. O'Driscoll C, Shaikh M. Cross‐Cultural Applicability of the Montreal Cognitive Assessment (MoCA): A Systematic Review. J Alzheimers Dis. 2017;58(3):789‐801. doi: 10.3233/JAD-161042
- 6. Cannon P, Larner AJ. Errors in the scoring and reporting of cognitive screening instruments administered in primary care. Neurodegener Dis Manag. 2016;6(4):271‐276. doi: 10.2217/nmt-2016-0004
- 7. Wojtowicz A, Larner AJ. General Practitioner Assessment of Cognition: use in primary care prior to memory clinic referral. Neurodegener Dis Manag. 2015;5(6):505‐510. doi: 10.2217/nmt.15.43
- 8. Karimi L, Mahboub‐Ahari A, Jahangiry L, Sadeghi‐Bazargani H, Farahbakhsh M. A systematic review and meta‐analysis of studies on screening for mild cognitive impairment in primary healthcare. BMC Psychiatry. 2022;22(1):97. doi: 10.1186/s12888-022-03730-8
- 9. Caselli RJ, Langlais BT, Dueck AC, et al. Neuropsychological decline up to 20 years before incident mild cognitive impairment. Alzheimers Dement. 2020;16(3):512‐523. doi: 10.1016/j.jalz.2019.09.085
- 10. Mormino EC, Papp KV, Rentz DM, et al. Early and late change on the preclinical Alzheimer's cognitive composite in clinically normal older individuals with elevated amyloid β. Alzheimers Dement. 2017;13(9):1004‐1012. doi: 10.1016/j.jalz.2017.01.018
- 11. Amini S, Hao B, Zhang L, et al. Automated detection of mild cognitive impairment and dementia from voice recordings: A natural language processing approach. Alzheimers Dement. 2023;19(3):946‐955. doi: 10.1002/alz.12721
- 12. König A, Satt A, Sorin A, et al. Automatic speech analysis for the assessment of patients with predementia and Alzheimer's disease. Alzheimers Dement (Amst). 2015;1(1):112‐124. doi: 10.1016/j.dadm.2014.11.012
- 13. Thomas JA, Burkhardt HA, Chaudhry S, et al. Assessing the Utility of Language and Voice Biomarkers to Predict Cognitive Impairment in the Framingham Heart Study Cognitive Aging Cohort Data. J Alzheimers Dis. 2020;76(3):905‐922. doi: 10.3233/JAD-190783
- 14. Graham SA, Lee EE, Jeste DV, et al. Artificial intelligence approaches to predicting and detecting cognitive decline in older adults: A conceptual review. Psychiatry Res. 2020;284:112732. doi: 10.1016/j.psychres.2019.112732
- 15. O'Caoimh R, Gao Y, McGlade C, et al. Comparison of the quick mild cognitive impairment (Qmci) screen and the SMMSE in screening for mild cognitive impairment. Age Ageing. 2012;41(5):624‐629. doi: 10.1093/ageing/afs059
- 16. De Roeck EE, De Deyn PP, Dierckx E, Engelborghs S. Brief cognitive screening instruments for early detection of Alzheimer's disease: a systematic review. Alzheimers Res Ther. 2019;11(1):21. doi: 10.1186/s13195-019-0474-3
- 17. Mitchell AJ. A meta‐analysis of the accuracy of the mini‐mental state examination in the detection of dementia and mild cognitive impairment. J Psychiatr Res. 2009;43(4):411‐431. doi: 10.1016/j.jpsychires.2008.04.014
- 18. Marc LG, Raue PJ, Bruce ML. Screening performance of the 15‐item geriatric depression scale in a diverse elderly home care population. Am J Geriatr Psychiatry. 2008;16(11):914‐921. doi: 10.1097/JGP.0b013e318186bd67
- 19. Albert MS, DeKosky ST, Dickson D, et al. The diagnosis of mild cognitive impairment due to Alzheimer's disease: recommendations from the National Institute on Aging‐Alzheimer's Association workgroups on diagnostic guidelines for Alzheimer's disease. Alzheimers Dement. 2011;7(3):270‐279. doi: 10.1016/j.jalz.2011.03.008
- 20. McKhann GM, Knopman DS, Chertkow H, et al. The diagnosis of dementia due to Alzheimer's disease: recommendations from the National Institute on Aging‐Alzheimer's Association workgroups on diagnostic guidelines for Alzheimer's disease. Alzheimers Dement. 2011;7(3):263‐269. doi: 10.1016/j.jalz.2011.03.005
- 21. Perneczky R, Wagenpfeil S, Komossa K, Grimmer T, Diehl J, Kurz A. Mapping scores onto stages: mini‐mental state examination and clinical dementia rating. Am J Geriatr Psychiatry. 2006;14(2):139‐144. doi: 10.1097/01.JGP.0000192478.82189.a8
- 22. O'Bryant SE, Humphreys JD, Bauer L, McCaffrey RJ, Hilsabeck RC. The influence of ethnicity on Symbol Digit Modalities Test performance: an analysis of a multi‐ethnic college and hepatitis C patient sample. Appl Neuropsychol. 2007;14(3):183‐188. doi: 10.1080/09084280701508986
- 23. Jaeger J. Digit Symbol Substitution Test: The Case for Sensitivity Over Specificity in Neuropsychological Testing. J Clin Psychopharmacol. 2018;38(5):513‐519. doi: 10.1097/JCP.0000000000000941
- 24. Calamia M, Weitzner DS, De Vito AN, Bernstein JPK, Allen R, Keller JN. Feasibility and validation of a web‐based platform for the self‐administered patient collection of demographics, health status, anxiety, depression, and cognition in community dwelling elderly. PLoS ONE. 2021;16(1):e0244962. doi: 10.1371/journal.pone.0244962
- 25. Darby D, Maruff P, Collie A, McStephen M. Mild cognitive impairment can be detected by multiple assessments in a single day. Neurology. 2002;59(7):1042‐1046. doi: 10.1212/wnl.59.7.1042
- 26. Hassenstab J, Ruvolo D, Jasielec M, Xiong C, Grant E, Morris JC. Absence of practice effects in preclinical Alzheimer's disease. Neuropsychology. 2015;29(6):940‐948. doi: 10.1037/neu0000208
- 27. Nicholas LE, Brookshire RH. A system for quantifying the informativeness and efficiency of the connected speech of adults with aphasia. J Speech Hear Res. 1993;36(2):338‐350. doi: 10.1044/jshr.3602.338
- 28. Macwhinney B, Fromm D, Forbes M, Holland A. AphasiaBank: Methods for Studying Discourse. Aphasiology. 2011;25(11):1286‐1307. doi: 10.1080/02687038.2011.589893
- 29. O'Caoimh R, Timmons S, Molloy DW. Screening for Mild Cognitive Impairment: Comparison of “MCI Specific” Screening Instruments. J Alzheimers Dis. 2016;51(2):619‐629. doi: 10.3233/JAD-150881
- 30. O'Caoimh R, Gao Y, Svendovski A, Gallagher P, Eustace J, Molloy DW. Comparing Approaches to Optimize Cut‐off Scores for Short Cognitive Screening Instruments in Mild Cognitive Impairment and Dementia. J Alzheimers Dis. 2017;57(1):123‐133. doi: 10.3233/JAD-161204
- 31. Eyben F, Wöllmer M, Schuller B. openSMILE—The Munich Versatile and Fast Open‐Source Audio Feature Extractor. In Proceedings of the ACM Multimedia (MM), 2010;1459‐1462.
- 32. Boersma P. Praat, a system for doing phonetics by computer. Glot International. 2001;5:341‐345.
- 33. de Jong NH, Wempe T. Praat script to detect syllable nuclei and measure speech rate automatically. Behav Res Methods. 2009;41(2):385‐390. doi: 10.3758/BRM.41.2.385
- 34. Radford A, Kim JW, Xu T, Brockman G, Mcleavey C, Sutskever I. Robust Speech Recognition via Large‐Scale Weak Supervision. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research, 2023;202:28492‐28518.
- 35. Honnibal M, Montani I, Van Landeghem S, Boyd A. spaCy: Industrial‐strength Natural Language Processing in Python. 2020; retrieved from https://spacy.io
- 36. Quirk C, Choudhury P, Gao J, et al. MSR SPLAT, a language analysis toolkit. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Demonstration Session, 2012.
- 37. Gosztolya G, Vincze V, Toth L, et al. Identifying mild cognitive impairment and mild Alzheimer's disease based on spontaneous speech using ASR and linguistic features. Computer Speech & Lang, 2019;53:181‐197.
- 38. Chipman HA, George EI, McCulloch RE. BART: Bayesian additive regression trees. Ann Appl Stat. 2010;4(1):266‐298.
- 39. Hill J, Linero A, Murray J. Bayesian Additive Regression Trees: A Review and Look Forward. Annu Rev Stat Appl. 2020;7:251‐278. doi: 10.1146/annurev-statistics-031219-041110
- 40. Kapelner A, Bleich J. bartMachine: Machine Learning with Bayesian Additive Regression Trees. J Stat Softw. 2016;70(4):1‐40. doi: 10.18637/jss.v070.i04
- 41. Xue C, Karjadi C, Paschalidis IC, et al. Detection of dementia on voice recordings using deep learning: a Framingham Heart Study. Alzheimers Res Ther. 2021;13:146. doi: 10.1186/s13195-021-00888-3
- 42. Barnes DE, Beiser AS, Lee A, et al. Development and validation of a brief dementia screening indicator for primary care. Alzheimers Dement. 2014;10(6):656‐665.e1. doi: 10.1016/j.jalz.2013.11.006