Abstract
Introduction
We investigated the agreement between automated and gold‐standard manual transcriptions of telephone chatbot‐based semantic verbal fluency testing.
Methods
We examined 78 cases from the Screening over Speech in Unselected Populations for Clinical Trials in AD (PROSPECT‐AD) study, including cognitively normal individuals and individuals with subjective cognitive decline, mild cognitive impairment, and dementia. We used Bayesian Bland–Altman analysis of word count and the qualitative features of semantic cluster size, cluster switches, and word frequencies.
Results
We found high levels of agreement for word count, with a 93% probability of a newly observed difference being below the minimally important difference. The qualitative features had fair levels of agreement. Word count reached high levels of discrimination between cognitively impaired and unimpaired individuals, regardless of transcription mode.
Discussion
Our results support the use of automated speech recognition particularly for the assessment of quantitative speech features, even when using data from telephone calls with cognitively impaired individuals in their homes.
Highlights
High levels of agreement were found between automated and gold‐standard manual transcriptions of telephone chatbot‐based semantic verbal fluency testing, particularly for word count.
The qualitative features had fair levels of agreement.
Word count reached high levels of discrimination between cognitively impaired and unimpaired individuals, regardless of transcription mode.
Automated speech recognition for the assessment of quantitative and qualitative speech features, even when using data from telephone calls with cognitively impaired individuals in their homes, seems feasible and reliable.
Keywords: automated speech recognition, Bland–Altman analysis, dementia, reliability, remote cognitive testing, semantic verbal fluency
1. INTRODUCTION
Language is impaired early in several dementia diseases, including Alzheimer's disease (AD), frontotemporal dementia (FTD), and primary progressive aphasia. Features automatically extracted from speech recordings are increasingly being used to identify and monitor cognitive decline in older people, including those with manifest dementia or at risk of developing dementia. 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 Automated chatbot calls allow for remote cognitive assessment with little to no input from third parties. In previous and ongoing studies, we have gained experience with the chatbot Mili (Mili Phone). 11 , 12 , 13 , 14 , 15 Mili is a platform designed to support speech data collection in clinical trials and studies via an App (Mili App) or via Chatbot (Mili Phone). Mili Phone connects with participants by making an ordinary phone call to their landline or smartphone device to complete cognitive assessments on the phone. The software guides the interaction through a predefined protocol consisting of short cognitive tests (e.g., semantic verbal fluency, episodic memory word list test). Like many other chatbot systems, Mili uses automated transcription of audio data as the basis for processing of verbal material. The accuracy of automated transcription of verbal material has received little attention to date, 16 , 17 even though it provides the basis for all subsequent analyses. This is particularly true when considering remote assessments via telephone that provide lower quality of speech recordings than recordings from a hospital setting.
The semantic verbal fluency task is very promising for automated remote speech‐based testing. 18 In addition to the word count, that is, number of correctly produced words, which is the typical readout in clinical testing, automated speech analysis can extract a range of additional features that may help not only to detect cognitive decline but also to differentiate between its causes. 18 Features that are reduced in people with cognitive impairments 19 include the size of semantic clusters, that is, temporal sequences of semantically related words, and the number of switches between these clusters, indicating the executive search process between clusters. 20 The word frequency measure indicates how often a term appears in a particular text or corpus 21 ; it has been found to increase with increasing cognitive impairment, 22 so that individuals with cognitive decline seem to lose access to less common words. The reliability and validity of these features will depend on the degree to which the words that are recorded are transcribed correctly. This is not an easy task, considering that the sound quality and noise level of different phones are different and people are studied at home under different environmental conditions, with different cognitive and neuropsychiatric impairments and different speaking habits, including different dialects.
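The cluster and switch features described above can be sketched in a few lines. The following is a minimal illustration only, using a hypothetical toy category lexicon; the actual SIGMA implementation is proprietary and relies on a far richer semantic model.

```python
from typing import Dict, List, Tuple

# Toy word->category lexicon (an assumption for illustration only)
TOY_CATEGORIES: Dict[str, str] = {
    "dog": "pets", "cat": "pets", "hamster": "pets",
    "lion": "wild", "tiger": "wild", "elephant": "wild",
    "trout": "fish", "salmon": "fish",
}

def fluency_features(words: List[str]) -> Tuple[int, float, int]:
    """Return (word count, mean semantic cluster size, number of cluster
    switches) for a fluency response, given a word->category mapping."""
    count = len(words)
    clusters: List[int] = []   # sizes of consecutive same-category runs
    switches = 0
    prev_cat = None
    for w in words:
        cat = TOY_CATEGORIES.get(w, w)  # unknown words form singleton clusters
        if cat == prev_cat:
            clusters[-1] += 1           # extend the current cluster
        else:
            if prev_cat is not None:
                switches += 1           # transition between clusters
            clusters.append(1)
        prev_cat = cat
    mean_size = sum(clusters) / len(clusters) if clusters else 0.0
    return count, mean_size, switches

count, size, switches = fluency_features(
    ["dog", "cat", "lion", "tiger", "elephant", "trout"]
)
```

Note how substituting a single word for one from another category would immediately change both the cluster sizes and the switch count, which is why these features are more sensitive to transcription errors than the plain word count.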
Here, we determined the agreement of speech features based on automated transcription compared to the reference standard of manual transcription of telephone‐based remote semantic verbal fluency recordings of older people with different levels of cognitive impairment, including dementia, mild cognitive impairment (MCI), subjective cognitive impairment, and normal cognition. The data used here were collected as part of the multilingual Screening over Speech in Unselected Populations for Clinical Trials in AD (PROSPECT‐AD) study, 13 but contained only speech recordings in German. Previous assessments of agreement relied on regression analysis between automated and manual transcriptions. 16 However, this can be misleading as a regression approach only assesses the relative association between two measures but not potential deviations in their absolute values, 23 which can be highly relevant for example when considering error rates of language features. Here, we used intra‐class correlation and Bland–Altman analysis in a Bayesian framework. Bland–Altman analysis calculates the mean difference between methods and an interval that defines the limits of agreement 24 , 25 and can be compared with the minimally important difference (MID). The MID defines which differences between methods may still be acceptable for clinical use. The Bayesian approach to Bland–Altman analysis was introduced in 2021. 26 It assesses “the degree of agreement between two methods of measurement via a posterior predictive distribution (e.g., calculation of the probability that the absolute difference will be within a fixed value in future observations) instead of a hypothesis test for the true limits of agreement” 26 as with frequentist analysis. It provides the user with a direct estimate of the probability that the expected differences between the two methods will be within an acceptable range.
RESEARCH IN CONTEXT
Systematic Review: The authors reviewed the literature using traditional (e.g., PubMed) sources and meeting abstracts and presentations. While automated and manual transcripts have been compared before, the analyses often used inappropriate metrics, such as simple regression, or did not consider levels of agreement but only downstream outcomes, such as levels of group discrimination. Many studies also used high‐quality recordings in experimental settings, rather than telephone calls to participants’ homes.
Interpretation: Our findings suggest that established features of semantic verbal fluency can be automatically extracted from automated telephone speech recordings and that even qualitative features achieve a reasonable level of agreement.
Future Directions: We suggest further lines of research in adapting adequate and interpretable measures of agreement, especially when used in a Bayesian framework, and encourage the use of remote automated speech assessment for detecting and monitoring cognitive decline in older people.
2. MATERIALS AND METHODS
2.1. Data sources
The data came from the PROSPECT‐AD study. The PROSPECT‐AD study is a multinational, multilingual study on the use and validation of speech biomarkers for the diagnosis and monitoring of AD. 13 It is financed by the Alzheimer Drug Discovery Foundation (ADDF). In Germany, the PROSPECT‐AD study is attached to two national cohorts conducted by the German Center for Neurodegenerative Diseases (DZNE), the DZNE Longitudinal Cognitive Impairment and Dementia Study (DELCODE), 27 and the Clinical Registry Study of Neurodegenerative Diseases (DESCRIBE).
2.2. Consent statement
Written informed consent was obtained from all participants and/or authorized representatives. The study protocols had been approved by the local institutional review boards and ethical committees of the centers participating in the PROSPECT‐AD, DELCODE, and DESCRIBE studies. The studies are being conducted in accordance with the Declaration of Helsinki of 1975 and its later amendments.
2.3. Participants
For the transcription analysis, we selected a subset of 78 cases from the PROSPECT‐AD study that had completed chatbot‐based speech assessment, including the semantic verbal fluency task, and had available clinical diagnostic information and neuropsychological testing from the DELCODE or DESCRIBE cohort visits. We included the following diagnoses: older healthy controls, first‐degree relatives, subjective cognitive decline (SCD), MCI, and AD dementia. A diagnosis of healthy control was based on the absence of cognitive impairment as determined by cognitive testing, the absence of subjective cognitive complaints, and the exclusion of neurological or psychiatric diseases. First‐degree relatives were cognitively normal but had at least one confirmed case of AD in their immediate family. The diagnosis of SCD followed the consensus definition proposed by the SCD Initiative Working Group, which defines SCD as "a persistent self‐perceived cognitive decline in the absence of objective cognitive impairment lasting at least for 6 months and being unrelated to an acute event." 28 A diagnosis of MCI or AD dementia followed the National Institute on Aging–Alzheimer's Association (NIA‐AA) workgroup criteria for MCI and AD, respectively. 29 , 30
2.4. Neuropsychological assessment
DELCODE and DESCRIBE conduct annual cognitive testing using an extensive testing battery. To test the discrimination between cognitively unimpaired and impaired subjects based on either automatically or manually transcribed speech, we grouped all subjects according to the Clinical Dementia Rating (CDR) global score (CDR = 0 vs. CDR > 0). For benchmarking the automated semantic fluency features, we used the number of correctly produced words from the paper–pencil version of the semantic verbal fluency task.
2.5. Chatbot‐based cognitive assessment
Participants were called via the software Mili. At the start of the automated assessment, the chatbot confirmed that the participant consented to continue with the phone call and agreed to the audio recording. The complete phone assessment included the following tasks: verbal learning encoding (immediate), semantic verbal fluency (animals), verbal learning recall (delayed), and narrative storytelling. Each task had verbatim instructions, which were read word by word to the participant before the task started. Each task was additionally recorded in a secondary audio stream containing only the participant's responses, to allow for deep speech analysis of task performance. For the analysis in the current paper, we only used the data from the semantic verbal fluency task.
2.6. Transcription and feature generation
Audio recordings of the semantic verbal fluency task were transcribed in two ways: manually by human raters and automatically by the proprietary ki:elements speech analysis pipeline SIGMA.
2.6.1. Manual transcripts
A bachelor's student of applied health science manually transcribed the audio files from the semantic verbal fluency tasks. After transcription was completed, S.K. checked all transcripts against the audio files again and extended them by adding the participants' comments and interrupted words. All words were recorded in an Excel file, regardless of whether they were animals. These included words that did not fit the category, such as "house"; fantasy animals, such as "wandering bear" and "power mosquito"; and participants' comments, such as "Oh‐God" or "Dog‐I‐already‐had." The student marked duplicate, interrupted, or inappropriate words; S.K. checked these marks and added or corrected them where needed.
2.6.2. Automated transcript
Automated transcripts were generated using SIGMA automatic transcription service based on Google Speech API (Google Speech API; available from: https://cloud.google.com/speech‐to‐text).
2.6.3. Feature extraction
Following transcription, semantic verbal fluency features were extracted from both the manual and the automatic transcripts in the same way, using ki:elements’ speech processing pipeline SIGMA, which analyzes subjective, serial, or semantic groupings as well as semantic and temporal aspects of the participants’ semantic verbal fluency output. The implementation of the features in SIGMA used in this analysis was based on the following:
2.7. Statistical analysis
Demographic data were compared between diagnostic groups using Bayesian analysis of variance (for age, education years, and Mini‐Mental State Examination [MMSE] scores) and a Bayesian contingency table test (for sex distribution) in Jeffreys's Amazing Statistics Program (JASP, version 0.18.3), available at jasp‐stats.org. We report the Bayes factor (BF10) quantifying evidence in favor of the hypothesis of a difference. A description of the BF10 and the evidence categories used can be found in the Supplementary Statistical Analysis section in supporting information.
We calculated intraclass correlation coefficients (ICCs) to determine levels of agreement, using the package "performance" in R, accessed through RStudio 2023.12.1. The ICC is defined as the proportion of the inter‐rater variance relative to the sum of the inter‐rater and intra‐rater variance of the measures. Here, we used variance estimates from Bayesian variance component models, resulting in an estimate of the ICC and its 95% credible interval. The ICC ranges between 0 and 1, with a value of 1 indicating perfect agreement.
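As a rough illustration of the variance ratio underlying the ICC, the following sketch computes a classical one‐way ANOVA estimate (ICC(1,1)) for paired scores. This is not the Bayesian variance component approach used in the study, and the data are hypothetical.

```python
import numpy as np

def icc_oneway(ratings: np.ndarray) -> float:
    """One-way random-effects ICC(1,1) for an n_subjects x n_raters matrix.
    Illustrative ANOVA estimator; the study used Bayesian variance
    component models instead."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    # Between-subject and within-subject mean squares
    msb = k * np.sum((row_means - grand) ** 2) / (n - 1)
    msw = np.sum((ratings - row_means[:, None]) ** 2) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical manual vs. automated word counts for five subjects
manual = np.array([21.0, 15.0, 18.0, 25.0, 12.0])
automated = manual + np.array([1.0, -1.0, 0.0, 1.0, 0.0])  # small transcription errors
icc = icc_oneway(np.column_stack([manual, automated]))
```

When both "raters" produce identical values the within‐subject variance vanishes and the ICC equals 1; small transcription errors relative to the between‐subject spread keep it close to 1.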
For the Bayesian Bland–Altman analysis, we used the R script BA.Bayesian provided by Alari et al. in the supplement of their paper. 26 We used uninformative priors, given the lack of a priori knowledge about the expected differences, adopting the prior parameter values reported in Alari et al. 26 For the Bland–Altman analysis, we defined the MID as half the standard deviation (SD) of the variable of interest, based on the gold standard measurement method in the healthy controls. There is no standard definition of the MID. Ideally, previous literature has directly examined the magnitude of the MID for a given metric. When such data are not available, as was the case in our analysis, many studies have used half the SD as an approximation for the MID. 33 , 34 We calculated the posterior mean μ and SD σ of the differences together with their 95% credible intervals, as well as their joint distribution (μ, σ), and the upper and lower limits of agreement θ1 and θ2 with their credible intervals. We also determined the posterior predictive distribution of the differences and its credible interval. The posterior predictive distribution was obtained by simulating new data points from the Bland–Altman model using samples from the posterior distribution of the differences. This distribution reflects the variability expected in newly observed differences based on the estimated model. Based on this posterior predictive distribution, we determined the probability that a newly observed difference will be smaller than the MID. More details on creating and analyzing the posterior predictive distribution can be found in the Supplementary Statistical Analysis section.
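The BA.Bayesian script of Alari et al. estimates the posterior predictive probability by sampling. As a simplified sketch of the same idea: under a noninformative (Jeffreys) prior on a normal model for the differences, the posterior predictive distribution of a new difference has a closed form (a location‐scale t distribution), from which P(|new difference| < MID) can be read off directly. The data below are simulated, not the study data.

```python
import numpy as np
from scipy import stats

def prob_new_diff_within_mid(diffs: np.ndarray, mid: float) -> float:
    """Posterior predictive P(|new difference| < MID) for normally
    distributed differences under a noninformative Jeffreys prior:
    the predictive distribution is t with n-1 degrees of freedom,
    location = sample mean, scale = s * sqrt(1 + 1/n)."""
    n = len(diffs)
    m, s = diffs.mean(), diffs.std(ddof=1)
    scale = s * np.sqrt(1.0 + 1.0 / n)
    t = stats.t(df=n - 1, loc=m, scale=scale)
    return t.cdf(mid) - t.cdf(-mid)

rng = np.random.default_rng(42)
# Simulated manual-minus-automated word count differences (mean ~0.8 words)
diffs = rng.normal(loc=0.8, scale=1.5, size=78)
p = prob_new_diff_within_mid(diffs, mid=3.0)
```

The MCMC approach of the original script generalizes this closed form to other priors, but the interpretation is identical: p is the probability that the next observed difference falls inside the MID range.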
Finally, we determined the probability of the alternative hypothesis H1 that the limits of agreement would be within the range of the MID. For comparison, we report the frequentist Bland–Altman analysis of word count, which plots the average versus the difference of the word count based on manual and automated transcriptions and indicates the mean difference and its bias, that is, the distance of the mean difference from zero, as well as the upper and lower limits of agreement. The limits of agreement were calculated assuming a Gaussian data‐generating process as mean difference ± z × SD, with SD representing the standard deviation of the differences and z the critical value of the standard normal distribution for a one‐sided P value of 0.025 (z = 1.96).
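The frequentist limits of agreement reduce to a few lines of arithmetic. The paired word counts below are hypothetical.

```python
import numpy as np

def bland_altman_limits(a: np.ndarray, b: np.ndarray):
    """Classical Bland-Altman summary: mean difference (bias) and the
    95% limits of agreement, bias +/- 1.96 * SD of the differences."""
    diffs = a - b
    bias = diffs.mean()
    sd = diffs.std(ddof=1)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical word counts from ten paired transcriptions
manual = np.array([22, 15, 18, 25, 12, 20, 17, 23, 14, 19], dtype=float)
automated = manual - np.array([1, 0, 1, 2, 0, 1, 1, 1, 0, 1], dtype=float)
bias, lo, hi = bland_altman_limits(manual, automated)
```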
3. RESULTS
3.1. Demographics
Demographic data are reported in Table 1. Of note, we combined cognitively normal controls and cognitively normal first‐degree relatives into one group, as the number of cases per group was small and there were no differences in cognitive performance between them. Diagnostic groups differed in sex distribution and, as expected, in MMSE scores, but not in age and years of education.
TABLE 1.
Demographics.
| | F/M a | Age [years] (SD) b | MMSE (SD) c | Education [years] (SD) d |
|---|---|---|---|---|
| Cognitively normal | 15/3 | 74.9 (5.7) | 29.4 (0.9) | 14.3 (2.2) |
| SCD | 20/23 | 72.0 (6.9) | 29.2 (1.1) | 14.9 (2.4) |
| MCI | 4/10 | 72.4 (7.6) | 27.4 (2.0) | 13.6 (3.0) |
| AD | 2/1 | 72.7 (7.6) | 26.0 (3.0) | 13.3 (1.5) |
Abbreviations: AD, Alzheimer's disease; F/M, female/male; MCI, mild cognitive impairment; MMSE, Mini‐Mental State Examination; SCD, subjective cognitive decline; SD, standard deviation.
a Moderate evidence for a between‐group difference in sex distribution (BF10 = 6.8).
b Moderate evidence against a between‐group difference (BF10 = 0.245).
c Extreme evidence for a between‐group difference (BF10 = 5596).
d Moderate evidence against a between‐group difference (BF10 = 0.357).
3.2. Agreement between manual and automated transcription
Figure 1 shows cross‐correlation scatter plots between the manual and automated features, with red frames highlighting the associations of the paired manual and automated features. Spearman rank correlation coefficients > 0.86 indicate a high level of association between the paired manual and automated features; however, this analysis ignores systematic differences in absolute numbers. The ICCs indicated fair to excellent agreement (Table S1 in supporting information), with 0.96 for word count but only 0.82 for cluster switches.
FIGURE 1.

Scatter plots of manual versus automated features. Scatter plots across all features. Bold red frames highlight the plots that represent corresponding pairs of features like manual and automated word count. R values indicate Spearman rank correlation. Man./aut. correct = manual/automatic transcription of word counts; higher values indicate better performance. Man./aut. size = manual/automatic transcription of semantic cluster size; higher values indicate better performance. Man./aut. switches = manual/automatic transcription of semantic cluster switches; higher values indicate better performance. Man./aut. freq. = manual/automatic transcription of word frequency; smaller values indicate better performance.
The results of the Bayesian Bland–Altman analysis for word count are shown in Figures 2 and 3. The word count mean difference was 0.78 words; the predicted difference, that is, the mean of the posterior predictive distribution, was 0.80 words, and a newly observed difference was predicted to be smaller than the MID with 93% probability. The corresponding Bayesian Bland–Altman plot is shown in Figure 3. For comparison, we plotted the frequentist Bland–Altman plot in Figure 4. The estimates of the mean difference and its variation were very similar between both analyses. However, the Bayesian approach provided additional information on the posterior predictive distribution of the differences, as shown in the lower right panel of Figure 2. In addition, the credible intervals have a direct interpretation, indicating the interval in which the true parameter will lie with 95% probability given the observed data. The mean values and corresponding parameters for the features of semantic cluster size, switches, and word frequency are reported in Table S2 in supporting information; detailed plots are shown in Figures S1–S3 in supporting information. Of note, newly observed differences for these features were predicted to be smaller than the MID with only 62% to 71% probability, considerably lower than for word count. Consistently, the posterior probability that the limits of agreement were within the MID approximated 0.
FIGURE 2.

Bayesian Bland–Altman analysis for word count. Upper left panel: posterior distribution of the mean differences (μ); vertical red dashed lines indicate the 95% credible interval. Upper middle panel: posterior distribution of the standard deviation of the differences (σ); vertical red dashed lines indicate the 95% credible interval. Upper right panel: joint posterior distribution of (μ, σ). Blue dots indicate pairs of μ and σ that fulfill the alternative hypothesis (H1) that the limits of agreement are below the minimally important difference. Lower left panel: posterior distribution of the lower limit of agreement (θ1); vertical red dashed lines indicate the 95% credible interval. Lower middle panel: posterior distribution of the upper limit of agreement (θ2); vertical red dashed lines indicate the 95% credible interval. Lower right panel: posterior predictive distribution of the differences, that is, the distribution of new differences simulated from draws of the posterior distribution; vertical red dashed lines indicate the 95% credible interval.
FIGURE 3.

Bayesian Bland–Altman plot for word count. Scatter plot of the average of the word count based on manual and automated transcriptions versus the difference between the word count based on manual and automated transcriptions. The red dashed line indicates the mean difference, the distance between this line and the black horizontal line crossing zero indicates the bias of the difference. The upper and lower limits of agreement are indicated by the dashed blue lines. LL, lower limit; SD, standard deviation; UL, upper limit.
FIGURE 4.

Frequentist Bland–Altman plot for word count. Same as Figure 3; however, the estimates are based on the mean of the differences, and the limits are derived as lower limit = mean difference − 1.96 × SD and upper limit = mean difference + 1.96 × SD, with SD representing the standard deviation of the differences and 1.96 representing the critical value of the standard normal distribution for a probability ≤ 0.025. 25 LL, lower limit; SD, standard deviation; UL, upper limit.
3.3. Group separation
In receiver operating characteristic (ROC) analyses, we discriminated cognitively impaired (CDR > 0) from cognitively unimpaired (CDR = 0) individuals (Figure 5 and Table S3 in supporting information). Numerically, the manually transcribed features, except for cluster size, yielded higher areas under the ROC curve (AUC) than the automatically transcribed features; however, the differences were small. Specifically, word count discriminated the groups with an AUC of 84%, irrespective of the mode of transcription. This compares to a benchmark AUC of 87% using the paper–pencil version of the semantic verbal fluency task during the clinical visit.
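The AUC used here equals the Mann–Whitney probability that a randomly chosen unimpaired individual scores higher than a randomly chosen impaired one. A minimal sketch, with hypothetical word counts:

```python
import numpy as np

def auc_mann_whitney(scores_pos: np.ndarray, scores_neg: np.ndarray) -> float:
    """AUC as the probability that an unimpaired (negative) score exceeds
    an impaired (positive) score; ties count half. Lower fluency scores
    are treated as indicating impairment."""
    greater = (scores_neg[:, None] > scores_pos[None, :]).sum()
    ties = (scores_neg[:, None] == scores_pos[None, :]).sum()
    return (greater + 0.5 * ties) / (len(scores_neg) * len(scores_pos))

impaired = np.array([8, 10, 12, 9, 11], dtype=float)      # CDR > 0, hypothetical
unimpaired = np.array([18, 22, 15, 20, 25], dtype=float)  # CDR = 0, hypothetical
auc = auc_mann_whitney(impaired, unimpaired)
```

With perfectly separated groups, as in this toy example, the AUC is 1.0; chance-level discrimination gives 0.5.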
FIGURE 5.

ROC curves with 95% credible intervals. Black (red) lines indicate ROC curves for discriminating cognitively impaired from unimpaired individuals based on manual (automated) transcriptions. The dashed lines indicate the corresponding 95% credible intervals. FPF, false positive fraction; ROC, receiver operating characteristic; TPF, true positive fraction.
4. DISCUSSION
Despite a generally high correlation between manual and automated features, we found quite different profiles of agreement between the different feature pairs using agreement metrics such as ICC and Bland–Altman analysis. Of note, the word count achieved excellent agreement, with a high posterior predictive probability of 93% that the difference between manual and automated feature extraction will be less than the MID. Cluster size and switches, as well as word frequency, had lower levels of agreement, still fair but not excellent, and the probability of the predicted differences being less than the MID was much lower than for word count. Consistently, these features differed slightly when based on manual versus automated transcription in discriminating cognitively impaired from unimpaired individuals, whereas for word count discrimination accuracy was identical between manual and automated transcription.
We draw two main conclusions from these results. First, in terms of clinical relevance, despite the ability of automated speech detection to provide a wide range of different features, the widely established word count appeared to be the most robust marker and came close to the group separation of the paper–pencil version of the semantic verbal fluency task. Previous studies had found differences in various semantic fluency features between MCI and AD patients and cognitively normal individuals, with higher effect sizes for word count than for semantic cluster size and switches. 35 , 36 , 37 Our group comparisons based on automated speech transcription and feature computation are consistent with these previous results. Although in previous studies the qualitative features of semantic cluster size and switching did not contribute discriminative power independent of word count to discriminate between AD and healthy controls, 38 the study of these features may help to better understand the underlying cognitive mechanisms of impaired language production 39 and may help to distinguish cortical from subcortical causes of dementia with higher impairment in subcortical cerebrovascular lesions affecting executive function. 40 Reliable automated rather than manual extraction of these qualitative features would therefore support cost‐effective comprehensive cognitive testing.
Second, from a methodological point of view, the use of appropriate metrics is important when assessing the level of agreement. Here, we extended the established portfolio of agreement measures by a Bayesian adaptation of the Bland–Altman analysis 24 , 25 that was introduced in 2021. 26 In the Bayesian framework, the 95% credible intervals for the mean differences and the predicted differences represent the bounds within which the true value is expected to lie with 95% probability (described in Kruschke, 41 chapter 11.3). Such an intuitive interpretation would be nonsensical for the frequentist confidence interval. In the frequentist framework, the true parameter value is not a random variable but a fixed (but unknown) quantity. Consequently, "by definition, any given 95% confidence interval estimate will either include or exclude the truth with 100% probability." 42 "Over infinite repeated sampling, […], the [95%]‐level confidence interval will include the true value in [95%] of the samples for which it is calculated" 42 ; that is, the confidence interval relates to long‐term realizations of the parameter value in future hypothetical experiments. Given this rather non‐intuitive meaning, the frequentist confidence interval is often misinterpreted as if it were a Bayesian credible interval. This gives the Bayesian approach an advantage, as it provides exactly the result that is mistakenly expected from the frequentist analysis.
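The long‐run interpretation of the frequentist confidence interval quoted above can be checked by simulation: across many repeated samples, roughly 95% of the computed intervals cover the fixed true mean, while any single interval either covers it or does not.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, n, reps = 0.0, 30, 2000
hits = 0
for _ in range(reps):
    sample = rng.normal(true_mu, 1.0, size=n)
    # Normal-approximation 95% CI for the mean
    half = 1.96 * sample.std(ddof=1) / np.sqrt(n)
    m = sample.mean()
    hits += (m - half <= true_mu <= m + half)
coverage = hits / reps  # long-run coverage, close to (slightly below) 0.95
```

The coverage falls slightly short of 95% because the z critical value 1.96 is used instead of the t critical value for n = 30; this is the kind of repeated‐sampling statement the confidence interval makes, in contrast to the direct posterior probability statement of the credible interval.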
As a special feature, the Bayesian Bland–Altman analysis provides an estimate of the posterior predictive probability that the difference will lie within a defined interval representing the MID. 26 This probability can be used to assess whether a particular approach is accurate enough to serve as a proxy for the gold standard method. Based on our results, one would consider automated transcription sufficient to replace manual transcription for estimating word count. However, posterior predictive probabilities were lower for the other features, so a trade‐off between accuracy and efficiency and cost will be required to decide whether automated transcription can serve as an appropriate replacement for the manual method.
One limitation of our study is the low number of cases (78). This was due to the large amount of time and resources required to manually transcribe the audio files. The PROSPECT‐AD study also tested the learning and recall of word lists using automated remote assessment, but here we focused on the semantic fluency task.
The observation that semantic cluster size, switches, and word frequency measures were more sensitive to transcription errors than the number of correctly produced words is not surprising, given the characteristics of these features. The substitution of a single word directly affects the definition of semantic clusters and the transitions between them, especially if the replaced word belongs to a different semantic category. Still, given the challenging material of speech recordings from telephone calls with older people with cognitive decline, the agreement between manual and automated transcriptions was at least fair, encouraging the use of remote speech‐based assessment for cognitive testing.
CONFLICT OF INTEREST STATEMENT
A.K., J.T., E.M., and N.L. are employed by ki:elements. N.L. and J.T. hold shares in the company ki:elements. S.K. has received unrestricted funding from the Alzheimer Drug Discovery Foundation and lecture fees from Eisai. B.F. has received funding from the Deutsche Forschungsgemeinschaft. J.W. has received funding from the BMBF; consulting fees from Immungenetics, Noselab, and Roboscreen; and lecture fees from Beijing Yibai Science and Technology Ltd., Gloryren, Janssen Cilag, Pfizer, Med Update GmbH, Roche Pharma, and Lilly. J.W. participated on a data safety monitoring board or advisory board of Biogen, Abbott, Boehringer Ingelheim, Lilly, MSD Sharp & Dohme, and Roche. C.B. received funding from the German Alzheimer Association and lecture fees from Lilly, Roche Pharma, and Eisai. S.T. participated on scientific advisory boards of Roche Pharma AG, Biogen, Lilly, and Eisai, and received lecture fees from Lilly and Eisai. A.O., E.D., W.G., M.B., J.P., S.A., A.S., O.K., I.K., C.L., M.M., S.R., I.F., D.H., F.J., and M.W. have nothing to disclose. Author disclosures are available in the supporting information.
Supporting information
ACKNOWLEDGMENTS
The PROSPECT‐AD study is funded by the Alzheimer Drug Discovery Foundation (ADDF).
Open access funding enabled and organized by Projekt DEAL.
König A, Köhler S, Tröger J, et al. Automated remote speech‐based testing of individuals with cognitive decline: Bayesian agreement of transcription accuracy. Alzheimer's Dement. 2024;16:e70011. 10.1002/dad2.70011
REFERENCES
- 1. Garcia‐Gutierrez F, Marquie M, Munoz N, et al. Harnessing acoustic speech parameters to decipher amyloid status in individuals with mild cognitive impairment. Front Neurosci. 2023;17:1221401.
- 2. He R, Chapin K, Al‐Tamimi J, et al. Automated classification of cognitive decline and probable Alzheimer's dementia across multiple speech and language domains. Am J Speech Lang Pathol. 2023;32:2075‐2086.
- 3. Šubert M, Novotný M, Tykalová T, et al. Lexical and syntactic deficits analyzed via automated natural language processing: the new monitoring tool in multiple sclerosis. Ther Adv Neurol Disord. 2023;16:17562864231180719.
- 4. Stegmann G, Hahn S, Bhandari S, et al. Automated semantic relevance as an indicator of cognitive decline: out‐of‐sample validation on a large‐scale longitudinal dataset. Alzheimers Dement. 2022;14:e12294.
- 5. Li R, Wang X, Lawler K, Garg S, Bai Q, Alty J. Applications of artificial intelligence to aid early detection of dementia: a scoping review on current capabilities and future directions. J Biomed Inform. 2022;127:104030.
- 6. Sanborn V, Ostrand R, Ciesla J, Gunstad J. Automated assessment of speech production and prediction of MCI in older adults. Appl Neuropsychol Adult. 2022;29:1250‐1257.
- 7. Ostrand R, Gunstad J. Using automatic assessment of speech production to predict current and future cognitive function in older adults. J Geriatr Psychiatry Neurol. 2021;34:357‐369.
- 8. Heaton KJ, Williamson JR, Lammert AC, et al. Predicting changes in performance due to cognitive fatigue: a multimodal approach based on speech motor coordination and electrodermal activity. Clin Neuropsychol. 2020;34:1190‐1214.
- 9. Mueller KD, Koscik RL, Hermann BP, Johnson SC, Turkstra LS. Declines in connected language are associated with very early mild cognitive impairment: results from the Wisconsin Registry for Alzheimer's Prevention. Front Aging Neurosci. 2017;9:437.
- 10. Toth L, Hoffmann I, Gosztolya G, et al. A speech recognition‐based solution for the automatic detection of mild cognitive impairment from spontaneous speech. Curr Alzheimer Res. 2018;15:130‐138.
- 11. Possemis N, Ter Huurne D, Banning L, et al. The reliability and clinical validation of automatically derived verbal memory features of the verbal learning test in early diagnostics of cognitive impairment. J Alzheimers Dis. 2024;97:179‐191.
- 12. Ter Huurne D, Possemis N, Banning L, et al. Validation of an automated speech analysis of cognitive tasks within a semiautomated phone assessment. Digit Biomark. 2023;7:115‐123.
- 13. König A, Linz N, Baykara E, et al. Screening over speech in unselected populations for clinical trials in AD (PROSPECT‐AD): study design and protocol. J Prev Alzheimers Dis. 2023;10:314‐321.
- 14. Ter Huurne D, Ramakers I, Possemis N, et al. The accuracy of speech and linguistic analysis in early diagnostics of neurocognitive disorders in a memory clinic setting. Arch Clin Neuropsychol. 2023;38:667‐676.
- 15. König A, Mallick E, Tröger J, et al. Measuring neuropsychiatric symptoms in patients with early cognitive decline using speech analysis. Eur Psychiatry. 2021;64:e64.
- 16. König A, Linz N, Tröger J, Wolters M, Alexandersson J, Robert P. Fully automatic speech‐based analysis of the semantic verbal fluency task. Dement Geriatr Cogn Disord. 2018;45:198‐209.
- 17. Soroski T, da Cunha Vasco T, Newton‐Mason S, et al. Evaluating web‐based automatic transcription for Alzheimer speech data: transcript comparison and machine learning analysis. JMIR Aging. 2022;5:e33460.
- 18. Hall JR, Harvey M, Vo HT, O'Bryant SE. Performance on a measure of category fluency in cognitively impaired elderly. Neuropsychol Dev Cogn B Aging Neuropsychol Cogn. 2011;18:353‐361.
- 19. Troyer AK, Moscovitch M, Winocur G, Leach L, Freedman M. Clustering and switching on verbal fluency tests in Alzheimer's and Parkinson's disease. J Int Neuropsychol Soc. 1998;4:137‐143.
- 20. Troyer AK, Moscovitch M, Winocur G. Clustering and switching as two components of verbal fluency: evidence from younger and older healthy adults. Neuropsychology. 1997;11:138‐146.
- 21. Brysbaert M, Mandera P, Keuleers E. The word frequency effect in word processing: an updated review. Curr Dir Psychol Sci. 2018;27:45‐50.
- 22. Marczinski CA, Kertesz A. Category and letter fluency in semantic dementia, primary progressive aphasia, and Alzheimer's disease. Brain Lang. 2006;97:258‐265.
- 23. Watson PF, Petrie A. Method agreement analysis: a review of correct methodology. Theriogenology. 2010;73:1167‐1179.
- 24. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;1:307‐310.
- 25. Giavarina D. Understanding Bland Altman analysis. Biochem Med. 2015;25:141‐151.
- 26. Alari KM, Kim SB, Wand JO. A tutorial of Bland Altman analysis in a Bayesian framework. Meas Phys Educ Exerc Sci. 2021;25:137‐148.
- 27. Jessen F, Spottke A, Boecker H, et al. Design and first baseline data of the DZNE multicenter observational study on predementia Alzheimer's disease (DELCODE). Alzheimers Res Ther. 2018;10:15.
- 28. Jessen F, Amariglio RE, van Boxtel M, et al. A conceptual framework for research on subjective cognitive decline in preclinical Alzheimer's disease. Alzheimers Dement. 2014;10:844‐852.
- 29. Albert MS, DeKosky ST, Dickson D, et al. The diagnosis of mild cognitive impairment due to Alzheimer's disease: recommendations from the National Institute on Aging‐Alzheimer's Association workgroups on diagnostic guidelines for Alzheimer's disease. Alzheimers Dement. 2011;7:270‐279.
- 30. McKhann GM, Knopman DS, Chertkow H, et al. The diagnosis of dementia due to Alzheimer's disease: recommendations from the National Institute on Aging‐Alzheimer's Association workgroups on diagnostic guidelines for Alzheimer's disease. Alzheimers Dement. 2011;7:263‐269.
- 31. Linz N, Tröger J, Alexandersson J, König A. Using neural word embeddings in the analysis of the clinical semantic verbal fluency task. In: Proceedings of the 12th International Conference on Computational Semantics (IWCS)—Short papers. Association for Computational Linguistics (ACL); 2017. p. 1‐7.
- 32. Linz N, Tröger J, Alexandersson J, Wolters M, König A, Robert P. Predicting dementia screening and staging scores from semantic verbal fluency performance. In: IEEE International Conference on Data Mining (ICDM)—Workshop on Data Mining for Aging, Rehabilitation and Independent Assisted Living. IEEE; 2017. p. 719‐728.
- 33. Norman GR, Sloan JA, Wyrwich KW. Interpretation of changes in health‐related quality of life: the remarkable universality of half a standard deviation. Med Care. 2003;41:582‐592.
- 34. Draak THP, de Greef BTA, Faber CG, Merkies ISJ; PeriNomS study group. The minimum clinically important difference: which direction to take. Eur J Neurol. 2019;26:850‐855.
- 35. Mueller KD, Koscik RL, LaRue A, et al. Verbal fluency and early memory decline: results from the Wisconsin Registry for Alzheimer's Prevention. Arch Clin Neuropsychol. 2015;30:448‐457.
- 36. Weakley A, Schmitter‐Edgecombe M. Analysis of verbal fluency ability in Alzheimer's disease: the role of clustering, switching and semantic proximities. Arch Clin Neuropsychol. 2014;29:256‐268.
- 37. Haugrud N, Crossley M, Vrbancic M. Clustering and switching strategies during verbal fluency performance differentiate Alzheimer's disease and healthy aging. J Int Neuropsychol Soc. 2011;17:1153‐1157.
- 38. Gomez RG, White DA. Using verbal fluency to detect very mild dementia of the Alzheimer type. Arch Clin Neuropsychol. 2006;21:771‐775.
- 39. Ahn H, Yi D, Chu K, et al. Functional neural correlates of semantic fluency task performance in mild cognitive impairment and Alzheimer's disease: an FDG‐PET study. J Alzheimers Dis. 2022;85:1689‐1700.
- 40. Zhao Q, Guo Q, Hong Z. Clustering and switching during a semantic verbal fluency test contribute to differential diagnosis of cognitive impairment. Neurosci Bull. 2013;29:75‐82.
- 41. Kruschke JK. Doing Bayesian Data Analysis—A Tutorial with R, JAGS, and Stan. 2nd ed. Elsevier; 2015.
- 42. Naimi AI, Whitcomb BW. Can confidence intervals be interpreted? Am J Epidemiol. 2020;189:631‐633.