Author manuscript; available in PMC: 2024 Sep 1.
Published in final edited form as: Artif Intell Med. 2023 Jul 17;143:102624. doi: 10.1016/j.artmed.2023.102624

ADscreen: A Speech Processing-based Screening System for Automatic Identification of Patients with Alzheimer’s Disease and Related Dementia

Maryam Zolnoori 1,*, Ali Zolnour 2,*, Maxim Topaz 1
PMCID: PMC10483114  NIHMSID: NIHMS1918819  PMID: 37673583

Abstract

Alzheimer’s disease and related dementias (ADRD) present a looming public health crisis, affecting roughly 5 million people and 11% of older adults in the United States. Despite nationwide efforts for timely diagnosis of patients with ADRD, more than 50% of them are undiagnosed and unaware of their disease. To address this challenge, we developed ADscreen, an innovative speech-processing based ADRD screening algorithm for the proactive identification of patients with ADRD. ADscreen consists of five major components: (i) noise reduction for reducing background noises from the audio-recorded patient speech, (ii) modeling the patient’s ability in phonetic motor planning using acoustic parameters of the patient’s voice, (iii) modeling the patient’s ability in semantic and syntactic levels of language organization using linguistic parameters of the patient speech, (iv) extracting vocal and semantic psycholinguistic cues from the patient speech, and (v) building and evaluating the screening algorithm. To identify important speech parameters (features) associated with ADRD, we used Joint Mutual Information Maximization (JMIM), an effective feature selection method for high dimensional, small sample size datasets. Modeling the relationship between speech parameters and the outcome variable (presence/absence of ADRD) was conducted using three different machine learning (ML) architectures capable of fusing informative acoustic and linguistic features with contextual word embedding vectors obtained from DistilBERT (a distilled version of Bidirectional Encoder Representations from Transformers, BERT). We evaluated the performance of ADscreen on audio-recorded patient speech (verbal descriptions) for the Cookie-Theft picture description task, which is publicly available in DementiaBank. The joint fusion of acoustic and linguistic parameters with contextual word embedding vectors of DistilBERT achieved an F1-score of 84.64 (standard deviation [std] = ± 3.58) and AUC-ROC of 92.53 (std = ± 3.34) for the training dataset, and an F1-score of 89.55 and AUC-ROC of 93.89 for the test dataset. In summary, ADscreen has a strong potential to be integrated into the clinical workflow to address the need for an ADRD screening tool, so that patients with cognitive impairment can receive appropriate and timely care.

Graphical abstract


1. Introduction

Alzheimer’s disease and related dementias (ADRD) represent a looming public health crisis, affecting roughly 5 million people and 11% of older adults in the United States.1 ADRD patients are frequent utilizers of healthcare services in general2,3 and emergency department services4,5 in particular, and they incur higher costs of care compared with non-ADRD patients.2,6 Despite nationwide efforts for timely diagnosis of ADRD, more than 50% of these patients remain underdiagnosed and undertreated.7–9 This is mostly due to patients’ inability to recognize early symptoms,10 limited availability of biomarkers (e.g., cerebrospinal fluid, magnetic resonance imaging11),12 and clinicians’ insufficient time to assess patients for ADRD.13 Given the projection of 13.2 million ADRD patients by 2050,14 development of a robust screening tool for early identification of elderly patients with ADRD has been recognized as a research priority by the National Institute on Aging (NIA).9,15

Emerging studies show that changes in patients’ spoken language are among the earliest signs of cognitive impairment, enabling the features of spoken language to act as biomarkers for multiple dimensions of cognitive abilities, including executive functioning, semantic memory, and language.16–18 Cues of cognitive impairment conveyed in the voice have been empirically documented by measurement of the acoustic waveform (parameters), reflecting the shape of the vocal tract and the patient’s ability to control vocal cord execution during speech. Also, the language of patients with cognitive impairment conveys cues such as low coherence or low information density that mostly occur due to memory deficits.19 These language cues can be identified using established methods in the natural language processing domain, such as metrics for measuring information density and contextual word embedding methods for modeling disfluency in patient speech. Additionally, alterations in emotion expression in patients with ADRD can develop in parallel with cognitive deterioration. These alterations can affect the psycholinguistic features of the voice and, in turn, the patient’s ability to communicate and express their needs to some degree.20,21 The psycholinguistic cues of speech can be modeled using both nonverbal vocalization (non-word speech) and, semantically, via language. Changes in nonverbal vocalization can be estimated using different phonatory and articulatory parameters of the acoustic waveform.22 The identification of semantic psycholinguistic cues in language can be achieved by utilizing a set of linguistic features, including lexical-based natural language processing tools specifically designed to analyze the psychological aspects of language.23

In this research, we developed an innovative screening method called ADscreen for proactive, automated identification of patients at risk for ADRD. In this study, we developed for the first time a unified pipeline to model three primary elements of speech: phonetic motor planning, semantic and syntactic language organization, and psycholinguistic features of patient speech. We used this pipeline to process audio recordings of patients’ verbal descriptions during the “Cookie-Theft” picture description test (referred to as the Cookie-Theft test) for each element. This audio-recorded dataset is part of the publicly available DementiaBank English Pitt Corpus. Different machine learning (ML) models were trained on the training dataset and evaluated on the test dataset to provide an unbiased evaluation of the performance of the screening algorithm.

2. Related Works

To develop a screening algorithm for identifying patients with ADRD, previous studies used different approaches to process the audio-recorded patients’ verbal descriptions for the Cookie-Theft test available in the DementiaBank English Pitt Corpus. Some of the studies focused only on the acoustic (voice) or linguistic (transcription of the verbal description) part of speech, while others analyzed both parts and highlighted the importance of each part in detecting patients with ADRD. Additionally, they used different ML architectures and evaluation mechanisms to build the screening algorithm. This section provides a brief review of the studies’ approaches for generating acoustic and linguistic features and the ML architectures used to analyze these features for building an ADRD screening algorithm. We included studies that demonstrated promising performance in identifying patients with ADRD.

Balagopalan et al.24 modeled the acoustic part of speech using acoustic parameters of frequency and spectral domain, and speech fluency. To model the linguistic part, they used lexical and syntactic features (e.g., lexical richness, constituency parsing tree) and proportions of various information content units (as an indicator of memory impairment) used in the patients’ verbal descriptions for the Cookie-Theft test. Different ML algorithms were trained on the combination of acoustic and linguistic features. Support Vector Machine (SVM) had the highest performance with an accuracy of 81.5. The authors did not report the importance of each part of the speech in achieving this performance.

Shah et al.25 investigated the performance of alternative open-access repositories of acoustic assessment algorithms, including AVEC-2013,26 EMO_Large,27 and ComParE-2013,28 for modeling the acoustic part of speech. For modeling the linguistic part, they used different natural language processing (NLP) techniques such as part-of-speech (POS) tagging, term frequency-inverse document frequency (TF-IDF) features, and n-grams to quantify syntactic and semantic parameters of the patient’s language. They also computed repetition and filled/non-filled pauses to quantify semantic disfluency. Different machine learning models were trained on the acoustic and linguistic feature sets. SVM achieved the highest performance with an accuracy of 65 for acoustic features, an accuracy of 85 for linguistic features, and an accuracy of 83 for the joint combination of acoustic and linguistic features. The authors concluded that the reduction in accuracy might be due to the overfitting of ML models on the feature sets of the training data.

Martinc et al.29 extracted the psycholinguistic cues in the patient’s speech using the GeMAPS30 acoustic feature set (see section “3.5 Component 4: Modeling the patient’s psycholinguistic expression” for details of GeMAPS). To model the acoustic part of speech, they used parameters of the frequency and spectral domain. To model the linguistic part, they used different NLP techniques, such as TF-IDF, grammatical dependency, universal dependency, and the Doc2Vec text representation model. The authors also used readability features (e.g., the Gunning Fog index) to measure complexity in the linguistic part of speech. Different ML algorithms were trained on the linguistic and acoustic feature sets. Logistic Regression had the highest performance, with an accuracy of 57.6 for the acoustic part, an accuracy of 75 for the linguistic component, and an accuracy of 77.08 for the joint combination of the acoustic, linguistic, and psycholinguistic feature sets.

Chen et al.31 extracted the psycholinguistic cues in the patient’s speech using Linguistic Inquiry and Word Count23 (LIWC) and GeMAPS30 feature sets. Similar to Shah et al.,25 authors used ComParE-201328 to model the acoustic part of speech. To model the linguistic part of speech, the authors used a transformer-based pretrained language model, Bidirectional Encoder Representations from Transformers (BERT). The BERT language model is capable of modeling the conceptual relationship between words and semantic disfluency in the patient’s utterances.32 Several ML algorithms were trained and tested on acoustic and linguistic feature sets (including psycholinguistic cues extracted using GeMAPS and LIWC). Logistic Regression had the highest performance with an accuracy of 71.69 for the acoustic part, an accuracy of 74.65 for the linguistic part (using only BERT), and an accuracy of 81.69 for the joint combination of acoustic, linguistic, and psycholinguistic feature sets.

Rohanian et al.33 modeled the acoustic part of speech using COVAREP,34 an open-access repository of acoustic assessment algorithms. To model the linguistic part of speech, the authors used GloVe,35 a word embedding technique that generates word embedding vectors from the patients’ utterances. To model semantic disfluency, the authors used a deep-learning driven model of incremental detection of disfluency developed by Hough and Schlangen.36 An ML architecture with bidirectional long short-term memory (Bi-LSTM) was trained and evaluated on the linguistic and acoustic feature sets, which achieved an accuracy of 66.6 for the acoustic feature set, an accuracy of 70.8 for the linguistic feature set, and an accuracy of 79.2 for the joint combination of acoustic and linguistic feature sets.

Pappagari et al.37 investigated the performance of x-vectors,38 a transformer-based pretrained speech processing model, to process the acoustic part of the speech. The x-vector model is a deep neural network originally developed for speaker identification. It was trained on several datasets38 of telephone conversations and microphone speech to map variable-length utterances to fixed-dimensional embeddings. For modeling the linguistic part, the authors used the BERT language model, which can model the conceptual relationship between words and semantic disfluency in the patient’s utterances.32 The authors trained and tested a gradient boosting machine (GBM) on the acoustic and linguistic embedding vectors. GBM achieved an accuracy of 58 for x-vectors, an accuracy of 72.92 for the BERT model, and an accuracy of 75 for the joint combination of x-vectors and the BERT model.

Pompili et al.39 investigated the performance of x-vectors (see Pappagari et al.37) and i-vectors,40 pretrained speech processing models, to process the acoustic part of the speech. The i-vector model is a DNN acoustic embedding method trained on the VoxCeleb dataset,41 an annotated audio dataset for speaker identification collected from YouTube. Similar to previous studies,31,37 the authors used BERT contextual embeddings to model the linguistic part of speech. Also, POS tagging was used to calculate the distribution of parts of speech for each word in the sentences. An ML architecture with a Bi-LSTM network and SVM was trained and tested on the acoustic and linguistic feature sets, which achieved an accuracy of 54.17 for x-vector embeddings, an accuracy of 72.92 for BERT embeddings, and an accuracy of 81.25 for the joint combination of acoustic and linguistic embeddings.

Zhu et al.42 investigated the performance of different transformer-based pretrained speech processing models for generating deep acoustic embeddings from the acoustic part of speech. Specifically, they used MobileNet,43 YAMNet,44 and Speech BERT45 for this task. MobileNet is a lightweight deep neural network built on depth-wise separable convolutions and trained on the ImageNet dataset.46 YAMNet has the same architecture as MobileNet but with the difference that it was trained on a human-labeled YouTube audio dataset47 for audio events. The Speech BERT architecture is similar to the Text BERT architecture (the contextual embedding model), except that the Speech BERT’s input is the Mel spectrogram of speech data, and it was trained on the LibriSpeech dataset,48 a large set of audiobooks. For the linguistic part, they used BERT, BERT large, and Longformer. Longformer49 is an extended version of the BERT language model that scales linearly with the sequence length of the text document to facilitate processing a document with thousands of tokens or longer. For unimodal (acoustic or linguistic part) transfer learning, Speech BERT and Longformer achieved the highest accuracy of 66.67 and 82.08, respectively. For multimodal transfer learning, the joint combination of Speech BERT and Longformer achieved accuracy = 82.9, which was the highest compared to the joint combination of other models.

Koo et al.50 investigated the acoustic component of speech using VGGish,51 a transformer-based pretrained speech processing model for generating deep acoustic embeddings from the acoustic part of speech. VGGish was trained on a large set of manually annotated YouTube videos.47 The authors also employed the GeMAPS feature set to extract acoustic psycholinguistic cues from the acoustic part of speech. To model the linguistic part of speech, the authors used the XLNet52 language model, an extended version of the BERT language model. Like BERT, XLNet is able to model the conceptual relationship between words and semantic disfluency in speech utterances. Additionally, the authors quantified semantic impairment in speech utterances using repetitiveness and lexical richness metrics. An ML architecture with CNN and Bi-LSTM networks was trained on the acoustic and linguistic features, which achieved an accuracy of 72.92 for VGGish, an accuracy of 81.25 for XLNet, and an accuracy of 81.25 for the joint combination of all acoustic and linguistic features. The authors concluded that achieving the same performance for only the linguistic feature set (unimodal) and the joint combination of linguistic and acoustic feature sets (multimodal) might be due to the overfitting of the multimodal model on the training dataset.

Syed et al.53 investigated the performance of alternative open-access repositories of acoustic assessment algorithms, including the IS10-Paralinguistics feature set54 and COVAREP,34 for modeling the acoustic part of speech. The authors also used VGGish51 for generating acoustic embeddings. For the linguistic part, they used different versions of the BERT language model (e.g., BERT large cased, DistilRoBERTa, and BioMed RoBERTa) for modeling the contextual relationship between words in the utterances. SVM was trained on the acoustic and linguistic parts of speech, which achieved an accuracy of 64.58 for the acoustic part, an accuracy of 85.42 for the linguistic part, and an accuracy of 79.17 for the joint combination of acoustic and linguistic parts.

Balagopalan et al.55 utilized lexico-syntactic features to model the linguistic aspects of participant speech. These features were derived from speech graphs, constituency parsing trees, lexical richness, and syntactic and semantic features based on the picture description content. For the acoustic aspect of speech, the researchers employed Mel-frequency cepstral coefficients (MFCCs), fundamental frequency, and zero-crossing rate-related statistics. They trained various machine learning algorithms using a combination of these acoustic and linguistic features. The SVM model, with the 10 most informative features chosen through the ANOVA method, achieved an accuracy of 81.3%. Additionally, the authors applied the pre-trained BERT model for the linguistic portion, resulting in a higher accuracy of 83.3% compared to models trained on manually crafted linguistic and acoustic features.

Kong et al.56 modeled the linguistic aspects of speech by employing syntactic and semantic features, as well as psycholinguistic characteristics of participants’ language. They projected these features into an embedding space of a specific dimension using an encoder. For the acoustic parameters of speech, the researchers utilized MFCCs. They combined the two modalities using a joint embedding method adapted from Kiros et al. (2014) and built logistic regression classifiers on these feature sets, achieving an accuracy of 70.8%. In addition to this approach, the study also explored the performance of an end-to-end neural model using hierarchical attention networks (HAN), which avoids any feature engineering. This model achieved an accuracy of 81.5%. When participant age was incorporated into the model, the classification performance improved further, reaching an accuracy of 86.9%.

Bertini et al.57 trained an autoencoder with a multilayer perceptron architecture on the log mel spectrogram of participants’ speech audio data. The rationale for employing an autoencoder was to generate a 128-dimensional vector that effectively captures the inherent audio features of Alzheimer’s disease patients’ vocal production. This code was then utilized to train a multilayer perceptron capable of identifying potential Alzheimer’s disease subjects. To enhance the model’s performance, the researchers utilized the SpecAugment suite, introduced by Park et al.,58 which transforms log mel spectrograms to increase the input data points. The model achieved an accuracy of 93.3% (F-1 score = 88.5) on the augmented dataset, demonstrating a 26.4% improvement compared to its performance on the non-augmented data (accuracy = 73.9, F-1 score = 62.1).

Roshanzamir et al.59 examined the performance of pre-trained BERT, XLNet, and XLM models for contextual embedding of the linguistic aspect of speech. They found that the BERT large language model, when combined with a logistic regression classifier, achieved the highest accuracy score of 88.08%. The researchers also investigated the impact of two text augmentation techniques on the performance of contextual word embedding methods: the similar word substitution augmentation method and the sentence removal augmentation method. However, they showed that these methods did not result in any significant overall improvements.

ADscreen, the screening algorithm we developed, has some key differences from previous studies: (1) ADscreen has a component for modeling the patient’s ability in phonetic motor planning, built on informative acoustic parameters associated with ADRD. Compared to previous studies that used transformer-based speech processing models, this component provides insight into impairments in acoustic parameters, such as alterations in speech fluency. Additionally, we showed that the predictive power of this component (accuracy = 78.87) is higher than that of the transformer-based speech processing models (x-vectors, i-vectors, VGGish, MobileNet, YAMNet, Speech BERT) reported in previous studies. (2) To model the linguistic part of speech, we used both a transformer-based pretrained language model (the BERT language model) and domain-related features to quantify semantic fluency, semantic impairment, and syntactic structure in the speech of patients with ADRD. This component had a relatively high predictive performance (accuracy = 83.09) compared to previous studies and can provide insight into impairments in the linguistic components of speech. (3) ADscreen has a component for extracting psycholinguistic cues from the patient’s speech. This component is particularly important for extracting vocal and semantic psycholinguistic cues associated with neuropsychiatric symptoms for further evaluation. (4) Finally, ADscreen achieved an accuracy of 90.14 and an F1-score of 89.55 for identifying patients with ADRD, measured on the test dataset. This result indicates that ADscreen has a strong potential to be integrated into the clinical workflow to draw clinicians’ attention to the patient’s cognitive status for further evaluation. Details of the methodology for the development of ADscreen are provided in the Method section.

3. Method

ADscreen is built on an analytic pipeline for modeling spontaneous speech by extracting acoustic and linguistic speech parameters. As part of this pipeline, we used ML algorithms to model the relationships among variables for detecting patients with ADRD. Figure 1 provides a schematic view of the ADscreen analytical components. Sections 3.2–3.5 provide a detailed description of ADscreen’s components.

Figure 1. A schematic view of the ADscreen pipeline

3.1. Data Source

We used an audio-recorded speech dataset from the English Pitt DementiaBank, which included spontaneous speech samples from 237 participants during the “Cookie-Theft” picture description test. The “Cookie-Theft” test60 is a drawing depicting two children stealing cookies behind their mother’s back (see Appendix A). This test has been proven effective for assessing cognitive function in several studies.61,62

Participants were instructed by a healthcare provider to describe the drawing or create a story. From the 237 participants, the audio-recorded speech of 166 participants was organized into a training dataset, consisting of 87 ADRD patients (case group) and 79 non-cognitively impaired participants (control group). The remaining 71 participants’ audio-recorded speech was organized into a test dataset, with 35 ADRD patients and 36 non-cognitively impaired participants.

The “Cookie-Theft” test study’s inclusion criteria mandated participants be at least 44 years old. Table 1 displays demographic information for both development and test datasets, indicating that case group participants were slightly older and more likely to be women than control participants.

Table 1.

Characteristics of the cohort

Development Dataset
Attribute | ADRD participants (case), N = 87 | Non-cognitively impaired participants (control), N = 79
Gender (F/M) | 58 / 29 | 52 / 27
Age, mean (std) | 69.72 ± 6.8 | 66.04 ± 6.25
MMSE score, mean (std) | 17.44 ± 5.33 | 28.99 ± 1.15
Word count, mean (std) | 88.54 ± 47.92 | 113.54 ± 69.58

Test Dataset
Attribute | ADRD participants (case), N = 35 | Non-cognitively impaired participants (control), N = 36
Gender (F/M) | 21 / 14 | 23 / 13
Age, mean (std) | 68.51 ± 7.12 | 66.11 ± 6.53
MMSE score, mean (std) | 18.86 ± 5.8 | 28.91 ± 1.25
Word count, mean (std) | 92 ± 57 | 109 ± 56

Participants also underwent a thorough neuropsychological assessment, including verbal tasks and Mini-Mental State Examination (MMSE). Eligible participants were required to have no history of major nervous system disorders and achieve an initial MMSE score above 10. MMSE scores, ranging from 0 to 30 points, are interpreted as follows: 24–30 (normal cognitive function), 18–23 (mild cognitive impairment), 10–17 (moderate cognitive impairment), and 0–9 (severe cognitive impairment). As per the MMSE scores in Table 1, case groups in both development (MMSE = 17.44 ± 5.33) and test datasets (MMSE = 18.86 ± 5.8) primarily experienced mild to moderate cognitive impairment.

ADRD participants exhibited lower mean word counts compared to the control group in both datasets (see Table 1), suggesting potential difficulties with language and communication. A detailed description of the cohort building process is available in Becker et al.,60 with further information provided in Appendix B of the manuscript.

3.2. Component 1: Noise reduction

Individuals’ speech that is audio-recorded in laboratory or real-world settings often includes noise that may affect the quality of downstream tasks built on that audio data. Environmental noise can stem from different sources, such as human conversation in the background or thermal noise from a radio receiver. Overall, noise can be categorized into two main types: stationary and nonstationary. In stationary noise, statistical parameters of the signal, such as intensity and spectrum shape, remain constant over time, while in nonstationary noise these parameters change. The presence of noise in the audio-recorded data affects the accuracy of acoustic assessment algorithms, as well as the accuracy of machine learning classifiers built on the computed acoustic feature sets. To eliminate noise in the speech data, we used the iZotope RX863 toolkit, a noise reduction and removal software built on deep learning neural network models. iZotope RX8 identifies and reduces both stationary and nonstationary background noise and has shown good performance in several sound enhancement studies.64,65

3.3. Component 2: Modeling phonetic motor planning (phonetic component)

Impairment in phonetic motor planning in patients with neurodegenerative disorders leads to poor pronunciation, along with alterations in phonological planning and speech rhythm.66,67,68 We used acoustic parameters in five domains to model phonetic motor planning.

3.3.1. Alteration in speech fluency:

Metrics for evaluation of speech fluency are among the most widely employed measures for assessing cognitive functioning. Different studies show that a subtle impairment in speech fluency indicates changes in the temporal functions of phonation. Individuals with ADRD manifest slower speech and articulation rates, including longer within-word disfluency (occurring due to prolongation of word sounds), more and longer pauses, as well as an inappropriate temporal distribution of pauses69 in their speech units. We used the following metrics to model speech fluency: articulation rate (number of phonemes per second without hesitation),66 speech rate (number of phonemes per second with hesitation),66 silent pauses (number of speechless intervals at the beginning of and between words),70 within-word disfluency (within-word silent pauses and sound prolongations),71 and voicing probability72 (the percentage of voiced versus unvoiced energy in a speech signal).
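As a rough illustration of these fluency measures, the following sketch computes speech rate, articulation rate, and silent-pause statistics from word-level timings such as those returned by an automatic transcription service. The timing format, the 0.25 s pause threshold, and the use of words rather than phonemes as the rate unit are simplifying assumptions rather than the authors' exact implementation.

```python
# Hedged sketch: fluency metrics from word-level timings (hypothetical format).
from typing import Dict, List

def fluency_metrics(words: List[Dict], pause_threshold: float = 0.25) -> Dict[str, float]:
    """words: [{'word': str, 'start': float, 'end': float}, ...] sorted by time."""
    total_time = words[-1]["end"] - words[0]["start"]
    speaking_time = sum(w["end"] - w["start"] for w in words)
    gaps = [b["start"] - a["end"] for a, b in zip(words, words[1:])]
    pauses = [g for g in gaps if g > pause_threshold]
    return {
        "speech_rate": len(words) / total_time,           # words/sec, hesitations included
        "articulation_rate": len(words) / speaking_time,  # words/sec, pauses excluded
        "n_silent_pauses": len(pauses),
        "mean_pause_duration": sum(pauses) / len(pauses) if pauses else 0.0,
    }
```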

3.3.2. Alteration in frequency and spectral parameters of the voice:

Fundamental frequency (F0) and resonant frequencies (formants) [measured in hertz (Hz) or cycles per second (cps)] can provide information about an individual’s ability to control the vocal folds and vocal tract in speech production,73 and phonological motor planning in turn. F0 is the vibratory rate of the vocal folds, and the F0 range is an indicator of the phonation range that an individual can produce. This range is lower in individuals with ADRD compared with non-cognitively impaired individuals. Formant frequencies74 (F1, F2, F3) are acoustic resonances of the vocal tract that occur due to changes in the position of the vocal organs. The perceived quality of vowel pronunciation is determined by the functional relationship among F1, F2, and F3.75 If the formants do not change fast enough or are not distinct enough, sounds may become harder for listeners to identify, leading to the perception of mumbling.76 Patients with ADRD are unable to control high-formant frequencies with average tonal oscillations over 500 Hz. This is particularly the case for F3, with tonal oscillations between 1500 Hz and 2500 Hz,77 resulting in the generation of unclear sound in speech.

Spectral parameters of speech are derived from the analysis of discrete frequencies (spectrum frequencies) of the speech signal over short time frames (e.g., 25 ms).78 Statistical analysis of the power (energy) of spectrum frequencies over a continuous range of speech can provide important clues about an individual’s phonological planning. Energy variation among the frequency spectrum of speech signals can be computed using the Mel Frequency Cepstral Coefficients (MFCCs). MFCC79 is a widely used approach for the detection of phones (the sound realizations of phonemes) in speech recognition systems. Previous studies show that MFCCs have good discriminative power in detecting patients with ADRD.80,81 Other metrics computed from spectrum frequencies are the long-term average spectrum (LTAS) and the spectral center of gravity,82 which capture the spectrum of the glottal source as well as resonant characteristics of the vocal tract.83 These two metrics were linked to cognitive changes in previous studies.84 Formant frequencies and bandwidths are commonly estimated using linear prediction analysis via Linear Predictor Coefficients (LPCs). Line Spectral Pairs (LSPs) are used to represent LPCs because properties such as lower sensitivity to quantization noise make them superior to direct quantization of LPCs.
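As an illustration, the sketch below extracts MFCCs over 25 ms frames with the librosa library and summarizes each coefficient with simple functionals; the number of coefficients (13), the 10 ms hop size, and the file name are assumptions rather than the authors' reported configuration.

```python
# Hedged sketch: MFCC functionals over 25 ms frames using librosa.
import librosa
import numpy as np
from scipy.stats import skew

y, sr = librosa.load("patient_speech.wav", sr=16000)       # hypothetical audio file
frame, hop = int(0.025 * sr), int(0.010 * sr)               # 25 ms frame, 10 ms hop
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=frame, hop_length=hop)

# Summarize each coefficient over time (mean, std, skewness).
functionals = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1), skew(mfcc, axis=1)])
```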

3.3.3. Alteration in the intensity of the voice:

Voice intensity is a function of the mass, tension, and biomechanical characteristics of the vocal folds, as well as slight variations in an individual’s ability in neural control. Impairment in phonological motor planning is associated with the individual’s inability to control the intensity of speech. That inability can negatively affect the articulatory and prosodic aspects of speech, making the sound monotonous, dull, or even meaningless.66 The mean and variability of intensity in an individual’s speech correlate with the perception of vocal loudness and loudness variation. Vocal intensity is measured using metrics of sound pressure level (the time-average sound level), indicating the strength of vocal fold vibration. Variation in loudness can be measured using the jitter and shimmer metrics, which are the cycle-to-cycle variations of fundamental frequency and amplitude, respectively. These two metrics are widely used for describing pathological voice quality,85 particularly in patients with cognitive impairment.86 Additionally, to quantify speech intensity, we computed the Hammarberg Index (the difference between the maximum energy in the 0–2 kHz band and the energy in the 2–5 kHz band), energy concentration70 (average of spectral frequency content), and the ratio of the energy of the spectral harmonic peak at the second and third formant’s center frequency to the energy of the spectral peak at F0 in voiced regions.
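The following sketch illustrates how jitter, shimmer, and intensity statistics of this kind can be obtained through the parselmouth interface to Praat; the pitch floor/ceiling (75–500 Hz) and the remaining thresholds are common Praat defaults assumed here, not values reported in the paper.

```python
# Hedged sketch: jitter, shimmer, and intensity statistics via parselmouth/Praat.
import numpy as np
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("patient_speech.wav")                   # hypothetical audio file
pulses = call(snd, "To PointProcess (periodic, cc)", 75, 500)   # glottal pulse detection

jitter_local = call(pulses, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
shimmer_local = call([snd, pulses], "Get shimmer (local)", 0, 0, 0.0001, 0.02, 1.3, 1.6)

intensity = snd.to_intensity()                                  # dB contour over time
mean_db, std_db = float(np.mean(intensity.values)), float(np.std(intensity.values))
```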

3.3.4. Alteration in the voice quality:

The patient’s ability in phonetic motor planning affects how the listener perceives the quality of their voice. Despite the increase in voice noise in elderly individuals as a part of normal aging, older individuals with ADRD “lose” part of their vocal noise characteristics and have more harmonic and fluty voices than they did when they were younger.77 The presence of noise can be measured using the Harmonics-to-Noise Ratio (HNR).73 HNR is the relationship between the harmonic sound (periodic component) and the noise (aperiodic component) in the vocal signal. Voice breaking87 is another indicator in an individual’s speech of impairment in phonological planning and difficulty in vocal cord execution. This break is a sudden gap in sound that occurs when the thyroarytenoid muscles suddenly decrease their activity and the cricothyroid muscles begin to function.88 It is calculated as the frequency of breaks during an utterance (a continuous block of speech without interruption). We also calculated the Acoustic Voice Quality Index (AVQI),89 which consists of a weighted combination of time-frequency and frequency-domain metrics and was originally developed to measure the severity of dysphonia.
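As a brief illustration, a mean HNR value can be estimated with parselmouth/Praat as sketched below; the analysis parameters are Praat defaults assumed for illustration rather than taken from the paper.

```python
# Hedged sketch: harmonics-to-noise ratio (HNR) via parselmouth/Praat.
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("patient_speech.wav")                   # hypothetical audio file
harmonicity = call(snd, "To Harmonicity (cc)", 0.01, 75, 0.1, 1.0)
hnr_db = call(harmonicity, "Get mean", 0, 0)                    # mean HNR in dB
```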

3.3.5. Alteration in the rhythmic structure of the voice:

Studies showed that alteration in rhythmic structure throughout the evolution from healthy aging to ADRD follows a steady pattern parallel to cognitive decline. This alteration can be related to impairment in phonetic motor planning, leading to poor pronunciation and alteration in syllabic rhythm. Rhythm is defined as an isochronous recurrence of some type of speech unit,90 such as syllabic duration, intensity, and voice breaks. Impairment in phonological planning and progression to ADRD implies conversion to slower speech, less intensive speech, a monotone and tremulous voice, and continuous interruptions and breaks. As a result, the speech signal becomes progressively degraded, and the speech itself loses clarity,86,91 creating the impression of choked and hesitant speech sounds. Slowness in speech is measured using metrics of speech fluency, prolonged syllable intervals, and a higher variation in the duration between two successive syllabic intervals (Pairwise Variability Index). A monotonous voice is the result of a reduction in the variation in the breadth of vowel sounds, which can be measured using the shimmer metric (a measure of the amplitude variation between two consecutive periods of vibration of the glottis). Greater shimmer in speech production is a characteristic of patients with cognitive impairment,91 indicating greater instability of amplitude in the sound.

3.4. Component 3: Modeling the patient ability in semantic and syntactic levels of language organization (linguistic component)

Language impairment in patients with cognitive impairment is associated with aphasia-like symptoms and memory deficit symptoms, characterized by less dense and inaccurate speech planning,92,93 difficulties in finding words,66,94 simplified syntax and semantics,66,95,96 and circumlocution.66,97 These symptoms can result in communication errors and lower coherence in speech. To model the semantic level of language organization, we used metrics of semantic disfluency and lexical richness. To model the syntactic structure of the language, we used metrics measuring the complexity and components of sentence structure.

3.4.1. Modeling semantic disfluency in speech:

Semantic disfluency in speech is often characterized by repeated words (repetitiveness) or inappropriate pausing behavior/hesitation during speech.

Repetitiveness:

Previous studies found that patients with ADRD repeat words and phrases more frequently in their verbal responses to the Cookie-Theft test compared with non-cognitively impaired participants.98,99 To identify repetitiveness in the patients’ verbal responses for the Cookie-Theft test, we used two methods: (1) computing the similarity score between clauses in the patient’s verbal response: using a bag-of-words representation, we computed the cosine distance between clauses to obtain the similarity score. To improve the accuracy of this computation, we removed stop words and common occurrences of some words, such as “he” or “is” in the clauses “he is looking at Mom” and “he is falling off the stool.” A similarity score of “0” indicates that two clauses are identical. The proportion of clauses with a score of “0” and the average and standard deviation of the computed similarity scores in the patient’s verbal response were taken into account. (2) Identifying consecutive duplicate words or phrases in clauses: we used the regular expression library of the Python programming language to identify duplicate words/terms. The proportion of duplicated words/phrases with reference to the total number of words/phrases in the patient’s verbal response was computed.
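The sketch below illustrates the two repetitiveness measures described above: pairwise cosine distances between clauses built from a bag-of-words representation with stop words removed, and a regular expression for consecutive duplicate words. The clause list and variable names are illustrative assumptions.

```python
# Hedged sketch: clause-level repetitiveness measures.
import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances

clauses = ["the boy is taking the cookies",
           "the boy is taking cookies",
           "the water is overflowing"]                      # hypothetical clause segmentation

bow = CountVectorizer(stop_words="english").fit_transform(clauses)
dist = cosine_distances(bow)
pairwise = dist[np.triu_indices(len(clauses), k=1)]         # distances between clause pairs
identical_ratio = float(np.mean(pairwise == 0))             # proportion of identical pairs
mean_dist, std_dist = float(pairwise.mean()), float(pairwise.std())

text = " ".join(clauses)
duplicates = re.findall(r"\b(\w+)(\s+\1\b)+", text, flags=re.IGNORECASE)
duplicate_ratio = len(duplicates) / max(len(text.split()), 1)
```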

Pausing behavior:

Pauses are the smallest syntactic units that occur at a specific moment in speech production where a form of a content word (e.g., verb/noun) or a lexical concept has to be retrieved from memory and inserted to complete a clause or an utterance.100,101 At this time, the lexical and semantic memory needs to be active. By contrast, at the initial boundary of an utterance or clause, a complete semantic and syntactic configuration should be constructed, which requires full thought. Such a structure is not retrieved from memory but is creatively produced on the occasion. Therefore, identifying disfluencies at a specific syntactic location can show a link between pausing and the semantic disfluency/thinking expressed in speech. We modeled pausing behavior according to four metrics introduced by Lofgren et al.:102 whether pauses occurred (1) within-clauses, (2) clause-initial, (3) utterance-initial, or (4) whether the pause preceded nouns, verbs, or adjectives/adverbs when occurring within-clauses.

To compute these linguistic parameters, we used the Amazon Web Service (AWS) General Transcribe (GT) system to transcribe the audio-recorded speech in the study sample. To calculate the pausing metrics, we used spaCy, a Python natural language processing library, and the “word timing” information provided by AWS-GT.
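A simplified version of this pause categorization is sketched below: pauses are located from the word timings and labeled with the part of speech of the following word using spaCy. The 0.25 s threshold, the use of spaCy sentence starts as a proxy for utterance-initial boundaries, and the assumption that transcript words align one-to-one with spaCy tokens are all simplifications for illustration.

```python
# Hedged sketch: labeling pauses by position and the POS of the following word.
import spacy

nlp = spacy.load("en_core_web_sm")

def categorize_pauses(words, pause_threshold=0.25):
    """words: [{'word': str, 'start': float, 'end': float}, ...] in time order."""
    doc = nlp(" ".join(w["word"] for w in words))
    tokens = [t for t in doc if not t.is_punct]               # rough word-to-token alignment
    labels = []
    for i in range(1, min(len(words), len(tokens))):
        gap = words[i]["start"] - words[i - 1]["end"]
        if gap <= pause_threshold:
            continue
        position = "utterance-initial" if tokens[i].is_sent_start else "within-clause"
        labels.append((position, tokens[i].pos_))              # e.g., ('within-clause', 'NOUN')
    return labels
```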

3.4.2. Measuring lexical richness (lexical diversity):

Patients with impairment in semantic memory have a less diverse vocabulary and are biased toward higher-frequency words.103,104 Additionally, they may have “difficulty accessing more diverse nouns and verbs.”105 Previous studies showed that metrics for computing lexical richness are applicable for quantifying impairment in semantic memory in patients with ADRD.92,93 We computed the lexical richness (content density) score using the following five metrics: 1) the Type-token Ratio (TTR), including the root type-token ratio (RTTR), corrected type-token ratio (CTTR), and moving average type-token ratio (MATTR),106 which is the number of unique words divided by the total number of words in each successive window of fixed length; 2) Brunet’s Index, the variation in the type of words marked by a part-of-speech tagging tool in a sentence with reference to the total number of words in the sentence;107 3) Honore’s Index, which measures the proportion of words used only once with reference to the total number of words;108 4) the hypergeometric distribution index, a discrete probability distribution that computes the probability of randomly drawing the same word after a number of attempts without replacement;109,110 and 5) the Measure of Textual Lexical Diversity (MTLD), which reflects the average number of words in a sequence of words for which a certain TTR is maintained.109
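For illustration, the sketch below computes several of these indices from a token list using their common textbook definitions; the MATTR window size of 50 tokens is an assumption, and the formulas follow standard definitions rather than a specific implementation used by the authors.

```python
# Hedged sketch: lexical richness indices from a list of word tokens.
import math
from collections import Counter

def lexical_richness(tokens, window=50):
    n, types = len(tokens), len(set(tokens))
    counts = Counter(tokens)
    hapax = sum(1 for c in counts.values() if c == 1)           # words used only once
    windows = [tokens[i:i + window] for i in range(max(n - window, 0) + 1)]
    mattr = sum(len(set(w)) / len(w) for w in windows) / len(windows)
    return {
        "ttr": types / n,
        "rttr": types / math.sqrt(n),
        "cttr": types / math.sqrt(2 * n),
        "mattr": mattr,
        "brunet": n ** (types ** -0.165),                       # lower values = richer vocabulary
        "honore": 100 * math.log(n) / max(1 - hapax / types, 1e-6),
    }
```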

3.4.3. Modeling the syntactic structure:

We used part-of-speech tagging (POS) for this task. POS is a method for identifying sentences and types of words (adjectives, adverbs, articles, nouns, numbers, and verbs) in sentences. We used the outcome of the POS method to compute metrics indicating syntactic complexity, syntactic components, and dependency among syntactic components of sentences using a set of syntactical features. The features include part-of-speech rate, frequency-of-use tagging, action verbs rate, relative pronouns, and negative adverbs rate. Previous studies showed that these metrics are associated with mental disorders and cognitive impairment, and thus they can point toward syntactical indicators of ADRD.76,111

3.4.4. Modeling disfluency in the patient speech using the BERT language model:

Previous studies showed that BERT and its extended versions (e.g., DistilBERT, XLNet) have knowledge of the structure of disfluency.32,112 This language model processes disfluency by selectively attending to different parts of the disfluency at different intensities using the key mechanism of attention. This mechanism allows BERT to differentiate between the contextual embeddings of disfluent sentences and their fluent counterparts (see more details in Appendix C).

We modeled the conceptual relationship between words in the patient’s utterances (for the Cookie-Theft test) using the BERT language model and its extended versions, DistilBERT, DistilRoBERTa, and XLNet. Next, we evaluated the performance of each language model in detecting patients with ADRD. Appendix D provides more details about this evaluation. DistilBERT had the highest performance with an accuracy = 83.09 and F1-score = 82.35. Therefore, we used this model along with other linguistic domain-related features (explained above) to model the patient’s ability in semantic and syntactic levels of language organization.

DistilBERT is a small, fast, and light pretrained transformer model based on BERT. It uses 40% fewer parameters than the original BERT-base model, runs 60% faster, and keeps more than 95% of BERT’s performance, as reported in the original study.113 We used the cased version of this model with the following hyperparameters: 6 layers, 768 hidden units, 12 attention heads, and 65 M parameters.114 The model implementation is publicly available.114
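A minimal sketch of extracting contextual embeddings from the cased DistilBERT model with the Hugging Face transformers library is shown below; mean-pooling the last hidden layer into a single vector is one common choice and is assumed here rather than prescribed by the paper.

```python
# Hedged sketch: contextual embeddings from distilbert-base-cased.
import torch
from transformers import DistilBertModel, DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-cased")
model = DistilBertModel.from_pretrained("distilbert-base-cased")

text = "the boy is taking cookies from the cookie jar"        # example transcribed description
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state                 # shape: (1, seq_len, 768)
sentence_vector = hidden.mean(dim=1).squeeze(0)                # 768-dimensional summary vector
```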

3.5. Component 4: Modeling the patient’s vocal and semantic psycholinguistic cues

Prior studies have revealed that vocal and semantic psycholinguistic cues present in a patient’s language can be indicative of Alzheimer’s disease and other related dementias, as these conditions impact a person’s cognitive processes.115,116 By analyzing these cues, healthcare professionals can detect early signs of cognitive decline and track the progression of the disease and the response to treatment. We modeled the vocal psycholinguistic cues using the Geneva Minimalistic Acoustic Parameter Set (GeMAPS)30 and the semantic psycholinguistic cues using Linguistic Inquiry and Word Count (LIWC) 2015.23

3.5.1. Extracting vocal psycholinguistic cues:

Vocal psycholinguistic cues conveyed in an individual’s voice have been empirically documented using a wide range of acoustic parameters that reflect subglottal pressure, vocal tract airflow, and vocal fold vibration117,118 (e.g., frequency and spectral parameters and parameters measuring intensity and speech fluency). In this study, we used a minimalistic standard parameter set called the Geneva Minimalistic Acoustic Parameter Set (GeMAPS)30 for modeling vocal psycholinguistic cues in participants’ speech. Compared with large brute-forced feature sets (e.g., ComParE with 6373 acoustic parameters), GeMAPS has a better generalization capability to unseen datasets.119 The GeMAPS parameters are presented in four domains: frequency-related parameters, energy/amplitude-related parameters, spectral (balance/shape/dynamics) parameters, and voice quality parameters. The details of the parameters for each domain are explained in the original article.30 The effectiveness of GeMAPS in modeling vocal psycholinguistic cues, particularly “arousal” and “valence,” and its generalization capabilities are clear from previous studies.30,120,121 Please see Appendix E for more details about the performance of GeMAPS in identifying vocal psycholinguistic cues. We also conducted a statistical association analysis using the t-test to demonstrate the relationship between GeMAPS parameters and ADRD within the study’s sample population. The findings can be found in Appendix E. According to the findings, out of 88 total acoustic parameters, 57 parameters were significantly associated with ADRD (P-value < 0.05).
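As an illustration, GeMAPS-style functionals can be extracted with the opensmile Python package as sketched below; the extended eGeMAPSv02 set (88 functionals) is assumed here because the text reports 88 acoustic parameters.

```python
# Hedged sketch: extracting GeMAPS/eGeMAPS functionals with the opensmile package.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,       # 88 functionals (assumed variant)
    feature_level=opensmile.FeatureLevel.Functionals,
)
gemaps_features = smile.process_file("patient_speech.wav")    # one row per recording
```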

3.5.2. Extracting semantic psycholinguistic cues:

In this study, we used Linguistic Inquiry and Word Count (LIWC) 201523 to extract semantic psycholinguistic cues associated with ADRD. LIWC is a manually curated lexical-based natural language processing tool developed by experts in the psychology of language. It contains a large selection of commonly used words and terms organized into 11 top-level categories, including linguistic structure, affective processes, social processes, cognitive processes, perceptual processes, biological processes, drives, relativity, informal language, personal concerns, and time orientation. LIWC has been used in several studies to characterize patients’ and clinicians’ language.122,123 For example, words indicating tentativeness, such as “maybe” and “guess/think,” belong to the category of cognitive processes; conjunctions such as “but” and “whereas” belong to the category of linguistic structure; and words such as “unease” and “worried” belong to the category of affective processes. In the area of healthcare, the reliability and validity of LIWC in detecting semantic psycholinguistic cues associated with mental and neurological disorders have been verified in several studies.95,124–127 O’Dea et al.124 showed that features of the LIWC’s linguistic domain, including “tentativeness” and “non-fluencies,” were significantly correlated with symptoms of depression and anxiety.124 Also, Asgari et al.95 found that linguistic markers from the domains of psychological processes and linguistic features were associated with the presence of cognitive impairment.

Please see Appendix F for more details about the performance of LIWC in identifying semantic psycholinguistic cues. We also conducted a statistical association analysis using the t-test to demonstrate the relationship between LIWC parameters and ADRD within the study’s sample population. The findings can be found in Appendix F. Based on these findings, 79 out of the 93 linguistic parameters analyzed exhibited a significant association with ADRD, as evidenced by a P-value of less than 0.05.
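A minimal sketch of this univariate association analysis is given below; it can be applied to any per-participant feature table (e.g., exported LIWC or GeMAPS features), and the file and column names are hypothetical.

```python
# Hedged sketch: per-feature t-test between ADRD and control groups.
import pandas as pd
from scipy.stats import ttest_ind

df = pd.read_csv("features_with_labels.csv")                   # hypothetical feature table
cases, controls = df[df["adrd"] == 1], df[df["adrd"] == 0]

significant = []
for col in df.columns.drop("adrd"):
    _, p_value = ttest_ind(cases[col], controls[col], equal_var=False)
    if p_value < 0.05:
        significant.append((col, p_value))
```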

3.6. Component 5: Building and evaluating machine learning models

Figure 2 provides a schematic overview of the phases we used for building and evaluating machine learning models.

Figure 2. A schematic view of the process used for building and evaluating machine learning models

3.6.1. Feature generation phase: Processing the acoustic and linguistic parts of the audio-recorded data

To model the patient’s phonetic motor planning, we used the implementations of acoustic assessment algorithms available in OpenSMILE128 and the PRAAT Vocal Toolkit129 for five acoustic domains: speech fluency, frequency and spectral parameters, intensity, quality, and rhythmic structure of the voice. Both toolkits are open-source platforms that include efficient implementations of the acoustic parameters. We computed the parameters using a frame size of 25 ms.130 The mean, standard deviation (std), quartiles, interquartile range, and skewness were computed. We also processed the acoustic part of speech using YAMNet, a transformer-based pretrained speech processing model. However, because of the low performance of this model in detecting patients with ADRD, we did not incorporate the acoustic embedding vectors generated by this model into the feature sets used for building the screening algorithm. See Appendix G for more information about YAMNet and its performance in detecting patients with ADRD.

For the implementation of acoustic parameters of GeMAPS, we also used the acoustic assessment algorithms available in OpenSMILE.128

To process the linguistic part of the speech, we first transcribed all the audio-recorded data to text using the Amazon Web Service (AWS) General Transcribe (GT) system. In our previous study,131 we computed the Word Error Rate (WER) for AWS-GT for the transcription of patient-spoken language as 0.26, which was lower than that of other transcription systems, including AWS-Medical Transcribe (WER = 0.56) and Wave2Vec (WER = 0.98).132 Wave2Vec133 is an open-source automatic transcription system developed by Facebook. The AWS-GT transcription includes the transcription of each spoken word and the timing (start time and end time) associated with it. We applied the NLTK and spaCy Python toolkits to the transcribed data to compute metrics related to repetitiveness, pausing behavior, lexical richness, and the syntactic structure of patients’ verbal descriptions (see section “3.4. Component 3” for details).

We also generated word embedding vectors using the DistilBERT language model with a size of 768 × 512 to model the conceptual relationship between words in the patients’ speech. 768 is the size of the hidden layer, and 512 is the max sequence length (See details in the section “3.4.4. Modeling disfluency in the patient speech using BERT” and Appendix D). The embedding vectors were incorporated into the acoustic and linguistic feature sets for building the screening algorithm (see section “3.6.3 In processing phase” for details).

To extract the semantic psycholinguistic cues, we used Linguistic Inquiry and Word Count (LIWC) version 2015.23 See section “3.5. Component 4: Modeling the patient’s vocal and semantic psycholinguistic cues” for details. Table 2 provides the list of generated features for the three components (phonetic motor planning, semantic and syntactic levels of language organization, and vocal and semantic psycholinguistic cues) in compact form.

Table 2.

List of acoustic and linguistic features used for the development of models, in compact form

Phonetic motor planning | Speech fluency | Articulation rate, speech rate, silent pauses, voicing probability, within-word disfluency
Phonetic motor planning | Frequency and spectral parameters | Fundamental frequency, jitter, pitch, formant frequencies, MFCCs, LPCs, LTAS
Phonetic motor planning | Voice intensity | Voice intensity, loudness, Hammarberg Index, energy concentration, ratio of the energy of the spectral harmonic peak at the second and third formant’s center frequency to the energy of the spectral peak at F0 in voiced regions
Phonetic motor planning | Voice quality | Harmonics-to-Noise Ratio, voice breaking, Acoustic Voice Quality Index
Phonetic motor planning | Rhythmic structure of the voice | Pairwise Variability Index, prolonged syllable intervals, shimmer metric
Phonetic motor planning | Acoustic embedding models | YAMNet model
Semantic and syntactic levels of language organization | Repetitiveness | Similarity score between clauses, consecutive duplicate words or phrases in clauses
Semantic and syntactic levels of language organization | Pausing behavior | Pauses occurring (1) within-clauses, (2) clause-initial, (3) utterance-initial, or (4) preceding nouns, verbs, or adjectives/adverbs when occurring within-clauses
Semantic and syntactic levels of language organization | Lexical richness | Type-token Ratio (TTR) [root type-token ratio (RTTR), corrected type-token ratio (CTTR), moving average type-token ratio (MATTR)], Brunet’s Index, Honore’s Index, hypergeometric distribution index, Measure of Textual Lexical Diversity (MTLD)
Semantic and syntactic levels of language organization | Syntactic structure | Part-of-speech rate, frequency-of-use tagging, action verbs rate, relative pronouns, negative adverbs rate
Semantic and syntactic levels of language organization | Contextual word embedding features | DistilBERT, BERT base-cased, XLNet, DistilRoBERTa
Vocal and semantic psycholinguistic cues | — | Parameters defined in GeMAPS and LIWC

3.6.2. Pre-processing phase

All variables (acoustic and linguistic features) were centered and scaled using standard scaling, which standardizes the features by removing the mean and scaling to unit variance. For feature selection, we used the Joint Mutual Information Maximization (JMIM) method. JMIM was recently introduced as an effective feature selection method, particularly for high dimensional, small sample size datasets. It belongs to the family of joint mutual information (JMI)-based methods. In information theory, the mutual information (MI) of two random variables is the amount of information obtained about one random variable (X) by observing the other random variable (Y). This can be quantified as the reduction in entropy of one random variable (Y) given another variable (X), as follows:

$$I(X;Y) = E(Y) - E(Y \mid X)$$

where E(Y) is the entropy of Y. For any discrete variable such as Y = (y_1, y_2, ..., y_N), E(Y) is defined as:

$$E(Y) = -\sum_{i=1}^{N} p(y_i)\log p(y_i)$$

E(Y|X) is the conditional entropy. The conditional entropy is the amount of uncertainty left in variable Y when variable X is introduced; thus, it is less than or equal to the entropy of Y. The conditional entropy is formulated as follows:

$$E(Y \mid X) = -\sum_{j=1}^{M}\sum_{i=1}^{N} p(x_i, y_j)\log p(y_j \mid x_i)$$

The information gain method is the simplest feature selection method built on MI. It assumes that features are conditionally independent of one another. To tackle this problem, the JMIM method attempts to take into account the potential dependency among the feature set F = {f_1, f_2, ..., f_N} by selecting a subset of features S = {s_1, s_2, ..., s_K} of dimension K, where K (the number of features in S) < N (the number of features in F) and S ⊂ F, while minimizing the redundancy of information among the selected features and maximizing the joint mutual information between the subset S and the outcome class label Y. Mathematically,

$$f_{JMIM} = \underset{f_i \in F \setminus S}{\arg\max}\left(\min_{f_s \in S} I(f_i, f_s; Y)\right)$$

The most important advantage of the JMIM method compared to wrapper and embedded feature selection methods is the generalizability of the selected features to unseen datasets, which can improve the stability and generalizability of the ML models on an unseen dataset.
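A simplified, hedged sketch of greedy JMIM-style selection is shown below: continuous features are discretized, and the joint mutual information I(f_i, f_s; Y) is estimated from the paired discretized values. This illustrates the criterion above rather than reproducing the authors' exact implementation; the number of bins and number of selected features are arbitrary choices.

```python
# Hedged sketch: greedy JMIM-style feature selection on discretized features.
import numpy as np
from sklearn.metrics import mutual_info_score
from sklearn.preprocessing import KBinsDiscretizer

def jmim_select(X, y, k=20, n_bins=10):
    Xd = KBinsDiscretizer(n_bins=n_bins, encode="ordinal",
                          strategy="quantile").fit_transform(X)
    n_features = Xd.shape[1]
    # Start with the single feature having the highest mutual information with the label.
    selected = [int(np.argmax([mutual_info_score(Xd[:, j], y) for j in range(n_features)]))]
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            # min over already-selected features of I((f_j, f_s); Y),
            # where each pair is encoded as a single joint discrete variable.
            score = min(mutual_info_score(Xd[:, j] * n_bins + Xd[:, s], y) for s in selected)
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```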

3.6.3. In-processing phase: Machine learning (ML) architecture

We used three different ML architectures to model the relationship between speech parameters (predictor variables) and the presence of ADRD (outcome variable). Figure 3 shows a schematic view of these three architectures. In the first architecture (Figure 3, Model I), we first trained the DistilBERT model on the training dataset to extract the contextual word embeddings from the last layer of the model. In this way, we obtained a single 768-dimensional feature vector for each patient’s description for the Cookie-Theft test. This approach was used by Pompili et al.39 and had promising results for identifying patients with ADRD. Next, we applied the JMIM method to the 768-dimensional feature vector to extract the important features associated with ADRD for combination with the other linguistic and acoustic feature sets.

Figure 3. A schematic view of the machine learning architectures used to fuse acoustic and linguistic parameters of the phonetic, linguistic, and psycholinguistic components with the word embeddings generated by the DistilBERT language model

To select important acoustic, linguistic, and psycholinguistic features from modeling the phonetic motor planning (phonetic component), the semantic and syntactic levels of language organization (linguistic component), and the vocal and semantic psycholinguistic cues (psycholinguistic component), we independently applied the JMIM method to the acoustic/linguistic features computed for each component. Next, we fused the selected features with the important contextual features obtained from DistilBERT. For the classification task, we tested the performance of different ML classifiers, including Logistic Regression (as a baseline); Random Forest and Extremely Randomized Trees134 (Extra Trees), two popular algorithms from the family of bootstrap aggregation (bagging) ensemble decision tree algorithms; Adaptive Boosting135 (AdaBoost) and Extreme Gradient Boosting (XGBoost),136 two popular algorithms from the family of gradient boosting ensemble decision tree algorithms; and Support Vector Machine (SVM) from the general category of kernel methods.137 See more information about these algorithms in Appendix H. Figure 3, Model I provides a schematic view of the architecture of Model I that was applied to the test dataset.

The second architecture (Figure 3, Model II) is composed of two bidirectional Long Short-Term Memory (Bi-LSTM) layers that were trained on the contextual word embedding vectors computed using DistilBERT. Bi-LSTM is a particular type of recurrent neural network (RNN) through which the relationships between longer input and output variables are modeled. In a Bi-LSTM network, the given input variables are utilized twice for training (i.e., first from left to right, and next from right to left). We used Bi-LSTM rather than the LSTM network because previous studies showed that it has a higher performance in modeling sequential data.138 The output of the second Bi-LSTM layer was passed to a fully connected (FC) layer. The outcome of the FC layer was a feature set for fusing with the acoustic and linguistic parameters. To determine the number of Bi-LSTM layers and the size of the FC layer, we computed the overall performance of this architecture on the training dataset with different numbers of Bi-LSTM layers (1, 2, 3, and 4) and different numbers of neurons for the FC layer (120, 64, and 32). This architecture was followed by a SoftMax layer for training. Two Bi-LSTM layers with one FC layer of 32 neurons had the highest performance. Next, we fused the outcome of the FC layer with the important acoustic/linguistic features computed for each component (phonetic, linguistic, psycholinguistic) using the JMIM method, and the fused features were passed to classification algorithms (LR, Random Forest, Extra Trees, XGBoost, AdaBoost, and SVM) for separating patients with ADRD from participants without ADRD. Figure 3, Model II provides a schematic view of the architecture of Model II that was applied to the test dataset.
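A hedged PyTorch sketch of this second architecture is shown below: two stacked Bi-LSTM layers over the DistilBERT token embeddings, a 32-unit FC layer, and fusion of its output with the JMIM-selected acoustic/linguistic features before classification. The hidden size, the number of handcrafted features, and the use of the last time step as the pooled representation are illustrative assumptions.

```python
# Hedged sketch: Model II-style Bi-LSTM fusion network (illustrative sizes).
import torch
import torch.nn as nn

class BiLstmFusion(nn.Module):
    def __init__(self, embed_dim=768, hidden=128, fused_dim=32, n_handcrafted=100):
        super().__init__()
        self.bilstm = nn.LSTM(embed_dim, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, fused_dim)
        self.classifier = nn.Linear(fused_dim + n_handcrafted, 2)  # softmax head

    def forward(self, token_embeddings, handcrafted):
        # token_embeddings: (batch, seq_len, 768) from DistilBERT
        out, _ = self.bilstm(token_embeddings)
        pooled = out[:, -1, :]                            # last time step (assumption)
        fused = torch.relu(self.fc(pooled))
        return self.classifier(torch.cat([fused, handcrafted], dim=1))
```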

For the third architecture (Figure 3. Model III), we investigated the performance of a Convolutional Neural Network (CNN) in modeling the relationship between the embedding vectors of DistilBERT and the output variable. CNN is a type of deep learning model that operates on data with a grid pattern (e.g., two-dimensional matrices) and “is designed to automatically and adaptively learn spatial hierarchies of features, from low- to high-level patterns.”139 A typical CNN architecture is composed of a stack of several convolution layers and a pooling layer linked to one or more FC layers. The convolutional layer is a specialized type of linear operation used for feature extraction, and the pooling layer uses a down-sampling operation to reduce the dimensionality of the extracted features.139 In this third architecture, we used two CNN layers, a max pooling layer, and one FC layer (with the same functionality as the FC layer in ML Model II). To determine the number of convolution layers, the pooling function, and the number of neurons for the FC layer, we computed the overall performance of this architecture on the training dataset with different numbers of convolution layers (1, 2, 3), average and max functions for the pooling operation, and different numbers of neurons for the FC layer (120, 64, and 32). This architecture was followed by a SoftMax layer for training. A model with two convolution layers, a max pooling function, and 32 neurons for the FC layer had the highest performance on the training dataset. Similar to Model II, we fused the output of the FC layer with the important acoustic/linguistic features, which were then passed to the classification algorithms to separate patients with ADRD from participants without ADRD. Figure 3. Model III provides a schematic view of the architecture of Model III that was applied to the test dataset.
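The following PyTorch sketch mirrors the Model III head (two convolution layers, max pooling, and a 32-neuron FC layer); the channel sizes and kernel widths are assumptions, since the text specifies only the layer counts and the pooling function.

# Sketch of the Model III head: two 1-D convolution layers over the DistilBERT
# token embeddings, max pooling over the token axis, and a 32-unit FC layer.
import torch
import torch.nn as nn

class CNNHead(nn.Module):
    def __init__(self, embed_dim=768, fc_dim=32, n_classes=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),              # max pooling over the 512 token positions
        )
        self.fc = nn.Linear(64, fc_dim)
        self.out = nn.Linear(fc_dim, n_classes)   # SoftMax head used during training

    def forward(self, token_embeddings):
        # token_embeddings: (batch, 512, 768); Conv1d expects (batch, channels, length)
        x = self.conv(token_embeddings.transpose(1, 2)).squeeze(-1)
        fused_features = torch.relu(self.fc(x))
        return self.out(fused_features), fused_features

model = CNNHead()
logits, feats = model(torch.randn(4, 512, 768))
print(feats.shape)  # torch.Size([4, 32])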

3.6.4. Post-processing phase: Training and evaluating performance of machine learning architecture

All the processes for selecting important acoustic and linguistic parameters (for the phonetic, linguistic, and psycholinguistic components) were conducted using the JMIM method on the training sample. We also used the training sample for fine-tuning the parameters of the machine learning algorithms used in ML Models I, II, and III. To do this, the training sample was partitioned into five equal subsets (“folds”), with the random partitioning stratified by ADRD status to ensure that the distribution of ADRD patients was approximately the same in all folds. Since the algorithms (LR, Random Forest, Extra Trees, XGBoost, AdaBoost, and SVM) require tuning parameters for optimal performance, a grid search was implemented within the five-fold cross-validation to select the best parameters. See Appendix I for the parameters tuned for each classification algorithm. Specifically, for each combination of parameters defined over the grid, the algorithms were trained with those parameters on four folds, and the performance of the model was assessed on the fifth, held-out fold. This procedure was repeated five times for each parameter set in the grid, until each fold had been used once for testing.

Then, the optimal parameters for each algorithm were selected based on the Area Under the Receiver Operating Characteristic curve (AUC-ROC) over the five repeated runs. For each algorithm, the final model was obtained by retraining the algorithm on the entire training dataset using the optimal parameters. We then used the test dataset to evaluate the performance of the final model on the validation cohort.
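The sketch below illustrates this tuning protocol with scikit-learn: a grid search scored by AUC-ROC inside stratified five-fold cross-validation, followed by refitting on the full training set and a single evaluation on the test set; the grid values and the synthetic data are placeholders.

# Sketch of the tuning protocol for one classifier (SVM with RBF kernel).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=200, n_features=60, random_state=0)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # stratified 5-fold CV
param_grid = {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.001]}

search = GridSearchCV(SVC(kernel="rbf", probability=True), param_grid,
                      scoring="roc_auc", cv=cv, refit=True)
search.fit(X_train, y_train)          # refit=True retrains on the full training set

print("Best parameters:", search.best_params_)
print("CV AUC-ROC:", search.best_score_)
print("Test AUC-ROC:", roc_auc_score(y_test, search.predict_proba(X_test)[:, 1]))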

3.6.5. ML Models testing:

The goodness of fit of each classifier was evaluated using standard performance metrics, including AUC-ROC, the Cumulative Gains curve, the Gini score, sensitivity, specificity, positive predictive value (PPV), and the F-score (the harmonic mean of precision and recall). The AUC-ROC captures the tradeoff between the true positive rate (TPR, or sensitivity) and the false positive rate (FPR, or 1 − specificity) and has the advantage of being invariant to the class distribution. For each ML model, we reported the standard deviation of the performance metrics over the five cross-validation folds. For the best-performing ML model, we computed the optimal sensitivity and specificity using the geometric mean (G-mean) and visualized this operating point on the ROC curve.
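As a brief illustration, the G-mean operating point described above can be located on the ROC curve as follows (the scores shown are placeholder values).

# Sketch: locate the ROC operating point that maximizes the geometric mean
# of sensitivity and specificity, G-mean = sqrt(TPR * (1 - FPR)).
import numpy as np
from sklearn.metrics import roc_curve

def gmean_operating_point(y_true, y_score):
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    gmeans = np.sqrt(tpr * (1 - fpr))
    best = np.argmax(gmeans)
    return thresholds[best], tpr[best], 1 - fpr[best]   # threshold, sensitivity, specificity

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])              # placeholder labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.55])  # placeholder scores
thr, sens, spec = gmean_operating_point(y_true, y_score)
print(f"threshold={thr:.2f}, sensitivity={sens:.2f}, specificity={spec:.2f}")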

4. Results

4.1. Most informative acoustic and linguistic features for screening patients with ADRD

The results of applying JMIM to extract the most informative acoustic and linguistic features from each component are presented in Figures 4, 5, and 6. Figure 4 presents the top 20 informative acoustic features for modeling impairment in phonological motor planning. Acoustic parameters from all five domains, speech fluency, frequency and spectral parameters, voice intensity, the rhythmic structure of the voice, and voice quality, are associated with the risk of ADRD. Acoustic parameters indicating alterations in frequency/spectral characteristics (line spectrum pair [LSP] frequencies, MFCC, long-term average spectrum, F0) and speech fluency (pause rate, voiced segments, voice probability, unvoiced segments) are more strongly represented among the top 20 acoustic parameters than those from other domains. Parameters representing voice intensity, including the rising slope of loudness, the loudness peak, the energy ratio of the spectral harmonic peaks of F0 and F1, and the energy ratio of the spectral harmonic peaks of F0 and F3, are more informative than other parameters in this domain for screening patients with ADRD. Alterations in jitter and shimmer, indicators of changes in the rhythmic structure of the voice, are also among the most informative acoustic parameters. The harmonic-to-noise ratio (HNR), an indicator of voice quality, is also an important parameter; however, it is less informative than the frequency/spectral and speech fluency parameters (see Figure 4).

Figure 4.

JMIM values of acoustic features used for modeling the impairment in phonological motor planning

*LSP are used to represent linear prediction coefficients (LPC) for transmission over a channel. LSPs have several properties such as smaller sensitivity to quantization noise that make them superior to direct quantization of LPCs.

** LinregerrQ: The quadratic error computed as the difference between the linear approximation and the actual contour.

***Linregc2: The offset (t) of a linear approximation of the contour.

Figure 5.

JMIM values for linguistic features used for modeling semantic and syntactic levels of language organization

Figure 6.

JMIM values of GeMAPS (A) and LIWC (B) features used for extracting vocal and semantic psycholinguistic cues in patients with ADRD

Figure 5 presents the importance of the top 20 linguistic features. Features of lexical richness (root and corrected type-token ratio, Honor’s and Brunet’s indices) are among the most important features for modeling the patient’s verbal language. Features extracted using POS tagging (e.g., pronouns, part-of-speech rate) are also informative indicators of ADRD risk, as expected. Features representing pausing behavior (e.g., total average silence duration per word within clauses, total average silence duration in initial clauses) and repetitiveness (e.g., proportion of clauses with a similarity score of zero) are also among the most informative features for detecting patients with ADRD. We did not combine the word embedding features extracted from DistilBERT with the linguistic domain features for presentation in Figure 5, as they are not explainable.

Figure 6.A presents the top 20 informative features for vocal psycholinguistic cues extracted using the GeMAPS acoustic features. Features from four GeMAPS domains, the frequency domain (standard deviation [std] of F2 bandwidth), the spectral domain (mean spectral slope 0–500 Hz, mean spectral slope 500–1500 Hz), the voice quality domain (HNR), and the energy/amplitude domain (std of the rising slope of loudness), appear among the top five features. The remaining top features are mostly related to the frequency and energy/amplitude domains, as expected. Figure 6.B presents the top 20 informative linguistic features for semantic psycholinguistic cues extracted using LIWC. Features from the linguistic dimension are among the most informative for the expression of semantic psycholinguistic cues in patients with ADRD. As expected, features from the psychological processes category are also informative for modeling psycholinguistic cues in these patients. The other domains (relativity, personal concerns, and spoken language) did not emerge as important features for modeling the psycholinguistic cues in patients with ADRD.

4.2. Performance of ML models

We evaluated the performance of the three ML models (I, II, III) for screening patients with ADRD. These models differ in how they process the contextual word embedding vectors obtained from the DistilBERT language model. In Model I, we extracted a 768-dimensional feature vector from the last contextual word embedding layer of DistilBERT and fused it with the informative features extracted from the phonetic, linguistic, and psycholinguistic components. This model outperformed ML Models II and III, in which Bi-LSTM and CNN models were trained on the word embedding vectors (512 × 768) of DistilBERT and their outputs were fused with the informative features of the three components (see Figure 3 for the architectures of these three models). Table 3 shows the performance of ML Model I with different classifiers for both the training and test datasets. SVM with an RBF (Radial Basis Function) kernel had the highest performance, with AUC-ROC = 92.53 ± 3.34, accuracy = 85.04 ± 3.41, and F1-score = 84.64 ± 3.58 on the training dataset and AUC-ROC = 93.89, accuracy = 90.14, and F1-score = 89.55 on the test dataset. The ensemble decision trees had nearly the same AUC-ROC, F-score, and accuracy on the training dataset. However, the performance of the gradient boosting trees (AdaBoost and XGBoost) on the test dataset was lower than that of the bagging trees (Random Forest and Extra Trees), which may indicate that the boosting trees overfitted the training dataset. Logistic Regression had the lowest performance on both the training and test datasets, which may indicate that the relationship between the input predictors and the outcome variable is not linear.

Table 3.

Performance of ML model I with different classification algorithms for training and test datasets

5-fold cross validation performance of classification algorithms on the training dataset
Algorithms Precision Recall F1-score AUC-ROC Accuracy
SVM 86.92 ± 5.18 82.68 ± 3.94 84.64 ± 3.58 92.53 ± 3.34 85.04 ± 3.41
Extra Trees 85.77 ± 6.05 83.58 ± 7.19 84.26 ± 3.62 92.82 ± 3.64 84.52 ± 3.21
Random Forest 85.14 ± 6.11 83.45 ± 7.72 83.92 ± 4.47 92.08 ± 4.24 84.16 ± 4.04
AdaBoost 88.06 ± 7.36 80.52 ± 9.79 83.63 ± 6.36 93.69 ± 3.61 84.47 ± 5.63
XGBoost 82.99 ± 6.23 82.81 ± 7.04 83.00 ± 5.33 92.39 ± 4.58 82.74 ± 5.18
LR 82.80 ± 5.21 79.21 ± 12.17 80.65 ± 8.36 90.01± 4.06 81.55 ± 6.93
Performance of classification algorithms on the test dataset
SVM 93.75 85.71 89.55 93.89 90.14
Random Forest 88.78 79.54 83.84 91.60 84.94
Extra Trees 89.94 76.79 82.80 91.29 84.31
AdaBoost 77.42 68.57 72.72 84.84 74.65
XGBoost 83.87 74.28 78.78 88.17 80.28
LR 87.10 77.14 81.81 88.01 83.09

The performance of ML Model II on the training dataset using the SVM algorithm was AUC-ROC = 89.03 ± 5.68, accuracy = 80.76 ± 7.15, and F1-score = 82.97 ± 6.56, while the performance of ML Model III (using the SVM algorithm) was AUC-ROC = 85.77 ± 6.75, accuracy = 77.73 ± 5.48, and F1-score = 78.65 ± 5.77 (see Appendix J, Table 5 and Table 6, for more results for ML Models II and III). These results were lower than those of ML Model I, indicating that modeling the high-dimensional word embedding vectors using the Bi-LSTM and CNN networks on the small training sample of this study does not improve the predictive power of the SVM algorithm. Additionally, Models II and III had higher standard deviations than ML Model I (measured using five-fold cross-validation on the training data), indicating lower stability of these models. The performance of Model II on the test data was AUC-ROC = 94.44, accuracy = 87.32, and F1-score = 87.67, while the performance of Model III was AUC-ROC = 92.06, accuracy = 88.73, and F1-score = 88.57. See Table 4 for the performance of Models II and III on the test dataset.

Table 5. Features selected via JMIM for building the highest-performing machine learning algorithm (Model I), ranked by importance.

Component (Com)- 1: phonetic motor planning; 2: Semantic and syntactic levels of language organization; 3: Psycholinguistic cues

Linguistic and acoustic features Com Linguistic and acoustic features Com
linregc2 of voice probability 2 Part-of-Speech rate 3
Common verbs 4 linregerrQ of simple moving average of LSP Frequency 2
quartile 1 of MFCC 2 Functional words 3
Words cannot be found in Dictionary in LIWC 4 Textual Lexical Diversity 3
Article 4 Silence time for VERB within clauses 3
Std of LSP frequency 2 Content words 3
Quartile2-Quartile1 LSP frequency 2 Silence time for ADJ/ADV within clauses 2
Voiced Segments Per Second 2 Average of similarity score between clauses without stop word 3
Pause rate 2 Brunet’s Index 3
Total average silence duration in initial clauses 2 Indefinites articles 3
LinregerrQ of MFCC 2 Rate of negative adverbs 2
Words that are longer than six letters. 4 Root type-token ratio 3
Definite articles 2 Interquartile range of the 3rd MFCC coefficient 2
Content Density 3 Analytical thinking (summary variables in LIWC that measure cognitive language style) 4
Lexical frequency 3 Corrected type-token ratio 3
Std of MFCC 2 Linregc2 of simple moving average of LSP Frequency 2
Average Length of Unvoiced Segments 2 linregc1 of perceived loudness 4
Proportion of clauses with a similarity score of zero with stop word 3 Honor’s Statistic 3
Cognitive processes 4 pitch 4
Skewness of LSP frequency 2 MaxPos of Simple Moving Average (sma) of LSP 2
Std of local Shimmer 2 Normalized standard deviation of simple moving average of F2 4
Silence time for NOUNs within clauses 3 Normalized Std of simple moving average of the amplitude of F1 relative to F0 4
Mean of Local Jitter 2 Determiners 3
Standard deviation of similarity score between clauses with stop word 3 80th percentile of Frequency of 27.5Hz 2
Relative pronouns rate 3 Ratio of standardized mean amplitude of F3 and F0 4
Mean F0 Envelope 2 Std of Length of Unvoiced Segment 2
Total average silence duration per word within clauses 3 80th Percentile of Loudness 2
Pronouns 3 Reference Rate to Reality 3
Std of rising slope of loudness 2 Average of similarity score between clauses with stop word 3
std local Jitter 2 Std of harmonic noise ratio 4
Mean ratio energy spectral harmonic 2 Unique word count 3
Speech rate 2 Proportion of clauses with a similarity score of zero without stop word 3
Nouns 3 Hypergeometric Distribution Diversity 3
Quartile2-Quartile3 of F0 2 Word count 4
Long term average spectrum 2 Consecutive repeated clauses 3
Normalized standard deviation of the amplitude of F2 to F0 2 Type-token ratio 3
* LSP are used to represent linear prediction coefficients (LPC) for transmission over a channel. LSPs have several properties, such as smaller sensitivity to quantization noise, that make them superior to direct quantization of LPCs.

** LinregerrQ: The quadratic error computed as the difference between the linear approximation and the actual contour.

*** Linregc2: The offset (t) of a linear approximation of the contour.

Table 6.

Performance of the component of phonetic motor planning in identifying patients with ADRD

Algorithms Precision Recall F1-score AUC-ROC Accuracy
XGBoost 79.41 77.14 78.26 80.32 78.87
Random Forest 78.47 77.84 78.13 83.07 78.53
ExtraTrees 76.58 77.77 77.15 83.34 77.29
AdaBoost 73.73 74.285 74.01 77.22 74.27
SVM 70.27 74.28 72.22 82.31 71.83
Logistic Regression 68.42 74.28 71.23 75.55 70.42

Table 4.

Performance of ML model II and ML model III with different classification algorithms for training and test datasets

Algorithms Precision Recall F1-score AUC-ROC Accuracy
Performance of ML model II with different classification algorithms on the test dataset
SVM 84.21 91.42 87.67 94.44 87.32
Random Forest 82.08 78.57 80.28 91.31 80.98
Extra Trees 82.21 80.18 81.14 91.85 81.67
AdaBoost 81.84 77.35 79.51 80.34 80.38
XGBoost 82.85 82.85 82.85 83.09 83.09
LR 85.29 82.85 84.05 92.06 84.50
Performance of ML model III with different classification algorithms on the test dataset
SVM 86.11 88.57 87.32 93.57 87.32
Random Forest 84.37 82.01 83.16 91.37 83.62
Extra Trees 83.29 84.09 83.67 91.21 83.83
AdaBoost 80.14 80.13 79.77 79.90 79.90
XGBoost 72.50 82.85 77.33 76.15 76.05
LR 88.57 88.57 88.57 92.06 88.73

The accuracy and F1-score of both models were lower than those of Model I, although the AUC-ROC of Model II was slightly better. We chose Model I as our final model for building the screening algorithm because of its higher stability and lower computational cost. Table 5 shows the most informative features used for building Model I.

Tables 6 and 7 provide information about the predictive power of the phonetic motor planning component (phonetic component) and the syntactic and semantic levels of language organization component (linguistic component) in identifying patients with ADRD, measured on the test dataset. The highest accuracy of the phonetic component and the linguistic component in detecting patients with ADRD was 78.87 and 83.09, respectively. This result indicates that cognitive impairment negatively affects both the acoustic and linguistic parts of speech; thus, as shown in Table 3, combining these two components can improve the performance of the screening algorithm for detecting patients with ADRD (accuracy = 90.14).

Table 7.

Performance of the component of syntactic and semantic levels of language organization in identifying patients with ADRD

Algorithms Precision Recall F1-score AUC-ROC Accuracy
SVM 84.84 80 82.35 89.05 83.09
Logistic Regression 84.84 80 82.35 89.60 83.09
AdaBoost 85.71 68.57 76.19 82.77 78.87
ExtraTrees 79.73 72.66 75.98 86.01 77.41
Random Forest 77.90 72.34 74.96 85.18 76.25
XGBoost 76.66 65.71 70.77 84.28 73.23

4.3. Added value of speech components in screening patients with ADRD

Figure 7.A demonstrates the added value of each speech component for screening patients with ADRD in terms of AUC-ROC, based on the results of ML Model I on the training dataset. As the figure shows, combining all three speech components (phonetic, linguistic, and psycholinguistic) substantially improves the AUC-ROC (AUC-ROC = 92.53) compared with using only the acoustic features of the phonetic component (AUC-ROC = 78.49) or the combination of the phonetic and linguistic components (AUC-ROC = 90.02). The same increase in performance is visible in the precision/recall curve (Figure 7.B), positive predictive value (Figure 7.D), and sensitivity (Figure 7.E). Figure 7.C presents the information gain of the speech components in identifying patients with ADRD with respect to the percentage of the study sample. The gain curve shows that if we selected the top 40% of the entire population, representing 66 patients (out of 166), the sample would contain approximately 80% of patients with ADRD. By contrast, the gain value when using only the acoustic features of the phonetic component is about 60% for the same sample size (N = 66).

Figure 7.

5-fold cross validation performance of ML model I and the added value of each speech component for screening patients with ADRD

Additionally, as Figures 7.A–E demonstrate, modeling the patient’s psycholinguistic cues can improve the performance of the screening algorithm. For example, as shown in Figure 7.A, adding the psycholinguistic cues to the combination of the phonetic and linguistic components improved the AUC-ROC by 2.79%. Finally, Figure 7.F shows the density plot for the SVM model. As this figure shows, the model separates patients with ADRD (class 1) from patients without ADRD (class 0) well.

Figure 8.A demonstrates the added value of each speech component for screening patients with ADRD based on the results of ML Model I on the test dataset. As the figure shows, ML Model I generalizes well to unseen data (the test dataset), with AUC-ROC = 93.89 for the combination of features from all three components (phonetic, linguistic, and psycholinguistic), AUC-ROC = 93.02 for the combination of the phonetic and linguistic components, and AUC-ROC = 82.31 for the phonetic component alone. This improvement is also visible in the other metrics: the precision/recall curve (Figure 8.B), positive predictive value (Figure 8.D), and sensitivity (Figure 8.E). The information gain curve also shows this improvement (Figure 8.C): by selecting the top 40% of the entire population (representing 66 patients out of 166), the sample would contain approximately 84% of patients with ADRD.

Figure 8.

Performance of ML model I and the added value of each speech component for screening patients with ADRD computed on the test dataset

5. Discussion

In this study, we provided a detailed perspective on how the patient’s verbal response (spontaneous speech) to the Cookie-Theft test can be used to model three speech components: (1) the individual’s ability in phonetic motor planning, (2) the semantic and syntactic levels of language organization, and (3) vocal and semantic psycholinguistic cues. Modeling these three components generated a list of informative features with high discriminative power for building ML models for the proactive identification of patients with ADRD. Each component comprises domain-related features that can provide insight into the underlying factors associated with the development of ADRD, such as the individual’s ability to control their vocal cords, their recall ability, and their ability to construct the semantic and syntactic structure of sentences.

Neuropsychological assessment tools, such as the Mini-Mental State Examination (MMSE),140 the Montreal Cognitive Assessment,141 and the Memory Impairment Screen,142 exhibit acceptable sensitivity and specificity in detecting patients with ADRD.143 However, their application in clinical settings is often limited due to the patient’s difficulty recognizing early symptoms10 and the clinicians’ insufficient time to assess cognitive impairment.13 Incorporating AI-based ADRD screening algorithms, such as speech-processing algorithms, into clinical workflows can streamline the patient screening process for ADRD diagnosis by alerting clinicians to patients’ cognitive status. This may enable clinicians to implement appropriate interventions, including lifestyle modifications, comorbidity management, or referrals to behavioral health specialists.144 Consequently, early detection may lead to improvements in the quality of life for patients and their caregivers while reducing overall healthcare utilization and costs.145

Language corpora of speech production in ADRD have inspired several studies on the automatic assessment of individuals at risk of ADRD. DementiaBank is the only relatively large, publicly available corpus that includes individuals’ verbal responses to the Cookie-Theft test. Several studies have been published on the development of automatic screening algorithms for identifying patients with ADRD using DementiaBank. To model the acoustic part of patients’ speech, these studies used two major approaches: (1) open-access repositories of acoustic assessment algorithms and (2) transformer-based pretrained speech processing models. For the first approach, the AVEC-2013,26 EMO_Large,27 ComParE-2013,28 COVAREP,34 and IS10-Paralinguistics54 feature sets are examples of repositories of acoustic algorithms used by Shah et al.,25 Chen et al.,31 Syed et al.,53 and Rohanian et al.33 The highest accuracy obtained using these repositories was for ComParE-2013, with an accuracy of 71.69 for identifying patients with ADRD.31 This relatively low accuracy is mostly due to the limited generalizability of the acoustic feature sets in these repositories for screening patients with ADRD. For example, ComParE-2013 includes 6,373 generic acoustic descriptors that were mostly developed for music information retrieval and general sound analysis. Developing a repository of acoustic parameters specifically associated with cognitive impairment diseases (e.g., dementia, Alzheimer’s disease, mild cognitive impairment) could improve the accuracy of assessing the acoustic component of speech for detecting patients with ADRD. By including significant acoustic parameters associated with ADRD in the phonetic motor planning component, we achieved an accuracy of 78.87 in this study.

Transformer-based pretrained speech processing models are the second approach used by previous studies to process the acoustic part of speech.38,40,42,50,53 VGGish51 had the highest accuracy (72.92, reported by Koo et al.50) in detecting patients with ADRD compared with MobileNet,43 YAMNet,44 x-vectors,38 i-vectors,40 and Speech BERT.45 However, the accuracy of VGGish was 64.58 in another study reported by Syed et al.53 MobileNet, YAMNet, and VGGish were trained on annotated YouTube audio data, and Speech BERT was trained on the LibriSpeech dataset.48 None of these training audio datasets is representative of the speech data collected from patients with cognitive impairment, which explains the models’ relatively low performance in detecting patients with ADRD. In this study, we also investigated the performance of YAMNet (accuracy = 64.78), but because it reduced the overall performance of ADscreen in detecting patients with ADRD, we excluded it from the final model.

To model the linguistic part of patients’ speech, studies have used both domain-related linguistic features and transformer-based pretrained language models. POS tagging, TF-IDF, n-grams, grammatical dependencies, and filled/unfilled pauses are examples of domain-related features that have been used to quantify syntactic and semantic parameters of the patient’s language.24,25,29,50 For transformer-based pretrained language models, BERT and its extended versions (e.g., XLNet, DistilRoBERTa) have been used. Previous studies reported different accuracies for BERT in detecting patients with ADRD depending on the type of BERT model used to generate contextual embedding vectors from the patients’ descriptions and the analysis method used to process the embedding vectors. For example, Pompili et al.39 reported an accuracy of 72.92 using BERT base-cased with a Logistic Regression classifier, while Zhu et al.42 reported an accuracy of 82.08 for the Longformer language model. Overall, BERT and its extended versions performed somewhat better than domain-related linguistic features alone. This is mostly because the BERT language models were trained on very large datasets from Wikipedia, Google News, or the biomedical literature that include textual information similar to patients’ descriptions for the Cookie-Theft test. Additionally, BERT models have the capability of modeling disfluency in patient language, which is a very important factor in detecting patients with ADRD (see Appendix C for details). In this study, we evaluated the performance of BERT base-cased and its extended versions; the DistilBERT model had the highest accuracy in detecting patients with ADRD in our initial analysis (see Appendix D for details). We combined DistilBERT’s word embeddings with the linguistic and acoustic domain-related features, which resulted in an accuracy of 83.09.

In summary, ADscreen is an example of a screening algorithm that can provide insight into three major components of speech by modeling phonetic motor planning, levels of language organization, and psycholinguistic cues. Additionally, the analytic pipeline for modeling these three components is generalizable to other speech datasets generated through individuals’ spontaneous speech for other neurological assessment tests (e.g., film-recall8 and story-retelling146 tasks) and to speech datasets created through patient–clinician verbal communication in clinical settings. In the next phase of this study, our goal is to upgrade the components of ADscreen by incorporating other domain-related features, such as social interaction features, that may provide additional clues about the risk of ADRD.

5.1. Limitations

  • First, although participants in both the case and control groups were selected through extensive physical and neurological examinations, semi-structured psychiatric interviews, and neuropsychological assessments, there remains a risk of misdiagnosing patients with ADRD, primarily due to limited access to biomarkers such as cerebrospinal fluid (CSF) biomarkers. This limitation may affect ADscreen’s sensitivity in detecting ADRD patients.

  • Second, patients’ spontaneous speech for the Cookie-Theft test in the Pitt corpus was audio-recorded in 1994 using now-outdated technology, resulting in low-quality voice recordings. The low quality of the audio data may affect the accuracy of the linguistic and acoustic features extracted from this dataset.

  • Third, participants in this dataset are predominantly White. Studies have demonstrated that ML algorithms trained on racially imbalanced data may yield poor predictive performance for minority populations. Therefore, ADscreen’s results may not be generalizable to other races and ethnic groups.

  • Fourth, although we explored a wide range of acoustic and linguistic features for modeling phonological and language impairment in ADRD patients, other acoustic and linguistic features (such as distinctive grammar patterns) might improve ADscreen’s performance in screening for ADRD.

  • Fifth, we investigated the performance of three different ML architectures built on both deep-learning methods (CNN and Bi-LSTM) and traditional ML algorithms for developing ADscreen. In the next phase of the study, we plan to explore the performance of other ML architectures (e.g., the combination of CNN and Gated Recurrent Unit [GRU]) in detecting ADRD patients.

  • Sixth, the dementia databank does not supply any information regarding the participants’ cognitive status or disease stage, which limits the ability of ML-based screening algorithms developed on this dataset to forecast disease progression for patients with cognitive impairment.

  • Lastly, the databank does not offer evaluation results concerning the participants’ emotional status (e.g., anxiety, depression). Changes in emotional status are strong biomarkers for detecting ADRD patients; however, without such labels it was not possible to directly model patients’ emotions and incorporate them as indicators for ADRD detection. Instead, we extracted vocal and semantic psycholinguistic cues as indicators for detecting ADRD patients using GeMAPS and LIWC in this study.

6. Conclusion

Recent advances in the automated assessment of patients at risk of ADRD should inspire new contributions to profiling speech components in pathological cognitive impairment. Both acoustic and linguistic parameters can be very sensitive to changes in the neuropsychological status of elderly patients; therefore, a comprehensive parametric speech profile should be established for (1) assessing the cognitive status of elderly individuals and (2) tracking progression from one clinical stage to another (e.g., from cognitively healthy, to mild cognitive impairment, to Alzheimer’s disease). Such profiling has the potential to monitor changes in the cognitive status of elderly individuals and, in turn, to evaluate the effectiveness of interventions in stopping or delaying the progression of the disease. In summary, ADscreen has the potential to address the need for an ADRD screening tool, so that patients with these disorders receive appropriate and timely care.

Statement of Significance.

Alzheimer’s disease and related dementias (ADRD) represent a looming public health crisis, affecting roughly 5 million people and 11% of older adults in the United States.1 Despite nationwide efforts for timely diagnosis, more than 50% of patients with ADRD are not diagnosed and are unaware of their disease. Missed and delayed diagnoses not only place additional emotional and financial strain on families and caregivers but also lead to lost opportunities for treatment and the associated negative outcomes, particularly emergency department visits and hospitalizations. Given the projection of 13.2 million ADRD patients by 2050,2 with an associated cost of more than $1.1 trillion, many organizations, including the National Institutes of Health and the National Science Foundation, have recognized the development of a robust diagnostic tool for the early identification of elderly patients with ADRD as a critical and urgent research priority. Emerging studies show that changes in patients’ spoken language are among the earliest signs of cognitive impairment, enabling features of spoken language to act as biomarkers for multiple dimensions of cognitive ability, including executive functioning, semantic memory, and language. Established speech analysis and natural language processing techniques can be used to model components of spoken language and to develop robust acoustic and linguistic metrics for detecting cues of cognitive impairment in spoken language. In response to the challenges of timely diagnosis of ADRD, we developed ADscreen for proactive automated screening of patients at risk for ADRD. To develop ADscreen, we trained different machine learning algorithms on a combination of a large set of acoustic and linguistic parameters and transformer-based methods to detect cues of cognitive impairment in spoken language. We tested the performance of ADscreen on the speech dataset of the DementiaBank English Pitt Corpus3 for ADRD patients. The results were not only promising for identifying patients with ADRD but also provide insight into the specific types of speech impairment present in these patients, informing the adoption of appropriate interventions.

Source of Funding:

K99AG076808- “Development of a Screening Algorithm for Timely Identification of Patients with Mild Cognitive Impairment and Early Dementia in Home Healthcare”

Appendices

Appendix A. Cookie-Theft Speech Description Task


Source: Figueiredo, S., & Barfod, V. (2012). Boston Diagnostic Aphasia Examination (BDAE). Chicago147

Appendix B. Inclusion and exclusion criteria for recruiting patients for the ADRD study

Patient recruitment:

Participants for the ADRD study were enrolled from various clinical settings, including the Benedum Geriatric Center, Multispecialty Outpatient Geriatric Facility at the University of Pittsburgh Medical Center, and local neurologists and psychiatrists.60

Inclusion criteria:

Individuals with a diagnosis of ADRD and symptoms associated with ADRD were eligible for the case group. Individuals with no history of cognitive impairment were eligible for the control group. The research team contacted eligible patients and their caregivers and explained the study’s goal and the potential risks and benefits associated with participation. Patients who provided signed informed consent received extensive physical and neurological examinations, semi-structured psychiatric interviews, and neuropsychological assessments. In addition, each participant was interviewed by a psychiatric nurse to assess their physical and cognitive limitations as well as the caregiving burden on their primary caregiver. Beyond the examinations listed, each participant completed various laboratory studies, including blood chemistry, liver and thyroid function tests, and vitamin level tests.60

Exclusion criteria:

Patients with any of the following were excluded from the study: severe manifestations of behavioral and psychiatric symptoms; severe impairment in speech and oral expression of language; significant disease of the central nervous system, such as brain tumor, seizure disorder, subdural hematoma, or cranial arteritis; the need for emergent care, such as uncontrolled pain or wound infection or deterioration; and frequent use of high doses of opioid analgesics.

Appendix C. How does BERT process disfluency?

Tian et al.32 investigated whether and how the BERT language model understands language disfluency using three experiments. In the first experiment, they added a softmax layer to a medium-sized BERT model (with 12 layers, 12 attention heads, and a total of 110M parameters). They then trained the classifier on a synthetic dataset consisting of 100 fluent sentences with corresponding disfluent sentences. The findings showed that the BERT language model has an accuracy of 81.3% in detecting disfluent sentences. The authors suggested that, without any fine-tuning on data containing disfluency, BERT already performs fairly well in identifying disfluent data.

With this finding, the authors hypothesized that BERT has an innate understanding of disfluencies. To test this hypothesis, they looked inside the black box of the BERT deep learning model to investigate how the embeddings of disfluent sentences change across BERT layers. They further hypothesized that if the BERT model can understand language disfluency, the sentence embeddings of a disfluent sentence and its fluent counterpart should be more similar in layers associated with semantic representation than in layers associated with surface form and syntactic representation, because fluent sentences and their disfluent counterparts are more similar in meaning than in form. For this experiment, they used a sample of 900 disfluent utterances from the Switchboard corpus,148 which includes telephone conversations from speakers across the United States. For each disfluent utterance, they created a fluent counterpart by removing filled pauses, interjections, and reparanda. Using two metrics, (1) raw cosine similarity and (2) cosine similarity ranking, the authors determined the quality of an embedding in capturing semantic nuances and the closeness of a disfluent–fluent pair in the embedding space. The results of the experiment showed that (1) “BERT ranks a disfluent sentence high in similarity compared to all possible fluent counterparts;” (2) “final layer embedding is a relatively good aggregation of sentence meaning;” and (3) “In terms of all sentence tokens, the similarity improves steadily in deeper layers, pointing towards increasing semantic selectivity and invariance to disfluencies.” These findings confirmed that the BERT language model can understand language disfluency.32 In the third experiment, the authors analyzed the role of the attention mechanism in identifying semantic disfluency. They found that BERT distinguishes the reparandum and the alteration in a sentence by paying less attention to the reparandum in deeper layers.

Overall, the BERT language model processes disfluency by selectively attending to different parts of the disfluency with different intensities through its attention mechanism. This mechanism allows the BERT language model to differentiate between the word embeddings of a disfluent sentence and its fluent counterpart.

Appendix D. Evaluating the performance of different pretrained transformer-based language models in detecting patients with ADRD

BERT149 is a contextualized word representation model built on a masked language model and pre-trained using bidirectional transformers on large datasets collected from Wikipedia and Google News. The BERT architecture addressed one of the fundamental challenges in language modeling: predicting a word in a sequence of words (e.g., a sentence). BERT uses a masked language model that predicts randomly masked words in a sequence and can therefore learn a bidirectional representation of each word. This mechanism substantially improved BERT’s performance over traditional language models,150,151 which combine information from two unidirectional language models to improve the accuracy of word prediction.

We hypothesize that BERT’s bidirectional representation mechanism is also critical for modeling the language component of speech in patients with ADRD because it helps incorporate information related to language disfluency (e.g., repeated words or filler pauses) into the word embeddings. Please see Appendix C for more details about modeling language disfluency using BERT.

We modeled the patients’ utterances (for the “Cookie-Theft” test) using BERT (BERT base-cased, the original version of BERT) and its extended versions, including DistilBERT, DistilRoBERTa, and XLNet. We selected these four language models because they outperformed other language models in different natural language processing tasks.152,153

  • BERT149 base-cased comprises 12 transformer blocks, a total of 110M parameters, 12 attention heads, and a hidden layer size of 768. We used the implementation of BERT base-cased here.154

  • DistilBERT:113 a small, fast, and light BERT-based transformer model that uses 40% fewer parameters than BERT-base, runs 60% faster, and preserves more than 97% of BERT’s performance.113 DistilBERT comprises 6 transformer blocks, 65M parameters, 12 attention heads, and a hidden layer size of 768. We used the implementation of DistilBERT here.114

  • XLNet:52 an extension of the Transformer-XL model, trained with an autoregressive method to learn bidirectional contexts. Like BERT base-cased, XLNet comprises 12 transformer blocks, a total of 110M parameters, 12 attention heads, and a hidden layer size of 768. We used the implementation of XLNet here.155

  • DistilRoBERTa-base:156 a distilled version of the RoBERTa-base model. The RoBERTa-base model156 is an extension of the BERT language model; compared with BERT, RoBERTa was trained on additional news and story corpora with adjusted training strategies to improve its performance. DistilRoBERTa-base comprises 6 transformer blocks, a total of 82M parameters (compared with 125M parameters for RoBERTa-base), 12 attention heads, and a hidden layer size of 768. On average, DistilRoBERTa is twice as fast as RoBERTa-base. We used the implementation of DistilRoBERTa here.157

We trained each language model independently on the study sample (see Section 3.1, Data Source), which generated sentence embedding vectors of size 768 × 512, where 768 is the size of the hidden layer and 512 is the maximum sequence length (i.e., the maximum number of tokens in a description). The embedding vectors for the test dataset were fed into a support vector machine (SVM) classifier with an RBF kernel. We chose the SVM classifier because it performs well on high-dimensional, small-sample datasets; additionally, the SVM classifier has stable performance across different random states, whereas other classification algorithms, such as ensemble decision trees, can yield different performance in different random states. Table 1 shows the performance of the four language models on the test dataset of the study (see Section 3.1, Data Source) in detecting patients with ADRD.

Table 1.

Performance of pretrained transformer language models in detecting patients with ADRD

Language model Precision Recall F1-score Accuracy AUC-ROC
BERT base-cased 75.67 80 77.77 77.46 83.73
DistilBERT 84.84 80 82.35 83.09 89.05
XLNET 73.68 80 76.71 76.05 83.17
DistilRoBERTa 79.41 77.14 78.26 79.20 78.87

Because DistilBERT had the highest performance among these language models, we used this model, along with the other domain-related linguistic features (see Component 3: Modeling the patient’s ability in semantic and syntactic levels of language organization), to model the patient’s ability in semantic and syntactic levels of language organization.
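A sketch of this Appendix D procedure, under our own assumptions about pooling and classifier settings, is given below: each public checkpoint embeds the transcripts, and an RBF-kernel SVM is scored by cross-validation. The variables passed to evaluate_checkpoints (transcripts and ADRD labels) are assumed to be supplied by the reader.

# Sketch: compare pretrained transformer checkpoints as fixed feature extractors
# for an RBF-kernel SVM (mean pooling and SVM settings are assumptions).
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

CHECKPOINTS = ["bert-base-cased", "distilbert-base-cased",
               "xlnet-base-cased", "distilroberta-base"]

def embed(texts, checkpoint):
    """Mean-pooled last-hidden-state embedding for each transcript."""
    tok = AutoTokenizer.from_pretrained(checkpoint)
    net = AutoModel.from_pretrained(checkpoint).eval()
    vecs = []
    for text in texts:
        enc = tok(text, truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            hidden = net(**enc).last_hidden_state      # (1, seq_len, hidden_dim)
        vecs.append(hidden.mean(dim=1).squeeze(0).numpy())
    return np.vstack(vecs)

def evaluate_checkpoints(texts, labels):
    """Score each checkpoint with an RBF-kernel SVM via 5-fold cross-validation."""
    for ckpt in CHECKPOINTS:
        X = embed(texts, ckpt)
        auc = cross_val_score(SVC(kernel="rbf"), X, labels, cv=5, scoring="roc_auc")
        print(f"{ckpt}: AUC-ROC = {auc.mean():.3f}")

# evaluate_checkpoints(transcripts, adrd_labels)  # transcripts/labels supplied by the user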

Appendix E. Supportive information on the efficacy of the Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for modeling vocal psycholinguistic cues in spoken language

Vocal psycholinguistic cues conveyed in the voice have been empirically documented by many previous studies in the area of speech and voice analysis.117,158–160 These studies mostly used established procedures from phonetics and the speech sciences to identify differentiating parameters of phonation and articulation in speech that are sensitive to alterations in subglottal pressure, transglottal airflow, and vocal fold vibration.161–164 These include a large set of acoustic parameters in the time domain (e.g., speech rate), the frequency domain (e.g., fundamental frequency [F0] or formant frequencies), the amplitude domain (e.g., intensity or energy of the voice), and the spectral energy domain (e.g., relative energy in different frequency bands). However, a large brute-force feature set usually results in over-adaptation of classifiers to the training data, reducing their ability to generalize to unseen datasets. In response to these challenges, the GeMAPS acoustic feature set was introduced to reduce the risk of overfitting and to improve generalization across corpora and, ultimately, in real-world test scenarios. GeMAPS contains an acoustic feature set of 18 low-level descriptors (LLDs) grouped into three acoustic parameter groups: frequency-related parameters, energy/amplitude-related parameters, and spectral (balance) parameters.

The performance of GeMAPS in identifying vocal psycholinguistic cues, specifically arousal/valence, was evaluated across several audio-recorded speech datasets labeled for emotion, including (1) FAU AIBO,165 (2) the TUM Audiovisual Interest Corpus,165 (3) the Berlin Emotional Speech Database,119 and (4) the Geneva Singing Voice Emotion Database.166 In all of these speech databases, the GeMAPS acoustic feature set surpassed larger acoustic parameter sets (e.g., ComParE, with 6,373 acoustic parameters) in detecting psycholinguistic vocal cues, particularly the “arousal” and “valence” present in participants’ speech.
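For readers who wish to reproduce a comparable feature set, the openSMILE Python wrapper can compute the GeMAPS/eGeMAPS functionals as sketched below; the study’s exact extraction toolkit and configuration are not restated here, and the file name is hypothetical.

# Illustrative sketch: extract eGeMAPS functionals for one recording with the
# openSMILE Python wrapper (the study's exact extraction setup may differ).
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,        # 88 functionals, matching Table 2
    feature_level=opensmile.FeatureLevel.Functionals,
)

features = smile.process_file("participant_001.wav")     # hypothetical file name
print(features.shape)                                     # (1, 88) DataFrame of parameters
print([c for c in features.columns if "loudness" in c])   # e.g., loudness_sma3_percentile20.0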

We conducted a statistical association analysis using the independent-samples t-test to examine the relationship between GeMAPS parameters and ADRD within the study sample. The findings are presented in Table 2: out of 88 acoustic parameters, 57 were significantly associated with ADRD (p < 0.05).

Table 2.

Association of GeMAPS features with ADRD

GeMAPS Parameter (P-value)
Parameters that are significantly associated with ADRD loudness_sma3_percentile20.0 (1.80e-29), HNRdBACF_sma3nz_stddevNorm (6.69e-28), mfcc2_sma3_stddevNorm (9.12e-27), slopeV0–500_sma3nz_stddevNorm (4.20e-26), alphaRatioV_sma3nz_stddevNorm (3.23e-24), loudness_sma3_percentile50.0 (2.16e-21), mfcc2V_sma3nz_stddevNorm (1.62e-15), spectralFluxUV_sma3nz_amean (2.34e-13), MeanUnvoicedSegmentLength (7.75e-13), slopeV500–1500_sma3nz_stddevNorm (9.59e-13), spectralFlux_sma3_amean (1.04e-12), StddevUnvoicedSegmentLength (1.49e-12), spectralFluxV_sma3nz_stddevNorm (1.81e-12), MeanVoicedSegmentLengthSec (2.04e-12), loudness_sma3_amean (7.23e-12), mfcc4_sma3_stddevNorm (3.15e-11), loudness_sma3_pctlrange0-2 (1.00e-10), loudness_sma3_percentile80.0 (1.12e-10), spectralFluxV_sma3nz_amean (1.40e-10), StddevVoicedSegmentLengthSec (3.11e-10), loudness_sma3_stddevFallingSlope (8.36e-09), loudness_sma3_stddevRisingSlope (1.12e-08), mfcc3V_sma3nz_stddevNorm (3.55e-08), loudness_sma3_stddevNorm (6.98e-08), mfcc3_sma3_stddevNorm (1.52e-07), spectralFlux_sma3_stddevNorm (3.06e-07), F0semitoneFrom27.5Hz_sma3nz_pctlrange0-2 (1.28e-06), loudness_sma3_meanFallingSlope (4.70e-06), loudness_sma3_meanRisingSlope (5.03e-06), logRelF0-H1-H2_sma3nz_stddevNorm (6.22e-06), mfcc1V_sma3nz_stddevNorm (1.12e-04), F0semitoneFrom27.5Hz_sma3nz_stddevFallingSlope (2.04e-04), slopeUV0–500_sma3nz_amean (2.06e-04), mfcc4V_sma3nz_stddevNorm (2.18e-04), HNRdBACF_sma3nz_amean (2.31e-04), jitterLocal_sma3nz_amean (2.39e-04), shimmerLocaldB_sma3nz_stddevNorm (2.77e-04), hammarbergIndexV_sma3nz_stddevNorm (2.96e-04), loudnessPeaksPerSec (7.97e-04), F0semitoneFrom27.5Hz_sma3nz_percentile20.0 (2.37e-03), slopeV0–500_sma3nz_amean (4.40e-03), logRelF0-H1-A3_sma3nz_stddevNorm (9.82e-03), F0semitoneFrom27.5Hz_sma3nz_percentile50.0 (9.84e-03), F0semitoneFrom27.5Hz_sma3nz_stddevNorm (1.20e-02), jitterLocal_sma3nz_stddevNorm (1.25e-02), F1amplitudeLogRelF0_sma3nz_stddevNorm (1.68e-02), F1bandwidth_sma3nz_stddevNorm (1.68e-02), F3frequency_sma3nz_amean (2.04e-02), F2amplitudeLogRelF0_sma3nz_stddevNorm (2.13e-02), F1amplitudeLogRelF0_sma3nz_amean (2.20e-02), F1bandwidth_sma3nz_amean (2.42e-02), mfcc3V_sma3nz_amean (2.47e-02), F3amplitudeLogRelF0_sma3nz_amean (3.08e-02), F3bandwidth_sma3nz_stddevNorm (3.47e-02), F1frequency_sma3nz_amean (3.63e-02), F3amplitudeLogRelF0_sma3nz_stddevNorm (4.35e-02), F2amplitudeLogRelF0_sma3nz_amean (4.48e-02)
Parameters that are not associated with ADRD hammarbergIndexV_sma3nz_amean (5.64e-02), logRelF0-H1-H2_sma3nz_amean (6.56e-02), F2frequency_sma3nz_stddevNorm (6.61e-02), F2frequency_sma3nz_amean (8.62e-02), F1frequency_sma3nz_stddevNorm (1.02e-01), F0semitoneFrom27.5Hz_sma3nz_amean (1.14e-01), logRelF0-H1-A3_sma3nz_amean (1.15e-01), F0semitoneFrom27.5Hz_sma3nz_meanFallingSlope (1.16e-01), F0semitoneFrom27.5Hz_sma3nz_percentile80.0 (1.31e-01), equivalentSoundLevel_dBp (1.46e-01), slopeUV500–1500_sma3nz_amean (2.19e-01), F3frequency_sma3nz_stddevNorm (2.54e-01), hammarbergIndexUV_sma3nz_amean (2.71e-01), mfcc1V_sma3nz_amean (4.22e-01), mfcc4V_sma3nz_amean (4.86e-01), F3bandwidth_sma3nz_amean (4.91e-01), mfcc3_sma3_amean (5.08e-01), mfcc2_sma3_amean (5.33e-01), mfcc2V_sma3nz_amean (5.79e-01), VoicedSegmentsPerSec (6.12e-01), F0semitoneFrom27.5Hz_sma3nz_meanRisingSlope (6.83e-01), shimmerLocaldB_sma3nz_amean (7.12e-01), F0semitoneFrom27.5Hz_sma3nz_stddevRisingSlope (7.19e-01), F2bandwidth_sma3nz_stddevNorm (7.62e-01), mfcc1_sma3_stddevNorm (7.94e-01), alphaRatioUV_sma3nz_amean (8.17e-01), slopeV500–1500_sma3nz_amean (8.33e-01), mfcc1_sma3_amean (8.44e-01), alphaRatioV_sma3nz_amean (8.72e-01), mfcc4_sma3_amean (9.61e-01), F2bandwidth_sma3nz_amean (9.68e-01)
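A minimal sketch of the per-feature association analysis used for Tables 2 and 3 (an independent two-sample t-test for each GeMAPS or LIWC feature between the ADRD and control groups) follows; the data frame layout and column names are assumptions.

# Sketch: t-test each feature column against the binary ADRD label.
import pandas as pd
from scipy.stats import ttest_ind

def feature_pvalues(df: pd.DataFrame, label_col: str = "adrd") -> pd.Series:
    """Return one p-value per feature column, sorted ascending (label_col is 0/1)."""
    cases = df[df[label_col] == 1]
    controls = df[df[label_col] == 0]
    pvals = {
        col: ttest_ind(cases[col], controls[col], nan_policy="omit").pvalue
        for col in df.columns if col != label_col
    }
    return pd.Series(pvals).sort_values()

# df: one row per participant with GeMAPS/LIWC features plus an 'adrd' label (assumed)
# significant = feature_pvalues(df)[lambda p: p < 0.05]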

Appendix F. Supportive information on the efficacy of Linguistic Inquiry and Word Count (LIWC) for identifying semantic psycholinguistic cues in spoken language

The developers of LIWC23 followed a rigorous methodology to improve its internal reliability and external validity for linguistic analysis and for identifying psychological cues in individuals’ language. The methodology consists of several steps. In step 1, vocabularies for each category were generated from several sources, such as emotion rating scales (e.g., PANAS), Roget’s Thesaurus, and standard English dictionaries; additionally, 3–6 judges participated in brainstorming sessions to generate more related words for each category. In step 2, each word in the grand list for each category was examined by 4–8 judges and qualitatively rated in terms of its “goodness of fit” for that category. In step 3, the developers analyzed a working version of the dictionary using the Meaning Extraction Helper (MEH)167 to determine how frequently the dictionary words were used in various contexts in different language corpora (e.g., Twitter, Facebook); words that did not occur at least once in these corpora were removed from the dictionary. In step 4, several language corpora were explored using the MEH tool to identify useful vocabularies not yet included in the dictionary, and the candidate vocabularies were evaluated by 4–8 judges for conceptual fit with a specific category. In step 5, the developers computed the internal consistency of each vocabulary within its category; vocabularies that were detrimental to the internal consistency of the category were evaluated by 2–8 judges for retention or removal from the dictionary. In step 6, the developers repeated steps 1 through 5 to catch any mistakes that might have occurred during the development process.

By following this rigorous process, the LIWC developers achieved high internal reliability and external validity of this text analysis tool in detecting psycholinguistic cues and language composition elements associated with cognitive and affective processes.95,124–127 In healthcare, the reliability and validity of LIWC in detecting semantic psycholinguistic cues associated with mental and neurological disorders have been verified in several studies.95,124–127 For example, Burkhardt et al.125 found that LIWC features, specifically linguistic features (e.g., pronoun use), can distinguish between individuals with and without schizophrenia; the authors concluded that elements of language composition differ between these groups and that LIWC language features are able to capture this difference. LIWC has also been useful for detecting patients with cognitive impairment. For example, Asgari et al. found that linguistic markers from the “psychological processes” and “linguistic structure” domains were associated with the presence of cognitive impairment.95

We conducted a statistical association analysis using the independent-samples t-test to examine the relationship between LIWC parameters and ADRD within the study sample. The findings are presented in Table 3: out of 93 linguistic parameters, 79 were significantly associated with ADRD (p < 0.05).

Table 3.

Association of LIWC features with ADRD

LIWC Parameter (P-value)
Parameters that are significantly associated with ADRD Other punctuation (4.09e-32), sexual (4.09e-32), filler (8.89e-32), Exclam (1.52e-31), Dash (1.92e-31), swear (8.33e-31), death (1.22e-30), friend (2.19e-30), sad (1.33e-29), relig (4.74e-29), they (2.67e-28), anger (3.86e-28), anx (1.54e-27), risk (7.54e-27), money (1.16e-26), cause (1.43e-25), work (1.32e-23), negemo (1.93e-23), netspeak (4.04e-23), feel (1.26e-21), interrog (2.64e-21), health (3.83e-21), we (3.95e-21), assent (4.84e-21), QMark (5.48e-21), achieve (2.72e-19), focusfuture (1.76e-18), discrep (1.86e-18), WC (4.36e-18), number (7.59e-18), focuspast (9.48e-18), reward (1.44e-17), hear (2.45e-17), family (4.70e-17), you (6.91e-17), posemo (9.92e-17), home (1.18e-16), body (2.17e-16), affiliation (1.39e-15), certain (4.80e-15), compare (1.88e-14), affect (2.70e-14), quant (2.77e-14), informal (3.62e-14), negate (7.36e-14), nonflu (1.25e-13), WPS (2.59e-13), insight (3.78e-13), adj (4.21e-13), i (1.16e-12), tentat (5.62e-12), male (1.62e-11), differ (2.25e-11), Comma (4.02e-11), leisure (1.20e-10), adverb (1.93e-10), power (6.12e-10), time (9.53e-10), see (4.55e-09), motion (6.22e-09), Period (7.86e-08), ipron (3.52e-07), shehe (4.76e-07), conj (1.51e-06), AllPunc (1.63e-06), Authentic (4.06e-06), Clout (4.14e-06), ingest (6.60e-06), cogproc (3.73e-05), bio (4.68e-05), percept (1.23e-04), Apostro (4.77e-04), drives (5.57e-04), female (1.63e-03), Analytic (2.83e-03), ppron (8.14e-03), verb (1.33e-02), function (1.97e-02), pronoun (4.67e-02)
Parameters that are not associated with ADRD relativ (6.76e-02), space (8.42e-02), social (1.31e-01), Tone (2.55e-01), focuspresent (4.89e-01), article (5.43e-01), prep (6.51e-01), Dic (6.72e-01), Sixltr (6.87e-01), auxverb (7.43e-01), Colon (all values 0 for this variable), SemiC (all values 0 for this variable), Quote (nan), Parenth (all values 0 for this variable).

Appendix G. YAMNet acoustic embedding model

YAMNet is a pretrained deep-learning acoustic embedding model that was trained on a human-labeled YouTube audio event dataset.47 YAMNet has the same architecture as MobileNet, with a backbone convolutional neural network (CNN). The backbone takes a three-dimensional (h, w, 3) matrix as input, where “h” is height, “w” is width, and 3 is the number of RGB channels, and converts it into a three-dimensional matrix of size (h′, w′, 1024), where (h′, w′) is a function of (h, w) and 1024 is the depth of the backbone CNN. The output matrix (h′, w′, 1024) is then sent to a global average pooling (GAP) layer, which reduces the h′ and w′ dimensions to obtain a 1024-dimensional feature vector. The GAP output is passed to a fully connected (FC) layer with 1000 neurons, which is connected to a SoftMax activation layer that produces the classification results. The MobileNet backbone was trained on image datasets; compared with MobileNet, YAMNet performs better on downstream audio-processing tasks (e.g., patients’ descriptions for the Cookie-Theft test) because it was trained on audio datasets. Table 4 reports the performance of YAMNet in detecting patients with ADRD using the audio-recorded patients’ descriptions for the Cookie-Theft test.

Table 4.

Performance of YAMNet in detecting patients with ADRD from the audio-recorded patients’ descriptions of the Cookie-Theft test

Model Precision Recall F1-score Accuracy AUC-ROC
YAMNet result 61.36 77.14 68.35 64.78 59.60
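As an illustration of how the 1024-dimensional YAMNet embeddings can be obtained, the sketch below (assumed, not the authors' pipeline) loads the public TensorFlow Hub YAMNet model, feeds it a 16-kHz mono waveform, and mean-pools the frame-level embeddings into a single recording-level vector; the audio file name is hypothetical:

import librosa
import numpy as np
import tensorflow_hub as hub

yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

# YAMNet expects a mono waveform sampled at 16 kHz with values in [-1.0, 1.0].
waveform, _ = librosa.load("cookie_theft_description.wav", sr=16000, mono=True)

# The model returns per-frame class scores, 1024-dimensional embeddings, and the log-mel spectrogram.
scores, embeddings, log_mel = yamnet(waveform)

# Mean-pool the frame-level embeddings into one 1024-dimensional feature vector per recording.
recording_vector = np.mean(embeddings.numpy(), axis=0)  # shape: (1024,)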

Appendix H. Detailed descriptions of machine learning classifiers used in machine learning Models I, II, and III to identify patients with ADRD

We used several discriminative machine learning (ML) algorithms: logistic regression as the baseline algorithm; bootstrap aggregation168 (bagging) and gradient boosting169 ensemble decision trees as non-parametric ML methods capable of generating a large number of decision trees (weak learners); and the support vector machine (SVM) as a kernel-based algorithm.

Logistic regression uses the logistic (sigmoid) function to model the probability of the outcome and estimates a coefficient for each input variable (predictor). It is often used as a baseline binary ML classification model because of its simplicity and efficient implementation.170
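A minimal baseline sketch (assumed settings, not the authors' exact configuration) of such a logistic-regression classifier over a feature matrix of selected speech parameters and binary labels might look as follows:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The saga solver supports all three penalties ("l1", "l2", "elasticnet") listed in Appendix I.
baseline = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l2", solver="saga", max_iter=5000),
)
# baseline.fit(X_train, y_train)  # X_train, y_train are placeholders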

Bagging methods are ensemble decision tree algorithms in which weak learners are trained independently and in parallel on the entire training sample or on subsets of it, and are weighted equally when computing the final outcome.168 We used two popular bagging algorithms, Random Forest and Extremely Randomized Trees134 (Extra Trees). Random Forest grows each weak learner from a bootstrap sample (a resampling technique that estimates statistics on a population by sampling a dataset with replacement), whereas Extra Trees fits each weak decision tree to the full original learning sample. Both algorithms grow the weak learner trees using a subset of features selected at random at each split point (node). However, unlike Random Forest, which uses a greedy search to select an optimal split point, Extra Trees selects the split point completely at random. For both algorithms, the outcomes of the weak trees are aggregated by majority vote in classification problems to yield the final prediction. Because Extra Trees combines explicit randomization of the cut-point with the full original learning sample rather than bootstrap replicas, it can reduce both variance and bias more strongly than Random Forest, which uses a weaker randomization scheme.134
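The contrast between the two bagging ensembles can be summarized in the following sketch (hyperparameter values are assumed; X_train and y_train are placeholders):

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# Random Forest: each tree is grown on a bootstrap sample and each split is the
# best split found within a random subset of features.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", bootstrap=True, random_state=42)

# Extra Trees: each tree is grown on the full training sample (no bootstrap by default)
# and the cut-point at each node is drawn at random before the best candidate is kept.
et = ExtraTreesClassifier(n_estimators=500, max_features="sqrt", bootstrap=False, random_state=42)

# rf.fit(X_train, y_train); et.fit(X_train, y_train)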

Gradient Boosting methods are ensemble decision tree algorithms in which weak learners are generated sequentially, each taking into account the errors of the previous tree, and are weighted according to their performance when computing the final outcome.169 We used Adaptive Boosting135 (AdaBoost) and extreme gradient boosting (XGBoost),136 two popular methods in this category. AdaBoost builds an ensemble of weak learners iteratively by modifying the weights of misclassified data at each iteration: misclassified samples receive larger weights and therefore have a higher probability of appearing in the training sample for the succeeding weak learners. Unlike AdaBoost, which changes the sample distribution used to train weak learners, XGBoost uses a gradient descent algorithm (an algorithm for minimizing a loss function) to reduce the errors. Because its weak learners are fitted incrementally to the remaining residual errors of the strong learner, XGBoost does not alter the sample distribution. Both algorithms use optimization procedures to select features for growing trees and to choose the split points of the nodes. However, unlike XGBoost, which uses optimization parameters to determine the depth of the weak learners, AdaBoost creates weak learners with a single split, called decision stumps. Additionally, XGBoost uses regularization parameters to reduce the complexity of the regression tree function and the bias-variance of the model (the bias-variance tradeoff).
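A comparable sketch for the two boosting ensembles (again with assumed hyperparameters; X_train and y_train are placeholders) is:

from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier

# AdaBoost: decision stumps (depth-1 trees), with sample weights shifted toward
# previously misclassified examples at every iteration.
ada = AdaBoostClassifier(n_estimators=1000, learning_rate=0.01, random_state=42)

# XGBoost: each new tree is fitted to the gradient of the loss (the residual errors),
# with regularization and subsampling parameters controlling tree complexity.
xgb = XGBClassifier(max_depth=3, subsample=0.8, colsample_bytree=0.8,
                    reg_lambda=1.0, eval_metric="logloss", random_state=42)

# ada.fit(X_train, y_train); xgb.fit(X_train, y_train)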

SVMs belong to the general category of kernel methods.137 A kernel method is an algorithm whose dependence on the data is limited to inner products of data vectors; these inner products can therefore be replaced by a kernel function that computes them implicitly in a high-dimensional space. The advantage of this approach is twofold: first, it can generate nonlinear decision boundaries using algorithms designed for linear classifiers; second, the kernel function allows a classifier to be applied to data that have no fixed-dimensional vector space representation. The SVM itself is built on four basic concepts171: (1) the separating hyperplane, which divides the data according to their outcome label; (2) the maximum-margin hyperplane, which maximizes the generalizability of the SVM; (3) the soft margin, whose regularization parameter controls the trade-off between maximizing the margin and minimizing the misclassification error; and (4) the kernel function, a similarity measure that implicitly transforms the data. The kernel function reduces the data complexity and improves the separability of the data, which is particularly useful in a high-dimensional feature space. Studies have shown that optimizing SVM parameters can reduce variance and bias in prediction tasks.171,172
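The soft-margin, kernel-based formulation described above corresponds, for example, to the following scikit-learn sketch (assumed settings; X_train and y_train are placeholders):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# C controls the soft-margin trade-off; the RBF kernel supplies the nonlinear boundary.
svm = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=1.0, gamma="scale", probability=True),
)
# svm.fit(X_train, y_train)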

Appendix I. Parameters tuned for machine learning algorithms

ML algorithms Parameters
Logistic Regression params = {
 norm of the penalty: [“l1”, “l2”, “elasticnet”]
}
Random Forest params = {
 number of trees in the forest: [10, 100, 500, 1000],
 maximum features at each split: [2, 4, 6, 8],
 function to measure the quality of a split: [“gini”, “entropy”]
}
Extra Trees params = {
 number of trees in the forest: [50, 500, 1000, 5000],
 number of features to consider when looking for the best split: [“sqrt”, “log2”, None],
 function to measure the quality of a split: [“gini”, “entropy”]
}
XGBoost params = {
 minimum sum of instance weight (hessian) needed in a child: [1, 5, 10],
 minimum loss reduction required to make a further partition on a leaf node of the tree (gamma): [0.5, 1, 1.5, 2, 5],
 subsample ratio of the training instances: [1.0, 0.8, 0.6],
 subsample ratio of columns when constructing each tree: [0.6, 0.8, 1.0],
 maximum depth: [3, 4, 5]
}
AdaBoost params = {
 maximum number of estimators at which boosting is terminated: [500, 1000, 2000, 5000],
 weight applied to each classifier at each boosting iteration (learning_rate): [0.001, 0.01, 0.1]
}
Support Vector Machine params = {
 kernel: [“poly”, “linear”, “rbf”, “sigmoid”],
 degree of the polynomial kernel function: [2, 3, 4, 5, 10],
 kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’: [“scale”, “auto”]
}
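In practice, grids such as those above are typically searched with cross-validation; the sketch below (assumed, not the authors' exact code) tunes the SVM grid with 5-fold cross-validation optimizing the F1-score; X_train and y_train are placeholders:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

svm_grid = {
    "kernel": ["poly", "linear", "rbf", "sigmoid"],
    "degree": [2, 3, 4, 5, 10],   # used only by the polynomial kernel
    "gamma": ["scale", "auto"],   # kernel coefficient for rbf, poly, and sigmoid
}

search = GridSearchCV(SVC(), svm_grid, scoring="f1", cv=5, n_jobs=-1)
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)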

Appendix J. Performance of machine learning Models II and III for detecting patients with ADRD

See details of ML Models II and III in section “3.6.3 In processing phase: Machine learning (ML) architecture.”

Table 5.

Performance of ML model II with different classification algorithms for training and test datasets

5-fold cross validation performance of classification algorithms on the training dataset
Algorithms Precision Recall F1-score AUC-ROC Accuracy
SVM 77.47 ± 6.38 89.67 ± 8.52 82.97 ± 6.56 89.03 ± 5.68 80.76 ± 7.15
Extra Trees 79.21 ± 7.69 82.17 ± 5.98 80.59 ± 6.56 86.91 ± 6.62 79.11 ± 7.26
Random Forest 77.65 ± 7.58 80.81 ± 8.48 79.14 ± 7.78 85.03 ± 6.17 77.67 ± 8.28
AdaBoost 74.58 ± 6.30 70.20 ± 2.64 72.19 ± 3.57 75.87 ± 3.80 71.52 ± 4.78
XGBoost 78.87 ± 7.36 70.13 ± 1.84 74.09 ± 3.47 77.83 ± 7.53 74.09 ± 4.50
LR 79.05 ± 5.79 81.63 ± 9.58 80.09 ± 6.45 84.93 ± 8.15 78.98 ± 6.28

Table 6.

Performance of ML model III with different classification algorithms for training and test datasets

5-fold cross validation performance of classification algorithms on the training dataset
Algorithms Precision Recall F1-score AUC-ROC Accuracy
SVM 78.79 ± 4.67 79.41 ± 10.62 78.65 ± 5.77 85.77 ± 6.75 77.73 ± 5.48
Extra Trees 78.17 ± 6.74 79.38 ± 10.26 78.27 ± 6.23 85.39 ± 8.01 77.16 ± 6.30
Random Forest 77.55 ± 6.91 80.08 ± 12.51 78.33 ± 8.01 85.33 ± 8.34 77.20 ± 8.03
AdaBoost 74.05 ± 5.75 77.10 ± 14.20 74.84 ± 7.96 78.29 ± 9.08 73.47 ± 7.66
XGBoost 75.13 ± 2.94 79.34 ± 13.12 76.61 ± 6.66 83.72 ± 4.69 75.32 ± 5.38
LR 77.47 ± 5.41 77.12 ± 9.47 76.91 ± 5.50 83.52 ± 7.77 75.93 ± 5.56

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Conflict of Interest

We have no financial, commercial, legal, or professional relationships with other organizations, or with the people working with them, that could influence our research.

Maryam Zolnoori, PhD

References

  • 1.Association A, Thies W & Bleiler L 2013 Alzheimer’s disease facts and figures. Alzheimer’s Dement 9, 208–245 (2013). [DOI] [PubMed] [Google Scholar]
  • 2.Zhu CW et al. Health-related resource use and costs in elderly adults with and without mild cognitive impairment. J. Am. Geriatr. Soc 61, 396–402 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.St-Hilaire A, Hudon C, Préville M & Potvin O Utilization of healthcare services among elderly with cognitive impairment no dementia and influence of depression and anxiety: A longitudinal study. Aging Ment. Health 21, 810–822 (2017). [DOI] [PubMed] [Google Scholar]
  • 4.Rovner BW & Casten RJ Emergency department visits in African Americans with mild cognitive impairment and diabetes. J. Diabetes Complications 35, 107905 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Stephens CE, Newcomer R, Blegen M, Miller B & Harrington C The effects of cognitive impairment on nursing home residents’ emergency department visits and hospitalizations. Alzheimer’s Dement 10, 835–843 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Perry W et al. Population health solutions for assessing cognitive impairment in geriatric patients. Innov. aging 2, igy025 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Boise L, Neal MB & Kaye J Dementia assessment in primary care: results from a study in three managed care systems. Journals Gerontol. Ser. A Biol. Sci. Med. Sci 59, M621–M626 (2004). [DOI] [PubMed] [Google Scholar]
  • 8.Tóth L et al. A speech recognition-based solution for the automatic detection of mild cognitive impairment from spontaneous speech. Curr. Alzheimer Res 15, 130–138 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.National Institute on Aging. Assessing Cognitive Impairment in Older Patients https://www.nia.nih.gov/health/assessing-cognitive-impairment-older-patients.
  • 10.Lion KM et al. Do people with dementia and mild cognitive impairments experience stigma? A cross-cultural investigation between Italy, Poland and the UK. Aging Ment. Health 24, 947–955 (2020). [DOI] [PubMed] [Google Scholar]
  • 11.Van De Pol LA et al. Magnetic resonance imaging predictors of cognition in mild cognitive impairment. Arch. Neurol 64, 1023–1028 (2007). [DOI] [PubMed] [Google Scholar]
  • 12.Zetterberg H & Blennow K Blood biomarkers: Democratizing alzheimer’s diagnostics. Neuron 106, 881–883 (2020). [DOI] [PubMed] [Google Scholar]
  • 13.Judge D, Roberts J, Khandker R, Ambegaonkar B & Black CM Physician perceptions about the barriers to prompt diagnosis of mild cognitive impairment and Alzheimer’s disease. Int. J. Alzheimer’s Dis 2019, (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Nichols LO et al. Impact of the REACH II and REACH VA dementia caregiver interventions on healthcare costs. J. Am. Geriatr. Soc 65, 931–936 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.National Institute on Aging. The National Institute on Aging: Strategic Directions for Research, 2020-2025. https://www.nia.nih.gov/about/aging-strategic-directions-research/goal-health-interventions#c2 (2021).
  • 16.Johnson M & Lin F Communication difficulty and relevant interventions in mild cognitive impairment: implications for neuroplasticity. Top. Geriatr. Rehabil 30, 18 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Martínez-Nicolás I, Llorente TE, Martínez-Sánchez F & Meilán JJG Ten Years of Research on Automatic Voice and Speech Analysis of People With Alzheimer’s Disease and Mild Cognitive Impairment: A Systematic Review Article. Front. Psychol 12, 645 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Tóth L et al. Automatic detection of mild cognitive impairment from spontaneous speech using ASR. in Sixteenth Annual Conference of the International Speech Communication Association (2015). [Google Scholar]
  • 19.Mirzaei S et al. Two-stage feature selection of voice parameters for early Alzheimer’s disease prediction. IRBM 39, 430–435 (2018). [Google Scholar]
  • 20.Han K-H et al. Impairment of vocal expression of negative emotions in patients with Alzheimer’s disease. Front. Aging Neurosci 6, 101 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Cadieux NL & Greve KW Emotion processing in Alzheimer’s disease. J. Int. Neuropsychol. Soc 3, 411–419 (1997). [PubMed] [Google Scholar]
  • 22.Spazzapan EA, Marino V. C. de C., Cardoso VM, Berti LC & Fabron EMG Acoustic characteristics of voice in different cycles of life: an integrative literature review. Rev. CEFAC 21, (2019). [Google Scholar]
  • 23.Pennebaker JW, Boyd RL, Jordan K & Blackburn K The development and psychometric properties of LIWC2015 (2015).
  • 24.Balagopalan A, Eyre B, Robin J, Rudzicz F & Novikova J Comparing pre-trained and feature-based models for prediction of Alzheimer’s disease based on speech. Front. Aging Neurosci 13, 635945 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Shah Z et al. Learning language and acoustic models for identifying alzheimer’s dementia from speech. Front. Comput. Sci 4 (2021). [Google Scholar]
  • 26.Valstar M et al. Avec 2013: the continuous audio/visual emotion and depression recognition challenge. in Proceedings of the 3rd ACM international workshop on Audio/visual emotion challenge 3–10 (2013). [Google Scholar]
  • 27.Eyben F, Wöllmer M & Schuller B Opensmile: the munich versatile and fast open-source audio feature extractor. in Proceedings of the 18th ACM international conference on Multimedia 1459–1462 (2010). [Google Scholar]
  • 28.Eyben F, Weninger F, Gross F & Schuller B Recent developments in opensmile, the munich open-source multimedia feature extractor. in Proceedings of the 21st ACM international conference on Multimedia 835–838 (2013). [Google Scholar]
  • 29.Martinc M & Pollak S Tackling the ADReSS Challenge: A Multimodal Approach to the Automated Recognition of Alzheimer’s Dementia. in INTERSPEECH 2157–2161 (2020). [Google Scholar]
  • 30.Eyben F et al. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Trans. Affect. Comput 7, 190–202 (2015). [Google Scholar]
  • 31.Chen J, Ye J, Tang F & Zhou J Automatic detection of Alzheimer’s disease using spontaneous speech only. in Interspeech vol. 2021 3830 (NIH Public Access, 2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Tian Y, Nieradzik T, Jalali S & Shiu D How does BERT process disfluency? in Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue 208–217 (2021). [Google Scholar]
  • 33.Rohanian M, Hough J & Purver M Multi-modal fusion with gating using audio, lexical and disfluency features for Alzheimer’s dementia recognition from spontaneous speech. arXiv Prepr. arXiv2106.09668 (2021). [Google Scholar]
  • 34.Degottex G, Kane J, Drugman T, Raitio T & Scherer S COVAREP—A collaborative voice analysis repository for speech technologies. in 2014 ieee international conference on acoustics, speech and signal processing (icassp) 960–964 (IEEE, 2014). [Google Scholar]
  • 35.Pennington J, Socher R & Manning CD Glove: Global vectors for word representation. in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) 1532–1543 (2014). [Google Scholar]
  • 36.Hough J & Schlangen D Recurrent neural networks for incremental disfluency detection (2015).
  • 37.Pappagari R, Cho J, Moro-Velazquez L & Dehak N Using State of the Art Speaker Recognition and Natural Language Processing Technologies to Detect Alzheimer’s Disease and Assess its Severity. in INTERSPEECH 2177–2181 (2020). [Google Scholar]
  • 38.Snyder D, Garcia-Romero D, Sell G, Povey D & Khudanpur S X-vectors: Robust dnn embeddings for speaker recognition. in 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP) 5329–5333 (IEEE, 2018). [Google Scholar]
  • 39.Pompili A, Rolland T & Abad A The INESC-ID multi-modal system for the ADReSS 2020 challenge. arXiv Prepr. arXiv2005.14646 (2020). [Google Scholar]
  • 40.Saon G, Soltau H, Nahamoo D & Picheny M Speaker adaptation of neural network acoustic models using i-vectors. in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding 55–59 (IEEE, 2013). [Google Scholar]
  • 41.Nagrani A, Chung JS & Zisserman A Voxceleb: a large-scale speaker identification dataset. arXiv Prepr. arXiv1706.08612 (2017). [Google Scholar]
  • 42.Zhu Y, Liang X, Batsis JA & Roth RM Exploring deep transfer learning techniques for alzheimer’s dementia detection. Front. Comput. Sci 3, (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Howard AG et al. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv Prepr. arXiv1704.04861 (2017). [Google Scholar]
  • 44.Naranjo-Alcazar J et al. An Open-set Recognition and Few-Shot Learning Dataset for Audio Event Classification in Domestic Environments. arXiv Prepr. arXiv2002.11561 (2020). [Google Scholar]
  • 45.Chuang Y-S, Liu C-L, Lee H-Y & Lee L Speechbert: An audio-and-text jointly learned language model for end-to-end spoken question answering. arXiv Prepr. arXiv1910.11559 (2019). [Google Scholar]
  • 46.Deng J et al. Imagenet: A large-scale hierarchical image database. in 2009 IEEE conference on computer vision and pattern recognition 248–255 (Ieee, 2009). [Google Scholar]
  • 47.Gemmeke JF et al. Audio set: An ontology and human-labeled dataset for audio events. in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 776–780 (IEEE, 2017). [Google Scholar]
  • 48.Pratap V, Xu Q, Sriram A, Synnaeve G & Collobert R Mls: A large-scale multilingual dataset for speech research. arXiv Prepr. arXiv2012.03411 (2020). [Google Scholar]
  • 49.Beltagy I, Peters ME & Cohan A Longformer: The long-document transformer. arXiv Prepr. arXiv2004.05150 (2020). [Google Scholar]
  • 50.Koo J, Lee JH, Pyo J, Jo Y & Lee K Exploiting multi-modal features from pre-trained networks for Alzheimer’s dementia recognition. arXiv Prepr. arXiv2009.04070 (2020). [Google Scholar]
  • 51.Hershey S et al. CNN architectures for large-scale audio classification. in 2017 ieee international conference on acoustics, speech and signal processing (icassp) 131–135 (IEEE, 2017). [Google Scholar]
  • 52.Yang Z et al. Xlnet: Generalized autoregressive pretraining for language understanding. Adv. Neural Inf. Process. Syst 32, (2019). [Google Scholar]
  • 53.Syed MSS, Syed ZS, Lech M & Pirogova E Automated Screening for Alzheimer’s Dementia Through Spontaneous Speech. in Interspeech vol. 2020 2222–2226 (2020). [Google Scholar]
  • 54.Schuller B et al. The INTERSPEECH 2010 paralinguistic challenge. in Proc. INTERSPEECH 2010, Makuhari, Japan 2794–2797 (2010). [Google Scholar]
  • 55.Balagopalan A, Eyre B, Rudzicz F & Novikova J To BERT or not to BERT: comparing speech and language-based approaches for Alzheimer’s disease detection. arXiv Prepr. arXiv2008.01551 (2020). [Google Scholar]
  • 56.Kong W Exploring neural models for predicting dementia from language (2019).
  • 57.Bertini F, Allevi D, Lutero G, Calzà L & Montesi D An automatic Alzheimer’s disease classifier based on spontaneous spoken English. Comput. Speech Lang 72, 101298 (2022). [Google Scholar]
  • 58.Park DS et al. Specaugment: A simple data augmentation method for automatic speech recognition. arXiv Prepr. arXiv1904.08779 (2019). [Google Scholar]
  • 59.Roshanzamir A, Aghajan H & Soleymani Baghshah M Transformer-based deep neural network language models for Alzheimer’s disease risk assessment from targeted speech. BMC Med. Inform. Decis. Mak 21, 1–14 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Becker JT, Boiler F, Lopez OL, Saxton J & McGonigle KL The natural history of Alzheimer’s disease: description of study cohort and accuracy of diagnosis. Arch. Neurol 51, 585–594 (1994). [DOI] [PubMed] [Google Scholar]
  • 61.Cummings L Describing the cookie theft picture: sources of breakdown in Alzheimer’s dementia. Pragmat. Soc 10, 153–176 (2019). [Google Scholar]
  • 62.Slegers A, Filiou R-P, Montembeault M & Brambati SM Connected speech features from picture description in Alzheimer’s disease: a systematic review. J. Alzheimer’s Dis 65, 519–542 (2018). [DOI] [PubMed] [Google Scholar]
  • 63.RX 8-Great Audio Starts With RX. https://www.izotope.com/en/products/rx.html.
  • 64.Yang Q, Wu P & Duan Z Large-scale analysis of lyrics and melodies in Cantonese pop songs.
  • 65.KOÇER B A Technical Review of White Noise in A Spotify Sample. Porte Akad. Müzik ve Dans Araştırmaları Derg 7–18. [Google Scholar]
  • 66.Meilán JJG, Martínez-Sánchez F, Martínez-Nicolás I, Llorente TE & Carro J Changes in the rhythm of speech difference between people with nondegenerative mild cognitive impairment and with preclinical dementia. Behav. Neurol 2020, (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67.Duffy JR & Josephs KA The diagnosis and understanding of apraxia of speech: Why including neurodegenerative etiologies may be important. J. Speech, Lang. Hear. Res (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Ward M, Cecato JF, Aprahamian I & Martinelli JE Assessment for apraxia in Mild Cognitive Impairment and Alzheimer’s disease. Dement. Neuropsychol 9, 71–75 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.Bucks RS, Singh S, Cuerden JM & Wilcock GK Analysis of spontaneous, conversational speech in dementia of Alzheimer type: Evaluation of an objective technique for analysing lexical performance. Aphasiology 14, 71–91 (2000). [Google Scholar]
  • 70.Themistocleous C, Eckerström M & Kokkinakis D Voice quality and speech fluency distinguish individuals with Mild Cognitive Impairment from Healthy Controls. PLoS One 15, e0236009 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71.Huet K, Delvaux V, Piccaluga M, Roland V & Harmegnies B Inter-Syllabic Interval as an indicator of fluency in Parkinsonian French speech. in 11th International Seminar on Speech Production, Tianjin, China (2017). [Google Scholar]
  • 72.Yeldener S Method of determining the voicing probability of speech signals. Acoust. Soc. Am. J 111, 25 (2002). [Google Scholar]
  • 73.Boersma P Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. in Proceedings of the institute of phonetic sciences vol. 17 97–110 (Amsterdam, 1993). [Google Scholar]
  • 74.Viegas F et al. Comparison of fundamental frequency and formants frequency measurements in two speech tasks. Rev. CEFAC 21, (2019). [Google Scholar]
  • 75.Wright R & Nichols D Measuring Vowel Formants. Retrieved January 20, 2017 (2015). [Google Scholar]
  • 76.Khodabakhsh A, Yesil F, Guner E & Demiroglu C Evaluation of linguistic and prosodic features for detection of Alzheimer’s disease in Turkish conversational speech. EURASIP J. Audio, Speech, Music Process 2015, 9 (2015). [Google Scholar]
  • 77.Meilan JJG, Martinez-Sanchez F, Carro J, Carcavilla N & Ivanova O Voice markers of lexical access in mild cognitive impairment and Alzheimer’s disease. Curr. Alzheimer Res 15, 111–119 (2018). [DOI] [PubMed] [Google Scholar]
  • 78.Tomas B & Zelenika D Determination of Spectral Parameters of Speech Signal by Goertzel Algorithm. Speech Technol (2011). [Google Scholar]
  • 79.On CK, Pandiyan PM, Yaacob S & Saudi A Mel-frequency cepstral coefficient analysis in speech recognition. in 2006 International Conference on Computing & Informatics 1–5 (IEEE, 2006). [Google Scholar]
  • 80.Meghanani A, Anoop CS & Ramakrishnan AG An exploration of log-mel spectrogram and MFCC features for Alzheimer’s dementia recognition from spontaneous speech. in 2021 IEEE Spoken Language Technology Workshop (SLT) 670–677 (IEEE, 2021). [Google Scholar]
  • 81.Dessouky MM, Elrashidy MA, Taha TE & Abdelkader HM Computer-aided diagnosis system for Alzheimer’s disease using different discrete transform techniques. Am. J. Alzheimer’s Dis. Other Dementias® 31, 282–293 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82.Kong Y-Y, Mullangi A, Marozeau J & Epstein M Temporal and spectral cues for musical timbre perception in electric hearing (2011). [DOI] [PMC free article] [PubMed]
  • 83.Tjaden K, Sussman JE, Liu G & Wilding G Long-term average spectral (LTAS) measures of dysarthria and their relationship to perceived severity. J. Med. Speech. Lang. Pathol 18, 125 (2010). [PMC free article] [PubMed] [Google Scholar]
  • 84.Martínez-Nicolás I, Llorente TE, Ivanova O, Martínez-Sánchez F & Meilán JJG Many Changes in Speech through Aging Are Actually a Consequence of Cognitive Changes. Int. J. Environ. Res. Public Health 19, 2137 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85.Farrús M, Hernando J & Ejarque P Jitter and shimmer measurements for speaker recognition. in 8th Annual Conference of the International Speech Communication Association; 2007 Aug. 27–31; Antwerp (Belgium).[place unknown]: ISCA; 2007 p. 778–81. (International Speech Communication Association (ISCA), 2007). [Google Scholar]
  • 86.Ivanova O et al. Discriminating speech traits of Alzheimer’s disease assessed through a corpus of reading task for Spanish language. Comput. Speech Lang 73, 101341 (2022). [Google Scholar]
  • 87.Simonyan K et al. Focal white matter changes in spasmodic dysphonia: a combined diffusion tensor imaging and neuropathological study. Brain 131, 447–459 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 88.David M The new voice pedagogy (Scarecrow Press, 2008). [Google Scholar]
  • 89.Maryn Y, De Bodt M & Roy N The Acoustic Voice Quality Index: toward improved treatment outcomes assessment in voice disorders. J. Commun. Disord 43, 161–174 (2010). [DOI] [PubMed] [Google Scholar]
  • 90.Abercrombie D Elements of General Phonetics, Edinburgh Univer (1966).
  • 91.Ivanova O, García Meilán JJ, Martínez-Sánchez F & Carro Ramos J Speech disorders in Alzheimer’s disease: preclinical markers of dementia? Psychol. Appl. Trends, Pr. C 464–468 (2018). [Google Scholar]
  • 92.Roark B, Mitchell M, Hosom J-P, Hollingshead K & Kaye J Spoken language derived measures for detecting mild cognitive impairment. IEEE Trans. Audio. Speech. Lang. Processing 19, 2081–2090 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 93.Kim BS, Kim YB & Kim H Discourse measures to differentiate between mild cognitive impairment and healthy aging. Front. Aging Neurosci 11, 221 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 94.Aramaki E, Shikata S, Miyabe M & Kinoshita A Vocabulary size in speech may be an early indicator of cognitive impairment. PLoS One 11, e0155195 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 95.Asgari M, Kaye J & Dodge H Predicting mild cognitive impairment from spontaneous spoken utterances. Alzheimer’s Dement. Transl. Res. Clin. Interv 3, 219–228 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 96.Sung JE, Choi S, Eom B, Yoo JK & Jeong JH Syntactic complexity as a linguistic marker to differentiate mild cognitive impairment from normal aging. J. Speech, Lang. Hear. Res 63, 1416–1429 (2020). [DOI] [PubMed] [Google Scholar]
  • 97.Mueller KD, Hermann B, Mecollari J & Turkstra LS Connected speech and language in mild cognitive impairment and Alzheimer’s disease: A review of picture description tasks. J. Clin. Exp. Neuropsychol 40, 917–939 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 98.Nicholas M, Obler LK, Albert ML & Helm-Estabrooks N Empty speech in Alzheimer’s disease and fluent aphasia. J. Speech, Lang. Hear. Res 28, 405–410 (1985). [DOI] [PubMed] [Google Scholar]
  • 99.Tomoeda CK, Bayles KA, Trosset MW, Azuma T & McGeagh A Cross-sectional analysis of Alzheimer disease effects on oral discourse in a picture description task. Alzheimer Dis. Assoc. Disord (1996). [DOI] [PubMed] [Google Scholar]
  • 100.Pistono A et al. What happens when nothing happens? An investigation of pauses as a compensatory mechanism in early Alzheimer’s disease. Neuropsychologia 124, 133–143 (2019). [DOI] [PubMed] [Google Scholar]
  • 101.Szatloczki G, Hoffmann I, Vincze V, Kalman J & Pakaski M Speaking in Alzheimer’s disease, is that an early sign? Importance of changes in language abilities in Alzheimer’s disease. Front. Aging Neurosci 7, 195 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 102.Lofgren M & Hinzen W Breaking the flow of thought: Increase of empty pauses in the connected speech of people with mild and moderate Alzheimer’s disease. J. Commun. Disord 97, 106214 (2022). [DOI] [PubMed] [Google Scholar]
  • 103.Paganelli F, Vigliocco G, Vinson D, Siri S & Cappa S An investigation of semantic errors in unimpaired and Alzheimer’s speakers of Italian. Cortex 39, 419–439 (2003). [DOI] [PubMed] [Google Scholar]
  • 104.Fraser KC, Meltzer JA & Rudzicz F Linguistic Features Identify Alzheimer’s Disease in Narrative Speech. J. Alzheimer’s Dis 49, 407–422 (2016). [DOI] [PubMed] [Google Scholar]
  • 105.Meteyard L, Quain E & Patterson K Ever decreasing circles: Speech production in semantic dementia. Cortex 55, 17–29 (2014). [DOI] [PubMed] [Google Scholar]
  • 106.Fergadiotis G, Wright HH & Green SB Psychometric evaluation of lexical diversity indices: Assessing length effects. J. Speech, Lang. Hear. Res 58, 840–852 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 107.Sanborn V, Ostrand R, Ciesla J & Gunstad J Automated assessment of speech production and prediction of MCI in older adults. Appl. Neuropsychol. Adult 1–8 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 108.Ntracha A et al. Detection of Mild Cognitive Impairment Through Natural Language and Touchscreen Typing Processing. Front. Digit. Health (Irvine. Calif) 2, 567158 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 109.Fergadiotis G, Wright HH & West TM Measuring lexical diversity in narrative discourse of people with aphasia (2013). [DOI] [PMC free article] [PubMed]
  • 110.Kapantzoglou M, Fergadiotis G & Auza Buenavides A Psychometric evaluation of lexical diversity indices in spanish narrative samples from children with and without developmental language disorder. J. Speech, Lang. Hear. Res 62, 70–83 (2019). [DOI] [PubMed] [Google Scholar]
  • 111.Calzà L, Gagliardi G, Favretti RR & Tamburini F Linguistic features and automatic classifiers for identifying mild cognitive impairment and dementia. Comput. Speech Lang 65, 101113 (2021). [Google Scholar]
  • 112.Rocholl JC et al. Disfluency detection with unlabeled data and small bert models. arXiv Prepr. arXiv2104.10769 (2021). [Google Scholar]
  • 113.Sanh V, Debut L, Chaumond J & Wolf T DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv Prepr. arXiv1910.01108 (2019). [Google Scholar]
  • 114.DistilBERT. https://huggingface.co/docs/transformers/model_doc/distilbert.
  • 115.Toffle ME & Quattropani MC The Self in the Alzheimer’s patient as revealed through psycholinguistic-story based analysis. Procedia-Social Behav. Sci 205, 361–372 (2015). [Google Scholar]
  • 116.Yu Y, Lu S, Wu Y, Wu Q & Wu J Dementia and Language Bilingualism Helps Ward Off Alzheimer’s Disease. Improv. Qual. Life Dement. Patients through Progress. Detect. Treat. Care 107–122 (2017). [Google Scholar]
  • 117.Kamiloğlu RG, Fischer AH & Sauter DA Good vibrations: A review of vocal expressions of positive emotions. Psychon. Bull. Rev 27, 237–265 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 118.Olowolayemo A et al. Conversational Analysis Agents for Depression Detection: A Systematic Review. J. Integr. Adv. Eng 3, 47–64 (2023). [Google Scholar]
  • 119.Burkhardt F, Paeschke A, Rolfes M, Sendlmeier WF & Weiss B A database of German emotional speech. in Interspeech vol. 5 1517–1520 (2005). [Google Scholar]
  • 120.Atmaja BT & Akagi M On the differences between song and speech emotion recognition: Effect of feature sets, feature types, and classifiers. in 2020 IEEE REGION 10 CONFERENCE (TENCON) 968–972 (IEEE, 2020). [Google Scholar]
  • 121.Latif S, Rana R, Khalifa S, Jurdak R & Epps J Direct modelling of speech emotion from raw speech. arXiv Prepr. arXiv1904.03833 (2019). [Google Scholar]
  • 122.Bahgat M, Wilson S & Magdy W LIWC-UD: Classifying Online Slang Terms into LIWC Categories. in 14th ACM Web Science Conference 2022 422–432 (2022). [Google Scholar]
  • 123.Belz FF, Adair KC, Proulx J, Frankel AS & Sexton JB The language of healthcare worker emotional exhaustion: A linguistic analysis of longitudinal survey. Front. Psychiatry 13, 2871 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 124.O’Dea B et al. The relationship between linguistic expression in blog content and symptoms of depression, anxiety, and suicidal thoughts: A longitudinal study. PLoS One 16, e0251787 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 125.Burkhardt HA et al. Behavioral activation and depression symptomatology: longitudinal assessment of linguistic indicators in text-based therapy sessions. J. Med. Internet Res 23, e28244 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 126.Collins SE et al. Language-based measures of mindfulness: Initial validity and clinical utility. Psychol. Addict. Behav 23, 743 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 127.Glauser T et al. Identifying epilepsy psychiatric comorbidities with machine learning. Acta Neurol. Scand 141, 388–396 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 128.Eyben F & Schuller B openSMILE:) The Munich open-source large-scale multimedia feature extractor. ACM SIGMultimedia Rec 6, 4–13 (2015). [Google Scholar]
  • 129.Praat Vocal Toolkit. http://www.praatvocaltoolkit.com/.
  • 130.Cummins N et al. A comparison of acoustic and linguistics methodologies for Alzheimer’s dementia recognition. in Interspeech 2020 2182–2186 (ISCA-International Speech Communication Association, 2020). [Google Scholar]
  • 131.Zolnoori M et al. Audio Recording Patient-Nurse Verbal Communications in Home Health Care Settings: Pilot Feasibility and Usability Study. JMIR Hum. Factors 9, e35325 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 132.Zolnoori M, Schilling K & Jones J Patient-Centered Decision Support for Pediatric Asthma Screening: A Web-based Interface for Parents (2022).
  • 133.Schneider S, Baevski A, Collobert R & Auli M wav2vec: Unsupervised pre-training for speech recognition. arXiv Prepr. arXiv1904.05862 (2019). [Google Scholar]
  • 134.Geurts P, Ernst D & Wehenkel L Extremely randomized trees. Mach. Learn 63, 3–42 (2006). [Google Scholar]
  • 135.Freund Y & Schapire RE Experiments with a new boosting algorithm. in icml vol. 96 148–156 (Citeseer, 1996). [Google Scholar]
  • 136.Chen T & Guestrin C Xgboost: A scalable tree boosting system. in Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining 785–794 (2016). [Google Scholar]
  • 137.Ben-Hur A & Weston J A user’s guide to support vector machines. in Data mining techniques for the life sciences 223–239 (Springer, 2010). [DOI] [PubMed] [Google Scholar]
  • 138.Siami-Namini S, Tavakoli N & Namin AS The performance of LSTM and BiLSTM in forecasting time series. in 2019 IEEE International Conference on Big Data (Big Data) 3285–3292 (IEEE, 2019). [Google Scholar]
  • 139.Yamashita R, Nishio M, Do RKG & Togashi K Convolutional neural networks: an overview and application in radiology. Insights Imaging 9, 611–629 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 140.Galea M & Woodward M Mini-mental state examination (MMSE). Aust. J. Physiother 51, 198 (2005). [DOI] [PubMed] [Google Scholar]
  • 141.All, O. C. N. F. Montreal Cognitive Assessment. Stroke (2015). [Google Scholar]
  • 142.Buschke H et al. Screening for dementia with the memory impairment screen. Neurology 52, 231 (1999). [DOI] [PubMed] [Google Scholar]
  • 143.Sheehan B Assessment scales in dementia. Ther. Adv. Neurol. Disord 5, 349–358 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 144.National Institute on Aging. Assessing Cognitive Impairment in Older Patients.
  • 145.Rasmussen J & Langerman H Alzheimer’s disease–why we need early diagnosis. Degener. Neurol. Neuromuscul. Dis 9, 123 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 146.Fraser KC, Rudzicz F, Graham N & Rochon E Automatic speech recognition in the diagnosis of primary progressive aphasia. in Proceedings of the fourth workshop on speech and language processing for assistive technologies 47–54 (2013). [Google Scholar]
  • 147.Figueiredo S & Barfod V Boston Diagnostic Aphasia Examination (BDAE) (2012).
  • 148.Godfrey JJ, Holliman EC & McDaniel J SWITCHBOARD: Telephone speech corpus for research and development. in Acoustics, Speech, and Signal Processing, IEEE International Conference on vol. 1 517–520 (IEEE Computer Society, 1992). [Google Scholar]
  • 149.Devlin J, Chang M-W, Lee K & Toutanova K Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv Prepr. arXiv1810.04805 (2018). [Google Scholar]
  • 150.Luo Y Recurrent neural networks for classifying relations in clinical notes. J. Biomed. Inform 72, 85–95 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 151.Colón-Ruiz C & Segura-Bedmar I Protected Health Information Recognition byBiLSTM-CRF. in Proceedings of the Iberian Languages Evaluation Forum, IberLEF (2019). [Google Scholar]
  • 152.Adoma AF, Henry N-M & Chen W Comparative analyses of bert, roberta, distilbert, and xlnet for text-based emotion recognition. in 2020 17th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP) 117–121 (IEEE, 2020). [Google Scholar]
  • 153.Cortiz D Exploring transformers in emotion recognition: a comparison of bert, distillbert, roberta, xlnet and electra. arXiv Prepr. arXiv2104.02041 (2021). [Google Scholar]
  • 154.BERT. https://huggingface.co/docs/transformers/model_doc/bert.
  • 155.XLNet. https://huggingface.co/docs/transformers/model_doc/xlnet.
  • 156.Liu Y et al. Roberta: A robustly optimized bert pretraining approach. arXiv Prepr. arXiv1907.11692 (2019). [Google Scholar]
  • 157.distilroberta-base · Hugging Face. https://huggingface.co/distilroberta-base.
  • 158.Banse R & Scherer KR Acoustic profiles in vocal emotion expression. J. Pers. Soc. Psychol 70, 614 (1996). [DOI] [PubMed] [Google Scholar]
  • 159.Juslin PN & Laukka P Communication of emotions in vocal expression and music performance: Different channels, same code? Psychol. Bull 129, 770 (2003). [DOI] [PubMed] [Google Scholar]
  • 160.Laukka P & Elfenbein HA Emotion appraisal dimensions can be inferred from vocal expressions. Soc. Psychol. Personal. Sci 3, 529–536 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 161.Moore II E, Clements MA, Peifer JW & Weisser L Critical analysis of the impact of glottal features in the classification of clinical depression in speech. IEEE Trans. Biomed. Eng 55, 96–107 (2007). [DOI] [PubMed] [Google Scholar]
  • 162.Busso C, Lee S & Narayanan S Analysis of emotionally salient aspects of fundamental frequency for emotion detection. IEEE Trans. Audio. Speech. Lang. Processing 17, 582–596 (2009). [Google Scholar]
  • 163.Sundberg J, Patel S, Bjorkner E & Scherer KR Interdependencies among voice source parameters in emotional speech. IEEE Trans. Affect. Comput 2, 162–174 (2011). [Google Scholar]
  • 164.Yap TF Speech production under cognitive load: Effects and classification (2012).
  • 165.Steidl S Automatic classification of emotion related user states in spontaneous children’s speech (Logos-Verlag; Berlin, Germany, 2009). [Google Scholar]
  • 166.Scherer KR, Sundberg J, Tamarit L & Salomão GL Comparing the acoustic expression of emotion in the speaking and the singing voice. Comput. Speech Lang 29, 218–235 (2015). [Google Scholar]
  • 167.Boyd RL MEH: Meaning Extraction Helper (Version 1.0. 6)[Software] (2015).
  • 168.Sun Q & Pfahringer B Bagging ensemble selection. in Australasian Joint Conference on Artificial Intelligence 251–260 (Springer, 2011). [Google Scholar]
  • 169.Natekin A & Knoll A Gradient boosting machines, a tutorial. Front. Neurorobot 7, 21 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 170.Cokluk O Logistic Regression: Concept and Application. Educ. Sci. Theory Pract 10, 1397–1407 (2010). [Google Scholar]
  • 171.Murty MN & Raghava R Kernel-based SVM. in Support vector machines and perceptrons 57–67 (Springer, 2016). [Google Scholar]
  • 172.Wang K, Cheng L & Yong B Spectral-similarity-based kernel of SVM for hyperspectral image classification. Remote Sens 12, 2154 (2020). [Google Scholar]
