. Author manuscript; available in PMC: 2023 Apr 14.
Published in final edited form as: Proc IEEE Int Conf Acoust Speech Signal Process. 2022 Apr 27;2022:6462–6466. doi: 10.1109/icassp43922.2022.9747006

TOWARDS INTERPRETABILITY OF SPEECH PAUSE IN DEMENTIA DETECTION USING ADVERSARIAL LEARNING

Youxiang Zhu 1, Bang Tran 1, Xiaohui Liang 1, John A Batsis 2, Robert M Roth 3
PMCID: PMC10102974  NIHMSID: NIHMS1883778  PMID: 37064829

Abstract

Speech pause is an effective biomarker in dementia detection. Recent deep learning models have exploited speech pauses to achieve highly accurate dementia detection, but have not addressed the interpretability of speech pauses, i.e., how the positions and lengths of speech pauses affect the result of dementia detection. In this paper, we study the positions and lengths of dementia-sensitive pauses using adversarial learning approaches. Specifically, we first utilize an adversarial attack approach that adds perturbations to the speech pauses of the testing samples, aiming to reduce the confidence levels of the detection model. We then apply an adversarial training approach to evaluate the impact of perturbations in training samples on the detection model. We examine interpretability from the perspectives of model accuracy, pause context, and pause length. We found that some pauses are more sensitive to dementia than others from the model’s perspective, e.g., speech pauses near the verb “is”. Increasing the lengths of sensitive pauses or adding sensitive pauses leads the model inference toward Alzheimer’s Disease (AD), while decreasing the lengths of sensitive pauses or deleting sensitive pauses leads it toward non-AD.

Keywords: Speech pauses, interpretability, acoustic feature, dementia detection, spontaneous speech

1. INTRODUCTION

Exploiting spontaneous speech for dementia detection has shown promise in recent years. In the Alzheimer’s Dementia Recognition through Spontaneous Speech (ADReSS) challenge [1, 2], spontaneous speech datasets were collected using a Cookie Theft picture description task [3], and researchers have developed deep learning models that achieve promising classification accuracy in differentiating AD from non-AD patients [4, 5]. One common acoustic biomarker these successful models have exploited is the speech pause [4, 5, 6], which has been widely studied in medical research [7, 8]. However, these deep learning models have not analyzed the impact of the positions and lengths of speech pauses on the detection of dementia. Thus, it remains unknown to what extent these models understand speech pauses when making their decisions.

Recent deep transfer learning studies have significantly advanced the domains of image, language, and speech. Deep transfer learning models learn features automatically from large-scale datasets, and these auto-learned features can be more representative than conventional handcrafted features. In dementia detection, such models have produced high accuracy [9, 4, 5], but they often lack interpretability, limiting the trustworthiness of their medical applications. Attention and visualization mechanisms are common methods to enhance the interpretability of deep learning models [10, 11, 12, 13, 14, 15]. However, these methods do not apply to the speech-based dementia detection scenario because of the complex relationship between input data and labels. In the image domain, labels are directly related to the input data, e.g., a dog and a dog image. Humans can evaluate whether the interpretability results are consistent with human knowledge, e.g., that the model assigns large weights to the dog’s body area to recognize the dog. In contrast, dementia labels are derived from a standard and complex assessment that is not directly related to the speech data, so humans can hardly evaluate interpretability results on dementia-related speech. Therefore, we evaluated the interpretability of speech pauses, rather than of the words expressed, for dementia detection.

The WavBERT dementia detection model [16] is the first model that preserves pauses in the text without using manual transcription. WavBERT uses Wav2vec [17] to measure the lengths of speech pauses and represents them with punctuation marks. In this paper, we use adversarial learning and WavBERT to study speech pauses. Specifically, we first propose an adversarial attack approach in which we generate adversarial samples from original testing samples by adding, replacing, and deleting punctuation marks. We test the WavBERT model with the adversarial samples and identify the perturbation action that leads to the most significant change in the model’s confidence. In an adversarial training approach, in addition to training with the original samples, we train the WavBERT model with two groups of adversarial samples: one includes samples positively influencing the confidence, and the other includes samples negatively influencing it. We then investigate how adversarial training affects the model on the testing samples. The contributions of this paper are three-fold.

First, we exploit adversarial learning to study the interpretability of speech pauses in dementia detection. We propose methods to evaluate the impact of speech pauses from the perspectives of model accuracy, pause context, and pause length.

Second, we found that the speech pause between “is” and another verb is dementia-sensitive, and confirmed this observation with the training set. We also found that adding pauses at sensitive places or increasing the lengths of sensitive pauses leads the model inference toward AD, while deleting sensitive pauses or decreasing their lengths leads the model inference toward non-AD.

Third, we found that misspelling errors from Wav2vec introduce uncommon tokens, leading to misinterpretation of speech pauses. Such a problem is inevitable when using a speech recognition model together with a language model, and might be resolved with large-scale speech representations.

2. WAVBERT - PAUSE PRESERVATION

WavBERT [16] utilizes speech pauses and transcripts for dementia detection. It transcribes the speech audio to a transcript using Wav2vec; this transcript does not have punctuation marks. WavBERT uses the number of blank tokens in the intermediate results of Wav2vec to define two levels of pauses: a longer sentence-level pause and a shorter in-sentence pause. A period is inserted into the transcript if a sentence-level pause is detected; a comma is inserted if an in-sentence pause is detected. After adding the periods and commas to the transcripts, WavBERT feeds the transcripts to BERT for AD classification. More details can be found in Figure 1 and the paper [16]. In this paper, we add one more level of in-sentence pause and represent it with a semicolon; the length of such a pause is longer than that of a comma and shorter than that of a period.
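As a rough illustration of this pause-to-punctuation mapping, the sketch below converts Wav2vec blank-token counts into the three pause levels. The threshold values and function names are illustrative assumptions, not the actual WavBERT thresholds.

```python
# Hypothetical sketch of WavBERT-style pause-to-punctuation mapping.
# The thresholds below are assumptions for illustration only.
def pause_to_mark(blank_tokens: int,
                  comma_min: int = 5,
                  semicolon_min: int = 15,
                  period_min: int = 30) -> str:
    """Map a pause length (count of Wav2vec blank tokens) to a punctuation mark."""
    if blank_tokens >= period_min:
        return "."   # sentence-level pause
    if blank_tokens >= semicolon_min:
        return ";"   # longer in-sentence pause (level 3 only)
    if blank_tokens >= comma_min:
        return ","   # shorter in-sentence pause
    return ""        # pause too short to mark

def insert_pause_marks(words, blank_counts):
    """Interleave words with punctuation derived from the pause after each word."""
    out = []
    for word, blanks in zip(words, blank_counts):
        out.append(word)
        mark = pause_to_mark(blanks)
        if mark:
            out.append(mark)
    return " ".join(out)

print(insert_pause_marks(["the", "mother", "is", "washing"], [0, 6, 35, 0]))
# → the mother , is . washing
```

The resulting punctuated transcript is what would then be fed to BERT for classification.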

Fig. 1: The WavBERT model (a), the adversarial attack approach (b), and the adversarial training approach (c).

3. PROPOSED ADVERSARIAL LEARNING

We present two adversarial learning approaches to examine the impact of speech pauses on dementia detection.

Adversarial attack.

An adversarial attack adds perturbations to the punctuation marks. Note that other adversarial attacks on BERT consider the entire vocabulary as the attack space [18, 19]; in our case, the attack space includes only punctuation marks, and the perturbation actions are adding, deleting, and replacing. As shown in Figure 1(b), we adopt the same training phase as the original WavBERT. To launch the attack, we choose an original testing sample s0 and generate a set of adversarial samples, each by a single action of adding, deleting, or replacing. For adding, we add one punctuation mark between two neighboring words where no punctuation mark exists; for deleting, we delete an existing punctuation mark; and for replacing, we replace an existing punctuation mark with a different one, e.g., replacing a comma with a period. We define a confidence level as l_AD = o_AD(s) for an AD sample s, and l_nonAD = o_nonAD(s) for a non-AD sample s, where o_AD(s) and o_nonAD(s) are the logits output by the model for the AD and non-AD labels. Among the samples generated from s0, we identify the sample s1 with the lowest confidence level as the most effective sample. We call s1 the step-1 sample, and then iteratively launch the attack on s1 to obtain the step-2 sample s2. The attack is launched for a maximum of 20 steps, stopping early if the confidence level cannot be lowered. The details are shown in Algorithm 1.
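The single-action perturbation space can be sketched as follows, treating punctuation marks as separate tokens. This is a simplified illustration of the add/delete/replace actions described above, not the authors' implementation; skipping insertions next to an existing mark (to avoid consecutive marks) is our assumption.

```python
# Illustrative sketch of the punctuation-only attack space (not the paper's code).
MARKS = [",", ";", "."]   # level-3 setting: comma, semicolon, period

def candidates(tokens):
    """Generate all single-action adversarial variants of a tokenized transcript
    (punctuation marks appear as separate tokens)."""
    outs = []
    # add: insert a mark between two neighboring words where no mark exists
    for i in range(len(tokens) + 1):
        prev_is_mark = i > 0 and tokens[i - 1] in MARKS
        next_is_mark = i < len(tokens) and tokens[i] in MARKS
        if not prev_is_mark and not next_is_mark:
            for m in MARKS:
                outs.append(tokens[:i] + [m] + tokens[i:])
    for i, t in enumerate(tokens):
        if t in MARKS:
            # delete an existing mark
            outs.append(tokens[:i] + tokens[i + 1:])
            # replace it with a different mark
            for m in MARKS:
                if m != t:
                    outs.append(tokens[:i] + [m] + tokens[i + 1:])
    return outs
```

For a transcript like `["a", ",", "b"]` this yields six additions (three marks at each of the two word boundaries), one deletion, and two replacements.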

Algorithm 1 Adversarial attack s_i → s_(i+1)
Input: a step-i sample s_i and a model M
Output: a step-(i+1) sample s_(i+1)
1: s_(i+1) = NULL, confidence level l_i = M(s_i)
2: Generate a set of adversarial samples from s_i using one perturbation action: C_(i+1) = {s_(i+1,1), s_(i+1,2), ..., s_(i+1,n)}
3: for each sample s_(i+1,j) in C_(i+1) do
4:   l_(i+1,j) = M(s_(i+1,j))
5:   if l_(i+1,j) < l_i then
6:     l_i = l_(i+1,j), s_(i+1) ← s_(i+1,j)
7: Return s_(i+1)
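A minimal sketch of the iterated greedy attack (Algorithm 1 applied for up to 20 steps). `model` is assumed to map a sample to a scalar confidence level, and `gen_candidates` to enumerate single-action perturbations; both names are illustrative.

```python
def adversarial_attack(sample, model, gen_candidates, max_steps=20):
    """Greedy multi-step attack: at each step keep the candidate with the
    lowest confidence; stop early when no candidate lowers it."""
    current, conf = sample, model(sample)
    for _ in range(max_steps):
        best, best_conf = None, conf
        for cand in gen_candidates(current):
            c = model(cand)
            if c < best_conf:
                best, best_conf = cand, c
        if best is None:           # confidence cannot be lowered further
            break
        current, conf = best, best_conf
    return current, conf
```

With a toy "model" such as `abs` and candidates `lambda x: [x - 1, x + 1]`, the attack greedily walks an integer sample down to 0, mirroring how the real attack walks the transcript toward the lowest model confidence.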

Adversarial training.

In the second adversarial learning approach, we generate adversarial samples and apply adversarial training to WavBERT, as shown in Figure 1(c). Specifically, we add perturbations to the original training samples of WavBERT. The perturbation actions are the same as in the adversarial attack approach. In each attack step on a sample, we choose the adversarial sample with the lowest confidence level; the attack is iteratively launched for a maximum of 20 steps, stopping early if the confidence level cannot be lowered. Finally, the original training samples and the perturbed samples together are used to train WavBERT. The updated model is expected to be less effective because the adversarial samples have a negative impact in the training phase. We further propose a reversed adversarial training approach, in which we generate from each original sample the adversarial sample with the highest confidence level. In this way, we envision that reversed adversarial training makes the model more effective.
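The training-set construction for both variants might look like the following sketch, where `reversed_=True` corresponds to the reversed adversarial training; the function and parameter names are our assumptions, and a single perturbation step stands in for the up-to-20-step procedure.

```python
# Illustrative sketch of augmenting the training set with adversarial samples.
def build_training_set(originals, model, gen_candidates, reversed_=False):
    """Return (sample, label) pairs: each original plus one perturbed version.
    Adversarial training picks the lowest-confidence candidate; reversed
    adversarial training picks the highest-confidence one."""
    pick = max if reversed_ else min
    augmented = []
    for sample, label in originals:
        augmented.append((sample, label))
        cands = gen_candidates(sample)
        if cands:
            adv = pick(cands, key=model)   # label is kept unchanged
            augmented.append((adv, label))
    return augmented
```

The augmented list would then be fed to the usual WavBERT training loop in place of the original training set.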

Interpretability.

We study the interpretability of speech pauses from the following three perspectives.

Model accuracy.

In the adversarial attack approach, we evaluate how much the perturbation affects the model accuracy. In the adversarial training approach, changes in model accuracy reveal how well the model understands the speech pauses in the original samples.

Pause context.

In each step of the adversarial attack, we choose the perturbation that causes the most negative impact on the confidence level. We then analyze the statistics of the words neighboring all the chosen punctuation marks and identify dementia-sensitive pauses based on their context.
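Such context statistics could be computed with a simple counter over the chosen marks, as in this sketch; the `r`-word window corresponds to the ranges reported in Table 1, and writing the mark itself as `#` follows the table's notation.

```python
from collections import Counter

def pause_contexts(tokens_list, marks=(",", ";", "."), r=1):
    """Count the r-word contexts around every pause mark across transcripts,
    rendering the mark itself as '#' (as in Table 1)."""
    counts = Counter()
    for tokens in tokens_list:
        for i, t in enumerate(tokens):
            if t in marks:
                left = tokens[max(0, i - r):i]       # up to r words before the pause
                right = tokens[i + 1:i + 1 + r]      # up to r words after the pause
                counts[" ".join(left + ["#"] + right)] += 1
    return counts
```

Calling `pause_contexts(...).most_common(5)` would then reproduce the kind of top-five listing shown in Table 1.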

Pause length.

We divide the perturbation actions into two groups. Group 1 involves adding a pause or replacing a short pause with a long one; Group 2 involves deleting a pause or replacing a long pause with a short one. Group 1 shifts the result toward AD, and Group 2 shifts the result toward non-AD.
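The two groups can be encoded with a simple ordering of the three marks by pause length, as in this illustrative sketch (the numeric lengths are only a relative ordering, not measured durations):

```python
# Relative pause lengths implied by the punctuation levels: comma < semicolon < period.
LENGTH = {",": 1, ";": 2, ".": 3}

def action_group(action, old=None, new=None):
    """Classify a perturbation into Group 1 (lengthens a pause, shifts toward AD)
    or Group 2 (shortens a pause, shifts toward non-AD)."""
    if action == "add":
        return 1
    if action == "delete":
        return 2
    # replace: compare the relative lengths of the old and new marks
    return 1 if LENGTH[new] > LENGTH[old] else 2
```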

4. EVALUATION

In this section, we present the implementation details and report our evaluation results on speech pauses.

4.1. Implementation details

We followed the original implementation of WavBERT [16] using PyTorch. The training and the adversarial training used the standard training and testing sets of ADReSS speech only (ADReSSo). Training was conducted for 10 rounds, and the average result over the 10 rounds is reported. Compared to the original setting, the learning rate was increased to 10^-5 and the maximum number of epochs was reduced to 100 for faster convergence; the model accuracy remained similar [16].

4.2. Model accuracy of adversarial attack/training

Figure 2 shows the impact of the adversarial attack and adversarial training on accuracy. The original WavBERT used a majority vote over 10 rounds and achieved an accuracy of 83-84%; the average accuracy over the 10 rounds was 82-83%. We report the average accuracy because it reflects the performance of every individual round. As shown in Figure 2(a), with the step-1 attack, the accuracy drops to 70-72%, showing that the WavBERT model can be easily impacted by a small perturbation. However, humans often add pauses in their speech for many reasons unrelated to cognitive problems. In addition, we observed that when a multi-step attack adds more perturbations to pauses, the accuracy is significantly lowered, confirming that pauses are an important factor for model inference. Lastly, the level-3 model resulted in the lowest accuracy when the step-20 attack was launched. We consider that the level-3 model relies more on pauses to make its inference and is thus more vulnerable to adversarial attacks on pauses. Figures 2(b) and (c) show that the accuracy of models using adversarial training and reversed adversarial training was lowered, but not significantly, revealing WavBERT’s limited understanding of pauses.

Fig. 2: Impact of adversarial attack/training. Level 1: period. Level 2: comma + period. Level 3: comma + semicolon + period.

4.3. Context analysis of pauses

Table 1 shows the context analysis results of the perturbation. We list the top five contexts by frequency for ranges r ∈ {1, 2} and attack steps s ∈ {1, 5, 20}. The adversarial samples listed under the step-k columns include all adversarial samples generated from step 1 to step k. We make three observations. i) The pause between “is” and a verb is highly dementia-sensitive. We confirmed that a long pause between “is” and a verb appears only in the AD samples of the training set, not in the non-AD samples. ii) The interviewer’s pauses matter. Since this type of pause does not appear in the step-1 attack, it is not among the most important pauses for model inference, but we consider that it still affects the accuracy to some extent; Pérez-Toro et al. have reported a similar result [20]. iii) ASR misspelling errors are important, such as “othing tote # ofor” and “gon as # aa itre”. We consider that the misspelling errors introduce uncommon tokens, which are treated as important context in the downstream task because the pretrained models have limited knowledge of these tokens.

Table 1:

Context analysis on the frequency of pauses. ‘#’ represents a pause; r is the range (number of words on each side of the pause), and s is the attack step.

level | r = 1, s = 1 | r = 1, s = 5 | r = 1, s = 20 | r = 2, s = 1 | r = 2, s = 5 | r = 2, s = 20
level 1 | it # looks 19 | of # the 40 | the # sink 102 | the mother # and the 10 | the picture # tell me 17 | mother is # washing dishes 26
 | of # the 16 | dishes # and 32 | and # the 71 | othing tote # ofor 10 | mother is # washing dishes 13 | ah it # looks like 20
 | dishes # and 12 | it # looks 27 | of # the 60 | ah it # looks like 10 | washing dishes # and the 12 | the picture # tell me 19
 | mother # and 11 | i # see 25 | the # little 54 | it is # o 10 | reaching for # the cookie 11 | washing dishes # and the 18
 | tote # ofor 10 | is # drying 19 | i # see 50 | gon as # aa itre 10 | nitchin cookies # ok 10 | is open # looks like 17
level 2 | of # the 17 | is # running 38 | and # the 117 | or cake # and look 10 | mother is # washing dishes 24 | mother is # washing dishes 37
 | it # looks 15 | dishes # and 38 | the # sink 93 | shink is # running over 10 | the picture # well the 18 | washing dishes # and the 18
 | is # drying 13 | of # the 37 | on # the 92 | the mother # and the 10 | two cups # and a 13 | the picture # well the 18
 | mother # and 12 | and # the 32 | of # the 76 | othing tote # ofor 10 | washing dishes # and the 12 | of water # and the 18
 | is # running 11 | the # sink 27 | i # see 66 | ah it # looks like 10 | the picture # tell me 12 | and the # sink is 18
level 3 | it # looks 15 | dishes # and 40 | and # the 120 | or cake # and look 10 | mother is # washing dishes 19 | mother is # washing dishes 37
 | of # the 12 | of # the 35 | the # sink 96 | ah it # looks like 10 | the picture # well the 17 | washing dishes # and the 23
 | cake # and 10 | is # going 31 | on # the 93 | little girl # cubboard doors 10 | two cups # and a 12 | water on # the floor 20
 | dishes # and 10 | is # running 30 | of # the 78 | othing tote # ofor 9 | trying to # get in 11 | the picture # well the 20
 | girl # cubboard 10 | and # the 30 | dishes # and 78 | what wit # the sinks 9 | stool is # going to 11 | and the # sink is 20

4.4. Length analysis of pauses

Table 2 shows the frequency of the perturbation actions. Adding and deleting long pauses (periods) are the most common perturbations for changing the detection result (highlighted). In the step-1 attack, we consider the perturbation actions to be performed on sensitive pauses. In the cases where confidence changes toward non-AD, the total number of deleting and long-to-short replacing actions is significantly larger than the total number of adding and short-to-long replacing actions. In the cases where confidence changes toward AD, the perturbation actions are mostly adding pauses. However, we note that most actions of adding a comma in levels 2/3 and adding a period in level 1 change the confidence toward non-AD. Thus, we consider that not all pauses are dementia-sensitive from the model’s perspective, because natural language contains pauses. Lastly, we found that replacing actions appear less frequently than adding and deleting, suggesting that the pause context is more important than the pause length.

Table 2:

Length analysis. A stands for AD, N for Non-AD

Perturbation | Step-1 A→N | Step-1 N→A | Step-5 A→N | Step-5 N→A | Step-20 A→N | Step-20 N→A
level 1 Add . | 90 | 357 | 495 | 1764 | 1336 | 5715
Delete . | 242 | 3 | 844 | 34 | 1168 | 644
level 2 Add , | 34 | 1 | 272 | 35 | 1631 | 1365
Add . | 38 | 334 | 157 | 1639 | 450 | 4410
Replace , → . | 5 | 16 | 58 | 86 | 196 | 502
Delete , | 57 | 9 | 388 | 19 | 1501 | 369
Delete . | 180 | 0 | 591 | 9 | 879 | 188
Replace . → , | 36 | 0 | 284 | 12 | 653 | 243
level 3 Add , | 36 | 2 | 284 | 15 | 1686 | 629
Add ; | 14 | 21 | 58 | 123 | 303 | 1009
Add . | 34 | 311 | 111 | 1524 | 329 | 4118
Replace , → ; | 0 | 5 | 4 | 9 | 89 | 141
Replace , → . | 2 | 10 | 19 | 89 | 86 | 454
Replace ; → . | 3 | 2 | 29 | 9 | 72 | 88
Delete , | 8 | 7 | 136 | 13 | 946 | 218
Delete ; | 41 | 0 | 187 | 0 | 444 | 75
Delete . | 153 | 2 | 488 | 11 | 719 | 163
Replace ; → , | 9 | 0 | 92 | 1 | 467 | 64
Replace . → , | 42 | 0 | 284 | 5 | 630 | 82
Replace . → ; | 8 | 0 | 58 | 1 | 169 | 73

5. CONCLUSIONS

In this paper, we exploited adversarial learning to study the interpretability of speech pauses in dementia detection. With the adversarial attack and training techniques, we identified the positions and lengths of sensitive pauses, e.g., speech pauses near the verb “is” are dementia-sensitive. We also confirmed that increasing the lengths of sensitive pauses or adding sensitive pauses leads the model inference toward AD, while decreasing the lengths of sensitive pauses or deleting sensitive pauses leads it toward non-AD. Our evaluation reveals that misspelling errors are inevitable when using a speech recognition model and a language model. One future direction is to exploit large-scale speech data for pre-training a speech-based language model, thus resolving the misspelling problem.

Acknowledgments

This research is funded by the US National Institutes of Health National Institute on Aging, under grant No. R01AG067416.

LIST OF ACRONYMS

AD

Alzheimer’s Disease

ADReSS

Alzheimer’s Dementia Recognition through Spontaneous Speech

ADReSSo

ADReSS speech only

REFERENCES

  • [1] Saturnino Luz, Fasih Haider, Sofia de la Fuente, Davida Fromm, and Brian MacWhinney, “Alzheimer’s dementia recognition through spontaneous speech: the ADReSS challenge,” arXiv preprint arXiv:2004.06833, 2020.
  • [2] Saturnino Luz, Fasih Haider, Sofia de la Fuente, Davida Fromm, and Brian MacWhinney, “Detecting cognitive decline using speech only: the ADReSSo challenge,” arXiv preprint arXiv:2104.09356, 2021.
  • [3] Harold Goodglass, Edith Kaplan, and Sandra Weintraub, BDAE: The Boston Diagnostic Aphasia Examination, Lippincott Williams & Wilkins, Philadelphia, PA, 2001.
  • [4] Jiahong Yuan, Yuchen Bian, Xingyu Cai, Jiaji Huang, Zheng Ye, and Kenneth Church, “Disfluencies and fine-tuning pre-trained language models for detection of Alzheimer’s disease,” in INTERSPEECH, 2020, pp. 2162–2166.
  • [5] Jiahong Yuan, Xingyu Cai, Yuchen Bian, Zheng Ye, and Kenneth Church, “Pauses for detection of Alzheimer’s disease,” Frontiers in Computer Science, vol. 2, p. 57, 2020.
  • [6] Rachel A. Sluis, Daniel Angus, Janet Wiles, Andrew Back, Tingting Gibson, Jacki Liddle, Peter Worthy, David Copland, and Anthony J. Angwin, “An automated approach to examining pausing in the speech of people with dementia,” American Journal of Alzheimer’s Disease & Other Dementias, vol. 35, 2020.
  • [7] Patricia Pastoriza-Dominguez, Ivan G. Torre, Faustino Dieguez-Vide, Isabel Gomez-Ruiz, Sandra Gelado, Joan Bello-Lopez, Asuncion Avila-Rivera, Jordi Matias-Guiu, Vanesa Pytel, and Antoni Hernandez-Fernandez, “Speech pause distribution as an early marker for Alzheimer’s disease,” medRxiv, 2021.
  • [8] Boyd H. Davis and Margaret Maclagan, “Examining pauses in Alzheimer’s discourse,” American Journal of Alzheimer’s Disease & Other Dementias, vol. 24, no. 2, pp. 141–154, 2009.
  • [9] Aparna Balagopalan, Benjamin Eyre, Frank Rudzicz, and Jekaterina Novikova, “To BERT or not to BERT: comparing speech and language-based approaches for Alzheimer’s disease detection,” arXiv preprint arXiv:2008.01551, 2020.
  • [10] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin, “‘Why should I trust you?’ Explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
  • [11] Yash Goyal, Ziyan Wu, Jan Ernst, Dhruv Batra, Devi Parikh, and Stefan Lee, “Counterfactual visual explanations,” in International Conference on Machine Learning, PMLR, 2019, pp. 2376–2384.
  • [12] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman, “Deep inside convolutional networks: visualising image classification models and saliency maps,” arXiv preprint arXiv:1312.6034, 2013.
  • [13] Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning, “What does BERT look at? An analysis of BERT’s attention,” arXiv preprint arXiv:1906.04341, 2019.
  • [14] Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith, “Linguistic knowledge and transferability of contextual representations,” arXiv preprint arXiv:1903.08855, 2019.
  • [15] Zhengping Che, Sanjay Purushotham, Robinder Khemani, and Yan Liu, “Distilling knowledge from deep networks with applications to healthcare domain,” arXiv preprint arXiv:1512.03542, 2015.
  • [16] Youxiang Zhu, Abdelrahman Obyat, Xiaohui Liang, John A. Batsis, and Robert M. Roth, “WavBERT: exploiting semantic and non-semantic speech using Wav2vec and BERT for dementia detection,” Proc. Interspeech 2021, pp. 3790–3794, 2021.
  • [17] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli, “wav2vec 2.0: a framework for self-supervised learning of speech representations,” arXiv preprint arXiv:2006.11477, 2020.
  • [18] Siddhant Garg and Goutham Ramakrishnan, “BAE: BERT-based adversarial examples for text classification,” arXiv preprint arXiv:2004.01970, 2020.
  • [19] Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu, “BERT-ATTACK: adversarial attack against BERT using BERT,” arXiv preprint arXiv:2004.09984, 2020.
  • [20] P. A. Pérez-Toro, S. P. Bayerl, T. Arias-Vergara, J. C. Vásquez-Correa, P. Klumpp, M. Schuster, Elmar Nöth, J. R. Orozco-Arroyave, and K. Riedhammer, “Influence of the interviewer on the automatic assessment of Alzheimer’s disease in the context of the ADReSSo challenge,” Proc. Interspeech 2021, pp. 3785–3789, 2021.
