Author manuscript; available in PMC: 2025 May 15.
Published before final editing as: J Voice. 2023 Nov 15:S0892-1997(23)00319-3. doi: 10.1016/j.jvoice.2023.10.011

Deep-Learning-Based Analysis of Glottal Attack and Offset Times in Adductor Laryngeal Dystonia

Ahmed M Yousef a, Dimitar D Deliyski a, Mohsen Zayernouri b, Stephanie RC Zacharias c,d, Maryam Naghibolhosseini a
PMCID: PMC11093885  NIHMSID: NIHMS1938084  PMID: 37977969

Abstract

Objective:

Diagnosis of adductor laryngeal dystonia (AdLD) is challenging because the disorder mimics the voice features of other voice disorders, which can lead to misdiagnosis (or delayed diagnosis) and ineffective treatment. This paper develops automated measurements of glottal attack time (GAT) and glottal offset time (GOT) from high-speed videoendoscopy (HSV) in connected speech as objective measures that could facilitate the diagnosis of this disorder in the future.

Methods:

HSV data were recorded from vocally normal adults and patients with AdLD during reading of the “Rainbow Passage” and six CAPE-V sentences. A deep-learning framework was designed and trained to segment the glottal area and detect the vocal fold edges in the HSV dataset. This framework enabled automatic measurement and quantification of the GATs and GOTs of the participants. The obtained measurements were then compared between the vocally normal speakers and those with AdLD.

Results:

The automated framework was successfully developed and accurately segmented the glottal area and edges. The automated measurements of GAT and GOT showed minor, non-significant differences from the results of visual analysis, with a strong correlation between the automated and manual measures. The GAT values differed significantly between the vocally normal subjects and the AdLD patients, with larger variability in both the GAT and GOT measures in the AdLD group.

Conclusions:

The developed automated approach for GAT and GOT measurement can be valuable in clinical practice. These quantitative measurements can serve as meaningful biomarkers of the impaired vocal function in AdLD and support its differential diagnosis in the future.

Keywords: Adductor Laryngeal Dystonia, Connected Speech, Laryngeal Imaging, High-Speed Videoendoscopy, Glottal Attack/Offset Time, Deep Learning

1. Introduction

Adductor laryngeal dystonia (AdLD) is a neurological voice disorder that disrupts laryngeal muscle control and causes excessive vocal fold (VF) closure during speech [1, 2, 3, 4]. The spasmodic overclosure of the VFs, particularly while the VFs are approximating during phonation, leads to frequent phonatory breaks and a strained voice quality with cessation of airflow [5, 6, 7]. The symptoms of AdLD are more likely to be elicited in running speech than in sustained phonation [8, 9]. Diagnosis of AdLD is difficult due to the lack of an accurate diagnostic criterion to differentiate AdLD from other voice disorders that mimic its voice features [10, 11]. Hence, AdLD is commonly misdiagnosed, and needless medical interventions may follow [12].

Previous studies investigated AdLD through various assessment tools such as perceptual voice evaluation [13], acoustic analysis [14, 15], fiberoptic laryngoscopy [16], and aerodynamic measurements [17]. However, the capability of these assessment procedures to elucidate the pathophysiology of AdLD and support its differential diagnosis has been limited [18]. Because of this limitation, a standardized diagnostic protocol for AdLD is currently not available, and the diagnosis predominantly relies on the specialists present in the clinical setting and the expertise of the clinicians [19]. Despite the variety of tools used for AdLD evaluation, laryngeal high-speed videoendoscopy (HSV) has seen limited use in investigating AdLD in the literature [20, 21, 22]. HSV is a powerful tool that can capture detailed VF motion [23, 24, 25, 26, 27, 28]. This advanced imaging tool facilitates visualization of not only periodic but also aperiodic vibrations [29, 30, 31, 32, 33, 34], which is not feasible with standard videostroboscopy [35, 36]. Analyzing aperiodic VF vibrations is crucial, as they are commonly associated with neurological voice disorders such as AdLD in connected speech [35, 37, 38, 39, 40]. Moreover, HSV makes it possible to study the complex transitional phonatory events of running speech during the onsets and offsets of vibration [41, 42]. It is therefore important to use such a powerful assessment tool, which can provide more information about the impaired laryngeal mechanisms and vocal function in AdLD and, eventually, facilitate its accurate diagnosis [43].

Only a few studies in the literature have examined AdLD using HSV, mainly with sustained vocalized samples; studies examining the impact of AdLD on running speech are scarce. Most studies visually evaluated steady-state phonation to reveal features of VF motion and oscillatory breaks corresponding to AdLD [20, 21]. Some studies demonstrated different VF gestures during the onset of phonation in AdLD patients [22, 21]. Given that AdLD symptoms and signs are task specific, occurring during running speech but not during sustained phonation, studying AdLD in a natural context is crucial to better understand its impaired vocal behavior. We have previously studied the visual characterization of phonation onset and offset using HSV [44]. That study was carried out by three raters through visual analysis of HSV recordings from vocally normal speakers and AdLD patients. Considering the large number of images (video frames) generated from HSV recordings, extracting measures by visually analyzing thousands of HSV frames was a time-consuming process [45, 46, 28]. Additionally, the manual analysis was challenging due to the dynamic nature of the laryngeal tissues and the unexpected behavior of the VFs in HSV [47]. Differences in the visual acuity and judgments of the raters may have also affected the accuracy of the HSV analysis [44]. These challenges emphasize the need for automated techniques that can rapidly analyze and process HSV videos in order to obtain accurate, reliable measurements and clinically relevant information about VF function during running speech [48, 49, 50, 51, 52].

Segmenting the glottal area is a fundamental step in automatically analyzing HSV video sequences, as it provides a quantitative representation of VF movement in connected speech, as we have demonstrated in multiple previous studies [53, 54]. As such, the extraction of the glottal area, particularly the glottal area waveform (GAW), from HSV data is a building block for HSV-based measures that assess both the oscillatory motion of the VFs and various pre- and post-phonatory events [55, 56, 49, 57, 58, 59, 60]. Evaluating these pre- and post-phonatory behaviors in HSV by specifically measuring the glottal attack time (GAT) and glottal offset time (GOT) [34, 44, 61, 62], corresponding to the onset and offset of VF oscillations and contacts, can be an effective way to assess the impaired vocal function of patients with AdLD [35, 38, 39, 40, 63, 62].

The present work aims to automatically analyze phonation onset and offset from HSV data in connected speech by measuring GAT and GOT in AdLD patients and normal controls. The primary research question is whether GAT and GOT differ between AdLD patients and normal controls. This question is addressed by testing the following hypotheses:

  • H1: An automated algorithm can be developed to measure GAT and GOT with comparable accuracy to visual measurements.

  • H2: GAT and GOT will be significantly higher in AdLD versus normal controls.

  • H3: GAT and GOT will show more variability in AdLD participants.

Toward testing these hypotheses, an automated approach was developed based on the VF edges and GAW segmented from the HSV recordings by means of our previously developed deep learning tool [64]. The deep learning tool was primarily used for image segmentation to capture the glottal area in the HSV frames and, accordingly, the VF edges. The GAT and GOT were defined as the time difference between the first VF oscillation and the first VF contact and between the last VF contact and the last VF oscillation, respectively. The GAT and GOT were automatically measured from the HSV recordings and validated against the visual analysis results, produced by three raters, in order to assess the accuracy of the developed automated tool. In addition, the automated measurements of GAT and GOT were compared between the AdLD patients and the normal controls to investigate the impact of AdLD on the onset and offset of phonation in running speech.

2. Materials and Methods

2.1. Data Collection

The study population included four vocally normal participants (three female and one male) without any history of voice disorders and five patients with AdLD (four female and one male). The participants were within the age range of 35–76 years. The data were collected at the Mayo Clinic in Scottsdale, AZ. The data collection was approved by the Institutional Review Board (IRB# 18–007716). A monochrome high-speed camera (Photron FASTCAM mini AX200; Photron Inc., San Diego, CA) was coupled with a flexible nasolaryngoscope and used to collect the HSV recordings from all the participants. The frame rate was 4,000 frames per second (fps), and the spatial resolution of the recordings was 256 × 224 pixels. During the recording session, the participants were asked to read the six CAPE-V sentences and the first six sentences of the “Rainbow Passage.” The length of the collected HSV videos ranged between 50 and 100 s (200,000–400,000 frames), depending on how fast the subjects read the material. All recordings were stored as mraw files on a laptop connected to the HSV system. After de-identification, the raw data were transferred to our laboratory data server for analysis.

2.2. Data Analysis

In this work, our previously developed deep neural network (DNN) tool was utilized as a spatial segmentation technique for the analysis of the collected HSV data [64]. This allowed us to automatically capture the glottal area and VF edges in the HSV recordings during running speech. In this study, the DNN was primarily used as a segmentation tool to estimate the spatial features (glottal area) across the HSV frames; that is, the pixels located in the area between the two VFs were spatially classified and identified. The DNN consisted of a fully convolutional neural network inspired by the U-Net architecture [65]. The main components of the DNN architecture included four encoder stages, tailored to extract spatial features from the input HSV frames. The encoded spatial features were then decoded through four decoder stages to reconstruct the output segmentation mask, highlighting the glottal area pixels based on the extracted image features. Each stage comprised multiple convolution layers to process the input HSV images and improve the segmentation results. The DNN output segmentation mask was a binary (black-and-white) image in which each pixel was assigned a value of one or zero depending on whether it belonged to the glottal area or the background.
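
As a rough illustration of this encoder-decoder design, the sketch below builds a four-stage U-Net-style segmenter for one-channel (monochrome) frames in PyTorch. It is a minimal sketch under stated assumptions: the channel widths, kernel sizes, and class name are illustrative choices, not the authors' published configuration.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # Two 3x3 convolutions with ReLU, as in typical U-Net stages.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class GlottisUNet(nn.Module):
    """U-Net-style segmenter: four encoder and four decoder stages,
    1-channel (monochrome HSV frame) in, 1-channel glottal-area logit out."""
    def __init__(self, widths=(16, 32, 64, 128)):
        super().__init__()
        self.encoders = nn.ModuleList()
        c = 1
        for w in widths:
            self.encoders.append(conv_block(c, w))
            c = w
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(widths[-1], 2 * widths[-1])
        self.upconvs, self.decoders = nn.ModuleList(), nn.ModuleList()
        c = 2 * widths[-1]
        for w in reversed(widths):
            self.upconvs.append(nn.ConvTranspose2d(c, w, 2, stride=2))
            self.decoders.append(conv_block(2 * w, w))  # skip + upsampled
            c = w
        self.head = nn.Conv2d(widths[0], 1, 1)  # per-pixel glottis logit

    def forward(self, x):
        skips = []
        for enc in self.encoders:        # encoder: extract spatial features
            x = enc(x)
            skips.append(x)
            x = self.pool(x)
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.upconvs, self.decoders, reversed(skips)):
            x = up(x)                    # decoder: reconstruct the mask
            x = dec(torch.cat([x, skip], dim=1))
        return self.head(x)

# One 256 x 224 monochrome frame in, one mask of the same size out;
# thresholding sigmoid(logits) at 0.5 yields the binary segmentation mask.
net = GlottisUNet()
logits = net(torch.randn(1, 1, 256, 224))
mask = torch.sigmoid(logits) > 0.5
```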

This network architecture was trained using manually segmented HSV frames, randomly selected from the different subjects’ recordings, except for one recording from an AdLD patient, which was excluded from training and used for testing/evaluating the trained architecture. Although the number of subjects was relatively small for training the network, the training frames were selected to diversely represent glottal area shapes/gestures during various phonatory events in connected speech to avoid overfitting. These included different events such as VF adjustments, vibrations, and spasms, as well as frames displaying obstructed and unobstructed VF views. Additionally, by incorporating data augmentation methods (e.g., flipping images, adding blurriness and noise, and applying rotation) along with a validation dataset (a subset of the training data), the capacity of the DNN to generalize and adapt to the variability present in running speech was enhanced given the limited sample size. After training, the algorithm was evaluated on the testing dataset to ensure accurate segmentation and detection of the glottal area across the HSV frames and recordings. The GAW was then obtained from the segmented frames for each subject.
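
For concreteness, a minimal sketch of this kind of augmentation is shown below, assuming frame/mask pairs stored as NumPy arrays; the flip probability, rotation range, blur strength, and noise level are illustrative assumptions rather than the study's settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, rotate

def augment(frame, mask, rng):
    """Jointly augment one monochrome frame and its binary glottis mask."""
    if rng.random() < 0.5:                      # horizontal flip
        frame, mask = frame[:, ::-1], mask[:, ::-1]
    angle = rng.uniform(-15.0, 15.0)            # small random rotation
    frame = rotate(frame, angle, reshape=False, order=1)
    mask = rotate(mask, angle, reshape=False, order=0)  # keeps mask binary
    frame = gaussian_filter(frame, sigma=rng.uniform(0.0, 1.5))  # blur
    frame = frame + rng.normal(0.0, 0.02, frame.shape)           # noise
    return np.clip(frame, 0.0, 1.0), mask

rng = np.random.default_rng(0)
frame = rng.random((256, 224))               # stand-in for a real HSV frame
mask = (frame > 0.95).astype(np.uint8)       # stand-in binary glottis mask
aug_frame, aug_mask = augment(frame, mask, rng)
```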

The GAW refers to the change in the glottal area (measured in pixels) across HSV frames due to VF vibrations. The GAW was determined during the various vocalized segments in the recordings. To do so, the spatial locations of the left and right VF edges were determined based on the segmented glottal area (glottal boundary) [54]. That is, to automatically detect the left and right VF edges in the HSV frames from the extracted glottal area, the first image moment was computed for each row of the detected glottal area pixels in each segmented HSV frame. The first image moment localized the center of the glottis in the segmentation masks along the length of the VFs. After the midpoints of the glottal area were detected, the corresponding midline was estimated as a fitted second-order curve to accurately represent the glottal center. Based on the detected glottal midline (the fitted curve), the spatial locations of the left and right VF edges were automatically determined relative to the midline. After extracting the GAWs and the associated glottal boundary (left and right VF edges), we proceeded with the computation of the GAT and GOT measures.
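
A minimal sketch of this midline and edge step follows, assuming a binary NumPy mask per frame: the row-wise first moment (centroid) of the glottal pixels gives the glottis center, a second-order polynomial fit gives the midline, and edge pixels are taken on either side of that midline. The function and variable names are illustrative.

```python
import numpy as np

def vf_edges(mask):
    """mask: 2-D binary array (1 = glottal-area pixel, 0 = background)."""
    rows, centers = [], []
    for r in range(mask.shape[0]):
        cols = np.flatnonzero(mask[r])
        if cols.size:                       # first image moment of this row
            rows.append(r)
            centers.append(cols.mean())
    if len(rows) < 3:
        return None                         # too little glottis to fit a curve
    rows = np.asarray(rows)
    coeffs = np.polyfit(rows, centers, 2)   # second-order midline fit
    midline = np.polyval(coeffs, rows)
    left, right = [], []
    for r, m in zip(rows, midline):
        cols = np.flatnonzero(mask[r])
        lcols, rcols = cols[cols <= m], cols[cols >= m]
        if lcols.size:
            left.append((r, lcols.min()))   # boundary toward the left fold
        if rcols.size:
            right.append((r, rcols.max()))  # boundary toward the right fold
    return np.array(left), np.array(right)

# Toy example: a small rectangular "glottis" in a 16 x 12 mask.
mask = np.zeros((16, 12), dtype=np.uint8)
mask[4:12, 5:8] = 1
left_edge, right_edge = vf_edges(mask)
```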

The GAT and GOT measures are important to study because they are critical factors in investigating the pathophysiology of voice disorders [41]. The GAT and GOT were automatically calculated during the onset and offset of phonation using the extracted GAW and the segmented VF edges. The GAT computation relied on the fact that the energy of the GAW increases at the beginning of VF oscillation, and the energy of the contact waveform rises sharply with VF contact. Accordingly, the time delay between the rise of the energy of VF oscillation and that of VF contact at the onset was taken as the measure of GAT. To do so, the normalized GAW and the average medial glottal contact waveform were calculated from the GAW and the detected VF edges. The GAW was defined as the area, measured in pixels, between the left and right VF edges. The average medial glottal contact waveform was calculated as the average number of points (pixels) along the VF length that were in contact at each frame. The energy of the two waveforms was then computed using a sliding window with a size of 30 ms: the energy waveform was calculated by taking the sum of squared values of the waveform within the sliding window, which was moved frame by frame, with the energy computed at each step. After the two energy waveforms were obtained, the GAT was computed as the unbiased delay between the first derivatives of the two energy contours, calculated using the cross-correlation technique. The cross-correlation method provides an unbiased measurement of the time delay without any predefined thresholds or conditions and free from operator intervention or bias; that is, it is fully automated [41].
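
The sketch below illustrates this computation under stated assumptions: a 4,000 fps recording (0.25 ms per frame), a 30 ms (120-frame) sliding energy window, and the lag of the peak of the full cross-correlation of the energy derivatives taken as the delay. The toy signals at the end are hypothetical, not study data.

```python
import numpy as np

FPS = 4000.0                      # frames per second
MS_PER_FRAME = 1000.0 / FPS       # 0.25 ms per frame

def sliding_energy(x, win_frames=120):
    # Sum of squared samples inside a 30 ms window moved frame by frame.
    return np.convolve(np.square(x), np.ones(win_frames), mode="same")

def attack_time_ms(gaw, contact):
    peak = np.max(np.abs(gaw))
    gaw = gaw / peak if peak > 0 else gaw      # normalized GAW
    d_gaw = np.diff(sliding_energy(gaw))       # first derivative of energy
    d_con = np.diff(sliding_energy(contact))
    xc = np.correlate(d_gaw, d_con, mode="full")
    lag = np.argmax(xc) - (len(d_con) - 1)     # lag of peak correlation
    return abs(lag) * MS_PER_FRAME

# Toy demo (hypothetical signals): the contact pulses start 60 frames
# (15 ms) after the oscillation, so the recovered delay should be ~15 ms.
t = np.arange(2000)
gaw = np.sin(2 * np.pi * t / 20) * (t > 400)
contact = ((np.sin(2 * np.pi * t / 20) > 0.9) & (t > 460)).astype(float)
print(f"GAT ~ {attack_time_ms(gaw, contact):.2f} ms")
```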

The automated computation of the GOT followed a procedure similar to that of the GAT. The GOT represented the time delay between the drop of the energy of VF oscillation and that of VF contact at the phonation offset. The GAW and the average medial glottal contact waveform were computed again as explained for the GAT measurement. During the phonation offset, the energy of the average medial glottal contact waveform drops sharply after the last VF contact, and the energy of the GAW drops and dissipates as the oscillation of the VFs damps. The GOT was calculated as the unbiased time delay between the first derivatives of the two energy waveforms using the cross-correlation method. Based on visual examination of the GAW extracted from each HSV recording, the individual phonations were identified, and the GAT and GOT were then automatically computed for each identified phonation.
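
Continuing the sketch above, the same machinery can be applied at an offset segment, where both energy contours drop rather than rise; the toy signals below are again hypothetical.

```python
# GOT via the same energy-derivative cross-correlation (reusing
# attack_time_ms, t, and np from the previous sketch). Toy offset: the
# contact pulses stop ~112 frames (28 ms) before the oscillation decays.
gaw_off = (np.sin(2 * np.pi * t / 20) * (t < 1500)
           * np.exp(-np.maximum(t - 1400, 0) / 50.0))
contact_off = ((np.sin(2 * np.pi * t / 20) > 0.9) & (t < 1388)).astype(float)
print(f"GOT ~ {attack_time_ms(gaw_off, contact_off):.2f} ms")
```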

During the visual analysis, three raters analyzed the HSV data from each participant using camera playback software (Phantom Camera Control, PCC), in which they could adjust the playback speed, Gaussian filter settings, gain, brightness, and contrast of the video frames for better visualization and image quality [44]. The raters visually determined the timestamps corresponding to the first oscillation and contact frames as well as the last oscillation and contact frames for each vocalization in the recordings [44]. Based on the differences between these timestamps, they computed the GAT and GOT, measured in number of frames and in ms. The raters compared and reviewed among themselves the vocalizations with large errors, particularly discrepancies of more than 2.5 ms for the contacts and 5 ms for the oscillations, in order to reach a consensus about their measurements.

The proposed automated approach for computing the GAT and GOT was validated by comparing its results with the GAT/GOT measures obtained by visual analysis of the same HSV recordings. To do so, the Pearson correlation was computed to quantify the agreement between the automated measurements and the manual analysis and to test the first hypothesis. This was done by comparing the manually acquired GAT and GOT measurements with their corresponding automated values. The visual measurements were obtained alongside the automated measurements from the same phonation onsets and offsets across the different subjects’ recordings, allowing a robust computation of the Pearson correlation coefficient between the two sets of measurements for both GAT and GOT. This analysis quantified the degree of linear association between the automated and manual measurements, thereby examining the accuracy and consistency of the automated method relative to the manual measures.
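
As a minimal sketch, this validation step amounts to a paired correlation of per-phonation values; the numbers below are hypothetical placeholders, not the study's measurements.

```python
import numpy as np
from scipy.stats import pearsonr

# Paired per-phonation GAT values in ms (hypothetical placeholders).
auto_gat = np.array([14.7, 16.2, 13.9, 21.5, 18.0, 15.1])
manual_gat = np.array([14.5, 16.8, 13.5, 20.9, 18.6, 15.4])

r, p = pearsonr(auto_gat, manual_gat)   # linear association, auto vs. visual
print(f"GAT agreement: r = {r:.2f} (p = {p:.3f})")
```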

An independent-samples t-test was used to test the second hypothesis and determine whether there was any significant difference between the GAT/GOT measures of the AdLD group and those of the normal controls. The GAT and GOT measurements were treated as continuous dependent variables in the statistical model, and the group (vocally normal subjects vs. AdLD patients) as a categorical independent variable. The t-test assumed independence of the derived measurements, homoscedasticity (homogeneity of variances across the two groups), and normally distributed data within each group; all assumptions were met. Two independent t-tests were performed to compare the two measurements (GAT and GOT) between the normal control group and the AdLD group. A significance level of 0.05 was used for each test in the present study.
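
A minimal sketch of this comparison follows, with hypothetical per-group values; the Shapiro-Wilk and Levene checks mirror the normality and homoscedasticity assumptions stated above.

```python
from scipy.stats import levene, shapiro, ttest_ind

# Per-recording mean GAT in ms (hypothetical placeholders, one list per group).
gat_normal = [14.9, 15.2, 14.4, 14.1]
gat_adld = [14.8, 22.9, 17.6, 19.9, 19.5]

print("normality p:", shapiro(gat_normal).pvalue, shapiro(gat_adld).pvalue)
print("equal variances p:", levene(gat_normal, gat_adld).pvalue)
t_stat, p_val = ttest_ind(gat_normal, gat_adld, equal_var=True)
print(f"GAT, normal vs. AdLD: t = {t_stat:.2f}, p = {p_val:.4f}")
```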

3. Results

A sample of HSV frames is shown in Figure 1. The figure includes HSV frames from both the vocally normal speakers and the AdLD patients. Various laryngeal gestures due to the movement of the epiglottis, arytenoid cartilages, and false VFs, and due to laryngeal tissue compression, are included in the figure. Different locations of the VFs in the image, open and closed VFs during onset/offset, and non-vibrating VFs are displayed. In addition to the different configurations and positions of the VFs in the HSV frames, varying levels of image quality are shown, including different degrees of brightness, contrast, blurring, and image noise.

Figure 1.

A sample of HSV frames selected from the video recordings of both vocally normal subjects and AdLD patients, displaying different laryngeal gestures and locations of VFs with different image quality.

Figure 2 illustrates the performance of the implemented DNN tool in detecting the glottal area and the left/right VF edges on a set of individual HSV frames. The figure shows the results of applying the DNN to seven HSV frames with different configurations and gestures of the VFs in terms of their location, size, image quality, and partial obstruction. The processing results for these seven frames are presented in five rows. Each row shows a different segmentation or image processing stage, except for the first, which depicts the original HSV frames. The second and third rows, from the top, show the glottal area segmentation results in the form of the output segmentation masks and the associated original frames with the detected glottal area in red, respectively. The last two rows show the detected midline in yellow (row 4) and the left (red) and right (green) VF edges (row 5). The results in rows 4 and 5 show the successful detection of the glottal midline and the VF edges. Overall, the developed tool accurately captured the various shapes of the glottal area and precisely detected the VF edges.

Figure 2.

Results of the automated segmentation applied to seven different HSV frames with different vocal fold gestures. The five rows shown in the figure represent the original HSV frames (row 1), segmentation masks with the glottal area in white (row 2), the segmented glottal area in red (row 3), the glottal midline in yellow (row 4), and the left and right vocal fold edges, highlighted in red and green (row 5).

Figure 3 presents the results of applying the deep learning tool to a set of sequential HSV frames during a phonation onset. The top panel displays a subset of 12 HSV frames selected at various timestamps from the HSV sequence in Figure 1 and segmented using the automated tool. The segmented glottal area is shown in green, and the corresponding timestamps are indicated on each segmented image in yellow font. The segmented frames demonstrate the detection of the glottal area, capturing the complex details of the glottis. The bottom panel shows the GAW in pixels (Px) as a function of time (ms). The timestamps at which the segmented frames were selected are marked by red dots on the generated GAW. The automatically generated GAW provides a visual representation of the change in the glottal area and VF behavior during the phonation onset.

Figure 3.

Results of applying the automated deep learning tool to a sequence of HSV frames during a phonation onset. The top panel shows the automatically segmented glottal area (in green) in 12 HSV frames, selected at different timestamps from the HSV sequence in Figure 1. The corresponding time for each frame is indicated in yellow font. The bottom panel illustrates the segmented GAW (measured in pixels, Px) across the HSV sequence as a function of time during the phonation onset.

The results of the automated measurement of the GAT at the phonation onset of a vocalization during running speech are shown in Figure 4. The normalized GAW and the average medial glottal contact waveform as a function of time are displayed in the top two panels as solid red and dashed blue lines, respectively. The corresponding energy contours are illustrated in the third panel with the same color codes. The cross-correlation of the GAW and contact waveform is shown in the bottom panel (in green). The automatically computed GAT is indicated by the time difference between the two vertical dashed lines in the bottom panel. The figure visualizes the energy rise of both the GAW and the contact waveform and how the delay between the two energy waveforms was captured. In this example, the automated algorithm reveals a delay of 14.75 ms, which is the GAT value for this particular phonation onset.

Figure 4.

Automated measurement of GAT during a phonation onset. The top two panels show the magnitudes of the normalized GAW (solid red line) and the average medial glottal contact waveform (dashed blue line) as a function of time. The bottom two panels illustrate the energy contours corresponding to each waveform (third panel) along with the outcome of the cross-correlation in green (bottom panel). The measured GAT is marked on the cross-correlation graph as the time delay between the two vertical dashed lines, with a value of 14.75 ms.

The outcome of applying the DNN tool to a sequence of HSV frames during the offset of phonation is shown in Figure 5, with the same formatting as Figure 3. The top panel shows a sample of 12 segmented HSV frames within the sequence, and the bottom panel shows the generated GAW in pixels (Px) as a function of time during the phonation offset. As shown in the top panel, different levels of image quality and varied configurations, views, and sizes of the VFs appear across the frames within the presented sequence. Various shapes of the glottal area, shown in green, are captured. The GAW in the bottom panel exhibits the dynamic characteristics of the glottal area and VF behavior during the offset of phonation, capturing not only the oscillations during the steady-state portion of vibration but also the small-amplitude oscillations at the end of the phonation offset.

Figure 5.

Results of applying the automated deep learning tool to a sequence of HSV frames during a phonation offset. The top panel shows the automatically segmented glottal area (in green) in 12 HSV frames, selected at different timestamps from the HSV sequence in Figure 1. The corresponding time for each frame is indicated in yellow font. The bottom panel illustrates the segmented GAW (measured in pixels, Px) across the HSV sequence as a function of time during the phonation offset.

Figure 6 depicts the results of the automated GOT measurement during a phonation offset. As in Figure 4, the top two panels display the normalized GAW and the average medial contact waveform, automatically generated using the segmentation tool and shown as solid red and dashed blue lines, respectively. The derived energy contours of each waveform are illustrated in the third panel. The two energy waveforms accurately represent the drop of the oscillation energy corresponding to the damping motion of the VFs at the end of phonation. As can be seen in the figure, there is a time lag between the two energy waveforms, which can be measured through cross-correlation. The result of the cross-correlation is shown in the bottom panel (in green), and the time lag between the energy waveforms is indicated by the two dashed lines, showing a GOT value of 28 ms at this specific phonation offset.

Figure 6.

Automated measurement of GOT during a phonation offset. The top two panels show the magnitudes of the normalized GAW (solid red line) and the average medial glottal contact waveform (dashed blue line) as a function of time. The bottom two panels illustrate the energy contours corresponding to each waveform (third panel) along with the outcome of the cross-correlation in green (bottom panel). The measured GOT is marked on the cross-correlation graph as the time delay between the two vertical dashed lines, with a value of 28 ms.

The results of the comparison between the manual and automated measurements of GAT and GOT are shown in Figure 7. The figure shows the mean values of GAT and GOT computed using the automated algorithm and the visual analysis for each HSV recording obtained from both the vocally normal subjects (N) and the AdLD patients. The solid blue and green bars indicate the automated measurements of GAT and GOT for each subject, respectively, whereas the dashed bars indicate the visual measurements. The standard deviations are illustrated as error bars. Overall, the automated measurements of GAT and GOT closely align and agree with the visual analysis in most subjects, with minimal differences. The automated measurements also agree marginally better with the manual measures for the vocally normal participants than for the AdLD patients. In addition, the automated computation of the GAT was slightly more accurate than that of the GOT, based on the comparison of the automated and manual measures.

Figure 7.

Bar graphs of the automated and visual measures of GAT and GOT in ms for the vocally normal participants (N) and the AdLD patients. The standard deviation is shown as error bars. The automated measurements of GAT and GOT are shown in solid blue and green bars, respectively. The manual measurements of GAT and GOT are illustrated in dashed blue and green bars, respectively.

Overall, the comparison revealed a minimal average discrepancy between the automated and the manual measurements. The average difference in the mean GAT between the automated and manual analysis was 1.6 ms across all the recordings. The mean GOT showed a slightly higher average difference of 2.7 ms between the automated and visual measurements. Moreover, an additional quantitative analysis compared the magnitudes of the GAT and GOT within the vocalized segments of the different subjects. The statistical analysis demonstrated a strong correlation between the automated and manual measurements for both GAT and GOT. As can be seen in Table 1, a high Pearson correlation coefficient of r = 0.93 was found for GAT, indicating a high level of agreement between the two measurement methods. Likewise, the GOT measurements demonstrated a strong correlation coefficient of r = 0.91. An independent t-test performed on the various vocalized segments showed no statistically significant difference between the automated and visual measurements (Table 1, middle column): the resulting p-values were 0.86 for the GATs and 0.77 for the GOTs, indicating no statistically significant differences and a high similarity between the manual and automated approaches.

Table 1.

Correlation coefficients and t-test results for GAT and GOT: automated vs. visual measurements, and normal controls vs. the AdLD group.

        Correlation (r)*          t-test (p-value)*         t-test (p-value)*
        Automated vs. Visual      Automated vs. Visual      Normal vs. AdLD
GAT     0.93                      0.86                      < 0.001
GOT     0.91                      0.77                      0.20

As shown in Figure 7, the mean GAT values of almost all the AdLD patients are higher than those of the vocally normal individuals. The mean GAT values of the AdLD individuals (14.8–22.9 ms) show larger variability than those of the normal controls (14.9–15.2 ms). In addition, the standard deviation is higher in AdLD (ranging from 5.4 to 19.6 ms with a mean of 12.1 ms) than in the vocally normal group (ranging from 4.1 to 8.7 ms with a mean of 6.1 ms), reflecting the high variability of the GAT in individual AdLD patients. Overall, the AdLD group had a higher average GAT (18.95 ms) than the vocally normal group (14.65 ms). As shown in Table 1, this difference was statistically significant (p < 0.001). The mean GOT values demonstrated a slight increase in the AdLD group (23.8–32.1 ms) compared with the normal controls (25.3–31.4 ms). The mean GOT across all the AdLD patients (28.9 ms) did not differ significantly (p = 0.20) from that of the vocally normal controls (27.3 ms). Similar to the variability found in the GAT measures, the GOT showed higher standard deviations in the AdLD group (9.8–19.1 ms for individual patients, averaging 15.3 ms) than in the normal controls (5.9–13.2 ms, averaging 10.1 ms).

Figure 8 presents box plots visualizing the individual measurements of GAT and GOT of each subject for a comprehensive representation of the measures. The box plots show the distribution of the individual measures for both GAT (in blue) and GOT (in green) across the different vocalizations for both the normal and AdLD group participants. The horizontal line within each box depicts the median value of each participant’s measurements. The box boundaries represent the 25th and 75th percentiles. The whiskers extend to the extreme values after outlier removal. The figure also depicts the outliers for each participant as blue and green dots for the GAT and GOT measurements, respectively. A larger interquartile range can be observed in the AdLD group for both GAT and GOT. Moreover, the difference between the mean and the median for each individual subject is larger in the AdLD group than in the normal-control group. The frequency of outliers was similar between the two groups, and more outliers were observed for the GOT measures in both groups.

Figure 8.

Box plots of the automated GAT (in blue) and GOT (in green) measurements for each subject in both the normal control and AdLD groups. The top and bottom edges of the boxes refer to the 75th and 25th percentiles, respectively. The horizontal line indicates the median. The whiskers represent the extreme measures after removal of the outliers, which are shown as individual dots.

4. Discussion

The goal of this study was to develop an automated algorithm for measuring GAT and GOT from HSV data in connected speech as objective measures that could potentially facilitate the diagnosis of AdLD in the future. To achieve this goal, an automated segmentation tool was applied to the monochrome HSV recordings of vocally normal subjects and patients with AdLD. The segmentation technique successfully detected the glottal area during various onsets and offsets of phonation in running speech. The DNN tool demonstrated high performance in detecting the varied sizes, geometries, locations, and configurations of the glottal area and VFs that commonly occur during different phonation onsets and offsets in running speech, even in the presence of challenging HSV image quality. Its ability to capture this variability and these transitional states, extending from various degrees of VF opening and small-amplitude oscillations to steady-state vibrations, demonstrated the high reliability and consistent accuracy of the developed tool during phonation onset/offset. Promising performance was achieved even with inferior and variable image quality in the recordings. The successful segmentation outcome in capturing and quantifying the dynamic changes in the glottal area enabled the automated measurements of the GAT and GOT. Based on the segmented glottis, the contact between the VFs was successfully determined. This allowed the precise computation of the energy contours associated with both the vibration of the VFs (represented in the GAW) and the VF contact. With these two contours, the delay between the rises in the energy contours at the phonation onset was calculated using the cross-correlation technique (to measure the GAT), as was the delay between the drops in the energy contours at the offset of phonation (to measure the GOT). The automated algorithm yielded efficient measurements of the GAT and GOT for the vocally normal group as well as the AdLD group.

The developed automated method for computing GAT and GOT was first validated against the visual analysis. This was done by applying the automated method to the HSV recordings of all the subjects and automatically generating the GAT and GOT values for the different phonation onsets and offsets in each recording. For the manual measurements, the visual analysis was carried out by three raters who visually determined the timestamps between the first oscillation and first contact of the VFs (to measure the GAT) and between the last contact and last oscillation (to measure the GOT). The manual and automated measurements were obtained for each recording. The comparative analysis showed a close agreement between the automated and visual measurements of GAT and GOT in most of the recordings, with minimal differences. The average difference between the automated and manual measurements, computed based on the mean of each recording, was small: 1.6 ms for GAT and 2.7 ms for GOT. This minor difference between the automated and the visual analysis was even lower than the error found among the three raters, which was considered an acceptable deviation in the measurements of GAT and GOT [44]. Additionally, the statistical analysis performed between the automated and manual measurements within different vocalizations demonstrated a strong correlation between the two in computing both GAT (r = 0.93) and GOT (r = 0.91), indicating a high degree of similarity between the automated and manual measures. These findings were supported by a t-test investigating whether the GAT/GOT measurements from the automated and manual analysis differed. The t-test revealed no significant difference between the automated and the visual analysis for GAT and GOT, with p-values of 0.86 and 0.77, respectively, indicating a high level of similarity between the two methods of measurement. These comparative results reflect the reliability and accuracy of the automated analysis technique relative to the visual analysis in estimating GAT and GOT, leading to the acceptance of the first hypothesis.

The results demonstrated that the automated algorithm was marginally more precise in determining the GAT than the GOT across the different subjects. This small deviation in accuracy was primarily driven by the longer durations of GOT: the minimal amplitudes of VF oscillations toward the end of the offset were difficult for the raters to define, causing small discrepancies with the automated method. Similarly, the automated measures obtained for the vocally normal group showed marginally higher precision for both GAT and GOT than those for the AdLD patients. This was mainly due to the irregularity of the VF dynamics in AdLD patients as well as the excessive phonatory breaks, which made the analysis more challenging for both the automated and the visual analysis. Another likely, though minimal, cause of the difference between the automated and the manual analysis was the inferior image quality of the recordings and the blurriness of the VF edges during adduction, which created difficulty in determining the first and last contact frames.

After the automated measurements were validated against the visual analysis, the automated measures of GAT and GOT across all the recordings were compared between the normal controls and the AdLD group. The results revealed that, overall, the GAT measures were longer in the AdLD patients than in the vocally normal participants. The statistical analysis demonstrated a significant difference (p < 0.001) between the average GAT of the AdLD group (18.95 ms) and that of the normal controls (14.65 ms). Accordingly, the part of the second hypothesis stating significantly higher GAT in AdLD versus normal controls is accepted. This finding is supported by a previous study that demonstrated a delay between the onset of phonation and the activation of the laryngeal muscles [66]. Another study in the literature showed longer onset times in AdLD than in normal controls, computed using a different method, which also supports the findings of the present work [21]. In contrast, although the present automated analysis demonstrated a slight increase in the mean GOT of the AdLD group (28.9 ms) versus the vocally normal group (27.3 ms), the difference was not statistically significant (p = 0.20). This non-significant difference in GOT rejects the part of the second hypothesis that expected significantly higher GOT in AdLD than in normal controls.

The results demonstrated larger variability in the measurements of GAT and GOT within the AdLD group, with particularly greater variability observed in GOT. In contrast, the variability within the normal controls was smaller, especially for the GAT, which showed a minimal range across the vocally normal individuals. This finding supports hypothesis 3. The explanation lies primarily in the irregular and inconsistent behavior of the VF vibrations in AdLD, along with the impaired neurological laryngeal function affecting laryngeal muscle control [2]. Hence, given the statistical significance between AdLD and normal controls, along with the greater consistency and lower variability within the normal controls, GAT may be a more valuable clinical measure than GOT. To rationalize the presence of outliers in the GAT/GOT measures of some subjects, the phonemic content of connected speech at the outliers should be investigated, which will be done in future work. A larger sample size could substantiate the findings on the impact of AdLD on GAT and GOT and establish the clinical significance of the introduced measures using HSV as an assessment tool.

5. Conclusion

The present work investigated pathological vocal function during phonation onset and offset in AdLD using HSV in connected speech, bridging a notable gap in the literature. An automated analysis was successfully conducted to measure the GAT and GOT from vocally normal participants and AdLD patients and to investigate the difference between the two groups. The automated measurements revealed minor, non-significant differences in comparison with the visual analysis, showing strong correlations between the two methods. The automated measurements demonstrated two main findings: AdLD patients showed significantly longer GATs compared with the vocally normal group, and more variability was observed in both the GATs and GOTs of the AdLD patients due to the considerable irregularity in their impaired VF vibrations. Since this study is one of the earliest attempts in the literature to investigate these measurements in running speech for AdLD, these findings can serve as a baseline for future research utilizing larger sample sizes and different voice disorders. The developed automated approach for GAT and GOT measurement can be valuable in clinical practice in the future. Obtaining such measures enables the exploration of clinically relevant information in HSV data to address the diagnostic challenges of AdLD.

Acknowledgments:

We acknowledge the support from NIH NIDCD K01DC017751, R21DC020003, and R01DC019402, and the Michigan State University Discretionary Funding Initiative.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Declaration of Interest:

The authors declare that they have no conflict of interest.

Submission Declaration and Verification:

The work described has not been published previously. Portions of this work were presented at the 52nd Anniversary Symposium: Care of the Professional Voice, Philadelphia, PA, May 31 - June 4, 2023. This manuscript is not under consideration for publication elsewhere, its publication is approved by all authors and tacitly or explicitly by the responsible authorities where the work was carried out, and if accepted, it will not be published elsewhere in the same form, in English or in any other language, including electronically without the written consent of the copyright-holder.


References

  • [1]. Chhetri DK, Merati AL, Blumin JH, Sulica L, Damrose EJ and Tsai VW, “Reliability of the perceptual evaluation of adductor spasmodic dysphonia,” Ann Otol Rhinol Laryngol, vol. 117, pp. 159–165, 2008.
  • [2]. Roy N, Gouse M, Mauszycki SC, Merrill RM and Smith ME, “Task specificity in adductor spasmodic dysphonia versus muscle tension dysphonia,” The Laryngoscope, vol. 115, no. 2, pp. 311–316, 2005.
  • [3]. Blitzer A and Kohli N, “Laryngeal dystonia: Phenomenology, genetics, and management,” Toxicon, p. 107258, 2023.
  • [4]. Meyer TK, Rickert SM, Childs LF and Blitzer A, “Spasmodic dysphonia,” in Handbook of Dystonia, CRC Press, pp. 196–206, 2012.
  • [5]. Cannito M and Kondraske G, “Rapid manual abilities in spasmodic dysphonic and normal female subjects,” J Speech Hear Res, vol. 33, pp. 123–133, 1990.
  • [6]. Roy N, “Differential diagnosis of muscle tension dysphonia and spasmodic dysphonia,” Current Opinion in Otolaryngology & Head and Neck Surgery, vol. 18, no. 3, pp. 165–170, 2010.
  • [7]. Stemple J, Roy N and Klaben B, Clinical Voice Pathology: Theory and Management, 6th ed., Plural Publishing, 2018.
  • [8]. Roy N, Mazin A and Awan SN, “Automated acoustic analysis of task dependency in adductor spasmodic dysphonia versus muscle tension dysphonia,” The Laryngoscope, vol. 124, no. 3, pp. 718–724, 2014.
  • [9]. Mor N and Blitzer A, “Diagnosis and treatment of laryngeal dystonia: past, present and future directions,” pp. 1–12, 2016.
  • [10]. Chhetri DK, Mendelsohn AH, Blumin JH and Berke GS, “Long-term follow-up results of selective laryngeal adductor denervation–reinnervation surgery for adductor spasmodic dysphonia,” Laryngoscope, vol. 116, pp. 635–642, 2006.
  • [11]. Roy N, Bless DM, Heisey D and Ford CN, “Manual circumlaryngeal therapy for functional dysphonia: an evaluation of short- and long-term treatment outcomes,” J Voice, vol. 11, pp. 321–331, 1997.
  • [12]. Chen DW and Ongkasuwan J, “Spasmodic dysphonia,” International Ophthalmology Clinics, vol. 58, no. 1, pp. 77–87, 2018.
  • [13]. Cannito MP, Woodson GE, Murry T and Bender B, “Perceptual analyses of spasmodic dysphonia before and after treatment,” Archives of Otolaryngology–Head & Neck Surgery, vol. 130, no. 12, pp. 1393–1399, 2004.
  • [14]. Sapienza CM, Walton S and Murry T, “Adductor spasmodic dysphonia and muscular tension dysphonia: acoustic analysis of sustained phonation and reading,” J Voice, vol. 14, pp. 502–520, 2000.
  • [15]. Rees CJ, Blalock PD, Kemp SE, Halum SL and Koufman JA, “Differentiation of adductor-type spasmodic dysphonia from muscle tension dysphonia by spectral analysis,” Otolaryngol Head Neck Surg, vol. 137, pp. 576–581, 2007.
  • [16]. Leonard R and Kendall K, “Differentiation of spasmodic and psychogenic dysphonias with phonoscopic evaluation,” Laryngoscope, vol. 109, pp. 295–300, 1999.
  • [17]. Higgins MB, Chait DH and Schulte L, “Phonatory air flow characteristics of adductor spasmodic dysphonia and muscle tension dysphonia,” J Speech Lang Hear Res, vol. 42, pp. 101–111, 1999.
  • [18]. Hintze JM, Ludlow CL, Bansberg SF, Adler CH and Lott DG, “Spasmodic dysphonia: a review. Part 2: characterization of pathophysiology,” Otolaryngology–Head and Neck Surgery, vol. 157, no. 4, pp. 558–564, 2017.
  • [19]. Ludlow CL, Adler CH, Berke GS, Bielamowicz SA, Blitzer A, Bressman SB and Woodson GE, “Research priorities in spasmodic dysphonia,” Otolaryngology–Head and Neck Surgery, vol. 139, no. 4, pp. 495–505, 2008.
  • [20]. Patel RR, Liu L, Galatsanos N and Bless DM, “Differential vibratory characteristics of adductor spasmodic dysphonia and muscle tension dysphonia on high-speed digital imaging,” Annals of Otology, Rhinology & Laryngology, vol. 120, no. 1, pp. 21–32, 2011.
  • [21]. Chen W, Woo P and Murry T, “Vibratory onset of adductor spasmodic dysphonia and muscle tension dysphonia: a high-speed video study,” Journal of Voice, vol. 34, no. 4, pp. 598–603, 2020.
  • [22]. Woo P, “High-speed imaging of vocal fold vibration onset delay: normal versus abnormal,” J Voice, vol. 31, pp. 307–312, 2017.
  • [23]. Morrison MD and Rammage LA, “Muscle misuse voice disorders: description and classification,” Acta Oto-Laryngologica, vol. 113, no. 3, pp. 428–434, 1993.
  • [24]. Halberstam B, “Acoustic and perceptual parameters relating to connected speech are more reliable measures of hoarseness than parameters relating to sustained vowels,” ORL, vol. 66, no. 2, pp. 70–73, 2004.
  • [25]. Maryn Y, Corthals P, Van Cauwenberge P, Roy N and De Bodt M, “Toward improved ecological validity in the acoustic measurement of overall voice quality: combining continuous speech and sustained vowels,” J Voice, vol. 24, no. 5, pp. 540–555, 2010.
  • [26]. Lowell SY, “The acoustic assessment of voice in continuous speech,” SIG 3 Perspectives on Voice and Voice Disorders, vol. 22, no. 2, pp. 57–63, 2012.
  • [27]. Pietruszewska W, Just M, Morawska J, Malinowski J, Hoffman J, Racino A, Barańska M, Kowalczyk M and Niebudek-Bogusz E, “Comparative analysis of high-speed videolaryngoscopy images and sound data simultaneously acquired from rigid and flexible laryngoscope: a pilot study,” Scientific Reports, vol. 11, no. 1, pp. 1–14, 2021.
  • [28]. Naghibolhosseini M, Deliyski DD, Zacharias SR, de Alarcon A and Orlikoff RF, “Temporal segmentation for laryngeal high-speed videoendoscopy in connected speech,” J Voice, vol. 32, no. 2, pp. 256.e1–256.e12, 2018.
  • [29]. Deliyski DD, Powell ME, Zacharias SR, Gerlach TT and de Alarcon A, “Experimental investigation on minimum frame rate requirements of high-speed videoendoscopy for clinical voice assessment,” Biomed Signal Process Control, vol. 17, pp. 51–59, 2015.
  • [30]. Zañartu M, Mehta DD, Ho JC, Wodicka GR and Hillman RE, “Observation and analysis of in vivo vocal fold tissue instabilities produced by nonlinear source-filter coupling: a case study,” Journal of the Acoustical Society of America, vol. 129, no. 1, pp. 326–339, 2011.
  • [31]. Mehta DD, Deliyski DD, Zeitels SM, Zañartu M and Hillman RE, “Integration of transnasal fiberoptic high-speed videoendoscopy with time-synchronized recordings of vocal function,” in Normal & Abnormal Vocal Folds Kinematics: High Speed Digital Phonoscopy (HSDP), Optical Coherence Tomography (OCT) & Narrow Band Imaging, vol. 12, San Francisco, CA: Pacific Voice & Speech Foundation, 2015, pp. 105–114.
  • [32]. Naghibolhosseini M, Deliyski DD, Zacharias SR, de Alarcon A and Orlikoff RF, “A method for analysis of the vocal fold vibrations in connected speech using laryngeal imaging,” in Manfredi C (Ed.), Proceedings of the 10th International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications (MAVEBA), Firenze University Press, Firenze, Italy, 2017.
  • [33]. Yiu E, Worrall L, Longland J and Mitchell C, “Analysing vocal quality of connected speech using Kay’s computerized speech lab: a preliminary finding,” Clin Linguist Phon, vol. 14, no. 4, pp. 295–305, 2000.
  • [34]. Naghibolhosseini M, Deliyski DD, Zacharias SRC, de Alarcon A and Orlikoff RF, “Studying vocal fold non-stationary behavior during connected speech using high-speed videoendoscopy,” The Journal of the Acoustical Society of America, vol. 144, no. 3, pp. 1766–1766, 2018.
  • [35]. Deliyski DD, “Clinical feasibility of high-speed videoendoscopy,” in Perspectives on Voice and Voice Disorders, vol. 17, American Speech-Language-Hearing Association, 2007, pp. 12–16.
  • [36]. Mehta DD, Deliyski DD, Quatieri TF and Hillman RE, “Automated measurement of vocal fold vibratory asymmetry from high-speed videoendoscopy recordings,” Journal of Speech, Language, and Hearing Research, vol. 54, no. 1, pp. 47–54, 2011.
  • [37]. Deliyski DD, Petrushev PP, Bonilha HS, Gerlach TT, Martin-Harris B and Hillman RE, “Clinical implementation of laryngeal high-speed videoendoscopy: challenges and evolution,” Folia Phoniatr Logop, vol. 60, no. 1, pp. 33–44, 2007.
  • [38]. Deliyski DD and Hillman RE, “State of the art laryngeal imaging: research and clinical implications,” Current Opinion in Otolaryngology & Head and Neck Surgery, vol. 18, no. 3, pp. 147–152, 2010.
  • [39]. Woo P, “Objective measures of stroboscopy and high speed video,” Advances in Oto-Rhino-Laryngology, vol. 85, pp. 25–44, 2020.
  • [40]. Deliyski DD, “Laryngeal high-speed videoendoscopy,” in Laryngeal Evaluation: Indirect Laryngoscopy to High-Speed Digital Imaging, New York: Thieme Medical Publishers, 2010, pp. 243–270.
  • [41]. Orlikoff RF, Deliyski DD, Baken R and Watson BC, “Validation of a glottographic measure of vocal attack,” J Voice, vol. 23, pp. 164–168, 2009.
  • [42]. Kunduk M, Ikuma T, Blouin DC and McWhorter AJ, “Effects of volume, pitch, and phonation type on oscillation initiation and termination phases investigated with high-speed videoendoscopy,” Journal of Voice, vol. 31, no. 3, pp. 313–322, 2017.
  • [43]. Yousef A, Deliyski DD, Zacharias SR and Naghibolhosseini M, “Detection of vocal fold image obstructions in high-speed videoendoscopy during connected speech in adductor spasmodic dysphonia: a convolutional neural networks approach,” Journal of Voice, pp. S0892-1997(22)00027-3, 2022. Online ahead of print.
  • [44]. Naghibolhosseini M, Zacharias SR, Zenas S, Levesque F and Deliyski DD, “Laryngeal imaging study of glottal attack/offset time in adductor spasmodic dysphonia during connected speech,” Applied Sciences, vol. 13, no. 5, p. 2979, 2023.
  • [45]. Olthoff A, Woywod C and Kruse E, “Stroboscopy versus high-speed glottography: a comparative study,” The Laryngoscope, vol. 117, no. 6, pp. 1123–1126, 2007.
  • [46]. Popolo PS, “Investigation of flexible high-speed video nasolaryngoscopy,” J Voice, vol. 32, no. 5, pp. 529–537, 2018.
  • [47]. Yousef AM, Deliyski DD, Zacharias SR, de Alarcon A, Orlikoff RF and Naghibolhosseini M, “A deep learning approach for quantifying vocal fold dynamics during connected speech using laryngeal high-speed videoendoscopy,” Journal of Speech, Language, and Hearing Research, vol. 65, no. 6, pp. 2098–2113, 2022.
  • [48]. Yousef AM, Deliyski DD, Zacharias SRC, de Alarcon A, Orlikoff RF and Naghibolhosseini M, “Spatial segmentation for laryngeal high-speed videoendoscopy in connected speech,” J Voice, vol. 37, no. 1, pp. 26–36, 2023.
  • [49]. Kist AM and Döllinger M, “Efficient biomedical image segmentation on EdgeTPUs at point of care,” IEEE Access, vol. 8, pp. 139356–139366, 2020.
  • [50]. Fehling MK, Grosch F, Schuster ME, Schick B and Lohscheller J, “Fully automatic segmentation of glottis and vocal folds in endoscopic laryngeal high-speed videos using a deep convolutional LSTM network,” PLoS ONE, vol. 15, no. 2, e0227791, 2020.
  • [51]. Naghibolhosseini M, Yousef AM, Zayernouri M, Zacharias SR and Deliyski DD, “Deep learning for high-speed laryngeal imaging analysis,” in Proceedings of the 3rd International IEEE Conference on Computational Intelligence and Knowledge Economy (ICCIKE), Amity University, Dubai, UAE, 2023.
  • [52]. Kist A, Gómez P, Dubrovskiy D, Schlegel P, Kunduk M, Echternach M, Patel R, Semmler M, Bohr C, Dürr S, Schützenberger A and Döllinger M, “A deep learning enhanced novel software tool for laryngeal dynamics analysis,” Journal of Speech, Language, and Hearing Research, pp. 1–15, 2021.
  • [53]. Yousef AM, Deliyski DD, Zacharias SR, de Alarcon A, Orlikoff RF and Naghibolhosseini M, “A hybrid machine-learning-based method for analytic representation of the vocal fold edges during connected speech,” Applied Sciences, vol. 11, no. 3, pp. 1179.e1–1179.e15, 2021.
  • [54]. Yousef AM, Deliyski DD, Zayernouri M, Zacharias SR and Naghibolhosseini M, “Vocal fold detective edge analysis in high-speed videoendoscopy during running speech in adductor spasmodic dysphonia,” in Proceedings of the 15th International Conference on Advances in Quantitative Laryngology, Voice and Speech Research (AQL), Phoenix, AZ, 2023 March 30–April 1.
  • [55]. Gómez P, Kist AM, Schlegel P, Berry DA, Chhetri DK, Dürr S, Echternach M, Johnson AM, Kniesburges S, Kunduk M, Maryn Y, Schützenberger A, Verguts M and Döllinger M, “BAGLS, a multihospital benchmark for automatic glottis segmentation,” Scientific Data, vol. 7, no. 1, p. 186, 2020.
  • [56]. Kist AM, Zilker J, Gómez P, Schützenberger A and Döllinger M, “Rethinking glottal midline detection,” Scientific Reports, vol. 10, 20723, 2020.
  • [57]. Yousef A, Deliyski D, Zacharias S, de Alarcon A, Orlikoff R and Naghibolhosseini M, “Automated detection and segmentation of glottal area using deep-learning neural networks in high-speed videoendoscopy during connected speech,” in Proceedings of the 14th International Conference on Advances in Quantitative Laryngology, Voice and Speech Research (AQL), Bogotá, Colombia, 2021 June 7–10.
  • [58]. Naghibolhosseini M, Henry T, Yousef A, Zayernouri M, Zacharias S and Deliyski D, “Applications of machine learning for vocal fold motion analysis using laryngeal high-speed videoendoscopy,” in Proceedings of the 10th Convention of the European Acoustics Association, Politecnico di Torino, Turin, Italy, 2023. Accepted for publication.
  • [59]. Murtola T, Malinen J, Geneid A and Alku P, “Analysis of phonation onsets in vowel production, using information from glottal area and flow estimate,” Speech Communication, vol. 109, pp. 55–65, 2019.
  • [60]. Ikuma T, Kunduk M and McWhorter AJ, “Objective quantification of pre- and postphonosurgery vocal fold vibratory characteristics using high-speed videoendoscopy and a harmonic waveform model,” Journal of Speech, Language, and Hearing Research, vol. 57, no. 3, pp. 743–757, 2014.
  • [61]. Koike Y, “Experimental studies on vocal attack,” Practica Oto-Rhino-Laryngologica, vol. 60, no. 8, pp. 663–688, 1967.
  • [62]. Baken RJ and Orlikoff RF, “Vocal fold adduction time estimated from glottographic signals,” in The 25th Mid-Winter Meeting of the Association for Research in Otolaryngology, St. Petersburg, FL, February 1998.
  • [63]. Casper J and Leonard R, Understanding Voice Problems: A Physiological Perspective for Diagnosis and Treatment, Lippincott Williams & Wilkins, 2006.
  • [64]. Yousef AM, Deliyski DD, Zacharias SR and Naghibolhosseini M, “Deep-learning-based representation of vocal fold dynamics in adductor spasmodic dysphonia during connected speech in high-speed videoendoscopy,” Journal of Voice, pp. S0892-1997(22)00263-6, 2022. Online ahead of print.
  • [65]. Ronneberger O, Fischer P and Brox T, “U-Net: Convolutional networks for biomedical image segmentation,” in Int. Conf. Med. Image Comput. Comput.-Assist. Interv. (MICCAI), 2015.
  • [66]. De Biase NG, Korn GP, Lorenzon P, Padovani M, Moraes M, Madazio G and Vilanova LCP, “Dysphonia severity degree and phonation onset latency in laryngeal adductor dystonia,” Journal of Voice, vol. 24, no. 4, pp. 406–409, 2010.
