Published in final edited form as: J Voice. 2020 Nov 27:S0892-1997(20)30408-2. doi: 10.1016/j.jvoice.2020.10.017

Spatial Segmentation for Laryngeal High-Speed Videoendoscopy in Connected Speech

Ahmed M Yousef*, Dimitar D Deliyski*, Stephanie RC Zacharias†, Alessandro de Alarcon§, Robert F Orlikoff, Maryam Naghibolhosseini*

Abstract

Summary: Objective.

This study proposes a new computational framework for automated spatial segmentation of the vocal fold edges in high-speed videoendoscopy (HSV) data during connected speech. This spatio-temporal analytic representation of the vocal folds enables the HSV-based measurement of the glottal area waveform and other vibratory characteristics in the context of running speech.

Methods.

HSV data were obtained from a vocally normal adult during production of the “Rainbow Passage.” An algorithm based on an active contour modeling approach was developed for the analysis of the HSV data. The algorithm was applied to a series of HSV kymograms at different intersections of the vocal folds to detect the edges of the vibrating vocal folds across the frames. This edge detection method follows a set of deformation rules for the active contours to capture the edges of the vocal folds through an energy optimization procedure. The detected edges in the kymograms were then registered back to the HSV frames. Subsequently, the glottal area waveform was calculated based on the area of the glottis enclosed by the vocal fold edges in each frame.

Results.

The developed algorithm successfully captured the edges of the vocal folds in the HSV kymograms. This method led to an automated measurement of the glottal area waveform from the HSV frames during vocalizations in connected speech.

Conclusion.

The proposed algorithm serves as an automated method for spatial segmentation of the vocal folds in HSV data in connected speech. This study is one of the initial steps toward developing HSV-based measures to study vocal fold vibratory characteristics and voice production mechanisms in norm and disorder in the context of connected speech.

Keywords: Connected Speech, Voice Assessment, High-Speed Videoendoscopy, Laryngeal Imaging, Spatial Segmentation, Glottal Area Waveform

INTRODUCTION

Laryngeal imaging is an essential tool in clinical voice assessment, used to observe and evaluate the laryngeal structure and function during voice production.1–4 The laryngeal imaging technique currently in widespread clinical use is videostroboscopy.5–7 In videostroboscopy, an endoscope coupled with a stroboscopic light source provides video recordings of the laryngeal structures. Although videostroboscopy is used during connected speech, it can only capture gross laryngeal adjustments; the functional assessment of vocal fold vibration using videostroboscopy is limited to sustained vocalizations.8–11 In addition to this constraint, videostroboscopy has limitations in the assessment of aperiodic vibrations of the vocal folds, which are commonly present in voice disorders.12,13 The recent advancement of coupling flexible fiberoptic endoscopes with laryngeal high-speed videoendoscopy (HSV) overcomes these limitations by capturing the true vibrations of the vocal folds, since HSV does not rely on the periodicity of the acoustic signal.12–14 Thus, HSV is a powerful tool that enables the study of vocal function and its underlying physiological and physical mechanisms with high temporal resolution in ways not possible with videostroboscopy.12,13,15,16

HSV provides a unique possibility to measure vocal fold vibrations with high temporal resolution during connected speech.17–21 Vocal abnormalities often reveal themselves not in sustained phonation but in running speech.22–27 Utilizing HSV in connected speech is therefore likely to serve as an invaluable tool, providing detailed information about the intra-cycle vibratory characteristics of the vocal folds as well as nonstationary events such as voicing onsets, offsets, and breaks.17,20,21,28–30 However, HSV recordings of connected speech produce large datasets, and navigating through this amount of information is not feasible in clinical settings. Thus, there is a need to develop automated techniques and algorithms for such big-data analysis and for the extraction of clinically relevant information from the vocal fold vibrations.

The analytic extraction of the edges of the vibrating vocal folds in connected speech would facilitate the development of HSV-based measures of vocal fold vibration. Previously, we presented a study on using HSV in connected speech, in which we developed a fully automated temporal segmentation method to extract the timestamps for onsets and offsets of vocalizations and for epiglottic obstructions of the vocal folds.17 In that study, we also estimated a temporal correlate of the glottal area waveform (GAW).17 A dedicated spatial segmentation method is needed to analytically segment, and register over time, the edges of the vocal folds for more accurate measurement of glottal attack time, glottal offset time, GAW, vocal fold contact, phase asymmetry, and other essential spatio-temporal characteristics of vocal fold behavior. Several approaches have been developed and applied for spatial segmentation of the glottal area in HSV during sustained phonation.31–35 The main methods for extracting the glottal area from HSV data are region growing,32,36,37 histogram thresholding,33,38 level sets,39,51 and active contour modeling.34,35,50 These methods were used for HSV analysis of isolated sustained vowels. Most of the developed methods are not fully automated and require visual inspection of the data and some manual analysis.34,36,39–41 Region growing and histogram thresholding are both vulnerable to image noise and intensity inhomogeneity.31 The level set method can accurately estimate the glottal cycle only if the vocal folds fully close during adduction, and it is also prone to noise.39 The active contour modeling approach, however, is less sensitive to noise and intensity inhomogeneity in images, can be initialized anywhere, even across boundaries, and efficiently preserves global line shapes.42,43 Hence, this approach is a promising technique for spatial segmentation of the glottal area.44

The active contour modeling method can be used to dynamically locate the contour of desired image features, such as the edges of the glottis, and therefore can be used for the GAW measurement. The active contour (also known as a snake) is a spline that deforms based on certain energy minimization rules to capture the glottal edge. The active contour modeling approach has been employed to detect the glottal edges (1) spatially in each HSV frame, using closed-loop snakes for each individual frame32,34,50; (2) temporally in HSV kymograms, using two open-curve snakes to detect the right and left vocal folds35,45; and (3) spatio-temporally, where open-curve snakes detect the glottal edges in HSV kymograms across different cross sections of the vocal folds and the extracted edges are then registered back to each HSV frame.35 The existing studies on spatial segmentation of glottal edges were performed on HSV data obtained during sustained vowels, not during connected speech.

The goal of this study is to present a spatial segmentation method based on the active contour modeling approach to provide an analytic representation of the edges of the vocal folds and perform a GAW measurement in HSV data during connected speech. The proposed computational algorithm in this work is fully automated and performs a spatio-temporal analysis of the HSV data to detect the glottal edges during vocal fold vibrations.

METHODS

Data acquisition

The HSV data were recorded from a vocally normal female (38 years of age) who did not have any history of voice disorder, while reading the “Rainbow Passage.” The examination was done at the Center for Pediatric Voice Disorders, Cincinnati Children’s Hospital Medical Center and was approved by the Institutional Review Board. A custom-built color HSV system with 4,000 frames per second (fps) and 249 μs integration time was used for the data acquisition. The recording length was 29.14 s (116,543 frames in total) with HSV image resolution of 256 × 256 pixels. The recording system included a FASTCAM SA-Z color high-speed camera (Photron Inc., San Diego, CA) coupled with a 3.6-mm Olympus ENF-GP Fiber Rhinolaryngoscope (Olympus Corporation, Tokyo, Japan), and a 300-W xenon light source, model 7152A (PENTAX Medical Company, Montvale, NJ). The camera had a 12-bit color image sensor with sensitivity of ISO 20,000 and 64 GB of cache memory divided into two 32-GB partitions. The selected zoom lens adapter had a focal distance of 45 mm in order to provide the optimal pixel representation and dynamic range. The distance of the endoscope to the vocal folds was selected to ensure that despite the active maneuvers of the larynx during connected speech, the vocal folds always fall within the field of view of the endoscope during the recording. The recorded HSV sequence was saved as an uncompressed 24-bit RGB AVI file and then analyzed.

The camera used for the data collection had a native resolution of 1,024 × 1,024 pixels at 20,000 fps. However, for the purposes of the study, the resolution was set to 256 × 256 pixels, which at the chosen speed of 4,000 fps provided for up to 30 seconds per partition to record the reading of the Rainbow Passage. The selected frame rate was shown to be clinically acceptable for voice assessment.46 Moreover, using 256 × 256 pixels provided the optimal balance between the image resolution, camera frame rate, duration of the recording, and the light sensitivity necessary for this data collection.

Data analysis

Prior to the automated spatial segmentation, several image processing steps were implemented. The first step was to apply a temporal segmentation method17 to extract the vocalizations in the HSV recording. The second step was to apply an algorithm for motion compensation to align the vibrating vocal folds across the frames during each vocalization.20,47 Next, the HSV kymograms were extracted at all cross sections of the vocal folds. The spatial segmentation was then performed to analytically describe the vocal fold edges in the kymograms using an active contour modeling approach. Subsequently, the segmented vocal fold edges were registered back to the HSV frames and the GAW was computed across the frames for each vocalization. Each of these processing steps is elaborated in the following subsections.

Temporal segmentation

A temporal segmentation algorithm we developed previously17 was applied to the HSV data to automatically extract timestamps of the vocalized segments (vibratory onsets/offsets) and epiglottic obstruction of the glottis. The unobstructed vocalized segments of the HSV recordings were selected for further processing.

Motion compensation

A previously developed denoising algorithm, along with a motion compensation technique,20 was applied to each vocalized segment of the HSV recording. The goal of this step was to capture the spatial location of the vibrating vocal folds in a motion window across the frames. The motion compensation was performed mainly to overcome the spatio-temporal misalignment of the vocal folds caused by laryngeal maneuvers during running speech and by the movements of the endoscopic tip relative to the laryngeal tissues over time. After detecting the location of the vibrating vocal folds across the frames, the HSV frames were cropped based on the location of the center of the motion window, designed in an ellipse shape (see20 for a complete description of the motion window). The size of the cropped frames was based on the horizontal and vertical second moments of inertia after applying the motion window to each HSV frame and removing the irrelevant noise and tissues. This step was essential for extracting the kymograms inside a rectangular window that enclosed the vibrating vocal folds in each frame.
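To illustrate the crop-sizing idea only (the complete motion window procedure is detailed in ref. 20), the following minimal Python sketch derives a crop from the horizontal and vertical second moments of inertia of a weight map assumed to be high over the vocal-fold region after the elliptical motion window and noise removal are applied; the function name and the scale factor k are our own assumptions, not the authors' implementation:

```python
import numpy as np

def crop_from_second_moments(weights, k=2.0):
    """Hypothetical sketch: derive a crop center and half-sizes from the
    horizontal/vertical second moments of inertia of `weights`, a 2-D map
    that is high inside the motion window (vocal-fold region)."""
    ys, xs = np.indices(weights.shape)
    w = weights / weights.sum()
    cy, cx = (w * ys).sum(), (w * xs).sum()   # center of the window
    m2_v = (w * (ys - cy) ** 2).sum()         # vertical second moment
    m2_h = (w * (xs - cx) ** 2).sum()         # horizontal second moment
    half_h, half_w = k * np.sqrt(m2_v), k * np.sqrt(m2_h)
    return (int(cy), int(cx)), (int(half_h), int(half_w))
```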

Kymogram extraction

The HSV kymograms were extracted for the frames between 25 ms before the onset and 25 ms after the offset of each vocalization to ensure that the full pre-, post-, and periphonatory phases were included. For each vocalized segment, the first kymogram was extracted at the mid-line passing through the vibrating vocal folds inside the motion window. The spatial segmentation was then applied to this kymogram to detect the glottal edges. Kymograms were subsequently extracted at different cross sections of the vocal folds between the anterior and posterior commissures, and the spatial segmentation algorithm, explained below, was applied to each kymogram.
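As a minimal sketch of the kymogram construction (assuming the cropped frames are stored as a NumPy array with the anterior-posterior axis along image rows; the function name and the commissure row variables are ours):

```python
import numpy as np

def extract_kymogram(frames, row):
    """Stack the pixel line at one cross section (image row) over time:
    the result has Kh rows (left-right dimension) and Kw columns (frames).
    frames: array of shape (T, H, W) or (T, H, W, 3)."""
    return np.swapaxes(frames[:, row], 0, 1)

# One kymogram per cross section between the (hypothetical) commissure rows:
# kymos = [extract_kymogram(frames, r) for r in range(r_anterior, r_posterior)]
```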

Spatial segmentation

The spatial segmentation algorithm was based on an active contour modeling approach. The goal of the active contour technique is to extract a specific feature in the image; in this work, the feature of interest was the glottal edge in the kymogram image. To do so, a line, or active contour (also called a snake), should be drawn close to the glottal edge. The snake should be close enough to the edges of interest; otherwise, it might be attracted to other edges present in the kymogram. In each kymogram, a pair of open-curve snakes (left and right active contours) was used to capture the left and right edges of the glottis. The initialization of the snakes’ locations is an important step toward accurate edge detection. Therefore, an algorithm for automated snake initialization was developed prior to applying the active contour modeling approach.

Automated snake initialization.

The initial snake should pass through the center of the glottis. The snake initialization was performed for each kymogram $n_i$ using the first moment of inertia, denoted by $M_1(y, n_i)$ and calculated as follows48:

$$M_1(y, n_i) = \frac{\sum_{x=1}^{K_w} \sum_{y=1}^{K_h} I(x, y, n_i)\, y}{\sum_{x=1}^{K_w} \sum_{y=1}^{K_h} I(x, y, n_i)}, \quad (1)$$

where x and y correspond to the spatial coordinates of a pixel, $I(x, y, n_i)$ is the kymogram image, $K_w$ is the kymogram image width (the number of frames in the kymogram), and $K_h$ is the kymogram image height in pixels. The green channel is less prone to noise and was therefore used for the calculation of the first moment. Since the first moment of inertia determines the center of brightness, the kymogram was inverted to find the center of darkness (ie, the centroid of the glottis). The first moment of inertia was then adjusted to find a more accurate line passing through the center of the glottis. This adjustment step was needed because the first moment of inertia was vulnerable to the high level of noise in the present HSV data and did not always yield an accurate line. The adjustment was performed using the gradient information of the kymogram image around the moment line. The gradient was computed in each kymogram vertically (along the left-right dimension of the image) with a step size of 10 pixels. The negative gradient can represent the glottis, since the pixels located inside the glottal area are darker than the neighboring pixels outside it. Hence, the summation of the negative gradient values was calculated horizontally (across time) at seven different lines just above and below the moment line. The pixel with the largest summation value of the negative gradient for each frame was considered the center point for that frame. The center points across the frames formed the snake initialization line, which was used for initialization of the right and left snakes.
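A minimal Python sketch of this initialization, under one plausible reading of the adjustment step (the seven candidate lines are taken as the moment line ±3 pixels, and the 10-pixel step is implemented as a coarse finite difference; parameter names are ours):

```python
import numpy as np

def initialize_snake(kymo_rgb, halfwin=3, step=10):
    """Sketch of the automated snake initialization (Eq. 1 + adjustment)."""
    g = kymo_rgb[..., 1].astype(float)            # green channel (less noisy)
    inv = g.max() - g                             # invert: glottis is dark
    rows = np.arange(g.shape[0])[:, None]
    m1 = int(round((inv * rows).sum() / inv.sum()))   # first moment of inertia

    grad = np.zeros_like(g)                       # vertical gradient, 10-px step
    grad[:-step, :] = (g[step:, :] - g[:-step, :]) / step
    neg = np.minimum(grad, 0.0)                   # keep the negative gradient

    lo = max(m1 - halfwin, 0)                     # 7 candidate lines around m1
    hi = min(m1 + halfwin + 1, g.shape[0])
    centers = lo + np.argmin(neg[lo:hi, :], axis=0)  # strongest negative per frame
    return centers                                # snake initialization line
```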

Active contour modeling.

The active contour modeling was applied to the kymograms to detect the left and right edges of the vocal folds. The snake deformed spatially in the kymogram image according to certain energy optimization rules. The optimization process aimed to minimize an energy function that was the sum of an external and an internal energy function. The external energy was derived from the spatial gradient of the image to highlight the desired feature (ie, the glottal edges). The internal energy controlled the shape and deformation of the snake.44 The snake spline can be parametrized as a vector v(s) = [x(s), y(s)], where s ∈ [0, 1]. The objective energy function to be minimized was defined as:44,49

$$E = \int_{0}^{1} \left[ E_{int}(v(s)) + E_{image}(v(s)) \right] ds, \quad (2)$$

where $E_{int}$ is the internal energy function and $E_{image}$ is the external image energy function. The internal energy function $E_{int}$ of the spline, acting as a soft constraint that keeps the shape of the snake smooth, was computed by:

$$E_{int}(v(s)) = \frac{1}{2} \left[ \alpha(s) \left| v'(s) \right|^2 + \beta(s) \left| v''(s) \right|^2 \right], \quad (3)$$

where v′(s) and v″(s) are the first and second derivatives of the snake, respectively, and α and β are two weights that adjust the elasticity and rigidity of the snake, respectively, thereby controlling its shape. The two weights were considered constant and were set to α = 0.1 and β = 0.03.

The image energy function (Eimage) counterbalances the internal energy and was given by:

$$E_{image}(v(s)) = -\left| \nabla I(x, y) \right|^2, \quad (4)$$

where ∇I(x, y) is the spatial gradient of the kymogram image.

The solution for Eq. (2) was based on discretization of the energy function. The finite difference method was used to approximate the first and second derivatives in the internal energy function. The internal energy function, given in Eq. (3), can be rewritten as follows, after the discretization:

$$E_{int}(v_i) = \frac{1}{2} \left[ \alpha \left| v_i - v_{i-1} \right|^2 + \beta \left| v_{i+1} - 2v_i + v_{i-1} \right|^2 \right], \quad (5)$$

where vi refers to the ith snaxel; the snaxels are the vertices that make up the snake spline. The discretization of the image energy function, given in Eq. (4), yielded:

$$E_{image}(v_i) = -\left| \nabla I(v_i) \right|^2. \quad (6)$$
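The two discretized energies translate directly into code. Below is a minimal sketch using the study's weights; the gradient array, function names, and the convention of one snaxel per kymogram column are our own assumptions:

```python
def e_int(v_prev, v, v_next, alpha=0.1, beta=0.03):
    """Internal energy at one snaxel, Eq. (5)."""
    return 0.5 * (alpha * (v - v_prev) ** 2
                  + beta * (v_next - 2.0 * v + v_prev) ** 2)

def e_img(grad, i, v):
    """Image energy, Eq. (6): negative squared gradient at column i, row v."""
    return -float(grad[int(v), i]) ** 2
```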

By discretizing the total energy function, given in Eq. (2), the following equation was derived, which can be considered as a dynamic-programming problem:49

$$E_{total} = \sum_{i=1}^{n} \left[ E_{int}(v_i) + E_{image}(v_i) \right], \quad (7)$$

where n is the total number of snaxels, which equals the number of frames included in the kymogram. The solution to the above dynamic-programming problem generated a sequence of functions $\{S_i\}_{i=1}^{n-1}$, each with one unknown variable $v_i$, where $S_i$ is the optimum value function. The $S_i$ functions for several snaxels are shown in Eq. (8). At each $S_i$, a minimization was conducted over the variable $v_i$; $v_i$ is a state variable that can be assigned m possible values, where m is the number of pixels in the neighborhood of the snaxel searched by the algorithm to find the optimal $v_i$. In this study, m was set to 5. Note that in Eq. (8), $E_{int}(v_n) = E_{int}(v_{n-1}, v_n, v_{n+1})$.

$$\begin{aligned}
S_1(v_1) &= \min_{v_1} \left[ E_{int}(v_0, v_1, v_2) + E_{image}(v_1) \right],\\
S_2(v_2) &= \min_{v_2} \left[ S_1(v_1) + E_{int}(v_1, v_2, v_3) + E_{image}(v_2) \right],\\
S_3(v_3) &= \min_{v_3} \left[ S_2(v_2) + E_{int}(v_2, v_3, v_4) + E_{image}(v_3) \right],\\
&\;\;\vdots\\
S_n(v_n) &= \min_{v_n} \left[ S_{n-1}(v_{n-1}) + E_{int}(v_n) + E_{image}(v_n) \right]. \quad (8)
\end{aligned}$$

And in the general case,

$$S_i(v_i) = \min_{v_i} \left[ \sum_{j=1}^{i-2} E_{int}(v_j) + \sum_{j=1}^{i-1} E_{image}(v_j) + E_{int}(v_{i-1}) + E_{int}(v_i) + E_{image}(v_i) \right]. \quad (9)$$

The image gradient ∇I was calculated in the vertical direction (along the left-right dimension of the kymogram image) in each frame with a step size of 10 pixels. The gradient was computed for each kymogram to signify the glottal edges, where the intensity changes rapidly. Accordingly, the movement of each snaxel of the left and right active contours was limited to the vertical direction. The positive and negative gradients were computed next: the positive gradient of the kymogram was obtained by setting the negative-gradient pixels to zero, and the negative gradient was obtained by setting the positive-gradient pixels to zero. The positive gradient was used in the image energy function when searching for the left vocal fold edge, while the negative gradient was used when searching for the right vocal fold edge.
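Continuing the earlier sketch (g is the green-channel kymogram as a float array; variable names are ours):

```python
import numpy as np

step = 10
grad = np.zeros_like(g)                     # vertical gradient, 10-px step
grad[:-step, :] = (g[step:, :] - g[:-step, :]) / step
grad_pos = np.where(grad > 0, grad, 0.0)    # negative pixels zeroed: left fold
grad_neg = np.where(grad < 0, grad, 0.0)    # positive pixels zeroed: right fold
```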

To find the snakes that captured the left and right vocal fold edges, the discretized energy function in Eq. (9) was solved for i = 1, 2, 3, …, n. The minimization of the $S_i$ functions was done recursively, and the snake was updated during each loop until its position remained almost unchanged through the minimization procedure. During each loop and at each snaxel, the value of the $S_i$ function was numerically calculated for the snaxel’s five column-wise neighboring pixels. The neighboring pixel that led to the minimum total energy was taken as the updated value of $v_i$. After all $v_i$ values were updated, the next loop started, and the process continued until the algorithm converged and the optimum snake spline was found. Convergence was determined by an error function, defined as the sum of squared differences between the snake positions in the current and previous loops; the algorithm converged when this error became smaller than or equal to 1.
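The loop below is a condensed sketch of this procedure using the e_int and e_img helpers defined earlier. It applies the per-snaxel five-neighbor update and the stopping rule directly (a greedy variant for illustration; the study solves Eq. (9) with the dynamic-programming recursion of Amini et al.49):

```python
import numpy as np

def fit_snake(grad, v_init, m=5, max_iter=200):
    """Iteratively move each snaxel to the best of its m column-wise
    neighbors until the sum of squared position changes is <= 1."""
    v = np.asarray(v_init, dtype=float).copy()   # one row position per frame
    offsets = np.arange(m) - m // 2              # candidate moves: -2..+2
    for _ in range(max_iter):
        v_new = v.copy()
        for i in range(len(v)):
            v_prev = v_new[i - 1] if i > 0 else v[i]
            v_next = v[i + 1] if i < len(v) - 1 else v[i]
            best_row, best_e = v[i], np.inf
            for d in offsets:
                row = v[i] + d
                if not 0 <= row < grad.shape[0]:
                    continue                     # stay inside the kymogram
                e = e_int(v_prev, row, v_next) + e_img(grad, i, row)
                if e < best_e:
                    best_e, best_row = e, row
            v_new[i] = best_row
        if ((v_new - v) ** 2).sum() <= 1.0:      # convergence criterion
            return v_new
        v = v_new
    return v

# left = fit_snake(grad_pos, centers); right = fit_snake(grad_neg, centers)
```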

The vocal fold edges were analytically described for each vocalized segment of the “Rainbow Passage.” The anterior and posterior commissures were detected automatically: if no glottal area was present in the extracted kymogram, the active contour method returned a snake similar to the snake initialization line, which specified the location of the anterior/posterior commissure.
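A small check in the same spirit (the tolerance is our own choice, not a value from the study):

```python
import numpy as np

def no_glottal_area(snake, init_line, tol=1.0):
    """True when the converged snake is essentially the initialization
    line, ie, no glottal area at this cross section (commissure reached)."""
    return np.mean(np.abs(np.asarray(snake) - np.asarray(init_line))) <= tol
```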

GAW

The detected edges in each frame of the kymograms were registered back to the HSV frames. Subsequently, the glottal area was measured for each frame. The glottal area was defined as the number of pixels in the enclosed area between the detected right and left edges of the vocal folds. The GAW was measured as the glottal area values across the frames for each vocalization.
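A minimal sketch of this pixel count (the array layout is our assumption: one row per cross section, one column per frame, with the right edge at the larger pixel index):

```python
import numpy as np

def glottal_area_waveform(left_edges, right_edges):
    """GAW: pixels enclosed between the registered left and right edges,
    summed over cross sections; arrays of shape (n_cross_sections, n_frames)."""
    opening = np.clip(right_edges - left_edges, 0, None)  # per-row opening
    return opening.sum(axis=0)                            # one value per frame
```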

RESULTS

Using the temporal segmentation algorithm, the timestamps of all vocalized segments of the HSV connected speech recording were extracted (except for the segments with epiglottic obstruction). Subsequently, motion compensation was applied to each vocalized segment of the “Rainbow Passage” to capture the location of the vibrating vocal folds across the frames. The results of applying the motion window to three individual frames during five different vocalizations are depicted in Figure 1a–e. Each row in Figure 1 shows three frames for a different vocalization between the following frame numbers: 40,505–41,255 (Figure 1a), 42,975–43,815 (Figure 1b), 84,281–84,891 (Figure 1c), 103,942–104,577 (Figure 1d), and 109,548–110,363 (Figure 1e). The individual frame numbers are shown on top of the HSV frames in each figure panel. The figure shows that the implemented motion window captures both the location and the size of the vibrating vocal folds in different frames.

FIGURE 1.

HSV frames along with the applied motion windows for three different frames at five different vocalized segments (panels a–e).

The kymograms were extracted for each vocalization of the “Rainbow Passage” after aligning the vocal folds across the frames using the motion compensation method. Examples of the extracted kymograms at the medial section of the vocal folds are shown in Figure 2. The kymograms for five different vocalizations are shown in panels a–e between the same frame numbers as in Figure 1. The onset and offset of phonation can be seen in each kymogram, and the darker glottal area lies almost on a straight line across the frames of each kymogram.

FIGURE 2.

HSV kymograms at the medial section of the vocal folds for five different vocalized segments (panels a–e). The L and R on the y-axis refer to the left and right vocal folds, respectively.

A cropped frame is shown in Figure 3a; the four dashed lines show the locations of the intersections for which the kymograms are plotted in panels b–e. The first-moment-of-inertia lines (solid magenta) and the adjusted moment lines (dashed cyan), which correspond to the center of the glottis, are shown for each kymogram. All the kymograms are extracted from the same vocalized segment during a sustained phonation over 200 frames (frames 40,768–40,967). As seen, the adjusted lines span the glottis center more accurately than the moment lines. The adjustment of the moment line becomes more critical as the image becomes blurrier, as in Figure 3d and e.

FIGURE 3.

Cropped HSV frame (panel a) and four kymograms at the four intersections of the vocal folds (panels b–e). The first moment of inertia (solid magenta line) and the adjusted snake initialization line (dashed cyan line) are shown for the four kymograms between frames 40,768 and 40,967. The L and R on the y-axis refer to the left and right vocal folds, respectively. For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.

After the snake initialization, the active contour algorithm was applied to the kymograms. The upper and lower snakes (active contours) corresponding to the left and right vocal folds for the four kymograms in Figure 3 are shown in Figure 4. The figure spans 546 frames, between frames 40,585 and 41,165, including the voicing onset, vocal fold vibration, and voicing offset. Two zoomed-in image segments are included in Figure 4 to better visualize the performance of the algorithm. As seen, the active contour approach detects both the left and right vocal fold edges (solid green and yellow lines, respectively) at different cross sections, providing an analytical representation of the glottal edges. Moreover, the algorithm is able to capture the edges before the phonation starts and after it ends.

FIGURE 4.

Kymograms between frames 40,585 and 41,165 at four different cross sections of the vocal folds (panels a–d), along with the upper and lower active contours (solid green and solid yellow lines, respectively) corresponding to the left and right vocal folds. Two zoomed-in image segments are included to better visualize the performance of the algorithm. The L and R on the y-axis refer to the left and right vocal folds, respectively. For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.

After the edges of the vocal folds were detected in the kymograms at different cross sections across the frames, the edges were registered back to the HSV frames. Video S1 demonstrates the detected edges of the vocal folds during the first vocalization of the “Rainbow Passage” (692 frames covering the voicing onset, sustained phonation, and voicing offset). The top panel of the video shows the HSV sequence between frames 43,030 and 43,721. Both the left and right edges of the vocal folds, extracted using the active contour modeling approach, are plotted on each HSV frame in green and yellow, respectively. The corresponding GAW is shown in the bottom panel of the video. The red star moving along the GAW corresponds to the glottal area of the HSV frame in the top panel. As can be seen, the registered edges of both the left and right vocal folds synchronously track the true spatial location of the vibrating vocal folds across the HSV frames.

The computed GAW for five different vocalizations of the “Rainbow Passage,” between frame numbers 40,505–41,255, 42,975–43,815, 84,281–84,891, 103,942–104,577, and 109,548–110,363, is shown in Figure 5a–e. The change in the glottal area during the onset and offset of phonation and during sustained phonation can be seen in each panel. The moments with GAW values of zero correspond to vocal fold contact.

FIGURE 5.

The computed GAW at five different vocalized segments of the “Rainbow Passage.”

DISCUSSION

The temporal segmentation was successfully utilized to determine the vocalized segments of the “Rainbow Passage.” Subsequently, the motion compensation precisely located the vibrating vocal folds across the frames during the extracted vocalizations. After applying the motion compensation and determining the location of the vibrating vocal folds, digital kymograms were successfully extracted at various intersections of the vocal folds. The vibrating vocal folds always appeared on an almost straight line in the extracted kymograms, which was necessary for a better performance of the spatial segmentation algorithm.

The automated snake initialization tool was successfully developed and accurately located a line spanning the glottis center in the extracted kymograms. The adjusted moment line was introduced because the results revealed how vulnerable the first-moment-of-inertia line was to the noise in the kymogram image. Based on the results, the proposed adjusted moment line provided a better estimate of the center of the glottis than the first-moment-of-inertia line alone. Obtaining an accurate snake initialization line was a necessary step toward better performance and convergence of the active contour modeling algorithm.

The active contour technique was successfully implemented on the kymograms of the vocalized segments of the “Rainbow Passage.” The application of the active contour method allowed the analytic representation of the vocal fold edges at different cross sections of the vocal folds from the anterior to the posterior commissure. The algorithm performed well despite the challenging quality of the HSV images. Of the 76 vocalizations of the “Rainbow Passage,” visual comparison of the detected edges with the HSV kymograms showed that the algorithm’s error was no more than one pixel for 67 vocalizations (88%), deeming it successful for precise detection of the glottal boundaries. For the remaining nine vocalizations, dim lighting in some of the frames prevented the active contour model from finding the glottal edges; visual inspection could not determine the glottal edges in these kymograms either, owing to the same lighting issue.

This study showed the feasibility of automatic edge detection using the proposed method on challenging data obtained with a color high-speed camera. Clinical specialists prefer color images over monochrome images because color allows them to evaluate the health of the tissues while observing the vibrations of the vocal folds. Therefore, this study used a color high-speed camera to demonstrate that the proposed algorithm can be applied to color images. Moreover, the goal of this study was to develop an algorithm that works under the most challenging conditions (ie, color images). Despite the edge uncertainties in the color images, the paired active contours were not attracted to erroneous edges and maintained optimal rigidity. Since this work shows the robustness of the spatial segmentation method under the most challenging conditions, applying the method to HSV data recorded with a monochrome high-speed camera can be expected to provide even better segmentation results, owing to the inherently higher dynamic range (image quality) of monochrome images and their significantly more accurate representation of the edge gradients. The application of this method to monochrome HSV recordings, and the evaluation of its accuracy and reliability on a larger sample, is part of an ongoing study that will be presented in the future.

After the segmented edges in the kymograms were registered back to the HSV frames, visual inspection of the results showed that the implemented active contour modeling successfully detected the edges of the vibrating vocal folds across the frames during each vocalized segment. The algorithm had some difficulty finding the edges before the oscillation started and after it stopped, and the registered vocal fold edges infrequently deviated from the actual location near the anterior commissure. This could be due to the deviation of the motion window from the location of the vocal folds before the oscillations began and after they stopped; because of this deviation, the vibrating vocal folds were not exactly at the center of the motion window. However, this was not an issue in this study, since the main goal was to capture the vocal fold edges during vocalization. In the future, we aim to develop a hybrid edge detection model using machine learning to also detect the edges of the vocal folds when there is no oscillation. This would be useful for studying laryngeal tissue maneuvers during the pre- and postphonatory phases in connected speech.

The GAW was computed successfully from the segmented vocal fold edges across the frames for each vocalization of the “Rainbow Passage.” The calculated GAW captured not only the fluctuations in the glottal area during vocal fold vibration but also the exact area fluctuations at the onset and offset of the oscillations. The development of spatial segmentation of the vocal fold edges during connected speech is important mainly to enable the measurement of glottal attack and offset times, as well as various measures related to vocal fold contact and asymmetry during spontaneous speech production. Nevertheless, the GAW is a common measure used in sustained phonation that is also made available through the proposed segmentation method, and it is provided in this article for comparative purposes.

CONCLUSION

The application of HSV to study vocal fold function during running speech is a breakthrough that can lead to an improved understanding of phonatory processes in connected speech and to the advancement of clinical voice assessment protocols. Developing automated image processing techniques for the analysis and interpretation of HSV recordings is crucial; automatic spatial segmentation, in particular, is an important step toward a new era of precision laryngeal imaging measurements. This work is necessary for the automatic extraction and measurement of relevant vocal fold vibratory features. The temporal segmentation and motion compensation determined the vocalized segments and located the vibrating vocal folds for the entire “Rainbow Passage.” The developed automated algorithm for spatial segmentation successfully captured the vocal fold edges across frames for each vocalization, which allowed for accurate computation of the GAW at each frame.

In this study, the promising performance of the developed algorithm was shown for a vocally normal participant. This warrants future research applying the developed method to HSV data from a larger number of vocally normal participants and from patients with voice disorders, which will help generalize the algorithm in norm and disorder and address inter- and intra-subject reliability. Furthermore, HSV-based measures will be developed from the measured GAW for norm and disorder.

Supplementary Material

Video S1. Download video file (12.7 MB, mp4).

Acknowledgments

The authors would like to acknowledge the support from the National Institutes of Health, NIDCD, K01DC017751 “Studying the Laryngeal Mechanisms Underlying Dysphonia in Connected Speech” and NIDCD R01-DC007640 “Efficacy of Laryngeal High-Speed Videoendoscopy.”

Footnotes

Declaration of Interest: The authors declare that they have no conflict of interest.

Submission Declaration and Verification: The work described has not been published previously. Portions of this work have been presented at the 47th Annual Symposium: Care of the Professional Voice, Philadelphia, PA, May 30-June 3, 2018, and the Virtual Voice Symposium: Care of the Professional Voice, May 27–31, 2020. This manuscript is not under consideration for publication elsewhere, its publication is approved by all authors and tacitly or explicitly by the responsible authorities where the work was carried out, and if accepted, it will not be published elsewhere in the same form, in English or in any other language, including electronically without the written consent of the copyright-holder.

SUPPLEMENTARY DATA

Supplementary data related to this article can be found online at doi:10.1016/j.jvoice.2020.10.017.

REFERENCES

1. Mafee MF, Valvassori GE, Becker M. Imaging of the Neck and Head. 2nd ed. Stuttgart: Thieme; 2005.
2. Uloza V, Saferis V, Uloziene I. Perceptual and acoustic assessment of voice pathology and the efficacy of endolaryngeal phonomicrosurgery. J Voice. 2005;19:138–145.
3. Verikas A, Uloza V, Bacauskiene M, et al. Advances in laryngeal imaging. Eur Arch Otorhinolaryngol. 2009;266:1509–1520.
4. Slonimsky E. Laryngeal imaging. Operat Techn Otolaryngol Head Neck Surg. 2019;30:237–242.
5. Kitzing P. Stroboscopy–a pertinent laryngological examination. J Otolaryngol. 1985;14:151–157.
6. Bless DM, Hirano M, Feder RJ. Videostroboscopic evaluation of the larynx. Ear Nose Throat J. 1987;66:289–296.
7. Woo P, Casper J, Colton R, et al. Aerodynamic and stroboscopic findings before and after microlaryngeal phonosurgery. J Voice. 1994;8:186–194.
8. Stemple JC, Roy N, Klaben BG. Clinical Voice Pathology: Theory and Management. San Diego, CA: Plural Publishing; 2000.
9. Stojadinovic A, Shaha AR, Orlikoff RF, et al. Prospective functional voice assessment in patients undergoing thyroid surgery. Ann Surg. 2002;236:823–832.
10. Mehta DD, Hillman RE. Voice assessment: updates on perceptual, acoustic, aerodynamic, and endoscopic imaging methods. Curr Opin Otolaryngol Head Neck Surg. 2008;16:211–215.
11. Aronson AE, Bless D. Clinical Voice Disorders. New York, NY: Thieme; 2011.
12. Patel R, Dailey S, Bless D. Comparison of high-speed digital imaging with stroboscopy for laryngeal imaging of glottal disorders. Ann Otol Rhinol Laryngol. 2008;117:413–424.
13. Zacharias SRC, Myer CM, Meinzen-Derr J, et al. Comparison of videostroboscopy and high-speed videoendoscopy in evaluation of supraglottic phonation. Ann Otol Rhinol Laryngol. 2016;125:829–837.
14. Deliyski DD. Laryngeal high-speed videoendoscopy. In: Laryngeal Evaluation: Indirect Laryngoscopy to High-Speed Digital Imaging. New York: Thieme Medical Publishers; 2010:243–270.
15. Deliyski DD. Clinical feasibility of high-speed videoendoscopy. Perspectives on Voice and Voice Disorders. 2007;17:12–16.
16. Deliyski DD, Petrushev PP, Bonilha HS, et al. Clinical implementation of laryngeal high-speed videoendoscopy: challenges and evolution. Folia Phoniatr Logop. 2007;60:33–44.
17. Naghibolhosseini M, Deliyski DD, Zacharias SR, et al. Temporal segmentation for laryngeal high-speed videoendoscopy in connected speech. J Voice. 2018;32:256.e1–256.e12.
18. Zañartu M, Mehta DD, Ho JC, et al. Observation and analysis of in vivo vocal fold tissue instabilities produced by nonlinear source-filter coupling: a case study. J Acoust Soc Am. 2011;129:326–339.
19. Mehta DD, Deliyski DD, Zeitels SM, et al. Integration of transnasal fiberoptic high-speed videoendoscopy with time-synchronized recordings of vocal function. In: Normal & Abnormal Vocal Folds Kinematics: High Speed Digital Phonoscopy (HSDP), Optical Coherence Tomography (OCT) & Narrow Band Imaging. Vol 12. San Francisco, CA: Pacific Voice & Speech Foundation; 2015:105–114.
20. Naghibolhosseini M, Deliyski DD, Zacharias SR, et al. A method for analysis of the vocal fold vibrations in connected speech using laryngeal imaging. In: Manfredi C, ed. Proceedings of the 10th International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications (MAVEBA). Firenze, Italy: Firenze University Press; 2017.
21. Naghibolhosseini M, Deliyski DD, Zacharias SRC, et al. Studying vocal fold non-stationary behavior during connected speech using high-speed videoendoscopy. J Acoust Soc Am. 2018;144:1766.
22. Morrison MD, Rammage LA. Muscle misuse voice disorders: description and classification. Acta Otolaryngol. 1993;113:428–434.
23. Yiu E, Worrall L, Longland J, et al. Analysing vocal quality of connected speech using Kay’s computerized speech lab: a preliminary finding. Clin Linguist Phon. 2000;14:295–305.
24. Halberstam B. Acoustic and perceptual parameters relating to connected speech are more reliable measures of hoarseness than parameters relating to sustained vowels. ORL J Otorhinolaryngol Relat Spec. 2004;66:70–73.
25. Roy N, Gouse M, Mauszycki SC, et al. Task specificity in adductor spasmodic dysphonia versus muscle tension dysphonia. Laryngoscope. 2005;115:311–316.
26. Maryn Y, Corthals P, Van Cauwenberge P, et al. Toward improved ecological validity in the acoustic measurement of overall voice quality: combining continuous speech and sustained vowels. J Voice. 2010;24:540–555.
27. Lowell SY. The acoustic assessment of voice in continuous speech. SIG 3 Perspect Voice Voice Disord. 2012;22:57–63.
28. Brown C, Deliyski DD, Zacharias SRC, et al. Glottal attack and offset time during connected speech in adductor spasmodic dysphonia. Virtual Voice Symposium: Care of the Professional Voice. Philadelphia; 2020.
29. Naghibolhosseini M, Deliyski DD, Zacharias SRC, et al. Glottal attack time in connected speech. The 11th International Conference on Voice Physiology and Biomechanics (ICVPB). East Lansing, MI; 2018.
30. Brown C, Naghibolhosseini M, Zacharias SRC, et al. Investigation of high-speed videoendoscopy during connected speech in norm and neurogenic voice disorder. Michigan Speech-Language-Hearing Association (MSHA) Annual Conference. East Lansing, MI; 2019.
31. Koç T, Çiloğlu T. Automatic segmentation of high speed video images of vocal folds. J Appl Math. 2014;2014:1–16.
32. Lohscheller J, Toy H, Rosanowski F, et al. Clinically evaluated procedure for the reconstruction of vocal fold vibrations from endoscopic digital high-speed videos. Med Image Anal. 2007;11:400–413.
33. Mehta DD, Deliyski DD, Quatieri TF, et al. Automated measurement of vocal fold vibratory asymmetry from high-speed videoendoscopy recordings. J Speech Lang Hear Res. 2011;54:47–54.
34. Karakozoglou S-Z, Henrich N, d’Alessandro C, et al. Automatic glottal segmentation using local-based active contours and application to glottovibrography. Speech Commun. 2012;54:641–654.
35. Moukalled HJ, Deliyski DD, Schwarz RR, et al. Segmentation of laryngeal high-speed videoendoscopy in temporal domain using paired active contours. In: Manfredi C, ed. Proceedings of the 10th International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications (MAVEBA). Firenze, Italy: Firenze University Press; 2009.
36. Yan Y, Chen X, Bless D. Automatic tracing of vocal-fold motion from high-speed digital images. IEEE Trans Biomed Eng. 2006;53:1394–1400.
37. Yan Y, Damrose E, Bless D. Functional analysis of voice using simultaneous high-speed imaging and acoustic recordings. J Voice. 2007;21:604–616.
38. Mehta DD, Deliyski DD, Zeitels SM, et al. Voice production mechanisms following phonosurgical treatment of early glottic cancer. Ann Otol Rhinol Laryngol. 2010;119:1–9.
39. Demeyer J, Dubuisson T, Gosselin B, et al. Glottis segmentation with a high-speed glottography: a fully automatic method. 3rd Adv Voice Funct Assess Int Workshop; 2009.
40. Yan Y, Du G, Zhu C, et al. Snake based automatic tracing of vocal-fold motion from high-speed digital images. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2012.
41. Zhang Y, Bieging E, Tsui H, et al. Efficient and effective extraction of vocal fold vibratory patterns from high-speed digital imaging. J Voice. 2010;24:21–29.
42. Zhou S, Wang J, Zhang S, et al. Active contour model based on local and global intensity information for medical image segmentation. Neurocomputing. 2016;186:107–118.
43. Sulong G, Abdulaali H, Hassan S. Edge detection algorithms vs-active contour for sketch matching: comparative study. Res J Appl Sci Eng Technol. 2015;11:759–764.
44. Kass M, Witkin A, Terzopoulos D. Active contour models. Int J Comput Vision. 1988;1:321–331.
45. Manfredi C, Bocchi L, Bianchi S, et al. Objective vocal fold vibration assessment from videokymographic images. Biomed Signal Process Control. 2006;1:129–136.
46. Deliyski DD, Powell ME, Zacharias SR, et al. Experimental investigation on minimum frame rate requirements of high-speed videoendoscopy for clinical voice assessment. Biomed Signal Process Control. 2015;17:21–28.
47. Deliyski DD. Endoscope motion compensation for laryngeal high-speed videoendoscopy. J Voice. 2005;19:485–496.
48. Deliyski DD, Petrushev P. Methods for objective assessment of high-speed videoendoscopy. Proc Adv Quant Laryngol. 2003:1–16.
49. Amini A, Weymouth T, Jain R. Using dynamic programming for solving variational problems in vision. IEEE Trans Pattern Anal Mach Intell. 1990;12(9):855–867.
50. Schenk F, Aichinger P, Roesner I, Urschler M. Automatic high-speed video glottis segmentation using salient regions and 3D geodesic active contours. Ann Br Mach Vision Assoc. 2015;2015:1–15.
51. Shi T, Kim HJ, Murry T, Woo P, Yan Y. Tracing vocal fold vibrations using level set segmentation method. Int J Numer Methods Biomed Eng. 2015;31(6):e02715. doi: 10.1002/cnm.
