Highlights
• A method to segment the vocal tract and articulators in MR images was developed.
• Median accuracy: Dice coefficient of 0.92; general Hausdorff distance of 5mm.
• Developed to facilitate quantitative analysis of the vocal tract and articulators.
• Intended for use in clinical and non-clinical studies of speech.
• A novel clinically relevant segmentation accuracy metric was also developed.
Keywords: Convolutional neural networks, Segmentation, Dynamic magnetic resonance imaging, Speech, Vocal tract, Articulators
Abstract
Background and Objective
Magnetic resonance (MR) imaging is increasingly used in studies of speech as it enables non-invasive visualisation of the vocal tract and articulators, thus providing information about their shape, size, motion and position. Extraction of this information for quantitative analysis is achieved using segmentation. Methods have been developed to segment the vocal tract; however, none of these also fully segments any of the articulators. The objective of this work was to develop a method to fully segment multiple groups of articulators as well as the vocal tract in two-dimensional MR images of speech, thus overcoming the limitations of existing methods.
Methods
Five speech MR image sets (392 MR images in total), each of a different healthy adult volunteer, were used in this work. A fully convolutional network with an architecture similar to the original U-Net was developed to segment the following six regions in the image sets: the head, soft palate, jaw, tongue, vocal tract and tooth space. A five-fold cross-validation was performed to investigate the segmentation accuracy and generalisability of the network. The segmentation accuracy was assessed using standard overlap-based metrics (Dice coefficient and general Hausdorff distance) and a novel clinically relevant metric based on velopharyngeal closure.
Results
The segmentations created by the method had a median Dice coefficient of 0.92 and a median general Hausdorff distance of 5mm. The method segmented the head most accurately (median Dice coefficient of 0.99), and the soft palate and tooth space least accurately (median Dice coefficients of 0.92 and 0.93 respectively). The segmentations created by the method correctly showed 90% (27 out of 30) of the velopharyngeal closures in the MR image sets.
Conclusions
An automatic method to fully segment multiple groups of articulators as well as the vocal tract in two-dimensional MR images of speech was successfully developed. The method is intended for use in clinical and non-clinical speech studies which involve quantitative analysis of the shape, size, motion and position of the vocal tract and articulators. In addition, a novel clinically relevant metric for assessing the accuracy of vocal tract and articulator segmentation methods was developed.
1. Introduction
1.1. MRI in investigations of speech
Magnetic resonance imaging (MRI) is increasingly used in studies of speech as it enables non-invasive visualisation of the vocal tract and articulators from any view. This visualisation provides information about the shape, size, motion and position of the vocal tract and articulators (organs involved in the production of sound) such as the soft palate and the tongue (Fig. 1). As well as being used in speech science and phonetics studies to increase our understanding of normal speech production [1], [2], [3], [4], [5], [6], MRI is beginning to be used in clinical assessments of speech. More specifically, it is used to visualise the motion of the tongue of patients after glossectomy (surgical removal of tongue tumours) [7] and the motion of the soft palate of patients with speech problems caused by velopharyngeal insufficiency (lack of contact between the soft palate and pharyngeal wall as shown in Fig. 1) [8], [9], [10], [11], [12]. Such visualisation provides important information that aids patient management decision-making. In addition, MRI is used to investigate the anatomical reasons why some patient groups are more predisposed to speech problems [13], [14], [15], [16].
Fig. 1.
Diagrams of the vocal tract, head, jaw, lower incisor space, epiglottis and five articulators (organs involved in the production of sound). (A) The soft palate at rest (for example, while breathing). (B) During normal speech, the soft palate elevates and comes into contact with the pharyngeal wall. This contact is known as velopharyngeal closure. Velopharyngeal closure is required in order to produce most sounds.
1.2. Vocal tract and articulator segmentation methods
Speech studies that use MRI involve the acquisition of time series of two-dimensional (2D) magnetic resonance (MR) images. To enable quantitative analysis of the information provided by these images, it is necessary to segment the anatomical features of interest, such as the vocal tract and articulators [1], [2], [3], [4], [5]. To avoid the time-consuming and expensive process of manual segmentation, several methods have been developed to perform this task semi-automatically or fully automatically [17], [18], [19], [20], [21], [22], [23], [24], [25]. One of these methods segmented the entire vocal tract [25], while the others only labelled pixels at air-tissue boundaries and therefore created a partial contour for each articulator [17], [18], [19], [20], [21], [22], [23], [24].
The methods have been based on a variety of approaches. In [17], the air-tissue boundaries between the vocal tract and neighbouring tissues were automatically labelled using an optimisation algorithm to adjust an anatomically informed synthetic image of the vocal tract until the k-space of the synthetic image was as similar as possible to the k-space of the MR image. Other methods performed the labelling by analysing pixel values along gridlines superposed on the MR image [18] or by using active shape models [19,20].
More recently, deep-learning-based methods have been developed to automatically label air-tissue boundaries between the vocal tract and neighbouring tissues [21], [22], [23], [24], [25]. In [21] and [23], fully convolutional networks (FCNs) with an architecture similar to SegNet [26] were developed to label the air-tissue boundaries and, in [21], identify which articulators the boundary pixels belonged to. In [22] and [24], FCNs with architectures similar to the original FCN [27] and the FCN in [28] respectively were developed to label the air-tissue boundaries. An FCN with an architecture similar to the original U-Net [29] was developed in [25] to segment the entire vocal tract, not just its boundaries with neighbouring tissues.
1.3. Motivation
Our ultimate goal is to develop software to automatically analyse 2D MR images of the vocal tract and articulators of patients with speech problems. This software will automatically extract quantitative information that clinicians believe will aid decision-making about which treatment is most likely to solve the speech problem. This information will include the shape and size of the vocal tract, soft palate and tongue, and the distance between the soft palate and pharyngeal wall. As a first step towards achieving this goal, we have developed a method to automatically segment these anatomical features.
1.4. Clinical considerations
Velopharyngeal closure (contact between the soft palate and pharyngeal wall, as shown in Fig. 1B) is required to produce most sounds. Some patients with speech problems are able to achieve velopharyngeal closure, while others are not. In the latter case, clinicians are interested in measuring the gap between the soft palate and pharyngeal wall, as they believe this information will aid decision-making about which treatment is most likely to solve the speech problem. One of our ultimate goals is to develop software that automatically performs this measurement by first segmenting the soft palate and pharyngeal wall, and then calculating the distance between these anatomical structures. Such a measurement requires a segmentation method that preserves gaps between the soft palate and pharyngeal wall. Equally, the method must accurately show velopharyngeal closures when they occur: a method that fails to preserve a gap shows an artificial closure, while one that fails to show a closure creates an artificial gap. By comparing the velopharyngeal closures in the ground truth segmentations with those in the segmentations created by the method, both abilities can be assessed.
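As an illustration only (this computation is not part of the published method), the velopharyngeal gap could be estimated from two segmentation masks using a distance transform; the function below, including its name and inputs, is a hypothetical sketch.

```python
# Hypothetical sketch: estimate the velopharyngeal gap from two binary masks.
# A gap of 0 mm corresponds to velopharyngeal closure.
import numpy as np
from scipy import ndimage

def velopharyngeal_gap_mm(soft_palate: np.ndarray,
                          pharyngeal_wall: np.ndarray,
                          pixel_size_mm: float) -> float:
    """soft_palate, pharyngeal_wall: 2D boolean masks on the same image grid."""
    # Distance (in pixels) from every pixel to the nearest pharyngeal wall pixel
    dist_to_wall = ndimage.distance_transform_edt(~pharyngeal_wall)
    # The smallest such distance over the soft palate mask is the gap
    return float(dist_to_wall[soft_palate].min() * pixel_size_mm)
```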
1.5. Contributions and novelty
Although existing segmentation methods label the air-tissue boundaries between the vocal tract and neighbouring tissues, the majority of them do not identify which articulators the tissues belong to [18], [19], [22], [23], [24], [25], and consequently only provide information about the shape and size of the vocal tract. Three methods have been developed that label the air-tissue boundaries and identify which articulators the boundary pixels belong to [17], [20], [21]. However, since these methods only label pixels at the air-tissue boundaries, they only partially contour the articulators. Full segmentation or contouring of the articulators is required for automatic analysis of their shape, size, motion and position during speech, analysis that clinicians are increasingly interested in performing.
The main contribution and novelty of this work is the development of an automatic method to fully segment multiple groups of articulators as well as the vocal tract in MR images, thus overcoming the limitations of existing vocal tract and articulator segmentation methods. Furthermore, for the first time a clinically relevant metric based on velopharyngeal closure is used to assess the performance of the method, as well as standard segmentation overlap-based metrics.
2. Materials and methods
2.1. Data
Five real-time MRI (rtMRI) datasets were used in this work. Each dataset consisted of a series of 2D rtMR images of a subject counting from one to ten in English (Fig. 2A). The datasets were acquired using a 3.0T TX Achieva MRI scanner and a 16-channel neurovascular coil (both Philips Healthcare, Best, the Netherlands) and a fast low-angle shot pulse sequence at a frame rate of 10 frames per second. The rtMR images are of a 300×230×10 mm³ midsagittal slice of the head and have a matrix size of 256×256. The subjects (2 females, 3 males; age range: 24–28 years) were fluent English speakers and had no recent history of speech and language disorders. The speech task performed by the subjects (counting from one to ten) is commonly performed in clinical assessments of speech in the United Kingdom [30]. The datasets consisted of 105, 71, 71, 78 and 67 images (392 images in total). Each dataset was normalised with respect to the minimum and maximum intensities in the set so that all intensities were between 0 and 1.
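A minimal sketch of this per-dataset intensity normalisation (NumPy is used for illustration; the function and variable names are assumptions):

```python
import numpy as np

def normalise_dataset(images: np.ndarray) -> np.ndarray:
    """images: array of shape (n_frames, 256, 256) containing one subject's rtMR images."""
    # Rescale using the dataset's own minimum and maximum so intensities lie in [0, 1]
    lo, hi = images.min(), images.max()
    return (images - lo) / (hi - lo)
```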
Fig. 2.
Example magnetic resonance (MR) images and corresponding ground truth segmentations. (A) Five consecutive MR images of a single subject saying the word “one”. (B) Ground truth segmentations of the head (dark blue), soft palate (light blue), jaw (green), tongue (yellow), vocal tract (pink) and tooth space (red).
2.2. Ground truth segmentation
First, each image was labelled as either showing velopharyngeal closure or not. Determining whether or not an rtMR image shows velopharyngeal closure can be challenging, especially if the soft palate is close to the pharyngeal wall or there is fluid (which has a similar intensity to the soft palate) between the soft palate and pharyngeal wall. To reduce the subjectivity of the labels, each image was labelled by two MRI Physicists with ten and five years of speech MRI experience respectively. Images that were labelled differently were jointly inspected a second time by the Physicists, and a consensus was reached on whether or not the images showed velopharyngeal closure.
Ground truth segmentations of six classes were created by manual pixel-wise labelling of the following anatomical features in every image in the datasets (Fig. 2B): the head (including the upper lip and hard palate), soft palate, jaw (including the lower lip), tongue (including the epiglottis), vocal tract and tooth space (lower incisor only). The segmentation was performed by the MRI Physicist with five years of speech MRI experience using MATLAB R2019b (MathWorks, Natick, MA).
2.3. FCN architecture, implementation and training
An FCN with a similar architecture to the original U-Net [29] was implemented. The network had a five-layer encoding path followed by a four-layer decoding path. More information on its architecture is provided in Fig. 3. Dropout with a probability of 0.5 was included in the fourth and fifth encoding layers. The outputs of the network were seven probability maps, one for each class. The network was implemented using PyTorch 1.4.0 [31] and training was performed on an NVIDIA TITAN RTX graphics card. Cross entropy was used as the loss function during network training. The Adam optimiser [32] with hyperparameters β1=0.9, β2=0.999 and ε=1e-8 was used to adjust network weights. In each experiment, the network was trained for 200 epochs. Data augmentation was performed to increase the number of images in the training dataset by a factor of four. Augmented images were created by randomly translating, rotating, cropping and rescaling the original images. Translations were between -30 and 30 pixels in the x-direction and between -30 and 0 pixels in the y-direction. Rotations were between -10° and 30° clockwise. Cropping was to a matrix size of either 220×220 if followed by rescaling or between 210×210 and 255×255 if followed by zero padding. All augmented images had the same matrix size as the original images. This was achieved by cropping and then zero padding the translated images and the rotated images, and rescaling or zero padding the cropped images.
Fig. 3.
Network architecture. BN: batch normalisation, ReLU: rectified linear unit, conv: convolution.
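For illustration, the training configuration described above could be set up in PyTorch as follows; model, train_loader and class_weights are placeholders rather than the code used in this work, and the learning rate is one of the hyperparameters tuned in section 3.4.

```python
import torch
import torch.nn as nn

def train_network(model: nn.Module, train_loader, class_weights: torch.Tensor,
                  learning_rate: float, device: str = "cuda") -> nn.Module:
    model = model.to(device)
    # Weighted cross entropy loss (class weighting described in section 2.4)
    criterion = nn.CrossEntropyLoss(weight=class_weights.to(device))
    # Adam optimiser with the hyperparameters reported above
    optimiser = torch.optim.Adam(model.parameters(), lr=learning_rate,
                                 betas=(0.9, 0.999), eps=1e-8)
    model.train()
    for epoch in range(200):  # 200 epochs per experiment
        for images, labels in train_loader:  # labels: per-pixel class indices (0-6)
            images, labels = images.to(device), labels.to(device)
            optimiser.zero_grad()
            logits = model(images)            # shape: (batch, 7, 256, 256)
            loss = criterion(logits, labels)  # labels shape: (batch, 256, 256)
            loss.backward()
            optimiser.step()
    return model
```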
2.4. Loss function weighting
The loss function was weighted in order to compensate for class imbalances. More specifically, the losses of class k ∈ {1, 2, …, 7} were multiplied by the following weight:
where Nk is the number of pixels of class k in the training dataset.
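The exact weighting formula is given in the published article; purely as an illustration, an inverse-frequency weighting scheme (an assumption, not necessarily the formula used here) could be computed as follows.

```python
# Illustration of inverse-frequency class weighting (an assumed scheme): rarer
# classes get larger weights so that small structures such as the soft palate
# are not neglected by the loss function.
import numpy as np
import torch

def inverse_frequency_weights(label_maps: np.ndarray, n_classes: int = 7) -> torch.Tensor:
    """label_maps: integer array of per-pixel class indices for the training set."""
    counts = np.bincount(label_maps.ravel(), minlength=n_classes).astype(float)
    weights = counts.sum() / (n_classes * counts)  # weight_k proportional to 1 / N_k
    return torch.tensor(weights, dtype=torch.float32)
```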
2.5. Segmentation post-processing
After fully training the network, connected component analysis based post-processing was performed on each segmentation output at inference time in order to remove anatomically impossible regions. More specifically, each region (i.e. connected component) in the segmentation output was automatically analysed in the following way:
1. The classes of the regions in contact with it were identified.
2. If the region was surrounded by another region, its class was changed to that of the surrounding region.
3. If the region was either in contact with an anatomically impossible region (for example, if a jaw region was in contact with a soft palate region) or not in contact with anatomically expected regions (for example, if a tooth space region was not in contact with a jaw region and a tongue region), the classes of the pixels surrounding the region were identified and the class of the region was changed to the mode class of these pixels.
This analysis was performed using MATLAB R2019b.
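A simplified sketch of the second rule (regions enclosed by a single other class are absorbed) is given below; this is a Python illustration rather than the MATLAB implementation used in this work, and it omits the anatomical adjacency rules.

```python
# Illustrative sketch of the "surrounded region" rule: any connected component
# whose surrounding pixels all belong to a single other class is relabelled
# to that class.
import numpy as np
from scipy import ndimage

def relabel_surrounded_regions(seg: np.ndarray) -> np.ndarray:
    """seg: 2D array of integer class labels. Returns a cleaned copy."""
    out = seg.copy()
    for cls in np.unique(seg):
        # Label the connected components of this class
        comps, n = ndimage.label(seg == cls)
        for i in range(1, n + 1):
            comp = comps == i
            # Dilate by one pixel and inspect the classes of the surrounding pixels
            ring = ndimage.binary_dilation(comp) & ~comp
            neighbour_classes = np.unique(out[ring])
            # If all surrounding pixels belong to one other class, absorb the component
            if len(neighbour_classes) == 1 and neighbour_classes[0] != cls:
                out[comp] = neighbour_classes[0]
    return out
```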
3. Experiments and results
3.1. Segmentation accuracy assessment
After fully training the network, its segmentation accuracy was assessed using two metrics. First, the Dice coefficient was used to quantify the overlap between the ground truth segmentations and the predicted segmentations (i.e. segmentations created by the network). Second, the general Hausdorff distance was used to quantify the maximum discrepancy between the boundaries of the ground truth and predicted segmentations. The Dice coefficient and general Hausdorff distance of each class in every predicted segmentation (both before and after post-processing) were calculated using MATLAB R2019b.
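For illustration, the two metrics could be computed as follows (the metrics in this work were computed in MATLAB; here the Dice coefficient is computed directly and SciPy's directed Hausdorff distance is symmetrised).

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_coefficient(pred: np.ndarray, truth: np.ndarray) -> float:
    """pred, truth: 2D boolean masks of one class."""
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / (pred.sum() + truth.sum())

def hausdorff_distance_mm(pred: np.ndarray, truth: np.ndarray,
                          pixel_size_mm: float) -> float:
    """Symmetric Hausdorff distance between the two masks, converted to mm."""
    pred_pts = np.argwhere(pred)
    truth_pts = np.argwhere(truth)
    d = max(directed_hausdorff(pred_pts, truth_pts)[0],
            directed_hausdorff(truth_pts, pred_pts)[0])
    return d * pixel_size_mm
```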
3.2. Velopharyngeal closure assessment
The accuracy with which the predicted segmentations showed occurrences of velopharyngeal closure was assessed by manually comparing the closures in the ground truth and predicted segmentations. The comparison enabled identification of the number of:
• "Correct" closures: closures that were shown in both the ground truth and predicted segmentations.
• "Additional" closures: closures that were shown in the predicted segmentations but not the ground truth segmentations.
• "Merged" closures: one or more consecutive closures that were shown as separate closures in the ground truth segmentations and a single closure in the predicted segmentations.
• "Missed" closures: closures that were shown in the ground truth segmentations but not in the predicted segmentations.
An example of each type of closure is shown in Fig. 4.
Fig. 4.
Examples of each type of velopharyngeal closure. On the y-axis, “closed” indicates contact between the soft palate and pharyngeal wall (velopharyngeal closure), while “open” indicates no contact.
3.3. Post-processing investigation
To investigate the effect of the post-processing on the accuracy of the predicted segmentations, two sets of statistical tests were performed: the first compared the Dice coefficient of each class before and after post-processing, and the second compared the general Hausdorff distance of each class before and after post-processing. Each metric was compared using either the one-tailed Wilcoxon signed-rank test or the one-tailed sign test, depending on whether or not the distribution of differences between the paired data points was symmetric. The statistical tests were performed using MATLAB R2019b with a 95% confidence level.
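A minimal sketch of this paired comparison for one class and one metric, using SciPy for illustration (the tests in this work were run in MATLAB, and the simple skewness-based symmetry check below is an assumption about how symmetry might be assessed):

```python
import numpy as np
from scipy import stats

def compare_before_after(before: np.ndarray, after: np.ndarray,
                         alternative: str = "greater", alpha: float = 0.05):
    """before, after: per-image values of one metric for one class.
    Use alternative="greater" for the Dice coefficient (expected to increase after
    post-processing) and alternative="less" for the general Hausdorff distance."""
    diffs = after - before
    # Crude symmetry check on the paired differences (stands in for the
    # symmetry assessment described in the text)
    if abs(stats.skew(diffs)) < 0.5:
        # One-tailed Wilcoxon signed-rank test
        _, p = stats.wilcoxon(after, before, alternative=alternative)
    else:
        # One-tailed sign test via the binomial distribution on the signs
        wins = int(np.sum(diffs > 0)) if alternative == "greater" else int(np.sum(diffs < 0))
        n = int(np.sum(diffs != 0))
        p = stats.binomtest(wins, n, 0.5, alternative="greater").pvalue
    return p, p < alpha
```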
3.4. Cross-validation
To evaluate the generalisability of the network, a five-fold cross-validation was performed with the dataset of each subject being left out once. Hyperparameter optimisation was achieved by carrying out a nested cross-validation for each fold of the main cross-validation, in the way described in [33]. This nested cross-validation was a four-fold cross-validation with the dataset of each of the remaining four subjects being left out once. Six different learning rate {0.003, 0.0003, 0.00003} and minibatch size {4, 8} combinations were evaluated in this way, and the hyperparameter combination that resulted in the highest mean Dice coefficient on the left-out dataset (of the nested cross-validation) after post-processing was chosen as the optimal hyperparameter combination. Once the optimal hyperparameter combination had been identified for a fold of the main cross-validation, the network was trained using all the datasets except the left-out dataset for that fold, and then tested using the left-out dataset. Segmentations created when the optimised network was inputted with the test dataset are referred to as predicted segmentations in the following sections of this article.
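A schematic of the nested leave-one-subject-out cross-validation described above is sketched below; train_and_evaluate is a placeholder that trains the network on the given subjects' datasets and returns the mean Dice coefficient (after post-processing) on the held-out subject.

```python
from itertools import product

SUBJECTS = ["s1", "s2", "s3", "s4", "s5"]               # one rtMRI dataset per subject
GRID = list(product([0.003, 0.0003, 0.00003], [4, 8]))  # (learning rate, minibatch size)

def nested_cross_validation(datasets, train_and_evaluate):
    results = {}
    for test_subject in SUBJECTS:                        # outer (main) cross-validation
        inner_subjects = [s for s in SUBJECTS if s != test_subject]
        # Inner four-fold cross-validation chooses the hyperparameters
        scores = {}
        for lr, batch in GRID:
            fold_scores = []
            for val_subject in inner_subjects:
                train_subjects = [s for s in inner_subjects if s != val_subject]
                fold_scores.append(
                    train_and_evaluate(datasets, train_subjects, val_subject, lr, batch))
            scores[(lr, batch)] = sum(fold_scores) / len(fold_scores)  # mean Dice after PP
        best_lr, best_batch = max(scores, key=scores.get)
        # Retrain on all four remaining subjects and test on the left-out subject
        results[test_subject] = train_and_evaluate(
            datasets, inner_subjects, test_subject, best_lr, best_batch)
    return results
```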
3.5. New vocal tract shape investigation
Different vocal tract shapes and articulator positions are required to produce different speech sounds. The data used to train the network does not contain images of all the different possible vocal tract shapes in English. To investigate the ability of the network to segment vocal tract shapes not present in the training dataset, 15 additional rtMR images were segmented using the network. The images (three per subject) were of the same five subjects described in section 2.1 producing three sounds which require vocal tract shapes not present in the training dataset: /ɒ/ and /b/ in “Bob” and /a/. The accuracy of the segmentations was assessed using the Dice coefficient and general Hausdorff distance. The images were acquired and ground truth segmentations created in the ways described in sections 2.1 and 2.2 respectively.
4. Results
For four of the folds in the main cross-validation, the optimal hyperparameter combination was a learning rate of 0.0003 and a minibatch size of 4. For the fifth fold, the optimal combination was a learning rate of 0.003 and a minibatch size of 8.
Examples of predicted segmentations before and after post-processing are shown in Fig. 5, Fig. 6. Fig. 5A(2), Fig. 5B(2) and Fig. 5C(2) show predicted segmentations (before post-processing) with relatively low, average and high Dice coefficients respectively, while Fig. 5D(2), Fig. 5E(2) and Fig. 5F(2) show predicted segmentations (before post-processing) with relatively large, average and small general Hausdorff distances respectively. Column 3 in Fig. 5 shows the predicted segmentations after post-processing. Fig. 6 is a video showing the ground truth and predicted segmentations (both before and after post-processing) of one of the image sets.
Fig. 5.
Examples of ground truth segmentations (column 1) and corresponding predicted segmentations before and after post-processing (columns 2 and 3 respectively). Rows A to C show predicted segmentations with low, average and high Dice coefficients respectively. Rows D to F show predicted segmentations with large, average and small general Hausdorff distances respectively. The sounds being produced by the subjects are /t/ in “two” (row A), /r/ in “three” (row B), /n/ at the end of “nine” (row C), /w/ in “one” (row D), /f/ in “four” (row E) and /n/ in “ten” (row F). The segmentations have been cropped to only show the vocal tract region.
Fig. 6.
Video showing the segmentations of one of the magnetic resonance image sets. The segmentations have been cropped to only show the vocal tract region. Left: ground truth segmentation. Centre: predicted segmentations before post-processing. Right: predicted segmentations after post-processing. The subject is counting from one to ten.
Results of the quantitative investigation into the effect of the post-processing on the accuracy of the predicted segmentations are shown in Fig. 7, Fig. 8. Fig. 7 shows the Dice coefficients of each class in the predicted segmentations before and after post-processing, while Fig. 8 shows the general Hausdorff distances of each class in the predicted segmentations before and after post-processing. Statistical tests revealed that the median Dice coefficient of each class increased after post-processing (95% confidence level, p < 0.001 for all classes), while the median general Hausdorff distance of each class decreased after post-processing (95% confidence level, p < 0.001 for all classes).
Fig. 7.
Dice coefficients of the predicted segmentations before and after post-processing (PP). Each boxplot shows the Dice coefficients of a different class. Statistical tests revealed that there was a statistically significant increase in the median Dice coefficient of each class after PP (95% confidence level, p < 0.001 for all classes).
Fig. 8.
General Hausdorff distances in mm of the predicted segmentations before and after post-processing (PP). Each boxplot shows the general Hausdorff distance of a different class. Statistical tests revealed that there was a statistically significant decrease in the median general Hausdorff distance of each class after PP (95% confidence level, p < 0.001 for all classes).
The results of the quantitative assessments of the accuracy of the predicted segmentations after post-processing for all classes are summarised in Fig. 9, Fig. 10. Fig. 9 shows the Dice coefficients of each class in the predicted segmentations after post-processing, while Fig. 10 shows the general Hausdorff distances of each class in the predicted segmentations after post-processing. These Figures enable comparison of the accuracy with which each class was segmented. The median Dice coefficient of the predicted segmentations after post-processing was 0.92, while the median general Hausdorff distance was 5mm. In 93% of segmentations (365 out of 392 images in the test dataset), the Dice coefficients of all the classes were above 0.85.
Fig. 9.
Dice coefficients of each class in the predicted segmentations after post-processing.
Fig. 10.
General Hausdorff distances in mm of each class in the predicted segmentations after post-processing.
The velopharyngeal closures in the ground truth and predicted segmentations (after post-processing) of one of the subjects are shown in Fig. 11. This Figure shows cases of “correct”, “merged” and “additional” closures in the predicted segmentations. Fig. 12 shows rtMR images whose predicted segmentations incorrectly showed velopharyngeal closure.
Fig. 11.
Velopharyngeal closures in the ground truth and predicted segmentations (after post-processing) of one of the subjects counting from one to ten. C: correct closure (closures that were shown in both the ground truth and predicted segmentations). A: additional closure (closures that were shown in the predicted segmentations but not the ground truth segmentations). Merged: merged closures (one or more consecutive closures that were shown as separate closures in the ground truth segmentations and a single closure in the predicted segmentations).
Fig. 12.
Magnetic resonance images (column 1) whose predicted segmentations after post-processing (column 3) incorrectly showed velopharyngeal closure. Column 2 is the ground truth segmentation of the images. In both images, the soft palate is close to the pharyngeal wall but not in contact with it. Row A shows the subject pausing between saying “four” and “five”, while row B shows the subject producing the sound /n/ at the end of “nine”. The segmentations have been cropped to only show the vocal tract region.
The velopharyngeal closures in the ground truth segmentations and the predicted segmentations after post-processing are summarised in Table 1. This Table lists the number of “correct”, “merged”, “additional” and “missed” closures in the predicted segmentations.
Table 1.
Number of velopharyngeal closures in the ground truth segmentations and predicted segmentations after post-processing. Total: total number of closures in the segmentations. Correct: closures that were shown in both the ground truth and predicted segmentations. Additional: closures that were shown in the predicted segmentations but not the ground truth segmentations. Merged: one or more consecutive closures that were shown as separate closures in the ground truth segmentations and a single closure in the predicted segmentations. Missed: closures that were shown in the ground truth segmentations but not in the predicted segmentations.
| | Ground truth | Predicted |
|---|---|---|
| Total | 30 | 33 |
| Correct | 30 | 27 |
| Additional | 0 | 5 |
| Merged | 0 | 3 |
| Missed | 0 | 0 |
Five examples of predicted segmentations after post-processing, created when the network was inputted with additional rtMR images of vocal tract shapes not present in the training dataset, are shown in Fig. 13. The median Dice coefficient of the predicted segmentations of the 15 additional rtMR images was 0.96, while the median general Hausdorff distance was 6mm.
Fig. 13.
Ground truth segmentations (column 1) and corresponding predicted segmentations after post-processing (column 2) created when the network was inputted with images of vocal tract shapes not present in the training dataset. The segmentations have been cropped to only show the vocal tract region.
5. Discussion
The main contribution and novelty of this work is the development of an automatic method to fully segment multiple groups of articulators as well as the vocal tract in 2D rtMR images. This novelty overcomes the limitations of existing methods that either only segment the air-tissue boundaries between the vocal tract and neighbouring tissues [17], [18], [19], [20], [21], [22], [23], [24] or fully segment the vocal tract only [25]. The other contribution and novelty of this work is the development of a clinically relevant metric for assessing the accuracy of segmentations created by vocal tract and articulator segmentation methods. This novel metric was used to assess the accuracy of the segmentations created by the method.
Our method is deep learning based and consists of two steps: first, segmentations are created by inputting rtMR images into a trained FCN with a similar architecture to the original U-Net [29]; second, a connected component analysis based post-processing is performed on the segmentations to remove anatomically impossible regions.
Statistical tests were performed to investigate the effect of the post-processing on the accuracy of the predicted segmentations. The tests revealed that the median Dice coefficient of each class in the predicted segmentations increased after post-processing (95% confidence level, p < 0.001 for all classes), while the median general Hausdorff distance decreased (95% confidence level, p < 0.001 for all classes). These results show that the post-processing was effective in improving the accuracy of the predicted segmentations.
Our method segmented each class with a high accuracy, as shown by its segmentations achieving a median Dice coefficient of 0.92 and a median general Hausdorff distance of 5mm. On average, the head was segmented most accurately (median Dice coefficient of 0.99). This result is unsurprising as this class has the largest number of pixels and the least variation in shape and position in the rtMR images. It is therefore the least challenging class for our method to learn to segment. On average, the soft palate and tooth space were segmented least accurately (median Dice coefficients of 0.92 and 0.93 respectively). This result is also unsurprising as these classes have the smallest number of pixels and so small errors at the boundaries will have a bigger impact on the Dice coefficient. In addition, the soft palate is the class with the largest variation in shape and position in the rtMR images. It is therefore the most challenging class for our method to learn to segment.
Our method segmented the vocal tract with a higher accuracy (mean Dice coefficient of 0.95) than the only other published method for fully segmenting the vocal tract (mean Dice coefficient of 0.90) [25]. Both methods are deep learning based and have a similar architecture; the higher accuracy of our method is therefore likely to be because it was trained using a larger number of images (up to 1625 images, compared with 300 images in [25]). Currently, no other methods to fully segment the articulators in MR images exist. It is therefore not possible to compare the articulator segmentation accuracy of our method with that of any other published methods.
In 93% of cases (365 out of 392 images), the Dice coefficient of each of the six predicted segmentations (one per class) was 0.85 or above. This result suggests that the generalisability of our method is good.
When clinically assessing the speech of patients with speech problems, an important consideration is whether velopharyngeal closure occurs during speech. It is therefore important that segmentation methods intended for use in clinical speech assessment accurately show occurrences of velopharyngeal closure, while not artificially creating velopharyngeal closures when these do not occur (i.e. preserving gaps between the soft palate and pharyngeal wall). The predicted segmentations correctly showed 90% (27 out of 30) of the velopharyngeal closures in the ground truth segmentations. As shown in Fig. 11, three consecutive closures in the ground truth segmentations were shown as a single closure in the predicted segmentations. It is important to note that the soft palate motion between these three closures was different from the motion between all the other closures: instead of moving to a position far from the pharyngeal wall, the soft palate remained close to the wall (an example is shown in Fig. 12A). Consequently, the gap between the soft palate and pharyngeal wall remained small. The predicted segmentations also showed five closures that did not occur in the ground truth segmentations (two are shown in Fig. 11). All five of these additional closures occurred when the soft palate was close to the pharyngeal wall (an example is shown in Fig. 12B). The merging of closures and the occurrence of additional closures show that our method was not always able to preserve small gaps between the soft palate and the pharyngeal wall. Further work is required to improve the ability of our method to preserve such gaps. A factor that can make preservation of such gaps particularly challenging is the presence of fluid within them. In rtMR images, fluid has a similar intensity to the soft palate and pharyngeal wall, and can therefore make it appear as though the soft palate and pharyngeal wall are in contact (an example is shown in Fig. 12B). This factor should be considered in any future work.
Different vocal tract shapes and articulator positions are required to produce different speech sounds. Our method was trained using rtMR images of the vocal tract shapes that occur when counting from one to ten (a speech task commonly performed in clinical speech assessment), rather than images of all the possible shapes in English. Nevertheless, the method segmented rtMR images of three vocal tract shapes not present in the training dataset with an accuracy similar to that achieved for shapes present in the training dataset: the median Dice coefficient and median general Hausdorff distance were 0.96 and 6mm respectively for the former images, compared with 0.92 and 5mm for the latter. However, further work involving images of a larger range of vocal tract shapes is required to investigate the extent to which this finding holds.
6. Conclusions
A novel automatic method to fully segment multiple groups of articulators as well as the vocal tract in 2D rtMR images was developed. This method overcomes the limitations of existing methods that either only segment the air-tissue boundaries between the vocal tract and neighbouring tissues or fully segment the vocal tract only. The method is intended for use in clinical and non-clinical speech investigations which involve quantitative analysis of the shape, size, motion and position of the vocal tract and articulators. A novel clinically relevant metric for assessing the accuracy of vocal tract and articulator segmentation methods was also developed.
Declaration of Competing Interest
None of the authors have any conflicts of interest to declare.
Acknowledgements
Matthieu Ruthven is funded by a Health Education England / National Institute for Health Research Clinical Doctoral Research Fellowship for this project.
Andrew King was supported by the Wellcome/EPSRC Centre for Medical Engineering [WT 203148/Z/16/Z].
Footnotes
Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.cmpb.2020.105814.
Appendix. Supplementary materials
References
1. Scott A.D., Wylezinska M., Birch M.J., Miquel M.E. Speech MRI: Morphology and function. Phys. Medica. 2014;30:604–618. doi: 10.1016/j.ejmp.2014.05.001.
2. Carignan C., Shosted R.K., Fu M., Liang Z.P., Sutton B.P. A real-time MRI investigation of the role of lingual and pharyngeal articulation in the production of the nasal vowel system of French. J. Phon. 2015;50:34–51. doi: 10.1016/j.wocn.2015.01.001.
3. Carey D., Miquel M.E., Evans B.G., Adank P., McGettigan C. Vocal Tract Images Reveal Neural Representations of Sensorimotor Transformation During Speech Imitation. Cereb. Cortex. 2017;33:316–325. doi: 10.1093/cercor/bhw393.
4. Leppävuori M., Lammentausta E., Peuna A., Bode M.K., Jokelainen J., Ojala J., Nieminen M.T. Characterizing Vocal Tract Dimensions in the Vocal Modes Using Magnetic Resonance Imaging. J. Voice. 2020. doi: 10.1016/j.jvoice.2020.01.015.
5. Kim J., Toutios A., Lee S., Narayanan S.S. Vocal tract shaping of emotional speech. Comput. Speech Lang. 2020. doi: 10.1016/j.csl.2020.101100.
6. Hagedorn C., Proctor M., Goldstein L., Wilson S.M., Miller B., Gorno-Tempini M.L., Narayanan S.S. Characterizing articulation in apraxic speech using real-time magnetic resonance imaging. J. Speech, Lang. Hear. Res. 2017;60:877–891. doi: 10.1044/2016_JSLHR-S-15-0112.
7. Ha J., Sung I., Son J., Stone M., Ord R., Cho Y. Analysis of speech and tongue motion in normal and post-glossectomy speaker using cine MRI. J. Appl. Oral Sci. 2016;24:472–480. doi: 10.1590/1678-775720150421.
8. Beer A.J., Hellerhoff P., Zimmermann A., Mady K., Sader R., Rummeny E.J., Hannig C. Dynamic near-real-time magnetic resonance imaging for analyzing the velopharyngeal closure in comparison with videofluoroscopy. J. Magn. Reson. Imaging. 2004;20:791–797. doi: 10.1002/jmri.20197.
9. Drissi C., Mitrofanoff M., Talandier C., Falip C., Le Couls V., Adamsbaum C. Feasibility of dynamic MRI for evaluating velopharyngeal insufficiency in children. Eur. Radiol. 2011;21:1462–1469. doi: 10.1007/s00330-011-2069-7.
10. Silver A.L., Nimkin K., Ashland J.E., Ghosh S.S., van der Kouwe A.J.W., Brigger M.T., Hartnick C.J. Cine Magnetic Resonance Imaging With Simultaneous Audio to Evaluate Pediatric Velopharyngeal Insufficiency. Arch. Otolaryngol. Neck Surg. 2011;137:258–263. doi: 10.1001/archoto.2011.11.
11. Sagar P., Nimkin K. Feasibility study to assess clinical applications of 3-T cine MRI coupled with synchronous audio recording during speech in evaluation of velopharyngeal insufficiency in children. Pediatr. Radiol. 2015;45:217–227. doi: 10.1007/s00247-014-3141-7.
12. Kulinna-Cosentini C., Czerny C., Baumann A., Weber M., Sinko K. TrueFisp versus HASTE sequences in 3T cine MRI: Evaluation of image quality during phonation in patients with velopharyngeal insufficiency. Eur. Radiol. 2016;26:2892–2898. doi: 10.1007/s00330-015-4115-3.
13. Ruotolo R.A., Veitia N.A., Corbin A., McDonough J., Solot C.B., McDonald-McGinn D., Zackai E.H., Emanuel B.S., Cnaan A., LaRossa D., Arens R., Kirschner R.E. Velopharyngeal Anatomy in 22q11.2 Deletion Syndrome: A Three-Dimensional Cephalometric Analysis. Cleft Palate Craniofac. J. 2006;43:446. doi: 10.1597/04-193R.1.
14. Park M., Ahn S.H., Jeong J.H., Baek R.M. Evaluation of the levator veli palatini muscle thickness in patients with velocardiofacial syndrome using magnetic resonance imaging. J. Plast. Reconstr. Aesthetic Surg. 2015;68:1100–1105. doi: 10.1016/j.bjps.2015.04.013.
15. Filip C., Impieri D., Aagenæs I., Breugem C., Høgevold H.E., Særvold T., Aukner R., Lima K., Tønseth K., Abrahamsen T.G. Adults with 22q11.2 deletion syndrome have a different velopharyngeal anatomy with predisposition to velopharyngeal insufficiency. J. Plast. Reconstr. Aesthetic Surg. 2018;71:524–536. doi: 10.1016/j.bjps.2017.09.006.
16. Kollara L., Baylis A.L., Kirschner R.E., Bates D.G., Smith M., Fang X., Perry J.L. Velopharyngeal Structural and Muscle Variations in Children With 22q11.2 Deletion Syndrome: An Unsedated MRI Study. Cleft Palate-Craniofacial J. 2019;56:1139–1148. doi: 10.1177/1055665619851660.
17. Bresch E., Narayanan S. Region segmentation in the frequency domain applied to upper airway real-time magnetic resonance images. IEEE Trans. Med. Imaging. 2009;28:323–338. doi: 10.1109/TMI.2008.928920.
18. Kim J., Kumar N., Lee S., Narayanan S. Enhanced airway-tissue boundary segmentation for real-time magnetic resonance imaging data. Proc. 10th Int. Semin. Speech Prod. 2014; pp. 222–225.
19. Silva S., Teixeira A. Unsupervised segmentation of the vocal tract from real-time MRI sequences. Comput. Speech Lang. 2015;33:25–46. doi: 10.1016/j.csl.2014.12.003.
20. Labrunie M., Badin P., Voit D., Joseph A.A., Frahm J., Lamalle L., Vilain C., Boë L.-J. Automatic segmentation of speech articulators from real-time midsagittal MRI based on supervised learning. Speech Commun. 2018;99:27–46. doi: 10.1016/j.specom.2018.02.004.
21. Somandepalli K., Toutios A., Narayanan S.S. Semantic Edge Detection for Tracking Vocal Tract Air-tissue Boundaries in Real-time Magnetic Resonance Images. INTERSPEECH. 2017; pp. 631–635.
22. Valliappan C., Mannem R., Ghosh P.K. Air-tissue boundary segmentation in real-time magnetic resonance imaging video using semantic segmentation with fully convolutional networks. INTERSPEECH. 2018; pp. 3132–3136.
23. Valliappan C., Kumar A., Mannem R., Karthik G., Ghosh P.K. An improved air tissue boundary segmentation technique for real time magnetic resonance imaging video using SegNet. IEEE Int. Conf. Acoust. Speech Signal Process. 2019; pp. 5921–5925.
24. Mannem R., Ghosh P.K. Air-tissue boundary segmentation in real time magnetic resonance imaging video using a convolutional encoder-decoder network. IEEE Int. Conf. Acoust. Speech Signal Process. 2019; pp. 5941–5945.
25. Erattakulangara S., Lingala S.G. Airway segmentation in speech MRI using the U-net architecture. IEEE Int. Symp. Biomed. Imaging. 2020; pp. 1887–1890.
26. Badrinarayanan V., Kendall A., Cipolla R. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017;39:2481–2495. doi: 10.1109/TPAMI.2016.2644615.
27. Long J., Shelhamer E., Darrell T. Fully Convolutional Networks for Semantic Segmentation. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 2015; pp. 3431–3440.
28. Yang J., Price B., Cohen S., Lee H., Yang M.H. Object contour detection with a fully convolutional encoder-decoder network. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 2016; pp. 193–202.
29. Ronneberger O., Fischer P., Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. Med. Image Comput. Comput. Interv., Springer; 2015; pp. 234–241.
30. Sell D., Pereira V. Instrumentation in the Analysis of the Structure and Function of the Velopharyngeal Mechanism. In: Howard S., Lohmander A., editors. Cleft Palate Speech Assess. Interv. Wiley-Blackwell; 2011. p. 373.
31. Paszke A., Gross S., Massa F., Lerer A., Bradbury J., Chanan G., Killeen T., Lin Z., Gimelshein N., Antiga L., Desmaison A., Kopf A., Yang E., DeVito Z., Raison M., Tejani A., Chilamkurthy S., Steiner B., Fang L., Bai J., Chintala S. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In: Wallach H., Larochelle H., Beygelzimer A., D'Alché-Buc F., Fox E., Garnett R., editors. Adv. Neural Inf. Process. Syst. 32. Curran Associates, Inc.; 2019; pp. 8024–8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
32. Kingma D.P., Ba J.L. Adam: A method for stochastic optimization. 3rd Int. Conf. Learn. Represent. (ICLR 2015), Conf. Track Proc. 2015.
33. Krstajic D., Buturovic L.J., Leahy D.E., Thomas S. Cross-validation pitfalls when selecting and assessing regression and classification models. J. Cheminform. 2014;6. doi: 10.1186/1758-2946-6-10.