. Author manuscript; available in PMC: 2022 Feb 1.
Published in final edited form as: Future Gener Comput Syst. 2020 Sep 30;115:610–618. doi: 10.1016/j.future.2020.09.040

Estimation of laryngeal closure duration during swallowing without invasive X-rays

Shitong Mao a, Aliaa Sabry b, Yassin Khalifa a, James L Coyle b, Ervin Sejdic a,c,*
PMCID: PMC7584133  NIHMSID: NIHMS1637912  PMID: 33100445

Abstract

Laryngeal vestibule (LV) closure is a critical physiologic event during swallowing, since it is the first line of defense against the food bolus entering the airway. Identifying the laryngeal vestibule status, including closure, reopening, and closure duration, provides indispensable references for assessing the risk of dysphagia and neuromuscular function. However, the commonly used radiographic examinations, known as videofluoroscopy swallowing studies, are highly constrained by their radiation exposure and cost. Here, we introduce a non-invasive, sensor-based system that acquires high-resolution cervical auscultation signals from the neck and applies advanced deep learning techniques to detect LV behaviors. The deep learning algorithm, which combined convolutional and recurrent neural networks, was developed with a dataset of 588 swallows from 120 patients with suspected dysphagia and further clinically tested on 45 samples from 16 healthy participants. For classifying the LV closure and opening statuses, our method achieved 78.94% and 74.89% accuracy on these two datasets, respectively, suggesting the feasibility of using sensor signals for LV prediction without traditional videofluoroscopy screening. The sensor-supported system offers a broadly applicable computational approach for clinical diagnosis and biofeedback in patients with swallowing disorders without the use of radiographic examination.

Keywords: Laryngeal vestibule closure, High resolution cervical auscultation (HRCA), Deep learning, Dysphagia, Health-care

Graphical Abstract


1. Introduction

For humans, the respiratory and digestive systems share the same entrance; therefore, protecting the airway from the food bolus entering the trachea or lungs is a fundamental requirement for safe swallowing. Laryngeal vestibule (LV) closure has been considered the first line of defense against swallowed material entering the airway [1, 2, 3]. Likewise, the duration of LV closure is a predictor of airway invasion during swallowing: if laryngeal closure is absent or its duration is too short, aspiration/penetration can occur [4, 5]. Aspiration is a major concern for individuals with dysphagia (swallowing disorders), especially in neurologic and neurodegenerative diseases, where aspiration-related respiratory infections are a leading cause of death [6]. Therefore, proper evaluation of LV closure and its duration could provide an objective outcome measure to improve the assessment of swallowing safety, provide clinical evidence of increased risk of airway compromise during swallowing, and guide the initiation of appropriate compensatory interventions.

The videofluoroscopy swallowing study (VFSS) is the only instrumental assessment technique that can visualize the event of LV closure and determine its duration during swallowing through the kinematic analysis of radiographic images [6, 7, 8]. However, practical issues arise when the VFSS is implemented: it exposes patients to radiation, and it is not feasible in facilities without x-ray departments or clinicians qualified to perform and interpret VFSS images [9, 10]. Additionally, it is not suitable for cases in which patients prefer not to undergo x-ray testing or are unable to participate in the examination protocols [10, 11, 12, 13].

Furthermore, certain limitations of the ordinary clinical setting prevent more frequent temporal analysis of swallowing events using VFSS images, which would provide quantification of LV closure at baseline and assessment of treatment efficacy. Frame-by-frame review of VFSS video is time-consuming, and some clinicians may be unable to record VFSS images for secondary review due to lack of equipment or limited access to archived materials. Clinicians tend to comment on whether, and at what phase of the swallow, material enters the laryngeal vestibule without determining whether LV closure itself was shortened, further limiting inferences that can lead to treatment decisions [14].

Because of the previously mentioned drawbacks and limitations of VFSS in the detection of LV closure and reopening, it would be practically beneficial for patients and clinicians to investigate an alternative, non-invasive tool. High-resolution cervical auscultation (HRCA) is a promising non-invasive method for dysphagia screening, assessment, and management [15]. It uses high-resolution accelerometers and microphones attached to patients' necks to record vibratory and acoustic signals during swallowing [16, 17]. The advantages of such a sensor-supported approach include mobility, cost-effectiveness, non-invasiveness, and suitability for day-to-day and even minute-to-minute monitoring [15, 18]. To investigate the relationship between the signals and LV movement, previous studies postulated the cardiac analogy hypothesis to explain the elusive physiologic cause of swallowing sounds [19]. This theory suggested that cervical auscultation acoustic signals are generated via vibrations caused by valve and pump systems within the vocal tract. Moreover, HRCA signal features have been found to be associated with LV closure onset and LV reopening [20]. The slapping of the epiglottis and aryepiglottic folds may provide the valve activity that generates the swallowing sounds and neck vibrations that can be recorded with HRCA.

All these studies indicated the possibility of detecting LV closure and reopening, and of determining the closure duration, solely from the HRCA signals. However, no studies have attempted to quantitatively implement such an idea. The main challenge was that explicit dependencies between the signal features and the LV behaviors had not been mathematically established. In this study, we sought to investigate the ability of HRCA signals to identify LV status with an advanced deep learning model, which approximated this relationship from training examples. We hypothesized that a computer-aided algorithm using HRCA signals acquired from the neck could detect LV status transitions and estimate the duration of LV closure.

Machine learning and deep learning methods have become powerful tools in health-care applications and are widely employed in computer-assisted diagnosis of swallowing, laryngeal, and neck disorders [21]. Based on contact endoscopic video images of the larynx, Esmaeili et al. applied support vector machines, k-nearest neighbors, and random forests to classify benign and malignant lesions on the superficial layers of the laryngeal mucosa [22]. For early-stage diagnosis of laryngeal squamous cell carcinoma, Moccia et al. implemented a support vector machine classifier with features extracted from laryngeal endoscopic frames and achieved 93% sensitivity [23]. For a similar purpose, Araújo et al. applied transfer learning with pre-trained Convolutional Neural Network (CNN) models to process laryngoscopy images and achieved state-of-the-art performance [24]. Our previous study also applied multi-scale CNN filters for hyoid bone detection in VFSS images [25]. All those studies used images as model input. Only a few studies have attempted to use time-series signals as input to deep learning models for swallowing or laryngeal applications. Our previous research verified that on-neck sensor signals can effectively identify swallowing activity, track the hyoid bone, and segment upper esophageal sphincter opening with the help of deep learning [26, 27, 28]. However, more studies are needed for analyzing LV closure.

In this study, we used a deep learning architecture combining a CNN and a Recurrent Neural Network (RNN), which composed a highly nonlinear topology to build the relationship between the sensor signals and the LV closure duration. To verify the efficacy of the proposed method, we developed the model with a dataset of 588 swallowing samples, and a 10-fold subject cross-validation technique was used. Then we applied the model to an independent dataset of 45 new samples as a testing set to analyze its capacity for generalization.

2. Methods

2.1. Data collection and equipment

This research aimed to estimate LV closure duration using HRCA signals by temporally identifying the LV status, including LV closure and reopening. We collected two sets of data. The first dataset was composed of 588 swallows from 120 enrolled patients with suspected dysphagia. The second dataset was composed of 45 swallows from 16 healthy participants. All the enrolled participants underwent VFSS at the University of Pittsburgh Medical Center Presbyterian Hospital. Participants in the first dataset included 52 females (43.33%) and 68 males (56.67%) with various diagnoses and etiologies of dysphagia, and a mean age of 64 years (range 19–94). Seventeen participants (14.17%) had a history of stroke. Participants in the second dataset included 7 females (43.75%) and 9 males (56.25%). The mean age of these participants was also 64 years, with a range of 55–75. We intentionally did not control for participant variables because the aim of this research was to test the feasibility of our system in predicting LV closure regardless of participant variables or characteristics of swallowed materials. Participants swallowed various volumes and consistencies of material. All participants in this study signed informed consents, and the data collection protocol was approved by the Institutional Review Board of the University of Pittsburgh.

VFSSs were conducted using a standard fluoroscopic x-ray system (Precision 500D, GE Healthcare, LLC, Waukesha, WI), and the videos were captured by a frame grabber module (AccuStream Express HD, Foresight Imaging, Chelmsford, MA) at a 30 Hz sampling rate.

The sensor signals were collected concurrently during all VFSS examinations using a tri-axial accelerometer neck sensor and a contact microphone, as shown in Figure 1. The vibratory signals were sampled along three accelerometry directions: anterior-posterior (A-P), superior-inferior (S-I), and medial-lateral (M-L). Meanwhile, the acoustic signal was acquired by the microphone. The accelerometer (ADXL 327, Analog Devices, Norwood, Massachusetts) was attached at the midline of the anterior neck at the level of the cricoid cartilage with medical adhesive tape to obtain the best signal quality [29, 18]. The sensor was powered by a power supply (model 1504, BK Precision, Yorba Linda, California) with a 3 V output, and the resulting signals were bandpass filtered from 0.1 to 3000 Hz with a ten-fold amplifier (model P55, Grass Technologies, Warwick, Rhode Island). The microphone (model C411L, AKG, Vienna, Austria), powered by a power supply (model B29L, AKG, Vienna, Austria), was placed below the accelerometer and slightly toward the right lateral side of the trachea. This location has previously been described as appropriate for collecting swallowing sound signals [29, 30] without interfering with x-ray image quality. All the signals acquired by the accelerometer and microphone were fed into a National Instruments 6210 DAQ and recorded at 20 kHz by a LabView program (Signal Express, National Instruments, Austin, Texas). The signals were then downsampled from 20 kHz to 4 kHz to remove redundant points. This setup was shown to be effective at detecting swallowing activity in previous studies [31, 32].
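The 20 kHz-to-4 kHz reduction described above is a factor-of-five decimation. A minimal sketch with a hypothetical helper name; a production pipeline would apply an anti-aliasing low-pass filter (e.g. scipy.signal.decimate) before discarding samples:

```python
import numpy as np

def downsample(signal, factor):
    """Naively decimate a 1-D signal by an integer factor.

    Sketch only: real pipelines low-pass filter first to avoid aliasing.
    """
    return np.asarray(signal)[::factor]

# 20 kHz -> 4 kHz corresponds to keeping every 5th sample
fs_in, fs_out = 20_000, 4_000
x = np.random.randn(fs_in * 2)          # two seconds of raw sensor data
y = downsample(x, fs_in // fs_out)      # 8000 samples remain
```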

Figure 1:


The HRCA signals and the VFSS videos were recorded concurrently for each participant in the study.

2.2. Data labeling

The videofluoroscopic x-ray videos were segmented into individual swallow events based on the frame in which the head of the bolus reached the ramus of the mandible (onset) and the frame in which the hyoid bone returned to its lowest position following clearance of the bolus from the pharynx (offset). Inter- and intra-rater reliabilities for segmentation were established on 10% of the videos with intraclass correlation coefficients over 0.99, and intra-rater reliability was maintained throughout testing to avoid judgment drift.

Two trained raters judged the first closure and first re-opening of the LV from the videofluoroscopic x-ray videos for each swallow sample. Excellent inter- and intra-judge reliabilities were established on 10% of the videos (intraclass correlation coefficients were all over 0.99), and reliability was continuously monitored and maintained throughout testing to avoid judgment drift. The total number of labeled frames across the 588 samples was n = 15,583. The criteria for determining the onsets of LV closure and re-opening are listed as follows:

Onset of LV closure:

The first frame in which no air or barium contrast is seen in the collapsed LV (between the arytenoid cartilage and base of the epiglottis).

Onset of LV re-opening:

The first frame in which the LV re-opens. It is the frame of the first obvious airspace reappearance within the vestibule.

All the frames before the onset of LV closure and after the onset of LV re-opening were coded as ‘0’, while all the frames within the period of LV closure were coded as ‘1’. The labeled LV status for each swallow was thus converted into a binary sequence, as shown in Fig. 2. Meanwhile, to match the frame sampling rate (30 Hz), the simultaneously acquired HRCA signals were up-sampled from 4 kHz to 4.5 kHz, so that each video frame corresponded to exactly 4500/30 = 150 signal samples.
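The frame-level labeling and the sample-to-frame alignment can be sketched as follows (illustrative function and variable names, not the authors' code):

```python
import numpy as np

def label_sequence(n_frames, closure_onset, reopening_onset):
    """Binary LV label per video frame: 1 during closure, 0 elsewhere.

    Frames are indexed from 0; `closure_onset` is the first closed frame
    and `reopening_onset` the first reopened frame, per the criteria above.
    """
    labels = np.zeros(n_frames, dtype=int)
    labels[closure_onset:reopening_onset] = 1
    return labels

# At a 4.5 kHz signal rate and 30 Hz frame rate, each frame spans exactly
# 4500 / 30 = 150 signal samples, keeping labels and signal windows aligned.
samples_per_frame = 4500 // 30
labels = label_sequence(n_frames=60, closure_onset=20, reopening_onset=41)
```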

Figure 2:


Temporal binary labeling based on kinematic analysis of VFSS videos.

2.3. Deep neural network architecture

The design of the deep neural network architecture should consider the characteristics of the swallowing data. First, a swallow is composed of a series of sequential events, and temporal dependencies naturally exist in the waveforms of HRCA signals [33]. The proposed deep structure should therefore have the capacity to model the causality of the physiological process, and the RNN is the natural choice [34]. Second, the signal is long: a complete swallow generally takes 1–2 seconds, leading to 4.5k to 9k sampling points at the 4.5 kHz sampling rate. In our dataset, the longest swallowing trial lasted 1.87 s, corresponding to a maximum length of 1.87 × 4500 ≈ 8415 sampling points. Therefore, if the signal sequence were fed directly into an RNN, model training would be computationally expensive, since the number of time steps would be extremely large. A solution to circumvent this issue is to adopt a sliding window to divide the long signal into small segments and then use the RNN to model the temporal dependency among these segments. Third, the features representing each local signal segment should be carefully chosen. Unfortunately, no study has analyzed features associated with swallowing kinematics within a window period, although our previous studies fully discussed meaningful global features that characterize the HRCA signal over a complete swallowing period [35, 36, 20, 31]. Instead of designing handcrafted features, we can take advantage of CNN layers, which are good at extracting contextual features over short time periods. In the following sections, we will also show that feeding the raw signal into the RNN without feature extraction does not give optimal performance. Besides the above considerations, an RNN-based structure can deal with training signals of variable length, a merit that particularly matches the form of swallowing data.

In this study, we designed a combined (hybrid) deep neural network which contained 2-layer CNN and 2-layer RNN, at the top of which were three fully connected layers for decision making. The architecture is shown in Fig. 3. The entire network processed signal segments successively along the HRCA signal length of each swallowing sample with a sliding window, and produced the outputs at individual window to aggregate a sequential prediction. The window stride was 0.033s, which was the duration of one video frame. To investigate the effect of overlap percentage between two adjacent windows, we set the variable overlapping ratio from 0 (non-overlap) to 50%.
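The sliding-window segmentation described above can be sketched as follows (hypothetical helper; the 150-sample stride follows from the 4.5 kHz signal rate and the 30 Hz frame rate, and the window length relative to the stride sets the overlap ratio):

```python
import numpy as np

def sliding_windows(signal, win_len, stride=150):
    """Split a (length, channels) signal into overlapping windows.

    One window per video frame: with a 4.5 kHz signal and 30 Hz frames the
    stride is 150 samples; win_len >= stride gives overlap = 1 - stride/win_len.
    """
    n_windows = (len(signal) - win_len) // stride + 1
    return np.stack([signal[i * stride : i * stride + win_len]
                     for i in range(n_windows)])

x = np.random.randn(4500, 4)            # one second, four HRCA channels
segs = sliding_windows(x, win_len=300)  # 50% overlap between neighbors
```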

Figure 3:


The architecture of the proposed convolutional recurrent neural network.

The CNN part involved two 1-d CNN layers, and a 1-d pooling layer was inserted after each CNN layer. For each CNN layer, the output was defined as

h_j = σ(W ∗ X_j + b) (1)

where X_j was the signal segment in the j-th window, and the symbol ‘∗’ represented the convolution operation. Here X_j was an L_win × C matrix, in which L_win was the window length and C was the number of channels of the input signal. The dimension of the weight tensor W was S × L_kernel × C, in which S was the number of output channels and L_kernel was the kernel length. The number of input channels of the first CNN layer was 4, since the HRCA had 4 channels, and the number of input channels of the second CNN layer was 64. The number of output channels for both CNN layers was 64, and the filter length was 9. The weights W of the CNN layers were initialized with the Xavier initializer [37]. The bias term b was initialized as zeros.

In Eq. 1, σ(·) referred to an activation function that introduced nonlinearity into the model. In this study, the activation function was the leaky ReLU (rectified linear unit), in which the slope for negative input was set to 0.2 [38]. Max pooling was used in the pooling layers with a pooling length of 4 points and a pooling stride of 2 points.
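As an illustration of Eq. (1) and the pooling step, the following NumPy sketch implements one CNN layer as a 'valid' cross-correlation with leaky ReLU, followed by max pooling (illustrative code, not the authors' implementation; padding and initialization conventions vary by framework):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def conv1d(X, W, b):
    """Eq. (1): 1-D convolution of X (L_win x C) with W (S x L_kernel x C)."""
    Lwin, C = X.shape
    S, Lk, _ = W.shape
    out = np.empty((Lwin - Lk + 1, S))
    for t in range(Lwin - Lk + 1):
        # each output channel is a dot product over a length-Lk patch
        out[t] = np.tensordot(W, X[t:t + Lk], axes=([1, 2], [0, 1])) + b
    return leaky_relu(out)

def max_pool1d(h, size=4, stride=2):
    n = (len(h) - size) // stride + 1
    return np.stack([h[i * stride : i * stride + size].max(axis=0)
                     for i in range(n)])

X = np.random.randn(150, 4)             # one non-overlapping window
W = np.random.randn(64, 9, 4) * 0.05    # 64 filters of length 9
h = max_pool1d(conv1d(X, W, np.zeros(64)))
```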

In a single window, the output of the second pooling layer was flattened into a local feature vector. The sliding window thus produced a feature sequence as the input for the RNN layers, which aimed to model the swallowing dynamics between the signal patches, namely:

h^(t+1) = RNNUnit(h^(t), x^(t)) (2)

In this study, we used bidirectional RNN and gated recurrent unit (GRU) as the RNN unit, which was defined as[39]:

h^(t) = u^(t) ⊙ h̃^(t) + (1 − u^(t)) ⊙ h^(t−1) (3)
h̃^(t) = tanh[U_h x^(t) + W_h (r^(t) ⊙ h^(t−1)) + b_h] (4)
r^(t) = sigmoid(U_r x^(t) + W_r h^(t−1) + b_r) (5)
u^(t) = sigmoid(U_u x^(t) + W_u h^(t−1) + b_u) (6)

where h^(t) was the hidden state at step t, the dimension of which is known as the unit size, and ⊙ denoted element-wise multiplication. The gate variables r^(t) and u^(t) represented the reset gate and the update gate, respectively. In Eqs. 4–6, all the matrices U and W were weight matrices, initialized with the Xavier initializer [37]. In this study, each RNN layer had 64 units for both the forward and backward paths, and the hidden states of the second layer were concatenated at each time step as the RNN output.
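A single GRU step per Eqs. (3)–(6) can be sketched in NumPy as follows (illustrative parameter container and input size; the paper's unit size of 64 is kept):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x, h_prev, P):
    """One GRU step per Eqs. (3)-(6); P holds the U, W, b parameters."""
    r = sigmoid(P["Ur"] @ x + P["Wr"] @ h_prev + P["br"])  # reset gate, Eq. (5)
    u = sigmoid(P["Uu"] @ x + P["Wu"] @ h_prev + P["bu"])  # update gate, Eq. (6)
    h_tilde = np.tanh(P["Uh"] @ x + P["Wh"] @ (r * h_prev) + P["bh"])  # Eq. (4)
    return u * h_tilde + (1.0 - u) * h_prev                # Eq. (3)

rng = np.random.default_rng(0)
dim_x, dim_h = 32, 64                    # illustrative input size; 64 units
P = {k: rng.normal(scale=0.1, size=(dim_h, dim_x if k[0] == "U" else dim_h))
     for k in ("Ur", "Uu", "Uh", "Wr", "Wu", "Wh")}
P.update({k: np.zeros(dim_h) for k in ("br", "bu", "bh")})
h = np.zeros(dim_h)
for x in rng.normal(size=(10, dim_x)):   # run ten time steps
    h = gru_cell(x, h, P)
```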

At the top of the RNN were three fully connected layers. The activation function of the first two layers was the leaky ReLU with a 0.2 slope for negative input, and the activation function of the last layer was the sigmoid function. The input dimension of all three layers was 64, and the output dimension of the last layer was one. A dropout layer was also employed for each fully connected layer, with a neuron keeping rate of 0.5. All the weights of the fully connected layers were initialized with the Xavier initializer [37].

2.4. Training

In the model training phase, a 10-fold subject cross-validation technique was applied to the first dataset, which involved 588 swallowing trials from 120 patients. We randomly divided all patients into 10 non-overlapping groups, each containing around 59 swallowing trials. The deep neural network was independently trained on 9 groups, with the remaining group retained as the validation set. The cross-validation process was repeated 10 times, so that each swallowing trial was used exactly once as validation data. After the optimal model hyperparameters were confirmed, we used all 588 trials as the training set to train a new model with the same setup, and then applied this model to the second dataset, which involved 45 swallowing trials and was independent of the first dataset, for testing purposes.
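A subject-wise split of this kind can be sketched as follows (hypothetical helper; the key property is that no patient contributes swallows to both the training and validation sides of a fold):

```python
import numpy as np

def subject_folds(patient_ids, k=10, seed=0):
    """Split patients (not swallows) into k disjoint folds, so all swallows
    of one patient land in the same fold and no subject appears in both the
    training and validation sets.
    """
    rng = np.random.default_rng(seed)
    patients = rng.permutation(np.unique(patient_ids))
    for group in np.array_split(patients, k):
        val = np.isin(patient_ids, group)   # boolean mask over swallows
        yield np.where(~val)[0], np.where(val)[0]

# e.g. 120 patients with 5 swallows each (illustrative counts)
ids = np.repeat(np.arange(120), 5)
folds = list(subject_folds(ids))
```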

Based on the training set, the primary aim was to find the parameters (weights and biases) that maximized the following probability:

max_θ P(y_i | x_1 … x_i … x_T; θ), ∀i (7)

where y_i referred to the LV status at the i-th window and was a binary variable: 0 or 1, representing LV opening or closure, respectively. x_1 … x_i were the four-channel signal segments at the past frame (window) steps. In this study, since a bidirectional RNN was used, the information along the entire signal was used to make the prediction at the current time step i. θ were the parameters (weights and biases) of the model, which were unknown before training. To maximize the probability in Eq. 7, we minimized the mean cross-entropy with L2 regularization, namely:

loss = −(1 / Σ_{n=1}^{N} K_n) Σ_{n=1}^{N} Σ_{i=1}^{K_n} [(1 − y_{n,i}) log(1 − ŷ_{n,i}) + y_{n,i} log(ŷ_{n,i})] + λ‖W_θ‖² (8)

where i and n were the frame (window) index and sample index, respectively, K_n was the frame length (or window number) of the n-th swallow sample, and N was the total number of samples. y_{n,i} was the human-labeled LV status (0 or 1), and ŷ_{n,i} was the model's final output, a probability value predicting closure or opening. The loss function had two parts: the cross-entropy, reflecting the difference between the model output and the human rater's labels, and an L2 norm for better generalization. In Eq. 8, ‖W_θ‖² was the squared L2 norm of all the weights in the deep model, scaled by a factor λ [40], including the CNN kernel weights, the RNN weights, and the weights of the fully connected layers. To avoid over-fitting, we adopted the following procedures:

  1. As introduced in the last section, the dropout layers were added for the fully connected layers.

  2. We used L2 regularization with λ = 1e−4, as shown in Eq. 8.

  3. We conducted early-stopping with 1000 training epochs. This criterion was confirmed by cross-validation.
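The regularized loss in Eq. (8) can be sketched as follows (illustrative function name; probabilities are clipped for numerical stability, a detail the paper does not specify):

```python
import numpy as np

def crnn_loss(y_true, y_pred, weights, lam=1e-4):
    """Mean cross-entropy over all frames of all swallows plus an L2
    penalty on the model weights, per Eq. (8).

    y_true / y_pred are lists of per-swallow label and probability arrays
    (variable length K_n); `weights` is a list of weight arrays.
    """
    eps = 1e-12                                  # numerical floor for log
    total_frames = sum(len(y) for y in y_true)
    ce = 0.0
    for y, p in zip(y_true, y_pred):
        y = np.asarray(y, float)
        p = np.clip(np.asarray(p, float), eps, 1 - eps)
        ce -= np.sum((1 - y) * np.log(1 - p) + y * np.log(p))
    l2 = sum(np.sum(w ** 2) for w in weights)
    return ce / total_frames + lam * l2

# a perfect prediction leaves only the L2 penalty
val = crnn_loss([[0, 1, 1]], [[0.0, 1.0, 1.0]], weights=[np.ones(2)])
```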

To optimize the loss function, we used the Adam algorithm with β1 = 0.9, β2 = 0.999, ϵ = 1e−7, and learning rate = 1e−4 [41]. Considering that the deep neural network may suffer from exploding gradients, we used gradient norm clipping with a threshold of 100.

3. Results

This research aimed to estimate LV closure duration using HRCA signals by identifying the LV status, including LV closure and reopening. A deep learning architecture with combined CNN and RNN was applied to fulfill this goal. To further investigate the efficacy of the proposed model, comparative studies were also conducted from two aspects: ablation studies and overlapping-rate tuning for the sliding window.

Besides the proposed C-RNN model, we also implemented three types of baseline models in the experiments:

  1. Multi-layer perceptron (MLP) model, which had exactly the same setup as the fully connected layers of the C-RNN model. The signal segment in each window was flattened into a one-dimensional vector as model input, and the temporal dependency among the windows was ignored.

  2. CNN 1-d layers and MLP, which had exactly the same setup as the CNN 1-d layers and fully connected layers of the C-RNN model. The CNN part was first applied to the signal segment in each window, as described in Eq. (1), and the output was flattened into a one-dimensional vector as the input of the MLP. The temporal dependency was also ignored in this model.

  3. RNN and MLP, which had exactly the same setup as the RNN part and fully connected layers of the C-RNN model. The signal segment in each window was first flattened into a one-dimensional vector, and the successive windows thus produced a vector sequence as input for the RNN.

The overlapping rate was another hyperparameter in the experiment. We set this rate from 0% (non-overlap) to 50% in 10% steps. Since the window stride was 150 points (the sampling rates of the HRCA signal and the VFSS frames were 4.5 kHz and 30 Hz, respectively), the window length was adjusted over [150, 166, 188, 214, 250, 300].

For each type of model and each overlapping rate, we first applied the 10-fold subject cross-validation on the first dataset. Then we re-trained the model with the entire first dataset and tested it on the second dataset. To predict LV status for the unseen validation and testing swallowing samples, we used HRCA signals solely as the input for the deep model, in which the parameters (weights and biases) were frozen after training. To evaluate the performance of these models, the metrics of accuracy, sensitivity, specificity, and AUC (area under the ROC curve) were used. Additionally, we used the duration ratio as a clinical measure, defined as the number of predicted LV closure frames over the number of human-labeled LV closure frames. The results are listed in Table 1. Meanwhile, the ROC (receiver operating characteristic) curves calculated from cross-validation are shown in Figure 4.
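The frame-level metrics and the duration ratio can be sketched as follows (illustrative helper operating on binarized predictions; AUC is omitted since it requires the raw probabilities):

```python
import numpy as np

def frame_metrics(y_true, y_pred):
    """Frame-level accuracy, sensitivity, specificity, and duration ratio
    (predicted closure frames over human-labeled closure frames)."""
    y_true = np.asarray(y_true, bool)
    y_pred = np.asarray(y_pred, bool)
    tp = np.sum(y_true & y_pred)          # closure frames correctly found
    tn = np.sum(~y_true & ~y_pred)        # opening frames correctly found
    return {
        "acc": (tp + tn) / y_true.size,
        "sen": tp / max(y_true.sum(), 1),
        "spe": tn / max((~y_true).sum(), 1),
        "dr":  y_pred.sum() / max(y_true.sum(), 1),
    }

m = frame_metrics([0, 0, 1, 1, 1, 0], [0, 1, 1, 1, 0, 0])
```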

Table 1:

Comparative results of proposed C-RNN and other architectures for LV status classification.

| Model | OR | Data | ACC | SEN | SPE | AUC | DR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MLP | 0 | Val | 63.99% | 11.00% | 92.56% | 0.5988 | 0.4856 |
| MLP | 0 | Test | 50.80% | 6.72% | 92.62% | 0.4934 | 0.2610 |
| MLP | 10% | Val | 63.17% | 20.40% | 86.22% | 0.6258 | 0.6986 |
| MLP | 10% | Test | 51.81% | 10.91% | 90.63% | 0.5331 | 0.2958 |
| MLP | 20% | Val | 63.77% | 22.59% | 85.70% | 0.6272 | 0.7162 |
| MLP | 20% | Test | 52.98% | 15.84% | 88.22% | 0.5448 | 0.3829 |
| MLP | 30% | Val | 63.76% | 24.90% | 84.62% | 0.6306 | 0.7574 |
| MLP | 30% | Test | 52.03% | 16.59% | 85.67% | 0.5397 | 0.4229 |
| MLP | 40% | Val | 63.44% | 27.33% | 82.87% | 0.6361 | 0.7961 |
| MLP | 40% | Test | 51.53% | 17.34% | 83.97% | 0.5267 | 0.4667 |
| MLP | 50% | Val | 62.96% | 30.37% | 80.50% | 0.6338 | 0.8734 |
| MLP | 50% | Test | 53.28% | 20.63% | 84.26% | 0.5331 | 0.4747 |
| CNN+MLP | 0 | Val | 64.98% | 8.92% | 94.92% | 0.6069 | 0.4764 |
| CNN+MLP | 0 | Test | 51.02% | 4.33% | 95.32% | 0.5361 | 0.2897 |
| CNN+MLP | 10% | Val | 64.62% | 9.30% | 94.52% | 0.6056 | 0.5341 |
| CNN+MLP | 10% | Test | 51.38% | 5.14% | 95.32% | 0.5376 | 0.3253 |
| CNN+MLP | 20% | Val | 64.20% | 14.41% | 90.68% | 0.6245 | 0.6236 |
| CNN+MLP | 20% | Test | 51.75% | 12.11% | 89.36% | 0.5649 | 0.4507 |
| CNN+MLP | 30% | Val | 64.70% | 19.06% | 89.22% | 0.6367 | 0.6504 |
| CNN+MLP | 30% | Test | 53.57% | 37.82% | 68.51% | 0.5699 | 0.8127 |
| CNN+MLP | 40% | Val | 64.80% | 24.52% | 86.54% | 0.6487 | 0.7335 |
| CNN+MLP | 40% | Test | 55.17% | 57.10% | 53.33% | 0.5803 | 1.1318 |
| CNN+MLP | 50% | Val | 64.82% | 25.68% | 85.98% | 0.6515 | 0.7723 |
| CNN+MLP | 50% | Test | 55.39% | 67.41% | 43.97% | 0.5879 | 1.3324 |
| RNN+MLP | 0 | Val | 72.93% | 59.35% | 80.84% | 0.7946 | 1.300 |
| RNN+MLP | 0 | Test | 65.50% | 47.68% | 82.41% | 0.7761 | 0.7572 |
| RNN+MLP | 10% | Val | 73.23% | 59.91% | 80.52% | 0.8024 | 1.2599 |
| RNN+MLP | 10% | Test | 62.81% | 49.17% | 75.74% | 0.7360 | 1.0474 |
| RNN+MLP | 20% | Val | 72.98% | 57.73% | 81.50% | 0.8022 | 1.2755 |
| RNN+MLP | 20% | Test | 65.42% | 58.59% | 71.91% | 0.7190 | 1.0073 |
| RNN+MLP | 30% | Val | 73.77% | 59.12% | 82.24% | 0.7962 | 1.2681 |
| RNN+MLP | 30% | Test | 62.81% | 45.14% | 79.57% | 0.6839 | 0.9331 |
| RNN+MLP | 40% | Val | 72.74% | 56.87% | 81.24% | 0.7872 | 1.2601 |
| RNN+MLP | 40% | Test | 62.74% | 55.75% | 69.36% | 0.7172 | 1.1059 |
| RNN+MLP | 50% | Val | 72.54% | 56.80% | 81.10% | 0.7869 | 1.2621 |
| RNN+MLP | 50% | Test | 65.07% | 48.13% | 81.13% | 0.6947 | 1.1040 |
| Proposed C-RNN | 0 | Val | 78.39% | 72.02% | 81.82% | 0.8622 | 1.291 |
| Proposed C-RNN | 0 | Test | 75.32% | 68.01% | 82.27% | 0.8587 | 0.9636 |
| Proposed C-RNN | 10% | Val | 78.94% | 72.39% | 82.69% | 0.8680 | 1.2693 |
| Proposed C-RNN | 10% | Test | 74.89% | 65.62% | 83.69% | 0.8613 | 0.9314 |
| Proposed C-RNN | 20% | Val | 78.62% | 72.46% | 81.98% | 0.8597 | 1.2672 |
| Proposed C-RNN | 20% | Test | 75.96% | 71.00% | 78.72% | 0.8363 | 1.0201 |
| Proposed C-RNN | 30% | Val | 78.45% | 71.50% | 82.41% | 0.8642 | 1.2551 |
| Proposed C-RNN | 30% | Test | 75.40% | 70.55% | 80.00% | 0.8540 | 1.0055 |
| Proposed C-RNN | 40% | Val | 78.56% | 70.99% | 82.67% | 0.8617 | 1.2470 |
| Proposed C-RNN | 40% | Test | 76.35% | 70.70% | 81.70% | 0.8568 | 0.9766 |
| Proposed C-RNN | 50% | Val | 78.39% | 71.29% | 82.28% | 0.8597 | 1.2472 |
| Proposed C-RNN | 50% | Test | 77.07% | 75.93% | 78.16% | 0.8678 | 1.0617 |

In the header row, the acronyms and abbreviations are as follows: OR: overlapping rate; Data: whether the model was applied on the validation sets (Val) or the testing set (Test); ACC: accuracy; SEN: sensitivity; SPE: specificity; AUC: area under the ROC curve; DR: duration ratio.

Figure 4:


The ROC curves for classifying the LV status based on the 10-fold subject cross-validation. The four models are (a) MLP; (b) CNN+MLP; (c) RNN+MLP; and (d) C-RNN.

The MLP model gave the lowest accuracy and AUC for both the validation and testing sets: the accuracy was around 63% for validation and 52% for testing. When the CNN 1-d layers were used for extracting local features, the model performance improved only marginally, with accuracies growing by only around 2% for both datasets. We observed a substantial improvement when the RNN was used: compared with the CNN+MLP model, the RNN+MLP model increased the accuracy by around 8% for both datasets, even though the features in each window were not properly extracted. The proposed C-RNN model achieved the best performance. Compared with the RNN+MLP model, the C-RNN improved the validation accuracy by around 6%, and, notably, the testing accuracy improved by around 11%, indicating that the proposed model generalized better to unseen data.

For all the models, we could not observe a clear effect of the overlapping rate. For the MLP and CNN+MLP models, increasing the overlapping rate gave a slight improvement in the testing accuracy and the cross-validation AUC. For the proposed C-RNN model, a 50% overlapping rate gave a 1.75% improvement in testing accuracy compared with non-overlapped windows. The best validation performance was found at a 10% overlapping rate, for which the accuracy was 78.94%, the AUC was 0.8680, and the duration ratio was 1.27. With the same setup, the testing accuracy was 74.89% and the duration ratio was 0.93.

4. Discussion

The VFSS has long been considered the only assessment tool for detecting laryngeal activities, despite its limitations. To overcome these limitations, the primary aim of this study was to determine the feasibility of using HRCA signals to predict LV status (opening or closure) with an advanced computer-aided approach and thus estimate the duration of LV closure non-invasively. The HRCA signal features acquired by the sensors are strongly associated with several swallowing kinematic events, including the onset of LV closure [20]. Also, cervical auscultation acoustic signals have been associated with LV opening and other swallowing-related events [42, 43, 36]. Based on the cardiac analogy theory, the slapping of the epiglottis and aryepiglottic folds may provide the valve activity that generates the swallowing sounds and vibrations that can be recorded with HRCA. Our results provide further evidence of LV detectability by extracting the information solely from the HRCA recordings. Importantly, we used a deep learning method to identify the LV activities within signals in which other swallowing-induced events are mixed, such as hyoid bone movement and upper esophageal sphincter closure.

The main challenge in this study was that the temporal complexity of swallowing kinematics is still unclear; therefore, it was unfeasible to explicitly model the framewise laryngeal behavior based on the HRCA signal. Moreover, no studies have attempted to analyze general patterns across individuals, since swallowing characteristics are individual-specific. We demonstrated that a highly complex and non-linear relationship between the LV status and HRCA signals can be established via advanced deep learning algorithms, such as the hybrid neural network proposed in this study. All the validation and testing predictions were independent of the VFSS videos, which were only used for evaluating the model performance against the human labeling.

Specifically, several practical problems needed to be addressed. First, the signal patch in each window was highly heterogeneous, non-stationary, patient-specific, and noisy. Second, for each signal patch, the signal representation related to LV opening and closure was unknown based on clinical knowledge. Third, the HRCA records were composed of acoustic and tri-axial vibration signals, yet the dependency between each pair of channels has not been fully explored. Furthermore, laryngeal kinematics during swallowing is a dynamic process: for a single swallow, the historical information affects the current state. In other words, within a single swallowing time period, the past observed HRCA signals and laryngeal status determine the closure/opening at the current time step.

Based on the results, the MLP and CNN+MLP models tried to learn the relation between each individual signal segment and the corresponding label, while the interaction with past or future segments was not taken into account. The accuracies of both models were relatively low. Meanwhile, the ROC curves shown in Figure 4(a) and (b) were close to the random-guess line, the diagonal between (0, 0) and (1, 1), demonstrating that these models could learn very limited meaningful information from the training pairs [44]. In contrast, the RNN-based models (RNN+MLP and C-RNN) significantly improved the accuracies and AUCs, suggesting that strong temporal dependency exists among the signal segments and that the design of the neural network should account for such dependency.

Compared with RNN+MLP, the C-RNN model further improved accuracy, AUC, and the duration ratio, suggesting that the CNN component was indispensable. Moreover, for the baseline models (MLP, CNN+MLP, and RNN+MLP), there was a 7%–10% gap between the accuracies on the validation and testing sets. When a CNN was used at the lower level of the model, as in the proposed C-RNN, this gap narrowed to 2%–4%. This result indicates that the CNN helps learn more robust features from the signal segments and thus offers the best generalization capability.
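The division of labor in such a hybrid (a convolutional front end extracting per-window features, a recurrent layer carrying state across windows) can be sketched in a few lines of numpy. This is a toy forward pass with random weights, not the paper's trained architecture; the layer sizes, ReLU/GRU equations, and global-average pooling are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, kernels):
    """Valid 1-D convolution with ReLU: x is (channels, length),
    kernels is (filters, channels, width) -> (filters, out_length)."""
    f, c, w = kernels.shape
    out_len = x.shape[1] - w + 1
    out = np.empty((f, out_len))
    for i in range(out_len):
        out[:, i] = np.tensordot(kernels, x[:, i:i + w], axes=([1, 2], [0, 1]))
    return np.maximum(out, 0)

def gru_step(h, x, Wz, Wr, Wh):
    """Single GRU update; each weight matrix acts on [h, x] concatenated."""
    hx = np.concatenate([h, x])
    z = 1 / (1 + np.exp(-Wz @ hx))                      # update gate
    r = 1 / (1 + np.exp(-Wr @ hx))                      # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h, x]))  # candidate state
    return (1 - z) * h + z * h_tilde

# Toy C-RNN pass: CNN features per window, then a GRU over the windows.
hidden, feat = 8, 6
Wz, Wr, Wh = (rng.normal(size=(hidden, hidden + feat)) * 0.1 for _ in range(3))
kernels = rng.normal(size=(feat, 4, 9)) * 0.1
h = np.zeros(hidden)
for window in rng.normal(size=(5, 4, 40)):            # 5 windows, 4 channels
    features = conv1d(window, kernels).mean(axis=1)   # global average pool
    h = gru_step(h, features, Wz, Wr, Wh)
# h now summarizes the window sequence; an MLP head would map it to LV status.
```

Because the recurrent layer only sees pooled convolutional features, the per-window noise and patient-specific detail are compressed before any temporal reasoning happens, which is one plausible reason a CNN front end narrows the validation-to-testing gap.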

The overlapping rate slightly affected model performance but was not the dominant factor, as shown in Figure 4. Increasing the window length allowed each signal segment to include more information from the previous and next windows; however, this was not sufficient for the MLP and CNN+MLP models to learn the temporal relationships along the entire signal.

In this study, we also confirmed the feasibility of estimating LV closure duration from the sensor signals. The model's output was close to the human-labeled results: the duration ratios for both the validation and testing sets were close to one. These results indicate that HRCA signals also carry information about LV status transitions, strongly supporting the hypothesis that the C-RNN model can extract such information from the signals. This provides an objective tool for further analyzing laryngeal behavior during swallowing that can be compared with published normative data and can provide clinical evidence of increased risk of airway compromise during swallowing. Adding objective temporal measurements to LV closure evaluation, in place of purely subjective judgments, can have significant diagnostic and treatment importance for people with dysphagia.
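Computing a duration ratio from framewise closure/opening predictions can be illustrated as follows. The frame rate and the example label sequences are hypothetical, and the "longest contiguous run" convention is an assumption of this sketch rather than the paper's stated definition.

```python
import numpy as np

def closure_duration(frame_labels, fps):
    """Duration (seconds) of the longest contiguous run of
    closure frames (label 1) in a framewise label sequence."""
    best = run = 0
    for v in frame_labels:
        run = run + 1 if v == 1 else 0
        best = max(best, run)
    return best / fps

# Duration ratio = predicted closure duration / human-labeled duration;
# a value near 1 means the model recovers the labeled closure interval.
fps = 30  # frames per second; illustrative
human = [0] * 10 + [1] * 24 + [0] * 6   # labeled: 24 closure frames
model = [0] * 11 + [1] * 22 + [0] * 7   # predicted: 22 closure frames
ratio = closure_duration(model, fps) / closure_duration(human, fps)
print(round(ratio, 3))  # 0.917
```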

Detecting LV activity non-invasively with neck sensors is of particular interest in speech-language pathology, as it paves the way toward a deeper understanding of the swallowing mechanism. This study reports the feasibility of using HRCA and our custom-designed analysis algorithms to non-invasively predict LV closure duration and estimate the risk of airway compromise during swallowing. From an engineering standpoint, future work includes increasing the precision of our system. From a clinical standpoint, we did not seek to characterize patient- or disease-specific differences in LVC, but rather to evaluate the feasibility of our system. Future efforts may also investigate potential interaction effects of patient age and bolus volume on model performance, all of which are ripe areas for investigation. Since the study aimed to determine the extent to which sensor-based signals can independently predict LV status regardless of age, gender, or diagnosis, the system's feasibility has been demonstrated, and the considerations discussed above point to interesting directions for future research.

5. Conclusion

In this research, we proposed a new method for detecting LV closure and opening status based on HRCA signals and a hybrid deep learning algorithm. The results revealed that it is possible to identify LV status, and further to calculate LV closure duration, solely from the information provided by HRCA signals. This study demonstrates the feasibility of using the sensor as a potential non-invasive screening method for assessing swallowing function. The sensor-supported approach provides a potential non-invasive tool for estimating LV closure duration for diagnostic and biofeedback purposes in managing patients with dysphagia.

Highlights.

  • Estimation of laryngeal closure duration during swallowing based on high-resolution cervical auscultation signals.

  • A hybrid deep learning structure combining a 1-dimensional convolutional neural network and a recurrent neural network is employed.

  • The model was trained and validated on signals from 588 swallows with 10-fold subject cross-validation and further clinically tested on 45 completely unseen swallows.

  • Our method achieved 78.94% and 74.89% accuracies on the cross-validation and testing sets, respectively.

Acknowledgements

Research reported in this publication was supported by the Eunice Kennedy Shriver National Institute of Child Health & Human Development of the National Institutes of Health under Award no. R01HD092239, while the data were collected under Award no. R01HD074819. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Biography

Shitong Mao

received the BSc and MSc degrees from Harbin Institute of Technology, China, in 2008 and 2010, respectively. He is working toward the PhD degree at the University of Pittsburgh, Swanson School of Engineering. His current research interests include pattern recognition, machine learning, biomedical signal processing, and computer vision.

Aliaa Sabry

Aliaa is a phoniatrician and a first-year postdoctoral research fellow at the Krembil Research Institute, University Health Network. She graduated from Mansoura University, Egypt, with a Bachelor of Medicine and Surgery (MBBCh) in 2006. She earned an M.Sc. in speech-language pathology from Ain Shams University, Egypt, in 2011. She received her PhD, with a specialization in dysphagia, from Mansoura University, Egypt, in 2019, and has practiced clinically as a phoniatrician with people with swallowing disorders. Before starting her postdoctoral research fellowship at the University Health Network, Toronto, Ontario, Canada, she worked as a researcher at the swallowing lab in the Department of Communication Science and Disorders at the University of Pittsburgh (from 2016 to 2019). Her current research interests include non-invasive instrumental methods of swallow screening and management through signal processing of patients' voice and vibratory signals generated during swallowing.

Yassin Khalifa

received his B.S. and M.S. degrees in 2010 and 2013 respectively from Biomedical Engineering Department, Cairo University, Egypt.

He is currently a Research Assistant in the Department of Electrical and Computer Engineering, Swanson School of Engineering, University of Pittsburgh, PA, USA.

From 2011 to 2016, Yassin worked as a teaching and research assistant at Cairo University and Nile University, Egypt, where he helped teach several graduate and undergraduate courses and participated in multiple research projects focusing on big data analytics, machine learning, and statistical analysis. His current research interests include biomedical signal processing, data science (especially big data analytics), and applications of deep learning.

James L Coyle

received the Ph.D. degree in rehabilitation science from the University of Pittsburgh, Pittsburgh, PA, USA, in 2008 with a focus in neuroscience.

He is currently a Professor of Communication Sciences and Disorders at the School of Health and Rehabilitation Sciences, University of Pittsburgh and an Adjunct Associate Professor of speech and hearing sciences at the Ohio State University, Columbus, OH, USA. He maintains an active clinical practice in the Department of Otolaryngology, Head and Neck Surgery and the Speech Language Pathology Service, University of Pittsburgh Medical Center.

Dr. Coyle is board certified by the American Board of Swallowing and Swallowing Disorders. He is a Fellow of the American Speech-Language-Hearing Association.

Ervin Sejdić

received the B.E.Sc. and Ph.D. degrees in electrical engineering from the University of Western Ontario, London, ON, Canada, in 2002 and 2008, respectively.

From 2008 to 2010, he was a Post-Doctoral Fellow with the University of Toronto, Toronto, ON, Canada, with a cross-appointment with Bloorview Kids Rehab, Toronto, ON, Canada, Canada's largest children's rehabilitation teaching hospital. From 2010 to 2011, he was a Research Fellow with the Harvard Medical School, Boston, MA, USA, with a cross-appointment with the Beth Israel Deaconess Medical Center. In 2011, he joined the Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA, USA, as a tenure-track Assistant Professor. In 2017, he was promoted to a tenured Associate Professor. He holds secondary appointments with the Department of Bioengineering, Swanson School of Engineering, with the Department of Biomedical Informatics, School of Medicine, and with the Intelligent Systems Program, School of Computing and Information, University of Pittsburgh. His current research interests include biomedical signal processing, gait analyses, swallowing difficulties, advanced information systems in medicine, rehabilitation engineering, assistive technologies, and anticipatory medical devices.

Dr. Sejdić was a recipient of many awards. As a graduate student, he was awarded two prestigious awards from the Natural Sciences and Engineering Research Council of Canada. In 2010, he was the recipient of the Melvin First Young Investigator's Award from the Institute for Aging Research at Hebrew SeniorLife, Boston, MA, USA. In 2016, President Obama named Prof. Sejdić as a recipient of the Presidential Early Career Award for Scientists and Engineers, the highest honor bestowed by the U.S. Government on science and engineering professionals in the early stages of their independent research careers. In 2017, he was the recipient of the National Science Foundation CAREER Award, which is the National Science Foundation's most prestigious award in support of career-development activities of those scholars who most effectively integrate research and education within the context of the mission of their organization.

Footnotes

Declaration of competing interest

We declare we have no competing interests.



References

  • [1].Vose A, Humbert I, Hidden in plain sight: A descriptive review of laryngeal vestibule closure, Dysphagia (2018) 1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [2].Ekberg O, Closure of the laryngeal vestibule during deglutition, Acta Oto-laryngologica 93 (1–6) (1982) 123–129. [DOI] [PubMed] [Google Scholar]
  • [3].Ekberg O, Nylander G, Cineradiography of the pharyngeal stage of deglutition in 150 individuals without dysphagia, The British Journal of Radiology 55 (652) (1982) 253–257. [DOI] [PubMed] [Google Scholar]
  • [4].Park T, Kim Y, Ko D-H, McCullough G, Initiation and duration of laryngeal closure during the pharyngeal swallow in post-stroke patients, Dysphagia 25 (3) (2010) 177–182. [DOI] [PubMed] [Google Scholar]
  • [5].Shibata S, Inamoto Y, Saitoh E, Kagaya H, Aoyagi Y, Ota K, Akahori R, Fujii N, Palmer J, González-Fernández M, The effect of bolus volume on laryngeal closure and UES opening in swallowing: Kinematic analysis using 320-row area detector CT study, Journal of Oral Rehabilitation 44 (12) (2017) 974–981. [DOI] [PubMed] [Google Scholar]
  • [6].Power ML, Hamdy S, Singh S, Tyrrell PJ, Turnbull I, Thompson DG, Deglutitive laryngeal closure in stroke patients, Journal of Neurology, Neurosurgery and Psychiatry 78 (2) (2007) 141–146. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [7].Vilardell N, Rofes L, Arreola V, Martin A, Muriana D, Palomeras E, Ortega O, Clavé P, Videofluoroscopic assessment of the pathophysiology of chronic poststroke oropharyngeal dysphagia, Neurogastroenterology and Motility 29 (10) (2017) 1–8. [DOI] [PubMed] [Google Scholar]
  • [8].Coyle JL, Robbins J, Assessment and behavioral management of oropharyngeal dysphagia, Current Opinion in Otolaryngology, Head and Neck Surgery 5 (3) (1997) 147–152. [Google Scholar]
  • [9].Mahesh M, Fluoroscopy: Patient radiation exposure issues, Radiographics 21 (4) (2001) 1033–1045. [DOI] [PubMed] [Google Scholar]
  • [10].Zammit-Maempel I, Chapple C-L, Leslie P, Radiation dose in videofluoroscopic swallow studies, Dysphagia 22 (1) (2007) 13–15. [DOI] [PubMed] [Google Scholar]
  • [11].Nierengarten MB, Evaluating dysphagia: Current approaches, Oncology Times 31 (14) (2009) 29–30. [Google Scholar]
  • [12].Steele C, Allen C, Barker J, Buen P, French R, Fedorak A, Day S, Lapointe J, Lewis L, MacKnight C, et al. , Dysphagia service delivery by speech-language pathologists in Canada: Results of a national survey, Canadian Journal of Speech-Language Pathology and Audiology 31 (4) (2007) 166–177. [Google Scholar]
  • [13].Bonilha HS, Humphries K, Blair J, Hill EG, McGrattan K, Carnes B, Huda W, Martin-Harris B, Radiation exposure time during MBSS: Influence of swallowing impairment severity, medical diagnosis, clinician experience, and standardized protocol use, Dysphagia 28 (1) (2013) 77–85. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [14].Bonilha HS, Blair J, Carnes B, Huda W, Humphries K, McGrattan K, Michel Y, Martin-Harris B, Preliminary investigation of the effect of pulse rate on judgments of swallowing impairment and treatment recommendations, Dysphagia 28 (4) (2013) 528–538. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [15].Sejdic E, Malandraki GA, Coyle JL, Computational deglutition: Using signal-and image-processing methods to understand swallowing and associated disorders [life sciences], IEEE Signal Processing Magazine 36 (1) (2019) 138–146. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [16].Movahedi F, Kurosu A, Coyle JL, Perera S, Sejdić E, Anatomical directional dissimilarities in tri-axial swallowing accelerometry signals, IEEE Transactions on Neural Systems and Rehabilitation Engineering 25 (5) (2017) 447–458. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [17].Dudik JM, Jestrović I, Luan B, Coyle JL, Sejdić E, Characteristics of dry chin-tuck swallowing vibrations and sounds, IEEE Transactions on Biomedical Engineering 62 (10) (2015) 2456–2464. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [18].Dudik JM, Kurosu A, Coyle JL, Sejdić E, Dysphagia and its effects on swallowing sounds and vibrations in adults, Biomedical Engineering Online 17 (1) (2018) 69. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [19].Cichero JA, Murdoch BE, The physiologic cause of swallowing sounds: Answers from heart sounds and vocal tract acoustics, Dysphagia 13 (1) (1998) 39–52. [DOI] [PubMed] [Google Scholar]
  • [20].Kurosu A, Coyle JL, Dudik JM, Sejdic E, Detection of swallow kinematic events from acoustic high-resolution cervical auscultation signals in patients with stroke, Archives of Physical Medicine and Rehabilitation 100 (3) (2019) 501–508. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [21].Resteghini C, Trama A, Borgonovi E, Hosni H, Corrao G, Orlandi E, Calareso G, De Cecco L, Piazza C, Mainardi L, et al. , Big data in head and neck cancer, Current Treatment Options in Oncology 19 (12) (2018) 62. [DOI] [PubMed] [Google Scholar]
  • [22].Esmaeili N, Illanes A, Boese A, Davaris N, Arens C, Friebe M, Novel automated vessel pattern characterization of larynx contact endoscopic video images, International Journal of Computer Assisted Radiology and Surgery 14 (10) (2019) 1751–1761. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [23].Moccia S, De Momi E, Guarnaschelli M, Savazzi M, Laborai A, Guastini L, Peretti G, Mattos LS, Confident texture-based laryngeal tissue classification for early stage diagnosis support, Journal of Medical Imaging 4 (3) (2017) 034502. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [24].Araújo T, Santos CP, De Momi E, Moccia S, Learned and handcrafted features for early-stage laryngeal SCC diagnosis, Medical & Biological Engineering & Computing 57 (12) (2019) 2683–2692. [DOI] [PubMed] [Google Scholar]
  • [25].Zhang Z, Coyle JL, Sejdić E, Automatic hyoid bone detection in fluoroscopic images using deep learning, Scientific Reports 8 (1) (2018) 1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [26].Khalifa Y, Coyle JL, Sejdić E, Non-invasive identification of swallows via deep learning in high resolution cervical auscultation recordings, Scientific Reports 10 (1) (2020) 1–13. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [27].Khalifa Y, Donohue C, Coyle JL, Sejdic E, Upper esophageal sphincter opening segmentation with convolutional recurrent neural networks in high resolution cervical auscultation, IEEE Journal of Biomedical and Health Informatics (2020) 1–1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [28].Mao S, Zhang Z, Khalifa Y, Donohue C, Coyle JL, Sejdic E, Neck sensor-supported hyoid bone movement tracking during swallowing, Royal Society Open Science 6 (7) (2019) 181982. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [29].Dudik JM, Coyle JL, Sejdić E, Dysphagia screening: Contributions of cervical auscultation signals and modern signal-processing techniques, IEEE Transactions on Human-machine Systems 45 (4) (2015) 465–477. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [30].Cichero JA, Murdoch BE, Detection of swallowing sounds: Methodology revisited, Dysphagia 17 (1) (2002) 40–49. [DOI] [PubMed] [Google Scholar]
  • [31].Dudik JM, Kurosu A, Coyle JL, Sejdić E, A statistical analysis of cervical auscultation signals from adults with unsafe airway protection, Journal of Neuroengineering and Rehabilitation 13 (1) (2016) 7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [32].Lee J, Sejdić E, Steele CM, Chau T, Effects of liquid stimuli on dual-axis swallowing accelerometry signals in a healthy population, Biomedical Engineering OnLine 9 (1) (2010) 7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [33].Dodds WJ, Stewart ET, Logemann JA, Physiology and radiology of the normal oral and pharyngeal phases of swallowing, American Journal of Roentgenology 154 (5) (1990) 953–963. [DOI] [PubMed] [Google Scholar]
  • [34].Goodfellow I, Bengio Y, Courville A, Deep Learning, MIT Press, 2016. [Google Scholar]
  • [35].He Q, Perera S, Khalifa Y, Zhang Z, Mahoney AS, Sabry A, Donohue C, Coyle JL, Sejdić E, The association of high resolution cervical auscultation signal features with hyoid bone displacement during swallowing, IEEE Transactions on Neural Systems and Rehabilitation Engineering 27 (9) (2019) 1810–1816. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [36].Rebrion C, Zhang Z, Khalifa Y, Ramadan M, Kurosu A, Coyle JL, Perera S, Sejdić E, High-resolution cervical auscultation signal features reflect vertical and horizontal displacements of the hyoid bone during swallowing, IEEE Journal of Translational Engineering in Health and Medicine 7 (2018) 1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [37].Glorot X, Bengio Y, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256. [Google Scholar]
  • [38].Xu B, Wang N, Chen T, Li M, Empirical evaluation of rectified activations in convolutional network, arXiv preprint arXiv:1505.00853 (2015). [Google Scholar]
  • [39].Chung J, Gulcehre C, Cho K, Bengio Y, Gated feedback recurrent neural networks, in: International Conference on Machine Learning, 2015, pp. 2067–2075. [Google Scholar]
  • [40].Han S, Pool J, Tran J, Dally W, Learning both weights and connections for efficient neural network, in: Advances in Neural Information Processing Systems, 2015, pp. 1135–1143. [Google Scholar]
  • [41].Kingma DP, Ba J, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014). [Google Scholar]
  • [42].Moriniere S, Boiron M, Alison D, Makris P, Beutter P, Origin of the sound components during pharyngeal swallowing in normal subjects, Dysphagia 23 (3) (2008) 267–273. [DOI] [PubMed] [Google Scholar]
  • [43].Perlman AL, Ettema SL, Barkmeier J, Respiratory and acoustic signals associated with bolus passage during swallowing, Dysphagia 15 (2) (2000) 89–94. [DOI] [PubMed] [Google Scholar]
  • [44].Fawcett T, An introduction to ROC analysis, Pattern Recognition Letters 27 (8) (2006) 861–874. [Google Scholar]
