Author manuscript; available in PMC: 2025 Mar 6.
Published in final edited form as: Annu Int Conf IEEE Eng Med Biol Soc. 2024 Jul;2024:1–4. doi: 10.1109/EMBC53108.2024.10782457

Optimizing Modified Barium Swallow Exam Workflow: Automating Pre-Analysis Video Sorting in Swallowing Function Assessment

Shitong Mao 1, Mohamed A Naser 2, Sheila Nida Buoy 1, Kristy K Brock 3, Katherine A Hutcheson 1
PMCID: PMC11883172  NIHMSID: NIHMS2052535  PMID: 40039880

Abstract

Modified Barium Swallow (MBS) exams, performed using video-fluoroscopy, an X-ray imaging technique, are essential for assessing swallowing function. They visualize the barium bolus (contrast agent) during the swallowing process in the head and neck area, thereby providing crucial insights into the dynamics of swallowing. Typically, these exams include both diagnostic anteroposterior (AP) and lateral planes, in addition to non-diagnostic “scout” films. This study introduces a deep learning solution aimed at streamlining the pre-analysis process of MBS exams by automating the identification of video orientations and scout video clips. Our methods were trained and tested on a comprehensive dataset comprising 2,315 video clips from 172 MBS exams and 106 patients. In distinguishing AP videos from lateral views, our model achieved more than 99% accuracy at the frame level and 100% at the video level. In differentiating scout from bolus swallowing tasks, the model attained a maximum accuracy of 86% at the video level. Merging these two tasks into a multi-task learning approach further enhanced the accuracy of scout/bolus differentiation to 91%. These advancements allow clinicians to focus their effort primarily on lateral-view videos for clinically relevant measurements such as the Penetration-Aspiration Scale (PAS) and Dynamic Imaging Grade of Swallowing Toxicity (DIGEST). This image sorting is also a prerequisite step for applying deep learning solutions to full image analysis.

Keywords: Modified Barium Swallow (MBS), Deep Learning, Video-fluoroscopy, Swallowing Function Assessment

I. Introduction

The Modified Barium Swallow (MBS) study, employing video-fluoroscopy, stands as a gold standard for assessing dysphagia (swallowing disorder). This dynamic X-ray imaging technique visualizes the barium bolus during the swallowing process, highlighting the safety and efficiency of the swallow function and key structures in the neck. Typically, MBS exams consist of several swallowing trials using barium as a contrast agent, recorded in both lateral and anteroposterior (AP) orientations, as shown in Fig. 1. Lateral views are indispensable for detailed swallowing mechanism analysis and are used for critical clinical measurements like the Penetration-Aspiration Scale (PAS) [1] and Dynamic Imaging Grade of Swallowing Toxicity (DIGEST) [2, 3]. Though less prevalent, AP views also offer vital insights into the symmetry of the swallowing process and the width of pharyngeal structures, but do not contribute to most derived MBS metrics. Another essential element in MBS studies is the scout video (also referred to as the initial view, preliminary view, baseline imaging, or setup film) [4], a preliminary non-barium video clip aimed at optimizing fluoroscopy settings and providing anatomic reference prior to barium administration [5, 6]. Scout films ensure the calibration, magnification, and patient position needed to capture the appropriate field of view for visualizing swallowing in subsequent trials. However, scout videos, along with AP-view clips, are interleaved temporally with diagnostic clips due to the sequential nature of MBS recording, which necessitates meticulous sorting to ensure accurate and efficient MBS analysis.

Fig. 1.


Configuration of MBS exam that includes AP and lateral orientations corresponding to the coronal and sagittal planes, respectively.

For post-exam analysis focusing on the lateral orientation, the task of manually sorting out non-diagnostic scout videos and identifying AP clips becomes increasingly complex with larger datasets (>100 MBSs). This process is not only labor-intensive but also poses significant challenges for data management and resource allocation. Recent advancements in computer vision and deep learning have advanced the automation of MBS exam analyses. Studies have applied deep learning to analyze video-fluoroscopic imaging features such as hyoid bone tracking [7, 8], food bolus segmentation [9], and cervical vertebrae localization [10], showing promise in dysphagia assessment. However, these algorithms were developed on manually curated datasets containing fractional segments of the full clinical examination file, from which researchers had censored non-informative image segments like scout or AP films. This manual curation makes it difficult to apply existing algorithms to automated processing of a full clinical imaging file. Manual sorting of videos, necessary for pre-processing, contradicts the goal of a fully automated system and remains a significant barrier to integrating computational techniques into swallowing function assessment.

In this study, we applied deep learning methods to automate the sorting of swallowing videos in MBS studies, classifying them by orientation (AP or lateral) and distinguishing between scout and bolus clips. The video-level accuracy of the first task reached 100%, and utilizing a multi-task learning approach, we improved the video-level accuracy of the second task from 86% to 91%. These results underscore the fundamental strides being made in computational deglutition analysis. The subsequent sections are organized as follows: Section II outlines the data pipeline and the architectures of the deep learning model. Section III presents the results, focusing on the comparative performance analysis across various model backbones.

II. Methods

A. Data Collection and Preparation

In this study, a dataset was collected from a cohort of 106 patients (92 male; average age 61.34 ± 9.00 years, mean ± std) at The University of Texas MD Anderson Cancer Center over the period 2020–2022. The collection of these MBS videos was conducted following approval from the Institutional Review Board (IRB) of MD Anderson (IRB approval number 2018–0019). Each patient underwent an MBS exam following a standard acquisition protocol as a routine healthcare procedure. Fluoroscopy was recorded at 30 FPS during the administration of approximately 10 barium bolus trials per MBS exam, and each trial was recorded as an individual video clip. The dataset encompasses a total of 172 MBS exams, yielding an aggregate of 2,315 individual video clips. The average number of frames per clip was 256.88 ± 175.63 (mean ± standard deviation), for a total of 594,667 frames. Clinical raters reviewed all the videos, manually annotating each for its orientation (AP or lateral) and categorizing it as a scout or bolus swallowing video.

To construct the deep learning data pipeline, we implemented two types of frame selection methods (as shown in Fig. 2): 1. Random frame selection: frames were randomly selected from the MBS video sequence and individually processed by a Convolutional Neural Network (CNN), with each frame’s output then fed into a classifier for the final output; 2. Fixed interval selection: frames were selected at a fixed interval from a randomly chosen segment (3/4 of the total frames) of the MBS video sequence, processed in sequence through a CNN, and their combined outputs concatenated as input for the final output. The first method substantially decreased the computational load, but at the cost of losing the temporal dependency inherent in the video sequence. The second method, similar to late fusion in video classification, was limited in its capability to compute frame-level accuracy. The number of selected frames was set to 9 for consistency across all following experiments in our study.
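The two selection strategies can be sketched as index generators over a video's frame sequence. This is a minimal illustration, not the authors' code: the function names are ours, and only the frame count of 9 and the 3/4 segment ratio come from the description above.

```python
import random

def random_frame_indices(num_frames, k=9, seed=None):
    """Strategy 1: sample k frame indices uniformly at random from the
    whole video; frames are scored independently, so temporal order
    within the video is not exploited."""
    rng = random.Random(seed)
    return rng.sample(range(num_frames), k)

def fixed_interval_indices(num_frames, k=9, segment_ratio=0.75, seed=None):
    """Strategy 2: choose a random contiguous segment covering
    segment_ratio of the video, then take k evenly spaced frames from
    it, preserving temporal order for sequence-level classification."""
    rng = random.Random(seed)
    seg_len = int(num_frames * segment_ratio)
    start = rng.randint(0, num_frames - seg_len)
    step = seg_len // k
    return [start + i * step for i in range(k)]
```

Either function yields the 9 frame indices fed to the CNN; only the second guarantees a sorted, evenly spaced sequence.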

Fig. 2.


Two strategies for analyzing video frame sequences from swallowing videos. (a) Random frame selection method and (b) Fixed interval selection method.

B. Multi-task Learning Approach

In our study, we utilized a multi-task learning approach to increase the accuracy of our deep learning model, training it to classify video clips by both orientation and scout type. We employed a pre-trained CNN backbone, with its top layer (head) removed, as a shared feature extractor across all selected frames. For randomly selected frames, a new head (a linear layer) processed the features from each individual frame; for fixed interval frame selection, the concatenated features served as input to the head. This design allowed the multi-task training process to adapt to either frame selection strategy. The model featured two heads, each trained independently but sharing the same CNN backbone; during the fine-tuning phase, the backbone parameters were updated by both tasks concurrently.
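The shared-backbone, two-head arrangement can be sketched as follows. This is an illustrative NumPy stand-in under assumed dimensions (512-d pooled features, as in ResNet18, and 9 frames); the real backbone is a pre-trained CNN, and the random "features" here only demonstrate the data flow through the two heads.

```python
import numpy as np

FEAT_DIM, N_FRAMES, N_CLASSES = 512, 9, 2  # assumed illustrative sizes
rng = np.random.default_rng(0)

def backbone_features(frames):
    """Stand-in for the shared pre-trained CNN with its head removed:
    maps each frame to a FEAT_DIM-dimensional feature vector."""
    return rng.standard_normal((len(frames), FEAT_DIM))

# Two task-specific linear heads share the backbone output:
# the orientation head scores each frame independently (random-frame
# strategy), while the scout/bolus head scores the concatenated
# sequence of features (fixed-interval strategy).
W_orientation = rng.standard_normal((FEAT_DIM, N_CLASSES)) * 0.01
W_scout = rng.standard_normal((FEAT_DIM * N_FRAMES, N_CLASSES)) * 0.01

feats = backbone_features(range(N_FRAMES))        # shape (9, 512)
orientation_logits = feats @ W_orientation        # per-frame, shape (9, 2)
scout_logits = feats.reshape(1, -1) @ W_scout     # per-video, shape (1, 2)
```

Because both heads read the same `feats`, gradients from either task would update the shared backbone during fine-tuning, which is the mechanism multi-task learning exploits here.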

C. Experiment Implementation

The cohort of 106 patients was randomly divided into training, validation, and test groups, detailed in Table I. Models underwent training using videos from the training group, with an early stopping criterion guided by the validation group’s performance. Validation performance relied on video-level accuracy, and for the multi-task learning framework, we calculated the mean accuracy across both tasks. Training ceased if there was no improvement in validation accuracy for 50 consecutive epochs. The model achieving the highest validation accuracy was then applied to the test group videos.
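The early stopping rule described above can be sketched as a small tracker; the class name and the simulated accuracy curve are illustrative, while the patience of 50 epochs and the use of (mean) video-level validation accuracy follow the text.

```python
class EarlyStopper:
    """Track video-level validation accuracy (for multi-task training,
    the mean accuracy across both tasks) and stop after `patience`
    consecutive epochs without improvement, remembering the best epoch."""

    def __init__(self, patience=50):
        self.patience = patience
        self.best_acc = float("-inf")
        self.best_epoch = -1
        self.epochs_since_best = 0

    def update(self, epoch, val_acc):
        """Record this epoch's validation accuracy; return True when
        training should stop."""
        if val_acc > self.best_acc:
            self.best_acc, self.best_epoch = val_acc, epoch
            self.epochs_since_best = 0
        else:
            self.epochs_since_best += 1
        return self.epochs_since_best >= self.patience

stopper = EarlyStopper(patience=50)
# Simulated run: accuracy improves for 10 epochs, then plateaus.
history = [0.5 + 0.01 * min(e, 10) for e in range(200)]
for epoch, acc in enumerate(history):
    if stopper.update(epoch, acc):
        break  # checkpoint from stopper.best_epoch would be evaluated
```

The model saved at `stopper.best_epoch` is the one applied to the test group.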

TABLE I.

Demographic and MBS Study Distribution

| Group | Patient number | Age | Gender | MBS exam number | Video number |
|---|---|---|---|---|---|
| Training group | 74 (69.81%) | 61.31 ± 8.58 | M: 65, F: 9 | 117 (68.02%) | 1,559 (67.34%) |
| Validation group | 16 (15.09%) | 61.13 ± 11.51 | M: 13, F: 3 | 27 (15.70%) | 369 (15.94%) |
| Test group | 16 (15.09%) | 61.88 ± 7.95 | M: 14, F: 2 | 28 (16.28%) | 387 (16.72%) |

Data augmentation techniques were integrated into the training process to enhance model robustness. These included random scaling and cropping of the frames, with a scaling range of 0.8 to 1.0 and an aspect ratio range of 0.6 to 1.667. Random horizontal flipping was also employed; vertical flipping was excluded to maintain clinical relevance, as it does not correspond to typical patient positioning in MBS exams, where the patient’s head remains upright. To ensure uniformity across frames of the same video, frames selected at fixed intervals underwent identical augmentation. No augmentation was performed on the validation or test sets. During inference on the test set, for models trained under the randomly selected frame scenario, the model was applied to all frames of an individual video, and the final classification for each video was determined by a majority vote over the frame-level outputs. For the fixed interval selection scenario, the ratio of the selected segment was set to one, meaning that we uniformly sampled frames from the entire frame sequence. We also applied single-task learning to these two tasks separately as references for multi-task learning.
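The majority-vote aggregation used at inference can be written in a few lines; the function name is ours, and the tie-breaking behavior (first label encountered wins on a tie) is an implementation detail of this sketch rather than something specified in the text.

```python
from collections import Counter

def video_prediction(frame_predictions):
    """Collapse per-frame class predictions into a single video-level
    label by majority vote over all frames of the video."""
    return Counter(frame_predictions).most_common(1)[0][0]

# e.g. 7 of 9 frame-level outputs say "bolus", so the video is "bolus"
label = video_prediction(["bolus"] * 7 + ["scout"] * 2)
```

This is why video-level accuracy can exceed frame-level accuracy: scattered frame errors are absorbed as long as the correct class holds the majority within each video.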

For the CNN backbones, we experimented with a variety of pre-trained models, each requiring a different input frame size: ResNet variants (18, 34, and 50) with an input frame size of (512, 512); Inception_V3 at (299, 299); and DenseNet variants (121 and 169), MobileNet, and EfficientNet, all at (224, 224). In the training phase, the loss function was cross-entropy loss, and we employed a Stochastic Gradient Descent (SGD) optimizer with a learning rate of 0.0001 and a momentum of 0.9. The batch size was set to 16, corresponding to the number of videos. To address data imbalance, we implemented the Synthetic Minority Over-sampling Technique (SMOTE). This study was developed on a Kubernetes GPU cluster; each node has 128 physical CPUs, 1 TB of memory, and 8 Nvidia A100 GPUs.

III. Results

The efficacy of our deep learning model was evaluated across two primary tasks: classification of video orientations (AP or lateral) and differentiation of video types (scout or bolus). We first applied single-task learning to the video orientation classification and evaluated accuracy and F1-score with randomly selected frames, as shown in Table II.

TABLE II.

Classification results for MBS video orientation

| Model | Frame-level accuracy | Frame-level F1-score | Video-level accuracy | Video-level F1-score |
|---|---|---|---|---|
| ResNet18 | 99.72% | 0.9921 | 100% | 1.0000 |
| ResNet34 | 99.70% | 0.9925 | 100% | 1.0000 |
| ResNet50 | 99.70% | 0.9921 | 100% | 1.0000 |
| Inception_V3 | 99.70% | 0.9919 | 100% | 1.0000 |
| DenseNet121 | 99.51% | 0.9908 | 100% | 1.0000 |
| DenseNet169 | 99.70% | 0.9922 | 100% | 1.0000 |
| MobileNet | 99.59% | 0.9894 | 100% | 1.0000 |
| EfficientNet | 99.22% | 0.9869 | 99.74% | 0.9985 |

For the MBS video orientation classification task, frame-level accuracies were exceptionally high, with ResNet18 leading slightly and EfficientNet showing marginally lower performance than the others. At the video level, all models reached perfect or near-perfect accuracy, with EfficientNet slightly trailing at 99.74%. The video-level F1-scores mirror this trend, with EfficientNet showing a slightly lower score of 0.9985, which is still remarkably high. These results suggest that the salient information within the MBS video images was well captured by the various deep learning models. The consistently high frame-level accuracies across all models indicate that the distinguishing features for orientation classification are prominently represented in the frames, allowing the models to effectively learn and identify them. Given that randomly selected frames achieved high accuracy, we did not proceed with experiments using frames selected at fixed intervals for distinguishing between AP and lateral swallowing videos, and instead focused on the performance of scout/bolus video classification.

We then applied single-task learning to identify scout films in the MBS exams by applying the deep learning model to randomly selected frames. The results are shown in Table III. At the frame level, accuracies range from 68.25% for Inception_V3 to 86.34% for DenseNet169. At the video level, all models show improved performance compared to the frame level, with accuracies ranging from 64.59% for Inception_V3 to 86.30% for DenseNet169, and a general increase in F1-scores. Subsequently, we trained models using sequences of uniformly sampled frames from single videos, as illustrated in Table VI. DenseNet169 and EfficientNet emerged as the top performers, each achieving an accuracy exceeding 86%. The F1-scores of these models indicated a commendable balance between precision and recall. Notably, ResNet50 exhibited a remarkable F1-score of 0.8830, underscoring its robustness in classifying videos. A comparison between Table III and Table VI, the latter of which incorporated temporal dependencies into the classification, revealed an increase in video-level accuracy for most models, with DenseNet169 experiencing a slight decline. This suggests that models like DenseNet and ResNet are possibly more capable of leveraging the global contextual information from fixed interval frames than the more variable data from random frame selection. Moreover, the observed improvements indicate that models may be better at discerning broader patterns within a frame sequence, enhancing overall video classification accuracy.

TABLE III.

Classification results for scout/bolus swallowing on randomly selected frames

| Model | Frame-level accuracy | Frame-level F1-score | Video-level accuracy | Video-level F1-score |
|---|---|---|---|---|
| ResNet18 | 77.23% | 0.7094 | 77.26% | 0.7410 |
| ResNet34 | 78.47% | 0.5320 | 78.81% | 0.6473 |
| ResNet50 | 73.48% | 0.7459 | 77.00% | 0.8152 |
| Inception_V3 | 68.25% | 0.5760 | 64.59% | 0.5673 |
| DenseNet121 | 72.91% | 0.6174 | 75.19% | 0.6612 |
| DenseNet169 | 86.34% | 0.6162 | 86.30% | 0.6718 |
| MobileNet | 72.41% | 0.7220 | 75.71% | 0.7906 |
| EfficientNet | 70.28% | 0.5782 | 70.28% | 0.6639 |

TABLE IV.

Classification results for scout/bolus videos on randomly selected frames from multi-task learning

| Model | Frame-level accuracy | Frame-level F1-score | Video-level accuracy | Video-level F1-score |
|---|---|---|---|---|
| ResNet18 | 76.63% | 0.7456 | 76.74% | 0.7785 |
| ResNet34 | 61.76% | 0.6719 | 62.53% | 0.6918 |
| ResNet50 | 79.71% | 0.7550 | 82.94% | 0.7901 |
| Inception_V3 | 65.59% | 0.5253 | 61.50% | 0.4704 |
| DenseNet121 | 82.43% | 0.6719 | 84.75% | 0.7247 |
| DenseNet169 | 88.56% | 0.5953 | 89.66% | 0.6819 |
| MobileNet | 80.29% | 0.4604 | 81.14% | 0.5917 |
| EfficientNet | 78.36% | 0.6608 | 79.33% | 0.7280 |

Next, we employed a multi-task learning framework to classify scout/bolus videos, utilizing both random and fixed interval frame selection strategies; the results are shown in Tables IV and V. For the randomly selected frame scenario, the performance changes between single-task and multi-task learning at the video level are model-dependent. Some models, such as ResNet50, DenseNet121, and EfficientNet, improved when leveraging the relatedness of the two tasks, possibly due to better feature extraction and generalization capabilities. Conversely, models like ResNet34 and Inception_V3 appeared to struggle with the added complexity of multi-task learning, highlighting the need for model-specific approaches to optimize performance on different types of tasks. For the scenario that considered temporal coherence, an enhancement in video-level accuracy was observed across nearly all models. An exception to this trend was DenseNet169, which exhibited a minor reduction in F1-score, suggesting a nuanced response to the complexities of multi-task learning. ResNet50 achieved the highest accuracy at 91.73%, likely attributable to the model’s ability to capitalize on temporal coherence. In terms of F1-score, ResNet34 demonstrated notable performance, maintaining the highest score despite an accuracy about 1% lower. This underscores the effectiveness of our strategy in achieving a balance between precision and recall, particularly in the scout/bolus classification task.

TABLE V.

Classification results for scout/bolus swallowing with temporal coherence from multi-task learning

| Model | Video-level accuracy | Video-level F1-score |
|---|---|---|
| ResNet18 | 90.18% | 0.8463 |
| ResNet34 | 90.70% | 0.8708 |
| ResNet50 | 91.73% | 0.6879 |
| Inception_V3 | 88.11% | 0.7898 |
| DenseNet121 | 89.15% | 0.8632 |
| DenseNet169 | 87.08% | 0.8317 |
| MobileNet | 82.95% | 0.7434 |
| EfficientNet | 82.17% | 0.7402 |

TABLE VI.

Classification results for scout/bolus swallowing on frames at fixed interval from single-task learning

| Model | Video-level accuracy | Video-level F1-score |
|---|---|---|
| ResNet18 | 78.55% | 0.7884 |
| ResNet34 | 81.91% | 0.8257 |
| ResNet50 | 80.36% | 0.8830 |
| Inception_V3 | 77.78% | 0.7435 |
| DenseNet121 | 77.26% | 0.6694 |
| DenseNet169 | 86.04% | 0.8671 |
| MobileNet | 79.32% | 0.7034 |
| EfficientNet | 86.56% | 0.8292 |

IV. Conclusion

This study represents an advancement in MBS exam analysis for swallowing function assessment by investigating a deep learning-based methodology to automate video sorting. The incorporation of a multi-task learning framework notably improved the classification of scout and bolus swallowing videos. This advancement suggests the possibility of fully automating the preprocessing steps and optimizing the MBS review workflow, paving the way toward complete automation of MBS analysis.

Acknowledgment

This work is supported by Cancer Center Support Grant (CCSG) P30 CA016672. We would also like to express our sincere appreciation to Gabriel Gelabert, Emory Twitty, Viridiana Andrade, and Parvin Ebadi, for their expert and dedicated manual annotation of the videos.

References

  • [1] Rosenbek JC, Robbins JA, Roecker EB, Coyle JL, and Wood JL, “A penetration-aspiration scale,” Dysphagia, vol. 11, no. 2, pp. 93–98, 1996.
  • [2] Hutcheson KA, Barbon CE, Alvarez CP, and Warneke CL, “Refining measurement of swallowing safety in the Dynamic Imaging Grade of Swallowing Toxicity (DIGEST) criteria: Validation of DIGEST version 2,” Cancer, vol. 128, no. 7, pp. 1458–1466, 2022.
  • [3] Hutcheson KA et al., “Dynamic Imaging Grade of Swallowing Toxicity (DIGEST): Scale development and validation,” Cancer, vol. 123, no. 1, pp. 62–70, 2017.
  • [4] Peterson R, “Modified Barium Swallow for Evaluation of Dysphagia,” Radiol Tech, vol. 89, no. 3, pp. 257–275, 2018.
  • [5] Fynes MM, Smith C, and Brodsky MB, “The modified barium swallow study: When, how, and why?,” Appl Radiol, vol. 48, no. 5, pp. 3–8, 2019.
  • [6] Choi HE, Jo GY, Kim WJ, Do HK, Kwon JK, and Park SH, “Characteristics and Clinical Course of Dysphagia Caused by Anterior Cervical Osteophyte,” Ann Rehabil Med, vol. 43, no. 1, pp. 27–37, 2019.
  • [7] Zhang Z, Coyle JL, and Sejdic E, “Automatic hyoid bone detection in fluoroscopic images using deep learning,” Sci Rep, vol. 8, 2018.
  • [8] Kim HI, Kim Y, Kim B, Shin DY, Lee SJ, and Choi SI, “Hyoid Bone Tracking in a Videofluoroscopic Swallowing Study Using a Deep-Learning-Based Segmentation Network,” Diagnostics, vol. 11, no. 7, 2021.
  • [9] Li W, Mao S, Mahoney AS, Petkovic S, Coyle JL, and Sejdic E, “Deep learning models for bolus segmentation in videofluoroscopic swallow studies,” J Real-Time Image Pr, vol. 21, no. 1, p. 18, 2024.
  • [10] Zhang Z, Mao S, Coyle JL, and Sejdic E, “Automatic annotation of cervical vertebrae in videofluoroscopy images via deep learning,” Medical Image Analysis, vol. 74, p. 102218, 2021.
