2025 Aug 12. Online ahead of print. doi: 10.1159/000547346

Quantifying Pharyngeal Residue during Videofluoroscopic Swallow Studies (VFSS/MBS): Validation of Percentage-Based Visual-Perceptual Residue Ratings

James Curtis a,b, Valentina Mocchetti b, Brandon Jagdhar a,b, Anaïs Rameau b,c, Christine M Clark b, Mel Grasso a,b
PMCID: PMC12503582  PMID: 40795820

Abstract

Introduction

Computerized, percentage-based assessments of pharyngeal residue (e.g., % filling of the valleculae) during videofluoroscopic swallow studies (VFSSs) offer a high level of precision, validity, and reliability compared to more frequently used visual-perceptual categorical-based rating methods (e.g., mild, moderate, severe). Despite these advantages, clinical practice largely relies on visual-perceptual methods, given their ease, speed, and lack of need for specialized software or training. Percentage-based visual-perceptual residue ratings could represent a scalable and clinically feasible alternative to computerized techniques; however, their accuracy and reliability have not been systematically investigated. Therefore, this study aimed to: evaluate the accuracy of visual-perceptual percentage-based ratings of pharyngeal residue during VFSS compared to pixel-based ground-truth measurements; characterize the inter- and intra-rater reliability of these ratings; and explore whether clinician characteristics are associated with rating accuracy.

Methods

An anonymous international survey was distributed to speech-language pathologists (SLPs). Respondents were asked to provide visual-perceptual ratings of pharyngeal residue for 40 pairs of unique fluoroscopic images. Digital tracings were superimposed onto each image to guide percentage-based ratings. SLP respondents provided two types of residue ratings: Bolus Clearance Ratio (BCR; n = 20) and Residue Ratio Scale for the valleculae (RRSV; n = 20). 50% of the images were randomly repeated to assess intra-rater reliability. Ratings were compared to ground-truth values. Statistical analyses were used to characterize rater accuracy and reliability.

Results

129 SLP respondents participated in the survey, yielding an analysis of 6,569 visual-perceptual percentage-based residue ratings. Residue ratings showed moderate-to-substantial agreement with ground-truth values for both BCR (ρc = 0.92) and RRSV (ρc = 0.94). Group-level inter-rater reliability was good-to-excellent for BCR (ICC = 0.90) and excellent for RRSV (ICC = 0.93). Intra-rater reliability was also excellent (BCR ICC = 0.92; RRSV ICC = 0.97). Greater accuracy was associated with clinicians with fewer years of clinical experience (r = 0.35, p = 0.006), clinicians who used frame-by-frame analysis more frequently in their clinical practice (τ = 0.228, p = 0.002), and clinicians who used standardized residue rating tools such as DIGEST and MBSImP more frequently in their clinical practice.

Conclusion

Visual-perceptual percentage-based residue ratings during VFSS demonstrate a high level of accuracy and inter- and intra-rater reliability. Clinician experience and practice patterns influence rating accuracy. These findings support the potential clinical utility of percentage-based visual-perceptual methods as a valid and accessible alternative to traditional ordinal residue scales.

Keywords: Pharyngeal residue, Videofluoroscopic swallow study (VFSS), Visual-perceptual percentage-based residue rating, Bolus Clearance Ratio, Dysphagia

Introduction

Swallowing efficiency refers to the transport of foods, liquids, and saliva from the mouth into the stomach with minimal post-swallow residue. Impairments in pharyngeal swallowing efficiency are characterized by reduced bolus clearance through the pharyngoesophageal segment, resulting in vallecular and piriform residue. The presence of pharyngeal residue is clinically important, as it is associated with an increased risk of post-swallow aspiration [1, 2] as well as serious downstream medical consequences including malnutrition [3] and pneumonia [4].

Videofluoroscopic swallow studies (VFSS) are widely used in clinical and research settings to assess impairments in pharyngeal swallow efficiency. Pharyngeal residue during VFSS can be evaluated using visual-perceptual methods or computerized techniques [5–7]. Visual-perceptual methods are the most common in clinical practice and typically rely on categorical rating systems. These categorical rating scales vary in construct: some are intended to characterize the severity of residue (e.g., none, mild, moderate, severe) [8, 9], while others are intended to estimate the amount of residue (e.g., minimal, moderate, maximal) [9, 10].

Despite their clinical utility, categorical scales have notable limitations. They often lack the granularity needed to detect subtle, yet potentially meaningful, differences in pharyngeal residue across different patients or within the same patient over time. For example, consider a patient who exhibits 45% of a bolus remaining in the pharynx after a swallow pre-therapy but 15% post-therapy. A rating scale that categorizes residue as <10%, 10–49%, and ≥50% would classify both instances within the same category (10–49%), obscuring a potentially meaningful improvement in swallowing efficiency because of therapy. Moreover, pharyngeal residue exists along a continuum and is not inherently categorical. As such, treating residue as a percentage-based outcome may offer greater construct validity and clinical sensitivity.
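The binning example above can be made concrete with a short sketch. The three bins (<10%, 10–49%, ≥50%) are the illustrative categories from the paragraph, not a published scale, and the function name `residue_category` is hypothetical.

```python
def residue_category(percent: float) -> str:
    """Map a residue percentage onto the three illustrative bins from the text."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must lie between 0 and 100")
    if percent < 10:
        return "<10%"
    if percent < 50:
        return "10-49%"
    return ">=50%"


# Pre- and post-therapy values taken from the example in the text.
pre_therapy, post_therapy = 45.0, 15.0

# Both values fall in the same bin, so the categorical scale shows no change...
same_bin = residue_category(pre_therapy) == residue_category(post_therapy)

# ...while the percentage-based outcome shows a 30-point reduction.
change_in_points = pre_therapy - post_therapy

print(residue_category(pre_therapy), residue_category(post_therapy))
print(same_bin, change_in_points)
```

The categorical comparison reports no change, whereas the percentage-based comparison retains the 30-percentage-point difference.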

Recent studies using flexible endoscopic evaluation of swallowing have shown that visual-perceptual percentage-based residue ratings may offer higher sensitivity, validity, and rater reliability compared to traditional categorical scales [11–14]. Similarly, percentage-based computerized assessments of residue during VFSS have demonstrated high accuracy and reliability [6], but they require specialized software and technical expertise, limiting their clinical applicability. Two examples of percentage-based computerized assessments of pharyngeal residue include the Bolus Clearance Ratio (BCR) and the Residue Ratio Scale for the valleculae (RRSV) [15–18]. BCR is used to characterize the percent of bolus remaining in the pharynx after the swallow, relative to how much bolus was originally propelled into the pharynx during the swallow. RRSV, conversely, is used to characterize the percent of vallecular space filled with vallecular residue.

To date, no published research has described the use of percentage-based residue ratings during visual-perceptual assessments of VFSS – an approach that may provide a more precise and accessible alternative to ordinal ratings without requiring complex software. If accurate and reliable, such rating methods could represent a clinically feasible method to enhance the precision of VFSS interpretation in real-world settings. This is important given that the current data suggest the reliability of visual-perceptual assessments of pharyngeal residue during VFSS is relatively low [19–21] and that there is a need to develop user-friendly approaches to rate pharyngeal residue using methods that are valid, reliable, and responsive/sensitive [22].

Given the above, the primary aims of this study were to: (1) evaluate the accuracy of visual-perceptual percentage-based ratings of pharyngeal residue during VFSS compared to ground-truth measurements; and (2) characterize the inter- and intra-rater reliability of these rating methods. As an exploratory aim, we examined whether clinician characteristics were associated with accuracy of the visual-perceptual percentage-based residue ratings. We hypothesized that visual-perceptual percentage-based ratings would demonstrate moderate-to-high agreement with ground-truth measurements and good-to-excellent rater reliability. No a priori hypotheses were made for the exploratory aim.

Methods

This study was a prospective cross-sectional study approved by the Institutional Review Board at Weill Cornell Medicine (IRB #: 24‐10028086). Previously completed, clinically indicated VFSS were obtained from the medical institution’s picture archiving and communication system and TIMS Review Software (TIMS Medical). The standard viewing plane for the VFSS included the lips anteriorly, the nasal cavity superiorly, the cervical spine posteriorly, and the cervical esophagus and proximal trachea inferiorly. All VFSS were acquired at 30 pulses per second. The TIMS DICOM audiovisual recording system was used to record VFSS images at a rate of 30 frames per second. All videos were stored in the university hospital’s picture archiving and communication system and locally in TIMS Review Software. Radiopaque contrasts were 40% w/v Varibar barium sulfate thin liquid or pudding products (Bracco Imaging). VFSS were reviewed for swallows representing a range of BCR and RRSV ratings. Once swallows were identified, fluoroscopic still images were obtained from each swallow to be used for subsequent BCR and RRSV analyses.

For BCR, two fluoroscopic images were obtained from each VFSS. The first fluoroscopic image captured the maximal amount of bolus propelled into the pharynx “during the swallow”, with a tracing superimposed outlining the bolus area of interest. The second fluoroscopic image captured the amount of bolus remaining in the pharynx “after the swallow”, with a tracing superimposed outlining the bolus area of interest (Fig. 1). For BCR, “after the swallow” bolus residue tracings were made immediately after closure of the pharyngoesophageal segment, but before complete descent of the larynx. For RRSV, one image capturing post-swallow vallecular residue was obtained from each VFSS. This image was captured after the larynx and pharynx had returned to their lowest resting position after the swallow. The image was then duplicated, and tracings were superimposed onto each image. The first image contained a tracing outlining vallecular residue. The second image contained a tracing outlining the total vallecular space (Fig. 2). A total of 20 BCR and 20 RRSV unique pairs of images were obtained. All tracings were made using ImageJ software (Version 2.14.0/1.54f), with pixel-based measurements obtained at the time of the tracings. ImageJ is a free, open-source image analysis software developed by the National Institutes of Health that enables pixel-based quantification of images. The freehand selection tool was used to trace the bolus area of interest, superimposing a yellow tracing over the gray scale fluoroscopic image. The pixel-based ImageJ measurements of BCR and RRSV served as the “ground truth” to which clinicians’ visual-perceptual percentage-based numerical ratings of BCR and RRSV were compared.

Fig. 1.

Example of a paired set of images used for the Bolus Clearance Ratio (BCR). The image on the left represents the amount of bolus propelled into the pharynx during the swallowing trial. The image on the right represents the amount of bolus remaining in the pharynx immediately after closure of the pharyngoesophageal segment. A yellow digital tracing was superimposed to guide visual-perceptual ratings and was used for computerized analysis of the ground truth. In this figure, BCR = 0.1, indicating that 10% of the bolus propelled into the pharynx during the swallow remained in the pharynx after the swallow.

Fig. 2.

Example of a paired set of images used for the Residue Ratio Scale of the valleculae (RRSV). Both images include the same post-swallow image. The image on the left includes a tracing of the entire vallecular space. The image on the right includes a tracing of the residue filling the valleculae. The yellow digital tracings were superimposed to guide visual-perceptual ratings and were used for computerized analysis of the ground truth. In this example, RRSV = 0.45, indicating that 45% of the vallecular space was filled with residue.
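As a minimal sketch of how the two ground-truth ratios relate to traced areas, the following assumes that each ratio is the quotient of two traced areas measured in pixels. The function names and the pixel counts are hypothetical; the pixel counts were chosen only so that the outputs match the values quoted in the captions of Fig. 1 and Fig. 2.

```python
def area_ratio(numerator_px: float, denominator_px: float) -> float:
    """Return the ratio of two traced areas measured in pixels."""
    if denominator_px <= 0:
        raise ValueError("denominator area must be positive")
    if numerator_px < 0:
        raise ValueError("numerator area cannot be negative")
    return numerator_px / denominator_px


def bolus_clearance_ratio(residue_after_px: float, bolus_during_px: float) -> float:
    """BCR: bolus remaining after the swallow relative to bolus propelled during it."""
    return area_ratio(residue_after_px, bolus_during_px)


def residue_ratio_valleculae(residue_px: float, vallecular_space_px: float) -> float:
    """RRSV: vallecular residue relative to the total vallecular space."""
    return area_ratio(residue_px, vallecular_space_px)


# Hypothetical pixel areas (not taken from the study data).
bcr = bolus_clearance_ratio(residue_after_px=250, bolus_during_px=2500)
rrsv = residue_ratio_valleculae(residue_px=450, vallecular_space_px=1000)

print(f"BCR = {bcr:.2f} ({bcr:.0%} of the bolus remained)")
print(f"RRSV = {rrsv:.2f} ({rrsv:.0%} of the valleculae filled)")
```

Multiplying either ratio by 100 gives the 0 to 100 scale on which respondents entered their visual-perceptual ratings.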

A three-part, anonymous, online survey (Qualtrics) was disseminated to speech-language pathologists nationally and internationally via online professional listservs, social media, and word-of-mouth. Inclusion criteria for study enrollment were as follows: (1) being 18 years or older; (2) being a licensed speech-language pathologist (SLP), SLP clinical fellow, retired SLP, or current undergraduate or graduate SLP student; and (3) not having a color vision impairment that would prevent the ability to distinguish a yellow tracing from a black or gray background. Written informed consent was provided to each potential participant prior to the start of the survey.

Part 1 of the survey involved collecting data on clinician characteristics. Clinician characteristics were grouped into four domains: (1) demographics, (2) professional training and status, (3) self-reported confidence, and (4) clinical practice patterns related to VFSS interpretation. Parts 2 and 3 of the survey each began with brief instructions on how to complete visual-perceptual percentage-based numerical ratings of BCR and RRSV, respectively. Participants were then presented with the 20 unique pairs of images for that measure and asked to provide a visual-perceptual percentage-based rating ranging from 0 to 100 for each pair. To examine intra-rater reliability, 50% of the 20 unique pairs (10 pairs) for each measure were randomly selected and presented a second time to each respondent, for a total of 30 pairs of images rated per respondent for BCR and 30 for RRSV.
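The repeated-image design can be sketched as follows. This is an illustration only: the survey was built in Qualtrics, and the source does not describe the randomization mechanism or presentation order, so the function `build_presentation_list`, its parameters, and the choice to append repeats after the unique items are all assumptions.

```python
import random


def build_presentation_list(n_unique=20, repeat_fraction=0.5, seed=None):
    """Return (image_pair_id, is_repeat) tuples for one respondent and one task.

    Every unique pair appears once; a random subset appears a second time
    so that intra-rater reliability can be estimated.
    """
    rng = random.Random(seed)
    unique_ids = list(range(1, n_unique + 1))
    n_repeats = round(n_unique * repeat_fraction)
    # Sample without replacement so no pair is repeated more than once.
    repeated_ids = rng.sample(unique_ids, n_repeats)
    first_pass = [(pair_id, False) for pair_id in unique_ids]
    second_pass = [(pair_id, True) for pair_id in repeated_ids]
    return first_pass + second_pass


# One hypothetical respondent's BCR task: 20 unique pairs plus 10 repeats.
bcr_items = build_presentation_list(seed=1)
print(len(bcr_items), sum(is_repeat for _, is_repeat in bcr_items))
```

Only the first-pass ratings would enter the accuracy analysis; the second-pass ratings are paired with their first-pass counterparts for intra-rater reliability.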

Statistical Analysis

All data were statistically analyzed using R version 4.4.3 [23]. Data and R code were uploaded to the Open Science Framework repository. Descriptive statistics were used to summarize accuracy of BCR and RRSV. Accuracy of visual-perceptual ratings of BCR and RRSV was determined for each unique image by comparing SLPs’ visual-perceptual ratings to the ground-truth pixel-based measurements. The BCR and RRSV images that were repeated for assessment of intra-rater reliability were excluded from this analysis. Lin’s concordance correlation coefficient (ρc) [24] was used to characterize the accuracy of the visual-perceptual ratings and to characterize the level of agreement with the ground truth. ρc was interpreted as “poor” if <0.90, “moderate” if 0.90–0.949, “substantial” if 0.95–0.998, and “excellent” if ≥0.999 [25]. In addition, we report scale shift (ω), location shift (υ), and correction bias (C.b.), which are components of Lin’s concordance analysis. The scale shift (ω) reflects how much ratings are compressed or expanded compared to the reference standard; the location shift (υ) indicates systematic over- or underestimation; and the correction bias (C.b.) is a summary index of overall agreement, with values closer to 1 indicating higher concordance.
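The concordance components named above can be illustrated with a minimal sketch of Lin's standard formulation. It is not the authors' R code. It assumes population (1/n) variances, takes the ratio of the rating standard deviation to the ground-truth standard deviation as the scale shift, and treats the cut points 0.90, 0.95, and 0.999 as category boundaries. The function names and example values are hypothetical.

```python
import math
from statistics import fmean


def lin_concordance(ratings, truth):
    """Return Lin's concordance components for paired ratings and reference values."""
    n = len(ratings)
    if n != len(truth) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 pairs")
    mean_r, mean_t = fmean(ratings), fmean(truth)
    # Population (1/n) moments; an assumption, software may use n - 1 instead.
    var_r = sum((x - mean_r) ** 2 for x in ratings) / n
    var_t = sum((y - mean_t) ** 2 for y in truth) / n
    cov = sum((x - mean_r) * (y - mean_t) for x, y in zip(ratings, truth)) / n
    if var_r == 0 or var_t == 0:
        raise ValueError("both sequences must vary")
    sd_r, sd_t = math.sqrt(var_r), math.sqrt(var_t)
    scale_shift = sd_r / sd_t                                    # omega
    location_shift = (mean_r - mean_t) / math.sqrt(sd_r * sd_t)  # upsilon
    c_b = 2 / (scale_shift + 1 / scale_shift + location_shift ** 2)
    pearson_r = cov / (sd_r * sd_t)
    rho_c = 2 * cov / (var_r + var_t + (mean_r - mean_t) ** 2)
    return {"rho_c": rho_c, "pearson_r": pearson_r, "c_b": c_b,
            "scale_shift": scale_shift, "location_shift": location_shift}


def interpret_ccc(rho_c):
    """Apply the interpretive labels quoted in the text."""
    if rho_c < 0.90:
        return "poor"
    if rho_c < 0.95:
        return "moderate"
    if rho_c < 0.999:
        return "substantial"
    return "excellent"


# Hypothetical ratings that overestimate every ground-truth value by 10 points.
result = lin_concordance([10, 20, 30, 40], [0, 10, 20, 30])
print(result, interpret_ccc(result["rho_c"]))
```

In this formulation ρc equals the Pearson correlation multiplied by C.b., so a constant 10-point overestimate lowers ρc through the location shift even though the Pearson correlation remains 1.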

Inter-rater reliability was examined by comparing visual-perceptual ratings between raters, whereas intra-rater reliability was examined by comparing visual-perceptual ratings within each rater. Inter-rater reliability was characterized at the group level, and for each unique pair of raters (dyad-level). Intra-rater reliability was characterized at the group level and for each unique rater (respondent-level). Intraclass correlation coefficient (ICC) was used to assess inter- and intra-rater reliability. ICCs were interpreted as “poor” if <0.5, “moderate” if 0.5–0.75, “good” if 0.75–0.90, and “excellent” if ≥0.90 [26].
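As a hedged illustration of the reliability statistic, the sketch below computes a two-way, absolute-agreement, single-rater ICC, which is what the notation ICC(A, 1) used in the Results conventionally denotes. It assumes a complete images-by-raters table with no missing ratings, which the survey data did not have, so it is not a reproduction of the authors' analysis. The function names are hypothetical, and the example matrix is the classic Shrout and Fleiss teaching dataset rather than study data.

```python
def icc_a1(table):
    """Two-way, absolute-agreement, single-rater ICC for a complete n x k table.

    Rows are rated targets (images); columns are raters or repeated ratings.
    """
    n, k = len(table), len(table[0])
    if n < 2 or k < 2 or any(len(row) != k for row in table):
        raise ValueError("need a complete table with at least 2 rows and 2 columns")
    grand = sum(sum(row) for row in table) / (n * k)
    row_means = [sum(row) / k for row in table]
    col_means = [sum(row[j] for row in table) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    denominator = ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    if denominator == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (ms_rows - ms_error) / denominator


def interpret_icc(icc):
    """Apply the labels quoted in the text; 0.75 is assigned to 'good' here."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc < 0.90:
        return "good"
    return "excellent"


# Shrout and Fleiss example: 6 targets rated by 4 judges (not study data).
example = [[9, 2, 5, 8], [6, 1, 3, 2], [8, 4, 6, 8],
           [7, 1, 2, 6], [10, 5, 6, 9], [6, 2, 4, 7]]
value = icc_a1(example)
print(round(value, 2), interpret_icc(value))
```

The same function applies to intra-rater reliability if the two columns hold a respondent's first and repeated ratings of the same images.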

Relationships between clinician characteristics and accuracy of residue ratings were examined using Pearson’s, Spearman’s, and Kendall’s correlation coefficients, as well as Wilcoxon rank-sum tests. Only respondents who completed the entire survey were included in this exploratory analysis. A single measure of accuracy was calculated for each respondent as the median absolute difference between their perceptual ratings and the ground-truth values across all rated images. Pearson’s correlation was used for continuous clinician characteristics. Spearman’s correlation was planned for ordinal variables, with Kendall’s correlation used in cases with a large number of tied ranks. Binary clinician characteristics were analyzed using Wilcoxon rank-sum tests.
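The per-respondent accuracy measure and its correlation with a continuous clinician characteristic can be sketched as below. The ratings, ground-truth values, and years of experience are invented for illustration, and the helper names are hypothetical; the rank-based (Spearman, Kendall) and Wilcoxon analyses are not shown.

```python
import math
from statistics import fmean, median


def median_absolute_difference(ratings, truth):
    """Median of |rating - ground truth| across the images one respondent rated."""
    if len(ratings) != len(truth) or not ratings:
        raise ValueError("need two equal-length, non-empty sequences")
    return median(abs(r - t) for r, t in zip(ratings, truth))


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("both sequences must vary")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical ground truth for three images and three respondents' ratings.
truth = [12, 45, 90]
respondent_ratings = {"A": [10, 50, 90], "B": [20, 60, 80], "C": [30, 70, 65]}
years_experience = {"A": 2, "B": 10, "C": 25}

errors = {rid: median_absolute_difference(r, truth)
          for rid, r in respondent_ratings.items()}
ids = sorted(errors)
r = pearson_r([years_experience[i] for i in ids], [errors[i] for i in ids])
print(errors, round(r, 3))
```

Because the summary is an error measure, a larger median absolute difference corresponds to lower accuracy; a positive correlation between a characteristic and this error therefore indicates lower accuracy as the characteristic increases.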

Results

A total of 129 SLPs participated in the survey (Tables 1–4), yielding an analysis of 6,569 visual-perceptual percentage-based residue ratings. The majority of SLP respondents were based in the USA (82.9%). Respondents represented a wide range of clinical experience, with a median of 11 years in the field. Most participants (69%) were practicing clinicians, while the remainder were SLP trainees, including clinical fellows, graduate students, and undergraduate students. Nearly half (49%) reported having performed more than 250 VFSSs in their careers, although 14.7% indicated they had never completed a VFSS as part of their clinical practice or training.

Table 1.

Demographics

Overall (N = 129)
Age, years
 Mean (SD) 33.7 (10.6)
 Median [Q1, Q3] 31.0 [25.0, 42.0]
 Min, max 21.0, 65.0
Sex
 Do not wish to provide 1 (0.8%)
 Female 117 (90.7%)
 Male 11 (8.5%)
Gender
 Do not wish to provide 3 (2.3%)
 Man 11 (8.5%)
 Non-binary 2 (1.6%)
 Woman 113 (87.6%)
Race/ethnicity
 Asian 22 (17.1%)
 Black or African American 1 (0.8%)
 Do not wish to provide 4 (3.1%)
 Hispanic or Latino 2 (1.6%)
 Multiracial 1 (0.8%)
 White 99 (76.7%)
Multiracial/multiethnic demographics
 Asian (and) Native American or other Pacific Islander 1 (0.8%)
Country
 Australia 1 (0.8%)
 Canada 2 (1.6%)
 Germany 1 (0.8%)
 India 1 (0.8%)
 Ireland 4 (3.1%)
 Philippines 1 (0.8%)
 Taiwan (Province of China) 9 (7.0%)
 United Kingdom 3 (2.3%)
 USA 107 (82.9%)

% represents percentage of total sample.

Table 4.

VFSS experience and practice patterns

Overall (N = 129)
Lifetime number of VFSS
 0 19 (14.7%)
 1–10 6 (4.7%)
 11–50 21 (16.3%)
 51–100 14 (10.9%)
 101–250 11 (8.5%)
 251–500 15 (11.6%)
 501–1,000 17 (13.2%)
 >1,000 26 (20.2%)
Frequency of using frame-by-frame analysis of VFSS
 0% 5 (3.9%)
 1–20% 8 (6.2%)
 21–40% 9 (7.0%)
 41–60% 15 (11.6%)
 61–80% 11 (8.5%)
 81–100% 62 (48.1%)
Frequency of using categorical-based descriptors of residue
 0% 16 (12.4%)
 1–20% 9 (7.0%)
 21–40% 7 (5.4%)
 41–60% 10 (7.8%)
 61–80% 13 (10.1%)
 81–100% 67 (51.9%)
Types of categorical-based residue rating methods used
 Non-standardized descriptions related to amount (e.g., minimal, moderate, maximal) 48 (37.2%)
 Non-standardized descriptions related to severity (e.g., mild, moderate, severe) 58 (45.0%)
 MBSImP 58 (45.0%)
 DIGEST 40 (31.0%)
 Other 5 (3.9%)
Frequency of using percentage-based descriptors of residue
 0% 61 (47.3%)
 1–20% 21 (16.3%)
 21–40% 10 (7.8%)
 41–60% 8 (6.2%)
 61–80% 2 (1.6%)
 81–100% 21 (16.3%)
Types of percentage-based residue rating methods used
 Visual-perceptual percentage-based estimations 63 (48.8%)
 Computerized/digital tracings 18 (14.0%)
 Digital tracing – Normalized Residue Ratio Scale of the valleculae 4 (3.1%)
 Digital tracing – Normalized Residue Ratio Scale of the Piriformis 4 (3.1%)
 Digital tracing – ASPEKT 11 (8.5%)
 Digital tracing – Bolus Clearance Ratio 12 (9.3%)
 Digital tracing – Pharyngeal Residue Ratio 10 (7.8%)
 Other 4 (3.1%)

% represents percentage of total sample.

Table 2.

Professional experiences and work setting

Overall (N = 129)
Training status
 Clinical fellow 5 (3.9%)
 Clinician (full-time, part-time, or retired) 89 (69.0%)
 Graduate student 33 (25.6%)
 Undergraduate student 2 (1.6%)
Years of work experience
 Mean (SD) 12.1 (9.28)
 Median [Q1, Q3] 11.0 [3.75, 18.3]
 Min, max 1.00, 41.0
 Missing 41 (31.8%)
Years of dysphagia work experience
 Mean (SD) 8.13 (9.25)
 Median [Q1, Q3] 4.00 [1.00, 15.0]
 Min, max 0, 41.0
 Missing 1 (0.8%)
Work setting
 Medical hospital 19 (14.7%)
 Multiple settings 74 (57.4%)
 Not applicable 15 (11.6%)
 Other 3 (2.3%)
 Outpatient 7 (5.4%)
 Private practice 5 (3.9%)
 Rehabilitation hospital 3 (2.3%)
 Research laboratory 1 (0.8%)
 Skilled nursing facility 2 (1.6%)

% represents percentage of total sample.

Table 3.

Trainings and confidence

Overall (N = 129)
Certifications
 CCC-SLP 78 (60.5%)
 BCS-S 10 (7.8%)
Specialized trainings
 DIGEST 26 (20.2%)
 MBSImP (clinician version) 47 (36.4%)
 MBSImP (student version) 56 (43.4%)
 VASES 16 (12.4%)
 ASPEKT 6 (4.7%)
 DSS/SwallowTail 8 (6.2%)
Confidence in interpreting VFSS
 Very not confident 9 (7.0%)
 Somewhat not confident 16 (12.4%)
 Neutral 19 (14.7%)
 Somewhat confident 56 (43.4%)
 Very confident 29 (22.5%)
Confidence in performing basic algebra
 Very not confident 5 (3.9%)
 Somewhat not confident 10 (7.8%)
 Neutral 11 (8.5%)
 Somewhat confident 39 (30.2%)
 Very confident 63 (48.8%)
 Missing 1 (0.8%)
Confidence in spatial awareness abilities
 Very not confident 3 (2.3%)
 Somewhat not confident 11 (8.5%)
 Neutral 30 (23.3%)
 Somewhat confident 63 (48.8%)
 Very confident 21 (16.3%)
 Missing 1 (0.8%)

% represents percentage of total sample.

All 129 respondents completed part 1 of the survey, which included questions about demographics, training, and clinical practice patterns. All 129 also completed at least one of the 30 BCR ratings in part 2 and were therefore included in the BCR analysis; 114 of these respondents (88%) completed all 30 BCR ratings. In part 3 of the survey, 101 respondents (78%) completed at least one of 30 RRSV ratings and were included in the RRSV analysis; of those, 99 respondents (76%) completed all 30 RRSV ratings. These 99 respondents, who completed all 20 unique ratings in both the BCR and RRSV tasks, were included in the exploratory aim examining the relationship between clinician characteristics and rating accuracy.

Accuracy of Residue Ratings

Visual-perceptual percentage-based ratings of BCR and RRSV demonstrated moderate-to-substantial agreement with ground-truth pixel-based measurements (Table 5; Fig. 3). Across all BCR trials (n = 2,405), the median absolute difference between visual-perceptual ratings and ground-truth values was 7.41 percentage points (IQR: 2.96–12.4), with a mean absolute difference of 9.17 (SD = 7.79). Lin’s concordance correlation coefficient indicated moderate agreement between visual-perceptual and pixel-based ratings for BCR (ρc = 0.92 [95% CI: 0.91–0.92]), with a scale shift of ω = 0.91, a location shift of υ = 0.05, and a correction bias of C.b. = 0.99. Across all RRSV trials (n = 1,986), the median absolute difference was 6.47 percentage points (IQR: 2.86–11.3), with a mean absolute difference of 7.87 (SD = 6.78). Lin’s concordance correlation coefficient indicated moderate agreement between visual-perceptual and pixel-based ratings for RRSV (ρc = 0.94 [95% CI: 0.94–0.95]), with a scale shift of ω = 0.89, a location shift of υ = −0.02, and a correction bias of C.b. = 0.99.

Table 5.

Accuracy of visual-perceptual percentage-based ratings

                      BCR                     RRSV
                      N = 2,580               N = 2,580
Absolute difference
 Mean (SD)            9.17 (7.79)             7.87 (6.78)
 Median [Q1, Q3]      7.41 [2.96, 12.4]       6.47 [2.86, 11.3]
 Min, max             0.0400, 47.3            0, 53.1
 Missing              175 (6.8%)              594 (23.0%)
Relative difference
 Mean (SD)            −1.41 (12.0)            0.713 (10.4)
 Median [Q1, Q3]      −0.310 [−9.17, 5.83]    0.0950 [−5.29, 7.71]
 Min, max             −45.1, 47.3             −53.1, 42.9
 Missing              175 (6.8%)              594 (23.0%)

% represents percentage of total sample.

BCR, Bolus Clearance Ratio; RRSV, Residue Ratio Scale of the valleculae.
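The trial counts in Table 5 and in the text can be reconciled with simple arithmetic. The sketch below assumes that N = 2,580 is the number of possible first-presentation ratings per task (129 respondents × 20 unique image pairs); the source does not state this explicitly, but the figures are consistent with it.

```python
respondents = 129
unique_pairs_per_task = 20

# Assumed meaning of N in Table 5: every respondent rating every unique pair once.
possible_ratings = respondents * unique_pairs_per_task

# Missing counts as reported in Table 5.
missing = {"BCR": 175, "RRSV": 594}

analyzed = {task: possible_ratings - m for task, m in missing.items()}
percent_missing = {task: round(100 * m / possible_ratings, 1)
                   for task, m in missing.items()}

print(possible_ratings)   # 2580
print(analyzed)           # {'BCR': 2405, 'RRSV': 1986}
print(percent_missing)    # {'BCR': 6.8, 'RRSV': 23.0}
```

The resulting counts match the n = 2,405 BCR trials and n = 1,986 RRSV trials reported in the text.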

Fig. 3.

Boxplots comparing ground truth to visual-perceptual percentage-based ratings of the Bolus Clearance Ratio (BCR; top) and Residue Ratio Scale of the valleculae (RRSV; bottom). The numbers within each panel represent the median visual-perceptual rating made across all raters.

Inter- and Intra-Rater Reliability of Residue Ratings

Group-level inter-rater reliability was high for both BCR and RRSV visual-perceptual ratings. The group-level ICC for visual-perceptual ratings of BCR was ICC(A, 1) = 0.90 (95% CI: 0.81–0.97), indicating good-to-excellent agreement among raters. For RRSV, the group-level ICC for visual-perceptual ratings was 0.93 (95% CI: 0.89–0.97), reflecting excellent inter-rater agreement. Across both measures, the mean relative difference from the grand mean was effectively zero, suggesting that no consistent directional bias was present in group-level rating patterns. Dyad-level inter-rater reliability ICCs were observed to be mostly within the good-to-excellent range for BCR and within the excellent range for RRSV (Fig. 4).

Fig. 4.

Histograms outlining the distribution of intraclass correlation coefficients (ICC) for inter- and intra-rater reliability of the Bolus Clearance Ratio (BCR) and the Residue Ratio Scale of the valleculae (RRSV).

Group-level intra-rater reliability was excellent for both BCR and RRSV visual-perceptual ratings. The group-level ICC for repeated intra-rater ratings of BCR was ICC(A, 1) = 0.92 (95% CI: 0.91–0.93), and for RRSV was ICC(A,1) = 0.97 (95% CI: 0.96–0.97), indicating highly consistent performance across repeated trials by the same rater. At the individual level, most respondents demonstrated good-to-excellent intra-rater reliability, with ICC values for BCR clustering above 0.85 and for RRSV tightly clustered above 0.90 (Fig. 4).

Clinician Characteristics and Accuracy of Residue Ratings

A series of correlation and Wilcoxon rank-sum tests were conducted to examine whether clinician characteristics were associated with accuracy of visual-perceptual residue ratings during VFSS. In terms of professional training and work status/setting, years of work experience showed a negative correlation with accuracy of residue ratings, such that a greater number of years of work experience was associated with lower residue rating accuracy (r = 0.348; p = 0.006). However, clinicians who had trained to interpret VFSS using the Dynamic Swallow Study (DSS/SwallowTail) computerized technique [27] showed greater accuracy of residue ratings (r = 0.265; p = 0.008).

In terms of VFSS clinical practice patterns, clinicians who use frame-by-frame analysis more frequently to interpret VFSS exhibited greater accuracy of residue ratings (τ = 0.228; p = 0.002). Additionally, clinicians who include assessment of pharyngeal residue when interpreting VFSS exhibited greater accuracy of residue ratings than clinicians who do not (τ = 0.228; p = 0.002). More specifically, people who rely on standardized assessments of pharyngeal residue, including DIGEST (r = 0.266; p = 0.008) and MBSImP (r = 0.255; p = 0.011), demonstrated greater accuracy of residue rating compared to people who rely on non-standardized assessment of amount (e.g., minimal, maximal) or severity (e.g., mild, severe). No other statistically significant relationships were observed, including but not limited to clinician characteristics related to work setting, professional certifications, self-reported confidence in VFSS interpretation, or number of VFSS completed within their career.

Discussion

This study evaluated the accuracy and reliability of visual-perceptual percentage-based residue ratings during VFSS among an international sampling of SLPs using two standardized metrics: BCR and RRSV. Findings revealed moderate-to-substantial agreement between participant ratings and ground-truth measurements, with excellent intra-rater and good-to-excellent inter-rater reliability. Additionally, certain clinician characteristics – such as fewer years of experience, use of frame-by-frame VFSS analysis, and reliance on standardized tools like DIGEST and MBSImP – were associated with greater rating accuracy.

These results suggest that clinicians can produce residue ratings that closely align with objective ground-truth values when using a percentage-based visual-perceptual method guided by digital tracings. Unlike categorical residue scales, percentage-based ratings offer greater precision and responsiveness by allowing clinicians to quantify subtle changes in pharyngeal residue that may be clinically meaningful but would otherwise be obscured by broad ordinal categories. This level of granularity can improve sensitivity to treatment effects, enhance monitoring of patient progress over time, and facilitate clearer communication across providers. Although computerized techniques remain the most precise method for quantifying pharyngeal residue, they require technical training, are not widely accessible in many clinical settings, and often depend on manual (non-automated) tracings that are still subject to rater error. In contrast, visual-perceptual percentage-based ratings – when standardized – strike a balance between clinical feasibility and interpretive precision, offering a scalable method for integrating quantitative assessment of swallowing efficiency into everyday practice. With appropriate training, they are accessible, expedient, and cost-free, making them particularly well suited for clinical environments that lack access to specialized software.

The high level of inter- and intra-rater reliability observed in this study is consistent with previous research using percentage-based rating strategies during flexible endoscopic evaluation of swallowing [12, 14, 28] and supports the broader applicability of such approaches across swallowing assessment modalities. Notably, the reliability of the visual-perceptual ratings in this study was comparable to – or even exceeded – that reported for the same metrics derived from manual computerized techniques [15, 16, 18, 29–31]. These findings suggest that a high level of reliability can be achieved with visual-perceptual assessments when clear, transparent, and standardized methods are used. In this study, clear, transparent, and standardized methods were facilitated by controlling the timing of ratings (using still images) and by defining spatial boundaries (through superimposed digital tracings), enabling a level of rater consistency similar to that observed with manual computerized methods.

Interestingly, a greater number of years of work experience was associated with lower accuracy in residue ratings. This finding is partially consistent with previous research showing that clinical experience either has no effect [12, 32, 33], or may have a small negative effect [34], on the interpretation of bolus-related outcomes during instrumental swallowing assessments. In contrast, more frequent use of frame-by-frame analysis and validated standardized rating scales was associated with greater residue rating accuracy. This finding, together with the absence of a relationship between rating accuracy and either work setting or specialty certifications, is also consistent with prior research [33]. These results are encouraging, as they suggest that clinicians might be able to improve their VFSS interpretation accuracy by incorporating frame-by-frame analysis and standardized interpretation techniques into their routine practice patterns.

Several limitations should be considered when interpreting these findings. First, although digital tracings were used to guide ratings and promote consistency, this level of visual support does not reflect typical clinical practice. The use of digital tracings allowed for direct comparisons between visual-perceptual ratings and ground-truth measurements by ensuring that the same spatial boundaries were applied across methods. This controlled approach likely contributed to the high accuracy and reliability observed and highlights that, when residue rating methods are clearly defined and standardized, visual-perceptual ratings can approach the rigor of computerized techniques. In contrast, most perceptual rating methods in clinical settings lack such explicit boundaries and methodological transparency, which may partially explain variability in rater performance. Future research should examine how rating accuracy and reliability are affected when digital tracings are removed but standardized rating instructions are retained, as this may better reflect the realities of clinical practice.

Second, this study focused exclusively on still image ratings and did not assess dynamic swallowing events or incorporate other VFSS elements such as bolus timing or kinematic measures. Exploring these rating approaches in the context of real-time VFSS videos will help determine their generalizability in more naturalistic settings. However, doing so may reintroduce variability in how spatial and temporal boundaries are perceived and applied – underscoring a key strength of the present study’s design, even as it limits broader ecological validity.

Third, the survey-based design of this study may have introduced sampling bias. For example, clinicians with busier schedules, greater burnout, or less motivation may have been less likely to participate. This could have skewed the sample toward clinicians with more time and motivation, which may have affected the results of this study. Future research might address this limitation by embedding rating tasks into protected clinical education sessions or mandatory in-service trainings, thereby making participation easier and the sample more representative across a broader spectrum of motivation levels.

Fourth, this study did not examine the accuracy and reliability of visual-perceptual ratings of the piriform sinuses, with comparisons to the Residue Ratio Scale of the Piriformis. This was not pursued in the current study to avoid survey fatigue and attrition. Future work should examine whether the results of this study generalize to percentage-based, visual-perceptual ratings of the piriform sinuses.

Lastly, this study did not compare differences in interpretability of categorical- versus percentage-based residue ratings. While categorical terms (e.g., mild, moderate, severe) may initially appear more intuitive, they are less precise and may lack consistency in meaning across raters, potentially reducing interpretive accuracy and clinical validity. Future research should therefore examine how percentage-based versus categorical residue ratings are interpreted and understood by patients, clinicians, and interdisciplinary team members. These data are important for applying user-centered design principles to improve clinician uptake, scale usability, and alignment with real-world practice needs [22].

Conclusions

Visual-perceptual percentage-based residue ratings during VFSS, when raters were provided with image tracings, demonstrated moderate-to-high accuracy and good-to-excellent inter- and intra-rater reliability across two standardized measures of pharyngeal residue. Clinician characteristics and practice patterns were associated with accuracy, suggesting that training and standardization may play a key role in optimizing rating accuracy. These findings support the potential use of percentage-based visual-perceptual approaches as an accessible and clinically meaningful alternative to traditional ordinal residue scales. With further validation, these methods may help improve the precision and consistency of pharyngeal residue assessment in everyday dysphagia care.

Statement of Ethics

This study was performed in accordance with the Declaration of Helsinki. This human study was approved by Weill Cornell Medicine’s Institutional Review Board, approval: IRB #: 24–10028086. All adult participants provided written informed consent to participate in this study.

Conflict of Interest Statement

The authors have no conflicts of interest to declare.

Funding Sources

This study was not supported by any sponsor or funder.

Author Contributions

All authors met the minimum criteria for authorship status, as proposed by the International Committee of Medical Journal Editors (ICMJE). Authorship contributions have been characterized using CRediT (https://credit.niso.org/). Roles include the following: conceptualization, data curation, formal analysis, funding acquisition, investigation, methodology, project administration, resources, software, supervision, validation, visualization, writing – original draft, and writing – reviewing and editing. Specific authorship contributions are as follows: James Curtis: conceptualization, data curation, formal analysis, funding acquisition, investigation, methodology, project administration, resources, software, supervision, visualization, writing – original draft, and writing – reviewing and editing; Brandon Jagdhar: data curation, investigation, methodology, and writing – reviewing and editing; Valentina Mocchetti: investigation, methodology, resources, and writing – reviewing and editing; Anaïs Rameau and Christine M. Clark: resources and writing – reviewing and editing; Mel Grasso: data curation, investigation, methodology, project administration, supervision, writing – original draft, and writing – reviewing and editing.

Data Availability Statement

All data and R code associated with this study are openly available in the Open Science Framework repository at https://osf.io/63b8y/. Further inquiries can be directed to the corresponding author.

References

  • 1. Steele CM, Peladeau-Pigeon M, Barrett E, Wolkin TS. The risk of penetration–aspiration related to residue in the pharynx. Am J Speech Lang Pathol. 2020;29(3):1608–17.
  • 2. Eisenhuber E, Schima W, Schober E, Pokieser P, Stadler A, Scharitzer M, et al. Videofluoroscopic assessment of patients with dysphagia: pharyngeal retention is a predictive factor for aspiration. AJR Am J Roentgenol. 2002;178(2):393–8.
  • 3. Oliveira DL, Moreira EAM, De Freitas MB, Gonçalves JA, Furkim AM, Clavé P, et al. Pharyngeal residue and aspiration and the relationship with clinical/nutritional status of patients with oropharyngeal dysphagia submitted to videofluoroscopy. J Nutr Health Aging. 2017;21(3):336–41.
  • 4. Langmore SE, Terpenning MS, Schork A, Chen Y, Murray JT, Lopatin D, et al. Predictors of aspiration pneumonia: how important is dysphagia? Dysphagia. 1998;13(2):69–81.
  • 5. Swan K, Cordier R, Brown T, Speyer R. Psychometric properties of visuoperceptual measures of videofluoroscopic and Fibre-Endoscopic Evaluations of Swallowing: a systematic review. Dysphagia. 2018;34:2–33.
  • 6. Steele CM, Peladeau-Pigeon M, Nagy A, Waito AA. Measurement of pharyngeal residue from lateral view videofluoroscopic images. J Speech Lang Hear Res. 2020;63(5):1404–15.
  • 7. Donohue C, Robison R, DiBiase L, Anderson A, Vasilopoulos T, Plowman EK. Comparison of validated videofluoroscopic outcomes of pharyngeal residue: concordance between a perceptual, ordinal, and bolus-based rating scale and a normalized pixel-based quantitative outcome. J Speech Lang Hear Res. 2022;65(7):2510–7.
  • 8. Bryant KN, Finnegan E, Berbaum K. VFS interjudge reliability using a free and directed search. Dysphagia. 2012;27(1):53–63.
  • 9. Hutcheson KA, Barrow MP, Barringer DA, Knott JK, Lin HY, Weber RS, et al. Dynamic imaging grade of swallowing toxicity (DIGEST): scale development and validation. Cancer. 2017;123(1):62–70.
  • 10. Martin-Harris B, Brodsky MB, Michel Y, Castell DO, Schleicher M, Sandidge J, et al. MBS measurement tool for swallow impairment-MBSimp: establishing a standard. Dysphagia. 2008;23(4):392–405.
  • 11. Pisegna JM, Kaneoka A, Coster WJ, Leonard R, Langmore SE. Residue ratings on FEES: trends for clinical application of residue measurement. Dysphagia. 2020;35(5):834–42.
  • 12. Pisegna JM, Borders JC, Kaneoka A, Coster WJ, Leonard R, Langmore SE. Reliability of untrained and experienced raters on FEES: rating overall residue is a simple task. Dysphagia. 2018;33(5):645–54.
  • 13. Pisegna JM, Kaneoka A, Leonard R, Langmore SE. Rethinking residue: determining the perceptual continuum of residue on FEES to enable better measurement. Dysphagia. 2018;33(1):100–8.
  • 14. Curtis JA, Borders JC, Perry SE, Dakin AE, Seikaly ZN, Troche MS. Visual Analysis of Swallowing Efficiency and Safety (VASES): a standardized approach to rating pharyngeal residue, penetration, and aspiration during FEES. Dysphagia. 2022;37(2):417–35.
  • 15. Leonard R. Two methods for quantifying pharyngeal residue on fluoroscopic swallow studies: reliability assessment. Ann Otolaryngol Rhinol. 2017;4(3):1168–72.
  • 16. Pearson WG, Molfenter SM, Smith ZM, Steele CM. Image-based measurement of post-swallow residue: the normalized residue ratio scale. Dysphagia. 2013;28(2):167–77.
  • 17. Jardine M, Miles A, Allen J, Leonard R. Quantifying post-swallow residue in healthy aging. Perspect ASHA Spec Interest Groups. 2020;5(6):1657–65.
  • 18. Leonard R, Miles A, Allen J. Bolus clearance ratio elevated in patients with neurogenic dysphagia compared with healthy adults: a measure of pharyngeal efficiency. Am J Speech Lang Pathol. 2023;32(1):107–14.
  • 19. Stoeckli SJ, Huisman TAGM, Seifert B, Martin-Harris BJW. Interrater reliability of videofluoroscopic swallow evaluation. Dysphagia. 2003;18(1):53–7.
  • 20. McCullough GH, Wertz RT, Rosenbek JC, Mills RH, Webb WG, Ross KB. Inter- and intrajudge reliability for videofluoroscopic swallowing evaluation measures. Dysphagia. 2001;16(2):110–8.
  • 21. Baijens L, Barikroo A, Pilz W. Intrarater and interrater reliability for measurements in videofluoroscopy of swallowing. Eur J Radiol. 2013;82(10):1683–95.
  • 22. Wilson T, Checklin M, Lawson N, Burnett AJ, Lombardo T, Freeman-Sanderson A. Understanding user experience and normative data in pharyngeal residue rating scales used in flexible endoscopic evaluation of swallowing (FEES): a scoping review. Int J Speech Lang Pathol. 2024;0(0):1–14.
  • 23. R Core Team. R: a language and environment for statistical computing [Internet]. Vienna, Austria: R Foundation for Statistical Computing; 2024. Available from: https://www.R-project.org/
  • 24. Lin LIK. A concordance correlation coefficient to evaluate reproducibility. Biometrics. 1989;45(1):255–68.
  • 25. McBride GB. A proposal for strength-of-agreement criteria for Lin’s concordance correlation coefficient. NIWA client Rep HAM2005-062. 2005;45:1–10.
  • 26. Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med. 2016;15(2):155–63.
  • 27. Leonard R, Kendall K. Dysphagia assessment and treatment planning: a team approach. 5th ed. San Diego, CA: Plural Publishing Inc.; 2023; p. 481.
  • 28. Curtis JA, Borders JC, Dakin AE, Troche MS. Normative reference values for FEES and VASES: preliminary data from 39 nondysphagic, community-dwelling adults. J Speech Lang Hear Res. 2023;66(7):2260–77.
  • 29. Molfenter SM, Steele CM. The relationship between residue and aspiration on the subsequent swallow: an application of the normalized residue ratio scale. Dysphagia. 2013;28(4):494–500.
  • 30. Curtis JA, Molfenter S, Troche MS. Predictors of residue and airway invasion in Parkinson’s disease. Dysphagia. 2020;35(2):220–30.
  • 31. Molfenter SM, Brates D, Herzberg E, Noorani M, Lazarus C. The swallowing profile of healthy aging adults: comparing noninvasive swallow tests to videofluoroscopic measures of safety and efficiency; p. 1–10.
  • 32. Neubauer PD, Rademaker AW, Leder SB. The Yale pharyngeal residue severity rating scale: an anatomically defined and image-based tool. Dysphagia. 2015;30(5):521–8.
  • 33. Vose AK, Kesneck S, Sunday K, Plowman E, Humbert I. A survey of clinician decision making when identifying swallowing impairments and determining treatment. J Speech Lang Hear Res. 2018;61(11):2735–56.
  • 34. Kitila M, Borders JC, Krisciunas GP, McNally E, Pisegna JM. Confidence, accuracy, and reliability of penetration-aspiration scale ratings on flexible endoscopic evaluations of swallowing by speech pathologists. Dysphagia. 2024;39(3):504–13.

Articles from Folia Phoniatrica et Logopaedica are provided here courtesy of Karger Publishers
