Fig. 3.
Processing pipeline used to detect the pharyngeal phase and the bolus location during the swallow in bolus-level clips. Top row (Supervised pharyngeal phase detection) - Each frame was classified by the CNN as pharyngeal (P) or non-pharyngeal (N). The BPM frame was identified as the first frame with predicted class P that was followed by at least 3 consecutive P frames, whereas the UESC frame was identified as the last frame with predicted class P that was preceded by at least 3 consecutive P frames. Bottom row (Weakly supervised bolus localization) - We refined the coarse localization maps obtained with Grad-CAM via additional image processing steps. Specifically, we binarized the activation maps and used image dilation to fill small holes, retaining only the largest blob. The resulting binary map was used as a mask to find the 100 darkest pixels within the pre-processed VFSS frames, and the convex hull of these pixels was used to initialize a geodesic active contour segmentation algorithm. Finally, the centroid of the refined shape was computed and mapped to a coordinate system whose vertical axis is defined by the C2-C4 spinal segment.
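
The frame-selection rule in the top row can be illustrated with a minimal sketch. It assumes the per-frame CNN predictions are available as a sequence of "P"/"N" labels, and it reads "followed by at least 3 consecutive P frames" as requiring the candidate frame plus the next 3 frames to all be P (and symmetrically for UESC). The function name `find_pharyngeal_phase` and the parameter `min_run` are hypothetical and not taken from the original work.

```python
from typing import Optional, Sequence, Tuple


def find_pharyngeal_phase(
    labels: Sequence[str], min_run: int = 3
) -> Tuple[Optional[int], Optional[int]]:
    """Return (bpm_index, uesc_index) from per-frame labels "P" or "N".

    Assumed reading of the caption's rule (hypothetical helper):
    - BPM: first P frame immediately followed by `min_run` P frames.
    - UESC: last P frame immediately preceded by `min_run` P frames.
    Either index is None if no frame satisfies its rule.
    """
    n = len(labels)
    is_p = [label == "P" for label in labels]

    # Scan forward for the first window of min_run + 1 consecutive P frames.
    bpm = None
    for i in range(n - min_run):
        if all(is_p[i:i + min_run + 1]):
            bpm = i
            break

    # Scan backward for the last window of min_run + 1 consecutive P frames.
    uesc = None
    for i in range(n - 1, min_run - 1, -1):
        if all(is_p[i - min_run:i + 1]):
            uesc = i
            break

    return bpm, uesc


if __name__ == "__main__":
    # Isolated P frames at indices 2 and 12 are ignored as spurious.
    predictions = list("NNPNNPPPPPPNPNN")
    print(find_pharyngeal_phase(predictions))  # (5, 10)
```

Requiring a run of consecutive P frames makes the detected boundaries robust to isolated misclassifications, which is presumably why the rule is stated this way in the caption.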
