Abstract
Background and Aims
Photodocumentation during EGD can be automated and standardized using deep learning (DL) models for anatomic site classification. EGD video data contain a significant number of suboptimal quality image frames for computational analysis (eg, off-center or blurry). We aimed to develop a DL model that extracts high-quality frames from EGD video data for anatomic classification.
Methods
The photodocumentation algorithm consisted of 2 image filters that extract high-quality image frames (appropriately centered, minimal to no blurriness) classified into 1 of 8 anatomic sites: esophagus, gastroesophageal junction, stomach body, fundus, angularis, antrum, duodenal bulb, and duodenum. Model training, testing, and internal validation were performed using 8231 EGD still images and 26,103 video-derived images with an even split among anatomic sites. Images were independently labeled by anatomic site by 2 gastroenterologists. External validation was performed using an independent dataset of 2142 EGD still images. Model performance (accuracy, F1 score) for 5 EGD videos (6308 frames) was analyzed using a majority vote strategy across windows of 5, 10, 20, and 30 consecutive frames.
Results
Internal testing and external validation for site classification showed overall accuracies of 98.1% and 95.0%, respectively; F1 scores ranged from 90.0% to 99.0% and 92.0% to 97.0% across anatomic sites, respectively. When applied to EGD video data, overall accuracies ranged from 89.7% to 94.8% across sampling window sizes.
Conclusions
We present a DL model capable of extracting high-quality frames from EGD video data and performing subsequent anatomic site classification with acceptable accuracy, allowing automated photodocumentation for consistent study quality and video indexing for annotated study review.
Accurate and comprehensive photodocumentation during EGD is an important component of the procedural electronic medical record. Photodocumentation guidelines were first proposed by the European Society of Gastrointestinal Endoscopy in 2001 and specified 8 standardized anatomic landmarks to be imaged during EGD studies.1 Multiple international endoscopy societies have since developed and revised their own EGD guidelines to include standardized photodocumentation,2, 3, 4 but the number of images and anatomic landmarks vary considerably. Implementation of consistent photodocumentation is burdensome to clinical workflow because the process requires manual acquisition and labeling of images for each EGD study. As a quality metric, auditing photodocumentation is a logistical challenge because images must be individually checked for quality and accuracy of annotation.5 Even among endoscopists enrolled in a training program to improve quality indicators for EGD (including photodocumentation), improvements in benchmark performance were not durable and regressed at the 3-year follow-up.6
With the purpose of automating and standardizing photodocumentation, previous studies have developed deep learning (DL) models for anatomic classification of EGD still images. Models developed using different convolutional neural network (CNN) architectures have demonstrated comparable performance.7 Although the current literature describes ways to optimize DL model development and accuracy for EGD still images, no studies have developed DL models for anatomic classification of raw EGD video output. The advantage of using raw video output is that acquisition is automated and provides a comprehensive record of the endoscopic procedure. However, raw endoscopy video data contain numerous clinically “noninformative” video frames that do not clearly visualize the GI lumen. Methods to identify in-focus images8 and to automate detection of “informative” video frames have been applied to wireless capsule endoscopy9 but have not been fully applied to EGD video data.
In this study, we aimed to develop a DL model that extracts high-quality images from raw EGD video data for subsequent anatomic classification. This allows not only automated photodocumentation but also efficient video indexing of clinically useful, high-quality video frames labeled by anatomic site.
Methods
Single-frame image and video acquisition
High-definition white-light EGD single-frame images obtained from clinical procedures performed at our institution between January 2020 and September 2021 were reviewed for quality and normal upper GI anatomy, defined as the absence of surgically altered anatomy, foreign body (eg, hemostatic clip, percutaneous gastrostomy tube), and/or upper GI pathology. Upper GI endoscopy videos were obtained prospectively under optimal conditions. After confirmation of normal GI anatomy, a video was captured in 1 fluid motion of controlled insertion of the endoscope from the upper esophagus to the duodenum and subsequent withdrawal, with no pauses for manual acquisition of still-frame images. To maximize visualization of anatomic landmarks, videos were obtained after lavage of the upper GI tract. This study was approved by our institutional review board.
All images and videos used for development of our DL model were obtained using EVIS EXCERA III GIF-190 series gastroscopes (Olympus, Tokyo, Japan). Images were stored in DICOM format with a resolution of 1920 × 1080 or 1280 × 720 pixels. Video recording was performed using high-definition video recorder systems (SCD3; Stryker, Kalamazoo, Mich, USA). Video image resolution was set at 1920 × 1080 pixels with a capture rate of 30 frames per second and stored in MP4 format. After acquisition, all images and videos were cropped to a resolution of 1350 × 1080 pixels to retain only the field of view of the endoscope.
High-definition white-light EGD single-frame images obtained from clinical procedures performed at 2 of our institution’s satellite sites between January 2020 and November 2021 were reviewed for quality and normal upper GI anatomy before use for external validation. Images were obtained using EVIS EXCERA III GIF-190 series gastroscopes (Olympus) and stored in DICOM format with a resolution of 1920 × 1080 pixels. All images were cropped to a resolution of 1350 × 1080 pixels to retain only the field of view of the endoscope.
Image classification
Single-frame and video-derived images were independently classified by 2 gastroenterologists (C.L.L. and N.C.-P.) into 8 anatomic sites as specified by previously published guidelines1,4: esophagus, gastroesophageal junction (GEJ), stomach body, stomach fundus, stomach angularis, stomach antrum, duodenal bulb, and duodenum. Single-frame image annotations were cross-checked for accuracy; only concordantly labeled images were retained for DL model development. Annotation of video-derived images was performed within the image sequence, with reviewers able to examine frames appearing before and after the frame of interest before assigning a final label. All video-derived frames with discordant labels between reviewers were classified as transition zones and excluded from data analysis. The final, concordant label generated for each single-frame and video-derived image was designated as the ground truth used for model development and validation. This approach accounted for the continuous nature of video data, in which 2 anatomic landmarks are not uncommonly captured within the same frame, making the labeling of transition zones challenging and subjective.
DL model image datasets
The following image datasets were acquired for model development and validation:
1. Dataset 1, used for the development of a quality filter to determine clinical usefulness of an image frame. This dataset consisted of 14,000 useful and 8500 nonuseful white-light still-frame EGD images as defined below.
2. Dataset 2, used for the development of an anatomic classification DL model. This dataset consisted of 8231 useful and fair/good quality white-light GI endoscopy still images and 26,103 video-derived images for the following anatomic sites (still; video-derived): esophagus (1078; 9114), GEJ (986; 2757), stomach body (1075; 3357), fundus (1020; 2662), angularis (926; 0), antrum (1006; 1760), duodenal bulb (1088; 3104), and duodenum (1052; 3349).
3. Dataset 3, used for external validation of our anatomic classification model. This dataset consisted of 2142 white-light EGD still images obtained for the following anatomic sites: esophagus (n = 300), GEJ (287), stomach body (274), fundus (300), angularis (298), antrum (283), duodenal bulb (200), and duodenum (200).
4. Dataset 4, used to assess performance of our anatomic classification model using raw EGD video input. Images were derived from 5 EGD videos obtained from patients with normal upper GI anatomy, as defined above, that were deconstructed into 18,859 individual frames. This dataset consisted of 6308 frames identified as useful, good-quality frames that excluded transition zones for the following anatomic sites: esophagus (n = 1025), GEJ (539), body (1189), fundus (598), angularis (327), antrum (769), duodenal bulb (647), and duodenum (1214), representing 33.4% of all frames.
Anatomic site classification and quality filter model architecture
Our DL model consisted of 2 image quality filters followed by an upper GI anatomic site classification model (Fig. 1, Video 1, available online at www.igiejournal.org). The first filter, the clinically useful image quality filter, was developed to extract all clinically useful images from a video-derived image series. The model classified an upper GI image as useful (Fig. 2A) versus nonuseful (Fig. 2B). Useful images were defined as images in which the GI lumen was clearly visible; nonuseful images were defined as those in which the GI lumen could not be identified (eg, tip of the endoscope pressed against the GI mucosa). This model was trained using dataset 1.
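To make the overall pipeline concrete, the sketch below chains the 2 quality filters and the site classifier on a per-frame basis. This is an illustrative outline only, not the authors' published code: the callables usefulness_model, blur_quality, and site_model are hypothetical stand-ins for the trained components described in this section, and the 224 × 224 input size, ImageNet normalization, and class-index convention are assumptions based on the ResNet34 backbone described below.

```python
# Illustrative per-frame pipeline: usefulness filter -> blurriness filter -> site classifier.
# usefulness_model, blur_quality, and site_model are hypothetical placeholders.
import torch
from PIL import Image
from torchvision import transforms

SITES = ["esophagus", "GEJ", "body", "fundus", "angularis", "antrum", "duodenal bulb", "duodenum"]

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                      # assumed input size for ResNet34
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],    # ImageNet normalization statistics
                         std=[0.229, 0.224, 0.225]),
])

def classify_frame(frame: Image.Image, usefulness_model, blur_quality, site_model):
    """Return an anatomic site label, or None if the frame fails either quality filter."""
    x = preprocess(frame).unsqueeze(0)

    # Filter 1: clinically useful vs nonuseful (GI lumen visible vs not).
    if usefulness_model(x).argmax(dim=1).item() != 1:   # assumes class index 1 == "useful"
        return None

    # Filter 2: degree of blurriness (poor / fair / good); poor frames are discarded.
    if blur_quality(frame) == "poor":
        return None

    # Anatomic site classification across the 8 landmark classes.
    return SITES[site_model(x).argmax(dim=1).item()]
```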
Figure 1.
Photodocumentation deep learning model architecture. The algorithm takes an EGD still-frame image or EGD video-derived image and then applies 2 filters to categorize the image regarding image quality. The first quality filter categorizes images by clinical usefulness, and the second quality filter categorizes images by camera focus. Images subsequently undergo anatomic classification by a convolutional neural network pretrained using ImageNet with a ResNet34 architecture, which consists of 32 convolutional layers, 1 average pool layer, and 1 maximum pool layer.
Figure 2.
A, Example of a “useful” image as classified by the clinically useful image quality filter. The esophageal lumen is well visualized and centered in the image. B, Example of a “nonuseful” image as classified by the clinically useful image quality filter. The endoscopic view of the esophageal lumen is partially occluded by an air bubble.
The second filter, the degree of image blurriness filter, was developed to determine the degree of blurriness of an image. This filter used Ying et al’s10 “patches to pictures” (PaQ-2-PiQ) model, which was trained on a set of 40,000 real-world photographs comprising 120,000 patches, on which roughly 4 million annotated picture-quality labels were collected. The PaQ-2-PiQ model measures an image’s degree of blurriness and then classifies the image as poor (Fig. 3A), fair (Fig. 3B), or good (Fig. 3C) quality.
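The PaQ-2-PiQ model itself is described and distributed by its original authors.10 As a simpler stand-in to illustrate the poor/fair/good bucketing, a no-reference sharpness proxy such as the variance of the Laplacian could be used; the thresholds below are hypothetical and would need tuning against expert-rated EGD frames. This is not the filter used in the study.

```python
# Simple blur-rating stand-in (NOT PaQ-2-PiQ): variance of the Laplacian as a
# no-reference sharpness measure, bucketed into poor / fair / good.
import cv2
import numpy as np

def blur_quality(image_bgr: np.ndarray) -> str:
    """Rate sharpness of a BGR frame; the cutoffs are hypothetical placeholders."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
    if sharpness < 50:          # hypothetical "poor" cutoff
        return "poor"
    if sharpness < 150:         # hypothetical "fair" cutoff
        return "fair"
    return "good"
```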
Figure 3.
A, Example of a “poor-quality” image as classified by the degree of image blurriness quality filter. The EGD image of the stomach body shows poor focus of the rugal folds in both the center and periphery of the image. B, Example of a “fair-quality” image as classified by the degree of image blurriness quality filter. The EGD image of the stomach body shows poor central focus of the rugal folds with sharp focus of the periphery of the image. C, Example of a “good-quality” image as classified by the degree of image blurriness quality filter. The EGD image of the stomach body shows sharp focus of the rugal folds in both the center and periphery of the image.
For the anatomic site classification model, an upper GI classification model was developed to assign each image to 1 of 8 categories: esophagus, GEJ, stomach body, fundus, angularis, antrum, duodenal bulb, and duodenum. This model was pretrained using ImageNet and used a ResNet34 architecture.11 It was developed using dataset 2, with splits of 80% training, 10% internal testing, and 10% validation for each of the 8 anatomic sites. External validation of the model was performed using dataset 3.
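A sketch of how such an 8-class model could be set up with torchvision is shown below. The paper does not publish training code, so the optimizer, learning rate, and weights API (torchvision ≥ 0.13) here are illustrative assumptions rather than the authors' implementation.

```python
# Illustrative fine-tuning setup for the 8-class anatomic site classifier.
import torch
import torch.nn as nn
from torchvision import models

NUM_SITES = 8  # esophagus, GEJ, body, fundus, angularis, antrum, duodenal bulb, duodenum

# ResNet34 backbone pretrained on ImageNet; replace the 1000-class head with 8 classes.
model = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_SITES)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # hypothetical hyperparameters
criterion = nn.CrossEntropyLoss()

def train_one_epoch(train_loader, device="cuda"):
    """One pass over the 80% training split; the 10%/10% testing and validation splits are evaluated analogously."""
    model.to(device).train()
    for images, labels in train_loader:    # images: (B, 3, 224, 224); labels: site indices 0-7
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```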
DL model performance
Analysis was performed using an NVIDIA Tesla A100 GPU (NVIDIA Corporation, Santa Clara, Calif). Overall performance was evaluated using accuracy and the F1 score, measured against the “ground truth” of the expert-annotated images. Performance was broken down by anatomic site. Confusion matrices were used to summarize model performance. Measures were defined as follows:
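In terms of per-site true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), the standard definitions, which we assume match the authors' usage, are:

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```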
When our DL model was applied to video-derived image data, a simple majority vote over model predictions was performed across windows of 5, 10, 20, and 30 consecutive frames. Multiple window sizes were explored to determine optimal performance when applied to an EGD video stream. Because EGD videos were recorded at 30 frames per second, the 5-, 10-, 20-, and 30-frame windows corresponded to one-sixth, one-third, two-thirds, and 1 second of video, respectively. The simple majority vote acted as a “smoothing function” that generated a single model prediction12 and a single ground-truth label for each window. This decreased the number of false-positive classifications by ensuring that the model analyzed at least one-sixth second of useful, good-quality EGD video, effectively acting as a threshold that triggered a prediction output only when the required number of consecutive frames was met. In cases where the majority vote resulted in a 2-way tie between anatomic sites, the tie was resolved by assigning the final prediction that resulted in site misclassification, yielding the most conservative estimate of accuracy.
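A sketch of this majority-vote smoothing is given below. The paper does not state how windows were formed; contiguous, non-overlapping windows are assumed here because they are consistent with the window counts reported in the Results, and the conservative tie-break rule is implemented by scoring any 2-way tie as a misclassification. The variable names are illustrative.

```python
# Majority-vote smoothing over windows of `window` consecutive useful, good-quality frames.
# Ties are scored as misclassifications, per the conservative tie-break rule.
from collections import Counter

def windowed_accuracy(predictions, truths, window):
    """predictions/truths: per-frame anatomic site labels for consecutive qualifying frames."""
    correct = total = 0
    for start in range(0, len(predictions) - window + 1, window):   # non-overlapping windows
        votes = Counter(predictions[start:start + window]).most_common(2)
        if len(votes) > 1 and votes[0][1] == votes[1][1]:
            voted = None                                  # 2-way tie -> counted as wrong
        else:
            voted = votes[0][0]
        truth = Counter(truths[start:start + window]).most_common(1)[0][0]
        correct += int(voted == truth)
        total += 1
    return correct / total if total else float("nan")

# Usage sketch: windowed_accuracy(frame_predictions, frame_truths, window=5)
# would correspond to the 5-frame-window accuracy reported for dataset 4.
```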
Gradient-weighted class activation maps
Gradient-weighted class activation maps (Grad-CAMs) provide visual insight into model prediction by overlaying a heat map that highlights regions of importance used to determine a given label.13 Grad-CAMs were generated for all datasets. Single-frame and video-derived images were cropped to 224 × 224 pixels before input into the Grad-CAM model. Grad-CAM uses the last convolutional layer of a CNN to assign importance values to each neuron for a particular decision of interest.
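For orientation, a minimal Grad-CAM sketch for a torchvision ResNet34 is shown below. This is a generic implementation of the published Grad-CAM method,13 not the authors' code; the choice of model.layer4[-1] as the hooked layer and the 224 × 224 input size are assumptions.

```python
# Minimal Grad-CAM sketch for a ResNet34 classifier (illustrative, not the study's code).
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_class=None):
    """image: (1, 3, 224, 224) tensor; returns a (224, 224) heat map normalized to [0, 1]."""
    activations, gradients = [], []
    layer = model.layer4[-1]   # assumed target: last residual block (final conv feature map)
    h1 = layer.register_forward_hook(lambda m, inp, out: activations.append(out))
    h2 = layer.register_full_backward_hook(lambda m, gin, gout: gradients.append(gout[0]))
    try:
        model.eval()
        logits = model(image)
        cls = logits.argmax(dim=1).item() if target_class is None else target_class
        model.zero_grad()
        logits[0, cls].backward()                                  # gradient of the chosen class score
        weights = gradients[0].mean(dim=(2, 3), keepdim=True)      # per-channel neuron importance
        cam = F.relu((weights * activations[0]).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=(224, 224), mode="bilinear", align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize for display
        return cam[0, 0].detach()
    finally:
        h1.remove()
        h2.remove()
```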
Results
Internal validation results
Overall accuracy was 98.1% (3395/3459 correctly labeled images). Individual anatomic site accuracies were summarized in a confusion matrix (Fig. 4A) and ranged from 86% for the angularis to 99% for the esophagus, GEJ, and fundus. The 3 most common misclassifications were between the angularis and antrum (5% of angularis frames misclassified as antrum), between the angularis and duodenal bulb (4% of angularis frames misclassified as duodenal bulb), and between the angularis and fundus (3% of angularis frames misclassified as fundus). The F1 score ranged from 90% for the angularis to 99% for the esophagus and duodenum (Table 1).
Figure 4.
A, Internal testing confusion matrix. The horizontal axis represents deep learning model predictions, whereas the vertical axis represents actual ground-truth labels. B, External validation confusion matrix. The horizontal axis represents deep learning model predictions, whereas the vertical axis represents actual ground-truth labels. ESO, Esophagus; GEJ, gastroesophageal junction; BOD, stomach body; FND, stomach fundus; ANG, stomach angularis; ANT, stomach antrum; BLB, duodenal bulb; DUO, duodenum.
Table 1.
F1 score by anatomic site
| | Esophagus | Gastroesophageal junction | Stomach body | Stomach fundus | Stomach angularis | Stomach antrum | Duodenal bulb | Duodenum |
|---|---|---|---|---|---|---|---|---|
| Internal testing | .99 | .98 | .98 | .98 | .90 | .98 | .98 | .99 |
| External validation | .97 | .97 | .96 | .96 | .92 | .95 | .93 | .95 |
| 5-frame window | .86 | .99 | .78 | .93 | .96 | .84 | .98 | .95 |
| 10-frame window | .85 | 1.00 | .74 | .93 | .96 | .82 | .97 | .96 |
| 20-frame window | .92 | 1.00 | .84 | .94 | 1.00 | .88 | 1.00 | 1.00 |
| 30-frame window | .91 | 1.00 | .86 | .98 | 1.00 | .92 | 1.00 | 1.00 |
External validation results
Overall accuracy was 95.0% (2034/2142 correctly labeled images). Individual anatomic site accuracies were summarized in a confusion matrix (Fig. 4B) and ranged from 90% for the angularis to 99% for the fundus and duodenum. The 3 most common misclassifications were between the angularis and fundus (8% of angularis frames misclassified as fundus), between the antrum and angularis (5% of antrum frames misclassified as angularis), and between the duodenal bulb and duodenum (4% of duodenal bulb frames misclassified as duodenum). The F1 score ranged from 92% for the angularis to 97% for the esophagus and GEJ (Table 1).
Video-derived majority vote analysis
The total number of frames from dataset 4 that were used for 5-, 10-, 20-, and 30-frame analysis were 5870, 5030, 4000, and 3480, respectively. This represented 93.1%, 79.7%, 63.4%, and 55.2% of all useful and high-quality frames, respectively. Total video analysis run time for dataset 4 was 40 minutes and 15 seconds, split into 13 minutes and 19 seconds for the 2 quality filters and 26 minutes and 56 seconds for the anatomic site classification model.
For majority vote analysis, total run times for 5-, 10-, 20-, and 30-frame windows were 6628.0, 5976.1, 5295.8, and 4874.8 ms, respectively, for the 5 EGD videos. This averaged to 1325.6, 1195.2, 1059.2, and 975.0 ms of run time per video. Overall accuracies for the 5-, 10-, 20-, and 30-frame analysis windows were 90.5% (1062/1174), 89.7% (451/503), 93.0% (186/200), and 94.8% (110/115), respectively.
Individual anatomic site accuracies across 5-, 10-, 20-, and 30-frame analysis windows are summarized in Table 2 and Supplementary Figure 1 (available online at www.igiejournal.org). Accuracy ranged from 81% for the antrum to 99% for the fundus; from 79% for the body to 100% for the fundus; from 83% for the antrum to 100% for the esophagus, GEJ, fundus, and angularis; and from 86% for the antrum to 100% for the esophagus, GEJ, fundus, angularis, and duodenal bulb for 5-, 10-, 20-, and 30-frame windows, respectively.
Table 2.
Accuracy by anatomic site
| | Esophagus | Gastroesophageal junction | Stomach body | Stomach fundus | Stomach angularis | Stomach antrum | Duodenal bulb | Duodenum |
|---|---|---|---|---|---|---|---|---|
| Internal testing | .990 | .990 | .980 | .990 | .860 | .980 | .980 | .980 |
| External validation | .960 | .980 | .950 | .990 | .900 | .920 | .910 | .990 |
| 5-frame window | .967 | .961 | .819 | .991 | .935 | .806 | .935 | .904 |
| 10-frame window | .958 | .978 | .794 | 1.000 | .926 | .810 | .907 | .901 |
| 20-frame window | 1.000 | 1.000 | .875 | 1.000 | 1.000 | .826 | .955 | .905 |
| 30-frame window | 1.000 | 1.000 | .870 | 1.000 | 1.000 | .857 | 1.000 | .957 |
Supplementary Figure 1.
A, Video data 5-frame window analysis confusion matrix. B, Video data 10-frame window analysis confusion matrix. C, Video data 20-frame window analysis confusion matrix. D, Video data 30-frame window analysis confusion matrix. The horizontal axis represents deep learning model predictions, whereas the vertical axis represents actual ground-truth labels. ESO, Esophagus; GEJ, gastroesophageal junction; BOD, stomach body; FND, stomach fundus; ANG, stomach angularis; ANT, stomach antrum; BLB, duodenal bulb; DUO, duodenum.
Across all windows, the most common misclassification was between the body and duodenal bulb, with 15%, 19%, 10%, and 13% of body frames misclassified as duodenal bulb for 5-, 10-, 20-, and 30-frame windows, respectively. The second most common misclassification was between the antrum and body, with 10%, 9%, 7%, and 7% of antrum frames misclassified as body for 5-, 10-, 20-, and 30-frame windows, respectively. The third most common misclassification was between the antrum and duodenal bulb, with 6%, 9%, 7%, and 7% of antrum frames misclassified as duodenal bulb for 5-, 10-, 20-, and 30-frame windows, respectively. The F1 score by anatomic site ranged from 78% for the stomach body to 99% for the GEJ; 74% for the stomach body to 100% for the GEJ; 84% for the stomach body to 100% for the GEJ, angularis, duodenal bulb, and duodenum; and 86% for the stomach body to 100% for the GEJ, angularis, duodenal bulb, and duodenum for 5-, 10-, 20-, and 30-frame windows, respectively (Table 1).
Grad-CAMs by anatomic classification site
Grad-CAM images were generated for each anatomic classification site using video-derived frames and demonstrated salient anatomic features corresponding to each site. Representative Grad-CAM images from the GEJ, angularis, and fundus sites can be found in Figure 5. Identification of the GEJ relies primarily on the sharp demarcation between squamous and columnar epithelium at the squamocolumnar junction. Identification of the fundus relies primarily on visualization of the retroflexed endoscope. Identification of the angularis relies on identification of the angular incisure, retroflexed endoscope, and distal GI lumen. Representative EGD images and corresponding Grad-CAM images for the remaining anatomic sites can be found in Supplementary Figure 2 (available online at www.igiejournal.org).
Figure 5.
EGD image and corresponding gradient-weighted class activation map for the following anatomic sites: A and B, gastroesophageal junction; C and D, angularis; E and F, fundus.
Supplementary Figure 2.
EGD image and corresponding gradient-weighted class activation map for the following anatomic sites: A and B, esophagus; C and D, stomach body; E and F, antrum; G and H, duodenal bulb; I and J, duodenum.
Discussion
The present study describes an anatomic site classification model for automated photodocumentation in upper GI endoscopy. Our model is unique because it was optimized for endoscopy video data and uses 2 image quality filters before applying a majority vote prediction across a predetermined number of frames. Model performance showed excellent overall accuracy on internal (98.1%) and external (95.0%) validation and when tested using EGD videos obtained under optimal conditions (90.5%-94.8%).
Our model’s performance is comparable with that of previously developed DL anatomic site classification models for single-frame EGD images. He et al7 showed similar overall accuracy across various CNN model architectures, with overall accuracies greater than 90%; the F1 score by anatomic site for the best-performing CNN was also similar, ranging from 88.42% for the GEJ to 98.07% for the fundus. Image annotation in that study was performed independently by 2 separate investigators, similar to our study methodology. However, their models were trained exclusively on single-frame EGD images and did not incorporate video-derived EGD images. The addition of video-derived EGD images to dataset 2 was necessary to optimize our model for video data analysis and did not compromise performance for single-frame image classification.
Other studies have focused on optimizing DL models for anatomic classification of still-frame EGD image data. Methods to accelerate model development using a “semi-supervised learning” approach have been developed to ease the workforce requirement for data preparation.14 Methods to automate a raw data preprocessing step before DL model development have been developed to maximize anatomic classification accuracy of EGD still-frame images.15 However, such DL models have not been optimized for EGD video data. Although video data provide comprehensive documentation of an EGD procedure, application of an anatomic classification model can be limited by a significant number of suboptimal-quality frames with various degrees of blurriness and lack of visibility of the GI lumen. To account for these limitations, our model incorporated quality filters for clinical usefulness and blurriness before anatomic site classification, making it uniquely optimized for video data analysis by identifying and excluding low-quality EGD video from subsequent anatomic site classification. The incorporation of quality filters did not significantly increase the DL model’s total run time, accounting for 33.1% of the total run time for dataset 4.
Our study explored the optimal frame window size (5, 10, 20, and 30 frames) to serve as a smoothing function when analyzing video-derived frames. Although a slight increase in accuracy was observed with increasing window size, the difference was minimal (5.1%) and was driven primarily by a decreasing proportion of misclassifications at the body and antrum sites. In general, all anatomic sites showed improved accuracy when increasing from a 5- to a 30-frame window because of a decreasing proportion of misclassifications. Using the simple majority vote method across wider frame windows therefore improved performance, but at the cost of a rapidly decreasing number of total windows that met the necessary criteria for analysis: the total numbers of windows available in the 10-, 20-, and 30-frame analyses were 43%, 17%, and 10%, respectively, of the windows available in the 5-frame analysis. Given the relatively small improvement in overall accuracy and the steep drop in available windows with increasing window size, a 5-frame window would confer the ideal compromise between data quality and data quantity for future applications of our DL model. There was less than a 1-second improvement in run time for the 30-frame analysis compared with the 5-frame analysis and therefore no practical difference. From a practical standpoint, an endoscopist would only need to obtain one-sixth second of useful, good-quality EGD video for our DL model to generate an anatomic site prediction. This ensures that anatomic sites that are often examined quickly during routine diagnostic EGD (eg, the duodenal bulb) are not missed by our model because of inadequate frames to trigger the prediction threshold.
We generated Grad-CAM maps for both EGD still images and video-derived EGD images that identified 3 image characteristics important for model prediction: visualization of characteristic anatomic features, visualization of the distal upper GI lumen, and visualization of the retroflexed endoscope. When performing video-derived classification, our 2 raters could view frames before and after the frame of interest before assigning a ground-truth label. This afforded a significant advantage in accurately predicting the location of isolated frames that were difficult to predict based on image characteristics alone. Without incorporation of temporal context, our model had the worst performance at the stomach body and antrum sites. Because of rapid transitions between different anatomic sites within the stomach, many frames were difficult to predict based on image characteristics alone but could be accurately labeled when viewed in the context of frames that preceded and followed them.
Several limitations preclude direct implementation of our model into clinical practice. First, our model was trained with EGD images from patients with normal upper GI anatomy; we anticipate performance drops when applied to EGD videos with altered anatomy or upper GI pathology. Second, our EGD videos were obtained using an initial cleaning step before the smooth reinsertion and withdrawal of the endoscope. Under normal EGD study conditions, endoscopic lavage is performed throughout the entire duration of the study and generates a significant number of suboptimal frames. The initial cleaning step therefore maximized the number of useful and good-quality frames for analysis; this step may deviate from routine clinical practice. The smooth reinsertion and withdrawal technique decreased the number of transition zone frames, which represent parts of the EGD procedure in which multiple anatomic landmarks are visualized simultaneously. We excluded transition zones from the analysis given the subjectivity associated with their annotation and to preserve a ground truth for all labeled frames. Currently, our model is limited to providing a single classification per video frame. A future iteration would output a primary and secondary prediction to account for transition zones.
For future steps, we plan to implement 3 major improvements before consideration for clinical use. First, the image quality filter will be optimized for EGD images by retraining the PaQ-2-PiQ model entirely on EGD images. Second, we will train the model using EGD images obtained from patients with surgically altered anatomy, foreign body, and/or upper GI pathology for generalizability. Third, the model will incorporate temporal data to improve performance when analyzing video data, which by nature contains spatial and temporal characteristics.16, 17, 18, 19
We describe a novel DL model that distills EGD video data into useful, high-quality image frames and requires only 5 consecutive frames (one-sixth second) of video data to perform accurate anatomic site classification. This model serves as the foundation for automated and standardized EGD photodocumentation, which improves the quality and efficiency of clinical workflow, and for video indexing, which enables the creation of an EGD video database that stores abbreviated, annotated procedural videos for patient and endoscopist review.
Disclosure
The following authors disclosed financial relationships: N. Coelho-Prabhu, C. L. Leggett: consultant for Verily Life Sciences. All other authors disclosed no financial relationships. Research support for this study was provided by a Career Development Award (Mayo Clinic, Rochester, Minn).
Supplementary Material
Video representation of deep learning anatomic site classification model. Video frames from a normal upper endoscopy are classified by image quality and anatomic location. A color-coded progress bar illustrates the site of each anatomic landmark, with red = esophagus, yellow = gastroesophageal junction, light green = stomach body, light blue = stomach fundus, dark green = stomach angularis, dark blue = stomach antrum, purple = duodenal bulb, violet = duodenum. Only “clinically useful” and “good” quality frames undergo anatomic site classification. The video is played at 1.5 times speed.
References
- 1. Rey J.F., Lambert R.; ESGE Quality Assurance Committee. ESGE recommendations for quality control in gastrointestinal endoscopy: guidelines for image documentation in upper and lower GI endoscopy. Endoscopy. 2001;33:901–903. doi: 10.1055/s-2001-42537.
- 2. Yao K. The endoscopic diagnosis of early gastric cancer. Ann Gastroenterol. 2013;26:11.
- 3. Emura F., Sharma P., Arantes V., et al. Principles and practice to facilitate complete photodocumentation of the upper gastrointestinal tract: World Endoscopy Organization position statement. Dig Endosc. 2020;32:168–179. doi: 10.1111/den.13530.
- 4. Beg S., Ragunath K., Wyman A., et al. Quality standards in upper gastrointestinal endoscopy: a position statement of the British Society of Gastroenterology (BSG) and Association of Upper Gastrointestinal Surgeons of Great Britain and Ireland (AUGIS). Gut. 2017;66:1886–1899. doi: 10.1136/gutjnl-2017-314109.
- 5. Bisschops R., Rutter M.D., Areia M., et al. Overcoming the barriers to dissemination and implementation of quality measures for gastrointestinal endoscopy: European Society of Gastrointestinal Endoscopy (ESGE) and United European Gastroenterology (UEG) position statement. Endoscopy. 2021;53:196–202. doi: 10.1055/a-1312-6389.
- 6. Serrat J.A.A., Córdova H., Moreira L., et al. Evaluation of long-term adherence to oesophagogastroduodenoscopy quality indicators. Gastroenterol Hepatol. 2020;43:589–597. doi: 10.1016/j.gastrohep.2020.01.017.
- 7. He Q., Bano S., Ahmad O.F., et al. Deep learning-based anatomical site classification for upper gastrointestinal endoscopy. Int J Comput Assist Radiol Surg. 2020;15:1085–1094. doi: 10.1007/s11548-020-02148-5.
- 8. Oh J.H., Hwang S., Lee J.K., et al. Informative frame classification for endoscopy video. Med Image Anal. 2007;11:110–127. doi: 10.1016/j.media.2006.10.003.
- 9. Bashar M.K., Kitasaka T., Suenaga Y., et al. Automatic detection of informative frames from wireless capsule endoscopy images. Med Image Anal. 2010;14:449–470. doi: 10.1016/j.media.2009.12.001.
- 10. Ying Z., Niu H., Gupta P., et al. From patches to pictures (PaQ-2-PiQ): mapping the perceptual space of picture quality. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA; 2020:3572–3582. doi: 10.1109/CVPR42600.2020.00363.
- 11. He K., Zhang X., Ren S., et al. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA; 2016:770–778.
- 12. Paul J.F., Rohnean A., Giroussens H., et al. Evaluation of a deep learning model on coronary CT angiography for automatic stenosis detection. Diagn Interv Imaging. 2022;103:316–323. doi: 10.1016/j.diii.2022.01.004.
- 13. Selvaraju R.R., Cogswell M., Das A., et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy; 2017:618–626.
- 14. Chang Y.Y., Li P.C., Chang R.F., et al. Deep learning-based endoscopic anatomy classification: an accelerated approach for data preparation and model validation. Surg Endosc. 2022;36:3811–3821. doi: 10.1007/s00464-021-08698-2.
- 15. Cogan T., Cogan M., Tamil L. MapGI: accurate identification of anatomical landmarks and diseased tissue in gastrointestinal tract using deep learning. Comput Biol Med. 2019;111. doi: 10.1016/j.compbiomed.2019.103351.
- 16. Bengio Y., Simard P., Frasconi P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Netw. 1994;5:157–166. doi: 10.1109/72.279181.
- 17. Udristoiu A.L., Stefanescu D., Gruionu G., et al. Deep learning algorithm for the confirmation of mucosal healing in Crohn's disease, based on confocal laser endomicroscopy images. J Gastrointest Liver Dis. 2021;30:59–65. doi: 10.15403/jgld-3212.
- 18. Reuss J., Pascual G., Wenzek H., et al. Sequential models for endoluminal image classification. Diagnostics. 2022;12:501. doi: 10.3390/diagnostics12020501.
- 19. Hanif U., Kezirian E., Kiar E.K., et al. Upper airway classification in sleep endoscopy examinations using convolutional recurrent neural networks. Annu Int Conf IEEE Eng Med Biol Soc. 2021;2021:3957–3960. doi: 10.1109/EMBC46164.2021.9630098.