ABSTRACT
Background
Assessment of the gastroesophageal junction (GEJ) is an integral part of gastroscopy; however, the absence of standardized reporting hinders consistency of examination documentation. The Hill classification offers a standardized approach for evaluating the GEJ. This study aims to compare the accuracy of an artificial intelligence (AI) system with that of physicians in classifying the GEJ according to Hill in a prospective, blinded, superiority trial.
Methods
Consecutive patients scheduled for gastroscopy with an intact GEJ were recruited during clinical routine from October 2023 to December 2023. Nine physicians (six experienced, three inexperienced) assessed the Hill grade, and the AI system operated in the background in real‐time. The gold standard was determined by a majority vote of independent assessments by three expert endoscopists who did not participate in the study. The primary outcome was accuracy. Secondary outcomes were per‐Hill grade analysis and result comparison for experienced and inexperienced endoscopists separately.
Results
In 131 analysed examinations, the AI's accuracy of 84.7% (95% CI: 78.6–90.8) was significantly higher than the physicians' 62.5% (95% CI: 54.2–71.0) (p < 0.01). The AI outperformed physicians in all but one case in the per‐Hill‐class analysis. The AI was significantly more accurate than inexperienced physicians (85% vs. 56%, p < 0.01) and, in trend, better than experienced physicians (84% vs. 69.6%, p = 0.07).
Conclusions
AI was significantly more accurate than examiners in assessing the Hill classification. This superior model performance can prove beneficial for endoscopists, especially those with limited experience.
Trial Registration
ClinicalTrials.gov identifier: NCT06040723
Keywords: accuracy, AI, ConvNext, deep learning, EndoMind, esophagogastroduodenoscopy, gastroscopy, GEJ

1. Summary
Summarize the established knowledge on this subject:
- The Hill classification provides a standardized and objective method for gastroesophageal junction assessment.
- Hill classification is associated with several distinct clinical conditions.
- Gastroscopy can be enhanced with AI‐powered tools.
What are the significant and/or new findings of this study?
- AI enables real‐time evaluation of the Hill grade during gastroscopy.
- AI surpasses physician accuracy in Hill grade assessments.
- AI can provide instant, interpretable feedback through heatmaps.
2. Introduction
Thorough inspection and assessment of the gastroesophageal junction (GEJ) is a central part of each gastroscopy. Alterations of this anatomical landmark such as hiatal hernia [1] are associated with conditions like gastroesophageal reflux disease (GERD) and dysphagia [2]. Hiatal hernia is usually assessed using the axial length in centimetres [3], which is subject to interobserver variability [4]. The GEJ inspection is performed in retroflexion during gastroscopy, with multiple national and international guidelines suggesting the inclusion of image documentation in the examination report [5, 6, 7]. In addition to image documentation, standardized assessment of the landmark is beneficial as it can limit inter‐observer variability. The Hill classification [8] is widely used for evaluating the GEJ [9] and has been associated with multiple conditions such as reflux oesophagitis [10] and postoperative GERD [11]. Additionally, a revised version of the Hill classification was suggested by Nguyen, but prospective clinical validation is still needed [12]. Furthermore, it was shown to be an accurate predictor for GERD [13], performing better than the length of the hiatal hernia [14]. However, proper assessment of the Hill grade requires expertise and experience in interpreting endoscopic findings.
Artificial intelligence (AI) has already found successful applications in gastroenterology, where it has become part of the clinical routine. Several applications have been developed for colonoscopy, including colorectal polyp detection [15, 16], characterization [17, 18] and sizing [19, 20] methods, developed for both research and commercial purposes. Additionally, automation of examination report generation based on AI has been investigated for both the upper and lower gastrointestinal (GI) tract [21, 22]. In gastroscopy, AI has found a plethora of applications in disease identification [23, 24], including gastric cancer [25, 26], atrophic gastritis [27, 28, 29] and Barrett's neoplasia [30, 31]. In our previous work, we investigated the development of an AI for predicting the Hill classification using only retrospectively collected endoscopic images [32]. Yet, the performance of that AI was neither assessed in clinical practice nor compared to that of physicians.
Although AI has found numerous applications in clinical routine, scepticism still exists regarding the ability to interpret model output for a given input. To overcome this, several explainability methods, such as GradCAM [33], saliency maps [34] and concept relevance propagation [35], have been developed. These methods attempt to highlight the parts of the original image that influence the model prediction most but typically require post‐processing. This hinders real‐time explainability, since no immediate insight into the input parts with the highest impact on model prediction can be obtained.
In this work, we evaluate the performance of an upgraded version of our self‐developed AI that assesses the GEJ according to the Hill classification in clinical practice during consecutive examinations and compare it with that of physicians. The main outcome is accuracy in assessing the Hill grade. We also investigate the impact of physician experience. Finally, the developed AI enables the generation of explainability heatmaps in real time. Using these heatmaps, we assess correct and erroneous predictions.
3. Materials and Methods
3.1. The Hill Classification
The Hill classification, introduced by Lucius Hill [8], provides a standardized method for assessing the status of the GEJ. Hill defined four grades of increasing severity, based on the anatomical condition of the GEJ, with grade 1 being the normal condition. Hill grade 1 is characterized by a distinct and persistent fold along the lesser curvature of the GEJ that is firm around the endoscope. In Hill grade 2, the fold at the GEJ periodically opens and closes around the endoscope. By contrast, in patients with Hill grade 3, the fold at the GEJ is not prominent and fails to grip tightly around the endoscope. Lastly, Hill grade 4 is characterized by a large hiatal hernia, resulting in an open oesophageal lumen that allows visualization of the squamous epithelium from below. Examples of the different grades are depicted in Figure 1.
FIGURE 1.

Example images representing each of the four different Hill grades.
3.2. Architecture and Training of the AI
We developed and trained an AI that automatically assesses images of the GEJ obtained during retroflexion in the stomach using the Hill classification. The AI was trained with 8500 expert‐annotated images from two hospitals. Compared to our previous work [32], this AI was trained with more images and used a different architecture. The developed AI uses a network of the ConvNext family [36] to extract image features, which are passed through a novel classification head capable of generating heatmaps in real time that explain which parts of the input image have the highest impact on the prediction. The class with the highest confidence was chosen as the predicted Hill grade. The architecture of the network and the proposed classification head (Supporting Information S1: Figure S2) are described in detail in Supporting Information S1. The AI was integrated into EndoMind [15], our freely available real‐time framework for AI in endoscopy, which enabled its introduction to clinical routine. During the examination, frames captured from the endoscopic processor were used as model inputs, and real‐time model outputs were displayed on the monitor and simultaneously stored in the system. Our study setup blinded the examiner to the AI output by turning off the second monitor displaying the AI predictions.
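The exact classification head is specified in Supporting Information S1. As an illustrative sketch only (not the authors' architecture), a CAM‐style head shows how per‐class heatmaps and logits can be produced in one pass: a per‐class linear map over the feature maps yields one spatial map per Hill grade, and spatially averaging each map gives the class logits, so the map for the predicted class doubles as a real‐time heatmap.

```python
import numpy as np

def cam_head(features: np.ndarray, weights: np.ndarray):
    """Hypothetical CAM-style head. features: (C, H, W) extracted feature
    maps; weights: (num_classes, C) per-class linear weights. A 1x1
    convolution is a per-pixel linear map, so the class maps reduce to an
    einsum; averaging each map spatially yields the class logits."""
    class_maps = np.einsum("kc,chw->khw", weights, features)  # per-class heatmaps
    logits = class_maps.mean(axis=(1, 2))                     # global average pooling
    return logits, class_maps

# toy tensors standing in for ConvNext features and learned weights
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 7, 7))
weights = rng.normal(size=(4, 8))      # 4 Hill grades
logits, maps = cam_head(features, weights)
pred = int(np.argmax(logits))          # grade with highest confidence
```

Because the heatmap is a by-product of the forward pass, no post-processing (as in GradCAM) is needed, which is what enables real-time display.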
3.3. Study Design and Power Analysis
The performance of the AI versus the physicians in assessing the Hill classification was compared through a single‐centre prospective superiority trial. The primary outcome of the study was the accuracy in assessing the Hill grades. Before the beginning of the trial, all participating physicians attended a seminar with a lecture on the application of the Hill classification. All adult patients appointed for a gastroscopy in the intervention room with the AI‐setup and who had no anatomical alterations of the GEJ were eligible for participation in the study. Examinations in the study were conducted by nine physicians, six of whom had over two years of endoscopy experience and three of whom had less than two years of experience. On average, participating endoscopists performed 347 endoscopic examinations per year. In more detail, the average was 421 examinations per year for examiners with more than two years of experience and 201 per year for physicians with less than two years of experience. Values were calculated based on the number of examinations over the last two years. The former were considered experienced and the latter inexperienced.
To calculate the sample size, the accuracy of the AI and physicians was internally assessed on images of the GEJ from an external expert‐annotated image dataset. Power analysis revealed that 127 patients (paired measurements) are required to have a 90% chance of detecting, as significant at the 5% level, an increase in the primary outcome measure from 72% in the control group to 88% in the experimental group. During each examination, the AI was running in the background in real time, and its predictions were stored in the system. The physicians performed the examination as usual and were instructed to assess the Hill grade for each patient without having access to the AI predictions.
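The reported 127 paired measurements presumably come from a paired-design calculation. As a rough illustration only (not the authors' exact method), the classical unpaired two-proportion formula with the same parameters (72% vs. 88%, two-sided alpha 0.05, 90% power) lands in the same range:

```python
from math import ceil, sqrt
from statistics import NormalDist

def two_proportion_n(p1: float, p2: float, alpha: float = 0.05,
                     power: float = 0.90) -> int:
    """Per-group sample size for detecting p1 vs. p2 with a two-sided
    test at level alpha (normal approximation, unpaired design)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

n = two_proportion_n(0.72, 0.88)  # physician vs. expected AI accuracy
```

The paired design used in the study is more efficient, which is consistent with its slightly smaller required sample size.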
To obtain the gold standard for the examination, high‐quality images from the inspection of the GEJ from the examination recording were collected when available. These images were presented to a three‐member committee of expert endoscopists, which were blinded to assessments from both AI and physicians. None of the committee members performed examinations analysed in the study. The result of the majority vote was regarded as the gold standard.
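The majority-vote rule for the three-member committee can be sketched as follows (an illustrative helper, not the study's software); with three raters and four grades, a three-way split would leave no majority, although this did not occur in the study.

```python
from collections import Counter

def majority_vote(grades):
    """Return the Hill grade chosen by an outright majority of the
    committee, or None if no grade exceeds half the votes."""
    grade, count = Counter(grades).most_common(1)[0]
    return grade if count > len(grades) / 2 else None

majority_vote([2, 2, 1])  # two of three experts agree
majority_vote([1, 2, 3])  # three-way split: no majority
```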
3.4. Study Outcomes and Statistics
Expert agreement when determining the gold standard was evaluated using the Fleiss kappa value [37]. The primary outcome was accuracy in evaluating the Hill classification, that is, the ratio of correct grade assessments. Confidence intervals were obtained using bootstrapping and the presence of statistically significant differences was investigated using the chi‐squared test. Secondary outcomes were the per Hill grade analysis for the accuracy, precision, sensitivity (recall), and specificity. Additionally, the distributions of over‐ and under‐estimations for the different Hill grades were investigated for both AI and physician assessments. Finally, a subgroup analysis compared the accuracy of experienced physicians, defined as physicians with at least two years' experience in performing gastroscopies, and inexperienced physicians, with less than two years of experience, against the accuracy of the AI.
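A percentile bootstrap for the accuracy confidence interval can be sketched as follows (an illustration of the general technique, not the study's analysis code; the toy data mimic 111 of 131 correct assessments, roughly the AI's 84.7%).

```python
import numpy as np

def bootstrap_accuracy_ci(correct, n_boot=10_000, ci=0.95, seed=1):
    """Percentile-bootstrap confidence interval for accuracy.
    `correct` is a 0/1 sequence marking correct per-patient assessments."""
    correct = np.asarray(correct)
    rng = np.random.default_rng(seed)
    # resample patients with replacement and recompute accuracy each time
    resamples = rng.choice(correct, size=(n_boot, correct.size), replace=True)
    accs = resamples.mean(axis=1)
    lo, hi = np.quantile(accs, [(1 - ci) / 2, (1 + ci) / 2])
    return correct.mean(), lo, hi

acc, lo, hi = bootstrap_accuracy_ci([1] * 111 + [0] * 20)
```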
4. Results
From 10 October 2023 until 22 December 2023, a total of 195 patients were recruited and assessed for eligibility. Of these, 29 were excluded according to protocol, resulting in 166 patients participating in the study. Of these, 24 had to be excluded due to technical issues while recording the examination data, leaving 142 patients for analysis. A further 11 cases were excluded because the expert committee decided that the image quality during GEJ inspection was not sufficient to determine a gold standard. Finally, a total of 131 patients were analysed. The CONSORT flowchart is presented in Figure 2. The analysed cases included 66 (50.4%) male and 65 (49.6%) female patients. The median patient age was 62 years (Q1–Q3: 47–71).
FIGURE 2.

Participant recruitment flowchart for the study.
The expert committee reached a majority vote for all examinations included in the study, with 82% having unanimous agreement. The Fleiss kappa coefficient for expert agreement was 0.8, indicating substantial to almost perfect agreement. In the analysed data, the distribution of gold standard labels was 67 (51.1%), 49 (37.4%), 10 (7.6%), and 5 (3.8%) for Hill grades 1, 2, 3, and 4, respectively. The distribution of gold standard labels as well as unanimous expert agreement is depicted in Supporting Information S1: Figure S1.
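Fleiss' kappa [37] measures chance-corrected agreement among a fixed number of raters. A minimal implementation (illustrative, with toy committee votes rather than the study data):

```python
import numpy as np

def fleiss_kappa(table: np.ndarray) -> float:
    """Fleiss' kappa. table[i, j] = number of raters assigning subject i
    to category j; every row sums to the (constant) number of raters."""
    n = table[0].sum()                        # raters per subject
    p_j = table.sum(axis=0) / table.sum()     # overall category proportions
    # observed pairwise agreement per subject
    P_i = ((table ** 2).sum(axis=1) - n) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), (p_j ** 2).sum()
    return (P_bar - P_e) / (1 - P_e)

# toy votes: 3 raters, 4 Hill grades, mostly unanimous
votes = np.array([[3, 0, 0, 0], [0, 3, 0, 0], [2, 1, 0, 0], [0, 0, 3, 0]])
k = fleiss_kappa(votes)
```

With one subject in disagreement, the toy example yields a kappa of about 0.74; fully unanimous tables yield exactly 1.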
The developed AI achieved accuracy of 84.7% (95% CI: 78.6%–90.8%), which was significantly higher than the accuracy of 62.5% (95% CI: 54.2%–71.0%) achieved by physicians (p < 0.01). The subgroup of inexperienced physicians achieved a 56% (95% CI: 46.7%–68%) accuracy, which was significantly less than the 85% (95% CI: 77.3%–93.3%) accuracy achieved by the AI for the same examinations (p < 0.01). Experienced physicians achieved an accuracy of 69.6% (95% CI: 57.1%–80.4%), which was lower but not significantly from the 84% (95% CI: 73.2%–92.8%) achieved by the AI for the same examinations (p = 0.07). The accuracy and bootstrapped confidence intervals regarding the accuracy for the study and subgroups of experienced and inexperienced endoscopists are depicted in Figure 3.
FIGURE 3.

Accuracy comparison between AI and physician assessments of Hill grade, presented with bootstrapped 95% confidence intervals for the entire dataset (left) as well as for those examinations performed by inexperienced (middle) and experienced (right) endoscopists. Bar heights indicate accuracies, and error bars represent bootstrapped 95% confidence intervals. The p‐values from chi‐squared tests are shown above each pair of accuracy bars.
The per‐class analysis for each Hill grade revealed that the AI achieved higher accuracy, precision, and specificity for all Hill grades. AI sensitivity was also higher for all grades, except grade 4. The complete per‐class analysis results are presented in Table 1.
TABLE 1.
Comparison of the AI and physician accuracy, precision, sensitivity (recall) and specificity for each different Hill grade.
| | Accuracy | Precision | Sensitivity | Specificity |
|---|---|---|---|---|
| Grade 1 | | | | |
| Physician | 84% | 82.9% | 86.7% | 81.3% |
| AI | **91.6%** | **87.8%** | **97%** | **85.9%** |
| Grade 2 | | | | |
| Physician | 67.1% | 61.5% | 32.7% | 87.8% |
| AI | **87.8%** | **92.3%** | **73.5%** | **96.3%** |
| Grade 3 | | | | |
| Physician | 83.2% | 20% | 40% | 86.8% |
| AI | **93.1%** | **53.8%** | **70%** | **95%** |
| Grade 4 | | | | |
| Physician | 90.8% | 26.7% | **80%** | 91.2% |
| AI | **96.9%** | **60%** | 60% | **98.4%** |

Note: Bold text indicates the higher value in each pair.
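The per-class values in Table 1 follow from one-vs-rest counts on the confusion matrix. A generic helper (illustrative, shown on a toy two-class matrix, not the study data):

```python
import numpy as np

def per_class_metrics(cm: np.ndarray, k: int) -> dict:
    """One-vs-rest accuracy, precision, sensitivity and specificity for
    class k. cm[i, j] = cases with gold standard i predicted as j."""
    tp = cm[k, k]
    fn = cm[k].sum() - tp        # gold standard k, predicted otherwise
    fp = cm[:, k].sum() - tp     # predicted k, gold standard otherwise
    tn = cm.sum() - tp - fn - fp
    return {
        "accuracy": (tp + tn) / cm.sum(),
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

cm = np.array([[8, 2],
               [1, 9]])          # toy 2-class confusion matrix
m = per_class_metrics(cm, 0)
```

Note that with rare classes (as for Hill grades 3 and 4 here), one-vs-rest accuracy and specificity are dominated by true negatives, which is why they stay high even when precision is modest.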
Analysis of the distribution of under‐ and overestimates of Hill grades showed that in all cases the AI assessments were at most one Hill grade away from the gold standard. A similar tendency was observed for physicians, except for Hill grade 2, which in 6 cases was over‐estimated as grade 4. All Hill grade estimations for both AI and physician assessments are depicted in Figure 4.
FIGURE 4.

Distribution of correct assessments as well as under‐ and over‐estimations of the AI and physicians for each Hill grade.
Post‐hoc analysis of the generated explainability heatmaps revealed that the AI always focused on correct areas of the input image when predicting, for both correct and erroneous assessments. Examples of correct and mistaken evaluations, together with the prediction for each class and the heatmap for the prediction, are depicted in Figure 5. Furthermore, Video S1 displays the real‐time AI output of our system, including predictions and heatmaps, during inspection of the GEJ. Due to the blinded design of the study, those predictions were not displayed during the examination.
FIGURE 5.

Examples of AI predictions for each Hill grade and explainability heatmap for the prediction. The first row of images contains correct assessments, whereas the second row contains erroneous assessments. In the lower left corner, the prediction confidence for each class is displayed as a green bar.
5. Discussion
In this study, an AI model that assesses the GEJ according to the Hill classification was evaluated in a prospective superiority design, comparing its accuracy to that of endoscopists. The model achieved an accuracy of 84.7%, which was significantly higher than the 62.5% achieved by endoscopists (p < 0.01). A per‐class analysis revealed that the AI performed better than physicians for all Hill grades in terms of accuracy, precision, sensitivity, and specificity, except for sensitivity for Hill grade 4. Moreover, when the AI was mistaken, the prediction was at most one Hill grade away from the gold standard. In contrast, in six cases, the physicians were two Hill grades away, with Hill grade 2 being assessed as grade 4. Finally, subgroup analysis based on the experience level of the physicians revealed that although the AI performed better in both cases, statistical significance existed when comparing with inexperienced physicians (85% vs. 56% accuracy, p < 0.01), but not with experienced ones (84% vs. 69.6% accuracy, p = 0.07).
Enabling standardized assessment of the GEJ and hiatal hernias in clinical practice comes with multiple benefits. First, it contributes to the completeness and quality improvement of the produced report [38]. Secondly, Hill classification has been already demonstrated to be a strong predictor for conditions such as GERD and erosive oesophagitis [10, 11, 13, 14]. Additionally, patient treatment plan can benefit from such a standardized assessment, as reduced efficiency of the flap at the GEJ is known to be related with poor response to PPI treatment [39].
In our previous work, a model for predicting the Hill classification was trained and evaluated on images from an external dataset [32]. However, this model was not evaluated on prospective data, and its explainability was investigated using the GradCAM method [33], which does not provide any real‐time insights. We used the model [32] in a post hoc analysis with the prospectively collected data in the current study. This resulted in an accuracy of 78.6%, which is lower than the 84.7% achieved by the model in this work. Furthermore, our new attention‐based AI model was able to generate explainability heatmaps in real time. These were used to evaluate correct and erroneous model assessments, revealing that in all cases, the model focused on the correct image areas when making its assessment. Erroneous model classifications can be attributed to inaccurate assessment of the widening of the ridge or the depth of the hernia around the endoscope. It is interesting to note that when assessing Hill grades 1 to 3, the heatmaps focus on the ridge, whereas for Hill grade 4, the model focuses on the hernia depth, which is consistent with the definitions of the Hill grades. Notably, during the annotation process for AI training, images were classified without any annotation of specific anatomical landmarks. To the best of our knowledge, there is no other AI study assessing the Hill classification with which we can compare our work.
This study is also subject to limitations. The study was conducted at a single centre, which can limit the variability of the included patients. To minimize selection bias, patients were recruited consecutively. Because the study was carried out on unselected gastroscopies, the higher Hill grades, 3 and 4, were less represented in the population. However, the per‐Hill grade analysis together with post hoc evaluation of the explainability heatmaps verified the high model performance in these cases. Furthermore, the study was not powered to assess statistical differences between AI and physicians in the per‐class analysis. Further analysis, especially for the higher Hill grades, should be performed in a future multi‐centre study with a more diverse population. Finally, in 11 (8%) of the cases, 9 of which were performed by inexperienced endoscopists, there was no image allowing proper assessment of the Hill grade. Yet, the AI is not able to distinguish between high‐ and low‐quality images of the GEJ. This could potentially be tackled by initial filtering of input images based on their quality.
We believe that the superior model performance combined with its ability to generate explainability heatmaps in real time can prove beneficial for physicians, especially those with limited experience. We believe that a randomized controlled trial assessing the performance of physicians with and without access to model predictions during the examination can properly assess the impact of introducing such a model in clinical practice.
In this work, the performance of a model for predicting the Hill classification of the gastroesophageal junction during gastroscopy was prospectively evaluated. The model performed significantly better in terms of accuracy compared with examiners. Per‐class analysis revealed that our AI performed better than physicians for almost all grades and metrics. Furthermore, subgroup analysis based on endoscopist experience revealed that the developed model performed on par with experienced physicians and was significantly better than physicians with limited experience.
Ethics Statement
This study received approval from the ethical committee of the University Hospital of Würzburg (12/20‐am).
Consent
In alignment with the Helsinki Declaration of 1964 and later versions, signed informed consent was obtained from each patient prior to participation.
Conflicts of Interest
The authors declare no conflicts of interest.
Supporting information
Supporting Information S1
Supporting Information S2
Video S1
Acknowledgements
Alexander Hann and Wolfram G. Zoller received public funding for this work from the Eva Mayr‐Stihl Foundation, Waiblingen, Germany. Florian Seyfried is supported by the German research foundation (DFG, SE 2027/5‐1). Open Access funding enabled and organized by Projekt DEAL.
Data Availability Statement
The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.
References
- 1. Stylopoulos N. and Rattner D. W., “The History of Hiatal Hernia Surgery: From Bowditch to Laparoscopy,” Annals of Surgery 241, no. 1 (2005): 185–193, 10.1097/01.sla.0000149430.83220.7f. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2. Fuchs K. H., DeMeester T. R., Otte F., et al., “Severity of GERD and Disease Progression,” Diseases of the Esophagus 34, no. 10 (2021): doab006, 10.1093/dote/doab006. [DOI] [PubMed] [Google Scholar]
- 3. Wallner B., Björ O., Andreasson A., et al., “Identifying Clinically Relevant Sliding Hiatal Hernias: A Population‐Based Endoscopy Study,” Scandinavian Journal of Gastroenterology 53, no. 6 (2018): 657–660, 10.1080/00365521.2018.1458896. [DOI] [PubMed] [Google Scholar]
- 4. Guda N. M., Partington S., and Vakil N., “Inter‐ and Intra‐Observer Variability in the Measurement of Length at Endoscopy: Implications for the Measurement of Barrett’s Esophagus,” Gastrointestinal Endoscopy 59, no. 6 (2004): 655–658, 10.1016/s0016-5107(04)00182-8. [DOI] [PubMed] [Google Scholar]
- 5. Rey J.‐F., Lambert R., and Committee null and the EQA , “ESGE Recommendations for Quality Control in Gastrointestinal Endoscopy: Guidelines for Image Documentation in Upper and Lower GI Endoscopy,” Endoscopy 33, no. 10 (2001): 901–903, 10.1055/s-2001-42537. [DOI] [PubMed] [Google Scholar]
- 6. Bisschops R., Areia M., Coron E., et al., “Performance Measures for Upper Gastrointestinal Endoscopy: A European Society of Gastrointestinal Endoscopy (ESGE) Quality Improvement Initiative,” Endoscopy 48, no. 9 (2016): 843–864, 10.1055/s-0042-113128. [DOI] [PubMed] [Google Scholar]
- 7. Beg S., Ragunath K., Wyman A., et al., “Quality Standards in Upper Gastrointestinal Endoscopy: A Position Statement of the British Society of Gastroenterology (BSG) and Association of Upper Gastrointestinal Surgeons of Great Britain and Ireland (AUGIS),” Gut 66, no. 11 (2017): 1886–1899, 10.1136/gutjnl-2017-314109. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Hill L. D., Kozarek R. A., Kraemer S. J. M., et al., “The Gastroesophageal Flap Valve: In Vitro and In Vivo Observations,” Gastrointestinal Endoscopy 44, no. 5 (1996): 541–547, 10.1016/S0016-5107(96)70006-8. [DOI] [PubMed] [Google Scholar]
- 9. Fuchs K. H., Kafetzis I., Hann A., and Meining A., “Hiatal Hernias Revisited—A Systematic Review of Definitions, Classifications, and Applications,” Life 14, no. 9 (2024): 1145, 10.3390/life14091145. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Brännström L., Werner M., Wallner B., Franklin K. A., and Karling P., “What Is the Significance of the Hill Classification?,” Diseases of the Esophagus 36, no. 9 (2023): doad004, 10.1093/dote/doad004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Chue K. M., Goh D. W. X., Chua C. M. E., et al., “The Hill’s Classification Is Useful to Predict the Development of Postoperative Gastroesophageal Reflux Disease and Erosive Esophagitis After Laparoscopic Sleeve Gastrectomy,” Journal of Gastrointestinal Surgery 26, no. 6 (2022): 1162–1170, 10.1007/s11605-022-05324-x. [DOI] [PubMed] [Google Scholar]
- 12. Nguyen N. T., Thosani N. C., Canto M. I., et al., “The American Foregut Society White Paper on the Endoscopic Classification of Esophagogastric Junction Integrity,” Foregut 2, no. 4 (2022): 339–348, 10.1177/26345161221126961. [DOI] [Google Scholar]
- 13. Osman A., Albashir M. M., Nandipati K., Walters R. W., and Chandra S., “Esophagogastric Junction Morphology on Hill’s Classification Predicts Gastroesophageal Reflux With Good Accuracy and Consistency,” Digestive Diseases and Sciences 66, no. 1 (2021): 151–159, 10.1007/s10620-020-06146-0. [DOI] [PubMed] [Google Scholar]
- 14. Hansdotter I., Björ O., Andreasson A., et al., “Hill Classification Is Superior to the Axial Length of a Hiatal Hernia for Assessment of the Mechanical Anti‐Reflux Barrier at the Gastroesophageal Junction,” Endoscopy International Open 04, no. 03 (2016): E311–E317, 10.1055/s-0042-101021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Lux T. J., Banck M., Saßmannshausen Z., et al., “Pilot Study of a New Freely Available Computer‐Aided Polyp Detection System in Clinical Practice,” International Journal of Colorectal Disease 37, no. 6 (2022): 1349–1354, 10.1007/s00384-022-04178-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Brand M., Troya J., Krenzer A., et al., “Development and Evaluation of a Deep Learning Model to Improve the Usability of Polyp Detection Systems During Interventions,” United European Gastroenterology Journal 10, no. 5 (2022): 477–484, 10.1002/ueg2.12235. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Weigt J., Repici A., Antonelli G., et al., “Performance of a New Integrated Computer‐Assisted System (CADe/CADx) for Detection and Characterization of Colorectal Neoplasia,” Endoscopy 54, no. 2 (2022): 180–184, 10.1055/a-1372-0419. [DOI] [PubMed] [Google Scholar]
- 18. Kader R., Cid‐Mejias A., Brandao P., et al., “Polyp Characterization Using Deep Learning and a Publicly Accessible Polyp Video Database,” Digestive Endoscopy 35, no. 5 (2023): 645–655, 10.1111/den.14500. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Sudarevic B., Sodmann P., Kafetzis I., et al., “Artificial Intelligence‐Based Polyp Size Measurement in Gastrointestinal Endoscopy Using the Auxiliary Waterjet as a Reference,” Endoscopy 55, no. 09 (2023): 871–876, 10.1055/a-2077-7398. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Benson A., Jacob H., Katz L. H., et al., “Artificial Intelligence (AI) Based Real‐Time Automatic Polyp Size Estimation During Colonoscopy,” Gastrointestinal Endoscopy 95, no. 6 (2022): AB251, 10.1016/j.gie.2022.04.652. [DOI] [Google Scholar]
- 21. Lux T. J., Saßmannshausen Z., Kafetzis I., et al., “Assisted Documentation as a New Focus for Artificial Intelligence in Endoscopy: The Precedent of Reliable Withdrawal Time and Image Reporting,” Endoscopy 55, no. 12 (2023): 1118–1123, 10.1055/a-2122-1671. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Zhang L., Lu Z., Yao L., et al., “Effect of a Deep Learning–Based Automatic Upper GI Endoscopic Reporting System: A Randomized Crossover Study (With Video),” Gastrointestinal Endoscopy 98, no. 2 (2023): 181–190.e10, 10.1016/j.gie.2023.02.025. [DOI] [PubMed] [Google Scholar]
- 23. Hmoud Al‐Adhaileh M., Mohammed Senan E., Alsaade F. W., et al., “Deep Learning Algorithms for Detection and Classification of Gastrointestinal Diseases,” Complexity 2021, no. 1 (2021): 6170416, 10.1155/2021/6170416. [DOI] [Google Scholar]
- 24. Yang H., Wu Y., Yang B., et al., “Identification of Upper GI Diseases During Screening Gastroscopy Using a Deep Convolutional Neural Network Algorithm,” Gastrointestinal Endoscopy 96, no. 5 (2022): 787–795.e6, 10.1016/j.gie.2022.06.011. [DOI] [PubMed] [Google Scholar]
- 25. Qiu W., Xie J., Shen Y., Xu J., and Liang J., “Endoscopic Image Recognition Method of Gastric Cancer Based on Deep Learning Model,” Expert Systems 39, no. 3 (2022): e12758, 10.1111/exsy.12758. [DOI] [Google Scholar]
- 26. Sharma P. and Hassan C., “Artificial Intelligence and Deep Learning for Upper Gastrointestinal Neoplasia,” Gastroenterology 162, no. 4 (2022): 1056–1066, 10.1053/j.gastro.2021.11.040. [DOI] [PubMed] [Google Scholar]
- 27. Chong Y., Xie N., Liu X., et al., “A Deep Learning Network Based on Multi‐Scale and Attention for the Diagnosis of Chronic Atrophic Gastritis,” Zeitschrift für Gastroenterologie 60, no. 12 (2022): 1770–1778, 10.1055/a-1828-1441. [DOI] [PubMed] [Google Scholar]
- 28. Zhang Y., Li F., Yuan F., et al., “Diagnosing Chronic Atrophic Gastritis by Gastroscopy Using Artificial Intelligence,” Digestive and Liver Disease 52, no. 5 (2020): 566–572, 10.1016/j.dld.2019.12.146. [DOI] [PubMed] [Google Scholar]
- 29. Zhao Q., Jia Q., and Chi T., “Deep Learning as a Novel Method for Endoscopic Diagnosis of Chronic Atrophic Gastritis: A Prospective Nested Case–Control Study,” BMC Gastroenterology 22, no. 1 (2022): 352, 10.1186/s12876-022-02427-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30. de Groof A. J., Struyvenberg M. R., van der Putten J., et al., “Deep‐Learning System Detects Neoplasia in Patients With Barrett’s Esophagus With Higher Accuracy Than Endoscopists in a Multistep Training and Validation Study With Benchmarking,” Gastroenterology 158, no. 4 (2020): 915–929.e4, 10.1053/j.gastro.2019.11.030. [DOI] [PubMed] [Google Scholar]
- 31. Abdelrahim M., Saiko M., Maeda N., et al., “Development and Validation of Artificial Neural Networks Model for Detection of Barrett’s Neoplasia: A Multicenter Pragmatic Nonrandomized Trial (With Video),” Gastrointestinal Endoscopy 97, no. 3 (2023): 422–434, 10.1016/j.gie.2022.10.031. [DOI] [PubMed] [Google Scholar]
- 32. Kafetzis I., Fuchs K.‐H., Sodmann P., et al., “Efficient Artificial Intelligence‐Based Assessment of the Gastroesophageal Valve With Hill Classification Through Active Learning,” Scientific Reports 14, no. 1 (2024): 18825, 10.1038/s41598-024-68866-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33. Selvaraju R. R., Cogswell M., Das A., Vedantam R., Parikh D., and Batra D., “Grad‐CAM: Visual Explanations From Deep Networks via Gradient‐Based Localization,” arXiv:1610.02391 (2016), 10.48550/arXiv.1610.02391. [DOI]
- 34. Simonyan K., Vedaldi A., and Zisserman A., “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps,” arXiv:1312.6034 (2014), 10.48550/arXiv.1312.6034. [DOI]
- 35. Achtibat R., Dreyer M., Eisenbraun I., et al., “From Attribution Maps to Human‐Understandable Explanations through Concept Relevance Propagation,” Nature Machine Intelligence 5, no. 9 (2023): 1006–1019, 10.1038/s42256-023-00711-8. [DOI] [Google Scholar]
- 36. Liu Z., Mao H., Wu C.‐Y., Feichtenhofer C., Darrell T., and Xie S., “A ConvNet for the 2020s,” arXiv:2201.03545 (2022), 10.48550/arXiv.2201.03545. [DOI]
- 37. Fleiss J. L., “Measuring Nominal Scale Agreement Among Many Raters,” Psychological Bulletin 76, no. 5 (1971): 378–382, 10.1037/h0031619. [DOI] [Google Scholar]
- 38. Januszewicz W. and Kaminski M. F., “Quality Indicators in Diagnostic Upper Gastrointestinal Endoscopy,” Therapeutic Advances in Gastroenterology 13 (2020): 1756284820916693, 10.1177/1756284820916693. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39. Cheong J. H., Kim G. H., Lee B. E., et al., “Endoscopic Grading of Gastroesophageal Flap Valve Helps Predict Proton Pump Inhibitor Response in Patients With Gastroesophageal Reflux Disease,” Scandinavian Journal of Gastroenterology 46, no. 7–8 (2011): 789–796, 10.3109/00365521.2011.579154. [DOI] [PubMed] [Google Scholar]