Author manuscript; available in PMC: 2021 Mar 12.
Published in final edited form as: JAMA Ophthalmol. 2019 Dec 1;137(12):1361–1362. doi: 10.1001/jamaophthalmol.2019.3512

Finding glaucoma in color fundus photographs using deep learning

Karine D Bojikian 1, Cecilia S Lee 1, Aaron Y Lee 1
PMCID: PMC7335661  NIHMSID: NIHMS1604142  PMID: 31513255

Advances in artificial intelligence (AI) and applications of deep learning to ophthalmic image analysis have generated remarkable successes and enthusiasm.1–3 In this issue of JAMA Ophthalmology, Liu and colleagues report a deep learning system (DLS) for detecting glaucomatous optic neuropathy (GON) and assess its generalizability across various datasets of color fundus photographs.4 In addition to the local validation set, the authors assessed the model’s performance on external validation datasets that vary in geographic location, population ethnicity, and camera system. The area under the receiver operating characteristic curve (AUC) for the local validation set was 0.996, while the AUCs for the out-of-sample, external datasets were lower, ranging from 0.823 to 0.995, with a similar pattern in sensitivities (82.2%–96.1%) and specificities (70.4%–97.1%). Interestingly, the authors also developed an online deep learning system with a human-computer interaction loop: the deep learning model predicted positive samples for glaucoma, which were confirmed by ophthalmologists and then fed back into the algorithm to improve its performance.

This first step in developing a deep learning system for GON screening at scale shows promising results. One important consideration in deep learning research is establishing the accuracy of the system’s training dataset and understanding the quality of the ground truth. In the study by Liu et al, labeling was performed with a multitiered assessment with arbitration, an excellent approach for curating labels for a large number of examples.4 However, a manual quality curation step was applied at the outset to this dataset and to many of the external validation datasets. The authors did not report the performance of the frozen algorithm when deployed on consecutive patients, for whom this manual quality curation step was not applied. In addition, the authors did not examine the veracity of the negative predictions in the online system, leaving open questions about the true real-world performance and scalability of the model. Finally, while the online adaptive deep learning system is compelling, there is a danger that the model will begin to overfit as it is fine-tuned only on corrections of positive predictions. Future deployments of such online deep learning systems need to be monitored carefully with a fixed, held-out test set and visualizations for explainability.
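The monitoring safeguard suggested above can be sketched as follows (a minimal illustration under our own assumptions; the function names, metrics, and tolerance are hypothetical and do not describe the authors' implementation): after each round of fine-tuning on ophthalmologist-confirmed positives, the updated model's predictions on a frozen, held-out test set are compared against the baseline's, and the update is rejected if sensitivity or specificity degrades.

```python
def sensitivity_specificity(y_true, y_pred):
    """Compute sensitivity and specificity from binary labels (1 = referable GON)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def accept_update(y_true, baseline_pred, updated_pred, tolerance=0.01):
    """Accept a fine-tuned model only if neither sensitivity nor specificity
    degrades by more than `tolerance` on the frozen held-out test set."""
    base_sens, base_spec = sensitivity_specificity(y_true, baseline_pred)
    new_sens, new_spec = sensitivity_specificity(y_true, updated_pred)
    return (new_sens >= base_sens - tolerance) and (new_spec >= base_spec - tolerance)
```

For this safeguard to be meaningful, the held-out set must never enter the fine-tuning loop; otherwise the metrics it reports no longer reflect out-of-sample performance.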

The study by Liu et al included over 241,000 fundus images for training and 114,300 for validation.4 All images were classified by human experts as unlikely, probable, or definite GON, and the DLS was trained for binary classification of normal vs “referable GON,” defined as probable or definite GON. Given the excellent AUC, sensitivities, and specificities, the authors suggested that the DLS could be applied in current GON screening programs. However, referable GON was imbalanced (ie, much rarer than normal) in both the training dataset and all external validation datasets. The use of AUC is prevalent in reporting ophthalmic binary classification results, but the area under the precision-recall (PR) curve can be a more informative measure of performance on imbalanced data.5 Reporting a balanced test-set accuracy or the area under the PR curve would greatly strengthen the understanding of the performance of automated diagnostic algorithms in the future.
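The effect of class imbalance on screening metrics can be made concrete with Bayes’ rule (an illustrative operating point, not the study’s figures): at a fixed sensitivity and specificity, the positive predictive value (precision) collapses as disease prevalence falls, which is exactly what the PR curve captures and the ROC curve hides.

```python
def precision(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value implied by Bayes' rule:
    P(disease | positive test) = sens*prev / (sens*prev + (1-spec)*(1-prev))."""
    tp = sensitivity * prevalence                    # expected true positives
    fp = (1.0 - specificity) * (1.0 - prevalence)    # expected false positives
    return tp / (tp + fp)

# Illustrative operating point: sensitivity 90%, specificity 90%.
for prev in (0.50, 0.10, 0.01):
    print(f"prevalence {prev:.2f} -> precision {precision(0.90, 0.90, prev):.3f}")
# prints precision 0.900, 0.500, and 0.083 for prevalences 0.50, 0.10, and 0.01
```

At a screening-realistic prevalence of 1%, more than 9 of every 10 positive calls from this hypothetical test would be false positives, despite the seemingly strong sensitivity and specificity.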

A better understanding of the strengths and limitations of DLSs is key to their future practical application by ophthalmologists. Another important contribution from Liu et al was the inclusion of visualizations and analyses of the reasons for false positives and false negatives. This step, often omitted in the machine learning literature, addresses one of the biggest criticisms of deep learning models in general. The black-box nature of deep learning models stems from the high-dimensional combinatorial space in which these algorithms operate, which gives the models immense flexibility but makes them difficult to explain. Studies like that of Liu et al that include visualizations and attempt to understand sources of failure will expedite the acceptance and approval of AI applications in medicine.

In many countries, regulatory approval is required before algorithms can be used in clinical care. Nevertheless, the approval process for AI models may evolve beyond what is done today. The first US Food and Drug Administration (FDA)-approved AI algorithm for a screening decision was IDx-DR, designed to output either “referable diabetic retinopathy (DR)” (ie, more than mild DR) or no referable DR based on retinal photographs. The FDA approval was based on a sensitivity of 87.2% and a specificity of 90.7% in retinal images from 900 patients with diabetes,6 similar to the performance of the DLS in this study. The approval process for IDx-DR followed the traditional medical device approval paradigm. However, a key advantage of AI algorithms is the potential for continuous improvement in performance as more data are included for tuning.

The FDA recently released a white paper acknowledging the unique challenges of regulating AI systems and proposed a new regulatory framework in which adaptive algorithms may not need to be reapproved.7 Regulatory changes placing further emphasis on the responsibility of researchers themselves to ensure the safety, efficacy, and improvement of AI systems are anticipated. Studies such as that of Liu et al, which incorporated a continuous performance feedback loop, offer an opportunity to establish minimal requirements for testing, validating, and monitoring.4 In addition, agreement on the highest standards for reporting AI performance across many external validation datasets and a willingness to share datasets among researchers will be essential to moving this field forward. Automated AI diagnostic systems have the potential to identify patients with early GON, prevent irreversible blindness in asymptomatic patients, provide expert-level diagnoses in resource-limited settings, and scale to population-wide screening programs.

References

1. Ting DSW, Cheung CY-L, Lim G, et al. Development and Validation of a Deep Learning System for Diabetic Retinopathy and Related Eye Diseases Using Retinal Images From Multiethnic Populations With Diabetes. JAMA. 2017;318(22):2211–2223.
2. Kihara Y, Heeren TFC, Lee CS, et al. Estimating Retinal Sensitivity Using Optical Coherence Tomography With Deep-Learning Algorithms in Macular Telangiectasia Type 2. JAMA Netw Open. 2019;2(2):e188029.
3. De Fauw J, Ledsam JR, Romera-Paredes B, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat Med. 2018;24(9):1342–1350.
4. Liu H, Li L, Qiao C, et al. Establishing a Generalized Deep Learning System for Detection of Glaucomatous Optic Neuropathy Using Fundus Photographs. JAMA Ophthalmol. In press.
5. Davis J, Goadrich M. The Relationship Between Precision-Recall and ROC Curves. In: Proceedings of the 23rd International Conference on Machine Learning (ICML ’06). New York, NY: ACM; 2006:233–240.
6. Abràmoff MD, Lavin PT, Birch M, Shah N, Folk JC. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. npj Digital Medicine. 2018;1(1). doi:10.1038/s41746-018-0040-6
7. US Food and Drug Administration. Artificial Intelligence and Machine Learning in Software. https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-and-machine-learning-software-medical-device. Published February 4, 2019. Accessed June 13, 2019.
