Abstract
PURPOSE:
Artificial intelligence (AI) offers considerable promise for retinopathy of prematurity (ROP) screening and diagnosis. The development of deep-learning algorithms to detect the presence of disease may contribute to adequate screening, early detection, and timely treatment of this preventable blinding disease. This review aimed to systematically examine the literature on AI algorithms for detecting ROP. Specifically, we focused on the performance of deep-learning algorithms, as measured by sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC), for both the detection and grading of ROP.
METHODS:
We searched Medline OVID, PubMed, Web of Science, and Embase for studies published from January 1, 2012, to September 20, 2021. Studies evaluating the diagnostic performance of deep-learning models based on retinal fundus images, with expert ophthalmologists' judgment as the reference standard, were included. Studies that did not investigate the presence or absence of disease were excluded. Risk of bias was assessed using the QUADAS-2 tool.
RESULTS:
Twelve of the 175 studies identified were included. Five studies measured performance in detecting the presence of ROP and seven studies in detecting the presence of plus disease. The average AUROC across the 11 studies reporting it was 0.98. The average sensitivity and specificity were 95.72% and 98.15%, respectively, for detecting ROP, and 91.13% and 95.92%, respectively, for detecting plus disease.
CONCLUSION:
The diagnostic performance of deep-learning algorithms in published studies was high. Few studies presented externally validated results or compared performance to expert human graders. Large scale prospective validation alongside robust study design could improve future studies.
Keywords: Artificial intelligence, deep learning, diagnosis, retinopathy of prematurity, screening
INTRODUCTION
The concept of artificial intelligence (AI) dates back to the 1950s, when Alan Turing first discussed how to build and test intelligent machines in the paper “Computing Machinery and Intelligence.”[1] It was not until 1956, however, at the seminal Dartmouth Summer Research Project on AI conference, that John McCarthy officially coined the term AI. This conference introduced a computer program designed to mimic the problem-solving skills of a human, catalyzing the next 20 years of AI research.[2] Today, AI is incorporated into many applications of day-to-day life, including speech recognition, photo captioning, language translation, robotics, and even self-driving cars.[3,4,5] These applications are made possible through deep learning, an advanced form of AI that learns from large training sets to program itself to perform certain tasks.[6] The application of AI has gained popularity in the medical diagnostic field, and promising outcomes have resulted from deep-learning screening algorithms in ophthalmology.
There has been particular success in AI screening for diabetic retinopathy, with several groups reporting deep-learning algorithms that detect diabetic retinopathy at sensitivities of 83%–90% and specificities of 92%–98%.[7,8] Moreover, the successful validation of these algorithms has led to “real-world” implementation of screening programs through prospective evaluation. One such study reported a sensitivity of 83.3% and specificity of 92.5% for detecting referable diabetic retinopathy in a prospective evaluation.[8] Similarly promising results are being reported by many other groups using deep learning for the diagnosis of other ophthalmic conditions, including diabetic macular edema,[9] age-related macular degeneration,[10] glaucoma,[11] and retinopathy of prematurity (ROP).[12,13]
ROP is a retinal vascular proliferative disease affecting premature infants whose diagnosis depends on timely screening. Globally, it is estimated that at least 50,000 children are blind from ROP,[14] and it remains the leading cause of preventable childhood blindness.[15] Advances in retinal imaging mean the disease is now readily identifiable on retinal photographs, making it an ideal candidate for deep learning. As survival rates of premature infants continue to increase with medical advances,[16] the demand for ROP screening is rapidly exceeding the capacity of available specialist ophthalmologists. For this reason, reports of deep-learning models matching or exceeding human experts in ROP diagnostic performance have generated considerable interest. It remains fundamental, however, that this enthusiasm does not override the need for critical appraisal, as a missed diagnosis of ROP can result in significant sequelae such as blindness. Therefore, any deep-learning screening algorithm will need to show high diagnostic performance and high sensitivity, be generalizable, and be applicable to the real-world setting. In anticipation of deep-learning diagnostic tools being implemented into clinical practice, it is judicious to systematically review the body of evidence supporting AI screening for ROP. This systematic review aims to critically appraise the current state of diagnostic performance of deep-learning algorithms for ROP screening, with particular consideration of study design, algorithm development, type of validation, performance compared to clinicians, and diagnostic accuracy.
METHODS
Search strategy and selection criteria
Studies that developed or validated a deep-learning model for the diagnosis of ROP and compared the accuracy of algorithm diagnoses to ROP experts were included in this systematic review. We searched MEDLINE-Ovid, PubMed, Web of Science, and Embase for studies published from January 1, 2012, to September 20, 2021. The full search strategy for each database is available in Appendix 1. The cutoff of January 1, 2012, was prespecified to coincide with the breakthrough in deep-learning approaches marked by the AlexNet model.[17] The search was first performed on July 10, 2020, revised on May 23, 2021, and updated on September 20, 2021. Manual searches of bibliographies and citations from included studies were also completed to identify any additional articles potentially missed by the searches.
Eligibility assessment was conducted by two reviewers who independently screened the titles and abstracts of search results. Only studies aiming to identify the presence of the disease of interest, ROP, using AI algorithms were included. We accepted standard-of-care diagnosis, expert opinion, or consensus as adequate reference standards to classify the absence or presence of disease. We excluded studies that did not test diagnostic performance or that investigated the accuracy of image segmentation rather than disease classification. Studies that assessed the ability to classify disease severity were accepted if they incorporated primary results of disease detection. Review articles, conference abstracts, and studies that presented duplicate data were excluded. We assessed the risk of bias in patient selection, index test, reference standard, and flow and timing of each study using QUADAS-2.[18] The full assessment of bias can be found in Appendix 2.
This systematic review was completed following the recommendations of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA)[19] statement, and the research question was formulated according to the CHARMS[20] checklist for systematic reviews of prediction models. Methods of analysis and inclusion criteria were specified in advance.
Data analysis
Data were extracted independently by two reviewers (AB and SD) using a predefined data extraction sheet, followed by cross-checking. Any discrepancies were discussed with a third reviewer (CC). Demographics and sample size (gestational age [GA], birth weight, number of participants, and number of images), data characteristics (data source, inclusion and exclusion criteria, and image augmentation), algorithm development (architecture, transfer learning, and number of images for training and tuning), algorithm validation (reference standard, number of experts, same method for assessing reference standard, and internal and external validation), and results (sensitivity, specificity, and area under the receiver operating characteristic curve [AUROC] for the algorithm, human graders, and external validation if applicable) were sought. Two papers produced different algorithms from different data sets or with different identification tasks and were therefore recorded as separate algorithms in the Results section.[21,22] Data from all 12 papers were included and any missing information was recorded. Where sensitivity and specificity were not explicitly reported but could be calculated from a confusion matrix, the calculated results were included.
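For transparency, the sketch below illustrates this calculation. It is a minimal illustration, not the extraction code used for this review, and the counts shown are hypothetical placeholders.

```python
# Minimal sketch (not the review's extraction code): deriving sensitivity and
# specificity when a study reports only a 2x2 confusion matrix.
# The counts below are hypothetical placeholders.
def sens_spec(tp, fn, tn, fp):
    """Return (sensitivity %, specificity %) from confusion-matrix counts."""
    sensitivity = 100 * tp / (tp + fn)   # true-positive rate among diseased images
    specificity = 100 * tn / (tn + fp)   # true-negative rate among healthy images
    return sensitivity, specificity

sens, spec = sens_spec(tp=290, fn=10, tn=950, fp=20)
print(f"Sensitivity {sens:.1f}%, specificity {spec:.1f}%")   # 96.7%, 97.9%
```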
RESULTS
Our search identified 175 records, of which 99 were screened [Figure 1]. Thirty full-text articles were assessed for eligibility and 12 studies were included in the systematic review.[12,13,21,22,24,25,26,27,28,29,30,31] The remaining full-text articles were excluded because they did not test diagnostic performance,[32,33,34,35,36,37,38,39] had no classification task,[40,41,42] had no internal validation,[23,43] did not use an AI algorithm,[44] or were not based on standard clinical care.[45]
Figure 1.

Outline of study selection
Data characteristics and demographics
All 12 studies obtained retrospective images collected as part of routine clinical care or from local screening programs. Seven of these studies collected images from China,[22,24,25,26,28,29,31] one from India,[27] one from North America,[12] one from American and Mexican sites,[30] one from America and Nepal,[21] and one from New Zealand.[13] Image collection dates across the studies ranged from July 2011 to June 2020. Three studies specified their inclusion criteria[25,26,31] and five other studies specified their exclusion criteria.[12,13,21,28,29] Poor-quality images were excluded in five studies,[12,13,28,29,31] and image augmentation was applied in seven studies.[13,21,25,27,28,29,30] These characteristics are summarized in Table 1. Seven studies recorded demographic information,[21,24,25,26,27,29,31] with a mean GA of 30.9 weeks and a mean birth weight of 1501.25 g. A total of 178,459 images were used across the 12 studies, ranging from 2668 to 52,249 images per study. Five studies formulated an algorithm to detect ROP[21,22,24,25,31] and seven studies created an algorithm to detect the presence of plus disease, drawing on a total of 5358 plus disease images.[12,13,26,27,28,29,30] Full details of demographics and sample size are provided in Table 2.
Table 1.
Data characteristics for the 12 included studies
| Data characteristics | |||||||
|---|---|---|---|---|---|---|---|
|
| |||||||
| Study | Source of data | Date range | Open-access data | Missing data | Inclusion criteria | Exclusion criteria | Exclusion of poor-quality imaging/image augmentation |
| Brown et al., 2018 | Retrospective cohort, data collected at multiple hospitals across North America | July 2011-December 2016 | N | NR | NR | Stage 4-5 ROP | Y/NR |
| Chen et al., 2020 | |||||||
| American Trained Algorithm | Routine ROP screening from 9 North American institutions | NR | NR | NR | NR | Evidence of prior treatment (laser photocoagulation and scars), retinal detachment (stage >4), if artifacts obscure >50% of image | N/Y - SciPy package in Python, contrast enhancement, green channel extraction, resizing to 224×224 pixels |
| Nepalese Trained Algorithm | ROP screening program from 4 urban hospitals in Kathmandu, Nepal (Patan Hospital, Kanti Children's Hospital, Paropakar Maternity and Women's Hospital, Tilganga Institute of ophthalmology) | NR | NR | NR | NR | ||
| Hu et al., 2019 | Chengdu Women and Children's Central Hospital | 2014-2017 | NR | NR | NR | NR | NR/NR |
| Huang et al., 2020 | Neonatal intensive care unit of Chang Gung Memorial Hospital, Linkou, Taiwan | 17 Dec 2013-24 May 2019 | NR | NR | Premature infants with no ROP, stage 1 ROP and stage 2 ROP. ROP screening criteria: BW ≤1500g, GA ≤32 weeks, select few infants with BW 1500-2000 g or GA >32 weeks with unstable clinical condition | NR | N/Y |
| Mao et al., 2020 | Eye Hospital of Wenzhou Medical University | July 2013 - May 2018 | NR | NR | Only images of the posterior retina | NR | NR/NR |
| Ramachandran et al., 2021 | KIDROP Bangalore, India | NR | NR | NR | NR | NR | N/Y |
| Tan et al., 2019 | Auckland Regional Telemedicine ROP (ART-ROP) image library, from four neonatal ICUs in Auckland, New Zealand | 2006-2015 | NR | NR | NR | Poor image quality | Y - Not grossly out of focus, not affected by blur/Y |
| Tong et al., 2020 | Images collected from Renmin Hospital of Wuhan University Eye Centre | 1 Feb 2012-1 Oct 2016 | NR | NR | NR | 1. Poor image quality, 2. Imaging artefacts 3. Unfocused scans 4. Presence of other disease phenotypes (e.g., retinal haemorrhage) | Y/Y |
| Wang et al., 2021 | 4 centres in southern China - JSIEC of Shantou University and Chinese University of Hong Kong, Guangdong Women and Children Hospital in Yuexiu branch (Yuexiu) and Panyu branch (Panyu), and the Sixth Affiliated Hospital of Guangzhou Medical University and Qingyuan People's Hospital | 1 Sept 2018-24 June 2020 | NR | NR | NR | 1. Nonfundus photos or fundus photos taken by imaging devices other than RetCam 2. Infants with other ocular diseases e.g., congenital cataract, retinoblastoma, or persistent hyperplastic primary vitreous, and 3. any images with disagreeing labels | Y/Y |
| Wang et al., 2018 | Images captured during routine clinical ROP Screening from Chengdu Women and Children's Central Hospital | Jan 2018 | NR | NR | NR | NR | NR/NR |
| Yildiz et al., 2020 | 8 study centres: Columbia University, University of Illinois at Chicago, William Beaumont Hospital, Children's Hospital Los Angeles, Cedars-Sinai Medical Centre, University of Miami, Weill Cornell Medical Centre and Asociacion para Evitar la Ceguera en Mexico | July 2011 - December 2016 | NR | NR | NR | NR | N/Y - resized to 480×640-pixel input images |
| Zhang et al., 2018 | From telemedicine ROP trial (Telemed-R), images collected from 30 hospitals in Guangdong and Fujian Provinces of China | June 2013 | NR | NR | BW <2000 g; preterm infants with BW >2000 g but severe systemic disorders (as per paediatrician) | NR | Y - Highly blurry images, very dark or bright images, nonfundus photographs/NR |
NR: Not recorded, ROP: Retinopathy of prematurity, KIDROP: Karnataka Internet Assisted Diagnosis of ROP, ART: Auckland Regional Telemedicine, ICU: Intensive care unit, JSIEC: Joint Shantou International Eye centre, BW: Birth weight, GA: Gestational age, N: No, Y: Yes
Table 2.
Patient demographics and sample size for the 12 included studies
| Study | Mean GA (SD; range), weeks | BW (SD; range), grams | Mean age at screening (SD), weeks | Number of participants represented by training data | Number of images used in the study | Number of images with ROP | Number of images with plus disease | Number of images with preplus |
|---|---|---|---|---|---|---|---|---|
| Brown et al., 2018 | NR | NR | NR | 898 | 5511 | N/A | 172 | 805 |
| Chen et al., 2020 | | | | | | | | |
| American Trained Algorithm | 26.6 (2.2; NR) | 856.2 (293.7; NR) | NR | 711 | 5943 | NR | N/A | N/A |
| Nepalese Trained Algorithm | 32.6 (2.8; NR) | 1949.6 (495.8; NR) | NR | 541 | 5049 | NR | N/A | N/A |
| Hu et al., 2019 | 32 (NR; NR) | 1994 (NR; NR) | NR | 720 | 2668 | 1184 | N/A | N/A |
| Huang et al., 2020 | 27.3 (1.8; NR) | 936.4 (229.8; NR) | NR | NR | 10,235 | 1279: 557 (stage 1 ROP), 722 (stage 2 ROP) | NR | NR |
| Mao et al., 2020 | 31.3 (2.1; NR) training set; 31 (2; NR) test set | 1643 (419.5; NR) training set; 1583.3 (401.6; NR) test set | NR | 3021 | 6161 | N/A | 290 | 691 |
| Ramachandran et al., 2021 | 32.4 (1.1; NR) no plus; 30.9 (1.8; NR) plus | 1350 (240; NR) no plus; 1280 (226; NR) plus | NR | 150 | 289 | N/A | 89 | N/A |
| Tan et al., 2019 | NR | NR | NR | NR | 4926 (3487 suitable for training) | N/A | 1638 (post image preprocessing + data augmentation) | 0 |
| Tong et al., 2020 | NR | NR | NR | NR | 36,231 | N/A | 3006: 2745 (in training dataset), 261 (in test dataset) | NR |
| Wang et al., 2021 | 32.9 (3.1; NR) | 1925 (774; NR) | NR | 8652 | 52,249 | N/A | NR | NR |
| Wang et al., 2018 | | | | | | | | |
| Id-Net | NR* | NR* | NR | 1273 total for both Id-Net and Gr-Net: 605 developing data, 264 data for expert comparison, 404 data from web | 20,795: 13,526 (developing), 2361 (expert comparison), 4908 (from web) | 6917: 5967 (developing), 293 (expert), 657 (data from web) | N/A | N/A |
| Gr-Net | NR | NR | NR | | 5089: 4139 (developing), 293 (expert comparison), 657 (data from web) | Severe=2517: 2305 (developing), 120 (expert), 92 (web); Minor=2572: 1834 (developing), 173 (expert), 565 (from web) | N/A | N/A |
| Yildiz et al., 2020 | NR | NR | NR | NR | 5512 | N/A | 163 | 802 |
| Zhang et al., 2018 | 31.9-32 (NR; 24-36.4) | 1490-1500 (NR; 630-2000) | NR | NR | 17,801 | 8090 | N/A | N/A |
*Data represented as bar graph distribution, unable to calculate mean. NR: Not recorded, N/A: Not applicable, BW: Birth weight, GA: Gestational age, SD: Standard deviation, ROP: Retinopathy of prematurity
Algorithm development and validation
Convolutional neural networks formed the basis of the algorithms developed in all 12 studies. A variety of architectures and pretraining sources were used for transfer learning, including ResNet, ImageNet, U-Net, and VGG-16 [Table 3], whereas one study did not use a transfer-learning approach.[25] The majority of studies used <6000 images to train their algorithm; however, five studies used >10,000 images for algorithm development.[22,25,28,29,31] The reference standard across all 12 studies was based on disease diagnosis by 1–5 expert graders, with an average of 2.6 human graders agreeing upon each image per study. A variety of internal validation methods were reported, including random split-sample validation and cross-validation [Table 4]. Five studies[12,13,21,22,30] obtained external validation of their AI algorithms, of which one completed a prospective evaluation of algorithm performance.[22]
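As an illustration of the transfer-learning pattern reported in Table 3, the sketch below fine-tunes an ImageNet-pretrained ResNet-50 for a binary ROP label. It is a minimal example under assumed settings (hypothetical image folder, batch size, and learning rate), not the pipeline of any included study.

```python
# Illustrative sketch only (not any study's actual pipeline): fine-tuning an
# ImageNet-pretrained ResNet-50 for a binary ROP / no-ROP label.
# The dataset path and hyperparameters are hypothetical placeholders.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

tfms = transforms.Compose([
    transforms.Resize((224, 224)),          # match the pretrained input size
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
train_set = datasets.ImageFolder("fundus_images/train", transform=tfms)  # hypothetical path
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)   # transfer learning
model.fc = nn.Linear(model.fc.in_features, 2)   # replace classifier head: ROP vs. no ROP
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:                   # one training pass shown for brevity
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```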
Table 3.
Details of algorithm development for the 12 included studies
| Algorithm development | ||||
|---|---|---|---|---|
|
| ||||
| Study | Algorithm name | Algorithm architecture | Transfer learning applied | Number of images for training/tuning |
| Brown et al., 2018 | NR | U-Net architecture | Yes | 4409/1102 |
| Chen et al., 2020 | ||||
| American Trained Algorithm | NR | ImageNet and Pytorch | Yes (from ResNet) | 5235/NR |
| Nepal Trained Algorithm | NR | ImageNet and Pytorch | Yes (from ResNet) | 4802/NR |
| Hu et al., 2019 | NR | ImageNet and TensorFlow (Inception-v2, VGG-16, ResNet-50) | Yes (from ImageNet) | 2068/300 |
| Huang et al., 2020 | NR | Tensorflow | No | 10,235/1137 |
| Mao et al., 2020 | NR | U-Net and DenseNet | Yes (from ImageNet) | 5711/NR |
| Ramachandran et al., 2021 | NR | Darknet-53 | Yes (from ImageNet) | 289/32 (then retrained by 96 images for final model) |
| Tan et al., 2019 | ROP.AI | TensorFlow's Inception-v3 | Yes | 80% of 6974/NR |
| Tong et al., 2020 | NR | Faster R-CNN+TensorFlow | Yes (from ResNet) | 90% of 26,459/10% of 26,459 |
| Wang et al., 2021 | J-PROP | NR* | Yes (from Res-Unet) | 75% (39,029)/10% (5140) |
| Wang et al., 2018 | DeepROP | Tensorflow - Inception-BN Network | Yes (from ImageNet) | 17665/NR |
| Yildiz et al., 2020 | I-ROP ASSIST | NR* | Yes (from U-Net) | 5512/NR |
| Zhang et al., 2018 | CAD-R | NR* | Yes (from VGG-16) | 17801/NR |
*Specific architecture not recorded; however, a CNN was used. NR: Not recorded, AI: Artificial intelligence, ROP: Retinopathy of prematurity, CAD-R: Computer-aided diagnosis system for ROP, R-CNN: Region-convolutional neural networks, i-ROP: Imaging and informatics in ROP, BN: Batch normalised, VGG: Visual geometry group
Table 4.
Method of algorithm validation for the 12 included studies
| Algorithm validation | |||||||
|---|---|---|---|---|---|---|---|
|
| |||||||
| Study | Reference standard | If compared to experts, how many? | Same method for assessing reference standard across samples | Type of internal validation | Number of images for internal validation | External validation | Number of images for external validation |
| Brown et al., 2018 | Expert consensus | 3 | Yes | Random split sample validation | 20% | Yes | 100 |
| Chen et al., 2020 | |||||||
| American Trained Algorithm | Expert consensus | 3 | Yes | 5-fold cross-validation | 10% test set | Yes | 247 images from Nepal |
| Nepal Trained Algorithm | 1 | Yes | 5-fold cross-validation | 10% test set | Yes | 708 images from America | |
| Hu et al., 2019 | Expert consensus | 3 | Yes | Random split sample validation | 300 | No | N/A |
| Huang et al., 2020 | Expert consensus | 3 | Yes | 5-fold cross-validation | 244 | No | N/A |
| Mao et al., 2020 | Clinical diagnosis by one ophthalmologist | 1 | NR | Random split | 450 | No | N/A |
| Ramachandran et al., 2021 | Expert consensus | 3 | Yes | 80:20 split | 161 (67 ROP) | No | N/A |
| Tan et al., 2019 | Expert ophthalmologist from New Zealand; external images graded by expert ophthalmologist from Hong Kong | 1 | No - 2 different experts between internal and external validation | 80:20 random split validation | 20% of 6974 | Yes | 90 (33 plus, 57 normal) + additional 26 preplus images for assessing preplus |
| Tong et al., 2020 | Expert grading (11 retinal experts for first-round screening, 2 senior experts confirmed or corrected labels) | 2 | No - 11 different first round graders | 10-fold cross-validation | 9772 | No* | N/A |
| Wang et al., 2021 | Expert grading (2 junior ophthalmologists labelled, any disagreement submitted to 1 senior ophthalmologist) | 3 | No - dependent on agreement | Random split 75:10:15 (training, validation, test) - but based on a patient-based split policy (i.e., all images of a patient were allocated into the same sub-data set) | 8080 | No | N/A |
| Wang et al., 2018 | Expert consensus (images included if 2 out of 3 graders agreed, disagreements sent to fourth ophthalmologist) | 3-4 | Yes | Random split | 298 (for Id-Net), 104 (for Gr-Net) | Yes - prospective evaluation | 2361 (total, Id and Gr net) |
| Yildiz et al., 2020 | Expert consensus | 3 | Yes | 5-fold cross-validation | 5000 | Yes | 100 (15 plus, 34 preplus) |
| Zhang et al., 2018 | Cross validation by one senior ophthalmologist | 5 (2 senior experts, 2 attending physicians, 1 resident) | Yes | Random selection | 1742 (155 ROP, 1587 without ROP) | No | N/A |
*No external validation, however did measure algorithm versus human graders on another 1227 images collected during routine clinical care. ROP: Retinopathy of prematurity, N/A: Not applicable, NR: Not recorded
Algorithm performance
The performance of each algorithm is listed in Table 5. Five studies recorded the ability of their algorithm to detect the presence of ROP, with an average area under the receiver operating characteristic curve (AUROC) of 0.984.[21,22,24,25,31] Sensitivity and specificity were recorded in four of those studies and averaged 95.72% and 98.15%, respectively.[22,24,25,31] One study compared human grader performance to the AI algorithm, revealing similar sensitivities (94.1% AI, 93.5% human) and specificities (99.3% AI, 99.5% human) for ROP diagnosis.[31] Two of the five studies underwent external validation, revealing an average sensitivity and specificity of 60% and 88.3%, respectively, for detecting the presence of disease.[21,22] The seven other studies assessed the ability of their algorithm to detect the presence of plus disease. Among these, six studies measured AUROC, for which the average was 0.98.[12,13,26,27,29,30] The average sensitivity and specificity for detecting plus disease, recorded in six studies, were 91.13% and 95.92%, respectively.[12,13,26,27,28,29] External validation occurred in two of these studies and produced an average sensitivity of 93.45% and specificity of 87.35%.[12,13] The performance of AI algorithms at detecting pre-plus disease was measured in two articles, producing an average sensitivity of 96.2% and specificity of 95.7%.[12,26] In comparison, four studies measured performance in determining the stage of ROP, showing an average sensitivity and specificity of 89.07% and 94.63%, respectively.[22,25,28,29]
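For readers less familiar with these metrics, the short sketch below shows how AUROC, sensitivity, and specificity are conventionally derived from a model's predicted probabilities. The arrays are toy placeholders rather than data from any included study, and the 0.5 threshold is an arbitrary illustrative cutoff.

```python
# Toy illustration (not data from any included study) of how AUROC, sensitivity,
# and specificity are computed from predicted probabilities.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # 1 = plus disease present (reference standard)
y_prob = np.array([0.92, 0.10, 0.85, 0.40, 0.70, 0.05, 0.71, 0.30])  # model output

auroc = roc_auc_score(y_true, y_prob)          # threshold-free ranking metric
y_pred = (y_prob >= 0.5).astype(int)           # sensitivity/specificity require a cutoff
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity, specificity = tp / (tp + fn), tn / (tn + fp)
print(f"AUROC {auroc:.2f}, sensitivity {sensitivity:.0%}, specificity {specificity:.0%}")
```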
Table 5.
Summary of results from the 12 included studies
Algorithm performance

Detecting disease

| Study | Sens % | Spec % | AUROC |
|---|---|---|---|
| Hu et al. 2019 | 96 | 98 | 0.9922 |
| Zhang et al. 2018 | 94.1 | 99.3 | 0.998 |
| Chen et al. 2020 | | | |
| American Trained Algorithm | NR | NR | 0.99 |
| Nepal Trained Algorithm | NR | NR | 0.96 |
| Combined (American & Nepal) Trained Algorithm | NR | NR | 0.99 |

Detecting disease and stage

| Study | Sens % | Spec % | AUROC | Sens grading % | Spec grading % | AUROC grading |
|---|---|---|---|---|---|---|
| Huang et al. 2020 | 96.14±0.87 | 95.95±0.48 | 0.96 | 91.82±2.03 (stage 1) | 94.5±0.71 (stage 1) | 0.93 (stage 1) |
| Wang et al. 2018 | | | | | | |
| Id-Net | 96.64 | 99.33 | 0.995 | n/a | n/a | n/a |
| Gr-Net | n/a | n/a | n/a | 88.46 (minor vs. severe) | 92.31 (minor vs. severe) | 0.951 (minor vs. severe) |

Detecting plus disease

| Study | Sens % | Spec % | AUROC | Sens pre-plus % | Spec pre-plus % | AUROC pre-plus |
|---|---|---|---|---|---|---|
| Brown et al. 2018 | 93 | 94 | 0.98 | 100 | 94 | NR |
| Mao et al. 2020 | 95.1 | 97.8 | 0.99 | 92.4 | 97.4 | NR |
| Ramachandran et al. 2021 | 99 | 98 | 0.9947 | n/a | n/a | n/a |
| Tan et al. 2019 | 96.6 | 98 | 0.993 | n/a | n/a | n/a |
| Yildiz et al. 2020 | NR | NR | 0.94 | NR | NR | 0.88 |

Detecting plus disease and severity

| Study | Sens % | Spec % | AUROC | Sens grading % | Spec grading % | AUROC grading |
|---|---|---|---|---|---|---|
| Wang et al. 2021 | 91.8 | 97 | 0.983 | 98.2 (stage) | 98.5 (stage) | 0.998 (stage) |
| Tong et al. 2020 | 71.3 | 90.7 | NR | 77.8 (“normal” “mild” “semi-urgent” “urgent”) | 93.2 (“normal” “mild” “semi-urgent” “urgent”) | NR |

Human performance and external validation

Detecting disease

| Study | Human sens % | Human spec % | Human sens grading % | Human spec grading % | Human AUROC | External validation sens % | External validation spec % | External validation AUROC |
|---|---|---|---|---|---|---|---|---|
| Hu et al. 2019 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Zhang et al. 2018 | 93.5 | 99.5 | n/a | n/a | n/a | n/a | n/a | n/a |
| Chen et al. 2020 | | | | | | | | |
| American Trained Algorithm | n/a | n/a | n/a | n/a | n/a | 52 | 99 | 0.96 |
| Nepal Trained Algorithm | n/a | n/a | n/a | n/a | n/a | 44 | 69 | 0.62 |
| Combined (American & Nepal) Trained Algorithm | n/a | n/a | n/a | n/a | n/a | 98/82 (against American/Nepal set) | 96/99 (against American/Nepal set) | 0.99/0.98 (against American/Nepal set) |

Detecting disease and stage

| Study | Human sens % | Human spec % | Human sens grading % | Human spec grading % | Human AUROC | External validation sens % | External validation spec % | External validation AUROC | External validation sens grading % | External validation spec grading % | External validation AUROC grading |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Huang et al. 2020 | NR | NR | NR | NR | NR | n/a | n/a | n/a | n/a | n/a | n/a |
| Wang et al. 2018 | | | | | | | | | | | |
| Id-Net | NR | NR | NR | NR | NR | 84.91 | 96.9 | NR | n/a | n/a | n/a |
| Gr-Net | NR | NR | NR | NR | NR | n/a | n/a | n/a | 93.33 (minor vs. severe) | 73.63 (minor vs. severe) | NR |

Detecting plus disease

| Study | Human sens % | Human spec % | Human sens grading % | Human spec grading % | Human AUROC | External validation sens % | External validation spec % | External validation AUROC | External validation sens pre-plus % | External validation spec pre-plus % | External validation AUROC pre-plus |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Brown et al. 2018 | n/a | n/a | n/a | n/a | n/a | 93 | 94 | NR | 100 | 94 | NR |
| Mao et al. 2020 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Ramachandran et al. 2021 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Tan et al. 2019 | n/a | n/a | n/a | n/a | n/a | 93.9 | 80.7 | NR | 81.4 | 80.7 | 0.977 |
| Yildiz et al. 2020 | n/a | n/a | n/a | n/a | n/a | NR | NR | 0.99 | NR | NR | 0.97 |

Detecting plus disease and severity

| Study | Human sens % | Human spec % | Human sens grading % | Human spec grading % | Human AUROC | External validation sens % | External validation spec % | External validation AUROC |
|---|---|---|---|---|---|---|---|---|
| Wang et al. 2021 | 100 (compared to J-PROP on same dataset: 100) | 99.8 (compared to J-PROP on same dataset: 98.4) | 91.7 (stage) (compared to J-PROP on same dataset: 97.9) | 99.1 (stage) (compared to J-PROP on same dataset: 97.4) | NR | n/a | n/a | n/a |
| Tong et al. 2020 | 74.8 (expert 1), 65.9 (expert 2) (for grading “normal” “mild” “semi-urgent” “urgent”) | 93.4 (expert 1), 92.3 (expert 2) (for grading “normal” “mild” “semi-urgent” “urgent”) | n/a | n/a | NR | n/a | n/a | n/a |
DISCUSSION
We found that deep-learning algorithms for ROP screening demonstrated sensitivity and specificity comparable to those of neural network algorithms for diabetic retinopathy.[46] Although these estimates support the potential for deep-learning algorithms to be implemented as real-world diagnostic tools, several methodological deficiencies were common across the included studies and need to be considered. These include the quality of the reference standard, the use of sample size calculations, external validation, the definition of the presence or absence of disease, and the need for prospective evaluation.
First, we found variability in the specific diagnostic targets, with the 12 papers split between diagnosing the presence of ROP as a whole and the presence of plus disease. It is important to differentiate these diagnostic targets, as the clinical implications of the findings will differ. In addition, most studies used a reference standard graded by, on average, 2–3 experts, with only one study producing a reference standard diagnosed by five clinicians per image.[31] It is well reported that there is significant intergrader variability in ROP diagnosis due to its subjective nature;[47,48] therefore, caution is needed in recognizing the potential for grader bias in studies using only a few expert graders.
Second, there was large variation in the number of images used to train each algorithm, ranging from 289[27] to 39,029 images.[29] Convolutional neural networks learn by computing the error between the machine's output and the image diagnosis; hence, the more images used to train a model, the smaller the error of its diagnostic output.[6] For this reason, the studies with sample sizes in the tens of thousands were likely to have more reliable results than those trained on hundreds or thousands of images. Nonetheless, no study reported a formal sample size calculation to ensure sufficient sizing, and only one paper acknowledged sample size as a limitation.[25] Despite the challenge of sample size calculations in the context of AI algorithms, they remain a principal component of study design. Future studies should consider performing sample size calculations to justify the number of images required for algorithm development.
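As one possible starting point, the sketch below applies Buderer's commonly cited formula for sizing a diagnostic accuracy study around a target sensitivity; the target sensitivity, precision, and prevalence values are illustrative assumptions only, not recommendations from this review.

```python
# Hedged sketch of one conventional approach (Buderer's formula) to sample size
# for diagnostic sensitivity; all input values are illustrative assumptions.
from math import ceil

def n_for_sensitivity(expected_sens, precision, prevalence, z=1.96):
    """Images needed so the 95% CI around sensitivity spans +/- `precision`."""
    n_diseased = (z ** 2) * expected_sens * (1 - expected_sens) / precision ** 2
    return ceil(n_diseased / prevalence)   # scale up by the expected disease prevalence

# e.g. expecting 95% sensitivity, +/-3% precision, 20% of screened images showing ROP
print(n_for_sensitivity(expected_sens=0.95, precision=0.03, prevalence=0.20))  # ~1014 images
```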
Third, the exclusion of poor-quality images or the use of image augmentation may affect how these deep-learning algorithms perform in the real-world clinical setting. Image quality can limit the diagnostic performance of an algorithm, as high-quality images correlate with high-quality diagnoses and smaller algorithm errors.[6] It is therefore understandable that most papers exclude poor-quality images; however, this should be kept within reason. The quality of images used to train an algorithm should correspond to the quality of images taken in the clinical setting so that measured performance reflects real-life performance. For the same reason, external validation of an algorithm on an image set outside the training set is crucial for determining the generalizability of a study. Only five of the 12 studies completed external validation, and all but one, which showed equivalent performance, revealed inferior algorithm performance compared with their internal test set. This finding highlights the need for out-of-sample external validation of these screening algorithms to better understand how they will perform in the clinical setting.
Fourth, the ground truth or reference standard labels were mostly derived from data collected for other purposes, such as databases of ROP images or retrospective routine clinical care notes. Although there is an internationally accepted guideline for defining the presence and stage of ROP, the International Classification of Retinopathy of Prematurity revisited (ICROP)[49] (updated in a third edition in 2021[50]), only five studies specifically mentioned the ICROP in their methods for defining the reference standard. As the ICROP is the universally adopted diagnostic criteria for grading ROP, it is reasonable to assume that the other seven studies also used these guidelines; however, the criteria for the presence or absence of disease should always be clearly defined in AI studies.
Finally, only one study completed a prospective evaluation of its algorithm, a process that is vital for assessing real-world performance. The majority of studies assessed deep-learning diagnostic accuracy in isolation, without external validation, as mentioned earlier, or comparison to experts. Only three studies compared AI performance with human performance, allowing evaluation of real-world application. Without such comparison, the results from the remaining studies are limited in their ability to be extrapolated to health-care delivery. For a deep-learning diagnostic tool to be applicable to clinical bedside screening, it must perform comparably to or better than the gold standard, in this case expert diagnosis. More work is required to validate the performance of AI algorithms against human graders, ideally using the same external test dataset.
It is clear from this systematic review that there is still no well-designed, randomized, head-to-head comparison of an effective, externally validated AI algorithm with human performance in real time. A study of this magnitude could reveal the clinical implications of implementing an algorithm in the clinical setting. For this reason, prospective evaluations of these deep-learning diagnostic tests are crucial to reveal the full potential of AI in both diagnostic and therapeutic medicine. We recognize that there is a substantial “black box” issue in deep learning, whereby the image features learned by an algorithm are unknown to the user.[6] It is for this reason that many clinicians are sceptical about entrusting clinical care to AI, especially when the clinical features clinicians are familiar with may not be the same features used by an algorithm. This further emphasizes the need for well-executed studies that minimize bias and are thoroughly and transparently reported. Most of the concerns we have highlighted in this review are avoidable with robust design, and it remains critical that these AI diagnostic tests are evaluated in the context of their intended clinical pathway.
CONCLUSION
AI has been heralded as a revolutionary technology for many industries, and deep-learning algorithms for the diagnosis of ROP are no exception. Despite the issues we have highlighted in this systematic review, the performance of the 12 deep-learning algorithms evaluated was extremely high, with every study that reported an AUROC achieving a value of 0.94 or above. These results outline the ability of AI algorithms to perform comparably to, or exceed, human experts and provide the groundwork for future large-scale prospective studies. Although there are clear screening and treatment guidelines, ROP disease burden continues to rise as increased survival of preterm infants coincides with advancements in medical care.[15] Inadequate accessibility and numbers of experienced ophthalmologists continue to limit ROP screening and diagnosis. Consequently, the burden of ROP visual impairment is expected to increase unless a novel strategy such as deep-learning diagnostic algorithms becomes available. There is no doubt that the successful application of AI in ROP would revolutionize disease diagnosis through its high predictive performance and streamlined efficiency. The clinical implications of implementation into real-world clinical practice are far-reaching, with translation into highly accessible, high-quality, timely screening and a significant reduction in the cost of screening. AI will therefore become ubiquitous and indispensable for ROP screening, and it is important that high-quality research continues to aid the translation of this transformative technology in order to reduce the incidence of visual loss and blindness from this preventable disease.
Financial support and sponsorship
Nil.
Conflicts of interest
There are no conflicts of interest.
Appendix 1: Full search strategy
| We show the search strategy for a. Medline OVID b. PubMed c. Web of Science d. Embase |
| Medline OVID |
| “Retinopathy of prematurity” or “ROP” and |
| “Diagnosis” or “screening” and |
| “Artificial intelligence” or “deep learning” or “convolutional neural networks” |
| PubMed |
| “Retinopathy of prematurity” or “ROP” and “diagnosis” or “screening” AND “artificial intelligence” or “deep learning” or “convolutional neural network” |
| Web of science |
| TI = (diagnosis or screening or classification) and |
| TS = (artificial intelligence or machine learning or deep learning or convolutional neural network) and |
| TI = (retinopathy of prematurity or ROP) |
| Embase |
| “Retinopathy of Prematurity” or “ROP” or “plus disease” and |
| “Diagnosis” or “screening” or “classification” and |
| “Artificial intelligence” or “deep learning” or “convolutional neural network” or “machine learning” |
ROP: Retinopathy of prematurity
Appendix 2: Methodological quality assessment of bias for included studies using QUADAS-2[18]
| Study | Domain 1A | Domain 1B | Domain 2A | Domain 2B | Domain 3A | Domain 3B | Domain 4A |
|---|---|---|---|---|---|---|---|
| Brown et al. 2018 | Unclear | Low | Low | Low | Low | Low | Low |
| Chen et al. 2020 | Low | Low | Low | Low | Unclear | Low | Low |
| Hu et al. 2019 | Unclear | Unclear | Low | Low | Low | Low | Low |
| Huang et al. 2020 | Low | Low | Low | Low | Low | Low | Low |
| Mao et al. 2020 | High | Unclear | Low | Low | High | Low | Unclear |
| Ramachandran et al. 2021 | High | Low | Low | Low | Low | Low | Low |
| Tan et al. 2019 | Low | Low | Low | Low | Low | Low | Low |
| Tong et al. 2020 | Low | Low | Low | Low | High | High | High |
| Wang et al. 2021 | Low | Low | Low | Low | Unclear | Low | Unclear |
| Wang et al. 2018 | Low | Low | Low | Low | Low | Low | Low |
| Yildiz et al. 2020 | High | Low | Low | High | Low | Low | Low |
| Zhang et al. 2018 | High | Low | Low | Low | Low | Low | Unclear |
REFERENCES
- 1.Turing A. Computing machinery and intelligence. Mind. 1950;59:433–60. [Google Scholar]
- 2.McCarthy J, Minsky M, Rochester N, Shannon C. A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence, 1955. AI Magazine. 2006;27:12. [Google Scholar]
- 3.Wu J, Yılmaz E, Zhang M, Li H, Tan KC. Deep spiking neural networks for large vocabulary automatic speech recognition. Front Neurosci. 2020;14:199. doi: 10.3389/fnins.2020.00199. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Popel M, Tomkova M, Tomek J, Kaiser Ł, Uszkoreit J, Bojar O, et al. Transforming machine translation: A deep learning system reaches news translation quality comparable to human professionals. Nat Commun. 2020;11:4381. doi: 10.1038/s41467-020-18073-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Fayyad J, Jaradat MA, Gruyer D, Najjaran H. Deep learning sensor fusion for autonomous vehicle perception and localization: A review. Sensors (Basel) 2020;20:e4220. doi: 10.3390/s20154220. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–44. doi: 10.1038/nature14539. [DOI] [PubMed] [Google Scholar]
- 7.Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316:2402–10. doi: 10.1001/jama.2016.17216. [DOI] [PubMed] [Google Scholar]
- 8.Zhang Y, Shi J, Peng Y, Zhao Z, Zheng Q, Wang Z, et al. Artificial intelligence-enabled screening for diabetic retinopathy: A real-world, multicenter and prospective study. BMJ Open Diabetes Res Care. 2020;8:e001596. doi: 10.1136/bmjdrc-2020-001596. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.De Fauw J, Ledsam JR, Romera-Paredes B, Nikolov S, Tomasev N, Blackwell S, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat Med. 2018;24:1342–50. doi: 10.1038/s41591-018-0107-6. [DOI] [PubMed] [Google Scholar]
- 10.Burlina PM, Joshi N, Pekala M, Pacheco KD, Freund DE, Bressler NM. Automated grading of age-related macular degeneration from color fundus images using deep convolutional neural networks. JAMA Ophthalmol. 2017;135:1170–6. doi: 10.1001/jamaophthalmol.2017.3782. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Li Z, He Y, Keel S, Meng W, Chang RT, He M. Efficacy of a deep learning system for detecting glaucomatous optic neuropathy based on color fundus photographs. Ophthalmology. 2018;125:1199–206. doi: 10.1016/j.ophtha.2018.01.023. [DOI] [PubMed] [Google Scholar]
- 12.Brown JM, Campbell JP, Beers A, Chang K, Ostmo S, Chan RVP, et al. Automated diagnosis of plus disease in retinopathy of prematurity using deep convolutional neural networks. JAMA Ophthalmol. 2018;136:803–10. doi: 10.1001/jamaophthalmol.2018.1934. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Tan Z, Simkin S, Lai C, Dai S. Deep learning algorithm for automated diagnosis of retinopathy of prematurity plus disease. Transl Vis Sci Technol. 2019;8:23. doi: 10.1167/tvst.8.6.23. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Blencowe H, Lawn JE, Vazquez T, Fielder A, Gilbert C. Preterm-associated visual impairment and estimates of retinopathy of prematurity at regional and global levels for 2010. Pediatr Res. 2013;74(Suppl 1):35–49. doi: 10.1038/pr.2013.205. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Gilbert C. Retinopathy of prematurity: A global perspective of the epidemics, population of babies at risk and implications for control. Early Hum Dev. 2008;84:77–82. doi: 10.1016/j.earlhumdev.2007.11.009. [DOI] [PubMed] [Google Scholar]
- 16.Valentine PH, Jackson JC, Kalina RE, Woodrum DE. Increased survival of low birth weight infants: Impact on the incidence of retinopathy of prematurity. Pediatrics. 1989;84:442–5. [PubMed] [Google Scholar]
- 17.Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet large scale visual recognition challenge. Int J Comput Vis. 2015;115:211–52. [Google Scholar]
- 18.Whiting PF, Rutjes AW, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, et al. QUADAS-2: A revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011;155:529–36. doi: 10.7326/0003-4819-155-8-201110180-00009. [DOI] [PubMed] [Google Scholar]
- 19.Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. Int J Surg. 2021;88:105906. doi: 10.1016/j.ijsu.2021.105906. [DOI] [PubMed] [Google Scholar]
- 20.Moons KG, de Groot JA, Bouwmeester W, Vergouwe Y, Mallett S, Altman DG, et al. Critical appraisal and data extraction for systematic reviews of prediction modelling studies: The CHARMS checklist. PLoS Med. 2014;11:e1001744. doi: 10.1371/journal.pmed.1001744. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Chen JS, Coyner AS, Ostmo S, Sonmez K, Bajimaya S, Pradhan E, et al. Deep learning for the diagnosis of stage in retinopathy of prematurity: Accuracy and generalizability across populations and cameras. Ophthalmol Retina. 2021;5:1027–35. doi: 10.1016/j.oret.2020.12.013. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Wang J, Ju R, Chen Y, Zhang L, Hu J, Wu Y, et al. Automated retinopathy of prematurity screening using deep neural networks. EBioMedicine. 2018;35:361–8. doi: 10.1016/j.ebiom.2018.08.033. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Campbell JP, Singh P, Redd TK, Brown JM, Shah PK, Subramanian P, et al. Applications of artificial intelligence for retinopathy of prematurity screening. Pediatrics. 2021;147:e2020016618. doi: 10.1542/peds.2020-016618. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Hu J, Chen Y, Zhong J, Ju R, Yi Z. Automated analysis for retinopathy of prematurity by deep neural networks. IEEE Trans Med Imaging. 2019;38:269–79. doi: 10.1109/TMI.2018.2863562. [DOI] [PubMed] [Google Scholar]
- 25.Huang YP, Basanta H, Kang EY, Chen KJ, Hwang YS, Lai CC, et al. Automated detection of early-stage ROP using a deep convolutional neural network. Br J Ophthalmol. 2021;105:1099–103. doi: 10.1136/bjophthalmol-2020-316526. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Mao J, Luo Y, Liu L, Lao J, Shao Y, Zhang M, et al. Automated diagnosis and quantitative analysis of plus disease in retinopathy of prematurity based on deep convolutional neural networks. Acta Ophthalmol. 2020;98:e339–45. doi: 10.1111/aos.14264. [DOI] [PubMed] [Google Scholar]
- 27.Ramachandran S, Niyas P, Vinekar A, John R. A deep learning framework for the detection of Plus disease in retinal fundus images of preterm infants. Biocybern Biomed Eng. 2021;41:362–75. [Google Scholar]
- 28.Tong Y, Lu W, Deng QQ, Chen C, Shen Y. Automated identification of retinopathy of prematurity by image-based deep learning. Eye Vis (Lond) 2020;7:40. doi: 10.1186/s40662-020-00206-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Wang J, Ji J, Zhang M, Lin JW, Zhang G, Gong W, et al. Automated explainable multidimensional deep learning platform of retinal images for retinopathy of prematurity screening. JAMA Netw Open. 2021;4:e218758. doi: 10.1001/jamanetworkopen.2021.8758. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Yildiz VM, Tian P, Yildiz I, Brown JM, Kalpathy-Cramer J, Dy J, et al. Plus disease in retinopathy of prematurity: Convolutional neural network performance using a combined neural network and feature extraction approach. Transl Vis Sci Technol. 2020;9:10. doi: 10.1167/tvst.9.2.10. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Zhang Y, Wang L, Wu Z, Zeng J, Chen Y, Tian R, et al. Development of an automated screening system for retinopathy of prematurity using a deep neural network for wide-angle retinal images. Ieee Access. 2019;7:10232–41. [Google Scholar]
- 32.Coyner AS, Swan R, Campbell JP, Ostmo S, Brown JM, Kalpathy-Cramer J, et al. Automated fundus image quality assessment in retinopathy of prematurity using deep convolutional neural networks. Ophthalmol Retina. 2019;3:444–50. doi: 10.1016/j.oret.2019.01.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Greenwald MF, Danford ID, Shahrawat M, Ostmo S, Brown J, Kalpathy-Cramer J, et al. Evaluation of artificial intelligence-based telemedicine screening for retinopathy of prematurity. J AAPOS. 2020;24:160–2. doi: 10.1016/j.jaapos.2020.01.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Gupta K, Campbell JP, Taylor S, Brown JM, Ostmo S, Chan RVP, et al. A Quantitative severity scale for retinopathy of prematurity using deep learning to monitor disease regression after treatment. JAMA Ophthalmol. 2019;137:1029–36. doi: 10.1001/jamaophthalmol.2019.2442. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Redd T, Campbell J, Brown J, Kim S, Ostmo S, Chan R, et al. Utilization of a deep learning image assessment tool for epidemiologic surveillance of retinopathy of prematurity. Invest Ophthalmol Vis Sci. 2019;60:580–4. [Google Scholar]
- 36.Smith K, Kim S, Goldstein I, Ostmo S, Chan R, Brown J, et al. Quantitative analysis of aggressive posterior retinopathy of prematurity using deep learning. Invest Ophthalmol Vis Sci. 2019;60:4759. [Google Scholar]
- 37.Taylor S, Kishan G, Campbell P, Brown J, Ostmo S, Chan R, et al. Invest Ophthalmol Vis Sci. 2018;59:3937. [Google Scholar]
- 38.Wallace DK, Zhao Z, Freedman SF. A pilot study using “ROPtool” to quantify plus disease in retinopathy of prematurity. J AAPOS. 2007;11:381–7. doi: 10.1016/j.jaapos.2007.04.008. [DOI] [PubMed] [Google Scholar]
- 39.Wang J, Zhang G, Lin J, Ji J, Qiu K, Zhang M. Application of standardized manual labeling on identification of retinopathy of prematurity images in deep learning. Zhonghua Shiyan Yanke Zazhi. 2019;37:653–7. [Google Scholar]
- 40.Campbell J, Chan R, Ostmo S, Anderson J, Singh P, Kalpathy-Cramer J, Chiang M. Analysis of the relationship between retinopathy of prematurity zone, stage, extent and a deep learning-based vascular severity scale. Invest Ophthalmol Vis Sci. 2020;61:2193. [Google Scholar]
- 41.Choi RY, Brown JM, Kalpathy-Cramer J, Chan RV, Ostmo S, Chiang MF, et al. Variability in plus disease identified using a deep learning-based retinopathy of prematurity severity scale. Ophthalmol Retina. 2020;4:1016–21. doi: 10.1016/j.oret.2020.04.022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Ramachandran S, Kochitty S, Vinekar V, John R. A fully convolutional neural network approach for the localization of optic disc in retinopathy of prematurity diagnosis. J Intell Fuzzy Syst. 2020;38:6269–78. [Google Scholar]
- 43.Worrall DE, Wilson C, Brostow GJ. Automated retinopathy of prematurity case detection with convolutional neural networks. In: Deep Learning and Data Labeling for Medical Applications. DLMIA, LABELS. 2016:68–76. [Google Scholar]
- 44.Ataer-Cansizoglu E, Bolon-Canedo V, Campbell JP, Bozkurt A, Erdogmus D, Kalpathy-Cramer J, et al. Computer-based image analysis for plus disease diagnosis in retinopathy of prematurity: performance of the “i-ROP” system and image features associated with expert diagnosis. Transl Vis Sci Technol. 2015;4:5. doi: 10.1167/tvst.4.6.5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Touch P, Wu Y, Kihara Y, Zepeda E, Gillette T, Cabrera M, et al. Development of AI deep learning algorithms for the quantification of retinopathy of prematurity. J Invest Med. 2019;67:209. [Google Scholar]
- 46.Wang S, Zhang Y, Lei S, Zhu H, Li J, Wang Q, et al. Performance of deep neural network-based artificial intelligence method in diabetic retinopathy screening: A systematic review and meta-analysis of diagnostic test accuracy. Eur J Endocrinol. 2020;183:41–9. doi: 10.1530/EJE-19-0968. [DOI] [PubMed] [Google Scholar]
- 47.Gschließer A, Stifter E, Neumayer T, Moser E, Papp A, Pircher N, et al. Inter-expert and intra-expert agreement on the diagnosis and treatment of retinopathy of prematurity. Am J Ophthalmol. 2015;160:553–60.e3. doi: 10.1016/j.ajo.2015.05.016. [DOI] [PubMed] [Google Scholar]
- 48.Campbell JP, Ataer-Cansizoglu E, Bolon-Canedo V, Bozkurt A, Erdogmus D, Kalpathy-Cramer J, et al. Expert diagnosis of plus disease in retinopathy of prematurity from computer-based image analysis. JAMA Ophthalmol. 2016;134:651–7. doi: 10.1001/jamaophthalmol.2016.0611. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.International Committee for the Classification of Retinopathy of Prematurity. The International classification of retinopathy of prematurity revisited. Arch Ophthalmol. 2005;123:991–9. doi: 10.1001/archopht.123.7.991. [DOI] [PubMed] [Google Scholar]
- 50.Chiang MF, Quinn GE, Fielder AR, Ostmo SR, Paul Chan RV, Berrocal A, et al. International classification of retinopathy of prematurity, third edition. Ophthalmology. 2021;128:e51–68. doi: 10.1016/j.ophtha.2021.05.031. [DOI] [PMC free article] [PubMed] [Google Scholar]
