Abstract
PURPOSE:
Artificial intelligence (AI) offers considerable promise for retinopathy of prematurity (ROP) screening and diagnosis. The development of deep-learning algorithms to detect the presence of disease may contribute to adequate screening, early detection, and timely treatment of this preventable blinding disease. This review aimed to systematically examine the literature on AI algorithms for detecting ROP. Specifically, we focused on the performance of deep-learning algorithms, as measured by sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC), for both the detection and grading of ROP.
METHODS:
We searched Medline OVID, PubMed, Web of Science, and Embase for studies published from January 1, 2012, to September 20, 2021. Studies evaluating the diagnostic performance of deep-learning models based on retinal fundus images, with expert ophthalmologists' judgment as the reference standard, were included. Studies that did not investigate the presence or absence of disease were excluded. Risk of bias was assessed using the QUADAS-2 tool.
RESULTS:
Twelve of the 175 studies identified were included. Five studies measured performance in detecting the presence of ROP and seven studies in detecting the presence of plus disease. The average AUROC across the 11 studies reporting it was 0.98. The average sensitivity and specificity were 95.72% and 98.15%, respectively, for detecting ROP, and 91.13% and 95.92%, respectively, for detecting plus disease.
CONCLUSION:
The diagnostic performance of deep-learning algorithms in published studies was high. Few studies presented externally validated results or compared performance to expert human graders. Large scale prospective validation alongside robust study design could improve future studies.
Keywords: Artificial intelligence, deep learning, diagnosis, retinopathy of prematurity, screening
INTRODUCTION
The concept of artificial intelligence (AI) dates back to the 1950s, when Alan Turing first discussed how to build and test intelligent machines in the paper “Computing Machinery and Intelligence.”[1] It was not until 1956, however, at the seminal Dartmouth Summer Research Project on AI conference, that John McCarthy officially coined the term AI. This conference introduced a computer program designed to mimic the problem-solving skills of a human, catalyzing the next 20 years of AI research.[2] Today, AI is incorporated into many applications of day-to-day life, including speech recognition, photo captioning, language translation, robotics, and even self-driving cars.[3,4,5] These applications are made possible through deep learning, an advanced form of AI that learns from large training sets to program itself to perform certain tasks.[6] The application of AI has gained popularity in the medical diagnostic field, and promising outcomes have resulted from deep-learning screening algorithms in ophthalmology.
There has been particular success in AI screening for diabetic retinopathy, with several groups reporting deep-learning algorithms that detect diabetic retinopathy at sensitivities of 83%–90% and specificities of 92%–98%.[7,8] Moreover, the successful validation of these algorithms has led to “real-world” implementation of screening programs through prospective evaluation. One such study reported a sensitivity of 83.3% and specificity of 92.5% for detecting referable diabetic retinopathy in a prospective evaluation.[8] Similarly promising results are being reported by many other groups using deep learning for the diagnosis of other ophthalmic conditions, including diabetic macular edema,[9] age-related macular degeneration,[10] glaucoma,[11] and retinopathy of prematurity (ROP).[12,13]
ROP is a retinal vascular proliferative disease affecting premature infants whose diagnosis depends on timely screening. Globally, it is estimated that at least 50,000 children are blind from ROP,[14] and it remains the leading cause of preventable childhood blindness.[15] Advances in retinal imaging mean the disease is now readily identifiable on retinal photographs, making it an ideal candidate for deep learning. As survival rates of premature infants continue to increase with medical advances,[16] the demand for ROP screening is rapidly exceeding the capacity of available specialist ophthalmologists. For this reason, reports of deep-learning models matching or exceeding human experts in ROP diagnostic performance have generated considerable interest. It remains fundamental, however, that this enthusiasm does not override the need for critical appraisal, as a missed diagnosis of ROP can result in significant sequelae such as blindness. Therefore, any deep-learning screening algorithm will need to show high diagnostic performance and high sensitivity, be generalizable, and be applicable to the real-world setting. In anticipation of deep-learning diagnostic tools being implemented into clinical practice, it is judicious to systematically review the body of evidence supporting AI screening for ROP. This systematic review aims to critically appraise the current state of diagnostic performance of deep-learning algorithms for ROP screening, with particular consideration of study design, algorithm development, type of validation, performance compared to clinicians, and diagnostic accuracy.
METHODS
Search strategy and selection criteria
Studies that developed or validated a deep-learning model for the diagnosis of ROP and compared the accuracy of algorithm diagnoses to ROP experts were included in this systematic review. We searched MEDLINE-Ovid, PubMed, Web of Science, and Embase for studies published from January 1, 2012, to September 20, 2021. The full search strategy for each database is available in Appendix 1. The cutoff of January 1, 2012, was prespecified to coincide with the breakthrough in deep-learning approaches marked by the AlexNet model.[17] The search was first performed on July 10, 2020, revised on May 23, 2021, and updated on September 20, 2021. Manual searches of bibliographies and citations from included studies were also completed to identify any additional articles potentially missed by the searches.
Eligibility assessment was conducted by two reviewers who independently screened the titles and abstracts of search results. Only studies aiming to identify the presence of the disease of interest, ROP, using AI algorithms were included. We accepted standard-of-care diagnosis, expert opinion, or consensus as adequate reference standards to classify the absence or presence of disease. We excluded studies that did not test diagnostic performance or that investigated the accuracy of image segmentation rather than disease classification. Studies that assessed the ability to classify disease severity were accepted if they incorporated primary results of disease detection. Review articles, conference abstracts, and studies that presented duplicate data were excluded. We assessed the risk of bias in patient selection, index test, reference standard, and flow and timing of each study using QUADAS-2.[18] The full assessment of bias can be found in Appendix 2.
This systematic review was completed following the recommendations of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA)[19] statement, and the research question was formulated according to the CHARMS[20] checklist for systematic reviews of prediction models. Methods of analysis and inclusion criteria were specified in advance.
Data analysis
Data were extracted independently by two reviewers (AB and SD) using a predefined data extraction sheet, followed by cross-checking. Any discrepancies were discussed with a third reviewer (CC). Demographics and sample size (gestational age [GA], birth weight, number of participants, and number of images), data characteristics (data source, inclusion and exclusion criteria, and image augmentation), algorithm development (architecture, transfer learning, and number of images for training and tuning), algorithm validation (reference standard, number of experts, same method for assessing reference standard, and internal and external validation), and results (sensitivity, specificity, and area under the receiver operating characteristic curve [AUROC] for the algorithm, human graders, and external validation if applicable) were sought. Two papers produced different algorithms from different data sets or with different identification tasks and were therefore recorded as separate algorithms in the Results section.[21,22] Data from all 12 papers were included and any missing information was recorded. Where sensitivity and specificity were not explicitly reported but could be calculated from a confusion matrix, the calculated results were included.
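For transparency, the sketch below illustrates this calculation. It is a minimal illustration, not the extraction code used for this review, and the counts shown are hypothetical placeholders.

```python
# Minimal sketch (not the review's extraction code): deriving sensitivity and
# specificity when a study reports only a 2x2 confusion matrix.
# The counts below are hypothetical placeholders.
def sens_spec(tp, fn, tn, fp):
    """Return (sensitivity %, specificity %) from confusion-matrix counts."""
    sensitivity = 100 * tp / (tp + fn)   # true-positive rate among diseased images
    specificity = 100 * tn / (tn + fp)   # true-negative rate among healthy images
    return sensitivity, specificity

sens, spec = sens_spec(tp=290, fn=10, tn=950, fp=20)
print(f"Sensitivity {sens:.1f}%, specificity {spec:.1f}%")   # 96.7%, 97.9%
```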
RESULTS
Our search identified 175 records, of which 99 were screened [Figure 1]. Thirty full-text articles were assessed for eligibility and 12 studies were included in the systematic review.[12,13,21,22,24,25,26,27,28,29,30,31] The remaining full-text articles were excluded because they did not test diagnostic performance,[32,33,34,35,36,37,38,39] had no classification task,[40,41,42] had no internal validation,[23,43] did not use an AI algorithm,[44] or were not based on standard clinical care.[45]
Figure 1.

Outline of study selection
Data characteristics and demographics
All 12 studies obtained retrospective images collected as part of routine clinical care or from local screening programs. Seven of these studies collected images from China,[22,24,25,26,28,29,31] one from India,[27] one from North America,[12] one from American and Mexican sites,[30] one from America and Nepal,[21] and one from New Zealand.[13] Image collection dates across the studies ranged from July 2011 to June 2020. Three studies specified their inclusion criteria[25,26,31] and five other studies specified their exclusion criteria.[12,13,21,28,29] Poor-quality images were excluded in five studies,[12,13,28,29,31] and image augmentation was applied in seven studies.[13,21,25,27,28,29,30] These characteristics are summarized in Table 1. Seven studies recorded demographic information,[21,24,25,26,27,29,31] with a mean GA of 30.9 weeks and a mean birth weight of 1501.25 g. A total of 178,459 images were used across the 12 studies, ranging from 2668 to 52,249 images per study. Five studies formulated an algorithm to detect ROP[21,22,24,25,31] and seven studies created an algorithm to detect the presence of plus disease, drawing on a total of 5358 plus disease images.[12,13,26,27,28,29,30] Full details of demographics and sample size are provided in Table 2.
Table 1.
Data characteristics for the 12 included studies
| Data characteristics | |||||||
|---|---|---|---|---|---|---|---|
|
| |||||||
| Study | Source of data | Date range | Open-access data | Missing data | Inclusion criteria | Exclusion criteria | Exclusion of poor-quality imaging/image augmentation |
| Brown et al., 2018 | Retrospective cohort, data collected at multiple hospitals across North America | July 2011-December 2016 | N | NR | NR | Stage 4-5 ROP | Y/NR |
| Chen et al., 2020 | |||||||
| American Trained Algorithm | Routine ROP screening from 9 North American institutions | NR | NR | NR | NR | Evidence of prior treatment (laser photocoagulation and scars), retinal detachment (stage >4), if artifacts obscure >50% of image | N/Y - SciPy package in Python, contrast enhancement, green channel extraction, resizing to 224×224 pixels |
| Nepalese Trained Algorithm | ROP screening program from 4 urban hospitals in Kathmandu, Nepal (Patan Hospital, Kanti Children's Hospital, Paropakar Maternity and Women's Hospital, Tilganga Institute of ophthalmology) | NR | NR | NR | NR | ||
| Hu et al., 2019 | Chengdu Women and Children's Central Hospital | 2014-2017 | NR | NR | NR | NR | NR/NR |
| Huang et al., 2020 | Neonatal intensive care unit of Chang Gung Memorial Hospital, Linkou, Taiwan | 17 Dec 2013-24 May 2019 | NR | NR | Premature infants with no ROP, stage 1 ROP and stage 2 ROP. ROP screening criteria: BW ≤1500g, GA ≤32 weeks, select few infants with BW 1500-2000 g or GA >32 weeks with unstable clinical condition | NR | N/Y |
| Mao et al., 2020 | Eye Hospital of Wenzhou Medical University | July 2013 - May 2018 | NR | NR | Only images of the posterior retina | NR | NR/NR |
| Ramachandran et al., 2021 | KIDROP Bangalore, India | NR | NR | NR | NR | NR | N/Y |
| Tan et al., 2019 | Auckland Regional Telemedicine ROP (ART-ROP) image library, from four neonatal ICUs in Auckland, New Zealand | 2006-2015 | NR | NR | NR | Poor image quality | Y - Not grossly out of focus, not affected by blur/Y |
| Tong et al., 2020 | Images collected from Renmin Hospital of Wuhan University Eye Centre | 1 Feb 2012-1 Oct 2016 | NR | NR | NR | 1. Poor image quality, 2. Imaging artefacts 3. Unfocused scans 4. Presence of other disease phenotypes (e.g., retinal haemorrhage) | Y/Y |
| Wang et al., 2021 | 4 centres in southern China - JSIEC of Shantou University and Chinese University of Hong Kong, Guangdong Women and Children Hospital in Yuexiu branch (Yuexiu) and Panyu branch (Panyu), and the Sixth Affiliated Hospital of Guangzhou Medical University and Qingyuan People's Hospital | 1 Sept 2018-24 June 2020 | NR | NR | NR | 1. Nonfundus photos or fundus photos taken by imaging devices other than RetCam 2. Infants with other ocular diseases e.g., congenital cataract, retinoblastoma, or persistent hyperplastic primary vitreous, and 3. any images with disagreeing labels | Y/Y |
| Wang et al., 2018 | Images captured during routine clinical ROP Screening from Chengdu Women and Children's Central Hospital | Jan 2018 | NR | NR | NR | NR | NR/NR |
| Yildiz et al., 2020 | 8 study centres: Columbia University, University of Illinois at Chicago, William Beaumont Hospital, Children's Hospital Los Angeles, Cedars-Sinai Medical Centre, University of Miami, Weill Cornell Medical Centre and Asociacion para Evitar la Ceguera en Mexico | July 2011 - December 2016 | NR | NR | NR | NR | N/Y - resized to 480×640-pixel input images |
| Zhang et al., 2018 | From telemedicine ROP trial (Telemed-R), images collected from 30 hospitals in Guangdong and Fujian Provinces of China | June 2013 | NR | NR | BW <2000 g; preterm infants with BW >2000 g but severe systemic disorders (as per paediatrician) | NR | Y - Highly blurry images, very dark or bright images, nonfundus photographs/NR |
NR: Not recorded, ROP: Retinopathy of prematurity, KIDROP: Karnataka Internet Assisted Diagnosis of ROP, ART: Auckland Regional Telemedicine, ICU: Intensive care unit, JSIEC: Joint Shantou International Eye centre, BW: Birth weight, GA: Gestational age, N: No, Y: Yes
Table 2.
Patient demographics and sample size for the 12 included studies
| Study | Mean GA (SD; range), weeks | BW (SD; range), grams | Mean age at screening (SD), weeks | Number of participants represented by training data | Number of images used in the study | Number of images with ROP | Number of images with plus disease | Number of images with preplus |
|---|---|---|---|---|---|---|---|---|
| Brown et al., 2018 | NR | NR | NR | 898 | 5511 | N/A | 172 | 805 |
| Chen et al., 2020 | | | | | | | | |
| American Trained Algorithm | 26.6 (2.2; NR) | 856.2 (293.7; NR) | NR | 711 | 5943 | NR | N/A | N/A |
| Nepalese Trained Algorithm | 32.6 (2.8; NR) | 1949.6 (495.8; NR) | NR | 541 | 5049 | NR | N/A | N/A |
| Hu et al., 2019 | 32 (NR; NR) | 1994 (NR; NR) | NR | 720 | 2668 | 1184 | N/A | N/A |
| Huang et al., 2020 | 27.3 (1.8; NR) | 936.4 (229.8; NR) | NR | NR | 10,235 | 1279: 557 (stage 1 ROP), 722 (stage 2 ROP) | NR | NR |
| Mao et al., 2020 | 31.3 (2.1; NR) training set; 31 (2; NR) test set | 1643 (419.5; NR) training set; 1583.3 (401.6; NR) test set | NR | 3021 | 6161 | N/A | 290 | 691 |
| Ramachandran et al., 2021 | 32.4 (1.1; NR) no plus; 30.9 (1.8; NR) plus | 1350 (240; NR) no plus; 1280 (226; NR) plus | NR | 150 | 289 | N/A | 89 | N/A |
| Tan et al., 2019 | NR | NR | NR | NR | 4926 (3487 suitable for training) | N/A | 1638 (post image preprocessing + data augmentation) | 0 |
| Tong et al., 2020 | NR | NR | NR | NR | 36,231 | N/A | 3006: 2745 (in training dataset), 261 (in test dataset) | NR |
| Wang et al., 2021 | 32.9 (3.1; NR) | 1925 (774; NR) | NR | 8652 | 52,249 | N/A | NR | NR |
| Wang et al., 2018 | | | | | | | | |
| Id-Net | NR* | NR* | NR | 1273 total for both Id-Net and Gr-Net: 605 developing data, 264 data for expert comparison, 404 data from web | 20,795: 13,526 (developing), 2361 (expert comparison), 4908 (from web) | 6917: 5967 (developing), 293 (expert), 657 (data from web) | N/A | N/A |
| Gr-Net | NR | NR | NR | | 5089: 4139 (developing), 293 (expert comparison), 657 (data from web) | Severe=2517: 2305 (developing), 120 (expert), 92 (web); Minor=2572: 1834 (developing), 173 (expert), 565 (from web) | N/A | N/A |
| Yildiz et al., 2020 | NR | NR | NR | NR | 5512 | N/A | 163 | 802 |
| Zhang et al., 2018 | 31.9-32 (NR; 24-36.4) | 1490-1500 (NR; 630-2000) | NR | NR | 17,801 | 8090 | N/A | N/A |
*Data represented as bar graph distribution, unable to calculate mean. NR: Not recorded, N/A: Not applicable, BW: Birth weight, GA: Gestational age, SD: Standard deviation, ROP: Retinopathy of prematurity
Algorithm development and validation
Convolutional neural networks formed the basis of the algorithms developed in all 12 studies. A variety of architectures and pretraining sources were used for transfer learning, including ResNet, ImageNet, U-Net, and VGG-16 [Table 3], whereas one study did not use a transfer-learning approach.[25] The majority of studies used <6000 images to train their algorithm; however, five studies used >10,000 images for algorithm development.[22,25,28,29,31] The reference standard across all 12 studies was based on disease diagnosis by 1–5 expert graders, with an average of 2.6 human graders agreeing upon each image per study. A variety of internal validation methods were reported, including random split-sample validation and cross-validation [Table 4]. Five studies[12,13,21,22,30] obtained external validation of their AI algorithms, of which one completed a prospective evaluation of algorithm performance.[22]
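As an illustration of the transfer-learning pattern reported in Table 3, the sketch below fine-tunes an ImageNet-pretrained ResNet-50 for a binary ROP label. It is a minimal example under assumed settings (hypothetical image folder, batch size, and learning rate), not the pipeline of any included study.

```python
# Illustrative sketch only (not any study's actual pipeline): fine-tuning an
# ImageNet-pretrained ResNet-50 for a binary ROP / no-ROP label.
# The dataset path and hyperparameters are hypothetical placeholders.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

tfms = transforms.Compose([
    transforms.Resize((224, 224)),          # match the pretrained input size
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
train_set = datasets.ImageFolder("fundus_images/train", transform=tfms)  # hypothetical path
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)   # transfer learning
model.fc = nn.Linear(model.fc.in_features, 2)   # replace classifier head: ROP vs. no ROP
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:                   # one training pass shown for brevity
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```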
Table 3.
Details of algorithm development for the 12 included studies
| Algorithm development | ||||
|---|---|---|---|---|
|
| ||||
| Study | Algorithm name | Algorithm architecture | Transfer learning applied | Number of images for training/tuning |
| Brown et al., 2018 | NR | U-Net architecture | Yes | 4409/1102 |
| Chen et al., 2020 | ||||
| American Trained Algorithm | NR | ImageNet and Pytorch | Yes (from ResNet) | 5235/NR |
| Nepal Trained Algorithm | NR | ImageNet and Pytorch | Yes (from ResNet) | 4802/NR |
| Hu et al., 2019 | NR | ImageNet and TensorFlow (Inception-v2, VGG-16, ResNet-50) | Yes (from ImageNet) | 2068/300 |
| Huang et al., 2020 | NR | Tensorflow | No | 10,235/1137 |
| Mao et al., 2020 | NR | U-Net and DenseNet | Yes (from ImageNet) | 5711/NR |
| Ramachandran et al., 2021 | NR | Darknet-53 | Yes (from ImageNet) | 289/32 (then retrained by 96 images for final model) |
| Tan et al., 2019 | ROP.AI | TensorFlow's Inception-v3 | Yes | 80% of 6974/NR |
| Tong et al., 2020 | NR | Faster R-CNN+TensorFlow | Yes (from ResNet) | 90% of 26,459/10% of 26,459 |
| Wang et al., 2021 | J-PROP | NR* | Yes (from Res-Unet) | 75% (39,029)/10% (5140) |
| Wang et al., 2018 | DeepROP | Tensorflow - Inception-BN Network | Yes (from ImageNet) | 17665/NR |
| Yildiz et al., 2020 | I-ROP ASSIST | NR* | Yes (from U-Net) | 5512/NR |
| Zhang et al., 2018 | CAD-R | NR* | Yes (from VGG-16) | 17801/NR |
*Specific architecture not recorded; however, a CNN was used. NR: Not recorded, AI: Artificial intelligence, ROP: Retinopathy of prematurity, CAD-R: Computer-aided diagnosis system for ROP, R-CNN: Region-convolutional neural networks, i-ROP: Imaging and informatics in ROP, BN: Batch normalised, VGG: Visual geometry group
Table 4.
Method of algorithm validation for the 12 included studies
| Algorithm validation | |||||||
|---|---|---|---|---|---|---|---|
|
| |||||||
| Study | Reference standard | If compared to experts, how many? | Same method for assessing reference standard across samples | Type of internal validation | Number of images for internal validation | External validation | Number of images for external validation |
| Brown et al., 2018 | Expert consensus | 3 | Yes | Random split sample validation | 20% | Yes | 100 |
| Chen et al., 2020 | |||||||
| American Trained Algorithm | Expert consensus | 3 | Yes | 5-fold cross-validation | 10% test set | Yes | 247 images from Nepal |
| Nepal Trained Algorithm | 1 | Yes | 5-fold cross-validation | 10% test set | Yes | 708 images from America | |
| Hu et al., 2019 | Expert consensus | 3 | Yes | Random split sample validation | 300 | No | N/A |
| Huang et al., 2020 | Expert consensus | 3 | Yes | 5-fold cross-validation | 244 | No | N/A |
| Mao et al., 2020 | Clinical diagnosis by one ophthalmologist | 1 | NR | Random split | 450 | No | N/A |
| Ramachandran et al., 2021 | Expert consensus | 3 | Yes | 80:20 split | 161 (67 ROP) | No | N/A |
| Tan et al., 2019 | Expert ophthalmologist from New Zealand; external images graded by expert ophthalmologist from Hong Kong | 1 | No - 2 different experts between internal and external validation | 80:20 random split validation | 20% of 6974 | Yes | 90 (33 plus, 57 normal) + additional 26 preplus images for assessing preplus |
| Tong et al., 2020 | Expert grading (11 retinal experts for first-round screening, 2 senior experts confirmed or corrected labels) | 2 | No - 11 different first round graders | 10-fold cross-validation | 9772 | No* | N/A |
| Wang et al., 2021 | Expert grading (2 junior ophthalmologists labelled, any disagreement submitted to 1 senior ophthalmologist) | 3 | No - dependent on agreement | Random split 75:10:15 (training, validation, test) - but based on a patient-based split policy (i.e., all images of a patient were allocated into the same sub-data set) | 8080 | No | N/A |
| Wang et al., 2018 | Expert consensus (images included if 2 out of 3 graders agreed, disagreements sent to fourth ophthalmologist) | 3-4 | Yes | Random split | 298 (for Id-Net), 104 (for Gr-Net) | Yes - prospective evaluation | 2361 (total, Id and Gr net) |
| Yildiz et al., 2020 | Expert consensus | 3 | Yes | 5-fold cross-validation | 5000 | Yes | 100 (15 plus, 34 preplus) |
| Zhang et al., 2018 | Cross validation by one senior ophthalmologist | 5 (2 senior experts, 2 attending physicians, 1 resident) | Yes | Random selection | 1742 (155 ROP, 1587 without ROP) | No | N/A |
*No external validation, however did measure algorithm versus human graders on another 1227 images collected during routine clinical care. ROP: Retinopathy of prematurity, N/A: Not applicable, NR: Not recorded
Algorithm performance
The performance of each algorithm is listed in Table 5. Five studies recorded the ability of their algorithm to detect the presence of ROP, with an average area under the receiver operating characteristic curve (AUROC) of 0.984.[21,22,24,25,31] Sensitivity and specificity were recorded in four of those studies and averaged 95.72% and 98.15%, respectively.[22,24,25,31] One study compared human grader performance to the AI algorithm, revealing similar sensitivities (94.1% AI, 93.5% human) and specificities (99.3% AI, 99.5% human) for ROP diagnosis.[31] Two of the five studies underwent external validation, revealing an average sensitivity and specificity of 60% and 88.3%, respectively, for detecting the presence of disease.[21,22] The seven other studies assessed the ability of their algorithm to detect the presence of plus disease. Among these, six studies measured AUROC, for which the average was 0.98.[12,13,26,27,29,30] The average sensitivity and specificity for detecting plus disease, recorded in six studies, were 91.13% and 95.92%, respectively.[12,13,26,27,28,29] External validation occurred in two of these studies and produced an average sensitivity of 93.45% and specificity of 87.35%.[12,13] The performance of AI algorithms at detecting pre-plus disease was measured in two articles, producing an average sensitivity of 96.2% and specificity of 95.7%.[12,26] In comparison, four studies measured performance in determining the stage of ROP, showing an average sensitivity and specificity of 89.07% and 94.63%, respectively.[22,25,28,29]
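For readers less familiar with these metrics, the short sketch below shows how AUROC, sensitivity, and specificity are conventionally derived from a model's predicted probabilities. The arrays are toy placeholders rather than data from any included study, and the 0.5 threshold is an arbitrary illustrative cutoff.

```python
# Toy illustration (not data from any included study) of how AUROC, sensitivity,
# and specificity are computed from predicted probabilities.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # 1 = plus disease present (reference standard)
y_prob = np.array([0.92, 0.10, 0.85, 0.40, 0.70, 0.05, 0.71, 0.30])  # model output

auroc = roc_auc_score(y_true, y_prob)          # threshold-free ranking metric
y_pred = (y_prob >= 0.5).astype(int)           # sensitivity/specificity require a cutoff
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity, specificity = tp / (tp + fn), tn / (tn + fp)
print(f"AUROC {auroc:.2f}, sensitivity {sensitivity:.0%}, specificity {specificity:.0%}")
```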
Table 5.
Summary of results from the 12 included studies
Algorithm performance

Detecting disease

| Study | Sens % | Spec % | AUROC |
|---|---|---|---|
| Hu et al. 2019 | 96 | 98 | 0.9922 |
| Zhang et al. 2018 | 94.1 | 99.3 | 0.998 |
| Chen et al. 2020 | | | |
| American Trained Algorithm | NR | NR | 0.99 |
| Nepal Trained Algorithm | NR | NR | 0.96 |
| Combined (American & Nepal) Trained Algorithm | NR | NR | 0.99 |

Detecting disease and stage

| Study | Sens % | Spec % | AUROC | Sens grading % | Spec grading % | AUROC grading |
|---|---|---|---|---|---|---|
| Huang et al. 2020 | 96.14±0.87 | 95.95±0.48 | 0.96 | 91.82±2.03 (stage 1) | 94.5±0.71 (stage 1) | 0.93 (stage 1) |
| Wang et al. 2018 | | | | | | |
| Id-Net | 96.64 | 99.33 | 0.995 | n/a | n/a | n/a |
| Gr-Net | n/a | n/a | n/a | 88.46 (minor vs. severe) | 92.31 (minor vs. severe) | 0.951 (minor vs. severe) |

Detecting plus disease

| Study | Sens % | Spec % | AUROC | Sens pre-plus % | Spec pre-plus % | AUROC pre-plus |
|---|---|---|---|---|---|---|
| Brown et al. 2018 | 93 | 94 | 0.98 | 100 | 94 | NR |
| Mao et al. 2020 | 95.1 | 97.8 | 0.99 | 92.4 | 97.4 | NR |
| Ramachandran et al. 2021 | 99 | 98 | 0.9947 | n/a | n/a | n/a |
| Tan et al. 2019 | 96.6 | 98 | 0.993 | n/a | n/a | n/a |
| Yildiz et al. 2020 | NR | NR | 0.94 | NR | NR | 0.88 |

Detecting plus disease and severity

| Study | Sens % | Spec % | AUROC | Sens grading % | Spec grading % | AUROC grading |
|---|---|---|---|---|---|---|
| Wang et al. 2021 | 91.8 | 97 | 0.983 | 98.2 (stage) | 98.5 (stage) | 0.998 (stage) |
| Tong et al. 2020 | 71.3 | 90.7 | NR | 77.8 (“normal” “mild” “semi-urgent” “urgent”) | 93.2 (“normal” “mild” “semi-urgent” “urgent”) | NR |

Human performance and external validation

Detecting disease

| Study | Human sens % | Human spec % | Human sens grading % | Human spec grading % | Human AUROC | External validation sens % | External validation spec % | External validation AUROC |
|---|---|---|---|---|---|---|---|---|
| Hu et al. 2019 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Zhang et al. 2018 | 93.5 | 99.5 | n/a | n/a | n/a | n/a | n/a | n/a |
| Chen et al. 2020 | | | | | | | | |
| American Trained Algorithm | n/a | n/a | n/a | n/a | n/a | 52 | 99 | 0.96 |
| Nepal Trained Algorithm | n/a | n/a | n/a | n/a | n/a | 44 | 69 | 0.62 |
| Combined (American & Nepal) Trained Algorithm | n/a | n/a | n/a | n/a | n/a | 98/82 (against American/Nepal set) | 96/99 (against American/Nepal set) | 0.99/0.98 (against American/Nepal set) |

Detecting disease and stage

| Study | Human sens % | Human spec % | Human sens grading % | Human spec grading % | Human AUROC | External validation sens % | External validation spec % | External validation AUROC | External validation sens grading % | External validation spec grading % | External validation AUROC grading |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Huang et al. 2020 | NR | NR | NR | NR | NR | n/a | n/a | n/a | n/a | n/a | n/a |
| Wang et al. 2018 | | | | | | | | | | | |
| Id-Net | NR | NR | NR | NR | NR | 84.91 | 96.9 | NR | n/a | n/a | n/a |
| Gr-Net | NR | NR | NR | NR | NR | n/a | n/a | n/a | 93.33 (minor vs. severe) | 73.63 (minor vs. severe) | NR |

Detecting plus disease

| Study | Human sens % | Human spec % | Human sens grading % | Human spec grading % | Human AUROC | External validation sens % | External validation spec % | External validation AUROC | External validation sens pre-plus % | External validation spec pre-plus % | External validation AUROC pre-plus |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Brown et al. 2018 | n/a | n/a | n/a | n/a | n/a | 93 | 94 | NR | 100 | 94 | NR |
| Mao et al. 2020 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Ramachandran et al. 2021 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| Tan et al. 2019 | n/a | n/a | n/a | n/a | n/a | 93.9 | 80.7 | NR | 81.4 | 80.7 | 0.977 |
| Yildiz et al. 2020 | n/a | n/a | n/a | n/a | n/a | NR | NR | 0.99 | NR | NR | 0.97 |

Detecting plus disease and severity

| Study | Human sens % | Human spec % | Human sens grading % | Human spec grading % | Human AUROC | External validation sens % | External validation spec % | External validation AUROC |
|---|---|---|---|---|---|---|---|---|
| Wang et al. 2021 | 100 (compared to J-PROP on same dataset: 100) | 99.8 (compared to J-PROP on same dataset: 98.4) | 91.7 (stage) (compared to J-PROP on same dataset: 97.9) | 99.1 (stage) (compared to J-PROP on same dataset: 97.4) | NR | n/a | n/a | n/a |
| Tong et al. 2020 | 74.8 (expert 1), 65.9 (expert 2) (for grading “normal” “mild” “semi-urgent” “urgent”) | 93.4 (expert 1), 92.3 (expert 2) (for grading “normal” “mild” “semi-urgent” “urgent”) | n/a | n/a | NR | n/a | n/a | n/a |
DISCUSSION
We found that deep-learning algorithms for ROP screening demonstrated sensitivity and specificity comparable to those of neural network algorithms for diabetic retinopathy.[46] Although these estimates support the potential for deep-learning algorithms to be implemented as real-world diagnostic tools, several methodological deficiencies were common across the included studies and need to be considered. These include the quality of the reference standard, the use of sample size calculations, external validation, the definition of the presence or absence of disease, and the need for prospective evaluation.
First, we found variability in the specific diagnostic targets, with the 12 papers split between diagnosing the presence of ROP as a whole and the presence of plus disease. It is important to differentiate these diagnostic targets, as the clinical implications of the findings will differ. In addition, most studies used a reference standard graded by, on average, 2–3 experts, with only one study producing a reference standard diagnosed by five clinicians per image.[31] It is well reported that there is significant intergrader variability in ROP diagnosis due to its subjective nature;[47,48] therefore, caution is needed in recognizing the potential for grader bias in studies using only a few expert graders.
Second, there was large variation in the number of images used to train each algorithm, ranging from 289[27] to 39,029 images.[29] Convolutional neural networks learn by computing the error between the machine's output and the image diagnosis; hence, the more images used to train a model, the smaller the error of its diagnostic output.[6] For this reason, the studies with sample sizes in the tens of thousands were likely to have more reliable results than those trained on hundreds or thousands of images. Nonetheless, no study reported a formal sample size calculation to ensure sufficient sizing, and only one paper acknowledged sample size as a limitation.[25] Despite the challenge of sample size calculations in the context of AI algorithms, they remain a principal component of study design. Future studies should consider performing sample size calculations to justify the number of images required for algorithm development.
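As one possible starting point, the sketch below applies Buderer's commonly cited formula for sizing a diagnostic accuracy study around a target sensitivity; the target sensitivity, precision, and prevalence values are illustrative assumptions only, not recommendations from this review.

```python
# Hedged sketch of one conventional approach (Buderer's formula) to sample size
# for diagnostic sensitivity; all input values are illustrative assumptions.
from math import ceil

def n_for_sensitivity(expected_sens, precision, prevalence, z=1.96):
    """Images needed so the 95% CI around sensitivity spans +/- `precision`."""
    n_diseased = (z ** 2) * expected_sens * (1 - expected_sens) / precision ** 2
    return ceil(n_diseased / prevalence)   # scale up by the expected disease prevalence

# e.g. expecting 95% sensitivity, +/-3% precision, 20% of screened images showing ROP
print(n_for_sensitivity(expected_sens=0.95, precision=0.03, prevalence=0.20))  # ~1014 images
```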
Third, the exclusion of poor-quality images or the use of image augmentation may affect how these deep-learning algorithms perform in the real-world clinical setting. Image quality can limit the diagnostic performance of an algorithm, as high-quality images correlate with high-quality diagnoses and smaller algorithm errors.[6] It is therefore understandable that most papers exclude poor-quality images; however, this should be kept within reason. The quality of images used to train an algorithm should correspond to the quality of images taken in the clinical setting so that measured performance reflects real-life performance. For the same reason, external validation of an algorithm on an image set outside the training set is crucial for determining the generalizability of a study. Only five of the 12 studies completed external validation, and all but one, which showed equivalent performance, revealed inferior algorithm performance compared with their internal test set. This finding highlights the need for out-of-sample external validation of these screening algorithms to better understand how they will perform in the clinical setting.
Fourth, the ground truth or reference standard labels were mostly derived from data collected for other purposes, such as databases of ROP images or retrospective routine clinical care notes. Although there is an internationally accepted guideline for defining the presence and stage of ROP, the International Classification of Retinopathy of Prematurity revisited (ICROP)[49] (updated in a third edition in 2021[50]), only five studies specifically mentioned the ICROP in their methods for defining the reference standard. As the ICROP is the universally adopted diagnostic criteria for grading ROP, it is reasonable to assume that the other seven studies also used these guidelines; however, the criteria for the presence or absence of disease should always be clearly defined in AI studies.
Finally, only one study completed a prospective evaluation of its algorithm, a process that is vital for assessing real-world performance. The majority of studies assessed deep-learning diagnostic accuracy in isolation, without external validation, as mentioned earlier, or comparison to experts. Only three studies compared AI performance with human performance, allowing evaluation of real-world application. Without such comparison, the results from the remaining studies are limited in their ability to be extrapolated to health-care delivery. For a deep-learning diagnostic tool to be applicable to clinical bedside screening, it must perform comparably to or better than the gold standard, in this case expert diagnosis. More work is required to validate the performance of AI algorithms against human graders, ideally using the same external test dataset.
It is clear from this systematic review that there is still no well-designed, randomized, head-to-head comparison of an effective, externally validated AI algorithm with human performance in real time. A study of this magnitude could reveal the clinical implications of implementing an algorithm in the clinical setting. For this reason, prospective evaluations of these deep-learning diagnostic tests are crucial to reveal the full potential of AI in both diagnostic and therapeutic medicine. We recognize that there is a substantial “black box” issue in deep learning, whereby the image features learned by an algorithm are unknown to the user.[6] It is for this reason that many clinicians are sceptical about entrusting clinical care to AI, especially when the clinical features clinicians are familiar with may not be the same features used by an algorithm. This further emphasizes the need for well-executed studies that minimize bias and are thoroughly and transparently reported. Most of the concerns we have highlighted in this review are avoidable with robust design, and it remains critical that these AI diagnostic tests are evaluated in the context of their intended clinical pathway.
CONCLUSION
AI has been heralded as a revolutionary technology for many industries, and deep-learning algorithms for the diagnosis of ROP are no exception. Despite the issues we have highlighted in this systematic review, the performance of the 12 deep-learning algorithms evaluated was extremely high, with every study that reported an AUROC achieving a value of 0.94 or above. These results outline the ability of AI algorithms to perform comparably to, or exceed, human experts and provide the groundwork for future large-scale prospective studies. Although there are clear screening and treatment guidelines, ROP disease burden continues to rise as increased survival of preterm infants coincides with advancements in medical care.[15] Inadequate accessibility and numbers of experienced ophthalmologists continue to limit ROP screening and diagnosis. Consequently, the burden of ROP visual impairment is expected to increase unless a novel strategy such as deep-learning diagnostic algorithms becomes available. There is no doubt that the successful application of AI in ROP would revolutionize disease diagnosis through its high predictive performance and streamlined efficiency. The clinical implications of implementation into real-world clinical practice are far-reaching, with translation into highly accessible, high-quality, timely screening and a significant reduction in the cost of screening. AI will therefore become ubiquitous and indispensable for ROP screening, and it is important that high-quality research continues to aid the translation of this transformative technology in order to reduce the incidence of visual loss and blindness from this preventable disease.
Financial support and sponsorship
Nil.
Conflicts of interest
There are no conflicts of interest.
Appendix 1: Full search strategy
| We show the search strategy for a. Medline OVID b. PubMed c. Web of Science d. Embase |
| Medline OVID |
| “Retinopathy of prematurity” or “ROP” and |
| “Diagnosis” or “screening” and |
| “Artificial intelligence” or “deep learning” or “convolutional neural networks” |
| PubMed |
| “Retinopathy of prematurity” or “ROP” and “diagnosis” or “screening” AND “artificial intelligence” or “deep learning” or “convolutional neural network” |
| Web of science |
| TI = (diagnosis or screening or classification) and |
| TS = (artificial intelligence or machine learning or deep learning or convolutional neural network) and |
| TI = (retinopathy of prematurity or ROP) |
| Embase |
| “Retinopathy of Prematurity” or “ROP” or “plus disease” and |
| “Diagnosis” or “screening” or “classification” and |
| “Artificial intelligence” or “deep learning” or “convolutional neural network” or “machine learning” |
ROP: Retinopathy of prematurity
Appendix 2: Methodological quality assessment of bias for included studies using QUADAS-2[18]
| Study | Domain 1A | Domain 1B | Domain 2A | Domain 2B | Domain 3A | Domain 3B | Domain 4A |
|---|---|---|---|---|---|---|---|
| Brown et al. 2018 | Unclear | Low | Low | Low | Low | Low | Low |
| Chen et al. 2020 | Low | Low | Low | Low | Unclear | Low | Low |
| Hu et al. 2019 | Unclear | Unclear | Low | Low | Low | Low | Low |
| Huang et al. 2020 | Low | Low | Low | Low | Low | Low | Low |
| Mao et al. 2020 | High | Unclear | Low | Low | High | Low | Unclear |
| Ramachandran et al. 2021 | High | Low | Low | Low | Low | Low | Low |
| Tan et al. 2019 | Low | Low | Low | Low | Low | Low | Low |
| Tong et al. 2020 | Low | Low | Low | Low | High | High | High |
| Wang et al. 2021 | Low | Low | Low | Low | Unclear | Low | Unclear |
| Wang et al. 2018 | Low | Low | Low | Low | Low | Low | Low |
| Yildiz et al. 2020 | High | Low | Low | High | Low | Low | Low |
| Zhang et al. 2018 | High | Low | Low | Low | Low | Low | Unclear |
REFERENCES
- 1.Turing A. Computing machinery and intelligence. Mind. 1950;59:433–60. [Google Scholar]
- 2.McCarthy J, Minsky M, Rochester N, Shannon C. A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence, 1955. AI Magazine. 2006;27:12. [Google Scholar]
- 3.Wu J, Yılmaz E, Zhang M, Li H, Tan KC. Deep spiking neural networks for large vocabulary automatic speech recognition. Front Neurosci. 2020;14:199. doi: 10.3389/fnins.2020.00199. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Popel M, Tomkova M, Tomek J, Kaiser Ł, Uszkoreit J, Bojar O, et al. Transforming machine translation: A deep learning system reaches news translation quality comparable to human professionals. Nat Commun. 2020;11:4381. doi: 10.1038/s41467-020-18073-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Fayyad J, Jaradat MA, Gruyer D, Najjaran H. Deep learning sensor fusion for autonomous vehicle perception and localization: A review. Sensors (Basel) 2020;20:e4220. doi: 10.3390/s20154220. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–44. doi: 10.1038/nature14539. [DOI] [PubMed] [Google Scholar]
- 7.Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316:2402–10. doi: 10.1001/jama.2016.17216. [DOI] [PubMed] [Google Scholar]
- 8.Zhang Y, Shi J, Peng Y, Zhao Z, Zheng Q, Wang Z, et al. Artificial intelligence-enabled screening for diabetic retinopathy: A real-world, multicenter and prospective study. BMJ Open Diabetes Res Care. 2020;8:e001596. doi: 10.1136/bmjdrc-2020-001596. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.De Fauw J, Ledsam JR, Romera-Paredes B, Nikolov S, Tomasev N, Blackwell S, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat Med. 2018;24:1342–50. doi: 10.1038/s41591-018-0107-6. [DOI] [PubMed] [Google Scholar]
- 10.Burlina PM, Joshi N, Pekala M, Pacheco KD, Freund DE, Bressler NM. Automated grading of age-related macular degeneration from color fundus images using deep convolutional neural networks. JAMA Ophthalmol. 2017;135:1170–6. doi: 10.1001/jamaophthalmol.2017.3782. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Li Z, He Y, Keel S, Meng W, Chang RT, He M. Efficacy of a deep learning system for detecting glaucomatous optic neuropathy based on color fundus photographs. Ophthalmology. 2018;125:1199–206. doi: 10.1016/j.ophtha.2018.01.023. [DOI] [PubMed] [Google Scholar]
- 12.Brown JM, Campbell JP, Beers A, Chang K, Ostmo S, Chan RVP, et al. Automated diagnosis of plus disease in retinopathy of prematurity using deep convolutional neural networks. JAMA Ophthalmol. 2018;136:803–10. doi: 10.1001/jamaophthalmol.2018.1934. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Tan Z, Simkin S, Lai C, Dai S. Deep learning algorithm for automated diagnosis of retinopathy of prematurity plus disease. Transl Vis Sci Technol. 2019;8:23. doi: 10.1167/tvst.8.6.23. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Blencowe H, Lawn JE, Vazquez T, Fielder A, Gilbert C. Preterm-associated visual impairment and estimates of retinopathy of prematurity at regional and global levels for 2010. Pediatr Res. 2013;74(Suppl 1):35–49. doi: 10.1038/pr.2013.205. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Gilbert C. Retinopathy of prematurity: A global perspective of the epidemics, population of babies at risk and implications for control. Early Hum Dev. 2008;84:77–82. doi: 10.1016/j.earlhumdev.2007.11.009. [DOI] [PubMed] [Google Scholar]
- 16.Valentine PH, Jackson JC, Kalina RE, Woodrum DE. Increased survival of low birth weight infants: Impact on the incidence of retinopathy of prematurity. Pediatrics. 1989;84:442–5. [PubMed] [Google Scholar]
- 17.Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet large scale visual recognition challenge. Int J Comput Vis. 2015;115:211–52. [Google Scholar]
- 18.Whiting PF, Rutjes AW, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, et al. QUADAS-2: A revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011;155:529–36. doi: 10.7326/0003-4819-155-8-201110180-00009. [DOI] [PubMed] [Google Scholar]
- 19.Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. Int J Surg. 2021;88:105906. doi: 10.1016/j.ijsu.2021.105906. [DOI] [PubMed] [Google Scholar]
- 20.Moons KG, de Groot JA, Bouwmeester W, Vergouwe Y, Mallett S, Altman DG, et al. Critical appraisal and data extraction for systematic reviews of prediction modelling studies: The CHARMS checklist. PLoS Med. 2014;11:e1001744. doi: 10.1371/journal.pmed.1001744. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Chen JS, Coyner AS, Ostmo S, Sonmez K, Bajimaya S, Pradhan E, et al. Deep learning for the diagnosis of stage in retinopathy of prematurity: Accuracy and generalizability across populations and cameras. Ophthalmol Retina. 2021;5:1027–35. doi: 10.1016/j.oret.2020.12.013. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Wang J, Ju R, Chen Y, Zhang L, Hu J, Wu Y, et al. Automated retinopathy of prematurity screening using deep neural networks. EBioMedicine. 2018;35:361–8. doi: 10.1016/j.ebiom.2018.08.033. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Campbell JP, Singh P, Redd TK, Brown JM, Shah PK, Subramanian P, et al. Applications of artificial intelligence for retinopathy of prematurity screening. Pediatrics. 2021;147:e2020016618. doi: 10.1542/peds.2020-016618. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Hu J, Chen Y, Zhong J, Ju R, Yi Z. Automated analysis for retinopathy of prematurity by deep neural networks. IEEE Trans Med Imaging. 2019;38:269–79. doi: 10.1109/TMI.2018.2863562. [DOI] [PubMed] [Google Scholar]
- 25.Huang YP, Basanta H, Kang EY, Chen KJ, Hwang YS, Lai CC, et al. Automated detection of early-stage ROP using a deep convolutional neural network. Br J Ophthalmol. 2021;105:1099–103. doi: 10.1136/bjophthalmol-2020-316526. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Mao J, Luo Y, Liu L, Lao J, Shao Y, Zhang M, et al. Automated diagnosis and quantitative analysis of plus disease in retinopathy of prematurity based on deep convolutional neural networks. Acta Ophthalmol. 2020;98:e339–45. doi: 10.1111/aos.14264. [DOI] [PubMed] [Google Scholar]
- 27.Ramachandran S, Niyas P, Vinekar A, John R. A deep learning framework for the detection of Plus disease in retinal fundus images of preterm infants. Biocybern Biomed Eng. 2021;41:362–75. [Google Scholar]
- 28.Tong Y, Lu W, Deng QQ, Chen C, Shen Y. Automated identification of retinopathy of prematurity by image-based deep learning. Eye Vis (Lond) 2020;7:40. doi: 10.1186/s40662-020-00206-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Wang J, Ji J, Zhang M, Lin JW, Zhang G, Gong W, et al. Automated explainable multidimensional deep learning platform of retinal images for retinopathy of prematurity screening. JAMA Netw Open. 2021;4:e218758. doi: 10.1001/jamanetworkopen.2021.8758. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Yildiz VM, Tian P, Yildiz I, Brown JM, Kalpathy-Cramer J, Dy J, et al. Plus disease in retinopathy of prematurity: Convolutional neural network performance using a combined neural network and feature extraction approach. Transl Vis Sci Technol. 2020;9:10. doi: 10.1167/tvst.9.2.10. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Zhang Y, Wang L, Wu Z, Zeng J, Chen Y, Tian R, et al. Development of an automated screening system for retinopathy of prematurity using a deep neural network for wide-angle retinal images. Ieee Access. 2019;7:10232–41. [Google Scholar]
- 32.Coyner AS, Swan R, Campbell JP, Ostmo S, Brown JM, Kalpathy-Cramer J, et al. Automated fundus image quality assessment in retinopathy of prematurity using deep convolutional neural networks. Ophthalmol Retina. 2019;3:444–50. doi: 10.1016/j.oret.2019.01.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Greenwald MF, Danford ID, Shahrawat M, Ostmo S, Brown J, Kalpathy-Cramer J, et al. Evaluation of artificial intelligence-based telemedicine screening for retinopathy of prematurity. J AAPOS. 2020;24:160–2. doi: 10.1016/j.jaapos.2020.01.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Gupta K, Campbell JP, Taylor S, Brown JM, Ostmo S, Chan RVP, et al. A Quantitative severity scale for retinopathy of prematurity using deep learning to monitor disease regression after treatment. JAMA Ophthalmol. 2019;137:1029–36. doi: 10.1001/jamaophthalmol.2019.2442. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Redd T, Campbell J, Brown J, Kim S, Ostmo S, Chan R, et al. Utilization of a deep learning image assessment tool for epidemiologic surveillance of retinopathy of prematurity. Invest Ophthalmol Vis Sci. 2019;60:580–4. [Google Scholar]
- 36.Smith K, Kim S, Goldstein I, Ostmo S, Chan R, Brown J, et al. Quantitative analysis of aggressive posterior retinopathy of prematurity using deep learning. Invest Ophthalmol Vis Sci. 2019;60:4759. [Google Scholar]
- 37.Taylor S, Kishan G, Campbell P, Brown J, Ostmo S, Chan R, et al. Invest Ophthalmol Vis Sci. 2018;59:3937. [Google Scholar]
- 38.Wallace DK, Zhao Z, Freedman SF. A pilot study using “ROPtool” to quantify plus disease in retinopathy of prematurity. J AAPOS. 2007;11:381–7. doi: 10.1016/j.jaapos.2007.04.008. [DOI] [PubMed] [Google Scholar]
- 39.Wang J, Zhang G, Lin J, Ji J, Qiu K, Zhang M. Application of standardized manual labeling on identification of retinopathy of prematurity images in deep learning. Zhonghua Shiyan Yanke Zazhi. 2019;37:653–7. [Google Scholar]
- 40.Campbell J, Chan R, Ostmo S, Anderson J, Singh P, Kalpathy-Cramer J, Chiang M. Analysis of the relationship between retinopathy of prematurity zone, stage, extent and a deep learning-based vascular severity scale. Invest Ophthalmol Vis Sci. 2020;61:2193. [Google Scholar]
- 41.Choi RY, Brown JM, Kalpathy-Cramer J, Chan RV, Ostmo S, Chiang MF, et al. Variability in plus disease identified using a deep learning-based retinopathy of prematurity severity scale. Ophthalmol Retina. 2020;4:1016–21. doi: 10.1016/j.oret.2020.04.022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Ramachandran S, Kochitty S, Vinekar V, John R. A fully convolutional neural network approach for the localization of optic disc in retinopathy of prematurity diagnosis. J Intell Fuzzy Syst. 2020;38:6269–78. [Google Scholar]
- 43.Worrall DE, Wilson C, Brostow GJ. Automated retinopathy of prematurity case detection with convolutional neural networks. In: Deep Learning and Data Labeling for Medical Applications. DLMIA, LABELS. 2016:68–76. [Google Scholar]
- 44.Ataer-Cansizoglu E, Bolon-Canedo V, Campbell JP, Bozkurt A, Erdogmus D, Kalpathy-Cramer J, et al. Computer-based image analysis for plus disease diagnosis in retinopathy of prematurity: performance of the “i-ROP” system and image features associated with expert diagnosis. Transl Vis Sci Technol. 2015;4:5. doi: 10.1167/tvst.4.6.5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Touch P, Wu Y, Kihara Y, Zepeda E, Gillette T, Cabrera M, et al. Development of AI deep learning algorithms for the quantification of retinopathy of prematurity. J Invest Med. 2019;67:209. [Google Scholar]
- 46.Wang S, Zhang Y, Lei S, Zhu H, Li J, Wang Q, et al. Performance of deep neural network-based artificial intelligence method in diabetic retinopathy screening: A systematic review and meta-analysis of diagnostic test accuracy. Eur J Endocrinol. 2020;183:41–9. doi: 10.1530/EJE-19-0968. [DOI] [PubMed] [Google Scholar]
- 47.Gschließer A, Stifter E, Neumayer T, Moser E, Papp A, Pircher N, et al. Inter-expert and intra-expert agreement on the diagnosis and treatment of retinopathy of prematurity. Am J Ophthalmol. 2015;160:553–60.e3. doi: 10.1016/j.ajo.2015.05.016. [DOI] [PubMed] [Google Scholar]
- 48.Campbell JP, Ataer-Cansizoglu E, Bolon-Canedo V, Bozkurt A, Erdogmus D, Kalpathy-Cramer J, et al. Expert diagnosis of plus disease in retinopathy of prematurity from computer-based image analysis. JAMA Ophthalmol. 2016;134:651–7. doi: 10.1001/jamaophthalmol.2016.0611. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.International Committee for the Classification of Retinopathy of Prematurity. The International classification of retinopathy of prematurity revisited. Arch Ophthalmol. 2005;123:991–9. doi: 10.1001/archopht.123.7.991. [DOI] [PubMed] [Google Scholar]
- 50.Chiang MF, Quinn GE, Fielder AR, Ostmo SR, Paul Chan RV, Berrocal A, et al. International classification of retinopathy of prematurity, third edition. Ophthalmology. 2021;128:e51–68. doi: 10.1016/j.ophtha.2021.05.031. [DOI] [PMC free article] [PubMed] [Google Scholar]
