Eye. 2022 Dec 28;37(12):2518–2526. doi: 10.1038/s41433-022-02366-y

Artificial intelligence for the diagnosis of retinopathy of prematurity: A systematic review of current algorithms

Ashwin Ramanathan 1, Sam Ebenezer Athikarisamy 2,3, Geoffrey C Lam 4,5
PMCID: PMC10397194  PMID: 36577806

Abstract

Background/Objectives

With the increasing survival of premature infants, there is an increased demand to provide adequate retinopathy of prematurity (ROP) services. Wide field digital retinal imaging (WFDRI) and artificial intelligence (AI) have shown promise in ROP and have the potential to improve diagnostic performance and reduce the workload of screening ophthalmologists. The aim of this review is to systematically summarise the diagnostic characteristics of existing deep learning algorithms.

Subject/Methods

Two authors independently searched the literature, and studies using a deep learning system from retinal imaging were included. Data were extracted, assessed and reported using PRISMA guidelines.

Results

Twenty-seven studies were included in this review. Nineteen studies used AI systems to diagnose ROP, classify its staging, diagnose the presence of pre-plus or plus disease, or assess the quality of retinal images. For their primary outcomes, the included studies reported a sensitivity of 71–100%, a specificity of 74–99% and an area under the curve of 0.91–0.99. AI techniques were comparable to assessment by ophthalmologists in overall accuracy and sensitivity. Eight studies evaluated vascular severity scores and were able to accurately differentiate severity using an automated classification score.

Conclusion

Artificial intelligence for ROP diagnosis is a growing field, and many potential utilities have already been identified, including the presence of plus disease, staging of disease and a new automated severity score. AI has a role as an adjunct to clinical assessment; however, there is insufficient evidence to support its use as a sole diagnostic tool currently.

Subject terms: Retinal diseases, Health services

Introduction

Retinopathy of prematurity (ROP) is a vaso-proliferative disease affecting premature infants [1]. The evolution of modern newborn care has improved the survival of premature infants, resulting in a higher number of infants needing ROP screening. Most milder cases of ROP resolve without significant clinical sequelae; however, a small proportion of cases (5–10%) will require intervention to prevent blindness [2]. The incidence of ROP varies between regions, primarily because of differences in the survival of at-risk infants and in oxygen saturation targeting protocols and oxygen delivery practices; screening guidelines have differed from region to region for the same reason [3]. Based on the International Classification of Retinopathy of Prematurity (ICROP), ROP is described by location (zone I–III), extent (clock hours of disease), stage (1–5), the presence of plus or pre-plus disease and whether aggressive ROP (previously known as aggressive posterior ROP) is present [4–7].

In current clinical practice, diagnosis depends on binocular indirect ophthalmoscope (BIO) examination or interpretation of wide field digital retinal imaging (WFDRI) by a skilled ophthalmologist [3]. This may be prone to inter-clinician discrepancy and variability [8], and access to experienced ophthalmologists is limited in some countries. In the last decade, some health services across the globe have transitioned to non-ophthalmologist-led (nurses, non-physician imagers) WFDRI to overcome these challenges [9]. However, in most settings, image interpretation is still performed by ophthalmologists and continues to demand significant time and effort. The integration of AI has been suggested as a further way to reduce the workload and improve the efficiency of the screening process. The key question that remains is whether AI can safely identify all infants who are at risk of developing sight-threatening ROP.

To date, there have been many published studies describing the role of AI in ROP assessment and diagnosis. The aim of this systematic review is to synthesize evidence from the current literature, analyse the diagnostic performance of these algorithms and discuss opportunities and challenges in the field of AI in ROP.

Glossary of AI terminology

Deep learning (DL) networks can progressively learn from input data without task-specific programming, although they require extensive training. They utilise several layers of artificial neural networks to create a robust algorithm capable of highly complex tasks such as image recognition (Fig. 1).

  1. Artificial neural networks (ANNs) are designed to simulate neural activity in the human brain and consist of multiple artificial neurons that build connections between each other to pass information [10]. A simple ANN consists of only three layers and as such tends to be effective only for single, specific tasks and lacks generalisability [11].

  2. Convolutional neural networks (CNNs) are a subset of ANNs that are particularly useful for image processing and cognitive tasks. They preserve the spatial relationship between pixels in an image, and their filters can be trained to extract specific features [12]. Training involves inputting raw, human-labelled data, which are processed through the algorithm’s multiple layers, allowing the CNN to develop its own representation of image features. Once trained appropriately, a CNN should in theory be able to classify unlabelled images accurately, based on features that humans may not be able to evaluate.
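The convolution operation at the heart of a CNN can be sketched in a few lines. This is a purely illustrative toy example, not code from any included study: a hand-written vertical-edge filter responds to a bright vertical structure (loosely, a vessel) while preserving its spatial position. In a real CNN the filter weights are learned from labelled images rather than hand-crafted.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2-D cross-correlation: slide the kernel across the image."""
    kh, kw = kernel.shape
    h, w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 5x5 "retinal image": a bright vertical line on a dark background.
image = np.zeros((5, 5))
image[:, 2] = 1.0

# Hand-crafted vertical-edge filter (in a real CNN these weights are learned).
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

# The response map peaks where the filter's pattern matches the image,
# so the line's location is preserved in the output.
response = conv2d(image, kernel)
```

Stacking many such learned filters, interleaved with non-linearities and pooling, is what lets a CNN build up from edges to higher-level features such as vessel tortuosity.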

Fig. 1. Diagrammatic representation of deep learning frameworks. Key features of artificial neural networks and convolutional neural networks.

Materials and Methods

Search strategy

Two authors (AR and SE) independently performed a comprehensive systematic search of the literature using online databases (PubMed, Ovid Medline, Google Scholar and Embase) for potentially relevant studies dated from 1 January 2000 to 31 December 2021. Search terms were selected after discussion between the authors: (“Retinopathy of Prematurity” OR “ROP” OR “Plus Disease”) AND (“Deep learning” OR “Machine learning” OR “Artificial Intelligence” OR “Convolutional neural networks” OR “CNN” OR “Artificial Neural Network” OR “Automated Diagnosis”). In addition to the database search, we also examined the bibliographies of relevant studies for further eligible publications. This systematic review was reported in adherence to the PRISMA statement [13].

Inclusion and exclusion criteria

All publications identified through the search were independently assessed for suitability in this systematic review. Relevant studies were identified by the authors using predetermined inclusion and exclusion criteria, and any disagreement was resolved through discussion.

The inclusion criteria for this review were as follows:

  • Available in English language

  • Published between 2000 and December 2021

  • Original research article

  • Employs a fully automated DL system to investigate the objective

Studies were excluded from this review if they met any of the following criteria:

  • Fully automated DL system not used, i.e., clinician/manual interpretation still required for the diagnosis

  • Inclusion of biometric information (gestational age or postnatal weight gain) into the algorithm

  • No full text available

  • Animal model

Data extraction

Once a final list of suitable publications was agreed upon, two authors (AR, SE) extracted relevant data points from each study. The main outcome measures for diagnostic studies were sensitivity, specificity, accuracy and the area under the receiver operating characteristic curve (AUROC). The AUROC quantifies the capability of a test to classify a binary outcome, with 0.5 representing random chance and 1.0 representing a perfect test [14]. In studies that used a vascular severity score (VSS), the mean or median VSS was recorded as a primary outcome.
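As a minimal sketch of how these indices relate to a binary confusion matrix and to score ranking (the function names and data below are illustrative, not drawn from any included study):

```python
def diagnostic_indices(y_true, y_pred):
    """Sensitivity, specificity and accuracy from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(y_true)

def auroc(y_true, scores):
    """AUROC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical model scores for two ROP images (label 1) and two no-ROP images.
area = auroc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])  # 0.75: better than chance (0.5)
```

The rank-based formulation makes explicit why 0.5 corresponds to random chance and 1.0 to a perfect test, independent of any single decision threshold.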

We also extracted data including the image dataset as well as sample size and test group size, camera used to capture retinal images, year of publication and type of model used.

Results

Study screening

Our comprehensive search of the literature yielded a total of 1006 potential articles for screening, and two additional studies were identified from the bibliographies of included studies. We removed 167 duplicates and excluded 785 irrelevant studies on review of titles and abstracts. The remaining 56 articles were reviewed in full, and 29 were excluded for failing to meet the inclusion criteria. A total of 27 studies met all criteria for inclusion (19 diagnostic accuracy studies and eight studies using VSS) (Fig. 2).
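The screening counts above can be cross-checked with simple arithmetic; each stage of the flow subtracts the exclusions of the previous stage:

```python
# Counts as reported in the screening process above.
database_records = 1006
from_bibliographies = 2
duplicates_removed = 167
excluded_on_title_abstract = 785
excluded_on_full_text = 29

identified = database_records + from_bibliographies         # 1008 records
screened = identified - duplicates_removed                  # 841 titles/abstracts
full_text_reviewed = screened - excluded_on_title_abstract  # 56 articles
included = full_text_reviewed - excluded_on_full_text       # 27 studies
```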

Fig. 2. PRISMA flow diagram. Graphical representation of the flow of citations reviewed in the course of this systematic review.

Study characteristics

Table 1 presents the 19 studies that evaluated the performance of AI systems in diagnosing or assessing ROP. The studies were published between 2016 and 2021, and all used a DL system. Seven studies included the outcome of identifying an image with ROP vs no-ROP, seven were able to further classify ROP images into stages, nine differentiated images based on the presence of plus or pre-plus disease, and three assessed the quality of the retinal images. The number of images varied between studies, with 5 of 19 using images from the i-ROP consortium database. The median (IQR) total dataset was 4861 (3975), with a median test set of 244 (1642). Table 2 summarises the eight studies that utilised a DL-derived VSS to classify ROP.

Table 1.

Characteristics of included diagnostic type studies (not including VSS studies).

Author | Year | Model | Dataset | Camera
    Images (test) | Outcome | Sensitivity/Specificity | Accuracy | AUROC

Ramachandran et al. [31] | 2021 | U-COSFIRE | KIDROP | RetCam
    545 (161) | Plus vs no-Plus | 0.99/0.98 | 0.97 | 0.99

Zhang et al. [22] | 2021 | ResNet50, SE-HBP50 | China | RetCam
    12230 (3654) | ROP vs no-ROP | –/0.99 | 0.98 | 0.99
    12230 (3654) | AP-ROP vs ROP | 0.88/0.95 | 0.93 | –

Atallah et al. [23] | 2021 | DIAROP (ResNet50, Inception and Inception ResNet) | ROP Collaboration group, China | RetCam
    17801 | ROP vs no-ROP | – | 0.93 | 0.98

Wang et al. [17] | 2021 | – | China | RetCam
    3635 (558) | Image quality | 0.97/0.95 | – | 0.99
    41678 (6349) | Stage | 0.98/0.99 | – | 0.998
    41312 (6389) | Haemorrhage | 0.97/0.99 | – | 0.997
    36390 (5629) | Posterior disease | 0.92/0.96 | – | 0.99
    16926 (2518) | Pre-plus/Plus | 0.92/0.97 | – | 0.98
    52249 (5295) | Referral warranted | 0.98/0.97 | – | 0.996

Mao et al. [25] | 2020 | U-Net, DenseNet | China | RetCam
    5711 (450) | Plus vs not-plus | 0.95/0.98 | 0.93 | 0.99
    5711 (450) | Pre-plus vs normal | 0.92/0.97 | – | –

Huang et al. [18] | 2020 | VGG-19* | Taiwan & Japan | RetCam
    2351 (101) | ROP vs no-ROP | 0.97/0.95 | 0.96 | 0.97
    2363 (85) | Mild vs severe ROP^ | 0.99/0.99 | 0.99 | 0.99

Huang et al. [19] | 2020 | CNN | Taiwan | RetCam
    1975 (244) | ROP vs no-ROP | 0.96/0.96 | 0.92 | 0.96
    1975 (244) | Stage 1 vs other’ | 0.92/0.95 | – | 0.93
    1975 (244) | Stage 2 vs other’ | 0.90/0.99 | – | 0.92

Tong et al. [26] | 2020 | ResNet, Faster-RCNN | China | RetCam
    36231 (9772) | Grading” | 0.78/0.93 | 0.90 | –
    36231 (9772) | Plus vs no-Plus | 0.71/0.91 | 0.90 | –

Yildiz et al. [27] | 2020 | U-Net, CNN | i-ROP | RetCam
    5512 (100) | Plus vs not-plus | – | – | 0.99
    5512 (100) | Pre-plus vs normal | – | – | 0.97

Lepore et al. [41] | 2020 | CNN | Italy | RetCam
    835 (83) | Treated vs not treated | – | 0.88 | 0.91

Coyner et al. [16] | 2019 | Inception-V3 | i-ROP | RetCam
    6139 (2109) | Acceptable vs possibly acceptable quality | 0.94/0.84 | – | 0.97

Tan et al. [28] | 2019 | ROP.AI | ART-ROP | RetCam
    3487 (90) | Plus vs not-plus | 0.94/0.81 | 0.86 | 0.98
    3487 (90) | Pre-plus vs normal | 0.81/0.81 | 0.81 | –

Hu et al. [20] | 2018 | Inception V2* | China | RetCam
    2068 (300) | ROP vs no-ROP | 0.96/0.98 | 0.97 | 0.99
    466 (100) | Mild vs severe ROP+ | 0.82/0.86 | 0.84 | 0.92

Brown et al. [29] | 2018 | U-Net, Inception V1 | i-ROP | RetCam
    5511 (100) | Plus vs not-plus | 0.93/0.94 | 0.91 | 0.98*
    5511 (100) | Pre-plus vs normal | 1.00/0.94 | – | –

Wang et al. [21] | 2018 | Id-Net, Gr-Net | China | –
    13526 (2361) | ROP vs no-ROP | 0.97/0.99 | – | 0.99
    Clin (4908/657) | ROP vs no-ROP (clin) | 0.85/0.97 | 0.96 | –
    Grad 4139 (293) | Mild vs severe ROP# | 0.88/0.92 | – | 0.95
    Clin (4908/657) | Mild vs severe ROP (clin) | 0.93/0.74 | 0.76 | –

Redd et al. [30] | 2018 | i-ROP DL score | i-ROP | RetCam
    4861 (100) | Plus vs no-Plus | – | – | 0.99
    4861 (100) | Type 1 ROP | 0.94/0.79 | – | 0.96
    4861 (100) | Clin sig ROP | – | – | 0.91

Coyner et al. [15] | 2018 | VGG-19 | i-ROP | –
    6043 (3073) | Acceptable vs not acceptable | – | 0.89 | 0.96

Zhang et al. [24] | 2018 | VGG-16, GoogleNet, AlexNet | ROP Collaboration group, China | RetCam
    17801 (1742) | ROP vs no-ROP | 0.99/0.98 | 0.98 | 0.99

Worrall et al. [32] | 2016 | GoogleNet, Bayesian CNN | Canada & London | –
    1459 (106) | Plus vs no-Plus | 0.83/0.98 | 0.92 | –

Total images are quoted as the total set of included images used for training after exclusions.

^Mild ROP defined as Stage 1 and 2. Severe ROP is stage 3.

‘Other is, e.g., Stage 1 vs (no-ROP and Stage 2).

“Grading the ROP cases as normal, mild (stage I or stage II, without plus disease; routine observation); semi-urgent (stage I or stage II, with plus disease; suggested referral); and urgent (stage III, stage IV, or stage V, with or without plus disease; urgent referral for treatment).

+Mild is defined as stage 1 and 2, severe is defined as stage 3–5.

#Mild is defined as Zone II or III and stage 1 or 2, and severe ROP is defined as threshold disease, type I, type II, or AP-ROP, and stage 4, or 5.

*Different models were tested, the most accurate and its results are quoted.

ROP Retinopathy of Prematurity.

AP-ROP Aggressive Posterior Retinopathy of Prematurity.

AUROC Area Under the Receiver Operating Characteristic Curve.

Table 2.

Characteristics of included studies utilising Vascular Severity Score.

Author | Year | Model | Dataset | Camera | Images | Outcome
    VSS applied to ROP classification, mean (SD) or median (IQR); Summary of results

Campbell et al. [34] | 2021 | i-ROP DL | India | RetCam | 4175 | Detection of treatment-requiring ROP
    VSS: No ROP 1.8 (1.3–2.4); Pre-plus 3.5 (2.4–4.3); Plus 6.2 (5.3–6.9)
    Results: AUROC 0.98, Sn 100%, Sp 78% for detecting TR-ROP; high diagnostic accuracy for TR-ROP; higher ROP severity in NICUs with fewer resources for oxygen monitoring

Bellsmith et al. [36] | 2020 | i-ROP DL | i-ROP | RetCam | 1507 | Detection of AP-ROP
    VSS: AP-ROP 8.8 (8.2–9.0); TR without AP-ROP 7.2 (5.3–8.7); Type 2 or pre-plus 4.3 (2.2–5.1); Mild ROP 1.2 (1.0–1.8); No ROP 1.0 (1.0–1.3)
    Results: longitudinal evaluation of VSS showed that eyes with AP-ROP demonstrated earlier and more rapid progression of disease than infants without AP-ROP

Campbell et al. [33] | 2020 | i-ROP DL+ | i-ROP | RetCam | 6344 | Automated classification of ROP severity
    VSS: No plus 2.4 (0.8); Pre-plus 4.7 (1.1); Plus 7.7 (1.0)
    Results: zone, stage and extent were all independently associated with the VSS; a higher VSS was associated with more posterior disease, higher stage and higher extent in stage 3

Greenwald et al. [40] | 2020 | i-ROP DL | USA | RetCam | 613 | Automated classification of ROP severity
    Results: AUC of 0.99 for detection of RR-ROP; an operating point of 3 on a 1–9 scale would have conferred 100% sensitivity for diagnosing RR-ROP; eyes with TR-ROP had a higher VSS than eyes without TR-ROP

Choi et al. [35] | 2020 | i-ROP DL | i-ROP | RetCam | 5255 | Automated classification of ROP severity
    VSS: Plus 7.4 (1.9)

Taylor et al. [37] | 2019 | i-ROP DL | i-ROP | RetCam | 5255 | Monitoring ROP progression over time
    VSS: No ROP 1.1 (1.0–1.5); Mild ROP 1.5 (1.1–3.4); Type 2 and pre-plus 4.6 (2.4–5.3); TR-ROP 7.5 (5.0–8.7)
    Results: VSS is associated with category of disease and clinical progression of ROP

Gupta et al. [38] | 2019 | i-ROP DL | i-ROP | RetCam | 5255 | ROP progression post treatment
    VSS: 2 weeks before treatment 4.19 (1.75); at treatment 7.43 (1.89); 2 weeks after treatment 4.00 (1.88)
    Results: VSS increased 2 weeks before treatment, peaked at treatment and decreased for 2 weeks after treatment; mean change in VSS post bevacizumab (−3.28) was larger than post laser (−1.91)

Brown et al. [39] | 2018 | U-Net, GoogLeNet | i-ROP | RetCam | 4800 | Severity score progression, including post treatment
    Results: disease severity increased across cohorts, with the highest rate of increase in the plus and aggressive posterior ROP groups; regression was seen post treatment with bevacizumab

+i-ROP DL system is from Brown et al. [39].

Sn Sensitivity, Sp Specificity.

SD Standard Deviation, IQR Interquartile Range.

VSS Vascular Severity Score.

ROP Retinopathy of Prematurity.

RR-ROP Referral Recommended Retinopathy of Prematurity.

TR-ROP Treatment Requiring Retinopathy of Prematurity.

AP-ROP Aggressive Posterior Retinopathy of Prematurity.

AUROC Area Under the Receiver Operating Characteristic Curve.

Due to the heterogeneity of outcomes of the studies, the diagnostic indices could not be pooled and are largely summarised individually.

Assessing the quality of images

Two separate studies trained a convolutional neural network (CNN) to automatically assess the quality of a database of retinal images [15, 16], using VGG-19 and Inception-V3, respectively. The first study used a test set of 3073 images and reported an AUROC of 0.96; the second had a smaller test set of 2109 images but a higher AUROC of 0.97. Wang et al. also assessed image quality as part of a diverse set of outcomes, with a similarly high AUROC of 0.99 [17]. These results affirm that captured digital retinal images are of sufficient quality to allow further analysis, and that automated quality assessment performs at a level comparable to experts. This forms the foundation for subsequent studies using these images for diagnosis and classification.

ROP vs no-ROP

A total of seven included studies aimed to identify the presence of ROP [18–24]. Each used a different CNN architecture, yet all reported convincing efficacy. The datasets varied in size, with a mean (SD) of 9679 (6806) total images and 1400 (1314) test images. The mean (SD) accuracy across six studies (one study did not report an accuracy) was 95.7% (2.36), and the mean sensitivity (n = 5) was 97% (1.10).

Multiple studies, including Huang et al. [18] and Zhang et al. [22], applied a number of different CNN models to compare their capabilities in detecting ROP. Huang et al. used five pretrained models (VGG16, VGG19, MobileNet, InceptionV3 and DenseNet), while Zhang et al. used a variety of ResNet networks with SE-HBP. In the Huang et al. study, the best-performing model for classifying ROP vs no-ROP was VGG19, with an accuracy of 96.0%, sensitivity of 96.6% and specificity of 95.2%.

While all of these studies demonstrated a high degree of statistical efficacy in distinguishing ROP from no-ROP, the highest accuracy was achieved by Zhang et al., who recorded 98% accuracy, with 99% sensitivity, 98% specificity and an AUROC of 0.99 [24].

Wang et al. conducted a real-world clinical trial of their AI diagnostic platform by integrating the CNN algorithm with a cloud-based telehealth computing system [21]. Clinical ophthalmologists from participating hospitals in China uploaded the raw unprocessed retinal images captured during routine screening programs to a website attached to the CNN system. Of the 944 images uploaded, the fully automated framework recorded an accuracy of 95.6% but a sensitivity of only 84.9% in correctly labelling ROP vs no-ROP.

Pre-plus and plus vs normal

Nine studies developed DL systems to distinguish images with plus disease from those without [17, 25–32], and four additionally aimed to classify pre-plus disease [25, 27–29]. Pre-plus disease was defined as arterial tortuosity and venous dilation in two or more quadrants of the eye. The mean (SD) number of total images in the nine studies was 8915 (10621), with a mean (SD) test dataset of 1489 (3021). The mean (SD) sensitivity for diagnosing plus disease among the six studies with available data was 89.2% (9.46); for pre-plus disease (n = 4), this was marginally higher at 91.3% (6.76). The mean (SD) accuracy of the automated systems for diagnosing plus disease was 91.5% (3.30), but only one of the four studies reported an accuracy for diagnosing pre-plus disease (81%).

Staging of ROP

Eight studies aimed to further subclassify images identified as having characteristics consistent with ROP according to disease severity. The classification of ROP varied greatly between these studies, with differing definitions of ‘mild’ and ‘severe’ ROP. Huang et al. [19] and Hu et al. [20] defined mild ROP as stages 1 and 2 and severe ROP as stages 3–5, while Tong et al. [26] graded images as mild (stage I or II, without plus disease; routine observation), semi-urgent (stage I or II, with plus disease; suggested referral) and urgent (stage III, IV or V, with or without plus disease; urgent referral for treatment). Both Zhang et al. [22] and Wang et al. [21] included aggressive posterior ROP (AP-ROP) in their grading systems. A recent study by Wang et al. [17] included not only posterior disease but also the presence of haemorrhage, with a sensitivity, specificity and AUROC of 0.97, 0.96 and 1.0, an outcome not described in any other study. Redd et al. [30] was the only study to distinguish type 1 ROP, reporting a high sensitivity of 0.94.

In the same clinical study by Wang et al., while the accuracy of diagnosing ROP vs no-ROP was high, the ability of the algorithm to grade the severity of ROP was weaker, with a considerably lower accuracy of only 76.4% [21].

Vascular severity score

Table 2 presents a summary of all currently available studies using a DL-derived VSS to monitor and categorise ROP. Eight studies were included [33–40], five of which used the automated system to assign images a VSS according to severity or classification, that is, plus disease or treatment-requiring ROP (TR-ROP). In all studies, plus disease was assigned a higher VSS than pre-plus disease and no plus disease. In two separate studies, Campbell et al. reported, on a 1–9 scale of increasing severity, a mean (SD) VSS for no plus disease of 2.4 (0.8) and a median (IQR) of 1.8 (1.3–2.4), for pre-plus disease 4.7 (1.1) and 3.5 (2.4–4.3), and for plus disease 7.7 (1.0) and 6.2 (5.3–6.9) [33, 34]. Choi et al. similarly reported a mean score for plus disease of 7.4 (1.9) [35]. Bellsmith et al. reported a notably high VSS for AP-ROP of 8.8 (8.2–9.0) and for TR-ROP without AP-ROP of 7.2 (5.3–8.7) [36]. Taylor et al. used the VSS to monitor the progression of ROP over time and classified severity with a mean VSS for TR-ROP of 7.5 [37].
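The logic of choosing an operating point on a severity scale, as in Greenwald et al., where a cut-off of 3 on the 1–9 scale preserved 100% sensitivity, can be sketched as follows. The data and function name here are hypothetical, purely to illustrate the thresholding idea, and are not taken from any included study:

```python
def lowest_fully_sensitive_cutoff(vss_scores, tr_labels):
    """Smallest VSS cut-off that still flags every treatment-requiring eye,
    i.e. the minimum score observed among the positive cases."""
    positives = [s for s, label in zip(vss_scores, tr_labels) if label == 1]
    return min(positives)

# Hypothetical per-eye VSS values (1-9 scale) and treatment-requiring labels.
vss = [1.2, 1.8, 3.1, 4.6, 5.0, 7.5, 8.8]
tr = [0, 0, 0, 0, 1, 1, 1]

cutoff = lowest_fully_sensitive_cutoff(vss, tr)
flagged = [score >= cutoff for score in vss]  # eyes referred at this cut-off
```

Lowering the cut-off below this value only adds false positives; raising it would miss a treatment-requiring eye, which is why screening applications prioritise sensitivity at the expense of specificity.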

Gupta et al. used VSS to monitor progression post treatment with laser or anti-VEGF and described a peak VSS of 7.43 at the time of treatment followed by a significant decrease two weeks following treatment to 4.00 [38]. The mean change in VSS was higher in the anti-VEGF treatment group. Brown et al. similarly monitored regression of disease post treatment with bevacizumab and described a significant decrease in VSS post treatment [39].

Additionally, Mao et al. monitored changes in vascular parameters post treatment with ranibizumab with a significant decrease in severity at each time point post treatment [25], and Lepore et al. was able to distinguish between ‘treatment requiring’ and ‘not treatment requiring’ in a unique dataset of images that had undergone fluorescein angiography with an accuracy of 0.88 [41]. However, neither used VSS.

Discussion

As more extremely low gestational age premature babies survive, there is a potential need for a fast, reliable and efficient diagnostic tool to assist in the ROP screening process. The limitations of binocular indirect ophthalmoscopy (BIO) are well known, as reported by both the CRYO-ROP [42] and ET-ROP [43] trials, which found 12% and 15% disagreement, respectively, between the first and second BIO. More recent studies examining inter- and intra-expert variability demonstrated very similar trends in the interpretation of WFDRI images [8, 44]. DL and AI systems are being developed and increasingly implemented across medicine to overcome this variability. The key question is whether currently available CNN algorithms can reduce the uncertainty surrounding diagnosis and assist ophthalmologists, particularly in low- and middle-income countries, where the disease burden is high and access to specialised care and screening is comparatively low. In this systematic review, we have synthesised evidence from the available literature on the use of AI programmes to enhance diagnostic and screening capabilities in ROP.

ROP needing treatment, termed type 1 ROP by the ET-ROP trial [43], includes all ROP cases with plus disease and cases of stage 3 in zone I without plus disease. Most algorithms tested to date are based on plus disease changes; some algorithms therefore risk missing stage 3, zone I cases without plus disease. It is important to note that in the e-ROP study, 48.5% of infants with type 2 ROP or referral-warranted ROP (RW-ROP) did not have any plus disease [45]. While multiple studies attempted to distinguish between type 1 and type 2 ROP using varying definitions of mild and severe ROP, only Redd et al. [30] used DL methods to differentiate between type 1, type 2 and clinically significant ROP. This discrimination is what drives treatment decisions and is of greatest clinical impact. In the study by Wang et al., clinical application to grade the severity of ROP was weaker, with a considerably lower accuracy of only 76.4% [21].

An alternative vascular severity score (Deep ROP score), reported by Li et al., described the degree of vascular severity on a scale of 1–100 [46]. The Deep ROP score had an AUROC of 0.981 for detecting type 1 ROP and 0.986 for type 2 ROP from 54626 images. At a hypothetical cut-off score of 35, all cases of severe ROP (type 1 and type 2) were identified. In contrast to the i-ROP DL score, this score was trained to recognise stage rather than plus disease. Coyner et al. reported a risk prediction model using gestational age and VSS to predict ROP requiring treatment, with 100% sensitivity and an 80.8% negative predictive value [47].

While AI stands to play a significantly beneficial role in medicine in the foreseeable future, it also has the potential to raise considerable ethical and legal issues, and a variety of concerns have been expressed by the medical and wider community. Reddy et al. identify a number of prominent ethical challenges, including AI bias, healthcare data and privacy, patient and clinician trust in an opaque mechanism, and algorithmic safety and accountability [48]. Strong governance is needed, and professional bodies will need to engage with regulatory agencies to address these concerns prior to widespread implementation.

In summary, we can conclude the following: (1) fully automated diagnostic algorithms can be used to evaluate a retinal image and produce a diagnosis of ROP with high accuracy; (2) AI systems can be used to classify the severity of ROP images; (3) AI systems can be used to diagnose the presence of pre-plus and plus diseases; (4) the VSS is a new automated classification system with good objectivity and accuracy; and (5) the VSS may be used to monitor the progression of ROP pre- and post-treatment.

The key question remains whether AI algorithms have the potential to identify infants who require intervention, which is currently decided based on assessment of location, severity and the presence of plus disease. While AI has been successful in identifying these characteristics individually, there are limited studies testing these components together. Furthermore, for AI to be more widely applicable, we first require a less expensive, more widely available imaging system and a workforce appropriately trained to acquire high-quality images that can be interpreted.

Limitations of this review

First, the heterogeneity of outcomes of studies made it difficult to compare and analyse the performance by pooling data. The accuracy and reliability of a DL algorithm depend largely on both the quality and quantity of the data used to train the system. The training process of an automated system still requires human involvement with the need for accurate labelling of training images, and inaccurate labelling at this early stage may influence the overall efficacy of the program. Furthermore, a large number of training images is required to achieve generalisability.

Although the studies presented in this review report high levels of performance, this is often limited to the research environment. Most included studies were retrospective and used image sets obtained from archives rather than prospective enrolment. Generally, only high-quality images were used, which may not reflect a real clinical setting. Only two papers reported a real-world clinical application of the developed AI system [21, 40]. Large-scale trials of proposed systems have yet to be conducted in a clinical setting, so the reproducibility of results and feasibility of use outside of a research setting cannot yet be commented on.

Conclusions

AI in ROP has the potential to create a paradigm shift in the diagnosis of ROP. Studies have identified many potential utilities, including automated diagnosis of ROP and plus disease, staging of disease and a new automated severity score. However, most studies have independently assessed different components of the ROP diagnostic criteria. To replace existing systems, AI needs to consistently identify infants requiring treatment to the level of 100% accuracy and overcome other implementation challenges through large-scale studies with data from real-world clinical settings.

Summary

What was known before

  • Artificial intelligence (AI) in the field of retinopathy of prematurity has been evaluated to improve diagnostic performance and reduce the workload of screening ophthalmologists.

What this study adds

  • This systematic review found that AI diagnostic algorithms demonstrated moderate to high levels of accuracy (sensitivity of 71–100%, specificity of 74–99% and area under the curve of 0.91–0.99) in classifying the disease (stages of ROP, diagnosis of the presence of pre-plus or plus disease).

  • There is scope to utilise AI technology to facilitate diagnostics and decision making and reduce the burden on specialists; however, limitations exist with current deep learning algorithms (exclusive testing in research setup and limited integration to clinical practice).

  • The technological, ethical and governance challenges need to be addressed prior to implementation.

Author contributions

AR and SA collected the data. AR, SA, and GL analysed and interpreted the data. All authors (AR, SA, and GL) drafted the paper, revised it, and approved the final version.

Funding

We declare that no author received any specific funding for this study.

Data availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.




Articles from Eye are provided here courtesy of Nature Publishing Group
