Abstract
Telemedicine has potential to improve the delivery, quality, and accessibility of ophthalmic care for infants with Retinopathy of Prematurity (ROP). Using a telemedicine screening strategy, three potential diagnostic cutoffs may be used to define disease that warrants ophthalmologic referral: presence of any ROP, presence of moderate (“type-2 prethreshold”) ROP, or presence of severe ROP requiring treatment. This study examines the relationship between accuracy and reliability of diagnosis by three masked ophthalmologist graders at each of these diagnostic cutoffs. The sensitivity, specificity, inter-grader reliability, and intra-grader reliability showed significant variation depending on the diagnostic cutoff, with best results at cutoffs of type-2 prethreshold ROP or treatment-requiring ROP. Before the large-scale adoption of telemedicine for image-based screening of diseases such as ROP, standards defining clinically-relevant referral cutoffs must be established, and diagnostic accuracy and reliability at these cutoffs must be characterized.
Introduction
Store-and-forward telemedicine is a technology in which static medical data and images are captured and transmitted to a remote storage device for subsequent retrieval and interpretation by an expert1–2. This has potential to improve the quality and delivery of medical care, particularly in heavily image-based domains such as ophthalmology. However, the widespread adoption of telemedicine has been limited by the lack of substantive evaluation studies about diagnostic efficacy and reliability, acceptability to patients and providers, and cost-effectiveness3–6.
The objective of a practical, large-scale telemedicine strategy is to diagnose cases of sufficient severity to warrant referral for further monitoring or treatment.
Clinical decisions in image-based domains are often made on the basis of subtle disease classifications, rather than on the simple presence or absence of disease. Retinopathy of Prematurity (ROP) is one such ophthalmic condition, in which diagnostic and treatment guidelines are based on classification in a well-established ordinal system7. The performance of image-based telemedicine systems is likely to vary depending on the classification cutoff used to define “abnormality” for the purpose of diagnostic referral. Therefore, appropriate evaluation of these technologies requires that clinically-relevant referral cutoffs are established, and that the accuracy and reproducibility of remote image interpretation by multiple graders are determined at these cutoff levels. Little published research has addressed these issues8.
Our recent studies have examined the performance of remote ROP diagnosis8. The objective of this paper is to analyze the variation in accuracy, inter-grader reliability, and intra-grader reliability of ROP diagnosis by three masked ophthalmologist graders at three potential diagnostic referral cutoffs. Results are compared to a reference standard of full examination by an experienced ophthalmologist.
Evaluation Domain
Retinopathy of prematurity (ROP) is an ocular disease of low birth weight infants, and affects over 10,000 children in the United States each year9. It is diagnosed from dilated ophthalmoscopic examination by an experienced ophthalmologist, and there are accepted guidelines for identifying high-risk premature infants who need serial screening examinations10. When ROP occurs, approximately 90% of cases improve spontaneously and require only close follow-up examinations every 1–2 weeks. However, the remaining 10% are at high risk for complications leading to blindness and require laser or surgical treatment9.
ROP is an ideal disease for applications and research in telemedicine for several reasons: (1) Clinical diagnosis is based solely on the appearance of disease in the retina. (2) There is a universally-accepted, evidence-based, diagnostic classification standard for ROP7. This ordinal system is used by ophthalmologists to define severe cases of ROP that require treatment, as well as cases with high-risk features that warrant particularly careful observation. (3) If severe ROP is diagnosed and treated early enough, blinding complications may be prevented11–12. (4) ROP continues to be a leading cause of childhood blindness in the United States and throughout the world9. (5) Current ROP examination methods are costly, time-intensive, frequently impractical, and physiologically stressful to premature infants. (6) Adequate ophthalmic expertise is often limited to larger academic centers, and therefore unavailable at the point of care. (7) Pilot studies have suggested that ROP diagnosis using remote interpretation of images captured using wide-angle digital retinal cameras may be feasible8,13–16.
Methods
Development of Retinal Atlas
This study was approved by the Institutional Review Board (IRB) at Columbia University Medical Center. A retinal image atlas was developed for this study, as described previously8. Briefly, infants who met existing criteria for ROP examination from 1999–2000, and whose parents provided informed consent for imaging, were included. Patients were excluded from consideration if they were judged by their attending neonatologist to be unstable for ROP examination because of poor general health.
Each infant underwent two examinations, which were sequentially performed under topical anesthesia at the neonatal intensive care unit bedside: (1) Dilated ophthalmoscopy by an experienced ophthalmologist, based on well-known protocols10. The presence or absence of ROP disease, and its characteristics when present, were documented according to the international classification standard9. (2) Wide-angle retinal imaging by an experienced ophthalmic photographer using a digital camera system (RetCam-120; MLI Inc., Pleasanton, California), based on guidelines established by the manufacturer.
All study photographs were compiled into a retinal atlas. This included 163 unique sets of digital images (81 right eyes, 82 left eyes) from 63 consecutive infants whose parents consented to participate. Each image set consisted of 1 to 7 photographs from a single eye. Because imaging was performed at the bedside under typical working conditions, it was not possible to capture a standard set of photographs on each infant. No images were excluded based on poor quality. Examples of atlas images are shown in Figure 1. To test intra-grader reliability for interpreting the same images at different times, seven sets of duplicated images were included in the atlas. To maintain patient confidentiality, images were not annotated with any individually identifiable clinical data. Therefore, remote diagnosis was based on the appearance of retinal images alone.
Image Interpretation and Data Analysis
Three board-certified practicing ophthalmologists participated as image graders. They were trained to analyze study images by a pediatric ophthalmologist experienced in ROP, using a set of teaching photos that were unrelated to the atlas created for this study.
Graders were evaluated to ensure compliance with existing diagnostic standards.
Masked graders independently interpreted each image set, according to an ordinal scale based on established criteria from two major multi-center National Institutes of Health trials11–12. Images in this study were classified as: (1) No ROP, meaning that patients should generally be re-examined in 2 weeks to screen for future disease development10; (2) Mild ROP, meaning that patients should generally be reexamined in 2 weeks to follow progression10; (3) Type-2 prethreshold ROP, meaning moderate disease requiring close re-examination in 1 week or less12; and (4) Treatment-requiring ROP, meaning that laser or incisional surgery is needed within 48–72 hours12. Reference standard examinations of the 163 image sets disclosed 64 (39%) with no ROP, 65 (40%) with mild ROP, 16 (10%) with type-2 prethreshold ROP, and 18 (11%) with treatment-requiring ROP.
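The way the four-level ordinal scale collapses into a binary "disease present" label at each referral cutoff can be sketched as follows. This is a minimal illustration and not the study's analysis code: the `Grade` enum, the `CUTOFFS` mapping, and the function names are hypothetical, and only the four reference-standard counts are taken from the text above.

```python
from enum import IntEnum


class Grade(IntEnum):
    """Ordinal ROP scale used by the graders (names are illustrative)."""
    NO_ROP = 0
    MILD = 1
    TYPE2_PRETHRESHOLD = 2
    TREATMENT_REQUIRING = 3


# Reference-standard counts reported for the 163 image sets.
REFERENCE_COUNTS = {
    Grade.NO_ROP: 64,
    Grade.MILD: 65,
    Grade.TYPE2_PRETHRESHOLD: 16,
    Grade.TREATMENT_REQUIRING: 18,
}

# A grade at or above the cutoff is treated as "disease present".
CUTOFFS = {
    "any ROP": Grade.MILD,
    "type-2 prethreshold or worse ROP": Grade.TYPE2_PRETHRESHOLD,
    "treatment-requiring ROP": Grade.TREATMENT_REQUIRING,
}


def is_positive(grade, cutoff):
    """Dichotomize an ordinal grade at the given referral cutoff."""
    return grade >= cutoff


def positives_at_cutoff(counts, cutoff):
    """Count image sets whose grade meets or exceeds the cutoff."""
    return sum(n for grade, n in counts.items() if is_positive(grade, cutoff))


total = sum(REFERENCE_COUNTS.values())
for name, cutoff in CUTOFFS.items():
    n_pos = positives_at_cutoff(REFERENCE_COUNTS, cutoff)
    print(f"{name}: {n_pos}/{total} positive ({100 * n_pos / total:.1f}%)")
```

Applied to the reported counts, this gives 99 of 163 reference-standard positives (60.7%) at the any-ROP cutoff, 34 of 163 (20.9%) at the type-2 prethreshold cutoff, and 18 of 163 (11.0%) at the treatment-requiring cutoff. Disease prevalence therefore differs substantially between the cutoffs.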
Data Analysis
Examination data were analyzed using statistical software (SPSS 13.0; SPSS Inc., Chicago, Illinois). Sensitivity, specificity, inter-grader reliability, and intra-grader reliability of remote imaging were determined for three diagnostic cutoffs: presence of any ROP (i.e. mild or worse ROP), presence of type-2 prethreshold or worse ROP, and presence of treatment-requiring ROP. Full ophthalmoscopic examination was used as the reference standard. The kappa statistic was used to measure chance-adjusted agreement for presence of disease, based on an accepted scale17: 0 to 0.20 = slight agreement, 0.21 to 0.40 = fair agreement, 0.41 to 0.60 = moderate agreement, 0.61 to 0.80 = substantial agreement, and 0.81 to 1.00 = almost-perfect agreement.
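For readers who wish to reproduce this kind of analysis outside SPSS, the measures described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the code used in the study: the function names are hypothetical, the example readings are invented, and the label for negative kappa values is an assumption because the scale quoted above begins at 0.

```python
def confusion_counts(reference, grader):
    """Return (tp, fp, fn, tn) for two equal-length lists of booleans."""
    pairs = list(zip(reference, grader))
    tp = sum(1 for r, g in pairs if r and g)
    fp = sum(1 for r, g in pairs if not r and g)
    fn = sum(1 for r, g in pairs if r and not g)
    tn = sum(1 for r, g in pairs if not r and not g)
    return tp, fp, fn, tn


def sensitivity(tp, fp, fn, tn):
    """Fraction of reference-positive cases that the grader called positive."""
    return tp / (tp + fn)


def specificity(tp, fp, fn, tn):
    """Fraction of reference-negative cases that the grader called negative."""
    return tn / (tn + fp)


def cohen_kappa(tp, fp, fn, tn):
    """Chance-adjusted agreement for a 2x2 table (degenerate tables not handled)."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return (observed - expected) / (1 - expected)


def agreement_label(kappa):
    """Map kappa to the agreement scale quoted in the text."""
    if kappa < 0:
        return "below chance"  # assumption: not part of the quoted scale
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost-perfect"


# Hypothetical readings: 20 diseased and 80 non-diseased image sets.
reference = [True] * 20 + [False] * 80
grader = [True] * 18 + [False] * 2 + [True] * 4 + [False] * 76

counts = confusion_counts(reference, grader)
kappa = cohen_kappa(*counts)
print(f"sensitivity = {sensitivity(*counts):.3f}")
print(f"specificity = {specificity(*counts):.3f}")
print(f"kappa = {kappa:.3f} ({agreement_label(kappa)})")
```

In this invented example the grader detects 18 of 20 diseased sets (sensitivity 0.900) and correctly excludes 76 of 80 non-diseased sets (specificity 0.950), which yields a kappa of about 0.819.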
Results
Accuracy Based on Diagnostic Cutoff
Figure 2 shows the sensitivity, specificity, and kappa statistic for remote image-based diagnosis by each grader for three diagnostic cutoff values: detection of any ROP (i.e. mild or worse ROP), detection of type-2 prethreshold or worse ROP, and detection of treatment-requiring ROP.
Graders A and B had sensitivity >75% and specificity >90% at all three cutoff values. Grader C had similarly high sensitivity at each cutoff value, and high specificity for type-2 prethreshold ROP (98.5%) and treatment-requiring ROP (95.3%). However, grader C had significantly lower specificity for any ROP compared to the other two graders (49.3%, p<0.001). Similarly, the kappa statistic showed substantial to almost-perfect agreement for graders A and B with the reference standard at each diagnostic cutoff value. For grader C, the kappa statistic for detection of type-2 prethreshold ROP and treatment-requiring ROP showed substantial agreement with the reference standard, whereas kappa for detection of any ROP showed only fair agreement.
Reliability Based on Diagnostic Cutoff
Figure 3 shows the inter-grader reliability between each pair of graders at each cutoff level. At a cutoff of any ROP, two pairs of graders had reliabilities of 0.533 and 0.572, indicating moderate agreement. However, at a cutoff of treatment-requiring ROP, the inter-grader reliability of all pairs ranged from 0.843 to 0.859, indicating almost-perfect agreement.
For the seven duplicated image sets, the intra-grader reliabilities for graders A and B were perfect (kappa = 1.00 for ordinal classification). For grader C, the intra-grader reliability for ability to detect the presence of any ROP was lower than would be expected by pure chance (kappa = −0.235). However, intra-grader reliability for grader C for ability to detect type-2 prethreshold ROP or treatment-requiring ROP was perfect (kappa = 1.00). For intra-grader reliability of ordinal classification among all graders, the kappa and weighted kappa (standard error) were 0.599 (0.246) and 0.775 (0.135), indicating moderate-to-substantial agreement.
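The distinction between kappa and weighted kappa, and the possibility of a negative kappa with only seven paired readings, can be illustrated with a short sketch. The weighting scheme used in the study is not stated, so linear weights are assumed here. The function name and all readings below are hypothetical and do not reproduce the study data.

```python
def kappa(first, second, n_categories, linear_weights=False):
    """Cohen's kappa for two readings coded as integers 0..n_categories-1.

    With linear_weights=True, a disagreement between categories i and j is
    penalized by |i - j| / (n_categories - 1) instead of a flat 1.
    """
    k = n_categories
    n = len(first)
    counts = [[0] * k for _ in range(k)]
    for a, b in zip(first, second):
        counts[a][b] += 1
    row_totals = [sum(counts[i]) for i in range(k)]
    col_totals = [sum(counts[i][j] for i in range(k)) for j in range(k)]

    def penalty(i, j):
        if linear_weights:
            return abs(i - j) / (k - 1)
        return 0.0 if i == j else 1.0

    cells = [(i, j) for i in range(k) for j in range(k)]
    observed = sum(penalty(i, j) * counts[i][j] for i, j in cells) / n
    expected = sum(
        penalty(i, j) * row_totals[i] * col_totals[j] for i, j in cells
    ) / n**2
    return 1 - observed / expected


# Hypothetical ordinal readings of seven duplicated image sets
# (0 = no ROP, 1 = mild, 2 = type-2 prethreshold, 3 = treatment-requiring).
first_read = [0, 0, 1, 1, 2, 3, 0]
second_read = [0, 1, 1, 1, 2, 3, 0]
unweighted = kappa(first_read, second_read, 4)
weighted = kappa(first_read, second_read, 4, linear_weights=True)
print(f"ordinal kappa = {unweighted:.3f}, weighted kappa = {weighted:.3f}")

# Hypothetical binary readings at the any-ROP cutoff (1 = ROP present).
# Five of seven readings agree, yet agreement is below the chance level
# because nearly every image set is called positive on both reads.
binary_first = [1, 1, 1, 1, 1, 1, 0]
binary_second = [1, 1, 1, 1, 1, 0, 1]
negative = kappa(binary_first, binary_second, 2)
print(f"binary kappa = {negative:.3f}")
```

In the ordinal example, the single disagreement is between adjacent categories, so the weighted kappa (about 0.873) exceeds the unweighted kappa (0.800). In the binary example, kappa is about −0.167 despite agreement on five of seven readings, which shows how sensitive the statistic is to marginal frequencies in very small samples.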
Discussion
This study examines the adequacy of remote image-based Retinopathy of Prematurity (ROP) diagnosis by multiple graders at three referral cutoffs. There was significant variation in accuracy, inter-grader reliability, and intra-grader reliability depending on the diagnostic cutoff, as shown in Figures 2 and 3. Sensitivity reflects the ability to detect actual disease cases, so that a negative result from a highly sensitive test helps to rule out disease; it was comparably high for each grader at all three cutoffs: any ROP, type-2 prethreshold or worse ROP, and treatment-requiring ROP. Specificity reflects the ability to correctly exclude actual non-diseased cases, so that a positive result from a highly specific test helps to rule in disease; it was comparably high at the cutoffs of type-2 prethreshold ROP and treatment-requiring ROP for graders A and B. However, the specificity of grader C was significantly lower at a cutoff of any ROP; this was caused by the systematic classification of images as “mild ROP” when reference standard examinations found “no ROP.” As a result, inter-grader reliabilities involving grader C were also significantly lower (Figure 3).
Although previous research has examined the sensitivity and specificity of image-based ROP diagnosis13–16, we are not aware of other published data sets that have analyzed the accuracy and reliability of multiple graders at different diagnostic referral cutoffs8. Taken together, these findings suggest that the accurate and reliable remote diagnosis of ROP is technically feasible by a range of graders. It is important to note that the accuracy, inter-grader reliability, and intra-grader reliability of an image-based ROP telemedicine strategy may be heavily dependent on the cutoff value that is used to define “abnormal” readings. Careful selection and training of graders will be required, regardless of whether graders are retina sub-specialists (such as graders A and B), general ophthalmologists (such as grader C), or trained laypersons.
Remote telemedicine screening is intended to determine the presence or absence of risk factors of sufficient severity to warrant referral for specialist diagnosis. This offers potential benefits such as improved access to medical care, reduced geographic and socioeconomic barriers to care, decreased waiting time, and reduced transportation costs for patients and physicians1–2. ROP diagnosis, as well as clinical research involving ROP incidence, has traditionally been based on determination of the presence of disease followed by strict classification of its severity (Figure 4A). Yet clinically, patients with “mild ROP” or “no ROP” are typically reexamined every 2 weeks without other intervention, whereas those with “prethreshold ROP” are observed much more carefully because of more significant risk for progression10. This is similar to the philosophy of a screening approach (Figure 4B). Therefore, the “type-2 prethreshold” or “treatment-requiring ROP” referral cutoffs may be much more clinically relevant than the detection of “any ROP” in assessing the accuracy and reliability of a telemedicine screening strategy14. The high specificity at the “type-2 prethreshold” and “treatment-requiring ROP” cutoffs suggests that telemedicine screening may be an effective and practical method for ruling-in clinically significant ROP for ophthalmic referral. In addition, the high inter-grader and intra-grader reliabilities at these cutoffs support the notion that this approach may be scalable. Further research must examine whether the false-negative rate will be acceptable from clinical and cost-effectiveness perspectives.
Several additional features and limitations of this study should be noted: (1) To protect patient confidentiality, images were not annotated with any clinical data. This may have biased against graders’ ability to interpret accurately. (2) No standardization of reading conditions was performed, such as resolution and color correction for monitor displays. Standardization of these features may improve accuracy and reliability of computer-based image interpretation. (3) The accuracy and reliability of the accepted reference standard, dilated examination by an experienced ophthalmologist, have not been previously established. (4) This study examined remote diagnosis based on an image atlas, and was not designed to analyze the process of image transfer.
This study demonstrates the significance of considering appropriate referral cutoffs for remote image diagnostic classification. When clinically-relevant cutoffs are used for ROP interpretation, the accuracy and intra-grader reliability of diagnosis by multiple graders are substantial. This suggests that the telemedical diagnosis of ROP is technically feasible using existing imaging modalities. Further studies involving the acceptability and cost-effectiveness tradeoffs of telemedical diagnosis will be required before standards defining disease features warranting referral for ROP can be established.
Acknowledgements
Supported by grant EY13972 from the National Institutes of Health, Bethesda, MD (MFC), and by a Career Development Award from Research to Prevent Blindness, New York, NY (MFC).
References
- 1. Grigsby J, Sanders JH. Telemedicine: where it is and where it’s going. Ann Int Med. 1998;129:123–7. doi: 10.7326/0003-4819-129-2-199807150-00012.
- 2. Bashshur RL, Reardon TF, Shannon GW. Telemedicine: a new health care delivery system. Annu Rev Public Health. 2000;21:613–37. doi: 10.1146/annurev.publhealth.21.1.613.
- 3. Field MJ, ed. Telemedicine: a guide to assessing telecommunications in health care. Washington, DC: National Academy Press, 1996.
- 4. Shea S, Starren J, Weinstock RS, et al. Columbia University’s Informatics for Diabetes Education and Telemedicine (IDEATel) project: rationale and design. J Am Med Inform Assoc. 2002;9:49–62. doi: 10.1136/jamia.2002.0090049.
- 5. Mair F, Whitten P. Systematic reviews of studies of patient satisfaction with telemedicine. BMJ. 2000;320:1517–20. doi: 10.1136/bmj.320.7248.1517.
- 6. Mair FS, Haycox A, May C, Williams T. A review of telemedicine cost-effectiveness studies. J Telemed Telecare. 2000;6 (Suppl 1):S38–40. doi: 10.1258/1357633001934096.
- 7. Committee for the classification of retinopathy of prematurity. An international classification of retinopathy of prematurity. Arch Ophthalmol. 1984;102:1130–4. doi: 10.1001/archopht.1984.01040030908011.
- 8. Chiang MF, Keenan JD, Starren J, et al. Accuracy and reliability of remote retinopathy of prematurity diagnosis. Arch Ophthalmol. In press.
- 9. Siatkowski RM, Flynn JT. Retinopathy of prematurity. In Nelson L, ed. Harley’s Pediatric Ophthalmology. 4th ed. Philadelphia: WB Saunders & Co; 1998.
- 10. Fierson WM, Palmer EA, Petersen RA, et al. Screening examination of premature infants for retinopathy of prematurity. Pediatrics. 2001;108:809–11. doi: 10.1542/peds.108.3.809.
- 11. Cryotherapy for ROP Cooperative Group. Multicenter trial of cryotherapy for retinopathy of prematurity: preliminary results. Arch Ophthalmol. 1988;106:471–9. doi: 10.1001/archopht.1988.01060130517027.
- 12. Early Treatment for ROP Cooperative Group. Revised indications for the treatment of retinopathy of prematurity: results of the early treatment for retinopathy of prematurity randomized trial. Arch Ophthalmol. 2003;121:1684–94. doi: 10.1001/archopht.121.12.1684.
- 13. Schwartz SD, Harrison SA, Ferrone PJ, Trese MT. Telemedical evaluation and management of retinopathy of prematurity using a fiberoptic digital fundus camera. Ophthalmology. 2000;107:25–8. doi: 10.1016/s0161-6420(99)00003-2.
- 14. Ells AL, Holmes JM, Astle WF, et al. Telemedicine approach to screening for severe retinopathy of prematurity: a pilot study. Ophthalmology. 2003;110:2113–7. doi: 10.1016/S0161-6420(03)00831-5.
- 15. Roth DB, Morales D, Feuer WJ, et al. Screening for retinopathy of prematurity employing the RetCam 120: sensitivity and specificity. Arch Ophthalmol. 2001;119:268–72.
- 16. Yen KG, Hess D, Burke B, et al. Telephotoscreening to detect retinopathy of prematurity: preliminary study of the optimum time to employ digital fundus camera imaging to detect ROP. J AAPOS. 2002;6:64–70.
- 17. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:157–74.