Abstract
Purpose:
To assess the diagnostic accuracy of multiple machine learning models using full retinal nerve fiber layer (RNFL) thickness maps in detecting glaucoma.
Design:
Case-control study.
Subjects:
93 eyes from 69 glaucoma patients and 128 eyes from 128 age- and sex-matched healthy controls from the Los Angeles Latino Eye Study, a large population-based, longitudinal cohort study consisting of Latino participants 40 years of age and older residing in La Puente, California, USA.
Methods:
6×6mm RNFL thickness maps centered on the optic nerve head (Cirrus 4000; ZEISS) were supplied to four different machine learning algorithms. These models included two conventional machine learning algorithms, Support Vector Machine (SVM) and K-Nearest Neighbor (KNN), and two convolutional neural nets, ResNet-18 and GlaucomaNet, which was a custom-made deep learning network. All models were tested with 5-fold cross validation.
Main Outcome Measures:
Area-under-curve (AUC) statistics to assess diagnostic accuracy of each model in comparison to that for conventional average circumpapillary RNFL thickness.
Results:
All four models achieved similarly high diagnostic accuracies, with AUC values ranging from 0.91 to 0.92. These values were significantly higher than that for average circumpapillary RNFL thickness, which had an AUC of 0.76 in the same patient population.
Conclusions:
Superior diagnostic performance was achieved with both conventional machine learning and convolutional neural net models, as compared to circumpapillary RNFL thickness. This supports the importance of the spatial structure of RNFL thickness map data in diagnosing glaucoma and further efforts to optimize our use of this data.
Precis:
Four machine learning models, including two convolutional neural networks, were trained to diagnose glaucoma using RNFL thickness maps. Diagnostic performance of all four models was superior to that of conventional circumpapillary RNFL thickness.
Introduction
Glaucoma, a leading cause of blindness worldwide, is an irreversible optic nerve degeneration, characterized by progressive loss of retinal ganglion cells (RGCs) and their axons (retinal nerve fiber layer; RNFL).1,2 Clinical diagnosis of glaucoma is typically based on evaluation of the optic disc on funduscopic exam and visual field testing, which reveal characteristic excavation of the neuroretinal rim with corresponding visual field defects. In recent decades, optical coherence tomography (OCT) has provided additional information about the mean and sectoral circumpapillary RNFL thickness as well as the inner macula thickness, both of which have offered reasonable diagnostic value.3 Clinicians also use the other available OCT data, such as RNFL thickness and deviation maps, qualitatively in their assessment of glaucoma patients.
Recent years have seen machine learning applied to the medical field, aided by advances in computer processing and big data analysis. Both OCT and non-OCT data, such as visual field tests and fundus images, have been used to diagnose glaucoma with machine learning.4–8 However, when OCT RNFL data are used in machine learning algorithms, only measured parameters like clock hour data or cup to disc ratio have been used as inputs to the classifiers, and these parameters may fail to capture critical information about the complete RNFL structure. In this study, we used the entire RNFL thickness maps from glaucoma and age-matched healthy participants of the Los Angeles Latino Eye Study (LALES) to evaluate the diagnostic accuracy of multiple machine learning algorithms, including deep learning models, which use receptive fields to extract and evaluate local features. We hypothesized that machine learning algorithms could utilize data from the complete RNFL thickness map to improve diagnostic accuracy as compared to mean circumpapillary RNFL thickness alone.
Methods
Study Population and Clinical Exam
This study was based on data from the Los Angeles Latino Eye Study (LALES), a large population-based, longitudinal cohort study consisting of Latino participants 40 years of age and older residing in La Puente, California, USA. Details of the study design, baseline characteristics, and methods for these studies have been reported previously.9 The study protocols were approved by the Institutional Review Board at the University of Southern California and adhered to the tenets of the Declaration of Helsinki. Written, informed consent was obtained from all participants. Study participants received a complete eye examination by an ophthalmologist including visual acuity, Goldmann applanation tonometry, gonioscopy, visual field testing, anterior and posterior segment examination, and stereoscopic fundus and optic disc photos. Visual field testing was performed using the Humphrey Automated Field Analyzer II (ZEISS, Dublin, CA, USA) using the Swedish Interactive Threshold Algorithm (SITA) Standard. If the results for the VF were normal, no additional VF testing was done. If the results were abnormal or unreliable, a repeat VF test was performed.
Glaucoma Diagnosis for Gold Standard
Diagnosis of glaucoma was made by consensus of two fellowship trained glaucoma specialists, and a third glaucoma specialist when disagreement occurred. The details of glaucoma diagnosis by the expert panel have been reported previously and will be described here.10 Two glaucoma specialists independently evaluated each participant’s clinical history, clinic examination data, optic disc photography, and visual field results. Each glaucoma specialist independently decided the presence or absence of open angle glaucoma, angle closure glaucoma, and other glaucoma based on specified guidelines reported previously10 and detailed below. The specialists focused on the visual field test results and stereoscopic optic disc photo evaluation in determining a diagnosis of glaucoma. IOP level was not considered in determining a diagnosis of glaucoma. A diagnosis of glaucoma was made if there was evidence of characteristic or compatible glaucomatous optic disc damage on stereo fundus photography and/or congruent, characteristic, or compatible glaucomatous visual field abnormalities. Simultaneous stereoscopic optic disc photographs were evaluated with a stereoscopic viewer (Asahi, Pentax, Englewood, CO) to characterize optic nerve findings including vertical and horizontal cup-to-disc ratio, cup-to-disc ratio asymmetry between the 2 eyes, disc or peripapillary nerve fiber layer hemorrhage, diffuse thinning of the neuroretinal rim (remaining rim <0.1), and notching of the neuroretinal rim (remaining rim in a localized area <0.1). Glaucomatous optic neuropathy was classified as characteristic if one of the following criteria was met: horizontal or vertical C/D ratio ≥0.8, notching of the neuroretinal rim, localized or diffuse loss of the neuroretinal rim with a maximum remaining rim of <0.1, or nerve fiber layer defect in the arcuate bundles. 
Each of the two glaucoma specialists graded VF loss as: characteristic or compatible with glaucoma; defect due to other nonglaucomatous/neurologic cause or artifact; or not determinable, based on the optic disc evaluation, clinical examination data, and evaluation of disc and fundus photographs. Visual field defects were deemed characteristic of glaucoma if they corresponded to the nerve fiber layer bundle pattern, which included nasal steps (superior or inferior), paracentral defects, arcuate defect, central island, temporal island, and absolute defect. Visual field defects that conformed to nerve fiber bundle loss but deviated in some manner from the characteristic defects, including both superior and inferior nasal steps or a VF defect present in one VF but not in the second VF tests (defects in the nasal, arcuate, or paracentral regions), were defined as compatible with glaucoma. Of note, while OCT data was collected on all participants in this study, OCT data was not accessible to glaucoma specialists on the diagnosis panels.
Selection of Cases and Controls
OCT data was collected from participants of LALES at follow-up (approximately 9 years after presentation) between 2009 and 2014. At this visit, participants were expected to be 49 years and older. Because the number of glaucomatous cases was limited, every LALES eye with a glaucoma diagnosis, as determined by the expert panel, and available OCT data was included in our glaucoma group, including both eyes of a patient where applicable. However, images that came from the same patient were always placed together in either the training set or the testing set. This prevented inter-eye correlation from inflating performance, which could occur if a model were tested on one eye of a patient after being trained on the fellow eye. Control eyes were selected from normal participants who were frequency matched to cases by age (5-year age group) and sex. These eyes were free of diabetic retinopathy, macular degeneration, other retinopathy or retinal disease, visually significant cataract worse than 20/40, corneal opacity, or prior ocular procedure. For the control group, only one eye of each participant was used in the study.
OCT Data
The OCT data of interest was the 6×6 mm, 200 × 200 pixel RNFL thickness maps centered on the optic nerve head (Cirrus 4000, ZEISS). The acquisition procedure ensured that the optic disc was in the center of the image. Beyond this, we did not perform any image registration. We examined all RNFL thickness maps in our dataset and found no instances of severe misalignment of the optic nerve head. Participants were excluded if their OCT scans had signal strengths below 6/10 or if their scans were missing more than 7% of the area of the 6×6 mm RNFL thickness map with the optic disc excluded. The optic nerve head region was excluded based on the OCT structural image, which delineated the disc by the Bruch’s membrane opening as calculated by the OCT device. The device exported a thickness map of the RNFL and the optic disc as a red-green-blue (RGB) image in which the optic nerve head was labeled in gray. We developed an algorithm to detect the gray colors of this image and created a mask to exclude the optic disc from the exported RNFL thickness map.
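The masking and exclusion steps described above can be sketched as follows. This is a minimal illustration and not the authors' algorithm. It assumes the exported map is available as rows of (R, G, B) tuples, that "gray" means the three channels are nearly equal and of mid intensity (the thresholds `tol`, `lo`, and `hi` are hypothetical), and that missing pixels are flagged in a separate boolean grid. All function names are invented for this example.

```python
def disc_mask(rgb_rows, tol=8, lo=60, hi=200):
    """Return a boolean grid that is True where a pixel looks gray (assumed optic disc).

    A pixel counts as gray when its channels differ by at most `tol` and all lie
    between `lo` and `hi`, so pure black or white pixels are not flagged.
    """
    return [
        [max(px) - min(px) <= tol and lo <= min(px) and max(px) <= hi for px in row]
        for row in rgb_rows
    ]


def missing_fraction(missing_rows, disc_rows):
    """Fraction of pixels outside the disc that are flagged as missing."""
    outside = 0
    missing = 0
    for miss_row, disc_row in zip(missing_rows, disc_rows):
        for is_missing, is_disc in zip(miss_row, disc_row):
            if not is_disc:
                outside += 1
                missing += bool(is_missing)
    return missing / outside if outside else 1.0


def passes_area_check(missing_rows, disc_rows, max_missing=0.07):
    """Keep a scan only if at most 7% of the map outside the disc is missing."""
    return missing_fraction(missing_rows, disc_rows) <= max_missing
```

In practice the color thresholds would have to be tuned to the device's export palette, which the text does not specify.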
Machine Learning Models
We applied four different machine learning models in this study: two conventional machine learning models, Support Vector Machine (SVM) and K-Nearest Neighbor (KNN), and two deep learning models, ResNet-18 and GlaucomaNet, a compact convolutional neural net we designed. KNN and SVM are well-established conventional algorithms that are widely used for classification tasks. For KNN, a new input is classified by a plurality vote of its closest members of the training set in the feature space. SVM, on the other hand, creates a hyperplane in the feature space that maximizes the distance to the different classes of the training set and classifies an input according to the side of the hyperplane on which it lies.
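The two conventional decision rules can be illustrated with a short sketch. It assumes each thickness map has been flattened into a vector of valid pixel values, uses Euclidean distance and k = 3 for KNN, and shows only the prediction step of a linear SVM. The kernel, the value of k, and the distance metric are not stated in the text, and the weights `w` and offset `b` are taken as given rather than fitted.

```python
from collections import Counter
import math


def knn_predict(train_x, train_y, x, k=3):
    """Label x by plurality vote among its k nearest training vectors."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], x))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def linear_svm_predict(w, b, x):
    """Label x by the side of the hyperplane w.x + b = 0 on which it lies.

    Returns the label and the signed score; fitting w and b (margin
    maximization) is outside the scope of this sketch.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    label = "glaucoma" if score > 0 else "healthy"
    return label, score
```

Neither rule looks at where a pixel sits in the map: both treat the flattened map as an unordered feature vector.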
Unlike KNN and SVM, convolutional neural nets (CNNs), the class of deep learning networks to which ResNet-18 and GlaucomaNet belong, came into wide use only in recent years. CNNs consist of convolutional layers, which are sequences of local filters that extract features; these features are eventually combined in the fully connected layer to derive a classification decision. ResNet was introduced in 2015 and has proven to be one of the most accurate deep learning algorithms to date.11 The design of shortcut connections in ResNet allows the neural networks to go much deeper compared to other networks. The ResNet-18 model has 17 convolutional layers and one fully connected layer.
The GlaucomaNet, which we designed, had 3 convolutional layers, each followed by an activation layer and a pooling layer, and 2 fully connected layers before the binary output (glaucoma or not). This network was compact, thus reducing the risk of overfitting, and was less time-consuming to train, modify, and analyze. Schematics of GlaucomaNet and ResNet-18 are shown in Figure 2.
Data augmentation was performed for both ResNet-18 and GlaucomaNet. When data augmentation was performed, we ensured that the overall structure was not largely altered. We used only three kinds of augmentation: center cropping to 3/4 of the image width and resizing back to the size needed for the network; translation along one of the axes by 3%; and slight rotation of the images by a maximum of 20 degrees. Moreover, transfer learning (i.e., fine-tuning a model pre-trained on a natural image dataset) was used for ResNet as well. Both maneuvers have been reported to lead to higher classification accuracy for deep learning.12
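To make the three augmentations concrete, the sketch below computes their parameters for a 200-pixel-wide map. It is an illustration under assumptions: the text does not say whether augmentations were combined, which direction the shift took, or how rotation angles were drawn. Here one augmentation is drawn at a time, the shift direction is random, and the angle is uniform within ±20 degrees. The function names are hypothetical.

```python
import random


def center_crop_box(size, keep=0.75):
    """Return (start, stop) pixel indices of a centered crop keeping `keep` of the width."""
    crop = round(size * keep)
    start = (size - crop) // 2
    return start, start + crop


def sample_augmentation(size=200, rng=random):
    """Draw one of the three augmentations together with its parameters."""
    kind = rng.choice(["center_crop", "translate", "rotate"])
    if kind == "center_crop":
        # The cropped region would then be resized back to the network input size.
        return kind, center_crop_box(size)
    if kind == "translate":
        shift = round(0.03 * size)  # 3% of the width, i.e. 6 pixels for a 200-pixel map
        axis = rng.choice(["x", "y"])
        direction = rng.choice([-1, 1])
        return kind, (axis, direction * shift)
    # Rotation angle in degrees, limited to 20 degrees in either direction.
    return kind, rng.uniform(-20.0, 20.0)
```

For a 200 × 200 pixel map, the crop keeps the central 150 × 150 pixels and the translation moves the map by 6 pixels, so the optic disc stays close to the image center in every case.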
For all models, all valid pixel values were used. Every model was tested with 5-fold cross validation. For each test run, 20% of the dataset was used as the testing set while the rest of the dataset was used for training. This was repeated 5 times so that every image in the dataset was tested once for accuracy. Training GlaucomaNet and ResNet-18 took 142s and 669s, respectively, while training KNN and SVM took 0.54s and 121.3s. All timing results were obtained on the same computer.
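A patient-level 5-fold split of the kind described here, with both eyes of a patient kept in the same fold, might look like the following sketch. The greedy balancing rule and the function names are assumptions made for illustration. The text does not describe how folds were constructed or whether they were stratified by diagnosis.

```python
from collections import defaultdict


def patient_level_folds(patient_ids, n_folds=5):
    """Assign each eye (by index) to a fold so that all eyes of one patient share a fold."""
    eyes_by_patient = defaultdict(list)
    for eye_index, pid in enumerate(patient_ids):
        eyes_by_patient[pid].append(eye_index)

    folds = [[] for _ in range(n_folds)]
    # Place patients with the most eyes first, always into the currently smallest fold.
    for pid in sorted(eyes_by_patient, key=lambda p: -len(eyes_by_patient[p])):
        smallest = min(folds, key=len)
        smallest.extend(eyes_by_patient[pid])
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), using each fold once as the testing set."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Because every eye appears in exactly one testing fold, each image is scored once by a model that never saw that patient during training.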
Statistical Analysis
Area-under-curve (AUC) statistics were calculated for each machine learning model based on the probability of glaucoma output by each model. Comparison p-values were obtained based on DeLong’s method.
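For reference, the AUC can be computed directly from model outputs as the probability that a randomly chosen glaucoma eye receives a higher score than a randomly chosen healthy eye. The sketch below is a plain pairwise implementation for illustration. It does not implement DeLong's comparison, and the label coding (1 for glaucoma, 0 for healthy) is an assumption.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5).

    `scores` are model outputs where higher means more likely glaucoma;
    `labels` are 1 for glaucoma and 0 for healthy.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one eye from each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

For a measure such as mean RNFL thickness, where lower values indicate disease, the thickness would be negated before being passed in as the score.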
Results
Of 103 glaucomatous eyes and 171 control eyes, we excluded 10 glaucoma eyes and 43 normal eyes due to poor signal strength or missing data. There were 93 glaucoma eyes and 128 age- and sex-matched healthy eyes included in this analysis. A summary of the patient demographics is shown in Table 1. There was no significant difference in age, sex, best-corrected visual acuity, or axial length between the two groups. The glaucoma group was more myopic and had lower visual field mean deviation, higher (treated) intraocular pressure, and higher cup to disc ratio compared to the healthy group. Among the glaucoma cases, based on Hodapp-Parrish-Anderson staging of the visual field mean deviation (VF MD) values, the largest portion of the glaucoma cases was in the mild stage of glaucoma. There were 47 glaucoma cases with VF MD greater than −6 dB, 20 cases between −6 and −12 dB, and 20 cases with VF MD worse than −12 dB (6 glaucoma cases had missing VF MD values).
Table 1.
Variables | Glaucoma (93 eyes from 69 participants) | Healthy (128 eyes from 128 participants) | P-value° |
---|---|---|---|
Age (patient) | 72 (11)^ | 70 (10) | 0.132 |
Female Sex (patient) | 33 (48) | 73 (57) | 0.216 |
History of Ocular Anti-Hypertensive Drop | 35 (38) | 2 (2) | <0.0001 |
Best-Corrected Visual Acuity | 27 (32) | 22 (7) | 0.190 |
Spherical Equivalent (D) | –1.45 (2.05) | –0.70 (1.03) | 0.0177 |
VF MD (dB) | –6.92 (8.28) | –1.84 (3.32) | <0.0001 |
Axial Length (mm) | 23.3 (2.21) | 23.2 (1.01) | 0.365 |
Intraocular Pressure (mmHg) | 16 (4.0) | 15 (3.3) | 0.0121 |
Cup to Disc Ratio | 0.76 (0.13) | 0.39 (0.16) | <0.0001 |
^ All data listed as mean (standard deviation) or frequency (percent)
° P-values for eye-specific variables were based on linear regression, controlling for age correlation. Chi-square test was used for the person-specific categorical variables, and two-sample t-test assuming unequal variances was used for person-specific continuous variables.
The average accuracy, specificity, and sensitivity for the four models are shown in Table 2. There was a trend toward slightly higher accuracy, sensitivity, and specificity for ResNet-18 compared to the other 3 models (SVM, KNN, and GlaucomaNet), and a trend toward slightly higher AUC for GlaucomaNet compared to the other 3 models. Overall, however, there was no significant difference in AUCs between any of the four machine learning models, with all comparison p-values greater than 0.45.
Table 2.
Classification Model | Accuracy* | Sensitivity | Specificity | Area Under Curve Statistic |
---|---|---|---|---|
SVM | 0.869 | 0.817 | 0.906 | 0.915 |
KNN | 0.860 | 0.828 | 0.883 | 0.911 |
ResNet-18 | 0.905 | 0.860 | 0.938 | 0.906 |
GlaucomaNet | 0.878 | 0.828 | 0.914 | 0.921 |
* Based on the models’ probability output of >0.5, or in the case of SVM based on distance between classification boundaries.
SVM = Support Vector Machine; KNN= K-Nearest Neighbor; ResNet-18 and GlaucomaNet are convolutional neural networks.
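The relationship between the three threshold-based columns of Table 2 can be checked with a short calculation. The confusion-matrix counts below are not reported in the text. They are back-calculated here from the reported sensitivity and specificity and the group sizes (93 glaucoma eyes, 128 healthy eyes), assuming results were pooled over the five folds, and are shown only as an illustration.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # glaucoma eyes correctly flagged
    specificity = tn / (tn + fp)  # healthy eyes correctly cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity


# Inferred (tp, fn, tn, fp) counts: an assumption for illustration, not source data.
inferred_counts = {
    "SVM": (76, 17, 116, 12),
    "KNN": (77, 16, 113, 15),
    "ResNet-18": (80, 13, 120, 8),
    "GlaucomaNet": (77, 16, 117, 11),
}

for name, counts in inferred_counts.items():
    acc, sens, spec = classification_metrics(*counts)
    print(f"{name}: accuracy={acc:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
```

Under these inferred counts the computed values agree with Table 2 to within rounding, which suggests the reported accuracy is consistent with the reported sensitivity and specificity at the stated group sizes.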
Figure 3 demonstrates the receiver operating characteristic (ROC) curves for the 4 models compared to mean RNFL thickness. Each of the 4 machine learning models outperformed mean RNFL thickness as determined by AUC value, with all comparison p-values less than 0.001. For further reference, in comparison to the AUC for mean RNFL thickness of 0.76 in this population, the AUC values of quadrant RNFL thickness were as follows: 0.56 (temporal RNFL), 0.75 (superior RNFL), 0.61 (nasal RNFL), and 0.79 (inferior RNFL).
Discussion
In this study, we trained and tested two conventional machine learning models and two convolutional neural network models to use data from the entire RNFL thickness map to diagnose glaucoma. Each of the four models achieved similarly high diagnostic accuracies, with AUC values above 0.90, and these values were superior to that for average circumpapillary RNFL thickness, which had an AUC of 0.76 in the same patient population. Average RNFL thickness value has often been considered to be among the single OCT parameters with the highest diagnostic performance, and thus it serves as a good control to determine the impact of using machine learning models using RNFL thickness maps to improve diagnostic accuracy.3,13 Leung et al. previously examined the diagnostic value of the RNFL deviation map using a scoring system that a grader used to assess the map and reported modestly higher diagnostic accuracy compared to average circumpapillary RNFL thickness.14 Additionally, Wu et al. demonstrated that qualitative evaluation of the OCT volume scans of the disc and macula had modest improvement compared to average circumpapillary RNFL thickness.15 This finding demonstrates that machine learning algorithms based on the entire RNFL thickness map data can substantially improve diagnostic accuracy and optimize the use of all available RNFL data. Furthermore, this finding confirms the importance of the spatial distribution of RNFL structure in assessment of glaucoma.
There were minor differences when comparing our 4 machine learning models. The deep learning models, ResNet-18 and GlaucomaNet, slightly outperformed our conventional models, KNN and SVM, in accuracy. This is likely because, in deep learning algorithms, regional information is extracted with convolutional filters and then combined to arrive at the diagnosis, whereas in KNN and SVM there is no local feature extraction of the map. Thus, local features of the RNFL thickness structure may have been lost in the KNN and SVM models. It was interesting that, although GlaucomaNet had a rather simple and straightforward network structure, it achieved surprisingly good accuracy. This might be because OCT-generated RNFL thickness maps have relatively consistent positions and expected structures for healthy eyes. Therefore, the classification task is much easier compared to diagnosis of other diseases such as tumor or lymph node pathology. Other advantages of GlaucomaNet are that fully training it took less than a quarter of the time needed to fully train ResNet-18 and that the smaller network is less prone to overfitting, especially when only a small dataset is available. Another important finding in this study was that even with a relatively small training dataset, compared to that of other published machine learning algorithms based on fundus photographs for example, substantial improvement in diagnostic accuracy over average circumpapillary RNFL thickness was seen.16 Future work with larger training datasets could explore whether this yields any further incremental improvement in diagnostic accuracy.
One limitation of this study is that all the patients included are from the Los Angeles Latino Eye Study. Therefore, the results of this study may not be applicable to other ethnic populations. Future work will include applying these models to other populations. Additionally, while this study demonstrates the importance of the entire RNFL thickness map, we cannot demonstrate which features of the map are most important in triggering the decision of glaucoma diagnosis. Future studies could evaluate the activation maps from the deep learning networks to improve our understanding of this. Additionally, future studies could utilize longitudinal OCT data to evaluate the efficacy of these models in detecting glaucomatous progression and in staging disease. Finally, the machine learning algorithms had access to the size of the excluded disc, and our population had a larger disc size in the glaucoma group. However, disc size alone achieved an AUC of only 0.71, which indicates that there is significant diagnostic value in the RNFL thickness map itself. In a population where glaucoma patients do not have a larger disc size, the machine learning algorithms may not perform quite as well as reported here.
In summary, this study demonstrates the high diagnostic accuracy of machine learning models using RNFL thickness map data. Such work could enhance our ability to use OCT as a screening tool for glaucoma in the future.
Acknowledgments
Financial Support: This work was supported by National Institutes of Health Grants (K23EY027855–01, GMR; U10EY011753, RV), an American Glaucoma Society Young Clinician Scientist Grant (GMR), and an unrestricted grant to the USC Department of Ophthalmology from Research to Prevent Blindness. The sponsors or funding organizations had no role in the design or conduct of this research.
Footnotes
Parts of the data presented in the current manuscript were previously presented at the Association for Research in Vision and Ophthalmology Annual Meeting 2018.
Conflict of Interest: No conflicting relationship exists for any author.
References
- 1. Heijl A, Leske MC, Bengtsson B, et al. Reduction of intraocular pressure and glaucoma progression: results from the Early Manifest Glaucoma Trial. Arch Ophthalmol 2002;120(10):1268–79.
- 2. Weinreb RN, Aung T, Medeiros FA. The pathophysiology and treatment of glaucoma: A review. JAMA 2014;311(18):1901–11.
- 3. Bussel II, Wollstein G, Schuman JS. OCT for glaucoma diagnosis, screening and detection of glaucoma progression. Br J Ophthalmol 2014;98(Suppl 2):ii15–9.
- 4. Kim SJ, Cho KJ, Oh S. Development of machine learning models for diagnosis of glaucoma. PLoS One 2017;12(5):e0177726.
- 5. Barella KA, Costa VP, Vidotti V Gonçalves, et al. Glaucoma Diagnostic Accuracy of Machine Learning Classifiers Using Retinal Nerve Fiber Layer and Optic Nerve Data from SD-OCT. J Ophthalmol 2013;2013:789129.
- 6. Bizios D, Heijl A, Hougaard JL, et al. Machine learning classifiers for glaucoma diagnosis based on classification of retinal nerve fibre layer thickness parameters measured by Stratus OCT. Acta Ophthalmol 2010;88(1):44–52.
- 7. Boden C, Chan K, Sample PA, et al. Assessing Visual Field Clustering Schemes Using Machine Learning Classifiers in Standard Perimetry. Invest Ophthalmol Vis Sci 2007;48(12):5582–90.
- 8. Vidotti VG, Costa VP, Silva FR, et al. Sensitivity and specificity of machine learning classifiers and spectral domain OCT for the diagnosis of glaucoma. Eur J Ophthalmol 2013;23(1):61–9.
- 9. Varma R, Paz SH, Azen SP, et al. The Los Angeles Latino Eye Study: design, methods, and baseline data. Ophthalmology 2004;111:1121–31.
- 10. Varma R, Ying-Lai M, Francis BA, et al. Prevalence of Open-Angle Glaucoma and Ocular Hypertension in Latinos: The Los Angeles Latino Eye Study. Ophthalmology 2004;111:1439–48.
- 11. He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016;770–778.
- 12. Shin HC, Roth HR, Gao M, et al. Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning. IEEE Trans Med Imaging 2016;35(5):1285–98.
- 13. Schweitzer C, Korobelnik J, Le Goff M, et al. Diagnostic performance of peripapillary retinal nerve fiber layer thickness for detection of glaucoma in an elderly population: the ALIENOR study. Invest Ophthalmol Vis Sci 2016;57:5882–91.
- 14. Leung CK, Lam S, Weinreb RN, et al. Retinal nerve fiber layer imaging with spectral-domain optical coherence tomography: analysis of the retinal nerve fiber layer map for glaucoma detection. Ophthalmology 2010;117:1684–1691.
- 15. Wu Z, Weng DSD, Rajshekhar R, et al. Effectiveness of a qualitative approach toward evaluating OCT imaging for detecting glaucomatous damage. Transl Vis Sci Technol 2018;7(4):7.
- 16. Gulshan V, Peng L, Coram M, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 2016;316:2402–10.