Abstract
Objective:
To compare the performance of deep learning (DL) classifiers for the diagnosis of plus disease in retinopathy of prematurity (ROP) trained using two approaches to multi-institutional model development: centralizing the data versus federated learning (FL), in which no data leaves any institution.
Design:
Evaluation of a diagnostic test or technology.
Subjects, Participants, and/or Controls:
DL models were trained, validated, and tested on 5,255 wide-angle retinal images in the neonatal intensive care units of 7 institutions as part of the Imaging and Informatics in ROP (i-ROP) study. All images were labeled for the presence of plus, pre-plus, or no plus disease with a clinical label, and a reference standard diagnosis (RSD) determined by three image-based ROP graders and the clinical diagnosis.
Methods, Intervention or Testing:
We compared the area under the receiver operating characteristic curve (AUROC) of models developed on multi-institutional data using a central approach versus FL, and compared locally trained models to both. We compared model performance (kappa) with label agreement (between clinical and RSD labels), dataset size, and number of plus disease cases in each training cohort using Spearman's correlation coefficient (CC).
Main Outcome Measures:
Model performance using AUROC and linearly-weighted kappa.
Results:
Four experimental comparisons were performed: FL trained on RSD versus central trained on RSD, FL trained on clinical labels versus central trained on clinical labels, FL trained on RSD versus central trained on clinical labels, and FL trained on clinical labels versus central trained on RSD (p=0.046, p=0.126, p=0.224, and p=0.017, respectively); only the last comparison was significant after correction. Models trained on local institutional data performed inferiorly to the FL models at 4/7 (57%) sites. Local model performance was positively correlated with label agreement between clinical and RSD labels (CC=0.389, p=0.387), total number of plus disease cases (CC=0.759, p=0.047), and overall training set size (CC=0.924, p=0.002).
Conclusions:
We show that an FL model performs comparably to a centralized model, confirming that FL may provide an effective and more feasible solution for inter-institutional learning. Smaller institutions benefited more from collaboration than larger institutions, showing the potential of FL for addressing disparities in resource access.
Keywords: retinopathy of prematurity, federated learning, deep learning, epidemiology
Introduction
Deep learning (DL) methods have shown exceptional performance across a wide range of tasks for automated image-based diagnosis. Empirical evidence has shown that the ability to train a model, and its subsequent performance in predicting disease, is highly correlated with the amount and diversity of the training data used to develop it.1–4 In the medical domain, the collection of large amounts of high-quality, labeled data is a challenging task. Even for large institutions with vast data resources, factors such as low disease incidence, variable image quality, and demographic homogeneity can undermine the development of a robust and performant model.5–7 These challenges can be circumvented by pooling data from multiple institutions to train centralized models. However, aggregating data on external servers exposes patients to privacy risks and adds overhead costs for data de-identification and duplicated data storage.
Federated learning (FL) has emerged as a promising training paradigm that allows multiple institutions to leverage their individual data resources to collaboratively develop DL models without directly sharing data, reducing time and cost and protecting patient privacy.8,9 Each contributing entity benefits from an aggregated model trained on a larger and more heterogeneous distribution of cases, which often results in more generalizable models that outperform standalone models trained using only a single institution's data. FL could therefore facilitate the democratization of highly robust and accurate DL models, even for institutions with scarce data resources or for diseases with extremely low prevalence. Within the domain of ophthalmology, there may be specific differences across institutions, such as disease prevalence, clinician diagnostic paradigms, image acquisition devices, or patient demographics. To our knowledge, no studies have demonstrated the potential of FL across institutional datasets within ophthalmology.
As a proof of concept, we demonstrate the utility of an FL approach for retinopathy of prematurity (ROP), a leading cause of childhood blindness in the United States. The Imaging and Informatics in ROP (i-ROP) consortium curated a centralized, multi-institutional dataset of ROP images from seven different hospital centers in the United States, and applied a consensus reference standard diagnosis (RSD) to all images to develop a convolutional neural network (CNN), i-ROP DL, for detection of plus disease. This algorithm was able to classify disease more accurately than ROP experts, compared to the RSD.10 However, a centralized training schema based on RSD labels is not always practical, as it is time-consuming and expensive, so there is merit in investigating whether this centralized approach was necessary to develop an effective algorithm. As such, we explored whether an FL approach, or even a model developed using only a single institution's data, could suffice, and whether the FL approach might obviate the need for central training with an RSD.
Methods
Dataset
This study was approved by an institutional review board at each center with written informed consent obtained from parents of all infants whose images were included in the study, and conforms with the Declaration of Helsinki. To demonstrate our federated learning framework, we utilized the multicenter Imaging and Informatics in Retinopathy of Prematurity (i-ROP) dataset, composed of 5,255 ROP fundus images collected from premature infants in the neonatal intensive care unit at seven academic healthcare institutions in the United States (Oregon Health and Science University, Columbia University, William Beaumont Hospital, Children’s Hospital Los Angeles, Cedars-Sinai Medical Center, University of Miami, Weill Cornell Medical Center). Each image was acquired between July 2011 and December 2016 using a commercially available camera (RetCam; Natus Medical Incorporated, Pleasanton, CA) with a standard imaging protocol at all centers. In line with the original study (Brown et al), only the posterior field of view was used for our experiments.
Ground truth
To investigate differences in consensus and individual grading, we based our experiments on two sets of ground truth: the original "bedside" clinical label, and a consensus RSD using methods previously described.11 The RSD was based on the consensus of 3 masked image graders and the clinical (ophthalmoscopic) diagnosis. In cases where there was a discrepancy between the clinical diagnosis and the three graders' majority, the 3 graders re-reviewed the label in the context of the clinical diagnosis to reach a consensus for the RSD. Herein, these two sets of ground truth labels are referred to as the "clinical" labels and the "RSD," respectively. The RSD represents a more robust gold standard given the known variability between clinicians in the diagnosis of plus disease in practice.11 Thus, training an algorithm on the RSD represents an idealized use case (given the need for consensus among multiple experts), while training on clinical labels represents a more realistic use case.
Model development
The DL architecture used for all experiments was ResNet18.12 As a preprocessing step, we extracted segmentation masks of the retinal vasculature using a previously trained U-Net encoder-decoder architecture13 to highlight the relevant vessel segments.10,14 The channel dimension of each image was repeated 3 times to leverage pretrained weights from the ImageNet dataset for an input dimension of 3×480×640 (channels, height, width).15 The classifier was trained with cross-entropy loss using the Adam optimizer.16 During training, we performed random data augmentation in the form of rotations (up to 30 degrees), horizontal flips (50% probability), and affine transformations, along with equal-class sampling per mini-batch to help mitigate the effect of class imbalance. The PyTorch framework was used for implementation and training of the DL models, with data augmentation operations provided by the MONAI library.17,18 All experiments were performed on an Nvidia GTX 1080 Ti GPU.
Our experiments utilized a variant of the FL technique known as model averaging, as shown in Figure 1.19 At the start of each round of federated training, a copy of the global model is shared with each of the seven institutions to initialize that institution’s local model. Each local model is then trained for 10 epochs, and the model checkpoint with the best validation performance, as measured by area under the receiver operating characteristic curve (AUROC), is chosen to be aggregated into the global model at the end of the round. Upon completion of all federation rounds, a copy of the global model is finally transferred to each individual site for the prediction of new, unseen data.
Figure 1: Federated learning training schema.
During each round of training, the global model is synced to all institutions locally (left). Then, each institution trains for a fixed number of epochs (center) before the local model weights are aggregated and averaged in the central server to update the global model (right).
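The aggregation step described above can be sketched in a few lines. The snippet below illustrates model averaging on plain Python dictionaries standing in for PyTorch `state_dict` tensors; the function name and toy weight values are illustrative, not taken from our implementation.

```python
def average_models(local_weights):
    """FedAvg-style aggregation: element-wise average of each parameter
    across institutions. In a real PyTorch implementation each value
    would be a tensor from model.state_dict(); floats suffice here."""
    n = len(local_weights)
    return {key: sum(w[key] for w in local_weights) / n
            for key in local_weights[0]}

# One federation round with three hypothetical sites:
site_models = [
    {"conv1.weight": 0.2, "fc.bias": -0.1},
    {"conv1.weight": 0.4, "fc.bias": 0.3},
    {"conv1.weight": 0.6, "fc.bias": 0.1},
]
global_model = average_models(site_models)
# global_model["conv1.weight"] ≈ 0.4, global_model["fc.bias"] ≈ 0.1
```

The original FedAvg formulation19 additionally weights each site's contribution by its local dataset size; simple averaging corresponds to equal weighting of institutions.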
Performance of federated versus central training by ground truth
Our primary analysis compared FL and centrally-hosted approaches (trained on pooled data from all centers) using clinical or RSD gradings. The dataset was partitioned into a held-out testing set of 2,666 images (51%) from 389 patients (45%), which was stratified by class label and institution, and a development set of 2,579 images (49%) from 478 patients (55%), which was further divided into a training set of 1,145 images (22%) from 228 patients (26%) and validation set of 1,434 images (27%) from 250 patients (29%). We repeated each experiment with 10 different random weight initializations in order to calculate variance estimates across model training experiments. Average performance of FL versus centrally-hosted models and with RSD versus clinical labels training were compared using a one-way analysis of variance (ANOVA).
Performance of models trained on local data versus federated
In a secondary analysis, we compared models trained with each institution's local data using the clinical labels. The sampling scheme for splitting the individual baseline models' data into training, validation, and test sets followed the same procedure as the federated and centrally hosted experiments. We evaluated the performance of each model using the AUROC for plus disease (according to the RSD) across a consolidated test set (composed of each institution's test set). We conducted pairwise comparisons of plus disease AUROC between the local models and the central and FL models using Tukey's range test with alpha=0.05.
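The AUROC used throughout these comparisons is equivalent to the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (the Mann-Whitney U formulation). A minimal sketch follows; in practice a library implementation such as scikit-learn's `roc_auc_score` would be used.

```python
def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic: the fraction of
    positive/negative pairs in which the positive case is scored
    higher (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # → 0.75
```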
Relationship between dataset size, disease prevalence, label quality, and model performance
We evaluated the relationships between three independent variables (the total number of cases in each institution's dataset, the number of plus disease cases, and the institutional agreement between RSD and clinical labels for 3-level plus disease classification) and the performance of the corresponding local institutional models, measured as the linearly-weighted kappa for 3-level plus disease classification (dependent variable), using Spearman's correlation coefficient (CC).
Results
Demographics
Table 1 displays the demographics of the populations at the seven sites, including the birthweight and gestational ages of the babies, as well as the total number of exams and plus disease cases by clinical and RSD labels. The number of clinical plus disease labels in each set of data varied from 0 (site B) to 54 (site G). There was variability not only in the proportion of cases diagnosed as plus disease (range 0–11%), but also in the overall dataset size at each institution (range 61–1626), and the underlying demographic risk factors for ROP (birthweight [mean range 804–1045] and gestational age [mean range 26–28]). There was variable agreement between RSD and clinical labels (mean weighted kappa 0.58, range 0.19–0.83).
Table 1.
Demographics of babies at 7 i-ROP sites included in FL model
| Site | Birth weight - mean (std) | Gestational age - mean (std) | PMA - mean (std) | Number / prevalence of plus (RSD) | Number / prevalence of plus (CLINICAL) | Kappa between RSD and clinical |
|---|---|---|---|---|---|---|
| A | 937.2±321.2 | 27.1±2.3 | 36 (3) | 27 / 911 (3%) | 25 / 911 (2.7%) | 0.67 |
| B | 1102.5±323.3 | 28.3±2.0 | 35 (2) | 0 / 328 (0%) | 0 / 328 (0%) | 0.66 |
| C | 770.1±322.2 | 26.1±2.8 | 37 (3) | 4 / 61 (6.6%) | 2 / 61 (3.3%) | 0.21 |
| D | 881.9±282.8 | 26.8±2.2 | 36 (4) | 35 / 727 (4.8%) | 26 / 727 (3.6%) | 0.51 |
| E | 799.8±232.2 | 25.9±2.1 | 38 (9) | 28 / 248 (11.3%) | 21 / 248 (8.5%) | 0.46 |
| F | 963.3±336.8 | 27.4±2.3 | 39 (7) | 22 /1344 (1.6%) | 20 / 1344 (1.5%) | 0.19 |
| G | 1058.5±334 | 27.9±2.4 | 36 (3) | 40 / 1626 (2.5%) | 54 / 1626 (3.3%) | 0.83 |
| All Sites | 967.1±327.3 | 27.3±2.4 | 37 (5) | 156 / 5245 (3.0%) | 148 / 5245 (2.8%) | 0.58 |
Performance of federated versus central training by ground truth
Figure 2 shows the performance of the 4 models: FL using RSD, FL using clinical labels, central using RSD, and central using clinical labels. The AUROC for all four models was high (ranging from 0.93 ± 0.02 to 0.96 ± 0.02). After Bonferroni correction for multiple hypothesis testing, a significant difference was observed between FL trained with clinical labels and central trained with RSD (p=0.017). No statistically significant difference was observed for the other three comparisons: FL versus central trained using RSD, FL versus central trained using clinical labels, and FL trained using RSD versus central trained using clinical labels (p=0.046, p=0.127, and p=0.224, respectively).
Figure 2. Performance of federated and centrally trained models by ground truth.
The AUROC was high for all four approaches, but models trained using a central approach slightly outperformed models trained using a federated learning approach, as did models trained using reference standard diagnosis (RSD) labels compared with clinical labels (when evaluated against RSD ground truth).
Performance of models trained on local data versus federated
Figure 3 compares the average plus disease AUROC, across all institutional test sets containing plus disease predictions (sites B and C were excluded because they had no images with clinical labels of plus disease), of the locally trained models at each institution (A-G) versus the FL and centrally trained models, all trained using clinical labels and evaluated against the RSD. Under the null hypothesis that model performance would be comparable, Tukey's range test showed that the locally trained models were inferior to both multi-institutional approaches at 4/7 (57%) sites (A, B, C, D).
Figure 3. Comparative performance of single institution versus multi-institutional models.
We compared the area under the receiver operating characteristic curve (AUROC) of the locally trained models (using clinical labels) averaged across the RSD-labeled test sets. 4/7 (57%) of locally trained models (in blue) performed inferiorly to both the central and federated learning models on data from their own institution labeled with a reference standard label.
Relationship between dataset size, disease prevalence, RSD label agreement and model performance
We evaluated the relationship between dataset size, plus disease prevalence, and clinical versus RSD label agreement and the average kappa performance of locally trained models trained using clinical labels on test sets labeled with an RSD (Figure 4). Institutions whose clinical labels had higher agreement with the RSD had moderately higher performing local models (correlation coefficient [CC] 0.389, p=0.387). We also found positive correlations between model performance and both the number of plus disease cases in the training dataset (CC 0.759, p=0.047) and the overall training set size (CC 0.924, p=0.002).
Figure 4. Relationship between label agreement, training dataset size, disease prevalence, and performance.
There was a positive correlation between clinical versus reference standard diagnosis (RSD) label agreement and the average kappa performance of the local models against the RSD (CC 0.389, p=0.387). The correlation between the number of plus disease cases in the training set and kappa performance was 0.759 (p=0.047), while the correlation between overall training set size and kappa performance was 0.924 (p=0.002).
Discussion
Large quantities of diverse training data improve the generalizability of DL models.20,21 When such data cannot be collated by a single institution, multi-institutional collaboration becomes a necessity. In this paper, we found that for an ROP classification task, training models using FL led to better local performance than locally trained models at 4/7 sites (Figure 3). Moreover, we found that the FL approach was comparable to centrally hosted models trained with either RSD or clinical ground truth labels (Figure 2), despite variability in agreement between clinical labels and the RSD (Figure 4). This is important because, while RSD labels are preferred, they are not practical to acquire in a real-world setting due to the need for evaluation and adjudication by multiple experts.22 The robustness of the FL model trained on clinical labels shows that the model is able to learn from less reliable (i.e., noisy) labels. Thus, FL represents a more private and convenient alternative to the currently most common mode of collaboration. To our knowledge, this is the first study to evaluate the utility of FL methods for DL model development in ophthalmology. As such, this result serves as an important proof of concept within ophthalmology and supports an expansion of evaluative efforts for this technology to other diseases.
A key difference between our work and many prior studies utilizing FL for healthcare applications is our evaluation of real patient data from 7 different hospital centers rather than simulating or synthesizing different institutions' data distributions. Thus, the heterogeneity of the training data in our federated learning setup is more representative of real-life performance. We also investigated both heterogeneity in patient demographics and heterogeneity in label noise by running experiments using bedside labels (clinical) in addition to a consensus ground truth (RSD). Retrospective analysis of patient demographics and the relative proportion of plus disease cases revealed a high degree of inter-institutional variance. Furthermore, clinical labels at individual institutions exhibited varying degrees of concordance with the corresponding RSD labels (Table 1). Thus, the FL training paradigm brings together diverse training data to train more generalizable models without requiring centralized data collection or an RSD for training. Prior to the introduction of FL, collaboration relied on training a single centralized model on pooled data from all participating sites. While simple to train, this approach becomes increasingly impractical in real-world contexts with growing concerns about patient privacy and regulations on sharing of medical data. FL may therefore offer a viable alternative to training on centralized datasets by generating high-performing and robust models without the risks and overhead of transferring sensitive medical datasets.
The relative performance of models trained on datasets from individual sites was also analyzed. While some institutional models were found to yield comparable AUROC values to the FL and central models tested on the same sets, training sets from certain sites (e.g., Site B and Site C) yielded models that performed poorly on test sets from external institutions (Figure 3). We found that the number of plus disease cases and label quality were correlated with model performance. In real-world applications, data characteristics will inevitably vary between institutions with differing patient populations, disease incidence rates, data acquisition methods (e.g., scanner type, protocol), and clinical and diagnostic biases. Our findings suggest that FL may facilitate diagnostic consistency and accuracy between institutions in spite of these differences. In a multi-institutional network supporting FL, institutions with varying levels of positive case volume or expertise mutually benefit. In the most extreme scenario, a larger institution with high positive case volume and institutional expertise for labeling will experience minimal performance gains from a federated model; smaller institutions may therefore have the most to gain. Finally, such models would additionally create a "universal" metric with which to compare epidemiological data and clinical diagnostic patterns between institutions, an implication explored by Hanif et al.23
There are several limitations to our study. First, the datasets contributed by each respective institution are not all entirely population-based. While some participants adopted a policy of enrolling nearly all eligible patients regardless of disease severity, others may have enrolled only specific tiers of severity or complexity, or enrolled only when the coordinators had time, introducing a possibility of population bias in their cohorts. As demonstrated in our analysis, varying proportions of disease severity between cohorts may result in differences in model performance (Figure 3). If a population's distribution is not truly reflected in these training cohorts, the model performance may not be reflective of what would be observed in a prospective evaluation. Second, this is a proof-of-concept experiment utilizing a prospective consecutive dataset. While demonstrative of the DL model and FL architecture's performance, the practical obstacles of implementing this method, such as limited communication bandwidth, are better observed in a real-time implementation across multiple servers. Future studies may investigate federated learning within this more realistic setting, and further explore the relationship between local dataset size and contribution to model development.24 Lastly, as our study focused only on the performance aspect of FL, experiments were performed under the assumption that the veracity of model weights was verified to protect against unintentional (or intentional) corruption of the global model. In a real-world setting, privacy-preserving federated learning techniques would need to be implemented.25
Conclusion
This study demonstrates the viability of FL as a new paradigm for disease classification in the context of ROP. Our results indicate that FL trained on point-of-care labels performs comparably to models trained on centralized datasets with consensus-expert labels, supporting the feasibility of the approach in real-world settings. In other words, our approach is more secure and more convenient than the more commonly used collaborative approach. Given our proof-of-concept results for ROP, further study is warranted to demonstrate the versatility of FL in other use cases within ophthalmology.
Financial Support:
This work was supported by grants R01 EY19474, R01 EY031331, R21 EY031883, and P30 EY10572 from the National Institutes of Health (Bethesda, MD), and by unrestricted departmental funding and a Career Development Award (JPC) from Research to Prevent Blindness (New York, NY). The sponsors or funding organizations had no role in the design or conduct of this research.
Conflict of Interest Disclosures
Drs. Campbell, Chan, and Kalpathy-Cramer receive research support from Genentech (San Francisco, CA). Dr. Chiang previously received research support from Genentech.
The i-ROP DL system has been licensed to Boston AI Lab (Boston, MA) by Oregon Health & Science University, Massachusetts General Hospital, Northeastern University, and the University of Illinois, Chicago, which may result in royalties to Drs. Chan, Campbell, Brown, and Kalpathy-Cramer in the future.
Dr. Campbell was a consultant to Boston AI Lab (Boston, MA).
Dr. Chan is on the Scientific Advisory Board for Phoenix Technology Group (Pleasanton, CA), a consultant for Alcon (Ft Worth, TX).
Dr. Chiang was previously a consultant for Novartis (Basel, Switzerland), and was previously an equity owner of InTeleretina, LLC (Honolulu, HI).
Drs. Chan and Campbell are equity owners of Siloam Vision.
Members of the i-ROP research consortium:
Oregon Health & Science University (Portland, OR): Michael F. Chiang, MD, Susan Ostmo, MS, Sang Jin Kim, MD, PhD, Kemal Sonmez, PhD, J. Peter Campbell, MD, MPH, Robert Schelonka, MD, Aaron Coyner, PhD. University of Illinois at Chicago (Chicago, IL): RV Paul Chan, MD, Karyn Jonas, RN, Bhavana Kolli, MHA. Columbia University (New York, NY): Jason Horowitz, MD, Osode Coki, RN, Cheryl-Ann Eccles, RN, Leora Sarna, RN. Weill Cornell Medical College (New York, NY): Anton Orlin, MD. Bascom Palmer Eye Institute (Miami, FL): Audina Berrocal, MD, Catherin Negron, BA. William Beaumont Hospital (Royal Oak, MI): Kimberly Denser, MD, Kristi Cumming, RN, Tammy Osentoski, RN, Tammy Check, RN, Mary Zajechowski, RN. Children’s Hospital Los Angeles (Los Angeles, CA): Thomas Lee, MD, Aaron Nagiel, MD, Evan Kruger, BA, Kathryn McGovern, MPH, Dilshad Contractor, Margaret Havunjian. Cedars Sinai Hospital (Los Angeles, CA): Charles Simmons, MD, Raghu Murthy, MD, Sharon Galvis, NNP. LA Biomedical Research Institute (Los Angeles, CA): Jerome Rotter, MD, Ida Chen, PhD, Xiaohui Li, MD, Kent Taylor, PhD, Kaye Roll, RN. University of Utah (Salt Lake City, UT): Mary Elizabeth Hartnett, MD, Leah Owen, MD. Stanford University (Palo Alto, CA): Darius Moshfeghi, MD, Mariana Nunez, B.S., Zac Wennber-Smith, B.S.. Massachusetts General Hospital (Boston, MA): Jayashree Kalpathy-Cramer, PhD. Northeastern University (Boston, MA): Deniz Erdogmus, PhD, Stratis Ioannidis, PhD. Asociacion para Evitar la Ceguera en Mexico (APEC) (Mexico City): Maria Ana Martinez-Castellanos, MD, Samantha Salinas-Longoria, MD, Rafael Romero, MD, Andrea Arriola, MD, Francisco Olguin-Manriquez, MD, Miroslava Meraz-Gutierrez, MD, Carlos M. Dulanto-Reinoso, MD, Cristina Montero-Mendoza, MD.
Footnotes
Meeting Presentation: This work is not under consideration for presentation and has not previously been presented at any meeting.
References
- 1. Lin T-Y, Maire M, Belongie S, et al. Microsoft COCO: Common Objects in Context. Computer Vision – ECCV 2014. 2014:740–755. doi:10.1007/978-3-319-10602-1_48.
- 2. Gulshan V, Peng L, Coram M, et al. Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA 2016;316:2402–2410.
- 3. Dunnmon JA, Yi D, Langlotz CP, et al. Assessment of Convolutional Neural Networks for Automated Classification of Chest Radiographs. Radiology 2019;290:537–544.
- 4. Deng J, Dong W, Socher R, et al. ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition. 2009. doi:10.1109/cvprw.2009.5206848.
- 5. Choi RY, Coyner AS, Kalpathy-Cramer J, et al. Introduction to Machine Learning, Neural Networks, and Deep Learning. Transl Vis Sci Technol 2020;9:14.
- 6. Srivastava N, Hinton G, Krizhevsky A, et al. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 2014;15:1929–1958.
- 7. Schmidhuber J. Deep learning in neural networks: an overview. Neural Netw 2015;61:85–117.
- 8. Sheller MJ, Edwards B, Reina GA, et al. Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data. Sci Rep 2020;10:1–12.
- 9. Sheller MJ, Reina GA, Edwards B, et al. Multi-institutional Deep Learning Modeling Without Sharing Patient Data: A Feasibility Study on Brain Tumor Segmentation. In: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. Springer, Cham; 2018:92–104.
- 10. Brown JM, Campbell JP, Beers A, et al. Automated Diagnosis of Plus Disease in Retinopathy of Prematurity Using Deep Convolutional Neural Networks. JAMA Ophthalmol 2018;136:803. doi:10.1001/jamaophthalmol.2018.1934.
- 11. Ryan MC, Ostmo S, Jonas K, et al. Development and evaluation of reference standards for image-based telemedicine diagnosis and clinical research studies in ophthalmology. AMIA Annu Symp Proc 2014;2014:1902–1910.
- 12. He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016. doi:10.1109/cvpr.2016.90.
- 13. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. Available at: https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/ [Accessed October 11, 2021].
- 14. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. Available at: https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/ [Accessed October 10, 2021].
- 15. Russakovsky O, Deng J, Su H, et al. ImageNet Large Scale Visual Recognition Challenge. Int J Comput Vis 2015;115:211–252. doi:10.1007/s11263-015-0816-y.
- 16. Loshchilov I, Hutter F. Decoupled Weight Decay Regularization. 2017. Available at: http://arxiv.org/abs/1711.05101 [Accessed October 10, 2021].
- 17. Paszke A, Gross S, Massa F, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. Adv Neural Inf Process Syst 2019;32.
- 18. Project MONAI documentation. Available at: https://docs.monai.io/en/latest/ [Accessed October 10, 2021].
- 19. McMahan HB, Moore E, Ramage D, et al. Communication-Efficient Learning of Deep Networks from Decentralized Data. 2016. Available at: http://arxiv.org/abs/1602.05629 [Accessed October 10, 2021].
- 20. Chen JS, Coyner AS, Ostmo S, et al. Deep Learning for the Diagnosis of Stage in Retinopathy of Prematurity. Ophthalmol Retina 2021. doi:10.1016/j.oret.2020.12.013.
- 21. Chang K, Beers AL, Brink L, et al. Multi-Institutional Assessment and Crowdsourcing Evaluation of Deep Learning for Automated Classification of Breast Density. J Am Coll Radiol 2020;17.
- 22. Abramoff MD, Cunningham B, Patel B, et al. Foundational Considerations for Artificial Intelligence Utilizing Ophthalmic Images. Ophthalmology 2021.
- 23. Hanif AM, Lu C, Chang K, et al. Federated learning for multi-center collaboration in ophthalmology: implications for clinical diagnosis and disease epidemiology. Ophthalmology Science (in review).
- 24. Kamp M, Fischer J, Vreeken J. Federated Learning from Small Datasets. arXiv preprint 2021;2110.03469.
- 25. Kaissis GA, Makowski MR, Rückert D, Braren RF. Secure, privacy-preserving and federated machine learning in medical imaging. Nat Mach Intell 2020;2:305–311.