Abstract
Objective
Federated Learning (FL) enables collaborative model training while keeping data local. Currently, most FL studies in radiology are conducted in simulated environments because numerous hurdles impede translation into practice. The few existing real-world FL initiatives rarely communicate the specific measures taken to overcome these hurdles. To bridge this knowledge gap, we propose a comprehensive guide for real-world FL in radiology. Given the effort required to implement real-world FL, there is also a lack of comprehensive assessments comparing FL to less complex alternatives in challenging real-world settings, which we address through extensive benchmarking.
Materials and Methods
We developed our own FL infrastructure within the German Radiological Cooperative Network (RACOON) and demonstrated its functionality by training FL models on lung pathology segmentation tasks across six university hospitals. Insights gained while establishing our FL initiative and running the extensive benchmark experiments were compiled and categorized into the guide.
Results
The proposed guide outlines essential steps, identified hurdles, and implemented solutions for establishing successful FL initiatives that conduct real-world experiments. Our experimental results demonstrate the practical relevance of our guide and show that FL outperforms less complex alternatives in all evaluation scenarios.
Discussion and Conclusion
Our findings justify the efforts required to translate FL into real-world applications by demonstrating advantageous performance over alternative approaches. Additionally, they emphasize the importance of strategic organization and robust management of distributed data and infrastructure in real-world settings. With the proposed guide, we aim to help future FL researchers circumvent pitfalls and accelerate the translation of FL into radiological applications.
Keywords: radiology, artificial intelligence, federated learning, healthcare infrastructure, distributed systems
Objectives
Deep learning (DL) has revolutionized radiological image analysis, rapidly driving radiological research and increasingly transforming clinical routine. Training powerful DL models requires access to vast and diverse datasets. A practical approach is to train on centralized data from multiple centers in a pooled data lake. However, such data aggregation is often complicated by regulatory requirements, including privacy regulations such as GDPR, HIPAA, state-specific healthcare laws or federal privacy laws.1 Federated Learning (FL)2 addresses these issues by keeping data at the originating medical centers. Instead of sharing data, FL collaboratively trains models through periodic exchanges of DL model weights between locally training participants and a central server, achieving performance comparable to centrally trained models.3 Consequently, FL holds significant potential in healthcare, enabling sufficiently large datasets even for rare diseases or minority populations.4
Most FL research is currently conducted in simulated environments,5 lacking broad translation into real-world applications due to practical implementation challenges. To address this predicament, we identified three major problems (P). P1: The few studies examining real-world FL in general medicine,6–11 and specifically in radiology,4,7,12–18 provide limited insights into real-world challenges of implementing FL in practice, leaving a significant knowledge gap. P2: Theoretical FL studies,5,12,19–34 which discuss hypothetical applications and potential challenges of FL in medicine, often do not account for real-world complexities and lack real-world FL investigations proving actual experience with these challenges, thereby reducing their practical relevance. P3: Given the numerous real-world challenges of FL, one may question its benefits or lean towards less complex alternatives. However, there is a lack of comprehensive benchmarking comparing FL to alternatives in a real-world setting across various evaluation scenarios.
Within the framework of the German Radiological Cooperative Network (RACOON) (https://racoon.network/),35 we have built and established the first nation-wide collaborative radiology initiative of its kind, which includes all 38 university hospitals of the country. We evaluated its functionality by conducting proof-of-concept real-world FL experiments to collaboratively train DL segmentation models on radiological image data across six university hospitals. This setup provides a ready-to-use infrastructure for future researchers to conduct clinical research using FL without building their own systems. While developing this FL initiative and conducting experiments with real-world datasets in a real-world setting, we encountered and overcame numerous difficulties and practical hurdles. These novel insights, largely unknown in the literature,4–9,12–16,19–34 motivated us to compile an extensive guide for real-world FL in radiology.
Our contributions (C) tackling the identified problems (P) are threefold (see Figure 1). C1: We propose a detailed guide for building real-world FL initiatives based on our first-hand experiences along with relevant literature. This guide outlines essential steps, highlights encountered issues and provides solutions for each phase of real-world FL in radiological research, aiming to support and accelerate future efforts. C2: We conducted real-world training of DL segmentation models for lung pathology detection using data distributed across six sites, demonstrating the functionality of our FL infrastructure and emphasizing the practical relevance of our proposed guide. C3: Recognizing the challenges of real-world FL (C1) and the potential preference for less complex approaches, we compare FL to simpler alternatives such as local model training and ensembling.36–38 We benchmark these approaches extensively across various evaluation scenarios: personalization, i.e., the benefits participating sites gain from FL training; and generalization, i.e., the benefits non-participating sites gain from leveraging models collaboratively trained at other sites, with or without incorporating local training capabilities.
Figure 1.
The underlying issue, the three identified problems, and the corresponding contributions that address them to establish real-world FL initiatives in radiology. The underlying issue is that practical hurdles and inherent difficulties impede a straightforward realization of FL in the real world. We identified three major problems (P1-P3) and aim to solve them with our contributions C1-C3.
Materials and methods
Categorization of insights in building real-world FL initiatives
The term “challenge” is frequently used in FL research, yet the issues encountered in actual real-world FL implementations are diverse. We define “real-world” studies as those utilizing private, sensitive datasets that are not publicly available and employing distributed computing infrastructure integrated into clinical IT ecosystems. Additionally, we characterize an FL initiative as the combination of community engagement, organizational structures, legal agreements, and infrastructure necessary to conduct and scale FL experiments beyond a single execution.
For our proposed guide (C1), we classify challenges of real-world FL implementation into two categories: practical hurdles and inherent difficulties. Practical hurdles encompass organizational, legal, or technical issues that can be solved through agreements or technical solutions. We share our solutions based on practical experiences with these hurdles. In contrast, inherent difficulties refer to limitations in real-world FL of an organizational, technical, or research-related nature that cannot be avoided but must be acknowledged to successfully develop FL initiatives and conduct real-world experiments.
We further categorize these issues based on the scope within which they impact the establishment of an FL initiative. These categories include organization, legal requirements, infrastructure setup, experiment preparation, and experiments and evaluation.
Real-world FL study: experimental setup
FL infrastructure
Our real-world FL efforts are part of the German RACOON initiative, which aims to use artificial intelligence to advance radiological research across all 38 German university hospitals. Each hospital is equipped with a server hosting key software: Mint Lesion (https://mint-medical.com/de/mint-lesion) for structured radiological reporting, SATORI (https://www.mevis.fraunhofer.de/de/research-and-technologies/werkzeuge-fuer-ki-kollaborationen.html) and ImFusion Labels (https://www.imfusion.com/products/imfusion-labels) for imaging data annotation, and a Kaapana-based (https://www.kaapana.ai/) platform for medical image processing. Thereby, Kaapana facilitates curating radiological data39 for subsequent local and federated training38 of DL models, supporting various studies within RACOON.
In our real-world FL experiments, we orchestrated a centralized FL initiative across a subset of six out of 38 university hospitals by connecting their Kaapana platforms to a central server. The six university hospitals were: Charité Berlin (CHA), Technical University of Munich (TUM), University Medicine Essen (UME), and the university hospitals in Frankfurt am Main (UKF), Cologne (UKK) and Kiel (UKKI) (Figure 2).
Figure 2.
Phases of the proposed guide for building and deploying real-world FL in radiology (A) and the infrastructure of the RACOON FL initiative (B). FL infrastructure with a central FL server and six participating sites (TUM, UME, UKF, CHA, UKK, UKKI) maintaining data locally and periodically exchanging model weights during FL training. Detailed view of the site infrastructure (C): radiological images are queried from the PACS, processed by third-party clinical systems (e.g., annotation tools) and used for FL training on the FL platform.
Dataset
In our FL experiments (C2), we trained a DL segmentation model on a lung CT dataset in which every scan contains pathologies, owing to a cohort design that investigates various lung diseases based on the extent of pathologies. For this dataset, voxel-level segmentation annotations of three pathologies were created by independent radiological readers, supervised by experienced board-certified radiologists at each site. The three types of pathological image patterns segmented were (non-)malignant consolidation (Cons), ground-glass opacity (GGO), and pleural effusion (PE), which are significant predictors of disease progression in various lung diseases.40 To avoid bias towards specific diseases, the dataset was curated to maintain a balanced number of samples from 20 different lung diseases across all sites. Data provision varied between sites: TUM, UME, and UKF provided manually generated voxel-level annotations; CHA, UKK, and UKKI provided automatically pre-processed, manually corrected annotations created using a nnU-Net model41 trained on public data42 (qualitative annotation comparison in Figure A1). For model training, data at each site was split into training and test sets with an 80% to 20% ratio, with special care taken to maintain this ratio for the less common PE cases. Applying this data split together with site-specific inclusion and exclusion criteria resulted in the curated datasets per site shown in Figure 3. Detailed descriptive statistics of the distributed data are given in Figure 4.
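The per-site 80%/20% split that preserves the ratio for the less common PE cases can be sketched as a stratified split. This is a minimal illustration and not the pipeline used in the study: the function name `stratified_split`, the boolean `has_pe` flag per case, and the fixed random seed are assumptions made for this example.

```python
import random


def stratified_split(cases, test_fraction=0.2, seed=0):
    """Split (case_id, has_pe) pairs into train and test lists.

    The split is done separately for cases with and without PE, so that
    both groups keep roughly the same train/test ratio (hypothetical helper).
    """
    rng = random.Random(seed)
    train, test = [], []
    for flag in (True, False):
        # Sort first so the result depends only on the seed, not on input order.
        group = sorted(case_id for case_id, has_pe in cases if has_pe == flag)
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Example: 50 cases at one site, 10 of which contain PE.
cases = [(f"case_{i}", i < 10) for i in range(50)]
train_ids, test_ids = stratified_split(cases)
```

Splitting each group separately keeps rare classes represented in the test set, which a plain random split over all cases would not guarantee.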
Figure 3.
Cohort definition after data curation, filtering and splitting of the distributed data across the six participating sites resulting in the final training and test sets.
Figure 4.
Data characteristics of CT data and annotation labels across the six participating sites. (A) CT scanner manufacturer distribution; (B) Average voxel volume distribution of CT scans; (C) Relative histogram of CT attenuation in HU (note: -1000 HU corresponds to air); (D) Annotation label distribution; (E) Annotation label volume distribution; (F) CCA: number of CC per annotation label.
Training details
The DL model utilized is a state-of-the-art U-Net model from the self-configuring medical image segmentation framework nnU-Net.41,43 The self-configuration process of the nnU-Net model uses a dataset fingerprint optimizing the model configuration through rule-based, fixed, and empirical parameter selection, making it a well-performing off-the-shelf baseline model.
The model's self-configuration process is straightforward for local training. However, in a federated setup, this procedure requires multiple steps to synchronize across participating sites, following the implementation in Kades et al.38 Each site generates a dataset fingerprint from its local training data, which is sent to the central server. The server aggregates these fingerprints and redistributes them to all sites, ensuring each site configures and initializes the model identically (Figure A2).
In our experiments, we used the low-resolution configuration of the nnU-Net model to optimize training efficiency and retained its self-configured parameters without further modifications or hyper-parameter tuning. Consequently, there was no need for a validation split, and we trained the model for a fixed number of 1000 epochs (ideal for local training41). Each model processed 250 batches per epoch; further nnU-Net training details are provided in Table A1. To optimize all sites' objectives equally, we used non-weighted averaging, updating the global model weights from the locally updated model weights after each local epoch and federated communication round,38 see Equation (1).
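$$ w^{t+1} = \frac{1}{K} \sum_{k=1}^{K} w_k^{t+1} \qquad (1) $$

where $K$ is the number of participating sites, $w_k^{t+1}$ denotes the model weights of site $k$ after its local epoch in communication round $t+1$, and $w^{t+1}$ denotes the resulting global model weights.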
As the experimental studies serve as proof-of-concept, real-world FL investigations, we chose established DL and FL methods rather than aiming for methodological novelty.
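The non-weighted averaging step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: model weights are represented as plain dictionaries mapping a parameter name to a flat list of floats, and the function name `average_weights` is hypothetical.

```python
def average_weights(site_weights):
    """Return the unweighted mean of the model weights of all sites.

    site_weights: list with one dict per site, each mapping a parameter
    name to a flat list of floats (assumed representation).
    """
    if not site_weights:
        raise ValueError("at least one site is required")
    n_sites = len(site_weights)
    averaged = {}
    for name in site_weights[0]:
        # Group the i-th value of this parameter across all sites.
        per_position = zip(*(weights[name] for weights in site_weights))
        averaged[name] = [sum(values) / n_sites for values in per_position]
    return averaged


# Example: three sites, one parameter tensor with two entries.
global_weights = average_weights([
    {"conv.weight": [0.0, 3.0]},
    {"conv.weight": [0.0, 6.0]},
    {"conv.weight": [3.0, 0.0]},
])
```

Because every site contributes with the same factor, a site with few training samples influences the global model as much as a site with many, in contrast to sample-size-weighted averaging.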
Evaluation metrics and ranking
To evaluate segmentation performances of trained models, we selected the following metrics according to investigated pathologies.44–46 We chose the intersection-based Dice Similarity Coefficient (DSC), as it is the default segmentation metric and suitable for the three target pathologies.45,46 We assessed the segmentation performance using the distance-based metrics Normalized Surface Dice (NSD), suitable for Cons and GGO45 with a threshold of 1 mm, and the Hausdorff Surface Distance (HSD), relevant for PE.46 The medically relevant difference of predicted and annotated volumes was measured via Normalized Average Volume Error (NAVE).45
Based on the utilized metric implementation,47 we disregarded samples with an empty ground truth (False Positives). For False Negatives, we set DSC and NSD to 0.0, HSD to 260.0 mm (height of a lung48) and NAVE to 20.0, twice the average of True Positives from local models.
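The handling of empty masks described above can be illustrated for the DSC. The sketch below is not the metric implementation cited in the text: it operates on flat binary masks for a single class, and the names `FALSE_NEGATIVE_VALUES`, `dice_or_none`, and `mean_dice` are assumptions made for this example. Only the fallback values are taken from the text.

```python
# Values assigned to False Negatives, as stated in the text.
FALSE_NEGATIVE_VALUES = {"DSC": 0.0, "NSD": 0.0, "HSD": 260.0, "NAVE": 20.0}


def dice_or_none(pred, truth):
    """DSC for one class on flat binary masks (lists of 0/1).

    Returns None if the ground truth is empty (sample is disregarded) and
    the False Negative value if the class is annotated but not predicted.
    """
    truth_sum = sum(truth)
    pred_sum = sum(pred)
    if truth_sum == 0:
        return None
    if pred_sum == 0:
        return FALSE_NEGATIVE_VALUES["DSC"]
    overlap = sum(p * t for p, t in zip(pred, truth))
    return 2.0 * overlap / (pred_sum + truth_sum)


def mean_dice(pairs):
    """Mean DSC over (pred, truth) pairs, skipping disregarded samples."""
    scores = [dice_or_none(pred, truth) for pred, truth in pairs]
    kept = [score for score in scores if score is not None]
    return sum(kept) / len(kept) if kept else None
```

For the DSC the False Negative value coincides with what the formula would return anyway; the explicit branch matters for the distance- and volume-based metrics, which are undefined for an empty prediction.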
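As we consider all metrics equally relevant, we determine the best performing method through a ranking. For each site and metric, we compute the mean metric over all test samples and classes, resulting in scores per site. All compared methods are ranked per metric score, and all rankings are averaged to obtain the overall ranking (Equation (2)).

$$ \bar{R}(m) = \frac{1}{|S|\,|J|} \sum_{s \in S} \sum_{j \in J} \operatorname{rank}_{s,j}(m) \qquad (2) $$

where $S$ is the set of evaluated sites, $J$ the set of metrics, and $\operatorname{rank}_{s,j}(m)$ the rank of method $m$ among all compared methods for metric $j$ on site $s$.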
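The rank averaging described above can be sketched as follows. This is an illustrative reading of the procedure, not the evaluation code of the study: the nested dictionary layout, the function name `average_ranks`, and the tie handling (ties are broken by input order) are assumptions.

```python
def average_ranks(scores, higher_is_better):
    """Average the rank of each method over all sites and metrics.

    scores[method][site][metric] holds the mean metric value.
    higher_is_better[metric] is True for metrics such as DSC and NSD and
    False for metrics such as HSD and NAVE.
    """
    methods = list(scores)
    sites = list(scores[methods[0]])
    rank_sums = {method: 0.0 for method in methods}
    n_rankings = 0
    for site in sites:
        for metric, higher in higher_is_better.items():
            # Best method first; rank 1 is the best rank.
            ordered = sorted(
                methods,
                key=lambda m: scores[m][site][metric],
                reverse=higher,
            )
            for position, method in enumerate(ordered, start=1):
                rank_sums[method] += position
            n_rankings += 1
    return {method: rank_sums[method] / n_rankings for method in methods}


# Example with two hypothetical methods on one site.
example = {
    "method_a": {"site_1": {"DSC": 0.50, "HSD": 100.0}},
    "method_b": {"site_1": {"DSC": 0.40, "HSD": 120.0}},
}
ranks = average_ranks(example, {"DSC": True, "HSD": False})
```

Averaging ranks rather than raw values avoids mixing metrics with different units and ranges, such as DSC in [0, 1] and HSD in millimeters.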
Study design
In our experimental studies, we demonstrate for C2 the execution of real-world FL experiments within our built FL infrastructure. First, we investigate data characteristics via descriptive statistics assessing the data heterogeneity.
Beyond previous studies,36–38 we benchmark for C3 the performance of a model trained via federated learning against the locally trained models of each site and an ensemble of these local models, obtained by averaging the softmax probabilities of the model predictions. Additionally, we assess specialized versions of the ensemble and the federated model, obtained by ensembling each of them with the local model specific to the site being evaluated.
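Ensembling by averaging softmax probabilities, and the specialization built on top of it, can be sketched as follows. This is a minimal illustration: probabilities are plain nested lists (voxels by classes), the function names are hypothetical, and the equal weighting of the collaborative and the local model in `specialize` is an assumption, since the text only states that the two are ensembled.

```python
def ensemble_probabilities(model_probs):
    """Average per-voxel class probabilities over several models.

    model_probs: one entry per model, each a list of voxels, each voxel a
    list of class probabilities (assumed layout).
    """
    n_models = len(model_probs)
    ensembled = []
    for voxel_across_models in zip(*model_probs):
        per_class = zip(*voxel_across_models)
        ensembled.append([sum(values) / n_models for values in per_class])
    return ensembled


def predict_labels(probabilities):
    """Pick the most probable class index for every voxel."""
    return [max(range(len(p)), key=p.__getitem__) for p in probabilities]


def specialize(collaborative_probs, local_probs):
    """Ensemble a collaborative model with the evaluated site's local model."""
    return ensemble_probabilities([collaborative_probs, local_probs])


# Example: two voxels, two classes.
federated = [[0.9, 0.1], [0.2, 0.8]]
local = [[0.5, 0.5], [0.6, 0.4]]
labels = predict_labels(specialize(federated, local))
```

Averaging probabilities instead of hard labels lets a confident model outweigh an uncertain one on individual voxels.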
We evaluate the compared models across three distinct evaluation scenarios summarized in Table 1.
Table 1.
Comprehensive overview of benchmarked models in the three distinct evaluation scenarios: personalization, generalization with local training, generalization without local training.
| Model | Personalization | Generalization without local training | Generalization with local training |
|---|---|---|---|
| | ✓ | ✗ | ✓ |
| | ✗ | ✓ | ✗ |
| | ✓ | ✗ | ✓ |
| | ✗ | ✓ | ✗ |
| | ✓ | ✗ | ✗ |
| | ✗ | ✓ | ✓ |
| | ✓ | ✗ | ✓ |
| | ✓ | ✗ | ✗ |
| | ✗ | ✗ | ✓ |
The models are the local models, the ensemble of those, the federated model, and the specializations obtained by ensembling the ensemble and the federated model with the local model specific to the site being evaluated.
The personalization scenario evaluates how a participating site can obtain improved models from joining collaborative efforts. Given the heterogeneous annotation procedures, we first investigate personalization capabilities in three-site experiments, each with homogeneous annotation procedures, before extending to six-site experiments. We assess segmentation performance by comparing locally trained models, an ensemble of locally trained models, a federated model, and the specialized versions of the ensemble and the federated model. Subsequently, we explore the generalization capabilities of models trained across the three sites with manual annotations.
The second scenario explores generalization without local training, focusing on sites that cannot train their own models and therefore rely solely on models trained at other sites or through collaborative efforts among those sites. We benchmark the performance of all other sites' local models, excluding the local model of the site being tested, against the ensemble of those local models and a federated model trained without the tested site.
The third scenario addresses generalization with local training, targeting sites that have local training capabilities but are hesitant to join real-world FL efforts due to its complexities. We compare the performance of the local models, the ensemble of those, a federated model trained without the tested site, and the specializations of the ensemble and the federated model.
Results
Insights in building real-world FL initiatives
For C1, we share our experiences from developing a real-world FL initiative, complemented with insights from literature.4–16,19–34 We compiled them into a comprehensive guide (Table 2) and a Gantt chart (Figure A3), navigating through various phases, steps, and issues of translating FL into the real world.
Table 2.
Detailed guide to build and conduct real-world FL in radiological research outlining the phases, steps, and associated issues.
| Steps & Issues (D: difficulty, H: hurdle, S: solution) |
|---|
| Phase 1: Organization |
|
| Step 2: Conclude non-technical agreements covering health protocols, intellectual property (19), scientific acknowledgement (★) |
|
|
|
| Step 8: Identify task to evaluate, e.g., communication efficiency, model performance, security robustness (20,21,30) |
|
| Phase 2: Legal requirements |
|
|
| Phase 3: Infrastructure set-up |
| Step 2: Provision and access VMs for deploying the FL platform on-site (★,16) |
|
| Step 4: Configure network settings to allow site VMs to access the container registry and connect to the FL server (★,16) |
| Step 5: Locate data in relevant source systems and configure secure read-access (16) |
|
| Phase 4: Experiment preparation |
|
|
| Phase 5: Experiments and evaluation |
|
|
|
|
Literature insights are cited; our insights are marked with ★.
Practical hurdles in organizing an FL initiative are convincing sites' IT departments and governance stakeholders to participate through incentives,6,16 and harmonizing terminology across the initiative.16 Organizational issues to consider are methods to assess and value sites’ data quantity, quality, heterogeneity, and infrastructure contributions,29 the need for available on-site personnel (IT staff, expert annotators),26 and strategies for incorporating human oversight, particularly across time zones.25,26 Clear governance, traceability, and accountability for human expertise and site policies should be established,26 and technical requirements for the FL platform regarding data access history, training configurations, and error handling should be clarified.19,34 Further practical hurdles include the need to create detailed specifications for radiological imaging data and annotations,10 establishing effective communication channels such as regular meetings and chat rooms among FL researchers, radiologists, and IT staff, and aligning the FL platform's development cycle with necessary features and fixes. Selecting the state-of-the-art DL and FL methods to use is a further difficulty to clarify.
Legal steps involve negotiating bidirectional contracts between sites and the FL initiative,14 potentially across diverse legal jurisdictions,15 addressing regulations regarding software support and audits,16 and agreements considering model weights as non-patient-related, shareable data.
Practical steps in infrastructure setup include meeting essential on-site infrastructure requirements,5,20,21,25–31,33,34 acquiring and connecting hardware, provisioning and accessing on-site virtual machines (VMs),16 managing limited disk space and resources,5,9,16 configuring firewall permissions, and establishing communication with third-party systems, e.g., PACS and annotation tools.5,16,28 Technical difficulties include minimizing the strain on site resources16 and managing limited communication capacities.9 Further significant practical issues involve installing FL platforms in highly restricted clinical IT environments and holding debugging sessions between FL platform engineers and local IT staff to resolve network access and communication issues between the FL platform and third-party systems.
During the experiment preparation phase, practical hurdles primarily involve data-related issues including low quality but high heterogeneity,15,19,23,24,26,27,29,34 the necessity to standardize custom data,16 and addressing missing data harmonization or inconsistencies.11,15 Further hurdles are insufficient data specifications, which lead to preventable data heterogeneity,13 and data handling that violates standards or specifications. Data readiness for experiments can be ensured by deploying automated data validators, which minimize the need for manual inspections by local IT staff, and by FL researchers regularly conducting sanity checks on experimental results. Additionally, the necessity for on-site debugging possibilities for FL researchers29 and the trade-off between allowing researchers to find relevant data on-site and maintaining data privacy19,26 represent considerable issues.
In the experiments and evaluation phase, hurdles include managing infrastructure failures, such as site dropouts6 due to straggling,25 crashing,28,31,33 and disconnected sites,27 which require experiment restarts and lead to idle machines at other sites. These issues are exacerbated by insufficient error logging and limited technical documentation, necessitating on-site debugging.13 Technical difficulties involve time variations per federated communication round due to infrastructure heterogeneity across the initiative,12 the necessity to minimize strain on sites' resources,16 managing limited communication capacities,9 and dealing with data heterogeneity that delays the convergence of federated models.25 Lastly, it is crucial to ensure that the final FL model is readily available on-site for evaluation.
Real-world FL study: experimental results
For C2, the experimental results demonstrate the functionality of our FL infrastructure, built and deployed by successfully overcoming the previously identified practical hurdles and difficulties, underscoring the practical relevance of our proposed guide. We benchmarked benefits hospitals gain by leveraging the power of FL, demonstrated across three distinct evaluation scenarios for C3.
Data characteristics
Data heterogeneity across participating sites impacts performance of models trained via FL.50 We examined the characteristics of the distributed data to understand the degree of heterogeneity (descriptive statistics in Figure 4). Since CTs were provided pseudonymized, we relied on technical metadata and image-derived characteristics, excluding demographic details of the cohort.
The dataset comprises CT scans from four manufacturers, predominantly Siemens and Philips, with site CHA being an outlier. The voxel volumes vary from 0.15 mm³ (site UKK) to 4.84 mm³ (site UME). The HU intensity distributions across the scans are similar due to normalization of HU values. Annotation labels, including PE as the least occurring pathology, are uniformly distributed across sites. We analyzed the differences between sites' annotations by examining annotation volumes and conducting a connected component analysis (CCA). Site CHA features the largest annotation volumes, especially for PE cases. The CCA highlighted significant variations in annotation procedures among sites, with automatically pre-processed sites (CHA, UKK, UKKI) having a higher count of connected components (CC) than manually annotated sites (TUM, UME, UKF) (Figure A1).
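Counting connected components per annotation label, as done in the CCA, can be sketched with a breadth-first search. This is a minimal standard-library illustration and not the analysis code of the study: the volume is a nested list indexed as `volume[z][y][x]`, and the choice of 6-connectivity (face neighbours only) is an assumption.

```python
from collections import deque

# Face neighbours in 3D (6-connectivity, assumed for this sketch).
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def count_components(volume, label):
    """Count 6-connected components of `label` in a 3D label volume."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    seen = set()
    count = 0
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                if volume[z][y][x] != label or (z, y, x) in seen:
                    continue
                # New component found: flood-fill it so it is counted once.
                count += 1
                seen.add((z, y, x))
                queue = deque([(z, y, x)])
                while queue:
                    cz, cy, cx = queue.popleft()
                    for dz, dy, dx in NEIGHBOURS:
                        nz, ny, nx = cz + dz, cy + dy, cx + dx
                        inside = 0 <= nz < depth and 0 <= ny < height and 0 <= nx < width
                        if inside and volume[nz][ny][nx] == label and (nz, ny, nx) not in seen:
                            seen.add((nz, ny, nx))
                            queue.append((nz, ny, nx))
    return count


# Example: label 1 forms two components, label 2 forms one.
volume = [
    [[1, 0, 1], [0, 0, 0]],
    [[1, 0, 0], [0, 0, 2]],
]
n_label_1 = count_components(volume, 1)
```

A high component count for a label indicates many small, fragmented regions, which is how differences between annotation procedures become visible in such an analysis.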
Segmentation performance evaluation
Personalization
Evaluating personalization capabilities among manually annotated sites (TUM, UME, UKF) shows an overall superior performance of achieving the best rank with average metrics: DSC = 0.47 (95% CI: 0.43-0.52), NSD = 0.39 (95% CI: 0.35-0.44), HSD = 127.94 mm (95% CI: 114.40-141.57) and NAVE = 6.20 (95% CI: 0.0-13.57). The introduced specialization stabilizes the segmentation performances across all metrics, whereas non-specialized models (, , ) show considerable variability. We conclude that collaborative approaches outperform local models , while among non-specialized collaborative approaches outperforms (Figure 5A, Table A2, Figure A4; qualitative results in Figure 6).
Figure 5.
Heatmap visualization of the ranks achieved by compared models in the five evaluation scenarios (A-E) over the four metrics (DSC, NSD, HSD, NAVE). Each row represents a model, with darker squares indicating better performance and ranks. The final column shows the average rank for each model, with the arrow pointing towards the best rank. The evaluation scenarios are (A) Personalization of manually annotated sites; (B) Personalization of automatically pre-processed sites; (C) Personalization of all sites; (D) Generalization without local training of manually annotated sites; (E) Generalization with local training of manually annotated sites.
Figure 6.
Qualitative segmentation results on a test sample from site UKF with Cons in violet, GGO in cyan, PE in yellow. Segmentation predictions of the models , , , with the specialization approach for and highlighted.
Considering personalization capabilities across automatically pre-processed sites (CHA, UKK, UKKI), we obtain superior performance of the specialized and non-specialized FL models ( and ) on average rank with average metrics for of DSC = 0.41 (95% CI: 0.40-0.51), NSD = 0.40 (95% CI: 0.39-0.49), HSD = 127.96 mm (95% CI: 106.51-134.51), NAVE = 2.02 (95% CI: 0.49-3.23). Despite achieving only a single first-place ranking, its specialization contributes to a more consistent performance across metrics, securing its top position on average rank. Moreover, the results reveal a trend that sites with poor local model performance gain greater benefits from FL (Figure 5B, Table A3, Figure A5).
Incorporating all six sites in the benchmarking introduces a high data heterogeneity due to differences in annotations. Despite this, we obtain for the best average ranking with average metrics of DSC = 0.44 (95% CI: 0.43-0.50), NSD = 0.38 (95% CI: 0.36-0.43), HSD = 136.48 mm (95% CI: 122.60-143.33) and NAVE = 10.10 (95% CI: 0.0-35.81). The results support previously observed trends that specialization brings more stable performances and ranks, while sites with poorer local performance notably benefit from FL (Figure 5C, Table A4, Figure A6). Additionally, it can be observed that expanding an FL initiative does not necessarily lead to improved model performance (Tables A2, A3, and A5).
Generalization without local training
The generalization performance is evaluated among the three manually annotated sites TUM, UME and UKF. The best generalizing model on average rank is with average metrics of DSC = 0.42 (95% CI: 0.37-0.46), NSD = 0.33 (95% CI: 0.29-0.37), HSD = 140.93 mm (95% CI: 127.30-155.77) and NAVE = 9.52 (95% CI: 0.0-23.97). The results reveal that local models from other sites do not generalize well, whereas collaborative approaches, particularly with , demonstrate superior generalization capabilities (Figure 5D, Table A6, Figure A7).
Generalization with local training
For sites with training capabilities seeking to circumvent efforts of real-world FL, the results consistently reveal that collaborative approaches are superior compared to local models . Although non-specialized outperforms , the top-performing model is with average metrics of DSC = 0.46 (95% CI: 0.41-0.50), NSD = 0.37 (95% CI: 0.33-0.41), HSD = 125.48 mm (95% CI: 112.06-139.92), NAVE = 3.65 (95% CI: 1.20-6.08). This suggests that the greatest benefit for site is achieved by adopting a FL model trained by other sites and specializing it with their local model (Figure 5E, Table A7, Figure A8).
Discussion
The guide we propose for building real-world FL initiatives (C1) details steps, identifies hurdles, and provides solutions that we have implemented, steering clear of hypothetical solutions. All issues associated with real-world FL were either resolved or circumvented. Key solutions included ensuring that each site had motivated and capable staff through targeted incentives such as scientific acknowledgement. Additionally, defining data specifications during the organizational phase and employing data validators in the experiment preparation phase were crucial to ensure data readiness. Insufficient data specifications led to varied annotation procedures among radiologists, resulting in poorer model performance (Figure 5B). This was evident from the performance disparities observed between sites with manual versus automatically pre-processed annotations (Tables A4-A6). Our most effective solution involved providing offline-installable VMs to deploy FL platforms at sites with restricted internet access. Conversely, our least effective solution was lengthy troubleshooting sessions via video conference, in which FL platform engineers, third-party system engineers, and on-site IT staff debugged software interfaces or deployed algorithms on sites' data instead of accessing the server remotely. Particularly hindering difficulties were idle machines at faster sites due to heterogeneous infrastructure and frequent errors during the initial experimental phase, until we identified and addressed contributing factors such as nightly backups and other maintenance activities conducted by local IT.
Our experimental studies demonstrate the functionality of our FL infrastructure, emphasize the relevance of our proposed guide for real-world applications (C2) and address P3 by benchmarking FL against alternative approaches to determine whether it is worth the inherent challenges (C3). First, we analyzed data characteristics to assess the real-world data heterogeneity. This proved invaluable, as it enabled us to detect differences in annotation procedures using a CCA, highlighting a significant source of data heterogeneity. Regarding segmentation performance, collaborative approaches (the ensemble, the federated model, and their specialized versions) consistently outperformed local models in all evaluation scenarios, highlighting the power of collaborative training approaches. Among these, the federated model and its specialized version showed superior performance over ensembling approaches in both personalization and generalization scenarios. Moreover, our experiments demonstrate that randomly adding sites to a federation can worsen model performance due to increased data heterogeneity (Tables A2, A3, and A5). Our results address research problem P3 by showing that, despite the substantial hurdles involved in conducting real-world FL, the benefits justify the investment. By jointly considering these hurdles and showcasing FL's superiority, we provide a complete picture of real-world FL in radiology.
The proposed guide serves as a starting point; it will not address all potential issues and is intended to be extended by future real-world FL efforts. While we supplemented our experiences with relevant FL literature, some insights may be unique to our initiative and not universally applicable. Our achieved performances on the challenging segmentation task45,51 are comparable to those reported in studies deploying non-fine-tuned segmentation models investigating the same target pathologies.51–53 The results of our proof-of-concept experiments are influenced by the following factors: data heterogeneity from varying annotation procedures, the choice of the default federated averaging aggregation strategy, and the use of a single fold for nnU-Net rather than five-fold cross-validation, without further hyper-parameter optimization. Additionally, results were influenced by efficiency-driven decisions, such as using nnU-Net’s low-resolution model and training for a fixed 1000 epochs, by the inclusion of sites with low segmentation performance on their own test data, and by the lack of an assessment of annotation inter-rater variability.
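As a reminder of what the default federated averaging strategy does, the following minimal sketch computes a sample-size-weighted mean of the parameters returned by each site, following the general idea described by McMahan et al.2 It is a simplified illustration under stated assumptions, not the aggregation code of our platform: the function name `federated_average` and the representation of a model as a dictionary of flat parameter lists are hypothetical.

```python
def federated_average(site_weights, site_sizes):
    """Sample-size-weighted average of model parameters from several sites.

    site_weights: list of dicts mapping parameter name -> flat list of floats.
    site_sizes:   list with the number of local training samples per site.
    """
    if not site_weights or len(site_weights) != len(site_sizes):
        raise ValueError("need one sample count per site")
    total = sum(site_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")

    averaged = {}
    for name in site_weights[0]:
        length = len(site_weights[0][name])
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(site_weights, site_sizes)) / total
            for i in range(length)
        ]
    return averaged


if __name__ == "__main__":
    # Two hypothetical sites; the second holds three times as many samples.
    site_a = {"conv.weight": [1.0, 2.0]}
    site_b = {"conv.weight": [3.0, 6.0]}
    print(federated_average([site_a, site_b], [1, 3]))
```

Because each site's contribution scales with its sample count, a large site with divergent annotations can dominate the aggregated model, which is one reason why the choice of aggregation strategy interacts with the annotation heterogeneity discussed above.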
Despite its limitations, our study outlines possibilities for future research. With the established FL infrastructure in the RACOON initiative, we are now equipped to investigate clinically relevant research questions at scale, leveraging the power of FL. From an FL perspective, exploring selection strategies for participating sites emerges as a promising area of research, particularly given the observed impact of heterogeneous annotations on model performance. This could lead to the development of automatic proxies that pre-evaluate a site’s suitability for participation in FL training. Moreover, given the lower effort and strong performance of ensemble approaches compared to local models, further exploration of how ensemble approaches can enhance personalization and generalization would be valuable.
Conclusion
In this work, we strive to bridge the gap between simulated and real-world FL research. We identified significant gaps in the literature: the lack of detailed, experience-based insights into establishing real-world FL initiatives (P1 and P2), and the absence of extensive benchmarking of FL against alternatives in real-world settings that would justify its adoption despite inherent challenges (P3). To address these gaps, we developed and deployed an FL initiative within the German RACOON project and compiled our insights into a comprehensive guide (C1). This guide details the necessary steps, the issues encountered, and the solutions involved in building and deploying real-world FL in radiological research. We conducted real-world experiments validating the functionality of our infrastructure, underscoring the practical relevance of our proposed guide (C2), and demonstrating FL’s superiority among collaborative learning approaches, proving its value despite the hurdles of real-world FL (C3). With these three contributions, we provide a comprehensive view of real-world FL and aim to streamline future real-world FL initiatives by guiding them through the development process and helping them avoid pitfalls. We aim to advance FL's clinical adoption, enhancing diagnosis and therapy with models trained collaboratively on distributed data.
Acknowledgments
The following individuals contributed significant work to the development of the Kaapana platform:
• Jonas Scherer (German Cancer Research Center Heidelberg)
• Klaus Kades (German Cancer Research Center Heidelberg)
• Ralf Floca (German Cancer Research Center Heidelberg)
• Hanno Gao (German Cancer Research Center Heidelberg)
• Philipp Schader (German Cancer Research Center Heidelberg)
• Santhosh Parampottupadam (German Cancer Research Center Heidelberg)
• Lorenz Feineis (German Cancer Research Center Heidelberg)
• Jens Beyermann (German Cancer Research Center Heidelberg)
• Benjamin Hamm (German Cancer Research Center Heidelberg)
• Rajesh Baidya (German Cancer Research Center Heidelberg)
• Mikulas Bankovic (German Cancer Research Center Heidelberg)
Additionally, the following individuals supported the project at the participating sites:
• Leonhard Feiner (TU Munich)
• Enrico Nasca (University Hospital Essen)
• Jonathan Kottlors (University Hospital Cologne)
• Benedikt Wichtlhuber (University Hospital Frankfurt)
• Petra Jiraskova (TU Munich)
Contributor Information
Markus Ralf Bujotzek, Division of Medical Image Computing, German Cancer Research Center Heidelberg, Heidelberg, 69120, Germany; Medical Faculty Heidelberg, University of Heidelberg, Heidelberg, 69120, Germany.
Ünal Akünal, Division of Medical Image Computing, German Cancer Research Center Heidelberg, Heidelberg, 69120, Germany.
Stefan Denner, Division of Medical Image Computing, German Cancer Research Center Heidelberg, Heidelberg, 69120, Germany; Faculty of Mathematics and Computer Science, Heidelberg University, Heidelberg, 69120, Germany.
Peter Neher, Division of Medical Image Computing, German Cancer Research Center Heidelberg, Heidelberg, 69120, Germany; Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital, Heidelberg, 69120, Germany; German Cancer Consortium (DKTK), Partner Site Heidelberg, Heidelberg, 69120, Germany.
Maximilian Zenk, Division of Medical Image Computing, German Cancer Research Center Heidelberg, Heidelberg, 69120, Germany; Medical Faculty Heidelberg, University of Heidelberg, Heidelberg, 69120, Germany.
Eric Frodl, Institute for Diagnostic and Interventional Radiology, University Hospital Frankfurt, Frankfurt (Main), 60590, Germany; Goethe University Frankfurt, Frankfurt, 60590, Germany.
Astha Jaiswal, Institute for Diagnostic and Interventional Radiology, Faculty of Medicine, University Hospital Cologne, University of Cologne, Cologne, 50937, Germany.
Moon Kim, Institute for AI in Medicine (IKIM), University Hospital Essen (AöR), Essen, 45131, Germany.
Nicolai R Krekiehn, Intelligent Imaging Lab@Section Biomedical Imaging, Department of Radiology and Neuroradiology, University Medical Center Schleswig-Holstein (UKSH), Kiel, 24118, Germany.
Manuel Nickel, Institute for AI in Medicine, Technical University of Munich, Munich, 81675, Germany.
Richard Ruppel, Department of Radiology, Charité—Universitätsmedizin Berlin, Berlin, 10117, Germany.
Marcus Both, Department of Radiology and Neuroradiology, University Medical Centers Schleswig-Holstein, Kiel, 24105, Germany.
Felix Döllinger, Department of Radiology, Charité—Universitätsmedizin Berlin, Berlin, 10117, Germany.
Marcel Opitz, Institute for Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Essen (AöR), Essen, 45131, Germany.
Thorsten Persigehl, Institute for Diagnostic and Interventional Radiology, Faculty of Medicine, University Hospital Cologne, University of Cologne, Cologne, 50937, Germany.
Jens Kleesiek, Institute for AI in Medicine (IKIM), University Hospital Essen (AöR), Essen, 45131, Germany.
Tobias Penzkofer, Department of Radiology, Charité—Universitätsmedizin Berlin, Berlin, 10117, Germany; Berlin Institute of Health, Berlin, 10178, Germany.
Klaus Maier-Hein, Division of Medical Image Computing, German Cancer Research Center Heidelberg, Heidelberg, 69120, Germany; Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital, Heidelberg, 69120, Germany; German Cancer Consortium (DKTK), Partner Site Heidelberg, Heidelberg, 69120, Germany; National Center for Tumor Diseases (NCT), NCT Heidelberg, A Partnership Between DKFZ and The University Medical Center Heidelberg, Heidelberg, 69120, Germany.
Andreas Bucher, Institute for Diagnostic and Interventional Radiology, University Hospital Frankfurt, Frankfurt (Main), 60590, Germany; Goethe University Frankfurt, Frankfurt, 60590, Germany.
Rickmer Braren, Institute for Diagnostic and Interventional Radiology, Klinikum rechts der Isar, Technical University of Munich, Munich, 81675, Germany.
Author contributions
Markus Ralf Bujotzek contributed to Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Software (development of Kaapana platform), Supervision, Validation, Visualization, Writing—original draft, Writing—review & editing. Ünal Akünal, Stefan Denner, Peter Neher, and Maximilian Zenk contributed to Conceptualization, Formal analysis, Investigation, Software (development of Kaapana platform), Supervision, Writing—review & editing. Eric Frodl, Astha Jaiswal, Nicolai R. Krekiehn, Manuel Nickel, and Richard Ruppel contributed to Investigation, Resources (local infrastructure). Moon Kim contributed to Investigation, Resources (local infrastructure), Data Curation, Segmentation annotation. Marcus Both, Felix Döllinger, Marcel Opitz, and Thorsten Persigehl contributed to Data Curation, Segmentation annotation. Jens Kleesiek contributed to Conceptualization (of RACOON), Funding acquisition, Project administration, Supervision. Tobias Penzkofer contributed to Conceptualization (of RACOON), Funding acquisition, Project administration. Klaus Maier-Hein contributed to Conceptualization (of RACOON and FL setup), Funding acquisition, Investigation, Project administration, Resources, Supervision, Writing—review & editing. Andreas Bucher contributed to Conceptualization (of RACOON and FL setup), Data curation, Segmentation annotation, Funding acquisition, Investigation, Project administration, Supervision, Writing—review & editing. Rickmer Braren contributed to Conceptualization (of RACOON), Data Curation, Segmentation annotation, Writing—review & editing.
Supplementary material
Supplementary material is available at Journal of the American Medical Informatics Association online.
Funding
This work was funded by “NUM 2.0” (FKZ: 01KX2121).
Conflicts of interest
The authors declare that they have no competing interests.
Data availability
The radiological image data used in this study is private, sensitive, and owned by the participating hospitals. Due to privacy regulations and institutional policies, this data cannot be shared publicly. However, the open-source code for the FL platform Kaapana, including all implementations used for the experimental studies, is openly available at https://github.com/kaapana/kaapana.
References
- 1. Kaissis GA, Makowski MR, Rückert D, et al. Secure, privacy-preserving and federated machine learning in medical imaging. Nat Mach Intell. 2020;2:305-311.
- 2. McMahan HB, Moore E, Ramage D, et al. Communication-efficient learning of deep networks from decentralized data. arXiv, http://arxiv.org/abs/1602.05629, 2016, preprint: not peer reviewed.
- 3. Sheller MJ, Edwards B, Reina GA, et al. Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data. Sci Rep. 2020;10:12598.
- 4. Pati S, Baid U, Edwards B, et al. Federated learning enables big data for rare cancer boundary detection. Nat Commun. 2022;13:7346.
- 5. Guan H, Yap PT, Bozoki A, et al. Federated learning for medical image analysis: a survey. Pattern Recognit. 2024;151:110424.
- 6. Soltan AAS, Thakur A, Yang J, et al. A scalable federated learning solution for secondary care using low-cost microcomputing: privacy-preserving development and evaluation of a COVID-19 screening test in UK hospitals. Lancet Digit Health. 2024;6:e93-e104.
- 7. Dayan I, Roth HR, Zhong A, et al. Federated learning for predicting clinical outcomes in patients with COVID-19. Nat Med. 2021;27:1735-1743.
- 8. Ogier Du Terrail J, Leopold A, Joly C, et al. Federated learning for predicting histological response to neoadjuvant chemotherapy in triple-negative breast cancer. Nat Med. 2023;29:135-146.
- 9. Oldenhof M, et al. Industry-scale orchestrated federated learning for drug discovery. In: Proceedings of the AAAI Conference on Artificial Intelligence. arXiv, 10.48550/arXiv.2210.08871, 2023, preprint: not peer reviewed.
- 10. Cremonesi F, Planat V, Kalokyri V, et al. The need for multimodal health data modeling: a practical approach for a federated-learning healthcare platform. J Biomed Inform. 2023;141:104338.
- 11. Deist TM, Dankers FJ, Ojha P, et al. Distributed learning on 20 000+ lung cancer patients—the personal health train. Radiother Oncol. 2020;144:189-200.
- 12. Camajori Tedeschini B, Savazzi S, Stoklasa R, et al. Decentralized federated learning for healthcare networks: a case study on tumor segmentation. IEEE Access. 2022;10:8693-8708.
- 13. Karargyris A, Umeton R, Sheller MJ, et al.; AI4SafeChole Consortium. Federated benchmarking of medical artificial intelligence with MedPerf. Nat Mach Intell. 2023;5:799-810.
- 14. Sarma KV, Harmon S, Sanford T, et al. Federated learning improves site performance in multicenter deep learning without data sharing. J Am Med Inform Assoc. 2021;28:1259-1264. 10.1093/jamia/ocaa341
- 15. Pati S, Baid U, Zenk M, et al. The federated tumor segmentation (FeTS) challenge. arXiv, http://arxiv.org/abs/2105.05874, 2021, preprint: not peer reviewed.
- 16. Mullie L, Afilalo J, Archambault P, et al. CODA: an open-source platform for federated analysis and machine learning on distributed healthcare data. J Am Med Inform Assoc. 2024;31:651-665.
- 17. Liu Y, Huang J, Chen J-C, et al. Predicting treatment response in multicenter non-small cell lung cancer patients based on federated learning. BMC Cancer. 2024;24:688.
- 18. Roth HR, Chang K, Singh P, et al. Federated learning for breast density classification: a real-world implementation. arXiv, 2020, preprint: not peer reviewed.
- 19. Naz S, Phan KT, Chen YP. A comprehensive review of federated learning for COVID-19 detection. Int J Intell Syst. 2022;37:2371-2392.
- 20. Guendouzi BS, Ouchani S, Assaad HEL, et al. A systematic review of federated learning: challenges, aggregation methods, and development tools. J Netw Comput Appl. 2023;220:103714.
- 21. Rahman KMJ, Ahmed F, Akhter N, et al. Challenges, applications and design aspects of federated learning: a survey. IEEE Access. 2021;9:124682-124700.
- 22. Martínez Beltrán ET, Pérez MQ, Sánchez PMS, et al. Decentralized federated learning: fundamentals, state of the art, frameworks, trends, and challenges. IEEE Commun Surv Tutor. 2023;25(4):2983-3013. 10.1109/COMST.2023.3315746
- 23. Joshi M, Pal A, Sankarasubbu M. Federated learning for healthcare domain—pipeline, applications and challenges. ACM Trans Comput Healthc. 2022;3(4):40.
- 24. Xu J, Glicksberg BS, Su C, et al. Federated learning for healthcare informatics. arXiv, http://arxiv.org/abs/1911.06270, 2020, preprint: not peer reviewed.
- 25. Rauniyar A, Hagos DH, Jha D, et al. Federated learning for medical applications: a taxonomy, current trends, challenges, and future research directions. IEEE Internet Things J. 2024;11:7374-7398.
- 26. Rehman MHU, Hugo Lopez Pinaya W, Nachev P, et al. Federated learning for medical imaging radiology. Br J Radiol. 2023;96:20220890.
- 27. Nguyen DC, Pham Q-V, Pathirana Pubudu N, et al. Federated learning for smart healthcare: a survey. ACM Comput Surv (CSUR). 2022;55(3):60.
- 28. Darzidehkalani E, Ghasemi-Rad M, van Ooijen PMA. Federated learning in medical imaging: part ii: methods, challenges, and considerations. J Am Coll Radiol. 2022;19:975-982.
- 29. Ng D, Lan X, Yao MM-S, et al. Federated learning: a collaborative effort to achieve better medical imaging models for individual sites that have small labelled datasets. Quant Imaging Med Surg. 2021;11:852-857.
- 30. Bharati S, Mondal MR, Podder P, et al. Federated learning: applications, challenges and future directions. HIS. 2022;18:19-35.
- 31. Li T, Sahu AK, Talwalkar A, et al. Federated learning: challenges, methods, and future directions. IEEE Signal Process Mag. 2020;37:50-60.
- 32. Mammen PM. Federated learning: opportunities and challenges. arXiv, arXiv:2101.05428, preprint: not peer reviewed.
- 33. Aouedi O, Sacco A, Piamrat K, et al. Handling privacy-sensitive medical data with federated learning: challenges and future directions. IEEE J Biomed Health Inform. 2023;27:790-803.
- 34. Rieke N, Hancox J, Li W, et al. The future of digital health with federated learning. Npj Digit Med. 2020;3:119-117.
- 35. Heyder R, NUM Coordination Office; NUKLEUS Study Group. The German Network of University Medicine: technical and organizational approaches for research data platforms. Bundesgesundheitsblatt Gesundheitsforschung Gesundheitsschutz. 2023;66:114-125.
- 36. Haggenmüller S, Schmitt M, Krieghoff-Henning E, et al. Federated learning for decentralized artificial intelligence in melanoma diagnostics. JAMA Dermatol. 2024;160:303-311.
- 37. Casado FE, Lema D, Iglesias R, et al. Ensemble and continual federated learning for classification tasks. Mach Learn. 2023;112:3413-3453.
- 38. Kades K, Scherer J, Zenk M, et al. Towards real-world federated learning in medical image analysis using kaapana. In: Albarqouni S, Bakas S, Bano S. (eds) Distributed, Collaborative, and Federated Learning, and Affordable AI and Healthcare for Resource Diverse Global Health. Springer Nature Switzerland, 2022, pp. 130-140.
- 39. Denner S, Scherer J, Kades K, et al. Efficient large scale medical image dataset preparation for machine learning applications. In: Bhattarai B, et al., eds. Data Engineering in Medical Imaging. DEMI 2023. Lecture Notes in Computer Science, Vol. 14314. Cham: Springer; 2023.
- 40. Bucher AM, Henzel K, Meyer HJ, et al. Pericardial effusion predicts clinical outcomes in patients with COVID-19: a nationwide multicenter study. Acad Radiol. 2024;31:1784-1791.
- 41. Isensee F, Jaeger PF, Kohl SAA, et al. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods. 2021;18:203-211.
- 42. Jun M, Cheng G, Yixin W, et al. COVID-19 CT lung and infection segmentation dataset. zenodo.org. 2020. Accessed 2021. 10.5281/zenodo.3757476
- 43. Isensee F, Wald T, Ulrich C, et al. nnU-Net revisited: a call for rigorous validation in 3D medical image segmentation. In: Linguraru MG, et al., eds. Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. MICCAI 2024. Lecture Notes in Computer Science, Vol. 15009. Cham: Springer; 2024.
- 44. Maier-Hein L, Reinke A, Godau P, et al. Metrics reloaded: recommendations for image analysis validation. Nat Methods. 2024;21:195-212.
- 45. Roth H, Xu Z, Diez CT, et al. Rapid artificial intelligence solutions in a pandemic—the COVID-19-20 lung CT lesion segmentation challenge. Med Image Anal. 2022;82:102605.
- 46. Carmo D, Ribeiro J, Dertkigil S, et al. A systematic review of automated segmentation methods and public datasets for the lung and its lobes and findings on computed tomography images. Yearb Med Inform. 2022;31(1):277-295.
- 47. Cardoso MJ, Li W, Brown R, et al. MONAI: an open-source framework for deep learning in healthcare. arXiv, arXiv:2211.02701, preprint: not peer reviewed.
- 48. Xian RP, Walsh CL, Verleden SE, et al. A multiscale X-ray phase-contrast tomography dataset of a whole human left lung. Sci Data. 2022;9:264. 10.1038/s41597-022-01353-y.
- 49. Ghorbani A, Zou J. Data shapley: equitable valuation of data for machine learning. In: International Conference on Machine Learning PMLR. Vol. 97. 2019:2242-2251.
- 50. Luo G, Liu T, Lu J, et al. Influence of data distribution on federated learning performance in tumor segmentation. Radiol Artif Intell. 2023;5:e220082.
- 51. Rao Y, Lv Q, Zeng S, et al. COVID-19 CT ground-glass opacity segmentation based on attention mechanism threshold. Biomed Signal Process Control. 2023;81:104486.
- 52. Pezzano G, Díaz O, Ripoll VR, Radeva P. CoLe-CNN+: context learning—convolutional neural network for COVID-19-ground-glass-opacities detection and segmentation. Comput Biol Med. 2021;136:104689.
- 53. Saood A, Hatem I. COVID-19 lung CT image segmentation using deep learning methods: U-Net versus SegNet. BMC Med Imaging. 2021;21:19.