Large-scale generative tumor synthesis in computed tomography images for improving tumor recognition

Linshan Wu; Jiaxin Zhuang; Yanning Zhou; Sunan He; Jiabo Ma; Luyang Luo; Xi Wang; Xuefeng Ni; Xiaoling Zhong; Mingxiang Wu; Yinghua Zhao; Xiaohui Duan; Varut Vardhanabhuti; Pranav Rajpurkar; Hao Chen

doi:10.1038/s41467-025-66071-6

. 2025 Dec 11;16:11053. doi: 10.1038/s41467-025-66071-6

Large-scale generative tumor synthesis in computed tomography images for improving tumor recognition

Linshan Wu ¹, Jiaxin Zhuang ¹, Yanning Zhou ², Sunan He ¹, Jiabo Ma ¹, Luyang Luo ^1,³, Xi Wang ⁴, Xuefeng Ni ¹, Xiaoling Zhong ⁵, Mingxiang Wu ⁵, Yinghua Zhao ⁶, Xiaohui Duan ⁷, Varut Vardhanabhuti ⁸, Pranav Rajpurkar ³, Hao Chen ^1,^9,^10,^11,^12,^✉

PMCID: PMC12698779 PMID: 41381469

Abstract

AI-driven tumor recognition unlocks new possibilities for precise tumor screening and diagnosis. However, the progress is heavily hampered by the scarcity of annotated datasets, demanding extensive efforts by radiologists. To this end, we introduce FreeTumor, a Generative AI framework to enable large-scale tumor synthesis for mitigating data scarcity. Specifically, FreeTumor effectively leverages limited labeled data and large-scale unlabeled data for training. Unleashing the power of large-scale data, FreeTumor is capable of synthesizing a large number of realistic tumors for augmenting training datasets. We curate a large-scale dataset comprising 161,310 Computed Tomography (CT) volumes for tumor synthesis and recognition, with only 2.3% containing annotated tumors. 13 board-certified radiologists are engaged to discern between synthetic and real tumors, rigorously validating the quality of synthetic tumors. Through high-quality tumor synthesis, FreeTumor showcases a notable superiority over state-of-the-art tumor recognition methods, indicating promising prospects in clinical applications.

Subject terms: Cancer imaging, Cancer screening, Cancer models

AIaided diagnosis is an exciting area of cancer research, however, large scale training is limited by the availability of imaging datasets. Here, the authors develop FreeTumor as a generative AI framework to develop realistic tumor images for clinical application.

Introduction

Tumors contribute significantly to the global burden of disease, accounting for an estimated 10 million deaths annually, according to the findings of the World Health Organization¹. With the rapid advancements of deep learning^2–6, AI-driven tumor recognition^7–15 has received increasing attention in clinical applications. However, existing tumor recognition methods heavily rely on annotated tumor datasets for training^7–9,13,16, demanding substantial medical expertise and dedicated efforts for data collection and annotation. Suffering from the data-hungry nature of AI methods and the extensive annotation burden, the limited scale of tumor datasets significantly poses a substantial obstacle to the advancement of AI-driven tumor recognition.

To address this challenge, data augmentation with synthetic data has emerged as a potential solution. Recently, Generative AI (GAI)^17–21 has witnessed rapid development, which can generate large-scale realistic images, presenting a potential solution to mitigate the scarcity of annotated datasets²². Specifically, synthetic data can increase the scale and diversity of training datasets, significantly boosting the robustness and generalization of AI models^23–27. GAI has also attracted increasing attention in medical research^16,28–36, demonstrating that GAI can synthesize high-quality medical images and consequently enhancing medical image understanding. Although encouraging results have been demonstrated, previous works largely ignored the importance of tumor synthesis, leading to limited improvements in downstream tumor recognition tasks^8,37.

In this study, we explore GAI to synthesize high-quality tumors on images, aiming to mitigate the scarcity of annotated tumor datasets. Early attempts^38–42 utilized handcrafted image processing techniques to synthesize tumors on images. However, these handcrafted methods require complex designs from radiologists, and the synthetic tumors still differ significantly from real tumors, thus failing to improve the downstream performance effectively. Recently, diffusion models, especially conditioned diffusion models^{19–21,43–45} have received increasing attention in recent advances of GAI. Although with promising achievements, these conditioned diffusion models heavily rely on the guidance of conditioning information, e.g., text or mask annotations. Thus, when applying conditioned diffusion models to tumor synthesis⁴⁶, the synthesis training is still limited by the scale of annotated tumor datasets and falls short in leveraging large-scale data. Constrained by the scale of training datasets, conditioned diffusion models may encounter challenges in effectively generalizing to extensive unseen datasets from various sources, particularly when faced with a wide range of diverse medical image characteristics such as varying intensity levels, spacing patterns, and resolutions.

Our goal is to unleash the power of large-scale unlabeled data via high-quality tumor synthesis, aiming to augment training datasets and fortify the foundations of tumor recognition. The primary challenges include: (1) effectively leveraging large-scale unlabeled data for tumor synthesis training and (2) synthesizing realistic tumors for segmentation training. Confronted with the challenge of conditioned diffusion models lacking the ability to leverage large-scale unlabeled data, our focus shifts towards the exploration of adversarial-training methods, i.e., Generative Adversarial Networks (GAN)^17,18,24,47. GAN-based methods involve training a generator for data generation and a discriminator for distinguishing between real and generated data, which excels in leveraging unpaired data for synthesis training. Specifically, we investigate adversarial-training methods to tackle the two aforementioned challenges: (1) The adversarial-training methods for unpaired data facilitate the integration of large-scale unlabeled data into tumor synthesis training, i.e., train a generator to synthesize tumors on unlabeled images and discriminate them with a discriminator (real or synthetic tumors). (2) The incorporated discriminator further enables us to discard the low-quality synthetic tumors, i.e., synthetic tumors failing to pass the discriminator will be discarded, thus facilitating quality control of synthetic tumors for boosting subsequent segmentation training.

To this end, we introduce FreeTumor, a GAI framework tailored for large-scale tumor synthesis and segmentation training. FreeTumor can synthesize high-quality tumors on healthy organs without the requirement of extra annotations from radiologists. This innovation facilitates the integration of large-scale unlabeled data into segmentation training. As illustrated in Fig. 1d, FreeTumor operates through two pivotal stages: synthesis training and segmentation training. In Stage 1, FreeTumor effectively leverages a combination of limited labeled data and large-scale unlabeled data for adversarial-based tumor synthesis training. Subsequently, in Stage 2, FreeTumor is employed to synthesize tumors on healthy organs for segmentation training. Simultaneously, FreeTumor incorporates a discriminator to discard low-quality synthetic tumors, enabling automatic quality control of large-scale synthetic tumors. By integrating large-scale datasets from diverse sources for synthesis training, FreeTumor significantly improves the quantity, quality, and diversity of tumors for training, enhancing the robustness of tumor recognition.

Fig. 1 — a We explore tumor synthesis and segmentation on five types of tumors/lesions, i.e., liver tumors, pancreas tumors, kidney tumors, lung tumors, and COVID-19. b The rapid advancements in medical imaging have enabled the collection of large-scale Computed Tomography (CT) data. However, annotated tumor datasets are scarce due to the extensive annotation burden. c We curated 161,310 CT volumes from 33 public sources to enable large-scale tumor synthesis and recognition, with merely 2.3% of them comprising annotated tumors. d FreeTumor consists of two stages: synthesis training and segmentation training. In Stage 1, FreeTumor effectively unleashes the power of large-scale unlabeled data for tumor synthesis training. In Stage 2, FreeTumor synthesizes high-quality tumors on healthy organs, facilitating the integration of large-scale unlabeled data in tumor segmentation training. We present two lung instances to demonstrate that we synthesize both lung tumors and COVID-19 lesions on lungs. e Clinical evaluation of synthetic tumors. We invited 13 board-certified radiologists to a Visual Turing Test to discern between synthetic and real tumors. Rigorous clinician evaluation validates the high quality of our synthetic tumors. f Extensive segmentation results on 12 public datasets showcase the superiority of FreeTumor. Specifically, FreeTumor adopts SwinUNETR⁵¹ as the segmentation model and employs tumor synthesis for augmenting segmentation datasets. With large-scale synthetic tumors for training, FreeTumor surpasses the baseline SwinUNETR⁵¹ by significant margins, achieving 10.6%, 5.5%, 3.8%, 6.1%, and 7.9% Dice score improvements for five types of tumors/lesions, respectively. g Early tumor detection results on 12 public datasets (number of samples n = 1533). Box plots show the mean (center), 25th and 75th percentiles (bounds of box), and minima to maxima (whiskers). With tumor synthesis, FreeTumor yields + 16.4% sensitivity improvements on average. Source data are provided as a Source Data file. The elements are created in BioRender. Wu, L. (2025) https://BioRender.com/qo600iw.

In this work, we create a large-scale training dataset for tumor synthesis and recognition by curating 161,310 publicly available CT volumes from different medical centers, with only 2.3% of them comprising annotated tumors. We evaluate the effectiveness of FreeTumor across four types of tumors, i.e., liver tumors, pancreas tumors, kidney tumors, and lung tumors. FreeTumor is versatile and can also be applied for COVID-19 lesions. To validate the fidelity of synthetic tumors, we engage 13 board-certified radiologists in a Visual Turing Test to discern between synthetic and real tumors. Rigorous clinician evaluation validates the high quality of our synthesis results, as they achieved only 51.1% sensitivity and 60.8% accuracy in distinguishing our synthetic tumors from real ones. Extensive experiments on 12 public datasets highlight the superiority of FreeTumor. Augmenting the training datasets by over 40 times, FreeTumor clearly surpasses state-of-the-art AI methods^{8,38,46,48–54}, including various synthesis methods and foundation models. Furthermore, the synthesis of small tumors can enhance the performance of early tumor detection, substantially aiding the timely treatment of patients. These findings underscore the promising potential of FreeTumor in improving tumor recognition within clinical practice.

Results

Datasets

The rapid advancements in medical imaging have enabled the collection of large-scale CT data. However, few previous works have considered harnessing the untapped potential of large-scale unlabeled CT data for tumor recognition³⁷. As shown in Fig. 1c, we curate the existing largest training dataset for tumor synthesis and recognition, encompassing 161,310 publicly available CT volumes from 33 different sources. It is worth noting that only 2.3% of them (3696 volumes) contain annotated tumors. The pre-processing details of the datasets are presented in Datasets and Implementation Details. Details of datasets are presented in Supplementary Table 30.

Clinician evaluation of synthetic tumors

It has been a common practice to utilize fidelity metrics like Fréchet Inception Distance (FID)⁵⁵ to measure the quality of natural image synthesis in GAI models^17–21, where lower FIDs reflect higher synthesis quality. We first evaluate the FID results of our synthetic tumors, detailed FID results are presented in Supplementary Table 4 and Fig. 6. We observe that our proposed FreeTumor can achieve lower FID compared with two previous tumor synthesis methods^38,46. However, we have noted limitations in the effectiveness of FID⁵⁵ in reflecting tumor synthesis quality. Specifically, many synthetic tumors, despite with low FIDs, still present with unrealistic characteristics in the views of radiologists. The inherent challenge lies in the fact that tumor regions predominantly exhibit small sizes with abnormal intensities, rendering conventional fidelity metrics unreliable^16,38,46. Clinician evaluation serves as a more convincing standard for validating the quality of tumor synthesis. To this end, we invited 13 board-certified radiologists to evaluate the quality of synthetic tumors.

Evaluation of tumor segmentation and detection

Tumor segmentation^7–9 aims to precisely segment target tumors by capturing their positions, sizes, and shapes. In contrast, tumor detection^7,13,46 focuses on identifying the presence and location of tumors, without the need to outline their precise shapes and sizes. Our detection pipeline is mask-based, and the metrics are reported per tumor. Following previous methods^{13,46,56–59}, tumor detection is achieved by the tumor segmentation models, where detected tumors are identified when segmentation predictions overlap with ground truth labels. For the evaluation of early tumor detection, we present the detection results of small tumors (diameter < 2 cm) following previous methods^13,46,59. The diameter measurement follows the standard of the World Health Organization (WHO)^59,60. The evaluation was restricted to those lesions.

We evaluate the effectiveness of FreeTumor across four types of tumors, i.e., liver tumors, pancreas tumors, kidney tumors, and lung tumors. FreeTumor is versatile and can also be applied to COVID-19 lesions. We assess the performance of these five types of tumors/lesions due to the availability of public annotated datasets for evaluation. Following previous medical image synthesis works^{16,22,28,34,36,38–42,46}, the downstream evaluation is conducted on only real-world medical datasets. The synthetic datasets are only used for training, as validation on synthetic data may introduce bias due to variations in synthesis quality^{16,22,28,34,36,38–42,46}. As shown in Fig. 1e, 12 public datasets are used to evaluate the performances of tumor segmentation and detection, including: (1) Liver tumors: LiTS⁶¹, HCC-TACE⁶², IRCAD⁶³. (2) Pancreas tumors: MSD07-Pancreas¹⁰, PANORAMA⁶⁴, QUBIQ⁶⁵. (3) Kidney tumors: KiTS21⁶⁶, KiTS23⁶⁶, KIPA⁶⁷. (4) Lung tumors: MSD06-Lung¹⁰, RIDER⁶⁸. (5) COVID-19: CV19-20⁶⁹. The details of the datasets are presented in Supplementary Table 30. For tumor segmentation, we utilize Dice scores to measure the segmentation performance. We utilize F1-Score, sensitivity, and specificity to measure the detection performance as previous methods^7,13. Notably, our method also achieved superior performance on three public leaderboards, including FLARE25, FLARE23, and KiTS19, as shown in Supplementary Table 25.

Clinician evaluation

We invited 13 board-certified radiologists to evaluate the fidelity of synthetic tumors through a Visual Turing Tests²². These radiologists are from 4 hospitals in China, i.e., Li Ka Shing Faculty of Medicine of The University of Hong Kong (HKU), Shenzhen People’s Hospital, Sun Yat-Sen Memorial Hospital of Sun Yat-Sen University, and The Third Affiliated Hospital of Southern Medical University. Among the group of 13 radiologists, there are 6 junior radiologists, 4 mid-level radiologists, and 3 senior radiologists. Each level of radiologists is defined by the following standards:

Junior radiologists: Doctors in residency programs, with 5–10 years of clinical experience.
Mid-level radiologists: Doctors with a professional tenure of 10–20 years in hospitals.
Senior radiologists: Doctors with advanced professional titles in hospitals, with at least 20 years of clinical experience.

The process of the Visual Turing Test is shown in Fig. 1e. During the Visual Turing Test, 13 radiologists were presented with the same set of CT volumes containing tumors, with each volume containing only one tumor case for evaluation. Half of these tumors are real, and the remaining half are synthesized by FreeTumor. Specifically, we provided 18 cases each of liver tumors, pancreas tumors, kidney tumors, lung tumors, and COVID-19 (a total of 90 cases) for evaluation. There are 45 real and 45 synthetic tumors among 90 tumor cases. For each type, the numbers of real and synthetic tumors are also equal (9 real and 9 synthetic in 18 tumor cases). These 90 cases are randomly selected from our datasets. During the Visual Turing Test, the radiologists were tasked with: (1) Identifying the synthetic tumors from real ones. (2) Discerning the distinguishing features between real and synthetic tumors. The radiologists were informed of the type of tumors they were required to identify, and the positions of tumors were also provided. The specific number of synthetic tumors for each type is unknown to the invited radiologists to prevent any bias in their assessments. On average, the radiologists require 1.5-2 min for viewing each case and require about 2-3 h to assess all 90 cases.

As shown in Supplementary Fig. 1, we report the sensitivity, specificity, and accuracy results to measure the ability of radiologists to identify our synthetic tumors. Lower values for sensitivity, specificity, and accuracy indicate that our synthetic tumors attain a higher quality level. We observe that even experienced radiologists are unable to identify our synthetic tumors with complete accuracy, which demonstrates the effectiveness of FreeTumor in synthesizing realistic tumors. Detailed results are presented in Supplementary Tables 1 and 2. Concretely:

Sensitivity and specificity. The sensitivity and specificity results for each type of tumor are depicted in Supplementary Fig. 1b, with the average results showcased in Supplementary Fig. 1d. Notably, the average sensitivity is recorded at a modest 51.1%, demonstrating that FreeTumor effectively synthesizes realistic tumors.
Accuracy. The accuracy results for each type of tumor are depicted in Supplementary Fig. 1c, with the average results showcased in Supplementary Fig. 1e. The accuracy results are 59.8%, 51.7%, 63.7%, 65.8%, and 62.8% for liver tumors, pancreas tumors, kidney tumors, lung tumors, and COVID-19, respectively. The average accuracy of the assessment is 60.8%, suggesting that nearly 40% of cases are misclassified.
Junior radiologists struggle in distinguishing our synthetic tumors from real ones. We engage radiologists of varying expertise levels to evaluate the synthetic tumors. Our observations reveal that the breadth of experience significantly influences the evaluation results. As shown in Supplementary Fig. 1c, 6 junior radiologists achieve only 41.5% sensitivity and 56.6% accuracy, indicating that our synthetic tumors exhibit realistic characteristics, capable of misleading radiologists with limited experience levels.
Comparisons among different types of tumors/lesions. As shown in Supplementary Fig. 1e, among the five assessed types, pancreas tumors present the greatest challenge in identification, achieving a low sensitivity of 30.8%.
Case analysis in Visual Turing Test. Based on the results of clinician evaluation, we categorize the synthetic tumors into two groups: (1) Pass the Visual Turing Test: more than 1/2 of 13 radiologists identified the synthetic tumors as real ones. (2) Fail the Visual Turing Test: fewer than 1/2 of 13 radiologists identified the synthetic tumors as real ones. The detailed distributions of these two groups are shown in Supplementary Fig. 1f. It can be observed that there are 28 of 45 synthetic tumors (62.3%) pass the Visual Turing Test, indicating the high quality of our synthetic tumors.

Case studies

The case studies of synthetic tumors are presented in Supplementary Fig. 8. Summarized from the radiologists’ assessment, we highlight some characteristics of our synthetic tumors that contribute to deceiving radiologists: (1) Density: our synthetic tumors exhibit uneven and indistinct densities that are consistent with the clinical presentations of tumors. (2) Boundary: our synthetic tumors present unclear boundaries with blurred edges, resembling the characteristics of real tumors. (3) Mass Effect: our synthetic tumors also showcase the mass effect on the surrounding organs as real tumors. However, in some cases, some radiologists can still tell the distinct features of synthetic tumors, suggesting that our synthetic results can be further improved. More case studies with failure cases are presented in Supplementary Fig. 9.

Accurate and scalable segmentation across five types of tumors/lesions

Comparison methods

We conduct extensive tumor segmentation experiments on 12 public datasets and report the corresponding Dice score results. First, we compare our FreeTumor with five widely-used tumor segmentation models^8,48–51, i.e., UNet⁴⁸, TransUNet⁴⁹, UNETR⁵⁰, nnUNet⁸, and SwinUNETR⁵¹. These works^8,48–51 proposed to advance network architectures for improving tumor segmentation, while our FreeTumor is designed to address the challenges in tumor segmentation from the data scarcity aspect. We adopt SwinUNETR⁵¹ as the segmentation model, thus SwinUNETR⁵¹ can be seen as the baseline for comparisons. Second, we compare FreeTumor with two tumor synthesis methods^38,46 and three CT foundation models^53,54,70. In addition, we further evaluate the out-of-domain performance of FreeTumor. Out-of-domain evaluation represents transferring a model trained on a source dataset to a target dataset, i.e., direct inference on target datasets without fine-tuning models. It is worth noting that our 40 × enlarged dataset lacks annotated tumors/lesions for training the baseline segmentation models^{8,48–51,53,54,70,71}. Thus, in this work, we introduce FreeTumor to synthesize tumors/lesions on the enlarged dataset for segmentation training.

FreeTumor outperforms baseline tumor segmentation models

As shown in Fig. 2, on 12 public datasets across various types of tumors/lesions, our FreeTumor consistently outperforms five widely-used tumor segmentation models^8,48–51 by a clear margin. By augmenting the training datasets by over 40 times, FreeTumor surpasses the baseline SwinUNETR⁵¹ by 6.9, 8.6, 16.1, 6.0, 3.1, 7.2, 4.0, 3.7, 5.8, 7.1, 5.1, and 7.9% on 12 datasets, respectively. Overall, FreeTumor brings an average + 6.7% Dice score improvement over the baseline SwinUNETR⁵¹. The two-sided paired t test p-value = 5.085 × 10⁻⁵, remaining significant after Bonferroni correction⁷² for multiple comparisons (α = 0.00417, 5.085 × 10⁻⁵ < 0.00417). The substantial improvements demonstrate that the scarcity of tumor annotations is a critical bottleneck in tumor segmentation. Specifically, as shown in Fig. 2c, for the IRCAD⁶³ dataset that contains only 22 labeled CT volumes, FreeTumor demonstrates + 16.1% Dice score improvements by augmenting training datasets. These findings robustly validate the rationale of our motivation to mitigate data scarcity. Detailed results are presented in Supplementary Table 8.

FreeTumor outperforms previous tumor synthesis methods

We further compare FreeTumor with two tumor synthesis methods: SynTumor³⁸ and DiffTumor⁴⁶. Note that both of these two tumor synthesis methods^38,46 cannot leverage unlabeled data for synthesis training: (1) SynTumor³⁸ utilizes handcrafted image processing techniques for tumor synthesis. (2) DiffTumor⁴⁶ employs conditioned diffusion models for tumor synthesis, thus, it can only leverage labeled data for tumor synthesis training (360 labeled volumes are used in this work). In addition, SynTumor³⁸ is only applicable to liver tumors, and DiffTumor⁴⁶ is not applicable to lung tumors and COVID-19. For fair comparisons, SynTumor³⁸ and DiffTumor⁴⁶ adopt the same segmentation model⁵¹ as FreeTumor.

As shown in Fig. 3, FreeTumor significantly outperforms previous tumor synthesis methods SynTumor³⁸ and DiffTumor⁴⁶ by a clear margin, underscoring the importance of leveraging large-scale data for synthesis training. We further evaluate the effectiveness of SynTumor³⁸ and DiffTumor⁴⁶ in utilizing our large-scale datasets for segmentation training. However, we observe that without large-scale synthesis training, these synthesis methods^38,46 fail to generalize well on large-scale unseen datasets with different image characteristics. For example, when employing SynTumor³⁸ to segmentation training on our large-scale datasets, the average Dice score on LiTS⁶¹ is dropped from 60.2% to 52.8%. Detailed results are presented in Supplementary Table 19 and Figure 5. In contrast, our FreeTumor is capable of leveraging large-scale data in both synthesis and segmentation training, facilitating robust generalization across datasets from various sources. Detailed results are presented in Supplementary Table 9.

FreeTumor outperforms various CT foundation models

We further compare FreeTumor with three CT foundation models: MAE3D⁷⁰, SwinSSL⁵³, and VoCo⁵⁴. These foundation models are based on Self-Supervised Learning (SSL)^52,73,74: MAE3D⁷⁰ and SwinSSL⁵³ are based on mask image modeling⁵², while VoCo⁵⁴ is based on contrastive learning. Although these foundation models^53,54,70 can leverage unlabeled data in self-supervised pre-training, they still fail to utilize unlabeled data during segmentation training and remain constrained by the limited scale of annotated datasets.

As shown in Fig. 3, we observe that our FreeTumor clearly outperforms three foundation models^53,54,70. The fundamental bottleneck of the foundation models^53,54,70 is that they fail to leverage large-scale data during segmentation training. For example, for the liver tumor dataset IRCAD⁶³, these foundation models^53,54,70 are limited to utilizing merely 22 CT volumes for fine-tuning, whereas our FreeTumor model can harness a significantly larger dataset of 19,571 CT volumes for segmentation training. The utilization of large-scale data in segmentation training enables the superiority of FreeTumor. Detailed results are presented in Supplementary Table 10.

FreeTumor excels in out-of-domain evaluation

Extensive out-of-domain comparisons with five tumor segmentation models^8,48–51, two tumor synthesis methods^38,46, and three foundation models^53,54,70 are presented in Fig. 2m–r and Fig. 3m–r, respectively. Leveraging large-scale data from diverse sources, FreeTumor demonstrates superior generalizability compared with previous methods. Notably, when transferring models from LiTS⁶¹ to IRCAD⁶³, FreeTumor achieves a substantial improvement of 22.9% Dice score compared with the baseline SwinUNETR⁵¹ and also surpasses both tumor synthesis methods^38,46, and foundation models^53,54,70 by a clear margin. Detailed results are presented in Supplementary Table 9.

FreeTumor yields significant improvements across five types of tumors/lesions

As shown in Fig. 4a, compared with the baseline SwinUNETR⁵¹, FreeTumor yields average improvements of 10.6, 5.5, 3.8, 6.1, and 7.9% for liver tumors, pancreas tumors, kidney tumors, lung tumors, and COVID-19 in Dice scores, respectively. Given the marginal disparities observed within previous methods^{8,38,46,48–51,53,54,70}, these improvements underscore a non-trivial advancement in tumor segmentation. We provide qualitative visualization results of tumor segmentation in Fig. 4b. Notably, FreeTumor demonstrates better segmentation performance, offering precise sizes, shapes, and positions that are crucial for accurate tumor diagnosis. More qualitative results are presented in Supplementary Fig. 13.

Large-scale data enables more accurate tumor segmentation

The key strength of FreeTumor lies in its capacity to harness large-scale unlabeled data for tumor synthesis and segmentation. To evaluate the effectiveness of scaling up datasets, we conduct ablation studies on five segmentation datasets, i.e., LiTS⁶¹ (liver tumors), MSD07¹⁰ (pancreas tumors), KiTS23⁶⁶, MSD06¹⁰ (lung tumors), and CV19-20⁶⁹ (COVID-19 infection). As shown in Fig. 4c–g, we showcase the effectiveness of scaling up segmentation training datasets across five segmentation datasets^10,61,66,69, representing five types of tumors/lesions. We present the comparisons with five baseline models^8,48–51 and two tumor synthesis methods^38,46. The foundation models^53,54,70 leveraged segmentation training datasets that are of equivalent scale to the baseline models^8,48–51.

We have noted a significant correlation between segmentation performance and the scale of segmentation training datasets. As shown in Figure 4h, we further present a comparative analysis of data utilization. Notably, a key distinction lies in the utilization of unlabeled data. Previous methods^{8,38,46,48–51} are limited to less than 4000 CT volumes for training. Although two previous methods SynTumor³⁸ and DiffTumor⁴⁶ also explore tumor synthesis, they are unable to leverage large-scale unlabeled data for synthesis training. Without synthesis training on large-scale data, these two synthesis methods^38,46 fall short in effectively leveraging large-scale data for segmentation training (Supplementary Table 19). In summary, previous methods^{8,38,46,48–51} are constrained by their reliance on limited labeled data, thus curbing their potential for achieving superior performances. In contrast, by integrating large-scale data for tumor synthesis and segmentation training, our FreeTumor surpasses previous methods^{8,38,46,48–51} by a clear margin. These findings unequivocally demonstrate the rationale and effectiveness of FreeTumor.

Accurate detection across five types of tumors/lesions

Tumor detection, especially the detection of early-stage tumors, is vital for the timely treatment of patients. Accurate early tumor detection can result in a greater probability of survival with less morbidity as well as less expensive treatment^{13,56–58,75}. However, early-stage tumors are typically small in size, making them challenging to detect. Our proposed FreeTumor can synthesize tumors with flexible sizes. Thus, the synthesis of small tumors can serve as an effective data augmentation solution to improve the robustness of early tumor detection. In this study, we employ FreeTumor to synthesize a large number of small tumors for training, thereby boosting the sensitivity of early tumor detection and facilitating the timely treatment for patients.

Evaluation of tumor detection across all stages of tumors

We first evaluate the detection performance across all tumor stages, with the F1-Score (%) results illustrated in Fig. 5a. It can be seen that FreeTumor consistently surpasses the baseline methods^8,48–51 without tumor synthesis. Notably, the F1-Scores of FreeTumor in detecting the five types of tumors/lesions all surpass 97%, highlighting the potential of FreeTumor in clinical practice.

Effectiveness of detecting small tumors

To evaluate the performances of early tumor detection, we further present the results of detecting small tumors (diameter < 2 cm)^59,60. We highlight the sensitivity improvements of FreeTumor in Fig. 5d. It can be seen that limited by the data scarcity, the baseline methods^8,48–51 are not sensitive in detecting small tumors/lesions. Equipped with FreeTumor, the detection of small liver tumors, pancreas tumors, kidney tumors, lung tumors, and COVID-19 are improved by 22.9, 10.3, 16.7, 17.8, and 14.1%, respectively. Notably, the overall sensitivity is improved from 49.7% to 66.1% (+ 16.4%), marking a substantial advancement towards accurate early tumor detection. These findings indicate promising prospects of FreeTumor in aiding the timely treatment of patients. Detailed sensitivity and specificity results are presented in Supplementary Fig. 12.

Discussion

FreeTumor is a GAI framework tailored for large-scale tumor synthesis and segmentation training. Our FreeTumor is designed to address the scarcity of annotated tumor datasets, aiming to unleash the power of large-scale unlabeled data for training. Specifically, FreeTumor effectively leverages a combination of limited labeled data and large-scale unlabeled data for tumor synthesis training. By large-scale tumor synthesis training, FreeTumor is capable of synthesizing a large number of tumors varying in sizes, positions, and backgrounds, thus boosting the robustness of tumor recognition models. Rigorous clinician evaluation conducted by 13 board-certified radiologists demonstrates the high quality of our synthetic tumors. To evaluate the effectiveness of FreeTumor, we create the largest training dataset for tumor synthesis and recognition, encompassing 161,310 publicly available CT volumes from diverse sources (with only 2.3% of them containing annotated tumors). Extensive experiments on 12 public datasets demonstrate the superiority of FreeTumor over state-of-the-art AI methods. These findings showcase the promising prospects of FreeTumor in tumor recognition.

AI-driven tumor recognition has received increasing attention in recent years, yet the progress is heavily hampered by the scarcity of annotated datasets. Early attempts^8,48–51 mainly focus on advancing network architectures to improve tumor recognition. Although encouraging results have been demonstrated, the scarcity of annotated datasets still heavily hampered further development. To this end, numerous medical foundation models^53,54,70 have been introduced to tackle the challenges of data scarcity. Although these foundation models can leverage unlabeled data in self-supervised pre-training^{52,73,74,76,77}, they still fail to utilize unlabeled data during segmentation training and remain constrained by the limited scale of annotated datasets.

Thus, tumor synthesis emerges as a promising solution to mitigate the scarcity of annotated tumor datasets, which can synthesize a large number of tumors on images for augmenting training datasets. Early attempts^38–42,46 investigated image processing and generative models for tumor synthesis. However, these methods fail to integrate large-scale data into synthesis training, thus hindering the improvements of downstream tumor recognition. In addition, these methods largely ignore the importance of quality control in synthesizing tumors, while low-quality synthetic tumors will pose a negative impact on downstream training.

To this end, we introduce FreeTumor to address the aforementioned challenges. First, FreeTumor adopts an effective adversarial-based synthesis training framework to leverage both labeled and unlabeled data, facilitating the integration of large-scale unlabeled data in synthesis training. Second, FreeTumor further employs an adversarial-based discriminator to discard low-quality synthetic tumors, enabling automatic quality control of large-scale synthetic tumors in the subsequent segmentation training. In this way, FreeTumor facilitates the utilization of large-scale data in both synthesis and segmentation training, demonstrating superior performances compared with previous methods.

Although FreeTumor has demonstrated promising results in tumor recognition, there are still numerous areas for growth and improvement. In our work, we collected 12 annotated datasets from public resources for training and validation, which are commonly used in existing research for the five types of tumors/lesions we studied. With more annotated tumor datasets for training, the performance of FreeTumor could be further improved. In the future, we will consistently collect more annotated datasets to advance our model.

Although FreeTumor has showcased promising results in synthesizing various types of tumors/lesions on CT volumes, moving forward, we will extend the application of FreeTumor to encompass other tumor types. Furthermore, generative models, including GAN and diffusion models, have also demonstrated promising results in the applications of other medical imaging modalities, e.g., X-ray^16,34 and pathology images³⁶. In the future, we will explore adapting FreeTumor to other medical imaging modalities, which require further dataset curation and more evaluation.

In our work, most CT scans come from a few hospitals. Thus, the synthetic data may copy their hidden biases. In the future, we will collaborate with more hospitals to collect more data for developing stronger models. In addition, while FreeTumor has achieved satisfactory performance on various public datasets, further exploration of its application in clinical practice is necessary to substantiate the effectiveness of our method.

Methods

In this section, we first introduce the preliminary of our method in Preliminary of FreeTumor. The details of our tumor synthesis pipeline are illustrated in Large-Scale Generative Tumor Synthesis Training. Then, in Quality Control of Synthetic Tumors for Large-Scale Segmentation Training, we further describe our quality control strategy to discard low-quality synthetic tumors. Following this, in Unleashing the Power of Large-scale Unlabeled Data, we discuss the process of integrating large-scale unlabeled data in segmentation training. Finally, in Datasets and Implementation Details, we delve into the details of our implementation, including the details of dataset collection, pre-processing, training implementations, and evaluation metrics.

In this study, we focus on the tumor recognition tasks, thus, we use the term “unlabeled” to represent “without tumor labels”. Specifically, during tumor synthesis, we require organ labels to simulate the tumor positions on healthy organs. Among the datasets collected in this study, only a few of them contain organ labels. For the datasets that are without organ labels, we first utilize an organ segmentation model to generate pseudo-organ labels. The details of pre-processing datasets are described in Datasets and Implementation Details.

Preliminary of free tumor

Confronted with the challenge of conditioned diffusion models lacking the ability to leverage unlabeled data in synthesis training⁴⁶, we explore the adversarial training method to unleash the power of large-scale unlabeled data. Specifically, unlike earlier GAN-based methods Pix2Pix¹⁸ and CycleGAN¹⁷, our synthesis training pipeline is motivated by the GAN-based semantic image synthesis methods^{24,26,47,78–81}. Semantic image synthesis aims to generate images with specific classes. Typically, GAN-based semantic image synthesis methods first train a classification model as the discriminator in the generative model. During synthesis training, this discriminator is utilized to classify the images generated by the generator, where higher classification accuracy indicates higher quality of synthetic images. In this way, the generator can be trained by minimizing the classification loss.

In this paper, we propose to shift this paradigm to the field of tumor synthesis. Specifically, instead of using classification models, we propose to train a tumor segmentation model as the discriminator to distinguish synthetic tumors. Furthermore, unlike previous semantic image synthesis methods focused solely on image generation, our synthetic tumors are utilized to augment segmentation training datasets. Thus, to alleviate the negative impact of low-quality synthetic tumors, we further leverage the discriminator to enable automatic quality control of synthetic tumors. The framework of FreeTumor is shown in Supplementary Fig. 4.

Large-scale generative tumor synthesis training

First, we train a tumor segmentation model to discriminate between real and synthetic tumors. In Stage 1, we train a baseline segmentation model with only labeled tumor datasets, which will be employed as the discriminator of the following tumor synthesis model to discriminate the synthetic tumors.

Second, we employ the adversarial training strategy to train a tumor synthesis model. The first step is to simulate the tumor positions on the healthy organs, which aims to select a proper location for the synthetic tumors. Specifically, we first generate organ labels for these datasets (as described in Datasets and Implementation Details). With organ labels, it is easy to select a location to synthesize tumors, e.g., liver tumors on livers, pancreas tumors on pancreases. Here, we denote the tumor mask as M that represents the positions of synthetic tumors, where M = 1 are the positions of synthetic tumors and M = 0 remain as the original values. The tumor mask M is generated with flexible sizes and positions, enabling us to synthesize diverse tumors for boosting the robustness of tumor segmentation models.

The generator G used in this study is a typical encoder-decoder based U-Net⁴⁸, which is widely used in state-of-the-art generative models^20,24,46,82. In FreeTumor, we aim to use the generator G to transform the voxel values from organ to tumor. Specifically, we use x to denote the original voxel values, $\hat{x}$ denotes the synthetic voxel values. Note that the original voxel value x corresponds to the healthy organ texture, same as in the inference process. The transform process is as follows:

\hat{x} = (1 - M) \otimes x + M \otimes [x - t a n h (G (x)) \otimes g (x)],

where x is first normalized to 0 ~ 1 and g(x) is the Gaussian filter to blur the textures, enabling us to simulate diverse tumor textures. tanh is the activation function to normalize G(x). With the tumor mask M, only the synthetic positions are transformed, and other positions are reserved as the original values. According to Equation (1), FreeTumor synthesizes tumors by estimating the distance (tanh(G(x))) between organs and tumors. This approach transforms tumor synthesis into a trainable process, enhancing its adaptability and effectiveness.

In FreeTumor, we propose to employ a tumor segmentation model as the discriminator for adversarial training. During synthesis training, we feed the volumes with synthetic tumors to the segmentation model S. We aim to use the segmentation results of these synthetic tumors to optimize the generator G by adversarial training. Concretely, it is intuitive that if a case of synthetic tumor appears realistic in comparison to the real tumors, it has a higher probability of being segmented by the segmentation model S. Similar observations are also witnessed in previous semantic image synthesis methods^26,38,47,82. Motivated by this, we use a segmentation model as the discriminator: a tumor can be segmented by the segmentation model, discriminate as real; a tumor cannot be segmented by the segmentation model, discriminate as fake. We calculate the segmentation loss L_seg for adversarial training as follows:

L_{s e g} = \frac{1}{∥ M ∥} \sum_{M = 1} ∥ 1 - S (\hat{x}) ∥,

where $S (\hat{x})$ is the tumor prediction logits generated by the baseline segmentation model S, and we employ the simplest Euclidean distance to optimize the generator G. Specifically, higher prediction logits represent higher fidelity of the synthetic tumors, since they can be recognized as real tumors by the segmentation model trained in real-world tumor datasets.

In addition, following the traditional GAN^17,18,24,47, besides the segmentation model, we also adopt another classifier discriminator C to discriminate real or fake tumors using a typical classification loss L_cls. The classifier C works similarly to the previous adversarial training methods: (1) In the discriminating process, C is optimized to distinguish real and synthetic tumors. (2) In the generating process, C is frozen and tries to classify the synthetic tumors as the real tumors, thus optimizing the generator G. Thus, the total adversarial training loss L_adv is as follow:

L_{a d v} = \max_{G^{~}} \min_{D^{~}} λ_{c l s} L_{c l s} + L_{s e g},

where G^~ and D^~ represent the generating and discriminating processes, respectively. λ_cls is the weight of L_cls and is set to 0.1 in experiments empirically. Ablation studies of loss functions are presented in Supplementary Table 21.

Quality control of synthetic tumors for large-scale segmentation training

It is worth noting that synthetic tumors are not always flawless or perfect. We observe that the low-quality synthetic tumors will deteriorate the tumor segmentation training. Previous tumor synthesis methods^{38,39,41,42,46} largely ignored to alleviate their negative impacts. Thus, based on our discriminator, we develop an effective quality control strategy to automatically discard low-quality synthetic tumors.

Segmentation-based discriminator for quality control

Our quality control strategy relies on the segmentation-based discriminator S, which is a key factor in our decision to utilize adversarial training rather than diffusion models for tumor synthesis. We propose to adaptively discard low-quality synthetic tumors by calculating the proportions of satisfactory synthesized tumor regions. The satisfactory synthesized tumors represent the synthetic tumors that do match the corresponding tumor masks M well. Intuitively, we can use the baseline segmentation model S to calculate the correspondence: the proportions of synthetic tumors that are segmented as tumors. Thus, we calculate the proportion P as follows:

P = \frac{\sum_{i = 1}^{N} [1 (S (\hat{x})) \times 1 ((M = 1))]}{\sum_{i = 1}^{N} [1 ((M = 1))]},

where N denotes the total number of voxels, $1 (S (\hat{x}))$ denotes the number of voxels that are segmented as tumors, $1 ((M = 1))$ denotes the number of voxels that tumor mask is 1 (the positions of synthetic tumors). It is intuitive that if the proportionPis higher, the quality of this case of synthetic tumor tends to be higher. In this way, the discriminator can serve as an automatic tool for quality control.

We set a threshold T to split the high- and low-quality synthetic tumors. We use the term “Quality Test" to represent whether the synthetic case passes the discriminator, the quality control strategy Q is defined as:

Q (x ∣ P, T, G, S) = \{\begin{matrix} \hat{x}, P \geq T, & This synthetic tumor pass the Quality Test \\ x, P < T, & This synthetic tumor fail the Quality Test \end{matrix})

With Q, we can effectively achieve quality control of the synthetic tumors online. Ablation studies are presented in Supplementary Table 20. Despite its simplicity, we effectively alleviate the negative impact of unsatisfactory synthetic tumors in segmentation training, which is a significant improvement upon the previous tumor synthesis methods^{38,39,41,42,46}.

Unleashing the power of large-scale unlabeled data

Distinguished from previous works^{8,38,46,48–51,53,54,70} that used a limited scale of dataset for tumor segmentation training, we emphasize the importance of large-scale unlabeled data in the development of tumor segmentation. With the rapid development of medical imaging, we can easily collect adequate unlabeled CT data for training our FreeTumor. The challenge is that these datasets lack annotated tumor cases. To this end, we develop FreeTumor to leverage these unlabeled data. Specifically, as described in Large-Scale Generative Tumor Synthesis Training and Quality Control of Synthetic Tumors for Large-Scale Segmentation Training, given the unlabeled datasets D_u, we conduct tumor synthesis for $D_{u}^{'}$ as follow:

D_{u}^{'} = {(x, F [G (x)], S) ∣ x \in D_{u}} .

Online tumor synthesis

Specifically, we synthesize tumors in an online manner during segmentation training, which means we do not need to generate and save the synthetic tumors as offline datasets. There are two merits behind the online generation: (1) offline synthetic datasets may introduce problems about misinformation propagation of patients²²; (2) online generation enables more diverse synthesis, enabling us to synthesize a large number of tumors for segmentation training.

Visual turing test implementation

We invited 13 board-certified radiologists to evaluate the fidelity of synthetic tumors through a Visual Turing Test. During the Visual Turing Test, 13 radiologists were presented with the same set of CT volumes, with each volume containing only one tumor/lesion case for evaluation. Notably, we did not perform organ-specific cropping in advance. Instead, we provided the whole 3D volume to the radiologists for evaluation, and the slice thickness is 1 mm. The radiologists were informed of the type of tumors/lesions they were required to identify, and the positions of the tumors/lesions were also provided. Thus, the radiologists knew which organ they needed to view. Window adjustments or any other pre-processing tools for CT volumes are allowed. On average, the radiologists require 1.5-2 min for viewing each case.

The real group is pooled from the 12 annotated datasets. Following the synthesis process of the previous work DiffTumor⁴⁶, in the synthetic group, the normal cases are randomly selected from the healthy datasets CHAOS⁸³ and TCIA-Pancreas⁸⁴. These datasets are confirmed without tumors/lesions by radiologists^{38,46,59,83–86}. Then, following the previous works^38,46, we use FreeTumor to synthesize tumors/lesions on the healthy organs for the Visual Turing Test. Besides the Turing Test, we further evaluate the Fréchet Inception Distance (FID), Fréchet Video Distance (FVD), and Learned Perceptual Image Patch Similarity (LPIPS) results in Supplementary Tables 4, 6, and 7.

Although the Visual Turing Test is widely used in discerning the fidelity of synthetic medical images^16,22,28, there is still a limitation in applying it to tumor synthesis evaluation, since the radiologists typically do not perform tasks to distinguish real from synthetic tumors in clinical practice. In the future, we will explore more effective tools and metrics to measure the quality of synthetic tumors.

Datasets and implementation details

Datasets collection and pre-processing

Our proposed FreeTumor excels in leveraging large-scale data for tumor synthesis and segmentation. Thus, in this study, we first create a large-scale dataset with 161,130 publicly available CT volumes from 33 different sources, as shown in Supplementary Table 30.

As described in Large-Scale Generative Tumor Synthesis Training, our initial step involves simulating tumor positions within their corresponding organ regions, e.g., liver tumors on livers, pancreas tumors on pancreases. Consequently, generating the organ labels becomes essential. While a few of the datasets already include organ labels, the others still lack organ labels. To address this, we first utilize a robust organ segmentation model VoCo^37,54 to generate liver, pancreas, and kidney labels for the abdomen CT datasets. For lung organs, we employ Lungmask⁸⁷ to generate lung labels for chest CT datasets. This approach enables us to leverage the entirety of 161,130 CT volumes for tumor synthesis and segmentation training. Note that we only utilize the generated organ labels to simulate approximate tumor positions. Therefore, these organ labels do not need to be perfectly precise for the scope of this study.

Among our curated datasets, some of them contain abdomen regions, some of them contain chest regions, and a few of them contain both abdomen and chest regions. Specifically, for the training of liver, pancreas, and kidney tumors, we utilize 19,571 abdomen CT volumes for training. For lung tumors and COVID-19, we utilize 141,784 chest CT volumes for training.

Implementation details

In this study, instead of developing new network architectures, we mainly focus on advancing tumor segmentation from a data-driven aspect. Thus, we simply adopt the SwinUNETR⁵¹ as the tumor segmentation model. We use SwinUNETR⁵¹ for two reasons: (1) It achieves competitive results among the baseline tumor segmentation methods^8,48–51,71. (2) Previous tumor synthesis methods^38,46 and CT foundation models^53,54 also adopt SwinUNETR⁵¹ as backbones.

Specifically, during tumor synthesis, we regenerate tumor masks for each epoch without a synthetic cache, since reusing the same synthetic cache could leak information and inflate accuracy. Notably, the synthetic set was regenerated for each run.

In the process of generating tumor masks M, we simply follow the steps of previous methods DiffTumor⁴⁶ and SynTumor³⁸ for fair comparisons. (1) Flexible sizes: following previous methods^38,46, we predefine four sizes of tumor masks M, i.e., tiny, small, medium, and large. The radii are set to 4, 8, 16, and 32, respectively. The selection randomness is set to 0.25 equally for per size, and the spatial offset is from 0.75 to 1.25. (2) Flexible positions: following previous methods^38,46, we randomly select a position on organ masks to generate tumor masks M. To verify the results, we further present some visualization results in Supplementary Fig. 7.

When using GANs, evaluation of diversity and mode collapse is essential, since mode collapse will cause the generator to ignore most data patterns and repeatedly output only a few simplified modes. Following previous GAN-based methods^88–91, we adopted the Learned Perceptual Image Patch Similarity (LPIPS)⁹² metric to evaluate the diversity. Specifically, we first calculate the LPIPS on real-world datasets (12 labeled tumor/lesion datasets), then calculate the LPIPS⁹² on our synthetic datasets. For feature extraction, we follow the implementation of GenerateCT⁹³, which is widely adopted in CT imaging synthesis evaluation. The results are shown in Supplementary Table 7. Our synthetic dataset achieves LPIPS scores comparable to those of real datasets, underscoring the effectiveness of our method in generating diverse and realistic data.

We use Pytorch⁹⁴, MONAI⁹⁵, and nnUNet⁸ frameworks to conduct all the experiments. The synthesis training and segmentation training are conducted on NVIDIA H800 (80G) GPUs. More implementation details are presented in Supplementary Table 34. The comparisons of parameters, training time, and inference cost are shown in Supplementary Table 33.

Evaluation metrics

For the Visual Turing Test in clinician evaluation, we report the sensitivity, specificity, and accuracy results to measure the radiologists’ ability to identify synthetic tumors. Sensitivity (%) and specificity (%) are calculated as:

sensitivity = \frac{TP}{TP + FN}, specificity = \frac{TN}{TN + FP},

where TP (True Positive) denotes truly identifying the synthetic tumors, TN (True Negative) denotes truly identifying the real tumors, FP (False Positive) denotes falsely recognizing real tumors as synthetic tumors, and FN (False Negative) denotes falsely recognizing synthetic tumors as real tumors. The accuracy (%) is calculated as:

accuracy = \frac{TP + TN}{TP + TN + FP + FN} .

For tumor segmentation, the standard Dice scores (%) is employed to evaluate the performance. Dice scores is calculated as:

Dice (P r e, G r o) = \frac{2 ∣ P r e \cap G r o ∣}{∣ P r e ∣ + ∣ G r o ∣},

where Pre denotes the segmentation predictions, Gro is the ground truth of tumor labels.

For tumor detection, detected tumors are identified when segmentation predictions overlap with the ground truth labels^13,56–58. We use F1-Score, sensitivity, and specificity to measure the performance of tumor detection, where F1-Score is formulated as:

F1-Score = \frac{2 \times Precision \times Recall}{Precision + Recall},

where Precision and Recall are formulated as:

Precision = \frac{TP}{TP + FP}, Recall = \frac{TP}{TP + FN} .

Here, the “positive” class is defined as detecting a tumor within a CT volume.

We further evaluate the FID⁵⁵ and FVD⁹⁶ results of synthetic tumors, as shown in Supplementary Tables 4 and 6. We cropped the tumor regions to evaluate the FID results of synthetic tumors. Since in our method, only the tumor regions are synthesized, while the other regions remain as the original values. Thus, only the tumor regions require evaluation, which is the same as previous works^38–42,46. Specifically, we generate tumor masks M to synthesize tumors following previous methods^38,46. With tumor masks M, we can easily crop the synthesized tumor regions by extracting the regions of M. We follow the implementation of GenerateCT⁹³ to evaluate the FVD⁹⁶ results.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Supplementary information

Supplementary Information^{(15MB, pdf)}

Reporting Summary^{(78.3KB, pdf)}

Transparent Peer Review file^{(4.3MB, pdf)}

Source data

Source Data^{(96.6KB, zip)}

Acknowledgements

This work was supported by the Hong Kong Innovation and Technology Commission (Project No. MHP/002/22, GHP/006/22GD and ITCPD/17-9H.C.), HKUST (Project No. FS111, H.C.), and the Research Grants Council of the Hong Kong Special Administrative Region, China (Project Reference Number: T45-401/22-N, H.C.). We also thank the support of HKUST SuperPod for providing the GPU platform for model training. We express our sincere gratitude to the radiologists who contributed to the clinician evaluation, including Shisi Li, Dexuan Chen, Lingling Yang, Yu Wang, Riyu Han, Lin Liu, Kanrong Yang, Rui Zhang, Guangzi Shi, and Qiang Ye. We greatly appreciate their dedicated efforts. Icons of Fig. 1d, f, Supplementary Figs. 4c–g, Figs. A1a, A2, A3, A4, A8, A9 are made by Freepik from www.flaticon.com. For the elements created by BioRender, the citation to use: Created in BioRender. Wu, L. (2025) https://BioRender.com/qo600iw. This project has been reviewed and approved by the Human and Artefacts Research Ethics Committee (HAREC). The protocol number is HREP-2024-0429.

Author contributions

L.W. designed the framework and conducted the experiments. Y.Z., L.L., X.W., and P.R. provided suggestions on the framework and experiments. J.Z., S.H., J.M., and X.N. contributed to the data acquisition and downstream task evaluation. X.Z, M.W, Y.W, X.D., and V.V. contributed to the clinician evaluation of tumor synthesis and analyzed the results of tumor recognition. All authors contributed to the drafting and revising of the manuscript. H.C. and L.W. conceived the study. H.C. supervised the research.

Peer review

Peer review information

Nature Communications thanks Namkug Kim and Zongwei Zhou for their contribution to the peer review of this work. A peer review file is available.

Data availability

This study incorporates a total of 33 public datasets from different sources, encompassing 161,130 publicly available CT volumes. All these datasets are publicly available for research. For detailed information about the data used in this project, please refer to Supplementary Table 30. Source data are provided in this paper.

Code availability

The codes, datasets, and models of FreeTumor are available at GitHub (https://github.com/Luffy03/FreeTumor).

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

The online version contains supplementary material available at 10.1038/s41467-025-66071-6.

References

1.Bray, F. et al. Global cancer statistics 2022: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin.74, 229–263 (2024). [DOI] [PubMed] [Google Scholar]
2.LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature521, 436–444 (2015). [DOI] [PubMed] [Google Scholar]
3.Jumper, J. et al. Highly accurate protein structure prediction with alphafold. Nature596, 583–589 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
4.He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778 (2016).
5.Deng, J. et al. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 248–255 (2009).
6.Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Proceedings of theAdvances in Neural Information Processing Systems, 25 (2012).
7.Zhao, T. et al. A foundation model for joint segmentation, detection, and recognition of biomedical objects across nine modalities. Nat. Methods22, 166–176 (2024). [DOI] [PubMed]
8.Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J. & Maier-Hein, K. H. nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods18, 203–211 (2021). [DOI] [PubMed] [Google Scholar]
9.Ma, J. et al. Segment anything in medical images. Nat. Commun.15, 654 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
10.Antonelli, M. et al. The medical segmentation decathlon. Nat. Commun.13, 4128 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Peiris, H., Hayat, M., Chen, Z., Egan, G. & Harandi, M. Uncertainty-guided dual-views for semi-supervised volumetric medical image segmentation. Nat. Mach. Intell.5, 724–738 (2023). [Google Scholar]
12.Wang, S. et al. Annotation-efficient deep learning for automatic medical image segmentation. Nat. Commun.12, 5915 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Cao, K. et al. Large-scale pancreatic cancer detection via non-contrast ct and deep learning. Nat. Med.29, 3033–3043 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Sun, Y., Wang, L., Li, G., Lin, W. & Wang, L. A foundation model for enhancing magnetic resonance images and downstream segmentation, registration and diagnostic tasks. Nat. Biomed. Eng.9, 521–538 (2024). [DOI] [PMC free article] [PubMed]
15.Avram, O. et al. Accurate prediction of disease-risk factors from volumetric medical scans by a deep vision model pre-trained with 2d scans. Nat. Biomed. Eng.9, 507–520 (2024). [DOI] [PMC free article] [PubMed]
16.Wang, J. et al. Self-improving generative foundation model for synthetic medical image generation and clinical applications. Nat. Med.31, 609–617 (2024). [DOI] [PubMed]
17.Zhu, J.-Y., Park, T., Isola, P. & Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, 2223–2232 (2017).
18.Isola, P., Zhu, J.-Y., Zhou, T. & Efros, A. A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1125–1134 (2017).
19.Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. & Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, 2256–2265 (PMLR, 2015).
20.Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10684–10695 (2022).
21.Zhang, L., Rao, A. & Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3836–3847 (2023).
22.Chen, R. J., Lu, M. Y., Chen, T. Y., Williamson, D. F. & Mahmood, F. Synthetic data in machine learning for medicine and healthcare. Nat. Biomed. Eng.5, 493–497 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Yang, L. et al. Depth anything v2. In Proceedings of the Advances in Neural Information Processing Systems, 36 (2024).
24.Sushko, V. et al. Oasis: only adversarial supervision for semantic image synthesis. Int. J. Comput. Vis.130, 2903–2923 (2022). [Google Scholar]
25.Fan, L. et al. Scaling laws of synthetic images for model training... for now. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7382–7392 (2024).
26.Yang, L., Xu, X., Kang, B., Shi, Y. & Zhao, H. Freemask: Synthetic images with dense annotations make stronger segmentation models. Advances in Neural Information Processing Systems36 (2024).
27.Zhong, Z., Zheng, L., Kang, G., Li, S. & Yang, Y. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, 34, 13001–13008 (2020).
28.Bluethgen, C. et al. A vision–language foundation model for the generation of realistic chest x-ray images. Nat. Biomed. Eng.9, 494–506 (2024). [DOI] [PMC free article] [PubMed]
29.Peng, Y., Rousseau, J. F., Shortliffe, E. H. & Weng, C. Ai-generated text may have a role in evidence-based medicine. Nat. Med.29, 1593–1594 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
30.Jo, A. The promise and peril of generative AI. Nature614, 214–216 (2023).36747115 [Google Scholar]
31.Tudosiu, P.-D. et al. Realistic morphology-preserving generative modelling of the brain. Nat. Mach. Intell.6, 811–819 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
32.DeGrave, A. J., Cai, Z. R., Janizek, J. D., Daneshjou, R. & Lee, S.-I. Auditing the inference processes of medical-image classifiers by leveraging generative ai and the expertise of physicians. Nat. Biomed. Eng.9, 294–306 (2023). [DOI] [PubMed]
33.Ktena, I. et al. Generative models improve fairness of medical classifiers under distribution shifts. Nat. Med.30, 1166–1173 (2024). [DOI] [PMC free article] [PubMed]
34.Gao, C. et al. Synthetic data accelerates the development of generalizable learning-based algorithms for x-ray image analysis. Nat. Mach. Intell.5, 294–308 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
35.Schäfer, R. et al. Overcoming data scarcity in biomedical imaging with a foundational multi-task model. Nat. Comput. Sci.4, 495–509 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
36.Carrillo-Perez, F. et al. Generation of synthetic whole-slide image tiles of tumours from rna-sequencing data via cascaded diffusion models. Nat. Biomed. Eng.9, 320–332 (2025). [DOI] [PubMed] [Google Scholar]
37.Wu, L., Zhuang, J. & Chen, H. Large-scale 3d medical image pre-training with geometric context priors. Preprint at 10.48550/arXiv.2410.09890 (2024). [DOI] [PubMed]
38.Hu, Q. et al. Label-free liver tumor segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7422–7432 (2023).
39.Lyu, F. et al. Pseudo-label guided image synthesis for semi-supervised covid-19 pneumonia infection segmentation. IEEE Trans. Med. Imaging42, 797–809 (2022). [DOI] [PubMed] [Google Scholar]
40.Yao, Q., Xiao, L., Liu, P. & Zhou, S. K. Label-free segmentation of covid-19 lesions in lung ct. IEEE Trans. Med. Imaging40, 2808–2819 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
41.Wang, H. et al. Anomaly segmentation in retinal images with poisson-blending data augmentation. Med. Image Anal.81, 102534 (2022). [DOI] [PubMed] [Google Scholar]
42.Wyatt, J., Leach, A., Schmon, S. M. & Willcocks, C. G. Anoddpm: Anomaly detection with denoising diffusion probabilistic models using simplex noise. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 650–656 (2022).
43.Croitoru, F.-A., Hondru, V., Ionescu, R. T. & Shah, M. Diffusion models in vision: A survey. IEEE Trans. Pattern Anal. Mach. Intell.45, 10850–10869 (2023). [DOI] [PubMed] [Google Scholar]
44.Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst.33, 6840–6851 (2020). [Google Scholar]
45.Song, J., Meng, C. & Ermon, S. Denoising diffusion implicit models. In International Conference on Learning Representations (2021).
46.Chen, Q. et al. Towards generalizable tumor synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024).
47.Park, T., Liu, M.-Y., Wang, T.-C. & Zhu, J.-Y. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2337–2346 (2019).
48.Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI, 234–241 (Springer, 2015).
49.Chen, J. et al. Transunet: Rethinking the u-net architecture design for medical image segmentation through the lens of transformers. Med. Image Anal.97, 103280 (2024). [DOI] [PubMed] [Google Scholar]
50.Hatamizadeh, A. et al. Unetr: Transformers for 3d medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 574–584 (2022).
51.Hatamizadeh, A. et al. Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images. In International MICCAI Brainlesion Workshop, 272–284 (Springer, 2021).
52.He, K. et al. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16000–16009 (2022).
53.Tang, Y. et al. Self-supervised pre-training of swin transformers for 3d medical image analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 20730–20740 (2022).
54.Wu, L., Zhuang, J. & Chen, H. Voco: A simple-yet-effective volume contrastive learning framework for 3d medical image analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 22873–22882 (2024).
55.Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B. & Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Proceedings of theAdvances in Neural Information Processing Systems, 30 (2017).
56.Fitzgerald, R. C., Antoniou, A. C., Fruk, L. & Rosenfeld, N. The future of early cancer detection. Nat. Med.28, 666–677 (2022). [DOI] [PubMed] [Google Scholar]
57.Singhi, A. D., Koay, E. J., Chari, S. T. & Maitra, A. Early detection of pancreatic cancer: opportunities and challenges. Gastroenterology156, 2024–2040 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
58.Pereira, S. P. et al. Early detection of pancreatic cancer. Lancet Gastroenterol. Hepatol.5, 698–710 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
59.Bassi, P. R. et al. Radgpt: Constructing 3d image-text tumor datasets. In Proceedings of the IEEE/CVF International Conference on Computer Vision (2025).
60.Miller, A., Hoogstraten, B., Staquet, M. & Winkler, A. Reporting results of cancer treatment. Cancer47, 207–214 (1981). [DOI] [PubMed] [Google Scholar]
61.Bilic, P. et al. The liver tumor segmentation benchmark (lits). Med. Image Anal.84, 102680 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
62.Morshid, A. et al. A machine learning model to predict hepatocellular carcinoma response to transcatheter arterial chemoembolization. Radiology: Artif. Intell.1, e180021 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
63.Soler, L. et al. 3d image reconstruction for comparison of algorithm database. https://www.ircad.fr/research/data-sets/liver-segmentation-3d-ircadb-01 (2010).
64.Alves, N. et al. The PANORAMA study protocol: Pancreatic cancer diagnosis-radiologists meet AI 10.5281/zenodo.10599559 (2024).
65.Žukovec, M., Dular, L. & Špiclin, Ž. Modeling multi-annotator uncertainty as multi-class segmentation problem. in Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries (Springer, 2021).
66.Heller, N. et al. The kits21 challenge: Automatic segmentation of kidneys, renal tumors, and renal cysts in corticomedullary-phase CT. Preprint at 10.48550/arXiv.2307.01984 (2023).
67.He, Y. et al. Meta grayscale adaptive network for 3d integrated renal structures segmentation. Med. Image Anal.71, 102055 (2021). [DOI] [PubMed] [Google Scholar]
68.Aerts, H. J. et al. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat. Commun.5, 4006 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
69.Roth, H. R. et al. Rapid artificial intelligence solutions in a pandemic-the covid-19-20 lung ct lesion segmentation challenge. Med. Image Anal.82, 102605 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
70.Chen, Z. et al. Masked image modeling advances 3d medical image analysis. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 1970–1980 (2023).
71.Wu, J. et al. Medsegdiff-v2: Diffusion-based medical image segmentation with transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, 38, 6030–6038 (2024).
72.Weisstein, E. W. Bonferroni correction. https://mathworld.wolfram.com/ (2004).
73.He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9729–9738 (2020).
74.Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, 1597–1607 (PMLR, 2020).
75.Choi, J.-Y., Lee, J.-M. & Sirlin, C. B. Ct and mr imaging diagnosis and staging of hepatocellular carcinoma: part i. development, growth, and spread: key pathologic and imaging aspects. Radiology272, 635–654 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
76.Oquab, M. et al. Dinov2: Learning robust visual features without supervision. Preprint at 10.48550/arXiv.2304.07193 (2024).
77.Caron, M. et al. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9650–9660 (2021).
78.Dong, H., Yu, S., Wu, C. & Guo, Y. Semantic image synthesis via adversarial learning. In Proceedings of the IEEE International Conference on Computer Vision, 5706–5714 (2017).
79.Tan, Z. et al. Diverse semantic image synthesis via probability distribution modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7962–7971 (2021).
80.Liu, X., Yin, G., Shao, J., Wang, X. et al. Learning to predict layout-to-image conditional convolutions for semantic image synthesis. In Proceedings of theAdvances in Neural Information Processing Systems32 (2019).
81.Tan, Z. et al. Efficient semantic image synthesis via class-adaptive normalization. IEEE Trans. Pattern Anal. Mach. Intell.44, 4852–4866 (2021). [DOI] [PubMed] [Google Scholar]
82.Xue, H., Huang, Z., Sun, Q., Song, L. & Zhang, W. Freestyle layout-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14256–14266 (2023).
83.Kavur, A. E. et al. CHAOS Challenge - combined (CT-MR) healthy abdominal organ segmentation. Med. Image Anal.69, 101950 (2021). [DOI] [PubMed] [Google Scholar]
84.Roth, H. R. et al. Data from pancreas-ct. 10.7937/K9/TCIA.2016.tNB1kqBU (2016).
85.Li, W. et al. Pants: The pancreatic tumor segmentation dataset. Preprint at 10.48550/arXiv.2507.01291 (2025).
86.Ma, J. et al. Fast and low-gpu-memory abdomen ct organ segmentation: the flare challenge. Med. Image Anal.82, 102616 (2022). [DOI] [PubMed] [Google Scholar]
87.Hofmanninger, J. et al. Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem. Eur. Radiol. Exp.4, 1–13 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
88.Xia, W. et al. Gan inversion: A survey. IEEE Trans. Pattern Anal. Mach. Intell.45, 3121–3138 (2022). [DOI] [PubMed] [Google Scholar]
89.Parmar, G., Zhang, R. & Zhu, J.-Y. On aliased resizing and surprising subtleties in gan evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11410–11420 (2022).
90.Kumari, N., Zhang, R., Shechtman, E. & Zhu, J.-Y. Ensembling off-the-shelf models for gan training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10651–10662 (2022).
91.Lang, O. et al. Explaining in style: training a gan to explain a classifier in stylespace. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 693–702 (2021).
92.Zhang, R., Isola, P., Efros, A. A., Shechtman, E. & Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 586–595 (2018).
93.Hamamci, I. E. et al. Generatect: Text-conditional generation of 3d chest CT volumes. Preprint at 10.48550/arXiv.2305.16037 (2023).
94.Paszke, A. et al. Pytorch: An imperative style, high-performance deep learning library. In Proceedings of theAdvances in Neural Information Processing Systems32 (2019).
95.Cardoso, M. J. et al. Monai: An open-source framework for deep learning in healthcare. Preprint at 10.48550/arXiv.2211.02701 (2022).
96.Unterthiner, T. et al. Fvd: A new metric for video generation (2019).

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Information^{(15MB, pdf)}

Reporting Summary^{(78.3KB, pdf)}

Transparent Peer Review file^{(4.3MB, pdf)}

Source Data^{(96.6KB, zip)}

Data Availability Statement

The codes, datasets, and models of FreeTumor are available at GitHub (https://github.com/Luffy03/FreeTumor).

[CR1] 1.Bray, F. et al. Global cancer statistics 2022: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin.74, 229–263 (2024). [DOI] [PubMed] [Google Scholar]

[CR2] 2.LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature521, 436–444 (2015). [DOI] [PubMed] [Google Scholar]

[CR3] 3.Jumper, J. et al. Highly accurate protein structure prediction with alphafold. Nature596, 583–589 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR4] 4.He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778 (2016).

[CR5] 5.Deng, J. et al. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 248–255 (2009).

[CR6] 6.Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Proceedings of theAdvances in Neural Information Processing Systems, 25 (2012).

[CR7] 7.Zhao, T. et al. A foundation model for joint segmentation, detection, and recognition of biomedical objects across nine modalities. Nat. Methods22, 166–176 (2024). [DOI] [PubMed]

[CR8] 8.Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J. & Maier-Hein, K. H. nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods18, 203–211 (2021). [DOI] [PubMed] [Google Scholar]

[CR9] 9.Ma, J. et al. Segment anything in medical images. Nat. Commun.15, 654 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR10] 10.Antonelli, M. et al. The medical segmentation decathlon. Nat. Commun.13, 4128 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR11] 11.Peiris, H., Hayat, M., Chen, Z., Egan, G. & Harandi, M. Uncertainty-guided dual-views for semi-supervised volumetric medical image segmentation. Nat. Mach. Intell.5, 724–738 (2023). [Google Scholar]

[CR12] 12.Wang, S. et al. Annotation-efficient deep learning for automatic medical image segmentation. Nat. Commun.12, 5915 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR13] 13.Cao, K. et al. Large-scale pancreatic cancer detection via non-contrast ct and deep learning. Nat. Med.29, 3033–3043 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR14] 14.Sun, Y., Wang, L., Li, G., Lin, W. & Wang, L. A foundation model for enhancing magnetic resonance images and downstream segmentation, registration and diagnostic tasks. Nat. Biomed. Eng.9, 521–538 (2024). [DOI] [PMC free article] [PubMed]

[CR15] 15.Avram, O. et al. Accurate prediction of disease-risk factors from volumetric medical scans by a deep vision model pre-trained with 2d scans. Nat. Biomed. Eng.9, 507–520 (2024). [DOI] [PMC free article] [PubMed]

[CR16] 16.Wang, J. et al. Self-improving generative foundation model for synthetic medical image generation and clinical applications. Nat. Med.31, 609–617 (2024). [DOI] [PubMed]

[CR17] 17.Zhu, J.-Y., Park, T., Isola, P. & Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, 2223–2232 (2017).

[CR18] 18.Isola, P., Zhu, J.-Y., Zhou, T. & Efros, A. A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1125–1134 (2017).

[CR19] 19.Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. & Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, 2256–2265 (PMLR, 2015).

[CR20] 20.Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10684–10695 (2022).

[CR21] 21.Zhang, L., Rao, A. & Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3836–3847 (2023).

[CR22] 22.Chen, R. J., Lu, M. Y., Chen, T. Y., Williamson, D. F. & Mahmood, F. Synthetic data in machine learning for medicine and healthcare. Nat. Biomed. Eng.5, 493–497 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR23] 23.Yang, L. et al. Depth anything v2. In Proceedings of the Advances in Neural Information Processing Systems, 36 (2024).

[CR24] 24.Sushko, V. et al. Oasis: only adversarial supervision for semantic image synthesis. Int. J. Comput. Vis.130, 2903–2923 (2022). [Google Scholar]

[CR25] 25.Fan, L. et al. Scaling laws of synthetic images for model training... for now. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7382–7392 (2024).

[CR26] 26.Yang, L., Xu, X., Kang, B., Shi, Y. & Zhao, H. Freemask: Synthetic images with dense annotations make stronger segmentation models. Advances in Neural Information Processing Systems36 (2024).

[CR27] 27.Zhong, Z., Zheng, L., Kang, G., Li, S. & Yang, Y. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, 34, 13001–13008 (2020).

[CR28] 28.Bluethgen, C. et al. A vision–language foundation model for the generation of realistic chest x-ray images. Nat. Biomed. Eng.9, 494–506 (2024). [DOI] [PMC free article] [PubMed]

[CR29] 29.Peng, Y., Rousseau, J. F., Shortliffe, E. H. & Weng, C. Ai-generated text may have a role in evidence-based medicine. Nat. Med.29, 1593–1594 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR30] 30.Jo, A. The promise and peril of generative AI. Nature614, 214–216 (2023).36747115 [Google Scholar]

[CR31] 31.Tudosiu, P.-D. et al. Realistic morphology-preserving generative modelling of the brain. Nat. Mach. Intell.6, 811–819 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR32] 32.DeGrave, A. J., Cai, Z. R., Janizek, J. D., Daneshjou, R. & Lee, S.-I. Auditing the inference processes of medical-image classifiers by leveraging generative ai and the expertise of physicians. Nat. Biomed. Eng.9, 294–306 (2023). [DOI] [PubMed]

[CR33] 33.Ktena, I. et al. Generative models improve fairness of medical classifiers under distribution shifts. Nat. Med.30, 1166–1173 (2024). [DOI] [PMC free article] [PubMed]

[CR34] 34.Gao, C. et al. Synthetic data accelerates the development of generalizable learning-based algorithms for x-ray image analysis. Nat. Mach. Intell.5, 294–308 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR35] 35.Schäfer, R. et al. Overcoming data scarcity in biomedical imaging with a foundational multi-task model. Nat. Comput. Sci.4, 495–509 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR36] 36.Carrillo-Perez, F. et al. Generation of synthetic whole-slide image tiles of tumours from rna-sequencing data via cascaded diffusion models. Nat. Biomed. Eng.9, 320–332 (2025). [DOI] [PubMed] [Google Scholar]

[CR37] 37.Wu, L., Zhuang, J. & Chen, H. Large-scale 3d medical image pre-training with geometric context priors. Preprint at 10.48550/arXiv.2410.09890 (2024). [DOI] [PubMed]

[CR38] 38.Hu, Q. et al. Label-free liver tumor segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7422–7432 (2023).

[CR39] 39.Lyu, F. et al. Pseudo-label guided image synthesis for semi-supervised covid-19 pneumonia infection segmentation. IEEE Trans. Med. Imaging42, 797–809 (2022). [DOI] [PubMed] [Google Scholar]

[CR40] 40.Yao, Q., Xiao, L., Liu, P. & Zhou, S. K. Label-free segmentation of covid-19 lesions in lung ct. IEEE Trans. Med. Imaging40, 2808–2819 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR41] 41.Wang, H. et al. Anomaly segmentation in retinal images with poisson-blending data augmentation. Med. Image Anal.81, 102534 (2022). [DOI] [PubMed] [Google Scholar]

[CR42] 42.Wyatt, J., Leach, A., Schmon, S. M. & Willcocks, C. G. Anoddpm: Anomaly detection with denoising diffusion probabilistic models using simplex noise. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 650–656 (2022).

[CR43] 43.Croitoru, F.-A., Hondru, V., Ionescu, R. T. & Shah, M. Diffusion models in vision: A survey. IEEE Trans. Pattern Anal. Mach. Intell.45, 10850–10869 (2023). [DOI] [PubMed] [Google Scholar]

[CR44] 44.Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst.33, 6840–6851 (2020). [Google Scholar]

[CR45] 45.Song, J., Meng, C. & Ermon, S. Denoising diffusion implicit models. In International Conference on Learning Representations (2021).

[CR46] 46.Chen, Q. et al. Towards generalizable tumor synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024).

[CR47] 47.Park, T., Liu, M.-Y., Wang, T.-C. & Zhu, J.-Y. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2337–2346 (2019).

[CR48] 48.Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI, 234–241 (Springer, 2015).

[CR49] 49.Chen, J. et al. Transunet: Rethinking the u-net architecture design for medical image segmentation through the lens of transformers. Med. Image Anal.97, 103280 (2024). [DOI] [PubMed] [Google Scholar]

[CR50] 50.Hatamizadeh, A. et al. Unetr: Transformers for 3d medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 574–584 (2022).

[CR51] 51.Hatamizadeh, A. et al. Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images. In International MICCAI Brainlesion Workshop, 272–284 (Springer, 2021).

[CR52] 52.He, K. et al. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16000–16009 (2022).

[CR53] 53.Tang, Y. et al. Self-supervised pre-training of swin transformers for 3d medical image analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 20730–20740 (2022).

[CR54] 54.Wu, L., Zhuang, J. & Chen, H. Voco: A simple-yet-effective volume contrastive learning framework for 3d medical image analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 22873–22882 (2024).

[CR55] 55.Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B. & Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Proceedings of theAdvances in Neural Information Processing Systems, 30 (2017).

[CR56] 56.Fitzgerald, R. C., Antoniou, A. C., Fruk, L. & Rosenfeld, N. The future of early cancer detection. Nat. Med.28, 666–677 (2022). [DOI] [PubMed] [Google Scholar]

[CR57] 57.Singhi, A. D., Koay, E. J., Chari, S. T. & Maitra, A. Early detection of pancreatic cancer: opportunities and challenges. Gastroenterology156, 2024–2040 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR58] 58.Pereira, S. P. et al. Early detection of pancreatic cancer. Lancet Gastroenterol. Hepatol.5, 698–710 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR59] 59.Bassi, P. R. et al. Radgpt: Constructing 3d image-text tumor datasets. In Proceedings of the IEEE/CVF International Conference on Computer Vision (2025).

[CR60] 60.Miller, A., Hoogstraten, B., Staquet, M. & Winkler, A. Reporting results of cancer treatment. Cancer47, 207–214 (1981). [DOI] [PubMed] [Google Scholar]

[CR61] 61.Bilic, P. et al. The liver tumor segmentation benchmark (lits). Med. Image Anal.84, 102680 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR62] 62.Morshid, A. et al. A machine learning model to predict hepatocellular carcinoma response to transcatheter arterial chemoembolization. Radiology: Artif. Intell.1, e180021 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR63] 63.Soler, L. et al. 3d image reconstruction for comparison of algorithm database. https://www.ircad.fr/research/data-sets/liver-segmentation-3d-ircadb-01 (2010).

[CR64] 64.Alves, N. et al. The PANORAMA study protocol: Pancreatic cancer diagnosis-radiologists meet AI 10.5281/zenodo.10599559 (2024).

[CR65] 65.Žukovec, M., Dular, L. & Špiclin, Ž. Modeling multi-annotator uncertainty as multi-class segmentation problem. in Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries (Springer, 2021).

[CR66] 66.Heller, N. et al. The kits21 challenge: Automatic segmentation of kidneys, renal tumors, and renal cysts in corticomedullary-phase CT. Preprint at 10.48550/arXiv.2307.01984 (2023).

[CR67] 67.He, Y. et al. Meta grayscale adaptive network for 3d integrated renal structures segmentation. Med. Image Anal.71, 102055 (2021). [DOI] [PubMed] [Google Scholar]

[CR68] 68.Aerts, H. J. et al. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat. Commun.5, 4006 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR69] 69.Roth, H. R. et al. Rapid artificial intelligence solutions in a pandemic-the covid-19-20 lung ct lesion segmentation challenge. Med. Image Anal.82, 102605 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR70] 70.Chen, Z. et al. Masked image modeling advances 3d medical image analysis. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 1970–1980 (2023).

[CR71] 71.Wu, J. et al. Medsegdiff-v2: Diffusion-based medical image segmentation with transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, 38, 6030–6038 (2024).

[CR72] 72.Weisstein, E. W. Bonferroni correction. https://mathworld.wolfram.com/ (2004).

[CR73] 73.He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9729–9738 (2020).

[CR74] 74.Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, 1597–1607 (PMLR, 2020).

[CR75] 75.Choi, J.-Y., Lee, J.-M. & Sirlin, C. B. Ct and mr imaging diagnosis and staging of hepatocellular carcinoma: part i. development, growth, and spread: key pathologic and imaging aspects. Radiology272, 635–654 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR76] 76.Oquab, M. et al. Dinov2: Learning robust visual features without supervision. Preprint at 10.48550/arXiv.2304.07193 (2024).

[CR77] 77.Caron, M. et al. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9650–9660 (2021).

[CR78] 78.Dong, H., Yu, S., Wu, C. & Guo, Y. Semantic image synthesis via adversarial learning. In Proceedings of the IEEE International Conference on Computer Vision, 5706–5714 (2017).

[CR79] 79.Tan, Z. et al. Diverse semantic image synthesis via probability distribution modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7962–7971 (2021).

[CR80] 80.Liu, X., Yin, G., Shao, J., Wang, X. et al. Learning to predict layout-to-image conditional convolutions for semantic image synthesis. In Proceedings of theAdvances in Neural Information Processing Systems32 (2019).

[CR81] 81.Tan, Z. et al. Efficient semantic image synthesis via class-adaptive normalization. IEEE Trans. Pattern Anal. Mach. Intell.44, 4852–4866 (2021). [DOI] [PubMed] [Google Scholar]

[CR82] 82.Xue, H., Huang, Z., Sun, Q., Song, L. & Zhang, W. Freestyle layout-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14256–14266 (2023).

[CR83] 83.Kavur, A. E. et al. CHAOS Challenge - combined (CT-MR) healthy abdominal organ segmentation. Med. Image Anal.69, 101950 (2021). [DOI] [PubMed] [Google Scholar]

[CR84] 84.Roth, H. R. et al. Data from pancreas-ct. 10.7937/K9/TCIA.2016.tNB1kqBU (2016).

[CR85] 85.Li, W. et al. Pants: The pancreatic tumor segmentation dataset. Preprint at 10.48550/arXiv.2507.01291 (2025).

[CR86] 86.Ma, J. et al. Fast and low-gpu-memory abdomen ct organ segmentation: the flare challenge. Med. Image Anal.82, 102616 (2022). [DOI] [PubMed] [Google Scholar]

[CR87] 87.Hofmanninger, J. et al. Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem. Eur. Radiol. Exp.4, 1–13 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR88] 88.Xia, W. et al. Gan inversion: A survey. IEEE Trans. Pattern Anal. Mach. Intell.45, 3121–3138 (2022). [DOI] [PubMed] [Google Scholar]

[CR89] 89.Parmar, G., Zhang, R. & Zhu, J.-Y. On aliased resizing and surprising subtleties in gan evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11410–11420 (2022).

[CR90] 90.Kumari, N., Zhang, R., Shechtman, E. & Zhu, J.-Y. Ensembling off-the-shelf models for gan training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10651–10662 (2022).

[CR91] 91.Lang, O. et al. Explaining in style: training a gan to explain a classifier in stylespace. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 693–702 (2021).

[CR92] 92.Zhang, R., Isola, P., Efros, A. A., Shechtman, E. & Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 586–595 (2018).

[CR93] 93.Hamamci, I. E. et al. Generatect: Text-conditional generation of 3d chest CT volumes. Preprint at 10.48550/arXiv.2305.16037 (2023).

[CR94] 94.Paszke, A. et al. Pytorch: An imperative style, high-performance deep learning library. In Proceedings of theAdvances in Neural Information Processing Systems32 (2019).

[CR95] 95.Cardoso, M. J. et al. Monai: An open-source framework for deep learning in healthcare. Preprint at 10.48550/arXiv.2211.02701 (2022).

[CR96] 96.Unterthiner, T. et al. Fvd: A new metric for video generation (2019).

PERMALINK

Large-scale generative tumor synthesis in computed tomography images for improving tumor recognition

Linshan Wu

Jiaxin Zhuang

Yanning Zhou

Sunan He

Jiabo Ma

Luyang Luo

Xi Wang

Xuefeng Ni

Xiaoling Zhong

Mingxiang Wu

Yinghua Zhao

Xiaohui Duan

Varut Vardhanabhuti

Pranav Rajpurkar

Hao Chen

Abstract

Introduction

Fig. 1. Overview of the study.

Results

Datasets

Clinician evaluation of synthetic tumors

Evaluation of tumor segmentation and detection

Clinician evaluation

Case studies

Accurate and scalable segmentation across five types of tumors/lesions

Comparison methods

FreeTumor outperforms baseline tumor segmentation models

Fig. 2. Comparison with baseline tumor segmentation models.

FreeTumor outperforms previous tumor synthesis methods

Fig. 3. Comparison with tumor synthesis methods and CT foundation models.

FreeTumor outperforms various CT foundation models

FreeTumor excels in out-of-domain evaluation

FreeTumor yields significant improvements across five types of tumors/lesions

Fig. 4. Comprehensive analysis of tumor segmentation performance and data scaling effects.

Large-scale data enables more accurate tumor segmentation

Accurate detection across five types of tumors/lesions

Evaluation of tumor detection across all stages of tumors

Fig. 5. Evaluation of tumor detection.

Effectiveness of detecting small tumors

Discussion

Methods

Preliminary of free tumor

Large-scale generative tumor synthesis training

Quality control of synthetic tumors for large-scale segmentation training

Segmentation-based discriminator for quality control

Unleashing the power of large-scale unlabeled data

Online tumor synthesis

Visual turing test implementation

Datasets and implementation details

Datasets collection and pre-processing

Implementation details

Evaluation metrics

Reporting summary

Supplementary information

Source data

Acknowledgements

Author contributions

Peer review

Peer review information

Data availability

Code availability

Competing interests

Footnotes

Supplementary information

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases