JCO Clinical Cancer Informatics
2024 Jan 25;8:e2300201. doi: 10.1200/CCI.23.00201

Synthetic Data Improve Survival Status Prediction Models in Early-Onset Colorectal Cancer

Hyunwook Kim 1, Won Seok Jang 2, Woo Seob Sim 3, Han Sang Kim 1, Jeong Eun Choi 4, Eun Sil Baek 5, Yu Rang Park 6, Sang Joon Shin 1
PMCID: PMC10830088  PMID: 38271642

Abstract

PURPOSE

In artificial intelligence–based modeling, working with a limited number of patient groups is challenging. This retrospective study aimed to evaluate whether applying synthetic data generation methods to the clinical data of small patient groups can enhance the performance of prediction models.

MATERIALS AND METHODS

A data set collected by the Cancer Registry Library Project from the Yonsei Cancer Center (YCC), Severance Hospital, between January 2008 and October 2020 was reviewed. Patients with colorectal cancer younger than 50 years who started their initial treatment at YCC were included. A Bayesian network–based synthesizing model was used to generate a synthetic data set, combined with the differential privacy (DP) method.

RESULTS

A synthetic population of 5,005 individuals was generated from a data set of 1,253 patients with 93 clinical features. The Hellinger distance and the correlation difference metric were below 0.3 and 0.5, respectively, indicating close statistical agreement between the synthetic and original data. Overall survival by disease stage did not differ between the synthetic and original populations. Training with the synthetic data and validating with the original data yielded the highest performances: 0.850, 0.836, and 0.790 for the Decision Tree, Random Forest, and XGBoost models, respectively. Synthetic data sets generated with different epsilon parameters showed performance improvements of >0.1% over the original data sets. For extremely small data sets, models using synthetic data outperformed those using only original data sets. The reidentification risk measures for epsilons between 0.1 and 100 fell below the baseline, indicating a preserved privacy state.

CONCLUSION

The synthetic data generation approach enhances predictive modeling performance by maintaining statistical and clinical integrity, and simultaneously reduces privacy risks through the application of DP techniques.


Synthetic oncology data with differential privacy enhance survival status prediction AI, ensuring data privacy.



INTRODUCTION

Colorectal cancer diagnosed in patients younger than 50 years, known as early-onset colorectal cancer (EOCRC), is believed to have distinct etiologic, genetic, and molecular profiles compared with colorectal cancer diagnosed at average onset ages.1-3 The rising prevalence of EOCRC in North America, Europe, Australia, and South Korea highlights its emergence as a public health concern.4,5 However, EOCRC accounts for only 10% of all colorectal cancer (CRC) cases, which has limited understanding of the disease.6

CONTEXT

  • Key Objective

  • To enhance predictive modeling of survival status in early-onset colorectal cancer by using synthetic data generation methods that preserve statistical and clinical integrity while ensuring patient privacy.

  • Knowledge Generated

  • The study generated a synthetic data set through a Bayesian network with differential privacy, preserving the original data's key attributes and confidentiality. Integrating synthetic data into machine learning models enhanced performance, emphasizing its utility in scenarios with limited patient data.

  • Relevance (J.L. Warner)

  • Synthetic datasets offer the opportunity to carry out analyses that protect privacy and confidentiality. Rigorous benchmarking is required to increase the general trust in such synthetic data, as this team has demonstrated.*

    *Relevance section written by JCO Clinical Cancer Informatics Editor-in-Chief Jeremy L. Warner, MD, MS, FAMIA, FASCO.

The integration of artificial intelligence (AI) in digital pathology and image analysis has significantly advanced oncology, especially in enhancing prognostic accuracy.7,8 However, applying AI to EOCRC populations poses challenges, as it requires numerous high-quality materials for training and validation, often scarce or hard to obtain.9 Comprehensive cancer registries such as the National Cancer Database offer data but have data quality limitations, such as incomplete variables, biases in treatment documentation, and difficulties in interpreting sequence data variables.10,11

Timely access to high-quality clinical data is crucial for evidence-based decisions and treatment in cancer care. Yet, acquiring these data faces challenges such as compliance with Health Insurance Portability and Accountability Act's privacy mandates,12 variable data accessibility, discrepancies in medical record systems across institutions, and the financial costs of data stewardship.

Despite the advent of the big data era and advancements in personalized precision medicine, a paradox emerges. As patient characteristics become increasingly refined, thereby narrowing target groups, the achievement of statistical significance and the implementation of AI grow more challenging. To address these challenges, the exploration of methods for generating synthetic data is underway.13-16

The Bayesian network (BN) is a graphical structure that represents the conditional probabilities among nodes, where each node represents a continuous or discrete variable.17 Studies have demonstrated that a BN can effectively capture variable correlations and generate synthetic data resembling the original data set.18,19 Compared with other synthesis technologies, a BN offers better interpretability, owing to its foundation in probability theory.20 Such interpretability is important in clinical settings, where understanding and clearly explaining outcomes are necessary. Additionally, the application of differential privacy (DP) methods ensures privacy preservation.21-23

We focused on generating synthetic data using a BN with DP and evaluated its statistical and clinical validity. We hypothesized that the performance of survival status prediction models for populations with limited data, such as the EOCRC population, could be improved by using synthetic data.

MATERIALS AND METHODS

Original Cohort and Data Set Structure

Data were collected from patients younger than 50 years diagnosed with CRC at Severance Hospital in Seoul, Republic of Korea, between January 2008 and October 2020. All included patients had initiated their first CRC treatment at Yonsei Cancer Center, and those with <30 days of follow-up or missing baseline disease stage information were excluded. Data for this study were sourced from the CRC database of the Cancer Registry Library Project, validated by data administrators and CRC-specialized medical oncologists. A study approval was granted by Yonsei University Health System's Institutional Review Board (4-2021-0520), with informed consent waived because of the anonymization of personal data after collection, adhering to stringent confidentiality standards.

Our database consisted of 11 tables with 93 variables, encompassing patient demographics, cancer diagnoses (diagnostic methods, histology, stages, and molecular subtypes), and treatment details, including clinical outcomes such as recurrence, treatment duration, and survival (Data Supplement, Table S1).

Generation of Synthetic Population

We built a synthetic population on the basis of the EOCRC population via a BN-based synthetic data generation algorithm called DataSynthesizer,24 which learns the interrelatedness of variables by estimating the conditional probability. The algorithm also uses DP, enabling the privacy of synthesized data. We used privacy budgets of epsilon 0.1, 1, 10, 100, 1,000, and 10,000.22,23
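DataSynthesizer's privacy mechanism perturbs the learned conditional distributions with Laplace noise whose scale shrinks as the privacy budget epsilon grows. As a rough illustration of that trade-off, the following is a simplified sketch of DP noise applied to a single marginal distribution; it is not the authors' pipeline or the DataSynthesizer API:

```python
import random

def dp_noisy_counts(counts, epsilon):
    """Add Laplace(1/epsilon) noise to a histogram with sensitivity 1.
    A Laplace draw is the difference of two independent Exp(1) draws
    scaled by 1/epsilon; smaller epsilon means more noise."""
    scale = 1.0 / epsilon
    return {k: max(c + scale * (random.expovariate(1.0) - random.expovariate(1.0)), 0.0)
            for k, c in counts.items()}

def sample_synthetic(counts, n, epsilon):
    """Draw n synthetic category values from the noise-perturbed marginal."""
    noisy = dp_noisy_counts(counts, epsilon)
    total = sum(noisy.values())
    values = list(noisy)
    if total == 0:  # degenerate case: all noisy counts clipped to zero
        return random.choices(values, k=n)
    weights = [noisy[v] / total for v in values]
    return random.choices(values, weights=weights, k=n)
```

With epsilon 10,000 the noisy counts are essentially the raw counts, whereas epsilon 0.1 can distort small cells badly, which is the fidelity-privacy trade-off examined in the Results.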

Quality Evaluation of Synthetic Data

The validity of the synthesized data was evaluated by examining the preservation of numerical data formations. Hellinger distance (HD), a metric that assesses the similarity of probability distributions ranging from zero (indicating no difference) to one, was used to measure the discrepancy between the original and synthetic data for each variable. We also calculated the correlation coefficient matrix, identifying the absolute differences and visualizing these as heat maps.
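For categorical variables, the HD between the empirical distributions of an original and a synthetic column can be computed directly. This small helper follows the standard definition; it is an illustrative implementation, not the study's evaluation code:

```python
import math
from collections import Counter

def hellinger(xs, ys):
    """Hellinger distance between the empirical distributions of two samples:
    0 means identical distributions, 1 means disjoint supports."""
    p, q = Counter(xs), Counter(ys)
    n_p, n_q = len(xs), len(ys)
    support = set(p) | set(q)
    total = sum((math.sqrt(p[v] / n_p) - math.sqrt(q[v] / n_q)) ** 2
                for v in support)
    return math.sqrt(total / 2)
```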

Medical oncologists independently analyzed both synthetic and original data from a clinical perspective to evaluate the clinical validity. This assessment involved an examination of baseline population characteristics, clinically significant disease-related features, and survival outcomes.

The reidentification disclosure risk was assessed to ensure patient privacy and protection against potential external privacy breaches.25,26 Two data sets were created, each containing half of the data features selected and zero values for the nonselected columns. The K-nearest neighbor algorithm was then applied to identify the closest samples from each data set, followed by a comparison of these nearest samples for a match.27
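The matching test can be sketched as follows. This is one interpretation of the described procedure (partial views with zeroed columns, nearest-neighbor lookup, agreement check), using hypothetical numeric records rather than the study's data:

```python
def nearest_index(query, rows):
    """Index of the row closest to query by squared Euclidean distance."""
    return min(range(len(rows)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(query, rows[i])))

def reidentification_match_rate(synthetic, original, n_left):
    """For each synthetic record, build two partial views of the original data
    (first n_left features kept / zeroed and vice versa), find the nearest
    original record under each view, and report how often both views agree.
    Frequent agreement suggests a synthetic record pins down one individual."""
    left = [[v if j < n_left else 0 for j, v in enumerate(r)] for r in original]
    right = [[0 if j < n_left else v for j, v in enumerate(r)] for r in original]
    agree = 0
    for r in synthetic:
        q_left = [v if j < n_left else 0 for j, v in enumerate(r)]
        q_right = [0 if j < n_left else v for j, v in enumerate(r)]
        if nearest_index(q_left, left) == nearest_index(q_right, right):
            agree += 1
    return agree / len(synthetic)
```

A match rate well below the baseline suggests that partial views of a record do not consistently single out the same individual.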

Development of Survival Status Prediction Models

We aimed to develop the survival status prediction models for the EOCRC population using machine learning (ML) models: Decision Tree, Random Forest, and XGBoost. Models were fine-tuned for optimal hyperparameters during the training and validation. Specifically, for Decision Tree and Random Forest, we set the maximum depth at (2, 4, 5, 7, 9, 10) and the minimum samples split at (2, 3, 4). For the XGBoost, the chosen maximum depth parameters were (4, 5, 7, 10, 50), along with learning rates of (0.01, 0.1).
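The search over these grids can be reproduced with scikit-learn's GridSearchCV, the package the authors report using. The toy data below stand in for the cohort, and the XGBoost grid is omitted to keep the sketch free of extra dependencies:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Decision Tree / Random Forest grid quoted in the text; the XGBoost grid
# pairs max_depth (4, 5, 7, 10, 50) with learning rates (0.01, 0.1).
tree_grid = {"max_depth": [2, 4, 5, 7, 9, 10], "min_samples_split": [2, 3, 4]}

# Illustrative data only, not the study's cohort.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0), tree_grid,
                      scoring="roc_auc", cv=3)
search.fit(X, y)
```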

We divided the original data set into three distinct subsets: the training set, the validation set, and the test set. The test set comprised 20% of the initial data set, while the remaining 80% was further split into training and validation sets, with 20% dedicated to validation. Both the validation and test sets were never exposed to the training process. Additionally, the test set remained unexposed to the synthetic data generation process, thereby ensuring the reliability and validity of our test results. Only original data were used in the test process. Features directly associated with survival status predictions, including overall survival days, date of death, 5-year survival status, and survival status for the overall period, were intentionally excluded.
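The nested 80/20 split described above can be expressed with the standard library alone (an illustrative helper, not the authors' code):

```python
import random

def three_way_split(records, test_frac=0.2, val_frac=0.2, seed=42):
    """Hold out test_frac of the data as a test set, then carve val_frac of
    the remainder out as a validation set; returns three index lists."""
    rng = random.Random(seed)
    idx = list(range(len(records)))
    rng.shuffle(idx)
    n_test = int(len(records) * test_frac)
    test, rest = idx[:n_test], idx[n_test:]
    n_val = int(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test
```

For 1,000 records this yields 640 training, 160 validation, and 200 test indices; per the design above, only the training and validation portions are ever exposed to the synthesizer.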

First, we implemented four strategies that differed in how the original and synthesized data were assigned to the training and validation processes. Thereafter, we developed models on the basis of synthetic data generated with epsilon values ranging from 0.1 to 10,000, using the optimal training-validation strategy derived from the first step, and compared their performance with that of the baseline model trained on the original data. Additional experiments investigated how models derived from synthetic data fared under extreme data scarcity by gradually reducing the inclusion age cutoff in 5-year intervals, starting at 50 years, and comparing performance. The schematic representation of the study's structure and progression is depicted in Figure 1.

FIG 1.


Flowchart of synthetic data generation, quality, and efficacy evaluation.

To increase the efficiency and accuracy of data synthesis, date columns were converted into numerical values on the basis of differences from standard dates, and dummy coding was used for low-cardinality categorical variables.
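These two transformations are simple to express; the reference date and category list below are hypothetical placeholders, not values from the study:

```python
from datetime import date

REFERENCE_DATE = date(2008, 1, 1)  # hypothetical standard date

def encode_date(d):
    """Convert a date column value to days elapsed since the reference date."""
    return (d - REFERENCE_DATE).days

def dummy_code(value, categories):
    """One-hot (dummy) encoding for a low-cardinality categorical variable."""
    return [1 if value == c else 0 for c in categories]
```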

The primary performance metric of models was measured as the area under the receiver operating characteristic curve (AUROC), as the survival data were imbalanced. The mean decrease in impurity, a feature importance factor, was also calculated, and this helped identify whether overlapping important features existed in each age group.28
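AUROC is robust to class imbalance because it reduces to the probability that a randomly chosen positive outranks a randomly chosen negative (the Mann-Whitney formulation). A compact standard-library version, for illustration only:

```python
def auroc(y_true, scores):
    """AUROC as the Mann-Whitney probability that a randomly chosen
    positive receives a higher score than a randomly chosen negative
    (ties count as half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```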

Statistical Analysis

Patient demographics, clinical characteristics, and outcomes were summarized using descriptive statistics. Relationships between categorical variables were assessed using chi-square or Fisher's exact tests, while independent-sample t-tests evaluated continuous variables' mean differences. Kaplan-Meier curves depicted survival, and log-rank tests examined differences in event-time distributions. Statistical tests were two-tailed, with P < .05 deemed significant. Statistical analyses were performed using R version 4.2.2 (The R Foundation for Statistical Computing,29 Vienna, Austria), and ML simulations were conducted using the Scikit-learn package, version 1.0.2, in Python.

RESULTS

Characteristics of the Cohort and the Synthetic Population

A total of 1,253 patients were included for synthetic data generation. The original data set was tabular and contained 93 distinct clinical variables. Similarly, the synthetic data set was presented in a tabular format, comprised 5,005 individuals, and maintained the same clinical information as the original data set.

In the DP framework with an epsilon value of 10,000, the synthetic population was statistically comparable with the original, as shown in Table 1. Within the synthetic population, the median (Q1-Q3) age was 42.0 (35.0-47.0) years, and 2,538 (50.7%) patients were male. The rates of stage II and stage III CRC were 23.2% and 34.8%, respectively, while the rate of stage IV was 11.8%. Most of the patients (94.4%) underwent surgical treatment for their primary tumor. The positive rates for RAS/RAF mutations and microsatellite instability-high were similar to the original data. However, the mismatch repair (MMR)–deficient rate showed a statistically significant difference, where the rate was 42.9% in the synthetic data group, notably higher than the 12.9% in the original population. The mortality rates of both the original and synthetic populations were found to be <10%, with observed rates of 9.4% and 9.3%, respectively. No significant differences were observed between the groups, including in the overall survival analysis by stage (Data Supplement, Fig S1).

TABLE 1.

Baseline Characteristics of Patients With EOCRC: Comparison of Real and Synthetic Data

Characteristic EOCRC Data (N = 1,253) Synthetic EOCRC Data (N = 5,005; ε = 10,000) P
Age range, years, median (IQR) 44.0 (38.0-47.0) 42.0 (35.0-47.0) .06
 <20, No. (%) 4 (0.3) 135 (2.7)
 20-29, No. (%) 49 (3.9) 536 (10.7)
 30-39, No. (%) 322 (25.7) 1,327 (26.5)
 40-49, No. (%) 878 (70.1) 3,007 (60.1)
Sex, No. (%) .89
 Male 611 (48.8) 2,538 (50.7)
 Female 642 (51.2) 2,467 (49.3)
CRC site, No. (%) .84
 Colon 881 (70.3) 3,403 (68.0)
 Rectum 372 (29.7) 1,602 (32.0)
Stage at diagnosis, No. (%) .99
 Stage 0 or I 371 (29.6) 1,515 (30.2)
 Stage II 291 (23.2) 1,160 (23.2)
 Stage III 442 (35.3) 1,740 (34.8)
 Stage IV 149 (11.9) 590 (11.8)
Histologic grade, No. (%) .62
 Well/moderately differentiated 912 (72.8) 3,288 (65.7)
 Poorly differentiated 40 (3.2) 246 (4.9)
 Others 214 (17.1) 1,146 (22.9)
 Missing 87 (6.9) 325 (6.5)
Surgery on primary tumor, No. (%) 1
 Yes 1,186 (94.7) 4,724 (94.4)
 No 67 (5.3) 281 (5.6)
Adjuvant chemotherapy, No. (%) 456 (36.4) 2,160 (43.2) .40
Relapse, No. (%) 189 (15.1) 745 (14.9) 1
Lines of palliative chemotherapy, No. (%) 95 (7.6) 392 (7.8) .99
 First line 38 (3.0) 146 (2.9)
 Second line 34 (2.7) 155 (3.1)
 Third line or beyond 36 (2.9) 158 (3.2)
Underwent RAS testing, No. (%) 371 (29.7) 2,237 (44.7)
 RAS-mutated 120 (32.3a) 686 (30.6a) .91
Underwent RAF testing, No. (%) 22 (1.8) 195 (3.9)
 RAF-mutated 2 (9.1a) 13 (6.7a) .71
Underwent MSI testing, No. (%) 310 (24.7) 1,650 (33.0)
 MSI-H 89 (28.7a) 641 (38.8a) .17
Underwent MMR IHC testing, No. (%) 116 (9.3) 1,585 (31.7)
 MMR-deficient 15 (12.9a) 680 (42.9a) <.001*
Overall 5-year survival rate (95% CI)
 Stage I 99.0 (96.2 to 99.7) 99.6 (98.9 to 99.9) .66b
 Stage II 96.8 (93.3 to 98.4) 96.6 (95.1 to 97.6) .91b
 Stage III 93.2 (89.8 to 95.5) 93.9 (92.5 to 95.1) .66b
 Stage IV 49.8 (40.8 to 58.2) 53.3 (48.8 to 57.5) .79b

Abbreviations: CRC, colorectal cancer; EOCRC, early-onset colorectal cancer; IHC, immunohistochemistry; MMR, mismatch repair; MSI, microsatellite instability; MSI-H, microsatellite instability-high; RAF, rapidly accelerated fibrosarcoma; RAS, rat sarcoma virus.

a Proportion with confirmed mutations or deficiency among those tested.

b P value was calculated with the log-rank test.

*P value <.05.

Quality Evaluation of Synthetic Data

Calculating the HD between the original and synthetic data sets yielded a value <0.5, indicating minor disparities in marginal distributions (Data Supplement, Fig S2). Additionally, our correlation-based assessment revealed increased discrepancies with higher noise levels. At the lowest noise level, epsilon 10,000, the differences remained below 0.5 (Data Supplement, Fig S3), demonstrating the synthetic data's statistical similarity to the original.

For privacy risk evaluation, the baseline score was set to approximately 0.03. Using an epsilon range of 0.1-100 resulted in the risk dropping below 0.02, indicating a lower privacy risk than with the original data. However, using epsilons of 1,000 and 10,000 yielded privacy risks of approximately 0.06, each roughly twice that of the baseline (Data Supplement, Fig S4).

Development of Survival Status Prediction Models

In an effort to maximize the performance of the ML model, multiple training strategies were evaluated using a data set with an epsilon value of 10,000 as the benchmark (Fig 2). On subjecting the original data to training and validation, performance measures of 0.710, 0.779, and 0.770 were obtained for the Decision Tree, Random Forest, and XGBoost models, respectively. When the Decision Tree model was trained on the synthetic data and validated using the original data, it demonstrated an enhanced performance of 0.850, showing an increase of 0.140 compared with its performance when trained on the original data. When the same training strategy was applied to different models, the Random Forest model trained on the synthetic data set scored 0.836, higher than the 0.779 achieved on the original data set. Similarly, the XGBoost model trained on synthetic data scored slightly higher, although the difference was marginal, at just 0.02. Other training strategies showed similar performance to those of the models trained on the original data but were unable to outperform them.

FIG 2.


Training strategy comparison in epsilon 10,000 synthetic data. The test results using synthetic data as training and validation data show the highest and closest AUROC performance to the original data in each ML model. AUROC, area under the receiver operating characteristic curve; ML, machine learning.

Using the optimal training and validation method determined previously, we created a model on the basis of a synthetic population synthesized with various epsilon values. In all performance tests, the model trained on synthetic data consistently outperformed the original data-based model in AUROC values, regardless of model type or epsilon values (Fig 3). For the Random Forest model, the original data demonstrated an AUROC value of <0.8. By contrast, the model trained on synthetic data achieved its highest performance, with an AUROC value of 0.836, at epsilon 10,000. For the XGBoost model, the synthetic data-based model with an epsilon value of 100 produced the highest AUROC value, reaching 0.887. Among the three models evaluated, the Decision Tree model yielded the top AUROC value of 0.909 when trained on synthetic data with an epsilon value of 0.1. Nevertheless, it was difficult to determine a distinct trend on the basis of the epsilon values in the ML results.

FIG 3.


Comparison of ML model performance across epsilon values ranging from 0.1 to 10,000 with (A) Decision Tree, (B) Random Forest, and (C) XGBoost. In every simulation, training with synthetic data and validating with the original data outperformed the performance of solely using the original data. AUROC, area under the receiver operating characteristic curve; ML, machine learning.

Performance Test in Extreme Conditions

As the cutoff age was lowered, both the absolute number of data samples and the performance of models trained on the original and synthetic data diminished. The Decision Tree model trained on synthetic data performed better than the model trained on the original data for all age groups (Fig 4A). For the Random Forest model, a similar trend was observed; however, its performance in the group younger than 30 years was higher than 0.9, whereas the performance of the model trained on the original data was close to 0.6 (Fig 4B). For the XGBoost model, the performance of the model trained on the synthetic data was superior to that of the model trained on the original data (Fig 4C).

FIG 4.


Performance evaluation in extreme data scenarios using a 5-year cutoff age decrease in ML models: (A) Decision Tree, (B) Random Forest, and (C) XGBoost. In most cases, the AUROC values of models trained on synthetic data and validated on original data showed superior performance to models trained and validated on original data. AUROC, area under the receiver operating characteristic curve; ML, machine learning; Real, trained on original data and validated on original data; Synthetic, trained on synthetic data and validated on real data.

DISCUSSION

We used a differentially private BN to generate a synthetic population on the basis of the high-quality but limited EOCRC population. Our study focused on analyzing the statistical and clinical integrity of the synthesized data as well as evaluating its efficacy in AI prediction models and the security of personal information.

Statistical attribution and privacy tests showed that higher epsilon values decreased the difference between synthetic and original data but increased the privacy risk. This aligns with previous reports of synthetic data having marginally higher privacy risks.26,30 Therefore, it is crucial to enforce privacy when generating synthetic data. Using DP, we found that the privacy risks exceeded baseline levels at epsilon values of 1,000 or higher (Data Supplement, Fig S4). The optimal epsilon value for privacy varies with data type, necessitating a balance that keeps privacy risk below baseline without overly perturbing the data, which may impair model performance.

Regarding the training and validating strategy, training with synthetic data and validating with original data consistently yielded high-performance models in all cases. These results align with those of previous studies, where models trained on synthetic data generally outperformed those trained on original data.26,31 Moreover, high-performance models can be constructed even with extremely scarce data via synthetic data generation. This tendency has been repeatedly demonstrated when synthetic data sets have been used with varying epsilon values to establish DP. It is unclear why the synthetic data, despite its imperfections, demonstrated higher survival status prediction performance than the original data. However, it should be noted that the quantity of training material is as important as the data quality. It is a widely recognized rule of thumb in ML that the data set should contain at least 10 times more data than the features.32

In our Random Forest model, under extreme conditions, the AUROC values notably outperformed those of the original data, particularly for the age cutoffs of 30 and 35 years. To analyze these causes, we examined the top five features' importance for two age cutoffs, 30 and 50 years. This showed that only relapse and disease stage were consistently significant predictors across both groups. Other features differed, indicating survival predictors may vary with specific factors such as age groups (Data Supplement, Fig S5).

Evaluating the statistical relationships between variables and the clinical validity of each synthetic instance is crucial when generating synthetic data.33 Therefore, we performed extra analysis to assess our synthetic data sets' clinical relevance, ensuring they accurately represent real-world clinical data with statistical robustness. Nonetheless, significant statistical differences were observed in the MMR-deficient population proportions within the synthetic population. The MMR IHC test proportion in the synthetic data group accounted for 31.7% of the total, which was significantly higher than the 9.3% observed in the original group (Table 1; P < .001). This discrepancy is likely the result of the difficulties encountered during the process of conducting column binding while preserving the individual column information for MLH1, MSH2, MSH6, and PMS2 to generate accurate synthetic data. Addressing the impact of imbalanced proportions on model performance, we examined the feature importance of MMR deficiency–related variables and found minimal influence on survival status prediction across various models (Data Supplement, Table S2). This observation is further supported by our analysis, which demonstrates that none of the top five features for the models include MMR deficiency–related features (Data Supplement, Table S3). We cautiously hypothesize that variations in group frequency are unlikely to significantly affect the overall performance of these models. This tentative assumption is grounded in the observation that changes in proportions do not alter the probabilistic relationship between the features and the target variable.

In a supplementary experiment, we tested a neural network model with the same data set and strategy, initially excluded for its lower interpretability and explainability, to provide further insights. The adjusted deep learning model had one to three hidden layers with 100 nodes each and alpha values ranging from 0.0001 to 1. This revealed that models trained on synthetic data and validated with original data consistently outperformed the original data set model, achieving an AUROC score of 0.814 compared with 0.754 (Data Supplement, Fig S6A). Moreover, across a range of epsilon values, except for 10 and 1,000, these models uniformly achieved AUROC scores above 0.8 (Data Supplement, Fig S6B). We achieved similar results with the neural network model as with the tree model, demonstrating comparable predictive power for survival status using synthetic data.

Our study had several limitations. First, it was based on a retrospective database of young patients with CRC from a single Asian institution, which may not represent the broader patient population, potentially limiting the generalizability of our results. Second, our method involved transforming time-varying variables into fixed-point or interval values, which might not accurately reflect the dynamics of real-world clinical data, indicating a need for further research in synthesizing time-dependent variables. Additionally, we implemented a preprocessing step to group clinically relevant columns, aiming to reduce computational load and enhance synthesis. However, this could have unintentionally heightened privacy risks and increased the error rate in column information.

Future research should focus on enhancing the quality of synthetic data by refining synthesis algorithms and integrating diverse patient data across various ethnicities, nations, and medical institutions. A critical step in this direction is the establishment of criteria for evaluating the clinical validity of synthetic populations. Additionally, a steadfast commitment to minimizing privacy risks is essential. We cautiously anticipate that advancements in data synthesis technology may eventually allow synthetic populations to substitute real control cohorts in randomized controlled clinical trials.

In conclusion, our study demonstrates the viability of using synthetic data generation in AI model development with feature-rich, high-quality oncology data sets, particularly when sample sizes are small. The results also show that the synthesized data retained the statistical correlation and clinical meanings, and potential privacy risks arising from synthesis can be mitigated by applying DP.

ACKNOWLEDGMENT

The authors thank the Medical Informatics Collaborative Unit members of Yonsei University College of Medicine for their assistance with the data analysis.


EQUAL CONTRIBUTION

H.K. and W.S.J. contributed equally to this work as first authors. S.J.S. and Y.R.P. contributed equally to this work as senior authors.

SUPPORT

Supported by a grant from the Korea Health Technology R&D Project through the Korea Health Industry Development Institute, funded by the Ministry of Health and Welfare, Republic of Korea (grant number: HI21C0974); the Korean Cancer Foundation (grant number K20230519).

DATA SHARING STATEMENT

Data will not be shared.

AUTHOR CONTRIBUTIONS

Conception and design: Hyunwook Kim, Won Seok Jang, Han Sang Kim, Yu Rang Park, Sang Joon Shin

Financial support: Sang Joon Shin

Administrative support: Jeong Eun Choi, Yu Rang Park

Provision of study materials or patients: Han Sang Kim

Collection and assembly of data: Hyunwook Kim, Won Seok Jang, Han Sang Kim, Jeong Eun Choi, Eun Sil Baek

Data analysis and interpretation: Hyunwook Kim, Won Seok Jang, Woo Seob Sim, Han Sang Kim

Manuscript writing: All authors

Final approval of manuscript: All authors

Accountable for all aspects of the work: All authors

AUTHORS' DISCLOSURES OF POTENTIAL CONFLICTS OF INTEREST

The following represents disclosure information provided by authors of this manuscript. All relationships are considered compensated unless otherwise noted. Relationships are self-held unless noted. I = Immediate Family Member, Inst = My Institution. Relationships may not relate to the subject matter of this manuscript. For more information about ASCO's conflict of interest policy, please refer to www.asco.org/rwc or ascopubs.org/cci/author-center.

Open Payments is a public database containing information reported by companies about payments made to US-licensed physicians (Open Payments).

Jeong Eun Choi

Employment: Yonsei University Health System

No other potential conflicts of interest were reported.

