Abstract
Breast cancer (BC) is a major contributor to female mortality worldwide, particularly in young women with aggressive tumors. Despite the need for accurate prognosis in this demographic, existing studies primarily focus on broader age groups, often using the SEER database, which has limitations in variable selection. This study aimed to develop an ML-based model to predict survival outcomes in young BC patients using the BC public staging database. A total of 3,401 patients with BC were included in the study. Patients were categorized as younger (n = 1574) and older (n = 1827). We applied several survival models—Random Survival Forest, Gradient Boosting Survival, Extra Survival Trees (EST), and penalized Cox models (Lasso and ElasticNet)—to compare mortality characteristics. The EST model outperformed others in predicting mortality for both age groups. Older patients exhibited a higher prevalence of comorbidities compared to younger patients. Tumor stage was the primary variable used to train the model for mortality prediction in both groups. COPD was a significant variable only in younger patients with BC. Other variables exhibited varying degrees of consistency in each group. These findings can help identify high-risk young female patients with BC who require aggressive treatment by predicting the risk of mortality.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-024-76331-y.
Keywords: Survival prediction model, Breast cancer in young women, Machine learning, Breast cancer
Subject terms: Computer science, Biomedical engineering, Breast cancer
Introduction
Breast cancer (BC) is one of the most common cancers and the leading cause of cancer-related mortality among women worldwide1. Recent epidemiological evidence indicates a consistent upward trend in the prevalence of BC in young women (BCYW)2. In the United States, 18% of new BC cases and 11% of BC deaths occur in women younger than 50 years3. BCYW tend to have a higher proportion of estrogen receptor (ER)-negative, triple-negative, and HER2+ tumors4. Younger age is correlated with poorer prognostic characteristics of tumors, such as decreased tumor differentiation, increased Ki-67 expression, and greater lymph node infiltration, compared to females older than 50 years5. Given the unique challenges faced by BCYW, predicting their prognosis is crucial. Consequently, considerable research has been conducted to predict the survival of this specific population.
Sun et al. proposed nomograms to predict overall and cancer-specific survival in young patients with BC. The nomograms used lymph node ratio as an important feature to predict overall survival and BC-specific survival based on the Cox proportional hazards model. These nomograms provide more accurate individualized risk predictions of adverse events and may assist clinicians in making decisions for young patients with BC6. Gong et al. developed and validated a comprehensive nomogram to predict survival in young women with BC, using the SEER database and a competing risks model. This model demonstrated superior performance and efficacy compared to the TNM system in the validation experiment7. Li et al. developed a model to predict the 3-year, 5-year, 7-year, and 10-year survival rates of young patients with BC using random survival forest, based on data from the SEER database8. Despite these efforts, few studies have focused specifically on predicting the prognosis of BCYW compared with studies of patients with BC of all ages9–13. Furthermore, most existing studies rely on the SEER database, which has inherent limitations in variable selection.
To address these gaps, our study introduces an ML-based prognosis model for young patients with BC, incorporating comorbidities such as atrial fibrillation (AF), chronic kidney disease (CKD), chronic obstructive pulmonary disease (COPD), diabetes mellitus (DM), deep venous thrombosis (DVT), dyslipidemia (DYS), heart failure (HF), hypertension (HYP), liver disease (LD), myocardial infarction (MI), peripheral vascular disease (PVD), and stroke. Additionally, we conducted a comparative analysis to identify the variables preferred by machine learning algorithms in predicting adverse outcomes in younger and older patients with BC.
Methods
We developed a model to predict all-cause mortality in younger patients with BC using data from the breast cancer public staging database (Breast-CPSD) provided by the National Cancer Data Center (NCDC) (Fig. 1). We first extracted data of patients diagnosed with BC between 2013 and 2015 from the Breast-CPSD. The data were divided into training and test datasets. Using the training data, we trained random survival forest (RSF), gradient boosting survival analysis (GBSA), extra survival trees (EST), and penalized Cox proportional hazards models with lasso (CoxPH-L) and ElasticNet (CoxPH-E) penalties. The performance of these models was evaluated on the test data using the concordance index (C-index).
Fig. 1.
Research Framework of prediction model for BC. CPSD cancer public staging database, BC breast cancer, C-index concordance index.
Table 2.
Mean C-index value from survival prediction models
| Model | Young cohort 5-year mortality prediction, C-index (95% CI) | Old cohort 5-year mortality prediction, C-index (95% CI) |
|---|---|---|
| RSF | 74.91 (72.47–77.36) | 81.23 (80.34–82.12) |
| GBSA | 66.58 (66.54–66.63) | 81.70 (81.70–81.70) |
| EST | 87.53 (86.68–88.53) | 82.95 (82.09–83.81) |
| CoxPH-L | 73.86 (73.86–73.86) | 81.70 (81.70–81.70) |
| CoxPH-E | 72.61 (72.61–72.61) | 81.73 (81.73–81.73) |
RSF Random Survival Forest, GBSA Gradient Boosting Survival Analysis, EST Extra Survival Trees, CoxPH-L Cox proportional hazards model Lasso, CoxPH-E Cox proportional hazards model ElasticNet, C-index concordance index, CI confidence interval.
Ethics approval and consent to participate
This study was approved by the Ethics Review Board of National Cancer Center Institutional Review Board (IRB NO. NCC2023-0260). The requirement for informed consent in this study was waived because the researcher accessed only anonymized data for analysis purposes. The pseudonymized data were analyzed in a secure environment provided by the National Cancer Data Centre, ensuring that only the results were exported. All procedures have been carried out in accordance with relevant guidelines and regulations.
Data source
This study utilized the Breast-CPSD provided by the NCDC14. This database was constructed by linking data from the Breast Cancer Public Library Database (CPLD) of the NCDC and collaborative staging (CS) for cancer established in the Korea Central Cancer Registry (KCCR)15,16. The Breast-CPSD includes data from 16,870 individuals who developed BC at primary sites C500–C506, C508, and C509 between 2012 and 2019. From these data, we excluded patients diagnosed with BC between 2016 and 2019, so that 5-year mortality could be predicted, and the 2012 cohort, for which screening information prior to cancer diagnosis was unavailable. Records with missing or unknown entries were then removed (T-size, 139 records; ER, 66 records; PR, 3 records; HER2, 181 records; AJCC7 STAGE, 7 records; weight, 1 record; protein in urine, 106 records; low-density lipoprotein, 6 records; smoking status, 3 records; weekly alcohol consumption, 1 record; moderate physical activity, 1 record; estimated glomerular filtration rate, 21 records; male, 10 records). Subsequently, the dataset was divided into two cohorts: individuals aged < 50 years and those aged ≥ 51 years. AF, CKD, COPD, DM, DVT, DYS, HF, HYP, LD, MI, PVD, and stroke were defined according to the International Classification of Diseases 10th Revision (ICD-10). The primary endpoint was defined as all-cause mortality within 5 years of cancer diagnosis.
For each cohort, 80% of the data were allocated to the training set to create survival models, whereas the remaining 20% constituted the test set, as shown in Fig. 2. We split the dataset into training and test sets using a random seed of 42 and manually adjusted the class label distribution to minimize bias. The difference in label proportions between the training set (y_train) and test set (y_test) was approximately 0.1%, ensuring balanced representation. In the dataset of older patients with BC, the label distribution in y_train was 94.18% for class 0 and 5.82% for class 1. Similarly, in the dataset of younger patients with BC, y_train had 95.08% for class 0 and 4.92% for class 1.
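The split described above can be illustrated with a small, dependency-free sketch. It is not the authors' code: the text says the label distribution was adjusted manually, whereas this example swaps in a stratified split, which shuffles each class separately so that both sets keep nearly the same event rate. The function names and the synthetic labels are hypothetical; in practice, scikit-learn's `train_test_split` with the `stratify` and `random_state` arguments offers the same behavior.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with similar label proportions in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)  # shuffle within each class only
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


def event_rate(labels, indices):
    """Fraction of the selected records whose label is 1."""
    return sum(labels[i] for i in indices) / len(indices)


# Synthetic labels for illustration only (not study data):
# 1 = death within 5 years, 0 = no death observed; 5% events.
labels = [1] * 50 + [0] * 950
train_idx, test_idx = stratified_split(labels)
train_rate = event_rate(labels, train_idx)
test_rate = event_rate(labels, test_idx)
print(len(train_idx), len(test_idx), train_rate, test_rate)
```

Because the split is done per class, the difference between the two event rates is bounded by rounding, which is the property the reported 0.1% difference reflects.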
Fig. 2.
Flow diagram of patients.
Statistical analysis
Continuous variables were described as mean ± SD, and categorical variables were described as percentages (%). We used the t-test for continuous variables and the chi-square test for categorical variables. Kaplan-Meier analysis and the log-rank test were used to analyze survival free from all-cause death. These analyses were performed using SAS version 9.4 (SAS Institute, Cary, NC).
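For readers unfamiliar with the Kaplan-Meier estimator, the following is a minimal sketch of the product-limit calculation. The study itself used SAS; this Python version, the function name `kaplan_meier`, and the toy follow-up times are illustrative assumptions only, and the log-rank test is not shown.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  : follow-up time for each patient
    events : 1 if death was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            # Product-limit step: multiply by the conditional survival at t.
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after t.
        n_at_risk -= leaving
    return curve


# Toy data for illustration only (not study data).
toy_times = [5, 8, 8, 12, 15]
toy_events = [1, 0, 1, 1, 0]
km_curve = kaplan_meier(toy_times, toy_events)
for t, s in km_curve:
    print(t, round(s, 3))
```

Censored patients contribute to the risk set until their last follow-up but never trigger a drop in the curve, which is why the estimate changes only at observed deaths.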
Survival machine learning models
The modeling of time-to-event data in survival analysis requires specialized techniques to address specific challenges such as censoring, truncation, time-varying covariates, and covariate effects. Several models, including RSF, GBSA, EST, CoxPH-L, and CoxPH-E, have been employed for this purpose. RSF17 is an extension to survival analysis of the random forest algorithm proposed by Breiman et al. in 200118. This model is a non-parametric, non-linear ensemble learning method. It captures non-linear effects and relationships between variables, while ranking the importance of variables to minimize generalization error19. In addition, RSF is capable of estimating the cumulative hazard function for each sample, even when the proportional hazards assumption does not hold17. The GBSA is an adaptation of the gradient boosting machine (GBM)20 tailored for survival analysis and able to handle censored data21,22. GBM is an ensemble learning method that builds a series of decision trees, with each tree attempting to correct the errors made by the previous one20. GBSA can capture complex interactions and non-linear associations between features and survival times, and it does not depend on the proportional hazards assumption23. The EST model is an extension of the extremely randomized trees model developed by Geurts et al. in 200624. This approach introduces additional randomness by selecting split points at random, rather than based on the best split criteria. It can also accommodate censored data and does not rely on the proportional hazards assumption25.
The Cox proportional hazards model (CoxPH)26 is the standard approach for survival analysis and has been enhanced by penalization methods, including lasso and ElasticNet penalties. The CoxPH models assume that the relationship between the predictors and the log hazard is both linear and additive27. The CoxPH-L implements Lasso regularization, a method that facilitates variable selection by imposing penalties on the absolute magnitude of coefficients, resulting in sparse models that include only the most important variables28. In contrast, the CoxPH-E integrates both Lasso (L1) and Ridge (L2) regularization techniques, striking a balance between the processes of variable selection and multicollinearity management29. The performance of these models was assessed using the C-index30. In addition, permutation feature importance analysis was performed. Feature importance scoring, which assesses the values of all input features to determine their importance in decision-making mechanisms, is a critical step in mortality prediction for survival analysis. Overall, feature importance was assessed using Scikit-learn (version 1.0.2)31. Survival models were constructed using the scikit-survival package (version 0.17.2)32, and the entire analysis was conducted using TensorFlow (version 1.15.5) and Python (version 3.7.5).
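To make the evaluation metric concrete, the following is a minimal sketch of Harrell's C-index, written without external dependencies. It is a simplified illustration, not the implementation used in the study: pairs with tied survival times are skipped, and the function name and toy values are hypothetical. The scikit-survival package provides `concordance_index_censored` for production use.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data (tied times are skipped).

    A pair (i, j) is comparable when patient i has an observed event and
    a strictly shorter time than patient j. The pair is concordant when
    the model assigns patient i the higher risk score.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data for illustration only (not study data).
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.7, 0.8, 0.1]
c_index = concordance_index(toy_times, toy_events, toy_risk)
print(c_index)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering; the tables in this paper report the same quantity multiplied by 100.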
Results
Baseline characteristics of the patients
Table 1 presents a comparison of the baseline characteristics of the younger and older patient cohorts. The younger group had higher proportions of ER, PR, past smokers, and current smokers, as well as more weekly alcohol consumption (days) and vigorous physical activity (days per week). In contrast, the proportions of HER2, AF, CKD, COPD, DM, DVT, DYS, HF, HYP, LD, MI, PVD, and stroke were higher in the older group. Additionally, the distributions of height, weight, waist circumference, urinary protein, topography code, morphology code, and AJCC7 STAGE differed between the younger and older groups.
Table 1.
Baseline characteristics between younger and older groups
| Variables | Older group (N = 1,827) | Younger group (N = 1,574) |
|---|---|---|
| Age (y) | 59.75 ± 7.05 | 44.35 ± 4.35 |
| BMI (kg/m2) | 23.85 ± 2.81 | 22.68 ± 2.73 |
| Height (cm, %) | | |
| 150 ≤ H < 160 | 77.78 | 54.51 |
| 160 ≤ H < 170 | 22 | 43.14 |
| 170 ≤ H | 0.22 | 2.35 |
| Weight (kg, %) | | |
| 40 ≤ W < 50 | 12.26 | 14.42 |
| 50 ≤ W < 60 | 46.52 | 48.86 |
| 60 ≤ W < 70 | 31.25 | 27.32 |
| 70 ≤ W < 80 | 7.99 | 6.8 |
| 80 ≤ W < 90 | 1.64 | 1.91 |
| 90 ≤ W | 0.33 | 0.7 |
| Waist circumference (cm, %) | | |
| 60 ≤ WC < 70 | 10.78 | 25.86 |
| 70 ≤ WC < 80 | 39.19 | 49.49 |
| 80 ≤ WC < 90 | 36.4 | 18.87 |
| 90 ≤ WC | 13.63 | 5.78 |
| Systolic blood pressure (mmHg) | 124.50 ± 15.40 | 115.60 ± 13.62 |
| Diastolic blood pressure (mmHg) | 76.35 ± 9.80 | 72.77 ± 9.47 |
| Hemoglobin level (g/dL) | 13.12 ± 1.15 | 12.77 ± 1.32 |
| Fasting blood sugar (mg/dL) | 101.90 ± 25.46 | 94.33 ± 17.46 |
| Total cholesterol (mg/dL) | 202.60 ± 39.86 | 189.50 ± 32.72 |
| Serum glutamic oxaloacetic transaminase (IU/L) | 26.50 ± 37.07 | 21.62 ± 12.60 |
| Serum glutamic pyruvic transaminase (IU/L) | 23.85 ± 30.93 | 18.44 ± 17.40 |
| Gamma glutamyl transpeptidase (IU/L) | 27.96 ± 37.14 | 21.79 ± 27.36 |
| Triglycerides (mg/dL) | 124.90 ± 72.32 | 98.22 ± 65.30 |
| High-density lipoprotein (mg/dL) | 57.04 ± 19.32 | 60.43 ± 14.30 |
| Low-density lipoprotein (mg/dL) | 121.00 ± 36.17 | 109.50 ± 29.88 |
| Serum creatinine (mg/dL) | 0.78 ± 0.41 | 0.81 ± 2.32 |
| Estimated glomerular filtration rate (mL/min) | 85.34 ± 22.74 | 94.23 ± 32.64 |
| Protein in urine (%) | | |
| 1 negative (-) | 95.35 | 94.98 |
| 2 positive (±) | 2.41 | 3.11 |
| 3 positive (+ 1) | 1.53 | 1.33 |
| 4 positive (+ 2) | 0.55 | 0.32 |
| 5 positive (+ 3) | 0.11 | 0.13 |
| 6 positive (+ 4) | 0.05 | 0.13 |
| Topography CODE (%) | | |
| C500 | 0.66 | 0.44 |
| C501 | 4.32 | 3.05 |
| C502 | 14.4 | 15.69 |
| C503 | 4.05 | 6.73 |
| C504 | 36.84 | 35.01 |
| C505 | 7.22 | 7.81 |
| C506 | 0 | 0.06 |
| C508 | 19.21 | 17.47 |
| C509 | 13.3 | 13.72 |
| Morphology CODE (%) | | |
| 1. Squamous and transitional cell carcinoma (8051–8084, 8120–8131) | 0.11 | 0 |
| 3. Adenocarcinoma (8140–8149, 8160–8163, 8190–8221, 8260–8337, 8350–8552, 8570–8576, 8940–8941) | 98.36 | 98.98 |
| 4. Other specific carcinomas (8030–8046, 8150–8157, 8170–8180, 8230–8255, 8340–8347, 8560–8562, 8580–8671) | 0.44 | 0.32 |
| 5. Unspecified carcinomas (NOS) (8010–8015, 8020–8022, 8050) | 0.82 | 0.32 |
| 16. Other specified types of cancer (8720–8790, 8930–8936, 8950–8983, 9000–9030, 9060–9110, 9260–9365, 9380–9539) | 0 | 0.06 |
| 17. Unspecified types of cancer (8000–8005) | 0.27 | 0.32 |
| AJCC7 STAGE | | |
| IA(%) | 46.52 | 46.19 |
| IB(%) | 1.97 | 1.78 |
| IIA(%) | 26.71 | 26.49 |
| IIB(%) | 13.03 | 12.77 |
| III(%) | 9.96 | 11.37 |
| IV(%) | 1.81 | 1.4 |
| T-size | 21.39 ± 15.08 | 22.46 ± 16.44 |
| ER (%) | 69.46 | 76.56 |
| PR (%) | 55.28 | 71.47 |
| HER2 (%) | 33.44 | 28.78 |
| Atrial fibrillation (%) | 3.17 | 0.57 |
| Chronic kidney disease (%) | 1.26 | 0.38 |
| Chronic obstructive pulmonary disease (%) | 4.76 | 1.78 |
| Diabetes (%) | 31.42 | 13.6 |
| Deep venous thrombosis (%) | 2.13 | 1.52 |
| Dyslipidemia (%) | 69.51 | 47.78 |
| Heart failure (%) | 5.47 | 2.54 |
| Hypertension (%) | 50.85 | 16.65 |
| Liver disease (%) | 40.72 | 35.9 |
| Myocardial infarction (%) | 0.93 | 0.13 |
| Peripheral vascular disease (%) | 17.41 | 6.86 |
| Stroke (%) | 0.88 | 0.13 |
| Smoking status | | |
| Non-smoker (%) | 94.75 | 91.68 |
| Past smoker (%) | 1.81 | 2.92 |
| Current smoker (%) | 3.45 | 5.4 |
| Weekly alcohol consumption (days) | 0.33 ± 0.96 | 0.61 ± 1.03 |
| Moderate physical activity (days in a week) | 1.26 ± 1.90 | 1.22 ± 1.67 |
| Physical activity Walking (days in a week) | 2.83 ± 2.48 | 2.80 ± 2.37 |
| Vigorous physical activity (days in a week) | 0.88 ± 1.66 | 0.94 ± 1.53 |
| Survival time (days) | 1775.70 ± 233.20 | 1797.00 ± 173.10 |
| All cause of death (%) | 5.64 | 3.37 |
Data are expressed as number of patients(percentage) or mean ± standard deviation.
BMI body mass index
The older group exhibited higher levels of BMI, systolic blood pressure, diastolic blood pressure, hemoglobin, fasting blood sugar, total cholesterol, serum glutamic oxaloacetic transaminase, serum glutamic pyruvic transaminase, gamma glutamyl transpeptidase, triglycerides, low-density lipoprotein, moderate physical activity (days per week), and walking (days per week). However, the younger group exhibited higher levels of high-density lipoprotein, serum creatinine, estimated glomerular filtration rate, tumor size, weekly alcohol consumption (days), and vigorous physical activity (days per week).
Model performance evaluation
The performance of the RSF, GBSA, EST, CoxPH-L, and CoxPH-E models was compared based on the mean C-index with 95% confidence interval (CI) for the prediction of mortality (Table 2). The RSF model achieved a C-index with 95% CI of 74.91 (72.47–77.36) for the young cohort and 81.23 (80.34–82.12) for the old cohort. The GBSA model had a C-index with 95% CI of 66.58 (66.54–66.63) for the young cohort and 81.70 (81.70–81.70) for the old cohort. The EST model showed a C-index with 95% CI of 87.53 (86.68–88.53) for the young cohort and 82.95 (82.09–83.81) for the old cohort. The CoxPH-L model had a C-index with 95% CI of 73.86 (73.86–73.86) for the young cohort and 81.70 (81.70–81.70) for the old cohort. Finally, the CoxPH-E model had a C-index with 95% CI of 72.61 (72.61–72.61) for the young cohort and 81.73 (81.73–81.73) for the old cohort.
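The text does not state how the mean C-index and its 95% CI were obtained. One plausible reading, given that each model was run ten times, is a mean over the runs with a Student-t interval; the sketch below illustrates that calculation under this assumption. The function name, the per-run scores, and the choice of a t-interval are all hypothetical and are not taken from the study.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 9 degrees of freedom (10 runs).
T_CRIT_DF9 = 2.262


def mean_with_ci_10_runs(scores):
    """Return (mean, lower, upper) for exactly ten repeated-run scores."""
    if len(scores) != 10:
        raise ValueError("this sketch hard-codes the t value for 10 runs")
    mean = statistics.mean(scores)
    standard_error = statistics.stdev(scores) / math.sqrt(len(scores))
    half_width = T_CRIT_DF9 * standard_error
    return mean, mean - half_width, mean + half_width


# Hypothetical per-run C-index values (x100); not taken from the study.
varying_runs = [86.9, 87.2, 87.8, 88.1, 87.0, 87.6, 87.4, 88.0, 87.3, 87.7]
varying_mean, varying_low, varying_high = mean_with_ci_10_runs(varying_runs)

# A model that returns the same score on every run yields a zero-width interval.
constant_mean, constant_low, constant_high = mean_with_ci_10_runs([70.0] * 10)

print(round(varying_mean, 2), round(varying_low, 2), round(varying_high, 2))
print(constant_mean, constant_low, constant_high)
```

Under this reading, an interval whose bounds equal the mean would be expected whenever a model produces identical scores across runs.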
To construct the survival prediction models, we identified the most important features for each model and extracted the top 10 features in each model for analysis. In the young cohort, the RSF model highlighted AJCC7 stage, C509, COPD, ER status, age, physical activity “walking” (days per week), hemoglobin levels, weekly alcohol consumption (days), total cholesterol, and weight. The GBSA model identified AJCC7 stage, diastolic blood pressure, systolic blood pressure, ER, weekly alcohol consumption (days), fasting blood sugar, COPD, low-density lipoprotein, moderate physical activity (days per week), and C503. The EST model included AJCC7 stage, C509, COPD, ER, PR, DYS, physical activity “walking” (days per week), LD, age, and height. The CoxPH-L model included total cholesterol, AJCC7 stage, low-density lipoprotein, triglycerides, high-density lipoprotein, COPD, PR, weekly alcohol consumption (days), C501, and C509. The CoxPH-E model included total cholesterol, low-density lipoprotein, AJCC7 stage, triglycerides, high-density lipoprotein, COPD, weekly alcohol consumption (days), C501, PR, and C509.
Within the old cohort, the RSF model selected AJCC7 stage, fasting blood sugar, T-size, ER, DYS, PR, estimated glomerular filtration rate, C509, HER2, and low-density lipoprotein. The GBSA model used AJCC7 stage, T-size, fasting blood sugar, age, BMI, high-density lipoprotein, estimated glomerular filtration rate, triglycerides, ER, and weight. The EST model included AJCC7 stage, DYS, ER, fasting blood sugar, MCODE-3 (morphology code 3, adenocarcinoma), C509, PR, T-size, HER2, and HF. The CoxPH-L model included AJCC7 stage, fasting blood sugar, gamma glutamyl transpeptidase, serum creatinine, PR, DYS, HYP, vigorous physical activity (days per week), physical activity “walking” (days per week), and PVD. The CoxPH-E model included AJCC7 stage, fasting blood sugar, gamma glutamyl transpeptidase, serum creatinine, PR, DYS, HYP, vigorous physical activity (days per week), physical activity “walking” (days per week), and hemoglobin.
Figure 3 depicts a heatmap showing how frequently each variable was used in developing the 5-year mortality prediction models in the young and old patient cohorts. Deeper colors signify higher selection frequency. In both cohorts, the AJCC7 stage consistently emerged as the most critical feature across all models. However, the importance of the other features varied considerably, not only between the models but also within each age cohort.
Fig. 3.
Mortality prediction models with feature importance by age group. BMI body mass index, ER estrogen receptor, HF heart failure, HER2 human epidermal growth factor receptor 2, LD liver disease, PVD peripheral vascular disease, PR progesterone receptor, T-size tumor size. The color scale indicates how often each feature was selected as one of the top 10 important features in the model, with darker colors indicating higher frequency.
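The selection frequencies shown in such a heatmap can be tallied as in the following sketch, which counts how often each feature appears among the top-k features across repeated runs. This is an illustrative assumption about the bookkeeping, not the authors' code; the function name and the importance values are hypothetical, and k is set to 2 only to keep the example small (the study used the top 10).

```python
from collections import Counter


def top_feature_frequency(runs, k=10):
    """Count how often each feature ranks in the top k across runs.

    runs : list of dicts mapping feature name -> importance score
    """
    counts = Counter()
    for importances in runs:
        ranked = sorted(importances, key=importances.get, reverse=True)
        counts.update(ranked[:k])
    return counts


# Hypothetical importance scores from three runs; not taken from the study.
example_runs = [
    {"AJCC7_STAGE": 0.30, "COPD": 0.10, "ER": 0.05},
    {"AJCC7_STAGE": 0.25, "COPD": 0.02, "ER": 0.08},
    {"AJCC7_STAGE": 0.28, "COPD": 0.12, "ER": 0.01},
]
frequency = top_feature_frequency(example_runs, k=2)
for feature, count in frequency.most_common():
    print(feature, count)
```

Each count becomes one cell of the heatmap, so a feature selected in every run of every model appears in the darkest color.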
In Fig. 4, the ROC curves illustrate the relationship between the true positive rate (TPR) and false positive rate (FPR) for the RSF, GBSA, EST, CoxPH-L, and CoxPH-E models. Each curve shows how accurately a model discriminates between death and survival outcomes based on its predictions. The closer the curve is to the top left-hand corner, the higher the TPR and the lower the FPR, and the better the model predicts death. The closer the curve is to the diagonal, the closer the model performs to random guessing. The AUC, the area under the ROC curve, is a numerical measure of a model’s classification performance; an AUC closer to 1 indicates better classification performance. The AUC value for each model is shown in the legend to facilitate comparison between models, and the random-guessing baseline is shown as a diagonal line.
Fig. 4.
ROC curves for survival prediction models. These curves demonstrate the models’ ability to discriminate between death and survival events. RSF Random Survival Forest, GBSA Gradient Boosting Survival Analysis, EST Extra Survival Trees, CoxPH-L penalized Cox proportional hazards lasso, CoxPH-E penalized Cox proportional hazards ElasticNet, AUC Area Under the Curve, ROC Receiver Operating Characteristic.
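The AUC has a simple pairwise interpretation that the following sketch makes explicit: it equals the fraction of (death, survivor) pairs in which the patient who died received the higher predicted risk. The sketch assumes a binary 5-year outcome label for every patient; how censored patients were handled when drawing the ROC curves is not stated in the text, and the function name and toy values are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative.

    labels : 1 for death within the horizon, 0 otherwise
    scores : predicted risk, higher meaning more likely to die
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # tied scores count as half a win
    return wins / (len(positives) * len(negatives))


# Toy data for illustration only (not study data).
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.35, 0.2, 0.1]
auc = roc_auc(toy_labels, toy_scores)
print(round(auc, 4))
```

This pairwise form gives the same value as integrating the ROC curve, which is why an AUC of 0.5 matches the diagonal random-guessing line.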
Discussion
BC is the leading cause of cancer-related deaths among women, and the number of younger patients with BC continues to increase. Younger patients with BC tend to have an unfavorable prognosis due to their unusual tumor characteristics. This highlights the need for a personalized treatment approach for younger patients with BC. Survival rates have been shown to differ between younger and older patients with BC, with notable differences in the baseline characteristics of these populations. In particular, the frequency of comorbidities varied significantly between younger and older patients in our experiment. These age-related physiological differences may result in different responses and outcomes, which should be considered when developing survival prediction models. Although several nomogram studies have been conducted to consider these characteristics, ML methodologies that can account for the characteristics of each population and reflect complex causal factors that contribute to negative outcomes are needed. Recently, Li et al. compared the performance of ML-based survival prediction models with the traditional Cox method for young patients with BC using the SEER database and reported that the ML methodology outperformed the Cox methodology8. However, that data source lacks important information. Patients with BC often have comorbidities for various reasons, and the impact of these comorbidities on negative prognosis was not considered. In addition, information on the daily lifestyle of patients with cancer was not considered.
To address these issues, we categorized patients with BC as young (≤ 50 years) and older (> 50 years) and compared 5-year survival prediction using five survival ML models based on the Breast-CPSD. We employed three tree-based survival machine learning models, which excel at capturing non-linear relationships and complex interactions between multiple variables, making them well suited for datasets with diverse prognostic factors and intricate patterns. Tree-based approaches do not rely on the assumption of proportional hazards, which can be a limitation in complex survival data33. These models can handle high-dimensional data and can automatically identify important feature interactions without the need for manual feature engineering. This capacity makes them particularly adept at survival analysis when relationships among covariates are intricate34. In addition to tree-based models, we also employed CoxPH-L and CoxPH-E. CoxPH-L simplifies the model by selecting only the most important variables, which reduces the risk of overfitting. CoxPH-E strikes a balance between variable selection and model complexity by combining L1 (lasso) and L2 (ridge) penalties. This enables it to maintain stable predictive performance while avoiding overfitting. Such semi-parametric methods can be highly interpretable and are beneficial when the proportional hazards assumption holds26,28.
In our experiment, the EST models split nodes by randomly selecting attributes and thresholds, which is the key difference from RF models. The EST model outperformed the other ML models in both the older and younger cohorts. EST performs particularly well on imbalanced data in which minority classes are important, because randomization allows the model to explore multiple variables and their interactions24. As the EST model allows for a wider range of threshold choices, it also performs particularly well on datasets with many attributes. The algorithm captures the intricate relationships among various data subsets and effectively handles diverse attribute distributions. These characteristics may explain why the EST models outperformed the other ML models in both age groups35,36. Furthermore, 10 iterations were conducted to select the optimal model. At each iteration, the variables deemed crucial for the ML model were identified and summarized. It is noteworthy that COPD appears to be a highly significant predictor of mortality in young patients with BC, in contrast to the results observed in the older population. Previous studies have demonstrated that COPD plays a role in the development of lung, liver, and colorectal cancer37. Our findings further indicate that COPD was a significant predictor of negative prognosis in patients with BC. In young patients with BC, the survival curves also differed between the group with COPD and the group without COPD (Fig. 5), and stage III was significantly more common in the COPD group in the baseline characterization of the two groups (Supplement 1). We can cautiously assume that these clinical factors contribute to the higher mortality observed in young patients with BC and COPD.
In addition, the presence of metastases and related complications are responsible for many adverse prognostic events in breast cancer patients38. Death due to lung metastases is of particular concern39. Nomogram-based models have recently been proposed as a potential solution to these problems40. However, the precise results of these models in the context of COPD have not yet been published. Further studies are needed to clarify the relationship between COPD, metastases and mortality.
Fig. 5.
Survival curve for all causes of death free survival. A Comparison survival curve between younger group with COPD and younger group without COPD. B Comparison survival curve between older group with COPD and older group without COPD. COPD chronic obstructive pulmonary disease.
In addition, the Breast-CPSD includes screening data from the National Health Service, which permits the consideration of data on the general lifestyle of patients with cancer. In contrast to the older patient population, our experimental results indicated that weekly alcohol consumption (days) was a negative prognostic factor in younger patients with BC. The model proposed in this study enables the identification of high-risk groups of young patients with BC by considering various data. The results of this study can inform the implementation of more aggressive treatment strategies in these patients.
In conclusion, the number of younger patients diagnosed with BC is increasing, and studies have proposed classifying these patients into high-risk groups. However, few studies have focused on survival predictions based on machine learning (ML). In this study, we developed an ML-based model to predict the survival of young patients with BC, which is expected to help identify patients who require aggressive treatment. However, our research has some limitations. First, given that the data used in this study were exclusively derived from female Korean patients with BC, further validation of the model in diverse populations is required in future studies. Second, although COPD was identified as a significant variable in the younger cohort in our study, the sample size is insufficient to draw definitive conclusions. Further studies are required to collect a larger sample on a nested case-control basis to provide a more comprehensive and objective understanding of the impact of comorbidity. Third, each model was run ten times to identify the variables it favored, and the preferred variables from each model were then visualized. Although PR, ER, and HER2 were not consistently selected as key variables across all experimental results, the highest-performing EST models intermittently selected these variables. ER, PR, and HER2 are known to be important prognostic variables in patients with BC. Therefore, further experiments based on large-scale data are essential to gain a more complete understanding of their role in this context. Finally, we built a survival prediction model using high-quality data to predict adverse prognosis in young patients with BC, and it performed well. However, as a methodological matter, future work is required to develop new optimized algorithms to predict such adverse prognosis.
Acknowledgements
This study was supported by a Grant (no: 2310440-2) from the National Cancer Center of Korea, Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (No. NRF-2022R1F1A107504).
Author contributions
Conceptualization was managed by HYJK, MSK, and KSR; methodology, HYJK, MSK, and KSR; validation, HYJK, MSK, and KSR; investigation, HYJK; data curation, HYJK, and KSR; writing the original draft preparation, HYJK and KSR. All the authors assisted in drafting and editing the manuscript.
Data availability
The data that support the findings of this study are available from the Korea Clinical Data Utilization Network for Research Excellence (K-CURE) portal, but restrictions apply to the availability of these data, which were used under license for the current study, and so they are not publicly available. Data are, however, available from the corresponding author upon reasonable request and with permission of NCDC. The GitHub repository (https://github.com/KwangSun-Ryu/Breast-cancer-survival-prediction.git) contains all code relevant to the current submission.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Anderson, B. O. et al. The Global Breast Cancer Initiative: a strategic collaboration to strengthen health care for non-communicable diseases. Lancet Oncol. 22, 578–581 (2021).
- 2.Fernandes, U. et al. Breast cancer in young women: a rising threat: a 5-year follow-up comparative study. Porto Biomed. J. 8, e213. 10.1097/j.pbj.0000000000000213 (2023).
- 3.DeSantis, C. E. et al. Breast cancer statistics, 2019. CA Cancer J. Clin. 69, 438–451 (2019).
- 4.Shah, A. N. et al. Circulating tumor cells, circulating tumor DNA, and disease characteristics in young women with metastatic breast cancer. Breast Cancer Res. Treat. 187, 397–405 (2021).
- 5.Pruessmann, J. et al. Conditional disease-free and overall survival of 1858 young women with non-metastatic breast cancer and with participation in a post-therapeutic rehab programme according to clinical subtypes. Breast Care. 16, 163–172 (2020).
- 6.Sun, Y. Nomograms for prediction of overall and cancer-specific survival in young breast cancer. Breast Cancer Res. Treat. 184, 597–613 (2020).
- 7.Guo, L. W. Development and validation of nomograms for predicting overall and breast cancer-specific survival among patients with triple-negative breast cancer. Cancer Manag Res. 10, 5881–5894 (2018).
- 8.Li, L. W., Liu, X., Shen, M. L., Zhao, M. J. & Liu, H. Development and validation of a random survival forest model for predicting long-term survival of early-stage young breast cancer patients based on the SEER database and an external validation cohort. Am. J. Cancer Res. 14, 1609–1621 (2024).
- 9.Sedighi-Maman, Z. & Mondello, A. A two-stage modeling approach for breast cancer survivability prediction. Int. J. Med. Inf. 149, 104438. 10.1016/j.ijmedinf.2021.104438 (2021).
- 10.Li, J. Predicting breast cancer 5-year survival using machine learning: a systematic review. PloS One. 16, e0250370 (2021).
- 11.Boeri, C. et al. Machine learning techniques in breast cancer prognosis prediction: a primary evaluation. Cancer Med. 9, 3234–3243. 10.1002/cam4.2811 (2020).
- 12.Liu, P. et al. Optimizing survival analysis of XGBoost for ties to predict disease progression of breast cancer. IEEE Trans. Biomed. Eng. 68, 148–160. 10.1109/TBME.2020.2993278 (2021).
- 13.Ganggayah, M. D. et al. Predicting factors for survival of breast cancer patients using machine learning techniques. BMC Med. Inf. Decis. Mak. 19, 48. 10.1186/s12911-019-0801-4 (2019).
- 14.National Cancer Center. Cancer data. www.cancerdata.re.kr/en/index. Accessed 11 July 2024.
- 15.Korea Central Cancer Registry. KCCR Survey. kccrsurvey.cancer.go.kr/index.do. Accessed 11 July 2024.
- 16.Choi, D. W. et al. Data resource profile: the cancer public library database in South Korea. Cancer Res. Treat. Apr. 30 10.4143/crt.2024.207 (2024).
- 17.Ishwaran, H., Kogalur, U. B., Blackstone, E. H. & Lauer, M. S. Random survival forests. Annals Appl. Stat. 841–860 (2008).
- 18.Breiman, L. Random forests. Mach. Learn. 45, 5–32. 10.1023/A:1010933404324 (2001).
- 19.Mogensen, U. B., Ishwaran, H. & Gerds, T. A. Evaluating random forests for survival analysis using prediction error curves. J. Stat. Softw. 50 (11), 1. 10.18637/jss.v050.i11 (2012).
- 20.Friedman, J. H. Greedy function approximation: a gradient boosting machine. Ann. Statist. 29, 1189–1232 (2001).
- 21.Chen, Y., Jia, Z., Mercola, D. & Xie, X. A gradient boosting algorithm for survival analysis via direct optimization of concordance index. Comput. Math. Methods Med. 2013, 873595. 10.1155/2013/873595 (2013).
- 22.Bai, M., Zheng, Y. & Shen, Y. Gradient boosting survival tree with applications in credit scoring. J. Oper. Res. Soc. 73, 39–55 (2022).
- 23.Tizi, W. & Berrado, A. Machine learning for survival analysis in cancer research: a comparative study. Sci. Afr. 21, e01880. 10.1016/j.sciaf.2023.e01880 (2023).
- 24.Geurts, P., Ernst, D. & Wehenkel, L. Extremely randomized trees. Mach. Learn. 63, 3–42 (2006).
- 25.Zaenal, M. S., Fitrianto, A. & Wijayanto, H. Comparison of extremely randomized survival trees and random survival forests: a simulation study. Sci. J. Inf. 11 (3), 635–644. 10.15294/sji.v11i3.8464 (2024).
- 26.Cox, D. R. Regression models and life-tables. J. Roy. Stat. Soc.: Ser. B (Methodol.). 34, 187–202 (1972).
- 27.Cygu, S., Seow, H., Dushoff, J. & Bolker, B. M. Comparing machine learning approaches to incorporate time-varying covariates in predicting cancer survival time. Sci. Rep. 13 (1), 1370. 10.1038/s41598-023-28393-7 (2023).
- 28.Tibshirani, R. The lasso method for variable selection in the Cox model. Stat. Med. 16, 385–395. https://doi.org/10.1002/(SICI)1097-0258(19970228)16:4<385::AID-SIM380>3.0.CO;2-3 (1997).
- 29.Zou, H. & Hastie, T. Regularization and variable selection via the elastic-net. J. R Stat. Soc. 67, 301–320. 10.1111/j.1467-9868.2005.00503.x (2005).
- 30.Harrell, F. E., Califf, R. M., Pryor, D. B., Lee, K. L. & Rosati, R. A. Evaluating the yield of medical tests. JAMA. 247, 2543–2546 (1982).
- 31.Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
- 32.Pölsterl, S. scikit-survival: a library for time-to-event analysis built on top of scikit-learn. J. Mach. Learn. Res. 21, 1–6 (2020).
- 33.Du, M., Haag, D. G., Lynch, J. W. & Mittinty, M. N. Comparison of the tree-based machine learning algorithms to Cox regression in predicting the survival of oral and pharyngeal cancers: analyses based on SEER database. Cancers (Basel). 12 (10), 2802. 10.3390/cancers12102802 (2020).
- 34.Dietrich, S. et al. Random Survival Forest in practice: a method for modeling complex metabolomics data in time to event analysis. Int. J. Epidemiol. 45 (5), 1406–1420. 10.22283/qbs.2017.36.2.85 (2016).
- 35.Ghazwani, M. & Begum, M. Y. Computational intelligence modeling of hyoscine drug solubility and solvent density in supercritical processing: gradient boosting, extra trees, and random forest models. Sci. Rep. 13, 10046. 10.1038/s41598-023-37232-8 (2023).
- 36.Wehenkel, L., Ernst, D. & Geurts, P. Ensembles of extremely randomized trees and some generic applications. In Proceedings of Robust Methods for Power System State Estimation and Load Forecasting (2006).
- 37.Ahn, S. V. et al. Cancer development in patients with COPD: a retrospective analysis of the National Health Insurance Service-National Sample Cohort in Korea. BMC Pulmon. Med. 20, 1–10 (2020).
- 38.Redig, A. J. & McAllister, S. S. Breast cancer as a systemic disease: a view of metastasis. J. Intern. Med. 274 (2), 113–126. 10.1111/joim.12084 (2013).
- 39.Cardoso, F. et al. 4th ESO–ESMO international consensus guidelines for advanced breast cancer (ABC 4). Ann. Oncol. 29 (8), 1634–1657. 10.1093/annonc/mdy192 (2018).
- 40.Wang, K., Li, Y., Wang, D. & Zhou, Z. Web-based dynamic nomograms for predicting overall survival and cancer-specific survival in breast cancer patients with lung metastases. J. Personalized Med. 13 (1), 43. 10.3390/jpm13010043 (2022).
- 41.Kwang Sun Ryu. Breast Cancer Survival Prediction. GitHub. (2024). https://github.com/KwangSun-Ryu/Breast-cancer-survival-prediction.git





