Abstract
Background
To develop a machine learning (ML) model for predicting the prognosis of breast cancer (BC) patients with low human epidermal growth factor receptor 2 (HER2) expression, and to investigate the association between clinicopathological characteristics and outcomes in HER2-low BC (HLBC) patients.
Methods
A retrospective analysis was conducted on data from 998 female HLBC patients treated at the Breast Center of the Fourth Hospital of Hebei Medical University (Hebei, China) between January 1, 2017, and December 31, 2020. To address class imbalance, the synthetic minority over-sampling technique was applied. Feature selection was performed using the least absolute shrinkage and selection operator, followed by construction of the prediction model using the random forest algorithm. Model performance, including specificity and accuracy, was assessed using the receiver operating characteristic (ROC) curve and confusion matrix and was compared with that of other ML models. Additionally, the log-rank test was employed to examine the relationship between selected features and patient outcomes in HLBC.
Results
The random survival forest model demonstrated superior accuracy and specificity in predicting survival outcomes for HLBC patients. Compared with other ML models, it achieved more precise predictions of disease-free survival (DFS) at 1, 2, and 3 years, with areas under the ROC curve (AUCs) of 0.726, 0.712, and 0.685 in the test cohort and 0.819, 0.776, and 0.774 in the training cohort, respectively. The analysis further identified a strong association between poor prognosis in HLBC patients and factors such as axillary lymph node dissection, family history, elevated topoisomerase (TOPO)-2 expression, advanced clinical stage, negative progesterone receptor status, P53 mutation, and increased Ki67 expression, observed across both cohorts.
Conclusions
A novel ML model was developed for accurate prognosis prediction in HLBC patients, offering valuable insights into prognostic risk factors. This model equips clinicians with enhanced data to guide treatment decisions, ultimately contributing to improved patient outcomes.
Keywords: Machine learning (ML), Random survival forest (RSF), HER2-low breast cancer (HLBC), Prognosis
Background
Breast cancer (BC) ranks among the most prevalent malignancies affecting women. In 2022, global cancer statistics reported an age-standardized incidence rate of approximately 33.0 per 100,000 [1]. Despite advancements in diagnostic techniques and therapeutic approaches, both the incidence and mortality rates remain elevated [2]. Furthermore, tumor heterogeneity contributes to varying prognoses across molecular subtypes. As a result, contemporary BC research increasingly emphasizes precision medicine in tailoring treatment strategies.
A range of treatment modalities, including chemotherapy, radiotherapy, endocrine therapy, targeted therapy, surgery, and immunotherapy, has been integrated into clinical practice to improve breast cancer outcomes. These therapies, whether applied individually or in combination, form the backbone of systemic and personalized treatment strategies. For instance, combining immunotherapy with chemotherapy has demonstrated a degree of efficacy in improving the prognosis of triple-negative breast cancer [3]. Concurrently, treatment protocols have been modified to mitigate adverse effects [4], such as hearing impairment [5], peripheral neuropathy, and headache [6]. Moreover, various clinical tools and prognostic scoring systems, including deoxyribonucleic acid (DNA) sequencing, immunohistochemistry, and the Royal Marsden Hospital (RMH) score, have been employed to assess patient outcomes and investigate the mechanisms driving breast cancer progression [7].
BC is classified into luminal A, luminal B, triple-negative, and HER2-enriched subtypes based on the expression of the hormone receptors (HRs), namely estrogen receptor (ER) and progesterone receptor (PR), together with human epidermal growth factor receptor 2 (HER2) [8]. As a key oncogene in BC, HER2 exhibits tyrosine kinase activity, and its overexpression correlates with increased tumor aggressiveness and a worse prognosis [9]. The development and approval of several HER2-targeted therapies, including trastuzumab, lapatinib, pertuzumab, pyrotinib, and trastuzumab emtansine, have notably enhanced the outcomes for patients with HER2-positive BC [10].
Recent research in BC has shifted from HER2-enriched to HER2-low BC (HLBC). According to the 2022 American Society of Clinical Oncology (ASCO)/College of American Pathologists (CAP) testing guidelines, HLBC is characterized by an immunohistochemistry (IHC) score of 1+, or of 2+ with a negative fluorescence in situ hybridization (FISH) test [11]. Trastuzumab deruxtecan (T-DXd) has demonstrated significant improvements in both progression-free survival (PFS) and overall survival (OS) for HLBC patients, indicating its therapeutic potential and survival benefit [12]. Historically, BC with low HER2 expression was categorized as HER2-negative, but recent studies, including the DAISY trial (NCT04132960), have revealed prognostic differences between HLBC and HER2-negative BC, prompting the reclassification of HLBC [13]. While these findings suggest a distinct prognostic relevance for HLBC, the specific prognostic factors remain unclear. Therefore, advanced methodologies are essential to identify prognostic markers in HLBC and to enhance prediction accuracy for patient outcomes.
Machine learning (ML) offers a powerful methodology for disease diagnosis and prognosis prediction, with Random Forest (RF) being a prominent algorithm for outcome prediction. RF’s key advantages include resilience to overfitting, effective handling of nonlinear parameters, the ability to estimate the significance of each covariate, and its adaptability to both categorical and continuous data without the need for scaling or manual variable selection [14].
This study presents the development of an RF-based prognostic model designed to predict HLBC patient outcomes. The model construction process encompassed feature selection, evaluation, and comparison, while the relationship between model features and patient prognosis was systematically analyzed to provide insights for clinical practice.
Methods
Data collection and study design
Figure 1 presents the structured workflow of the study. Data from female HLBC patients who received treatment at the Breast Center of the Fourth Hospital of Hebei Medical University (Hebei, China) between January 1, 2017, and December 31, 2020, were retrospectively gathered.
Fig. 1.
The study design flowchart
The study’s inclusion criteria were defined as follows: (1) female patients diagnosed with BC; (2) pathology-confirmed invasive BC; (3) low HER2 expression as determined by immunohistochemistry (IHC); (4) TNM stages I to III based on the eighth edition of the American Joint Committee on Cancer (AJCC) staging guidelines [15]; (5) histological subtypes restricted to invasive ductal or lobular carcinoma; and (6) absence of distant metastasis or other malignancies, confirmed through preoperative comprehensive evaluation. Patients with incomplete follow-up data, missing clinical information, or DFS under 30 days were excluded.
Case information was collected according to the above criteria. Because the study involved no identifiable or private patient information and no intervention, the Ethics Committee of the Fourth Hospital of Hebei Medical University waived the requirement for informed consent.
Clinicopathological features and endpoint indicators
ER-positive or PR-positive tumors were characterized by IHC staining of more than 1% of cells. Because the role of P53 in cancer is complex, the interpretation of P53 mutation status varies across cancer types. In line with most of the literature on the immunohistochemical interpretation of P53 mutation in breast cancer and gynecological tumors, P53 mutation was defined as uniformly strong nuclear staining in at least 80% of the breast cancer cells [16–20]. HLBC was classified as a BC subgroup with an HER2 IHC score of 1+, or of 2+ with a negative FISH test. Pathological grading of BC was based on the Nottingham Classification Criteria [21], with Grade 1 categorized as low grade and Grade 2 or higher as high grade.
The primary outcome measure, DFS, was defined as the period from radical surgery to either local recurrence, death, or the cutoff date of December 31, 2022.
Data preprocessing and feature selection
Categorical variables were transformed into numerical formats using the “get_dummies” function from the “pandas” package. Missing data were imputed through the “KNNImputer” function from the “sklearn.impute” package. Patient data were split into training and test cohorts at a 6:4 ratio using the “train_test_split” function from the “sklearn.model_selection” package. To address data imbalance, oversampling was performed with the “SMOTE” function from the “imblearn.over_sampling” package. The R package “gtsummary” generated a summary table outlining baseline characteristics.
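The oversampling step can be illustrated with a minimal from-scratch sketch of the SMOTE idea: each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. This is not the “imblearn” implementation used in the study; the function name `smote_sketch`, the toy data, and the parameter values are assumptions made for illustration only.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic rows by interpolating between a minority
    row and one of its k nearest minority neighbours (Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by distance to the chosen base row.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: math.dist(base, row))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: two numeric features per minority-class patient.
minority_rows = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_rows = smote_sketch(minority_rows, n_new=6)
print(len(new_rows))
```

Because every synthetic row is a convex combination of two observed rows, it always lies within the range spanned by the original minority samples.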
The “glmnet” package was employed for feature selection utilizing the Least Absolute Shrinkage and Selection Operator (LASSO) method, a statistical approach for variable selection and model optimization. Based on the algorithm’s design, LASSO compressed the coefficients of certain variables to zero, effectively excluding them from the final model. Variables with non-zero coefficients were retained for inclusion. The selected feature coefficients were ranked by absolute value in descending order. To visualize the variables and their corresponding weights, the “wordcloud2” package generated a word cloud representation.
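The shrink-to-zero behaviour described above can be sketched with a small coordinate-descent solver for a squared-error LASSO objective. This is an illustrative sketch under stated assumptions, not the “glmnet” routine: the loss family used in the study is not specified here, the columns are assumed to be centred, and the function names and toy data are hypothetical.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam; return 0 inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n)*||y - Xb||^2 + lam*||b||_1 by cyclic coordinate
    descent. Assumes centred columns and no intercept."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that excludes j.
            rho = 0.0
            for i in range(n):
                pred_without_j = sum(
                    X[i][k] * beta[k] for k in range(p) if k != j
                )
                rho += X[i][j] * (y[i] - pred_without_j)
            rho /= n
            col_norm = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / col_norm
    return beta


# Hypothetical toy data: y depends on feature 0 only.
X = [[-1.5, 0.5], [-0.5, -1.0], [0.5, 1.0], [1.5, -0.5]]
y = [-3.0, -1.0, 1.0, 3.0]
beta = lasso_coordinate_descent(X, y, lam=0.5)
# Rank features by absolute coefficient, largest first.
ranked = sorted(enumerate(beta), key=lambda item: abs(item[1]), reverse=True)
print(beta, ranked)
```

In this toy case the uninformative feature receives a coefficient of exactly zero and would be dropped, while the informative feature is retained with a shrunken coefficient.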
Establishment and validation of the random survival forest (RSF) model
Based on the LASSO feature selection outcomes, variables such as family history, TOPO2, TNM stage, T stage, N stage, PR expression, Ki67 expression, P53 mutation, and axillary surgery type (Armpit_surgery) were integrated into ML models for predicting 1-, 2-, and 3-year DFS in HLBC patients. The RF ML model was constructed using the “sklearn” package, while ten-fold cross-validation was performed on the training cohort through the “GridSearchCV” function. Model performance was assessed on the test cohort. The specificity and accuracy of the model were evaluated using the receiver operating characteristic (ROC) curve and confusion matrix, implemented with the “timeROC” and “sklearn” packages, respectively.
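As a brief illustration of what the AUC summarizes, the sketch below computes it as the probability that a randomly chosen patient with an event is assigned a higher predicted risk than a randomly chosen patient without one. It is an ordinary binary AUC, not the time-dependent estimator provided by “timeROC”, and the labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs in which the positive
    case has the higher score (ties count as half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = DFS event within the horizon) and predicted risks.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.2, 0.8]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

Here eight of the nine positive/negative pairs are ordered correctly, so the AUC is 8/9.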
Feature importance, model comparison, and prognosis analysis
This study employed Shapley additive explanations (SHAP) to assess and quantify the significance of model variables. Furthermore, the performance of various ML models, including logistic regression, support vector machine, K-Nearest Neighbor, decision tree, Gradient Boosting Decision Tree, AdaBoost, Light Gradient Boosting Machine, and XGBoost, was evaluated by comparing their accuracy and specificity.
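The quantity that SHAP attributes to each variable can be sketched by computing exact Shapley values for a very small model through enumeration of all feature coalitions. The “SHAP” package relies on faster model-specific algorithms; this brute-force version is feasible only for a handful of features, and the toy risk function, its coefficients, and the feature names are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one instance. Features outside a coalition
    are replaced by their baseline value before calling predict."""
    n = len(instance)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                weight = (
                    factorial(size) * factorial(n - size - 1) / factorial(n)
                )
                with_i = [
                    instance[j] if (j in subset or j == i) else baseline[j]
                    for j in range(n)
                ]
                without_i = [
                    instance[j] if j in subset else baseline[j]
                    for j in range(n)
                ]
                values[i] += weight * (predict(with_i) - predict(without_i))
    return values


# Hypothetical risk score over three binary features.
def toy_risk(x):
    alnd, pr_negative, t2_or_higher = x
    return 0.10 + 0.30 * alnd + 0.20 * pr_negative + 0.10 * alnd * t2_or_higher


phi = shapley_values(toy_risk, instance=[1, 1, 1], baseline=[0, 0, 0])
print([round(v, 3) for v in phi])
```

The attributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is split equally between the two features involved.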
Survival analysis was conducted on the training and test cohorts utilizing the “survival” package to produce Kaplan-Meier curves and assess the influence of clinicopathological characteristics on HLBC patient outcomes. Univariate Cox analysis was initially applied to identify potential associations between variables and prognosis, with features displaying P < 0.1 advancing to multivariate Cox analysis. Variables with P < 0.05 in the multivariate analysis were then classified as independent prognostic factors.
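The Kaplan-Meier estimate and the two-group log-rank comparison can be sketched in a few lines. This is a from-scratch illustration rather than the R “survival” routines used in the study; the function names and the follow-up data are hypothetical.

```python
import math


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time."""
    survival, curve = 1.0, []
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    event_times = sorted(
        {t for t, e in zip(times_a + times_b, events_a + events_b) if e == 1}
    )
    observed_minus_expected, variance = 0.0, 0.0
    for t in event_times:
        n_a = sum(1 for u in times_a if u >= t)
        n_b = sum(1 for u in times_b if u >= t)
        d_a = sum(1 for u, e in zip(times_a, events_a) if u == t and e == 1)
        d_b = sum(1 for u, e in zip(times_b, events_b) if u == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical DFS in months; event = 1 for recurrence/death, 0 for censored.
times_a, events_a = [6, 12, 18, 24, 30], [1, 1, 1, 0, 1]
times_b, events_b = [20, 28, 36, 40, 48], [0, 1, 0, 0, 0]
km_a = kaplan_meier(times_a, events_a)
chi2, p_value = logrank_test(times_a, events_a, times_b, events_b)
print(km_a)
print(chi2, p_value)
```

Censored patients contribute to the at-risk counts until their last follow-up but never trigger a drop in the survival curve.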
Statistical analysis
The “sklearn.impute” package handled missing data imputation, while categorical variable processing was managed using the “pandas” package. Model construction and comparison were executed using the “sklearn” package within an ML framework. To assess feature importance, the “SHAP” package was employed. The performance metrics, including the confusion matrix and ROC curves, were evaluated and visualized using the “sklearn” and “timeROC” packages. Survival analyses were conducted with the “survival” package. All modeling and statistical analyses were performed using R (version 4.2.3, Statistics Department of the University of Auckland) and Python (version 3.7.1, Python Software Foundation). Statistical significance was determined at P-value < 0.05.
Results
Clinical characteristics of HLBC patients
Data from 748 eligible HLBC patients were analyzed, with their clinicopathological characteristics outlined in Table 1. Of the cohort, 372 patients (approximately 50%) were under 55 years of age, and 150 (20%) reported a family history of BC. Based on TNM staging, 88% (662 patients) presented with stage 1 or stage 2 BC, with 285 cases (38%) classified as T1 and 482 cases (64%) as N0. Pathological grading revealed that 520 cases (70%) were categorized as low-grade BC. Surgical interventions included mastectomy in 608 patients and axillary lymph node dissection (ALND) in 320 cases. Clinicopathological analysis indicated 97 cases were ER-negative, 181 were PR-negative, P53 mutation was absent in 580 cases, 139 cases exhibited low Ki67 expression, and 427 displayed low topoisomerase II (TOPO II) expression.
Table 1.
Baseline characteristics of HER2 low expression (HLBC) patients before SMOTE oversampling
| Characteristics | Whole population (N = 748) | Training cohort (N = 449) | Testing cohort (N = 299) |
|---|---|---|---|
| Age | |||
| Less than 55 | 372 (50%) | 212 (47%) | 160 (54%) |
| No less than 55 | 376 (50%) | 237 (53%) | 139 (46%) |
| Family_history | |||
| No | 598 (80%) | 364 (81%) | 234 (78%) |
| Yes | 150 (20%) | 85 (19%) | 65 (22%) |
| T | |||
| T1 | 285 (38%) | 164 (37%) | 121 (40%) |
| T2-T4 | 463 (62%) | 285 (63%) | 178 (60%) |
| N | |||
| N0 | 482 (64%) | 283 (63%) | 199 (67%) |
| N1-N3 | 266 (36%) | 166 (37%) | 100 (33%) |
| Stage | |||
| 1 | 265 (35%) | 153 (34%) | 112 (37%) |
| 2 | 397 (53%) | 240 (53%) | 157 (53%) |
| 3 | 86 (11%) | 56 (12%) | 30 (10%) |
| Breast_surgery | |||
| Mastectomy | 608 (81%) | 354 (79%) | 254 (85%) |
| Breast conserving surgery | 140 (19%) | 95 (21%) | 45 (15%) |
| Armpit_surgery | |||
| ALND | 320 (43%) | 201 (45%) | 119 (40%) |
| SLNB | 428 (57%) | 248 (55%) | 180 (60%) |
| Grade | |||
| Low | 520 (70%) | 319 (71%) | 201 (67%) |
| High | 228 (30%) | 130 (29%) | 98 (33%) |
| ER | |||
| Negative | 97 (13%) | 58 (13%) | 39 (13%) |
| Positive | 651 (87%) | 391 (87%) | 260 (87%) |
| PR | |||
| Negative | 181 (24%) | 112 (25%) | 69 (23%) |
| Positive | 567 (76%) | 337 (75%) | 230 (77%) |
| P53 mutation | |||
| No | 580 (78%) | 346 (77%) | 234 (78%) |
| Yes | 168 (22%) | 102 (23%) | 66 (22%) |
| Ki67 | |||
| Less than 30% | 139 (19%) | 84 (19%) | 55 (18%) |
| No less than 30% | 609 (81%) | 365 (81%) | 244 (82%) |
| TOPO2 | |||
| Less than 20% | 427 (57%) | 261 (58%) | 166 (56%) |
| No less than 20% | 321 (43%) | 188 (42%) | 133 (44%) |
SMOTE oversampling was then employed to balance the data, with the post-balancing characteristics summarized in Table 2. Following adjustment, 821 patients were under 55 years of age, and 155 had a family history of BC. Approximately 90% of the cohort presented with stage 1 or stage 2 BC, and 633 patients were at T1 stage and 977 patients at N0 stage. Additionally, 1105 cases (approximately 79%) were classified as low-grade BC. Surgical interventions included mastectomy in 1222 patients and ALND in 801 cases. Regarding molecular markers, 316 cases were ER-negative, 565 were PR-negative, P53 remained unmutated in 1076 cases, 150 exhibited low Ki67 expression, and 817 demonstrated low TOPO II expression.
Table 2.
Baseline characteristics of HER2 low expression (HLBC) patients after SMOTE oversampling
| Characteristics | Whole population (N = 1404) | Training cohort (N = 842) | Testing cohort (N = 562) |
|---|---|---|---|
| Age | |||
| Less than 55 | 821 (58%) | 490 (58%) | 331 (59%) |
| No less than 55 | 583 (42%) | 352 (42%) | 231 (41%) |
| Family_history | |||
| No | 1249 (89%) | 745 (88%) | 504 (90%) |
| Yes | 155 (11%) | 97 (12%) | 58 (10%) |
| T | |||
| T1 | 633 (45%) | 378 (45%) | 255 (45%) |
| T2-T4 | 771 (55%) | 464 (55%) | 307 (55%) |
| N | |||
| N0 | 977 (70%) | 595 (71%) | 382 (68%) |
| N1-N3 | 427 (30%) | 247 (29%) | 180 (32%) |
| Stage | |||
| 1 | 536 (38%) | 324 (38%) | 212 (38%) |
| 2 | 728 (52%) | 437 (52%) | 291 (52%) |
| 3 | 140 (10%) | 81 (10%) | 59 (10%) |
| Breast_surgery | |||
| Mastectomy | 1222 (87%) | 734 (87%) | 488 (87%) |
| Breast conserving surgery | 182 (13%) | 108 (13%) | 74 (13%) |
| Armpit_surgery | |||
| ALND | 801 (57%) | 478 (57%) | 323 (57%) |
| SLNB | 603 (43%) | 364 (43%) | 239 (43%) |
| Grade | |||
| Low | 1105 (79%) | 649 (77%) | 456 (81%) |
| High | 299 (21%) | 193 (23%) | 106 (19%) |
| ER | |||
| Negative | 316 (23%) | 182 (22%) | 134 (24%) |
| Positive | 1088 (77%) | 660 (78%) | 428 (76%) |
| PR | |||
| Negative | 565 (40%) | 331 (39%) | 234 (42%) |
| Positive | 839 (60%) | 511 (61%) | 328 (58%) |
| P53 mutation | |||
| No | 1076 (77%) | 632 (75%) | 444 (79%) |
| Yes | 328 (23%) | 210 (25%) | 118 (21%) |
| Ki67 | |||
| Less than 30% | 150 (11%) | 94 (11%) | 56 (10%) |
| No less than 30% | 1254 (89%) | 748 (89%) | 506 (90%) |
| TOPO2 | |||
| Less than 20% | 817 (58%) | 489 (58%) | 328 (58%) |
| No less than 20% | 587 (42%) | 353 (42%) | 234 (42%) |
Establishing and evaluating predictive models for estimating the prognosis of HLBC patients
An RSF model was constructed to predict 1-, 2-, and 3-year DFS in HLBC patients following surgery. Patients were randomly allocated to training and testing cohorts in a 6:4 ratio. Feature variables were selected using LASSO, with the top nine variables identified as key predictors for the RSF model (Fig. 2). To enhance model reliability, 10-fold cross-validation was employed for iterative testing within the training cohort. ROC curves were generated for both the training and test cohorts, and the corresponding AUC values were calculated. The RSF model demonstrated robust predictive performance, with AUCs for 1-, 2-, and 3-year DFS of 0.726, 0.712, and 0.685 in the test cohort and 0.819, 0.776, and 0.774 in the training cohort, respectively (Fig. 3).
Fig. 2.
Feature selection by least absolute shrinkage and selection operator. ER: estrogen receptor and PR: progesterone receptor
Fig. 3.
Evaluation of the random survival forest model. ROC curves of the 1-year prognostic model of the (A) training and (B) test cohorts, 2-year prognostic model of the (C) training and (D) test cohorts, and 3-year prognostic model of the (E) training and (F) test cohorts. ROC: receiver operating characteristic curve and AUC: area under the curve
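The 10-fold cross-validation mentioned above can be sketched as a partition of the training indices into ten folds, each serving once as the validation set. This is a simplified illustration and not the “GridSearchCV” internals; the function name is hypothetical, and the sample size of 842 is borrowed from the training cohort in Table 2 purely as an example.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (training indices, validation indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        ]
        yield training, validation


splits = list(k_fold_indices(842, k=10))
print(len(splits), [len(validation) for _, validation in splits])
```

Each candidate hyperparameter setting would be fitted on the nine training folds and scored on the held-out fold, with the ten scores averaged before settings are compared.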
The model’s performance was assessed through a confusion matrix (Fig. 4) and benchmarked against other ML models. The RSF model demonstrated strong predictive capabilities, with AUCs for 1-, 2-, and 3-year DFS in the test cohort recorded at 0.726 (recall = 0.8721; F1 score = 0.9245), 0.712 (recall = 0.9045; F1 score = 0.9598), and 0.685 (recall = 0.8346; F1 score = 0.9254), respectively (Fig. 5).
Fig. 4.
Confusion matrix of the predicted results of the random survival forest model for the test cohort. Confusion matrix of the (A) 1-, (B) 2-, and (C) 3-year prognostic models
Fig. 5.
Performance of the machine learning-based prognostic models on the test cohort. LR: logistic regression, RF: random forest, SVM: support vector machine, KNN: K-Nearest Neighbor, XGB: XGBoost, DT: decision tree, GBDT: Gradient Boosting Decision Tree, and LBGM: Light Gradient Boosting Machine
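The recall, specificity, accuracy, and F1 score reported for each model all derive from the four cells of a binary confusion matrix. The following minimal sketch shows the arithmetic; the counts are hypothetical and are not taken from Fig. 4.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive summary metrics from the cells of a binary confusion matrix."""
    recall = tp / (tp + fn)  # sensitivity: share of true events detected
    specificity = tn / (tn + fp)  # share of non-events correctly ruled out
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)
print({name: round(value, 4) for name, value in metrics.items()})
```

Because the F1 score combines precision and recall, it can remain high even when specificity is modest, which is why these metrics are usually reported together.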
SHAP analysis was then applied to assess the significance of model variables. In the 1-year (Fig. 6A, B), 2-year (Fig. 6C, D), and 3-year (Fig. 6E, F) models, ALND, PR expression, and T stage emerged as the top three predictive variables. Notably, ALND demonstrated the highest influence in the 1- and 2-year DFS prognostic models, whereas PR expression was the most influential factor in the 3-year DFS prognostic model.
Fig. 6.
Importance ranking of the variables in the random survival forest model assessed by SHAP. (A) Importance ranking and (B) summary of clinical characteristics in the 1-year prognostic model. (C) Importance ranking and (D) summary of clinical characteristics in the 2-year prognostic model. (E) Importance ranking and (F) summary of clinical characteristics in the 3-year prognostic model
Association between clinicopathological features and HLBC prognosis
A survival analysis was conducted to assess the association between clinicopathological features and the prognosis of HLBC patients. The analysis revealed significant associations between poor prognosis and factors such as ALND surgery (Fig. 7A), family history (Fig. 7B), elevated TOPO II expression (Fig. 7C), advanced clinical stage (Fig. 7D–F), PR-negative status (Fig. 7G), P53 mutation (Fig. 7H), and high Ki67 expression (Fig. 7I) in both the training and test cohorts (Fig. 8). Univariate and multivariate Cox regression analyses indicated that, aside from TOPO2 expression and armpit surgery (ALND versus sentinel lymph node biopsy [SLNB]), all other variables were independent prognostic factors (P < 0.05 in the multivariate analysis, Table 3).
Fig. 7.
Analysis of the association between characteristic variables and prognosis of HLBC patients in the training cohort. (A) Axillary lymph node dissection; (B) family history; (C) topoisomerase (TOPO)-2 expression; (D) tumor (T) stage; (E) node (N) stage; (F) tumor, node, and metastasis (TNM) stage; (G) progesterone receptor expression; (H) P53 mutation; and (I) Ki67 expression
Fig. 8.
Analysis of the association between characteristic variables and prognosis of HLBC patients in the test cohort. (A) Axillary lymph node dissection; (B) family history; (C) topoisomerase (TOPO)-2 expression; (D) tumor (T) stage; (E) node (N) stage; (F) tumor, node, and metastasis (TNM) stage; (G) progesterone receptor expression; (H) P53 mutation; and (I) Ki67 expression
Table 3.
Univariate and multivariate Cox analyses of the model features
| Characteristics | Number (N) | Univariate HR (95% CI) | Univariate P value | Multivariate HR (95% CI) | Multivariate P value |
|---|---|---|---|---|---|
| Family_history | 1404 | | | | |
| No | 1251 | Reference | | Reference | |
| Yes | 153 | 6.527 (5.364 - 7.942) | < 0.001 | 5.664 (4.590 - 7.040) | < 0.001 |
| T | 1404 | | | | |
| T1 | 633 | Reference | | Reference | |
| T2-T4 | 771 | 1.800 (1.543 - 2.099) | < 0.001 | 1.285 (1.093 - 1.511) | 0.005 |
| N | 1404 | | | | |
| N1-N3 | 427 | Reference | | Reference | |
| N0 | 977 | 0.184 (0.159 - 0.215) | < 0.001 | 0.210 (0.177 - 0.248) | < 0.001 |
| Stage | 1404 | | | | |
| Stage I | 536 | Reference | | Reference | |
| Stage II | 728 | 1.868 (1.572 - 2.219) | < 0.001 | 1.840 (1.560 - 2.229) | < 0.001 |
| Stage III | 140 | 3.519 (2.765 - 4.478) | < 0.001 | 2.302 (1.782 - 2.975) | < 0.001 |
| Armpit_surgery | 1404 | | | | |
| ALND | 801 | Reference | | Reference | |
| SLNB | 603 | 0.456 (0.386 - 0.538) | < 0.001 | 0.919 (0.766 - 1.103) | 0.364 |
| PR | 1404 | | | | |
| Negative | 565 | Reference | | Reference | |
| Positive | 839 | 0.389 (0.334 - 0.451) | < 0.001 | 0.493 (0.423 - 0.575) | < 0.001 |
| P53 mutation | 1404 | | | | |
| No | 1076 | Reference | | Reference | |
| Yes | 328 | 1.478 (1.258 - 1.758) | < 0.001 | 1.420 (1.191 - 1.692) | < 0.001 |
| Ki67 | 1404 | | | | |
| No less than 30% | 1254 | Reference | | Reference | |
| Less than 30% | 150 | 0.109 (0.062 - 0.193) | < 0.001 | 0.195 (0.109 - 0.349) | < 0.001 |
| TOPO2 | 1404 | | | | |
| No less than 20% | 587 | Reference | | Reference | |
| Less than 20% | 817 | 0.711 (0.613 - 0.825) | < 0.001 | 0.917 (0.786 - 1.070) | 0.271 |
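Several rows of Table 3 use the higher-risk category as the reference, so the hazard ratio (HR) is below 1. Taking the reciprocal of the HR and swapping the confidence bounds expresses the same effect with the lower-risk category as the reference. The sketch below applies this to the univariate N-stage row; the helper names are hypothetical, and recovering the standard error assumes a Wald interval that is symmetric on the log scale.

```python
import math


def invert_hazard_ratio(hr, lower, upper):
    """Re-express an HR with the reference category swapped."""
    return 1.0 / hr, 1.0 / upper, 1.0 / lower


def log_hr_and_se(hr, lower, upper, z=1.96):
    """Recover the log-HR and its standard error from a 95% CI
    (assumes a Wald interval symmetric on the log scale)."""
    beta = math.log(hr)
    se = (math.log(upper) - math.log(lower)) / (2.0 * z)
    return beta, se


# Univariate values for N0 versus N1-N3 as reported in Table 3.
hr_inv, lower_inv, upper_inv = invert_hazard_ratio(0.184, 0.159, 0.215)
beta, se = log_hr_and_se(0.184, 0.159, 0.215)
print(round(hr_inv, 2), round(lower_inv, 2), round(upper_inv, 2))
print(round(beta, 3), round(se, 3))
```

Read this way, node-positive patients (N1-N3) had roughly a 5.4-fold higher hazard than N0 patients in the univariate analysis, which is the same information as the reported HR of 0.184.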
Discussion
HER2, a tyrosine kinase receptor within the human epidermal growth factor receptor family, consists of four subdomains: I, II, III, and IV [22, 23]. Upon ligand binding, HER2 undergoes dimerization, forming homodimers or heterodimers, which subsequently activate downstream signaling pathways that drive cancer cell proliferation, migration, invasion, and survival [23]. Various HER2-targeting therapies, including trastuzumab, pertuzumab, and lapatinib, effectively inhibit the oncogenic activity of HER2 [24].
HLBC is characterized by an IHC score of 1+, or of 2+ with a negative FISH test. Prior research has examined its prognostic and therapeutic implications. The DS8201-A J101 study reported a median PFS of 11.1 months and a median disease control duration of 10.4 months in advanced HLBC patients treated with trastuzumab deruxtecan [25]. Additionally, the TROPiCS-02 study demonstrated improved survival in both HER2-low and HER2-nonexpressing patients treated with sacituzumab govitecan [26, 27]. These results highlight the increasing focus on the treatment and prognosis of HLBC. While several prognostic models exist for breast cancer, most do not account for the heterogeneity among subtypes, leading to varied outcomes. HER2-low BC, as a distinct emerging subtype, has a prognosis that differs from traditional triple-negative BC, though its oncogenic mechanisms and prognostic factors remain poorly understood [28]. Furthermore, few models specifically address HER2-low BC prognosis, underscoring the need to explore its prognostic characteristics and develop reliable predictive models. This study applied multiple machine learning approaches to construct predictive models and selected the best-performing one. Survival analysis was then conducted on model features to identify independent prognostic factors, providing a basis for personalized treatment strategies and improved prognostic predictions for HER2-low BC patients.
Lymph node positivity represents a significant risk factor for BC, closely linked to lymph node metastasis (LNM), which is influenced by the expression of non-coding RNAs. For example, miR-98 has been shown to drive tumor cell metastasis to sentinel lymph nodes and correlates with poor prognosis in ER-positive, HER2-negative BC patients [29, 30]. Immune cell infiltration also plays a critical role in LNM. Takada et al. demonstrated that the density of tumor-infiltrating lymphocytes was markedly reduced in patients with LNM compared to those without [31]. Additionally, perineural and lymphatic invasion contribute substantially to LNM, with studies indicating that both significantly elevate the risk of LNM [32]. Other research highlights the relevance of tumor size, body mass index, and the platelet-to-lymphocyte ratio in both LNM and BC prognosis [29, 33, 34]. Aligning with these observations, the present model confirmed that lymph node positivity had a pronounced impact on HLBC patient prognosis.
Tumor size (T stage) is a significant prognostic factor in BC patients. Kustic et al. demonstrated that larger tumors correlated with poorer prognosis, negatively impacting both DFS and OS in BC patients [35]. Similarly, Chu et al. found that patients with tumors ≥ 5 cm faced higher mortality risk and notably shorter survival times compared to those with tumors ≤ 1 cm. In alignment with these studies, our model indicated that T stage influenced the prognosis of HLBC patients, likely due to its correlation with cancer progression. Larger tumors are typically observed in advanced cancer stages, which are associated with increased rates of LNM and distant metastasis, both of which adversely affect patient outcomes [36].
The surgical approach significantly influences the prognosis of BC patients. Xue et al. demonstrated that patients who underwent surgery had a notably longer OS (34 months) compared to those who did not (23 months) [37]. ALND has been shown to enhance survival in node-positive BC patients [38]. However, the present findings indicate that ALND is associated with a less favorable prognosis, likely due to its frequent use in patients with LNM, a known adverse prognostic factor in BC.
ER and PR are key determinants of BC prognosis, with over 70% of BCs being HR-positive due to elevated levels of ER, PR, or both. As these tumors are estrogen-driven, they often respond to endocrine therapies that inhibit estrogen signaling. Various ER-targeting drugs have been developed, including tamoxifen, a selective estrogen receptor modulator commonly prescribed for premenopausal patients. Aromatase inhibitors such as letrozole, anastrozole, and exemestane are standard treatments for postmenopausal BC. Additionally, gonadotropin-releasing hormone agonists, including goserelin, leuprolide, and triptorelin, are widely used to suppress ovarian function in premenopausal women with BC. Results indicated that negative ER and PR expression were linked to poorer prognosis in HLBC patients [39]. Previous studies have also shown that high HR expression correlates with better outcomes in luminal A BC patients, aligning with the present findings [40, 41].
Several biomarkers, including P53, Ki67, and TOPO II, influence both the progression and prognosis of BC. As a key transcription factor, P53 responds to cancer-related stress and metabolic shifts, orchestrating multiple tumor suppressor functions such as metabolic regulation [42, 43]. Mutations in P53 result in the loss of its tumor-suppressing capabilities, alongside the acquisition of oncogenic properties. For instance, mutant P53 enhances the synthesis of serine and glycine, as well as the uptake of essential amino acids by BC cells, thus driving cancer proliferation [44, 45]. Ki67, a marker of cellular proliferation, serves as an indicator of clinical outcomes in early-stage luminal A and luminal B BC [46] and predicts responses to neoadjuvant chemotherapy [47]. TOPO enzymes (types I and II), present in the nucleus, modulate DNA topology by catalyzing DNA breakage and strand re-ligation. TOPO II, whose bacterial counterpart is known as gyrase, comprises two isoenzymes, α and β. TOPO IIα, a critical enzyme in DNA replication, serves as a target for anthracyclines and shows significant upregulation during the S-G2/M phase of the cell cycle, decreasing after mitosis (G1 and G0 phases). Positive TOPO IIα expression indicates an active proliferative state in tumor cells [48]. This study found that elevated TOPO II expression, increased Ki67 levels, and P53 mutation were strongly associated with poor prognosis in HLBC patients.
To date, limited research has been conducted on HLBC, with even fewer studies addressing prognosis prediction in HLBC patients. This study developed a highly accurate and sensitive prognostic model for HLBC, offering clinicians a valuable reference for selecting appropriate treatment strategies. However, several limitations must be acknowledged. First, the study cohort consisted of retrospective cases from a single center, which may restrict the model’s applicability and generalizability. Second, the model’s performance was evaluated solely using internal validation, highlighting the need for further validation with external cohorts. Additionally, the absence of chemotherapy and radiotherapy data could compromise prediction accuracy. Comprehensive investigations into this novel subtype are required in the coming years to explore its pathogenesis and potential mechanisms. Such research is essential for advancing the understanding of the disease and developing more targeted, personalized therapies. By prioritizing this subtype, future studies have the potential to significantly improve patient outcomes in breast cancer management.
Conclusions
This study developed the first ML model specifically designed to predict the prognosis of HLBC patients. In contrast to previous research, this model offers a novel approach by incorporating additional prognostic risk factors, thereby enabling clinicians to make more informed decisions regarding treatment strategies and ultimately improving patient outcomes.
Acknowledgements
Not applicable.
Abbreviations
- ML: Machine learning
- BC: Breast cancer
- HER2: Human epidermal growth factor receptor 2
- HLBC: Human epidermal growth factor receptor 2-low breast cancer
- ROC: Receiver operating characteristic
- AUC: Area under the ROC curve
- DFS: Disease-free survival
- TOPO-2: Topoisomerase 2
- RSF: Random survival forest
- DNA: Deoxyribonucleic acid
- RMH: Royal Marsden Hospital
- HRs: Hormone receptors
- ER: Estrogen receptor
- PR: Progesterone receptor
- ASCO: American Society of Clinical Oncology
- CAP: College of American Pathologists
- FISH: Fluorescence in situ hybridization
- T-DXd: Trastuzumab deruxtecan
- PFS: Progression-free survival
- OS: Overall survival
- IHC: Immunohistochemistry
- AJCC: American Joint Committee on Cancer
- LASSO: Least absolute shrinkage and selection operator
- SHAP: Shapley additive explanations
- ALND: Axillary lymph node dissection
- LNM: Lymph node metastasis
Author contributions
L.M. and YL.L. carried out study conception and design; YL.L. performed data analysis and prepared the draft; L.M. reviewed and revised the manuscript; YL.L. and L.M. participated in the study administration; and XL.Y. collected data. All the authors read and approved the final manuscript.
Funding
This research received funding from the Foundation of Hebei Province for Scientific Research of Selected Returned Overseas Professionals (No. CY201608), the Clinical Medical Talent Support Program of Hebei Provincial Department of Finance (No. 201746), the Biomedical Joint Foundation of Hebei Province (H2021206157), and the Innovation Team Support Program of the Fourth Hospital of Hebei Medical University (2023B01). The funding organizations had no involvement in the study’s design, data collection, analysis, interpretation, or manuscript preparation.
Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
Declarations
Ethics approval and consent to participate
This was an observational study approved by the Ethics Committee of the Fourth Hospital of Hebei Medical University. A clinical trial number is not applicable. Because the study did not involve identifiable or private patient information, the requirement for informed consent was waived by the Ethics Committee of the Fourth Hospital of Hebei Medical University.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Authors’ information
Not applicable.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Cao W, Qin K, Li F, Chen W. Comparative study of cancer profiles between 2020 and 2022 using global cancer statistics (GLOBOCAN). J Natl Cancer Cent. 2024;4(2):128–34.
- 2. Giaquinto AN, Sung H, Miller KD, Kramer JL, Newman LA, Minihan A, Jemal A, Siegel RL. Breast cancer statistics, 2022. CA Cancer J Clin. 2022;72(6):524–41.
- 3. Rizzo A, Cusmai A, Acquafredda S, Giovannelli F, Rinaldi L, Misino A, Palmiotti G. KEYNOTE-522, IMpassion031 and GeparNUEVO: changing the paradigm of neoadjuvant immune checkpoint inhibitors in early triple-negative breast cancer. Future Oncol (London England). 2022;18(18):2301–9.
- 4. Rizzo A, Schipilliti FM, Di Costanzo F, Acquafredda S, Arpino G, Puglisi F, Del Mastro L, Montemurro F, De Laurentiis M, Giuliano M. Discontinuation rate and serious adverse events of chemoimmunotherapy as neoadjuvant treatment for triple-negative breast cancer: a systematic review and meta-analysis. ESMO open. 2023;8(6):102198.
- 5. Guven DC, Erul E, Kaygusuz Y, Akagunduz B, Kilickap S, De Luca R, Rizzo A. Immune checkpoint inhibitor-related hearing loss: a systematic review and analysis of individual patient data. Supportive care cancer: Official J Multinational Association Supportive Care Cancer. 2023;31(12):624.
- 6. Rizzo A, Santoni M, Mollica V, Logullo F, Rosellini M, Marchetti A, Faloppi L, Battelli N, Massari F. Peripheral neuropathy and headache in cancer patients treated with immunotherapy and immuno-oncology combinations: the MOUSEION-02 study. Expert Opin Drug Metab Toxicol. 2021;17(12):1455–66.
- 7. Sahin TK, Rizzo A, Aksoy S, Guven DC. Prognostic significance of the Royal Marsden Hospital (RMH) score in patients with cancer: a systematic review and meta-analysis. Cancers. 2024;16(10).
- 8. Goldhirsch A, Wood WC, Coates AS, Gelber RD, Thürlimann B, Senn HJ. Strategies for subtypes–dealing with the diversity of breast cancer: highlights of the St. Gallen International Expert Consensus on the primary therapy of early breast Cancer 2011. Annals Oncology: Official J Eur Soc Med Oncol. 2011;22(8):1736–47.
- 9. Seshadri R, Firgaira FA, Horsfall DJ, McCaul K, Setlur V, Kitchen P. Clinical significance of HER-2/neu oncogene amplification in primary breast cancer. The South Australian breast Cancer Study Group. J Clin Oncology: Official J Am Soc Clin Oncol. 1993;11(10):1936–42.
- 10. Pondé N, Brandão M, El-Hachem G, Werbrouck E, Piccart M. Treatment of advanced HER2-positive breast cancer: 2018 and beyond. Cancer Treat Rev. 2018;67:10–20.
- 11. Giordano SH, Franzoi MAB, Temin S, Anders CK, Chandarlapaty S, Crews JR, Kirshner JJ, Krop IE, Lin NU, Morikawa A, et al. Systemic therapy for Advanced human epidermal growth factor receptor 2-Positive breast Cancer: ASCO Guideline Update. J Clin Oncology: Official J Am Soc Clin Oncol. 2022;40(23):2612–35.
- 12. Modi S, Jacot W, Yamashita T, Sohn J, Vidal M, Tokunaga E, Tsurutani J, Ueno NT, Prat A, Chae YS, et al. Trastuzumab Deruxtecan in previously treated HER2-Low advanced breast Cancer. N Engl J Med. 2022;387(1):9–20.
- 13. Lee J, Park YH. Trastuzumab deruxtecan for HER2 + advanced breast cancer. Future Oncol (London England). 2022;18(1):7–19.
- 14. Salazar RM, Duryea JD, Leone AO, Nair SS, Mumme RP, De B, Corrigan KL, Rooney MK, Das P, Holliday EB et al. Random forest modeling of acute toxicity in anal cancer: effects of peritoneal cavity contouring approaches on model performance. Int J Radiation Oncol Biol Phys. 2023.
- 15. Edition S, Edge S, Byrd D. AJCC cancer staging manual. AJCC cancer staging manual; 2017.
- 16. Köbel M, Kang EY. The many uses of p53 immunohistochemistry in gynecological pathology: proceedings of the ISGyP companion society session at the 2020 USCAP annual meeting. Int J Gynecol Pathol Off J Int Soc Gynecol Pathol. 2021;40(1):32–40.
- 17. de Roos MA, de Bock GH, de Vries J, van der Vegt B, Wesseling J. p53 overexpression is a predictor of local recurrence after treatment for both in situ and invasive ductal carcinoma of the breast. J Surg Res. 2007;140(1):109–14.
- 18. Linjawi A, Kontogiannea M, Halwani F, Edwardes M, Meterissian S. Prognostic significance of p53, bcl-2, and Bax expression in early breast cancer. J Am Coll Surg. 2004;198(1):83–90.
- 19. Bellizzi AM. p53 as Exemplar Next-Generation immunohistochemical marker: a molecularly informed, pattern-based Approach, Methodological considerations, and Pan-cancer Diagnostic Applications. Appl Immunohistochem Mol Morphology: AIMM. 2023;31(7):507–30.
- 20. Armbruster H, Schotte T, Götting I, Overkamp M, Granai M, Volmer LL, Bahlinger V, Matovina S, Koch A, Dannehl D, et al. Aberrant p53 immunostaining patterns in breast carcinoma of no special type strongly correlate with presence and type of TP53 mutations. Virchows Archiv: Int J Pathol. 2024;485(4):631–42.
- 21. Tsang JY, Tse GM. Breast cancer with neuroendocrine differentiation: an update based on the latest WHO classification. Mod Pathology: Official J United States Can Acad Pathol Inc. 2021;34(6):1062–73.
- 22. Loibl S, Gianni L. HER2-positive breast cancer. Lancet (London England). 2017;389(10087):2415–29.
- 23. Krishnamurti U, Silverman JF. HER2 in breast cancer: a review and update. Adv Anat Pathol. 2014;21(2):100–7.
- 24. Oh DY, Bang YJ. HER2-targeted therapies - a role beyond breast cancer. Nat Reviews Clin Oncol. 2020;17(1):33–48.
- 25. Jerusalem G, Park YH, Yamashita T, Hurvitz SA, Modi S, Andre F, Krop IE, Gonzàlez Farré X, You B, Saura C, et al. Trastuzumab Deruxtecan in HER2-Positive metastatic breast Cancer patients with brain metastases: a DESTINY-Breast01 subgroup analysis. Cancer Discov. 2022;12(12):2754–62.
- 26. Rugo HS, Bardia A, Tolaney SM, Arteaga C, Cortes J, Sohn J, Marmé F, Hong Q, Delaney RJ, Hafeez A, et al. TROPiCS-02: a phase III study investigating sacituzumab govitecan in the treatment of HR+/HER2- metastatic breast cancer. Future Oncol (London England). 2020;16(12):705–15.
- 27. Rugo HS, Bardia A, Marmé F, Cortés J, Schmid P, Loirat D, Trédan O, Ciruelos E, Dalenc F, Gómez Pardo P, et al. Overall survival with sacituzumab govitecan in hormone receptor-positive and human epidermal growth factor receptor 2-negative metastatic breast cancer (TROPiCS-02): a randomised, open-label, multicentre, phase 3 trial. Lancet (London England). 2023;402(10411):1423–33.
- 28. Abubakar M, Guo C, Koka H, Sung H, Shao N, Guida J, Deng J, Li M, Hu N, Zhou B et al. Clinicopathological and epidemiological significance of breast cancer subtype reclassification based on p53 immunohistochemical expression. NPJ Breast Cancer. 2019;5(1).
- 29. Okuno J, Miyake T, Sota Y, Tanei T, Kagara N, Naoi Y, Shimoda M, Shimazu K, Kim SJ, Noguchi S. Development of Prediction Model Including MicroRNA expression for Sentinel Lymph Node Metastasis in ER-Positive and HER2-Negative breast Cancer. Ann Surg Oncol. 2021;28(1):310–9.
- 30. Fujita Y, Yoshioka Y, Ochiya T. Extracellular vesicle transfer of cancer pathogenic components. Cancer Sci. 2016;107(4):385–90.
- 31. Takada K, Kashiwagi S, Asano Y, Goto W, Kouhashi R, Yabumoto A, Morisaki T, Shibutani M, Takashima T, Fujita H, et al. Prediction of lymph node metastasis by tumor-infiltrating lymphocytes in T1 breast cancer. BMC Cancer. 2020;20(1):598.
- 32. Zhang Y, Wang H, Zhao H, He X, Wang Y, Wang H. Prognostic significance and value of further classification of lymphovascular invasion in invasive breast cancer: a retrospective observational study. Breast Cancer Res Treat. 2024;206(2):397–410.
- 33. Ishizuka Y, Horimoto Y, Nakamura M, Arakawa A, Fujita T, Iijima K, Saito M. Predictive factors for non-sentinel nodal metastasis in patients with Sentinel Lymph Node-positive breast Cancer. Anticancer Res. 2020;40(8):4405–12.
- 34. Wang J, Cai Y, Yu F, Ping Z, Liu L. Body mass index increases the lymph node metastasis risk of breast cancer: a dose-response meta-analysis with 52904 subjects from 20 cohort studies. BMC Cancer. 2020;20(1):601.
- 35. Kustic D, Lovasic F, Belac-Lovasic I, Avirovic M, Ruzic A, Petretic-Majnaric S. Impact of HER2 receptor status on axillary nodal burden in patients with non-luminal A invasive ductal breast carcinoma. Rev Med Chil. 2019;147(5):557–67.
- 36. Zuo WJ, He M, Zheng H, Liu Y, Liu XY, Jiang YZ, Wang ZH, Lu RQ, Shao ZM. Serum HER2 levels predict treatment efficacy and prognosis in patients with HER2-positive breast cancer undergoing neoadjuvant treatment. Gland Surg. 2021;10(4):1300–14.
- 37. Xue F, Yu L, Lin Y, Wang Z, Li S, Shao N, Ye F, Gu C, Li X. Surgery in initially metastatic breast cancer: prognosis is associated with patient characteristics and timing of surgery. J BUON: Official J Balkan Union Oncol. 2019;24(2):543–8.
- 38. Park TS, Thomas SM, Rosenberger LH, Fayanju OM, Plichta JK, Blitzblau RC, Ong CT, Hyslop T, Hwang ES, Greenup RA. The Association of Extent of Axillary Surgery and survival in women with N2-3 invasive breast Cancer. Ann Surg Oncol. 2018;25(10):3019–29.
- 39. Kay C, Martinez-Perez C, Dixon JM, Turnbull AK. The role of nodes and nodal assessment in diagnosis, treatment and prediction in ER+, node-positive breast cancer. J Personalized Med. 2023;13(10).
- 40. Stenmark Tullberg A, Lundstedt D, Olofsson Bagge R, Karlsson P. Positive sentinel node in luminal A-like breast cancer patients - implications for adjuvant chemotherapy? Acta Oncol (Stockholm Sweden). 2019;58(2):162–7.
- 41. Dunnwald LK, Rossing MA, Li CI. Hormone receptor status, tumor characteristics, and prognosis: a prospective cohort of breast cancer patients. Breast cancer Research: BCR. 2007;9(1):R6.
- 42. Lacroix M, Riscal R, Arena G, Linares LK, Le Cam L. Metabolic functions of the tumor suppressor p53: implications in normal physiology, metabolic disorders, and cancer. Mol Metabolism. 2020;33:2–22.
- 43. Hafner A, Bulyk ML, Jambhekar A, Lahav G. The multiple mechanisms that regulate p53 activity and cell fate. Nat Rev Mol Cell Biol. 2019;20(4):199–210.
- 44. Walerych D, Lisek K, Del Sal G. Mutant p53: one, no one, and one hundred Thousand. Front Oncol. 2015;5:289.
- 45. Oren M, Rotter V. Mutant p53 gain-of-function in cancer. Cold Spring Harb Perspect Biol. 2010;2(2):a001107.
- 46. Colozza M, Azambuja E, Cardoso F, Sotiriou C, Larsimont D, Piccart MJ. Proliferative markers as prognostic and predictive tools in early breast cancer: where are we now? Annals Oncology: Official J Eur Soc Med Oncol. 2005;16(11):1723–39.
- 47. Goldhirsch A, Winer EP, Coates AS, Gelber RD, Piccart-Gebhart M, Thürlimann B, Senn HJ. Personalizing the treatment of women with early breast cancer: highlights of the St Gallen International Expert Consensus on the primary therapy of early breast Cancer 2013. Annals Oncology: Official J Eur Soc Med Oncol. 2013;24(9):2206–23.
- 48. Jin J, Zheng D, Liu Y. Correlation between the expression of Topo IIα and Ki67 in breast cancer and its clinical pathological characteristics. Pakistan J Med Sci. 2017;33(4):844–8.