Abstract
Background:
Breast cancer (BC) risk-stratification tools for Asian women that are both highly accurate and readily interpretable are lacking. We aimed to develop risk-stratification models to predict long- and short-term BC risk among Chinese women and to simultaneously rank potential non-experimental risk factors.
Methods:
The Breast Cancer Cohort Study in Chinese Women, a large ongoing prospective dynamic cohort study, includes 122,058 women aged 25–70 years from the eastern part of China. Using parametric machine-learning methods (penalized logistic regression, bootstrap, and ensemble learning), we developed two risk prediction models to estimate BC risk: the short-term ensemble penalized logistic regression (EPLR) model and the ensemble penalized long-term (EPLT) model. The models were assessed for calibration and discrimination and were then externally validated in new study participants enrolled from 2017 to 2020.
Results:
The area under the receiver operating characteristic curve (AUC) of the short-term EPLR risk prediction model was 0.800 in the internal validation set and 0.751 in the external validation set. For the long-term EPLT risk prediction model, the AUC was 0.692 and 0.760 in internal and external validation, respectively. The net reclassification improvement index of the EPLT relative to the Gail and the Han Chinese Breast Cancer Prediction Model (HCBCP) models for external validation was 0.193 and 0.233, respectively, indicating that the EPLT model has higher classification accuracy.
Conclusions:
We developed the EPLR and EPLT models to screen populations with a high risk of developing BC. These can serve as useful tools to aid in risk-stratified screening and BC prevention.
Keywords: Breast cancer, Cancer prevention, Models, Women, Risk assessment
Introduction
According to GLOBOCAN 2020, breast cancer (BC) remains the leading cancer among women.[1–2] In 2020, BC surpassed lung cancer as the most commonly diagnosed form of cancer, with an estimated 2.26 million new cases. There were an estimated 416,000 new cases of BC in China during 2020, making it the most common malignant tumor among women in China, and the number of cases is increasing over time.[3]
Developing a risk assessment model is of great importance for the early detection of groups with a high risk of disease. Since the first BC risk prediction model was reported by Gail et al[4] in 1989, more than 30 BC risk prediction models have been established,[5–8] including the Claus model, BRCAPRO model, Tyrer–Cuzick model, and Myriad model. Most models have values for the area under the receiver operating characteristic curve (AUC) ranging from 0.55 to 0.65, indicating low accuracy in predicting individual BC incidence. To improve prediction efficiency, breast density,[9] hormone levels, genomic and multi-omics data,[10] biomarkers, and other factors are gradually being included in prediction models. If single nucleotide polymorphisms (SNPs) are included, the accuracy of the model can be improved to a certain extent.[11,12] van Veen et al[12] added breast density to the Gail and Tyrer–Cuzick models, and the AUCs rose to more than 0.70 in both models.
Despite these positive results, the above-mentioned models have limitations. For instance, the included risk factors involve information requiring invasive detection, such as breast biopsy and genetic testing, which is not suitable for widespread application in the context of China's large population base, economic underdevelopment, and unbalanced distribution of medical resources. Similar studies among Asians have begun recently,[13,14] but there are few widely used or well-recognized prediction models that are particularly suitable for Chinese women. Therefore, it is critical to establish an efficient risk assessment model that is suitable for the Chinese population and meets health economic requirements.
Previously, we established a community model for high-risk groups and conducted preliminary verification and improvement, with a moderate AUC of 0.735.[15] In recent years, machine learning methods have been widely used in predicting both incidence and prognosis.[16–18] Traditional machine learning algorithms[19,20] often assume that the numbers of cases and controls are roughly the same and aim to minimize the global classification error, which introduces bias in an unbalanced setting. To overcome this challenge, we adopted a bootstrap strategy that first produces balanced data and then constructs base classifiers, which simultaneously perform variable selection. We also borrowed the idea of bagging, an ensemble learning strategy, to further improve the discrimination accuracy and to rank the importance of risk factors based on their number of occurrences in the weak classifiers. Specifically, we adopted the penalized logistic regression (PLR) model with elastic net penalty[21] as the base weak classifier. Penalized regression methods such as the Least Absolute Shrinkage and Selection Operator (LASSO) have been widely applied in various settings for their advantages in selecting important factors.[22–24] Additionally, ensemble learning[25] has aroused great interest among researchers. The aggregation of classifiers has been shown to improve classification accuracy in a wide range of applications and can greatly reduce classification error. Our goals with this project were to incorporate advanced machine learning into existing BC risk assessment models, to improve screening performance in both the short and long term, to maintain detection quality in the context of an unbalanced distribution of medical resources, and to set a standard for risk assessment models with widespread applicability in China.
Methods
Study population
The Breast Cancer Cohort Study in Chinese Women (BCCS-CW) is a large, ongoing prospective dynamic survey designed to investigate BC incidence among Han Chinese women. The study's target population is women aged 25–70 years in the eastern part of China (Shandong, Jiangsu, Hebei, and Tianjin). All participants provided their written informed consent; there was no financial compensation for study participation. Additional information is available in the literature.[26] The study was approved by the Ethics Committees of the Second Hospital of Shandong University (Jinan, China: 07090122) and National Center for Chronic and Non-communicable Disease Control (Jinan, China: 201610). The study followed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting guideline [Supplementary Table 1, http://links.lww.com/CM9/B768].
Outcomes and cohorts
A total of 122,058 women were recruited in the BCCS-CW cohort during 2008–2009. Participants completed face-to-face interviews, and physiological measurements and biosamples were obtained at baseline. At the last follow-up conducted from 2017 to 2020, 112,218 women were identified, including 31,724 women newly enrolled in this cohort. Incident cases of BC and mortality were identified chiefly through linkage with the national health insurance claims database and disease registries, supplemented with local residential records and annual active confirmation at baseline and follow-up. Follow-up time, counted from the beginning of the investigation, was rounded to 10 years for reporting.
A questionnaire comprising only non-experimental factors was designed to collect BC-related information in the target population. A total of 72 variables in six aspects were included in the questionnaire, including demographic characteristics, female physiological and reproductive factors, medical and family history, dietary habits, lifestyle habits, and BC-related knowledge.
Statistical analysis
Data pre-processing
All analyses were performed in R (version 4.1.2; R Foundation for Statistical Computing, Vienna, Austria). The "glmnet" package was used to construct the ensemble penalized logistic regression (EPLR) model, and the "pROC" package was used to plot the receiver operating characteristic (ROC) curve. P <0.05 was considered statistically significant. For duplicate entries in the original BCCS-CW database, only one was included in our analysis. We further excluded individuals with over 50% missing predictors, leaving 120,128 individuals for model development. Missing values for continuous and categorical predictors were imputed using the column mean and mode, respectively. We handled duplicate samples and missing values in the new population data using the same criteria, and 30,060 of 31,724 individuals were selected for analysis. Additionally, to reduce overfitting, the continuous predictors used to develop the ensemble penalized long-term (EPLT) model were converted to categorical predictors. Body mass index (BMI) and weight-to-height ratio (WHI) were categorized according to well-established criteria for the Chinese population. Other predictors were categorized as described in Supplementary Table 2, http://links.lww.com/CM9/B768.
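The exclusion and imputation rules described above can be illustrated with a minimal Python sketch. The study itself used R; the function name `preprocess`, the toy records, and the variable names are hypothetical and serve only to show the order of operations (exclude individuals with more than 50% missing predictors, then impute means and modes from the retained individuals).

```python
from collections import Counter
from statistics import mean

def preprocess(rows, continuous, categorical, max_missing=0.5):
    """Drop rows with more than `max_missing` missing predictors, then impute."""
    predictors = continuous + categorical
    kept = [
        dict(r) for r in rows
        if sum(r[p] is None for p in predictors) / len(predictors) <= max_missing
    ]
    for p in continuous:  # column mean for continuous predictors
        fill = mean(r[p] for r in kept if r[p] is not None)
        for r in kept:
            if r[p] is None:
                r[p] = fill
    for p in categorical:  # column mode for categorical predictors
        fill = Counter(r[p] for r in kept if r[p] is not None).most_common(1)[0][0]
        for r in kept:
            if r[p] is None:
                r[p] = fill
    return kept

# Hypothetical toy records; None marks a missing value.
rows = [
    {"bmi": 22.0, "sleep_hours": 8.0, "menopause": "no", "tea": "no"},
    {"bmi": None, "sleep_hours": 7.0, "menopause": "no", "tea": "no"},
    {"bmi": 26.0, "sleep_hours": 6.0, "menopause": None, "tea": "yes"},
    {"bmi": None, "sleep_hours": None, "menopause": None, "tea": "no"},  # 75% missing
]
clean = preprocess(rows, ["bmi", "sleep_hours"], ["menopause", "tea"])
print(len(clean), clean[1]["bmi"], clean[2]["menopause"])
```

In this sketch the imputation values are computed after exclusion, which is one reasonable reading of the text; the original analysis may have ordered these steps differently.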
Model development
Our statistical methods draw on regularization, logistic regression, bootstrapping, and ensemble learning. As described in Supplementary Figure 1, http://links.lww.com/CM9/B768, the EPLR model[27,28] was built by aggregating a sequence of PLR models according to the bagging-based integrated framework.[29] We set the number of bootstrap samples t to 200 based on our experience. In PLR, the important risk factors were entered into the logistic model automatically, i.e., the model was data-driven, thereby avoiding the arbitrariness of selecting risk factors subjectively. PLR belongs to the family of regularization methods, which have shown great success in machine learning. Variable selection was repeated 200 times, once for each base predictor. Combining the results of the 200 variable selections, we counted how often each risk factor was selected across the 200 base predictors, which can be used as a measure of its association with BC incidence. By sorting the factors according to frequency from high to low, we obtained a factor importance ranking table. To further enhance the stability and accuracy of prediction, we integrated the approach of the EPLR model and the Gail model based on the age-specific hazard approach to develop the EPLT model, which can be used to predict the long-term risk of developing BC. The concrete model development process and algorithmic formula are provided in Supplementary Methods, http://links.lww.com/CM9/B768.
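The bagging scheme described above can be sketched in plain Python as follows. This is an illustrative approximation and not the authors' implementation, which used "glmnet" in R and is specified in the Supplementary Methods. The proximal-gradient solver, the penalty settings (`lam`, `alpha`), the way balanced samples are drawn (equal numbers of cases and controls, with replacement), and the simulated data are all assumptions made for the example.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_plr(X, y, lam=0.2, alpha=0.5, lr=0.5, iters=200):
    """Elastic-net penalized logistic regression via proximal gradient descent."""
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(iters):
        resid = [sigmoid(b0 + sum(b * v for b, v in zip(beta, xi))) - yi
                 for xi, yi in zip(X, y)]
        b0 -= lr * sum(resid) / n                         # intercept is not penalized
        for j in range(p):
            grad = sum(r * xi[j] for r, xi in zip(resid, X)) / n
            grad += lam * (1.0 - alpha) * beta[j]         # ridge part of the penalty
            z = beta[j] - lr * grad
            shrunk = max(abs(z) - lr * lam * alpha, 0.0)  # lasso part (soft threshold)
            beta[j] = math.copysign(shrunk, z)
    return b0, beta

def balanced_bootstrap(X, y, rng):
    """Assumed scheme: resample as many controls as there are cases."""
    cases = [i for i, v in enumerate(y) if v == 1]
    controls = [i for i, v in enumerate(y) if v == 0]
    idx = rng.choices(cases, k=len(cases)) + rng.choices(controls, k=len(cases))
    return [X[i] for i in idx], [y[i] for i in idx]

def fit_ensemble(X, y, t=200, seed=0, **kw):
    rng = random.Random(seed)
    models = [fit_plr(*balanced_bootstrap(X, y, rng), **kw) for _ in range(t)]
    # Selection frequency: in how many base models is each coefficient nonzero?
    freq = [sum(1 for _, beta in models if beta[j] != 0.0) for j in range(len(X[0]))]
    return models, freq

def predict(models, x):
    """Average the predicted probabilities of the base models."""
    probs = [sigmoid(b0 + sum(b * v for b, v in zip(beta, x))) for b0, beta in models]
    return sum(probs) / len(probs)

# Simulated unbalanced data: 30 cases, 300 controls; only feature 0 is informative.
rng = random.Random(42)
y = [1] * 30 + [0] * 300
X = [[rng.gauss(2.0 * yi, 1.0), rng.gauss(0, 1), rng.gauss(0, 1)] for yi in y]
models, freq = fit_ensemble(X, y, t=25)  # the paper uses t = 200; 25 keeps the toy run fast
ranking = sorted(range(3), key=lambda j: -freq[j])
print(freq, ranking)
```

Sorting the factors by `freq` gives the kind of importance ranking described in the text, and `predict` reflects aggregation by averaging.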
Sample splitting to internal and external datasets
The BCCS-CW database was split into internal and external datasets for model establishment and validation, as shown in Figure 1. The Shandong province dataset was treated as the internal dataset to develop our models, with 70% of the data randomly selected to train the EPLR model and the remaining 30% left for validation. To reduce the influence of any single random split, we repeated the above process 100 times, as shown in the left part of the flowchart in Supplementary Figure 2, http://links.lww.com/CM9/B768. The dataset from the other three provinces was used as the external validation set to test generalization performance. We also used data from participants newly added to the BCCS-CW at the follow-up conducted from 2017 to 2020 to further validate the performance of the EPLT model.
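The repeated 70/30 split of the internal dataset can be sketched as below. The helper name `repeated_splits`, the seed, and the sample size of 1,000 are illustrative assumptions; the sketch uses simple random splitting because the text does not state whether the split was stratified by case status.

```python
import random

def repeated_splits(n, train_frac=0.7, repeats=100, seed=2023):
    """Yield `repeats` random (train, validation) index splits of range(n)."""
    rng = random.Random(seed)
    cut = round(train_frac * n)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        yield idx[:cut], idx[cut:]

splits = list(repeated_splits(n=1000))
train_idx, valid_idx = splits[0]
print(len(splits), len(train_idx), len(valid_idx))
```

Each split would be used to train one model and compute one validation AUC, and the 100 values would then be averaged as reported in the Results.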
Figure 1.
Flow chart of the use of the BCCS-CW database. The BCCS-CW database was divided into separate datasets by province to train, validate, and test the EPLR and EPLT models. BCCS-CW: Breast Cancer Cohort Study in Chinese Women; EPLR: Ensemble penalized logistic regression; EPLT: Ensemble penalized long-term.
Model assessment
To fully evaluate how well the model predicts the outcome, we evaluated the discrimination and calibration of the model.[30] We used the ROC curve,[31,32] AUC, and net reclassification index (NRI) to evaluate the discrimination of the model. The NRI is the sum of the gains in sensitivity and specificity for a given risk threshold. Calibration refers to the goodness of fit between the predicted probability of the model and the true probability. To assess the calibration ability of the EPLT model, we used calibration plots and the E/O ratio (defined as the expected number of cases divided by the observed number) in this study.[33]
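The three summary measures named above can be written out in a few lines of Python. This is a sketch with made-up scores rather than study data; the NRI follows the definition given in the text (gain in sensitivity plus gain in specificity), and applying one shared threshold to both models is an assumption of the example.

```python
def auc(scores, labels):
    """Mann-Whitney estimate of P(case score > control score); ties count 1/2."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def sens_spec(scores, labels, threshold):
    tp = sum(1 for s, l in zip(scores, labels) if l == 1 and s >= threshold)
    tn = sum(1 for s, l in zip(scores, labels) if l == 0 and s < threshold)
    n_pos = sum(labels)
    return tp / n_pos, tn / (len(labels) - n_pos)

def nri(new, old, labels, threshold):
    """Gain in sensitivity plus gain in specificity of `new` over `old`."""
    se_new, sp_new = sens_spec(new, labels, threshold)
    se_old, sp_old = sens_spec(old, labels, threshold)
    return (se_new - se_old) + (sp_new - sp_old)

def eo_ratio(predicted_risks, labels):
    """Expected number of cases (sum of predicted risks) over observed cases."""
    return sum(predicted_risks) / sum(labels)

# Hypothetical predicted risks from a new and an old model for 8 women.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
new = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.1]
old = [0.6, 0.4, 0.3, 0.5, 0.2, 0.7, 0.1, 0.4]
print(auc(new, labels), auc(old, labels), nri(new, old, labels, 0.5), eo_ratio(new, labels))
```

An E/O ratio above 1 means the model predicts more cases than were observed (overestimation), and a ratio below 1 means underestimation.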
Results
Baseline characteristics
Table 1 provides baseline characteristics of the 20 most important risk factors determined according to the EPLR model in the BCCS-CW dataset. Morbidity was lower in the Shandong subset of the population (0.156%; n = 94) compared with populations from the other three provinces (0.201%; n = 120). Overall, the BC prevalence in the BCCS-CW dataset was 41.9 per 100,000, which is basically consistent with the data in the cancer registration system.[34–36] Both datasets comprised 72 non-experimental factors. Supplementary Table 2, http://links.lww.com/CM9/B768, shows the characteristics of the datasets with all 72 variables.
Table 1.
Population characteristics of BCCS-CW.
Characteristics | Development set (n = 60,396) | | | | Validation set (n = 59,732) | | | |
---|---|---|---|---|---|---|---|---|
 | Case (n = 94) | Control (n = 60,302) | χ²/t | P-value | Case (n = 120) | Control (n = 59,612) | χ²/t | P-value |
Quality of life satisfaction degree | | | | | | | | |
<25 | 42 (44.68) | 41,307 (68.50) | 4.434* | 0.218 | 46 (38.33) | 24,438 (41.00) | 0.300* | 0.764 |
≥25 | 52 (55.32) | 18,995 (31.50) | | | 74 (61.67) | 35,174 (59.00) | | |
Menopause | | | | | | | | |
Yes | 61 (64.89) | 17,771 (29.47) | 56.599* | <0.001 | 78 (65.00) | 17,934 (30.08) | 69.319* | <0.001 |
No | 33 (35.11) | 42,531 (70.53) | | | 42 (35.00) | 41,678 (69.92) | | |
Family history of BC | | | | | | | | |
No | 90 (95.74) | 59,728 (99.05) | 10.805* | 0.001 | 110 (91.67) | 58,930 (98.86) | 54.055* | <0.001 |
Yes | 4 (4.26) | 574 (0.95) | | | 10 (8.33) | 682 (1.14) | | |
Breast hyperplasia | | | | | | | | |
No | 92 (97.87) | 56,087 (93.01) | 3.416* | 0.065 | 105 (87.50) | 56,473 (94.73) | 12.532* | <0.001 |
Yes | 2 (2.13) | 4215 (6.99) | | | 15 (12.50) | 3139 (5.27) | | |
Behavior prevention score | 1.72 ± 1.73 | 0.99 ± 1.36 | 18.319† | <0.001 | 3.05 ± 1.88 | 1.88 ± 1.89 | 32.314† | <0.001 |
Occupation | | | | | | | | |
Farmer | 65 (69.15) | 43,093 (71.46) | 14.843* | 0.190 | 70 (58.33) | 23,833 (39.98) | 34.931* | <0.001 |
Worker | 7 (7.45) | 8830 (14.64) | | | 12 (10.00) | 12,207 (20.48) | | |
Teacher | 1 (1.06) | 335 (0.56) | | | 8 (6.67) | 8664 (14.53) | | |
Civil service | 1 (1.06) | 163 (0.27) | | | 3 (2.50) | 2647 (4.44) | | |
Individual traders | 3 (3.19) | 985 (1.63) | | | 3 (2.50) | 1724 (2.89) | | |
Driver | 0 (0) | 32 (0.05) | | | 1 (0.83) | 413 (0.69) | | |
Services | 0 (0) | 469 (0.78) | | | 2 (1.67) | 1178 (1.98) | | |
Staff | 2 (2.13) | 851 (1.41) | | | 1 (0.83) | 1355 (2.27) | | |
Housewife | 11 (11.70) | 4516 (7.49) | | | 12 (10.00) | 5116 (8.58) | | |
Health care | 3 (3.19) | 856 (1.42) | | | 2 (1.67) | 1376 (2.31) | | |
Student | 0 (0) | 2 (0) | | | 0 (0) | 223 (0.37) | | |
Others | 1 (1.06) | 170 (0.28) | | | 6 (5.00) | 876 (1.47) | | |
Breast feeding duration (months) | 26.11 ± 17.77 | 28.06 ± 24.95 | 0.298† | 0.766 | 27.98 ± 22.65 | 20.96 ± 18.78 | –3.006† | 0.003 |
Fresh beans | | | | | | | | |
Almost everyday | 5 (5.32) | 5524 (9.16) | 8.464* | 0.037 | 10 (8.33) | 5512 (9.25) | 5.855* | 0.119 |
3–4 days/week | 30 (31.91) | 13,630 (22.60) | | | 29 (24.17) | 20,191 (33.87) | | |
1–2 days/week | 33 (35.11) | 18,253 (30.27) | | | 53 (44.17) | 22,527 (37.79) | | |
Almost never | 26 (27.66) | 22,895 (37.97) | | | 28 (23.33) | 11,382 (19.09) | | |
Age at first pregnancy at term (years) | 22.87 ± 5.53 | 23.92 ± 4.59 | 2.207† | 0.027 | 24.47 ± 4.28 | 23.55 ± 4.98 | –2.012† | 0.044 |
Awareness of BC score | 5.96 ± 3.63 | 4.50 ± 3.59 | 9.386† | 0.002 | 6.45 ± 4.72 | 5.34 ± 4.59 | 5.038† | 0.025 |
Ham | | | | | | | | |
Almost everyday | 1 (1.06) | 430 (0.71) | 0.868* | 0.833 | 7 (5.83) | 2806 (4.71) | 16.165* | 0.001 |
3–4 days/week | 4 (4.26) | 3781 (6.27) | | | 15 (12.50) | 10,096 (16.94) | | |
1–2 days/week | 22 (23.40) | 14,550 (24.13) | | | 26 (21.67) | 21,094 (35.39) | | |
Almost never | 67 (71.28) | 41,541 (68.89) | | | 72 (60.00) | 25,616 (42.97) | | |
Garlic | | | | | | | | |
Almost everyday | 30 (31.91) | 16,150 (26.78) | 1.93* | 0.587 | 25 (20.83) | 12,860 (21.57) | 8.479* | 0.037 |
3–4 days/week | 27 (28.72) | 19,218 (31.87) | | | 32 (26.67) | 18,771 (31.49) | | |
1–2 days/week | 27 (28.72) | 16,628 (27.57) | | | 39 (32.50) | 21,070 (35.35) | | |
Almost never | 10 (10.64) | 8306 (13.77) | | | 24 (20.00) | 6911 (11.59) | | |
Sleep duration (hours) | 7.91 ± 1.43 | 7.92 ± 1.25 | 0.078† | 0.938 | 7.95 ± 1.33 | 7.91 ± 1.28 | 6.160† | 0.104 |
Tea | | | | | | | | |
No | 78 (82.98) | 46,320 (76.81) | 2.005* | 0.157 | 109 (90.83) | 51,577 (86.52) | 1.911* | 0.167 |
Yes | 16 (17.02) | 13,982 (23.19) | | | 11 (9.17) | 8035 (13.48) | | |
Hypertension | | | | | | | | |
No | 75 (79.79) | 56,225 (93.24) | 26.863* | <0.001 | 106 (88.33) | 56,332 (94.50) | 8.734* | 0.003 |
Yes | 19 (20.21) | 4077 (6.76) | | | 14 (11.67) | 3280 (5.50) | | |
Sleep disorders | | | | | | | | |
No | 81 (86.17) | 54,695 (90.70) | 4.434* | 0.218 | 104 (86.67) | 55,085 (92.41) | 6.160* | 0.104 |
Yes | 13 (13.83) | 5607 (9.30) | | | 16 (13.33) | 4527 (7.59) | | |
Waistline (Chi) | 2.57 ± 0.29 | 2.39 ± 0.25 | –5.861† | <0.001 | 2.44 ± 0.29 | 2.31 ± 0.22 | –4.622† | <0.001 |
Fried foods | | | | | | | | |
Almost everyday | 1 (1.06) | 867 (1.44) | 1.325* | 0.723 | 9 (7.50) | 2933 (4.92) | 4.602* | 0.203 |
3–4 days/week | 6 (6.38) | 4733 (7.85) | | | 17 (14.17) | 10,966 (18.40) | | |
1–2 days/week | 43 (45.74) | 24,236 (40.19) | | | 46 (38.33) | 25,462 (42.71) | | |
Almost never | 44 (46.81) | 30,466 (50.52) | | | 48 (40.00) | 20,251 (33.97) | | |
Passive smoke | | | | | | | | |
No | 35 (37.23) | 21,673 (35.94) | 0.068* | 0.794 | 62 (51.67) | 32,791 (55.01) | 0.540* | 0.462 |
Yes | 59 (62.77) | 38,629 (64.06) | | | 58 (48.33) | 26,821 (44.99) | | |
Carrot | | | | | | | | |
Almost everyday | 0 (0) | 994 (1.65) | 1.966* | 0.742 | 10 (8.33) | 3701 (6.21) | 6.741* | 0.081 |
3–4 days/week | 13 (13.83) | 8267 (13.71) | | | 28 (23.33) | 14,331 (24.04) | | |
1–2 days/week | 37 (39.36) | 21,584 (35.79) | | | 46 (38.33) | 28,522 (47.85) | | |
Almost never | 44 (46.81) | 29,457 (48.85) | | | 36 (30.00) | 13,058 (21.90) | | |
The characteristics of the study sample are summarized as mean ± SD for continuous variables and number (percentage) for categorical variables. This table only provides information on the 20 most important risk factors determined according to the EPLR model in the BCCS-CW dataset. See Supplementary Table 2, http://links.lww.com/CM9/B768 for complete details. *: χ² values; †: t values. BCCS-CW: Breast Cancer Cohort Study in Chinese Women; EPLR: Ensemble penalized logistic regression; SD: Standard deviation.
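As a worked example of how the χ² values for binary factors in Table 1 can be obtained, the sketch below computes the Pearson χ² statistic for a 2 × 2 table from the menopause counts in the development set. It assumes no continuity correction; the function name is illustrative.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Menopause, development set: yes = 61 cases / 17,771 controls, no = 33 cases / 42,531 controls.
chi2 = chi_square_2x2(61, 17771, 33, 42531)
print(round(chi2, 3))  # agrees with the 56.599 reported in Table 1
```

Factors with more than two categories would use the general r × c form of the same statistic.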
For individuals without BC at baseline, 242 of 60,302 participants in Shandong and 47 of 59,612 participants in the other three provinces were newly diagnosed with BC at follow-up during 2008–2020. We estimated the age-specific BC incidence [Supplementary Table 3, http://links.lww.com/CM9/B768] based on the birth date for all women and age when cancer developed.
Model development
In the initial exploration, we built EPLR models using the training set containing all 72 variables and the training set containing only the 51 variables available in the new population dataset. The results in Figure 2A show that the 51-factor model was not better than the 72-factor model in either the internal validation set (30% of the Shandong province dataset) or the external validation set (dataset from the other three provinces, which included 54,919 controls and 113 cases). Thus, we built the 72-factor EPLR model. Individuals without BC at baseline were recruited for the long-term risk prediction EPLT model. Similarly, the internal and external datasets comprised samples from Shandong province and the other three provinces, respectively. During 2018–2020, we carried out a 3-year follow-up, with 30,060 new participants added to the BCCS-CW during that time. However, only 51 of the same variables were collected for the new participants. Therefore, we used the new population as a purely external validation set only for the EPLT model.
Figure 2.
ROCs for EPLR model. (A) ROC curves for 51-factor and 72-factor EPLR models. ROC curves show performance of the 72-factor-EPLR model using both the internal validation set (green) and external validation set (yellow). ROC curves show performance of the 51-factor-EPLR model using both the internal validation set (red) and external validation set (blue). The AUC was improved in absolute terms by 4.5% and 8.5% using the 72-factor-EPLR model compared with the 51-factor-EPLR model in the internal and external validation sets, respectively. (B) ROC curves showing performance of the EPLR model using both the internal validation set (green) and external validation set (yellow). ROC curves showing performance of the BCRAM using both the internal validation set (green) and external validation set (blue). The AUC was improved in absolute terms by 10.9% and 6.8% using the EPLR model compared with the BCRAM in the internal and external validation sets, respectively. *Indicates the external validation set. AUC: Area under the receiver operating characteristic curve. EPLR: Ensemble penalized logistic regression; ROC: Receiver operating characteristic.
Discrimination: AUCs and NRI
In the 100 internal validation sets, the average value of the AUC for the 72-factor EPLR model was 0.800 (95% confidence interval [CI], 0.791–0.808; Figure 2B, green curve), and the average AUC for the Social Network-inspired Breast Cancer Risk Assessment Model (BCRAM)[15] was 0.691 (95% CI, 0.683–0.699; Figure 2B, blue curve). The averaged NRI of the 72-factor EPLR model relative to BCRAM was 0.164. Furthermore, in the external validation set, the EPLR model had an AUC of 0.751 (95% CI, 0.705–0.798; Figure 2B, yellow curve). The NRI relative to BCRAM (AUC: 0.683; 95% CI, 0.646–0.721; Figure 2B, red curve) was 0.268, which indicated that the EPLR model exhibited a substantial and statistically significant increase in prediction accuracy. Supplementary Table 4, http://links.lww.com/CM9/B768, shows the changes in some indicators when comparing the EPLR model with BCRAM.
For the EPLT model, we report the 10-year predictions here; 5-year predictions can be found in Supplementary Figures 3, 4, and 5; and Supplementary Table 5 [http://links.lww.com/CM9/B768]. In the 100 internal validation sets, the EPLT model obtained an average AUC of 0.692 (95% CI, 0.686–0.698; Supplementary Figure 3A, http://links.lww.com/CM9/B768, orange curve) compared with an average AUC of 0.568 (95% CI, 0.562–0.574) for the Han Chinese Breast Cancer Prediction Model (HCBCP)[37] and 0.607 (95% CI, 0.601–0.613) for the Gail model. A more detailed summary of the results is provided in Supplementary Figure 3A–C and Supplementary Table 5, http://links.lww.com/CM9/B768. For the 10-year incidence probability, the AUC of the EPLT model was 0.692 (95% CI, 0.686–0.698; Figure 3A, orange curve) in the internal validation set and 0.760 (95% CI, 0.704–0.816; Figure 3B, orange curve) in the external validation set. The NRI of the EPLT compared with the Gail and HCBCP models was 0.109 and 0.171, respectively, in internal validation and 0.193 and 0.233, respectively, in external validation.
Figure 3.
ROC curves and calibration plots for EPLT, Gail, and HCBCP models. Orange, blue, and green represent curves or plots of EPLT, Gail, and HCBCP models, respectively. (A) ROC curves for internal validation set and (B) external validation set; (C) calibration plots for internal validation set and (D) external validation set. EPLT: Ensemble penalized long-term; ROC: Receiver operating characteristic.
Calibration: Calibration plots and E/O ratio
In the internal validation set, the E/O ratio for the EPLT, HCBCP, and Gail models was 1.098 (95% CI, 1.073–1.124), 1.198 (95% CI, 1.171–1.226), and 2.316 (95% CI, 2.263–2.370), respectively. In the external validation set, the E/O ratio for the EPLT, HCBCP, and Gail models was 0.944 (95% CI, 0.710–1.225), 1.224 (95% CI, 0.920–1.627), and 2.185 (95% CI, 1.643–2.905), respectively. Additionally, the calibration plots [Figure 3C,D] indicated that in both the internal and the external data sets, the EPLT model resulted in a good linear correlation between predicted and observed probabilities of BC.
Discussion
Using data from a large national prospective cohort of over 110,000 individuals in China, we developed and validated a simple, convenient, and practical machine learning risk prediction model to screen populations with a high risk of BC. The model is an aggregation of a series of logistic regression models, with good discrimination and moderate calibration, and relies only on non-experimental predictors. The superior accuracy of our model was confirmed in both internal and external validation. The developed model may serve as a useful tool to aid risk-stratified screening of individuals for potential BC and can provide a theoretical basis and technical support for the precise screening of high-risk women in China for BC prevention.
In this study, the AUC values of the EPLR model were 0.800 and 0.751 in the internal and external validation set, respectively. In the same internal and external validation sets, the AUC values of BCRAM were only 0.691 and 0.683. We also calculated the NRI to quantitatively evaluate the degree of improvement achieved by the EPLR model, which was 0.164 and 0.268, indicating better discrimination of the EPLR model. Notably, the EPLR model includes only simple risk factors that do not require invasive testing but have good predictive performance. Although the AUC values of the Gail and Tyrer–Cuzick models reach 0.74 and 0.76, respectively, these models generally use invasive and often expensive forms of detection; thus, these models are unsuitable for widespread application in China. Moreover, the number of factors in most models is limited. Our model involves 72 risk factors, relying on information collected in a survey, which is appropriate for the real-world context of China. Our model adopts simple, non-invasive factors and is therefore more suitable for the screening process and requirements of health economics in China.
To screen the high-risk population over the next few years, we further developed the EPLT model, which can be considered a combination of the EPLR and Gail models, comprising 51 variables and showing good discrimination and calibration ability. Compared with the Gail (five variables) and HCBCP (six variables) models, the dimensionality was relatively high; however, the predicted false positives and false negatives were reduced in both internal and external validation [see Supplementary Table 5, http://links.lww.com/CM9/B768]. Hence, the population entering chemoprevention can be screened more accurately using the EPLT model. These results indicate that our long-term risk prediction model has better discrimination and calibration ability than the HCBCP and Gail models.
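To make the idea of combining a relative-risk score with age-specific hazards concrete, the sketch below implements a generic Gail-type absolute risk projection with piecewise-constant hazards and competing mortality. It is not the EPLT formula, which is given only in the Supplementary Methods. The function name, the hazard values, and the assumption that a single relative risk multiplies the baseline BC hazard in every age band are all illustrative.

```python
import math

def absolute_risk(rel_risk, bc_hazard, death_hazard, widths):
    """Gail-type absolute risk over consecutive age bands.

    bc_hazard    : baseline BC incidence hazard per year in each band
    death_hazard : competing (non-BC) mortality hazard per year in each band
    widths       : length of each band in years
    """
    surv, risk = 1.0, 0.0  # surv = probability of being alive and BC-free so far
    for h1, h2, w in zip(bc_hazard, death_hazard, widths):
        total = rel_risk * h1 + h2
        if total > 0:
            # Probability of an event in this band, times the share due to BC.
            risk += surv * (rel_risk * h1 / total) * (1.0 - math.exp(-total * w))
        surv *= math.exp(-total * w)
    return risk

# Hypothetical hazards for two 5-year age bands (10-year projection).
bc = [0.0004, 0.0006]
death = [0.002, 0.003]
risk_average = absolute_risk(1.0, bc, death, [5, 5])
risk_high = absolute_risk(2.0, bc, death, [5, 5])
check = absolute_risk(1.0, [0.001, 0.001], [0.0, 0.0], [5, 5])  # equals 1 - exp(-0.01)
print(risk_average, risk_high, check)
```

Because the mortality hazard enters the projection directly, substituting mortality rates from a different region changes the projected risks, which is relevant to the calibration differences discussed for the external set.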
In the external set, the EPLT model also performed well in terms of discrimination and calibration. The externally validated AUC was higher than the internally validated AUC. This may be caused by the death rate in the EPLT model being replaced by the death rate in Taixing City [Supplementary Table 3, http://links.lww.com/CM9/B768]. In the follow-up, we will collect as much information on mortality rates in other provinces and cities as possible to verify and improve the model.
Compared with traditional models, we incorporated modern machine learning algorithms to improve prediction accuracy and simultaneously rank the importance of potential risk factors. Based on the integration framework of bagging, the EPLR model aggregates the prediction results of multiple elastic net–PLR models by averaging, which makes the prediction results more stable and the variance smaller.[38] Therefore, the integration process strengthens the generalization ability of the EPLR model. Additionally, the integration and elastic net penalties allow us to assign an importance score to each risk factor. The importance score table of risk factors allows us to screen important risk factors as new predictors in the HCBCP model to improve its long-term prediction performance. We considered the unbalanced data samples such that our logistic regression model based on bagging calculates the regression coefficient more accurately. The conditional logistic regression method used in the original HCBCP model only extracts a small part of the data for the control group whereas our model combines multiple control groups to reduce the generalization error and improve the generalization ability of the model.
We incorporated all factors in the questionnaire into the model development and screened for significant factors multiple times by introducing a penalty function in the base predictor. Important risk factors were automatically selected in a data-driven manner, avoiding the arbitrariness of subjective risk factor selection. Except for the number of previous breast biopsies, which were not collected in the BCCS-CW, other risk factors included in the Gail and HCBCP models were ranked relatively high in importance [Figure 4]. We found that the most important predictor was "overall life satisfaction", which is a risk factor that we designed to indicate how a person feels regarding their current life status. This shows the importance of life satisfaction in lowering the risk of BC. It has been reported that the influence of mental and psychological status in BC cannot be ignored. A large number of studies[39–42] have shown that negative life events, depression, anxiety, irritability, poor psychological factors, and spiritual factors are all related to the incidence of BC, highlighting that some psychological factors may be related to BC. Psychological intervention should be considered in the comprehensive prevention of BC.
Figure 4.
Score of importance for each risk factor. The score is the number of occurrences of each of the 72 risk factors in the 200 PLR models trained on data from all of Shandong Province and quantifies the impact of each risk factor on breast cancer incidence. BMI: Body mass index; PLR: Penalized logistic regression; WHR: Weight-to-height ratio.
This study had several limitations. First, the performance of our models was externally validated only with data from the other three provinces, and the 10-year and 5-year predictions of the EPLT model were validated with an external dataset that had only 3 years of follow-up. Therefore, further validation of our models in a larger validation set outside the cohort and with longer follow-up is warranted. Second, several established risk factors were filtered out by the EPLR model and were therefore not included in the long-term risk prediction model. For example, although several studies have included alcohol, the low score for alcohol intake in the ranking table of risk factors in our study indicated that it was unimportant in our data. Finally, owing to the limited number of BC cases, our database lacks information on BC subtypes. Thus, we were unable to establish an estrogen receptor-specific model.
In conclusion, using machine learning algorithms, we constructed BC risk prediction models, the EPLR and EPLT models, for the precise screening of groups with a high risk of developing BC in China. Our models have greater discrimination accuracy than existing state-of-the-art methods.
Acknowledgments
We would like to thank all subjects involved in the study for their participation. We are also grateful to the CDC for their technical assistance and generous support.
Funding
This research was supported by grants from China Postdoctoral Science Foundation (Nos. 2021M691911, 2021M701997), the National Key Research and Development Program of China (No. 2016YFC0901301), the Minister-affiliated Hospital Key Project of the Ministry of Health of China (No. 07090122), and General Programs of Natural Science Foundation of Shandong Province (No. ZR2021MH243).
Conflicts of interest
None.
Footnotes
Liyuan Liu and Yong He contributed equally to this work.
How to cite this article: Liu LY, He Y, Kao CY, Fan YY, Yang F, Wang F, Yu LX, Zhou F, Xiang YJ, Huang SY, Zheng C, Cai H, Bao HL, Fang LW, Wang LH, Chen ZJ, Yu ZG. An advanced machine learning method for simultaneous breast cancer risk prediction and risk ranking in Chinese population: A prospective cohort and modeling study. Chin Med J 2024;137:2084–2091. doi: 10.1097/CM9.0000000000002891
References
- 1. Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 2021;71: 209–249. doi: 10.3322/caac.21660.
- 2. Xia C, Dong X, Li H, Cao M, Sun D, He S, Yang F, Yan X, Zhang S, Li N, Chen W. Cancer statistics in China and United States, 2022: profiles, trends, and determinants. Chin Med J 2022;135: 584–590. doi: 10.1097/CM9.0000000000002108.
- 3. Cao M, Chen W. Epidemiology of cancer in China and the current status of prevention and control (in Chinese). Chin J Clin Oncol 2019;24: 145–149. doi: 10.3969/j.issn.1000-8179.2019.03.283.
- 4. Gail MH, Brinton LA, Byar DP, Corle DK, Green SB, Schairer C, et al. Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. J Natl Cancer Inst 1989;81: 1879–1886. doi: 10.1093/jnci/81.24.1879.
- 5. Meads C, Ahmed I, Riley RD. A systematic review of breast cancer incidence risk prediction models with meta-analysis of their performance. Breast Cancer Res Treat 2012;132: 365–377. doi: 10.1007/s10549-011-1818-2.
- 6. MacInnis RJ, Bickerstaffe A, Apicella C, Dite GS, Dowty JG, Aujard K, et al. Prospective validation of the breast cancer risk prediction model BOADICEA and a batch-mode version BOADICEACentre. Br J Cancer 2013;109: 1296–1301. doi: 10.1038/bjc.2013.382.
- 7. Tyrer J, Duffy SW, Cuzick J. A breast cancer prediction model incorporating familial and personal risk factors. Stat Med 2004;23: 1111–1130. doi: 10.1002/sim.1668.
- 8. Lindor NM, Lindor RA, Apicella C, Dowty JG, Ashley A, Hunt K, et al. Predicting BRCA1 and BRCA2 gene mutation carriers: Comparison of LAMBDA, BRCAPRO, Myriad II, and modified Couch models. Fam Cancer 2007;6: 473–482. doi: 10.1007/s10689-007-9150-z.
- 9. Kim D, Joung JG, Sohn KA, Shin H, Park YR, Ritchie MD, et al. Knowledge boosting: A graph-based integration approach with multi-omics data and genomic knowledge for cancer clinical outcome prediction. J Am Med Inform Assoc 2015;22: 109–120. doi: 10.1136/amiajnl-2013-002481.
- 10. Brentnall AR, Evans DG, Cuzick J. Distribution of breast cancer risk from SNPs and classical risk factors in women of routine screening age in the UK. Br J Cancer 2014;110: 827–828. doi: 10.1038/bjc.2013.747.
- 11. Dite GS, MacInnis RJ, Bickerstaffe A, Dowty JG, Allman R, Apicella C, et al. Breast cancer risk prediction using clinical models and 77 independent risk-associated SNPs for women aged under 50 years: Australian Breast Cancer Family Registry. Cancer Epidemiol Biomarkers Prev 2016;25: 359–365. doi: 10.1158/1055-9965.EPI-15-0838.
- 12. van Veen EM, Brentnall AR, Byers H, Harkness EF, Astley SM, Sampson S, et al. Use of single-nucleotide polymorphisms and mammographic density plus classic risk factors for breast cancer risk prediction. JAMA Oncol 2018;4: 476–482. doi: 10.1001/jamaoncol.2017.4881.
- 13. Dai J, Hu Z, Jiang Y, Shen H, Dong J, Ma H, et al. Breast cancer risk assessment with five independent genetic variants and two risk factors in Chinese women. Breast Cancer Res 2012;14: R17. doi: 10.1186/bcr3101.
- 14. Zheng W, Wen W, Gao YT, Shyr Y, Zheng Y, Long J, et al. Genetic and clinical predictors for breast cancer risk assessment and stratification among Chinese women. J Natl Cancer Inst 2010;102: 972–981. doi: 10.1093/jnci/djq170.
- 15. Li A, Wang R, Liu L, Xu L, Wang F, Chang F, et al. BCRAM: A social-network-inspired breast cancer risk assessment model. IEEE Trans Ind Inf 2018;15: 366–376. doi: 10.1109/TII.2018.2825345.
- 16. Cruz JA, Wishart DS. Applications of machine learning in cancer prediction and prognosis. Cancer Inform 2007;2: 59–77. doi: 10.1177/117693510600200030.
- 17. Fontanella S, Frainay C, Murray CS, Simpson A, Custovic A. Machine learning to identify pairwise interactions between specific IgE antibodies and their association with asthma: A cross-sectional analysis within a population-based birth cohort. PLoS Med 2018;15: e1002691. doi: 10.1371/journal.pmed.1002691.
- 18. Battineni G, Sagaro GG, Chinatalapudi N, Amenta F. Applications of machine learning predictive models in the chronic disease diagnosis. J Pers Med 2020;10: 21. doi: 10.3390/jpm10020021.
- 19. Shi X, Chen Z, Wang H, et al. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. Adv Neural Inf Process Syst 2015;28.
- 20. Mortazavi BJ, Downing NS, Bucholz EM, Dharmarajan K, Manhapra A, Li SX, et al. Analysis of machine learning techniques for heart failure readmissions. Circ Cardiovasc Qual Outcomes 2016;9: 629–640. doi: 10.1161/CIRCOUTCOMES.116.003039.
- 21. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Series B Stat Methodol 2005;67: 301–320. doi: 10.1111/j.1467-9868.2005.00503.x.
- 22. Zou H. The adaptive LASSO and its oracle properties. J Am Stat Assoc 2006;101: 1418–1429. doi: 10.1198/016214506000000735.
- 23. Zhang CH. Nearly unbiased variable selection under minimax concave penalty. Ann Stat 2010;38: 894–942. doi: 10.1214/09-AOS729.
- 24. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Series B Methodol 1996;58: 267–288. doi: 10.1111/j.2517-6161.1996.tb02080.x.
- 25. Yijing L, Haixiang G, Xiao L, Yanan L, Jinling L. Adapted ensemble classification algorithm based on multiple classifier system and feature selection for classifying multi-class imbalanced data. Knowledge-Based Syst 2016;94: 88–104. doi: 10.1016/j.knosys.2015.11.013.
- 26. Bao HL, Liu LY, Fang LW, Cong S, Fu ZT, Tang JL, et al. The Breast Cancer Cohort Study in Chinese Women: The methodology of population-based cohort and baseline characteristics (in Chinese). Chin J Epidemiol 2020;41: 2040–2045. doi: 10.3760/cma.j.cn112338-20200507-00695.
- 27. Mancini A, Vito L, Marcelli E, Piangerelli M, De Leone R, Pucciarelli S, et al. Machine learning models predicting multidrug resistant urinary tract infections using "DsaaS". BMC Bioinformatics 2020;21(Suppl 10): 347. doi: 10.1186/s12859-020-03566-7.
- 28. Ijaz M, Asghar Z, Gul A. Ensemble of penalized logistic models for classification of high-dimensional data. Commun Stat Simul Comput 2021;50: 2072–2088. doi: 10.1080/03610918.2019.1595647.
- 29. Breiman L. Bagging predictors. Mach Learn 1996;24: 123–140. doi: 10.1023/A:1018054314350.
- 30. Zhou ZR, Wang WW, Li Y, Jin KR, Wang XY, Wang ZW, et al. In-depth mining of clinical data: The construction of clinical prediction model with R. Ann Transl Med 2019;7: 796. doi: 10.21037/atm.2019.08.63.
- 31. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143: 29–36. doi: 10.1148/radiology.143.1.7063747.
- 32. Cook NR. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation 2007;115: 928–935. doi: 10.1161/CIRCULATIONAHA.106.672402.
- 33. Costantino JP, Gail MH, Pee D, Anderson S, Redmond CK, Benichou J, et al. Validation studies for models projecting the risk of invasive and total breast cancer incidence. J Natl Cancer Inst 1999;91: 1541–1548. doi: 10.1093/jnci/91.18.1541.
- 34. Zheng R, Zhang S, Zeng H, Wang S, Sun K, Chen R, et al. Cancer incidence and mortality in China, 2016. J Natl Cancer Center 2022;2: 1–9. doi: 10.1016/j.jncc.2022.02.002.
- 35. Giaquinto AN, Sung H, Miller KD, Kramer JL, Newman LA, Minihan A, et al. Breast cancer statistics, 2022. CA Cancer J Clin 2022;72: 524–541. doi: 10.3322/caac.21754.
- 36. Pan R, Zhu M, Yu C, Lv J, Guo Y, Bian Z, et al. Cancer incidence and mortality: A cohort study in China, 2008-2013. Int J Cancer 2017;141: 1315–1323. doi: 10.1002/ijc.30825.
- 37. Wang L, Liu LY, Lou Z, Ding LJ, Guan H, Wang F, et al. Risk prediction for breast cancer in Han Chinese women based on a cause-specific hazard model. BMC Cancer 2019;19: 128. doi: 10.1186/s12885-019-5321-1.
- 38. Ditzler G, LaBarck J, Ritchie J, Rosen G, Polikar R. Extensions to online feature selection using bagging and boosting. IEEE Trans Neural Netw Learn Syst 2018;29: 4504–4509. doi: 10.1109/TNNLS.2017.2746107.
- 39. Reich M, Lesur A, Perdrizet-Chevallier C. Depression, quality of life and breast cancer: A review of the literature. Breast Cancer Res Treat 2008;110: 9–17. doi: 10.1007/s10549-007-9706-5.
- 40. Wondimagegnehu A, Abebe W, Abraha A, Teferra S. Depression and social support among breast cancer patients in Addis Ababa, Ethiopia. BMC Cancer 2019;19: 836. doi: 10.1186/s12885-019-6007-4.
- 41. Li J, Gao W, Yang Q, Cao F. Perceived stress, anxiety, and depression in treatment-naïve women with breast cancer: A case-control study. Psychooncology 2021;30: 231–239. doi: 10.1002/pon.5555.
- 42. Galgut C. Psychological effect of breast cancer. Lancet Oncol 2011;12: 1187. doi: 10.1016/S1470-2045(11)70356-4.