Abstract
We propose a method to predict when a woman will develop breast cancer (BCa) from her lifestyle and health history features. To address this objective, we use data from the Alberta’s Tomorrow Project of 18,288 women to train Individual Survival Distribution (ISD) models to predict an individual’s Breast-Cancer-Onset (BCaO) probability curve. We show that our three-step approach–(1) filling missing data with multiple imputations by chained equations, followed by (2) feature selection with the multivariate Cox method, and finally, (3) using MTLR to learn an ISD model–produced the model with the smallest L1-Hinge loss among all calibrated models with comparable C-index. We also identified 7 actionable lifestyle features that a woman can modify and illustrate how this model can predict the quantitative effects of those changes–suggesting how much each will potentially extend her BCa-free time. We anticipate this approach could be used to identify appropriate interventions for individuals with a higher likelihood of developing BCa in their lifetime.
1. Introduction
Breast Cancer (BCa) is the most diagnosed malignancy among women worldwide, with 2.26 million new cases diagnosed in 2020 [1, 2]. It accounted for 30% of estimated new cancer cases in American women in 2021 [3], and has a mortality-to-incidence ratio of 15% [4]. Researchers have mostly looked through the lens of the human genome to diagnose, prevent, and treat cancer [5, 6]. However, studies of identical twins have shown that genes are not the only source of cancer [7]. Instead, research shows that external factors, such as lifestyle and environment, also contribute greatly to cancer development [8]. Metabolomic data also show that more than 90% of cancers are caused by environmental exposures [9]. Identifying these external factors can benefit individuals by providing them with a “prescription” for how changing their lifestyle can reduce cancer risk.
Several recent observational studies have reported the impact of lifestyle and environment on the incidence of BCa. The incidence of BCa varies across continents and countries (27 per 100,000 in Africa and East Asia, but 97 per 100,000 in North America), reflecting the possible association between the risk of BCa development and local economic status, social and lifestyle factors [10]. One of the most significant cohort studies, the Million Women Study, confirmed the deleterious effect of hormone replacement therapy on BCa in a cohort of over one million women [11, 12]. A study cohort of female BCa patients in Sweden showed that a large proportion of BCa cases might also be associated with pregnancy-related factors, hormone therapy, lifestyle factors (such as body mass index, exercise, alcohol consumption, diet habits, and smoking), and other risk factors [13]. Other studies concluded that more than a third of breast cancers seem to be preventable through lifestyle changes in high-income countries [8]. In addition, there is controversy as to whether hormonal contraceptives increase the risk of BCa. Some studies suggest that the associated effect is minimal [13], while others suggest that long-term usage has a certain deleterious effect on BCa [14, 15].
Time-to-event (aka survival) analysis methods can greatly assist with building “disease onset” predictive models using modifiable BCa risk factors. This onset information can also help reduce BCa mortality and aid in women’s access to high-quality prevention, early detection, and treatment services [16, 17]. The well-known Gail model/Breast Cancer Risk Assessment Tool [18, 19] assesses the BCa risk over the next 5 years using medical history information. The Breast Cancer Surveillance Consortium model [20] performed risk analysis to explore the relationship of age, race, family cancer history, and breast density with the risk of BCaO. However, none of the above research considers external factors, e.g., lifestyle and environment, in its risk factor analysis. The closest studies to ours might be the Nurses’ Health Study (NHS) [21, 22] and the California Teachers Study (CTS) [23], which both include medical history information, postmenopausal hormones, and alcohol consumption to predict BCaO risk. However, our study aimed to identify modifiable lifestyle factors by including a richer set of lifestyle traits, including dietary habits, physical activity information, hormone usage, and social support index, on which the NHS and CTS studies were silent. Furthermore, in contrast to all these traditional risk models, few (if any) survival prediction models can use health history and lifestyle factors to calculate the probability curve of a woman developing BCa over time (rather than a single risk score).
An accurate estimation of the time to breast cancer onset (BCaO) in a woman based on her health history and lifestyle information can help suggest timely lifestyle changes to potentially delay her BCaO. This accurate personalized BCaO detection can thus improve the overall quality of life and reduce the burden of cancer treatment for a patient. This paper focuses on survival prediction models that compute time to BCaO for women at the individual level. We consider the following types of survival models to provide a context for our personalized approach to BCaO prediction [24]:
Individual Survival Distribution Models produce a survival distribution (a probability curve for all future time points) specific to each individual.
Population-Level Survival Models produce a survival curve for a group of individuals, e.g., the Kaplan-Meier estimator [25].
Single-time Probability Models compute one survival estimate at a specific time for a particular individual. For example, the Gail model [18, 19] calculates the probability of breast cancer onset at 5 years from recruitment.
Time-Invariant Risk Models produce a time-invariant risk score for each individual, such as the Cox proportional hazards model [26].
We focus on the first type of model, which computes ISDs, because the other types do not provide BCaO probabilities over all future time points, as required for our analysis. It should be noted that we use the term ISD to refer to the cancer-free probability curves in this paper (see examples in Fig 1). Fig 1 shows that Patient A’s cancer-free probability is 88% at 20 months; this means our model predicts that there is an 88% chance that she will be BCa-free throughout those 20 months and a 12% chance that she will develop BCa in this time. ISDs have many desirable properties for clinical applications–e.g., an ISD can provide personalized BCaO probabilities at any future time-point and can also be used to compute the expected time of BCaO, as well as confidence intervals around any such prediction. Fig 1 shows the ISDs of two example patients and highlights the important patient-specific statistics that can be computed from these ISDs.
Fig 1. ISDs of two example participants–Patient A in green and B in red.
Note that the ISD provides a person’s probability of being cancer-free for each future time t, and the BCaO probability at time t is (1 − cancer-free-probability(t)). Some useful statistics can be computed from ISDs: the intersection of a patient’s ISD with the median probability line (the horizontal line at 0.5) is the median BCaO time for that patient; the intersection of an ISD with the vertical line at a target time (e.g., at 20 months) provides the cancer-free probability at that specific time-point. Note that, in general, multiple ISDs are allowed to cross over (shown with an arrow) as each patient has her own ISD computed from her specific features (unless a model assumes proportional hazards).
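As a minimal sketch of how these statistics can be read off an ISD, the snippet below represents an ISD as a list of (time, cancer-free probability) points and interpolates linearly between them. The function names and the example curve are hypothetical: the curve was chosen only so that it passes through 0.88 at 20 months, as in the Patient A example, and is not taken from the study.

```python
from typing import List, Optional, Tuple

# (time in months, cancer-free probability); times strictly increasing
Curve = List[Tuple[float, float]]

def cancer_free_probability(isd: Curve, t: float) -> float:
    """Linearly interpolate the cancer-free probability at time t."""
    if t <= isd[0][0]:
        return isd[0][1]
    for (t0, p0), (t1, p1) in zip(isd, isd[1:]):
        if t0 <= t <= t1:
            return p0 + (p1 - p0) * (t - t0) / (t1 - t0)
    return isd[-1][1]  # beyond the last point: hold the last value

def median_onset_time(isd: Curve) -> Optional[float]:
    """Time at which the curve first crosses 0.5, or None if it never does."""
    for (t0, p0), (t1, p1) in zip(isd, isd[1:]):
        if p0 >= 0.5 >= p1 and p0 != p1:
            return t0 + (p0 - 0.5) * (t1 - t0) / (p0 - p1)
    return None

# Hypothetical ISD, for illustration only.
patient_a: Curve = [(0, 1.0), (20, 0.88), (100, 0.60), (200, 0.40)]

p20 = cancer_free_probability(patient_a, 20)
print("cancer-free probability at 20 months:", round(p20, 2))
print("BCaO probability at 20 months:", round(1 - p20, 2))
print("median BCaO time (months):", median_onset_time(patient_a))
```

A curve that never drops to 0.5 within the observed horizon has no median on that horizon, which is why the second function may return `None`.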
Thus, our objective is to produce a tool that can accurately estimate the time until BCaO for each woman from her ISD, which is computed using her personal values of various features–including modifiable lifestyle features and health history information. We also propose that some of the modifiable lifestyle features could be used to suggest meaningful interventions to an individual to potentially delay BCaO, which we hope will eventually lead to a better quality of life (see Discussion section). The major strengths of this study are listed below:
Curating a BCaO dataset, ATP-BCa, with Alberta’s Tomorrow Project (ATP) cohort that describes the relevant characteristics of 18,288 female residents in Alberta, Canada, and their corresponding BCaO times or censoring times. We believe this is the first time-to-event analysis dataset that contains the lifestyle and environmental features for making personalized BCaO predictions.
Building ISD models from this ATP-BCa dataset that can predict individual BCaO from health history and lifestyle factors. We present a three-step approach (missing value imputation, feature selection, and ISD model) using a large observational dataset to develop an improved prognostic model for predicting BCaO time for healthy women. Note that this deviates from other existing models: we recruited only healthy adults without a previous history of cancer, whereas most traditional risk models include BCa patients and healthy controls. Our design reflects the real-world scenario and is hence more applicable in practice. In addition, our model predicts BCaO time for healthy women instead of survival time for cancer-diagnosed patients.
Identifying the important features for BCaO prediction. While our models include intrinsic (i.e., non-modifiable) features, we focus on actionable lifestyle features that participants can modify to potentially prolong their BCa-free time. These important actionable lifestyle features include features from supplement consumption (selenium and Vitamin E intake), social support index (connection and share index), and healthy food consumption (orange vegetable, fish, and whole-grain consumption). We also demonstrate how clinicians can use counterfactual results obtained from a participant’s ISD to provide them with actionable “recommendations” to potentially delay a woman’s BCaO.
2. Methods
2.1. Ethics approval and consent to participate
All data used in this study were de-identified before being released to the authors for analysis. Data access and analyses in this study complied with the provincial Health Information Act (HIA) in Alberta and Alberta Health Services (AHS) data access procedures and data disclosure guidelines. The study was approved by the local Health Research Ethics Board of Alberta (HREBA) – Cancer Committee under the protocol #-19-0188.
The data used secondarily in this publication were originally collected from participants through Alberta’s Tomorrow Project [27, 28]. This data collection was reviewed and approved by the Health Research Ethics Board of Alberta (HREBA.CC-17-0461). More specifically, each participant signed a consent form, with a witness signature. In cases where a physical consent form was not signed, implied consent was documented–e.g., if individuals participated in the telephone screening interview, they agreed to receive a mailed information package and returned a completed baseline questionnaire. ATP documents consent both electronically (through e-collection or scanning paper copies of the signed consent) and on paper (not yet scanned).
2.2. Dataset curation and description
ATP was launched in 2000 to develop a deeper understanding of the etiological basis of cancers and other chronic diseases, to help prevent or reduce their incidence in the near future [27, 28]. This cohort recruited adult Alberta residents from 2000 to 2008, then followed up with these participants regularly to collect lifestyle and health information. We selected women who filled out the Health and Lifestyle Questionnaire (HLQ) at the enrollment stage, as shown in the data assembly process in Fig 2. ATP also collected these participants’ diet and physical activity information through the Canadian Diet History Questionnaire I (CDHQ-I) and Past-Year Total Physical Activity Questionnaire (PYTPAQ). These participants were linked to the Alberta Cancer Registry (ACR) via their Personal Health Numbers (PHN) to collect detailed information on BCa diagnosis and stage. This study cohort, hereafter called “ATP-BCa”, includes 122 features (and time-to-event outcomes, i.e., censor indicator δ and time T) from 18,288 female ATP participants. Table 1 describes the characteristics of the uncensored and censored participants for this dataset. Please refer to [27, 28] for more details and patient characteristics in the ATP cohort.
Fig 2. Scheme of data curation.

Table 1. The definition and characteristics of female participants in the ATP-BCa dataset.
| | Uncensored Participants | Censored Participants | Total Participants |
|---|---|---|---|
| Event of Interest | Breast Cancer Onset | | |
| Criteria for Breast Cancer Onset | 1. Ductal Carcinoma In-situ, Intraepithelial, Non-Infiltrating, Non-invasive; or 2. Malignant or Primary Tumor | | |
| Start Time Definition | Recruitment Month | | |
| Time-to-Event or Censoring | Breast Cancer Onset Month | Last Follow-up Month a | Breast Cancer Onset Month / Last Follow-up Month |
| Number of Instances (%) | 605 (3.31%) | 17683 (96.69%) | 18288 (100%) |
| Number of features (Number of Actionable Lifestyle features) | 122 (98) | | |
| Minimum–Maximum Age in Years | 35.22–70.18 | 35.08–70.34 | 35.08–70.34 |
| Mean Age ± Standard Deviation | 53.89 ± 9.23 | 50.47 ± 9.19 | 50.58 ± 9.21 |
| Maximum Follow-up Time (Months) | 201 | 207 | 207 |
| Median Follow-up Time (Months) | 86.88 | 160.68 | 106.20 |
a The last follow-up time is the date of the last linkage with the Alberta Cancer Registry.
Fig 3 shows the cancer-free time characteristics for the ATP-BCa dataset. The Kaplan-Meier curves of Fig 3A show that most participants are not expected to develop BCa within 207 months, as the cancer-free probability for this time is 95.85%, with a 95% confidence interval of 95.44% to 96.22% (computed using Greenwood’s formula [29])–this narrow confidence interval supports the observation that 96% of the individuals did not experience BCaO in the ATP-BCa dataset. Fig 3B presents the event and censor time histogram of all 18k participants in the ATP-BCa study, showing that the first censored participant appears at the 95th month after she was recruited. Thus, the personalized BCaO model built from the ATP-BCa data should incorporate a large number of censored patients with censor times between 95 and 207 months. Fig 3C shows a zoomed-in version of the histogram of event times, as these are occluded in Fig 3B. Note that we consider a broader definition of BCaO that includes ductal carcinoma in-situ (DCIS) because (1) DCIS is often considered the earliest form of breast cancer, and since our goal was early BCaO detection, it is reasonable to use DCIS as a cancer onset event; (2) DCIS has the potential to develop into an invasive carcinoma, which means that patients with DCIS are at a higher risk of invasive breast cancer.
Fig 3. Kaplan-Meier estimation of the ATP-BCa dataset and the uncensored/censored time histograms.
(a) Kaplan-Meier estimation with 95% confidence interval. Note that the y-axis starts from 0.95; (b) Uncensored (blue) and censored time (yellow) histogram of the participants in the ATP-BCa dataset; (c) Rescaling, and showing only the uncensored time histogram of the ATP-BCa dataset.
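For readers unfamiliar with the estimator behind Fig 3A, the following is a minimal sketch of a Kaplan-Meier curve with a Greenwood-type 95% confidence interval. It uses the plain (linear) form of the interval, which is an assumption: the text does not say which variant or software was used, and the toy data below are invented for illustration.

```python
import math
from typing import List, Tuple

def kaplan_meier(times: List[float], events: List[int],
                 z: float = 1.96) -> List[Tuple[float, float, float, float]]:
    """Return (t, S(t), lower, upper) at each distinct event time.

    events[i] is 1 if BCaO was observed at times[i] and 0 if censored.
    The interval is S(t) +/- z * SE, with SE from Greenwood's formula,
    clipped to [0, 1].
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv, var_sum, out = 1.0, 0.0, []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = c = 0  # events and censorings at time t
        while i < len(data) and data[i][0] == t:
            d += data[i][1]
            c += 1 - data[i][1]
            i += 1
        if d > 0:
            surv *= 1 - d / n_at_risk
            if n_at_risk > d:
                var_sum += d / (n_at_risk * (n_at_risk - d))
            se = surv * math.sqrt(var_sum)
            out.append((t, surv, max(0.0, surv - z * se), min(1.0, surv + z * se)))
        n_at_risk -= d + c
    return out

# Toy data (months, event indicator); not from ATP-BCa.
km = kaplan_meier([5, 8, 8, 12, 15, 20], [1, 0, 1, 0, 1, 0])
for t, s, lo, hi in km:
    print(f"t={t:>3}  S={s:.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```

Censored participants never lower the curve; they only leave the risk set, which is why a dataset with heavy censoring can still yield a tight interval when the number at risk stays large.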
2.3. Feature categories
The dataset has rich feature sets including baseline information (BLINE), HLQ, CDHQ-I, and PYTPAQ [27, 28]. As mentioned in the introduction, we want to determine the lifestyle risk factors associated with BCaO, focusing on features that participants can alter or modify in the hope of delaying onset. Below we summarize the feature fields and define which sets of variables our experts consider actionable lifestyle factors.
BLINE (2 features): There are just two baseline variables: age and geographic (rural or urban) characteristics; obviously, age is not an actionable factor while the geographic location is actionable.
HLQ (56 features): This feature set contains the participant’s self-reported answers about personal and family medical history, history of cancer screening and family history of BCa or other cancers, reproductive health, smoking, social support, anthropometric measurements, and demographic characteristics. The HLQ was administered to participants at recruitment time. While the participant has the option of changing her future lifestyle (related to smoking, social support, and anthropometrics), she of course cannot modify past features, such as her medical history, cancer screening behavior, reproductive health, or demographic characteristics.
CDHQ-I (52 features): This feature set contains self-reported food and nutrient intake information for the year preceding each subject’s enrolment in the project–including daily consumption of alcohol, dairy, meat and vegetables, vitamin intake from supplements, etc. All CDHQ-I features are considered actionable lifestyle factors.
PYTPAQ (12 features): This feature set describes the type and average amount of physical activity in the year before that subject enrolled. It considers four types of activities: job-related physical activity, household-related physical activity, leisure-time physical activity, and transportation-related physical activity. All PYTPAQ features are considered actionable lifestyle factors.
We distinguish each feature by whether it is actionable or intrinsic, as described in S4 Table in the Supplementary file. We treated 24 features as intrinsic and the remaining 98 as actionable.
2.4. Preprocessing
The raw ATP-BCa dataset is missing some values for two reasons: (1) whole sections of questionnaires are missing for participants who were part of different sub-studies, so that, for example, a participant was instructed to complete the HLQ but not the CDHQ-I or PYTPAQ questionnaire; or (2) specific variables are missing, for example, BMI information is not available due to missing height or weight data for some participants. We then experimented with three different data imputation methods to fill in the missing values:
Median Value Imputation (MVI) is a univariate non-parametric method that fills the missing values with the median value of that feature.
K-Nearest Neighbors (KNN) Imputation is a multivariate non-parametric method for imputation: If participant p is missing the value of feature f, this method first finds p’s k “nearest neighbors”, based on Euclidean distance (after removing feature f, of course). We then set p’s “f” value to be the mean of the “f” value (after one-hot encoding categorical features) of those k nearest neighbors. Here we set k = 2, meaning we use the 2 closest neighbors of each participant to calculate the missing feature value.
Multiple Imputation by Chained Equations (MICE) is a multivariate parametric method, which iteratively models each feature with missing values as a function of the other features and then imputes that estimate [30].
One difference between these three imputation methods is that MVI assumes the features are independent, while KNN (resp. MICE) computes the relevant value as a learned non-linear (resp., linear) function of the other feature values. After completing the missing value imputation, we apply feature normalization to non-categorical features and one-hot encoding to categorical features.
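The first two imputation methods can be sketched in a few lines of plain Python. This is an illustrative sketch, not the study’s implementation: the function names are hypothetical, the distance is computed over the features observed in both rows (one common convention, assumed here), and categorical encoding is omitted. MICE is not shown because it requires fitting a regression model per feature.

```python
import math
from statistics import median
from typing import List, Optional

Row = List[Optional[float]]  # None marks a missing value

def median_impute(rows: List[Row]) -> List[List[float]]:
    """MVI: replace each missing value with that feature's observed median."""
    n_features = len(rows[0])
    medians = [median(r[j] for r in rows if r[j] is not None)
               for j in range(n_features)]
    return [[medians[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]

def _distance(a: Row, b: Row, skip: int) -> float:
    """Root mean squared difference over features observed in both rows,
    ignoring the feature being imputed (`skip`)."""
    diffs = [(x - y) ** 2 for j, (x, y) in enumerate(zip(a, b))
             if j != skip and x is not None and y is not None]
    return math.sqrt(sum(diffs) / len(diffs)) if diffs else math.inf

def knn_impute(rows: List[Row], k: int = 2) -> List[Row]:
    """KNN imputation: mean of feature f over the k nearest rows observing f."""
    out = [list(r) for r in rows]
    for i, r in enumerate(rows):
        for f, v in enumerate(r):
            if v is not None:
                continue
            donors = sorted((_distance(r, other, f), other[f])
                            for m, other in enumerate(rows)
                            if m != i and other[f] is not None)
            nearest = [val for _, val in donors[:k]]
            if nearest:
                out[i][f] = sum(nearest) / len(nearest)
    return out

# Toy data: the second participant is missing feature 1.
toy = [[1.0, 10.0], [2.0, None], [3.0, 30.0], [10.0, 100.0]]
print(median_impute(toy)[1])    # uses the median of 10, 30, 100
print(knn_impute(toy, k=2)[1])  # uses the mean over the two closest rows
```

On the toy data the two methods disagree (30.0 versus 20.0 for the missing entry), which illustrates the point made above: MVI ignores the other features, while KNN lets the nearby participants determine the value.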
2.5. Feature selection
As discussed in the Introduction, identifying key actionable lifestyle features is important because this can help a patient delay her BCaO time by adopting a healthier lifestyle. However, clinicians will prefer to recommend a few (instead of hundreds) of the most important actionable features that a patient can modify to possibly reduce her BCaO probability by (say) 10% at 10 years in the future. With this in mind, we performed a thorough feature selection analysis of the whole ATP-BCa dataset with its 122 available features. This section briefly introduces the feature selection methods and the extraction of the features related to BCaO. Details of the implementation and the comparison are available in the Results section.
Univariate Cox fits a Cox proportional hazard (CoxPH) model using one feature at a time and uses the Wald test to assess the significance of that feature [26, 31]. If the p-value of the Wald test is lower than a specified value (we used threshold = 0.001, as recommended in [32]), then the feature is considered significant and is retained in the dataset; otherwise, it is removed.
Recursive Feature Elimination (RFE) is a standard feature selection method in classification and regression tasks [33]. In this study, we adapted it to accommodate censored data by iteratively fitting a CoxPH model and removing the least important feature, until only the desired number of features remains. We set the desired number of features to 10, which is comparable to the number of features selected by the above method.
Minimum Redundancy, Maximum Relevance (mRMR) is a “minimal optimal” feature selection algorithm, meaning that it seeks to find a feature set that gives the best possible predictive performance, given a fixed number of features. Inspired by [34], we adapted the mRMR method to use the C-index to estimate the correlation between a variable and the binary outcome of BCaO. We again set the desired number of features to 10 for mRMR.
Multivariate Cox fits a linear CoxPH model using all features and uses an elastic net penalty for feature selection [35]. Features with non-zero weights are retained in the dataset by this method. The hyper-parameter (lasso/ridge ratio) selection is made through grid search based on validation performance computed as the C-index.
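To make the greedy logic of mRMR concrete, here is a minimal sketch of the selection loop in its common "relevance minus mean redundancy" form. It is not the adapted method used in this study: the relevance scores and pairwise redundancies are taken as given inputs (in the paper’s adaptation, relevance would be derived from the C-index), and all feature names and numbers below are invented for illustration.

```python
from typing import Callable, Dict, List

def mrmr_select(relevance: Dict[str, float],
                redundancy: Callable[[str, str], float],
                n_features: int) -> List[str]:
    """Greedy mRMR: at each step, add the candidate with the highest
    relevance minus its mean redundancy with the already selected features."""
    selected: List[str] = []
    candidates = set(relevance)
    while candidates and len(selected) < n_features:
        def score(f: str) -> float:
            if not selected:
                return relevance[f]
            penalty = sum(redundancy(f, s) for s in selected) / len(selected)
            return relevance[f] - penalty
        best = max(sorted(candidates), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical relevance scores and redundancies (not study results).
relevance = {"age": 0.30, "bmi": 0.20, "weight": 0.19, "fish": 0.10}
high_overlap = {frozenset(("bmi", "weight")): 0.9}

def redundancy(a: str, b: str) -> float:
    return high_overlap.get(frozenset((a, b)), 0.1)

print(mrmr_select(relevance, redundancy, n_features=3))
```

In this toy run, "weight" is individually more relevant than "fish" but is skipped once "bmi" has been chosen, because the two carry largely the same information. This is the behaviour that distinguishes mRMR from ranking features by univariate significance alone.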
2.6. Evaluation metrics
This section briefly describes the metrics we used in this study for evaluating ISD models. We evaluate each model from the following three aspects: how close the predicted BCaO time is to reality (L1-Hinge loss), pairwise ranking accuracy (Concordance index), and model calibration (D-calibration). For simplicity, we only provide a plain description of these metrics; precise definitions can be found in section A in S1 Text. We prefer a model that has the lowest L1-Hinge loss among all D-calibrated models and use C-index to break L1-Hinge loss ties (see Discussion section).
L1-Hinge loss is like the mean absolute error (MAE) measurement in regression tasks, with lower scores (lower bounded by 0) indicating better performance. Fig 4A illustrates the L1-Hinge loss calculation, which is the absolute difference between the actual and model-predicted BCaO times (the latter being the median of the person’s ISD). It incorporates censoring by assigning 0 loss to any patient whose predicted event time is later than her censoring time, and a loss of [censoring time − predicted BCaO time] if the predicted BCaO time is earlier than the participant’s censoring time. For formal definitions and details, please see section A.1 in S1 Text.
Concordance index (C-index) is one of the most popular metrics for evaluating survival prediction models, with higher scores (upper bounded by 1) indicating better performance. It checks if the predicted order of events matches the true order for every “comparable” pair in a dataset. C-index includes censored instances into its calculation by appropriately defining the “comparable” pairs. A pair is “comparable” if we can determine who has the BCaO first. Finally, the C-index is calculated as the percentage of concordant pairs (correctly ordered predicted event times) among all comparable pairs. Please see section A.2 in S1 Text for details.
Distribution calibration (D-calibration) is a statistical test that assesses whether the probability prediction provided by an ISD curve is reliable [24]. Note that popular calibration measures, such as calibration regression plots and 1-calibration, are not applicable to ISD models because ISD models provide event probabilities at all future time-points instead of just one event probability or risk value at a fixed time-point. D-calibration splits the probability-axis of an ISD into B equal-sized bins (we used B = 10 in this paper) and assesses whether the actual number of events within each probability bin is statistically similar to the model-predicted number of events, claiming a model is D-calibrated if the p-value from the Hosmer-Lemeshow test is > 0.05 [24]. For the formal definition of the D-calibration calculation and how censored patients are handled, please see section A.3 in S1 Text.
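The binning idea behind D-calibration can be sketched for the simplest case, in which every participant is uncensored. The sketch below swaps in a plain Pearson chi-squared goodness-of-fit statistic against a uniform distribution and compares it with the 5% critical value for 9 degrees of freedom (16.919); this is a simplifying assumption and not the exact test or censoring treatment used in the paper (see section A.3 in S1 Text). The function name and example inputs are hypothetical.

```python
from typing import List

# Chi-squared critical value for 9 degrees of freedom at alpha = 0.05.
# Valid only for n_bins = 10.
CHI2_CRIT_DF9_ALPHA05 = 16.919

def d_calibration_statistic(probs_at_event: List[float], n_bins: int = 10) -> float:
    """Pearson chi-squared statistic for uncensored participants.

    probs_at_event[i] is the predicted cancer-free probability of
    participant i evaluated at her own observed BCaO time. For a
    calibrated model these values should be spread uniformly on [0, 1].
    """
    counts = [0] * n_bins
    for p in probs_at_event:
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        counts[idx] += 1
    expected = len(probs_at_event) / n_bins
    return sum((c - expected) ** 2 / expected for c in counts)

# Evenly spread predictions: consistent with calibration.
uniform_like = [(i + 0.5) / 100 for i in range(100)]
# Every prediction near 0.95 at the event time: clearly over-optimistic.
overconfident = [0.95] * 100

for name, probs in [("uniform_like", uniform_like), ("overconfident", overconfident)]:
    stat = d_calibration_statistic(probs)
    print(name, round(stat, 3), "calibrated" if stat < CHI2_CRIT_DF9_ALPHA05 else "not calibrated")
```

A statistic below the critical value corresponds to a p-value above 0.05, i.e., no evidence against calibration under this simplified test.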
Fig 4. Illustration of the L1-Hinge loss and its comparison with the C-index.
(a) L1-Hinge loss calculation from an ISD curve for an uncensored participant in ATP-BCa dataset. (b) Comparison of the L1-Hinge measurement and the C-index (see Section 5.3 for details). The dashed and dotted arrows indicate the expected BCaO times. (Note these examples are of uncensored individuals).
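The two discrimination measures described above can be sketched as follows. This is a plain-language rendering rather than the formal definitions in S1 Text: the loss is averaged over all participants, pairs with tied observed times are skipped, and tied predictions count as half-concordant, all of which are assumptions made for this illustration. The example values are invented.

```python
from typing import List

def l1_hinge(pred: List[float], time: List[float], event: List[int]) -> float:
    """Mean L1-Hinge loss; event[i] is 1 for observed BCaO, 0 for censored."""
    total = 0.0
    for p, t, e in zip(pred, time, event):
        if e == 1:
            total += abs(t - p)       # uncensored: absolute error
        else:
            total += max(0.0, t - p)  # censored: penalize only predictions before censoring
    return total / len(pred)

def c_index(pred: List[float], time: List[float], event: List[int]) -> float:
    """Fraction of comparable pairs whose predicted order matches the true order.

    A pair (i, j) is comparable when i has an observed event and
    time[i] < time[j], so we know that i had BCaO first.
    """
    concordant = comparable = 0.0
    n = len(pred)
    for i in range(n):
        if event[i] != 1:
            continue
        for j in range(n):
            if i != j and time[i] < time[j]:
                comparable += 1
                if pred[i] < pred[j]:
                    concordant += 1
                elif pred[i] == pred[j]:
                    concordant += 0.5
    return concordant / comparable

# Predicted median BCaO times, observed times (months), event indicators.
pred = [50.0, 120.0, 90.0, 200.0]
time = [60.0, 100.0, 150.0, 207.0]
event = [1, 1, 0, 0]

print(l1_hinge(pred, time, event))  # 24.25
print(c_index(pred, time, event))   # 0.8
```

The example shows how the two measures capture different things: the third participant is ranked incorrectly relative to the second (lowering the C-index), and her early prediction also contributes 60 months to the L1-Hinge loss, whereas the fourth participant costs only 7 months despite being censored.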
2.7. Experimental design
The primary outcome for this study is the prediction of BCaO probabilities within the next 207 months (maximum follow-up time in currently available ATP-BCa dataset) given the participant does not have BCa at recruitment. We then compared the effectiveness of nine survival algorithms, ranging from traditional statistical models to deep learning models: Cox proportional hazard model (CoxPH) [26], CoxPH model with elastic net penalty (CoxNet) [35], accelerated failure time (AFT) [36], random survival forest (RSF) [37], gradient boosting Cox machine (GBCM) [38], component-wise gradient boosting Cox machine (CW-GBCM) [39], DeepHit [40], deep survival machine (DSM) [41], and multi-task logistic regression (MTLR) [42, 43].
We applied the same experimental procedure to all models using the ATP-BCa dataset as input:
We divided the ATP-BCa dataset into 5 cross-validation (5CV) splits, stratified for both censor indicator, δ, and BCaO time, T.
The imputation and feature selection processes were performed after the dataset was divided into the 5CV splits.
When necessary, we ran internal 5CV, within the training set, for grid-search-based hyperparameter selection (details appear in the Supplementary material), seeking the best (in terms of the lowest L1-Hinge loss or highest C-index, depending on the model) hyperparameter settings.
Discrimination measurements (L1-Hinge for individual-level prediction, C-index for pair-wise-level prediction) and the calibration measurement are reported on the predicted BCaO probability curves of each ISD model. We do not report the Brier Score (nor the Integrated Brier Score), because the high censoring rate of the ATP-BCa dataset means these measures have limited clinical utility [44].
We also propose a soft-L1-Hinge loss as the objective function and show that directly minimizing this objective function for several epochs after pre-training the original model results in better performance. We compare the training loss for this approach versus others; see the Supplementary material. The Discussion section presents the details of these evaluation metrics and their limitations in clinical application.
The above procedure is visually demonstrated in S1 Fig.
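The stratified split in the first step can be sketched as below. The sketch stratifies on the censor indicator and on a coarse binning of the time variable; the number of time bins, the function name, and the round-robin assignment are assumptions for illustration, since the text does not specify how the stratification on T was implemented.

```python
import random
from collections import defaultdict
from typing import List

def stratified_folds(times: List[float], events: List[int], n_folds: int = 5,
                     n_time_bins: int = 4, seed: int = 0) -> List[List[int]]:
    """Assign participant indices to folds, balancing (event, time-bin) strata."""
    ordered = sorted(times)
    # Time-bin edges at empirical quantiles of the observed times.
    edges = [ordered[len(ordered) * q // n_time_bins] for q in range(1, n_time_bins)]
    strata = defaultdict(list)
    for i, (t, e) in enumerate(zip(times, events)):
        time_bin = sum(t >= edge for edge in edges)
        strata[(e, time_bin)].append(i)
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    offset = 0
    for key in sorted(strata):
        members = strata[key]
        rng.shuffle(members)
        # Deal members round-robin, continuing where the previous stratum stopped.
        for pos, idx in enumerate(members):
            folds[(offset + pos) % n_folds].append(idx)
        offset += len(members)
    return folds

# Toy cohort: 100 participants, 10% with an observed event.
toy_times = [float(i) for i in range(100)]
toy_events = [1 if i % 10 == 0 else 0 for i in range(100)]
folds = stratified_folds(toy_times, toy_events)

for k, test_idx in enumerate(folds):
    held_out = set(test_idx)
    train_idx = [i for i in range(len(toy_times)) if i not in held_out]
    n_events = sum(toy_events[i] for i in test_idx)
    # Imputation, feature selection, and model fitting would use train_idx only.
    print(f"fold {k}: train={len(train_idx)} test={len(test_idx)} test events={n_events}")
```

With only about 3% uncensored participants in ATP-BCa, stratifying on the censor indicator matters: an unstratified split could leave a fold with very few observed events, which would make the per-fold metrics unstable.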
3. Results
In this section, we report the results of our experiments that compare different imputation methods, different feature selection methods, and different ISD methods for BCaO prediction in ATP-BCa dataset. Specifically, we experimented with three imputation methods–median value imputation (MVI), K-Nearest Neighbor (KNN), and Multiple Imputation by Chained Equations (MICE); four feature selection methods–univariate Cox feature selection, Recursive Feature Elimination (RFE), Minimum Redundancy Maximum Relevance (mRMR), and multivariate Cox feature selection; and several ISD models including Cox proportional hazard model (CoxPH) [26], CoxPH model with elastic net penalty (CoxNet) [35], accelerated failure time (AFT) [36], random survival forest (RSF) [37], gradient boosting Cox machine (GBCM) [38], component-wise gradient boosting Cox machine (CW-GBCM) [39], DeepHit [40], deep survival machine (DSM) [41], and multi-task logistic regression (MTLR) [42, 43].
We report the L1-Hinge loss, D-calibration, and time-invariant concordance index (see Discussion section and section A in S1 Text for a detailed explanation) as described in [24]. Note that the time-invariant concordance index differs from the standard single-time concordance in a fundamental way: the former measures concordance over the scope of event times, while the latter focuses on single-time probabilities. All reported results are the average (± standard deviation) over the five folds of cross-validation. We prefer D-calibrated models with the smallest L1-Hinge loss and use C-index to break L1-Hinge loss ties (see Discussion section). The best performing combination in our experiments was the MTLR model with MICE imputation and multivariate Cox feature selection, which was D-calibrated and had an average L1-Hinge loss of 7.173 months (smallest among all D-calibrated models).
The comprehensive results are presented in S3 Table of the Supplementary material. Due to the space constraints in the main text, we only present the comparison of (a) imputation methods with the fixed (best) feature selection method (Fig 5), (b) feature selection methods with fixed feature imputation method (Fig 6), and (c) ISD models with fixed feature imputation and feature selection methods (Table 2), respectively.
Fig 5. Imputation method comparison using multivariate Cox feature selection method.
The red rectangle shows the non-calibrated models.
Fig 6. Feature selection method comparison using MICE imputation.
The red rectangle boxes represent the non-calibrated models.
Table 2. Comparison of all ISD algorithms using MICE imputation and multivariate Cox feature selection.
| Model (MICE Imputation, Multivariate Cox Feature Selection) | L1-Hinge Loss (months), Mean | L1-Hinge Loss (months), Standard Deviation | Concordance Index, Mean | Concordance Index, Standard Deviation |
|---|---|---|---|---|
| CoxPH | 80.286 | 2.523 | 0.602 | 0.013 |
| CoxNet | 77.653 | 0.707 | 0.605 | 0.012 |
| RSF | 125.086 | 16.731 | 0.516 | 0.017 |
| AFT | 77.261 | 2.594 | 0.602 | 0.012 |
| CW-GBCM | 77.122 | 0.879 | 0.605 | 0.012 |
| GBCM | 77.763 | 0.738 | 0.586 | 0.014 |
| DeepHit* | 3.841* | 0.018* | 0.583* | 0.015* |
| DSM | 80.887 | 5.356 | 0.601 | 0.018 |
| MTLR | 7.173 | 1.341 | 0.603 | 0.015 |
* Not D-Calibrated model.
The comparison of the three data imputation methods, with the feature selection method fixed, is shown in Fig 5 and S3 Table in the Supplementary material. All models in this comparison used the multivariate Cox feature selection method. The red highlight on “DeepHit”, in both Fig 5 and S3 Table, means that the DeepHit models are not D-calibrated (see Materials and Methods section), while all other models are D-calibrated. These results demonstrate only small differences between the three imputation methods. However, MICE imputation produced models with a relatively lower average L1-hinge loss (Fig 5A), and MVI and MICE imputation both obtained a higher C-index than KNN imputation (Fig 5B). Recall that a lower L1-hinge loss indicates better performance while a higher C-index indicates better performance.
The feature selection method comparison appears in Fig 6 and S4 Table in the Supplementary material. All models in this comparison used the MICE imputation method. The red rectangle boxes in Fig 6 show the non-calibrated models (all DeepHit models). Fig 6A shows that univariate Cox and multivariate Cox methods obtain relatively lower L1-hinge losses and higher C-index in most models. The best performance is obtained using the MICE imputation method with Multivariate Cox feature selection using the MTLR model, as it has the lowest L1-hinge loss among all D-calibrated models and has a competitive C-index score. We can see that implementing feature selection (either with univariate or multivariate Cox) significantly boosted the performance for all ISD models. The poor performance of the models that used all features as input might be due to potential overfitting on the large set of features. However, RSF models without feature selection showed an obvious performance boost (mostly for L1-Hinge loss), which is probably because RSF has its own internal feature selection mechanism and thus does not require another external feature selection method.
Table 2 shows the models’ performance on the ATP-BCa dataset with the best imputation method (MICE) and the best feature selection method (multivariate Cox). We see that both discrete-time models–MTLR and DeepHit–have relatively smaller L1-hinge loss values, which may be because discrete-time models do not require the strong assumptions (a parametric distribution or proportional hazards) that the continuous-time models require. The MTLR model also has the second-best C-index value among these nine models.
While DeepHit does have the lowest L1-hinge loss, it is not the best model, for three reasons: (1) DeepHit is the only model that is not D-calibrated (see S2 Fig in the Supplementary materials), which means its probability estimates at individuals’ BCaO times are not reliable and hence not acceptable for clinical usage [45, 46]; (2) DeepHit requires that all BCaO probability curves reach 0 at the last observed time, which implies that everyone must have BCaO before the maximum time in the dataset (207 months for ATP-BCa), contrary to the ATP-BCa dataset described in the Results section; (3) DeepHit has a smaller C-index than the other models, which means it cannot reliably rank pairs of patients by risk.
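The idea behind the D-calibration check can be illustrated with a simplified sketch. It assumes only uncensored participants, 10 equal-width bins, and a Pearson chi-square test at the 0.05 level; the full procedure in [24] also handles censored participants, which this sketch omits. The function names and the example inputs are hypothetical and are not taken from our experiments.

```python
# Sketch of a D-calibration check for uncensored participants only.
# Assumption: each input is S_i(t_i), the model's predicted probability of
# remaining BCa-free at participant i's actual onset time.

CHI2_CRITICAL_DF9_ALPHA_05 = 16.919  # chi-square critical value, 9 df, alpha = 0.05


def bin_counts(surv_probs_at_event, n_bins=10):
    """Count how many S_i(t_i) values fall in each equal-width bin of [0, 1]."""
    counts = [0] * n_bins
    for p in surv_probs_at_event:
        if not 0.0 <= p <= 1.0:
            raise ValueError("survival probabilities must lie in [0, 1]")
        # p == 1.0 is placed in the last bin.
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts


def chi_square_statistic(counts):
    """Pearson chi-square statistic against a uniform expectation."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


def is_d_calibrated(surv_probs_at_event):
    """True if uniformity is not rejected at the 0.05 level (10 bins, 9 df)."""
    statistic = chi_square_statistic(bin_counts(surv_probs_at_event))
    return statistic < CHI2_CRITICAL_DF9_ALPHA_05


# Hypothetical inputs: evenly spread probabilities versus probabilities piled near 1.
well_spread = [(i + 0.5) / 100 for i in range(100)]
overconfident = [0.95] * 100
print(is_d_calibrated(well_spread))    # True
print(is_d_calibrated(overconfident))  # False
```

A model whose predicted probabilities at the observed onset times cluster in one bin, as in the second input, fails the check even if it ranks participants well.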
4. Discussion
4.1. Selected actionable lifestyle factors
In past decades, many clinical studies have used statistical or machine learning methods to identify risk factors that reliably predict BCa development. However, most of these studies focused on intrinsic risk factors such as hereditary factors [5, 6], hormones and metabolic molecules [14, 47], and reproductive factors [48, 49]. In this study, we expand the scope to include both intrinsic factors (hereditary, hormone, reproductive) and lifestyle factors. We then consider four different feature selection methods for selecting the most important factors. Experimental results (see Results section) indicate that the multivariate Cox feature selection method (with MTLR) has the best performance (smallest L1-hinge loss among all D-calibrated models). Table 3 shows the features selected by applying multivariate Cox feature selection to the entire ATP-BCa dataset (after pre-processing and MICE imputation; see Materials and Methods section).
Table 3. Important features selected by multivariate Cox feature selection.
| Feature selected | Plain description | Coefficient | Type |
|---|---|---|---|
| BLINE_AGE_AT_BASELINE | Age at recruitment | 0.3007 | Intrinsic |
| CDHQ1_SELENIUM_SPL | Selenium intake | -0.0595 | Actionable |
| HLQ_SPT_11 | Connection index with Family and Friends | 0.0194 | Actionable |
| CDHQ1_ORANGE_VEG_MYP | Orange vegetable consumption | -0.0184 | Actionable |
| CDHQ1_FISH_HI_MYP | Fish with high Omega-3 consumption | -0.0148 | Actionable |
| HLQ_FRH_1 | Age of first menstrual period | -0.0117 | Intrinsic |
| HLQ_SPT_16 | Share index with Family and Friends | 0.0094 | Actionable |
| HLQ_FRH_25 | Hormone replacement (usage) history | 0.0055 | Intrinsic a |
| CDHQ1_WHOLE_GRAIN_MYP | Whole-grain consumption | 0.0052 | Actionable |
| CDHQ1_VITAMIN_E_SPL | Vitamin E intake | 0.0010 | Actionable |
a Note that we consider hormone replacement history as an intrinsic feature from the patients’ perspective as they cannot ask for treatment on their own. However, it could be considered as actionable from the clinicians’ perspective.
Table 3 divides the selected features into two types: intrinsic and actionable. A feature is “actionable” if it can be modified–such as diet or exercise. By contrast, a feature is “intrinsic” if it cannot be changed–e.g., current age, age of first menstrual period, and hormone usage history. The positive coefficients of age and hormone usage history (in Table 3) indicate that an increase in the values of these features increases the likelihood of BCaO. Similarly, the negative coefficient of age at first menstrual period indicates that women with a later first menstrual period have reduced BCaO probability. These findings are consistent with the results reported in previous studies [14, 15, 48–50].
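To make the sign convention concrete, the following sketch converts the Table 3 coefficients into hazard ratios, assuming they are standard Cox log-hazard coefficients per one unit of each feature as encoded (the feature scaling is not restated in this section, so the magnitudes should be read with that caveat). The helper name `hazard_ratio` is ours and is used only for illustration.

```python
import math

# Coefficients copied from Table 3 (multivariate Cox feature selection).
coefficients = {
    "BLINE_AGE_AT_BASELINE": 0.3007,
    "CDHQ1_SELENIUM_SPL": -0.0595,
    "HLQ_SPT_11": 0.0194,
    "CDHQ1_ORANGE_VEG_MYP": -0.0184,
    "CDHQ1_FISH_HI_MYP": -0.0148,
    "HLQ_FRH_1": -0.0117,
    "HLQ_SPT_16": 0.0094,
    "HLQ_FRH_25": 0.0055,
    "CDHQ1_WHOLE_GRAIN_MYP": 0.0052,
    "CDHQ1_VITAMIN_E_SPL": 0.0010,
}


def hazard_ratio(coef):
    """Hazard ratio for a one-unit increase in a feature, exp(coefficient)."""
    return math.exp(coef)


# List features from largest to smallest absolute coefficient.
for name, coef in sorted(coefficients.items(), key=lambda kv: -abs(kv[1])):
    direction = "higher hazard" if coef > 0 else "lower hazard"
    print(f"{name:24s} coef={coef:+.4f}  HR={hazard_ratio(coef):.4f}  ({direction})")
```

A hazard ratio above 1 corresponds to a positive coefficient (earlier expected onset), and a hazard ratio below 1 to a negative coefficient (later expected onset).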
Below we discuss the selected actionable features from three aspects: supplement consumption, social environment index, and healthy food consumption.
4.1.1. Supplement consumption (Selenium intake and Vitamin E intake)
Selenium intake from supplements is identified as the second most important feature overall and the most important among the actionable features. Its negative coefficient in Table 3 means that larger values suggest a lower BCaO hazard–i.e., later onset of BCa. However, retrospective research regarding selenium intake and human BCa development is limited. Cann et al. [51] suggested a preventive role of selenium in BCa, which was assessed from the association between dietary preferences of U.S. and Japanese women and their relatively lower incidence rates of BCa. Although there is evidence from the study of dietary intakes and whole blood selenium levels that selenium has a preventive effect on human BCa [52], rigorous retrospective and prospective studies are still needed to confirm these hypotheses.
Additionally, there is evidence that metabolic selenium is associated with BCa incidence. A random-effects meta-analysis supports an inverse association between selenium concentration in serum and BCa risk [53]. A recent cohort study of 2295 Swedish patients also concluded that the combination of high serum iodine levels and high selenium levels was associated with a lower risk of BCa [54]. However, some researchers suggest that alterations in serum concentrations of selenium in women with BCa appear to be a consequence, rather than a cause, of cancer [55].
The small positive coefficient for Vitamin E intake suggests that a higher intake of this vitamin might increase the likelihood of BCa. Although the relationship between Vitamin E and cancer development is not yet established, some researchers suggest that Vitamin E may exert growth inhibitory effects on cancer cells, which may indicate that Vitamin E can reduce the tumor burden [56, 57]. Our findings suggest instead that Vitamin E might increase the likelihood of BCaO in ATP participants; however, the bioavailability of Vitamin E from supplements is unclear [58]. Vitamin E is hydrophobic, and the bulk of this vitamin partitions into adipose tissue (body fat) [59, 60]; hence, more research is required to determine how Vitamin E intake and bioavailability are associated with BCaO.
Beyond any direct causal influence, we hypothesize that selenium and Vitamin E intake from supplements may also have an indirect association with the incidence of BCa. This might be due to unobserved confounders that have a direct effect on both selenium/Vitamin E intake and BCa incidence. For example, women who pay more attention to personal health may choose a healthier lifestyle (e.g., smoke and drink less, exercise regularly, and maintain a healthy diet) and take their required supplements regularly. A healthy lifestyle, in turn, may play a direct role in preventing BCa, so that selenium and Vitamin E intake have only a (non-causal) correlation with BCa incidence. Finally, both selenium and Vitamin E intake were self-reported in the ATP-BCa dataset, and thus, we need an objective quantification of these variables to accurately assess their relevance for BCaO prediction.
4.1.2. Social support (connection and share index with family and friends)
Social support is a well-recognized determinant of personal health [61]. Although prospective research has shown that socially isolated individuals have a physiological milieu that promotes tumor growth [62, 63], at present there is no reported quantitative research on the impact of social environment and social isolation on cancer. Our early-stage research dataset, ATP-BCa, is the first attempt to transform the social environment into 20 semi-quantitative indicators. Table 3 shows that the multivariate Cox feature selection method has identified HLQ_SPT_11 (“Connection index with family and friends”) and HLQ_SPT_16 (“Share index with family and friends”) as important features for predicting BCaO. Surprisingly, these two features have positive coefficients, meaning they appear deleterious in the ATP-BCa dataset. This seems to differ from previous research suggesting that more social interactions might lead to a better overall quality of life for breast cancer patients [62–64]. However, note that our study focuses on personalized prediction of BCaO and not on the quality-of-life assessment of breast cancer patients. It is possible that some unobserved confounders lead to both a higher BCaO rate and a higher social support index in the ATP-BCa dataset. For example, people with a high social support index might also tend to have higher alcohol consumption and irregular eating and sleeping schedules (a “party animal” lifestyle). Again, this field of quantitative research is still in its early stages, and further research is needed to confirm whether these findings can be used for intervention purposes.
4.1.3. Healthy food consumption (orange vegetable, fish with high Omega-3, and whole grain)
Healthy eating habits are essential for good physical health and growth. In Table 3, orange vegetables (negative coefficient), fish with high omega-3 (negative coefficient), and whole-grain consumption (positive coefficient) are identified as the three most important features among all healthy foods. Some of these findings are consistent with the existing research results, as described below.
Orange vegetables: Farvid et al. [65] used CoxPH to estimate hazard ratios with fruit and vegetable consumption as risk factors for BCa. Their research suggested that higher consumption of fruits and vegetables, especially cruciferous and yellow/orange vegetables, can reduce the risk of BCa. In addition, it is well known that orange vegetables are rich in carotenoids, such as α-carotene, β-carotene, lutein, and lycopene. Many studies have demonstrated that a higher intake of these carotenoids is associated with a lower risk of pre- and postmenopausal BCa incidence [66, 67].
Fish with high Omega-3: Omega-3 fatty acids have been demonstrated to have a protective effect against BCa incidence [68]. Kim et al. [69] used a multivariate logistic regression model in a 718-Korean-patient cohort study and concluded that a high intake of fatty fish was associated with a reduced risk for BCa in both pre- and postmenopausal women. Kaizer et al. compared the incidence of BCa with the estimated consumption of fish, other foods, and nutrients for women from different countries [70]. Their results showed that the percentage of fish consumption was negatively correlated with the incidence of BCa, which was consistent with the protective effect described above as well as with our findings.
Whole grain: Studies have shown that eating more whole grains, as part of a healthy diet, can reduce the risk of multiple diseases [71]. But its preventive effect on cancer, especially BCa, has yet to be confirmed. In a case-control study, it was observed that large amounts of whole grain foods were associated with a 0.4-fold lower likelihood of BCa compared with people who rarely consumed whole grains [72]. Moderate consumption of whole grain foods is associated with a 0.6-fold lower likelihood of BCa incidence [72]. Another case-control study observed the same trend when investigating pre-menopausal BCa likelihood [73]. By contrast, the positive coefficient of whole grain intake in Table 3 suggests that higher intake is associated with an increased risk of BCa incidence in the ATP-BCa dataset. These findings should therefore be confirmed with future rigorous research investigations in a case-control setting.
4.2. Interventions on lifestyle factors
After identifying the most important actionable features from the ATP-BCa dataset, both participants and doctors may have prognostic questions such as: How can these actionable features be quantified? Which actionable features should be changed to maximize a person’s cancer-free time (that is, how to delay or prevent BCaO)? How much can we delay a woman’s BCaO through interventions on actionable risk factors? This section provides two examples of actual patients in the ATP-BCa dataset that suggest how one might use our ISD model for prognostic analysis. (While these examples are based on instances in our ATP-BCa dataset, we have made slight modifications to protect the privacy of the participants.)
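A point estimate of BCaO time can be read off an ISD curve in several ways. The sketch below assumes the median is used (the time at which the predicted BCa-free probability crosses 0.5, as in the risk-score definition of Section 4.3), that the curve starts at probability 1 at time 0, and that linear interpolation is applied between grid points. The curve values are hypothetical and do not correspond to any ATP-BCa participant.

```python
def median_time(times, surv_probs):
    """Time at which a non-increasing survival curve first reaches 0.5.

    Assumes S(0) = 1 and linearly interpolates between grid points.
    Returns None if the curve never drops to 0.5 within the grid.
    """
    prev_t, prev_s = 0.0, 1.0
    for t, s in zip(times, surv_probs):
        if s <= 0.5:
            # prev_s > 0.5 >= s here, so the denominator is positive.
            return prev_t + (prev_s - 0.5) * (t - prev_t) / (prev_s - s)
        prev_t, prev_s = t, s
    return None


# Hypothetical ISD curve evaluated at 100, 200, and 300 months.
times = [100.0, 200.0, 300.0]
surv_probs = [0.9, 0.6, 0.4]
print(median_time(times, surv_probs))  # approximately 250 months
```

Returning `None` when the curve stays above 0.5 is a simplification; an implementation would need some extrapolation rule in that case, which this sketch does not attempt.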
Example 1. Consider a female participant who was recruited at age 62 into the ATP-BCa study. Her regular follow-up revealed BCaO at the 197th month after her recruitment in the study. At recruitment, she had reported that she did not take any selenium supplement in the year preceding her recruitment day. Our learned model (MTLR + MICE imputation + multivariate Cox feature selection) predicted her BCaO time as 267 months (at 85 years of age) from the date of recruitment, using the features recorded at the time of her recruitment in the ATP-BCa dataset. Note that our predicted BCaO time is close to her actual BCaO time: 267 months vs. 197 months, an L1-Hinge loss of 70 months. In a counterfactual setting, we recomputed her BCaO time by changing her selenium intake to 47 unit/day (from 0 unit/day), which is the maximum dose in the ATP-BCa dataset, while keeping the other risk factors unchanged; here, our model predicted that her counterfactual BCaO time would be 362 months, an increase of 95 months (approximately 8 years) over the previous estimate.
Example 2. Another ATP-BCa female participant was 37 years old at the time of recruitment. The periodic follow-up with linkage to ACR found that she was diagnosed with BCa after 390 months in the ATP-BCa study. In the CDHQ-I questionnaire, she reported that she rarely eats orange vegetables (0.1 cups/day), which include pumpkin, sweet potatoes, carrots, beef and chicken mixtures, and soups [74]. Our MTLR model predicted her BCaO time as 343 months (66 years old) if she maintains her current lifestyle. Suppose she increases her daily orange vegetable intake to the average intake in the ATP-BCa dataset (0.21 cups/day). In that case, the predicted BCaO time goes up to 367 months (about 68 years old), which means we expect her to gain an additional 24 cancer-free months. If she increases orange vegetables to the maximum intake in the ATP-BCa dataset (2.31 cups/day), the predicted event time goes up to 660 months (92 years old), which gives her an additional predicted cancer-free time of about 26 years (317 months).
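The counterfactual gains in the two examples are simple differences between predicted BCaO times. The short sketch below reproduces that arithmetic from the predicted times quoted in the text; the function names are illustrative only, and the model predictions themselves are taken as given rather than recomputed.

```python
def gained_months(baseline_months, counterfactual_months):
    """Extra predicted BCa-free time from an intervention, in months."""
    return counterfactual_months - baseline_months


def months_to_years(months):
    """Convert months to years."""
    return months / 12.0


# (baseline prediction, counterfactual prediction) in months, from Examples 1 and 2.
scenarios = {
    "Example 1, selenium 0 -> 47 unit/day": (267, 362),
    "Example 2, orange vegetables 0.1 -> 0.21 cups/day": (343, 367),
    "Example 2, orange vegetables 0.1 -> 2.31 cups/day": (343, 660),
}

for label, (baseline, counterfactual) in scenarios.items():
    gain = gained_months(baseline, counterfactual)
    print(f"{label}: +{gain} months (about {months_to_years(gain):.1f} years)")
```

The three scenarios give gains of 95, 24, and 317 months, that is, roughly 7.9, 2.0, and 26.4 years.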
4.3. L1-Hinge loss versus C-index
C-index is one of the most popular metrics and has dominated the evaluation of time-to-event prediction for decades. It measures the pairwise discriminability of a survival model, with a larger value indicating superior model performance. Intuitively, the C-index only assesses a model’s ability to correctly order the event times of the patients. Although C-index is useful in several clinical problems that require such patient comparison, such as prioritizing patients for liver transplants [75], it is not an appropriate metric for evaluating patient-specific BCaO prediction models. This is because there is no need to predict the correct patient order in BCaO prediction. Why would Ms. Smith be interested in knowing whether she will get BCa before some random stranger, Ms. Jones? Also, a medical professional will administer care to a patient based on that patient’s specific risk of BCaO instead of providing recommendations based on the BCaO risk of another patient (although a clinician can use the ordered risk of patients to prioritize whom to see first).
Thus, Ms. Smith may want to know when she will likely develop BCa based on her current lifestyle. The physicians also want to know the predicted time of BCaO for Ms. Smith to decide whether to suggest appropriate cancer screening tests or provide her with an immediate treatment plan. To be able to answer these types of questions, we need a BCaO prediction model whose predicted BCaO time is close to the actual BCaO time for individuals in a given dataset–this is precisely quantified by the L1-Hinge loss metric [24]. Thus, we propose that L1-Hinge loss should be used to evaluate the performance of personalized BCaO prediction models.
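As a minimal sketch of this metric, the code below assumes the usual form of the L1-Hinge loss: the absolute error for participants with an observed BCaO, and a one-sided (hinge) penalty for censored participants, who are penalized only when the prediction falls before their censoring time. The exact definition used in our experiments is the one in [24]; the censored participants in the example are hypothetical.

```python
def l1_hinge_loss(observed_times, event_observed, predicted_times):
    """Mean L1-Hinge loss in the same time unit as the inputs.

    observed_times  : onset time if the event was observed, else censoring time
    event_observed  : True if BCaO was observed, False if censored
    predicted_times : model's predicted onset times
    """
    total = 0.0
    for t, observed, pred in zip(observed_times, event_observed, predicted_times):
        if observed:
            total += abs(t - pred)       # ordinary absolute error
        else:
            total += max(0.0, t - pred)  # only penalize predictions before censoring
    return total / len(observed_times)


# Participant from Example 1: onset at 197 months, predicted 267 months.
print(l1_hinge_loss([197], [True], [267]))  # 70.0

# Hypothetical censored participants: no penalty if predicted after censoring,
# a penalty of 50 months if predicted 50 months before censoring.
print(l1_hinge_loss([150], [False], [267]))  # 0.0
print(l1_hinge_loss([150], [False], [100]))  # 50.0
```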
It is tempting to believe that C-index and L1-Hinge loss prefer the same model–i.e., if Model 1 has a higher C-index than Model 2, then Model 1 must have a lower L1-Hinge loss, and vice-versa. However, this is not always the case. Fig 4B shows two ISD models that each predict BCaO times for three participants (A, B, and C) using their respective ISD curves. We see that the actual order of BCaO is patient C, followed by B and then A. Defining an ISD model’s risk score as (the negative of) the median value of the ISD curve, we see that Model 1 has the perfect C-index (= 1) while its predicted BCaO times are far from the actual BCaO times (L1-Hinge loss ≈ 90 months). Model 2 has the worst possible C-index (= 0), but its BCaO time estimation is more accurate (L1-Hinge loss ≈ 10 months). This example demonstrates that C-index’s preferences do not always match L1-Hinge loss’s, as these metrics measure different aspects of the model: even a model with a perfect C-index can have very poor estimates at the individual level. Moreover, L1-Hinge loss measures the accuracy of individual predictions, while C-index measures the accuracy of pairwise discrimination, which is often not relevant for personalized decision making in clinical settings because it is a relative performance measure that requires pairs of patients for its computation. Therefore, we prefer models that have the lowest L1-Hinge loss among all D-calibrated (p-value > 0.05) models.
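The disagreement between the two metrics can be reproduced with a small numerical sketch. The times below are hypothetical values chosen to match the approximate losses quoted above (they are not read from Fig 4B), all three onsets are treated as observed so that the L1-Hinge loss reduces to the mean absolute error, and the C-index is computed directly on predicted times, which is equivalent to using the negative predicted time as the risk score.

```python
from itertools import combinations


def c_index(actual, predicted):
    """Fraction of comparable pairs whose predicted order matches the actual order."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(actual)), 2):
        if actual[i] == actual[j]:
            continue  # tied actual times are not comparable
        comparable += 1
        actual_diff = actual[i] - actual[j]
        predicted_diff = predicted[i] - predicted[j]
        if predicted_diff == 0:
            concordant += 0.5  # tied predictions count as half
        elif (actual_diff > 0) == (predicted_diff > 0):
            concordant += 1.0
    return concordant / comparable


def mean_abs_error(actual, predicted):
    """L1-Hinge loss when every event is observed (no censoring)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical BCaO times in months for participants A, B, C (C is earliest).
actual = [120, 110, 100]
model_1 = [210, 200, 190]  # correct order, but every prediction is 90 months late
model_2 = [105, 110, 115]  # reversed order, but every prediction is within 15 months

print(c_index(actual, model_1), mean_abs_error(actual, model_1))  # 1.0 90.0
print(c_index(actual, model_2), mean_abs_error(actual, model_2))  # 0.0 10.0
```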
4.4. Future directions
This early-stage work addresses the need for predicting the time of a woman’s BCaO from her medical history and lifestyle features. Our observational study included several experiments to assess various combinations of data imputation, feature selection, and ISD models for computing patient-specific BCaO probability curves. We also provided guidelines for appropriate metrics for evaluating the BCaO models and discussed possible counterfactual usage of these personalized BCaO probability curves. We think that further research can address the following limitations of this paper: (1) The ATP-BCa dataset contains two categories of lifestyle features: diet-related and physical-activity-related features. While the multivariate Cox feature selection method selected 7 actionable features, none involve physical activities. This may be because the effects of the physical activity traits on BCaO have been indirectly revealed by the selected features. However, the direct effects of physical-activity-related features still require further investigation. (2) Due to the nature of the dataset, all features were derived from patients’ responses to self-reported questionnaires, which means those responses might not be objective and might contain information bias, and the range of questions might not be comprehensive. For example, the Introduction section mentioned that a patient’s genetic features, breast density features, and features extracted from medical images might lead to more accurate personalized BCaO predictions. In addition, the current database is dominated by female Caucasians; it would be useful to include people of other ethnicities. (3) Our analysis assumed that the patient’s medical history and lifestyle characteristics were stable. This implicit assumption runs throughout our study, and particularly through our discussion in this section. However, in the real world, a patient’s medical and lifestyle characteristics do change over time, and some people can even change their lifestyle habits drastically in a short span of time. These changes need to be incorporated into the datasets, and the analysis requires ISD models that can handle temporal-feature datasets.
5. Conclusion
This paper has demonstrated the effectiveness of learned individual survival distribution models for predicting a woman’s personalized BCaO probability curve (time-dependent probabilities) from both her lifestyle and health history information. We first curated the ATP-BCa dataset, which contains both health history and lifestyle information of 18,288 females who were healthy at the time of recruitment and were followed up for their BCaO. We used this ATP-BCa dataset to evaluate the effect of various combinations (3 × 5 × 9 = 135) of imputation methods, feature selection methods, and ISD models. We evaluated these combinations for both discrimination and calibration through three evaluation metrics: L1-Hinge loss, Concordance index, and D-Calibration [24]. We demonstrated the importance of L1-Hinge loss for evaluating models that compute personalized BCaO probabilities and explained how it differs from the standard concordance measure; we also showed that L1-Hinge loss is often more relevant. Our results showed that the multi-task logistic regression algorithm [42, 43], with MICE imputation and multivariate Cox feature selection, had the lowest L1-Hinge loss among all D-calibrated models and also had a competitive Concordance index for BCaO prediction. This paper also described the top ten features that were identified as important for predicting BCaO for ATP-BCa participants. Among these 10 important features, 7 were actionable lifestyle features that included supplement consumption, social support index, and food nutrient consumption. Finally, this paper suggested ways to motivate these subjects to change those lifestyle features to increase their number of BCa-free years.
Acknowledgments
We gratefully acknowledge Dr. John Mackey for helpful discussions. Alberta’s Tomorrow Project is only possible because of the commitment of its research participants, its staff, and its funders. Cancer registry data was obtained through linkage with Surveillance & Reporting, Cancer Research & Analytics, Cancer Care Alberta. The views expressed herein represent the views of the author(s) and not of Alberta’s Tomorrow Project or any of its funders.
Data Availability
There are ethical and legal restrictions preventing us from making the Alberta’s Tomorrow Project’s (ATP) data available publicly. All data is available to researchers who apply through the ATP’s standard process (https://myatpresearch.ca/DataAccess). The consent signed by the ATP participants requires that all data releases be approved by following the Access Guidelines prior to release. This is also in accordance with the Health Information Act of Alberta and Freedom of Information and Privacy Act of Alberta. The ethical approval for the release of data requires Alberta Health Services to follow the Access Guidelines. While the data has been processed according to the Alberta Health Services Non-identifying data standard, there is still some theoretical ability to identify a subset of the records. All data released by Alberta Health Services must be released under the control of a disclosure notice and a commitment form to not attempt to re-identify our participants. Thus, the ATP data used in this study is only available through a request made to ATP.
Funding Statement
Alberta Health, Alberta, Canada (Grace Shen-Tu); Canadian Breast Cancer Foundation, Prairies/NWT Chapter, Canada (Sambasivarao Damaraju); Alberta Cancer Foundation, Alberta, Canada (Grace Shen-Tu); Canadian Partnership Against Cancer and Health Canada, Ontario, Canada (Grace Shen-Tu); Alberta Health Services, Alberta, Canada (Grace Shen-Tu); Alberta Machine Intelligence Institute (Russell Greiner); Natural Sciences and Engineering Research Council of Canada (Russell Greiner).
References
- 1. Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, et al. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J Clin 2021;71:209–49. doi: 10.3322/caac.21660
- 2. Ferlay J, Colombet M, Soerjomataram I, Parkin DM, Piñeros M, Znaor A, et al. Cancer statistics for the year 2020: An overview. Int J Cancer 2021. doi: 10.1002/ijc.33588
- 3. Siegel RL, Miller KD, Fuchs HE, Jemal A. Cancer Statistics, 2021. CA Cancer J Clin 2021;71:7–33. doi: 10.3322/caac.21654
- 4. Siegel RL, Miller KD, Jemal A. Cancer statistics, 2020. CA Cancer J Clin 2020;70:7–30. doi: 10.3322/caac.21590
- 5. Feng Y, Spezia M, Huang S, Yuan C, Zeng Z, Zhang L, et al. Breast cancer development and progression: Risk factors, cancer stem cells, signaling pathways, genomics, and molecular pathogenesis. Genes Dis 2018;5:77–106. doi: 10.1016/j.gendis.2018.05.001
- 6. Sun Y-S, Zhao Z, Yang Z-N, Xu F, Lu H-J, Zhu Z-Y, et al. Risk Factors and Preventions of Breast Cancer. Int J Biol Sci 2017;13:1387–97. doi: 10.7150/ijbs.21635
- 7. Hamilton AS, Mack TM. Puberty and genetic susceptibility to breast cancer in a case-control study in twins. N Engl J Med 2003;348:2313–22. doi: 10.1056/NEJMoa021293
- 8. Anand P, Kunnumakkara AB, Kunnumakara AB, Sundaram C, Harikumar KB, Tharakan ST, et al. Cancer is a preventable disease that requires major lifestyle changes. Pharm Res 2008;25:2097–116. doi: 10.1007/s11095-008-9661-9
- 9. Wishart D. Metabolomics and the Multi-Omics View of Cancer. Metabolites 2022;12:154. doi: 10.3390/metabo12020154
- 10. Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 2018;68:394–424. doi: 10.3322/caac.21492
- 11. Beral V, Banks E, Reeves G. Effects of Estrogen-Only Treatment in Postmenopausal Women. JAMA 2004;292:683. doi: 10.1001/jama.292.6.684-a
- 12. Gray S. Breast cancer and hormone-replacement therapy: the Million Women Study. The Lancet 2003;362:1332. doi: 10.1016/S0140-6736(03)14598-9
- 13. Nur U, El Reda D, Hashim D, Weiderpass E. A prospective investigation of oral contraceptive use and breast cancer mortality: findings from the Swedish women’s lifestyle and health cohort. BMC Cancer 2019;19:807. doi: 10.1186/s12885-019-5985-6
- 14. Mørch LS, Skovlund CW, Hannaford PC, Iversen L, Fielding S, Lidegaard Ø. Contemporary Hormonal Contraception and the Risk of Breast Cancer. N Engl J Med 2017;377:2228–39. doi: 10.1056/NEJMoa1700732
- 15. Del Pup L, Codacci-Pisanelli G, Peccatori F. Breast cancer risk of hormonal contraception: Counselling considering new evidence. Crit Rev Oncol Hematol 2019;137:123–30. doi: 10.1016/j.critrevonc.2019.03.001
- 16. Ginsburg O, Bray F, Coleman MP, Vanderpuye V, Eniu A, Kotha SR, et al. The global burden of women’s cancers: a grand challenge in global health. Lancet Lond Engl 2017;389:847–60. doi: 10.1016/S0140-6736(16)31392-7
- 17. Kim G, Bahl M. Assessing Risk of Breast Cancer: A Review of Risk Prediction Models. J Breast Imaging 2021;3:144–55. doi: 10.1093/jbi/wbab001
- 18. Gail MH, Brinton LA, Byar DP, Corle DK, Green SB, Schairer C, et al. Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. J Natl Cancer Inst 1989;81:1879–86. doi: 10.1093/jnci/81.24.1879
- 19. Costantino JP, Gail MH, Pee D, Anderson S, Redmond CK, Benichou J, et al. Validation studies for models projecting the risk of invasive and total breast cancer incidence. J Natl Cancer Inst 1999;91:1541–8. doi: 10.1093/jnci/91.18.1541
- 20. Tice JA, Cummings SR, Smith-Bindman R, Ichikawa L, Barlow WE, Kerlikowske K. Using clinical factors and mammographic breast density to estimate breast cancer risk: development and validation of a new predictive model. Ann Intern Med 2008;148:337–47. doi: 10.7326/0003-4819-148-5-200803040-00004
- 21. Rosner B, Colditz GA. Nurses’ health study: log-incidence mathematical model of breast cancer incidence. J Natl Cancer Inst 1996;88:359–64. doi: 10.1093/jnci/88.6.359
- 22. Rockhill B, Byrne C, Rosner B, Louie MM, Colditz G. Breast cancer risk prediction with a log-incidence model: evaluation of accuracy. J Clin Epidemiol 2003;56:856–61. doi: 10.1016/s0895-4356(03)00124-0
- 23. Rosner BA, Colditz GA, Hankinson SE, Sullivan-Halley J, Lacey JV, Bernstein L. Validation of Rosner-Colditz breast cancer incidence model using an independent data set, the California Teachers Study. Breast Cancer Res Treat 2013;142:187–202. doi: 10.1007/s10549-013-2719-3
- 24.Haider H, Hoehn B, Davis S, Greiner R. Effective Ways to Build and Evaluate Individual Survival Distributions. J Mach Learn Res 2020;21:1–63.34305477 [Google Scholar]
- 25.Kaplan EL, Meier P. Nonparametric Estimation from Incomplete Observations. J Am Stat Assoc 1958;53:457–81. 10.2307/2281868. [DOI] [Google Scholar]
- 26.Cox DR. Regression Models and Life-Tables. J R Stat Soc Ser B Methodol 1972;34:187–202. 10.1111/j.2517-6161.1972.tb00899.x. [DOI] [Google Scholar]
- 27.Csizmadi I, Lo Siou G, Friedenreich CM, Owen N, Robson PJ. Hours spent and energy expended in physical activity domains: results from the Tomorrow Project cohort in Alberta, Canada. Int J Behav Nutr Phys Act 2011;8:110. doi: 10.1186/1479-5868-8-110 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Robson PJ, Solbak NM, Haig TR, Whelan HK, Vena JE, Akawung AK, et al. Design, methods and demographics from phase I of Alberta’s Tomorrow Project cohort: a prospective cohort profile. CMAJ Open 2016;4:E515–27. doi: 10.9778/cmajo.20160005 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Greenwood M. A Report on the Natural Duration of Cancer. Rep Nat Durat Cancer 1926. [Google Scholar]
- 30.Buuren S van, Groothuis-Oudshoorn K. mice: Multivariate Imputation by Chained Equations in R. J Stat Softw 2011;45:1–67. 10.18637/jss.v045.i03. [DOI] [Google Scholar]
- 31.Therneau TM. A Package for Survival Analysis in R. 2022. [Google Scholar]
- 32.Simon RM, Korn EL, McShane LM, Radmacher MD, Wright GW, Zhao Y. Design and analysis of DNA microarray investigations. vol. 209. Springer; 2003. [Google Scholar]
- 33.Guyon I, Weston J, Barnhill S, Vapnik V. Gene Selection for Cancer Classification using Support Vector Machines. Mach Learn 2002;46:389–422. 10.1023/A:1012487302797. [DOI] [Google Scholar]
- 34.Schröder MS, Culhane AC, Quackenbush J, Haibe-Kains B. survcomp: an R/Bioconductor package for performance assessment and comparison of survival models. Bioinformatics 2011;27:3206–8. doi: 10.1093/bioinformatics/btr511 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Simon N, Friedman J, Hastie T, Tibshirani R. Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent. J Stat Softw 2011;39:1–13. doi: 10.18637/jss.v039.i05 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Stute W. Consistent Estimation Under Random Censorship When Covariables Are Present. J Multivar Anal 1993;45:89–103. [Google Scholar]
- 37. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Ann Appl Stat 2008;2:841–60. doi: 10.1214/08-AOAS169
- 38. Ridgeway G. The state of boosting. Comput Sci Stat 1999:172–81.
- 39. Hothorn T, Bühlmann P, Dudoit S, Molinaro A, van der Laan MJ. Survival ensembles. Biostat Oxf Engl 2006;7:355–73. doi: 10.1093/biostatistics/kxj011
- 40. Lee C, Zame W, Yoon J, Van Der Schaar M. Deephit: A deep learning approach to survival analysis with competing risks. Proc. AAAI Conf. Artif. Intell., vol. 32, 2018.
- 41. Nagpal C, Li X, Dubrawski A. Deep Survival Machines: Fully Parametric Survival Regression and Representation Learning for Censored Data With Competing Risks. IEEE J Biomed Health Inform 2021;25:3163–75. doi: 10.1109/JBHI.2021.3052441
- 42. Yu C-N, Greiner R, Lin H-C, Baracos V. Learning Patient-Specific Cancer Survival Distributions as a Sequence of Dependent Regressors. Adv. Neural Inf. Process. Syst. 24 25th Annu. Conf. Neural Inf. Process. Syst. 2011. Proc. Meet. Held 12–14 Dec. 2011 Granada Spain, 2011, p. 1845–53.
- 43. Fotso S. Deep Neural Networks for Survival Analysis Based on a Multi-Task Framework. arXiv:1801.05512 [cs, stat] 2018.
- 44. Assel M, Sjoberg DD, Vickers AJ. The Brier score does not evaluate the clinical utility of diagnostic tests or prediction models. Diagn Progn Res 2017;1:19. doi: 10.1186/s41512-017-0020-3
- 45. Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW, Topic Group ‘Evaluating diagnostic tests and prediction models’ of the STRATOS initiative. Calibration: the Achilles heel of predictive analytics. BMC Med 2019;17:230. doi: 10.1186/s12916-019-1466-7
- 46. Kumar N, Qi S, Kuan L-H, Sun W, Zhang J, Greiner R. Learning accurate personalized survival models for predicting hospital discharge and mortality of COVID-19 patients. Sci Rep 2022;12:4472. doi: 10.1038/s41598-022-08601-6
- 47. Endogenous Hormones and Breast Cancer Collaborative Group, Key TJ, Appleby PN, Reeves GK, Roddam AW, Helzlsouer KJ, et al. Circulating sex hormones and breast cancer risk factors in postmenopausal women: reanalysis of 13 studies. Br J Cancer 2011;105:709–22. doi: 10.1038/bjc.2011.254
- 48. Kim Y, Yoo K-Y, Goodman MT. Differences in incidence, mortality and survival of breast cancer by regions and countries in Asia and contributing factors. Asian Pac J Cancer Prev APJCP 2015;16:2857–70. doi: 10.7314/apjcp.2015.16.7.2857
- 49. Bhadoria AS, Kapil U, Sareen N, Singh P. Reproductive factors and breast cancer: a case-control study in tertiary care hospital of North India. Indian J Cancer 2013;50:316–21. doi: 10.4103/0019-509X.123606
- 50. Mahouri K, Dehghani Zahedani M, Zare S. Breast cancer risk factors in south of Islamic Republic of Iran: a case-control study. East Mediterr Health J Rev Sante Mediterr Orient Al-Majallah Al-Sihhiyah Li-Sharq Al-Mutawassit 2007;13:1265–73. doi: 10.26719/2007.13.6.1265
- 51. Cann SA, van Netten JP, van Netten C. Hypothesis: iodine, selenium and the development of breast cancer. Cancer Causes Control CCC 2000;11:121–7. doi: 10.1023/a:1008925301459
- 52. el-Bayoumy K. Evaluation of chemopreventive agents against breast cancer and proposed strategies for future clinical intervention trials. Carcinogenesis 1994;15:2395–420. doi: 10.1093/carcin/15.11.2395
- 53. Babaknejad N, Sayehmiri F, Sayehmiri K, Rahimifar P, Bahrami S, Delpesheh A, et al. The relationship between selenium levels and breast cancer: a systematic review and meta-analysis. Biol Trace Elem Res 2014;159:1–7. doi: 10.1007/s12011-014-9998-3
- 54. Manjer J, Sandsveden M, Borgquist S. Serum Iodine and Breast Cancer Risk: A Prospective Nested Case-Control Study Stratified for Selenium Levels. Cancer Epidemiol Biomark Prev Publ Am Assoc Cancer Res Cosponsored Am Soc Prev Oncol 2020;29:1335–40. doi: 10.1158/1055-9965.EPI-20-0122
- 55. Lopez-Saez J-B, Senra-Varela A, Pousa-Estevez L. Selenium in breast cancer. Oncology 2003;64:227–31. doi: 10.1159/000069312
- 56. Malafa MP, Neitzel LT. Vitamin E succinate promotes breast cancer tumor dormancy. J Surg Res 2000;93:163–70. doi: 10.1006/jsre.2000.5948
- 57. Kline K, Yu W, Sanders BG. Vitamin E and breast cancer. J Nutr 2004;134:3458S–3462S. doi: 10.1093/jn/134.12.3458S
- 58. Borel P, Preveraud D, Desmarchelier C. Bioavailability of vitamin E in humans: an update. Nutr Rev 2013;71:319–31. doi: 10.1111/nure.12026
- 59. Bjørneboe A, Bjørneboe GE, Drevon CA. Absorption, transport and distribution of vitamin E. J Nutr 1990;120:233–42. doi: 10.1093/jn/120.3.233
- 60. Landrier J-F, Marcotorchino J, Tourniaire F. Lipophilic micronutrients and adipose tissue biology. Nutrients 2012;4:1622–49. doi: 10.3390/nu4111622
- 61. Hinzey A, Gaudier-Diaz MM, Lustberg MB, DeVries AC. Breast cancer and social environment: getting by with a little help from our friends. Breast Cancer Res BCR 2016;18:54. doi: 10.1186/s13058-016-0700-x
- 62. Williams JB, Pang D, Delgado B, Kocherginsky M, Tretiakova M, Krausz T, et al. A model of gene-environment interaction reveals altered mammary gland gene expression and increased tumor growth following social isolation. Cancer Prev Res Phila Pa 2009;2:850–61. doi: 10.1158/1940-6207.CAPR-08-0238
- 63. Hermes GL, Delgado B, Tretiakova M, Cavigelli SA, Krausz T, Conzen SD, et al. Social isolation dysregulates endocrine and behavioral stress while increasing malignant burden of spontaneous mammary tumors. Proc Natl Acad Sci U S A 2009;106:22393–8. doi: 10.1073/pnas.0910753106
- 64. Kroenke CH, Kwan ML, Neugut AI, Ergas IJ, Wright JD, Caan BJ, et al. Social networks, social support mechanisms, and quality of life after breast cancer diagnosis. Breast Cancer Res Treat 2013;139:515–27. doi: 10.1007/s10549-013-2477-2
- 65. Farvid MS, Chen WY, Rosner BA, Tamimi RM, Willett WC, Eliassen AH. Fruit and vegetable consumption and breast cancer incidence: Repeated measures over 30 years of follow-up. Int J Cancer 2019;144:1496–510. doi: 10.1002/ijc.31653
- 66. Gaudet MM, Britton JA, Kabat GC, Steck-Scott S, Eng SM, Teitelbaum SL, et al. Fruits, vegetables, and micronutrients in relation to breast cancer modified by menopause and hormone receptor status. Cancer Epidemiol Biomark Prev Publ Am Assoc Cancer Res Cosponsored Am Soc Prev Oncol 2004;13:1485–94.
- 67. Farvid MS, Chen WY, Michels KB, Cho E, Willett WC, Eliassen AH. Fruit and vegetable consumption in adolescence and early adulthood and risk of breast cancer: population based cohort study. BMJ 2016;353:i2343. doi: 10.1136/bmj.i2343
- 68. Fabian CJ, Kimler BF, Hursting SD. Omega-3 fatty acids for breast cancer prevention and survivorship. Breast Cancer Res BCR 2015;17:62. doi: 10.1186/s13058-015-0571-6
- 69. Kim J, Lim S-Y, Shin A, Sung M-K, Ro J, Kang H-S, et al. Fatty fish and fish omega-3 fatty acid intakes decrease the breast cancer risk: a case-control study. BMC Cancer 2009;9:216. doi: 10.1186/1471-2407-9-216
- 70. Kaizer L, Boyd NF, Kriukov V, Tritchler D. Fish consumption and breast cancer risk: an ecological study. Nutr Cancer 1989;12:61–8. doi: 10.1080/01635588909514002
- 71. Slavin J. Whole grains and human health. Nutr Res Rev 2004;17:99–110. doi: 10.1079/NRR200374
- 72. Mourouti N, Kontogianni MD, Papavagelis C, Psaltopoulou T, Kapetanstrataki MG, Plytzanopoulou P, et al. Whole Grain Consumption and Breast Cancer: A Case-Control Study in Women. J Am Coll Nutr 2016;35:143–9. doi: 10.1080/07315724.2014.963899
- 73. Farvid MS, Cho E, Eliassen AH, Chen WY, Willett WC. Lifetime grain consumption and breast cancer risk. Breast Cancer Res Treat 2016;159:335–45. doi: 10.1007/s10549-016-3910-0
- 74. Cleveland LE, Cook DA, Krebs-Smith SM, Friday J. Method for assessing food intakes in terms of servings based on food guidance. Am J Clin Nutr 1997;65:1254S–1263S. doi: 10.1093/ajcn/65.4.1254S
- 75. Andres A, Montano-Loza A, Greiner R, Uhlich M, Jin P, Hoehn B, et al. A novel learning algorithm to predict individual survival after liver transplantation for primary sclerosing cholangitis. PloS One 2018;13:e0193523. doi: 10.1371/journal.pone.0193523
Data Availability Statement
Ethical and legal restrictions prevent us from making the Alberta’s Tomorrow Project (ATP) data publicly available. All data are available to researchers who apply through the ATP’s standard process (https://myatpresearch.ca/DataAccess). The consent signed by ATP participants requires that any release of data be approved under the Access Guidelines beforehand. This is also in accordance with the Health Information Act of Alberta and the Freedom of Information and Privacy Act of Alberta. The ethical approval for the release of data requires Alberta Health Services to follow the Access Guidelines. Although the data have been processed according to the Alberta Health Services Non-identifying data standard, it remains theoretically possible to identify a subset of the records. Any release of data by Alberta Health Services must therefore be made under a disclosure notice and a commitment form in which the recipient agrees not to attempt to re-identify participants. Thus, the ATP data used in this study are available only through a request made to ATP.





