Abstract
Motivation: There is great interest in pathway-based methods for genomics data analysis in the research community. Although machine learning methods, such as random forests, have been developed to correlate survival outcomes with a set of genes, no study has assessed the abilities of these methods in incorporating pathway information for analyzing microarray data. In general, genes that are identified without incorporating biological knowledge are more difficult to interpret. Correlating pathway-based gene expression with survival outcomes may lead to biologically more meaningful prognosis biomarkers. Thus, a comprehensive study on how these methods perform in a pathway-based setting is warranted.
Results: In this article, we describe a pathway-based method using random forests to correlate gene expression data with survival outcomes and introduce a novel bivariate node-splitting random survival forests. The proposed method allows researchers to identify important pathways for predicting patient prognosis and time to disease progression, and to discover important genes within those pathways. We compared different implementations of random forests with different split criteria and found that bivariate node-splitting random survival forests with the log-rank test is among the best. We also performed simulation studies showing that random forests outperforms several other machine learning algorithms and gives results comparable with those of a newly developed component-wise Cox boosting model. Thus, pathway-based survival analysis using machine learning tools represents a promising approach for dissecting pathways and for generating new biological hypotheses from microarray studies.
Availability: R package Pwayrfsurvival is available from URL: http://www.duke.edu/~hp44/pwayrfsurvival.htm
Contact: pathwayrf@gmail.com
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
The availability of high-dimensional gene expression datasets has posed new statistical and computational challenges. These valuable data may help to identify new anticancer drug targets. The success story of Food and Drug Administration (FDA) approval for these drugs has been documented (Mayburd et al., 2008). Gene expression data are now commonly collected from patients in clinical trials. These data provide valuable information for identifying biomarkers of patients' prognosis and treatment response. An effective approach to analyzing and interpreting these high-dimensional data is to consider biological pathways rather than all the genes at once or one at a time. Pathways are sets of genes that serve a particular cellular or physiological function. Biology is not about a single gene: taking a pathway-based approach not only allows us to reduce the dimensionality of the problem but also makes the results more interpretable. In the past several years, many pathway-based approaches have been published, including discriminant analysis (Tai and Pan, 2007; Wu et al., 2009), enrichment analysis (Kemp et al., 2007; Subramanian et al., 2005), non-parametric regression (Wei and Li, 2007) and random forests (Pang et al., 2006; Pang and Zhao, 2008). These methods have primarily focused on discrete outcomes (classification) and continuous outcomes (regression), but not on censored outcomes (survival analysis).
Several methods have been proposed to analyze survival outcomes with microarray data, including a hierarchical Bayesian approach (Kaderali et al., 2006), ridge regression (Pawitan et al., 2004) and partial least squares (Li and Gui, 2004; Park et al., 2002). Some methods perform better than others on certain datasets, but no approach is superior to the others across all datasets. Overall, random forests is among the best for analyzing survival time using gene expression data (Schumacher et al., 2007; van Wieringen et al., 2009). However, the resulting sets of genes that are found to be good predictors may be hard to interpret. Therefore, there is a need to develop a pathway-based approach with efficient survival methods, e.g. tree-based ones, that would allow us to obtain results that are more closely tied to the biological mechanisms of disease.
This article introduces a methodology for identifying relevant pathways informative about patient prognosis and the first bivariate node-splitting random forests for censored data. Our method is based on random forests, a tree-based method developed by Breiman (2001). Survival random forests was first proposed by Breiman and it has since then been refined (Breiman, 2002). Two popular variants of random forests for survival data are conditional inference random forests (Hothorn et al., 2006a) and random survival forests (Ishwaran et al., 2008). The use of random forests for survival outcomes in analyzing gene expression allows us to identify baseline gene expression measures that help predict prognosis and select important genes in each pathway. We have also developed a pathway-based outliers plot for survival data. The details of our approach are described below. As an illustration, we applied it to two breast cancer datasets with survival outcomes.
2 METHODS
Figure 1 shows the flow chart of how to integrate microarray data with pathway information using random forests for survival outcomes. For brief definitions of survival terms used in this article, see Supplementary Material A.
Fig. 1.
A schematic diagram of pathway analysis for survival outcomes using random forests.
Random forests was first introduced for classification and regression settings (Breiman, 2001). It grows and aggregates many trees, hence the name ‘forests’. The first version of survival forests, random forests for censored data, was outlined in Breiman's notes in 2002 (Breiman, 2002) and implemented in Fortran. The three versions of random forests we used for pathway-based analysis were conditional inference random forests, random survival forests and the newly developed bivariate node-splitting random survival forests. The key difference between random forests for classification and regression and their survival counterpart is that the outcome of interest is a set of survival times with corresponding censoring indicators. As a result, the split criteria for censored data also differ from those of random forests for classification and regression, which use the Gini criterion (Breiman, 2001). These differences are outlined in more detail below. The main properties of random forests, however, remain in these algorithms. A deterministic algorithm dictates how each individual tree is formed. The trees differ owing to two factors. First, the best split is chosen from a random subset of the covariates rather than from all of them. The best split on a single predictor is chosen for conditional inference random forests and random survival forests, whereas the best split on a pair of predictors is chosen for bivariate node-splitting random survival forests. Second, every tree is built using a bootstrap sample of the observations. Moreover, unlike other tree algorithms, no form of pruning is necessary. Each observation is assigned to a leaf, the terminal node of a tree, according to the order and values of the predictor variables. Predictions are handled differently in conditional inference random forests and random survival forests, as described in more detail below.
The analysis was performed using our R package, Pwayrfsurvival, which was built on our brsf package together with randomSurvivalForest and party.
2.1 Random conditional inference forests
The conditional inference forests (CIFs) cforest implemented in R differs from original random forests in two ways. The base learners used are conditional inference trees (Hothorn et al., 2006a). Also, the aggregation scheme works by averaging observation weights extracted from each of the trees built instead of averaging predictions as in the original version.
A CIF consists of many conditional inference trees, which are built in a regression framework by a binary recursive partitioning algorithm. This algorithm has three basic steps:
Step (1) Variable selection by testing the global null hypothesis of independence between any of the predictors and the survival outcome.
Step (1a) If the hypothesis cannot be rejected, then terminate the algorithm.
Step (1b) Otherwise select the predictor with the strongest association to the survival outcome.
Step (2) Implement a binary split based on split criteria using the predictor selected in Step (1b).
Step (3) Recursively repeat Steps (1) and (2).
For the case of censored data, the association is measured by a P-value corresponding to the following linear test statistic of a single predictor and the survival outcome:
$$T_j = \sum_{i=1}^{n} 1(\text{in node})\, a_i X_{ji} \tag{1}$$
where 1(in node) = 1 if the individual is in node, 0 otherwise. Xji is the j-th covariate for the i-th subject. ai is the log-rank score (LRS) (Hothorn and Lausen, 2003) of subject i defined as follows:
Let X(1) ≤ X(2) ≤ ··· ≤ X(n) be the ordered predictor values and Yi be the response for individual i. For the survival time Si of observation i,
$$a_i = 1_i - \sum_{l=1}^{\gamma_i} \frac{1_l}{n - \gamma_l + 1} \tag{2}$$
where 1i = 1 if an event is observed for individual i and 0 otherwise, and $\gamma_l = \#\{t : S_t \le S_l\}$ is the number of observations, events or censored, occurring at Sl or before.
Since the distribution of Tj is unknown in most situations, permutation tests are employed. The conditional expectation μj and covariance Σj under the null hypothesis were derived by Strasser and Weber (1999). A standardized form of (1) can be obtained, and the default in the cforest algorithm is the quadratic form $Q = (t - \mu)\Sigma^{+}(t - \mu)^{T}$, where t is the observed linear test statistic and $\Sigma^{+}$ is the Moore-Penrose inverse of Σ. The algorithm terminates at Step (1) if the P-value exceeds one minus the prespecified mincriterion, i.e. if the null hypothesis cannot be rejected. For more details on how the best split is chosen, see Supplementary Material B.
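As an illustration of the association test, the following minimal Python sketch computes log-rank scores and a Monte Carlo permutation P-value for the linear statistic T = Σ a_i x_i within a single node. It is not the cforest implementation: the function names are hypothetical, the log-rank score is assumed to follow the Hothorn and Lausen (2003) form, and plain random permutations stand in for the conditional distribution used by the party package.

```python
import random


def log_rank_scores(times, events):
    """Log-rank score of each subject (assumed Hothorn-Lausen form)."""
    n = len(times)
    # gamma[l]: number of observations (events or censored) at or before times[l]
    gamma = [sum(1 for s in times if s <= times[l]) for l in range(n)]
    scores = []
    for i in range(n):
        cumulative = sum(events[l] / (n - gamma[l] + 1)
                         for l in range(n) if times[l] <= times[i])
        scores.append(events[i] - cumulative)
    return scores


def permutation_p_value(x, scores, n_perm=2000, seed=0):
    """Two-sided permutation P-value for the linear statistic T = sum_i a_i * x_i."""
    rng = random.Random(seed)
    n = len(x)
    # Expected value of T when the scores are randomly reassigned to subjects
    expected = sum(scores) * sum(x) / n
    observed = abs(sum(a * v for a, v in zip(scores, x)) - expected)
    shuffled = list(scores)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        stat = abs(sum(a * v for a, v in zip(shuffled, x)) - expected)
        if stat >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimated P-value is never exactly zero
    return (hits + 1) / (n_perm + 1)


# Toy node: 12 uncensored subjects and a covariate equal to the survival time
times = list(range(1, 13))
events = [1] * 12
scores = log_rank_scores(times, events)
p_value = permutation_p_value(times, scores)
```

In this toy node the covariate orders the subjects exactly by survival time, so the permutation P-value is small and the predictor would be a candidate for splitting.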
2.1.1 Ensemble survival method
Let w be the weight vector where wi = 0 iff it corresponds to a censored observation. The random CIF (Hothorn et al., 2006b) is constructed as follows:
Step (A) Draw a bootstrap sample c = (c1,…, cn) from the multinomial distribution Mult(n, p1,…, pn), where $p_i = w_i / \sum_{k=1}^{n} w_k$.
Step (B) The base learner is constructed using the conditional inference trees algorithm described in the above section. The training sample comes from Step (A).
Step (C) Repeat Steps (A and B) until it reaches the desired number of trees (ntree).
The base learners are aggregated by averaging observation weights extracted from each of the ‘ntree’ trees. This is different from the original randomForest, which averages the predictions directly. Specifically, for the survival tree Treeb grown on bootstrap sample b from Step (A), we can determine the observations that are elements of the same leaf as an observation xnew that was left out of the training sample. The aggregated sample is then $Tree_A(x_{new}) = \bigcup_{b=1}^{ntree} Tree_b(x_{new})$. An aggregated estimator of the true conditional survival function F(.|xnew) is then the Kaplan-Meier curve of TreeA(xnew).
cforest allows one to select cforest_classical (CIFc) that mimics the behavior of the original randomForest or cforest_unbiased (CIFu) which is the algorithm just described above. A key advantage of the latter random forest-type algorithm is that it can produce unbiased variable importance (VIMP) measures for classification and regression.
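For prediction, cforest uses the weighted average of the observed (log)-survival times, which is the minimizer under the quadratic loss function:
$$\hat{Y}(x_{new}) = \Big(\sum_{i=1}^{n} w_i(x_{new})\Big)^{-1} \sum_{i=1}^{n} w_i(x_{new}) \log(Y_i) \tag{3}$$
where the prediction weights are
$$w_i(x_{new}) = \sum_{b=1}^{ntree} 1(*)$$
and 1(*) = 1 when Xi and xnew are both in the same partition of tree b, for i = 1,…, n, and 0 otherwise.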
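The weighted-average prediction can be sketched in a few lines of Python. This is an illustrative reading of the description above, not the party code: the weight of a training subject is assumed to be the number of trees in which it shares a leaf with the new subject, the leaf assignments are supplied as plain lists, and the function names are hypothetical.

```python
import math


def prediction_weights(train_leaves, new_leaves):
    """train_leaves[b][i]: leaf of training subject i in tree b.
    new_leaves[b]: leaf reached by the new subject in tree b."""
    n = len(train_leaves[0])
    return [sum(1 for b, tree in enumerate(train_leaves) if tree[i] == new_leaves[b])
            for i in range(n)]


def predict_log_survival(weights, times):
    """Weighted average of the observed log survival times."""
    return sum(w * math.log(t) for w, t in zip(weights, times)) / sum(weights)


# Two trees, three training subjects
train_leaves = [[0, 0, 1],
                [1, 0, 0]]
new_leaves = [0, 0]
times = [math.exp(1), math.exp(2), math.exp(3)]

weights = prediction_weights(train_leaves, new_leaves)
prediction = predict_log_survival(weights, times)
```

Here the second training subject shares a leaf with the new subject in both trees, so it receives twice the weight of the other two.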
2.2 Random survival forests
The randomSurvivalForest (RSF) algorithm (Ishwaran et al., 2008) implemented in R more closely resembles the original survival random forests by Breiman. Like random forests, RSF builds many binary trees. However, the aggregation scheme is now based on a cumulative hazard function (CHF), described in more detail below.
Step (I) Draw bootstrap samples from the original data ntree times. For each bootstrap sample, this leaves approximately one-third of the samples out-of-bag (OOB).
Step (II) A survival tree is grown for each bootstrap sample.
Step (IIa) At each node of the tree, select a subset of the predictors at random as candidates for splitting.
Step (IIb) Using one of the splitting criteria described below, a node is split using the single predictor from Step (IIa) that maximizes the survival differences between daughter nodes.
Step (IIc) Repeat Steps (IIa) and (IIb) until each terminal node contains no more than 0.632 times the number of events.
Step(III) Calculate a CHF for each survival tree built. Aggregate the ntree trees to obtain the ensemble cumulative hazard estimate.
An extra step for calculating the OOB error rate for the ensemble is available in the software, but we will not make use of this when we compare the random conditional inference trees with RSF as the former does not have an analogous way for calculating the prediction error.
2.2.1 Split criteria
Four split criteria are available for the RSF algorithm: log-rank (LR) (Segal, 1988), standardized LRS (Hothorn and Lausen, 2003), conservation of events (Naftel et al., 1985) and random (Lin and Jeon, 2006) splitting. We describe below the two that are used throughout this article, LR and LRS. For the other two, conservation of events (CON) and random (RAN), see Supplementary Material B. A proposed split of a node on predictor x with cutoff c is of the form x ≤ c and x > c. Let n = n1 + n2 be the number of individuals in the node, where $n_1 = \sum_{i=1}^{n} 1(X_i \le c)$ is the number assigned to the first child node.
The LR test for splitting (Segal, 1988) is defined as follows:
$$LR(X, c) = \frac{\sum_{i=1}^{E}\left(d_{t_i,child_1} - R_{t_i,child_1}\frac{d_{t_i}}{R_{t_i}}\right)}{\sqrt{\sum_{i=1}^{E}\frac{R_{t_i,child_1}}{R_{t_i}}\left(1 - \frac{R_{t_i,child_1}}{R_{t_i}}\right)\left(\frac{R_{t_i} - d_{t_i}}{R_{t_i} - 1}\right)d_{t_i}}} \tag{4}$$
where E is the number of distinct event times T(1) ≤ T(2) ≤ ··· ≤ T(E) in the parent node, dti, childj is the number of events at time ti in child node j = 1, 2, and Rti, childj is the number of individuals at risk at time ti in child node j = 1, 2, i.e. the number of individuals who are alive or die at time ti. In addition, Rti = Rti, child1 + Rti, child2 and dti = dti, child1 + dti, child2. The absolute value of LR(X, c) measures the node separation. The best split is the one that maximizes the absolute value of Equation (4).
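The LR split statistic and the search for the best cutoff on a single predictor can be sketched as follows. This is a plain-Python illustration under the standard two-sample log-rank form, not the C code inside randomSurvivalForest; the function names and the toy data are hypothetical.

```python
import math


def log_rank_split(x, times, events, c):
    """Log-rank statistic for the split x <= c (child 1) versus x > c (child 2)."""
    numerator = 0.0
    variance = 0.0
    for t in sorted({s for s, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for s in times if s >= t)
        at_risk_1 = sum(1 for xi, s in zip(x, times) if s >= t and xi <= c)
        deaths = sum(1 for s, e in zip(times, events) if s == t and e == 1)
        deaths_1 = sum(1 for xi, s, e in zip(x, times, events)
                       if s == t and e == 1 and xi <= c)
        # Observed minus expected events in child 1 at this event time
        numerator += deaths_1 - at_risk_1 * deaths / at_risk
        if at_risk > 1:
            share = at_risk_1 / at_risk
            variance += share * (1 - share) * (at_risk - deaths) / (at_risk - 1) * deaths
    return numerator / math.sqrt(variance) if variance > 0 else 0.0


def best_cutoff(x, times, events):
    """Cutoff maximizing |LR|; the largest value of x is skipped (empty child 2)."""
    candidates = sorted(set(x))[:-1]
    return max(candidates, key=lambda c: abs(log_rank_split(x, times, events, c)),
               default=None)


# Toy node: six uncensored subjects whose predictor orders them by survival time
x = [1, 2, 3, 4, 5, 6]
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 1, 1, 1, 1]

lr_at_3 = log_rank_split(x, times, events, 3)
cutoff = best_cutoff(x, times, events)
```

Swapping the roles of the two child nodes only flips the sign of the statistic, which is why the split is chosen on its absolute value.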
The LRS test based on the LRS defined earlier in (2) is defined as
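$$LRS(X, c) = \frac{\sum_{X_i \le c} a_i - n_1 \mu_a}{\sqrt{n_1\left(1 - \frac{n_1}{n}\right) s_a^2}} \tag{5}$$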
where μa and sa2 are the sample mean and sample variance of ai, respectively. LRS(X, c) measures the node separation. The best split is chosen such that it maximizes the absolute value of Equation (5).
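A minimal Python sketch of the standardized log-rank score statistic is given below. It assumes that the log-rank scores a_i have already been computed and that the statistic takes the usual standardized form, i.e. the sum of scores in the first child node, centered by n1 times the mean score and scaled by the square root of n1(1 − n1/n) times the sample variance. The function name and the toy scores are illustrative only.

```python
import math
import statistics


def log_rank_score_split(x, scores, c):
    """Standardized log-rank score statistic for the split x <= c versus x > c."""
    n = len(x)
    n1 = sum(1 for xi in x if xi <= c)
    mean_a = statistics.mean(scores)
    var_a = statistics.variance(scores)  # sample variance (n - 1 denominator)
    in_child_1 = sum(a for xi, a in zip(x, scores) if xi <= c)
    scale = math.sqrt(n1 * (1 - n1 / n) * var_a)
    return (in_child_1 - n1 * mean_a) / scale if scale > 0 else 0.0


# Log-rank scores of four uncensored subjects with survival times 1, 2, 3, 4
scores = [3 / 4, 5 / 12, -1 / 12, -13 / 12]
x = [1, 2, 3, 4]

lrs = log_rank_score_split(x, scores, 2)
```

Because the scores are computed once per node, evaluating many candidate cutoffs only requires updating the partial sum of scores in the first child node.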
2.2.2 Bivariate combination split
All the above procedures use univariate variable selection. Given the much-reduced dimension in a pathway-based setting, a bivariate splitting criterion is feasible. For the LR split criterion, we implemented three bivariate splitting strategies in random survival forests by modifying the C code embedded in the R package. The LR split criterion was chosen because it performs best in simulations and gives the most consistent findings for the two real datasets. The first strategy is to split on the best pair of covariates at every node split by changing the Step (II) portion of the above algorithm:
Step (IIa) At each node of the tree, select a subset of the predictors at random as candidates for splitting.
Step (IIb) Using the LR splitting criterion, a node is split using the predictor pair from Step (IIa) that maximizes the survival differences between daughter nodes by finding the best split of the form xi + xj ≤ c for i ≠ j.
The second strategy is to find the best split of the form xi + xj ≤ c for i ≠ j or xk for any k. Using LR splitting criterion, a node is split using the predictor pair or a single predictor from Step (IIa) that maximizes the survival differences between daughter nodes.
The third strategy is to split on the best combination of a pair of predictors at every node split given some constraints to minimize computational cost:
Let i and j index the best pair of split covariates. Instead of finding the best split of the form xi + xj ≤ c, we find the best split of the form aixi + ajxj ≤ c, where ai + aj = 1 with ai, aj ≥ 0.
However, through simulations we found that the second and the third approaches did not perform well. Therefore, we focus on the first strategy and call this approach bRSF LR for bivariate random survival forests with LR splitting criterion. This strategy helps take into account the correlations among genes in the pathway. The R package to perform the bivariate split is available from brsf.
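The first strategy, an exhaustive search over predictor pairs and cutoffs of the form xi + xj ≤ c, can be sketched in Python as follows. This is an illustration of the idea rather than the brsf implementation: the candidate predictors are passed in as a dictionary, the helper names are hypothetical, and the log-rank statistic is the standard two-sample form.

```python
import math
from itertools import combinations


def log_rank(in_child_1, times, events):
    """Log-rank statistic for a node split given child-1 membership flags."""
    num = var = 0.0
    for t in sorted({s for s, e in zip(times, events) if e == 1}):
        r = sum(1 for s in times if s >= t)
        r1 = sum(1 for s, g in zip(times, in_child_1) if s >= t and g)
        d = sum(1 for s, e in zip(times, events) if s == t and e == 1)
        d1 = sum(1 for s, e, g in zip(times, events, in_child_1)
                 if s == t and e == 1 and g)
        num += d1 - r1 * d / r
        if r > 1:
            var += (r1 / r) * (1 - r1 / r) * (r - d) / (r - 1) * d
    return num / math.sqrt(var) if var > 0 else 0.0


def best_bivariate_split(columns, times, events):
    """Search all pairs (i, j) and cutoffs c for the split x_i + x_j <= c.
    Returns (|LR|, (name_i, name_j), c) for the best split found."""
    best = (0.0, None, None)
    for i, j in combinations(sorted(columns), 2):
        sums = [a + b for a, b in zip(columns[i], columns[j])]
        for c in sorted(set(sums))[:-1]:  # largest value would empty child 2
            stat = abs(log_rank([s <= c for s in sums], times, events))
            if stat > best[0]:
                best = (stat, (i, j), c)
    return best


# Toy node: g1 + g2 separates early from late events, g3 is unrelated
columns = {
    "g1": [0, 1, 0, 1, 3, 2, 3, 2],
    "g2": [1, 0, 1, 0, 2, 3, 2, 3],
    "g3": [1, 5, 2, 6, 1, 5, 2, 6],
}
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1] * 8

stat, pair, cutoff = best_bivariate_split(columns, times, events)
```

The number of pairs grows quadratically with the number of candidate predictors, which is why this kind of search is practical for pathway-sized gene sets but costly for a whole array.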
2.2.3 Ensemble CHF
Once the survival tree reaches Step (III) in the algorithm, i.e. once each terminal node contains no more than 0.632 times the number of events, the trees are aggregated to form an ensemble CHF, which is calculated by grouping hazard estimates using terminal nodes. Let L be a terminal node, {ti, L} be the distinct survival times, dti,L be the number of events and Rti,L be the number of individuals at risk at time ti,L.
The CHF estimate for a terminal node L is the Nelson–Aalen estimator
$$\hat{\Lambda}_L(t) = \sum_{t_{i,L} \le t} \frac{d_{t_{i,L}}}{R_{t_{i,L}}}.$$
All individuals within L will have the same CHF. For Q terminal nodes in a tree, there are Q different CHF values. To determine $\Lambda(t|x_{new})$ for an individual i with predictor xnew, drop xnew down the tree; xnew will fall into a unique terminal node L among the Q terminal nodes. The CHF for i is
$$\Lambda(t|x_{new}) = \hat{\Lambda}_L(t), \quad \text{if } x_{new} \in L.$$
$\Lambda(t|x_{new})$ is calculated for all new individuals in the test set.
The CHF above is defined for one tree. To compute the ensemble over all ntree survival trees, let b = 1,…, ntree index the bootstrap samples and denote the CHF for tree b as Λb*(t|xnew). The bootstrap ensemble for individual i is
$$\Lambda_e^{*}(t|x_{new}) = \frac{1}{ntree}\sum_{b=1}^{ntree} \Lambda_b^{*}(t|x_{new}).$$
For prediction, the ensemble survival calculation is based on the concept of conservation of events. Corollary 1 in Ishwaran et al. (2008) states that the total number of events in a tree is conserved within a terminal node L. The ensemble survival for individual i is defined as
$$\sum_{j} \Lambda_e^{*}(T_j|x_{new}).$$
This measures the expected value of the CHF summed over the Tj, conditioned on xnew.
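The chain from terminal-node hazards to the ensemble prediction can be illustrated with a short Python sketch. It assumes that the new individual's terminal node is already known in each tree and is represented by the survival data of the training subjects in that node; the function names are hypothetical, and the ensemble is taken as a simple average over trees.

```python
def nelson_aalen(times, events):
    """Return a function t -> Nelson-Aalen cumulative hazard for one terminal node."""
    steps = []
    for t in sorted({s for s, e in zip(times, events) if e == 1}):
        deaths = sum(1 for s, e in zip(times, events) if s == t and e == 1)
        at_risk = sum(1 for s in times if s >= t)
        steps.append((t, deaths / at_risk))

    def chf(t):
        return sum(increment for s, increment in steps if s <= t)

    return chf


def ensemble_chf(tree_chfs, t):
    """Average, over trees, of the CHF of the node the new individual falls into."""
    return sum(chf(t) for chf in tree_chfs) / len(tree_chfs)


def ensemble_survival(tree_chfs, event_times):
    """Ensemble CHF summed over the supplied event times."""
    return sum(ensemble_chf(tree_chfs, t) for t in event_times)


# The new individual lands in node A of tree 1 and node B of tree 2
node_a = nelson_aalen(times=[1, 2, 3], events=[1, 0, 1])
node_b = nelson_aalen(times=[2, 4], events=[1, 1])

chf_at_2 = ensemble_chf([node_a, node_b], 2)
expected_events = ensemble_survival([node_a, node_b], [1, 2, 3, 4])
```

A larger summed CHF corresponds to a higher expected number of events, so it can be used to rank individuals from high to low risk.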
2.3 Important pathways
We propose to use cforest and RSF for survival prediction in pathway-based analysis. Our goal is to test whether a specific set of genes, e.g. genes from the same pathway, are strong prognostic factors. One way to do this is to find the expected survival times and expected number of events from cforest and RSF, respectively. We can then split the patients into two groups of approximately equal size, with high and low expected survival times or numbers of events. Finally, we can compute an LR test to see whether there is a significant difference between the two groups. A small P-value would indicate that this set of genes is informative about the prognosis of patients and can be marked as potentially interesting. The expected survival times and numbers of events are obtained using 10-fold cross-validation. At each of the 10 iterations, 90% of the data are used to build the random forests model for survival data. The left-out 10% is then used to make predictions for individuals who were not involved in training the model. In addition, we can rank the importance of the pathways by using the LR tests. To evaluate the accuracy of pathway-based survival prediction, we also consider the area under the receiver operating characteristic (ROC) curve (AUC) approach for censored data (Heagerty et al., 2000). Time-dependent ROC analysis is an extension of the concept of ROC curves to time-dependent binary disease variables in censored data. In particular, we want to see how well the pathway-based genes predict the disease-free survival of the individuals. For expected survival times or expected number of events, E, derived from a set of predictive markers from a pathway, sensitivity and specificity are defined as functions of time t as follows: sensitivity(c, t) = P(E > c|D(t) = 1) and specificity(c, t) = P(E ≤ c|D(t) = 0), where the disease variable Di(t) = 1 if patient i has recurrent breast cancer before time t and Di(t) = 0 otherwise.
An ROC(t) is a function of t at different cutoffs c. Time-dependent ROC curve is a plot of sensitivity(c, t) versus 1 - specificity(c, t) and AUC is an accuracy measure of the ROC curve. A higher prediction accuracy is supported by a larger AUC value.
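The pathway-level test can be sketched in Python as a median split of the cross-validated predictions followed by a two-sample log-rank test. This is a simplified illustration, not the Pwayrfsurvival code: the predictions are supplied directly instead of coming from a fitted forest, the function names are hypothetical, and the P-value uses the chi-square approximation with one degree of freedom.

```python
import math
import statistics


def median_split(predicted):
    """Assign 1 to subjects whose predicted value is above the median, else 0."""
    cut = statistics.median(predicted)
    return [1 if p > cut else 0 for p in predicted]


def log_rank_test(groups, times, events):
    """Two-sample log-rank test; returns (chi-square statistic, P-value)."""
    diff = 0.0      # observed minus expected events in group 1
    variance = 0.0
    for t in sorted({s for s, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for s in times if s >= t)
        at_risk_1 = sum(1 for s, g in zip(times, groups) if s >= t and g == 1)
        deaths = sum(1 for s, e in zip(times, events) if s == t and e == 1)
        deaths_1 = sum(1 for s, e, g in zip(times, events, groups)
                       if s == t and e == 1 and g == 1)
        diff += deaths_1 - at_risk_1 * deaths / at_risk
        if at_risk > 1:
            share = at_risk_1 / at_risk
            variance += share * (1 - share) * (at_risk - deaths) / (at_risk - 1) * deaths
    if variance == 0:
        return 0.0, 1.0
    chi_square = diff ** 2 / variance
    # Upper tail of a chi-square variable with 1 df: erfc(sqrt(x / 2))
    return chi_square, math.erfc(math.sqrt(chi_square / 2))


# Hypothetical cross-validated expected numbers of events for eight patients
expected_events = [9.1, 8.4, 7.7, 6.9, 2.1, 1.8, 1.2, 0.9]
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1] * 8

groups = median_split(expected_events)
chi_square, p_value = log_rank_test(groups, times, events)
```

In practice the predictions would come from the held-out folds of the 10-fold cross-validation, so that no patient is scored by a model trained on that patient.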
2.4 Important genes
Once the important pathways are identified, we can find potentially informative and biologically relevant genes within those pathways. It is often of interest to know which of the variables are important in prediction. RSF has a way to calculate the VIMP. To obtain the VIMP for variable x, the OOB cases are dropped down their in-bag survival tree. It then assigns a child node randomly when the variable of interest x is used for the split. The CHF is then calculated for each tree and aggregated across the trees. The VIMP for a predictor x is equal to PEo − PEn, where PEo is the prediction error of the original ensemble and PEn is the prediction error of the new ensemble. cforest does not have a built-in feature for VIMP for survival outcome. Therefore, we can use PEo − PEpx where PEo is the prediction error of the original cforest and PEpx is the prediction error of the new cforest with values of predictor x randomly permuted. This is also available in RSF. These two approaches are distinct from each other. However, they are both designed to quantify which genes are most informative, i.e. contribute most to the prediction accuracy, for giving the correct pathway-based prediction on survival outcome.
2.5 Pathway-based plots
A diagram analogous to the pathway-based outlier plot introduced in Pang et al. (2006) can be drawn in the survival setting. Based on the expected number of events, approximately equal numbers of subjects are assigned to the low- and high-risk groups. The outlier measure for subject ng of group g, low or high, is defined as $out(n_g) = 1/\sum_{k}[proximity(n_g, k)]^2$, where the sum is over all k in the same group as ng. It is the inverse of the sum of squares of the proximity measure proximity(ng, k), which measures the pairwise similarities among the subjects. If cases 1 and 2 fall in the same terminal node of a tree, then the proximity measure between cases 1 and 2 is increased by one. The higher the value of the outlier measure, the more likely it is that the subject is an outlier in that particular pathway. Another useful plot is the multidimensional scaling plot, which projects the proximities among patients onto a 2D surface, giving a summary of the similarities among them and their respective ensemble survival measures.
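The proximity and outlier computations can be sketched in Python as follows. The sketch assumes that terminal-node assignments are available as one list per tree, uses raw co-occurrence counts as proximities, and excludes a subject's proximity to itself from the sum; that exclusion and the function names are assumptions made for illustration.

```python
def proximity_matrix(leaf_ids):
    """leaf_ids[b][i]: terminal node of subject i in tree b.
    Entry (i, k) counts the trees in which subjects i and k share a terminal node."""
    n = len(leaf_ids[0])
    prox = [[0] * n for _ in range(n)]
    for tree in leaf_ids:
        for i in range(n):
            for k in range(n):
                if tree[i] == tree[k]:
                    prox[i][k] += 1
    return prox


def outlier_measure(prox, groups):
    """Inverse sum of squared proximities to the other subjects of the same group."""
    measures = []
    for i, g in enumerate(groups):
        total = sum(prox[i][k] ** 2 for k in range(len(groups))
                    if k != i and groups[k] == g)
        measures.append(1 / total if total > 0 else float("inf"))
    return measures


# Two trees, four subjects, three of them in the low-risk group
leaf_ids = [[0, 0, 1, 1],
            [0, 0, 0, 1]]
groups = ["low", "low", "low", "high"]

prox = proximity_matrix(leaf_ids)
outliers = outlier_measure(prox, groups)
```

The third subject shares fewer terminal nodes with the rest of the low-risk group, so its outlier measure is larger than that of the first two subjects.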
2.6 Comparisons with other machine learning methods
We compared the random forests approach with several machine learning tools for survival data in simulations, including gradient boosting with component-wise linear models (Buhlmann, 2006; Buhlmann and Hothorn, 2007), recursive partitioning survival trees (Therneau and Atkinson, 1997), survival neural networks (Ripley and Ripley, 2001; Ripley et al., 2004) and the survival support vector machine (SVM) (Evers and Messow, 2008). The boosting approach is implemented as glmboost in the R package mboost and has its own prediction tool. The base learner for this algorithm is a univariate Cox model. At each iteration, it fits the component-wise univariate Cox model by reducing the criterion based on the squared error loss weighted by inverse probabilities of censoring. The recursive partitioning and regression trees method is available as rpart in R. This algorithm implements CART-like trees for censored outcomes. The survival neural networks method is implemented in the R package survnnet. It fits various feed-forward neural nets. Three hidden layer units and parametric survival nets for the Weibull distribution were used in our simulations. The survival support vector machine generalizes the idea of margin maximization in support vector machine classification to censored data. We used the default value of one for both the width of the Gaussian kernel and the cost of error in the predicted sequence of events.
2.7 Datasets
2.7.1 Pathways
A total of 435 pathways were used for the analysis. These pathways are wiring diagrams of genes and molecules from KEGG (Kanehisa et al., 2006) and BioCarta (http://www.biocarta.com/).
A total of 152 pathways were taken from KEGG, a pathway database with the majority responsible for metabolism, degradation and biosynthesis. There are also a few signal processing pathways and others related to human diseases.
A total of 283 pathways were taken from BioCarta. Most of these pathways are related to signal transduction for human and a smaller set of metabolic pathways.
2.7.2 Real data
We considered the datasets reported by Pawitan et al. (2004) and Miller et al. (2005). Both datasets used the Affymetrix HG-U133a chipset with 22 215 probe sets. The Miller et al. (2005) dataset has 242 breast cancer patients, whereas the Pawitan et al. (2004) dataset has 159 breast cancer patients. The outcome of interest is disease-free survival. For more details regarding these two datasets, see Supplementary Material D.
2.7.3 Simulated dataset
To understand how well our approach performs under the null and alternative hypotheses, we performed simulations. We modified the R package boost to simulate the pathway-based gene expression data (Dettling, 2004). This simulator allows us to retain the pathway correlation structure from real data. Gene expression data from a chosen pathway are used to estimate the covariance matrix from which the simulated data are generated. For the null case and the alternative case, two pathways from the Pawitan et al. dataset with large and small P-values for LR tests were chosen, respectively. The survival time Ss is defined as Ss = exp(−Xs′β + ϵ), where Xs is the simulated gene expression matrix and ϵ ∼ N(0, 0.25). For the null case, β equals 0 for all the predictors and the generated survival time is permuted for each run. Under the alternative case, β equals 1 for the top 8 informative genes and 0.1 otherwise. The censoring time (CT) was generated from N(max(Ss), 2), which left around 20–45% of the observations censored in each simulated dataset. If the generated CT is less than the generated survival time, the survival time for that individual is considered censored. Each simulation generated 35 i.i.d. genes with sample size 50, 100 or 150.
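The data-generating steps can be sketched in Python as below. This is a simplified stand-in for the modified boost simulator, with several assumptions: expression values are drawn as independent standard normals instead of from a pathway covariance matrix, N(0, 0.25) is read as a variance of 0.25, and the second parameter of N(max(Ss), 2) is read as a standard deviation. The source does not say how non-positive censoring draws were handled, so the sketch leaves them as drawn, and it is not expected to reproduce the reported censoring proportions.

```python
import math
import random


def simulate_dataset(n_samples, n_genes=35, n_informative=8, null=False, seed=0):
    """Return (expression matrix, observed times, event indicators)."""
    rng = random.Random(seed)
    if null:
        beta = [0.0] * n_genes
    else:
        beta = [1.0] * n_informative + [0.1] * (n_genes - n_informative)
    # Assumption: independent standard normal expression values
    x = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_samples)]
    # Assumption: variance 0.25, i.e. standard deviation 0.5, for the error term
    survival = [math.exp(-sum(b * v for b, v in zip(beta, row)) + rng.gauss(0.0, 0.5))
                for row in x]
    if null:
        rng.shuffle(survival)  # break any link between expression and survival
    # Assumption: censoring times drawn from a normal centered at the longest survival
    longest = max(survival)
    censoring = [rng.gauss(longest, 2.0) for _ in range(n_samples)]
    observed = [min(s, c) for s, c in zip(survival, censoring)]
    events = [0 if c < s else 1 for s, c in zip(survival, censoring)]
    return x, observed, events


x, observed, events = simulate_dataset(50, seed=1)
x_null, observed_null, events_null = simulate_dataset(50, null=True, seed=1)
```

Fixing the seed makes each simulated dataset reproducible, which is convenient when the same datasets are passed to several competing methods.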
3 RESULTS
3.1 Simulation studies
To assess the type I error rate, we simulated 100 datasets from the null hypothesis as described in Section 2.7. For every simulated dataset, we first calculated the LR test P-value from 10-fold cross-validation as described in Section 2.3. For both type I error and power, we calculated the ratio between the number of pathways having a P < 0.05 and the number of simulated datasets.
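The rejection rate used for both type I error and power is a simple proportion, and a short Python sketch also makes its Monte Carlo uncertainty explicit. The function names are hypothetical; the standard error uses the usual binomial formula and is added here only to help read the simulation tables.

```python
import math


def rejection_rate(p_values, alpha=0.05):
    """Fraction of simulated datasets whose LR test P-value falls below alpha."""
    return sum(1 for p in p_values if p < alpha) / len(p_values)


def binomial_standard_error(rate, n_datasets):
    """Monte Carlo standard error of an estimated rejection rate."""
    return math.sqrt(rate * (1 - rate) / n_datasets)


# With 100 simulated datasets and a true type I error of 0.05
se_at_nominal = binomial_standard_error(0.05, 100)

# Hypothetical P-values from five simulated datasets
rate = rejection_rate([0.01, 0.20, 0.04, 0.60, 0.33])
```

With 100 datasets the standard error at the nominal level is about 0.02, so observed rates a few percentage points away from 5% are within simulation noise.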
From Table 1, the observed type I errors were around the nominal 0.05 level across the different sample sizes for RSF with the LR test split rule (RSF LR) and RSF with the LRS split rule (RSF LRS). The type I error for CIFu was ∼0.04 for 150 samples. The boosting algorithm produced type I error rates comparable with those of CIFu and had the highest type I error at every sample size among the eight algorithms compared (tied with CIFu at 50 samples). Among the random forests methods, bRSF LR had the smallest type I error on average across the different sample sizes. Survival nnet, rpart and survival SVM were also well controlled at the nominal 0.05 level. In terms of power (Table 2), the random forests and boosting methods gave very similar results. RSF LR did well among the random forests methods, achieving 74% power with 50 samples and close to 100% for both 100 and 150 samples, and bRSF LR and RSF LRS were close to RSF LR. bRSF LR and the boosting algorithm were slightly superior for 50 samples. The boosting algorithm had the same power as RSF LR for sample sizes 100 and 150. However, CIFu seemed to do slightly worse than most of the eight methods tested in terms of type I error.
Table 1.
Simulation results under the Null - Type I error
| Sample size | bRSF LR (%) | CIFu (%) | RSF LR (%) | RSF LRS (%) | mboost (%) | nnet (%) | rpart (%) | SVM surv (%) |
|---|---|---|---|---|---|---|---|---|
| 50 | 2 | 11 | 4 | 4 | 11 | 0 | 2 | 2 |
| 100 | 3 | 8 | 4 | 1 | 10 | 1 | 2 | 1 |
| 150 | 2 | 4 | 2 | 3 | 7 | 1 | 1 | 3 |
Table 2.
Simulation results under the Alternative - Power
| Sample size | bRSF LR (%) | CIFu (%) | RSF LR (%) | RSF LRS (%) | mboost (%) | nnet (%) | rpart (%) | SVM surv (%) |
|---|---|---|---|---|---|---|---|---|
| 50 | 81 | 73 | 74 | 74 | 80 | 18 | 44 | 44 |
| 100 | 97 | 96 | 99 | 97 | 99 | 52 | 61 | 57 |
| 150 | 100 | 100 | 100 | 100 | 100 | 92 | 89 | 86 |
3.2 Applications of random forests for censored data
In this section, we applied CIF and RSF to assess their abilities in giving biological insights based on two breast cancer microarray datasets.
3.2.1 Pawitan et al. breast cancer dataset
We considered 435 gene sets for random forests prediction of breast cancer survival outcomes. Pathways were ranked using the LR test P-values from the survival prediction results as described in Section 2.3. Using a cutoff of 0.00001, our analysis indicated that the following pathways were the most informative in predicting the survival of the breast cancer patients according to the three methods used: (i) Pyrimidine metabolism and (ii) p38 MAPK Signaling Pathway. It has been found that activity of the de novo pyrimidine metabolism pathway was higher in breast cancer cells than in normal cells (Sigoillot et al., 2004). The p38 MAPK Signaling pathway is closely related to human breast cancer and mammary tumorigenesis (Bulavin et al., 2004; Demidov et al., 2007).
Using the predicted survival times, AUC analysis for CIF and RSF could tell us which pathways were better at predicting disease-free survival in patients at year 2. A high AUC means that the set of genes in a pathway has good prognostic value. The following pathways had an AUC value close to 0.70 or above in bRSF LR, CIFu or RSF LRS: (i) IGF 1 Signaling Pathway; (ii) Control of skeletal myogenesis by HDAC & calcium calmodulin dependent kinase (CaMK) pathway; and (iii) Skeletal muscle hypertrophy is regulated via AKT mTOR pathway.
3.2.2 Miller et al. breast cancer dataset
We next turned to another study of breast cancer by Miller et al. Two pathways were found to be significant at a P-value cutoff of 0.0005 across the three methods: the Cyclins and Cell Cycle Regulation (CCCR) pathway and Pyrimidine metabolism.
The Pyrimidine metabolism pathway, described in the previous section, was also shown to have a high AUC. This indicates that it had high power for predicting which patients will still be relapse free at year 2. In addition to the Pyrimidine metabolism pathway, the Cell Cycle G2 M Checkpoint pathway, the Cell Cycle G1 S Check Point pathway and the Sonic Hedgehog Receptor Ptc1 regulates cell cycle pathway all had AUC ≥ 0.7 for bRSF LR, CIFu and RSF LRS. This dataset, which has the larger sample size, predicted patient survival better than the first dataset.
In this set of pathways, several were directly related to cell cycle regulation: the Cell Cycle G1 S Check Point, Cell Cycle G2 M Checkpoint and Sonic Hedgehog Receptor Ptc1 regulates cell cycle pathways. Overall, cell cycle checkpoint pathways and how cells deal with DNA damage have been investigated in the literature (Kastan and Bartek, 2004), as this is a crucial determinant of whether an individual develops cancer. The Cell Cycle G1 S Check Point pathway is closely related to breast cancer, as described in D'Assoro et al. (2004) and Massague (2004). Similarly, the Cell Cycle G2 M checkpoint has been linked with BRCA1 (Yamane et al., 2007; Yarden et al., 2002) and has been described as a biomarker of breast cancer risk (Kaufmann et al., 2006). PTC-1, a member of the Sonic Hedgehog Receptor Ptc1 regulates cell cycle pathway, was overexpressed in a number of human breast carcinomas (Mukherjee et al., 2006). A research group also suggested the Hedgehog signaling pathway as a potential target for new therapeutics (Kameda et al., 2009).
We performed permutation analysis by permuting the survival times 100 times and found that none of the permutation P-values for the pathways in Tables 3 and 4 were less than 0.0001 (see Supplementary Material C). This indicates that the pathways found in Tables 3 and 4 are less likely to be false positives.
Table 3.
Top pathways with P ≤ 0.00001 in at least one of the following methods for the Pawitan dataset
| Pathways | Number of genes | bRSF LR P-value | bRSF LR AUC | CIFu P-value | CIFu AUC | RSF LRS P-value | RSF LRS AUC |
|---|---|---|---|---|---|---|---|
| Acetylation and Deacetylation of RelA in the Nucleus pathway | 24 | 0.0001 | 0.603 | 0.0043 | 0.584 | 0.000002 | 0.646 |
| Activation of Src by PTP alpha pathway | 23 | 0.000002 | 0.691 | 0.0002 | 0.617 | 0.0009 | 0.637 |
| Adipocytokine signaling pathway | 108 | 0.0000005 | 0.660 | 0.0013 | 0.693 | 0.0004 | 0.642 |
| AKAP95 role in mitosis and chromosome dynamics pathway | 19 | 0.0002 | 0.666 | 0.00079 | 0.651 | 0.00008 | 0.643 |
| Alanine and aspartate metabolism | 42 | 0.0009 | 0.691 | 0.00018 | 0.680 | 0.0095 | 0.608 |
| Cdc25 and chk1 Regulatory pathway in response to DNA damage pathway | 19 | 0.00001 | 0.661 | 0.0006 | 0.627 | 0.0002 | 0.626 |
| Cell cycle | 80 | 0.0132 | 0.613 | 0.00005 | 0.657 | 0.00001 | 0.673 |
| Control of skeletal myogenesis by HDAC and calcium calmodulin dependent kinase (CaMK) pathway | 66 | 0.0016 | 0.663 | 0.00001 | 0.703 | 0.00008 | 0.699 |
| IGF 1 Signaling pathway | 45 | 0.00015 | 0.663 | 0.000009 | 0.709 | 0.00035 | 0.697 |
| Lysine biosynthesis | 6 | 0.0144 | 0.557 | 0.000008 | 0.657 | 0.00009 | 0.625 |
| NFAT and Hypertrophy of the heart pathway | 107 | 0.0004 | 0.681 | 0.00001 | 0.684 | 0.00001 | 0.686 |
| p38 MAPK Signaling pathway | 85 | 0.00008 | 0.649 | 0.00006 | 0.676 | 0.000003 | 0.693 |
| Pyrimidine metabolism | 123 | 0.000009 | 0.674 | 0.00004 | 0.667 | 0.00009 | 0.609 |
| Skeletal muscle hypertrophy is regulated via AKT mTOR pathway | 45 | 0.00007 | 0.684 | 0.00018 | 0.693 | 0.000003 | 0.706 |
This P-value cutoff was chosen such that the FDR Q-value will be controlled at the 0.01 level for both datasets.
Table 4.
Top pathways with P ≤ 0.0001 in at least one of the following methods for the Miller dataset
| Pathways | Number of genes | bRSF LR P-value | bRSF LR AUC | CIFu P-value | CIFu AUC | RSF LRS P-value | RSF LRS AUC |
|---|---|---|---|---|---|---|---|
| Cell Cycle G1 S Check Point pathway | 53 | 0.0006 | 0.769 | 0.00006 | 0.733 | 0.0074 | 0.728 |
| Cell Cycle G2 M Check Point pathway | 47 | 0.000004 | 0.752 | 0.00003 | 0.751 | 0.0124 | 0.712 |
| Classical Complement pathway | 13 | 0.00007 | 0.712 | 0.0473 | 0.672 | 0.0041 | 0.709 |
| CCCR pathway | 35 | 0.00003 | 0.755 | 0.000003 | 0.729 | 0.00007 | 0.679 |
| IL 3 Signaling pathway | 20 | 0.00004 | 0.721 | 0.0012 | 0.636 | 0.00007 | 0.674 |
| NFAT and Hypertrophy of the heart pathway | 107 | 0.407 | 0.691 | 0.1615 | 0.660 | 0.00001 | 0.686 |
| p38 MAPK Signaling pathway | 85 | 0.2145 | 0.656 | 0.0223 | 0.614 | 0.000003 | 0.693 |
| Phosphorylation of MEK1 by cdk5 p35 down regulates the MAP kinase pathway | 20 | 0.0028 | 0.707 | 0.00042 | 0.692 | 0.0808 | 0.623 |
| Pyrimidine metabolism | 123 | 0.0001 | 0.720 | 0.00004 | 0.692 | 0.00039 | 0.701 |
| Regulation of BAD phosphorylation pathway | 47 | 0.0002 | 0.683 | 0.0109 | 0.630 | 0.00008 | 0.656 |
| Sonic Hedgehog Receptor Ptc1 regulates cell cycle pathway | 20 | 0.000004 | 0.752 | 0.0006 | 0.722 | 0.007 | 0.708 |
This P-value cutoff was chosen such that the FDR Q-value will be controlled at the 0.01 level for both datasets.
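To make the link between a P-value cutoff and an FDR Q-value concrete, the sketch below computes adjusted values from a list of pathway P-values. The text does not state which FDR procedure was used; the Benjamini-Hochberg step-up adjustment is assumed here, and the function name is hypothetical.

```python
def bh_q_values(p_values):
    """Benjamini-Hochberg adjusted P-values (assumed FDR procedure)."""
    m = len(p_values)
    # Indices of the P-values from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotone Q-values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q

if __name__ == "__main__":
    print(bh_q_values([0.01, 0.04, 0.03, 0.005]))
```

A pathway would then be declared significant when its Q-value falls at or below the chosen level, for example 0.01.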
3.2.3 Top pathways for both datasets
Table 5 shows the consistency of the pathways found in both datasets. For the LR tests with P ≤ 0.0001, CIFu found the largest number of top pathways common to the Pawitan and Miller datasets. Three of the RSF methods (LR, CON and RAN) each found two overlapping pathways, whereas RSF LRS found three. The majority of these pathways belong to cell signaling and cell cycle regulation. One pathway, the CCCR pathway, was found by five different methods: bRSF LR, CIFu, RSF LR, RSF CON and RSF RAN. Pyrimidine metabolism was found by bRSF LR, CIFu, RSF LR and RSF CON and was discussed in Section 3.2.1. Both CIFu and CIFc found the Cell Cycle G2 M Check Point pathway, which was discussed earlier.
Table 5.
Number of pathways found using different methods for two datasets based on LR tests with P ≤ 0.0001
| Methods | Pawitan | Miller | Both |
|---|---|---|---|
| bRSF LR | 27 | 6 | 2 |
| CIFc | 14 | 1 | 1 |
| CIFu | 25 | 4 | 4 |
| RSF LR | 21 | 6 | 2 |
| RSF CON | 35 | 3 | 2 |
| RSF LRS | 23 | 5 | 3 |
| RSF RAN | 10 | 4 | 2 |
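The counts in each row can be read as two set sizes and an intersection. The sketch below illustrates this with the bRSF LR P-values of three pathways that appear in both Tables 3 and 4; it is only a three-row illustration, not the full analysis, and the function name is hypothetical.

```python
def overlap_summary(p_first, p_second, cutoff=0.0001):
    """Count pathways passing the cutoff in each dataset and list those passing in both."""
    hits_first = {name for name, p in p_first.items() if p <= cutoff}
    hits_second = {name for name, p in p_second.items() if p <= cutoff}
    return len(hits_first), len(hits_second), sorted(hits_first & hits_second)

# bRSF LR P-values copied from Tables 3 (Pawitan) and 4 (Miller)
pawitan = {
    "NFAT and Hypertrophy of the heart pathway": 0.0004,
    "p38 MAPK Signaling pathway": 0.00008,
    "Pyrimidine metabolism": 0.000009,
}
miller = {
    "NFAT and Hypertrophy of the heart pathway": 0.407,
    "p38 MAPK Signaling pathway": 0.2145,
    "Pyrimidine metabolism": 0.0001,
}

if __name__ == "__main__":
    print(overlap_summary(pawitan, miller))
```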
In addition, the Regulation of BAD phosphorylation pathway found by RSF LRS contains BAD, a widely expressed BCL-2 family member that has been studied in the context of epidermal growth factor receptor-targeted therapy for breast cancer (Motoyama and Hynes, 2003). Expression of BAD was also known to predict outcome in breast cancer patients treated with tamoxifen (Cannings et al., 2007). The Cell Cycle G1 S Check Point pathway found by RSF RAN was discussed above. For the Activation of Src by Protein tyrosine phosphatase (PTP) alpha pathway found by CIFu, breast cancer cell lines have been found to be closely tied to PTP alpha and Src (Ardini et al., 2000; Egan et al., 1999; Zheng et al., 2008).
For the AUC analysis in Table 6, only the RSF methods bRSF LR, RSF LR and RSF LRS had pathways with AUC > 0.7 in both datasets. The pathways are the TGF beta signaling pathway and Phenylalanine tyrosine and tryptophan biosynthesis for RSF LR; the Regulation of eIF2 pathway, Cell Cycle G1 S Check Point pathway and Classical Complement pathway for bRSF LR; and the Degradation of the RAR and RXR by the proteasome pathway for RSF LRS. TGF beta signaling has been found to be a potent inhibitor of human breast cancer cell proliferation and to have a tumor-suppressor effect (Chen et al., 1998; Derynck et al., 2001). In a recent study, researchers found that it may regulate tumor cell dynamics (Tang et al., 2007). In addition, high levels of TGF beta 1 mRNA correlated with poor prognosis in breast cancer patients (de Jong et al., 1998). PKR, a kinase activated by viral infection, can phosphorylate a subunit of eIF-2. This kinase has shown elevated levels of activity in breast cancer cells (Kim et al., 2000; Nussbaum et al., 2003). For the Classical Complement pathway, complement activation can be controlled by membrane-bound complement regulators such as decay-accelerating factor (DAF) and membrane cofactor protein (MCP). It has been reported that loss of DAF was associated with poor prognosis in breast cancer (Madjd et al., 2004). Moreover, elevated MCP in breast cancers was correlated with tumor grade and recurrence (Madjd et al., 2005). Furthermore, RAR and RXR in the Degradation of the RAR and RXR by the proteasome pathway were found to be related to cancer (Altucci et al., 2007) and breast cancer cells (Wu et al., 2004).
Table 6.
Number of pathways found using different methods for two datasets based on AUC ≥ 0.7
| Methods | Pawitan | Miller | Both |
|---|---|---|---|
| bRSF LR | 17 | 54 | 3 |
| CIFc | 19 | 9 | 0 |
| CIFu | 22 | 12 | 0 |
| RSF LR | 21 | 46 | 2 |
| RSF CON | 32 | 22 | 0 |
| RSF LRS | 18 | 14 | 1 |
| RSF RAN | 11 | 23 | 0 |
CCCR pathway, NFAT and Hypertrophy of the heart pathway and Phenylalanine tyrosine and tryptophan biosynthesis (by RSF LR) will be discussed further in the next section on the identification of important genes.
The multidimensional scaling plot for the TGF beta signaling pathway (Fig. 2) illustrates that this pathway was good at separating patients with high and low numbers of events. Patients with high numbers of events, indicated by hollow circles, tend to cluster to the left, whereas those with lower numbers, indicated by solid black circles, are concentrated more on the right-hand side of the plot. The outliers plot (Fig. 3), which allows users to visualize which patients had extreme gene expression values compared with others, is particularly useful in pathway-based analysis. Shaded bars represent high risk and white bars represent low risk. In this case, we see that patient 125 in the low-risk group was more like the high-risk group in this subset of the data for the TGF beta signaling pathway. Pathway-based analysis allows us to check which patients are outlying within a pathway with respect to others, and this might help describe other physical conditions or the health of the individual.
Fig. 2.
A multidimensional scaling plot for TGF beta signaling pathway.
Fig. 3.
An outliers plot of 30 patients for TGF beta signaling pathway.
We compared our method with gene set analysis (GSA; Efron and Tibshirani, 2006) and found that our method can identify more significant pathways. Our bRSF LR method found two pathways that were significant in both datasets, whereas GSA found none (see Supplementary Material I).
3.2.4 Important genes
For the pathways found in the previous section, it may not be apparent why they were chosen. We therefore investigated the important genes in these pathways from both datasets to better understand why they were good predictors of the survival outcome of breast cancer patients.
The following are some of the top genes with high VIMP values and supporting literature evidence. First, ENO1 in Phenylalanine tyrosine and tryptophan biosynthesis was found to be significantly overexpressed in HER-2/neu positive breast tumors (Zhang et al., 2005). Second, TGFBR2 in the TGF beta signaling pathway was associated with poor prognosis and recurrence in breast cancer patients (Barlow et al., 2003; Lucke et al., 2001). Third, IGF1 in the NFAT and Hypertrophy of the heart pathway also has prognostic significance in human breast cancers (Bonneterre et al., 1990). Fourth, CDC2 in the CCCR pathway activated an apoptotic pathway in breast cancer (Choi and Kim, 2009) and had an effect on cell cycle progression in breast cancer cells (Caffarel et al., 2006). Fifth, E2F1 in the same pathway was also known to be a strong predictor of breast cancer survival outcome (Baldini et al., 2006; Vuaroqueaux et al., 2007). Sixth, a common variant of CDKN2A, also in the CCCR pathway, was associated with breast cancer risk (Debniak et al., 2008). Moreover, TK2 in the Pyrimidine metabolism pathway might have value in determining prognosis in breast cancer patients (O'Neill et al., 1992). Finally, POLD2, also in the same pathway, was one of the genes found to increase with tumor progression (Hedenfalk et al., 2001).
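One way to read a VIMP value is as the increase in prediction error when a gene's values are scrambled. The sketch below is a generic permutation-importance illustration under that reading, not the random survival forest implementation, which works with out-of-bag data. The names `predict`, `error` and `mean_squared_error` are hypothetical placeholders.

```python
import random

def mean_squared_error(y_true, y_pred):
    """Toy error measure (assumption); a survival analysis would use a censoring-aware error."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def permutation_vimp(X, y, predict, error, seed=0):
    """Importance of each column = error after permuting that column minus baseline error."""
    rng = random.Random(seed)
    baseline = error(y, [predict(row) for row in X])
    vimp = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)  # break the link between gene j and the outcome
        permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        vimp.append(error(y, [predict(row) for row in permuted]) - baseline)
    return vimp

if __name__ == "__main__":
    X = [[1.0, 5.0], [2.0, 6.0], [3.0, 7.0], [4.0, 8.0]]
    y = [1.0, 2.0, 3.0, 4.0]
    # Hypothetical model that only uses the first gene
    print(permutation_vimp(X, y, lambda row: row[0], mean_squared_error))
```

A gene that the model never uses receives an importance of zero, whereas permuting an informative gene can only leave the error unchanged or increase it in this toy setting.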
4 DISCUSSION
In this article, we have described a pathway-based approach for analyzing microarray data with censored outcomes using random CIFs and random survival forests with univariate and bivariate node splits. This approach allows us to identify pathways that are strong predictors of patients' survival and informative biomarkers. We showed two distinct ways of ranking these important pathways: (i) the LR test approach helps to identify pathways that are good at predicting patients' prognosis and (ii) the AUC helps to identify pathways that are good at correctly predicting patients who progress/survive past a certain time. This tool can help biomedical researchers develop more biologically meaningful prognostic markers. Finding these important pathways allows researchers to focus on a small gene set that explains the response of interest. We demonstrated the use of our tool with two breast cancer microarray datasets.
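The two ranking criteria can be illustrated with a small sketch. It assumes a standard two-group log-rank test between predicted low- and high-risk patients, and a simplified fixed-time AUC that drops patients censored before the time point. The simplified AUC is a stand-in chosen here for brevity, not the time-dependent ROC estimator of Heagerty et al. (2000). All function names are hypothetical.

```python
import math

def logrank_test(times, events, groups):
    """Two-group log-rank test. events: 1 = event, 0 = censored; groups: 0 or 1."""
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        at_risk = [(u, e, g) for u, e, g in zip(times, events, groups) if u >= t]
        n = len(at_risk)
        n1 = sum(1 for _, _, g in at_risk if g == 1)
        d = sum(1 for u, e, _ in at_risk if e == 1 and u == t)
        d1 = sum(1 for u, e, g in at_risk if e == 1 and u == t and g == 1)
        obs_minus_exp += d1 - d * n1 / n  # observed minus expected events in group 1
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p_value

def auc_at_time(times, events, risk, t):
    """Simplified AUC at time t: cases had an event by t, controls were followed past t."""
    cases = [r for u, e, r in zip(times, events, risk) if u <= t and e == 1]
    controls = [r for u, r in zip(times, risk) if u > t]
    if not cases or not controls:
        return float("nan")
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))

if __name__ == "__main__":
    times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    events = [1] * 10
    groups = [0] * 5 + [1] * 5
    print(logrank_test(times, events, groups))
    print(auc_at_time([1, 2, 5, 6], [1, 1, 0, 1], [0.9, 0.8, 0.2, 0.1], t=3))
```

The log-rank P-value ranks pathways by how well the predicted risk groups separate in survival, whereas the AUC ranks them by how well predicted risk orders patients around a chosen time point such as year 2.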
We compared different implementations of both the random CIFs and the random survival forests. Of the two random CIF versions, the unbiased one gave more consistent results. It also found the largest number of pathways (four) common to the Pawitan and Miller datasets. Possibly due to differences in sample size and patient-specific characteristics, the Pawitan dataset gave more significant results than the Miller dataset at the same level of statistical significance. For random survival forests, bivariate node-split LR tests, LR tests and LRS tests gave the most consistent results among the five split criteria used. These three split criteria were also able to achieve AUC > 0.7 for the same pathway in both breast cancer datasets. Our simulation studies demonstrated that the type I error rates were higher for unbiased random CIFs than for random survival forests with bivariate node-split LR, LR or LRS. However, the four random forests methods are similar in terms of power at sample sizes of 100 and 150. The bivariate node-split LR test has the highest power among the random survival forest approaches at a sample size of 50. In a comparison with another novel machine learning approach, we found that the boosting algorithm has power comparable to random survival forests but did worse in terms of type I error. From the simulation results, we see that it might be wise to repeat the cross-validation procedure when the sample size is small. The resulting estimates of the error rate should be taken with caution though (Hanczar et al., 2007). The default for our R package is to split the low- versus high-risk patients equally. To make our program more flexible, we also allow splitting at less than quartile one versus greater than or equal to quartile one, and at less than quartile three versus greater than or equal to quartile three. This should be done if there is prior biological knowledge to indicate that it makes more sense to split the patients at other quartiles.
It is well known that genes work together in groups. Random forests implicitly take into account the correlations among genes in pathways and are an ideal algorithm for modeling pathway-based survival microarray data. Pathway analysis using random forests provides a tool for researchers to combine biological information from externally available pathway databases such as KEGG and BioCarta with high-throughput data. In addition, random forests provide a multidimensional scaling plot and allow the detection of pathway-based outliers. These can tell us which subjects behave differently from others in a particular pathway, which is often useful in medical studies. This approach also has benefits over examining genes at the individual level. Moreover, we can use the VIMP measure to identify genes that are more informative for pathway-based survival prediction. This can help to pick out important genes of interest, which may turn out to be novel biomarkers. One drawback of our approach is that we do not have a pathway-based false discovery rate (FDR) control, unlike traditional gene set-based methods for binary or continuous outcomes, which make use of other background genes. However, these methods tend to be highly sensitive to the background expression of genes. The described method is one of the first to combine machine learning methods with pathway information for analyzing microarray data with survival outcomes, and this is also the first paper to introduce a bivariate node-splitting algorithm for survival random forests. It may motivate other researchers to develop further novel pathway-based methods for survival outcomes.
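The risk-group split described above can be sketched as a threshold on a per-patient predicted risk score. The score itself, the function name and the use of inclusive quartiles are assumptions made for illustration; the R package may compute its cut points differently.

```python
import statistics

def split_risk_groups(scores, cut="median"):
    """Label each patient 1 (high risk) if score >= the chosen cut point, else 0 (low risk)."""
    q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    threshold = {"q1": q1, "median": q2, "q3": q3}[cut]
    return [1 if s >= threshold else 0 for s in scores]

if __name__ == "__main__":
    scores = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical predicted risk scores
    for cut in ("median", "q1", "q3"):
        print(cut, split_risk_groups(scores, cut))
```

Splitting at the median gives two groups of equal size, whereas splitting at quartile one or quartile three produces a 25/75 or 75/25 division, which changes the group sizes entering the LR test.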
Supplementary Material
ACKNOWLEDGEMENTS
We would like to thank the reviewers for their constructive comments.
Funding: National Institutes of Health (grant GM59507); National Science Foundation (grant DMS0714817); a pilot grant from the Yale Pepper Center; start-up funds from Duke University Medical Center.
Conflict of Interest: none declared.
REFERENCES
- Altucci L, et al. RAR and RXR modulation in cancer and metabolic disease. Nat. Rev. Drug Discov. 2007;6:793–810. doi: 10.1038/nrd2397.
- Ardini E, et al. Expression of protein tyrosine phosphatase alpha (RPTPalpha) in human breast cancer correlates with low tumor grade, and inhibits tumor cell growth in vitro and in vivo. Oncogene. 2000;19:4979–4987. doi: 10.1038/sj.onc.1203869.
- Baldini E, et al. Cyclin A and E2F1 overexpression correlate with reduced disease-free survival in node-negative breast cancer patients. Anticancer Res. 2006;26:4415–4421.
- Barlow J, et al. Higher stromal expression of transforming growth factor-beta type II receptors is associated with poorer prognosis breast tumors. Breast Cancer Res. Treat. 2003;79:149–159. doi: 10.1023/a:1023918026437.
- Bonneterre J, et al. Prognostic significance of insulin-like growth factor 1 receptors in human breast cancer. Cancer Res. 1990;50:6931–6935.
- Breiman L. Random forests. Mach. Learn. 2001;45:5–32.
- Breiman L. How to use survival forests (SFPDV1). 2002. Available at http://oz.berkeley.edu/users/breiman/SF_Manual.pdf (last accessed date May 1, 2009).
- Buhlmann P. Boosting for high-dimensional linear models. Ann. Stat. 2006;34:559–583.
- Buhlmann P, Hothorn T. Boosting algorithms: regularization, prediction and model fitting. Stat. Sci. 2007;22:477–505.
- Bulavin D, et al. Inactivation of the Wip1 phosphatase inhibits mammary tumorigenesis through p38 MAPK-mediated activation of the p16(Ink4a)-p19(Arf) pathway. Nat. Genet. 2004;36:343–350. doi: 10.1038/ng1317.
- Caffarel M, et al. Delta9-tetrahydrocannabinol inhibits cell cycle progression in human breast cancer cells through Cdc2 regulation. Cancer Res. 2006;66:6615–6621. doi: 10.1158/0008-5472.CAN-05-4566.
- Cannings E, et al. Bad expression predicts outcome in patients treated with tamoxifen. Breast Cancer Res. Treat. 2007;102:173–179. doi: 10.1007/s10549-006-9323-8.
- Chen T, et al. Transforming growth factor beta type I receptor kinase mutant associated with metastatic breast cancer. Cancer Res. 1998;58:4805–4810.
- Choi E, Kim G. Apigenin causes G(2)/M arrest associated with the modulation of p21(Cip1) and Cdc2 and activates p53-dependent apoptosis pathway in human breast cancer SK-BR-3 cells. J. Nutr. Biochem. 2009;20:285–290. doi: 10.1016/j.jnutbio.2008.03.005.
- D'Assoro A, et al. Genotoxic stress leads to centrosome amplification in breast cancer cell lines that have an inactive G1/S cell cycle checkpoint. Oncogene. 2004;36:4068–4075. doi: 10.1038/sj.onc.1207568.
- Debniak T, et al. CDKN2A-positive breast cancers in young women from Poland. Breast Cancer Res. Treat. 2008;103:355–359. doi: 10.1007/s10549-006-9382-x.
- de Jong J, et al. Expression of growth factors, growth-inhibiting factors, and their receptors in invasive breast cancer. J. Pathol. 1998;184:53–57. doi: 10.1002/(SICI)1096-9896(199801)184:1<53::AID-PATH6>3.0.CO;2-7.
- Demidov O, et al. The role of the MKK6/p38 MAPK pathway in Wip1-dependent regulation of ErbB2-driven mammary gland tumorigenesis. Oncogene. 2007;26:2502–2506. doi: 10.1038/sj.onc.1210032.
- Derynck R, et al. TGF-beta signaling in tumor suppression and cancer progression. Nat. Genet. 2001;29:117–129. doi: 10.1038/ng1001-117.
- Dettling M. BagBoosting for tumor classification with gene expression data. Bioinformatics. 2004;20:3583–3593. doi: 10.1093/bioinformatics/bth447.
- Efron B, Tibshirani R. On testing the significance of sets of genes. 2006 Stanford Technical Report 2006.
- Egan C, et al. Activation of Src in human breast tumor cell lines: elevated levels of phosphotyrosine phosphatase activity that preferentially recognizes the Src carboxy terminal negative regulatory tyrosine 530. Oncogene. 1999;18:1227–1237. doi: 10.1038/sj.onc.1202233.
- Evers L, Messow C-M. Sparse kernel methods for high-dimensional survival data. Bioinformatics. 2008;15:1632–1638. doi: 10.1093/bioinformatics/btn253.
- Hanczar B, et al. Decorrelation of the true and estimated classifier errors in high-dimensional settings. EURASIP J. Bioinform. Syst. Biol. 2007:38473. doi: 10.1155/2007/38473.
- Heagerty P, et al. Time-dependent ROC curves for censored survival data and a diagnostic marker. Biometrics. 2000;56:337–344. doi: 10.1111/j.0006-341x.2000.00337.x.
- Hedenfalk I, et al. Gene-expression profiles in hereditary breast cancer. N. Engl. J. Med. 2001;344:539–548. doi: 10.1056/NEJM200102223440801.
- Hothorn T, Lausen B. On the exact distribution of maximally selected rank statistics. Comput. Stat. Data Anal. 2003;43:121–137.
- Hothorn T, et al. Unbiased recursive partitioning: a conditional inference framework. J. Comput. Graph. Stat. 2006a;15:651–674.
- Hothorn T, et al. Survival ensembles. Biostatistics. 2006b;7:355–373. doi: 10.1093/biostatistics/kxj011.
- Ishwaran H, et al. Random survival forests. Ann. Appl. Stat. 2008;2:841–860.
- Kaderali L, et al. CASPAR: a hierarchical Bayesian approach to predict survival times in cancer from gene expression data. Bioinformatics. 2006;22:1495–1502. doi: 10.1093/bioinformatics/btl103.
- Kameda C, et al. The Hedgehog pathway is a possible therapeutic target for patients with estrogen receptor-negative breast cancer. Anticancer Res. 2009;29:871–879.
- Kanehisa M, et al. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 2006;34:D354–D357. doi: 10.1093/nar/gkj102.
- Kastan M, Bartek J. Cell-cycle checkpoints and cancer. Nature. 2004;432:316–323. doi: 10.1038/nature03097.
- Kaufmann W, et al. Radiation clastogenesis and cell cycle checkpoint function as functional markers of breast cancer risk. Carcinogenesis. 2006;27:2519–2527. doi: 10.1093/carcin/bgl103.
- Kemp D, et al. Extending the pathway analysis framework with a test for transcriptional variance implicates novel pathway modulation during myogenic differentiation. Bioinformatics. 2007;23:1356–1362. doi: 10.1093/bioinformatics/btm116.
- Kim S, et al. Human breast cancer cells contain elevated levels and activity of the protein kinase, PKR. Oncogene. 2000;19:3086–3094. doi: 10.1038/sj.onc.1203632.
- Li H, Gui J. Partial Cox regression analysis for high-dimensional microarray gene expression data. Bioinformatics. 2004;20(Suppl. 1):i208–i215. doi: 10.1093/bioinformatics/bth900.
- Lin Y, Jeon Y. Random forests and adaptive nearest neighbors. J. Am. Stat. Assoc. 2006;101:578–590.
- Lucke C, et al. Inhibiting mutations in the transforming growth factor beta type 2 receptor in recurrent human breast cancer. Cancer Res. 2001;61:482–485.
- Madjd Z, et al. Loss of CD55 is associated with aggressive breast tumors. Clin. Cancer Res. 2004;10:2797–2803. doi: 10.1158/1078-0432.ccr-1073-03.
- Madjd Z, et al. Do poor-prognosis breast tumours express membrane cofactor proteins (CD46)? Cancer Immunol. Immunother. 2005;54:149–156. doi: 10.1007/s00262-004-0590-0.
- Massague J. G1 cell-cycle control and cancer. Nature. 2004;432:298–306. doi: 10.1038/nature03094.
- Mayburd A, et al. Successful anti-cancer drug targets able to pass FDA review demonstrate the identifiable signature distinct from the signatures of random genes and initially proposed targets. Bioinformatics. 2008;24:389–395. doi: 10.1093/bioinformatics/btm447.
- Miller L, et al. An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc. Natl Acad. Sci. USA. 2005;102:13550–13555. doi: 10.1073/pnas.0506230102.
- Motoyama AB, Hynes NE. BAD: a good therapeutic target? Breast Cancer Res. 2003;5:27–30. doi: 10.1186/bcr552.
- Mukherjee S, et al. Hedgehog signaling and response to cyclopamine differ in epithelial and stromal cells in benign breast and breast cancer. Cancer Biol. Ther. 2006;5:674–683. doi: 10.4161/cbt.5.6.2906.
- Naftel D, et al. Conservation of events. 1985 unpublished.
- Nussbaum J, et al. Transcriptional upregulation of interferon-induced protein kinase, PKR, in breast cancer. Cancer Lett. 2003;196:207–216. doi: 10.1016/s0304-3835(03)00276-3.
- O'Neill K, et al. Can thymidine kinase levels in breast tumors predict disease recurrence? J. Natl Cancer Inst. 1992;84:1825–1828. doi: 10.1093/jnci/84.23.1825.
- Pang H, et al. Pathway analysis using random forests classification and regression. Bioinformatics. 2006;22:2028–2036. doi: 10.1093/bioinformatics/btl344.
- Pang H, Zhao H. Building pathway clusters from Random Forests classification using class votes. BMC Bioinformatics. 2008;9:87. doi: 10.1186/1471-2105-9-87.
- Park P, et al. Linking gene expression data with patient survival times using partial least squares. Stat. Med. 2002;18(Suppl. 1):S120–S127. doi: 10.1093/bioinformatics/18.suppl_1.s120.
- Pawitan Y, et al. Gene expression profiling for prognosis using Cox regression. Stat. Med. 2004;23:1767–1780. doi: 10.1002/sim.1769.
- Ripley B, Ripley R. Neural networks as statistical methods in survival analysis. In: Dybowski R, Gant V, editors. Clinical Applications of Artificial Neural Networks. Cambridge, UK: Cambridge University Press; 2001.
- Ripley R, et al. Non-linear survival analysis using neural networks. Stat. Med. 2004;23:825–842. doi: 10.1002/sim.1655.
- Schumacher M, et al. Assessment of survival prediction models based on microarray data. Bioinformatics. 2007;23:1768–1774. doi: 10.1093/bioinformatics/btm232.
- Segal M. Regression trees for censored data. Biometrics. 1988;44:35–47.
- Sigoillot F, et al. Breakdown of the regulatory control of pyrimidine biosynthesis in human breast cancer cells. Int. J. Cancer. 2004;109:491–498. doi: 10.1002/ijc.11717.
- Strasser H, Weber C. On the asymptotic theory of permutation statistics. Math. Methods Stat. 1999;8:220–250.
- Subramanian A, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA. 2005;102:15545–15550. doi: 10.1073/pnas.0506580102.
- Tai F, Pan W. Incorporating prior knowledge of predictors into penalized classifiers with multiple penalty terms. Bioinformatics. 2007;23:1775–1782. doi: 10.1093/bioinformatics/btm234.
- Tang B, et al. Transforming growth factor-beta can suppress tumorigenesis through effects on the putative cancer stem or early progenitor cell and committed progeny in a breast cancer xenograft model. Cancer Res. 2007;67:8643–8652. doi: 10.1158/0008-5472.CAN-07-0982.
- Therneau T, Atkinson E. An introduction to recursive partitioning using the RPART routine, Mayo Foundation. Technical Report. 1997.
- van Wieringen W, et al. Survival prediction using gene expression data: a review and comparison. Comput. Stat. Data Anal. 2009;53:1590–1603.
- Vuaroqueaux V, et al. Low E2F1 transcript levels are a strong determinant of favorable breast cancer outcome. Breast Cancer Res. 2007;9:R33. doi: 10.1186/bcr1681.
- Wei Z, Li H. A Markov random field model for network-based analysis of genomic data. Bioinformatics. 2007;23:1537–1544. doi: 10.1093/bioinformatics/btm129.
- Wu Q, et al. Ubiquitinated or sumoylated retinoic acid receptor alpha determines its characteristic and interacting model with retinoid X receptor alpha in gastric and breast cancer cells. J. Mol. Endocrinol. 2004;32:595–613. doi: 10.1677/jme.0.0320595.
- Wu M, et al. Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection. Bioinformatics. 2009;25:1145–1151. doi: 10.1093/bioinformatics/btp019.
- Yamane K, et al. BRCA1 activates a G2-M cell cycle checkpoint following 6-thioguanine-induced DNA mismatch damage. Cancer Res. 2007;67:6286–6292. doi: 10.1158/0008-5472.CAN-06-2205.
- Yarden R, et al. BRCA1 regulates the G2/M checkpoint by activating Chk1 kinase upon DNA damage. Nat. Genet. 2002;30:285–289. doi: 10.1038/ng837.
- Zhang D, et al. Proteomic study reveals that proteins involved in metabolic and detoxification pathways are highly expressed in HER-2/neu-positive breast cancer. Mol. Cell Proteomics. 2005;4:1686–1696. doi: 10.1074/mcp.M400221-MCP200.
- Zheng X, et al. Apoptosis of estrogen-receptor negative breast cancer and colon cancer cell lines by PTP alpha and src RNAi. Int. J. Cancer. 2008;222:1999–2007. doi: 10.1002/ijc.23321.