Author manuscript; available in PMC: 2012 Aug 31.
Published in final edited form as: Bioinformatics. 2006 Nov 26;23(1):30–37. doi: 10.1093/bioinformatics/btl543

Improved breast cancer prognosis through the combination of clinical and genetic markers

Yijun Sun 1,3, Steve Goodison 2, Jian Li 3, Li Liu 1, William Farmerie 1
PMCID: PMC3431620  NIHMSID: NIHMS396978  PMID: 17130137

Abstract

Motivation

Accurate prognosis of breast cancer can spare a significant number of breast cancer patients from receiving unnecessary adjuvant systemic treatment and its related expensive medical costs. Recent studies have demonstrated the potential value of gene expression signatures in assessing the risk of post-surgical disease recurrence. However, these studies all attempt to develop genetic marker-based prognostic systems to replace the existing clinical criteria, while ignoring the rich information contained in established clinical markers. Given the complexity of breast cancer prognosis, a more practical strategy would be to utilize both clinical and genetic marker information that may be complementary.

Methods

A computational study is performed on publicly available microarray data, which has spawned a 70-gene prognostic signature. The recently proposed I-RELIEF algorithm is used to identify a hybrid signature through the combination of both genetic and clinical markers. A rigorous experimental protocol is used to estimate the prognostic performance of the hybrid signature and other prognostic approaches. Survival data analysis is performed to compare different prognostic approaches.

Results

The hybrid signature performs significantly better than other methods, including the 70-gene signature, clinical markers alone and the St. Gallen consensus criterion. At the 90% sensitivity level, the hybrid signature achieves 67% specificity, as compared to 47% for the 70-gene signature and 48% for the clinical markers. The odds ratio of the hybrid signature for developing distant metastases within five years between the patients with a good prognosis signature and the patients with a bad prognosis is 21.0 (95% CI: 6.5–68.3), far higher than either genetic or clinical markers alone.

1 INTRODUCTION

Breast cancer is the second most common cause of cancer death among women in the United States. In 2006, an estimated 212 000 new cases of invasive breast cancer will be diagnosed, along with 58 000 new cases of non-invasive breast cancer, and 40 000 women are expected to die from the disease (data from the American Cancer Society, 2006). The major clinical problem of breast cancer is the recurrence of therapeutically resistant disseminated disease. In many patients, microscopic or clinically evident metastases have already occurred by the time the primary tumor is diagnosed. Chemotherapy or hormonal therapy reduces the risk of distant metastases by one-third. However, it is estimated that about 70% of patients receiving treatment would have survived without it. Therefore, being able to predict disease outcomes more accurately would help physicians make informed decisions regarding the potential necessity of adjuvant treatment, and may lead to the development of individually tailored treatments to maximize the efficacy of treatment. Consequently, this would ultimately contribute to a decrease in overall breast cancer mortality, a reduction in overall health care cost and an improvement in patients’ quality of life.

Despite significant advances in the treatment of primary cancer, the ability to predict the metastatic behavior of tumors remains one of the greatest clinical challenges in oncology. Two commonly used treatment guidelines are the St. Gallen (Goldhirsch et al., 2003) and NIH (Eifel et al., 2000) consensus criteria that determine whether a patient is at a high risk of tumor recurrence and/or distant metastases based on a panel of clinical markers, such as age of patient, tumor size, the number of involved lymph nodes at the time of surgery and the aggressiveness of the cancer based on histopathological parameters. These criteria are less than precise in predicting therapy failure, with only 10% specificity at the 90% sensitivity level1. A more accurate prognostic criterion is urgently needed to avoid over- or under-treatment in newly diagnosed patients.

It has been recently established that related cellular phenotypes are generally reflected in the related patterns of cellular transcripts, implying the possibility of classifying cellular states by monitoring gene expression profiles (Golub et al., 1999). Identifying a gene signature using microarray data for breast cancer prognosis has been a central goal in some recent large-scale exploratory studies. In van't Veer et al., 2002, a 70-gene signature (also known as the Amsterdam signature) was derived from a cohort of 78 breast cancer patients, the prognostic value of which was further validated in a larger dataset (van De Vijver et al., 2002). More recently, a 76-gene signature was identified and successfully used to predict distant metastases of lymph node-negative primary breast cancer (Wang et al., 2005). These studies have shown that gene profiling can achieve a much higher specificity than the current clinical systems (50% versus 10%) at the same sensitivity level. These results are considered groundbreaking in breast cancer prognosis. A prospective and randomized study involving ~800 breast cancer patients, referred to as MINDACT (Microarray In Node negative Disease may Avoid ChemoTherapy), is currently being conducted in Europe in order to evaluate the prognostic value of the 70-gene signature (Loi et al., 2006).

The predictive values of these gene signatures are usually demonstrated through comparison with the conventional St. Gallen and NIH consensus criteria. Though the results favor gene signatures, the comparison is somewhat unfair since both St. Gallen and NIH consensus criteria perform risk assessment by following the rules derived heuristically from clinical experiences rather than carefully optimized rules2. Edén et al. showed experimentally that the clinical markers, when used as the features in a well trained neural network (NNW), performed similarly to a gene based prognostic system (Edén et al., 2004), which is in sharp contrast with the conclusions drawn in the existing studies (van't Veer et al., 2002; van De Vijver et al., 2002; Weigelt et al., 2005). Moreover, most of the existing studies attempt to use a genetic marker based prognostic system to replace the existing clinical rules, rather than incorporating the valuable clinical information. Given the complexity of breast cancer prognosis, a more practical strategy, as suggested by Brenton et al., 2005, is to utilize both clinical and genetic markers that may contain complementary information. This may lead to a more economical and accurate prognostic system. In this paper, we conduct a computational study to demonstrate the feasibility of this strategy.

The key challenge to deriving a hybrid prognostic signature from both genetic and clinical markers is feature selection. One characteristic of microarray data, different from most of the classification problems we encounter, is the extremely large feature dimensionality compared to the small sample size. The curse of dimensionality (Duda et al., 2000; Trunk, 1979) becomes a serious problem. Here, we use our recently developed I-RELIEF algorithm to select a small feature subset such that the performance of a learning algorithm is optimized. I-RELIEF employs a feature weighting strategy that assigns each feature a real-valued number, instead of a binary one, to indicate its relevance to a learning problem. The feature weighting strategy enables the employment of well established optimization techniques, and thus allows for efficient algorithmic implementation that is critical for microarray data analysis. We use a rigorous experimental protocol to estimate the classification parameters and the prognostic performance of the new hybrid signature and other prognostic approaches, including the 70-gene signature, the clinical markers alone, and the conventional St. Gallen criterion. Survival data analyses are performed to compare the different prognostic approaches. Our results clearly demonstrate the superiority of the hybrid signature over a prognostic system that uses only genetic or clinical markers.

2 MATERIALS AND METHODS

2.1 Dataset

A computational study is performed on van't Veer's data (van't Veer et al., 2002). This dataset contains expression profile information derived from samples collected from 97 lymph node-negative breast cancer patients 55 years old or younger, and associated clinical information including age, tumor size, histological grade, angioinvasion, lymphocytic infiltration, estrogen receptor (ER) and progesterone receptor (PR) status. Among the 97 patients, 46 developed distant metastases within 5 years and 51 remained metastases free for at least 5 years. The isolation of RNA from cancerous tissues, labeling of complementary RNA (cRNA), the competing hybridization of labeled cRNA with a reference pool of cRNA from all tumors to arrays containing 24 481 gene probes, and the quantization and normalization of fluorescence intensities of scanned images are described in detail in the previous publication (van't Veer et al., 2002). The task is to build a computational model to accurately predict the risk of distant recurrence of breast cancer (using a 5-year post-surgery period as the defining point commonly used in the literature). Except for a simple re-scaling of each feature value to be between 0 and 1, no other preprocessing is performed. The re-scaling is performed by using the formula:

$$\hat{x}_n^{(i)} = \frac{x_n^{(i)} - \min_m x_m^{(i)}}{\max_m x_m^{(i)} - \min_m x_m^{(i)}},$$

where $x_n^{(i)}$ is the ith feature in the nth sample. In the following, we drop the hat in $\hat{x}_n^{(i)}$ for notational brevity.
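The re-scaling can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function name `minmax_rescale` is hypothetical, samples are assumed to be stored as rows, and mapping a constant feature to 0 is our own convention for the case the formula leaves undefined.

```python
import numpy as np

def minmax_rescale(X):
    """Rescale every feature (column) of X to [0, 1] across samples (rows)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    span = col_max - col_min
    # Assumption: a constant feature (span == 0) is mapped to 0 instead of dividing by zero.
    safe_span = np.where(span > 0, span, 1.0)
    return (X - col_min) / safe_span

# Three samples, two features; the second feature is constant.
X_demo = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
X_scaled = minmax_rescale(X_demo)
```

In a cross-validation setting one would normally compute the minima and maxima on the training samples only and reuse them for the held-out sample; the text does not say which variant was used.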

2.2 Feature selection

Feature selection plays a critical role in the success of a learning algorithm in problems involving a significant number of irrelevant features. Here, we use the term feature to refer to both genetic and clinical markers. Microarray profiling is a powerful technique that allows researchers to examine the expression levels of tens of thousands of genes in a cell or a tissue simultaneously. However, it also poses a serious challenge to the existing machine-learning algorithms. With relatively small sample size, a learning algorithm can easily overfit training data, resulting in a zero training error but a very poor generalization performance on unseen data. A commonly used practice to correct for overfitting is to select a small feature subset such that the performance of a learning algorithm is optimized. Compared to the classifier design, feature selection still, to date, lacks a rigorous theoretical treatment. Most existing feature selection algorithms rely on heuristic combinatorial search and thus cannot provide any guarantee of optimality. This is largely due to the difficulty in defining an objective function that can be easily optimized by some well-established optimization techniques. In the presence of thousands of irrelevant genes, even heuristic searches become computationally unfeasible. For this reason, in microarray data analysis, nearly all of the gene selection algorithms resort to filter type methods that evaluate genes individually, e.g. t-test and Fisher score (Dudoit et al., 2002; Golub et al., 1999). The limitations of filter methods for feature selection are summarized as follows:

  1. Filter methods are unable to remove redundant features. For example, if a gene is top ranked, its co-regulated genes will also have high ranking scores. It is a well-established fact in machine learning that redundant features may not improve but rather deteriorate classification performance (Kohavi and John, 1997). This fact is largely ignored in many microarray data analyses. From the clinical perspective, the examination of the expression levels of redundant genes will not improve clinical decisions but increase medical examination costs needlessly.

  2. Filter methods evaluate the goodness of features individually, while neglecting the possible correlation information among them (Li et al., 2004; Dudoit et al., 2002). Some features may receive low ranking scores when evaluated separately, but can provide critical information when combined with other features. One possible solution to this problem is to use wrapper type methods that use a classifier to evaluate the goodness of selected feature subsets (Kohavi and John, 1997). However, with tens of thousands of features, it is computationally unfeasible to perform the combinatorial searching required in a wrapper method.

We have recently developed a new feature selection algorithm, referred to as I-RELIEF (Sun and Li, 2006), to alleviate the aforementioned drawbacks of filter methods and the computational issue of wrapper methods. I-RELIEF is one of the first feature selection algorithms that utilize the performance of a nonlinear classifier when searching for informative features, and yet can be implemented efficiently through optimization and numerical analysis techniques, instead of combinatorial searching. Below we present a brief review of I-RELIEF. Let $D = \{(x_n, y_n)\}_{n=1}^{N}$ denote a training dataset, where $x_n$ is the nth data sample and $y_n \in \{\pm 1\}$ is the corresponding class label, i.e. metastasis or no metastasis. The ith component of $x_n$ records the expression level of the ith gene in the nth sample. We define a margin for the sample $x_n$ as $\rho_n = d(x_n - \mathrm{NM}(x_n)) - d(x_n - \mathrm{NH}(x_n))$, where $\mathrm{NM}(x_n)$ and $\mathrm{NH}(x_n)$ are the nearest miss and nearest hit of $x_n$, which can be regarded as two functions that, given an input $x_n$, return the nearest neighbors of $x_n$ from the opposite and same classes, respectively, and $d(\cdot)$ is a distance function defined as $d(x) = \sum_i |x_i|$. Note that $\rho_n > 0$ if and only if $x_n$ is correctly classified by a one-nearest-neighbor classifier. One natural idea is to scale each feature such that the averaged margin in a weighted feature space is maximized:

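$$\max_{w} \sum_{n=1}^{N} \rho_n(w) = \max_{w} \sum_{n=1}^{N} \sum_{i=1}^{I} w_i \left( \left| x_n^{(i)} - \mathrm{NM}^{(i)}(x_n) \right| - \left| x_n^{(i)} - \mathrm{NH}^{(i)}(x_n) \right| \right), \quad \text{s.t. } \|w\|_2^2 = 1,\ w \ge 0, \tag{1}$$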

where $\rho_n(w)$ is the margin of $x_n$ computed with respect to $w$. The constraint $\|w\|_2^2 = 1$ prevents the maximization from increasing without bound, and $w \ge 0$ ensures that the w-weighted distance is a metric. We have proven that the optimization scheme in Equation (1) can be solved with a closed-form solution, and is equivalent to the well-known RELIEF algorithm (Kira and Rendell, 1992; Sun and Li, 2006). Note that the use of the block distance in the margin definition is consistent with the original formulation of RELIEF; other distance functions can also be used. For example, in Gilad-Bachrach et al., 2004, Euclidean distance is used in defining a margin, which, however, leads to a difficult nonconvex optimization problem. Because it uses feedback from the performance of a nonlinear classifier when searching for useful features, RELIEF usually performs better than filter methods. One major drawback of RELIEF, however, is that the nearest neighbors are defined in the original feature space, and these are highly unlikely to be the nearest neighbors in the weighted space. In the presence of many irrelevant features, which is the case in microarray data analysis, the performance of RELIEF can degrade significantly. I-RELIEF provides an analytic solution to mitigate this problem of RELIEF.
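To make the closed-form solution concrete, the sketch below accumulates the per-feature margin vector over all samples, using nearest hits and misses found with the block distance in the original feature space, and then projects it onto the constraint set. The projection (clip negatives to zero, then normalize to unit length) is the same form the paper gives in Equation (4), applied here to fixed neighbors. The function name `relief_weights` and the toy data are hypothetical, and the code assumes every class contains at least two samples.

```python
import numpy as np

def relief_weights(X, y):
    """Closed-form maximizer of the averaged margin with fixed nearest hit/miss."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_samples, n_features = X.shape
    z = np.zeros(n_features)
    for n in range(n_samples):
        diff = np.abs(X - X[n])          # per-feature |x_i - x_n| for every sample i
        dist = diff.sum(axis=1)          # block (L1) distance in the original space
        dist[n] = np.inf                 # a sample is never its own neighbor
        same = (y == y[n])
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest hit
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest miss
        z += diff[miss] - diff[hit]
    z_pos = np.maximum(z, 0.0)           # enforce w >= 0
    norm = np.linalg.norm(z_pos)
    return z_pos / norm if norm > 0 else z_pos  # enforce ||w||_2 = 1

# Toy data: feature 0 separates the classes, feature 1 does not.
X_toy = np.array([[0.0, 0.5], [0.1, 0.1], [1.0, 0.4], [0.9, 0.2]])
y_toy = np.array([1, 1, -1, -1])
w_relief = relief_weights(X_toy, y_toy)
```

On this toy set the accumulated margin is positive for feature 0 and negative for feature 1, so all weight goes to feature 0.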

We first define two sets $M_n = \{i : 1 \le i \le N,\ y_i \ne y_n\}$ and $H_n = \{i : 1 \le i \le N,\ y_i = y_n,\ i \ne n\}$, associated with each sample $x_n$. Suppose that we know, for each sample $x_n$, its nearest hit and miss, the indices of which are recorded in the set $S_n = \{(s_{n1}, s_{n2})\}$, where $s_{n1} \in M_n$ and $s_{n2} \in H_n$. Then the objective function we want to optimize can be formulated as

$$C(w) = \sum_{n=1}^{N} \left( \| x_n - x_{s_{n1}} \|_w - \| x_n - x_{s_{n2}} \|_w \right), \tag{2}$$

where $\|x\|_w = \sum_i w_i |x_i|$. Equation (2) can be easily optimized by using RELIEF. However, we do not know the set $S = \{S_n\}_{n=1}^{N}$. By following the principle of the Expectation Maximization algorithm, we regard the elements of $\{S_n\}_{n=1}^{N}$ as hidden random variables, and derive the probability distributions of the unobserved data. We first make a guess on the weight vector $w$. The probability of the ith data point being the nearest miss of $x_n$ if $i \in M_n$, or being the nearest hit of $x_n$ if $i \in H_n$, can be naturally defined as

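$$P_m(i \mid x_n, w) = \frac{f(\| x_n - x_i \|_w)}{\sum_{j \in M_n} f(\| x_n - x_j \|_w)},$$

and

$$P_h(i \mid x_n, w) = \frac{f(\| x_n - x_i \|_w)}{\sum_{j \in H_n} f(\| x_n - x_j \|_w)},$$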

respectively, where $f(\cdot)$ is a kernel function. One commonly used kernel function is $f(d) = \exp(-d/\sigma)$, where $\sigma$ is a user-defined parameter. In the experiment, we set $\sigma = 2$ based on our empirical experience. (In the Supplementary material, we show that the choice of the tuning parameter is not critical, and the algorithm performs similarly for a large range of $\sigma$ values.) For notational brevity, we define $\alpha_{i,n} = P_m(i \mid x_n, w^{(t)})$, $\beta_{i,n} = P_h(i \mid x_n, w^{(t)})$, $W = \{w : \|w\|_2 = 1,\ w \ge 0\}$, $m_{n,i} = |x_n - x_i|$ if $i \in M_n$, and $h_{n,i} = |x_n - x_i|$ if $i \in H_n$, where the absolute value is taken element-wise. I-RELIEF can be summarized as follows:

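  • Step-1: After the tth iteration, the Q function is calculated as:
    $$\begin{aligned} Q(w \mid w^{(t)}) = E_{\{S\}}[C(w)] &= \sum_{n=1}^{N} \Big( \sum_{i \in M_n} \alpha_{i,n} \| x_n - x_i \|_w - \sum_{i \in H_n} \beta_{i,n} \| x_n - x_i \|_w \Big) \\ &= \sum_{n=1}^{N} \Big( \sum_{j} w_j \sum_{i \in M_n} \alpha_{i,n} m_{n,i}^{j} - \sum_{j} w_j \sum_{i \in H_n} \beta_{i,n} h_{n,i}^{j} \Big) \\ &= w^{T} \sum_{n=1}^{N} ( \bar{m}_n - \bar{h}_n ) = w^{T} v, \end{aligned} \tag{3}$$
    where $\bar{m}_n = \sum_{i \in M_n} \alpha_{i,n} m_{n,i}$, $\bar{h}_n = \sum_{i \in H_n} \beta_{i,n} h_{n,i}$, and $m_{n,i}^{j}$ denotes the jth component of $m_{n,i}$.
  • Step-2: The re-estimation of w in the (t + 1)th iteration is:
    $$w^{(t+1)} = \arg\max_{w \in W} Q(w \mid w^{(t)}) = \frac{(v)^{+}}{\| (v)^{+} \|_2}, \tag{4}$$
    where $(v_i)^{+} = \max(v_i, 0)$. The above two steps alternate until convergence, i.e. $\| w^{(t+1)} - w^{(t)} \| < \theta$, where $\theta$ is a small positive number. In Sun and Li, 2006, we have mathematically proven that I-RELIEF converges to a unique solution regardless of the initial weights if the kernel function is properly selected. The convergence is usually achieved within a few iterations.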
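The two steps can be sketched as a short NumPy loop. This is an illustrative reading of Equations (3) and (4), not the authors' implementation: the name `irelief`, the uniform initial weights, the iteration cap `max_iter` and the Euclidean norm in the stopping test are our assumptions. The dense array of pairwise differences is only practical for small problems, and every class is assumed to contain at least two samples.

```python
import numpy as np

def irelief(X, y, sigma=2.0, theta=1e-6, max_iter=100):
    """Iterate Step-1 (expected margin vector v) and Step-2 (projection onto W)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_samples, n_features = X.shape
    w = np.full(n_features, 1.0 / np.sqrt(n_features))  # assumed start, ||w||_2 = 1
    absdiff = np.abs(X[:, None, :] - X[None, :, :])     # [n, i, :] = |x_n - x_i|
    same = (y[:, None] == y[None, :])
    hit_mask = same & ~np.eye(n_samples, dtype=bool)    # H_n: same class, i != n
    miss_mask = ~same                                   # M_n: opposite class
    for _ in range(max_iter):
        dist = absdiff @ w                              # [n, i] = ||x_n - x_i||_w
        kern = np.exp(-dist / sigma)                    # kernel f(d) = exp(-d / sigma)
        alpha = kern * miss_mask
        alpha /= alpha.sum(axis=1, keepdims=True)       # alpha[n, i] = P_m(i | x_n, w)
        beta = kern * hit_mask
        beta /= beta.sum(axis=1, keepdims=True)         # beta[n, i] = P_h(i | x_n, w)
        # Step-1: v = sum_n (m_bar_n - h_bar_n)
        v = np.einsum("ni,nij->j", alpha - beta, absdiff)
        # Step-2: w = (v)+ / ||(v)+||_2
        v_pos = np.maximum(v, 0.0)
        norm = np.linalg.norm(v_pos)
        if norm == 0.0:
            break                                       # degenerate case: keep current w
        w_new = v_pos / norm
        converged = np.linalg.norm(w_new - w) < theta
        w = w_new
        if converged:
            break
    return w

# Toy data: feature 0 separates the classes, feature 1 does not.
X_toy = np.array([[0.0, 0.5], [0.1, 0.1], [1.0, 0.4], [0.9, 0.2]])
y_toy = np.array([1, 1, -1, -1])
w_irelief = irelief(X_toy, y_toy)
```

Unlike the fixed-neighbor computation in RELIEF, the neighbor probabilities here are recomputed from the current weights in every pass, which is the point of the iteration.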

I-RELIEF combines the merits of both filter and wrapper methods. Note that the objective function optimized by I-RELIEF approximates the leave-one-out accuracy of a nearest-neighbor classifier. Therefore, I-RELIEF can be regarded as a wrapper method, and it thereby naturally addresses the issues of feature correlation and the removal of redundant features. Moreover, I-RELIEF can be solved analytically, and thus avoids any heuristic combinatorial search. The effectiveness of the algorithm has been demonstrated through large-scale experiments on simulated data and six microarray datasets (Sun and Li, 2006). In the Supplementary material, a simulation study of I-RELIEF on a toy example is presented for illustration.

3 EXPERIMENTS

3.1 Experimental setup

In a computational study using microarray data with small sample sizes, special care must be taken in experimental protocols to avoid possible overfitting of a computational model to training data. One particular problem in many microarray data analyses is an incomplete cross-validation method that uses the same dataset for both training and testing, resulting in over-optimistic performances not reproducible in other independent validation studies (Simon et al., 2003; Simon, 2005; Brenton et al., 2005). To avoid this problem, we adopt a rigorous experimental protocol proposed in Wessels et al., 2005 with the leave-one-out cross validation (LOOCV) method. In each iteration, one sample is held out for testing and the remaining samples are used for training. The experimental protocol consists of two loops: inner and outer loops. In the inner loop, LOOCV is performed to estimate the optimal classification parameters based on the training data provided by the outer loop. In the outer loop, the held-out sample is classified by using the best parameters from the inner loop. The experiment is repeated until each sample has been used for testing.
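The two-loop protocol can be sketched as follows. To keep the example self-contained, the feature ranking and the classifier are simple stand-ins (a mean-gap ranking and a nearest-centroid rule) rather than I-RELIEF and LDA, and all function names and the toy data are hypothetical. What the sketch is meant to show is that the held-out sample never influences either the feature ranking or the choice of the number of features.

```python
import numpy as np

def rank_by_mean_gap(X, y):
    """Stand-in ranking (not I-RELIEF): sort features by |difference of class means|."""
    gap = np.abs(X[y == 1].mean(axis=0) - X[y == -1].mean(axis=0))
    return np.argsort(-gap)

def nearest_centroid_predict(X_train, y_train, X_test):
    """Stand-in classifier: assign each test sample to the closer class mean."""
    mu_pos = X_train[y_train == 1].mean(axis=0)
    mu_neg = X_train[y_train == -1].mean(axis=0)
    d_pos = np.linalg.norm(X_test - mu_pos, axis=1)
    d_neg = np.linalg.norm(X_test - mu_neg, axis=1)
    return np.where(d_pos <= d_neg, 1, -1)

def loocv_errors(X, y, k, rank, predict):
    """LOOCV error count when the top-k features are re-selected in every fold."""
    errors = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        top = rank(X[keep], y[keep])[:k]
        pred = predict(X[keep][:, top], y[keep], X[i:i + 1, top])
        errors += int(pred[0] != y[i])
    return errors

def nested_loocv(X, y, k_grid, rank=rank_by_mean_gap, predict=nearest_centroid_predict):
    """Outer loop tests one held-out sample; inner loop picks k on the rest."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    preds = np.empty(len(y), dtype=int)
    for i in range(len(y)):                       # outer loop: hold out sample i
        keep = np.arange(len(y)) != i
        X_tr, y_tr = X[keep], y[keep]
        # inner loop: LOOCV on the outer training data only
        best_k = min(k_grid, key=lambda k: loocv_errors(X_tr, y_tr, k, rank, predict))
        top = rank(X_tr, y_tr)[:best_k]
        preds[i] = predict(X_tr[:, top], y_tr, X[i:i + 1, top])[0]
    return preds

# Toy data: feature 0 is informative, feature 1 is noise.
X_toy = np.array([[0.0, 0.3], [0.1, 0.7], [0.2, 0.5], [0.05, 0.4],
                  [1.0, 0.6], [0.9, 0.2], [0.8, 0.5], [0.95, 0.45]])
y_toy = np.array([-1, -1, -1, -1, 1, 1, 1, 1])
preds_toy = nested_loocv(X_toy, y_toy, k_grid=[1, 2])
```

The cost grows quadratically with the number of samples because every outer fold runs a full inner LOOCV for each candidate k, which is one reason a small parameter grid matters in practice.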

The classification parameters that need to be specified in the inner loop include the kernel width of I-RELIEF, the structural parameters of a classifier (e.g. the regularization parameter in SVM and the number of hidden nodes in NNW), and the number of features used in a classifier, which leads to a multi-dimensional parameter search. To make the experiment computationally feasible, we adopt some heuristic simplifications. Linear discriminant analysis (LDA) is used to estimate classification performances. One major advantage of LDA, compared to other classifiers such as SVM and NNW, is that LDA has no structural parameters. We then predefine the kernel width σ = 2, and estimate the number of features through LOOCV in the inner loop. The use of LDA is further justified by other research work. Simon pointed out that there may not be sufficient information in most microarray datasets to support nonlinear classifiers (Simon, 2005). In the analyses performed by van't Veer et al., the 70-gene signature was derived from the same dataset, and the samples were classified using a correlation based classifier. It can be shown that the correlation based classifier is a special case of LDA, with the within-class scatter matrix being replaced by an identity matrix I. In Edén et al., 2004, where a NNW classifier was constructed, it was found through cross-validation that a NNW without hidden layers performed the best, which is actually a linear classifier. We comment that a comprehensive parameter search may lead to a more accurate prediction performance but with a much higher computational complexity.
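As an illustration of the remark about the identity scatter matrix, the sketch below builds a two-class linear discriminant and lets the within-class scatter be swapped for I. With the identity, the weight vector is simply the difference of the class means. The function name, the equal-prior threshold at the midpoint of the means, and the small ridge term that keeps the scatter matrix invertible are our assumptions, not details given in the paper.

```python
import numpy as np

def linear_discriminant(X, y, identity_scatter=False, ridge=1e-6):
    """Return (w, b) of the rule sign(w @ x + b) for labels in {+1, -1}."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_pos, X_neg = X[y == 1], X[y == -1]
    mu_pos, mu_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    n_features = X.shape[1]
    if identity_scatter:
        scatter = np.eye(n_features)          # within-class scatter replaced by I
    else:
        centered = np.vstack([X_pos - mu_pos, X_neg - mu_neg])
        # Assumption: a small ridge keeps the scatter matrix invertible.
        scatter = centered.T @ centered + ridge * np.eye(n_features)
    w = np.linalg.solve(scatter, mu_pos - mu_neg)
    b = -w @ (mu_pos + mu_neg) / 2.0          # threshold at the midpoint (equal priors)
    return w, b

X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0]])
y_toy = np.array([-1, -1, 1, 1])
w_id, b_id = linear_discriminant(X_toy, y_toy, identity_scatter=True)
labels_id = np.sign(X_toy @ w_id + b_id)
```

With many more features than samples the estimated scatter matrix is singular, so some form of regularization or prior feature selection is needed before the full LDA rule can be applied.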

We demonstrate the predictive value of the hybrid prognostic signature derived from the genetic and clinical markers by comparing its performance with those of the clinical markers used as features in a well-trained LDA classifier, the St. Gallen criterion and the 70-gene signature3. The performances of the 70-gene signature and the clinical markers are estimated through LOOCV. Hence, the held-out testing sample is not involved in the identification of a gene signature. It should be noted that the signature identified in each iteration is very likely to be different from the one reported in van't Veer et al., 2002. However, the LOOCV error provides us with an unbiased estimate of how the gene signature so produced performs on unseen data (Simon, 2005) (cf. Section 3.2).

3.2 Results

With a small sample size, some performance measurements, such as odds ratios, are heavily influenced by the choice of a decision threshold. A receiver operating characteristic (ROC) curve obtained by varying a decision threshold can give us a direct view of how a classifier performs at different sensitivity and specificity levels. In Figure 1, we plot the ROC curves of three classifiers based on the hybrid signature, the 70-gene signature and the clinical markers. We observe that the hybrid signature significantly outperforms both the 70-gene signature and the clinical markers, whereas the latter two approaches perform similarly. Following the study of van't Veer and colleagues (van't Veer et al., 2002), a threshold is set for each classifier such that the sensitivity of each classifier is equal to 90%. The corresponding specificities are computed and reported in Table 1. For comparison, the specificity of the St. Gallen criterion is also reported. Both the 70-gene signature and the clinical markers significantly outperform the St. Gallen criterion, as reported in the literature, and the hybrid signature improves on the specificities of the 70-gene signature and the clinical markers by roughly 20 percentage points (67% versus 47% and 48%). We point out that our estimate of the specificity of the 70-gene signature is worse than that reported in van't Veer et al., 2002 and Weigelt et al., 2005 (47% versus 73%), but is consistent with that in the follow-up validation study of the 70-gene signature on a larger dataset (van De Vijver et al., 2002) (53%). This is because 76 samples in van't Veer's dataset that were used for performance estimation were also involved in the identification of the gene signature, which led to a biased estimate of the prediction performance of the signature. Our result suggests that LOOCV can effectively correct for this bias.
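Fixing the operating point at a target sensitivity can be sketched as below. The function name and the toy scores are hypothetical, and the convention that a higher score means a higher predicted risk of metastasis is our assumption. The threshold is the score of the last positive sample that must be flagged to reach the target.

```python
import numpy as np

def specificity_at_sensitivity(scores, y, target=0.90):
    """Pick the threshold reaching the target sensitivity; return the resulting rates."""
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(y)
    pos = np.sort(scores[y == 1])[::-1]                   # positive scores, descending
    n_needed = int(np.ceil(target * len(pos) - 1e-9))     # positives that must be flagged
    threshold = pos[n_needed - 1]                         # flag samples with score >= threshold
    sensitivity = np.mean(scores[y == 1] >= threshold)
    specificity = np.mean(scores[y == -1] < threshold)
    return threshold, sensitivity, specificity

# Toy example: 10 positives scored 1..10 and 4 negatives.
scores_toy = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0.5, 1.5, 2.5, 3.5])
y_toy = np.array([1] * 10 + [-1] * 4)
thr, sens, spec = specificity_at_sensitivity(scores_toy, y_toy)
```

With 46 positive samples, as in this dataset, a 90% target requires flagging at least 42 of them under this rounding convention, so the realized sensitivity is slightly above 90%.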

Fig. 1. ROC curves of three methods. A colour version of this figure is available as supplementary data.

Table 1. Prognostic results (90% sensitivity)

Method       Specificity    Odds ratio (95% CI)    Hazard ratio (95% CI)    P-value (HR)
70-gene      24/51 = 47%    9.3 (2.9–30.0)         6.0 (2.0–17.0)           <0.001
St. Gallen   6/51 = 12%     N/A                    N/A                      N/A
Clinical     25/51 = 48%    9.3 (2.9–30.0)         6.2 (2.2–17.6)           <0.001
Hybrid       34/51 = 67%    21.0 (6.5–68.3)        11.1 (3.9–31.5)          <0.001

We compute the odds ratio (OR) of four approaches for developing distant metastases within five years between the patients with a good prognostic signature and the patients with a bad prognosis. The results are reported in Table 1. We observe that the 70-gene signature has the same OR (9.3, 95% confidence interval (CI): 2.9–30.0) as the clinical markers. This result is consistent with the findings reported in Edén et al., 2004. We also note that the hybrid signature gives a much higher OR (21.0, 95% CI: 6.5–68.3) than either genetic or clinical markers.
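For readers who want to see how such an odds ratio and its interval arise from a 2×2 table, the sketch below uses the standard log-odds (Woolf) approximation. The paper does not state which interval method was used, so this choice is an assumption, as are the function name and the count of 42 of 46 metastasis patients flagged (the paper reports only the 90% sensitivity level). The specificity count 34/51 is taken from Table 1.

```python
import math

def odds_ratio_ci(tp, fn, fp, tn, z=1.96):
    """Odds ratio of a 2x2 table with a Woolf (log-normal) confidence interval."""
    odds_ratio = (tp * tn) / (fn * fp)
    se_log = math.sqrt(1.0 / tp + 1.0 / fn + 1.0 / fp + 1.0 / tn)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper

# Hybrid signature: 34 of 51 metastasis-free patients correctly classified (Table 1);
# 42 of 46 metastasis patients flagged is an assumed count consistent with ~90% sensitivity.
or_hybrid, lo_hybrid, hi_hybrid = odds_ratio_ci(tp=42, fn=4, fp=17, tn=34)
# -> approximately 21.0, 6.5, 68.3
```

Under these assumed counts the computation reproduces the hybrid-signature entry of Table 1, which suggests, but does not confirm, that a similar procedure was used.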

To further demonstrate the predictive value of the hybrid signature in assessing the risk of developing distant metastases in breast cancer patients, survival data analyses of four approaches are also performed4. The Kaplan–Meier curve of the hybrid signature, plotted in Figure 2, shows a significant difference in the probability of remaining free of distant metastases in patients with a good signature and the patients with a bad prognostic signature (P-value <0.001). The Mantel–Cox estimation of hazard ratio of distant metastases within five years for the hybrid signature is 11.1 (95% CI: 3.9–31.5, P-value <0.001), which is superior to either genetic or clinical markers alone.

Fig. 2. Kaplan–Meier estimation of the probabilities of remaining free of distant metastases in patients with a good or bad prognostic signature, determined by the St. Gallen criterion, clinical markers, 70-gene signature and hybrid signature. The P-values are computed using the log-rank test.
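The Kaplan–Meier curves are product-limit estimates, which can be sketched in a few lines. The function name and the toy follow-up data are hypothetical; the sketch covers only the survival estimate for one patient group and does not include the log-rank test or the hazard ratio estimation.

```python
import numpy as np

def kaplan_meier(times, events):
    """Product-limit estimate: survival probability just after each event time.

    times  : follow-up time of each patient
    events : True if distant metastasis was observed, False if censored
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(times[events])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)              # patients still under observation at t
        d = np.sum((times == t) & events)         # events occurring exactly at t
        s *= 1.0 - d / at_risk
        survival.append(s)
    return event_times, np.array(survival)

# Toy group of four patients; the second one is censored at time 2.
t_km, s_km = kaplan_meier([1, 2, 3, 4], [True, False, True, True])
```

Censored patients contribute to the at-risk count up to their last follow-up but never trigger a drop in the curve, which is what distinguishes this estimate from a simple proportion.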

More experimental results can be found in the Supplementary material.

3.3 Hybrid signature

With a small sample size, each iteration in LOOCV may generate a different prognostic signature since training data are different (Simon, 2005). In our study, we find that the majority of the iterations identify the same hybrid signature that consists of only three gene markers and two clinical markers (Supplementary Table 1). Note that the hybrid signature is markedly shorter than the 70-gene signature.

The two clinical markers in the hybrid signature are tumor grade and angio-invasion. Histological grading of tumors has been shown in numerous studies to provide useful prognostic information in breast cancer (Elston et al., 1991). The grade represents a morphological assessment of the degree of differentiation of the tumor as evaluated by the percentage of tubule formation, the degree of nuclear pleomorphism and the presence of mitoses. Grade 1 tumors have a low risk of metastases; grade 2 tumors have an intermediate risk of metastases and grade 3 tumors have a high risk of metastases. Patients with grade 1 tumors have a significantly better survival rate than those with grade 2 or 3 tumors (Elston et al., 1991). An essential step in the metastatic cascade is (lympho)vascular invasion, or the penetration of tumor cells into lymph and/or blood vessels in and around the primary tumor. Accordingly, the observation of 3 or more tumor cell emboli in tumor-associated vessels has been correlated with the presence of LN metastases and with poor prognosis in patients with breast cancer (de Mascarel et al., 1998; Pinder et al., 1994).

The three genetic markers in the hybrid signature are AL080059, CEGP1 and PRAME, of which CEGP1 and AL080059 are also listed in the 70-gene signature. The CEGP1 gene (also known as SCUBE2, EGF2-like 2 and ASCL3) is located on human chromosome 11p15 and has homology to the achaete-scute complex (ASC) of genes in the basic helix–loop–helix (bHLH) family of transcription factors. The exact biological role of CEGP1 (SCUBE2) is still unknown, but the gene encodes a secreted and cell-surface protein containing EGF and CUB domains (Yang et al., 2002). The epidermal growth factor motif is found in many extracellular proteins that play an important role during development, and the CUB domain is found in several proteins implicated in the regulation of extracellular processes, such as cell–cell communication and adhesion (Grimmond et al., 2000). Expression of SCUBE2 has been detected in vascular endothelium and may play important roles in development, inflammation and perhaps carcinogenesis (Yang et al., 2002). The expression of SCUBE2 was recently reported to be associated with ER status in a SAGE-based study of breast cancer specimens (Abba et al., 2005). The AL080059 label refers to a sequence obtained from a human cDNA clone, but subsequent analysis has revealed significant homology with the TSPY-like 5 (TSPYL5) gene, and with other human proteins, including NAPs, factors which play a role in DNA replication (Schnieders et al., 1996). It is thought that NAPs act as histone chaperones, shuttling histones from their site of synthesis in the cytoplasm to the nucleus. Histone proteins are involved in regulating chromatin structure and accessibility and therefore can impact gene expression (Rodriguez et al., 1997); thus, a role in tumor cell phenotype can be proposed.

Although both AL080059 and CEGP1 were found to be significantly over-expressed in our studies of a breast tumor metastases model (Goodison et al., 2005), neither AL080059 nor CEGP1 has been evaluated independently in human cancers. Conversely, the expression of the preferentially expressed antigen in melanoma (PRAME) gene has been linked to human disease, including cancer. PRAME is classed as a cancer-testis antigen (CTA), a group of tumor-associated antigens that represent possible target proteins for immuno-therapeutic approaches. Their expression is encountered in a variety of malignancies but is negligible in healthy tissues, with male germinal cells being the exception (Juretic et al., 2003). PRAME was first discovered in a patient with melanoma (Ikeda et al., 1997), and has since been found to be expressed in a large variety of cancer cells including squamous cell lung carcinoma, medulloblastoma, neuroblastoma, renal cell carcinoma and acute leukemia (Matsushita et al., 2003). Our study raises the possibility that therapies targeted to PRAME may also be beneficial in breast cancers.

4 DISCUSSION

We present some discussion on the optimality and uniqueness of prognostic signatures. Due to these issues, among others, whether the existing gene signatures are ready for clinical trials is currently under active debate (Brenton et al., 2005; Weigelt et al., 2005; Loi et al., 2006). Since many potential readers of this paper are from the oncology community, we start the discussion with a relatively simple machine-learning example. This example was first used by Trunk (Trunk, 1979) to demonstrate the existence of the curse of dimensionality. We find that Trunk's experiment, when applied to the research of breast cancer prognosis, reveals much more than the curse of dimensionality. Consider the following binary classification problem. The a priori probabilities are $P(C_1) = P(C_2) = 1/2$, and the class conditional probabilities are Gaussian, given by $p(x \mid C_1) \sim N(\mu, I)$ and $p(x \mid C_2) \sim N(-\mu, I)$, respectively, where $\mu$ is the mean vector, the ith component of which is $(1/i)^{1/2}$, and $I$ is the identity matrix. The task is to construct a classifier based on a given training dataset. All of the a priori knowledge is given except for the mean vector $\mu$, which is estimated from training data. The classification accuracies, as a function of the number of features used in a constructed classifier for four different training sample sizes (i.e. 20, 50, 100 and 200) averaged from 100 runs, are plotted in Figure 3. From the figure, we arrive at the following observations:

  1. For a given sample size, the inclusion of additional features beyond a certain point leads to a higher error. It can be shown that, with a finite sample size, the classification error converges to one-half as the number of features goes to infinity (Trunk, 1979). Note that in Trunk's data model, every feature contains a certain amount of discriminant information. Applied to breast cancer prognosis, this observation implies that, with a limited number of training samples, some features that have predictive value when evaluated individually do not necessarily improve the predictive performance of a computational model when used together with other features, and may sometimes even degrade it. This highlights the need for feature selection.

  2. As the sample size increases, the number of features corresponding to the optimum performance also increases. For example, with 20 samples, the classification accuracy peaks around 30 features, whereas for 200 samples, the peak occurs around 200 features (Fig. 3). Applied to breast cancer prognosis research, this observation indicates that a prognostic signature derived from a small training dataset is likely to be lengthened when a larger training dataset is used.

  3. There exists a range of feature dimensionality over which a classifier achieves close-to-optimum performance. Moreover, with the inclusion of more samples, this range becomes even larger. (Note that the x-axis of Fig. 3 is log-scaled.) This means that, for a given training dataset, there may exist several different signatures with a similar predictive value. This observation, together with other factors (e.g. the existence of co-regulated genes and the use of different microarray platforms and data processing algorithms; see Dalton et al., 2006 for a more detailed discussion), may explain why the gene signatures identified in some recent independent studies are different.
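The first observation can be made quantitative with a short calculation. The following is a sketch under stated assumptions, not a reproduction of Trunk's original derivation: we write $n$ for the number of features, $N$ for the number of training samples and $\Phi$ for the standard normal distribution function (these symbols are introduced here for illustration), we take the classifier to be the plug-in rule that assigns $\mathbf{x}$ to $C_1$ when $\hat{\boldsymbol{\mu}}^{T}\mathbf{x} > 0$, and we treat $\hat{\boldsymbol{\mu}}^{T}\mathbf{x}$ as approximately Gaussian in the estimated-mean case.

```latex
% Known mean: z = \mu^T x has mean \|\mu\|^2 and variance \|\mu\|^2 under C_1.
P_e^{\text{known}}(n)
  = \Phi\!\left(-\|\boldsymbol{\mu}\|\right)
  = \Phi\!\left(-\Big(\sum_{i=1}^{n} \tfrac{1}{i}\Big)^{1/2}\right)
  \;\longrightarrow\; 0 \quad (n \to \infty).

% Estimated mean: \hat{\mu} \sim N(\mu, I/N), so z = \hat{\mu}^T x has
% mean \|\mu\|^2 and variance (1 + 1/N)\|\mu\|^2 + n/N under C_1.
P_e^{\text{est}}(N, n)
  \approx \Phi\!\left(-\theta(N, n)\right), \qquad
\theta(N, n)
  = \frac{\sum_{i=1}^{n} \tfrac{1}{i}}
         {\sqrt{\big(1 + \tfrac{1}{N}\big)\sum_{i=1}^{n} \tfrac{1}{i} + \tfrac{n}{N}}}
  \;\longrightarrow\; 0 \quad (n \to \infty,\ N \text{ fixed}),
\qquad\text{so}\qquad P_e^{\text{est}} \to \tfrac{1}{2}.
```

Because the harmonic sum grows only like $\ln n$ while the noise term $n/N$ grows linearly, $\theta$ first rises and then falls as features are added, and a larger $N$ delays the turning point, which is in line with the second observation.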

Fig. 3.

Trunk's experiment showing the curse of dimensionality. A colour version of this figure is available as supplementary data.
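For readers who wish to reproduce the qualitative behaviour described above, the following is a minimal simulation sketch of Trunk's experiment. It is not the Matlab code used for Figure 3. It assumes that the classifier is the plug-in rule sign(μ̂·x), with μ̂ estimated as the average of the sign-flipped training points; the function names `trunk_accuracy` and `accuracy_curve`, the test-set size and the feature counts are illustrative choices of ours.

```python
import numpy as np


def trunk_accuracy(n_train, n_features, n_test=500, rng=None):
    """Test accuracy of the plug-in classifier on one random draw of Trunk's problem."""
    rng = np.random.default_rng() if rng is None else rng
    # True mean vector: the i-th component is (1/i)^(1/2).
    mu = 1.0 / np.sqrt(np.arange(1, n_features + 1))

    def sample(n):
        # Equal priors; class C1 is coded +1 (mean +mu), class C2 is coded -1 (mean -mu).
        y = rng.choice([-1.0, 1.0], size=n)
        x = y[:, None] * mu + rng.standard_normal((n, n_features))
        return x, y

    x_train, y_train = sample(n_train)
    # Only the mean is unknown: estimate it by averaging sign-flipped training points.
    mu_hat = (y_train[:, None] * x_train).mean(axis=0)

    x_test, y_test = sample(n_test)
    # With identity covariance and equal priors, the plug-in rule is sign(mu_hat . x).
    y_pred = np.where(x_test @ mu_hat >= 0.0, 1.0, -1.0)
    return float(np.mean(y_pred == y_test))


def accuracy_curve(n_train, feature_counts, n_runs=100, n_test=500, seed=0):
    """Average accuracy over n_runs independent draws for each feature count."""
    rng = np.random.default_rng(seed)
    return {
        d: float(np.mean([trunk_accuracy(n_train, d, n_test, rng) for _ in range(n_runs)]))
        for d in feature_counts
    }


if __name__ == "__main__":
    # Small illustrative run: two training sizes, a handful of feature counts.
    for n_train in (20, 200):
        curve = accuracy_curve(n_train, [1, 10, 30, 100, 300, 1000], n_runs=20)
        print(n_train, {d: round(a, 3) for d, a in curve.items()})
```

Accuracy is estimated on a fresh test set in each run, so the curves are Monte Carlo estimates. Increasing `n_runs` and `n_test` smooths them at the cost of run time.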

For clinical applications, what we are really interested in is not whether there exist several different signatures with a similar predictive value, but whether these signatures achieve optimum, or close-to-optimum, performance. After decades of research on breast cancer prognosis, many prognostic markers have been reported in the literature, including clinical markers and gene signatures. Many of these reports are single-marker prognostic and predictive studies. A critical question remains unanswered to date: what is the best performance we can achieve in breast cancer prognosis, given all clinical and genetic information and using advanced computational algorithms? Without further optimization, an expensive clinical validation trial of a prognostic signature may merely confirm the already established predictive value of the signature without proving its optimality for clinical applications. In this paper, we have presented a computational study demonstrating the feasibility of utilizing both clinical and genetic information simultaneously for more accurate breast cancer prognosis. We believe that this is an advantageous direction to pursue in future breast cancer prognosis studies.

Our experiment is based on van't Veer's data, which was obtained from only 97 tissue samples. Trunk's experiment shows that identifying prognostic signatures for breast cancer is necessarily an ongoing and dynamic process: as more patient samples are included, a prognostic signature will be continuously lengthened and refined, and its performance will improve accordingly and eventually stabilize.

5 RELATED WORK

From the machine-learning perspective, it is a straightforward idea to integrate all available information for a classification task. Some efforts have been made in this direction for breast cancer prognosis, but with little success. Ritz (Ritz, 2003) combined genetic and clinical information in a neural network (NNW) for breast cancer prognosis but found that the combination did not improve performance. Dettling et al. (2004) applied penalized logistic regression analysis to predict cancer prognosis for the same dataset. They found that none of the clinical variables entered the model and concluded that the clinical data did not contain any useful independent information for prediction, given the gene expression profile. Gevaert et al. (2006) developed a Bayesian network to perform breast cancer prognosis. Their results showed that, although a Bayesian network that used both genetic and clinical information can lead to a simpler classifier with fewer genes, which is consistent with our finding, it performed similarly to the 70-gene signature. We emphasize that these negative results do not necessarily mean that the clinical data contain no information beyond the genetic data; they only tell us that the combination strategies applied in those models did not work. This highlights the difficulty of designing a successful combination strategy.

6 CONCLUSION

In this paper, we applied a new mathematical model to predict the likelihood of disease recurrence and metastases in breast cancer. Our preliminary study has shown that a hybrid signature can provide significantly improved prognostic specificity over the existing gene signatures and the current clinical systems by about 20% and 60%, respectively. We have also discussed the curse of dimensionality in the context of breast cancer prognosis. We believe that researchers, particularly from the oncology community, should benefit from this discussion.

To fully address the question posed in Section 4, namely what is the best performance we can achieve in breast cancer prognosis given all available information, larger-scale computational studies that involve more patient data and compare different learning algorithms are required. Such studies are under way in our laboratory.

Supplementary Material

Supplementary Data
Supplementary Figures

ACKNOWLEDGEMENTS

Conflict of Interest: none declared.

Footnotes

Availability: The breast cancer dataset is available at www.nature.com and Matlab codes are available upon request.

Supplementary information: Supplementary data are available at Bioinformatics online.

1

The specificity is defined as the rate of correctly predicting the lack of need of the adjuvant systemic therapies when the therapies are indeed not necessary, and the sensitivity is the rate of administering the adjuvant systemic therapies when indeed these therapies are effective.

2

The St. Gallen consensus criterion: tumor ≥2 cm, estrogen receptor negative, grade 2–3, patient <35 years old (any one of these criteria met suggests high distant metastasis risk); the NIH consensus is similar to that of St. Gallen, but with tumor >1 cm.

3

We follow the experimental procedure outlined in van't Veer et al., 2002, which first identified the top 70 genes and then assessed their predictive value by using a correlation-based classifier. The detailed description is presented in the Supplement.

4

It is not clear whether, at 5 years post-surgery, patients had died from distant metastasis or the clinicians were unable to continue follow-up for other reasons. Some researchers (Edén et al., 2004) treated the patients who survived more than 5 years as lost to follow-up, whereas in our experiment we consider them as “dead”. Therefore, the results of the 10-year prognosis are not reliable.

REFERENCES

  1. Abba M, et al. Gene expression signature of estrogen receptor α status in breast cancer. BMC Genomics. 2005;6:74–81. doi: 10.1186/1471-2164-6-37.
  2. Brenton J, et al. Molecular classification and molecular forecasting of breast cancer: ready for clinical application? J. Clin. Oncol. 2005;23:7350–7360. doi: 10.1200/JCO.2005.03.3845.
  3. Dalton W, et al. Cancer biomarkers–an invitation to the table. Science. 2006;312:1165–1168. doi: 10.1126/science.1125948.
  4. de Mascarel I, et al. Obvious peritumorous emboli: an elusive prognostic factor reappraised: multivariate analysis of 1320 node-negative breast cancers. Eur. J. Cancer. 1998;34:58–65. doi: 10.1016/s0959-8049(97)00344-4.
  5. Dettling M, et al. Finding predictive gene groups from microarray data. J. Multivariate Anal. 2004;1:106–131.
  6. Duda R, et al. Pattern Classification. J. Wiley; NY: 2000.
  7. Dudoit S, et al. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc. 2002;97:77–87.
  8. Edén P, et al. ‘Good old’ clinical markers have similar power in breast cancer prognosis as microarray gene expression profilers. Eur. J. Cancer. 2004;40:1837–1841. doi: 10.1016/j.ejca.2004.02.025.
  9. Eifel P, et al. National Institutes of Health consensus development conference statement: adjuvant therapy for breast cancer. J. Natl. Cancer Inst. 2000;93:979–989. doi: 10.1093/jnci/93.13.979.
  10. Elston C, et al. Pathological prognostic factors in breast cancer. I. The value of histological grade in breast cancer: experience from a large study with long-term follow-up. Histopathology. 1991;19:403–410. doi: 10.1111/j.1365-2559.1991.tb00229.x.
  11. Gevaert O, et al. Predicting the prognosis of breast cancer by integrating clinical and microarray data with Bayesian networks. Bioinformatics. 2006;22:184–190. doi: 10.1093/bioinformatics/btl230.
  12. Gilad-Bachrach R, et al. Margin based feature selection—theory and algorithms. Proceedings of 21st International Conference Machine Learning. 2004. pp. 43–50.
  13. Goldhirsch A, et al. Meeting highlights: updated international expert consensus on the primary therapy of early breast cancer. J. Clin. Oncol. 2003;21:3357–3365. doi: 10.1200/JCO.2003.04.576.
  14. Golub T, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286:531–537. doi: 10.1126/science.286.5439.531.
  15. Goodison S, et al. The RhoGAP protein DLC-1 functions as a metastasis suppressor in breast cancer cells. Cancer Res. 2005;65:6042–6053. doi: 10.1158/0008-5472.CAN-04-3043.
  16. Grimmond S, et al. Cloning, mapping, and expression analysis of gene encoding a novel Mammalian EGF-related protein (SCUBE1). Genomics. 2000;70:74–81. doi: 10.1006/geno.2000.6370.
  17. Ikeda H, et al. Characterization of an antigen that is recognized on a melanoma showing partial HLA loss by CTL expressing an NK inhibitory receptor. Immunity. 1997;6:199–208. doi: 10.1016/s1074-7613(00)80426-4.
  18. Juretic A, et al. Cancer/testis tumour-associated antigens: immunohistochemical detection with monoclonal antibodies. Lancet Oncol. 2003;4:104–109. doi: 10.1016/s1470-2045(03)00982-3.
  19. Kira K, Rendell L. A practical approach to feature selection. Proceedings of 9th International Conference Machine Learning. 1992. pp. 249–256.
  20. Kohavi R, John G. Wrappers for feature subset selection. Artif. Intell. 1997;97:273–324.
  21. Li T, et al. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics. 2004;20:2429–2437. doi: 10.1093/bioinformatics/bth267.
  22. Loi S, et al. Molecular forecasting of breast cancer: time to move forward with clinical testing. J. Clin. Oncol. 2006;24:721–722. doi: 10.1200/JCO.2005.04.6524.
  23. Matsushita M, et al. Preferentially expressed antigen of melanoma (PRAME) in the development of diagnostic and therapeutic methods for hematological malignancies. Leuk. Lymphoma. 2003;44:439–444. doi: 10.1080/1042819021000035725.
  24. Pinder S, et al. Pathological prognostic factors in breast cancer. III. Vascular invasion: relationship with recurrence and survival in a large study with a long-term follow-up. Histopathology. 1994;24:41–47. doi: 10.1111/j.1365-2559.1994.tb01269.x.
  25. Ritz C. Master thesis. Lund University; Sweden: 2003. Comparing prognostic markers for metastases in breast cancer using artificial neural networks.
  26. Rodriguez P, et al. Functional characterization of human nucleosome assembly protein-2 (NAP1L4) suggests a role as a histone chaperone. Genomics. 1997;44:253–265. doi: 10.1006/geno.1997.4868.
  27. Schnieders F, et al. Testis-specific protein, Y-encoded (TSPY) expression in testicular tissues. Hum. Mol. Genet. 1996;5:1801–1807. doi: 10.1093/hmg/5.11.1801.
  28. Simon R, et al. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J. Natl. Cancer Inst. 2003;95:14–18. doi: 10.1093/jnci/95.1.14.
  29. Simon R. Roadmap for developing and validating therapeutically relevant genomic classifiers. J. Clin. Oncol. 2005;23:7332–7341. doi: 10.1200/JCO.2005.02.8712.
  30. Sun Y, Li J. Iterative RELIEF for feature weighting. Proceedings of 23rd International Conference Machine Learning. 2006. pp. 913–920.
  31. Trunk G. A problem of dimensionality: a simple example. IEEE Trans. Pattern Anal. Mach. Intell. 1979;1:306–307. doi: 10.1109/tpami.1979.4766926.
  32. van't Veer L, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415:530–536. doi: 10.1038/415530a.
  33. van De Vijver M, et al. A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med. 2002;347:1999–2009. doi: 10.1056/NEJMoa021967.
  34. Wang Y, et al. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005;365:671–679. doi: 10.1016/S0140-6736(05)17947-1.
  35. Weigelt B, et al. Breast cancer metastasis: markers and models. Nat. Rev. Cancer. 2005;5:591–602. doi: 10.1038/nrc1670.
  36. Wessels L, et al. A protocol for building and evaluating predictors of disease state based on microarray data. Bioinformatics. 2005;21:3755–3762. doi: 10.1093/bioinformatics/bti429.
  37. Yang R, et al. Identification of a novel family of cell-surface proteins expressed in human vascular endothelium. J. Biol. Chem. 2002;227:46364–46373. doi: 10.1074/jbc.M207410200.
