Abstract
The tumor-node-metastasis staging system has been the lynchpin of cancer diagnosis, treatment, and prognosis for many years. For meaningful clinical use, an orderly grouping of the T and N categories into a staging system needs to be defined, usually with respect to a time-to-event outcome. This can be reframed as a model selection problem with respect to features arranged on a partially ordered two-way grid, and a penalized regression method is proposed for selecting the optimal grouping. Instead of penalizing the L1-norm of the coefficients like lasso, in order to enforce the stage grouping, we place L1 constraints on the differences between neighboring coefficients. The underlying mechanism is the sparsity-enforcing property of the L1 penalty, which forces some estimated coefficients to be the same and hence leads to stage grouping. Partial ordering constraints are also required as both the T and N categories are ordinal. A series of optimal groupings with different numbers of stages can be obtained by varying the tuning parameter, which gives a tree-like structure offering a visual aid on how the groupings are progressively made. We hence call the proposed method the lasso tree. We illustrate the utility of our method by applying it to the staging of colorectal cancer using survival outcomes. Simulation studies are carried out to examine the finite sample performance of the selection procedure. We demonstrate that the lasso tree is able to give the right grouping with moderate sample size, is stable with regard to changes in the data, and is not affected by random censoring.
Keywords: Cancer staging, Cox model, Lasso, Lasso tree, Model selection
1. Introduction
The development of accurate prognostic classification schemes is of great interest and concern in many areas of clinical research. In oncology, much effort has been made to define a cancer classification scheme which can facilitate diagnosis and prognosis, provide a basis for making treatment or other clinical decisions, and identify homogeneous groups of patients for clinical trials (American Joint Committee on Cancer, 2002; 2010; National Cancer Institute, 2004). Among various classification schemes, the tumor-node-metastasis (TNM) staging system is widely used because of its simplicity and prognostic ability.
The basis of TNM staging is the anatomic extent of disease. It has three components: T for primary tumor, N for lymph nodes, and M for distant metastasis. In the case of colorectal cancer, which we will use as an example in this paper, there are four categories of T, three of N, and two of M. Details of the categories are provided in Table 1. Since patients with distant metastases (M1) have markedly different prognoses from M0 patients, they are usually treated separately. Hence, in this paper we will confine our efforts to only the T and N categories of the TNM system.
Table 1.
The TNM staging system for colorectal cancer
| Category | Description |
|---|---|
| T: primary tumor | |
| T1 | Tumor invades submucosa |
| T2 | Tumor invades muscularis propria |
| T3 | Tumor invades pericolorectal tissues |
| T4 | Tumor directly invades or is adherent to other organs |
| N: lymph nodes | |
| N0 | No regional lymph node metastasis |
| N1 | Metastasis in one to three regional lymph nodes |
| N2 | Metastasis in four or more regional lymph nodes |
| M: distant metastasis | |
| M0 | No distant metastasis |
| M1 | Distant metastasis |
The T and N categories for colon cancer jointly define 12 distinct groups, which are unwieldy for meaningful clinical use (Gönen and Weiser, 2010). Therefore, the American Joint Committee on Cancer (AJCC) and International Union against Cancer defined an orderly, progressive grouping of the TNM categories which reduces the system to fewer stages (three main stages and six substages under the AJCC sixth edition; see Figure 2(a)). Alternative grouping schemes have also been proposed by other authors. The value and usefulness of these TNM stage groupings are, however, very much debated. The main concern is that the AJCC system is defined without systematic empirical investigation. By systematic, we mean the extensive division of the T × N table into all possible stage groupings. There is also a lack of commonly accepted statistical methods for developing stage groupings.
Fig. 2.

Schematic showing the staging schemes for M0 colorectal cancer by (a) AJCC sixth edition staging, (b) the lasso tree selected staging, and (c) the lasso tree selected staging incorporating information from the AJCC.
Although a vast literature in the medical community exists on cancer staging systems, it is focused solely on evaluating and comparing existing proposals of TNM groupings (Lee and others, 1999; Groome and others, 2001). There have been relatively few reports of statistical techniques for developing cancer stage groupings. Begg and others (2005) considered the problem of comparing alternative stage groupings where three non-model-based statistical criteria were applied and compared. Hothorn and Zeileis (2008) used maximally selected log rank statistics to select the optimal two-class partition determined by the T and N categories of rectal cancer patients. Lin and others (2012) considered a bootstrap model selection method which selects the grouping that maximizes the bootstrap estimates of the chosen statistical criteria. These methods all suffer from the following drawbacks: (1) the number of stages needs to be pre-specified; (2) a complete search through all eligible partitions is thwarted by combinatorial explosion when the desired number of stages is large; and (3) these procedures, essentially a form of best subset model selection, are unstable with regard to small changes in the data (Breiman, 1996).
An intuitive approach for developing classification schemes is to use tree-based methods such as recursive partitioning (Breiman and others, 1983). However, these tree-based methods do not capture all types of T and N combinations. At each split, trees must have full/complete separation with respect to one variable. That is, in the T × N table, a tree method will split fully along all columns or rows conditional on the existing splits. In other words, the grouping that results from a tree must have partitions in straight lines. Yet the true staging system might have a different configuration. For instance, the newly published AJCC seventh edition (American Joint Committee on Cancer, 2010) has a partition along the diagonal, which cannot be achieved by a tree method. Hence, a more flexible method is needed for estimating cancer stage groupings.
In this paper, we reframe the task of finding the optimal stage grouping in a model selection context and an L1 penalized regression method is proposed. Specifically, the development of cancer stage groupings can be considered a model selection problem for a censored response grouped with respect to features arranged on a partially ordered two-way grid (the T × N table). Partial ordering is required since both T and N are ordinal, and hence only those groupings which are ordered in T given N and vice versa are eligible. An attractive way to reduce the time complexity of an exhaustive search is to introduce an L1 penalty in a regression model. In order to yield the grouping effect, we constrain the differences of coefficients that are one unit apart in both directions to be small. To be specific, we require
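$$\sum_{j=1}^{p}\sum_{k=2}^{q}\left|\beta_{j,k}-\beta_{j,k-1}\right|+\sum_{j=2}^{p}\sum_{k=1}^{q}\left|\beta_{j,k}-\beta_{j-1,k}\right|\le s, \tag{1.1}$$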
where βj,k is the coefficient for the cell with T=j and N=k, and s>0 is a pre-specified tuning parameter. The constraint leads to some of the estimated coefficients being exactly the same, which provides the desired stage grouping. An attractive feature of this method is that a series of optimal groupings with different numbers of groups can be obtained as a function of the tuning parameter s. This gives a tree-like structure for partitioning the T × N table, and thus offers doctors and medical researchers a visual aid on how the groupings are made progressively and a freedom to choose among different numbers of stages. We use the term “lasso tree” for the proposed method for the rest of the paper.
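As an illustration of the constraint in (1.1), the following minimal Python sketch computes the sum of absolute differences between neighboring cells of a T × N grid. The function name, the nested-list layout, and the 0-based indexing are assumptions made for this example only; the coefficient values are hypothetical.

```python
def neighbor_difference_penalty(beta):
    """Sum of absolute differences between adjacent cells of a T x N grid.

    beta[j][k] holds the coefficient for T category j and N category k
    (0-based here, whereas the text indexes from 1).
    """
    p, q = len(beta), len(beta[0])
    # Differences between neighbors along the N direction (same T).
    along_n = sum(abs(beta[j][k] - beta[j][k - 1])
                  for j in range(p) for k in range(1, q))
    # Differences between neighbors along the T direction (same N).
    along_t = sum(abs(beta[j][k] - beta[j - 1][k])
                  for j in range(1, p) for k in range(q))
    return along_n + along_t


# Hypothetical 4 x 3 grid with two stages: T1-T2 versus T3-T4.
beta_two_stage = [[0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0],
                  [1.0, 1.0, 1.0],
                  [1.0, 1.0, 1.0]]
penalty_value = neighbor_difference_penalty(beta_two_stage)
print(penalty_value)
```

Only the three differences that cross the stage boundary contribute, which is why a small bound s favors configurations with few distinct coefficient values.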
The continuous constraint function shrinks the difference of coefficients toward zero continuously and is expected to result in a more stable stage grouping than that provided by best subset selection (Tibshirani, 1997). This strategy has proved appropriate in other statistical problems such as the fused lasso (Tibshirani and others, 2005; Tibshirani and Taylor, 2011). Unlike the fused lasso, the lasso tree focuses only on sparsity in the differences of the coefficients, not in the coefficients themselves. It also takes into account the partial ordering characteristics of the T and N categories.
The structure of the paper is as follows. In Section 2, we describe the lasso tree and the algorithm for obtaining the estimates. The method is illustrated on a real data example in Section 3, and simulation studies comparing the lasso tree with the best subset selection methods are presented in Section 4. In Section 5, we show that the proposed method can also incorporate information from the AJCC or other sources by imposing different weights on the penalty terms, and an example is given on colorectal cancer. Some discussion is given in Section 6.
2. The lasso tree
2.1. A lasso-type selection procedure for survival outcomes
In general, suppose that the T descriptor has p categories and the N descriptor has q categories. The data for n subjects are of the form (y1,δ1,X1),…,(yn,δn,Xn), with δi describing whether yi is a survival time (δi=1) or a censoring time (δi=0) and Xi denoting the vector of covariates for the ith individual. Under the Cox proportional hazards model, the T × N table can be seen as a categorical covariate with p×q levels and the model can be written as
$$\lambda(t\mid X_i)=\lambda_0(t)\exp\left(X_i^{T}\beta\right), \tag{2.1}$$
where β={βj,k},j=1,…,p,k=1,…,q, are the regression coefficients for cells in the T × N table and λ0(t) is an unspecified baseline hazard function. Equation(2.1) can be solved through maximizing the partial likelihood function
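$$L(\beta)=\prod_{r\in D}\frac{\exp\left(X_r^{T}\beta\right)}{\sum_{i\in R_r}\exp\left(X_i^{T}\beta\right)}, \tag{2.2}$$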
where D is the set of indices of the events and Rr denotes the set of indices of the individuals at risk at time tr−0.
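To make the partial likelihood concrete, here is a minimal Python sketch of its logarithm when each subject's linear predictor is simply the coefficient of the T × N cell the subject falls in. The function name and argument layout are assumptions for this example, and tied event times are handled as in the Breslow approximation, which the text does not specify.

```python
import math


def cox_log_partial_likelihood(times, events, cells, beta):
    """Log partial likelihood for a Cox model with one coefficient per cell.

    times  : observed times y_i
    events : 1 if y_i is a survival time, 0 if it is a censoring time
    cells  : (j, k) pair giving the T and N index of each subject (0-based)
    beta   : nested list, beta[j][k] is the coefficient of cell (j, k)
    """
    eta = [beta[j][k] for (j, k) in cells]
    loglik = 0.0
    for r, (t_r, d_r) in enumerate(zip(times, events)):
        if d_r != 1:
            continue  # censored subjects contribute only through risk sets
        # Risk set: everyone whose observed time is at least t_r.
        denom = sum(math.exp(eta[i]) for i, y in enumerate(times) if y >= t_r)
        loglik += eta[r] - math.log(denom)
    return loglik
```

With all coefficients equal, every subject in a risk set is equally likely to fail, so each event contributes minus the log of the risk set size.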
The grouping problem can be addressed by borrowing the framework of a lasso-type model selection problem; instead of estimating β such that some of its components are exactly 0 as in the usual implementation of the lasso, we aim to estimate β such that some of the solution coefficients are exactly the same. Hence, instead of constraints on the coefficients we pose constraints on the differences between neighboring coefficients. For the stage grouping problem specifically, β is ordered in both T and N directions such that βj,1≤⋯≤βj,q and β1,k≤⋯≤βp,k,j=1,…,p,k=1,…,q, and these partial ordering constraints are also applied. We propose to estimate β as follows:
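$$\hat{\beta}=\arg\max_{\beta}\ \ell(\beta)\quad\text{subject to}\quad\sum_{j=1}^{p}\sum_{k=2}^{q}\left|\beta_{j,k}-\beta_{j,k-1}\right|+\sum_{j=2}^{p}\sum_{k=1}^{q}\left|\beta_{j,k}-\beta_{j-1,k}\right|\le s, \tag{2.3}$$
or, equivalently,
$$\hat{\beta}=\arg\min_{\beta}\left\{-\ell(\beta)+\lambda\left(\sum_{j=1}^{p}\sum_{k=2}^{q}\left|\beta_{j,k}-\beta_{j,k-1}\right|+\sum_{j=2}^{p}\sum_{k=1}^{q}\left|\beta_{j,k}-\beta_{j-1,k}\right|\right)\right\}, \tag{2.4}$$
where $\ell(\beta)=\log L(\beta)$ is the log partial likelihood, both problems are solved subject to the partial ordering constraints above, and s,λ>0 are tuning parameters. The underlying mechanism is the sparsity-enforcing property of the L1 penalty, which is expected to give a reduced number of unique βj,k values that represent different groups.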
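The partial ordering constraints can be stated compactly in code. The sketch below, with a hypothetical function name and a 0-based nested-list layout chosen for this example, checks whether a coefficient grid is non-decreasing in N for each T and in T for each N.

```python
def satisfies_partial_ordering(beta, tol=1e-12):
    """True if beta[j][k] is non-decreasing along both directions of the grid."""
    p, q = len(beta), len(beta[0])
    # Ordered in N given T: beta[j][0] <= ... <= beta[j][q-1].
    nondecreasing_in_n = all(beta[j][k - 1] <= beta[j][k] + tol
                             for j in range(p) for k in range(1, q))
    # Ordered in T given N: beta[0][k] <= ... <= beta[p-1][k].
    nondecreasing_in_t = all(beta[j - 1][k] <= beta[j][k] + tol
                             for j in range(1, p) for k in range(q))
    return nondecreasing_in_n and nondecreasing_in_t


ordered = [[0.0, 0.2, 0.5],
           [0.1, 0.2, 0.9]]
violating = [[0.0, 0.6, 0.5],   # drops from 0.6 to 0.5 along N
             [0.1, 0.7, 0.9]]
print(satisfies_partial_ordering(ordered), satisfies_partial_ordering(violating))
```

Note that the ordering is only partial: cells such as T2N0 and T1N1 are not directly comparable, so either may carry the larger coefficient.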
2.2. Computational approach
If each neighboring difference is penalized equivalently, as in (2.3) and (2.4), because of the ordering constraints, the absolute values can be dropped and the objective function can be simplified as
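$$-\ell(\beta)+\lambda\left\{\sum_{j=1}^{p}\left(\beta_{j,q}-\beta_{j,1}\right)+\sum_{k=1}^{q}\left(\beta_{p,k}-\beta_{1,k}\right)\right\}. \tag{2.5}$$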
Note that only the coefficients of the “boundary cells” in the T × N table are taken into account in (2.5). Yet this is mathematically equivalent to (2.3) and (2.4) and will give the same estimates as (2.3) and (2.4).
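The equivalence follows from telescoping: under the ordering constraints each difference is non-negative, so the differences along a row sum to the last entry minus the first, and likewise along a column. The Python sketch below checks this numerically on a hypothetical ordered grid; the function names are assumptions for this example.

```python
def full_penalty(beta):
    """Sum of absolute neighboring differences over the whole grid."""
    p, q = len(beta), len(beta[0])
    along_n = sum(abs(beta[j][k] - beta[j][k - 1])
                  for j in range(p) for k in range(1, q))
    along_t = sum(abs(beta[j][k] - beta[j - 1][k])
                  for j in range(1, p) for k in range(q))
    return along_n + along_t


def boundary_penalty(beta):
    """Telescoped form that uses only the boundary cells of the grid."""
    p, q = len(beta), len(beta[0])
    rows = sum(beta[j][q - 1] - beta[j][0] for j in range(p))
    cols = sum(beta[p - 1][k] - beta[0][k] for k in range(q))
    return rows + cols


# Hypothetical 4 x 3 grid that satisfies the partial ordering.
ordered_grid = [[0.0, 0.2, 0.5],
                [0.1, 0.2, 0.9],
                [0.4, 0.7, 0.9],
                [0.4, 1.0, 1.3]]
print(full_penalty(ordered_grid), boundary_penalty(ordered_grid))
```

The two forms agree only when the ordering holds; for a grid that violates it, the boundary form can understate the full penalty, which is why the ordering constraints must be enforced alongside the simplified objective.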
Tibshirani (1997) gave an iterative procedure to solve the L1 penalized Cox proportional hazards model by expressing the usual Newton–Raphson update as an iterative reweighted least squares step and then replacing the weighted least squares step by a constrained weighted least squares procedure. Since our problem does not involve high-dimensional data, this procedure is quite adequate for computing its estimates. Define η=Xβ, u=∂ℓ/∂η, A=−∂²ℓ/∂η∂ηᵀ, and z=η+A⁻¹u. Define Pλ(β) as the penalty term in (2.4), that is, λ times the sum of the absolute differences between neighboring coefficients. The iterative procedure is as follows:
(1) Fix λ and initialize $\hat{\beta}$.
(2) Compute η, u, A, and z based on the current value of $\hat{\beta}$.
(3) Minimize (z−Xβ)ᵀA(z−Xβ)+Pλ(β) subject to βj,1≤⋯≤βj,q and β1,k≤⋯≤βp,k.
(4) Repeat Steps (2) and (3) until the convergence of $\hat{\beta}$.
With the absolute values dropped, the minimization in Step (3) is a simple quadratic program with linear inequality constraints. It requires O(n²) computations with A being a full matrix. To speed up the computation, one can replace A with a diagonal matrix D with the diagonal entries of A (Tibshirani, 1997). The procedure typically runs between p and q iterations, and converges quickly based on our empirical experience. For the colorectal cancer data example that we study in Section 3 (n=1326, p=4, q=3), the typical computation time with a fixed λ is 5 s on a dual-core 2.4 GHz processor.
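The working quantities in Step (2) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: it uses the diagonal approximation D in place of A, treats ties as in the Breslow approximation, and leaves η unchanged wherever a diagonal entry is zero. The function name is hypothetical.

```python
import math


def irls_working_quantities(times, events, eta):
    """Return (u, d, z) for the Cox log partial likelihood at eta.

    u : gradient of the log partial likelihood with respect to eta
    d : diagonal of the negative Hessian (the matrix D in the text)
    z : working response eta + u / d
    """
    n = len(times)
    w = [math.exp(e) for e in eta]
    u = [float(delta) for delta in events]
    d = [0.0] * n
    for r in range(n):
        if events[r] != 1:
            continue
        # Risk set for the event of subject r.
        risk = [i for i in range(n) if times[i] >= times[r]]
        total = sum(w[i] for i in risk)
        for i in risk:
            share = w[i] / total
            u[i] -= share
            d[i] += share * (1.0 - share)
    # Assumption of this sketch: keep eta where the curvature is zero.
    z = [eta[i] + (u[i] / d[i] if d[i] > 0 else 0.0) for i in range(n)]
    return u, d, z


u, d, z = irls_working_quantities([1, 2, 3], [1, 1, 1], [0.0, 0.0, 0.0])
print(u, d, z)
```

Step (3) would then minimize the weighted squared distance between z and Xβ plus the penalty, subject to the ordering constraints, using any quadratic programming routine.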
2.3. The tuning parameter and the lasso tree
The estimates from (2.4) depend on the tuning parameter λ. When λ=0, the solution is the usual Cox model estimate. As λ increases, the absolute differences between neighboring coefficients go to 0 successively, corresponding to the successive grouping of the cells, until all cells are in one group. This is similar to pruning a tree bottom-up and hence we call the proposed method the lasso tree. The estimated coefficients from the lasso tree fit can be displayed as a function of the tuning parameter λ; an example is given in Section3. “Warm starts” are used to efficiently compute the path of solutions over a grid of values for λ: starting at a solution for the previous λ, the solution for the next λ can be found relatively quickly.
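Reading a staging system off a fitted coefficient grid amounts to collecting cells whose coefficients coincide. The following minimal Python sketch does this with a numerical tolerance; the function name, the tolerance, and the coefficient values are assumptions for this example.

```python
def stage_groups(beta, tol=1e-6):
    """Map each (T, N) cell to a stage number; equal coefficients share a stage.

    Stages are numbered from 1 in increasing order of the coefficient.
    """
    cells = sorted((value, j, k)
                   for j, row in enumerate(beta)
                   for k, value in enumerate(row))
    stages, current, last = {}, 0, None
    for value, j, k in cells:
        if last is None or value - last > tol:
            current += 1  # coefficient differs from the previous one: new stage
        stages[(j, k)] = current
        last = value
    return stages


# Hypothetical 2 x 3 fit in which the cells have merged into two groups.
fitted = [[0.0, 0.0, 0.5],
          [0.0, 0.5, 0.5]]
groups = stage_groups(fitted)
print(groups)
```

Applying such a function along the grid of λ values would trace out the successive merging of cells that gives the tree its shape.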
We propose to use the Bayesian information criterion (BIC) (Schwarz, 1978)
$$\mathrm{BIC}(\lambda)=-2\,\ell(\hat{\beta}_{\lambda})+k_{\lambda}\log n \tag{2.6}$$
to select the tuning parameter λ, where $\ell(\hat{\beta}_{\lambda})$ is the log-partial likelihood for the constrained fit with λ, and kλ is the degrees of freedom in the model. In this paper, we estimate kλ by the number of unique parameters (the number of groups identified). Intuitively, the BIC inflates the negative log-partial likelihood by a penalty term proportional to the effective number of parameters. The BIC is calculated over a grid of values of λ which are uniformly distributed on the log-scale from 0 (yielding p×q stages) to a sufficiently large value (reducing to a single stage), and the value $\hat{\lambda}$ yielding the lowest estimated BIC is selected.
Other popular methods for tuning parameter selection include the Akaike information criterion (AIC) (Akaike, 1974), cross-validation (CV), and generalized CV (GCV) (Craven and Wahba, 1978). It is known that the BIC is consistent for model selection while the CV, GCV, and AIC are not (Shao, 1997). Since our primary focus is model selection rather than prediction, we elect to use BIC as the tuning parameter selection criterion in this paper. In fact, simulation studies have shown that the GCV statistic is inferior to the BIC in terms of selecting the correct grouping (not shown in this paper). In addition, AIC tends to select less sparse groupings compared with BIC.
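The selection step can be sketched as follows. This is a minimal illustration assuming the usual Schwarz form of the criterion, minus twice the log partial likelihood plus the number of groups times log(n); the function name, the rounding used to count distinct coefficients, and all numbers in the example path are hypothetical.

```python
import math


def select_by_bic(path, n):
    """Pick the fit with the smallest BIC from a list of candidate fits.

    path : list of (lam, loglik, beta) tuples, one per tuning parameter value
    n    : sample size used in the log(n) penalty (an assumption of this sketch)
    Returns (bic, lam, k) for the selected fit.
    """
    best = None
    for lam, loglik, beta in path:
        # Degrees of freedom: number of distinct coefficient values (groups).
        k = len({round(value, 6) for row in beta for value in row})
        bic = -2.0 * loglik + k * math.log(n)
        if best is None or bic < best[0]:
            best = (bic, lam, k)
    return best


# Made-up path: heavier penalties merge cells at some cost in likelihood.
example_path = [
    (0.1, -100.0, [[0.0, 0.3], [0.6, 0.9]]),   # four groups
    (1.0, -101.0, [[0.0, 0.0], [0.8, 0.8]]),   # two groups
    (10.0, -120.0, [[0.0, 0.0], [0.0, 0.0]]),  # one group
]
selected = select_by_bic(example_path, n=100)
print(selected)
```

In this made-up path the two-group fit is selected because the small loss in likelihood relative to the four-group fit is outweighed by the saving of two parameters.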
3. Example
3.1. Colorectal cancer data
We based our analysis on a de-identified database of 1326 patients with non-metastatic colon cancer treated at the Memorial Sloan-Kettering Cancer Center between January 1, 1990 and December 27, 2000. All patients were diagnosed with AJCC stage 1 to 3c (sixth edition) disease. The outcome used in this analysis is cancer-specific survival. Of the 1326 patients, 379 died by the end of follow-up, and the median survival was 115 months. Of those alive, the median follow-up time was 61.4 months. A table presenting the sample size and hazard ratio estimated from an unrestricted Cox model for each cell in the T × N table is provided in Section A of the supplementary material available at Biostatistics online. With a couple of exceptions apparently due to small sample sizes, there is a strong upward trend in risk with increasing T and N involvement. However, we observe a relatively poor separation of the Kaplan–Meier survival curves under the AJCC sixth edition stage grouping (Figure3(a)).
Fig. 3.
Survivals of colorectal cancer patients by (a) AJCC staging, (b) the lasso tree selected staging, and (c) the lasso tree selected staging incorporating information from the AJCC.
3.2. Results: the lasso tree
Figure 1 shows the estimated coefficients from the lasso tree fit as a function of the log tuning parameter log(λ). Labels on top of the graph show when the groupings occur. The cell T1N0 is set to be the reference group. There is a left-right tree structure and the cells merge successively in a roughly monotone fashion. The monotonicity is with respect to the grouping process, where a higher level grouping always contains the lower ones as subsets.
Fig. 1.

The lasso tree: coefficient estimates and BIC for the colorectal cancer example, as a function of log(λ). The dotted line represents the staging for $\hat{\lambda}$, selected by minimizing the BIC. K represents the number of groups.
The tree structure starts with 8 groups instead of 12 because some of the cells are forced to merge by the partial ordering constraints. Specifically, the unconstrained coefficients in cells T2N0, T2N1, T1N2, and T2N2 are not in line with the natural ordering and hence these cells are aggregated into one group at the very beginning. The same can be said for cells T3N0 and T4N0. Interestingly, T1N1 starts very close to T1N0 but then separates from it and eventually merges with a higher stage. Note that since only 1 patient died among the 14 patients in T1N1, this behavior of the T1N1 coefficient estimate can be attributed to the large variation within this subcategory.
The vertical dotted line is drawn at 1.66, the value of log(λ) that minimizes the BIC. The grouping selected by BIC has four stages (Figure2(b)), with estimated hazard ratios
. Unlike the AJCC grouping (Figure2(a)), which divides stage 3 horizontally at N1, the lasso tree selected grouping classifies stages primarily by the T categories (vertically). The Kaplan–Meier survival curves for the lasso tree selected staging are displayed in Figure3(b). Clearly, the lasso tree selected staging scheme gives a better separation of survivals than does the AJCC staging system. This result indicates that there might be much room for improvement for the current AJCC staging.
Begg and others (2005) proposed three measures for comparing tumor staging and grading systems: (1) explained variation; (2) area under the ROC curve (AUC); and (3) the probability of concordance of stage and survival. Here, we use these criteria to formally evaluate the AJCC and the lasso tree selected staging systems. All three measures are bounded by 0 and 1, with greater values indicating a greater degree of prognostic separation. Table 2 shows the estimated values of the three criteria for the two systems. The estimates of standard errors are obtained by bootstrapping. As expected, the lasso tree selected system has substantially greater prognostic power as measured by all three criteria than the AJCC system.
Table 2.
The AJCC and the lasso tree selected systems: the estimated criteria and their standard errors
| System | VAR (SE) | AUC (SE) | Concordance (SE) |
|---|---|---|---|
| AJCC | 0.642 (0.012) | 0.660 (0.013) | 0.623 (0.008) |
| Lasso tree selected | 0.686 (0.009) | 0.706 (0.011) | 0.664 (0.008) |
VAR, explained variation; AUC, area under the ROC curve; concordance, the probability of concordance.
4. Simulation study
In this section, we present simulation studies to investigate the finite sample properties of the lasso tree. The performance of the lasso tree and the existing approaches for cancer staging will be compared from two aspects: the ability to select the correct grouping and robustness with regard to changes in the data. In Section D of the supplementary material available at Biostatistics online, we also investigate the role of the partial ordering constraints on the lasso tree.
4.1. Ability to select the correct grouping
A hypothetical four-stage system is used to generate the data, whose sample distribution in the T × N table is chosen to be representative of the real colorectal cancer data. The sample size is set to be 1000, also representative of the colorectal cancer data. Based on the “true” grouping, we generate survival times from two models: (A) the exponential model
; and (B) the log-normal model
, where W has a standard normal distribution and σ=0.8. The coefficients are set to be
for a moderate effect and
for a small effect. Censoring times are generated from a Unif(0,τ) distribution, where τ is chosen to produce 40% and 80% censoring.
For comparison, the bootstrap and best subset methods are included and the criteria for selection are chosen to be the AUC for 10-year survival and the probability of concordance of grouping and survival (concordance), as studied by Begg and others (2005) and Lin and others (2012). Explicit descriptions of these two methods are provided in Section B of the supplementary material available at Biostatistics online. Other selection criteria that quantify the prognostic ability of candidate groupings, such as the partial likelihood, BIC, etc., can also be used under the bootstrap and best subset methods. Yet they perform unfavorably when compared with AUC and concordance (results not shown in this paper), and hence are dropped from the simulation. Note that both approaches assume that the true number of groups K is known or pre-specified, which might not be realistic in practice.
Table 3 reports the empirical probabilities (based on 1000 simulations) of selecting the correct four-stage system. Two significant digits are shown based on the maximum Monte Carlo standard error (√(0.5×0.5/1000)≈0.016). For the lasso tree, we report selection probabilities both when assuming that the true number of groups K=4 is known and when K is unknown and estimated by the BIC. The empirical results show that the lasso tree is able to select the correct grouping, especially when the true number of groups K is pre-specified; the probability of selecting the true grouping when fixing K=4 is over 70% in all scenarios studied. When the BIC is used to select the grouping, the success rate is lower (around 65%). However, it is worthwhile to emphasize that most of the remaining groupings selected by the BIC are in fact nested in the true four-stage grouping, and so their errors involve falsely splitting stages rather than erroneously combining them. Moreover, the grouping selected by the BIC is very robust to the degree of censoring and the effect size. The bootstrap and best subset methods give much less satisfactory results, especially the best subset selection methods, which show very low probabilities of selecting the correct grouping even though the true number of groups K is pre-specified.
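The precision of the reported probabilities can be checked with a short calculation. The standard error of an empirical proportion over a fixed number of independent simulation runs is largest when the true probability is 0.5; the sketch below evaluates that bound, with a function name chosen for this example.

```python
import math


def max_monte_carlo_se(n_sim):
    """Largest standard error of an empirical proportion over n_sim runs."""
    # p * (1 - p) is maximized at p = 0.5.
    return math.sqrt(0.5 * 0.5 / n_sim)


mc_se = max_monte_carlo_se(1000)
print(round(mc_se, 3))
```

With 1000 simulations the bound is below 0.02, so differences between methods in the second decimal place of a few hundredths should be read with that uncertainty in mind.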
Table 3.
Selection probabilities based on 1000 simulations (sample size = 1000)
| True model | % censored | Lasso tree: K known | Lasso tree: K unknown | Bootstrap: AUC | Bootstrap: Concordance | Best subset: AUC | Best subset: Concordance |
|---|---|---|---|---|---|---|---|
| A(1, 2, 4, 8) | 40 | 0.94 | 0.68 | 0.57 | 0.62 | 0.40 | 0.33 |
| A(1, 2, 4, 8) | 80 | 0.88 | 0.65 | 0.38 | 0.43 | 0.23 | 0.14 |
| A(1.0, 1.5, 2.5, 4.0) | 40 | 0.80 | 0.68 | 0.44 | 0.48 | 0.22 | 0.14 |
| A(1.0, 1.5, 2.5, 4.0) | 80 | 0.72 | 0.62 | 0.32 | 0.31 | 0.13 | 0.09 |
| B(1, 2, 4, 8) | 40 | 0.99 | 0.74 | 0.81 | 0.82 | 0.78 | 0.77 |
| B(1, 2, 4, 8) | 80 | 0.95 | 0.73 | 0.67 | 0.67 | 0.62 | 0.63 |
| B(1.0, 1.5, 2.5, 4.0) | 40 | 0.94 | 0.74 | 0.56 | 0.58 | 0.53 | 0.52 |
| B(1.0, 1.5, 2.5, 4.0) | 80 | 0.89 | 0.74 | 0.51 | 0.50 | 0.46 | 0.44 |
A, exponential model; B, log-normal model.
4.2. Estimation stability
Bootstrap samples are drawn from the colorectal cancer dataset with replacement to evaluate the estimation stability of the lasso tree with respect to changes in the data. For comparison, we include the best subset selection and assume the number of groups to be K=3 and K=6, corresponding to the main stage and substage groupings of the AJCC sixth edition, respectively. The algorithm is stable if it selects one grouping a large proportion of the time. Here, we only state the results: the lasso tree gives the most stable results, with one dominant staging when fixing K=3 or estimating K with BIC, and a few dominant groupings when K=6. The best subset selection gives the most unstable results, especially when K is large. More details can be found in Section C of the supplementary material available at Biostatistics online.
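One simple way to summarize stability in this sense is the share of bootstrap replicates that return the most frequently selected grouping. The Python sketch below assumes each selected grouping is encoded as a tuple of stage labels, one per cell in a fixed order; the function name and the example selections are hypothetical.

```python
from collections import Counter


def dominant_grouping_share(selected_groupings):
    """Return the most frequent grouping and the proportion of times it occurs."""
    counts = Counter(selected_groupings)
    grouping, count = counts.most_common(1)[0]
    return grouping, count / len(selected_groupings)


# Made-up selections from five bootstrap replicates on a 2 x 2 grid.
replicates = [
    (1, 1, 2, 2),
    (1, 1, 2, 2),
    (1, 2, 2, 2),
    (1, 1, 2, 2),
    (1, 1, 1, 2),
]
top_grouping, share = dominant_grouping_share(replicates)
print(top_grouping, share)
```

A share close to 1 indicates that resampling the data rarely changes the selected staging, whereas a small share spread over many groupings indicates instability.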
5. Incorporating information from the AJCC staging or other source
The AJCC staging system has been seen as the most concerted effort to design a universally acceptable staging system and, since its introduction, has been used in clinical practice throughout the world. We recognize that suggesting a redefinition of the AJCC grouping scheme can be very difficult and comes with a cost regarding our future ability to make comparisons to past experience. On the other hand, the AJCC staging scheme has been developed using a combination of medical knowledge and observational studies, and hence could contain valuable information on prognostic separation of cancer patients. It is reasonable to incorporate this information when developing new systems, which can be done by modifying the penalty terms to reflect it.
We include the information from AJCC in the regression model by imposing a heavier penalty on the differences between cells that are in the same stage according to AJCC, in other words, the differences that are zero under AJCC. More specifically, the sixth edition AJCC on colorectal cancer has three main stages and six substages. With βj,k indexed by T=j and N=k as in Section 2, where k=1, 2, 3 corresponds to N0, N1, N2, the differences that are zero under the substage grouping are β2,1−β1,1, β2,2−β1,2, β4,2−β3,2, β2,3−β1,3, β3,3−β2,3, and β4,3−β3,3; and the additional differences that are zero under the main stage grouping are β4,1−β3,1, β3,2−β2,2, β1,3−β1,2, β2,3−β2,2, β3,3−β3,2, and β4,3−β4,2. Different weights are imposed on these two sets of differences to reflect these two levels of grouping; cells in the same substage are expected to be closer than cells in the same main stage but different substages. Penalty terms |βj,k−βj,k−1| (or |βj,k−βj−1,k|) in (2.4) corresponding to the first set of differences are replaced by w1|βj,k−βj,k−1| (or w1|βj,k−βj−1,k|), and penalty terms corresponding to the second set of differences are replaced by w2|βj,k−βj,k−1| (or w2|βj,k−βj−1,k|), where w1>w2>1. The remaining differences have weights of 1. These cells are hence forced to aggregate more aggressively than the rest, leading to a staging system that might look more like the AJCC.
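The weighting scheme can be sketched in Python by assigning a weight to every pair of neighboring cells according to whether the two cells share an AJCC substage, share only a main stage, or neither. The substage table below encodes the AJCC sixth edition grouping as implied by the differences listed above; the function names and dictionary layout are assumptions for this example.

```python
# AJCC sixth edition substages for M0 colorectal cancer, keyed by (T, N).
AJCC6_SUBSTAGE = {
    (1, 0): "I",    (2, 0): "I",    (3, 0): "IIA",  (4, 0): "IIB",
    (1, 1): "IIIA", (2, 1): "IIIA", (3, 1): "IIIB", (4, 1): "IIIB",
    (1, 2): "IIIC", (2, 2): "IIIC", (3, 2): "IIIC", (4, 2): "IIIC",
}


def main_stage(substage):
    """Strip the substage letter, e.g. 'IIIB' -> 'III'."""
    return substage.rstrip("ABC")


def penalty_weights(w1, w2):
    """Weight for each neighboring pair of cells in the T x N table."""
    weights = {}
    for (t, n), stage in AJCC6_SUBSTAGE.items():
        # Neighbors one step up in T and one step up in N.
        for neighbor in ((t + 1, n), (t, n + 1)):
            if neighbor not in AJCC6_SUBSTAGE:
                continue
            other = AJCC6_SUBSTAGE[neighbor]
            if other == stage:
                w = w1      # same substage: heaviest penalty
            elif main_stage(other) == main_stage(stage):
                w = w2      # same main stage, different substage
            else:
                w = 1.0     # different main stages
            weights[((t, n), neighbor)] = w
    return weights


ajcc_weights = penalty_weights(4.0, 2.0)
print(sorted(ajcc_weights.values()))
```

Of the 17 neighboring pairs in the 4 × 3 table, six receive w1, six receive w2, and the remaining five keep weight 1, matching the two sets of six differences described in the text.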
We apply this modified modeling on the colorectal cancer data with two choices of wi's: (1) w1=4, w2=2; and (2) w1=10, w2=5. The optimal groupings selected by the BIC are identical under both choices of wi's with slightly different hazard ratio estimates. The selected scheme has five groups (Figure2(c)), with
when w=(4,2) and
when w=(10,5). This grouping is in fact almost identical to the one selected by the lasso tree in Section3, except that the cell T2N0 is separated as a single stage. The information contained in the AJCC scheme is overwhelmed by the information contained in the data. Figure3(c) shows the survival curves under this five-stage system. As a further division of the previous four-stage scheme, the five-stage scheme seems to offer no apparent improvement in separating the survivals.
6. Discussion and conclusions
An accurate staging system is crucial for obtaining prognoses and guiding the treatment strategy. For decades investigators have developed and refined stage groupings using a combination of medical knowledge and observational studies, yet there appears to be no well-established statistical method for objectively incorporating quantitative evidence into this process. In this paper, we have proposed and studied the lasso tree stage selection method for censored survival data via a penalized likelihood approach. The method is shown to be effective in estimating the best staging system and its estimates stable with regard to changes in the data. The tree structure resulting from varying the tuning parameter provides a visual aid on how the groupings are made progressively and allows flexibility in the decision-making process for clinical researchers and practitioners.
Our analysis of the colorectal cancer data has provided some insight into the prognostic power of the TNM staging system. The selected systems (Figure2(b) and (c)) are virtually identical in their configuration and prognostic accuracy. The selected five-stage grouping is a further division of the four-stage grouping with no apparent improvement in separating the survivals. Thus, it might be reasonable to favor a more parsimonious system as urged in Gönen and Weiser (2010). Both selected schemes suggest that the most essential information is contained in the contrast between the tumor invading through the muscularis propria (T3 and T4) and otherwise (T1 and T2). This differs from the AJCC where the primary distinction is between node-positive (N1 and N2) and node-negative (N0) cancers. Cancer staging has been as much about anatomic interpretation as it is about accurate prognosis. A staging system that is prognostically optimal is unlikely to be adopted if it does not respect the anatomic extent of disease. Our results for colorectal cancer suggest that prognostically optimal systems are also anatomically interpretable. The substantial difference between the prognostic ability of the selected grouping schemes and the AJCC categories is concerning especially in light of the fact that the selected schemes are comparable with the AJCC's in terms of simplicity and interpretability.
In Section5, we tried combining prior beliefs on cancer staging—the AJCC system—into the selection process by imposing different weights on the penalty terms. Since the AJCC grouping bears almost no resemblance to the data selected grouping, the data eventually overwhelmed the prior belief in both choices of weights. The same task might be tackled from a Bayesian perspective. The prior domain knowledge can be quantified in the form of prior distributions elicited from the AJCC. More specifically, we may obtain the prior distributions of the regression coefficients by fitting the AJCC model/grouping to an independent dataset, such as the data from the Surveillance, Epidemiology, and End Results program, a large national cancer registry. Posterior distributions of model parameters can then be obtained for the penalized Cox proportional hazards model using a Bayesian methodology.
One difficulty with the development of staging systems has been that the current treatment strategies are stage-dependent. Thus, the survival outcomes for patients could be confounded by the actual staging system that was used in their care. A possible remedy is to control for the treatment assignment, which can be done by including the treatment covariate in the penalized proportional hazards model. Otherwise, we must acknowledge that our and others’ results may combine two or more categories whose outcomes have been rendered similar by varying treatment regimens.
In the data analysis in Section 3, we note that the four- and five-stage groupings selected by the lasso tree have nearly identical BICs. A careful examination of the analysis suggests that the four-stage grouping is indeed more desirable because it aggregates T1N1, the second stage in the five-stage grouping with inadequate sample size, into a larger group. Stages with low prevalence are not useful for clinical treatment decisions. In fact, one of the many criteria for a good staging system defined by Groome and others (2001) is a balanced distribution of patients across the groups. This was achieved by minimizing BIC in the colorectal cancer example, yet in more general cases it might be desirable to pursue a balanced distribution of patients by modifying the penalty as follows:
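$$\lambda\left(\sum_{j=1}^{p}\sum_{k=2}^{q}\tau_{j,k}\left|\beta_{j,k}-\beta_{j,k-1}\right|+\sum_{j=2}^{p}\sum_{k=1}^{q}\upsilon_{j,k}\left|\beta_{j,k}-\beta_{j-1,k}\right|\right), \tag{6.1}$$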
where the positive weights τ and υ are chosen to be inversely proportional to the sample size in the corresponding cells. That is, τj,k=1/(nj,k+nj,k−1) and υj,k=1/(nj,k+nj−1,k), where nj,k is the sample size in the cell with T=j and N=k. This places a heavier penalty on cells with small sample sizes and forces them to aggregate, leading to a more balanced distribution of stage sample sizes.
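The sample-size weights are straightforward to compute from the table of cell counts. The sketch below uses 0-based indices and a nested list of counts; the function name and the counts are hypothetical, and the sketch assumes no two neighboring cells are both empty.

```python
def sample_size_weights(counts):
    """Weights inversely proportional to the combined size of neighboring cells.

    counts[j][k] is the number of patients with T index j and N index k.
    Returns (tau, upsilon), keyed by the (j, k) of the larger-index cell.
    """
    p, q = len(counts), len(counts[0])
    # tau: weight on the difference between cells (j, k) and (j, k-1).
    tau = {(j, k): 1.0 / (counts[j][k] + counts[j][k - 1])
           for j in range(p) for k in range(1, q)}
    # upsilon: weight on the difference between cells (j, k) and (j-1, k).
    upsilon = {(j, k): 1.0 / (counts[j][k] + counts[j - 1][k])
               for j in range(1, p) for k in range(q)}
    return tau, upsilon


# Made-up 2 x 2 table of cell sample sizes.
cell_counts = [[10, 30],
               [50, 10]]
tau, upsilon = sample_size_weights(cell_counts)
print(tau, upsilon)
```

Pairs involving sparsely populated cells receive the largest weights, so their differences are shrunk to zero first as λ grows.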
TNM staging is applicable to virtually any type of solid tumor and hence, although we used colorectal cancer as an illustration, our methodology has general appeal. In addition to cancers, many other diseases also use aggregate risk scores based on ordinal (or ordinalized) risk factors, such as the ATP III score for high blood cholesterol (Expert Panel on Detection, Evaluation, and Treatment of High Blood Cholesterol in Adults, 2001), and these scores can also benefit from optimal aggregation. In this paper, the lasso tree is based on the Cox model for censored survival data. But our methodology is applicable in principle to binary outcomes as well, where the Cox proportional hazards model is simply replaced by a logistic regression model. In some situations a “landmark” survival time, such as 5- or 10-year survival, can be more desirable than the full survival time. A logistic regression model was proposed by Jung (1996) for landmark survival analysis, and an extension of the lasso tree to this model is also quite possible.
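To make the idea of a landmark outcome concrete, the sketch below converts a censored survival record into a binary indicator of survival past a landmark time. It is a naive illustration only and is not the estimator of Jung (1996): patients censored before the landmark have an unknown status and are simply marked as missing here, which a proper analysis would need to handle. The function name and the example records are hypothetical.

```python
def landmark_outcome(time, event, landmark):
    """Binary landmark outcome from a (time, event) survival record.

    time     : follow-up time, in the same units as `landmark`
    event    : True if death was observed at `time`, False if censored
    landmark : landmark time, e.g. 5 for 5-year survival

    Returns 0 if the patient died on or before the landmark, 1 if the
    patient is known to have survived past it, and None if the patient
    was censored before the landmark (status unknown).
    """
    if event and time <= landmark:
        return 0
    if time >= landmark:
        return 1
    return None


# Hypothetical records: (follow-up time in years, death observed?)
records = [(3.0, True), (7.0, True), (7.0, False), (2.0, False)]
outcomes = [landmark_outcome(t, e, landmark=5.0) for t, e in records]
```

The resulting 0/1 outcomes could serve as the response in a logistic regression with the same penalty on neighboring coefficients; records marked None would have to be dropped or reweighted, and dropping them may bias the estimates.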
In summary, we have proposed a promising method for developing cancer staging systems using a penalized proportional hazards model. Future work is needed to provide thorough theoretical support for this method, which would involve the consistency of the selection procedure and the asymptotic properties of the penalized likelihood estimators. The general lasso and fused lasso are not model selection consistent except under restrictions such as the Irrepresentable Condition (Zhao and Yu, 2006). However, the regression problem resulting from the original staging problem has a special structure: all covariates are orthogonal to each other. Under this specific structure, following Zhao and Yu (2006) and Fan and Li (2002), we expect to prove the model selection consistency of our lasso tree method. In other words, we expect to show that, under certain regularity conditions, when n (sample size) tends to infinity, the differences between β's that belong to the same stage will go to zero, and hence with probability tending to 1, we can identify the true staging system.
Software
Software in the form of R code together with a sample input dataset is available online at ftp://www.biostat.wisc.edu/pub/chappell/cancerstaging or on request from the corresponding author (yunzhi@stat.wisc.edu).
Supplementary material
Supplementary material is available at http://biostatistics.oxfordjournals.org.
Funding
This research was supported in part by NIH/NCI Grant Number P30 CA014520 to the UW Carbone Cancer Center, Madison, WI.
Acknowledgments
The authors thank the anonymous reviewers and the Associate Editor for their insightful comments and suggestions, which have led to a significantly improved paper. We also thank Mithat Gönen for supplying the colon cancer dataset. The contents of this article are solely the responsibility of the authors and do not necessarily represent the official views of the NIH. Conflict of Interest: None declared.
References
- Akaike H. A new look at the statistical model identification. IEEE Transactions on Automatic Control. 1974;19:716–723.
- American Joint Committee on Cancer. AJCC Cancer Staging Manual. 6th edition. New York: Springer; 2002.
- American Joint Committee on Cancer. AJCC Cancer Staging Manual. 7th edition. New York: Springer; 2010.
- Begg C. B., Cramer L. D., Venkatraman E. S., Rosai J. Comparing tumor staging and grading systems: a case study and a review of the issues, using thymoma as a model. Statistics in Medicine. 2000;19:1997–2014. doi: 10.1002/1097-0258(20000815)19:15<1997::aid-sim511>3.0.co;2-c.
- Breiman L. Heuristics of instability and stabilization in model selection. Annals of Statistics. 1996;24:2350–2383.
- Breiman L., Friedman J. H., Olshen R. A., Stone C. J. Classification and Regression Trees. Belmont, CA: Wadsworth; 1984.
- Craven P., Wahba G. Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik. 1978;31:377–403.
- Expert Panel on Detection, Evaluation, and Treatment of High Blood Cholesterol in Adults. Executive summary of the Third Report of the National Cholesterol Education Program (NCEP) Expert Panel on Detection, Evaluation, and Treatment of High Blood Cholesterol in Adults (Adult Treatment Panel III). Journal of the American Medical Association. 2001;285:2486–2497. doi: 10.1001/jama.285.19.2486.
- Fan J., Li R. Variable selection for Cox's proportional hazards model and frailty model. Annals of Statistics. 2002;30:74–99.
- Gönen M., Weiser M. R. Whither TNM? Seminars in Oncology. 2010;37:27–30. doi: 10.1053/j.seminoncol.2009.12.009.
- Groome P. A., Schulze K. M., Mackillop W. J., Grice B., Goh C., Cummings B. J., Hall S. F., Liu F.-F., Payne D., Rothwell D. M., and others. A comparison of published head and neck stage groupings in carcinomas of the tonsillar region. Cancer. 2001;92:1484–1494. doi: 10.1002/1097-0142(20010915)92:6<1484::aid-cncr1473>3.0.co;2-w.
- Hothorn T., Zeileis A. Generalized maximally selected statistics. Biometrics. 2008;64:1263–1269. doi: 10.1111/j.1541-0420.2008.00995.x.
- Jung S. H. Regression analysis for long-term survival rate. Biometrika. 1996;83:227–232.
- Lee A. W., Foo W., Law S. C., and others. Staging of nasopharyngeal carcinoma: from Ho's to the new UICC system. International Journal of Cancer. 1999;84(2):179–187. doi: 10.1002/(sici)1097-0215(19990420)84:2<179::aid-ijc15>3.0.co;2-6.
- Lin Y., Chappell R. J., Gonen M. A Systematic Selection Method for the Development of Cancer Staging Systems. University of Wisconsin Department of Biostatistics and Medical Informatics, Technical Report No. 230. 2012. doi: 10.1177/0962280213486853.
- National Cancer Institute. National Cancer Institute Fact Sheet: Cancer Staging. http://www.cancer.gov/cancertopics/factsheet/detection/staging
- Schwarz G. E. Estimating the dimension of a model. Annals of Statistics. 1978;6:461–464.
- Shao J. An asymptotic theory for linear model selection. Statistica Sinica. 1997;7:221–264.
- Tibshirani R. The lasso method for variable selection in the Cox model. Statistics in Medicine. 1997;16:385–395. doi: 10.1002/(sici)1097-0258(19970228)16:4<385::aid-sim380>3.0.co;2-3.
- Tibshirani R., Saunders M., Rosset S., Zhu J. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society Series B. 2005;67(1):91–108.
- Tibshirani R., Taylor J. The solution path of the generalized lasso. Annals of Statistics. 2011;39:1335–1371.
- Zhao P., Yu B. On model selection consistency of lasso. Journal of Machine Learning Research. 2006;7:2541–2563.