Summary
To provide an appropriate and practical level of health care, it is critical to group patients into relatively few strata that have distinct prognoses. Such grouping or stratification is typically based on well-established risk factors and clinical outcomes. A well-known example is the American Joint Committee on Cancer staging for cancer, which uses tumor size, node involvement, and metastasis status. We consider a statistical method for such grouping based on individual patient data from multiple studies. The method encourages a common grouping structure as a basis for borrowing information, but acknowledges data heterogeneity, including unbalanced data structures across multiple studies. We build on the "lasso-tree" method, which is more versatile than the well-known classification and regression tree method in generating possible grouping patterns. In addition, the parametrization of the lasso-tree method makes it very natural to incorporate the underlying order information in the risk factors. In this article, we also strengthen the lasso-tree method by establishing its theoretical properties, which Lin and others (2013. Lasso tree for cancer staging with survival data. Biostatistics 14, 327–339) did not pursue. We evaluate our method in extensive simulation studies and an analysis of multiple breast cancer data sets.
Keywords: Cancer staging, Data heterogeneity, Individual patient data, Integrated analysis, Survival analysis
1. Introduction
Risk stratifying patients using established risk factors is a perpetual health care research theme. This topic was brought to the center stage of cancer research when the American Joint Committee on Cancer (AJCC) initiated its 8th edition on cancer staging in 2018. All previous editions have focused on anatomical factors alone, but the new edition tries to incorporate validated biological factors to provide more precise stratification.
Take breast cancer as an example: the new edition adds estrogen receptor (ER) and progesterone receptor (PR) status, human epidermal growth factor receptor 2 (HER2) status, histologic grade, and oncogene expression to the previous 7th edition, which uses only the anatomical assessment of tumor size, regional lymph node involvement, and distant metastasis (Amin and others, 2017; National Cancer Institute, 2004). This more "personalized" approach to patient classification is an inevitable step in this era of precision molecular oncology. However, the availability of many factors, or of many levels of the established factors, leads to an exceedingly refined categorization of patients.
Compared with the 7th-edition staging table, which can be depicted on a single page, the 8th-edition staging algorithm spans over three pages and is rather complex and hard to follow. A large part of the algorithm is based on expert opinions that, while sensible, still need to be validated through continued accumulation of data. This naturally raises a practical question for health care management: is the categorization or refinement clinically significant, and if not, how can we combine similar categories into relatively few strata with clinically distinct prognoses so that the staging system is relatively simple to follow? This can be crucial for effective and simple care delivery and disease management given limited resources.
In our motivating example with three breast cancer data sets, the anatomical assessment factor, ER or PR status, and HER2 status were collected. However, histologic grade and oncogene expression information were not. Therefore, we cannot verify the new AJCC 8th edition cancer staging. However, it is still methodologically and scientifically interesting to investigate how the biological factors of ER or PR status and HER2 status enhance the anatomical staging. In our data sets, the anatomical feature based on the 7th edition of AJCC has four stages: I, IIa, IIb, and III. The biomarker feature, based on the ER/PR and HER2 statuses, has four well-known subtypes (Onitilo and others, 2009): ER/PR+ and HER2+, ER/PR+ and HER2−, ER/PR− and HER2+, and ER/PR− and HER2−. The combination of the two factors, anatomical (A) and biomarker (B), leads to 16 categories as in a typical 4×4 A×B table. The goal is to simplify the categories into fewer risk strata based on patients' clinical outcomes using data-driven algorithms.
Statistically, this task is a supervised clustering or stratification problem where the outcome is usually time to an event (such as recurrence or death) that is subject to possible censoring due to loss to follow-up. An intuitive solution is to use the well-known classification and regression tree (CART) method through recursive partitioning (Breiman, 1984; Loh, 2014). However, as argued by Lin and others (2013), CART does not capture all types of A and B combinations. At each split, CART must have a complete separation with respect to the splitting variable. In the case of the A×B table, this means that CART will split fully along all columns or rows conditional on the existing splits. In other words, the grouping that results from CART must have partitions in straight lines. Yet the true staging system might have a different configuration (Amin and others, 2017). Hence, Lin and others (2013) proposed using a Cox proportional hazards model with penalization on the neighboring coefficients for cancer staging. Specifically, they utilized the sparsity-enforcing property of the lasso penalty (Tibshirani, 1997) to merge neighboring coefficients that are close and, therefore, form clusters. Their method, known as "lasso-tree", has the advantage of generating more general partitioning patterns.
To enable CART to generate arbitrary patterns of the A×B table, one could create an interaction term between A and B and then apply CART directly to this interaction term. In our setting, both the anatomical and biomarker features are ordered in the sense that their corresponding cancer prognoses satisfy the following: I ≺ IIa ≺ IIb ≺ III and (ER/PR+, HER2+) ≺ (ER/PR+, HER2−) ≺ (ER/PR−, HER2+) ≺ (ER/PR−, HER2−), where "≺" means better prognosis. When A and B are ordered, the levels of the interaction term become a partially ordered set, as off-diagonal levels are not comparable. For example, A1B1 has better prognosis than A1B2, A2B1, A2B2, etc., but A1B2 and A2B1 are not directly comparable. Such partial ordering information of the interaction term, which is derived from the ordered properties of A and B, is clearly not utilized in the usual CART algorithm. On the other hand, the lasso-tree method directly incorporates the ordering into its algorithm.
This article extends the lasso-tree method to perform integrated analysis that uses individual participant data (IPD) from multiple sources. Such an integrated analysis can synthesize more information to borrow strength across different studies. On the other hand, the analysis needs to fully acknowledge data heterogeneity. In our motivating study, the data sets have differences in sample sizes, lengths of follow-up, study periods, and population composition and characteristics. These underlying differences are likely associated with challenges, such as varying magnitudes of signals, missing data, and population variations across different studies.
Therefore, we extend Lin and others (2013) in a non-trivial way to emphasize a common structure of risk stratification across different data sets. This basis for borrowing information is robust, as it otherwise makes no assumption or restriction on the magnitudes of the parameters. The common-structure assumption is enforced through a special adaptive group lasso penalty. In addition, we strengthen the method of Lin and others (2013) by establishing its theoretical properties. We evaluate our method in extensive simulation studies and in analyzing our motivating breast cancer data sets.
2. Penalized Cox regression
Suppose the data are collected from $K$ studies, where Study $k$ has $n_k$ observations for $k = 1, \ldots, K$. The total sample size is $n = \sum_{k=1}^{K} n_k$, and for subject $i$ we observe $(T_i, \delta_i)$, with $\delta_i$ the event indicator describing whether the observed time $T_i$ is an event time ($\delta_i = 1$) or a censoring time ($\delta_i = 0$). Let $k_i \in \{1, \ldots, K\}$ be the study indicator of subject $i$. In addition, we also observe covariates, in particular including the staging factors $A_i$ and $B_i$, which are ordinal variables. Because we adopt a regression framework, adjustment for other non-staging covariates, such as age and family history, can easily be incorporated too. Our method works for more than two factors. However, for simpler presentation and better result visualization, we focus on the two staging factors and, without loss of generality, suppose $A_i \in \{1, \ldots, I\}$ and $B_i \in \{1, \ldots, J\}$, with larger values corresponding to worse prognoses.
We use the Cox proportional hazards model as our modeling framework. The hazard function for Study $k$ is assumed to be

$$\lambda_k(t \mid A_i = a, B_i = b) = \lambda_{0k}(t)\,\exp(\theta_{k,ab}), \tag{2.1}$$

where the matrix $\theta_k = (\theta_{k,ab})_{I \times J}$ corresponds to the two staging features $A$ and $B$. The baseline hazard function $\lambda_{0k}(t)$ is unspecified as in the traditional Cox model. Note that we allow both the baseline functions $\lambda_{0k}$ and the coefficients $\theta_k$ to be study-specific to account for possible heterogeneity among the studies. To make $\theta_k$ identifiable, we restrict $\theta_{k,11} = 0$.
We further denote

$$\beta_k = \operatorname{vec}(\theta_k), \tag{2.2}$$

where $\operatorname{vec}(\cdot)$ is the vectorization transformation which stacks matrix columns into a vector.
The logarithm of the partial likelihood corresponding to Study $k$ is

$$\ell_k(\beta_k) = \sum_{i \in D_k} \left[ \theta_{k, A_i B_i} - \log \sum_{j \in R_k(T_i)} \exp\!\left(\theta_{k, A_j B_j}\right) \right], \tag{2.3}$$

where $D_k$ is the set of all failure events and $R_k(T_i)$ is the risk set of subject $i$ in Study $k$. The overall log-likelihood is

$$\ell(\beta) = \sum_{k=1}^{K} \ell_k(\beta_k), \tag{2.4}$$

where $\beta = (\beta_1^\top, \ldots, \beta_K^\top)^\top$.
To construct a staging system, one can first maximize (2.4) with respect to $\beta$, or maximize $\ell_k(\beta_k)$ with respect to $\beta_k$ for $k = 1, \ldots, K$, and then inspect the resulting estimates for possible clustering of adjacent cells. However, such an endeavor may be very ineffective due to the large number of free parameters, that is, $K(IJ - 1)$, in $\beta$. In addition, a small Study $k$ may cause unstable estimation of $\beta_k$, leading to difficulty in synthesizing findings across studies. Therefore, it is critical to streamline the analysis for the construction of an interpretable and unified staging system.
As patients with similar values of the risk factors are more likely to have similar disease prognoses, the adjacent cells in the grid generated by the ordinal risk factors are more likely to cluster in the same stage. We therefore work with the vector of differences between the neighboring coefficients in Study $k$. To simplify notation, we let $E$ be the set of neighboring pairs and define $\eta_k$ as follows:

$$\eta_k = D \beta_k. \tag{2.5}$$

Here the matrix $D$ can be written as

$$D = \begin{pmatrix} I_J \otimes D_I \\ D_J \otimes I_I \end{pmatrix},$$

where $I_I$ and $I_J$ are identity matrices of dimensions $I$ and $J$, $\otimes$ is the Kronecker product, and $D_m$ is the $(m-1) \times m$ first-difference matrix

$$D_m = \begin{pmatrix} -1 & 1 & & \\ & \ddots & \ddots & \\ & & -1 & 1 \end{pmatrix}.$$
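To make the difference operator concrete, the following minimal Python sketch (an illustrative translation by us; the authors' implementation is in R, and the function names here are ours) builds $D$ for an $I \times J$ grid and applies it to a vectorized coefficient matrix:

```python
import numpy as np

def first_diff(m):
    # (m-1) x m first-difference matrix D_m: row r has -1 at r and +1 at r+1
    return np.eye(m, k=1)[: m - 1] - np.eye(m)[: m - 1]

def diff_matrix(I, J):
    # D = [ I_J kron D_I ; D_J kron I_I ], acting on beta = vec(theta)
    return np.vstack([
        np.kron(np.eye(J), first_diff(I)),   # neighbors within columns (A direction)
        np.kron(first_diff(J), np.eye(I)),   # neighbors across columns (B direction)
    ])

I, J = 4, 4
# toy coefficient grid: theta[a, b] = b*I + a, so row steps differ by 1, column steps by 4
theta = np.array([[b * I + a for b in range(J)] for a in range(I)], dtype=float)
beta = theta.flatten(order="F")              # vec: stack columns into a vector
eta = diff_matrix(I, J) @ beta               # all neighboring differences
```

For this toy grid, the first $J(I-1)$ entries of `eta` (within-column differences) are all 1 and the remaining $I(J-1)$ entries (across-column differences) are all 4, matching the construction of $\theta$.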
To borrow strength across the studies and also enhance result interpretation, we consider the following adaptive group lasso penalization (Yuan and Lin, 2006; Zou, 2006; Wang and Leng, 2008) in the form of

$$P_\lambda(\beta) = \lambda \sum_{(s,t) \in E} w_{st} \sqrt{\sum_{k=1}^{K} \eta_{k,st}^2}, \tag{2.6}$$

where the adaptive weights are

$$w_{st} = \left( \sum_{k=1}^{K} \tilde{\eta}_{k,st}^2 \right)^{-1/2}, \tag{2.7}$$

with $\tilde{\eta}_{k,st} = d_{st}^\top \tilde{\beta}_k$ computed from initial estimators $\tilde{\beta}_k$ and $d_{st}^\top$ the row of $D$ corresponding to the neighboring pair $(s,t)$. Note that, different from the usual adaptive group lasso, our penalty is directly on $\eta_k = D\beta_k$ instead of $\beta_k$. This penalization on differences of adjacent parameters facilitates a "fusing" or clustering of neighboring coefficients for risk stratification. By the well-known group-wise selection property of the group lasso penalization (Yuan and Lin, 2006), all the terms under the square root are enforced to be 0 simultaneously or not at all. That is, for a given pair $(s,t) \in E$, the differences $\eta_{k,st}$, $k = 1, \ldots, K$, are either all estimated to be 0 or all left nonzero. Therefore, our penalization preserves the stratifying structure across the $K$ studies; otherwise, we allow the coefficients $\beta_k$ to vary freely. The adaptive part, realized by $w_{st}$, leads to much more stable computation and paves the way for a rigorous theoretical analysis of our method.
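To see the group-wise mechanism numerically, here is a small Python sketch (our own illustration, not the GLIDARS source) that evaluates a penalty of the form (2.6) given the per-study difference vectors; `lam` and `w` stand for the tuning parameter and adaptive weights:

```python
import numpy as np

def group_lasso_penalty(etas, lam, w):
    """etas: (P, K) array with one row per neighboring pair (s, t)
    and one column per study; w: length-P adaptive weights.
    Returns lam * sum over pairs of w_st * ||eta_st||_2, cf. (2.6)."""
    return lam * float(np.sum(w * np.linalg.norm(etas, axis=1)))
```

Because each row enters only through its Euclidean norm, a pair's differences are driven to zero in all $K$ studies at once, which is exactly the "preserved stratifying structure" described above.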
Because the staging factors are ordinal, the following ordering constraints are imposed:

$$D \beta_k \geq 0, \qquad k = 1, \ldots, K. \tag{2.8}$$

The above constraints (2.8) mean that all the elements of $\eta_k$ are non-negative, which we denote by $\eta_k \geq 0$.

In summary, we seek the solution of the following problem:

$$\hat{\beta} = \arg\min_{\beta:\, \eta_k \geq 0,\, k = 1, \ldots, K} \left\{ -\ell(\beta) + P_\lambda(\beta) \right\}. \tag{2.9}$$

When there is only one study ($K = 1$) and the adaptive weights are kept constant, our formulation reduces essentially to the lasso-tree method (Lin and others, 2013).
We use the well-known theory of counting processes (Andersen and Gill, 1982) to establish the oracle properties of the proposed penalized Cox regression approach. To guarantee the local asymptotic quadratic property of the partial likelihood function (2.4), Conditions A–D on page 1105 of Andersen and Gill (1982) are assumed throughout this section. We call these conditions the "Andersen–Gill conditions". Suppose $\beta_k^0$, $k = 1, \ldots, K$, are the true values of $\beta_k$, and write $\eta_k^0 = D \beta_k^0$ and $\beta^0 = (\beta_1^{0\top}, \ldots, \beta_K^{0\top})^\top$. Let $E_0 = \{(s,t) \in E : \eta_{k,st}^0 = 0 \text{ for all } k\}$ be the indices of adjacent grids that should be grouped together. Then we have the following theorem.

Theorem 1

Assume the Andersen–Gill conditions as in Andersen and Gill (1982) and Condition (2.7). In addition, assume that $\lambda_n / \sqrt{n} \to 0$ and $\lambda_n \to \infty$ as $n \to \infty$. Then the solution to (2.9), $\hat{\beta}$, satisfies the following properties as $n \to \infty$:

(Consistency in grouping) $P(\hat{E}_0 = E_0) \to 1$, where $\hat{E}_0 = \{(s,t) \in E : \hat{\eta}_{k,st} = 0 \text{ for all } k\}$;

(Asymptotic normality) $\sqrt{n}\, C (\hat{\beta} - \beta^0) \to_d N(0, C \Sigma C^\top)$, where $\Sigma$ is the asymptotic covariance that one would obtain by using a standard Cox model as in Andersen and Gill (1982), and $C$ is any full-rank matrix satisfying $C (e_k \otimes d_{st}) = 0$ for all $(s,t) \in E_0$ and $k = 1, \ldots, K$.
The proof of Theorem 1 largely follows Andersen and Gill (1982) and Zou (2006) and is relegated to the Supplementary Materials available at Biostatistics online.
3. Computation and algorithm
The minimization in Problem (2.9) is non-trivial. Different from the usual adaptive group lasso, which can be solved by applying a block coordinate descent algorithm (Friedman and others, 2007), our penalty is on $\eta_k = D\beta_k$ instead of $\beta_k$. As a result, each element of $\beta_k$ can appear under multiple square roots, which makes it impossible to reparameterize so as to apply the block coordinate descent algorithm. Before we introduce our algorithm, we first revisit the existing algorithm for the Cox model with the lasso penalty proposed in Tibshirani (1997). We have developed an R package called "Group Lasso assisted Integrated Data Analysis for Risk Stratification" (GLIDARS), which is available at https://github.com/WangTJ/glidars.
3.1. Iteratively reweighted least squares algorithm
In the presence of the well-known lasso penalty, the minimization of the negative log partial likelihood of the Cox model is achieved by a quasi-Newton method, usually referred to as the iteratively reweighted least squares (IRLS) algorithm (Green, 1984; Tibshirani, 1997).

For notational convenience, we introduce a dummy vector $x_i$ for subject $i$ such that $x_i^\top \beta = \theta_{k_i, A_i B_i}$. Let $X$ be the matrix whose $i$th row is $x_i^\top$. Then the hazard function for subject $i$ under the Cox model (2.1) can be rewritten as $\lambda_{0 k_i}(t)\exp(x_i^\top \beta)$. The IRLS algorithm starts from an initial guess of the optimizer, $\beta^{(0)}$, and then calculates the gradient and Hessian of $-\ell$ with respect to the linear predictor $f = X\beta$ at $\beta^{(0)}$:

$$u = \left.\frac{\partial (-\ell)}{\partial f}\right|_{f = X\beta^{(0)}}, \qquad H = \left.\frac{\partial^2 (-\ell)}{\partial f\, \partial f^\top}\right|_{f = X\beta^{(0)}}.$$

Then $\beta^{(0)}$ is updated to $\beta^{(1)}$, which minimizes the quadratic approximation

$$(z - X\beta)^\top H (z - X\beta), \qquad z = X\beta^{(0)} - H^{-1} u,$$

with respect to $\beta$ under the lasso penalty. This process is then iterated until convergence. It is worth mentioning that the calculation of the second-order derivative matrix $H$, whose dimension is the total sample size $n$, is the bottleneck of the computation. Tibshirani (1997) suggested keeping only the diagonal elements of the matrix $H$, because they are much larger than the off-diagonal ones. With such a method, more iterations are needed before convergence, but each iteration is much faster.
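For intuition, the gradient and the diagonal of the Hessian of the negative log partial likelihood with respect to the linear predictor can be computed directly from the risk sets. The following Python sketch (our own illustration, assuming no tied event times; not part of the GLIDARS package) does this for a single data set:

```python
import numpy as np

def cox_irls_weights(time, event, f):
    """Gradient u and diagonal Hessian h of the negative log partial
    likelihood with respect to the linear predictor f (no tied times)."""
    n = len(time)
    u = -np.asarray(event, dtype=float)        # each subject's own event term
    h = np.zeros(n)
    ef = np.exp(f)
    for r in range(n):
        if event[r]:                           # each event contributes via its risk set
            risk = time >= time[r]
            pi = np.where(risk, ef, 0.0) / ef[risk].sum()
            u += pi                            # gradient: -delta_i + sum of risk-set shares
            h += pi * (1.0 - pi)               # diagonal curvature terms
    return u, h
```

A quick sanity check of the sketch: the gradient entries always sum to zero, since each event's risk-set shares sum to one and cancel that event's own term.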
To deal with the penalty term in (2.6), we use a local quadratic approximation similar to Fan and Li (2001). Define $\eta_{st} = (\eta_{1,st}, \ldots, \eta_{K,st})^\top$. Note that the penalty term (2.6) is actually a summation of the $\ell_2$ norms of the vectors $\eta_{st}$ over all the groups, or

$$P_\lambda(\beta) = \lambda \sum_{(s,t) \in E} w_{st} \|\eta_{st}\|_2.$$

The key idea is to replace the $\ell_2$ norms in the penalty with quadratic terms that are simple to deal with in optimization. Suppose in the $m$th step $\eta_{st}^{(m)}$ is obtained from $\beta^{(m)}$; then the $\ell_2$ norm of $\eta_{st}$ can be approximated by

$$\|\eta_{st}\|_2 \approx \frac{\|\eta_{st}\|_2^2}{\|\eta_{st}^{(m)}\|_2}. \tag{3.10}$$

Now the numerator is quadratic in $\beta$ and can be combined with the quadratic terms in the IRLS. The denominator is replaced with $\|\eta_{st}^{(m)}\|_2$, which can be calculated from $\beta^{(m)}$ and is thus free of the unknown when optimizing for $\beta^{(m+1)}$.
One needs to pay special attention when the norm of some group approaches 0, $\|\eta_{st}^{(m)}\|_2 \to 0$, which is expected for the purpose of grouping through variable selection when the tuning parameter $\lambda$ is large enough. In that case, we use a small cutoff $\epsilon_0$ to preclude the denominator in (3.10) from being 0 and modify (3.10) as

$$\|\eta_{st}\|_2 \approx \frac{\|\eta_{st}\|_2^2}{\max\!\left( \|\eta_{st}^{(m)}\|_2,\; \epsilon_0 \right)},$$

where $\max$ is the maximum operator. The cutoff $\epsilon_0$ should be chosen much smaller than the convergence tolerance $\epsilon$, that is, $\epsilon_0 \ll \epsilon$, to ensure an accurate staging in the result.
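Note that the approximation is exact at the expansion point, which is what makes iterating it stable. A minimal Python sketch of the modified rule (our own illustration; the names are ours):

```python
import numpy as np

def surrogate_norm(eta, eta_m, eps0=1e-10):
    # Local quadratic approximation of ||eta||_2 around the previous
    # iterate eta_m, with cutoff eps0 guarding a vanishing denominator.
    return float(eta @ eta) / max(float(np.linalg.norm(eta_m)), eps0)
```

At `eta = eta_m = (3, 4)` this returns 25/5 = 5, the exact norm; once a group has collapsed (`eta_m` near 0), the surrogate's very large curvature keeps it at zero, producing the fusing behavior described above.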
The complete algorithm is described in Algorithm 1 below. The updating rule (3.10) has been rewritten in terms of $\beta$. To do so, we write $\eta_{k,st} = d_{st}^\top \beta_k$, where $d_{st}^\top$ is the row of $D$ corresponding to the neighboring pair $(s,t)$. Then define the following matrix for each $(s,t) \in E$:

$$G_{st} = \sum_{k=1}^{K} e_k \left( e_k^\top \otimes d_{st}^\top \right) = I_K \otimes d_{st}^\top,$$

where $e_k$ is the vector of length $K$ whose $k$th component is 1 and other components are 0. The functionality of these matrices is to link $\eta_{st}$ with $\beta$ so that

$$\eta_{st} = G_{st} \beta.$$

With these notations, the approximation (3.10) can be written as a quadratic term of $\beta$:

$$\|\eta_{st}\|_2 \approx \frac{\beta^\top G_{st}^\top G_{st}\, \beta}{\max\!\left( \|G_{st} \beta^{(m)}\|_2,\; \epsilon_0 \right)}. \tag{3.11}$$
Algorithm 1 The GLIDARS algorithm
Due to the constraint $\eta_k \geq 0$, the updating step for $\beta$ is a quadratic programming problem which can be solved by many standard solvers, such as the R package quadprog. In our setup, the Hessian has many zero eigenvalues, leading to unstable computation. We deal with this by adding a very small disturbance $\epsilon I$ to the Hessian, where $\epsilon$ is the precision tolerance. This also generates a moderate fusion of very close coefficients.
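As an illustration of this constrained update (the GLIDARS package itself uses quadprog in R), the following Python sketch solves a generic quadratic program with non-negativity constraints on $D\beta$ via SciPy, with the ridge jitter added as described above; all names are ours:

```python
import numpy as np
from scipy.optimize import minimize

def solve_constrained_qp(H, g, D, eps=1e-8):
    """Minimize 0.5 * b'Hb + g'b subject to D b >= 0.
    A small ridge eps*I stabilizes a rank-deficient Hessian."""
    Hs = H + eps * np.eye(H.shape[0])
    res = minimize(
        lambda b: 0.5 * b @ Hs @ b + g @ b,
        np.zeros(H.shape[0]),
        jac=lambda b: Hs @ b + g,
        constraints={"type": "ineq", "fun": lambda b: D @ b,
                     "jac": lambda b: D},
        method="SLSQP",
    )
    return res.x
```

For example, with $H = I_2$, $g = (1, -1)^\top$, and $D = I_2$, the unconstrained minimizer $(-1, 1)$ is infeasible and the solver returns approximately $(0, 1)$, the projection onto the feasible cone.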
3.2. Tuning and adaptive weights
For the adaptive weights, the initial estimators $\tilde{\beta}_k$ should be $\sqrt{n}$-consistent for $\beta_k$ to ensure the theoretical results of Theorem 1. We propose obtaining $\tilde{\beta}_k$ by fitting the Cox model without penalization to each data set.

With the adaptive weights fixed, we propose to use the Bayesian information criterion (BIC) (Schwarz, 1978) to select the tuning parameter $\lambda$:

$$\mathrm{BIC}(\lambda) = -2\, \ell(\hat{\beta}_\lambda) + \log(n)\, \mathrm{df}_\lambda. \tag{3.12}$$

Here $\ell(\hat{\beta}_\lambda)$ is the log partial likelihood at the estimate obtained with penalty tuning parameter $\lambda$, and $\mathrm{df}_\lambda$ is the degrees of freedom of the model, taken as the number of unique parameters. The BIC is calculated over a grid of values of $\lambda$, which we choose uniformly on the log-scale from 0 to a large value. The value of $\lambda$ yielding the lowest BIC is selected.
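This grid search can be sketched as follows (illustrative Python; the `fit` callable is a placeholder for the actual penalized model fit, not part of GLIDARS):

```python
import numpy as np

def select_lambda(fit, n, lam_grid):
    """fit(lam) returns (loglik, df) for the model fitted at lam;
    pick the lam minimizing BIC = -2*loglik + log(n)*df, cf. (3.12)."""
    bics = []
    for lam in lam_grid:
        loglik, df = fit(lam)
        bics.append(-2.0 * loglik + np.log(n) * df)
    return lam_grid[int(np.argmin(bics))]
```

With a stub fit whose log likelihood peaks at $\lambda = 1$ and constant degrees of freedom, the function returns 1 from the grid $\{0, 1, 2\}$, as expected.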
4. Simulation study
We demonstrate the performance of our method in simulation studies under the three scenarios listed in Table 1. Across scenarios, we vary the following aspects: sample sizes, distribution of patients over the A×B combinations, relative risks, and baseline functions. All the results are based on 500 simulations. Scenarios 1 and 2 have two data sets and Scenario 3 has four data sets. The underlying staging structures, highlighted in different colored shades, are the same across the data sets. However, the actual coefficients can vary between data sets. We also allow the patient distributions over the A×B combinations to vary between data sets. The survival outcomes were generated from the exponential distribution and the censoring rates were kept around 0.2 for all data sets.
Table 1.
Simulation scenarios
For comparison, we apply the lasso-tree method from Lin and others (2013) and CART using the R package rpart. As these comparison methods are not designed for integrated analysis, we apply them both to each data set separately and to the pooled data.
To evaluate and compare results from different methods, we define staging or clustering "accuracy" as the proportion of correctly identified zero differences $\eta_{k,st} = 0$ between neighboring coefficients over the pairs $(s,t) \in E$. The accuracy takes values between 0 and 1, with larger values indicating better results. For our method, GLIDARS, and the lasso-tree method, because there are also parameters to be estimated, we additionally present results for the sum of squared errors (SSE) of $\hat{\beta}$ in the Supplementary Materials available at Biostatistics online.
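Under one natural reading of this metric (our own illustration; the paper's exact convention may differ), the accuracy compares the estimated and true zero patterns of the neighboring differences:

```python
import numpy as np

def grouping_accuracy(eta_hat, eta_true, tol=1e-8):
    # Proportion of neighboring pairs whose zero / nonzero status
    # (merged vs. separated cells) is recovered by the estimate.
    return float(np.mean((np.abs(eta_hat) <= tol) == (np.abs(eta_true) <= tol)))
```

For instance, if the truth merges one of three neighboring pairs and the estimate merges that pair plus one extra, the accuracy is 2/3.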
In Scenario 1, both data sets have observations in all cells of the A×B table. In Scenario 2, we allow some cells of the A×B table to be empty, similar to our real data: the first data set has no missing cells, while the second data set has a larger sample size but no observations for B3 or for the cell A3B2. Scenario 3 has four data sets. The sample sizes for all scenarios are given in Table 1; we chose relatively larger sample sizes for the data sets with smaller signals or coefficients so that the separate analyses were stably conducted.
The results from the different methods are in Figure 1. Besides boxplots of stratification accuracy across the 500 simulations, we also plot the percentage of simulations in which each method yielded the highest accuracy. We see that GLIDARS clearly outperforms the other methods in terms of accuracy. The separate analyses suffer from the small sample sizes; GLIDARS successfully uses the compensating distributions of the data sets to overcome this problem. Due to the heterogeneity in both the values of the coefficients and the patient distributions over the A×B table, naive pooling also leads to inferior results. In terms of SSE, GLIDARS also outperforms the lasso-tree method, as shown in the Supplementary Materials available at Biostatistics online.
Fig. 1.
Staging accuracy results are shown in box plots (left), and the proportions of being best model in each simulation are shown in the bar plots (right).
We further conducted simulation studies when the underlying true staging may differ across different data sets. The simulation scenarios and results are presented in the Supplementary Materials available at Biostatistics online. By an inspection of our theoretical development, the final integrated staging in this case should asymptotically be the intersection of all the individual data stagings because of the way we penalize the differences of adjacent coefficients. All nonzero differences will remain nonzero in the integrated results given large sample sizes. Consequently, the integrated staging will be finer than all the individual data stagings. We take comfort in this result as this means more refined care for patients. In addition, the estimated coefficients from different stages can provide further information about the necessity of the integrated staging. That is, if the coefficients from two adjacent stages are very close to each other, we may combine the two stages into one stage.
5. Application to Breast Cancer Staging
We perform integrated analysis using our method and the comparators on three different studies: CALGB 49907 and CALGB 9741, conducted by the Cancer and Leukemia Group B (CALGB), and N9831, conducted by the North Central Cancer Treatment Group (NCCTG). CALGB 49907 included 298 early-stage breast cancer patients who were 65 years or older (Partridge and others, 2010). CALGB 9741 included 798 patients with positive lymph nodes (Citron and others, 2002). N9831 has the largest sample size of 2256 (Perez and others, 2006). According to Table 2, the patient distribution is quite different across the studies. Both CALGB 9741 and N9831 have missing observations in some of the A×B combinations. Follow-up and event rates are also shown in the table.
Table 2.
Summary statistics and patient distributions for the three studies
| (a) Overall | |||
|---|---|---|---|
| Study | 49907 | 9741 | N9831 |
| Sample size | 298 | 798 | 2256 |
| Mean follow-up time (years) | 6.93 | 6.43 | 8.96 |
| Censor rate | 0.68 | 0.69 | 0.84 |
| Study 49907 | ER/PR+, HER2+ | ER/PR+, HER2− | ER/PR−, HER2+ | ER/PR−, HER2− |
|---|---|---|---|---|
| I | 1 | 9 | 4 | 11 |
| IIa | 8 | 86 | 8 | 36 |
| IIb | 5 | 55 | 7 | 19 |
| III | 1 | 36 | 3 | 10 |
| Study 9741 | ER/PR+, HER2+ | ER/PR+, HER2− | ER/PR−, HER2+ | ER/PR−, HER2− |
|---|---|---|---|---|
| I | 0 | 0 | 0 | 0 |
| IIa | 0 | 136 | 0 | 41 |
| IIb | 0 | 179 | 0 | 100 |
| III | 0 | 221 | 0 | 94 |
| Study N9831 | ER/PR+, HER2+ | ER/PR+, HER2− | ER/PR−, HER2+ | ER/PR−, HER2− |
|---|---|---|---|---|
| I | 98 | 14 | 181 | 17 |
| IIa | 592 | 134 | 451 | 36 |
| IIb | 0 | 0 | 0 | 0 |
| III | 347 | 66 | 303 | 13 |
Similar to our simulation study, we compare GLIDARS with the lasso-tree and CART methods. Note that because our goal for the real data is to construct a unified staging system for all patients, we only apply the comparison methods to the pooled data. For both GLIDARS and the lasso-tree, the BIC is used to determine the optimal value of $\lambda$ from a uniform grid on the log-scale from 0 to 15.
Table 3 shows results from the GLIDARS method: the stratification results are in Table 3a and the corresponding coefficient estimates are in Table 3b. Figure 2 shows the estimated Kaplan–Meier survival curves based on GLIDARS's stratification into five stages. The stratification led to clear separation except between Stage 3 and Stage 4 patients for CALGB 49907. However, these two groups of patients were well separated in the other two studies.
Table 3.
The stratifying schemes and values of the coefficients on extending the AJCC 7th edition staging with biomarkers for three selected studies
Fig. 2.
The survival curves from the GLIDARS stratification.
Results from the comparator methods are in Table 4 and Figure 3. The lasso-tree method based on the pooled data also yielded five stages; however, the separation of their Kaplan–Meier survival curves is not as distinct. For CART, the default pruning option led to three stages with clear separation of their Kaplan–Meier survival curves. However, we think this stratification may not be scientifically valid: it is well known that triple-negative breast cancer (ER/PR− and HER2−) is more aggressive than the other subtypes, yet this fact is not reflected in the CART result for patients with anatomical stages I and IIa. We then specified more conservative options for CART, which led to four-stage and five-stage stratifications. The same issue persisted for these CART results, and the separation of their Kaplan–Meier survival curves became less distinct.
Table 4.
The stratifying schemes of comparator methods
Fig. 3.
The stratified survival curves from the comparator methods.
6. Discussion and conclusion
In this article, we developed an integrated analysis method for risk stratification using data sets from different sources. In particular, we made the minimal assumption that all data sets share the same structure of risk stratification, but otherwise let all the parameters vary freely, including the baseline hazard functions. We enforced this assumption through a special adaptive group lasso penalty on the differences of adjacent parameters. We developed our method based on the well-known Cox model for time-to-event outcomes; our idea can be directly applied to other types of outcomes. We also studied the asymptotic properties of our approach. When simplified to a single-study setting (i.e., $K = 1$), our theoretical study also fills an important gap left by the original lasso-tree paper (Lin and others, 2013).
Even though we only considered two staging factors in this article, our algorithm can be applied to more than two factors. The increase in computational burden is manageable to a certain degree, as the workhorse of our algorithm is a quadratic program that can be solved efficiently.
We motivated our method on risk stratification for breast cancer patients. Our method is directly applicable to other settings as well. For example, for patients with many chronic conditions, risk stratification is also encountering challenges due to the need to incorporate many risk factors. These risk factors, similar to our breast cancer example, are typically ordered. Considering a risk stratification system based on these factors can lead to personalized care but overly refined classification can be challenging for efficient and equitable delivery of health care.
7. Software
The programming codes for this paper are publicly available at https://github.com/WangTJ/glidars.
Acknowledgments
The authors would like to thank the anonymous reviewers for their valuable input on improving the original version of this article. Conflict of Interest: None declared.
Contributor Information
Tianjie Wang, Department of Statistics, University of Wisconsin, Madison, WI, USA.
Rui Chen, Department of Statistics, University of Wisconsin, Madison, WI, USA.
Wenshuo Liu, Department of Research & Innovation, Interactions LLC, 31 Hayward Street Suite E, Franklin, MA 02038, USA.
Menggang Yu, Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI, USA.
Supplementary material
Supplementary material is available at http://biostatistics.oxfordjournals.org.
Funding
The authors' research was partially supported by the Specialized Program of Research Excellence (SPORE) program, through the National Institute of Dental and Craniofacial Research (NIDCR) and the National Cancer Institute (NCI), grant P50DE026787. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health (NIH).
References
- Amin, M. B., Edge, S. B., Greene, F. L., Byrd, D. R., Brookland, R. K., Washington, M. K., Gershenwald, J. E., Compton, C. C., Hess, K. R., Sullivan, D. C. and others. (2017). AJCC Cancer Staging Manual. New York: Springer.
- Andersen, P. K. and Gill, R. D. (1982). Cox's regression model for counting processes: a large sample study. The Annals of Statistics 10, 1100–1120.
- Breiman, L. (1984). Classification and Regression Trees. New York: Routledge.
- Citron, M., Berry, D., Cirrincione, C., Carpenter, J., Hudis, C., Gradishar, W., Davidson, N., Ingle, J., Martino, S., Livingston, R. and others. (2002). Superiority of dose-dense over conventional scheduling and equivalence of sequential (sc) vs. combination adjuvant chemotherapy for node-positive breast cancer (CALGB 9741, Int C9741). Breast Cancer Research and Treatment 76, S32.
- Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348–1360.
- Friedman, J., Hastie, T., Hofling, H. and Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics 1, 302–332.
- Green, P. J. (1984). Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. Journal of the Royal Statistical Society, Series B (Methodological) 46, 149–192.
- Lin, Y., Wang, S. and Chappell, R. J. (2013). Lasso tree for cancer staging with survival data. Biostatistics 14, 327–339.
- Loh, W.-Y. (2014). Fifty years of classification and regression trees. International Statistical Review 82, 329–348.
- National Cancer Institute. (2004). National Cancer Institute Fact Sheet: Cancer Staging. https://www.cancer.gov/about-cancer/diagnosis-staging/staging.
- Onitilo, A. A., Engel, J. M., Greenlee, R. T. and Mukesh, B. N. (2009). Breast cancer subtypes based on ER/PR and Her2 expression: comparison of clinicopathologic features and survival. Clinical Medicine and Research 7, 4–13.
- Partridge, A. H., Archer, L., Kornblith, A. B., Gralow, J., Grenier, D., Perez, E., Wolff, A. C., Wang, X., Kastrissios, H., Berry, D. and others. (2010). Adherence and persistence with oral adjuvant chemotherapy in older women with early-stage breast cancer in CALGB 49907: adherence companion study 60104. Journal of Clinical Oncology 28, 2418.
- Perez, E. A., Suman, V. J., Davidson, N. E., Martino, S., Kaufman, P. A., Lingle, W. L., Flynn, P. J., Ingle, J. N., Visscher, D. and Jenkins, R. B. (2006). HER2 testing by local, central, and reference laboratories in specimens from the North Central Cancer Treatment Group N9831 intergroup adjuvant trial. Journal of Clinical Oncology 24, 3032–3038.
- Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6, 461–464.
- Tibshirani, R. (1997). The lasso method for variable selection in the Cox model. Statistics in Medicine 16, 385–395.
- Wang, H. and Leng, C. (2008). A note on adaptive group lasso. Computational Statistics and Data Analysis 52, 5277–5286.
- Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68, 49–67.
- Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101, 1418–1429.