Published in final edited form as: Stat Med. 2019 Dec 8;39(6):675–686. doi: 10.1002/sim.8438

A surrogate ℓ0-sparse Cox's regression with applications to sparse high-dimensional massive sample size time-to-event data

Eric S Kawaguchi 1, Marc A Suchard 1,2,3, Zhenqiu Liu 4, Gang Li 1,2
PMCID: PMC8386178  NIHMSID: NIHMS1732951  PMID: 31814146

Abstract

Sparse high-dimensional massive sample size (sHDMSS) time-to-event data present multiple challenges to quantitative researchers, as most current sparse survival regression methods and software grind to a halt and become practically inoperable. This paper develops a scalable ℓ0-based sparse Cox regression tool for right-censored time-to-event data that easily takes advantage of existing high-performance implementations of ℓ2-penalized regression for sHDMSS time-to-event data. Specifically, we extend the ℓ0-based broken adaptive ridge (BAR) methodology to the Cox model, which involves repeatedly performing reweighted ℓ2-penalized regression. We rigorously show that the resulting estimator for the Cox model is selection consistent, oracle for parameter estimation, and has a grouping property for highly correlated covariates. Furthermore, we implement our BAR method in an R package for sHDMSS time-to-event data by leveraging existing efficient algorithms for massive ℓ2-penalized Cox regression. We evaluate the BAR Cox regression method by extensive simulations and illustrate its application on an sHDMSS time-to-event data set from the National Trauma Data Bank with hundreds of thousands of observations and tens of thousands of sparsely represented covariates.

Keywords: Censoring, high-dimensional covariates, massive sample size, penalized regression, proportional hazards, survival analysis

1 | INTRODUCTION

Advancements in medical informatics tools and high-throughput biological experimentation are making large-scale data routinely accessible to researchers, administrators, and policymakers. This data deluge poses new challenges and critical barriers for quantitative researchers, as existing statistical methods and software grind to a halt when analyzing such large-scale data sets, and it calls for appropriate methods that can readily fit them. This paper primarily concerns survival analysis of sparse high-dimensional massive sample size (sHDMSS) data, a particular type of large-scale data with the following characteristics: (1) high-dimensional, with a large number of covariates ($p_n$ in thousands or tens of thousands); (2) massive in sample size ($n$ in thousands to hundreds of millions); (3) sparse in covariates, with only a very small portion of covariates being nonzero for each subject; and (4) rare in event rate. An example of sHDMSS data is the pediatric trauma mortality data from the National Trauma Data Bank (NTDB) maintained by the American College of Surgeons.1 This data set includes 210 555 patient records of injured children under 15, collected over 5 years from 2006 to 2010. Each patient record includes 125 952 binary covariates that indicate the presence or absence of an attribute (ICD9 codes, AIS codes, etc) as well as their two-way interactions. The data matrix is extremely sparse, with less than 1% of the covariate entries being nonzero, and the event (mortality) rate is very low at 2%. Another application domain where sHDMSS data are common is drug safety, where studies use massive patient-level databases such as the US FDA's Sentinel Initiative (https://www.fda.gov/safety/fdassentinelinitiative/ucm2007250.htm) and the Observational Health Data Sciences and Informatics program (https://ohdsi.org/) to study rare adverse events with hundreds of millions of patient records and tens of thousands of sparsely represented patient attributes.

sHDMSS survival data present multiple challenges to quantitative researchers. First, not all of the thousands of covariates are expected to be relevant to an outcome of interest, and it would be practically undesirable to predict a patient outcome using thousands of covariates. Traditionally, researchers hand-pick subject characteristics to include in an analysis. However, hand-picking can introduce not only bias but also a source of variability between researchers and studies, and it becomes impractical in large-scale evidence generation when hundreds or thousands of analyses are to be performed.2 Hence, automated sparse regression methods are desired. Second, the commonly used divide-and-conquer strategy for massive data is inappropriate for sHDMSS time-to-event data, since each division of the data would contain too few events for a meaningful analysis. Third, sHDMSS data present a critical barrier to the application of existing sparse survival regression methods, since most current methods and standard software become inoperable on large data sets due to high computational costs and large memory requirements. Although many sparse survival regression methods are available,3–10 to the best of our knowledge, only LASSO, Elastic Net,11 and ridge regression have been adapted to sHDMSS time-to-event data. In particular, Mittal et al12 developed a tool, named Cyclops, for fitting LASSO and ridge Cox regression to sHDMSS time-to-event data by storing data in a sparse format, exploiting sparsity in the data and the partial likelihood, and using multicore threading and vector processing, along with other high-performance computing techniques, which delivers > 10-fold speedup12 over its competitors. However, ridge Cox regression does not yield a sparse model, and LASSO tends to select too many noise features and is biased for estimation.13,14 Improved sparse Cox regression tools for sHDMSS time-to-event data are therefore desired.

The purpose of this paper is to develop a surrogate ℓ0-based sparse Cox regression method and adapt it to sHDMSS time-to-event data. It is well known that ℓ0-penalized regression is natural for variable selection and parameter estimation and enjoys some optimality properties.15–18 On the other hand, it is also known to have pitfalls such as instability19 and failure to scale beyond moderate-dimensional covariates. The broken adaptive ridge (BAR) estimator, defined as the limit of an iteratively reweighted ℓ2-penalization algorithm, was introduced to approximate the ℓ0-penalization problem and has recently been shown to possess desirable selection, estimation, and clustering properties under the linear model and several other model settings.10,20–22 It is also computationally scalable to high-dimensional covariates and stable for variable selection, as discussed later in Remark 2 of Section 2. However, the BAR method has yet to be rigorously studied for the Cox model. Moreover, current BAR algorithms have only been implemented for densely represented covariates and are unsuitable for sHDMSS data due to high computational costs, high memory requirements, and numerical instability. Computation of the Cox partial likelihood and its derivatives is particularly demanding for massive sample sizes, since the required number of operations grows at the rate $O(n^2)$. The key contributions of this paper are twofold. First, we rigorously extend the BAR methodology to the Cox model; specifically, we establish selection consistency, an oracle property for parameter estimation, and a grouping property for highly correlated covariates. It is worth noting that this theoretical extension is nontrivial and notably different from other models, because the log-partial likelihood of the Cox model is not a sum of independent terms and the standard martingale central limit theorem used to derive the asymptotic theory for Cox's model with a fixed number of covariates no longer applies when the number of parameters diverges. Furthermore, because BAR involves performing an infinite sequence of penalized regressions, the derivations of its selection consistency and estimation oracle property differ substantially from those for a single-step oracle estimator in the literature. The second key contribution of this paper is an efficient implementation of BAR for Cox regression with sHDMSS time-to-event data that leverages existing massive ℓ2-penalized Cox regression techniques,12 including a column relaxation with logistic loss (CLG) algorithm with one-dimensional updates and a one-step Newton-Raphson approximation, as well as exploitation of the sparsity in the covariate structure and the Cox partial likelihood to reduce the number of operations from $O(n^2)$ to $O(n)$.

In Section 2, we formally define the BAR estimator, state its theoretical properties for variable selection, parameter estimation, and grouping highly correlated covariates for the Cox model, and describe an efficient implementation for sHDMSS survival data. We also discuss how to adapt BAR as a postscreening sparse regression method for ultrahigh dimensional Cox regression with relatively small sample size. In Section 3, we present simulation studies to demonstrate the performance of the CoxBAR estimator with both moderate and massive sample size in various low and high-dimensional settings. We provide a real data example using the pediatric trauma mortality data12 in Section 4. Lastly, we give closing remarks in Section 5. The appendix collects proofs of the theoretical results and regularity conditions needed for the derivations. An R package has been developed for BAR and made available at https://github.com/OHDSI/BrokenAdaptiveRidge.

2 | METHODOLOGY

2.1 | Cox's BAR regression and its large sample properties

2.1.1 | The data structure, model, and estimator

Suppose that one observes a random sample of right-censored time-to-event data consisting of $n$ independent and identically distributed triplets $\{(X_i, \delta_i, z_i(\cdot))\}_{i=1}^n$, where, for subject $i$, $X_i = \min(T_i, C_i)$ is the observed event time, $\delta_i = I(T_i \le C_i)$ is the censoring indicator, $T_i$ is the event time of interest, and $C_i$ is a censoring time that is conditionally independent of $T_i$ given a $p_n$-dimensional, possibly time-dependent, covariate vector $z_i(\cdot) = (z_{i1}(\cdot), \ldots, z_{ip_n}(\cdot))'$.

Assume the Cox23 proportional hazards model

$$h\{t \mid z(t)\} = h_0(t)\exp\{z(t)'\beta\}, \qquad (1)$$

where $h\{t \mid z(t)\}$ is the conditional hazard function of $T_i$ given $\{z(u), 0 \le u \le t\}$, $h_0(t)$ is an unspecified baseline hazard function, and $\beta = (\beta_1, \ldots, \beta_{p_n})'$ is a vector of time-independent regression coefficients. Denote by $\beta_1$ and $\beta_2$ the first $q_n$ and remaining $p_n - q_n$ components of $\beta$, respectively, and define $\beta_0 = (\beta_{01}', \beta_{02}')'$ as the true value of $\beta$, where, without loss of generality, $\beta_{01} = (\beta_{01}, \ldots, \beta_{0q_n})'$ is a vector of $q_n$ nonzero values and $\beta_{02} = 0$ is a $(p_n - q_n)$-dimensional vector of zeros. Further technical assumptions on $\beta_0$ and $p_n$ are given later in condition (C6) of Section S4 of the Supplementary Material. For simplicity, we work on the time interval $s \in [0, 1]$ as in the work of Andersen and Gill,24 which can be extended to any time interval $[0, \tau]$ with $0 < \tau < \infty$. Using standard counting process notation, the log-partial likelihood for the Cox model is defined as

$$l_n(\beta) = \sum_{i=1}^n \int_0^1 \beta' z_i(s)\, dN_i(s) - \int_0^1 \log\left[\sum_{j=1}^n Y_j(s)\exp\{\beta' z_j(s)\}\right] d\bar{N}(s), \qquad (2)$$

where, for subject $i$, $Y_i(s) = I(X_i \ge s)$ is the at-risk process and $N_i(s) = I(X_i \le s, \delta_i = 1)$ is the counting process of the uncensored event, with intensity process $h_i(t \mid \beta) = h_0(t) Y_i(t)\exp\{z_i(t)'\beta\}$ and $\bar{N} = \sum_{i=1}^n N_i$.

Our Cox's BAR estimation of $\beta$ starts with an initial Cox ridge regression estimator25

$$\hat{\beta}^{(0)} = \operatorname*{arg\,min}_{\beta}\left\{-2 l_n(\beta) + \xi_n \sum_{j=1}^{p_n} \beta_j^2\right\}, \qquad (3)$$

which is updated iteratively by the reweighted ℓ2-penalized Cox regression estimator

$$\hat{\beta}^{(k)} = \operatorname*{arg\,min}_{\beta}\left\{-2 l_n(\beta) + \lambda_n \sum_{j=1}^{p_n} \frac{\beta_j^2}{(\hat{\beta}_j^{(k-1)})^2}\right\}, \qquad k \ge 1, \qquad (4)$$

where $\xi_n$ and $\lambda_n$ are nonnegative penalization tuning parameters. The BAR estimator is defined as

$$\hat{\beta} = \lim_{k \to \infty} \hat{\beta}^{(k)}. \qquad (5)$$

Since ℓ2-penalization yields a nonsparse solution, defining the BAR estimator as a limit is necessary to produce sparsity. Although $\lambda_n$ is fixed across iterations, each coefficient's penalty is weighted inversely by the square of its ridge estimate from the previous iteration. Consequently, coefficients whose true values are zero receive ever larger penalties in subsequent iterations, whereas the penalties for truly nonzero coefficients converge to a constant. We show later in Theorem 1 that, under certain regularity conditions, the estimates of the truly zero coefficients shrink toward zero while the estimates of the truly nonzero coefficients converge to their oracle estimates with probability tending to 1. As illustrated by a small simulation in Section S2 (Figure S1) of the Supplementary Material, the signal (nonzero coefficients) and noise (zero coefficients) can be quickly separated within a few BAR iterations, although more iterations may be necessary in some scenarios to improve estimation of the nonzero coefficients.
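To fix ideas, the following is a minimal, self-contained R sketch of the iteration in (3) to (5) for small dense data. It assumes no tied event times and uses a plain Newton-Raphson ridge fit; all function and variable names are ours, and it is illustrative only, not the BrokenAdaptiveRidge package code (which uses the multiplication form discussed in Remark 1 below and cyclic coordinate descent rather than the naive flooring used here).

```r
# Score U(beta) and information I(beta) of the log partial likelihood (2),
# accumulated over risk sets by scanning subjects in decreasing time order.
cox_derivs <- function(Z, time, status, beta) {
  ord <- order(time, decreasing = TRUE)
  Z <- Z[ord, , drop = FALSE]; status <- status[ord]
  n <- nrow(Z); p <- ncol(Z)
  w <- exp(drop(Z %*% beta))
  S0 <- 0; S1 <- numeric(p); S2 <- matrix(0, p, p)
  U <- numeric(p); I <- matrix(0, p, p)
  for (i in seq_len(n)) {                  # risk set grows as time decreases
    S0 <- S0 + w[i]
    S1 <- S1 + w[i] * Z[i, ]
    S2 <- S2 + w[i] * tcrossprod(Z[i, ])
    if (status[i] == 1) {
      zbar <- S1 / S0
      U <- U + Z[i, ] - zbar
      I <- I + S2 / S0 - tcrossprod(zbar)
    }
  }
  list(U = U, I = I)
}

# Newton-Raphson fit of the weighted ridge objective
# -2 * l_n(beta) + sum_j wts_j * beta_j^2, covering both (3) (wts_j = xi_n)
# and (4) (wts_j = lambda_n / beta_j_prev^2). No step-halving safeguard.
ridge_cox <- function(Z, time, status, wts, beta, maxit = 25, tol = 1e-8) {
  for (it in seq_len(maxit)) {
    d <- cox_derivs(Z, time, status, beta)
    g <- -2 * d$U + 2 * wts * beta         # gradient of penalized objective
    H <- 2 * d$I + 2 * diag(wts, length(beta))
    step <- solve(H, g)
    beta <- beta - step
    if (max(abs(step)) < tol) break
  }
  beta
}

# BAR (5): initial ridge fit with xi_n, then iteratively reweighted fits.
# The denominator is floored at eps^2 to avoid dividing by a vanishing
# coefficient; the packaged implementation instead uses the multiplication
# form of Remark 1, which needs no flooring.
bar_cox <- function(Z, time, status, xi, lambda, maxit = 50, eps = 1e-6) {
  p <- ncol(Z)
  beta <- ridge_cox(Z, time, status, wts = rep(xi, p), beta = rep(0, p))
  for (k in seq_len(maxit)) {
    wts <- lambda / pmax(beta^2, eps^2)    # reweighting step of (4)
    beta_new <- ridge_cox(Z, time, status, wts, beta)
    if (max(abs(beta_new - beta)) < eps) { beta <- beta_new; break }
    beta <- beta_new
  }
  beta[abs(beta) < eps] <- 0               # numerical thresholding (Remark 1)
  beta
}
```

For example, a call such as bar_cox(Z, time, status, xi = 1, lambda = log(ncol(Z))) mirrors the prefixed tuning parameter choices examined in Section 3.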

Remark 1. (Computational aspects of BAR)

For moderate-size data, one may calculate $\hat{\beta}^{(k)}$ in (4) using the Newton-Raphson method as in the work of Frommlet and Nuel,26 who outlined an iteratively reweighted ridge regression for generalized linear models. At first sight, (4) appears to risk numerical overflow because some of the $\hat{\beta}_j^{(k-1)}$ go to zero as $k$ increases. However, after some simple algebraic manipulation, the Newton-Raphson updating formula involves only multiplications, not divisions, by the $\hat{\beta}_j^{(k-1)}$'s, so numerical overflow can be avoided; further details are provided in Section S1 (Equation (3)) of the Supplementary Material. We also note that because the limit of the BAR algorithm cannot be numerically achieved at any finite iteration, an extra thresholding rule for small coefficients is required to obtain a numerically sparse solution. This threshold can be set arbitrarily small (by default, $10^{-6}$ in our implementation), since it is used only to declare numerical convergence to zero and has minimal impact on the resulting BAR estimator. Furthermore, Equation (3) of Section S1 of the Supplementary Material implies that, once a $\hat{\beta}_j^{(k-1)}$ becomes zero, it remains zero in all subsequent iterations. Thus, one only needs to update $\hat{\beta}^{(k)}$ within the reduced nonzero parameter space, an appealing computational advantage in high-dimensional settings.
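To make the "multiplications instead of divisions" point concrete, the following is a hedged sketch of the underlying algebra in our own notation (not a reproduction of Equation (3) of Section S1): for a local quadratic approximation of $-2l_n$ with curvature matrix $H$ and working vector $b$, the solution of (4) is a weighted ridge estimate that can be rewritten as

$$\hat{\beta}^{(k)} = \left(H + \lambda_n D_k^{-2}\right)^{-1} b = D_k\left(D_k H D_k + \lambda_n I_{p_n}\right)^{-1} D_k\, b, \qquad D_k = \operatorname{diag}\big(\hat{\beta}^{(k-1)}\big),$$

using the identity $H + \lambda_n D_k^{-2} = D_k^{-1}(D_k H D_k + \lambda_n I_{p_n})D_k^{-1}$. The right-hand side involves only multiplications by the $\hat{\beta}_j^{(k-1)}$'s, and a zero diagonal entry of $D_k$ forces the corresponding component of $\hat{\beta}^{(k)}$ to be zero, consistent with the reduced-space updating described above.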

For massive data with large $n$ and $p_n$, the Newton-Raphson procedure, which at each iteration requires both the gradient and the Hessian, can become practically infeasible due to high computational costs, memory requirements, and numerical instability. In Section 2.2, we discuss how to adapt an efficient algorithm for massive ℓ2-penalized Cox regression via cyclic coordinate descent and how to exploit the sparsity of the covariate structure to make BAR scalable to sHDMSS data.

Remark 2. (Broken adaptive ridge versus best subset selection)

The BAR method can be viewed as performing a sequence of surrogate ℓ0-penalizations, where each reweighted ℓ2 penalty serves as a surrogate ℓ0 penalty and the approximation improves with each iteration. Consequently, BAR enjoys the best of ℓ0- and ℓ2-penalized regression. For example, we establish in the next two sections that BAR possesses selection consistency and the oracle property for estimation (ℓ0 properties) as well as a grouping property (an ℓ2 property). Numerically, for a fixed tuning parameter value, BAR, being a surrogate for ℓ0-penalization, is not expected to be identical to the exact global ℓ0-penalized solution, which must be found by best subset search (BSS), but it can be very similar; we illustrate this in Section S3 of the Supplementary Material (Figures S2 and S3) using a small simulation study. It is worth emphasizing that BAR overcomes some shortcomings of BSS: BSS is computationally NP-hard and can be unstable for variable selection,19 whereas BAR is scalable to high-dimensional covariates and more stable for variable selection, as demonstrated in Figures S2 and S3 of Section S3 of the Supplementary Material.

2.1.2 | Oracle properties

We establish the oracle properties of the BAR estimator for simultaneous variable selection and parameter estimation, allowing both $q_n$ and $p_n$ to diverge to infinity.

Theorem 1 (Oracle properties).

Assume that regularity conditions (C1) to (C6) in Section S4.1 of the Supplementary Material hold. Let $\hat{\beta}_1$ and $\hat{\beta}_2$ be the first $q_n$ and the remaining $p_n - q_n$ components of the BAR estimator $\hat{\beta}$, respectively. Then, as $n \to \infty$, with probability tending to one,

  (a) the BAR estimator $\hat{\beta} = (\hat{\beta}_1', \hat{\beta}_2')'$ exists and is unique, with $\hat{\beta}_2 = 0$;

  (b) $\sqrt{n}\, b_n' \Sigma(\beta_0)_{11}^{1/2} (\hat{\beta}_1 - \beta_{01}) \to_D N(0, 1)$ for any $q_n$-dimensional vector $b_n$ such that $\|b_n\|_2 \le 1$, where $\Sigma(\beta_0)_{11}$ is the first $q_n \times q_n$ submatrix of $\Sigma(\beta_0)$, defined in condition (C4).

Theorem 1(a) establishes the selection consistency of the BAR estimator. Part (b) essentially states that the nonzero component of the BAR estimator is asymptotically normal and equivalent to the weighted ridge estimator of the oracle model, as shown in the proof provided in Section S4.2 of the Supplementary Material.

Remark 3 (Ultrahigh-dimensional covariates setting).

Although we allow $p_n$ to diverge, the asymptotic properties of the BAR estimator in Section 2.1 are derived for $p_n < n$. In an ultrahigh-dimensional setting where the number of covariates far exceeds the number of observations ($p_n \gg n$), one may couple a sure screening27 method with the BAR estimator to obtain a two-step estimator with desirable selection and estimation properties. The orders of $q_n$, $p_n$, and $n$ and their relationships depend on the employed screening procedure. For example, coupling the BAR estimator with the sure joint screening procedure28 has been explored in the work of Kawaguchi.29

2.1.3 | A grouping property

When the true model has a group structure, it is desirable for a variable selection method to either retain or drop all variables that are clustered within the same group. It is well known that ridge regression possesses the grouping property for highly correlated covariates.11 Because the BAR estimator is based on iterative ridge regression, we show that it also possesses a grouping property for highly correlated covariates, as stated in the following theorem.

Theorem 2.

Let $\lambda_n$ and $\{(X_i, \delta_i, z_i)\}_{i=1}^n$ be given, and assume that the design matrix $Z = (z_1, \ldots, z_n)'$ is standardized; that is, for all $j = 1, \ldots, p_n$, $\sum_{i=1}^n z_{ij} = 0$ and $z_{[,j]}' z_{[,j]} = n - 1$, where $z_{[,j]}$ is the $j$th column of $Z$. Suppose that regularity conditions (C1) to (C6) in Section S4.1 of the Supplementary Material hold and let $\hat{\beta}$ be the BAR estimator. Then, for any $\hat{\beta}_i \ne 0$ and $\hat{\beta}_j \ne 0$,

$$\left| \hat{\beta}_i^{-1} \pm \hat{\beta}_j^{-1} \right| \le \frac{1}{\lambda_n}\, \sqrt{2\{(n-1)(1 \pm r_{ij})\}}\; \sqrt{n}\,(1 + d_n)^2, \qquad (6)$$

with probability tending to one, where $d_n = \sum_{i=1}^n \delta_i$ and $r_{ij} = \frac{1}{n-1} z_{[,i]}' z_{[,j]}$ is the sample correlation of $z_{[,i]}$ and $z_{[,j]}$.

The proof is provided in Section S4.3 of the Supplementary Material. It is seen from (6) that, as $r_{ij} \to 1$, the bound with the minus sign approaches 0, forcing $\hat{\beta}_i^{-1}$ and $\hat{\beta}_j^{-1}$, and hence $\hat{\beta}_i$ and $\hat{\beta}_j$, to be close; the estimated coefficients of two highly positively correlated variables will therefore be similar in magnitude. Similarly, taking the plus sign as $r_{ij} \to -1$, the estimated coefficients of two highly negatively correlated variables are similar in magnitude with a sign change.

2.1.4 | Selection of tuning parameters

Model complexity depends critically on the choice of the tuning parameters. The BAR estimator depends on two of them: $\xi_n$ for the initial ridge estimator in (3) and $\lambda_n$ for the iterative ridge step in (4). Our simulations in Section 3.1 illustrate that, for fixed $\lambda_n$, the BAR estimator is insensitive to the choice of $\xi_n$ over a wide interval (Figure 1), so in practice only optimization with respect to $\lambda_n$ is needed.

FIGURE 1

Path plots for broken adaptive ridge (BAR) regression with varying $\xi_n$, for (B) $\lambda_n = \log(p_n)$, (C) $\lambda_n = 0.5\log(p_n)$, and (D) $\lambda_n = 0.75\log(p_n)$, with estimates averaged over 100 Monte Carlo simulations of size $n = 300$, $p_n = 100$, and censoring rate ≈ 25%. The path plot for ridge regression (A) with varying $\xi_n$ is included as a comparison

We optimize with respect to $\lambda_n$ in a similar manner to currently used penalization methods. A popular strategy for tuning parameter selection is to optimize a data-driven selection criterion such as cross-validation (CV),30,31 the Akaike information criterion,15 or the Bayesian information criterion (BIC).16,17,32 Although CV has been used extensively in the literature, it is known to asymptotically overfit with positive probability.33,34 Recent theoretical work has shown that, for penalized Cox models possessing the oracle property, BIC-based tuning parameter selection identifies the true model with probability tending to one.32 Further discussion on selecting $\lambda_n$ for BAR is provided in the last paragraph of Section 3.2.
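As a concrete illustration, the following hedged R sketch selects $\lambda_n$ by BIC minimization over a coarse grid $c\log(p_n)$, $c \in (0, 1]$ (the grid form recommended later in Section 3.3), reusing the bar_cox() sketch from Section 2.1.1; cox_loglik() is our own helper, not part of any package.

```r
# Log partial likelihood (no ties), used to score each candidate model.
cox_loglik <- function(Z, time, status, beta) {
  ord <- order(time, decreasing = TRUE)
  Z <- Z[ord, , drop = FALSE]; status <- status[ord]
  eta <- drop(Z %*% beta)
  risk <- cumsum(exp(eta))               # risk-set sums, largest time first
  sum((eta - log(risk))[status == 1])
}

# BIC-based selection of lambda_n over the coarse grid c * log(p_n):
# BIC = -2 * l_n(beta_hat) + log(n) * (number of nonzero coefficients).
select_bar <- function(Z, time, status, xi = 1, cs = seq(0.1, 1, by = 0.1)) {
  n <- nrow(Z); p <- ncol(Z)
  fits <- lapply(cs, function(cc) bar_cox(Z, time, status, xi, cc * log(p)))
  bic <- vapply(fits, function(b) {
    -2 * cox_loglik(Z, time, status, b) + log(n) * sum(b != 0)
  }, numeric(1))
  fits[[which.min(bic)]]                 # estimate at the BIC-minimizing lambda_n
}
```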

2.2 | Implementation of BAR for sHDMSS data

As mentioned in Remark 1, the Newton-Raphson step inside each BAR iteration becomes infeasible in large-scale settings with large $n$ and $p_n$ due to high computational costs, high memory requirements, and numerical instability. Furthermore, recently proposed BAR algorithms, like most publicly available procedures, cannot directly handle sHDMSS data because of the burden imposed by a densely stored design matrix. Because BAR only involves fitting a reweighted Cox ridge regression at each iteration, however, we can adapt the efficient algorithm developed by Mittal et al12 for massive Cox ridge regression.

2.2.1 | Adaptation of existing efficient algorithms for fitting massive ℓ2-penalized Cox's regression

Mittal et al12 developed an efficient implementation of massive Cox's ridge regression for sHDMSS data. For parameter estimation, the authors adopted the CLG algorithm of Zhang and Oles,35 a cyclic coordinate descent algorithm that estimates the coefficients through one-dimensional updates. CLG scales easily to high-dimensional data7,36,37 and has recently been implemented for fitting ℓ2- and ℓ1-penalized generalized linear models,38 parametric time-to-event models,39 and Cox's model.12 Readers are referred to Section S3 of the Supplementary Material for a detailed explanation of the algorithm.
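The shape of one such update is sketched below in R; this is a hedged paraphrase of the CLG-style rule described by Zhang and Oles35 and Genkin et al,42 not the Cyclops source: a one-step Newton move on coordinate $j$, clamped to an adaptive trust region to keep the quadratic approximation honest.

```r
# One CLG-style coordinate update (hedged sketch, our notation):
# g_j and h_j are the first and second partial derivatives of the penalized
# negative log partial likelihood with respect to beta_j; trust_j is the
# current trust-region half-width for coordinate j.
clg_update <- function(beta_j, g_j, h_j, trust_j) {
  step <- -g_j / h_j                          # one-dimensional Newton step
  step <- max(min(step, trust_j), -trust_j)   # clamp to the trust region
  list(beta_j = beta_j + step,
       trust_j = max(2 * abs(step), trust_j / 2))  # adapt region for next sweep
}
```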

The design matrix $Z$ for sHDMSS data has few nonzero entries per subject. Storing such a sparse matrix in a dense format is inefficient and may increase computation time and/or cause standard software to crash due to insufficient memory. To the best of our knowledge, popular penalization packages such as glmnet40 and ncvreg41 do not support a sparse data format as input for right-censored time-to-event models, although the former supports it for other generalized linear models. For sHDMSS data, we propose to use specialized column-data structures as in the works of Mittal et al12 and Suchard et al.38 The advantage of this structure is twofold: it significantly reduces the memory required to store the covariate information, and it enhances performance when employing cyclic coordinate descent. For example, when updating $\beta_j$, efficiency is gained by caching the inner products $r_i = z_i'\beta$ and refreshing them through the low-rank update $r_i^{\text{new}} = r_i + z_{ij}\,\Delta\beta_j$ for all $i$.12,35,36,38,42

Furthermore, calculating the gradient and Hessian diagonal requires a series of cumulative sums over the risk sets $R_i = \{j : X_j \ge X_i\}$. These cumulative sums must be recomputed when updating each parameter estimate in the optimization routine, which can be computationally costly when both $n$ and $p_n$ are large. By taking advantage of the sparsity of the design matrix, one can reduce the time needed to compute these cumulative sums by entering this operation only if at least one observation in the risk set has a nonzero covariate value along dimension $j$, and by starting the scan at the first nonzero entry rather than at the beginning. Mittal et al12 and Suchard et al38 implemented these efficiency techniques for Cox's regression and conditional Poisson regression, respectively. Our BAR implementation naturally exploits the sparsity in the design matrix and the partial likelihood by embedding an adaptive version of Mittal et al's12 massive Cox's ridge regression within each iteration of the reweighted ridge loop. A sketch of the sparsity-aware score computation is given below.
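The following hedged R sketch shows the idea for the score with respect to a single coordinate $j$; col_rows and col_vals are assumed triplet-style column structures (ours, not the Cyclops internals), subjects are pre-sorted by decreasing observed time, and in a real coordinate-descent sweep the denominator sums would be maintained incrementally via the low-rank update above rather than recomputed.

```r
# Score for coordinate j: U_j = sum over events of (z_ij - S1_i / S0_i),
# where S0_i and S1_i are risk-set sums of w = exp(eta) and of z_j * w.
score_j <- function(j, w, status, col_rows, col_vals) {
  n <- length(w)
  zw <- numeric(n)
  zw[col_rows[[j]]] <- col_vals[[j]] * w[col_rows[[j]]]
  S0 <- cumsum(w)                        # shared across j; cached in practice
  S1 <- cumsum(zw)                       # zero until the first nonzero of column j
  zj <- numeric(n)
  zj[col_rows[[j]]] <- col_vals[[j]]
  first <- min(col_rows[[j]])            # start the scan here, not at row 1
  idx <- which(status == 1)
  idx <- idx[idx >= first]               # earlier events contribute exactly 0
  sum(zj[idx] - S1[idx] / S0[idx])
}
```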

3 | SIMULATIONS

This section presents three simulation studies. First, we demonstrate in Section 3.1 that, for fixed $\lambda_n$, the BAR estimator is insensitive to the tuning parameter $\xi_n$ of its initial ridge estimator and does well at performing variable selection and correcting possible bias of the initial ridge estimator. Then, in Section 3.2, we evaluate and compare the operating characteristics of BAR with some popular penalized Cox regression methods; here we only consider moderate sample sizes because most of the competing methods are inoperable for massive sample size data. Finally, in Section 3.3, we use an sHDMSS setting to illustrate the performance of BAR against its closest competitor.

Sections 3.1 and 3.2 employ the same simulation structure, sketched in code below. Event times are drawn from an exponential proportional hazards model with baseline hazard $h_0(t) = 1$ and $\beta_0 = (0.40, 0, 0.45, 0, 0.50, 0.55, 0, 0, 0.70, 0.80, \mathbf{0}_{p_n - 10}')'$, representing $q_n = 6$ small to moderate effect sizes; the design matrix $Z = (z_1, \ldots, z_n)'$ is generated from a $p_n$-dimensional normal distribution with mean zero and covariance matrix $\Sigma = (\sigma_{ij})$ with autoregressive structure $\sigma_{ij} = 0.5^{|i-j|}$; and independent censoring times are generated from a uniform distribution $U(0, u_{\max})$, where $u_{\max}$ is chosen to achieve different percentages of censoring. We describe how we simulate sHDMSS time-to-event data in Section 3.3.
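A hedged R sketch of this data-generating mechanism follows; the value of u_max below is our guess, not a figure from the paper, and should be tuned empirically to the target censoring rate.

```r
# Simulation design of Sections 3.1-3.2 (sketch): AR(1) normal covariates,
# exponential event times under model (1) with h_0(t) = 1, uniform censoring.
set.seed(1)
n <- 300; p <- 100
beta0 <- c(0.40, 0, 0.45, 0, 0.50, 0.55, 0, 0, 0.70, 0.80, rep(0, p - 10))
Sigma <- 0.5^abs(outer(1:p, 1:p, "-"))   # sigma_ij = 0.5^|i-j|
Z <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)
T_event <- rexp(n, rate = exp(drop(Z %*% beta0)))
u_max <- 3                               # unverified guess; adjust until
C <- runif(n, 0, u_max)                  #   mean(status == 0) is near 0.25
time <- pmin(T_event, C)
status <- as.integer(T_event <= C)
```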

3.1 | Broken adaptive ridge estimator for varying values of $\xi_n$

We illustrate how the BAR estimator behaves when we fix $\lambda_n$ and vary the tuning parameter $\xi_n$ of the initial Cox ridge regression. Figures 1B to 1D depict the solution path plots, averaged over 100 Monte Carlo simulations, of the BAR estimator with respect to $\xi_n$ over the wide interval $[10^{-2}, 10^2]$ for $n = 300$, $p_n = 100$, ≈ 25% censoring, and $\lambda_n = \log(p_n)$, $0.5\log(p_n)$, and $0.75\log(p_n)$, respectively. The resulting BAR estimator is essentially unchanged over a large interval of $\xi_n$, regardless of the choice of $\lambda_n$, suggesting that the BAR estimator is relatively insensitive to the initial ridge estimator.

As a reference, we also display the solution path plots of the corresponding initial ridge estimator in panel (A). The initial ridge estimator starts to introduce overshrinkage, and consequently estimation bias, when $\xi_n$ exceeds 10; however, this bias is effectively corrected by BAR. Therefore, by iteratively refitting reweighted Cox ridge regressions, the BAR estimator not only performs variable selection by shrinking estimates of the truly zero parameters to zero, but also effectively corrects the estimation bias of the initial Cox ridge estimator. Similar results were obtained for several other simulation scenarios and can be found in Section S4 of the Supplementary Material.

3.2 | Model selection and parameter estimation

In this simulation, we evaluate and compare the variable selection and parameter estimation performance of BAR with four popular penalized Cox regression methods: LASSO,3 SCAD,4 adaptive LASSO (ALASSO),5 and MCP.6 We fix $\xi_n = 1$ for the BAR methods since Section 3.1 provides evidence that the BAR estimator is insensitive to the choice of $\xi_n$. For all methods, a 25-value grid was used to find the optimal tuning parameter value via BIC minimization.32

Estimation bias is summarized through the mean squared bias, $E(\|\hat{\beta} - \beta_0\|^2)$. Variable selection performance is measured by several indices: the mean number of false positives (FP), the mean number of false negatives (FN), and an average similarity measure for support recovery, $SM = |\hat{S} \cap S_0| / \sqrt{|\hat{S}| \cdot |S_0|}$, where $S_0$ and $\hat{S}$ are the sets of indices of the nonzero components of $\beta_0$ and $\hat{\beta}$, respectively, and $|\cdot|$ denotes cardinality.43 The similarity measure can be viewed as a continuous measure of true model recovery: it is close to 1 when the estimated model is similar to the true model and close to 0 when it is highly dissimilar. We use the R package ncvreg to perform the LASSO, ALASSO, SCAD, and MCP penalizations in our simulations. For ALASSO, we let the initial weight be the maximum partial likelihood estimator since $p_n < n$. Partial simulation results are summarized in Table 1, where we fix $n = 300$ or $1000$, $p_n = 100$, and a censoring rate of ≈ 25%, and average results over 100 replications.

TABLE 1.

(Moderate dimension and sample size) Simulated estimation and variable selection performance of broken adaptive ridge (BAR) (BIC), LASSO (BIC), SCAD (BIC), adaptive LASSO (ALASSO) (BIC), and MCP (BIC), where BIC in parentheses indicates that BIC minimization over a grid search was used to select the tuning parameter. (MSB = mean squared bias; FN = mean number of false negatives; FP = mean number of false positives; SM = average similarity measure; BIC = average BIC score. Each entry is based on 100 Monte Carlo samples of size n = 300 or 1000, pn = 100, censoring rate ≈ 25%)

MSB FN FP SM BIC
n = 300 BAR (λn = 0.5 log(pn)) 0.06 0.02 0.23 0.98 1930.97
BAR (λn = log(pn)) 0.10 0.17 0.02 0.98 1938.43
BAR (BIC) 0.11 0.01 1.79 0.89 1919.26
LASSO (BIC) 0.27 0.01 3.32 0.82 1958.40
SCAD (BIC) 0.12 0.01 2.23 0.87 1933.43
ALASSO (BIC) 0.11 0.04 1.48 0.90 1935.60
MCP (BIC) 0.09 0.02 1.21 0.92 1929.33
n = 1000 BAR (λn = 0.5 log(pn)) 0.01 0.00 0.19 0.99 8200.97
BAR (λn = log(pn)) 0.01 0.00 0.00 1.00 8203.52
BAR (BIC) 0.02 0.00 0.73 0.95 8196.51
LASSO (BIC) 0.10 0.00 2.77 0.84 8236.76
SCAD (BIC) 0.01 0.00 0.23 0.98 8203.00
ALASSO (BIC) 0.02 0.00 0.26 0.98 8204.58
MCP (BIC) 0.01 0.00 0.08 0.99 8202.04

Table 1 shows that, when the tuning parameter $\lambda_n$ is selected by minimizing the BIC score, as for the other methods, the performance of BAR (BIC) is generally comparable to that of the other methods with respect to all measures across both scenarios. We conducted more extensive simulations with different combinations of model dimension, censoring rate, sample size, and model sparsity, which yielded consistent findings; they are reported in Section S5 of the Supplementary Material.

Since BAR aims to approximate ℓ0-penalized regression, it directly provides a surrogate optimum for some popular information criteria with a prefixed $\lambda_n$. For example, performing BAR with $\lambda_n = c\log(p_n)$ for some $c > 0$ yields a surrogate optimum for directly optimizing the extended BIC.44–46 For thoroughness, in addition to using a 25-value grid for $c$, we also include simulation results in Table 1 for BAR with the prefixed values $\lambda_n = 0.5\log(p_n)$ and $\lambda_n = \log(p_n)$. Not surprisingly, BAR with these prefixed values produced sometimes slightly suboptimal, but generally comparable, estimation and selection performance. We also conducted further simulations using a coarse 10-value grid for $\lambda_n$. The results, presented in Tables S1 to S3 of the Supplementary Material, show that the 10-value grid worked as well as the 25-value grid across almost all of our simulation scenarios. This suggests that computational savings can be gained for BAR by using either prefixed values or a coarse grid for $\lambda_n$ with massive data, as also illustrated in Section 4 (Table 3).

TABLE 3.

(Pediatric National Trauma Data Bank (NTDB) data) Comparison of mCox-LASSO and massive Cox's regression with broken adaptive ridge (mBAR) for the pediatric NTDB data. (mCox-LASSO (CV) and mCox-LASSO (BIC) correspond to mCox-LASSO using cross-validation and the Bayesian information criterion (BIC), respectively. mBAR (BIC) denotes mBAR using the BIC selection criterion while fixing ξn = log(pn). The training set has a sample size of 168 000, while the test set used for the c-index has a sample size of 42 555)

Method # Selected BIC score c-index Runtime (hours)
mBAR (λn = 0.5 log(pn)) 45 51 613.52 0.91 8
mBAR (λn = log(pn)) 21 52 182.90 0.89 8
mBAR (BIC) 83 51 269.43 0.93 97
mCox-LASSO (BIC) 100 52 544.90 0.91 25
mCox-LASSO (CV) 253 53 165.44 0.92 41

3.3 | Sparse high-dimensional massive sample size data

In this simulation, we generate an sHDMSS time-to-event data set with $n = 200\,000$, $p_n = 20\,000$, and $q_n = 80$. Event times are generated from an exponential hazards model with baseline hazard $h_0(t) = 1$ and regression coefficients $\beta_0 = (0.7\,\mathbf{1}_{10}', 0.5\,\mathbf{1}_{10}', 0.8\,\mathbf{1}_{10}', \mathbf{1}_{10}', -0.7\,\mathbf{1}_{10}', -0.5\,\mathbf{1}_{10}', -0.8\,\mathbf{1}_{10}', -\mathbf{1}_{10}', \mathbf{0}_{p_n - 80}')'$, where $\mathbf{1}_{10}$ denotes a vector of 10 ones, and the censoring rate is 95%. The covariates for each subject are simulated such that, on average, 2% are assigned a nonzero value. Storing this design matrix densely would require over 16 GB of memory, which exceeds the functional capacity of most statistical software packages on standard hardware. To overcome this difficulty, we store the information in coordinate-list fashion (sketched below) and compare our massive Cox's regression with BAR (mBAR) against the massive sparse Cox's LASSO (mCox-LASSO) in the Cyclops package,12,38 which, to the best of our knowledge, is the fastest software available today that exploits the sparsity of sHDMSS time-to-event data for efficient computing and offers > 10-fold speedup12 over competitors such as CoxNet7 and FastCox.47 For LASSO, the optimal tuning parameter was found both by CV (mCox-LASSO (CV)), combined with a nonconvex optimization technique that is more efficient than the classical grid search, and by BIC score minimization (mCox-LASSO (BIC)) with a classical grid search. For the mBAR method, we implement BIC score minimization using a grid search as well as the two prefixed tuning parameters $\lambda_n = 0.5\log(p_n)$ and $\log(p_n)$ for comparison. We report the bias $\|\hat{\beta} - \beta_0\|_2$, the numbers of FP and FN, and the BIC score $-2 l_n(\hat{\beta}) + \log(n)\sum_j I(\hat{\beta}_j \ne 0)$ in Table 2.
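The coordinate-list idea is easy to sketch in R with the Matrix package (our scaffolding, not the Cyclops data loader); only the (row, column, value) triplets of nonzero entries are materialized.

```r
# Triplet/compressed-sparse storage of the simulated design matrix: at ~2%
# fill, memory drops by roughly two orders of magnitude versus dense storage.
library(Matrix)
set.seed(1)
n <- 200000; p <- 20000
nnz <- round(0.02 * n * p)               # ~2% of entries are nonzero
Z <- sparseMatrix(i = sample.int(n, nnz, replace = TRUE),
                  j = sample.int(p, nnz, replace = TRUE),
                  x = 1, dims = c(n, p),
                  use.last.ij = TRUE)    # collapse duplicate draws to a single 1
format(object.size(Z), units = "GB")     # ~1 GB here versus >16 GB dense
```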

TABLE 2.

(Sparse high-dimensional and massive sample size) Estimation and variable selection results for massive Cox regression with broken adaptive ridge (mBAR) and with the LASSO penalty (mCox-LASSO12) for a simulated sHDMSS data set with n = 200 000, pn = 20 000, and qn = 80. (Bias $= \|\hat{\beta} - \beta_0\|_2$; FP = number of false positives; FN = number of false negatives)

Method Bias FP FN BIC score
mBAR (λn = 0.5 log(pn)) 1.19 0 3 83 313.02
mBAR (λn = log(pn)) 2.02 0 10 83 573.96
mBAR (BIC) 0.97 5 0 83 266.47
mCox-LASSO (BIC) 2.93 12 0 84 479.47
mCox-LASSO (CV) 2.12 963 0 93 770.58

Abbreviations: BIC, Bayesian information criterion; CV, cross-validation.

We observe that both mCox-LASSO methods retain all 80 truly nonzero coefficients together with a moderate to large number of noise variables (12 for BIC and 963 for CV). In contrast, mBAR (BIC) chooses a sparser model, selecting all 80 nonzero coefficients and only 5 noise variables. As expected, mBAR (BIC) is less biased (0.97) than mCox-LASSO (2.93 for BIC and 2.12 for CV) and has a much lower BIC score than both mCox-LASSO methods. We also notice that mBAR with the two prefixed $\lambda_n$ values tends to underestimate the true model: fixing $\lambda_n = \log(p_n)$ results in a model that is too sparse, whereas $\lambda_n = 0.5\log(p_n)$ produces a model closer to the oracle model.

We further examined the solution paths of mCox-LASSO and mBAR in Figure 2. The vertical solid and dashed lines in the mCox-LASSO solution path plot (Figure 2A) represent the estimates at the optimal tuning parameter obtained via CV and BIC minimization, respectively. The mCox-LASSO solution path changes rapidly as its tuning parameter varies and shows severe bias. In contrast, the mBAR solution path with respect to $\lambda_n$ (Figure 2B), where the vertical line marks the estimates at the optimal tuning parameter selected by BIC minimization, changes very slowly and yields estimates that are less biased than mCox-LASSO's (see Table 2). Furthermore, the optimal value of $\lambda_n$ that minimizes the BIC score for mBAR roughly corresponds to $0.3\log(p_n)$. Since our empirical results suggest that the optimal $\lambda_n$ generally lies within a constant multiple of $\log(p_n)$, we recommend a coarse grid search over $c\log(p_n)$ with $c \in (0, 1]$. This is further corroborated by additional simulations in the Supplementary Material (Tables S1 to S3).

FIGURE 2

Path plots for massive sparse Cox's regression with LASSO (mCox-LASSO) and massive Cox's regression with broken adaptive ridge (mBAR). A, Path plot for mCox-LASSO regression, where the black solid and dashed lines represent the estimates when BIC minimization and cross-validation, respectively, were used to find the optimal value of the tuning parameter; B, Path plot for mBAR regression with $\xi_n = \log(p_n)$ and varying $\lambda_n$, where the black solid, dashed, and dotted lines represent estimates with $\lambda_n$ selected by Bayesian information criterion minimization, fixed at $\log(p_n)$, and fixed at $0.5\log(p_n)$, respectively; C, Path plot for mBAR regression with $\lambda_n = \log(p_n)$ and varying $\xi_n$, where the black solid line represents the estimates for mBAR when $\xi_n = \log(p_n)$

For the mBAR method, we also plotted the solution path with respect to $\xi_n$ while fixing $\lambda_n = \log(p_n)$ (Figure 2C). It shows that the mBAR estimates are very stable over a large range of $\xi_n$, affirming our observation with smaller-scale data in Section 3.1 that mBAR is generally insensitive to $\xi_n$.

4 | PEDIATRIC TRAUMA MORTALITY

For an application of mBAR regression in the sHDMSS setting, we consider a subset of the NTDB, a trauma database maintained by the American College of Surgeons.1 This data set was previously analyzed by Mittal et al12 as an example of efficient massive Cox regression on sHDMSS data with mCox-LASSO and ridge regression. The data set includes 210 555 patient records of injured children under 15, collected over the 5 years from 2006 to 2010. Each patient record includes 125 952 binary covariates, which indicate the presence or absence of an attribute (ICD9 codes, AIS codes, etc) as well as their two-way interactions. The outcome of interest is mortality after time of injury. The data are extremely sparse, with less than 1% of covariate entries being nonzero, and the censoring rate is 98%. We randomly split the data into training and test sets of 168 000 and 42 555 records, respectively, with the mortality rates of both sets approximately equal to the combined rate. As in Section 3.3, we were unable to load the training set ($n = 168\,000$, $p_n = 125\,952$) into other popular oracle procedures because of the memory required to support a dense design matrix of that size, so we compare mBAR to mCox-LASSO only. BIC-score minimization over a penalization path of 10 tuning parameter values was used to select the final model for both mBAR (fixing $\xi_n = \log(p_n)$) and mCox-LASSO. In addition, we ran mCox-LASSO using CV and mBAR with the fixed tuning parameters $\lambda_n = 0.5\log(p_n)$ and $\log(p_n)$. The BIC score based on the training data is used to compare selection performance between models, and discriminatory performance is measured by Harrell's c-statistic48,49 on the test data, computed as sketched below.
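For reference, Harrell's c-statistic admits a short self-contained R sketch (an $O(n^2)$-pair version for clarity; tied event times are handled only crudely here, and production code would use an optimized routine). The test-set object names in the final comment are placeholders.

```r
# Harrell's c-index: among pairs where the shorter time is an observed event,
# the fraction in which the higher risk score fails first (ties in the score
# count 1/2; tied times are skipped).
c_index <- function(time, status, risk) {
  num <- 0; den <- 0
  for (i in which(status == 1)) {
    comp <- which(time > time[i])        # comparable pairs: i fails first
    den <- den + length(comp)
    num <- num + sum(risk[i] > risk[comp]) + 0.5 * sum(risk[i] == risk[comp])
  }
  num / den
}
# Example (hypothetical objects):
# c_index(test_time, test_status, as.numeric(Z_test %*% beta_hat))
```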

Table 3 summarizes the findings for our example, which reflect what we observed in Section 3.3. Using BIC minimization, mBAR selects fewer covariates than both mCox-LASSO methods, and both its model selection and discriminatory performance are similar to or slightly better than those of the mCox-LASSO methods. Again, mBAR with prefixed $\lambda_n$ selects far fewer covariates than mBAR (BIC); however, the overall high c-index for both suggests that the strong predictors of pediatric trauma mortality are still retained in the model. In terms of runtime, mBAR (BIC) is more time-consuming than mCox-LASSO (BIC), as expected, but BAR with a prefixed tuning parameter value can reduce the runtime while delivering comparable prediction performance.

5 | DISCUSSION

We have extended the BAR methodology to Cox's model as a new sparse Cox regression method and rigorously established that it is selection consistent, oracle for parameter estimation, stable, and endowed with a grouping property for highly correlated covariates. We have illustrated through empirical studies that the BAR estimator performs satisfactorily for variable selection and parameter estimation. We have also extended the application of BAR to the sHDMSS domain by taking advantage of the fact that the BAR algorithm lets us easily adapt existing high-performance algorithms and software for massive ℓ2-penalized Cox regression.12

Our surrogate ℓ0-based BAR method and theory can be easily extended to a surrogate ℓd-based BAR method for any $d \in [0, 1]$ by replacing $(\hat{\beta}_j^{(k-1)})^2$ with $|\hat{\beta}_j^{(k-1)}|^{2-d}$ in (4); in the bar_cox() sketch of Section 2.1.1, this amounts to the one-line change shown below. We have observed empirically that, as $d$ increases toward 1, the resulting estimator becomes less sparse, and the average number of FP as well as the estimation bias tend to increase, especially for larger $p_n$, while the average number of FN tends to decrease. In practice, $d$ can be used as a resolution tuning parameter.
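A hedged illustration of the ℓd generalization (d here is a new, hypothetical argument, not part of any package):

```r
# Reweighting step for the surrogate l_d penalty; d = 0 recovers BAR.
wts <- lambda / pmax(abs(beta)^(2 - d), eps^2)
```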

Our theoretical and empirical results establish the BAR method as a valid and viable tool for variable selection and parameter estimation in the $p_n < n$ setting, with $p_n$ allowed to diverge with $n$. Theoretical properties of the BAR estimator in the high-dimensional setting ($p_n \gg n$) remain to be investigated. Furthermore, as pointed out by a referee, although BAR is selection consistent and oracle, it is subject to the same postselection inference issues as other variable selection methods.50,51 Lastly, although iteratively performing reweighted ℓ2-penalizations allows BAR to enjoy the best of ℓ0- and ℓ2-penalized regressions and to readily adopt an existing efficient implementation of ℓ2-penalization for sHDMSS data, its iterative nature does add a layer of computational complexity. While this is not a practical concern for moderate-size data, it can considerably increase the runtime when both $n$ and $p_n$ are large. As illustrated in our real data example, using a prefixed tuning parameter value based on the extended BIC, $\lambda_n = c\log(p_n)$, can reduce the runtime of BAR with reasonably good performance. To further improve computational efficiency, we are currently developing modified BAR algorithms, including a cyclic coordinatewise BAR algorithm, that will have computational complexity and runtime comparable to other popular variable selection methods such as LASSO. This line of development is beyond the scope of this paper and will be studied fully in a sequel paper.

Supplementary Material

Supp

ACKNOWLEDGEMENTS

The authors are grateful to the referees for their insightful comments and suggestions that have greatly improved the paper. The authors are also grateful to Drs Randall Burd and Sushil Mittal for providing us access to the NTDB data. Gang Li’s research was supported in part by National Institutes of Health grants P30 CA-16042, UL1TR000124-02, and P01AT003960.

Footnotes

CONFLICT OF INTEREST

The authors declare no potential conflict of interest.

SUPPORTING INFORMATION

Additional supporting information may be found online in the Supporting Information section at the end of the article.

Regularity conditions, proofs of theorems and additional simulation results are publicly available online in the supporting information tab of this article and implementation of BAR for right-censored time-to-event data can be found on the Github page [https://github.com/OHDSI/BrokenAdaptiveRidge].

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are available from the American College of Surgeons.1 Restrictions apply to the availability of these data.

REFERENCES

1. National Trauma Data Bank. https://www.facs.org/quality-programs/trauma/tqp/center-programs/ntdb
2. Schuemie MJ, Ryan PB, Hripcsak G, Madigan D, Suchard MA. Honest learning for the healthcare system: large-scale evidence from real-world data. Science. 2017. Under review.
3. Tibshirani R. The lasso method for variable selection in the Cox model. Statist Med. 1997;16(4):385–395.
4. Fan J, Li R. Variable selection for Cox's proportional hazards model and frailty model. Ann Stat. 2002;30(1):74–99.
5. Zhang HH, Lu W. Adaptive lasso for Cox's proportional hazards model. Biometrika. 2007;94(3):691–703.
6. Zhang C-H. Nearly unbiased variable selection under minimax concave penalty. Ann Stat. 2010;38(2):894–942.
7. Simon N, Friedman J, Hastie T, Tibshirani R. Regularization paths for Cox's proportional hazards model via coordinate descent. J Stat Softw. 2011;39(5):1.
8. Johnson BA, Long Q, Huang Y, Chansky K, Redman M. Log-Penalized Least Squares, Iteratively Reweighted Lasso, and Variable Selection for Censored Lifetime Medical Cost. Technical Report. Atlanta, GA: Emory University; 2012.
9. Su X, Wijayasinghe CS, Fan J, Zhang Y. Sparse estimation of Cox proportional hazards models via approximated information criteria. Biometrics. 2016;72(3):751–759.
10. Dai L, Chen K, Sun Z, Liu Z, Li G. Broken adaptive ridge regression and its asymptotic properties. J Multivar Anal. 2018;168:334–351.
11. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B Stat Methodol. 2005;67(2):301–320.
12. Mittal S, Madigan D, Burd RS, Suchard MA. High-dimensional, massive sample-size Cox proportional hazards regression for survival analysis. Biostatistics. 2014;15(2):207–221.
13. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc. 2001;96(456):1348–1360.
14. Zou H. The adaptive lasso and its oracle properties. J Am Stat Assoc. 2006;101(476):1418–1429.
15. Akaike H. A new look at the statistical model identification. IEEE Trans Autom Control. 1974;19(6):716–723.
16. Schwarz G. Estimating the dimension of a model. Ann Stat. 1978;6(2):461–464.
17. Volinsky CT, Raftery AE. Bayesian information criterion for censored survival models. Biometrics. 2000;56(1):256–262.
18. Shen X, Pan W, Zhu Y. Likelihood-based selection and sharp parameter estimation. J Am Stat Assoc. 2012;107(497):223–232.
19. Breiman L. Heuristics of instability and stabilization in model selection. Ann Stat. 1996;24(6):2350–2383.
20. Zhao H, Sun D, Li G, Sun J. Variable selection for recurrent event data with broken adaptive ridge regression. Can J Stat. 2018;46(3):416–428. doi:10.1002/cjs.11459
21. Zhao H, Wu Q, Li G, Sun J. Simultaneous estimation and variable selection for interval-censored data with broken adaptive ridge regression. J Am Stat Assoc. 2019:1–13. doi:10.1080/01621459.2018.1537922
22. Zhao H, Sun D, Li G, Sun J. Simultaneous estimation and variable selection for incomplete event history studies. J Multivar Anal. 2019;171:350–361.
23. Cox DR. Regression models and life-tables. J R Stat Soc Ser B Methodol. 1972;34(2):187–202.
24. Andersen PK, Gill RD. Cox's regression model for counting processes: a large sample study. Ann Stat. 1982;10(4):1100–1120.
25. Verweij PJ, Van Houwelingen HC. Penalized likelihood in Cox regression. Statist Med. 1994;13(23–24):2427–2436.
26. Frommlet F, Nuel G. An adaptive ridge procedure for L0 regularization. PLOS ONE. 2016;11(2):e0148620.
27. Fan J, Feng Y, Wu Y. High-dimensional variable selection for Cox's proportional hazards model. In: Borrowing Strength: Theory Powering Applications–A Festschrift for Lawrence D. Brown. Bethesda, MD: Institute of Mathematical Statistics; 2010:70–86.
28. Yang G, Yu Y, Li R, Buu A. Feature screening in ultrahigh dimensional Cox's model. Statistica Sinica. 2016;26:881–901.
29. Kawaguchi ES. Scalable Methods for Big Time-to-Event Data [PhD thesis]. Los Angeles, CA: UCLA; 2019.
30. Craven P, Wahba G. Smoothing noisy data with spline functions. Numerische Mathematik. 1978;31(4):377–403.
31. Verweij PJ, Van Houwelingen HC. Cross-validation in survival analysis. Statist Med. 1993;12(24):2305–2314.
32. Ni A, Cai J. Tuning parameter selection in Cox proportional hazards model with a diverging number of parameters. Scand J Stat Theory Appl. 2018;45(3):557–570.
33. Wang H, Li R, Tsai C-L. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 2007;94(3):553–568.
34. Zhang Y, Li R, Tsai C-L. Regularization parameter selections via generalized information criterion. J Am Stat Assoc. 2010;105(489):312–323.
35. Zhang T, Oles FJ. Text categorization based on regularized linear classification methods. Information Retrieval. 2001;4(1):5–31.
36. Wu TT, Lange K. Coordinate descent algorithms for lasso penalized regression. Ann Appl Stat. 2008;2(1):224–244.
37. Gorst-Rasmussen A, Scheike T. Coordinate descent methods for the penalized semiparametric additive hazards model. J Stat Softw. 2012;47(9):1–17. https://www.jstatsoft.org/v047/i09
38. Suchard MA, Simpson SE, Zorych I, Ryan P, Madigan D. Massive parallelization of serial inference algorithms for a complex generalized linear model. ACM Trans Model Comput Simul. 2013;23(1).
39. Mittal S, Madigan D, Cheng JQ, Burd RS. Large-scale parametric survival analysis. Statist Med. 2013;32(23):3955–3971.
40. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33(1):1–22.
41. Breheny P, Huang J. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann Appl Stat. 2011;5(1):232–253.
42. Genkin A, Lewis DD, Madigan D. Large-scale Bayesian logistic regression for text categorization. Technometrics. 2007;49(3):291–304.
43. Zhang X, Cheng G. Simultaneous inference for high-dimensional linear models. J Am Stat Assoc. 2017;112(518):757–768.
44. Chen J, Chen Z. Extended Bayesian information criteria for model selection with large model spaces. Biometrika. 2008;95(3):759–771.
45. Chen J, Chen Z. Extended BIC for small-n-large-P sparse GLM. Statistica Sinica. 2012;22(2):555–574.
46. Gao X, Carroll RJ. Data integration with high dimensionality. Biometrika. 2017;104(2):251–272.
47. Yang Y, Zou H. A cocktail algorithm for solving the elastic net penalized Cox's regression in high dimensions. Stat Interface. 2012;6(2):167–173.
48. Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the yield of medical tests. JAMA. 1982;247(18):2543–2546.
49. Harrell FE Jr, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statist Med. 1996;15(4):361–387.
50. Lockhart R, Taylor J, Tibshirani RJ, Tibshirani R. A significance test for the lasso. Ann Stat. 2014;42(2):413.
51. Lee JD, Sun DL, Sun Y, Taylor JE. Exact post-selection inference, with application to the lasso. Ann Stat. 2016;44(3):907–927.
