Marginal false discovery rate for a penalized transformation survival model

Weijuan Liang; Shuangge Ma; Cunjie Lin

doi:10.1016/j.csda.2021.107232

. Author manuscript; available in PMC: 2022 Aug 1.

Published in final edited form as: Comput Stat Data Anal. 2021 Apr 2;160:107232. doi: 10.1016/j.csda.2021.107232

Marginal false discovery rate for a penalized transformation survival model

Weijuan Liang ^b, Shuangge Ma ^b,^c, Cunjie Lin ^a,^b,^*

PMCID: PMC8356905 NIHMSID: NIHMS1699547 PMID: 34393307

Abstract

Survival analysis that involves moderate/high dimensional covariates has become common. Most of the existing analyses have been focused on estimation and variable selection, using penalization and other regularization techniques. To draw more definitive conclusions, a handful of studies have also conducted inference. The recently developed mFDR (marginal false discovery rate) technique provides an alternative inference perspective and can be advantageous in multiple aspects. The existing inference studies for regularized estimation of survival data with moderate/high dimensional covariates assume the Cox and other specific models, which may not be sufficiently flexible. To tackle this problem, the analysis scope is expanded to the transformation model, which is robust and has been shown to be desirable for practical data analysis. Statistical validity is rigorously established. Two data analyses are conducted. Overall, an alternative inference approach has been developed for survival analysis with moderate/high dimensional data.

Keywords: Marginal false discovery rate, Transformation model, Survival analysis

1. Introduction

With the advancement of data collection techniques, data with a survival outcome and moderate/high dimensional covariates has become common. Examples can be found in biomedical studies as well as other fields such as engineering and social science. With such data, regularized estimation and variable selection are often needed to generate unique and stable estimation and screen out noises. Among the available regularization techniques, penalization has been popular because of its lucid interpretation as well as satisfactory statistical and numerical performance. There has been a vast literature on penalized estimation and variable selection with survival data.

To draw more definitive conclusions, inference is needed beyond estimation and selection. Being significantly more challenging, inference studies on survival data with moderate/high dimensional covariates under penalized (and other regularized) estimation have been limited. Examples include Lin and Wei (1989), Fang et al. (2016), Fang et al. (2020), Chai et al. (2019), and others. As pointed out in Miller and Breheny (2019), “…they are very restrictive, in the sense that they are often unable to select more than one or two features, even at high false discovery rates …”. Our own limited numerical studies also “confirm” this observation. Another limitation is that the existing inference approaches are often computationally expensive. Here we do note that, with the development of more efficient computational techniques, software, and hardware, this problem may diminish.

As summarized in Breheny (2019) and other studies, existing inference techniques mostly take the fully conditional or pathwise conditional perspective. Under the fully conditional perspective, the goal is to quantify whether a covariate is independent of survival conditional on all other covariates (Meinshausen et al., 2009). In contrast, under the pathwise conditional perspective, the goal is to quantify whether a covariate is independent of survival conditional on covariates that have already been included in the model (and concluded as relevant) (Lee et al., 2013; Tibshirani et al., 2016). More recently, the mFDR (marginal false discovery rate) strategy has been developed (Breheny, 2019). Consider the schematic plot in Fig. 1. Among the three covariates, X₁ is directly associated with survival and should be identified as significant. X₃ is not associated and should be identified as insignificant. X₂ may have some ambiguity: whether it is associated with survival depends on the status of X₁. If X₁ is not present, which is possible under the pathwise perspective, then X₂ is significant. If X₁ is present, as under the fully conditional perspective and some cases of the pathwise perspective, X₂ is not significant. To avoid such ambiguity and draw “cleaner” conclusions, the goal of the mFDR technique is to screen out covariates like X₃. For linear, generalized linear, as well as Cox and some other survival models, the theoretical validity and numerical advantage of the mFDR technique have been established (Breheny, 2019; Miller and Breheny, 2019).

Fig. 1. — Scheme of different associations with survival.

In published inference studies with survival data, including the aforementioned, the Cox and some parametric models have been assumed. In “classic” survival analysis, it has been shown that model mis-specification is not uncommon, and the general transformation model (Khan and Tamer, 2007), which includes the Cox and many other models as special cases, is more flexible and “more immune” from model mis-specification. Under moderate/high dimensional settings, the transformation model has been adopted for marginal analysis (Song et al., 2014) and shown to outperform (semi)parametric alternatives. For joint analysis with penalization, its estimation and variable selection properties have been established (Song and Ma, 2010), and its advantage of robustness/flexibility over the Cox and other models have been established.

In this article, we consider survival data with moderate/high dimensional covariates. The goal is to further expand the scope of mFDR analysis and investigate its properties under the general transformation model. This is motivated by the appealing properties of mFDR and desirable robustness/flexibility of the transformation model for moderate/high dimensional data. We note that this research is more than a direct application of the mFDR technique. In particular, the existing mFDR studies are all based on M-estimation. By contrast, for the general transformation model, U-estimation is needed, which is significantly more complicated than M-estimation. Accordingly, new and more challenging theoretical and numerical developments are needed. Overall, this study can provide a practically useful new inference approach for survival data with moderate/high dimensional covariates and is warranted beyond the existing literature.

2. Methods

2.1. Model and penalized estimation

Denote T and C as the event and censoring time, respectively. Under right censoring, the observed survival data are $Y = \min (T, C)$ and $Δ = I (T \leq C)$ . Consider the transformation model:

g (T) = β^{T} Z + ε,

where g(·) is a monotone increasing function with an unspecified form, and ε is the random error with mean zero and an unknown distribution. Z is a covariate vector independent of ε, and β is a length p vector of regression coefficients. For identifiability, and different from the Cox and other models, the first element of β is restricted to 1, that is, $β = {(1, θ^{T})}^{T}$ . Without specifying the form of g, the transformation model is “more immune” from model mis-specification. Without distributional assumptions on ε, likelihood approaches are not applicable, making the analysis fundamentally different from that in Miller and Breheny (2019) and others.

Let ${Y_{i}, Δ_{i}, Z_{i}}, i = 1, \dots, n,$ be a sample of independent and identically distributed observations. Consider the second order U-statistic based objective function:

{\tilde{H}}_{n} (β) = \frac{1}{n (n - 1)} \sum_{i \neq j} Δ_{j} I (Y_{i} \geq Y_{j}) I (β^{T} Z_{i} \geq β^{T} Z_{j}),

where I(·) is the indicator function. This has been referred to as the partial rank estimation (PRE) (Khan and Tamer, 2007) and has roots in the maximum rank correlation (MRC) estimation (Han, 1987). Note that U-estimation adopted here differs fundamentally from M-estimation. The indicator function is non-differentiable, creating challenges in optimization. Further consider the objective function with a smooth approximation:

H_{n} (β) = \frac{1}{n (n - 1)} \sum_{i \neq j} Δ_{j} I (Y_{i} \geq Y_{j}) s_{n} (β^{T} Z_{i} - β^{T} Z_{j}),

where $s_{n} (u) = s (u / h_{n}), s (u) = 1 / {1 + \exp (- u)},$ and h_n is strictly positive and decreasing and satisfies $\lim_{n \to \infty} h_{n} = 0$ . This has been referred to as the SPRE (smoothed PRE) (Song et al., 2007).

With moderate/high dimensional covariates, to regularize estimation and select relevant covariates, consider the penalized objective function:

Q_{n} (β) = H_{n} (β) - \sum_{k = 1}^{p - 1} p_{λ} (| θ_{k} |),

(1)

where p_λ(·) is a penalty function with a regularization parameter λ. In what follows, we consider Lasso with $p_{λ} (θ) = λ | θ | .$ Other penalties can be adopted and analyzed in a similar manner.

2.2. mFDR

Let $β_{0} = {(1, θ_{0}^{T})}^{T}$ be the true value of β. Suppose conditions (A1)-(A9) in Song and Ma (2010) hold, for any solution $\hat{β}$ of (1), we have:

\begin{array}{l} u_{k} (\hat{β}) - λ sign ({\hat{β}}_{k}) = 0, if {\hat{β}}_{k} \neq 0, \\ | u_{k} (\hat{β}) | \leq λ, if {\hat{β}}_{k} = 0, \end{array}

for k = 2,…, p, where

u_{k} (β) = \frac{\partial H_{n} (β)}{\partial β_{k}} = \frac{1}{n (n - 1)} \sum_{i \neq j} Δ_{j} I (Y_{i} \geq Y_{j}) s' (\frac{β^{T} Z_{i} - β^{T} Z_{j}}{h_{n}}) \frac{Z_{i k} - Z_{j k}}{h_{n}} .

Here, $s^{'} (u) = e^{- u} / {(1 + e^{- u})}^{2}$ is the first-order derivative of s(u). Define $X = (Y, Δ, Z), x = (y, δ, z),$ and

\begin{array}{l} f^{*} (x_{1}, x_{2}, β) = δ_{2} I (y_{1} \geq y_{2}) I (β^{T} z_{1} - β^{T} z_{2} > 0) + δ_{1} I (y_{2} \geq y_{1}) I (β^{T} z_{2} - β^{T} z_{1} > 0), \\ f_{n}^{*} (x_{1}, x_{2}, β) = δ_{2} I (y_{1} \geq y_{2}) s_{n} (β^{T} z_{1} - β^{T} z_{2}) + δ_{1} I (y_{2} \geq y_{1}) s_{n} (β^{T} z_{2} - β^{T} z_{1}) . \end{array}

Let $τ (x, β) = E {f^{*} (x, X, β)}$ and $τ_{n} (x, β) = E {f_{n}^{*} (x, X, β)} .$ In addition, let f₀(·) denote the marginal density function of the index $β_{0}^{T} Z$ , and $μ_{k} (β_{0}^{T} Z)$ denote the conditional expectation of Z_k given $β_{0}^{T} Z$ . Furthermore, define the function

q (x, t) = E [Δ I (y \geq Y) - δ I (Y \geq y) | β_{0}^{T} Z = t] .

We additionally impose the following condition:

(B0) q(x, t) is differentiable with respect to t, f₀(·) is differentiable, and

E {[Z_{k} - μ_{k} (β_{0}^{T} Z)] [Z_{r} - μ_{r} (β_{0}^{T} Z)] q' (X, β_{0}^{T} Z) f_{0} (β_{0}^{T} Z)} = 0,

where q′(·, ·) denotes the partial derivative of q(x, t) with respect to t.

This condition is analogous to the “vanishing correlation” assumed in Miller and Breheny (2019). Carefully examining the published studies and our proof suggest that it cannot be easily dropped or replaced.

For a variable Z_k independent of survival, our goal is to properly identify false discovery, that is, $β_{k 0} = 0$ , but ${\hat{β}}_{k} \neq 0.$ We can establish the following key result.

Theorem 1.

Suppose that Conditions (A1)–(A9) in Song and Ma (2010) and (B0) hold and that $λ = O (n^{- 1 / 2}) .$ For variable k with $β_{k 0} = 0,$

\sqrt{n} {u_{k} (\hat{β}) + v_{k} {\hat{β}}_{k}} \overset{D}{\to} N (0, σ_{k}^{2}),

where $v_{k} = - E {\frac{\partial^{2} τ (X, β_{0})}{\partial β_{k}^{2}}}$ and $σ_{k}^{2} = E {\frac{\partial τ (X, β_{0})}{\partial β_{k}}}^{2} .$

Proof is presented in Appendix A. With the U-estimation, it proceeds differently from that in Miller and Breheny (2019). Based on Theorem 1, we can develop the mFDR bound for estimator (1) as follows.

Consider:

\frac{\partial^{2} H_{n} (β_{0})}{\partial β_{k}^{2}} = \frac{1}{n (n - 1)} \sum_{i \neq j} Δ_{j} I (Y_{i} \geq Y_{j}) s^{″} (\frac{β^{T} Z_{i} - β^{T} Z_{j}}{h_{n}}) {(\frac{Z_{i k} - Z_{j k}}{h_{n}})}^{2},

where $s^{″} (u) = e^{- u} (e^{- u} - 1) / {(1 + e^{- u})}^{3} .$ Note that s″(u) < 0 if u > 0. The monotonicity of g(·) implies that $P (Y_{i} \geq Y_{j} | Z_{i}, Z_{j}) - P (Y_{i} \leq Y_{j} | Z_{i}, Z_{j})$ has the same sign as $β_{0}^{T} Z_{i} - β_{0}^{T} Z_{j} .$ And we can get that v_k is positive and

P ({\hat{β}}_{k} \neq 0) = P (| u_{k} (\hat{β}) + v_{k} {\hat{β}}_{k} | > λ) .

Furthermore, if $β_{k 0} = 0,$ this probability can be approximated by

P (\frac{\sqrt{n} | u_{k} (\hat{β}) + v_{k} {\hat{β}}_{k} |}{σ_{k}} > \frac{\sqrt{n} λ}{σ_{k}}) = 2 Φ (- \frac{\sqrt{n} λ}{σ_{k}}),

where Φ(·) is the distribution function of standard normal. This implies that the probability of a marginally unimportant covariate k being selected is approximately equal to the probability of a random variable with distribution $N (0, σ_{k}^{2} / n)$ exceeding λ in absolute value. Therefore, we can obtain the following estimate for the expected number of false discoveries:

\hat{FD} = 2 \sum_{k \in N} Φ (- \frac{\sqrt{n} λ}{{\hat{σ}}_{k}}),

(2)

where $N$ is the index set of noise covariates and ${\hat{σ}}_{k} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {\frac{\partial_{τ_{n}} (X_{i}, \hat{β})}{\partial β_{k}}}^{2}} .$ Finally, the mFDR can be estimated by

m \hat{FD} R = \frac{\hat{FD}}{| S |},

where $S$ is the set of selected covariates, and $| S |$ is its size. It is noted that we use $S^{c}$ , as opposed to the set containing all covariates in Miller and Breheny (2019), to obtain a more accurate estimate of $N$ in (2). The overall mFDR estimation procedure is summarized in Algorithm 1.

Algorithm 1:

Algorithm for calculating the upper bound of mFDR

1 For

k \in {1, \dots, p} :

(1) Estimate

{\hat{σ}}_{k};

(2) Calculate

{\hat{FD}}_{k} = 2 Φ (- \frac{\sqrt{n} λ}{{\hat{σ}}_{k}}) .

2 Calculate

\hat{FD} = {\sum_{k \in S^{c}} \hat{FD}}_{k} .

3 The mFDR upper bound is

\hat{mFDR} = \min {\frac{\hat{FD}}{| S |}, 1} .

Open in a new tab

2.3. Computation using the Fabs technique

As the loss function is nonconvex, the estimation of $\hat{β}$ and ${\hat{σ}}_{k}$ is nontrivial. We adopt the Forward and Backward Stagewise (Fabs) technique (Shi et al., 2018) to overcome challenges brought by the nonconvexity. Fabs is an iterative algorithm, with each iteration taking either a backward or a forward step with a fixed small step size ϵ. It is rooted in the BLasso algorithm (Zhao and Yu, 2017) but extends to nonconvex optimization problems, by using the first-order Taylor expansion and taking advantage of differentiability.

We first initialize ${\hat{β}}^{(0)}$ as a length-p vector with the first entry being 1 and the rest being 0. It is noted that for notational simplicity, we take the first covariate as the “anchor” (Song and Ma, 2010) and set its estimate as 1. At iteration t, we take either a backward step along direction m determined by:

m = \underset{k \in A^{(t)}}{arg max} H_{n} ({\hat{β}}^{(t)} - sign ({\hat{β}}_{k}^{(t)}) 1_{k} ϵ),

(3)

or a forward step along direction m determined by:

m = \underset{k = 2, \dots, p}{arg max} H_{n} ({\hat{β}}^{(t)} + sign (u_{k} ({\hat{β}}^{(t)})) 1_{k} ϵ),

(4)

where 1_k denotes the length-p vector with the kth entry being 1 and the rest being 0, and $A^{(t)}$ is the index set of ${\hat{β}}^{(t)}$ ’s nonzero entries besides the first one. Fabs uses the first-order Taylor expansions to approximate objective functions (3) and (4) as follows:

H_{n} ({\hat{β}}^{(t)} - sign ({\hat{β}}_{k}^{(t)}) 1_{k} ϵ) = H_{n} ({\hat{β}}^{(t)}) - u_{k} ({\hat{β}}^{(t)}) sign ({\hat{β}}_{k}^{(t)}) ϵ + O (ϵ^{2}),

(5)

H_{n} ({\hat{β}}^{(t)} + sign (u_{k} ({\hat{β}}^{(t)})) 1_{k} ϵ) = H_{n} ({\hat{β}}^{(t)}) + | u_{k} ({\hat{β}}^{(t)}) | ϵ + O (ϵ^{2}) .

(6)

When ϵ is small enough, the third term on the right hand sides of (5) and (6) can be ignored. Therefore, a backward step direction m is determined by:

m = \underset{k \in A^{(t)}}{arg min} {u_{k} ({\hat{β}}^{(t)}) sign ({\hat{β}}_{k}^{(t)})},

and a forward step direction m is determined by:

m = \underset{k = 2 \dots, p}{arg max} | u_{k} ({\hat{β}}^{(t)}) | .

(7)

A backward step is taken to move the estimate along direction m towards zero, if this step can lead to an increase of the objective function. Specifically, if

Q_{n} ({\hat{β}}^{(t)} - sign ({\hat{β}}_{m}^{(t)}) 1_{m} ϵ; λ^{(t)}) > Q_{n} ({\hat{β}}^{(t)}; λ^{(t)}),

a backward step is taken by

{\hat{β}}^{(t + 1)} = {\hat{β}}^{(t)} - sign ({\hat{β}}_{m}^{(t)}) 1_{m} ϵ, and λ^{(t + 1)} = λ^{(t)} .

Otherwise, a forward step is taken by updating:

{\hat{β}}^{(t + 1)} = {\hat{β}}^{(t)} + sign (u_{m} ({\hat{β}}^{(t)})) 1_{m} ϵ, λ^{(t + 1)} = \min {λ^{(t)}, \frac{H_{n} ({\hat{β}}^{(t + 1)}) - H_{n} ({\hat{β}}^{(t)})}{ϵ}},

and direction m is determined by (7).

The algorithm is summarized in Appendix B. It stops when the predefined minimum λ is reached.

3. Numerical studies

3.1. Simulation

Multiple sets of simulations are conducted to gauge performance of the proposed approach and compare with alternatives. With the proposed approach, $S^{c}$ is estimated with λ selected using the BIC criterion. For each setting, 200 replicates are simulated.

Simulation 1.

The goal is to evaluate performance of the proposed approach by comparing the estimated versus actual mFDR values, under settings with different sample sizes, strengths of signals, and dimensions of covariates. As tuning parameter selection is not a completely resolved problem, we consider a sequence of tuning values, as in Miller and Breheny (2019).

Consider data generated under the AFT model

\log (T_{i}) = β^{⟙} Z_{i} + ε_{i},

where ε_i has a standard normal distribution. β = b (1, 1, −1, −1, 0_p−4)^⟙, where 0_p−4 is a vector with all entries equal to 0. The censoring time C_i is generated independently, and the censoring rate is about 10%. Covariates Z_i follow a multivariate normal distribution with marginal means 0, marginal variances 1, and the AR(0.3) correlation structure. Set h_n = n^−1/2. We further consider multiple scenarios: (a) n = 100, 200, and 400, with b = 1 and p = 10; (b) b = 0.5, 1 and 2, with n = 100 and p=10; and (c) p = 20, 40 and 150, with n = 100 and b = 1.

The results are summarized in Fig. 2. The left and middle panels show that the proposed approach provides a somewhat conservative control of mFDR. However, as sample size increases, or as signal level increases, the estimated mFDR value gets much closer to the actual value. For example, in the left panel, when the actual mFDR is 10%, the estimated mFDR values are 11.87%, 10.38%, and 10.13% with sample sizes 100, 200, and 400. In the middle panel, when the actual mFDR is 15%, the estimated mFDRs are 21.43%, 17.98%, and 17.70% with b = 0.5, 1, and 2.

It is noted that the theoretical validity is established under finite p. Simulation in the right panel examines the performance when covariates are high dimensional. It is observed that the proposed approach is still able to provide a conservative control. For example, when the true mFDR is 10%, the estimated mFDRs are 11.71%, 17.00%, and 21.40% with p = 20, 40, and 150.

In the next two sets of simulations, we further establish merit of the proposed approach by comparing against alternatives, when data are generated from the Cox and accelerated failure time (AFT) models. More specifically, comparison is conducted in two ways. (i) We first compare with the likelihood-based mFDR method assuming the Cox model and adopting M-estimating (Miller and Breheny, 2019). It is noted that, for better comparability, we modify the estimation of $\hat{FD}$ in equation (2.4) of Miller and Breheny (2019) to sum over $S^{c}$ instead of 1,…, p. For the proposed and Cox model-based methods, we examine the estimated mFDR values under the smallest λ that generates the actual mFDR=20%. (ii) We also evaluate the “ordinary” variable selection performance. With the proposed and Cox model-based methods, we examine variable selection performance under the smallest λ with estimated mFDR≤20%. The second alternative is Univariate testing, which fits a marginal Cox model with each covariate separately. Then the Benjamini-Hochberg procedure is applied to the p-values to control the FDR at 20%. The third alternative is Cross-Validation, where we fit a joint Cox model with all covariates, apply Lasso for penalized estimation, and select the optimal tuning using 10-fold cross validation, and hence focus on the important covariates correspondingly. It is noted that, the first two alternatives have analysis frameworks closer to that of the proposed, while the last alternative is more popular in the literature but does not have an explicit FDR control.

Simulation 2.

Survival time T is generated from an exponential distribution with rate exp(β^TZ), where β = (1, 0.8, 0.6, 1.2 0_p−4)^⟙. Censoring time C follows an exponential distribution with mean 10.

Simulation 3.

Survival time T is generated from an AFT model with β = (1, 0.8, 0.6, 1.2, 0_p−4)^⟙, and the random error follows a normal distribution with mean 0 and standard deviation 0.5. Censoring time C is generated in the same way as in Simulation 2.

For both Simulation 2 and Simulation 3, we set p = 10 and n = 100. In addition, we consider various strengths of correlation among covariates. In particular, in the AR(ρ) correlation structure, we consider ρ = 0, 0.1, …, 0.7, 0.8.

As shown in Fig. 3, both the proposed and Cox model-based approaches provide a conservative control of mFDR. As correlation increases, which makes it more difficult to distinguish signals from noises, performance of both approaches deteriorates. However, under all settings, the estimated mFDR values under the proposed approach are closer to the actual values, providing a more accurate control. It is interesting to observe that, even when the Cox model assumption is satisfied, the proposed approach still performs better. Research on mFDR is still limited, and it is not yet clear what causes the less satisfactory performance of the Cox model-based analysis.

As shown in Fig. 4, all four approaches have similar, satisfactory performance in identifying the true positives. The Univariate testing and Cross-Validation approaches are inferior with higher false positive rates. The proposed and Cox model-based approaches perform similarly, with the proposed approach having slightly more false positives.

Simulation 4.

This set of simulation is conducted to examine performance under the “null”. Specifically, survival time T is generated from the transformation model $g (T) = β^{T} Z + ε,$ where $g (T) = \sqrt{T} - 2, β = (1, 0_{p - 1}),$ and the random error is generated as in Simulation 1. Censoring time C follows an exponential distribution with mean 20. Set p = 10 and n = 100.

We consider the proposed and three alternative approaches. With the proposed, Cox model-based mFDR, and Univariate testing approaches, we target at controlling the mFDR or FDR at 10%. The average numbers of noise variables selected are 0.015 (proposed), 0.105 (Cox model-based mFDR), 0.76 (Univariate testing), and 1.89 (Cross-Validation), respectively. The proposed approach has the lowest false positive rate. Strictly speaking, there is still one signal in this model, as the transformation model-based analysis demands an anchor. We have also experimented with settings that have zero signal and found similar but less stable results (higher variations).

3.2. Data analysis

Two datasets are analyzed to further examine performance of the proposed analysis. The first dataset has a small number of covariates. The second has high-dimensional covariates, which is considered as the simulation shows that the proposed approach may also be applicable.

Example 1 (Loan Data).

We retrieve the loan data for the first to third quarter of 2019 from the Lending Club (LC) website (https://www.lendingclub.com/info/download-data.action). We obtain records with 60 months payments for two high-risk groups (LC grade E and F) and remove those that were in the process of being repaid. The time to event of interest is the total number of months a borrower has paid off, defined as the total payment received to date divided by the monthly payment owed by the borrower. Records with non-default loans over the study period are censored. Similar analysis has also been considered in the literature (Shi et al., 2018). The processed data contains a total number of 596 borrowers, and 82 of them default. We are interested in the relationship between the default time and borrower’s financial features, including the logarithm of loan amount, employment length, homeownership status, logarithm of borrower’s reported annual income, ratio of total monthly debt payment to total debt obligation (dti), number of open credit lines, logarithm of total credit revolving balance, revolving line utilization rate, total number of credit lines, number of derogatory public records, and their two-way interactions. Overall, the number of covariates is 55, which is moderate compared to the sample size.

To avoid unnecessary complication, we treat interactions in the same manner as for main effects. When applying the proposed method, we set the covariate with the largest positive Kendall’s τ correlation with the response, which is the interaction between logarithm of borrower’s reported annual income and logarithm of total credit revolving balance, as having coefficient 1. Results are summarized in the left panel of Fig. 5. More details are available from the authors. If we control the mFDR level at 10%, besides the first automatically entered covariate, the proposed approach identifies five covariates, namely the logarithm of loan amount, interaction between homeownership status and dti, interaction between homeownership status and number of open credit lines, interaction between total credit revolving balance and revolving line utilization rate, and interaction between revolving line utilization rate and total number of credit lines. These features have also been suggested as important factors in published studies, which provides some support to the validity of the proposed analysis. For example, it has been shown that a borrower with a large number of open credit or total credit lines may either default or choose to prepay, since he or she may either carry too heavy a debt burden or have great ability to pay back.

Fig. 5. — Analysis of Loan data (left) and SKCM data (right) using the proposed approach: estimated mFDR and number of covariates selected as a function of tuning.

Data is also analyzed using the alternatives. With the Cox model-based mFDR method, we allow the anchor in the model without any penalization to achieve better comparability. If we control the mFDR level at 10%, except for the first automatically entered covariate, this approach identifies none. When we increase the mFDR level to 29.32%, it can identify two additional covariates, namely the logarithm of loan amount and interaction between homeownership status and the number of open credit lines. When the FDR is controlled at 10%, the Univariate testing approach identifies none. When the FDR level is increased to 25.6% (and 29.32%), it identifies four covariates, all of which overlap with the proposed analysis. The Cross-Validation approach identifies three covariates, all overlapping with the proposed analysis.

Example 2 (SKCM Data).

We analyze the TCGA (The Cancer Genome Atlas) data on skin cutaneous melanoma (SKCM). The response of interest is overall survival, and the covariates are gene expressions. After standard processing, including removing samples with missing response data and genes with minimal expression variations, the final dataset contains 278 patients (among whom 159 have censored survival), their survival information, and 17,944 gene expressions.

Results are summarized in the right panel of Fig. 5, with more details available from the authors. When setting the mFDR level at 10%, beyond gene DDX60 which is automatically included, the proposed approach additionally selects gene PRKAR2 A. At estimated mFDR 16.49%, it selects three genes beyond DDX60: PHACTR4, STRIP1, and PRKAR2 A, all of which have been suggested as associated with melanoma prognosis. For example, literature has revealed that PHACTR4 is a STOP gene, which can limit proliferation and tumor growth in cells and thus prevent tumorigenesis. STRIP1 is one of the important complexes of STRIPAK, and STRIPAK complexes regulate the signal transduction pathway, mitosis and cytokinesis, cell polarity, and protein trafficking. Also, it has been shown that individuals with a deficiency in PRKAR2 A protein are predisposed to malignancies.

With the alternatives, the Cox model-based mFDR approach identifies no additional gene when controlling mFDR at 10%. When controlling FDR at 10%, the Univariate testing approach identifies 332 genes, which may not be sensible considering the limited number of melanoma associated genes reported in the literature. The Cross-Validation approach identifies only one gene C3ORF67.

4. Discussion

For survival data with a moderate to large number of covariates, we have developed the mFDR based inference under the transformation model and penalized estimation. mFDR provides an alternative, computationally simpler, and practically advantageous way of inference. In terms of methodology, adopting the transformation model leads to the much-desired robustness and flexibility. Our numerical analysis also shows its advantages over the Cox model-based mFDR analysis and other alternatives. Theoretically, we have been able to establish results comparable to those for the Cox model. It is noted that this is the first mFDR analysis with U-estimation, significantly expanding its scope. We have only established theoretical validity with a finite p. Our limited simulation shows that when p is large, the proposed approach provides a conservative control, and as such, it can be applicable. It is of interest but beyond our scope to establish theoretical validity for a diverging p. It is also noted that the PRE and SPRE can be directly simplified to accommodate continuous and categorical data. As such, the proposed approach can be applied to such data with almost no modification. We acknowledge the connections with but also distinctions from the recent inference studies on survival data under the Cox and other models–such studies have assumed less robust models and adopted M-estimation. Consolidating the developments for mFDR and FDR as well as for different models demands significant additional work. As the FDR studies are not as closely related, and as some of them are not easy to implement, we postpone numerical comparisons to the future.

Acknowledgments

We thank the editor and reviewers for their careful review and insightful comments. This work was supported by the National Natural Science Foundation of China (11701561), National Statistical Science Research Project, China (2019LZ22), Foundation from Ministry of Education of China (20JZD023), fund for building world-class universities (disciplines) of Renmin University of China, and NIH, USA (CA204120, CA121974).

Appendix A. Proof of Theorem 1

As established in Song and Ma (2010), $\hat{β}$ is consistent with $‖ \hat{β} - β_{0} ‖ = O_{p} (n^{- 1 / 2}) .$ With Taylor expansion,

\begin{array}{l} u_{k} (\hat{β}) = u_{k} (β_{0}) + \frac{\partial u_{k} (β_{0})}{\partial β^{T}} (\hat{β} - β_{0}) + o_{p} (‖ \hat{β} - β_{0} ‖) \\ = u_{k} (β_{0}) + \sum_{r \neq k} \frac{\partial^{2} H_{n} (β_{0})}{\partial β_{k} \partial β_{r}} ({\hat{β}}_{r} - β_{r 0}) + \frac{\partial^{2} H_{n} (β_{0})}{\partial β_{k}^{2}} ({\hat{β}}_{k} - β_{k 0}) + o_{p} (‖ \hat{β} - β_{0} ‖) . \end{array}

$\partial^{2} H_{n} (β_{0}) / \partial β \partial β^{T}$ is a second-order degenerate U-statistic, and it can be shown that

\begin{array}{l} \begin{matrix} \frac{\partial^{2} H_{n} (β_{0})}{\partial β_{k} \partial β_{r}} = E {\frac{\partial^{2} τ (X, β_{0})}{\partial β_{k} \partial β_{r}}} + o_{p} (1) & for & r \neq k, \end{matrix} \\ \frac{\partial^{2} H_{n} (β_{0})}{\partial β_{k}^{2}} = E {\frac{\partial^{2} τ (X, β_{0})}{\partial β_{k}^{2}}} + o_{p} (1) . \end{array}

Under Condition (B0), we further have:

E {\frac{\partial^{2} τ (X, β_{0})}{\partial β_{k} \partial β_{r}}} = E {[Z_{k} - μ_{k} (β_{0}^{T} Z)] [Z_{r} - μ_{r} (β_{0}^{T} Z)] q^{'} (X, β_{0}^{T} Z) f_{0} (β_{0}^{T} Z)},

and

\frac{\partial^{2} H_{n} (β_{0})}{\partial β_{k} \partial β_{r}} = o_{p} (1)

for $r \neq k .$ Together with ${\hat{β}}_{r} - β_{r 0} = O_{p} (n^{- 1 / 2})$ and $β_{k 0} = 0,$ we have

\sqrt{n} {u_{k} (\hat{β}) + v_{k} {\hat{β}}_{k}} = \sqrt{n} u_{k} (β_{0}) + o_{p} (1),

where $v_{k} = - E {\frac{\partial^{2} τ (X, β_{0})}{\partial β_{k}^{2}}} .$

$u_{k} (β_{0})$ is a U-statistic with kernel function

ψ (X_{i}, X_{j}) = Δ_{j} I (Y_{i} \geq Y_{j}) s^{'} (\frac{β_{0}^{T} Z_{i} - β_{0}^{T} Z_{j}}{h_{n}}) \frac{Z_{i k} - Z_{j k}}{h_{n}} + Δ_{i} I (Y_{j} \geq Y_{i}) s' (\frac{β_{0}^{T} Z_{j} - β_{0}^{T} Z_{i}}{h_{n}}) \frac{Z_{j k} - Z_{i k}}{h_{n}},

where $X_{i} = (Y_{i}, Δ_{i}, Z_{i}) .$ It can be rewritten as:

u_{k} (β_{0}) = \frac{1}{n (n - 1)} \sum_{i < j} ψ (X_{i}, X_{j}) .

We have $ψ (X_{i}, X_{j}) = \partial f_{n}^{*} (X_{i}, X_{j}, β_{0}) / \partial β_{k},$ and $E {ψ (X_{i}, X_{j})} \to 0$ as h_n → 0. For the variance of $u_{k} (β_{0})$ , we have:

ψ_{1} (X_{i}) = E {ψ (X_{i}, X_{j}) | X_{i}} = \frac{\partial τ_{n} (X_{i}, β_{0})}{\partial β_{k}} .

Then $u_{k} (β_{0})$ can be approximated by the U-statistic projection ${\hat{u}}_{k} (β_{0}) = \frac{1}{n} \sum_{i = 1}^{n} ψ_{1} (X_{i}) .$ Using the central limit theory of U-statistics (Lee, 1990), we have

\sqrt{n} u_{k} (β_{0}) \overset{D}{\to} N (0, σ_{k}^{2}),

where $σ_{k}^{2} = E {\frac{\partial τ (X, β_{0})}{\partial β_{k}}}^{2} .$ This completes the proof. □

Appendix B. Algorithm

Algorithm 2:

Fabs Algorithm for estimating β

1 Initialization:

With a small step size

ϵ > 0,

take an initial forward step:

m = \underset{k = 2, \dots, p}{arg max} | u_{k} ({\hat{β}}^{(0)}) | and A^{(0)} = {m},

{\hat{β}}^{(1)} = {\hat{β}}^{(0)} + sign (u_{m} ({\hat{β}}^{(0)})) 1_{m} ϵ,

λ^{(1)} = \frac{1}{ϵ} [H_{n} ({\hat{β}}^{(1)}) - H_{n} ({\hat{β}}^{(0)})],

where

{\hat{β}}^{(0)}

is the length-p vector with the first entry being 1 and the rest being 0, 1_m denotes the length-p vector with the mth entry being 1 and the rest being 0, and

A^{(t)}

is the index set of

{\hat{β}}^{(t)}

’s nonzero entries besides the first one.

2 Backward and Forward Steps:

2.1 The backward direction is determined by:

\begin{array}{l} m = \underset{k \in A^{(t)}}{arg min} {u_{k} ({\hat{β}}^{(t)}) sign ({\hat{β}}_{k}^{(t)})}, \\ Δ_{m} = - sign ({\hat{β}}_{m}^{(t)}) 1_{m} . \end{array}

2.2 Take a backward step as

{\hat{β}}^{(t + 1)} = {\hat{β}}^{(t)} + Δ_{m} ϵ

and

λ^{(t + 1)} = λ^{(t)}

Q_{n} ({\hat{β}}^{(t)} + Δ_{m} ϵ; λ^{(t)}) > Q_{n} ({\hat{β}}^{(t)}; λ^{(t)})

. Otherwise, take a forward step as

\begin{matrix} m = \underset{k = 2, \dots, p}{arg max} | u_{k} ({\hat{β}}^{(t)}) | and A^{(t + 1)} = A^{(t)} \cup {m}, \\ {\hat{β}}^{(t + 1)} = {\hat{β}}^{(t)} + sign (u_{m} ({\hat{β}}^{(t)})) 1_{m} ϵ, \end{matrix}

and relax λ if necessary as

λ^{(t + 1)} = \min {λ^{(t)}; \frac{1}{ϵ} [H_{n} ({\hat{β}}^{(t + 1)}) - H_{n} ({\hat{β}}^{(t)})]} .

3 Iteration:

Update t = t + 1, and repeat Step 2 and 3 until

λ^{(t)} \leq λ_{\min}

, where

λ_{\min}

is a predefined lower bound of the tuning.

Open in a new tab

References

Breheny P, 2019. Marginal false discovery rates for penalized regression models. Biostatistics 20 (2), 299–314. [DOI] [PubMed] [Google Scholar]
Chai H, Zhang Q, Huang J, Ma S, 2019. Inference for low-dimensional covariates in a high-dimensional accelerated failure time model. Statist. Sinica 29, 877–894. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fang EX, Ning Y, Li R, 2020. Test of significance for high-dimensional longitudinal data. Ann. Statist 48 (5), 2622–2645. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fang EX, Ning Y, Liu H, 2016. Testing and confidence intervals for high dimensional proportional hazards model. J. R. Stat. Soc. Ser. B Stat. Methodol 79 (5), 1415–1437. [DOI] [PMC free article] [PubMed] [Google Scholar]
Han AK, 1987. Non-parametric analysis of a generalized regression model: the maximum rank correlation estimator. J. Econometrics 35 (2–3), 303–316. [Google Scholar]
Khan S, Tamer E, 2007. Partial rank estimation of duration models with general forms of censoring. J. Econometrics 136 (1), 251–280. [Google Scholar]
Lee AJ, 1990. U-Statisics: Theory and Practice. Marcel dekker, Inc., New York, Basel. [Google Scholar]
Lee JD, Sun DL, Sun Y, Taylor JE, 2013. Exact post-selection inference with the lasso. Ann. Statist 44 (3), 907–927. [Google Scholar]
Lin D, Wei L, 1989. The robust inference for the cox proportional hazards model. J. Amer. Statist. Assoc 84 (408), 1074–1078. [Google Scholar]
Meinshausen N, Meier L, Bühlmann P, 2009. P-values for high-dimensional regression. J. Amer. Statist. Assoc 104 (488), 1671–1681. [Google Scholar]
Miller RE, Breheny P, 2019. Marginal false discovery rate control for likelihood-based penalized regression models. Biom. J 61 (4), 889–901. [DOI] [PubMed] [Google Scholar]
Shi X, Huang Y, Huang J, Ma S, 2018. A forward and backward stagewise algorithm for nonconvex loss functions with adaptive lasso. Comput. Statist. Data Anal 124, 235–251. [DOI] [PMC free article] [PubMed] [Google Scholar]
Song R, Lu W, Ma S, Jeng X, 2014. Censored rank independence screening for high-dimensional survival data. Biometrika 101 (4), 799–814. [DOI] [PMC free article] [PubMed] [Google Scholar]
Song X, Ma S, 2010. Penalised variable selection with U-estimates. J. Nonparametr. Stat 22 (4), 499–515. [DOI] [PMC free article] [PubMed] [Google Scholar]
Song X, Ma S, Huang J, Zhou XH, 2007. A semiparametric appraoch for the nonparametric transformation survival model with multiple covariates. Biostatistics 8 (2), 197–211. [DOI] [PubMed] [Google Scholar]
Tibshirani RJ, Taylor J, Lockhart R, Tibshirani R, 2016. Exact post-selection inference for sequential regression procedures. J. Amer. Statist. Assoc 111 (514), 600–620. [Google Scholar]
Zhao P, Yu B, 2017. Stagewise lasso. J. Mach. Learn. Res 8, 2701–2726. [Google Scholar]

[R1] Breheny P, 2019. Marginal false discovery rates for penalized regression models. Biostatistics 20 (2), 299–314. [DOI] [PubMed] [Google Scholar]

[R2] Chai H, Zhang Q, Huang J, Ma S, 2019. Inference for low-dimensional covariates in a high-dimensional accelerated failure time model. Statist. Sinica 29, 877–894. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] Fang EX, Ning Y, Li R, 2020. Test of significance for high-dimensional longitudinal data. Ann. Statist 48 (5), 2622–2645. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R4] Fang EX, Ning Y, Liu H, 2016. Testing and confidence intervals for high dimensional proportional hazards model. J. R. Stat. Soc. Ser. B Stat. Methodol 79 (5), 1415–1437. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] Han AK, 1987. Non-parametric analysis of a generalized regression model: the maximum rank correlation estimator. J. Econometrics 35 (2–3), 303–316. [Google Scholar]

[R6] Khan S, Tamer E, 2007. Partial rank estimation of duration models with general forms of censoring. J. Econometrics 136 (1), 251–280. [Google Scholar]

[R7] Lee AJ, 1990. U-Statisics: Theory and Practice. Marcel dekker, Inc., New York, Basel. [Google Scholar]

[R8] Lee JD, Sun DL, Sun Y, Taylor JE, 2013. Exact post-selection inference with the lasso. Ann. Statist 44 (3), 907–927. [Google Scholar]

[R9] Lin D, Wei L, 1989. The robust inference for the cox proportional hazards model. J. Amer. Statist. Assoc 84 (408), 1074–1078. [Google Scholar]

[R10] Meinshausen N, Meier L, Bühlmann P, 2009. P-values for high-dimensional regression. J. Amer. Statist. Assoc 104 (488), 1671–1681. [Google Scholar]

[R11] Miller RE, Breheny P, 2019. Marginal false discovery rate control for likelihood-based penalized regression models. Biom. J 61 (4), 889–901. [DOI] [PubMed] [Google Scholar]

[R12] Shi X, Huang Y, Huang J, Ma S, 2018. A forward and backward stagewise algorithm for nonconvex loss functions with adaptive lasso. Comput. Statist. Data Anal 124, 235–251. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] Song R, Lu W, Ma S, Jeng X, 2014. Censored rank independence screening for high-dimensional survival data. Biometrika 101 (4), 799–814. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] Song X, Ma S, 2010. Penalised variable selection with U-estimates. J. Nonparametr. Stat 22 (4), 499–515. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] Song X, Ma S, Huang J, Zhou XH, 2007. A semiparametric appraoch for the nonparametric transformation survival model with multiple covariates. Biostatistics 8 (2), 197–211. [DOI] [PubMed] [Google Scholar]

[R16] Tibshirani RJ, Taylor J, Lockhart R, Tibshirani R, 2016. Exact post-selection inference for sequential regression procedures. J. Amer. Statist. Assoc 111 (514), 600–620. [Google Scholar]

[R17] Zhao P, Yu B, 2017. Stagewise lasso. J. Mach. Learn. Res 8, 2701–2726. [Google Scholar]

PERMALINK

Marginal false discovery rate for a penalized transformation survival model

Weijuan Liang

Shuangge Ma

Cunjie Lin

Abstract

1. Introduction

Fig. 1.

2. Methods

2.1. Model and penalized estimation

2.2. mFDR

Theorem 1.

Algorithm 1:

2.3. Computation using the Fabs technique

3. Numerical studies

3.1. Simulation

Simulation 1.

Fig. 2.

Simulation 2.

Simulation 3.

Fig. 3.

Fig. 4.

Simulation 4.

3.2. Data analysis

Example 1 (Loan Data).

Fig. 5.

Example 2 (SKCM Data).

4. Discussion

Acknowledgments

Appendix A. Proof of Theorem 1

Appendix B. Algorithm

Algorithm 2:

References

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Marginal false discovery rate for a penalized transformation survival model

Weijuan Liang

Shuangge Ma

Cunjie Lin

Abstract

1. Introduction

Fig. 1.

2. Methods

2.1. Model and penalized estimation

2.2. mFDR

Theorem 1.

Algorithm 1:

2.3. Computation using the Fabs technique

3. Numerical studies

3.1. Simulation

Simulation 1.

Fig. 2.

Simulation 2.

Simulation 3.

Fig. 3.

Fig. 4.

Simulation 4.

3.2. Data analysis

Example 1 (Loan Data).

Fig. 5.

Example 2 (SKCM Data).

4. Discussion

Acknowledgments

Appendix A. Proof of Theorem 1

Appendix B. Algorithm

Algorithm 2:

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases