Author manuscript; available in PMC: 2019 Oct 1.
Published in final edited form as: Stat Methods Med Res. 2018 Oct 11;28(10-11):3404–3414. doi: 10.1177/0962280218803654

Bi-level variable selection for case-cohort studies with group variables

Soyoung Kim 1, Kwang Woo Ahn 1
PMCID: PMC6748310  NIHMSID: NIHMS1046936  PMID: 30306838

Abstract

The case-cohort design is an economical approach to estimating the effect of risk factors on a survival outcome when collecting exposure information or covariates on all patients in a large cohort study is expensive. Variables often have group structure, as with categorical variables and highly correlated continuous variables. The existing literature for case-cohort data is limited to identifying non-zero variables at the individual level only. In this article, we propose a bi-level variable selection method to select non-zero group and within-group variables for case-cohort data when variables have group structure. The proposed method allows the number of variables to diverge as the sample size increases. The asymptotic properties of the estimator, including bi-level variable selection consistency and asymptotic normality, are shown. We also conduct simulations to compare our proposed method with some existing methods and apply them to the Busselton Health data.

Keywords: Case-cohort design, efficiency, multiple diseases, survival analysis, variable selection

1. Introduction

The case-cohort design is widely used to estimate the effect of risk factors on survival outcome when measuring exposure information is costly. Case-cohort data consist of a random sample, called the subcohort, from the full cohort and all cases outside the subcohort. In other words, the expensive covariate information is collected from subjects being in the subcohort or having events of interest outside the subcohort. When studying multiple diseases of interest, the case-cohort design capable of using the same subcohort has been advocated.1

Extensive work has been done on analyzing case-cohort data for survival outcomes. With a single disease, a pseudo-likelihood approach was proposed by Prentice1 and Self and Prentice.2 To improve efficiency, Barlow3 proposed a robust estimator using a time-varying weight, and Kulich and Lin4 proposed a class of weighted estimators using all available information. For the setting in which several case-cohort studies have already been conducted for multiple diseases, Kang and Cai5 developed a joint model with multivariate failure times. However, they did not use extra information from the other diseases when estimating the effect of risk factors for a disease of interest. Kim et al.6 developed a more efficient estimation method with a new weight to make full use of information from the other diseases.

In spite of the progress made in developing methods for analyzing case-cohort data, the study of variable selection under the case-cohort design has been limited. Recently, Ni et al.7 proposed a variable selection procedure using the smoothly clipped absolute deviation (SCAD) penalty to select non-zero individual variables for a single case-cohort study. In practice, variables often have group structure, as with categorical variables and highly correlated continuous variables. For example, in the Busselton Health data from Cullen8 and Knuiman et al.,9 a case-cohort study was conducted to investigate the association between serum ferritin and stroke events. The data include categorical variables such as smoking status and categorized serum ferritin levels, as well as continuous variables. For a continuous variable $x$, investigators often examine the effects of $x, \ldots, x^a$ on the outcome of interest, where $a$ is a positive integer. In practice, individual variable selection methods are used to identify important variables among $x, \ldots, x^a$.7,10 However, because these terms are highly correlated, one may treat $x, \ldots, x^a$ as a group and apply a bi-level variable selection technique to identify non-zero group and within-group variables. Thus, bi-level selection may be more efficient than individual variable selection in selecting non-zero variables.

Many methods have been developed for variable selection with group structure under the proportional hazards model. Ma et al.11 and Kim et al.12 proposed the supervised group lasso and the group lasso method, respectively, which are limited to group variable selection. Huang et al.13 studied the group bridge to select variables at two levels, that is, at the group and within-group levels. However, they showed group variable selection consistency only, not bi-level variable selection consistency. As an alternative approach, Wang and Nan14 considered a hierarchically penalized proportional hazards regression for bi-level variable selection. For the subdistribution hazards model with group variables, Ahn et al.15 developed the adaptive group bridge, which has bi-level variable selection consistency. However, to the best of our knowledge, there is no literature on bi-level variable selection for survival outcomes under the case-cohort design.

In this article, we propose the adaptive group bridge which is capable of identifying non-zero variables at group and within-group levels for the univariate proportional hazards model under the case-cohort design. To study a disease of interest, we consider two case-cohort designs: a single case-cohort study and multiple case-cohort studies. In contrast with a single case-cohort study, multiple case-cohort studies have extra information from the other diseases. We propose to use such extra information to improve variable selection accuracy. We show the bi-level variable selection consistency and the asymptotic normality of the proposed estimator while allowing the number of variables to diverge as the sample size increases. The proposed method is applied to the Busselton Health data.

2. Model selection with adaptive group bridge under case-cohort designs

2.1. Estimation for case-cohort designs

We define notations for multiple case-cohort studies for generality. A single case-cohort study can be handled as a special case. To elaborate multiple case-cohort studies, we consider multivariate outcomes one of which is the outcome of interest for the univariate proportional hazards model.

Suppose there are n independent subjects and K diseases in the full cohort. For subject i with disease k, let $T_{ik}$, $C_{ik}$, and $X_{ik} = \min(T_{ik}, C_{ik})$ denote the failure time of interest, censoring time, and observed time, respectively. Let $\Delta_{ik} = I(T_{ik} \le C_{ik})$, $N_{ik}(t) = I(X_{ik} \le t, \Delta_{ik} = 1)$, and $Y_{ik}(t) = I(X_{ik} \ge t)$ be, respectively, a failure indicator, the counting process, and an at-risk indicator, where $I(\cdot)$ is an indicator function. Let $Z_{ik}(t) = \{Z_{ik1}(t), \ldots, Z_{ikd_n}(t)\}^T$ be a $d_n \times 1$ possibly time-dependent covariate vector. Without loss of generality, we assume that the $Z_{ik}(t)$ are centered and standardized. We assume that $T_{ik}$ is independent of $C_{ik}$ given $Z_{ik}(\cdot)$. The study period is assumed to be $[0, \tau]$. Consider the event-specific hazards model: the hazard function for subject i with disease k is assumed to be $h_{ik}\{t \mid Z_{ik}(t)\} = h_{0k}(t)\exp\{\beta_k^T Z_{ik}(t)\}$, where $h_{0k}(t)$ is an unspecified baseline hazard function and $\beta_k = (\beta_1, \ldots, \beta_{d_n})^T$ is an unknown parameter vector of interest.16 Let $\beta_0 = (\beta_{10}, \ldots, \beta_{d_n 0})^T$ be the true parameter vector.

Under multiple case-cohort studies, covariate information is available for (i) a randomly selected subcohort from the full cohort and (ii) all cases from any cause outside the subcohort. The subcohort is shared by all outcomes. More specifically, a fixed number $\tilde n$ of subjects are randomly selected from the full cohort for the shared subcohort. Let $\xi_i$ be an indicator of subject i being selected into the subcohort. Each subject is selected with the same probability $\tilde\alpha = \mathrm{pr}(\xi_i = 1) = \tilde n / n$. The data under multiple case-cohort studies consist of $\{X_{ik}, \Delta_{ik}, \xi_i, Z_{ik}(t), 0 \le t \le X_{ik}\}$ when $\xi_i = 1$ or $\Delta_{ik} = 1$ for $k = 1, \ldots, K$, and $\{X_{ik}, \Delta_{ik}, \xi_i\}$ when $\xi_i = 0$ and $\Delta_{ik} = 0$ for $k = 1, \ldots, K$.

Assume disease k is of interest. For disease k, the negative pseudo-partial likelihood is

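$$\tilde l_k(\beta) = -\sum_{i=1}^{n} \int_0^\tau \Big[ \beta_k^T Z_{ik}(t) - \log \sum_{j=1}^{n} w_{jk}(t) Y_{jk}(t) \exp\{\beta_k^T Z_{jk}(t)\} \Big] \, dN_{ik}(t) \qquad (1)$$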

where wik(t) is the time-varying weight function for case-cohort data. There are several weight functions.1,3,6,17,18 In this paper, we consider two efficient time-varying weight functions. The first weight function proposed by Kalbfleisch and Lawless17 for a single case-cohort study has the following form

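$$w_{ik,1}(t) = \Delta_{ik} + (1 - \Delta_{ik}) \, \xi_i \, \hat\alpha_k^{-1}(t) \qquad (2)$$

where $\hat\alpha_k(t) = \sum_{i=1}^{n} \xi_i Y_{ik}(t)(1 - \Delta_{ik}) \big/ \sum_{i=1}^{n} Y_{ik}(t)(1 - \Delta_{ik})$ is an estimator of the subcohort selection probability. Because of the factor $(1 - \Delta_{ik})$, it is the proportion of sampled subjects among subjects who have not had a disease k event and remain in the risk set at time t.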

The weight function wik,1(t) ignores the extra information from the other diseases under multiple case-cohort studies. To use all collected covariate information for subjects who have the other diseases outside the subcohort, Kim et al.6 proposed the following more efficient weight

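$$w_{ik,2}(t) = \Big\{ 1 - \prod_{k=1}^{K} (1 - \Delta_{ik}) \Big\} + \prod_{k=1}^{K} (1 - \Delta_{ik}) \, \xi_i \, \tilde\alpha_k^{-1}(t) \qquad (3)$$

where $\tilde\alpha_k(t) = \sum_{i=1}^{n} \xi_i Y_{ik}(t) \prod_{k=1}^{K} (1 - \Delta_{ik}) \big/ \sum_{i=1}^{n} Y_{ik}(t) \prod_{k=1}^{K} (1 - \Delta_{ik})$ is the proportion of sampled subjects among subjects who do not have any of the diseases and remain in the risk set at time t. The parameter estimator $\tilde\beta$ can be obtained by minimizing the negative pseudo-partial likelihood (1) using weights (2) or (3).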
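As a rough illustration of how weights (2) and (3) differ, the sketch below evaluates both at a single time point on a toy cohort. The function names, the list-based data layout, and the toy values are assumptions made for this example only; it is not the authors' implementation, and it does not guard against an empty risk set (a zero denominator).

```python
def at_risk(x, t):
    """Y(t) = I(X >= t) for one subject."""
    return 1 if x >= t else 0

def weight_single(i, t, X, delta, xi):
    """Weight (2): cases get 1, sampled non-cases get 1 / alpha-hat_k(t)."""
    num = sum(s * at_risk(x, t) * (1 - d) for x, d, s in zip(X, delta, xi))
    den = sum(at_risk(x, t) * (1 - d) for x, d in zip(X, delta))
    alpha_hat = num / den  # sampled fraction among non-cases still at risk
    return delta[i] + (1 - delta[i]) * xi[i] / alpha_hat

def weight_multi(i, t, X, deltas, xi):
    """Weight (3): subjects with any disease get 1, sampled disease-free get 1 / alpha-tilde_k(t)."""
    free = []
    for row in deltas:  # row holds the K failure indicators of one subject
        prod = 1
        for d in row:
            prod *= (1 - d)
        free.append(prod)  # 1 only if the subject has none of the K diseases
    num = sum(s * at_risk(x, t) * f for x, f, s in zip(X, free, xi))
    den = sum(at_risk(x, t) * f for x, f in zip(X, free))
    alpha_tilde = num / den
    return (1 - free[i]) + free[i] * xi[i] / alpha_tilde

# Toy cohort (hypothetical): observed times for the disease of interest,
# indicators for disease 1 and disease 2, and subcohort membership.
X = [2.0, 3.0, 1.0, 4.0, 5.0, 6.0]
delta1 = [1, 0, 0, 0, 0, 0]
delta2 = [0, 1, 0, 0, 0, 0]
xi = [0, 0, 1, 0, 1, 0]
deltas = list(zip(delta1, delta2))

w2 = [weight_single(i, 0.5, X, delta1, xi) for i in range(6)]
w3 = [weight_multi(i, 0.5, X, deltas, xi) for i in range(6)]
```

In this toy example the second subject (a disease 2 case outside the subcohort) contributes nothing under weight (2) but receives weight 1 under weight (3), which is the extra information that weight (3) is designed to use.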

2.2. Adaptive group bridge

In this section, we propose the adaptive group bridge for bi-level variable selection with group variables. First, we define notation for group variables and their membership. Suppose that there are G groups and that the group memberships $A_1, \ldots, A_G$ are defined as subsets of $\{1, \ldots, d_n\}$. The cardinality of a set A is denoted by $|A|$. Groups are allowed to overlap. We define the $|A| \times 1$ vectors $\beta_A = (\beta_m, m \in A)^T$ and $\beta_{A0} = (\beta_{m0}, m \in A)^T$. For individual membership, we define $B_1$ and $B_2$ such that $\beta_m \ne 0$ if $m \in B_1$ and $\beta_m = 0$ if $m \in B_2$. Without loss of generality, we assume $\beta_{A_g} \ne 0$ for $g = 1, \ldots, G_1$ and $\beta_{A_g} = 0$ for $g = G_1 + 1, \ldots, G$.

For data with group structure, we propose the following penalized pseudo-partial likelihood to obtain the estimator β^.

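$$L_n(\beta) = \tilde l_k(\beta) + \lambda_n \sum_{g=1}^{G} c_g \Big( \sum_{m \in A_g} \frac{|\beta_m|}{|\tilde\beta_m|^{v}} \Big)^{\gamma} \qquad (4)$$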

where $0 < \gamma < 1$, $\lambda_n > 0$, $v > 0$, the $c_g$'s are constants that adjust for the different group sizes $|A_g|$, and $\tilde\beta$ is a consistent estimator of $\beta_0$. The estimator $\tilde\beta$ can be obtained from pseudo-partial likelihood (1) using weights (2) or (3). In practice, $c_g \propto |A_g|^{1-\gamma}$ is widely used.13,19 When $v = 0$, the penalty term in (4) is the group bridge penalty of Huang et al.13,19 Using $\tilde\beta_m$ as the weight is similar to the adaptive lasso of Zou.20 Thus, we call our proposed penalty the adaptive group bridge penalty. In the case with $\gamma = 1/2$ and $c_g = 1$ for all g, the adaptive group bridge penalty is the same as the adaptive hierarchical penalty of Wang and Nan.14

Although Huang et al.13,19 proposed the group bridge for bi-level variable selection, they showed group variable selection consistency only, not bi-level variable selection consistency. Because the group bridge sets $v$ to 0 in (4), it penalizes the $\beta_m$'s for $m \in A_g$ equally within group g. This may lead to inconsistent within-group variable selection. To overcome this limitation of the group bridge, the adaptive group bridge uses the weight $\tilde\beta_m$ in the penalty term, like the adaptive lasso. When $\beta_{m0} = 0$, $\tilde\beta_m$ is close to 0 for sufficiently large n. Thus, $1/|\tilde\beta_m|^v$ assigns larger penalties to zero parameters. On the other hand, $1/|\tilde\beta_m|^v$ converges to a non-zero constant when $\beta_{m0} \ne 0$. Putting the weight $\tilde\beta_m$ into the penalty term therefore enables the adaptive group bridge to identify non-zero variables at both levels more consistently than the group bridge.
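To make the role of the adaptive weight concrete, the following sketch evaluates the penalty term of (4) for a given coefficient vector. The function name, the zero-based index lists, and the choice $c_g = |A_g|^{1-\gamma}$ (proportionality constant set to 1) are assumptions for illustration; setting `v=0.0` gives the unweighted group bridge penalty. The initial estimates are assumed to be non-zero, since a zero entry would make the weight undefined.

```python
def agb_penalty(beta, beta_init, groups, lam, gamma=0.5, v=1.0):
    """lam * sum_g c_g * (sum_{m in A_g} |beta_m| / |beta_init_m|^v)^gamma."""
    total = 0.0
    for members in groups:  # groups may overlap
        c_g = len(members) ** (1.0 - gamma)  # assumed form of c_g
        inner = sum(abs(beta[m]) / abs(beta_init[m]) ** v for m in members)
        total += c_g * inner ** gamma
    return lam * total

# Hypothetical example: the second coefficient is zero, the others are not.
beta = [1.0, 0.0, 2.0]
beta_init = [1.0, 0.5, 2.0]
groups = [[0, 1], [2]]

adaptive = agb_penalty(beta, beta_init, groups, lam=1.0, gamma=0.5, v=1.0)
plain = agb_penalty(beta, beta_init, groups, lam=1.0, gamma=0.5, v=0.0)
```

With `v > 0`, a coefficient whose initial estimate is small contributes a large term $|\beta_m| / |\tilde\beta_m|^v$ as soon as it moves away from zero, which is the mechanism the paragraph above describes.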

2.3. Asymptotic properties

In this section, we study the asymptotic properties of the adaptive group bridge estimator $\hat\beta$ using weight (3). The asymptotic properties of the estimator using weight (2) can be shown similarly, and thus their proofs are omitted in this article. We define $a^{\otimes 0} = 1$, $a^{\otimes 1} = a$, $a^{\otimes 2} = a a^T$, and the following notation:

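$$S_k^{(d)}(\beta, t) = \frac{1}{n} \sum_{i=1}^{n} Y_{ik}(t) Z_{ik}(t)^{\otimes d} e^{\beta^T Z_{ik}(t)}, \quad d = 0, 1, 2$$
$$\tilde S_k^{(d)}(\beta, t) = \frac{1}{n} \sum_{i=1}^{n} w_{ik,2}(t) Y_{ik}(t) Z_{ik}(t)^{\otimes d} e^{\beta^T Z_{ik}(t)}, \quad d = 0, 1, 2$$
$$s_k^{(d)}(\beta, t) = E\{S_k^{(d)}(\beta, t)\}, \quad d = 0, 1, 2, \qquad e_k(\beta, t) = s_k^{(1)}(\beta, t) / s_k^{(0)}(\beta, t)$$
$$v_k(\beta, t) = \frac{s_k^{(2)}(\beta, t)\, s_k^{(0)}(\beta, t) - s_k^{(1)}(\beta, t)^{\otimes 2}}{s_k^{(0)}(\beta, t)^2}$$
$$V_k(\beta, t) = \frac{\tilde S_k^{(2)}(\beta, t)\, \tilde S_k^{(0)}(\beta, t) - \tilde S_k^{(1)}(\beta, t)^{\otimes 2}}{\tilde S_k^{(0)}(\beta, t)^2}$$
$$\Omega(\beta) = \int_0^\tau v_k(\beta, t)\, s_k^{(0)}(\beta, t)\, h_{0k}(t) \, dt$$
$$\Gamma(\beta) = \frac{1}{n} \mathrm{Var}\{\partial \tilde l_k(\beta) / \partial \beta\}$$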

We make the following assumptions:

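  1. For all k, $\int_0^\tau h_{0k}(t)\,dt < \infty$ and $P\{Y_{ik}(t) = 1\} > 0$ for $t \in [0, \tau]$, $i = 1, \ldots, n$.

  2. $|Z_{ijk}(0)| + \int_0^\tau |dZ_{ijk}(t)| < D_z < \infty$, $i = 1, \ldots, n$, $j = 1, \ldots, d_n$, almost surely, where $D_z$ is a constant.

  3. For d = 0, 1, 2, there exists a neighborhood $B$ of $\beta_0$ such that the $s_k^{(d)}(\beta, t)$ are continuous functions and $\sup_{t \in [0, \tau], \beta \in B} \| S_k^{(d)}(\beta, t) - s_k^{(d)}(\beta, t) \|_2 \to_p 0$, where $\|a\|_p$ denotes the $L_p$ norm of a.

  4. For all $\beta \in B$, $t \in [0, \tau]$, and $k = 1, \ldots, K$, $S_k^{(1)}(\beta, t) = \partial S_k^{(0)}(\beta, t) / \partial \beta$ and $S_k^{(2)}(\beta, t) = \partial^2 S_k^{(0)}(\beta, t) / \partial \beta \partial \beta^T$, where the $S_k^{(d)}(\beta, t)$ for d = 0, 1, 2 are continuous functions of $\beta \in B$ uniformly in $t \in [0, \tau]$ and are bounded on $B \times [0, \tau]$; $s_k^{(0)}$ is bounded away from zero on $B \times [0, \tau]$.

  5. There exist constants $C_1$ and $C_2$ such that
    $0 < C_1 < \mathrm{eigen}_{\min}\{\Omega(\beta_0)\} \le \mathrm{eigen}_{\max}\{\Omega(\beta_0)\} < C_2 < \infty$

    where $\mathrm{eigen}_{\min}\{A\}$ and $\mathrm{eigen}_{\max}\{A\}$ are the minimal and maximal eigenvalues of a matrix A, respectively.

  6. There exist constants $C_3$ and $C_4$ such that
    $0 < C_3 < \mathrm{eigen}_{\min}\{\Omega(\beta_0)\} \le \mathrm{eigen}_{\max}\{\Omega(\beta_0)\} < C_4 < \infty$
  7. $\lim_{n \to \infty} \tilde\alpha = \alpha$, where $\tilde\alpha = \tilde n / n$ and $\alpha$ is a positive constant.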

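  8. For some $v_1$ and $v_2$ such that $0 < v_1 < 1$, $0 < v_2$, and $v_2 / (1 - v_1) < v$, $\min_{j \in B_1} |\beta_{j0}(\tau)| = O_p\{(d_n / n)^{v_1 / 2}\}$, $\max_g |A_g \cap B_1| = O\{(n / d_n)^{v_2 / 2}\}$, we assume $\sum_{g=1}^{G_1} c_g \big\{ \big( \sum_{j \in A_g \cap B_1} |\beta_{j0}|^{1 - v} \big)^{\gamma - 1} \sum_{j \in A_g \cap B_1} 1 / |\beta_{j0}|^{v} \big\} \le M_n$ and $\sum_{g=1}^{G_1} c_g \big\{ \big( \sum_{j \in A_g \cap B_1} |\beta_{j0}|^{1 - v} \big)^{\gamma - 2} \sum_{j \in A_g \cap B_1} 1 / |\beta_{j0}|^{2v} \big\} \le M_n$, where $M_n = O_p(1)$.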

  9. λn/n0,n/dnβ˜j=Op(1),andmin(λnn(v1)/2dn(1+v)/2,λnnγ(v1)/2dn1+γ(1v)/2).

Conditions 1–6 are standard conditions for the proportional hazards model. They are similar to Conditions A1–A3 of Cai et al.10 and Conditions 1–4 of Ni et al.7 They guarantee the local asymptotic quadratic property of $\tilde l_k(\beta)$ and the existence of a local minimizer of $L_n(\beta)$. Condition 7 is a boundedness condition on the subcohort selection probability. Conditions 8–9 control $\lambda_n$, the number of variables within each group, and the magnitude of the true non-zero parameters within non-zero groups as $n \to \infty$. Conditions similar to Conditions 8–9 were used in Ahn et al.15

We have the following theorem.

Theorem 1. Under Conditions 1–9, we have

  1. Consistency: if $d_n^4 / n \to 0$, $\| \hat\beta - \beta_0 \|_2 = O_p(\sqrt{d_n / n})$.

  2. Bi-level variable selection consistency: $P(\hat\beta_{B_2} = 0) \to 1$.

  3. Asymptotic distribution: if $d_n^5 / n \to 0$, we have

n1/2uTΩ111/2Γ11(β^B1βB10)dN(0,1)

where u is a |B1|×1 constant vector with ||u|| = 1, and Ω11 and Γ11 are the leading |B1|×|B1| submatrices of Ω(β0) and Γ(β0), respectively.

The Supplemental Materials include the proof of Theorem 1 and the estimators of Ω(β0) and Γ(β0). Theorem 1 establishes the oracle property of the adaptive group bridge estimator. In particular, it shows that the adaptive group bridge consistently selects not only non-zero group variables, but also non-zero within-group variables. We have the following corollary:

Corollary 1. Let $\tilde\beta$ be the estimator based on the weighted pseudo-partial likelihood (1) using weights (2) or (3). Under Conditions 1–7 and $d_n^4 / n \to 0$, $\| \tilde\beta - \beta_0 \|_2 = O_p(\sqrt{d_n / n})$.

2.4. Computation

Since it is difficult to minimize $L_n(\beta)$ directly with respect to $\beta$, we reformulate the minimization of $L_n(\beta)$ as the minimization of

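$$Q(\beta, \theta) = \tilde l_k(\beta) + \sum_{g=1}^{G} \theta_g^{1 - 1/\gamma} c_g^{1/\gamma} \sum_{m \in A_g} \frac{|\beta_m|}{|\tilde\beta_m|^{v}} + \zeta_n \sum_{g=1}^{G} \theta_g \qquad (5)$$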

where $\theta = (\theta_1, \ldots, \theta_G)^T$ and $\lambda_n = \zeta_n^{1 - \gamma} \gamma^{-\gamma} (1 - \gamma)^{\gamma - 1}$. We have the following proposition:

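Proposition 1. Assume that $\lambda_n = \zeta_n^{1 - \gamma} \gamma^{-\gamma} (1 - \gamma)^{\gamma - 1}$. Then, $\hat\beta$ minimizes $L_n(\beta)$ if and only if $(\hat\beta, \hat\theta)$ minimizes $Q(\beta, \theta)$, where $\theta_g > 0$ and $\hat\theta_g > 0$ for $g = 1, \ldots, G$.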

We can show Proposition 1 similarly to Huang et al.19 Thus, its proof is omitted.
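The link between $\zeta_n$ and $\lambda_n$ in Proposition 1 can be checked numerically for a single group. Writing $s = \sum_{m \in A_g} |\beta_m| / |\tilde\beta_m|^v$, setting the derivative of $\theta_g^{1-1/\gamma} c_g^{1/\gamma} s + \zeta_n \theta_g$ with respect to $\theta_g$ to zero gives $\theta_g = c_g \{(1-\gamma)/(\zeta_n \gamma)\}^{\gamma} s^{\gamma}$, and plugging this back in yields $\lambda_n c_g s^{\gamma}$. The sketch below verifies this identity for arbitrary illustrative values; the function names and the numbers are assumptions, not part of the original derivation.

```python
def lam_from_zeta(zeta, gamma):
    """lambda_n = zeta_n^(1-gamma) * gamma^(-gamma) * (1-gamma)^(gamma-1)."""
    return zeta ** (1 - gamma) * gamma ** (-gamma) * (1 - gamma) ** (gamma - 1)

def theta_star(s, c, zeta, gamma):
    """Minimizer over theta > 0 of the theta-dependent part of Q for one group."""
    return c * ((1 - gamma) / (zeta * gamma)) ** gamma * s ** gamma

def q_part(theta, s, c, zeta, gamma):
    """theta^(1 - 1/gamma) * c^(1/gamma) * s + zeta * theta."""
    return theta ** (1 - 1 / gamma) * c ** (1 / gamma) * s + zeta * theta

def bridge_part(s, c, zeta, gamma):
    """lambda_n * c * s^gamma, the matching term of L_n."""
    return lam_from_zeta(zeta, gamma) * c * s ** gamma

# Arbitrary illustrative values.
s, c, zeta, gamma = 1.7, 1.3, 0.8, 0.5
th = theta_star(s, c, zeta, gamma)
profiled = q_part(th, s, c, zeta, gamma)
target = bridge_part(s, c, zeta, gamma)
```

Because $1 - 1/\gamma < 0$ for $0 < \gamma < 1$, the function of $\theta_g$ being minimized is convex on $\theta_g > 0$, so the stationary point is the unique minimizer.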

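To approximate $\tilde l_k(\beta)$, we define $\nabla \tilde l(\beta) = \partial \tilde l_k(\beta) / \partial \beta$ and $\nabla^2 \tilde l(\beta) = \partial^2 \tilde l_k(\beta) / (\partial \beta \partial \beta^T) = X^T X$, where $X$ can be obtained by the Cholesky decomposition, and $Y = (X^T)^{-1} \{ \nabla^2 \tilde l(\beta) \beta - \nabla \tilde l(\beta) \}$. To minimize $Q(\beta, \theta)$, we minimize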

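$$\tilde Q(\beta, \theta) = \frac{1}{2} (Y - X\beta)^T (Y - X\beta) + \sum_{g=1}^{G} \theta_g^{1 - 1/\gamma} c_g^{1/\gamma} \sum_{m \in A_g} \frac{|\beta_m|}{|\tilde\beta_m|^{v}} + \zeta_n \sum_{g=1}^{G} \theta_g \qquad (6)$$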

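Let $\beta^{(p)}$ be the estimator at the pth step of the optimization. Then, the algorithm is as follows:

Step 1 Obtain $\tilde\beta$ by minimizing the negative pseudo-partial likelihood (1) and use it as the initial value $\beta^{(0)}$.

Step 2 Compute $X$ and $Y$ using $\beta^{(p)}$ at the pth step.

Step 3 Compute

$$\theta_g^{(p)} = c_g \Big( \frac{1 - \gamma}{\zeta_n \gamma} \Big)^{\gamma} \Big( \sum_{m \in A_g} \frac{|\beta_m^{(p)}|}{|\tilde\beta_m|^{v}} \Big)^{\gamma}, \quad g = 1, \ldots, G$$

Step 4 After plugging $\theta^{(p)} = (\theta_1^{(p)}, \ldots, \theta_G^{(p)})^T$ from Step 3 into $\tilde Q(\beta, \theta^{(p)})$, minimize $\tilde Q(\beta, \theta^{(p)})$ to obtain $\beta^{(p+1)}$.

Step 5 Repeat Steps 2–4 until $\| \beta^{(p+1)} - \beta^{(p)} \|_1 < 10^{-4}$.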
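The iteration in Steps 3–5 can be sketched as follows. For a fixed $\theta$, minimizing (6) is a weighted lasso problem with per-coefficient penalty $\kappa_m = \sum_{g: m \in A_g} \theta_g^{1-1/\gamma} c_g^{1/\gamma} / |\tilde\beta_m|^v$, solved here by coordinate descent with soft-thresholding. This is a simplified illustration under stated assumptions: `X` and `y` are held fixed, so it behaves like a least-squares problem rather than recomputing the Cholesky factor of the pseudo-partial likelihood Hessian at every pass (Step 2); $c_g = |A_g|^{1-\gamma}$ is assumed; and all function names are hypothetical.

```python
import math

def soft_threshold(z, k):
    """sign(z) * max(|z| - k, 0)."""
    return math.copysign(max(abs(z) - k, 0.0), z)

def weighted_lasso(X, y, kappa, beta, sweeps=100):
    """Coordinate descent for 0.5 * ||y - X b||^2 + sum_m kappa[m] * |b[m]|."""
    n, d = len(y), len(beta)
    beta = list(beta)
    resid = [y[i] - sum(X[i][m] * beta[m] for m in range(d)) for i in range(n)]
    for _ in range(sweeps):
        for m in range(d):
            col_sq = sum(X[i][m] ** 2 for i in range(n))
            if math.isinf(kappa[m]) or col_sq == 0.0:
                new = 0.0  # infinite penalty keeps the coefficient at zero
            else:
                rho = sum(X[i][m] * (resid[i] + X[i][m] * beta[m]) for i in range(n))
                new = soft_threshold(rho, kappa[m]) / col_sq
            if new != beta[m]:
                for i in range(n):
                    resid[i] += X[i][m] * (beta[m] - new)
                beta[m] = new
    return beta

def agb_fit(X, y, beta_init, groups, zeta, gamma=0.5, v=1.0, tol=1e-4, max_iter=100):
    """Alternate the theta update (Step 3) and the weighted lasso (Step 4)."""
    d = len(beta_init)
    c = [len(A) ** (1.0 - gamma) for A in groups]  # assumed form of c_g
    beta = list(beta_init)
    for _ in range(max_iter):
        theta = [c[g] * ((1 - gamma) / (zeta * gamma)) ** gamma
                 * sum(abs(beta[m]) / abs(beta_init[m]) ** v for m in A) ** gamma
                 for g, A in enumerate(groups)]
        kappa = [0.0] * d
        for g, A in enumerate(groups):
            for m in A:
                if theta[g] == 0.0:
                    kappa[m] = math.inf  # a group that reached zero stays at zero
                else:
                    kappa[m] += (theta[g] ** (1 - 1 / gamma) * c[g] ** (1 / gamma)
                                 / abs(beta_init[m]) ** v)
        new = weighted_lasso(X, y, kappa, beta)
        if sum(abs(a - b) for a, b in zip(new, beta)) < tol:  # Step 5, L1 norm
            return new
        beta = new
    return beta

# Hypothetical orthogonal design: the second group has only tiny initial estimates.
X = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
y = [2.0, 1.5, 0.05, -0.03]
fit = agb_fit(X, y, beta_init=y, groups=[[0, 1], [2, 3]], zeta=0.1)
```

In this toy run the coefficients with small initial estimates receive large penalties and are set exactly to zero, while the two large coefficients are only mildly shrunk.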

To choose a tuning parameter, we can use the generalized cross validation by following Huang et al.13:

$$\frac{\tilde l_k(\hat\beta)}{n \{ 1 - \hat d(\lambda_n) / n \}^2}$$

where d^(λn) is the number of non-zero coefficients given λn. To estimate the covariance matrix of β^0, we can use a quadratic approximation as in Fan and Li21 and Huang et al.13 It can be estimated as follows:

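$$\{ \nabla^2 \tilde l(\hat\beta) + \Upsilon(\hat\beta, \hat\theta) \}^{-1} \, \widehat{\mathrm{Cov}}\{ \nabla \tilde l(\hat\beta) \} \, \{ \nabla^2 \tilde l(\hat\beta) + \Upsilon(\hat\beta, \hat\theta) \}^{-1}$$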

where

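$$\Upsilon(\hat\beta, \hat\theta) = \mathrm{diag}\Big\{ \sum_{g: A_g \ni m} \hat\theta_g^{1 - 1/\gamma} c_g^{1/\gamma} \frac{I(\hat\beta_m \ne 0)}{|\hat\beta_m| \, |\tilde\beta_m|^{v}}, \; m = 1, \ldots, d_n \Big\}$$
$$\hat\theta_g = c_g \Big( \frac{1 - \gamma}{\zeta_n \gamma} \Big)^{\gamma} \Big( \sum_{m \in A_g} \frac{|\hat\beta_m|}{|\tilde\beta_m|^{v}} \Big)^{\gamma}$$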
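The generalized cross-validation criterion described in this section can be computed directly from the fitted negative pseudo-partial likelihood and the number of non-zero coefficients. The sketch below assumes that the tuning parameter with the smallest criterion value is chosen, which is the usual convention; the function names and the candidate values are hypothetical.

```python
def gcv(neg_pseudo_loglik, n, n_nonzero):
    """l_k(beta_hat) / [n * {1 - d(lambda) / n}^2]."""
    if not 0 <= n_nonzero < n:
        raise ValueError("number of non-zero coefficients must be in [0, n)")
    return neg_pseudo_loglik / (n * (1.0 - n_nonzero / n) ** 2)

def select_lambda(candidates, n):
    """candidates: list of (lambda, negative pseudo-partial likelihood, non-zero count)."""
    return min(candidates, key=lambda c: gcv(c[1], n, c[2]))[0]

# Hypothetical fits at three tuning parameter values.
candidates = [(0.1, 100.0, 10), (0.5, 102.0, 4), (1.0, 120.0, 1)]
best = select_lambda(candidates, n=50)
```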

3. Simulation

We conducted simulations to evaluate the performance of the adaptive group bridge and compare it with the group bridge from Huang et al.13 and SCAD of Ni et al.7 under two case-cohort studies. The failure time for disease 1, Ti1, was generated based on the proportional hazards model. To generate failure time for disease 2, Ti2, we used the Clayton–Cuzick model22

$$F(t_1, t_2 \mid Z_{i1}, Z_{i2}) = \{ S_1(t_1; Z_{i1})^{-1/\eta} + S_2(t_2; Z_{i2})^{-1/\eta} - 1 \}^{-\eta}$$

where $Z_{i1} = Z_{i2} = Z_i$, $S_k(t; Z_i) = \Pr(T_k > t \mid Z_{ik}) = \exp\{ -\int_0^t h_{0k}(u) e^{\beta_k^T Z_{ik}} \, du \}$ is the survival function, $h_{0k}(t)$ and $\beta_k$ (k = 1, 2) are the baseline hazard function and the covariate effect for disease k, respectively, and $\eta$ is the association parameter between the failure times of the two diseases. The relationship between Kendall's tau $\tau_\eta$ and $\eta$ is $\tau_\eta = 1/(2\eta + 1)$. A larger Kendall's tau represents a higher correlation between $T_1$ and $T_2$. A value of 4 was used for $\eta$. The corresponding Kendall's tau value was approximately 0.11. Disease 1, that is, k = 1, was of interest.
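One way to generate a pair of failure times with this joint survival function is conditional inversion of the Clayton copula with parameter $1/\eta$, followed by inversion of each marginal survival function. The sketch below assumes constant baseline hazards and time-fixed covariates, so that $T_k = -\log(U_k) / \{h_{0k} \exp(\beta_k^T Z)\}$; the function names, the seed, and the sample size are illustrative assumptions rather than the authors' simulation code.

```python
import math
import random

def kendall_tau_from_eta(eta):
    """tau = 1 / (2 * eta + 1)."""
    return 1.0 / (2.0 * eta + 1.0)

def clayton_pair(eta, rng):
    """Draw (U1, U2) from the Clayton copula with parameter theta = 1 / eta."""
    theta = 1.0 / eta
    u = 1.0 - rng.random()  # uniform on (0, 1]
    w = 1.0 - rng.random()
    v = (u ** (-theta) * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
    return u, v

def failure_times(eta, h01, h02, lin1, lin2, rng):
    """Invert S_k(t) = exp(-h0k * exp(lin_k) * t) at the copula draws."""
    u1, u2 = clayton_pair(eta, rng)
    t1 = -math.log(u1) / (h01 * math.exp(lin1))
    t2 = -math.log(u2) / (h02 * math.exp(lin2))
    return t1, t2

def empirical_kendall_tau(pairs):
    """(concordant - discordant) / (concordant + discordant) over all pairs."""
    conc = disc = 0
    for a in range(len(pairs)):
        for b in range(a + 1, len(pairs)):
            s = (pairs[a][0] - pairs[b][0]) * (pairs[a][1] - pairs[b][1])
            if s > 0:
                conc += 1
            elif s < 0:
                disc += 1
    return (conc - disc) / (conc + disc)

rng = random.Random(0)
pairs = [failure_times(4.0, 2.0, 8.0, 0.0, 0.0, rng) for _ in range(1500)]
tau_theory = kendall_tau_from_eta(4.0)  # equals 1/9
tau_sample = empirical_kendall_tau(pairs)
```

Since each failure time is a decreasing function of its uniform draw, the rank correlation of the failure times matches that of the copula draws, so the sample value should be close to $1/(2\eta + 1)$.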

We considered two censoring rates: 90% and 80%. The censoring times were generated independently from the uniform distribution. We assumed constant baseline hazards: $h_{01}(t) = 2$ and $h_{02}(t) = 8$. We examined two event rates: 10% and 20% for k = 1. The corresponding event rates for k = 2 were 20% and 40%, respectively. The control-to-case ratio was set to 1:2. Two sample sizes of the full cohort were considered: n = 750 and n = 1500. Let $n_{case}$ be the expected number of cases given the censoring rate. For each setting, 500 replications were conducted.

To compare the performance of the three variable selection methods, we calculated the group correction rate and the individual correction rate representing the proportion that each variable selection method correctly identified the true non-zero groups and non-zero individual variables of the underlying model, respectively. We also calculated group size (GS) and model size (MS) defined as the average number of non-zero groups and non-zero individual variables selected by each method. The ratio of mean squared errors was defined as the ratio of the mean squared error for each variable selection relative to the mean squared error of the oracle estimator

$$\frac{\sum_{i=1}^{500} \| \hat\beta^{i} - \beta_0 \|_2^2}{\sum_{i=1}^{500} \| \hat\beta^{i}_{Oracle} - \beta_0 \|_2^2}$$

where $\hat\beta^{i}_{Oracle}$ is the oracle estimator of $\beta_0$ at the ith replication. The oracle estimator was obtained from the pseudo-partial likelihood (1) with either weight (2) or weight (3), assuming the true non-zero variables were known. Therefore, a ratio of mean squared errors closer to 1 indicates better estimation of $\beta_0$.
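The ratio of mean squared errors can be computed from the replicated estimates as in the sketch below. The function names and the two-replication toy input are assumptions for illustration only.

```python
def squared_error(est, truth):
    """Squared L2 distance between an estimate and the true vector."""
    return sum((e - t) ** 2 for e, t in zip(est, truth))

def mse_ratio(estimates, oracle_estimates, beta0):
    """Sum of squared errors of a method divided by that of the oracle estimator."""
    num = sum(squared_error(b, beta0) for b in estimates)
    den = sum(squared_error(b, beta0) for b in oracle_estimates)
    return num / den

# Hypothetical results from two replications with beta0 = (1, 0).
beta0 = [1.0, 0.0]
estimates = [[1.2, 0.0], [0.8, 0.1]]
oracle_estimates = [[1.1, 0.0], [0.9, 0.0]]
ratio = mse_ratio(estimates, oracle_estimates, beta0)
```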

In the first simulation, we examined group variables consisting of continuous variables. When the event rate was 10% with population size 750, there were five group variables with GSs $(|A_1|, |A_2|, |A_3|, |A_4|, |A_5|)^T = (2, 3, 4, 5, 3)^T$. Groups 2 and 3 overlapped as follows

$$(\beta_1, \ldots, \beta_{15}) = (\underbrace{1.1, 0.9}_{A_1}, \underbrace{1.2, 0, 0}_{A_2}, 0, 0, \underbrace{1.1, 1, 0, 0.9, 0}_{A_4}, \underbrace{0, 0, 0}_{A_5}), \qquad A_3 = \{4, 5, 6, 7\}$$

For the event rate 20% with population size 750 or the event rate 10% with population size 1500, one more zero group with size 2, (β16, β17)T = (0,0)T, was added. For the event rate 20% with population size 1500, an additional zero group with size 2, (β18, β19)T = (0,0)T, was added. By doing so, we allowed dn to increase as ncase increased. All variables within each group were generated from the multivariate normal distribution with mean 0, variance 1, and correlation 0.5. All groups other than Groups 2 and 3 were assumed to be independent. Thus, the true GS and MS were 3 and 6, respectively. For each scenario, the efficient case-cohort weight (3) and the traditional case-cohort weight (2) were examined.
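The group structure of the first simulation, and the way a fitted coefficient vector would be scored against it, can be written down as in the sketch below. The zero-based index lists are read off the displayed coefficient vector, with Groups 2 and 3 assumed to share the fourth and fifth coefficients; the coefficient magnitudes are taken as displayed. The function names and the example fit are hypothetical.

```python
# True coefficients for the setting with 15 variables (magnitudes as displayed).
beta0 = [1.1, 0.9, 1.2, 0.0, 0.0, 0.0, 0.0, 1.1, 1.0, 0.0, 0.9, 0.0, 0.0, 0.0, 0.0]

# Zero-based index sets; A2 and A3 overlap (assumed to share positions 3 and 4).
groups = {
    "A1": [0, 1],
    "A2": [2, 3, 4],
    "A3": [3, 4, 5, 6],
    "A4": [7, 8, 9, 10, 11],
    "A5": [12, 13, 14],
}

def selected_groups(beta, groups):
    """Names of groups containing at least one non-zero coefficient."""
    return {name for name, idx in groups.items() if any(beta[m] != 0 for m in idx)}

def selected_variables(beta):
    """Indices of non-zero coefficients."""
    return {m for m, b in enumerate(beta) if b != 0}

def selection_correct(beta_hat, beta0, groups):
    """(group-level correct, individual-level correct) for one fitted vector."""
    group_ok = selected_groups(beta_hat, groups) == selected_groups(beta0, groups)
    indiv_ok = selected_variables(beta_hat) == selected_variables(beta0)
    return group_ok, indiv_ok

true_gs = len(selected_groups(beta0, groups))
true_ms = len(selected_variables(beta0))

# A hypothetical fit that finds the right groups but keeps one extra variable in A4.
beta_hat = [1.0, 0.8, 1.1, 0.0, 0.0, 0.0, 0.0, 1.0, 0.9, 0.2, 0.8, 0.0, 0.0, 0.0, 0.0]
result = selection_correct(beta_hat, beta0, groups)
```

Averaging the two indicators over replications gives the group and individual correction rates, and averaging the sizes of the selected sets gives GS and MS.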

Table 1 summarizes the results. It reports the group correction rate (GRC %), individual correction rate (IDC %), GS, MS, and ratio of mean squared errors (MSER). For all variable selection methods, higher event rates for disease 1 and larger sample sizes produced higher group correction and individual correction rates, GSs and MSs closer to 3 and 6, respectively, and an MSER closer to 1. The results show that (i) the group correction and individual correction rates for the adaptive group bridge are generally higher than those of the group bridge and SCAD; and (ii) the GSs and the MSs for the adaptive group bridge are generally closer to 3 and 6, respectively, than those of the other two methods. Moreover, all three variable selection methods using the efficient case-cohort weight (3) have better group correction and individual correction rates than those using the traditional case-cohort weight (2), in particular when the event rate for disease 2 is higher and the correlation between the two failure times is smaller. We also examined different event rates, different pairwise correlations between variables, and a large number of coefficients with a high pairwise correlation between variables. In all scenarios, the results were similar to those in Table 1. The detailed settings and results of some additional simulation studies are provided in the Supplemental Materials.

Table 1.

Simulation results for group variables consisting of continuous variables.

n     Case-cohort weight  P(Δ1, Δ2)   dn  Method  GRC%  IDC%  GS    MS    MSER

750   Traditional         (0.1, –)    15  AGB     27.8  14.2  4.09  9.02  2.22
750   Traditional         (0.1, –)    15  GB      33.2   5.2  3.96  9.53  2.02
750   Traditional         (0.1, –)    15  SCAD     8.8   8.2  4.27  8.67  2.43
750   Traditional         (0.2, –)    17  AGB     73.4  59.6  3.32  6.73  1.46
750   Traditional         (0.2, –)    17  GB      73.2  21.8  3.32  7.61  1.72
750   Traditional         (0.2, –)    17  SCAD    38.8  38.2  3.68  7.05  1.69
750   Efficient           (0.1, 0.2)  15  AGB     54.2  32.8  3.58  7.48  1.64
750   Efficient           (0.1, 0.2)  15  GB      52.8  12.2  3.56  8.36  1.70
750   Efficient           (0.1, 0.2)  15  SCAD    17.6  16.6  4.06  8.01  2.07
750   Efficient           (0.2, 0.4)  17  AGB     83.0  71.2  3.20  6.48  1.36
750   Efficient           (0.2, 0.4)  17  GB      79.4  23.4  3.23  7.35  1.72
750   Efficient           (0.2, 0.4)  17  SCAD    47.4  46.4  3.56  6.79  1.50
1500  Traditional         (0.1, –)    17  AGB     34.4  20.0  4.12  8.68  2.02
1500  Traditional         (0.1, –)    17  GB      31.2   4.4  4.18  9.85  2.00
1500  Traditional         (0.1, –)    17  SCAD    12.0  11.8  4.37  8.74  2.18
1500  Traditional         (0.2, –)    19  AGB     83.8  76.0  3.22  6.45  1.30
1500  Traditional         (0.2, –)    19  GB      73.0  18.8  3.33  7.56  1.70
1500  Traditional         (0.2, –)    19  SCAD    66.0  65.6  3.39  6.57  1.39
1500  Efficient           (0.1, 0.2)  17  AGB     56.6  41.2  3.69  7.67  1.75
1500  Efficient           (0.1, 0.2)  17  GB      51.4  10.8  3.69  8.54  1.84
1500  Efficient           (0.1, 0.2)  17  SCAD    34.8  33.8  3.89  7.63  1.82
1500  Efficient           (0.2, 0.4)  19  AGB     92.8  85.2  3.10  6.22  1.24
1500  Efficient           (0.2, 0.4)  19  GB      82.8  23.4  3.19  7.28  1.74
1500  Efficient           (0.2, 0.4)  19  SCAD    75.4  74.8  3.26  6.34  1.23

GRC: group correction; IDC: individual correction; GS: group size; MS: model size; MSER: mean square error ratio; AGB: adaptive group bridge; GB: group bridge; SCAD: smoothly clipped absolute deviation.

In the second simulation, we considered group structure with continuous and categorical variables. For a 10% event rate for disease 1, there were five groups: two groups consisting of categorical variables ($A_1$, $A_2$) and three groups consisting of continuous variables ($A_3$, $A_4$, $A_5$). The GSs were $(|A_1|, |A_2|, |A_3|, |A_4|, |A_5|) = (2, 3, 3, 4, 5)$, and the overlapping groups were $A_3$ and $A_4$, as follows

$$(\beta_1, \ldots, \beta_{15}) = (\underbrace{1.1, 0.9}_{A_1}, \underbrace{0, 0, 0}_{A_2}, \underbrace{1.2, 0, 0}_{A_3}, 0, 0, \underbrace{1.1, 1, 0, 0.9, 0}_{A_5}), \qquad A_4 = \{7, 8, 9, 10\}$$

The categorical variables in the groups of size 2 and 3 were generated from variables with three and four categories, respectively. The reference groups were set to 0. The continuous variables were generated from the multivariate normal distribution with mean 0, within-group correlation 0.5, and between-group correlation 0. When the event rate for disease 1 was 20% with population size 750, or the event rate for disease 1 was 10% with population size 1500, we added one more group consisting of categorical variables with size 2, (β16, β17)T = (0,0)T. When the event rate for disease 1 was 20% with population size 1500, one more group consisting of categorical variables with size 2, (β18, β19)T = (0,0)T, was added. Table 2 summarizes the results. For the group/individual correction rates, GS, and MS, the adaptive group bridge generally outperformed the group bridge and SCAD; the efficient case-cohort weight (3) identified the true non-zero variables more accurately than the traditional case-cohort weight (2) for all three methods. The MSER of the adaptive group bridge is no larger than that of the group bridge in all settings and is comparable with that of SCAD.

Table 2.

Simulation results for group variables consisting of continuous and categorical variables.

n     Case-cohort weight  P(Δ1, Δ2)   dn  Method  GRC%  IDC%  GS    MS     MSER

750   Traditional         (0.1, –)    15  AGB     25.2   9.8  4.08   8.88  2.06
750   Traditional         (0.1, –)    15  GB      26.8   2.0  3.95   9.46  2.06
750   Traditional         (0.1, –)    15  SCAD    10.6   8.6  4.31   9.00  2.20
750   Traditional         (0.2, –)    17  AGB     67.2  51.2  3.40   6.87  1.40
750   Traditional         (0.2, –)    17  GB      68.2  14.2  3.38   7.82  1.71
750   Traditional         (0.2, –)    17  SCAD    33.6  33.4  3.73   7.10  1.47
750   Efficient           (0.1, 0.2)  15  AGB     46.0  24.0  3.69   7.75  1.65
750   Efficient           (0.1, 0.2)  15  GB      48.8   6.0  3.61   8.52  1.76
750   Efficient           (0.1, 0.2)  15  SCAD    17.0  15.6  4.07   8.25  1.89
750   Efficient           (0.2, 0.4)  17  AGB     78.2  64.4  3.26   6.55  1.31
750   Efficient           (0.2, 0.4)  17  GB      77.4  15.8  3.25   7.53  1.67
750   Efficient           (0.2, 0.4)  17  SCAD    43.4  42.8  3.61   6.86  1.36
1500  Traditional         (0.1, –)    17  AGB     25.6  12.2  4.40   9.34  1.87
1500  Traditional         (0.1, –)    17  GB      21.2   1.4  4.36  10.18  1.91
1500  Traditional         (0.1, –)    17  SCAD    15.0  14.6  4.45   8.90  1.88
1500  Traditional         (0.2, –)    19  AGB     78.4  65.4  3.27   6.57  1.37
1500  Traditional         (0.2, –)    19  GB      72.2  15.4  3.36   7.68  1.78
1500  Traditional         (0.2, –)    19  SCAD    61.8  61.4  3.43   6.61  1.34
1500  Efficient           (0.1, 0.2)  17  AGB     46.4  28.4  3.79   7.84  1.53
1500  Efficient           (0.1, 0.2)  17  GB      46.4   5.2  3.79   8.71  1.71
1500  Efficient           (0.1, 0.2)  17  SCAD    30.0  29.4  3.98   7.76  1.53
1500  Efficient           (0.2, 0.4)  19  AGB     84.0  74.4  3.19   6.38  1.34
1500  Efficient           (0.2, 0.4)  19  GB      82.8  20.8  3.21   7.41  1.91
1500  Efficient           (0.2, 0.4)  19  SCAD    68.6  68.4  3.33   6.42  1.24

GRC: group correction; IDC: individual correction; GS: group size; MS: model size; MSER: mean square error ratio; AGB: adaptive group bridge; GB: group bridge; SCAD: smoothly clipped absolute deviation.

We also examined the performance of variable selection when there were continuous variables (the $Z_{ij}$'s) and their squares (the $Z_{ij}^2$'s). The detailed settings and simulation results are presented in the Supplemental Materials. The results are similar to those in Table 1: the adaptive group bridge has better bi-level selection accuracy than the group bridge and SCAD, and its mean GSs and MSs are closer to their true values than those of the other two methods. This suggests that when some of the variables are highly collinear, bi-level selection may be beneficial even if individual variable selection is of interest.

4. Data analysis

We applied the proposed method to analyze the data from the Busselton Health Study.8,9 The Busselton Health Study was conducted in the south-west of Western Australia, and questionnaires were used every three years from 1966 to 1981 to collect general health information from adult participants. The main aims of this study were to evaluate the association between serum ferritin and stroke and to identify risk factors related to stroke. The population consisted of 1612 men and women aged 40–89 who participated in 1981 and were free of coronary heart disease or stroke at that time. The outcome of interest was the time to stroke event, defined as hospital admission, any procedure, or death from stroke, with follow-up through 31 December 1998. The time to stroke event was considered censored if subjects did not have an event by the end of the study or were lost to follow-up during the study period.

To reduce the cost and preserve the blood samples, a case-cohort study was conducted for stroke. In addition to the case-cohort sample for stroke, additional serum ferritin information was obtained from another case-cohort study for coronary heart disease. Under this design, serum ferritin was measured for all the subjects with coronary heart disease and/or stroke as well as those in the subcohort. The full cohort size and the subcohort size were 1210 and 450, respectively. There were 117 subjects and 55 subjects who had stroke events in the full cohort and the subcohort, respectively. Extra serum ferritin information was available for 174 subjects who had only coronary heart disease outside the subcohort. The other risk factors included age, gender, body mass index (BMI), blood pressure treatment, cholesterol, triglycerides, hemoglobin, and smoking status.

To identify non-zero variables, we used the adaptive group bridge with weights (2) and (3) and compared the results with those from the group bridge and SCAD. For continuous variables such as age, BMI, and triglycerides, we considered the following hierarchical structure: (Group 1) each continuous variable (GS 1); (Group 2) each continuous variable and its squared value (GS 2). All variables were centered and standardized. Since triglycerides were severely skewed, they were log-transformed before standardization.

Table 3 reports the estimated coefficients and their standard errors using the group bridge, adaptive group bridge, and SCAD. The results show that SCAD selected more variables than the group bridge and the adaptive group bridge regardless of the weights. In particular, the standard errors of serum ferritin levels, the square of log-transformed triglycerides, BMI, and diabetes treatment that SCAD identified even with weight (3) were large relative to their estimates. When using the traditional weight (2), all three methods selected more variables than when using the efficient weight (3). More specifically, compared to the variables selected by the group bridge with weight (3), the group bridge with weight (2) selected four additional variables, including ferritin tertile 2, log-transformed triglycerides, its quadratic term, and diabetes treatment. The standard errors of those four variables' estimates were relatively large compared with the estimates. On the other hand, compared to the variables selected by the adaptive group bridge with weight (3), the adaptive group bridge with weight (2) additionally selected diabetes treatment only. Both the group bridge and the adaptive group bridge using the efficient weight (3) selected the same three variables: age, blood pressure treatment, and sex. Therefore, age, blood pressure treatment, and sex were associated with stroke events.

Table 3.

Estimated coefficients and standard errors for the Busselton Health Study Data.

Entries are β^ (SE).

                            Traditional case-cohort weight (2)            Efficient case-cohort weight (3)
Variable                    AGB           GB            SCAD          AGB           GB            SCAD

Ferritin tertile (ref = 1)
  Ferritin tertile 2        0 (–)         0 (–)         0.04 (0.15)   0 (–)         0 (–)         0.11 (0.15)
  Ferritin tertile 3        0 (–)         0.09 (0.11)   0.14 (0.13)   0 (–)         0 (–)         0.11 (0.13)
Age                         0.86 (0.13)   0.91 (0.13)   0.88 (0.13)   0.85 (0.13)   0.83 (0.12)   0.90 (0.13)
Age^2                       0 (–)         0 (–)         0 (–)         0 (–)         0 (–)         0 (–)
BMI                         0 (–)         0 (–)         0 (–)         0 (–)         0 (–)         0.04 (0.13)
BMI^2                       0 (–)         0 (–)         0 (–)         0 (–)         0 (–)         0 (–)
Cholesterol                 0 (–)         0 (–)         0 (–)         0 (–)         0 (–)         0 (–)
Cholesterol^2               0 (–)         0 (–)         0 (–)         0 (–)         0 (–)         0 (–)
log(TR)                     0 (–)         −0.17 (0.25)  0 (–)         0 (–)         0 (–)         0 (–)
log(TR)^2                   0 (–)         0.21 (0.21)   0.03 (0.13)   0 (–)         0 (–)         0.05 (0.12)
Diabetes TRT                −0.17 (0.1)   −0.16 (0.1)   −0.13 (0.09)  0 (–)         0 (–)         −0.05 (0.08)
Blood pressure TRT          0.28 (0.09)   0.3 (0.1)     0.31 (0.10)   0.26 (0.09)   0.25 (0.09)   0.30 (0.10)
Sex (1 = female)            −0.33 (0.11)  −0.31 (0.12)  −0.32 (0.12)  −0.27 (0.11)  −0.26 (0.11)  −0.25 (0.11)
Smoking (ref = Never)
  Former                    0 (–)         0 (–)         0 (–)         0 (–)         0 (–)         0 (–)
  Current                   0 (–)         0 (–)         0 (–)         0 (–)         0 (–)         0 (–)

TR: triglycerides; TRT: treatment; AGB: adaptive group bridge; GB: group bridge; SCAD: smoothly clipped absolute deviation; BMI: body mass index.

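We compared the model errors of the three methods using a fivefold cross-validation evaluation. Following Huang et al.,13 we considered the following model error: $ME(\hat\beta) = E\{ \exp(\hat\beta^T Z) - \exp(\beta_0^T Z) \}^2$. We estimated the model error for case-cohort data as follows

$$\widehat{ME}(\hat\beta) = \frac{1}{n} \sum_{i=1}^{n} \Big\{ \Delta_{ik} + (1 - \Delta_{ik}) \frac{\xi_i}{\tilde\alpha} \Big\} \{ \exp(\hat\beta^T Z_i) - \exp(\beta_0^T Z_i) \}^2$$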

For the traditional weight (2), the estimated model errors for the adaptive group bridge, the group bridge, and SCAD were 1.16, 2.35, and 11.05, respectively. When the efficient weight (3) was used, the estimated model errors for the three methods were smaller than those using the traditional weight (2): 0.18, 0.63, and 1.85 for the adaptive group bridge, the group bridge, and SCAD, respectively. For both weights, the adaptive group bridge had the smallest model error, which indicates the adaptive group bridge had a better prediction than the other two methods.

5. Discussion

We proposed the adaptive group bridge for case-cohort data and studied its asymptotic properties. The simulation studies and the Busselton Health data example showed that the adaptive group bridge was superior to the group bridge and SCAD in terms of variable selection and prediction. The objective function of the proposed penalized proportional hazards model with the adaptive group bridge is non-convex. To minimize this non-convex objective function, we proposed a coordinate descent algorithm that uses a quadratic approximation of the pseudo-likelihood function and the optimization scheme for the adaptive L1 penalty.
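To make the coordinate descent idea more concrete, here is a minimal Python sketch of one inner step only: minimizing a quadratic approximation plus a weighted L1 penalty with the penalty weights held fixed. It is not the authors' algorithm. The quadratic form (1/2) βᵀHβ − cᵀβ, the names `H`, `c`, and `weights`, and the function names are assumptions made for illustration, and how the weights would be derived from the adaptive group bridge penalty is not shown.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def cd_weighted_l1(H, c, weights, n_iter=200, tol=1e-10):
    """Coordinate descent for 0.5 * b'Hb - c'b + sum_j weights[j] * |b_j|.

    H       : (p, p) symmetric positive definite matrix (assumed curvature
              of the quadratic approximation)
    c       : (p,) linear term of the quadratic approximation
    weights : (p,) non-negative penalty weights, held fixed here
    """
    H = np.asarray(H, dtype=float)
    c = np.asarray(c, dtype=float)
    weights = np.asarray(weights, dtype=float)
    p = c.shape[0]
    beta = np.zeros(p)

    for _ in range(n_iter):
        beta_old = beta.copy()
        for j in range(p):
            # Linear term for coordinate j with all other coordinates fixed.
            partial = c[j] - H[j] @ beta + H[j, j] * beta[j]
            # Closed-form minimizer of the one-dimensional problem.
            beta[j] = soft_threshold(partial, weights[j]) / H[j, j]
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta
```

A larger weight on a coordinate shrinks it more strongly toward zero, which is how an adaptive L1 scheme can remove individual variables within a group.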

The proposed method is limited to the case dn < n. Studying bi-level selection when dn > n would be an important future research problem. One way to deal with this problem is screening. A two-stage selection procedure may be considered: group variables are screened in the first stage, and a parsimonious list of non-zero variables is then obtained using the adaptive group bridge in the second stage.

Inference based only on the selected variables may not be valid because the hypotheses are generated from the data rather than pre-specified. Recently, post-selection inference for the LASSO has received much attention.23–25 Developing a method to conduct valid inference after model selection would be an important research problem.

In this paper, we have considered the event-specific model. A joint analysis should be conducted when an investigator wishes to compare the effects of risk factors on different diseases. Developing a variable selection method for multivariate failure time models under multiple case-cohort studies would be another interesting future research problem.

Acknowledgements

We thank Professor Matthew Knuiman and the Busselton Population Medical Research Foundation for permission to use their data.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by Institutional Research Grant IRG #16-183-31 from the American Cancer Society and the MCW Cancer Center, and the United States National Cancer Institute (U24CA076518).

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Supplemental Material

Supplemental material for this article is available online.

References

  • 1. Prentice R. A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika 1986; 73: 1–11.
  • 2. Self SG and Prentice RL. Asymptotic distribution theory and efficiency results for case-cohort studies. Ann Statist 1988; 16: 64–81.
  • 3. Barlow W. Robust variance estimation for the case-cohort design. Biometrics 1994; 50: 1064–1072.
  • 4. Kulich M and Lin DY. Improving the efficiency of relative-risk estimation in case-cohort study. J Am Statist Assoc 2004; 99: 832–844.
  • 5. Kang S and Cai J. Marginal hazard model for case-cohort studies with multiple disease outcomes. Biometrika 2009; 96: 887–901.
  • 6. Kim S, Cai J and Lu W. More efficient estimators for case-cohort studies. Biometrika 2013; 100: 695–708.
  • 7. Ni A, Cai J and Zeng D. Variable selection for case-cohort studies with failure time outcome. Biometrika 2016; 103: 547–562.
  • 8. Cullen KJ. Mass health examinations in the Busselton population, 1966 to 1970. Aust J Med 1972; 2: 714–718.
  • 9. Knuiman MW, Divitini ML, Olynyk JK, et al. Serum ferritin and cardiovascular disease: a 17-year follow-up study in Busselton, Western Australia. Am J Epidemiol 2003; 158: 144–149.
  • 10. Cai J, Fan J, Li R, et al. Variable selection for multivariate failure time data. Biometrika 2005; 92: 303–316.
  • 11. Ma S, Song X and Huang J. Supervised group lasso with applications to microarray. BMC Bioinformatics 2007; 3: 60.
  • 12. Kim J, Sohn I, Jung S, et al. Analysis of survival data with group lasso. Comm Statist Simulation Comput 2012; 41: 1593–1605.
  • 13. Huang J, Li L, Liu Y, et al. Group selection in the Cox model with a diverging number of covariates. Statist Sin 2014; 24: 1787–1810.
  • 14. Wang S and Nan B. Hierarchically penalized Cox regression with grouped variables. Biometrika 2009; 96: 307–322.
  • 15. Ahn KW, Banerjee A, Sahr N, et al. Group and within-group variable selection for competing risks data. Lifetime Data Anal 2018; 24: 407–424.
  • 16. Cox DR. Regression models and life-tables (with discussion). J R Statist Soc B 1972; 34: 187–220.
  • 17. Kalbfleisch JD and Lawless JF. Likelihood analysis of multistate models for disease incidence and mortality. Statist Med 1988; 7: 149–160.
  • 18. Borgan O, Langholz B, Samuelsen SO, et al. Exposure stratified case-cohort designs. Lifetime Data Anal 2000; 6: 39–58.
  • 19. Huang J, Ma S, Xie H, et al. A group bridge approach for variable selection. Biometrika 2009; 96: 339–355.
  • 20. Zou H. The adaptive lasso and its oracle properties. J Am Statist Assoc 2006; 101: 1418–1429.
  • 21. Fan J and Li R. Variable selection for Cox's proportional hazards model and frailty model. Ann Statist 2002; 30: 74–99.
  • 22. Clayton D and Cuzick J. Multivariate generalizations of the proportional hazards model (with discussion). J R Statist Soc A 1985; 148: 82–117.
  • 23. Lee J, Sun D, Sun Y, et al. Exact post-selection inference, with application to the lasso. Ann Stat 2016; 44: 907–927.
  • 24. Lockhart R, Taylor J, Tibshirani R, et al. A significance test for the lasso. Ann Stat 2014; 42: 413–468.
  • 25. Tibshirani R, Taylor J, Lockhart R, et al. Exact post-selection inference for sequential regression procedures. J Am Statist Assoc 2016; 111: 600–620.
