Identification of Homogeneous and Heterogeneous Variables in Pooled Cohort Studies

Xin Cheng; Wenbin Lu; Mengling Liu

doi:10.1111/biom.12285

Author manuscript; available in PMC: 2016 Jun 1.

Published in final edited form as: Biometrics. 2015 Mar 2;71(2):397–403. doi: 10.1111/biom.12285

PMCID: PMC4745128 NIHMSID: NIHMS680620 PMID: 25732747

Summary

Pooled analyses integrate data from multiple studies and achieve a larger sample size for enhanced statistical power. When heterogeneity exists in variables’ effects on the outcome across studies, the simple pooling strategy fails to present a fair and complete picture of the effects of heterogeneous variables. Thus, it is important to investigate the homogeneous and heterogeneous structure of variables in pooled studies. In this paper, we consider the pooled cohort studies with time-to-event outcomes and propose a penalized Cox partial likelihood approach with adaptively weighted composite penalties on variables’ homogeneous and heterogeneous effects. We show that our method can characterize the variables as having heterogeneous, homogeneous, or null effects, and estimate non-zero effects. The results are readily extended to high-dimensional applications where the number of parameters is larger than the sample size. The proposed selection and estimation procedure can be implemented using the iterative shooting algorithm. We conduct extensive numerical studies to evaluate the performance of our proposed method and demonstrate it using a pooled analysis of gene expression in patients with ovarian cancer.

Keywords: Adaptive group lasso, Cox proportional hazard model, Heterogeneity, Penalized partial likelihood, Pooled analysis, Structure identification

1. Introduction

Pooled studies can achieve a large sample size and facilitate investigations on rare diseases, rare exposures, and topics not easily addressed in a single study. For example, Ganzfried et al. (2013) made a concerted effort to create a curated database consisting of clinical and microarray gene expression data on 2970 ovarian cancer patients from 23 studies using 11 gene expression measurement platforms. This pooled database empowers researchers to investigate the prognostic effect of genetic biomarkers on ovarian cancer survival in a uniform and consistent fashion. But as seen in this and many other pooled studies, inter-study heterogeneity in the association between the biomarkers and the outcome often exists and its source includes differences in study populations, sampling methods, disease ascertainments, and measurement methods. Although harmonizing the data can alleviate this issue (Ganzfried et al., 2013), heterogeneity often is inherent in pooled studies. In some studies, heterogeneity itself is important for understanding disease disparity and progression at different phases (Moreno et al., 1996). Therefore, the analysis of pooled studies needs to properly account for heterogeneity to yield meaningful results.

To estimate covariate effects in pooled studies, the two-step procedure is commonly used, in which study-specific effects are first estimated using individual study data and then combined using a fixed-effects model (Hedges and Olkin, 1985) or a random-effects model (DerSimonian and Laird, 1986). However, the two-step procedure has difficulty in handling multiple variables. If heterogeneity exists in variables’ effects across studies, we want to distinguish the variables with heterogeneous effects from those without: the effect of a homogeneous predictor should be modeled using a common parameter across studies to reduce model complexity and improve efficiency, while heterogeneous effects should be modeled by distinct parameters for different studies to build accurate models. Methods for discovering heterogeneity in a variable’s effects include examining its interactions with the study-membership indicator variables or heterogeneity statistics such as Cochran’s Q and I², but these often have low power, especially when the number of studies is small and the number of predictors is large (Hedges and Olkin, 1985).
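The second step of the two-step procedure can be illustrated with a minimal sketch for a single variable, using the standard formulas for Cochran’s Q, I², and the DerSimonian and Laird moment estimator of the between-study variance. The function name `random_effects_summary` and its inputs (study-specific estimates and their variances) are hypothetical and chosen only for illustration.

```python
def random_effects_summary(estimates, variances):
    """Cochran's Q, I^2, DerSimonian-Laird tau^2 and the pooled random-effects estimate."""
    k = len(estimates)
    w = [1.0 / v for v in variances]                      # inverse-variance weights
    fixed = sum(wi * b for wi, b in zip(w, estimates)) / sum(w)
    q = sum(wi * (b - fixed) ** 2 for wi, b in zip(w, estimates))
    i2 = max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0    # share of variation due to heterogeneity
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                    # moment estimator, truncated at zero
    w_star = [1.0 / (v + tau2) for v in variances]        # random-effects weights
    pooled = sum(wi * b for wi, b in zip(w_star, estimates)) / sum(w_star)
    se = (1.0 / sum(w_star)) ** 0.5
    return {"Q": q, "I2": i2, "tau2": tau2, "pooled": pooled, "se": se}
```

Under the null hypothesis of homogeneity, Q is compared with a chi-square distribution with K − 1 degrees of freedom, which is why the test has little power when K is small.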

In this paper, we consider the heterogeneity issue in pooled studies with a time-to-event endpoint, and formulate the problem in the framework of group variable selection. Specifically, we treat a variable’s effects across studies as a group and aim to classify the variables into three categories according to their effects: homogeneous, heterogeneous, and null. Group penalty regularized methods have been proposed to select variables with a pre-specified group structure (Kim et al., 2012; Ma et al., 2007), and the composite absolute penalties (CAP) of Zhao et al. (2009) can accommodate complex group structures. Recently, Liu et al. (2013) investigated the use of adaptive CAP regularized partial likelihood estimation in the context of pooled nested case-control studies with heterogeneity. These methods, however, cannot classify variables into the three desired categories.

Inspired by some recent developments in structure identification in the partially linear model (Zhang et al., 2011) and the time-varying Cox model (Yan and Huang, 2012), we employ the adaptively weighted L₁ and L₁/L₂ penalties on variables’ average effects and heterogeneous effects respectively, and propose a penalized partial likelihood approach to characterize the variables’ heterogeneity and simultaneously estimate variables’ effects. We establish asymptotic results for the proposed estimator when the number of parameters is fixed and also when it diverges with the sample size. The rest of the article is organized as follows. We introduce the composite L₁ + L₁/L₂ penalty regularized partial likelihood approach and the computation algorithm in Section 2. We also establish the theoretical properties of our proposed estimator, with the proofs provided in the Web Appendix. Section 3 contains numerical simulations and real study applications in the pooled ovarian cancer study. Section 4 gives concluding remarks.

2. Regularized Method for Identifying Homogeneous and Heterogeneous Variables

2.1 Penalized partial likelihood function

We consider a pooled study consisting of K sub-studies, with n_k subjects in study k and $\sum_{k = 1}^{K} n_{k} = n$ . Let $T_{k i}^{*}$ and C_ki be the failure time and censoring time for the ith subject in study k. Define the observed event time $T_{k i} = min (T_{k i}^{*}, C_{k i})$ , and the occurrence indicator of the failure event $δ_{k i} = I (T_{k i}^{*} < C_{k i})$ . We assume a Cox proportional hazards model for $T_{k i}^{*}$ :

$$λ_{k}\{t \mid Z_{ki}\} = λ_{0k}(t) \exp\{β_{k}' Z_{ki}\}, \quad k = 1, \dots, K, \tag{1}$$

where λ_0k(·) is the baseline hazard function, and β_k = (β_k1, …, β_kp)′ is a p × 1 vector characterizing the effects of covariates Z in study k. For the pooled study data under model (1), the log partial likelihood is expressed as

$$ℓ(β_{1}, \dots, β_{K}) = \sum_{k=1}^{K} \sum_{i=1}^{n_{k}} δ_{ki} \left[ β_{k}' Z_{ki} - \log \left\{ \sum_{j=1}^{n_{k}} I(T_{kj} \geq T_{ki}) \exp(β_{k}' Z_{kj}) \right\} \right].$$
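As an illustration, the pooled log partial likelihood can be evaluated study by study, since each study has its own baseline hazard and therefore its own risk sets. The sketch below assumes NumPy and a hypothetical data layout in which each study is a tuple `(time, event, Z)`; the function name is also hypothetical. It builds the full risk-set indicator matrix, so it is meant for small examples rather than large cohorts.

```python
import numpy as np

def pooled_log_partial_likelihood(betas, studies):
    """betas: list of K coefficient vectors; studies: list of K (time, event, Z) tuples."""
    total = 0.0
    for beta, (time, event, Z) in zip(betas, studies):
        eta = Z @ beta                                    # linear predictors beta_k' Z_ki
        # at_risk[i, j] = I(T_kj >= T_ki): subject j is in the risk set of subject i
        at_risk = (time[None, :] >= time[:, None]).astype(float)
        log_risk = np.log(at_risk @ np.exp(eta))          # log of the risk-set sums
        total += np.sum(event * (eta - log_risk))         # only events contribute
    return total
```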

To separate out homogeneous and heterogeneous effects, we reformulate model (1) into

$$λ_{k}\{t \mid Z_{ki}\} = λ_{0k}(t) \exp\{μ' Z_{ki} + (β_{k} - μ)' Z_{ki}\} \triangleq λ_{0k}(t) \exp\{μ' Z_{ki} + α_{k\cdot}' Z_{ki}\},$$

where μ = (μ₁, …, μ_p)′ denotes the average effects and α_k· = (α_k1, …, α_kp)′ denotes the deviation of the effects in study k from the average effects μ. Because $\sum_{k = 1}^{K} α_{k l} = 0$ for each covariate l, l = 1, …, p, the first-study component α_1l is determined by the others; we therefore denote α_l = (α_2l, …, α_Kl)′ and work with $θ = (μ', α_{1}^{'}, \dots, α_{p}^{'})'$ . We classify the p predictors into three mutually exclusive categories, where ‖·‖ denotes the Euclidean norm: (1) homogeneous effects if μ_l ≠ 0 and ‖α_l‖ = 0; (2) heterogeneous effects if ‖α_l‖ ≠ 0; (3) null effects if μ_l = 0 and ‖α_l‖ = 0. To estimate this homogeneous and heterogeneous structure, we propose the following composite penalty regularized partial likelihood estimator

$$\hat{θ} = \arg\min_{θ} Q_{n}(θ) = \arg\min_{θ} \left\{ -ℓ(θ) + λ_{1n} \sum_{l=1}^{p} ω_{0l} |μ_{l}| + λ_{2n} \sum_{l=1}^{p} ω_{1l} ‖α_{l}‖ \right\}, \tag{2}$$

where ω_0l and ω_1l are data-dependent weights, chosen here as ω_0l = 1/|μ̃_l| and ω_1l = 1/‖α̃_l‖, with (μ̃_l, α̃_l) being some initial root-n consistent estimators.
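To make the parameterization concrete, the following sketch (assuming NumPy; all function names are hypothetical) maps (μ, α) back to the study-specific effects β_k using the sum-to-zero constraint, evaluates the composite penalty in (2) with the adaptive weights, and labels each variable by the three categories defined above. Here `alpha` is stored as a (K − 1) × p array whose rows correspond to studies 2, …, K, so that column l is α_l. The weights assume nonzero initial estimates.

```python
import numpy as np

def study_effects(mu, alpha):
    """Return the K x p matrix of beta_k = mu + alpha_k, with alpha_1 = -sum_{k>=2} alpha_k."""
    alpha_1 = -alpha.sum(axis=0)                    # sum-to-zero constraint per covariate
    return mu + np.vstack([alpha_1, alpha])

def composite_penalty(mu, alpha, mu_init, alpha_init, lam1, lam2):
    """Adaptive L1 penalty on mu plus adaptive group (L2) penalty on each alpha_l."""
    w0 = 1.0 / np.abs(mu_init)                      # omega_0l = 1 / |mu_tilde_l|
    w1 = 1.0 / np.linalg.norm(alpha_init, axis=0)   # omega_1l = 1 / ||alpha_tilde_l||
    l1_part = lam1 * np.sum(w0 * np.abs(mu))
    group_part = lam2 * np.sum(w1 * np.linalg.norm(alpha, axis=0))
    return l1_part + group_part

def classify(mu, alpha):
    """Label each covariate as heterogeneous, homogeneous, or null."""
    labels = []
    for m, a in zip(mu, np.linalg.norm(alpha, axis=0)):
        if a != 0:
            labels.append("heterogeneous")
        elif m != 0:
            labels.append("homogeneous")
        else:
            labels.append("null")
    return labels
```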

2.2 Computation algorithm

We use the iterative shooting algorithm (Fu, 1998; Zhang and Lu, 2007) to minimize Q_n(θ) in (2). Let G = −∂ℓ/∂θ, H = −∂²ℓ/∂θ∂θ′, and X′X be the Cholesky decomposition of H. By defining a pseudo response vector Y = (X′)⁻¹{Hθ − G}, we can approximate −ℓ(θ) by a quadratic form $\frac{1}{2} (Y - X θ)' (Y - X θ)$ . Furthermore, we consider the L₁ norm as a special case of the Euclidean norm with one element and rewrite the composite penalty terms in (2) as an adaptive group lasso problem with 2p groups

$$\frac{1}{2} (Y - Xθ)' (Y - Xθ) + \sum_{g=1}^{2p} λ_{g} ‖θ_{g}‖, \tag{3}$$

where θ_g = μ_g and λ_g = λ_1nω_0g for g = 1, …, p and θ_g = α_g and λ_g = λ_2nω_1g for g = (p+1), …, 2p.
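The pseudo-data construction can be sketched as follows, assuming NumPy. `np.linalg.cholesky` returns a lower-triangular factor L with H = LL′, so X = L′ satisfies X′X = H and Y = (X′)⁻¹(Hθ − G) is obtained from a linear solve with L. The function name `pseudo_data` is hypothetical, and G and H are assumed to be evaluated at the current iterate `theta0`.

```python
import numpy as np

def pseudo_data(theta0, G, H):
    """Return (X, Y) such that 0.5 * ||Y - X theta||^2 matches the quadratic
    expansion of -loglik around theta0, up to an additive constant."""
    L = np.linalg.cholesky(H)                 # H = L L', hence X = L' gives X'X = H
    X = L.T
    Y = np.linalg.solve(L, H @ theta0 - G)    # Y = (X')^{-1} (H theta0 - G)
    return X, Y
```

A quick consistency check is that the gradient of the quadratic form at `theta0`, namely −X′(Y − Xθ₀), reproduces G.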

By the Karush–Kuhn–Tucker condition, Yuan and Lin (2006) showed that the necessary and sufficient condition for θ to be a solution of (3) is

$$-X_{g}' (Y - Xθ) + λ_{g} \frac{θ_{g}}{‖θ_{g}‖} = 0, \quad θ_{g} \neq 0; \tag{4}$$

$$‖ -X_{g}' (Y - Xθ) ‖ \leq λ_{g}, \quad θ_{g} = 0. \tag{5}$$

Note that the condition (4) is equivalent to

$$S_{g} = \left( X_{g}' X_{g} + \frac{λ_{g}}{‖θ_{g}‖} I_{d_{g}} \right) θ_{g}, \tag{6}$$

where X_g denotes the columns of X corresponding to θ_g, $S_{g} = X_{g}^{'} (Y - X θ_{- g})$ , with $θ_{- g} = (θ_{1}^{'}, \dots, θ_{g - 1}^{'}, 0', θ_{g + 1}^{'}, \dots, θ_{2 p}^{'})'$ , I_{d_g} is the identity matrix of dimension d_g, and d_g is the number of parameters in group g. Thus, our shooting algorithm is:

Initialize with θ⁽⁰⁾.
For each g = 1, …, 2p, if θ_g = 0, then it stays at 0. Otherwise, update θ_g with
$$θ_{g} = \begin{cases} \left( X_{g}' X_{g} + \frac{λ_{g}}{‖θ_{g}‖} I_{d_{g}} \right)^{-1} S_{g}, & \text{if } ‖S_{g}‖ > λ_{g}; \\ 0, & \text{if } ‖S_{g}‖ \leq λ_{g}. \end{cases}$$
Update θ with the new θ_g, and repeat until convergence.
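The steps above can be sketched for the quadratic problem (3) as follows, assuming NumPy. The function name, the convergence tolerance, and the representation of groups as lists of column indices are illustrative choices, not part of the original description. Because a group that reaches zero stays at zero, the starting value should have all groups nonzero (for example, an unpenalized estimate).

```python
import numpy as np

def shooting_group_lasso(X, Y, groups, lams, theta0, max_iter=500, tol=1e-8):
    """Cyclic group-wise updates for 0.5*||Y - X theta||^2 + sum_g lam_g * ||theta_g||."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(max_iter):
        previous = theta.copy()
        for idx, lam in zip(groups, lams):
            norm_g = np.linalg.norm(theta[idx])
            if norm_g == 0.0:
                continue                                  # a zero group stays at zero
            theta_minus = theta.copy()
            theta_minus[idx] = 0.0                        # theta with group g zeroed out
            Xg = X[:, idx]
            S = Xg.T @ (Y - X @ theta_minus)              # S_g = X_g'(Y - X theta_{-g})
            if np.linalg.norm(S) <= lam:
                theta[idx] = 0.0                          # condition (5): group set to zero
            else:
                A = Xg.T @ Xg + (lam / norm_g) * np.eye(len(idx))
                theta[idx] = np.linalg.solve(A, S)        # update based on (6)
        if np.max(np.abs(theta - previous)) < tol:
            break
    return theta
```

In a full implementation this inner loop would alternate with recomputing G, H, and the pseudo data at the updated θ; that outer loop is omitted here.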

We choose the tuning parameters λ_1n and λ_2n over a two-dimensional grid by minimizing the Bayesian information criterion (BIC),

$$BIC(\hat{θ}) = -ℓ(\hat{θ}) + \log(n) \times df,$$

where $\hat{θ} = (\hat{μ}', \hat{α}_{1}', \dots, \hat{α}_{p}')'$ is a minimizer of (2) under λ_1n and λ_2n, and the degrees of freedom (df) are defined following Yuan and Lin (2006) as $df = \sum_{l = 1}^{p} I (| {\hat{μ}}_{l} | > 0) + \sum_{l = 1}^{p} I (‖ {\hat{α}}_{l} ‖ > 0) + (K - 2) \sum_{l = 1}^{p} ‖ {\hat{α}}_{l} ‖ / ‖ {\tilde{α}}_{l} ‖$ . The BIC-based tuning parameter selection corresponds to maximizing the posterior probability of selecting the true model and has been shown to be consistent for model selection in various settings (Wang et al., 2007; Zhang et al., 2010). Our simulation experience also supports the use of the BIC-based selection method.
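The degrees of freedom and the criterion can be computed directly from a fitted model, as in the sketch below. It follows the formulas as written in this section; the function names and the choice of passing the group norms ‖α̂_l‖ and ‖α̃_l‖ as plain lists are assumptions made for illustration.

```python
import math

def degrees_of_freedom(mu_hat, alpha_norm_hat, alpha_norm_init, K):
    """df = #{mu_l != 0} + #{||alpha_l|| != 0} + (K - 2) * sum_l ||alpha_hat_l|| / ||alpha_tilde_l||."""
    n_mu = sum(1 for m in mu_hat if abs(m) > 0)
    n_alpha = sum(1 for a in alpha_norm_hat if a > 0)
    shrink = sum(a / a0 for a, a0 in zip(alpha_norm_hat, alpha_norm_init))
    return n_mu + n_alpha + (K - 2) * shrink

def bic(loglik, n, df):
    """BIC as defined in this section: -loglik + log(n) * df."""
    return -loglik + math.log(n) * df
```

In use, `bic` would be evaluated at the fit for every pair (λ_1n, λ_2n) on the grid, and the pair with the smallest value retained.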

Following Fan and Li (2001), we estimate the covariance of θ̂ by

$$\widehat{COV}(\hat{θ}) = \left. \left\{ (H + Σ)^{-1} H (H + Σ)^{-1} \right\} \right|_{\hat{θ}}, \tag{7}$$

where $Σ = \mathrm{diag}\{ (λ_{g} / ‖θ_{g}‖) I_{d_{g}} \}_{g = 1, \dots, 2p}$.
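A minimal sketch of the sandwich formula (7), assuming NumPy, is given below. Since λ_g/‖θ_g‖ is undefined for groups estimated as exactly zero, this sketch restricts the computation to the nonzero groups; that restriction, the function name, and the index-list representation of groups are assumptions of the example rather than details stated in the text. It expects at least one nonzero group.

```python
import numpy as np

def sandwich_covariance(H, theta, groups, lams):
    """Return (indices, covariance) for the nonzero groups using (H+S)^{-1} H (H+S)^{-1}."""
    active = [(idx, lam) for idx, lam in zip(groups, lams)
              if np.linalg.norm(theta[idx]) > 0]
    keep = np.concatenate([np.asarray(idx) for idx, _ in active])
    # Diagonal of Sigma: lam_g / ||theta_g|| repeated d_g times for each active group
    sigma_diag = np.concatenate([
        np.full(len(idx), lam / np.linalg.norm(theta[idx])) for idx, lam in active
    ])
    H_a = H[np.ix_(keep, keep)]                      # Hessian block for active parameters
    A_inv = np.linalg.inv(H_a + np.diag(sigma_diag))
    return keep, A_inv @ H_a @ A_inv
```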

2.3 Theoretical properties

Denote the true parameters by $θ_{n}^{*}$ , where we use the general notation with the subscript n to allow the number of covariates (p_n) to go to infinity as the sample size n increases. Let q_n = Kp_n be the total number of parameters. Define 𝒜_1n = {l : μ_l ≠ 0, l = 1, …, p_n}, 𝒜_2n = {l : ‖α_l‖ ≠ 0, l = 1, …, p_n}, and 𝒜_n = 𝒜_1n ∪ 𝒜_2n. The total number of parameters corresponding to 𝒜_n is s_n = |𝒜_1n| + (K − 1) × |𝒜_2n|. Under the regularity conditions specified in the Web Appendix, the following asymptotic properties hold.

Theorem 1 (Estimation Consistency)

If $λ_{1 n} / \sqrt{n} \to 0, λ_{2 n} / \sqrt{n} \to 0$ , and $q_{n}^{4} / n \to 0$ , then $‖ {\hat{θ}}_{n} - θ_{n}^{*} ‖ = O_{p} {{(n / q_{n})}^{- 1 / 2}}$ .

Theorem 2 (Selection Consistency)

If λ_1n/q_n → ∞ and λ_2n/q_n → ∞, then $P ({\hat{θ}}_{𝒜_{n}^{c}} = 0) \to 1$ .

Because the dimension of θ̂_{𝒜_n} may diverge as the sample size goes to infinity, for the asymptotic normality result below we consider an arbitrary linear combination B_nθ̂_{𝒜_n}, where B_n is an arbitrary m × s_n matrix with a finite m such that $B_{n} B_{n}^{'} \to G$ and G is positive-definite.

Theorem 3 (Asymptotic Normality)

If $λ_{1 n} / \sqrt{n / q_{n}} \to 0, λ_{2 n} / \sqrt{n / q_{n}} \to 0$ , λ_1n/q_n → ∞, λ_2n/q_n → ∞, and $q_{n}^{4} / n \to 0$ , then

\sqrt{n} B_{n} I_{𝒜_{n}}^{1 / 2} ({\hat{θ}}_{𝒜_{n}} - θ_{𝒜_{n}}^{*}) \to_{d} N (0, G),

where I_{𝒜_n} is the Fisher information matrix corresponding to $θ_{𝒜_{n}}^{*}$ .

Therefore, as the sample size goes to infinity, the proposed estimator θ̂_n in (2) performs as well as the estimator obtained when the correct model is known in advance. Proofs are given in the Web Appendix.

3. Numerical Studies

3.1 Simulations

We conducted simulations to evaluate the performance of our method under practical settings and compared it with four methods including the maximum likelihood estimation (MLE) method, the two-step method (two-step) with the random-effects model (DerSimonian and Laird, 1986), and the penalized partial likelihood methods with adaptive group Lasso (agLASSO) penalty (Yuan and Lin, 2006; Kim et al., 2012) and adaptive composite absolute (aCAP) penalty (Zhao et al., 2009; Liu et al., 2013). For the MLE method, we obtained the study-specific effects and checked whether the average effect and the heterogeneous effects of each variable were significantly different from 0. In the two-step method, we used Cochran’s Q-test to examine the heterogeneity. The agLASSO method imposed a weighted L₂ penalty on the group of coefficients consisting of each variable’s effects across studies. The aCAP approach imposed a weighted composite penalty as $λ_{1 n} \sum_{l = 1}^{p} ω_{0 l} ‖ (μ_{l}, α_{l}) ‖ + λ_{2 n} \sum_{l = 1}^{p} ω_{1 l} ‖ α_{l} ‖$ .

3.1.1 Example 1: fixed number of covariates

We first considered a pooled study with 3 sub-studies of size N = 150 or 300, and generated 11 covariates from a multivariate normal distribution with mean 0 and covariance cov(z_i, z_j) = 0.5^|i−j|. The survival times were generated using the Cox proportional hazards model (1) with the Weibull distribution (shape= 10, scale = 1) determining the baseline hazard. The true θ* was specified as follows:

$$θ^{*} = \begin{pmatrix} μ' \\ α_{2\cdot}' \\ α_{3\cdot}' \end{pmatrix}: \quad \begin{array}{c|rrrrrrrrrrr} & Z_{1} & Z_{2} & Z_{3} & Z_{4} & Z_{5} & Z_{6} & Z_{7} & Z_{8} & Z_{9} & Z_{10} & Z_{11} \\ \hline μ' & 0.6 & 0.4 & 0.6 & 0 & 0 & 0 & 0.6 & 0 & 0 & 0 & 0 \\ α_{2\cdot}' & 0 & -0.4 & -0.5 & 0 & 0 & 0 & 0 & 0.4 & 0 & 0 & 0 \\ α_{3\cdot}' & 0 & -0.4 & 0.4 & 0 & 0 & 0 & 0 & -0.5 & 0 & 0 & 0 \end{array}$$

Thus, covariates Z₁ and Z₇ had homogeneous effects, Z₂, Z₃, and Z₈ had heterogeneous effects, and the rest were null variables. Censoring times were generated from the uniform distribution on [0, 1.28] or [0, 1.77] to yield event rates around 25% or 45%.
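The data-generating mechanism of Example 1 can be sketched as below, assuming NumPy. The Weibull parameterization is an assumption of this sketch: the cumulative baseline hazard is taken to be (t/scale)^shape, and failure times are drawn by inverting the conditional survival function. The function name `simulate_study` and the variable names are hypothetical, and the sketch does not attempt to reproduce the reported event rates.

```python
import numpy as np

# True effects from Example 1: mu, and alpha for studies 2 and 3
mu = np.array([0.6, 0.4, 0.6, 0, 0, 0, 0.6, 0, 0, 0, 0])
alpha2 = np.array([0, -0.4, -0.5, 0, 0, 0, 0, 0.4, 0, 0, 0])
alpha3 = np.array([0, -0.4, 0.4, 0, 0, 0, 0, -0.5, 0, 0, 0])
alpha1 = -(alpha2 + alpha3)                           # sum-to-zero constraint
beta_true = [mu + a for a in (alpha1, alpha2, alpha3)]

def simulate_study(n, beta, rng, rho=0.5, shape=10.0, scale=1.0, censor_max=1.77):
    """Simulate one sub-study: returns (observed time, event indicator, covariates)."""
    p = len(beta)
    idx = np.arange(p)
    cov = rho ** np.abs(idx[:, None] - idx[None, :])  # cov(z_i, z_j) = rho^|i-j|
    Z = rng.multivariate_normal(np.zeros(p), cov, size=n)
    u = 1.0 - rng.uniform(size=n)                     # uniform on (0, 1], avoids log(0)
    # Invert S(t | Z) = exp{-(t/scale)^shape * exp(beta'Z)} at S = u
    t_star = scale * (-np.log(u) / np.exp(Z @ beta)) ** (1.0 / shape)
    c = rng.uniform(0.0, censor_max, size=n)          # uniform censoring times
    time = np.minimum(t_star, c)
    event = (t_star < c).astype(int)
    return time, event, Z
```

A pooled data set is then obtained by calling `simulate_study` once per sub-study with the corresponding element of `beta_true` and a shared generator such as `np.random.default_rng(seed)`.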

Tables 1–4 report the simulation results. Table 1 reports the average numbers of correctly and incorrectly selected homogeneous and heterogeneous variables over 200 simulations. We use the root mean square error (rMSE), (E‖θ̂ − θ*‖²)^1/2, to measure the estimation error. Our method performs very well in identifying the correct structure in all scenarios and outperforms all other methods in terms of having the smallest rMSE. Its overall performance improves with increasing sample size. When the sample size increases to 300, with an event rate of 45%, the average number of correctly identified homogeneous variables by the proposed method is 1.96, and the average number of correctly identified heterogeneous variables is 3.00. The agLASSO method is only capable of identifying the non-zero groups of coefficients, and cannot differentiate homogeneous from heterogeneous effects. The random-effects model gives a large estimation error because the method cannot estimate the study-specific effects for each variable.

Table 1.

Simulation results on variable selection and root mean square errors in Example 1.

		25% event rate					45% event rate

N	Method	Homo Corr (2)	Homo Incorr (0)	Hetero Corr (3)	Hetero Incorr (0)	rMSE	Homo Corr (2)	Homo Incorr (0)	Hetero Corr (3)	Hetero Incorr (0)	rMSE
150	our method	1.9	0.69	2.37	0.31	0.83	1.93	0.3	2.84	0.26	0.57
	agLASSO	0	0	2.63	2.25	0.87	0	0	2.93	2.21	0.62
	aCAP	1.76	0.55	1.96	0.35	0.98	1.76	0.44	2.41	0.35	0.79
	two-step	1.78	0.60	2.41	0.74	1.39	1.84	0.41	2.89	0.61	1.30
	MLE	1.67	0.68	2.63	1.24	1.42	1.72	0.43	2.94	1.02	0.93

300	our method	1.94	0.14	2.95	0.27	0.48	1.96	0.06	3.00	0.16	0.35
	agLASSO	0	0	2.98	2.16	0.54	0	0	3	2.14	0.40
	aCAP	1.77	0.28	2.55	0.3	0.70	1.84	0.29	2.57	0.21	0.64
	two-step	1.85	0.25	2.97	0.57	1.27	1.87	0.31	3	0.51	1.25
	MLE	1.75	0.30	3	0.87	0.79	1.80	0.35	3.00	0.89	0.55

Homo/Hetero Corr: average numbers of correct homogeneous/heterogeneous effects; Homo/Hetero Incorr: average numbers of incorrect homogeneous/heterogeneous effects; rMSE: root mean square errors.

Table 4.

Simulation results on variable selection and root mean square errors in Example 2 with cohort size of 200 and 40% event rate.

ρ	p	Homo Corr (8)	Homo Incorr (0)	Hetero Corr (10)	Hetero Incorr (0)	rMSE
0.5	250	6.92	0.96	9.81	1.58	1.86
0.5	450	6.87	1.03	9.75	1.56	1.90

0.75	250	6.52	0.38	9.61	2.02	2.02
0.75	450	6.31	0.41	9.62	2.31	2.05

Homo/Hetero Corr: average numbers of correct homogeneous/heterogeneous effects; Homo/Hetero Incorr: average numbers of incorrect homogeneous/heterogeneous effects; rMSE: root mean square errors

Table 2 presents the detection frequencies of each variable’s average and heterogeneous effects under the scenario with 45% event rate and N = 300 in each study. Our method identifies all variables’ structure with good accuracy. The agLASSO method cannot differentiate between the average effect and heterogeneous effects, and always selects the variable as long as it has at least one non-zero effect, e.g. Z₁ and Z₇. The aCAP also shows a reasonable performance except for variable Z₈, which has zero mean effect but non-zero heterogeneous effects. The two-step method fails to identify the non-zero average effect of variable Z₂, because Z₂ has a small average effect size, and the heterogeneity of the effects of Z₂ is captured with a large variance estimated by the random-effects model. Because of the collinearity between variables, the MLE method does not perform well for variables with null effects. Similar results are observed in other scenarios.

Table 2.

Simulation results on the frequency of identifying nonzero mean effect and heterogeneous effects for each variable in Example 1 with cohort size of 300 and 45% event rate.

Method	Structure	Z₁	Z₂	Z₃	Z₄	Z₅	Z₆	Z₇	Z₈	Z₉	Z₁₀	Z₁₁
Ideal counts	μ ≠ 0, α ≠ 0	200 0	200 200	200 200	0 0	0 0	0 0	200 0	0 200	0 0	0 0	0 0

our method	μ ≠ 0	200	200	200	2	3	2	200	1	1	0	4
our method	α ≠ 0	2	200	200	3	5	3	6	200	2	7	3

agLASSO	μ ≠ 0	200	200	200	1	9	5	200	200	4	6	3
agLASSO	α ≠ 0	200	200	200	1	9	5	200	200	4	6	3

aCAP	μ ≠ 0	200	199	200	15	18	11	200	114	10	8	6
aCAP	α ≠ 0	9	199	200	2	4	1	24	114	1	1	1

two-step	μ ≠ 0	200	0	119	14	16	7	200	0	5	13	7
two-step	α ≠ 0	7	200	200	14	15	14	20	200	8	13	11

MLE	μ ≠ 0	200	200	200	15	21	11	200	10	8	15	14
MLE	α ≠ 0	19	200	200	27	25	18	22	200	15	25	25

Table 3 presents the estimated standard errors for covariates Z₁, Z₂, and Z₈ based on the sandwich formula (7), and compares them with the sample standard deviations calculated from 200 iterations. The empirical variance estimates and asymptotic estimates show some discrepancies in these finite-sample settings, but the overall performance improves with the sample size.

Table 3.

Simulation results on the estimated standard error and sample standard deviation (in parentheses) of estimators for selected variables in Example 1 with cohort size of 300 and 45% event rate.

N	Method	μ₁ (variable 1)	μ₂ (variable 2)	α₂₂ (variable 2)	α₃₂ (variable 2)	α₂₈ (variable 8)	α₃₈ (variable 8)
150	our method	0.083(0.109)	0.086(0.131)	0.099(0.135)	0.099(0.130)	0.088(0.140)	0.084(0.124)
	agLASSO	0.073(0.119)	0.090(0.118)	0.122(0.139)	0.122(0.146)	0.078(0.144)	0.077(0.130)
	aCAP	0.082(0.121)	0.093(0.114)	0.093(0.115)	0.093(0.105)	0.090(0.138)	0.088(0.122)
	MLE	0.107(0.126)	0.124(0.144)	0.165(0.199)	0.165(0.191)	0.161(0.187)	0.161(0.162)

300	our method	0.059(0.068)	0.063(0.077)	0.077(0.082)	0.077(0.086)	0.068(0.088)	0.065(0.083)
	agLASSO	0.056(0.072)	0.067(0.072)	0.090(0.083)	0.090(0.086)	0.065(0.099)	0.064(0.095)
	aCAP	0.058(0.069)	0.067(0.079)	0.073(0.072)	0.073(0.072)	0.065(0.093)	0.064(0.083)
	MLE	0.068(0.072)	0.079(0.080)	0.106(0.104)	0.106(0.101)	0.103(0.105)	0.103(0.103)

3.1.2 Example 2: diverging number of covariates p > N

We still considered pooling 3 studies, with each study size of N = 200. The number of covariates was set to be p = 250 or 450, and the covariate vector was generated from the multivariate normal distribution with mean 0 and covariance cov(z_i, z_j) = ρ^|i−j|, ρ = 0.5 or 0.75. The survival times were generated from the Cox proportional hazards model with a constant baseline hazard function λ₀(t) = 0.1 and the true coefficients θ* were specified as follows:

θ^{*} = (\begin{matrix} μ \\ α_{2 \cdot} \\ α_{3 \cdot} \end{matrix}) = (\begin{matrix} Z_{1, \dots, 8} & Z_{9, \dots, 13} & Z_{14, \dots, 18} & Z_{19, \dots, p} \\ 1.5 \cdot 1_{8} & 1 \cdot 1_{5} & 0.5 \cdot 1_{5} & 0_{p - 18} \\ 0_{8} & 1 \cdot 1_{5} & - 0.5 \cdot 1_{5} & 0_{p - 18} \\ 0_{8} & - 1 \cdot 1_{5} & 0.5 \cdot 1_{5} & 0_{p - 18} \end{matrix}),

where a_m denotes an m-vector of a’s. Therefore, covariates Z₁ to Z₈ had homogeneous effects, Z₉ to Z₁₈ had heterogeneous effects, and the rest were null variables. Censoring times were generated from the uniform distribution U(0, 2) to yield an event rate around 40% in each cohort. Table 4 reports the average numbers of correctly and incorrectly selected homogeneous and heterogeneous variables by our method and the rMSE over 200 replicates. Our results show that the proposed method achieves good accuracy in identifying the homogeneous and heterogeneous effects and maintains low error rates when the number of covariates is greater than the sample size.

3.2 Pooled ovarian cancer study

Using the pooled data from 1676 patients from 10 studies, which had complete information on tumor stage and debulking surgery, Ganzfried et al. (2013) found that the expression level of chemokine CXCL12 was associated with patient survival, which was not detected in individual studies due to insufficient power. We first examined whether the three variables CXCL12, tumor stage and debulking were homogeneous across studies, and applied our method and the meta-analysis method with random-effects model respectively. Both methods identified all three variables as homogeneous variables and yielded similar results on effect estimation. Figure 1 shows the forest plot of hazard ratios (HRs) of the variables.

Figure 1. Forest plot of the hazard ratio estimates of the variables CXCL12, tumor stage, and debulking from the pooled ovarian cancer study with 10 sub-studies. The study names are listed at the left and study sizes are given in parentheses. The three vertical dashed lines are reference lines at a hazard ratio of 1.

To further study the association between other genetic variables and survival, we examined 21 candidate genes related to breast and ovarian cancer according to the reports from the National Cancer Institute (http://www.cancer.gov/cancertopics/pdq/genetics/breast-and-ovarian/HealthProfessional), which are CXCL12, CXCR4, RAD51C, RAD51, BABAM1, MLH1, MSH2, MSH6, TP53, HOXD1, CHEK2, HOXD3, CASP8, IRS1, TIPARP, PLEKHM1, BNC2, SKAP1, CERS6, BRCA1, and BRCA2. We also included three clinical variables: tumor stage, debulking, and age at initial pathologic diagnosis. The pooled analysis was conducted in 1053 patients from 4 studies in the database, each with a study size over 100 and complete information on the genetic and clinical variables. Table 5 reports the estimated coefficients by our method and the meta-analysis. The proposed method identified no heterogeneous variables, supporting the quality of data curation by Ganzfried et al. (2013). We found that the clinical variables age, tumor stage, and debulking remained important risk factors associated with ovarian cancer, and identified important genetic biomarkers including BNC2, BRCA2, CASP8, CXCL12, CXCR4, HOXD1, and IRS1 (Goode et al., 2010; Welcsh and King, 2001; Ding et al., 2012; Kajiyama et al., 2008; Ma et al., 2011). The meta-analysis method also identified 6 out of these 10 variables as important factors with no heterogeneity, with the remaining 4 variables being non-significant, which resembles our finding in the simulation study that the meta-analysis can have low power for small average effects. The remaining 14 variables were classified as null variables by our method, and the meta-analysis method reached the same conclusion for 11 of them.

Table 5.

Estimates of the log hazard ratios of genetic and clinical variables in the pooled ovarian cancer study.

Variable	Our method	Two-step
age	0.022(0.004)	0.026(0.006)^*
tumor stage	0.455(0.085)	0.536(0.100)^*
debulking	0.248(0.071)	0.297(0.157)
BNC2	0.076(0.030)	0.145(0.058)^*
BRCA2	0.140(0.036)	0.183(0.049)^*
CASP8	−0.061(0.030)	−0.138(0.074)
CXCL12	0.042(0.026)	0.089(0.055)
CXCR4	−0.098(0.034)	−0.118(0.051)^*
HOXD1	−0.011(0.015)	−0.111(0.137)
IRS1	0.072(0.029)	0.158(0.057)
BABAM1	0(−)	−0.075(0.131) ⁺
BRCA1	0(−)	0.024(0.054)
CERS6	0(−)	−0.014(0.100)
CHEK2	0(−)	−0.023(0.102)
HOXD3	0(−)	0.072(0.072)
MSH2	0(−)	0.126(0.083)
MSH6	0(−)	−0.159(0.072)^*
MLH1	0(−)	−0.018(0.051)
PLEKHM1	0(−)	0.034(0.055)
RAD51	0(−)	0.108(0.067)
RAD51C	0(−)	0.017(0.099)⁺
SKAP1	0(−)	0.020(0.052)
TIPARP	0(−)	0.017(0.086)
TP53	0(−)	−0.014(0.058)

^* denotes nonzero average effects; ⁺ denotes nonzero heterogeneous effects with the random-effects model. Standard errors are reported in parentheses.

4. Discussion

In this article, we address the question of identifying variables’ homogeneous and heterogeneous effects on a time-to-event outcome in pooled studies using a group variable selection approach based on penalized regression. The proposed method requires that every study in the pooled analysis has the same predictors. We establish the theoretical properties and show good numerical performance of the proposed method. In practice, to better estimate the variables’ effects, we could refit the data using the variables and their homogeneous/heterogeneous structure identified by our method, or we could randomly partition the data into two parts, using one part to detect heterogeneity and the other to estimate effects. Also note that the proposed method can be easily extended to linear models and generalized linear models.

When the number of covariates diverges with the sample size, we establish the asymptotic results under the rate condition $p_{n}^{4} / n \to 0$ , following Cai et al. (2005). More recently, Huang et al. (2013) established oracle inequalities in the p_n ≫ n sparse Cox model setting, which may potentially be applicable to our context and need further investigation.

For tuning parameter selection, it is well known that the generalized cross validation (GCV) and AIC-based methods may select irrelevant predictors with a non-vanishing probability as n → ∞ (Wang et al., 2007). In contrast, BIC-based tuning parameter selection is consistent for model selection in various settings (Wang et al., 2007; Zhang et al., 2010). The log partial likelihood of the Cox proportional hazards model can be quadratically approximated so that the optimization is conducted in a similar fashion to the least-squares setting, which supports the use of BIC for our proposed method. The cross-validation score approximating the Kullback-Leibler divergence can also be used to select the tuning parameter (Du et al., 2010), but needs further investigation when both n and p tend to infinity.

Acknowledgments

The authors thank the editor Jeremy M. G. Taylor, the Associate Editor, and two referees for their careful review and constructive comments that substantially improved the presentation of the paper. This work was partially supported by the National Cancer Institute (R01 CA140632, R21 CA116585).

Footnotes

Supplementary Materials

Web Appendix referenced in Section 2.3 and R code to reproduce the analyses in Section 3.2 are available with this paper at the Biometrics website on Wiley Online Library.

References

Cai J, Fan J, Li R, Zhou H. Variable selection for multivariate failure time data. Biometrika. 2005;92:303–316. doi: 10.1093/biomet/92.2.303.
DerSimonian R, Laird N. Meta-analysis in clinical trials. Controlled Clinical Trials. 1986;7:177–188. doi: 10.1016/0197-2456(86)90046-2.
Ding YC, McGuffog L, Healey S, Friedman E, Laitman Y, Paluch-Shimon S, Kaufman B, Liljegren A, Lindblom A, Olsson H, Kristoffersson U, Stenmark-Askmalm M, Melin B, et al. A nonsynonymous polymorphism in IRS1 modifies risk of developing breast and ovarian cancers in BRCA1 and ovarian cancer in BRCA2 mutation carriers. Cancer Epidemiology Biomarkers & Prevention. 2012;21:1362–1370. doi: 10.1158/1055-9965.EPI-12-0229.
Du P, Ma S, Liang H. Penalized variable selection procedure for Cox models with semiparametric relative risk. The Annals of Statistics. 2010;38:2092–2117. doi: 10.1214/09-AOS780.
Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
Fu WJ. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics. 1998;7:397–416.
Ganzfried BF, Riester M, Haibe-Kains B, Risch T, Tyekucheva S, Jazic I, Wang XV, Ahmadifar M, Birrer MJ, Parmigiani G, Huttenhower C, Waldron L. CuratedOvarianData: clinically annotated data for the ovarian cancer transcriptome. Database: The Journal of Biological Databases and Curation. 2013;2013:bat013. doi: 10.1093/database/bat013.
Goode EL, Chenevix-Trench G, Song H, Ramus SJ, Notaridou M, Lawrenson K, Widschwendter M, Vierkant RA, Larson MC, Kjaer SK, Birrer MJ, et al. A genome-wide association study identifies susceptibility loci for ovarian cancer at 2q31 and 8q24. Nature Genetics. 2010;42:874–879. doi: 10.1038/ng.668.
Hedges LV, Olkin I. Statistical methods for meta-analysis. Orlando: Academic Press; 1985.
Huang J, Sun T, Ying Z, Yu Y, Zhang C-H. Oracle inequalities for the lasso in the Cox model. The Annals of Statistics. 2013;41:1142–1165. doi: 10.1214/13-AOS1098.
Kajiyama H, Shibata K, Terauchi M, Ino K, Nawa A, Kikkawa F. Involvement of SDF-1α/CXCR4 axis in the enhanced peritoneal metastasis of epithelial ovarian carcinoma. International Journal of Cancer. 2008;122:91–99. doi: 10.1002/ijc.23083.
Kim J, Sohn I, Jung S-H, Kim S, Park C. Analysis of survival data with group lasso. Communications in Statistics - Simulation and Computation. 2012;41:1593–1605.
Liu M, Lu W, Krogh V, Hallmans G, Clendenen TV, Zeleniuch-Jacquotte A. Estimation and selection of complex covariate effects in pooled nested case-control studies with heterogeneity. Biostatistics. 2013;14:682–694. doi: 10.1093/biostatistics/kxt015.
Ma S, Song X, Song X, Huang J. Supervised group lasso with applications to microarray data analysis. BMC Bioinformatics. 2007;8:60–76. doi: 10.1186/1471-2105-8-60.
Ma X, Zhang J, Liu S, Huang Y, Chen B, Wang D. Polymorphisms in the CASP8 gene and the risk of epithelial ovarian cancer. Gynecologic Oncology. 2011;122:554–559. doi: 10.1016/j.ygyno.2011.05.031.
Moreno V, Martin ML, Bosch FX, de Sanjosé S, Torres F, Muñoz N. Combined analysis of matched and unmatched case-control studies: comparison of risk estimates from different studies. American Journal of Epidemiology. 1996;143:293–300. doi: 10.1093/oxfordjournals.aje.a008741.
Wang H, Li R, Tsai C-L. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 2007;94:553–568. doi: 10.1093/biomet/asm053.
Welcsh PL, King MC. BRCA1 and BRCA2 and the genetics of breast and ovarian cancer. Human Molecular Genetics. 2001;10:705–713. doi: 10.1093/hmg/10.7.705.
Yan J, Huang J. Model selection for Cox models with time-varying coefficients. Biometrics. 2012;68:419–428. doi: 10.1111/j.1541-0420.2011.01692.x.
Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B. 2006;68:49–67.
Zhang HH, Cheng G, Liu Y. Linear or nonlinear? Automatic structure discovery for partially linear models. Journal of the American Statistical Association. 2011;106:1099–1112. doi: 10.1198/jasa.2011.tm10281.
Zhang HH, Lu W. Adaptive lasso for Cox’s proportional hazards model. Biometrika. 2007;94:691–703.
Zhang Y, Li R, Tsai C-L. Regularization parameter selections via generalized information criterion. Journal of the American Statistical Association. 2010;105:312–323. doi: 10.1198/jasa.2009.tm08013.
Zhao P, Rocha G, Yu B. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics. 2009;37:3468–3497.

Associated Data

Supplementary Materials

Supp Material: NIHMS680620-supplement-Supp_Material.pdf (113.9 KB, PDF)

