Integrative analysis of multiple case-control studies

Han Zhang; Lu Deng; William Wheeler; Jing Qin; Kai Yu

doi:10.1111/biom.13461

Author manuscript; available in PMC: 2022 Sep 25.

Published in final edited form as: Biometrics. 2021 Apr 19;78(3):1080–1091. doi: 10.1111/biom.13461

PMCID: PMC8565848 NIHMSID: NIHMS1747007 PMID: 33768525

Abstract

It is often challenging to share detailed individual-level data among studies due to various informatics and privacy constraints. However, it is relatively easy to pool together aggregated summary level data, such as the ones required for standard meta-analyses. Focusing on data generated from case-control studies, we present a flexible inference procedure that integrates individual-level data collected from an “internal” study with summary data borrowed from “external” studies. This procedure is built on a retrospective empirical likelihood framework to account for the sampling bias in case-control studies. It can incorporate summary statistics extracted from various working models adopted by multiple independent or overlapping external studies. It also allows for external studies to be conducted in a population that is different from the internal study population. We show both theoretically and numerically its efficiency advantage over several competing alternatives.

Keywords: case-control studies, empirical likelihood, estimating equations, Lagrange multiplier, meta-analysis, retrospective likelihood

1 |. INTRODUCTION

In the era of big data, collaborative multicenter studies are often carried out to study a disease outcome, with detailed individual-level data being collected by participating centers. If individual-level data from all studies are available, the most efficient way to draw inference is to conduct a pooled analysis by applying a unified statistical model to all data. However, sharing of individual-level data can be challenging due to various informatics and privacy constraints. Also, meta-analysis of summary data (i.e., estimated coefficients) generated from participating studies can be challenging when summary data are derived from different working models (e.g., varying sets of covariates, or inconsistent covariate definitions).

We consider a setting where researchers have collected individual-level data in their own study (the internal study), and in the meantime can acquire summary data from published literature or other studies (external studies). Since the case-control sampling design is most commonly used for studying a binary disease outcome (Breslow and Day, 1980), we focus on integrating data from case-control studies. The goal is to develop a flexible statistical inference framework that can effectively synthesize all information from individual-level and aggregated summary data.

A number of procedures based on the empirical likelihood have been proposed to achieve the goal of integrative analysis (Chen and Sitter, 1999; Qin, 2000; Chaudhuri et al., 2008; Chatterjee et al., 2016; Han and Lawless, 2019; Zhang et al., 2020). The summary data can be quite general, as long as they satisfy a set of constraint equations defined by certain population moment conditions. For example, the summary data can be the population mean, disease prevalence in a given stratum, or estimates of coefficients in a working regression model chosen by an external study (Chatterjee et al., 2016). Those procedures obtain their estimates by maximizing the empirical likelihood of observed individual-level data, under moment condition constraints imposed by summary information. Other procedures based on the generalized method of moments (Imbens and Lancaster, 1994; Kundu et al., 2019; Huang and Qin, 2020) and Bayesian approaches (Cheng et al., 2018, 2019) have also been proposed.

Most existing procedures are built on the prospective likelihood approach, which focuses on modeling the probability of the disease outcome given covariates under the assumption that both individual-level and summary data come from prospective studies. They cannot be directly applied to data from case-control studies without further investigation. Qin et al. (2015) developed a procedure to improve the efficiency of case-control studies by utilizing knowledge on stratum-specific disease prevalence. Chatterjee et al. (2016) incorporated more general summary information into the analysis of an internal case-control study but assumed that summary data were derived from prospective studies. Both Qin et al. (2015) and Chatterjee et al. (2016) treated summary data as known parameters without any uncertainty. However, as shown by Zhang et al. (2020), this strategy is not optimal for integrating summary data with unignorable variability.

We present a retrospective likelihood approach for the integrative analysis of data from multiple case-control studies. Following Zhang et al. (2020), we treat both individual-level and summary data as observed random variables and derive their joint likelihood function. In order to account for the sampling bias in the case-control study design, the individual-level data are modeled with a retrospective empirical likelihood, which specifies the distribution of covariates given the disease status. Moment conditions satisfied by the summary data are used as constraints on the space of the parameters under investigation. As those constraints narrow the search region for the unknown parameters, they can help to reduce the uncertainty of parameter estimates. We show in theory and through simulation studies that estimates derived from the joint likelihood subject to those constraints are more efficient than existing approaches. A real example is used to illustrate the application of the proposed procedure.

2 |. METHOD

2.1 |. Setup and notations

We assume that we have a case-control study (called the internal study) of a binary disease outcome D and a set of covariates X. The study consists of n₁ subjects, with n_1,0 controls (D = 0) and n_1,1 cases (D = 1), n₁ = n_1,0 + n_1,1. We represent the individual-level data for this internal study as (X_i, D_i), i = 1, … , n₁, with the first n_1,0 subjects being controls, and the remaining n_1,1 subjects being cases. We further assume that the following logistic regression model is correctly specified as the underlying risk model:

logit {P (D = 1 | X)} = θ_{1}^{*} + H (X; θ_{2}),

(1)

where H(X; θ₂) is a given function, such as H(X; θ₂) = X^t θ₂, with θ₂ being the set of parameters of interest. By Bayes’ formula, model (1) specifies the following connection between covariate distributions in cases and in controls (Qin and Zhang, 1997),

P (X ∣ D = 1) = P (X ∣ D = 0) Δ (X; θ),

(2)

where $Δ (X; θ) = \exp \{θ_{1} + H (X; θ_{2})\}$ with $θ = {(θ_{1}, θ_{2}^{t})}^{t}$, and $θ_{1} = θ_{1}^{*} - \operatorname{logit} \{P (D = 1)\}$. Since we consider data from case-control studies, we hereafter adopt the retrospective form (2) to represent the underlying model.

To draw inference on θ₂ based on the internal study, the standard prospective likelihood procedure can be used (Prentice and Pyke, 1979). Here we briefly review the equivalent retrospective empirical likelihood approach, as our proposed procedure is built on this framework. Denote $P = \{p_{i} ≜ P (X_{i} | D = 0), i = 1, \dots, n_{1}\}$ as the empirical version of P(X|D = 0) supported on samples from the internal study. Following Qin and Zhang (1997), θ can be inferred by maximizing the following empirical log-likelihood function:

\sum_{i = 1}^{n_{1}} log p_{i} + \sum_{i = 1}^{n_{1}} D_{i} log Δ (X_{i}; θ),

(3)

subject to constraint equations $\sum_{i = 1}^{n_{1}} p_{i} = 1$ and $\sum_{i = 1}^{n_{1}} p_{i} \{Δ (X_{i}; θ) - 1\} = 0$. After profiling out p_i, the empirical likelihood estimate of θ is the stationary point of the profile log-likelihood,

ℓ_{1} (θ) = \sum_{i = 1}^{n_{1}} D_{i} log Δ (X_{i}; θ) - \sum_{i = 1}^{n_{1}} log {1 + ρ_{1} Δ (X_{i}; θ)},

with $ρ_{1} = n_{1, 1} / n_{1, 0}$ . It is evident that the estimate of θ₂ (not θ₁) based on this function is exactly the same as the one based on the prospective likelihood specified by (1). Furthermore, similar to the result of Prentice and Pyke (1979), Qin and Zhang (1997) showed that the asymptotic distribution of the empirical likelihood estimate is the same as the one based on the prospective likelihood if $H (X; θ_{2}) = X^{t} θ_{2}$ .
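The equivalence with the prospective fit can be checked numerically. The following is a minimal sketch, assuming a scalar covariate and the linear form H(X; θ₂) = Xθ₂; the function names and the toy data are illustrative and do not come from the paper. It evaluates ℓ₁(θ) and the prospective logistic log-likelihood with intercept θ₁ + log ρ₁, whose difference is the constant n_1,1 log ρ₁, so both functions share the same maximizer in θ₂.

```python
import math

def profile_loglik(theta1, theta2, xs, ds):
    """Retrospective profile log-likelihood l_1(theta) for a scalar covariate."""
    n_cases = sum(ds)
    rho1 = n_cases / (len(ds) - n_cases)      # rho_1 = n_{1,1} / n_{1,0}
    total = 0.0
    for x, d in zip(xs, ds):
        log_delta = theta1 + theta2 * x       # log Delta(X; theta), linear H assumed
        total += d * log_delta - math.log1p(rho1 * math.exp(log_delta))
    return total

def prospective_loglik(intercept, slope, xs, ds):
    """Standard prospective logistic log-likelihood."""
    total = 0.0
    for x, d in zip(xs, ds):
        eta = intercept + slope * x
        total += d * eta - math.log1p(math.exp(eta))
    return total

# Hypothetical toy data: four controls followed by three cases.
xs = [0.0, 1.0, 1.0, 2.0, 1.0, 2.0, 2.0]
ds = [0, 0, 0, 0, 1, 1, 1]
rho1 = 3 / 4

def loglik_gap(theta1, theta2):
    """Prospective minus retrospective log-likelihood after shifting the intercept."""
    shifted = theta1 + math.log(rho1)
    return (prospective_loglik(shifted, theta2, xs, ds)
            - profile_loglik(theta1, theta2, xs, ds))
```

Because `loglik_gap` does not depend on (θ₁, θ₂), maximizing either function gives the same slope estimate; only the intercept is reparameterized.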

Our goal is to estimate $θ_{2}$ by integrating individual-level data from the internal study and summary data (i.e., coefficient estimates) extracted from other case-control studies (called the external studies). We first present the method for incorporating summary data from one external study that is conducted within the same source population as the internal study. Later, we will expand the procedure to more complicated scenarios where multiple external studies are conducted in the same or a different source population.

For an external study consisting of n₂ subjects, with n_2,0 controls and n_2,1 cases, we represent its (unobserved) individual-level data as (X_i, D_i), i = n₁ + 1, … , n₁ + n₂, with the first n_2,0 subjects being controls, and the remainder being cases. We assume that the external study was analyzed with a working model $\operatorname{logit} \{P (D = 1 | X)\} = α_{1}^{*} + M (X; α_{2}, β)$, with $M (X; α_{2}, β)$ being a chosen function. This working model might be different from (1). Equivalently, we can represent this working model as

W (X | D = 1) = W (X | D = 0) \exp {α_{1} + M (X; α_{2}, β)},

(4)

where W(X|D = 1) and W(X|D = 0) represent distributions of X in cases and in controls, with their connection being misspecified. Here we separate all unknown parameters into two parts, with β being the set of parameters whose estimates are presented as summary data for the integrative analysis, and $α = {(α_{1}, α_{2}^{t})}^{t}$ being the set of parameters whose estimates are not given. We allow the existence of α (called nuisance parameters) to accommodate the situations where not all estimates from the working model are available. The procedure also needs to know the definition of the working model (4) in order to use the summary data properly. Here we provide two examples to illustrate setups for some applications.

Example 1. Suppose $X = {(X_{1}, X_{2}, X_{3})}^{t}$ , and the underlying model assumed by the internal study is $P (X | D = 1) = P (X | D = 0) \exp (θ_{1} + θ_{21} X_{1} + θ_{22} X_{2} + θ_{23} X_{3})$ . The external study can choose a reduced model nested within the assumed model, such as W(X|D = 1) = W(X|D = 0) exp(α₁ + α₂X₁ + βX₂), or a nonnested working model as W(X|D = 1) = W(X|D = 0) exp{α₁ + α₂X₁ + βlog(X₂)}. If only the estimate of β is provided as summary data, then (α₁, α₂) are considered as nuisance parameters. Note that these two working models are misspecified due to the noncollapsibility property of the logistic regression model.
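Noncollapsibility can be seen in a small numerical case. The sketch below is illustrative only: it uses the population (prospective) form of the model, two independent binary covariates, and made-up coefficients, none of which come from the paper. It computes the log odds ratio for X₁ after averaging the risk over X₂ and compares it with the conditional coefficient.

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

def marginal_log_or(intercept, b1, b2, p_x2=0.5):
    """Log odds ratio for X1 after averaging the risk over an independent binary X2."""
    def risk(x1):
        return ((1.0 - p_x2) * expit(intercept + b1 * x1)
                + p_x2 * expit(intercept + b1 * x1 + b2))
    return logit(risk(1)) - logit(risk(0))

conditional = 1.0  # hypothetical coefficient of X1 in the full model
marginal = marginal_log_or(intercept=-1.0, b1=conditional, b2=2.0)
print(round(marginal, 3))  # about 0.80, smaller than the conditional value 1.0
```

Even though X₂ is independent of X₁ here, dropping it changes the coefficient of X₁, so a reduced working model estimates a different quantity from θ₂₁.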

Example 2. The true model is given by $P (X | D = 1) = P (X | D = 0) \exp \{θ_{1} + θ_{2} ε (X)\}$, with ε(X) a known function of $X = {(X_{1}, \dots, X_{m})}^{t}$. The external study fits several marginal models, $W_{k} (X | D = 1) = W_{k} (X | D = 0) \exp \{α_{k} + β_{k} X_{k}\}$, k = 1, …, m. The summary data consist of estimates of β_k, k = 1, …, m. This is an example of using summary data from multiple working models fitted with the same external study, a setting similar to the one considered in the real example.

2.2 |. Asymptotic distribution of summary data

If summary data come from a misspecified working model, the variance–covariance matrix estimated by standard statistical packages is not correct; even the robust sandwich estimate derived from the prospective logistic model is not valid. Here we present the proper asymptotic distribution of the summary data.

Based on (4), the quasi-log-likelihood function of the external study can be expressed as

ℓ_{2} (α, β) = \sum_{i = n_{1} + 1}^{n_{1} + n_{2}} D_{i} \log δ (X_{i}; α, β) - \sum_{i = n_{1} + 1}^{n_{1} + n_{2}} \log {1 + ρ_{2} δ (X_{i}; α, β)},

where ρ₂ = n_2,1/n_2,0, $δ (X; α, β) ≜ \exp {α_{1} + M (X; α_{2}, β)}$ . Estimates $(\tilde{α}, \tilde{β})$ of (α, β) are obtained from the estimating equation:

\frac{\partial ℓ_{2} (α, β)}{\partial (α, β)} = \sum_{i = n_{1} + 1}^{n_{1} + n_{2}} {D_{i} - \frac{ρ_{2} δ (X_{i}; α, β)}{1 + ρ_{2} δ (X_{i}; α, β)}} \frac{\partial log δ (X_{i}; α, β)}{\partial (α, β)} = 0 .

Let $ϕ_{0} (X; α, β) = - \frac{ρ_{2} δ (X; α, β)}{1 + ρ_{2} δ (X; α, β)} \frac{\partial \log δ (X; α, β)}{\partial (α, β)}$, and $ϕ_{1} (X; α, β) = \frac{1}{1 + ρ_{2} δ (X; α, β)} \frac{\partial \log δ (X; α, β)}{\partial (α, β)}$. With the reparameterization of the intercept term, ℓ₂(α, β) is equivalent to the prospective log-likelihood formulation. Thus, $\tilde{β}$ is the same as the estimate obtained from the standard prospective model.

Based on the estimating equation theory (White, 1982), we know ${({\tilde{α}}^{t}, {\tilde{β}}^{t})}^{t}$ is a consistent estimate of ${(α^{* t}, β^{* t})}^{t}$ , which is the solution of the following stochastic constraint equation:

E_{0} {ϕ_{0} (X; α, β)} + ρ_{2} E_{1} {ϕ_{1} (X; α, β)} = 0,

where E₀ is the expectation over P(X|D = 0), the true conditional distribution in controls, and E₁ is the expectation over $P (X | D = 1)$. Here we assume that n_2,1/n_2,0 = ρ₂ remains constant as n₂ → ∞. Let $μ^{*} = {(θ^{* t}, α^{* t}, β^{* t})}^{t}$ be the true value for $μ = {(θ^{t}, α^{t}, β^{t})}^{t}$. Based on (2), we know μ* satisfies the following constraint equation:

E_{0} {g (X; μ^{*})} = 0,

(5)

with $g (X; μ) = ϕ_{0} (X; α, β) + ρ_{2} Δ (X; θ) ϕ_{1} (X; α, β)$ .
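The step from the population estimating equation to (5) uses only the density ratio in (2). Written out, assuming the expectations involved exist:

```latex
\begin{aligned}
E_{1}\{\phi_{1}(X; \alpha, \beta)\}
  &= \int \phi_{1}(x; \alpha, \beta)\, P(x \mid D = 1)\, dx \\
  &= \int \phi_{1}(x; \alpha, \beta)\, \Delta(x; \theta^{*})\, P(x \mid D = 0)\, dx
   = E_{0}\{\Delta(X; \theta^{*})\, \phi_{1}(X; \alpha, \beta)\}, \\
E_{0}\{\phi_{0}(X; \alpha, \beta)\} + \rho_{2} E_{1}\{\phi_{1}(X; \alpha, \beta)\}
  &= E_{0}\{\phi_{0}(X; \alpha, \beta)
     + \rho_{2}\, \Delta(X; \theta^{*})\, \phi_{1}(X; \alpha, \beta)\}
   = E_{0}\{g(X; \theta^{*}, \alpha, \beta)\}.
\end{aligned}
```

Evaluating at (α, β) = (α*, β*) makes the left-hand side zero, which gives (5).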

The asymptotic distribution of ${({\tilde{α}}^{t}, {\tilde{β}}^{t})}^{t}$ is given by

\sqrt{n_{2}} \begin{pmatrix} \tilde{α} - α^{*} \\ \tilde{β} - β^{*} \end{pmatrix} \overset{d}{\to} N (0, A^{- 1} B {(A^{- 1})}^{t}),

(6)

with

A = \frac{1}{ρ_{2} + 1} E_{0} {\frac{\partial g (X; μ^{*})}{\partial (α, β)}},

B = E_{0} {\frac{1}{ρ_{2} + 1} ϕ_{0} (X; α^{*}, β^{*}) ϕ_{0} {(X; α^{*}, β^{*})}^{t} + \frac{ρ_{2} Δ (X; θ^{*})}{ρ_{2} + 1} ϕ_{1} (X; α^{*}, β^{*}) ϕ_{1} {(X; α^{*}, β^{*})}^{t}} - \frac{1}{ρ_{2}} E_{0} {ϕ_{0} (X; α^{*}, β^{*})} E_{0} {ϕ_{0} {(X; α^{*}, β^{*})}^{t}} .

Let Σ₀ be the submatrix of A⁻¹B(A⁻¹)^t corresponding to β. We know $Cov (\tilde{β}) = n_{2}^{- 1} Σ_{0}$ . Here we choose to represent A and B in terms of the expectation defined by P(X|D = 0), as we can obtain the estimate of P(X|D = 0) within the empirical likelihood framework described below.

Because of the specific form of B, following Carroll et al. (1995), it can be seen that the asymptotic covariance matrix of $\tilde{β}$ is not theoretically equivalent to the one given by the corresponding prospective formula derived under the assumption that the external data are collected from a prospective study.

2.3 |. The integrative procedure

Here we extend the framework given by Zhang et al. (2020) to combine individual-level data with summary data $\tilde{β}$ . The main difference is that all considered data come from case-control studies. Rather than using the empirical distribution of X observed in the source population, we build our likelihood function using the empirical distribution of X observed among controls. We take a joint likelihood approach for the inference of μ = (θ^t, α^t, β^t)^t by treating both individual-level and summary data as observed random variables. The log-likelihood for the internal case-control study is given by (3). For the summary data $\tilde{β}$ , because of (6), its log-likelihood function can be written as $- n_{2} / 2 {(β - \tilde{β})}^{t} Σ_{0}^{- 1} (β - \tilde{β})$ . Since Σ₀ is unknown, we propose to estimate μ by solving the following optimization problem over (P, μ):

(\hat{P}, \hat{μ}) = \underset{(P, μ)}{argmax} \sum_{i = 1}^{n_{1}} log p_{i} + \sum_{i = 1}^{n_{1}} D_{i} log Δ (X_{i}; θ) - \frac{n_{2}}{2} {(β - \tilde{β})}^{t} V^{- 1} (β - \tilde{β}),

subject to

\sum_{i = 1}^{n_{1}} p_{i} = 1, \sum_{i = 1}^{n_{1}} p_{i} {Δ (X_{i}; θ) - 1} = 0, \sum_{i = 1}^{n_{1}} p_{i} g (X_{i}; μ) = 0,

(7)

with p_i ≥ 0, and V being any given positive definite matrix. The last constraint equation in (7) is due to (5). A simple choice for V is the identity matrix. We will show that Σ₀ is the optimal choice of V under our framework. Since Σ₀ is unknown, we can use an iterative algorithm discussed later to obtain the most efficient estimate of θ based on a consistent estimate of Σ₀.

For any given V, we use the Lagrange multiplier approach to solve the constrained optimization problem (7). Define the corresponding Lagrange function as

L (P, μ, κ, λ, ξ) = \sum_{i = 1}^{n_{1}} log p_{i} + \sum_{i = 1}^{n_{1}} D_{i} log Δ (X_{i}; θ) - \frac{n_{2}}{2} {(β - \tilde{β})}^{t} V^{- 1} (β - \tilde{β}) - n_{1} κ (\sum_{i = 1}^{n_{1}} p_{i} - 1) - n_{1} λ [\sum_{i = 1}^{n_{1}} p_{i} {Δ (X_{i}; θ) - 1}] - n_{1} \sum_{i = 1}^{n_{1}} p_{i} ξ^{t} g (X_{i}; μ),

with κ, λ, and ξ being the Lagrange multipliers. It is easy to show that κ = 1, and the maximizer should satisfy

p_{i} = \frac{1}{n_{1}} \frac{1}{1 + λ {Δ (X_{i}; θ) - 1} + ξ^{t} g (X_{i}; μ)} .
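Both claims follow from the first-order condition in p_i together with the three constraints in (7). A short derivation, written as a sketch of the standard empirical likelihood argument:

```latex
\begin{aligned}
\frac{\partial L}{\partial p_{i}}
  &= \frac{1}{p_{i}} - n_{1} \kappa - n_{1} \lambda \{\Delta(X_{i}; \theta) - 1\}
     - n_{1} \xi^{t} g(X_{i}; \mu) = 0, \\
\sum_{i = 1}^{n_{1}} p_{i} \frac{\partial L}{\partial p_{i}}
  &= n_{1} - n_{1} \kappa \sum_{i = 1}^{n_{1}} p_{i}
     - n_{1} \lambda \sum_{i = 1}^{n_{1}} p_{i} \{\Delta(X_{i}; \theta) - 1\}
     - n_{1} \xi^{t} \sum_{i = 1}^{n_{1}} p_{i}\, g(X_{i}; \mu)
   = n_{1} - n_{1} \kappa = 0
   \;\Longrightarrow\; \kappa = 1, \\
\frac{1}{p_{i}}
  &= n_{1} \left[ 1 + \lambda \{\Delta(X_{i}; \theta) - 1\} + \xi^{t} g(X_{i}; \mu) \right].
\end{aligned}
```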

Therefore, let $η = {(λ, ξ^{t}, μ^{t})}^{t}$ , we can express the profile log-likelihood as

ℓ_{V} (η) = - \sum_{i = 1}^{n_{1}} \log [1 + λ {Δ (X_{i}; θ) - 1} + ξ^{t} g (X_{i}; μ)] + \sum_{i = 1}^{n_{1}} D_{i} \log Δ (X_{i}; θ) - \frac{n_{2}}{2} {(β - \tilde{β})}^{t} V^{- 1} (β - \tilde{β}) .

(8)

We use the Newton-Raphson algorithm to find the stationary point ${\hat{η}}_{V} = {({\hat{λ}}_{V}, {\hat{ξ}}_{V}^{t}, {\hat{μ}}_{V}^{t})}^{t}$ of (8), with ${\hat{μ}}_{V} = {({\hat{θ}}_{V}^{t}, {\hat{α}}_{V}^{t}, {\hat{β}}_{V}^{t})}^{t}$ . A good initial point can be ${(n_{1, 1} / n_{1, 0}, 0^{t}, {\hat{μ}}_{0}^{t})}^{t}$ , with ${\hat{μ}}_{0}$ obtained by using the data from the internal study to fit models (2) and (4). We adjust the case-control sample size ratio when fitting (4).

We show in the Web Appendix A (Lemma 1) that under some regularity conditions, ${\hat{η}}_{V}$ is a consistent estimate of $η^{*} = {(λ^{*}, ξ^{* t}, μ^{* t})}^{t}$ , with $λ^{*} = \frac{ρ_{1}}{1 + ρ_{1}}$ , $ξ^{* t} = 0$ , and $μ^{*} = {(θ^{* t}, α^{* t}, β^{* t})}^{t}$ as defined before. After we obtain ${\hat{η}}_{V}$ , we can estimate P(X_i|D = 0), i = 1, …, n₁, as

{\hat{p}}_{i} = \frac{1}{n_{1}} \frac{1}{1 + {\hat{λ}}_{V} {Δ (X_{i}; {\hat{θ}}_{V}) - 1} + {\hat{ξ}}_{V}^{t} g (X_{i}; {\hat{μ}}_{V})} .

(9)
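As a rough illustration of (9), the sketch below computes the weights from given values of λ, ξ, Δ(X_i; θ), and g(X_i; μ). The function name and the toy inputs are hypothetical. With ξ = 0 and λ = n_1,1/n₁ (the limiting value ρ₁/(1 + ρ₁)), the weights reduce to 1/{n_1,0 + n_1,1 Δ(X_i; θ)}, the form implied by the internal-study-only profile likelihood.

```python
def control_weights(lam, xi, deltas, gs):
    """Weights p_i from (9): 1 / (n1 * [1 + lam*(Delta_i - 1) + xi' g_i])."""
    n1 = len(deltas)
    weights = []
    for delta, g in zip(deltas, gs):
        denom = 1.0 + lam * (delta - 1.0) + sum(a * b for a, b in zip(xi, g))
        if denom <= 0.0:
            raise ValueError("non-positive denominator; weights must satisfy p_i >= 0")
        weights.append(1.0 / (n1 * denom))
    return weights

# Hypothetical inputs: n_{1,0} = 2 controls and n_{1,1} = 2 cases.
deltas = [0.5, 1.0, 2.0, 4.0]
gs = [[0.1], [-0.2], [0.3], [0.0]]
weights_at_limit = control_weights(lam=2 / 4, xi=[0.0], deltas=deltas, gs=gs)
reference = [1.0 / (2 + 2 * d) for d in deltas]
```

The weights sum to one only at a solution of the constraints in (7), so no normalization is applied inside the function.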

We want to point out that the nuisance parameter α is identifiable as α can influence ℓ_V(η) through g(X; μ).

The asymptotic distribution of ${\hat{η}}_{V}$ is given by the following result, with its proof shown in Web Appendix A.

Theorem 1. Assuming n₂/n₁ → τ, n_1,1/n_1,0 = ρ₁, and n_2,1/n_2,0 = ρ₂ remain constant as n₁ → ∞, we have

\sqrt{n_{1}} ({\hat{η}}_{V} - η^{*}) \overset{d}{\to} N (0, J_{V}^{- 1} I_{V} J_{V}^{- 1}),

where J_V and I_V are defined in the Appendix.

Definitions of J_V and I_V rely on η*, Σ₀, and P(X|D = 0). To estimate the covariance of ${\hat{η}}_{V}$ , we can replace $η^{*}$ with ${\hat{η}}_{V}$ , and use the estimated empirical distribution (9) for P(X|D = 0). Similarly, we can replace Σ₀ with its consistent estimate ${\hat{Σ}}_{0}$ , which is a submatrix of ${\hat{A}}^{- 1} \hat{B} {({\hat{A}}^{- 1})}^{t}$ corresponding to β. $\hat{A}$ and $\hat{B}$ are estimates of A and B. They can be obtained by replacing μ* with ${\hat{μ}}_{V}$ , and by calculating the expectation in controls using estimates given by (9).

Although ${\hat{η}}_{V}$ is consistent given any V in (8), its level of efficiency depends on the choice of V. In fact, we can obtain the most efficient estimate of η by using V = Σ₀, or its consistent estimate. We need to introduce some notation before presenting the result. Let $s = {(λ, ξ^{t})}^{t}$ be the vector of Lagrange multipliers. We can represent J_V as the following block matrix:

J_{V} = \begin{pmatrix} J_{s s} & J_{s μ} \\ J_{μ s} & J_{V, μ μ} \end{pmatrix} .

Note that only the lower right corner submatrix depends on V. We can represent $J_{Σ_{0}}$ similarly by replacing $J_{V, μ μ}$ with $J_{Σ_{0}, μ μ}$. By letting V = Σ₀ in I_V, we can see that $I_{Σ_{0}}$ can be written as

I_{Σ_{0}} = \begin{pmatrix} - J_{s s} & 0 \\ 0 & J_{Σ_{0}, μ μ} \end{pmatrix} - λ^{*} (1 - λ^{*}) J_{. λ} J_{. λ}^{t},

with $J_{. λ}$ being the first column of $J_{Σ_{0}}$. We have the following result (see Web Appendix A for the proof).

Corollary 1. The asymptotic variance–covariance matrix of $\sqrt{n_{1}} ({\hat{η}}_{V} - η^{*})$ attains its minimum at V = Σ₀. At V = Σ₀, the asymptotic variance–covariance matrix of $\sqrt{n_{1}} ({\hat{η}}_{Σ_{0}} - η^{*})$ has the following form:

\begin{pmatrix} - J_{s s}^{- 1} - J_{s s}^{- 1} J_{s μ} {(J_{Σ_{0}, μ μ} - J_{μ s} J_{s s}^{- 1} J_{s μ})}^{- 1} J_{μ s} J_{s s}^{- 1} & 0 \\ 0 & {(J_{Σ_{0}, μ μ} - J_{μ s} J_{s s}^{- 1} J_{s μ})}^{- 1} \end{pmatrix} - λ^{*} (1 - λ^{*}) e_{1} e_{1}^{t},

where e₁ is a vector with its first element being 1, all others being zero. This asymptotic variance–covariance matrix remains the same if V is a consistent estimate of Σ₀.

Because of Corollary 1, we can use an iterative algorithm to find the optimal estimate. First, we let V in (8) be the identity matrix to obtain an initial estimate η₍₁₎. Second, we use η₍₁₎ to obtain ${\hat{Σ}}_{0}$, which is a consistent estimate of Σ₀. Third, by letting $V = {\hat{Σ}}_{0}$, we obtain an updated estimate of η. We can iterate the second and third steps several times until the estimate converges. We define the final estimate of θ as ${\hat{θ}}_{{\hat{Σ}}_{0}}$ and refer to the procedure as the retrospective generalized integration method (rGIM). In contrast, we call the original integration method that was developed for integrating data from prospective studies (Zhang et al., 2020) the prospective generalized integration method (pGIM).
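The control flow of these three steps can be sketched as a short loop. This is a schematic outline only, not the authors' implementation: `fit_eta` and `estimate_sigma0` are hypothetical callbacks standing in for maximizing (8) at a given V and for the plug-in estimate of Σ₀, and the toy callbacks at the end exist only to exercise the loop on a scalar problem.

```python
def iterate_rgim(fit_eta, estimate_sigma0, identity, tol=1e-8, max_iter=50):
    """Alternate between fitting eta for a given V and re-estimating Sigma_0."""
    eta = fit_eta(identity)               # step 1: start from V = identity
    v = identity
    for _ in range(max_iter):
        v = estimate_sigma0(eta)          # step 2: plug-in estimate of Sigma_0
        new_eta = fit_eta(v)              # step 3: refit with V = estimated Sigma_0
        change = max(abs(a - b) for a, b in zip(new_eta, eta))
        eta = new_eta
        if change < tol:                  # stop once the estimate stabilizes
            break
    return eta, v

# Toy scalar stand-ins (hypothetical), used only to run the loop.
def toy_fit(v):
    return [(1.0 * v + 2.0) / (v + 1.0)]

def toy_sigma(eta):
    return 0.5 + 0.1 * eta[0]

eta_hat, v_hat = iterate_rgim(toy_fit, toy_sigma, identity=1.0)
```

The returned `v_hat` is the weight matrix that produced `eta_hat`, so the pair is internally consistent when the loop exits.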

Although rGIM is designed for case-control studies, it can also be applied to the setting when one or both studies are conducted under a simple random sampling design. This is true because P(X|D) can be inferred properly with a set of random samples of (X, D). Simulation studies presented later also confirm this.

We can extend the aforementioned results to a more general situation where summary data come from multiple independent or partially overlapping external studies. Summary data from multiple models fitted within a given study are also allowed (see Example 2 as an illustration). All these can be achieved by specifying the correct variance–covariance matrix of ${({\tilde{α}}^{t}, {\tilde{β}}^{t})}^{t}$ . (See Web Appendix B for more details.)

2.4 |. Alternative approaches

Instead of treating the summary statistics $\tilde{β}$ as observed random variables, we can set $β = \tilde{β}$ as known and only estimate the other parameters (i.e., θ and α). Qin (2000) and Chatterjee et al. (2016) used this strategy to incorporate summary information from a prospective study, although they did not consider the existence of nuisance parameter α. Using this strategy in case-control studies, we can estimate θ and α by solving the following constrained optimization problem:

({\hat{P}}_{\tilde{β}}, {\hat{θ}}_{\tilde{β}}, {\hat{α}}_{\tilde{β}}) = \underset{(P, θ, α)}{argmax} \sum_{i = 1}^{n_{1}} log p_{i} + \sum_{i = 1}^{n_{1}} D_{i} log Δ (X_{i}; θ),

subject to

\sum_{i = 1}^{n_{1}} p_{i} = 1, \sum_{i = 1}^{n_{1}} p_{i} {Δ (X_{i}; θ) - 1} = 0, \sum_{i = 1}^{n_{1}} p_{i} g (X_{i}; μ_{\tilde{β}}) = 0,

(10)

where $μ_{\tilde{β}} = {(θ^{t}, α^{t}, {\tilde{β}}^{t})}^{t}$ since $β$ is fixed at $\tilde{β}$. We denote the resultant constraint maximum likelihood (CML) estimate as ${\hat{υ}}_{\tilde{β}} = {({\hat{θ}}_{\tilde{β}}^{t}, {\hat{α}}_{\tilde{β}}^{t})}^{t}$. Another alternative estimate of θ is the standard maximum likelihood estimate based on the internal study (MLE-Int), denoted as ${\hat{θ}}_{m l e}$.

In Web Appendix A5, we derive the proper variance-covariance matrix of the CML estimate by accounting for the variability of $\tilde{β}$ and prove the following result.

Corollary 2. The rGIM estimate ${\hat{θ}}_{{\hat{Σ}}_{0}}$ is asymptotically more efficient than the internal data based MLE ${\hat{θ}}_{m l e}$ and the CML estimate ${\hat{θ}}_{\tilde{β}}$.

2.5 |. Different study populations

So far, we have presented procedures assuming that the internal and external studies are conducted within a common underlying population. In some applications, the two studies might be conducted in two source populations (called the internal study population and the external study population), which have different distributions of X. We can expand the proposed procedure to this more general setting.

We assume that the disease risk in the two populations can be specified by the following model:

logit {P (D = 1 ∣ X, s)} = θ^{* (s)} + H (X; θ_{2}),

where s = 1 or 2 indicates the internal or external study population. Thus, we in fact assume that the regression coefficients other than the intercept term are the same in the two risk models. Since we allow the two study populations to have different covariate distributions, each study population has its own covariate distribution in cases and in controls, denoted as P(X|D = 1, s) and P(X|D = 0, s), respectively. Following (2), we know

P (X ∣ D = 1, s) = P (X ∣ D = 0, s) exp {θ^{(s)} + H (X; θ_{2})},

where $θ^{(s)} = θ^{* (s)} - \operatorname{logit} \{P (D = 1 | s)\}$. In addition to the individual-level data and summary statistics, we also require observations on a set of controls $\{X_{i}^{*}, i = 1, \dots, n^{*}\}$ from the external study population and use them as a reference set to estimate P(X|D = 0, s = 2), which is different from P(X|D = 0, s = 1). In Web Appendix C, we extend the rGIM procedure to the setting of different study populations and refer to it as GIM^REF.

3 |. SIMULATION STUDIES

3.1 |. Same internal and external study population

We first considered the scenario when the internal and external studies were conducted within the same study population. We assumed that X = (X₁, X₂) represented genotypes on two genetic markers, called single nucleotide polymorphisms (SNPs), and the true disease risk model was $\operatorname{logit} \{P (D = 1 | X)\} = θ_{0}^{*} + θ_{1} X_{1} + θ_{2} X_{2}$. We further assumed that the two SNPs were located in close proximity and let G = (G₁, G₂) represent the allele type (0 or 1) of the two SNPs on a given chromosome (i.e., haplotype). In the study population, the probabilities of G = (0,0), (0,1), (1,0), and (1,1) were 0.28, 0.12, 0.18, and 0.42, respectively. For each subject, the joint genotype X was given by the sum of two randomly selected haplotypes. We let θ₁ = log(2), θ₂ = log(1.5), and chose θ₀* such that the disease prevalence was around 10%. Each pair of internal and external studies was generated from this population, with the internal study consisting of 500 cases and 500 controls, and the external study consisting of 500 cases and 2000 (or 5000) controls. Based on the external study, the summary data $\tilde{β} = ({\tilde{β}}_{1}, {\tilde{β}}_{2})$ were coefficient estimates from the following two working models: $W_{k} (X | D = 1) = W_{k} (X | D = 0) \exp \{α_{k} + β_{k} X_{k}\}$, k = 1, 2.
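The population described above can be written down exactly. The sketch below rests on two assumptions that are not stated in the text: the two haplotypes of a subject are drawn independently, and the intercept is found by bisection to hit a prevalence of exactly 10% (the text only says "around 10%"). All names are illustrative.

```python
import math
from itertools import product

HAPLOTYPE_PROBS = {(0, 0): 0.28, (0, 1): 0.12, (1, 0): 0.18, (1, 1): 0.42}
THETA1, THETA2 = math.log(2.0), math.log(1.5)

def genotype_distribution(hap_probs):
    """Distribution of X = sum of two independently drawn haplotypes."""
    dist = {}
    for (h1, p1), (h2, p2) in product(hap_probs.items(), repeat=2):
        x = (h1[0] + h2[0], h1[1] + h2[1])
        dist[x] = dist.get(x, 0.0) + p1 * p2
    return dist

def prevalence(theta0, dist):
    """P(D = 1) under the logistic risk model, averaged over the genotype distribution."""
    return sum(p / (1.0 + math.exp(-(theta0 + THETA1 * x1 + THETA2 * x2)))
               for (x1, x2), p in dist.items())

def solve_intercept(target, dist, lo=-20.0, hi=20.0):
    """Bisection on the intercept; prevalence is increasing in theta0."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if prevalence(mid, dist) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

GENOTYPE_DIST = genotype_distribution(HAPLOTYPE_PROBS)
THETA0 = solve_intercept(0.10, GENOTYPE_DIST)
```

Cases and controls can then be sampled from the genotype distribution reweighted by the risk and by one minus the risk, respectively.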

We analyzed each simulated dataset with rGIM, pGIM, CML, and MLE-Int. Except for MLE-Int, the other three methods used summary data consisting of either $({\tilde{β}}_{1}, {\tilde{β}}_{2})$ or ${\tilde{β}}_{1}$. Table 1 summarizes the performance of the four methods over 5000 simulated datasets under each scenario.

TABLE 1.

Simulation results in situations when internal and external studies are conducted under a case-control sampling design in a common source population

	Estimate of θ₁
	MLE-Int	Given ${\tilde{β}}_{1}$			Given $({\tilde{β}}_{1}, {\tilde{β}}_{2})$
	MLE-Int	CML	pGIM	rGIM	CML	pGIM	rGIM
External study: 500 cases and 2000 controls
Bias	0.33	0.19	0.18	0.10	0.18	0.18	0.10
SE-Emp	10.94	9.09	7.66	7.53	8.86	7.14	6.98
SE-Est	11.00	9.20	6.75	7.56	8.92	5.92	6.95
CP	95.36	95.26	91.88	95.14	95.04	89.18	94.94
External study: 500 cases and 5000 controls
Bias	0.33	−0.68	−0.56	−0.45	−0.52	−0.43	−0.34
SE-Emp	10.94	8.66	7.79	7.32	8.34	7.30	6.71
SE-Est	11.00	8.75	5.63	7.34	8.39	4.38	6.69
CP	95.36	95.12	83.86	94.90	95.18	75.60	95.06
	Estimate of θ₂
	MLE-Int	Given ${\tilde{β}}_{1}$			Given $({\tilde{β}}_{1}, {\tilde{β}}_{2})$
	MLE-Int	CML	pGIM	rGIM	CML	pGIM	rGIM
External study: 500 cases and 2000 controls
Bias	0.15	0.15	0.15	0.15	0.21	0.15	0.08
SE-Emp	10.32	10.32	10.32	10.32	8.40	6.80	6.64
SE-Est	10.38	10.38	10.38	10.38	8.36	5.77	6.64
CP	95.46	95.46	95.46	95.46	94.68	90.12	94.82
External study: 500 cases and 5000 controls
Bias	0.15	0.15	0.15	0.15	−0.26	−0.22	−0.20
SE-Emp	10.32	10.32	10.32	10.32	7.87	6.92	6.39
SE-Est	10.38	10.38	10.38	10.38	7.85	4.42	6.39
CP	95.46	95.46	95.46	95.46	94.80	79.32	94.54

The internal study has 500 cases and 500 controls. The external study has 500 cases and 2000 (or 5000) controls. The true model is logit {P(D = 1|G)} = θ₀* + θ₁G₁ + θ₂G₂. Summary data $({\tilde{β}}_{1}, {\tilde{β}}_{2})$ are coefficient estimates based on two working models, logit {P(D = 1|G_k)} = α_k + β_kG_k, k = 1, 2. Empirical bias (bias), empirical standard error (SE-Emp), mean of estimated asymptotic standard error (SE-Est), and coverage probability (CP) of a 95% confidence interval are summarized over 5000 simulated datasets. All numbers are multiplied by 100. Note: MLE-Int: standard MLE based on the internal study; CML: constraint MLE developed for case-control studies; pGIM: GIM procedure developed under the prospective likelihood framework; rGIM: GIM procedure developed under the retrospective likelihood framework.

First, Table 1 shows that the asymptotic distribution of the rGIM estimate presented in Corollary 1 is quite accurate: the estimate is nearly unbiased, and its estimated asymptotic standard error matches the empirical standard error well in all considered settings. Its 95% confidence interval also has the correct coverage probability (CP). Second, as predicted by Corollary 2, rGIM is at least as efficient as MLE-Int and CML. The magnitude of the efficiency gain depends on the available summary data. Specifically, when summary data consist of only ${\tilde{β}}_{1}$, the estimated marginal effect of X₁, rGIM is more efficient than MLE-Int and CML for estimating θ₁ (the true effect of X₁), but all three methods have the same level of efficiency for estimating θ₂. When $({\tilde{β}}_{1}, {\tilde{β}}_{2})$ is used as summary data, the rGIM procedure shows a clear advantage over MLE-Int and CML for estimating both θ₁ and θ₂. Third, the pGIM procedure, which ignores the case-control sampling, can generate incorrect standard error estimates. For example, when using $({\tilde{β}}_{1}, {\tilde{β}}_{2})$ estimated from an external study with 500 cases and 2000 controls, the pGIM estimate of θ₁ has an empirical standard error of 0.071, compared to a mean estimated asymptotic standard error of 0.059. As a result, its 95% confidence interval has only 89% coverage probability. When we increase the number of controls in the external study to 5000, its coverage probability decreases further to 76%.
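The link between an underestimated standard error and undercoverage can be checked with back-of-the-envelope arithmetic. The sketch below is a simplification, not part of the paper's analysis: it treats the estimate as normal and unbiased, and treats the mean estimated standard error as if it were used in every replicate. Under those assumptions, the coverage of a nominal 95% interval is 2Φ(1.96 × SE-Est / SE-Emp) − 1.

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def implied_coverage(se_est, se_emp, z=1.96):
    """Coverage of estimate +/- z*se_est when the true SE is se_emp (normal, unbiased)."""
    return 2.0 * normal_cdf(z * se_est / se_emp) - 1.0

# Table 1 entries for theta_1 given both summary statistics (values are x100).
pgim_2000 = implied_coverage(5.92, 7.14)   # pGIM, 2000 external controls
pgim_5000 = implied_coverage(4.38, 7.30)   # pGIM, 5000 external controls
rgim_2000 = implied_coverage(6.95, 6.98)   # rGIM, 2000 external controls
```

These give approximately 0.90 and 0.76 for pGIM, close to the reported 89.18 and 75.60, and approximately 0.95 for rGIM, which suggests that the undercoverage of pGIM is driven mainly by its standard error estimate rather than by bias.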

On a 2.30 GHz Linux machine, the running time of MLE-Int on an internal study of 500 cases and 500 controls is about 0.004 s. rGIM and CML procedures integrating summary data $({\tilde{β}}_{1}, {\tilde{β}}_{2})$ with the same internal study take about 6.1 and 3.3 s, respectively. rGIM is slower than CML as it needs to reestimate (β₁, β₂), instead of assuming $β_{i} = {\tilde{β}}_{i}$ , i = 1, 2.

3.2 |. Different internal and external study populations

We considered the scenario when the internal and external studies were conducted in two different study populations. The internal study population was the same as the one given in Section 3.1. For the external study population, the distribution of haplotype G was specified as G = (0,0), (0,1), (1,0), and (1,1) with probabilities of 0.14, 0.06, 0.56, and 0.24, respectively. We assumed that the same aforementioned disease risk model applies to both populations. For each pair of simulated internal and external studies, we also generated an additional set of 300 controls from the external population and used them as reference samples in the GIM^REF procedure. Simulation results are summarized in Table 2. We did not present results from the pGIM procedure as we have shown in Section 3.1 that pGIM was not appropriate for case-control studies.

TABLE 2.

Simulation results in situations when internal and external studies are conducted under a case-control sampling design in two different source populations

	Estimate of θ₁
	MLE-Int	Given ${\tilde{β}}_{1}$		Given $({\tilde{β}}_{1}, {\tilde{β}}_{2})$
	MLE-Int	GIM^REF	rGIM	GIM^REF	rGIM
External study: 500 cases and 2000 controls
Bias	0.33	−0.10	−9.83	0.06	−4.62
SE-Emp	10.94	7.79	8.60	7.59	8.93
SE-Est	11.00	7.89	7.27	7.66	6.70
CP	95.36	95.32	69.06	95.02	80.42
External study: 500 cases and 5000 controls
Bias	0.33	−0.47	−10.72	−0.24	−5.21
SE-Emp	10.94	7.66	8.45	7.44	8.73
SE-Est	11.00	7.72	6.99	7.48	6.41
CP	95.36	95.12	62.28	95.08	76.70
	Estimate of θ₂
	MLE-Int	Given ${\tilde{β}}_{1}$		Given $({\tilde{β}}_{1}, {\tilde{β}}_{2})$
	MLE-Int	GIM^REF	rGIM	GIM^REF	rGIM
External study: 500 cases and 2000 controls
Bias	0.15	0.26	0.15	−0.06	−13.96
SE-Emp	10.32	10.07	10.32	6.19	7.22
SE-Est	10.38	10.08	10.38	6.20	6.27
CP	95.46	95.48	95.46	95.16	40.64
External study: 500 cases and 5000 controls
Bias	0.15	0.38	0.15	−0.25	−14.77
SE-Emp	10.32	10.05	10.32	5.98	6.96
SE-Est	10.38	10.06	10.38	5.95	5.97
CP	95.46	95.74	95.44	94.72	32.72

The internal study has 500 cases and 500 controls. The external study has 500 cases and 2000 (or 5000) controls. The reference set consists of 300 controls from the external population. The same risk model logit {P(D = 1|G)} = θ₀* + θ₁G₁ + θ₂G₂ is assumed for both populations. Summary data $({\tilde{β}}_{1}, {\tilde{β}}_{2})$ are coefficient estimates based on two working models, logit {P(D = 1|G_k)} = α_k + β_kG_k, k = 1, 2. Empirical bias (Bias), empirical standard error (SE-Emp), mean of estimated asymptotic standard error (SE-Est), and coverage probability (CP) of a 95% confidence interval are summarized over 5000 simulated datasets. All numbers are multiplied by 100. Note: MLE-Int: standard MLE based on the internal study; GIM^REF: retrospective GIM procedure with a reference set developed for the setting in which external and internal studies are conducted in two different populations; rGIM: retrospective GIM procedure with the assumption that two studies are conducted in a common population.
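The four summary measures defined in the table note can be computed from simulation replicates along the following lines. This is a minimal sketch: the function name and the toy inputs are hypothetical, and the use of the sample standard deviation (n − 1 denominator) for SE-Emp is an assumption, since the source does not specify it.

```python
import statistics

def summarize_replicates(estimates, std_errors, truth, z=1.96):
    """Summarize simulation replicates; all outputs are multiplied by 100."""
    n = len(estimates)
    bias = statistics.fmean(estimates) - truth      # empirical bias
    se_emp = statistics.stdev(estimates)            # empirical SD of the estimates
    se_est = statistics.fmean(std_errors)           # mean estimated asymptotic SE
    covered = sum(
        1 for est, se in zip(estimates, std_errors)
        if est - z * se <= truth <= est + z * se    # does the 95% CI contain the truth?
    )
    return {
        "Bias": 100 * bias,
        "SE-Emp": 100 * se_emp,
        "SE-Est": 100 * se_est,
        "CP": 100 * covered / n,
    }

# Hypothetical replicates, for illustration only (not data from the paper).
toy = summarize_replicates(
    estimates=[0.10, 0.30, 0.20, 0.45],
    std_errors=[0.10, 0.10, 0.10, 0.10],
    truth=0.20,
)
```

In this toy input, three of the four intervals contain the true value, so the coverage entry is 75.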

Results in Table 2 demonstrate that GIM^REF has the desired performance when dealing with studies from different populations. GIM^REF is also more efficient than MLE-Int. On the other hand, the rGIM procedure, which is designed for data collected from a common study population, can generate inconsistent estimates and erroneous standard error estimates (Table 2).
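To see how bias and a misestimated standard error combine to degrade coverage, one can approximate the coverage probability from the tabulated Bias, SE-Emp, and SE-Est. The sketch below assumes the estimator is normally distributed with mean equal to the truth plus the bias and standard deviation equal to SE-Emp, and that every interval uses the mean SE-Est; these simplifications are assumptions of this illustration, not part of the paper's procedure.

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def approx_coverage(bias, se_emp, se_est, z=1.96):
    """Approximate coverage of estimate +/- z * se_est when the estimator is
    normal with mean (truth + bias) and standard deviation se_emp."""
    upper = (z * se_est - bias) / se_emp
    lower = (-z * se_est - bias) / se_emp
    return normal_cdf(upper) - normal_cdf(lower)

# Table 2, theta_2, summary data (beta_1, beta_2), external study with 2000 controls.
cp_rgim = approx_coverage(bias=-13.96, se_emp=7.22, se_est=6.27)  # rGIM
cp_ref = approx_coverage(bias=-0.06, se_emp=6.19, se_est=6.20)    # GIM^REF
```

Under this approximation the two values come out at roughly 0.41 and 0.95, in line with the reported coverage probabilities of 40.64 and 95.16.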

3.3 |. Studies under different designs

We conducted additional simulations to evaluate the performance of the retrospective likelihood approach in situations when one or both studies were conducted under a simple random sampling design.

First, we considered the scenario in which the internal study was a case-control study with 500 cases and 500 controls from the source population defined in Section 3.1, while the external study consisted of 5500 subjects randomly sampled from the same source population. For the purpose of illustration, we only present results from analyses using $({\tilde{β}}_{1}, {\tilde{β}}_{2})$ as summary data. Results shown in Table 3 indicate that both rGIM and CML estimates are consistent, with proper estimated standard errors, but rGIM is more efficient than CML and MLE-Int. On the other hand, the prospective likelihood approach pGIM does not estimate the standard error correctly. This is expected as pGIM ignores the sampling bias in the internal data. We further considered the situation in which the external study had its 5500 subjects randomly sampled from a different source population (as defined in Section 3.2). In this setup, we also generated a reference set of 300 controls from the external study population and used it as part of the input for GIM^REF. Results summarized in Table 4 indicate that GIM^REF has the expected performance in this setting.

TABLE 3.

Simulation results in situations when the internal study is conducted under a case-control sampling design in a source population, and the external study is conducted under a simple random sampling design in the same population

	Estimate of θ₁
	MLE-Int	CML	pGIM	rGIM
Bias	0.33	0.10	0.09	0.04
SE-Emp	10.94	7.95	6.97	6.50
SE-Est	11.00	8.07	4.38	6.53
CP	95.36	95.78	77.62	95.60
	Estimate of θ₂
	MLE-Int	CML	pGIM	rGIM
Bias	0.15	0.15	0.13	0.06
SE-Emp	10.32	7.57	6.67	6.23
SE-Est	10.38	7.56	4.43	6.24
CP	95.46	94.70	81.26	95.00

The internal study has 500 cases and 500 controls. The external study consists of 5500 randomly sampled subjects. The true risk model is logit {P(D = 1|G)} = θ₀* + θ₁G₁ + θ₂G₂. Summary data $({\tilde{β}}_{1}, {\tilde{β}}_{2})$ are coefficient estimates from two working models, logit {P(D = 1|G_k)} = α_k + β_kG_k, k = 1, 2. Empirical bias (Bias), empirical standard error (SE-Emp), mean of estimated asymptotic standard error (SE-Est), and coverage probability (CP) of a 95% confidence interval are summarized over 5000 simulated datasets. All numbers are multiplied by 100. Note: MLE-Int: standard MLE based on the internal study; CML: constraint MLE developed for case-control studies; pGIM: GIM procedure developed under the prospective likelihood framework; rGIM: GIM procedure developed under the retrospective likelihood framework.
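The efficiency comparison in Table 3 can be expressed as a relative efficiency, that is, a ratio of empirical variances. The snippet below is a small illustration using the SE-Emp values for θ₁ copied from the table; the function name is hypothetical, and defining relative efficiency as a squared ratio of empirical standard errors is a choice made here for illustration.

```python
# SE-Emp values (multiplied by 100) for theta_1, copied from Table 3.
SE_EMP_THETA1 = {"MLE-Int": 10.94, "CML": 7.95, "pGIM": 6.97, "rGIM": 6.50}

def relative_efficiency(se_reference, se_method):
    """Variance ratio; values above 1 mean the second estimator is more precise."""
    return (se_reference / se_method) ** 2

re_vs_mle = relative_efficiency(SE_EMP_THETA1["MLE-Int"], SE_EMP_THETA1["rGIM"])
re_vs_cml = relative_efficiency(SE_EMP_THETA1["CML"], SE_EMP_THETA1["rGIM"])
```

With these inputs, rGIM has roughly 2.8 times the efficiency of MLE-Int and roughly 1.5 times that of CML for θ₁.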

TABLE 4.

Simulation results in situations when the internal study is conducted under a case-control sampling design in one population, and the external study is conducted under a simple random sampling design in a different population

	Estimate of θ₁
	MLE-Int	rGIM	GIM^REF
Bias	0.33	−4.82	0.22
SE-Emp	10.94	8.66	7.31
SE-Est	11.00	6.24	7.25
CP	95.36	77.46	95.22
	Estimate of θ₂
	MLE-Int	rGIM	GIM^REF
Bias	0.15	−15.21	0.04
SE-Emp	10.32	6.93	6.01
SE-Est	10.38	5.83	5.84
CP	95.46	29.42	94.22

The internal study has 500 cases and 500 controls. The external study consists of 5500 randomly sampled subjects. The same risk model logit {P(D = 1|G)} = θ₀* + θ₁G₁ + θ₂G₂ is assumed for both populations. Summary data $({\tilde{β}}_{1}, {\tilde{β}}_{2})$ are coefficient estimates from two working models, logit {P(D = 1|G_k)} = α_k + β_kG_k, k = 1, 2. Empirical bias (Bias), empirical standard error (SE-Emp), mean of estimated asymptotic standard error (SE-Est), and coverage probability (CP) of a 95% confidence interval are summarized over 5000 simulated datasets. All numbers are multiplied by 100. Note: MLE-Int: standard MLE based on the internal study; GIM^REF: retrospective GIM procedure with a reference set developed for the setting in which external and internal studies are conducted in two different populations; rGIM: retrospective GIM procedure with the assumption that two studies are conducted in a common population.

Next, we considered situations when the internal study was generated under a simple random sampling design with 5500 subjects, and the external study of 500 cases and 500 controls was compiled under a case-control sampling design in the same or a different source population (see Web Appendix Tables 1 and 2). From those two tables, we can draw similar conclusions as those from Tables 3 and 4, respectively. Finally, we considered the setting in which both internal and external studies were carried out under a simple random sampling design, with each consisting of 5500 subjects. Web Appendix Table 3 shows simulation results in the scenario when both studies are conducted in the same source population. Web Appendix Table 4 shows results when the two studies are sampled from different populations. Again, we can see from both tables that the retrospective likelihood approach works fine. As expected, the prospective approach (pGIM) also performs well in this setting as data are collected from prospective studies.

4 |. REAL APPLICATION

We illustrated the application of our method by applying it to two genome-wide association studies (GWAS) of pancreatic cancer (Amundadottir et al., 2009; Petersen et al., 2010). Both GWAS had genotypes measured on over half a million SNPs. Genotypes on additional SNPs were obtained by imputation. We focused on subjects with predominantly European ancestry from the two studies. The first GWAS (called PanScan I) consisted of 1761 cases and 1804 controls. The second GWAS (called PanScan II) had 1768 cases and 1851 controls (PanScan, 2015).

In this application, we concentrated on genes from the PredictDB Data Repository defined by gene expression in pancreatic tissues (Gamazon et al., 2015; Barbeira et al., 2018). For a given gene, PredictDB provided a prediction model that used genotypes on a set of SNPs selected around the neighborhood of the gene to predict its gene expression level. The prediction model can be represented as $ε(X) = \sum_{k = 1}^{m} w_{k} X_{k}$, with X = (X₁, …, X_m) being genotypes on the set of m selected SNPs, and w = (w₁, …, w_m) being their corresponding weights. A transcriptome-wide association study (TWAS) searches for genes associated with the outcome D by assessing the correlation between ε(X) and D, assuming w is known.
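The predicted expression level is simply a weighted sum of genotypes. A minimal sketch follows; the weights and genotypes are hypothetical values chosen for illustration and are not taken from PredictDB.

```python
def predicted_expression(genotypes, weights):
    """Weighted genotype score sum_k w_k * X_k for one subject."""
    if len(genotypes) != len(weights):
        raise ValueError("genotypes and weights must have the same length")
    return sum(w * x for w, x in zip(weights, genotypes))

# Hypothetical weights and genotypes for m = 3 SNPs (illustration only).
weights = [0.5, -0.25, 0.125]
subject = [2, 1, 0]  # minor-allele counts for one subject
score = predicted_expression(subject, weights)
```

Because w is treated as known, this score is computed once per subject and then enters the association model as a single covariate.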

We considered PanScan I GWAS as the internal study. From the independent PanScan II GWAS, we randomly selected a set of 500 controls as reference samples and treated the remaining data as the external study, based on which summary data were derived. To assess each gene’s association with pancreatic cancer, we used the model $\text{logit}\{P(D = 1 | X, Z)\} = Z^{t} ϕ + θ ε(X)$, where θ is the parameter of interest and Z is the set of adjusted covariates, including the intercept, gender, and top five eigenvectors identified in the two combined GWAS. We assumed the summary data $\tilde{β} = ({\tilde{β}}_{1}, \dots, {\tilde{β}}_{m})$ were coefficient estimates from the following working models fitted with data from the external study, $W_{k}(X | D = 1) = W_{k}(X | D = 0) \exp\{α_{k} + β_{k} X_{k}\}$, k = 1, …, m.
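The working models above are written in retrospective (exponential tilting) form. A short derivation using only Bayes' rule shows why β_k can still be read as a log odds ratio. Here π denotes P(D = 1) in the external source population, a symbol introduced only for this sketch:

$$
\frac{P(D = 1 | X)}{P(D = 0 | X)} = \frac{W_{k}(X | D = 1)\, π}{W_{k}(X | D = 0)\, (1 - π)} = \exp\left\{α_{k} + \log\frac{π}{1 - π} + β_{k} X_{k}\right\}.
$$

Under the k-th working model, the tilt form therefore corresponds to a logistic model with the same slope β_k and an intercept shifted by log{π/(1 − π)}.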

In the integrative analysis, we only used summary information on SNPs whose pairwise correlation was less than 0.95 in order to avoid the problem of collinearity, although we still included all SNPs in the definition of ε(X). For some genes, the number of summary statistics can be more than 60.
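One way to select SNPs whose pairwise correlation stays below 0.95 is a greedy pass over the SNP list. The sketch below is an assumption about how such a filter might be implemented; the source does not describe the exact pruning procedure, the ordering of SNPs, or whether the absolute correlation is used. The function names and genotype values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def prune_snps(genotype_columns, threshold=0.95):
    """Greedily keep SNP indices whose |correlation| with every kept SNP is below threshold."""
    kept = []
    for j, column in enumerate(genotype_columns):
        if all(abs(pearson(column, genotype_columns[i])) < threshold for i in kept):
            kept.append(j)
    return kept

# Hypothetical genotypes (one list per SNP, one entry per subject).
snps = [
    [0, 1, 2, 1, 0, 2],
    [0, 1, 2, 1, 0, 2],  # identical to SNP 0, so it is dropped
    [2, 0, 1, 0, 2, 1],
]
kept = prune_snps(snps)  # indices of SNPs whose summary statistics would be used
```

Only the summary statistics of the retained SNPs would enter the integrative analysis, while ε(X) would still be computed from all m SNPs. A monomorphic SNP (constant column) would need to be excluded beforehand, since its correlation is undefined.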

We analyzed a total of 5832 genes from PredictDB Data Repository with MLE-Int and GIM^REF. In Figure 1, we showed Q–Q plots for the two analyses. Although there is no globally significant gene (based on the Bonferroni threshold) detected by either method, there is more suggestive evidence of association indicated at the tail end of the Q–Q plot produced by GIM^REF, than that by MLE-Int. This is expected as the GIM method utilizes more information than MLE-Int.
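For reference, the Bonferroni threshold and the expected null quantiles used in a Q–Q plot can be computed as follows. The significance level of 0.05 and the (i − 0.5)/n plotting positions are assumptions of this sketch; the source states neither.

```python
import math

N_GENES = 5832  # number of genes analyzed

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

def expected_neglog10_p(n_tests):
    """Expected -log10(p) under the null, largest first, using (i - 0.5) / n positions."""
    return [-math.log10((i - 0.5) / n_tests) for i in range(1, n_tests + 1)]

threshold = bonferroni_threshold(N_GENES)  # about 8.6e-06 at alpha = 0.05
expected = expected_neglog10_p(N_GENES)    # x-axis values of the Q-Q plot
```

Plotting the sorted observed −log10 p-values against `expected` gives the Q–Q plot; points above the diagonal at the upper tail indicate more small p-values than expected under the null.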

FIGURE 1.

Q–Q plots for TWAS results based on pancreatic cancer GWAS. Each of 5832 considered genes is analyzed with the standard MLE based on the internal study (MLE-Int), as well as the retrospective GIM procedure with an additional reference set (GIM^REF)

Zhong et al. (2020) recently conducted a TWAS analysis of pancreatic cancer using a different strategy for defining the gene expression prediction model. Based on a much larger set of GWAS data, including those from PanScan I and II, they identified 12 pancreatic cancer associated genes that passed the Bonferroni threshold. Seven of those genes are among the set of 5832 genes we analyzed. Results on those seven genes are summarized in Table 5. Again, it appears that GIM^REF can detect more signals than MLE-Int.

TABLE 5.

Gene-level association testing results based on pancreatic cancer GWAS

Gene	MLE-Int		GIM^REF
Gene	Est (SE)	p-value	Est (SE)	p-value
KLF5	0.571 (0.850)	5.02E–01	0.806 (0.629)	2.00E–01
PDX1	−0.559 (0.383)	1.45E–01	−0.970 (0.287)	7.25E–04
WDR59	0.538 (0.362)	1.37E–01	0.575 (0.288)	4.59E–02
CFDP1	−0.060 (0.371)	8.72E–01	0.232 (0.288)	4.20E–01
CELA3B	−0.528 (0.368)	1.51E–01	−0.818 (0.287)	4.36E–03
INHBA	−1.658 (0.461)	3.21E–04	−0.912 (0.344)	8.07E–03
ABO	0.121 (0.067)	7.16E–02	0.103 (0.051)	4.38E–02

Results are given on seven genes that have been established to be associated with pancreatic cancer by Zhong et al. (2020). Note: MLE-Int: standard MLE based on the internal study (PanScan I); GIM^REF: retrospective GIM procedure with a reference set developed for the setting in which external (PanScan II) and internal (PanScan I) studies are conducted in two different populations.
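The p-values in Table 5 can be approximately reproduced from the reported estimates and standard errors. The sketch below assumes a two-sided Wald test based on the normal approximation; the source does not state which test was used, so this is an assumption, and small discrepancies are expected because Est and SE are rounded.

```python
import math

def wald_p_value(estimate, std_error):
    """Two-sided p-value for H0: theta = 0 from a normal (Wald) approximation."""
    z = abs(estimate / std_error)
    return math.erfc(z / math.sqrt(2.0))  # equals 2 * (1 - Phi(z))

# Est (SE) for PDX1, copied from Table 5.
p_mle_int = wald_p_value(-0.559, 0.383)  # MLE-Int
p_gim_ref = wald_p_value(-0.970, 0.287)  # GIM^REF
```

For PDX1, these come out close to the tabulated 1.45E–01 and 7.25E–04.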

5 |. DISCUSSION

We have developed an integrative procedure for effectively combining aggregated summary data with detailed individual-level data. We adopt a retrospective likelihood framework to account for the sampling bias resulting from the case-control study design. The procedure is flexible enough to incorporate summary data generated from distinct working models adopted by multiple external studies. It also allows for external studies to be conducted in a population different from the internal study population, provided that individual-level data on a set of reference control samples are available from the external study population. We establish asymptotic properties for the procedure and prove that its estimate is at least as efficient as the MLE based on the internal study and the constraint MLE procedure, which derives its estimate under the restriction imposed by the summary data.

We demonstrate that it is important to adjust for the sampling bias in integrative analyses. The prospective likelihood approach that ignores the case-control sampling design can generate inaccurate standard error estimates. Although the proposed procedure is developed specifically for integrating data from multiple case-control studies, it can also be used for studies conducted under a simple random sampling design. We show through simulation studies that the proposed retrospective likelihood approach has the desired performance for internal and external studies conducted under either a case-control or a simple random sampling design. Another advantage of our procedure is that it can combine individual-level and summary data taken from different underlying populations, as long as a set of reference control samples is available from each external study population. The reference set is critical to ensure a correct estimate of the covariate distribution in the underlying population. Since the reference set only requires a random sample of controls, it is appealing in practice when case samples are not easily accessible. This feature is especially useful for genetic association studies, as ample reference genomes with different ethnic backgrounds are available from public resources. In situations where such reference samples are not available, our simulation studies suggest that applying the procedure under the assumption of a common source population can generate biased estimates. Further investigations are needed to develop procedures that can provide more robust estimates in this setting.

Supplementary Material

supporting information

NIHMS1747007-supplement-supporting_information.pdf (460.1 KB, pdf)

ACKNOWLEDGEMENTS

The study utilized the computational resource of the NIH Biowulf cluster (https://hpc.nih.gov/).

Footnotes

SUPPORTING INFORMATION

Web Appendices and Tables referenced in Sections 2.3, 2.4, 2.5, and 3.3 are available with this article at the Biometrics website on Wiley Online Library. We have implemented the proposed procedure in the R package “gim”, which can be obtained from https://CRAN.R-project.org/package=gim. Example code showing how to use the gim package is posted at the Biometrics website.

DATA AVAILABILITY STATEMENT

The data that support the findings in this paper are openly available in the database of Genotypes and Phenotypes (dbGaP) at https://www.ncbi.nlm.nih.gov/gap/, with dbGaP Study Accession # phs000206.v5.p3.

REFERENCES

Amundadottir L, Kraft P, Stolzenberg-Solomon RZ, Fuchs CS, Petersen GM, Arslan AA, et al. (2009) Genome-wide association study identifies variants in the ABO locus associated with susceptibility to pancreatic cancer. Nature Genetics, 41, 986–990.
Barbeira AN, Dickinson SP, Bonazzola R, Zheng J, Wheeler HE, Torres JM, et al. (2018) Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nature Communications, 9, 1825.
Breslow NE, and Day NE (1980) Statistical methods in cancer research. Volume I – The analysis of case-control studies. IARC Scientific Publications, 32, 5–338.
Carroll RJ, Wang S, and Wang CY (1995) Prospective analysis of logistic case-control studies. Journal of the American Statistical Association, 90, 157–169.
Chatterjee N, Chen YH, Maas P, and Carroll RJ (2016) Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. Journal of the American Statistical Association, 111, 107–117.
Chaudhuri S, Handcock MS, and Rendall MS (2008) Generalised linear models incorporating population level information: An empirical likelihood based approach. Journal of the Royal Statistical Society Series B (Statistical Methodology), 70, 311–328.
Chen J, and Sitter RR (1999) A pseudo empirical likelihood approach to the effective use of auxiliary information in complex surveys. Statistica Sinica, 9, 385–406.
Cheng W, Taylor JMG, Gu T, Tomlins SA, and Mukherjee B (2019) Informing a risk prediction model for binary outcomes with external coefficient information. Journal of the Royal Statistical Society, Series C (Applied Statistics), 68, 121–139.
Cheng W, Taylor JMG, Vokonas PS, Park SK, and Mukherjee B (2018) Improving estimation and prediction in linear regression incorporating external information from an established reduced model. Statistics in Medicine, 37, 1515–1530.
Gamazon ER, Wheeler HE, Shah KP, Mozaffari SV, Aquino-Michaels K, Carroll RJ, et al. (2015) A gene-based association method for mapping traits using reference transcriptome data. Nature Genetics, 47, 1091–1098.
Han P, and Lawless JF (2019) Empirical likelihood estimation using auxiliary summary information with different covariate distributions. Statistica Sinica, 29, 1321–1342.
Huang CY, and Qin J (2020) A unified approach for synthesizing population-level covariate effect information in semiparametric estimation with survival data. Statistics in Medicine, 39, 1573–1590.
Imbens GW, and Lancaster T (1994) Combining micro and macro data in microeconometric models. The Review of Economic Studies, 61, 655–680.
Kundu P, Tang R, and Chatterjee N (2019) Generalized meta-analysis for multiple regression models across studies with disparate covariate information. Biometrika, 106, 567–585.
PanScan (2015) Whole genome scan for pancreatic cancer risk in the Pancreatic Cancer Cohort Consortium and Pancreatic Cancer Case-Control Consortium (PanScan). dbGaP Study Accession # phs000206.v5.p3.
Petersen GM, Amundadottir L, Fuchs CS, Kraft P, Stolzenberg-Solomon RZ, Jacobs ZB, et al. (2010) A genome-wide association study identifies pancreatic cancer susceptibility loci on chromosomes 13q22.1, 1q32.1 and 5p15.33. Nature Genetics, 42, 224–228.
Prentice RL, and Pyke R (1979) Logistic disease incidence models and case-control studies. Biometrika, 66, 403–411.
Qin J (2000) Combining parametric and empirical likelihoods. Biometrika, 87, 484–490.
Qin J, and Zhang B (1997) A goodness-of-fit test for logistic regression models based on case-control data. Biometrika, 84, 609–618.
Qin J, Zhang H, Li P, Albanes D, and Yu K (2015) Using covariate-specific disease prevalence information to increase the power of case-control studies. Biometrika, 102, 169–180.
White H (1982) Maximum likelihood estimation of misspecified models. Econometrica, 50, 1–25.
Zhang H, Deng L, Schiffman M, Qin J, and Yu K (2020) Generalized integration model for improved statistical inference by leveraging external summary data. Biometrika, 107, 689–703.
Zhong J, Jermusyk A, Wu L, Hoskins JW, Collins I, Mocci E, et al. (2020) A transcriptome-wide association study identifies novel candidate susceptibility genes for pancreatic cancer. Journal of the National Cancer Institute, 112, 1003–1012.
