Abstract
It is often challenging to share detailed individual-level data among studies due to various informatics and privacy constraints. However, it is relatively easy to pool together aggregated summary level data, such as the ones required for standard meta-analyses. Focusing on data generated from case-control studies, we present a flexible inference procedure that integrates individual-level data collected from an “internal” study with summary data borrowed from “external” studies. This procedure is built on a retrospective empirical likelihood framework to account for the sampling bias in case-control studies. It can incorporate summary statistics extracted from various working models adopted by multiple independent or overlapping external studies. It also allows for external studies to be conducted in a population that is different from the internal study population. We show both theoretically and numerically its efficiency advantage over several competing alternatives.
Keywords: case-control studies, empirical likelihood, estimating equations, Lagrange multiplier, meta-analysis, retrospective likelihood
1 |. INTRODUCTION
In the era of big data, collaborative multicenter studies are often carried out to study a disease outcome, with detailed individual-level data being collected by participating centers. If individual-level data from all studies are available, the most efficient way to draw inference is to conduct a pooled analysis by applying a unified statistical model to all data. However, sharing of individual-level data can be challenging due to various informatics and privacy constraints. Also, meta-analysis of summary data (i.e., estimated coefficients) generated from participating studies can be challenging when summary data are derived from different working models (e.g., varying sets of covariates, or inconsistent covariate definitions).
We consider a setting where researchers have collected individual-level data in their own study (the internal study), and in the meantime can acquire summary data from published literature or other studies (external studies). Since the case-control sampling design is most commonly used for studying a binary disease outcome (Breslow and Day, 1980), we focus on integrating data from case-control studies. The goal is to develop a flexible statistical inference framework that can effectively synthesize all information from individual-level and aggregated summary data.
A number of procedures based on the empirical likelihood have been proposed to achieve the goal of integrative analysis (Chen and Sitter, 1999; Qin, 2000; Chaudhuri et al., 2008; Chatterjee et al., 2016; Han and Lawless, 2019; Zhang et al., 2020). The summary data can be quite general, as long as they satisfy a set of constraint equations defined by certain population moment conditions. For example, the summary data can be the population mean, disease prevalence in a given stratum, or estimates of coefficients in a working regression model chosen by an external study (Chatterjee et al., 2016). Those procedures obtain their estimates by maximizing the empirical likelihood of observed individual-level data, under moment condition constraints imposed by summary information. Other procedures based on the generalized method of moments (Imbens and Lancaster, 1994; Kundu et al., 2019; Huang and Qin, 2020) and Bayesian approaches (Cheng et al., 2018, 2019) have also been proposed.
Most existing procedures are built on the prospective likelihood approach, which focuses on modeling the probability of the disease outcome given covariates under the assumption that both individual-level and summary data come from prospective studies. They cannot be directly applied to data from case-control studies without further investigation. Qin et al. (2015) developed a procedure to improve the efficiency of case-control studies by utilizing knowledge on stratum-specific disease prevalence. Chatterjee et al. (2016) incorporated more general summary information into the analysis of an internal case-control study but assumed that summary data were derived from prospective studies. Both Qin et al. (2015) and Chatterjee et al. (2016) treated summary data as known parameters without any uncertainty. However, as shown by Zhang et al. (2020), this strategy is not optimal for integrating summary data with unignorable variability.
We present a retrospective likelihood approach for the integrative analysis of data from multiple case-control studies. Following Zhang et al. (2020), we treat both individual-level and summary data as observed random variables and derive their joint likelihood function. In order to account for the sampling bias in the case-control study design, the individual-level data are modeled with a retrospective empirical likelihood, which specifies the distribution of covariates given the disease status. Moment conditions satisfied by the summary data are used as constraints on the space of the parameters under investigation. As those constraints narrow the search region for the unknown parameters, they can help to reduce the uncertainty of parameter estimates. We show in theory and through simulation studies that estimates derived from the joint likelihood subject to those constraints are more efficient than existing approaches. A real example is used to illustrate the application of the proposed procedure.
2 |. METHOD
2.1 |. Setup and notations
We assume that we have a case-control study (called the internal study) of a binary disease outcome D and a set of covariates X. The study consists of n1 subjects, with n1,0 controls (D = 0) and n1,1 cases (D = 1), n1 = n1,0 + n1,1. We represent the individual-level data for this internal study as (Xi, Di), i = 1, … , n1, with the first n1,0 subjects being controls, and the remaining n1,1 subjects being cases. We further assume that the following logistic regression model is correctly specified as the underlying risk model:
$$\operatorname{logit}\{P(D = 1 \mid X)\} = \theta_1^* + H(X; \theta_2), \tag{1}$$
where H(X; θ2) is a given function, such as, H(X; θ2) = XT θ2, with θ2 being the set of parameters of interest. By Bayes’ formula, model (1) specifies the following connection between covariate distributions in cases and in controls (Qin and Zhang, 1997),
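$$P(X \mid D = 1) = P(X \mid D = 0)\exp\{\theta_1 + H(X; \theta_2)\}, \tag{2}$$
where θ = (θ1, θ2) with θ1 = θ1* + log{(1 − π)/π}, and π = P(D = 1). Since we consider data from case-control studies, we hereafter adopt this retrospective form (2) to represent the underlying model.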
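The link between the prospective and retrospective forms follows from Bayes' formula: the case-to-control density ratio equals the disease odds given X multiplied by (1 − π)/π, where π = P(D = 1). The sketch below checks this numerically for a discrete covariate. The covariate distribution, the parameter values, and the linear choice H(X; θ2) = Xθ2 are illustrative assumptions, not quantities taken from the paper.

```python
import numpy as np

# Hypothetical discrete covariate and its population distribution (assumed values).
x = np.array([0.0, 1.0, 2.0])
p_x = np.array([0.5, 0.3, 0.2])

# Assumed prospective intercept and slope, with H(X; theta2) = X * theta2.
theta1_star, theta2 = -2.0, 0.7
risk = 1.0 / (1.0 + np.exp(-(theta1_star + theta2 * x)))  # P(D = 1 | X = x)

prev = np.sum(p_x * risk)                      # pi = P(D = 1)
p_x_case = p_x * risk / prev                   # P(X = x | D = 1) by Bayes' formula
p_x_ctrl = p_x * (1.0 - risk) / (1.0 - prev)   # P(X = x | D = 0) by Bayes' formula

# Retrospective intercept implied by the prospective intercept and the prevalence.
theta1 = theta1_star + np.log((1.0 - prev) / prev)
tilt = np.exp(theta1 + theta2 * x)

# The case distribution should equal the exponentially tilted control distribution.
max_gap = np.max(np.abs(p_x_case - p_x_ctrl * tilt))
print(max_gap)
```

The tilted control distribution also sums to one, which is the normalizing constraint that the retrospective empirical likelihood imposes on the control weights.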
To draw inference on θ2 based on the internal study, the standard prospective likelihood procedure can be used (Prentice and Pyke, 1979). Here we briefly review the equivalent retrospective empirical likelihood approach, as our proposed procedure is built on this framework. Denote by pi, i = 1, … , n1, the empirical version of P(X|D = 0) supported on samples from the internal study. Following Qin and Zhang (1997), θ can be inferred by maximizing the following empirical log-likelihood function:
| (3) |
subject to constraint equations , and . After profiling out pi, the empirical likelihood estimate of θ is the stationary point of the profile log-likelihood,
with . It is evident that the estimate of θ2 (not θ1) based on this function is exactly the same as the one based on the prospective likelihood specified by (1). Furthermore, similar to the result of Prentice and Pyke (1979), Qin and Zhang (1997) showed that the asymptotic distribution of the empirical likelihood estimate is the same as the one based on the prospective likelihood if .
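The equivalence with the prospective fit can be checked numerically. The sketch below assumes that, up to an additive constant, the profile log-likelihood takes the familiar form Σ_cases η_i − Σ_all log{1 + ρ1 exp(η_i)} with η_i = θ1 + X_i θ2 and ρ1 = n1,1/n1,0; this form is stated here as an assumption, since it is the standard result of Qin and Zhang (1997) rather than an expression quoted from the text. Under that assumption, the prospective logistic estimate with its intercept shifted by −log ρ1 should be a stationary point. The simulated data and function names are illustrative.

```python
import numpy as np

def fit_logistic(X, d, n_iter=50):
    """Newton-Raphson for prospective logistic regression; X includes an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (d - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-12:
            break
    return beta

def profile_score(theta, X, d, rho):
    """Gradient of sum_cases(eta) - sum_all log(1 + rho * exp(eta)), with eta = X @ theta."""
    w = rho * np.exp(X @ theta)
    return X.T @ (d - w / (1.0 + w))

# Toy case-control sample: controls first, then cases (assumed covariate distributions).
rng = np.random.default_rng(0)
n0, n1 = 400, 200
x = np.concatenate([rng.normal(0.0, 1.0, n0), rng.normal(0.8, 1.0, n1)])
d = np.concatenate([np.zeros(n0), np.ones(n1)])
X = np.column_stack([np.ones(n0 + n1), x])

beta_hat = fit_logistic(X, d)                       # prospective fit
rho = n1 / n0
theta_hat = beta_hat - np.array([np.log(rho), 0.0]) # shift the intercept only
score_at_theta = profile_score(theta_hat, X, d, rho)
print(score_at_theta)
```

Only the intercept differs between the two parameterizations, which is why the slope estimate is shared.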
Our goal is to estimate by integrating individual-level data from the internal study and summary data (i.e., coefficient estimates) extracted from other case-control studies (called the external studies). We first present the method for incorporating summary data from one external study that is conducted within the same source population as the internal study. Later, we will expand the procedure to more complicated scenarios where multiple external studies are conducted in the same or a different source population.
For an external study consisting of n2 subjects, with n2,0 controls and n2,1 cases, we represent its (unobserved) individual-level data as (Xi, Di), i = n1 + 1, … , n1 + n2, with the first n2,0 subjects being controls, and the remainder being cases. We assume that the external study was analyzed with a working model , with being a chosen function. This working model might be different from (1). Equivalently, we can represent this working model as
| (4) |
where W(X|D = 1) and W(X|D = 0) represent distributions of X in cases and in controls, with their connection being misspecified. Here we separate all unknown parameters into two parts, with β being the set of parameters whose estimates are presented as summary data for the integrative analysis, and being the set of parameters whose estimates are not given. We allow the existence of α (called nuisance parameters) to accommodate the situations where not all estimates from the working model are available. The procedure also needs to know the definition of the working model (4) in order to use the summary data properly. Here we provide two examples to illustrate setups for some applications.
Example 1. Suppose , and the underlying model assumed by the internal study is . The external study can choose a reduced model nested within the assumed model, such as W(X|D = 1) = W(X|D = 0) exp(α1 + α2X1 + βX2), or a nonnested working model as W(X|D = 1) = W(X|D = 0) exp{α1 + α2X1 + βlog(X2)}. If only the estimate of β is provided as summary data, then (α1, α2) are considered as nuisance parameters. Note that these two working models are misspecified due to the noncollapsibility property of the logistic regression model.
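Noncollapsibility can be illustrated with a small exact calculation. The sketch below uses two independent binary covariates and assumed coefficient values (none of these come from the paper) and computes the marginal log odds ratio for X2 after X1 is dropped from the model. Because X2 is binary, the marginal model is saturated and the marginal coefficient is well defined.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def logit(p):
    return np.log(p / (1.0 - p))

# Assumed full model: logit P(D = 1 | X1, X2) = a + b1 * X1 + b2 * X2,
# with X1 and X2 independent Bernoulli(0.5) covariates.
a, b1, b2 = -1.0, 1.5, 1.0

def marginal_risk(x2):
    """P(D = 1 | X2 = x2), averaging the full-model risk over X1."""
    return 0.5 * expit(a + b2 * x2) + 0.5 * expit(a + b1 + b2 * x2)

# Coefficient of X2 in the reduced model that omits X1.
marginal_b2 = logit(marginal_risk(1)) - logit(marginal_risk(0))
print(marginal_b2, b2)
```

Even though X1 is independent of X2, the marginal coefficient is attenuated toward zero relative to b2, so a reduced working model targets a different parameter than the full model.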
Example 2. The true model is given by , with ε(X) a known function of . The external study fits several marginal models, , k = 1, …, m. The summary data consist of estimates of βk, k = 1, … , m. This is an example of using summary data from multiple working models fitted with the same external study, a similar setting considered in the real example.
2.2 |. Asymptotic distribution of summary data
If summary data come from a misspecified working model, the variance–covariance matrix estimated by standard statistical packages is not correct; even the robust sandwich estimate derived from the prospective logistic model is not valid. Here we present the proper asymptotic distribution of the summary data.
Based on (4), the quasi-log-likelihood function of the external study can be expressed as
where ρ2 = n2,1/n2,0, . Estimates of (α, β) are obtained from the estimating equation:
Let , and . With the reparameterization of the intercept term, ℓ2(α, β) is equivalent to the prospective log-likelihood formulation. Thus, is the same as the one obtained by the standard prospective model.
Based on the estimating equation theory (White, 1982), we know is a consistent estimate of , which is the solution of the following stochastic constraint equation:
where E0 is the expectation over P(X|D = 0), the true conditional distribution in controls, and E1 is the expectation over . Here we assume n2,1/n2,0 = ρ2 as n2 → ∞. Let be the true value for . Based on (2), we know μ* satisfies the following constraint equation:
| (5) |
with .
The asymptotic distribution of is given by
| (6) |
with
Let Σ0 be the submatrix of A−1B(A−1)t corresponding to β. We know . Here we choose to represent A and B in terms of the expectation defined by P(X|D = 0), as we can obtain the estimate of P(X|D = 0) within the empirical likelihood framework described below.
Because of the specific form of B, following Carroll et al. (1995), it can be seen that the asymptotic covariance matrix of is not theoretically equivalent to the one given by the corresponding prospective formula derived under the assumption that the external data are collected from a prospective study.
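To illustrate why the prospective sandwich formula can differ under case-control sampling, the sketch below contrasts two choices of the "meat" matrix for a misspecified logistic working model. It is not the paper's expression for A and B, which is written in terms of expectations over P(X|D = 0). It relies on a simpler assumption: with fixed numbers of cases and controls drawn independently, the variance of the summed score is the sum of within-group score covariances. All function names and data-generating choices are hypothetical.

```python
import numpy as np

def fit_logistic(X, d, n_iter=50):
    """Newton-Raphson fit of a logistic working model; X includes an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        step = np.linalg.solve((X * (p * (1 - p))[:, None]).T @ X, X.T @ (d - p))
        beta += step
        if np.max(np.abs(step)) < 1e-12:
            break
    return beta

def sandwich_covariances(X, d, beta):
    """Return (case-control sandwich, prospective sandwich, B_prosp - B_cc)."""
    n = len(d)
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    scores = X * (d - p)[:, None]                # per-subject score contributions
    A = (X * (p * (1 - p))[:, None]).T @ X / n   # average negative Hessian ("bread")
    B_prosp = scores.T @ scores / n              # prospective "meat"
    B_cc = np.zeros_like(A)
    for g in (0, 1):                             # within-group centred scores
        s = scores[d == g]
        c = s - s.mean(axis=0)
        B_cc += c.T @ c / n
    A_inv = np.linalg.inv(A)
    return A_inv @ B_cc @ A_inv / n, A_inv @ B_prosp @ A_inv / n, B_prosp - B_cc

# Toy external case-control study; unequal variances make the linear working model misspecified.
rng = np.random.default_rng(1)
n0, n1 = 2000, 500
x = np.concatenate([rng.normal(0.0, 1.0, n0), rng.normal(0.6, 1.5, n1)])
d = np.concatenate([np.zeros(n0), np.ones(n1)])
X = np.column_stack([np.ones(n0 + n1), x])

beta_hat = fit_logistic(X, d)
cov_cc, cov_prosp, gap = sandwich_covariances(X, d, beta_hat)
print(np.sqrt(np.diag(cov_cc)), np.sqrt(np.diag(cov_prosp)))
```

The difference between the two meat matrices equals the weighted sum of outer products of the group-mean scores, so it is positive semidefinite; how much it matters for a given coefficient depends on the working model and the design.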
2.3 |. The integrative procedure
Here we extend the framework given by Zhang et al. (2020) to combine individual-level data with summary data . The main difference is that all considered data come from case-control studies. Rather than using the empirical distribution of X observed in the source population, we build our likelihood function using the empirical distribution of X observed among controls. We take a joint likelihood approach for the inference of μ = (θt, αt, βt)t by treating both individual-level and summary data as observed random variables. The log-likelihood for the internal case-control study is given by (3). For the summary data , because of (6), its log-likelihood function can be written as . Since Σ0 is unknown, we propose to estimate μ by solving the following optimization problem over (P, μ):
subject to
| (7) |
with pi ≥ 0, and V being any given positive definite matrix. The last constraint equation in (7) is due to (5). A simple choice for V is the identity matrix. We will show that Σ0 is the optimal choice of V under our framework. Since Σ0 is unknown, we can use an iterative algorithm discussed later to obtain the most efficient estimate of θ based on a consistent estimate of Σ0.
For any given V, we use the Lagrange multiplier approach to solve the constrained optimization problem (7). Define the corresponding Lagrange function as
with κ, λ, and ξ being the Lagrange multipliers. It is easy to show that κ = 1, and the maximizer should satisfy
Therefore, letting , we can express the profile log-likelihood as
| (8) |
We use the Newton-Raphson algorithm to find the stationary point of (8), with . A good initial point can be , with obtained by using the data from the internal study to fit models (2) and (4). We adjust the case-control sample size ratio when fitting (4).
We show in the Web Appendix A (Lemma 1) that under some regularity conditions, is a consistent estimate of , with , , and as defined before. After we obtain , we can estimate P(Xi|D = 0), i = 1, …, n1, as
| (9) |
We want to point out that the nuisance parameter α is identifiable as α can influence ℓV(η) through g(X; μ).
The asymptotic distribution of is given by the following result, with its proof shown in Web Appendix A.
Theorem 1. Assuming n2/n1 → τ, n1,1/n1,0 = ρ1, and n2,1/n2,0 = ρ2 remain constant as n1 → ∞, we have
where JV and IV are defined in the Appendix.
Definitions of JV and IV rely on η*, Σ0, and P(X|D = 0). To estimate the covariance of , we can replace with , and use the estimated empirical distribution (9) for P(X|D = 0). Similarly, we can replace Σ0 with its consistent estimate , which is a submatrix of corresponding to β. and are estimates of A and B. They can be obtained by replacing μ* with , and by calculating the expectation in controls using estimates given by (9).
Although is consistent given any V in (8), its level of efficiency depends on the choice of V. In fact, we can obtain the most efficient estimate of η by using V = Σ0, or its consistent estimate. We need to introduce some notation before presenting the result. Let be the vector of Lagrange multipliers. We can represent JV as the following block matrix:
Note that only the lower right corner submatrix depends on V. We can represent JΣ0 similarly by replacing JV,μμ with JΣ0,μμ. By letting V = Σ0 in IV, we can see that IΣ0 can be written as
with J.λ being the first column of JΣ0. We have the following result (see Web Appendix A for the proof).
Corollary 1. The asymptotic variance–covariance matrix of attains its minimum at V = Σ0. At V = Σ0, the asymptotic variance–covariance matrix of has the following form:
where e1 is a vector with its first element being 1, all others being zero. This asymptotic variance–covariance matrix remains the same if V is a consistent estimate of Σ0.
Because of Corollary 1, we can use an iterative algorithm to find the optimal estimate. First, we let V in (8) be the identity matrix to obtain an initial estimate η(1). Second, we use η(1) to obtain , which is a consistent estimate of Σ0. Third, by letting , we obtain an updated estimate of η. We can iterate the second and third steps several times until the estimate converges. We define the final estimate of θ as and refer to the procedure as the retrospective generalized integration method (rGIM). In contrast, we call the original integration method that was developed for integrating data from prospective studies (Zhang et al., 2020) the prospective generalized integration method (pGIM).
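The control flow of this iteration can be sketched as a short fixed-point loop. The two callables below, `fit_eta` and `estimate_sigma0`, are hypothetical placeholders: in the actual procedure the first would find the stationary point of (8) for a given V and the second would evaluate the plug-in estimate of Σ0. The toy stand-ins used here (a precision-weighted average and a made-up covariance function) only exercise the loop and are not the rGIM computations.

```python
import numpy as np

def iterate_weight_matrix(fit_eta, estimate_sigma0, dim, max_iter=20, tol=1e-8):
    """Step 1: fit with V = I. Steps 2-3: re-estimate V from eta, refit, repeat."""
    V = np.eye(dim)
    eta = fit_eta(V)
    for _ in range(max_iter):
        V = estimate_sigma0(eta)          # step 2: plug-in estimate of Sigma_0
        eta_new = fit_eta(V)              # step 3: refit with the updated V
        converged = np.max(np.abs(eta_new - eta)) < tol
        eta = eta_new
        if converged:
            return eta, V, True
    return eta, V, False

# Toy stand-ins (assumptions for illustration only).
b_int = np.array([0.70, 0.40])            # "internal" estimate
S_int = np.diag([0.012, 0.010])           # its covariance
b_ext = np.array([0.65, 0.45])            # "external" summary estimate

def fit_eta(V):
    """Precision-weighted combination of the two estimates, using V as the external covariance."""
    P_int, P_ext = np.linalg.inv(S_int), np.linalg.inv(V)
    return np.linalg.solve(P_int + P_ext, P_int @ b_int + P_ext @ b_ext)

def estimate_sigma0(eta):
    """Made-up covariance that depends on the current estimate."""
    return np.diag(0.004 * (1.0 + eta ** 2))

eta_hat, V_hat, ok = iterate_weight_matrix(fit_eta, estimate_sigma0, dim=2)
print(eta_hat, ok)
```

Returning a convergence flag alongside the estimate makes it easy to detect runs where the second and third steps have not stabilized within the iteration budget.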
Although rGIM is designed for case-control studies, it can also be applied to the setting when one or both studies are conducted under a simple random sampling design. This is true because P(X|D) can be inferred properly with a set of random samples of (X, D). Simulation studies presented later also confirm this.
We can extend the aforementioned results to a more general situation where summary data come from multiple independent or partially overlapping external studies. Summary data from multiple models fitted within a given study are also allowed (see Example 2 as an illustration). All these can be achieved by specifying the correct variance–covariance matrix of . (See Web Appendix B for more details.)
2.4 |. Alternative approaches
Instead of treating the summary statistics as observed random variables, we can set as known and only estimate the other parameters (i.e., θ and α). Qin (2000) and Chatterjee et al. (2016) used this strategy to incorporate summary information from a prospective study, although they did not consider the existence of nuisance parameter α. Using this strategy in case-control studies, we can estimate θ and α by solving the following constrained optimization problem:
subject to
| (10) |
where since is fixed at . We denote the resultant constraint maximum likelihood (CML) estimate as . Another alternative estimate of θ is the standard maximum likelihood estimate based on the internal study (MLE-Int), denoted as .
In Web Appendix A5, we derive the proper variance-covariance matrix of the CML estimate by accounting for the variability of and prove the following result.
Corollary 2. The rGIM estimate is asymptotically more efficient than the internal data based MLE and the CML estimate .
2.5 |. Different study populations
So far, we have presented procedures assuming that the internal and external studies are conducted within a common underlying population. In some applications, the two studies might be conducted in two source populations (called the internal study population and the external study population), which have different distributions of X. We can expand the proposed procedure to this more general setting.
We assume that the disease risk in the two populations can be specified by the following model:
where s = 1, or 2 indicates the internal or external study population. Thus, we in fact assume that regression coefficients except the intercept term are the same between the two risk models. Since we allow the two study populations to have different covariate distributions, each study population has its own covariate distribution in cases and in controls, denoted as P(X|D = 1, s), and P(X|D = 0, s), respectively. Following (2), we know
where . In addition to the individual level data and summary statistics, we also require observations on a set of controls from the external study population and use them as a reference set to estimate P(X|D = 0, s = 2), which is different from P(X|D = 0, s = 1). In Web Appendix C, we extend the rGIM procedure to the setting of different study populations and refer to it as GIMREF.
3 |. SIMULATION STUDIES
3.1 |. Same internal and external study population
We first considered the scenario when the internal and external studies were conducted within the same study population. We assumed that X = (X1, X2) represented genotypes on two genetic markers, called single nucleotide polymorphisms (SNPs), and the true disease risk model was logit{P(D = 1|X)} = θ0* + θ1X1 + θ2X2. We further assumed that the two SNPs’ locations were in close proximity and let G = (G1, G2) represent the allele type (0 or 1) of the two SNPs on a given chromosome (i.e., haplotype). In the study population, the probabilities of G = (0,0), (0,1), (1,0), and (1,1) are 0.28, 0.12, 0.18, and 0.42, respectively. For each subject, the joint genotype X was given by the sum of two randomly selected haplotypes. We let θ1 = log(2), θ2 = log(1.5), and chose the intercept θ0* such that the disease prevalence was around 10%. Each pair of internal and external studies was generated from this population, with the internal study consisting of 500 cases and 500 controls, and the external study consisting of 500 cases and 2000 (or 5000) controls. Based on the external study, the summary data were coefficient estimates from the following two working models: logit{P(D = 1|Xk)} = αk + βkXk, k = 1, 2.
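The data-generating mechanism described above can be sketched as follows. The haplotype frequencies and log odds ratios are those given in the text; the exact 10% prevalence target, the bisection search for the intercept, the batch size, and all function names are assumptions made for illustration.

```python
import numpy as np

HAPLOTYPES = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
HAP_PROBS = np.array([0.28, 0.12, 0.18, 0.42])
THETA = np.array([np.log(2.0), np.log(1.5)])   # log odds ratios for the two SNPs

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def prevalence(intercept):
    """Exact P(D = 1), enumerating all 16 ordered haplotype pairs."""
    total = 0.0
    for i in range(4):
        for j in range(4):
            x = HAPLOTYPES[i] + HAPLOTYPES[j]   # genotype = sum of two haplotypes
            total += HAP_PROBS[i] * HAP_PROBS[j] * expit(intercept + x @ THETA)
    return total

def solve_intercept(target=0.10, lo=-20.0, hi=5.0, n_iter=100):
    """Bisection on the intercept; prevalence is increasing in the intercept."""
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if prevalence(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def draw_case_control(n_cases, n_controls, intercept, rng, batch=20000):
    """Sample the population in batches until enough cases and controls are collected."""
    cases = np.empty((0, 2), dtype=int)
    controls = np.empty((0, 2), dtype=int)
    while len(cases) < n_cases or len(controls) < n_controls:
        idx = rng.choice(4, size=(batch, 2), p=HAP_PROBS)
        x = HAPLOTYPES[idx[:, 0]] + HAPLOTYPES[idx[:, 1]]
        d = rng.random(batch) < expit(intercept + x @ THETA)
        cases = np.vstack([cases, x[d]])
        controls = np.vstack([controls, x[~d]])
    X = np.vstack([controls[:n_controls], cases[:n_cases]])   # controls first, then cases
    D = np.concatenate([np.zeros(n_controls, dtype=int), np.ones(n_cases, dtype=int)])
    return X, D

intercept = solve_intercept()
X, D = draw_case_control(500, 500, intercept, np.random.default_rng(0))
print(X.shape, D.sum())
```

Calling `draw_case_control(500, 2000, intercept, rng)` with a separate generator would give an external study of the first size considered in the text.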
We analyzed each simulated dataset with rGIM, pGIM, CML, and MLE-Int. Except for MLE-Int, the methods used summary data consisting of either β̂1 alone or (β̂1, β̂2). Table 1 summarizes the performance of the four methods over 5000 simulated datasets under each scenario.
TABLE 1.
Simulation results in situations when internal and external studies are conducted under a case-control sampling design in a common source population
| Estimate of θ1 | MLE-Int | CML, given β̂1 | pGIM, given β̂1 | rGIM, given β̂1 | CML, given (β̂1, β̂2) | pGIM, given (β̂1, β̂2) | rGIM, given (β̂1, β̂2) |
|---|---|---|---|---|---|---|---|
| External study: 500 cases and 2000 controls | |||||||
| Bias | 0.33 | 0.19 | 0.18 | 0.10 | 0.18 | 0.18 | 0.10 |
| SE-Emp | 10.94 | 9.09 | 7.66 | 7.53 | 8.86 | 7.14 | 6.98 |
| SE-Est | 11.00 | 9.20 | 6.75 | 7.56 | 8.92 | 5.92 | 6.95 |
| CP | 95.36 | 95.26 | 91.88 | 95.14 | 95.04 | 89.18 | 94.94 |
| External study: 500 cases and 5000 controls | |||||||
| Bias | 0.33 | −0.68 | −0.56 | −0.45 | −0.52 | −0.43 | −0.34 |
| SE-Emp | 10.94 | 8.66 | 7.79 | 7.32 | 8.34 | 7.30 | 6.71 |
| SE-Est | 11.00 | 8.75 | 5.63 | 7.34 | 8.39 | 4.38 | 6.69 |
| CP | 95.36 | 95.12 | 83.86 | 94.90 | 95.18 | 75.60 | 95.06 |
| Estimate of θ2 | MLE-Int | CML, given β̂1 | pGIM, given β̂1 | rGIM, given β̂1 | CML, given (β̂1, β̂2) | pGIM, given (β̂1, β̂2) | rGIM, given (β̂1, β̂2) |
| External study: 500 cases and 2000 controls | |||||||
| Bias | 0.15 | 0.15 | 0.15 | 0.15 | 0.21 | 0.15 | 0.08 |
| SE-Emp | 10.32 | 10.32 | 10.32 | 10.32 | 8.40 | 6.80 | 6.64 |
| SE-Est | 10.38 | 10.38 | 10.38 | 10.38 | 8.36 | 5.77 | 6.64 |
| CP | 95.46 | 95.46 | 95.46 | 95.46 | 94.68 | 90.12 | 94.82 |
| External study: 500 cases and 5000 controls | |||||||
| Bias | 0.15 | 0.15 | 0.15 | 0.15 | −0.26 | −0.22 | −0.20 |
| SE-Emp | 10.32 | 10.32 | 10.32 | 10.32 | 7.87 | 6.92 | 6.39 |
| SE-Est | 10.38 | 10.38 | 10.38 | 10.38 | 7.85 | 4.42 | 6.39 |
| CP | 95.46 | 95.46 | 95.46 | 95.46 | 94.80 | 79.32 | 94.54 |
The internal study has 500 cases and 500 controls. The external study has 500 cases and 2000 (or 5000) controls. The true model is logit {P(D = 1|G)} = θ0* + θ1G1 + θ2G2. Summary data are coefficient estimates based on two working models, logit {P(D = 1|Gk)} = αk + βkGk, k = 1, 2. Empirical bias (bias), empirical standard error (SE-Emp), mean of estimated asymptotic standard error (SE-Est), and coverage probability (CP) of a 95% confidence interval are summarized over 5000 simulated datasets. All numbers are multiplied by 100. Note: MLE-Int: standard MLE based on the internal study; CML: constraint MLE developed for case-control studies; pGIM: GIM procedure developed under the prospective likelihood framework; rGIM: GIM procedure developed under the retrospective likelihood framework.
First, Table 1 shows that the asymptotic distribution of the rGIM estimate presented in Corollary 1 is quite accurate: the estimate is nearly unbiased, and its estimated asymptotic standard error matches the empirical standard error well in all considered settings. Its 95% confidence interval also has the correct coverage probability (CP). Second, as predicted by Corollary 2, rGIM is at least as efficient as MLE-Int and CML. The magnitude of the efficiency gain depends on the available summary data. Specifically, when summary data consist of only β̂1, the estimated marginal effect of X1, rGIM is more efficient than MLE-Int and CML for estimating θ1 (the true effect of X1), but all three methods have the same level of efficiency for estimating θ2. If using (β̂1, β̂2) as summary data, the rGIM procedure shows a clear advantage over MLE-Int and CML for estimating both θ1 and θ2. Third, the pGIM procedure that ignores the case-control sampling can generate incorrect standard error estimates. For example, when using (β̂1, β̂2) estimated from an external study with 500 cases and 2000 controls, the pGIM estimate of θ1 has an empirical standard error of 0.071, compared to the mean estimated asymptotic standard error of 0.059. As a result, its 95% confidence interval has only 89% coverage probability. When we increase the number of controls in the external study to 5000, its coverage probability decreases further to 76%.
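The four summaries reported in the simulation tables can be computed from per-replicate estimates and standard errors as in the sketch below. The Wald-type interval (estimate ± 1.96 × standard error) is an assumption about how the 95% confidence interval is formed; the function name and the toy inputs are hypothetical.

```python
import numpy as np

def summarize(estimates, std_errors, truth, z=1.959964):
    """Bias, empirical SE, mean estimated SE, and coverage, each multiplied by 100."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    covered = np.abs(estimates - truth) <= z * std_errors   # Wald interval contains truth
    return {
        "Bias": 100.0 * (estimates.mean() - truth),
        "SE-Emp": 100.0 * estimates.std(ddof=1),
        "SE-Est": 100.0 * std_errors.mean(),
        "CP": 100.0 * covered.mean(),
    }

# Toy replicates: estimates scattered around log(2) with a correctly sized standard error.
rng = np.random.default_rng(2)
truth = np.log(2.0)
est = truth + rng.normal(0.0, 0.07, size=5000)
se = np.full(5000, 0.07)
print(summarize(est, se, truth))
```

When SE-Est falls well below SE-Emp, as for pGIM in Table 1, the intervals are too narrow and CP drops below the nominal level.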
On a 2.30 GHz Linux machine, the running time of MLE-Int on an internal study of 500 cases and 500 controls is about 0.004 s. The rGIM and CML procedures integrating summary data with the same internal study take about 6.1 and 3.3 s, respectively. rGIM is slower than CML as it needs to reestimate (β1, β2), instead of assuming βi = β̂i, i = 1, 2.
3.2 |. Different internal and external study populations
We considered the scenario when the internal and external studies were conducted in two different study populations. The internal study population was the same as the one given in Section 3.1. For the external study population, the distribution of haplotype G was specified as G = (0,0) , (0,1), (1,0), and (1,1) with probability of 0.14, 0.06, 0.56, and 0.24, respectively. We assumed that the same aforementioned disease risk model applies to both populations. For each pair of simulated internal and external studies, we also generated an additional set of 300 controls from the external population and used them as reference samples in the GIMREF procedure. Simulation results are summarized in Table 2. We did not present results from the pGIM procedure as we have shown in Section 3.1 that pGIM was not appropriate for case-control studies.
TABLE 2.
Simulation results in situations when internal and external studies are conducted under a case-control sampling design in two different source populations
| Estimate of θ1 | MLE-Int | GIMREF, given β̂1 | rGIM, given β̂1 | GIMREF, given (β̂1, β̂2) | rGIM, given (β̂1, β̂2) |
|---|---|---|---|---|---|
| External study: 500 cases and 2000 controls | |||||
| Bias | 0.33 | −0.10 | −9.83 | 0.06 | −4.62 |
| SE-Emp | 10.94 | 7.79 | 8.60 | 7.59 | 8.93 |
| SE-Est | 11.00 | 7.89 | 7.27 | 7.66 | 6.70 |
| CP | 95.36 | 95.32 | 69.06 | 95.02 | 80.42 |
| External study: 500 cases and 5000 controls | |||||
| Bias | 0.33 | −0.47 | −10.72 | −0.24 | −5.21 |
| SE-Emp | 10.94 | 7.66 | 8.45 | 7.44 | 8.73 |
| SE-Est | 11.00 | 7.72 | 6.99 | 7.48 | 6.41 |
| CP | 95.36 | 95.12 | 62.28 | 95.08 | 76.70 |
| Estimate of θ2 | MLE-Int | GIMREF, given β̂1 | rGIM, given β̂1 | GIMREF, given (β̂1, β̂2) | rGIM, given (β̂1, β̂2) |
| External study: 500 cases and 2000 controls | |||||
| Bias | 0.15 | 0.26 | 0.15 | −0.06 | −13.96 |
| SE-Emp | 10.32 | 10.07 | 10.32 | 6.19 | 7.22 |
| SE-Est | 10.38 | 10.08 | 10.38 | 6.20 | 6.27 |
| CP | 95.46 | 95.48 | 95.46 | 95.16 | 40.64 |
| External study: 500 cases and 5000 controls | |||||
| Bias | 0.15 | 0.38 | 0.15 | −0.25 | −14.77 |
| SE-Emp | 10.32 | 10.05 | 10.32 | 5.98 | 6.96 |
| SE-Est | 10.38 | 10.06 | 10.38 | 5.95 | 5.97 |
| CP | 95.46 | 95.74 | 95.44 | 94.72 | 32.72 |
The internal study has 500 cases and 500 controls. The external study has 500 cases and 2000 (or 5000) controls. The reference set consists of 300 controls from the external population. The same risk model logit {P(D = 1|G)} = θ0* + θ1G1 + θ2G2 is assumed for both populations. Summary data are coefficient estimates based on two working models, logit {P(D = 1|Gk)} = αk + βkGk, k = 1, 2. Empirical bias (bias), empirical standard error (SE-Emp), mean of estimated asymptotic standard error (SE-Est), and coverage probability (CP) of a 95% confidence interval are summarized over 5000 simulated datasets. All numbers are multiplied by 100. Note: MLE-Int: standard MLE based on the internal study; GIMREF: retrospective GIM procedure with a reference set developed for the setting in which external and internal studies are conducted in two different populations; rGIM: retrospective GIM procedure with the assumption that two studies are conducted in a common population.
Results in Table 2 demonstrate that GIMREF has the desired performance when dealing with studies from different populations. GIMREF is also more efficient than MLE-Int. On the other hand, the rGIM procedure, which is designed for data collected from a common study population, can generate inconsistent estimates and erroneous standard error estimates (Table 2).
3.3 |. Studies under different designs
We conducted additional simulations to evaluate the performance of the retrospective likelihood approach in situations when one or both studies were conducted under a simple random sampling design.
First, we considered the scenario in which the internal study was a case-control study with 500 cases and 500 controls from the source population defined in Section 3.1, while the external study consisted of 5500 subjects randomly sampled from the same source population. For the purpose of illustration, we only present results from analyses using (β̂1, β̂2) as summary data. Results shown in Table 3 indicate that both rGIM and CML estimates are consistent, with proper estimated standard errors, but rGIM is more efficient than CML and MLE-Int. On the other hand, the prospective likelihood approach pGIM does not estimate the standard error correctly. This is expected as pGIM ignores the sampling bias in the internal data. We further considered the situation in which the external study had its 5500 subjects randomly sampled from a different source population (as defined in Section 3.2). In this setup, we also generated a reference set of 300 controls from the external study population and used it as part of the input for GIMREF. Results summarized in Table 4 indicate that GIMREF has the expected performance in this setting.
TABLE 3.
Simulation results in situations when the internal study is conducted under a case-control sampling design in a source population, and the external study is conducted under a simple random sampling design in the same population
| Estimate of θ1 | MLE-Int | CML | pGIM | rGIM |
|---|---|---|---|---|
| Bias | 0.33 | 0.10 | 0.09 | 0.04 |
| SE-Emp | 10.94 | 7.95 | 6.97 | 6.50 |
| SE-Est | 11.00 | 8.07 | 4.38 | 6.53 |
| CP | 95.36 | 95.78 | 77.62 | 95.60 |
| Estimate of θ2 | MLE-Int | CML | pGIM | rGIM |
| Bias | 0.15 | 0.15 | 0.13 | 0.06 |
| SE-Emp | 10.32 | 7.57 | 6.67 | 6.23 |
| SE-Est | 10.38 | 7.56 | 4.43 | 6.24 |
| CP | 95.46 | 94.70 | 81.26 | 95.00 |
The internal study has 500 cases and 500 controls. The external study consists of 5500 randomly sampled subjects. The true risk model is logit {P(D = 1|G)} = θ0* + θ1G1 + θ2G2. Summary data are coefficient estimates from two working models, logit {P(D = 1|Gk)} = αk + βkGk, k = 1, 2. Empirical bias (bias), empirical standard error (SE-Emp), mean of estimated asymptotic standard error (SE-Est), and coverage probability (CP) of a 95% confidence interval are summarized over 5000 simulated datasets. All numbers are multiplied by 100. Note: MLE-Int: standard MLE based on the internal study; CML: constraint MLE developed for case-control studies; pGIM: GIM procedure developed under the prospective likelihood framework; rGIM: GIM procedure developed under the retrospective likelihood framework.
TABLE 4.
Simulation results in situations when the internal study is conducted under a case-control sampling design in one population, the external study is conducted under a simple random sampling design in a different population
| Estimate of θ1 | MLE-Int | rGIM | GIMREF |
|---|---|---|---|
| Bias | 0.33 | −4.82 | 0.22 |
| SE-Emp | 10.94 | 8.66 | 7.31 |
| SE-Est | 11.00 | 6.24 | 7.25 |
| CP | 95.36 | 77.46 | 95.22 |
| Estimate of θ2 | MLE-Int | rGIM | GIMREF |
| Bias | 0.15 | −15.21 | 0.04 |
| SE-Emp | 10.32 | 6.93 | 6.01 |
| SE-Est | 10.38 | 5.83 | 5.84 |
| CP | 95.46 | 29.42 | 94.22 |
The internal study has 500 cases and 500 controls. The external study consists of 5500 randomly sampled subjects. The same risk model logit{P(D = 1|G)} = θ0* + θ1G1 + θ2G2 is assumed for both populations. Summary data are coefficient estimates from two working models, logit{P(D = 1|Gk)} = αk + βkGk, k = 1, 2. Empirical bias (Bias), empirical standard error (SE-Emp), mean of estimated asymptotic standard error (SE-Est), and coverage probability (CP) of a 95% confidence interval are summarized over 5000 simulated datasets. All numbers are multiplied by 100. Note: MLE-Int: standard MLE based on the internal study; GIMREF: retrospective GIM procedure with a reference set developed for the setting in which external and internal studies are conducted in two different populations; rGIM: retrospective GIM procedure with the assumption that two studies are conducted in a common population.
Next, we considered situations in which the internal study was generated under a simple random sampling design with 5500 subjects, and the external study of 500 cases and 500 controls was compiled under a case-control sampling design in the same or a different source population (see Web Appendix Tables 1 and 2). The conclusions from those two tables are similar to those from Tables 3 and 4, respectively. Finally, we considered the setting in which both internal and external studies were carried out under a simple random sampling design, with each consisting of 5500 subjects. Web Appendix Table 3 shows simulation results in the scenario when both studies are conducted in the same source population. Web Appendix Table 4 shows results when the two studies are sampled from different populations. Again, both tables show that the retrospective likelihood approach performs well. As expected, the prospective approach (pGIM) also performs well in this setting because the data are collected from prospective studies.
4 |. REAL APPLICATION
We illustrated our method by applying it to two genome-wide association studies (GWAS) of pancreatic cancer (Amundadottir et al., 2009; Petersen et al., 2010). Both GWAS had genotypes measured on over half a million SNPs. Genotypes on additional SNPs were obtained by imputation. We focused on subjects with predominantly European ancestry from the two studies. The first GWAS (called PanScan I) consisted of 1761 cases and 1804 controls. The second GWAS (called PanScan II) had 1768 cases and 1851 controls (PanScan, 2015).
In this application, we concentrated on genes from the PredictDB Data Repository defined by gene expression in pancreatic tissues (Gamazon et al., 2015; Barbeira et al., 2018). For a given gene, PredictDB provided a prediction model that used genotypes on a set of SNPs selected around the neighborhood of the gene to predict its gene expression level. The prediction model can be represented as ε(X) = w1X1 + ⋯ + wmXm, with X = (X1, …, Xm) being genotypes on the set of m selected SNPs, and w = (w1, …, wm) being their corresponding weights. A transcriptome-wide association study (TWAS) searches for genes associated with the outcome D by assessing the correlation between ε(X) and D assuming w is known.
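As an illustration of how the predicted expression level ε(X) can be evaluated for each subject, the following minimal sketch assumes that the prediction model is a weighted sum of SNP genotypes with known weights. The function name `predicted_expression` and the toy genotype matrix and weights are hypothetical and are not taken from PredictDB.

```python
import numpy as np


def predicted_expression(genotypes, weights):
    """Compute the weighted genotype sum for each subject.

    genotypes: (n_subjects, m) array of SNP genotypes (e.g., allele counts 0/1/2)
    weights:   length-m array of SNP weights, treated as known
    """
    genotypes = np.asarray(genotypes, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if genotypes.ndim != 2 or genotypes.shape[1] != weights.shape[0]:
        raise ValueError("genotypes must be (n, m) and weights must have length m")
    return genotypes @ weights


# Toy data: three subjects, three SNPs (hypothetical values).
X = np.array([[0, 1, 2],
              [1, 1, 0],
              [2, 0, 1]])
w = np.array([0.5, -0.2, 0.1])
eps = predicted_expression(X, w)  # one predicted expression value per subject
```

The resulting vector plays the role of a single derived covariate in the gene-level association model, so each gene contributes one parameter of interest regardless of how many SNPs enter its prediction model.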
We considered PanScan I GWAS as the internal study. From the independent PanScan II GWAS, we randomly selected a set of 500 controls as reference samples and treated the remaining data as the external study, based on which summary data were derived. To assess each gene’s association with pancreatic cancer, we used a logistic regression model for D with ε(X) and Z as covariates, where θ, the coefficient of ε(X), is the parameter of interest, and Z is the set of adjusted covariates, including the intercept, gender, and top five eigenvectors identified in the two combined GWAS. We assumed the summary data were coefficient estimates from working models fitted with data from the external study, one for each SNP Xk, k = 1, …, m.
In the integrative analysis, we only used summary information on SNPs whose pairwise correlation was less than 0.95 in order to avoid the problem of collinearity, although we still included all SNPs in the definition of ε(X). For some genes, the number of summary statistics can be more than 60.
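One simple way to select SNPs whose pairwise correlation stays below 0.95 is a greedy pass over the genotype columns, sketched below. This is an assumed selection rule for illustration only; the source does not describe the exact procedure, and the function name `prune_by_correlation`, the use of absolute Pearson correlation, and the toy genotypes are hypothetical.

```python
import numpy as np


def prune_by_correlation(genotypes, threshold=0.95):
    """Greedily keep SNP columns whose absolute correlation with every
    previously kept column is below `threshold`.

    Assumes every SNP varies in the sample (a constant column would give
    an undefined correlation).
    """
    genotypes = np.asarray(genotypes, dtype=float)
    corr = np.abs(np.corrcoef(genotypes, rowvar=False))
    kept = []
    for k in range(genotypes.shape[1]):
        if all(corr[k, j] < threshold for j in kept):
            kept.append(k)
    return kept


# Toy genotypes: SNP 1 duplicates SNP 0, SNP 2 is only weakly correlated with it.
snp0 = [0, 1, 2, 0, 1, 2, 0, 1]
snp1 = list(snp0)
snp2 = [0, 0, 1, 1, 2, 2, 1, 0]
G = np.column_stack([snp0, snp1, snp2])

kept_snps = prune_by_correlation(G)  # indices of SNPs whose summary data are used
```

Only the summary statistics of the retained SNPs would enter the integrative analysis, while the predicted expression ε(X) is still computed from all m SNPs.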
We analyzed a total of 5832 genes from the PredictDB Data Repository with MLE-Int and GIMREF. Figure 1 shows Q–Q plots for the two analyses. Although neither method detected a globally significant gene (based on the Bonferroni threshold), the tail of the Q–Q plot produced by GIMREF shows more suggestive evidence of association than that produced by MLE-Int. This is expected, as the GIM method utilizes more information than MLE-Int.
FIGURE 1.

Q–Q plots for TWAS results based on pancreatic cancer GWAS. Each of 5832 considered genes is analyzed with the standard MLE based on the internal study (MLE-Int), as well as the retrospective GIM procedure with an additional reference set (GIMREF)
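The coordinates of a Q–Q plot and the Bonferroni threshold for 5832 tests can be obtained as in the following minimal sketch. The family-wise level of 0.05 and the plotting positions (i − 0.5)/n are assumptions, since the source does not state them, and the function name `qq_points` and the toy p-values are hypothetical.

```python
import numpy as np


def qq_points(p_values):
    """Return (expected, observed) -log10 p-values for a Q-Q plot.

    Expected quantiles use the plotting positions (i - 0.5) / n under the
    uniform null distribution.
    """
    p = np.sort(np.asarray(p_values, dtype=float))
    n = p.size
    expected = (np.arange(1, n + 1) - 0.5) / n
    return -np.log10(expected), -np.log10(p)


n_genes = 5832
alpha = 0.05                           # assumed family-wise error level
bonferroni_threshold = alpha / n_genes  # roughly 8.6e-06

# Toy p-values (hypothetical) to show the shape of the output.
expected_q, observed_q = qq_points([0.5, 0.1, 0.9])
```

Points that lie above the diagonal at the upper right of such a plot correspond to p-values smaller than expected under the null, which is where the suggestive evidence described in the text would appear.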
Zhong et al. (2020) recently conducted a TWAS analysis of pancreatic cancer using a different strategy for defining the gene expression prediction model. Based on a much larger set of GWAS data, including those from PanScan I and II, they identified 12 pancreatic cancer associated genes that passed the Bonferroni threshold. Seven of those genes are among the set of 5832 genes we analyzed. Results on those seven genes are summarized in Table 5. Again, it appears that GIMREF can detect more signals than MLE-Int.
TABLE 5.
Gene-level association testing results based on pancreatic cancer GWAS
| Gene | MLE-Int: Est (SE) | MLE-Int: p-value | GIMREF: Est (SE) | GIMREF: p-value |
|---|---|---|---|---|
| KLF5 | 0.571 (0.850) | 5.02E-01 | 0.806 (0.629) | 2.00E-01 |
| PDX1 | −0.559 (0.383) | 1.45E-01 | −0.970 (0.287) | 7.25E-04 |
| WDR59 | 0.538 (0.362) | 1.37E-01 | 0.575 (0.288) | 4.59E-02 |
| CFDP1 | −0.060 (0.371) | 8.72E-01 | 0.232 (0.288) | 4.20E-01 |
| CELA3B | −0.528 (0.368) | 1.51E-01 | −0.818 (0.287) | 4.36E-03 |
| INHBA | −1.658 (0.461) | 3.21E-04 | −0.912 (0.344) | 8.07E-03 |
| ABO | 0.121 (0.067) | 7.16E-02 | 0.103 (0.051) | 4.38E-02 |
Results are given on seven genes that have been established to be associated with pancreatic cancer by Zhong et al. (2020). Note: MLE-Int: standard MLE based on the internal study (PanScan I); GIMREF: retrospective GIM procedure with a reference set developed for the setting in which external (PanScan II) and internal (PanScan I) studies are conducted in two different populations.
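The source does not state which test produced the p-values in Table 5, but the reported values appear consistent with a two-sided Wald test based on Est/SE. The following minimal sketch, under that assumption, recomputes the PDX1 p-values from the rounded table entries and also reports the squared ratio of standard errors as a rough measure of efficiency gain. The function names are hypothetical.

```python
import math


def wald_p_value(estimate, std_error):
    """Two-sided p-value for H0: parameter = 0 using a normal reference."""
    z = estimate / std_error
    return math.erfc(abs(z) / math.sqrt(2.0))


def relative_efficiency(se_reference, se_new):
    """Variance ratio: values above 1 mean the new estimator is more precise."""
    return (se_reference / se_new) ** 2


# PDX1 entries from Table 5 (rounded values as printed).
p_pdx1_mle_int = wald_p_value(-0.559, 0.383)    # table reports 1.45E-01
p_pdx1_gimref = wald_p_value(-0.970, 0.287)     # table reports 7.25E-04
re_pdx1 = relative_efficiency(0.383, 0.287)     # about 1.78
```

Small differences from the printed p-values are to be expected because the estimates and standard errors in the table are rounded to three decimals.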
5 |. DISCUSSION
We have developed an integrative procedure for effectively combining aggregated summary data with detailed individual-level data. We adopt a retrospective likelihood framework to account for the sampling bias resulting from the case-control study design. The procedure is flexible enough to incorporate summary data generated from distinct working models from multiple external studies. It also allows for external studies to be conducted in a population different from the internal study population, provided that individual-level data on a set of reference control samples are available from the external study population. We establish asymptotic properties for the procedure and prove that its estimate is more efficient than both the MLE based on the internal study and the constraint MLE procedure, which derives its estimate under the restriction imposed by the summary data.
We demonstrate that it is important to adjust for the sampling bias in integrative analyses. The prospective likelihood approach that ignores the case-control sampling design can generate inaccurate standard error estimates. Although the proposed procedure is developed specifically for integrating data from multiple case-control studies, it can be used for studies conducted under a simple random sampling design. We show through simulation studies that the proposed retrospective likelihood approach has the desired performance for internal and external studies conducted under either a case-control or a simple random sampling design. Another advantage of our procedure is that it can combine individual-level and summary data taken from different underlying populations, as long as a set of reference control samples is available from each external study population. The reference set is critical to ensure a correct estimate of the covariate distribution in the underlying population. Since the reference set only requires a random sample of controls, it is appealing in practice when case samples are not easily accessible. This feature is especially useful for genetic association studies, as ample reference genomes with different ethnic backgrounds are available from public resources. In situations where such reference samples are not available, our simulation studies suggest that applying the procedure under the assumption of a common source population can generate biased estimates. Further investigations are needed to develop procedures that can provide more robust estimates in this setting.
ACKNOWLEDGEMENTS
The study utilized the computational resource of the NIH Biowulf cluster (https://hpc.nih.gov/).
SUPPORTING INFORMATION
Web Appendices and Tables referenced in Sections 2.3, 2.4, 2.5 and 3.3 are available with this article at the Biometrics website on Wiley Online Library. We have implemented the proposed procedure in the R package “gim”, which can be obtained from https://CRAN.R-project.org/package=gim. Example code showing how to use the gim package is posted at the Biometrics website.
DATA AVAILABILITY STATEMENT
The data that support the findings in this paper are openly available in the database of Genotypes and Phenotypes (dbGaP) at https://www.ncbi.nlm.nih.gov/gap/, with dbGaP Study Accession # phs000206.v5.p3.
REFERENCES
- Amundadottir L, Kraft P, Stolzenberg-Solomon RZ, Fuchs CS, Petersen GM, Arslan AA, et al. (2009) Genome-wide association study identifies variants in the ABO locus associated with susceptibility to pancreatic cancer. Nature Genetics, 41, 986–990.
- Barbeira AN, Dickinson SP, Bonazzola R, Zheng J, Wheeler HE, Torres JM, et al. (2018) Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nature Communications, 9, 1825.
- Breslow NE, and Day NE (1980) Statistical methods in cancer research. Volume I – The analysis of case-control studies. IARC Scientific Publications, 32, 5–338.
- Carroll RJ, Wang S, and Wang CY (1995) Prospective analysis of logistic case-control studies. Journal of the American Statistical Association, 90, 157–169.
- Chatterjee N, Chen YH, Maas P, and Carroll RJ (2016) Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. Journal of the American Statistical Association, 111, 107–117.
- Chaudhuri S, Handcock MS, and Rendall MS (2008) Generalised linear models incorporating population level information: An empirical likelihood based approach. Journal of the Royal Statistical Society Series B (Statistical Methodology), 70, 311–328.
- Chen J, and Sitter RR (1999) A pseudo empirical likelihood approach to the effective use of auxiliary information in complex surveys. Statistica Sinica, 9, 385–406.
- Cheng W, Taylor JMG, Gu T, Tomlins SA, and Mukherjee B (2019) Informing a risk prediction model for binary outcomes with external coefficient information. Journal of the Royal Statistical Society, Series C (Applied Statistics), 68, 121–139.
- Cheng W, Taylor JMG, Vokonas PS, Park SK, and Mukherjee B (2018) Improving estimation and prediction in linear regression incorporating external information from an established reduced model. Statistics in Medicine, 37, 1515–1530.
- Gamazon ER, Wheeler HE, Shah KP, Mozaffari SV, Aquino-Michaels K, Carroll RJ, et al. (2015) A gene-based association method for mapping traits using reference transcriptome data. Nature Genetics, 47, 1091–1098.
- Han P, and Lawless JF (2019) Empirical likelihood estimation using auxiliary summary information with different covariate distributions. Statistica Sinica, 29, 1321–1342.
- Huang CY, and Qin J (2020) A unified approach for synthesizing population-level covariate effect information in semiparametric estimation with survival data. Statistics in Medicine, 39, 1573–1590.
- Imbens GW, and Lancaster T (1994) Combining micro and macro data in microeconometric models. The Review of Economic Studies, 61, 655–680.
- Kundu P, Tang R, and Chatterjee N (2019) Generalized meta-analysis for multiple regression models across studies with disparate covariate information. Biometrika, 106, 567–585.
- PanScan (2015) Whole genome scan for pancreatic cancer risk in the Pancreatic Cancer Cohort Consortium and Pancreatic Cancer Case-Control Consortium (PanScan). dbGaP Study Accession # phs000206.v5.p3.
- Petersen GM, Amundadottir L, Fuchs CS, Kraft P, Stolzenberg-Solomon RZ, Jacobs ZB, et al. (2010) A genome-wide association study identifies pancreatic cancer susceptibility loci on chromosomes 13q22.1, 1q32.1 and 5p15.33. Nature Genetics, 42, 224–228.
- Prentice RL, and Pyke R (1979) Logistic disease incidence models and case-control studies. Biometrika, 66, 403–411.
- Qin J (2000) Combining parametric and empirical likelihoods. Biometrika, 87, 484–490.
- Qin J, and Zhang B (1997) A goodness-of-fit test for logistic regression models based on case-control data. Biometrika, 84, 609–618.
- Qin J, Zhang H, Li P, Albanes D, and Yu K (2015) Using covariate-specific disease prevalence information to increase the power of case-control studies. Biometrika, 102, 169–180.
- White H (1982) Maximum likelihood estimation of misspecified models. Econometrica, 50, 1–25.
- Zhang H, Deng L, Schiffman M, Qin J, and Yu K (2020) Generalized integration model for improved statistical inference by leveraging external summary data. Biometrika, 107, 689–703.
- Zhong J, Jermusyk A, Wu L, Hoskins JW, Collins I, Mocci E, et al. (2020) A transcriptome-wide association study identifies novel candidate susceptibility genes for pancreatic cancer. Journal of the National Cancer Institute, 112, 1003–1012.
