Abstract
The two-phase sampling design is a cost-efficient way of collecting expensive covariate information on a judiciously selected sub-sample. It is natural to apply such a strategy for collecting genetic data in a sub-sample enriched for exposure to environmental factors for gene-environment interaction (G × E) analysis. In this paper, we consider two-phase studies of G × E interaction where phase I data are available on exposure, covariates and disease status. Stratified sampling is done to prioritize individuals for genotyping at phase II conditional on disease and exposure. We consider a Bayesian analysis based on the joint retrospective likelihood of phase I and phase II data. We address several important statistical issues: (i) we consider a model with multiple genes, environmental factors and their pairwise interactions. We employ a Bayesian variable selection algorithm to reduce the dimensionality of this potentially high-dimensional model; (ii) we use the assumption of gene-gene and gene-environment independence to trade-off between bias and efficiency for estimating the interaction parameters through use of hierarchical priors reflecting this assumption; (iii) we posit a flexible model for the joint distribution of the phase I categorical variables using the non-parametric Bayes construction of Dunson and Xing (2009). We carry out a small-scale simulation study to compare the proposed Bayesian method with weighted likelihood and pseudo likelihood methods that are standard choices for analyzing two-phase data. The motivating example originates from an ongoing case-control study of colorectal cancer, where the goal is to explore the interaction between the use of statins (a drug used for lowering lipid levels) and 294 genetic markers in the lipid metabolism/cholesterol synthesis pathway. The sub-sample of cases and controls on which these genetic markers were measured is enriched in terms of statin users. 
The example and simulation results illustrate that the proposed Bayesian approach has a number of advantages for characterizing joint effects of genotype and exposure over existing alternatives and makes efficient use of all available data in both phases.
Keywords and phrases: Biased sampling, Colorectal cancer, Dirichlet prior, Exposure enriched sampling, Gene-environment independence, Joint effects, Multivariate categorical distribution, Spike and slab prior
1. Introduction
Case-control studies are popular analytical tools, particularly in cancer epidemiology, for assessing gene-disease association where the allele/genotype frequencies at a bi-allelic single nucleotide polymorphism (SNP) locus are compared between cases and controls. Recent genomewide case-control association studies (GWAS) have been remarkably successful in identifying susceptibility loci for many cancers [Yeager et al. (2007); Hunter et al. (2007); Amundadottir (2009)]. A large fraction of variability in the different cancer traits still remains unexplained, with the identified SNPs contributing modestly to prediction of disease risk [Wacholder et al. (2010); Park et al. (2010)]. In search of the missing heritability, it is thus natural to study the genetic architecture of a cancer phenotype in conjunction with the known environmental risk factors (environmental toxins, dietary exposures, physical activity levels, medication use, and other behavioral risk factors). In the post-GWAS era, more efficient statistical approaches to characterize such complex gene-environment (G × E) interactions, in terms of both design and analytic tools, have become a pressing need in cancer epidemiology research.
Variants of the case-control sampling design have often been employed in epidemiologic studies. Two-phase stratified sampling [Neyman (1938)] is an efficient alternative to the traditional cohort and case-control designs [Cochran (1963)] from cost and resource-saving perspectives. A typical application of two-phase sampling is for collecting expensive covariate information, for example, novel biomarkers or genotype data on a prioritized sub-sample of the initial study base. In particular, we will consider the following set-up: the binary disease outcome or case-control status D, some relatively inexpensive covariates (S) and environmental data (E) are collected at phase I (P1). At phase II (P2), genotype data (G) are collected on a subset selected from the phase I sample. To select this phase II sub-sample, stratified sampling with strata defined by phase I data (D, E and possibly S) is implemented.
There is a large literature on two-phase designs, using different likelihood based approaches [Horvitz and Thompson (1952); Flanders and Greenland (1991); Breslow and Cain (1988)] or estimating score approaches [Reilly and Pepe (1995); Chatterjee et al. (2003); Robins et al. (1994)]. Maximum likelihood inference for such problems was considered in the pioneering work of Scott and Wild (1997) and Breslow and Holubkov (1997a, b). Lawless et al. (1999) and Breslow and Chatterjee (1999) compare and contrast several approaches for analyzing two-phase data. It has been noted that adding more phases can lead to further efficiency gains; consequently, the two-phase design has been generalized to multiphase designs [Whittemore and Halpern (1997); Lee et al. (2010)]. Haneuse and Chen (2011) propose an intermediate phase between phase I and phase II to reduce participation bias caused by differential participation.
The potential for such sampling designs for G × E studies has been indicated in Thomas (2010). Many GWAS adopt this sampling at the design phase, but little attention is paid at the analysis stage to address the sampling design, thus potentially leading to biased estimates. To the best of our knowledge, literature on two-phase studies of G × E interaction is very limited. Chatterjee and Chen (2007) proposed maximum likelihood inference using a novel regression model for G × E interaction studies where second stage sampling was carried out based on disease outcome and family history. Asymptotic theories were established under the assumption of independence of the genetic and environmental factors in the population.
Multiple papers [Piegorsch et al. (1994); Umbach and Weinberg (1997); Chatterjee and Carroll (2005)] attest the phenomenon of gaining efficiency in studies of G × E by exploiting independence between the genetic and environmental factors under case-control sampling. Under such constraints, it is beneficial to use the retrospective likelihood for estimating interaction parameters instead of standard prospective logistic regression. However, with departures from these constraints, biases in estimating the interaction parameter can occur under retrospective methods. Several researchers have addressed this issue and proposed more robust strategies for testing G × E interaction [Mukherjee et al. (2008, 2010); Mukherjee and Chatterjee (2008); Vansteelandt et al. (2008); Li and Conti (2009); Murcray et al. (2009)]. There is no standard multivariate tool for handling multiple genetic markers simultaneously for G × G and G × E studies that data-adaptively exploits gene-gene and gene-environment independence for gaining efficiency in estimating multiple SNP × E interaction parameters in a potentially high dimensional model.
Bayesian literature on two-phase studies, even beyond the context of G × E studies, is also very limited. Haneuse and Wakefield (2007) presented the first hierarchical Bayesian work that closely relates to such a data structure. The Bayesian framework presented in this paper appears to be a natural route to explore for multiple reasons. First, Bayesian estimation can lead to efficient computational algorithms as the two-phase likelihood is naturally a missing data likelihood. Second, for G × E studies, Bayesian methods provide data-adaptive shrinkage to leverage the constraints of gene-environment independence by imposing informative priors around this assumption. Third, we incorporate Bayesian variable selection features which help us to handle a potentially high dimensional disease risk model with main effects and interactions of multiple genes and environmental factors simultaneously. Fourth, we use the non-parametric Bayesian construction of Dunson and Xing (2009) as a substitute for profile likelihood in the frequentist setting to construct the retrospective likelihood under two-phase sampling. The current paper thus contributes to analysis of G × E studies with multiple markers/environmental exposures under an outcome-exposure stratified two-phase sampling design by offering a new Bayesian treatment of the problem. Our data analysis and simulation studies illustrate that for characterizing sub-group effects of the environmental exposure across genotype categories, our method provides gains in efficiency compared to other alternatives. Moreover, there are no comparable alternatives that can offer the flexibility of our method in terms of multi-marker models and efficient G × E analysis under the two-phase design.
The paper is largely motivated by an example that originates from a population based case-control study of colorectal cancer (CRC) in Israel, namely, the Molecular Epidemiology of Colorectal Cancer (MECC) study. Statins (our environmental factor E) are a class of lipid-lowering drugs used by more than 25 million individuals worldwide for reducing cardiovascular disease risk. The MECC study was the first to establish a chemoprotective association of statins with risk of CRC [Poynter et al. (2005)]. Follow-up individual studies and a meta analysis of 18 studies have confirmed this association [Hachem et al. (2009)]. The benefit of statins for reducing CRC risk has been shown to vary with genetic variations in HMGCR (3-Hydroxy-3-methylglutaryl coenzyme A reductase) gene, a gene involved in cholesterol synthesis [Lipkin et al. (2010)]. To understand the mechanism of effect modification further, investigators measured 294 SNPs in 40 genes, including HMGCR (our set of genetic factors G), selected in the cholesterol synthesis/lipid metabolism pathway. The sub-sample selected for genotyping from the study population of all cases and controls was chosen by stratified sampling conditional on statin use (E) and case-control status (D) where statin users were purposefully oversampled. This sampling strategy was adopted due to limited budgetary resources and DNA samples. Complete statin use (E) data and other basic demographic covariates (S) were available on the entire study base (phase I or P1), and genetic data on these 294 SNPs were only available for the phase II subsample (P2).
In addition, in the MECC study, due to experimental and laboratory logistics, genotype data were missing on a subset of individuals selected in P2 on a group of genes (G1, say) and on a different subset of individuals on another group of genes (G2, say). This led to a non-monotone missing data structure with some individuals in P2 having observations on both (G1, G2) (subset denoted by P2(G1, G2)) and some only on G1 (subset denoted by P2(G1)) and some only on G2 (subset denoted by P2(G2)). Figure 1 is a flow diagram of the sampling scheme and missingness pattern in the data.
The rest of the paper is organized as follows. In Section 2, we present the model ingredients: the likelihood, priors and posteriors. In Section 3, we discuss the analysis of statin × gene interaction in the MECC study. In Section 4, we conduct a simulation study to compare the various maximum likelihood and score based approaches with the Bayesian approach. Section 5 concludes with a discussion.
2. Proposed Methods
2.1. The likelihood
We refer to Figure 1 for understanding the data structure and construction of our likelihood. Let u and D denote the subject indicator and disease status respectively. Here, E is environmental exposure and S are basic demographic covariates as described before. Let W = (E, S). There are N individuals in phase I and M individuals in phase II. To simplify notation we write the retrospective likelihood corresponding to a two-gene model (G1, G2), with the understanding that the methods and notation extend directly to gene-sets (G1, G2) where each contains multiple SNPs. The two-phase likelihood has the following form to capture the sampling phases and the missingness patterns in G (Figure 1),

$$L_{TP} = \prod_{u \in P_2(G_1, G_2)} P(G_{1u}, G_{2u}, W_u \mid D_u) \times \prod_{u \in P_2(G_1)} P(G_{1u}, W_u \mid D_u) \times \prod_{u \in P_2(G_2)} P(G_{2u}, W_u \mid D_u) \times \prod_{u \in P_1 \setminus P_2} P(W_u \mid D_u).$$
Each term in LTP can be factorized by using P(G1, G2, W|D) = {P(D|G1, G2, W) P(G1, G2|W)P(W)}/P(D). This retrospective likelihood is then marginalized over the missing data in each term. We assume missing completely at random [Little and Rubin (2002)] for the genotype data collected at phase II. The likelihood is then expressed as,
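$$L_{TP} = \prod_{u \in P_2(G_1, G_2)} \frac{P(D_u \mid G_{1u}, G_{2u}, W_u)\, P(G_{1u}, G_{2u} \mid W_u)\, P(W_u)}{P(D_u)} \times \prod_{u \in P_2(G_1)} \frac{\sum_{g_2} P(D_u \mid G_{1u}, g_2, W_u)\, P(G_{1u}, g_2 \mid W_u)\, P(W_u)}{P(D_u)} \times \prod_{u \in P_2(G_2)} \frac{\sum_{g_1} P(D_u \mid g_1, G_{2u}, W_u)\, P(g_1, G_{2u} \mid W_u)\, P(W_u)}{P(D_u)} \times \prod_{u \in P_1 \setminus P_2} \frac{\sum_{g_1, g_2} P(D_u \mid g_1, g_2, W_u)\, P(g_1, g_2 \mid W_u)\, P(W_u)}{P(D_u)}, \tag{2.1}$$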
where P(Du) = ∑g1, g2 ∫w P(Du|g1, g2, w)P(g1, g2|w)P(dw) with the integral replaced by the sum when components of W are discrete. Corresponding to this likelihood, there are three model ingredients:
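1. A disease risk model. We assume P(D = 1|G1 = g1, G2 = g2, W = w, β) = H{β0 + m(g1, g2, w, β)}, where H is the logistic function H(u) = 1/{1 + exp(−u)}. A typical choice of m, say for two genes G1 and G2, is $m(g_1, g_2, w, \beta) = \beta_{G_1} g_1 + \beta_{G_2} g_2 + \beta_E e + \beta_S^\top s + \beta_{G_1 G_2} g_1 g_2 + \beta_{G_1 E} g_1 e + \beta_{G_2 E} g_2 e$, noting that w = (e, s).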
2. A model for (G1, G2|W = (E, S)). For genotype data at a bi-allelic locus, Gj can take three possible values (‘g0=aa’, ‘g1=Aa’ and ‘g2=AA’). In full generality, this component requires the joint probabilities P(G1 = gj, G2 = gj′|W = w), j, j′ = 0, 1, 2, that is, a joint model for multivariate categorical data (trinary for SNP data at a bi-allelic locus). Under gene-gene and gene-environment independence, the model factorizes conditional on covariates S as P(G1 = gj, G2 = gj′|E, S) = P(G1 = gj|S) P(G2 = gj′|S), for j, j′ = 0, 1, 2.
Instead of the above fully non-parametric model, we explore a parametric model for the joint distribution P(G1, G2|W). We consider a class of log-linear models with linear by linear structure [Agresti, (2002)] for parsimonious modeling of the (G1, G2|W) associations,
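$$\log \mu_{g_1 g_2 \mid e, s} = \lambda_0 + \lambda_{G_1} g_1 + \lambda_{G_2} g_2 + \lambda_{G_1 G_2} g_1 g_2 + \lambda_{G_1 E} g_1 e + \lambda_{G_2 E} g_2 e + \lambda_{G_1 S}^\top s\, g_1 + \lambda_{G_2 S}^\top s\, g_2, \tag{2.2}$$
where $\mu_{g_1 g_2 \mid e, s}$ denotes the expected cell count for (G1 = g1, G2 = g2) given (E = e, S = s) and gj are chosen ordinal scores, typically 0, 1, 2 [Agresti, (2002)]. This is the common allelic dosage coding under a log-additive genetic susceptibility model.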
Our method could easily be extended to a co-dominant coding of the genetic factor using two dummy variables. Since log-additivity is often assumed for screening interactions, and for simplicity of presentation in terms of one parameter estimate as opposed to two, we proceed with this additive coding. Additionally, even if the true genetic susceptibility model is co-dominant with the disease-causing allele, for a tagging marker which is correlated with this causal allele, one would not know a priori the direction of association of the marker allele and causal allele. Pfeiffer and Gail (2003) show that the additive scores are more robust to choice of marker allele and varying correlation scenarios. In case of high dimensional G, we can further reduce the dimensionality of the problem by assuming common association parameters λGE and λGS between similar functional groups of SNPs. As discussed in Agresti (2002), this Poisson log-linear model has a corresponding multinomial representation. Thus, the probability of (G1 = g1, G2 = g2) given W = (e, s) can be written in terms of the multinomial probabilities,

$$P(G_1 = g_1, G_2 = g_2 \mid E = e, S = s) = \frac{\exp\{\eta(g_1, g_2, e, s)\}}{\sum_{g_1', g_2'} \exp\{\eta(g_1', g_2', e, s)\}},$$
where $\eta(g_1, g_2, e, s)$ denotes the right-hand side of (2.2).
Note that, gene-gene and gene-environment independence in the above model (2.2) will imply λG1E ≡ λG2E ≡ λG1G2 ≡ 0.
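The multinomial representation of the log-linear model (2.2) and the independence condition just stated can be checked numerically. The following is a minimal Python sketch, not the authors' implementation: the function name `genotype_probs`, the use of a single scalar covariate s, and all parameter values are assumptions made purely for illustration.

```python
import numpy as np

def genotype_probs(lam, e, s):
    """Return a 3x3 array of P(G1=g1, G2=g2 | E=e, S=s) under a
    linear-by-linear log-linear model with allelic dosage scores 0, 1, 2.

    lam is a dict of hypothetical parameter values; s is a single scalar
    covariate here for simplicity (an assumption of this sketch).
    """
    g = np.arange(3)
    g1, g2 = np.meshgrid(g, g, indexing="ij")
    eta = (lam["G1"] * g1 + lam["G2"] * g2
           + lam["G1G2"] * g1 * g2
           + lam["G1E"] * g1 * e + lam["G2E"] * g2 * e
           + lam["G1S"] * g1 * s + lam["G2S"] * g2 * s)
    w = np.exp(eta - eta.max())   # subtract the max for numerical stability
    return w / w.sum()            # normalise over the nine genotype cells

lam = {"G1": -0.5, "G2": -0.8, "G1G2": 0.0,
       "G1E": 0.0, "G2E": 0.0, "G1S": 0.1, "G2S": -0.2}

p_unexposed = genotype_probs(lam, e=0, s=1)
p_exposed = genotype_probs(lam, e=1, s=1)

# With lambda_G1E = lambda_G2E = 0 the genotype distribution does not depend on E.
print(np.allclose(p_unexposed, p_exposed))

# With lambda_G1G2 = 0 the joint distribution equals the product of its margins.
outer = np.outer(p_exposed.sum(axis=1), p_exposed.sum(axis=0))
print(np.allclose(p_exposed, outer))
```

Terms that depend only on (e, s), such as an intercept, cancel in the normalisation, which is why they are omitted from the linear predictor in this sketch.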
3. A model for W = (E, S). A non-parametric and flexible model for the distribution of W is desired. Recall that W can be a mixed set of quantitative and categorical variables. For the MECC example W is a set of categorical covariates, which will be our primary focus in this paper. The approach for modeling the joint distribution of a set of categorical variables that we follow for W can also be applied to the joint distribution of the trinary genotype variables G1 and G2 modeled in (2.2). However, reflecting prior belief in the gene-gene and gene-environment independence assumptions through direct priors on the parameters λG1E, λG2E, λG1G2 in the log-linear model (2.2) is more straightforward for a practitioner. This is the primary reason for using (2.2) for the second component P(G1, G2|W = (E, S)).
Let Wu = (Eu, Su) denote the W data corresponding to subject u, u = 1, …, N. Here Wu is a p×1 vector of p categorical variables, i.e. Wu = (wu1, …, wup) for a subject u. Assume that the j-th component of W can take dj values, j = 1, ⋯, p. In order to parsimoniously model this (d1 × d2 × ⋯ × dp) joint distribution, Dunson and Xing (2009) (henceforth DX) first note that the joint distribution of two categorical variables can always be expressed as a finite mixture of product-multinomial distributions. Extending this idea, DX introduce a latent class index variable zu ∈ {1, …, k}, such that wur, wut, r, t ∈ {1, …, p}, r ≠ t, are conditionally independent given zu. Then the joint distribution for wu has this finite mixture representation,
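$$P(w_{u1} = c_1, \ldots, w_{up} = c_p) = \sum_{h=1}^{k} P(z_u = h) \prod_{j=1}^{p} P(w_{uj} = c_j \mid z_u = h). \tag{2.3}$$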
For notational convenience, we rewrite (2.3) as

$$P(w_{u1} = c_1, \ldots, w_{up} = c_p) = \sum_{h=1}^{k} \nu_h \prod_{j=1}^{p} \psi^{(j)}_{h c_j},$$
where ν = (ν1, …, νk)⊤ is a probability vector with νh = P(zu = h) and $\psi^{(j)}_h = (\psi^{(j)}_{h1}, \ldots, \psi^{(j)}_{h d_j})^\top$ is a dj × 1 probability vector whose cj-th element is the conditional probability of wuj = cj, given that subject u is in latent class h, for j = 1, …, p. We will discuss the choice of k through a Dirichlet process prior structure on this latent class probability model in the next section.
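To make the latent class representation concrete, the sketch below builds the full joint probability table of p categorical variables from given class weights ν and class-specific category probabilities ψ. It is an illustrative Python sketch only: the function name `mixture_joint` and the numerical values of ν and ψ are hypothetical, and the example uses three binary variables with two latent classes.

```python
import numpy as np
from functools import reduce

def mixture_joint(nu, psi):
    """Joint probability tensor of p categorical variables under a finite
    mixture of product-multinomials.

    nu  : length-k vector of latent class probabilities.
    psi : list of p arrays; psi[j] has shape (k, d_j) and row h holds the
          category probabilities of variable j within latent class h.
    """
    joint = 0.0
    for h in range(len(nu)):
        # Outer product of the p class-specific probability vectors.
        component = reduce(np.multiply.outer, [psi_j[h] for psi_j in psi])
        joint = joint + nu[h] * component
    return joint

# Hypothetical example: p = 3 binary variables, k = 2 latent classes.
nu = np.array([0.6, 0.4])
psi = [np.array([[0.9, 0.1], [0.3, 0.7]]),
       np.array([[0.8, 0.2], [0.4, 0.6]]),
       np.array([[0.5, 0.5], [0.2, 0.8]])]

joint = mixture_joint(nu, psi)
print(joint.shape)   # one axis per variable
print(joint.sum())   # total probability

# The first two variables are marginally dependent even though they are
# independent within each latent class.
pair = joint.sum(axis=2)
product = np.outer(pair.sum(axis=1), pair.sum(axis=0))
print(np.abs(pair - product).max())
```

The number of free parameters grows linearly in p for fixed k, whereas the saturated table has d1 × ⋯ × dp − 1 free cells, which is the parsimony the construction is designed to provide.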
Remark 1: While Chatterjee and Chen (2007) and Chatterjee and Carroll (2005) use profile likelihood for handling the distribution of W non-parametrically, it has been a challenging task in the Bayesian framework to posit a flexible model for W = (E, S) which could be a mixture of categorical and continuous covariates. In this mixed case, Müller et al. (1999) model the joint distribution of the continuous covariates through a Dirichlet Process mixture of normals. Then, conditional on the continuous covariates, the categorical variables have a joint multivariate probit distribution. A recent paper by Bhattacharya and Dunson (2011) extends the above DX construction for categorical data to handle joint distribution modeling of more complex data, including continuous and discrete data. They extend the conditional independence idea and replace the product-multinomial structure in (2.4) by a product of various kernels, such as Gaussian, Poisson and more complex univariate or multivariate distributional kernels. The MECC example does not require going beyond the original DX construction, but with continuous E, this is what we would adopt.
Remark 2: If the phase I sample is a cohort study, with disease endpoint D, then the corresponding likelihood is proportional to:
(2.4) |
Similarly, if environmental data E is collected in phase II as well, the first term representing the phase I cohort likelihood can also involve an integral over the missing E data with respect to a probability distribution dF(E), exactly as in equation (3) of Chatterjee and Chen (2007). A surrogate measure of E, namely E* may be available in phase I and a measurement error model relating E and E* can also be used to construct a joint likelihood of phase I and phase II data.
2.2. Priors
As mentioned before, for this complex retrospective likelihood formulation, we have three sets of parameters from the above three ingredients of the likelihood. For β in the disease risk model, we use a spike and slab type mixture prior to handle variable selection in a high-dimensional disease risk model with multiple markers. For λ in the multivariate gene model, the Bayesian hierarchical approach provides a flexible way to allow for uncertainty around the assumption of gene-gene and gene-environment independence, through priors on λG1G2, λG1E, and λG2E. When certain configurations of (G1, G2, W) are sparse or the dimension of (G1, G2, W) grows, frequentist profile likelihood estimation may become unstable; the log-linear model with shared parameters across gene-sets and the DX latent mixture construction help in such situations. We follow the same sequence as in the previous section to describe the prior structure on the parameters.
1. In the presence of multiple genes in G1 and G2, the logistic disease risk model can potentially have many pairwise and higher order interaction terms. We implement a scalable variable selection framework via spike and slab type priors [Mitchell and Beauchamp (1988); George and McCulloch (1993)] on the parameters β in the disease risk model P(D|G1, G2, W; β). We impose mixture prior distributions on each component of β, say, (β0, βG1, βG2, βE, βS, βG1G2, βG1E, βG2E) for a two-gene model. In general we denote this vector by βnβ×1 = {βr, r = 1, …, nβ}. Given a latent variable p0 representing the mixture weight on the “not informative” regression coefficients, we describe the hierarchical prior structure as follows,
As discussed in Ishwaran and Rao (2003), υ0 in the above specification is assumed to be a small positive value near 0. Note that fr can assume two values, υ0 or 1. At each iteration of posterior sampling, fr takes value 1 if the sampled βr is significantly away from zero, implying that the r-th covariate is potentially informative. Note that a key feature of this prior specification is that the marginal prior variance of βr is calibrated as var(βr) = fr τr² and has a bimodal distribution. Large var(βr) can occur when fr = 1 and τr² is large, inducing large values of βr and identifying potentially informative covariates. Small values of var(βr) occur when fr assumes value υ0, leading to values of βr that are near zero, suggesting that βr is potentially uninformative. The value of p0 controls how likely it is for fr to be υ0 or 1, thus controlling how many βr are non-zero, or the complexity of the model. The hyperparameters (a, b) control the degree of parsimony through the prior on p0. We set (a, b) = (1, 1), i.e. a uniform prior on p0, for the analysis we present in the main text. Note that (a1, a2) determines the prior on τr² and thus the variance of βr. We fix (a1, a2) at (5, 50) to allow the possibility of large prior variances on β. The values used for the hyperparameters in the hierarchy are exactly as recommended in Ishwaran and Rao (2003).
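The behaviour described above can be seen by drawing from a spike and slab hierarchy of this general type. The sketch below is an illustration under stated assumptions rather than the paper's exact specification: it assumes p0 ~ Beta(a, b) (uniform when a = b = 1), fr = υ0 with probability p0 and 1 otherwise, 1/τr² ~ Gamma with shape a1 and rate a2, and βr ~ N(0, fr τr²). The function name `draw_spike_slab` and the value υ0 = 0.005 are hypothetical.

```python
import numpy as np

def draw_spike_slab(n_beta, v0=0.005, a=1.0, b=1.0, a1=5.0, a2=50.0, rng=None):
    """Draw one vector beta from an assumed spike-and-slab hierarchy.

    Assumed structure (an illustration, not the paper's exact display):
      p0 ~ Beta(a, b)                      weight on the 'spike' value v0
      f_r = v0 with probability p0, else 1
      1 / tau_r^2 ~ Gamma(shape=a1, rate=a2)
      beta_r | f_r, tau_r^2 ~ N(0, f_r * tau_r^2)
    """
    rng = np.random.default_rng() if rng is None else rng
    p0 = rng.beta(a, b)
    f = np.where(rng.random(n_beta) < p0, v0, 1.0)
    # NumPy's gamma takes shape and scale, so scale = 1 / rate.
    tau2 = 1.0 / rng.gamma(shape=a1, scale=1.0 / a2, size=n_beta)
    beta = rng.normal(0.0, np.sqrt(f * tau2))
    return beta, f, tau2, p0

rng = np.random.default_rng(1)
beta, f, tau2, p0 = draw_spike_slab(n_beta=8, rng=rng)
for r in range(len(beta)):
    label = "slab" if f[r] == 1.0 else "spike"
    prior_sd = np.sqrt(f[r] * tau2[r])
    print(f"beta[{r}] = {beta[r]: .3f}  ({label}, prior sd = {prior_sd:.3f})")
```

Coefficients assigned to the spike have a prior standard deviation that is smaller by a factor of √υ0, so they are shrunk towards zero, while slab coefficients retain a diffuse prior.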
2. In the joint log-linear model (2.2), we typically assume vague normal priors with large variance on the parameters (λG1, λG2, λG1S, λG2S). In our data example, we have used a N(0, 10^4) prior. On the other hand, for the G-G and G-E pairwise association parameters (λG1G2, λG1E, λG2E) we reflect a priori information on G-G or G-E independence via a normal prior centered at zero but with two different choices for the prior variance. In the first set of priors we reflect the belief that with 95% probability the association parameter lies between log(0.8) and log(1.2). This leads to an approximate SD = 0.1 under a normal distribution and thus we assume an informative prior of N(0, 10^−2). In the second choice, following the empirical Bayes estimation of Mukherjee and Chatterjee (2008), we compute association parameters for G1–G2, G1-E, and G2-E in the control subjects in the data, say θ̂, and use a data-driven prior N(0, θ̂²) on λG1G2, λG1E, and λG2E.
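The step from the 95% prior interval to SD ≈ 0.1 is a short calculation, sketched here in Python. The variable names are illustrative; the only input beyond the text is the standard normal 97.5th percentile.

```python
import math

z = 1.959964  # 97.5th percentile of the standard normal distribution
lower, upper = math.log(0.8), math.log(1.2)

# The interval is not exactly symmetric about zero on the log scale,
# so each end point implies a slightly different standard deviation.
sd_from_lower = abs(lower) / z
sd_from_upper = upper / z
sd_from_width = (upper - lower) / (2 * z)

print(round(sd_from_lower, 3), round(sd_from_upper, 3), round(sd_from_width, 3))
```

All three values are close to 0.1, which corresponds to a prior variance of roughly 10^−2.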
3. The mixture representation in (2.4) requires determining the number of latent classes k. Following DX, instead of selecting a fixed k, a Bayesian nonparametric approach is carried out through the Dirichlet process prior specification on ν:
where ⊗ is the outer product. The parameter α is a hyper-parameter that controls the rate of decrease of the weights in the stick-breaking process (Sethuraman, 1994). For example, for small values of α, νh decreases towards zero quickly with increasing h, thus putting most of the weight on the first few components and leading to a sparse representation. The hyperprior on α allows one to data-adaptively determine the degree of sparseness or the number of components needed. As discussed in DX (2009), we set (aα, bα) = (1/4, 1/4) for a vague prior, which implies that the probability of independence across components of w in the product multinomial model is 0.5. We set uniform priors for each category probability ψ with aj1 = ⋯ = ajdj = 1, for j = 1, …, p, and let the data dominate the priors. To avoid an unnecessarily large number of mixture components, instead of using infinite mixtures we truncate the number of mixture components k at a maximum of 30 in the real data example [Ahn et al. (2012)]. We study sensitivity with respect to this truncation threshold in Table 1.
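The effect of α on the sparseness of the latent class weights can be illustrated with a truncated stick-breaking draw. This is a minimal Python sketch under assumptions: α is held fixed rather than given its Gamma hyperprior, the last stick-breaking fraction is set to one so that the truncated weights sum to one, and the function name `stick_breaking_weights` is hypothetical.

```python
import numpy as np

def stick_breaking_weights(alpha, k_max, rng):
    """Truncated stick-breaking weights nu_1, ..., nu_kmax.

    V_h ~ Beta(1, alpha) for h < k_max and V_kmax = 1, so the weights
    sum to one exactly; nu_h = V_h * prod_{l < h} (1 - V_l).
    """
    v = rng.beta(1.0, alpha, size=k_max)
    v[-1] = 1.0                                   # close the stick at k_max
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

rng = np.random.default_rng(0)
summary = {}
for alpha in (0.25, 1.0, 5.0):
    # Average over repeated draws to see the typical concentration of weight.
    draws = np.array([stick_breaking_weights(alpha, 30, rng) for _ in range(2000)])
    summary[alpha] = np.sort(draws, axis=1)[:, -3:].sum(axis=1).mean()
    print(f"alpha = {alpha}: mean weight on the three largest components = {summary[alpha]:.2f}")
```

Smaller α concentrates the weight on fewer components, so with a truncation level such as 30 most components receive negligible weight and the truncation has little practical effect.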
Table 1.
| | TPFB | TPFBemp | WL | PL | UML | CML | EB |
|---|---|---|---|---|---|---|---|
| | est. (PSD) | est. (PSD) | est. (se) | est. (se) | est. (se) | est. (se) | est. (se) |
| Exposure variables | | | | | | | |
| G1 | 0.04 (0.09) | 0.01 (0.09) | 0.00 (0.09) | 0.00 (0.09) | 0.00 (0.09) | 0.00 (0.08) | −0.07 (0.08) |
| G2 | −0.04 (0.10) | −0.06 (0.10) | −0.13 (0.10) | −0.13 (0.10) | −0.13 (0.10) | −0.12 (0.08) | −0.12 (0.08) |
| Statin use | −1.29 (0.30) | −1.32 (0.27) | −1.30 (0.30) | −1.30 (0.30) | −1.40 (0.30) | −1.54 (0.28) | −1.51 (0.29) |
| G1 × G2 | 0.01 (0.07) | 0.03 (0.05) | 0.05 (0.08) | 0.06 (0.08) | 0.06 (0.08) | 0.06 (0.06) | 0.06 (0.06) |
| G1 × Statin use | 0.34 (0.17) | 0.34 (0.15) | 0.25 (0.18) | 0.25 (0.18) | 0.25 (0.18) | 0.38 (0.15) | 0.34 (0.17) |
| G2 × Statin use | 0.33 (0.16) | 0.33 (0.16) | 0.38 (0.18) | 0.38 (0.19) | 0.38 (0.20) | 0.38 (0.18) | 0.38 (0.19) |
| Gene-Statin and Gene-Gene association parameters from P(G1, G2 \| E, S) | | | | | | | |
| λG1G2 | 0.02 (0.05) | 0.00 (0.05) | | | | | |
| λG1E | 0.05 (0.07) | 0.08 (0.06) | | | | | |
| λG2E | 0.01 (0.07) | 0.01 (0.07) | | | | | |

(b) Sensitivity analysis with respect to the maximum number of allowable mixture components kmax, and the prior on G-G and G-E association parameters λ.

| | | G1 | G2 | Statin use | G1 × G2 | G1 × Statin use | G2 × Statin use |
|---|---|---|---|---|---|---|---|
| kmax=10 | TPFB | 0.05 (0.10) | −0.03 (0.10) | −1.29 (0.30) | 0.01 (0.07) | 0.36 (0.16) | 0.32 (0.17) |
| | TPFBemp | 0.01 (0.09) | −0.06 (0.10) | −1.32 (0.29) | 0.03 (0.07) | 0.31 (0.15) | 0.32 (0.16) |
| kmax=30 | TPFBnon | 0.05 (0.11) | −0.03 (0.11) | −1.29 (0.31) | 0.01 (0.07) | 0.34 (0.19) | 0.34 (0.21) |
TPFB, TPFBemp, TPFBnon: Two-phase full Bayes (with informative prior N(0, 10^−2), using empirical estimates for prior variances, and with non-informative prior N(0, 10^4), respectively) on G-E association parameters.
UML: Unconstrained maximum likelihood, CML: Constrained maximum likelihood, EB: Empirical-Bayes, WL: weighted likelihood, and PL: pseudo-likelihood.
2.3. Posterior sampling
In the full likelihood (2.1), the three components are linked with each other through the sum over each component in the expression for P(D) in the denominator. We denote the two-phase likelihood in (2.1) by LTP, which involves the parameters (β, λ, ψ, V, α). The full conditionals are not reducible to a simpler closed form and are best represented by the following proportionality relations:
where nβ and nλ again represent the number of parameters in (β, λ) respectively.
Posterior sampling corresponding to P(W): Let us recapitulate the model structure for W which is essentially a Dirichlet process mixture of discrete Dirichlet kernels. For u = 1, ⋯, N and j = 1, ⋯, p,
DX present an efficient data-augmented Gibbs sampling algorithm by augmenting the likelihood with latent constructs following Walker (2007). The details of the updating steps are described in the supplemental article [Ahn et al. (2012)].
Note that while the entire likelihood in DX consisted of W data only, in our problem P(W) is embedded as a component in the joint retrospective likelihood LTP in (2.1). Thus for updating the parameters involved in P(W), say θ (= {ψ, V, α}), we use the Metropolis-Hastings algorithm. Only the terms ∏u P(Wu)/P(Du) from the full likelihood (2.1) involve θ, where P(Du) = ∑g1, g2 ∑w P(Du|g1, g2, w) P(g1, g2|w)P(w). We draw θ following the DX algorithm, and for the proposal density of θ we consider the implied full conditional q(θnew|W) as determined by this algorithm. Then, given λ and β, we repeat the following updates of θ.
- At iteration l, sample a vector θnew from q(θnew|W) as described in the DX (2009) algorithm.
- Compute the acceptance ratio $r(\theta^{new}, \theta^{l}) = \min\{1, \prod_u P(D_u \mid \theta^{l}) / \prod_u P(D_u \mid \theta^{new})\}$. In calculating the acceptance ratio, we note that the factor ∏u{P(Wu|θnew)}p(θnew) q(θl|W)/∏u{P(Wu|θl)}p(θl)q(θnew|W) cancels, where p(θ) is the prior for θ.
- If U < r(θnew, θl), where U ~ unif(0, 1), we set θl+1 = θnew. Otherwise, the candidate vector θnew is rejected and θl+1 = θl.
- Repeat the steps until the posterior chains converge.
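The mechanics of these steps can be sketched in a deliberately small toy problem. In the Python sketch below, W is a single binary covariate, θ = P(W = 1), the proposal is the conjugate Beta posterior of θ given W alone under a uniform prior (standing in for the DX full conditional), and the risk model is held fixed. All names and numerical values are hypothetical, and the W data are simulated without regard to case-control status, so the sketch illustrates only the update, not the full model.

```python
import numpy as np

def log_prob_d(theta, d, risk):
    """log prod_u P(D_u | theta), with P(D=1 | theta) = sum_w P(D=1|w) P(w|theta).

    theta is P(W = 1) for a single binary covariate W (toy setting) and
    risk = (P(D=1 | W=0), P(D=1 | W=1)) is held fixed, as beta and lambda
    are held fixed during this update.
    """
    p1 = risk[0] * (1.0 - theta) + risk[1] * theta
    n_case = d.sum()
    return n_case * np.log(p1) + (len(d) - n_case) * np.log1p(-p1)

def update_theta(theta, w, d, risk, rng):
    """One independence Metropolis-Hastings update of theta.

    The proposal is the Beta posterior of theta given W alone, so the
    prior, P(W | theta) and the proposal density cancel and only the
    ratio of prod_u P(D_u | theta) terms remains.
    """
    theta_new = rng.beta(1 + w.sum(), 1 + len(w) - w.sum())
    log_r = log_prob_d(theta, d, risk) - log_prob_d(theta_new, d, risk)
    if np.log(rng.random()) < min(0.0, log_r):
        return theta_new, True
    return theta, False

rng = np.random.default_rng(42)
risk = (0.2, 0.6)                      # hypothetical P(D=1 | W=0), P(D=1 | W=1)
w = rng.binomial(1, 0.3, size=400)     # hypothetical covariate data
d = np.repeat([1, 0], 200)             # 200 cases and 200 controls

theta, accepted, chain = 0.5, 0, []
for _ in range(5000):
    theta, acc = update_theta(theta, w, d, risk, rng)
    accepted += acc
    chain.append(theta)

print(f"acceptance rate = {accepted / 5000:.2f}")
print(f"posterior mean of theta after burn-in = {np.mean(chain[1000:]):.3f}")
```

Working on the log scale avoids underflow in the product over subjects, which matters when N is in the thousands as in the MECC data.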
Given the full conditionals, we implement the Gibbs sampler [Geman and Geman (1984)] with Metropolis Hastings updates to sample from respective full conditional distributions. For each parameter, we iterate 50,000 times and discard the first 40,000 iterations as ‘burn-in’. We check convergence of the chains using trace plots and the numerical diagnostic statistic ‘potential scale reduction factor’ [Gelman and Rubin (1992)] using the R package CODA [Plummer et al. (2009)]. Auto and cross-correlation checks are performed and a thinning of every tenth observation is carried out. Remaining posterior samples are used to construct estimated posterior summaries needed for Bayesian inference.
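The burn-in, thinning and convergence check described above can be sketched as follows. This Python sketch computes the basic form of the potential scale reduction factor from parallel chains; it omits the degrees-of-freedom correction that some software applies, and the simulated chains are hypothetical stand-ins for actual MCMC output. The function names are illustrative.

```python
import numpy as np

def potential_scale_reduction(chains):
    """Basic Gelman-Rubin potential scale reduction factor for one parameter.

    chains : array of shape (m, n) holding m parallel chains of length n
             (after burn-in and thinning).
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    within = chains.var(axis=1, ddof=1).mean()       # W: mean within-chain variance
    between = n * chain_means.var(ddof=1)            # B: between-chain variance
    var_plus = (n - 1) / n * within + between / n    # pooled variance estimate
    return np.sqrt(var_plus / within)

def burn_and_thin(draws, burn_in=40_000, thin=10):
    """Discard the burn-in and keep every `thin`-th remaining draw."""
    return np.asarray(draws)[burn_in::thin]

rng = np.random.default_rng(7)
# Hypothetical stand-ins for MCMC output: three chains of 50,000 draws each.
raw = rng.normal(size=(3, 50_000))
kept = np.array([burn_and_thin(c) for c in raw])
print(kept.shape)
print(round(potential_scale_reduction(kept), 3))
```

With 50,000 iterations, a burn-in of 40,000 and thinning by ten, each chain contributes 1,000 retained draws; values of the factor close to 1 are consistent with convergence.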
3. The Molecular Epidemiology of Colorectal Cancer Study
In this section, we describe the motivating example from the MECC study in detail and present analysis results. We use data on 1,746 cases and 1,853 controls with completely observed response to the question whether statins were used for more than 5 years. The binary variable ‘statin use of at least 5 years’ (E), is the environmental factor of interest with 91% “NO” and 9% “YES”.
We adjust for completely observed confounders and precision variables (S): age (S1), gender (S2), ethnicity (S3), physical activity (S4), family history of CRC (S5), vegetable consumption (S6), NSAID usage within 3 years (S7), and Aspirin usage within 3 years (S8). Age and ethnicity variables were dichotomized as Age ≥ or < 50 (94% and 6% respectively), and ‘Ashkenazi’ and ‘Non-Ashkenazi’ (68% and 32% respectively). Gender (S2) was coded as 1 (50%) for male and 0 (50%) for female. The remaining binary factors (S4, S5, S6, S7, S8) are coded as 1 or “YES” with proportions (0.36, 0.09, 0.31, 0.02, 0.20) respectively.
For genotyping at phase II, stratified sampling based on the disease status (D) and statin use (E) was carried out. All case-control subjects with statin use (“YES”) were included in the phase II sample. We have 1,200 cases and 1,200 controls at phase II with data available on 294 trinary SNPs G = (G1, …, G294). Genotype data are not completely observed even at phase II due to technical genotyping failures for a limited number of SNPs. Among the 2,400 case-control subjects at phase II, 56 subjects and 20 subjects had partial genotype information on two respective subsets of SNPs. We did not have a dense set of markers typed across the genome to successfully impute these missing genotypes; thus we consider a marginalized likelihood as in (2.1).
Among the 294 SNPs, we first illustrate our methods with two SNPs on two genes, RS1800775 on CETP (G1) and RS1056836 on CYP1B1 (G2), where both SNPs exhibit significant interactions with statin use in an initial single marker interaction analysis. We compare our methods for this simple two SNP model to some of the alternative methods that can only handle single marker interaction analysis. The raw frequencies of the cross-classification of case-control status (D), statins (E), genotypes G1 and G2 are shown in online supplementary Table 1 [Ahn et al. (2012)]. Simple logistic regression analysis was carried out to examine G1-E and G2-E association among control subjects and yielded odds ratios of 1.11 and 1.01 and corresponding p-values of 0.30 and 0.91 respectively. A chi-squared test for independence of G1 and G2 reveals no association (p-value of 0.90). These tests suggest that the data support the G1-E, G2-E and G1–G2 independence assumptions.
We report the results of the multivariate analysis in Table 1. Along with the two-phase full Bayes approach (TPFB) we consider five alternative methods. None of these competing methods both uses the data from both phases and exploits the independence constraints. The first three use phase II data only: (i) Unconstrained maximum likelihood (UML), a retrospective analysis that does not specify any constraints on P(G1, G2|E, S); (ii) Constrained maximum likelihood (CML), which imposes Hardy-Weinberg Equilibrium as well as G1-E/G1–G2 independence; (iii) Empirical-Bayes (EB), which uses data-adaptive ‘shrinkage estimation’ between the constrained and unconstrained ML estimates. Since methods (ii) and (iii) are developed for single marker analysis, G2-E independence cannot be enforced in existing software (we used the ‘CGEN’ package by Bhattacharjee, Chatterjee, and Wheeler, 2011). These three methods completely ignore biased sampling at phase II and may thus lead to biased estimation of the main effect of E, particularly if the exposure sampling rates were differential among cases and controls. The next two approaches use information from both phases under a prospective likelihood framework: (iv) a Horvitz-Thompson estimator, typically known as a weighted likelihood (WL) approach [Manski and Lerman (1977); Breslow and Chatterjee (1999)]. This approach uses sampling fractions nij/Nij, where nij and Nij are the number of subjects corresponding to D = i, E = j at phase II and phase I respectively. The inverse sampling fractions serve as weights in the likelihood to adjust for biased sampling (we used the svyglm function in the ‘survey’ package in R by Lumley, 2011). Finally, (v) a pseudo-likelihood (PL) approach, which also adjusts for biased sampling probabilities in a likelihood framework [Schill et al. (1993)]. Briefly, if we denote Pij = P(D = i|E = j) = exp(iαj)/{1+exp(αj)} where αj is the log-odds for D = 1 when E = j, then pseudo-likelihood is defined as . Here,
where sijk denotes covariate values for a subject with D = i and E = j.
Note that all five of these methods use only the completely observed phase II data on G1 and G2, whereas our proposed method also includes partially observed data by marginalizing the likelihood over G1 and G2 when needed.
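To make the weighting in the WL approach concrete, the sketch below computes the inverse sampling fractions Nij/nij per (D, E) stratum and plugs them into a weighted logistic log-likelihood. It is an illustration under simple assumptions (binary D and E, a plain logistic risk model) and is not the svyglm implementation used for the analysis; the function and argument names are ours.

```python
import math
from collections import Counter

def sampling_weights(phase1, phase2):
    """Inverse sampling fractions N_ij / n_ij for each (D, E) stratum.

    phase1, phase2: iterables of (d, e) pairs, one pair per subject
    observed at phase I and at phase II respectively.
    """
    big_n = Counter(phase1)
    small_n = Counter(phase2)
    return {stratum: big_n[stratum] / small_n[stratum] for stratum in small_n}

def weighted_loglik(beta, subjects, weights):
    """Weighted logistic log-likelihood over phase II subjects.

    subjects: list of (d, e, x), where x is the covariate vector
    (including the intercept) and (d, e) identifies the sampling stratum.
    """
    total = 0.0
    for d, e, x in subjects:
        eta = sum(b * xi for b, xi in zip(beta, x))
        total += weights[(d, e)] * (d * eta - math.log1p(math.exp(eta)))
    return total
```

A stratum that is fully genotyped at phase II receives weight 1, while a stratum sampled at rate 500/900 receives weight 1.8, so under-sampled strata count for more in the estimating equations.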
As previously explained, we present our method (TPFB) under two different priors on the G-E and G-G association parameters in model (2.2). First, we consider the informative prior N(0, 10⁻²), which encodes a fixed prior belief concentrated around G-E and G-G independence; we denote this analysis by TPFB. The analysis using an alternative prior, where the prior variances on λGG and λGE are estimated from the observed association in the data, is denoted by TPFBemp. In Table 1, the variable selection scheme is switched off in TPFB and TPFBemp by setting all fr = 1, r = 1, …, nβ, so that all covariates are included across all methods. This is done so that the method can be fairly compared to the alternatives, which do not have the variable selection feature.
Under all methods, note in Table 1 that the estimated coefficients corresponding to statin use suggest a strong negative association with CRC status. The estimated effect size varies depending on whether the method accounts for biased sampling and/or gene-environment independence. In the presence of interactions, the main effect estimates cannot be interpreted on their own, and we need to combine the model results to present estimated sub-group effects. Recall that G1-E and G2-E independence does appear to be plausible in the light of these data. Note that while the G2 × E interaction is detected by all methods, the G1 × E interaction is detected only by CML, EB, TPFB and TPFBemp, i.e., methods that use the independence assumption. The TPFB estimates of terms involving E differ slightly in effect size and have smaller standard errors when compared to the other methods. Smaller standard errors for the interaction parameters are noted in all retrospective methods that explicitly model the (G1, G2, E) dependence structure.
We also carried out a sensitivity analysis with respect to the choice of threshold used to truncate the maximum value of k in the DX construction and the prior on G-E and G-G association. As can be seen from Table 1(b), the results are almost identical with a smaller number (kmax = 10) of components in the mixture distribution for W. This suggests that a further gain in computational efficiency is possible by imposing a more parsimonious constraint on k. In another sensitivity analysis, when the prior on G-E association is the non-informative N(0, 10⁴), we notice that the TPFB estimates drift slightly toward the estimates from PL and WL while losing some efficiency on the G1 × E and G2 × E terms.
To reflect our main interest in sub-group effects of statin across genotype configurations, we report effects of statin across genotype sub-groups of one SNP, holding the other SNP fixed at its common genotype category (coded as 0). It seems that the statin effect is strongly modified by G1 and G2. According to the TPFBemp estimates, keeping the G2 genotype fixed at C/C, the benefit of taking statins to reduce the risk of CRC is greatest in the A/A genotype of G1, with the posterior estimate (and 95% HPD) of the odds ratio (statin users versus non-users) being 0.26 (0.15, 0.42). The corresponding ORs in genotype categories A/C and C/C are 0.36 (0.23, 0.45) and 0.49 (0.31, 0.69) respectively. Figure 2 illustrates estimated posterior densities of the odds ratios corresponding to statin use across each genotype of G1 (left) or G2 (right), while holding the other SNP fixed at the most common category. This figure indicates that the protective effect of statin on CRC diminishes as the allelic dosage of the minor allele increases in both G1 and G2. Overall, the TPFB approaches provide much narrower credible intervals compared to PL and WL by exploiting G1-E and G2-E independence. The estimates from methods that use phase II data only, namely CML, UML and EB, are numerically slightly different.
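Sub-group odds ratios of this kind combine the main effect of E with the relevant interaction terms. The sketch below shows the arithmetic under an assumed coding, namely genotypes entered as minor-allele counts (0, 1, 2) with linear G × E terms on the log-odds scale; the coding actually used in the fitted model is not restated in this section. The coefficient values in the usage note are hypothetical round numbers, not the estimates in Table 1.

```python
import math
import statistics

def statin_odds_ratio(beta_e, beta_g1e, beta_g2e, g1, g2):
    """Odds ratio for statin users versus non-users in the sub-group (g1, g2).

    Assumes a logistic risk model in which E enters through
    beta_e * E + beta_g1e * g1 * E + beta_g2e * g2 * E.
    """
    return math.exp(beta_e + beta_g1e * g1 + beta_g2e * g2)

def posterior_median_or(draws, g1, g2):
    """Posterior median of the sub-group odds ratio.

    draws: list of (beta_e, beta_g1e, beta_g2e) tuples, one per MCMC iteration.
    """
    return statistics.median(
        statin_odds_ratio(be, b1, b2, g1, g2) for be, b1, b2 in draws
    )
```

For example, with hypothetical values beta_e = log(0.25) and beta_g1e = log(1.4), the odds ratio moves from 0.25 to 0.35 to 0.49 as g1 goes from 0 to 2 with g2 = 0, which is the kind of attenuation with allelic dosage described above.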
Variable Selection: We explore how the variable selection feature performs in this example for the TPFB method. Previous research by Ishwaran and Rao (2003) discussed the performance of the spike and slab prior for general variable selection. We add three SNPs (RS5925224, RS10174721, RS10077453) and all possible pairwise G × G and G × E interactions to the two-SNP model fitted in Table 1. The dimension of the disease risk model is now 34. None of the main effects and interactions corresponding to these three additional SNPs were found significant in an initial single marker analysis.
We set fr = 1 for S1 through S8 to always keep the confounders and precision variables in the model. The tuning parameter υ0 is fixed at 0.0001 for this application, with sensitivity analysis results presented for υ0 = 0.001 in Table 3. We would like to see if the variable selection can still detect significant G1 × E and G2 × E interactions. Moreover, we would like to assess whether the three additional SNPs and the corresponding interactions we added (with null effects as observed in our initial analysis) are also identified as uninformative by this process. We tabulate the posterior distribution of fr = 1 among f = (f1, …, fnβ), which gives the ‘in-and-out’ frequencies of the corresponding parameters. These posterior frequencies of fr = 1 can be used to define a ranking of important predictors. An alternative is to rank the top models (not just the predictors individually). Before implementing the TPFB, we reduced the dimensionality of the parameters in the model P(G|W), where G = (G1, G2, G3, G4, G5), by assuming common λGG and λGE association parameters across all SNPs. We use a N(0, 0.1²) prior on this common parameter. In addition, we assume a single common parameter λGS for all G-S associations with a vague normal prior N(0, 10⁴). These assumptions may be stringent in certain situations, but we need them to reduce the estimation burden in the log-linear model for the TPFB methods. For SNPs on the same functional pathway, as in our example, it may not be too unrealistic to assume a shared association parameter across SNPs.
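Both summaries of the indicator vector f, the marginal ‘in-and-out’ frequency of each predictor and the ranking of whole models, can be tabulated directly from the MCMC output. The following is a minimal sketch, assuming the sampler stores one 0/1 tuple per iteration; the function names and the storage format are our assumptions for illustration.

```python
from collections import Counter

def inclusion_frequencies(draws, names):
    """Posterior frequency of f_r = 1 for each predictor.

    draws: list of 0/1 tuples (one per MCMC iteration), ordered as `names`.
    """
    n = len(draws)
    return {name: sum(d[k] for d in draws) / n for k, name in enumerate(names)}

def top_models(draws, names, top=10):
    """Most frequently visited models with their posterior probabilities."""
    counts = Counter(tuple(d) for d in draws)
    n = len(draws)
    ranked = []
    for model, count in counts.most_common(top):
        included = [name for name, flag in zip(names, model) if flag]
        ranked.append((included, count / n))
    return ranked
```

The first function corresponds to a predictor-level ranking and the second to a model-level ranking; the two can disagree when a predictor appears in many low-probability models.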
Table 3.
Model | posterior probability % | BIC |
---|---|---|
[E][All S][G2 × E] | 13.1 % (12.5 %) | 43992 (43967) |
[E][All S][G1 × E][G2 × E] | 9.2 % (6.3 %) | 43993 (43978) |
[E][All S][G1 × E] | 7.7 % (5.1 %) | 43994 (43967) |
[E][All S] | 7.5 % (10.6 %) | 43997 (43977) |
[E][All S][G2 × E][G3 × E] | 3.9 % (4.6 %) | 44004 (43977) |
[E][All S][G2 × E][G4 × E] | 2.4 % (2.2 %) | 44002 (43974) |
[E][All S][G2 × E][G3 × G4] | 2.1 % (1.5 %) | 43998 (43971) |
[E][All S][G1 × E][G3 × E] | 2.1 % (2.2 %) | 44005 (43974) |
[E][All S][G1][G2 × E] | 1.8 % (0.7 %) | 43996 (43976) |
[E][All S][G1 × E][G2 × E][G5 × E] | 1.6 % (2.0 %) | 44010 (43974) |

(b) The estimated posterior probabilities of appearance (in %) corresponding to G and E main effects and their interactions are shown under the identical setting as in Table 3(a). Results in parentheses represent the sensitivity analysis carried out with υ0 = 0.001.

G1 | G2 | G3 | G4 | G5 | E | E × G1 | E × G2 | E × G3 | E × G4 | E × G5 |
---|---|---|---|---|---|---|---|---|---|---|
6.7 | 5.0 | 4.3 | 4.1 | 7.6 | 100.0 | 36.8 | 64.0 | 18.4 | 10.1 | 13.9 |
(5.6) | (6.8) | (10.6) | (7.3) | (9.4) | 100.0 | (29.5) | (55.4) | (19.9) | (9.7) | (9.5) |

BIC represents Bayesian Information Criterion.
In Table 3, we present numerical results on model and predictor ranking as well as the Bayesian Information Criterion (BIC) corresponding to each model. We only present the top 10 models. According to these results, the model with the main effect of E and the G2 × E interaction is the most preferred model (posterior probability 13.1%), followed by the model with E and both the G1 × E and G2 × E interactions (posterior probability 9.2%). With υ0 = 0.001, the ranking of predictors is slightly different, as the main effects of G2 through G5 are now selected more often. The bottom panel of Table 3 shows the frequency of retaining a predictor in the model according to the posterior distribution of f. The main effect of E is always retained (100%), and the G1 × E and G2 × E interactions have the largest selection probabilities among the remaining terms (36.8% and 64.0% respectively). Overall, non-significant interactions and main effects are well filtered under this variable selection scheme.
4. Simulation study
In this section, we assess the performance of the proposed method by conducting a simulation study. We mainly consider two aspects: (i) varying gene-gene/gene-environment association structure and (ii) differential phase II sampling between cases and controls. We compare our method with the five alternative methods mentioned before (WL, PL, UML, CML, and EB) in terms of the average bias and mean squared error (MSE), based on 1,000 simulated datasets.
We first describe the data generation procedure. We consider two genes G1 and G2, and, one environment factor E, with disease status D, all binary. We generate data from the following log-linear model [Li and Conti (2009)]:
(4.1) |
where μ denotes the expected cell count corresponding to the (D, G1, G2, E) configuration. Under this model, we can fix the G1-E, G2-E, and G1–G2 associations among controls by setting the values of λG1E, λG2E, and λG1G2 respectively. These parameters are approximately equivalent to those in model P(G1, G2|W) (2.2) when the disease is rare. Similarly, we can set βG1E, βG2E or βG1G2, corresponding to the G × E or G × G interactions in the disease risk model. The parameters (γ0, γG1, γG2, γE) control the marginal frequencies of G1, G2 and E in controls. A large negative value of γD ensures that the disease is rare.
For the model parameters in (4.1), we fixed (γ0, γG1, γG2, γE, γD) = (−6, −0.5, −0.5, −2.0, −4.5) that produces approximately 2.5% of cases, frequency of G1 = 1 and G2 = 1 both at 45% while the prevalence of E = 1 is 15%. We assign (βG1E, βG2E, βG1G2) = (0, log(2), log(2)) in (4.1). For setting parameters corresponding to G-E/G-G association, we set (λG1G2, λG1E, λG2E) = (log(2), 0, log(1.5)) to reflect G1–G2 and G2-E dependence, and (0, 0, 0) for the independence scenario.
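The data-generating step can be sketched as follows. The display of (4.1) is not reproduced here, so the code assumes the usual log-linear form with γ terms for the marginal effects, λ terms for the pairwise G-G and G-E associations, and β terms for the D-by-covariate effects. The main-effect β values are not stated in the text and are set to 0 purely as placeholders. All function and key names are ours.

```python
import itertools
import math
import random

def cell_probabilities(gamma, lam, beta):
    """Cell probabilities over (D, G1, G2, E) under an assumed log-linear model."""
    cells = {}
    for d, g1, g2, e in itertools.product((0, 1), repeat=4):
        control_part = (gamma["0"] + gamma["G1"] * g1 + gamma["G2"] * g2
                        + gamma["E"] * e + lam["G1G2"] * g1 * g2
                        + lam["G1E"] * g1 * e + lam["G2E"] * g2 * e)
        disease_part = (gamma["D"] + beta["G1"] * g1 + beta["G2"] * g2
                        + beta["E"] * e + beta["G1G2"] * g1 * g2
                        + beta["G1E"] * g1 * e + beta["G2E"] * g2 * e)
        cells[(d, g1, g2, e)] = math.exp(control_part + d * disease_part)
    total = sum(cells.values())
    return {cell: mu / total for cell, mu in cells.items()}

def draw_subjects(probs, d, n, rng):
    """Draw n subjects with disease status d from the conditional cell distribution."""
    cells = [cell for cell in probs if cell[0] == d]
    weights = [probs[cell] for cell in cells]
    return rng.choices(cells, weights=weights, k=n)

# Values quoted in the text; beta main effects are placeholders (assumption).
gamma = {"0": -6.0, "G1": -0.5, "G2": -0.5, "E": -2.0, "D": -4.5}
lam_dep = {"G1G2": math.log(2), "G1E": 0.0, "G2E": math.log(1.5)}
beta = {"G1": 0.0, "G2": 0.0, "E": 0.0,
        "G1G2": math.log(2), "G1E": 0.0, "G2E": math.log(2)}
probs = cell_probabilities(gamma, lam_dep, beta)
cases = draw_subjects(probs, 1, 1000, random.Random(1))
controls = draw_subjects(probs, 0, 1000, random.Random(2))
```

Under this assumed form, the G2-E odds ratio among controls with G1 = 0 equals exp(λG2E), which is how the λ parameters control the association structure.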
Now we turn our attention to the sampling design. We randomly generate 1,000 cases and 1,000 controls with complete (D, G1, G2, E) data. We then carry out (D, E)-stratified sampling as follows. We select 600 cases and 600 controls in phase II. We consider two scenarios for the stratified sampling strategy: (a) all subjects with E = 1, in both cases and controls, are automatically included in phase II; (b) all cases with E = 1 are included in phase II, whereas the 600 controls for phase II are randomly selected regardless of E status. Finally, information on G1 and G2 for the phase I subjects not selected at phase II, that is, 400 cases and 400 controls, is treated as missing by design. We repeat these steps to generate 1,000 replicate datasets under each sampling scheme.
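The two phase II selection schemes can be sketched as below. The text does not say how the remaining phase II slots are filled once all exposed subjects are included; the sketch assumes simple random sampling among the unexposed, which is our assumption. The function name and arguments are hypothetical.

```python
import random

def select_phase2(subjects, n_select, keep_all_exposed, rng):
    """Select phase II subjects within one disease group (cases or controls).

    subjects: list of (index, e) pairs.
    keep_all_exposed: if True, every subject with e == 1 is included and the
    remaining slots are filled at random from the unexposed (assumption);
    if False, n_select subjects are drawn at random regardless of e.
    """
    if keep_all_exposed:
        exposed = [i for i, e in subjects if e == 1]
        unexposed = [i for i, e in subjects if e == 0]
        if len(exposed) > n_select:
            raise ValueError("more exposed subjects than phase II slots")
        chosen = exposed + rng.sample(unexposed, n_select - len(exposed))
    else:
        chosen = rng.sample([i for i, _ in subjects], n_select)
    return sorted(chosen)
```

Scheme (a) corresponds to calling the function with keep_all_exposed=True for both cases and controls; scheme (b) uses True for cases and False for controls. Subjects not returned keep their (D, E) values, and their G1 and G2 are set to missing.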
Tables 4 and 5 display the simulation results based on the two sampling schemes (a) and (b) respectively. We follow the convention that ⊥ and ~ represent independence and dependence between two variables respectively. Under G1⊥E, G2⊥E, and G1⊥G2, the CML method yields the smallest MSE for the G1 × E and G1 × G2 interactions, followed by TPFB, TPFBemp, and EB, while WL, PL, and UML present relatively larger MSE. Here we need to note that the current implementation of CML and EB can only use G1-E and G1–G2 independence, but not G2-E independence. As phase II sampling becomes differential between cases and controls from scenario (a) (Table 4) to (b) (Table 5), we notice a substantial increase in the bias for estimating the main effect of E from CML, UML, and EB, as expected, while WL, PL, TPFB, and TPFBemp provide relatively less biased estimates. This trend remains present in the case where G1⊥E, G2 ~ E, and G1 ~ G2. Beyond the bias in βE from CML, UML, and EB, we note that under departure from the G1⊥G2 independence assumption, there is a dramatic increase in the bias corresponding to the G1 × G2 interaction under CML and, to some extent, under TPFB. TPFBemp and EB are more robust to violation of this assumption. Both TPFB methods show gains in efficiency for interaction estimation compared to PL and WL. Overall, our proposed methods, especially TPFBemp, yield a clear gain in efficiency compared to PL and WL for the G × E or G × G interactions in the presence of independence. On the other hand, TPFBemp provides less biased estimates of the E effect compared to UML, CML, and EB, which use only phase II data. When the sub-sampling ratio is 80%, the pattern remains the same, as seen in online supplemental Table 2 [Ahn et al. (2012)]. We also provide the sum of the MSEs across all parameters in order to capture the accuracy of estimating sub-group effects defined by different G-E configurations. This summary measure in the last columns of Table 4 and Table 5 suggests that our methods yield a more efficient characterization of the joint effect of exposure and genetic factors.
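The summaries reported in Tables 4 and 5 are simple averages over replicates. A minimal sketch of the bias, the MSE, and the summed MSE across parameters follows; the function names and the dictionary layout are our own choices for illustration.

```python
def bias_and_mse(estimates, truth):
    """Average bias and mean squared error of replicate estimates of one parameter."""
    n = len(estimates)
    bias = sum(est - truth for est in estimates) / n
    mse = sum((est - truth) ** 2 for est in estimates) / n
    return bias, mse

def summed_mse(estimates_by_param, truth_by_param):
    """Sum of per-parameter MSEs, as in the 'Sum (MSE)' columns.

    estimates_by_param: dict mapping a parameter name to its list of
    replicate estimates; truth_by_param: dict of the true values.
    """
    return sum(
        bias_and_mse(estimates_by_param[name], truth_by_param[name])[1]
        for name in truth_by_param
    )
```

As a check against the tables, the TPFB entry for scheme (a) under independence is 0.093 + 0.044 + 0.120 + 0.135 = 0.392.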
Table 4.
Stratified sampling (a)†. Columns 3 to 7: G1⊥E, G1⊥G2, G2⊥E, with (λG1G2, λG1E, λG2E) = (0, 0, 0). Columns 8 to 12: G1⊥E, G1~G2, G2~E, with (λG1G2, λG1E, λG2E) = (log(2), 0, log(1.5)).

Method | | E | G1 × G2 | G1 × E | G2 × E | Sum (MSE)¶ | E | G1 × G2 | G1 × E | G2 × E | Sum (MSE)¶ |
---|---|---|---|---|---|---|---|---|---|---|---|
TPFB | Bias | −0.024 | −0.017 | 0.020 | −0.017 | | −0.056 | 0.166 | −0.022 | 0.119 | |
 | (MSE) | (0.093) | (0.044) | (0.120) | (0.135) | (0.392) | (0.117) | (0.081) | (0.091) | (0.122) | (0.411) |
TPFBemp | Bias | 0.007 | −0.019 | −0.021 | −0.062 | | −0.033 | 0.043 | −0.029 | 0.026 | |
 | (MSE) | (0.089) | (0.025) | (0.111) | (0.126) | (0.351) | (0.113) | (0.064) | (0.091) | (0.120) | (0.388) |
WL | Bias | −0.038 | −0.025 | 0.043 | 0.009 | | −0.038 | 0.011 | 0.011 | 0.006 | |
 | (MSE) | (0.099) | (0.058) | (0.144) | (0.157) | (0.458) | (0.105) | (0.057) | (0.101) | (0.121) | (0.384) |
PL | Bias | −0.038 | −0.026 | 0.043 | 0.009 | | −0.038 | 0.011 | 0.011 | 0.006 | |
 | (MSE) | (0.098) | (0.056) | (0.144) | (0.157) | (0.455) | (0.105) | (0.056) | (0.101) | (0.121) | (0.383) |
UML | Bias | −0.093 | −0.026 | 0.043 | 0.009 | | −0.096 | 0.011 | 0.011 | 0.006 | |
 | (MSE) | (0.110) | (0.056) | (0.144) | (0.157) | (0.467) | (0.116) | (0.056) | (0.101) | (0.121) | (0.394) |
CML | Bias | −0.085 | −0.020 | 0.026 | 0.003 | | −0.100 | 0.700 | 0.011 | 0.009 | |
 | (MSE) | (0.099) | (0.025) | (0.083) | (0.155) | (0.362) | (0.112) | (0.520) | (0.070) | (0.116) | (0.818) |
EB | Bias | −0.087 | −0.025 | 0.036 | 0.004 | | −0.099 | 0.089 | 0.010 | 0.008 | |
 | (MSE) | (0.099) | (0.036) | (0.099) | (0.155) | (0.389) | (0.112) | (0.069) | (0.075) | (0.116) | (0.392) |

† All cases and controls with E = 1 are sub-sampled for phase II.
¶ The combined MSE, summed over all four parameters.
Table 5.
Stratified sampling (b)†. Columns 3 to 7: G1⊥E, G1⊥G2, G2⊥E, with (λG1G2, λG1E, λG2E) = (0, 0, 0). Columns 8 to 12: G1⊥E, G1~G2, G2~E, with (λG1G2, λG1E, λG2E) = (log(2), 0, log(1.5)).

Method | | E | G1 × G2 | G1 × E | G2 × E | Sum (MSE)¶ | E | G1 × G2 | G1 × E | G2 × E | Sum (MSE)¶ |
---|---|---|---|---|---|---|---|---|---|---|---|
TPFB | Bias | 0.007 | 0.022 | −0.007 | −0.022 | | −0.105 | 0.160 | 0.032 | 0.128 | |
 | (MSE) | (0.081) | (0.040) | (0.113) | (0.124) | (0.358) | (0.125) | (0.073) | (0.106) | (0.128) | (0.432) |
TPFBemp | Bias | 0.039 | 0.015 | −0.054 | −0.073 | | −0.027 | 0.036 | 0.024 | 0.000 | |
 | (MSE) | (0.086) | (0.031) | (0.127) | (0.122) | (0.366) | (0.121) | (0.058) | (0.115) | (0.135) | (0.429) |
WL | Bias | −0.012 | 0.016 | 0.021 | 0.024 | | −0.044 | 0.004 | 0.076 | −0.014 | |
 | (MSE) | (0.098) | (0.059) | (0.165) | (0.167) | (0.489) | (0.133) | (0.056) | (0.150) | (0.147) | (0.486) |
PL | Bias | −0.012 | 0.015 | 0.020 | 0.025 | | −0.046 | 0.002 | 0.077 | −0.013 | |
 | (MSE) | (0.097) | (0.059) | (0.164) | (0.166) | (0.486) | (0.132) | (0.055) | (0.149) | (0.146) | (0.482) |
UML | Bias | 0.538 | 0.015 | 0.020 | 0.025 | | 0.520 | 0.002 | 0.077 | −0.013 | |
 | (MSE) | (0.395) | (0.059) | (0.164) | (0.166) | (0.784) | (0.407) | (0.055) | (0.149) | (0.146) | (0.757) |
CML | Bias | 0.544 | 0.017 | −0.002 | 0.018 | | 0.530 | 0.699 | 0.046 | −0.013 | |
 | (MSE) | (0.384) | (0.030) | (0.088) | (0.161) | (0.663) | (0.399) | (0.515) | (0.073) | (0.141) | (1.128) |
EB | Bias | 0.543 | 0.016 | 0.008 | 0.018 | | 0.528 | 0.078 | 0.059 | −0.013 | |
 | (MSE) | (0.385) | (0.039) | (0.112) | (0.161) | (0.697) | (0.401) | (0.066) | (0.097) | (0.141) | (0.705) |

† All cases with E = 1 are included in phase II, whereas controls are randomly selected for phase II.
¶ The combined MSE, summed over all four parameters.
Table 3 in the online Supplement [Ahn et al. (2012)] presents simulation results under the traditional, unstratified case-control design, in which a random sample of cases and controls is taken irrespective of E status. Comparing Table 4 to Table 3 in the online Supplement, we note clear efficiency gains from stratified sampling for estimating the interaction parameters.
5. Discussion
We presented a flexible Bayesian approach to estimate gene-gene (G × G) and/or gene-environment (G × E) interactions under two-phase sampling with multiple markers. The proposed approach can handle multiple genetic and environmental factors. The method can trade off between bias and efficiency by incorporating uncertainty around gene-environment independence through the hierarchical structure in a data-adaptive way. The underlying ingredients of this hierarchy are the disease risk model, the multivariate gene model, and the joint model for the environment factors/covariates respectively. Our method can also handle potential missingness in genetic information due to technical inconsistency, or due to merging different studies or cohorts, leading to non-monotone missing data structure at phase II sub-sample. This paper is the first Bayesian paper with retrospective modeling for G × E studies under two-phase sampling that can handle multiple markers.
We compared our method to simpler alternatives such as UML, CML, and EB, which can use gene-environment independence but are based only on phase II data and ignore biased sampling. We also considered methods that account for biased sampling at phase II, namely weighted likelihood and pseudo-likelihood, but these do not leverage the independence assumption. Our method provides a framework that integrates both of these features. In a clinical study like the MECC example, where interest lies in estimating the differential effect of statin use across genetic sub-groups for devising targeted prevention strategies, estimates of main effects and of gene-environment interaction are equally important, so both need to be assessed. In terms of aggregate MSE, our method has superior performance over competing methods across a wide range of scenarios.
There are some limitations of the current paper that need to be addressed in future studies. First, we do not fully assess the performance of our method in the presence of a truly high-dimensional gene model through simulation studies. The method is scalable to handle up to 294 SNPs and pairwise interactions in our data example, but we have not carried out a simulation study at that scale due to computation time. We also need to deal with the rapidly increasing number of G × E and G × G interactions in the disease risk model, as well as G-E/G-G/G-S associations in the multivariate gene model, as we add more G-variables to the model. We address this by Bayesian variable selection and by assuming a common parameter for G-E/G-G/G-S association for genes in the same pathway in the multivariate gene model. The latter is a rather ad hoc strategy for reducing the dimension and is a limitation of our method. Bias in parameter estimates is expected to arise under departures from this assumption. Calculation of P(D) in the denominator of the likelihood could also pose challenges with truly high-dimensional data. Second, we have not tested the Bhattacharya and Dunson (2011) algorithm for a mixed set of discrete and continuous covariates in W. Future research will focus on higher-dimensional G and E settings, a more general structure of the W vector, and the possibility of capturing higher order interactions, not just pairwise interactions.
For practitioners who want to choose a design strategy to enhance the power of screening G × E effects with a relatively rare exposure, exposure enrichment of cases and controls for collecting genotype data is a better strategy than random sampling. The tools we developed in the paper provide a way to account for the biased sampling. The approach also allows one to explore a multivariate model with multiple SNPs and environmental exposures and to identify potentially informative predictors. If the interest lies in characterizing sub-group effects of E across different sub-groups defined by G, this design and analysis strategy is particularly powerful. We recommend the use of the default prior choices in the code available at http://www.umich.edu/~jaeil/tp.zip and recommend TPFBemp as the analysis to be reported. For prescribing a preventive medicine prophylactically, like the use of statins for colorectal cancer, identifying genetic sub-groups that will receive the most benefit from such a therapy is particularly helpful. Characterizing G × E effects furthers our understanding of such sub-group effects for tailoring targeted prevention strategies.
Supplementary Material
Table 2.
Statins | Statins | Statins | Statins | Statins | |
---|---|---|---|---|---|
G1 | A/A | A/C | C/C | A/A | A/A |
G2 | C/C | C/C | C/C | G/C | G/G |
TPFB | 0.27 (0.15, 0.39) | 0.35 (0.22, 0.44) | 0.48 (0.26, 0.65) | 0.38 (0.24, 0.51) | 0.48 (0.30, 0.77) |
TPFBemp | 0.26 (0.15, 0.42) | 0.36 (0.23, 0.45) | 0.49 (0.31, 0.69) | 0.37 (0.23, 0.50) | 0.50 (0.31, 0.79) |
WL | 0.27 (0.15, 0.49) | 0.35 (0.22, 0.55) | 0.45 (0.26, 0.77) | 0.40 (0.26, 0.62) | 0.59 (0.34, 1.02) |
PL | 0.27 (0.15, 0.49) | 0.35 (0.22, 0.55) | 0.45 (0.26, 0.79) | 0.40 (0.25, 0.63) | 0.59 (0.33, 1.05) |
UML | 0.25 (0.14, 0.44) | 0.32 (0.20, 0.50) | 0.41 (0.23, 0.72) | 0.36 (0.23, 0.57) | 0.53 (0.29, 0.95) |
CML | 0.22 (0.12, 0.37) | 0.33 (0.21, 0.51) | 0.49 (0.29, 0.83) | 0.34 (0.20, 0.48) | 0.46 (0.26, 0.80) |
EB | 0.22 (0.13, 0.39) | 0.31 (0.20, 0.49) | 0.43 (0.24, 0.79) | 0.32 (0.21, 0.49) | 0.47 (0.27, 0.82) |
TPFB, TPFBemp: Two-phase full Bayes (with empirical estimates for prior variances), UML: Unconstrained maximum likelihood, CML: Constrained maximum likelihood, EB: Empirical-Bayes, WL: weighted likelihood, and PL: pseudo-likelihood
Acknowledgments
The research of Jaeil Ahn, Malay Ghosh and Bhramar Mukherjee was partially supported by NSF grant DMS-1007494, the research of Bhramar Mukherjee was supported by R03 CA156608 and NIH/NIEHS grant ES020811. The research of Stephen B. Gruber and Bhramar Mukherjee was supported by NIH grant U19 NCI-895700. Genotyping and data collection was supported by R01 CA81488 and N01 CN43302.
Footnotes
SUPPLEMENTARY MATERIAL
Bayesian semiparametric analysis for two-phase studies of gene-environment interaction
We consider two-phase studies of G × E interaction where phase I data are available on exposure, covariates and disease status and stratified sampling is done to prioritize individuals for genotyping at phase II. We consider a Bayesian analysis based on the joint retrospective likelihood of phase I and phase II data that handles multiple genetic and environmental factors and makes data-adaptive use of gene-environment independence.
References
- 1. Agresti A. Categorical data analysis. 2nd ed. New York: John Wiley and Sons; 2002. MR1700749.
- 2. Ahn J, Mukherjee B, Gruber SB, Ghosh M. Supplement to "Bayesian semiparametric analysis for two-phase studies of gene-environment interaction." 2012. doi: 10.1214/12-AOAS599.
- 3.Amundadottir L, Kraft P, Stolzenberg-Solomon RZ, Fuchs CS, Petersen GM, Arslan AA, et al. Genome-wide association study identifies variants in the ABO locus associated with susceptibility to pancreatic cancer. Nat Genet. 2009;41:986–990. doi: 10.1038/ng.429. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Bhattacharya A, Dunson DB. Simplex factor models for multivariate unordered categorical data. Journal Amer. Stat. Assoc. 2011 doi: 10.1080/01621459.2011.646934. Revision invited. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Bhattacharjee S, Chatterjee N, Wheeler W. Package CGEN, Version 1.0.0, An R package for analysis of case-control studies in genetic epidemiology. 2011 http://dceg.cancer.gov/bb/tools/genetanalcasecontdata.
- 6.Breslow NE, Holubkov R. Maximum likelihood estimation of logistic regression parameters under two-phase, outcome-dependent sampling. J.R. Stat. Soc. B. 1997a;59:447–461. [Google Scholar]
- 7.Breslow NE, Holubkov R. Weighted likelihood, pseudo-likelihood and maximum likelihood methods for logistic regression analysis of two-stage data. Statist. Med. 1997b;16:103–116. doi: 10.1002/(sici)1097-0258(19970115)16:1<103::aid-sim474>3.0.co;2-p. [DOI] [PubMed] [Google Scholar]
- 8.Breslow NE, Chatterjee N. Design and analysis of two-phase studies with binary outcome applied to Wilms tumour prognosis. Appl. Statist. 1999;48:457–468. [Google Scholar]
- 9.Breslow NE, Cain KC. Logistic regression for two-stage case control data. Biometrika. 1988;75:11–20. [Google Scholar]
- 10.Chatterjee N, Carroll RJ. Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies. Biometrika. 2005;92:399–418. [Google Scholar]
- 11.Chatterjee N, Chen Y. Maximum likelihood inference on a mixed conditionally and marginally specified regression model for genetic epidemiologic studies with two-phase sampling. J. R. Stat. Soc. B. 2007;69:123–142. [Google Scholar]
- 12.Chatterjee N, Chen YH, Breslow NE. A pseudoscore estimator for regression problems with two-phase sampling. Journal Amer. Stat. Assoc. 2003;98:158–168. [Google Scholar]
- 13.Cochran WG. Sampling Techniques. New York: Wiley; 1963. [Google Scholar]
- 14.Dunson DB, Xing C. Nonparametric Bayes Modeling of Multivariate Categorical Data. Journal Amer. Stat. Assoc. 2009;104:1042–1051. doi: 10.1198/jasa.2009.tm08439. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Flanders WD, Greenland S. Analytic methods for 2-stage case-control studies and other stratified designs. Statist. Med. 1991;10:739–747. doi: 10.1002/sim.4780100509. [DOI] [PubMed] [Google Scholar]
- 16.Gelman A, Rubin DB. Inference from iterative simulation using multiple sequences (with discussion) Statistical Science. 1992;7:457–472. [Google Scholar]
- 17.Geman S, Geman D. Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1984;6:721–741. doi: 10.1109/tpami.1984.4767596. [DOI] [PubMed] [Google Scholar]
- 18.George EI, Mcculloch RE. Variable selection via Gibbs sampling. Journal Amer. Stat. Assoc. 1993;88:881–889. [Google Scholar]
- 19.Hachem C, Morgan R, Johnson M, Muebeler M, El-Serag H. Statins and the Risk of Colorectal Carcinoma: A Nested Case-Control Study in Veterans With Diabetes. Am. J. Gastroenterol. 2009;104:1241–1248. doi: 10.1038/ajg.2009.64. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Haneuse S, Chen J. A Multiphase Design Strategy for Dealing with Participation Bias. Biometrics. 2011;67:309–318. doi: 10.1111/j.1541-0420.2010.01419.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Haneuse S, Wakefield JC. Hierarchical Models for Combining Ecological and Case-Control Data. Biometrics. 2007;63:128–136. doi: 10.1111/j.1541-0420.2006.00673.x. [DOI] [PubMed] [Google Scholar]
- 22.Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. Journal Amer. Stat. Assoc. 1952;47:663–685. [Google Scholar]
- 23.Hunter DJ, Kraft P, Jacobs KB, Cox DG, Yeager M, et al. A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic post-menopausal breast cancer. Nat. genet. 2007;39:870–874. doi: 10.1038/ng2075. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Ishwaran H, Rao JS. Detecting Differentially Expressed Genes in Microarrays Using Bayesian Model Selection. Journal Amer. Stat. Assoc. 2003;98:438–455. [Google Scholar]
- 25.Lawless JF, Kalbfleisch JD, Wild CJ. Semiparametric Methods for Response-Selective and Missing Data Problems in Regression. J.R. Stat. Soc. B. 1999;61:413–438. [Google Scholar]
- 26.Lee AJ, Scott AJ, Wild CJ. Efficient estimation in multi-phase case-control studies. Biometrika. 2010;97:361–374. [Google Scholar]
- 27.Li D, Conti DV. Interactions Using a Combined Case-Only and Case-Control Approach. Am J Epidemiol. 2009;169:497–504. doi: 10.1093/aje/kwn339. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Lipkin, et al. Genetic Variation in 3-Hydroxy-3-Methylglutaryl CoA Reductase Modifies the Chemopreventive Activity of Statins for Colorectal Cancer. Cancer Prevention Research. 2010;5:597–603. doi: 10.1158/1940-6207.CAPR-10-0007. [DOI] [PubMed] [Google Scholar]
- 29.Little RJA, Rubin DB. Statistical Analysis with missing data. 2nd ed. New York: Wiley; 2002. [Google Scholar]
- 30.Lumley T. Package Survey, Version 3.2.4. R for analyzing data from complex surveys. 2011 http://cran.r-project.org/package/survey. [Google Scholar]
- 31.Manski CF, Lerman S. The Estimation of Choice Probabilities from Choice-Based Samples. Econometrica. 1977;45:1977–1988. [Google Scholar]
- 32.Mitchell TJ, Beauchamp JJ. Bayesian Variable Selection in Linear Regression. Journal Amer. Stat. Assoc. 1988;83:1023–1036. [Google Scholar]
- 33.Mukherjee B, Chatterjee N. Exploiting gene-environment independence for analysis of case-control studies: An empirical-Bayes type shrinkage estimator to trade off between bias and efficiency. Biometrics. 2008;64:685–694. doi: 10.1111/j.1541-0420.2007.00953.x. [DOI] [PubMed] [Google Scholar]
- 34.Mukherjee B, Ahn J, Stephen BG, Rennert G, Victor M, Chatterjee N. Testing gene-environment interaction from case-control data: A novel study of Type-1 error, power and designs. Gen. Epid. 2008;32:615–626. doi: 10.1002/gepi.20337. [DOI] [PubMed] [Google Scholar]
- 35.Mukherjee B, Ahn J, Stephen BG, Ghosh M, Chatterjee N. Bayesian Sample Size Determination for Case-Control Studies of Gene-Environment Interaction. Biometrics. 2010;66:934–948. doi: 10.1111/j.1541-0420.2009.01357.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Müller P, Parmigiani G, Shildkraut J, Tardella L. A Bayesian Hierarchical Approach for Combining Case-Control and Prospective Studies. Biometrics. 1999;55:858–866. doi: 10.1111/j.0006-341x.1999.00858.x. [DOI] [PubMed] [Google Scholar]
- 37.Murcray CE, Lewinger JP, Gauderman WJ. Gene-environment interaction in genome-wide association studies. Am J Epidemiol. 2009;169:219–226. doi: 10.1093/aje/kwn353. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Neyman J. Contribution to the theory of sampling from human populations. Journal Amer. Stat. Assoc. 1938;33:101–116. [Google Scholar]
- 39.Park JH, Wacholder S, Gail MH, Peters U, Jacobs KB, Chanock SJ, Chatterjee N. Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nat Gen. 2010;42:570–575. doi: 10.1038/ng.610. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Pfeiffer RM, Gail MH. Sample size calculations for population and family-based case-control association studies on marker genotypes. Genet Epidemiol. 2003;25:136–148. doi: 10.1002/gepi.10245. [DOI] [PubMed] [Google Scholar]
- 41.Piegorsch WW, Weinberg CR, Taylor J. Non hierarchical logistic models and case-only designs for assessing susceptibility in population-based case-control studies. Stat in Med. 1994;13:153–162. doi: 10.1002/sim.4780130206. [DOI] [PubMed] [Google Scholar]
- 42. Plummer M, Best N, Cowles K, Vines K. Package CODA, Version 0.13-4, Output analysis and diagnostics for MCMC. 2009. http://cran.r-project.org/web/packages/coda/
- 43. Poynter JN, Gruber SB, Higgins PD, Almog R, Bonner JD, Rennert HS, Low M, Greenson JK, Rennert G. Statins and the risk of colorectal cancer. N Engl J Med. 2005;352:2184–2192. doi: 10.1056/NEJMoa043792.
- 44. Reilly M, Pepe MS. A mean score method for missing and auxiliary covariate data in regression models. Biometrika. 1995;82:299–314.
- 45. Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. Journal Amer. Stat. Assoc. 1994;89:846–866.
- 46. Scott AJ, Wild CJ. Fitting regression models to case-control data by maximum likelihood. Biometrika. 1997;84:57–71.
- 47. Sethuraman J. A constructive definition of Dirichlet priors. Statistica Sinica. 1994;4:639–650.
- 48. Schill W, Jockel KH, Drescher K, Timm J. Logistic analysis in case-control studies under validation sampling. Biometrika. 1993;80:339–352.
- 49. Thomas DC. Gene-environment-wide association studies: emerging approaches. Nature Reviews Genetics. 2010;11:259–272. doi: 10.1038/nrg2764.
- 50. Umbach DM, Weinberg CR. Designing and analysing case-control studies to exploit independence of genotype and exposure. Stat in Med. 1997;16:1731–1743. doi: 10.1002/(sici)1097-0258(19970815)16:15<1731::aid-sim595>3.0.co;2-s.
- 51. Vansteelandt S, VanderWeele TJ, Robins JM. Multiply Robust Inference for Statistical Interactions. Journal Amer. Stat. Assoc. 2008;103:1693–1704. doi: 10.1198/016214508000001084.
- 52. Wacholder S, Hartge P, Prentice R, Garcia-Closas M, et al. Performance of common genetic variants in breast-cancer risk models. N Engl J Med. 2010;362:986–993. doi: 10.1056/NEJMoa0907727.
- 53. Walker SG. Sampling the Dirichlet Mixture Model With Slices. Communications in Statistics - Simulation and Computation. 2007;36:45–54.
- 54. Whittemore AS, Halpern J. Multi-stage sampling in Genetic Epidemiology. Stat in Med. 1997;16:153–167. doi: 10.1002/(sici)1097-0258(19970130)16:2<153::aid-sim477>3.0.co;2-7.
- 55. Yeager M, Orr N, Hayes RB, Jacobs KB, Kraft P, Wacholder S, et al. Genome-wide association study of prostate cancer identifies a second risk locus at 8q24. Nat Genet. 2007;39:645–649. doi: 10.1038/ng2022.