Author manuscript; available in PMC: 2018 Feb 7.
Published in final edited form as: Biometrika. 2017 Sep 15;104(4):801–812. doi: 10.1093/biomet/asx045

Semiparametric Analysis of Complex Polygenic Gene–Environment Interactions in Case–Control Studies

Odile Stalder, Alex Asher, Liang Liang, Raymond J Carroll, Yanyuan Ma, Nilanjan Chatterjee
PMCID: PMC5793684  NIHMSID: NIHMS909293  PMID: 29430038

Summary

Many methods have recently been proposed for efficient analysis of case–control studies of gene–environment interactions using a retrospective likelihood framework that exploits the natural assumption of gene–environment independence in the underlying population. However, for polygenic modeling of gene–environment interactions, a topic of increasing scientific interest, applications of retrospective methods have been limited due to a requirement in the literature for parametric modeling of the distribution of the genetic factors. We propose a general, computationally simple, semiparametric method for analysis of case–control studies that allows exploitation of the assumption of gene–environment independence without any further parametric modeling assumptions about the marginal distribution of either of the two sets of factors. The method relies on the key observation that an underlying efficient profile likelihood depends on the distribution of genetic factors only through certain expectation terms that can be evaluated empirically. We develop asymptotic inferential theory for the estimator and evaluate numerical performance using simulation studies. An application of the method is presented.

Some key words: Case-control studies, Gene-environment interactions, Genetic epidemiology, Pseudolikelihood, Retrospective studies, Semiparametric methods

1. Introduction

Recent genome-wide association studies indicate that complex diseases, such as cancers, diabetes and heart diseases, are in general extremely polygenic (Chatterjee et al., 2016; Fuchsberger et al., 2016). Genetic predisposition to a single disease may involve thousands of genetic variants, each of which may have a very small effect individually, but in combination they can explain substantial variation in risk in the underlying population. As discoveries from genome-wide association studies continue to enhance understanding of complex diseases, in the future, it will be critical to understand how these genetic factors interact with environmental risk factors for both understanding disease mechanisms and developing public health strategies for disease prevention.

Because of its sampling efficiency, the case–control design is widely popular for conducting studies of genetic associations and gene–environment interactions. A variety of analytic methods have been proposed to increase the efficiency of analysis of case–control data for studies of gene–environment interactions by exploiting an assumption of gene–environment independence in the underlying population. It has been shown that under the assumptions of gene–environment independence and rare disease, the interaction odds-ratio parameters of a logistic regression model can efficiently be estimated based on cases alone (Piegorsch et al., 1994). A general logistic regression model can be fit to case–control data under the gene–environment independence assumption using a log-linear modeling framework (Umbach & Weinberg, 1997) or a semiparametric retrospective profile likelihood framework (Chatterjee & Carroll, 2005). More recently, the assumption of gene–environment independence has been exploited to propose a variety of powerful hypothesis testing methods for conducting genome-wide scans of gene–environment interactions (Murcray et al., 2009; Mukherjee & Chatterjee, 2008; Han et al., 2015; Mukherjee et al., 2012; Gauderman et al., 2013; Hsu et al., 2012).

We consider developing methods for efficient analysis of case–control studies for modeling gene–environment interactions involving multiple genetic variants simultaneously. To develop parsimonious models for joint effects, many studies have recently focused on developing models for gene–environment interactions using underlying polygenic risk scores that could be defined by all known genetic variants associated with the diseases (Meigs et al., 2008; Wacholder et al., 2010; Dudbridge, 2013; Chatterjee et al., 2013, 2016). Further, to obtain improved biological insights and to enhance statistical power for detection, it may often be desirable to model gene–environment interactions using multiple variants within genomic regions and/or biological pathways (Chatterjee et al., 2006; Jiao et al., 2013; Lin et al., 2013, 2015). In standard prospective logistic regression analysis, which conditions on both the genetic and environmental risk factor status of the individuals, handling multiple genetic variants is relatively straightforward. In contrast, with retrospective methods, which aim to exploit the assumption of gene–environment independence, the task becomes complicated because all currently existing methods require parametric modeling of the distribution of the genetic or environmental variables.

We propose computationally simple methodology for fitting general logistic regression models to case–control data under the assumption of gene–environment independence, but without requiring any further modeling assumptions about the distributions of the genetic or environmental variables. We extend the Chatterjee–Carroll profile likelihood framework, which originally considered modeling gene–environment interactions using single genetic variants for which genotype status could be specified using parametric multinomial models. The new method relies on the observation that the profile likelihood itself can be estimated based on an empirical genotype distribution that is estimable from a case–control sample. We develop the asymptotic theory of the resulting estimator under a semiparametric inferential framework. Simulations and an example illustrate the properties of the new methodology.

2. Model, Method and Theory

2·1. Background, model and method

In the following, we use notation similar to that of Chatterjee & Carroll (2005). We will denote disease status, genetic information and environmental risk factors by D, G and X, respectively. Here G may correspond to a complex multivariate genotype associated with multiple genetic variants or a continuous polygenic risk score that is defined a priori based on known associations of the genetic variants with the disease. We assume the risk of the disease given genetic and environmental factors in the underlying population can be specified using a model of the form

$$\mathrm{pr}(D = 1 \mid G, X) = H\{\alpha_0 + m(G, X, \beta)\}, \tag{1}$$

where $H(x) = \{1 + \exp(-x)\}^{-1}$ is the logistic distribution function and $m(G, X, \beta)$ is a parametrically specified function that defines a model for the joint effect of G and X on the logistic-risk scale. The goal of the gene–environment interaction study is to make inference on the parameters β in (1), including interaction parameters.
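As a minimal sketch of model (1), the snippet below evaluates the disease risk for a scalar G and a binary X. The particular form of `m` (main effects plus a product interaction) and the function names are assumptions chosen for illustration; the paper only requires that m be parametrically specified. The numerical values are borrowed from the simulation settings in §3·1 purely as placeholders.

```python
import math

def H(x):
    """Logistic distribution function H(x) = {1 + exp(-x)}^(-1)."""
    return 1.0 / (1.0 + math.exp(-x))

def m(g, x, beta):
    """Illustrative (assumed) form: beta_G*g + beta_X*x + beta_GX*g*x."""
    beta_g, beta_x, beta_gx = beta
    return beta_g * g + beta_x * x + beta_gx * g * x

def disease_risk(g, x, alpha0, beta):
    """pr(D = 1 | G = g, X = x) under model (1)."""
    return H(alpha0 + m(g, x, beta))

# Placeholder values, not estimates from any data set.
beta = (math.log(1.2), math.log(1.5), math.log(1.3))
risk = disease_risk(g=1.0, x=1, alpha0=-4.14, beta=beta)
```

Under this assumed form, the odds ratio for a one-unit increase in G among subjects with X = 1 is exp(β_G + β_GX), which is how the interaction parameter modifies the genetic effect.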

Let F(G, X) denote the joint distribution of G and X in the underlying population. The key assumption that genetic, G, and environmental factors, X, are independently distributed in the underlying population can be mathematically stated as

$$dF(G, X) = dF_G(G) \times dF_X(X),$$

where FG and FX denote the underlying marginal distributions of G and X, respectively. In the Supplementary Material we discuss how to weaken this assumption by suitable conditioning on additional stratification factors. In contrast to the existing literature, here we assume that the marginal distributions FG(G) and FX(X) are both completely unspecified.

We consider a population-based case–control study, in which (G, X) are sampled independently from those with the disease, called cases, and those without the disease, called controls. Suppose there are n1 cases and n0 controls. Standard prospective logistic regression analysis, which is equivalent to maximum likelihood estimation when F(G, X) is allowed to be completely unspecified, yields consistent estimates of β (Prentice & Pyke, 1979).

The retrospective likelihood is the probability of observing the genetic and environmental variables, given the subject’s disease status. Under gene–environment independence in the underlying population, the retrospective likelihood is

$$\mathrm{pr}(G = g, X = x \mid D = d) = \mathrm{pr}(D = d \mid G = g, X = x)\,\mathrm{pr}(G = g)\,\mathrm{pr}(X = x)/\mathrm{pr}(D = d).$$

Let fG(·) and fX(·) represent the density/mass functions of G and X, respectively. The retrospective likelihood is

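$$\frac{f_G(g)\, f_X(x) \exp[d\{\alpha_0 + m(g, x, \beta)\}] / [1 + \exp\{\alpha_0 + m(g, x, \beta)\}]}{\int f_G(u)\, f_X(v) \exp[d\{\alpha_0 + m(u, v, \beta)\}] / [1 + \exp\{\alpha_0 + m(u, v, \beta)\}]\, du\, dv}. \tag{2}$$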

Chatterjee & Carroll (2005) profiled out fX(·) by treating it as discrete on the set of distinct observed values (x1, . . . , xm) of X with probabilities δi = pr(X = xi), and then maximized (2) over (δ1, . . . , δm), leading eventually to the semiparametric profile likelihood described as follows. Define κ = α0 + log(n1/n0) − log(π1/π0), where π1 = 1 − π0 = pr(D = 1) is defined as the probability of the disease in the underlying population. Define Ω = (κ, βT)T. Also define

$$S(d, g, x, \Omega) = \frac{\exp[d\{\kappa + m(g, x, \beta)\}]}{1 + \exp\{\kappa + \log(\pi_1/\pi_0) - \log(n_1/n_0) + m(g, x, \beta)\}}.$$

Then, with this notation, the semiparametric profile likelihood is

$$L(D, G, X, \Omega, f_G) = \frac{f_G(G)\, S(D, G, X, \Omega)}{\sum_{d=0}^{1} \int f_G(v)\, S(d, v, X, \Omega)\, dv}. \tag{3}$$

While the representation in (3) does not involve the unknown density of X, it does involve the unknown density of G. This is a major reason that the current literature specifies a parametric distribution for G. Our task in this paper is to dispense with the need to give a parametric form for the distribution function of G, so that analysis can be performed with respect to potentially complex multivariate genotype data for which parametric modeling can be difficult and cumbersome.

Here is our key insight, which we discuss first in the context that π1 is known or at least can be estimated well. For case–control studies that are conducted within well defined populations, relevant probabilities of the disease can be ascertained based on population-based disease registries. When case–control studies are conducted by sampling of subjects within a larger cohort study, the probability of the disease in the underlying population can be estimated using the disease incidence rate observed in the cohort.

Our key insight in treating the distribution of G as nonparametric concerns the term in the denominator of (3), defined as

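$$R(x, \Omega) = \sum_{r=0}^{1} \int f_G(v)\, S(r, v, x, \Omega)\, dv.$$

This is simply the expectation, in the source population, of $\sum_{r=0}^{1} S(r, G, x, \Omega)$. That is, $R(x, \Omega) = E_{\mathrm{pop}}\{\sum_{r=0}^{1} S(r, G, x, \Omega)\}$, where the subscript pop emphasizes that the expectation is in the source population, not the case–control study. However, crucially,

$$R(x, \Omega) = \pi_1 E\Big\{\sum_{r=0}^{1} S(r, G, x, \Omega) \,\Big|\, D = 1\Big\} + \pi_0 E\Big\{\sum_{r=0}^{1} S(r, G, x, \Omega) \,\Big|\, D = 0\Big\}. \tag{4}$$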
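The identity in (4) is an instance of the law of total expectation applied to disease status. Writing $A(G) = \sum_{r=0}^{1} S(r, G, x, \Omega)$ as shorthand (a symbol introduced here only for illustration), a sketch of the step is:

$$E_{\mathrm{pop}}\{A(G)\} = \sum_{d=0}^{1} \mathrm{pr}(D = d)\, E\{A(G) \mid D = d\} = \pi_1 E\{A(G) \mid D = 1\} + \pi_0 E\{A(G) \mid D = 0\}.$$

The conditional expectations on the right are taken over the distribution of G among cases and among controls, which is what a case–control sample provides, so each can be approximated by a within-group sample average.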

Of course, R(x, Ω) is unknown, but we estimate it unbiasedly and nonparametrically by

$$\widehat{R}(x, \Omega) = \sum_{j=1}^{n} \sum_{r=0}^{1} \sum_{d=0}^{1} (\pi_d/n_d)\, I(D_j = d)\, S(r, G_j, x, \Omega). \tag{5}$$

In the Supplementary Material, we show that $\widehat{R}(x, \Omega)$ is an unbiased estimate of $R(x, \Omega)$, that it is $n^{1/2}$-consistent, and that it is asymptotically normally distributed.
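The estimator in (5) is a weighted sample average, with cases weighted by π1/n1 and controls by π0/n0. The following sketch computes S and the estimate of R for a scalar G; the form of `m`, the function names and the toy data are assumptions made for illustration and are not taken from the authors' R software.

```python
import math

def m(g, x, beta):
    """Illustrative (assumed) form: beta_G*g + beta_X*x + beta_GX*g*x."""
    beta_g, beta_x, beta_gx = beta
    return beta_g * g + beta_x * x + beta_gx * g * x

def S(d, g, x, kappa, beta, pi1, n1, n0):
    """S(d, g, x, Omega) as defined before (3)."""
    lin = kappa + m(g, x, beta)
    offset = math.log(pi1 / (1.0 - pi1)) - math.log(n1 / n0)
    return math.exp(d * lin) / (1.0 + math.exp(lin + offset))

def R_hat(x, kappa, beta, D, G, pi1):
    """Empirical estimate (5) of R(x, Omega) from case-control data."""
    n1 = sum(D)
    n0 = len(D) - n1
    weight = {1: pi1 / n1, 0: (1.0 - pi1) / n0}  # pi_d / n_d
    total = 0.0
    for d_j, g_j in zip(D, G):
        for r in (0, 1):
            total += weight[d_j] * S(r, g_j, x, kappa, beta, pi1, n1, n0)
    return total

# Toy data: 2 cases, 3 controls (made-up values).
D = [1, 1, 0, 0, 0]
G = [0.5, 1.2, -0.3, 0.0, 0.8]
r_example = R_hat(1, 0.3, (0.2, 0.4, 0.3), D, G, pi1=0.03)
```

A quick sanity check on this sketch: when β = 0 the summand no longer depends on G, the weights sum to one, and the estimate reduces to (1 + e^κ)/[1 + exp{κ + log(π1/π0) − log(n1/n0)}].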

Ignoring the leading term fG(G) in (3), which is not estimated, and taking logarithms, leads us to an estimated loglikelihood in Ω across the data as

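$$L(\Omega) = \sum_{i=1}^{n} \log S(D_i, G_i, X_i, \Omega) - \sum_{i=1}^{n} \log \widehat{R}(X_i, \Omega). \tag{6}$$

Define $S_\Omega(d, g, x, \Omega) = \partial S(d, g, x, \Omega)/\partial\Omega$ and similarly for $\widehat{R}_\Omega(x, \Omega)$. Then the estimated score function, a type of estimated estimating equation, is

$$\widehat{\mathcal{S}}_n(\Omega) = n^{-1/2} \sum_{i=1}^{n} \left\{ \frac{S_\Omega(D_i, G_i, X_i, \Omega)}{S(D_i, G_i, X_i, \Omega)} - \frac{\widehat{R}_\Omega(X_i, \Omega)}{\widehat{R}(X_i, \Omega)} \right\}. \tag{7}$$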

Define

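$$\mathcal{S}_n(\Omega) = n^{-1/2} \sum_{i=1}^{n} \left\{ \frac{S_\Omega(D_i, G_i, X_i, \Omega)}{S(D_i, G_i, X_i, \Omega)} - \frac{R_\Omega(X_i, \Omega)}{R(X_i, \Omega)} \right\},$$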

which is the profile loglikelihood score function when the distribution of G is known. Since the profile loglikelihood score of Chatterjee & Carroll (2005) would have mean zero if the distribution of G were known, it follows that

$$E\{\mathcal{S}_n(\Omega)\} = 0, \tag{8}$$

where the expectation in (8) is taken in the case–control study, not in the source population. Thus, since $\widehat{R}(x, \Omega)$ and $\widehat{R}_\Omega(x, \Omega)$ converge in probability to $R(x, \Omega)$ and $R_\Omega(x, \Omega)$, respectively, a consistent estimate of Ω can be obtained by solving $\widehat{\mathcal{S}}_n(\Omega) = 0$. This estimate $\widehat{\Omega}$, which maximizes the semiparametric pseudolikelihood (6), will be referred to as the semiparametric pseudolikelihood estimator.
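The pseudo-loglikelihood (6) can be evaluated directly once S and the empirical estimate (5) are available. The sketch below does this for a scalar G; the form of `m`, the function names and the toy data are illustrative assumptions, and the double loop is written for clarity rather than speed. In practice the returned value would be handed to a general-purpose numerical optimizer, which is not shown here.

```python
import math

def m(g, x, beta):
    """Illustrative (assumed) form: beta_G*g + beta_X*x + beta_GX*g*x."""
    beta_g, beta_x, beta_gx = beta
    return beta_g * g + beta_x * x + beta_gx * g * x

def S(d, g, x, kappa, beta, pi1, n1, n0):
    """S(d, g, x, Omega) as defined before (3)."""
    lin = kappa + m(g, x, beta)
    offset = math.log(pi1 / (1.0 - pi1)) - math.log(n1 / n0)
    return math.exp(d * lin) / (1.0 + math.exp(lin + offset))

def R_hat(x, kappa, beta, D, G, pi1, n1, n0):
    """Empirical estimate (5): cases weighted pi1/n1, controls pi0/n0."""
    weight = {1: pi1 / n1, 0: (1.0 - pi1) / n0}
    return sum(weight[d_j] * S(r, g_j, x, kappa, beta, pi1, n1, n0)
               for d_j, g_j in zip(D, G) for r in (0, 1))

def pseudo_loglik(kappa, beta, D, G, X, pi1):
    """Estimated loglikelihood (6), summed over all subjects."""
    n1 = sum(D)
    n0 = len(D) - n1
    total = 0.0
    for d_i, g_i, x_i in zip(D, G, X):
        total += math.log(S(d_i, g_i, x_i, kappa, beta, pi1, n1, n0))
        total -= math.log(R_hat(x_i, kappa, beta, D, G, pi1, n1, n0))
    return total

# Toy data: 2 cases, 3 controls (made-up values).
D = [1, 1, 0, 0, 0]
G = [0.5, 1.2, -0.3, 0.0, 0.8]
X = [1, 0, 1, 0, 0]
value = pseudo_loglik(0.3, (0.2, 0.4, 0.3), D, G, X, pi1=0.03)
```

One property of this sketch that is easy to verify by hand: with β = 0 the criterion collapses to n1 κ − n log(1 + e^κ), which is maximized at κ = log(n1/n0).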

2·2. Rare diseases when π1 is unknown

When the probability of disease in the source population is unknown, one can invoke a rare disease assumption, which is often reasonable for case–control studies (Piegorsch et al., 1994; Modan et al., 2001; Epstein & Satten, 2003; Lin & Zeng, 2006; Kwee et al., 2007; Zhao et al., 2003). If we assume that π1 ≈ 0, then $S(d, g, x, \Omega) \approx \exp[d\{\kappa + m(g, x, \beta)\}]$, and the expectation involved in the calculation of R(X, Ω) can be evaluated based on the sample of controls, with D = 0, only. In this case, the estimates of Ω converge not to Ω itself, but instead to Ω*, the solution to (8) with π1 = 0. Typically, except when the sample size is very large and hence standard errors are unusually small, the small possible bias of the rare disease approximation is of little consequence and coverage probabilities of confidence intervals remain near nominal; see §3 for examples. The asymptotic theory of §2·3 below is then unchanged.

In the Supplementary Material, we show that the score and the Hessian take on simple forms in this case, and that the Hessian is negative semidefinite. Computation is thus very efficient.
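Under the rare disease approximation, setting π1 = 0 in (5) leaves only the controls, and the estimate of R becomes the control-sample average of 1 + exp{κ + m(G, x, β)}. A minimal sketch of the resulting criterion follows; the form of `m`, the function names and the toy data are assumptions for illustration only.

```python
import math

def m(g, x, beta):
    """Illustrative (assumed) form: beta_G*g + beta_X*x + beta_GX*g*x."""
    beta_g, beta_x, beta_gx = beta
    return beta_g * g + beta_x * x + beta_gx * g * x

def R_hat_rare(x, kappa, beta, D, G):
    """Estimate of R(x, Omega) with pi1 = 0: an average over controls only."""
    controls = [g for d, g in zip(D, G) if d == 0]
    return sum(1.0 + math.exp(kappa + m(g, x, beta)) for g in controls) / len(controls)

def pseudo_loglik_rare(kappa, beta, D, G, X):
    """Criterion (6) with S(d, g, x, Omega) replaced by exp[d{kappa + m(g, x, beta)}]."""
    total = 0.0
    for d_i, g_i, x_i in zip(D, G, X):
        total += d_i * (kappa + m(g_i, x_i, beta))  # log S under pi1 = 0
        total -= math.log(R_hat_rare(x_i, kappa, beta, D, G))
    return total

# Toy data: 2 cases, 3 controls (made-up values).
D = [1, 1, 0, 0, 0]
G = [0.5, 1.2, -0.3, 0.0, 0.8]
X = [1, 0, 1, 0, 0]
value = pseudo_loglik_rare(0.3, (0.2, 0.4, 0.3), D, G, X)
```

Note that no value of π1 appears anywhere in this version, which is the practical appeal of the approximation when no registry or cohort estimate of the disease rate is available.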

2·3. Asymptotic theory

To state the asymptotic results, we first make the definitions

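$$\Gamma_1 = \sum_{d=0}^{1} (n_d/n)\, E\left[ \frac{\partial \{S_\Omega(D, G, X, \Omega)/S(D, G, X, \Omega)\}}{\partial \Omega^{T}} \,\Big|\, D = d \right]; \qquad \Gamma_2 = \sum_{d=0}^{1} (n_d/n)\, E\left[ \frac{\partial \{R_\Omega(X, \Omega)/R(X, \Omega)\}}{\partial \Omega^{T}} \,\Big|\, D = d \right].$$

In addition, define $c_d = n_d/n$, $Z_i = (D_i, G_i, X_i)$, $P_1(X_i, \Omega) = 1/R(X_i, \Omega)$ and $P_2(X_i, \Omega) = R_\Omega(X_i, \Omega)/R^2(X_i, \Omega)$.

We use the notational convention that for arbitrary functions (P, T), $T_E(r, d, x) = E\{T(r, G, x) \mid D = d\}$. Also, we use the convention that

$$E[P(X)\{T(r, g_i, X) - T_E(r, d, X)\} \mid D = t] = E[P(X)\{T(r, g, X) - T_E(r, d, X)\} \mid D = t]_{g = G_i}.$$

Define

$$\zeta(Z_i, \Omega) = \frac{S_\Omega(Z_i, \Omega)}{S(Z_i, \Omega)} - \frac{R_\Omega(X_i, \Omega)}{R(X_i, \Omega)} - \sum_{d=0}^{1} \sum_{r=0}^{1} \frac{c_d\, \pi_{d_i}}{c_{d_i}}\, E[\{P_1(X, \Omega)\, S_\Omega(r, g_i, X) - P_2(X, \Omega)\, S(r, g_i, X)\} \mid D = d].$$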

Finally, define ζ*(Zi, Ω) = ζ(Zi, Ω) − E{ζ(Z, Ω) | D = Di}.

Theorem 1

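Suppose $n_d/n \to c_d$, where $0 < c_d < \infty$, and that $\pi_1$ is known. Then

$$n^{1/2}(\widehat{\Omega} - \Omega) = -(\Gamma_1 - \Gamma_2)^{-1} n^{-1/2} \sum_{i=1}^{n} \zeta(Z_i, \Omega) + o_p(1). \tag{9}$$

Thus, since the Zi are independent and E{ζ*(Z, Ω) | Di} = 0, as n → ∞, in distribution,

$$n^{1/2}(\widehat{\Omega} - \Omega) \to N[0, (\Gamma_1 - \Gamma_2)^{-1} \Sigma \{(\Gamma_1 - \Gamma_2)^{-1}\}^{T}]; \qquad \Sigma = \sum_{d=0}^{1} (n_d/n)\, \mathrm{cov}\{\zeta^{*}(D, X, G, \Omega) \mid D = d\} = \sum_{d=0}^{1} (n_d/n)\, \mathrm{cov}\{\zeta(D, X, G, \Omega) \mid D = d\}.$$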

In §2·2, when π1 is unknown and the disease is relatively rare, the same result holds by setting π1 = 0.

3. Simulations

3·1. Overview

In our simulations, $m(G, X, \beta) = G^{T}\beta_G + X\beta_X + (GX)^{T}\beta_{GX}$ and the value of X is binary with population frequency 0.5. There are either three or five correlated single nucleotide polymorphisms within a region: we report the latter case, but the results for the former are similar. Each single nucleotide polymorphism takes on the values 0, 1 or 2 following a trinomial distribution that follows Hardy–Weinberg equilibrium, i.e., the jth component of G equals 0, 1, 2 with probabilities $\{(1 - p_j)^2, 2p_j(1 - p_j), p_j^2\}$. The values of the $p_j$ are described below.

To generate correlation among the single nucleotide polymorphisms, we first generated a 3- or 5-variate multivariate normal variate, each component with mean 0 and standard deviation 1, and with correlation between the jth and kth components equal to $\rho^{|j-k|}$, where ρ = 0.7. After generating these random variables, we trichotomized them with appropriate thresholds so that the frequencies of 0, 1 and 2 matched those specified by the allele frequency $p_j$ and Hardy–Weinberg equilibrium.
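The genotype-generation step can be sketched as follows. This is one possible reading of the description, not the authors' code: it assumes the correlated normals are produced by a first-order autoregressive recursion (which yields the stated correlation between components) and that the thresholds are standard normal quantiles of the cumulative Hardy–Weinberg probabilities. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

def ar1_normals(k, rho, rng):
    """k standard normals with corr(z_j, z_k) = rho ** |j - k|."""
    z = [rng.gauss(0.0, 1.0)]
    for _ in range(k - 1):
        z.append(rho * z[-1] + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0))
    return z

def hwe_thresholds(p):
    """Normal cut points giving genotype frequencies (1-p)^2, 2p(1-p), p^2."""
    q0 = (1.0 - p) ** 2
    q01 = q0 + 2.0 * p * (1.0 - p)
    std_normal = NormalDist()
    return std_normal.inv_cdf(q0), std_normal.inv_cdf(q01)

def simulate_genotypes(thresholds, rho, rng):
    """One genotype vector with entries in {0, 1, 2}."""
    z = ar1_normals(len(thresholds), rho, rng)
    return [0 if z_j <= t0 else 1 if z_j <= t1 else 2
            for z_j, (t0, t1) in zip(z, thresholds)]

rng = random.Random(2017)  # arbitrary seed, for reproducibility only
freqs = (0.1, 0.3, 0.3, 0.3, 0.1)
thresholds = [hwe_thresholds(p) for p in freqs]
sample = [simulate_genotypes(thresholds, 0.7, rng) for _ in range(20000)]
```

Because the trichotomization is nonlinear, the correlation between the resulting genotypes is generally weaker than the correlation between the underlying normal variables.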

In both simulations, the logistic intercept α0 was chosen so that the population disease rate π1 = 0.03. However additional simulations with π1 = 0.01 yielded very similar results with regards to coverage, efficiency gains, and unbiasedness. See also §3·3 for a discussion of additional simulations, and the Supplementary Material. In the simulation reported here, (p1, p2, p3, p4, p5) = (0.1, 0.3, 0.3, 0.3, 0.1), βX = log(1.5), βG = {log(1.2), log(1.2), 0.0, log(1.2), 0.0}, and βGX = {log(1.3), 0.0, 0.0, log(1.3), 0.0}. Here the value of α0 = −4.14.
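The text does not say how the intercept was chosen to hit the target disease rate. One simple possibility, sketched here as an assumption, is to draw a large sample of linear predictors m(G, X, β) from the population model and solve for the intercept by bisection, since the average risk is increasing in α0. The toy population below uses a single Hardy–Weinberg polymorphism rather than the five correlated ones of the simulation, so it is not expected to reproduce the reported intercept.

```python
import math
import random

def H(x):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-x))

def population_rate(alpha0, lin_preds):
    """Monte Carlo estimate of pr(D = 1) for a given intercept."""
    return sum(H(alpha0 + lp) for lp in lin_preds) / len(lin_preds)

def calibrate_intercept(lin_preds, target, lo=-20.0, hi=20.0, iters=60):
    """Bisection on alpha0; the rate is monotone increasing in alpha0."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if population_rate(mid, lin_preds) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy population (illustrative only): one SNP with allele frequency 0.3,
# binary X with frequency 0.5, and placeholder coefficients.
rng = random.Random(0)
lin_preds = []
for _ in range(2000):
    g = (rng.random() < 0.3) + (rng.random() < 0.3)  # 0, 1 or 2 minor alleles
    x = 1 if rng.random() < 0.5 else 0
    lin_preds.append(math.log(1.2) * g + math.log(1.5) * x + math.log(1.3) * g * x)

alpha0 = calibrate_intercept(lin_preds, target=0.03)
```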

3·2. Results

The standard error estimators used in our simulation were based on the asymptotic theory described in Theorem 1: we also used the bootstrap, with very similar results. The appropriate bootstrap in a case–control study is to resample the cases and controls separately, thus maintaining the sample sizes for each.
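The stratified resampling described above can be sketched in a few lines: indices are drawn with replacement within the cases and within the controls, so every bootstrap sample has the same n1 and n0 as the original. The function name and the toy records are hypothetical.

```python
import random

def bootstrap_case_control(D, records, rng):
    """Resample cases and controls separately, preserving n1 and n0."""
    case_idx = [i for i, d in enumerate(D) if d == 1]
    control_idx = [i for i, d in enumerate(D) if d == 0]
    drawn = (rng.choices(case_idx, k=len(case_idx))
             + rng.choices(control_idx, k=len(control_idx)))
    return [D[i] for i in drawn], [records[i] for i in drawn]

# Toy data: each record is a made-up (G, X) pair.
rng = random.Random(1)
D = [1, 1, 0, 0, 0]
records = [(0.5, 1), (1.2, 0), (-0.3, 1), (0.0, 0), (0.8, 0)]
D_star, records_star = bootstrap_case_control(D, records, rng)
```

Refitting the estimator on many such resamples and taking the standard deviation of the refitted estimates gives a bootstrap standard error that can be compared with the asymptotic one.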

The simulation results are presented in Table 1. Our semiparametric pseudolikelihood estimator has little bias and coverage percentages near the nominal level. Both with a rare disease approximation and with π1 known, our semiparametric pseudolikelihood estimator achieves approximately a 25% increase in mean squared error efficiency over ordinary logistic regression for the main effects in both G and X.

Table 1.

Results of 1000 simulations as described in §3, with mean bias, coverage probabilities of a 95% nominal confidence interval, and mean squared error efficiency of our semiparametric pseudolikelihood estimator compared to ordinary logistic regression. The simulations were performed with 1000 cases and 1000 controls.

βG1 βG2 βG3 βG4 βG5 βX βG1X βG2X βG3X βG4X βG5X
True 0.18 0.18 0.00 0.18 0.00 0.41 0.26 0.00 0.00 0.26 0.00
Logistic: 1000 cases
Bias 0.00 0.01 0.00 0.01 −0.01 0.01 0.01 −0.01 0.00 0.00 0.01
CI (%) 94.3 95.2 95.7 95.1 94.7 94.6 94.9 94.2 94.5 96.0 94.2
SPMLE, Rare: 1000 cases
Bias 0.01 0.00 0.00 0.02 −0.01 0.02 −0.02 −0.01 0.01 −0.02 0.01
CI (%) 95.2 95.4 96.4 95.8 95.3 95.1 95.4 94.8 96.1 95.5 94.9
Avg MSE Eff All G: 1.28 All X: 1.26 All G * X: 2.18
SPMLE, π1 known: 1000 cases
Bias 0.00 0.00 0.00 0.01 −0.01 0.01 0.00 −0.01 0.01 −0.01 0.01
CI (%) 95.1 95.5 96.4 95.8 95.0 95.5 95.6 94.6 95.9 95.2 94.5
Avg MSE Eff All G: 1.28 All X: 1.28 All G * X: 2.07

Logistic is ordinary logistic regression; SPMLE, Rare is our estimator using the rare disease approximation with unknown π1 (§2·2); SPMLE, π1 known is our estimator when π1 is known in the source population (§2·1); CI (%) is the coverage in percent of a nominal 95% confidence interval (calculated using the asymptotic standard error); Avg MSE Eff is the mean squared error efficiency of our method compared to logistic regression averaged over G (All G), over X (All X) and over the G * X (All G * X) interactions.

Strikingly, the mean squared error efficiency of our semiparametric pseudolikelihood estimators compared to ordinary logistic regression is approximately 2.00 for all the interaction terms, thus demonstrating that our methods, which do not model the distribution of either G or X, achieve numerically significant increases in efficiency.

3·3. Additional simulations

The Supplementary Material presents a series of additional simulations. These include a simulation to evaluate the robustness of our method to misspecification of the population disease rate, where we found a surprising degree of robustness. Additionally, there are simulations to examine the robustness of our method to violations of the gene–environment independence assumption. Those simulation studies show that there will be bias in the estimates of gene–environment interaction parameters for the specific single nucleotide polymorphisms under violation of gene–environment independence, but the average mean squared error for parameter estimates across all the different single nucleotide polymorphisms could still be substantially lower than that obtained from prospective logistic regression analysis. We also show there how to remove this bias when G and X are independent conditional on a discrete stratification variable. Mukherjee & Chatterjee (2008) and Chen et al. (2009) show how to use empirical-Bayes methods to provide additional robustness to violations of the gene–environment independence assumption.

4. Data Analysis

In this section, we apply our methodology to a case–control study for breast cancer arising from a large prospective cohort at the National Cancer Institute: the Prostate, Lung, Colorectal and Ovarian cancer screening trial (Canzian et al., 2010). The design of this study is described in detail by Prorok et al. (2000) and Hayes et al. (2000). The cohort data consisted of 622,449 women, of whom 3.56% developed breast cancer (Pfeiffer et al., 2013). The case–control study analyzed here consists of 753 controls and 658 cases. Although π1 is known in this population, we analyze the data both with π1 known and with π1 unknown but with a rare disease approximation.

We had data available on genotypes for 21 single nucleotide polymorphisms that have been previously associated with breast cancer based on large genome-wide association studies. The polygenic risk score was defined by a weighted combination of the genotypes, with the weights defined by log-odds-ratio coefficients reported in prior studies. We examined the interaction of the polygenic risk score with age at menarche (X), a known risk factor for breast cancer, defined as the binary indicator of whether the age at menarche exceeds 13 or not. We also adjust the model for age as a continuous variable, denoted here as Z, so that the model fitted is

$$\mathrm{pr}(D = 1) = H(\beta_0 + \beta_G G + \beta_X X + \beta_{GX} G X + \beta_Z Z). \tag{10}$$

Results in which age was categorized as <35, 35–40, 40–45, ..., >75 were similar.

We also performed analyses to check the gene–environment independence assumption. Since X is binary, we ran a t-test of the polygenic risk score against the levels of X, among the controls only. The p-value was 0.91, providing no evidence of an association between the polygenic risk score and X. We also ran chi-squared tests for the 21 individual single nucleotide polymorphisms, finding no significant association after controlling the false discovery rate: the minimum q-value was 0.09. We also checked for correlation, known as linkage disequilibrium, between the 21 loci used to create the polygenic risk score and 32 loci that are known to influence age at menarche (Elks et al., 2010). The data available to us do not have the necessary information to analyze linkage disequilibrium between the two sets of loci.

Using phased haplotypes from subjects of European descent from 1000 Genomes (The 1000 Genomes Project Consortium, 2015) and HapMap (Gibbs et al., 2003), no evidence of linkage disequilibrium was found: the maximum $R^2$ was 0.1 and the minimum q-value was 0.85. Finally, a 2014 study examined the relationship between age at menarche and 10 of the 21 single nucleotide polymorphisms used to create our polygenic risk score, none of which were found to influence age at menarche (Andersen et al., 2014).

Table 2 presents the results for the cases that π1 is unknown and known, respectively: as remarked upon previously, the results are very similar. To provide a basis for comparison because of the very different scales of the variables, the variable age at baseline was standardized to have mean zero and standard deviation one. In addition, we standardized some of the coefficient estimates so that βG was multiplied by the standard deviation of the polygenic risk score, and βGX was multiplied by the standard deviation of X times the polygenic risk score.

Table 2.

Results of the analysis of the Prostate, Lung, Colorectal and Ovarian cancer screening trial data

βZ βG βX βGX
Logistic
 Estimate 0.018 0.297 0.165 0.124
 std err 0.054 0.064 0.132 0.068
 p-value 7.45 × 10^{−1} 3.19 × 10^{−6} 2.10 × 10^{−1} 6.87 × 10^{−2}
SPMLE, Rare
 Estimate 0.024 0.321 0.175 0.138
 std err (asymptotic) 0.054 0.067 0.134 0.055
 p-value (asymptotic) 6.60 × 10^{−1} 1.62 × 10^{−6} 1.91 × 10^{−1} 1.16 × 10^{−2}
SPMLE, π1 known
 Estimate 0.022 0.313 0.174 0.141
 std err (asymptotic) 0.054 0.065 0.133 0.055
 p-value (asymptotic) 6.78 × 10^{−1} 1.64 × 10^{−6} 1.93 × 10^{−1} 1.13 × 10^{−2}

Logistic is ordinary logistic regression; SPMLE, Rare is our method using the rare disease approximation with unknown π1; SPMLE, π1 known is our method when the disease rate is known in the source population (π1 = 3.56%); std err is the asymptotic standard error estimate; βZ is the main effect for age; βG and βX are the main effects for the polygenic risk score (G) and the environmental variable X (age at menarche > 13), respectively; βGX is the gene–environment interaction.

As expected from the known association of the single nucleotide polymorphisms with risk of breast cancer, the polygenic risk score was strongly associated with breast cancer status of the women in the study. Standard logistic regression analysis reveals some evidence for interaction of the polygenic risk score with age-at-menarche, but the result was not statistically significant at the 0.05 level. When the analysis was done under the gene–environment independence assumption, the evidence of interaction appeared to be stronger.

The coefficient estimate for the interaction term is slightly larger for our semiparametric methods than for logistic regression. Also, the asymptotic standard error estimate for logistic regression is approximately 23% larger than that of our methods, corresponding to a variance increase of approximately 50%. Although not listed here, the bootstrap mentioned in §3·2 gives very similar standard error estimates. In that bootstrap, 33% of the time, the logistic interaction estimate was actually greater than the estimate obtained with the disease rate known.
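The standard error comparison can be checked from the interaction column of Table 2, where logistic regression has standard error 0.068 and both semiparametric fits have 0.055:

$$\frac{0.068}{0.055} \approx 1.24, \qquad \left(\frac{0.068}{0.055}\right)^{2} \approx 1.53.$$

The first ratio is the roughly 23% larger standard error, and its square is the roughly 50% larger variance.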

5. Discussion and Extensions

We have proposed a general method for using retrospective likelihoods for studying gene–environment interactions involving multiple markers, a method that does not require any distributional assumption about the multivariate genotype. Sometimes, one may consider modeling multi-marker gene–environment interactions using an underlying polygenic risk score, which is a weighted combination of numerous genetic markers where the weights are pre-determined from previous association studies. In such situations, the polygenic risk score might be assumed to follow approximately a normal distribution in the underlying population, and the profile likelihood method of Chatterjee & Carroll (2005) can be used with appropriate modification by replacing the parametric multinomial distribution for a single nucleotide polymorphism genotype by a parametric normal distribution for the polygenic risk score; see also Chen et al. (2008) and Lin & Zeng (2009). In general, however, when an investigator wishes to explore complex models for multivariate gene–environment interactions, retaining separate parameters for distinct single nucleotide polymorphisms or for distinct genetic profiles defined by combinations of correlated single nucleotide polymorphisms, one cannot avoid dealing with complex multivariate genotype distributions, which are not easy to specify through parametric models.

Our methods are types of semiparametric plug-in estimators, and thus have certain features in common with the work of Newey (1994), namely that the profile likelihood has the nonparametric component R(x, Ω) in (4) that is estimated by (5). Generally, however, such plug-in estimators are not semiparametric efficient. We believe it will be possible to create an efficient semiparametric estimator by modifying the work of Ma (2010): we are exploring this and its computational aspects, which may be daunting.

Acknowledgments

Stalder and Asher should be considered as joint first authors. Carroll is also Distinguished Professor, School of Mathematical and Physical Sciences, University of Technology Sydney, Broadway NSW 2007, Australia. Stalder’s work was supported by a fellowship from the Fondation Ernest Boninchi. Ma’s research was supported by the National Science Foundation and the National Institute of Neurological Disorders and Stroke. Asher, Liang and Carroll’s research was supported by the National Cancer Institute.

Footnotes

Supplementary Material

Supplementary Material available at Biometrika online contains proofs, skewness and kurtosis and q–q plots for the simulation in Table 1, how to modify our methods to account for strata, additional simulations and software written in R. The data used in §4 are available from the National Cancer Institute via a data transfer agreement.

Contributor Information

Odile Stalder, Department of Clinical Research, and Institute of Social and Preventive Medicine, University of Bern, 3012 Bern, Switzerland.

Alex Asher, Department of Statistics, Texas A&M University, College Station, Texas 77843-3143, U.S.A.

Liang Liang, Department of Statistics, Texas A&M University, College Station, Texas 77843-3143, U.S.A.

Raymond J. Carroll, Department of Statistics, Texas A&M University, College Station, Texas 77843-3143, U.S.A.

Yanyuan Ma, Department of Statistics, Penn State University, University Park, Pennsylvania 16802, U.S.A.

Nilanjan Chatterjee, Department of Biostatistics and Department of Oncology, Johns Hopkins University, Baltimore, Maryland 21205, U.S.A.

References

  1. Andersen SW, Trentham-Dietz A, Gangnon RE, Hampton JM, Skinner HG, Engelman CD, Klein BE, Titus LJ, Egan KM, Newcomb PA. Breast cancer susceptibility loci in association with age at menarche, age at natural menopause and the reproductive lifespan. Cancer Epidemiol. 2014;38:62–65. doi: 10.1016/j.canep.2013.12.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Canzian F, Cox DG, Setiawan VW, Stram DO, Ziegler RG, Dossus L, Beckmann L, Blanché H, Barricarte A, Berg CD, et al. Comprehensive analysis of common genetic variation in 61 genes related to steroid hormone and insulin-like growth factor-i metabolism and breast cancer risk in the NCI breast and prostate cancer cohort consortium. Hum Mol Genet. 2010;19:3873–3884. doi: 10.1093/hmg/ddq291. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Chatterjee N, Carroll RJ. Semiparametric maximum likelihood estimation in case-control studies of gene-environment interactions. Biometrika. 2005;92:399–418. [Google Scholar]
  4. Chatterjee N, Kalaylioglu Z, Moslehi R, Peters U, Wacholder S. Powerful multilocus tests of genetic association in the presence of gene-gene and gene-environment interactions. Am J Hum Genet. 2006;79:1002–1016. doi: 10.1086/509704.
  5. Chatterjee N, Shi J, García-Closas M. Developing and evaluating polygenic risk prediction models for stratified disease prevention. Nature Rev Genet. 2016;17:392–406. doi: 10.1038/nrg.2016.27.
  6. Chatterjee N, Wheeler B, Sampson J, Hartge P, Chanock SJ, Park JH. Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies. Nature Genet. 2013;45:400–405. doi: 10.1038/ng.2579.
  7. Chen YH, Chatterjee N, Carroll RJ. Retrospective analysis of haplotype-based case-control studies under a flexible model for gene-environment association. Biostatistics. 2008;9:81–99. doi: 10.1093/biostatistics/kxm011.
  8. Chen YH, Chatterjee N, Carroll RJ. Shrinkage estimators for robust and efficient inference in haplotype-based case-control studies. J Am Statist Assoc. 2009;104:220–233. doi: 10.1198/jasa.2009.0104.
  9. Dudbridge F. Power and predictive accuracy of polygenic risk scores. PLoS Genet. 2013;9:e1003348. doi: 10.1371/journal.pgen.1003348.
  10. Elks CE, Perry JRB, Sulem P, Chasman DI, Franceschini N, He C, Lunetta KL, Visser JA, Byrne EM, Cousminer DL, et al. Thirty new loci for age at menarche identified by a meta-analysis of genome-wide association studies. Nature Genet. 2010;42:1077–1085. doi: 10.1038/ng.714.
  11. Epstein MP, Satten GA. Inference on haplotype effects in case-control studies using unphased genotype data. Am J Hum Genet. 2003;73:1316–1329. doi: 10.1086/380204.
  12. Fuchsberger C, Flannick J, Teslovich TM, et al. The genetic architecture of type 2 diabetes. Nature. 2016 doi: 10.1038/nature18642.
  13. Gauderman WJ, Zhang P, Morrison JL, Lewinger JP. Finding novel genes by testing G×E interactions in a Genome-Wide Association Study. Genet Epidemiol. 2013;37:603–613. doi: 10.1002/gepi.21748.
  14. Gibbs RA, Belmont JW, Hardenbol P, Willis TD, Yu F, Yang H, Ch’ang LY, Huang W, Liu B, Shen Y, et al. The International HapMap Project. Nature. 2003;426:789–796. doi: 10.1038/nature02168.
  15. Han SS, Rosenberg PS, Ghosh A, Landi MT, Caporaso NE, Chatterjee N. An exposure-weighted score test for genetic associations integrating environmental risk factors. Biometrics. 2015;71:596–605. doi: 10.1111/biom.12328.
  16. Hayes RB, Reding D, Kopp W, Subar AF, Bhat N, Rothman N, Caporaso N, Ziegler RG, Johnson CC, Weissfeld JL, et al. Etiologic and early marker studies in the Prostate, Lung, Colorectal and Ovarian (PLCO) cancer screening trial. Control Clin Trials. 2000;21:349S–355S. doi: 10.1016/s0197-2456(00)00101-x.
  17. Hsu L, Jiao S, Dai JY, Hutter C, Peters U, Kooperberg C. Powerful cocktail methods for detecting genome-wide gene-environment interaction. Genet Epidemiol. 2012;36:183–194. doi: 10.1002/gepi.21610.
  18. Jiao S, Hsu L, Bézieau S, Brenner H, Chan AT, Chang-Claude J, Le Marchand L, Lemire M, Newcomb PA, Slattery ML, et al. SBERIA: Set-based gene-environment interaction test for rare and common variants in complex diseases. Genet Epidemiol. 2013;37:452–464. doi: 10.1002/gepi.21735.
  19. Kwee LC, Epstein MP, Manatunga AK, Duncan R, Allen AS, Satten GA. Simple methods for assessing haplotype-environment interactions in case-only and case-control studies. Genet Epidemiol. 2007;31:75–90. doi: 10.1002/gepi.20192.
  20. Lin DY, Zeng D. Likelihood-based inference on haplotype effects in genetic association studies. J Am Statist Assoc. 2006;101:89–104.
  21. Lin DY, Zeng D. Proper analysis of secondary phenotype data in case-control association studies. Genet Epidemiol. 2009;33:256–265. doi: 10.1002/gepi.20377.
  22. Lin X, Lee S, Christiani DC, Lin X. Test for interactions between a genetic marker set and environment in generalized linear models. Biostatistics. 2013;14:667–681. doi: 10.1093/biostatistics/kxt006.
  23. Lin X, Lee S, Wu MC, Wang C, Chen H, Li Z, Lin X. Test for rare variants by environment interactions in sequencing association studies. Biometrics. 2015;72:156–164. doi: 10.1111/biom.12368.
  24. Ma Y. A semiparametric efficient estimator in case-control studies. Bernoulli. 2010;16:585–603. doi: 10.1016/j.jmva.2019.01.006.
  25. Meigs JB, Shrader P, Sullivan LM, McAteer JB, Fox CS, Dupuis J, Manning AK, Florez JC, Wilson PW, D’Agostino RB, Sr, et al. Genotype score in addition to common risk factors for prediction of type 2 diabetes. N Engl J Med. 2008;359:2208–2219. doi: 10.1056/NEJMoa0804742.
  26. Modan B, Hartge P, Hirsh-Yechezkel G, Chetrit A, Lubin F, Beller U, Ben-Baruch G, Fishman A, Menczer J, Struewing JP, et al. Parity, oral contraceptives, and the risk of ovarian cancer among carriers and noncarriers of a BRCA1 or BRCA2 mutation. N Engl J Med. 2001;345:235–240. doi: 10.1056/NEJM200107263450401.
  27. Mukherjee B, Ahn J, Gruber SB, Chatterjee N. Testing gene-environment interaction in large-scale case-control association studies: possible choices and comparisons. Am J Epidemiol. 2012;175:177–190. doi: 10.1093/aje/kwr367.
  28. Mukherjee B, Chatterjee N. Exploiting gene-environment independence for analysis of case–control studies: An empirical Bayes-type shrinkage estimator to trade-off between bias and efficiency. Biometrics. 2008;64:685–694. doi: 10.1111/j.1541-0420.2007.00953.x.
  29. Murcray CE, Lewinger JP, Gauderman WJ. Gene-environment interaction in genome-wide association studies. Am J Epidemiol. 2009;169:219–226. doi: 10.1093/aje/kwn353.
  30. Newey WK. The asymptotic variance of semiparametric estimators. Econometrica. 1994;62:1349–1382.
  31. Pfeiffer RM, Park Y, Kreimer AR, Lacey JV, Jr, Pee D, Greenlee RT, Buys SS, Hollenbeck A, Rosner B, Gail MH, et al. Risk prediction for breast, endometrial, and ovarian cancer in white women aged 50 y or older: derivation and validation from population-based cohort studies. PLoS Med. 2013;10:e1001492. doi: 10.1371/journal.pmed.1001492.
  32. Piegorsch WW, Weinberg CR, Taylor JA. Non-hierarchical logistic models and case-only designs for assessing susceptibility in population based case-control studies. Statist Med. 1994;13:153–162. doi: 10.1002/sim.4780130206.
  33. Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika. 1979;66:403–411.
  34. Prorok PC, Andriole GL, Bresalier RS, Buys SS, Chia D, Crawford ED, Fogel R, Gelmann EP, Gilbert F, Hasson MA, et al. Design of the prostate, lung, colorectal and ovarian (PLCO) cancer screening trial. Control Clin Trials. 2000;21:273S–309S. doi: 10.1016/s0197-2456(00)00098-2.
  35. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393.
  36. Umbach DM, Weinberg CR. Designing and analysing case-control studies to exploit independence of genotype and exposure. Statist Med. 1997;16:1731–1743. doi: 10.1002/(sici)1097-0258(19970815)16:15<1731::aid-sim595>3.0.co;2-s.
  37. Wacholder S, Hartge P, Prentice R, Garcia-Closas M, Feigelson HS, Diver WR, Thun MJ, Cox DG, Hankinson SE, Kraft P, et al. Performance of common genetic variants in breast-cancer risk models. N Engl J Med. 2010;362:986–993. doi: 10.1056/NEJMoa0907727.
  38. Zhao LP, Li SS, Khalid N. A method for the assessment of disease associations with single-nucleotide polymorphism haplotypes and environmental variables in case-control studies. Am J Hum Genet. 2003;72:1231–1250. doi: 10.1086/375140.

Supplementary Materials

R Code
Supplement