Biometrika. 2017 Sep 15;104(4):801–812. doi: 10.1093/biomet/asx045

Semiparametric analysis of complex polygenic gene-environment interactions in case-control studies

Odile Stalder 1, Alex Asher 2,*, Liang Liang 2,*, Raymond J Carroll 2,*, Yanyuan Ma 3, Nilanjan Chatterjee 4
PMCID: PMC5793684  NIHMSID: NIHMS909293  PMID: 29430038

Summary

Many methods have recently been proposed for efficient analysis of case-control studies of gene-environment interactions using a retrospective likelihood framework that exploits the natural assumption of gene-environment independence in the underlying population. However, for polygenic modelling of gene-environment interactions, which is a topic of increasing scientific interest, applications of retrospective methods have been limited due to a requirement in the literature for parametric modelling of the distribution of the genetic factors. We propose a general, computationally simple, semiparametric method for analysis of case-control studies that allows exploitation of the assumption of gene-environment independence without any further parametric modelling assumptions about the marginal distributions of either of the two sets of factors. The method relies on the key observation that an underlying efficient profile likelihood depends on the distribution of genetic factors only through certain expectation terms that can be evaluated empirically. We develop asymptotic inferential theory for the estimator and evaluate its numerical performance via simulation studies. An application of the method is presented.

Keywords: Case-control study, Gene-environment interaction, Genetic epidemiology, Pseudolikelihood, Retrospective study, Semiparametric method

1. Introduction

Recent genome-wide association studies indicate that complex diseases, such as cancers, diabetes and heart diseases, are in general extremely polygenic (Chatterjee et al., 2016; Fuchsberger et al., 2016). Genetic predisposition to a single disease may involve thousands of genetic variants; each of these may have a very small effect individually, but in combination they can explain substantial variation in risk in the underlying population. As discoveries from genome-wide association studies continue to enhance understanding of complex diseases, in the future it will be critical to elucidate how these genetic factors interact with environmental risk factors, in order to better understand disease mechanisms and to develop public health strategies for disease prevention.

Because of its sampling efficiency, the case-control design is widely popular for conducting studies of genetic associations and gene-environment interactions. A variety of analytical methods have been proposed to increase the efficiency of analysis of case-control data for studies of gene-environment interactions by exploiting an assumption of gene-environment independence in the underlying population. It has been shown that under the assumptions of gene-environment independence and rare disease, the interaction odds-ratio parameters of a logistic regression model can be estimated efficiently based on cases alone (Piegorsch et al., 1994). A general logistic regression model can be fitted to case-control data under the gene-environment independence assumption using a log-linear modelling framework (Umbach & Weinberg, 1997) or a semiparametric retrospective profile likelihood framework (Chatterjee & Carroll, 2005). More recently, the assumption of gene-environment independence has been exploited to propose a variety of powerful hypothesis testing methods for conducting genome-wide scans of gene-environment interactions (Mukherjee & Chatterjee, 2008; Murcray et al., 2009; Hsu et al., 2012; Mukherjee et al., 2012; Gauderman et al., 2013; Han et al., 2015).

We consider developing methods for efficient analysis of case-control studies for modelling gene-environment interactions that involve multiple genetic variants simultaneously. To develop parsimonious models for joint effects, many studies have focused on developing models for gene-environment interactions using underlying polygenic risk scores that could be defined by all known genetic variants associated with the disease (Meigs et al., 2008; Wacholder et al., 2010; Chatterjee et al., 2013; Dudbridge, 2013; Chatterjee et al., 2016). Further, to obtain improved biological insights and to enhance statistical power for detection, one may often wish to model gene-environment interactions using multiple variants within genomic regions and/or biologic pathways (Chatterjee et al., 2006; Jiao et al., 2013; Lin et al., 2013, 2015). In standard prospective logistic regression analysis, which conditions on both the genetic and the environmental risk factor status of the individuals, handling multiple genetic variants is relatively straightforward. In contrast, with so-called retrospective methods, which aim to exploit the assumption of gene-environment independence, the task becomes complicated because all currently existing methods require parametric modelling of the distribution of the genetic or environmental variables.

We propose a computationally simple method for fitting general logistic regression models to case-control data under the assumption of gene-environment independence, but without requiring any further modelling assumptions about the distributions of the genetic or environmental variables. We extend the Chatterjee–Carroll profile likelihood framework, which originally considered modelling gene-environment interactions using single genetic variants for which genotype status could be specified using parametric multinomial models. The new method relies on the observation that the profile likelihood itself can be estimated based on an empirical genotype distribution that is estimable from a case-control sample. We develop the asymptotic theory of the resulting estimator under a semiparametric inferential framework. Simulations and an example illustrate the properties of the new method.

2. Model, method and theory

2.1. Background, model and method

In the following, we use notation similar to that in Chatterjee & Carroll (2005). We will denote disease status, genetic information and environmental risk factors by $D$, $G$ and $X$, respectively. Here $G$ may correspond to a complex multivariate genotype associated with multiple genetic variants or to a continuous polygenic risk score that is defined a priori based on known associations of the genetic variants with the disease. We assume that the risk of the disease given genetic and environmental factors in the underlying population can be specified using a model of the form

$$\mathrm{pr}(D = 1 \mid G, X) = H\{\alpha_0 + m(G, X, \beta)\}, \qquad (1)$$

where $H(x) = \{1 + \exp(-x)\}^{-1}$ is the logistic distribution function and $m(G, X, \beta)$ is a parametrically specified function that defines a model for the joint effect of $G$ and $X$ on the logistic-risk scale. The goal of the gene-environment interaction study is to make inference on the parameters $\beta$ in (1), including interaction parameters.

Let $F(g, x)$ denote the joint distribution of $G$ and $X$ in the underlying population. The key assumption that genetic factors, $G$, and environmental factors, $X$, are independently distributed in the underlying population can be mathematically stated as

$$F(g, x) = F_G(g)\, F_X(x),$$

where $F_G$ and $F_X$ denote the underlying marginal distributions of $G$ and $X$, respectively. In the Supplementary Material we discuss how to weaken this assumption by suitable conditioning on additional stratification factors. In contrast to the existing literature, here we assume that the marginal distributions $F_G$ and $F_X$ are both completely unspecified.

We consider a population-based case-control study, in which $(G, X)$ are sampled independently from individuals with the disease, called cases, and those without the disease, called controls. Suppose there are $n_1$ cases and $n_0$ controls. Standard prospective logistic regression analysis, which is equivalent to maximum likelihood estimation when $F(g, x)$ is allowed to be completely unspecified, yields consistent estimates of $\beta$ (Prentice & Pyke, 1979).

The retrospective likelihood is the probability of observing the genetic and environmental variables, given the subject’s disease status. Under gene-environment independence in the underlying population, the retrospective likelihood is

graphic file with name Equation3.gif

Let Inline graphic and Inline graphic represent the density or mass functions of Inline graphic and Inline graphic, respectively. The retrospective likelihood is

graphic file with name Equation4.gif (2)

Chatterjee & Carroll (2005) profiled out Inline graphic by treating it as discrete on the set of distinct observed values Inline graphic of Inline graphic with probabilities Inline graphic, and then maximizing (2) over Inline graphic, leading eventually to the semiparametric profile likelihood described as follows. Define Inline graphic, where Inline graphic is defined as the probability of the disease in the underlying population. Define Inline graphic. Also let

graphic file with name Equation5.gif

Then, with this notation, the semiparametric profile likelihood is

graphic file with name Equation6.gif (3)

While the representation in (3) does not involve the unknown density of Inline graphic, it does involve the unknown density of Inline graphic. This is a major reason that methods in the current literature specify a parametric distribution for Inline graphic. Our aim in this paper is to dispense with the need to give a parametric form for the distribution function of Inline graphic, so that analysis can be performed with respect to potentially complex multivariate genotype data for which parametric modelling can be difficult and cumbersome.

Here is our key insight, which we discuss first in the context where the probability of the disease in the underlying population is known or at least can be estimated well. For case-control studies that are conducted within well-defined populations, relevant probabilities of the disease can be ascertained using population-based disease registries. When case-control studies are conducted by the sampling of subjects within a larger cohort study, the probability of the disease in the underlying population can be estimated using the disease incidence rate observed in the cohort.

Our key insight in treating the distribution of Inline graphic as nonparametric concerns the term in the denominator of (3), defined as

graphic file with name Equation7.gif

This is simply the expectation, in the source population, of Inline graphic; that is, Inline graphic, where the subscript pop emphasizes that the expectation is in the source population, not in the case-control study. However, crucially,

graphic file with name Equation8.gif (4)

Of course, Inline graphic is unknown, but we estimate it unbiasedly and nonparametrically by

graphic file with name Equation9.gif (5)

In the Supplementary Material, we show that Inline graphic is an unbiased estimate of Inline graphic which is Inline graphic-consistent, and that it is asymptotically normally distributed.
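The reweighting idea behind (4) and (5) can be illustrated with a minimal sketch. By the law of total expectation, a source-population mean of any function of the genotype equals the mean among cases weighted by the disease probability plus the mean among controls weighted by its complement, so it can be estimated from case-control data once the disease probability is supplied. The function name `population_mean`, the argument `pi1` for the disease probability and the toy numbers are illustrative assumptions, not the authors' software or notation.

```python
from statistics import fmean


def population_mean(values, disease, pi1):
    """Estimate a source-population mean from case-control data.

    values  : a function of the genotype evaluated for every sampled subject
    disease : 1 for cases, 0 for controls
    pi1     : probability of disease in the source population (assumed known)
    """
    cases = [v for v, d in zip(values, disease) if d == 1]
    controls = [v for v, d in zip(values, disease) if d == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    # Law of total expectation: weight each stratum mean by its population share.
    return pi1 * fmean(cases) + (1.0 - pi1) * fmean(controls)


# Toy data in which cases carry larger values than controls.
values = [2.0, 4.0, 1.0, 1.0, 0.0, 2.0]
disease = [1, 1, 0, 0, 0, 0]
estimate = population_mean(values, disease, pi1=0.1)
```

With `pi1=0.1` the toy estimate is 0.1 × 3 + 0.9 × 1 = 1.2, whereas the pooled sample mean, 10/6, overweights the cases because they are oversampled relative to the population.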

Ignoring the leading term Inline graphic in (3), which is not estimated, and taking logarithms leads us to an estimated loglikelihood in Inline graphic across the data as

graphic file with name Equation10.gif (6)

Define Inline graphic and similarly for Inline graphic. Then the estimated score function, a type of estimated estimating equation, is

graphic file with name Equation11.gif (7)

Define

graphic file with name Equation12.gif

which is the profile loglikelihood score function when the distribution of Inline graphic is known. Since the profile loglikelihood score of Chatterjee & Carroll (2005) would have mean zero if the distribution of Inline graphic were known, it follows that

graphic file with name Equation13.gif (8)

where the expectation in (8) is taken in the case-control study, not in the source population. Thus, since Inline graphic and Inline graphic converge in probability to Inline graphic and Inline graphic, respectively, a consistent estimate of Inline graphic can be obtained by solving Inline graphic. This estimate Inline graphic, which maximizes the semiparametric pseudolikelihood (6), will be referred to as the semiparametric pseudolikelihood estimator.

2.2. Rare diseases when the population disease probability is unknown

When the probability of disease in the source population is unknown, one can invoke a rare disease assumption which is often reasonable for case-control studies (Piegorsch et al., 1994; Modan et al., 2001; Epstein & Satten, 2003; Zhao et al., 2003; Lin & Zeng, 2006; Kwee et al., 2007). If we assume that Inline graphic, then Inline graphic, and the expectation involved in the calculation of Inline graphic can be evaluated based on only the sample of controls, with Inline graphic. In this case, the estimates of Inline graphic converge not to Inline graphic itself but to Inline graphic, the solution to (8) with Inline graphic. Typically, except when the sample size is very large and hence standard errors are unusually small, the small possible bias of the rare disease approximation is of little consequence and coverage probabilities of confidence intervals remain near nominal; see § 3 for examples. The asymptotic theory of § 2.3 below is then unchanged.
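The reason that only controls are needed under the rare disease assumption can be written out in one line. The symbols below are assumptions made for illustration: $\pi_1$ stands for the probability of disease in the source population, $h$ for any bounded function of the genotype, and $n_0$ for the number of controls. This is a sketch of the limiting argument, not a statement of the estimator derived in the Supplementary Material.

```latex
\begin{align*}
E_{\mathrm{pop}}\{h(G)\}
  &= \pi_1\, E\{h(G) \mid D = 1\} + (1 - \pi_1)\, E\{h(G) \mid D = 0\} \\
  &\approx E\{h(G) \mid D = 0\} \qquad (\pi_1 \approx 0), \\
\widehat{E}_{\mathrm{pop}}\{h(G)\}
  &= n_0^{-1} \sum_{i:\, D_i = 0} h(G_i).
\end{align*}
```

When the disease is rare, the genotype distribution among controls is therefore close to that in the source population, and a control-only average replaces the weighted average over cases and controls.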

In the Supplementary Material, we show that the score and the Hessian take simple forms in this case, and that the Hessian is negative semidefinite. Computation is thus very efficient.

2.3. Asymptotic theory

To state the asymptotic results, we first make the definitions

graphic file with name Equation14.gif

In addition, define Inline graphic, Inline graphic, Inline graphic and Inline graphic.

We use the notational convention that for arbitrary functions Inline graphic, Inline graphic. Also, we use the convention that

graphic file with name Equation15.gif

Define

graphic file with name Equation16.gif

Finally, define Inline graphic.

Theorem 1.

Suppose that Inline graphic, where Inline graphic, and that Inline graphic is known. Then

Theorem 1.

Therefore, since the Inline graphic are independent and Inline graphic, as Inline graphic,

Theorem 1.

in distribution, where

Theorem 1.

In § 2.2, when Inline graphic is unknown and the disease is relatively rare, the same result holds upon setting Inline graphic.

3. Simulations

3.1. Overview

In our simulations, Inline graphic and the value of Inline graphic is binary with population frequency Inline graphic. There are either three or five correlated single nucleotide polymorphisms within a region; we report on the latter case, but the results for the former case are similar. Each single nucleotide polymorphism takes on the values 0, 1 or 2 following a trinomial distribution that follows the Hardy–Weinberg equilibrium, i.e., the Inline graphicth component of Inline graphic equals 0, 1, 2 with probabilities Inline graphic, respectively. The values of the Inline graphic are described below.

To generate correlation among the single nucleotide polymorphisms, we first generated a 3- or 5-variate multivariate normal variate, with mean 0 and standard deviation 1, and a correlation matrix with correlation between the Inline graphicth and Inline graphicth components being Inline graphic, where Inline graphic. After generating these random variables, we trichotomized them with appropriate thresholds so that the frequencies of 0, 1 and 2 matched those specified by the allele frequency Inline graphic and Hardy–Weinberg equilibrium.
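The generating scheme just described can be sketched as follows. The sketch assumes a first-order autoregressive structure for the latent normal variables, so that the correlation between two components decays geometrically with the distance between them. The value of `rho`, the allele frequencies and the function names are placeholders chosen for illustration and are not the settings used in the simulations.

```python
import random
from math import sqrt
from statistics import NormalDist


def hwe_thresholds(maf):
    """Cut points on a standard normal so that genotypes 0, 1, 2 have the
    Hardy-Weinberg probabilities (1-p)^2, 2p(1-p), p^2 for allele frequency p."""
    p0 = (1.0 - maf) ** 2
    p1 = 2.0 * maf * (1.0 - maf)
    nd = NormalDist()
    return nd.inv_cdf(p0), nd.inv_cdf(p0 + p1)


def simulate_genotypes(n, mafs, rho, seed=0):
    """Draw n genotype vectors from trichotomized correlated normals.

    The latent normals have correlation rho**|j-k| (an assumed AR(1) structure).
    """
    rng = random.Random(seed)
    cuts = [hwe_thresholds(p) for p in mafs]
    rows = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        row = []
        for j, (lower, upper) in enumerate(cuts):
            if j > 0:
                # AR(1) step: keeps unit variance, correlation rho with previous z.
                z = rho * z + sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
            row.append(0 if z <= lower else (1 if z <= upper else 2))
        rows.append(row)
    return rows


# Hypothetical settings: five variants, illustrative allele frequencies.
genotypes = simulate_genotypes(n=20000, mafs=[0.1, 0.3, 0.3, 0.3, 0.1], rho=0.7)
```

Because each latent variable is marginally standard normal, thresholding at the normal quantiles of the cumulative Hardy-Weinberg probabilities reproduces the target genotype frequencies at every variant while the shared latent structure induces correlation between variants.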

In both simulations, the logistic intercept Inline graphic was chosen so that the population disease rate Inline graphic. However, additional simulations with Inline graphic yielded very similar results in terms of coverage, efficiency gains, and unbiasedness. See § 3.3 and the Supplementary Material for a discussion of additional simulations. In the simulation reported here, Inline graphic, Inline graphic, Inline graphicInline graphic and Inline graphic. Here Inline graphic.

3.2. Results

The standard error estimators used in our simulation were based on the asymptotic theory described in Theorem 1; we also used the bootstrap and obtained very similar results. The appropriate bootstrap in a case-control study is to resample the cases and controls separately, thus maintaining the sample sizes for each.
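The stratified resampling described above can be sketched in a few lines. The record layout, a tuple of disease status, genotype and environmental value, and the function name `case_control_bootstrap` are hypothetical choices made for this illustration.

```python
import random


def case_control_bootstrap(records, n_boot, seed=0):
    """Yield bootstrap samples that resample cases and controls separately.

    records : list of (d, g, x) tuples, with d = 1 for cases and 0 for controls
    """
    rng = random.Random(seed)
    cases = [r for r in records if r[0] == 1]
    controls = [r for r in records if r[0] == 0]
    for _ in range(n_boot):
        # Sampling with replacement within each stratum keeps both sample sizes fixed.
        yield (rng.choices(cases, k=len(cases))
               + rng.choices(controls, k=len(controls)))


# Toy data: three cases and four controls.
records = [(1, 0, 1), (1, 2, 0), (1, 1, 1),
           (0, 0, 0), (0, 1, 0), (0, 0, 1), (0, 2, 1)]
samples = list(case_control_bootstrap(records, n_boot=200))
```

Refitting the estimator on each bootstrap sample and taking the standard deviation of the replicate estimates gives a bootstrap standard error. Resampling the pooled data instead would let the numbers of cases and controls vary, which does not match the case-control design.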

The simulation results are presented in Table 1. Our semiparametric pseudolikelihood estimator shows little bias and has coverage percentages near the nominal level. Both with the rare disease approximation and with the disease rate known, it achieves approximately a 25% increase in mean squared error efficiency over ordinary logistic regression for the main effects of both the genetic and the environmental variables.

Table 1.

Results of Inline graphic simulations as described in § 3: mean bias, coverage probabilities of a 95% nominal confidence interval, and mean squared error efficiency of our semiparametric pseudolikelihood estimator compared with ordinary logistic regression; the simulations were performed with 1000 cases and Inline graphic controls

  Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
True 0.18 0.18 0.00 0.18 0.00 0.41 0.26 0.00 0.00 0.26 0.00
Logistic: 1000 cases
Bias 0.00 0.01 0.00 0.01 –0.01 0.01 0.01 –0.01 0.00 0.00 0.01
CI (%) 94.3 95.2 95.7 95.1 94.7 94.6 94.9 94.2 94.5 96.0 94.2
SPMLE Rare: 1000 cases
Bias 0.01 0.00 0.00 0.02 –0.01 0.02 –0.02 –0.01 0.01 –0.02 0.01
CI (%) 95.2 95.4 96.4 95.8 95.3 95.1 95.4 94.8 96.1 95.5 94.9
Avg MSE Eff All Inline graphic: 1.28 Inline graphic: 1.26 All Inline graphic: 2.18
SPMLE Inline graphic known: 1000 cases
Bias 0.00 0.00 0.00 0.01 –0.01 0.01 0.00 –0.01 0.01 –0.01 0.01
CI (%) 95.1 95.5 96.4 95.8 95.0 95.5 95.6 94.6 95.9 95.2 94.5
Avg MSE Eff All Inline graphic: 1.28 All Inline graphic: 1.28 All Inline graphic: 2.07

Logistic, ordinary logistic regression; SPMLE Rare, our estimator using the rare disease approximation with unknown Inline graphic (§ 2.2); SPMLE Inline graphic known, our estimator when Inline graphic is known in the source population (§ 2.1); CI, coverage of a nominal 95% confidence interval, calculated using the asymptotic standard error; Avg MSE Eff, mean squared error efficiency of our method compared to logistic regression averaged over Inline graphic (All Inline graphic), over Inline graphic (All Inline graphic) or over all Inline graphic interactions (All Inline graphic).

Strikingly, the mean squared error efficiency of our semiparametric pseudolikelihood estimators compared to ordinary logistic regression is approximately Inline graphic for all the interaction terms, thus demonstrating that our methods, which do not model the distribution of either Inline graphic or Inline graphic, achieve numerically significant increases in efficiency.

3.3. Additional simulations

The Supplementary Material presents a series of additional simulations. These include the results of a simulation to evaluate the robustness of our method with respect to misspecification of the population disease rate; we found a surprising robustness with respect to disease rate misspecification. Additionally, we performed simulations to examine the robustness of our method with respect to violations of the gene-environment independence assumption. Those simulation studies show that there will be bias in the estimates of gene-environment interaction parameters for the specific single nucleotide polymorphisms that violate gene-environment independence, but the average mean squared error for parameter estimates across all the different single nucleotide polymorphisms could still be substantially lower than that obtained from prospective logistic regression analysis. We also show in the Supplementary Material how to remove this bias when Inline graphic and Inline graphic are independent conditional on a discrete stratification variable. Mukherjee & Chatterjee (2008) and Chen et al. (2009) show how to use empirical Bayes methods to provide additional robustness with respect to violations of the gene-environment independence assumption.

4. Data analysis

In this section, we apply our method to a case-control study for breast cancer arising from a large prospective cohort at the National Cancer Institute: the Prostate, Lung, Colorectal and Ovarian cancer screening trial (Canzian et al., 2010). The design of this study is described in detail by Prorok et al. (2000) and Hayes et al. (2000). The cohort data consisted of Inline graphic women, of whom 3.56% developed breast cancer (Pfeiffer et al., 2013). The case-control study analysed here consists of 753 controls and 658 cases. Although the disease rate is known in this population, we analyse the data both with the disease rate treated as known and with it treated as unknown, using a rare disease approximation in the latter case.

We had data available on genotypes for 21 single nucleotide polymorphisms that have been previously associated with breast cancer based on large genome-wide association studies. The polygenic risk score was defined by a weighted combination of the genotypes, with the weights defined by log-odds-ratio coefficients reported in prior studies. We examined the interaction of the polygenic risk score with age at menarche, Inline graphic, a known risk factor for breast cancer, defined as the binary indicator of whether the age at menarche exceeds 13 or not. We also adjust the model for age as a continuous variable, denoted here by Inline graphic, so that the model fitted is

graphic file with name Equation20.gif

Results when age was categorized as Inline graphic35, 35–40, 40–45, Inline graphic75 were similar.
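The polygenic risk score used in this analysis is a weighted sum of genotypes with weights equal to previously reported log odds ratios. A minimal sketch of that construction follows; the three odds ratios and the function name are hypothetical and serve only to show the arithmetic.

```python
from math import log


def polygenic_risk_score(genotype, log_odds_ratios):
    """Weighted sum of allele counts, with one log-odds-ratio weight per variant."""
    if len(genotype) != len(log_odds_ratios):
        raise ValueError("one weight is needed per variant")
    return sum(g * w for g, w in zip(genotype, log_odds_ratios))


# Hypothetical per-allele odds ratios for three variants.
weights = [log(1.10), log(1.25), log(0.90)]
# One subject carrying 2, 0 and 1 copies of the respective alleles.
score = polygenic_risk_score([2, 0, 1], weights)
```

Because the weights are fixed in advance from earlier studies, the score enters the fitted model as a single continuous covariate rather than as 21 separate genotype terms.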

We also performed analyses to check the gene-environment independence assumption. Since the age-at-menarche indicator is binary, we ran a t-test of the polygenic risk score against its two levels, among the controls only. The p-value was 0.91, giving no evidence of an association between the polygenic risk score and age at menarche. We also ran chi-squared tests for the 21 individual single nucleotide polymorphisms, finding no significant association after controlling the false discovery rate: the minimum Inline graphic-value was 0.09. In addition, we checked for correlation, known as linkage disequilibrium, between the 21 loci used to create the polygenic risk score and 32 loci that are known to influence age at menarche (Elks et al., 2010). The data available to us do not contain the necessary information to analyse linkage disequilibrium between the two sets of loci.

Using phased haplotypes from subjects of European descent from the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2015) and HapMap (Gibbs et al., 2003), no evidence of linkage disequilibrium was found: the maximum Inline graphic was 0.1 and the minimum Inline graphic-value was 0.85. Finally, a 2014 study examined the relationship between age at menarche and 10 of the 21 single nucleotide polymorphisms used to create our polygenic risk score, none of which were found to influence age at menarche (Andersen et al., 2014).
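The control-only comparison of the polygenic risk score between the two levels of the binary environmental variable can be sketched as below. The sketch uses Welch's unequal-variance form of the two-sample t statistic, which is an assumption, since the text does not say which variant was used; the function name and toy data are likewise hypothetical.

```python
from math import sqrt
from statistics import fmean, variance


def welch_t_among_controls(score, x, disease):
    """Welch two-sample t statistic comparing scores between x = 1 and x = 0,
    computed among controls (disease == 0) only."""
    g1 = [s for s, xi, d in zip(score, x, disease) if d == 0 and xi == 1]
    g0 = [s for s, xi, d in zip(score, x, disease) if d == 0 and xi == 0]
    se = sqrt(variance(g1) / len(g1) + variance(g0) / len(g0))
    return (fmean(g1) - fmean(g0)) / se


# Toy data: the last two subjects are cases and must be ignored.
score = [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 10.0, 10.0]
x = [0, 0, 0, 1, 1, 1, 1, 0]
disease = [0, 0, 0, 0, 0, 0, 1, 1]
t_stat = welch_t_among_controls(score, x, disease)
```

Restricting to controls matters because cases are selected on disease status, which can induce an association between genetic and environmental factors among cases even when the two are independent in the population. For a rare disease, the controls are approximately representative of that population.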

Table 2 presents the results for the cases where the disease rate is unknown and known; as remarked upon previously, the results are very similar. Because of the very different scales of the variables, to provide a basis for comparison the variable age at baseline was standardized to have mean zero and standard deviation one. In addition, we standardized some of the coefficient estimates: the main-effect coefficient of the polygenic risk score was multiplied by the standard deviation of the polygenic risk score, and the interaction coefficient was multiplied by the standard deviation of the age-at-menarche indicator times the polygenic risk score.

Table 2.

Results of the analysis of the Prostate, Lung, Colorectal and Ovarian cancer screening trial data

  Inline graphic Inline graphic Inline graphic Inline graphic
Logistic
 Estimate 0.018 0.297 Inline graphic 0.165 0.124
 Std err 0.054 0.064 Inline graphic 0.132 0.068
Inline graphic -value Inline graphic Inline graphic Inline graphic Inline graphic
SPMLE Rare
 Estimate 0.024 0.321 Inline graphic 0.175 0.138
 Std err (asymptotic) 0.054 0.067 Inline graphic 0.134 0.055
Inline graphic -value (asymptotic) Inline graphic Inline graphic Inline graphic Inline graphic
SPMLE Inline graphic known
 Estimate 0.022 0.313 Inline graphic 0.174 0.141
 Std err (asymptotic) 0.054 0.065 Inline graphic 0.133 0.055
Inline graphic -value (asymptotic) Inline graphic Inline graphic Inline graphic Inline graphic

Logistic, ordinary logistic regression; SPMLE Rare, our method using the rare disease approximation with unknown Inline graphic; SPMLE Inline graphic known, our method when the disease rate is known in the source population (Inline graphic); Std err, the asymptotic standard error estimate; Inline graphic, the main effect for age; Inline graphic and Inline graphic, the main effects for the polygenic risk score (Inline graphic) and the environmental variable Inline graphic (age at menarche > 13), respectively; Inline graphic, the gene-environment interaction.

As expected from the known association of the single nucleotide polymorphisms with risk of breast cancer, the polygenic risk score was strongly associated with breast cancer status of the women in the study. Standard logistic regression analysis reveals some evidence for interaction of the polygenic risk score with age at menarche, but the result was not statistically significant at the 0.05 level. When the analysis was done under the gene-environment independence assumption, the evidence for interaction appeared to be stronger.

The coefficient estimate for the interaction term is slightly larger for our semiparametric methods than for logistic regression. Also, the asymptotic standard error estimate of logistic regression is approximately 23% larger than that for our methods, indicating a variance increase of approximately 50%. Although not listed here, the bootstrap mentioned in § 3.2 has very similar standard error estimates. In that bootstrap, 33% of the time the logistic interaction estimate was actually greater than that of the disease-rate-known estimate.
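The step from a 23% larger standard error to a variance increase of about 50% follows from squaring the ratio of standard errors. As a check, the calculation below uses the interaction standard errors from Table 2, read as 0.068 for logistic regression and 0.055 for the semiparametric estimators. That reading assumes the image placeholders inside those table entries are decimal points.

```latex
\[
\frac{0.068}{0.055} \approx 1.236,
\qquad
\left(\frac{0.068}{0.055}\right)^{2} \approx 1.53 .
\]
```

The variance of the logistic regression interaction estimate is thus roughly one and a half times that of the semiparametric estimate, consistent with the stated figure of approximately 50%.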

5. Discussion and extensions

We have proposed a general method for using retrospective likelihoods to study gene-environment interactions involving multiple markers, an approach that does not require any distributional assumption on the multivariate genotype distribution. Sometimes, one may consider modelling multimarker gene-environment interactions using an underlying polygenic risk score, which is a weighted combination of numerous genetic markers where the weights are predetermined from previous association studies. In such situations, the polygenic risk score might be assumed to follow approximately a normal distribution in the underlying population, and the profile likelihood method of Chatterjee & Carroll (2005) can be used with appropriate modification by replacing the parametric multinomial distribution for a single nucleotide polymorphism genotype with a parametric normal distribution for the polygenic risk score; see also Chen et al. (2008) and Lin & Zeng (2009). In general, however, if one wishes to explore complex models for multivariate gene-environment interactions retaining separate parameters for distinct single nucleotide polymorphisms or for distinct genetic profiles defined by combinations of correlated single nucleotide polymorphisms, then one cannot avoid dealing with complex multivariate genotype distributions, something that is not easy to specify through parametric models.

Our methods are types of semiparametric plug-in estimators, and thus have certain features in common with the work of Newey (1994), namely that the profile likelihood has the nonparametric component Inline graphic in (4) that is estimated by (5). Generally, however, such plug-in estimators are not semiparametric efficient. We believe it will be possible to create an efficient semiparametric estimator by modifying the work of Ma (2010); we are exploring this and its computational aspects, which may be daunting.


Acknowledgement

Stalder and Asher should be considered joint first authors. Carroll is also Distinguished Professor at the University of Technology Sydney. Chatterjee is also Bloomberg Professor of Oncology at the Johns Hopkins University. Stalder was supported by a fellowship from the Fondation Ernest Boninchi. Ma was supported by the U.S. National Science Foundation and National Institute of Neurological Disorders and Stroke. Asher, Liang and Carroll were supported by the National Cancer Institute. Chatterjee’s research was partially funded through a Patient-Centered Outcomes Research Institute Award. The statements and opinions in this article are solely the responsibility of the authors and do not necessarily represent the views of the Patient-Centered Outcomes Research Institute, its Board of Governors or Methodology Committee.

Supplementary material

Supplementary Material available at Biometrika online contains proofs, skewness and kurtosis and Q-Q plots for the simulation in Table 1, a discussion of how to modify our methods to account for strata, results of additional simulations, and software written in R. The data used in § 4 are available from the National Cancer Institute via a data transfer agreement.

References

  1. Andersen S. W., Trentham-Dietz A., Gangnon R. E., Hampton J. M., Skinner H. G., Engelman C. D., Klein B. E., Titus L. J., Egan K. M. & Newcomb P. A. (2014). Breast cancer susceptibility loci in association with age at menarche, age at natural menopause and the reproductive lifespan. Cancer Epidemiol. 38, 62–5.
  2. Canzian F., Cox D. G., Setiawan V. W., Stram D. O., Ziegler R. G., Dossus L., Beckmann L., Blanché H., Barricarte A., Berg C. D. et al. (2010). Comprehensive analysis of common genetic variation in 61 genes related to steroid hormone and insulin-like growth factor-I metabolism and breast cancer risk in the NCI breast and prostate cancer cohort consortium. Hum. Molec. Genet. 19, 3873–84.
  3. Chatterjee N. & Carroll R. J. (2005). Semiparametric maximum likelihood estimation in case-control studies of gene-environment interactions. Biometrika 92, 399–418.
  4. Chatterjee N., Kalaylioglu Z., Moslehi R., Peters U. & Wacholder S. (2006). Powerful multilocus tests of genetic association in the presence of gene-gene and gene-environment interactions. Am. J. Hum. Genet. 79, 1002–16.
  5. Chatterjee N., Shi J. & García-Closas M. (2016). Developing and evaluating polygenic risk prediction models for stratified disease prevention. Nature Rev. Genet. 17, 392–406.
  6. Chatterjee N., Wheeler B., Sampson J., Hartge P., Chanock S. J. & Park J.-H. (2013). Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies. Nature Genet. 45, 400–5.
  7. Chen Y. H., Chatterjee N. & Carroll R. J. (2008). Retrospective analysis of haplotype-based case-control studies under a flexible model for gene-environment association. Biostatistics 9, 81–99.
  8. Chen Y. H., Chatterjee N. & Carroll R. J. (2009). Shrinkage estimators for robust and efficient inference in haplotype-based case-control studies. J. Am. Statist. Assoc. 104, 220–33.
  9. Dudbridge F. (2013). Power and predictive accuracy of polygenic risk scores. PLoS Genet. 9, e1003348.
  10. Elks C. E., Perry J. R. B., Sulem P., Chasman D. I., Franceschini N., He C., Lunetta K. L., Visser J. A., Byrne E. M., Cousminer D. L. et al. (2010). Thirty new loci for age at menarche identified by a meta-analysis of genome-wide association studies. Nature Genet. 42, 1077–85.
  11. Epstein M. P. & Satten G. A. (2003). Inference on haplotype effects in case-control studies using unphased genotype data. Am. J. Hum. Genet. 73, 1316–29.
  12. Fuchsberger C., Flannick J., Teslovich T. M., Mahajan A., Agarwala V., Gaulton K. J., Ma C., Fontanillas P., Moutsianas L., McCarthy D. J. et al. (2016). The genetic architecture of type 2 diabetes. Nature 536, 41–7.
  13. Gauderman W. J., Zhang P., Morrison J. L. & Lewinger J. P. (2013). Finding novel genes by testing G × E interactions in a genome-wide association study. Genet. Epidemiol. 37, 603–13.
  14. Gibbs R. A., Belmont J. W., Hardenbol P., Willis T. D., Yu F., Yang H., Ch’ang L.-Y., Huang W., Liu B., Shen Y. et al. (2003). The International HapMap Project. Nature 426, 789–96.
  15. Han S. S., Rosenberg P. S., Ghosh A., Landi M. T., Caporaso N. E. & Chatterjee N. (2015). An exposure-weighted score test for genetic associations integrating environmental risk factors. Biometrics 71, 596–605.
  16. Hayes R. B., Reding D., Kopp W., Subar A. F., Bhat N., Rothman N., Caporaso N., Ziegler R. G., Johnson C. C., Weissfeld J. L. et al. (2000). Etiologic and early marker studies in the prostate, lung, colorectal and ovarian (PLCO) cancer screening trial. Control. Clin. Trials 21, 349S–55S.
  17. Hsu L., Jiao S., Dai J. Y., Hutter C., Peters U. & Kooperberg C. (2012). Powerful cocktail methods for detecting genome-wide gene-environment interaction. Genet. Epidemiol. 36, 183–94.
  18. Jiao S., Hsu L., Bézieau S., Brenner H., Chan A. T., Chang-Claude J., Le Marchand L., Lemire M., Newcomb P. A., Slattery M. L. et al. (2013). SBERIA: Set-based gene-environment interaction test for rare and common variants in complex diseases. Genet. Epidemiol. 37, 452–64.
  19. Kwee L. C., Epstein M. P., Manatunga A. K., Duncan R., Allen A. S. & Satten G. A. (2007). Simple methods for assessing haplotype-environment interactions in case-only and case-control studies. Genet. Epidemiol. 31, 75–90.
  20. Lin D. Y. & Zeng D. (2006). Likelihood-based inference on haplotype effects in genetic association studies. J. Am. Statist. Assoc. 101, 89–104.
  21. Lin D. Y. & Zeng D. (2009). Proper analysis of secondary phenotype data in case-control association studies. Genet. Epidemiol. 33, 256–65.
  22. Lin X., Lee S., Christiani D. C. & Lin X. (2013). Test for interactions between a genetic marker set and environment in generalized linear models. Biostatistics 14, 667–81.
  23. Lin X., Lee S., Wu M. C., Wang C., Chen H., Li Z. & Lin X. (2015). Test for rare variants by environment interactions in sequencing association studies. Biometrics 72, 156–64.
  24. Ma Y. (2010). A semiparametric efficient estimator in case-control studies. Bernoulli 16, 585–603.
  25. Meigs J. B., Shrader P., Sullivan L. M., McAteer J. B., Fox C. S., Dupuis J., Manning A. K., Florez J. C., Wilson P. W., D’Agostino Sr R. B. et al. (2008). Genotype score in addition to common risk factors for prediction of type 2 diabetes. N. Engl. J. Med. 359, 2208–19.
  26. Modan B., Hartge P., Hirsh-Yechezkel G., Chetrit A., Lubin F., Beller U., Ben-Baruch G., Fishman A., Menczer J., Struewing J. P. et al. (2001). Parity, oral contraceptives, and the risk of ovarian cancer among carriers and noncarriers of a BRCA1 or BRCA2 mutation. N. Engl. J. Med. 345, 235–40.
  27. Mukherjee B., Ahn J., Gruber S. B. & Chatterjee N. (2012). Testing gene-environment interaction in large-scale case-control association studies: Possible choices and comparisons. Am. J. Epidemiol. 175, 177–90.
  28. Mukherjee B. & Chatterjee N. (2008). Exploiting gene-environment independence for analysis of case–control studies: An empirical Bayes-type shrinkage estimator to trade-off between bias and efficiency. Biometrics 64, 685–94.
  29. Murcray C. E., Lewinger J. P. & Gauderman W. J. (2009). Gene-environment interaction in genome-wide association studies. Am. J. Epidemiol. 169, 219–26.
  30. Newey W. K. (1994). The asymptotic variance of semiparametric estimators. Econometrica 62, 1349–82.
  31. Pfeiffer R. M., Park Y., Kreimer A. R., Lacey Jr J. V., Pee D., Greenlee R. T., Buys S. S., Hollenbeck A., Rosner B., Gail M. H. et al. (2013). Risk prediction for breast, endometrial, and ovarian cancer in white women aged 50 y or older: Derivation and validation from population-based cohort studies. PLoS Med. 10, e1001492.
  32. Piegorsch W. W., Weinberg C. R. & Taylor J. A. (1994). Non-hierarchical logistic models and case-only designs for assessing susceptibility in population based case-control studies. Statist. Med. 13, 153–62.
  33. Prentice R. L. & Pyke R. (1979). Logistic disease incidence models and case-control studies. Biometrika 66, 403–11.
  34. Prorok P. C., Andriole G. L., Bresalier R. S., Buys S. S., Chia D., Crawford E. D., Fogel R., Gelmann E. P., Gilbert F., Hasson M. A. et al. (2000). Design of the prostate, lung, colorectal and ovarian (PLCO) cancer screening trial. Control. Clin. Trials 21, 273S–309S.
  35. The 1000 Genomes Project Consortium (2015). A global reference for human genetic variation. Nature 526, 68–74.
  36. Umbach D. M. & Weinberg C. R. (1997). Designing and analysing case-control studies to exploit independence of genotype and exposure. Statist. Med. 16, 1731–43.
  37. Wacholder S., Hartge P., Prentice R., Garcia-Closas M., Feigelson H. S., Diver W. R., Thun M. J., Cox D. G., Hankinson S. E., Kraft P. et al. (2010). Performance of common genetic variants in breast-cancer risk models. N. Engl. J. Med. 362, 986–93.
  38. Zhao L. P., Li S. S. & Khalid N. (2003). A method for the assessment of disease associations with single-nucleotide polymorphism haplotypes and environmental variables in case-control studies. Am. J. Hum. Genet. 72, 1231–50.

Articles from Biometrika are provided here courtesy of Oxford University Press
