Abstract
Case-control studies are popular epidemiological designs for detecting gene–environment interactions in the etiology of complex diseases, where the genetic susceptibility and environmental exposures may often be reasonably assumed independent in the source population. Various papers have presented analytical methods exploiting gene–environment independence to achieve better efficiency, all of which require either a rare disease assumption or a distributional assumption on the genetic variables. We relax both assumptions. We construct a semiparametric estimator in case-control studies exploiting gene–environment independence, while the distributions of genetic susceptibility and environmental exposures are both unspecified and the disease rate is assumed unknown and is not required to be close to zero. The resulting estimator is semiparametric efficient and its superiority over prospective logistic regression, the usual analysis in case-control studies, is demonstrated in various numerical illustrations.
Keywords: Biased samples, Case-control study, Gene–environment independence, Gene–environment interaction, Semiparametric estimation
1. Introduction
The etiology of most complex diseases, such as cancers and cardiovascular diseases, is the joint effect of genetic susceptibility and environmental or non-genetic exposures, as well as their interactions. Even subtle differences in genetic factors between people, when exposed to the same environmental factors, can lead to dramatically different responses. In other words, people with certain genes may have a low risk of developing a disease whereas others may be more vulnerable when exposed to an identical environmental agent. One common example is that sunlight exposure results in higher risk of developing skin cancer among fair-skinned individuals than people with dark skin [17,22]. Studying gene–environment interactions is thus of great importance to understand disease mechanisms and develop new treatments and prevention strategies.
The case-control study design is commonly used to investigate the intricate interplay of genetic susceptibility and environment effects. It is cost-efficient and convenient to implement compared to a cohort study, especially when dealing with relatively rare diseases [6]. Instead of taking a random sample from the underlying source population, the case-control design randomly draws a fixed number of cases (diseased subjects) and a comparable number of controls (non-diseased subjects) from the respective case and control subpopulations. Genetic and environmental factors are measured and recorded for these sampled subjects. The standard approach for the analysis of such a case-control study is prospective logistic regression, which ignores the underlying retrospective nature of the case-control design. Cornfield [10] showed the equivalence of prospective and retrospective odds ratios, which validates the prospective approach. Prentice and Pyke [24] further showed that prospective logistic regression analysis gives an efficient estimator, in the sense that it yields the maximum likelihood estimates of the odds ratio parameters under a semiparametric model that allows an arbitrary covariate distribution.
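The equivalence of prospective and retrospective odds ratios can be checked numerically. The sketch below uses a hypothetical binary exposure and made-up population probabilities (the function names and all numbers are illustrative assumptions, not taken from this paper). It shows that the exposure odds ratio comparing cases with controls equals the disease odds ratio comparing exposed with unexposed subjects.

```python
def prospective_or(p_d_given_e1, p_d_given_e0):
    """Odds ratio of disease comparing exposed with unexposed subjects."""
    odds1 = p_d_given_e1 / (1.0 - p_d_given_e1)
    odds0 = p_d_given_e0 / (1.0 - p_d_given_e0)
    return odds1 / odds0


def retrospective_or(p_e, p_d_given_e1, p_d_given_e0):
    """Odds ratio of exposure comparing cases with controls."""
    # Joint probabilities of (exposure, disease) in the source population.
    p_e1_d1 = p_e * p_d_given_e1
    p_e0_d1 = (1.0 - p_e) * p_d_given_e0
    p_e1_d0 = p_e * (1.0 - p_d_given_e1)
    p_e0_d0 = (1.0 - p_e) * (1.0 - p_d_given_e0)
    # Exposure odds within cases and within controls. The marginal disease
    # probabilities cancel, which is why case-control sampling preserves them.
    odds_cases = p_e1_d1 / p_e0_d1
    odds_controls = p_e1_d0 / p_e0_d0
    return odds_cases / odds_controls


# Hypothetical values: 30% exposed, disease risks of 8% and 3%.
or_pro = prospective_or(0.08, 0.03)
or_retro = retrospective_or(0.30, 0.08, 0.03)
```

The two quantities agree for any exposure prevalence, which is the property that justifies fitting a prospective model to retrospectively sampled data.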
Despite this, prospective logistic regression in a case-control study can still require a large sample size to obtain adequate statistical power for detecting gene–environment interactions or testing other hypotheses of interest. As a consequence, epidemiological researchers often exploit the potential efficiency gain from further assuming certain parametric or semiparametric structures for the covariate distribution. For example, in practice, a common assumption is that genetic susceptibility and environmental exposure are independent in the underlying source population [23], possibly given strata. Under such a model, prospective logistic regression analysis is still valid but may not be efficient because it ignores gene–environment independence.
A growing number of articles have been published in the last two decades, proposing analytical methods that exploit the gene–environment independence assumption [5,14,15,20,21,23]. Piegorsch et al. [23] showed that under gene–environment independence and a rare disease assumption, the multiplicative interaction odds-ratio parameter can be estimated from cases alone and the resulting estimator is more precise than the estimator from traditional prospective logistic regression analysis using both cases and controls. However, the misuse of a rare disease assumption in analyzing diseases with moderate prevalence, or diseases with small marginal probability in the source population but high risk for certain combinations of genetic and environmental exposures, can lead to considerable bias in the estimation. Noting this fact, Chatterjee and Carroll [5] developed a semiparametric maximum likelihood estimator employing the gene–environment independence assumption but not requiring any rare-disease assumption. Their approach leaves the distribution of the environmental exposures totally unspecified but restricts genetic susceptibility to have a discrete distribution that takes values in a finite and fixed set. Ma [20] proposed a semiparametric efficient estimator in the same setting as Chatterjee and Carroll [5] except the distribution of genetic susceptibility is allowed to be either discrete or continuous with a finite-dimensional parameter. The key ingredient of this approach is to construct a hypothetical population with infinite population size and a disease to non-disease ratio of n1/n0, where n1 and n0 are the numbers of cases and controls in the case-control sample. Section 2 of Ma [20] showed that the case-control sample can be viewed as a random sample of size n = n0 + n1 of independent and identically distributed observations from this hypothetical population, and hence classical semiparametric analysis is applicable.
The validity and usefulness of such a hypothetical population were established in Ma [20]. Instead of assuming independence of gene and environment, there is a literature based on parametric modeling of the relationship between them [8,9,18,19]; we make no such parametric assumptions.
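The case-only idea of Piegorsch et al. [23], and the bias that arises when the disease is not rare, can be illustrated with a small calculation. The sketch below assumes binary G and X that are independent in the source population and a logistic risk model with main effects and an interaction; the function names and the intercept values are hypothetical. Under independence the marginal distributions of G and X cancel in the odds ratio between G and X among cases, so only the four risks are needed.

```python
import math


def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-t))


def case_only_log_or(alpha, b_g, b_x, b_gx):
    """Log odds ratio between binary G and X among cases.

    Assumes G and X are independent in the source population, so that
    pr(G=g, X=x | D=1) is proportional to pr(G=g) pr(X=x) risk(g, x)
    and the marginal probabilities cancel in the odds ratio.
    """
    risk = {
        (g, x): expit(alpha + b_g * g + b_x * x + b_gx * g * x)
        for g in (0, 1)
        for x in (0, 1)
    }
    return math.log(risk[1, 1] * risk[0, 0] / (risk[1, 0] * risk[0, 1]))


beta = (0.76, 0.36, -0.63)  # illustrative main effects and interaction
rare = case_only_log_or(-8.0, *beta)    # very rare disease
common = case_only_log_or(-1.0, *beta)  # common disease
```

With the very negative intercept the case-only log odds ratio is close to the interaction coefficient, whereas with the larger intercept it is noticeably attenuated, which is the bias described in the text.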
In this paper, we consider a more general setting which keeps the gene–environment independence assumption, while further allowing an unknown disease rate and completely nonparametric distributions for both the genetic susceptibility and the environmental exposure. Under such a model setting, we adopt the hypothetical population framework of Ma [20] and derive the semiparametric efficient estimator by employing a semiparametric approach, which links the efficient estimator with the efficient score function. Throughout our work, the underlying source population is referred to as the true population to emphasize the difference between the underlying source population and the hypothetical population. The inherent connection between the two populations allows us to transport parameter estimation and inference results derived in the hypothetical population directly to those in the true population; see Theorem 1. Although general semiparametric theory applies in the hypothetical population framework, computing the efficient estimator in this context is technically challenging because the efficient score does not have an explicit form and must be solved from an integral equation. We adopt a simple numerical approach to solve the integral equation by discretizing the distribution of the genetic susceptibility when it is continuous. The resulting estimator, when properly implemented, is asymptotically linear with optimal efficiency.
The rest of the paper is organized as follows. The specific model and the hypothetical population framework are presented in Section 2, with the corresponding identifiability conditions provided in Appendix A.1. In Section 3, we formulate the problem by using a conventional semiparametric approach. The analytic expression of our semiparametric efficient estimator as well as its detailed implementation are discussed in this section. Section 4 illustrates the asymptotic properties of the resulting estimator. Several simulation studies are conducted in Section 5 to demonstrate the numerical performance of our semiparametric efficient estimator compared with prospective logistic regression. A real data analysis is provided in Section 6, followed by a brief discussion in Section 7. Technical details and proofs are given in an Appendix and in the Online Supplement.
2. Model and framework
2.1. Background
It is useful to describe how the methods, referenced in Section 1, for exploiting a genetic-environmental relationship in an underlying source population have evolved from the earlier work, a relatively simple case in Chatterjee and Carroll [5], which includes the following key ingredients:
- An underlying logistic regression for disease D as a function of genetic variables G and environmental exposures X.
- A parametric distribution assumption for G in the source population when G and X are independent.
- Writing out the retrospective likelihood of the observed case-control data.
- A profile likelihood argument that estimates the distribution of X in the source population using a Lagrange multiplier argument that places probability mass at each observed value of X. This leads to a pseudolikelihood that involves the distribution of G but not the distribution of X.
The main technical difficulty is carrying out the algebra of the Lagrange multiplier argument and getting an explicit pseudolikelihood, where by explicit we mean that the resulting formula requires no numerical solutions to nonlinear equations.
In our case, however, we are not making the assumption of a parametric distribution for G in the source population. A profile likelihood method to remove the distribution of G and get a new, explicit, profile likelihood based on a Lagrange multiplier argument does not appear to be possible, or at least it seems to be very difficult, because of the form of the pseudolikelihood.
To overcome these difficulties, there have been two main alternatives, and they are both based on the idea of relating the case-control study to some version of a prospective random sampling framework to derive a methodology, and to then show that this methodology is valid in the case-control study. Recall that n0 is the number of controls in the sample and n1 the number of cases. Define .
I. In Section 2.3.3 of [9], Chen et al. treat the case-control study as if it were a random sample from the source population but with data missing at random. They propose a prospective sampling scenario where each subject from the source population is observed with probability , where d = 1 for cases and d = 0 for controls, respectively. They show that performing a missing data analysis for the distribution of (D, G) given X and the probability that the subject is observed yields the same pseudolikelihood as other papers have computed, but without having to do the Lagrange argument, and in a much easier way.
II. Ma [20] takes an entirely different approach, also without having to do the Lagrange argument. This approach, which she calls a hypothetical population approach, differs from that of Chen et al. [9] in that she aims to create a likelihood that (a) is equivalent to that of the case-control sample; and (b) is that of a simple random sample of size n = n0 + n1 from a hypothetical population. Because this is a random sample, rather than a sample with missing data, it allows us to rely on the classic machinery of semiparametric methods as exemplified by Bickel et al. [4] and Tsiatis [27].
2.2. Basic calculations and likelihood
Assume that the prospective risk given the covariates (G, X) follows a logistic model, viz.
(1) |
where and m is a function known up to the parameter β. Here and throughout the text, the superscript “true” is used to emphasize that those quantities are related to the true source population. In addition, in the true population, G and X are assumed to be independent so that the joint probability density/mass function of G, X can be written as . Here, for notational simplicity, we write as . The problem stated above is identifiable in the case-control study under mild conditions, which are given in Appendix A.1, along with the proof of identifiability.
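For reference, a logistic risk model of the kind described here can be written as below. The intercept symbol α follows its use in Section 5 and Appendix A.1, and m(g, x, β) is the function known up to β; the exact layout of the original display (1) is an assumption.

```latex
\mathrm{pr}^{\mathrm{true}}(D = 1 \mid G = g, X = x)
  = \frac{\exp\{\alpha + m(g, x, \beta)\}}{1 + \exp\{\alpha + m(g, x, \beta)\}}
```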
The hypothetical population study joint density/mass function of (D, G, X) is
(2) |
where
(3) |
We consider η = {η1, η2} as the infinite-dimensional nuisance parameter. The approach of Ma [20] views this as a semiparametric problem, to be solved using techniques explained in Bickel et al. [4] and Tsiatis [27]. Here, the concept of hypothetical population and the corresponding distorted likelihood is used as a vehicle to allow us to transport the semiparametric tools for direct application. It enables us to construct consistent estimators without having to be concerned about the non-random sampling issue in case-control studies. Because the non-random sampling issue is already taken into account when we formulate the distorted likelihood, the resulting estimator is automatically consistent under the original case-control sampling framework; that is, if the case-control sample size grows to infinity while retaining the relative sample proportion n1/n0, the estimator will converge to the true parameter value. We formally write out this result in Theorem 1.
Theorem 1. Assume (d1, g1, x1), …, (dn, gn, xn) is a case-control sample with n1 cases, n0 controls, and with disease model (1) and independence of X and G. Assume is a random sample of independent and identically distributed observations with size n from model (2). Then, if is a √n-consistent regular asymptotically linear estimator of θ and satisfies , then so is .
Theorem 1 essentially says that if we can develop a √n-consistent estimator based on a random sample from model (2), then we can simply apply this estimation procedure to the case-control sample and we will still get a √n-consistent estimator. The proof of Theorem 1 is the entire content of Section 2 of Ma [20]. We take advantage of this property to generate an estimation procedure, which we will then show consistently estimates the parameters when using the case-control data. In particular, the procedure is not dependent on the hypothetical population study formalism.
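One way to make the hypothetical population concrete is to tabulate its joint distribution when G and X are both discrete. The sketch below assumes the construction described in Section 1: within cases and within controls the distribution of (G, X) is that of the true population, while the case fraction is fixed at n1/n. The function name, the marginal distributions and the parameter values are hypothetical.

```python
import numpy as np


def hypothetical_joint(p_g, p_x, risk, n1, n0):
    """Joint pmf f[d, g, x] of (D, G, X) with case fraction n1 / (n0 + n1).

    p_g, p_x : marginal pmfs of G and X in the true population,
               where G and X are independent.
    risk     : risk[g, x] = pr(D = 1 | G = g, X = x) in the true population.
    """
    p_g, p_x, risk = (np.asarray(a, dtype=float) for a in (p_g, p_x, risk))
    joint_gx = np.outer(p_g, p_x)         # independence of G and X
    pi1 = float(np.sum(joint_gx * risk))  # true disease rate
    n = n0 + n1
    f = np.empty((2,) + risk.shape)
    # Conditional law of (G, X) given D is unchanged; only pr(D = d) is reset.
    f[1] = (n1 / n) * joint_gx * risk / pi1
    f[0] = (n0 / n) * joint_gx * (1.0 - risk) / (1.0 - pi1)
    return f


g_vals = np.array([0.0, 1.0])
x_vals = np.array([0.0, 1.0, 2.0])
lin = (-3.0 + 0.76 * g_vals[:, None] + 0.36 * x_vals[None, :]
       - 0.63 * g_vals[:, None] * x_vals[None, :])
risk = 1.0 / (1.0 + np.exp(-lin))
f = hypothetical_joint([0.7, 0.3], [0.5, 0.3, 0.2], risk, n1=1000, n0=1000)
```

In this toy table the entries sum to one and the marginal probability of D = 1 equals n1/n, regardless of the true disease rate.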
3. Analytic derivations: Efficient score and algorithm
The outline of the semiparametric approach is to first construct a Hilbert space , consisting of all measurable functions with mean zero and finite variance. We next decompose into nuisance tangent space and its orthogonal complement . The efficient estimator can then be obtained by solving
where Seff is the projection of the score function Sθ onto , and thus Seff is called the efficient score function.
Careful calculation shows that the score function under the hypothetical population (2) takes the form where and . Let p denote the dimension of θ. The final forms of the spaces Λ and Λ⊥ are listed below, with the detailed derivation provided in Appendix A.2. Specifically,
Define and . Projecting the score function onto Λ⊥ shows that
where
(4) |
(5) |
It is easy to check that .
In order to obtain the efficient score function, we need to solve a and b from the integral equations (4) and (5). The existence of the solution is automatically guaranteed by the identifiability of the problem, whereas the uniqueness is not. However, it is shown in Appendix A.3 that a and b are unique up to constant shifts. Thus, (4) and (5) have a unique solution under the constraints . It is further proved in Appendix A.4 that, under the mean zero constraint, (4) and (5) have an equivalent expression, which is given by Eqs. (A.1)–(A.3), in the Appendix. Such an equivalent expression allows us to separate a and b by introducing an intermediate variable . However, there is no explicit expression for a and b. We still need to solve the integral equation (A.1). In Appendix A.5, we propose an approximation to its solution in the spirit of Tsiatis and Ma [28], by discretizing X if X is continuous.
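The integral equations (4) and (5) are specific to this model, but the general idea of solving an integral equation by discretization can be illustrated on a toy problem. The sketch below solves a generic Fredholm equation of the second kind, a(x) − ∫ K(x, y) a(y) dy = f(x), on a grid with trapezoidal weights. The kernel, the right-hand side and the function name are illustrative assumptions; they are not the equations of this paper nor the procedure of Appendix A.5.

```python
import numpy as np


def solve_fredholm(kernel, rhs, grid):
    """Solve a(x) - int K(x, y) a(y) dy = f(x) on a grid (Nystrom method)."""
    m = len(grid)
    # Trapezoidal quadrature weights, valid for uneven grids as well.
    w = np.zeros(m)
    steps = np.diff(grid)
    w[:-1] += steps / 2.0
    w[1:] += steps / 2.0
    k_mat = kernel(grid[:, None], grid[None, :])
    # Discretized equation: (I - K W) a = f, a linear system in a(grid).
    system = np.eye(m) - k_mat * w[None, :]
    return np.linalg.solve(system, rhs(grid))


grid = np.linspace(0.0, 1.0, 201)
a_hat = solve_fredholm(lambda x, y: x * y, lambda x: x, grid)
# For K(x, y) = x y and f(x) = x on [0, 1], the exact solution is a(x) = 1.5 x.
exact = 1.5 * grid
```

Once the unknown function is replaced by its values on a grid, the integral equation becomes a finite linear system; a constraint such as mean zero can then be imposed as an extra linear condition.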
The detailed algorithm for constructing the efficient score function and computing the efficient estimator for θ is given in Algorithm 1, where the disease rate is estimated during the procedure. Usually, the disease prevalence is not identifiable from a case-control sample [24]. However, the additional assumption we make on the relationship between G and X in the source population, i.e., gene–environment independence, leads to the technical identifiability [5,20].
Algorithm 1
Step 1. Estimate the conditional density/mass function of X given disease status D = d by nonparametric kernel density estimation among the data with Di = d, for d ∈ {0, 1}:
for continuous X, and
for discrete X, where K is a univariate kernel function.
Step 2. Estimate the conditional density/mass function of G given disease status D = d by nonparametric kernel density estimation among the data with Di = d, for d ∈ {0, 1}, in the same way as for X. Denote the result by .
Step 3. Define , , what we call a weighted nonparametric density/mass function estimate, being weighted by the (estimated) population probabilities.
Step 4. When (π0, π1) is unknown, estimate them by solving the integral equation
and setting
Step 5. Follow the method described in Appendix A.5 to obtain the solution of the integral equations (4) and (5), with result , and approximate using nonparametric density estimates and with result .
Step 6. From , and estimate θ by solving the estimating equation
(6) |
It is critical that we estimate and involved in Steps 5 and 6 using and described in Steps 1 and 2 of the above algorithm, instead of simply taking a sample version of the expectations. This ensures that all the conditional expectations are computed using the same kind of approximation and the gene–environment independence assumption is fully employed.
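The density-estimation part of Algorithm 1 (kernel estimates within cases and within controls, followed by a combination weighted by the population probabilities) can be sketched as follows for a continuous covariate. The Epanechnikov kernel is chosen here only because it has support (−1, 1); the function names, the bandwidth, the simulated data and the value of π1 are hypothetical, the combination π0 f(x | D = 0) + π1 f(x | D = 1) is our reading of the weighted estimate, and the code is not the authors' implementation.

```python
import numpy as np


def epanechnikov(u):
    """Kernel with support (-1, 1) that integrates to one."""
    return np.where(np.abs(u) < 1.0, 0.75 * (1.0 - u ** 2), 0.0)


def conditional_kde(x_eval, x, d, status, h):
    """Kernel estimate of the density of X given D = status, at x_eval."""
    xs = x[d == status]
    u = (x_eval[:, None] - xs[None, :]) / h
    return epanechnikov(u).mean(axis=1) / h


def weighted_density(x_eval, x, d, pi1, h):
    """Assumed weighted estimate: pi0 * f(x | D=0) + pi1 * f(x | D=1)."""
    f0 = conditional_kde(x_eval, x, d, 0, h)
    f1 = conditional_kde(x_eval, x, d, 1, h)
    return (1.0 - pi1) * f0 + pi1 * f1


rng = np.random.default_rng(0)
d = np.repeat([0, 1], 500)             # 500 controls, then 500 cases
x = rng.normal(loc=0.3 * d, scale=1.0)  # toy covariate, shifted in cases
grid = np.linspace(-6.0, 6.0, 1201)
dens = weighted_density(grid, x, d, pi1=0.045, h=0.4)
mass = float(np.sum(dens) * (grid[1] - grid[0]))  # numerical integral
```

Because the weights sum to one, the combined estimate is itself a density, and it targets the covariate distribution in the source population rather than in the case-control sample.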
4. Distribution theory
It is not surprising that the semiparametric estimator described in Algorithm 1 is asymptotically normal with a parametric convergence rate and optimal efficiency as it is formed by estimating all conditional expectations in the efficient score nonparametrically. The asymptotic properties of our estimator are described in Theorem 2 under regularity conditions C1–C2 listed below. The proof is provided in the Online Supplement.
C1 The univariate kernel function K has support (–1, 1) and satisfies , . The bandwidth h satisfies and .
C2 Any discrete covariate has finitely many levels. Any continuous covariate has compact support and its density function is twice continuously differentiable.
Theorem 2. Under the regularity conditions C1 and C2, the estimator obtained from solving the estimating Eq. (6) is asymptotically normal with optimal efficiency, i.e., , and is semiparametric efficient.
5. Simulation study
We performed simulations to understand the finite sample performance of the semiparametric efficient estimator described in Section 3 and demonstrate its superiority to prospective logistic regression method under the gene–environment independent model. Two scenarios are considered: (a) and (b) , corresponding to cases with a relatively rare disease rate and a common disease rate, respectively. In each scenario, we generated X from the standard normal distribution or the Gamma distribution with mean 20 and variance 20, , while the distribution of G is one of the following: (i) Bernoulli with success probability , where for example G = 1 or G = 0 corresponds to the presence or absence of a genetic mutation, and (ii) , which can be used to model gene expression levels or continuous traits, such as height and skin color, that are controlled by several genes. Given G and X, we generated disease status D from the logistic regression model
where for both settings with normal X, and for both settings with Gamma X. We varied the intercept α in different simulations to get the desired disease rate. Specifically, in the case of , we set α = −3.61 and −3.465 for binary G and normal G, respectively, to achieve a disease rate of 4.5%, and we set α = −2.74 and −2.538 for binary G and normal G, respectively, to achieve a disease rate of 10%. In the case of , we set α = −5.220 and −5.086 for binary G and normal G, respectively, to achieve a disease rate of 4.5%, and we set α = −4.352 and −4.158 for binary G and normal G, respectively, to achieve a disease rate of 10%. For each setting, we simulated 1000 data sets, each with n1 = 1000 cases and n0 = 1000 controls. The details of simulating the case-control data are provided in the Online Supplement. In the computation of the weighted nonparametric density/mass function estimates defined in Algorithm 1, we used the asymptotically justified bandwidth , where , and the results were insensitive to the choice of c.
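The authors' simulation details are in the Online Supplement and are not reproduced here. As a rough illustration only, one common way to generate case-control data is to draw subjects from the source population in batches and keep the first n1 cases and n0 controls. In the sketch below the function name, the batch size and the Bernoulli probability for G are hypothetical; the coefficient values are used purely as an example.

```python
import numpy as np


def simulate_case_control(n1, n0, alpha, beta, p_g, rng, batch=50_000):
    """Draw n1 cases and n0 controls from a logistic source population."""
    cases, controls = [], []
    n_case = n_ctrl = 0
    while n_case < n1 or n_ctrl < n0:
        g = rng.binomial(1, p_g, size=batch).astype(float)  # binary G
        x = rng.standard_normal(batch)                       # X independent of G
        lin = alpha + beta[0] * g + beta[1] * x + beta[2] * g * x
        d = rng.random(batch) < 1.0 / (1.0 + np.exp(-lin))
        cases.append(np.column_stack([g[d], x[d]]))
        controls.append(np.column_stack([g[~d], x[~d]]))
        n_case += int(d.sum())
        n_ctrl += int((~d).sum())
    # Subjects are exchangeable, so the first n1 / n0 are random samples.
    gx = np.vstack([np.vstack(cases)[:n1], np.vstack(controls)[:n0]])
    d_out = np.concatenate([np.ones(n1), np.zeros(n0)])
    return d_out, gx[:, 0], gx[:, 1]


rng = np.random.default_rng(1)
d, g, x = simulate_case_control(
    1000, 1000, alpha=-3.61, beta=(0.76, 0.36, -0.63), p_g=0.3, rng=rng
)
```

For a Gamma-distributed X with mean 20 and variance 20, the corresponding shape and scale are 20 and 1, so `rng.gamma(20.0, 1.0, size=batch)` could replace the normal draw in this sketch.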
The results are summarized in Tables 1–4. For 4.5% disease prevalence and normally distributed X (Table 1), it is clear that prospective logistic regression and our semiparametric efficient estimator are both consistent, while the semiparametric estimator has smaller variance. Specifically, the semiparametric efficient estimator has a mean squared error efficiency gain as large as 57% (the interaction term between G and X) for binary G, and 46% (the interaction term between G and X) for normal G. For 4.5% disease prevalence and Gamma X (Table 3), when G follows a Bernoulli distribution, our semiparametric efficient estimator has a mean squared error efficiency gain between 31% (the main effect of X) and 56% (the interaction term between G and X); when G is normal, the corresponding efficiency gain of the interaction term is 44%.
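The "MSE Eff" rows in the tables can be read as ratios of mean squared errors, logistic over semiparametric; this reading is an assumption, since the quantity is not defined in the text. Under it, the ratio can be approximately recomputed from the rounded means and standard errors reported in Table 1, as in the sketch below (the helper name is hypothetical). Small differences from the tabulated 1.566 are expected because the inputs are rounded to three decimals.

```python
def mse(mean_est, se, truth):
    """Mean squared error as squared bias plus variance."""
    return (mean_est - truth) ** 2 + se ** 2


# Interaction coefficient, binary G and normal X (Table 1); true value -0.63.
mse_logistic = mse(-0.635, 0.103, -0.63)
mse_semi = mse(-0.630, 0.082, -0.63)
mse_eff = mse_logistic / mse_semi
gain_percent = 100.0 * (mse_eff - 1.0)
```

A ratio above one means the semiparametric estimator has the smaller mean squared error, and the quoted percentage gains correspond to the ratio minus one.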
Table 1.
Method | Statistic | Binary G, Normal X: βG | Binary G, Normal X: βX | Binary G, Normal X: βGX | Normal G, Normal X: βG | Normal G, Normal X: βX | Normal G, Normal X: βGX
---|---|---|---|---|---|---|---
β | True value | 0.76 | 0.36 | −0.63 | 0.76 | 0.36 | −0.63
Logistic | Mean | 0.761 | 0.363 | −0.635 | 0.762 | 0.363 | −0.634
Logistic | se | 0.101 | 0.088 | 0.103 | 0.055 | 0.053 | 0.056
Logistic | est se | 0.101 | 0.084 | 0.101 | 0.056 | 0.054 | 0.055
Logistic | 95% | 0.952 | 0.939 | 0.942 | 0.950 | 0.954 | 0.942
Semi | Mean | 0.761 | 0.360 | −0.630 | 0.761 | 0.362 | −0.627
Semi | se | 0.101 | 0.077 | 0.082 | 0.054 | 0.051 | 0.046
Semi | est se | 0.100 | 0.073 | 0.079 | 0.053 | 0.051 | 0.041
Semi | 95% | 0.953 | 0.939 | 0.941 | 0.949 | 0.953 | 0.921
MSE Eff | | 1.003 | 1.325 | 1.566 | 1.068 | 1.112 | 1.457
Table 4.
Method | Statistic | Binary G, Gamma X: βG | Binary G, Gamma X: βX | Binary G, Gamma X: βGX | Normal G, Gamma X: βG | Normal G, Gamma X: βX | Normal G, Gamma X: βGX
---|---|---|---|---|---|---|---
β | True value | 3.577 | 0.080 | −0.141 | 3.577 | 0.080 | −0.141
Logistic | Mean | 3.589 | 0.081 | −0.141 | 3.600 | 0.081 | −0.142
Logistic | se | 0.459 | 0.018 | 0.022 | 0.274 | 0.012 | 0.013
Logistic | est se | 0.460 | 0.018 | 0.022 | 0.269 | 0.012 | 0.012
Logistic | 95% | 0.949 | 0.950 | 0.947 | 0.950 | 0.934 | 0.944
Semi | Mean | 3.565 | 0.080 | −0.140 | 3.590 | 0.081 | −0.142
Semi | se | 0.394 | 0.016 | 0.019 | 0.268 | 0.012 | 0.012
Semi | est se | 0.381 | 0.016 | 0.018 | 0.247 | 0.011 | 0.011
Semi | 95% | 0.945 | 0.953 | 0.938 | 0.934 | 0.937 | 0.930
MSE Eff | | 1.360 | 1.240 | 1.406 | 1.048 | 1.031 | 1.061
Table 3.
Method | Statistic | Binary G, Gamma X: βG | Binary G, Gamma X: βX | Binary G, Gamma X: βGX | Normal G, Gamma X: βG | Normal G, Gamma X: βX | Normal G, Gamma X: βGX
---|---|---|---|---|---|---|---
β | True value | 3.577 | 0.080 | −0.141 | 3.577 | 0.080 | −0.141
Logistic | Mean | 3.599 | 0.081 | −0.142 | 3.592 | 0.080 | −0.141
Logistic | se | 0.456 | 0.018 | 0.022 | 0.269 | 0.012 | 0.012
Logistic | est se | 0.462 | 0.018 | 0.022 | 0.259 | 0.012 | 0.012
Logistic | 95% | 0.957 | 0.953 | 0.949 | 0.937 | 0.950 | 0.942
Semi | Mean | 3.586 | 0.080 | −0.141 | 3.569 | 0.080 | −0.140
Semi | se | 0.375 | 0.016 | 0.018 | 0.230 | 0.011 | 0.010
Semi | est se | 0.369 | 0.016 | 0.017 | 0.202 | 0.011 | 0.009
Semi | 95% | 0.950 | 0.949 | 0.942 | 0.914 | 0.940 | 0.919
MSE Eff | | 1.484 | 1.305 | 1.559 | 1.372 | 1.059 | 1.437
The results for the 10% disease rate case (Tables 2 and 4) are similar. Both approaches are asymptotically valid, with our approach being superior to prospective logistic regression in the sense that our semiparametric efficient estimator has smaller mean squared error.
Table 2.
Method | Statistic | Binary G, Normal X: βG | Binary G, Normal X: βX | Binary G, Normal X: βGX | Normal G, Normal X: βG | Normal G, Normal X: βX | Normal G, Normal X: βGX
---|---|---|---|---|---|---|---
β | True value | 0.76 | 0.36 | −0.63 | 0.76 | 0.36 | −0.63
Logistic | Mean | 0.762 | 0.363 | −0.638 | 0.762 | 0.363 | −0.633
Logistic | se | 0.102 | 0.084 | 0.100 | 0.056 | 0.051 | 0.057
Logistic | est se | 0.100 | 0.083 | 0.100 | 0.056 | 0.053 | 0.057
Logistic | 95% | 0.943 | 0.952 | 0.955 | 0.957 | 0.960 | 0.952
Semi | Mean | 0.762 | 0.359 | −0.628 | 0.761 | 0.363 | −0.629
Semi | se | 0.102 | 0.077 | 0.087 | 0.055 | 0.050 | 0.053
Semi | est se | 0.100 | 0.074 | 0.081 | 0.055 | 0.052 | 0.050
Semi | 95% | 0.944 | 0.932 | 0.936 | 0.953 | 0.960 | 0.934
MSE Eff | | 1.004 | 1.180 | 1.325 | 1.032 | 1.065 | 1.145
6. Example
Prostate cancer is a heterogeneous disease resulting from the complex interplay of genetic susceptibility and environmental exposures. It is the second leading cause of cancer death among men in the USA [1]. Prostate cells (both primary and cancer cells) have been shown to have 1α-OHase activity; 1α-OHase is the enzyme responsible for converting [25(OH)D], the major circulating form of vitamin D that reflects both dietary and sunlight exposures, into 1,25-dihydroxy-vitamin D [1,25(OH)2D], the most active form of this vitamin, which can induce cell-cycle regulation, apoptosis and differentiation in prostate cancer cells via the vitamin D receptor (VDR). Thus, (a) [25(OH)D] is hypothesized to have an anticancer effect, and (b) an important question is whether its relationship with the risk of developing prostate cancer is modified by genetic polymorphisms in the VDR gene.
In this section, we implemented our methodology in a case-control study of prostate cancer, using the same data set analyzed, in a different context, by Chen et al. [9]; see that reference for details about the study. Specifically, our analysis is based on a polygenic risk score, a single risk factor incorporating information from susceptibility SNPs, whereas Chen et al. [9] focused on haplotypes. The data consist of n1 = 690 cases and n0 = 717 controls randomly selected from the screening arm of a large population-based cohort study, the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO) at the National Cancer Institute. The PLCO cohort study recruited a total of 76,685 men aged 55–74 at 10 screening centers between November 1993 and July 2001, then randomly assigned 38,340 of them to the screening arm and the rest to the non-screening arm. In a 10-year follow-up period, in the study population, the cumulative incidence rate for prostate cancer in the screening arm was 108.4 per 10,000 person-years [3]. Apart from case-control status, [25(OH)D] level (nmol/L) and genotype data on 19 single-nucleotide polymorphisms (SNPs) are available for each subject involved in the case-control study. According to Chen et al. [9], these polymorphisms, our G, are unlikely to affect the [25(OH)D] level, our X, as the VDR gene plays a “downstream” role in the vitamin-D pathway. In other words, the gene–environment independence assumption is likely to be valid in this application. Detailed information about the design can be found in Andriole et al. [3], Hayes et al. [16], and Prorok et al. [25].
One difficulty in investigating the genetic modification of the VDR gene to [25(OH)D] on the risk of prostate cancer is that the VDR gene contains multiple underlying susceptibility SNPs, where each individual SNP may only confer a small component of overall risk. In fact, running a logistic regression of case-control status on each of the 19 SNPs shows only three SNPs have p-values ≤ 0.10. Recently, it has been recognized that the polygenic risk score has the potential of improving risk prediction for some common diseases [2,7,11–13,26]. Therefore, we created a polygenic risk score for the prostate cancer data by weighting those 19 SNPs, where the weights are the effect sizes of separate logistic regressions applied to each SNP.
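The construction of the polygenic risk score described above, a weighted sum of SNPs with weights taken from separate single-SNP logistic regressions, can be sketched as follows. The data are simulated and the function names, allele frequency and effect sizes are hypothetical; the small Newton–Raphson fit stands in for whatever logistic regression software was actually used.

```python
import numpy as np


def logistic_slope(y, z, n_iter=25):
    """Slope from a logistic regression of y on one covariate z (Newton-Raphson)."""
    design = np.column_stack([np.ones_like(z), z])
    coef = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ coef))
        grad = design.T @ (y - p)                               # score vector
        hess = design.T @ (design * (p * (1.0 - p))[:, None])   # information
        coef = coef + np.linalg.solve(hess, grad)
    return coef[1]


def polygenic_risk_score(y, snps):
    """Weighted sum of SNPs; weights are the per-SNP logistic slopes."""
    weights = np.array(
        [logistic_slope(y, snps[:, j]) for j in range(snps.shape[1])]
    )
    return snps @ weights, weights


rng = np.random.default_rng(2)
n, k = 1400, 19
snps = rng.binomial(2, 0.3, size=(n, k)).astype(float)  # toy genotypes 0/1/2
true_w = np.linspace(-0.3, 0.3, k)                      # toy effect sizes
prob = 1.0 / (1.0 + np.exp(-(snps - 0.6) @ true_w))
y = (rng.random(n) < prob).astype(float)
score, weights = polygenic_risk_score(y, snps)
```

The resulting score is a single covariate per subject, which then plays the role of G in the analysis.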
The results of prospective logistic regression and our semiparametric approach based on 1000 bootstrap samples are given in Table 5. The two sets of estimates are fairly consistent, as expected. However, our semiparametric efficient estimator has smaller standard errors than does the prospective logistic regression, in accordance with theory and our simulations. This leads to a substantial difference in inference for the interaction between the polygenic risk score and the [25(OH)D] level. Specifically, both prospective logistic regression and our semiparametric efficient method show that the main effects of both the polygenic risk score and the [25(OH)D] level are statistically significant and positive. That is, ignoring the interaction, men with higher polygenic risk scores and/or higher [25(OH)D] levels tend to have higher risk of developing prostate cancer.
Table 5.
Method | Statistic | βG | βX | βGX
---|---|---|---|---
Logistic | Estimates | 0.169 | 0.123 | −0.101
Logistic | se, bootstrap | 0.056 | 0.056 | 0.054
Logistic | est se, asymptotic | 0.055 | 0.055 | 0.055
Logistic | p-value, bootstrap | 0.002 | 0.028 | 0.064
Logistic | p-value, asymptotic | 0.002 | 0.024 | 0.066
Semi | Estimates | 0.168 | 0.124 | −0.110
Semi | se, bootstrap | 0.056 | 0.056 | 0.049
Semi | est se, asymptotic | 0.055 | 0.054 | 0.042
Semi | p-value, bootstrap | 0.003 | 0.027 | 0.026
Semi | p-value, asymptotic | 0.002 | 0.021 | 0.009
Importantly, the estimate of the interaction parameter from the prospective logistic regression is not significant at the 5% level. However, our approach shows significant evidence of interaction, i.e., the effects of [25(OH)D] level on prostate cancer risk differ depending on the polygenic risk score.
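The asymptotic p-values for the interaction in Table 5 are consistent with two-sided Wald tests based on the reported estimates and asymptotic standard errors; that the authors used exactly this test is an assumption. A minimal sketch of the calculation, with a hypothetical helper name:

```python
import math


def wald_p_value(estimate, se):
    """Two-sided p-value from a normal approximation to estimate / se."""
    z = abs(estimate / se)
    # erfc(z / sqrt(2)) equals 2 * (1 - Phi(z)) for the standard normal cdf Phi.
    return math.erfc(z / math.sqrt(2.0))


# Interaction estimates and asymptotic standard errors from Table 5.
p_logistic = wald_p_value(-0.101, 0.055)
p_semi = wald_p_value(-0.110, 0.042)
```

The two estimates are close to each other, so the difference in significance comes almost entirely from the smaller standard error of the semiparametric estimator.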
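In addition, our approach provides an estimated disease rate in the population of 10.6%, whereas the cumulative incidence reported for the PLCO screening arm, 108.4 per 10,000 person-years over the 10-year follow-up, corresponds to roughly 10.8%. This close agreement supports our methodology and suggests an additional use for it, namely estimating the population disease rate from case-control data.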
7. Discussion
We have developed a semiparametric efficient estimator in case-control studies for the gene–environment independent model, where the distributions of genetic susceptibility and environmental exposure are allowed to be arbitrary and the disease rate is assumed completely unknown. We showed that in spite of these weak assumptions, the problem is identifiable in most cases. The proposed estimator is derived under the so-called hypothetical population framework, which enables us to view the case-control sample as a random sample from a hypothetical distribution and thus facilitates the application of a conventional semiparametric approach. Such an estimator is semiparametric efficient and its superiority over the prospective logistic regression was demonstrated in various simulations. The general methodology of our approach can be extended to parametric models other than the logistic model, such as the probit model, and it can be used to consider assumptions other than gene–environment independence, such as Hardy–Weinberg equilibrium, as long as the resulting model is identifiable.
The method hinges on the assumption of gene–environment independence. When gene and environment are in fact dependent, blindly applying this method will not lead to a consistent estimator. It is possible to further apply the empirical Bayes shrinkage method of Chen et al. [9] to improve robustness to the model assumptions. This method effectively uses our method when the assumption holds, and effectively uses logistic regression when the model assumption fails.
To handle the nuisance parameters in the estimation procedure, nonparametric density/mass function estimation is used. When the dimensions of genetic susceptibility or environmental exposures increase, such nonparametric estimation suffers from the curse of dimensionality. In such cases, dimension reduction techniques might be needed to maintain model flexibility as well as ensure computational feasibility. This will be pursued in future work.
Acknowledgments
Ma’s research was partially supported by the National Science Foundation, USA (DMS-1608540). Carroll and Liang’s research was supported by a grant from the National Cancer Institute, USA (U01-CA057030). We thank Nilanjan Chatterjee and Alex Asher for many helpful comments.
Appendix. Sketch of technical arguments
A.1. Identifiability
A1 There exists cx so that when , or for any g.
A2 There exists g1 and x1, x2 such that .
A3 There exists cg so that when , or for any x.
A4 There exists x1 and g1, g2 such that .
Proposition 1. The problem stated in (2) is identifiable
(i) If condition A1 holds, and at least one of the conditions A3 and A4 holds;
(ii) or if at least one of the conditions A1 and A2 holds, and condition A3 holds.
Remark 1. In practice, a widely used model is the one including main effects and two-way interaction, i.e., . It can be easily verified that if g and x both have support on ℝ, then this model satisfies conditions A1 and A3 described above and hence is identifiable.
Remark 2. Proposition 1 applies in the case where at most one of G and X is discrete. In the case where both G and X are discrete with levels ℓG and ℓX respectively, identifiability requires as a necessary condition. Additional conditions may be needed. Although sufficient conditions for identifiability can be easy to derive for a specific model with known ℓG and ℓX, such a result is difficult to describe in general.
Proof of Proposition 1. From [24], β is identifiable. Thus, we aim at establishing the identifiability of η1, η2 and α. We first prove the result under A1 and A3. Assume there are α, η1, η2 and so that
This yields
Taking the ratio of the above two and solving, we obtain . This leads to
Under condition A1, letting , we obtain . Similarly, under condition A3, letting , we obtain . This in turn leads to . Finally, these results lead to .
We now prove the result under A1 and A4. Under condition A1 alone, the same derivation as before leads to
Thus A4 further implies
or equivalently, . Hence, for d = 0, 1. As a result, α∗ = α and .
The result under A2 and A3 is symmetric to the one under A1 and A4 and hence is omitted. □
The requirements in A1 and A3 are appropriate in the case where G and X are both continuous. The requirements in A1 and A4 are suitable in the case where G is discrete and X is continuous. The requirements in A2 and A3 are suitable in the case where X is discrete and G is continuous.
A.2. Nuisance tangent space Λ and its orthogonal complement
The nuisance tangent space Λ is computed in two steps. First, we replace the nuisance parameter with a finite-dimensional parameter, say , and take the derivative of ln with respect to γ to get . Second, we find the mean squared closure that contains all such Sγ, which is Λ.
For any finite-dimensional parameter , we have , where
It is easy to show the nuisance tangent spaces associated with η1 and η2 are respectively
Then
Define . Now consider . Then for any ,
Hence, almost surely. Besides, need to be a subspace of the Hilbert space , hence . Thus, we have shown . Furthermore, for any ,
hence . Thus, we have obtained . Similarly, we can prove
Hence,
A.3. Uniqueness of a and b up to constants
To prove that a and b defined in Eqs. (4) and (5) are unique up to constant shifts, we consider the following. If there exist a1, a2, b1, b2 such that
then
The left-hand side is a function of g while the right-hand side is a function of x and d. Hence is a constant. Similarly, is also a constant. □
A.4. Equivalent expression of Eqs. (4) and (5) and the proof under the condition
We claim under the mean zero constraint , (4) and (5) are equivalent to (A.1)–(A.3), below, namely
(A.1) |
(A.2) |
(A.3) |
where , .
Proof. Suppose a and b are the solution of Eqs. (4) and (5). Let , . Then (A.3) automatically holds. It is easy to verify that . Hence (4) and (5) become
Further write
Then
(A.4) |
(A.5) |
Note that (A.4) above is exactly (A.2) defined in Section 3. Taking conditional expectation of (A.4) given G = g, we obtain
Subtracting the above from (A.5), we obtain (A.1), namely
From the above derivation, it is clear that any mean zero functions a(g), b(x) that solve (4) and (5) also satisfy (A.1)–(A.3). We now prove the other way around, that is any mean zero functions a(g), b(x) that satisfy (A.1)–(A.3) also satisfy (4) and (5).
Taking the expectation of (A.2) conditionally on G = g and adding the resulting equation to (A.1), we obtain exactly (A.5).
Hence Eqs. (A.1) and (A.2) lead to Eqs. (A.2) and (A.5).
For preparation, note also that . Hence under (A.3) and the condition , we can further write
Similarly, . From (A.2), we obtain
which is exactly (4). Similarly, from (A.5), we obtain (5). □
Eq. (A.1) allows us to solve for a(g) as a function of u0 and other known quantities, say , where Fa is a function that solves (A.1) which does not need to have mean 0. Then we can solve b from (A.2) as a function of u0 to obtain
Now
which allows us to solve for u0. Having obtained u0, we can then solve for all other quantities easily. Unfortunately, the integral equation (A.1) does not have an explicit solution. We propose an approximation to its solution in the spirit of Tsiatis and Ma [28], which is provided in Appendix A.5, by discretizing X if X is continuous.
The efficient score Seff, and especially the procedure for solving for a and b, involves several expectations conditional on D, G, or X. To estimate these conditional expectations, we need density estimators of the nuisance parameter η = (η1, η2).
If the disease rate π1 or the non-disease rate π0 = 1 − π1 is known, then η can be approximated by
where and are the nonparametric estimators of the conditional density/mass function and respectively for d ∈ {0, 1}. Of course, in practice, π0 is typically unknown. However, we can get an estimate of π0 through (3).
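As an illustration of this step, the sketch below mixes group-specific nonparametric estimates with the weights π0 and π1, which is one natural form implied by the law of total probability. The function names, the Gaussian kernel, and the fixed bandwidth are assumptions made for this example and are not taken from the text above.

```python
import numpy as np

def population_mass(values, d, levels, pi1):
    """Estimate the population mass function of a discrete variable.

    Mixes the empirical mass functions among controls (d == 0) and
    cases (d == 1) with weights pi0 = 1 - pi1 and pi1.
    """
    values, d = np.asarray(values), np.asarray(d)
    pi0 = 1.0 - pi1
    mass = np.zeros(len(levels))
    for k, v in enumerate(levels):
        p0 = np.mean(values[d == 0] == v)  # empirical P(V = v | D = 0)
        p1 = np.mean(values[d == 1] == v)  # empirical P(V = v | D = 1)
        mass[k] = pi0 * p0 + pi1 * p1
    return mass

def population_density(points, values, d, pi1, bandwidth):
    """Continuous analogue using Gaussian kernel density estimates."""
    points = np.asarray(points, dtype=float)
    values = np.asarray(values, dtype=float)
    d = np.asarray(d)

    def kde(sample):
        # Gaussian kernel density estimate evaluated at `points`.
        z = (points[:, None] - sample[None, :]) / bandwidth
        norm = len(sample) * bandwidth * np.sqrt(2.0 * np.pi)
        return np.exp(-0.5 * z ** 2).sum(axis=1) / norm

    return (1.0 - pi1) * kde(values[d == 0]) + pi1 * kde(values[d == 1])
```

When π1 is unknown, an estimate of it would be plugged in for `pi1`; the mixing step itself is unchanged.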
A.5. Solving the integral equation (A.1)
Define . An equivalent expression of (A.1) is
(A.6) |
For fixed u0, all the quantities in Z are known or have explicit form except E(S | D). With the weighted kernel density , estimated non-disease rate and disease rate , we can estimate it by
A.5.1. Discrete G with finite number of levels
Assume G is discrete with mass at mg points g1, …, gmg. We compute each term in (A.6) under the weighted nonparametric densities
Similarly, we have
(A.7) |
and
(A.8) |
Consequently, the integral equation (A.6) reduces to the linear equations , where A is the (p + 1) × mg matrix {a(g1), …, a(gmg)}, corresponding to the solution of the integral equation, I is an mg × mg identity matrix, B is an mg × mg matrix whose (i, j)th element is given by
and C is a (p + 1) × mg matrix whose kth column is defined in (A.7) and (A.8). After obtaining a, we set
Then we compute , where
A.5.2. Continuous G or discrete G with infinite number of levels
When G is a continuous variable, we discretize it at a finite number of equally distributed points, say, g1 ≤ ··· ≤ gmg with for all i ∈ {1, …, mg − 1}, such that
Similarly, when G is discrete with an infinite number of levels, we simply choose a sufficient number of points from its support so that their overall probability is close to 1. The subsequent procedure is then exactly the same as that described for the case where G is discrete with a finite number of levels.
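The reduction of an integral equation to a finite linear system by evaluating it at grid points can be sketched generically as follows. This is a minimal illustration for a scalar second-kind equation a(g) − ∫ K(g, h) a(h) dh = c(g) on [0, 1]; the kernel K, the right-hand side c, the trapezoidal weights and all function names are assumptions chosen so that the exact solution a(g) = g is known, and they do not reproduce the matrices B and C of this appendix.

```python
import numpy as np

def trapezoid_weights(grid):
    """Trapezoidal quadrature weights for a sorted one-dimensional grid."""
    w = np.zeros_like(grid)
    h = np.diff(grid)
    w[:-1] += h / 2.0
    w[1:] += h / 2.0
    return w

def solve_second_kind(kernel, rhs, grid, weights):
    """Solve a(g) - sum_j w_j K(g, g_j) a(g_j) = c(g) at the grid points."""
    K = kernel(grid[:, None], grid[None, :])  # m x m kernel evaluations
    B = K * weights[None, :]                  # discretized integral operator
    c = rhs(grid)
    # The discretized equation is the linear system (I - B) a = c.
    return np.linalg.solve(np.eye(len(grid)) - B, c)

# Toy problem (hypothetical): K(g, h) = g * h and c(g) = 2g/3,
# for which a(g) = g solves the equation exactly.
grid = np.linspace(0.0, 1.0, 201)
a_hat = solve_second_kind(lambda g, h: g * h,
                          lambda g: 2.0 * g / 3.0,
                          grid, trapezoid_weights(grid))
```

For a vector-valued unknown, the same matrix I − B is reused with one right-hand side per component. A finer grid reduces the discretization error at the cost of a larger linear system.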
Appendix B. Supplementary data
Supplementary material related to this article can be found online at https://doi.org/10.1016/j.jmva.2019.01.006.
Contributor Information
Yanyuan Ma, Email: yzm63@psu.edu.
Raymond J. Carroll, Email: carroll@stat.tamu.edu.
References
- [1] American Cancer Society, Cancer Facts & Figures 2015, American Cancer Society, Atlanta, GA, 2015.
- [2] Aly M, Wiklund F, Xu J, Isaacs WB, Eklund M, D’Amato M, Adolfsson J, Grönberg H, Polygenic risk score improves prostate cancer risk prediction: Results from the Stockholm-1 cohort study, Eur. Urol 60 (2011) 21–28.
- [3] Andriole GL, Crawford ED, Grubb RL, Buys SS, Chia D, Church TR, Fouad MN, Isaacs C, Kvale PA, Reding DJ, Weissfeld JL, Yokochi LA, O’Brien B, Ragard LR, Clapp JD, Rathmell JM, Riley TL, Hsing AW, Izmirlian G, Pinsky PF, Kramer BS, Miller AB, Gohagan JK, Prorok PC, PLCO Project Team, Prostate cancer screening in the randomized prostate, lung, colorectal, and ovarian cancer screening trial: Mortality results after 13 years of follow-up, J. Natl. Cancer Inst 104 (2012) 125–132.
- [4] Bickel PJ, Klaassen CA, Ritov Y, Wellner JA, Efficient and Adaptive Estimation for Semiparametric Models, The Johns Hopkins University Press, Baltimore, MD, 1993.
- [5] Chatterjee N, Carroll RJ, Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies, Biometrika 92 (2005) 399–418.
- [6] Chatterjee N, Chen Y-H, Luo S, Carroll RJ, Analysis of case-control association studies: SNPs, imputation and haplotypes, Statist. Sci 24 (2009) 489–502.
- [7] Chatterjee N, Shi J, García-Closas M, Developing and evaluating polygenic risk prediction models for stratified disease prevention, Nature Rev. Genet 17 (2016) 392–406.
- [8] Chen YH, Chatterjee N, Carroll RJ, Retrospective analysis of haplotype-based case-control studies under a flexible model for gene-environment association, Biostatistics 9 (2008) 81–99.
- [9] Chen YH, Chatterjee N, Carroll RJ, Shrinkage estimators for robust and efficient inference in haplotype-based case-control studies, J. Amer. Statist. Assoc 104 (2009) 220–233.
- [10] Cornfield J, A statistical problem arising from retrospective studies, in: Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, vol. 4, pp. 135–148.
- [11] Dudbridge F, Power and predictive accuracy of polygenic risk scores, PLoS Genet 9 (2013) e1003348.
- [12] Evans DM, Visscher PM, Wray NR, Harnessing the information contained within genome-wide association studies to improve individual prediction of complex disease risk, Hum. Mol. Gen 18 (2009) 3525–3531.
- [13] Fuchsberger C, Flannick J, Teslovich TM, Mahajan A, Agarwala V, Gaulton KJ, Ma C, Fontanillas P, Moutsianas L, McCarthy DJ, et al., The genetic architecture of type 2 diabetes, Nature 536 (2016) 41–47.
- [14] Gauderman WJ, Zhang P, Morrison JL, Lewinger JP, Finding novel genes by testing G × E interactions in a genome-wide association study, Genetic Epidemiol 37 (2013) 603–613.
- [15] Han SS, Rosenberg PS, Ghosh A, Landi MT, Caporaso NE, Chatterjee N, An exposure-weighted score test for genetic associations integrating environmental risk factors, Biometrics 71 (2015) 596–605.
- [16] Hayes RB, Reding D, Kopp W, Subar AF, Bhat N, Rothman N, Caporaso N, Ziegler RG, Johnson CC, Weissfeld JL, Hoover RN, Hartge P, Palace C, Gohagan JK, et al., Etiologic and early marker studies in the Prostate, Lung, Colorectal and Ovarian, PLCO cancer screening trial, Controlled Clin. Trials 21 (2000) 349S–355S.
- [17] Hunter DJ, Gene–environment interactions in human diseases, Nature Rev. Genet 6 (2005) 287–298.
- [18] Jiang Y, Scott AJ, Wild CJ, Secondary analysis of case-control data, Stat. Med 25 (2006) 1323–1339.
- [19] Lin D, Zeng D, Proper analysis of secondary phenotype data in case-control association studies, Genet. Epidemiol 33 (2009) 256–265.
- [20] Ma Y, A semiparametric efficient estimator in case-control studies, Bernoulli 16 (2010) 585–603.
- [21] Murcray CE, Lewinger JP, Gauderman WJ, Gene-environment interaction in genome-wide association studies, Am. J. Epidemiol 169 (2009) 219–226.
- [22] Ottman R, Gene-environment interaction: Definitions and study designs, Prev. Med 25 (1996) 764.
- [23] Piegorsch WW, Weinberg CR, Taylor JA, Non-hierarchical logistic models and case-only designs for assessing susceptibility in population based case-control studies, Stat. Med 13 (1994) 153–162.
- [24] Prentice RL, Pyke R, Logistic disease incidence models and case-control studies, Biometrika 66 (1979) 403–411.
- [25] Prorok PC, Andriole GL, Bresalier RS, Buys SS, Chia D, Crawford ED, Fogel R, Gelmann EP, Gilbert F, Hasson MA, Hayes RB, Johnson CC, Mandel JS, Oberman A, O’Brien B, Oken MM, Rafla S, Reding D, Rutt W, Weissfeld JL, Yokochi L, Gohagan JK, et al., Design of the Prostate, Lung, Colorectal and Ovarian, PLCO cancer screening trial, Controlled Clin. Trials 21 (2000) 273S–309S.
- [26] Purcell SM, Wray NR, Stone JL, Visscher PM, O’Donovan MC, Sullivan PF, Sklar P, Ruderfer DM, McQuillin A, Morris DW, et al., Common polygenic variation contributes to risk of schizophrenia and bipolar disorder, Nature 460 (2009) 748–752.
- [27] Tsiatis AA, Semiparametric Theory and Missing Data, Springer, New York, 2007.
- [28] Tsiatis AA, Ma Y, Locally efficient semiparametric estimators for functional measurement error models, Biometrika 91 (2004) 835–848.