Summary
It is increasingly of interest in statistical genetics to test for the presence of an additive interaction between genetic (G) and environmental (E) risk factors. In case-control studies involving a rare disease, a statistical test of no additive G × E interaction typically entails a test of no relative excess risk due to interaction (RERI). It has been shown that a likelihood ratio test of a null RERI incorporating the G-E independence assumption (RERI-LRT) outperforms the standard approach. The RERI-LRT relies on correct specification of a logistic model for the binary outcome, as a function of G, E and auxiliary covariates. However, when at least one exposure is not categorical or auxiliary covariates are present, nonparametric estimation may not be feasible, while parametric logistic regression will a priori rule out the null hypothesis of no additive interaction in most practical situations, inflating type I error rate. In this paper, we present a general approach to test for G × E additive interaction exploiting G-E independence. Unlike the RERI-LRT, it allows the regression model for the binary outcome to remain unrestricted, and nonetheless still allows for covariate adjustment in order to ensure the G-E independence assumption or to rule out residual confounding. The methods are illustrated through extensive simulation studies and an ovarian cancer study.
Keywords: gene-environment additive interaction, gene-environment independence, case-control study
1 |. INTRODUCTION
There is growing interest in the development and application of statistical methods to detect the presence of an interaction between genetic (G) and environmental (E) risk factors. In case-control studies, the vast majority of interaction analyses have been based on the multiplicative scale. This is mainly because linear regression is generally inconsistent [1], while logistic regression with product terms can be conveniently fitted and such a model is approximately multiplicative in the case of a rare disease. However, it has been argued that assessment of interaction should be mainly based on the additive scale rather than the multiplicative scale [2, 3, 4, 5, 6]. The additive scale allows one to capture whether the disease risks would be different across different subgroups and thus is more relevant for measuring public health impact. It also closely corresponds to tests for a mechanistic interaction [5]. Specifically, a weaker notion of mechanistic interaction is a sufficient cause interaction, which is present if there is at least one person for whom the outcome would occur if both exposures were present but would not occur if just one exposure were present [7]. A stronger notion is an epistatic interaction, which concerns whether there is at least one person who will have the outcome if and only if both exposures are present [5]. Such interactions can sometimes be detected by evaluating additive interactions under certain monotonicity conditions.
Although the interaction term in a logistic regression does not have an additive interpretation, statistical tests of no additive G × E interaction can in fact be performed using such a model when the disease is rare. In particular, the relative excess risk due to interaction (RERI) test statistic [8] can be represented in terms of relative risks. For case-control data involving a rare disease, the relative risks can be approximated by odds ratios, so a test of null RERI can be obtained from a standard logistic regression. When G and E are known to be independent in the target population, [9] developed a likelihood ratio test of the null hypothesis of no RERI incorporating the G-E independence assumption (hereafter RERI-LRT), which is generally more powerful than the standard RERI test of no additive interaction. Such a gain in power has long been observed for various tests of multiplicative interaction [10, 11, 12, 13, 14].
RERI-based tests of additive interaction rely on correct specification of a logistic model for the binary outcome, as a function of G, E and auxiliary covariates. In principle, a saturated or nonparametric specification of the outcome logistic regression would ensure validity of RERI-based tests (i.e. preserve the nominal type I error rate). However, when the environmental exposure (E) or auxiliary covariates include count or continuous variables, nonparametric estimation may not be feasible. Moreover, in this case, a standard parametric specification of a logistic outcome regression will generally a priori rule out the null hypothesis of no additive interaction, leading to inflation of the type I error rate, as we detail in the appendix.
In this paper, we present a general, yet fairly straightforward approach to directly test for the presence of additive G × E interaction in case-control studies without requiring a regression model of disease risk. The proposed approach, which is easily made to exploit the G-E independence assumption, relies on separate regression models for G and E given the covariates. By avoiding specification of an outcome model, the approach circumvents the aforementioned difficulties with RERI-based tests and is completely robust to mis-specification of the outcome regression, although correct specification of the models for G and E is instead needed for the new approach to be valid. Nonetheless, unlike RERI-based tests, standard parametric models can be used for the latter in most practical situations without a priori ruling out the null hypothesis of no additive interaction, even if E is continuous. In addition, although the test is developed in the context of G × E interaction, it can be used in a more general context to test for interaction [15].
The methods are illustrated through an extensive simulation study in the setting of binary G and E with no additional covariates so that the traditional RERI-based tests and the new approach equally apply and can be directly compared in terms of power. Additional simulations are performed to illustrate the poor behavior of parametric RERI-based tests in more practical scenarios while the proposed approach continues to perform well. We also demonstrate the new approach using data from an ovarian cancer study to detect an additive interaction between the BRCA1/2 genetic variant (G), and the woman’s parity and number of years of oral contraceptive use (E). Because both environmental exposures are counts, the RERI-LRT cannot be implemented without possibly recoding the original environmental exposures as dichotomous or as categorical with few levels. Covariate adjustment needed in the study also presents analytical difficulty for RERI-LRT and for these reasons the approach is forgone in both applications in favor of the new methodology.
The paper is organized as follows. We detail the proposed method in Section 2. Specifically, we start with the simple setting of binary genetic and environmental risk factors. We review the RERI in Section 2.1. We then propose, in Section 2.2, a test statistic based on an alternative characterization of the null hypothesis of no additive interaction, which possesses the flexibility to leverage G-E independence as described in Section 2.3 and to adjust for covariates as described in Section 2.4. Next we extend the proposed test to more general types of exposures in Section 2.5. We discuss the possibility of relaxing the rare disease assumption in Section 2.6. In Section 3, we provide a unified class of test statistics which subsumes all settings in previous sections as special cases. We evaluate the performance of the proposed tests via extensive simulation studies in Section 4, and illustrate use of the new test statistic using data from the ovarian cancer study in Section 5. We close with a discussion in Section 6.
2 |. METHODS
In this section, we first consider the scenario where both G and E are binary. We provide a brief review of the existing method based on a test of a null RERI with binary exposures. We introduce a new test based on an alternative representation of the null hypothesis of no RERI. We extend the test to incorporate the G-E independence assumption and to adjust for covariates. Then we provide a more general test for polytomous or continuous exposures. We also briefly discuss approaches to potentially relax the rare disease assumption. In the next section, we will provide a unified class of test statistics which additionally allows for relaxation of the G-E independence assumption.
Suppose one has observed retrospective case-control data on n unrelated individuals. Let D denote the rare disease outcome defining case-control status and (A1, A2) denote the two exposures in view. For instance, in a statistical genetic application, A1 may denote the genetic variant G and A2 an environmental exposure E. Here we use the more generic notation (A1, A2) to allow for the more general contexts considered in Section 3, where either or both exposures may be count or continuous. Let μ(a1, a2) = Pr(D = 1|A1 = a1, A2 = a2) denote the disease risk of individuals in the target population with exposure values (a1, a2). The additive interaction between A1 and A2 is measured by the extent to which the effect of A1 and A2 together exceeds the effect of each considered individually, i.e.,
$$\{\mu(a_1,a_2)-\mu(0,0)\}-\{\mu(a_1,0)-\mu(0,0)\}-\{\mu(0,a_2)-\mu(0,0)\} = \mu(a_1,a_2)-\mu(a_1,0)-\mu(0,a_2)+\mu(0,0).$$
2.1 |. Binary exposures and the standard RERI
In the case of binary A1 and A2, we have the following saturated model
$$\mu(a_1,a_2) = \beta_0 + \beta_1 a_1 + \beta_2 a_2 + \beta_3 a_1 a_2. \tag{1}$$
Therefore, an additive interaction between A1 and A2 is said to be present if
$$\beta_3 = \mu(1,1) - \mu(1,0) - \mu(0,1) + \mu(0,0) \neq 0.$$
In case-control studies we cannot generally estimate the risk μ(a1, a2). An alternative representation commonly considered is the relative excess risk due to interaction (RERI), defined as
$$\mathrm{RERI} = \frac{\mu(1,1)}{\mu(0,0)} - \frac{\mu(1,0)}{\mu(0,0)} - \frac{\mu(0,1)}{\mu(0,0)} + 1,$$
based on the observation that
$$\mathrm{RERI} = \beta_3/\mu(0,0),$$
so that RERI = 0 if and only if β3 = 0.
An empirical version of RERI can be obtained by estimating the required risk ratios μ(a1, a2)/μ(0,0) via a saturated logistic regression, under the rare disease assumption that odds ratios and risk ratios are approximately equal. Specifically, let $\widehat{OR}_{G,E}$ denote the point estimate of the odds ratio ORG,E = [μ(G,E)(1 − μ(0,0))]/[μ(0,0)(1 − μ(G,E))] for levels (G,E) compared to the reference category (G,E) = (0,0) from a saturated logistic regression including G, E, and G×E terms; then
$$\widehat{\mathrm{RERI}} = \widehat{OR}_{1,1} - \widehat{OR}_{1,0} - \widehat{OR}_{0,1} + 1.$$
The standard Delta method can be applied to provide a consistent estimate of its standard error, as detailed in [16]. Finally, a Wald-type test statistic TRERI is constructed as the ratio of $\widehat{\mathrm{RERI}}$ to its estimated standard error. Using standard asymptotic arguments and the Delta method, under the null hypothesis of no additive interaction H0: β3 = RERI = 0, TRERI is approximately standard normal in large samples.
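To make the standard RERI test concrete, the following is a minimal sketch in Python for binary G and E with no covariates. It assumes the data are summarized as case and control cell counts, uses the fact that odds ratios from a saturated logistic model equal the cross-product ratios of those counts, and applies a textbook delta-method variance on the log odds ratio scale. The function name `reri_wald` and the input layout are illustrative choices, not taken from [16].

```python
import math

def reri_wald(cases, controls):
    """Wald test of RERI = 0 from 2x2 exposure counts in cases and controls.

    cases, controls: dicts keyed by (g, e) in {0,1}^2 holding cell counts.
    Returns (reri_hat, standard_error, z_statistic).
    """
    ref = (0, 0)
    levels = [(1, 1), (1, 0), (0, 1)]
    # Saturated-model odds ratios versus the (0, 0) reference cell.
    odds_ratio = {
        k: (cases[k] * controls[ref]) / (cases[ref] * controls[k]) for k in levels
    }
    reri = odds_ratio[(1, 1)] - odds_ratio[(1, 0)] - odds_ratio[(0, 1)] + 1.0

    # Gradient of RERI with respect to the three log odds ratios.
    grad = {
        (1, 1): odds_ratio[(1, 1)],
        (1, 0): -odds_ratio[(1, 0)],
        (0, 1): -odds_ratio[(0, 1)],
    }
    # All log odds ratios share the reference cell, which induces covariance.
    shared = 1.0 / cases[ref] + 1.0 / controls[ref]
    var = 0.0
    for j in levels:
        for k in levels:
            cov = shared
            if j == k:
                cov += 1.0 / cases[j] + 1.0 / controls[j]
            var += grad[j] * grad[k] * cov
    se = math.sqrt(var)
    return reri, se, reri / se
```

The returned z statistic would be compared with standard normal quantiles. The sketch does not incorporate G-E independence, so it corresponds to the standard prospective analysis rather than the RERI-LRT of [9].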
2.2 |. The proposed test of additive interaction
The following result gives an alternative characterization of the null hypothesis of no additive interaction which motivates the new approach. To state the result, let π1(a2) = Pr(A1 = 1|A2 = a2) denote the prevalence of the first exposure A1 among individuals with second exposure A2 = a2 in the underlying population, and likewise define π2(a1) = Pr(A2 = 1|A1 = a1). Let α denote the log odds ratio relating A1 and A2 in the target population. Then
$$\alpha = \log\frac{\pi_1(1)\{1-\pi_1(0)\}}{\pi_1(0)\{1-\pi_1(1)\}}.$$
Note that α = 0 encodes the independence assumption between A1 and A2.
Result 1. We have that the null hypothesis of no additive interaction H0: β3 = 0 holds if and only if
$$E(U \mid D = 1) = 0,$$
where
$$U = \{A_1 - \pi_1(0)\}\{A_2 - \pi_2(0)\}\exp(-\alpha A_1 A_2). \tag{2}$$
We should note that Result 1 does not rely on the rare disease assumption and holds irrespective of the population disease prevalence. The result is a special case of a more general lemma given later in Section 3 allowing for arbitrary exposures and for covariate adjustment. According to the result, the null hypothesis of no additive interaction holds if and only if RERI is equal to zero, or equivalently if and only if the random variable U has mean zero among cases (D = 1), i.e., E(U | D = 1) = 0.
Remark 1. Intuition about the result is gained by assuming G-E independence, i.e. α = 0. Let fj(aj) = Pr(Aj = aj) denote the density of Aj in the underlying target population. Under G-E independence we have πj(a) = πj, ∀a. Then, upon noting that the conditional density of (A1, A2) given D = 1 is proportional to
$$\mu(a_1,a_2)\, f_1(a_1)\, f_2(a_2),$$
where fj(1) = πj, one observes that E(U | D = 1) is proportional to
$$\sum_{a_1,a_2}\mu(a_1,a_2)(a_1-\pi_1)(a_2-\pi_2)f_1(a_1)f_2(a_2) = \pi_1(1-\pi_1)\pi_2(1-\pi_2)\,\beta_3,$$
confirming that E(U | D = 1) = 0 if and only if the additive interaction β3 = 0. Result 1 further shows that a similar result holds when the exposures are dependent, upon applying a weight to individuals with both exposures equal to the inverse of the odds ratio association of the two exposures. Intuitively, weighting makes the exposures independent, thus essentially recovering the independent exposure setting in the weighted sample. Since U only uses exposure data among cases (with D = 1), the result suggests that one may be able to test for additive interaction by considering whether the distribution of the exposures in view satisfies the above condition using data for cases only.
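The characterization in Result 1 can be checked numerically for binary exposures. The sketch below assumes that U takes the weighted-product form {A1 − π1(0)}{A2 − π2(0)}exp(−αA1A2), which is our reading of equation (2), and computes E(U | D = 1) exactly from a population joint distribution of (A1, A2) and a linear risk model as in equation (1). The function name and argument layout are hypothetical.

```python
import math

def mean_u_among_cases(cell_probs, beta):
    """Exact E(U | D = 1) for binary exposures.

    cell_probs: dict keyed by (a1, a2) with the population joint pmf of (A1, A2).
    beta: (b0, b1, b2, b3) for the risk model mu = b0 + b1*a1 + b2*a2 + b3*a1*a2.
    Assumes U = (A1 - pi1(0)) * (A2 - pi2(0)) * exp(-alpha * A1 * A2).
    """
    p = cell_probs
    pi1_0 = p[(1, 0)] / (p[(0, 0)] + p[(1, 0)])  # Pr(A1 = 1 | A2 = 0)
    pi2_0 = p[(0, 1)] / (p[(0, 0)] + p[(0, 1)])  # Pr(A2 = 1 | A1 = 0)
    # Population log odds ratio between the two exposures.
    alpha = math.log(p[(1, 1)] * p[(0, 0)] / (p[(1, 0)] * p[(0, 1)]))
    b0, b1, b2, b3 = beta
    numerator = 0.0
    prevalence = 0.0
    for (a1, a2), prob in p.items():
        risk = b0 + b1 * a1 + b2 * a2 + b3 * a1 * a2
        u = (a1 - pi1_0) * (a2 - pi2_0) * math.exp(-alpha * a1 * a2)
        numerator += risk * prob * u  # proportional to the case density times U
        prevalence += risk * prob     # Pr(D = 1)
    return numerator / prevalence
```

With dependent exposures such as `{(0, 0): 0.4, (1, 0): 0.2, (0, 1): 0.1, (1, 1): 0.3}`, the function returns zero up to rounding when b3 = 0 and a nonzero value otherwise, whatever the values of b0, b1 and b2, in line with the statement that the result does not require a rare disease.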
Unfortunately, U is not directly observed and therefore cannot directly be used for inference, as it depends on the unknown population parameters πj(0), j = 1, 2. Nonetheless, progress can be made under the rare disease assumption, since one may use the controls (with D = 0) for approximate inference, upon observing that πj(0) ≈ pj(0), where p1(a2) = Pr(A1 = 1|A2 = a2, D = 0) and p2(a1) = Pr(A2 = 1|A1 = a1, D = 0). Specifically, let
$$\omega = \log\frac{p_1(1)\{1-p_1(0)\}}{p_1(0)\{1-p_1(1)\}}, \tag{3}$$
then ω ≈ α under the rare disease assumption. In particular, the control log odds ratio is asymptotically negligible when A1 and A2 are independent in the population. Therefore, one may estimate E(U | D = 1) with the average over cases of the plug-in version of U,
$$\frac{1}{n_1}\sum_{i:\,D_i=1}\{A_{1i}-\hat p_1(0)\}\{A_{2i}-\hat p_2(0)\}\exp(-\hat\omega A_{1i}A_{2i}), \tag{4}$$
where n1 is the number of cases, $\hat p_1(a)$ is the sample version of p1(a), $\hat p_2(a)$ is similarly defined, and $\hat\omega$ is the sample log odds ratio relating A1 and A2 in the controls. In the Appendix, we show how to derive the variance of this estimator (see equation (1) of the Appendix).
Suppose that, unbeknownst to the analyst, A1 and A2 are independent in the population and therefore ω is asymptotically negligible. We evaluate the variance at this particular submodel and show that it can be decomposed as V1 + V2 + V3, where the first term V1 captures the variance of the estimator if (ω, p1(0), p2(0)) were known; the second term V2 reflects the uncertainty due to estimation of (p1(0), p2(0)); while V3 reflects the uncertainty associated with estimation of the odds ratio parameter ω. Specifically,
In the next section, we further consider how explicitly leveraging G-E independence assumption alters each of these contributions, to reveal how the independence assumption can improve power to detect the presence of an additive interaction.
Here we note that, under H0, the standardized test statistic T, obtained by dividing the estimator in (4) by its estimated standard error, is approximately standard normal in large samples, where the standard error is consistently estimated by substituting all unknown expectations with empirical averages and unknown parameters with corresponding estimators. Under the two-sided alternative hypothesis β3 ≠ 0, one can further show that in large samples, T has approximate variance one, and is approximately centered at the non-centrality parameter κ × β3, where:
λ is the sampling fraction of cases (i.e. λ = proportion of cases in case-control sample/proportion of cases in population). Since κ tends to infinity with sample size, T has asymptotic power one, confirming that, similar to TRERI, T is a consistent test statistic of H0.
Interestingly, the above derivation also implies that the statistic gives a consistent estimate of β3/Pr{D = 1}, which is the interaction parameter of interest scaled by the inverse of the population disease prevalence. Thus, one could in principle recover a consistent estimate of β3 if either the underlying population disease prevalence or the sampling fraction of cases were known.
We note that neither T nor TRERI makes explicit use of the G-E independence assumption and therefore both may be inefficient when the assumption holds. In the following section, we modify T to explicitly encode the independence assumption thus obtaining a more powerful test statistic.
2.3 |. Test incorporating independence assumption
Suppose that A1 and A2 are known to be independent in the population. Naturally, one may wish to exploit such prior information in testing for G-E interaction. This can be accomplished by adapting the methodology developed in the previous section upon noting that the independence assumption implies α = 0, which, under the rare disease assumption, also implies that ω ≈ 0. This leads us to modify . Define similarly to with , i.e.
| (5) |
In the appendix, we show that can be estimated by . Consequently , reflecting the efficiency gain due to the independence assumption, i.e. is exactly zero since there is no uncertainty associated with One can verify that the non-centrality parameter β3 × κ1 of becomes , confirming that T1 is guaranteed to be more powerful than T.
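As an illustration of the test under G-E independence, the sketch below computes the case-only average of {A1 − p̂1}{A2 − p̂2}, with the exposure prevalences p̂1 and p̂2 estimated marginally in the controls, and standardizes it. The variance used here is a delta-method approximation that we derive for this sketch (case-sample variance plus the contribution from estimating the two prevalences in controls, treating those two estimates as uncorrelated); it is an assumption and is not a transcription of the estimator given in equation (1) of the Appendix. The function name is hypothetical.

```python
import math
from statistics import mean, variance

def additive_interaction_test_indep(cases, controls):
    """Standardized case-only statistic under assumed G-E independence.

    cases, controls: lists of (a1, a2) pairs with binary entries.
    Returns (estimate, standard_error, z_statistic).
    """
    n1, n0 = len(cases), len(controls)
    # Exposure prevalences estimated in controls (rare disease approximation).
    p1 = mean(a1 for a1, _ in controls)
    p2 = mean(a2 for _, a2 in controls)

    products = [(a1 - p1) * (a2 - p2) for a1, a2 in cases]
    estimate = mean(products)

    # Variance if p1 and p2 were known.
    v_cases = variance(products) / n1
    # Delta-method contribution from plugging in the control estimates.
    d1 = -mean(a2 - p2 for _, a2 in cases)  # derivative with respect to p1
    d2 = -mean(a1 - p1 for a1, _ in cases)  # derivative with respect to p2
    v_controls = d1 ** 2 * p1 * (1 - p1) / n0 + d2 ** 2 * p2 * (1 - p2) / n0

    se = math.sqrt(v_cases + v_controls)
    return estimate, se, estimate / se
```

Because the log odds ratio between the exposures is fixed at zero rather than estimated, no term for its sampling variability appears, which mirrors the efficiency gain described in this section.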
2.4 |. Adjusting for covariates
In observational studies, it is usually desirable to adjust for potential confounding of the joint effects of A1 and A2, and such covariate adjustment may also be required to enforce the G-E independence assumption. Let X denote such a vector of covariates and suppose that the exposures are independent conditional on X, i.e., A1 ⊥ A2 | X. Define p1(x) = Pr(A1 = 1|X = x, D = 0) and p2(x) = Pr(A2 = 1|X = x, D = 0). Likewise, let $\hat p_1(x)$ and $\hat p_2(x)$ denote the corresponding estimates, obtained using standard parametric models, e.g. logistic regressions of each exposure on X with coefficients computed by maximum likelihood. Under the null hypothesis of no additive interaction, the test statistic formed by dividing the estimator below by its estimated standard error has an approximate standard normal distribution, with the estimator defined as
$$\frac{1}{n_1}\sum_{i:\,D_i=1}\{A_{1i}-\hat p_1(X_i)\}\{A_{2i}-\hat p_2(X_i)\}, \tag{6}$$
where n1 is the number of cases and the variance estimate is obtained using equation (1) of the Appendix.
2.5 |. More general exposures
In this section, we extend the test statistic to scenarios where the exposures A1 and A2 may be continuous or polytomous. Suppose that the environmental exposure A2 were continuous. For example, if D were diabetes status, A2 could be body mass index (BMI), typically coded on a continuous scale. Note that the null hypothesis of no additive interaction can be restated as follows to acknowledge the continuous exposure:
$$H_0:\ \mu(a_1,a_2,x)-\mu(a_1,0,x)-\mu(0,a_2,x)+\mu(0,0,x)=0 \quad \text{for all } (a_1,a_2,x),$$
where μ(a1,a2,x) = Pr(D = 1|a1,a2,x). To construct an appropriate test statistic of H0, suppose that the mean of A2 given X is estimated with a linear model via ordinary least squares using controls only. Assuming G-E conditional independence given X, it is straightforward to modify the proposed test statistic to account for the continuous exposure, by simply replacing $\hat p_2(X)$ with the fitted mean $\hat E(A_2 \mid X)$. Thus, we let
$$\frac{1}{n_1}\sum_{i:\,D_i=1}\{A_{1i}-\hat p_1(X_i)\}\{A_{2i}-\hat E(A_2\mid X_i)\}, \tag{7}$$
and estimate its variance using equation (1) of the Appendix. Then, the test statistic obtained by dividing (7) by its estimated standard error is approximately standard normal under H0.
A similar test statistic could be defined if A2 were a count, upon estimating its mean given X with a log-linear model computed by maximum likelihood under, say, a Poisson model for A2. Likewise, if A1 were a count, for instance if A1 were to encode the number of minor alleles measured at a single nucleotide polymorphism (SNP) locus, one could model its mean under the Hardy-Weinberg equilibrium model, where A1 could be modelled as Binomial with two trials and event probability estimated by a logistic regression model. Note that under this model the mean of A1 is given by twice the event probability. Then, one could simply replace the fitted prevalence of the corresponding exposure with its fitted mean in defining the test statistic, and likewise modify the estimated variance of the test statistic using (1) of the Appendix.
Suppose now that A1 were more generally categorical, having K possible levels {0, a1,1, …, a1,K−1} with 0 a reference value. Further assuming that A2 were, say, continuous and independent of A1 given X, we could then simply define
| (8) |
where $\hat p_{1,k}(x)$ is a maximum likelihood estimate of Pr(A1 = a1,k | x) computed using standard polytomous logistic regression. The large sample variance of the statistic in (8) is estimated based on (1) of the Appendix. Then in large samples, the resulting standardized test statistic is approximately standard normal under the null hypothesis of no additive interaction, which may be restated to account for the polytomous and continuous exposures:
$$H_0:\ \mu(a_{1,k},a_2,x)-\mu(a_{1,k},0,x)-\mu(0,a_2,x)+\mu(0,0,x)=0 \quad \text{for all } k,\ a_2,\ x.$$
2.6 |. Relaxing the rare disease assumption
If the rare disease assumption does not apply, estimating exposure regression models in controls only may not be entirely appropriate. Nonetheless, it may still be possible to test for the presence of an additive interaction. For instance, if, as is often the case in nested case-control studies, sampling fractions for cases and controls were known, then standard inverse probability weighting based on the known sampling weights could be used to estimate population models for the exposures using both cases and controls. Potentially more efficient estimates of models for the exposures could alternatively be obtained using more recent methodology for regression analysis of secondary outcomes in case-control studies [17].
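The inverse probability weighting idea can be illustrated in its simplest form, estimating a population exposure prevalence from pooled cases and controls. The sketch below assumes the sampling fractions are known and uses a normalized weighted mean; the function and argument names are hypothetical, and a weighted regression on covariates would follow the same weighting scheme.

```python
def ipw_prevalence(exposure, is_case, frac_cases, frac_controls):
    """Weighted mean of a binary exposure using known sampling fractions.

    exposure: iterable of 0/1 exposure values.
    is_case: iterable of 0/1 case indicators aligned with `exposure`.
    frac_cases, frac_controls: known probabilities that a case or a control
        from the source population was sampled into the study.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for a, d in zip(exposure, is_case):
        # Each subject stands in for 1 / (sampling fraction) population members.
        w = 1.0 / (frac_cases if d else frac_controls)
        weighted_sum += w * a
        weight_total += w
    return weighted_sum / weight_total
```

For example, if every case but only 1% of controls were sampled, each control receives weight 100 and each case weight 1, so the estimate is driven mainly by controls, as one would expect for a disease that is uncommon in the population.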
3 |. A UNIFIED CLASS OF TEST STATISTICS
We now provide a unified class of test statistics for the null hypothesis of no additive interaction which subsumes each of the settings considered in previous sections as special cases, but which also allows for the conditional independence assumption of the two exposures to be relaxed.
To do so, we proceed as in [18] and use the following representation of the joint density of (A1, A2) given X:
$$f(A_1,A_2\mid X)=\frac{OR(A_1,A_2;X)\,f(A_1\mid A_2=0,X)\,f(A_2\mid A_1=0,X)}{\int OR(a_1,a_2;X)\,f(a_1\mid A_2=0,X)\,f(a_2\mid A_1=0,X)\,d\nu(a_1,a_2)}, \tag{9}$$
where ν is a dominating measure of the distribution of (A1, A2), and OR(A1, A2; X) is the generalized odds ratio function relating A1 and A2 within levels of X, that is
$$OR(A_1,A_2;X)=\frac{f(A_1,A_2\mid X)\,f(A_1=0,A_2=0\mid X)}{f(A_1,A_2=0\mid X)\,f(A_1=0,A_2\mid X)},$$
and {f (A1|A2 = 0, X), f (A2|A1 = 0, X)} are baseline densities in the target population. Note that in the simple case of binary exposures, the generalized odds ratio function reduces to the standard odds ratio effect measure, but remains well defined as a measure of association for exposures of a more general nature, including categorical, count or continuous variables. In particular, OR(A1, A2; X) = 1 if and only if A1 and A2 are independent within levels of X. Let
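$$\beta_3(a_1,a_2,x)=\mu(a_1,a_2,x)-\mu(a_1,0,x)-\mu(0,a_2,x)+\mu(0,0,x)$$
denote the additive interaction between A1 and A2. The null hypothesis of no additive interaction can more generally be stated as:
$$H_0:\ \beta_3(a_1,a_2,x)=0 \quad \text{for all } (a_1,a_2,x).$$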
For any function g(A1, A2, X) of (A1,A2,X), define
| (10) |
Remark 2. Intuition about the form of W(g) is similar to that given in Remark 1 upon noting that for any function g(A1, A2, X) of (A1, A2, X),
| (11) |
which is equal to zero if and only if β3(a1, a2, x) = 0, as shown in the proof. By Theorem 1 of [18], any function of (A1, A2, X, D) that satisfies (11) must be of the form of W(g).
A special case of W(g) is U defined in (2). In particular, it is straightforward to verify that the test statistics proposed to handle binary, continuous or count exposures under the independence assumption are obtained by taking:
where the mean of Aj is evaluated under f (Aj |X), j = 1, 2. For A1 categorical with K distinct categories and A2 binary, continuous or a count, one likewise obtains the test statistic previously proposed by taking:
Lemma 1. The null hypothesis H0 : β3(a1, a2, x) = 0 holds if and only if
Result 1 is easily recovered as a corollary of Lemma 1. According to Lemma 1, an empirical version of W(g) with a user-specified function g may be used to test H0. Specifically, one must estimate the unknown odds ratio function and baseline densities in order to obtain an estimate of the joint density of (A1, A2) given X using the parametrization given in equation (9).
Under the rare disease assumption, estimation of the joint density can be done by standard maximum likelihood in the controls only, upon positing parametric models for the odds ratio function and baseline densities. To fix ideas, one could posit a single parameter model for the odds ratio function as log OR(A1, A2; X; ω) = ωA1A2, which encodes the assumption that the odds ratio association between A1 and A2 given X does not vary with X, i.e. no effect heterogeneity in X of the odds ratio association between A1 and A2 in the population. Similarly, one could posit parametric models for the exposure densities f (A1 |A2 = 0, X; α1) and f (A2 |A1 = 0, X; α2). For exposures that are binary, continuous or counts, generalized linear models within the exponential family may be used to model the baseline densities. For example, counts may be modeled by assuming a Poisson distribution for the corresponding baseline density.
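For two binary exposures and no covariates, the odds ratio parametrization of the joint density reduces to a small calculation, sketched below. The joint probability of each cell is taken proportional to exp(ω a1 a2) times the two baseline probabilities, then normalized over the four cells. The function name and arguments are illustrative; with covariates, the same construction would be applied within each level of X using fitted baseline models.

```python
import math

def joint_pmf_binary(pi1_0, pi2_0, omega):
    """Joint pmf of binary (A1, A2) from baseline probabilities and a log odds ratio.

    pi1_0: Pr(A1 = 1 | A2 = 0); pi2_0: Pr(A2 = 1 | A1 = 0); omega: log odds ratio.
    """
    base1 = {0: 1.0 - pi1_0, 1: pi1_0}
    base2 = {0: 1.0 - pi2_0, 1: pi2_0}
    # Unnormalized density: odds ratio function times the two baseline densities.
    unnormalized = {
        (a1, a2): math.exp(omega * a1 * a2) * base1[a1] * base2[a2]
        for a1 in (0, 1)
        for a2 in (0, 1)
    }
    total = sum(unnormalized.values())  # normalizing constant (the integral)
    return {cell: value / total for cell, value in unnormalized.items()}
```

Setting `omega = 0` returns the product of the two baseline distributions, which is the independence case used when OR(A1, A2; X) is fixed at 1.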
Let $\hat\theta = (\hat\omega, \hat\alpha_1, \hat\alpha_2)$ denote the approximate maximum likelihood estimate of θ = (ω, α1, α2) using controls only, and let the corresponding plug-in quantity, averaged over cases, denote the resulting estimate of W(g). Our proposed test statistic is then given by this estimate divided by its estimated standard error, where the variance estimate is provided in the Appendix. Under the independence assumption, we set OR(A1, A2; X) to 1 for all persons in the sample. In this case, the asymptotic variance can be easily modified to reflect that OR(A1, A2; X) is set to 1 with no associated uncertainty.
4 |. A SIMULATION STUDY
We study the power and type I error of our proposed test in the standard setting of binary genetic and environmental variables with no other covariate, so that it is readily compared to the approach of [9]. In order to evaluate both type I error rates and power of the various test statistics, we generated simulated data following the design of [9], which encodes the magnitude of the interaction indirectly by varying RERI from 0 (to assess type I error) to 0.5. The probability of having the genetic variant was 0.5, the probability of the binary environmental variable was 0.2, and these factors were generated to be independent. Let logit(p) = log{p/(1 − p)}. The disease risk model was
$$\mathrm{logit}\,\Pr(D=1\mid G,E)=\alpha_0+\alpha_1 G+\alpha_2 E+\alpha_3 GE,$$
with baseline risk equal to 0.01 (i.e. α0 = logit(0.01)). The gene and environment main effects were varied so that α1, α2 ∈ {log(0.7), log(1.2), log(2)}, and the multiplicative G-E interaction parameter α3 was selected to yield the desired RERI, according to the formula
$$\mathrm{RERI}=e^{\alpha_1+\alpha_2+\alpha_3}-e^{\alpha_1}-e^{\alpha_2}+1.$$
In each simulation, we generated 4000 cases and 4000 controls. We report results from 10,000 simulations for each setting corresponding to a particular combination of (α1,α2) and RERI values.
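A data-generating step along these lines could be sketched as follows. The sketch assumes that RERI is expressed through odds ratios, so that α3 solves RERI = exp(α1 + α2 + α3) − exp(α1) − exp(α2) + 1; this is our reading of the design rather than a formula quoted from [9]. Subjects are drawn from the population model and retained until the target numbers of cases and controls are reached. All function names are hypothetical.

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def alpha3_for_reri(reri, alpha1, alpha2):
    """Multiplicative interaction giving the target odds-ratio-based RERI (assumed form)."""
    return math.log(reri + math.exp(alpha1) + math.exp(alpha2) - 1.0) - alpha1 - alpha2

def simulate_case_control(n_cases, n_controls, alpha0, alpha1, alpha2, alpha3,
                          p_g=0.5, p_e=0.2, seed=0):
    """Draw independent binary G and E, then D from a logistic risk model.

    Returns two lists of (g, e) pairs: one for cases and one for controls.
    """
    rng = random.Random(seed)
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        g = int(rng.random() < p_g)
        e = int(rng.random() < p_e)
        risk = expit(alpha0 + alpha1 * g + alpha2 * e + alpha3 * g * e)
        diseased = rng.random() < risk
        # Keep the subject only if the corresponding group is not yet full.
        if diseased and len(cases) < n_cases:
            cases.append((g, e))
        elif not diseased and len(controls) < n_controls:
            controls.append((g, e))
    return cases, controls
```

With a baseline risk of 0.01, filling 4000 cases requires several hundred thousand population draws per replicate, so a full study of 10,000 replicates in pure Python would be slow; the sketch is meant to convey the design rather than to be an efficient implementation.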
Figure 1 summarizes the simulation results in terms of power, comparing the proposed tests with and without using the G-E independence assumption, labeled ‘U ind’ and ‘U’ respectively. The figure also presents results for the retrospective profile likelihood ratio test proposed by [9] with and without using the independence assumption, labeled ‘Han ind’ and ‘Han’ respectively. In addition, the figure displays results from the standard RERI test based on prospective logistic regression, which is labeled ‘prosp’. Table 1 summarizes the type I error rate at a significance level of 5% for the various methods under a range of parameter values. Additional summaries of the type I error at a significance level of 1%, as well as the power at different Pr(G = 1) values, are presented in the appendix.
FIGURE 1.
Power of the various tests for identifying additive G-E interaction, in the simple setting in which both a1 and a2 are binary and no covariates are used in the model, with Pr(G = 1) = 0.5. The simulated a1 is common, with Pr(a1 = 1) = 0.5 and Pr(a2 = 1) = 0.2; the disease model Pr(D = 1|a1, a2) has the fixed baseline risk α0 = logit(0.01), the main effects of a1 and a2 are varied as α1 and α2, and the RERI is varied from 0 to 0.5. The compared tests are the proposed ‘U’ and ‘U ind’ (without and with assuming G-E independence); ‘prosp’, the standard test under prospective logistic regression; and ‘Han’ and ‘Han ind’, the retrospective profile likelihood tests proposed in [9] without and with assuming G-E independence.
TABLE 1.
The type I error of the compared tests, under various combinations of the prevalence of the genetic variant a1, and the effect of the genetic and environmental variables on the disease outcome (α1 and α2, respectively). The tests ‘U’ and ‘U ind’ are the proposed tests without and with the assumption of G-E independence. ‘prosp’ is the usual test based on standard logistic regression, and ‘Han’ and ‘Han ind’ are the tests based on retrospective profile likelihood proposed by [9]. The type I error was calculated from 10,000 simulations, each with 4000 cases and 4000 controls.
| α1: | log(0.7) | log(0.7) | log(0.7) | log(1.2) | log(1.2) | log(1.2) | log(2) | log(2) | log(2) |
|---|---|---|---|---|---|---|---|---|---|
| α2: | log(0.7) | log(1.2) | log(2) | log(0.7) | log(1.2) | log(2) | log(0.7) | log(1.2) | log(2) |
| Pr(G = 1) = 0.5 | |||||||||
| U | 0.048 | 0.049 | 0.051 | 0.050 | 0.048 | 0.049 | 0.049 | 0.050 | 0.048 |
| U ind | 0.047 | 0.048 | 0.049 | 0.047 | 0.047 | 0.049 | 0.046 | 0.046 | 0.048 |
| prosp | 0.050 | 0.048 | 0.047 | 0.048 | 0.048 | 0.046 | 0.046 | 0.049 | 0.049 |
| Han | 0.049 | 0.049 | 0.050 | 0.050 | 0.051 | 0.050 | 0.047 | 0.052 | 0.051 |
| Han ind | 0.047 | 0.048 | 0.050 | 0.047 | 0.048 | 0.049 | 0.046 | 0.046 | 0.048 |
| Pr(G = 1) = 0.2 | |||||||||
| U | 0.050 | 0.047 | 0.048 | 0.046 | 0.051 | 0.048 | 0.052 | 0.050 | 0.046 |
| U ind | 0.052 | 0.045 | 0.050 | 0.045 | 0.047 | 0.050 | 0.048 | 0.049 | 0.050 |
| prosp | 0.050 | 0.046 | 0.046 | 0.045 | 0.049 | 0.045 | 0.051 | 0.047 | 0.045 |
| Han | 0.051 | 0.048 | 0.050 | 0.048 | 0.050 | 0.048 | 0.055 | 0.051 | 0.050 |
| Han ind | 0.051 | 0.046 | 0.049 | 0.045 | 0.048 | 0.051 | 0.048 | 0.049 | 0.052 |
| Pr(G = 1) = 0.05 | |||||||||
| U | 0.050 | 0.054 | 0.053 | 0.054 | 0.054 | 0.050 | 0.051 | 0.049 | 0.048 |
| U ind | 0.052 | 0.057 | 0.050 | 0.051 | 0.051 | 0.050 | 0.048 | 0.047 | 0.048 |
| prosp | 0.043 | 0.048 | 0.050 | 0.042 | 0.047 | 0.045 | 0.042 | 0.043 | 0.041 |
| Han | 0.052 | 0.048 | 0.050 | 0.050 | 0.054 | 0.052 | 0.051 | 0.051 | 0.050 |
| Han ind | 0.052 | 0.052 | 0.049 | 0.050 | 0.049 | 0.046 | 0.047 | 0.048 | 0.048 |
One observes that the RERI-LRT test ‘Han ind’ and ‘U ind’ were equally powerful when Pr(G = 1) = 0.5 across various possible values for the other parameters, and both tests were dramatically more powerful when compared to the other tests, while ‘U’ was slightly less powerful than ‘Han’, which was in turn slightly less powerful than ‘prosp’.
In additional simulations, we varied the prevalence of the genetic marker Pr(G = 1) to have population probabilities 0.2 and 0.05, while the environmental factor was maintained to have probability 0.2. All tests appeared to have correct type I error rate as shown in Table 1. Power plots similar to those appearing in Figure 1 are provided in the Appendix for these additional settings. These additional simulations confirmed that all tests became less powerful as the genetic variant became less common, with ‘Han ind’ being slightly more powerful than ‘U ind’ when Pr(G = 1) = 0.05. Overall, the simulation study confirmed that the proposed approach performs quite competitively when compared with the efficient RERI-LRT approach, in settings where both methods are available.
In addition to the above results, we conducted further simulation studies to evaluate the performance of the various tests under violation of the G-E independence assumption, violation of the rare disease assumption, and a variety of exposure variable types. Simulation results are provided in the Appendix. In summary, under violation of the G-E independence assumption, tests that assume G-E independence had inflated type I errors. Violation of the rare disease assumption did not have a substantial impact on the performance of the tests, except for inflated type I error under relatively extreme genetic and environmental main effects. Our proposed tests performed quite well for various types of risk factors.
In the following section, we consider a data application of the new approach in which RERI is no longer readily available and cannot easily be applied without further making unnecessary assumptions.
5 |. OVARIAN CANCER APPLICATION
We applied the proposed test of additive interaction to the Israeli Ovarian Cancer data [19], which were recently analyzed by [12, 13, 14]. The Israeli Ovarian Cancer data were collected from a population-based case-control study based on all ovarian cancer patients identified in Israel between March 1st, 1994 and June 30th, 1999. The main goal of the study was to examine the interplay of the BRCA1/2 mutation and the known reproductive/gynecological risk factors of ovarian cancer. We obtained the data from the R package CGEN provided by [12]. Although previous analyses aimed to detect a multiplicative gene-environment interaction between having the BRCA1/2 mutation and two environmental exposures, number of years of oral contraceptive use (OC) and number of children (parity), here we are primarily concerned with determining whether such interactions might be operating on the additive scale.
There were 832 cases and 747 controls available in this case-control study. A number of covariates were available for confounding adjustment and also to enforce the independence assumption. Carriers (non-carriers) of BRCA1/2 mutation were coded as one (zero), and we considered the following logistic regression model
where X denotes the set of adjusted covariates. Specifically, we included an indicator variable for age ≤ 50, indicator variables for Ashkenazi Jewish and non-Ashkenazi ethnicity (with mixed race serving as the reference category), and indicator variables for personal history of breast cancer, family history of breast cancer, and family history of ovarian cancer. Both environmental exposures are naturally coded as counts and were therefore modelled using Poisson regression adjusting for the same set of covariates X, that is
We also conducted a sensitivity analysis without adjusting for personal history of breast cancer in both models.
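One way to write the two working exposure models just described is sketched below. The notation is an assumption made here for illustration rather than the authors' exact display: X is taken to include an intercept, α1 and α2 are coefficient vectors (the symbols are borrowed from the Appendix, where θ = (α1, α2) under independence), and the conditioning on the other exposure at its reference level can be dropped when G and E are independent given X.

```latex
\begin{aligned}
\operatorname{logit}\,\Pr(G = 1 \mid E = 0, X) &= \alpha_1^{\top} X, \\
\log \mathbb{E}(E \mid G = 0, X) &= \alpha_2^{\top} X .
\end{aligned}
```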
We performed the tests both with and without the G-E independence assumption. Without G-E independence, we assumed a generalized odds ratio function of the form OR(G, E; X) = exp(ωGE), where ω is the log odds ratio parameter, as suggested in Section 3. We used our proposed variance estimator to evaluate 95% confidence intervals and p-values. We also included a comparison with the RERI-based test using the retrospective profile likelihood proposed in [9], with the environmental factor dichotomized as .
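To see why weighting by the inverse odds ratio function matters when G and E are dependent, the following Python sketch enumerates a binary G, binary E population exactly, with no simulation. It assumes, as in the Appendix R code for binary exposures, a statistic of the form exp(−ωGE)(G − p10)(E − p20)D, with p10 = Pr(G = 1 | E = 0) and p20 = Pr(E = 1 | G = 0). The numerical values and the function name `expected_stat` are illustrative assumptions.

```python
import math

omega = 0.5                      # log odds ratio between G and E (illustrative)
q1, q2 = 0.2, 0.2                # Pr(G=1 | E=0) and Pr(E=1 | G=0)
b0, b1, b2 = 0.01, 0.01, 0.01    # risk model with no additive interaction

# Joint law of (G, E): product of baseline conditionals times exp(omega * g * e).
raw = {
    (g, e): math.exp(omega * g * e) * (q1 if g else 1 - q1) * (q2 if e else 1 - q2)
    for g in (0, 1)
    for e in (0, 1)
}
total = sum(raw.values())
joint = {key: value / total for key, value in raw.items()}


def risk(g, e):
    """Disease risk, additive in g and e (the null hypothesis)."""
    return b0 + b1 * g + b2 * e


def expected_stat(pg, pe, w):
    """Population mean of exp(-w*G*E) * (G - pg) * (E - pe) * D."""
    return sum(
        joint[g, e] * risk(g, e) * math.exp(-w * g * e) * (g - pg) * (e - pe)
        for (g, e) in joint
    )


# Weighted statistic centred at the baseline conditional probabilities.
weighted = expected_stat(q1, q2, omega)

# Statistic that wrongly treats G and E as independent (marginal centring, no weight).
marg_g = joint[1, 0] + joint[1, 1]
marg_e = joint[0, 1] + joint[1, 1]
unweighted = expected_stat(marg_g, marg_e, 0.0)

print(weighted, unweighted)
```

In this sketch the weighted mean vanishes under the null up to floating-point error, whereas the version that ignores the dependence does not, which is consistent with the inflated type I error reported for the independence-based tests when the assumption fails.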
Table 2 provides results from testing for a G × E additive interaction with and without making the G-E independence assumption. In accordance with the simulation results, the test incorporating the independence assumption yields a consistently more extreme standardized test statistic (that is, a smaller p-value) for both exposures than the corresponding test which does not incorporate the assumption. Specifically, we reject the null hypothesis of no additive G × E interaction between BRCA1/2 mutation and parity at the 0.05 level, and we found no conclusive evidence of an additive interaction with OC. Similar results were observed when using the RERI test based on retrospective profile likelihood (Han p-val) with a dichotomized environmental factor, as well as in the sensitivity analysis. It is particularly interesting to compare these findings with previous analyses of these data, which have primarily been concerned with detecting the presence of a multiplicative G × E interaction. For instance, [13] leveraged the independence assumption to detect a G × E multiplicative interaction only with OC and failed to find evidence of a similar interaction with parity, thus essentially reporting the opposite findings to ours. However, our findings are potentially more scientifically relevant given that interactions on the multiplicative scale may be harder to interpret biologically.
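As a rough consistency check on the primary-analysis entries of Table 2, the Python sketch below back-calculates a standard error from each reported 95% confidence interval and converts the estimate into a two-sided normal p-value. This assumes a symmetric, normal-approximation interval, which may differ from how the reported intervals were actually computed; the function names are illustrative.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def wald_p_from_ci(estimate, lower, upper, z_crit=1.959964):
    """Two-sided p-value implied by an estimate and a symmetric 95% interval."""
    se = (upper - lower) / (2.0 * z_crit)
    z = estimate / se
    return 2.0 * (1.0 - normal_cdf(abs(z)))


# (exposure, independence assumed): (U, CI lower, CI upper, reported p-value)
table2_primary = {
    ("Parity", "Yes"): (-0.044, -0.084, -0.004, 0.030),
    ("Parity", "No"): (-0.052, -0.102, -0.001, 0.047),
    ("OC", "Yes"): (0.050, -0.007, 0.107, 0.088),
    ("OC", "No"): (-0.051, -0.112, 0.011, 0.110),
}

implied = {
    key: wald_p_from_ci(u, lo, hi) for key, (u, lo, hi, _) in table2_primary.items()
}
for key, p in implied.items():
    print(key, round(p, 3), "reported:", table2_primary[key][3])
```

The implied p-values agree with the reported ones to within rounding of the table entries, which is what one would expect if the intervals and p-values come from the same approximately normal reference distribution.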
6 |. CONCLUSION
We have described a very general framework to test for G×E additive interactions exploiting G-E independence in case-control studies. The proposed strategy has several advantages over existing RERI-based strategies. In particular, it does not require a regression model for the outcome, and therefore is robust to model misspecification of the outcome regression, a potential concern particularly if E is a count or continuous variable and additional covariates are included in the regression.
The approach put forward in this paper is closely related to the semiparametric framework of [15], which characterized the set of influence functions of a model of interaction (on the additive or multiplicative scale) under a semiparametric union model in which only a subset but not all of the parametric models used to describe the data generating mechanism need to be correct for valid inference. In fact, one can show that our proposed test statistic belongs to the general class of test statistics for additive interaction associated with the set of influence functions in [15]. However, because [15] did not allow for outcome dependent sampling and only considered standard prospective random sampling, not all test statistics in their class may be used under a case-control design. Thus, an important contribution of the current paper has been to characterize the subset of the class of test statistics of an additive interaction that may be used both under prospective and retrospective sampling.
An important limitation of the proposed approach is that it does not readily produce an estimate of the risk difference parameters, which are often of primary interest for understanding the public health significance of any significant finding. To obtain such estimates, one would need an estimate of the main effects of the exposures, which are treated as unspecified nuisance parameters in the proposed approach for the sake of robustness. Addressing this limitation is a priority for future research to extend the methods described herein.
TABLE 2.
Results for tests of null hypothesis of no additive G × E interaction between presence of BRCA1/2 mutation (G) and number of years of oral contraceptive (OC) use, and parity (E variables), with and without G-E independence assumption. U is the proposed (standardized) test statistic, and its 95% bootstrap confidence interval, p-value, and the RERI test based on retrospective profile likelihood (Han p-val) are provided.
| E | Assume G-E indept | U (primary) | 95% CI (primary) | P-val (primary) | Han P-val (primary) | U (sensitivity) | 95% CI (sensitivity) | P-val (sensitivity) | Han P-val (sensitivity) |
|---|---|---|---|---|---|---|---|---|---|
| Parity | Yes | −0.044 | (−0.084, −0.004) | 0.030 | 0.046 | −0.039 | (−0.075, −0.004) | 0.028 | 0.040 |
| Parity | No | −0.052 | (−0.102, −0.001) | 0.047 | 0.107 | −0.048 | (−0.095, −0.001) | 0.046 | 0.089 |
| OC | Yes | 0.050 | (−0.007, 0.107) | 0.088 | 0.685 | 0.044 | (−0.013, 0.100) | 0.129 | 0.584 |
| OC | No | −0.051 | (−0.112, 0.011) | 0.110 | 0.666 | −0.052 | (−0.126, 0.022) | 0.169 | 0.699 |
ACKNOWLEDGEMENT
The research was supported by U.S. National Institutes of Health grants R01AI104459, R01AI127271, R01CA222147. We thank the associate editor and two reviewers for their helpful comments.
APPENDIX
1. Proof that .
To show the result requires the influence function of which is of the form
where , where the first component is the score of p1(0), the second component is the score of p2(0), the last component is the score of ω, and θ = (ω, p1, p2). Standard matrix algebra can be used to show that at the submodel where A1 and A2 are independent IF = (IF1, IF2, IF3) where:
A Taylor series argument then gives
Upon noting that the above four terms are mutually uncorrelated, we have that:
where
A similar derivation shows that
which gives
proving the result.
2. Asymptotic variance for unified class of test statistics
Our proposed test statistic is then given by , where is an estimate of one can derive using a standard Taylor series argument:
| (1) |
where Wθ(g) is the derivative of W(g, θ) with respect to θ evaluated at the truth, and is the influence function of [1]. For instance, when is a maximum likelihood estimator, , where Sθ denotes the score of θ. Under the assumption that A1 and A2 are independent, we may set and redefine θ = (α1, α2). Also note that under independence, the joint density (3) simplifies to f(A1, A2|X) = f(A1|X)f(A2|X), which simplifies the above expression for the asymptotic variance of the test statistic.
3. Proof of Lemma 1.
Consider the nonparametric additive representation of μ(a1, a2, x) given by μ(a1, a2, x) = β1(a1, x) + β2(a2, x) + β3(a1, a2, x) + β4(x) where β1(a1, x) is the main effect of A1 and satisfies β1(0, x) = 0, likewise β2(a2, x) is the main effect of A2 and satisfies β2(0, x) = 0, β3(a1, a2, x) is the additive interaction between A1 and A2 and satisfies β3(0, a2, x) = β3(a1, 0, x) = 0, and β4(x) is the main effect of X. For any function g, note that
since
furthermore by symmetry,
and finally
therefore
for any choice of g. Thus, the null of no additive interaction β3(a1, a2, x) = 0 for all (a1, a2, x) implies that E {W(g)|D = 1, x} = 0. We get the result in the other direction by choosing g(a1, a2, x) = g*(a1, a2, x) = β3(a1, a2, x) which gives
which in turn implies that β3(a1, a2, x) = 0 for all (a1, a2, x) proving the result. □
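The lemma can be illustrated in the simplest case of two independent binary exposures without covariates. The Python sketch below, which is an illustration under stated assumptions rather than part of the proof, enumerates the four exposure cells and evaluates the population mean of (A1 − p1)(A2 − p2)D under the additive representation μ(a1, a2) = b0 + b1·a1 + b2·a2 + b3·a1·a2. The function name `expected_u` and all numerical values are hypothetical. Conditioning on D = 1 only divides this mean by Pr(D = 1), so it does not affect whether the mean is zero.

```python
def expected_u(p1, p2, b0, b1, b2, b3):
    """Population mean of (A1 - p1) * (A2 - p2) * D for independent binary A1, A2."""
    total = 0.0
    for a1 in (0, 1):
        for a2 in (0, 1):
            prob = (p1 if a1 else 1 - p1) * (p2 if a2 else 1 - p2)
            mu = b0 + b1 * a1 + b2 * a2 + b3 * a1 * a2  # Pr(D = 1 | a1, a2)
            total += prob * mu * (a1 - p1) * (a2 - p2)
    return total


def closed_form(p1, p2, b3):
    """Value implied by the algebra: only the interaction term survives."""
    return b3 * p1 * (1 - p1) * p2 * (1 - p2)


p1, p2 = 0.2, 0.3
null_value = expected_u(p1, p2, b0=0.01, b1=0.02, b2=0.015, b3=0.0)
alt_value = expected_u(p1, p2, b0=0.01, b1=0.02, b2=0.015, b3=0.005)
print(null_value, alt_value, closed_form(p1, p2, 0.005))
```

In this binary setting the mean equals b3·p1(1 − p1)·p2(1 − p2), so it is zero exactly when the additive interaction b3 is zero, whatever the main effects are.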
4. Failure of RERI-based approaches with continuous exposure
We describe in some detail the failure of RERI-based approaches that use standard logistic regression when at least one exposure is non-discrete and auxiliary covariates are present. In this vein, suppose that A1 is continuous, while A2 may be binary. In practice, to evaluate RERI in this context, one typically proceeds by estimating a standard logistic regression for Pr {D = 1|a1, a2, x} using a simple parametric formulation of the model, such as:
| (2) |
where logit(p) =log{p/(1 − p)} and the parameters (α0, α1, α2, α3, α4) are variation independent [2]. Below, we argue that such a standard logistic regression will generally be incompatible with the null hypothesis of no additive interaction if both exposures (a1, a2) have a non-null association with the outcome. Specifically, suppose that the main effects of A1 and A2 and X are correctly specified in the logistic model, i.e.
with α1 ≠ 0 and α2 ≠ 0. Then there will generally be no parameter value of (α0, α1, α2, α3, α4) that encodes the null hypothesis of no additive interaction. Consequently, any RERI-type test based on model (2) will generally have inflated type I error rate for testing the null of no additive interaction.
To further understand the failure of RERI in this context, note that under the null hypothesis of no additive interaction
| (3) |
for all possible values of a1 and a2. We show that RERI = 0, or equivalently, α3 = 0, fails to imply the above null. The interaction function on the log odds ratio scale is given by
in which case, correct specification of a logistic model for Pr {D = 1|a1, a2, x} under a null additive interaction is of the form
| (4) |
Because of the nonlinear dependence of θ on a1 and x, it is clear that model (4) cannot be nested in the standard logistic model (2), and therefore the latter cannot be used to obtain a valid test of the null hypothesis of no additive interaction.
In order to implement an LRT of additive interaction using the RERI approach, an analyst would need to carefully specify a model for the odds ratio interaction, so that model (4) is recovered under the null of no additive interaction. Such a parametrization of the outcome regression will characteristically be nonstandard in the sense that the interaction of the resulting logistic model would need to be explicitly defined as a function of models for both exposure main effects, and the effect of covariates. Such a parametrization of a logistic model would seldom naturally arise in practice purely on scientific basis. Furthermore, one would generally be unable to easily obtain parameter estimates for such a model using off-the-shelf statistical software for standard logistic regression, which completely undermines the often quoted practical advantage of the RERI approach.
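The incompatibility described in this section can be checked numerically. The Python sketch below assumes logistic main effects for a continuous A1 and a binary A2, imposes the null of no additive interaction on the risk scale, and computes the interaction that this implies on the log odds ratio scale as a function of a1. It takes the interaction in the standard model to be a single product term α3·a1·a2, which is an assumption about the form of model (2); the parameter values and function names are illustrative.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Log odds."""
    return math.log(p / (1.0 - p))


a0 = logit(0.01)          # baseline risk of 1%
alpha1 = math.log(2.0)    # log odds ratio per unit of the continuous exposure A1
alpha2 = math.log(2.0)    # log odds ratio for the binary exposure A2


def implied_log_or_interaction(a1):
    """Log odds ratio interaction at (a1, a2 = 1) implied by a null additive interaction."""
    risk_a1_only = expit(a0 + alpha1 * a1)
    risk_a2_only = expit(a0 + alpha2)
    # No additive interaction: risks add on the probability scale.
    risk_both = risk_a1_only + risk_a2_only - expit(a0)
    return logit(risk_both) - a0 - alpha1 * a1 - alpha2


for a1 in (0.5, 1.0, 2.0, 3.0):
    theta = implied_log_or_interaction(a1)
    # A product term alpha3 * a1 would require theta / a1 to be constant in a1.
    print(a1, round(theta, 4), round(theta / a1, 4))
```

The ratio in the last column changes with a1, so no single coefficient on a1·a2 reproduces the implied interaction; this is the nonlinear dependence on a1 referred to above.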
5. Additional simulation studies
5.1. Power of the tests under alternative values of Pr(G=1) and type I error at significance level of 1%
Figure 1:
Power of the various tests for identifying additive G-E interaction, in the simple setting in which both a1 and a2 are binary and no covariates are used in the model, with Pr(G = 1) = 0.2. The exposures are simulated with Pr(a1 = 1) = 0.2 and Pr(a2 = 1) = 0.2, and the disease model Pr(D = 1|a1, a2) has fixed baseline risk α0 = logit(0.01); the main effects of a1 and a2 are varied as α1 and α2, and the RERI is varied from 0 to 0.5. The compared tests are the proposed ‘U’ and ‘U ind’ (without and with assuming G-E independence), ‘prosp’, the standard test under prospective logistic regression, and ‘Han’ and ‘Han ind’, the retrospective profile likelihood tests proposed in Han (2012) without and with assuming G-E independence.
Figure 2:
Power of the various tests for identifying additive G-E interaction, in the simple setting in which both a1 and a2 are binary and no covariates are used in the model, with Pr(G = 1) = 0.05. The exposures are simulated with Pr(a1 = 1) = 0.05 and Pr(a2 = 1) = 0.2, and the disease model Pr(D = 1|a1, a2) has fixed baseline risk α0 = logit(0.01); the main effects of a1 and a2 are varied as α1 and α2, and the RERI is varied from 0 to 0.5. The compared tests are the proposed ‘U’ and ‘U ind’ (without and with assuming G-E independence), ‘prosp’, the standard test under prospective logistic regression, and ‘Han’ and ‘Han ind’, the retrospective profile likelihood tests proposed in Han (2012) without and with assuming G-E independence.
Table 1:
The type I error of the compared tests at significance level of 1%, under various combinations of the prevalence of the genetic variant a1, and the effect of the genetic and environmental variables on the disease outcome (α1 and α2, respectively). The tests ‘U’ and ‘U ind’ are the proposed tests without and with the assumption of G-E independence. ‘prosp’ is the usual test based on standard logistic regression, and ‘Han’ and ‘Han ind’ are the tests based on retrospective profile likelihood proposed by [3]. The type I error was calculated from 10,000 simulations, each with 4000 cases and 4000 controls.
| α1: | log(0.7) | log(0.7) | log(0.7) | log(1.2) | log(1.2) | log(1.2) | log(2.0) | log(2.0) | log(2.0) |
|---|---|---|---|---|---|---|---|---|---|
| α2: | log(0.7) | log(1.2) | log(2.0) | log(0.7) | log(1.2) | log(2.0) | log(0.7) | log(1.2) | log(2.0) |
| Pr(G=1)=0.5 | |||||||||
| U | 0.010 | 0.010 | 0.011 | 0.010 | 0.010 | 0.009 | 0.011 | 0.012 | 0.010 |
| U ind | 0.011 | 0.009 | 0.008 | 0.007 | 0.009 | 0.008 | 0.009 | 0.010 | 0.010 |
| prosp | 0.010 | 0.008 | 0.009 | 0.009 | 0.009 | 0.009 | 0.010 | 0.010 | 0.009 |
| Han | 0.009 | 0.010 | 0.010 | 0.010 | 0.009 | 0.010 | 0.010 | 0.012 | 0.010 |
| Han ind | 0.011 | 0.009 | 0.009 | 0.007 | 0.009 | 0.008 | 0.009 | 0.010 | 0.011 |
| Pr(G=1)=0.2 | |||||||||
| U | 0.011 | 0.012 | 0.010 | 0.011 | 0.012 | 0.012 | 0.011 | 0.012 | 0.010 |
| U ind | 0.010 | 0.010 | 0.012 | 0.010 | 0.008 | 0.011 | 0.011 | 0.011 | 0.011 |
| prosp | 0.010 | 0.008 | 0.008 | 0.009 | 0.009 | 0.009 | 0.008 | 0.009 | 0.008 |
| Han | 0.010 | 0.009 | 0.009 | 0.011 | 0.010 | 0.010 | 0.009 | 0.009 | 0.009 |
| Han ind | 0.011 | 0.010 | 0.011 | 0.010 | 0.009 | 0.011 | 0.011 | 0.010 | 0.011 |
| Pr(G=1)=0.05 | |||||||||
| U | 0.011 | 0.016 | 0.017 | 0.014 | 0.020 | 0.017 | 0.018 | 0.016 | 0.016 |
| U ind | 0.010 | 0.013 | 0.011 | 0.010 | 0.014 | 0.011 | 0.010 | 0.009 | 0.011 |
| prosp | 0.006 | 0.010 | 0.013 | 0.007 | 0.012 | 0.012 | 0.010 | 0.010 | 0.011 |
| Han | 0.009 | 0.009 | 0.010 | 0.009 | 0.010 | 0.009 | 0.013 | 0.009 | 0.009 |
| Han ind | 0.010 | 0.011 | 0.010 | 0.009 | 0.012 | 0.011 | 0.010 | 0.008 | 0.010 |
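When reading Table 1, it can help to know how much an empirical rejection rate fluctuates by chance alone. The Python sketch below computes an approximate 95% Monte Carlo band around the nominal 1% level for 10,000 replicates, assuming independent replicates and a normal approximation to the binomial; the function name `mc_band` is illustrative.

```python
import math


def mc_band(alpha, n_sim, z=1.96):
    """Approximate 95% band for an empirical rejection rate when the true rate is alpha."""
    se = math.sqrt(alpha * (1.0 - alpha) / n_sim)
    return alpha - z * se, alpha + z * se


lower, upper = mc_band(0.01, 10_000)
print(f"{lower:.4f} to {upper:.4f}")
```

Under these assumptions the band runs from roughly 0.008 to 0.012. Most entries of Table 1 fall inside it, while several entries for ‘U’ at Pr(G=1)=0.05 (0.014 to 0.020) lie above it.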
5.2. Violation of G-E independence assumption
In this section, we evaluate the performance of the proposed tests under the scenario where G and E are in fact not independent. This is done by generating the binary environmental variable as Binomial(Pr(E = 1|G = g) = −1 − 0.1g). As expected, the tests that rely on G-E independence are no longer valid.
Table 2:
The type I error of the compared tests, under various combinations of the prevalence of the genetic variant a1, and the effect of the genetic and environmental variables on the disease outcome (α1 and α2, respectively). The tests ‘U’ and ‘U ind’ are the proposed tests without and with the assumption of G-E independence. ‘prosp’ is the usual test based on standard logistic regression, and ‘Han’ and ‘Han ind’ are the tests based on retrospective profile likelihood proposed by [3]. The type I error was calculated from 10,000 simulations, each with 4000 cases and 4000 controls.
| α1: | log(0.7) | log(0.7) | log(0.7) | log(1.2) | log(1.2) | log(1.2) | log(2.0) | log(2.0) | log(2.0) |
|---|---|---|---|---|---|---|---|---|---|
| α2: | log(0.7) | log(1.2) | log(2.0) | log(0.7) | log(1.2) | log(2.0) | log(0.7) | log(1.2) | log(2.0) |
| Pr(G=1)=0.5 | |||||||||
| U | 0.048 | 0.049 | 0.051 | 0.049 | 0.05 | 0.047 | 0.052 | 0.052 | 0.048 |
| U ind | 0.22 | 0.303 | 0.372 | 0.249 | 0.301 | 0.33 | 0.253 | 0.274 | 0.272 |
| prosp | 0.047 | 0.049 | 0.047 | 0.048 | 0.049 | 0.046 | 0.049 | 0.051 | 0.048 |
| Han | 0.047 | 0.049 | 0.05 | 0.049 | 0.05 | 0.048 | 0.05 | 0.053 | 0.051 |
| Han ind | 0.22 | 0.303 | 0.374 | 0.249 | 0.302 | 0.33 | 0.252 | 0.272 | 0.27 |
| Pr(G=1)=0.2 | |||||||||
| U | 0.05 | 0.048 | 0.053 | 0.049 | 0.05 | 0.051 | 0.049 | 0.052 | 0.047 |
| U ind | 0.133 | 0.187 | 0.242 | 0.202 | 0.233 | 0.254 | 0.26 | 0.268 | 0.26 |
| prosp | 0.049 | 0.046 | 0.05 | 0.046 | 0.048 | 0.05 | 0.045 | 0.049 | 0.044 |
| Han | 0.051 | 0.049 | 0.053 | 0.047 | 0.05 | 0.052 | 0.05 | 0.051 | 0.049 |
| Han ind | 0.125 | 0.181 | 0.233 | 0.195 | 0.226 | 0.243 | 0.252 | 0.26 | 0.25 |
| Pr(G=1)=0.05 | |||||||||
| U | 0.051 | 0.056 | 0.051 | 0.05 | 0.05 | 0.049 | 0.052 | 0.043 | 0.049 |
| U ind | 0.081 | 0.1 | 0.115 | 0.101 | 0.115 | 0.116 | 0.14 | 0.137 | 0.134 |
| prosp | 0.043 | 0.05 | 0.046 | 0.041 | 0.044 | 0.045 | 0.042 | 0.039 | 0.042 |
| Han | 0.052 | 0.054 | 0.051 | 0.049 | 0.05 | 0.05 | 0.051 | 0.047 | 0.054 |
| Han ind | 0.071 | 0.086 | 0.101 | 0.09 | 0.1 | 0.105 | 0.13 | 0.123 | 0.12 |
Figure 3:
Power of the various tests for identifying additive G-E interaction under violation of G-E independence, in the simple setting in which both a1 and a2 are binary and no covariates are used in the model, with Pr(G = 1) = 0.5. The exposures are simulated with Pr(a1 = 1) = 0.5 and Pr(a2 = 1) = 0.2, and the disease model Pr(D = 1|a1, a2) has fixed baseline risk α0 = logit(0.01); the main effects of a1 and a2 are varied as α1 and α2, and the RERI is varied from 0 to 0.5. The compared tests are the proposed ‘U’ and ‘U ind’ (without and with assuming G-E independence), ‘prosp’, the standard test under prospective logistic regression, and ‘Han’ and ‘Han ind’, the retrospective profile likelihood tests proposed in Han (2012) without and with assuming G-E independence.
Figure 4:
Power of the various tests for identifying additive G-E interaction under violation of G-E independence, in the simple setting in which both a1 and a2 are binary and no covariates are used in the model, with Pr(G = 1) = 0.2. The exposures are simulated with Pr(a1 = 1) = 0.2 and Pr(a2 = 1) = 0.2, and the disease model Pr(D = 1|a1, a2) has fixed baseline risk α0 = logit(0.01); the main effects of a1 and a2 are varied as α1 and α2, and the RERI is varied from 0 to 0.5. The compared tests are the proposed ‘U’ and ‘U ind’ (without and with assuming G-E independence), ‘prosp’, the standard test under prospective logistic regression, and ‘Han’ and ‘Han ind’, the retrospective profile likelihood tests proposed in Han (2012) without and with assuming G-E independence.
Figure 5:
Power of the various tests for identifying additive G-E interaction under violation of G-E independence, in the simple setting in which both a1 and a2 are binary and no covariates are used in the model, with Pr(G = 1) = 0.05. The exposures are simulated with Pr(a1 = 1) = 0.05 and Pr(a2 = 1) = 0.2, and the disease model Pr(D = 1|a1, a2) has fixed baseline risk α0 = logit(0.01); the main effects of a1 and a2 are varied as α1 and α2, and the RERI is varied from 0 to 0.5. The compared tests are the proposed ‘U’ and ‘U ind’ (without and with assuming G-E independence), ‘prosp’, the standard test under prospective logistic regression, and ‘Han’ and ‘Han ind’, the retrospective profile likelihood tests proposed in Han (2012) without and with assuming G-E independence.
5.3. Violation of rare outcome assumption
In this section, we evaluate the performance of the proposed tests under the scenario where the outcome is not very rare, with baseline risk equal to 0.1 (i.e. α0 = logit(0.1)), which is ten times the original baseline risk. In most of the coefficient settings, the tests are still valid with similar power. We observed inflated type I error under large G and E main effects with Pr(G = 1) = 0.5 (Table 3, α1 = α2 = log(2.0), where the rejection rates range from 0.102 to 0.205).
Table 3:
The type I error of the compared tests, under various combinations of the prevalence of the genetic variant a1, and the effect of the genetic and environmental variables on the disease outcome (α1 and α2, respectively). The tests ‘U’ and ‘U ind’ are the proposed tests without and with the assumption of G-E independence. ‘prosp’ is the usual test based on standard logistic regression, and ‘Han’ and ‘Han ind’ are the tests based on retrospective profile likelihood proposed by [3]. The type I error was calculated from 10,000 simulations, each with 4000 cases and 4000 controls.
| α1: | log(0.7) | log(0.7) | log(0.7) | log(1.2) | log(1.2) | log(1.2) | log(2.0) | log(2.0) | log(2.0) |
|---|---|---|---|---|---|---|---|---|---|
| α2: | log(0.7) | log(1.2) | log(2.0) | log(0.7) | log(1.2) | log(2.0) | log(0.7) | log(1.2) | log(2.0) |
| Pr(G=1)=0.5 | |||||||||
| U | 0.052 | 0.052 | 0.069 | 0.049 | 0.052 | 0.047 | 0.077 | 0.052 | 0.102 |
| U ind | 0.058 | 0.052 | 0.079 | 0.053 | 0.054 | 0.057 | 0.078 | 0.060 | 0.201 |
| prosp | 0.058 | 0.046 | 0.054 | 0.046 | 0.052 | 0.052 | 0.067 | 0.054 | 0.119 |
| Han | 0.056 | 0.050 | 0.061 | 0.047 | 0.054 | 0.054 | 0.067 | 0.058 | 0.127 |
| Han ind | 0.058 | 0.053 | 0.080 | 0.053 | 0.055 | 0.057 | 0.077 | 0.060 | 0.205 |
| Pr(G=1)=0.2 | |||||||||
| U | 0.050 | 0.052 | 0.071 | 0.053 | 0.046 | 0.046 | 0.071 | 0.042 | 0.042 |
| U ind | 0.054 | 0.055 | 0.078 | 0.050 | 0.047 | 0.052 | 0.074 | 0.058 | 0.058 |
| prosp | 0.055 | 0.048 | 0.063 | 0.048 | 0.044 | 0.047 | 0.063 | 0.043 | 0.043 |
| Han | 0.058 | 0.050 | 0.061 | 0.049 | 0.049 | 0.055 | 0.062 | 0.051 | 0.051 |
| Han ind | 0.057 | 0.054 | 0.074 | 0.048 | 0.046 | 0.054 | 0.071 | 0.059 | 0.059 |
| Pr(G=1)=0.05 | |||||||||
| U | 0.046 | 0.053 | 0.068 | 0.057 | 0.048 | 0.046 | 0.062 | 0.043 | 0.029 |
| U ind | 0.049 | 0.053 | 0.063 | 0.054 | 0.052 | 0.050 | 0.062 | 0.048 | 0.072 |
| prosp | 0.043 | 0.043 | 0.061 | 0.045 | 0.042 | 0.043 | 0.049 | 0.037 | 0.028 |
| Han | 0.055 | 0.046 | 0.053 | 0.052 | 0.049 | 0.051 | 0.051 | 0.052 | 0.063 |
| Han ind | 0.052 | 0.052 | 0.058 | 0.053 | 0.052 | 0.053 | 0.055 | 0.048 | 0.083 |
Figure 6:
Power of the various tests for identifying additive G-E interaction, in the simple setting in which both a1 and a2 are binary and no covariates are used in the model, with Pr(G = 1) = 0.5. The exposures are simulated with Pr(a1 = 1) = 0.5 and Pr(a2 = 1) = 0.2, and the disease model Pr(D = 1|a1, a2) has fixed baseline risk α0 = logit(0.1); the main effects of a1 and a2 are varied as α1 and α2, and the RERI is varied from 0 to 0.5. The compared tests are the proposed ‘U’ and ‘U ind’ (without and with assuming G-E independence), ‘prosp’, the standard test under prospective logistic regression, and ‘Han’ and ‘Han ind’, the retrospective profile likelihood tests proposed in Han (2012) without and with assuming G-E independence.
Figure 7:
Power of the various tests for identifying additive G-E interaction, in the simple setting in which both a1 and a2 are binary and no covariates are used in the model, with Pr(G = 1) = 0.2. The exposures are simulated with Pr(a1 = 1) = 0.2 and Pr(a2 = 1) = 0.2, and the disease model Pr(D = 1|a1, a2) has fixed baseline risk α0 = logit(0.1); the main effects of a1 and a2 are varied as α1 and α2, and the RERI is varied from 0 to 0.5. The compared tests are the proposed ‘U’ and ‘U ind’ (without and with assuming G-E independence), ‘prosp’, the standard test under prospective logistic regression, and ‘Han’ and ‘Han ind’, the retrospective profile likelihood tests proposed in Han (2012) without and with assuming G-E independence.
Figure 8:
Power of the various tests for identifying additive G-E interaction, in the simple setting in which both a1 and a2 are binary and no covariates are used in the model, with Pr(G = 1) = 0.05. The exposures are simulated with Pr(a1 = 1) = 0.05 and Pr(a2 = 1) = 0.2, and the disease model Pr(D = 1|a1, a2) has fixed baseline risk α0 = logit(0.1); the main effects of a1 and a2 are varied as α1 and α2, and the RERI is varied from 0 to 0.5. The compared tests are the proposed ‘U’ and ‘U ind’ (without and with assuming G-E independence), ‘prosp’, the standard test under prospective logistic regression, and ‘Han’ and ‘Han ind’, the retrospective profile likelihood tests proposed in Han (2012) without and with assuming G-E independence.
5.4. Different type of environmental variable: count data
In this section, we evaluate the performance of the proposed tests when the environmental variable E is count data following Poisson distribution with mean equal to 1. RERI based on parametric logistic regression can have slightly inflated type I error under relatively extreme gene or environment effects.
Table 4:
The type I error of the compared tests, under various combinations of the prevalence of the genetic variant a1, and the effect of the genetic and environmental variables on the disease outcome (α1 and α2, respectively). The tests ‘U’ and ‘U ind’ are the proposed tests without and with the assumption of G-E independence. ‘prosp’ is the usual test based on standard logistic regression. The type I error was calculated from 10,000 simulations, each with 4000 cases and 4000 controls.
| α1: | log(0.7) | log(0.7) | log(0.7) | log(1.2) | log(1.2) | log(1.2) | log(2.0) | log(2.0) | log(2.0) |
|---|---|---|---|---|---|---|---|---|---|
| α2: | log(0.7) | log(1.2) | log(2.0) | log(0.7) | log(1.2) | log(2.0) | log(0.7) | log(1.2) | log(2.0) |
| Pr(G=1)=0.5 | |||||||||
| U | 0.045 | 0.045 | 0.046 | 0.048 | 0.048 | 0.048 | 0.049 | 0.049 | 0.049 |
| U ind | 0.051 | 0.048 | 0.047 | 0.048 | 0.049 | 0.049 | 0.050 | 0.049 | 0.049 |
| prosp | 0.048 | 0.048 | 0.050 | 0.051 | 0.051 | 0.050 | 0.049 | 0.049 | 0.053 |
| Pr(G=1)=0.2 | |||||||||
| U | 0.046 | 0.045 | 0.047 | 0.048 | 0.048 | 0.048 | 0.049 | 0.049 | 0.049 |
| U ind | 0.050 | 0.049 | 0.047 | 0.049 | 0.049 | 0.049 | 0.049 | 0.049 | 0.049 |
| prosp | 0.048 | 0.048 | 0.051 | 0.051 | 0.050 | 0.049 | 0.049 | 0.049 | 0.054 |
| Pr(G=1)=0.05 | |||||||||
| U | 0.045 | 0.045 | 0.048 | 0.048 | 0.048 | 0.049 | 0.049 | 0.049 | 0.049 |
| U ind | 0.049 | 0.048 | 0.048 | 0.049 | 0.049 | 0.049 | 0.049 | 0.049 | 0.049 |
| prosp | 0.048 | 0.047 | 0.050 | 0.051 | 0.050 | 0.049 | 0.048 | 0.048 | 0.054 |
6. R codes for simulation and data application
Below we provide the R code for the simulations and the data application. The code is also available at https://github.com/shixu0830/A-test-for-gene-environment-additive-interaction.
################################################################
### Simulation code for Binary G and Binary E:
### Xu Shi (xushi@hsph.harvard.edu)
### Benedict H.W.Wong (wong01@g.harvard.edu)
### Tamar Sofer (tsofer@bwh.harvard.edu)
################################################################
rm(list = ls())
library(alr3)
library(maxLik)
library(MASS)
library(boot)
require(CGEN)
source("functions.R")
a0.seq = c(logit(0.01), logit(0.05), logit(0.1)) ## baseline risk
g.mean.seq = c(0.5, 0.2, 0.05)
signal.seq = seq(0, 0.5, by = 0.1) ## 0: typeIerror; >0: power
## we here look at a few combinations of gene/environment effects
env.eff.seq = c(log(0.7), log(1.2), log(2))
gene.eff.seq = c(log(0.7), log(1.2), log(2))
### select one comb of the above seq
gene.eff = gene.eff.seq[1] #1,2,3
env.eff = env.eff.seq[1] #1,2,3
a0 = a0.seq[1] #1,2
g.mean = g.mean.seq[1] #1,2,3
sig = signal.seq[1] #1,2,3
## setting the coefficient of g*e such that RERI = sig.
gene.env.eff = logit((sig - 1)*expit(a0) + expit(a0 + env.eff) + expit(a0 + gene.eff)) - a0 - gene.eff - env.eff
n.sim = 10000
nn = 4000
N <- nn/0.001
n.cont <- nn
n.case <- nn
n = n.cont*2
test.stat.U = pval.U = test.stat.U.ind = pval.U.ind = pvals.reri = pvals.UML = pvals.mult = pvals.wald.mult.indep = pvals.mult.indep = pvals.han = pvals.han.indep = test.stat.RERI = pval.RERI = rep(NA, n.sim)
for (i in 1:n.sim){
  set.seed(i)
  e <- rbinom(N, 1, 0.2)
  g <- rbinom(N, 1, g.mean)
  pD <- expit(a0 + env.eff*e + gene.eff*g + gene.env.eff*g*e)
  D <- rbinom(length(pD), 1, pD)
  cases <- as.data.frame(cbind(e, g, D)[sample(which(D == 1), n.case),])
  conts <- as.data.frame(cbind(e, g, D)[sample(which(D == 0), n.cont),])
  d <- c(cases$D, conts$D)
  e <- c(cases$e, conts$e)
  g <- c(cases$g, conts$g)
  ############################################################
  ### Han and RERI
  temp <- additive.test(data = data.frame(d = d, g = g, e = e), response.var = "d", snp.var = "g", exposure.var = "e")
  temp.indep <- additive.test(data = data.frame(d = d, g = g, e = e), response.var = "d", snp.var = "g", exposure.var = "e", op = list(indep = T))
  pvals.reri[i] <- temp$RERI$pval
  pvals.han[i] <- temp$pval.add
  pvals.han.indep[i] <- temp.indep$pval.add
  ############################################################
  ## RERI
  temp.RERI <- RERI.E.cont(d, g, e)
  test.stat.RERI[i] <- temp.RERI$test.stat
  pval.RERI[i] <- temp.RERI$pval
  ############################################################
  ## Proposed test stat U
  ### first: test that assumes GE independence
  probs.mod <- estimate.g.e.probs(g[which(d == 0)], e[which(d == 0)], ge.indep = T, return.equation = T)
  p10 <- probs.mod$cond.probs$p.g1.cond.e0
  p11 <- probs.mod$cond.probs$p.g1.cond.e1
  p20 <- probs.mod$cond.probs$p.e1.cond.g0
  u <- (g - p10)*(e - p20)*d
  mean.u = mean(u)
  alpha <- probs.mod$coef$alpha ## alpha is the parameter of the model f(g,e|D=0)
  exp.2 <- exp(alpha[2])
  exp.1 <- exp(alpha[1])
  ## derivative of U w.r.t. alpha
  partial.u.partial.V.p <- cbind(
    d*(0 - exp.1/((exp.1 + 1)^2))*(e - exp.2/(1 + exp.2)),
    d*(g - exp.1/(exp.1 + 1))*(0 - exp.2/((1 + exp.2)^2))
  )
  # its mean
  E.partial.u.partial.V.p <- colMeans(partial.u.partial.V.p)
  alpha <- alpha[1:2]
  ## the equations accounting for estimation of alpha
  term.account.p <- n*rbind(matrix(0, nrow = sum(d == 1), ncol = length(alpha)), probs.mod$equation) %*% E.partial.u.partial.V.p
  u.account.p.ind <- u - term.account.p
  sd.u <- sqrt(mean(u.account.p.ind^2)/(n))
  test.stat.U.ind[i] <- mean.u/sd.u
  pval.U.ind[i] = (1 - pnorm(abs(mean.u/sd.u)))*2
  ### second: test that does not assume GE independence
  probs.mod <- estimate.g.e.probs(g[which(d == 0)], e[which(d == 0)], ge.indep = F, return.equation = T)
  p10 <- probs.mod$cond.probs$p.g1.cond.e0
  p11 <- probs.mod$cond.probs$p.g1.cond.e1
  p20 <- probs.mod$cond.probs$p.e1.cond.g0
  alpha <- probs.mod$coef$alpha
  u <- exp(-alpha[3]*g*e)*(g - p10)*(e - p20)*d
  mean.u = mean(u)
  exp.2 <- exp(alpha[2])
  exp.1 <- exp(alpha[1])
  term.3 <- exp(-alpha[3]*g*e)
  ## derivative of U w.r.t. alpha
  partial.u.partial.V.p <- cbind(
    d*term.3*(0 - exp.1/((exp.1 + 1)^2))*(e - exp.2/(1 + exp.2)),
    d*term.3*(g - exp.1/(exp.1 + 1))*(0 - exp.2/((1 + exp.2)^2)),
    -d*term.3*g*e*(g - exp.1/(exp.1 + 1))*(e - exp.2/(1 + exp.2))
  )
  # its mean
  E.partial.u.partial.V.p <- colMeans(partial.u.partial.V.p)
  term.account.p <- n*rbind(matrix(0, nrow = sum(d == 1), ncol = length(alpha)), probs.mod$equation) %*% E.partial.u.partial.V.p
  u.account.p <- u - term.account.p
  sd.u <- sqrt(mean(u.account.p^2)/n)
  test.stat.U[i] <- mean.u/sd.u
  pval.U[i] = (1 - pnorm(abs(mean.u/sd.u)))*2
  if(i %% (n.sim/10) == 0) {print(i)}
}

################################################################
### Simulation code for Binary G and Count E:
### Xu Shi (xushi@hsph.harvard.edu)
### Benedict H.W.Wong (wong01@g.harvard.edu)
### Tamar Sofer (tsofer@bwh.harvard.edu)
################################################################
rm(list = ls())
library(alr3)
library(maxLik)
library(MASS)
library(boot)
require(CGEN)
source("functions.R")
a0.seq = c(logit(0.01), logit(0.05), logit(0.1)) ## baseline risk
g.mean.seq = c(0.5, 0.2, 0.05)
signal.seq = seq(0, 0.5, by = 0.1) ## 0: typeIerror; >0: power
## we here look at a few combinations of gene/environment effects
env.eff.seq = c(log(0.7), log(1.2), log(2))
gene.eff.seq = c(log(0.7), log(1.2), log(2))
## NOTE: a0, g.mean, gene.eff, env.eff and myseed are not set in this script;
## unless functions.R defines them, set them before the loop
## (e.g. as in the binary-E script above, with myseed an integer seed offset).
n.sim = 10000
nn = 4000
N = nn/0.001
n.cont = nn
n.case = nn
n = n.cont*2
pval = pval2 = pval.indept = NULL
for (i in 1:n.sim){
  #### simulate data
  set.seed((myseed - 1)*n.sim + i)
  e = rpois(N, lambda = 1)
  g = rbinom(N, 1, g.mean)
  pD = expit(a0 + gene.eff*g) + expit(a0 + env.eff*e) + expit(a0)
  D = rbinom(length(pD), 1, pD)
  cases = as.data.frame(cbind(e, g, D)[sample(which(D == 1), n.case),])
  conts = as.data.frame(cbind(e, g, D)[sample(which(D == 0), n.cont),])
  d = c(cases$D, conts$D)
  e = c(cases$e, conts$e)
  g = c(cases$g, conts$g)
  X = matrix(rep(1, length(g)), nrow = length(g))
  p = ncol(X)
  myg = g[d == 0]; mye = e[d == 0]; myX = as.matrix(X[d == 0,]);
  #### Proposed test without assuming G-E independence
  pval.i = get.pval(g, e, X, d, ge.indep = F)
  pval = c(pval, pval.i["p"])
  #### Proposed test assuming G-E independence
  pval.indept.i = get.pval(g, e, X, d, ge.indep = T)
  if(inherits(pval.indept.i, "try-error")){ pval.indept.i = NA }
  pval.indept = c(pval.indept, pval.indept.i["p"])
  #### RERI using standard logistic regression
  reri = RERI.E.cont(d, g, e)
  pval2 = c(pval2, reri$pval)
}
#### compute type I error at sig level of 0.05
mean(pval < 0.05)
mean(pval.indept < 0.05)
mean(pval2 < 0.05)

################################################################
### Data application: ovarian cancer study
### Xu Shi (xushi@hsph.harvard.edu)
### Benedict H.W.Wong (wong01@g.harvard.edu)
### Tamar Sofer (tsofer@bwh.harvard.edu)
################################################################
rm(list = ls())
library(alr3)
library(maxLik)
library(MASS)
library(CGEN)
source("functions.R")
############### get data ###############
data(Xdata, package = "CGEN")
dat = Xdata
dat$ethnicity.1 = as.numeric(dat$ethnic.group == 1)
dat$ethnicity.2 = as.numeric(dat$ethnic.group == 2)
dat$ethnicity.3 = as.numeric(dat$ethnic.group == 3)
dat$famhist.1 = as.numeric(dat$family.history == 1)
dat$famhist.2 = as.numeric(dat$family.history == 2)
dat$age50 = as.numeric(dat$age.group > 2) ## age at least 50
#### Adjusted covariates in the primary analysis
adjust.vars = c("age50", "ethnicity.1", "ethnicity.2", "BRCA.history", "famhist.1", "famhist.2")
#### Adjusted covariates in the sensitivity analysis: remove BRCA.history
#### (as written, this second assignment overrides the primary-analysis set above)
adjust.vars = c("age50", "ethnicity.1", "ethnicity.2", "famhist.1", "famhist.2")
rslt = NULL
for (children_or_oralcon in c(1, 2)){
  for (ge.indep in c(T, F)){
    boot.dat = dat
    d = boot.dat$case.control
    g = boot.dat$BRCA.status
    if(children_or_oralcon == 1) {
      e = boot.dat$n.children
    } else {
      e = boot.dat$oral.years
    }
X = cbind (intercept =1, as.matrix (boot.dat [, adjust.vars])) p= ncol (X) myg =g[d ==0]; mye =e[d ==0]; myX =as.matrix (X[d ==0,]); reri.p= RERI.E.cont.X(d,g,e,X) temp <− additive.test (data = data.frame (d=d,g=g,e=as.numeric (e< mean (e)),boot.dat [, adjust.vars]), response.var = “d“, snp.var=“g“, exposure.var =“e“, main.vars = adjust.vars, op = list (indep = ge.indep)) pvals.reri <− temp $ RERI $ pval pvals.han <− temp $ pval.add 20 ### our proposed tests if(ge.indep ==T){ ### assume independent MLE.rslt = maxLik (loglik.indept, start = rep (0,p*2)) par = MLE.rslt $ estimate phi =0; alpha = par [1:(p)]; beta = par [(1 + p):(p*2)] } else { ### not assume independent MLE.rslt = maxLik (loglik, start = rep (0,p*2 + 1)) par = MLE.rslt $ estimate phi = par [1]; alpha = par [2:(1 + p)]; beta = par [(2 + p):(1 + p*2)] } ### get u m.g = expit (X%*% alpha); m.e = exp (X%*% beta) w= exp (− phi*g*e); u = w*(g − m.g)*(e − m.e)*d ### if(ge.indep ==T){ ### get IF d.u.d.theta = cbind ( diag (as.numeric (w*(e − m.e)*d*(-m.g*(1-m.g))))%*%X, diag (as.numeric (w*(g − m.g)*d*(-m.e)))%*%X) E.d.u.d.theta = apply (d.u.d.theta,2, sum) ## order : (phi),g,e score = numDeriv :: jacobian (loglik.indept,c(par)) } else { ### get IF d.u.d.theta = cbind (-g*e*u, diag (as.numeric (w*(e − m.e)*d*(-m.g*(1-m.g))))%*%X, diag (as.numeric (w*(g − m.g)*d*(-m.e)))%*%X) E.d.u.d.theta = apply (d.u.d.theta,2, sum) ## order : phi,g,e score = numDeriv :: jacobian (loglik,c(par)) } score.IF= ginv (t(score)) adj.IF = score.IF%*%E.d.u.d.theta u.IF = c(u[d==1],u[d ==0]) + c(rep (0, length (which (d ==1))),adj.IF) (p =(1 − pnorm (abs (mean (u)*sqrt (length (u))/sd(u.IF))))*2) stat = sum (u)/sqrt (sum(u.IF ^2)) (CI = round (mean (u) + c(-1,0,1)*qnorm (0.975)*sd(u.IF)/sqrt (length (u)),3)) rslt = rbind (rslt,c(sprintf (“%.3 f“,phi),sprintf (“%.3 f“,CI [2]), paste0 (“(“,sprintf (“%.3 f“,CI [1]),“,“,sprintf (“%.3 f“,CI [3]),”)”) ,sprintf (“%.3 f“,p),sprintf (“%.3 f“,reri.p$ pval) ,sprintf (“%.3 
f“,pvals.reri),sprintf (“%.3 f“,pvals.han) )) } } rslt ################################################################ ### Functions ### Xu Shi (xushi@hsph.harvard.edu) ### Benedict H.W.Wong (wong01@g.harvard.edu) ### Tamar Sofer (tsofer@bwh.harvard.edu) ################################################################ expit = function (x){ exp (x)/(1 + exp (x))} logit = function (x){ log (x/(1 − x))} get.pval = function (g,e,X,d,ge.indep){ if(ge.indep ==T){ ### assume G-E independence MLE.rslt = maxLik (loglik.indept, start = rep (0,p*2)) par = MLE.rslt $ estimate phi =0; alpha = par [1:(p)]; beta = par [(1 + p):(p*2)] } else { ### not assume G-E independence MLE.rslt = maxLik (loglik, start = rep (1,p*2 + 1)) par = MLE.rslt $ estimate phi = par [1]; alpha = par [2:(1 + p)]; beta = par [(2 + p):(1 + p*2)] 21 } ### get u m.g = expit (X%*% alpha); m.e = exp (X%*% beta) w= exp (− phi*g*e); u = w*(g − m.g)*(e − m.e)*d if(ge.indep ==T){ ### get IF d.u.d.theta = cbind ( diag (as.numeric (w*(e − m.e)*d*(-m.g*(1-m.g))))%*%X, diag (as.numeric (w*(g − m.g)*d*(-m.e)))%*%X) E.d.u.d.theta = apply (d.u.d.theta,2, sum) ## order : (phi),g,e score = numDeriv :: jacobian (loglik.indept,c(par)) } else { ### get IF d.u.d.theta = cbind (-g*e*u, diag (as.numeric (w*(e − m.e)*d*(-m.g*(1-m.g))))%*%X, diag (as.numeric (w*(g − m.g)*d*(-m.e)))%*%X) E.d.u.d.theta = apply (d.u.d.theta,2, sum) ## order : phi,g,e score = numDeriv :: jacobian (loglik,c(par)) } score.IF= score %*% solve (t(score)%*%(score)) #or score.IF= ginv (t(score)) adj.IF = score.IF%*%E.d.u.d.theta u.IF = c(u[d==1],u[d ==0]) + c(rep (0, length (which (d ==1))),adj.IF) p =(1 − pnorm (abs (mean (u)*sqrt (length (u))/sd(u.IF))))*2 #or p =(1 − pnorm (abs (sum (u)/sqrt ( sum (u.IF ^2)))))*2 stat = sum (u)/sqrt (sum(u.IF ^2)) return (c(p=p, mean.u= mean (u),phi =phi, stat =stat,sd.u=sd(u.IF)/sqrt (length (u)) )) } loglik.indept = function (par){#### Function for the log likelihood assuming G-E indep ## get coefs phi =0; alpha = 
par [1:(ncol (myX))]; beta = par [(1 + ncol (myX)):(ncol (myX)*2)] ## compute joint destribution f(G,E|X) lambdaX = exp (myX%*% beta) ### proportional to f(G,E|X) fGE_cond_X = exp(myX%*% alpha*myg)/(1 + exp (myX%*% alpha))*#fG_cond_E0X lambdaX ^mye*exp(− lambdaX)/factorial (mye)*#fE_cond_G0X exp (phi*myg*mye) #OR(G,E|X), assume doesnt depend on X C_X = expit (myX %*% alpha)*exp (lambdaX*(exp(phi)−1)) + 1/(1 + exp (myX %*% alpha)) ## normalizing const fGE_cond_X=fGE_cond_X/C_X return ((log(fGE_cond_X))) } loglik = function (par){#### Function for the log likelihood without assuming G-E indep ## get coefs phi = par [1]; alpha = par [2:(1 + ncol (myX))]; beta = par [(2 + ncol (myX)):(1 + ncol (myX)*2)] ## compute joint destribution f(G,E|X) lambdaX = exp (myX%*% beta) ### proportional to f(G,E|X) fGE_cond_X = exp(myX%*% alpha*myg)/(1 + exp (myX%*% alpha))*#fG_cond_E0X lambdaX ^mye*exp(− lambdaX)/factorial (mye)*#fE_cond_G0X exp (phi*myg*mye) #OR(G,E|X), assume doesnt depend on X C_X = expit (myX %*% alpha)*exp (lambdaX*(exp(phi) −1)) + 1/(1 + exp (myX %*% alpha)) ## normalizing const fGE_cond_X=fGE_cond_X/C_X return ((log(fGE_cond_X))) } RERI.E.cont = function (d,g,e){#### RERI using standard logistic regression for the outcome mod = glm (d ~ g + e + g*e, family = “binomial”) temp = deltaMethod (mod, “exp (b1 + b2 + b3) − exp (b1) − exp (b2) + 1“, parameterNames = paste (“b“, 0:3, sep = “”)) test.stat = temp $ Estimate/temp $SE pval = 2*(1 − pnorm (abs(test.stat))) return (list (test.stat = test.stat, pval = pval)) } RERI.E.cont.X = function (d,g,e,X){#### RERI using standard logistic regression for the outcome 22 with covariate adjustment X=X[, −1] # remove intercept colnames (X)= paste0 (“x“,4:(ncol (X) + 3)) mod = glm (as.formula (paste (“d~“,paste (paste0 (“x“,1:(ncol (X) + 3)),collapse =“+”))),data = data.frame (x1 =g,x2=e,x3=g*e,X),family = “binomial”) temp = deltaMethod (mod, “exp (x1 + x2 + x3) − exp (x1) − exp (x2) + 1“, parameterNames = c(paste0 (“x“,0:( 
ncol (X) + 3)))) test.stat = temp $ Estimate/temp $SE pval = 2*(1 − pnorm (abs(test.stat))) return (list (test.stat = test.stat, pval = pval)) } estimate.g.e.probs = function (g,e, ge.indep = F, return.equation = F, max.iter = 500, eps = 1e −6) { #### Function for the proposed test under binary G and E (saturated model) converge = F iter = 0 ind.alpha1 = 1 ind.alpha2 = 2 if (ge.indep){ alpha.old = rep (0,2) U.vec = rep (0, length (alpha.old)) U.deriv.mat = matrix (0, length (alpha.old), length (alpha.old)) while (! converge & iter < max.iter){ e1 = exp (alpha.old [ind.alpha1]) e2 = exp (alpha.old [ind.alpha2]) e12 = exp (sum (alpha.old)) C = 1/(e12 + e1 + e2 + 1) U.vec [ind.alpha1] = mean (g) − C*(e12 + e1) U.vec [ind.alpha2] = mean (e) − C*(e12 + e2) U.deriv.mat[ind.alpha1, ind.alpha1] = C^2*(e12 + e1)*(e12 + e1) − C*(e12 + e1) U.deriv.mat[ind.alpha1, ind.alpha2] = C^2*(e12 + e2)*(e12 + e1) − C*e12 U.deriv.mat[ind.alpha2, ind.alpha1] = C^2*(e12 + e1)*(e12 + e2) − C*e12 U.deriv.mat[ind.alpha2, ind.alpha2] = C^2*(e12 + e2)*(e12 + e2) − C*(e12 + e2) alpha.new = alpha.old − solve (U.deriv.mat) %*% U.vec if (max (abs (alpha.new − alpha.old)) <= eps) converge = T else { alpha.old = alpha.new iter = iter + 1 } } if (return.equation){ U = matrix (0, ncol = length (alpha.new), nrow = length (g)) U[, ind.alpha1] = g − C*(exp (sum (alpha.new)) + exp(alpha.old[ind.alpha1])) U[, ind.alpha2] = e − C*(exp (sum (alpha.new)) + exp(alpha.old[ind.alpha2])) # EU.inv.U = U %*% solve (U.deriv.mat) EU.inv.U = U %*% solve (t(U) %*% U) } alpha.new = c(alpha.new, 0) } else { alpha.old = rep (0,3) ind.alpha3 = 3 U.vec = rep (0, length (alpha.old)) 23 U.deriv.mat = matrix (0, length (alpha.old), length (alpha.old)) while (! 
converge & iter < max.iter){ e1 = exp (alpha.old [ind.alpha1]) e2 = exp (alpha.old [ind.alpha2]) e123 = exp (sum (alpha.old)) C = 1/(e123 + e1 + e2 + 1) U.vec [ind.alpha1] = mean (g) − C*(e123 + e1) U.vec [ind.alpha2] = mean (e) − C*(e123 + e2) U.vec [ind.alpha3] = mean (g*e) − C*e123 U.deriv.mat[ind.alpha1, ind.alpha1] = C^2*(e123 + e1)*(e123 + e1) − C*(e123 + e1) U.deriv.mat[ind.alpha1, ind.alpha2] = C^2*(e123 + e2)*(e123 + e1) − C*e123 U.deriv.mat[ind.alpha1, ind.alpha3] = C^2*e123*(e123 + e1) − C*e123 U.deriv.mat[ind.alpha2, ind.alpha1] = C^2*(e123 + e1)*(e123 + e2) − C*e123 U.deriv.mat[ind.alpha2, ind.alpha2] = C^2*(e123 + e2)*(e123 + e2) − C*(e123 + e2) U.deriv.mat[ind.alpha2, ind.alpha3] = C^2*e123*(e123 + e2) − C*e123 U.deriv.mat[ind.alpha3, ind.alpha1] = C^2*(e123 + e1)*e123 − C*e123 U.deriv.mat[ind.alpha3, ind.alpha2] = C^2*(e123 + e2)*e123 − C*e123 U.deriv.mat[ind.alpha3, ind.alpha3] = C^2*e123*e123 − C*e123 alpha.new = alpha.old − solve (U.deriv.mat) %*% U.vec if (max (abs (alpha.new − alpha.old)) <= eps) converge = T else { alpha.old = alpha.new iter = iter + 1 } } if (return.equation){ U = matrix (0, ncol = length (alpha.new), nrow = length (g)) U[, ind.alpha1] = g − C*(e123 + e1) U[, ind.alpha2] = e − C*(e123 + e2) U[, ind.alpha3] = g*e − C*e123 # EU.inv.U = U %*% solve (U.deriv.mat) EU.inv.U = U %*% solve (t(U) %*% U) } } ## joint probabilities : p.g1.e1 = C*exp (sum (alpha.new)) p.g1.e0 = C*exp (alpha.new [ind.alpha1]) p.g0.e1 = C*exp (alpha.new [ind.alpha2]) p.g0.e0 = C ### return conditional probabilities : p.g1.cond.e1 = p.g1.e1/(p.g1.e1 + p.g0.e1) p.e1.cond.g1 = p.g1.e1/(p.g1.e0 + p.g1.e1) p.g1.cond.e0 = p.g1.e0/(p.g0.e0 + p.g1.e0) p.e1.cond.g0 = p.g0.e1/(p.g0.e1 + p.g0.e0) if (return.equation){ return (list (cond.probs = list (p.g1.cond.e1 = p.g1.cond.e1, p.g1.cond.e0 = p.g1.cond.e0, p.e1.cond. 
g0 = p.e1.cond.g0, p.e1.cond.g1 = p.e1.cond.g1), coefs = list (alpha = alpha.new), equation = EU.inv.U)) } else { return (list (cond.probs = list (p.g1.cond.e1 = p.g1.cond.e1, p.g1.cond.e0 = p.g1.cond.e0, p.e1.cond. g0 = p.e1.cond.g0, p.e1.cond.g1 = p.e1.cond.g1), coefs = list (alpha = alpha.new)))} }
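For readers mapping the listing back to the tests, the quantities computed in get.pval and RERI.E.cont can be summarized as below. This is only a reading of the code, not a restatement of the theory in the main text. The symbols $U_i$, $\tilde U_i$, $\hat m_G$, $\hat m_E$, $\nabla$ and $\hat\Sigma$ are shorthand introduced here for illustration: $\tilde U_i$ stands for the entries of u.IF (equal to $U_i$ for cases, and $U_i$ plus the correction adj.IF for controls), $(\hat\phi, \hat\alpha, \hat\beta)$ are the maxLik estimates fitted among controls with $\hat\phi = 0$ when G-E independence is assumed, and $\hat\Sigma$ is the estimated covariance of the logistic coefficients $(b_1, b_2, b_3)$ for g, e and their product.

```latex
\begin{align*}
\hat m_G(X_i) &= \operatorname{expit}(X_i^\top \hat\alpha), \qquad
\hat m_E(X_i) = \exp(X_i^\top \hat\beta), \\
U_i &= D_i \, e^{-\hat\phi\, G_i E_i}\,
       \{G_i - \hat m_G(X_i)\}\,\{E_i - \hat m_E(X_i)\}, \\
Z &= \frac{\sqrt{n}\,\bar U}{\operatorname{sd}(\tilde U)}, \qquad
p = 2\,\{1 - \Phi(|Z|)\}, \\[4pt]
\widehat{\mathrm{RERI}} &= e^{b_1 + b_2 + b_3} - e^{b_1} - e^{b_2} + 1, \\
\nabla &= \bigl(e^{b_1+b_2+b_3} - e^{b_1},\;
               e^{b_1+b_2+b_3} - e^{b_2},\;
               e^{b_1+b_2+b_3}\bigr)^\top, \\
\widehat{\mathrm{SE}} &= \sqrt{\nabla^\top \hat\Sigma\, \nabla}, \qquad
Z_{\mathrm{RERI}} = \widehat{\mathrm{RERI}} \,/\, \widehat{\mathrm{SE}} .
\end{align*}
```

The first three lines correspond to the proposed test as returned by get.pval (the listing also reports the alternative statistic sum(u)/sqrt(sum(u.IF^2))); the last three are the first-order delta-method Wald test that deltaMethod carries out for the standard RERI approach.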
References
- [1]. Tchetgen Tchetgen Eric J, Robins James M, Rotnitzky Andrea. On doubly robust estimation in a semiparametric odds ratio model. Biometrika. 2010;97:171–180.
- [2]. Knol Mirjam J, Tweel Ingeborg, Grobbee Diederick E, Numans Mattijs E, Geerlings Mirjam I. Estimating interaction on an additive scale between continuous determinants in a logistic regression model. International Journal of Epidemiology. 2007;36:1111–1118.
- [3]. Han Summer S, Rosenberg Philip S, Garcia-Closas Montse, et al. Likelihood Ratio Test for Detecting Gene (G)-Environment (E) Interactions Under an Additive Risk Model Exploiting G-E Independence for Case-Control Data. American Journal of Epidemiology. 2012;176:1060–1067.
References
- [1]. Skrondal Anders. Interaction as departure from additivity in case-control studies: a cautionary note. American Journal of Epidemiology. 2003;158(3):251–258.
- [2]. Rothman Kenneth J, Greenland Sander, Walker Alexander M. Concepts of interaction. American Journal of Epidemiology. 1980;112(4):467–470.
- [3]. Greenland Sander. Tests for interaction in epidemiologic studies: a review and a study of power. Statistics in Medicine. 1983;2(2):243–251.
- [4]. Cordell Heather J. Epistasis: what it means, what it doesn’t mean, and statistical methods to detect it in humans. Human Molecular Genetics. 2002;11(20):2463–2468.
- [5]. VanderWeele Tyler J, Knol Mirjam J. A tutorial on interaction. Epidemiologic Methods. 2014;3(1):33–72.
- [6]. Liu Gang, Mukherjee Bhramar, Lee Seunggeun, et al. Robust tests for additive gene-environment interaction in case-control studies using gene-environment independence. American Journal of Epidemiology. 2017;187(2):366–377.
- [7]. Rothman Kenneth J. Causes. American Journal of Epidemiology. 1976;104(6):587–592.
- [8]. Rothman Kenneth J, et al. Modern Epidemiology. 3rd ed. Philadelphia, PA: Lippincott Williams & Wilkins; 2008.
- [9]. Han Summer S, Rosenberg Philip S, Garcia-Closas Montse, et al. Likelihood Ratio Test for Detecting Gene (G)-Environment (E) Interactions Under an Additive Risk Model Exploiting G-E Independence for Case-Control Data. American Journal of Epidemiology. 2012;176(11):1060–1067.
- [10]. Piegorsch Walter W, Weinberg Clarice R, Taylor Jack A. Non-hierarchical logistic models and case-only designs for assessing susceptibility in population-based case-control studies. Statistics in Medicine. 1994;13(2):153–162.
- [11]. Umbach David M, Weinberg Clarice R. Designing and analysing case-control studies to exploit independence of genotype and exposure. Statistics in Medicine. 1997;16(15):1731–1743.
- [12]. Chatterjee Nilanjan, Carroll Raymond J. Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies. Biometrika. 2005;92(2):399–418.
- [13]. Tchetgen Tchetgen Eric J, Robins James. The semiparametric case-only estimator. Biometrics. 2010;66(4):1138–1144.
- [14]. Tchetgen Tchetgen E. Robust discovery of genetic associations incorporating gene-environment interaction and independence. Epidemiology. 2011;22(2):262–272.
- [15]. Vansteelandt Stijn, VanderWeele Tyler J, Tchetgen Eric J, Robins James M. Multiply robust inference for statistical interactions. Journal of the American Statistical Association. 2008;103(484):1693–1704.
- [16]. Hosmer David W, Lemeshow Stanley. Confidence interval estimation of interaction. Epidemiology. 1992:452–456.
- [17]. Tchetgen Tchetgen Eric J. A general regression framework for a secondary outcome in case-control studies. Biostatistics. 2013;15(1):117–128.
- [18]. Tchetgen Tchetgen Eric J, Robins James M, Rotnitzky Andrea. On doubly robust estimation in a semiparametric odds ratio model. Biometrika. 2010;97(1):171–180.
- [19]. Modan Baruch, Hartge Patricia, Hirsh-Yechezkel Galit, et al. Parity, oral contraceptives, and the risk of ovarian cancer among carriers and noncarriers of a BRCA1 or BRCA2 mutation. New England Journal of Medicine. 2001;345(4):235–240.









