A general approach to detect gene (G)-environment (E) additive interaction leveraging G-E independence in case-control studies

Eric J Tchetgen Tchetgen; Xu Shi; Benedict HW Wong; Tamar Sofer

doi:10.1002/sim.8337

Author manuscript; available in PMC: 2020 Oct 30.

Published in final edited form as: Stat Med. 2019 Aug 23;38(24):4841–4853. doi: 10.1002/sim.8337

PMCID: PMC6959489 NIHMSID: NIHMS1042495 PMID: 31441522

Summary

It is increasingly of interest in statistical genetics to test for the presence of an additive interaction between genetic (G) and environmental (E) risk factors. In case-control studies involving a rare disease, a statistical test of no additive G × E interaction typically entails a test of no relative excess risk due to interaction (RERI). It has been shown that a likelihood ratio test of a null RERI incorporating the G-E independence assumption (RERI-LRT) outperforms the standard approach. The RERI-LRT relies on correct specification of a logistic model for the binary outcome, as a function of G, E and auxiliary covariates. However, when at least one exposure is not categorical or auxiliary covariates are present, nonparametric estimation may not be feasible, while parametric logistic regression will a priori rule out the null hypothesis of no additive interaction in most practical situations, inflating type I error rate. In this paper, we present a general approach to test for G × E additive interaction exploiting G-E independence. Unlike the RERI-LRT, it allows the regression model for the binary outcome to remain unrestricted, and nonetheless still allows for covariate adjustment in order to ensure the G-E independence assumption or to rule out residual confounding. The methods are illustrated through extensive simulation studies and an ovarian cancer study.

Keywords: gene-environment additive interaction, gene-environment independence, case-control study

1 |. INTRODUCTION

There is growing interest in the development and application of statistical methods to detect the presence of an interaction between genetic (G) and environmental (E) risk factors. In case-control studies, the vast majority of interaction analyses have been based on the multiplicative scale. This is mainly because linear regression is generally inconsistent [1], while logistic regression with product terms can be conveniently fitted and such a model is approximately multiplicative in the case of a rare disease. However, it has been argued that assessment of interaction should be mainly based on the additive scale rather than the multiplicative scale [2, 3, 4, 5, 6]. The additive scale allows one to capture whether the disease risks would be different across different subgroups and thus is more relevant for measuring public health impact. It also closely corresponds to tests for a mechanistic interaction [5]. Specifically, a weaker notion of the mechanistic interaction is a sufficient cause interaction, which is present if there is at least one person for whom the outcome would occur if both exposures were present but would not occur if just one exposure were present [7]. A stronger notion is an epistatic interaction, which concerns whether there is at least one person who will have the outcome if and only if both exposures are present [5]. Such interactions can sometimes be detected by evaluating additive interactions under certain monotonicity conditions.

Although the interaction term in a logistic regression does not have an additive interpretation, statistical tests of no additive G × E interaction can in fact be performed using such a model when the disease is rare. In particular, the relative excess risk due to interaction (RERI) test statistic [8] can be represented in terms of relative risks. For case-control data involving a rare disease, the relative risks can be approximated by odds ratios; thus a test of null RERI can be obtained from a standard logistic regression. When G and E are known to be independent in the target population, [9] developed a likelihood ratio test of the null hypothesis of no RERI incorporating the G-E independence assumption (hereafter RERI-LRT), which is generally more powerful than the standard RERI test of no additive interaction. Such a gain in power has long been observed for various tests of multiplicative interaction [10, 11, 12, 13, 14].

RERI-based tests of additive interaction rely on correct specification of a logistic model for the binary outcome, as a function of G, E and auxiliary covariates. In principle, a saturated or nonparametric specification of the outcome logistic regression would ensure validity of RERI-based tests (i.e. that they preserve the nominal type I error rate). However, when the environmental exposure (E) or auxiliary covariates include count or continuous variables, nonparametric estimation may not be feasible. Moreover, in this case, standard parametric specification of a logistic outcome regression will generally a priori rule out the null hypothesis of no additive interaction, leading to inflation of the type I error rate, as we detail in the appendix.

In this paper, we present a general, yet fairly straightforward approach to directly test for the presence of additive G × E interaction in case-control studies without requiring a regression model of disease risk. The proposed approach, which is easily made to exploit the G-E independence assumption, relies on separate regression models for G and E given the covariates. By avoiding specification of an outcome model, the approach circumvents the aforementioned difficulties with RERI-based tests and is completely robust to mis-specification of the outcome regression, although correct specification of models for G and E is instead needed for the new approach to be valid. Nonetheless, unlike RERI-based tests, standard parametric models can be used for the latter in most practical situations without a priori ruling out the null hypothesis of no additive interaction, even if E is continuous. In addition, although the test is developed in the context of G × E interaction, it can be used in a more general context to test for interaction [15].

The methods are illustrated through an extensive simulation study in the setting of binary G and E with no additional covariates, so that the traditional RERI-based tests and the new approach equally apply and can be directly compared in terms of power. Additional simulations are performed to illustrate the poor behavior of parametric RERI-based tests in more practical scenarios, in which the proposed approach continues to perform well. We also demonstrate the new approach using data from an ovarian cancer study to detect an additive interaction between the BRCA1/2 genetic variant (G), and the woman’s parity and number of years of oral contraceptive use (E). Because both environmental exposures are counts, the RERI-LRT cannot be implemented without possibly recoding the original environmental exposures as dichotomous or as categorical with few levels. Covariate adjustment needed in the study also presents analytical difficulty for the RERI-LRT, and for these reasons the approach is forgone in both applications in favor of the new methodology.

The paper is organized as follows. We detail the proposed method in Section [2]. Specifically, we start with the simple setting of binary genetic and environmental risk factors. We review the RERI in Section [2.1]. We then propose the test statistic based on an alternative characterization of the null hypothesis of no additive interaction in Section [2.2], which possesses the flexibility to leverage G-E independence as described in Section [2.3] and to adjust for covariates as described in Section [2.4]. Next we extend the proposed test to more general types of exposures in Section [2.5]. We discuss the possibility of relaxing the rare disease assumption in Section [2.6]. In Section [3], we provide a unified class of test statistics which subsumes all settings in previous sections as special cases. We evaluate the performance of the proposed tests via extensive simulation studies in Section [4], and illustrate use of the new test statistic using data from the ovarian cancer study in Section [5]. We close with a discussion in Section [6].

2 |. METHODS

In this section, we first consider the scenario where both G and E are binary. We provide a brief review of the existing method based on a test of a null RERI with binary exposures. We introduce a new test based on an alternative representation of the null hypothesis of no RERI. We extend the test to incorporate the G-E independence assumption and to adjust for covariates. Then we provide a more general test for polytomous or continuous exposures. We also briefly discuss approaches to potentially relax the rare disease assumption. In the next section, we will provide a unified class of test statistics which additionally allows for relaxation of the G-E independence assumption.

Suppose one has observed retrospective case-control data on n unrelated individuals. Let D denote the rare disease outcome defining case-control status and (A₁,A₂) denote two exposures in view. For instance, in a statistical genetic application, A₁ may denote the genetic variant G and A₂ an environmental exposure E. Here we will use the more generic notation (A₁, A₂) to allow for more general contexts considered in Section [3], which considers cases where either or both exposures may be count or continuous. Let μ(a₁, a₂) = Pr(D = 1|A₁ = a₁, A₂ = a₂) denote the disease risk of individuals in the target population with exposure values (a₁, a₂). The additive interaction between A₁ and A₂ is measured by the extent to which the effect of A₁ and A₂ together exceeds the sum of the effects of each considered individually, i.e.,

[μ (a_{1}, a_{2}) - μ (0, 0)] - {[μ (a_{1}, 0) - μ (0, 0)] + [μ (0, a_{2}) - μ (0, 0)]} = μ (a_{1}, a_{2}) - μ (a_{1}, 0) - μ (0, a_{2}) + μ (0, 0), for some a_{1}, a_{2} .

2.1 |. Binary exposures and the standard RERI

In the case of binary A₁ and A₂, we have the following saturated model

μ (A_{1}, A_{2}) = β_{0} + β_{1} A_{1} + β_{2} A_{2} + β_{3} A_{1} A_{2} .

(1)

Therefore, an additive interaction between A₁ and A₂ is said to be present if

β_{3} = μ (1, 1) - μ (1, 0) - μ (0, 1) + μ (0, 0) \neq 0.

In case-control studies we cannot generally estimate the risk μ(a₁, a₂). An alternative representation commonly considered is the relative excess risk due to interaction (RERI), defined as

RERI = {μ (1, 1) - μ (1, 0) - μ (0, 1)} / μ (0, 0) + 1 = β_{3} / μ (0, 0),

based on the observation that

β_{3} = 0 \Leftrightarrow RERI = 0.

An empirical version of RERI can be obtained by estimating the required risk ratios μ(a₁, a₂)/μ(0,0) via a saturated logistic regression, using the rare disease assumption under which odds ratios and risk ratios are approximately equal. Specifically, let ${\hat{OR}}_{G, E}$ denote the point estimate of the odds ratio OR_{G, E} = [μ(G,E)(1 − μ(0,0))]/[μ(0,0)(1 − μ(G,E))] for levels (G,E) compared to the reference category (G,E) = (0,0) from a saturated logistic regression including G, E, and G × E terms. Then

\hat{RERI} = {\hat{OR}}_{1, 1} - {\hat{OR}}_{1, 0} - {\hat{OR}}_{0, 1} + 1.

The standard Delta method can be applied to provide a consistent estimate of its standard error $\sqrt{{\hat{σ}}_{RERI}^{2}}$, as detailed in [16]. Finally, a Wald-type test statistic is constructed as $T_{RERI} = \hat{RERI} / \sqrt{{\hat{σ}}_{RERI}^{2}}$. Using standard asymptotic arguments and the Delta method, under the null hypothesis of no additive interaction H₀: β₃ = RERI = 0, T_RERI is approximately standard normal in large samples.
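The Wald-type RERI test for binary G and E can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `reri_wald` and the 2 × 2 count layout are assumptions, and the covariance of the log odds ratios uses the familiar closed form for a saturated logistic model (sum of reciprocal cell counts, with the reference cell shared across odds ratios), which may be presented differently in [16].

```python
import numpy as np

def reri_wald(cases, controls):
    """RERI, its delta-method standard error, and the Wald statistic.

    cases[g, e] and controls[g, e] are 2 x 2 arrays of counts (hypothetical layout).
    """
    cases = np.asarray(cases, dtype=float)
    controls = np.asarray(controls, dtype=float)
    odds = cases / controls
    # Odds ratios relative to the reference cell (G, E) = (0, 0).
    or11 = odds[1, 1] / odds[0, 0]
    or10 = odds[1, 0] / odds[0, 0]
    or01 = odds[0, 1] / odds[0, 0]
    reri = or11 - or10 - or01 + 1.0

    # Covariance of (log OR11, log OR10, log OR01) in the saturated model:
    # every pair shares the reference-cell term; each variance adds its own cell.
    ref = 1.0 / cases[0, 0] + 1.0 / controls[0, 0]
    own = [1.0 / cases[g, e] + 1.0 / controls[g, e] for g, e in [(1, 1), (1, 0), (0, 1)]]
    cov = np.full((3, 3), ref) + np.diag(own)

    # Gradient of RERI with respect to the log odds ratios.
    grad = np.array([or11, -or10, -or01])
    se = np.sqrt(grad @ cov @ grad)
    return reri, se, reri / se
```

The returned statistic would be compared with standard normal quantiles, and it inherits the rare disease approximation discussed above.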

2.2 |. The proposed test of additive interaction

The following result gives an alternative characterization of the null hypothesis of no additive interaction which motivates the new approach. To state the result, let π₁ (a₂) = Pr(A₁ = 1|A₂ = a₂) denote the prevalence of the first exposure A₁ among individuals with second exposure A₂ = a₂ in the underlying population, and likewise define π₂ (a₁) = Pr(A₂ = 1|A₁ = a₁). Let α denote the log odds ratio relating A₁ and A₂ in the target population. Then

e^{α} = \frac{π_{1} (1) (1 - π_{1} (0))}{π_{1} (0) (1 - π_{1} (1))} .

Note that α = 0 encodes the independence assumption between A₁ and A₂.

Result 1. We have that the null hypothesis of no additive interaction H₀: β₃ = 0 holds if and only if

E {U | D = 1} = 0,

where

U = e^{- α A_{1} A_{2}} (A_{1} - π_{1} (0)) (A_{2} - π_{2} (0)) D .

(2)

We should note that Result 1 does not rely on the rare disease assumption and holds irrespective of the population disease prevalence. The result is a special case of a more general lemma given later in Section [3] allowing for arbitrary exposures and for covariate adjustment. According to the result, the null hypothesis of no additive interaction holds if and only if RERI is equal to zero, or equivalently if and only if the random variable U has mean zero among cases (D = 1), i.e.,

β_{3} = 0 \Leftrightarrow RERI = 0 \Leftrightarrow E {U | D = 1} = 0.

Remark 1. Intuition about the result is gained by assuming G-E independence, i.e. α = 0. Let f_j(a_j) = Pr(A_j = a_j) denote the density of A_j in the underlying target population. Under G-E independence we have π_j(a) = π_j, ∀a. Then, upon noting that the conditional density of (A₁, A₂) given D = 1 is proportional to

μ (A_{1}, A_{2}) f_{1} (A_{1}) f_{2} (A_{2}) = (β_{0} + β_{1} A_{1} + β_{2} A_{2} + β_{3} A_{1} A_{2}) f_{1} (A_{1}) f_{2} (A_{2})

where f_j(1) = π_j, one observes that $E {U | D = 1}$ is proportional to

\sum_{a_{1}, a_{2}} (a_{1} - π_{1}) (a_{2} - π_{2}) (β_{0} + β_{1} a_{1} + β_{2} a_{2} + β_{3} a_{1} a_{2}) f_{1} (a_{1}) f_{2} (a_{2})
= \underbrace{β_{0} \sum_{a_{1}, a_{2}} (a_{1} - π_{1}) (a_{2} - π_{2}) f_{1} (a_{1}) f_{2} (a_{2})}_{= 0} + \underbrace{β_{1} \sum_{a_{1}, a_{2}} (a_{1} - π_{1}) (a_{2} - π_{2}) a_{1} f_{1} (a_{1}) f_{2} (a_{2})}_{= 0} + \underbrace{β_{2} \sum_{a_{1}, a_{2}} (a_{1} - π_{1}) (a_{2} - π_{2}) a_{2} f_{1} (a_{1}) f_{2} (a_{2})}_{= 0} + β_{3} \sum_{a_{1}, a_{2}} (a_{1} - π_{1}) (a_{2} - π_{2}) a_{1} a_{2} f_{1} (a_{1}) f_{2} (a_{2})
= β_{3} π_{1} (1 - π_{1}) π_{2} (1 - π_{2}) for binary A_{1}, A_{2},

confirming that $E {U | D = 1} = 0$ if and only if the additive interaction β₃ = 0. Result 1 further shows that a similar result holds when the exposures are dependent upon applying a weight to individuals with both exposures, equal to the inverse of the odds ratio association of the two exposures. Intuitively, weighting makes the exposures independent, thus essentially recovering the independent exposure setting in the weighted sample. Since U only uses exposure data among cases (with D = 1), the result suggests that one may be able to test for additive interaction by considering whether the distribution of the exposures in view satisfies the above condition using data for cases only.
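Result 1 can also be checked numerically for binary exposures that are not independent. The sketch below computes E(U) = E(U | D = 1) Pr(D = 1) exactly from a joint exposure distribution and a linear risk model. The function name `expected_u` and the numerical values in the usage lines are illustrative assumptions, not quantities taken from the paper.

```python
import numpy as np

def expected_u(joint, mu):
    """Exact E(U) for binary exposures, with U as in equation (2).

    joint[a1, a2] = Pr(A1 = a1, A2 = a2) in the population,
    mu[a1, a2]    = Pr(D = 1 | A1 = a1, A2 = a2).
    """
    joint = np.asarray(joint, dtype=float)
    mu = np.asarray(mu, dtype=float)
    pi1_0 = joint[1, 0] / joint[:, 0].sum()   # Pr(A1 = 1 | A2 = 0)
    pi2_0 = joint[0, 1] / joint[0, :].sum()   # Pr(A2 = 1 | A1 = 0)
    odds_ratio = joint[1, 1] * joint[0, 0] / (joint[1, 0] * joint[0, 1])  # e^alpha
    total = 0.0
    for a1 in (0, 1):
        for a2 in (0, 1):
            weight = odds_ratio ** (-(a1 * a2))   # e^{-alpha a1 a2}
            total += weight * (a1 - pi1_0) * (a2 - pi2_0) * mu[a1, a2] * joint[a1, a2]
    return total

# Illustrative values: exposure odds ratio 6, risks linear in (a1, a2).
joint = np.array([[0.4, 0.1], [0.2, 0.3]])
mu_null = np.array([[0.01, 0.04], [0.03, 0.06]])   # beta3 = 0
mu_alt = np.array([[0.01, 0.04], [0.03, 0.11]])    # beta3 = 0.05
print(expected_u(joint, mu_null), expected_u(joint, mu_alt))
```

With β₃ = 0 the weighted mean vanishes up to rounding, while a nonzero β₃ moves it away from zero in proportion to β₃, in line with the weighting argument of Remark 1.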

Unfortunately, U is not directly observed and therefore cannot directly be used for inference, as it depends on the unknown population parameters π_j(0), j = 1, 2. Nonetheless, progress can be made under the rare disease assumption, since one may use the controls (with D = 0) for approximate inference, upon observing that π_j(0) ≈ p_j(0) where p₁(a₂) = Pr(A₁ = 1|A₂ = a₂, D = 0) and p₂(a₁) = Pr(A₂ = 1|A₁ = a₁, D = 0). Specifically, let

ω = log {[p_{1} (1) (1 - p_{1} (0))] / [p_{1} (0) (1 - p_{1} (1))]},

(3)

then ω ≈ α under the rare disease assumption. In particular, $\hat{ω}$ is asymptotically negligible when A₁ and A₂ are independent in the population. Therefore, one may estimate $\sum_{i} U_{i}$ with $\sum_{i} {\hat{U}}_{i}$ where

{\hat{U}}_{i} = exp (- \hat{ω} A_{1, i} A_{2, i}) (A_{1, i} - {\hat{p}}_{1} (0)) (A_{2, i} - {\hat{p}}_{2} (0)) D_{i},

(4)

with ${\hat{p}}_{1} (a) = \sum_{i} A_{1, i} I (A_{2, i} = a, D_{i} = 0) / \sum_{i} I (A_{2, i} = a, D_{i} = 0)$ the sample version of p₁(a), ${\hat{p}}_{2} (a)$ similarly defined, and $e^{\hat{ω}} = [{\hat{p}}_{1} (1) (1 - {\hat{p}}_{1} (0))] / [{\hat{p}}_{1} (0) (1 - {\hat{p}}_{1} (1))]$ the sample odds ratio relating A₁ and A₂ in the controls. In the Appendix, we show how to derive $σ_{\hat{U}}^{2} = Var (\sum_{i} {\hat{U}}_{i} / n)$ (see equation (1) of the Appendix).
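The plug-in quantities in equation (4) can be computed directly from the data. The following is a minimal sketch assuming binary exposures coded 0/1 and a case indicator `d`; the function name `u_hat` is hypothetical, and the variance of the resulting sum, which the Appendix derives, is not reproduced here.

```python
import numpy as np

def u_hat(a1, a2, d):
    """Per-subject terms of equation (4) and the control-based log odds ratio."""
    a1, a2, d = (np.asarray(v) for v in (a1, a2, d))
    controls = d == 0

    def prevalence(target, other, level):
        # Proportion with target = 1 among controls whose other exposure equals `level`.
        return target[controls & (other == level)].mean()

    p1_0 = prevalence(a1, a2, 0)
    p1_1 = prevalence(a1, a2, 1)
    p2_0 = prevalence(a2, a1, 0)
    omega = np.log(p1_1 * (1 - p1_0) / (p1_0 * (1 - p1_1)))
    u = np.exp(-omega * a1 * a2) * (a1 - p1_0) * (a2 - p2_0) * d
    return u, omega
```

Because of the factor D_i, controls contribute zero to the sum and enter only through the estimated prevalences and odds ratio.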

Suppose that, unbeknownst to the analyst, A₁ and A₂ are independent in the population and therefore $\hat{ω}$ is asymptotically negligible. We evaluate $σ_{\hat{U}}^{2}$ at this particular submodel and show that it can be decomposed as $σ_{\hat{U}}^{2} = V_{1} + V_{2} + V_{3}$, where the first term V₁ captures the variance of $\sum_{i} U_{i} / n$ if (ω, p₁(0), p₂(0)) were known; the second term V₂ reflects the uncertainty due to estimation of (p₁(0), p₂(0)); and V₃ reflects the uncertainty associated with estimation of the odds ratio parameter ω. Specifically,

V_{1} = Var (U) / n,

V_{2} = E {[(A_{2} - p_{2} (0)) D]}^{2} Var ((A_{1} - E (A_{1} | D = 0)) (1 - D)) / n + E {[(A_{1} - p_{1} (0)) D]}^{2} Var ((A_{2} - E (A_{2} | D = 0)) (1 - D)) / n,

V_{3} = {(E [A_{1} A_{2} (A_{2} - p_{2}) (A_{1} - p_{1}) D] {p_{1} p_{2} (1 - p_{1}) (1 - p_{2})}^{- 1} + E [(A_{2} - p_{2}) D] {[(1 - p_{2})]}^{- 1} + E [(A_{1} - p_{1}) D] {[(1 - p_{1})]}^{- 1})}^{2} \times Var ((A_{1} - E (A_{1} | D = 0)) (A_{2} - E (A_{2} | D = 0)) (1 - D)) / n .

In the next section, we further consider how explicitly leveraging G-E independence assumption alters each of these contributions, to reveal how the independence assumption can improve power to detect the presence of an additive interaction.

Here we note that, under H₀, the standardized test statistic $T = \sum_{i} {\hat{U}}_{i} / (n \sqrt{{\hat{σ}}_{\hat{U}}^{2}})$ is approximately standard normal in large samples, where ${\hat{σ}}_{\hat{U}}^{2}$ is a consistent estimator of $σ_{\hat{U}}^{2}$ obtained by substituting all unknown expectations with empirical averages and unknown parameters with corresponding estimators. Under the two-sided alternative hypothesis β₃ ≠ 0, one can further show that in large samples, T has approximate variance one, and is approximately centered at the non-centrality parameter κ × β₃, where:

κ = p_{1} (0) (1 - p_{1} (0)) p_{2} (0) (1 - p_{2} (0)) λ / σ_{\hat{U}}^{2},

where λ is the sampling fraction of cases (i.e. λ = proportion of cases in the case-control sample/proportion of cases in the population). Since $1 / σ_{\hat{U}}^{2}$ tends to infinity with sample size, T has asymptotic power one, confirming that, similar to T_RERI, T is a consistent test statistic of H₀.

Interestingly, the above derivation also implies that the statistic $\sum_{i} {\hat{U}}_{i} / \{{\hat{p}}_{1} (0) (1 - {\hat{p}}_{1} (0)) {\hat{p}}_{2} (0) (1 - {\hat{p}}_{2} (0)) \sum_{i} D_{i}\}$ gives a consistent estimate of β₃/Pr{D = 1}, which is the interaction parameter of interest scaled by the inverse of the population disease prevalence. Thus, one could in principle recover a consistent estimate of β₃ if either the underlying population disease prevalence or the sampling fraction of cases were known.

We note that neither T nor T_RERI makes explicit use of the G-E independence assumption and therefore both may be inefficient when the assumption holds. In the following section, we modify T to explicitly encode the independence assumption thus obtaining a more powerful test statistic.

2.3 |. Test incorporating independence assumption

Suppose that A₁ and A₂ are known to be independent in the population. Naturally, one may wish to exploit such prior information in testing for G-E interaction. This can be accomplished by adapting the methodology developed in the previous section upon noting that the independence assumption implies α = 0, which, under the rare disease assumption, also implies that ω ≈ 0. This leads us to modify ${\hat{U}}_{i}$ . Define ${\tilde{U}}_{i}$ similarly to ${\hat{U}}_{i}$ with $\hat{ω} = 0$ , i.e.

{\tilde{U}}_{i} = (A_{1, i} - {\hat{p}}_{1} (0)) (A_{2, i} - {\hat{p}}_{2} (0)) D_{i} .

(5)

In the appendix, we show that $σ_{\tilde{U}}^{2} = Var (\sum_{i} {\tilde{U}}_{i} / n)$ can be estimated by ${\hat{σ}}_{\tilde{U}}^{2} = {\hat{V}}_{1} + {\hat{V}}_{2}$. Consequently ${\hat{σ}}_{\tilde{U}}^{2} < {\hat{σ}}_{\hat{U}}^{2}$, reflecting the efficiency gain due to the independence assumption: ${\hat{V}}_{3}$ is exactly zero since there is no uncertainty associated with $\hat{ω} = 0$. One can verify that the non-centrality parameter β₃ × κ₁ of $T_{1} = \sum_{i} {\tilde{U}}_{i} / (n \sqrt{{\hat{σ}}_{\tilde{U}}^{2}})$ satisfies $κ_{1} = \frac{σ_{\hat{U}}}{σ_{\tilde{U}}} κ > κ$, confirming that T₁ is guaranteed to be more powerful than T.

2.4 |. Adjusting for covariates

In observational studies, it is usually desirable to adjust for potential confounding of the joint effects of A₁ and A₂, and such covariate adjustment may also be required to enforce the G-E independence assumption. Let X denote such a vector of covariates and suppose that the exposures are independent conditional on X, i.e., A₁ ⊥ A₂ | X. Define p₁(x) = Pr(A₁ = 1|X = x,D = 0) and p₂(x) = Pr(A₂ = 1|X = x,D = 0). Likewise, let ${\hat{p}}_{1} (x)$ and ${\hat{p}}_{2} (x)$ denote the corresponding estimates, obtained using standard parametric models, e.g. logistic regressions of the form $logit {\hat{p}}_{j} (x) = logit p_{j} (x; {\hat{θ}}_{j}) = (1, x^{'}) {\hat{θ}}_{j}, j = 1, 2$, where ${\hat{θ}}_{j}$ is computed by maximum likelihood. Under the null hypothesis of no additive interaction, the test statistic $T_{2} = \sum_{i} {\bar{U}}_{i} / (n \sqrt{{\hat{σ}}_{\bar{U}}^{2}})$ has an approximate standard normal distribution, with ${\bar{U}}_{i}$ defined as

{\bar{U}}_{i} = (A_{1, i} - {\hat{p}}_{1} (X_{i})) (A_{2, i} - {\hat{p}}_{2} (X_{i})) D_{i},

(6)

where ${\hat{σ}}_{\bar{U}}^{2}$ is obtained using equation (1) of the Appendix.
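To make the construction concrete, the sketch below computes a standardized statistic in the simplest special case of equation (6): no covariates, so that an intercept-only logistic regression in controls reduces to the control sample proportion of each exposure. Two points are assumptions of this illustration and not taken from the paper: the function name `independence_test_statistic`, and the variance, which is a plain delta-method calculation treating cases and controls as two independent samples in place of equation (1) of the Appendix.

```python
import numpy as np

def independence_test_statistic(a1, a2, d):
    """Standardized mean of (A1 - p1_hat)(A2 - p2_hat) among cases (no covariates)."""
    a1, a2, d = (np.asarray(v, dtype=float) for v in (a1, a2, d))
    cases, controls = d == 1, d == 0
    n1, n0 = cases.sum(), controls.sum()

    # Exposure prevalences estimated in controls (rare disease approximation).
    p1_hat, p2_hat = a1[controls].mean(), a2[controls].mean()
    r1, r2 = a1[cases] - p1_hat, a2[cases] - p2_hat
    u_mean = np.mean(r1 * r2)

    # Delta method: variability of the case average with p1, p2 held fixed ...
    var_cases = np.var(r1 * r2, ddof=1) / n1
    # ... plus variability from estimating p1, p2 in controls.
    c1, c2 = r1.mean(), r2.mean()
    cov0 = np.cov(a1[controls], a2[controls])
    var_controls = (c2**2 * cov0[0, 0] + c1**2 * cov0[1, 1] + 2 * c1 * c2 * cov0[0, 1]) / n0
    return u_mean / np.sqrt(var_cases + var_controls)
```

Under these assumptions the returned value would be compared with standard normal quantiles; a faithful implementation of T₂ would replace the variance with the Appendix formula and the sample proportions with fitted values ${\hat{p}}_{j} (X_{i})$.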

2.5 |. More general exposures

In this section, we extend the test statistic to the scenarios where the exposures A₁ and A₂ may be continuous or polytomous. Suppose that the environmental exposure A₂ were continuous. For example if D were diabetes status, A₂ could be body mass index (BMI) typically coded on a continuous scale. Note that the null hypothesis of no additive interaction can be restated as follows to acknowledge the continuous exposure:

H_{0} : μ (1, a_{2}, x) - μ (1, 0, x) - μ (0, a_{2}, x) + μ (0, 0, x) = 0 for all values of a_{2} and x,

where μ(a₁,a₂,x) = Pr(D = 1|a₁,a₂,x). To construct an appropriate test statistic of H₀, suppose that $E (A_{2} | X = x, D = 0)$ is estimated with the linear model ${\hat{m}}_{2} (x) = m_{2} (x; {\hat{θ}}_{2}) = (1, x^{'}) {\hat{θ}}_{2}$ via ordinary least squares using controls only. Assuming G-E conditional independence given X, it is straightforward to modify the proposed test statistic to account for the continuous exposure, by simply replacing ${\hat{p}}_{2} (x)$ with ${\hat{m}}_{2} (x)$ . Thus, we let

{\bar{U}}_{i}^{c} = (A_{1, i} - {\hat{p}}_{1} (X_{i})) (A_{2, i} - {\hat{m}}_{2} (X_{i})) D_{i},

(7)

and let ${\hat{σ}}_{{\bar{U}}^{c}}^{2}$ denote an estimate of the variance of $\sum_{i} {\bar{U}}_{i}^{c} / n$ obtained using equation (1) of the Appendix. Then, the test statistic $T_{3} = \sum_{i} {\bar{U}}_{i}^{c} / (n \sqrt{{\hat{σ}}_{{\bar{U}}^{c}}^{2}})$ is approximately standard normal under H₀.
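The per-subject terms in equation (7) can be sketched with NumPy alone. The helper names `fit_logistic` and `u_bar_c` are hypothetical, the logistic fit is a bare Newton iteration without safeguards against separation, and the variance estimate from the Appendix is deliberately left out.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Maximum likelihood logistic regression by Newton's method (no regularization)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        theta = theta + np.linalg.solve(hessian, X.T @ (y - p))
    return theta

def u_bar_c(a1, a2, x, d):
    """Terms of equation (7): binary a1, continuous a2, covariates x, case indicator d."""
    design = np.column_stack([np.ones(len(d)), x])   # rows (1, x')
    controls = d == 0
    # Exposure models are fitted in controls only (rare disease assumption).
    theta1 = fit_logistic(design[controls], a1[controls])
    theta2, *_ = np.linalg.lstsq(design[controls], a2[controls], rcond=None)
    p1_hat = 1.0 / (1.0 + np.exp(-design @ theta1))
    m2_hat = design @ theta2
    return (a1 - p1_hat) * (a2 - m2_hat) * d
```

Summing the returned terms gives the numerator of T₃. Swapping the least-squares fit for a Poisson or Binomial mean model would follow the same pattern for count exposures.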

A similar test statistic could be defined if A₂ were a count, upon estimating its mean with the log-linear model $log n_{2} (x; {\hat{θ}}_{2}) = (1, x^{'}) {\hat{θ}}_{2}$ computed by maximum likelihood under, say, a Poisson model for A₂. Likewise, if A₁ were a count, for instance, if A₁ were to encode the number of minor alleles measured at a single nucleotide polymorphism (SNP) locus, one could model its mean under the Hardy-Weinberg equilibrium model, where A₁ could be modelled as Binomial with two trials and event probability estimated by a logistic regression model $q (x; {\hat{θ}}_{1}) = expit ((1, x^{'}) {\hat{θ}}_{1})$. Note that under this model the mean of A₁ is given by $n_{1} (x; {\hat{θ}}_{1}) = 2 q (x; {\hat{θ}}_{1})$. Then, one could simply replace ${\hat{m}}_{2}$ with ${\hat{n}}_{2}$ (or replace ${\hat{p}}_{1}$ with ${\hat{n}}_{1}$) in defining the test statistic, and likewise modify the estimated variance of the test statistic using (1) of the Appendix.

Suppose now that A₁ were more generally categorical having K possible levels {0, a_{1,1}, …, a_{1,K−1}} with 0 a reference value. Further assuming that A₂ were, say, continuous and independent of A₁ given X, we could then simply define

{\bar{U}}_{i}^{m} = \sum_{k = 1}^{K - 1} (I (A_{1, i} = a_{1, k}) - {\hat{p}}_{1, k} (X_{i})) (A_{2, i} - {\hat{m}}_{2} (X_{i})) D_{i},

(8)

where ${\hat{p}}_{1, k} (x)$ is a maximum likelihood estimate of Pr(A₁ = a_{1,k}|x) computed using standard polytomous logistic regression, i.e. ${\hat{p}}_{1, k} (x) = e^{(1, x^{'}) {\hat{θ}}_{k}} / (1 + \sum_{j = 1}^{K - 1} e^{(1, x^{'}) {\hat{θ}}_{j}})$. Let ${\hat{σ}}_{{\bar{U}}^{m}}^{2}$ denote an estimate of the large sample variance of $\sum_{i} {\bar{U}}_{i}^{m} / n$ based on (1) of the Appendix. Then in large samples, the resulting test statistic $T_{4} = \sum_{i} {\bar{U}}_{i}^{m} / (n \sqrt{{\hat{σ}}_{{\bar{U}}^{m}}^{2}})$ is approximately standard normal under the null hypothesis of no additive interaction, which may be restated to account for the polytomous and continuous exposures:

H_{0} : μ (a_{1, k}, a_{2}, x) - μ (a_{1, k}, 0, x) - μ (0, a_{2}, x) + μ (0, 0, x) = 0 for all k, and all values of a_{2} and x .

2.6 |. Relaxing the rare disease assumption

In case the rare disease assumption does not apply, estimating exposure regression models in controls only may not be entirely appropriate. Nonetheless, it may still be possible to test for the presence of an additive interaction. For instance, if, as is often the case in nested case-control studies, sampling fractions for cases and controls were known, then standard inverse probability weighting could be used based on known sampling weights to estimate population models for the exposures using both cases and controls. Potentially more efficient estimates of models for the exposures could alternatively be obtained using more recent methodology for regression analysis of secondary outcomes in case-control studies [17].

3 |. A UNIFIED CLASS OF TEST STATISTICS

We now provide a unified class of test statistics for the null hypothesis of no additive interaction which subsumes each of the settings considered in previous sections as special cases, but which also allows for the conditional independence assumption of the two exposures to be relaxed.

To do so, we proceed as in [18] and use the following representation of the joint density of (A₁, A₂) given X:

f (A_{1}, A_{2} | X) = \frac{f (A_{1} | A_{2} = 0, X) f (A_{2} | A_{1} = 0, X) OR (A_{1}, A_{2}; X)}{\iint f (a_{1} | A_{2} = 0, X) f (a_{2} | A_{1} = 0, X) OR (a_{1}, a_{2}; X) d ν (a_{1}, a_{2})},

(9)

where ν is a dominating measure of the distribution of (A₁, A₂), and OR(A₁, A₂; X) is the generalized odds ratio function relating A₁ and A₂ within levels of X, that is

OR (A_{1}, A_{2}; X) = \frac{f (A_{1}, A_{2} | X) f (A_{1} = 0, A_{2} = 0 | X)}{f (A_{1} = 0, A_{2} | X) f (A_{1}, A_{2} = 0 | X)}

and {f (A₁|A₂ = 0, X), f (A₂|A₁ = 0, X)} are baseline densities in the target population. Note that in the simple case of binary exposures, the generalized odds ratio function reduces to the standard odds ratio effect measure, but remains well defined as a measure of association for exposures of a more general nature, including categorical, count or continuous variables. In particular, OR(A₁, A₂; X) = 1 if and only if A₁ and A₂ are independent within levels of X. Let

β_{3} (a_{1}, a_{2}, x) = μ (a_{1}, a_{2}, x) - μ (a_{1}, 0, x) - μ (0, a_{2}, x) + μ (0, 0, x)

denote the additive interaction between A₁ and A₂. The null hypothesis of no additive interaction can more generally be stated as:

H_{0} : β_{3} (a_{1}, a_{2}, x) = 0 for all values of a_{1}, a_{2} and x .
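For discrete exposures with no covariates, the generalized odds ratio function and the representation in equation (9) can be verified numerically. In the sketch below, the function names `generalized_or` and `rebuild_joint` are hypothetical, and the joint distribution is stored as a matrix whose first row and column correspond to the reference level 0.

```python
import numpy as np

def generalized_or(joint):
    """OR(a1, a2) = f(a1, a2) f(0, 0) / {f(0, a2) f(a1, 0)} for a joint pmf matrix."""
    joint = np.asarray(joint, dtype=float)
    return joint * joint[0, 0] / np.outer(joint[:, 0], joint[0, :])

def rebuild_joint(joint):
    """Recover the joint pmf from its baseline densities and odds ratio function, as in (9)."""
    joint = np.asarray(joint, dtype=float)
    base1 = joint[:, 0] / joint[:, 0].sum()   # f(a1 | A2 = 0)
    base2 = joint[0, :] / joint[0, :].sum()   # f(a2 | A1 = 0)
    unnormalized = np.outer(base1, base2) * generalized_or(joint)
    return unnormalized / unnormalized.sum()
```

For a 2 × 2 table the entry at (1, 1) is the usual odds ratio, and the function equals 1 everywhere exactly when the two exposures are independent.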

For any function g(A₁, A₂, X) of (A₁,A₂,X), define

W (g) = w (A_{1}, A_{2}, X, D; g) = OR (A_{1}, A_{2}; X)^{- 1} {g (A_{1}, A_{2}, X) - \int g (A_{1}, a_{2}, X) f (a_{2} | A_{1} = 0, X) d μ (a_{2}) - \int g (a_{1}, A_{2}, X) f (a_{1} | A_{2} = 0, X) d μ (a_{1}) + \int g (a_{1}, a_{2}, X) f (a_{2} | A_{1} = 0, X) f (a_{1} | A_{2} = 0, X) d μ (a_{1}, a_{2})} D .

(10)

Remark 2. Intuition about the form of W (g) is similar to that given in Remark [1] upon noting that for any function g(A₁, A₂, X) of (A₁, A₂, X),

E [W (g) | D = 1, X] \propto \iint w (a_{1}, a_{2}, x, 1; g) β_{3} (a_{1}, a_{2}, x) f (a_{1} | A_{2} = 0, x) f (a_{2} | A_{1} = 0, x) OR (a_{1}, a_{2}; x) d ν (a_{1}, a_{2}),

(11)

which is equal to zero if and only if β₃(a₁, a₂, x) = 0, as shown in the proof. By Theorem 1 of [18], any function of (A₁, A₂, X, D) that satisfies (11) must be of the form of W(g).

A special case of W(g) is U defined in equation (2). In particular, it is straightforward to verify that the test statistics proposed to handle binary, continuous or count exposures under the independence assumption are obtained by taking:

g (A_{1}, A_{2}, X) = (A_{1} - E (A_{1} | X)) (A_{2} - E (A_{2} | X)),

where $E (A_{j} | X)$ is the mean of A_j evaluated under f (A_j |X), j = 1, 2. For A₁ categorical with K distinct categories and A₂ binary, continuous or a count, one likewise obtains the test statistic previously proposed by taking:

g (A_{1}, A_{2}, X) = \sum_{k = 1}^{K - 1} (I (A_{1} = a_{1, k}) - E (I (A_{1} = a_{1, k}) | X)) (A_{2} - E (A_{2} | X)) .

Lemma 1. The null hypothesis H₀ : β₃(a₁, a₂, x) = 0 holds if and only if

E {W (g) | D = 1, x} = 0 for all values of x and all functions g.

Result 1 is easily recovered as a corollary of Lemma 1. According to Lemma 1, an empirical version of W(g) with user-specified function g may be used to test H₀. Specifically, one must estimate the unknown odds ratio function and baseline densities, in order to obtain an estimate of the joint density of (A₁,A₂) given X using the parametrization given in equation (9).

Under the rare disease assumption, estimation of the joint density can be done by standard maximum likelihood in the controls only, upon positing parametric models for the odds ratio function and baseline densities. To ground ideas, one could posit a single parameter model for the odds ratio function as log OR(A₁, A₂; X; ω) = ωA₁A₂, which encodes the assumption that the odds ratio association between A₁ and A₂ given X does not vary with X, i.e. no effect heterogeneity in X of the odds ratio association between A₁ and A₂ in the population. Similarly, one could posit parametric models for the exposure densities f (A₁ |A₂ = 0, X; α₁) and f (A₂ |A₁ = 0, X; α₂). For exposures that are either binary, continuous or counts, generalized linear models within the exponential family may be used to model the baseline densities. For example, counts may be modeled by assuming a Poisson distribution for the corresponding baseline density.

Let $\hat{ω}$ , $\hat{f} (A_{1} | A_{2} = 0, X)$ and $\hat{f} (A_{2} | A_{1} = 0, X)$ denote the approximate maximum likelihood estimates using controls only, and let $\hat{W} (g) = W (g; \hat{θ})$ denote the resulting estimate of W(g), where θ = (ω, α₁, α₂). Our proposed test statistic is then given by $Z = \sum_{i} {\hat{W}}_{i} (g) / (n \sqrt{{\hat{σ}}_{W}^{2}})$ , where ${\hat{σ}}_{W}^{2}$ is the estimate of $Var (\sum_{i} {\hat{W}}_{i} (g) / n)$ provided in the Appendix. Under the independence assumption, we set OR(A₁, A₂; X) to 1 for all persons in the sample. In this case, the asymptotic variance of $\sum_{i} {\hat{W}}_{i} (g) / n$ can be easily modified to reflect that OR(A₁, A₂; X) is set to 1 with no associated uncertainty.

4 |. A SIMULATION STUDY

We study the power and type I error of our proposed test in the standard setting of binary genetic and environmental variables with no other covariate, so that it is readily compared to the approach of [9]. In order to evaluate both type I error rates and power of the various test statistics, we generated simulated data following the design of [9], which encodes the magnitude of the interaction indirectly by varying RERI from 0 (to assess type I error) to 0.5. The probability of having the genetic variant was 0.5, the probability of the binary environmental variable was 0.2, and these factors were generated to be independent. Let $\mathrm{expit}(z) = e^{z} / [e^{z} + 1]$ and $\mathrm{logit}(p) = \log[p / (1 - p)]$. The disease risk model was

$$\mathrm{logit}\,\Pr(D = 1 | a_{1}, a_{2}) = α_{0} + α_{1} a_{1} + α_{2} a_{2} + α_{3} a_{1} a_{2},$$

with baseline risk equal to 0.01 (i.e. α₀ = logit(0.01)). The gene and environment main effects α₁ and α₂ were each varied over {log(0.7), log(1.2), log(2)}, and the multiplicative G-E interaction parameter α₃ was selected to yield the desired RERI, according to the formula

$$α_{3} = \mathrm{logit}[(\mathrm{RERI} - 1)\,\mathrm{expit}(α_{0}) + \mathrm{expit}(α_{0} + α_{1}) + \mathrm{expit}(α_{0} + α_{2})] - α_{0} - α_{1} - α_{2}.$$
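The formula for α₃ can be checked numerically with a short Python sketch. The helper names (`alpha3_for_reri`, `reri_from_alphas`) are hypothetical; the second function simply recomputes RERI from the four cell risks as RR₁₁ − RR₁₀ − RR₀₁ + 1, so that the round trip can be verified.

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

def alpha3_for_reri(reri, alpha0, alpha1, alpha2):
    """Multiplicative interaction coefficient that yields the requested RERI."""
    p00 = expit(alpha0)
    p10 = expit(alpha0 + alpha1)
    p01 = expit(alpha0 + alpha2)
    # RERI = (p11 - p10 - p01) / p00 + 1, solved for the doubly exposed risk p11.
    p11 = (reri - 1.0) * p00 + p10 + p01
    return logit(p11) - alpha0 - alpha1 - alpha2

def reri_from_alphas(alpha0, alpha1, alpha2, alpha3):
    """Recompute RERI from the logistic coefficients (binary exposures)."""
    p00 = expit(alpha0)
    p10 = expit(alpha0 + alpha1)
    p01 = expit(alpha0 + alpha2)
    p11 = expit(alpha0 + alpha1 + alpha2 + alpha3)
    return (p11 - p10 - p01) / p00 + 1.0

alpha0 = logit(0.01)
a3 = alpha3_for_reri(0.5, alpha0, math.log(1.2), math.log(2.0))
print(a3, reri_from_alphas(alpha0, math.log(1.2), math.log(2.0), a3))
```

Note that RERI = 0 generally corresponds to a nonzero α₃ whenever both main effects are non-null, which is why the null on the additive scale has to be encoded through this formula rather than by setting α₃ = 0.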

In each simulation, we generated 4000 cases and 4000 controls. We report results from 10,000 simulations for each setting corresponding to a particular combination of (α₁,α₂) and RERI values.
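One way to generate a fixed number of cases and controls under this design is to sample exposure cells directly from their distributions given D = 1 and D = 0, obtained by Bayes' rule. The sketch below takes that route as an assumption; the text does not state the sampling mechanism actually used, and the function names are hypothetical.

```python
import math
import random

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def cell_distributions(p_g, p_e, alpha0, alpha1, alpha2, alpha3):
    """P(a1, a2 | D = 1) and P(a1, a2 | D = 0) for independent binary exposures."""
    cells = [(0, 0), (0, 1), (1, 0), (1, 1)]
    case_w, ctrl_w = [], []
    for a1, a2 in cells:
        # Population cell probability under G-E independence.
        p_cell = (p_g if a1 else 1.0 - p_g) * (p_e if a2 else 1.0 - p_e)
        risk = expit(alpha0 + alpha1 * a1 + alpha2 * a2 + alpha3 * a1 * a2)
        case_w.append(p_cell * risk)
        ctrl_w.append(p_cell * (1.0 - risk))
    case_p = [w / sum(case_w) for w in case_w]
    ctrl_p = [w / sum(ctrl_w) for w in ctrl_w]
    return cells, case_p, ctrl_p

def simulate_case_control(n_cases, n_controls, cells, case_p, ctrl_p, rng):
    """Draw (a1, a2, D) triples with fixed numbers of cases and controls."""
    cases = rng.choices(cells, weights=case_p, k=n_cases)
    controls = rng.choices(cells, weights=ctrl_p, k=n_controls)
    return [(a1, a2, 1) for a1, a2 in cases] + [(a1, a2, 0) for a1, a2 in controls]

alpha0 = math.log(0.01 / 0.99)  # logit(0.01)
cells, case_p, ctrl_p = cell_distributions(0.5, 0.2, alpha0, math.log(2.0), math.log(1.2), 0.0)
data = simulate_case_control(4000, 4000, cells, case_p, ctrl_p, random.Random(2019))
```

Sampling from the conditional cell distributions avoids generating a very large cohort and discarding most non-cases, which would otherwise be needed at a baseline risk of 0.01.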

Figure 1 summarizes the simulation results in terms of power, comparing the proposed tests with and without the G-E independence assumption, labeled ‘U ind’ and ‘U’ respectively. The figure also presents results for the retrospective profile likelihood ratio test proposed by [9] with and without the independence assumption, labeled ‘Han ind’ and ‘Han’ respectively, as well as results from the standard RERI test based on prospective logistic regression, labeled ‘prosp’. Table 1 summarizes the type I error rates of the various methods at a significance level of 5% over a range of parameter values. Additional summaries of the type I error at a significance level of 1%, as well as the power at different Pr(G = 1) values, are presented in the Appendix.

FIGURE 1. Power of the various tests for identifying additive G-E interaction, in the simple setting in which both a₁ and a₂ are binary and no covariates are used in the model, with Pr(G = 1) = 0.5. The exposures are simulated with Pr(a₁ = 1) = 0.5 and Pr(a₂ = 1) = 0.2, the disease model Pr(D = 1|a₁, a₂) has fixed baseline risk α₀ = logit(0.01), the main effects of a₁ and a₂ are varied through α₁ and α₂, and the RERI is varied from 0 to 0.5. The compared tests are the proposed ‘U’ and ‘U ind’ (without and with assuming G-E independence), the standard test under prospective logistic regression ‘prosp’, and the retrospective profile likelihood tests ‘Han’ and ‘Han ind’ proposed in [9] (without and with assuming G-E independence).

TABLE 1.

The type I error of the compared tests, under various combinations of the prevalence of the genetic variant a₁, and the effect of the genetic and environmental variables on the disease outcome (α₁ and α₂, respectively). The tests ‘U’ and ‘U ind’ are the proposed tests without and with the assumption of G-E independence. ‘prosp’ is the usual test based on standard logistic regression, and ‘Han’ and ‘Han ind’ are the tests based on retrospective profile likelihood proposed by [9]. The type I error was calculated from 10,000 simulations, each with 4000 cases and 4000 controls.

α₁ :	log(0.7)			log(1.2)			log(2)
α₂ :	log(0.7)	log(1.2)	log(2)	log(0.7)	log(1.2)	log(2)	log(0.7)	log(1.2)	log(2)
Pr(G = 1) = 0.5
U	0.048	0.049	0.051	0.050	0.048	0.049	0.049	0.050	0.048
U ind	0.047	0.048	0.049	0.047	0.047	0.049	0.046	0.046	0.048
prosp	0.050	0.048	0.047	0.048	0.048	0.046	0.046	0.049	0.049
Han	0.049	0.049	0.050	0.050	0.051	0.050	0.047	0.052	0.051
Han ind	0.047	0.048	0.050	0.047	0.048	0.049	0.046	0.046	0.048
Pr(G = 1) = 0.2
U	0.050	0.047	0.048	0.046	0.051	0.048	0.052	0.050	0.046
U ind	0.052	0.045	0.050	0.045	0.047	0.050	0.048	0.049	0.050
prosp	0.050	0.046	0.046	0.045	0.049	0.045	0.051	0.047	0.045
Han	0.051	0.048	0.050	0.048	0.050	0.048	0.055	0.051	0.050
Han ind	0.051	0.046	0.049	0.045	0.048	0.051	0.048	0.049	0.052
Pr(G = 1) = 0.05
U	0.050	0.054	0.053	0.054	0.054	0.050	0.051	0.049	0.048
U ind	0.052	0.057	0.050	0.051	0.051	0.050	0.048	0.047	0.048
prosp	0.043	0.048	0.050	0.042	0.047	0.045	0.042	0.043	0.041
Han	0.052	0.048	0.050	0.050	0.054	0.052	0.051	0.051	0.050
Han ind	0.052	0.052	0.049	0.050	0.049	0.046	0.047	0.048	0.048

One observes that the RERI-LRT ‘Han ind’ and the proposed ‘U ind’ were equally powerful when Pr(G = 1) = 0.5 across the values considered for the other parameters, and both were dramatically more powerful than the other tests, while ‘U’ was slightly less powerful than ‘Han’, which was in turn slightly less powerful than ‘prosp’.

In additional simulations, we varied the prevalence of the genetic marker Pr(G = 1) to have population probabilities 0.2 and 0.05, while the environmental factor was maintained to have probability 0.2. All tests appeared to have correct type I error rate as shown in Table 1. Power plots similar to those appearing in Figure 1 are provided in the Appendix for these additional settings. These additional simulations confirmed that all tests became less powerful as the genetic variant became less common, with ‘Han ind’ being slightly more powerful than ‘U ind’ when Pr(G = 1) = 0.05. Overall, the simulation study confirmed that the proposed approach performs quite competitively when compared with the efficient RERI-LRT approach, in settings where both methods are available.
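When reading empirical type I error rates, it helps to know how much Monte Carlo noise to expect from 10,000 replicates. The following sketch computes an approximate 95% band around the nominal level using the binomial standard error; the function name `mc_band` is hypothetical, and the normal approximation is an assumption.

```python
import math

def mc_band(nominal_level, n_sims, z=1.96):
    """Approximate 95% Monte Carlo band for an empirical rejection rate
    when the true rejection probability equals nominal_level."""
    se = math.sqrt(nominal_level * (1.0 - nominal_level) / n_sims)
    return nominal_level - z * se, nominal_level + z * se

low_5, high_5 = mc_band(0.05, 10_000)   # band for the 5% level
low_1, high_1 = mc_band(0.01, 10_000)   # band for the 1% level
print(round(low_5, 4), round(high_5, 4))
print(round(low_1, 4), round(high_1, 4))
```

With 10,000 simulations the band at the 5% level is roughly 0.046 to 0.054, so the entries of Table 1 can be read against that range.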

We also conducted further simulation studies to evaluate the performance of the various tests under violation of the G-E independence assumption, under violation of the rare disease assumption, and for a variety of exposure variable types. Simulation results are provided in the Appendix. In summary, under violation of the G-E independence assumption, tests that assume G-E independence had inflated type I error. Violation of the rare disease assumption did not have a substantial impact on the performance of the tests, except for inflated type I error under relatively extreme genetic and environmental main effects. Our proposed tests performed quite well for various types of risk factors.

In the following section, we consider a data application of the new approach in which the RERI-based approach is no longer readily available and cannot easily be applied without making further unnecessary assumptions.

5 |. OVARIAN CANCER APPLICATION

We applied the proposed test of additive interaction to the Israeli Ovarian Cancer data [19], which were recently analyzed by [12, 13, 14]. The Israeli Ovarian Cancer data were collected from a population-based case-control study based on all ovarian cancer patients identified in Israel between March 1st, 1994 and June 30th, 1999. The main goal of the study was to examine the interplay of the BRCA1/2 mutation and the known reproductive/gynecological risk factors of ovarian cancer. We obtained the data from the R package CGEN provided by [12]. Although previous analyses aimed to detect a multiplicative gene-environment interaction between having the BRCA1/2 mutation and two environmental exposures, number of years of oral contraceptive use (OC) and number of children (parity), here we are primarily concerned with determining whether such interactions might be operating on the additive scale.

There were 832 cases and 747 controls available in this case-control study. A number of covariates were available for confounding adjustment and also to enforce the independence assumption. Carriers (non-carriers) of BRCA1/2 mutation were coded as one (zero), and we considered the following logistic regression model

$$\mathrm{logit}[\Pr(G = 1 | E = 0, X)] = β_{0} + β_{age} I(\text{age} > 50) + β_{ash} I(\text{Ashkenazi}) + β_{non\text{-}ash} I(\text{non-Ashkenazi}) + β_{PHB} I(\text{personal history of breast cancer}) + β_{FHB} I(\text{family history of breast cancer}) + β_{FBO} I(\text{family history of ovarian cancer}),$$

where X denotes the set of adjusted covariates. Specifically, we included an indicator variable for age > 50, indicator variables for Ashkenazi Jewish and non-Ashkenazi ethnicity (with mixed race serving as the reference category), and indicator variables for personal history of breast cancer, family history of breast cancer, and family history of ovarian cancer. Both environmental exposures are naturally coded as counts and were therefore modeled using Poisson regression adjusting for the same set of covariates X, that is

$$\log E[E | G = 0, X] = α_{0} + α_{age} I(\text{age} > 50) + α_{ash} I(\text{Ashkenazi}) + α_{non\text{-}ash} I(\text{non-Ashkenazi}) + α_{PHB} I(\text{personal history of breast cancer}) + α_{FHB} I(\text{family history of breast cancer}) + α_{FBO} I(\text{family history of ovarian cancer}).$$

We also conducted a sensitivity analysis without adjusting for personal history of breast cancer in both models.

We performed the tests both with and without the G-E independence assumption. Without G-E independence, we assumed a generalized odds ratio function of the form $OR(G, E; X) = e^{ωGE}$, where ω is the log odds ratio parameter, as suggested in Section 3. We used our proposed variance estimator to evaluate 95% confidence intervals and p-values. We also included a comparison with the RERI-based test using the retrospective profile likelihood proposed in [9], with the environmental factor dichotomized as $E_{bin} = 1(E > \bar{E})$.

Table 2 provides results from testing for a G×E additive interaction with and without making the G-E independence assumption. In accordance with the simulation results, the independence assumption yields a test statistic that is consistently more extreme, for both exposures considered, than the corresponding test that does not incorporate the assumption. Specifically, we reject the null hypothesis of no additive G-E interaction between BRCA1/2 mutation and parity at the 0.05 level, and we found no conclusive evidence of an additive interaction with OC. Similar results were observed when using the RERI test based on retrospective profile likelihood (Han p-val) with dichotomized environmental factor, as well as in the sensitivity analysis. It is particularly interesting to compare these findings with previous analyses of these data, which have primarily been concerned with detecting the presence of a multiplicative G×E interaction. For instance, [13] leveraged the independence assumption to detect a G × E multiplicative interaction only with OC and failed to find evidence of a similar interaction with parity, thus essentially reporting the opposite findings to ours. However, our findings are potentially more scientifically relevant given that interactions on the multiplicative scale may be harder to interpret biologically.

6 |. CONCLUSION

We have described a very general framework to test for G×E additive interactions exploiting G-E independence in case-control studies. The proposed strategy has several advantages over existing RERI-based strategies. In particular, it does not require a regression model for the outcome, and therefore is robust to model misspecification of the outcome regression, a potential concern particularly if E is a count or continuous variable and additional covariates are included in the regression.

The approach put forward in this paper is closely related to the semiparametric framework of [15], which characterized the set of influence functions of a model of interaction (on the additive or multiplicative scale) under a semiparametric union model in which only a subset but not all of the parametric models used to describe the data generating mechanism need to be correct for valid inference. In fact, one can show that our proposed test statistic belongs to the general class of test statistics for additive interaction associated with the set of influence functions in [15]. However, because [15] did not allow for outcome dependent sampling and only considered standard prospective random sampling, not all test statistics in their class may be used under a case-control design. Thus, an important contribution of the current paper has been to characterize the subset of the class of test statistics of an additive interaction that may be used both under prospective and retrospective sampling.

An important limitation of the proposed approach is that it does not readily produce an estimate of the risk difference parameters, which are often of primary interest for understanding the public health significance of any significant finding. To obtain such estimates, one would need an estimate of the main effects of the exposures, which are treated as unspecified nuisance parameters in the proposed approach on grounds of robustness. Addressing this limitation is a priority for future research to extend the methods described herein.

TABLE 2.

Results for tests of null hypothesis of no additive G × E interaction between presence of BRCA1/2 mutation (G) and number of years of oral contraceptive (OC) use, and parity (E variables), with and without G-E independence assumption. U is the proposed (standardized) test statistic, and its 95% bootstrap confidence interval, p-value, and the RERI test based on retrospective profile likelihood (Han p-val) are provided.

E	Assume G-E indept	Primary Analysis				Sensitivity Analysis
E	Assume G-E indept	U	95% CI	P-val	Han P-val	U	95% CI	P-val	Han P-val
Parity	Yes	−0.044	(−0.084, −0.004)	0.030	0.046	−0.039	(−0.075, −0.004)	0.028	0.040
Parity	No	−0.052	(−0.102, −0.001)	0.047	0.107	−0.048	(−0.095, −0.001)	0.046	0.089
OC	Yes	0.050	(−0.007, 0.107)	0.088	0.685	0.044	(−0.013, 0.100)	0.129	0.584
OC	No	−0.051	(−0.112, 0.011)	0.110	0.666	−0.052	(−0.126, 0.022)	0.169	0.699

ACKNOWLEDGEMENT

The research was supported by U.S. National Institutes of Health grants R01AI104459, R01AI127271, R01CA222147. We thank the associate editor and two reviewers for their helpful comments.

APPENDIX

1. Proof that $\mathrm{Var}(\sum_{i} \hat{U}_{i} / n) > \mathrm{Var}(\sum_{i} \tilde{U}_{i} / n)$.

To show the result requires the influence function of $\hat{θ} = {(\hat{ω}, {\hat{p}}_{1} = {\hat{p}}_{1} (0), {\hat{p}}_{2} = {\hat{p}}_{2} (0))}^{T}$ which is of the form

$$IF = -E\left(\frac{\partial R(θ)}{\partial θ}\right)^{-1} R(θ)$$

where $R(θ) = \{(1 - D)(1 - A_{2}) \times [A_{1} - E(A_{1}; θ)], (1 - D)(1 - A_{1}) \times [A_{2} - E(A_{2}; θ)], (1 - D) \times [A_{1} A_{2} - E(A_{1} A_{2}; θ)]\}^{T}$, where the first component is the score of p₁(0), the second component is the score of p₂(0), the last component is the score of ω, and θ = (ω, p₁, p₂). Standard matrix algebra can be used to show that, at the submodel where A₁ and A₂ are independent, IF = (IF₁, IF₂, IF₃), where:

$$IF_{1} = E[(1 - A_{2})(1 - D)]^{-1} (1 - A_{2})(A_{1} - E(A_{1}))(1 - D) \approx -E[1 - E(A_{2} | D = 0)]^{-1} (A_{2} - E(A_{2} | D = 0))(A_{1} - E(A_{1} | D = 0))(1 - D) + (A_{1} - E(A_{1} | D = 0))(1 - D)$$

$$IF_{2} = E[(1 - A_{1})(1 - D)]^{-1} (1 - A_{1})(A_{2} - E(A_{2}))(1 - D) \approx -E[1 - E(A_{1} | D = 0)]^{-1} (A_{1} - E(A_{1} | D = 0))(A_{2} - E(A_{2} | D = 0))(1 - D) + (A_{2} - E(A_{2} | D = 0))(1 - D)$$

$$IF_{3} = E[(A_{1} - E(A_{1} | D = 0))^{2} | D = 0]^{-1} E[(A_{2} - E(A_{2} | D = 0))^{2} | D = 0]^{-1} \times (A_{1} - E(A_{1} | D = 0))(A_{2} - E(A_{2} | D = 0))(1 - D)$$

A Taylor series argument then gives

$$\begin{aligned}
\sum_{i} \hat{U}_{i} / \sqrt{n} &\approx \sum_{i} U_{i} / \sqrt{n} - E[(A_{2} - p_{2}(0)) D]\, IF_{1} - E[(A_{1} - p_{1}(0)) D]\, IF_{2} - E[A_{1} A_{2} (A_{2} - p_{2}(0))(A_{1} - p_{1}(0)) D]\, IF_{3} \\
&= \sum_{i} U_{i} / \sqrt{n} - E[(A_{2} - p_{2}(0)) D] \sum_{i} (A_{1,i} - E(A_{1} | D = 0))(1 - D_{i}) / \sqrt{n} \\
&\quad - E[(A_{1} - p_{1}(0)) D] \sum_{i} (A_{2,i} - E(A_{2} | D = 0))(1 - D_{i}) / \sqrt{n} \\
&\quad - \left( E[A_{1} A_{2} (A_{2} - p_{2})(A_{1} - p_{1}) D] \{p_{1} p_{2} (1 - p_{1})(1 - p_{2})\}^{-1} + E[(A_{2} - p_{2}) D] (1 - p_{2})^{-1} + E[(A_{1} - p_{1}) D] (1 - p_{1})^{-1} \right) \\
&\qquad \times \sum_{i} (A_{1,i} - E(A_{1} | D = 0))(A_{2,i} - E(A_{2} | D = 0))(1 - D_{i}) / \sqrt{n}
\end{aligned}$$

Upon noting that the above four terms are mutually uncorrelated, we have that:

$$\mathrm{Var}(\sum_{i} \hat{U}_{i} / n) \approx V_{1} + V_{2} + V_{3}$$

where

$$V_{1} = \mathrm{Var}(U) / n$$

$$V_{2} = E[(A_{2} - p_{2}(0)) D]^{2}\, \mathrm{Var}((A_{1} - E(A_{1} | D = 0))(1 - D)) / n + E[(A_{1} - p_{1}(0)) D]^{2}\, \mathrm{Var}((A_{2} - E(A_{2} | D = 0))(1 - D)) / n$$

$$V_{3} = \left( E[A_{1} A_{2} (A_{2} - p_{2})(A_{1} - p_{1}) D] \{p_{1} p_{2} (1 - p_{1})(1 - p_{2})\}^{-1} + E[(A_{2} - p_{2}) D] (1 - p_{2})^{-1} + E[(A_{1} - p_{1}) D] (1 - p_{1})^{-1} \right)^{2} \times \mathrm{Var}((A_{1} - E(A_{1} | D = 0))(A_{2} - E(A_{2} | D = 0))(1 - D)) / n$$

A similar derivation shows that

$$\sum_{i} \tilde{U}_{i} / \sqrt{n} \approx \sum_{i} U_{i} / \sqrt{n} - E[(A_{2} - p_{2}(0)) D] \sum_{i} (A_{1,i} - E(A_{1} | D = 0))(1 - D_{i}) / \sqrt{n} - E[(A_{1} - p_{1}(0)) D] \sum_{i} (A_{2,i} - E(A_{2} | D = 0))(1 - D_{i}) / \sqrt{n}$$

which gives

$$\mathrm{Var}(\sum_{i} \tilde{U}_{i} / n) \approx V_{1} + V_{2}$$

proving the result.

2. Asymptotic variance for unified class of test statistics

Our proposed test statistic is given by $Z = \sum_{i} \hat{W}_{i}(g) / (n \hat{σ}_{W})$, where $\hat{σ}_{W}^{2}$ is an estimate of $\mathrm{Var}(\sum_{i} \hat{W}_{i}(g) / n)$ that one can derive using a standard Taylor series argument:

$$\mathrm{Var}(\sum_{i} \hat{W}_{i}(g) / n) \approx n^{-1} \mathrm{Var}(W(g, θ)) + n^{-1} E(W_{θ}^{T}(g))\, \mathrm{Var}(S_{θ}^{†})\, E(W_{θ}(g)) \tag{1}$$

where W_θ(g) is the derivative of W(g, θ) with respect to θ evaluated at the truth, and $S_{θ}^{†}$ is the influence function of $\hat{θ}$ [1]. For instance, when $\hat{θ}$ is a maximum likelihood estimator, $S_{θ}^{†} = E(S_{θ} S_{θ}^{T})^{-1} S_{θ}$, where S_θ denotes the score of θ. Under the assumption that A₁ and A₂ are independent, we may set $\hat{ω} = 0$, so that the odds ratio function equals 1, and redefine θ = (α₁, α₂). Note also that under independence the joint density (3) simplifies to f(A₁,A₂|X) = f(A₁|X)f(A₂|X), leading to some simplification in the above expression for the asymptotic variance of the test statistic.
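A plug-in version of expression (1) can be sketched as follows. The sketch assumes that the per-subject values of W, the per-subject derivatives of W with respect to θ, and the per-subject influence function values of $\hat{θ}$ have all been computed elsewhere; the function name and argument names are hypothetical. It implements the displayed formula as written, with no cross-covariance term.

```python
def plug_in_variance(w_values, w_grad_rows, infl_rows):
    """Plug-in estimate of Var(sum_i W_i / n) following expression (1).

    w_values    : list of n values W_i(g; theta_hat)
    w_grad_rows : n rows, each the derivative of W_i with respect to theta (length p)
    infl_rows   : n rows, each the influence function of theta_hat for subject i (length p)
    """
    n = len(w_values)
    p = len(w_grad_rows[0])
    # Empirical Var(W).
    mean_w = sum(w_values) / n
    var_w = sum((w - mean_w) ** 2 for w in w_values) / n
    # Empirical E(W_theta): average derivative.
    grad = [sum(row[j] for row in w_grad_rows) / n for j in range(p)]
    # Empirical Var(S_dagger): p x p covariance of the influence function.
    s_mean = [sum(row[j] for row in infl_rows) / n for j in range(p)]
    cov = [[sum((row[j] - s_mean[j]) * (row[k] - s_mean[k]) for row in infl_rows) / n
            for k in range(p)] for j in range(p)]
    # Quadratic form E(W_theta)^T Var(S_dagger) E(W_theta).
    quad = sum(grad[j] * cov[j][k] * grad[k] for j in range(p) for k in range(p))
    return (var_w + quad) / n
```

Under G-E independence the same function applies with the ω component dropped from θ, so that the rows have length p equal to the dimension of (α₁, α₂).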

3. Proof of Lemma 1.

Consider the nonparametric additive representation of μ(a₁, a₂, x) given by μ(a₁, a₂, x) = β₁(a₁, x) + β₂(a₂, x) + β₃(a₁, a₂, x) + β₄(x) where β₁(a₁, x) is the main effect of A₁ and satisfies β₁(0, x) = 0, likewise β₂(a₂, x) is the main effect of A₂ and satisfies β₂(0, x) = 0, β₃(a₁, a₂, x) is the additive interaction between A₁ and A₂ and satisfies β₃(0, a₂, x) = β₃(a₁, 0, x) = 0, and β₄(x) is the main effect of X. For any function g, note that

$$\begin{aligned}
E\{W(g) | D = 1, x\} &= \iint w(a_{1}, a_{2}, x, d = 1; g) f(a_{1}, a_{2} | D = 1, x)\, dν(a_{1}, a_{2}) \\
&= \frac{\iint w(a_{1}, a_{2}, x, 1; g) μ(a_{1}, a_{2}, x) f(a_{1}, a_{2} | x) f(x)\, dν(a_{1}, a_{2})}{\iint μ(a_{1}, a_{2}, x) f(a_{1}, a_{2} | x) f(x)\, dν(a_{1}, a_{2})} \\
&\propto \iint w(a_{1}, a_{2}, x, 1; g) μ(a_{1}, a_{2}, x) f(a_{1} | A_{2} = 0, x) f(a_{2} | A_{1} = 0, x)\, OR(a_{1}, a_{2}; x)\, dν(a_{1}, a_{2}) \\
&= \iint w(a_{1}, a_{2}, x, 1; g) β_{3}(a_{1}, a_{2}, x) f(a_{1} | A_{2} = 0, x) f(a_{2} | A_{1} = 0, x)\, OR(a_{1}, a_{2}; x)\, dν(a_{1}, a_{2})
\end{aligned}$$

since

\int w (a_{1}, a_{2}, x, 1; g) β_{1} (a_{1}, x) f (a_{2} | A_{1} = 0, x) O R (a_{1}, a_{2}; x) d ν (a_{2}) = {β_{1} (a_{1}, x) \int g (A_{1}, A_{2}, X) f (a_{2} | A_{1} = 0, x) d ν (a_{2}) - β_{1} (a_{1}, x) \iint g (A_{1}, a_{2}, X) f (a_{2} | A_{1} = 0, X) d μ (a_{2}) f (a_{2} | A_{1} = 0, x) d ν (a_{2}) - β_{1} (a_{1}, x) \iint g (a_{1}, A_{2}, X) f (a_{1} | A_{2} = 0, X) d μ (a_{1}) f (a_{2} | A_{1} = 0, x) d ν (a_{2}) + β_{1} (a_{1}, x) \iint g (a_{1}, a_{2}, X) f (a_{1} | A_{2} = 0, X) d μ (a_{1}) f (a_{2} | A_{1} = 0, x) d ν (a_{2}) = 0

furthermore by symmetry,

$$\int w(a_{1}, a_{2}, x, 1; g) β_{2}(a_{2}, x) f(a_{1} | A_{2} = 0, x)\, OR(a_{1}, a_{2}; x)\, dν(a_{1}) = 0$$

and finally

\int w (a_{1}, a_{2}, x, 1; g) β_{4} (x) f (a_{2} | A_{1} = 0, x) O R (a_{1}, a_{2}; x) d ν (a_{2}) = {β_{4} (x) \int g (A_{1}, A_{2}, X) f (a_{2} | A_{1} = 0, x) d ν (a_{2}) - β_{4} (x) \int g (A_{1}, a_{2}, X) f (a_{2} | A_{1} = 0, X) d μ (a_{2}) f (a_{2} | A_{1} = 0, x) d ν (a_{2}) - β_{4} (x) \iint g (a_{1}, A_{2}, X) f (a_{1} | A_{2} = 0, X) d μ (a_{1}) f (a_{2} | A_{1} = 0, x) d ν (a_{2}) + β_{4} (x) \iint g (a_{1}, A_{2}, X) f (a_{1} | A_{2} = 0, X) d μ (a_{1}) f (a_{2} | A_{1} = 0, x) d ν (a_{2}) = 0

therefore

$$\iint w(a_{1}, a_{2}, x, 1; g) \{β_{1}(a_{1}, x) + β_{2}(a_{2}, x) + β_{4}(x)\} \times f(a_{1} | A_{2} = 0, x) f(a_{2} | A_{1} = 0, x)\, OR(a_{1}, a_{2}; x)\, dν(a_{1}, a_{2}) = 0$$

for any choice of g. Thus, the null of no additive interaction β₃(a₁, a₂, x) = 0 for all (a₁, a₂, x) implies that E {W(g)|D = 1, x} = 0. We get the result in the other direction by choosing g(a₁, a₂, x) = g^*(a₁, a₂, x) = β₃(a₁, a₂, x) which gives

E {W (g) | D = 1, x} = 0 for all g and x implies that

$$\iint w(a_{1}, a_{2}, x, 1; g^{*})^{2} f(a_{1} | A_{2} = 0, x) f(a_{2} | A_{1} = 0, x)\, dν(a_{1}, a_{2}) = 0 \text{ for all } x$$

which in turn implies that β₃(a₁, a₂, x) = 0 for all (a₁, a₂, x) proving the result. □

4. Failure of RERI-based approaches with continuous exposure

We describe in some detail the failure of RERI-based approaches that use standard logistic regression when at least one exposure is non-discrete and auxiliary covariates are present. In this vein, suppose that A₁ is continuous, while A₂ may be binary. In practice, to evaluate RERI in this context, one typically proceeds by estimating a standard logistic regression for Pr {D = 1|a₁, a₂, x} using a simple parametric formulation of the model, such as:

$$\mathrm{logit}\,\Pr\{D = 1 | a_{1}, a_{2}, x; α_{0}, α_{1}, α_{2}, α_{3}, α_{4}\} = α_{0} + α_{1} a_{1} + α_{2} a_{2} + α_{3} a_{1} a_{2} + α_{4}^{'} x, \tag{2}$$

where logit(p) =log{p/(1 − p)} and the parameters (α₀, α₁, α₂, α₃, α₄) are variation independent [2]. Below, we argue that such a standard logistic regression will generally be incompatible with the null hypothesis of no additive interaction if both exposures (a₁, a₂) have a non-null association with the outcome. Specifically, suppose that the main effects of A₁ and A₂ and X are correctly specified in the logistic model, i.e.

$$\mathrm{logit}\,\Pr\{D = 1 | a_{1}, a_{2} = 0, x\} = α_{0} + α_{1} a_{1} + α_{4}^{'} x$$

$$\mathrm{logit}\,\Pr\{D = 1 | a_{1} = 0, a_{2}, x\} = α_{0} + α_{2} a_{2} + α_{4}^{'} x$$

with α₁ ≠ 0 and α₂ ≠ 0. Then there will generally be no parameter value of (α₀, α₁, α₂, α₃, α₄) that encodes the null hypothesis of no additive interaction; consequently, any RERI-type test based on model (2) will generally have inflated type I error rate for testing the null of no additive interaction.

To further understand the failure of RERI in this context, note that under the null hypothesis of no additive interaction

$$0 = \Pr\{D = 1 | a_{1}, a_{2}, x\} - \Pr\{D = 1 | a_{1}, a_{2} = 0, x\} - \Pr\{D = 1 | a_{1} = 0, a_{2}, x\} + \Pr\{D = 1 | a_{1} = 0, a_{2} = 0, x\} \tag{3}$$

for all possible values of a₁ and a₂. We show that RERI = 0, or equivalently, α₃ = 0, fails to imply the above null. The interaction function on the log odds ratio scale is given by

$$\begin{aligned}
θ(a_{1}, a_{2}, x; α_{0}, α_{1}, α_{2}, α_{4}) &= \mathrm{logit}\,\Pr\{D = 1 | a_{1}, a_{2}, x; α_{0}, α_{1}, α_{2}, α_{4}\} - \mathrm{logit}\,\Pr\{D = 1 | a_{1}, a_{2} = 0, x; α_{0}, α_{1}, α_{4}\} \\
&\quad - \mathrm{logit}\,\Pr\{D = 1 | a_{1} = 0, a_{2}, x; α_{0}, α_{2}, α_{4}\} + \mathrm{logit}\,\Pr\{D = 1 | a_{1} = 0, a_{2} = 0, x; α_{0}, α_{4}\} \\
&\overset{(3)}{=} \mathrm{logit}\{\Pr\{D = 1 | a_{1} = 0, a_{2}, x; α_{0}, α_{2}, α_{4}\} + \Pr\{D = 1 | a_{1}, a_{2} = 0, x; α_{0}, α_{1}, α_{4}\} - \Pr\{D = 1 | a_{1} = 0, a_{2} = 0, x; α_{0}, α_{4}\}\} \\
&\quad - \mathrm{logit}\,\Pr\{D = 1 | a_{1} = 0, a_{2}, x; α_{0}, α_{2}, α_{4}\} - \mathrm{logit}\,\Pr\{D = 1 | a_{1}, a_{2} = 0, x; α_{0}, α_{1}, α_{4}\} \\
&\quad + \mathrm{logit}\,\Pr\{D = 1 | a_{1} = 0, a_{2} = 0, x; α_{0}, α_{4}\},
\end{aligned}$$

in which case, correct specification of a logistic model for Pr {D = 1|a₁, a₂, x} under a null additive interaction is of the form

$$\mathrm{logit}\,\Pr\{D = 1 | a_{1}, a_{2}, x; α_{0}, α_{1}, α_{2}, α_{4}\} = α_{0} + α_{1} a_{1} + α_{2} a_{2} + θ(a_{1}, a_{2}, x; α_{0}, α_{1}, α_{2}, α_{4}) + α_{4}^{'} x \tag{4}$$

Because of the nonlinear dependence of θ on a₁ and x, it is clear that model (4) cannot be nested in the standard logistic model (2), and therefore the latter cannot be used to obtain a valid test of the null hypothesis of no additive interaction.
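The incompatibility can be illustrated numerically. The sketch below assumes a continuous a₁, a binary a₂ fixed at 1, no covariates x, and illustrative coefficient values that are not taken from the text. It computes the log-odds interaction θ(a₁, 1) implied by a null additive interaction and the constant α₃ that model (2) would need at each a₁, namely θ(a₁, 1)/a₁. The function name `theta_null` is hypothetical.

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

def theta_null(a1, a2, alpha0, alpha1, alpha2):
    """Log-odds interaction implied by a null additive interaction,
    given logistic main effects and no covariates."""
    p00 = expit(alpha0)
    p10 = expit(alpha0 + alpha1 * a1)
    p01 = expit(alpha0 + alpha2 * a2)
    # Null additive interaction: p11 - p10 - p01 + p00 = 0.
    p11 = p10 + p01 - p00
    return logit(p11) - logit(p10) - logit(p01) + logit(p00)

# Illustrative values (assumptions): baseline risk 0.01, odds ratio 2 per unit of each exposure.
alpha0, alpha1, alpha2 = logit(0.01), math.log(2.0), math.log(2.0)
required_alpha3 = {a1: theta_null(a1, 1, alpha0, alpha1, alpha2) / a1
                   for a1 in (1.0, 2.0, 3.0)}
for a1, a3 in required_alpha3.items():
    print(f"a1 = {a1:.0f}: alpha3 needed under the null = {a3:.4f}")
```

In this illustration the required α₃ changes with a₁, so no single coefficient on a₁a₂ in model (2) reproduces the null of no additive interaction at every exposure level.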

In order to implement an LRT of additive interaction using the RERI approach, an analyst would need to carefully specify a model for the odds ratio interaction, so that model (4) is recovered under the null of no additive interaction. Such a parametrization of the outcome regression will characteristically be nonstandard in the sense that the interaction of the resulting logistic model would need to be explicitly defined as a function of models for both exposure main effects, and the effect of covariates. Such a parametrization of a logistic model would seldom naturally arise in practice purely on scientific basis. Furthermore, one would generally be unable to easily obtain parameter estimates for such a model using off-the-shelf statistical software for standard logistic regression, which completely undermines the often quoted practical advantage of the RERI approach.

5. Additional simulation studies

5.1. Power of the tests under alternative values of Pr(G=1) and type I error at a significance level of 1%

Figure 1: Power of the various tests for identifying additive G-E interaction, in the simple setting in which both a₁ and a₂ are binary and no covariates are used in the model, with Pr(G = 1) = 0.2. The exposures are simulated with Pr(a₁ = 1) = 0.2 and Pr(a₂ = 1) = 0.2, the disease model Pr(D = 1|a₁, a₂) has fixed baseline risk α₀ = logit(0.01), the main effects of a₁ and a₂ are varied through α₁ and α₂, and the RERI is varied from 0 to 0.5. The compared tests are the proposed ‘U’ and ‘U ind’ (without and with assuming G-E independence), the standard test under prospective logistic regression ‘prosp’, and the retrospective profile likelihood tests ‘Han’ and ‘Han ind’ proposed in Han (2012) (without and with assuming G-E independence).

Figure 2: Power of the various tests for identifying additive G-E interaction, in the simple setting in which both a₁ and a₂ are binary and no covariates are used in the model, with Pr(G = 1) = 0.05. The exposures are simulated with Pr(a₁ = 1) = 0.05 and Pr(a₂ = 1) = 0.2, the disease model Pr(D = 1|a₁, a₂) has fixed baseline risk α₀ = logit(0.01), the main effects of a₁ and a₂ are varied through α₁ and α₂, and the RERI is varied from 0 to 0.5. The compared tests are the proposed ‘U’ and ‘U ind’ (without and with assuming G-E independence), the standard test under prospective logistic regression ‘prosp’, and the retrospective profile likelihood tests ‘Han’ and ‘Han ind’ proposed in Han (2012) (without and with assuming G-E independence).

Table 1:

The type I error of the compared tests at significance level of 1%, under various combinations of the prevalence of the genetic variant a₁, and the effect of the genetic and environmental variables on the disease outcome (α₁ and α₂, respectively). The tests ‘U’ and ‘U ind’ are the proposed tests without and with the assumption of G-E independence. ‘prosp’ is the usual test based on standard logistic regression, and ‘Han’ and ‘Han ind’ are the tests based on retrospective profile likelihood proposed by [3]. The type I error was calculated from 10,000 simulations, each with 4000 cases and 4000 controls.

α₁ :	log(0.7)			log(1.2)			log(2.0)
α₂ :	log(0.7)	log(1.2)	log(2.0)	log(0.7)	log(1.2)	log(2.0)	log(0.7)	log(1.2)	log(2.0)
Pr(G=1)=0.5
U	0.010	0.010	0.011	0.010	0.010	0.009	0.011	0.012	0.010
U ind	0.011	0.009	0.008	0.007	0.009	0.008	0.009	0.010	0.010
prosp	0.010	0.008	0.009	0.009	0.009	0.009	0.010	0.010	0.009
Han	0.009	0.010	0.010	0.010	0.009	0.010	0.010	0.012	0.010
Han ind	0.011	0.009	0.009	0.007	0.009	0.008	0.009	0.010	0.011
Pr(G=1)=0.2
U	0.011	0.012	0.010	0.011	0.012	0.012	0.011	0.012	0.010
U ind	0.010	0.010	0.012	0.010	0.008	0.011	0.011	0.011	0.011
prosp	0.010	0.008	0.008	0.009	0.009	0.009	0.008	0.009	0.008
Han	0.010	0.009	0.009	0.011	0.010	0.010	0.009	0.009	0.009
Han ind	0.011	0.010	0.011	0.010	0.009	0.011	0.011	0.010	0.011
Pr(G=1)=0.05
U	0.011	0.016	0.017	0.014	0.020	0.017	0.018	0.016	0.016
U ind	0.010	0.013	0.011	0.010	0.014	0.011	0.010	0.009	0.011
prosp	0.006	0.010	0.013	0.007	0.012	0.012	0.010	0.010	0.011
Han	0.009	0.009	0.010	0.009	0.010	0.009	0.013	0.009	0.009
Han ind	0.010	0.011	0.010	0.009	0.012	0.011	0.010	0.008	0.010

5.2. Violation of G-E independence assumption

In this section, we evaluate the performance of the proposed tests under the scenario where G and E are in fact not independent. This is done by generating the binary environmental variable as Binomial(Pr(E = 1|G = g) = −1 − 0.1g). As expected, the tests that rely on G-E independence are no longer valid.

Table 2:

The type I error of the compared tests, under various combinations of the prevalence of the genetic variant a₁, and the effect of the genetic and environmental variables on the disease outcome (α₁ and α₂, respectively). The tests ‘U’ and ‘U ind’ are the proposed tests without and with the assumption of G-E independence. ‘prosp’ is the usual test based on standard logistic regression, and ‘Han’ and ‘Han ind’ are the tests based on retrospective profile likelihood proposed by [3]. The type I error was calculated from 10,000 simulations, each with 4000 cases and 4000 controls.

α₁ :	log(0.7)			log(1.2)			log(2.0)
α₂ :	log(0.7)	log(1.2)	log(2.0)	log(0.7)	log(1.2)	log(2.0)	log(0.7)	log(1.2)	log(2.0)
Pr(G=1)=0.5
U	0.048	0.049	0.051	0.049	0.05	0.047	0.052	0.052	0.048
U ind	0.22	0.303	0.372	0.249	0.301	0.33	0.253	0.274	0.272
prosp	0.047	0.049	0.047	0.048	0.049	0.046	0.049	0.051	0.048
Han	0.047	0.049	0.05	0.049	0.05	0.048	0.05	0.053	0.051
Han ind	0.22	0.303	0.374	0.249	0.302	0.33	0.252	0.272	0.27
Pr(G=1)=0.2
U	0.05	0.048	0.053	0.049	0.05	0.051	0.049	0.052	0.047
U ind	0.133	0.187	0.242	0.202	0.233	0.254	0.26	0.268	0.26
prosp	0.049	0.046	0.05	0.046	0.048	0.05	0.045	0.049	0.044
Han	0.051	0.049	0.053	0.047	0.05	0.052	0.05	0.051	0.049
Han ind	0.125	0.181	0.233	0.195	0.226	0.243	0.252	0.26	0.25
Pr(G=1)=0.05
U	0.051	0.056	0.051	0.05	0.05	0.049	0.052	0.043	0.049
U ind	0.081	0.1	0.115	0.101	0.115	0.116	0.14	0.137	0.134
prosp	0.043	0.05	0.046	0.041	0.044	0.045	0.042	0.039	0.042
Han	0.052	0.054	0.051	0.049	0.05	0.05	0.051	0.047	0.054
Han ind	0.071	0.086	0.101	0.09	0.1	0.105	0.13	0.123	0.12

Figure 3: — Power of the various tests for identifying additive G-E interaction under violation of G-E independence, in the simple settings in which both a₁ and a₂ are binary, and no covariates are used in the model, with Pr(G = 1) = 0.5. The simulated a₁ is common Pr(a₁ = 1) = 0.2, Pr(a₂ = 1) = 0.2, and the disease model Pr(D = 1|a₁*, a*₂) has the fixed baseline risk α₀ = logit(0.01), the main effects of the a₁*, a*₂ are varied as α₁*, α*₂, and the RERI is varied from 0 to 0.5. The compared estimators are the proposed ‘U’ and ‘U ind’ (without and with assuming G-E independence), ‘prosp’ is the standard test under prospective logistic regression and ‘Han’ and ‘Han ind’ is the retrospective profile likelihood test proposed in Han (2012) without and with assuming G-E independence.

Figure 4: — Power of the various tests for identifying additive G-E interaction under violation of G-E independence, in the simple settings in which both a₁ and a₂ are binary, and no covariates are used in the model, with Pr(G = 1) = 0.2. The simulated a₁ is common Pr(a₁ = 1) = 0.2, Pr(a₂ = 1) = 0.2, and the disease model Pr(D = 1|a₁*, a*₂) has the fixed baseline risk α₀ = logit(0.01), the main effects of the a₁ *, a*₂ are varied as α₁*, α*₂, and the RERI is varied from 0 to 0.5. The compared estimators are the proposed ‘U’ and ‘U ind’ (without and with assuming G-E independence), ‘prosp’ is the standard test under prospective logistic regression and ‘Han’ and ‘Han ind’ is the retrospective profile likelihood test proposed in Han (2012) without and with assuming G-E independence.

Figure 5: Power of the tests for identifying additive G-E interaction under violation of G-E independence, in the simple setting in which both a₁ and a₂ are binary and no covariates are used in the model, with Pr(G = 1) = 0.05. The exposures are simulated with Pr(a₁ = 1) = 0.05 and Pr(a₂ = 1) = 0.2. The disease model Pr(D = 1|a₁*, a₂*) has a fixed intercept α₀ = logit(0.01) (baseline risk 0.01), the main effects α₁*, α₂* of a₁*, a₂* are varied, and the RERI is varied from 0 to 0.5. The compared tests are the proposed ‘U’ and ‘U ind’ (without and with the G-E independence assumption), ‘prosp’, the standard test under prospective logistic regression, and ‘Han’ and ‘Han ind’, the retrospective profile likelihood tests proposed in Han et al. (2012) without and with the G-E independence assumption.

5.3. Violation of rare outcome assumption

In this section, we evaluate the performance of the proposed tests when the outcome is not very rare, with baseline risk equal to 0.1 (i.e., α₀ = logit(0.1)), ten times the baseline risk of 0.01 used in the preceding simulations. In most of the coefficient settings, the tests remain valid with similar power. We observed slightly inflated type I error under large G and E main effects with Pr(G = 1) = 0.5.
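The role of the rare-outcome assumption can be illustrated with a small numerical sketch. It is not part of the authors' code: the helper `reri.or.under.null` and the chosen effect sizes are illustrative assumptions. The sketch fixes risks that satisfy no additive interaction on the risk scale and then computes the RERI from odds ratios, which is what case-control data identify. The odds-ratio-based RERI stays close to zero at a baseline risk of 0.01 and drifts away from zero at a baseline risk of 0.1.

```r
expit <- function(x) exp(x) / (1 + exp(x))
logit <- function(x) log(x / (1 - x))
odds  <- function(p) p / (1 - p)

# Hypothetical helper: RERI computed from odds ratios when the true risks
# satisfy RERI = 0 on the risk scale, i.e. p11 = p10 + p01 - p00.
reri.or.under.null <- function(p00, gene.eff = log(2), env.eff = log(2)) {
  a0  <- logit(p00)
  p10 <- expit(a0 + gene.eff)  # risk for G = 1, E = 0
  p01 <- expit(a0 + env.eff)   # risk for G = 0, E = 1
  p11 <- p10 + p01 - p00       # no additive interaction on the risk scale
  (odds(p11) - odds(p10) - odds(p01) + odds(p00)) / odds(p00)
}

reri.rare   <- reri.or.under.null(0.01)  # baseline risk 0.01
reri.common <- reri.or.under.null(0.10)  # baseline risk 0.10
print(c(rare = reri.rare, common = reri.common))
```

With both main effects set to log(2.0), the largest value in the simulation grid, the departure from zero is larger at the higher baseline risk. This is the same corner of Table 3 in which inflation was observed.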

Table 3:

The type I error of the compared tests under various combinations of the prevalence of the genetic variant a₁ and the effects of the genetic and environmental variables on the disease outcome (α₁ and α₂, respectively). The tests ‘U’ and ‘U ind’ are the proposed tests without and with the assumption of G-E independence. ‘prosp’ is the usual test based on standard logistic regression, and ‘Han’ and ‘Han ind’ are the tests based on the retrospective profile likelihood proposed by Han et al. [3]. The type I error was calculated from 10,000 simulations, each with 4000 cases and 4000 controls.

α₁ :	log(0.7)			log(1.2)			log(2.0)
α₂ :	log(0.7)	log(1.2)	log(2.0)	log(0.7)	log(1.2)	log(2.0)	log(0.7)	log(1.2)	log(2.0)
Pr(G=1)=0.5
U	0.052	0.052	0.069	0.049	0.052	0.047	0.077	0.052	0.102
U ind	0.058	0.052	0.079	0.053	0.054	0.057	0.078	0.060	0.201
prosp	0.058	0.046	0.054	0.046	0.052	0.052	0.067	0.054	0.119
Han	0.056	0.050	0.061	0.047	0.054	0.054	0.067	0.058	0.127
Han ind	0.058	0.053	0.080	0.053	0.055	0.057	0.077	0.060	0.205
Pr(G=1)=0.2
U	0.050	0.052	0.071	0.053	0.046	0.046	0.071	0.042	0.042
U ind	0.054	0.055	0.078	0.050	0.047	0.052	0.074	0.058	0.058
prosp	0.055	0.048	0.063	0.048	0.044	0.047	0.063	0.043	0.043
Han	0.058	0.050	0.061	0.049	0.049	0.055	0.062	0.051	0.051
Han ind	0.057	0.054	0.074	0.048	0.046	0.054	0.071	0.059	0.059
Pr(G=1)=0.05
U	0.046	0.053	0.068	0.057	0.048	0.046	0.062	0.043	0.029
U ind	0.049	0.053	0.063	0.054	0.052	0.050	0.062	0.048	0.072
prosp	0.043	0.043	0.061	0.045	0.042	0.043	0.049	0.037	0.028
Han	0.055	0.046	0.053	0.052	0.049	0.051	0.051	0.052	0.063
Han ind	0.052	0.052	0.058	0.053	0.052	0.053	0.055	0.048	0.083


Figure 6: Power of the tests for identifying additive G-E interaction, in the simple setting in which both a₁ and a₂ are binary and no covariates are used in the model, with Pr(G = 1) = 0.5. The exposures are simulated with Pr(a₁ = 1) = 0.5 and Pr(a₂ = 1) = 0.2. The disease model Pr(D = 1|a₁*, a₂*) has a fixed intercept α₀ = logit(0.1) (baseline risk 0.1), the main effects α₁*, α₂* of a₁*, a₂* are varied, and the RERI is varied from 0 to 0.5. The compared tests are the proposed ‘U’ and ‘U ind’ (without and with the G-E independence assumption), ‘prosp’, the standard test under prospective logistic regression, and ‘Han’ and ‘Han ind’, the retrospective profile likelihood tests proposed in Han et al. (2012) without and with the G-E independence assumption.

Figure 7: Power of the tests for identifying additive G-E interaction, in the simple setting in which both a₁ and a₂ are binary and no covariates are used in the model, with Pr(G = 1) = 0.2. The exposures are simulated with Pr(a₁ = 1) = 0.2 and Pr(a₂ = 1) = 0.2. The disease model Pr(D = 1|a₁*, a₂*) has a fixed intercept α₀ = logit(0.1) (baseline risk 0.1), the main effects α₁*, α₂* of a₁*, a₂* are varied, and the RERI is varied from 0 to 0.5. The compared tests are the proposed ‘U’ and ‘U ind’ (without and with the G-E independence assumption), ‘prosp’, the standard test under prospective logistic regression, and ‘Han’ and ‘Han ind’, the retrospective profile likelihood tests proposed in Han et al. (2012) without and with the G-E independence assumption.

Figure 8: Power of the tests for identifying additive G-E interaction, in the simple setting in which both a₁ and a₂ are binary and no covariates are used in the model, with Pr(G = 1) = 0.05. The exposures are simulated with Pr(a₁ = 1) = 0.05 and Pr(a₂ = 1) = 0.2. The disease model Pr(D = 1|a₁*, a₂*) has a fixed intercept α₀ = logit(0.1) (baseline risk 0.1), the main effects α₁*, α₂* of a₁*, a₂* are varied, and the RERI is varied from 0 to 0.5. The compared tests are the proposed ‘U’ and ‘U ind’ (without and with the G-E independence assumption), ‘prosp’, the standard test under prospective logistic regression, and ‘Han’ and ‘Han ind’, the retrospective profile likelihood tests proposed in Han et al. (2012) without and with the G-E independence assumption.

5.4. Different type of environmental variable: count data

In this section, we evaluate the performance of the proposed tests when the environmental variable E is a count following a Poisson distribution with mean 1. The RERI test based on parametric logistic regression can have slightly inflated type I error under relatively extreme gene or environment effects.
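One way to see why a parametric logistic model can struggle with a count exposure is the following sketch. It is an illustration under assumed parameter values, not the authors' simulation. A logistic model with terms g, e and g*e forces the log-odds to be linear in e within each level of g. Under the additive null p(g, e) = p(g, 0) + p(0, e) − p(0, 0), the log-odds among carriers (g = 1) are generally not linear in e, so the fitted model cannot represent the null exactly.

```r
expit <- function(x) exp(x) / (1 + exp(x))
logit <- function(x) log(x / (1 - x))

# Illustrative values (assumptions): baseline risk 0.01, both main effects log(2)
a0 <- logit(0.01); gene.eff <- log(2); env.eff <- log(2)
e  <- 0:2

# Risks under the additive null: p(g, e) = p(g, 0) + p(0, e) - p(0, 0)
risk.null <- function(g, e) {
  expit(a0 + gene.eff * g) + expit(a0 + env.eff * e) - expit(a0)
}

# Linearity of the log-odds in e means a zero second difference in e
second.diff.g0 <- diff(logit(risk.null(0, e)), differences = 2)
second.diff.g1 <- diff(logit(risk.null(1, e)), differences = 2)
print(c(g0 = second.diff.g0, g1 = second.diff.g1))
```

The second difference is zero (up to rounding) for non-carriers by construction and clearly nonzero for carriers, which is the misspecification a logistic model with a single g*e product term would carry under the null.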

Table 4:

The type I error of the compared tests, under various combinations of the prevalence of the genetic variant a₁, and the effect of the genetic and environmental variables on the disease outcome (α₁ and α₂, respectively). The tests ‘U’ and ‘U ind’ are the proposed tests without and with the assumption of G-E independence. ‘prosp’ is the usual test based on standard logistic regression. The type I error was calculated from 10,000 simulations, each with 4000 cases and 4000 controls.

α₁ :	log(0.7)			log(1.2)			log(2.0)
α₂ :	log(0.7)	log(1.2)	log(2.0)	log(0.7)	log(1.2)	log(2.0)	log(0.7)	log(1.2)	log(2.0)
Pr(G=1)=0.5
U	0.045	0.045	0.046	0.048	0.048	0.048	0.049	0.049	0.049
U ind	0.051	0.048	0.047	0.048	0.049	0.049	0.050	0.049	0.049
prosp	0.048	0.048	0.050	0.051	0.051	0.050	0.049	0.049	0.053
Pr(G=1)=0.2
U	0.046	0.045	0.047	0.048	0.048	0.048	0.049	0.049	0.049
U ind	0.050	0.049	0.047	0.049	0.049	0.049	0.049	0.049	0.049
prosp	0.048	0.048	0.051	0.051	0.050	0.049	0.049	0.049	0.054
Pr(G=1)=0.05
U	0.045	0.045	0.048	0.048	0.048	0.049	0.049	0.049	0.049
U ind	0.049	0.048	0.048	0.049	0.049	0.049	0.049	0.049	0.049
prosp	0.048	0.047	0.050	0.051	0.050	0.049	0.048	0.048	0.054
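When reading the type I error tables, it helps to know how much of the deviation from 0.05 could be simulation noise. The short sketch below is an added illustration rather than part of the authors' code. It computes the Monte Carlo standard error of an empirical rejection rate based on 10,000 simulations at a nominal level of 0.05, together with an approximate 95% band around the nominal level.

```r
n.sim <- 10000   # number of simulated data sets per table cell
alpha <- 0.05    # nominal significance level

# Binomial standard error of an empirical rejection rate when the true rate is alpha
mc.se <- sqrt(alpha * (1 - alpha) / n.sim)

# Approximate 95% band for the empirical rate under a correctly sized test
band <- alpha + c(-1, 1) * qnorm(0.975) * mc.se
print(round(c(mc.se = mc.se, lower = band[1], upper = band[2]), 4))
```

Entries well outside this band, such as the largest values in Table 3, are hard to attribute to Monte Carlo variability alone.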


6. R codes for simulation and data application

Below we provide the R code for the simulations and the data application. The code is also available at https://github.com/shixu0830/A-test-for-gene-environment-additive-interaction.

################################################################
### Simulation code for Binary G and Binary E:
### Xu Shi (xushi@hsph.harvard.edu)
### Benedict H.W.Wong (wong01@g.harvard.edu)
### Tamar Sofer (tsofer@bwh.harvard.edu)
################################################################

rm(list = ls())
library(alr3)
library(maxLik)
library(MASS)
library(boot)
require(CGEN)
source("functions.R")
a0.seq = c(logit(0.01), logit(0.05), logit(0.1)) ## baseline risk
g.mean.seq = c(0.5, 0.2, 0.05)
signal.seq = seq(0, 0.5, by = 0.1) ## 0: type I error; >0: power
## we here look at a few combinations of gene/environment effects
env.eff.seq = c(log(0.7), log(1.2), log(2))
gene.eff.seq = c(log(0.7), log(1.2), log(2))
### select one combination of the above sequences
gene.eff = gene.eff.seq[1]  #1,2,3
env.eff = env.eff.seq[1]    #1,2,3
a0 = a0.seq[1]              #1,2
g.mean = g.mean.seq[1]      #1,2,3
sig = signal.seq[1]         #1,2,3
## setting the coefficient of g*e such that RERI = sig
gene.env.eff = logit((sig - 1)*expit(a0) + expit(a0 + env.eff) + expit(a0 + gene.eff)) -
  a0 - gene.eff - env.eff
n.sim = 10000
nn = 4000
N <- nn/0.001
n.cont <- nn
n.case <- nn
n = n.cont*2

test.stat.U = pval.U = test.stat.U.ind = pval.U.ind =
  pvals.reri = pvals.UML = pvals.mult = pvals.wald.mult.indep = pvals.mult.indep =
  pvals.han = pvals.han.indep = test.stat.RERI = pval.RERI =
  rep(NA, n.sim)

for (i in 1:n.sim){
  set.seed(i)
  e <- rbinom(N, 1, 0.2)
  g <- rbinom(N, 1, g.mean)
  pD <- expit(a0 + env.eff*e + gene.eff*g + gene.env.eff*g*e)
  D <- rbinom(length(pD), 1, pD)
  cases <- as.data.frame(cbind(e, g, D)[sample(which(D == 1), n.case),])
  conts <- as.data.frame(cbind(e, g, D)[sample(which(D == 0), n.cont),])
  d <- c(cases$D, conts$D)
  e <- c(cases$e, conts$e)
  g <- c(cases$g, conts$g)

  ############################################################
  ### Han and RERI
  temp <- additive.test(data = data.frame(d = d, g = g, e = e), response.var = "d",
                        snp.var = "g", exposure.var = "e")
  temp.indep <- additive.test(data = data.frame(d = d, g = g, e = e), response.var = "d",
                              snp.var = "g", exposure.var = "e", op = list(indep = T))
  pvals.reri[i] <- temp$RERI$pval
  pvals.han[i] <- temp$pval.add
  pvals.han.indep[i] <- temp.indep$pval.add

  ############################################################
  ## RERI
  temp.RERI <- RERI.E.cont(d, g, e)
  test.stat.RERI[i] <- temp.RERI$test.stat
  pval.RERI[i] <- temp.RERI$pval

  ############################################################
  ## Proposed test stat U
  ### first: test that assumes G-E independence
  probs.mod <- estimate.g.e.probs(g[which(d == 0)], e[which(d == 0)], ge.indep = T,
                                  return.equation = T)
  p10 <- probs.mod$cond.probs$p.g1.cond.e0
  p11 <- probs.mod$cond.probs$p.g1.cond.e1
  p20 <- probs.mod$cond.probs$p.e1.cond.g0

  u <- (g - p10)*(e - p20)*d
  mean.u = mean(u)
  alpha <- probs.mod$coefs$alpha ## alpha is the parameter of the model f(g,e|D = 0)
  exp.2 <- exp(alpha[2])
  exp.1 <- exp(alpha[1])
  ## derivative of U w.r.t. alpha
  partial.u.partial.V.p <- cbind(
    d*(0 - exp.1/((exp.1 + 1)^2))*(e - exp.2/(1 + exp.2)),
    d*(g - exp.1/(exp.1 + 1))*(0 - exp.2/((1 + exp.2)^2))
  )
  # its mean
  E.partial.u.partial.V.p <- colMeans(partial.u.partial.V.p)
  alpha <- alpha[1:2]
  ## the equations accounting for estimation of alpha
  term.account.p <- n*rbind(matrix(0, nrow = sum(d == 1), ncol = length(alpha)),
                            probs.mod$equation) %*% E.partial.u.partial.V.p
  u.account.p.ind <- u - term.account.p
  sd.u <- sqrt(mean(u.account.p.ind^2)/(n))

  test.stat.U.ind[i] <- mean.u/sd.u
  pval.U.ind[i] = (1 - pnorm(abs(mean.u/sd.u)))*2
  ### second: test that does not assume G-E independence
  probs.mod <- estimate.g.e.probs(g[which(d == 0)], e[which(d == 0)], ge.indep = F,
                                  return.equation = T)
  p10 <- probs.mod$cond.probs$p.g1.cond.e0
  p11 <- probs.mod$cond.probs$p.g1.cond.e1
  p20 <- probs.mod$cond.probs$p.e1.cond.g0
  alpha <- probs.mod$coefs$alpha
  u <- exp(-alpha[3]*g*e)*(g - p10)*(e - p20)*d
  mean.u = mean(u)
  exp.2 <- exp(alpha[2])
  exp.1 <- exp(alpha[1])
  term.3 <- exp(-alpha[3]*g*e)
  ## derivative of U w.r.t. alpha
  partial.u.partial.V.p <- cbind(
    d*term.3*(0 - exp.1/((exp.1 + 1)^2))*(e - exp.2/(1 + exp.2)),
    d*term.3*(g - exp.1/(exp.1 + 1))*(0 - exp.2/((1 + exp.2)^2)),
    -d*term.3*g*e*(g - exp.1/(exp.1 + 1))*(e - exp.2/(1 + exp.2))
  )
  # its mean
  E.partial.u.partial.V.p <- colMeans(partial.u.partial.V.p)
  term.account.p <- n*rbind(matrix(0, nrow = sum(d == 1), ncol = length(alpha)),
                            probs.mod$equation) %*% E.partial.u.partial.V.p
  u.account.p <- u - term.account.p
  sd.u <- sqrt(mean(u.account.p^2)/n)
  test.stat.U[i] <- mean.u/sd.u
  pval.U[i] = (1 - pnorm(abs(mean.u/sd.u)))*2
  if (i %% (n.sim/10) == 0) {print(i)}
}
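The calibration of `gene.env.eff` in the binary simulation script above can be checked numerically. The sketch below is an added illustration with assumed parameter values, not part of the authors' code. It reuses the same formula for the interaction coefficient and confirms that the resulting risks give a RERI on the risk-ratio scale equal to the chosen signal `sig`.

```r
expit <- function(x) exp(x) / (1 + exp(x))
logit <- function(x) log(x / (1 - x))

# Illustrative values (assumptions); any admissible combination can be used
a0 <- logit(0.01); gene.eff <- log(2); env.eff <- log(1.2); sig <- 0.3

# Same calibration formula as in the simulation script
gene.env.eff <- logit((sig - 1)*expit(a0) + expit(a0 + env.eff) + expit(a0 + gene.eff)) -
  a0 - gene.eff - env.eff

p00 <- expit(a0)                                      # G = 0, E = 0
p10 <- expit(a0 + gene.eff)                           # G = 1, E = 0
p01 <- expit(a0 + env.eff)                            # G = 0, E = 1
p11 <- expit(a0 + gene.eff + env.eff + gene.env.eff)  # G = 1, E = 1

# RERI on the risk-ratio scale: RR11 - RR10 - RR01 + 1
reri <- (p11 - p10 - p01 + p00) / p00
print(reri)
```

Setting `sig = 0` therefore generates data under the null of no additive interaction, and positive values of `sig` generate the alternatives used for the power curves.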

################################################################
### Simulation code for Binary G and Count E:
### Xu Shi (xushi@hsph.harvard.edu)
### Benedict H.W.Wong (wong01@g.harvard.edu)
### Tamar Sofer (tsofer@bwh.harvard.edu)
################################################################

rm(list = ls())
library(alr3)
library(maxLik)
library(MASS)
library(boot)
require(CGEN)
source("functions.R")
a0.seq = c(logit(0.01), logit(0.05), logit(0.1)) ## baseline risk
g.mean.seq = c(0.5, 0.2, 0.05)
signal.seq = seq(0, 0.5, by = 0.1) ## 0: type I error; >0: power
## we here look at a few combinations of gene/environment effects
env.eff.seq = c(log(0.7), log(1.2), log(2))
gene.eff.seq = c(log(0.7), log(1.2), log(2))
### select one combination of the above sequences (as in the binary-E script;
### the original listing omits this step)
gene.eff = gene.eff.seq[1]  #1,2,3
env.eff = env.eff.seq[1]    #1,2,3
a0 = a0.seq[1]              #1,2,3
g.mean = g.mean.seq[1]      #1,2,3
myseed = 1 ## assumption: index of the simulation batch; undefined in the original listing
n.sim = 10000
nn = 4000
N = nn/0.001
n.cont = nn
n.case = nn
n = n.cont*2
pval = pval2 = pval.indept = NULL
for (i in 1:n.sim){
  #### simulate data
  set.seed((myseed - 1)*n.sim + i)
  e = rpois(N, lambda = 1)
  g = rbinom(N, 1, g.mean)
  ## additive null: p(g,e) = p(g,0) + p(0,e) - p(0,0)
  pD = expit(a0 + gene.eff*g) + expit(a0 + env.eff*e) - expit(a0)
  D = rbinom(length(pD), 1, pD)
  cases = as.data.frame(cbind(e, g, D)[sample(which(D == 1), n.case),])
  conts = as.data.frame(cbind(e, g, D)[sample(which(D == 0), n.cont),])
  d = c(cases$D, conts$D)
  e = c(cases$e, conts$e)
  g = c(cases$g, conts$g)
  X = matrix(rep(1, length(g)), nrow = length(g))
  p = ncol(X)
  myg = g[d == 0]; mye = e[d == 0]; myX = as.matrix(X[d == 0,]);

  #### Proposed test without assuming G-E independence
  pval.i = get.pval(g, e, X, d, ge.indep = F)
  pval = c(pval, pval.i["p"])
  #### Proposed test assuming G-E independence
  pval.indept.i = get.pval(g, e, X, d, ge.indep = T)
  if (inherits(pval.indept.i, "try-error")) {pval.indept.i = NA}
  pval.indept = c(pval.indept, pval.indept.i["p"])
  #### RERI using standard logistic regression
  reri = RERI.E.cont(d, g, e)
  pval2 = c(pval2, reri$pval)
}

#### compute type I error at sig level of 0.05
mean(pval < 0.05)
mean(pval.indept < 0.05)
mean(pval2 < 0.05)

################################################################
### Data application: ovarian cancer study
### Xu Shi (xushi@hsph.harvard.edu)
### Benedict H.W.Wong (wong01@g.harvard.edu)
### Tamar Sofer (tsofer@bwh.harvard.edu)
################################################################

rm(list = ls())
library(alr3)
library(maxLik)
library(MASS)
library(CGEN)
source("functions.R")
############### get data ###############
data(Xdata, package = "CGEN")
dat = Xdata
dat$ethnicity.1 = as.numeric(dat$ethnic.group == 1)
dat$ethnicity.2 = as.numeric(dat$ethnic.group == 2)
dat$ethnicity.3 = as.numeric(dat$ethnic.group == 3)
dat$famhist.1 = as.numeric(dat$family.history == 1)
dat$famhist.2 = as.numeric(dat$family.history == 2)
dat$age50 = as.numeric(dat$age.group > 2) ## age at least 50
#### Adjusted covariates in the primary analysis
adjust.vars = c("age50", "ethnicity.1", "ethnicity.2", "BRCA.history", "famhist.1", "famhist.2")
#### Adjusted covariates in the sensitivity analysis: remove BRCA.history
#### (run only one of the two assignments; as listed, this one overrides the primary set)
adjust.vars = c("age50", "ethnicity.1", "ethnicity.2", "famhist.1", "famhist.2")

rslt = NULL
for (children_or_oralcon in c(1, 2)){
  for (ge.indep in c(T, F)){
    boot.dat = dat
    d = boot.dat$case.control
    g = boot.dat$BRCA.status
    if (children_or_oralcon == 1) {
      e = boot.dat$n.children
    } else {
      e = boot.dat$oral.years
    }
    X = cbind(intercept = 1, as.matrix(boot.dat[, adjust.vars]))
    p = ncol(X)
    myg = g[d == 0]; mye = e[d == 0]; myX = as.matrix(X[d == 0,]);
    reri.p = RERI.E.cont.X(d, g, e, X)
    temp <- additive.test(data = data.frame(d = d, g = g, e = as.numeric(e < mean(e)),
                                            boot.dat[, adjust.vars]),
                          response.var = "d", snp.var = "g", exposure.var = "e",
                          main.vars = adjust.vars,
                          op = list(indep = ge.indep))
    pvals.reri <- temp$RERI$pval
    pvals.han <- temp$pval.add

    ### our proposed tests
    if (ge.indep == T){
      ### assume independent
      MLE.rslt = maxLik(loglik.indept, start = rep(0, p*2))
      par = MLE.rslt$estimate
      phi = 0; alpha = par[1:(p)]; beta = par[(1 + p):(p*2)]
    } else {
      ### not assume independent
      MLE.rslt = maxLik(loglik, start = rep(0, p*2 + 1))
      par = MLE.rslt$estimate
      phi = par[1]; alpha = par[2:(1 + p)]; beta = par[(2 + p):(1 + p*2)]
    }
    ### get u
    m.g = expit(X %*% alpha); m.e = exp(X %*% beta)
    w = exp(-phi*g*e); u = w*(g - m.g)*(e - m.e)*d
    ###
    if (ge.indep == T){
      ### get IF
      d.u.d.theta = cbind(
        diag(as.numeric(w*(e - m.e)*d*(-m.g*(1 - m.g)))) %*% X,
        diag(as.numeric(w*(g - m.g)*d*(-m.e))) %*% X)
      E.d.u.d.theta = apply(d.u.d.theta, 2, sum) ## order: (phi),g,e
      score = numDeriv::jacobian(loglik.indept, c(par))
    } else {
      ### get IF
      d.u.d.theta = cbind(-g*e*u,
        diag(as.numeric(w*(e - m.e)*d*(-m.g*(1 - m.g)))) %*% X,
        diag(as.numeric(w*(g - m.g)*d*(-m.e))) %*% X)
      E.d.u.d.theta = apply(d.u.d.theta, 2, sum) ## order: phi,g,e
      score = numDeriv::jacobian(loglik, c(par))
    }
    score.IF = ginv(t(score))
    adj.IF = score.IF %*% E.d.u.d.theta
    u.IF = c(u[d == 1], u[d == 0]) + c(rep(0, length(which(d == 1))), adj.IF)
    (p = (1 - pnorm(abs(mean(u)*sqrt(length(u))/sd(u.IF))))*2)
    stat = sum(u)/sqrt(sum(u.IF^2))
    (CI = round(mean(u) + c(-1, 0, 1)*qnorm(0.975)*sd(u.IF)/sqrt(length(u)), 3))

    rslt = rbind(rslt, c(sprintf("%.3f", phi), sprintf("%.3f", CI[2]),
                         paste0("(", sprintf("%.3f", CI[1]), ",", sprintf("%.3f", CI[3]), ")"),
                         sprintf("%.3f", p), sprintf("%.3f", reri.p$pval),
                         sprintf("%.3f", pvals.reri), sprintf("%.3f", pvals.han)))
  }
}
rslt
################################################################
### Functions
### Xu Shi (xushi@hsph.harvard.edu)
### Benedict H.W.Wong (wong01@g.harvard.edu)
### Tamar Sofer (tsofer@bwh.harvard.edu)
################################################################

expit = function(x){exp(x)/(1 + exp(x))}
logit = function(x){log(x/(1 - x))}

## note: get.pval reads p (the number of columns of X) from the calling environment,
## and the likelihood functions read myg, mye and myX from it
get.pval = function(g, e, X, d, ge.indep){
  if (ge.indep == T){
    ### assume G-E independence
    MLE.rslt = maxLik(loglik.indept, start = rep(0, p*2))
    par = MLE.rslt$estimate
    phi = 0; alpha = par[1:(p)]; beta = par[(1 + p):(p*2)]
  } else {
    ### not assume G-E independence
    MLE.rslt = maxLik(loglik, start = rep(1, p*2 + 1))
    par = MLE.rslt$estimate
    phi = par[1]; alpha = par[2:(1 + p)]; beta = par[(2 + p):(1 + p*2)]
  }
  ### get u
  m.g = expit(X %*% alpha); m.e = exp(X %*% beta)
  w = exp(-phi*g*e); u = w*(g - m.g)*(e - m.e)*d

  if (ge.indep == T){
    ### get IF
    d.u.d.theta = cbind(
      diag(as.numeric(w*(e - m.e)*d*(-m.g*(1 - m.g)))) %*% X,
      diag(as.numeric(w*(g - m.g)*d*(-m.e))) %*% X)
    E.d.u.d.theta = apply(d.u.d.theta, 2, sum) ## order: (phi),g,e
    score = numDeriv::jacobian(loglik.indept, c(par))
  } else {
    ### get IF
    d.u.d.theta = cbind(-g*e*u,
      diag(as.numeric(w*(e - m.e)*d*(-m.g*(1 - m.g)))) %*% X,
      diag(as.numeric(w*(g - m.g)*d*(-m.e))) %*% X)
    E.d.u.d.theta = apply(d.u.d.theta, 2, sum) ## order: phi,g,e
    score = numDeriv::jacobian(loglik, c(par))
  }

  score.IF = score %*% solve(t(score) %*% (score)) # or score.IF = ginv(t(score))
  adj.IF = score.IF %*% E.d.u.d.theta
  u.IF = c(u[d == 1], u[d == 0]) + c(rep(0, length(which(d == 1))), adj.IF)
  # or p = (1 - pnorm(abs(sum(u)/sqrt(sum(u.IF^2)))))*2
  p = (1 - pnorm(abs(mean(u)*sqrt(length(u))/sd(u.IF))))*2
  stat = sum(u)/sqrt(sum(u.IF^2))
  return(c(p = p, mean.u = mean(u), phi = phi, stat = stat, sd.u = sd(u.IF)/sqrt(length(u))))
}


loglik.indept = function(par){ #### Function for the log likelihood assuming G-E indep
  ## get coefs
  phi = 0; alpha = par[1:(ncol(myX))]; beta = par[(1 + ncol(myX)):(ncol(myX)*2)]
  ## compute joint distribution f(G,E|X)
  lambdaX = exp(myX %*% beta)
  ### proportional to f(G,E|X)
  fGE_cond_X = exp(myX %*% alpha*myg)/(1 + exp(myX %*% alpha))* #fG_cond_E0X
    lambdaX^mye*exp(-lambdaX)/factorial(mye)* #fE_cond_G0X
    exp(phi*myg*mye) #OR(G,E|X), assume doesnt depend on X
  ## normalizing const
  C_X = expit(myX %*% alpha)*exp(lambdaX*(exp(phi) - 1)) + 1/(1 + exp(myX %*% alpha))
  fGE_cond_X = fGE_cond_X/C_X
  return((log(fGE_cond_X)))
}
loglik = function(par){ #### Function for the log likelihood without assuming G-E indep
  ## get coefs
  phi = par[1]; alpha = par[2:(1 + ncol(myX))]; beta = par[(2 + ncol(myX)):(1 + ncol(myX)*2)]
  ## compute joint distribution f(G,E|X)
  lambdaX = exp(myX %*% beta)
  ### proportional to f(G,E|X)
  fGE_cond_X = exp(myX %*% alpha*myg)/(1 + exp(myX %*% alpha))* #fG_cond_E0X
    lambdaX^mye*exp(-lambdaX)/factorial(mye)* #fE_cond_G0X
    exp(phi*myg*mye) #OR(G,E|X), assume doesnt depend on X
  ## normalizing const
  C_X = expit(myX %*% alpha)*exp(lambdaX*(exp(phi) - 1)) + 1/(1 + exp(myX %*% alpha))
  fGE_cond_X = fGE_cond_X/C_X
  return((log(fGE_cond_X)))
}


RERI.E.cont = function(d, g, e){ #### RERI using standard logistic regression for the outcome
  mod = glm(d ~ g + e + g*e, family = "binomial")
  temp = deltaMethod(mod, "exp(b1 + b2 + b3) - exp(b1) - exp(b2) + 1",
                     parameterNames = paste("b", 0:3, sep = ""))
  test.stat = temp$Estimate/temp$SE
  pval = 2*(1 - pnorm(abs(test.stat)))
  return(list(test.stat = test.stat, pval = pval))
}
RERI.E.cont.X = function(d, g, e, X){
  #### RERI using standard logistic regression for the outcome with covariate adjustment
  X = X[, -1] # remove intercept
  colnames(X) = paste0("x", 4:(ncol(X) + 3))

  mod = glm(as.formula(paste("d~", paste(paste0("x", 1:(ncol(X) + 3)), collapse = "+"))),
            data = data.frame(x1 = g, x2 = e, x3 = g*e, X), family = "binomial")
  temp = deltaMethod(mod, "exp(x1 + x2 + x3) - exp(x1) - exp(x2) + 1",
                     parameterNames = c(paste0("x", 0:(ncol(X) + 3))))
  test.stat = temp$Estimate/temp$SE
  pval = 2*(1 - pnorm(abs(test.stat)))
  return(list(test.stat = test.stat, pval = pval))
}

estimate.g.e.probs = function(g, e, ge.indep = F, return.equation = F, max.iter = 500, eps = 1e-6) {
  #### Function for the proposed test under binary G and E (saturated model)
  converge = F
  iter = 0
  ind.alpha1 = 1
  ind.alpha2 = 2

  if (ge.indep){

    alpha.old = rep(0, 2)

    U.vec = rep(0, length(alpha.old))
    U.deriv.mat = matrix(0, length(alpha.old), length(alpha.old))

    while (!converge & iter < max.iter){

      e1 = exp(alpha.old[ind.alpha1])
      e2 = exp(alpha.old[ind.alpha2])
      e12 = exp(sum(alpha.old))

      C = 1/(e12 + e1 + e2 + 1)
      U.vec[ind.alpha1] = mean(g) - C*(e12 + e1)
      U.vec[ind.alpha2] = mean(e) - C*(e12 + e2)

      U.deriv.mat[ind.alpha1, ind.alpha1] = C^2*(e12 + e1)*(e12 + e1) - C*(e12 + e1)
      U.deriv.mat[ind.alpha1, ind.alpha2] = C^2*(e12 + e2)*(e12 + e1) - C*e12
      U.deriv.mat[ind.alpha2, ind.alpha1] = C^2*(e12 + e1)*(e12 + e2) - C*e12
      U.deriv.mat[ind.alpha2, ind.alpha2] = C^2*(e12 + e2)*(e12 + e2) - C*(e12 + e2)

      alpha.new = alpha.old - solve(U.deriv.mat) %*% U.vec

      if (max(abs(alpha.new - alpha.old)) <= eps) converge = T else {
        alpha.old = alpha.new
        iter = iter + 1
      }

    }

    if (return.equation){
      U = matrix(0, ncol = length(alpha.new), nrow = length(g))
      U[, ind.alpha1] = g - C*(exp(sum(alpha.new)) + exp(alpha.old[ind.alpha1]))
      U[, ind.alpha2] = e - C*(exp(sum(alpha.new)) + exp(alpha.old[ind.alpha2]))

      #  EU.inv.U = U %*% solve(U.deriv.mat)
      EU.inv.U = U %*% solve(t(U) %*% U)

    }

    alpha.new = c(alpha.new, 0)

  } else {

    alpha.old = rep(0, 3)

    ind.alpha3 = 3

    U.vec = rep(0, length(alpha.old))
    U.deriv.mat = matrix(0, length(alpha.old), length(alpha.old))

    while (!converge & iter < max.iter){

      e1 = exp(alpha.old[ind.alpha1])
      e2 = exp(alpha.old[ind.alpha2])
      e123 = exp(sum(alpha.old))

      C = 1/(e123 + e1 + e2 + 1)
      U.vec[ind.alpha1] = mean(g) - C*(e123 + e1)
      U.vec[ind.alpha2] = mean(e) - C*(e123 + e2)
      U.vec[ind.alpha3] = mean(g*e) - C*e123

      U.deriv.mat[ind.alpha1, ind.alpha1] = C^2*(e123 + e1)*(e123 + e1) - C*(e123 + e1)
      U.deriv.mat[ind.alpha1, ind.alpha2] = C^2*(e123 + e2)*(e123 + e1) - C*e123
      U.deriv.mat[ind.alpha1, ind.alpha3] = C^2*e123*(e123 + e1) - C*e123
      U.deriv.mat[ind.alpha2, ind.alpha1] = C^2*(e123 + e1)*(e123 + e2) - C*e123
      U.deriv.mat[ind.alpha2, ind.alpha2] = C^2*(e123 + e2)*(e123 + e2) - C*(e123 + e2)
      U.deriv.mat[ind.alpha2, ind.alpha3] = C^2*e123*(e123 + e2) - C*e123
      U.deriv.mat[ind.alpha3, ind.alpha1] = C^2*(e123 + e1)*e123 - C*e123
      U.deriv.mat[ind.alpha3, ind.alpha2] = C^2*(e123 + e2)*e123 - C*e123
      U.deriv.mat[ind.alpha3, ind.alpha3] = C^2*e123*e123 - C*e123

      alpha.new = alpha.old - solve(U.deriv.mat) %*% U.vec

      if (max(abs(alpha.new - alpha.old)) <= eps) converge = T else {
        alpha.old = alpha.new
        iter = iter + 1
      }

    }

    if (return.equation){
      U = matrix(0, ncol = length(alpha.new), nrow = length(g))
      U[, ind.alpha1] = g - C*(e123 + e1)
      U[, ind.alpha2] = e - C*(e123 + e2)
      U[, ind.alpha3] = g*e - C*e123
      #   EU.inv.U = U %*% solve(U.deriv.mat)
      EU.inv.U = U %*% solve(t(U) %*% U)

    }

  }

  ## joint probabilities:
  p.g1.e1 = C*exp(sum(alpha.new))
  p.g1.e0 = C*exp(alpha.new[ind.alpha1])
  p.g0.e1 = C*exp(alpha.new[ind.alpha2])
  p.g0.e0 = C

  ### return conditional probabilities:

  p.g1.cond.e1 = p.g1.e1/(p.g1.e1 + p.g0.e1)
  p.e1.cond.g1 = p.g1.e1/(p.g1.e0 + p.g1.e1)
  p.g1.cond.e0 = p.g1.e0/(p.g0.e0 + p.g1.e0)
  p.e1.cond.g0 = p.g0.e1/(p.g0.e1 + p.g0.e0)

  if (return.equation){
    return(list(cond.probs = list(p.g1.cond.e1 = p.g1.cond.e1, p.g1.cond.e0 = p.g1.cond.e0,
                                  p.e1.cond.g0 = p.e1.cond.g0, p.e1.cond.g1 = p.e1.cond.g1),
                coefs = list(alpha = alpha.new), equation = EU.inv.U))
  } else {
    return(list(cond.probs = list(p.g1.cond.e1 = p.g1.cond.e1, p.g1.cond.e0 = p.g1.cond.e0,
                                  p.e1.cond.g0 = p.e1.cond.g0, p.e1.cond.g1 = p.e1.cond.g1),
                coefs = list(alpha = alpha.new)))}

}
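The logic behind the statistic u = (g − p10)(e − p20)d used in the binary simulation script can be illustrated at the population level. The sketch below is an added illustration, not the authors' implementation. It assumes G-E independence and plugs in the population margins `pg` and `pe`, whereas the scripts estimate these quantities from controls, which approximates the population when the disease is rare. The helper `mean.u` and all numerical values are hypothetical. Under these assumptions the mean of u is zero when the risks are additive and equals δ·pg(1 − pg)·pe(1 − pe) when the additive interaction contrast is δ.

```r
# Hypothetical helper: population mean of u = (g - pg) * (e - pe) * d for binary G and E,
# assuming G and E are independent with Pr(G = 1) = pg and Pr(E = 1) = pe
mean.u <- function(p00, p10, p01, p11, pg, pe) {
  grid <- expand.grid(g = 0:1, e = 0:1)
  risk <- c(p00, p10, p01, p11)  # matches grid row order: (0,0), (1,0), (0,1), (1,1)
  f.ge <- ifelse(grid$g == 1, pg, 1 - pg) * ifelse(grid$e == 1, pe, 1 - pe)
  sum((grid$g - pg) * (grid$e - pe) * risk * f.ge)
}

pg <- 0.2; pe <- 0.2
p00 <- 0.01; p10 <- 0.02; p01 <- 0.015

# Null: p11 = p10 + p01 - p00 (no additive interaction)
u.null <- mean.u(p00, p10, p01, p11 = p10 + p01 - p00, pg, pe)

# Alternative: additive interaction contrast delta = 0.005
delta <- 0.005
u.alt <- mean.u(p00, p10, p01, p11 = p10 + p01 - p00 + delta, pg, pe)
print(c(null = u.null, alt = u.alt))
```

Because the mean of u is proportional to δ, testing whether it is zero is a test of no additive interaction that does not require a model for the outcome given G and E.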

References

[1]. Tchetgen Tchetgen Eric J, Robins James M, Rotnitzky Andrea. On doubly robust estimation in a semiparametric odds ratio model. Biometrika. 2010;97:171–180.
[2]. Knol Mirjam J, Tweel Ingeborg, Grobbee Diederick E, Numans Mattijs E, Geerlings Mirjam I. Estimating interaction on an additive scale between continuous determinants in a logistic regression model. International Journal of Epidemiology. 2007;36:1111–1118.
[3]. Han Summer S, Rosenberg Philip S, Garcia-Closas Montse, et al. Likelihood Ratio Test for Detecting Gene (G)-Environment (E) Interactions Under an Additive Risk Model Exploiting G-E Independence for Case-Control Data. American Journal of Epidemiology. 2012;176:1060–1067.

References

[1]. Skrondal Anders. Interaction as departure from additivity in case-control studies: a cautionary note. American Journal of Epidemiology. 2003;158(3):251–258.
[2]. Rothman Kenneth J, Greenland Sander, Walker Alexander M. Concepts of interaction. American Journal of Epidemiology. 1980;112(4):467–470.
[3]. Greenland Sander. Tests for interaction in epidemiologic studies: a review and a study of power. Statistics in Medicine. 1983;2(2):243–251.
[4]. Cordell Heather J. Epistasis: what it means, what it doesn’t mean, and statistical methods to detect it in humans. Human Molecular Genetics. 2002;11(20):2463–2468.
[5]. VanderWeele Tyler J, Knol Mirjam J. A tutorial on interaction. Epidemiologic Methods. 2014;3(1):33–72.
[6]. Gang Liu, Bhramar Mukherjee, Seunggeun Lee, et al. Robust tests for additive gene-environment interaction in case-control studies using gene-environment independence. American Journal of Epidemiology. 2017;187(2):366–377.
[7]. Rothman Kenneth J. Causes. American Journal of Epidemiology. 1976;104(6):587–592.
[8]. Rothman Kenneth J, et al. Modern epidemiology. 3 ed. Philadelphia, PA: Lippincott Williams & Wilkins; 2008.
[9]. Han Summer S, Rosenberg Philip S, Garcia-Closas Montse, et al. Likelihood Ratio Test for Detecting Gene (G)-Environment (E) Interactions Under an Additive Risk Model Exploiting G-E Independence for Case-Control Data. American Journal of Epidemiology. 2012;176(11):1060–1067.
[10]. Piegorsch Walter W, Weinberg Clarice R, Taylor Jack A. Non-hierarchical logistic models and case-only designs for assessing susceptibility in population-based case-control studies. Statistics in Medicine. 1994;13(2):153–162.
[11]. Umbach David M, Weinberg Clarice R. Designing and analysing case-control studies to exploit independence of genotype and exposure. Statistics in Medicine. 1997;16(15):1731–1743.
[12]. Chatterjee Nilanjan, Carroll Raymond J. Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies. Biometrika. 2005;92(2):399–418.
[13]. Tchetgen Tchetgen Eric J, Robins James. The semiparametric case-only estimator. Biometrics. 2010;66(4):1138–1144.
[14]. Tchetgen Tchetgen E. Robust discovery of genetic associations incorporating gene-environment interaction and independence. Epidemiology. 2011;22(2):262–272.
[15]. Vansteelandt Stijn, VanderWeele Tyler J, Tchetgen Eric J, Robins James M. Multiply robust inference for statistical interactions. Journal of the American Statistical Association. 2008;103(484):1693–1704.
[16]. Hosmer David W, Lemeshow Stanley. Confidence interval estimation of interaction. Epidemiology. 1992;:452–456.
[17]. Tchetgen Tchetgen Eric J. A general regression framework for a secondary outcome in case-control studies. Biostatistics. 2013;15(1):117–128.
[18]. Tchetgen Tchetgen Eric J, Robins James M, Rotnitzky Andrea. On doubly robust estimation in a semiparametric odds ratio model. Biometrika. 2010;97(1):171–180.
[19]. Modan Baruch, Hartge Patricia, Hirsh-Yechezkel Galit, et al. Parity, oral contraceptives, and the risk of ovarian cancer among carriers and noncarriers of a BRCA1 or BRCA2 mutation. New England Journal of Medicine. 2001;345(4):235–240.
