Partial linear varying multi-index coefficient model for integrative gene-environment interactions

Xu Liu; Yuehua Cui; Runze Li

doi:10.5705/ss.202015.0114

Author manuscript; available in PMC: 2017 Jul 1.

Published in final edited form as: Stat Sin. 2016 Jul;26:1037–1060. doi: 10.5705/ss.202015.0114


PMCID: PMC5033130 NIHMSID: NIHMS781374 PMID: 27667907

Abstract

Gene-environment (G×E) interactions play key roles in many complex diseases. An increasing number of epidemiological studies have shown the combined effect of multiple environmental exposures on disease risk. However, no appropriate statistical models have been developed to conduct a rigorous assessment of such combined effects when G×E interactions are considered. In this paper, we propose a partial linear varying multi-index coefficient model (PLVMICM) to assess how multiple environmental factors act jointly to modify individual genetic risk on complex disease. Our model includes the varying-index coefficient model as a special case, where discrete variables are admitted as the linear part. Thus PLVMICM allows one to study nonlinear interaction effects between genes and continuous environments as well as linear interactions between genes and discrete environments, simultaneously. We derive a profile method to estimate the parametric parameters and a B-spline backfitted kernel method to estimate the nonlinear interaction functions. Consistency and asymptotic normality of the parametric and nonparametric estimates are established under some regularity conditions. Hypothesis tests for the parametric coefficients and nonparametric functions are developed. Results show that the statistics for testing the parametric coefficients and the nonparametric functions asymptotically follow χ²-distributions with different degrees of freedom. The utility of the method is demonstrated through extensive simulations and a case study.

Key words and phrases: Association study, Backfitting, B-spline, Single index model, Varying coefficient model

1. Introduction

There has been great interest in identifying gene-environment (G×E) interaction in the scientific literature. G×E interaction is defined as how genotypes influence phenotypes differently under different environmental conditions (Falconer (1952)), a phenomenon also termed as genetic sensitivity to environmental stimulus. A growing number of reports have confirmed the role of G×E interaction in many diseases, such as Parkinson disease (Ross and Smith (2007)) and type 2 diabetes (Zimmet et al. (2001)). G×E interaction has traditionally been pursued based on a single environment exposure model. Evidence from epidemiological studies has clearly indicated that disease risk can be modified by simultaneous exposure to multiple environmental factors, higher than what would be expected from simple addition of the effects of factors acting alone (Carpenter et al. (2002); Sexton and Hattis (2007)). Thus, assessing the combined effect of environmental mixtures and the mechanism in which they, as a whole, interact with genes to affect disease risk could shed novel insight into disease etiology. Suppose that Y is the trait response of primary interest. In many genetic studies, one collects a p-dimensional continuous covariate vector X, and a q-dimensional discrete covariate vector Z. Motivated by an empirical analysis to study G×E interaction, see Section 5, we propose a partial linear varying multi-index coefficient model (PLVMICM):

Y = m_{0} (β_{0}^{T} X) + α_{0}^{T} Z + \sum_{l = 1}^{L} \{m_{l} (β_{l}^{T} X) G_{l} + α_{l}^{T} Z G_{l}\} + ε,

(1.1)

where G_l, l = 1, ⋯, L, are genetic variables (e.g., single nucleotide polymorphisms (SNPs)) of interest; ε is an error term with mean 0 and finite variance; m_l(·), l = 0, 1, ⋯, L, are unknown index functions; and α₀, ⋯, α_L and β₀, ⋯, β_L are parameters of interest, where the index coefficients β_l are the index loadings or the loading parameters. The SNP variable G_l can be coded as 2, 1, and 0 for genotypes AA, Aa, and aa, assuming an additive model. Note that the main genetic effect for each G_l is captured by the function $m_{l} (β_{l}^{T} X)$, l = 1, ⋯, L. Thus we do not need a separate term to model the main genetic effect for each SNP. Model (1.1) provides a unified framework for many existing models used for studying G×E interaction. Specifically, the model proposed in Ma et al. (2011) can be viewed as a special case with p = 1 (the dimension of β_l), q = 0 (the dimension of α_l), and L = 1. Model (1.1) also includes the semiparametric varying-index coefficient model proposed by Ma and Xu (2015) for studying G×E interaction as a special case with β₀ = β_l = β, l = 1, ⋯, L, i.e., assuming the same index loading parameter. Our empirical analysis of the data example in Section 5 clearly shows that this assumption is not realistic, making it necessary to allow different loading parameters in the model.

Model (1.1) also includes many other existing models as special cases. It reduces to the partial linear single-index model (Carroll et al. (1997); Xia and Li (1999); Xia and Hardle (2006); Liang et al. (2010); Cui et al. (2011)), in which the discrete variable in the linear part is admitted if all G_l = 0; it reduces to VICM proposed by Ma and Song (2015) if Z = 0.

This paper aims to develop a set of statistical estimation and hypothesis testing procedures for model (1.1). We employ a B-spline backfitted kernel smoothing (BSBK) procedure to estimate the parametric parameters and the nonparametric functions (Wang and Yang (2007)). We first develop a profile least squares method to estimate the index coefficients β_l and the linear coefficients α_l by approximating the unknown functions m_l(·) with B-spline basis functions. The parametric estimates can be shown to be n^{1/2}-consistent and asymptotically normal. We also obtain uniformly consistent estimators of the nonparametric functions. Given the n^{1/2}-consistent parametric estimators and the consistent estimators of the nonparametric functions, kernel estimators of the nonparametric functions can be obtained, for which we establish asymptotic normality.

Under model (1.1), it is natural to ask whether there is an interaction between discrete/continuous environments and genes, and whether the interaction with the combined environmental exposures is linear or nonlinear. Cai et al. (2000) studied the nonparametric testing problem for varying coefficient models based on the generalized likelihood ratio test. Nonparametric inferences for additive models were previously discussed by employing the generalized likelihood ratio (GLR) statistic (Fan and Jiang (2005)). We propose a parametric likelihood ratio test to test for the linear interaction term and a nonparametric GLR test to test for the nonparametric interaction functions (Fan et al. (2001)). We further show that the proposed nonparametric GLR statistic is asymptotically χ². We conduct rigorous theoretical evaluation of the proposed estimators and test statistics and show the utility of the model through extensive simulations and a case study.

The paper is organized as follows. In Section 2.2, we formulate the model and describe the BSBK procedure and the parametric estimators for the continuous and discrete parts based on a profile least squares method. The nonparametric kernel estimators for index functions are given in Section 2.3. The consistency and normality of parametric and nonparametric estimators are given in Section 2.4. Section 3 gives the parametric likelihood ratio statistic and several nonparametric GLR statistics, as well as their theoretical properties. In Section 4, we report on simulation studies that illustrate the finite sample performance of the proposed estimators and test statistics. In Section 5 we show the utility of the method by applying it to a baby birthweight data set. Some concluding remarks are given in Section 6. The proofs of the main results are relegated to the Appendix.

2. Estimation Procedures

2.1. Estimation Procedures

We focus on the situation with L = 1 for ease of presentation, and rewrite (1.1) as

Y = m_{0} (β_{0}^{T} X) + α_{0}^{T} Z + m_{1} (β_{1}^{T} X) G + α_{1}^{T} Z G + ε .

(2.1)

The proposed procedure for model (2.1) can be easily extended to model (1.1) with multiple G’s (i.e., multiple SNPs), and it is still more general than the existing ones used for G×E interaction. It is motivated by a recent genome-wide association study to identify genetic risk factors interacting with maternal uterine environments to increase the risk of low and high birth weight (HAPO Study Cooperative Research Group (2009)). The underlying hypothesis is that the variation of birth weight can be explained by complex G×E interactions in the context of the maternal-fetal unit. As a fetus resides inside its mother’s womb, there is intensive signalling and chemical exchange between the two. The effects of fetal genes could be modified by simultaneous exposure to multiple stimuli from the mother’s side, such as the mother’s glucose level and blood pressure. For continuously measured environmental variables, we propose to model the joint effect of the environmental variables as a whole through an unknown index function m(·). Whether the index function is linear or nonlinear is determined by the data, which gives the flexibility to capture the underlying mechanism by which environmental mixtures modify genetic influences on disease risk. For discrete environmental variables such as smoking status and family disease history, their interaction effects with genes can be modeled through a parametric function.

The motivation for assessing nonlinear G×E interaction in complex disease has been discussed extensively in Ma et al. (2011) and Wu and Cui (2013). The model for testing nonlinear G×E interactions in Ma et al. (2011) can be viewed as a special case of (2.1) with p = 1 (the dimension of β_l) and q = 0 (the dimension of α_l). We assume the index loading parameters β₀ and β₁ to be different; this differs from the single index model proposed by Xia and Li (1999), which assumes common loading parameters for different index functions. Li et al. (2010) studied generalized functional linear models with semiparametric single-index interaction, but did not allow dissimilar loading parameters in different index functions. Although the varying-index coefficient model (VICM) proposed by Ma and Song (2015) can consider the joint interaction of multiple environments with genes, it does not admit discrete variables Z. Such discrete environmental variables are common in G×E studies, and their inclusion is crucial for assessing discrete G×E interactions, as implemented in most partial linear single-index models (Carroll et al. (1997); Xia and Li (1999); Xia and Hardle (2006); Liang et al. (2010)). Nevertheless, including both parametric and nonparametric terms in the same model poses computational and theoretical challenges. As discussed earlier, our model differs from that proposed by Ma and Xu (2015), who assumed the same loading parameters for different index functions. This assumption is too strong in reality, since the modulation effect of environmental variables may differ from gene to gene. Our data analysis results in Section 5 indicate that such an assumption is invalid there. Theoretical and practical considerations thus motivate us to consider a more flexible model, as in (2.1), that can incorporate both linear and nonlinear interactions without too many assumptions on the model parameters.

2.2. Parameter estimation

Consider the PLVMICM model given in (2.1). Let θ = (α^T, β^T)^T, where $α = {(α_{0}^{T}, α_{1}^{T})}^{T}$ and $β = {(β_{0}^{T}, β_{1}^{T})}^{T}$. Let V_i = (X_i, Z_i, G_i), i = 1, ⋯, n, be the observations, and Θ_α and Θ_β be the parameter spaces for α and β, respectively. In this section, we derive the detailed estimation procedure employing the BSBK method proposed by Wang and Yang (2007). Let ℱ_n be the space of B-spline basis functions of order r (r ≥ 2) (de Boor (2001)) with the B-spline basis B_r(u) = (B_{s,r}(u) : 1 ≤ s ≤ J_n)^T, u ∈ [a, b], where J_n = N + r and N = N_n is the number of interior knots for a knot sequence ξ₁ = ⋯ = ξ_r = 0 < ξ_{r+1} < ⋯ < ξ_{r+N_n} < 1 = ξ_{r+N_n+1} = ⋯ = ξ_{N_n+2r}, in which N_n increases with the sample size n. Then m_l(u_l) with $u_{l} = u_{l} (β_{l}) = β_{l}^{T} X$, l = 0, 1, can be approximated by a spline function,

{\tilde{m}}_{l} (u_{l}) \equiv {\tilde{m}}_{l} (u_{l}, β) \approx \sum_{s = 1}^{J_{n}} B_{s, r} (u_{l}) λ_{s, l} (β) = B_{r}^{T} (u_{l}) λ_{l} (β),

where λ_l(β) = (λ_{s,l}(β), 1 ≤ s ≤ J_n)^T and λ(β) = (λ₀(β)^T, λ₁(β)^T)^T. For given β, the B-spline coefficients λ(β) and α can be estimated as

{({\hat{α}}^{T}, \hat{λ} {(β)}^{T})}^{T} = \underset{α \in Θ_{α}, λ (β) \in ℝ^{2 J_{n}}}{arg min} \tilde{R} ({(α^{T}, β^{T})}^{T}, λ (β)),

where $\tilde{R} ({(α^{T}, β^{T})}^{T}, λ (β)) = \sum_{i = 1}^{n} {[Y_{i} - {\tilde{m}}_{0} (β_{0}^{T} X_{i}) - α_{0}^{T} Z_{i} - \{{\tilde{m}}_{1} (β_{1}^{T} X_{i}) + α_{1}^{T} Z_{i}\} G_{i}]}^{2}$. Let $D_{i} ({\tilde{Z}}_{i}, β) = {[{\tilde{Z}}_{i}^{T}, {(D_{i, s l} (β_{l}), 1 \leq s \leq J_{n}, l = 0, 1)}^{T}]}^{T}$, where ${\tilde{Z}}_{i} = {(Z_{i}^{T}, Z_{i}^{T} G_{i})}^{T}$, $D_{i, s 0} (β_{0}) = B_{s, r} (β_{0}^{T} X_{i})$ and $D_{i, s 1} (β_{1}) = B_{s, r} (β_{1}^{T} X_{i}) G_{i}$. Let D(Z̃, β) = (D₁(Z̃₁, β), ⋯, D_n(Z̃_n, β))^T, an n × 2(q + J_n) matrix, and Y = (Y₁, ⋯, Y_n)^T, where Z̃ = (Z̃₁, ⋯, Z̃_n)^T is an n × 2q matrix. Then the least squares estimators of α and λ(β) are

{({\hat{α}}^{T}, \hat{λ} {(β)}^{T})}^{T} = {(D {(\tilde{Z}, β)}^{T} D (\tilde{Z}, β))}^{- 1} D {(\tilde{Z}, β)}^{T} Y .

(2.2)
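The least squares step in (2.2) can be sketched numerically as follows. This is a minimal illustration rather than the authors' implementation: the helper names `bspline_basis`, `clamped_knots` and `profile_ls` are hypothetical, the knots are placed at sample quantiles of the index with boundary knots at its observed range, the toy data use m₁(u) = sin(πu) and an error standard deviation of 0.1 as assumptions, and the loadings β are held fixed at their true values so that only α and λ(β) are estimated.

```python
import numpy as np

def bspline_basis(u, knots, degree):
    """All B-spline basis functions at u (Cox-de Boor recursion)."""
    u = np.asarray(u, dtype=float)
    t = np.asarray(knots, dtype=float)
    n_basis = len(t) - degree - 1
    # Degree-0 indicators on half-open knot intervals [t_j, t_{j+1}).
    B = np.zeros((u.size, len(t) - 1))
    for j in range(len(t) - 1):
        B[:, j] = (t[j] <= u) & (u < t[j + 1])
    # Close the last non-empty interval so the right endpoint is covered.
    B[u == t[-1], n_basis - 1] = 1.0
    for d in range(1, degree + 1):
        B_next = np.zeros((u.size, len(t) - 1 - d))
        for j in range(len(t) - 1 - d):
            left, right = t[j + d] - t[j], t[j + d + 1] - t[j + 1]
            if left > 0:
                B_next[:, j] += (u - t[j]) / left * B[:, j]
            if right > 0:
                B_next[:, j] += (t[j + d + 1] - u) / right * B[:, j + 1]
        B = B_next
    return B  # shape (n, J_n) with J_n = N + r and r = degree + 1

def clamped_knots(u, n_interior, degree):
    """Assumed knot rule: interior knots at uniform sample quantiles of u."""
    interior = np.quantile(u, np.linspace(0, 1, n_interior + 2)[1:-1])
    return np.concatenate([np.repeat(u.min(), degree + 1), interior,
                           np.repeat(u.max(), degree + 1)])

def profile_ls(Y, X, Z, G, beta0, beta1, n_interior=3, degree=3):
    """For fixed beta, solve (2.2) for alpha and the spline coefficients."""
    u0, u1 = X @ beta0, X @ beta1
    B0 = bspline_basis(u0, clamped_knots(u0, n_interior, degree), degree)
    B1 = bspline_basis(u1, clamped_knots(u1, n_interior, degree), degree)
    Z_tilde = np.hstack([Z, Z * G[:, None]])           # n x 2q
    D = np.hstack([Z_tilde, B0, B1 * G[:, None]])      # n x 2(q + J_n)
    coef, *_ = np.linalg.lstsq(D, Y, rcond=None)       # numerically stable LS
    q2 = Z_tilde.shape[1]
    return coef[:q2], coef[q2:], D

# Toy data (all settings below are illustrative assumptions).
rng = np.random.default_rng(0)
n = 500
X = rng.uniform(size=(n, 3))
Z = rng.binomial(1, 0.5, size=(n, 2)).astype(float)
G = rng.binomial(2, 0.3, size=n).astype(float)         # additive 0/1/2 coding
beta0 = np.array([np.sqrt(5), 2.0, 2.0]) / np.sqrt(13)
beta1 = np.ones(3) / np.sqrt(3)
alpha_true = np.array([0.5, 0.5, 0.3, 0.3])            # (alpha_0, alpha_1)
Z_tilde = np.hstack([Z, Z * G[:, None]])
Y = (np.cos(np.pi * (X @ beta0)) + np.sin(np.pi * (X @ beta1)) * G
     + Z_tilde @ alpha_true + rng.normal(0.0, 0.1, n))

alpha_hat, lambda_hat, D = profile_ls(Y, X, Z, G, beta0, beta1)
print(np.round(alpha_hat, 3))
```

`np.linalg.lstsq` is used in place of the explicit inverse in (2.2) purely for numerical stability; both give the same minimizer when D has full column rank. In the full procedure this step would be nested inside the search over β.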

Once the B-spline coefficients λ(β) are estimated, we can obtain the first derivative of the spline approximation of the nonparametric function as ${\tilde{m}}_{l}^{'} (u_{l}) \equiv {\tilde{m}}_{l}^{'} (u_{l}, β) \approx B_{r}^{'} {(u_{l})}^{T} {\hat{λ}}_{l} (β)$ , where $B_{r}^{'} {(u_{l})}^{T}$ is the first derivative of B_r(u_l). Given the estimator λ̂_l(β) in (2.2), we can estimate the loading parameters β by

\hat{β} = \underset{β \in Θ_{β}}{arg min} \tilde{R} ({({\hat{α}}^{T}, β^{T})}^{T}, \hat{λ} (β)) .

Let λ̂_l(β̂) be the estimators of the spline coefficients obtained by replacing D(Z̃, β) with D(Z̃, β̂) in (2.2). Based on the parametric estimator θ̂, it is easy to obtain the estimator of the nonparametric function m_l(u_l) as

{\tilde{m}}_{l} (u_{l}, \hat{β}) = B_{r} {(u_{l})}^{T} {\hat{λ}}_{l} (\hat{β}), l = 0, 1.

(2.3)

A detailed estimation algorithm is given in Supplementary Materials.

2.3. Kernel estimator of nonparametric functions

To establish asymptotic normality for the estimators of the nonparametric functions m_l(u_l), l = 0, 1, we use the BSBK estimator, as in Wang and Yang (2007). Define Ỹ_l = (Ỹ_{1l}, ⋯, Ỹ_{nl})^T as the new pseudo-responses, and their corresponding “oracle” responses as $Y_{l}^{O} = {(Y_{1 l}^{O}, \dots, Y_{n l}^{O})}^{T}$, l = 0, 1. By using the B-spline estimators m̃_l(·) and the parametric estimators $\hat{θ} = {({\hat{α}}_{0}^{T}, {\hat{α}}_{1}^{T}, {\hat{β}}_{0}^{T}, {\hat{β}}_{1}^{T})}^{T}$ of Section 2.2, we have

{\tilde{Y}}_{i 1} = Y_{i} - {\hat{α}}_{0}^{T} Z_{i} - {\tilde{m}}_{0} ({\hat{β}}_{0}^{T} X_{i}, \hat{β}) - {\hat{α}}_{1}^{T} Z_{i} G_{i}, \quad \text{and} \quad Y_{i 1}^{O} = Y_{i} - {\hat{α}}_{0}^{T} Z_{i} - m_{0} ({\hat{β}}_{0}^{T} X_{i}) - {\hat{α}}_{1}^{T} Z_{i} G_{i} .

Similarly, Ỹ_i₀ and $Y_{i 0}^{O}$ can be defined. In the “oracle” responses, the functions m_l(·) are assumed to be known.

Based on the new responses Ỹ₁, we can obtain the BSBK estimator of m₁(u₁) as m̂₁(u₁, β̂) = â by local linear fitting, in which

(\hat{a}, \hat{b}) = \underset{a, b}{arg min} \sum_{i = 1}^{n} \{{\tilde{Y}}_{i 1} - a G_{i} - b ({\hat{β}}_{1}^{T} X_{i} - u_{1}) G_{i}\}^{2} K_{h_{1}} ({\hat{β}}_{1}^{T} X_{i} - u_{1}),

where K_h(t) = K(t/h)/h and K(·) is a kernel function and h is a bandwidth. By minimizing the weighted least squares, the estimator m̂₁(u₁, β̂) has a closed form

{\hat{m}}_{1} (u_{1}, \hat{β}) = (1, 0) {[{\tilde{X}}^{T} W \tilde{X}]}^{- 1} {\tilde{X}}^{T} W {\tilde{Y}}_{1},

(2.4)

where

\begin{array}{ll} \tilde{X} & \equiv \tilde{X} (u_{1}, {\hat{β}}_{1}) = {\begin{pmatrix} G_{1} & \dots & G_{n} \\ ({\hat{β}}_{1}^{T} X_{1} - u_{1}) G_{1} / h_{1} & \dots & ({\hat{β}}_{1}^{T} X_{n} - u_{1}) G_{n} / h_{1} \end{pmatrix}}^{T}, \\ W & \equiv W (u_{1}, {\hat{β}}_{1}) = diag \{K_{h_{1}} ({\hat{β}}_{1}^{T} X_{1} - u_{1}), \dots, K_{h_{1}} ({\hat{β}}_{1}^{T} X_{n} - u_{1})\} . \end{array}
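The closed form (2.4) is a weighted least squares fit repeated at each evaluation point u₁. The sketch below illustrates it under stated assumptions: the function names `epanechnikov` and `local_linear_m1` are hypothetical, the pseudo-responses are noise-free with a linear m₁(u) = 2 + 3u chosen so that the local linear fit should recover m₁ up to rounding error, and the bandwidth h₁ = 0.2 is arbitrary rather than selected by EBBS.

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov kernel K(t) = 0.75 (1 - t^2)_+."""
    return 0.75 * np.clip(1.0 - t**2, 0.0, None)

def local_linear_m1(u_grid, index, G, Y_tilde, h):
    """Evaluate the local linear estimator of m1 at each point of u_grid.

    index   : fitted single index values (beta1_hat^T X_i)
    G       : genotype codes
    Y_tilde : pseudo-responses with the other model terms removed
    """
    est = np.empty(len(u_grid))
    for k, u in enumerate(u_grid):
        d = index - u
        w = epanechnikov(d / h) / h                   # K_h(d) = K(d / h) / h
        X_tilde = np.column_stack([G, d * G / h])     # rows of X-tilde in (2.4)
        XtW = X_tilde.T * w                           # X^T W without forming W
        coef = np.linalg.solve(XtW @ X_tilde, XtW @ Y_tilde)
        est[k] = coef[0]                              # (1, 0) picks the intercept
    return est

# Noise-free toy check (illustrative assumptions only).
rng = np.random.default_rng(1)
n = 400
index = rng.uniform(0.0, 1.0, n)
G = rng.choice([0.0, 1.0, 2.0], size=n)
Y_tilde = (2.0 + 3.0 * index) * G                     # m1(u) = 2 + 3u
u_grid = np.array([0.3, 0.5, 0.7])
m1_hat = local_linear_m1(u_grid, index, G, Y_tilde, h=0.2)
print(m1_hat)
```

Observations with G_i = 0 contribute nothing to the fit, which is consistent with the remark in Section 4.1 that the information available for an interaction function depends on the allele frequency.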

Similarly, we can also obtain the “oracle” kernel estimator of m₁(u₁) as ${\hat{m}}_{1}^{O} (u_{1}, {\hat{β}}_{1})$ based on new data $Y_{1}^{O}$ by local linear fitting

{\hat{m}}_{1}^{O} (u_{1}, \hat{β}) = (1, 0) {[{\tilde{X}}^{T} W \tilde{X}]}^{- 1} {\tilde{X}}^{T} W Y_{1}^{O} .

(2.5)

An outline of the algorithm can be found in Supplementary Materials. We use the BIC criterion to select the number of interior knots, while fixing the order of the basis functions as cubic to approximate the unknown functions, as described in Ma and Song (2015). The positions of the interior knots are chosen as the uniform quantiles of $u_{l}^{(k)} = X^{T} {\hat{β}}_{l}^{(k)}$ in the (k + 1)-th step (l = 0, 1, ⋯, L). Thus they change at each step while the number of knots remains fixed. This, however, does not affect the convergence of the algorithm in practice. Proving convergence of the algorithm with changing knots is beyond the scope of this work. The BSBK estimator m̂_l(u_l, θ̂) is sensitive to the choice of bandwidth h_l, l = 0, 1. Bandwidth selection has been intensively studied; see Sepanski et al. (1994) and Ruppert et al. (1995) for good discussions. To avoid the estimation of high order derivatives, we employ a bandwidth selector based on the mean squared error (MSE) criterion, called empirical bias bandwidth selection (EBBS) (Ruppert (1997); Carroll et al. (1998); Liu et al. (2014)). The details of EBBS are provided in Supplementary Materials.

Remark 1

Cui et al. (2011) and Ma and Song (2015) relaxed the constraints ||β_l||₂ = 1 to ||β_{l,−1}|| < 1 with β_{l,−1} = (β_{l2}, ⋯, β_{lp})^T, l = 0, 1. We work directly with the equality constraints ||β_l||₂ = 1, which allows us to easily develop a Newton-Raphson algorithm. We can then test H₀ : β_{lk} = 0 for all k = 1, ⋯, p (see Section 5 for a demonstration). In addition, the Newton-Raphson algorithm is faster than the nonlinear optimization method adopted in Ma and Song (2015), especially under nonlinear constraints.

2.4. Theoretical results

We need some additional notation to show the asymptotic normality of the estimator. Let θ⁰ = ((α⁰)^T, (β⁰)^T)^T be the true parameter θ, where $α^{0} = {({(α_{0}^{0})}^{T}, {(α_{1}^{0})}^{T})}^{T}$ and $β^{0} = {({(β_{0}^{0})}^{T}, {(β_{1}^{0})}^{T})}^{T}$. Let the space ℳ be a collection of functions with finite L₂ norm on [a₀, b₀] × [a₁, b₁] × ℛ with ℳ = {g(u) = g₀(u₀) + g₁(u₁)G, E g_l(u_l)² < ∞}, where u = (u₀, u₁)^T. For 1 ≤ k ≤ q, let $g_{Z_{k}}^{0} (u)$ be a minimizer in ℳ for the optimization problem,

g_{Z_{k}}^{0} (U (β^{0})) = g_{0}^{0} (X^{T} β_{0}^{0}) + g_{1}^{0} (X^{T} β_{1}^{0}) G = \underset{g \in ℳ}{arg min} E \{Z_{k} - g (U (β^{0}))\}^{2},

where $U (β^{0}) = {(X^{T} β_{0}^{0}, X^{T} β_{1}^{0})}^{T}$ . Let $P_{k} (Z_{k}) = g_{Z_{k}}^{0} (U (β^{0}))$ and P(Z) = (P₁(Z₁), ⋯, P_q(Z_q))^T. We take P(X) = (P₁(X₁), ⋯, P_p(X_p))^T with $P_{k} (X_{k}) = g_{X_{k}}^{0} (U (β^{0}))$ . Let Ẑ = Z − P(Z), X̂ = X − P(X) and ϕ(V,β⁰) = (ϕ₁(V,β⁰)^T,ϕ₂(V,β⁰)^T)^T, where ϕ₁(V,β⁰) = (Ẑ^T, Ẑ^T G)^T and $ϕ_{2} (V, β^{0}) = {({[m_{0}^{'} (X^{T} β^{0}) \hat{X}]}^{T}, {[m_{1}^{'} (X^{T} β^{0}) \hat{X} G]}^{T})}^{T}$ . Define the covariance matrix of θ⁰ as

Σ = \{E [ϕ {(V, β^{0})}^{\otimes 2}]\}^{- 1} \{E [σ^{2} (V) ϕ {(V, β^{0})}^{\otimes 2}]\} \{E [ϕ {(V, β^{0})}^{\otimes 2}]\}^{- 1},

where ζ^{⊗2} = ζζ^T for any vector ζ. Σ simplifies to $Σ = σ_{0}^{2} \{E [ϕ {(V, β^{0})}^{\otimes 2}]\}^{- 1}$ if the error variance σ²(V) is a constant $σ_{0}^{2}$.

Theorem 1

If assumptions (A.1)–(A.4) in the Appendix hold, and nN⁴ → ∞ and nN^{−2r−2} → 0, then ||θ̂ − θ⁰||₂ = O_p(n^{−1/2}), and as n → ∞, $n^{1 / 2} (\hat{θ} - θ^{0}) \overset{L}{\to} N (0, Σ)$.

Theorem 2

If assumptions (A.1)–(A.4) in the Appendix hold, and nN⁴ → ∞ and nN^{−2r−2} → 0, then for l = 0, 1,

sup_{u_{l} \in [a_{l}, b_{l}]} ∣ {\tilde{m}}_{l} (u_{l}, \hat{β}) - m_{l} (u_{l}) ∣ = O_{p} ({(N / n)}^{1 / 2} + N^{- r}),

where m̃_l(u_l, β̂) is given in (2.3), and m_l(·) is the true function.

Next we show that the order of the asymptotic uniform magnitude of the difference between the BSBK estimator m̂_l(u_l, β̂) and its “oracle” version ${\hat{m}}_{l}^{O} (u_{l}, \hat{β})$ is o_p(n^{−2/5}), so m̂_l(u_l, β̂) and ${\hat{m}}_{l}^{O} (u_{l}, \hat{β})$ share the same asymptotic distribution.

Theorem 3

If assumptions (A.1)–(A.6) in the Appendix hold, and nN⁴ → ∞ and nN^{−δ} → 0 with δ = min(2r + 2, 5r/2), then for l = 0, 1,

sup_{u_{l} \in [a_{l}, b_{l}]} ∣ {\hat{m}}_{l} (u_{l}, \hat{β}) - {\hat{m}}_{l}^{O} (u_{l}, \hat{β}) ∣ = o_{p} (n^{- 2 / 5}) .

Set μ_k = ∫t^kK(t)dt, ν_k = ∫t^kK²(t)dt. The consistency and asymptotic normality of the unknown functions m₀(·) and m₁(·) now follow.

Theorem 4

If assumptions (A.1)–(A.6) in the Appendix hold, and nN⁴ → ∞ and nN^{−2r−2} → 0, then, for l = 0, 1,

{(n h_{l})}^{1 / 2} \{{\hat{m}}_{l} (u_{l}, \hat{β}) - m_{l} (u_{l}) - b_{l} (u_{l}) h_{l}^{2}\} \overset{L}{\to} N (0, v_{l} (u_{l})), \quad \text{as } n \to \infty,

where $b_{l} (u_{l}) = μ_{2} m_{l}^{″} (u_{l}) / 2$, l = 0, 1, $v_{0} (u_{0}) = f_{0} {(u_{0})}^{- 1} ν_{0} E [σ^{2} (V) ∣ X^{T} β_{0}^{0} = u_{0}]$, and $v_{1} (u_{1}) = f_{1} {(u_{1})}^{- 1} ν_{0} E [G^{2} σ^{2} (V) ∣ X^{T} β_{1}^{0} = u_{1}] / {(E [G^{2} ∣ X^{T} β_{1}^{0} = u_{1}])}^{2}$.

If $σ^{2} (V) = σ_{0}^{2}$, the variance v_l(u_l) can be simplified as $f_{l} {(u_{l})}^{- 1} ν_{0} σ_{0}^{2}$ for l = 0, 1.

3. Hypothesis tests

3.1. Testing for nonparametric components

Our model can assess the interaction of the combined effect of multiple environmental exposures with genes. This can be achieved by testing the nonparametric component m₁(·) to discover the change trend of the interaction of the combined environmental effect. We consider a test to detect whether m₁(u₁) is a linear function $m_{1}^{0} (u_{1}) = δ_{0} + δ_{1} u_{1}$ ,

H_{0} : m_{1} (\cdot) = m_{1}^{0} (\cdot) v.s. H_{1} : m_{1} (\cdot) \neq m_{1}^{0} (\cdot),

(3.1)

via a generalized likelihood ratio (GLR) test (Fan et al. (2001); Liang et al. (2010); Ma and Song (2015)). Rejecting H₀ indicates statistical evidence of nonlinear interaction between G and multiple environmental mixtures. If we fail to reject H₀, we can further assess whether there exists a genetic effect as well as linear interaction effect between a gene and multiple environmental exposures by fitting a parametric linear interaction model.

Remark 2

In addition to the linear hypothesis, we are interested in testing H₀ : m₁(·) = 0 or H₀ : m₁(·) = c, where c is a constant. Although testing for a zero or constant effect can be done under the varying-coefficient model proposed in Ma et al. (2011), it cannot be done in the current model setup because the loading parameters β₁ are not identifiable under these nulls. If we fail to reject the null in hypotheses (3.1), we can fit a linear interaction model $Y = m_{0} (β_{0}^{T} X) + α_{0}^{T} Z + (δ_{0} + β_{1}^{T} X + α_{1}^{T} Z) G + ε$, where no constraints on β₁ are imposed. Then one can proceed to test $H_{0}^{L} : δ_{0} = β_{1} = α_{1} = 0$ to assess the overall effect of G on Y. If $H_{0}^{L}$ is rejected, one can continue to assess the marginal effect of G on Y and the interaction effect between G and X or Z.

Consider (3.1). Let θ̂ be the BSBK estimate of θ proposed in Section 2.2. Let m̂_{l,H₀}(u_l) and m̂_{l,H₁}(u_l) be the estimators under H₀ and H₁, respectively. Let the residual sums of squares under H₀ and H₁ in (3.1) be ${RSS}_{1} (H_{0}) = \sum_{i = 1}^{n} \{{\hat{Y}}_{i} - {\hat{m}}_{0, H_{0}} ({\hat{β}}_{0}^{T} X_{i}) - {\hat{m}}_{1, H_{0}} ({\hat{β}}_{1}^{T} X_{i}) G_{i}\}^{2}$ and ${RSS}_{1} (H_{1}) = \sum_{i = 1}^{n} \{{\hat{Y}}_{i} - {\hat{m}}_{0, H_{1}} ({\hat{β}}_{0}^{T} X_{i}) - {\hat{m}}_{1, H_{1}} ({\hat{β}}_{1}^{T} X_{i}) G_{i}\}^{2}$, where Ŷ_i = Y_i − α̂^T Z̃_i. We define the generalized likelihood ratio (GLR) test statistic as

T_{1} = \frac{n}{2} \frac{{RSS}_{1} (H_{0}) - {RSS}_{1} (H_{1})}{{RSS}_{1} (H_{1})},

(3.2)

Let a_K = {K(0) − 1/2 ∫ K²(u)du} [∫{K(u) − 1/2 K * K(u)}² du]⁻¹, where K * K(u) denotes the convolution of K with itself. Denote by Ω_l the support of $β_{l}^{T} X$, and by |Ω_l| the length of Ω_l, l = 0, 1.

Theorem 5

If assumptions (A.1)–(A.6) in the Appendix hold, and nN⁴ → ∞ and nN^{−2r−2} → 0, then under H₀ in (3.1), when $m_{1}^{0} (u_{1})$ is a linear function of u₁,

σ_{1 n}^{- 1} (T_{1} - μ_{1 n}) \overset{L}{\to} N (0, 1),

where $σ_{1 n}^{2} = \frac{2}{h_{1}} ∣ Ω_{1} ∣ \int \{K (u) - 1 / 2 K * K (u)\}^{2} d u$, and $μ_{1 n} = \frac{1}{h_{1}} ∣ Ω_{1} ∣ \{K (0) - 1 / 2 \int K^{2} (u) d u\}$. Furthermore, $a_{K} T_{1} \overset{a}{\sim} χ_{d_{1}}^{2}$, where d₁ = a_K μ_{1n}.
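The kernel constants in Theorem 5 can be evaluated numerically. The following sketch does this for the Epanechnikov kernel and then assembles T₁, μ_{1n}, σ²_{1n} and d₁ from (3.2) and Theorem 5. It assumes a_K is the ratio of K(0) − ½∫K² to ∫(K − ½K∗K)², which equals 2μ_{1n}/σ²_{1n}; the function name `glr_test_quantities` and the input values (residual sums of squares, |Ω₁|, h₁) are hypothetical placeholders, not results from the paper.

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov kernel K(t) = 0.75 (1 - t^2)_+."""
    return 0.75 * np.clip(1.0 - t**2, 0.0, None)

# Symmetric grid with an odd number of points so that mode="same"
# keeps the convolution aligned with the grid (centre at t = 0).
grid = np.linspace(-3.0, 3.0, 6001)
dt = grid[1] - grid[0]
K = epanechnikov(grid)
KK = np.convolve(K, K, mode="same") * dt            # K * K on the same grid

c_K = epanechnikov(0.0) - 0.5 * np.sum(K**2) * dt   # K(0) - (1/2) int K^2
s_K = np.sum((K - 0.5 * KK) ** 2) * dt              # int {K - (1/2) K*K}^2
a_K = c_K / s_K

def glr_test_quantities(rss0, rss1, n, omega_len, h):
    """Assemble T1 of (3.2) and the normalising constants of Theorem 5."""
    T1 = 0.5 * n * (rss0 - rss1) / rss1
    mu_1n = omega_len / h * c_K
    sigma2_1n = 2.0 * omega_len / h * s_K
    d1 = a_K * mu_1n                                 # approximate chi-square df
    return {"T1": T1, "mu_1n": mu_1n, "sigma2_1n": sigma2_1n,
            "scaled_stat": a_K * T1, "df": d1}

# Hypothetical inputs, for illustration only.
out = glr_test_quantities(rss0=105.0, rss1=100.0, n=500, omega_len=1.5, h=0.2)
print(round(c_K, 4), round(a_K, 4))
print(out)
```

The scaled statistic a_K T₁ would then be compared with a χ² distribution with d₁ degrees of freedom, or, as Remark 3 recommends for small samples, with a conditional bootstrap reference distribution.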

When assessing the linear form of the function, RSS₁(H₀) and RSS₁(H₁) can be calculated by first obtaining the estimators of m₀(·) and m₁(·) using the B-spline method under the null and alternative hypotheses. The B-spline estimators under H₀ are given by ${\tilde{m}}_{0, H_{0}} (u_{0}) = B_{r}^{T} (u_{0}) {\hat{λ}}_{0}$ and m̂_{1,H₀}(u₁) = δ̂₀ + δ̂₁u₁, where δ̂₀, δ̂₁, and λ̂₀ are the ordinary least squares estimators of δ₀, δ₁, and λ₀. Then, we can obtain the kernel estimator m̂_{0,H₀}(u₀) based on the new data (Ŷ_{H₀}, X, Z, G) and ${\hat{u}}_{0} = {\hat{β}}_{0}^{T} X$, using the arguments in Section 2.3, where Ŷ_{H₀} = (Ŷ_{1,H₀}, ⋯, Ŷ_{n,H₀})^T and ${\hat{Y}}_{i, H_{0}} = Y_{i} - {\hat{α}}^{T} {\tilde{Z}}_{i} - {\hat{m}}_{1, H_{0}} ({\hat{β}}_{1}^{T} X_{i}) G_{i}$. Here m̂_{0,H₁}(·) and m̂_{1,H₁}(·) are the BSBK estimators, which can be obtained as in (2.4).

To illustrate the testing for the case with L > 1, we consider a model with two genetic variables G₁ and G₂,

Y = m_{0} (β_{0}^{T} X) + α_{0}^{T} Z + \{m_{1} (β_{1}^{T} X) + α_{1}^{T} Z\} G_{1} + \{m_{2} (β_{2}^{T} X) + α_{2}^{T} Z\} G_{2} + ε .

(3.3)

One can simultaneously test m₁(·) and m₂(·), for example, testing

H_{0} : m_{1} (\cdot) = m_{1}^{0} (\cdot), m_{2} (\cdot) = m_{2}^{0} (\cdot) v.s. H_{1} : m_{1} (\cdot) \neq m_{1}^{0} (\cdot) or m_{2} (\cdot) \neq m_{2}^{0} (\cdot),

(3.4)

where $m_{1}^{0} (\cdot)$ and $m_{2}^{0} (\cdot)$ are linear functions. Similarly, we can construct the corresponding GLR test statistic

T_{2} = \frac{n}{2} \frac{{RSS}_{2} (H_{0}) - {RSS}_{2} (H_{1})}{{RSS}_{2} (H_{1})},

(3.5)

where ${RSS}_{2} (H_{0}) = \sum_{i = 1}^{n} \{{\hat{Y}}_{i} - {\hat{m}}_{0, H_{0}} ({\hat{β}}_{0}^{T} X_{i}) - {\hat{m}}_{1, H_{0}} ({\hat{β}}_{1}^{T} X_{i}) G_{i 1} - {\hat{m}}_{2, H_{0}} ({\hat{β}}_{2}^{T} X_{i}) G_{i 2}\}^{2}$, ${RSS}_{2} (H_{1}) = \sum_{i = 1}^{n} \{{\hat{Y}}_{i} - {\hat{m}}_{0, H_{1}} ({\hat{β}}_{0}^{T} X_{i}) - {\hat{m}}_{1, H_{1}} ({\hat{β}}_{1}^{T} X_{i}) G_{i 1} - {\hat{m}}_{2, H_{1}} ({\hat{β}}_{2}^{T} X_{i}) G_{i 2}\}^{2}$, and ${\hat{Y}}_{i} = Y_{i} - {\hat{α}}_{0}^{T} Z_{i} - {\hat{α}}_{1}^{T} Z_{i} G_{i 1} - {\hat{α}}_{2}^{T} Z_{i} G_{i 2}$. Note that ${\hat{m}}_{l, H_{0}} ({\hat{β}}_{l}^{T} X_{i})$, l = 0, 1, 2, are different from those in T₁, but the estimation is similar.

Theorem 6

If assumptions (A.1)–(A.6) in the Appendix hold, nN⁴ → ∞ and nN^{−2r−2} → 0, then under H₀ in (3.4), when $m_{1}^{0} (u_{1})$ and $m_{2}^{0} (u_{2})$ are linear functions,

σ_{2 n}^{- 1} (T_{2} - μ_{2 n}) \overset{L}{\to} N (0, 1),

where $σ_{2 n}^{2} = 2 b_{n} \int \{K (u) - 1 / 2 K * K (u)\}^{2} d u$, μ_{2n} = b_n{K(0) − 1/2 ∫ K²(u)du} and b_n = Σ_{l=1,2} |Ω_l|/h_l. Furthermore, $a_{K}^{*} T_{2} \overset{a}{\sim} χ_{d_{2}}^{2}$, where $d_{2} = a_{K}^{*} μ_{2 n}$ with $a_{K}^{*} = 2 μ_{2 n} / σ_{2 n}^{2}$.

Remark 3

The formulation of asymptotic normality in Theorem 6 follows that in Fan and Jiang (2005). Theorem 6 can be generalized to cases where three or more genetic variables are fitted and tested (L ≥ 3). One can apply Theorem 6 for simultaneous inference on the functions of some components of varying index coefficients. While the asymptotic results for T₁ and T₂ are available, they may not perform well when sample sizes are small. We recommend the conditional bootstrap method (Cai et al. (2000); Fan et al. (2001)) in applications.

3.2. Testing parametric components

We are also interested in assessing the interaction effects of genes with discrete environments. This can be addressed via parametric hypothesis testing. Furthermore, if there is G×E interaction, one may be interested in testing which index coefficients contribute to the joint effect. This results in another parametric hypothesis testing problem. We consider a class of general hypothesis testing problems with

H_{0} : A ζ = γ v.s. H_{1} : A ζ \neq γ,

(3.6)

where A is a known k × (q + s) full-rank matrix, s is the number of elements in S ⊂ {1, ⋯, p}, $ζ = {(α_{1}^{T}, β_{S}^{T})}^{T}$ with β_S = (β_j₁, ⋯, β_{j_s})^T, j_l ∈ S, and γ is a k-dimensional vector. For a special case, we can detect whether α₁ and β_S are zeros by taking

H_{0} : α_{1} = 0, β_{S} = 0 v.s. H_{1} : α_{1} \neq 0 or β_{S} \neq 0 .

(3.7)

Let $θ_{H_{0}} = {(α_{0, H_{0}}^{T}, α_{1, H_{0}}^{T}, β_{0, H_{0}}^{T}, β_{1, H_{0}}^{T})}^{T}$ be the parameters corresponding to θ under H₀ in (3.7) and $θ_{H_{1}} = {(α_{0, H_{1}}^{T}, α_{1, H_{1}}^{T}, β_{0, H_{1}}^{T}, β_{1, H_{1}}^{T})}^{T}$ be the counterparts under H₁. Define the residual sums of squares under H₀ and H₁ as

\begin{array}{ll} R_{H_{0}} & = \sum_{i = 1}^{n} [Y_{i} - {\hat{m}}_{0, H_{0}} ({\hat{β}}_{0, H_{0}}^{T} X_{i}, {\hat{β}}_{H_{0}}) - {\hat{α}}_{0, H_{0}}^{T} Z_{i} - \{{\hat{m}}_{1, H_{0}} ({\hat{β}}_{1, H_{0}}^{T} X_{i}, {\hat{β}}_{H_{0}}) + {\hat{α}}_{1, H_{0}}^{T} Z_{i}\} G_{i}]^{2}, \\ R_{H_{1}} & = \sum_{i = 1}^{n} [Y_{i} - {\hat{m}}_{0, H_{1}} ({\hat{β}}_{0, H_{1}}^{T} X_{i}, {\hat{β}}_{H_{1}}) - {\hat{α}}_{0, H_{1}}^{T} Z_{i} - \{{\hat{m}}_{1, H_{1}} ({\hat{β}}_{1, H_{1}}^{T} X_{i}, {\hat{β}}_{H_{1}}) + {\hat{α}}_{1, H_{1}}^{T} Z_{i}\} G_{i}]^{2}, \end{array}

where θ̂_{H₀} and θ̂_{H₁} are the estimators of θ under H₀ and H₁ proposed in Section 2.2, and m̂_{l,H₀}(·) and m̂_{l,H₁}(·) are the estimators of m_l(·) proposed in (2.4) under H₀ and H₁, l = 0, 1, respectively. We take the test statistic

T_{3} = \frac{n (R_{H_{0}} - R_{H_{1}})}{R_{H_{1}}} .

(3.8)

Theorem 7

If assumptions (A.1)–(A.6) in the Appendix hold, nN⁴ → ∞ and nN^{−2r−2} → 0, then when σ²(V) is a constant $σ_{0}^{2}$,

under H₀ in (3.6), $T_{3} \overset{L}{\to} χ_{k}^{2}$;
under H₁ in (3.6), T₃ converges to a noncentral χ² distribution with k degrees of freedom and noncentrality parameter ϕ = lim_{n→∞} nσ²(Aζ − γ)^T (AΣ⁻¹A)⁻¹(Aζ − γ), where Σ is defined as in Theorem 1.

4. Monte Carlo simulation

The finite sample performance of the proposed method was evaluated by simulation studies. Under model (2.1), we generated continuous X variables X₁, X₂, X₃ as independent uniform U(0, 1) and discrete Z variables Z₁, Z₂ as independent Bernoulli Ber(1, 0.5). The genetic variable G was coded as (2, 1, 0) corresponding to genotypes (AA, Aa, aa). We set the minor allele frequency (MAF) p_A = (0.1, 0.3, 0.5) and assumed Hardy-Weinberg equilibrium. SNP genotypes AA, Aa, and aa were simulated from a multinomial distribution with frequencies $p_{A}^{2}$ , 2p_A(1 − p_A) and (1 − p_A)², respectively. The error term ε was normal N(0, 0.1).

We set m₀(u) = cos(πu) and m₁(u) = sin{π(u − A)/(B − A)} with $A = \sqrt{3} / 2 - 1.645 / \sqrt{12}$ and $B = \sqrt{3} / 2 + 1.645 / \sqrt{12}$ , and $β_{0} = (\sqrt{5}, \sqrt{4}, \sqrt{4}) / \sqrt{13}, β_{1} = (1, 1, 1) / \sqrt{3}$ , α₀ = (0.5, 0.5)^T, and α₁ = (0.3, 0.3)^T. We drew 1000 data sets with sample size n = 200, 500. The Epanechnikov kernel K(t) = 0.75(1 − t²)₊ was chosen to localize the unknown functions m₀(·) and m₁(·). The suitable smoothing bandwidths for estimating both functions were selected using the EBBS method described in Section 2.3. The number of interior knots N_k was selected by the BIC method.
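The data-generating design above can be sketched as follows. This is an illustrative reconstruction, not the authors' simulation code: the function name `simulate_plvmicm` is hypothetical, the second argument of N(0, 0.1) is treated as a standard deviation (an assumption, since the text does not say whether 0.1 is the variance or the standard deviation), and Ber(1, 0.5) is read as a single Bernoulli trial with success probability 0.5.

```python
import numpy as np

def simulate_plvmicm(n, maf, rng, sigma=0.1):
    """Draw one data set from model (2.1) under the Section 4 settings."""
    X = rng.uniform(0.0, 1.0, size=(n, 3))                # X1, X2, X3 ~ U(0, 1)
    Z = rng.binomial(1, 0.5, size=(n, 2)).astype(float)   # Z1, Z2 Bernoulli(0.5)
    # Genotypes AA, Aa, aa coded 2, 1, 0 with Hardy-Weinberg frequencies.
    probs = [maf**2, 2.0 * maf * (1.0 - maf), (1.0 - maf) ** 2]
    G = rng.choice([2.0, 1.0, 0.0], size=n, p=probs)

    beta0 = np.array([np.sqrt(5.0), 2.0, 2.0]) / np.sqrt(13.0)
    beta1 = np.ones(3) / np.sqrt(3.0)
    alpha0 = np.array([0.5, 0.5])
    alpha1 = np.array([0.3, 0.3])

    A = np.sqrt(3.0) / 2.0 - 1.645 / np.sqrt(12.0)
    B = np.sqrt(3.0) / 2.0 + 1.645 / np.sqrt(12.0)
    m0 = np.cos(np.pi * (X @ beta0))
    m1 = np.sin(np.pi * (X @ beta1 - A) / (B - A))

    eps = rng.normal(0.0, sigma, size=n)                  # sigma is an assumption
    Y = m0 + Z @ alpha0 + (m1 + Z @ alpha1) * G + eps
    return Y, X, Z, G, (beta0, beta1)

rng = np.random.default_rng(2016)
Y, X, Z, G, (beta0, beta1) = simulate_plvmicm(n=500, maf=0.3, rng=rng)
print(Y.shape, np.unique(G), round(G.mean(), 3))
```

Both loading vectors have unit Euclidean norm, matching the constraint ||β_l||₂ = 1 of Remark 1, and the first entries of β₀ round to the true values 0.620 and 0.555 listed in Table 1.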

4.1. Performance of estimation

Table 1 summarizes the average bias of the estimators (Bias), the standard deviation of the 1000 estimators (SD), the average of the estimated standard errors (SE) based on the theoretical calculation, and the estimated coverage probability (CP) at the nominal 95% confidence level for the parameters. In general, the coverage probability for all the parameters was close to the nominal 95% level. As the sample size increased, the performance of the parameter estimators improved: both SD and SE were consistently smaller when n increased from 200 to 500. The same trend was observed when n increased to 1000 (see Supplementary Materials for more details). The parameter estimators for the interaction effects (β₁, α₁) improved as MAF increased. For example, the SD of β̂₁₁ went from 0.028 to 0.012 when MAF increased from 0.1 to 0.5 under a fixed sample size n = 200. In contrast, the SD and SE of the estimators for the main effects (β₀, α₀) grew as MAF increased, because the amount of data informing these parameters is proportional to (1 − p_A)², the frequency of genotype aa, which shrinks as MAF increases.

Table 1.

Simulation results for p_A = 0.1, 0.3, 0.5 with sample size n = 200, 500.

n	Param	True	p_A = 0.1				p_A = 0.3				p_A = 0.5
n	Param	True	Bias	SD	SE	CP	Bias	SD	SE	CP	Bias	SD	SE	CP
200	α₀₁	0.500	4.4E-04	0.016	0.016	95.2	3.1E-04	0.020	0.020	95.2	9.9E-04	0.026	0.026	95.1
	α₀₂	0.500	−1.6E-04	0.016	0.016	95.3	4.1E-04	0.020	0.020	95.3	5.6E-04	0.026	0.026	95.8
	α₁₁	0.300	9.4E-05	0.040	0.039	94.1	6.0E-04	0.024	0.024	94.1	6.7E-05	0.022	0.022	95.2
	α₁₂	0.300	−1.1E-03	0.040	0.039	95.0	−1.1E-03	0.023	0.024	95.9	−4.4E-04	0.021	0.022	96.3
	β₀₁	0.620	−3.7E-04	0.011	0.011	94.7	−1.7E-03	0.012	0.013	94.8	−2.1E-03	0.014	0.014	94.5
	β₀₂	0.555	3.3E-04	0.012	0.012	95.3	1.0E-03	0.013	0.013	96.4	1.5E-03	0.014	0.015	96.6
	β₀₃	0.555	−2.7E-04	0.012	0.012	94.0	4.2E-04	0.013	0.013	95.3	3.1E-04	0.015	0.015	95.4
	β₁₁	0.577	1.4E-03	0.028	0.027	92.9	−4.0E-04	0.015	0.015	95.5	−7.5E-05	0.012	0.012	95.1
	β₁₂	0.577	−3.4E-04	0.029	0.028	93.5	9.5E-05	0.015	0.015	95.3	2.9E-04	0.011	0.012	96.2
	β₁₃	0.577	−3.2E-03	0.028	0.027	94.3	−2.6E-04	0.015	0.015	96.1	−5.7E-04	0.012	0.012	96.0
500	α₀₁	0.500	−3.2E-04	0.010	0.010	95.8	−5.5E-04	0.012	0.012	95.2	−4.0E-04	0.016	0.016	96.1
	α₀₂	0.500	1.9E-04	0.010	0.010	94.1	2.0E-04	0.013	0.012	94.2	3.8E-04	0.016	0.016	94.6
	α₁₁	0.300	5.6E-04	0.023	0.022	93.7	9.9E-04	0.015	0.014	93.8	6.5E-04	0.013	0.013	94.5
	α₁₂	0.300	1.2E-05	0.023	0.022	94.0	2.6E-04	0.015	0.014	93.8	2.0E-04	0.013	0.013	94.1
	β₀₁	0.620	−4.6E-04	0.007	0.007	95.2	−1.0E-03	0.008	0.008	95.7	−1.2E-03	0.009	0.009	94.9
	β₀₂	0.555	1.2E-04	0.007	0.007	95.5	4.3E-04	0.008	0.008	95.1	5.5E-04	0.009	0.009	95.1
	β₀₃	0.555	2.6E-04	0.007	0.007	94.2	5.2E-04	0.008	0.008	94.1	5.2E-04	0.009	0.009	94.4
	β₁₁	0.577	5.2E-04	0.015	0.016	95.0	3.0E-05	0.009	0.009	96.6	−8.5E-06	0.007	0.007	95.9
	β₁₂	0.577	−3.4E-04	0.016	0.016	94.0	−8.0E-06	0.009	0.009	95.6	1.0E-04	0.007	0.007	96.3
	β₁₃	0.577	−8.3E-04	0.016	0.016	94.5	−2.3E-04	0.009	0.009	95.2	−2.3E-04	0.007	0.007	94.8
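
The four summary measures reported in Table 1 can be computed from Monte Carlo output roughly as follows. This is a sketch under assumptions: the per-replicate estimates and their estimated standard errors are taken as given, the SD uses the n − 1 divisor, and coverage is based on the normal-approximation interval estimate ± 1.96 × SE. The function name is hypothetical.

```python
import numpy as np


def summarize_replicates(estimates, std_errors, true_value, z=1.96):
    """Bias, SD, SE and coverage (in percent) for one scalar parameter."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    bias = estimates.mean() - true_value   # average bias over replicates
    sd = estimates.std(ddof=1)             # empirical SD of the estimates
    se = std_errors.mean()                 # average estimated standard error
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    covered = (lower <= true_value) & (true_value <= upper)
    return {"Bias": bias, "SD": sd, "SE": se, "CP": 100.0 * covered.mean()}
```

Agreement between SD and SE, as seen throughout Table 1, suggests that the theoretical standard errors track the sampling variability well.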


Figure 1 shows the estimators of m₁(u₁) and the corresponding confidence bands under different sample sizes and MAFs for u₁ in the interval from 0.25 to 1.25. It can be seen that the estimated curves almost overlap with the corresponding true curves, and the confidence bands are very tight, especially under large MAF and sample size. We also plotted the estimate of m₀(·); see the Supplementary Materials.

Figure 1. The estimation of function m₁(·) under different MAFs and sample sizes. The estimated and true functions are denoted by the solid and dashed lines, respectively. The 95% confidence band is denoted by the dash-dotted line.

4.2. Performance of hypothesis tests

We first evaluated the performance of the test for the nonparametric function under the hypothesis $H_{0} : m_{1} (\cdot) = m_{1}^{0} (\cdot)$, where $m_{1}^{0} (u_{1}) = δ_{0} + δ_{1} u_{1}$, and δ₀ and δ₁ are some constants. Power was evaluated under a sequence of alternative models indexed by τ, $H_{1}^{τ} : m_{1}^{τ} (\cdot) = m_{1}^{0} (\cdot) + τ \{ m_{1} (\cdot) - m_{1}^{0} (\cdot) \}$. When τ = 0, the test results provide the false positive rates. The null model corresponds to a linear G×E effect.

Figure 2 shows the size (τ = 0) and power function (τ > 0) at significance level 0.05 based on 500 Monte Carlo simulations, each with 500 bootstrap samples, under sample sizes n = 200, 500, 1000. The empirical type I error rates under the three sample sizes are very close to the nominal level 0.05. We observed a dramatic power increase when MAF increased from 0.1 to 0.3 in all scenarios. The results indicate that our method can reasonably control the false positive rate and has appropriate power to detect departures from a linear G×E effect. We also considered the PLVMICM model in (3.3) with two genetic components and tested whether both m₁(·) and m₂(·) are simultaneously linear, following Theorem 6. The results are in the Supplementary Materials.

Figure 2. The empirical size and power function of testing nonparametric function m₁(·) under different sample sizes and MAFs.
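
The sequence of alternatives indexed by τ and the empirical size and power can be sketched as below. The constants δ₀ and δ₁ are not specified in the text, so they are left as arguments; the function names are hypothetical, and the bootstrap step that yields each p-value is not shown.

```python
import numpy as np

# Constants A and B from the simulation design in Section 4.
A = np.sqrt(3.0) / 2.0 - 1.645 / np.sqrt(12.0)
B = np.sqrt(3.0) / 2.0 + 1.645 / np.sqrt(12.0)


def m1_true(u):
    """Nonlinear interaction function m1(u) = sin{pi (u - A) / (B - A)}."""
    return np.sin(np.pi * (np.asarray(u, dtype=float) - A) / (B - A))


def m1_alternative(u, tau, delta0, delta1):
    """m1^tau(u) = m1^0(u) + tau * {m1(u) - m1^0(u)}, with linear m1^0."""
    u = np.asarray(u, dtype=float)
    m1_null = delta0 + delta1 * u
    return m1_null + tau * (m1_true(u) - m1_null)


def rejection_rate(p_values, level=0.05):
    """Share of replicates rejecting H0: size if tau = 0, power if tau > 0."""
    return float(np.mean(np.asarray(p_values, dtype=float) < level))
```

Setting τ = 0 reproduces the linear null model and τ = 1 the fully nonlinear m₁(·), so intermediate values trace out a power curve.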

To check the performance of the interaction test between G and the discrete variable Z, under model (2.1), we considered the hypothesis H₀ : α₁ = 0. The power of the test was evaluated under a sequence of alternatives indexed by τ, $H_{1}^{τ} : α_{1}^{τ} = τ α_{1}$. Data were simulated as in the previous section. Figure 3 depicts the empirical size (τ = 0) and power functions (τ > 0) under different sample sizes and MAFs at the 0.05 significance level. As expected, the power and size improve as MAF and sample size increase. Under low MAF (p_A = 0.1), the size is a little inflated when n is small (200 and 500), but is well controlled when n increases to 1000. As with the nonparametric test, a dramatic power improvement is observed when MAF increases from 0.1 to 0.3. The power difference between MAF = 0.3 and MAF = 0.5 is small, indicating good performance of the test.

Figure 3. The empirical size and power functions of testing H₀ : α₁ = 0 under different sample sizes and MAFs.

5. A case study

We applied the proposed PLVMICM model to a data set from the Gene Environment Association Studies initiative (GENEVA, http://www.genevastudy.org) funded by the trans-NIH Genes, Environment, and Health Initiative (GEI), to show the utility of the method. Low and high birth weights are not only the major causes of neonatal morbidity and mortality, but are also related to increased risk of metabolic diseases later in life. Fetal growth is determined by fetal genes as well as complex interactions between fetal genes and the maternal uterine environment. We focused on the Thai population, which comprised 1126 subjects genotyped on the Omni1-Quad v1-0 B platform after outliers were removed. After regressing the baby’s body weight on twelve environmental variables, including nine continuous and three discrete variables, five continuous variables and one discrete variable remained significant at the 0.0001 significance level. Three of the five continuous variables were chosen: mother’s mean OGTT diastolic blood pressure (denoted as X₁), mother’s one hour OGTT glucose level (denoted as X₂), and mother’s mean OGTT systolic blood pressure (denoted as X₃). The discrete variable, denoted as Z, is baby’s gender. For demonstration, we picked one candidate gene, CDKAL1. The gene is located on chromosome 6 and contains 192 SNPs after removing those with MAF < 0.05. Low birth weight has been shown to be associated with a high risk of type 2 diabetes later in life. Evidence from genetic studies on type 2 diabetes loci suggests that this gene is associated with reduced birth weight in Caucasian populations (Zhao et al. (2009); Andersson et al. (2010)). Our goal is to evaluate whether this gene also functions in the Thai population and, if so, how SNPs in the gene interact with the mother’s condition (considered as environment) to affect birth weight, and further to determine the interaction mechanism.

We first tested whether any SNP is associated with birth weight based on the nonparametric test of H₀ : m₁(u₁) = δ₀ + δ₁u₁, with p-value denoted by p_m₁. Since we tested each SNP individually, we applied a simple multiple testing correction method. We first calculated the effective number of tests E₀ by using the Cheverud estimation method, given by $E_{0} = 1 + L^{- 1} \sum_{i, j = 1}^{L} (1 - r_{i j}^{2})$, where L = 192 is the total number of SNPs and r_ij are the pairwise correlation coefficients of SNPs (Cheverud (2001)). The estimated E₀ = 188.09, which yields a gene-wide significance level of α = 0.01/E₀ = 5.3 × 10⁻⁵. Figure 4 depicts the −log₁₀(p-values). Six SNPs, rs16884481, rs10946428, rs6904348, rs10806925, rs9465873, and rs12662218, passed the significance level based on 10⁵ bootstrap samples.

Figure 4. Plot of −log₁₀(p-value) for SNPs within gene CDKAL1.
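
The effective number of tests and the resulting gene-wide level can be reproduced along the following lines. This is a sketch that takes the L × L matrix of pairwise SNP correlations as given; the function names are hypothetical.

```python
import numpy as np


def effective_number_of_tests(corr):
    """E0 = 1 + L^{-1} * sum_{i,j} (1 - r_ij^2) for an L x L correlation matrix."""
    corr = np.asarray(corr, dtype=float)
    n_snps = corr.shape[0]
    return float(1.0 + np.sum(1.0 - corr ** 2) / n_snps)


def gene_wide_level(e0, family_level=0.01):
    """Per-SNP significance level: family-wise level divided by E0."""
    return family_level / e0
```

Two limiting cases help to read the formula: uncorrelated SNPs give E₀ = L, and perfectly correlated SNPs give E₀ = 1. With E₀ = 188.09, `gene_wide_level(188.09)` is about 5.3 × 10⁻⁵, matching the level quoted above.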

The testing results for the six SNPs are reported in Table 2. We report the SNP ID, MAF, allele information (with the minor allele in bold), and p-values for the nonparametric test described in Section 4.2. We also report, in the column labeled p_β, the p-value of the test H₀ : β₀ = β₁ vs. H₁ : β₀ ≠ β₁, which compares our model with that of Ma and Xu (2015) and is based on the generalized likelihood ratio test in Section 3.2. The p-value of the parametric test H₀ : α₁ = 0 is reported in the column labeled p_α₁, following the procedure described in Section 4.2. To compare the goodness of fit for PLVMICM with an additive varying-coefficient model (AVCM), $E (Y ∣ X, Z, G) = \sum_{j = 1}^{3} m_{0 j} (X_{j}) + α_{0}^{T} Z + \sum_{j = 1}^{3} m_{1 j} (X_{j}) G + α_{1}^{T} Z G$, and to see the relative gain by the integrative analysis, we calculated the MSEs of both models; they are given in the last two columns of Table 2. The p-values for testing H₀ : m₁₁(X₁) = m₁₂(X₂) = m₁₃(X₃) = 0 when fitting the AVCM model are reported in the column labeled p_AVCM.

Table 2.

List of SNPs with MAF, allele, p-values under different hypothesis testing and MSE

SNP ID	MAF	Alleles	p-value				MSE
SNP ID	MAF	Alleles	p_m₁	p_β	p_α₁	p_AVCM	PLVMICM	AVCM
rs16884481	0.1960	C/T	≤1.0E-05	5.1E-04	0.2517	0.1799	0.1342	0.1402
rs10946428	0.2744	A/G	≤1.0E-05	1.5E-05	0.0960	0.1227	0.1333	0.1399
rs6904348	0.2766	A/C	≤1.0E-05	1.9E-05	0.0869	0.1358	0.1334	0.1399
rs10806925	0.4761	C/T	2.0E-05	2.2E-06	0.3671	0.2733	0.1340	0.1405
rs9465873	0.4503	A/G	3.0E-05	6.5E-06	0.4911	0.2562	0.1340	0.1403
rs12662218	0.2719	A/G	5.0E-05	5.4E-06	0.2802	0.4616	0.1345	0.1408


The p-values in column p_β for the comparison of different model assumptions clearly show that the loading parameters are different for different index functions, indicating the necessity of the proposed model over the one proposed by Ma and Xu (2015). The p-values in column p_α₁ indicate that SNP×gender interactions are not significant for these six SNPs. The goodness of fit measure in the last two columns shows that the PLVMICM model fits the data better than the AVCM model, indicating the potential benefit of integrative G×E analysis. Furthermore, the testing p-values for the AVCM model do not show significance. The results imply that the genetic effects of these six SNPs are modified by the mixture effect of the three X variables, rather than by each variable separately, which further indicates the power of the integrative analysis.
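
To put the goodness-of-fit comparison on a relative scale, the MSE columns of Table 2 can be converted into percentage reductions. The snippet below is a small arithmetic sketch using only the values printed in Table 2; the variable and function names are hypothetical.

```python
# (PLVMICM MSE, AVCM MSE) pairs copied from the last two columns of Table 2.
TABLE2_MSE = {
    "rs16884481": (0.1342, 0.1402),
    "rs10946428": (0.1333, 0.1399),
    "rs6904348": (0.1334, 0.1399),
    "rs10806925": (0.1340, 0.1405),
    "rs9465873": (0.1340, 0.1403),
    "rs12662218": (0.1345, 0.1408),
}


def relative_reduction(mse_plvmicm, mse_avcm):
    """Percentage reduction in MSE of PLVMICM relative to AVCM."""
    return 100.0 * (mse_avcm - mse_plvmicm) / mse_avcm


reductions = {snp: relative_reduction(*pair) for snp, pair in TABLE2_MSE.items()}
```

For all six SNPs the reduction lies between 4% and 5%, so the gain from the integrative model is modest in size but consistent across SNPs.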

For the remaining 186 SNPs, for which the linear null hypothesis was not rejected, we fitted the model assuming a linear G×E interaction, m₁(u₁) = δ₀ + δ₁u₁, and then tested H₀ : δ₀ = δ₁ = 0. No SNP was significant at the 5.3E-05 significance level; the most significant SNP was rs12209806, with a p-value of 6.72E-05. This indicates that there is no linear interaction between these SNPs and the three environmental variables. However, four of the 186 SNPs, rs12196595, rs6908425, rs6917599, and rs7773189, showed interactions with gender based on p_α₁; their p-values were 6.12E-08, 1.89E-07, 3.69E-07, and 1.61E-05, respectively.

We tested the significance of each individual X variable that contributes to the joint effect, following the procedure given in Section 3.2. The results showed that X₁ and X₂ contribute significantly to the joint effect in these six SNPs, but X₃ does not (see Table S2 in Supplementary Materials). The estimators of the nonparametric function m₁(u₁) for the first two SNPs, rs16884481 and rs10946428, along with their 95% confidence bands, are given in Figure 5. The estimators for the other four SNPs are shown in Section 3 of the Supplementary Materials owing to space limitations. The estimated function first decreases and then increases slightly as the index value u₁ increases. Our model clearly reveals the nonlinear modulating effect of environmental mixtures on the genetic effect on birth weight. Such dynamic effects can be helpful in designing prevention strategies when the model is applied to other complex diseases such as diabetes.

Figure 5. Plot of the estimate (solid curve) of the nonparametric function m₁(u₁) for SNPs rs16884481 and rs10946428 along with their 95% confidence band (dash-dotted line).

6. Discussion

G×E interaction has been studied intensively in the literature and many statistical methods have been proposed. In this paper, we developed a partially linear varying multi-index coefficient model (PLVMICM) to conduct a rigorous assessment of the combined effect of multiple environmental exposures on the risk of disease under the paradigm of G×E interaction. Our model can be interpreted as a systems genetics approach to modeling the joint effect of environmental mixtures as a whole, then assessing how the integrative effect modifies genetic influence on disease risk. Our model is biologically attractive in that it addresses a long-term question on G×E interaction from a systems genetics perspective and is well supported by epidemiological studies (Carpenter et al. (2002); Monosson (2005); Powers et al. (2008)); and it has the flexibility to detect nonlinear interactions, and therefore, is more powerful when genetic effects are nonlinearly modified by simultaneous exposure to multiple environments.

From a statistical point of view, the index coefficient function treats multiple environmental variables X as a single index variable, and therefore can reduce the multiple testing burden incurred when interactions between the X variables and G are modelled separately. In addition, when there exist interactions between the X variables, our model has the flexibility to incorporate such interactions by adding interaction terms to the index function. PLVMICM is flexible and includes several existing models as special cases, for example, the partially linear single-index model (Carroll et al. (1997); Xia and Li (1999); Xia and Härdle (2006); Liang et al. (2010); Cui et al. (2011)) and the nonparametric additive model discussed by Fan and Jiang (2005).

In a typical G×E study, there are usually a large number of genetic variables (e.g., SNPs), and it is important to fit multiple SNPs in a single model and to select important players that interact with environmental mixtures to affect disease risk in a high dimensional model setup. In addition, many human diseases are measured on a binary scale. It is natural to extend the current PLVMICM model to a generalized PLVMICM model framework. This will be considered in a future investigation.

Supplementary Material

NIHMS781374-supplement-suppl.pdf (321.4 KB, PDF)

Acknowledgments

This work was partially supported by grants from NSF (IOS-1237969, DMS-1209112 and DMS-1512422), from NIDA/NIH (P50 DA10075 and P50 DA039838), and from NSFC (31371336). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NSF, the NIDA/NIH and the NSFC. The authors thank an associate editor and two reviewers for their constructive and helpful comments. Funding support for the GWA mapping: Maternal Metabolism-Birth Weight Interactions study was provided through the NIH Genes, Environment and Health Initiative [GEI] (U01HG004415). The datasets used for the analyses described in this manuscript were obtained from dbGaP at http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap through dbGaP accession number phs000096.v4.p1. Code for implementing the method was written in Matlab and C, and is available for free download at http://www.stt.msu.edu/~cui/software.html.

Appendix: Proofs

Notations

For any vector ξ = (ξ₁, ⋯, ξ_s)^T ∈ ℝ^s, let ${‖ ξ ‖}_{\infty} = \max_{1 \leq l \leq s} | ξ_{l} |$. For any nonzero matrix $A_{s \times s}$, denote its L_r norm by ${‖ A ‖}_{r} = \max_{ξ \in ℝ^{s}, ξ \neq 0} {‖ A ξ ‖}_{r} {‖ ξ ‖}_{r}^{- 1}$. For any matrix $A = {(A_{i j})}_{i, j = 1}^{s, t}$, let ${‖ A ‖}_{\infty} = \max_{1 \leq i \leq s} \sum_{j = 1}^{t} | A_{i j} |$. Let $C^{(p)} [a, b] = \{ ψ : ψ^{(p)} \in C [a, b] \}$ be the space of the pth-order smooth functions. Denote the space of Lipschitz continuous functions for any fixed constant c₀ as Lib([a, b], c₀) = {ψ : |ψ(x₁) − ψ(x₂)| ≤ c₀|x₁ − x₂|, ∀x₁, x₂ ∈ [a, b]}. The following assumptions are required.

A.1
For each l = 0, 1, the density function f_{U(β_l)}(·) of random variable $U (β_{l}) = β_{l}^{T} X$ is bounded away from 0 on Ω_l, and there exists a constant 0 < c₀ < ∞ such that f_{U(β_l)}(·) ∈ Lib([a, b], c₀) for β_l in the neighborhood of $β_{l}^{0}$, where $Ω_{l} = \{ β_{l}^{T} X, X \in 𝒳 \}$ and 𝒳 is a compact support of X.
A.2
The nonparametric function $m_{l} \in C^{(r)} [a_{l}, b_{l}]$, l = 0, 1.
A.3
The noise ε satisfies E(ε|V) = 0, E(|ε|⁴) < ∞ and σ(v) = var(ε|V = v) < c₁ for some 0 < c₁ < ∞.
A.4
There exist constants 0 < c_z ≤ C_z < ∞ such that c_z ≤ Q(x) = E(Z̃Z̃^T |X = x) ≤ C_z for all x ∈ 𝒳.
A.5
The kernel function K(·) is a symmetric density function with compact support [−1, 1] and K ∈ Lib([a, b], c_K) for some constant c_K. The bandwidth $h_{l} = O (n^{- 1 / 5})$, l = 0, 1.
A.6
The functions u³K(u) and u³K′(u) are bounded and ∫ u⁴K(u)du < ∞.

Let $Y_{z, i} = Y_{i} - Z_{i}^{T} α_{0}^{0} - Z_{i}^{T} α_{1}^{0} G_{i}$, $Y_{z} = (Y_{z, 1}, \cdots, Y_{z, n})^{T}$, e = (ε₁, ⋯, ε_n)^T, 𝕏 = (X₁, ⋯, X_n)^T, ℤ = (Z₁, ⋯, Z_n)^T, ℤ̃ = (1_n, ℤ), and G = (G₁, ⋯, G_n)^T. Define

\begin{array}{r} U (β) = E [D_{i} (β) D_{i} {(β)}^{T}], \hat{U} (β) = \frac{1}{n} D {(β)}^{T} D (β), \\ U (\tilde{Z}, β) = E [D_{i} (\tilde{Z}, β) D_{i} {(\tilde{Z}, β)}^{T}], \hat{U} (\tilde{Z}, β) = \frac{1}{n} D {(\tilde{Z}, β)}^{T} D (\tilde{Z}, β), \end{array}

(A.1)

where $D_{i} (β) = (D_{i, s l} (β_{l}), 1 \leq s \leq J_{n}, l = 0, 1)^{T}$ and D(β) = (D₁(β), ⋯, D_n(β))^T, an n × 2J_n matrix.

Proof of Theorem 1

This is a straightforward result of Lemma S.6 in the Supplementary Materials.

Proof of Theorem 2

For simplicity, we assume [a_l, b_l] = [a, b] for l = 0, 1. Since for any u_l ∈ [a_l, b_l], $B_{s, l} (u_{l})$, s = 1, ⋯, J_n, l = 0, 1, have bounded first derivatives, by Lemmas S.4 and S.5 in the Supplementary Materials and Theorem 1, we have for any u_l ∈ [a, b],

\begin{array}{l} | {\tilde{m}}_{l} (u_{l}, \hat{β}) - {\tilde{m}}_{l} (u_{l}, β^{0}) | = | D {(\hat{β})}^{T} \hat{λ} (\hat{β}) - D {(β^{0})}^{T} λ (β^{0}) | \\ \leq | D {(β^{0})}^{T} \{ \hat{λ} (\hat{β}) - λ (β^{0}) \} | + | \{ D (\hat{β}) - D (β^{0}) \}^{T} \hat{λ} (\hat{β}) | \\ \leq | n^{- 1} D {(β^{0})}^{T} \hat{U} {(β^{0})}^{- 1} D {(β^{0})}^{T} e | + O_{p} (n^{- 1 / 2}) \\ = O_{p} ({(N / n)}^{1 / 2}) . \end{array}

Then, combined with Lemma S.4, we have

\begin{array}{l} \sup_{u_{l} \in [a, b]} | {\tilde{m}}_{l} (u_{l}, \hat{β}) - m_{l} (u_{l}) | \leq \sup_{u_{l} \in [a, b]} | {\tilde{m}}_{l} (u_{l}, \hat{β}) - {\tilde{m}}_{l} (u_{l}, β^{0}) | + \sup_{u_{l} \in [a, b]} | {\tilde{m}}_{l} (u_{l}, β^{0}) - m_{l} (u_{l}) | \\ = O_{p} ({(N / n)}^{1 / 2} + N^{- r}) . \end{array}

This completes the proof of Theorem 2.

Proof of Theorem 4

As $n h_{l}^{5} = O (1)$, we have ${(n h_{l})}^{1 / 2} n^{- 2 / 5} = O (1)$. By Theorem 3, we have

{(n h_{l})}^{1 / 2} \{ {\hat{m}}_{l} (u_{l}, \hat{β}) - m_{l} (u_{l}) - b_{l} (u_{l}) h_{l}^{2} \} = {(n h_{l})}^{1 / 2} \{ {\hat{m}}_{l}^{O} (u_{l}, \hat{β}) - m_{l} (u_{l}) - b_{l} (u_{l}) h_{l}^{2} \} + o_{p} (1) .

Thus Theorem 4 can be shown straightforwardly following Lemma S.7 in the Supplementary Materials.

Proof of Theorem 7

This proof is similar to that of Liang et al. (2010). Accordingly, we only provide a sketch of the proof here; more details can be found in the Supplementary Materials. We first prove n⁻¹R(H₁) = E{σ(V)} + o_p(1). Let m̂(X, β) = m̂₀(X^Tβ₀, β) + m̂₁(X^Tβ₁, β)G and, correspondingly, ${\hat{m}}^{O} (X, β) = {\hat{m}}_{0}^{O} (X^{T} β_{0}, β) + {\hat{m}}_{1}^{O} (X^{T} β_{1}, β) G$. By Theorem 3 and Lemma S.7 in the Supplementary Materials, n⁻¹R(H₁) can be decomposed as

\begin{array}{l} n^{- 1} R (H_{1}) = \frac{1}{n} \sum_{i = 1}^{n} \{ y_{i} - {\tilde{Z}}_{i}^{T} \hat{α} - \hat{m} (X_{i}, \hat{β}) \}^{2} \\ = \frac{1}{n} \sum_{i = 1}^{n} \{ y_{i} - {\tilde{Z}}_{i}^{T} α^{0} - {\hat{m}}^{O} (X_{i}, β^{0}) \}^{2} + o_{p} (n^{- 2 / 5}) + O_{p} (n^{- 1 / 2}) \\ = \frac{1}{n} \sum_{i = 1}^{n} \{ ε_{i} - ({\hat{m}}^{O} (X_{i}, β^{0}) - m (X_{i}, β^{0})) \}^{2} + o_{p} (n^{- 2 / 5}) \\ \equiv I_{1} + I_{2} + I_{3} + o_{p} (n^{- 2 / 5}), \end{array}

where $I_{3} = \frac{1}{n} \sum_{i = 1}^{n} \{ {\hat{m}}^{O} (X_{i}, β^{0}) - m (X_{i}, β^{0}) \}^{2}$, $I_{2} = - 2 \frac{1}{n} \sum_{i = 1}^{n} \{ {\hat{m}}^{O} (X_{i}, β^{0}) - m (X_{i}, β^{0}) \} ε_{i}$, and $I_{1} = \frac{1}{n} \sum_{i = 1}^{n} ε_{i}^{2}$. It is easy to see by the Law of Large Numbers that $I_{1} = E \{ σ (V) \} + O_{p} (n^{- 1 / 2})$. By Theorem 2.6 in Li and Racine (2007), we have $\max_{i} | {\hat{m}}^{O} (X_{i}, β^{0}) - m (X_{i}, β^{0}) | = O_{p} (\{ \log (n) / (n h) \}^{1 / 2})$, which results in $I_{2} = O_{p} (\{ \log (n) / (n^{2} h) \}^{1 / 2})$ and $I_{3} = O_{p} (\log (n) / (n h))$. This leads to n⁻¹R(H₁) = E{σ(V)} + o_p(1).
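
To make the last step explicit, the following short calculation may help. It assumes, beyond what A.5 states, that the bandwidth is of exact order $h \asymp n^{- 1 / 5}$ (A.5 gives only the upper bound), so the orders below are a sketch under that assumption.

```latex
\begin{aligned}
\frac{\log n}{n h} &\asymp n^{-4/5} \log n \to 0, \\
I_{3} &= O_p\!\left( n^{-4/5} \log n \right) = o_p(1), \\
I_{2} &= O_p\!\left( \{ n^{-9/5} \log n \}^{1/2} \right) = o_p(1), \\
n^{-1} R(H_{1}) &= I_{1} + I_{2} + I_{3} + o_p(n^{-2/5})
                 = E\{\sigma(V)\} + o_p(1).
\end{aligned}
```

Only $I_{1}$ contributes to the limit; the estimation error of the nonparametric part enters through $I_{2}$ and $I_{3}$, both of which vanish.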

The difference R(H₀) − R(H₁) can be decomposed as

\begin{array}{l} R (H_{0}) - R (H_{1}) = \sum_{i = 1}^{n} \{ {\tilde{Z}}_{i}^{T} ({\hat{α}}_{H_{0}} - {\hat{α}}_{H_{1}}) + (\hat{m} (X_{i}, {\hat{β}}_{H_{0}}) - \hat{m} (X_{i}, {\hat{β}}_{H_{1}})) \}^{2} \\ + 2 \sum_{i = 1}^{n} \{ {\tilde{Z}}_{i}^{T} ({\hat{α}}_{H_{0}} - {\hat{α}}_{H_{1}}) + (\hat{m} (X_{i}, {\hat{β}}_{H_{0}}) - \hat{m} (X_{i}, {\hat{β}}_{H_{1}})) \} \\ \times \{ y_{i} - {\tilde{Z}}_{i}^{T} {\hat{α}}_{H_{1}} - \hat{m} (X_{i}, {\hat{β}}_{H_{1}}) \} \equiv I_{4} + I_{5} . \end{array}

Under the null, we have $σ^{- 2} I_{4} \overset{L}{\to} χ_{k}^{2}$, and under the alternative $σ^{- 2} I_{4}$ asymptotically follows a noncentral chi-squared distribution with k degrees of freedom and noncentrality parameter ϕ. It remains to show that $I_{5} = o_{p} (1)$. This can be shown along the same lines as for $I_{4}$. This completes the proof of Theorem 7.

The proofs of Theorems 3, 5, and 6 are in the Supplementary Materials.

Footnotes

Supplementary Materials: Proofs of theorems and lemmas, additional simulation results, and data analysis results can be found in the Supplementary Materials.

References

Andersson EA, Pilgaard K, Pisinger C, Harder MN, Grarup N, Faerch K, Poulsen P, Witte DR, Jørgensen T, Vaag A, Hansen T, Pedersen O. Type 2 diabetes risk alleles near ADCY5, CDKAL1 and HHEX-IDE are associated with reduced birthweight. Diabetologia. 2010;53:1908–1916. doi: 10.1007/s00125-010-1790-0.
Cai Z, Fan J, Li R. Efficient estimation and inferences for varying-coefficient models. J Am Stat Assoc. 2000;95:888–902.
Carroll RJ, Fan J, Gijbels I, Wand MP. Generalized partially linear single-index models. J Am Stat Assoc. 1997;92:477–489.
Carroll RJ, Ruppert D, Welsh AH. Local estimating equations. J Am Stat Assoc. 1998;93:214–227.
Carpenter DO, Arcaro K, Spink DC. Understanding the human health effects of chemical mixtures. Environ Health Perspect. 2002;110(suppl 1):25–42. doi: 10.1289/ehp.02110s125.
Cheverud J. A simple correction for multiple comparisons in interval mapping genome scans. Heredity. 2001;87:52–58. doi: 10.1046/j.1365-2540.2001.00901.x.
Cui X, Härdle W, Zhu L. The EFM approach for single-index models. Ann Stat. 2011;39:1658–1688.
de Boor C. A Practical Guide to Splines. Springer; New York: 2001.
Falconer DS. The problem of environment and selection. Am Natural. 1952;86:293–299.
Fan J, Jiang J. Nonparametric inferences for additive models. J Am Stat Assoc. 2005;100:890–907.
Fan J, Zhang C, Zhang J. Generalized likelihood ratio statistics and Wilks phenomenon. Ann Stat. 2001;29:153–193.
HAPO Study Cooperative Research Group. Hyperglycemia and Adverse Pregnancy Outcome (HAPO) Study: associations with neonatal anthropometrics. Diabetes. 2009;58:453–459. doi: 10.2337/db08-1112.
Li Q, Racine JS. Nonparametric Econometrics: Theory and Practice. Princeton University Press; Princeton, NJ: 2007.
Li Y, Wang N, Carroll RJ. Generalized functional linear models with semiparametric single-index interactions. J Am Stat Assoc. 2010;105:621–633. doi: 10.1198/jasa.2010.tm09313.
Liang H, Liu X, Li R, Tsai CL. Estimation and testing for partially linear single index models. Ann Stat. 2010;38:3811–3836. doi: 10.1214/10-AOS835.
Liu X, Jiang H, Zhou Y. Local empirical likelihood inference for varying-coefficient density-ratio models based on case-control data. J Am Stat Assoc. 2014;109:635–646.
Ma S, Song PX. Varying index coefficient models. J Am Stat Assoc. 2015;110:341–356.
Ma S, Xu S. Semiparametric nonlinear regression for detecting gene and environment interactions. J Stat Plan Inf. 2015;156:31–47.
Ma S, Yang L, Romero R, Cui Y. Varying coefficient model for gene-environment interaction: a non-linear look. Bioinformatics. 2011;27:2119–2126. doi: 10.1093/bioinformatics/btr318.
Monosson E. Chemical mixtures: considering the evolution of toxicology and chemical assessment. Environ Health Perspect. 2005;113:383–390. doi: 10.1289/ehp.6987.
Ruppert D. Empirical-bias bandwidths for local polynomial nonparametric regression and density estimation. J Am Stat Assoc. 1997;92:1049–1062.
Ruppert D, Sheather SJ, Wand MP. An effective bandwidth selector for local least squares regression. J Am Stat Assoc. 1995;90:1257–1270.
Ross CA, Smith WW. Gene environment interactions in Parkinson’s disease. Parkins Rel Dis. 2007;13:S309–S315. doi: 10.1016/S1353-8020(08)70022-1.
Powers KM, Kay DM, Factor SA, Zabetian CP, Higgins DS, Samii A, Nutt JG, Griffith A, Leis B, Roberts JW, Martinez ED, Montimurro JS, Checkoway H, Payami H. Combined effects of smoking, coffee, and NSAIDs on Parkinson’s disease risk. Mov Disord. 2008;23:88–95. doi: 10.1002/mds.21782.
Sepanski JH, Knickerbocker R, Carroll RJ. A semiparametric correction for attenuation. J Am Stat Assoc. 1994;89:1366–1373.
Sexton K, Hattis D. Assessing cumulative health risks from exposure to environmental mixtures - three fundamental questions. Environ Health Perspect. 2007;115:825–832. doi: 10.1289/ehp.9333.
Wang L, Yang L. Spline-backfitted kernel smoothing of nonlinear additive autoregression model. Ann Stat. 2007;35:2474–2503.
Wu C, Cui Y. A novel method for identifying nonlinear gene-environment interactions in case-control association studies. Hum Genet. 2013;132:1413–1425. doi: 10.1007/s00439-013-1350-z.
Xia Y, Härdle W. Semi-parametric estimation of partially linear single-index models. J Multiv Anal. 2006;97:1162–1184.
Xia YC, Li WK. On single-index coefficient regression models. J Am Stat Assoc. 1999;94:1275–1285.
Zhao J, Li M, Bradfield JP, Wang K, Zhang H, Sleiman P, Kim CE, Annaiah K, Glaberson W, Glessner JT, Otieno FG, Thomas KA, Garris M, Hou C, Frackelton EC, Chiavacci RM, Berkowitz RI, Hakonarson H, Grant SF. Examination of type 2 diabetes loci implicates CDKAL1 as a birth weight gene. Diabetes. 2009;58:2414–2418. doi: 10.2337/db09-0506.
Zimmet P, Alberti K, Shaw J. Global and societal implications of the diabetes epidemic. Nature. 2001;414:782–787. doi: 10.1038/414782a.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

suppl

NIHMS781374-supplement-suppl.pdf^{(321.4KB, pdf)}

[R1] Andersson EA, Pilgaard K, Pisinger C, Harder MN, Grarup N, Faerch K, Poulsen P, Witte DR, Jrgensen T, Vaag A, Hansen T, Pedersen O. Type 2 diabetes risk alleles near ADCY5, CDKAL1 and HHEX-IDE are associated with reduced birthweight. Diabetologia. 2010;53:1908–1916. doi: 10.1007/s00125-010-1790-0. [DOI] [PubMed] [Google Scholar]

[R2] Cai Z, Fan J, Li R. Efficient estimation and inferences for varying-coefficient models. J Am Stat Assoc. 2000;95:888–902. [Google Scholar]

[R3] Carroll RJ, Fan J, Gijbels I, Wand MP. Generalized partially linear single-index models. J Am Stat Assoc. 1997;92:477–489. [Google Scholar]

[R4] Carroll RJ, Ruppert D, Welsh AH. Local estimating equations. J Am Stat Assoc. 1998;93:214–227. [Google Scholar]

[R5] Carpenter DO, Arcaro K, Spink DC. Understanding the human health effects of chemical mixtures. Environ Health Perspect. 2002;110(suppl 1):25–42. doi: 10.1289/ehp.02110s125. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] Cheverud J. A simple correction for multiple comparisons in interval mapping genome scans. Heredity. 2001;87:52–58. doi: 10.1046/j.1365-2540.2001.00901.x. [DOI] [PubMed] [Google Scholar]

[R7] Cui X, Härdle W, Zhu L. The EFM approach for single-index models. Ann Stat. 2011;39:1658–1688. [Google Scholar]

[R8] de Boor C. A Practical Guide to Splines. Springer; New York: 2001. [Google Scholar]

[R9] Falconer DS. The problem of environment and selection. Am Natural. 1952;86:293–299. [Google Scholar]

[R10] Fan J, Jiang J. Nonparametric inferences for additive models. J Am Stat Assoc. 2005;100:890–907. [Google Scholar]

[R11] Fan J, Zhang C, Zhang J. Generalized likelihood ratio statistics and Wilks phenomenon. Ann Stat. 2001;29:153–193. [Google Scholar]

[R12] HAPO Study Cooperative Research Group. Hyperglycemia and Adverse Pregnancy Outcome (HAPO) Study: associations with neonatal anthropometrics. Diabetes. 2009;58:453–459. doi: 10.2337/db08-1112. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] Li Q, Racine RS. Nonparametric Econometrics: Theory and Practice. Princeton University Press; Princeton, N. J: 2007. [Google Scholar]

[R14] Li Y, Wang N, Carroll RJ. Generalized functional linear models with semi- parametric single-index interactions. J Am Stat Assoc. 2010;105:621–633. doi: 10.1198/jasa.2010.tm09313. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] Liang H, Liu X, Li R, Tsai CL. Estimation and testing for partially linear single index models. Ann Stat. 2010;38:3811–3836. doi: 10.1214/10-AOS835. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] Liu X, Jiang H, Zhou Y. Local empirical likelihood inference for varying-coefficient density-ratio models based on case-control data. J Am Stat Assoc. 2014;109:635–646. [Google Scholar]

[R17] Ma S, Song PX. Varying index coefficient models. J Am Stat Assoc. 2015;110:341–356. [Google Scholar]

[R18] Ma S, Xu S. Semiparametric nonlinear regression for detecting gene and environment interactions. J Stat Plan Inf. 2015;156:31–47. [Google Scholar]

[R19] Ma S, Yang L, Romero R, Cui Y. Varying coefficient model for gene-environment interaction: a non-linear look. Bioinformatics. 2011;27:2119–2126. doi: 10.1093/bioinformatics/btr318. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R20] Monosson E. Chemical mixtures: considering the evolution of toxicology and chemical assessment. Environ Health Perspect. 2005;113:383–390. doi: 10.1289/ehp.6987.

[R21] Ruppert D. Empirical-bias bandwidths for local polynomial nonparametric regression and density estimation. J Am Stat Assoc. 1997;92:1049–1062.

[R22] Ruppert D, Sheather SJ, Wand MP. An effective bandwidth selector for local least squares regression. J Am Stat Assoc. 1995;90:1257–1270.

[R23] Ross CA, Smith WW. Gene environment interactions in Parkinson’s disease. Parkins Rel Dis. 2007;13:S309–S315. doi: 10.1016/S1353-8020(08)70022-1.

[R24] Powers KM, Kay DM, Factor SA, Zabetian CP, Higgins DS, Samii A, Nutt JG, Griffith A, Leis B, Roberts JW, Martinez ED, Montimurro JS, Checkoway H, Payami H. Combined effects of smoking, coffee, and NSAIDs on Parkinson’s disease risk. Mov Disord. 2008;23:88–95. doi: 10.1002/mds.21782.

[R25] Sepanski JH, Knickerbocker R, Carroll RJ. A semiparametric correction for attenuation. J Am Stat Assoc. 1994;89:1366–1373.

[R26] Sexton K, Hattis D. Assessing cumulative health risks from exposure to environmental mixtures - three fundamental questions. Environ Health Perspect. 2007;115:825–832. doi: 10.1289/ehp.9333.

[R27] Wang L, Yang L. Spline-backfitted kernel smoothing of nonlinear additive autoregression model. Ann Stat. 2007;35:2474–2503.

[R28] Wu C, Cui Y. A novel method for identifying nonlinear gene-environment interactions in case-control association studies. Hum Genet. 2013;132:1413–1425. doi: 10.1007/s00439-013-1350-z.

[R29] Xia Y, Härdle W. Semi-parametric estimation of partially linear single-index models. J Multiv Anal. 2006;97:1162–1184.

[R30] Xia YC, Li WK. On single-index coefficient regression models. J Am Stat Assoc. 1999;94:1275–1285.

[R31] Zhao J, Li M, Bradfield JP, Wang K, Zhang H, Sleiman P, Kim CE, Annaiah K, Glaberson W, Glessner JT, Otieno FG, Thomas KA, Garris M, Hou C, Frackelton EC, Chiavacci RM, Berkowitz RI, Hakonarson H, Grant SF. Examination of type 2 diabetes loci implicates CDKAL1 as a birth weight gene. Diabetes. 2009;58:2414–2418. doi: 10.2337/db09-0506.

[R32] Zimmet P, Alberti K, Shaw J. Global and societal implications of the diabetes epidemic. Nature. 2001;414:782–787. doi: 10.1038/414782a.
