2022 May 18;41(19):3643–3660. doi: 10.1002/sim.9440

Multivariate partial linear varying coefficients model for gene‐environment interactions with multiple longitudinal traits

Honglang Wang 1, Jingyi Zhang 2,3, Kelly L Klump 4, Sybil Alexandra Burt 4, Yuehua Cui 2
PMCID: PMC9308731  NIHMSID: NIHMS1807455  PMID: 35582816

Abstract

Correlated phenotypes often share common genetic determinants. Thus, a multi-trait analysis can potentially increase association power and help in understanding pleiotropic effects. When multiple traits are jointly measured over time, the correlation information between multivariate longitudinal responses can help to gain power in association analysis, and the longitudinal traits can provide insights into the dynamic gene effect over time. In this work, we propose a multivariate partially linear varying coefficients model to identify genetic variants with their effects potentially modified by environmental factors. We derive a testing framework to jointly test the association of genetic factors and illustrate it with a bivariate phenotypic trait, while taking the time-varying genetic effects into account. We extend the quadratic inference functions to deal with the longitudinal correlations and use penalized splines for the approximation of nonparametric coefficient functions. Theoretical results such as consistency and asymptotic normality of the estimates are established. The performance of the testing procedure is evaluated through Monte Carlo simulation studies. The utility of the method is demonstrated with a real data set from the Twin Study of Hormones and Behavior across the menstrual cycle project, in which single nucleotide polymorphisms associated with emotional eating behavior are identified.

Keywords: gene‐environment interaction, longitudinal traits, multi‐trait analysis, partial linear model, quadratic inference function

1. INTRODUCTION

Cross‐sectional disease traits have been the primary focus in genetic association studies. Given the improved power to identify disease genes with phenotypic data measured over time, longitudinal designs are becoming popular in genetic association studies. 1 , 2 , 3 , 4 Most statistical methods developed so far focus on a single outcome of interest. When multiple outcomes are measured over time, for example, multiple measures of heart function in a longitudinal study of cardiac function, methods focusing on just a single outcome over time may not provide a complete picture of cardiac function.

In genetics, the phenomenon that a single gene or locus influences more than one trait is known as pleiotropy. 5 , 6 Genetic pleiotropy plays a crucial role in many complex diseases. One of the most well‐known examples is the phenylketonuria (PKU) disease. 7 The conventional approach to identify genetic pleiotropic effects on multiple traits is to test the association between a gene and each trait individually and then determine whether the genetic effect is significantly associated with more than one trait. The disadvantages of this approach, such as the inflation in the family wise Type I error and incomplete information in individual tests compared to a combined analysis for multiple traits, have been discussed in some studies. 5 Therefore, a joint genetic association test on multiple traits is more desirable to control the family wise Type I error and enhance the power of tests.

In real life, timing is a very important factor in the development of a disease. Genetic effects on a disease trait vary during the life span of an individual. The function of a gene depends largely on when it turns on and off, which could show a temporal pattern. In order to capture the dynamic effect of a gene on a disease trait over time, it is natural to model the dynamic effect as a potential (nonlinear) function over time. Considering multiple longitudinal traits, we propose the following partially linear varying coefficients model,

$$y_{li}(t_{ij})=\beta_{0l}(t_{ij})+\beta_{1l}(t_{ij})\,G_i+\alpha_l^T Z_{ij}+\epsilon_{li}(t_{ij}),\quad l=1,\ldots,L,\; i=1,\ldots,N,\; j=1,\ldots,n_i, \tag{1}$$

where $y_{lij}=y_{li}(t_{ij})$ is the response variable which measures the $l$th phenotype on the $i$th subject at the $j$th time point; $Z_{ij}$ is a $p$-dimensional vector of covariates, which can be either time dependent or independent; $G_i$ denotes the time-invariant genetic variable within subject; $\beta_{0l}(\cdot)$ and $\beta_{1l}(\cdot)$ are unknown functions; and the stacked error vector $\epsilon_i=(\epsilon_{1i}^T,\ldots,\epsilon_{Li}^T)^T$ with $\epsilon_{li}=(\epsilon_{li1},\ldots,\epsilon_{lin_i})^T$ is assumed to have mean zero and covariance $\Sigma_i$. Models for multivariate longitudinal traits are necessarily complex, because they must consider different types of correlations for each independent subject: correlation between measurements for the same trait at different time points, correlation between measurements at the same time point on different traits, and correlation between measurements at different time points and on different traits. With the stacked error vector $\epsilon_i$, its covariance matrix $\Sigma_i$ carries all of these correlations.

If we use a time‐varying environmental factor Xij instead of tij in the model, that is,

$$y_{lij}=\beta_{0l}(X_{ij})+\beta_{1l}(X_{ij})\,G_i+\alpha_l^T Z_{ij}+\epsilon_{lij},$$

then the model can be used for jointly modeling nonlinear gene-environment (G×E) interactions for multiple longitudinal traits. In the model, one can assess how X modifies the effect of G on the multiple responses Y. Models for nonlinear G×E interactions have been studied. 8 , 9

Qu and Li applied the method of quadratic inference functions (QIF) to the varying coefficients models for longitudinal data. 10 One important advantage is that the QIF method only requires correct specification of the mean structure and does not require any likelihood or approximation of the likelihood in hypothesis testing. In addition, when the working correlation structure is misspecified, QIF is more efficient than the generalized estimating equation (GEE) approach. Another advantage of the QIF approach is that the inference function has an asymptotic form, which provides a model selection criterion similar to AIC and BIC. It also allows us to test whether coefficients are significantly time-varying based on the asymptotic results.

Rochon analyzed bivariate longitudinal data for discrete and continuous outcomes by using generalized estimating equations, which did not utilize the nice property of the QIF. 11 Cho applied QIF for multivariate longitudinal data with generalized linear models, which is not adequate to consider nonlinear effects as in varying coefficient models. 12 Using random effects is another popular way to model longitudinal data. 13 Proudfoot and his coauthors modeled the longitudinal data using random effects and then combined the multiple outcomes, again in a manner similar to generalized estimating equations. 14 Recently, Zhao and his coauthors proposed a joint penalized quasi-likelihood modeling based on splines for multivariate longitudinal data using random effects with applications to HIV-1 RNA load levels and CD4 cell counts. 15 Hector and Song investigated a distributed quadratic inference function framework to jointly estimate regression parameters from multiple heterogeneous data sets with correlated responses. 16

Given the ability of QIF to deal with complicated correlated data for a univariate longitudinal response, 10 in this article we generalize it to partial linear varying coefficient models with multivariate longitudinal responses. 17 , 18 The purpose of this article is to develop a powerful joint testing procedure using QIF for model (1). If the correlation between the longitudinal outcomes is reasonably high, we aim to show that the joint test has higher power than the marginal tests to detect a signal such as the genetic effect in model (1). We first use splines to approximate the nonparametric functions in the model, 19 followed by penalized estimation to avoid overfitting. Then we develop a 2-step testing procedure: a joint test for the interaction effect on multiple outcomes based on the QIF approach, followed by separate tests of the marginal effect on each outcome if the overall null is rejected. In cross-sectional studies, Wu et al. 20 developed a multivariate partially linear varying coefficient model to detect G×E interactions with multiple traits. Their method can select genetic variants with pleiotropic effects incorporating either the homogeneity (ie, pleiotropy) or heterogeneity (ie, no pleiotropy) assumptions. However, their approach cannot provide uncertainty quantification for the selected variables. Generalizing their method to multivariate longitudinal data merits further study.

This article is organized as follows. We state our proposed model in Section 2.1, and generalize the QIF method to the multivariate longitudinal responses in Section 2.2. The estimation procedure and asymptotic properties of the estimators are provided in Section 2.3. A theorem for the general goodness-of-fit test via QIF is established in Section 2.4, based on which we propose a 2-step testing procedure. We assess the finite sample performance of the proposed procedure with Monte Carlo simulation in Section 3 and illustrate the proposed methodology by the analysis of an emotional eating behavior study in Section 4. Conclusions and discussion are given in Section 5. Proofs are given in the Appendix.

2. STATISTICAL METHODS

2.1. A joint multivariate partial linear model

In multivariate longitudinal studies, suppose $y_{lij}$ is the $l$th continuous outcome collected on the $i$th subject at time point $t_{ij}$, where $l=1,\ldots,L$, $i=1,\ldots,N$, $j=1,\ldots,n_i$. The joint partially linear varying coefficient models are defined as

$$y_{lij}=y_{li}(t_{ij})=\beta_{0l}(t_{ij})+\beta_{1l}(t_{ij})\,G_i+\alpha_l^T Z_{ij}+\epsilon_{lij},$$

where Gi is the single nucleotide polymorphism (SNP) variable which does not depend on time and other types of measurement; Zij is a p‐dimensional covariate vector, which can be either time‐dependent or time‐independent; to accommodate the correlation between multiple responses, we stack the error terms ϵlij together into a long vector

$$\epsilon_i=\begin{pmatrix}\epsilon_{1i}\\ \vdots\\ \epsilon_{Li}\end{pmatrix}\quad\text{where}\quad \epsilon_{li}=\begin{pmatrix}\epsilon_{li1}\\ \vdots\\ \epsilon_{lin_i}\end{pmatrix}.$$

We assume $\epsilon_i$ has mean 0 and covariance $\Sigma_i$, which carries three kinds of association information: the within-subject correlation across different time points, the between-subject correlation at the same time point and the between-subject correlation across different time points; $\beta_{0l}(\cdot)$ and $\beta_{1l}(\cdot)$ are unknown nonparametric smooth functions, representing the main time effect and the time-dependent genetic effect, respectively. To illustrate the idea, in the following we demonstrate the methods assuming $L=2$. For the situation where there are more than two traits ($L>2$), the technique can be easily extended.

2.2. Quadratic inference function

To construct the objective function using the QIF approach, we first approximate the unknown functions β01, β11, β02, and β12 by a q‐degree truncated power spline basis, that is,

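$$\beta_{sl}(t)\approx B_{sl}(t)^T\gamma_{sl},\quad \text{for } s=0,1 \text{ and } l=1,2, \tag{2}$$

where $B_{sl}(t)=(1,t,t^2,\ldots,t^{q_{sl}},(t-\kappa_{sl,1})_+^{q_{sl}},\ldots,(t-\kappa_{sl,K_{sl}})_+^{q_{sl}})^T$ is a truncated power spline basis with degree $q_{sl}$ and $K_{sl}$ knots $\kappa_{sl,1},\ldots,\kappa_{sl,K_{sl}}$, and $\gamma_{sl}$ is a $(q_{sl}+K_{sl}+1)$-dimensional vector of spline coefficients.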
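As an illustration of the basis in (2), the following minimal Python sketch evaluates a truncated power basis at a single time point and the resulting spline approximation. The function names are hypothetical and the sketch assumes a degree of at least 1; it is not the authors' implementation.

```python
def truncated_power_basis(t, degree, knots):
    """Return B(t) = (1, t, ..., t^q, (t - k_1)_+^q, ..., (t - k_K)_+^q).

    Assumes degree >= 1 so that the truncated terms vanish left of each knot.
    """
    polynomial = [t ** p for p in range(degree + 1)]
    truncated = [max(t - k, 0.0) ** degree for k in knots]
    return polynomial + truncated


def spline_value(t, degree, knots, gamma):
    """Evaluate the approximation B(t)^T gamma for one coefficient function."""
    basis = truncated_power_basis(t, degree, knots)
    if len(gamma) != len(basis):
        raise ValueError("gamma must have degree + len(knots) + 1 entries")
    return sum(b * g for b, g in zip(basis, gamma))


# A quadratic basis with one knot at 0.5 has 2 + 1 + 1 = 4 elements.
example_basis = truncated_power_basis(0.7, degree=2, knots=[0.5])
```

The length of the returned list matches the dimension $q_{sl}+K_{sl}+1$ of $\gamma_{sl}$ stated above.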

Under the GEE framework, we solve

$$\sum_{i=1}^{N}\dot\mu_i^T v_i^{-1}(y_i-\mu_i)=0, \tag{3}$$

where $y_i=(y_{1i}^T,y_{2i}^T)^T$, $y_{li}=(y_{li1},\ldots,y_{lin_i})^T$; $\mu_i=E(y_i)$ is the mean function and $\dot\mu_i$ is the first derivative of $\mu_i$ with respect to the parameters; $v_i$ is the covariance matrix of $y_i$ and can be decomposed as $v_i=A_i^{1/2}R(\rho)A_i^{1/2}$ with $A_i$ being a diagonal matrix of marginal variances and $R(\rho)$ being a working correlation matrix with nuisance parameters $\rho$. To avoid the estimation of $\rho$, the QIF approach considers the inverse of the correlation matrix $R$ as a linear combination of several known basis matrices in the form

$$R^{-1}\approx a_1M_1+a_2M_2+\cdots+a_hM_h, \tag{4}$$

where $M_1$ is the identity matrix and $M_2,\ldots,M_h$ are symmetric basis matrices. As discussed in the existing literature, 10 , 12 the choice of the basis for the inverse of the correlation matrix plays an important role. Let $\Gamma$ denote the within-subject correlation structure and $\omega$ the between-subject correlation coefficient, so that the working correlation structure can be expressed as the Kronecker product (tensor product) $R=\Omega\otimes\Gamma$, where $\Omega$ is the $2\times2$ symmetric matrix with 1 on the diagonal and $\omega$ elsewhere. The inverse of the Kronecker product is $R^{-1}=\Omega^{-1}\otimes\Gamma^{-1}=(\gamma_0I+\gamma_1W)\otimes\Gamma^{-1}$, where $W$ is the $2\times2$ symmetric matrix with 0 on the diagonal and 1 elsewhere, and $I$ is the identity matrix of compatible dimension. So if the basis matrices for the inverse of the within-subject correlation $\Gamma^{-1}$ are $U_1(=I),U_2,\ldots,U_k$, then the bases $M$ are $\{I\otimes U_j, W\otimes U_j: 1\le j\le k\}$. For an exchangeable working correlation, we can set $k=2$ and let $U_2$ have 0 on the diagonal and 1 elsewhere. If the working correlation is AR(1), we can set $k=2$ and let $U_2$ have 1 on its two subdiagonals and 0 elsewhere. Following the QIF approach, we define the estimation function as
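The Kronecker construction of the basis matrices for two traits can be sketched with NumPy as follows. This is a minimal sketch: the function names and the string labels for the working structures are assumptions made for illustration, and equal cluster size $n$ is assumed.

```python
import numpy as np


def within_basis(n, structure):
    """Return [U_1, U_2] for the inverse within-subject correlation."""
    identity = np.eye(n)
    if structure == "exchangeable":
        u2 = np.ones((n, n)) - identity          # 0 on the diagonal, 1 elsewhere
    elif structure == "ar1":
        u2 = np.eye(n, k=1) + np.eye(n, k=-1)    # 1 on the two subdiagonals
    else:
        raise ValueError("structure must be 'exchangeable' or 'ar1'")
    return [identity, u2]


def qif_basis_matrices(n, structure):
    """Return the bases {I (x) U_j, W (x) U_j : j = 1, 2} for L = 2 traits."""
    i2 = np.eye(2)
    w = np.array([[0.0, 1.0], [1.0, 0.0]])
    us = within_basis(n, structure)
    return [np.kron(i2, u) for u in us] + [np.kron(w, u) for u in us]


bases = qif_basis_matrices(3, "exchangeable")    # four 6 x 6 symmetric matrices
```

For the exchangeable case, the inverse of $\Omega\otimes\Gamma$ lies exactly in the span of these four matrices, which can be checked numerically by a least squares fit of the vectorized inverse on the vectorized bases.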

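$$g_N(\theta)=\frac{1}{N}\sum_{i=1}^{N}g_i(\theta)=\frac{1}{N}\begin{pmatrix}\sum_{i=1}^{N}\dot\mu_i^TA_i^{-1/2}M_1A_i^{-1/2}(y_i-\mu_i)\\ \vdots\\ \sum_{i=1}^{N}\dot\mu_i^TA_i^{-1/2}M_hA_i^{-1/2}(y_i-\mu_i)\end{pmatrix}. \tag{5}$$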

Using the spline approximation, the mean function μi can be written as

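$$\mu_i(\theta)=\begin{pmatrix}\mu_{1i}(\theta)\\ \mu_{2i}(\theta)\end{pmatrix}=\begin{pmatrix}\mu_{1i1}(\theta)\\ \vdots\\ \mu_{1in_i}(\theta)\\ \mu_{2i1}(\theta)\\ \vdots\\ \mu_{2in_i}(\theta)\end{pmatrix}=\begin{pmatrix}B_{01}^T(t_{i1})\gamma_{01}+B_{11}^T(t_{i1})\gamma_{11}G_i+\alpha_1^TZ_{i1}\\ \vdots\\ B_{01}^T(t_{in_i})\gamma_{01}+B_{11}^T(t_{in_i})\gamma_{11}G_i+\alpha_1^TZ_{in_i}\\ B_{02}^T(t_{i1})\gamma_{02}+B_{12}^T(t_{i1})\gamma_{12}G_i+\alpha_2^TZ_{i1}\\ \vdots\\ B_{02}^T(t_{in_i})\gamma_{02}+B_{12}^T(t_{in_i})\gamma_{12}G_i+\alpha_2^TZ_{in_i}\end{pmatrix},$$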

and the first derivative of μi is given as,

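$$\dot\mu_i=\begin{pmatrix}B_{01}^T(t_{i1}) & B_{11}^T(t_{i1})G_i & Z_{i1}^T & 0 & 0 & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ B_{01}^T(t_{in_i}) & B_{11}^T(t_{in_i})G_i & Z_{in_i}^T & 0 & 0 & 0\\ 0 & 0 & 0 & B_{02}^T(t_{i1}) & B_{12}^T(t_{i1})G_i & Z_{i1}^T\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ 0 & 0 & 0 & B_{02}^T(t_{in_i}) & B_{12}^T(t_{in_i})G_i & Z_{in_i}^T\end{pmatrix},$$

where $\theta=(\gamma_{01}^T,\gamma_{11}^T,\alpha_1^T,\gamma_{02}^T,\gamma_{12}^T,\alpha_2^T)^T$.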

Setting each component in (5) to be zero will result in more equations than unknown parameters. Following the idea of generalized method of moments, 21 the QIF method is defined as

$$Q_N(\theta)=N\,g_N^TC_N^{-1}g_N, \tag{6}$$

where $C_N=\frac{1}{N}\sum_{i=1}^{N}g_ig_i^T$ is a consistent estimator for $\mathrm{var}(g_i)$. Minimizing the objective function (6) provides the estimation of the parameters.
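To make (5) and (6) concrete, the sketch below evaluates the QIF objective for a generic linear mean $\mu_i=X_i\theta$, which covers the spline-approximated model once $X_i$ is taken to be $\dot\mu_i$. It assumes unit marginal variances ($A_i=I$) and equal cluster sizes, uses a pseudo-inverse as a numerical guard that is not part of the definition, and all names and the toy data are hypothetical.

```python
import numpy as np


def qif_objective(theta, designs, responses, basis_matrices):
    """Evaluate Q_N(theta) = N * g_N^T C_N^{-1} g_N for mu_i = X_i @ theta.

    Assumes A_i = I and that every subject has the same number of rows.
    """
    scores = []
    for x_i, y_i in zip(designs, responses):
        residual = y_i - x_i @ theta
        # One block X_i^T M_k (y_i - mu_i) per basis matrix, stacked as in (5).
        scores.append(np.concatenate([x_i.T @ m @ residual for m in basis_matrices]))
    scores = np.asarray(scores)              # shape (N, h * dim(theta))
    n_subjects = scores.shape[0]
    g_bar = scores.mean(axis=0)              # g_N(theta)
    c_n = scores.T @ scores / n_subjects     # C_N = (1/N) sum_i g_i g_i^T
    # pinv replaces the plain inverse only to guard against a singular C_N.
    return float(n_subjects * g_bar @ np.linalg.pinv(c_n) @ g_bar)


# Toy data: intercept plus one covariate, four time points per subject.
rng = np.random.default_rng(0)
n_subjects, n_times = 200, 4
theta_true = np.array([1.0, -0.5])
designs = [np.column_stack([np.ones(n_times), rng.uniform(size=n_times)])
           for _ in range(n_subjects)]
responses = [x @ theta_true + rng.normal(scale=0.3, size=n_times) for x in designs]
identity = np.eye(n_times)
bases = [identity, np.ones((n_times, n_times)) - identity]

q_true = qif_objective(theta_true, designs, responses, bases)
q_far = qif_objective(theta_true + 5.0, designs, responses, bases)
```

In this toy setting the objective is much smaller at the true parameter than at a distant value, which is the behavior that minimization of (6) relies on.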

2.3. Estimation procedure via penalized QIF

The estimation of the parameters can be obtained through minimizing the objective function, that is,

$$\hat\theta=\arg\min_{\theta}Q_N(\theta).$$

To avoid over‐fitting, we can define a penalized QIF in a form

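$$N^{-1}Q_N(\theta)+\lambda\theta^TD\theta, \tag{7}$$

where $D$ is a diagonal matrix whose diagonal entry is 1 if the corresponding parameter is a spline coefficient associated with a knot, and 0 otherwise. Minimizing the penalized QIF provides

$$\hat\theta=\arg\min_{\theta}\left(N^{-1}Q_N(\theta)+\lambda\theta^TD\theta\right). \tag{8}$$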

To estimate the tuning parameter λ, we can extend the generalized cross‐validation 10 , 22 , 23 to the penalized QIF and define the generalized cross‐validation statistic as

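$$\mathrm{GCV}(\lambda)=\frac{N^{-1}Q_N}{(1-N^{-1}\,df)^2}$$

with the effective degree of freedom

$$df=\mathrm{tr}\left[(\ddot Q_N+2N\lambda D)^{-1}\ddot Q_N\right],$$

where $\ddot Q_N$ is the second derivative of $Q_N$. The optimized tuning parameter $\lambda$ is given as

$$\hat\lambda=\arg\min_{\lambda}\mathrm{GCV}(\lambda).$$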
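The effective degrees of freedom and the GCV statistic above can be computed directly once $Q_N$ and its second derivative are available. The sketch below is a minimal illustration under that assumption; `fit` is a hypothetical callback that returns the pair (value of $Q_N$, Hessian of $Q_N$) at the penalized estimate for a given $\lambda$, and the grid search is an assumed way of carrying out the minimization.

```python
import numpy as np


def penalty_matrix(is_knot_coefficient):
    """D: diagonal matrix with 1 for knot-related spline coefficients, else 0."""
    return np.diag(np.asarray(is_knot_coefficient, dtype=float))


def effective_df(hessian, penalty, lam, n_subjects):
    """df = tr[(Q'' + 2 N lam D)^{-1} Q'']."""
    penalized = hessian + 2.0 * n_subjects * lam * penalty
    return float(np.trace(np.linalg.solve(penalized, hessian)))


def gcv(q_value, hessian, penalty, lam, n_subjects):
    """GCV(lam) = (Q_N / N) / (1 - df / N)^2."""
    df = effective_df(hessian, penalty, lam, n_subjects)
    return (q_value / n_subjects) / (1.0 - df / n_subjects) ** 2


def select_lambda(candidates, fit, penalty, n_subjects):
    """Grid search over candidate values; fit(lam) -> (Q_N, Hessian) is hypothetical."""
    scores = {lam: gcv(*fit(lam), penalty, lam, n_subjects) for lam in candidates}
    return min(scores, key=scores.get)
```

With $\lambda=0$ the effective degrees of freedom equal the number of parameters, and they shrink toward the number of unpenalized parameters as $\lambda$ grows.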

To establish the asymptotic properties for the penalized quadratic inference function estimators with fixed knots, we assume $\theta_0$ to be the parameter satisfying $E_{\theta_0}(g_i)=0$. Similar theoretical results are provided in Qu and Li. 10 Following their idea and extending those results to the estimators in our model, we obtain the strong consistency of the resulting estimators in Theorem 1. The $\sqrt{N}$-consistency and asymptotic normality of the estimators are given in Theorem 2.

Theorem 1

Suppose conditions (A1)-(A6) in the Appendix hold and the smoothing parameter $\lambda_N=o(1)$. Then the estimator $\hat\theta$, which is obtained by minimizing the penalized quadratic function in (7), exists and converges to $\theta_0$ almost surely.

Theorem 2

Suppose conditions (A1)-(A6) in the Appendix hold and the smoothing parameter $\lambda_N=o(N^{-1/2})$. Then the estimator $\hat\theta$ obtained by minimizing the penalized quadratic function in (7) is asymptotically normally distributed with the limiting distribution

$$\sqrt{N}(\hat\theta-\theta_0)\xrightarrow{d}N\left(0,(G_0^TC_0^{-1}G_0)^{-1}\right),$$

where the calculation of $G_0$ defined in (A6) and $C_0$ defined in (A5) can be found in the Appendix.

2.4. A two‐step hypothesis testing procedure

Compared to GEE, an advantage of the QIF approach is that QIF provides a goodness-of-fit test without estimating the second moment parameters. Suppose that the $d$-dimensional parameter vector $\gamma$ is partitioned into $(\psi,\zeta)$, where $\psi$ is the parameter of interest with dimension $d_1$, and $\zeta$ is a nuisance parameter with dimension $d_2=d-d_1$. If we are interested in testing

$$H_0:\psi=\psi_0,$$

then the test statistic

$$Q_N(\psi_0,\tilde\zeta)-Q_N(\hat\psi,\hat\zeta)$$

asymptotically follows a chi-square distribution with $d_1$ degrees of freedom, as shown by Qu and her coauthors. 24

Theorem 3

Suppose that all required regularity conditions are satisfied and $\psi$ has dimension $d_1$. Under the null hypothesis, $Q_N(\psi_0,\tilde\zeta)-Q_N(\hat\psi,\hat\zeta)$ is asymptotically chi-square distributed with $d_1$ degrees of freedom, where

$$\tilde\zeta=\arg\min_{\zeta}Q_N(\psi_0,\zeta),\quad(\hat\psi,\hat\zeta)=\arg\min_{(\psi,\zeta)}Q_N(\psi,\zeta). \tag{9}$$

In Model (1), it is of interest to test whether the genetic effects on multiple traits are significant or not. Based on Theorem 3, we develop a 2-step testing procedure for testing the significance of the varying coefficient functions. In the first step, the joint test is performed to see whether a genetic factor has a significant effect on at least one longitudinal trait. If the testing result in the first step is significant, we then conduct the marginal tests in the second step to assess whether the genetic effect is significant on both traits or just one trait. For associated multiple traits with reasonably strong correlation, the joint test is more powerful than the marginal tests, which is empirically verified in our simulation studies.

2.4.1. Step 1: Joint test

First, we are interested in testing whether the genetic factor G has an effect on at least one longitudinal trait. The hypothesis is stated as

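$$H_0:\beta_{11}(\cdot)=\beta_{12}(\cdot)=0\quad\text{vs.}\quad H_1:\beta_{11}(\cdot)\ne0\ \text{or}\ \beta_{12}(\cdot)\ne0.$$

This can be handled through the truncated power spline approximation of the nonparametric functions stated in (2). In particular, testing this hypothesis is equivalent to testing the following null hypothesis

$$H_0:\gamma_{11}=\gamma_{12}=0.$$

According to Theorem 3, we can construct a test statistic

$$T_N=Q_N(\tilde\theta)-Q_N(\hat\theta),$$

where

$$\tilde\theta=\arg\min_{\gamma_{11}=\gamma_{12}=0}Q_N(\gamma_{01},\gamma_{11},\alpha_1,\gamma_{02},\gamma_{12},\alpha_2\mid y_1,y_2,G,Z),$$

and

$$\hat\theta=\arg\min Q_N(\gamma_{01},\gamma_{11},\alpha_1,\gamma_{02},\gamma_{12},\alpha_2\mid y_1,y_2,G,Z).$$

The test statistic $T_N$ has an asymptotic $\chi^2$ distribution with degrees of freedom equal to the number of constraints under $H_0$, according to Theorem 3.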

2.4.2. Step 2: Marginal tests

From the joint test, if there exists a significant genetic effect on at least one longitudinal trait, then we can further test the marginal effects, that is,

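$$H_{0l}:\beta_{1l}(\cdot)=0\quad\text{vs.}\quad H_{1l}:\beta_{1l}(\cdot)\ne0,\quad l=1,2.$$

Based on (2), this is equivalent to testing $H_{01}:\gamma_{11}=0$ and $H_{02}:\gamma_{12}=0$ separately.

For testing $H_{01}:\gamma_{11}=0$, we use the test statistic $T_{N1}=Q_N(\tilde\gamma_{01},0,\tilde\alpha_1)-Q_N(\hat\gamma_{01},\hat\gamma_{11},\hat\alpha_1)$, where

$$(\tilde\gamma_{01},0,\tilde\alpha_1)=\arg\min_{\gamma_{11}=0}Q_N(\gamma_{01},\gamma_{11},\alpha_1\mid y_1,G,Z),$$

and

$$(\hat\gamma_{01},\hat\gamma_{11},\hat\alpha_1)=\arg\min Q_N(\gamma_{01},\gamma_{11},\alpha_1\mid y_1,G,Z).$$

Similarly, we can construct a test statistic $T_{N2}=Q_N(\tilde\gamma_{02},0,\tilde\alpha_2)-Q_N(\hat\gamma_{02},\hat\gamma_{12},\hat\alpha_2)$ for testing $H_{02}:\gamma_{12}=0$, where

$$(\tilde\gamma_{02},0,\tilde\alpha_2)=\arg\min_{\gamma_{12}=0}Q_N(\gamma_{02},\gamma_{12},\alpha_2\mid y_2,G,Z),$$

and

$$(\hat\gamma_{02},\hat\gamma_{12},\hat\alpha_2)=\arg\min Q_N(\gamma_{02},\gamma_{12},\alpha_2\mid y_2,G,Z).$$

The asymptotic distributions of the test statistics $T_{N1}$ and $T_{N2}$ can be obtained from Theorem 3.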
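The two-step procedure can be sketched in Python as below, given the minimized QIF values under the null and full models. This is a minimal sketch using only the standard library: the function names, the input layout and the default level 0.05 are assumptions, and the chi-square tail probability is computed from the standard recurrence for integer degrees of freedom. For the joint test the degrees of freedom would be the total length of $\gamma_{11}$ and $\gamma_{12}$.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability P(X > x) for a chi-square with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        sf, nu = math.exp(-half), 2                # closed form for df = 2
    else:
        sf, nu = math.erfc(math.sqrt(half)), 1     # closed form for df = 1
    while nu < df:
        # Recurrence: sf(df + 2) = sf(df) + (x/2)^(df/2) e^(-x/2) / Gamma(df/2 + 1)
        sf += half ** (nu / 2.0) * math.exp(-half) / math.gamma(nu / 2.0 + 1.0)
        nu += 2
    return sf


def qif_difference_test(q_null, q_full, df):
    """Return (statistic, p-value) for T = Q_N(null fit) - Q_N(full fit)."""
    statistic = q_null - q_full
    return statistic, chi2_sf(statistic, df)


def two_step_test(q_joint_null, q_joint_full, df_joint, marginal_inputs, alpha=0.05):
    """Step 1: joint test. Step 2: marginal tests, run only if step 1 rejects.

    marginal_inputs is a list of (q_null, q_full, df) tuples, one per trait.
    """
    joint = qif_difference_test(q_joint_null, q_joint_full, df_joint)
    marginal = None
    if joint[1] < alpha:
        marginal = [qif_difference_test(*inputs) for inputs in marginal_inputs]
    return {"joint": joint, "marginal": marginal}
```

The marginal tests are skipped when the joint null is not rejected, mirroring the description of Step 2.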

3. SIMULATION STUDIES

3.1. Simulation setup

In this section, the finite sample performance of the proposed method is evaluated through Monte Carlo simulation studies. Two continuous longitudinal responses are generated from the models

$$\begin{aligned}y_{1ij}&=y_{1i}(t_{ij})=\beta_{01}(t_{ij})+\beta_{11}(t_{ij})G_i+\alpha_1Z_i+\epsilon_{1ij},\\ y_{2ij}&=y_{2i}(t_{ij})=\beta_{02}(t_{ij})+\beta_{12}(t_{ij})G_i+\alpha_2Z_i+\epsilon_{2ij},\end{aligned}$$

where $\beta_{01}(t_{ij})=0.5\cos(2\pi t_{ij})$, $\beta_{11}(t_{ij})=\sin(\pi(t_{ij}-0.2))$, $\beta_{02}(t_{ij})=\sin(\pi t_{ij})-0.5$, $\beta_{12}(t_{ij})=\cos(\pi t_{ij}-0.8)$, $\alpha_1=0.2$ and $\alpha_2=0.3$. We generate the same number of time points $n$ for each individual, $t_i=(t_{i1},\ldots,t_{in})$, from a uniform distribution $U(0,1)$. The time-independent predictor variable $Z_i$ is also generated from $U(0,1)$. We set the minor allele frequency (MAF) for $G_i$ as $p_A$ and assume Hardy-Weinberg equilibrium. Three different SNP genotypes AA, Aa, and aa are simulated from a multinomial distribution with frequencies $p_A^2$, $2p_A(1-p_A)$ and $(1-p_A)^2$, respectively. In this simulation study, we vary $p_A\in\{0.1,0.3,0.5\}$ to investigate the effect of minor allele frequency. Variable $G$ takes values $\{0,1,2\}$ corresponding to genotypes $\{aa,Aa,AA\}$, following an additive model. We assume $\epsilon_{1ij}$ and $\epsilon_{2ij}$ are jointly normally distributed as
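The genotype sampling under Hardy-Weinberg equilibrium described above can be sketched with the Python standard library. The function names and the optional seed argument are hypothetical.

```python
import random


def genotype_probabilities(maf):
    """Return [P(G=0), P(G=1), P(G=2)] for genotypes aa, Aa, AA under HWE."""
    return [(1.0 - maf) ** 2, 2.0 * maf * (1.0 - maf), maf ** 2]


def simulate_genotypes(n_subjects, maf, seed=None):
    """Draw additive genotype codes (number of minor alleles A) per subject."""
    rng = random.Random(seed)
    return rng.choices([0, 1, 2], weights=genotype_probabilities(maf), k=n_subjects)


genotypes = simulate_genotypes(200, maf=0.3, seed=1)
```

Under this coding the expected value of $G$ is $2p_A$, which gives a quick sanity check on simulated data.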

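$$\begin{pmatrix}\epsilon_{1i}\\ \epsilon_{2i}\end{pmatrix}\sim N\left(\begin{pmatrix}0\\ 0\end{pmatrix},\begin{pmatrix}\sigma_1^2\Sigma_{11} & \sigma_1\sigma_2\Sigma_{12}\\ \sigma_1\sigma_2\Sigma_{12} & \sigma_2^2\Sigma_{22}\end{pmatrix}\right).$$

We set the marginal variances $\sigma_1^2=\sigma_2^2=0.1$. The true correlation structures of $\Sigma_{11}$ and $\Sigma_{22}$ are both exchangeable with the structure

$$\begin{pmatrix}1 & \rho & \cdots & \rho\\ \rho & 1 & \cdots & \rho\\ \vdots & \vdots & \ddots & \vdots\\ \rho & \rho & \cdots & 1\end{pmatrix}$$

with $\rho_1=\rho_2=0.5$. For $\Sigma_{12}$ we choose

$$\begin{pmatrix}\tau & \rho_{12} & \cdots & \rho_{12}\\ \rho_{12} & \tau & \cdots & \rho_{12}\\ \vdots & \vdots & \ddots & \vdots\\ \rho_{12} & \rho_{12} & \cdots & \tau\end{pmatrix}$$

with $\rho_{12}=0.2$ as the between-subject correlation across different time points. We vary the between-subject correlation at the same time point, $\tau=\mathrm{corr}(\epsilon_{1ij},\epsilon_{2ij})$, to investigate the power gain for the joint test.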
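The stacked error covariance of this simulation design can be assembled and sampled with NumPy as in the sketch below. The function names, the seed and the choice $\tau=0.4$ in the example are assumptions for illustration.

```python
import numpy as np


def constant_offdiag(n, diag, off):
    """n x n matrix with `diag` on the diagonal and `off` elsewhere."""
    return (diag - off) * np.eye(n) + off * np.ones((n, n))


def error_covariance(n, sigma1_sq, sigma2_sq, rho1, rho2, rho12, tau):
    """Covariance of the stacked errors (eps_1i, eps_2i) for n time points."""
    s1, s2 = np.sqrt(sigma1_sq), np.sqrt(sigma2_sq)
    sigma11 = constant_offdiag(n, 1.0, rho1)     # exchangeable, trait 1
    sigma22 = constant_offdiag(n, 1.0, rho2)     # exchangeable, trait 2
    sigma12 = constant_offdiag(n, tau, rho12)    # cross-trait block
    return np.block([[s1 ** 2 * sigma11, s1 * s2 * sigma12],
                     [s1 * s2 * sigma12.T, s2 ** 2 * sigma22]])


cov = error_covariance(10, 0.1, 0.1, 0.5, 0.5, 0.2, tau=0.4)
rng = np.random.default_rng(1)
errors = rng.multivariate_normal(np.zeros(20), cov, size=200)   # one row per subject
```

Each row of `errors` holds the ten errors for trait 1 followed by the ten errors for trait 2 of one subject.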

We draw 1000 data sets with sample sizes $N=200$ and $N=500$ and $n_i=n=10$ time points, in order to compare the performance of our proposed method under different sample sizes. We set $M_1$ to be the identity matrix and $M_2$ to be 1 on the subdiagonals and 0 elsewhere, that is, the AR(1) working correlation. An important issue for the model selection is to decide whether the spline model (2) is adequate for further penalization by (8). In the following simulations, we use quartic splines with the number of knots taken to be the largest integer not greater than $0.6\times N^{1/5}$, as suggested by Tian and colleagues. 25
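As a small worked example of the knot rule, the sketch below computes the number of knots for the sample sizes used here. The equally spaced placement on the unit interval is an assumption made for illustration, since the text does not state where the knots are placed.

```python
import math


def number_of_knots(n_subjects):
    """Largest integer not greater than 0.6 * N^(1/5)."""
    return math.floor(0.6 * n_subjects ** 0.2)


def equally_spaced_knots(n_knots, lower=0.0, upper=1.0):
    """Interior knots at equal spacing (an assumed placement rule)."""
    step = (upper - lower) / (n_knots + 1)
    return [lower + step * (k + 1) for k in range(n_knots)]


knots_200 = equally_spaced_knots(number_of_knots(200))   # one knot for N = 200
knots_500 = equally_spaced_knots(number_of_knots(500))   # two knots for N = 500
```

With quartic splines and one knot, each coefficient function then has $4+1+1=6$ spline coefficients.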

3.2. Estimation performance

We use the asymptotic normality in Theorem 2 to construct the Wald type confidence interval for parameters α1 and α2. Table 1 summarizes the empirical coverage probability (CP) in percentage and the average length (AL) of the confidence intervals at 95% confidence level based on 1000 simulation replicates. As we can see from the table, the CPs are close to the nominal level 95%. When the sample size gets larger, the ALs are shorter and the CPs are closer to 95%.

TABLE 1.

Empirical coverage probability (%) and average length of confidence intervals for αl, l=1,2

| | N=200 CP | N=200 AL | N=500 CP | N=500 AL |
|---|---|---|---|---|
| α1 | 93.2 | 0.078 | 94.8 | 0.050 |
| α2 | 93.7 | 0.078 | 95.6 | 0.050 |

Next, we consider the estimation performance of the nonparametric functions βsl(t) for s=0,1, l=1,2. In Figure 1, the plots are from the case with sample size N=200 and pA=0.1 (for other situations, please refer to the Appendix). For each function, the red solid line is the true function, and the three blue dashed lines correspond to the average of the estimated functions from 1000 simulation replicates (middle) and the 95% pointwise confidence bands, with the standard error calculated from the standard deviation over the 1000 replicates. The estimation is quite accurate even with a small sample size and low MAF. As the MAF or sample size increases, the estimation performance improves (see Appendix). We also use the asymptotic normality in Theorem 2 to construct pointwise confidence intervals. Table 2 summarizes the empirical coverage probability (CP) in percentage and the average length (AL) (in parentheses) of the CIs for βsl(t) at t=0.2, 0.4, 0.6, and 0.8 for sample sizes N=500 and N=1000. The CPs are all close to the nominal level 95% and the ALs are shorter under a larger sample size.

FIGURE 1.


The estimation of nonparametric functions βsl(·) for s=0,1,l=1,2 with N = 200 and pA=0.1. In each panel, the red solid line is the true function, and the three blue dashed lines correspond to the estimated function in the middle and the 95% pointwise confidence bands

TABLE 2.

Empirical coverage probability (%) and average length of pointwise confidence intervals (in parentheses) for βsl(t) at t=0.2,0.4,0.6, and 0.8

| t | N | Intercept β01(t) | Slope β11(t) | Intercept β02(t) | Slope β12(t) |
|---|---|---|---|---|---|
| 0.2 | 500 | 92.2 (0.115) | 92.6 (0.073) | 92.9 (0.115) | 92.7 (0.073) |
| 0.2 | 1000 | 91.7 (0.082) | 93.3 (0.052) | 92.9 (0.082) | 92.9 (0.052) |
| 0.4 | 500 | 90.5 (0.109) | 91.9 (0.069) | 92.4 (0.108) | 92.8 (0.068) |
| 0.4 | 1000 | 92.3 (0.077) | 93.7 (0.049) | 94.4 (0.077) | 93.1 (0.049) |
| 0.6 | 500 | 91.0 (0.111) | 91.0 (0.070) | 92.3 (0.110) | 93.1 (0.070) |
| 0.6 | 1000 | 94.0 (0.079) | 93.8 (0.050) | 95.1 (0.079) | 94.4 (0.050) |
| 0.8 | 500 | 91.5 (0.113) | 93.1 (0.072) | 92.9 (0.113) | 94.1 (0.071) |
| 0.8 | 1000 | 91.5 (0.081) | 94.2 (0.051) | 94.8 (0.080) | 94.6 (0.051) |

3.3. Testing performance

We propose a two-step hypothesis testing procedure to detect the genetic effects on multiple traits. When the traits are correlated, the joint test is expected to have higher power than the marginal tests. We would like to evaluate how much we can gain in power when the correlation between multiple traits increases. This is done by varying the correlation coefficient τ at the same time point for the two simulated traits.

We evaluate the performance of the joint test under the null hypothesis $H_0:\beta_{11}(\cdot)=\beta_{12}(\cdot)=0$. Power is evaluated under a sequence of alternative models indexed by $\delta$, denoted by $H_1$, in which the genetic effect functions of Section 3.1 are scaled to $\delta\beta_{11}(\cdot)$ and $\delta\beta_{12}(\cdot)$. The performance of the marginal tests for the nonparametric functions corresponding to different traits is evaluated under the two null hypotheses $H_{01}:\beta_{11}(\cdot)=0$ and $H_{02}:\beta_{12}(\cdot)=0$, respectively. For each test, power is evaluated under the corresponding sequence of alternative models, denoted by $H_{al}$, in which the genetic effect function is $\delta\beta_{1l}(\cdot)$, $l=1,2$.

Figure 2 shows the power comparison between the joint test and the marginal tests under different correlation coefficients τ varying from 0.1 through 0.6. Each panel corresponds to the results with one τ value and displays the comparison of the three power curves, with the empirical size (when δ = 0) and power at different δ (>0) at the significance level 0.05 and sample size N=200. A similar pattern can be observed for the larger sample size N=500. As expected, the empirical size stays below the nominal 0.05 level (Table 3) and the power increases as the signal δ increases for every power curve. When τ is small (low correlation), we do not see much power gain of the joint test compared to the marginal tests. As τ increases (correlation between the two traits increases), we observe higher power of the joint test (starting at τ=0.4). This shows that the joint test is more powerful than the marginal tests for moderate or high correlation between traits. We also conducted simulations to evaluate the impact of the between-subject correlation across different time points on the testing power. We observed results similar to those obtained by varying the between-subject correlation at the same time point. Due to space limits, these results are provided in the supplemental file.

FIGURE 2.


The power comparison between the joint test and marginal tests under different correlation coefficient τ from 0.1 to 0.6 with sample size N=200. The exact empirical sizes are given in Table 3

MAF also plays a major role in the inference performance of an association test in general. For the proposed method, the power increases as the MAF pA increases from 0.1 to 0.5. This is in line with general expectations. In particular, there is a big power improvement as pA increases from 0.1 to 0.3, as shown in Figure 3.

FIGURE 3.


The power comparison of the joint test under different minor allele frequencies (pA=0.1,0.3,0.5) and different sample sizes (N=200,500)

TABLE 3.

Empirical size for the joint and marginal tests under different correlation coefficient τ from 0.1 to 0.6 with sample size N=200

| Test | τ=0.1 | τ=0.2 | τ=0.3 | τ=0.4 | τ=0.5 | τ=0.6 |
|---|---|---|---|---|---|---|
| Joint | 0.040 | 0.040 | 0.037 | 0.036 | 0.034 | 0.026 |
| Marginal 1 | 0.038 | 0.036 | 0.038 | 0.039 | 0.039 | 0.038 |
| Marginal 2 | 0.034 | 0.036 | 0.039 | 0.037 | 0.038 | 0.037 |

4. REAL DATA APPLICATION

We applied the proposed multivariate partially linear varying coefficients model and the two‐step hypothesis testing procedure to the Twin Study of Hormones and Behavior across the Menstrual Cycle project 26 from the Michigan State University Twin Registry (MSUTR). 27 , 28 , 29 The goal of the study was to examine associations between changes in estradiol and progesterone levels and emotional eating across the menstrual cycle. Emotional eating was measured with the Dutch Eating Behavior Questionnaire (DEBQ) and negative affect was measured with the Negative Affect scale from the Positive and Negative Affect Schedule (PANAS). The DEBQ assesses the tendency to eat in response to negative emotions while PANAS is used to measure negative emotional states like sadness and anxiety.

In this study, we wanted to examine how genes respond to the hormone change (eg, estrogen) to affect emotional eating measured by DEBQ and PANAS. Since body mass index (BMI) is an important covariate for the study, we included it in the linear component of the model. Although the original study contains twins data, we only included one of the twins in each family in the analysis to make the samples independent. Measurements for each participant were collected for 45 consecutive days, which then were grouped into eight menstrual cycle phases, that is, ovulatory phase (1), transition ovulatory to midluteal (2), midluteal phase (3), transition midluteal to premenstrual (4), premenstrual phase including the first day of menstrual cycle (5), remaining days of menstrual cycle, part of follicular phase (6), follicular phase (7) and transition follicular to ovulatory phase (8). They were grouped into these phases based on profiles of changes in estrogen and progesterone across the cycle. 30 Data that belong to the same phase were averaged to get a phase‐level measure. All individuals were aligned according to the 8 phases for further analysis.

To demonstrate the utility of the method, here we focused on a candidate gene, nuclear receptor coactivator 7 (NCoA7). This gene codes for an estrogen receptor‐associated protein which plays an important role in the cellular response to estrogen. After removing SNPs with MAF <0.05, we had 12 SNPs measured on 327 participants for further analysis.

We consider the partially linear varying coefficient model with the two longitudinal traits, namely DEBQ and PANAS, with the form

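$$\begin{aligned}y_{ij}^{D}&=\beta_0^{D}(X_{ij})+\beta_1^{D}(X_{ij})G_i+\alpha^{D}Z_{ij}+\epsilon_{ij}^{D},\\ y_{ij}^{P}&=\beta_0^{P}(X_{ij})+\beta_1^{P}(X_{ij})G_i+\alpha^{P}Z_{ij}+\epsilon_{ij}^{P}.\end{aligned}$$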

For the $i$th individual measured at menstrual cycle phase $j$, the two longitudinal traits are denoted as $y_{ij}^{D}$ and $y_{ij}^{P}$ for DEBQ and PANAS, respectively. One phase-dependent covariate, BMI, is denoted as $Z_{ij}$. $X_{ij}$ refers to the hormone estradiol level, which is standardized to range between 0 and 1 by $\Phi((E-X_E)/S_E)$, where $E$ is the original estrogen level, $X_E$ and $S_E$ are the sample mean and standard deviation of $E$, and $\Phi$ is the cumulative distribution function of a standard normal. $G_i$ represents the SNP variable, and the 12 SNPs were analyzed separately.
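The standardization of the hormone level to the unit interval can be sketched as follows using only the Python standard library. The function names and the example values are hypothetical, and the use of the sample standard deviation with denominator $n-1$ is an assumption.

```python
import math
import statistics


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def standardize_to_unit_interval(levels):
    """Map raw hormone levels E to Phi((E - mean) / sd), which lies in (0, 1)."""
    mean = statistics.fmean(levels)
    sd = statistics.stdev(levels)   # sample SD with n - 1 denominator (assumed)
    return [normal_cdf((e - mean) / sd) for e in levels]


scaled = standardize_to_unit_interval([10.0, 20.0, 30.0, 40.0, 50.0])
```

The transform is monotone, so the ordering of the original levels is preserved, and a level equal to the sample mean maps to 0.5.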

We aimed to test if an SNP is associated with the two traits with its effect modified by the estrogen hormone level, that is,

$$H_0:\beta_1^{D}(\cdot)=\beta_1^{P}(\cdot)=0\quad\text{vs.}\quad H_1:\beta_1^{D}(\cdot)\ne0\ \text{or}\ \beta_1^{P}(\cdot)\ne0.$$

We applied quadratic splines and the exchangeable working correlation structure for this real data analysis. After the Bonferroni correction for the 12 SNPs, we found that three SNPs, rs584032, rs6911452, and rs9401855, are significant with the joint test. Table 4 lists the results. The joint test results are all more significant than the marginal tests. The between-trait correlations between DEBQ and PANAS at the same cycle phase are shown in Figure 4; they are quite strong, ranging from 0.36 to 0.61 across phases. This explains why the joint test shows stronger significance than the marginal tests. Figure 4 also shows the detailed correlation information about the three components: within-trait correlation, between-trait correlation at the same time points, and between-trait correlation across different time points.

TABLE 4.

The test results of the 3 significant SNPs with their rs numbers, the alleles (minor allele shows with bold font), the MAF, and the P‐values for the joint test (denoted as Pjoint) and the two marginal tests (denoted as PDEBQ and PPANAS)

| SNP | Alleles | MAF | Pjoint | PDEBQ | PPANAS |
|---|---|---|---|---|---|
| rs584032 | T/A | 0.176 | 2.390e-4 | 3.472e-2 | 1.937e-3 |
| rs6911452 | A/G | 0.089 | 8.225e-4 | 3.491e-3 | 4.627e-3 |
| rs9401855 | A/G | 0.129 | 3.758e-4 | 5.449e-2 | 1.108e-3 |
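As a check on the Bonferroni correction for the 12 SNPs, the sketch below compares the P-values in Table 4 with the corrected threshold. The family-wise level of 0.05 is an assumption carried over from the simulation section, and the variable names are hypothetical.

```python
P_VALUES = {
    "rs584032": {"joint": 2.390e-4, "DEBQ": 3.472e-2, "PANAS": 1.937e-3},
    "rs6911452": {"joint": 8.225e-4, "DEBQ": 3.491e-3, "PANAS": 4.627e-3},
    "rs9401855": {"joint": 3.758e-4, "DEBQ": 5.449e-2, "PANAS": 1.108e-3},
}


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


threshold = bonferroni_threshold(0.05, 12)   # assumed family-wise level 0.05
joint_significant = {snp: p["joint"] < threshold for snp, p in P_VALUES.items()}
joint_smallest = {snp: p["joint"] < min(p["DEBQ"], p["PANAS"])
                  for snp, p in P_VALUES.items()}
```

Under this assumed level, all three joint P-values fall below the threshold of about 0.0042, and each joint P-value is smaller than both marginal P-values for the same SNP.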

FIGURE 4.


The correlation information including within‐trait correlation, between‐trait correlation at the same and across different cycle phases. The x‐axis and y‐axis represent the 8 cycle phases for the two variables DEBQ and PANAS, respectively

Figure 5 shows the estimated nonparametric coefficient functions for both responses DEBQ and PANAS, with SNP rs9401855 as an example. The point‐wise 95% confidence bands cover a large part of the zero line for the DEBQ, which is consistent with its P‐value .0545 from the marginal test. And for PANAS, in the central region, the zero line is outside of the 95% point‐wise confidence bands, which is also consistent with the marginal P‐value .0011. The result shows that this SNP interacts with estrogen hormone and only affects PANAS, but not DEBQ. The negative coefficients show that estrogen hormone negatively impacts PANAS. Individuals carrying the GG genotype are more likely to experience negative affective states such as sadness and anxiety, compared to those carrying one or no G allele. From the slightly quadratic effect curve, it can be seen that the negative impact peaks around phase 5‐6, that is, the Premenstrual phase including the first day of menstrual cycle (5) to Remaining days of menstrual cycle, part of follicular phase (6), while less negative impact is observed at the beginning and the end of the eight phased cycle (ie, during the ovulatory phase).

FIGURE 5.

The estimated intercept and slope functions for DEBQ and PANAS from the joint model (red solid curve) and their point‐wise 95% confidence bands (dashed curve)

5. DISCUSSION

Joint analysis of multiple correlated traits can potentially improve the power to identify genetic variants associated with complex traits. However, association analysis focusing on multiple longitudinal traits has not been well studied, and methods for G×E interaction with multiple traits under a longitudinal design are even rarer. In this article, we proposed a joint multivariate varying coefficient modeling approach to accommodate correlated longitudinal traits, together with a testing procedure to identify genetic variants that are associated with multiple longitudinal traits and whose effects are modified by environmental factors. By modeling the environmental effect with a nonparametric function, one can estimate how the effect of G on Y changes over the values of X. The nonparametric function is flexible in the sense that it is determined by the data without assuming a parametric structure. Both the simulation studies and the real data analysis demonstrate the utility of the proposed method.

One difficulty in jointly modeling multiple longitudinal traits is modeling the complex correlation structure. For each subject, we should consider three kinds of correlation: between measurements of the same trait at different time points, between measurements of different traits at the same time point, and between measurements of different traits at different time points. We applied the QIF approach in the estimation and testing procedures. The QIF approach has several advantages. First, it only requires correct specification of the mean structure and does not require any joint likelihood in hypothesis testing. Second, it avoids estimating the nuisance correlation structure parameters by assuming that the inverse of the working correlation matrix can be approximated by a linear combination of several known basis matrices. Third, when the working correlation structure is misspecified, the QIF is more efficient than the GEE approach. Fourth, the inference function of the QIF approach has an explicit asymptotic form, which provides a model selection criterion and allows us to test whether coefficients are significant or time varying based on the asymptotic results. It is worth mentioning that missing completely at random (MCAR) is assumed in this work, which is a common assumption under the QIF framework when dealing with missing data. 31
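
To make the second advantage concrete, the following is a minimal Python sketch of a QIF fit with an exchangeable working correlation, whose inverse is represented through the two basis matrices M0 = I and M1 (zeros on the diagonal, ones elsewhere). It is a deliberately simplified illustration and not the implementation used in this article: it assumes a single trait, a linear mean without spline terms or penalty, balanced clusters, and simulated data. All function names, the Gauss‐Newton update, and the simulation settings are assumptions introduced here.

```python
import numpy as np

def extended_scores(beta, X, y):
    """Per-subject extended scores g_i(beta), shape (N, 2p).

    X has shape (N, n, p) and y has shape (N, n). The two blocks use the
    basis matrices M0 = I and M1 (0 on the diagonal, 1 off the diagonal).
    """
    n = X.shape[1]
    M0 = np.eye(n)
    M1 = np.ones((n, n)) - np.eye(n)
    resid = y - X @ beta                                   # (N, n)
    g0 = np.einsum("inp,nm,im->ip", X, M0, resid)          # X_i^T M0 r_i
    g1 = np.einsum("inp,nm,im->ip", X, M1, resid)          # X_i^T M1 r_i
    return np.hstack([g0, g1])

def qif(beta, X, y):
    """Quadratic inference function Q_N = N * g_N^T C_N^{-1} g_N."""
    g = extended_scores(beta, X, y)
    N = g.shape[0]
    g_bar = g.mean(axis=0)
    C = g.T @ g / N
    return N * g_bar @ np.linalg.solve(C, g_bar)

def fit_qif(X, y, n_iter=20):
    """Gauss-Newton iterations; no correlation parameter is ever estimated."""
    N, n, p = X.shape
    M1 = np.ones((n, n)) - np.eye(n)
    # Derivative of g_N with respect to beta (constant for a linear mean).
    G = -np.vstack([
        np.einsum("inp,inq->pq", X, X),
        np.einsum("inp,nm,imq->pq", X, M1, X),
    ]) / N                                                 # (2p, p)
    beta = np.zeros(p)
    for _ in range(n_iter):
        g = extended_scores(beta, X, y)
        g_bar = g.mean(axis=0)
        C = g.T @ g / N
        CinvG = np.linalg.solve(C, G)
        beta = beta - np.linalg.solve(G.T @ CinvG, CinvG.T @ g_bar)
    return beta

# Simulated example (hypothetical settings): a shared random intercept
# induces an exchangeable within-subject correlation of 0.5.
rng = np.random.default_rng(0)
N, n, p = 300, 4, 2
beta_true = np.array([1.0, -0.5])
X = rng.normal(size=(N, n, p))
y = X @ beta_true + rng.normal(size=(N, 1)) + rng.normal(size=(N, n))

beta_hat = fit_qif(X, y)
print("estimate:", beta_hat, "QIF at estimate:", qif(beta_hat, X, y))
```

The point of the sketch is that the fit uses only the extended score and its empirical covariance; the correlation parameter of the exchangeable structure never appears.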

In the real application, we investigated the association of SNPs in a candidate gene with two longitudinal traits, DEBQ and PANAS. Although the data were regrouped into eight phases, they still carry the temporal information and can be treated like longitudinal data. The results show that three SNPs passed the Bonferroni threshold with the joint test and that the P‐values of the joint test are smaller than those of the individual marginal tests. This shows the relative advantage of the joint test. As shown in the simulation study, the joint test can achieve power gain when the traits are correlated. Therefore, it is essential to assess the correlations between traits when fitting multiple traits jointly and conducting joint testing.

Our method was demonstrated with two traits. The method can be extended to multiple longitudinal traits with L>2, although the computational cost might increase. In addition, our method is not restricted to longitudinal studies. It also applies to other studies in which multiple traits are measured over a linear scale. For example, in a pharmacogenetic study, multiple drug responses (eg, blood pressure and heart rate) can be measured over different dosages of a drug treatment. The proposed model can be fitted to assess how genes respond to increasing dosage levels to affect the drug responses. As another example, in a brain imaging genetic study, brain activities in different brain regions can be measured over a spatial scale and treated as multiple traits. One can fit the proposed model to understand how genes affect brain activities over a spatial scale.

Supporting information

Data S1: Supporting Information

ACKNOWLEDGEMENTS

The authors wish to thank the anonymous reviewers for their insightful comments that greatly improved the presentation of the manuscript. Research reported in this publication was supported by the National Human Genome Research Institute of the National Institutes of Health (NIH) under award number R21HG010073, by the National Institute of General Medical Sciences of the NIH under award number R01GM131398 and by the National Institute of Mental Health of the NIH under award number R01MH082054. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

PROOFS OF THEOREMS

To establish the asymptotic properties for the estimator of θ, we need the following regularity conditions.

  • (A1)

    $\{n_i\}$ is a bounded sequence of integers.

  • (A2)

    The parameter space $\Omega_\theta$ is compact and $\theta_0$ is an interior point of $\Omega_\theta$.

  • (A3)

    The parameter $\theta$ is identified, that is, there is a unique $\theta_0 \in \Omega_\theta$ such that the first moment assumption $E[g_i(\theta_0)] = 0$ holds for $i = 1, \ldots, N$, and $E[g_i(\theta)]$ is continuous.

  • (A4)

    $E[g(\theta)]$ is continuous in $\theta$.

  • (A5)

    $C_N(\hat\theta) = \frac{1}{N}\sum_{i=1}^{N} g_i(\hat\theta)\, g_i(\hat\theta)^{T}$ converges almost surely to $C_0$, which is a constant and invertible matrix.

  • (A6)

    The first derivative of $g_N$ exists and is continuous. $\frac{\partial g_N}{\partial\theta}(\hat\theta)$ converges in probability to $G_0$ if $\hat\theta$ converges in probability to $\theta_0$.

Proof of Theorem 1

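$\hat\theta$ exists because (7) has zero as a lower bound and the global minimum exists. To prove consistency, note first that the estimator $\hat\theta$ is obtained by minimizing (7), so we have

$$\frac{1}{N}Q_N(\hat\theta) + \lambda_N \hat\theta^{T} D \hat\theta \le \frac{1}{N}Q_N(\theta_0) + \lambda_N \theta_0^{T} D \theta_0. \tag{A1}$$

Since

$$\frac{1}{N}Q_N(\theta_0) = g_N^{T}(\theta_0)\, C_N^{-1}(\theta_0)\, g_N(\theta_0) = o(1),$$

by the strong law of large numbers and condition (A5), and $\lambda_N = o(1)$,

$$\frac{1}{N}Q_N(\theta_0) + \lambda_N \theta_0^{T} D \theta_0 \xrightarrow{a.s.} 0.$$

Thus, we obtain from inequality (A1) that

$$\frac{1}{N}Q_N(\hat\theta) = g_N^{T}(\hat\theta)\, C_N^{-1}(\hat\theta)\, g_N(\hat\theta) \xrightarrow{a.s.} 0. \tag{A2}$$

Since the parameter space $\Omega_\theta$ is compact, by the Glivenko‐Cantelli theorem,

$$\sup_{\theta \in \Omega_\theta} \left\| g_N(\theta) - E[g(\theta)] \right\| \xrightarrow{a.s.} 0.$$

Hence, by condition (A5) and the continuous mapping theorem,

$$g_N^{T}(\hat\theta)\, C_N^{-1}(\hat\theta)\, g_N(\hat\theta) - E[g(\hat\theta)]^{T} C_0^{-1} E[g(\hat\theta)] \xrightarrow{a.s.} 0.$$

Combined with equation (A2), we get

$$E[g(\hat\theta)]^{T} C_0^{-1} E[g(\hat\theta)] \xrightarrow{a.s.} 0. \tag{A3}$$

Suppose $\hat\theta$ is not a strongly consistent estimator of $\theta_0$. Then there exists a neighborhood of the true parameter $\theta_0$, say $U$, such that $\hat\theta \in U^{c}$. Since $E[g(\theta)]^{T} C_0^{-1} E[g(\theta)]$ is a continuous function and $U^{c}$ is compact, there exists a point $\theta^{*} \in U^{c}$ at which

$$E[g(\theta)]^{T} C_0^{-1} E[g(\theta)]$$

achieves its minimum over $U^{c}$. By the identification of $\theta$ in condition (A3), there is a unique $\theta_0 \in \Omega_\theta$ satisfying $E[g(\theta_0)] = 0$, and we have

$$E[g(\theta^{*})]^{T} C_0^{-1} E[g(\theta^{*})] > 0,$$

which contradicts equation (A3). Hence, $\hat\theta$ is a consistent estimator of $\theta_0$.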

Proof of Theorem 2

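The estimate of $\theta$ satisfies

$$\frac{1}{N}\frac{\partial Q_N}{\partial\theta}(\hat\theta) + 2\lambda_N D \hat\theta = 0.$$

By Taylor's expansion, we obtain

$$\frac{1}{N}\frac{\partial Q_N}{\partial\theta}(\theta_0) + 2\lambda_N D \theta_0 + \left[\frac{1}{N}\frac{\partial^{2} Q_N}{\partial\theta^{2}}(\tilde\theta) + 2\lambda_N D\right](\hat\theta - \theta_0) = 0,$$

where $\tilde\theta$ is some value between $\hat\theta$ and $\theta_0$. Thus, we have

$$\hat\theta - \theta_0 = -\left[\frac{1}{N}\frac{\partial^{2} Q_N}{\partial\theta^{2}}(\tilde\theta) + 2\lambda_N D\right]^{-1}\left[\frac{1}{N}\frac{\partial Q_N}{\partial\theta}(\theta_0) + 2\lambda_N D \theta_0\right]. \tag{A4}$$

Since $\hat\theta$ converges to $\theta_0$ in probability and $\tilde\theta$ is between $\hat\theta$ and $\theta_0$, by conditions (A5) and (A6) we get

$$\frac{1}{N}\frac{\partial^{2} Q_N}{\partial\theta^{2}}(\tilde\theta) = 2\left[\frac{\partial g_N}{\partial\theta}(\tilde\theta)\right]^{T} C_N^{-1}(\tilde\theta)\, \frac{\partial g_N}{\partial\theta}(\tilde\theta) + o_p(1) \xrightarrow{p} 2\, G_0^{T} C_0^{-1} G_0.$$

When $\lambda_N = o(N^{-1/2})$,

$$\left[\frac{1}{N}\frac{\partial^{2} Q_N}{\partial\theta^{2}}(\tilde\theta) + 2\lambda_N D\right]^{-1} = \frac{1}{2}\left(G_0^{T} C_0^{-1} G_0\right)^{-1} + o_p(N^{-1/2}).$$

Similarly, since

$$\frac{1}{N}\frac{\partial Q_N}{\partial\theta}(\theta_0) = 2\left[\frac{\partial g_N}{\partial\theta}(\theta_0)\right]^{T} C_N^{-1}(\theta_0)\, g_N(\theta_0) + o_p(N^{-1/2}),$$

and $\lambda_N = o(N^{-1/2})$, we have

$$\frac{1}{N}\frac{\partial Q_N}{\partial\theta}(\theta_0) + 2\lambda_N D \theta_0 = 2\, G_0^{T} C_0^{-1} g_N(\theta_0) + o_p(N^{-1/2}).$$

The factors $\frac{1}{2}$ and $2$ cancel, so equation (A4) can be written as

$$\sqrt{N}(\hat\theta - \theta_0) = -\sqrt{N}\left(G_0^{T} C_0^{-1} G_0\right)^{-1} G_0^{T} C_0^{-1} g_N(\theta_0) + o_p(1). \tag{A5}$$

By the Central Limit Theorem,

$$\sqrt{N}\, g_N(\theta_0) \xrightarrow{d} N(0, C_0). \tag{A6}$$

Using equations (A5) and (A6), we obtain

$$\sqrt{N}(\hat\theta - \theta_0) \xrightarrow{d} N\left(0, \left(G_0^{T} C_0^{-1} G_0\right)^{-1}\right).$$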
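
The limiting covariance in Theorem 2 suggests a plug‐in variance estimate for the parameter estimates. The sketch below is a hedged illustration only: it assumes that estimates of $G_0$ and $C_0$ (called `G_hat` and `C_hat` here) are already available, and the function name and the toy matrices are hypothetical rather than taken from the article or its R code.

```python
import numpy as np

def qif_asymptotic_cov(G_hat, C_hat, N):
    """Plug-in covariance of theta_hat: (G^T C^{-1} G)^{-1} / N.

    G_hat: (q, p) estimate of the derivative of the extended score.
    C_hat: (q, q) estimate of the covariance of the extended score.
    N:     number of subjects.
    """
    G_hat = np.asarray(G_hat, dtype=float)
    C_hat = np.asarray(C_hat, dtype=float)
    info = G_hat.T @ np.linalg.solve(C_hat, G_hat)   # G^T C^{-1} G
    return np.linalg.inv(info) / N

# Toy inputs (hypothetical numbers): 4 moment conditions, 2 parameters.
G_toy = np.array([[1.0, 0.2],
                  [0.0, 1.5],
                  [0.5, 0.0],
                  [0.3, 0.7]])
C_toy = np.diag([1.0, 2.0, 1.5, 0.5])

cov_toy = qif_asymptotic_cov(G_toy, C_toy, N=200)
se_toy = np.sqrt(np.diag(cov_toy))                   # standard errors
print(se_toy)
```

Square roots of the diagonal entries give approximate standard errors, which is the form used for point‐wise confidence bands and Wald‐type statements.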

Proof of Theorem 3

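By Taylor's expansion,

$$Q(\psi_0,\zeta_0) - Q(\hat\psi,\hat\zeta) = \begin{pmatrix}\psi_0-\hat\psi\\ \zeta_0-\hat\zeta\end{pmatrix}^{T}\dot Q(\hat\psi,\hat\zeta) + \frac{1}{2}\begin{pmatrix}\psi_0-\hat\psi\\ \zeta_0-\hat\zeta\end{pmatrix}^{T}\ddot Q(\psi^{*},\zeta^{*})\begin{pmatrix}\psi_0-\hat\psi\\ \zeta_0-\hat\zeta\end{pmatrix},$$

where $(\psi^{*},\zeta^{*})$ is some value between $(\psi_0,\zeta_0)$ and $(\hat\psi,\hat\zeta)$. We can also obtain from Taylor's expansion that

$$Q(\psi_0,\zeta_0) - Q(\psi_0,\tilde\zeta) = (\zeta_0-\tilde\zeta)^{T}\dot Q_{\zeta}(\psi_0,\tilde\zeta) + \frac{1}{2}(\zeta_0-\tilde\zeta)^{T}\ddot Q_{\zeta\zeta}(\psi_0,\zeta^{**})(\zeta_0-\tilde\zeta),$$

where $\zeta^{**}$ is between $\zeta_0$ and $\tilde\zeta$. From the conditions in (9), we have

$$\dot Q(\hat\psi,\hat\zeta) = 0 \quad\text{and}\quad \dot Q_{\zeta}(\psi_0,\tilde\zeta) = 0.$$

Hence

$$Q(\psi_0,\tilde\zeta) - Q(\hat\psi,\hat\zeta) = \frac{1}{2}\begin{pmatrix}\hat\psi-\psi_0\\ \hat\zeta-\zeta_0\end{pmatrix}^{T}\ddot Q(\psi^{*},\zeta^{*})\begin{pmatrix}\hat\psi-\psi_0\\ \hat\zeta-\zeta_0\end{pmatrix} - \frac{1}{2}\begin{pmatrix}0\\ \tilde\zeta-\zeta_0\end{pmatrix}^{T}\ddot Q(\psi_0,\zeta^{**})\begin{pmatrix}0\\ \tilde\zeta-\zeta_0\end{pmatrix}.$$

If we expand $\dot Q_{\zeta}(\psi_0,\tilde\zeta)$ about $\zeta_0$, and $\dot Q_{\zeta}(\hat\psi,\hat\zeta)$ about $(\psi_0,\zeta_0)$, we obtain

$$0 = \dot Q_{\zeta}(\psi_0,\tilde\zeta) = \dot Q_{\zeta}(\psi_0,\zeta_0) + \ddot Q_{\zeta\zeta}(\tilde\zeta-\zeta_0) + O_p(N^{-1/2}),$$
$$0 = \dot Q_{\zeta}(\hat\psi,\hat\zeta) = \dot Q_{\zeta}(\psi_0,\zeta_0) + \ddot Q_{\zeta\psi}(\hat\psi-\psi_0) + \ddot Q_{\zeta\zeta}(\hat\zeta-\zeta_0) + O_p(N^{-1/2}).$$

The above two equations give us

$$\tilde\zeta-\zeta_0 = \ddot Q_{\zeta\zeta}^{-1}\ddot Q_{\zeta\psi}(\hat\psi-\psi_0) + (\hat\zeta-\zeta_0) + O_p(N^{-1/2}),$$

which can be written as

$$\begin{pmatrix}0\\ \tilde\zeta-\zeta_0\end{pmatrix} = \begin{pmatrix}0 & 0\\ \ddot Q_{\zeta\zeta}^{-1}\ddot Q_{\zeta\psi} & I\end{pmatrix}\begin{pmatrix}\hat\psi-\psi_0\\ \hat\zeta-\zeta_0\end{pmatrix} + O_p(N^{-1/2}).$$

Then $Q(\psi_0,\tilde\zeta) - Q(\hat\psi,\hat\zeta)$ can be written as

$$\frac{1}{2}\begin{pmatrix}\hat\psi-\psi_0\\ \hat\zeta-\zeta_0\end{pmatrix}^{T}\left\{\ddot Q(\psi^{*},\zeta^{*}) - \begin{pmatrix}0 & 0\\ \ddot Q_{\zeta\zeta}^{-1}\ddot Q_{\zeta\psi} & I\end{pmatrix}^{T}\ddot Q(\psi_0,\zeta^{**})\begin{pmatrix}0 & 0\\ \ddot Q_{\zeta\zeta}^{-1}\ddot Q_{\zeta\psi} & I\end{pmatrix}\right\}\begin{pmatrix}\hat\psi-\psi_0\\ \hat\zeta-\zeta_0\end{pmatrix} + O_p(N^{-1/2}),$$

which is asymptotically equivalent to

$$\begin{pmatrix}\hat\psi-\psi_0\\ \hat\zeta-\zeta_0\end{pmatrix}^{T}\left\{\begin{pmatrix}J_{\psi\psi} & J_{\psi\zeta}\\ J_{\zeta\psi} & J_{\zeta\zeta}\end{pmatrix} - \begin{pmatrix}0 & 0\\ J_{\zeta\zeta}^{-1}J_{\zeta\psi} & I\end{pmatrix}^{T}\begin{pmatrix}J_{\psi\psi} & J_{\psi\zeta}\\ J_{\zeta\psi} & J_{\zeta\zeta}\end{pmatrix}\begin{pmatrix}0 & 0\\ J_{\zeta\zeta}^{-1}J_{\zeta\psi} & I\end{pmatrix}\right\}\begin{pmatrix}\hat\psi-\psi_0\\ \hat\zeta-\zeta_0\end{pmatrix} = (\hat\psi-\psi_0)^{T}\left(J_{\psi\psi}-J_{\psi\zeta}J_{\zeta\zeta}^{-1}J_{\zeta\psi}\right)(\hat\psi-\psi_0).$$

By Theorem 3.2 in Hansen, 21

$$\begin{pmatrix}\hat\psi-\psi_0\\ \hat\zeta-\zeta_0\end{pmatrix} \xrightarrow{d} N_{d}\left(\begin{pmatrix}0\\ 0\end{pmatrix}, \begin{pmatrix}J_{\psi\psi} & J_{\psi\zeta}\\ J_{\zeta\psi} & J_{\zeta\zeta}\end{pmatrix}^{-1}\right).$$

Therefore,

$$\hat\psi-\psi_0 \xrightarrow{d} N_{d_1}\left(0, \left(J_{\psi\psi}-J_{\psi\zeta}J_{\zeta\zeta}^{-1}J_{\zeta\psi}\right)^{-1}\right),$$

and thus $Q(\psi_0,\tilde\zeta) - Q(\hat\psi,\hat\zeta)$ asymptotically follows a $\chi^{2}_{d_1}$ distribution.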
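
The algebraic step in this proof, where the quadratic form in the full parameter vector collapses to one in $\hat\psi-\psi_0$ with the Schur complement of $J_{\zeta\zeta}$, can be checked numerically, and the resulting statistic can be referred to a chi‐square distribution with $d_1$ degrees of freedom. The sketch below is illustrative only: the random matrix `J`, the function names, and the closed‐form tail routine (valid for integer degrees of freedom) are assumptions introduced here, not part of the article.

```python
import math
import numpy as np

def profile_quadratic_forms(J, a, b):
    """Return both sides of the identity used in the proof.

    J is a symmetric positive definite (d x d) matrix partitioned with the
    first d1 = len(a) coordinates for psi and the rest for zeta.
    a plays the role of psi_hat - psi_0, b of zeta_hat - zeta_0.
    """
    d1, d = len(a), J.shape[0]
    J_pp, J_pz = J[:d1, :d1], J[:d1, d1:]
    J_zp, J_zz = J[d1:, :d1], J[d1:, d1:]
    M = np.zeros((d, d))
    M[d1:, :d1] = np.linalg.solve(J_zz, J_zp)      # J_zz^{-1} J_zp
    M[d1:, d1:] = np.eye(d - d1)
    v = np.concatenate([a, b])
    lhs = v @ (J - M.T @ J @ M) @ v
    rhs = a @ (J_pp - J_pz @ np.linalg.solve(J_zz, J_zp)) @ a
    return lhs, rhs

def chi2_sf(x, df):
    """P(chi2_df > x) for integer df >= 1 via the incomplete-gamma recurrence
    Q(s + 1, z) = Q(s, z) + z**s * exp(-z) / Gamma(s + 1), with z = x / 2."""
    z = x / 2.0
    if df % 2 == 0:
        s, q = 1.0, math.exp(-z)                   # Q(1, z)
    else:
        s, q = 0.5, math.erfc(math.sqrt(z))        # Q(1/2, z)
    while s < df / 2.0 - 1e-9:
        q += z**s * math.exp(-z) / math.gamma(s + 1.0)
        s += 1.0
    return q

# Hypothetical example: d = 5 parameters, d1 = 2 of them under test.
rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
J = A @ A.T + np.eye(5)                            # symmetric positive definite
a, b = rng.normal(size=2), rng.normal(size=3)

lhs, rhs = profile_quadratic_forms(J, a, b)
print(lhs, rhs, chi2_sf(rhs, df=len(a)))
```

The two printed quadratic forms agree up to rounding, which is the identity behind the $\chi^{2}_{d_1}$ limit; the last value is the corresponding upper‐tail probability for this toy statistic.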

Wang H, Zhang J, Klump KL, Alexandra Burt S, Cui Y. Multivariate partial linear varying coefficients model for gene‐environment interactions with multiple longitudinal traits. Statistics in Medicine. 2022;41(19):3643–3660. doi: 10.1002/sim.9440

 

Honglang Wang and Jingyi Zhang contributed equally to this work.

Funding information National Institutes of Health, Grant/Award Numbers: R01GM131398; R01MH082054; R21HG010073

DATA AVAILABILITY STATEMENT

Research data used for real data analysis are from a different group and are not shared. R code used to implement the method can be downloaded at https://github.com/Honglang/MPLVC.

REFERENCES

  • 1. Sitlani CM, Rice KM, Lumley T, et al. Generalized estimating equations for genome‐wide association studies using longitudinal phenotype data. Stat Med. 2015;34(1):118‐130.
  • 2. Macgregor S, Knott SA, White I, Visscher PM. Quantitative trait locus analysis of longitudinal quantitative trait data in complex pedigrees. Genetics. 2005;171(3):1365‐1376.
  • 3. Furlotte NA, Eskin E, Eyheramendy S. Genome‐wide association mapping with longitudinal data. Genet Epidemiol. 2012;36(5):463‐471.
  • 4. Xu Z, Shen X, Pan W, Alzheimer's Disease Neuroimaging Initiative. Longitudinal analysis is more powerful than cross‐sectional analysis in detecting genetic association with neuroimaging phenotypes. PLoS One. 2014;9(8):e102312.
  • 5. Wang W, Feng Z, Bull SB, Wang Z. A 2‐step strategy for detecting pleiotropic effects on multiple longitudinal traits. Front Genet. 2014;5:357.
  • 6. Gratten J, Visscher PM. Genetic pleiotropy in complex traits and diseases: implications for genomic medicine. Genome Med. 2016;8(1):78.
  • 7. Lobo I. Pleiotropy: one gene can affect multiple traits; 2008.
  • 8. Ma S, Yang L, Romero R, Cui Y. Varying coefficient model for gene–environment interaction: a non‐linear look. Bioinformatics. 2011;27(15):2119‐2126.
  • 9. Liu X, Cui Y, Li R. Partial linear varying multi‐index coefficient model for integrative gene‐environment interactions. Stat Sin. 2016;26:1037.
  • 10. Qu A, Li R. Quadratic inference functions for varying‐coefficient models with longitudinal data. Biometrics. 2006;62(2):379‐391.
  • 11. Rochon J. Analyzing bivariate repeated measures for discrete and continuous outcome variables. Biometrics. 1996;52(2):740.
  • 12. Cho H. The analysis of multivariate longitudinal data using multivariate marginal models. J Multivar Anal. 2016;143:481‐491.
  • 13. Fieuws S, Verbeke G. Joint modelling of multivariate longitudinal profiles: pitfalls of the random‐effects approach. Stat Med. 2004;23(20):3093‐3104.
  • 14. Proudfoot J, Faig W, Natarajan L, Xu R. A joint marginal‐conditional model for multivariate longitudinal data. Stat Med. 2018;37(5):813‐828.
  • 15. Zhao L, Chen T, Novitsky V, Wang R. Joint penalized spline modeling of multivariate longitudinal data, with application to HIV‐1 RNA load levels and CD4 cell counts. Biometrics. 2021;77(3):1061‐1074.
  • 16. Hector EC, Song PXK. Joint integrative analysis of multiple data sources with correlated vector outcomes; 2020. arXiv preprint arXiv:2011.14996.
  • 17. Bandyopadhyay S, Ganguli B, Chatterjee A. A review of multivariate longitudinal data analysis. Stat Methods Med Res. 2011;20(4):299‐330.
  • 18. Verbeke G, Fieuws S, Molenberghs G, Davidian M. The analysis of multivariate longitudinal data: a review. Stat Methods Med Res. 2014;23(1):42‐59.
  • 19. Ruppert D, Carroll RJ. Theory & methods: spatially‐adaptive penalties for spline fitting. Aust N Z J Stat. 2000;42(2):205‐223.
  • 20. Wu C, Cui Y, Ma S. Integrative analysis of gene–environment interactions under a multi‐response partially linear varying coefficient model. Stat Med. 2014;33(28):4988‐4998.
  • 21. Hansen LP. Large sample properties of generalized method of moments estimators. Econometrica. 1982;50(4):1029‐1054.
  • 22. Ruppert D. Selecting the number of knots for penalized splines. J Comput Graph Stat. 2002;11(4):735‐757.
  • 23. Bai Y, Fung WK, Zhu ZY. Penalized quadratic inference functions for single‐index models with longitudinal data. J Multivar Anal. 2009;100(1):152‐161.
  • 24. Qu A, Lindsay BG, Li B. Improving generalised estimating equations using quadratic inference functions. Biometrika. 2000;87(4):823‐836.
  • 25. Tian R, Xue L, Liu C. Penalized quadratic inference functions for semiparametric varying coefficient partially linear models with longitudinal data. J Multivar Anal. 2014;132:94‐110.
  • 26. Klump KL, Keel PK, Racine SE, et al. The interactive effects of estrogen and progesterone on changes in emotional eating across the menstrual cycle. J Abnorm Psychol. 2013;122(1):131.
  • 27. Klump KL, Burt SA. The Michigan State University Twin Registry (MSUTR): genetic, environmental and neurobiological influences on behavior across development. Twin Res Hum Genet. 2006;9(6):971‐977.
  • 28. Burt SA, Klump KL. The Michigan State University Twin Registry (MSUTR): an update. Twin Res Hum Genet. 2013;16(1):344‐350.
  • 29. Burt SA, Klump KL. The Michigan State University Twin Registry (MSUTR): 15 years of twin and family research. Twin Res Human Genet Offic J Int Soc Twin Stud. 2019;22(6):741.
  • 30. Klump KL, Racine SE, Hildebrandt B, et al. Ovarian hormone influences on dysregulated eating: a comparison of associations between women with versus without binge episodes. Clin Psychol Sci. 2014;2(5):545‐559.
  • 31. Song PXK, Jiang Z, Park E, Qu A. Quadratic inference functions in marginal models for longitudinal data. Stat Med. 2009;28(29):3683‐3696.


Articles from Statistics in Medicine are provided here courtesy of Wiley
