Abstract
Correlated phenotypes often share common genetic determinants. Thus, a multi‐trait analysis can potentially increase association power and help in understanding pleiotropic effects. When multiple traits are jointly measured over time, the correlation information between multivariate longitudinal responses can help to gain power in association analysis, and the longitudinal traits can provide insights into the dynamic gene effect over time. In this work, we propose a multivariate partially linear varying coefficients model to identify genetic variants whose effects are potentially modified by environmental factors. We derive a testing framework to jointly test the association of genetic factors, illustrated with a bivariate phenotypic trait, while taking the time‐varying genetic effects into account. We extend the quadratic inference functions to deal with the longitudinal correlations and use penalized splines to approximate the nonparametric coefficient functions. Theoretical results such as consistency and asymptotic normality of the estimates are established. The performance of the testing procedure is evaluated through Monte Carlo simulation studies. The utility of the method is demonstrated with a real data set from the Twin Study of Hormones and Behavior across the menstrual cycle project, in which single nucleotide polymorphisms associated with emotional eating behavior are identified.
Keywords: gene‐environment interaction, longitudinal traits, multi‐trait analysis, partial linear model, quadratic inference function
1. INTRODUCTION
Cross‐sectional disease traits have been the primary focus in genetic association studies. Given the improved power to identify disease genes with phenotypic data measured over time, longitudinal designs are becoming popular in genetic association studies. 1 , 2 , 3 , 4 Most statistical methods developed so far focus on a single outcome of interest. When multiple outcomes are measured over time, for example, multiple measures of heart function in a longitudinal study of cardiac function, methods focusing on just a single outcome over time may not provide a complete picture of cardiac function.
In genetics, the phenomenon that a single gene or locus influences more than one trait is known as pleiotropy. 5 , 6 Genetic pleiotropy plays a crucial role in many complex diseases. One of the most well‐known examples is the phenylketonuria (PKU) disease. 7 The conventional approach to identify genetic pleiotropic effects on multiple traits is to test the association between a gene and each trait individually and then determine whether the genetic effect is significantly associated with more than one trait. The disadvantages of this approach, such as the inflation in the family wise Type I error and incomplete information in individual tests compared to a combined analysis for multiple traits, have been discussed in some studies. 5 Therefore, a joint genetic association test on multiple traits is more desirable to control the family wise Type I error and enhance the power of tests.
In real life, timing is a very important factor in the development of a disease. Genetic effects on a disease trait vary during the life span of an individual. The function of a gene depends largely on when it turns on and off, which could show a temporal pattern. In order to capture the dynamic effect of a gene on a disease trait over time, it is natural to model the dynamic effect as a potentially nonlinear function of time. Considering multiple longitudinal traits, we propose the following partially linear varying coefficients model,
| (1) |
where is the response variable which measures the th phenotype on the th subject at the th time point; is a ‐dimensional vector of covariates, which can be either time dependent or independent; denotes the time‐invariant genetic variable within subject; and are unknown functions; and the stacked error vector with is assumed to have mean zero and covariance . Models for multivariate longitudinal traits are necessarily complex, because they must consider different types of correlations for each independent subject: correlation between measurements for the same trait at different time points, correlation between measurements at the same time point on different traits, and correlation between measurements at different time points and on different traits. With the stacked error vector , its covariance matrix carries all of these correlations.
If we use a time‐varying environmental factor instead of in the model, that is,
then the model can be used for jointly modeling nonlinear gene‐environment (GE) interactions for multiple longitudinal traits. In the model, one can assess the influence of on to affect multiple responses . Models for nonlinear GE interactions have been studied. 8 , 9
Qu and Li applied the method of quadratic inference functions (QIF) to the varying coefficients models for longitudinal data. 10 One important advantage is that the QIF method only requires correct specification of the mean structure and does not require any likelihood or approximation of the likelihood in hypothesis testing. In addition, when the working correlation structure is misspecified, QIF is more efficient than the generalized estimating equation (GEE) approach. Another advantage of the QIF approach is that the inference function has an explicit asymptotic form, which provides a model selection criterion similar to AIC and BIC. It also allows us to test whether coefficients are significantly time‐varying based on the asymptotic results.
Rochon analyzed bivariate longitudinal data for discrete and continuous outcomes by using generalized estimating equations, which did not utilize the nice property of the QIF. 11 Cho applied QIF to multivariate longitudinal data with generalized linear models, which cannot accommodate nonlinear effects such as those in varying coefficient models. 12 Using random effects is another popular way to model longitudinal data. 13 Proudfoot and coauthors modeled the longitudinal data using random effects and then combined multiple outcomes, again in a manner similar to generalized estimating equations. 14 Recently, Zhao and coauthors proposed a joint penalized quasi‐likelihood modeling approach based on splines for multivariate longitudinal data using random effects, with applications to HIV‐1 RNA load levels and CD4 cell counts. 15 Hector and Song investigated a distributed quadratic inference function framework to jointly estimate regression parameters from multiple heterogeneous data sets with correlated responses. 16
Given the ability of QIF to handle complicated correlated data for a univariate longitudinal response, 10 in this article we generalize it to partial linear varying coefficient models with multivariate longitudinal responses. 17 , 18 The purpose of this article is to develop a powerful joint testing procedure using QIF for model (1). If the correlation between the longitudinal outcomes is reasonably high, we aim to show that the joint test has higher power than the marginal tests to detect a signal such as the genetic effect in model (1). We first use splines to approximate the nonparametric functions in the model, 19 followed by penalized estimation to avoid over‐fitting. Then we develop a 2‐step testing procedure: a joint test for the interaction effect on multiple outcomes based on the QIF approach, followed by separate tests of the marginal effect on each outcome if the overall null is rejected. In cross‐sectional studies, Wu et al. 20 developed a multivariate partially linear varying coefficient model to detect GE interactions with multiple traits. Their method can select genetic variants with pleiotropic effects incorporating either the homogeneity (ie, pleiotropy) or heterogeneity (ie, no pleiotropy) assumptions. However, their approach cannot provide uncertainty quantification for the selected variables. Generalizing their method to multivariate longitudinal data is worth further study.
This article is organized as follows. We state our proposed model in Section 2.1, and generalize the QIF method to the multivariate longitudinal responses in Section 2.2. The estimation procedure and asymptotic properties of the estimators are provided in Section 2.3. A theorem for the general goodness‐of‐fit test via QIF is established in Section 2.4, based on which we propose a 2‐step testing procedure. We assess the finite sample performance of the proposed procedure with Monte Carlo simulation in Section 3 and illustrate the proposed methodology by the analysis of an emotional eating behavior study in Section 4. Conclusions and discussion are given in Section 5. Proofs are deferred to the Appendix.
2. STATISTICAL METHODS
2.1. A joint multivariate partial linear model
In multivariate longitudinal studies, suppose is the th continuous outcome collected on the th observation at time point , where = 1, , , = 1, , , = 1, , . The joint partially linear varying coefficient models are defined as
where is the single nucleotide polymorphism (SNP) variable which does not depend on time and other types of measurement; is a ‐dimensional covariate vector, which can be either time‐dependent or time‐independent; to accommodate the correlation between multiple responses, we stack the error terms together into a long vector
We assume mean 0 with covariance , which carries three different types of association information: the within‐subject correlation across different time points, the between‐subject correlation at the same time point, and the between‐subject correlation across different time points; and are unknown nonparametric smooth functions, representing the main time effect and the time‐dependent genetic effect, respectively. To illustrate the idea, in the following we demonstrate the methods assuming =2. For the situation where there are more than two traits (), the technique can be easily extended.
2.2. Quadratic inference function
To construct the objective function using the QIF approach, we first approximate the unknown functions , , , and by a ‐degree truncated power spline basis, that is,
| (2) |
where is a truncated power spline basis with degree and knots . is a ‐dimensional vector of spline coefficients.
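As a minimal sketch of the truncated power basis described above, the function below evaluates the polynomial terms followed by one truncated term per knot. The function names, the example knots, and the restriction to a degree of at least 1 are illustrative assumptions, not part of the original formulation.

```python
def truncated_power_basis(t, knots, degree):
    """Return [1, t, ..., t^q, (t - k_1)_+^q, ..., (t - k_K)_+^q] for a scalar t.

    Assumes degree >= 1; the knots are taken as given (hypothetical values below).
    """
    poly = [t ** p for p in range(degree + 1)]
    trunc = [max(t - k, 0.0) ** degree for k in knots]
    return poly + trunc


def spline_value(t, coefs, knots, degree):
    """Evaluate the spline approximation as the inner product of basis and coefficients."""
    basis = truncated_power_basis(t, knots, degree)
    return sum(c * b for c, b in zip(coefs, basis))


# Quadratic basis with two hypothetical knots, evaluated at t = 0.5.
example_basis = truncated_power_basis(0.5, [0.25, 0.75], 2)
```

The basis vector has length equal to the degree plus one plus the number of knots, which is the dimension of the spline coefficient vector for one nonparametric function.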
Under the GEE framework, we solve
| (3) |
where , ; is the mean function and is the first derivative of with respect to the parameters; is the covariance matrix of and can be decomposed as with being a diagonal matrix of marginal variances and being a working correlation matrix with nuisance parameters . To avoid the estimation of , QIF approach considers the inverse of the correlation matrix as a linear combination of several known basis matrices in a form
| (4) |
where is the identity matrix and are symmetric basis matrices. As discussed in the existing literature, 10 , 12 the choice of the basis for the inverse of the correlation matrix plays an important role. Suppose as the within‐subject correlation structure and as between‐subject correlation coefficient, that is, the working correlation structure can be expressed as the Kronecker product (tensor product) with as the symmetric matrix with 1 on the diagonal and elsewhere. The inverse of the Kronecker product is with as a symmetric matrix with 0 on the diagonal and 1 elsewhere, and as the identity matrix with compatible dimension. So if the basis matrix for the inverse of the within‐subject correlation is given by , then we have the bases 's as . For exchangeable working correlation, we can set and has 0 on the diagonal and 1 elsewhere. If the working correlation is AR(1), we can set and to have 1 on its two subdiagonals and 0 elsewhere. Following QIF approach, we define the estimation function as
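The basis construction above can be checked numerically. The sketch below builds the exchangeable and AR(1) basis matrices, verifies that the inverse of an exchangeable correlation matrix is a linear combination of the identity and the off‐diagonal basis, and forms candidate joint bases by Kronecker products for two traits. The function names, the closed‐form coefficients, and the ordering of the Kronecker factors (trait by time) are assumptions made for illustration; the ordering must match how the responses are stacked.

```python
import numpy as np


def exchangeable_bases(m):
    """M1 = identity; M2 has 0 on the diagonal and 1 elsewhere."""
    return np.eye(m), np.ones((m, m)) - np.eye(m)


def ar1_bases(m):
    """M1 = identity; M2 has 1 on the two subdiagonals and 0 elsewhere."""
    return np.eye(m), np.eye(m, k=1) + np.eye(m, k=-1)


def exchangeable_inverse_coefs(m, rho):
    """Coefficients (a, b) such that inv(R) = a * M1 + b * M2 for exchangeable R."""
    b = -rho / ((1 - rho) * (1 + (m - 1) * rho))
    a = 1 / (1 - rho) + b
    return a, b


def joint_bases(within_bases):
    """Kronecker products of the 2 x 2 trait-level bases with the within-trait bases.

    Assumes two traits and responses stacked trait by trait.
    """
    I2 = np.eye(2)
    J2 = np.array([[0.0, 1.0], [1.0, 0.0]])
    return [np.kron(B, M) for B in (I2, J2) for M in within_bases]


m, rho = 4, 0.5
M1, M2 = exchangeable_bases(m)
R = (1 - rho) * M1 + rho * np.ones((m, m))
a, b = exchangeable_inverse_coefs(m, rho)
bases = joint_bases((M1, M2))
```

Because the coefficients a and b multiply known matrices, the nuisance correlation parameter never needs to be estimated; this is the property the QIF approach exploits.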
| (5) |
Using the spline approximation, the mean function can be written as
and the first derivative of is given as,
where .
Setting each component in (5) to be zero will result in more equations than unknown parameters. Following the idea of generalized method of moments, 21 the QIF method is defined as
| (6) |
where is a consistent estimator for . Minimizing the objective function (6) provides the estimation of the parameters.
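To make the construction concrete, the following sketch evaluates a QIF objective for a simple linear mean model with an identity link and unit marginal variances. It is an illustration under stated assumptions rather than the authors' implementation: the toy data generator, the parameter values, and the scaling of the objective (the number of subjects times the quadratic form in the averaged extended score) are all chosen for the example.

```python
import numpy as np


def qif(beta, X_list, y_list, bases):
    """Q(beta) = N * gbar' C^{-1} gbar, identity link, unit marginal variances."""
    scores = []
    for X, y in zip(X_list, y_list):
        resid = y - X @ beta
        # One block of moment conditions per basis matrix M_k.
        scores.append(np.concatenate([X.T @ M @ resid for M in bases]))
    g = np.asarray(scores)            # shape: N x (len(bases) * p)
    n_subjects = g.shape[0]
    gbar = g.mean(axis=0)
    C = g.T @ g / n_subjects          # sample second-moment matrix of the scores
    return float(n_subjects * gbar @ np.linalg.solve(C, gbar))


def simulate_subjects(n_subjects, m, beta, rng):
    """Toy data: intercept plus a time covariate, exchangeable errors via a shared effect."""
    X_list, y_list = [], []
    for _ in range(n_subjects):
        t = rng.uniform(0.0, 1.0, size=m)
        X = np.column_stack([np.ones(m), t])
        errors = rng.normal(0.0, 0.3) + rng.normal(0.0, 0.3, size=m)
        X_list.append(X)
        y_list.append(X @ beta + errors)
    return X_list, y_list


rng = np.random.default_rng(0)
m = 4
beta_true = np.array([1.0, 2.0])      # hypothetical true coefficients
X_list, y_list = simulate_subjects(200, m, beta_true, rng)
bases = (np.eye(m), np.ones((m, m)) - np.eye(m))   # exchangeable working correlation
q_true = qif(beta_true, X_list, y_list, bases)
q_far = qif(beta_true + 1.0, X_list, y_list, bases)
```

With two basis matrices and two regression coefficients there are four moment conditions for two unknowns, so the objective is generally positive even at its minimizer; it is small near the true parameter and grows as the parameter moves away from it.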
2.3. Estimation procedure via penalized QIF
The estimation of the parameters can be obtained through minimizing the objective function, that is,
To avoid over‐fitting, we can define a penalized QIF in a form
| (7) |
where is a diagonal matrix with 1 if the corresponding parameter is the spline coefficient associated with knots, and 0 otherwise. Minimizing the penalized QIF provides
| (8) |
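The diagonal penalty matrix described above can be sketched as follows. The ordering of the stacked parameter vector (linear covariate coefficients first, then the spline coefficients of each nonparametric function) and the function names are assumptions for illustration, and the way the penalty term is scaled relative to the QIF term in (7) is not reproduced here.

```python
def penalty_diagonal(degree, n_knots, n_functions, n_linear=0):
    """Diagonal of D: 1 for knot (truncated-power) coefficients, 0 otherwise.

    Assumes the parameter vector stacks n_linear unpenalized linear coefficients,
    followed by (degree + 1 + n_knots) spline coefficients per function.
    """
    one_function = [0] * (degree + 1) + [1] * n_knots
    return [0] * n_linear + one_function * n_functions


def ridge_penalty(theta, diag, lam):
    """Compute lam * theta' D theta for a diagonal D stored as a list."""
    return lam * sum(d * th * th for d, th in zip(diag, theta))


diag = penalty_diagonal(degree=2, n_knots=3, n_functions=2, n_linear=1)
```

Only the coefficients attached to knots are shrunk, so a large tuning parameter pulls each fitted function toward a global polynomial of the chosen degree.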
To estimate the tuning parameter , we can extend the generalized cross‐validation 10 , 22 , 23 to the penalized QIF and define the generalized cross‐validation statistic as
with the effective degree of freedom
where is the second derivative of . The optimized tuning parameter is given as
To establish the asymptotic properties for the penalized quadratic inference function estimators with fixed knots, we assume to be the parameter satisfying . Similar theoretical results are provided in Qu and Li. 10 Following their idea and extending those results to the estimators in our model, we obtain the strong consistency of the resulting estimators in Theorem 1. The ‐consistency and asymptotic normality of the estimators are given in Theorem 2.
Theorem 1
Suppose conditions (A1)‐(A6) in the Appendix hold and the smoothing parameter , then the estimator , which is obtained by minimizing the penalized quadratic function in ( 7 ), exists and converges to almost surely.
Theorem 2
Suppose conditions (A1)‐(A6) in the Appendix hold and the smoothing parameter , then the estimator obtained by minimizing the penalized quadratic function in ( 7 ) is asymptotically normally distributed with the limiting distribution,
where the calculation of defined in (A6) and defined in (A5) can be found in the Appendix.
2.4. A two‐step hypothesis testing procedure
Compared to GEE, an advantage of the QIF approach is that QIF provides a goodness‐of‐fit test without estimating the second moment parameters. Suppose that the ‐dimensional parameter vector is partitioned into , where is the parameter of interest with dimension , and is a nuisance parameter with dimension . If we are interested in testing
then the test statistic
asymptotically follows a chi‐square distribution with degrees of freedom, as shown by Qu and coauthors. 24
Theorem 3
Suppose that all required regularity conditions are satisfied and has dimension . Under the null hypothesis, is asymptotically chi‐square distributed with degrees of freedom, where
| (9) |
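As a sketch of how such a test could be carried out in practice, the code below takes the minimized QIF value under the null constraint and under the unconstrained model, forms their difference, and converts it to a p‐value using a chi‐square upper tail. The function names are hypothetical, and the tail probability is computed with a standard series for the incomplete gamma function, which is adequate for moderate statistics but loses relative accuracy for extremely small p‐values.

```python
import math


def chi2_sf(x, df):
    """Upper tail P(X > x) for X ~ chi-square(df), via the lower incomplete gamma series."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-16 * abs(total):
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def qif_difference_test(q_null, q_full, df):
    """Return (statistic, p-value) for the difference of minimized QIF values.

    q_null: minimized QIF under the null constraint; q_full: unconstrained minimum;
    df: number of constrained parameters.
    """
    stat = q_null - q_full
    return stat, chi2_sf(stat, df)


stat, p_value = qif_difference_test(q_null=12.0, q_full=10.0, df=2)  # made-up values
```

With two degrees of freedom the upper tail has the closed form exp(-x/2), which gives a quick sanity check of the series.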
In Model (1), it is of interest to test whether the genetic effects on multiple traits are significant or not. Based on Theorem 3, we develop a 2‐step testing procedure for testing the significance of the varying coefficient functions. In the first step, the joint test is performed to see whether a genetic factor has a significant effect on at least one longitudinal trait. If the testing result in the first step is significant, we then conduct the marginal tests in the second step to assess whether the genetic effect is significant on both traits or on just one trait. For associated multiple traits with reasonably strong correlation, the joint test is more powerful than the marginal tests, which is empirically verified in our simulation studies.
2.4.1. Step 1: Joint test
First, we are interested in testing whether the genetic factor has an effect on at least one longitudinal trait. The hypothesis is stated as
This can be handled through the truncated power spline approximation of the nonparametric functions stated in (2). In particular, testing this hypothesis is equivalent to testing the following null hypothesis
According to Theorem 3, we can construct a test statistic
where
and
The test statistic has an asymptotic chi‐square distribution with degrees of freedom equal to the number of constraints under , according to Theorem 3.
2.4.2. Step 2: Marginal tests
From the joint test, if there exists a significant genetic effect on at least one longitudinal trait, then we can further test the marginal effects, that is,
Based on (2), this is equivalent to testing and , separately.
For testing , we use test statistic , where
and
Similarly, we can construct a test statistic for testing , where
and
The asymptotic distribution of the test statistics and can be obtained from Theorem 3.
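The decision logic of the two‐step procedure can be summarized in a few lines. In the sketch below, the marginal tests are examined only when the joint test rejects; the function name, the trait labels, the p‐values, and the choice of significance level for the second step are illustrative assumptions.

```python
def two_step_test(p_joint, p_marginals, alpha_joint=0.05, alpha_marginal=0.05):
    """Return the traits declared significant, or [] if the joint test is not rejected.

    p_marginals maps a trait label to its marginal p-value. The significance
    levels are assumptions; the second-step level is not fixed by the procedure.
    """
    if p_joint >= alpha_joint:
        return []                     # Step 1 not rejected: stop here
    return [trait for trait, p in p_marginals.items() if p < alpha_marginal]


# Hypothetical p-values for illustration only.
result_a = two_step_test(0.001, {"trait1": 0.20, "trait2": 0.003})
result_b = two_step_test(0.30, {"trait1": 0.01, "trait2": 0.02})
```

In the second call the marginal p‐values are small, but no trait is reported because the joint test acts as a gatekeeper.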
3. SIMULATION STUDIES
3.1. Simulation setup
In this section, the finite sample performance of the proposed method is evaluated through Monte Carlo simulation studies. Two continuous longitudinal responses are generated from the models
where , , , , and . We generate the same number of time points for each individual from a uniform distribution . The time independent predictor variable is also generated from . We set the minor allele frequency (MAF) for as and assume Hardy‐Weinberg equilibrium. Three different SNP genotypes , , and are simulated from a multinomial distribution with frequencies , and , respectively. In this simulation study, we vary to investigate the effect of minor allele frequency. Variable takes value {0,1,2} corresponding to genotypes , following an additive model. We assume and are jointly normally distributed as
We set the marginal variances = = 0.1. The true correlation structure of and are both exchangeable with the structure
with . And for we choose
with as the between‐subject correlation across different time points. We vary the between‐subject correlation at the same time point to investigate the power gain for the joint test.
We draw 1000 data sets with sample size and time points , in order to compare the performance of our proposed method under different sample sizes. We set to be the identity matrix and to be 1 on subdiagonals and 0 elsewhere, that is, AR(1) working correlation. An important issue for the model selection is to decide whether the spline model (2) is adequate for further penalization by (8). In the following simulations, we use quartic splines with the number of knots taken to be the largest integer not greater than 0.6 as suggested by Tian and colleagues. 25
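The genotype simulation step in this setup can be sketched as follows. Under Hardy‐Weinberg equilibrium the three genotype frequencies are determined by the minor allele frequency, and the additive coding counts copies of the minor allele. The function names and the convention that the code 2 denotes two copies of the minor allele are assumptions for illustration.

```python
import random


def hwe_genotype_probs(maf):
    """Probabilities of carrying 0, 1, or 2 copies of the minor allele under HWE."""
    return [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]


def simulate_genotypes(n, maf, seed=None):
    """Draw n additive genotype codes in {0, 1, 2} from the HWE multinomial."""
    rng = random.Random(seed)
    return rng.choices([0, 1, 2], weights=hwe_genotype_probs(maf), k=n)


genotypes = simulate_genotypes(20000, maf=0.3, seed=1)
observed_maf = sum(genotypes) / (2 * len(genotypes))
```

With a small minor allele frequency the genotype coded 2 is rare (probability 0.01 when the frequency is 0.1), which is one reason power drops at low frequencies.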
3.2. Estimation performance
We use the asymptotic normality in Theorem 2 to construct the Wald type confidence interval for parameters and . Table 1 summarizes the empirical coverage probability (CP) in percentage and the average length (AL) of the confidence intervals at confidence level based on 1000 simulation replicates. As we can see from the table, the CPs are close to the nominal level 95%. When the sample size gets larger, the ALs are shorter and the CPs are closer to 95%.
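The coverage probability and average length reported for such intervals can be computed from the replicate‐level estimates and standard errors. The sketch below assumes that each replicate supplies a point estimate and a standard error derived from the asymptotic variance; the function names and the example numbers are hypothetical.

```python
from statistics import NormalDist


def wald_interval(estimate, std_error, level=0.95):
    """Two-sided Wald interval: estimate +/- z * standard error."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * std_error, estimate + z * std_error


def coverage_and_length(estimates, std_errors, truth, level=0.95):
    """Return (coverage in percent, average interval length) across replicates."""
    intervals = [wald_interval(e, s, level) for e, s in zip(estimates, std_errors)]
    covered = sum(lo <= truth <= hi for lo, hi in intervals)
    avg_len = sum(hi - lo for lo, hi in intervals) / len(intervals)
    return 100.0 * covered / len(intervals), avg_len


# Two made-up replicates: the first interval covers the truth, the second does not.
cp, al = coverage_and_length([1.0, 3.0], [0.5, 0.5], truth=1.2)
```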
TABLE 1.
Empirical coverage probability (%) and average length of confidence intervals for ,
| | CP | AL | CP | AL |
|---|---|---|---|---|
| | 93.2 | 0.078 | 94.8 | 0.050 |
| | 93.7 | 0.078 | 95.6 | 0.050 |
Next, we consider the estimation performance of the nonparametric functions for . In Figure 1, the plots are from the case with sample size and (for other situations, please refer to the Appendix). For each function, the red solid line is the true function, and the three blue dashed lines correspond to the average of the estimated functions from 1000 simulation replicates in the middle and the 95% pointwise confidence bands, with the standard error calculated from the standard deviation of the 1000 replicates. The estimation is quite accurate even with a small sample size and low MAF. As the MAF or sample size increases, the estimation performance improves (see Appendix). We also use the asymptotic normality in Theorem 2 to construct the pointwise confidence intervals. Table 2 summarizes the empirical coverage probability (CP) in percentage and the average length (AL) (in parentheses) of the CIs for at , and 0.8 for sample size and . The CPs are all close to the nominal level 95% and the ALs are shorter under a larger sample size.
FIGURE 1.

The estimation of nonparametric functions for with = 200 and . In each panel, the red solid line is the true function, and the three blue dashed lines correspond to the estimated function in the middle and the 95% pointwise confidence bands
TABLE 2.
Empirical coverage probability (%) and average length of pointwise confidence intervals (in parentheses) for at , and 0.8
| Evaluation point | Sample size | Intercept | Slope | Intercept | Slope |
|---|---|---|---|---|---|
| 0.2 | 500 | 92.2 (0.115) | 92.6 (0.073) | 92.9 (0.115) | 92.7 (0.073) |
| 0.2 | 1000 | 91.7 (0.082) | 93.3 (0.052) | 92.9 (0.082) | 92.9 (0.052) |
| 0.4 | 500 | 90.5 (0.109) | 91.9 (0.069) | 92.4 (0.108) | 92.8 (0.068) |
| 0.4 | 1000 | 92.3 (0.077) | 93.7 (0.049) | 94.4 (0.077) | 93.1 (0.049) |
| 0.6 | 500 | 91.0 (0.111) | 91.0 (0.070) | 92.3 (0.110) | 93.1 (0.070) |
| 0.6 | 1000 | 94.0 (0.079) | 93.8 (0.050) | 95.1 (0.079) | 94.4 (0.050) |
| 0.8 | 500 | 91.5 (0.113) | 93.1 (0.072) | 92.9 (0.113) | 94.1 (0.071) |
| 0.8 | 1000 | 91.5 (0.081) | 94.2 (0.051) | 94.8 (0.080) | 94.6 (0.051) |
3.3. Testing performance
We propose a two‐step hypothesis testing procedure to detect the genetic effects on multiple traits. With the joint test, higher power is expected with correlated traits than the marginal tests. We would like to evaluate how much we can gain in power when the correlation between multiple traits increases. This is done by varying the correlation coefficient at the same time point for the two simulated traits.
We evaluate the performance of the joint test under the null hypothesis . Power is evaluated under a sequence of alternative models with different values of , which is denoted by . The performance of the marginal tests for the nonparametric functions corresponding to different traits is evaluated under the two null hypotheses and respectively. For each test, power is evaluated under a sequence of alternative models, denoted by , correspondingly.
Figure 2 shows the power comparison between the joint test and the marginal tests under different correlation coefficients varying from 0.1 through 0.6. Each panel corresponds to the results with one value and displays the comparison of the three power curves, with the empirical size (when = 0) and power at different at the significance level 0.05 and sample size . A similar pattern can be observed for the larger sample size . As expected, the Type I error is close to 0.05 and the power increases as the signal increases for every power curve. When is small (low correlation), we do not see much power gain of the joint test compared to the marginal tests. As increases (correlation between the two traits increases), we observe higher power of the joint test (starting at ). This shows that the joint test is more powerful than the marginal tests for moderate or high correlation between traits. We also conducted simulations to evaluate the impact of between‐subject correlation across different time points on the testing power. We observed results similar to those obtained by varying the between‐subject correlation at the same time point. Due to space limitations, these results are provided in the supplemental file.
FIGURE 2.

The power comparison between the joint test and marginal tests under different correlation coefficient from 0.1 to 0.6 with sample size . The exact empirical sizes are given in Table 3
MAF also plays a major role in the inference performance of an association test in general. For the proposed method, the power increases as the MAF increases from 0.1 to 0.5. This aligns with the general expectation. In particular, there is a large power improvement as increases from 0.1 to 0.3, as shown in Figure 3.
FIGURE 3.

The power comparison of the joint test under different minor allele frequencies () and different sample sizes ()
TABLE 3.
Empirical size for the joint and marginal tests under different correlation coefficient from 0.1 to 0.6 with sample size
| Test | | | | | | |
|---|---|---|---|---|---|---|
| Joint | 0.040 | 0.040 | 0.037 | 0.036 | 0.034 | 0.026 |
| Marginal 1 | 0.038 | 0.036 | 0.038 | 0.039 | 0.039 | 0.038 |
| Marginal 2 | 0.034 | 0.036 | 0.039 | 0.037 | 0.038 | 0.037 |
4. REAL DATA APPLICATION
We applied the proposed multivariate partially linear varying coefficients model and the two‐step hypothesis testing procedure to the Twin Study of Hormones and Behavior across the Menstrual Cycle project 26 from the Michigan State University Twin Registry (MSUTR). 27 , 28 , 29 The goal of the study was to examine associations between changes in estradiol and progesterone levels and emotional eating across the menstrual cycle. Emotional eating was measured with the Dutch Eating Behavior Questionnaire (DEBQ) and negative affect was measured with the Negative Affect scale from the Positive and Negative Affect Schedule (PANAS). The DEBQ assesses the tendency to eat in response to negative emotions while PANAS is used to measure negative emotional states like sadness and anxiety.
In this study, we wanted to examine how genes respond to the hormone change (eg, estrogen) to affect emotional eating measured by DEBQ and PANAS. Since body mass index (BMI) is an important covariate for the study, we included it in the linear component of the model. Although the original study contains twins data, we only included one of the twins in each family in the analysis to make the samples independent. Measurements for each participant were collected for 45 consecutive days, which then were grouped into eight menstrual cycle phases, that is, ovulatory phase (1), transition ovulatory to midluteal (2), midluteal phase (3), transition midluteal to premenstrual (4), premenstrual phase including the first day of menstrual cycle (5), remaining days of menstrual cycle, part of follicular phase (6), follicular phase (7) and transition follicular to ovulatory phase (8). They were grouped into these phases based on profiles of changes in estrogen and progesterone across the cycle. 30 Data that belong to the same phase were averaged to get a phase‐level measure. All individuals were aligned according to the 8 phases for further analysis.
To demonstrate the utility of the method, here we focused on a candidate gene, nuclear receptor coactivator 7 (NCoA7). This gene codes for an estrogen receptor‐associated protein which plays an important role in the cellular response to estrogen. After removing SNPs with MAF , we had 12 SNPs measured on 327 participants for further analysis.
We consider the partially linear varying coefficient model with the two longitudinal traits, namely DEBQ and PANAS, with the form
For the th individual measured at menstrual cycle phase , the two longitudinal traits are denoted as and for DEBQ and PANAS, respectively. One phase dependent covariate BMI is denoted as . refers to the hormone estradiol level, which is standardized to range between 0 and 1 by where is the original estrogen level, and are the sample mean and standard deviation of , and is the cumulative distribution function of a standard normal. represents the SNP variable and the 12 SNPs were analyzed separately.
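The transformation of the hormone level described above maps each value to the unit interval by standardizing it and applying the standard normal cumulative distribution function. A minimal sketch follows; the function name is hypothetical, and the use of the sample standard deviation with an n - 1 denominator is an assumption.

```python
from statistics import NormalDist, mean, stdev


def standardize_to_unit_interval(levels):
    """Map raw levels to (0, 1) via Phi((x - sample mean) / sample sd)."""
    mu, sd = mean(levels), stdev(levels)   # stdev uses the n - 1 denominator
    std_normal = NormalDist()
    return [std_normal.cdf((x - mu) / sd) for x in levels]


scaled = standardize_to_unit_interval([1.0, 2.0, 3.0])   # made-up levels
```

A value equal to the sample mean is mapped to 0.5, and the transformation preserves the ordering of the original measurements.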
We aimed to test if an SNP is associated with the two traits with its effect modified by the estrogen hormone level, that is,
We applied quadratic splines and the exchangeable working correlation structure for this real data analysis. After the Bonferroni correction for the 12 SNPs, we found that three SNPs, rs584032, rs6911452, and rs9401855, were significant with the joint test. Table 4 lists the results. The joint test results are all more significant than the marginal tests. The between‐trait correlations between DEBQ and PANAS at the same cycle phase are shown in Figure 4, which shows fairly strong correlations at different phases, ranging from 0.36 to 0.61. This explains why the joint test shows stronger significance than the marginal tests. Figure 4 shows the detailed correlation information about the three components: within‐trait correlation, and between‐trait correlation at the same time points and across different time points.
TABLE 4.
The test results of the 3 significant SNPs with their rs numbers, the alleles (minor allele shown in bold font), the MAF, and the ‐values for the joint test (denoted as ) and the two marginal tests (denoted as and )
| SNP | Alleles | MAF | Joint test | Marginal test (DEBQ) | Marginal test (PANAS) |
|---|---|---|---|---|---|
| rs584032 | T/A | 0.176 | 2.390e‐4 | 3.472e‐2 | 1.937e‐3 |
| rs6911452 | A/G | 0.089 | 8.225e‐4 | 3.491e‐3 | 4.627e‐3 |
| rs9401855 | A/G | 0.129 | 3.758e‐4 | 5.449e‐2 | 1.108e‐3 |
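The Bonferroni screening of the joint test can be reproduced from the reported p‐values. The sketch below assumes a family‐wise significance level of 0.05, which is the level used in the simulations but is not stated explicitly for the real data analysis; the variable names are illustrative.

```python
n_snps = 12
alpha = 0.05                      # assumed family-wise level
threshold = alpha / n_snps        # Bonferroni-adjusted per-SNP threshold

# Joint test p-values reported in Table 4.
joint_p = {"rs584032": 2.390e-4, "rs6911452": 8.225e-4, "rs9401855": 3.758e-4}
significant = [snp for snp, p in joint_p.items() if p < threshold]
```

Under this assumption the per‐SNP threshold is about 0.0042, and all three joint p‐values fall below it.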
FIGURE 4.

The correlation information including within‐trait correlation, between‐trait correlation at the same and across different cycle phases. The x‐axis and y‐axis represent the 8 cycle phases for the two variables DEBQ and PANAS, respectively
Figure 5 shows the estimated nonparametric coefficient functions for both responses DEBQ and PANAS, with SNP rs9401855 as an example. The point‐wise 95% confidence bands cover a large part of the zero line for the DEBQ, which is consistent with its ‐value .0545 from the marginal test. For PANAS, in the central region, the zero line is outside of the 95% point‐wise confidence bands, which is also consistent with the marginal ‐value .0011. The result shows that this SNP interacts with estrogen hormone and affects only PANAS, not DEBQ. The negative coefficients show that estrogen hormone negatively impacts PANAS. Individuals carrying the GG genotype are more likely to experience negative affective states such as sadness and anxiety, compared to those carrying one or no G allele. From the slightly quadratic effect curve, it can be seen that the negative impact peaks around phases 5‐6, that is, from the premenstrual phase including the first day of menstrual cycle (5) to the remaining days of menstrual cycle, part of follicular phase (6), while less negative impact is observed at the beginning and the end of the eight‐phase cycle (ie, during the ovulatory phase).
FIGURE 5.

The estimated intercept and slope functions for DEBQ and PANAS from the joint model (red solid curve) and their point‐wise 95% confidence bands (dashed curve)
5. DISCUSSION
Joint analysis of multiple correlated traits can potentially improve the power to identify genetic variants associated with complex traits. However, association analysis focusing on multiple longitudinal traits has not been well studied. Methods for GE interaction with multiple traits under a longitudinal design are even rarer. In this article, we proposed a joint multivariate varying coefficient modeling approach to accommodate correlated longitudinal traits and proposed a testing procedure to identify genetic variants associated with multiple longitudinal traits with their effects modified by some environmental factors. By modeling the environmental effect with a nonparametric function, one can estimate the dynamic changing effect of on over the changing values of . The nonparametric function is flexible in the sense that the function is determined by the data without assuming a parametric structure. Both simulation and real data analysis demonstrate the utility of the proposed method.
One difficulty in jointly modeling multiple longitudinal traits is modeling the complex correlation structure. For each subject, we should consider correlation between measurements for the same trait at different time points, correlation between measurements at the same time point on different traits, and correlation between measurements at different time points and on different traits. We applied the QIF approach in the estimation and testing procedures. The QIF approach has several advantages. First, it only requires correct specification of the mean structure and does not require any joint likelihood in hypothesis testing. Second, it avoids estimating the nuisance correlation structure parameters by assuming that the inverse of the working correlation matrix can be approximated by a linear combination of several known basis matrices. Third, when the working correlation structure is misspecified, the QIF is more efficient than the GEE approach. Fourth, the inference function of the QIF approach has an explicit asymptotic form, which provides a model selection criterion and allows us to test whether coefficients are significant or time varying based on the asymptotic results. It is worth mentioning that missing completely at random (MCAR) is assumed in this work, which is a common assumption under the QIF framework when dealing with missing data. 31
In the real application, we investigated the association of SNPs in a candidate gene with two longitudinal traits, DBEQ and PANAS. Although the data were regrouped into eight phases, they still carry the temporal information and can be treated like longitudinal data. The results show that three SNPs passed the Bonferroni threshold with the joint test, and the p‐values of the joint test are smaller than those of the individual marginal tests. This shows the relative advantage of the joint test. As shown in the simulation study, the joint test can achieve power gain when the traits are correlated. Therefore, it is essential to assess the correlations between traits when fitting multiple traits jointly and conducting joint testing.
Our method was demonstrated with two traits. The method can be extended to more than two longitudinal traits, although the computational cost might increase. In addition, our method is not restricted to a longitudinal study. It also applies to other studies where multiple traits can be measured over a linear scale. For example, in a pharmacogenetic study, multiple drug responses (eg, blood pressure and heart rate) can be measured over different dosages of a drug treatment. The proposed model can be fitted to assess how genes respond to the increasing dosage levels to affect the drug responses. As another example, in a brain imaging genetic study, brain activities in different brain regions can be measured over a spatial scale and can be treated as multiple traits. One can fit the proposed model to understand how genes affect brain activities over a spatial scale.
Supporting information
Data S1: Supporting Information
ACKNOWLEDGEMENTS
The authors wish to thank the anonymous reviewers for their insightful comments that greatly improved the presentation of the manuscript. Research reported in this publication was supported by the National Human Genome Research Institute of the National Institutes of Health (NIH) under award number R21HG010073, by the National Institute of General Medical Sciences of the NIH under award number R01GM131398 and by the National Institute of Mental Health of the NIH under award number R01MH082054. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
PROOFS OF THEOREMS
To establish the asymptotic properties of the estimator, we need the following regularity conditions.
- (A1) is a bounded sequence of integers.
- (A2) The parameter space is compact and is an interior point of .
- (A3) The parameter is identified, that is, there is a unique such that the first moment assumption holds for , and is continuous.
- (A4) is continuous in .
- (A5) converges almost surely to , which is a constant and invertible matrix.
- (A6) The first derivative of exists and is continuous. converges in probability to if converges in probability to .
Proof of Theorem 1
The estimator exists because the objective in (7) is bounded below by zero and attains its global minimum. To prove consistency, note first that the estimator is obtained by minimizing (7), so we have
(A1) Since
by the strong law of large numbers and (A5), and ,
Thus, we can obtain from (A1) that
(A2) Since the parameter space is compact, by the Glivenko‐Cantelli theorem,
Hence, by (A5) and the continuous mapping theorem,
Combined with (A2), we get
(A3) Suppose is not a strongly consistent estimator of , then there exists a neighborhood of the true parameter , say , such that . Since is a continuous function and is compact, there exists a point such that
achieves its minimum in . By the identification of in (A3), there is a unique satisfying , and we have
which contradicts (A3). Hence, is a consistent estimator of .
Proof of Theorem 2
The estimate of satisfies
By Taylor's expansion, we obtain
where is some value between and . Thus, we can have
(A4) Since converges to in probability and is between and , by (A5) and (A6) we can get
When ,
Similarly, since
and , we have
Therefore, (A4) can be written as
(A5) By the central limit theorem,
(A6) Using (A5) and (A6), we obtain
Proof of Theorem 3
By Taylor's expansion,
where is some value between and . We can also obtain from Taylor's expansion that
where is between and . From conditions in (9), we have
Hence
If we expand about , and about , we obtain
The above two equations give us
which can be written as
Then can be written as
which is asymptotically equivalent to
By Theorem 3.2 in Hansen, 21
Therefore,
thus, follows asymptotically.
Wang H, Zhang J, Klump KL, Alexandra Burt S, Cui Y. Multivariate partial linear varying coefficients model for gene‐environment interactions with multiple longitudinal traits. Statistics in Medicine. 2022;41(19):3643–3660. doi: 10.1002/sim.9440
Honglang Wang and Jingyi Zhang contributed equally to this work.
Funding information National Institutes of Health, Grant/Award Numbers: R01GM131398; R01MH082054; R21HG010073
DATA AVAILABILITY STATEMENT
Research data used for real data analysis are from a different group and are not shared. R code used to implement the method can be downloaded at https://github.com/Honglang/MPLVC.
REFERENCES
- 1. Sitlani CM, Rice KM, Lumley T, et al. Generalized estimating equations for genome‐wide association studies using longitudinal phenotype data. Stat Med. 2015;34(1):118‐130.
- 2. Macgregor S, Knott SA, White I, Visscher PM. Quantitative trait locus analysis of longitudinal quantitative trait data in complex pedigrees. Genetics. 2005;171(3):1365‐1376.
- 3. Furlotte NA, Eskin E, Eyheramendy S. Genome‐wide association mapping with longitudinal data. Genet Epidemiol. 2012;36(5):463‐471.
- 4. Xu Z, Shen X, Pan W, Alzheimer's Disease Neuroimaging Initiative. Longitudinal analysis is more powerful than cross‐sectional analysis in detecting genetic association with neuroimaging phenotypes. PLoS One. 2014;9(8):e102312.
- 5. Wang W, Feng Z, Bull SB, Wang Z. A 2‐step strategy for detecting pleiotropic effects on multiple longitudinal traits. Front Genet. 2014;5:357.
- 6. Gratten J, Visscher PM. Genetic pleiotropy in complex traits and diseases: implications for genomic medicine. Genome Med. 2016;8(1):78.
- 7. Lobo I. Pleiotropy: one gene can affect multiple traits; 2008.
- 8. Ma S, Yang L, Romero R, Cui Y. Varying coefficient model for gene–environment interaction: a non‐linear look. Bioinformatics. 2011;27(15):2119‐2126.
- 9. Liu X, Cui Y, Li R. Partial linear varying multi‐index coefficient model for integrative gene‐environment interactions. Stat Sin. 2016;26:1037.
- 10. Qu A, Li R. Quadratic inference functions for varying‐coefficient models with longitudinal data. Biometrics. 2006;62(2):379‐391.
- 11. Rochon J. Analyzing bivariate repeated measures for discrete and continuous outcome variables. Biometrics. 1996;52(2):740.
- 12. Cho H. The analysis of multivariate longitudinal data using multivariate marginal models. J Multivar Anal. 2016;143:481‐491.
- 13. Fieuws S, Verbeke G. Joint modelling of multivariate longitudinal profiles: pitfalls of the random‐effects approach. Stat Med. 2004;23(20):3093‐3104.
- 14. Proudfoot J, Faig W, Natarajan L, Xu R. A joint marginal‐conditional model for multivariate longitudinal data. Stat Med. 2018;37(5):813‐828.
- 15. Zhao L, Chen T, Novitsky V, Wang R. Joint penalized spline modeling of multivariate longitudinal data, with application to HIV‐1 RNA load levels and CD4 cell counts. Biometrics. 2021;77(3):1061‐1074.
- 16. Hector EC, Song PXK. Joint integrative analysis of multiple data sources with correlated vector outcomes; 2020. arXiv preprint arXiv:2011.14996.
- 17. Bandyopadhyay S, Ganguli B, Chatterjee A. A review of multivariate longitudinal data analysis. Stat Methods Med Res. 2011;20(4):299‐330.
- 18. Verbeke G, Fieuws S, Molenberghs G, Davidian M. The analysis of multivariate longitudinal data: a review. Stat Methods Med Res. 2014;23(1):42‐59.
- 19. Ruppert D, Carroll RJ. Theory & methods: spatially‐adaptive penalties for spline fitting. Aust N Z J Stat. 2000;42(2):205‐223.
- 20. Wu C, Cui Y, Ma S. Integrative analysis of gene–environment interactions under a multi‐response partially linear varying coefficient model. Stat Med. 2014;33(28):4988‐4998.
- 21. Hansen LP. Large sample properties of generalized method of moments estimators. Econometrica. 1982;50(4):1029‐1054.
- 22. Ruppert D. Selecting the number of knots for penalized splines. J Comput Graph Stat. 2002;11(4):735‐757.
- 23. Bai Y, Fung WK, Zhu ZY. Penalized quadratic inference functions for single‐index models with longitudinal data. J Multivar Anal. 2009;100(1):152‐161.
- 24. Qu A, Lindsay BG, Li B. Improving generalised estimating equations using quadratic inference functions. Biometrika. 2000;87(4):823‐836.
- 25. Tian R, Xue L, Liu C. Penalized quadratic inference functions for semiparametric varying coefficient partially linear models with longitudinal data. J Multivar Anal. 2014;132:94‐110.
- 26. Klump KL, Keel PK, Racine SE, et al. The interactive effects of estrogen and progesterone on changes in emotional eating across the menstrual cycle. J Abnorm Psychol. 2013;122(1):131.
- 27. Klump KL, Burt SA. The Michigan State University Twin Registry (MSUTR): genetic, environmental and neurobiological influences on behavior across development. Twin Res Hum Genet. 2006;9(6):971‐977.
- 28. Burt SA, Klump KL. The Michigan State University Twin Registry (MSUTR): an update. Twin Res Hum Genet. 2013;16(1):344‐350.
- 29. Burt SA, Klump KL. The Michigan State University Twin Registry (MSUTR): 15 years of twin and family research. Twin Res Hum Genet. 2019;22(6):741.
- 30. Klump KL, Racine SE, Hildebrandt B, et al. Ovarian hormone influences on dysregulated eating: a comparison of associations between women with versus without binge episodes. Clin Psychol Sci. 2014;2(5):545‐559.
- 31. Song PXK, Jiang Z, Park E, Qu A. Quadratic inference functions in marginal models for longitudinal data. Stat Med. 2009;28(29):3683‐3696.
