A nonparametric regression method for multiple longitudinal phenotypes using multivariate adaptive splines

Wensheng ZHU; Heping ZHANG

doi:10.1007/s11464-012-0256-8

Author manuscript; available in PMC: 2014 Oct 10.

Published in final edited form as: Front Math China. 2012 Oct 22;8(3):731–743. doi: 10.1007/s11464-012-0256-8

PMCID: PMC4193387 NIHMSID: NIHMS423886 PMID: 25309585

Abstract

In genetic studies of complex diseases, particularly mental illnesses and behavior disorders, two distinct characteristics have emerged in some data sets. First, genetic data sets are collected with a large number of phenotypes that are potentially related to the complex disease under study. Second, each phenotype is collected from the same subject repeatedly over time. In this study, we present a nonparametric regression approach to study multivariate and time-repeated phenotypes together by using the technique of the multivariate adaptive regression splines for analysis of longitudinal data (MASAL), which makes it possible to identify genes as well as gene-gene and gene-environment (including gene-time) interactions associated with the phenotypes of interest. Furthermore, we propose a permutation test to assess the associations between the phenotypes and selected markers. Through simulation, we demonstrate that our proposed approach has advantages over the existing methods that examine each longitudinal phenotype separately or analyze the summarized values of phenotypes by compressing them into one-time-point phenotypes. Application of the proposed method to the Framingham Heart Study illustrates that the use of multivariate longitudinal phenotypes enhanced the significance of the association test.

Keywords: Multivariate phenotypes, longitudinal data analysis, genetic association test, multivariate adaptive regression splines

1 Introduction

In genetic studies of many complex diseases, particularly mental illnesses and behavior disorders, it is common that researchers collect multiple phenotypes related to a disorder as well as other disorders that are potentially correlated with the primary disorder of interest [21]. On the one hand, these phenotypes may be mutually correlated, and some or all of them may share a common genetic mechanism. On the other hand, these phenotypes may also be time-dependent, since some phenotypes are collected from the same subject repeatedly over time, which is known as a longitudinal study design. A well-known example is the Framingham Heart Study (FHS), where hundreds of phenotypes related to cardiovascular disease (CVD) were collected repeatedly every two to four years. Hence, it is useful to analyze multiple phenotypes with repeated measures simultaneously in genetic association studies.

Several methods have been developed to test the association between multivariate phenotypes and genetic markers. Lange et al. [8] proposed a multivariate extension of family-based association tests based on generalized estimating equations (FBAT-GEE) to analyze multiple phenotypes simultaneously [9]. Xu et al. [13] suggested combining all univariate trait-specific test statistics based on individual phenotypes for a global association test. Furthermore, Zhu and Zhang [21] performed a series of simulations to clarify how beneficial it is to analyze multiple phenotypes jointly and when it makes the most sense to analyze them simultaneously. Their simulation results demonstrated that it is advantageous to analyze multiple phenotypes jointly, especially when the genes affect more than one phenotype and when there exist causal relations between multivariate phenotypes. Recently, Zhang et al. [18] constructed a nonparametric test based on the generalized Kendall’s tau to accommodate any combination of dichotomous, ordinal, and quantitative traits. Zhu et al. [20] extended their nonparametric test to adjust for covariate effects.

However, some existing methods cannot deal with longitudinal data directly, and they require repeated measures to be summarized into a one-time-point trait by taking the average of several measures or by other transformations [19]. Although the FBAT-GEE can be applied to longitudinal studies in theory, it is not conveniently designed to include time-varying covariates and gene-time interaction. Furthermore, most methods only considered the main effects of genes but did not accommodate interactions among genes and between genes and environmental factors when analyzing multiple phenotypes simultaneously; yet, it has been documented that considering those interactions will increase the chance of identifying susceptibility loci in association studies [1,3].

Zhang [15] proposed a nonparametric regression method to analyze longitudinal data, called multivariate adaptive splines for analysis of longitudinal data (MASAL). MASAL is an automatic procedure for fitting longitudinal data without specifying the full regression surface, and it can directly be used to establish the relationship between univariate longitudinal phenotype and genetic markers while adjusting for environmental and time-dependent covariates.

In the present paper, we extend the MASAL to establish the relationship between multivariate, repeated measured phenotypes and genetic markers as well as other non-genetic covariates in the study of complex disorders. To analyze multiple phenotypes jointly, we must acknowledge the fact that different phenotypes of the same subject share the common non-time covariates (say, the common genotypes and the common gender), but these common covariates may have different effects on different phenotypes. In addition, in order for the MASAL to distinguish the different effects of the common covariates, we present a suitable way to re-code covariates for different phenotypes. Moreover, based on the fitted MASAL model, we provide a formal testing procedure for assessing the significance of gene, gene-gene, gene-environment, and gene-time interactions in association with multiple phenotypes. To demonstrate the benefit of our proposed method, we carry out simulation studies and analyze the data from the Framingham Heart Study (FHS).

2 Method

2.1 MASAL for univariate phenotype (univariate MASAL)

We first briefly review MASAL presented by Zhang [15] and demonstrate its application in genetic association studies. Let Y be the univariate phenotype, and let X = (x₁, …, x_p)^T be the p covariates that consist of the measured environmental factors E and genotypes G measured on a series of SNPs, with each genotype coded as 0, 1, or 2, representing the number of copies of one allele at this marker. For the ith individual (i = 1, …, n), suppose that Y and X are repeatedly measured at T_i different time points t_{i1}, …, t_{iT_i}, where x_{l,ij} and y_{ij} are respectively the measurements of the lth covariate x_l (l = 1, …, p) and the observed value of the phenotype Y at time t_{ij}, j = 1, …, T_i. We note, however, that the genotypes of each individual are identical over time.

To establish the relationship between phenotype Y and the p environmental and genotypic covariates, we consider a nonparametric model

y_{i j} = f (x_{1, i j}, \dots, x_{p, i j}, t_{i j}) + e_{i j},

(1)

where f is an unknown smooth function and e_ij is the error term for j = 1, …, T_i, i = 1, …, n.

To fit model (1), we adopt the MASAL method presented by Zhang [15–17]. MASAL selects a model from the following class of functions:

{f; f (x) = \sum_{k = 0}^{K} β_{k} B_{k} (x), K = 0, 1, \dots},

(2)

where B_k(x) is a special basis function of the p covariates x = (x₁, …, x_p)^T, and β_k is the regression coefficient of B_k(x) (k = 0, 1, …, K). All basis functions are made of the following two functions:

{(x_{l} - τ)}^{+}, x_{l} (l = 1, \dots, p),

(3)

where τ is called a knot and a⁺ = max(a, 0) for any number a. Because B_k(x) can either be one of the functions in equation (3) or the product of those functions involving distinct covariates such as

{(x_{2} - τ_{2})}^{+} \times x_{3} {(x_{4} - τ_{3})}^{+},

the MASAL model can accommodate not only gene-gene interactions but also gene-environment and gene-time interactions in the genetic association studies.
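The construction of basis functions from the two building blocks in (3) can be sketched in a few lines of Python. This is only an illustration: the helper names `hinge` and `product_basis`, the encoding of a factor as an `(index, knot)` pair, and the numeric values are assumptions made here and are not part of the MASAL software.

```python
from math import prod


def hinge(value, knot):
    """Truncated linear function (value - knot)^+ = max(value - knot, 0)."""
    return max(value - knot, 0.0)


def product_basis(x, factors):
    """Evaluate a basis function built as a product of factors from (3).

    x       : sequence of covariate values (zero-based indexing)
    factors : list of (index, knot) pairs on distinct covariates;
              knot=None stands for the linear factor x_l,
              otherwise the factor is (x_l - knot)^+.
    """
    return prod(
        x[idx] if knot is None else hinge(x[idx], knot)
        for idx, knot in factors
    )


# Hypothetical example: (x_2 - 1.0)^+ * x_3, i.e. zero-based indices 1 and 2.
x = [0.0, 2.5, 2.0]
value = product_basis(x, [(1, 1.0), (2, None)])  # (2.5 - 1.0) * 2.0
```

If one factor is a genotype and another is an environmental covariate or time, the product is a gene-environment or gene-time interaction term of the kind described above.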

To select a model from (2), MASAL uses a forward step first and then a backward step. In the forward step, all knots are found and terms are added to minimize the (weighted) sum of squared residuals

WLS = \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{T} W_{i}^{- 1} (y_{i} - {\hat{y}}_{i}),

where y_i = (y_{i1}, …, y_{iT_i})^T, ŷ_i is the predicted value of y_i, and W_i is the within-subject covariance matrix for

e_{i} = {(e_{i 1}, \dots, e_{{i T}_{i}})}^{T}, i = 1, \dots, n .

In the backward step, based on generalized cross-validation (GCV), MASAL deletes one least significant term at a time from the large model obtained after the forward step. The final model is the one that yields the smallest

GCV = {WLS}_{k} \cdot {[1 - \frac{λ k + 1}{\sum_{i = 1}^{n} T_{i}}]}^{- 2},

where WLS_k is the WLS of a reduced model with k terms and λ is the penalizing parameter for model complexity, which is usually chosen between three and five [2,16].
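The backward-step criterion can be illustrated with a small Python sketch that evaluates GCV for a few candidate models and keeps the one with the smallest value. The function name `gcv`, the default λ = 4, and the candidate WLS values are hypothetical and chosen only for illustration; computing WLS itself (which requires the matrices W_i) is not shown.

```python
def gcv(wls_k, k, total_obs, lam=4.0):
    """GCV = WLS_k * [1 - (lam * k + 1) / total_obs]^(-2).

    wls_k     : weighted sum of squared residuals of the k-term model
    k         : number of terms in the reduced model
    total_obs : total number of observations, sum of T_i over subjects
    lam       : complexity penalty (the text suggests values from 3 to 5)
    """
    shrink = 1.0 - (lam * k + 1.0) / total_obs
    if shrink <= 0.0:
        raise ValueError("model too large for the number of observations")
    return wls_k / shrink ** 2


# Hypothetical (k, WLS_k) pairs from a backward deletion sequence.
candidates = [(1, 1500.0), (3, 1300.0), (6, 1290.0)]
total_obs = 1200
best_k, best_wls = min(candidates, key=lambda c: gcv(c[1], c[0], total_obs))
```

In this made-up example the six-term model has the smallest WLS, but the penalty makes the three-term model the GCV minimizer.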

2.2 MASAL for multivariate phenotypes (multivariate MASAL)

In this subsection, we will discuss how to use MASAL to analyze multiple phenotypes. Let Y = (Y₁, …, Y_M )^T denote the M phenotypes of interest. Our goal is to assess the effect of X on the multiple phenotypes Y. Similar to the case of univariate phenotype, each phenotype of Y₁, …, Y_M is measured repeatedly over time. Although measurements may be taken at different times for different phenotypes for each subject, for clarity, we assume that M phenotypes are measured repeatedly over the same time points t_i₁, …, t_{iT_i} for the ith individual, where T_i is the number of repeated measurements, i = 1, …, n.

Let y_ijm denote the observed value of the mth phenotype Y_m at time t_ij for the ith individual, and X = (x₁, …, x_p)^T still denote the p covariates that consist of the measured environmental factors E and genotypes G measured on a series of SNPs. Although different phenotypes of each subject share the common environmental factors as well as the common genotypes, these common factors may have different effects on different phenotypes. To distinguish the different effects of X for different phenotypes, we should re-code different covariates for each one. For the mth phenotype Y_m, we define the corresponding covariates as

X_{m} = 1_{m} \otimes (x_{1}, \dots, x_{p}),

where ⊗ represents the Kronecker product and 1_m is an M-dimensional row-vector consisting of (M − 1) zeros and a single 1 in the mth position. After recoding, the new covariates corresponding to each phenotype become a pM-dimensional row-vector. For example, if we consider two phenotypes (M = 2) and three covariates (p = 3), X₁ and X₂ could be coded as

{(x_{1}, x_{2}, x_{3}, 0, 0, 0)}^{T}, {(0, 0, 0, x_{1}, x_{2}, x_{3})}^{T},

respectively. In general, let x_{l,ijm} denote the value of the lth covariate (l = 1, …, pM) at time t_{ij} for the mth phenotype of the ith individual.
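The recoding X_m = 1_m ⊗ (x₁, …, x_p) amounts to placing the p covariates in the mth block of a zero vector of length pM. A minimal Python sketch follows; the function name `recode` and the one-based phenotype index are choices made here for illustration.

```python
def recode(covariates, m, n_phenotypes):
    """Return 1_m (x) covariates as a flat list of length p * M.

    covariates   : the p covariate values (x_1, ..., x_p)
    m            : one-based index of the phenotype
    n_phenotypes : M, the number of phenotypes
    """
    p = len(covariates)
    coded = [0.0] * (p * n_phenotypes)
    start = (m - 1) * p
    coded[start:start + p] = covariates  # fill the m-th block only
    return coded


x = [1.0, 2.0, 3.0]      # hypothetical values of (x_1, x_2, x_3)
x_1 = recode(x, 1, 2)    # [1.0, 2.0, 3.0, 0.0, 0.0, 0.0]
x_2 = recode(x, 2, 2)    # [0.0, 0.0, 0.0, 1.0, 2.0, 3.0]
```

Because each phenotype gets its own block of columns, a term selected on column l of block m describes the effect of x_l on phenotype m only.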

Now, the nonparametric model of interest becomes

y_{ijm} = f_{m} (x_{1, ijm}, \dots, x_{p M, ijm}, t_{i j}) + e_{ijm},

(4)

where m = 1, …, M, j = 1, …, T_i, i = 1, …, n.

The procedure that is used by MASAL to fit model (4) is the same as in the univariate-trait case, except for the following modifications. First, the x in (2) now is x = (x₁, …, x_{pM})^T and B_k(x) is a special basis function of the pM covariates x = (x₁, …, x_{pM})^T. Second, the y_i in WLS should be

y_{i} = {(y_{i 11}, \dots, y_{i 1 M}, \dots, y_{i j 1}, \dots, y_{ijM}, \dots, y_{{i T}_{i} 1}, \dots, y_{{i T}_{i} M})}^{T}

and ŷ_i is the predicted value of y_i. Furthermore, W_i should be the within-subject covariance matrix for

e_{i} = {(e_{i 11}, \dots, e_{i 1 M}, \dots, e_{i j 1}, \dots, e_{ijM}, \dots, e_{{i T}_{i} 1}, \dots, e_{{i T}_{i} M})}^{T}, i = 1, \dots, n .

2.3 Fitted model and test procedure

According to the above discussion, all the terms in the fitted MASAL model fall into three categories: intercept, genetic covariate terms, and non-genetic covariate terms, where the genetic terms include main effects of genes, gene-gene, gene-environment, and gene-time interactions, and the non-genetic terms include environmental factors, time-dependent covariates, as well as interactions among them. In general, by re-arranging the order of terms, the final MASAL model can be summarized as

\hat{f} (x) = \hat{α} + {\hat{β}}^{T} B^{G} (x) + {\hat{γ}}^{T} B^{E T} (x),

(5)

where

B^{G} (x) = {(B_{1}^{G} (x), \dots, B_{k_{1}}^{G} (x))}^{T}

represents k₁ genetic covariate terms, and

B^{E T} (x) = {(B_{1}^{E T} (x), \dots, B_{k_{2}}^{E T} (x))}^{T}

represents k₂ non-genetic covariate terms. α̂ is the estimate of intercept, and

\hat{β} = {({\hat{β}}_{1}, \dots, {\hat{β}}_{k_{1}})}^{T}, \hat{γ} = {({\hat{γ}}_{1}, \dots, {\hat{γ}}_{k_{2}})}^{T}

are the estimates of the corresponding regression coefficients.

To assess the significance of gene, gene-gene, gene-environment, and gene-time interactions in association with traits, we propose a Wald-type statistic to test whether or not β = 0 based on the final MASAL model (5). The Wald-type statistic can be written as

W = {\hat{β}}^{T} {({\hat{Σ}}_{β, β})}^{- 1} \hat{β},

(6)

where Σ̂_{β,β} is the estimated covariance matrix of β̂, which can be obtained when the MASAL model is fitted.
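Given the coefficient estimates of the genetic terms and their estimated covariance matrix, the statistic in (6) is a quadratic form. The following Python sketch computes it by solving a linear system rather than forming the inverse explicitly. The helper names `solve` and `wald_statistic` and the numeric inputs are hypothetical; how MASAL itself produces the covariance estimate is not shown here.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def wald_statistic(beta_hat, cov_beta):
    """W = beta_hat^T cov_beta^{-1} beta_hat for the genetic terms."""
    z = solve(cov_beta, beta_hat)
    return sum(b * zi for b, zi in zip(beta_hat, z))


# Hypothetical estimates for two genetic terms.
w = wald_statistic([1.0, 2.0], [[2.0, 0.0], [0.0, 4.0]])  # 1/2 + 4/4
```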

As Zhang [15] pointed out, MASAL requires the identification of the adaptive knot allocation, and therefore, W no longer follows the chi-square distribution under the null hypothesis. We use a permutation procedure to establish the null distribution of W. The permutation is done by randomly assigning the genotypes while keeping the phenotype and non-genetic covariates intact for each individual. This permutation generates the data under the null hypothesis of no association between genetic covariates and the multiple phenotypes. For each permuted data set, we use MASAL to find the fitted model and then calculate the Wald-type statistic as we have done by using the original data set. We repeat this procedure 1000 times to generate the distribution of W under the null hypothesis of no association between genetic covariates and the multiple phenotypes.
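The permutation procedure can be sketched as follows. The sketch is generic: `statistic` stands in for "fit MASAL and compute W", which is not implemented here, and the toy statistic, the data, and the (count + 1)/(B + 1) form of the p-value are assumptions made for illustration rather than details taken from the paper.

```python
import random


def permutation_p_value(genotypes, other_data, statistic, n_perm=1000, seed=0):
    """Permute genotypes across individuals; keep everything else intact.

    genotypes  : list with one genotype record per individual
    other_data : phenotypes and non-genetic covariates, left untouched
    statistic  : callable(genotypes, other_data) -> float (placeholder for W)
    """
    rng = random.Random(seed)
    observed = statistic(genotypes, other_data)
    exceed = 0
    for _ in range(n_perm):
        shuffled = genotypes[:]      # copy so the original order is preserved
        rng.shuffle(shuffled)        # reassign genotypes to individuals
        if statistic(shuffled, other_data) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


def squared_covariance(genotypes, phenotypes):
    """Toy stand-in for the Wald-type statistic (not MASAL)."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    cov = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotypes)) / n
    return cov * cov


# Hypothetical data in which the phenotype depends strongly on the genotype.
genotypes = [0, 1, 2] * 20
phenotypes = [float(g) for g in genotypes]
observed, p_value = permutation_p_value(genotypes, phenotypes, squared_covariance)
```

Shuffling whole genotype records, rather than individual SNPs, keeps the linkage disequilibrium among markers intact while breaking any link to the phenotypes.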

3 Simulation

We carried out a series of simulations to demonstrate the promise of our method in genetic association studies of multiple phenotypes with repeated measures. To consider the linkage disequilibrium (LD) in our simulations, we simulated multi-locus genotypes at 11 SNPs by randomly drawing pairs of haplotypes according to the haplotype frequencies given by Yeager et al. [14, Fig. 3]. Then, we assumed the fifth SNP as the disease locus that affects the multiple phenotypes under study. This disease causing SNP is used to generate the phenotypes but is not used for association studies. In addition to the genotypes, a non-genetic binary covariate, X, was generated with

Pr (X = 1) = 0.3, Pr (X = 0) = 0.7.

To simulate the repeated-measure values of multiple phenotypes, for the mth phenotype, we consider a 3-dimensional function

f_{m} (X, g, t) = α_{m} X + β_{m} g + γ_{m} t,

where g = 0, 1, or 2 represents the number of copies of the minor allele at the disease locus, m = 1, …, M. For simplicity, we only consider two phenotypes (i.e., M = 2), and each of them was measured at three different time points t = 1, 2, and 3. For each subject, based on the obtained X and g, we generated the repeated-measure values of multivariate phenotypes

Y = {(y_{11}, y_{21}, y_{12}, y_{22}, y_{13}, y_{23})}^{T}

from the multivariate normal distribution N₆(μ, Σ), where

μ = {(f_{1} (X, g, 1), f_{2} (X, g, 1), f_{1} (X, g, 2), f_{2} (X, g, 2), f_{1} (X, g, 3), f_{2} (X, g, 3))}^{T},

Σ = (σ_{mt, m′t′}^{2})_{6 × 6} is the covariance matrix, and σ_{mt, m′t′}^{2} represents the covariance between y_{mt} and y_{m′t′}, m, m′ = 1, 2, t, t′ = 1, 2, 3. In our simulations, we let the variance σ_{mt, mt}^{2} = 1. Then the covariance

σ_{m t, m^{'} t^{'}}^{2} = ρ_{m t, m^{'} t^{'}} .

For ease of interpretation, we refer to ρ_{m1,m2}, ρ_{m2,m3}, ρ_{m1,m3} (m = 1, 2) as the within-trait correlation coefficients ρ_WT, and refer to the rest of the coefficients as the between-trait correlation coefficients ρ_BT. For all simulation studies, we let α_m = 0.2, γ_m = 0.1 for m = 1, 2, and let the within-trait correlation coefficients ρ_WT = 0.2.

For estimating the type I error, under the null hypothesis, the multiple phenotypes are not associated with the disease susceptibility locus. Then we let β_m = 0, m = 1, 2. To accommodate different covariance structures among multiple traits, we let the between-trait correlation coefficients ρ_BT = 0, 0.1, 0.2, and 0.3. To evaluate power, we let β_m = 0.2, m = 1, 2, and let the between-trait correlation coefficients

ρ_{B T} = - 0.3, - 0.2, - 0.1, 0, 0.1, 0.2, 0.3.

The simulations were replicated 2000 times (2000 data sets) for type I error estimation and power evaluation, with each data set consisting of 400 subjects.
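The phenotype-generating step for one subject can be sketched in Python as below. The sketch takes X and g as given, so the haplotype-based genotype simulation is not reproduced, and it uses the same α, β, and γ for both traits as in the settings above. The function names and the hand-written Cholesky factorization are illustrative choices, not the authors' code.

```python
import math
import random


def covariance(rho_wt, rho_bt, n_traits=2, n_times=3):
    """Covariance of (y_11, y_21, y_12, y_22, y_13, y_23) with unit variances."""
    size = n_traits * n_times
    sigma = [[0.0] * size for _ in range(size)]
    for r in range(size):
        for c in range(size):
            if r == c:
                sigma[r][c] = 1.0
            elif r % n_traits == c % n_traits:   # same trait, different time
                sigma[r][c] = rho_wt
            else:                                # different traits
                sigma[r][c] = rho_bt
    return sigma


def cholesky(a):
    """Lower-triangular L with L L^T = a (a must be positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def simulate_phenotypes(x, g, rho_bt, beta=0.2, alpha=0.2, gamma=0.1,
                        rho_wt=0.2, rng=random):
    """Draw Y ~ N_6(mu, Sigma) with mu_mt = alpha*x + beta*g + gamma*t."""
    low = cholesky(covariance(rho_wt, rho_bt))
    mean = [alpha * x + beta * g + gamma * t for t in (1, 2, 3) for _ in range(2)]
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    return [mu + sum(low[i][k] * z[k] for k in range(i + 1))
            for i, mu in enumerate(mean)]


y = simulate_phenotypes(x=1, g=2, rho_bt=0.1, rng=random.Random(1))
```

Setting `beta=0.0` gives data under the null hypothesis used for the type I error runs.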

To demonstrate the advantages of analyzing the multivariate phenotypes with repeated measures simultaneously, we also used the following two methods to analyze the simulated data and compared the results of them with that of our proposed method (Multivariate MASAL). One method is that we first used Univariate MASAL to analyze each longitudinal phenotype separately, tested the associations between genetic markers and each of the phenotype using the Wald-type statistic, and then adjusted for multiple testing by using Bonferroni correction. The other method is that we first took the average value of each phenotype over all the time points and then used Multivariate MASAL to analyze the multiple average phenotypes without repeated measures. For the purpose of illustration, we refer to these two methods as Univariate MASAL and One-Time-Point MASAL (OTP-MASAL), respectively.
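The two comparison strategies rely on two simple operations, sketched here in Python: averaging each phenotype over its time points (the input to OTP-MASAL) and a Bonferroni decision over per-phenotype p-values (used with Univariate MASAL). The function names are hypothetical, and the rule "declare an association if any adjusted test is significant" is our reading of the procedure, not a detail stated in the text.

```python
def time_average(y, n_traits=2):
    """Average each trait over time for y ordered as (y_11, y_21, y_12, y_22, ...)."""
    averages = []
    for m in range(n_traits):
        values = y[m::n_traits]          # all time points of trait m
        averages.append(sum(values) / len(values))
    return averages


def bonferroni_reject(p_values, alpha=0.05):
    """Reject the global null if any p-value is at most alpha / number of tests."""
    return min(p_values) <= alpha / len(p_values)


# Hypothetical measurements of two traits at three time points.
averaged = time_average([1.0, 10.0, 2.0, 20.0, 3.0, 30.0])   # [2.0, 20.0]
decision = bonferroni_reject([0.03, 0.20])                   # 0.03 > 0.025
```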

Table 1 provides the comparison results of type I error for Multivariate MASAL, Univariate MASAL, and OTP-MASAL. It is clear from Table 1 that the empirical type I errors of Multivariate MASAL are close to the nominal significance level. OTP-MASAL slightly inflates the type I error rate but it is still acceptable. For Univariate MASAL, the tests performed on each phenotype are not independent due to the correlations between the phenotypes, and this violation of the independence assumption should limit the ability of the Bonferroni adjustment to control type I error effectively. However, it is interesting to see that Univariate MASAL controls the type I error rates well according to the results of Table 1. We can also see from Table 1 that all three methods are very robust to different correlations among multiple phenotypes for the type I error estimation.

Table 1.

Estimated type I errors of Multivariate MASAL, Univariate MASAL, and OTP-MASAL at significance level of 0.05

ρ_BT	Multivariate MASAL	Univariate MASAL	OTP-MASAL
0.0	0.048	0.052	0.057
0.1	0.052	0.051	0.056
0.2	0.050	0.048	0.058
0.3	0.047	0.046	0.053

Table 2 provides the power comparison for the three methods at the significance level 0.05. Analyzing multiple longitudinal phenotypes jointly with Multivariate MASAL has the best power in all the cases shown in Table 2. This indicates that there is a substantial loss of power if we analyze each longitudinal phenotype separately by using Univariate MASAL or if we analyze the average value of each phenotype over all the time points when multiple longitudinal phenotypes are available. One explanation is that Multivariate MASAL outperforms Univariate MASAL by avoiding Bonferroni correction. As we pointed out above, correlations among the phenotypes result in the dependence of the tests performed on each phenotype, which limits the ability of the Bonferroni adjustment for Univariate MASAL. In our simulation studies, the two phenotypes are correlated even when ρ_BT = 0, because they share the common factors (the covariate X, gene, and time) in the linear model used to simulate the data. For OTP-MASAL, the loss of power is mostly due to the change of correlation information between phenotypes when taking the average value of each phenotype over all the time points. However, neither Univariate MASAL nor OTP-MASAL is uniformly better than the other in our simulation results. Furthermore, we can see from Table 2 that the performance of the three methods varies across different correlations between phenotypes, especially for Multivariate MASAL and OTP-MASAL. As expected, Multivariate MASAL is more powerful when the correlation between the two phenotypes becomes stronger. Interestingly, the power of OTP-MASAL increases as the correlation between the two phenotypes changes from positive to negative. One possible explanation is that the correlation between the average phenotypes is no longer the same as the correlation between the original phenotypes.

Table 2.

Power comparisons of Multivariate MASAL, Univariate MASAL, and OTP-MASAL at significance level of 0.05

ρ_BT	Multivariate MASAL	Univariate MASAL	OTP-MASAL
0.3	0.892	0.592	0.551
0.2	0.791	0.613	0.583
0.1	0.715	0.627	0.615
0.0	0.724	0.601	0.629
−0.1	0.771	0.629	0.632
−0.2	0.787	0.642	0.684
−0.3	0.816	0.657	0.690

4 Application on Framingham Heart Study (FHS)

4.1 Background

Cardiovascular disease (CVD) is the leading cause of death and serious illness in the US. The Framingham Heart Study, founded in 1948 and now under the direction of the National Heart, Lung, and Blood Institute (NHLBI), aims to identify the common factors or characteristics that contribute to CVD. The participants in the FHS belong to three different cohorts (Original Cohort, Offspring Cohort, and Gen3 Cohort) and have been examined every two to four years over a long period of time. The FHS is thus a longitudinal cohort study.

Many studies pointed out that blood lipid levels, such as low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C), and triglycerides (TG), are heritable risk factors for CVD [4–6,10–12]. However, some of these studies did not directly deal with longitudinal data but required the longitudinal measures to be summarized into a one-time-point trait [5]. Furthermore, few studies analyzed several blood lipid levels together; most methods examined each blood lipid level one at a time. The goal of the present study is to illustrate the practical relevance of the proposed method for analyzing multiple blood lipid levels jointly in longitudinal cohort studies.

4.2 Data analysis

Phenotypes of our primary interest are the blood lipid levels measured in the main exams of the FHS. Specifically, we considered three quantitative traits: total cholesterol (CHOL), HDL-C, and TG. Although LDL-C can be calculated from CHOL, HDL-C, and TG according to Friedewald’s formula, the formula is only valid if the TG level is less than 400 mg/dL. As a result, we used CHOL instead of LDL-C here. In the main clinical exams, blood lipid levels are available from only a single exam for Gen3, and HDL-C levels are not available at all for the Original Cohort, so these datasets are not suitable for longitudinal analyses. We restricted our analyses to participants from the Offspring Cohort, where the blood lipid levels were measured at up to four exam visits (exams 1, 3, 5, and 7). For each family, we randomly selected one subject and obtained an independent sample consisting of 702 subjects. Besides blood lipid levels, we also included sex, age, body mass index (BMI), smoking status, and alcohol intake and treated them as covariates in our analyses; these are all repeatedly measured variables.
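For reference, the Friedewald estimate mentioned above is LDL-C = CHOL − HDL-C − TG/5 with all quantities in mg/dL. A small Python sketch of the formula and its validity restriction follows; the function name and the choice to return `None` outside the valid range are illustrative only, and the formula was not used in the analysis reported here.

```python
def friedewald_ldl(chol, hdl, tg):
    """Estimate LDL-C (mg/dL) as CHOL - HDL-C - TG/5.

    Returns None when TG is 400 mg/dL or higher, where the estimate
    is not considered valid.
    """
    if tg >= 400:
        return None
    return chol - hdl - tg / 5.0


ldl = friedewald_ldl(chol=200.0, hdl=50.0, tg=150.0)  # 200 - 50 - 30
```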

We focused on the gene LPL and studied the associations between this candidate gene and the three blood lipid traits jointly. The gene LPL, especially the SNP rs328 (S447X) in LPL, was previously found to be associated with TG and HDL-C [5–7]. In the FHS 50K SNP data set, 29 SNPs were genotyped around the LPL gene as well as its neighboring genes. In order to minimize false positive associations due to rare SNPs and genotyping artifacts, we limited our analyses to SNPs with a minor allele frequency of at least 1% and a p-value greater than 0.001 for testing Hardy-Weinberg equilibrium with Pearson’s chi-squared test. Thus, a total of 22 SNPs remained in our analyses.
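The two SNP filters can be sketched in Python from genotype counts. The sketch uses the standard one-degree-of-freedom Pearson chi-squared test for Hardy-Weinberg equilibrium, with the p-value obtained from the identity P(χ²₁ > c) = erfc(√(c/2)). The function names and the example counts are hypothetical, and the exact test variant used in the original analysis (for example, any continuity correction) is not stated in the text.

```python
import math


def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Minor allele frequency from counts of the three genotypes."""
    n = n_aa + n_ab + n_bb
    freq_b = (2 * n_bb + n_ab) / (2 * n)
    return min(freq_b, 1.0 - freq_b)


def hwe_p_value(n_aa, n_ab, n_bb):
    """Pearson chi-squared test of Hardy-Weinberg equilibrium (1 df)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0                       # monomorphic marker: nothing to test
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    observed = [n_aa, n_ab, n_bb]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return math.erfc(math.sqrt(chi2 / 2.0))


def keep_snp(n_aa, n_ab, n_bb, min_maf=0.01, min_hwe_p=0.001):
    """Apply both filters: MAF >= 1% and HWE p-value > 0.001."""
    return (minor_allele_frequency(n_aa, n_ab, n_bb) >= min_maf
            and hwe_p_value(n_aa, n_ab, n_bb) > min_hwe_p)


kept = keep_snp(360, 480, 160)      # counts in exact HWE proportions
dropped = keep_snp(500, 0, 500)     # no heterozygotes at all
```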

We first performed an association analysis for each of the three traits separately by using Univariate MASAL. None of these three analyses revealed any evidence of association between a single trait and the underlying SNPs. In fact, MASAL did not put any SNP terms into the final fitted model when it handled each trait separately. In this situation, it is not necessary to perform the testing procedure for the associations between traits and SNPs. However, when the three traits were analyzed jointly, not only the main effects of SNPs but also SNP-SNP and SNP-covariate interactions were selected in the final MASAL model. Based on the final model, we list all the terms that are related to the TG trait as well as the corresponding coefficients in Table 3. For the CHOL and HDL-C traits, only the BMI covariate has an effect. We further performed a permutation study to test the associations between genetic covariates and the three traits jointly, where the value of the corresponding Wald statistic is 370.53, and the result revealed a significant association (p-value = 0.027). Thus, the power of the association test was increased by analyzing the three traits jointly.

Table 3.

Main and interaction effect on TG trait selected by MASAL

effect	selected terms	coefficients
Main effect	rs6988825	−62.4857
	Sex	−108.4280
	Alcohol	−33.6991
	BMI	+8.9254
Gene-gene interaction	rs328 × rs6988825	+13.7025
	rs6988825 × rs6988757 × rs2975418	+22.1655
Gene-environment interaction	rs6988825 × (Alcohol-1.6556)⁺	+16.2935
	rs4922086 × Alcohol	+3.4890
	rs3923920 × Alcohol	+10.3054
	rs6988825 × rs3923920 × (Alcohol-1.6556)⁺	−8.4261
Environment-environment interaction	Sex × Age	+1.4097
	BMI × (Age-23)⁺	−0.2104
	BMI × (Age-53.4752)⁺	+0.1683
	Alcohol × Smoking	+2.8496

A × B denotes interaction term between A and B

From Table 3, we can see that SNP rs6988757 was selected by MASAL and had an interaction effect on the TG trait. This SNP is in strong linkage disequilibrium with SNP rs328 (D′ = 1), which was previously found to be associated with TG and HDL-C [5–7]. Furthermore, an interaction between SNP rs328 and SNP rs6988825 was also found by MASAL, where SNP rs6988825 lies in CSGALNACT1, a gene that plays a role in the initiation and elongation steps of chondroitin sulfate synthesis. Besides the gene-gene interactions, the fitted model also provided gene-covariate interactions (e.g., rs4922086 and alcohol; rs6988825, rs3923920, and alcohol).

5 Discussion

Understanding comorbidity is a very important and challenging issue in the genetic studies of complex diseases, especially for mental health conditions. A common approach is to analyze all the phenotypes of interest together instead of analyzing one at a time. In this article, we proposed a nonparametric regression method to test associations between genes and multivariate phenotypes with repeated measures simultaneously, which is based on a proven method, MASAL [15]. The major advantage of our proposed method is that it can handle complex data sets with two distinguishing characteristics, multivariate phenotypes and longitudinal data, which are hard to handle simultaneously with the existing methods. In addition, the new method can accommodate gene-gene, gene-environment, and time-covariate interactions in association studies.

On the basis of our simulations, we found that our proposed method gained more power when we analyzed multivariate phenotypes with longitudinal measurements jointly, as opposed to analyzing a single longitudinal phenotype one at a time or analyzing multivariate phenotypes by summarizing repeated measures into a one-time-point value. By applying our method to the FHS data focusing on the blood lipid levels, we demonstrated that the analysis of multivariate longitudinal phenotypes enhanced the significance of the association test. Furthermore, our method did identify gene-gene and gene-environment interactions (e.g., SNP rs328 and SNP rs6988825, alcohol and SNP rs4922086). Previous studies have shown that rs328 was associated with TG and HDL-C [5–7]. Our method not only verified their findings, but also drew our attention to the interactions between rs328 and other SNPs on the TG trait, which, to the best of our knowledge, have not been reported by other methods. Thus, further investigation is warranted to confirm our findings.

We used a permutation procedure to establish the null distribution of the Wald-type statistic because it no longer follows the chi-square distribution under the null hypothesis as Zhang [15] pointed out. The computation time of permutation is reasonable for a real data set, but can be intensive for simulation studies. Moreover, to permute the data, we randomly assigned the genotypes (G) only, while keeping the phenotype (Y ) and non-genetic covariates (E) intact for each individual. However, this may destroy the correlation between E and G although our method works well for simulation studies and real data analysis. Theoretical studies for exploring the asymptotic distribution of the proposed statistic will be useful.

Acknowledgments

The authors thank two anonymous referees for their constructive comments and suggestions. This work was supported by grant R01 DA016750-09 from the National Institute on Drug Abuse. Zhu’s work was also supported by the National Natural Science Foundation of China (Grant No. 11001044), the Fundamental Research Funds for the Central Universities (11CXPY007, 10JCXK001), the Natural Science Foundation of Jilin Province (Grant No. 201215007), the Scientific Research Foundation for Returned Scholars, MOE of China, and the Program for Changjiang Scholars and Innovative Research Team in University. The Framingham Heart Study project is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University (N01 HC25195). The Framingham data used for the analyses described in this manuscript were obtained through dbGaP (phs000128.v3.p3).

References

1.Carlborg O, Haley CS. Epistasis: too often neglected in complex trait studies? Nat Rev Genet. 2004;5:618–625. doi: 10.1038/nrg1407. [DOI] [PubMed] [Google Scholar]
2.Friedman JH. Multivariate adaptive regression splines. Ann Stat. 1991;19:1–141. doi: 10.1177/096228029500400303. [DOI] [PubMed] [Google Scholar]
3. Kallberg H, Padyukov L, Plenge RM, et al. Gene-gene and gene-environment interactions involving HLA-DRB1, PTPN22, and smoking in two subsets of rheumatoid arthritis. Am J Hum Genet. 2007;80:867–875. doi: 10.1086/516736.
4. Kannel WB, Dawber TR, Kagan A, et al. Factors of risk in the development of coronary heart disease-six year follow-up experience. The Framingham Study. Ann Intern Med. 1961;55:33–50. doi: 10.7326/0003-4819-55-1-33.
5. Kathiresan S, Manning AK, Demissie S, et al. A genome-wide association study for blood lipid phenotypes in the Framingham Heart Study. BMC Med Genet. 2007;8(Suppl 1):S17. doi: 10.1186/1471-2350-8-S1-S17.
6. Kathiresan S, Melander O, Guiducci C, et al. Six new loci associated with blood low-density lipoprotein cholesterol, high-density lipoprotein cholesterol or triglycerides in humans. Nat Genet. 2008;40:189–197. doi: 10.1038/ng.75.
7. Kooner JS, Chambers JC, Aguilar-Salinas CA, et al. Genome-wide scan identifies variation in MLXIPL associated with plasma triglycerides. Nat Genet. 2008;40:149–151. doi: 10.1038/ng.2007.61.
8. Lange C, Silverman E, Xu X, et al. A multivariate family-based association test using generalized estimating equations: FBAT-GEE. Biostatistics. 2003;4:195–206. doi: 10.1093/biostatistics/4.2.195.
9. Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22.
10. Miller NE, Miller GJ. Letter: high-density lipoprotein and atherosclerosis. Lancet. 1975;1:10–33. doi: 10.1016/s0140-6736(75)91977-7.
11. Namboodiri KK, Kaplan EB, Heuch I, et al. The Collaborative Lipid Research Clinics Family Study: biological and cultural determinants of familial resemblance for plasma lipids and lipoproteins. Genet Epidemiol. 1985;2:227–254. doi: 10.1002/gepi.1370020302.
12. Pollin TI, Damcott CM, Shen H, et al. A null mutation in human APOC3 confers a favorable plasma lipid profile and apparent cardioprotection. Science. 2008;322:1702–1705. doi: 10.1126/science.1161524.
13. Xu X, Tian L, Wei LJ. Combining dependent tests for linkage or association across multiple phenotypic traits. Biostatistics. 2003;4:223–229. doi: 10.1093/biostatistics/4.2.223.
14. Yeager M, Orr N, Hayes RB, et al. Genome-wide association study of prostate cancer identifies a second risk locus at 8q24. Nat Genet. 2007;39:645–649. doi: 10.1038/ng2022.
15. Zhang HP. Multivariate adaptive splines for analysis of longitudinal data. J Comput Graph Stat. 1997;6:74–91.
16. Zhang HP. Analysis of infant growth curves using multivariate adaptive splines. Biometrics. 1999;55:452–459. doi: 10.1111/j.0006-341x.1999.00452.x.
17. Zhang HP. Mixed effects multivariate adaptive splines model for the analysis of longitudinal and growth curve data. Stat Methods Med Res. 2004;13:63–82. doi: 10.1191/0962280204sm353ra.
18. Zhang HP, Liu C-T, Wang XQ. An association test for multiple traits based on the generalized Kendall’s tau. J Amer Stat Assoc. 2010;105:473–481. doi: 10.1198/jasa.2009.ap08387.
19. Zhang HP, Zhong X. Linkage analysis of longitudinal data and design consideration. BMC Genet. 2006;7:37. doi: 10.1186/1471-2156-7-37.
20. Zhu WS, Jiang Y, Zhang HP. Nonparametric covariate-adjusted association tests based on the generalized Kendall’s tau. J Amer Stat Assoc. 2012;107:1–11. doi: 10.1080/01621459.2011.643707.
21. Zhu WS, Zhang HP. Why do we test multiple traits in genetic association studies? (with discussion) J Korean Stat Soc. 2009;38:1–10. doi: 10.1016/j.jkss.2008.10.006.

