Summary
Genetic association studies with longitudinal markers of chronic diseases (e.g., blood pressure, body mass index) provide a valuable opportunity to explore how genetic variants affect traits over time by utilizing the full trajectory of longitudinal outcomes. Since these traits are likely influenced by the joint effect of multiple variants in a gene, a joint analysis of these variants that accounts for linkage disequilibrium (LD) may help to explain additional phenotypic variation. In this article, we propose a longitudinal genetic random field model (LGRF) to test the association between a phenotype measured repeatedly during the course of an observational study and a set of genetic variants. Generalized score type tests are developed, which we show are robust to misspecification of within-subject correlation, a feature that is desirable for longitudinal analysis. In addition, a joint test incorporating gene-time interaction is proposed. Computational advances are made for scalable implementation of the proposed methods in large-scale genome-wide association studies (GWAS). The proposed methods are evaluated through extensive simulation studies and illustrated using data from the Multi-Ethnic Study of Atherosclerosis (MESA). Our simulation results indicate a substantial gain in power using LGRF when compared with two commonly used existing alternatives: (i) single marker tests using the longitudinal outcome and (ii) existing gene-based tests using the average value of repeated measurements as the outcome.
Keywords: Genetic association, Generalized estimating equations, Generalized score test, Longitudinal study, Multi-marker test, Random field
1. Introduction
Genome-wide association studies (GWAS) have been successful in identifying susceptibility loci for risk factors of chronic diseases. For genetic studies of cardiovascular disease risk factors, such as the Multi-Ethnic Study of Atherosclerosis (MESA), observations at multiple time points are available for each individual (Bild, et al., 2002). The longitudinal nature of these studies results in more precise phenotypic characterization, enhancing the ability to associate genes or chromosomal regions with the phenotypes and to assess gene-time interaction. However, current statistical methods for testing genetic association in longitudinal studies in the presence of effect heterogeneity across time are limited, even for analyses of one single-nucleotide polymorphism (SNP) at a time (Fan, et al., 2012; Furlotte, Eskin and Eyheramendy, 2012). Investigators often take the simple approach of collapsing the repeated measurements into a single value, which cannot harness the complete information contained in the longitudinal trajectory. One can also apply the standard methods available for correlated outcome models to better utilize the longitudinal data, namely, random effects models (Fitzmaurice, Laird and Ware, 2011) and generalized estimating equations (GEE) (Zeger and Liang, 1986). These methods were primarily proposed for modeling and testing a limited number of SNPs, and cannot be directly applied, without further modification, to assess the joint association of a longitudinally varying outcome with an entire gene or a region containing hundreds of SNPs.
Recent studies showed the advantages of multi-marker tests over individual SNP analyses. First, the genetic markers in LD with the causal SNP(s) carry additional information and may enhance the power of identifying the true effect. Second, gene-based tests considerably reduce the burden of multiple comparisons. Third, region-based methods are appealing for multi-ethnic cohorts because LD structure differs across ethnic groups, and thus meta-analysis of a region-based statistic is likely to be more consistent than meta-analysis of single marker tests across ethnicities. Last, gene-based tests enhance the power of identifying rare-variant association in next generation sequencing studies (Morris and Zeggini, 2010). Two notable existing approaches are the sequence kernel association tests (SKAT) (Wu, et al., 2011) and similarity regression (SIMreg) (Tzeng, et al., 2009). Building on a random field framework and borrowing ideas from spatial statistics, the genetic random field model (GenRF) was recently developed for modeling and testing joint associations (He, et al., 2014; Li, et al., 2014). So far, however, extensions of these methods are not available for longitudinal data.
It is desirable to have a multi-marker test for longitudinal studies that can incorporate the time-dependent variation in outcome, utilize all the variants in a gene or region and boost power in the presence of effect heterogeneity across time. Extending the GenRF method to the longitudinal setting, we propose a longitudinal genetic random field model (LGRF) and develop generalized score type tests to study the association between repeatedly measured phenotypes and a set of genetic variants in a gene or region. The methods are evaluated through extensive simulation studies and illustrated by analyzing the association between blood pressure and 29 candidate genomic regions across four ethnic groups in MESA.
2. Method
2.1 Notation
Consider a study population of m subjects, and the i-th subject has ni repeated observations. Each subject is sequenced in a region of interest with q variants, and measured on p additional non-genetic covariates such as age, gender and other potential confounders. Let Yi,l be the phenotypic value for the l-th observation on the i-th subject, measured at time ti,l; Gi = (Gi,1, Gi,2, . . ., Gi,q)T be the genotypes for the q variants within the region where Gi,h ∈ {0, 1, 2} for any 1 ≤ h ≤ q, which does not change over time; Xi,l = (Xi,l,1, . . ., Xi,l,p)T be the covariates corresponding to the l-th observation on the i-th subject, either time-varying or time-invariant. We denote n = Σi ni, Yn×1 = (Y1,1, . . ., Y1,n1, Y2,1, . . .)T and define Xn×p, Gn×q similarly for covariates and genotypes. We are interested in investigating the association between phenotype Yn×1 and variants Gn×q, adjusted for the effect of Xn×p.
2.2 Longitudinal Genetic Random Field Model
The GenRF method (He, et al., 2014) is a gene-based association test motivated by the general idea that, if the genetic variants in a region are jointly associated with the phenotype, then subjects having similar genotypes in that region will have similar phenotype (Tzeng, et al., 2009). Motivated by development in spatial statistics (Cressie, 1993) and random field theory (Besag, 1974; Adler and Taylor, 2007), GenRF views phenotypic values as a random field on a genetic space where the vector of genotype sequences determines the location in the space; i.e., the phenotype at each location is a random variable and these random variables are possibly correlated depending on their spatial location, e.g., the closer the more similar. It directly regresses the phenotype of a given subject on that of all others, where the contribution of other subjects is weighted by their genotype similarity with the given subject. This leads to a conditional autoregressive model commonly used in spatial statistics to study spatial dependence.
With repeated measurements, one has to appropriately account for the within-subject correlation between outcomes to obtain valid inference and improve efficiency. Extending the GenRF model to the longitudinal setting, we propose a longitudinal GenRF (LGRF) model, where the conditional mean of each observation is modeled as a weighted sum of all other observations, including those from the same subject. In a longitudinal setting, one may expect phenotypes from the same subject to be more similar for reasons other than the shared genetic variants of interest. To capture this, we define a within-subject similarity, which depends on the time between two measurements on the same subject; for example, if two observations are measured closer in time, their within-subject similarity may be larger. Formally, the LGRF model is written as:
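$$Y_{i,l} \mid Y_{-(i,l)} = X_{i,l}^T\beta + \sum_{k \neq l} w(t_{i,k}, t_{i,l}; \eta)\,(Y_{i,k} - X_{i,k}^T\beta) + \gamma \sum_{(j,k) \neq (i,l)} s_{i,j}\,(Y_{j,k} - X_{j,k}^T\beta) + \varepsilon_{i,l} \qquad (1)$$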
where Y–(i,l) denotes all other phenotypic values except Yi,l; Xi,l and β are, respectively, covariates and the corresponding regression coefficients, and thus Xi,lTβ is the contribution to the outcome mean from non-genetic covariates; εi,l ~ i.i.d. N(0, σ2); w(ti,k, ti,l; η) is the within-subject similarity between Yi,k and Yi,l, with parameters η playing the role of introducing within-subject correlation between repeated measurements, similar to the parameters of a correlation matrix in a GEE framework; si,j is the genetic similarity between subjects i and j. Possible choices of si,j include the genetic relationship (GR) similarity (Yang, et al., 2011), which is defined in terms of the population allele frequency ph of the h-th SNP in the region, and the identity-by-state (IBS) similarity. Parameter γ measures the magnitude of the joint association between genetic variants and the phenotype. If none of the genetic variants are associated with the phenotype, the phenotype of subject i will be unrelated to the phenotypes of others regardless of their proximity in the genetic space, i.e., γ = 0. In contrast, a large positive γ indicates a strong spatial dependence or, equivalently, genetic association. Thus, γ can be interpreted as the magnitude of the joint association between the q genetic variants and the phenotype. Briefly, the conditional autoregressive model relates each observation to others measured on the same subject by the within-subject similarity w(ti,k, ti,l; η), and to all other observations (including other measurements on the same subject) in the study by the genetic similarity si,j.
According to the factorization theorem of Besag (1974), the conditional model (1) uniquely determines a joint distribution of Y:
$$Y \sim N\left(X\beta,\ \sigma^2\{I - W(\eta) - \gamma S\}^{-1}\right) \qquad (2)$$
where I is an n × n identity matrix; W(η) and S are matrices (n × n) composed of w(ti,k, ti,l; η) and si,j, respectively. Specifically, the within-subject similarity matrix W(η) is block diagonal with block i (ni × ni) corresponding to subject i and the (k, l)-th element of block i is w(ti,k, ti,l; η) except for diagonal elements of W(η). The genetic similarity matrix S is composed of m × m block matrices with dimension ni × nj, i, j = 1, . . ., m, and all elements in the (i, j)-th block are si,j except for the diagonal elements of S. The diagonal elements of W(η) and S are 0 as in model (1) observations are not compared with themselves. To evaluate the joint association of multiple genetic variants with the phenotype we can test the null hypothesis H0 : γ = 0 involving a single parameter in the precision matrix (or equivalently in the variance matrix).
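The block structure of S described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the IBS formula used here (average allele sharing, (2q − Σh|Gi,h − Gj,h|)/(2q)) is a commonly used definition and is assumed rather than taken from this article, and the function names `ibs_similarity` and `expand_to_observations` are hypothetical.

```python
import numpy as np

def ibs_similarity(G):
    """Subject-level IBS similarity (assumed form): average allele sharing over q variants."""
    G = np.asarray(G, dtype=float)
    q = G.shape[1]
    # Sum over variants of |G_ih - G_jh| for every pair of subjects (m x m).
    diff = np.abs(G[:, None, :] - G[None, :, :]).sum(axis=2)
    return (2.0 * q - diff) / (2.0 * q)

def expand_to_observations(s, n_obs):
    """Expand an m x m subject-level similarity to the n x n matrix S.

    Every element of the (i, j)-th block (n_i x n_j) equals s[i, j];
    the diagonal of S is set to zero because observations are not
    compared with themselves.
    """
    idx = np.repeat(np.arange(len(n_obs)), n_obs)  # subject index of each observation
    S = np.asarray(s, dtype=float)[np.ix_(idx, idx)]
    np.fill_diagonal(S, 0.0)
    return S

# Toy example: m = 3 subjects, q = 3 variants, n_i = 2, 3, 1 observations.
G = np.array([[0, 1, 2], [0, 1, 1], [2, 0, 0]])
n_obs = [2, 3, 1]
S = expand_to_observations(ibs_similarity(G), n_obs)
```

Note that the off-diagonal entries of a within-subject block equal si,i, so repeated measurements on the same subject are also linked through S, as stated in the text.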
With respect to the within-subject similarity, the random field model focuses on how the observations are related, regardless of the direction (past or future) as opposed to transition models which condition each observation only on the past observations. However, they can result in very similar marginal correlation structures such as the first-order auto-regressive (AR1) correlation. Examples of plausible W(η) are given below.
Example 1
One might assume observations from the same subject to be equally similar and set w(ti,k, ti,l; η) = η for all i, k, l, so that in matrix notation W(η) = ηT, where T is a block diagonal matrix whose i-th block, i = 1, . . ., m, is an ni × ni matrix with 0's on the diagonal and 1's off the diagonal. Under H0 : γ = 0, the corresponding covariance matrix is σ2(I – ηT)–1. This specification is equivalent to the usual compound symmetric correlation.
Example 2
One might assume each observation conditionally depends on only the nearest observations before and after it (Markov property): w(ti,k, ti,l; η) = η if |k – l| = 1, and 0 otherwise. This is an approximation of the usual AR1 correlation by ignoring the edge effect (Qu, et al., 2000). Again W(η) = ηT for a block diagonal matrix T, where the (k, l)-th element of the i-th block is 1 if |k – l| = 1 and 0 otherwise.
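The two examples can be checked numerically for a single subject. The sketch below builds one block of T for each example and inverts I − ηT to obtain the implied covariance under H0; the function names and the values of η are illustrative assumptions, chosen so that I − ηT is positive definite.

```python
import numpy as np

def t_block(n_i, kind):
    """One n_i x n_i block of T: 'cs' for Example 1, 'nn' (nearest neighbor) for Example 2."""
    k = np.arange(n_i)
    lag = np.abs(k[:, None] - k[None, :])
    if kind == "cs":
        return (lag > 0).astype(float)   # 1 everywhere except the diagonal
    if kind == "nn":
        return (lag == 1).astype(float)  # 1 only for adjacent observations
    raise ValueError(f"unknown similarity type: {kind}")

def null_covariance(n_i, kind, eta, sigma2=1.0):
    """Covariance sigma^2 (I - eta T)^{-1} of one subject's observations under H0."""
    T = t_block(n_i, kind)
    return sigma2 * np.linalg.inv(np.eye(n_i) - eta * T)

cov_cs = null_covariance(4, "cs", eta=0.2)  # exchangeable: equal off-diagonal entries
cov_nn = null_covariance(4, "nn", eta=0.3)  # covariance decays with the lag |k - l|
```

With four observations and η = 0.2, the compound symmetric case gives a common correlation of η/(1 − 2η) = 1/3, while the nearest-neighbor case gives covariances that shrink as the lag grows, which is the AR1-like behavior described in Example 2.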
In addition, multiple within-subject similarities can be combined for a better working precision matrix, adaptively approximating the underlying structure. Taking W(η) to be linear in η, e.g., the two examples given above and their linear combinations, can lead to a rich class to accommodate many commonly used working correlation structures. A similar idea has been studied by Qu, et al. (2000) to improve efficiency of estimation over GEE method.
As in the GEE framework, the within-subject similarity matrix W(η), or equivalently the correlation matrix {I – W(η)}–1 under the null, is only a working assumption that is not required to be correct for valid inference. Thus we present our test using a working within-subject similarity matrix that is of the form ηT, as in the two examples, and note the method applies to more general W(η). For simplicity, the matrix representation of the LGRF model is given by:
$$Y \mid Y_{-} = X\beta + (\eta T + \gamma S)(Y - X\beta) + \varepsilon \qquad (3)$$
where Y is the n-dimensional vector of all observations; Y|Y– denotes that each observation Yi,l is conditioned on all other observations Y–(i,l); the matrices T and S have diagonal elements equal to zero, reflecting that the mean of each element of Y depends only on the other elements and not on itself; and ε = (ε1,1, . . ., ε1,n1, ε2,1, . . .)T is the residual vector. Since the genetic similarities are compared across all observations, the model does not have the Markov property (i.e., each observation having a finite number of neighbors) that is typically assumed in a conditional auto-regressive model in spatial statistics. Thus the regular likelihood-ratio test or score test used in spatial statistics for testing spatial auto-correlation cannot be applied directly. Also, because of the within-subject similarity, the pseudo-likelihood approach developed by He, et al. (2014) does not apply. Instead, we propose a set of generalized score type tests.
2.3 Association Test under the Longitudinal Genetic Random Field Model
In this subsection we focus on developing a generalized score type test for testing H0 : γ = 0 under model (3). The inference procedure is developed by treating the within-subject correlation as a working model, leading to a test that is robust to misspecification of the correlation structure. Model (3) states that, given all other observations, the conditional mean of each observation is linearly related to the others, i.e., E(Y|Y–) = Xβ + (ηT + γS)(Y – Xβ). Adapting the argument used for the usual GEE method (Zeger and Liang, 1986) to our conditional auto-regressive model, we construct the following generalized estimating function:
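$$U_\gamma(\beta, \eta, \gamma) = (Y - \mu)^T S\,\{(Y - \mu) - (\eta T + \gamma S)(Y - \mu)\} = (Y - \mu)^T S\,(I - \eta T - \gamma S)(Y - \mu) \qquad (4)$$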
where μ = Xβ. The estimating equation is quadratic in Y because γ is a coefficient in an auto-regressive model and corresponds to a parameter in the marginal variance as in (2). In the Supplementary Materials section 1.1, we show that the above estimating function is unbiased in the sense that its expectation is zero under the truth. Therefore, following Boos (1992), we refer to it as a “generalized” score and the score evaluated at γ = 0, i.e., Uγ(β, η, 0) = (Y – μ)T S(I – ηT)(Y – μ), can be used to construct a generalized score type test. Due to the unbiasedness, we show that Uγ(β, η, 0) has mean 0 under H0 and positive mean γE{(Y – μ)T S2(Y – μ)} under H1 : γ > 0. This rationale leads to constructing a generalized score statistic
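$$Q_G = U_\gamma(\hat\beta, \hat\eta, 0) = (Y - \hat\mu)^T S\,(I - \hat\eta T)(Y - \hat\mu), \qquad \hat\mu = X\hat\beta \qquad (5)$$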
and rejecting H0 when it is sufficiently large. In (5), $\hat\beta$ and $\hat\eta$ are estimates under the null hypothesis that γ = 0. Specifically, $\hat\beta$ and $\hat\eta$ are the solution to the following estimating equations:
$$X^T (I - \eta T)(Y - X\beta) = 0, \qquad (Y - X\beta)^T\, T\, (I - \eta T)(Y - X\beta) = 0.$$
The first equation is the usual estimating equation for β in GEE based on the joint distribution (2), as I – ηT is proportional to the inverse of a working correlation matrix under H0. The second equation is derived by considering the estimating function under H0. It is worth noting that the second estimating equation is linear in η. Thus the estimators $\hat\beta$ and $\hat\eta$ can be calculated by iteratively solving linear equations. This property remains when we linearly combine multiple within-subject similarities, leading to an efficient way to estimate the correlation structure.
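The iterative scheme can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the null estimating equations take the form X^T(I − ηT)(Y − Xβ) = 0 and (Y − Xβ)^T T(I − ηT)(Y − Xβ) = 0 (consistent with the description above, but an assumption), it assumes the statistic is (Y − μ̂)^T S(I − η̂T)(Y − μ̂), and it centers genotypes by their sample mean instead of 2ph. The names `fit_null` and `score_statistic` and all simulation settings are hypothetical.

```python
import numpy as np

def fit_null(Y, X, T, n_iter=100, tol=1e-10):
    """Alternate between the two linear estimating equations under H0 (gamma = 0)."""
    TX, TY = T @ X, T @ Y

    def solve_beta(eta):
        # Solves X'(I - eta T)(Y - X beta) = 0 for beta at a fixed eta.
        return np.linalg.solve(X.T @ (X - eta * TX), X.T @ (Y - eta * TY))

    eta = 0.0
    for _ in range(n_iter):
        beta = solve_beta(eta)
        r, Tr = Y - X @ beta, TY - TX @ beta
        # r'T(I - eta T)r = 0 is linear in eta, so eta = r'Tr / r'TTr.
        eta_new = (r @ Tr) / (Tr @ Tr)
        converged = abs(eta_new - eta) < tol
        eta = eta_new
        if converged:
            break
    return solve_beta(eta), eta

def score_statistic(Y, X, T, S):
    """Generalized score statistic evaluated at the null estimates."""
    beta, eta = fit_null(Y, X, T)
    r = Y - X @ beta
    return float(r @ S @ (r - eta * (T @ r))), beta, eta

# Simulated null data: 300 subjects, 4 visits, random intercept (compound symmetry).
rng = np.random.default_rng(0)
m, n_rep, q = 300, 4, 5
idx = np.repeat(np.arange(m), n_rep)
T = (idx[:, None] == idx[None, :]).astype(float)
np.fill_diagonal(T, 0.0)
G = rng.binomial(2, 0.3, size=(m, q)).astype(float)
Z = (G - G.mean(axis=0))[idx]          # centered genotypes, one row per observation
S = Z @ Z.T / q                        # GR-type similarity (scaling assumed)
np.fill_diagonal(S, 0.0)
X = np.column_stack([np.ones(m * n_rep), rng.normal(size=m * n_rep)])
Y = X @ np.array([1.0, 0.5]) + rng.normal(size=m)[idx] + rng.normal(size=m * n_rep)
Q_G, beta_hat, eta_hat = score_statistic(Y, X, T, S)
```

In this simulated design the within-subject correlation is 0.5, for which the compound symmetric working model implies η = ρ/(1 + 2ρ) = 0.25, so `eta_hat` should land near that value.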
We derive an asymptotically equivalent representation of QG under H0 and show that this representation allows us to achieve theoretical protection against the misspecification of within-subject correlation as well as facilitating computationally efficient implementation suitable for large-scale studies. Specifically, we show in Supplementary Materials sections 1.2 and 1.3 that for all the genetic similarity metrics introduced previously, under H0, QG can be represented as
where η0 is the true parameter under H0; Idq is a dq × dq identity matrix and 0dq is a zero matrix; and ; Z is an n × dq matrix for some integer d, and c is a constant. The exact form or value of Z, d and c depend on the chosen genetic similarity and the details are given in the Supplementary Materials sections 1.2 and 1.4. For example, for GR similarity, Z, (n × q), is the centered genotype matrix, i.e., each column of the genotype matrix G is now centered by the genotype population mean 2ph. Note that , which is a summation of m terms each with expectation zero under the null regardless of the specified working correlation structure. Therefore, the summand is an unbiased estimating function for β, and according to the theory of M-estimators (Stefanski and Boos, 2002), is asymptotically normal with a covariance matrix that can be robustly estimated by some sandwich covariance estimates, leading to robustness to misspecification of working correlation.
In Results 2 and 3 of Supplementary Materials, using the theory of M-estimation as well as distributions for quadratic forms, we show that QG has an asymptotic distribution
under H0, where c is a constant which does not affect the inference; 's are i.i.d. Chi-square distributions with degree of freedom one; λk are eigenvalues of a 2dq × 2dq matrix
where Σ can be consistently estimated by a sandwich covariance estimate defined in Result 2 of the Supplementary Materials. Moreover, the null distribution of QG only depends on the eigenvalues of a 2dq × 2dq matrix. As the number of variants q in a target gene is relatively small, the computation is efficient and hence suitable for large-scale GWAS. To obtain the p-value, Davies’ method (1980) can be used as a computationally efficient way to analytically calculate the tail probability of a mixture of chi-squares by inverting the corresponding characteristic function.
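To make the final step concrete, the tail probability of a weighted sum of independent one-degree-of-freedom chi-squares can be approximated by simulation. The sketch below is a plain Monte Carlo stand-in and is not Davies' method: its resolution is limited to roughly 1/n_draws, so it is unsuitable for genome-wide significance thresholds and is shown only to illustrate what is being computed. The function name and defaults are hypothetical, and the constant c is assumed to have been subtracted from the observed statistic beforehand.

```python
import numpy as np

def mixture_chi2_pvalue(q_obs, lam, n_draws=200_000, seed=0):
    """Monte Carlo estimate of P(sum_k lam_k * chi2_1 > q_obs).

    lam may contain negative eigenvalues; the simulation handles
    an indefinite quadratic form without modification.
    """
    rng = np.random.default_rng(seed)
    lam = np.asarray(lam, dtype=float)
    draws = rng.chisquare(df=1, size=(n_draws, lam.size)) @ lam
    return float(np.mean(draws > q_obs))

# Sanity checks against known chi-square quantiles.
p_one = mixture_chi2_pvalue(3.841, [1.0])        # chi-square with 1 df
p_two = mixture_chi2_pvalue(5.991, [1.0, 1.0])   # chi-square with 2 df
```

In practice one would replace this with an exact inversion method such as Davies' algorithm, as the text describes, to reach small p-values reliably.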
2.4 Testing for the Joint Effect of Gene and Gene-time Interaction
In a regression framework, an interaction effect is typically modeled using new variables defined as the product of the two interacting factors. Similarly, we can define interaction terms, Giti,l = (Gi,1ti,l, Gi,2ti,l, . . ., Gi,qti,l)T, and treat them in the same way as Gi. The modified LGRF is therefore given by:
where is the similarity generated by gene-time interaction terms, similar to the genetic similarity; and γ1 and γ2 represent the genotype main effect and gene-time interaction effect, respectively. The IBS similarity is not suitable for the interaction terms because it is specifically designed for genetic variants/imputed dosage lying between 0 and 2. In the spirit of genetic relationship similarity, we define , where . Considering a working within-subject similarity matrix ηT as before, in matrix form the model is written as
$$Y \mid Y_{-} = X\beta + (\eta T + \gamma_1 S + \gamma_2 S_{GT})(Y - X\beta) + \varepsilon \qquad (6)$$
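where SGT is the similarity matrix of the interaction terms, in which the (l, k)-th element of the (i, j)-th block (ni × nj) is the interaction similarity between the l-th observation on subject i and the k-th observation on subject j, and the diagonal elements of SGT are zero. Under this model, we can evaluate the joint effect of gene and gene-time interaction by testing H0 : γ1 = γ2 = 0.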
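Denoting γ = (γ1, γ2)T and following the previous development, we construct two estimating functions with respect to γ1 and γ2:
$$U_{\gamma_1}(\beta, \eta, \gamma) = (Y - \mu)^T S\,(I - \eta T - \gamma_1 S - \gamma_2 S_{GT})(Y - \mu), \qquad U_{\gamma_2}(\beta, \eta, \gamma) = (Y - \mu)^T S_{GT}\,(I - \eta T - \gamma_1 S - \gamma_2 S_{GT})(Y - \mu).$$
As before, evaluating the corresponding estimating functions at γ = 0 leads to the following generalized score statistics
$$Q_G = (Y - \hat\mu)^T S\,(I - \hat\eta T)(Y - \hat\mu), \qquad Q_{GT} = (Y - \hat\mu)^T S_{GT}\,(I - \hat\eta T)(Y - \hat\mu).$$
We propose to combine these two by:
$$Q_J = \alpha_G Q_G + \alpha_{GT} Q_{GT}$$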
where and ; and are proportional to the variance of Uγ1 and Uγ2 respectively. Though the choice of weights can be arbitrary depending on the need of assessing marginal or interaction effect, our weights are defined such that αGQG and αGTQGT have approximately equal variance. Defining ZGT as the centered gene-interaction matrix, i.e., each gene-interaction term Gi,hti,l is centered by the its mean , and , we can rewrite the joint test statistic as a quadratic form:
where d is a constant depending on the chosen genetic similarity for the marginal genetic effect, as in Section 2.3, and cJ is a constant similar to c. Although more complex, QJ has the same form as QG in Section 2.3. The inference follows directly from the previous development and therefore we omit the details. The proposed method does not test the gene-time interaction separately; instead, it improves the power of the LGRF test by exploiting the potential interaction effect if it exists.
3. Illustration in MESA
We refer to the LGRF test for the marginal effect of a gene as LGRF-G and the joint test as LGRF-J. We illustrate the proposed methods using data from the Multi-Ethnic Study of Atherosclerosis (MESA). MESA is a collaborative longitudinal study initiated in July 2000 to investigate the prevalence, correlates, and progression of subclinical cardiovascular disease (CVD) (Bild, et al., 2002). From 2000 to 2007, four examinations of blood pressure (BP) were conducted over 18- to 24-month periods. We aimed to replicate the findings (29 significant SNPs associated with blood pressure) of the International Consortium for Blood Pressure (ICBP) (International Consortium for Blood Pressure Genome-Wide Association Studies, 2011) by a region-based analysis. A total of 6361 subjects, consisting of 2526 Caucasians (CAU), 1611 African Americans (AFA), 1449 Hispanics (HIS) and 775 Asians of Chinese descent (CHN), with genome-wide genotype data and systolic blood pressure (sBP) and diastolic blood pressure (dBP) outcomes were considered in the current analysis. Longitudinal summaries and descriptive statistics for the characteristics of the study population are provided in Supplementary Tables 8 - 11. For this analysis, we used SNPs that were directly genotyped using the Affymetrix Genome-Wide Human SNP Array 6.0 or imputed as per MESA protocol. Imputation was performed using the IMPUTE 2.1.0 program (Marchini, et al., 2007) in conjunction with HapMap Phase I and II reference panels (CEU+YRI+CHB+JPT, release 22 - NCBI Build 36 for African-, Chinese- and Hispanic-American participants; CEU, release 24 - NCBI Build 36 for European Americans). We selected genomic regions around the 29 index SNPs that were reported as significant (p-value < 10^-9) by the ICBP. Each genomic region was defined according to the following criteria: when the index SNP fell within a gene, we selected all SNPs within the gene +/− 5kb and adopted the gene's name. When the index SNP fell outside of a gene, we selected the index SNP plus all SNPs +/− 50kb and named the region after the index SNP. All SNPs were included in the analysis without any minor allele frequency filters. We applied LGRF-G and LGRF-J using longitudinal outcomes, and SKAT using the average value of repeated measures, to test the association between each candidate region and BP (sBP and dBP) for the four ethnic groups separately, adjusting for age, gender, BMI and the top two principal components (PCs) to correct for potential within-ethnicity stratification. Since only the first two principal components were associated with either systolic or diastolic blood pressure in any ethnicity at p < 0.01 (Supplementary Table 7), we only included these two principal components as adjustment variables. We adjusted the measured blood pressures for participants taking anti-hypertension medications using the standard procedure of adding 10 mmHg to systolic blood pressure and 5 mmHg to diastolic blood pressure (Cui, et al., 2003). SKAT was implemented with a linear kernel and equal weights on the SNPs. Based on the p-values of the stratified analysis, a meta-analysis was done by Fisher's method.
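Two of the steps described above are simple enough to sketch with the Python standard library: the medication adjustment and Fisher's combined probability test. Fisher's statistic, −2 Σ log p, follows a chi-square distribution with 2k degrees of freedom under the null, and for even degrees of freedom the survival function has a closed form. The function names and the `on_medication` flag are hypothetical; the example p-values are the four ethnicity-specific LGRF-G p-values for sBP in the region indexed by rs13082711 from Table 1.

```python
import math

def adjust_bp(sbp, dbp, on_medication):
    """Add 10 mmHg (systolic) and 5 mmHg (diastolic) for treated participants."""
    if on_medication:
        return sbp + 10.0, dbp + 5.0
    return float(sbp), float(dbp)

def fisher_combined_pvalue(pvalues):
    """Fisher's method for k independent p-values.

    Uses the closed-form chi-square survival function for 2k degrees
    of freedom: exp(-x/2) * sum_{j<k} (x/2)^j / j!.
    """
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # equals (Fisher statistic) / 2
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return math.exp(-half_stat) * total

# CAU, AFA, HIS, CHN p-values of LGRF-G for sBP, region indexed by rs13082711.
p_meta = fisher_combined_pvalue([0.0052, 0.6750, 0.0267, 0.0191])
```

Applied to the rounded p-values in Table 1, the combined p-value comes out at roughly 9 × 10^-4, in line with the Meta row of the table; small differences from the reported value are expected because the inputs are rounded to four decimals.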
We analyzed 29 regions, with details summarized in Supplementary Tables 12 - 21. The LGRF-G test results in comparable or smaller p-values than SKAT using average outcomes in most cases. We expect LGRF-J to have higher power than LGRF-G when there exists gene-time interaction, but lower power when there is no such interaction. In the MESA example, the LGRF-J test has smaller p-values than LGRF-G in relatively few instances (for example, the association of C10orf107 with diastolic blood pressure in Table 1), but larger p-values than LGRF-G in general. This may indicate that gene-time interaction does not contribute sufficiently to the marginal gene-level association in most cases. Table 1 shows the results of the top two associations between sBP/dBP and candidate regions. The top two regions were selected according to the p-values of LGRF-G in the meta-analysis using Fisher's combined probability test. The region indexed by rs13082711 emerged as the most strongly associated region. The meta-analysis p-values of LGRF-G are 8.69 × 10^-4 for sBP and 6.25 × 10^-4 for dBP. Another suggestive association identified by LGRF-G that is consistent with the ICBP analysis is between dBP and C10orf107 (p-value = 9.71 × 10^-4), and LGRF-J exhibited a smaller p-value for this association (p-value = 8.64 × 10^-4).
Table 1.
Analysis of Multi-Ethnic Study of Atherosclerosis (MESA) data: top two regions associated with systolic blood pressure/diastolic blood pressure.
Systolic Blood Pressure

| Group | rs13082711 region: SNPs | rs13082711 region: LGRF-G | rs13082711 region: LGRF-J | rs13082711 region: SKAT-Avg. | rs1378942 region: SNPs | rs1378942 region: LGRF-G | rs1378942 region: LGRF-J | rs1378942 region: SKAT-Avg. |
|---|---|---|---|---|---|---|---|---|
| CAU | 111 | 0.0052 | 0.0078 | 0.0047 | 84 | 0.0019 | 0.0023 | 0.0019 |
| AFA | 82 | 0.6750 | 0.6315 | 0.6806 | 70 | 0.1894 | 0.2047 | 0.1929 |
| HIS | 82 | 0.0267 | 0.0453 | 0.0307 | 70 | 0.5269 | 0.3446 | 0.4094 |
| CHN | 79 | 0.0191 | 0.0496 | 0.0302 | 70 | 0.8798 | 0.9364 | 0.8969 |
| Meta | - | 0.0009 | 0.0036 | 0.0013 | - | 0.0258 | 0.0248 | 0.0222 |

Diastolic Blood Pressure

| Group | rs13082711 region: SNPs | rs13082711 region: LGRF-G | rs13082711 region: LGRF-J | rs13082711 region: SKAT-Avg. | C10orf107: SNPs | C10orf107: LGRF-G | C10orf107: LGRF-J | C10orf107: SKAT-Avg. |
|---|---|---|---|---|---|---|---|---|
| CAU | 111 | 0.1774 | 0.1185 | 0.1704 | 190 | 0.0283 | 0.0412 | 0.0202 |
| AFA | 82 | 0.0263 | 0.0222 | 0.0233 | 157 | 0.0129 | 0.0106 | 0.0152 |
| HIS | 82 | 0.0086 | 0.0349 | 0.0058 | 157 | 0.0104 | 0.0081 | 0.0234 |
| CHN | 79 | 0.0292 | 0.0713 | 0.0308 | 154 | 0.5361 | 0.4998 | 0.4757 |
| Meta | - | 0.0006 | 0.0024 | 0.0004 | - | 0.0010 | 0.0009 | 0.0015 |
Each cell shows the p-value. CAU: Caucasians; AFA: African Americans; HIS: Hispanics; CHN: Asians of Chinese descent. Meta: Meta-analysis combining the results of four ethnic groups using Fisher's combined probability test. LGRF-G: the LGRF test for the marginal effect of a gene. LGRF-J: the LGRF test for the joint effect of gene and gene-time interaction. The working correlation assumed in LGRF is compound symmetric. SKAT-Avg.: cross-sectional SKAT using the average value of repeated measurements as the outcome. The column “SNPs” shows the total number of typed and imputed SNPs in each ethnic group.
4. Simulation Studies
We evaluated three classes of methods: (a) the proposed multi-marker tests for longitudinal data: LGRF-G and LGRF-J; (b) a multi-marker test for cross-sectional studies using the average of the repeated measures as a single outcome: SKAT; and (c) single-marker tests for longitudinal outcomes: GEE, adjusted by the Bonferroni correction. Specifically, in LGRF-G, LGRF-J and GEE, a working compound symmetric correlation structure was used, and SKAT was implemented with equal weights on the SNPs. Classes (b) and (c) represent two strategies commonly used in practice, as currently no multi-marker tests are available for longitudinal data. SKAT and GEE were chosen as the representative of each class, recognizing that multiple other alternatives exist in each class. Additional simulation studies on the impact of different genetic similarity measures, further evaluation of the power gain from a longitudinal design, use of LGRF in a meta-analysis, and evaluation of type-I error rate and power at lower significance levels are shown in Supplementary Tables 2 - 7.
For each replicated dataset, subjects were randomly selected from the Caucasian (CAU) ethnic group in MESA, and the variants commonly existing in all four ethnicities (154 SNPs) in genotype region C10orf107 are included as the target region. We varied the number of repeated measurements to be 4, 6 and 8, and number of subjects 600, 400 and 300 respectively, keeping total number of observations as 2400. Assuming missing completely at random, we first simulated the complete data, and then a missingness indicator with fixed drop-out rate of 4% at each exam was applied approximating what is observed in the MESA study.
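The drop-out mechanism can be sketched as a monotone missingness mask applied to the complete data. The sketch below assumes that the baseline exam is always observed and that a subject who drops out at one exam is missing at all later exams; the article states only the 4% per-exam rate, so both of these details, and the function name `dropout_mask`, are assumptions.

```python
import numpy as np

def dropout_mask(m, r, rate=0.04, seed=0):
    """Boolean m x r matrix: True where the observation is retained."""
    rng = np.random.default_rng(seed)
    drop = rng.random((m, r)) < rate
    drop[:, 0] = False                        # assumption: baseline exam always observed
    # Once a subject has dropped out, every later exam is missing as well.
    return np.cumsum(drop, axis=1) == 0

# The three designs in the text all have 2400 scheduled observations.
designs = {(600, 4): None, (400, 6): None, (300, 8): None}
for (m, r) in designs:
    designs[(m, r)] = dropout_mask(m, r)
mask = designs[(600, 4)]
```

Because the mask is generated independently of the outcome, this mechanism is missing completely at random, matching the assumption stated above.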
4.1 Type-I Error Simulations
We evaluated the type-I error rate at level α = 0.05, 0.01, and 0.001 using 100000 replicates. Data are generated from the model:
| (7) |
where ; r is the number of measurements per subject; εi = (εi,1, . . ., εi,r)T independently follows multivariate normal distribution with four types of covariance matrices:
Independent (Ind.): .
Auto-regressive of order 1 (AR1): εi ~ N(0, ΣAR), where ΣAR is an r × r matrix and its (l, k) element is .
Compound symmetry (CS): , , , where and bi are independent.
Mixed model with a random intercept and a random slope (RR): , , b1i, , where , b1i and b2i are independent.
Where ; , ρ = 0.6; ; ; . The missingness indicator was then applied to the simulated data with 4% drop-out rate. The empirical type-I error rates are presented in Table 2. LGRF-G and LGRF-J both have well controlled type-I error rates under all scenarios, even if the true correlation is not the assumed working correlation “CS”. The tests also have valid type-I error rates at low α-levels (0.01 and 0.001). The simulation results demonstrate that, consistent with the asymptotic result, the proposed methods are robust to misspecification of within-subject correlation in finite samples. We note that the proposed methods tend to be slightly conservative at lower significance levels (Supplementary Table 5) due to the use of sandwich estimator as in regular GEE.
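The four error structures can be written as covariance matrices and sampled directly. In the sketch below only ρ = 0.6 is taken from the text; the variance values (`sigma2`, `tau2`, `slope_var`), the measurement times, and the function names are placeholders chosen for illustration and are not the values used in the simulations.

```python
import numpy as np

def error_covariance(kind, t, rho=0.6, sigma2=1.0, tau2=1.0, slope_var=0.1):
    """Covariance of one subject's error vector under each structure."""
    t = np.asarray(t, dtype=float)
    r = t.size
    if kind == "ind":   # independent errors
        return sigma2 * np.eye(r)
    if kind == "ar1":   # first-order auto-regressive: sigma2 * rho^|l - k|
        lag = np.abs(np.arange(r)[:, None] - np.arange(r)[None, :])
        return sigma2 * rho ** lag
    if kind == "cs":    # random intercept plus independent noise
        return tau2 * np.ones((r, r)) + sigma2 * np.eye(r)
    if kind == "rr":    # independent random intercept and random slope in time
        return tau2 * np.ones((r, r)) + slope_var * np.outer(t, t) + sigma2 * np.eye(r)
    raise ValueError(f"unknown structure: {kind}")

def simulate_errors(kind, t, m, seed=0, **kwargs):
    """Draw m independent error vectors from the chosen structure."""
    rng = np.random.default_rng(seed)
    cov = error_covariance(kind, t, **kwargs)
    return rng.multivariate_normal(np.zeros(len(t)), cov, size=m)

t = np.arange(4, dtype=float)            # four equally spaced exams (assumed)
eps = simulate_errors("rr", t, m=600)    # 600 x 4 matrix of errors
```

Writing the CS and RR structures as marginal covariances is equivalent to simulating the random effects explicitly, since the random effects and the noise are independent normal variables.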
Table 2.
Type-I Error Rate Corresponding to Different Within-Subject Correlation Structures.
| Design | True correlation | LGRF-G, α = 0.05 | LGRF-G, α = 0.01 | LGRF-G, α = 0.001 | LGRF-J, α = 0.05 | LGRF-J, α = 0.01 | LGRF-J, α = 0.001 |
|---|---|---|---|---|---|---|---|
| Four repeated measurements (600 subjects) | Ind. | 0.0495 | 0.0096 | 0.0008 | 0.0493 | 0.0097 | 0.0008 |
| Four repeated measurements (600 subjects) | CS | 0.0493 | 0.0099 | 0.0009 | 0.0491 | 0.0096 | 0.0009 |
| Four repeated measurements (600 subjects) | AR1 | 0.0499 | 0.0097 | 0.0009 | 0.0507 | 0.0097 | 0.0009 |
| Four repeated measurements (600 subjects) | RR | 0.0497 | 0.0094 | 0.0009 | 0.0498 | 0.0096 | 0.0008 |
| Six repeated measurements (400 subjects) | Ind. | 0.0501 | 0.0096 | 0.0009 | 0.0501 | 0.0093 | 0.0010 |
| Six repeated measurements (400 subjects) | CS | 0.0501 | 0.0097 | 0.0009 | 0.0488 | 0.0089 | 0.0008 |
| Six repeated measurements (400 subjects) | AR1 | 0.0485 | 0.0093 | 0.0009 | 0.0494 | 0.0097 | 0.0008 |
| Six repeated measurements (400 subjects) | RR | 0.0497 | 0.0096 | 0.0010 | 0.0500 | 0.0095 | 0.0009 |
| Eight repeated measurements (300 subjects) | Ind. | 0.0488 | 0.0091 | 0.0008 | 0.0483 | 0.0091 | 0.0007 |
| Eight repeated measurements (300 subjects) | CS | 0.0484 | 0.0092 | 0.0010 | 0.0488 | 0.0090 | 0.0007 |
| Eight repeated measurements (300 subjects) | AR1 | 0.0474 | 0.0090 | 0.0008 | 0.0471 | 0.0089 | 0.0009 |
| Eight repeated measurements (300 subjects) | RR | 0.0492 | 0.0095 | 0.0008 | 0.0485 | 0.0091 | 0.0008 |
Each cell represents the empirical type-I error rate evaluated at α=0.05, 0.01 and 0.001 based on 100000 replicates. The total number of observations is 2,400 and repeated measurements per subject were generated in the same follow-up period according to different correlation structures. Ind.: the repeated measurements are independent. CS: the correlation is compound symmetric. AR1: the repeated measurements follow a first-order auto-regressive model. RR: observations follow a mixed model with a random intercept and a random slope. LGRF-G: the LGRF test for the marginal effect of a gene. LGRF-J: the LGRF test for the joint effect of gene and gene-time interaction. The working correlation assumed in LGRF is CS.
4.2 Power Simulations
In the first set of power simulations, one out of 154 SNPs was randomly set to be causal. We evaluated two distinct scenarios in which the effect of the single causal SNP is manifested through: 1. its marginal association with the outcome, without any gene-time interaction; 2. its interaction with time (SNP × Time interaction). The data were generated under the following two models, respectively:
| (8) |
| (9) |
where Gi is the genotype of subject i for the randomly selected causal SNP; α0 = 12/r, α1 = 0.4 and α2 = 0.6/r; r is the number of measurements per subject. To mimic the real data scenario, α1 and α2 were elicited based on fitting single SNP models with and without gene-time interaction to MESA data. We chose a large α0 in our simulation studies to illustrate the power gain that can be expected from a longitudinal design with strong time trend in the mean outcome levels compared to using the average of repeated measures. We recognize that smaller values of α0 will lead to smaller power differences.
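To make the two scenarios concrete, here is a minimal sketch of how outcomes for one subject might be generated. The model forms are assumptions, not the paper's equations: the marginal scenario is taken to be Y_ij = α0·t_ij + α1·G_i + ε_ij and the interaction scenario Y_ij = α0·t_ij + α2·G_i·t_ij + ε_ij, with t_ij = j, no other covariates, and compound-symmetric errors with correlation `rho`. Only α0 = 12/r, α1 = 0.4 and α2 = 0.6/r come from the text; the function name and all other settings are hypothetical.

```python
import random


def simulate_subject(genotype, r, scenario, rho=0.5, rng=random):
    """Simulate r repeated outcomes for one subject (assumed model, see lead-in)."""
    if scenario not in ("marginal", "interaction"):
        raise ValueError("scenario must be 'marginal' or 'interaction'")
    # Coefficient values quoted in the text
    alpha0, alpha1, alpha2 = 12.0 / r, 0.4, 0.6 / r
    # Compound symmetry: a shared subject-level term plus independent noise
    shared = rng.gauss(0.0, 1.0)
    outcomes = []
    for j in range(1, r + 1):
        error = rho ** 0.5 * shared + (1.0 - rho) ** 0.5 * rng.gauss(0.0, 1.0)
        mean = alpha0 * j  # assumed linear time trend with t_ij = j
        if scenario == "marginal":
            mean += alpha1 * genotype
        else:
            mean += alpha2 * genotype * j
        outcomes.append(mean + error)
    return outcomes


# Example: a subject carrying two copies of the minor allele, four measurements
print(simulate_subject(genotype=2, r=4, scenario="interaction", rng=random.Random(0)))
```

Passing a seeded `random.Random` instance keeps the draws reproducible; other error structures (AR1, random intercept and slope) would replace the two lines that build `error`.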
In the second set of simulations, ten out of 154 SNPs were randomly set to be causal in each replicate. Among them, six SNPs have only marginal effects, three have both marginal and interaction effects and the remaining one has only an interaction effect. The true model is of the form:
where Gi,k is the genotype of subject i at the k-th randomly selected causal SNP. The coefficients are set proportional to α1 and α2 so that the empirical powers of the methods can be differentiated.
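The selection of causal SNPs in this second set of simulations can be sketched as follows. This is an illustrative sketch only: the function name and the dictionary keys are hypothetical, and the order in which roles are assigned to the sampled SNPs is an assumption, since the text specifies only the counts (six, three and one).

```python
import random


def assign_causal_roles(n_snps=154, rng=random):
    """Draw ten distinct causal SNP indices and split them into 6/3/1 roles."""
    causal = rng.sample(range(n_snps), 10)  # sampling without replacement
    return {
        "marginal_only": causal[:6],
        "marginal_and_interaction": causal[6:9],
        "interaction_only": causal[9:],
    }


roles = assign_causal_roles(rng=random.Random(2024))
for role, indices in roles.items():
    print(role, sorted(indices))
```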
Two important points are illustrated by this simulation: 1. the advantage of incorporating longitudinal information over using only the average outcome; 2. the advantage of multi-marker tests over single-marker tests. The proposed multi-marker tests using the longitudinal outcome have larger power than SKAT using the average of outcomes, because the proposed tests use the whole trajectory of longitudinal outcomes rather than only the information contained in the average. As the number of repeated measurements increases, the power difference becomes more pronounced. Not surprisingly, when the causal SNP has only a marginal effect (Table 3), the LGRF-J test generally has slightly lower power than LGRF-G because no gene-time interaction exists in these scenarios.
When the causal SNP has only an interaction effect (Table 4), the advantage of the methods using repeated measures over the one using the average outcome is more pronounced. In addition, the joint test LGRF-J further enhances power in these scenarios because it incorporates the gene-time interaction explicitly. We note that the power difference between LGRF and SKAT using the average outcome is mainly attributable to the longitudinal design rather than to the difference between the genetic random field model and SKAT (Supplementary Table 4).
Table 4.
Power comparisons when one randomly selected SNP is causal and has only a gene-time interaction effect.
| Correlation structure | LGRF-G | LGRF-J | SKAT-Avg. | GEE-G | GEE-J |
|---|---|---|---|---|---|
| Four Repeated Measurements (600 Subjects) | | | | | |
| Ind. | 0.38 | 0.39 | 0.29 | 0.21 | 0.20 |
| CS | 0.48 | 0.54 | 0.36 | 0.33 | 0.46 |
| AR1 | 0.41 | 0.49 | 0.34 | 0.27 | 0.34 |
| RR | 0.53 | 0.57 | 0.39 | 0.42 | 0.50 |
| Six Repeated Measurements (400 Subjects) | | | | | |
| Ind. | 0.38 | 0.43 | 0.20 | 0.21 | 0.23 |
| CS | 0.33 | 0.44 | 0.19 | 0.17 | 0.37 |
| AR1 | 0.31 | 0.39 | 0.21 | 0.16 | 0.21 |
| RR | 0.42 | 0.50 | 0.25 | 0.27 | 0.38 |
| Eight Repeated Measurements (300 Subjects) | | | | | |
| Ind. | 0.32 | 0.36 | 0.16 | 0.16 | 0.19 |
| CS | 0.25 | 0.36 | 0.16 | 0.12 | 0.30 |
| AR1 | 0.25 | 0.35 | 0.14 | 0.13 | 0.16 |
| RR | 0.35 | 0.44 | 0.16 | 0.16 | 0.31 |
Each cell represents the empirical power from 500 replicates at level α=0.05. The total number of observations is 2,400 and repeated measurements were recorded in the same follow-up period. Ind.: the repeated measurements are independent. CS: the correlation is compound symmetric. AR1: the repeated measurements follow a first-order auto-regressive model. RR: observations follow a mixed model with a random intercept and a random slope. LGRF-G: the LGRF test for the marginal effect of a gene. LGRF-J: the LGRF test for the joint effect of gene and gene-time interaction. The working correlation assumed in LGRF is CS. SKAT-Avg.: cross-sectional SKAT using the average value of repeated measurements as the outcome. GEE-G: test the marginal association by GEE. GEE-J: jointly test the marginal association and gene-time interaction by GEE. These single-marker tests were implemented by testing every SNP in the region and adjusting the minimum p-value by the Bonferroni correction.
We also note that the proposed multi-marker tests have larger power than single-marker tests using GEE with Bonferroni correction (Tables 3-5), consistent with results found in cross-sectional studies where advantages of multi-marker tests over single-marker tests have been demonstrated repeatedly. The advantage in power is more substantial when there are multiple causal SNPs (Table 5) than when there is only one causal SNP (Tables 3-4).
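The single-marker comparison described in the table notes (test every SNP in the region, then adjust the minimum p-value by Bonferroni correction) reduces to a one-line computation once the per-SNP p-values are available. The sketch below assumes those p-values have already been obtained from some GEE fit; the function name is hypothetical and the example p-values are made up for illustration.

```python
def bonferroni_min_p(p_values):
    """Region-level p-value: the smallest per-SNP p-value times the number of SNPs, capped at 1."""
    if not p_values:
        raise ValueError("p_values must not be empty")
    return min(1.0, len(p_values) * min(p_values))


# Illustrative per-SNP p-values for a region with 154 SNPs (made-up numbers)
example_p_values = [0.0003] + [0.5] * 153
region_p = bonferroni_min_p(example_p_values)
print(f"region-level p-value: {region_p:.4f}, reject at 0.05: {region_p <= 0.05}")
```

With 154 SNPs in the region, the region is declared significant at level 0.05 only if some SNP has a p-value of at most 0.05/154, which is part of why this approach loses power when signals are spread over several correlated SNPs.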
Table 3.
Power comparisons when one randomly selected SNP is causal and has a marginal effect.
| Correlation structure | LGRF-G | LGRF-J | SKAT-Avg. | GEE-G | GEE-J |
|---|---|---|---|---|---|
| Four Repeated Measurements (600 Subjects) | | | | | |
| Ind. | 0.42 | 0.39 | 0.34 | 0.26 | 0.19 |
| CS | 0.53 | 0.49 | 0.43 | 0.41 | 0.33 |
| AR1 | 0.46 | 0.45 | 0.38 | 0.32 | 0.28 |
| RR | 0.58 | 0.55 | 0.46 | 0.50 | 0.43 |
| Six Repeated Measurements (400 Subjects) | | | | | |
| Ind. | 0.48 | 0.47 | 0.31 | 0.29 | 0.26 |
| CS | 0.40 | 0.41 | 0.28 | 0.28 | 0.23 |
| AR1 | 0.41 | 0.38 | 0.29 | 0.26 | 0.21 |
| RR | 0.51 | 0.48 | 0.35 | 0.42 | 0.37 |
| Eight Repeated Measurements (300 Subjects) | | | | | |
| Ind. | 0.40 | 0.39 | 0.25 | 0.29 | 0.23 |
| CS | 0.36 | 0.35 | 0.25 | 0.22 | 0.18 |
| AR1 | 0.36 | 0.36 | 0.22 | 0.23 | 0.21 |
| RR | 0.49 | 0.45 | 0.24 | 0.34 | 0.30 |
Each cell represents the empirical power from 500 replicates at level α=0.05. The total number of observations is 2,400 and repeated measurements were recorded in the same follow-up period. Ind.: the repeated measurements are independent. CS: the correlation is compound symmetric. AR1: the repeated measurements follow a first-order auto-regressive model. RR: observations follow a mixed model with a random intercept and a random slope. LGRF-G: the LGRF test for the marginal effect of a gene. LGRF-J: the LGRF test for the joint effect of gene and gene-time interaction. The working correlation assumed in LGRF is CS. SKAT-Avg.: cross-sectional SKAT using the average value of repeated measurements as the outcome. GEE-G: test the marginal association by GEE. GEE-J: jointly test the marginal association and gene-time interaction by GEE. These single-marker tests were implemented by testing every SNP in the region and adjusting the minimum p-value by the Bonferroni correction.
Table 5.
Power comparisons when randomly selected multiple SNPs are causal and have both marginal and interaction effects.
| Correlation structure | LGRF-G | LGRF-J | SKAT-Avg. | GEE-G | GEE-J |
|---|---|---|---|---|---|
| Four Repeated Measurements (600 Subjects) | | | | | |
| Ind. | 0.36 | 0.36 | 0.25 | 0.13 | 0.09 |
| CS | 0.50 | 0.49 | 0.37 | 0.19 | 0.18 |
| AR1 | 0.43 | 0.42 | 0.35 | 0.19 | 0.17 |
| RR | 0.60 | 0.60 | 0.46 | 0.36 | 0.29 |
| Six Repeated Measurements (400 Subjects) | | | | | |
| Ind. | 0.37 | 0.36 | 0.21 | 0.15 | 0.11 |
| CS | 0.33 | 0.35 | 0.21 | 0.12 | 0.10 |
| AR1 | 0.32 | 0.32 | 0.22 | 0.13 | 0.10 |
| RR | 0.46 | 0.43 | 0.24 | 0.22 | 0.15 |
| Eight Repeated Measurements (300 Subjects) | | | | | |
| Ind. | 0.30 | 0.30 | 0.17 | 0.11 | 0.11 |
| CS | 0.27 | 0.29 | 0.18 | 0.09 | 0.11 |
| AR1 | 0.26 | 0.28 | 0.14 | 0.08 | 0.08 |
| RR | 0.40 | 0.41 | 0.20 | 0.19 | 0.15 |
Each cell represents the empirical power from 500 replicates at level α=0.05. The total number of observations is 2,400 and repeated measurements were recorded in the same follow-up period. Ind.: the repeated measurements are independent. CS: the correlation is compound symmetric. AR1: the repeated measurements follow a first-order auto-regressive model. RR: observations follow a mixed model with a random intercept and a random slope. LGRF-G: the LGRF test for the marginal effect of a gene. LGRF-J: the LGRF test for the joint effect of gene and gene-time interaction. The working correlation assumed in LGRF is CS. SKAT-Avg.: cross-sectional SKAT using the average value of repeated measurements as the outcome. GEE-G: test the marginal association by GEE. GEE-J: jointly test the marginal association and gene-time interaction by GEE. These single-marker tests were implemented by testing every SNP in the region and adjusting the minimum p-value by the Bonferroni correction.
5. Discussion
We extended the genetic random field model to the longitudinal setting and developed generalized score type tests for the joint association between a set of genetic variants and a repeatedly measured phenotype. Besides the advantages of region-based tests over single-marker tests in cross-sectional studies, the LGRF model is able to utilize all the repeated measurements and incorporate gene-time interaction explicitly, resulting in higher power. As in GenRF, LGRF models the joint association using a single parameter by considering the similarity in phenotype induced by genetic similarity. A main challenge in modeling longitudinal data is accounting for within-subject correlation; in LGRF, this correlation is conceptually viewed and modeled in a unified way with the joint genetic association. Furthermore, the specified correlation structure is treated as a working assumption in inference, and the resulting LGRF tests are robust to its misspecification.
LGRF tests are generalized score tests that only require fitting the model under the null hypothesis, which does not depend on the target region. Users can therefore fit the null model once and test all regions without refitting it. In addition, the computational cost of LGRF depends mainly on the number of variants in the region rather than on the sample size. This property improves computational efficiency dramatically (see Supplementary Table 1), especially when the target region is small, for example when investigators are only interested in exons.
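The "fit the null model once, test many regions" pattern can be illustrated with a deliberately simplified sketch. This is not the LGRF statistic: it assumes an intercept-only null model, one outcome per subject, and a SKAT-like sum of squared per-variant scores, and it ignores within-subject correlation and p-value computation entirely. All function names are hypothetical.

```python
def null_residuals(outcomes):
    """Fit the (intercept-only) null model once and return its residuals."""
    mean = sum(outcomes) / len(outcomes)
    return [y - mean for y in outcomes]


def region_score_statistic(residuals, genotypes):
    """Sum over variants of the squared score: (sum_i residual_i * genotype_ik)^2.

    genotypes is a list with one row per subject and one column per variant.
    """
    n_variants = len(genotypes[0])
    total = 0.0
    for k in range(n_variants):
        score_k = sum(res * row[k] for res, row in zip(residuals, genotypes))
        total += score_k ** 2
    return total


outcomes = [1.0, 2.0, 3.0]
residuals = null_residuals(outcomes)  # computed once, independent of any region

# Two toy regions with different numbers of variants reuse the same residuals
regions = {
    "region_A": [[0], [1], [2]],
    "region_B": [[1, 0], [1, 1], [1, 2]],
}
for name, genotypes in regions.items():
    print(name, region_score_statistic(residuals, genotypes))
```

The point of the design is that the expensive step (fitting the null model) happens outside the loop over regions, so each additional region costs only a pass over its own variants.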
We note that, beyond the precise description of phenotype progression that longitudinal outcomes provide, considering time-varying exposures and their interaction with genotype may also improve the discovery process. However, an analysis using the average outcome and a single measure of exposure loses the longitudinal features of the time-varying exposure variables and their correlations, reducing the rich exposure and outcome data to an aggregate summary measure. In the spirit of multi-marker tests for gene-environment interaction, such as GESAT (Lin, et al., 2013), we expect that a future extension of LGRF towards separately testing gene-time or gene-environment interaction in longitudinal studies with time-dependent covariates may enhance the discovery process. Finally, the proposed test is only valid when data are missing completely at random, as in GEE (Zeger and Liang, 1986). Future work extending the method to other missing-data mechanisms will be of interest.
Acknowledgements
This work was supported by NSF DMS 1406712 and NIH/NIEHS grant ES020811 (B.M.), NIH/NHLBI grant R00HL113164 (S.L.), NIH/NHLBI HL101161 (A.D., S.K., B.M., J.S.), and NIMHHD Grant 2P60MD002249 Center for Integrative Approaches to Health Disparities (CIAHD) (A.D., S.K., J.S.). MESA and the MESA SHARe project are conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with MESA investigators. Support for MESA is provided by contracts N01-HC-95159, N01-HC-95160, N01-HC-95161, N01-HC-95162, N01-HC-95163, N01-HC-95164, N01-HC-95165, N01-HC-95166, N01-HC-95167, N01-HC-95168, N01-HC-95169 and CTSA UL1-RR-024156. Funding for SHARe genotyping was provided by NHLBI Contract N02-HL-64278. Genotyping was performed at Affymetrix (Santa Clara, California, USA) and the Broad Institute of Harvard and MIT (Boston, Massachusetts, USA) using the Affymetrix Genome-Wide Human SNP Array 6.0.
Footnotes
6. Supplementary Materials
Web Appendices referenced in Sections 2, 3, 4 and 5, and the R code implementing the method are available with this paper at the Biometrics website on Wiley Online Library. The code and an illustrative example are also available at: http://sitemaker.umich.edu/statzihuai/longitudinal_genetic_random_field_lgrf_.
References
- Adler RJ, Taylor JE. Random Fields and Geometry. Springer; New York: 2007.
- Besag J. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society. Series B. 1974;36(2):192–236.
- Bild DE, Bluemke DA, Burke GL, Detrano R, Roux AVD, Folsom AR, et al. Multi-ethnic study of atherosclerosis: objectives and design. American Journal of Epidemiology. 2002;156(9):871–881. doi: 10.1093/aje/kwf113.
- Boos DD. On generalized score tests. The American Statistician. 1992;46(4):327–333.
- Cui JS, Hopper JL, Harrap SB. Antihypertensive treatments obscure familial contributions to blood pressure variation. Hypertension. 2003;41(2):207–210. doi: 10.1161/01.hyp.0000044938.94050.e3.
- Cressie N. Statistics for spatial data. Wiley; New York: 1993.
- Davies R. The distribution of a linear combination of chi-square random variables. Applied Statistics. 1980;29:323–333.
- Fan R, Zhang Y, Albert P, Liu A, Wang Y, Xiong M. Longitudinal association analysis of quantitative traits. Genetic Epidemiology. 2012. doi: 10.1002/gepi.21673.
- Fitzmaurice GM, Laird NM, Ware JH. Applied longitudinal analysis. 2nd Edition. John Wiley & Sons; 2011.
- Furlotte N, Eskin E, Eyheramendy S. Genome-wide association mapping with longitudinal data. Genetic Epidemiology. 2012;36:463–471. doi: 10.1002/gepi.21640.
- He Z, Zhang M, Zhan X, Lu Q. Modeling and testing for joint association using a genetic random field model. Biometrics. 2014. doi: 10.1111/biom.12160.
- International Consortium for Blood Pressure Genome-Wide Association Studies. Genetic variants in novel pathways influence blood pressure and cardiovascular disease risk. Nature. 2011;478(7367):103–109. doi: 10.1038/nature10405.
- Li M, He Z, Zhang M, Zhan X, Wei C, Elston RC, Lu Q. A generalized genetic random field method for the genetic association analysis of sequencing data. Genetic Epidemiology. 2014;38:242–253. doi: 10.1002/gepi.21790.
- Lin X, Lee S, Christiani DC, Lin X. Test for the Interaction between a Genetic Marker Set and Environment in Generalized Linear Models. Biostatistics. 2013;14(4):667–681. doi: 10.1093/biostatistics/kxt006.
- Marchini J, Howie B, Myers S, McVean G, Donnelly P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nature Genetics. 2007;39(7):906–913. doi: 10.1038/ng2088.
- Morris AP, Zeggini E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genetic Epidemiology. 2010;34(2):188–193. doi: 10.1002/gepi.20450.
- Qu A, Lindsay BG, Li B. Improving generalised estimating equations using quadratic inference functions. Biometrika. 2000;87(4):823–836.
- Stefanski LA, Boos DD. The calculus of M-estimation. The American Statistician. 2002;56:29–38.
- Tzeng JY, Zhang D, Chang SM, Thomas DC, Davidian M. Gene-Trait Similarity Regression for Multimarker-based Association Analysis. Biometrics. 2009;65(3):822–832. doi: 10.1111/j.1541-0420.2008.01176.x.
- Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics. 2011;89(1):82–93. doi: 10.1016/j.ajhg.2011.05.029.
- Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. American Journal of Human Genetics. 2011;88(1):76–82. doi: 10.1016/j.ajhg.2010.11.011.
- Zeger SL, Liang KY. Longitudinal data analysis for discrete and continuous outcomes. Biometrics. 1986:121–130.