American Journal of Human Genetics. 2018 Apr 5;102(4):574–591. doi: 10.1016/j.ajhg.2018.02.016

L-GATOR: Genetic Association Testing for a Longitudinally Measured Quantitative Trait in Samples with Related Individuals

Xiaowei Wu 1,3, Mary Sara McPeek 1,2
PMCID: PMC5985289  PMID: 29625022

Abstract

In complex-trait mapping, when each subject has multiple measurements of a quantitative trait over time, power for detecting genetic association can be gained by the inclusion of all measurements and not just single time points or averages in the analysis. To increase power and control type 1 error, one should account for dependence among observations for a single individual as well as dependence between observations of related individuals if they are present in the sample. We propose L-GATOR, a retrospective, mixed-effects method for association mapping of longitudinally measured traits in samples with related individuals. L-GATOR allows arbitrary time points for different individuals, incorporates both time-varying and static covariates, and properly addresses various types of dependence. In simulations, we show that L-GATOR outperforms existing prospective methods in terms of both type 1 error and power when there is phenotype model misspecification or missing data. Compared with the previously proposed longGWAS method, L-GATOR was more than ten times faster for association testing in our simulations and almost 100 times faster for parameter estimation. L-GATOR is applicable to essentially arbitrary combinations of related and unrelated individuals, including small families as well as large, complex pedigrees. We apply the method to data from the Framingham Heart Study to identify association between longitudinal systolic blood pressure measurements and genome-wide SNPs. Of the smallest p values, one-third occur in or near genes that have been previously identified as associated with pulse pressure (such as PIK3CG) and systolic and diastolic blood pressure (such as C10orf107), showing that L-GATOR is able to prioritize relevant loci in a genome screen.

Keywords: longitudinal, GWAS, linear mixed model, pedigree, missing genotypes, Framingham, systolic blood pressure, relatives, association mapping, mixed effects model

Introduction

In some genome-wide association studies (GWASs), phenotype data are measured longitudinally or repeatedly for each individual in the sample. In such studies, association analyses based on single-time-point measurements of the trait, or on individuals’ average trait values, can lose power by not making full use of the available phenotype information.1 Furthermore, when individuals’ average phenotype values are analyzed, it can be problematic to incorporate dynamic (i.e., time-varying) covariates in an appropriate way. In addition, when related individuals are included in the sample, failure to account for residual correlation among relatives’ phenotypes can lead to inflated type 1 error and reduced power. A statistical analysis that uses all the longitudinal information and properly accounts for relatedness among individuals, as well as for dependence among observations from an individual, has the potential to improve power for detecting genetic association.

For genetic analysis of longitudinal traits, methods that are appropriate for unrelated individuals (i.e., that assume there is no residual phenotypic correlation among individuals) include fGWAS2 and MASAL.3 For related individuals, the FBAT-PC-based methods4, 5, 6 have been developed, and these can allow covariates and missing phenotype values, but they have stringent requirements on the availability of genotype data on relatives. The longGWAS method1 also allows covariates and is applicable to a wider range of samples than FBAT-PC, but it is limited in its handling of different time points for different individuals and missing data.

We propose L-GATOR, a retrospective, mixed-model method for genetic association analysis of longitudinal traits in related individuals. L-GATOR can be viewed as an extension, to longitudinal traits, of the MASTOR method7 for quantitative traits. Features of L-GATOR include the following: (1) it is applicable to arbitrary combinations of related and unrelated individuals, (2) it allows both static and dynamic covariates to be included in the analysis, (3) it allows different individuals to have different time points, (4) it has correct type 1 error even when the trait model is misspecified, and (5) it improves power by using information on dependence among relatives to make use of observations that have either missing phenotype or missing genotype data. For comparison, we also propose and implement PMALT, a prospective, mixed-model method that is an extension, to longitudinal traits, of the GTAM method8 for quantitative traits and that is also closely related to the longGWAS method.1

Through simulations, we demonstrate that L-GATOR represents an improvement over other methods in terms of type 1 error and power. The proposed methods are applied to a whole-genome analysis of longitudinal systolic blood pressure (SBP) measurements from the Framingham Heart Study (FHS).

Material and Methods

Suppose that in an association study we have phenotype, covariate, and genotype data for a group of $m$ individuals with known pedigree information. The phenotype data are assumed to consist of longitudinal measurements of a quantitative trait. We allow in the analysis both dynamic covariates (e.g., age) and static covariates (e.g., sex). We consider the phenotype and covariate data at a single time point for a given individual to be a “record” for that individual. Each record thus provides a snapshot of the trait and the covariates of an individual at a specified time point. Different individuals are allowed to have different numbers of records and different time points. We can think of records as being nested within individuals and individuals as being nested within families. We assume that the set of individuals is partitioned into families in such a way that if two individuals are in different families, then they are unrelated. (A family is permitted to consist of a single individual if that person is unrelated to anyone else in the sample.) Assume, for simplicity, that the $m$ individuals in the sample are ordered so that individuals within the same family are listed consecutively. Denote the number of records for individual $i$, $1 \le i \le m$, by $n_i$, and denote the total number of records by $n = \sum_{i=1}^{m} n_i$. We define $t = (t_1^T, t_2^T, \ldots, t_m^T)^T$, where $t_i$ is a vector of length $n_i$ whose $j$th entry, $t_{ij}$, is equal to the measurement time of the $j$th record on the $i$th individual for $1 \le j \le n_i$. (In practice, only the time intervals and not the actual times are needed.) We let $Y$ denote the trait vector (length $n$) and let $W$ denote the covariate matrix (size $n \times q$), which is assumed to include an intercept. Here, $Y$ and the rows of $W$ are assumed to be arranged in the same order as $t$, such that all the records for person 1 are listed first, followed by all the records for person 2, and so on. Note that $W$ can incorporate dynamic covariates such as age.
Static covariates in W retain the same value for all records on a given individual. Table 1 shows an example of the assumed data structure. We define a “complete record” to be a table row that contains no missing values, i.e., it consists of the data of a single individual at a single time point for which the phenotype and covariates are observed without missing values. For example, in Table 1, individual 101 has three complete records, and individual 112 has two complete records. Note that we use the term “complete record” to refer only to phenotype and covariate data and not to genotype data.

Table 1.

Example of Phenotype and Covariate Data

Family  Individual  Record  t      Y     W: Intercept  W: Static Covariate  W: Dynamic Covariate
1       101         1       40.01  4.79  1             1                    3.03
1       101         2       42.04  4.71  1             1                    3.06
1       101         3       44.06  4.77  1             1                    3.04
1       112         1       58.67  4.86  1             0                    2.97
1       112         2       60.38  4.93  1             0                    2.98
1       31          1       27.38  5.11  1             1                    3.29
1       31          2       29.70  5.24  1             1                    3.28
1       31          3       31.52  5.17  1             1                    3.27
1       31          4       33.27  5.21  1             1                    3.31
2       15          1       58.82  4.91  1             0                    3.15
2       15          2       60.91  4.86  1             0                    3.18
2       15          3       62.71  4.77  1             0                    3.21

The first three columns show the hierarchy of family, individual, and record. In columns 4 and 5, t denotes the measurement time (note that only time intervals and not the times themselves are used in the analysis), and Y denotes the trait value. The covariate matrix, W, is shown with an intercept column, one static covariate, and one dynamic covariate. In general, W is assumed to always contain the intercept column, and it could contain any number of additional static and/or dynamic covariates.
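As a rough illustration of this layout, the sketch below loads the rows of Table 1 into the record-level objects $t$, $Y$, and $W$ and derives the per-individual record counts $n_i$. The variable names (`rows`, `n_records`, and so on) are illustrative choices, not part of the method's software.

```python
import numpy as np

# Rows of Table 1: (family, individual, record, t, Y, intercept, static, dynamic)
rows = [
    (1, 101, 1, 40.01, 4.79, 1, 1, 3.03),
    (1, 101, 2, 42.04, 4.71, 1, 1, 3.06),
    (1, 101, 3, 44.06, 4.77, 1, 1, 3.04),
    (1, 112, 1, 58.67, 4.86, 1, 0, 2.97),
    (1, 112, 2, 60.38, 4.93, 1, 0, 2.98),
    (1, 31, 1, 27.38, 5.11, 1, 1, 3.29),
    (1, 31, 2, 29.70, 5.24, 1, 1, 3.28),
    (1, 31, 3, 31.52, 5.17, 1, 1, 3.27),
    (1, 31, 4, 33.27, 5.21, 1, 1, 3.31),
    (2, 15, 1, 58.82, 4.91, 1, 0, 3.15),
    (2, 15, 2, 60.91, 4.86, 1, 0, 3.18),
    (2, 15, 3, 62.71, 4.77, 1, 0, 3.21),
]

ids = [r[1] for r in rows]
individuals = list(dict.fromkeys(ids))          # order of first appearance
n_records = [ids.count(i) for i in individuals]  # n_i for each individual

t = np.array([r[3] for r in rows])               # measurement times, length n
Y = np.array([r[4] for r in rows])               # trait vector, length n
W = np.array([r[5:] for r in rows], dtype=float)  # covariate matrix, n x q

m, n, q = len(individuals), len(rows), W.shape[1]
```

Records stay grouped by individual, and individuals by family, which is the ordering the text assumes.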

For simplicity, we assume that the observed genetic variants are biallelic. (The extension to multiallelic variants is similar to that given in Appendix A of Jakobsdottir and McPeek.7) Markers are tested one at a time for association. For a given variant, with alleles arbitrarily labeled 0 and 1, we write the corresponding genotype data as a vector X (length m) with ith component

$$X_i = \tfrac{1}{2} \times (\text{the number of alleles of type 1 in individual } i), \quad 1 \le i \le m.$$

The vector X differs from Y and W in that it is indexed by individual rather than by record. We define the matrix B (size n×m) to be a “vertically expanded identity matrix,” i.e.,

$$B = \begin{pmatrix} B_1 \\ \vdots \\ B_m \end{pmatrix},$$ (Equation 1)

where, for $1 \le i \le m$, $B_i$ is an $n_i \times m$ matrix with every row equal to the $i$th row of the $m \times m$ identity matrix, i.e., the $(j,k)$th element of $B_i$ is $\mathbb{1}\{k = i\}$. Then, for example, to expand the genotype data, $X$, from the individual level to the record level, we can simply write $BX$.
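A minimal NumPy sketch of this expansion, using the record counts from Table 1 and a made-up genotype vector (the genotype values are hypothetical and chosen only for illustration):

```python
import numpy as np

def expansion_matrix(n_records):
    """Build the n x m 'vertically expanded identity matrix' B.

    Row i of the m x m identity matrix is repeated n_records[i] times.
    """
    m = len(n_records)
    return np.repeat(np.eye(m), n_records, axis=0)

n_records = [3, 2, 4, 3]               # n_i for the four individuals in Table 1
B = expansion_matrix(n_records)        # shape (12, 4)

# Hypothetical genotypes coded as (number of type-1 alleles) / 2.
X = np.array([0.5, 0.0, 1.0, 0.5])
BX = B @ X                             # one genotype value per record
```

Each individual's genotype is simply repeated once for each of that individual's records.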

We assume known pedigree structure and define the m×m kinship matrix of the sampled individuals by

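$$\Phi = \begin{pmatrix} 1 + h_1 & 2\phi_{12} & \cdots & 2\phi_{1m} \\ 2\phi_{12} & 1 + h_2 & \cdots & 2\phi_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ 2\phi_{1m} & 2\phi_{2m} & \cdots & 1 + h_m \end{pmatrix},$$

where $h_i$ is the inbreeding coefficient of individual $i$, and $\phi_{ij}$ is the kinship coefficient between individuals $i$ and $j$, where $1 \le i \le m$ and $1 \le j \le m$. Matrix $\Phi$ is block-diagonal such that each diagonal block represents the kinship matrix of a family and the off-diagonal blocks are 0 matrices.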
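To make the block-diagonal structure concrete, the sketch below assembles $\Phi$ for a hypothetical sample of one outbred parent-offspring trio plus one unrelated individual. The pedigree is an assumption made for illustration; it uses the standard parent-offspring kinship coefficient of 1/4, so the corresponding off-diagonal entries are $2 \times 1/4 = 0.5$. The helper `block_diag` is written out by hand to keep the example dependent on NumPy only.

```python
import numpy as np

def block_diag(blocks):
    """Place square blocks along the diagonal of an otherwise zero matrix."""
    size = sum(b.shape[0] for b in blocks)
    out = np.zeros((size, size))
    start = 0
    for b in blocks:
        k = b.shape[0]
        out[start:start + k, start:start + k] = b
        start += k
    return out

# Hypothetical family 1: father, mother, offspring (all outbred, h_i = 0).
phi_trio = np.array([
    [1.0, 0.0, 0.5],   # father:   unrelated to mother, 2 * 1/4 with offspring
    [0.0, 1.0, 0.5],   # mother
    [0.5, 0.5, 1.0],   # offspring
])
# Hypothetical family 2: a single unrelated, outbred individual.
phi_single = np.array([[1.0]])

Phi = block_diag([phi_trio, phi_single])
```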

In what follows, we introduce two association testing methods: PMALT (which is a loose acronym for “prospective mixed-model method for association of a longitudinal trait”) and L-GATOR (an acronym for “longitudinal genetic association test on relatives”). PMALT is a prospective method, meaning that the analysis is conditional on (W,X) with Y treated as random, whereas L-GATOR is a retrospective method, meaning that the analysis is conditional on (W,Y) with X treated as random. An important difference between the two is that the PMALT analysis includes only genotyped individuals who have at least one complete record, whereas the L-GATOR analysis can incorporate certain types of partially missing data, as described in the subsection L-GATOR Incorporates Incomplete Data.

PMALT: Missing Values Excluded

The PMALT method for longitudinal quantitative traits can be viewed as an extension of the GTAM method8 for non-longitudinal quantitative traits. Like GTAM, PMALT can be formulated as a version of the prospective likelihood score test based on a mixed model for phenotype conditional on genotype and covariates. (This is described in more detail in the following paragraphs.) In PMALT, only individuals with observed genotype data at the SNP being tested and with at least one complete record are included in the analysis. Furthermore, for each genotyped individual, incomplete records are ignored. Notationally, once the data ignored by PMALT have been culled, the number of records, $n_i$, for individual $i$ satisfies $n_i \ge 1$ for each $1 \le i \le m$, and the resulting $X$, $Y$, and $W$ are assumed to contain no missing values. The numbers of records and the time points are permitted to differ across individuals. The set of individuals included in the analysis can differ across variants because of missing genotype data for some individuals at some SNPs.

The modeling assumption underlying PMALT is that the trait vector Y (possibly after suitable transformation of the phenotype and/or covariates) has a multivariate normal distribution conditional on W and X,

$$Y \mid W, X \sim \mathrm{MVN}(\mu, \Sigma),$$ (Equation 2)

where the mean, $\mu$, is assumed to be the following linear function of the covariates and genotype data:

$$\mu = W\beta + BX\gamma,$$ (Equation 3)

where $\beta$ is an unknown length-$q$ vector of fixed effects and $\gamma$ is an unknown scalar representing the association between the variant and the trait. Additional sources of variation are incorporated in the variance-covariance matrix,

$$\Sigma = \sigma_E^2 I + \sigma_I^2 J + \sigma_A^2 K + \sigma_T^2 L,$$ (Equation 4)

where $\sigma_E^2$, $\sigma_I^2$, $\sigma_A^2$, and $\sigma_T^2$ are unknown variance components and $I$, $J$, $K$, and $L$ are matrices of size $n \times n$. $I$, $J$, and $K$ are known matrices, whereas $L$ includes two additional unknown parameters. $I$ is the identity matrix, so $\sigma_E^2$ represents variance due to random measurement error at the record level. $J$ and $L$ are both block-diagonal such that each diagonal block corresponds to one individual and the off-diagonal blocks are 0 matrices. For the $i$th individual, the $n_i \times n_i$ diagonal block $J_i$ is a matrix with every entry equal to 1, so $\sigma_I^2$ stands for variance due to non-genetic individual random effects that do not vary over time. For the $i$th individual, the $n_i \times n_i$ diagonal block $L_i$ has a $(j,k)$th entry given by $\exp(-\alpha |t_{ij} - t_{ik}|^c)$, where $1 \le j \le n_i$, $1 \le k \le n_i$, $\alpha$ and $c$ are unknown parameters, and $t_{ij}$ and $t_{ik}$ are the $j$th and $k$th components of the subvector $t_i$ in the vector $t$, so that $|t_{ij} - t_{ik}|$ is the absolute difference in measurement times between the $j$th and $k$th records for individual $i$. We think of $L$ as modeling autocorrelation in an individual’s trait measured over time.9 Therefore, $\sigma_T^2$ represents variance attributable to time-varying individual random effects. Parameters $\alpha$ and $c$ control the time dependence between the trait measurements for an individual. Note that $c = 1$ corresponds to a continuous-time analog of the first-order autoregressive (AR(1)) covariance. $K = B\Phi B^T$ is the kinship matrix expanded from the individual level to the record level; hence, $\sigma_A^2$ stands for variance attributed to additive polygenic random effects. The variance-component model is similar to other polygenic models that have been proposed for linkage analysis and association testing in longitudinal family studies.10, 11, 12, 13 In this setting, the covariance matrix $\Sigma$ depends on six unknown parameters, $\sigma_E^2$, $\sigma_I^2$, $\sigma_A^2$, $\sigma_T^2$, $\alpha$, and $c$.
For convenience, we reparametrize the variance components so that one of the parameters is the total residual variance (for an outbred individual), $\sigma^2 = \sigma_E^2 + \sigma_I^2 + \sigma_A^2 + \sigma_T^2$, and the others are the narrow-sense heritability, $\sigma_A^2/\sigma^2$, and the proportions of (residual) variance explained by individual effects, $\sigma_I^2/\sigma^2$, and by measurement error, $\sigma_E^2/\sigma^2$. We drop $\sigma_T^2/\sigma^2$ from the parameterization to avoid redundancy. The vector of unknown parameters that appear in $\Sigma$ can then be written as $(\sigma^2, \tau)$, where we define $\tau = (\sigma_E^2/\sigma^2, \sigma_I^2/\sigma^2, \sigma_A^2/\sigma^2, \alpha, c)$. We further write

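$$\Sigma = \sigma^2 V(\tau),$$ (Equation 5)

where

$$V(\tau) \equiv \frac{\sigma_E^2}{\sigma^2} I + \frac{\sigma_I^2}{\sigma^2} J + \frac{\sigma_A^2}{\sigma^2} K + \left(1 - \frac{\sigma_E^2}{\sigma^2} - \frac{\sigma_I^2}{\sigma^2} - \frac{\sigma_A^2}{\sigma^2}\right) L.$$ (Equation 6)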

For convenience, we often drop the argument τ and denote this matrix by V. In the special case when all individuals are outbred, V is the correlation matrix of Y|W,X.
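The construction of $V(\tau)$ can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the time-dependence entries are taken to have the decaying form $\exp(-\alpha |t_{ij} - t_{ik}|^c)$ with $\alpha > 0$, the function name `build_V` and its argument names are invented here, and the times, kinship matrix, and parameter values are made up.

```python
import numpy as np

def build_V(t, n_records, Phi, prop_E, prop_I, prop_A, alpha, c):
    """Assemble V(tau) of Equation 6 at the record level.

    prop_E, prop_I, prop_A are the variance proportions sigma_E^2/sigma^2,
    sigma_I^2/sigma^2 and sigma_A^2/sigma^2; the remainder goes to L.
    """
    n = len(t)
    B = np.repeat(np.eye(len(n_records)), n_records, axis=0)
    J = B @ B.T                       # 1 where two records share an individual
    K = B @ Phi @ B.T                 # kinship expanded to the record level
    lag = np.abs(t[:, None] - t[None, :])
    L = J * np.exp(-alpha * lag ** c)  # time dependence, within individual only
    prop_T = 1.0 - prop_E - prop_I - prop_A
    return prop_E * np.eye(n) + prop_I * J + prop_A * K + prop_T * L

# Hypothetical example: a parent (2 records) and an offspring (3 records).
t = np.array([40.0, 42.0, 58.0, 60.0, 62.0])
n_records = [2, 3]
Phi = np.array([[1.0, 0.5],
                [0.5, 1.0]])
V = build_V(t, n_records, Phi, prop_E=0.2, prop_I=0.2, prop_A=0.3,
            alpha=0.1, c=1.0)
```

With outbred individuals the diagonal of `V` is 1, consistent with $V$ being a correlation matrix in that case.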

To obtain a prospective likelihood score test of the null hypothesis $H_0: \gamma = 0$ in the model defined by Equations 2, 3, 4, 5, and 6, one would typically first obtain the null maximum likelihood estimate (MLE) of the nuisance parameter $(\beta, \sigma^2, \tau)$, i.e., the MLE of $(\beta, \sigma^2, \tau)$ in the model with $\gamma = 0$. We denote this null MLE by $(\hat\beta_0, \hat\sigma_0^2, \hat\tau_0)$. (See Appendix A for details on how to obtain the null MLE.) We define the transformed phenotypic residual to be

$$RY = \hat V_0^{-1}(Y - W\hat\beta_0),$$ (Equation 7)

where $R = \hat V_0^{-1} - \hat V_0^{-1} W (W^T \hat V_0^{-1} W)^{-1} W^T \hat V_0^{-1}$, $\hat V_0 = V(\hat\tau_0)$, and $\hat\beta_0 = (W^T \hat V_0^{-1} W)^{-1} W^T \hat V_0^{-1} Y$. The test statistic for the prospective likelihood score test could then be written as

$$S = \frac{(BX)^T R Y}{\sqrt{\widehat{\mathrm{Var}}_0[(BX)^T R Y \mid X, W]}} = \frac{(BX)^T R Y}{\sqrt{(BX)^T R B X \,\hat\sigma_0^2}} = \frac{(BX)^T \hat V_0^{-1}(Y - W\hat\beta_0)}{\sqrt{(BX)^T R B X \,\hat\sigma_0^2}},$$ (Equation 8)

where $\hat\sigma_0^2 = n^{-1} Y^T R Y = n^{-1}(Y - W\hat\beta_0)^T \hat V_0^{-1}(Y - W\hat\beta_0)$ and where we have used the fact that $R \hat V_0 R = R$ in deriving Equation 8. Under the null hypothesis, $S$ would follow a standard normal distribution asymptotically. The PMALT test statistic is a slight variation on the prospective likelihood score test, given by
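The identity $R \hat V_0 R = R$ used for Equation 8 can be verified in a few lines. Write $M = \hat V_0^{-1} W (W^T \hat V_0^{-1} W)^{-1} W^T \hat V_0^{-1}$, so that $R = \hat V_0^{-1} - M$ (the symbol $M$ is shorthand introduced only for this check). Then

$$\begin{aligned}
M \hat V_0 M &= \hat V_0^{-1} W (W^T \hat V_0^{-1} W)^{-1} (W^T \hat V_0^{-1} W) (W^T \hat V_0^{-1} W)^{-1} W^T \hat V_0^{-1} = M, \\
R \hat V_0 R &= (\hat V_0^{-1} - M)\, \hat V_0\, (\hat V_0^{-1} - M) = \hat V_0^{-1} - 2M + M \hat V_0 M = \hat V_0^{-1} - M = R.
\end{aligned}$$

Consequently, with the null covariance of $Y$ estimated by $\hat\sigma_0^2 \hat V_0$, the estimated null variance of the numerator is $\hat\sigma_0^2 (BX)^T R \hat V_0 R B X = \hat\sigma_0^2 (BX)^T R B X$, which is the variance that appears in the denominator of Equation 8.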

$$\mathrm{PMALT} = \frac{(BX)^T R Y}{\sqrt{(BX)^T R B X \,\hat\sigma_Y^2}} = \sqrt{\frac{\hat\sigma_0^2}{\hat\sigma_Y^2}}\; S,$$ (Equation 9)

where $\hat\sigma_Y^2 = (n - q - 1)^{-1} Y^T Q Y$, with $Q = R - R B X [(BX)^T R B X]^{-1} (BX)^T R$. Here, $\hat\sigma_Y^2$ is an estimate of the residual variance of $Y$ under the alternative model (e.g., as in generalized regression), whereas $\hat\sigma_0^2$ is an estimate of the residual variance of $Y$ under the null model. Under the null hypothesis, $\hat\sigma_0^2/\hat\sigma_Y^2$ converges in probability to 1, so PMALT and $S$ are asymptotically equivalent under the null hypothesis, and the asymptotic null distribution of PMALT is also standard normal.

Note that the numerator of the PMALT statistic can be written as $X^T A$, where $A = B^T R Y$ is a vector whose $i$th entry, $A_i$, is obtained via summation of individual $i$’s entries of the transformed phenotypic residual vector $RY = \hat V_0^{-1}(Y - W\hat\beta_0)$. Thus, $A_i$ is a scalar quantity that encapsulates all the complete records (longitudinal measurements and covariates) for individual $i$ into a single transformed phenotypic residual.
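The following NumPy sketch computes $R$, the score statistic $S$, and the PMALT statistic for one variant, taking $\hat V_0$ as given. It is an illustration only: estimation of $\hat\tau_0$ (Appendix A) is not shown, the function name `pmalt` and the returned dictionary are invented here, and the data and the matrix `V0` are simulated placeholders rather than fitted values.

```python
import numpy as np

def pmalt(Y, W, BX, V0):
    """Score statistic S (Equation 8) and PMALT (Equation 9) for one variant."""
    n, q = W.shape
    V0_inv = np.linalg.inv(V0)
    V0_inv_W = V0_inv @ W
    # R = V0^{-1} - V0^{-1} W (W' V0^{-1} W)^{-1} W' V0^{-1}
    R = V0_inv - V0_inv_W @ np.linalg.solve(W.T @ V0_inv_W, V0_inv_W.T)
    RY = R @ Y                         # transformed phenotypic residual
    num = BX @ RY                      # (BX)' R Y
    D = BX @ R @ BX                    # (BX)' R BX
    sigma0_sq = (Y @ RY) / n           # residual variance under the null
    R_BX = R @ BX
    Q = R - np.outer(R_BX, R_BX) / D
    sigmaY_sq = (Y @ Q @ Y) / (n - q - 1)  # residual variance, alternative
    return {
        "S": num / np.sqrt(D * sigma0_sq),
        "PMALT": num / np.sqrt(D * sigmaY_sq),
        "gamma_hat": num / D,
        "D": D, "sigma0_sq": sigma0_sq, "sigmaY_sq": sigmaY_sq, "R": R,
    }

# Simulated placeholder data: 4 individuals with 3, 2, 4, 3 records.
rng = np.random.default_rng(0)
B = np.repeat(np.eye(4), [3, 2, 4, 3], axis=0)
n = B.shape[0]
W = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + covariate
X = np.array([0.5, 0.0, 1.0, 0.5])                     # hypothetical genotypes
V0 = 0.6 * np.eye(n) + 0.4 * (B @ B.T)                 # stand-in for V(tau_hat)
Y = rng.normal(size=n)

out = pmalt(Y, W, B @ X, V0)
A = B.T @ (out["R"] @ Y)   # individual-level transformed residuals
```

Here `X @ A` reproduces the numerator `(B @ X) @ R @ Y`, matching the individual-level form $X^T A$.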

We can obtain a slightly different interpretation of the PMALT test statistic by viewing it as the t-statistic for testing $H_0: \gamma = 0$ versus $H_1: \gamma \ne 0$ in the generalized regression model given by

$$Y = W\beta + BX\gamma + \epsilon,$$ (Equation 10)

with the assumption that $E\epsilon = 0$ and $\mathrm{Var}\,\epsilon = \sigma^2 \hat V_0$, where $\hat V_0$ is treated as fixed and known and $\beta$, $\gamma$, and $\sigma^2$ are treated as unknown. Let $\hat\beta$, $\hat\gamma$, and $\hat\sigma_Y^2$ represent the parameter estimates obtained by generalized regression under this model. Then the PMALT test statistic is the same as the generalized regression t-statistic for testing $H_0: \gamma = 0$, which is given by

$$\mathrm{PMALT} = \hat\gamma \sqrt{\frac{(BX)^T R B X}{\hat\sigma_Y^2}},$$ (Equation 11)

where $\hat\gamma = [(BX)^T R B X]^{-1} (BX)^T R Y$, $\hat\sigma_Y^2$ is as defined above, and the expressions in Equations 9 and 11 are equal.
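The equality of Equations 9 and 11 follows directly from the definition of $\hat\gamma$. Writing $D = (BX)^T R B X$ (a shorthand used only here), we have $(BX)^T R Y = D\hat\gamma$, so

$$\mathrm{PMALT} = \frac{(BX)^T R Y}{\sqrt{D\,\hat\sigma_Y^2}} = \frac{D\,\hat\gamma}{\sqrt{D\,\hat\sigma_Y^2}} = \hat\gamma \sqrt{\frac{D}{\hat\sigma_Y^2}}.$$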

Note that if the same individuals were genotyped at every tested SNP, then, in the expression for the PMALT test statistic given in Equation 9, $B$, $R$, $Y$, and $\hat\sigma_Y^2$ would be the same for every SNP, and only the values of $X$ would change across SNPs. However, if different individuals had missing genotypes at different SNPs, then not only would $B$, $Y$, and $X$ change for every SNP, but, in principle, we would also need to re-estimate $\hat\tau_0$ at every SNP, in addition to recalculating $R$ and $\hat\sigma_Y^2$, which are functions of both $\hat\tau_0$ and $Y$. For computational convenience, we only calculate $\hat\tau_0$ once per genome screen by using a representative set of individuals rather than re-estimating it at every SNP. However, we do recalculate $B$, $Y$, $X$, $R$, and $\hat\sigma_Y^2$ at every SNP.

L-GATOR for Complete Data

Unlike PMALT, which is a prospective analysis based on Y|(X,W), L-GATOR is a retrospective analysis based on X|(Y,W). The retrospective framework is particularly useful for incorporating certain types of partially missing data because one can conveniently make use of the known dependence structure among relatives’ genotypes that arises from Mendelian inheritance. (This will be explained in more detail in the next subsection.) Another advantage of the retrospective approach is the robustness of the analysis to misspecification of the phenotype model, in the sense that the type 1 error of the association test will generally be correct even when the phenotype model is misspecified.

For clarity, we first present L-GATOR for the special case of complete data, meaning that only individuals with observed genotype data at the SNP being tested and with at least one complete record are included in the analysis (the same assumptions used in PMALT in the previous subsection). In the complete-data case, we define the L-GATOR test statistic by

$$\text{L-GATOR}_{\text{comp}} = \frac{[(BX)^T R Y]^2}{\widehat{\mathrm{Var}}_0[(BX)^T R Y \mid Y, W]}.$$ (Equation 12)

Comparison of Equations 9 and 12 shows that the square of the PMALT statistic and the $\text{L-GATOR}_{\text{comp}}$ statistic have the same numerator, whereas their denominators are the null conditional variances of the numerator given $(X, W)$ and $(Y, W)$, respectively.

In order to give a more specific expression for L-GATOR in the complete-data case and to partially justify an asymptotic $\chi_1^2$ null distribution, we first make two modeling assumptions on the distribution of $X \mid (Y, W)$. The first assumption concerns the null mean:

$$E_0(X \mid Y, W) = W_s \eta,$$ (Equation 13)

where $W_s$ is the $m \times q_s$ matrix of static covariate values and $q_s$ is the number of static covariates. The rows of $W_s$ are indexed by individuals, the columns are indexed by the static covariates, and the $(i,j)$th element is equal to the value of the $j$th static covariate in the $i$th individual. The static covariates are the covariates that stay the same for all records of an individual (e.g., sex), whereas dynamic covariates, such as age, change over time. Because the intercept is always included as a static covariate, $W_s$ will always contain at least one column, and $1 \le q_s \le q$. In Equation 13, $\eta$ is an unknown column vector of length $q_s$. The interpretation of Equation 13 is that, under the null hypothesis of no association and no linkage between genotype and phenotype, the genotype is permitted to be linearly related to the static covariates. A special case of this condition would be $E_0(X \mid Y, W) = f\mathbf{1}$, where $f$ is the allele frequency and $\mathbf{1}$ is a vector of length $m$, all of whose entries are equal to 1. This case arises when all the entries of $\eta$ are 0 except for the entry corresponding to the intercept, so that the null mean does not depend on any other static covariates.

To understand why we make the assumption in Equation 13, it is helpful to first understand the relationship between the full covariate matrix $W$, which has rows indexed by records (and thus has $n$ rows), and the static covariate matrix $W_s$, which has rows indexed by individuals (and thus has $m$ rows). If we use the vertically expanded identity matrix, $B$, to expand $W_s$ from the individual level to the record level, then the columns of $BW_s$ are a subset of the columns of $W$. Because the matrix $R$ is orthogonal to the column space of $W$, we have $RBW_s = 0$. Thus, the assumption of Equation 13 ensures that

$$E_0[(BX)^T R Y \mid Y, W] = E_0[Y^T R B X \mid Y, W] = Y^T R B W_s \eta = 0.$$ (Equation 14)

In other words, the square root of the numerator of the $\text{L-GATOR}_{\text{comp}}$ statistic has mean 0 under the null hypothesis.
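The orthogonality used above can be checked directly from the definition of $R$:

$$RW = \hat V_0^{-1} W - \hat V_0^{-1} W (W^T \hat V_0^{-1} W)^{-1} (W^T \hat V_0^{-1} W) = \hat V_0^{-1} W - \hat V_0^{-1} W = 0.$$

Since every column of $BW_s$ is also a column of $W$, it follows that $RBW_s = 0$. In particular, because the intercept is a static covariate, $B\mathbf{1}$ is the intercept column of $W$, so $A^T \mathbf{1} = Y^T R B \mathbf{1} = 0$; that is, the individual-level transformed phenotypic residuals sum to zero.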

The second modeling assumption that we make on the distribution of X|(Y,W) is that

$$\mathrm{Var}_0(X \mid Y, W) = \sigma_X^2 \Phi,$$ (Equation 15)

where $\sigma_X^2$ is an unknown scalar. This is a version of the standard variance relationship that holds, for example, under Mendelian inheritance in a single population. Here, however, we use the more general parameter $\sigma_X^2$ rather than assuming that it equals $0.5 f(1 - f)$, which would hold under Hardy-Weinberg equilibrium (HWE). The complete-data L-GATOR test statistic from Equation 12 can then be equivalently written as

$$\text{L-GATOR}_{\text{comp}} = \frac{(Y^T R B X)^2}{\widehat{\mathrm{Var}}_0(Y^T R B X \mid Y, W)} = \frac{(A^T X)^2}{\widehat{\mathrm{Var}}_0(A^T X \mid Y, W)} = \frac{(A^T X)^2}{A^T \Phi A \,\hat\sigma_X^2},$$ (Equation 16)

where $A = B^T R Y$ is the individual-level transformed phenotypic residual vector, defined in the subsection PMALT: Missing Values Excluded. Furthermore,

$$\hat\sigma_X^2 = \frac{1}{m - q_s} X^T U X$$ (Equation 17)

is an estimator of the null conditional variance of the genotype of an outbred individual, where we define

$$U = \Phi^{-1} - \Phi^{-1} W_s (W_s^T \Phi^{-1} W_s)^{-1} W_s^T \Phi^{-1}.$$ (Equation 18)

The estimator $\hat\sigma_X^2$ is based on the residual sum of squares for the generalized regression of $X$ on $W_s$ with conditional covariance matrix proportional to $\Phi$. If more restrictive assumptions were made, e.g., if Equation 13 were replaced with the assumption $E_0(X \mid Y, W) = f\mathbf{1}$, then we could replace $\hat\sigma_X^2$ of Equation 17 with

$$\sigma_X^2 = \frac{1}{m - 1} X^T \Phi^{-1}(X - \hat f \mathbf{1}) = \frac{1}{m - 1} X^T U_1 X,$$ (Equation 19)

where $\hat f = (\mathbf{1}^T \Phi^{-1} \mathbf{1})^{-1} \mathbf{1}^T \Phi^{-1} X$ is the best linear unbiased estimator14 (BLUE) of the allele frequency $f$ for the current variant, and $U_1 = \Phi^{-1} - \Phi^{-1} \mathbf{1} (\mathbf{1}^T \Phi^{-1} \mathbf{1})^{-1} \mathbf{1}^T \Phi^{-1}$. If, in addition, we assumed HWE at the variant, we could replace $\hat\sigma_X^2$ of Equation 17 with the estimator15

$$\tilde\sigma_X^2 = \frac{1}{2} \hat f (1 - \hat f).$$ (Equation 20)

Under the assumptions in Equations 13 and 15, and with the usual regularity conditions sufficient for a central limit theorem, $\text{L-GATOR}_{\text{comp}}$ is asymptotically $\chi_1^2$ distributed under the null hypothesis of no association and no linkage.
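A compact sketch of the complete-data statistic (Equations 16 to 18) is given below, assuming the individual-level residual vector $A$ has already been computed. The function names, the kinship matrix, and the values of `A` and `X` are hypothetical, and a four-person sample is far too small for the asymptotic $\chi_1^2$ approximation; the example only shows how the pieces fit together. The p-value uses the identity $P(\chi_1^2 > s) = \mathrm{erfc}(\sqrt{s/2})$.

```python
import math
import numpy as np

def lgator_comp(A, X, Phi, Ws):
    """Complete-data L-GATOR statistic with the variance estimator of Equation 17."""
    m, qs = Ws.shape
    Phi_inv_Ws = np.linalg.solve(Phi, Ws)
    # U = Phi^{-1} - Phi^{-1} Ws (Ws' Phi^{-1} Ws)^{-1} Ws' Phi^{-1}
    U = np.linalg.inv(Phi) - Phi_inv_Ws @ np.linalg.solve(Ws.T @ Phi_inv_Ws,
                                                          Phi_inv_Ws.T)
    sigmaX_sq = (X @ U @ X) / (m - qs)
    return (A @ X) ** 2 / ((A @ Phi @ A) * sigmaX_sq)

def chi2_1_pvalue(stat):
    """Upper-tail probability of a chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical sample: an outbred trio plus one unrelated individual.
Phi = np.array([[1.0, 0.0, 0.5, 0.0],
                [0.0, 1.0, 0.5, 0.0],
                [0.5, 0.5, 1.0, 0.0],
                [0.0, 0.0, 0.0, 1.0]])
Ws = np.ones((4, 1))                     # intercept as the only static covariate
X = np.array([0.0, 1.0, 0.5, 0.5])       # made-up genotypes
A = np.array([1.0, -0.5, 0.25, -0.75])   # made-up residuals that sum to zero

stat = lgator_comp(A, X, Phi, Ws)
p_value = chi2_1_pvalue(stat)
```

The statistic is unchanged when `A` is rescaled, since the scale cancels between numerator and denominator.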

L-GATOR Incorporates Incomplete Data

One advantage of the retrospective analysis used in L-GATOR is that it provides a natural way to incorporate certain types of partially missing data in order to improve the power of association testing. In contrast, the PMALT analysis ignores data on individuals who have one or more complete records but have missing genotype data at the variant being tested or who have genotype data at the variant being tested but do not have any complete records.

We introduce notation to represent and distinguish different types of incomplete data. Recall that we assume that genetic variants are tested one by one for association with the trait. In what follows, we call the genetic variant currently being tested the “variant of interest.” For a given variant of interest, let $P$ denote the subset of individuals in the sample who have at least one complete record and who satisfy at least one of the following three criteria: (1) they have nonmissing genotype at the variant of interest, (2) they have a relative in the study with nonmissing genotype at the variant of interest, or (3) they are in the same pedigree as an individual who has at least one complete record and either has nonmissing genotype or has a relative with nonmissing genotype at the variant of interest. We let $G$ denote the subset of individuals in the sample with nonmissing genotype at the variant of interest. The PMALT association analysis includes only the individuals in the set $P \cap G$ (i.e., those in $P$ who satisfy criterion 1 above). In contrast, the L-GATOR association analysis is based on the larger set of individuals in $P \cup G$. We also introduce notation, $Z$, for the set of all individuals who have at least one complete record, where $P \subseteq Z$.
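The sets $P$, $G$, and $Z$ can be sketched with plain Python sets. The function below is one possible reading of the three criteria, under the assumption that "relative" means a positive entry in the kinship matrix $\Phi$ and that "same pedigree" means the same family label; the function name and inputs are invented for illustration. The example sample is a trio in which only the father is genotyped and only the mother and offspring have complete records, plus an unrelated phenotyped individual without genotype.

```python
def analysis_sets(family, has_record, has_genotype, Phi):
    """Return (P, G, Z) as sets of individual indices for one variant."""
    ids = range(len(family))
    Z = {i for i in ids if has_record[i]}      # at least one complete record
    G = {i for i in ids if has_genotype[i]}    # genotyped at the variant

    def informative(i):
        # Criterion 1 or 2: genotyped, or has a genotyped relative.
        return i in G or any(Phi[i][j] > 0 for j in G if j != i)

    # Phenotyped individuals for whom a genotype or its predictor is available.
    anchors = {k for k in Z if informative(k)}
    # Criterion 3: shares a pedigree with such an individual.
    P = {i for i in Z
         if informative(i) or any(family[i] == family[k] for k in anchors)}
    return P, G, Z

# Index 0: father, 1: mother, 2: offspring, 3: unrelated individual.
family = [1, 1, 1, 2]
has_record = [False, True, True, True]
has_genotype = [True, False, False, False]
Phi = [[1.0, 0.0, 0.5, 0.0],
       [0.0, 1.0, 0.5, 0.0],
       [0.5, 0.5, 1.0, 0.0],
       [0.0, 0.0, 0.0, 1.0]]

P, G, Z = analysis_sets(family, has_record, has_genotype, Phi)
pmalt_set = P & G     # individuals usable by PMALT
lgator_set = P | G    # individuals usable by L-GATOR
```

In this example PMALT has no usable individuals, whereas L-GATOR uses the whole trio; the unrelated individual is in $Z$ but not in $P$.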

To accommodate partially missing data, we need to redefine the index sets for some of the vectors and matrices defined in the previous two subsections. We let $p = |P|$ and $g = |G|$, which are the numbers of phenotyped and genotyped individuals, respectively, who are included in the L-GATOR association analysis for the variant of interest. We assume that all incomplete records have been culled from the dataset, that $n_i$ represents the number of complete records for individual $i$, and that $n = \sum_{i \in P} n_i$ represents the total number of complete records included in the L-GATOR analysis. We redefine the genotype vector, $X$, to have length $g$ and be indexed by the set $G$, and we redefine the vertically expanded identity matrix, $B$, to be $n \times p$. We define $\Phi_P$, $\Phi_{PG}$, and $\Phi_G$ to be submatrices of $\Phi$ where, for example, $\Phi_P$ is the $p \times p$ matrix that retains in the $\Phi$ matrix the rows and columns corresponding to individuals in $P$, and $\Phi_{PG}$ is the $p \times g$ matrix that retains in the $\Phi$ matrix the rows corresponding to individuals in $P$ and the columns corresponding to individuals in $G$. Hence, we rewrite the null conditional variance assumption in Equation 15 as

$$\mathrm{Var}_0(X \mid Y, W) = \sigma_X^2 \Phi_G,$$ (Equation 21)

we redefine the expanded kinship matrix $K_{n \times n}$ to be $K = B \Phi_P B^T$, and we redefine $U_1$ to be the $g \times g$ matrix $U_1 = \Phi_G^{-1} - \Phi_G^{-1} \mathbf{1} (\mathbf{1}^T \Phi_G^{-1} \mathbf{1})^{-1} \mathbf{1}^T \Phi_G^{-1}$. We also redefine the individual-level transformed phenotypic residual vector $A$ to be $p \times 1$ with $A = B^T R Y$.

In the incomplete-data case, the L-GATOR statistic becomes

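$$\text{L-GATOR} = \frac{(Y^T R B \Phi_{PG} U_1 X)^2}{\widehat{\mathrm{Var}}_0(Y^T R B \Phi_{PG} U_1 X \mid Y, W)} = \frac{(A^T \Phi_{PG} U_1 X)^2}{\widehat{\mathrm{Var}}_0(A^T \Phi_{PG} U_1 X \mid Y, W)},$$ (Equation 22)

which we rewrite as

$$\text{L-GATOR} = \frac{(A^T \hat X)^2}{\widehat{\mathrm{Var}}_0(A^T \hat X \mid Y, W)} = \frac{(A^T \hat X)^2}{A^T \Phi_{PG} U_1 \Phi_{GP} A \,\hat\sigma_X^2},$$ (Equation 23)

where $\hat\sigma_X^2$ is a consistent estimator of $\sigma_X^2$ (discussed below) and where the vector $\hat X_{p \times 1}$ is the best linear unbiased predictor16 (BLUP) of the vector of genotypes for the individuals in set $P$. Here, the BLUP, $\hat X$, is based on $X$, the vector of genotypes of the individuals in set $G$, and the kinship information. Specifically, the BLUP is given by

$$\hat X = \hat f \mathbf{1} + \Phi_{PG} \Phi_G^{-1}(X - \hat f \mathbf{1}) = \left[\mathbf{1}(\mathbf{1}^T \Phi_G^{-1} \mathbf{1})^{-1} \mathbf{1}^T \Phi_G^{-1} + \Phi_{PG} U_1\right] X,$$ (Equation 24)

where $\hat f = (\mathbf{1}^T \Phi_G^{-1} \mathbf{1})^{-1} \mathbf{1}^T \Phi_G^{-1} X$ is the BLUE of allele frequency based on $X$. Further details can be found in McPeek.16 The equality of the expressions in Equations 22 and 23 is shown in Appendix B.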
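The BLUP of Equation 24 is straightforward to compute once the kinship submatrices are available. The sketch below uses a hypothetical sample in which a father and an unrelated individual are genotyped, and the father's offspring is phenotyped but not genotyped; the function name and all values are illustrative assumptions.

```python
import numpy as np

def blup_genotypes(X, Phi_PG, Phi_G):
    """BLUP of genotypes for individuals in P from genotypes X of those in G."""
    ones_g = np.ones(len(X))
    # BLUE of the allele frequency: (1' Phi_G^{-1} 1)^{-1} 1' Phi_G^{-1} X
    f_hat = (ones_g @ np.linalg.solve(Phi_G, X)) / (
        ones_g @ np.linalg.solve(Phi_G, ones_g))
    # X_hat = f_hat * 1 + Phi_PG Phi_G^{-1} (X - f_hat * 1)
    X_hat = f_hat + Phi_PG @ np.linalg.solve(Phi_G, X - f_hat * ones_g)
    return X_hat, f_hat

# Full kinship matrix for (father, offspring, unrelated individual).
Phi = np.array([[1.0, 0.5, 0.0],
                [0.5, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
P_idx = [0, 1, 2]          # all three have complete records
G_idx = [0, 2]             # only father and unrelated individual are genotyped
Phi_PG = Phi[np.ix_(P_idx, G_idx)]
Phi_G = Phi[np.ix_(G_idx, G_idx)]

X = np.array([1.0, 0.0])   # hypothetical genotypes of the individuals in G
X_hat, f_hat = blup_genotypes(X, Phi_PG, Phi_G)
```

In this example the estimated allele frequency is 0.5, the two genotyped individuals get back their observed genotypes (1.0 and 0.0), and the ungenotyped offspring is predicted at 0.75, halfway between the father's genotype and the estimated frequency.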

In the special case when an individual is in $P \cap G$, so that his or her genotype is observed, the BLUP of the individual’s genotype is simply equal to the individual’s observed genotype. Thus, in the complete-data case, the numerator, $(A^T \hat X)^2$, of L-GATOR in Equation 23 becomes equal to the numerator, $(A^T X)^2$, of $\text{L-GATOR}_{\text{comp}}$ in Equation 16. Similarly, in the complete-data case, the denominator, $A^T \Phi_{PG} U_1 \Phi_{GP} A \,\hat\sigma_X^2$, of L-GATOR in Equation 23 becomes equal to the denominator, $A^T \Phi A \,\hat\sigma_X^2$, of $\text{L-GATOR}_{\text{comp}}$ in Equation 16. (See Appendix B for details.)

Possible estimators, $\hat\sigma_X^2$, of $\sigma_X^2$ of Equation 23 include the estimator of Equation 17, where we restrict the estimator to be based only on those individuals who have both genotype and static covariates available (see Appendix B). Alternatively, if we assumed $E_0(X \mid Y, W) = f\mathbf{1}$, then we could replace $\hat\sigma_X^2$ of Equation 23 by

$$\sigma_X^2 = \frac{1}{g - 1} X^T U_1 X,$$ (Equation 25)

as in Thornton and McPeek17, or we could use Equation 20 if we were also willing to assume HWE.

To obtain an asymptotic $\chi_1^2$ distribution for the L-GATOR test statistic under the null hypothesis, in the case of incomplete data, we slightly modify the assumption on the null mean, given in Equation 13 for the complete-data case, to obtain

$$E_0(\hat X \mid Y, W) = W_s \eta,$$ (Equation 26)

where $X$ in Equation 13 has been replaced with $\hat X$ in Equation 26. Under the assumptions of Equations 21 and 26 and with the usual regularity conditions sufficient for a central limit theorem, L-GATOR is asymptotically $\chi_1^2$ distributed under the null hypothesis of no association and no linkage.
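Putting the pieces together, the incomplete-data statistic of Equation 23 can be sketched as below, here paired with the variance estimator of Equation 25 (which presumes the null mean $f\mathbf{1}$). The function name and data are hypothetical, and $\Phi_{GP}$ is taken to be the transpose of $\Phi_{PG}$. As a sanity check, the example sets $P = G$ so the result can be compared with the complete-data form $(A^T X)^2 / (A^T \Phi A \,\hat\sigma_X^2)$ of Equation 16; the made-up vector `A` sums to zero, as transformed residuals do when the model includes an intercept.

```python
import numpy as np

def lgator(A, X, Phi_PG, Phi_G):
    """L-GATOR statistic (Equation 23) with the variance estimator of Equation 25."""
    g = len(X)
    ones = np.ones(g)
    PhiG_inv = np.linalg.inv(Phi_G)
    PhiG_inv_1 = PhiG_inv @ ones
    # U1 = Phi_G^{-1} - Phi_G^{-1} 1 (1' Phi_G^{-1} 1)^{-1} 1' Phi_G^{-1}
    U1 = PhiG_inv - np.outer(PhiG_inv_1, PhiG_inv_1) / (ones @ PhiG_inv_1)
    sigmaX_sq = (X @ U1 @ X) / (g - 1)
    num = (A @ Phi_PG @ U1 @ X) ** 2
    den = (A @ Phi_PG @ U1 @ Phi_PG.T @ A) * sigmaX_sq  # Phi_GP = Phi_PG'
    return num / den, sigmaX_sq

# Hypothetical complete-data example (P = G): outbred trio plus one singleton.
Phi = np.array([[1.0, 0.0, 0.5, 0.0],
                [0.0, 1.0, 0.5, 0.0],
                [0.5, 0.5, 1.0, 0.0],
                [0.0, 0.0, 0.0, 1.0]])
X = np.array([0.0, 1.0, 0.5, 0.5])
A = np.array([1.0, -0.5, 0.25, -0.75])   # sums to zero

stat, sigmaX_sq = lgator(A, X, Phi, Phi)
stat_comp = (A @ X) ** 2 / ((A @ Phi @ A) * sigmaX_sq)
```

With missing genotypes, the same function would be called with the rectangular $p \times g$ matrix $\Phi_{PG}$ and the $g \times g$ matrix $\Phi_G$.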

We can define an alternative version of L-GATOR, which we call $\text{L-GATOR}_{\text{alt}}$, in which the BLUP, $\hat X$, depends not only on $X$ but also on the static covariates (see Appendix C for details). Note that use of $\text{L-GATOR}_{\text{alt}}$ would require that all individuals included in the set $G$ also have non-missing values for the static covariates.

L-GATOR can be viewed as a quasi-likelihood score test that generalizes the MASTOR test7 for quantitative traits to the context of repeated measures. Specifically, the expression for L-GATOR in Equation 23 can be seen as a generalization of the MASTOR statistic given in equation 11 of Jakobsdottir and McPeek,7 where the quantity $PY$ defined in Jakobsdottir and McPeek for a non-longitudinal quantitative trait is replaced by the more general vector $A = B^T R Y$ for a longitudinal quantitative trait and where these two quantities are equal if there is only one time point.

Understanding the Contribution of Partially Missing Data in L-GATOR

If an individual has at least one complete record but has missing genotype data at the current marker, or if the individual has genotype data but no complete records, it is of interest to understand to what extent the data on this individual can still provide information on genotype-phenotype association. One type of information these data can provide is information on nuisance parameters, which indirectly improves inference on association. To be specific, genotyped individuals with no complete records can still provide information on population allele frequencies, and individuals with missing genotype data at the current marker but with at least one complete record can still provide information on the null MLE of the phenotypic nuisance parameter (β,τ) used in L-GATOR. In the current implementation of L-GATOR, the nuisance parameter f is estimated from the data on all individuals in G, the nuisance parameter τ is estimated from the data on all individuals in Z (i.e., the individuals with at least one complete record), and the nuisance parameter β is estimated from the data on all individuals in the set P.

When the sample includes relatives, it is also possible for individuals with at least one complete record but missing genotype data to provide more direct information on genotype-phenotype association if they have relatives whose data give partial information on the missing genotypes. In particular, if an individual has a missing genotype, one or more complete records, and at least one genotyped relative, then the BLUP formulation of L-GATOR in Equation 23 shows that the individual contributes to the association analysis such that the BLUP of their genotype replaces their observed genotype in the numerator (and the additional uncertainty in their genotype is accounted for in the denominator).

Finally, there is another less obvious contribution that can be made by an individual, say individual a, who has at least one complete record, missing genotype data, and no genotyped relatives but is in the same pedigree with an individual, b, who has at least one complete record and either is genotyped at the variant of interest or has a genotyped relative. In that case, b contributes to the association analysis because either an observed genotype or a BLUP for the genotype of b is available, and a can contribute to the transformed phenotypic residual of b provided that (V̂_0^{−1})_ij ≠ 0 for at least one pair (i,j), where i is a record of individual a and j is a record of individual b. This represents a slight contribution of individual a to the analysis through the transformed phenotypic residual of individual b. For example, consider a parent-offspring trio in which the father has genotype data but no complete records, whereas both the mother and offspring have at least one complete record but missing genotype data. Then, a BLUP of the genotype is available for the offspring, so the BLUP formulation of L-GATOR in Equation 23 makes clear how the offspring’s data contribute to the association analysis. The father’s data provide information for the BLUP of the offspring, whereas the mother’s data can contribute to the transformed phenotypic residual of the offspring.
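To make the trio example concrete, the following minimal sketch computes a best linear unbiased predictor of an ungenotyped offspring's genotype from one genotyped parent. It is an illustration under stated assumptions, not the estimator of Equation 23: the allele frequency f is treated as known, the parent is outbred (self-kinship 0.5), parent-offspring kinship is 0.25, and genotypes use the 0, 0.5, 1 coding of this paper. The function name is hypothetical.

```python
def blup_offspring_from_parent(x_parent, f):
    """Linear predictor of an ungenotyped offspring's genotype (coded 0, 0.5, 1)
    given one genotyped parent, assuming a known allele frequency f."""
    phi_parent_parent = 0.5      # self-kinship of an outbred parent (assumption)
    phi_offspring_parent = 0.25  # parent-offspring kinship coefficient
    # Genotype covariance is proportional to kinship, so the predictor is
    # f + (phi_op / phi_pp) * (x_parent - f).
    return f + (phi_offspring_parent / phi_parent_parent) * (x_parent - f)

# Hypothetical allele frequency of 0.2; the prediction is shrunk halfway toward f.
predictions = {x: blup_offspring_from_parent(x, f=0.2) for x in (0.0, 0.5, 1.0)}
for genotype, value in predictions.items():
    print(genotype, round(value, 3))
```

The prediction equals (x_parent + f)/2, which matches the intuition that one offspring allele comes from the genotyped parent and the other from the population.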

Connection between longGWAS and PMALT

The previously proposed longGWAS method1 is a prospective, longitudinal analysis based on a mixed-effects model and implemented in R. There is a close connection between the longGWAS and PMALT methods. The main difference between them lies in the parameterization of the correlation between repeated observations on an individual, where this correlation is given in the L matrix of PMALT and in the D matrix of longGWAS. In PMALT, the correlation between the jth and kth observations of individual i depends only on the length of time, |t_ij − t_ik|, between the two observations and is given by exp(−α|t_ij − t_ik|^c), which is the (j,k)th element of the ith diagonal block of the L matrix. The entire L matrix involves only two unknown parameters, α and c, which must be estimated. In contrast, in longGWAS, the correlation between the jth and kth observations of individual i depends not simply on the length of time between the two observations but on the actual times themselves and is given by d_{t_ij, t_ik}, which is an unknown parameter. Thus, for example, if we were to try to use longGWAS to analyze data from the original cohort of the FHS dataset (see the subsection SBP Data from the Framingham Heart Study for a description) and if we used the age of the individual rounded to the nearest year as the time point and assumed that individuals were tested every 2 years (and never allowed an odd number of years as a time increment), the D matrix would already involve more than 1,800 unknown parameters that must be estimated, in contrast to two parameters for the L matrix of PMALT. Because of this, the longGWAS method does not seem well suited to analysis of the FHS data or other similar datasets in which the number of possible distinct time points is large and each individual has observations at only a small subset of the possible time points.
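As an illustration of the two-parameter structure, the sketch below builds one individual's diagonal block of the L matrix from arbitrary measurement times by using the correlation exp(−α|t_ij − t_ik|^c). The function name, the measurement ages, and the values of α and c are hypothetical choices made only for this example.

```python
import math

def time_correlation_block(times, alpha, c):
    """Correlation block for one individual: entry (j, k) is
    exp(-alpha * |t_j - t_k| ** c), so it depends only on the time lag."""
    return [[math.exp(-alpha * abs(tj - tk) ** c) for tk in times]
            for tj in times]

# Irregularly spaced ages (in years) for one hypothetical individual.
ages = [40, 42, 47, 51]
block = time_correlation_block(ages, alpha=0.03, c=1.2)

for row in block:
    print([round(value, 3) for value in row])
```

Whatever the number or spacing of time points, only α and c need to be estimated, which is the contrast with the D matrix drawn in the text.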

Another difference between the two methods is that longGWAS does not include a component of variance for non-genetic individual random effects that do not vary over time (the term σ_I^2 J in Equation 4). In addition, in longGWAS, the minor allele frequency (MAF) is imputed for any missing genotype for computational convenience, whereas missing genotypes are excluded in PMALT. The variance estimators used in the two methods are asymptotically equivalent, though not exactly equal in small samples. In datasets in which the number of distinct possible time points is not very large in relation to the number of observations per individual, in which σ_I^2 is also close to 0, and in which there are no missing genotypes, PMALT and longGWAS would be expected to behave very similarly.

Simulation Studies

We performed simulation studies to assess the type 1 error and compare the power of L-GATOR, PMALT, and longGWAS. We also assessed the sensitivity of L-GATOR and PMALT to estimation of the variance-components parameter, τ, from a subset of individuals who are not exactly the same individuals used in the association test. To achieve these goals, we simulated data that included related individuals by using two different trait models and various assumptions about missing genotype and phenotype data, as we now describe.

We simulated two trait models, denoted I and II, both of which have an intercept, static and dynamic covariates, major gene effects, additive polygenic effects, and static and dynamic non-genetic individual effects. The models differ in the major gene effects, such that model I has two unlinked major causal loci that each act dominantly with epistasis between them, whereas model II has four unlinked major causal loci with epistasis between them. For both models, the values of the static covariate were simulated as independent standard normal random variables, one per individual. Each individual was assumed to have five time points with a common difference of 2 years between each consecutive pair of measurement times for an individual. The values of the dynamic covariate were simulated as independent standard normal random variables, five per individual, corresponding to the five time points. Trait model I is given by Equations 2 and 4 with

μ=Wβ+Bf(X1,X2), (Equation 27)

where the first column of W contains the intercept, the second column contains the static covariate, and the third column contains the dynamic covariate; β = (1, 0.6, 0.2)^T, τ = (0.4, 0.1, 0.4, 0.04, 1.2)^T, and either σ^2 = 5 (in the scenarios with no missing genotype data) or σ^2 = 2 (in the scenarios with some missing genotype data). Here, B is the vertically expanded identity matrix given in Equation 1, and f(X1,X2) is a vector with ith element equal to f(X1i,X2i), where X1i and X2i are the genotype values of individual i at causal loci 1 and 2, respectively. Each individual’s genotype at a given locus is coded as 0, 0.5, or 1 according to whether the individual carries 0, 1, or 2 copies, respectively, of the variant allele at that locus. The frequency of the variant allele is 0.1 at locus 1 and 0.5 at locus 2. Table 2 gives the values of f(x1,x2) for (x1,x2) ∈ {0, 0.5, 1}^2. Trait model II is given by Equations 2 and 4 with

μ=Wβ+Bg(X1,X2,X3,X4), (Equation 28)

where W, β, B, τ, and σ^2 are the same as for trait model I. Here, g(X1,X2,X3,X4) is a vector with ith element equal to g(X1i,X2i,X3i,X4i), where X1i, X2i, X3i, and X4i are the genotype values of individual i at causal loci 1, 2, 3, and 4, respectively. The frequencies of the variant alleles at loci 1, 2, 3, and 4 are 0.1, 0.2, 0.3, and 0.4, respectively. Table 3 gives the values of g(x1,x2,x3,x4) for (x1,x2,x3,x4) ∈ {0, 0.5, 1}^4.
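The mean structure in Equations 27 and 28 can be sketched as follows for a toy sample. This is only an illustration: it assumes that records are ordered by individual (so that row r of B belongs to individual r // 5), it uses three individuals rather than a full sample, and the per-individual genetic effects are hypothetical placeholder values rather than simulated genotypes. All function and variable names are assumptions of this sketch.

```python
import random

def expanded_identity(n_individuals, n_times):
    """B: one row per record and one column per individual,
    assuming records are grouped by individual."""
    return [[1.0 if col == r // n_times else 0.0 for col in range(n_individuals)]
            for r in range(n_individuals * n_times)]

def design_matrix(static_cov, dynamic_cov):
    """W: intercept, static covariate (repeated over time), dynamic covariate."""
    return [[1.0, s, d] for s, row in zip(static_cov, dynamic_cov) for d in row]

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

rng = random.Random(0)
n_individuals, n_times = 3, 5
static_cov = [rng.gauss(0, 1) for _ in range(n_individuals)]
dynamic_cov = [[rng.gauss(0, 1) for _ in range(n_times)] for _ in range(n_individuals)]

W = design_matrix(static_cov, dynamic_cov)
B = expanded_identity(n_individuals, n_times)
beta = [1.0, 0.6, 0.2]
genetic_effect = [0.04, 0.8, 0.04]  # hypothetical values of f(X1i, X2i)

# mu = W beta + B f(X1, X2): one mean value per record.
mu = [a + b for a, b in zip(matvec(W, beta), matvec(B, genetic_effect))]
print(len(mu), [round(m, 3) for m in mu[:5]])
```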

Table 2.

Genetic Effects, f(x1,x2), for the Two Unlinked Causal Loci of Trait Model I

x2= 0 x2= 0.5 x2= 1
x1=0 0.04 0.04 0.04
x1=0.5 0.04 0.8 0.8
x1=1 0.04 0.8 0.8

Genotypic effects are given as a function of (x1,x2), where xi is 0, 0.5, or 1 according to whether the individual has zero, one, or two copies, respectively, of the variant allele at locus i. The frequency of the variant allele is 0.1 at locus 1 and 0.5 at locus 2.
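Table 2 can be encoded as a lookup, which makes the dominant-by-dominant epistatic pattern explicit: the effect is 0.8 only when the individual carries at least one variant allele at both loci. The sketch below also computes the mean genetic effect for an unrelated individual under the added assumption of Hardy-Weinberg genotype proportions, which is an assumption of this example and not a description of how pedigree genotypes were simulated.

```python
LEVELS = (0.0, 0.5, 1.0)

# Table 2: f(x1, x2) is 0.8 when both loci carry a variant allele, else 0.04.
F_TABLE = {(x1, x2): (0.8 if x1 > 0 and x2 > 0 else 0.04)
           for x1 in LEVELS for x2 in LEVELS}

def hwe_probs(freq):
    """Genotype probabilities under Hardy-Weinberg proportions (assumption)."""
    return {0.0: (1 - freq) ** 2, 0.5: 2 * freq * (1 - freq), 1.0: freq ** 2}

p1, p2 = hwe_probs(0.1), hwe_probs(0.5)
mean_effect = sum(p1[x1] * p2[x2] * F_TABLE[(x1, x2)]
                  for x1 in LEVELS for x2 in LEVELS)
print(round(mean_effect, 4))
```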

Table 3.

Genetic Effects, g(x1,x2,x3,x4), for the Four Unlinked Causal Loci of Trait Model II

(x1, x2) (x3, x4)
(0, 0) (0, 0.5) (0, 1) (0.5, 0) (0.5, 0.5) (0.5, 1) (1, 0) (1, 0.5) (1, 1)
(0,0) 0.04 0.04 0.04 0.3 0.3 0.3 0.5 0.5 0.5
(0,0.5) 0.04 0.1 0.1 0.3 0.3 0.3 0.5 0.5 0.5
(0,1) 0.04 0.1 0.1 0.3 0.3 0.3 0.5 0.5 0.5
(0.5,0) 0.2 0.2 0.2 1.2 1.2 1.2 1.4 1.4 1.4
(0.5,0.5) 0.2 0.2 0.2 1.2 1.2 1.2 1.4 1.4 1.4
(0.5,1) 0.2 0.2 0.2 1.2 1.2 1.2 1.4 1.4 1.4
(1,0) 0.4 0.4 0.4 1.4 1.4 1.4 1.6 1.6 1.6
(1,0.5) 0.4 0.4 0.4 1.4 1.4 1.4 1.6 1.6 1.6
(1,1) 0.4 0.4 0.4 1.4 1.4 1.4 1.6 1.6 1.6

Genotypic effects are given as a function of (x1,x2,x3,x4), where xi is 0, 0.5, or 1 according to whether the individual has zero, one, or two copies, respectively, of the variant allele at locus i. The frequency of the variant allele is 0.1, 0.2, 0.3, and 0.4 at loci 1, 2, 3, and 4, respectively.

We simulated three different sample configurations and three different ascertainment schemes for a total of nine different sample types, each of which consisted of a choice of sample configuration and a choice of ascertainment scheme. Sample configuration 1 contained 50 outbred, three-generation families, each containing 16 individuals related as in Figure S1. Sample configuration 2 contained 75 families of the type in Figure S1. Sample configuration 3 contained 400 unrelated individuals plus 50 families of the type in Figure S1. For each sample configuration, we allowed three possible ascertainment schemes. Ascertainment scheme a was random ascertainment of families and individuals. In ascertainment scheme b, ascertainment of each family was conditional on its having at least two family members whose time-averaged phenotype values were >1.5 SD away from the population mean, whereas unrelated individuals (if any) were sampled randomly from the population. In ascertainment scheme c, ascertainment of each family was conditional on its having at least two family members whose time-averaged phenotype values were >1 SD above the population mean, whereas unrelated individuals (if any) were sampled randomly from the population. Then, for example, sample type 2b denoted a sample with configuration 2 and ascertainment type b.

We considered three possible choices of which individuals were genotyped in a given sample: “all” denoted that all sampled individuals were genotyped, “both tails” denoted that only individuals whose time-averaged phenotype value was >1.5 SD away from the population mean were genotyped, and “upper tail” denoted that only individuals whose time-averaged phenotype value was >1 SD above the population mean were genotyped. We also considered three possible choices of which individuals were phenotyped in a given sample: “all” denoted that all sampled individuals were phenotyped for all records; “missing A” denoted that each sampled individual had a 25% chance of having all phenotype records set to be missing, where this was independent across individuals; and “missing B” denoted that each of the individuals unrelated to anyone else in the sample (if any) had a 50% chance of having all phenotype records set to be missing, where this was independent across individuals, whereas all phenotype records were assumed to be available for all individuals in the sampled families in missing B.
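The three genotyping choices amount to simple thresholds on the standardized time-averaged phenotype. A minimal sketch, in which the function name and the toy phenotype values are hypothetical:

```python
def genotyped_mask(avg_phenotypes, pop_mean, pop_sd, scheme):
    """Return True for individuals who are genotyped under the given scheme."""
    z_scores = [(y - pop_mean) / pop_sd for y in avg_phenotypes]
    if scheme == "all":
        return [True] * len(z_scores)
    if scheme == "both tails":
        return [abs(z) > 1.5 for z in z_scores]  # >1.5 SD away from the mean
    if scheme == "upper tail":
        return [z > 1.0 for z in z_scores]       # >1 SD above the mean
    raise ValueError(f"unknown scheme: {scheme}")

# Toy time-averaged phenotypes with population mean 0 and SD 1.
averages = [-2.0, -1.2, 0.3, 1.1, 1.8]
both_tails = genotyped_mask(averages, 0.0, 1.0, "both tails")
upper_tail = genotyped_mask(averages, 0.0, 1.0, "upper tail")
print(both_tails)
print(upper_tail)
```

Because genotyping depends on the phenotype in the two tail schemes, the genotyped subset has a different phenotype distribution from the full phenotyped sample, which is the feature that the type 1 error comparisons exploit.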

SBP Data from the Framingham Heart Study

The FHS18 is a long-term, multigenerational study designed for the purpose of identifying risk factors that contribute to cardiovascular disease. Our use of the FHS data was approved by the institutional review board of the Biological Sciences Division of the University of Chicago. We analyzed longitudinal SBP measurements on 14,173 FHS-sampled individuals, 1,475 of whom were apparently unrelated to anyone else in the sample and 12,698 of whom could be divided into 1,537 multigeneration pedigrees. The sampled individuals were divided into three cohorts such that the individuals in cohort 1 (original cohort) had measurements at up to 28 time points, approximately every 2 years; the individuals in cohort 2 (offspring cohort) had up to 8 time points, approximately every 4 years; and the individuals in cohort 3 (third generation) had one time point. The time points for an individual in a given cohort could be somewhat irregularly spaced, and the number of time points could vary across individuals in a cohort due to missing data. We analyzed log(SBP) as a longitudinal quantitative trait. In order to account for possible effects of hypotensive medication usage among sampled individuals, we first adjusted the SBP measurements19, 20 by adding a sensible constant (10 mmHg) to measurements that occurred when the corresponding patient was on treatment. In the longitudinal trait model, we included the following nine covariates: sex (1 = male, 0 = female); age (years); age^2; log(BMI), where BMI is body mass index (MIM: 606641); log(current smoking), where current smoking was measured as the number of cigarettes per day; smoking history (1 = current or former smoker, 0 = never smoked up to the time of measurement); log(blood glucose), where blood glucose was measured in mg/100 mL; and two cohort indicators to account for generation effects. We excluded 149 records that occurred when the corresponding individuals were under 18 years of age.
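The order of the phenotype preprocessing steps (medication adjustment first, then log transformation, with records under age 18 dropped) can be sketched as follows. The record layout and function name are hypothetical, and the use of the natural logarithm is an assumption of this sketch.

```python
import math

def preprocess_record(sbp_mmhg, on_treatment, age_years, offset_mmhg=10.0):
    """Return the adjusted log(SBP) for one record, or None if excluded."""
    if age_years < 18:
        return None  # records taken before age 18 are excluded
    adjusted = sbp_mmhg + offset_mmhg if on_treatment else sbp_mmhg
    return math.log(adjusted)  # natural log is an assumption here

treated = preprocess_record(130.0, on_treatment=True, age_years=55)
untreated = preprocess_record(130.0, on_treatment=False, age_years=55)
excluded = preprocess_record(110.0, on_treatment=False, age_years=16)
print(treated, untreated, excluded)
```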

Among the sampled individuals, 9,240 were genotyped on the Affymetrix 500K array, and 536 of these were apparently unrelated to anyone else in the sample. As part of our quality-control procedure, we excluded genotyped individuals who met either of the following two criteria: (1) empirical self-kinship coefficient Φ̃_ii ≥ 0.525, i.e., empirical inbreeding coefficient h̃_i ≥ 0.05, or (2) completeness ≤ 96%, where completeness is defined as the proportion of markers for which a given individual had genotypes called and where we used the completeness value given in the dbGaP quality-control file. Genotype data on 469 individuals satisfying criterion 1 and 584 individuals satisfying criterion 2 (such that 410 individuals satisfied both criteria) were then filtered out (i.e., set to be missing) for all SNPs. Only SNPs that met all of the following quality-control criteria were included in the analysis: (1) call rate ≥ 96%, (2) Mendelian error rate ≤ 2%, and (3) MAF ≥ 1%; for all three of these criteria, we used the values given in the dbGaP quality-control file. We then used PMALT and L-GATOR to test for association between log(SBP) and each of the 369,046 SNPs that met all three of these quality-control criteria. (The analysis of the FHS data was not computationally feasible with longGWAS.)
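Two small pieces of arithmetic are implicit in the individual-level filters: the self-kinship threshold follows from the standard relation self-kinship = (1 + h)/2, and the number of distinct individuals filtered follows from inclusion-exclusion. The sketch below works these out; the count of retained individuals is a derived figure that assumes all filtered individuals are among the 9,240 genotyped individuals.

```python
def self_kinship(inbreeding_coefficient):
    """Self-kinship coefficient implied by an inbreeding coefficient h."""
    return (1 + inbreeding_coefficient) / 2

n_genotyped = 9240
n_criterion_1 = 469  # failed the self-kinship criterion
n_criterion_2 = 584  # failed the completeness criterion
n_both = 410         # failed both criteria

# Inclusion-exclusion: individuals failing at least one criterion.
n_filtered = n_criterion_1 + n_criterion_2 - n_both
n_retained = n_genotyped - n_filtered

print(self_kinship(0.05), n_filtered, n_retained)
```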

Results

To assess type 1 error, we simulated under each of the 18 different settings of trait model, missing data, and sample type listed in Table 4, and in each setting, we tested association between the trait and an unlinked and unassociated marker. For each setting, one phenotype replicate was sampled, and empirical type 1 error was calculated on the basis of 50,000 simulated replicates of genotype data for the unlinked and unassociated marker, which in each case had a MAF of 0.2. Columns 5, 7, and 9 of Table 4 give the empirical type 1 error for PMALT, L-GATOR, and longGWAS at nominal level 0.001. From these results, we can see that both PMALT and L-GATOR were correctly calibrated in all the simulation settings, in the sense that none of the empirical type 1 error results were significantly different from the nominal level, according to a z-test at level 0.05. In contrast, although longGWAS was correctly calibrated in the scenarios with no missing genotype data, its type 1 error was drastically too large in the “both tails” scenario of missing genotype data and drastically too small in the “upper tail” scenario. This phenomenon was caused by the fact that longGWAS substitutes the MAF for any missing genotypes, which, in these scenarios, results in heteroskedasticity that causes the type 1 error to be incorrect.

Table 4.

Empirical Type 1 Error of PMALT, L-GATOR, and longGWAS at Level 0.001

Genotyped Individuals Sample Type Phenotyped Individuals Trait Model Empirical Type 1 Error
PMALT PMALTZ L-GATOR L-GATORPG LongGWAS
All 1a all I 0.0010 0.0010 0.0011 0.0011 0.0009
All 1a all II 0.0010 0.0010 0.0010 0.0010 0.0009
All 2a missing A I 0.0012 0.0012 0.0011 0.0011 0.0010
All 2a missing A II 0.0010 0.0010 0.0010 0.0010 0.0009
All 3a missing B I 0.0012 0.0012 0.0012 0.0012 0.0010
All 3a missing B II 0.0009 0.0009 0.0010 0.0010 0.0012
Both tails 1b all I 0.0012 0.0200 0.0010 0.0010 0.0673
Both tails 1b all II 0.0010 0.0354 0.0008 0.0009 0.1235
Both tails 2b missing A I 0.0010 0.0330 0.0008 0.0009 0.1224
Both tails 2b missing A II 0.0012 0.0373 0.0009 0.0009 0.0990
Both tails 3b missing B I 0.0010 0.0165 0.0009 0.0009 0.0541
Both tails 3b missing B II 0.0011 0.0269 0.0011 0.0010 0.0869
Upper tail 1c all I 0.0010 0 0.0008 0.0009 0
Upper tail 1c all II 0.0010 0 0.0011 0.0012 4 × 10^−5
Upper tail 2c missing A I 0.0011 0 0.0009 0.0009 0.0004
Upper tail 2c missing A II 0.0009 0 0.0008 0.0008 0.0006
Upper tail 3c missing B I 0.0010 0 0.0012 0.0011 0.0001
Upper tail 3c missing B II 0.0012 0 0.0009 0.0009 0.0003

For each row of the table, the empirical type 1 error values were calculated on the basis of one phenotype replicate and 50,000 simulated genotype replicates. The minor allele frequency of the tested marker is 0.2. Underlined values are those that differ significantly (p value < 0.05) from the nominal level of 0.001 by a z-test.
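The calibration check described in the legend can be sketched as a one-sample z-test of a proportion. The sketch assumes that the standard error is computed under the nominal level, which is one common choice and is not confirmed by the text; the function name is hypothetical.

```python
import math

def differs_from_nominal(p_hat, p0=0.001, n_reps=50_000, z_crit=1.959963984540054):
    """Two-sided z-test (level 0.05) of an empirical type 1 error against p0,
    using the standard error under the null, sqrt(p0 * (1 - p0) / n_reps)."""
    se = math.sqrt(p0 * (1 - p0) / n_reps)
    return abs(p_hat - p0) / se > z_crit

# Selected entries from Table 4.
for label, value in [("PMALT 1b/I", 0.0012), ("L-GATOR 1b/II", 0.0008),
                     ("PMALTZ 1b/I", 0.0200), ("longGWAS 2c/II", 0.0006)]:
    print(label, value, differs_from_nominal(value))
```

Under this assumption, values between roughly 0.0008 and 0.0012 are compatible with the nominal level of 0.001 when 50,000 replicates are used.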

In addition, we considered the robustness of the type 1 error of PMALT and L-GATOR to estimation of the variance-component parameter, τ, from a slightly different subset of individuals than those included in the association test. Recall that we ordinarily include in the PMALT analysis only those individuals in P ∩ G, i.e., those who are both genotyped and phenotyped, and this set is used for both estimation of τ and testing of association. In column 6 of Table 4, we give the type 1 error for PMALT when the variance-component parameter, τ, was instead estimated from the larger set, Z, consisting of all phenotyped individuals, whereas the association test was applied only to the set, P ∩ G, of individuals who were both phenotyped and genotyped. We refer to this procedure as PMALTZ. From Table 4, we can see that this small change in the estimation procedure for τ can, in some settings, lead to drastically incorrect type 1 error, either too liberal or too conservative, depending on the setting. We can apply a similar analysis to L-GATOR. In our default implementation of L-GATOR, we estimated τ by using the set of all phenotyped individuals, Z. In column 8 of Table 4, we give the type 1 error for L-GATOR when the variance-component parameter, τ, was instead estimated from the smaller set, P ∩ G, of individuals who were both phenotyped and genotyped, whereas the association test was applied to the larger set, P ∪ G. We refer to this procedure as L-GATORPG. From Table 4, we can see that L-GATORPG was correctly calibrated in all simulation settings, in the sense that none of the empirical type 1 error results were significantly different from the nominal level, according to a z-test at level 0.05. Thus, L-GATOR is much more robust than PMALT to changes in how the variance components are estimated.
This can be explained by the fact that L-GATOR is a retrospective method, so correct type 1 error does not depend on correct specification of the phenotype model, whereas PMALT is a prospective method, so type 1 error is more sensitive to changes in parameter estimates of the phenotype model. In the simulation settings in which the individuals in Z had a different phenotype distribution overall from those in the subset P ∩ G, namely the “both tails” and “upper tail” settings, the change in how the parameters were estimated could throw off the type 1 error of PMALTZ. This suggests that when PMALT is used, the variance components should ideally be estimated from the same set of individuals used in the test for a specific variant.

To assess power, we simulated under each of the 18 different settings of trait model, missing data, and sample type listed in Table 5, and in each setting, we tested association between the trait and causal SNP 1 of the simulated model. For each setting, empirical power was calculated on the basis of 5,000 simulated replicates of the sample genotypes, covariates, and phenotypes. In Table 5, we can see that for settings with nonmissing genotype data (rows 1–6 of Table 5), PMALT and L-GATOR achieved comparable power. This result held even when there was missing phenotype data (rows 3–6 of Table 5), suggesting that the extra information (i.e., genotype data on individuals with missing phenotype) that is used by L-GATOR, but not by PMALT, in estimation of the nuisance parameter, f, does not have a large effect on power. In contrast, in the settings in which some genotypes were missing (rows 7–18 of Table 5), L-GATOR was always more powerful than PMALT. This extra power resulted from the fact that genotyped relatives provided partial genotype information on the individuals who were phenotyped but had missing genotypes. PMALT ignores this information, whereas L-GATOR is able to make use of it to increase power for association.

Table 5.

Power Comparison of PMALT and L-GATOR

Genotyped Individuals Sample Type Phenotyped Individuals Trait Model Power (SE)
PMALT L-GATOR
All 1a all I 0.84 (0.005) 0.84 (0.005)
All 1a all II 0.82 (0.005) 0.82 (0.005)
All 2a missing A I 0.89 (0.004) 0.89 (0.004)
All 2a missing A II 0.87 (0.005) 0.87 (0.005)
All 3a missing B I 0.93 (0.004) 0.93 (0.004)
All 3a missing B II 0.90 (0.004) 0.90 (0.004)
Both tails 1b all I 0.60 (0.007) 0.73 (0.006)
Both tails 1b all II 0.62 (0.007) 0.72 (0.006)
Both tails 2b missing A I 0.66 (0.007) 0.81 (0.006)
Both tails 2b missing A II 0.69 (0.007) 0.83 (0.005)
Both tails 3b missing B I 0.74 (0.006) 0.85 (0.005)
Both tails 3b missing B II 0.74 (0.006) 0.84 (0.005)
Upper tail 1c all I 0.25 (0.006) 0.37 (0.007)
Upper tail 1c all II 0.36 (0.007) 0.46 (0.007)
Upper tail 2c missing A I 0.29 (0.006) 0.35 (0.007)
Upper tail 2c missing A II 0.34 (0.007) 0.36 (0.007)
Upper tail 3c missing B I 0.33 (0.007) 0.37 (0.007)
Upper tail 3c missing B II 0.41 (0.007) 0.42 (0.007)

Empirical power at significance level 0.05 was calculated on the basis of 5,000 simulated replicates. The higher power for each simulation setting is underlined.
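The standard errors in Table 5 are consistent with the usual binomial formula sqrt(p(1 − p)/n) with n = 5,000 replicates. A minimal sketch that reproduces two of them and shows how much larger the standard error becomes with only 500 replicates (the function name is hypothetical):

```python
import math

def power_se(power, n_reps=5000):
    """Binomial standard error of an empirical power estimate."""
    return math.sqrt(power * (1 - power) / n_reps)

print(round(power_se(0.84), 3))              # row 1 of Table 5
print(round(power_se(0.60), 3))              # PMALT, "both tails", sample type 1b
print(round(power_se(0.84, n_reps=500), 3))  # same power with 500 replicates
```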

We also assessed the power of longGWAS in the first six settings listed in Table 5 because these were the only simulation settings in which longGWAS had correct type 1 error. The empirical power of longGWAS was calculated on the basis of only 500 simulated replicates of the sample genotypes, covariates, and phenotypes rather than on 5,000 replicates as for L-GATOR and PMALT because running longGWAS is much slower. In these six settings, the power of longGWAS was not significantly different from that of L-GATOR and PMALT (see Table S1). This verifies our theoretically based expectation that in datasets in which the number of possible time points is not very large in relation to the number of observations per individual, in which σ_I^2 is not large, and in which there are no missing genotypes, longGWAS would behave similarly to PMALT.

In the analysis of the SBP data (described in the next subsection), we noticed that the estimated value of σ_I^2/σ^2, the proportion of variance due to non-genetic individual random effects, was quite small. For the case when the true value of σ_I^2/σ^2 was small to moderate, we explored whether this parameter could be safely removed from the analysis without harming power. To do this, we assessed the power of PMALT and L-GATOR in six settings (see Table S1) under which the analyses were performed with σ_I^2 set to 0 (i.e., removed from the model) even though the simulation settings had σ_I^2/σ^2 = 0.1. Comparing the results of Table S1 with those of the first six rows of Table 5, we found that in this case of relatively small σ_I^2/σ^2, there was no significant difference in power for either PMALT or L-GATOR when σ_I^2 was removed from the model.

Analysis of SBP Data from the Framingham Heart Study

We used both PMALT and L-GATOR to analyze the SBP data from the FHS. We were not able to complete the analysis with the longGWAS method. (In general, longGWAS is not suitable for the analysis of the FHS data or other similar datasets in which the number of possible distinct time points is large and each individual has observations at only a small subset of the possible time points.) For both PMALT and L-GATOR, we first estimated the variance-component parameter, τ, and the covariate effects, β, under the null on the basis of all individuals with at least one non-missing record; the results are given in Table 6 and Table 7, respectively. In each case, the standard error was calculated as the square root of the corresponding diagonal entry of the inverse Fisher information matrix (see Appendix A for more details). In Tables 6 and 7, two different PMALT analyses are summarized. “PMALT After” refers to the PMALT analysis after the genotypes of all individuals in cohort 1 were set to be missing, whereas “PMALT Before” refers to the PMALT analysis when the genotypes of the individuals in cohort 1 were not set to be missing. The purpose of the PMALT After analysis is described later in this subsection. In Table 7, we see that all covariates except log(current smoking) were significant in the L-GATOR analysis, and all but log(current smoking) and smoking history were significant in the PMALT analyses. In Table 6, we see that the variance due to (non-genetic) individual effects, σ_I^2, appeared to be not significantly different from 0 in any of the three analyses. This presumably reflects the fact that there were three other variance components in the model, so the modeled covariance structure was able to fit the data well without the σ_I^2 term. Parameters α and c control the time dependence across the trait measurements for an individual.
Under the model with the values of α and c given in Table 6 (where these values are based on time measured in years), the correlation between the measurements on a single individual at two different time points decayed fairly slowly as a function of the time interval between the measurements. For example, the correlation was 0.93 (PMALT Before), 0.95 (PMALT After), or 0.91 (L-GATOR) for a 2-year span, and the correlation was 0.59 (PMALT Before), 0.61 (PMALT After), or 0.56 (L-GATOR) for a 10-year span. The estimated narrow-sense heritability for the log(SBP) trait in this study ranged from 0.31 to 0.35 for the different analyses.
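The quoted correlations can be reproduced from the estimates of α and c in Table 6 by evaluating exp(−α·Δt^c) at Δt = 2 and Δt = 10 years. A short check, in which the dictionary and function names are hypothetical:

```python
import math

# (alpha, c) estimates from Table 6, with time measured in years.
ESTIMATES = {
    "PMALT Before": (0.031, 1.23),
    "PMALT After": (0.017, 1.47),
    "L-GATOR": (0.043, 1.13),
}

def within_individual_correlation(alpha, c, lag_years):
    """Correlation between two measurements on one individual lag_years apart."""
    return math.exp(-alpha * lag_years ** c)

correlations = {
    name: (round(within_individual_correlation(a, c, 2), 2),
           round(within_individual_correlation(a, c, 10), 2))
    for name, (a, c) in ESTIMATES.items()
}
for name, (two_year, ten_year) in correlations.items():
    print(name, two_year, ten_year)
```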

Table 6.

Estimated Variance Components (Based on L-GATOR and PMALT) for Log(SBP) in the FHS Data

Parameter Estimate (SE)
L-GATOR PMALT Before PMALT After
σ^2 0.017 (2 × 10^−4) 0.014 (2 × 10^−4) 0.013 (2 × 10^−4)
σ_E^2/σ^2 0.254 (0.006) 0.303 (0.008) 0.302 (0.012)
σ_I^2/σ^2 3 × 10^−7 (0.02) 4 × 10^−7 (0.03) 2 × 10^−5 (0.04)
σ_A^2/σ^2 0.35 (0.01) 0.31 (0.02) 0.33 (0.02)
α 0.043 (0.006) 0.031 (0.006) 0.017 (0.007)
c 1.13 (0.07) 1.23 (0.09) 1.47 (0.18)

“PMALT Before” refers to the variance-component estimation results from PMALT before the cohort 1 genotypes were removed from the FHS dataset, whereas “PMALT After” refers to the variance-component estimation results from PMALT after the cohort 1 genotypes were removed from the FHS dataset. The variance-component estimation results for L-GATOR are based on all phenotyped individuals, so they do not change when cohort 1 genotypes are removed. Estimation of α and c is based on years as the unit of measurement for time.

Table 7.

Estimated Covariate Effects (with SEs and p Values) from the PMALT and L-GATOR Analyses of Log(SBP) in the FHS Data

Parameter PMALT Before (a) PMALT After L-GATOR
estimate (SE) p value estimate (SE) p value estimate (SE) p value
Intercept 3.98 (0.02) 0 3.78 (0.03) 0 3.96 (0.02) 0
Age −2.3 × 10^−3 (3 × 10^−4) 6.6 × 10^−17 −3.6 × 10^−3 (4 × 10^−4) 5.3 × 10^−25 −1.0 × 10^−3 (2 × 10^−4) 2.3 × 10^−5
Age^2 5.3 × 10^−5 (2 × 10^−6) 1.8 × 10^−98 6.8 × 10^−5 (4 × 10^−6) 2.3 × 10^−79 4.1 × 10^−5 (2 × 10^−6) 8.9 × 10^−77
Sex 0.035 (0.002) 9.7 × 10^−58 0.038 (0.002) 2.1 × 10^−61 0.025 (0.002) 2.5 × 10^−38
log(BMI) 0.242 (0.005) 0 0.231 (0.006) 0 0.265 (0.004) 0
log(current smoking) 4 × 10^−4 (6 × 10^−4) 0.53 3 × 10^−4 (7 × 10^−4) 0.62 9 × 10^−4 (5 × 10^−4) 0.054
Smoking history −0.005 (0.003) 0.051 −0.004 (0.003) 0.15 −0.012 (0.002) 1.7 × 10^−8
log(blood glucose) 0.012 (0.003) 2.6 × 10^−6 0.053 (0.004) 3.4 × 10^−32 0.006 (0.002) 5.6 × 10^−4
Cohort 2 indicator −0.044 (0.003) 1.6 × 10^−50 0.036 (0.003) 2.5 × 10^−43 −0.076 (0.002) 3.3 × 10^−301
Cohort 3 indicator −0.084 (0.004) 4.2 × 10^−119 NA NA −0.125 (0.003) 0

SBP measurements are adjusted according to hypotensive medication usage. SBP, BMI, current smoking, and glucose are log transformed. Age is measured in years; sex is coded as 1 = male and 0 = female; BMI is body mass index; current smoking is measured in cigarettes per day; smoking history = 1 if the individual is a current or former smoker and 0 if the individual has never smoked up to the time of measurement; blood glucose is measured in mg/(100 mL); cohort 2 indicator = 1 if the individual is a member of cohort 2 and 0 otherwise, and the same applies for the cohort 3 indicator.

(a) See Table 6 legend.

Figure 1 shows the quantile-quantile (Q-Q) plots of the negative log10-transformed p values for the PMALT and L-GATOR analyses (before the removal of cohort 1 genotypes). The resulting genomic-control inflation factors were 1.064 (PMALT) and 1.065 (L-GATOR), slightly larger than the threshold of 1.05, which is generally considered benign.21 Although the inflation factors suggest that the p values did not show substantial departure from the uniform distribution overall, we observed that the Q-Q plots seemed to lift off the line between 1 and 2 (on the negative log10 scale), and we also noticed that the smallest p values seemed noticeably smaller in PMALT than in L-GATOR. This motivated us to perform a further investigation on the quality of the genotype data for the three cohorts. To do this, we went back to the original data before applying our quality-control filters. Figure S2 gives histograms of the completeness values (where completeness is defined as the proportion of markers for which a given individual has genotypes called and where we use the completeness value given in the dbGaP quality-control file) for the individuals in cohort 1 (Figure S2A) and the individuals in cohorts 2 and 3 (Figure S2B). Figure S3 gives histograms of empirical self-kinship coefficients for these two groups. The distributions of completeness and empirical self-kinship coefficient suggest that the original cohort had on average noticeably lower-quality genotype data than the other two cohorts.
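For readers unfamiliar with the genomic-control inflation factor, a common definition is the median of the 1-df chi-square statistics implied by the p values divided by the median of the 1-df chi-square distribution (about 0.455). The sketch below uses that standard definition with only the Python standard library; it is not necessarily the exact computation used for the values reported above, and the function name is hypothetical.

```python
from statistics import NormalDist, median

def genomic_inflation(p_values):
    """Median 1-df chi-square statistic implied by the p values, divided by
    the median of the 1-df chi-square distribution. Requires 0 < p <= 1."""
    normal = NormalDist()
    # A two-sided p value corresponds to chi-square = (z quantile at p / 2) ** 2.
    chi_squares = [normal.inv_cdf(p / 2) ** 2 for p in p_values]
    null_median = normal.inv_cdf(0.75) ** 2  # approximately 0.455
    return median(chi_squares) / null_median

# Evenly spaced p values mimic a perfectly uniform null distribution.
uniform_p = [(i + 0.5) / 1000 for i in range(1000)]
inflation = genomic_inflation(uniform_p)
print(round(inflation, 3))
```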

Figure 1.


Q-Q Plots of PMALT and L-GATOR p Values for the FHS SBP Data before the Removal of Cohort 1 Genotypes

Q-Q plots of genome-wide p values for association with SBP in the FHS data before the removal of genotypes for cohort 1 are shown. The p values were computed by the PMALT method (A) or the L-GATOR method (B). In each panel, p values for 369,046 genome-wide SNPs are plotted on the negative log10 scale.

In order to investigate whether lower-quality genotype data in the original cohort were likely to affect the association tests, we tested, for each SNP, whether the allele frequency in cohort 1 was equal to the allele frequency in the combined cohorts 2 and 3. At significance level 10^−10, 709 SNPs showed a significant difference in allele frequency between the two groups. Of these 709 SNPs, 170 had already been filtered out on the basis of our original quality-control criteria (call rate, Mendelian error rate, and MAF) and so were not included in the association analysis. Of the remaining 539 SNPs with extremely significant cohort allele-frequency differences, 322 had MAF <2%. This is not surprising because, for fixed rates of differential genotyping error, SNPs with smaller MAF would be expected to be more sensitive than SNPs with MAF closer to 0.5. Figure 2A shows that the 539 SNPs with extremely significant cohort allele-frequency differences tended to have markedly different association p values between the L-GATOR and PMALT analyses; specifically, many of them had very small p values in PMALT but not in L-GATOR. Among the 539 SNPs with extremely significant cohort allele-frequency differences, 40% of them (215 SNPs) had PMALT p values smaller than 1/10 of their L-GATOR p values (below the lower blue line of 10-fold change on the negative log10 scale), whereas only one SNP showed up in the other direction (above the upper blue line of 10-fold change on the negative log10 scale). The small p values for PMALT were probably the result of confounding due to different apparent allele frequencies in different generations, which probably resulted from differential genotyping error. To understand why the L-GATOR p values were not also inflated, it is helpful to recall that L-GATOR can be thought of as handling missing genotype data by imputing the BLUP of the missing genotypes (and fully accounting for the error of imputation).
For the 539 SNPs with extremely significant cohort allele-frequency differences, we noticed that many individuals with missing genotype data had relatives with highly unlikely genotype configurations, which led to imputation of BLUP values that were “contaminated” by likely genotyping errors in some of the data used for calculating the BLUP. This led to extra noise in the genotypes for the L-GATOR method, which tended to make the p values larger.
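
The 10-fold comparison between PMALT and L-GATOR p values used in Figure 2 can be written as a simple classification rule. The following is a minimal sketch and not part of the L-GATOR software; the function names and the example p values are hypothetical and only illustrate the rule.

```python
def fold_change_class(p_pmalt, p_lgator, fold=10.0):
    """Classify one SNP by which method gives the markedly smaller p value.

    "pmalt_smaller": PMALT p value is below 1/fold of the L-GATOR p value.
    "lgator_smaller": L-GATOR p value is below 1/fold of the PMALT p value.
    "concordant": the two p values are within a factor of `fold` of each other.
    """
    if p_pmalt < p_lgator / fold:
        return "pmalt_smaller"
    if p_lgator < p_pmalt / fold:
        return "lgator_smaller"
    return "concordant"


def summarize(pairs, fold=10.0):
    """Count SNPs in each class for an iterable of (p_pmalt, p_lgator) pairs."""
    counts = {"pmalt_smaller": 0, "lgator_smaller": 0, "concordant": 0}
    for p_pmalt, p_lgator in pairs:
        counts[fold_change_class(p_pmalt, p_lgator, fold)] += 1
    return counts


# Made-up (PMALT, L-GATOR) p-value pairs, for illustration only.
example_pairs = [(1e-8, 1e-3), (0.02, 0.05), (1e-3, 1e-8), (0.4, 0.3)]
print(summarize(example_pairs))
```

Applied to the 539 flagged SNPs, a rule of this kind would reproduce the counts quoted in the text (215 in one direction and 1 in the other), given the actual p values.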

Figure 2.

Scatterplots for L-GATOR p Values versus PMALT p Values from the FHS SBP Data before and after the Removal of Cohort 1 Genotypes

The L-GATOR p value of each SNP in the FHS SBP data is plotted against its PMALT p value, where both p values are on the negative log10 scale. Shown are the results before (A) and after (B) removal of the genotypes for cohort 1. In each case, the red line represents perfect concordance between the PMALT and L-GATOR p values (ratio of 1 between them), whereas the blue lines represent a 10-fold difference between the PMALT and L-GATOR p values (ratio of 10:1 or 1:10). Shown in green are the SNPs with extremely significant (p value $<10^{-10}$) differences in allele frequency between cohort 1 and the other cohorts.

To avoid problems caused by possible lower-quality genotype data in the original cohort, we set the genotype data of the original cohort to be missing for all SNPs and reran the two analyses. Note that for L-GATOR, the estimates of the variance-component parameters under the null hypothesis (Table 6) did not change when the genotype data of cohort 1 were set to be missing because they were estimated based on the set, Z, of individuals with at least one complete record, and this set did not change. Similarly, the L-GATOR estimates of the fixed effects (Table 7) did not change noticeably when the genotype data of cohort 1 were set to be missing because L-GATOR used the set P, which also did not change very much, to estimate these parameters. In contrast, the parameter estimates for PMALT could change substantially because parameter estimation was based on the set PG, which changed substantially because G was changed substantially. Both sets of estimates for PMALT are listed in Tables 6 and 7. Figure 3 shows the Q-Q plots of the negative log10-transformed p values for the PMALT and L-GATOR analyses after the removal of cohort 1 genotypes. The genomic control inflation factors for the PMALT and L-GATOR analyses after removal of cohort 1 genotypes were 1.028 and 1.041, respectively, which are lower than in the previous analyses. The corresponding scatterplot of PMALT p values versus L-GATOR p values is shown in Figure 2B. We see that after the removal of genotype data for cohort 1, the Q-Q plots look closer to uniform, and the two analyses produce relatively consistent p values.
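
The text reports genomic control inflation factors without restating how they are computed. A common definition, assumed here, is the median of the 1-df chi-square statistics implied by the p values divided by the median of the chi-square distribution with 1 df under the null (about 0.455). The sketch below uses only the Python standard library; the function names are illustrative and are not taken from the L-GATOR software.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a chi-square(1 df) variable: the squared 0.75 quantile of N(0, 1).
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def p_to_chi2_1df(p):
    """Convert a p value into the equivalent 1-df chi-square statistic."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p value must lie in (0, 1]")
    # Use the lower tail (p / 2) so that very small p values stay representable.
    return _STD_NORMAL.inv_cdf(p / 2.0) ** 2


def genomic_inflation_factor(p_values):
    """Median observed chi-square statistic over its expected median under the null."""
    stats = [p_to_chi2_1df(p) for p in p_values]
    return median(stats) / CHI2_1DF_MEDIAN


# Evenly spaced p values mimic a perfectly uniform null distribution.
uniform_p = [(i + 0.5) / 1001 for i in range(1001)]
print(genomic_inflation_factor(uniform_p))
```

Under this definition, values near 1 indicate p values that are close to uniform, which is the sense in which factors of 1.028 and 1.041 suggest little residual inflation.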

Figure 3.

Q-Q Plots of PMALT and L-GATOR p Values for the FHS SBP Data after Removal of Cohort 1 Genotypes

Q-Q plots of genome-wide p values for association with SBP in the FHS data after removal of genotypes for cohort 1 are shown. The p values were computed by the PMALT method (A) or the L-GATOR method (B). In each panel, p values for 369,046 genome-wide SNPs are plotted on the negative log10 scale.

Table 8 lists the 13 SNPs for which L-GATOR gave a p value $<2.5 \times 10^{-5}$ for association with SBP after removal of the genotype data for the original cohort. Three of these SNPs, rs17398575, rs12705390, and rs11760498, are located at 7q22, close to PIK3CG (MIM: 601232), which was identified22 as associated with pulse pressure, the difference between SBP and diastolic blood pressure (DBP), in a meta-analysis of individuals of European ancestry from 35 studies (one of which is the FHS). PIK3CG encodes a protein that belongs to the PI3/PI4-kinase family, and the gene is located in a region previously identified as associated with platelet aggregation23 and with mean platelet volume, counts, and function.24 SNP rs12246717, located at 10q21, is near C10orf107, which was previously reported to be associated with SBP and DBP in meta-analyses of individuals of European ancestry.25, 26

Table 8.

Strongest Association Signals in the FHS SBP Data after Removal of Cohort 1 Genotypes

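| Chromosomal Region | Nearby Genes | SNP ID | Position | P Value |
|---|---|---|---|---|
| 1q24 | BAT2L2, MYOC | rs235891 | 169,839,888 | $3.71 \times 10^{-7}$ |
| | | rs235864 | 169,855,206 | $5.51 \times 10^{-6}$ |
| 2q14 | CNTNAP5 | rs13033474 | 124,653,857 | $2.06 \times 10^{-5}$ |
| 2q21 | TMEM163 | rs655472 | 135,006,691 | $5.50 \times 10^{-6}$ |
| | | rs666614 | 135,006,923 | $5.43 \times 10^{-6}$ |
| 7q22 | PIK3CG, FLJ36031 | rs17398575 | 106,196,688 | $1.04 \times 10^{-5}$ |
| | | rs12705390 | 106,198,013 | $1.56 \times 10^{-5}$ |
| | | rs11760498 | 106,206,208 | $7.23 \times 10^{-6}$ |
| 10q21 | C10orf107 | rs12246717 | 63,129,189 | $2.11 \times 10^{-5}$ |
| 17q22 | C17orf67, NOG | rs227661 | 52,170,897 | $7.31 \times 10^{-6}$ |
| | | rs227660 | 52,171,218 | $4.41 \times 10^{-6}$ |
| 18p11 | RAB31 | rs1455587 | 9,700,554 | $1.39 \times 10^{-5}$ |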

SBP-associated SNPs with L-GATOR p value $<2.5 \times 10^{-5}$ are reported on the basis of the FHS data with cohort 1 genotypes removed. Underlined genes have been previously identified as associated with SBP. MIM numbers of genes not mentioned in the text: MYOC (MIM: 601652), CNTNAP5 (MIM: 610519), NOG (MIM: 602991), RAB31 (MIM: 605694).

Our results seem to have little in common with the results of a previous FHS SBP analysis,27 which tested SNPs on the Affymetrix 100K GeneChip for association with either single time-point measurements or long-term averages. The strongest results in the previous study do not appear to have been replicated to date. The previous study had a much smaller sample size and lower SNP density than did our analysis, and the previous study did not make use of longitudinal information, whereas our analysis did. These major differences most likely account for the lack of consistency between the two sets of results.

Assessment of Run Time

We provide freely available software called L-GATOR, which is programmed in C and which can perform both the L-GATOR and PMALT analyses. In Table 9, we compare the run times of L-GATOR and PMALT, using the L-GATOR software with default settings, with that of longGWAS, using the longGWAS R package with default settings, for analysis of sample type 1a of trait model I. These times were obtained with a single processor on a shared machine with 4-core Intel Xeon 3.16 GHz central processing units and 32 GB RAM. From Table 9, we can see that longGWAS took almost 100 times as long as PMALT or L-GATOR for variance-component estimation, which is done once per genome screen, and longGWAS took more than ten times as long per SNP as L-GATOR for association testing.

Table 9.

Comparison of Run Times for PMALT, L-GATOR, and longGWAS

| Step | PMALT time (s) | L-GATOR time (s) | longGWAS time (s) |
|---|---|---|---|
| Variance-component estimation (once per dataset) | 7.778 | 7.771 | 743.581 |
| Association testing (per SNP) | 0.016 | 0.028 | 0.335 |

Comparison uses sample type 1a of trait model I.
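
The ratios quoted in the text follow directly from Table 9. The short sketch below recomputes them and, as a purely hypothetical extrapolation, projects the total time for a screen of a given number of SNPs by assuming that the per-SNP cost stays constant. The function names and the SNP count of 100,000 are illustrative assumptions, not values from the study.

```python
# Run times in seconds, copied from Table 9 (sample type 1a of trait model I).
RUN_TIMES = {
    "PMALT": {"vc_estimation": 7.778, "per_snp": 0.016},
    "L-GATOR": {"vc_estimation": 7.771, "per_snp": 0.028},
    "longGWAS": {"vc_estimation": 743.581, "per_snp": 0.335},
}


def slowdown(method, baseline, step):
    """How many times longer `method` takes than `baseline` for one step."""
    return RUN_TIMES[method][step] / RUN_TIMES[baseline][step]


def projected_screen_seconds(method, n_snps):
    """One-off variance-component estimation plus n_snps association tests."""
    times = RUN_TIMES[method]
    return times["vc_estimation"] + n_snps * times["per_snp"]


print(slowdown("longGWAS", "L-GATOR", "vc_estimation"))  # "almost 100 times"
print(slowdown("longGWAS", "L-GATOR", "per_snp"))        # "more than ten times"

# Hypothetical screen size, for illustration only.
for name in RUN_TIMES:
    print(name, projected_screen_seconds(name, 100_000))
```

Because the variance-component step is run once per screen, the per-SNP cost dominates the total for any realistic number of markers under this simple model.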

We also give run times for L-GATOR genome-wide analysis of the FHS SBP data. (We were not able to complete the analysis with longGWAS.) The FHS data are computationally challenging because they contain some large families with a large number of repeated measures per individual. For example, there are six families with sizes larger than 100 individuals, and the largest family contains 499 individuals and 3,284 records. To assess the run time, we first performed a genome-wide analysis of the FHS SBP data while excluding the six largest families. In that case, the average run time for L-GATOR was about 0.58 s per SNP for association testing. When all families were included, the computation was five to six times slower as a result of memory allocation and matrix decomposition and inversion for large matrices.

For both the L-GATOR and PMALT analyses, the L-GATOR software calculates $\hat{\tau}_0$ just once per genome screen. The slowest step is the calculation, at each marker, of the inverse, $\hat{V}_0^{-1}$, for the PMALT analysis and of both $\hat{V}_0^{-1}$ and $\Phi_G^{-1}$ for the L-GATOR analysis. By default, these inverses are calculated at each marker because the set PG in PMALT and the sets P and G in L-GATOR can vary by marker. For very large datasets, possible approximate approaches, which speed calculation, are to either (1) calculate $\hat{V}_0^{-1}$ just once per genome screen and select the needed rows and columns of $\hat{V}_0^{-1}$ for each SNP or (2) calculate the transformed phenotypic residual vector, $\hat{V}_0^{-1}(Y - W\hat{\beta}_0)$, once per genome screen and select the needed entries of it for each SNP. These approaches would be expected to provide good approximations only if the set of individuals with missing genotypes is quite similar across markers (and would be exact if the set of individuals with missing genotypes is the same across markers). Our software gives users the option of using either of these approaches, although the default is to recalculate $\hat{V}_0^{-1}$ for every SNP. The use of either option provides a considerable speed-up, and option 2 is faster than option 1.
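
One way to see why option 1 is an approximation is that selecting rows and columns of a precomputed inverse is not, in general, the same as inverting the selected sub-block. The two agree when the dropped entries are uncorrelated with the retained ones. The sketch below illustrates this with a toy 3 × 3 matrix in pure Python. It assumes that the once-per-screen inverse is computed on a superset of the observations used at a given marker; the matrices, names, and helper functions are made up for illustration and are not taken from the L-GATOR software.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity on the right.
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def submatrix(matrix, keep):
    """Keep only the rows and columns whose indices are listed in `keep`."""
    return [[matrix[i][j] for j in keep] for i in keep]


def shortcut_error(v, keep):
    """Largest entrywise gap between inverting the sub-block and selecting from one inverse."""
    exact = invert(submatrix(v, keep))      # recompute the inverse for this marker
    shortcut = submatrix(invert(v), keep)   # select from a single precomputed inverse
    return max(abs(x - y) for ra, rb in zip(exact, shortcut) for x, y in zip(ra, rb))


# Toy covariance matrices (made-up values); index 2 is the observation dropped at a marker.
V_CORRELATED = [[2.0, 0.5, 0.3], [0.5, 2.0, 0.4], [0.3, 0.4, 2.0]]
V_INDEPENDENT = [[2.0, 0.5, 0.0], [0.5, 2.0, 0.0], [0.0, 0.0, 2.0]]
KEEP = [0, 1]

print(shortcut_error(V_CORRELATED, KEEP))   # nonzero: the shortcut is only approximate
print(shortcut_error(V_INDEPENDENT, KEEP))  # essentially zero: the shortcut is exact
```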

Discussion

Longitudinal studies have the potential to provide substantial information for identification of genetic variants and environmental factors that affect complex phenotypes over time. When individuals in the longitudinal study are related, the dependence of genotypes across individuals and dependence of phenotype measurements both across individuals and across time can pose statistical and computational challenges for genetic analysis. In order to make efficient use of the full longitudinal information contained in such data, we propose L-GATOR, a retrospective quasi-likelihood score test for genetic association with a longitudinally measured trait in samples with related individuals. L-GATOR is based on a mixed-effects trait model that (1) incorporates both time-varying and static covariates as fixed effects and (2) uses random effects to model dependence among observations for an individual, including both static and time-varying individual effects, as well as additive genetic effects that model dependence across observations from related individuals. The mixed-effects approach allows the analysis to make full use of the available information on repeated measurements, for which different individuals are permitted to have different time points and different numbers of observations. In samples with related individuals, the retrospective approach has the advantage of providing a natural and computationally efficient way to incorporate partially missing observations in the analysis in order to increase power while properly accounting for uncertainty and dependence. L-GATOR is applicable and computationally feasible for essentially arbitrary combinations of related and unrelated individuals, including large, complex, inbred pedigrees as well as samples in which small outbred families are combined with unrelated individuals.

In our simulations, L-GATOR was more than ten times faster than the previously proposed prospective method, longGWAS, for association testing and almost 100 times faster for parameter estimation. Furthermore, in the presence of ascertainment or missing data, L-GATOR was able to maintain correct type 1 error, whereas longGWAS was not. In part because of the challenge of using longGWAS in our simulations and data analysis, we also developed PMALT, a prospective method that is similar to longGWAS but adopts the longitudinal modeling approach of L-GATOR so that it can take advantage of the computational speed-up. Through simulation, we have demonstrated that, although both L-GATOR and PMALT are correctly calibrated, L-GATOR provides higher power to detect association in samples with ascertainment and partially missing data (missing phenotype or genotype) when related individuals are present in the sample. This is because L-GATOR uses the information on dependence of related individuals to allow individuals with missing genotype or missing phenotype to contribute power to the analysis. Furthermore, we have shown that compared with PMALT, L-GATOR is more robust to misspecification of the phenotype model in that it maintains correct type 1 error even when the variance-component parameter is misspecified. We applied L-GATOR to detect association with SBP in data from the FHS. Among the strongest association signals were four SNPs in or near genes previously reported to be associated with blood pressure (PIK3CG and C10orf107).

From the point of view of analyzing the FHS SBP data, a key aspect of the L-GATOR model is the parametric modeling of correlation among observations from the same individual at different time points, as captured in the $L$ matrix, which depends on the two parameters $\alpha$ and $c$. This is essential because of the wide range of ages (time points) in the study and the fact that each individual has observations for at most a small subset of these. In contrast, completely nonparametric modeling of the correlation among observations from the same individual at different time points1 is impractical in this context. The modeling of additive genetic variance is also essential in this context because of the presence of relatedness in the sample. In contrast, the estimated value of $\sigma_I^2/\sigma^2$, the proportion of variance due to non-genetic individual random effects, is quite small in the FHS SBP data, indicating that it is not needed in the analysis. In Table S1, we report the power results for L-GATOR and PMALT in six settings when the true value of $\sigma_I^2/\sigma^2 = 0.1$ but when the analyses were performed with $\sigma_I^2$ set to 0 (i.e., removed from the model). We found that in this case of low to moderate $\sigma_I^2/\sigma^2$, there was no significant difference in power for either PMALT or L-GATOR when $\sigma_I^2$ was removed from the model. The current version of L-GATOR software offers the user the option of removing this variance component from the model.

The L-GATOR method we describe accounts for known relatedness. In addition, it might be of interest to account for further unknown population structure and cryptic relatedness. One possible approach to this is to replace the L-GATOR statistic of Equation 23 with

$$\text{L-GATOR} = \frac{(A^T\hat{X})^2}{A^T\Phi_{PG}U_{\mathbf{1}}\hat{\Psi}U_{\mathbf{1}}\Phi_{GP}A\,\hat{\sigma}_X^2},$$ (Equation 29)

where $\hat{\Psi}$ is an empirical genetic-relatedness matrix calculated from genome-wide data. This has the effect of correcting the variance of the statistic for additional unknown structure. An alternative approach could be to include ancestry-informative vectors as covariates in the model and replace $\hat{\Psi}$ with a residual empirical genetic-relatedness matrix that is orthogonal to the ancestry-informative vectors. The second approach could be expected to have more power if population structure is substantial.
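
As a consistency check on Equation 29, one can note what happens if the empirical matrix is replaced by the pedigree-based kinship matrix $\Phi_G$. The short derivation below is a sketch that relies only on the identity $U_{\mathbf{1}}\Phi_G U_{\mathbf{1}} = U_{\mathbf{1}}$ stated in Equation B2 of Appendix B; it shows that the denominator then reduces to the denominator of Equation 23, so the modified statistic coincides with the original one in that case.

```latex
\begin{align*}
A^T \Phi_{PG} U_{\mathbf{1}} \hat{\Psi} U_{\mathbf{1}} \Phi_{GP} A \,\hat{\sigma}_X^2
  \Big|_{\hat{\Psi} = \Phi_G}
  &= A^T \Phi_{PG} \left( U_{\mathbf{1}} \Phi_G U_{\mathbf{1}} \right) \Phi_{GP} A \,\hat{\sigma}_X^2 \\
  &= A^T \Phi_{PG} U_{\mathbf{1}} \Phi_{GP} A \,\hat{\sigma}_X^2 .
\end{align*}
```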

A general feature of retrospective association testing approaches such as L-GATOR is that the type 1 error is not affected by misspecification of the trait model, and this is confirmed by Table 4. However, power for association can sometimes be affected by trait model misspecification. An example in which power is not significantly affected is given in Table S1. On the other hand, examples in which power is reduced (in the case of a non-longitudinal quantitative trait) can be seen by a comparison of the ASTOR and MASTOR results in rows 1–4 of Table 4 of a previous work7. The fact that power for association can sometimes be affected by trait model misspecification motivates our careful development of the mean and variance models for L-GATOR, and it is worth considering how to address possible deviations from that model. For example, the trait model fit by L-GATOR assumes that the genetic effect is constant over time. Other possibilities could include that there is an initial genetic effect that attenuates over time or a genetic effect that does not manifest until late in life. If the true trait model deviated strongly from the assumption of constant genetic effect over time, it could affect power for association. Relatively simple approaches to improving the power in that case could include (1) restricting consideration to an age range over which the genetic effect is believed to be strong and/or (2) extending the model of L-GATOR to allow an interaction between the genetic fixed effect and age.

The L-GATOR method we propose is designed for genetic analysis of longitudinal measurements of a quantitative trait. L-GATOR could certainly also be applied to longitudinal measurements of a binary trait. A drawback would be that the trait model given in Equations 2, 3, and 4 would be misspecified for a binary trait because a key feature of binary data is that the variance of a binary random variable with mean $\mu$ must be $\mu(1 - \mu)$. This model misspecification could affect the power of the association test for a binary trait (although not the type 1 error). To improve power for longitudinal binary traits, we could consider developing a new method that combines the approach used in L-GATOR with the approaches used in the recent CARAT28 and CERAMIC29 methods for non-longitudinal binary traits to create a hybrid method that would be expected to be more powerful than L-GATOR for longitudinal binary traits.

Acknowledgments

This study was supported by National Institutes of Health grant R01 HG001645 (to M.S.M.). The Framingham Heart Study (FHS) is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University (contract nos. N01-HC-25195 and HHSN268201500001I). Funding for SHARe Affymetrix genotyping was provided by NHLBI contract N02-HL-64278. The Framingham SHARe data used for the analyses described in this manuscript were obtained through dbGaP: phs000007.v15.p6. This manuscript was not prepared in collaboration with investigators of the FHS and does not necessarily reflect the opinions or views of the FHS, Boston University, or the NHLBI.

Published: April 5, 2018

Footnotes

Supplemental Data include three figures and one table and can be found with this article online at https://doi.org/10.1016/j.ajhg.2018.02.016.

Appendix A

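Null MLE and Fisher Information for $(\sigma^2, \tau)$

From the log-likelihood function under the null,

$$\ell = -\frac{1}{2}\left[n\log(2\pi\sigma^2) + \log|V| + \frac{1}{\sigma^2}(Y - W\beta)^T V^{-1}(Y - W\beta)\right],$$

we can obtain the null MLE, $(\hat{\beta}_0, \hat{\sigma}_0^2, \hat{\tau}_0)$, of $(\beta, \sigma^2, \tau)$ by performing the one-dimensional minimization

$$\hat{\tau}_0 = \arg\min_{\tau}\left\{\log|V| + n\log\hat{\sigma}_0^2(\tau)\right\},$$

where $\hat{\sigma}_0^2(\tau) = \frac{1}{n}Y^T V^{-1}[Y - W\hat{\beta}_0(\tau)]$ and $\hat{\beta}_0(\tau) = (W^T V^{-1} W)^{-1} W^T V^{-1} Y$ are the null MLEs of $\sigma^2$ and $\beta$ as functions of $\tau$. We then set $\hat{\beta}_0 = \hat{\beta}_0(\hat{\tau}_0)$ and $\hat{\sigma}_0^2 = \hat{\sigma}_0^2(\hat{\tau}_0)$.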
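
The profiling step above can be illustrated on a deliberately simplified model. The sketch below is not the L-GATOR implementation: it assumes a single scalar parameter $\tau$, a diagonal matrix $V(\tau)$ with entries $1 + \tau k_i$ for known non-negative constants $k_i$, an intercept-only design matrix $W$, and a plain grid search. All names and data values are hypothetical. It evaluates $\hat{\beta}_0(\tau)$, $\hat{\sigma}_0^2(\tau)$, and the objective $\log|V| + n\log\hat{\sigma}_0^2(\tau)$ exactly as written in the formulas above, specialized to that toy setting.

```python
import math


def profile_terms(y, k, tau):
    """Return (beta_hat, sigma2_hat, objective) at a fixed value of tau.

    Simplifying assumptions (not the model of the paper): V(tau) is diagonal
    with entries 1 + tau * k[i], and W is a single column of ones.
    """
    n = len(y)
    weights = [1.0 / (1.0 + tau * ki) for ki in k]  # diagonal of V^{-1}
    # beta_hat(tau) = (W' V^{-1} W)^{-1} W' V^{-1} Y for an intercept-only W.
    beta_hat = sum(w * yi for w, yi in zip(weights, y)) / sum(weights)
    # sigma2_hat(tau) = (1/n) Y' V^{-1} (Y - W beta_hat).
    sigma2_hat = sum(w * yi * (yi - beta_hat) for w, yi in zip(weights, y)) / n
    log_det_v = sum(math.log(1.0 + tau * ki) for ki in k)
    objective = log_det_v + n * math.log(sigma2_hat)
    return beta_hat, sigma2_hat, objective


def null_mle(y, k, tau_grid):
    """Grid-search the profiled objective, then plug tau_hat back in."""
    tau_hat = min(tau_grid, key=lambda t: profile_terms(y, k, t)[2])
    beta_hat, sigma2_hat, _ = profile_terms(y, k, tau_hat)
    return beta_hat, sigma2_hat, tau_hat


# Made-up data, for illustration only.
y_example = [1.0, 2.0, 3.0, 4.0]
k_example = [0.0, 1.0, 2.0, 3.0]
grid = [i / 10 for i in range(51)]
print(null_mle(y_example, k_example, grid))
```

In practice a numerical optimizer would replace the grid search, and $V$ would be a full matrix that requires a determinant and a linear solve at each candidate value of $\tau$.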

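To obtain the Fisher information for the parameter vector $(\sigma^2, \tau)^T$, where $\tau = (\tau_1, \tau_2, \tau_3, \alpha, c)$ (note that $\beta$ is independent and can be considered separately), let $\Lambda_1 = \partial V/\partial\tau_1$, $\Lambda_2 = \partial V/\partial\tau_2$, $\Lambda_3 = \partial V/\partial\tau_3$, $\Lambda_4 = \partial V/\partial\alpha$, and $\Lambda_5 = \partial V/\partial c$. The corresponding Fisher information matrix is then

$$I = \begin{bmatrix} I_{\sigma^2} & I_{\sigma^2\tau}^T \\ I_{\sigma^2\tau} & I_{\tau} \end{bmatrix},$$

where $I_{\sigma^2} = n/(2\sigma^4)$, $I_{\sigma^2\tau}$ is a vector of length 5 with $i$th component $(1/(2\sigma^2))\,\mathrm{tr}(V^{-1}\Lambda_i)$ for $1 \le i \le 5$, and $I_{\tau}$ is a matrix of size 5 × 5 with $(i,j)$th component $(1/2)\,\mathrm{tr}(V^{-1}\Lambda_i V^{-1}\Lambda_j)$ for $1 \le i \le 5$ and $1 \le j \le 5$.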

Appendix B

Equality of the Expressions in Equations 22 and 23

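To obtain equality of the numerators in Equations 22 and 23, it is sufficient to show that

$$A^T\hat{X} = A^T\Phi_{PG}U_{\mathbf{1}}X.$$ (Equation B1)

This follows from the fact that $RB\mathbf{1} = 0$ (because $R$ is orthogonal to the column space of $W$, and $B\mathbf{1}$ equals the intercept column of $W$), and so $A^T\mathbf{1} = Y^T RB\mathbf{1} = 0$. Thus, $A^T\hat{X} = A^T\mathbf{1}(\mathbf{1}^T\Phi_G^{-1}\mathbf{1})^{-1}\mathbf{1}^T\Phi_G^{-1}X + A^T\Phi_{PG}U_{\mathbf{1}}X = A^T\Phi_{PG}U_{\mathbf{1}}X$ because the first term is 0. To obtain equality of the denominators of Equations 22 and 23, we note that a key fact, easily verified, is that

$$U_{\mathbf{1}}\Phi_G U_{\mathbf{1}} = U_{\mathbf{1}}.$$ (Equation B2)

Combining this with Equation B1, we have

$$\widehat{\mathrm{Var}}_0(A^T\hat{X} \mid Y, W) = \widehat{\mathrm{Var}}_0(A^T\Phi_{PG}U_{\mathbf{1}}X) = A^T\Phi_{PG}U_{\mathbf{1}}\Phi_G U_{\mathbf{1}}\Phi_{GP}A\,\hat{\sigma}_X^2 = A^T\Phi_{PG}U_{\mathbf{1}}\Phi_{GP}A\,\hat{\sigma}_X^2.$$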
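
Equation B2 is described above as easily verified; the steps can be spelled out as follows. This sketch assumes that $U_{\mathbf{1}}$ has the form $\Phi_G^{-1} - \Phi_G^{-1}\mathbf{1}(\mathbf{1}^T\Phi_G^{-1}\mathbf{1})^{-1}\mathbf{1}^T\Phi_G^{-1}$, which is the form implied by the complete-data identity used later in this appendix and which parallels the definition of $U_\Delta$ in Equation B3. The key observation is that $\mathbf{1}^T U_{\mathbf{1}} = 0$.

```latex
\begin{align*}
\mathbf{1}^T U_{\mathbf{1}}
  &= \mathbf{1}^T \Phi_G^{-1}
     - (\mathbf{1}^T \Phi_G^{-1} \mathbf{1})(\mathbf{1}^T \Phi_G^{-1} \mathbf{1})^{-1}
       \mathbf{1}^T \Phi_G^{-1} = 0, \\
U_{\mathbf{1}} \Phi_G
  &= I - \Phi_G^{-1} \mathbf{1} (\mathbf{1}^T \Phi_G^{-1} \mathbf{1})^{-1} \mathbf{1}^T, \\
U_{\mathbf{1}} \Phi_G U_{\mathbf{1}}
  &= U_{\mathbf{1}}
     - \Phi_G^{-1} \mathbf{1} (\mathbf{1}^T \Phi_G^{-1} \mathbf{1})^{-1}
       \left( \mathbf{1}^T U_{\mathbf{1}} \right)
   = U_{\mathbf{1}} .
\end{align*}
```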

Equality of the Denominators in Equations 23 and 16 with Complete Data

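With complete data, we have $\Phi_{PG} = \Phi_G$, so the denominator of Equation 23 becomes $A^T\Phi_G U_{\mathbf{1}}\Phi_G A\,\hat{\sigma}_X^2 = A^T\left(\Phi_G - \mathbf{1}(\mathbf{1}^T\Phi_G^{-1}\mathbf{1})^{-1}\mathbf{1}^T\right)A\,\hat{\sigma}_X^2 = A^T\Phi_G A\,\hat{\sigma}_X^2$ because $A^T\mathbf{1} = 0$.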

Estimator $\hat{\sigma}_X^2$ in Equation 23

Equation 23 gives the form of L-GATOR in the incomplete-data case, which includes $\hat{\sigma}_X^2$, an estimator of the null conditional genotypic variance. In the complete-data case, we have an expression for $\hat{\sigma}_X^2$ that is given in Equation 17. However, this expression must be modified for the incomplete-data case. Let $\Delta$ denote the set of individuals with genotype and all static covariates available, and let $\delta = |\Delta|$. Then, in the incomplete-data case, we let

$$\hat{\sigma}_X^2 = (\delta - q_s)^{-1}X_\Delta^T U_\Delta X_\Delta,$$ (Equation B3)

where $X_\Delta$ is the genotype vector $X$ restricted to those individuals in $\Delta$, and $U_\Delta = \Phi_\Delta^{-1} - \Phi_\Delta^{-1}W_{\Delta,s}(W_{\Delta,s}^T\Phi_\Delta^{-1}W_{\Delta,s})^{-1}W_{\Delta,s}^T\Phi_\Delta^{-1}$, where $\Phi_\Delta$ and $W_{\Delta,s}$ are $\Phi$ and $W_s$, respectively, restricted to those individuals in $\Delta$.

Appendix C

Extension of L-GATOR to Include Covariates in BLUP for Genotype

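If we required all of the individuals in G to have non-missing values for the static covariates, i.e., if we required $G = \Delta$, then we could define an alternative version of L-GATOR, called $\text{L-GATOR}_{\text{alt}}$, where

$$\text{L-GATOR}_{\text{alt}} = \frac{(Y^T RB\Phi_{P\Delta}U_\Delta X_\Delta)^2}{\widehat{\mathrm{Var}}_0(Y^T RB\Phi_{P\Delta}U_\Delta X_\Delta \mid Y, W)},$$

which we rewrite as

$$\text{L-GATOR}_{\text{alt}} = \frac{(A^T\hat{X}_{\text{alt}})^2}{\widehat{\mathrm{Var}}_0(A^T\hat{X}_{\text{alt}} \mid Y, W)} = \frac{(A^T\hat{X}_{\text{alt}})^2}{A^T\Phi_{P\Delta}U_\Delta\Phi_{\Delta P}A\,\hat{\sigma}_X^2},$$

where $\hat{\sigma}_X^2$ is defined in Equation B3 and where

$$\hat{X}_{\text{alt}} = W_s\hat{\eta} + \Phi_{P\Delta}\Phi_\Delta^{-1}(X - W_{\Delta,s}\hat{\eta}) = \left[W_s(W_{\Delta,s}^T\Phi_\Delta^{-1}W_{\Delta,s})^{-1}W_{\Delta,s}^T\Phi_\Delta^{-1} + \Phi_{P\Delta}U_\Delta\right]X_\Delta$$

is a new BLUP that uses not only the available genotypes but also the static covariates to predict the missing genotypes.

$\text{L-GATOR}_{\text{alt}}$ belongs to the family of quasi-likelihood score tests in association analysis,15, 30 and it can be derived as the quasi-likelihood score test of $H_0: \xi = 0$ (no association and no linkage) under the retrospective mean model $E[X_\Delta \mid W, Y] = W_{\Delta,s}\eta + \xi\Phi_{\Delta P}A$ with the null variance assumption $\mathrm{Var}_0(X_\Delta \mid W, Y) = \sigma_X^2\Phi_\Delta$. Note that validity of the $\text{L-GATOR}_{\text{alt}}$ test depends only on the assumptions $E_0[X_\Delta \mid W, Y] = W_{\Delta,s}\eta$ and $\mathrm{Var}_0(X_\Delta \mid W, Y) = \sigma_X^2\Phi_\Delta$, in addition to the usual regularity conditions needed for a central limit theorem.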

Web Resources

Supplemental Data

Document S1. Figures S1–S3 and Table S1
mmc1.pdf (102.7KB, pdf)
Document S2. Article plus Supplemental Data
mmc2.pdf (1.2MB, pdf)

References

  • 1. Furlotte N.A., Eskin E., Eyheramendy S. Genome-wide association mapping with longitudinal data. Genet. Epidemiol. 2012;36:463–471. doi: 10.1002/gepi.21640.
  • 2. Das K., Li J., Wang Z., Tong C., Fu G., Li Y., Xu M., Ahn K., Mauger D., Li R., Wu R. A dynamic model for genome-wide association studies. Hum. Genet. 2011;129:629–639. doi: 10.1007/s00439-011-0960-6.
  • 3. Zhu W., Cho K., Chen X., Zhang M., Wang M., Zhang H. A genome-wide association analysis of Framingham Heart Study longitudinal data using multivariate adaptive splines. BMC Proc. 2009;3(Suppl 7):S119. doi: 10.1186/1753-6561-3-s7-s119.
  • 4. Lange C., van Steen K., Andrew T., Lyon H., DeMeo D.L., Raby B., Murphy A., Silverman E.K., MacGregor A., Weiss S.T., Laird N.M. A family-based association test for repeatedly measured quantitative traits adjusting for unknown environmental and/or polygenic effects. Stat. Appl. Genet. Mol. Biol. 2004;3:e17. doi: 10.2202/1544-6115.1067.
  • 5. Ding X., Lange C., Xu X., Laird N. New powerful approaches for family-based association tests with longitudinal measurements. Ann. Hum. Genet. 2009;73:74–83. doi: 10.1111/j.1469-1809.2008.00481.x.
  • 6. Ding X., Laird N. Family-Based Association Tests with longitudinal measurements: handling missing data. Hum. Hered. 2009;68:98–105. doi: 10.1159/000212502.
  • 7. Jakobsdottir J., McPeek M.S. MASTOR: mixed-model association mapping of quantitative traits in samples with related individuals. Am. J. Hum. Genet. 2013;92:652–666. doi: 10.1016/j.ajhg.2013.03.014.
  • 8. Abney M., Ober C., McPeek M.S. Quantitative-trait homozygosity and association mapping and empirical genomewide significance in large, complex pedigrees: fasting serum-insulin level in the Hutterites. Am. J. Hum. Genet. 2002;70:920–934. doi: 10.1086/339705.
  • 9. Diggle P.J. An approach to the analysis of repeated measurements. Biometrics. 1988;44:959–971.
  • 10. de Andrade M., Guéguen R., Visvikis S., Sass C., Siest G., Amos C.I. Extension of variance components approach to incorporate temporal trends and longitudinal pedigree data analysis. Genet. Epidemiol. 2002;22:221–232. doi: 10.1002/gepi.01118.
  • 11. Soler J.M.P., Blangero J. Longitudinal familial analysis of blood pressure involving parametric (co)variance functions. BMC Genet. 2003;4(Suppl 1):S87. doi: 10.1186/1471-2156-4-S1-S87.
  • 12. Burton P.R., Scurrah K.J., Tobin M.D., Palmer L.J. Covariance components models for longitudinal family data. Int. J. Epidemiol. 2005;34:1063–1077. doi: 10.1093/ije/dyi069. discussion 1077–1079.
  • 13. Kang H.M., Sul J.H., Service S.K., Zaitlen N.A., Kong S.Y., Freimer N.B., Sabatti C., Eskin E. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 2010;42:348–354. doi: 10.1038/ng.548.
  • 14. McPeek M.S., Wu X., Ober C. Best linear unbiased allele-frequency estimation in complex pedigrees. Biometrics. 2004;60:359–367. doi: 10.1111/j.0006-341X.2004.00180.x.
  • 15. Bourgain C., Hoffjan S., Nicolae R., Newman D., Steiner L., Walker K., Reynolds R., Ober C., McPeek M.S. Novel case-control test in a founder population identifies P-selectin as an atopy-susceptibility locus. Am. J. Hum. Genet. 2003;73:612–626. doi: 10.1086/378208.
  • 16. McPeek M.S. BLUP genotype imputation for case-control association testing with related individuals and missing data. J. Comput. Biol. 2012;19:756–765. doi: 10.1089/cmb.2012.0024.
  • 17. Thornton T., McPeek M.S. ROADTRIPS: case-control association testing with partially or completely unknown population and pedigree structure. Am. J. Hum. Genet. 2010;86:172–184. doi: 10.1016/j.ajhg.2010.01.001.
  • 18. Splansky G.L., Corey D., Yang Q., Atwood L.D., Cupples L.A., Benjamin E.J., D’Agostino R.B., Sr., Fox C.S., Larson M.G., Murabito J.M. The third generation cohort of the National Heart, Lung, and Blood Institute’s Framingham Heart Study: design, recruitment, and initial examination. Am. J. Epidemiol. 2007;165:1328–1335. doi: 10.1093/aje/kwm021.
  • 19. Cui J.S., Hopper J.L., Harrap S.B. Antihypertensive treatments obscure familial contributions to blood pressure variation. Hypertension. 2003;41:207–210. doi: 10.1161/01.hyp.0000044938.94050.e3.
  • 20. Tobin M.D., Sheehan N.A., Scurrah K.J., Burton P.R. Adjusting for treatment effects in studies of quantitative traits: antihypertensive therapy and systolic blood pressure. Stat. Med. 2005;24:2911–2935. doi: 10.1002/sim.2165.
  • 21. Price A.L., Zaitlen N.A., Reich D., Patterson N. New approaches to population stratification in genome-wide association studies. Nat. Rev. Genet. 2010;11:459–463. doi: 10.1038/nrg2813.
  • 22. Wain L.V., Verwoert G.C., O’Reilly P.F., Shi G., Johnson T., Johnson A.D., Bochud M., Rice K.M., Henneman P., Smith A.V., LifeLines Cohort Study. EchoGen consortium. AortaGen Consortium. CHARGE Consortium Heart Failure Working Group. KidneyGen consortium. CKDGen consortium. Cardiogenics consortium. CardioGram. Genome-wide association study identifies six new loci influencing pulse pressure and mean arterial pressure. Nat. Genet. 2011;43:1005–1011. doi: 10.1038/ng.922.
  • 23. Johnson A.D., Yanek L.R., Chen M.H., Faraday N., Larson M.G., Tofler G., Lin S.J., Kraja A.T., Province M.A., Yang Q. Genome-wide meta-analyses identifies seven loci associated with platelet aggregation in response to agonists. Nat. Genet. 2010;42:608–613. doi: 10.1038/ng.604.
  • 24. Soranzo N., Rendon A., Gieger C., Jones C.I., Watkins N.A., Menzel S., Döring A., Stephens J., Prokisch H., Erber W. A novel variant on chromosome 7q22.3 associated with mean platelet volume, counts, and function. Blood. 2009;113:3831–3837. doi: 10.1182/blood-2008-10-184234.
  • 25. Newton-Cheh C., Johnson T., Gateva V., Tobin M.D., Bochud M., Coin L., Najjar S.S., Zhao J.H., Heath S.C., Eyheramendy S., Wellcome Trust Case Control Consortium. Genome-wide association study identifies eight loci associated with blood pressure. Nat. Genet. 2009;41:666–676. doi: 10.1038/ng.361.
  • 26. Ehret G.B., Munroe P.B., Rice K.M., Bochud M., Johnson A.D., Chasman D.I., Smith A.V., Tobin M.D., Verwoert G.C., Hwang S.J., International Consortium for Blood Pressure Genome-Wide Association Studies. CARDIoGRAM consortium. CKDGen Consortium. KidneyGen Consortium. EchoGen consortium. CHARGE-HF consortium. Genetic variants in novel pathways influence blood pressure and cardiovascular disease risk. Nature. 2011;478:103–109. doi: 10.1038/nature10405.
  • 27. Levy D., Larson M.G., Benjamin E.J., Newton-Cheh C., Wang T.J., Hwang S.J., Vasan R.S., Mitchell G.F. Framingham Heart Study 100K Project: genome-wide associations for blood pressure and arterial stiffness. BMC Med. Genet. 2007;8(Suppl 1):S3. doi: 10.1186/1471-2350-8-S1-S3.
  • 28. Jiang D., Zhong S., McPeek M.S. Retrospective binary-trait association test elucidates genetic architecture of Crohn disease. Am. J. Hum. Genet. 2016;98:243–255. doi: 10.1016/j.ajhg.2015.12.012.
  • 29. Zhong S., Jiang D., McPeek M.S. CERAMIC: case-control association testing in samples with related individuals, based on retrospective mixed model analysis with adjustment for covariates. PLoS Genet. 2016;12:e1006329. doi: 10.1371/journal.pgen.1006329.
  • 30. Thornton T., McPeek M.S. Case-control association testing with related individuals: a more powerful quasi-likelihood score test. Am. J. Hum. Genet. 2007;81:321–337. doi: 10.1086/519497.
