Author manuscript; available in PMC: 2017 Apr 1.
Published in final edited form as: Genet Epidemiol. 2016 Apr;40(3):210–221. doi: 10.1002/gepi.21955

Uncovering local trends in genetic effects of multiple phenotypes via functional linear models

Olga A Vsevolozhskaya 1,*,#, Dmitri V Zaykin 2,#, David A Barondess 3, Xiaoren Tong 3, Sneha Jadhav 4, Qing Lu 3,*
PMCID: PMC4817279  NIHMSID: NIHMS758694  PMID: 27027515

Abstract

Recent technological advances equipped researchers with capabilities that go beyond traditional genotyping of loci known to be polymorphic in a general population. Genetic sequences of study participants can now be assessed directly. This capability removed technology-driven bias toward scoring predominantly common polymorphisms and let researchers reveal a wealth of rare and sample-specific variants. While the relative contributions of rare and common polymorphisms to trait variation are being debated, researchers are faced with the need for new statistical tools for simultaneous evaluation of all variants within a region. Several research groups demonstrated flexibility and good statistical power of the functional linear model approach. In this work we extend previous developments to allow inclusion of multiple traits and adjustment for additional covariates. Our functional approach is unique in that it provides a nuanced depiction of effects and interactions for the variables in the model by representing them as curves varying over a genetic region. We demonstrate flexibility and competitive power of our approach by contrasting its performance with commonly used statistical tools and illustrate its potential for discovery and characterization of genetic architecture of complex traits using sequencing data from the Dallas Heart Study.

Keywords: multivariate analysis, pleiotropy, genome-wide association studies, sequencing studies, quantitative traits, qualitative traits, functional analysis

1 Introduction

Genome-wide association studies (GWAS) have identified numerous risk loci for common complex diseases, and next-generation sequencing based association strategies are now emerging to characterize the contribution of rare genetic variants to human genetic disorders. Analysis of the ‘rare variant - common complex disease’ hypothesis requires tailored statistical methods, as single marker tests fail to uncover these rare variants [Carvajal-Carmona, 2010]. An entirely new powerful class of statistical methods based on non-parametric functions was recently developed for genetic association testing that can accommodate both rare and common variants, or the combination of the two [Fan et al., 2014, 2013; Lee et al., 2014; Luo et al., 2011, 2012a,b; Svishcheva et al., 2015; Vsevolozhskaya et al., 2014; Wang et al., 2015; Zhu and Xiong, 2012]. Comprehensive comparisons of non-parametric functional-based methods (FBMs) via simulation studies and real data applications have repeatedly shown that FBMs have a valid type-I error rate and substantially higher power to detect an association than alternative approaches. Additionally, FBMs have proven to be a powerful approach for genetic association studies with longitudinal data [Reimherr et al., 2014] and for the analysis of gene expression data [Storey et al., 2005].

Recently, our research group has demonstrated that within FBMs, functional analysis of variance (FANOVA) attains higher power to detect an association between a genetic region and a dichotomous trait compared to methods based on functional linear models (FLM) [Vsevolozhskaya et al., 2014]. Specifically, we have shown that FANOVA outperforms FLM for small to moderate effect sizes of the variants within a genetic region. Nonetheless, from a practical point of view, FANOVA had a notable limitation in that it was not able to accommodate quantitative traits or adjust for continuous covariates.

In light of these shortcomings, our aim was to extend the existing FANOVA method to association analyses of multiple quantitative and qualitative traits and to accommodate situations in which (1) a gene influences more than one trait (i.e., pleiotropy), (2) there are confounding/mediation effects (due to population substructure or other sources), and (3) the effect of disease risk can be modified by a trait or an exposure, a phenomenon that we refer to hereafter as “Treatment by Trait” (T×T) interaction.

To conceptualize T×T interaction, consider a study of genetic risk factors of substance abuse disorder. It is well known that personality traits like impulsivity and sensation-seeking are highly prevalent in drug-dependent individuals (e.g., [De Wit, 2009]). It is also known that personality traits are substantially influenced by genes (e.g., [Bouchard Jr and Loehlin, 2001]). Suppose there are genetic risk factors that contribute to the increased risk of developing drug addiction among individuals with high trait-impulsivity. Suppose, further, that a different genetic disposition might be involved in the increased risk of developing drug addiction among individuals with low trait-impulsivity. Hence, risk alleles for drug-dependence (i.e., ‘treatment’) might vary by the level of personality traits, which will be modeled as a T×T interaction in our generalized FANOVA approach, as described later.

A distinctive contribution of the approach presented here to the emerging field of FBMs for genetic association studies is the introduction of an efficient way to estimate the effects of phenotypes, confounding factors and T×T interactions using continuous curves smoothly varying over genetic loci. Previously proposed functional methods for genetic association studies (e.g., [Fan et al., 2013; Luo et al., 2011, 2012a]) and other methods that combine information across multiple variants within a gene (e.g., [Liu and Leal, 2010; Wu et al., 2011]) aggregate across both the association signals of genetic variants as well as over covariate effects. We exploit the flexibility of the functional approach to unveil a more nuanced blueprint of how covariate and interaction effects vary within a genetic region by estimating partial regression coefficient curves that change over variant positions.

Unlike traditional statistical models that treat a disease phenotype as an outcome (i.e., on the left-hand side of the equation), our model puts non-genetic variables on the right-hand side, including traits, environmental exposures, and confounders. The response function in our model is an allelic dosage curve, fitted through genetic variants within a region. If we start our modeling by including a binary trait such as drug dependence as a single predictor, the continuous regression coefficient will be the difference between average allelic dosages over multiple variants of the two groups. That is, a continuous intercept curve will estimate smoothed average allelic dosage among non drug dependent controls, and a continuous regression coefficient will estimate a deviation from this baseline allelic dosage over multiple variants among drug-dependent cases. Further, if we include personality trait as a covariate, the regression coefficient curve for drug-addiction will be adjusted for personality trait. Finally, if we include a T×T interaction between drug-dependence status and a personality trait, the deviation from the baseline allelic dosage among drug-dependent cases will vary by the level of a personality trait.

Functional models in which the genetic predictor (X) and the outcome (Y) are swapped in the regression equation are reminiscent of the reverse regression approach [Maddala, 1992]. In general, the coefficients of the direct and the reverse regressions are not the same; however, the test statistic for X (adjusted for any covariates) and the corresponding partial correlation coefficient remain the same after the swap. For example, adjustment for confounding or mediation is unaffected and remains valid in the reverse regression approach.
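The invariance of the test statistic under reversal can be checked numerically. The following is a minimal sketch in an ordinary (non-functional) regression setting, assuming simulated data; the helper `t_stat`, the variable names, and the effect sizes are illustrative and are not taken from the study.

```python
import numpy as np

def t_stat(y, predictors, which):
    """OLS t-statistic for predictor number `which` (an intercept is added)."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + list(predictors))
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    dof = n - A.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(A.T @ A)
    j = which + 1  # skip the intercept column
    return coef[j] / np.sqrt(cov[j, j])

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)                      # covariate (hypothetical)
g = 0.5 * z + rng.normal(size=n)            # stand-in for a genetic score
y = 0.3 * g + 0.8 * z + rng.normal(size=n)  # phenotype

t_direct = t_stat(y, [g, z], which=0)   # y ~ g + z, test for g
t_reverse = t_stat(g, [y, z], which=0)  # g ~ y + z, test for y
print(t_direct, t_reverse)              # the two values agree
```

The regression coefficients of the two fits differ, but both t-statistics are the same function of the partial correlation between g and y given z, which is why they coincide.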

To estimate continuous coefficient curves, our new generalized FANOVA approach utilizes a connection between penalized spline regression and best linear unbiased predictors (BLUPs), enabling a straightforward practical implementation using standard linear mixed models statistical software. A connection between BLUPs and penalized functional regression has been explored in statistical and machine learning literature [Brumback et al., 1999; Crainiceanu et al., 2005; Crainiceanu and Goldsmith, 2010; Eilers and Marx, 1996; Goldsmith et al., 2011; Ivanescu et al., 2014; Lian, 2007; Nosedal-Sanchez et al., 2012; Pearce and Wand, 2006; Ruppert et al., 2003; Wand and Ormerod, 2008; Wang, 1998]. However, this connection has largely been ignored in functional method approaches for genetic association studies.

We provide an illustration of our method using data from the Dallas Heart Study [Romeo et al., 2007], by characterizing associations of sequence variants with plasma triglyceride levels, modified by race and adjusted for sex. In addition to identifying the originally reported association between triglyceride levels and the ANGPTL4 gene, our new FANOVA approach identified specific sub-regions of the ANGPTL4 gene associated with plasma triglyceride levels among European Americans, African Americans, and Hispanics.

2 Methods

2.1 Genotypic functions: a brief overview

In brief, our method is an extension of the previously proposed FANOVA methodology, which seeks to quantify the relationship between scalar phenotypes X1, X2, . . . , Xk and smooth genotypic functions G(t)'s, with t indexing a genetic variant's position over a genetic region, t ∈ [0, τ] [Vsevolozhskaya et al., 2014]. By using the term ‘genotypic functions,’ we refer to nonparametric functions fitted with a basis expansion method [Ramsay and Silverman, 2005; Ruppert et al., 2003; Wood, 2006]. Thus, for each subject, the genetic data is not of a discrete (i.e., counted) nature, such as would be the case for genotype frequencies, but rather a single nonparametric genotypic function, G(t), of a continuous nature.

A genotypic function is obtained by either (i) a cubic B-spline basis expansion over a dense set of knots, κ1, . . . , κK, over the range of the variant's genomic positions ti's (in the one-base coordinate system) or (ii) penalized spline smoothing that avoids the knot selection problem completely (e.g., [Luo et al., 2012a; Vsevolozhskaya et al., 2014]). Earlier investigations of functional linear models designed for genetic association testing include comprehensive coverage of the estimation procedure for the genotypic functions G(t)'s [Fan et al., 2014, 2013; Lee et al., 2014; Luo et al., 2011, 2012a,b; Svishcheva et al., 2015; Vsevolozhskaya et al., 2014; Wang et al., 2015; Zhu and Xiong, 2012].

If we let G1(t), . . . , GN(t), t ∈ [0, τ] denote the functional genotypic data for N individuals, and we let X1i, . . . , XPi, i = 1, . . . , N denote a set of P variables that consists of covariates and traits (either quantitative or qualitative) that may contribute to a disease, our model for each individual's genotypic function is:

$$G_i(t) = \beta_0(t) + \beta_1(t) X_{1i} + \cdots + \beta_P(t) X_{Pi} + \epsilon_i(t), \tag{1}$$

where βi(t)'s are continuous regression coefficients that describe an association between a scalar trait and a set of variants in a genetic region t ∈ [0, τ], and where $\epsilon_i(t)$ is a residual function. Unlike traditional models where the outcome is regressed on a set of predictors, this model treats genetic information as an outcome. Outside of the functional approach, the utility of such “reverse regressions” has been explored previously for analysis of genetic associations [Feng, 2014; Kwan et al., 2011]. While coefficient estimates change, in general, when variables are swapped between the two sides of a regression equation, the partial correlations as well as the test statistics and P-values for the coefficients remain invariant: this follows simply from expressing these quantities in terms of the entries of the inverse of the correlation matrix between all variables including the outcome. Thus, tests for effects and the validity of regression adjustments are preserved under the reversal. There is also convenience in having the same type of outcome (i.e., genetic information) and thus the same type of link function regardless of the type and the number of other variables in the model. Additionally, within the functional approach, exploration of the $\hat{\beta}(t)$'s may allow researchers to determine sub-regions of [0, τ] that harbor causal genetic variants (i.e., sub-regions over which $\hat{\beta}(t) \neq 0$).

To estimate the $\hat{\beta}(t)$'s, we place the function-on-scalar regression in Eq. (1) into the context of a mixed-effects model or, more generally, embed the penalized splines problem into the class of reproducing kernel methods. To introduce the method, we first present the case of a single curve estimation, and conclude with the general case that allows us to estimate continuous coefficients of multiple traits, construct their confidence intervals and test for an association. We finally note that in the context of this paper, the word “kernel” should not be confused with a weight function as in local regression (or local smoothing), which is also called a kernel [Hastie et al., 2009].

2.2 Estimating a single curve

To draw connections between smoothing splines and reproducing kernels, first consider a simpler problem of estimating a single curve from the observed yi's and ti's, i = 1, . . . , n. One possible approach to estimating a nonparametric function f(t) from discrete data is to invoke penalized spline smoothing (e.g., [Wahba, 1990]). This smooth interpolation of the data is achieved by minimizing least squares fits with an additional roughness penalty (i.e., penalized sums of squares) as follows:

$$\min\left\{ n^{-1} \sum_{i=1}^{n} \big(y_i - f(t_i)\big)^2 + \lambda \int_0^1 [Ly(t)]^2 \, dt \right\}. \tag{2}$$

Here, the roughness of a function is quantified by the square of a linear differential operator Ly(t) (a typical choice is Ly(t) = f″(t) that corresponds to penalizing curvature of the function). The constant term, λ, referred to as a smoothing or a tuning parameter, should be either specified by a user or determined through the generalized cross-validation (GCV) [Wood, 2006].

The above minimization problem is analogous to a corresponding regularization problem within the machine learning domain:

$$\min_{f \in H} \left\{ \sum_{i=1}^{n} L\big(y_i, f(t_i)\big) + \lambda \lVert P f \rVert_H^2 \right\}, \tag{3}$$

where L(yi, f(ti)) is a loss function, ∥P f∥2 penalizes f in terms of the variability of its function values, and H is the reproducing kernel Hilbert space (RKHS) of real functions f. The theory of RKHS was developed by Aronszajn [1950] and Saitoh [1988], with good overviews provided by Smola and Schölkopf [1998], Wahba [1990], and Rasmussen and Williams [2006]. Briefly, a RKHS on $\mathbb{R}^d$ is a Hilbert space of real-valued functions generated by a bivariate symmetric, positive definite kernel k(·, ·) with the following properties: (i) for every t in $\mathbb{R}^d$, k(t, t′) is a function of t′ in H and (ii) k has the reproducing property $\langle k(\cdot, t_i), f \rangle_H = f(t_i)$, where 〈·, ·〉 denotes the inner product. To conceptualize penalized splines in Eq. (2) as BLUPs in a mixed model framework, we explore the solution to the regularization problem in Eq. (3) from the machine learning theory. Based on the results of the representer theorem [Kimeldorf and Wahba, 1971], it can be shown that each minimizer f ∈ H of Eq. (3) can be written as a linear combination of kernel functions, as follows:

$$f(t) = \sum_{i=1}^{n} \alpha_i \, k(t, t_i). \tag{4}$$

The solution for α = [α1, . . . , αn]′ can be obtained as $\hat{\alpha} = (K + \lambda I_n)^{-1} y$, in which K is the n × n matrix with ijth entry k(ti, tj), $I_n$ is the n × n identity matrix, and y is the n × 1 vector of observed yi's [Hastie et al., 2009; Rasmussen and Williams, 2006]. Further, the vector of n fitted values is given by $\hat{f} = K\hat{\alpha}$. This solution looks very similar to that from a linear regression model (i.e., $\hat{y} = T\hat{\beta}$, since we used ti instead of xi in Eqs. (2)-(3)). Regrettably, this reproducing kernel transformation of ti's does not simply move our non-linear problem into the ‘friendly’ linear model domain, because the solution for α depends on λ.
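As an illustration of the closed-form solution above, the following minimal sketch computes $\hat{\alpha}$ and the fitted values for simulated data. The Gaussian kernel, its bandwidth, the value of λ, and the simulated curve are assumptions made for the example only.

```python
import numpy as np

def gaussian_kernel(s, t, bandwidth=0.1):
    """Gaussian kernel matrix with entries k(s_i, t_j); an assumed kernel choice."""
    return np.exp(-(s[:, None] - t[None, :]) ** 2 / (2.0 * bandwidth ** 2))

rng = np.random.default_rng(1)
n = 50
t = np.sort(rng.uniform(0.0, 1.0, n))
y = np.sin(2.0 * np.pi * t) + rng.normal(scale=0.2, size=n)

lam = 0.1                                            # smoothing parameter, fixed by hand
K = gaussian_kernel(t, t)                            # n x n kernel matrix
alpha_hat = np.linalg.solve(K + lam * np.eye(n), y)  # (K + lam I)^{-1} y
f_hat = K @ alpha_hat                                # fitted values at the observed t_i

# Prediction at new positions uses the same coefficients, as in Eq. (4)
t_new = np.linspace(0.0, 1.0, 5)
f_new = gaussian_kernel(t_new, t) @ alpha_hat
```

Re-running the sketch with a different `lam` changes `alpha_hat`, which is the dependence on λ noted in the text.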

A slight variation of the representer theorem can be achieved by decomposing H into $H_0 \oplus H_1$, where $H_0$ is a finite-dimensional null space containing terms which will not be penalized, and $H_1$ is its orthogonal complement (i.e., penalized terms). For example, for ∥P f∥2 defined by differential operators of the form $Ly(t) = f^{(m)}(t)$, the null space $H_0$ is spanned by polynomials of degree up to m − 1. More specifically, if m = 2, then constant and linear functions are in the null space, because they are not penalized for ‘curvature.’ With the decomposition of H, the minimizer f of the regularization function in Eq. (3) now has the form:

$$f(t) = \sum_{j=1}^{m} d_j \, \phi_j(t) + \sum_{i=1}^{n} c_i \, k_1(t, t_i), \tag{5}$$

where ϕ1(t), . . . , ϕm(t) form the basis of H0 and k1(·, ·) is a reproducing kernel that generates H1. If m = 2 as in the example above, then ϕ1(t) = 1 and ϕ2(t) = t span the null space of unpenalized functions.

There are relatively few published recommendations in the statistical literature on how to construct k1(·, ·). For example, Lian [2007] writes “[...] the construction of k1 in general is difficult and a search of the literature does not seem to provide us with any clues about how to construct a positive definite kernel in general.” Nonetheless, if we shift our attention to the machine learning literature, we see that k1(t, ti) = G(t, ti), where G(t, ti) is a Green's function of the linear differential operator Ly(t) [Fasshauer, 2012; Fasshauer and Ye, 2013; Poggio and Girosi, 1990; Rasmussen and Williams, 2006]. Note that the Green's function also depends on the boundary conditions. A ‘natural’ choice is the “Natural Boundary Condition” $f^{(j)}(a) = f^{(j)}(b) = 0$, j = 1, . . . , m, where a and b are the boundaries of the functional domain [Green and Silverman, 1993].

How can we estimate the fitted values of the coefficients $\hat{d}$ and ĉ in Eq. (5) for a specific problem? If we re-write Eq. (5) using linear algebra notation as:

$$\hat{f} = \Phi \hat{d} + K_1 \hat{c}, \tag{6}$$

it becomes evident that Eq. (6) represents a solution to the linear mixed-effects model with design matrices Φ and K1, with $\hat{d}$ and ĉ estimated as best linear unbiased predictors (BLUPs) from this model [Speed, 1991]. In addition, the BLUP solution for the coefficients is independent of the smoothing parameter λ, which is equal to the ratio of the variances of the residuals and random effects. For numerical stability reasons, the design matrices are specified for a sequence of knots k1, . . . , kκ placed at sample quantiles over the range of ti's [Ruppert, 2002] as:

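$$\Phi = \begin{bmatrix} 1 & t_1 \\ \vdots & \vdots \\ 1 & t_n \end{bmatrix} \quad \text{and} \quad K_1 = \begin{bmatrix} (t_1 - k_1)_+ & \cdots & (t_1 - k_\kappa)_+ \\ \vdots & \ddots & \vdots \\ (t_n - k_1)_+ & \cdots & (t_n - k_\kappa)_+ \end{bmatrix}, \tag{7}$$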

where $G(t_i, t_j) = (t_i - t_j)_+$ is the Green's function of the linear differential operator $f^{(2)}(t)$, and $x_+ = \max\{0, x\}$. This specification of the design matrices corresponds to a truncated line basis expansion $\hat{f}(t) = \hat{d}_0 + \hat{d}_1 t + \sum_{i=1}^{\kappa} \hat{c}_i (t - k_i)_+$. Other choices of basis functions can also be used with corresponding changes to the penalized terms. Possible choices include, but are not limited to, (a) the truncated power basis $(t - k_i)_+^p$, (b) O'Sullivan splines [Wand and Ormerod, 2008], (c) thin plate splines [Ivanescu et al., 2014], or (d) the Gaussian kernel [Lian, 2007].
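The design matrices in Eq. (7) and the resulting penalized fit can be sketched as follows. This is an illustration under stated assumptions: the data are simulated, the number of knots is arbitrary, and λ is fixed by hand, whereas in the mixed-model formulation it would be estimated as a ratio of variance components (for example by REML in mixed-model software).

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_knots = 100, 10
t = np.sort(rng.uniform(0.0, 1.0, n))
y = np.sin(2.0 * np.pi * t) + rng.normal(scale=0.2, size=n)

# Knots placed at interior sample quantiles of the t_i's
probs = np.linspace(0.0, 1.0, n_knots + 2)[1:-1]
knots = np.quantile(t, probs)

Phi = np.column_stack([np.ones(n), t])             # unpenalized part: 1 and t
K1 = np.maximum(t[:, None] - knots[None, :], 0.0)  # truncated lines (t - k)_+
C = np.hstack([Phi, K1])

# Penalty acts only on the truncated-line coefficients c
D = np.diag([0.0, 0.0] + [1.0] * n_knots)

lam = 0.01  # assumed value, not estimated from the data here
theta = np.linalg.solve(C.T @ C + lam * D, C.T @ y)
d_hat, c_hat = theta[:2], theta[2:]
f_hat = Phi @ d_hat + K1 @ c_hat                   # Eq. (6)
```

Setting `lam = 0` gives an unpenalized piecewise-linear least squares fit, and larger values shrink `c_hat` toward zero so that the fit approaches a straight line.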

Some readers might wonder whether the mixed model formulation for penalized splines bears the same parameter interpretation as in a typical application to nested hierarchical data. We should clarify that the functional representation in Eq. (6) is just a convenient way of shifting a non-linear problem into a linear domain, while simultaneously estimating a smoothing parameter. Similarly, the random effects in c are just a convenient device to model the curvature in f and should not be interpreted as random effects, per se.

2.3 Estimating β(t)'s

With respect to the conceptual model in Eq. (1), continuous regression coefficients can be estimated as follows. For each subject, the genotypic function is evaluated on the grid of genomic positions t1, . . . , tn, i.e., Ĝi(t) = Ĝi(t1), . . . , Ĝi(tn). For the sequence of knots k1, . . . , kκ, each functional regression coefficient is expanded as a linear combination of ϕ's and k1's. This expansion yields the following mixed-model representation of Eq. (1):

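$$\hat{G}_i(t) = \hat{\beta}_0(t) + \hat{\beta}_1(t) X_{1i} + \cdots + \hat{\beta}_P(t) X_{Pi} = \underbrace{\Big(\hat{d}_1^{(0)} + \hat{d}_2^{(0)} t + \sum_{l=1}^{\kappa} k_1(t, k_l)\, \hat{c}_l^{(0)}\Big)}_{\hat{\beta}_0(t)} + \underbrace{\Big(\hat{d}_1^{(1)} + \hat{d}_2^{(1)} t + \sum_{l=1}^{\kappa} k_1(t, k_l)\, \hat{c}_l^{(1)}\Big)}_{\hat{\beta}_1(t)} X_{1i} + \cdots + \underbrace{\Big(\hat{d}_1^{(P)} + \hat{d}_2^{(P)} t + \sum_{l=1}^{\kappa} k_1(t, k_l)\, \hat{c}_l^{(P)}\Big)}_{\hat{\beta}_P(t)} X_{Pi}. \tag{8}$$

Here the superscript (p) marks the coefficients that belong to the expansion of $\hat{\beta}_p(t)$, and l indexes the knots.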

Conceptually, the generalized FANOVA-based regression coefficients, β(t)'s, are similar to the genetic effect coefficients in the recently published paper by Wang et al. [2015]. Specifically, Wang et al. [2015] also proposed to estimate regression coefficients, βl(t)'s, smoothly varying over the genetic position t. However, unlike the methodology proposed in the present study, their approach cannot simultaneously handle quantitative and qualitative traits, adjust coefficients for confounders/mediators over a continuum [0, τ], or modify effects by the level of another trait. With our approach, these adjustments can be easily incorporated into the model.

Suppose we want to adjust the effect of a risk factor X1 by trait X2 over all t. The model will be written as:

$$\hat{G}_i(t) = \hat{\beta}_0(t) + \hat{\beta}_1(t) X_{1i} + \hat{\beta}_2(t) X_{2i}.$$

Suppose, further, we want to modify the effect of a risk factor X1 by the level of trait X2, i.e., model a T×T interaction (for simplicity, assume that X2 has only two levels). The model can be expressed as:

$$\hat{G}_i(t) = \hat{\beta}_0(t) + \hat{\beta}_1(t) X_{1i} + \hat{\beta}_2(t) X_{2i} + \hat{\beta}_{12}(t) X_{1i} X_{2i}.$$

Then, for the first level of X2, dummy coded as 0, the association between a gene and X1 will be estimated by $\hat{\beta}_1(t)$:

$$\hat{G}_i(t) = \hat{\beta}_0(t) + \hat{\beta}_1(t) X_{1i},$$

and for the second level of X2, dummy coded as 1, the association between a gene and X1 will be modified as:

$$\hat{G}_i(t) = \big(\hat{\beta}_0(t) + \hat{\beta}_2(t)\big) + \big(\hat{\beta}_1(t) + \hat{\beta}_{12}(t)\big) X_{1i}.$$

To facilitate the data analysis using mixed-effects software, the input response should be a vectorized matrix of genotype functions for N subjects evaluated on the grid of genomic positions, $\text{vec}(\hat{G}) = [\hat{G}_1(t_1), \ldots, \hat{G}_1(t_n), \ldots, \hat{G}_N(t_1), \ldots, \hat{G}_N(t_n)]$. Input predictors should be (N · n) × 1 vectors X1, . . . , XP, which are generated by repeating each phenotype observation n times and stacking the repeated values on top of one another. The fixed and the random effects design matrices, Φ and K1, are then constructed as follows:

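$$\Phi = \begin{bmatrix} 1 & t_1 & X_{11} & t_1 X_{11} & \cdots & X_{P1} & t_1 X_{P1} \\ \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots \\ 1 & t_n & X_{11} & t_n X_{11} & \cdots & X_{P1} & t_n X_{P1} \\ \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots \\ 1 & t_1 & X_{1N} & t_1 X_{1N} & \cdots & X_{PN} & t_1 X_{PN} \\ \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots \\ 1 & t_n & X_{1N} & t_n X_{1N} & \cdots & X_{PN} & t_n X_{PN} \end{bmatrix} = \left[ 1_N \otimes \begin{bmatrix} 1 & t_1 \\ \vdots & \vdots \\ 1 & t_n \end{bmatrix},\; X_1 \otimes \begin{bmatrix} 1 & t_1 \\ \vdots & \vdots \\ 1 & t_n \end{bmatrix},\; \ldots,\; X_P \otimes \begin{bmatrix} 1 & t_1 \\ \vdots & \vdots \\ 1 & t_n \end{bmatrix} \right],$$

and $K_1 = [\, 1_N \otimes K,\; X_1 \otimes K,\; \ldots,\; X_P \otimes K \,]$, where $1_N$ is the N × 1 vector of 1's, ⊗ is the Kronecker product, and K is the n × κ matrix with ijth entry k1(ti, kj) calculated over the sequence of knots k1, . . . , kκ. For the dimensions to agree, each $X_p$ inside a Kronecker product is the N × 1 vector of the pth phenotype before its values are repeated.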
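The Kronecker-product construction can be sketched in a few lines. The grid, knots, and phenotype values below are toy numbers chosen only to make the shapes easy to inspect, and the truncated-line kernel from Section 2.2 is assumed for K.

```python
import numpy as np

t = np.array([0.1, 0.4, 0.7, 0.9])   # n = 4 grid positions
knots = np.array([0.3, 0.6])         # kappa = 2 knots
X = np.array([[1.0, 0.2],            # N = 3 subjects, P = 2 phenotypes
              [0.0, -1.3],
              [1.0, 0.5]])
N, P = X.shape
n = t.size

T = np.column_stack([np.ones(n), t])               # n x 2 block [1, t]
K = np.maximum(t[:, None] - knots[None, :], 0.0)   # n x kappa block (t - k)_+

# One N x 1 column per coefficient curve: the intercept, then each phenotype
cols = [np.ones((N, 1))] + [X[:, [p]] for p in range(P)]
Phi = np.hstack([np.kron(c, T) for c in cols])     # (N*n) x 2(P+1)
K1 = np.hstack([np.kron(c, K) for c in cols])      # (N*n) x kappa(P+1)

# The response stacks each subject's n grid values in turn (placeholder values)
G_hat = np.arange(N * n, dtype=float).reshape(N, n)
vec_G = G_hat.reshape(-1)
```

Row i·n + j of `Phi` and `K1` (zero-based) corresponds to subject i at grid position j, which matches the ordering of `vec_G`.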

2.4 Confidence interval for $\hat{\beta}(t)$

Since the conceptual model in Eq. (1) can be expressed as a mixed-effects model in Eq. (8), the typical inferential machinery for mixed-effects models can be used to obtain the variance-covariance estimates of the model parameters [Ruppert et al., 2003]. An explicit formulation for the estimated standard deviation of $\hat{\beta}(t)$ is:

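$$\widehat{\text{st.dev}}\big(\hat{\beta}(t)\big) = \hat{\sigma}_\epsilon \sqrt{C (C^T C + \hat{\lambda} D)^{-1} C^T C (C^T C + \hat{\lambda} D)^{-1} C^T}, \tag{9}$$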

where $\hat{\sigma}_\epsilon$ is a REML estimate of $\sigma_\epsilon$, C = [Φ K1] is formed by the two design matrices described in Eq. (7), $\hat{\lambda} = \hat{\sigma}_\epsilon^2 / \hat{\sigma}_c^2$ is the estimated smoothing parameter, and D is formed as follows:

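$$D = \begin{bmatrix} 0_{m \times m} & 0_{m \times \kappa} \\ 0_{\kappa \times m} & I_{\kappa \times \kappa} \end{bmatrix},$$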

where m is the number of ‘fixed effects’ and κ is the number of ‘random effects.’ An approximate point-wise 100(1 − α)% confidence interval is $\hat{\beta}(t) \pm z_{(1-\alpha/2)}\, \widehat{\text{st.dev}}\big(\hat{\beta}(t)\big)$. Alternatively, Bayesian credible intervals can be obtained by realizing a connection between Gaussian processes and spline construction [Crainiceanu et al., 2005; Rasmussen and Williams, 2006], or “subject re-sampling” bootstrap error bars can be obtained to construct the confidence intervals [Wu and Yu, 2002].
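For a single fitted curve, the point-wise band based on Eq. (9) can be sketched as below. Several simplifications are assumptions of the example: the data are simulated, λ is fixed by hand instead of being estimated by REML, the residual standard deviation is estimated with n − trace(S) degrees of freedom, and the square root in Eq. (9) is applied to the diagonal of the matrix product.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_knots, lam = 120, 8, 0.05
t = np.sort(rng.uniform(0.0, 1.0, n))
y = np.cos(2.0 * np.pi * t) + rng.normal(scale=0.3, size=n)

knots = np.quantile(t, np.linspace(0.0, 1.0, n_knots + 2)[1:-1])
C = np.column_stack([np.ones(n), t, np.maximum(t[:, None] - knots, 0.0)])
m = 2                                         # number of unpenalized columns
D = np.diag([0.0] * m + [1.0] * n_knots)

A = np.linalg.solve(C.T @ C + lam * D, C.T)   # (C'C + lam D)^{-1} C'
S = C @ A                                     # smoother ("hat") matrix
f_hat = S @ y

resid = y - f_hat
sigma_hat = np.sqrt(resid @ resid / (n - np.trace(S)))  # simple plug-in estimate

# S is symmetric, so the matrix in Eq. (9) equals S S^T; its diagonal is the row sums of S**2
sd = sigma_hat * np.sqrt(np.sum(S ** 2, axis=1))
z = 1.959964                                  # 97.5th percentile of N(0, 1)
lower, upper = f_hat - z * sd, f_hat + z * sd
```

The band is point-wise: it covers the curve at each position separately and is not a simultaneous band over the whole region.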

In the application of point-wise bands to functional genotype data, the issue of the bias-variance trade-off associated with the selection of the degree of smoothing might deserve more careful attention. Specifically, in the context of the mixed-effects model in Eq. (8), the response variable is a fitted genotypic function Ĝ(t). If the fitted function is somewhat wiggly, this ‘noise’ will account for the increased width of the point-wise standard error bands for $\hat{\beta}(t)$. We previously proposed the ‘flipping algorithm’ for genotype re-labeling that decreases the number of noisy oscillations for smoothed genotype data and showed that this approach results in a substantial increase of statistical power to detect a genetic association [Vsevolozhskaya et al., 2014]. Nonetheless, overly smooth genotype functions might result in narrow standard error bands for $\hat{\beta}(t)$ and thus estimate a biased version of the true function with great reliability. Further research is needed on the issue of optimal choice of a smoothing parameter in the context of genotype function fitting.

2.5 Testing for an association

In this section we turn our attention to a test statistic used for evaluating an association between a genetic region and one or more phenotypes. Whereas different types of point-wise confidence intervals for the coefficient curves can be constructed, the hypothesis testing problem of distinguishing an optimal sub-model of β(t)'s is still of interest. To address this issue, we will use the functional F statistic [Shen and Faraway, 2004] as previously used in our FANOVA methodology [Vsevolozhskaya et al., 2014]. Specifically, suppose we want to test the nullity of a single predictor:

$$H_0: \beta_i(t) = 0, \quad i = 1, \ldots, P.$$

By using Theorem 2 in Shen and Faraway [2004], a test statistic to determine if β(t) is equivalent to the zero function can be constructed as follows:

$$F = \frac{(N - P) \int \hat{\beta}_i^2(t) \, dt}{\text{rss}_1 \, (X^T X)^{-1}_{ii}}, \tag{10}$$

where X = (1 X1 . . . XP) is a design matrix for the full model containing all phenotypic variables, and $\text{rss}_1 = \sum_{i=1}^{N} \int \big(\hat{G}_i(t) - X\hat{\beta}\big)^2 dt$ is the residual sum of squares for the full model. Under the null hypothesis, it can be shown (e.g., [Reimherr et al., 2014; Shen and Faraway, 2004; Zhang, 2013]) that the distribution of F can be approximated by an F-distribution as:

$$F \sim F_{\hat{d},\, (N - P)\hat{d}},$$

where $\hat{d} = \big(\sum_{i=1}^{n} r_i\big)^2 \big/ \sum_{i=1}^{n} r_i^2$, with n being the number of genetic variants, and $r_i$ is the ith ordered eigenvalue of the empirical variance-covariance matrix under the full model, $\hat{\Sigma}_1$.
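A minimal numerical sketch of Eq. (10) and of the degrees-of-freedom adjustment $\hat{d}$ follows. The assumptions are: simulated genotype functions on a common grid, point-wise least squares estimates of the coefficient curves, integrals approximated by the trapezoidal rule, and residual degrees of freedom taken as N minus the number of columns of X (the text writes this as N − P).

```python
import numpy as np

def integrate(values, grid):
    """Trapezoidal rule along the last axis."""
    return np.sum((values[..., 1:] + values[..., :-1]) * np.diff(grid), axis=-1) / 2.0

rng = np.random.default_rng(4)
N, n = 80, 30
t = np.linspace(0.0, 1.0, n)
x1 = rng.normal(size=N)
X = np.column_stack([np.ones(N), x1])      # design matrix (1, X_1)
beta1_true = 0.5 * np.sin(np.pi * t)       # simulated effect curve
G = 0.3 + np.outer(x1, beta1_true) + rng.normal(scale=0.5, size=(N, n))

XtX_inv = np.linalg.inv(X.T @ X)
B = XtX_inv @ X.T @ G                      # row j holds beta_hat_j(t) on the grid
resid = G - X @ B
dof = N - X.shape[1]

i = 1                                      # test the curve attached to X_1
rss1 = integrate(resid ** 2, t).sum()
F = dof * integrate(B[i] ** 2, t) / (rss1 * XtX_inv[i, i])

Sigma1 = resid.T @ resid / dof             # empirical covariance over the grid
r = np.linalg.eigvalsh(Sigma1)
d_hat = r.sum() ** 2 / np.sum(r ** 2)
print(F, d_hat)
```

A P-value would then come from the upper tail of an F distribution with $\hat{d}$ and $\hat{d}$ times the residual degrees of freedom, for instance via `scipy.stats.f.sf(F, d_hat, dof * d_hat)` if SciPy is available.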

Alternatively, if we want to test the nullity of K predictors simultaneously, that is, to compare the full model:

$$\hat{G}_i(t) = \sum_{j=1}^{P} \hat{\beta}_j(t) X_{ij},$$

to the reduced model:

$$\hat{G}_i(t) = \sum_{j=1}^{P-K} \hat{\beta}_j(t) X_{ij},$$

the test statistic F can be defined in terms of the reduction in the sums of squared errors, as follows:

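$$F = \frac{(\text{rss}_0 - \text{rss}_1)/K}{\text{rss}_1/(N - P)} = \frac{\text{trace}(\hat{\Sigma}_0 - \hat{\Sigma}_1)/K}{\text{trace}(\hat{\Sigma}_1)/(N - P)}, \tag{11}$$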

where rss0 is the residual sum of squares for the reduced model, and $\hat{\Sigma}_0$ is the empirical variance-covariance matrix under the reduced model. Under the null hypothesis, the distribution of F is approximated by $F_{K\hat{d},\, (N - P)\hat{d}}$.

We note that the test statistic in Eq. (11) is computationally more complex than the one in Eq. (10). That is, if the goal is to test the nullity of only one predictor at a time, the test statistic in Eq. (10) can be calculated directly by fitting only the full model, and thus omitting fitting the reduced model. Further details and comparisons of the two formulas can be found in Shen and Faraway [2004].
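The comparison of a full and a reduced model through the residual sums of squares can be sketched as follows, using the first form of Eq. (11). The data are simulated so that the two dropped predictors have no effect; the helper names and the use of N minus the number of columns of the full design matrix as residual degrees of freedom are assumptions of the example.

```python
import numpy as np

def integrate(values, grid):
    """Trapezoidal rule along the last axis."""
    return np.sum((values[..., 1:] + values[..., :-1]) * np.diff(grid), axis=-1) / 2.0

def functional_rss(G, X, grid):
    """Residual sum of squares of a point-wise least squares fit, integrated over the grid."""
    B, *_ = np.linalg.lstsq(X, G, rcond=None)
    return integrate((G - X @ B) ** 2, grid).sum()

rng = np.random.default_rng(6)
N, n = 100, 25
t = np.linspace(0.0, 1.0, n)
x1, x2, x3 = rng.normal(size=(3, N))
# Only x1 affects the simulated genotype functions; x2 and x3 are null predictors
G = 0.2 + np.outer(x1, 0.4 * t) + rng.normal(scale=0.5, size=(N, n))

X_full = np.column_stack([np.ones(N), x1, x2, x3])
X_reduced = np.column_stack([np.ones(N), x1])
K = X_full.shape[1] - X_reduced.shape[1]   # number of predictors tested jointly
dof = N - X_full.shape[1]

rss1 = functional_rss(G, X_full, t)
rss0 = functional_rss(G, X_reduced, t)
F = ((rss0 - rss1) / K) / (rss1 / dof)
print(F)
```

Both models must be fitted here, which is the extra computational cost relative to Eq. (10) mentioned above.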

3 Simulation Study

3.1 Design

The flexibility of our method allows us to accommodate various analysis settings and types of variables, including multiple, possibly correlated or pleiotropic phenotypes, and T×T interactions. One way to analyze multiple traits is to test for an association one trait at a time. For proper control of the experiment-wise false-discovery error rate, this ‘one at a time’ testing approach requires accounting for the number of tests performed and correcting each individual trait's P-value. This individual correction typically inflates the corrected P-values. In contrast, our method provides an efficient way of testing multiple traits simultaneously, with no P-value correction required, and thus naturally provides superior performance in terms of statistical power to detect an association. Moreover, to handle T×T interactions, or to assess modification of genetic susceptibility to disease by trait, our model requires a test of nullity for an interaction term. Previously, we investigated the power of FANOVA to detect an association with a single predictor [Vsevolozhskaya et al., 2014]. Simulation studies presented here reflect the extension of our previous basic model with the addition of mediation/confounding scenarios.

Figure 1 aids in conceptualization of our data simulation process. We focused on a three variable system and hypothesized that there is a genetic predisposition (G) to continuous phenotypes (Z) and (X). We also assumed a relationship between (Z) and (X) and were interested in testing for an association between (G) and (X), while adjusting for the third variable (Z). Clearly, data generated under this scenario fits the mediation analysis framework, but MacKinnon et al. [2000] point out that the label of (Z) (i.e., either as a mediator or a confounder) depends on the framework used to conceptualize the phenomenon. From a statistical modeling point of view, directionality and the causality are indistinguishable, making these seemingly different concepts of mediation and confounding statistically equivalent. Therefore, data generated under our design can be used to check for both a mediator and a confounding control.

Figure 1. The genetic information (G(t)) is directly associated with the outcome of interest (X) and indirectly through the third variable (Z).

3.2 Data generation

We generated genetic data (G) using the 1000 Genomes Project [Durbin et al., 2010] to mimic the real sequencing data structure (e.g., linkage disequilibrium patterns, allele frequencies, and randomly missing genotype data). Specifically, at each simulation iteration, a random 30 kb section of a genetic region was drawn. Within this 30 kb region, each simulated data set contained an average of 300 variants with minor allele frequencies (MAF) ranging from less than 0.001 to almost 0.5. The complete distribution of MAF for all variants across simulations is provided in the left panel of Figure 2.

Figure 2. Panel (a): The range and the distribution of MAF for all variants. Panels (b)-(c): MAF distribution of causal variants by the effect size.

Next, a continuous trait (Z) was simulated as:

$$Z_i = \sum_{j=1}^{n} G_i^{\chi}(t_j) \times \gamma(t_j) + \epsilon_i, \quad i = 1, \ldots, N, \; j = 1, \ldots, n, \tag{12}$$

where N is the number of subjects, n is the number of variants, tj indexes the position of variants, γ(tj) is the effect of the variant at position tj, εi ~ N(0, 1), and “χ” indicates a subset of genetic variants harboring causal alleles. For example, if χ = 10%, then a random sample of 10% of all variants for subject i were causal, and the effect of each causal variant j, γ(tj), was drawn from an $N(\mu_\gamma, \sigma_\gamma^2)$ distribution (the rest of the γ(t)'s, corresponding to non-causal variants, were zero). If μγ = 0, the effect of a given causal variant was either protective or deleterious. If μγ > 0, then the majority of causal variants had the same direction of effect (i.e., deleterious), and the magnitude of the effect size varied by manipulating $\sigma_\gamma^2$. The middle panel of Figure 2 illustrates simulated effects by MAF for the choice μγ = 0 and $\sigma_\gamma^2 = 1$; the right panel for μγ = 0.25 and $\sigma_\gamma^2 = 1$. The reader should note that under our simulation scenario, the causal variants can be both rare and common. Alternative situations with only rare or only common causal variants were previously investigated by our group and showed favorable performance by FANOVA [Vsevolozhskaya et al., 2014].

Another continuous trait (X) was simulated as:

$$X_i = \sum_{j=1}^{n} G_i^{\chi}(t_j) \times \alpha(t_j) + \beta \times Z_i + \epsilon_i. \tag{13}$$

Similar to γ(tj), $\alpha(t_j) \sim N(\mu_\alpha, \sigma_\alpha^2)$ represents the effect of a causal variant j on the trait (X), and β ~ N(3, 1) represents the effect of the third variable (Z) on the trait (X).
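The trait-generating steps in Eqs. (12) and (13) can be sketched as follows. The genotype matrix here is a stand-in: genotypes are drawn independently from binomial distributions with hypothetical allele frequencies, whereas the study sampled regions from the 1000 Genomes Project data and therefore preserved linkage disequilibrium. A single causal set shared by all subjects is also assumed, with separate random sets for the two traits.

```python
import numpy as np

rng = np.random.default_rng(5)
N, n = 500, 300                                 # subjects, variants
maf = rng.uniform(0.001, 0.5, size=n)           # hypothetical allele frequencies
G = rng.binomial(2, maf, size=(N, n)).astype(float)  # allelic dosages 0/1/2

chi = 0.10                                      # proportion of causal variants
n_causal = int(round(chi * n))
mu_gamma, sd_gamma = 0.0, 1.0                   # effects of G on Z
mu_alpha, sd_alpha = 0.0, 1.0                   # effects of G on X

def causal_effects(mu, sd):
    """Zero effects everywhere except a random causal subset with N(mu, sd^2) effects."""
    effects = np.zeros(n)
    idx = rng.choice(n, size=n_causal, replace=False)
    effects[idx] = rng.normal(mu, sd, size=n_causal)
    return effects

gamma = causal_effects(mu_gamma, sd_gamma)
alpha = causal_effects(mu_alpha, sd_alpha)
beta = rng.normal(3.0, 1.0)                     # effect of Z on X

Z = G @ gamma + rng.normal(size=N)              # Eq. (12)
X = G @ alpha + beta * Z + rng.normal(size=N)   # Eq. (13)
```

Setting `alpha` to all zeros gives the null scenario in which (G) is associated with (X) only through (Z).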

3.3 Type I error results

For empirical type I error simulations, we set the genetic effect on the continuous trait (X) to zero, i.e., α(tj) = 0 for all j, and tested for an association between (G) and (X), while adjusting for (Z). The percentage of risk variants for the association between (G) and (Z) in Eq. (12) was set to χ = 30% and the γ(tj)'s were simulated from an N(μγ = 0, σγ = 3) distribution. For the different sample sizes, we compared the generalized FANOVA approach to the SKAT methodology [Wu et al., 2011]. The results are summarized in Table 1. For both methods, all empirical type I error rates are around the nominal α levels with the exception of SKAT for a small sample size. To further contrast the differences between FANOVA and SKAT, we proceeded to a comparison of power simulations.

Table 1.

Empirical type I error rates for the association tests between (G) and (X), while adjusting for (Z).

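| Sample size | Nominal level α | FANOVA | SKAT |
|---|---|---|---|
| 50 | 0.05 | 0.04319 | 0.04164 |
| 50 | 0.01 | 0.01037 | 0.00845 |
| 50 | 0.001 | 0.00191 | 0.00018 |
| 50 | 0.0001 | 0.00036 | 0.00000 |
| 500 | 0.05 | 0.04346 | 0.04854 |
| 500 | 0.01 | 0.00941 | 0.01002 |
| 500 | 0.001 | 0.00108 | 0.00123 |
| 500 | 0.0001 | 0.00023 | 0.00000 |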
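For readers who want to reproduce entries of this kind, an empirical type I error rate is simply the fraction of null-simulation P-values at or below the nominal level. The sketch below also reports the binomial Monte Carlo standard error, which helps judge whether a deviation from the nominal level is larger than simulation noise. The number of replicates behind Table 1 is not stated in this section, so the replicate count here is whatever the caller supplies; the function name is hypothetical.

```python
import numpy as np

def empirical_rejection_rate(p_values, alpha):
    """Fraction of P-values <= alpha, plus the Monte Carlo standard error
    expected if the true rejection probability equals alpha."""
    p = np.asarray(p_values, dtype=float)
    rate = float(np.mean(p <= alpha))
    mc_se = float(np.sqrt(alpha * (1.0 - alpha) / p.size))
    return rate, mc_se
```

The same function applied to P-values simulated under an alternative gives empirical power.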

3.4 Statistical power results

For the statistical power comparison, traits (Z) and (X) were assigned the same percentage of risk variants, with the risk variants themselves chosen at random. The percentage of risk variants was set to 5%, 10%, 30%, 50%, 70%, 90%, and 100%. The sample size values were N = 50, 500, 2500, and 5000. The execution time of a single iteration of the simulations (the statistical power is based on at least 1,000 iterations) on a single core (2.5 GHz Intel Xeon E5-2670v2) of the high-performance computing center (HPCC: https://wiki.hpcc.msu.edu/) ranged from 20 seconds for N = 50 up to an hour for N = 5000. The allocated memory for N = 5000 subjects was 64 GB.

Figure 3 summarizes empirical power results for the scenario in which risk variants had either positive or negative effects (i.e., μγ = μα = 0) for different numbers of subjects. In Figure 4 the majority of risk variants had deleterious effects for both traits (i.e., μγ > 0 and μα > 0). In each figure, the generalized FANOVA statistical power to detect an association between (G) and (X), while adjusting for (Z), is represented by a solid line, and the power of SKAT is represented by a dashed line.

Figure 3.

Empirical power of FANOVA (solid line) and SKAT (dashed line) when the variants can have either protective or deleterious effects (i.e., μγ = μα = 0). Panel (a): N = 50, σγ = σα = 0.05; (b): N = 50, σγ = σα = 1; (c): N = 500, σγ = σα = 0.05; (d): N = 1000, σγ = σα = 0.05; (e): N = 2500, σγ = σα = 0.015; (f): N = 5000, σγ = σα = 0.015.

Figure 4.

Empirical power of FANOVA (solid line) and SKAT (dashed line) when the majority of variants have deleterious effects (i.e., μγ > 0 and μα > 0). Panel (a): N = 50, μγ = μα = 0.05, σγ = σα = 0.25; (b): N = 50, μγ = μα = 0.05, σγ = σα = 1; (c): N = 500, μγ = μα = 0.05, σγ = σα = 0.15; (d): N = 500, μγ = μα = 0.25, σγ = σα = 0.15; (e): N = 1000, μγ = μα = 0.05, σγ = σα = 0.05.

In general, the proposed FANOVA approach attained higher power than SKAT, especially for small sample sizes, small effect sizes, and small percentages of risk variants. The empirical power of the two approaches became comparable when the effect sizes and the proportion of risk variants were large.

4 Application to real data: ANGPTL4 association with triglycerides

To further illustrate the utility of our generalized FANOVA approach, we turn to the issue of association testing between sequence variations in the ANGPTL4 gene and lipid metabolism. In mice, the involvement of ANGPTL4 in lipid metabolism was shown by intravenous injection of recombinant ANGPTL4, which resulted in an increase in plasma triglyceride (TG) levels [Yoshida et al., 2002]. In humans, the involvement of ANGPTL4 in lipid metabolism is probable and may be associated with a higher risk of cardiovascular disorder [Kathiresan et al., 2009; Muendlein et al., 2014; Romeo et al., 2007]. However, each individual ANGPTL4 variant confers a modest effect [Kathiresan et al., 2009], suggesting improved statistical power for methods like generalized FANOVA that perform a joint gene-based association analysis.

We conducted an analysis of 93 sequence variations in ANGPTL4 that were identified among 3,551 participants in the Dallas Heart Study [Romeo et al., 2007]. To examine an increase in plasma TG levels, we binned individuals into the ‘low-triglyceride’ group (660 individuals with plasma triglyceride level ≤25th percentile) and into the ‘high-triglyceride’ group (679 individuals with plasma triglyceride level ≥75th percentile). The resulting sample included 443 individuals of mixed European descent, 651 African Americans, and 245 Hispanics.
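The extreme-group binning described above can be sketched as follows. This is a minimal illustration that assumes the percentiles are computed on a single vector of TG measurements; how the cutoffs were actually derived for the Dallas Heart Study sample (for example, within strata) is not specified here. The function name and the 0/1 coding of the resulting indicator are assumptions for the example.

```python
import numpy as np

def extreme_groups(tg):
    """Keep subjects at or below the 25th or at or above the 75th percentile.

    Returns a boolean mask of retained subjects and, for those subjects,
    an indicator that is 1 for the high-triglyceride group and 0 for the low group.
    """
    tg = np.asarray(tg, dtype=float)
    q25, q75 = np.percentile(tg, [25, 75])
    low = tg <= q25
    high = tg >= q75
    keep = low | high
    x_tg = high[keep].astype(int)
    return keep, x_tg
```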

As discussed elsewhere (e.g., [Svishcheva et al., 2015; Vsevolozhskaya et al., 2014]), statistical power of functional methods may depend on the quality of genotype data smoothing. To obtain smooth genotypic functions, we first coded allelic dosage based on the minor allele counts (i.e., either 0, 1 or 2) and applied the “flipping algorithm” [Vsevolozhskaya et al., 2014] to minimize the number of 0-2 (or 2-0) patterns in every two subsequent variant positions. However, because the majority of 93 sequenced variants were rare [Romeo et al., 2007], the coding based on minor allele counts was concluded to be optimal and no re-coding of allelic dosage was necessary.
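To make the idea of recoding concrete, the following is a simple greedy sketch of a flipping step. It is not the published "flipping algorithm" of Vsevolozhskaya et al. [2014], whose details are not given in this section; it only illustrates the stated goal of reducing 0-2 and 2-0 patterns between consecutive variant positions. Here `G` is a hypothetical subjects-by-variants matrix of allele counts, and flipping a variant means replacing its dosage d by 2 - d for every subject.

```python
import numpy as np

def greedy_flip(G):
    """Left-to-right greedy recoding (illustrative assumption, not the published method).

    A variant's coding is flipped (d -> 2 - d) when doing so lowers the number
    of subjects showing a 0-2 or 2-0 jump relative to the previous variant.
    """
    G = np.array(G, dtype=int)               # work on a copy
    flipped = np.zeros(G.shape[1], dtype=bool)
    for j in range(1, G.shape[1]):
        prev = G[:, j - 1]
        keep_cost = np.sum(np.abs(G[:, j] - prev) == 2)
        flip_cost = np.sum(np.abs((2 - G[:, j]) - prev) == 2)
        if flip_cost < keep_cost:
            G[:, j] = 2 - G[:, j]
            flipped[j] = True
    return G, flipped
```

When most variants are rare, nearly all dosages are 0 and very few 0-2 jumps occur, so a procedure of this kind would leave the minor-allele coding essentially unchanged.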

To examine an effect of increase in TG levels, modified by race and adjusted for sex, we built the following model:

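$$\hat{G}_i(t_j) = \beta_0(t_j) + \beta_1(t_j) X_{TG_i} + \beta_2(t_j) X_{AA_i} + \beta_3(t_j) X_{H_i} + \beta_{12}(t_j) X_{TG_i} X_{AA_i} + \beta_{13}(t_j) X_{TG_i} X_{H_i} + \beta_4(t_j) X_{Sex_i} + \epsilon_i(t_j),$$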

where β0(tj) is the smoothed baseline allelic dosage at variant position tj, j = 1, . . . , 93, and β1(tj) is the effect of TG increase on allelic dosage. The next four terms are added to examine the T×T interaction, that is, whether the effect of TG increase varies among European Americans (β1(tj)), African Americans (β1(tj) + β12(tj)), and Hispanics (β1(tj) + β13(tj)). Finally, β4(tj) is the adjustment for sex.
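The way the group-specific TG effects are assembled from the coefficients can be illustrated with an ordinary least-squares fit performed separately at each position. This is a simplified, unpenalized sketch and not the penalized functional fit used in the paper; it assumes 0/1 indicator coding with European Americans as the reference group, and the function names are hypothetical.

```python
import numpy as np

def design_matrix(tg, aa, h, sex):
    """Columns: intercept, TG, AA, H, TG*AA, TG*H, Sex (all 0/1 indicators)."""
    tg, aa, h, sex = (np.asarray(v, dtype=float) for v in (tg, aa, h, sex))
    return np.column_stack([np.ones_like(tg), tg, aa, h, tg * aa, tg * h, sex])

def group_tg_effects(G_hat, X):
    """Pointwise least squares; G_hat is subjects x positions.

    Returns the TG effect curve for each group, built from beta_1,
    beta_1 + beta_12, and beta_1 + beta_13.
    """
    B, *_ = np.linalg.lstsq(X, G_hat, rcond=None)   # B has shape (7, n_positions)
    b1, b12, b13 = B[1], B[4], B[5]
    return {"EA": b1, "AA": b1 + b12, "H": b1 + b13}
```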

To determine the most parsimonious model, we first performed a test for T×T interaction, i.e., H0 : β12(tj) = β13(tj) = 0 for all tj, and found statistically significant differences in TG-increasing effect among individuals of different racial descent (P-value=0.0028). We note that the magnitude of this P-value remained the same for different choices of kernels and as such, we proceeded to explore specific sub-regions of the ANGPTL4 gene that may harbor causal variants for the different racial groups.

Each panel of Figure 5 illustrates the estimated TG-increasing effect for one racial group across the 93 variants of the ANGPTL4 gene. Further, the positions of the recently identified variants E40K and T266M [Romeo et al., 2007; Talmud et al., 2008] are added as vertical lines to each panel. The left panel of Figure 5 shows β^1(t), the estimated effect of TG increase among European Americans. From this panel we can infer that the region around the E40K variant contributes the most among European Americans, since it is the region over which β^1(t) deviates the most from the zero line. Additionally, the direction of β^1(t) around E40K is negative, indicating that TG increase is associated with a lower dosage of the E40K variant, which implies that European American E40K carriers can be expected to have lower TG levels. However, the confidence bands for β^1(t) include zero and indicate a lack of statistical significance.

Figure 5.

TG-increasing effect among European Americans (left panel), African Americans (middle panel), and Hispanics (right panel) with the 95% confidence bands (shaded regions).

The right panel of Figure 5 shows β^1(t)+β^13(t), the estimated effect of TG increase among Hispanics. Once again, the effect is largest in magnitude around the E40K region, but its direction is reversed, indicating that Hispanic E40K carriers tend to have higher TG levels. Additionally, among Hispanics, the association of the E40K region with TG increase reached statistical significance.

The middle panel of Figure 5 shows β^1(t)+β^12(t), the estimated effect of TG increase among African Americans. Unlike among European Americans and Hispanics, the E40K variant does not appear to be appreciably associated with TG increase. Also, no contribution of the T266M variant to TG increase (or decrease) was found in any racial group.

Finally, to compare our T×T interaction results to SKAT, we performed a subgroup analysis on data from European Americans, African Americans, and Hispanics. The P-values, adjusted for sex, for the test of an association between TG levels and variants in the ANGPTL4 gene were as follows: among European Americans PSKAT = 0.0006, PFANOVA = 0.0262; among Hispanics PSKAT = 0.1738, PFANOVA = 0.0001; among African Americans PSKAT = 0.2321, PFANOVA = 0.9447. Accordingly, both methods concluded an association between ANGPTL4 variants and plasma triglyceride levels among European Americans and no association among African Americans, and gave discordant results among Hispanics. The reader should not be surprised by the seemingly conflicting FANOVA conclusions for European Americans summarized via the confidence bands in Figure 5 and via the P-value for the association test. It has been noted multiple times, including by our research group [Vsevolozhskaya et al., 2015], that a combination of multiple "marginally significant" outcomes across different variants may result in overall significance for a genetic region.
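The last point can be illustrated numerically with Fisher's combination method, used here purely as a stand-in: it is not the FANOVA test, and it assumes independent tests, whereas neighboring variants are correlated through linkage disequilibrium. The sketch uses the closed-form survival function of a chi-square distribution with an even number of degrees of freedom, so only the standard library is needed. The function name is hypothetical.

```python
import math

def fisher_combined_p(p_values):
    """Fisher's method for k independent P-values.

    The statistic -2 * sum(log p) is chi-square with 2k degrees of freedom;
    for even degrees of freedom its upper tail is
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)   # equals statistic / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total

# Five results that are each non-significant at the 0.05 level.
individual = [0.10] * 5
combined = fisher_combined_p(individual)
```

In this toy case every individual P-value exceeds 0.05, yet the combined P-value falls below it, which mirrors how pointwise bands can include zero while a region-level test is significant.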

5 Discussion

By generalizing previously proposed FANOVA methodology, we offer a novel approach not previously explored in FLM-based association studies for estimating multiple phenotype-specific effects smoothly varying over genetic variants. Furthermore, by treating genetic information as the response variable and all traits as predictors (qualitative or quantitative), the generalized FANOVA provides a straightforward way to account for hidden population stratification, confounders, mediators and T×T interactions. The established connection between penalized least squares and best linear unbiased predictors allows for a straightforward implementation of the proposed methodology using standard mixed linear model software.

The introduced notion of T×T interaction deserves additional clarification. We are not necessarily putting emphasis on the interaction itself or the value of its coefficient. Rather, the inclusion of this term gives a simple way of detecting possible effects of various combinations of treatment and trait values that may go beyond what is captured by the sum of their individual effects.

How well do our generalized FANOVA regression coefficient estimates replicate what others have found in prior studies of ANGPTL4? Studies by Romeo et al. [2007] and Talmud et al. [2008] revealed that, among European Americans, E40K carriers have significantly lower TG levels. Talmud et al. [2008] also showed a TG-lowering effect of the T266M variant, but only among E40K carriers (i.e., whenever E40K men were excluded from the reanalysis, there was no longer a significant association between T266M and TG levels). T266M is more prevalent than E40K, and in our sample only 16 of the 620 T266M carriers were also carriers of E40K, which may explain the lack of association. Furthermore, no studies have presented conclusive findings on the TG-lowering effect of mutations in ANGPTL4, so a replication of the reported association is required.

Our generalized FANOVA model is a functional model analogue of "reverse regression" (e.g., [Maddala, 1992]), where genetic information, X, becomes the response while phenotypes, Y, are treated as predictors. Regression coefficients are not invariant to swapping of predictor and response variables. However, partial correlations, as well as test statistics and the corresponding P-values, remain the same after swapping. Thus, effects of adjustments for covariates in a direct model are properly preserved when testing for association in a reverse model. With multiple correlated predictors at an arbitrary variant's position tj, the test statistic for the regression coefficient βi can be re-expressed based on the partial correlation between Y and Xi, which is not affected by swapping of variables, and the test statistic (and therefore the P-value) is also invariant under the reversal in a functional model. One limitation of this approach is that for the direct and reverse tests to be equivalent, Xi cannot enter any interaction terms with other variables.
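The invariance of the test statistic under swapping can be checked numerically at a single variant with ordinary least squares. The sketch below uses simulated data (a hypothetical genotype x, phenotype y, and covariate z) and compares the t statistic for x in the direct model y ~ x + z with the t statistic for y in the reverse model x ~ y + z. It is a scalar illustration only and does not involve the functional smoothing step.

```python
import numpy as np

def t_stat(response, predictor, covariate):
    """OLS t statistic for `predictor` in response ~ 1 + predictor + covariate."""
    n = response.size
    D = np.column_stack([np.ones(n), predictor, covariate])
    coef, *_ = np.linalg.lstsq(D, response, rcond=None)
    resid = response - D @ coef
    sigma2 = resid @ resid / (n - D.shape[1])
    cov = sigma2 * np.linalg.inv(D.T @ D)
    return coef[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(1)
n = 300
z = rng.normal(size=n)                           # covariate
x = rng.binomial(2, 0.2, size=n).astype(float)   # stand-in for a genotype
y = 0.3 * x + 0.5 * z + rng.normal(size=n)       # phenotype

t_direct = t_stat(y, x, z)    # direct model:  Y ~ X + Z
t_reverse = t_stat(x, y, z)   # reverse model: X ~ Y + Z
```

The two statistics agree up to floating-point error because both equal r·sqrt((n − 3)/(1 − r²)), where r is the partial correlation between x and y given z, even though the fitted regression coefficients themselves differ.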

The generalized FANOVA is an extension of the previously proposed FANOVA approach and thus inherits some of its features. For example, generalized FANOVA fully utilizes variants’ position information and linkage-disequilibrium structure when computing the test statistic F. However, unlike the previously proposed FANOVA, our current approach allows inclusion of multiple traits and adjustment for additional covariates. Moreover, our new functional approach provides a unique way of graphically depicting phenotypic effects and interactions by representing them as continuous curves varying over a genetic region. We also hypothesize that the functional approach may hold increased robustness to genotyping errors. This may be due to the fact that the estimated genotype functions, Ĝ(t), are used for the analysis in place of allele frequencies of the marked locus. It is noted that genotyping errors can have severe consequences for the analysis of low frequency alleles (e.g., [Abecasis et al., 2001]). Although genotype functions are estimated via allele counts, they incorporate a certain degree of smoothing, therefore the fitted functions are expected to be less prone to genotyping errors.

In terms of the application of the generalized FANOVA methodology, practitioners can use standard mixed-effects software to estimate continuous regression coefficients as illustrated in the Methods section of this article. Previous research in penalized regression models [Scheipl and Greven, 2012] suggests that a penalty with a small null space should be preferred (a typical choice for the number of “fixed effects” is 2) and a ‘rule of thumb’ for the number of “random effects” is κ = 35. However, the specific number of kernel functions is unimportant as long as the fitted genotype functions are not too smooth.
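The link between penalized least squares and best linear unbiased prediction that makes standard mixed-model software usable can be verified on a toy example. The sketch below is an assumption-laden illustration: it uses a truncated-line spline basis rather than the reproducing-kernel basis of the paper, takes 2 unpenalized "fixed effects" and κ = 35 penalized "random effects" as in the rule of thumb above, and treats the variance components as known instead of estimating them. All function names are hypothetical.

```python
import numpy as np

def spline_design(t, n_knots=35):
    """Fixed part [1, t] and random part of truncated lines (t - knot)_+."""
    knots = np.linspace(t.min(), t.max(), n_knots + 2)[1:-1]
    X = np.column_stack([np.ones_like(t), t])
    Z = np.maximum(t[:, None] - knots[None, :], 0.0)
    return X, Z

def penalized_fit(y, X, Z, lam):
    """Ridge-type penalized least squares: only the Z coefficients are penalized."""
    C = np.hstack([X, Z])
    D = np.diag([0.0] * X.shape[1] + [1.0] * Z.shape[1])
    theta = np.linalg.solve(C.T @ C + lam * D, C.T @ y)
    return C @ theta

def blup_fit(y, X, Z, sigma2_u, sigma2_e):
    """Mixed-model fit: GLS estimate of fixed effects plus BLUP of random effects."""
    V = sigma2_u * Z @ Z.T + sigma2_e * np.eye(y.size)
    V_inv = np.linalg.inv(V)
    beta = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)
    u = sigma2_u * Z.T @ V_inv @ (y - X @ beta)
    return X @ beta + Z @ u
```

With the smoothing parameter set to lam = sigma2_e / sigma2_u, the two fitted curves coincide, which is why mixed-model software that estimates the variance components also selects the amount of smoothing.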

Acknowledgments

Funding

This work was supported by a National Institute of Drug Abuse T32 research training program award (NIDA; T32DA021129) for OAV's postdoctoral fellowship, DVZ's Intramural Research Program of the National Institute of Environmental Health Sciences (NIEHS), DAB's research award (NIDA; R01DA016558), and QL's Mentored Research Scientist Development Award (NIDA; K01DA033346).

Footnotes

Conflict of Interests

We have no conflicts of interest to declare.

References

  1. Abecasis GR, Cherny SS, Cardon LR. The impact of genotyping error on family-based analysis of quantitative traits. Eur J Hum Genet. 2001;9(2):130–134. doi: 10.1038/sj.ejhg.5200594.
  2. Aronszajn N. Theory of reproducing kernels. Trans Amer Math Soc. 1950:337–404.
  3. Bouchard TJ, Jr, Loehlin JC. Genes, evolution, and personality. Behav Genet. 2001;31(3):243–273. doi: 10.1023/a:1012294324713.
  4. Brumback BA, Ruppert D, Wand MP. Comment. J Am Stat Assoc. 1999;94(447):794–797.
  5. Carvajal-Carmona LG. Challenges in the identification and use of rare disease-associated predisposition variants. Curr Opin Genet Dev. 2010;20(3):277–281. doi: 10.1016/j.gde.2010.05.005.
  6. Crainiceanu C, Ruppert D, Wand MP. Bayesian analysis for penalized spline regression using winbugs. Journal of Statistical Software. 2005;14(14):1–24.
  7. Crainiceanu CM, Goldsmith AJ. Bayesian functional data analysis using winbugs. Journal of Statistical Software. 2010;32(11).
  8. De Wit H. Impulsivity as a determinant and consequence of drug use: a review of underlying processes. Addict Biol. 2009;14(1):22–31. doi: 10.1111/j.1369-1600.2008.00129.x.
  9. Durbin RM, Altshuler DL, Abecasis GAR, Bentley DR, Chakravarti A, Clark AG, Collins FS, et al. A map of human genome variation from population-scale sequencing. Nature. 2010;467(7319):1061–1073. doi: 10.1038/nature09534.
  10. Eilers PH, Marx BD. Flexible smoothing with b-splines and penalties. Stat Sci. 1996:89–102.
  11. Fan R, Wang Y, Mills JL, Carter TC, Lobach I, Wilson AF, Bailey-Wilson JE, Weeks DE, Xiong M. Generalized functional linear models for gene-based case-control association studies. Genet Epidemiol. 2014;38(7):622–637. doi: 10.1002/gepi.21840.
  12. Fan R, Wang Y, Mills JL, Wilson AF, Bailey-Wilson JE, Xiong M. Functional linear models for association analysis of quantitative traits. Genet Epidemiol. 2013;37(7):726–742. doi: 10.1002/gepi.21757.
  13. Fasshauer GE. Approximation Theory XIII. Springer; San Antonio: 2012. Greens functions: Taking another look at kernel approximation, radial basis functions, and splines. pp. 37–63. 2010.
  14. Fasshauer GE, Ye Q. Reproducing kernels of sobolev spaces via a green kernel approach with differential operators and boundary operators. Adv Comput Math. 2013;38(4):891–921.
  15. Feng Z. A generalized quasi-likelihood scoring approach for simultaneously testing the genetic association of multiple traits. J Roy Stat Soc C-App. 2014;63(3):483–498.
  16. Goldsmith J, Bobb J, Crainiceanu CM, Caffo B, Reich D. Penalized functional regression. J Comput Graph Stat. 2011;20(4). doi: 10.1198/jcgs.2010.10007.
  17. Green PJ, Silverman BW. Nonparametric regression and generalized linear models: a roughness penalty approach. CRC Press; 1993.
  18. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. Springer; 2009.
  19. Ivanescu AE, Staicu A-M, Scheipl F, Greven S. Penalized function-on-function regression. Computation Stat. 2014:1–30.
  20. Kathiresan S, Willer CJ, Peloso GM, Demissie S, Musunuru K, Schadt EE, Kaplan L, Bennett D, Li Y, Tanaka T, et al. Common variants at 30 loci contribute to polygenic dyslipidemia. Nat Genet. 2009;41(1):56–65. doi: 10.1038/ng.291.
  21. Kimeldorf G, Wahba G. Some results on tchebycheffian spline functions. J Mathe Anal Appl. 1971;33(1):82–95.
  22. Kwan JS, Kung AW, Sham PC. A simple bias correction in linear regression for quantitative trait association under two-tail extreme selection. Behavior genetics. 2011;41(5):776–779. doi: 10.1007/s10519-011-9475-0.
  23. Lee D-Y, Hanis C, Bell G, Aguilar D, Redline S, Below J, Xiong M. Genetic studies of physiological traits with their application to sleep apnea. arXiv preprint arXiv:1410.7363. 2014.
  24. Lian H. Nonlinear functional models for functional responses in reproducing kernel hilbert spaces. Can J Stat. 2007;35(4):597–606.
  25. Liu DJ, Leal SM. A novel adaptive method for the analysis of next-generation sequencing data to detect complex trait associations with rare variants due to gene main effects and interactions. PLoS Genetics. 2010;6(10):e1001156. doi: 10.1371/journal.pgen.1001156.
  26. Luo L, Boerwinkle E, Xiong M. Association studies for next-generation sequencing. Genome Res. 2011;21(7):1099–1108. doi: 10.1101/gr.115998.110.
  27. Luo L, Zhu Y, Xiong M. Quantitative trait locus analysis for next-generation sequencing with the functional linear models. J Med Genet. 2012a;49(8):513–524. doi: 10.1136/jmedgenet-2012-100798.
  28. Luo L, Zhu Y, Xiong M. Smoothed functional principal component analysis for testing association of the entire allelic spectrum of genetic variation. Eur J Hum Genet. 2012b;21(2):217–224. doi: 10.1038/ejhg.2012.141.
  29. MacKinnon DP, Krull JL, Lockwood CM. Equivalence of the mediation, confounding and suppression effect. Prevention Science. 2000;1(4):173–181. doi: 10.1023/a:1026595011371.
  30. Maddala GS. Introduction to econometrics. Vol. 2. Macmillan; New York: 1992.
  31. Muendlein A, Saely CH, Leiherer A, Fraunberger P, Kinz E, Rein P, Vonbank A, Zanolin D, Malin C, Drexel H. Angiopoietin-like protein 4 significantly predicts future cardiovascular events in coronary patients. Atherosclerosis. 2014;237(2):632–638. doi: 10.1016/j.atherosclerosis.2014.10.028.
  32. Nosedal-Sanchez A, Storlie CB, Lee TC, Christensen R. Reproducing kernel hilbert spaces for penalized regression: A tutorial. Am Stat. 2012;66(1):50–60.
  33. Pearce ND, Wand MP. Penalized splines and reproducing kernel methods. Am Stat. 2006;60(3).
  34. Poggio T, Girosi F. Networks for approximation and learning. Proceedings of the IEEE. 1990;78(9):1481–1497.
  35. Ramsay J, Silverman B. Functional Data Analysis. second edition. Springer; 2005.
  36. Rasmussen CE, Williams CKI. Gaussian processes for machine learning. Citeseer. 2006.
  37. Reimherr M, Nicolae D, et al. A functional data analysis approach for genetic association studies. Ann Appl Stat. 2014;8(1):406–429.
  38. Romeo S, Pennacchio LA, Fu Y, Boerwinkle E, Tybjaerg-Hansen A, Hobbs HH, Cohen JC. Population-based resequencing of ANGPTL4 uncovers variations that reduce triglycerides and increase hdl. Nat Genet. 2007;39(4):513–516. doi: 10.1038/ng1984.
  39. Ruppert D. Selecting the number of knots for penalized splines. J Comput Graph Stat. 2002;11(4).
  40. Ruppert D, Wand MP, Carroll RJ. Semiparametric regression. Cambridge University Press; 2003.
  41. Saitoh S. Theory of reproducing kernels and its applications. Longman Scientific & Technical Harlow. 1988.
  42. Scheipl F, Greven S. Technical Report 125. University of Munich - Department of Statistics; 2012. Identifiability in penalized function-on-function regression models.
  43. Shen Q, Faraway J. An f test for linear models with functional responses. Statistica Sinica. 2004;14(4):1239–1258.
  44. Smola AJ, Schölkopf B. Learning with kernels. Citeseer. 1998.
  45. Speed T. [that blup is a good thing: The estimation of random effects]: Comment. Statistical Science. 1991:42–44.
  46. Storey JD, Xiao W, Leek JT, Tompkins RG, Davis RW. Significance analysis of time course microarray experiments. Proc Natl Acad Sci U S A. 2005;102(36):12837–12842. doi: 10.1073/pnas.0504609102.
  47. Svishcheva GR, Belonogova NM, Axenovich TI. Region-based association test for familial data under functional linear models. PloS One. 2015;10(6):e0128999. doi: 10.1371/journal.pone.0128999.
  48. Talmud PJ, Smart M, Presswood E, Cooper JA, Nicaud V, Drenos F, Palmen J, Marmot MG, Boekholdt SM, Wareham NJ, et al. Angptl4 e40k and t266m effects on plasma triglyceride and hdl levels, postprandial responses, and chd risk. Arterioscler Thromb Vasc Biol. 2008;28(12):2319–2325. doi: 10.1161/ATVBAHA.108.176917.
  49. Vsevolozhskaya OA, Greenwood MC, Powell SL, Zaykin DV. Resampling-based multiple comparison procedure with application to point-wise testing with functional data. Environmental and Ecological Statistics. 2015;22(1):45–59. doi: 10.1007/s10651-014-0282-7.
  50. Vsevolozhskaya OA, Zaykin DV, Greenwood MC, Wei C, Lu Q. Functional analysis of variance for association studies. PloS One. 2014;9(9):e105074. doi: 10.1371/journal.pone.0105074.
  51. Wahba G. Spline models for observational data. Siam. 1990.
  52. Wand M, Ormerod J. On semiparametric regression with o'sullivan penalized splines. Australian & New Zealand Journal of Statistics. 2008;50(2):179–198.
  53. Wang Y. Smoothing spline models with correlated random errors. J Am Stat Assoc. 1998;93(441):341–348.
  54. Wang Y, Liu A, Mills JL, Boehnke M, Wilson AF, Bailey-Wilson JE, Xiong M, Wu CO, Fan R. Pleiotropy analysis of quantitative traits at gene level by multivariate functional linear models. Genet Epidemiol. 2015. doi: 10.1002/gepi.21895.
  55. Wood S. Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC Texts in Statistical Science. 2006.
  56. Wu CO, Yu KF. Nonparametric varying-coefficient models for the analysis of longitudinal data. Int Stat Rev. 2002;70(3):373–393.
  57. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89(1):82–93. doi: 10.1016/j.ajhg.2011.05.029.
  58. Yoshida K, Shimizugawa T, Ono M, Furukawa H. Angiopoietin-like protein 4 is a potent hyperlipidemia-inducing factor in mice and inhibitor of lipoprotein lipase. J Lipid Res. 2002;43(11):1770–1772. doi: 10.1194/jlr.c200010-jlr200.
  59. Zhang J-T. Analysis of variance for functional data. CRC Press; 2013.
  60. Zhu Y, Xiong M. Family-based association studies for next-generation sequencing. Am J Hum Genet. 2012;90(6):1028–1045. doi: 10.1016/j.ajhg.2012.04.022.
