Biometrika. 2017 Sep 2;104(4):987–994. doi: 10.1093/biomet/asx044

Dependent generalized functional linear models

S Jadhav 1, H L Koul 1, Q Lu 2
PMCID: PMC5771479  NIHMSID: NIHMS929781  PMID: 29353911

Summary

This paper considers testing for no effect of functional covariates on response variables in multivariate regression. We use generalized estimating equations to determine the underlying parameters and establish their joint asymptotic normality. This is then used to test the significance of the effect of predictors on the vector of response variables. Simulations demonstrate the importance of considering existing correlation structures in the data. To explore the effect of treating genetic data as a function, we perform a simulation study using gene sequencing data and find that the performance of our test is comparable to that of another popular method used in sequencing studies. We present simulations to explore the behaviour of our test under varying sample size, cluster size and dimension of the parameter to be estimated, and an application where we are able to confirm known associations between nicotine dependence and neuronal nicotinic acetylcholine receptor subunit genes.

Keywords: Cluster data, Family sequencing data, Functional data analysis, Generalized estimating equation

1. Introduction

This paper presents a large-sample test for assessing whether a functional covariate has a regression effect on real-valued and possibly dependent responses. Dependent data arise in family-based genetic studies, whose aim is often to test for a relation between a gene region and the phenotype of interest. Sequencing data for a gene or gene region consist of observations on a large number of single nucleotide variants. In light of linkage disequilibrium (Laird & Lange, 2010), dependence among variants may decline with their separation, so we treat sequencing data as a function of the single nucleotide variant positions and use a functional data-based approach.

There is abundant literature on functional linear models (Cardot et al., 1999, 2003; Cardot & Sarda, 2005; Ramsay, 2006; Cardot & Johannes, 2010). Recent reviews can be found in Morris (2015) and Wang et al. (2016). These models assume that the response is a continuous scalar and the predictor variable is a function. However, in genetic studies the phenotype or response is often binary, and few papers discuss generalized functional linear models (Müller & Stadtmüller, 2005; Li et al., 2010; Gertheiss et al., 2013) as needed in this case. Existing methodology cannot be directly applied to genetic family data, where the response variable is a vector of dependent traits and the predictor is a vector of functions. We address this shortcoming using generalized estimating equations. The estimators thus obtained are shown to be consistent and asymptotically normal, under suitable conditions. The latter result is used to propose an asymptotic test for the regression relation between the functional covariates and the responses.

2. Model

Let Inline graphic and Inline graphic be positive integers. We observe Inline graphic clusters Inline graphic, each of size Inline graphic, where, for each Inline graphic, Inline graphic is an Inline graphic-dimensional predicting process and Inline graphic is a vector of Inline graphic responses. We assume that the clusters are independent and identically distributed with the same correlation structure. For the Inline graphicth subject in the Inline graphicth cluster, the predicting process Inline graphic is assumed to be a square-integrable random process on Inline graphic. The corresponding response Inline graphic is a continuous or discrete scalar which is related to Inline graphic via the following generalized regression model, in which Inline graphic is a positive measure on Inline graphic. All integrals in this paper are taken over the interval Inline graphic, unless specified otherwise. For a constant Inline graphic and a real-valued function Inline graphic, let

graphic file with name Equation1.gif

We model the regression of Inline graphic on Inline graphic as

graphic file with name Equation2.gif (1)

for a known real-valued link function Inline graphic and a positive function Inline graphic, where the Inline graphic have zero mean and Inline graphic is the true Inline graphic correlation matrix.

We now give another representation of the model (1). Let Inline graphicInline graphic be an orthonormal basis of the functional space Inline graphic. The predictor process and parameter function can be written as

graphic file with name Equation3.gif

with random variables Inline graphic= Inline graphic and coefficients Inline graphic. The random variables Inline graphic and Inline graphic are uncorrelated for Inline graphic. By the orthonormality of these bases,

graphic file with name Equation4.gif

Using this representation, we now address the issue of an infinite number of predicting variables. Based on the truncation strategy proposed by Müller & Stadtmüller (2005), we replace model (1) with the following approximate sequence of finite-dimensional models. Let Inline graphic be a sequence of positive integers tending to infinity and define our new approximate model by

graphic file with name Equation5.gif (2)

Let Inline graphic. The superscript Inline graphic indicates the number of parameters. We display this superscript and subscript only when necessary. We assume that the standardized error Inline graphic is independent of Inline graphic for all Inline graphic and Inline graphic.
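As an illustration of the truncation idea, the following sketch checks numerically that, for an orthonormal basis, the integral of the product of the parameter function and the predictor equals the sum of products of their basis coefficients, so that keeping the first p terms gives a finite-dimensional linear predictor. The cosine basis on [0, 1], the coefficient values and all function names are assumptions made for this example only; they are not taken from the paper.

```python
import math

def basis(j, t):
    """Orthonormal cosine basis on [0, 1]: phi_0 = 1, phi_j = sqrt(2) cos(j pi t)."""
    return 1.0 if j == 0 else math.sqrt(2.0) * math.cos(j * math.pi * t)

def expand(coefs, t):
    """Evaluate sum_j coefs[j] * phi_j(t)."""
    return sum(c * basis(j, t) for j, c in enumerate(coefs))

def integrate(f, n=2000):
    """Midpoint rule on [0, 1]."""
    h = 1.0 / n
    return h * sum(f((k + 0.5) * h) for k in range(n))

def truncated_predictor(x_coefs, beta_coefs, p):
    """Linear predictor that keeps only the first p basis coefficients."""
    return sum(b * e for b, e in zip(beta_coefs[:p], x_coefs[:p]))

# Hypothetical coefficients of one predictor curve and of the parameter function.
x_coefs = [0.5, -1.0, 0.25, 0.1]
beta_coefs = [1.0, 0.3, -0.2, 0.05]

# Integral of beta(t) * X(t) versus the sum of coefficient products.
full_integral = integrate(lambda t: expand(x_coefs, t) * expand(beta_coefs, t))
full_sum = truncated_predictor(x_coefs, beta_coefs, 4)

# Truncating at p = 2 drops the contribution of the last two terms.
approx_p2 = truncated_predictor(x_coefs, beta_coefs, 2)
```

In this toy setting the truncation error is simply the sum of the dropped coefficient products, which is why the number of retained terms must grow with the sample size for the approximation to vanish.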

In what follows, all limits are taken as Inline graphic and Inline graphic. For any vector or finite-dimensional matrix Inline graphic, Inline graphic will denote the Frobenius norm of Inline graphic. For a matrix Inline graphic we denote its maximum and minimum eigenvalues by Inline graphic and Inline graphic, respectively.

We use generalized estimating equations to estimate Inline graphic. In most applications, we do not know the true correlation matrix Inline graphic, so we use a working correlation matrix Inline graphic that depends on a parameter Inline graphic. We let Inline graphic denote an estimated working correlation matrix. The Inline graphic that we use below was suggested by Balan & Schiopu-Kratina (2005). In the following, Inline graphic denotes the derivative of the function Inline graphic. Let

graphic file with name Equation6.gif

The estimator denoted by Inline graphic is the solution to the equation

graphic file with name Equation7.gif (3)

Let Inline graphic, Inline graphic, Inline graphic, Inline graphic and Inline graphic, and let Inline graphic be a preliminary Inline graphic-consistent estimate of Inline graphic. Details on obtaining this estimator can be found in Wang (2011).
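To give a feel for how an estimating equation of this kind is solved, here is a deliberately simplified sketch. It assumes an identity link, a constant variance function, clusters of size two, two parameters, and an exchangeable working correlation with a fixed, user-supplied value rho. Under these assumptions the estimating equation is linear in the parameter and has a closed-form solution. The paper's setting is more general (arbitrary link, estimated working correlation, growing dimension), and the function names below are hypothetical.

```python
def inv_exchangeable_2x2(rho):
    """Inverse of the working correlation matrix [[1, rho], [rho, 1]]."""
    d = 1.0 - rho * rho
    return [[1.0 / d, -rho / d], [-rho / d, 1.0 / d]]

def solve_gee_identity(clusters, rho):
    """Solve sum_i X_i' R^{-1} (y_i - X_i beta) = 0 for a 2-vector beta.

    clusters: list of (X, y) with X a 2x2 list (rows are subjects, columns are
    the two truncated scores) and y a list of the two responses.
    """
    r_inv = inv_exchangeable_2x2(rho)
    a = [[0.0, 0.0], [0.0, 0.0]]  # accumulates X' R^{-1} X
    b = [0.0, 0.0]                # accumulates X' R^{-1} y
    for x, y in clusters:
        for s in range(2):          # parameter index
            for j in range(2):      # first subject index
                for k in range(2):  # second subject index
                    w = x[j][s] * r_inv[j][k]
                    b[s] += w * y[k]
                    for t in range(2):
                        a[s][t] += w * x[k][t]
    # Solve the 2x2 linear system a * beta = b by Cramer's rule.
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(a[1][1] * b[0] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

# Noise-free toy data generated from a hypothetical true parameter.
true_beta = [1.0, -0.5]
designs = [[[1.0, 0.0], [0.0, 1.0]],
           [[1.0, 2.0], [3.0, 1.0]],
           [[0.5, -1.0], [2.0, 0.5]]]
clusters = [(x, [row[0] * true_beta[0] + row[1] * true_beta[1] for row in x])
            for x in designs]

beta_working = solve_gee_identity(clusters, rho=0.3)
beta_independent = solve_gee_identity(clusters, rho=0.0)
```

With noise-free data both working correlations recover the same parameter; with noisy data the choice of working correlation affects efficiency rather than consistency, which is the point made later in the asymptotic theory.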

3. Asymptotic theory

3.1. Existence and consistency

We first state the assumptions needed for existence of the solution to (3) and its consistency.

Assumption 1.

Let Inline graphic for all Inline graphic, Inline graphic, and Inline graphic.

Assumption 2.

The function Inline graphic is monotone and invertible and has two continuous bounded derivatives. The function Inline graphic has a continuous bounded derivative and is bounded from below by Inline graphic.

Assumption 3.

Let Inline graphic.

Assumption 4.

The true correlation matrix Inline graphic has positive eigenvalues. The estimated working correlation matrix Inline graphic satisfies Inline graphic and Inline graphic is a constant positive-definite matrix.

Assumption 5.

There exist two positive constants Inline graphic and Inline graphic such that Inline graphic.

We can obtain a consistent estimator of the mean function to centre the predictor function. We assume that Inline graphic to obtain consistent estimates of the eigenvalues (Horváth & Kokoszka, 2012). Assumptions 4 and 5 can also be found in Wang (2011). We prove that Inline graphic satisfies Assumption 4 in Proposition S1 in the Supplementary Material.

Theorem 1.

Under Assumptions 1–5, the solution Inline graphic to (3) exists and satisfies Inline graphic.

The proof is along the lines of Wang (2011) and can be found in the Supplementary Material together with proofs of the other results. We approximate Inline graphic by Inline graphic. Thus, we can consistently estimate the parameters even if the correlation structure is misspecified.

3.2. Asymptotic normality

To show the asymptotic normality of the estimator, we approximate Inline graphic by Inline graphic. We write Inline graphic where

graphic file with name Equation8.gif

Next, we state the assumptions needed for establishing the asymptotic normality of Inline graphic.

Assumption 6.

Let Inline graphic.

Assumption 7.

The matrices Inline graphic and Inline graphic are nonsingular for all Inline graphic.

Assumption 8.

The eigenvalues of Inline graphic are bounded and

[display equation omitted in source]

Let Inline graphic and Inline graphic.

Theorem 2.

Under Assumptions 1–8, the following convergences in distribution hold as Inline graphic:

[display equation omitted in source] (4)
[display equation omitted in source] (5)
[display equation omitted in source] (6)

This theorem can easily be extended to include additional finite-dimensional covariates and finitely many functional predictors. It can also be used to construct confidence bands; see Müller & Stadtmüller (2005, Corollary 4.3).

We are now ready to describe an Inline graphic test for a regression relation between a scalar response and a functional predicting variable. Referring to the model (1), testing for no association between the response and the predicting process is equivalent to testing Inline graphic. Since we use the sequence of approximate models proposed in (2), we test the hypothesis Inline graphic versus the alternative that Inline graphic is not true for a given Inline graphic. The proposed test rejects Inline graphic for large absolute values of

graphic file with name Equation13.gif

From Theorem 2 it follows that the test that rejects Inline graphic whenever Inline graphic is of asymptotic size Inline graphic, where Inline graphic is the Inline graphicth percentile of the standard normal distribution.
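The decision rule can be sketched as follows. The exact form of the test statistic is not recoverable from this copy of the text, so the code assumes a standardized quadratic form, (W - p) / sqrt(2p), which centres and scales a chi-square-type quantity W with p degrees of freedom by its mean p and standard deviation sqrt(2p). It also assumes a two-sided rule based on the upper alpha/2 normal quantile. Both the statistic and the function names are assumptions for illustration.

```python
import math
from statistics import NormalDist

def standardized_statistic(wald, p):
    """Centre and scale a quadratic form with p degrees of freedom."""
    return (wald - p) / math.sqrt(2.0 * p)

def reject_null(t_stat, alpha=0.05):
    """Two-sided rule: reject when |T| exceeds the upper alpha/2 normal quantile."""
    critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return abs(t_stat) > critical

# Hypothetical values of the quadratic form for p = 10 parameters.
t_small = standardized_statistic(wald=12.0, p=10)
t_large = standardized_statistic(wald=30.0, p=10)

decision_small = reject_null(t_small)
decision_large = reject_null(t_large)
```

The normal reference distribution is only justified when the number of parameters is large; for small p a chi-square reference is the natural alternative.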

4. Simulations

For our first simulation we generated pseudo-random regression functions using Fourier basis functions Inline graphic and the model Inline graphic, with the Inline graphic independent and identically distributed as Inline graphic. The effect function Inline graphic, where Inline graphicInline graphic and Inline graphic.

For cluster size Inline graphic and a continuous response we used the following model to generate the responses: Inline graphic, Inline graphic, where Inline graphic and Inline graphic is a Inline graphic correlation matrix with all off-diagonal elements equal to Inline graphic, for Inline graphic. For a binary response, we generated the correlated responses using the function from the bindata package in R (R Development Core Team, 2017) with marginal probabilities given by Inline graphic, where Inline graphic is the logistic link function. The correlation matrix is the same as in the previous case.
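For the continuous case, errors with an exchangeable correlation matrix can be generated in several ways. The sketch below uses a shared cluster effect, which is one standard construction and not necessarily the one used by the authors. The cluster size of 3, the seed and the function names are assumptions; the construction requires a non-negative correlation.

```python
import math
import random

def simulate_cluster_errors(n_clusters, cluster_size, rho, rng):
    """Standard normal errors with exchangeable correlation rho (0 <= rho <= 1).

    Each error is sqrt(rho) * (shared cluster effect) + sqrt(1 - rho) * (own
    effect), so any two members of a cluster have correlation rho.
    """
    root_shared = math.sqrt(rho)
    root_own = math.sqrt(1.0 - rho)
    clusters = []
    for _ in range(n_clusters):
        shared = rng.gauss(0.0, 1.0)
        clusters.append([root_shared * shared + root_own * rng.gauss(0.0, 1.0)
                         for _ in range(cluster_size)])
    return clusters

def sample_correlation(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(2017)
errors = simulate_cluster_errors(n_clusters=20000, cluster_size=3, rho=0.5, rng=rng)

# Empirical correlations between cluster members; both should be close to 0.5.
r01 = sample_correlation([e[0] for e in errors], [e[1] for e in errors])
r12 = sample_correlation([e[1] for e in errors], [e[2] for e in errors])
```

Correlated binary responses need a different mechanism, such as the rmvbin function mentioned in the text, because thresholding correlated normals does not preserve the target correlation.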

Table 1 shows that the empirical power of the Inline graphic test generally increases with the correlation, whereas the power of the test that treats responses as independent stays roughly constant, at about 42–46% for continuous traits and 12–13% for binary traits. Ignoring the correlation structure therefore costs power. In this study we chose Inline graphic, with Inline graphic replications and significance level Inline graphic. The value of Inline graphic is fixed at Inline graphic for continuous traits and at Inline graphic for binary traits. The number of parameters Inline graphic was determined using five-fold crossvalidation. The empirical level for this study can be found in the Supplementary Material.

Table 1.

Empirical power Inline graphic of the F and Ind tests as a function of Inline graphic

Corr Inline graphic    Continuous trait                   Binary trait
                       Inline graphic test    Ind test    Inline graphic test    Ind test
0                      42·1                   41·8        13·7                   12·7
0·05                   45·6                   45·0        13·0                   12·4
0·30                   56·7                   45·7        16·5                   12·4
0·50                   69·2                   44·3        19·5                   12·9
0·80                   99·3                   45·5        30·8                   13·4

Ind test, functional test with independent structure; Inline graphic test, functional test with correlation Inline graphic.

For the second study, we compared the performance of the Inline graphic test with that of a generalized estimating equation-based sequence kernel association test proposed by Wang et al. (2013). Although this is based on generalized estimating equations, it is not meant to be applied to functional variables. For a fair comparison we used a region of the genome from chromosome 17, obtained from the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2010), and then simulated response values using these data. As this is a population dataset, all the individuals are assumed to be independent. The focus of this simulation is to investigate the effects of treating sequencing data as a function.

In applications, we do not observe the predictor function but assume that the densely observed data constitute a realization of a function; so a smoothing step is needed to construct predictor functions. Smoothing methods given in Ramsay (2006) typically use the following model to fit a single curve. The given curve Inline graphic is observed at Inline graphic discrete points Inline graphic. Let Inline graphicInline graphic denote these observed values. We then recover the function Inline graphic from these observed values by fitting the linear model Inline graphicInline graphic where the Inline graphic are basis functions. We select a large value for Inline graphic and use penalization to ensure that the fitted function is not very rough. We choose Inline graphicInline graphic to minimize Inline graphic, where Inline graphic denotes the second derivative. We obtain Inline graphic. We used cubic B-spline basis functions and the R function smooth.spline to carry out the smoothing. We specified the knots to be at Inline graphic, so Inline graphic.
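For reference, the penalized least-squares criterion described above is usually written as follows. The notation is introduced here for illustration and may differ from the symbols used in the paper: y_j are the observed values at points t_j (j = 1, ..., T), B_k are the K basis functions, c is the coefficient vector and lambda is the smoothing parameter.

```latex
\hat{c} \;=\; \arg\min_{c \in \mathbb{R}^{K}}
  \sum_{j=1}^{T} \Bigl\{ y_j - \sum_{k=1}^{K} c_k B_k(t_j) \Bigr\}^{2}
  \;+\; \lambda \int \Bigl\{ \sum_{k=1}^{K} c_k B_k''(t) \Bigr\}^{2} \, dt ,
\qquad
\hat{c} \;=\; \bigl( B^{\top} B + \lambda P \bigr)^{-1} B^{\top} y ,
```

Here B is the T x K matrix with entries B_k(t_j) and P is the K x K penalty matrix with entries given by the integral of B_k''(t) B_l''(t). The closed form follows because the criterion is quadratic in c; larger values of lambda give smoother fitted curves.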

We next simulated a discrete and a continuous response using the sequencing data. Let Inline graphic denote the sequencing data matrix of dimension Inline graphic, where Inline graphic and Inline graphic represent the number of clusters, the cluster size, and the number of variants, respectively. For the selected region from the 1000 Genomes Project, Inline graphic and Inline graphic while the family size Inline graphic. Then the responses are generated as Inline graphic with Inline graphic and Inline graphic. For the binary case we proceed as before using the R function rmvbin. The following empirical powers are based on 1000 replicates and a significance level of 0·05. We chose Inline graphic by five-fold crossvalidation. We compare the Inline graphic test with that of Wang et al. (2013). Table 2 shows that the two tests have comparable power but the Type I error for the latter is inflated in the binary case.

Table 2.

Comparison of empirical power Inline graphic of the Inline graphic test and gSKAT

Effect Inline graphic    Continuous trait                 Binary trait
                         Inline graphic test    gSKAT     Inline graphic test    gSKAT
0·00                     4·1                    4·8       4·7                    6·1
0·05                     10·2                   10·8      7·2                    6·3
1·00                     31·4                   37·4      11·4                   13·3
3·00                     99·6                   99·4      55·5                   60·8
5·00                     100                    100       95·1                   96·2

Inline graphic test, functional test with the correlation structure; gSKAT, family sequence kernel association test of Wang et al. (2013).

In the Supplementary Material we investigate the effect of sample size and cluster size on the power of the Inline graphic test. The regression function and the response variable are generated as in the first simulation study. The number of basis functions Inline graphic selected to generate the regression functions is Inline graphic. The effect size Inline graphic is Inline graphic for the binary response case and Inline graphic for the continuous response case. The correlation is Inline graphic. We find that the power increases with the sample size for both types of response variables.

To demonstrate the effect of using a large cluster size Inline graphic, we took a sample size of Inline graphic. Results of this simulation are reported in the Supplementary Material. We observe that the Type I error is inflated as the cluster size increases. For a cluster of size Inline graphic, the number of parameters in the correlation matrix is of order Inline graphic. Large Inline graphic will affect the consistency of the correlation estimate (Inline graphic) used in the score equation (3), rendering our asymptotic results invalid.

In the Supplementary Material we also explore the effect of increasing the number of parameters Inline graphic in the model. The predictor function and the continuous response are simulated as before. The number of basis functions used to generate the predictor is set to Inline graphic. The sample size is Inline graphic, the effect size Inline graphic is Inline graphic, and the correlation parameter Inline graphic is taken to be Inline graphic. The number of parameters Inline graphic in the model is chosen to be 30, 50 or 90 rather than using crossvalidation. We find that as the number of parameters increases, the Type I error also increases.

To determine the number of parameters Inline graphic, we used functional principal component analysis for dimension reduction. We projected the function onto a subspace generated by the first Inline graphic eigenfunctions of the covariance operator of the predictor functions. This operator is estimated from the predictor functions obtained by smoothing the observed data. We first selected those values of Inline graphic for which the proportion of variance explained by the first Inline graphic principal components, Inline graphic, varied from 80% to 99% as potential values for the number of parameters Inline graphic. We then used five-fold crossvalidation to select Inline graphic.
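The first stage of this selection, shortlisting dimensions by the proportion of variance explained, can be sketched as below. The eigenvalues are made-up numbers assumed to be sorted in decreasing order, and the function name is hypothetical; the crossvalidation stage that picks one value from the shortlist is not shown.

```python
def candidate_dimensions(eigenvalues, lower=0.80, upper=0.99):
    """Return (p, pve) pairs whose cumulative proportion of variance explained
    by the first p components lies in [lower, upper]."""
    total = sum(eigenvalues)
    candidates = []
    running = 0.0
    for p, value in enumerate(eigenvalues, start=1):
        running += value
        pve = running / total
        if lower <= pve <= upper:
            candidates.append((p, pve))
    return candidates

# Hypothetical eigenvalues of an estimated covariance operator.
eigenvalues = [40.0, 30.0, 20.0, 5.0, 3.0, 2.0]
candidates = candidate_dimensions(eigenvalues)
candidate_p = [p for p, _ in candidates]
```

With these eigenvalues the cumulative proportions are 0.40, 0.70, 0.90, 0.95, 0.98 and 1.00, so the shortlist contains the dimensions 3, 4 and 5.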

5. Application

Twin and family-based studies have suggested a substantial genetic contribution to substance dependence, and several studies have identified genetic variants contributing to nicotine dependence. In this application, we applied the Inline graphic test to assess the relation between 15 neuronal nicotinic acetylcholine receptor (nAChR) subunit genes and nicotine dependence using sequencing data from the Minnesota Twin Study (Vrieze et al., 2014). Genetic association between nAChR subunit genes and nicotine dependence has already been established. A comprehensive study of these genes (Saccone et al., 2009) found associations between nicotine dependence and loci in the CHRNA5, CHRNA3, CHRNA4, CHRNB4, CHRNB3, CHRNB1, CHRNA6, CHRND, CHRNG and CHRNB4 genes. Our aim is to check whether our test replicates these associations. The sequencing dataset we used for this analysis contains 662 families and 1445 individuals.

Nicotine dependence is a continuous variable measured using the protocols of the Substance Abuse Module of the Composite International Diagnostic Interview (Hicks et al., 2011). It takes into account the frequency and quantity of nicotine use, including cigarettes, cigars, pipes and chewing tobacco. The covariates age and sex were also considered. We first fitted a regression model with nicotine dependence as the response and with age and sex as predictors, and then used the residuals in the analysis. We applied our proposed test to the 15 genes individually. As the response and predictors are centred, we omit the intercept from this analysis. The smooth functions obtained by the smoothing method discussed in the previous section serve as predictors in the test. Table 3 reports the number of parameters Inline graphic used for each gene. Because these values are small, the normal approximation is unreliable, as shown in the Supplementary Material. We instead use Inline graphic as the statistic, which follows Inline graphic, and use this to report the unadjusted p-values for the Inline graphic test. The p-values for the Inline graphic test are smaller than those of gSKAT for 13 of the 15 genes, so our method flags several associations that gSKAT does not detect, which might suggest that our method has better performance. Even after adjustment for multiple testing, we find associations for genes such as CHRNB2, CHRNA6 and CHRND.

Table 3.

The p-values for 15 nAChR subunit genes

Gene      Inline graphic test    gSKAT     Inline graphic    Inline graphic
CHRNA1    0·929                  0·621     2                 0·147
CHRNA2    0·740                  0·595     2                 0·600
CHRNA3    0·057                  0·611     2                 5·713
CHRNA4    0·011                  0·125     2                 8·900
CHRNA5    0·148                  0·415     2                 3·810
CHRNA6    4E-05                  0·216     6                 29·765
CHRNA7    0·625                  0·736     2                 0·937
CHRNA9    0·400                  0·524     2                 1·831
CHRNB1    0·017                  0·870     2                 8·047
CHRNB2    2E-09                  0·1351    7                 53·838
CHRNB3    0·003                  0·429     5                 17·524
CHRNB4    0·382                  0·561     2                 1·922
CHRND     8E-04                  0·175     2                 14·230
CHRNE     0·207                  0·675     2                 3·142
CHRNG     0·044                  0·270     2                 6·219

Inline graphic, number of parameters selected; Inline graphic, statistic defined as Inline graphic; gSKAT, family sequence kernel association test of Wang et al. (2013).
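As a rough consistency check on Table 3, the sketch below computes upper-tail chi-square probabilities from the last two columns, treating the fourth column as the degrees of freedom and the fifth as the observed statistic. That reading of the columns is an assumption, since the symbols are not recoverable here, and the function name chi2_sf is ours. For the rows tried, the computed values agree with the reported p-values up to rounding or truncation of the last digit.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1.

    Uses the recursion Q(x; v + 2) = Q(x; v) + (x/2)^(v/2) exp(-x/2) / Gamma(v/2 + 1),
    started from Q(x; 1) = erfc(sqrt(x/2)) or Q(x; 2) = exp(-x/2).
    """
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        v, q = 2, math.exp(-half)
    else:
        v, q = 1, math.erfc(math.sqrt(half))
    while v < df:
        q += half ** (v / 2.0) * math.exp(-half) / math.gamma(v / 2.0 + 1.0)
        v += 2
    return q

# (gene, degrees of freedom, statistic) copied from Table 3.
rows = [("CHRNA3", 2, 5.713),
        ("CHRNA6", 6, 29.765),
        ("CHRNB3", 5, 17.524),
        ("CHRNB2", 7, 53.838)]

p_values = {gene: chi2_sf(stat, df) for gene, df, stat in rows}
```

For two degrees of freedom the tail probability reduces to exp(-x/2), which makes the rows with two parameters easy to verify by hand.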

Acknowledgement

This work was supported by the National Institute on Drug Abuse. We thank Drs Scott Vrieze, Matt McGue and S. Alexandra Burt for helping us access the whole-genome sequencing data from the Minnesota Twin Study. We are grateful to the reviewers for their constructive comments, which have helped to improve the presentation of the paper.

Supplementary material

Supplementary material available at Biometrika online contains results on the empirical levels of the first simulation study, the sample size study, the cluster size study, and the study of increasing dimensions, as well as Q-Q plots to check the normality of the test statistic. It also includes proofs of the theoretical results.

References

  1. Balan R. M. & Schiopu-Kratina I. (2005). Asymptotic results with generalized estimating equations for longitudinal data. Ann. Statist. 33, 522–41.
  2. Cardot H., Ferraty F. & Sarda P. (1999). Functional linear model. Statist. Prob. Lett. 45, 11–22.
  3. Cardot H., Ferraty F. & Sarda P. (2003). Spline estimators for the functional linear model. Statist. Sinica 13, 571–91.
  4. Cardot H. & Johannes J. (2010). Thresholding projection estimators in functional linear models. J. Mult. Anal. 101, 395–408.
  5. Cardot H. & Sarda P. (2005). Estimation in generalized linear models for functional data via penalized likelihood. J. Mult. Anal. 92, 24–41.
  6. Gertheiss J., Maity A. & Staicu A.-M. (2013). Variable selection in generalized functional linear models. Stat 2, 86–101.
  7. Hicks B. M., Schalet B. D., Malone S. M., Iacono W. G. & McGue M. (2011). Psychometric and genetic architecture of substance use disorder and behavioral disinhibition measures for gene association studies. Behav. Genet. 41, 459–75.
  8. Horváth L. & Kokoszka P. (2012). Inference for Functional Data with Applications. New York: Springer.
  9. Laird N. M. & Lange C. (2010). The Fundamentals of Modern Statistical Genetics. New York: Springer.
  10. Li Y., Wang N. & Carroll R. J. (2010). Generalized functional linear models with semiparametric single-index interactions. J. Am. Statist. Assoc. 105, 621–33.
  11. Morris J. S. (2015). Functional regression. Ann. Rev. Statist. Applic. 2, 321–59.
  12. Müller H.-G. & Stadtmüller U. (2005). Generalized functional linear models. Ann. Statist. 33, 774–805.
  13. R Development Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. http://www.R-project.org.
  14. Ramsay J. O. (2006). Functional Data Analysis. New York: Springer.
  15. Saccone N. L., Saccone S. F., Hinrichs A. L., Stitzel J. A., Duan W., Pergadia M. L., Agrawal A., Breslau N., Grucza R. A., Hatsukami D. et al. (2009). Multiple distinct risk loci for nicotine dependence identified by dense coverage of the complete family of nicotinic receptor subunit (CHRN) genes. Am. J. Med. Genet. B 150, 453–66.
  16. The 1000 Genomes Project Consortium (2010). A map of human genome variation from population-scale sequencing. Nature 467, 1061–73.
  17. Vrieze S. I., Malone S. M., Vaidyanathan U., Kwong A., Kang H. M., Zhan X., Flickinger M., Irons D., Jun G., Locke A. E. et al. (2014). In search of rare variants: Preliminary results from whole genome sequencing of 1,325 individuals with psychophysiological endophenotypes. Psychophysiology 51, 1309–20.
  18. Wang J.-L., Chiou J.-M. & Müller H.-G. (2016). Functional data analysis. Ann. Rev. Statist. Applic. 3, 257–95.
  19. Wang L. (2011). GEE analysis of clustered binary data with diverging number of covariates. Ann. Statist. 39, 389–417.
  20. Wang X., Lee S., Zhu X., Redline S. & Lin X. (2013). GEE-based SNP set association test for continuous and discrete traits in family-based association studies. Genet. Epidemiol. 37, 778–86.

Articles from Biometrika are provided here courtesy of Oxford University Press
