Dependent generalized functional linear models

S Jadhav; H L Koul; Q Lu

Biometrika. 2017 Sep 2;104(4):987–994. doi: 10.1093/biomet/asx044

PMCID: PMC5771479 NIHMSID: NIHMS929781 PMID: 29353911

Summary

This paper considers testing for no effect of functional covariates on response variables in multivariate regression. We use generalized estimating equations to determine the underlying parameters and establish their joint asymptotic normality. This is then used to test the significance of the effect of predictors on the vector of response variables. Simulations demonstrate the importance of considering existing correlation structures in the data. To explore the effect of treating genetic data as a function, we perform a simulation study using gene sequencing data and find that the performance of our test is comparable to that of another popular method used in sequencing studies. We present simulations to explore the behaviour of our test under varying sample size, cluster size and dimension of the parameter to be estimated, and an application where we are able to confirm known associations between nicotine dependence and neuronal nicotinic acetylcholine receptor subunit genes.

Keywords: Cluster data, Family sequencing data, Functional data analysis, Generalized estimating equation

1. Introduction

This paper presents a large-sample test for assessing whether a functional covariate has a regression effect on real-valued and possibly dependent responses. Dependent data arise in family-based genetic studies, whose aim is often to test for a relation between a gene region and the phenotype of interest. Sequencing data for a gene or gene region consist of observations on a large number of single nucleotide variants. In light of linkage disequilibrium (Laird & Lange, 2010), dependence among variants may decline with their separation, so we treat sequencing data as a function of the single nucleotide variant positions and use a functional data-based approach.

There is abundant literature on functional linear models (Cardot et al., 1999, 2003; Cardot & Sarda, 2005; Ramsay, 2006; Cardot & Johannes, 2010). Recent reviews can be found in Morris (2015) and Wang et al. (2016). These models assume that the response is a continuous scalar and the predictor variable is a function. However, in genetic studies the phenotype or response is often binary, and few papers discuss generalized functional linear models (Müller & Stadtmüller, 2005; Li et al., 2010; Gertheiss et al., 2013) as needed in this case. Existing methodology cannot be directly applied to genetic family data, where the response variable is a vector of dependent traits and the predictor is a vector of functions. We address this shortcoming using generalized estimating equations. The estimators thus obtained are shown to be consistent and asymptotically normal, under suitable conditions. The latter result is used to propose an asymptotic test for the regression relation between the functional covariates and the responses.

2. Model

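Let $n$ and $m$ be positive integers. We observe $n$ clusters $(X_i, Y_i)$, each of size $m$, where, for each $i = 1, \ldots, n$, $X_i = (X_{i1}, \ldots, X_{im})^{\mathrm T}$ is an $m$-dimensional predicting process and $Y_i = (Y_{i1}, \ldots, Y_{im})^{\mathrm T}$ is a vector of responses. We assume that the clusters are independent and identically distributed with the same correlation structure. For the $j$th subject in the $i$th cluster, the predicting process $X_{ij}(t)$ $(t \in \mathcal{T})$ is assumed to be a square-integrable random process on $\mathcal{T}$. The corresponding response $Y_{ij}$ is a continuous or discrete scalar which is related to $X_{ij}$ via the following generalized regression model, in which $dw$ is a positive measure on $\mathcal{T}$. All integrals in this paper are taken over the interval $\mathcal{T}$, unless specified otherwise. For a constant $\beta_0$ and a real-valued function $\beta \in L^2(dw)$, let

$$\eta_{ij} = \beta_0 + \int \beta(t) X_{ij}(t)\, dw(t), \qquad \mu_{ij} = g(\eta_{ij}).$$

We model the regression of $Y_{ij}$ on $X_{ij}$ as

$$Y_{ij} = g(\eta_{ij}) + e_{ij}, \qquad \mathrm{var}(e_{ij} \mid X_{ij}) = \sigma^2(\mu_{ij}), \qquad \mathrm{corr}(Y_i \mid X_i) = R_0, \tag{1}$$

for a known real-valued link function $g$ and a positive function $\sigma^2(\cdot)$, where the $e_{ij}$ have zero mean and $R_0$ is the true correlation matrix.

We now give another representation of the model (1). Let $\{\rho_k : k = 1, 2, \ldots\}$ be an orthonormal basis of the functional space $L^2(dw)$. The predictor process $X_{ij}$ and parameter function $\beta$ can be written as

$$X_{ij}(t) = \sum_{k=1}^{\infty} \varepsilon_{ijk}\, \rho_k(t), \qquad \beta(t) = \sum_{k=1}^{\infty} \beta_k\, \rho_k(t),$$

with random variables $\varepsilon_{ijk} = \int X_{ij}(t) \rho_k(t)\, dw(t)$ and coefficients $\beta_k = \int \beta(t) \rho_k(t)\, dw(t)$. The random variables $\varepsilon_{ijk}$ and $\varepsilon_{ijl}$ are uncorrelated for $k \neq l$. By the orthonormality of these bases,

$$\int \beta(t) X_{ij}(t)\, dw(t) = \sum_{k=1}^{\infty} \beta_k\, \varepsilon_{ijk}.$$

Using this representation, we now address the issue of an infinite number of predicting variables. Based on the truncation strategy proposed by Müller & Stadtmüller (2005), we replace model (1) with the following approximate sequence of finite-dimensional models. Let $p_n$ be a sequence of positive integers tending to infinity and define our new approximate model by

$$Y_{ij} = g\Big(\beta_0 + \sum_{k=1}^{p_n} \beta_k\, \varepsilon_{ijk}\Big) + e_{ij}. \tag{2}$$

Let $\beta^{p_n} = (\beta_0, \beta_1, \ldots, \beta_{p_n})^{\mathrm T}$. The superscript indicates the number of parameters. We exhibit this superscript and subscript when necessary. We assume that the standardized error $e_{ij}/\sigma(\mu_{ij})$ is independent of $X_{ij}$ for all $i$ and $j$.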
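The basis representation behind the truncation can be checked numerically. The following is a minimal sketch rather than the authors' code: it assumes the interval [0, 1] with Lebesgue measure, a Fourier orthonormal basis, and hypothetical coefficient values chosen only for illustration. It verifies that the integral of the product of the coefficient function and the predictor equals the sum of products of their basis coefficients.

```python
import math


def fourier_basis(k, t):
    """k-th function (k = 1, 2, ...) of an orthonormal Fourier basis on [0, 1]."""
    if k == 1:
        return 1.0
    freq = k // 2
    if k % 2 == 0:
        return math.sqrt(2.0) * math.sin(2.0 * math.pi * freq * t)
    return math.sqrt(2.0) * math.cos(2.0 * math.pi * freq * t)


def integrate(f, n_grid=2000):
    """Midpoint rule on [0, 1]."""
    h = 1.0 / n_grid
    return h * sum(f((i + 0.5) * h) for i in range(n_grid))


def scores(func, p):
    """First p basis coefficients: integral of func(t) * rho_k(t) over [0, 1]."""
    return [
        integrate(lambda t, k=k: func(t) * fourier_basis(k, t))
        for k in range(1, p + 1)
    ]


def from_coefficients(coefs):
    """Build the function sum_k coefs[k] * rho_k(t)."""
    return lambda t: sum(c * fourier_basis(k, t) for k, c in enumerate(coefs, start=1))


# Hypothetical coefficients (assumptions, not values from the paper).
x_coefs = [0.5, -1.0, 0.3, 0.8, -0.2]
beta_coefs = [1.0, 0.5, 0.0, -0.25, 0.1]
x_func = from_coefficients(x_coefs)
beta_func = from_coefficients(beta_coefs)

p = 5  # truncation level
eps = scores(x_func, p)        # recovered predictor scores
beta_k = scores(beta_func, p)  # recovered coefficient-function scores

integral = integrate(lambda t: beta_func(t) * x_func(t))
truncated_sum = sum(b * e for b, e in zip(beta_k, eps))
print(integral, truncated_sum)
```

Here both functions lie in the span of the first five basis functions, so the truncated sum reproduces the integral up to rounding. For a general predictor the truncated sum is only an approximation, and its quality depends on how quickly the coefficients decay.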

In what follows, all limits are taken as $n \to \infty$ and $p_n \to \infty$. For any vector or finite-dimensional matrix $A$, $\|A\|$ will denote the Frobenius norm of $A$. For a matrix $A$ we denote its maximum and minimum eigenvalues by $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$, respectively.

We use generalized estimating equations to estimate Inline graphic . In most applications, we do not know the true correlation matrix , so we use a working correlation matrix that depends on a parameter . We let denote an estimated working correlation matrix. The that we use below was suggested by Balan & Schiopu-Kratina (2005). In the following, denotes the derivative of the function Inline graphic . Let

The estimator denoted by Inline graphic is the solution to the equation

(3)

Let Inline graphic , , , and , and let be a preliminary -consistent estimate of . Details on obtaining this estimator can be found in Wang (2011).
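As a rough illustration of how a generalized estimating equation is solved in the simplest case, the sketch below assumes an identity link, constant variance, and an exchangeable working correlation whose parameter `rho` is fixed and known. Under those assumptions the estimating equation, the sum over clusters of E_i^T R^{-1} (Y_i - E_i b) set to zero, is linear in b. This is a sketch under stated assumptions and not the authors' equation (3); the function names and toy data are hypothetical.

```python
def exchangeable_inverse(m, rho):
    """Closed-form inverse of the m x m correlation matrix (1 - rho) I + rho J."""
    a = 1.0 / (1.0 - rho)
    b = -rho / ((1.0 - rho) * (1.0 + (m - 1) * rho))
    return [[a + b if r == c else b for c in range(m)] for r in range(m)]


def solve_linear(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def gee_identity_link(clusters, rho):
    """Solve sum_i E_i^T R^{-1} (Y_i - E_i b) = 0 for b.

    clusters: list of (design, y) pairs, where design is an m x p list of
    rows (intercept and basis scores) and y is a list of m responses.
    """
    p = len(clusters[0][0][0])
    lhs = [[0.0] * p for _ in range(p)]
    rhs = [0.0] * p
    for design, y in clusters:
        m = len(y)
        r_inv = exchangeable_inverse(m, rho)
        for r in range(m):
            for s in range(m):
                w = r_inv[r][s]
                for a in range(p):
                    rhs[a] += design[r][a] * w * y[s]
                    for b in range(p):
                        lhs[a][b] += design[r][a] * w * design[s][b]
    return solve_linear(lhs, rhs)


# Hypothetical noiseless toy data: three clusters of size three.
true_beta = [1.0, -2.0]
designs = [
    [[1.0, 0.2], [1.0, -0.5], [1.0, 1.1]],
    [[1.0, 0.7], [1.0, 0.0], [1.0, -1.3]],
    [[1.0, -0.4], [1.0, 0.9], [1.0, 0.3]],
]
clusters = [
    (d, [sum(x * b for x, b in zip(row, true_beta)) for row in d]) for d in designs
]
beta_hat = gee_identity_link(clusters, rho=0.5)
print(beta_hat)
```

With a nonlinear link the same equation would have to be solved iteratively, and in practice the working correlation would be estimated from residuals of a preliminary fit rather than supplied.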

3. Asymptotic theory

3.1. Existence and consistency

We first state the assumptions needed for existence of the solution to (3) and its consistency.

Assumption 1.

Let for all , , and .

Assumption 2.

The function $g$ is monotone and invertible and has two continuous bounded derivatives. The function $\sigma^2(\cdot)$ has a continuous bounded derivative and is bounded from below by a positive constant.

Assumption 3.

Let .

Assumption 4.

The true correlation matrix has positive eigenvalues. The estimated working correlation matrix satisfies and is a constant positive-definite matrix.

Assumption 5.

There exist two positive constants and such that .

We can obtain a consistent estimator of the mean function to centre the predictor function. We assume that Inline graphic to obtain consistent estimates of the eigenvalues (Horváth & Kokoszka, 2012). Assumptions 4 and 5 can also be found in Wang (2011). We prove that satisfies Assumption 4 in Proposition S1 in the Supplementary Material.

Theorem 1.

Under Assumptions 1–5, the solution to (3) exists and satisfies .

The proof is along the lines of Wang (2011) and can be found in the Supplementary Material together with proofs of the other results. We approximate Inline graphic by . Thus, we can consistently estimate the parameters even if the correlation structure is misspecified.

3.2. Asymptotic normality

To show the asymptotic normality of the estimator, we approximate Inline graphic by . We write where

Next, we state the assumptions needed for establishing the asymptotic normality of Inline graphic .

Assumption 6.

Let .

Assumption 7.

The matrices and are nonsingular for all .

Assumption 8.

The eigenvalues of are bounded and

Let Inline graphic and .

Theorem 2.

Under Assumptions 1–8, the following convergences in distribution hold as :

(4)

(5)

(6)

This theorem can easily be extended to include additional finite-dimensional covariates and finitely many functional predictors. It can also be used to construct confidence bands; see Müller & Stadtmüller (2005, Corollary 4.3).

We are now ready to describe an F test for a regression relation between a scalar response and a functional predicting variable. Referring to the model (1), testing for no association between the response and the predicting process is equivalent to testing $\beta \equiv 0$. Since we use the sequence of approximate models proposed in (2), we test the hypothesis $H_0: \beta_1 = \cdots = \beta_{p_n} = 0$ versus the alternative that $H_0$ is not true for a given $p_n$. The proposed test rejects $H_0$ for large absolute values of a standardized test statistic.

From Theorem 2 it follows that the test that rejects $H_0$ whenever the absolute value of this statistic exceeds the corresponding percentile of the standard normal distribution is of asymptotic size $\alpha$.
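The decision rule can be sketched as follows. This is an illustrative helper, not the authors' implementation: it takes an already computed, asymptotically standard normal statistic as input, and the function name and example value are hypothetical.

```python
from statistics import NormalDist


def two_sided_normal_test(statistic, level=0.05):
    """Reject when |statistic| exceeds the upper level/2 standard normal quantile.

    Returns (reject, p_value) for a two-sided test of asymptotic size `level`.
    """
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1.0 - level / 2.0)
    p_value = 2.0 * (1.0 - std_normal.cdf(abs(statistic)))
    return abs(statistic) > critical, p_value


# Hypothetical value of the standardized statistic.
reject, p_value = two_sided_normal_test(2.3, level=0.05)
print(reject, round(p_value, 4))
```

The normal reference distribution is an asymptotic approximation, so this rule is only appropriate when the number of parameters is large enough for that approximation to hold.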

4. Simulations

For our first simulation we generated pseudo-random regression functions using Fourier basis functions Inline graphic and the model , with the independent and identically distributed as . The effect function , where and .

For cluster size Inline graphic and a continuous response we used the following model to generate the responses: , , where and is a correlation matrix with all off-diagonal elements equal to , for . For a binary response, we generated the correlated responses using the function from the bindata package in R (R Development Core Team, 2017) with marginal probabilities given by Inline graphic , where is the logistic link function. The correlation matrix is the same as in the previous case.
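One simple way to obtain errors whose correlation matrix has all off-diagonal elements equal to a common value is a shared random effect. The sketch below is an assumption-laden illustration and not the simulation code used in the paper: it assumes standard normal errors, a nonnegative correlation `rho`, and a hypothetical cluster size of three.

```python
import math
import random


def exchangeable_normal(m, rho, rng):
    """Draw m standard normal errors with common pairwise correlation rho >= 0.

    Each error is sqrt(rho) * shared + sqrt(1 - rho) * own, so every error has
    unit variance and any two errors have covariance rho.
    """
    shared = rng.gauss(0.0, 1.0)
    return [
        math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
        for _ in range(m)
    ]


def sample_correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(2017)  # fixed seed for reproducibility
rho = 0.5
draws = [exchangeable_normal(3, rho, rng) for _ in range(20000)]
estimated_rho = sample_correlation([d[0] for d in draws], [d[1] for d in draws])
print(round(estimated_rho, 3))
```

Continuous responses would then be formed by adding such errors to the mean from the regression model. Correlated binary responses need a different generator, such as the bindata function named in the text.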

Table 1 shows that as the correlation increases, so does the empirical power of the F test, whereas the power of the test which treats responses as independent stays roughly constant. Therefore we need to consider the correlation structure. In this study we chose , with replications and significance level . The value of is fixed at for continuous traits and at for binary traits. The number of parameters $p_n$ was determined using five-fold crossvalidation. The empirical level for this study can be found in the Supplementary Material.

Table 1.

Empirical power (%) of the F and Ind tests as a function of the correlation

Corr	Continuous trait, F test	Continuous trait, Ind test	Binary trait, F test	Binary trait, Ind test
0	42·1	41·8	13·7	12·7
0·05	45·6	45·0	13·0	12·4
0·30	56·7	45·7	16·5	12·4
0·50	69·2	44·3	19·5	12·9
0·80	99·3	45·5	30·8	13·4

Ind test, functional test with independent structure; F test, functional test with correlation structure.

For the second study, we compared the performance of the F test with that of a generalized estimating equation-based sequence kernel association test (gSKAT) proposed by Wang et al. (2013). Although this is based on generalized estimating equations, it is not meant to be applied to functional variables. For a fair comparison we used a region of the genome from chromosome 17, obtained from the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2010), and then simulated response values using these data. As this is a population dataset, all the individuals are assumed to be independent. The focus of this simulation is to investigate the effects of treating sequencing data as a function.

In applications, we do not observe the predictor function but assume that the densely observed data constitute a realization of a function; so a smoothing step is needed to construct predictor functions. Smoothing methods given in Ramsay (2006) typically use the following model to fit a single curve. The given curve Inline graphic is observed at discrete points . Let denote these observed values. We then recover the function from these observed values by fitting the linear model where the are basis functions. We select a large value for and use penalization to ensure that the fitted function is not very rough. We choose Inline graphic to minimize , where denotes the second derivative. We obtain . We used cubic B-spline basis functions and the R function smooth.spline to carry out the smoothing. We specified the knots to be at , so .

We next simulated a discrete and a continuous response using the sequencing data. Let Inline graphic denote the sequencing data matrix of dimension , where and represent the number of clusters, the cluster size, and the number of variants, respectively. For the selected region from the 1000 Genomes Project, and while the family size . Then the responses are generated as with and Inline graphic . For the binary case we proceed as before using the R function rmvbin. The following empirical powers are based on 1000 replicates and a significance level of 0·05. We chose by five-fold crossvalidation. We compare the test with that of Wang et al. (2013). Table 2 shows that the two tests have comparable power but the Type I error for the latter is inflated in the binary case.

Table 2.

Comparison of empirical power (%) of the F test and gSKAT

Effect	Continuous trait, F test	Continuous trait, gSKAT	Binary trait, F test	Binary trait, gSKAT
0·00	4·1	4·8	4·7	6·1
0·05	10·2	10·8	7·2	6·3
1·00	31·4	37·4	11·4	13·3
3·00	99·6	99·4	55·5	60·8
5·00	100	100	95·1	96·2

F test, functional test with the correlation structure; gSKAT, family sequence kernel association test of Wang et al. (2013).

In the Supplementary Material we investigate the effect of sample size and cluster size on the power of the Inline graphic test. The regression function and the response variable are generated as in the first simulation study. The number of basis functions selected to generate the regression functions is . The effect size is for the binary response case and for the continuous response case. The correlation is Inline graphic . We find that the power increases with the sample size for both types of response variables.

To demonstrate the effect of using a large cluster size Inline graphic , we took a sample size of . Results of this simulation are reported in the Supplementary Material. We observe that the Type I error is inflated as the cluster size increases. For a cluster of size , the number of parameters in the correlation matrix is of order . Large will affect the consistency of the correlation estimate ( Inline graphic ) used in the score equation (3), rendering our asymptotic results invalid.

In the Supplementary Material we also explore the effect of increasing the number of parameters Inline graphic in the model. The predictor function and the continuous response are simulated as before. The number of basis functions used to generate the predictor is set to . The sample size is , the effect size is , and the correlation parameter is taken to be . The number of parameters in the model is chosen to be 30, 50 or 90 rather than using crossvalidation. We find that as the number of parameters increases, the Type I error also increases.

To determine the number of parameters $p_n$, we used functional principal component analysis for dimension reduction. We projected the function onto a subspace generated by the first $p_n$ eigenfunctions of the covariance operator of the predictor functions. This operator is estimated from the predictor functions obtained by smoothing the observed data. We first selected those values of $p_n$ for which the proportion of variance explained by the first $p_n$ principal components, $\sum_{k=1}^{p_n} \hat\lambda_k / \sum_{k \geq 1} \hat\lambda_k$ with $\hat\lambda_k$ the estimated eigenvalues, varied from 80% to 99% as potential values for the number of parameters $p_n$. We then used five-fold crossvalidation to select $p_n$.
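The first stage of this selection, shortlisting dimensions by proportion of variance explained, can be sketched as below. The eigenvalues are hypothetical and the function name is an assumption. The crossvalidation step that picks one value from the shortlist is not shown.

```python
def candidate_dimensions(eigenvalues, lower=0.80, upper=0.99):
    """Return (p, pve) pairs whose cumulative proportion of variance explained
    by the first p components lies between `lower` and `upper`.

    `eigenvalues` must be sorted in decreasing order.
    """
    total = sum(eigenvalues)
    candidates = []
    running = 0.0
    for p, lam in enumerate(eigenvalues, start=1):
        running += lam
        pve = running / total
        if lower <= pve <= upper:
            candidates.append((p, pve))
    return candidates


# Hypothetical estimated eigenvalues of the covariance operator.
eigenvalues = [5.0, 2.5, 1.2, 0.6, 0.4, 0.25, 0.05]
candidates = candidate_dimensions(eigenvalues)
for p, pve in candidates:
    print(p, round(pve, 3))
```

Each shortlisted dimension would then be scored by its out-of-fold prediction error, with folds formed from whole clusters so that correlated observations stay together. That fold construction is a design suggestion and is not stated in the text.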

5. Application

Twin and family-based studies have suggested a substantial genetic contribution to substance dependence. Studies have been successful in identifying genetic variants contributing to nicotine dependence. In this application, we applied the F test to assess the relation between 15 neuronal nicotinic acetylcholine receptor (nAChR) subunit genes and nicotine dependence using sequencing data from the Minnesota Twin Study (Vrieze et al., 2014). Genetic association between nAChR subunit genes and nicotine dependence has already been established. A comprehensive study of these genes (Saccone et al., 2009) found associations between nicotine dependence and loci in the CHRNA5, CHRNA3, CHRNA4, CHRNB4, CHRNB3, CHRNB1, CHRNA6, CHRND, CHRNG and CHRNB4 genes. Our aim is to check whether our test can replicate these associations. The sequencing dataset we used for this analysis contains 662 families and 1445 individuals.

Nicotine dependence is a continuous variable measured using the protocols of the Substance Abuse Module of the Composite International Diagnostic Interview (Hicks et al., 2011). It takes into account the frequency and quantity of nicotine use, including cigarettes, cigars, pipes and tobacco chewing. The covariates of age and sex were also considered. We first fitted a regression model with nicotine dependence as the response and with age and sex as predictors, and then used the residuals in the analysis. We applied our proposed test to the 15 genes individually. As the response and predictors are centred, we omit the intercept from this analysis. The smooth functions obtained by the smoothing method discussed in the previous section serve as predictors in the test. Table 3 reports the number of parameters $p_n$ used for each gene. Because these values are small, the normal approximation does not work, as shown in the Supplementary Material. We instead use a statistic that approximately follows a $\chi^2$ distribution with $p_n$ degrees of freedom, also reported in Table 3, and use this to report the unadjusted $p$-values for the F test. We find that the $p$-values for the F test are smaller than those of gSKAT for many genes, so our method flags several associations that are undetected by gSKAT, which might suggest that our method has better performance. Even after adjustment for multiple testing, we find associations for genes such as CHRNB2, CHRNA6 and CHRND.

Table 3.

The $p$-values for nAChR subunit genes

Gene	F test	gSKAT	$p_n$	Statistic
CHRNA1	0·929	0·621	2	0·147
CHRNA2	0·740	0·595	2	0·600
CHRNA3	0·057	0·611	2	5·713
CHRNA4	0·011	0·125	2	8·900
CHRNA5	0·148	0·415	2	3·810
CHRNA6	4E-05	0·216	6	29·765
CHRNA7	0·625	0·736	2	0·937
CHRNA9	0·400	0·524	2	1·831
CHRNB1	0·017	0·870	2	8·047
CHRNB2	2E-09	0·1351	7	53·838
CHRNB3	0·003	0·429	5	17·524
CHRNB4	0·382	0·561	2	1·922
CHRND	8E-04	0·175	2	14·230
CHRNE	0·207	0·675	2	3·142
CHRNG	0·044	0·270	2	6·219

$p_n$, number of parameters selected; Statistic, test statistic referred to the $\chi^2$ distribution with $p_n$ degrees of freedom; gSKAT, family sequence kernel association test of Wang et al. (2013).
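The F test $p$-values in Table 3 can be recomputed from the tabulated statistic and number of parameters, which gives a quick consistency check. The sketch below uses only the standard library and a closed-form series for the chi-squared upper tail with integer degrees of freedom. The Bonferroni threshold at the end is an assumption made for illustration, since the text does not name the multiple-testing adjustment that was used.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability P(chi-squared with integer df > x)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: two-sided normal tail plus a finite series.
    root = math.sqrt(x)
    tail = math.erfc(root / math.sqrt(2.0))
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-half) / math.sqrt(2.0 * math.pi)
    return tail + 2.0 * density * total


# (gene, number of parameters, statistic), copied from Table 3.
table3 = [
    ("CHRNA1", 2, 0.147), ("CHRNA2", 2, 0.600), ("CHRNA3", 2, 5.713),
    ("CHRNA4", 2, 8.900), ("CHRNA5", 2, 3.810), ("CHRNA6", 6, 29.765),
    ("CHRNA7", 2, 0.937), ("CHRNA9", 2, 1.831), ("CHRNB1", 2, 8.047),
    ("CHRNB2", 7, 53.838), ("CHRNB3", 5, 17.524), ("CHRNB4", 2, 1.922),
    ("CHRND", 2, 14.230), ("CHRNE", 2, 3.142), ("CHRNG", 2, 6.219),
]

p_values = {gene: chi2_sf(stat, df) for gene, df, stat in table3}

# Assumed adjustment: Bonferroni over the 15 genes at overall level 0.05.
threshold = 0.05 / len(table3)
significant = sorted(g for g, p in p_values.items() if p < threshold)

for gene, p in p_values.items():
    print(f"{gene}\t{p:.3g}")
print("significant after Bonferroni:", significant)
```

The recomputed values agree with the tabulated F test column once truncated to the printed precision. Under the assumed Bonferroni threshold the genes that remain significant are CHRNA6, CHRNB2 and CHRND, the same three genes named in the text.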

Acknowledgement

This work was supported by the National Institute on Drug Abuse. We thank Drs Scott Vrieze, Matt McGue and S. Alexandra Burt for helping us access the whole-genome sequencing data from the Minnesota Twin Study. We are grateful to the reviewers for their constructive comments, which have helped to improve the presentation of the paper.

Supplementary material

Supplementary material available at Biometrika online contains results on the empirical levels of the first simulation study, the sample size study, the cluster size study, and the study of increasing dimensions, as well as Q-Q plots to check the normality of the test statistic. It also includes proofs of the theoretical results.

References

Balan R. M. & Schiopu-Kratina I. (2005). Asymptotic results with generalized estimating equations for longitudinal data. Ann. Statist. 33, 522–41.
Cardot H., Ferraty F. & Sarda P. (1999). Functional linear model. Statist. Prob. Lett. 45, 11–22.
Cardot H., Ferraty F. & Sarda P. (2003). Spline estimators for the functional linear model. Statist. Sinica 13, 571–91.
Cardot H. & Johannes J. (2010). Thresholding projection estimators in functional linear models. J. Mult. Anal. 101, 395–408.
Cardot H. & Sarda P. (2005). Estimation in generalized linear models for functional data via penalized likelihood. J. Mult. Anal. 92, 24–41.
Gertheiss J., Maity A. & Staicu A.-M. (2013). Variable selection in generalized functional linear models. Stat 2, 86–101.
Hicks B. M., Schalet B. D., Malone S. M., Iacono W. G. & McGue M. (2011). Psychometric and genetic architecture of substance use disorder and behavioral disinhibition measures for gene association studies. Behav. Genet. 41, 459–75.
Horváth L. & Kokoszka P. (2012). Inference for Functional Data with Applications. New York: Springer.
Laird N. M. & Lange C. (2010). The Fundamentals of Modern Statistical Genetics. New York: Springer.
Li Y., Wang N. & Carroll R. J. (2010). Generalized functional linear models with semiparametric single-index interactions. J. Am. Statist. Assoc. 105, 621–33.
Morris J. S. (2015). Functional regression. Ann. Rev. Statist. Applic. 2, 321–59.
Müller H.-G. & Stadtmüller U. (2005). Generalized functional linear models. Ann. Statist. 33, 774–805.
R Development Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. http://www.R-project.org.
Ramsay J. O. (2006). Functional Data Analysis. New York: Springer.
Saccone N. L., Saccone S. F., Hinrichs A. L., Stitzel J. A., Duan W., Pergadia M. L., Agrawal A., Breslau N., Grucza R. A., Hatsukami D. et al. (2009). Multiple distinct risk loci for nicotine dependence identified by dense coverage of the complete family of nicotinic receptor subunit (CHRN) genes. Am. J. Med. Genet. B 150, 453–66.
The 1000 Genomes Project Consortium (2010). A map of human genome variation from population-scale sequencing. Nature 467, 1061–73.
Vrieze S. I., Malone S. M., Vaidyanathan U., Kwong A., Kang H. M., Zhan X., Flickinger M., Irons D., Jun G., Locke A. E. et al. (2014). In search of rare variants: Preliminary results from whole genome sequencing of 1,325 individuals with psychophysiological endophenotypes. Psychophysiology 51, 1309–20.
Wang J.-L., Chiou J.-M. & Müller H.-G. (2016). Functional data analysis. Ann. Rev. Statist. Applic. 3, 257–95.
Wang L. (2011). GEE analysis of clustered binary data with diverging number of covariates. Ann. Statist. 39, 389–417.
Wang X., Lee S., Zhu X., Redline S. & Lin X. (2013). GEE-based SNP set association test for continuous and discrete traits in family-based association studies. Genet. Epidemiol. 37, 778–86.
