Summary
This paper considers testing for no effect of functional covariates on response variables in multivariate regression. We use generalized estimating equations to determine the underlying parameters and establish their joint asymptotic normality. This is then used to test the significance of the effect of predictors on the vector of response variables. Simulations demonstrate the importance of considering existing correlation structures in the data. To explore the effect of treating genetic data as a function, we perform a simulation study using gene sequencing data and find that the performance of our test is comparable to that of another popular method used in sequencing studies. We present simulations to explore the behaviour of our test under varying sample size, cluster size and dimension of the parameter to be estimated, and an application where we are able to confirm known associations between nicotine dependence and neuronal nicotinic acetylcholine receptor subunit genes.
Keywords: Cluster data, Family sequencing data, Functional data analysis, Generalized estimating equation
1. Introduction
This paper presents a large-sample test for assessing whether a functional covariate has a regression effect on real-valued and possibly dependent responses. Dependent data arise in family-based genetic studies, whose aim is often to test for a relation between a gene region and the phenotype of interest. Sequencing data for a gene or gene region consist of observations on a large number of single nucleotide variants. In light of linkage disequilibrium (Laird & Lange, 2010), dependence among variants may decline with their separation, so we treat sequencing data as a function of the single nucleotide variant positions and use a functional data-based approach.
There is abundant literature on functional linear models (Cardot et al., 1999, 2003; Cardot & Sarda, 2005; Ramsay, 2006; Cardot & Johannes, 2010). Recent reviews can be found in Morris (2015) and Wang et al. (2016). These models assume that the response is a continuous scalar and the predictor variable is a function. However, in genetic studies the phenotype or response is often binary, and few papers discuss generalized functional linear models (Müller & Stadtmüller, 2005; Li et al., 2010; Gertheiss et al., 2013) as needed in this case. Existing methodology cannot be directly applied to genetic family data, where the response variable is a vector of dependent traits and the predictor is a vector of functions. We address this shortcoming using generalized estimating equations. The estimators thus obtained are shown to be consistent and asymptotically normal, under suitable conditions. The latter result is used to propose an asymptotic test for the regression relation between the functional covariates and the responses.
2. Model
Let
and
be positive integers. We observe
clusters
, each of size
, where, for each
,
is an
-dimensional predicting process and
is a vector of
responses. We assume that the clusters are independent and identically distributed with the same correlation structure. For the
th subject in the
th cluster, the predicting process
is assumed to be a square-integrable random process on
. The corresponding response
is a continuous or discrete scalar which is related to
via the following generalized regression model, in which
is a positive measure on
. All integrals in this paper are taken over the interval
, unless specified otherwise. For a constant
and a real-valued function
, let
![]() |
We model the regression of
on
as
![]() |
(1) |
for a known real-valued link function
and a positive function
, where the
have zero mean and
is the true
correlation matrix.
We now give another representation of the model (1). Let 
be an orthonormal basis of the functional space
. The predictor process and parameter function can be written as
![]() |
with random variables
=
and coefficients
. The random variables
and
are uncorrelated for
. By the orthonormality of these bases,
![]() |
Using this representation, we now address the issue of an infinite number of predicting variables. Based on the truncation strategy proposed by Müller & Stadtmüller (2005), we replace model (1) with the following approximate sequence of finite-dimensional models. Let
be a sequence of positive integers tending to infinity and define our new approximate model by
![]() |
(2) |
Let
. The superscript
indicates the number of parameters. We exhibit this superscript and subscript when necessary. We assume that the standardized error
is independent of
for all
and
.
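The basis representation and truncation described above can be illustrated numerically. The sketch below is a minimal example under stated assumptions: the domain is taken to be [0, 1] with Lebesgue measure, the basis is an orthonormal Fourier basis, and the curves, coefficients and helper names (`trapezoid`, `fourier_basis`) are hypothetical choices made for illustration only.

```python
import numpy as np

def trapezoid(values, grid):
    """Trapezoidal rule along the last axis."""
    widths = np.diff(grid)
    return np.sum(widths * (values[..., 1:] + values[..., :-1]) / 2.0, axis=-1)

def fourier_basis(grid, n_basis):
    """First n_basis orthonormal Fourier functions on [0, 1]."""
    basis = [np.ones_like(grid)]
    k = 1
    while len(basis) < n_basis:
        basis.append(np.sqrt(2.0) * np.sin(2 * np.pi * k * grid))
        if len(basis) < n_basis:
            basis.append(np.sqrt(2.0) * np.cos(2 * np.pi * k * grid))
        k += 1
    return np.vstack(basis)  # shape (n_basis, len(grid))

grid = np.linspace(0.0, 1.0, 2001)
basis = fourier_basis(grid, 5)

# Hypothetical predictor curve and coefficient function (illustrative values).
x_scores_true = np.array([0.5, -1.0, 0.3, 0.8, -0.2])
beta_coefs_true = np.array([1.0, 0.5, 0.0, -0.7, 0.2])
x_curve = x_scores_true @ basis
beta_curve = beta_coefs_true @ basis

# Scores and coefficients are inner products with each basis function.
x_scores = trapezoid(basis * x_curve, grid)
beta_coefs = trapezoid(basis * beta_curve, grid)

# By orthonormality, the integral of beta(t) X(t) equals the sum of products.
integral = trapezoid(beta_curve * x_curve, grid)
full_sum = float(np.sum(beta_coefs * x_scores))

# Truncated linear predictor that keeps only the first p = 3 terms.
truncated_sum = float(np.sum(beta_coefs[:3] * x_scores[:3]))
```

In this toy setting `integral` and `full_sum` coincide up to numerical error, while `truncated_sum` shows the approximation obtained when only the leading terms are retained.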
In what follows, all limits are taken as
and
. For any vector or finite-dimensional matrix
,
will denote the Frobenius norm of
. For a matrix
we denote its maximum and minimum eigenvalues by
and
, respectively.
We use generalized estimating equations to estimate
. In most applications, we do not know the true correlation matrix
, so we use a working correlation matrix
that depends on a parameter
. We let
denote an estimated working correlation matrix. The
that we use below was suggested by Balan & Schiopu-Kratina (2005). In the following,
denotes the derivative of the function
. Let
![]() |
The estimator denoted by
is the solution to the equation
![]() |
(3) |
Let
,
,
,
and
, and let
be a preliminary
-consistent estimate of
. Details on obtaining this estimator can be found in Wang (2011).
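To make the role of the working correlation matrix concrete, the following sketch solves a generalized estimating equation in a deliberately simplified setting. It assumes an identity link, constant variance and a fixed exchangeable working correlation, so the equation is linear in the parameter; it is not the estimator defined in (3), and the function names and simulated data are hypothetical.

```python
import numpy as np

def exchangeable(m, rho):
    """m x m correlation matrix with every off-diagonal entry equal to rho."""
    return (1.0 - rho) * np.eye(m) + rho * np.ones((m, m))

def gee_identity_link(scores, responses, working_corr):
    """Solve sum_i X_i' R^{-1} (Y_i - X_i beta) = 0 for beta.

    scores: array of shape (n_clusters, m, p)
    responses: array of shape (n_clusters, m)
    """
    r_inv = np.linalg.inv(working_corr)
    lhs = np.einsum("imp,mk,ikq->pq", scores, r_inv, scores)
    rhs = np.einsum("imp,mk,ik->p", scores, r_inv, responses)
    return np.linalg.solve(lhs, rhs)

rng = np.random.default_rng(0)
n_clusters, m, p = 500, 3, 4
beta_true = np.array([1.0, -0.5, 0.25, 0.0])  # illustrative values

# Simulated basis scores and within-cluster correlated Gaussian errors.
scores = rng.normal(size=(n_clusters, m, p))
chol = np.linalg.cholesky(exchangeable(m, 0.5))
errors = rng.normal(size=(n_clusters, m)) @ chol.T
responses = scores @ beta_true + errors

# Estimates under an exchangeable and an independence working correlation.
beta_hat = gee_identity_link(scores, responses, exchangeable(m, 0.5))
beta_hat_ind = gee_identity_link(scores, responses, np.eye(m))
```

Both choices of working correlation give estimates close to `beta_true` in this example, which is in line with the remark that consistency does not require a correctly specified correlation structure.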
3. Asymptotic theory
3.1. Existence and consistency
We first state the assumptions needed for existence of the solution to (3) and its consistency.
Assumption 1.
Let
for all
,
, and
.
Assumption 2.
The function
is monotone and invertible and has two continuous bounded derivatives. The function
has a continuous bounded derivative and is bounded from below by
.
Assumption 3.
Let
.
Assumption 4.
The true correlation matrix
has positive eigenvalues. The estimated working correlation matrix
satisfies
and
is a constant positive-definite matrix.
Assumption 5.
There exist two positive constants
and
such that
.
We can obtain a consistent estimator of the mean function to centre the predictor function. We assume that
to obtain consistent estimates of the eigenvalues (Horváth & Kokoszka, 2012). Assumptions 4 and 5 can also be found in Wang (2011). We prove that
satisfies Assumption 4 in Proposition S1 in the Supplementary Material.
Theorem 1.
Under Assumptions 1–5, the solution
to (3) exists and satisfies
.
The proof follows the lines of Wang (2011) and can be found in the Supplementary Material, together with proofs of the other results. We approximate
by
. Thus, we can consistently estimate the parameters even if the correlation structure is misspecified.
3.2. Asymptotic normality
To show the asymptotic normality of the estimator, we approximate
by
. We write
where
![]() |
Next, we state the assumptions needed for establishing the asymptotic normality of
.
Assumption 6.
Let
.
Assumption 7.
The matrices
and
are nonsingular for all
.
Assumption 8.
The eigenvalues of
are bounded and
Let
and
.
Theorem 2.
Under Assumptions 1–8, the following convergences in distribution hold as
:
(4)
(5)
(6)
This theorem can easily be extended to include additional finite-dimensional covariates and finitely many functional predictors. It can also be used to construct confidence bands; see Müller & Stadtmüller (2005, Corollary 4.3).
We are now ready to describe an F test for a regression relation between a scalar response and a functional predicting variable. Referring to the model (1), testing for no association between the response and the predicting process is equivalent to testing
. Since we use the sequence of approximate models proposed in (2), we test the hypothesis
versus the alternative that
is not true for a given
. The proposed test rejects
for large absolute values of
![]() |
From Theorem 2 it follows that the test that rejects
whenever
is of asymptotic size
, where
is the
th percentile of the standard normal distribution.
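The decision rule can be sketched as follows. The example assumes that the standardized statistic is approximately standard normal under the null hypothesis and that the critical value is the 100(1 - alpha/2)th percentile; the observed value 2.3 and the function name are hypothetical.

```python
from statistics import NormalDist

def two_sided_normal_test(statistic, alpha=0.05):
    """Reject when |statistic| exceeds the upper alpha/2 normal critical value."""
    standard_normal = NormalDist()
    critical = standard_normal.inv_cdf(1.0 - alpha / 2.0)
    p_value = 2.0 * (1.0 - standard_normal.cdf(abs(statistic)))
    return abs(statistic) > critical, critical, p_value

# 2.3 is a made-up value of the standardized statistic.
reject, critical, p_value = two_sided_normal_test(2.3)
```

For a nominal level of 0.05 the critical value is close to 1.96, so the hypothetical statistic above leads to rejection.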
4. Simulations
For our first simulation we generated pseudo-random regression functions using Fourier basis functions
and the model
, with the
independent and identically distributed as
. The effect function
, where 
and
.
For cluster size
and a continuous response we used the following model to generate the responses:
,
, where
and
is a
correlation matrix with all off-diagonal elements equal to
, for
. For a binary response, we generated the correlated responses using the function rmvbin from the bindata package in R (R Development Core Team, 2017) with marginal probabilities given by
, where
is the logistic link function. The correlation matrix is the same as in the previous case.
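A minimal sketch of how clustered continuous responses with an exchangeable correlation matrix might be generated is given below. It assumes Gaussian errors with unit variance and uses zero conditional means as placeholders; the binary case handled by rmvbin is not reproduced here, and all names are hypothetical.

```python
import numpy as np

def exchangeable(m, rho):
    """m x m correlation matrix with every off-diagonal entry equal to rho."""
    return (1.0 - rho) * np.eye(m) + rho * np.ones((m, m))

def simulate_clustered_gaussian(means, rho, rng):
    """Add exchangeable-correlated N(0, 1) errors to the conditional means.

    means: array of shape (n_clusters, m); returns an array of the same shape.
    """
    n_clusters, m = means.shape
    chol = np.linalg.cholesky(exchangeable(m, rho))
    return means + rng.standard_normal((n_clusters, m)) @ chol.T

rng = np.random.default_rng(1)
means = np.zeros((20000, 3))  # placeholder conditional means
responses = simulate_clustered_gaussian(means, 0.5, rng)

# Empirical within-cluster correlation, to be compared with rho = 0.5.
empirical_corr = np.corrcoef(responses, rowvar=False)
```

The off-diagonal entries of `empirical_corr` should be close to the chosen correlation, which gives a quick check that the simulated dependence is as intended.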
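Table 1 shows that as the correlation increases, so does the empirical power of the F test. The power of the Ind test, which treats responses as independent, stays roughly constant, ranging from 41·8 to 45·7 for continuous traits and from 12·4 to 13·4 for binary traits. Ignoring the correlation structure therefore costs power, and the loss grows with the correlation. In this study we chose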
, with
replications and significance level
. The value of
is fixed at
for continuous traits and at
for binary traits. The number of parameters
was determined using five-fold crossvalidation. The empirical level for this study can be found in the Supplementary Material.
Table 1. Empirical power (%) of the F and Ind tests as a function of the correlation
| Corr | Continuous trait, F test | Continuous trait, Ind test | Binary trait, F test | Binary trait, Ind test |
|---|---|---|---|---|
| 0 | 42·1 | 41·8 | 13·7 | 12·7 |
| 0·05 | 45·6 | 45·0 | 13·0 | 12·4 |
| 0·30 | 56·7 | 45·7 | 16·5 | 12·4 |
| 0·50 | 69·2 | 44·3 | 19·5 | 12·9 |
| 0·80 | 99·3 | 45·5 | 30·8 | 13·4 |
Ind test, functional test with independent structure; F test, functional test with correlation Corr.
For the second study, we compared the performance of the F test with that of a generalized estimating equation-based sequence kernel association test proposed by Wang et al. (2013). Although the latter test is based on generalized estimating equations, it is not designed for functional variables. For a fair comparison we used a region of the genome from chromosome 17, obtained from the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2010), and then simulated response values using these data. As this is a population dataset, all the individuals are assumed to be independent. The focus of this simulation is to investigate the effects of treating sequencing data as a function.
In applications, we do not observe the predictor function but assume that the densely observed data constitute a realization of a function; so a smoothing step is needed to construct predictor functions. Smoothing methods given in Ramsay (2006) typically use the following model to fit a single curve. The given curve
is observed at
discrete points
. Let 
denote these observed values. We then recover the function
from these observed values by fitting the linear model 
where the
are basis functions. We select a large value for
and use penalization to ensure that the fitted function is not very rough. We choose 
to minimize
, where
denotes the second derivative. We obtain
. We used cubic B-spline basis functions and the R function smooth.spline to carry out the smoothing. We specified the knots to be at
, so
.
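The roughness-penalty idea behind the smoothing step can be illustrated with a discrete analogue. The sketch below is not the cubic B-spline fit produced by smooth.spline: it swaps in a second-difference (Whittaker-type) smoother, which minimizes the residual sum of squares plus a penalty on squared second differences. The genotype coding 0/1/2, the penalty value and the function names are assumptions made for illustration.

```python
import numpy as np

def second_difference_smoother(values, penalty):
    """Minimize ||values - f||^2 + penalty * ||D2 f||^2 over vectors f."""
    n = len(values)
    d2 = np.diff(np.eye(n), n=2, axis=0)  # (n - 2) x n second-difference matrix
    return np.linalg.solve(np.eye(n) + penalty * d2.T @ d2, values)

def roughness(values):
    """Sum of squared second differences, a discrete stand-in for the integrated squared second derivative."""
    return float(np.sum(np.diff(values, n=2) ** 2))

rng = np.random.default_rng(2)
positions = np.linspace(0.0, 1.0, 200)  # hypothetical scaled variant positions

# Hypothetical genotypes coded 0/1/2 with a smoothly varying allele frequency.
allele_freq = 0.3 + 0.2 * np.sin(2 * np.pi * positions)
genotypes = rng.binomial(2, allele_freq).astype(float)

smoothed = second_difference_smoother(genotypes, penalty=50.0)
```

A larger penalty gives a smoother curve, mirroring the trade-off between fidelity and roughness in the penalized criterion described above.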
We next simulated a discrete and a continuous response using the sequencing data. Let
denote the sequencing data matrix of dimension
, where
and
represent the number of clusters, the cluster size, and the number of variants, respectively. For the selected region from the 1000 Genomes Project,
and
while the family size
. Then the responses are generated as
with
and
. For the binary case we proceed as before using the R function rmvbin. The following empirical powers are based on 1000 replicates and a significance level of 0·05. We chose
by five-fold crossvalidation. We compare the F test with that of Wang et al. (2013). Table 2 shows that the two tests have comparable power, but the Type I error of the latter is inflated in the binary case (6·1 against a nominal 5).
Table 2. Comparison of empirical power (%) of the F test and gSKAT
| Effect | Continuous trait, F test | Continuous trait, gSKAT | Binary trait, F test | Binary trait, gSKAT |
|---|---|---|---|---|
| 0·00 | 4·1 | 4·8 | 4·7 | 6·1 |
| 0·05 | 10·2 | 10·8 | 7·2 | 6·3 |
| 1·00 | 31·4 | 37·4 | 11·4 | 13·3 |
| 3·00 | 99·6 | 99·4 | 55·5 | 60·8 |
| 5·00 | 100 | 100 | 95·1 | 96·2 |
F test, functional test with the correlation structure; gSKAT, family sequence kernel association test of Wang et al. (2013).
In the Supplementary Material we investigate the effect of sample size and cluster size on the power of the F test. The regression function and the response variable are generated as in the first simulation study. The number of basis functions
selected to generate the regression functions is
. The effect size
is
for the binary response case and
for the continuous response case. The correlation is
. We find that the power increases with the sample size for both types of response variables.
To demonstrate the effect of using a large cluster size
, we took a sample size of
. Results of this simulation are reported in the Supplementary Material. We observe that the Type I error is inflated as the cluster size increases. For a cluster of size
, the number of parameters in the correlation matrix is of order
. Large
will affect the consistency of the correlation estimate (
) used in the score equation (3), rendering our asymptotic results invalid.
In the Supplementary Material we also explore the effect of increasing the number of parameters
in the model. The predictor function and the continuous response are simulated as before. The number of basis functions used to generate the predictor is set to
. The sample size is
, the effect size
is
, and the correlation parameter
is taken to be
. The number of parameters
in the model is chosen to be 30, 50 or 90 rather than using crossvalidation. We find that as the number of parameters increases, the Type I error also increases.
To determine the number of parameters
, we used functional principal component analysis for dimension reduction. We projected the function onto a subspace generated by the first
eigenfunctions of the covariance operator of the predictor functions. This operator is estimated from the predictor functions obtained by smoothing the observed data. We first selected those values of
for which the proportion of variance explained by the first
principal components,
, varied from 80% to 99% as potential values for the number of parameters
. We then used five-fold crossvalidation to select
.
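The first stage of this selection procedure can be sketched as follows. The example assumes curves observed on a common equally spaced grid, so that the eigenvalues of the sample covariance matrix stand in for those of the covariance operator up to a constant factor; the simulated curves and the function name are hypothetical, and the subsequent crossvalidation step is not shown.

```python
import numpy as np

def candidate_dimensions(curves, low=0.80, high=0.99):
    """Return dimensions whose cumulative explained variance lies in [low, high].

    curves: array of shape (n_curves, n_grid) holding smoothed predictors.
    """
    centred = curves - curves.mean(axis=0)
    cov = centred.T @ centred / (curves.shape[0] - 1)
    eigenvalues = np.linalg.eigvalsh(cov)[::-1]      # sorted in decreasing order
    eigenvalues = np.clip(eigenvalues, 0.0, None)    # guard against tiny negatives
    explained = np.cumsum(eigenvalues) / np.sum(eigenvalues)
    dims = np.arange(1, len(eigenvalues) + 1)
    keep = (explained >= low) & (explained <= high)
    return dims[keep], explained

rng = np.random.default_rng(3)
grid = np.linspace(0.0, 1.0, 50)

# Hypothetical curves driven by five components with decreasing variance.
sd = np.array([3.0, 2.0, 1.0, 0.5, 0.25])
components = np.vstack([np.sin((k + 1) * np.pi * grid) for k in range(5)])
curves = (rng.standard_normal((300, 5)) * sd) @ components

dims, explained = candidate_dimensions(curves)
```

The array `dims` lists the candidate numbers of parameters that would then be compared by five-fold crossvalidation.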
5. Application
Twin and family-based studies have suggested a substantial genetic contribution to substance dependence. Studies have been successful in identifying genetic variants contributing to nicotine dependence. In this application, we applied the F test to assess the relation between 15 neuronal nicotinic acetylcholine receptor (nAChR) subunit genes and nicotine dependence using sequencing data from the Minnesota Twin Study (Vrieze et al., 2014). Genetic association between nAChR subunit genes and nicotine dependence has already been established. A comprehensive study of these genes (Saccone et al., 2009) found associations between nicotine dependence and loci in the CHRNA5, CHRNA3, CHRNA4, CHRNB4, CHRNB3, CHRNB1, CHRNA6, CHRND, CHRNG and CHRNB4 genes. Our aim is to check whether our test can replicate these associations. The sequencing dataset we used for this analysis contains 662 families and 1445 individuals.
Nicotine dependence is a continuous variable measured using the protocols of the Substance Abuse Module of the Composite International Diagnostic Interview (Hicks et al., 2011). It takes into account the frequency and quantity of nicotine use, including cigarettes, cigars, pipes and tobacco chewing. The covariates of age and sex were also considered. We first fitted a regression model with nicotine dependence as the response and with age and sex as predictors, and then used the residuals in the analysis. We applied our proposed test to the 15 genes individually. As the response and predictors are centred, we omit the intercept from this analysis. The smooth functions obtained by the smoothing method discussed in the previous section serve as predictors in the test. Table 3 reports the number of parameters
used for each gene. Because these values are small, the normal approximation is inaccurate, as shown in the Supplementary Material. We instead use
as the statistic that follows
, and use this to report the unadjusted p-values for the F test. We find that the p-values for the F test are smaller than those of gSKAT for 13 of the 15 genes, so our method flags several associations that are undetected by gSKAT, which might suggest that our method has better performance. Even after adjustment for multiple testing, we find associations for genes such as CHRNB2, CHRNA6 and CHRND.
Table 3. The p-values for 15 nAChR subunit genes
| Gene | F test | gSKAT | Number of parameters | Statistic |
|---|---|---|---|---|
| CHRNA1 | 0·929 | 0·621 | 2 | 0·147 |
| CHRNA2 | 0·740 | 0·595 | 2 | 0·600 |
| CHRNA3 | 0·057 | 0·611 | 2 | 5·713 |
| CHRNA4 | 0·011 | 0·125 | 2 | 8·900 |
| CHRNA5 | 0·148 | 0·415 | 2 | 3·810 |
| CHRNA6 | 4E-05 | 0·216 | 6 | 29·765 |
| CHRNA7 | 0·625 | 0·736 | 2 | 0·937 |
| CHRNA9 | 0·400 | 0·524 | 2 | 1·831 |
| CHRNB1 | 0·017 | 0·870 | 2 | 8·047 |
| CHRNB2 | 2E-09 | 0·1351 | 7 | 53·838 |
| CHRNB3 | 0·003 | 0·429 | 5 | 17·524 |
| CHRNB4 | 0·382 | 0·561 | 2 | 1·922 |
| CHRND | 8E-04 | 0·175 | 2 | 14·230 |
| CHRNE | 0·207 | 0·675 | 2 | 3·142 |
| CHRNG | 0·044 | 0·270 | 2 | 6·219 |
Number of parameters, number of parameters selected; Statistic, the statistic used to compute the unadjusted F test p-values; gSKAT, family sequence kernel association test of Wang et al. (2013).
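As an illustration of how unadjusted p-values of this kind can be obtained from a statistic and the number of parameters, the sketch below evaluates an upper tail probability. It assumes that the reference distribution is chi-squared with degrees of freedom equal to the number of parameters selected; this is an assumption of the sketch, and the helper `chi2_sf` is a hypothetical name. The inputs are the statistics and numbers of parameters reported for four genes in Table 3.

```python
import math

def chi2_sf(x, df):
    """Upper tail P(X > x) for a chi-squared variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)              # Q(1, half)
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))   # Q(1/2, half)
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q

# (statistic, number of parameters) as reported in Table 3.
table3_rows = {
    "CHRNA4": (8.900, 2),
    "CHRNA6": (29.765, 6),
    "CHRNB3": (17.524, 5),
    "CHRND": (14.230, 2),
}

p_values = {gene: chi2_sf(stat, df) for gene, (stat, df) in table3_rows.items()}

# Bonferroni threshold for 15 genes at an overall level of 0.05.
bonferroni_threshold = 0.05 / 15
significant = sorted(g for g, p in p_values.items() if p < bonferroni_threshold)
```

Under this assumption the values computed for these four genes agree with the tabulated F test p-values to within one unit in the last reported digit, and the Bonferroni comparison gives one simple way to carry out an adjustment for multiple testing.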
Acknowledgement
This work was supported by the National Institute on Drug Abuse. We thank Drs Scott Vrieze, Matt McGue and S. Alexandra Burt for helping us access the whole-genome sequencing data from the Minnesota Twin Study. We are grateful to the reviewers for their constructive comments, which have helped to improve the presentation of the paper.
Supplementary material
Supplementary material available at Biometrika online contains results on the empirical levels of the first simulation study, the sample size study, the cluster size study, and the study of increasing dimensions, as well as Q-Q plots to check the normality of the test statistic. It also includes proofs of the theoretical results.
References
- Balan R. M. & Schiopu-Kratina I. (2005). Asymptotic results with generalized estimating equations for longitudinal data. Ann. Statist. 33, 522–41.
- Cardot H., Ferraty F. & Sarda P. (1999). Functional linear model. Statist. Prob. Lett. 45, 11–22.
- Cardot H., Ferraty F. & Sarda P. (2003). Spline estimators for the functional linear model. Statist. Sinica 13, 571–91.
- Cardot H. & Johannes J. (2010). Thresholding projection estimators in functional linear models. J. Mult. Anal. 101, 395–408.
- Cardot H. & Sarda P. (2005). Estimation in generalized linear models for functional data via penalized likelihood. J. Mult. Anal. 92, 24–41.
- Gertheiss J., Maity A. & Staicu A.-M. (2013). Variable selection in generalized functional linear models. Stat 2, 86–101.
- Hicks B. M., Schalet B. D., Malone S. M., Iacono W. G. & McGue M. (2011). Psychometric and genetic architecture of substance use disorder and behavioral disinhibition measures for gene association studies. Behav. Genet. 41, 459–75.
- Horváth L. & Kokoszka P. (2012). Inference for Functional Data with Applications. New York: Springer.
- Laird N. M. & Lange C. (2010). The Fundamentals of Modern Statistical Genetics. New York: Springer.
- Li Y., Wang N. & Carroll R. J. (2010). Generalized functional linear models with semiparametric single-index interactions. J. Am. Statist. Assoc. 105, 621–33.
- Morris J. S. (2015). Functional regression. Ann. Rev. Statist. Applic. 2, 321–59.
- Müller H.-G. & Stadtmüller U. (2005). Generalized functional linear models. Ann. Statist. 33, 774–805.
- R Development Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. http://www.R-project.org.
- Ramsay J. O. (2006). Functional Data Analysis. New York: Springer.
- Saccone N. L., Saccone S. F., Hinrichs A. L., Stitzel J. A., Duan W., Pergadia M. L., Agrawal A., Breslau N., Grucza R. A., Hatsukami D. et al. (2009). Multiple distinct risk loci for nicotine dependence identified by dense coverage of the complete family of nicotinic receptor subunit (CHRN) genes. Am. J. Med. Genet. B 150, 453–66.
- The 1000 Genomes Project Consortium (2010). A map of human genome variation from population-scale sequencing. Nature 467, 1061–73.
- Vrieze S. I., Malone S. M., Vaidyanathan U., Kwong A., Kang H. M., Zhan X., Flickinger M., Irons D., Jun G., Locke A. E. et al. (2014). In search of rare variants: Preliminary results from whole genome sequencing of 1,325 individuals with psychophysiological endophenotypes. Psychophysiology 51, 1309–20.
- Wang J.-L., Chiou J.-M. & Müller H.-G. (2016). Functional data analysis. Ann. Rev. Statist. Applic. 3, 257–95.
- Wang L. (2011). GEE analysis of clustered binary data with diverging number of covariates. Ann. Statist. 39, 389–417.
- Wang X., Lee S., Zhu X., Redline S. & Lin X. (2013). GEE-based SNP set association test for continuous and discrete traits in family-based association studies. Genet. Epidemiol. 37, 778–86.