Published in final edited form as: Biometrika. 2022 Dec;109(4):1133–1148. doi: 10.1093/biomet/asac007

Functional hybrid factor regression model for handling heterogeneity in imaging studies

C HUANG 1, H ZHU 2

Summary

This paper develops a functional hybrid factor regression modelling framework to handle the heterogeneity of many large-scale imaging studies, such as the Alzheimer’s disease neuroimaging initiative study. Despite the numerous successes of those imaging studies, such heterogeneity, which may be caused by differences in study environment, population, design, protocols or other hidden factors, has posed major challenges in the integrative analysis of imaging data collected from multiple centres or studies. We propose both estimation and inference procedures for estimating unknown parameters and detecting unknown factors under our new model. The asymptotic properties of both estimation and inference procedures are systematically investigated. The finite-sample performance of our proposed procedures is assessed by using Monte Carlo simulations and a real data example on hippocampal surface data from the Alzheimer’s disease study.

Keywords: Alzheimer’s disease, Functional hybrid factor regression model, Hippocampal surface, Imaging heterogeneity, Surrogate variable analysis

1. Introduction

With the rapid growth of modern technology, many large-scale imaging studies, such as the Alzheimer’s disease neuroimaging initiative, ADNI, study (Mueller et al., 2005), the Human Connectome Project (Van Essen et al., 2013) and the UK Biobank study (Sudlow et al., 2015), have been conducted to collect massive datasets with large volumes of complex information from increasingly large cohorts for unravelling the etiology of different diseases, such as Alzheimer’s disease. For example, the ADNI study is a multi-phase study that aims to discover the progression of Alzheimer’s disease and improve clinical trials for the prevention and treatment of Alzheimer’s disease. However, such integrative data analysis is challenging largely due to the heterogeneity in those imaging studies, since the datasets are often collected from different centres and/or phases and need to be rigorously integrated (Lock et al., 2013; Yu et al., 2017). The potential heterogeneity may be caused by the differences in study environment, population (e.g., race), design, protocols (e.g., imaging acquisition protocol) and some other (unknown) hidden factors in multiple centres and/or phases (Leek & Storey, 2007; Mirzaalian et al., 2016; Fortin et al., 2017). As an illustration, we consider a hippocampal surface dataset obtained from the three different phases, ADNI-1, ADNI-GO and ADNI-2, of the ADNI study. Fig. 1 presents the three quantiles of logged radial distances across all the vertices on the left and right hippocampal surfaces. More details on the calculation of radial distances will be discussed in § 5. We observe different patterns in the quantile plots across the three phases, especially between ADNI-1 and the other two phases, indicating that imaging heterogeneity does exist in the ADNI hippocampal surface data. Thus, appropriately handling the imaging heterogeneity can be critically important for understanding the role of imaging biomarkers in the etiological mechanism of Alzheimer’s disease. Another example on diffusion tensor imaging also illustrates the heterogeneity in different imaging datasets and can be found in the Supplementary Material.

Fig. 1.

Heterogeneity in the ADNI hippocampal surface dataset: (a) three quantiles of the logged radial distances across all the vertices on the left hippocampal surface and (b) those on the right hippocampal surface for all subjects obtained from ADNI-1 (blue), ADNI-GO (orange) and ADNI-2 (green).

Currently, there are two approaches to tackling heterogeneity in imaging studies. The first one is image-based meta-analysis, in which study-specific statistical analyses are performed first, e.g., via Fisher’s combined probability test or Stouffer’s z-transformation test, and the results are combined afterwards (Salimi-Khorshidi et al., 2009). Although it has shown great promise for some studies with a large number of participants at each phase, or site (Kochunov et al., 2014), this technique still suffers from at least two major limitations: (i) the study-specific population might not be large enough to estimate the true biological variability in the entire population (Mirzaalian et al., 2016); and (ii) computing study-specific summary statistics can be affected by unbalanced data. For instance, the variance of the z-score is highly dependent on the ratio of cases to controls in each individual study, which can lead to inaccurate statistical inferences (Fortin et al., 2017). The second approach is to apply either fixed-effect or mixed-effect models to capture the heterogeneity. These methods estimate primary effects, while adjusting for study-related known covariates and unknown hidden factors. To identify those unknown factors, surrogate variable analysis has been developed in various genomic studies (Johnson et al., 2007; Leek & Storey, 2007, 2008; Sun et al., 2012; Lee et al., 2017; Wang et al., 2017), and recently adapted to imaging data analysis (Guillaume et al., 2018). Since surrogate variable analysis assumes that massive univariate regression models share a common set of unknown factors, imaging measures are usually treated as multivariate phenotypes. However, imaging measures across different voxels, or grid points, are more naturally treated as functional responses, so it is natural to use functional data analysis tools, which can explicitly account for the three key features of imaging data: spatial smoothness, spatial correlation and low-dimensional representation (Zhu et al., 2012). Furthermore, by applying some smoothing techniques, the noise component of imaging measures can be reduced and the estimates of primary effects outperform those under mass-univariate analysis in terms of estimation precision (Ramsay & Silverman, 2002). Therefore, it is of great importance to address the hidden factor issue in functional regression models by borrowing ideas from surrogate variable analysis.

The aim of this paper is to develop a functional hybrid factor regression modelling framework to investigate the relationship between functional responses and primary covariates, while adjusting for hidden factors. Compared to existing surrogate variable analysis methods, our proposed method is the first designed for functional data. Although some functional models also consider recovering hidden factors via functional principal component analysis (Zhu et al., 2012), they are inefficient for handling imaging heterogeneity, since the hidden factors and the observed covariates are assumed to be uncorrelated. We develop a three-step procedure to estimate the unknown quantities in our proposed model. In addition to the estimation procedure, a global Wald-type test and a simultaneous confidence band are also constructed for the coefficient functions. We also systematically investigate the asymptotic properties of the estimated coefficient functions, detected hidden factors and test statistics. Furthermore, both simulation studies and real data analysis show that our proposed method outperforms competing methods in terms of both estimation accuracy and robustness.

2. Methods

2.1. Functional hybrid factor regression model

Suppose that we observe both imaging data and some covariates from n unrelated subjects. Assume that all the images have been registered to a common template, denoted as $S \subset \mathbb{R}^d$. The template $S$ includes $n_v$ grid points, denoted as $s_1, \ldots, s_{n_v}$, which have a common density $p(s)$ with bounded support $\mathrm{supp}(p) \subset S$. For each registered image, it is assumed that $J$ imaging measurements, or features, are derived at each point such that $y(s_k) = \{y_{\cdot 1}(s_k), \ldots, y_{\cdot J}(s_k)\}$ is an $n \times J$ matrix of the $J$ features at $s_k$ across the $n$ subjects. Let $X$ be an $n \times p$ full column rank matrix of observed covariates including the intercept, and let $Z$ be an $n \times q$ full column rank matrix of hidden factors, where the number of latent factors, $q$, is unknown. Let $C^2(S)$ denote the class of functions whose second-order partial derivatives exist and are continuous everywhere in $S$.

In this paper, to build up the relationship between imaging responses and both observed covariates and hidden factors, a functional hybrid factor regression model is described as

$$y_{\cdot j}(s) = X\beta_j(s) + Z\gamma_j(s) + \eta_{\cdot j}(s) + \epsilon_{\cdot j}(s) \quad (j = 1, \ldots, J), \qquad (1)$$

where $\beta_j(s)$ is a $p \times 1$ vector with entries $\{\beta_{tj}(s) \in C^2(S)\}_{t=1}^{p}$ representing the primary effect of $X$ on $y_{\cdot j}(s)$, and $\gamma_j(s)$ is a $q \times 1$ vector with entries $\{\gamma_{lj}(s) \in C^2(S)\}_{l=1}^{q}$ representing the effect on $y_{\cdot j}(s)$ caused by the hidden factors $Z$. Moreover, let $\eta(s) = \{\eta_{\cdot 1}(s), \ldots, \eta_{\cdot J}(s)\}$ be an $n \times J$ matrix that characterizes both subject-specific and location-specific spatial variability, and let $\epsilon(s) = \{\epsilon_{\cdot 1}(s), \ldots, \epsilon_{\cdot J}(s)\}$ be measurement errors. It is also assumed that each row in $\eta(s)$ and each row in $\epsilon(s)$ are mutually independent and identical copies of $\mathrm{SP}(0, \Sigma_\eta)$ and $\mathrm{SP}(0, \Sigma_\epsilon)$, respectively, where $\mathrm{SP}(\mu, \Sigma)$ denotes a stochastic process vector with mean function $\mu(s)$ and covariance function $\Sigma(s, s')$. Moreover, $\Sigma_\epsilon(s, s')$ takes the form of $\Omega_\epsilon(s)1(s = s')$, where $\Omega_\epsilon(s)$ is a diagonal matrix and $1(\cdot)$ is the indicator function. As a comparison, we also consider multivariate varying coefficient models (Zhu et al., 2012) given by

$$y_{\cdot j}(s) = X\beta_j(s) + \eta_{\cdot j}(s) + \epsilon_{\cdot j}(s) \quad (j = 1, \ldots, J). \qquad (2)$$

Here models (1) and (2) share several common features. First, both models account for the spatial smoothness, spatial correlation and low-dimensional representation of functional responses (Zhu et al., 2012). Second, both models can be used to investigate the relationship between multivariate functional responses and some observed covariates of interest. Third, individual functional variation is captured through $\eta(s)$ in both models (Zhu et al., 2012). Fourth, the detection and adjustment of hidden factors are possible in both models.

However, models (1) and (2) use different strategies to handle the hidden factors. In Zhu et al. (2012), the hidden factors can be captured by the individual functions $\eta(s)$ through functional principal component analysis (Wang et al., 2016), where all the principal component scores can be used to recover the structure of the hidden factors. A major issue associated with this strategy is that it cannot appropriately handle the case in which the observed covariates and hidden factors are correlated with each other. Specifically, in Zhu et al. (2012), the observed covariates $X$ are assumed to be uncorrelated with the hidden factors in the individual functions $\eta(s)$. However, such an assumption may be questionable in some applications (Helmer et al., 1999; Sundström et al., 2016; Sommerlad et al., 2018) and, thus, model (2) can be problematic for appropriately detecting and adjusting for the hidden factors. In contrast, in model (1), the individual functions $\eta(s)$ are assumed to be uncorrelated with both the observed covariates $X$ and the hidden factors $Z$, while no assumptions are made about the correlation between $X$ and $Z$. Therefore, model (1) can handle hidden factors even when they are correlated with the observed covariates.

2.2. Estimation procedure

We present the estimation procedure for coefficient functions and hidden factors in three steps.

Step 1. By applying the orthogonal decomposition of the matrix Z onto the column space of X, we reparameterize model (1) as

$$y_{\cdot j}(s) = X\beta_j^{*}(s) + Z^{*}\gamma_j(s) + \eta_{\cdot j}(s) + \epsilon_{\cdot j}(s) \quad (j = 1, \ldots, J), \qquad (3)$$

where $\beta_j^{*}(s) = \beta_j(s) + (X^TX)^{-1}X^TZ\gamma_j(s)$, $Z^{*} = (I_n - P_X)Z$ and $P_X = X(X^TX)^{-1}X^T$. Obviously, the columns of $X$ are orthogonal to those of $Z^{*}$. Then, given that $\{y_{\cdot j}(s)\}_{j=1}^{J}$ and $X$ are observed, the multivariate local linear kernel smoothing technique (Ruppert & Wand, 1994; Fan & Gijbels, 1996) is used to derive the weighted least squares estimator of $\beta_j^{*}(s)$ in (3). Let $e^{\otimes 2} = ee^T$ for any vector $e$, and let $C \otimes D$ denote the Kronecker product of two matrices $C$ and $D$. In addition, define $K_{H_\beta}(s) = |H_\beta|^{-1}K(H_\beta^{-1}s)$ and $z_{H_\beta}(s_k - s) = \{1, (s_k - s)^T H_\beta^{-1}\}^T$, where $K(\cdot)$ is the kernel function and $H_\beta$ is a positive definite bandwidth matrix with determinant $|H_\beta|$. For each $j$ and fixed $H_\beta$, the estimator of $\beta_j^{*}(s)$ is derived as

$$\hat\beta_j^{*}(s) = (X^TX)^{-1}X^T\sum_{k=1}^{n_v}\varrho_k(H_\beta, s)\,y_{\cdot j}(s_k), \qquad (4)$$

where $\varrho_k(H_\beta, s) = (1, 0_{1\times d})\{\sum_{k'=1}^{n_v}K_{H_\beta}(s_{k'} - s)z_{H_\beta}^{\otimes 2}(s_{k'} - s)\}^{-1}K_{H_\beta}(s_k - s)z_{H_\beta}(s_k - s)$. Since there is no linearity assumption on the coefficient function $\beta_j^{*}(s)$, the local linear smoother in (4) is a biased estimator (Fan & Gijbels, 1996). To overcome this issue, a standard technique considered here is bias correction. Following the pre-asymptotic substitution method in Fan & Gijbels (1996), the bias term can be obtained by using a local cubic fit with a pilot bandwidth selected in (4). Furthermore, according to the definition of $\beta_j^{*}(s)$, the aim of the following two steps is to seek an estimate of $Z\gamma_j(s)$; the estimate of $\beta_j(s)$ can then be derived by subtracting the corresponding estimate of $(X^TX)^{-1}X^TZ\gamma_j(s)$ from $\hat\beta_j^{*}(s)$.
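To make Step 1 concrete, the following Python sketch computes the raw estimator in (4) for a single feature, assuming a Gaussian kernel; the function names, the simple plug-in structure and the omission of bias correction and bandwidth selection are our simplifications, not part of the paper.

```python
import numpy as np

def local_linear_weights(S, s0, H):
    """Local linear weights rho_k(H, s0) at a target point s0 (Gaussian kernel assumed).

    S  : (nv, d) array of grid points s_1, ..., s_nv
    s0 : (d,) target location
    H  : (d, d) positive-definite bandwidth matrix
    Returns an (nv,) vector w so that sum_k w[k] * y(s_k) is the local linear fit at s0.
    """
    Hinv = np.linalg.inv(H)
    u = (S - s0) @ Hinv.T                                   # standardized differences
    K = np.exp(-0.5 * np.sum(u**2, axis=1)) / np.linalg.det(H)
    ZH = np.hstack([np.ones((S.shape[0], 1)), u])           # rows z_H(s_k - s0)^T
    A = ZH.T @ (K[:, None] * ZH)                            # sum_k K(.) z z^T
    W = np.linalg.solve(A, (K[:, None] * ZH).T)             # (d+1, nv) local polynomial weights
    return W[0]                                             # intercept row gives the fit weights

def step1_beta_star(Y, X, S, H):
    """Raw (uncorrected) estimator of beta*_j(s_k) in (4) at every grid point.

    Y : (n, nv) observations of one feature j;  X : (n, p) covariates.
    Returns a (p, nv) array whose k-th column is hat{beta}*_j(s_k).
    """
    XtXinv_Xt = np.linalg.solve(X.T @ X, X.T)               # (X^T X)^{-1} X^T
    B = np.empty((X.shape[1], S.shape[0]))
    for k in range(S.shape[0]):
        w = local_linear_weights(S, S[k], H)                # weights rho_{k'}(H, s_k)
        B[:, k] = XtXinv_Xt @ (Y @ w)                       # smooth the responses, then project on X
    return B
```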

Step 2. The residual term from the previous step is first defined as $R_{\cdot j}(s) = y_{\cdot j}(s) - X\tilde\beta_j^{*}(s)$, where $\tilde\beta_j^{*}(s)$ is the refined version of $\hat\beta_j^{*}(s)$ after correcting the bias using the local linear kernel smoothing technique. Next, we construct an $n \times Jn_v$ extended residual matrix written as $\bar R = \{R_{\cdot 1}(s_1), \ldots, R_{\cdot 1}(s_{n_v}), \ldots, R_{\cdot J}(s_1), \ldots, R_{\cdot J}(s_{n_v})\}$. Then, given $S$, $X$ and $Z$, the conditional expectation of the extended residual matrix can be derived as (Ruppert & Wand, 1994)

$$E(\bar R \mid S, X, Z) = Z^{*}\bar\Gamma + o_p\{\mathrm{tr}(H_\beta^{2})\}, \qquad (5)$$

where $\bar\Gamma = \{\gamma_1(s_1), \ldots, \gamma_1(s_{n_v}), \ldots, \gamma_J(s_1), \ldots, \gamma_J(s_{n_v})\}$ and $\mathrm{tr}(\cdot)$ is the trace of a given matrix. To estimate the primary term $Z^{*}$ in (5), the singular value decomposition is first performed on $\bar R$, i.e., $\bar R = U\Lambda V^T$, where the columns of $U$ and $V$ consist of the left and right singular vectors, respectively, and $\Lambda$ is a diagonal matrix whose diagonal entries are the ordered singular values of $\bar R$. Specifically, the first $q$ columns of $U$, denoted as $U_{1:q}$, can be treated as an estimator of linear combinations of the columns of $Z^{*}$; see the Supplementary Material. Then, there exists a $q \times q$ orthonormal matrix $Q$ such that $U_{1:q} = Z^{*}Q + o_p(1)$ and $Z^{*}\gamma_j(s) = U_{1:q}\alpha_j(s)$, where $\alpha_j(s) = Q^T\gamma_j(s)$ $(j = 1, \ldots, J)$.
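Step 2 therefore reduces to a single singular value decomposition of the extended residual matrix. A minimal sketch, assuming q is known, is given below; the helper name is ours.

```python
import numpy as np

def step2_factor_directions(R_list, q):
    """Leading left singular vectors U_{1:q} of the extended residual matrix.

    R_list : list of J arrays of shape (n, nv), the residuals R_.j(s_k) over the grid.
    q      : assumed number of hidden factors.
    """
    R_bar = np.hstack(R_list)                               # n x (J * nv) extended residual matrix
    U, _, _ = np.linalg.svd(R_bar, full_matrices=False)
    return U[:, :q]                                         # spans the column space of Z* up to rotation
```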

Step 3. To derive the estimate of αj(s), the residual terms in the previous steps are treated as functional responses. Then, a new varying coefficient model is constructed via substituting the singular value decomposition results:

$$R_{\cdot j}(s) = U_{1:q}\alpha_j(s) + \tilde\eta_{\cdot j}(s) + \tilde\epsilon_{\cdot j}(s) \quad (j = 1, \ldots, J),$$

with $\tilde\eta_{\cdot j}(s)$ and $\tilde\epsilon_{\cdot j}(s)$ defined similarly to $\eta_{\cdot j}(s)$ and $\epsilon_{\cdot j}(s)$, respectively. For a fixed $H_\alpha$, the estimator of $\alpha_j(s)$ can be derived as $U_{1:q}^T\sum_{k=1}^{n_v}\varrho_k(H_\alpha, s)R_{\cdot j}(s_k)$, and $\hat\alpha_j(s)$ is denoted as the corresponding bias-corrected version. Then, an estimating equation can be constructed as

$$X\tilde B^{*}(s) + U_{1:q}\hat A(s) = XB(s) + G\hat A(s),$$

where $\tilde B^{*}(s) = \{\tilde\beta_1^{*}(s), \ldots, \tilde\beta_J^{*}(s)\}$, $G = ZQ$ and $\hat A(s) = \{\hat\alpha_1(s), \ldots, \hat\alpha_J(s)\}$. With an additional assumption that the row vectors of $B(s) = \{\beta_1(s), \ldots, \beta_J(s)\}$ and the row vectors of $\Gamma(s) = \{\gamma_1(s), \ldots, \gamma_J(s)\}$ are orthogonal with respect to $p(s)$ on $S$ after mean centring, we can derive the estimator of $G$ as

$$\hat G = U_{1:q} + X\int_S \tilde B^{*}(s)(I_J - P_J)\hat A^T(s)\,p(s)\,ds\;\Omega^{-1},$$

where $\Omega = \int_S \hat A(s)(I_J - P_J)\hat A^T(s)\,p(s)\,ds$ and $P_J = 1_J(1_J^T 1_J)^{-1}1_J^T$, in which $1_J$ is a $J \times 1$ vector of ones. Since $G\alpha_j(s) = Z\gamma_j(s)$ for $j = 1, \ldots, J$, the estimator of $B(s)$ is given by

$$\hat B(s) = \tilde B^{*}(s) - (X^TX)^{-1}X^T\hat G\hat A(s).$$
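A rough sketch of Step 3 under the orthogonality assumption, approximating the integrals by quadrature on the grid; the array layout and function name are our choices, and the bias-corrected inputs from Steps 1 and 2 are assumed to be supplied.

```python
import numpy as np

def step3_recover_G_and_B(B_tilde, A_hat, X, U1q, weights):
    """Compute hat{G} and hat{B}(s_k) from the outputs of Steps 1 and 2.

    B_tilde : (p, J, nv) bias-corrected hat{beta}*_j(s_k), stacked over features j and grid points k
    A_hat   : (q, J, nv) bias-corrected hat{alpha}_j(s_k)
    X       : (n, p) covariates;  U1q : (n, q) from Step 2
    weights : (nv,) quadrature weights approximating p(s_k) ds
    """
    p, J, nv = B_tilde.shape
    C = np.eye(J) - np.ones((J, J)) / J                     # I_J - P_J (mean centring over features)
    # Omega = int A(s) (I_J - P_J) A^T(s) p(s) ds, approximated on the grid
    Omega = sum(weights[k] * A_hat[:, :, k] @ C @ A_hat[:, :, k].T for k in range(nv))
    # int B~*(s) (I_J - P_J) A^T(s) p(s) ds
    BA = sum(weights[k] * B_tilde[:, :, k] @ C @ A_hat[:, :, k].T for k in range(nv))
    G_hat = U1q + X @ BA @ np.linalg.inv(Omega)             # hat{G}
    proj = np.linalg.solve(X.T @ X, X.T @ G_hat)            # (X^T X)^{-1} X^T hat{G}
    B_hat = np.stack([B_tilde[:, :, k] - proj @ A_hat[:, :, k] for k in range(nv)], axis=2)
    return G_hat, B_hat                                     # B_hat has shape (p, J, nv)
```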

2.3. Other issues in the estimation procedure

First, according to the definitions of $G$ and $\{\alpha_j(s)\}$, they are not identifiable due to a scaling issue. To address this issue, we impose the constraint $(nq)^{-1}\sum_{i=1}^{n}\sum_{j=1}^{q}G_{i,j}^{2} = 1$, where $G_{i,j}$ is the $(i, j)$th element of $G$, and the estimated $\hat G$ is adjusted to satisfy this constraint. Thus, $G$ and $\{\alpha_j(s)\}$ are identifiable only up to an orthonormal transformation.

Second, by using the smoothing method in Ruppert & Wand (1994), we smooth the individual functions of η(s) based on the updated residual matrix as

$$\hat\eta(s) = \sum_{k=1}^{n_v}\varrho_k(H_\eta, s)\{y(s_k) - X\hat B(s_k) - \hat G\hat A(s_k)\},$$

where $H_\eta$ is a fixed bandwidth matrix. Furthermore, we use the empirical covariance $\hat\Sigma_\eta(s, s') = (n - p - q)^{-1}\sum_{i=1}^{n}\hat\eta_{i\cdot}(s)\hat\eta_{i\cdot}^T(s')$ to estimate $\Sigma_\eta(s, s')$.

Third, to select the optimal bandwidth in $\hat B(s)$ and $\hat A(s)$, we use leave-one-curve-out cross-validation, whereas for the optimal bandwidth in $\hat\eta(s)$, we use the generalized cross-validation score method (Zhang & Chen, 2007; Zhu et al., 2012). Moreover, we standardize all covariates, as well as all functional features, to have mean zero and standard deviation one. Finally, we choose a common bandwidth for all covariates and features. More details can be found in the Supplementary Material.

Fourth, since the number of latent factors, $q$, is unknown, we consider four different methods: the analytical-asymptotic approach (Johnstone, 2001), a permutation version of parallel analysis (Buja & Eyuboglu, 1992), the eigenvalue difference method (Onatski, 2010) and the bi-cross-validation method (Owen & Wang, 2016). We compare the four methods in terms of detection accuracy and computation time in the simulation studies, and select the one with the best performance in the rest of our data analyses.
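For intuition, the following is a deliberately simplified eigenvalue-gap rule in the spirit of Onatski (2010); it is not his calibrated procedure, and the threshold default below is purely illustrative.

```python
import numpy as np

def eigen_gap_q(R_bar, q_max=10, threshold=None):
    """Pick the number of factors from gaps between consecutive eigenvalues (simplified sketch).

    R_bar : (n, m) extended residual matrix with min(n, m) > q_max
    q_max : largest number of factors considered
    """
    m = R_bar.shape[1]
    eigvals = np.linalg.svd(R_bar, compute_uv=False) ** 2 / m   # eigenvalues of R_bar R_bar^T / m
    gaps = eigvals[:q_max] - eigvals[1:q_max + 1]               # differences of consecutive eigenvalues
    if threshold is None:
        threshold = 5.0 * np.median(gaps)                       # crude default, for illustration only
    large = np.where(gaps > threshold)[0]
    return 0 if large.size == 0 else int(large.max() + 1)       # last index with a large gap
```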

2.4. Inference procedure

We consider the following linear hypotheses on B(s):

$$H_0: C\,\mathrm{vec}\{B(s)\} = b_0(s)\ \text{for all}\ s \in S \quad \text{versus} \quad H_1: C\,\mathrm{vec}\{B(s)\} \neq b_0(s)\ \text{for some}\ s \in S. \qquad (6)$$

Here C is an r × Jp matrix with rank r, vec(·) denotes the vectorization of a given matrix and b0(s) is an r × 1 vector of functions. The global test statistic Tn for (6) is defined as

$$T_n = \int_S T_n(s)\,p(s)\,ds \quad \text{with} \quad T_n(s) = \zeta^T(s)\bigl[C\{\hat\Sigma_\eta(s, s)\otimes(\hat M\hat M^T)\}C^T\bigr]^{-1}\zeta(s), \qquad (7)$$

where $\zeta(s) = C\,\mathrm{vec}\{\hat B(s)\} - b_0(s)$, $\hat M = (I_p, 0_{p\times q})(\hat W^T\hat W)^{-1}\hat W^T$ and $\hat W = (X, \hat G)$.
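For completeness, the following sketch evaluates (7) on the observed grid by numerical quadrature; the array layout is our convention and the inputs are assumed to come from the fitted model.

```python
import numpy as np

def global_statistic(B_hat, Sigma_eta, M_hat, Cmat, b0, weights):
    """Global Wald-type statistic T_n in (7), approximated on the grid.

    B_hat     : (p, J, nv) estimated coefficient functions hat{B}(s_k)
    Sigma_eta : (J, J, nv) diagonal slices hat{Sigma}_eta(s_k, s_k)
    M_hat     : (p, n) matrix hat{M};  Cmat : (r, J*p);  b0 : (r, nv)
    weights   : (nv,) quadrature weights approximating p(s_k) ds
    """
    MMt = M_hat @ M_hat.T                                   # hat{M} hat{M}^T
    Tn = 0.0
    for k in range(B_hat.shape[2]):
        vecB = B_hat[:, :, k].T.reshape(-1)                 # vec{B(s_k)}: columns stacked
        zeta = Cmat @ vecB - b0[:, k]                       # C vec{B(s_k)} - b0(s_k)
        V = Cmat @ np.kron(Sigma_eta[:, :, k], MMt) @ Cmat.T
        Tn += weights[k] * zeta @ np.linalg.solve(V, zeta)  # T_n(s_k) p(s_k) delta_s
    return Tn
```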

As the asymptotic distribution of $T_n$ under $H_0$ is quite complicated, it is difficult to derive the percentiles of $T_n$ directly from the corresponding asymptotic results. To address this issue, a wild bootstrap method (Zhu et al., 2012) is developed, consisting of the following four steps.

Step 1. Fit model (1) under $H_0$ on $X$ and $\{y(s_k)\}_{k=1}^{n_v}$, yielding $\hat G$, $\hat A(s)$, $\hat B(s)$, $\hat\eta(s)$, $\hat\epsilon(s)$ and the global test statistic $T_n$.

Step 2. Generate random vectors $\tau^{(m)} = (\tau_1^{(m)}, \ldots, \tau_n^{(m)})^T$ and $\tau^{(m)}(s_k) = \{\tau_1^{(m)}(s_k), \ldots, \tau_n^{(m)}(s_k)\}^T$ independently from the standard normal distribution $N(0, I_n)$ for $k = 1, \ldots, n_v$, and then construct

$$y^{(m)}(s_k) = X\hat B(s_k) + \hat G\hat A(s_k) + \mathrm{diag}(\tau^{(m)})\,\hat\eta(s_k) + \mathrm{diag}\{\tau^{(m)}(s_k)\}\,\hat\epsilon(s_k),$$

where diag(τ) denotes a diagonal matrix with the elements of τ lying along the diagonal.

Step 3. Based on $X$ and $\{y^{(m)}(s_k)\}_{k=1}^{n_v}$, recalculate $\hat B^{(m)}(s)$ and the global test statistic $T_n^{(m)}$.

Step 4. Repeat the previous two steps $M$ times to obtain $\{T_n^{(1)}, \ldots, T_n^{(M)}\}$, which yields the empirical p-value $p = M^{-1}\sum_{m=1}^{M}1(T_n^{(m)} > T_n)$.
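A minimal sketch of the four bootstrap steps above, written for a single feature; `refit_statistic` stands in for the user's re-estimation of model (1) under the null and is not a function from the paper.

```python
import numpy as np

def wild_bootstrap_pvalue(Tn_obs, Xb, GA, eta, eps, refit_statistic, M=500, seed=0):
    """Wild-bootstrap p-value for the global statistic T_n.

    Xb, GA, eta, eps : (n, nv) arrays giving X hat{B}(s_k), hat{G} hat{A}(s_k),
                       hat{eta}(s_k) and hat{eps}(s_k) from the null fit (one feature shown).
    refit_statistic  : callable mapping bootstrapped responses (n, nv) to T_n^(m).
    """
    rng = np.random.default_rng(seed)
    n, nv = eta.shape
    exceed = 0
    for _ in range(M):
        tau = rng.standard_normal(n)                        # subject-level multipliers tau^(m)
        tau_k = rng.standard_normal((n, nv))                # location-level multipliers tau^(m)(s_k)
        Y_boot = Xb + GA + tau[:, None] * eta + tau_k * eps
        exceed += refit_statistic(Y_boot) > Tn_obs
    return exceed / M                                       # empirical p-value
```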

Construction of simultaneous confidence bands for the coefficient functions is also of great interest in statistical inference for our proposed model. For a given level $\vartheta \in (0, 1)$, we construct the $1 - \vartheta$ simultaneous confidence band for $\beta_{tj}(s)$,

$$\bigl\{\hat\beta_{tj}(s) - n^{-1/2}C_{tj}(\vartheta),\ \hat\beta_{tj}(s) + n^{-1/2}C_{tj}(\vartheta)\bigr\}, \quad 1 \le t \le p,\ 1 \le j \le J,$$

where Ctj(ϑ) is a scalar, which is to be determined. Here an efficient resampling method (Kosorok, 2003; Zhu et al., 2007, 2012) is developed to approximate Ctj(ϑ) as follows.

Step 1. Fit model (1) on $X$ and $\{y(s_k)\}_{k=1}^{n_v}$, yielding the residuals $v_{\cdot j}(s) = y_{\cdot j}(s) - X\hat\beta_j(s) - \hat G\hat\alpha_j(s)$.

Step 2. Generate the random vector $\tau^{(m)} = (\tau_1^{(m)}, \ldots, \tau_n^{(m)})^T$ from the standard normal distribution $N(0, I_n)$, and then construct $\omega_{tj}^{(m)}(s) = n^{-1/2}e_t^T\hat M\,\mathrm{diag}(\tau^{(m)})\sum_{k=1}^{n_v}\varrho_k(H, s)\,v_{\cdot j}(s_k)$, where $e_t$ is a $p \times 1$ vector with the $t$th element equal to 1 and all others equal to 0.

Step 3. Repeat the second step $M$ times to obtain $\{\sup_{s}\omega_{tj}^{(1)}(s), \ldots, \sup_{s}\omega_{tj}^{(M)}(s)\}$, and use their $1 - \vartheta$ empirical percentile to estimate $C_{tj}(\vartheta)$.
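The critical value $C_{tj}(\vartheta)$ can be approximated as below; this sketch assumes the smoothed residual curves have already been evaluated on the grid, follows Steps 2 and 3 above, and uses names of our choosing.

```python
import numpy as np

def band_critical_value(M_hat, smoothed_resid, t, vartheta=0.05, M=500, seed=0):
    """Approximate C_tj(vartheta) for the simultaneous confidence band of beta_tj(s).

    M_hat          : (p, n) matrix hat{M}
    smoothed_resid : (n, nv) array whose k-th column is sum_{k'} rho_{k'}(H, s_k) v_.j(s_{k'})
    t              : index of the covariate whose coefficient function is banded
    """
    rng = np.random.default_rng(seed)
    n = smoothed_resid.shape[0]
    sups = np.empty(M)
    for m in range(M):
        tau = rng.standard_normal(n)                            # multiplier vector tau^(m)
        omega = (M_hat[t] * tau) @ smoothed_resid / np.sqrt(n)  # omega_tj^(m)(s_k) on the grid
        sups[m] = omega.max()                                   # sup over the grid, as in Step 3
    return np.quantile(sups, 1 - vartheta)                      # 1 - vartheta empirical percentile
```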

3. Asymptotic properties

We systematically investigate the asymptotic properties of all estimators proposed in § 2.2 and inference procedures in § 2.4. Assumptions used to facilitate the technical details can be found in the Supplementary Material.

The following theorem tackles the theoretical properties of B^(s) and G^. The detailed proof can be found in the Supplementary Material.

Theorem 1. Under Assumptions A1–A7 in the Supplementary Material, we have the following results.

  1. The columns of $\hat G$ span the same column space as the columns of $Z$ in probability.

  2. It holds that $n^{1/2}[\{I_J\otimes(\hat M\hat M^T)^{-1/2}\}\,\mathrm{vec}\{\hat B(s) - E(\hat B(s))\} : s \in S]$ converges weakly to a centred Gaussian process with covariance function $\Sigma_\eta(s, s')\otimes I_p$.

The following theorem derives the asymptotic distribution of global test statistic Tn in (7) under the null hypothesis and its asymptotic power under local alternative hypotheses. The detailed proof can be found in the Supplementary Material.

Theorem 2. Under Assumptions A1–A9 in the Supplementary Material, we have the following results.

  1. It holds that $T_n$ converges in distribution to $\int_S \xi(s)^T\xi(s)\,ds$ as $n \to \infty$, where $\xi(s)$ is a centred Gaussian process.

  2. It holds that $P(T_n > T_{n,\vartheta} \mid H_{1n}) \to 1$ as $n \to \infty$ for a sequence of local alternatives $H_{1n}: C\,\mathrm{vec}\{B(s)\} - b_0(s) = n^{-\tau/2}\zeta(s)$, where $\tau$ is any scalar in $[0, 1)$, $T_{n,\vartheta}$ is the upper $100\vartheta$ percentile of $T_n$ under $H_0$ and $0 < \int_S \|\zeta(s)\|^2\,ds < \infty$.

4. Simulation studies

To examine the proposed methods, we generated synthetic curves from the model

$$y_{ij}(s_k) = x_i^T\beta_j(s_k) + z_i\gamma_j(s_k) + \eta_{ij}(s_k) + \epsilon_{ij}(s_k) \quad (j = 1, 2),$$

where $s_1 = 0 \le s_2 \le \cdots \le s_{n_v} = 1$, in which we independently simulated $\tilde s_k \sim U(0, 1)$ for $k = 2, \ldots, n_v - 1$ and sorted them to obtain $\{s_k : k = 2, \ldots, n_v - 1\}$. We set $x_i = (1, x_{i1}, x_{i2}, x_{i3})^T$, in which we independently simulated $x_{i1} \sim \mathrm{Ber}(0.5)$, $x_{i2} \sim N(0, 1)$ and $x_{i3} \sim N(0, 1)$ for $i = 1, \ldots, n$. We simulated $z_i$ as

$$z_i = x_i^T\varphi + \omega_i, \quad \omega_i \sim N(0, 1) \quad (i = 1, \ldots, n),$$

where $\varphi = \{u_1(2b_1 - 1), u_2(2b_2 - 1), u_3(2b_3 - 1), u_4(2b_4 - 1)\}^T$ with the $b_l$ independently generated from $\mathrm{Ber}(0.5)$. We independently simulated the $u_l$ for all $l$ and considered four different simulation scenarios for $u_l$: (i) $u_l = 0$; (ii) $u_l \sim U(0, 0.2)$; (iii) $u_l \sim U(0.2, 0.5)$; and (iv) $u_l \sim U(0.5, 1)$. These scenarios correspond to the hidden factor $Z$ being (i) independent of $X$, (ii) weakly correlated with $X$, (iii) moderately correlated with $X$ and (iv) highly correlated with $X$, respectively. The $\eta_{ij}(s)$ admits the Karhunen–Loève expansion $\eta_{ij}(s) = \xi_{ij1}\psi_{j1}(s) + \xi_{ij2}\psi_{j2}(s)$, where the $\psi_{jl}(s)$ are eigenfunctions and $\xi_{ijl} \sim N(0, 0.5)$ for $j = 1, 2$ and $l = 1, 2$. We simulated $(\epsilon_{i1}, \epsilon_{i2})^T \sim N\{(0, 0)^T, 0.5\,\mathrm{diag}(\sigma_1^2, \sigma_2^2)\}$, where $\sigma_l^2 \sim \text{Inverse-Gamma}(10, 9)$ for $l = 1, 2$. Also, we set the following functions:

$$\beta_1(s) = \{3s^2,\ 3(1 - s)^2,\ 6s(1 - s),\ s^2\}^T, \qquad \gamma_1(s) = 2\sin(\pi s),$$
$$\beta_2(s) = \{1 - 2(s - 0.5)^2,\ 1.5s,\ 3s^2,\ 2s^3\}^T, \qquad \gamma_2(s) = 2\cos(2\pi s),$$
$$\psi_{11}(s) = 0.5, \quad \psi_{12}(s) = s - 0.5, \quad \psi_{21}(s) = 2s - 1, \quad \psi_{22}(s) = 1.$$

Throughout the simulation studies, we set n = 50 and nv = 2000. Finally, we generated 200 datasets for each simulation scenario.
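To illustrate the simulation design, here is a sketch that generates one dataset for the first feature (j = 1) under a chosen range for u_l; the helper name and the reading of σ_1² as a single inverse-gamma draw for that feature are our assumptions.

```python
import numpy as np

def simulate_feature1(n=50, nv=2000, u_range=(0.2, 0.5), seed=0):
    """Generate (s, X, z, y1) for feature j = 1 under the simulation design of Section 4."""
    rng = np.random.default_rng(seed)
    s = np.sort(np.concatenate(([0.0], rng.uniform(size=nv - 2), [1.0])))     # grid points on [0, 1]
    X = np.column_stack([np.ones(n),                       # intercept
                         rng.binomial(1, 0.5, n),          # x_i1 ~ Ber(0.5)
                         rng.standard_normal(n),           # x_i2 ~ N(0, 1)
                         rng.standard_normal(n)])          # x_i3 ~ N(0, 1)
    u = rng.uniform(*u_range, size=4)
    b = rng.binomial(1, 0.5, size=4)
    phi = u * (2 * b - 1)
    z = X @ phi + rng.standard_normal(n)                   # hidden factor correlated with X
    beta1 = np.vstack([3 * s**2, 3 * (1 - s)**2, 6 * s * (1 - s), s**2])      # (4, nv)
    gamma1 = 2 * np.sin(np.pi * s)
    psi11, psi12 = np.full(nv, 0.5), s - 0.5               # eigenfunctions for j = 1
    xi = rng.normal(0.0, np.sqrt(0.5), size=(n, 2))        # scores xi_ijl ~ N(0, 0.5)
    eta1 = np.outer(xi[:, 0], psi11) + np.outer(xi[:, 1], psi12)
    sigma1_sq = 1.0 / rng.gamma(10.0, 1.0 / 9.0)           # sigma_1^2 ~ Inverse-Gamma(10, 9)
    eps1 = rng.standard_normal((n, nv)) * np.sqrt(0.5 * sigma1_sq)
    y1 = X @ beta1 + np.outer(z, gamma1) + eta1 + eps1
    return s, X, z, y1
```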

We compare our method with two other methods: the multivariate varying coefficient model of Zhu et al. (2012) and the confounder adjustment method of Wang et al. (2017). For the method in Wang et al. (2017), the curve data are treated as multivariate responses. To evaluate the finite-sample performance of each method, we consider the integrated square error $\sum_{l=1}^{2}\int_0^1\|\hat\beta_l(s) - \beta_l(s)\|^2\,ds$, where $\hat\beta_l(s)$ denotes any estimator of $\beta_l(s)$. For both the method in Wang et al. (2017) and our method, the eigenvalue difference method (Onatski, 2010) is used to estimate the number of factors.

Fig. 2 presents the comparison results for the three methods in all four scenarios. Inspecting Fig. 2 reveals the following results. First, compared with the method in Zhu et al. (2012), both the method in Wang et al. (2017) and our method are very stable and robust to the correlation between X and Z. Second, our method outperforms that in Wang et al. (2017) for all scenarios, indicating that it is critically important to use the functional data analysis tools. Third, when Z and X are independent, the difference between our method and that of Zhu et al. (2012) is very small. Fourth, when the correlation between X and Z is high in scenario (iv), the integrated square errors based on the method in Zhu et al. (2012) dramatically increase. In contrast, those of our method are much smaller even though there are a few outliers, which are caused by the uncertainty of estimating q, as detailed below.

Fig. 2.

Simulation results for comparisons of the proposed and competing methods on synthetic curve data in terms of the integrated square error. Four scenarios were considered: the hidden factors Z are (i) independent of X, (ii) weakly correlated with X, (iii) moderately correlated with X and (iv) highly correlated with X. The methods of Wang et al. (2017) (blue) and Zhu et al. (2012) (orange) are compared with our method (green).

We compare four estimation methods for the number of hidden factors: the analytical-asymptotic approach in Johnstone (2001), the permutation version of parallel analysis in Buja & Eyuboglu (1992), the eigenvalue difference method in Onatski (2010) and the bi-cross-validation method in Owen & Wang (2016). Table 1 reports the estimation results for the four methods. We observe that the last three methods achieve almost 100% estimation accuracy, greatly outperforming the analytical-asymptotic approach, whose estimation accuracy is only around 30%. In addition, in terms of average computation time, the eigenvalue difference method (Onatski, 2010) is much more efficient than the bi-cross-validation method (Owen & Wang, 2016) and the parallel analysis approach (Buja & Eyuboglu, 1992). Thus, the eigenvalue difference method is used in subsequent analyses.

Table 1.

Comparison of four methods for estimating the number of hidden factors with q = 1. The average computation time for each method is reported as well. In the four scenarios considered Z is (i) independent of X, (ii) weakly correlated with X, (iii) moderately correlated with X and (iv) highly correlated with X

Method Scenario: (i) (ii) (iii) (iv) Average computation time (seconds per dataset)
Johnstone (2001) 62/200 65/200 64/200 64/200 0.1
Buja & Eyuboglu (1992) 190/200 191/200 192/200 191/200 70.2
Onatski (2010) 200/200 200/200 198/200 198/200 0.8
Owen & Wang (2016) 200/200 196/200 196/200 196/200 9.7

Since there are some outliers in Fig. 2 for our method when $Z$ and $X$ are highly correlated, we investigate the sensitivity of our method with respect to the misspecification of $q$ under the four scenarios. We consider three choices of $q$, namely $q = 0$, 1 and 2, which represent the underestimated, true and overestimated values, respectively. Fig. 3 presents the box plots of integrated square errors for all $q$ values under the four scenarios. There are three major findings. First, when the hidden factor $Z$ is independent of or weakly correlated with $X$, the integrated square errors are relatively stable even when $q$ is misspecified. Second, when $Z$ is moderately or highly correlated with $X$, the integrated square errors dramatically increase for misspecified $q$ values. Third, the underestimated $q = 0$ has a much larger effect on the integrated square errors than the overestimated $q = 2$.

Fig. 3.

Simulation results for the sensitivity analysis of our method under the three choices of q, q = 0 (blue), q = 1 (orange) and q = 2 (green) in the four scenarios in which Z is (i) independent of X, (ii) weakly correlated with X, (iii) moderately correlated with X and (iv) highly correlated with X.

We examine the correlation between the space spanned by the columns of the detected latent factors and that spanned by the columns of the true $Z$. Fig. 4 presents the simulation results for the four scenarios: the absolute values of the Pearson correlation between $\hat G$ and $Z$ are all greater than 0.90, indicating that the detected factors are consistent with the true ones. Moreover, as the correlation between $Z$ and $X$ gets higher, the absolute values of the Pearson correlation coefficient get closer to 1.

Fig. 4.

Simulation results for the absolute values of the Pearson correlation between G^ and Z in the four scenarios in which Z is (i) independent of X, (ii) weakly correlated with X, (iii) moderately correlated with X and (iv) highly correlated with X.

We examine the Type I and Type II error rates of $T_n$. For the sake of space, we only consider the third scenario (iii), in which $\varphi = (u_1, -u_2, u_3, -u_4)^T$ with the $u_l$ independently simulated from $U(0.2, 0.5)$ for $l = 1, 2, 3, 4$. Moreover, we fix all other parameters at the values specified above, except that we set $\beta_{14}(s) = -cs^2$ and $\beta_{24}(s) = -2cs/3$, where $c$ is a scalar specified below. We want to test the following hypotheses:

$$H_0: \beta_{14}(s) = \beta_{24}(s) = 0\ \text{for all}\ s \quad \text{versus} \quad H_1: \beta_{14}(s) \neq 0\ \text{or}\ \beta_{24}(s) \neq 0\ \text{for at least one}\ s. \qquad (8)$$

We set $c = 0$ to assess the Type I error rates of $T_n$, and set $c = 0.1$, 0.2, 0.3, 0.4 and 0.5 to examine the power of $T_n$. We set the sample size to $n = 100$ and 200. For each case, 500 bootstrap replications were generated to construct the empirical distribution of $T_n$ under $H_0$. Fig. 5 presents the power curves at the significance levels $\alpha = 0.05$ and 0.01. The Type I error rates of $T_n$ based on the wild bootstrap method are close to the nominal levels for moderate sample sizes, $n = 100$ and 200, at both $\alpha = 0.01$ and 0.05. As expected, the power increases with the sample size.

Fig. 5.

Power curves for the hypothesis testing problem (8) based on our method with different choices of c and levels of α: α = 0.05 (red) and α = 0.01 (blue). Two horizontal dashed lines are added to indicate the levels α = 0.05 (orange) and α = 0.01 (green).

Finally, we investigate the coverage probabilities of simultaneous confidence bands for the functional coefficients in B(s) based on the resampling method. We only consider the third scenario (iii). We fix all parameters specified above except that we set n = 200 and the number of grid points nv = 200 and 2000. We calculated the simultaneous confidence bands for each component in B(s) based on 200 replications. Table 2 summarizes the empirical coverage probabilities at α = 0.05 and 0.01. As expected, the coverage probabilities improve as the number of grid points nv increases.

Table 2.

Empirical coverage probabilities of 1 − α simultaneous confidence bands

α nv β11 β12 β13 β14 β21 β22 β23 β24
0.05 200 0.935 0.920 0.925 0.920 0.915 0.915 0.930 0.940
2000 0.945 0.950 0.950 0.950 0.945 0.945 0.955 0.950
0.01 200 0.985 0.990 0.995 0.980 0.980 0.995 0.990 0.990
2000 0.990 0.995 0.990 0.995 0.995 0.995 0.990 0.995

5. Real data analysis

5.1. Data processing

In this data analysis, we consider 936 MRI scans from normal controls and individuals with mild cognitive impairment or Alzheimer’s disease from the three phases ADNI-1, ADNI-GO and ADNI-2. Table 3 summarizes the demographic information of all the subjects.

Table 3.

Hippocampal surface data: demographic information of 936 subjects

Phase ADNI-1 ADNI-GO ADNI-2 Total
Size 800 24 112 936
Gender (F/M) 465/335 13/11 61/51 539/397
Handedness (R/L) 738/62 20/4 103/9 861/75
Age range (years) [58, 95] [55, 84] [53, 87] [53, 95]
Education length range (years) [4, 20] [12, 20] [8, 20] [4, 20]
Disease (NC/MCI/AD) 224/389/187 0/24/0 29/58/25 253/471/212

NC, normal control; MCI, mild cognitive impairment; AD, Alzheimer’s disease.

We processed the MRI data using standard steps and generated one-to-one hippocampal surface registrations following Shi et al. (2013). Then, we computed various surface statistics on the registered surfaces, such as multivariate tensor-based morphometry statistics, which retain the full tensor information of the deformation Jacobian matrix, together with the radial distance, which retains information on the deformation along the surface normal direction. More detailed image data processing procedures can be found in the Supplementary Material.

5.2. Data analysis

The hippocampus is believed to be involved in memory, spatial navigation and behavioural inhibition. In Alzheimer’s disease, the hippocampus is one of the first regions of the brain to be affected, leading to the confusion and loss of memory so commonly seen in the early stages of the disease (Kong et al., 2019). The objective of this data analysis is to integrate the data from three different phases, i.e., ADNI-1, ADNI-GO and ADNI-2, and examine the effects of clinical and demographic variables on either the left or right hippocampus. Moreover, the hidden factors are expected to be recovered and discussed. Before conducting this analysis, we would like to check whether there is any heterogeneity caused by phases. According to Fig. 1 and the related discussion in § 1, this study-level heterogeneity does exist in the ADNI hippocampal surface data. Therefore, the phase information should be included among the predictors in the data analysis.

We applied our new method with either the left or right hippocampal surface data as the functional responses. The method in Zhu et al. (2012) was used for comparison. Specifically, we consider four imaging measurements: the logged radial distance and three tensor-based morphometry statistics measured over 7500 vertices on the hippocampal surface (3750 on each side). In this case, we have J = 4. Moreover, we included an intercept, gender, handedness, education length, age, diagnostic information and phase information as predictors in X. The corresponding coefficients are treated as functions on the hippocampal surface, and the Gaussian kernel function is adopted in the estimation procedure. Subsequently, we test the effects of all the primary variables on the four functional responses on the hippocampal surfaces. We calculated the global test statistic for each predictor and used 500 replications in the wild bootstrap approach. Table 4 summarizes the corresponding p-values, where p-values less than 5% are in red. Given the significance level 0.05, both disease status, Alzheimer’s disease versus normal control, and age are found to be significant on the left hippocampal surface based on the method in Zhu et al. (2012). In contrast, more predictors are found to be significant based on our method. For example, a significant age effect is found on the left hippocampal surface, while both the education length effect and the disease effect, Alzheimer’s disease versus normal control, are significant on both the left and right hippocampal surfaces. Among all these variables, education length is found to be significant by our method, but not by the competing method; education length has been reported as an important factor for changes in hippocampal structure (Arenaza-Urquijo et al., 2013).

Table 4.

Hippocampal surface data: comparison of p-values for primary variables

Variable p-value
Left hippocampus Right hippocampus
Zhu et al. (2012) Our method Zhu et al. (2012) Our method
Gender 0.212 0.092 0.234 0.116
Handedness 0.652 0.102 0.704 0.082
Education length 0.132 0.036 0.244 0.048
Age 0.048 0.048 0.096 0.052
MCI versus NC 0.156 0.066 0.082 0.064
AD versus NC 0.046 0.034 0.054 0.040
ADNI-GO versus ADNI-1 0.134 0.112 0.136 0.120
ADNI-2 versus ADNI-1 0.118 0.106 0.112 0.114

NC, normal control; MCI, mild cognitive impairment; AD, Alzheimer’s disease.

Furthermore, we are also interested in detecting significant subregions by using the local test statistic and the false discovery rate (Benjamini & Yekutieli, 2001). Fig. 6 presents the false discovery rate adjusted −log10(p)-value maps. To better understand the significant subregions, we consider the cytoarchitectonic subregions of the hippocampal formation mapped on blank MR-based models at 3 T (Frisoni et al., 2008, Fig. 2). All the significant subregions associated with age and disease, circled in red, are found in the CA1 subfield. Similar hippocampal subregions were found to be affected by Alzheimer’s disease (Frisoni et al., 2008), indicating that our findings are in agreement with those in the literature.

Fig. 6.

Hippocampal surface data: adjusted −log10(p)-value maps corresponding to three covariates of interest: age, education level, and diagnosis status.

We investigate the potential hidden factors estimated by our method. Applying the eigenvalue difference method yields three hidden factors. Table 5 presents the correlation between primary variables and detected hidden factors, where p-values less than 5% are in red. Specifically, we calculated the Pearson correlation between two continuous variables and the polyserial correlation between a continuous variable and a discrete one. Inspecting Table 5 reveals that on both left and right hippocampal surfaces, the detected factors are highly related to education length, age, disease status and phase information. In contrast, for the method in Zhu et al. (2012), the key assumption that the hidden factors and primary variables are uncorrelated is inappropriate.

Table 5.

Hippocampal surface data: correlations between primary variables and detected hidden factors and their associated p-values in parentheses

Primary variable Hidden factor
Left hippocampus Right hippocampus
Factor 1 Factor 2 Factor 3 Factor 1 Factor 2 Factor 3
Gender −0.038 (0.358) 0.015 (0.724) −0.048 (0.239) 0.006 (0.883) 0.023 (0.582) −0.045 (0.278)
Handedness −0.013 (0.835) −0.041 (0.517) 0.076 (0.209) 0.041 (0.494) −0.055 (0.382) 0.047 (0.435)
Education length −0.021 (0.531) 0.024 (0.466) 0.090 (0.006) 0.058 (0.078) 0.014 (0.665) 0.074 (0.025)
Age 0.120 (<0.001) 0.089 (0.007) −0.079 (0.015) −0.163 (<0.001) 0.071 (0.030) −0.131 (<0.001)
MCI versus NC −0.045 (0.272) 0.061 (0.144) 0.020 (0.617) 0.064 (0.119) 0.003 (0.944) 0.062 (0.131)
AD versus NC 0.087 (0.041) −0.058 (0.228) 0.061 (0.507) −0.094 (0.039) −0.029 (0.530) −0.008 (0.853)
ADNI-GO versus ADNI-1 −0.305 (<0.001) 0.392 (<0.001) 0.215 (0.011) 0.440 (<0.001) −0.176 (0.064) 0.403 (<0.001)
ADNI-2 versus ADNI-1 −0.221 (<0.001) −0.318 (<0.001) 0.213 (<0.001) 0.271 (<0.001) −0.469 (<0.001) 0.466 (<0.001)

NC, normal control; MCI, mild cognitive impairment; AD, Alzheimer’s disease.

Finally, we investigate whether there are any other variables not included in our current analysis that may be strongly correlated with the latent factors. We consider seven new variables in three categories: ethnic group information (three dummy variables representing Asian, African American and White), marital status (three dummy variables representing widowed, divorced and never married) and retirement status. There are several reasons why we do not include these new regressors in the main model at the beginning. First, we only include a standard set of covariates, which have been widely considered in the existing literature (Kong et al., 2019), in the main model. Second, we apply our proposed method to detect hidden factors that cannot be explained by the existing covariates. Third, we correlate the hidden factors with the set of new regressors and find that these regressors can partially explain the factors; this process also illustrates the importance of our functional hybrid factor regression model. Another reason is that these new regressors contain many missing values: the missing data rates in the three categories are 9.8% for ethnic group information, 10.9% for marital status and 9.4% for retirement status. We observe that on the left hippocampal surface, the detected hidden factors are strongly correlated with all of these variables, whereas on the right hippocampal surface, the detected hidden factors are only correlated with marital status. More detailed results can be found in the Supplementary Material.

6. Discussion

The key assumption of our method is Assumption A6 in the Supplementary Material, which requires that the row vectors of $B(s)$ and the row vectors of $\Gamma(s)$ are orthogonal with respect to the underlying density function $p(s)$ after mean centring. Similar assumptions for model identification can be found in some existing methods (Sun et al., 2012; Lee et al., 2017). This assumption is reasonable in many imaging studies. For example, in neuroimaging data analysis, batch effects are usually caused by heterogeneity in imaging acquisition protocols, and their effect sizes would not be correlated with those of population differences or diagnostic status (Lee et al., 2017). Also, our simulation studies show that our method is robust even when this assumption is violated. Specifically, in our simulation settings, when $\|\int_S B(s)(I_J - P_J)\Gamma^T(s)\,p(s)\,ds\|_1 = 3.544$, indicating that this assumption does not hold, our method still outperforms the two competing methods.

Besides the assumption on the functional coefficients, the modelling of the latent factors $Z$ is also a key component of our method. In this paper, we treat the latent factors as fixed. However, to account for imaging heterogeneity, it would be more flexible to assume that the latent factors are random. For example, Wang et al. (2017) modelled the latent factors $Z$ through a linear regression on the primary variables $X$ with a normally distributed error term. Therefore, it is important to extend the model in this paper to handle random latent factors, which will be the focus of future work.

Another interesting topic is to extend our method to some unsupervised or semisupervised learning, whose goal is to recover the subgroup structure within the functional data when the subgroup information is unknown or not completely observable. It is challenging because unwanted variations may be correlated with the subgroup information. For example, it is of great interest to conduct the clustering analysis in terms of brain atrophy variations among patients with Alzheimer’s disease (Poulakis et al., 2018), and there is increasing evidence that the patients’ cluster information has strong association with some unknown factors like marital status (Sommerlad et al., 2018). Thus, it would be interesting to extend our model to simultaneously investigate the latent subgroup structure, while accounting for unknown latent factors. We leave these extensions to future research.


Acknowledgement

Huang gratefully acknowledges financial support from the National Science Foundation and Zhu from the National Institutes of Health. The authors are grateful to the editor, associate editor and reviewers for their feedback that helped to improve the manuscript.

Footnotes

Supplementary material

Supplementary Material available at Biometrika online includes another example illustrating the heterogeneity in different imaging datasets, assumptions of theorems, proofs of the theoretical results, and additional simulation and real data analysis results.

Contributor Information

C. HUANG, Department of Statistics, Florida State University, 117 N. Woodward Ave., Tallahassee, Florida 32304, U.S.A.

H. ZHU, Department of Biostatistics, The University of North Carolina at Chapel Hill, 135 Dauer Drive, Chapel Hill, North Carolina 27599, U.S.A.

References

  1. Arenaza-Urquijo EM, Landeau B, La Joie R, Mevel K, Mézenge F, Perrotin A, Desgranges B, Bartrés-Faz D, Eustache F & Chételat G (2013). Relationships between years of education and gray matter volume, metabolism and functional connectivity in healthy elders. NeuroImage 83, 450–7.
  2. Benjamini Y & Yekutieli D (2001). The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 29, 1165–88.
  3. Buja A & Eyuboglu N (1992). Remarks on parallel analysis. Mult. Behav. Res. 27, 509–40.
  4. Fan J & Gijbels I (1996). Local Polynomial Modelling and Its Applications. London: Chapman and Hall.
  5. Fortin J-P, Parker D, Tunç B, Watanabe T, Elliott MA, Ruparel K, Roalf DR, Satterthwaite TD, Gur RC & Gur RE (2017). Harmonization of multi-site diffusion tensor imaging data. NeuroImage 161, 149–70.
  6. Frisoni GB, Ganzola R, Canu E, Rüb U, Pizzini FB, Alessandrini F, Zoccatelli G, Beltramello A, Caltagirone C & Thompson PM (2008). Mapping local hippocampal changes in Alzheimer’s disease and normal ageing with MRI at 3 Tesla. Brain 131, 3266–76.
  7. Guillaume B, Wang C, Poh J, Shen MJ, Ong ML, Tan PF, Karnani N, Meaney M & Qiu A (2018). Improving mass-univariate analysis of neuroimaging data by modelling important unknown covariates: application to epigenome-wide association studies. NeuroImage 173, 57–71.
  8. Helmer C, Damon D, Letenneur L, Fabrigoule C, Barberger-Gateau P, Lafont S, Fuhrer R, Antonucci T, Commenges D & Orgogozo J (1999). Marital status and risk of Alzheimer’s disease: a French population-based cohort study. Neurology 53, 1953–8.
  9. Johnson WE, Li C & Rabinovic A (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–27.
  10. Johnstone IM (2001). On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist. 29, 295–327.
  11. Kochunov P, Jahanshad N, Sprooten E, Nichols TE, Mandl RC, Almasy L, Booth T, Brouwer RM, Curran JE & de Zubicaray GI (2014). Multi-site study of additive genetic effects on fractional anisotropy of cerebral white matter: comparing meta and megaanalytical approaches for data pooling. NeuroImage 95, 136–50.
  12. Kong D, An B, Zhang J & Zhu H (2019). L2RM: low-rank linear regression models for high-dimensional matrix responses. J. Am. Statist. Assoc. 115, 403–24.
  13. Kosorok MR (2003). Bootstraps of sums of independent but not identically distributed stochastic processes. J. Mult. Anal. 84, 299–318.
  14. Lee S, Sun W, Wright FA & Zou F (2017). An improved and explicit surrogate variable analysis procedure by coefficient adjustment. Biometrika 104, 303–16.
  15. Leek JT & Storey JD (2007). Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 3, e161.
  16. Leek JT & Storey JD (2008). A general framework for multiple testing dependence. Proc. Nat. Acad. Sci. U.S.A. 105, 18718–23.
  17. Lock EF, Hoadley KA, Marron JS & Nobel AB (2013). Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. Ann. Appl. Statist. 7, 523–42.
  18. Mirzaalian H, Ning L, Savadjiev P, Pasternak O, Bouix S, Michailovich O, Grant G, Marx C, Morey RA & Flashman L (2016). Inter-site and inter-scanner diffusion MRI data harmonization. NeuroImage 135, 311–23.
  19. Mueller SG, Weiner MW, Thal LJ, Petersen RC, Jack C, Jagust W, Trojanowski JQ, Toga AW & Beckett L (2005). The Alzheimer’s disease neuroimaging initiative. Neuroimaging Clin. N. Am. 15, 869–77.
  20. Onatski A (2010). Determining the number of factors from empirical distribution of eigenvalues. Rev. Econ. Statist. 92, 1004–16.
  21. Owen AB & Wang J (2016). Bi-cross-validation for factor analysis. Statist. Sci. 31, 119–39.
  22. Poulakis K, Pereira JB, Mecocci P, Vellas B, Tsolaki M, Kłoszewska I, Soininen H, Lovestone S, Simmons A, Wahlund L-O et al. (2018). Heterogeneous patterns of brain atrophy in Alzheimer’s disease. Neurobiol. Aging 65, 98–108.
  23. Ramsay JO & Silverman BW (2002). Applied Functional Data Analysis: Methods and Case Studies. New York: Springer.
  24. Ruppert D & Wand MP (1994). Multivariate locally weighted least squares regression. Ann. Statist. 22, 1346–70.
  25. Salimi-Khorshidi G, Smith SM, Keltner JR, Wager TD & Nichols TE (2009). Meta-analysis of neuroimaging data: a comparison of image-based and coordinate-based pooling of studies. NeuroImage 45, 810–23.
  26. Shi J, Thompson PM, Gutman B & Wang Y (2013). Surface fluid registration of conformal representation: application to detect disease burden and genetic influence on hippocampus. NeuroImage 78, 111–34.
  27. Sommerlad A, Ruegger J, Singh-Manoux A, Lewis G & Livingston G (2018). Marriage and risk of dementia: systematic review and meta-analysis of observational studies. J. Neurol. Neurosurg. Psychiat. 89, 231–8.
  28. Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, Downey P, Elliott P, Green J & Landray M (2015). UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779.
  29. Sun Y, Zhang NR & Owen AB (2012). Multiple hypothesis testing adjusted for latent variables, with an application to the AGEMAP gene expression data. Ann. Appl. Statist. 6, 1664–88.
  30. Sundström A, Westerlund O & Kotyrlo E (2016). Marital status and risk of dementia: a nationwide population-based prospective study from Sweden. BMJ Open 6, e008565.
  31. Van Essen DC, Smith SM, Barch DM, Behrens TE, Yacoub E, Ugurbil K & Consortium W-MH (2013). The WU-Minn Human Connectome Project: an overview. NeuroImage 80, 62–79.
  32. Wang J, Zhao Q, Hastie T & Owen AB (2017). Confounder adjustment in multiple hypothesis testing. Ann. Statist. 45, 1863–94.
  33. Wang J-L, Chiou J-M & Müller H-G (2016). Functional data analysis. Ann. Rev. Statist. 3, 257–95.
  34. Yu Q, Risk BB, Zhang K & Marron J (2017). JIVE integration of imaging and behavioral data. NeuroImage 152, 38–49.
  35. Zhang J & Chen J (2007). Statistical inference for functional data. Ann. Statist. 35, 1052–79.
  36. Zhu H, Ibrahim JG, Tang N, Rowe DB, Hao X, Bansal R & Peterson BS (2007). A statistical analysis of brain morphology using wild bootstrapping. IEEE Trans. Med. Imag. 26, 954–66.
  37. Zhu H, Li R & Kong L (2012). Multivariate varying coefficient model for functional responses. Ann. Statist. 40, 2634–66.
