ABSTRACT
The aim of this paper is to systematically investigate merging and ensembling methods for spatially varying coefficient mixed effects models (SVCMEM) in order to carry out integrative learning of neuroimaging data obtained from multiple biomedical studies. The "merged" approach involves training a single learning model using a comprehensive dataset that encompasses information from all the studies. Conversely, the "ensemble" approach involves creating a weighted average of distinct learning models, each developed from an individual study. We systematically investigate the prediction accuracy of the merged and ensemble learners under different degrees of interstudy heterogeneity. Additionally, we establish asymptotic guidelines for making strategic decisions about when to employ either of these models in different scenarios, along with deriving optimal weights for the ensemble learner. To validate our theoretical results, we perform extensive simulation studies. The proposed methodology is also applied to 3 large-scale neuroimaging studies.
Keywords: ensemble learner, interstudy heterogeneity, merged learner, neuroimaging, spatially varying coefficient mixed effects model
1. INTRODUCTION
With rapid imaging technology advancements, an array of extensive biomedical studies, including the UK Biobank (UKB) (Sudlow et al., 2015), Adolescent Brain Cognitive Development (ABCD) study (Casey et al., 2018), Alzheimer's Disease Neuroimaging Initiative (ADNI) (Weiner et al., 2017), and Human Connectome Project (HCP) (Somerville et al., 2018), are underway or completed. These studies encompass diverse data types: neuroimaging data, genetics, clinical records, and health details. The essential concern is how to integrate these multiview datasets across studies to enhance the prediction of neuroimaging outcomes and the identification of reliable imaging biomarkers for subsequent tasks, such as Alzheimer's disease detection (Zhu et al., 2023). However, attaining these goals is complex due to substantial interstudy heterogeneity stemming from varied sources. These include differences in data collection, study design, acquisition protocols, preprocessing pipelines, and study-specific elements, collectively hindering integrated data analysis. This challenge resonates across fields (Leek and Storey, 2007; Fortin et al., 2017; Zhang et al., 2020; Beer et al., 2020; Chen et al., 2022; Zhu et al., 2023; Hu et al., 2023). Notably, confounding factors such as device, acquisition parameters, and motion effects exert a larger influence on neuroimaging data than subtle brain change signals like age, gender, and disease predictors (Alfaro-Almagro et al., 2021). Hence, effectively addressing the challenge of interstudy heterogeneity becomes paramount when pursuing integrative learning of neuroimaging data across multiple studies.
Two principal strategies, namely the "merged learner" and the "ensemble learner," stand as key approaches to confront the issue of interstudy heterogeneity within integrative learning (Cai et al., 2020; Patil and Parmigiani, 2018). In the first strategy, data sourced from various studies are initially amalgamated into a unified dataset, which serves as the basis for training a single learning model. Within this context, either fixed-effect or random-effect models are commonly employed to train the merged learner. Techniques such as principal components analysis (PCA) (Price et al., 2006), confounder adjusted testing and estimation (CATE) (Wang et al., 2017), and direct surrogate variable analysis (dSVA) (Lee et al., 2017) can capture the presence of interstudy heterogeneity. Recently, dSVA was adapted to tackle interstudy heterogeneity within neuroimaging data (Guillaume et al., 2018). Furthermore, Huang and Zhu (2022) introduced a functional hybrid factor regression modeling framework that merges surrogate variable analysis with functional data analysis, forming a hybrid solution to effectively navigate the challenges posed by interstudy heterogeneity.
In the second strategy, the ensemble learner is usually constructed as a weighted average of learners estimated from the individual datasets. Analogous concepts have been applied in ensemble machine learning (Patil and Parmigiani, 2018), meta-analysis (Jackson and Riley, 2014), and fusion learning (Cai et al., 2020). For instance, fusion learning combines confidence distributions for the parameters of interest from different studies. It has been shown that under some specific settings (eg, linear mixed models), the ensemble learner outperforms the merged one when there is large heterogeneity across studies (Patil and Parmigiani, 2018). Conversely, the merged learner showcases reduced prediction error in comparison to the ensemble counterpart when dealing with studies that are relatively homogeneous (Lagani et al., 2016). Hence, it becomes intriguing to explore when and how to optimize the ensemble learner in more general settings, while taking into account specific criteria such as prediction accuracy (Guan et al., 2019) and asymptotic efficiency (Zeng and Lin, 2015).
In practice, there is significant interest in comparing the merged and ensemble learners in terms of predicting neuroimaging outcomes for external testing datasets. K-fold cross-validation (CV), a useful statistical learning technique, is commonly employed to assess prediction performance. Specifically, both learners undergo estimation on training datasets, and prediction metrics, such as mean squared error, are computed by applying the trained learners to the validation dataset. However, the effectiveness of traditional K-fold CV approaches may diminish due to variations in interstudy heterogeneity patterns across folds. While some K-fold CV variants, like stratified K-fold CV (Prusty et al., 2022), attempt to address this issue by incorporating additional information from observed confounding factors during dataset partitioning, evaluating prediction performance using K-fold CV methods remains challenging when interstudy heterogeneity is predominantly influenced by unknown study-specific random effects. Therefore, it is of great importance to derive strategy-decision guidelines that clarify which of the merged and ensemble learners performs better.
The primary goal of this paper is to conduct a comprehensive exploration of ensemble learning’s potential in analyzing neuroimaging data across multiple studies. To realize this objective, we embark on a specific case study involving multiview data encompassing genetic predictors [causal single nucleotide polymorphisms (SNPs)], demographic components, imaging data, and clinical factors. The aim is to predict imaging outcomes along the genu of the corpus callosum (GCC) for participants across 3 distinct biomedical studies: UKB, ABCD, and HCP studies. Imaging outcomes considered here include both fractional anisotropy (FA) and mean diffusivity (MD) values derived from diffusion-weighted magnetic resonance imaging (dMRI). Within this framework, we adopt a spatially varying coefficient mixed-effects model (SVCMEM). This model posits that imaging outcomes across studies have 4 distinct components: (i) fixed effects, (ii) study-specific random effects, (iii) subject-specific and location-specific spatial variations expressed through individual stochastic functions, and (iv) random noise. Notably, in contrast to earlier models for multistudy neuroimaging data that primarily consider univariate and multivariate phenotypes (Guan et al., 2019; Guillaume et al., 2018), our proposed SVCMEM explicitly accommodates the intricate multilevel variations inherent in imaging data encompassing locations, phenotypes, subjects, and studies. Further insights into SVCMEM are provided in Section 2.1. Building upon the neuroimaging data generation mechanism, we proceed to construct both the ensemble learner and the merged learner. Subsequently, we methodically compare their individual performances in terms of predicting neuroimaging outcomes. Furthermore, we derive the asymptotic strategy-decision guideline across various scenarios and determine the optimal weights for the ensemble learner. 
Finally, through simulation studies and real data analyses, we substantiate the effectiveness of the ensemble learner’s derived decisions in enhancing the accuracy of neuroimaging outcome predictions.
The paper is organized as follows: Section 2 introduces SVCMEM and outlines the estimation procedure for the merged and ensemble learners. Subsequently, we delve into exploring the theoretical properties of both learners in predicting imaging outcomes. In Section 3, simulation studies on synthetic curve data are presented to validate the theoretical results concerning the strategy-decision guideline for the two learners across different scenarios. Finally, in Section 4, we apply our method to an imaging genetics scenario using data from UKB, ABCD, and HCP studies.
2. METHOD
Suppose that we observe both neuroimaging data and a vector of covariates of interest from n unrelated subjects in K distinct neuroimaging studies such that there are n_k subjects in the k-th study for k = 1, …, K and n = ∑_{k=1}^{K} n_k. Without loss of generality, all the images have been registered to a common template. Let S be a region of interest (ROI) containing M grid points {s_m : m = 1, …, M}, which follow a common density π(s) for s ∈ S. It is worth mentioning that the method proposed in this paper can be readily extended to accommodate multi-ROI scenarios, where voxels from different ROIs are not obligated to share a common density. At each grid point s_m, we observe J imaging features for each subject, for example, FA and MD values. For the k-th neuroimaging study, let Y_k(s_m) be an n_k × J matrix including the J imaging features and X_k be an n_k × p full column rank matrix including all covariates of interest (eg, age, gender, disease status, and causal SNPs) as well as the intercept.
2.1. Spatially varying coefficient mixed effects model (SVCMEM)
We assume that multivariate neuroimaging outcomes are generated from the SVCMEM:

Y_k(s_m) = X_k B(s_m) + Z_k b_k(s_m) + η_k(s_m) + ε_k(s_m),  (1)

where B(s_m) is a p × J matrix representing fixed effects related to X_k, which are invariant across studies (such as age and gender). The Z_k = X_k Δ is an n_k × q design matrix for random effects, where Δ is a p × q matrix with entry δ_{jt} = 1 if the j-th predictor is included and associated with the t-th random effect, and b_k(s_m) is a q × J matrix representing the corresponding study-specific random effects. Moreover, η_k(s_m) is an n_k × J matrix that includes individual stochastic functions characterizing both subject-specific and location-specific spatial variability, and the ε_k(s_m)s are measurement errors. To further characterize the spatial correlations within the multivariate imaging phenotypes, similar to Zhu et al. (2012) and Huang and Zhu (2022), we assume that the rows in η_k(·) and ε_k(·) are mutually independent and identical copies of SP(0, Σ_η) and SP(0, Σ_ε), where SP(μ, Σ) denotes a stochastic process vector with mean function μ(s) and covariance function Σ(s, t). Moreover, Σ_ε(s, t) takes the form of S_ε(s) 1(s = t), where S_ε(s) is a diagonal matrix and 1(·) is the indicator function. It is also assumed that the elements in b_k(·), that is, the study-specific random-effect functions, are independent of η_k(·) and ε_k(·), and they are mutually independent copies of SP(0, Σ_b). This does not exclude correlations along the functional direction, as Σ_b(s, t) for s ≠ t is not required to be zero.
The proposed SVCMEM encompasses several established models as special instances. For instance, if all imaging outcomes originate from a single study, the SVCMEM simplifies to the multivariate varying coefficient model proposed by Zhu et al. (2012). In cases where only univariate imaging outcomes are considered, the SVCMEM transforms into the mixed-effect model proposed by Guan et al. (2019). Additionally, excluding subject-specific and location-specific spatial variability aligns the SVCMEM with the confounder-adjusted regression model introduced in Guillaume et al. (2018). Notably, in contrast to current models catering to multistudy neuroimaging data (Guan et al., 2019; Guillaume et al., 2018), our SVCMEM adeptly captures imaging variations across multiple dimensions: locations, phenotypes, subjects, and studies.
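To make the data-generating mechanism of the SVCMEM concrete, the following minimal sketch simulates one study from a schematic version of model (1) with a single imaging feature. It is written in Python rather than the R used for the paper's own code; the exponential covariance kernel, the choice of taking the random-effect design equal to the full covariate matrix, and all names, shapes, and variance parameters are illustrative assumptions, not the paper's specification.

```python
import numpy as np

def simulate_svcmem(rng, n_k, grid, B_fun, sd_b=0.5, sd_eta=0.3, sd_eps=0.1):
    """Draw one study from a schematic SVCMEM with J = 1 outcome.

    y_i(s_m) = x_i' B(s_m) + x_i' b(s_m) + eta_i(s_m) + eps_i(s_m),
    where b(.) is a study-level random coefficient curve, eta_i(.) are
    subject-level smooth curves, and eps_i is white-noise measurement error.
    """
    M, p = len(grid), 2
    X = rng.normal(size=(n_k, p))                     # covariates of interest
    B = np.stack([B_fun(s) for s in grid])            # (M, p) fixed-effect curves
    # Smooth correlated curves drawn via an exponential covariance kernel
    K = np.exp(-np.abs(grid[:, None] - grid[None, :]) / 0.2)
    L = np.linalg.cholesky(K + 1e-8 * np.eye(M))
    b = sd_b * (L @ rng.normal(size=(M, p)))          # study-specific effect curves
    eta = sd_eta * (L @ rng.normal(size=(M, n_k))).T  # subject-specific curves
    eps = sd_eps * rng.normal(size=(n_k, M))          # measurement error
    Y = X @ (B + b).T + eta + eps                     # (n_k, M) outcomes on the grid
    return X, Y
```

The study-specific curve b(·) is shared by all subjects within a study, which is exactly the source of interstudy heterogeneity that distinguishes the merged and ensemble learners below.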
2.2. Estimation procedure
We introduce the estimation procedure of the merged learner and the ensemble learner below.
Merging. Regarding the merged learner, we typically assume the absence of the study-specific random effects b_k(·), which implies relative homogeneity among imaging outcomes across all studies. Under this assumption, we initially consolidate observed data from all K studies and derive the estimator B̂_M(s) for B(s) based on this amalgamated dataset.

Let Y(s_m) and X be, respectively, the merged n × J imaging and n × p covariate matrices across the K studies. Given Y(s_m) and X, the multivariate local linear kernel smoothing technique (Zhu et al., 2012) is used to derive the weighted least squares (WLS) estimator of B(s). Some notation is introduced here. Let a^{⊗2} = a a^T for any vector a and A ⊗ B be the Kronecker product of 2 matrices A and B. In addition, denote z_H(s_m − s) = (1, (s_m − s)^T)^T and K_H(s_m − s) = |H|^{−1/2} K(H^{−1/2}(s_m − s)), where K(·) is the kernel function and H is the positive definite bandwidth matrix. Then, the WLS estimator of B(s) based on the merged data is given columnwise by

b̂_{M,j}(s) = (e_1^T ⊗ I_p) {∑_{m=1}^{M} K_H(s_m − s) z_m^{⊗2} ⊗ (X^T X)}^{−1} {∑_{m=1}^{M} K_H(s_m − s) (z_m ⊗ X^T) y_j(s_m)}, j = 1, …, J,  (2)

where z_m = z_H(s_m − s), y_j(s_m) is the j-th column of Y(s_m), and e_1 = (1, 0^T)^T. More detailed derivation of the estimator in (2) is provided in the supplementary materials.

Ensembling. Different from the merged learner, the requirement of homogeneity regarding b_k(·) does not apply to the ensemble learner. In fact, for the ensemble learner, individual estimators of B(s), denoted as B̂_1(s), …, B̂_K(s), are first derived based on the data from each of the K studies, respectively, and then the ensemble estimator of B(s), denoted as B̂_E(s), is calculated as a weighted average of the B̂_k(s)s. Specifically, the B̂_k(s)s and B̂_E(s) can be computed, respectively, as follows:

b̂_{k,j}(s) = (e_1^T ⊗ I_p) {∑_{m=1}^{M} K_{H_k}(s_m − s) z_{k,m}^{⊗2} ⊗ (X_k^T X_k)}^{−1} {∑_{m=1}^{M} K_{H_k}(s_m − s) (z_{k,m} ⊗ X_k^T) y_{k,j}(s_m)} and B̂_E(s) = ∑_{k=1}^{K} w_k B̂_k(s),  (3)

where H_1, …, H_K are the bandwidth matrices, z_{k,m} = z_{H_k}(s_m − s), and the weights w_1, …, w_K satisfy w_k ≥ 0 and ∑_{k=1}^{K} w_k = 1.

Throughout this paper, we standardize all covariates and imaging responses to have mean zero and SD one. The leave-one-curve-out CV (Zhang and Chen, 2007) is used to select H in B̂_M(s) and H_k in B̂_k(s). Finally, we set a common bandwidth, denoted as h, for all covariates and imaging responses (Huang et al., 2017).
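As a concrete illustration of the merging and ensembling steps in this subsection, here is a minimal numerical sketch. It is written in Python rather than the R used for the paper's own code, assumes a scalar grid with a Gaussian kernel and a scalar bandwidth, and uses hypothetical names and array shapes throughout; it sketches the generic local linear technique, not the authors' implementation.

```python
import numpy as np

def local_linear_vc(Y, X, grid, s0, h):
    """Local linear kernel WLS estimate of the coefficient matrix B(s0).

    Y: (n, M, J) imaging outcomes, X: (n, p) covariates,
    grid: (M,) scalar grid points, h: scalar bandwidth.
    All names and shapes are illustrative assumptions.
    """
    n, M, J = Y.shape
    p = X.shape[1]
    d = grid - s0
    w = np.exp(-0.5 * (d / h) ** 2)           # Gaussian kernel weights
    A = np.zeros((2 * p, 2 * p))              # weighted Gram matrix
    C = np.zeros((2 * p, J))                  # weighted cross-products
    for m in range(M):
        Zm = np.hstack([X, d[m] * X])         # local linear design: [X, (s_m - s0) X]
        A += w[m] * Zm.T @ Zm
        C += w[m] * Zm.T @ Y[:, m, :]
    theta = np.linalg.solve(A, C)             # stacked local intercept and slope
    return theta[:p]                          # local intercept = estimate of B(s0)

def merged_fit(studies, grid, s0, h):
    """Merged learner: pool all studies, then run one local linear fit."""
    Y = np.concatenate([S["Y"] for S in studies])
    X = np.concatenate([S["X"] for S in studies])
    return local_linear_vc(Y, X, grid, s0, h)

def ensemble_fit(studies, grid, s0, h, weights):
    """Ensemble learner: per-study fits combined by weights summing to 1."""
    fits = [local_linear_vc(S["Y"], S["X"], grid, s0, h) for S in studies]
    return sum(w * B for w, B in zip(weights, fits))
```

Because a local linear fit is exact for coefficient functions that are linear in s, noiseless data generated from such functions recover B(s0) up to numerical error, which provides a quick sanity check of the implementation.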
2.3. Prediction performance comparison
Suppose that we have n_0 newly arriving subjects forming an n_0 × p covariate matrix, denoted as X_0, and an n_0 × J imaging matrix, denoted as Y_0(s_m). Given X_0, Y_0(s_m) can be predicted based on either the merged learner or the ensemble learner. We are interested in comparing the merged and ensemble learners by using the squared prediction error (SPE) given by

SPE(B̂) = ∑_{m=1}^{M} ‖Y_0(s_m) − X_0 B̂(s_m)‖_F^2,  (4)

where ‖·‖_F is the Frobenius norm, B̂(s) = B̂_M(s) for the merged learner, and B̂(s) = B̂_E(s) for the ensemble learner. Furthermore, E{SPE(B̂)} can be decomposed as the sum of 3 key terms given by

E{SPE(B̂)} = T_0 + T_B(B̂) + T_V(B̂),  (5)

with T_B(B̂) = ∑_{m=1}^{M} ∑_{j=1}^{J} ‖X_0 Bias{b̂_j(s_m)}‖_2^2 and T_V(B̂) = ∑_{m=1}^{M} ∑_{j=1}^{J} tr{X_0 Σ_j(s_m) X_0^T}, where tr(·) is the trace of a given matrix, Bias{b̂_j(s)} is the asymptotic bias of the j-th column in B̂(s), and Σ_j(s) is the corresponding asymptotic conditional variance. The detailed derivation of the decomposition in (5) can be found in the supplementary materials. It follows from (5) that E{SPE(B̂)} is primarily determined by the terms T_B(B̂) and T_V(B̂), since T_0 only depends on the variations in the new dataset Y_0(·). Moreover, T_B(B̂) and T_V(B̂) are, respectively, determined by the squared bias and variance of the estimated functional coefficients.
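For concreteness, the SPE in (4) amounts to summing squared Frobenius norms of residual matrices over the grid. The following minimal Python sketch assumes hypothetical array shapes (Y0 as subjects × grid points × features, per-grid-point coefficient estimates stacked along the first axis) and omits any normalization the paper's exact definition may carry.

```python
import numpy as np

def squared_prediction_error(Y0, X0, B_hat):
    """Empirical SPE: sum over grid points of ||Y0(s_m) - X0 B_hat(s_m)||_F^2.

    Y0: (n0, M, J) test outcomes, X0: (n0, p) test covariates,
    B_hat: (M, p, J) estimated coefficient functions on the grid.
    Shapes are illustrative assumptions.
    """
    n0, M, J = Y0.shape
    spe = 0.0
    for m in range(M):
        resid = Y0[:, m, :] - X0 @ B_hat[m]   # (n0, J) residual matrix at s_m
        spe += np.sum(resid ** 2)             # squared Frobenius norm
    return spe
```

Plugging in the merged and ensemble coefficient estimates gives the two competing empirical SPEs used throughout the comparisons below.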
2.4. Theoretical analysis
In this section, we systematically explore the theoretical properties of B̂_M(s) and B̂_E(s), assuming a constant number of studies, K, while the minimum number of subjects across these studies, denoted as n_min, approaches infinity. The comprehensive proof and the underlying assumptions that facilitate the technical details are provided in the supplementary materials.
We initiate our examination by deriving the asymptotic bias and conditional variance of B̂_M(s) and B̂_E(s)
. We introduce
, satisfying
, where
represents a
identity matrix. Furthermore, let b̂_{M,j}(s) and b̂_{E,j}(s) denote the j-th columns of B̂_M(s) and B̂_E(s), respectively.
Theorem 1:
Suppose that Assumptions 1-7 in the supplementary materials hold. The following results hold:
(a) For the merged learner, the asymptotic bias of
is given by
(6) where the l-th element in
is defined as
with
being the
Hessian matrix of the
-th element in
. Furthermore, the asymptotic conditional covariance matrix,
, is given by
(7) where
,
is a diagonal matrix with the elements
on the main diagonal, and
is the j-th diagonal element in
.
(b) For the ensemble learner, the asymptotic bias of
is
(8) and the corresponding asymptotic conditional variances,
, is given by
(9) where
.
Theorem 1 has 2 important implications. First, the biases of the 2 learners are asymptotically comparable with each other. Second, the conditional variances of
and
can be decomposed into the study-specific variation and the subject-specific variation. Specifically, for the ensemble learner, its subject-specific and study-specific variations are equal to
and
, respectively. When
and
, the subject-specific and study-specific variations of
reduce to those of
.
Second, we compare the ensemble learner and the merged learner in terms of the expected SPE below. Let
be the average study-specific variation and
be the average subject-specific variation. We also define 
(10)
(11)
where
is a
vector with the l-th element being 1 and zero others for
.
Theorem 2:
Suppose that Assumptions 1-7 in the supplementary materials hold. The following results hold:
if
and
for
, then
asymptotically holds if and only if
;
are asymptotically valid when
and
hold;
are asymptotically valid when
and
hold.
Theorem 2 has several implications. First, in the equal-variances case, Theorem 2 (a) provides a necessary and sufficient condition for the ensemble learner to outperform the merged learner. In this case,
represents a transition point. Second, in more general settings, Theorem 2 (b) and (c) provide sufficient conditions under which the ensemble learner outperforms the merged learner and vice versa. Moreover, since
is smaller than
,
quantifies the degree of heterogeneity across the study-specific random effects. In the interval of
, the ensemble learner and the merged one are comparable with each other. In the equal-variances case,
and
are equal to each other and reduce to
. Thus, Theorem 2 (a) is a special case of Theorem 2 (b) and (c).
Third, we investigate the optimal choice of weighting scheme in the ensemble learner and present the results as follows:
Theorem 3:
Suppose that Assumptions 1-7 in the supplementary materials hold. For the ensemble learner
, the optimal ensembling weights,
, which minimize the expected SPE, are given by
(12) for
, where
.
In Theorem 3, since the term
in the expected SPE (5) only depends on the variations in the new dataset and the asymptotic bias in the term
is of negligible order, the optimal weights are mainly determined by the term
, that is, the variance of the prediction errors
. Specifically, the optimal weight
for the k-th study is proportional to the inverse variance of the prediction error of the learner trained solely on the k-th study, that is,
.
In practice, to derive the optimal weights
, we need to estimate the study-specific variation,
, and the average subject-specific variation,
. Here we consider the method of “smoothing first, then estimation” in Zhang and Chen (2007). Specifically, we first adopted the local linear kernel smoothing techniques to smooth the imaging responses
, leading to
, and then we fit SVCMEM with smoothed responses, that is,
![]() |
(13) |
where the measurement error term
in (1) is not included due to the smoothing procedure. Then, the WLS estimator described in equation (2) can be employed as an approximation for estimating
, denoted as
. Given the residuals
for
, the covariance functions,
and
, can be estimated through a least squares method, whose details can be found in Zhu et al. (2019), yielding the estimates of
and
.
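The inverse-variance weighting of Theorem 3 admits a simple plug-in sketch once the variance components have been estimated. In the Python sketch below, each study's prediction-error variance is assumed to take the schematic form v_k = (study-level variance) + (subject-level variance) / n_k; the exact expression in (12) involves additional quantities, so this is an illustration of the inverse-variance principle, not the paper's formula.

```python
import numpy as np

def optimal_ensemble_weights(var_study, var_subject, n_subjects):
    """Inverse-variance ensembling weights, normalized to sum to 1.

    var_study: per-study study-level variance estimates,
    var_subject: per-study subject-level variance estimates,
    n_subjects: per-study sample sizes. The decomposition
    v_k = var_study[k] + var_subject[k] / n_k is an assumed schematic form.
    """
    v = (np.asarray(var_study, dtype=float)
         + np.asarray(var_subject, dtype=float) / np.asarray(n_subjects, dtype=float))
    w = 1.0 / v                # weight each study by its inverse variance
    return w / w.sum()         # normalize so the weights sum to 1
```

With equal variance components, larger studies receive larger weights; as the study-level variance dominates, the weights approach the uniform 1/K, matching the intuition that strong heterogeneity caps how much any single study can be trusted.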
3. SIMULATION STUDIES
In this section, we use the empirical SPE to examine the prediction performance of the ensemble and merged learners based on simulated datasets. The R codes for implementing these simulation studies can be found in the supplementary materials. Specifically, we set the number of training studies
with unequal sample sizes
for
and the sample size
for the test study. In each study, we considered bivariate imaging features with
at each grid point and generated synthetic curves from SVCMEM as follows:
(14)
for
and
. Moreover, we independently simulated
for
and sorted them to obtain
. For the covariates of interest
in the k-th training study, we set the i-th row as
, in which
was generated from a bivariate normal distribution with mean
and covariance matrix
with
, and
. For the testing study, we set
, where
were generated in the same way as
for
. In the k-th study, we set the number of random effects
and
includes the second and third columns of the design matrix
for
. Furthermore, let the random effects
with covariance function Cov
and
. For the stochastic individual function
in the k-th study, its
-th element
admits the Karhunen-Loève expansion as
, where
with
for
and
,
,
,
, and
. Therefore,
. We simulated the
-th element in the measurement error
independently according to
. Also, we set the coefficient functions as
.
We consider 2 scenarios for the variances of random effects, including the homogeneous scenario with
corresponding to Theorem 2 (a) and the general scenario with
and
corresponding to Theorem 2 (b)-(c). In each scenario, 5000 data sets were generated for each of 10 levels of the mean variance
, including 0 and the theoretical transition points in Theorem 2. For each of the 10 levels of
in each scenario,
and
were calculated according to the estimation procedure introduced in Section 2.2, with equal weights
for all k. Subsequently, we calculated the predicted values for the testing study based on both the merged and ensemble learners and their corresponding empirical SPEs. The 5000 simulated data sets were evenly and randomly divided into 50 subgroups, and the sample mean of
in each subgroup was calculated.
Figure 1 presents the averaged
from all the 50 subgroups in the 10 levels of
in each scenario and the 3 theoretical transition points in Theorem 2. We observe some important findings as follows. In both homogeneous and heterogeneous scenarios, the averaged
decreases as the average variance of random effects
increases, indicating that the ensemble learner
gradually outperforms the merged learner
in terms of the prediction accuracy. For the homogeneous scenario, the empirical transition point, indicated by
, and the red dashed line in the top plot of Figure 1 coincide with the transition point provided in Theorem 2. For the heterogeneous scenario, according to Theorem 2, in the interval
, the approach with better performance transitions from the merged learner to the ensemble one. This coincides with the empirical transition points with
near 0 shown as the red dashed lines in the bottom plot of Figure 1. Therefore, our simulation studies validate our theoretical results in Theorem 2.
FIGURE 1.
Boxplots of averaged
over 100 replicates for homogeneous scenario (top) and heterogeneous scenario (bottom). The theoretical transition points are indicated by the red dashed lines.
4. REAL DATA ANALYSIS
In this section, we carried out a real data analysis of imaging genetics in 2 scenarios:
A. Training the merged and ensemble learners using 4 subsets of the UKB study and testing them on a holdout subset of the UKB study;
B. Training the merged and ensemble learners using 2 large-scale studies, including the UKB and ABCD studies, and testing them on the HCP study as an independent study.
For each scenario, both learners were estimated based on the training data and the corresponding SPEs were calculated.
4.1. Data description
We consider 3 studies: the UKB study, whose individuals are aged from 45 to 80 years; the ABCD study, whose subjects are aged from 9 to 11 years; and the HCP study, whose subjects are aged from 22 to 35 years. Besides different age ranges, these 3 studies may differ from each other in device, acquisition parameters, noise levels, and image processing protocols, among others. Such differences may introduce systematic differences in neuroimaging data. We used an individual filtering procedure to make sure that only independent individuals were included in our analysis. Only individuals with European ancestry are included for HCP and ABCD, while only British individuals are included for UKB. After further excluding the individuals with missing data, the numbers of individuals included in HCP, ABCD, and UKB are 298, 5088, and 16,381, respectively.
We considered dMRIs (Basser et al., 1994) and used both the FA and MD diffusion statistics along the GCC consisting of
grid points as our imaging outcomes. The corpus callosum (CC) consists of a flat bundle of commissural fibers beneath the cerebral cortex in the brain, connecting the left and right cerebral hemispheres. The CC is the largest white matter structure in the human brain and has 4 main parts: the rostrum, the genu, the body, and the splenium. As shown in Zhao et al. (2021), both the FA and MD values of the GCC are significantly linked to a large number of SNPs. Recent findings show that the brain white matter structures are much more significantly associated with genetic markers than other brain imaging features, including cortical volume and thickness (Zhao et al., 2019; 2021).
We focus on 2 important SNPs, rs12653308 and rs2237077, as genetic predictors, since they are (i) mutually uncorrelated and available in all 3 studies and (ii) significantly associated with both the FA and MD values in the GCC in the existing literature (Huang et al., 2017; Zhao et al., 2021). The SNPs are coded as quantitative variables with values equal to the number of alternative alleles (ie, 0, 1, or 2), and thus an additive model is implemented with respect to each SNP. The number of alternative alleles was extracted from imputed genotype data. Specifically, ABCD and HCP genotype data were imputed using the 1000 Genomes reference panel (Zhao et al., 2021), whereas UKB genotype data were imputed using the Haplotype Reference Consortium and UK10K + 1000 Genomes reference panels (Bycroft et al., 2018).
4.2. Data analysis
In scenarios A and B, we divided all available data into a training part and a test part as follows. For scenario A, we selected phases 1 and 2 of the UKB study and randomly split them into 5 subsets of approximately equal size such that
and
for
in the training part and
in the test part. For scenario B, both the ABCD study (
) and the UKB study (
) were used for training (
), while the HCP study (
) was used for testing. The covariates of interest include an intercept, age, age², sex, sex × age, sex × age², the top ten principal scores, and the number of alternative alleles for SNPs rs12653308 and rs2237077. We estimated
,
,
, and the optimal weights
in each scenario accordingly. Table 1 shows all these estimation results. We observe that
is greater than
for both scenarios, no matter whether using equal ensemble weights or optimal ensemble weights. According to Theorem 2, the ensemble learner yields a smaller expected SPE in both scenarios. However, different from scenario B, both
and
are very close to 0 in scenario A.
TABLE 1.
Estimation results of the quantities in Theorem 2 and the optimal weights in Theorem 3 for scenarios A and B.
| | scenario A | scenario B |
|---|---|---|
| Training data | 4 UKB subsamples ( ) | ABCD ( ); UKB ( ) |
| Testing data | 1 UKB subsample ( ) | HCP ( ) |
| | | 0.363 |
| | 2.811 | 3.749 |
The empirical SPEs for both learners were calculated to further compare their prediction performance. For each scenario, we calculated
for
,
,
based on equal ensemble weights
, and
based on optimal ensemble weights
estimated from Table 1. Given the predictors
in testing data, we calculated the predictions
based on
,
and
, respectively, and their corresponding empirical SPEs. Furthermore, we generated 100 bootstrap samples by randomly drawing with replacement from the original training data, while fixing the test data unchanged. Given each bootstrap sample, all the learners were retrained and used to predict the imaging responses in the test data.
Figure 2 presents the bootstrap SPEs with the empirical SPEs based on the original training data indicated by the blue lines for both scenarios. We have several important findings from Figure 2. For scenario A,
,
, and
have similar prediction performance in terms of their empirical SPEs and corresponding variations. This indicates low heterogeneity within the UKB cohort and small variation among the different subsets, consistent with the small
for scenario A in Table 1. For scenario B,
and
outperform the merged learner. This is consistent with the sufficient condition of Theorem 2 (b) and the corresponding estimation results in Table 1, that is
. In addition, since the optimal weights are close to the equal weights,
and
are comparable with each other in terms of their empirical SPEs.
FIGURE 2.
Boxplots of 100 bootstrap SPEs for scenario A (left) and scenario B (right) with original SPEs indicated by the red circles.
5. DISCUSSION
In this paper, we have systematically compared the merged and ensemble learners to explicitly deal with the interstudy heterogeneity in integrative analysis of neuroimaging data obtained from multiple studies. We have considered the SVCMEM to spatially model the varying association between imaging measures and a set of covariates, while explicitly accounting for the spatial smoothness and correlation of neuroimaging data. We have constructed the ensemble and merged learners for regression coefficient functions and compared them in terms of the prediction accuracy of neuroimaging outcomes. We have used both simulation studies and real data analysis to examine the finite sample performance of both learners.
There are several topics of interest for future research. First, we may extend the current linear spatial varying coefficient mixed effects model in (1) to the nonlinear setting, that is,
,
where
is the unknown link function applying to each element in
. Then, the estimation and inference procedures of single-index varying coefficient models (Luo et al., 2016) may be adopted here to establish the corresponding ensemble and merged learners. Second, we may compare the statistical efficiency of the ensemble and merged learners based on hypothesis testing and confidence interval (Xie et al., 2011). Specifically, given the p-values of all K studies, denoted as
, we may define an ensemble p-value,
, where
is the cumulative distribution function of the standard normal distribution. We may compare
with the p-value of the merged learner, but we leave this for future research.
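The ensemble p-value described above corresponds to the classical inverse-normal (Stouffer) combination. A minimal Python sketch, using only the standard library and assuming equal study weights, is:

```python
from statistics import NormalDist

def ensemble_p_value(p_values):
    """Inverse-normal (Stouffer) combination of K per-study p-values.

    Each p_k is mapped to a z-score via the standard normal quantile
    function, the z-scores are summed with a sqrt(K) scaling, and the
    result is mapped back to a single combined p-value.
    """
    nd = NormalDist()
    z = [nd.inv_cdf(1.0 - p) for p in p_values]   # Phi^{-1}(1 - p_k)
    k = len(p_values)
    return 1.0 - nd.cdf(sum(z) / k ** 0.5)
```

Study-specific weights (eg, the optimal weights of Theorem 3) could be folded in by replacing the plain sum with a weighted sum and the sqrt(K) scaling with the square root of the summed squared weights.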
Supplementary Material
Web Appendices (detailed derivations of estimation procedures, comprehensive proof, and the underlying assumptions that facilitate the technical details) and code (R codes for implementing the simulation studies) referenced in Sections 2.2, 2.3, 2.4, and 3 are available with this paper at the Biometrics website on Oxford Academic. In addition, the R codes, including synthetic multivariate functional data generation, merged and ensemble learning algorithms, and calculation of asymptotic transition points in Section 2.4, can also be found from the GitHub link https://github.com/BIG-S2/SVCMEM.
Acknowledgement
The first 2 authors, Dr. Shan and Dr. Huang, contributed equally to this paper. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH and NSF.
Contributor Information
Yue Shan, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States.
Chao Huang, Department of Statistics, Florida State University, Tallahassee, FL 32306, United States.
Yun Li, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States; Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States.
Hongtu Zhu, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States; Department of Statistics, Florida State University, Tallahassee, FL 32306, United States; Department of Statistics & Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States; Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States.
FUNDING
The research of Dr. Zhu was partially supported by the National Institute on Aging (NIA) of the National Institutes of Health (NIH) under Award Numbers RF1AG082938 and MH116527. The research of Dr. Li was partially supported by NIH grants R56-AG079291 and U01HG011720. The research of Dr. Huang was partially supported by the National Science Foundation (NSF) under Award Number DMS-1953087.
CONFLICT OF INTEREST
None declared.
DATA AVAILABILITY
The data that support the findings in this paper include (i) the UK Biobank resource (application no. 22783), which is subject to a data transfer agreement, (ii) the Adolescent Brain Cognitive Development (ABCD) study (https://abcdstudy.org), held in the National Institute of Mental Health Data Archive (NDA), and (iii) the Human Connectome Project (HCP) study (https://www.humanconnectome.org) by the WU-Minn Consortium (1U54MH091657).
References
- Alfaro-Almagro F., McCarthy P., Afyouni S., Andersson J. L., Bastiani M., Miller K. L. et al. (2021). Confound modelling in UK Biobank brain imaging. NeuroImage, 224, 117002.
- Basser P. J., Mattiello J., LeBihan D. (1994). Estimation of the effective self-diffusion tensor from the NMR spin echo. Journal of Magnetic Resonance, Series B, 103, 247–254.
- Beer J. C., Tustison N. J., Cook P. A., Davatzikos C., Sheline Y. I., Shinohara R. T. et al. (2020). Longitudinal ComBat: a method for harmonizing longitudinal multi-scanner imaging data. NeuroImage, 220, 117129.
- Bycroft C., Freeman C., Petkova D., Band G., Elliott L. T., Sharp K. et al. (2018). The UK Biobank resource with deep phenotyping and genomic data. Nature, 562, 203–209.
- Cai C., Chen R., Xie M. (2020). Individualized inference through fusion learning. WIREs Computational Statistics, 12, e1498.
- Casey B. J., Cannonier T., Conley M. I., Cohen A. O., Barch D. M., Heitzeg M. M. et al. (2018). The Adolescent Brain Cognitive Development (ABCD) study: imaging acquisition across 21 sites. Developmental Cognitive Neuroscience, 32, 43–54.
- Chen A. A., Luo C., Chen Y., Shinohara R. T., Shou H. and ADNI (2022). Privacy-preserving harmonization via distributed ComBat. NeuroImage, 248, 118822.
- Fortin J.-P., Parker D., Tunç B., Watanabe T., Elliott M. A., Ruparel K. et al. (2017). Harmonization of multi-site diffusion tensor imaging data. NeuroImage, 161, 149–170.
- Guan Z., Parmigiani G., Patil P. (2019). Merging versus ensembling in multi-study prediction: theoretical insight from random effects. arXiv preprint arXiv:1905.07382.
- Guillaume B., Wang C., Poh J., Shen M. J., Ong M. L., Tan P. F. et al. (2018). Improving mass-univariate analysis of neuroimaging data by modelling important unknown covariates: application to epigenome-wide association studies. NeuroImage, 173, 57–71.
- Hu F., Chen A. A., Horng H., Bashyam V., Davatzikos C., Alexander-Bloch A. et al. (2023). Image harmonization: a review of statistical and deep learning methods for removing batch effects and evaluation metrics for effective harmonization. NeuroImage, 274, 120125.
- Huang C., Thompson P., Wang Y., Yu Y., Zhang J., Kong D. et al. (2017). FGWAS: functional genome wide association analysis. NeuroImage, 159, 107–121.
- Huang C., Zhu H. (2022). Functional hybrid factor regression model for handling heterogeneity in imaging studies. Biometrika, 109, 1133–1148.
- Jackson D., Riley R. D. (2014). A refined method for multivariate meta-analysis and meta-regression. Statistics in Medicine, 33, 541–554.
- Lagani V., Karozou A. D., Gomez-Cabrero D., Silberberg G., Tsamardinos I. (2016). A comparative evaluation of data-merging and meta-analysis methods for reconstructing gene-gene interactions. BMC Bioinformatics, 17, 287–305.
- Lee S., Sun W., Wright F. A., Zou F. (2017). An improved and explicit surrogate variable analysis procedure by coefficient adjustment. Biometrika, 104, 303–316.
- Leek J. T., Storey J. D. (2007). Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics, 3, 1724–1735.
- Luo X., Zhu L., Zhu H. (2016). Single-index varying coefficient model for functional responses. Biometrics, 72, 1275–1284.
- Patil P., Parmigiani G. (2018). Training replicable predictors in multiple studies. Proceedings of the National Academy of Sciences, 115, 2578–2583.
- Price A. L., Patterson N. J., Plenge R. M., Weinblatt M. E., Shadick N. A., Reich D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38, 904–909.
- Prusty S., Patnaik S., Dash S. K. (2022). SKCV: stratified k-fold cross-validation on ML classifiers for predicting cervical cancer. Frontiers in Nanotechnology, 4, 972421.
- Somerville L. H., Bookheimer S. Y., Buckner R. L., Burgess G. C., Curtiss S. W., Dapretto M. et al. (2018). The Lifespan Human Connectome Project in Development: a large-scale study of brain connectivity development in 5-21 year olds. NeuroImage, 183, 456–468.
- Sudlow C., Gallacher J., Allen N., Beral V., Burton P., Danesh J. et al. (2015). UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine, 12, e1001779.
- Wang J., Zhao Q., Hastie T., Owen A. B. (2017). Confounder adjustment in multiple hypothesis testing. Annals of Statistics, 45, 1863–1894.
- Weiner M. W., Veitch D. P., Aisen P. S., Beckett L. A., Cairns N. J., Green R. C. et al., Alzheimer's Disease Neuroimaging Initiative (2017). Recent publications from the Alzheimer's Disease Neuroimaging Initiative: reviewing progress toward improved AD clinical trials. Alzheimer's & Dementia, 13, e1–e85.
- Xie M., Singh K., Strawderman W. E. (2011). Confidence distributions and a unifying framework for meta-analysis. Journal of the American Statistical Association, 106, 320–333.
- Zeng D., Lin D. (2015). On random-effects meta-analysis. Biometrika, 102, 281–294.
- Zhang J., Chen J. (2007). Statistical inference for functional data. Annals of Statistics, 35, 1052–1079.
- Zhang Y., Bernau C., Parmigiani G., Waldron L. (2020). The impact of different sources of heterogeneity on loss of accuracy from genomic prediction models. Biostatistics, 21, 253–268.
- Zhao B., Li T., Yang Y., Wang X., Luo T., Shan Y. et al. (2021). Common genetic variation influencing human white matter microstructure. Science, 372, eabf3736.
- Zhao B., Luo T., Li T., Li Y., Zhang J., Shan Y. et al. (2019). Genome-wide association analysis of 19,629 individuals identifies variants influencing regional brain volumes and refines their genetic co-architecture with cognitive and mental health traits. Nature Genetics, 51, 1637–1644.
- Zhu H., Chen K., Luo X., Yuan Y., Wang J.-L. (2019). FMEM: functional mixed effects models for longitudinal functional responses. Statistica Sinica, 29, 2007–2033.
- Zhu H., Li R., Kong L. (2012). Multivariate varying coefficient model for functional responses. Annals of Statistics, 40, 2634–2666.
- Zhu H., Li T., Zhao B. (2023). Statistical learning methods for neuroimaging data analysis with applications. Annual Review of Biomedical Data Science, 6, 73–104.