Biometrics. 2024 Mar 11;80(1):ujae003. doi: 10.1093/biomtc/ujae003

Merging or ensembling: integrative analysis in multiple neuroimaging studies

Yue Shan, Chao Huang, Yun Li, Hongtu Zhu
PMCID: PMC10926268  PMID: 38465984

ABSTRACT

The aim of this paper is to systematically investigate merging and ensembling methods for spatially varying coefficient mixed effects models (SVCMEM) in order to carry out integrative learning of neuroimaging data obtained from multiple biomedical studies. The "merged" approach involves training a single learning model using a comprehensive dataset that encompasses information from all the studies. Conversely, the "ensemble" approach involves creating a weighted average of distinct learning models, each developed from an individual study. We systematically investigate the prediction accuracy of the merged and ensemble learners in the presence of different degrees of interstudy heterogeneity. Additionally, we establish asymptotic guidelines for making strategic decisions about when to employ either of these models in different scenarios, along with deriving optimal weights for the ensemble learner. To validate our theoretical results, we perform extensive simulation studies. The proposed methodology is also applied to 3 large-scale neuroimaging studies.

Keywords: ensemble learner, interstudy heterogeneity, merged learner, neuroimaging, spatially varying coefficient mixed effects model

1. INTRODUCTION

With rapid advancements in imaging technology, an array of extensive biomedical studies, including the UK Biobank (UKB) (Sudlow et al., 2015), Adolescent Brain Cognitive Development (ABCD) study (Casey et al., 2018), Alzheimer’s Disease Neuroimaging Initiative (ADNI) (Weiner et al., 2017), and Human Connectome Project (HCP) (Somerville et al., 2018), are underway or completed. These studies encompass diverse data types: neuroimaging data, genetics, clinical records, and health details. A central concern is how to integrate these multiview datasets across studies to enhance the prediction of neuroimaging outcomes and the identification of reliable imaging biomarkers for subsequent tasks, such as Alzheimer’s disease detection (Zhu et al., 2023). However, attaining these goals is complex due to substantial interstudy heterogeneity stemming from varied sources. These include differences in data collection, study design, acquisition protocols, preprocessing pipelines, and study-specific elements, collectively hindering integrated data analysis. This challenge resonates across fields (Leek and Storey, 2007; Fortin et al., 2017; Zhang et al., 2020; Beer et al., 2020; Chen et al., 2022; Zhu et al., 2023; Hu et al., 2023). Notably, confounding factors such as device, acquisition parameters, and motion effects exert a larger influence on neuroimaging data than the subtle signals of brain change associated with predictors such as age, gender, and disease status (Alfaro-Almagro et al., 2021). Hence, effectively addressing interstudy heterogeneity becomes paramount when pursuing integrative learning of neuroimaging data across multiple studies.

Two principal strategies, namely the "merged learner" and the "ensemble learner," stand as key approaches to confronting the issue of interstudy heterogeneity within integrative learning (Cai et al., 2020; Patil and Parmigiani, 2018). In the first strategy, data sourced from various studies are initially amalgamated into a unified dataset, which serves as the basis for training a single learning model. Within this context, either fixed-effect or random-effect models are commonly employed to train the merged learner. Techniques such as principal components analysis (PCA) (Price et al., 2006), confounder adjusted testing and estimation (CATE) (Wang et al., 2017), and direct surrogate variable analysis (dSVA) (Lee et al., 2017) can capture the presence of interstudy heterogeneity. Recently, dSVA was adapted to tackle interstudy heterogeneity within neuroimaging data (Guillaume et al., 2018). Furthermore, Huang and Zhu (2022) introduced a functional hybrid factor regression modeling framework that merges surrogate variable analysis with functional data analysis, forming a hybrid solution to effectively navigate the challenges posed by interstudy heterogeneity.

In the second strategy, the ensemble learner is usually constructed as a weighted average of learners estimated from the individual datasets. Analogous concepts have been applied in ensemble machine learning (Patil and Parmigiani, 2018), meta-analysis (Jackson and Riley, 2014), and fusion learning (Cai et al., 2020). For instance, fusion learning combines confidence distributions for the parameters of interest from different studies. It has been shown that under some specific settings (eg, linear mixed models), the ensemble learner outperforms the merged one when there is large heterogeneity across studies (Patil and Parmigiani, 2018). Conversely, the merged learner exhibits a smaller prediction error than its ensemble counterpart when the studies are relatively homogeneous (Lagani et al., 2016). Hence, it is of interest to explore when and how to optimize the ensemble learner in more general settings, while taking into account specific criteria such as prediction accuracy (Guan et al., 2019) and asymptotic efficiency (Zeng and Lin, 2015).

In practice, there is significant interest in comparing the merged and ensemble learners in terms of predicting neuroimaging outcomes for external testing datasets. K-fold cross-validation (CV), a useful statistical learning technique, is commonly employed to assess prediction performance. Specifically, both learners are estimated on training datasets, and prediction metrics, such as mean squared error, are computed by applying the trained learners to the validation dataset. However, the effectiveness of traditional K-fold CV approaches may diminish due to variations in interstudy heterogeneity patterns across folds. While some K-fold CV variants, like stratified K-fold CV (Prusty et al., 2022), attempt to address this issue by incorporating additional information from observed confounding factors during dataset partitioning, evaluating prediction performance using K-fold CV methods remains challenging when interstudy heterogeneity is predominantly driven by unknown study-specific random effects. Therefore, it is of great importance to derive strategy-decision guidelines that clarify which of the merged and ensemble learners performs better.

The primary goal of this paper is to conduct a comprehensive exploration of ensemble learning’s potential in analyzing neuroimaging data across multiple studies. To realize this objective, we embark on a specific case study involving multiview data encompassing genetic predictors [causal single nucleotide polymorphisms (SNPs)], demographic components, imaging data, and clinical factors. The aim is to predict imaging outcomes along the genu of the corpus callosum (GCC) for participants across 3 distinct biomedical studies: UKB, ABCD, and HCP studies. Imaging outcomes considered here include both fractional anisotropy (FA) and mean diffusivity (MD) values derived from diffusion-weighted magnetic resonance imaging (dMRI). Within this framework, we adopt a spatially varying coefficient mixed-effects model (SVCMEM). This model posits that imaging outcomes across studies have 4 distinct components: (i) fixed effects, (ii) study-specific random effects, (iii) subject-specific and location-specific spatial variations expressed through individual stochastic functions, and (iv) random noise. Notably, in contrast to earlier models for multistudy neuroimaging data that primarily consider univariate and multivariate phenotypes (Guan et al., 2019; Guillaume et al., 2018), our proposed SVCMEM explicitly accommodates the intricate multilevel variations inherent in imaging data encompassing locations, phenotypes, subjects, and studies. Further insights into SVCMEM are provided in Section 2.1. Building upon the neuroimaging data generation mechanism, we proceed to construct both the ensemble learner and the merged learner. Subsequently, we methodically compare their individual performances in terms of predicting neuroimaging outcomes. Furthermore, we derive the asymptotic strategy-decision guideline across various scenarios and determine the optimal weights for the ensemble learner. Finally, through simulation studies and real data analyses, we substantiate the effectiveness of the ensemble learner’s derived decisions in enhancing the accuracy of neuroimaging outcome predictions.

The paper is organized as follows: Section 2 introduces SVCMEM and outlines the estimation procedure for the merged and ensemble learners. Subsequently, we delve into exploring the theoretical properties of both learners in predicting imaging outcomes. In Section 3, simulation studies on synthetic curve data are presented to validate the theoretical results concerning the strategy-decision guideline for the two learners across different scenarios. Finally, in Section 4, we apply our method to an imaging genetics scenario using data from UKB, ABCD, and HCP studies.

2. METHOD

Suppose that we observe both neuroimaging data and a vector of covariates of interest from n unrelated subjects in K distinct neuroimaging studies such that there are Inline graphic subjects in the k-th study for Inline graphic and Inline graphic. Without loss of generality, all the images have been registered to a common template. Let Inline graphic be a region of interest (ROI) containing Inline graphic grid points Inline graphic, which follow a common density Inline graphic for Inline graphic. It is worth mentioning that the method proposed in this paper can be readily extended to accommodate multi-ROI scenarios, where voxels from different ROIs are not required to share a common density. At each grid point Inline graphic, we observe J imaging features for each subject, for example, FA and MD values. For the k-th neuroimaging study, let Inline graphic be an Inline graphic matrix including J imaging features and Inline graphic be an Inline graphic full column rank matrix including all covariates of interest (eg, age, gender, disease status, and causal SNPs) as well as the intercept.

2.1. Spatially varying coefficient mixed effects model (SVCMEM)

We assume that multivariate neuroimaging outcomes are generated from the SVCMEM:

(1)

where Inline graphic is a Inline graphic matrix representing fixed effects related to Inline graphic, which are invariant across studies (such as age and gender). Here, Inline graphic is an Inline graphic design matrix for random effects, where Inline graphic is a Inline graphic matrix with Inline graphic if the j-th predictor is included and associated with the t-th random effect, and Inline graphic is a Inline graphic matrix representing the corresponding study-specific random effects. Moreover, Inline graphic is an Inline graphic matrix that includes individual stochastic functions characterizing both subject-specific and location-specific spatial variability, and the Inline graphic's are measurement errors. To further characterize the spatial correlations within the multivariate imaging phenotypes, similar to Zhu et al. (2012) and Huang and Zhu (2022), we assume that the rows in Inline graphic and Inline graphic are mutually independent and identical copies of SPInline graphic and SPInline graphic, where SPInline graphic denotes a stochastic process vector with mean function Inline graphic and covariance function Inline graphic. Moreover, Inline graphic takes the form of Inline graphic, where Inline graphic is a diagonal matrix and Inline graphic is the indicator function. It is also assumed that the elements in Inline graphic, that is, Inline graphic, are independent of Inline graphic and Inline graphic and are mutually independent copies of SPInline graphic, respectively. This does not exclude correlations along the functional direction, as Inline graphic is not required to be zero.
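
In our notation (which may differ from the published display of (1)), the component description above corresponds to a model of the schematic form

$$Y_k(v) \;=\; X_k\,B(v) \;+\; X_k\Gamma\,b_k(v) \;+\; \eta_k(v) \;+\; E_k(v), \qquad k = 1, \ldots, K,$$

where $Y_k(v)$ collects the J imaging features of the k-th study at location v, $B(v)$ contains the fixed-effect coefficient functions, $b_k(v)$ the study-specific random effects, $\eta_k(v)$ the individual stochastic functions, and $E_k(v)$ the measurement errors.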

The proposed SVCMEM encompasses several established models as special instances. For instance, if all imaging outcomes originate from a single study, the SVCMEM simplifies to the multivariate varying coefficient model proposed by Zhu et al. (2012). In cases where only univariate imaging outcomes are considered, the SVCMEM transforms into the mixed-effect model proposed by Guan et al. (2019). Additionally, excluding subject-specific and location-specific spatial variability aligns the SVCMEM with the confounder-adjusted regression model introduced in Guillaume et al. (2018). Notably, in contrast to current models catering to multistudy neuroimaging data (Guan et al., 2019; Guillaume et al., 2018), our SVCMEM adeptly captures imaging variations across multiple dimensions: locations, phenotypes, subjects, and studies.

2.2. Estimation procedure

We introduce the estimation procedure of the merged learner and the ensemble learner below.

Merging. Regarding the merged learner, we typically assume the absence of a study-specific random effect Inline graphic, which implies relative homogeneity among imaging outcomes across all studies. Under this assumption, we initially consolidate observed data from all K studies and derive the estimator Inline graphic for Inline graphic based on this amalgamated dataset.

Let Inline graphic and Inline graphic be, respectively, the merged Inline graphic imaging and Inline graphic covariate matrices across K studies. Given Inline graphic and Inline graphic, the multivariate local linear kernel smoothing technique (Zhu et al., 2012) is used to derive the weighted least squares (WLS) estimator of Inline graphic. Some notation is introduced here. Let Inline graphic for any vector Inline graphic and Inline graphic be the Kronecker product of 2 matrices Inline graphic and Inline graphic. In addition, denote Inline graphic and Inline graphic, where Inline graphic is the kernel function and Inline graphic is the positive definite bandwidth matrix. Then, the WLS estimator of Inline graphic based on the merged data is given by

(2)

where Inline graphic. A more detailed derivation of the estimator in (2) is provided in the supplementary materials.
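
As a concrete illustration of this step, a minimal sketch in R is given below; it fits one imaging feature at a single grid point with a scalar bandwidth and the Epanechnikov kernel, and all function and variable names (fit_wls_merged, epanechnikov) are ours rather than part of the published implementation.

```r
## Minimal sketch (not the authors' implementation) of the merged local linear
## kernel WLS fit of B(v0) at a single grid point v0, for one imaging feature.
epanechnikov <- function(u) 0.75 * pmax(1 - u^2, 0)

fit_wls_merged <- function(Y, X, grid, v0, h) {
  # Y: n x M matrix of one imaging feature observed at M grid points
  # X: n x p merged covariate matrix (rows stacked across the K studies)
  n <- nrow(Y); M <- length(grid); p <- ncol(X)
  w <- epanechnikov((grid - v0) / h) / h          # kernel weight for each grid point
  D <- matrix(0, n * M, 2 * p)                    # local linear design matrix
  y <- numeric(n * M); wts <- numeric(n * M)
  for (m in seq_len(M)) {
    rows <- ((m - 1) * n + 1):(m * n)
    D[rows, ] <- cbind(X, X * (grid[m] - v0))     # covariates and their interaction with (v_m - v0)
    y[rows] <- Y[, m]
    wts[rows] <- w[m]
  }
  coef <- solve(crossprod(D, wts * D), crossprod(D, wts * y))
  coef[seq_len(p)]                                # estimate of B(v0) for this feature
}
```

In practice, this fit is repeated over all grid points and imaging features; the estimator in (2) uses a positive definite bandwidth matrix rather than the scalar bandwidth assumed here.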

Ensembling. Different from the merged learner, the requirement of homogeneity with respect to Inline graphic does not apply to the ensemble one. In fact, for the ensemble learner, individual estimators of Inline graphic, denoted as Inline graphic, are first derived from the data of each of the K studies separately, and then the ensemble estimator of Inline graphic, denoted as Inline graphic, is calculated as a weighted average of the Inline graphic's. Specifically, the Inline graphic's and Inline graphic can be computed, respectively, as follows:

(3)

where Inline graphic are the bandwidth matrices and Inline graphic satisfy Inline graphic.
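
Continuing the sketch above, the ensemble step reuses fit_wls_merged on each study's data separately and averages the study-specific fits with user-supplied weights; the names and scalar bandwidths are again illustrative.

```r
## Sketch of the ensemble step: fit B(v0) on each study separately and take a
## weighted average of the study-specific fits.
fit_ensemble <- function(Y_list, X_list, grid, v0, h_list, weights) {
  stopifnot(abs(sum(weights) - 1) < 1e-8)         # weights must sum to one
  fits <- mapply(function(Yk, Xk, hk) fit_wls_merged(Yk, Xk, grid, v0, hk),
                 Y_list, X_list, h_list, SIMPLIFY = FALSE)
  Reduce(`+`, Map(`*`, as.list(weights), fits))   # weighted average of study fits
}
```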

Throughout this paper, we standardize all covariates and imaging responses to have mean zero and SD one. The leave-one-curve-out CV (Zhang and Chen, 2007) is used to select Inline graphic in Inline graphic and Inline graphic in Inline graphic. Finally, we set a common bandwidth, denoted as Inline graphic, for all covariates and imaging responses (Huang et al., 2017).
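
A naive sketch of the leave-one-curve-out criterion for choosing a common bandwidth is shown below; it is written for clarity rather than speed and reuses fit_wls_merged from the earlier sketch.

```r
## Sketch of leave-one-curve-out cross-validation: each subject's whole curve is
## held out in turn, B(v) is refit on the remaining curves, and the squared error
## of the predicted held-out curve is accumulated for each candidate bandwidth.
locv_bandwidth <- function(Y, X, grid, h_grid) {
  n <- nrow(Y)
  cv_err <- sapply(h_grid, function(h) {
    err <- 0
    for (i in seq_len(n)) {
      Bhat <- sapply(grid, function(v0)
        fit_wls_merged(Y[-i, , drop = FALSE], X[-i, , drop = FALSE], grid, v0, h))
      pred <- as.numeric(X[i, , drop = FALSE] %*% Bhat)  # predicted curve for subject i
      err <- err + sum((Y[i, ] - pred)^2)
    }
    err
  })
  h_grid[which.min(cv_err)]                        # bandwidth with the smallest CV error
}
```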

2.3. Prediction performance comparison

Suppose that we have Inline graphic newly observed subjects forming an Inline graphic covariate matrix, denoted as Inline graphic, and an Inline graphic imaging matrix, denoted as Inline graphic. Given Inline graphic, Inline graphic can be predicted based on either the merged learner or the ensemble learner. We are interested in comparing the merged and ensemble learners by using the squared prediction error (SPE) given by

(4)

where Inline graphic is the Frobenius norm, Inline graphic for the merged learner, and Inline graphic for the ensemble learner. Furthermore, Inline graphic can be decomposed as the sum of 3 key terms given by

(5)

where Inline graphic is the trace of a given matrix, Inline graphic is the asymptotic bias of the j-th column in Inline graphic, and Inline graphic is the corresponding asymptotic conditional variance. The detailed derivation of the decomposition in (5) can be found in the supplementary materials. It follows from (5) that Inline graphic is primarily determined by the terms Inline graphic and Inline graphic, since Inline graphic depends only on the variations in the new dataset Inline graphic. Moreover, Inline graphic and Inline graphic are determined by the squared bias and the variance of the estimated functional coefficients, respectively.
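
For one imaging feature, the empirical counterpart of (4) can be sketched as follows; whether and how the SPE is normalized by the test sample size is an assumption on our part, and the names are illustrative.

```r
## Sketch of the empirical squared prediction error: apply a fitted coefficient
## matrix Bhat (p x M) to the new covariate matrix and take the squared
## Frobenius norm of the residual, averaged over the test subjects.
empirical_spe <- function(Y_new, X_new, Bhat) {
  pred <- X_new %*% Bhat                           # n0 x M predicted imaging matrix
  sum((Y_new - pred)^2) / nrow(Y_new)              # averaged squared Frobenius norm
}
```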

2.4. Theoretical analysis

In this section, we systematically explore the theoretical properties of Inline graphic and Inline graphic, assuming a constant number of studies, K, while the minimum number of subjects across these studies, denoted as Inline graphic, approaches infinity. The complete proofs and the underlying technical assumptions are provided in the supplementary materials.

We initiate our examination by deriving the asymptotic bias and conditional variance of Inline graphic and Inline graphic. We introduce Inline graphic, satisfying Inline graphic, where Inline graphic represents a Inline graphic identity matrix. Furthermore, let Inline graphic and Inline graphic denote the j-th columns of Inline graphic and Inline graphic, respectively.

Theorem 1:

Suppose that Assumptions 1-7 in the supplementary materials hold. The following results hold:

(a) For the merged learner, the asymptotic bias of Inline graphic is given by

(6)

where the l-th element in Inline graphic is defined as Inline graphic with Inline graphic being the Inline graphic Hessian matrix of the Inline graphic-th element in Inline graphic. Furthermore, the asymptotic conditional covariance matrix, Inline graphic, is given by

(7)

where Inline graphic, Inline graphic is a diagonal matrix with the elements Inline graphic on the main diagonal, and Inline graphic is the j-th diagonal element in Inline graphic.

(b) For the ensemble learner, the asymptotic bias of Inline graphic is

(8)

and the corresponding asymptotic conditional variance, Inline graphic, is given by

(9)

where Inline graphic.

Theorem 1 has 2 important implications. First, the biases of the 2 learners are asymptotically comparable with each other. Second, the conditional variances of Inline graphic and Inline graphic can be decomposed into the study-specific variation and the subject-specific variation. Specifically, for the ensemble learner, its subject-specific and study-specific variations are equal to Inline graphic and Inline graphic, respectively. When Inline graphic and Inline graphic, the subject-specific and study-specific variations of Inline graphic reduce to those of Inline graphic.

Second, we compare the ensemble learner and the merged learner in terms of the expected SPE below. Let Inline graphic be the average study-specific variation and Inline graphic be the average subject-specific variation. We also define Inline graphic

(10)
(11)

where Inline graphic is a Inline graphic vector with the l-th element equal to 1 and all other elements equal to zero for Inline graphic.

Theorem 2:

Suppose that Assumptions 1-7 in the supplementary materials hold. The following results hold:

  (a) if Inline graphic and Inline graphic for Inline graphic, then Inline graphic asymptotically holds if and only if Inline graphic;

  (b) Inline graphic are asymptotically valid when Inline graphic and Inline graphic hold;

  (c) Inline graphic are asymptotically valid when Inline graphic and Inline graphic hold.

Theorem 2 has several implications. First, in the equal-variances case, Theorem 2 (a) provides a necessary and sufficient condition under which the ensemble learner outperforms the merged learner. In this case, Inline graphic represents a transition point. Second, in more general settings, Theorem 2 (b) and (c) provide sufficient conditions under which the ensemble learner outperforms the merged learner and vice versa. Moreover, since Inline graphic is smaller than Inline graphic, Inline graphic quantifies the degree of heterogeneity across study-specific random effects. In the interval Inline graphic, the ensemble learner and the merged one are comparable with each other. In the equal-variances case, Inline graphic and Inline graphic coincide and reduce to Inline graphic. Thus, Theorem 2 (a) is a special case of Theorem 2 (b) and (c).

Third, we investigate the optimal choice of weighting scheme in the ensemble learner and present the results as follows:

Theorem 3:

Suppose that Assumptions 1-7 in the supplementary materials hold. For the ensemble learner Inline graphic, the optimal ensembling weights, Inline graphic, which minimize the expected SPE, are given by

(12)

for Inline graphic, where Inline graphic.

In Theorem 3, since the term Inline graphic in the expected SPE (5) depends only on the variations in the new dataset and the asymptotic bias in the term Inline graphic is of negligible order, the optimal weights are mainly determined by the term Inline graphic, that is, the variance of the prediction errors Inline graphic. Specifically, the optimal weight Inline graphic for the k-th study is proportional to the inverse variance of the prediction error obtained when the learner is trained on the k-th study alone, that is, Inline graphic.
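
Under this reading of Theorem 3, the weights can be sketched as inverse-variance weights normalized to sum to one; the exact expression of each prediction-error variance in terms of the variance components is given in (12) and is not reproduced here, so the input vector below is an assumption.

```r
## Sketch of the inverse-variance weighting implied by Theorem 3: pred_var[k]
## holds an estimate of the prediction-error variance for a learner trained on
## the k-th study alone.
optimal_weights <- function(pred_var) {
  w <- 1 / pred_var
  w / sum(w)                                       # normalize so the weights sum to one
}
optimal_weights(c(0.8, 1.2, 2.5))                  # example with 3 hypothetical studies
```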

In practice, to derive the optimal weights Inline graphic, we need to estimate the study-specific variation, Inline graphic, and the average subject-specific variation, Inline graphic. Here we consider the "smoothing first, then estimation" approach of Zhang and Chen (2007). Specifically, we first adopt the local linear kernel smoothing technique to smooth the imaging responses Inline graphic, leading to Inline graphic, and then we fit SVCMEM with the smoothed responses, that is,

(13)

where the measurement error term Inline graphic in (1) is not included due to the smoothing procedure. Then, the WLS estimator described in equation (2) can be employed as an approximation for estimating Inline graphic, denoted as Inline graphic. Given the residuals Inline graphic for Inline graphic, the covariance functions, Inline graphic and Inline graphic, can be estimated through a least squares method, whose details can be found in Zhu et al. (2019), yielding the estimates of Inline graphic and Inline graphic.
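
A minimal sketch of the pre-smoothing step is given below; it uses a simple kernel-weighted average over the grid as a stand-in for local linear smoothing, reuses epanechnikov from the earlier sketch, and all names and the scalar bandwidth are illustrative.

```r
## Sketch of "smoothing first, then estimation": pre-smooth each observed curve
## so that the measurement-error term drops out, then refit SVCMEM on the
## smoothed responses.
smooth_curves <- function(Y, grid, h) {
  W <- outer(grid, grid, function(u, v) epanechnikov((u - v) / h))
  W <- sweep(W, 1, rowSums(W), "/")                # each row of weights sums to one
  Y %*% t(W)                                       # smoothed curves, one per row of Y
}
## The study-specific and subject-specific covariance functions are then
## estimated from the residual curves by least squares (Zhu et al., 2019),
## which we do not reproduce here.
```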

3. SIMULATION STUDIES

In this section, we use the empirical SPE to examine the prediction performance of the ensemble and merged learners based on simulated datasets. The R codes for implementing these simulation studies can be found in the supplementary materials. Specifically, we set the number of training studies Inline graphic with unequal sample sizes Inline graphic for Inline graphic and the sample size Inline graphic for the test study. In each study, we considered a bivariate imaging feature with Inline graphic at each grid point and generated synthetic curves from SVCMEM as follows:

(14)

for Inline graphic and Inline graphic. Moreover, we independently simulated Inline graphic for Inline graphic and sorted them to obtain Inline graphic. For the covariates of interest Inline graphic in the k-th training study, we set the i-th row as Inline graphic, in which Inline graphic was generated from a bivariate normal distribution with mean Inline graphic and covariance matrix Inline graphic with Inline graphic, and Inline graphic. For the testing study, we set Inline graphic, where Inline graphic were generated in the same way as Inline graphic for Inline graphic. In the k-th study, we set the number of random effects Inline graphic and Inline graphic includes the second and third columns of the design matrix Inline graphic for Inline graphic. Furthermore, let the random effects Inline graphic with covariance function CovInline graphic and Inline graphic. For the stochastic individual function Inline graphic in the k-th study, its Inline graphic-th element Inline graphic admits the Karhunen-Loève expansion as Inline graphic, where Inline graphic with Inline graphic for Inline graphic and Inline graphic, Inline graphic, Inline graphic, Inline graphic, and Inline graphic. Therefore, Inline graphic. We simulated the Inline graphic-th element in the measurement error Inline graphic independently according to Inline graphic. Also, we set the coefficient functions as Inline graphic.

We consider 2 scenarios for the variances of random effects, including the homogeneous scenario with Inline graphic corresponding to Theorem 2 (a) and the general scenario with Inline graphic and Inline graphic corresponding to Theorem 2 (b)-(c). In each scenario, 5000 data sets were generated for each of 10 levels of the mean variance Inline graphic, including 0 and the theoretical transition points in Theorem 2. For each of the 10 levels of Inline graphic in each scenario, Inline graphic and Inline graphic were calculated according to the estimation procedure introduced in Section 2.2, with equal weights Inline graphic for all k. Subsequently, we calculated the predicted values for the testing study based on both the merged and ensemble learners and their corresponding empirical SPEs. The 5000 simulated data sets were evenly and randomly divided into 50 subgroups, and the sample mean of Inline graphic in each subgroup was calculated.
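
The following sketch illustrates the simulation logic in simplified form (the exact coefficient functions, random-effect structure, and variance levels of this section are not reproduced); it reuses fit_wls_merged, fit_ensemble, and empirical_spe from the earlier sketches, and all settings are illustrative.

```r
## Sketch of the simulation logic: simulate K training studies from a simplified
## SVCMEM with study-specific coefficient shifts, fit merged and equal-weight
## ensemble learners, and compare their empirical SPEs on an independent test study.
set.seed(1)
K <- 3; n_k <- c(60, 80, 100); M <- 50; p <- 3; h <- 0.2
grid <- sort(runif(M))
B <- rbind(sin(2 * pi * grid), cos(2 * pi * grid), grid^2)    # p x M coefficient functions
sim_study <- function(n, sigma_b = 0.5) {
  X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
  b <- rnorm(p, sd = sigma_b)                     # study-specific coefficient shift
  eta <- matrix(rnorm(n * M, sd = 0.3), n, M)     # simplified subject-level variation
  eps <- matrix(rnorm(n * M, sd = 0.1), n, M)     # measurement error
  list(Y = X %*% (B + b) + eta + eps, X = X)
}
studies <- lapply(n_k, sim_study)
test <- sim_study(100)
Y_all <- do.call(rbind, lapply(studies, `[[`, "Y"))
X_all <- do.call(rbind, lapply(studies, `[[`, "X"))
B_merge <- sapply(grid, function(v0) fit_wls_merged(Y_all, X_all, grid, v0, h))
B_ens <- sapply(grid, function(v0)
  fit_ensemble(lapply(studies, `[[`, "Y"), lapply(studies, `[[`, "X"),
               grid, v0, rep(list(h), K), rep(1 / K, K)))
c(merged = empirical_spe(test$Y, test$X, B_merge),
  ensemble = empirical_spe(test$Y, test$X, B_ens))
```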

Figure 1 presents the averaged Inline graphic from all the 50 subgroups at the 10 levels of Inline graphic in each scenario, together with the 3 theoretical transition points in Theorem 2. We observe several important findings. In both the homogeneous and heterogeneous scenarios, the averaged Inline graphic decreases as the average variance of random effects Inline graphic increases, indicating that the ensemble learner Inline graphic gradually outperforms the merged learner Inline graphic in terms of prediction accuracy. For the homogeneous scenario, the empirical transition point, indicated by Inline graphic, and the red dashed line in the top plot of Figure 1 coincide with the transition point provided in Theorem 2. For the heterogeneous scenario, according to Theorem 2, in the interval Inline graphic, the better-performing approach transitions from the merged learner to the ensemble one. This is consistent with the empirical transition points with Inline graphic near 0, shown as the red dashed lines in the bottom plot of Figure 1. Therefore, our simulation studies validate our theoretical results in Theorem 2.

FIGURE 1. Boxplots of averaged Inline graphic over 100 replicates for the homogeneous scenario (top) and the heterogeneous scenario (bottom). The theoretical transition points are indicated by the red dashed lines.

4. REAL DATA ANALYSIS

In this section, we carried out a real data analysis of imaging genetics in 2 scenarios:

  • A. Training the merged and ensemble learners using 4 subsets of the UKB study and testing them on a holdout subset of the UKB study;

  • B. Training the merged and ensemble learners using 2 large-scale studies, including the UKB and ABCD studies, and testing them on the HCP study as an independent study.

For each scenario, both learners were estimated based on the training data and the corresponding SPEs were calculated.

4.1. Data description

We consider 3 studies: the UKB study, whose participants are aged 45 to 80 years; the ABCD study, whose participants are aged 9 to 11 years; and the HCP study, whose participants are aged 22 to 35 years. Besides different age ranges, these 3 studies may differ from each other in device, acquisition parameters, noise characteristics, and image processing protocols, among others. Such differences may introduce systematic differences in neuroimaging data. We used an individual filtering procedure to ensure that only independent individuals were included in our analysis. Only individuals with European ancestry are included for HCP and ABCD, while only British individuals are included for UKB. After further excluding individuals with missing data, the numbers of individuals included in HCP, ABCD, and UKB are 298, 5088, and 16381, respectively.

We considered dMRIs (Basser et al., 1994) and used both the FA and MD diffusion statistics along the GCC consisting of Inline graphic grid points as our imaging outcomes. The corpus callosum (CC) consists of a flat bundle of commissural fibers beneath the cerebral cortex in the brain, connecting the left and right cerebral hemispheres. The CC is the largest white matter structure in the human brain and has 4 main parts, including the rostrum, the genu, the body, and the splenium. As shown in Zhao et al. (2021), both the FA and MD values of the GCC are significantly linked to a large number of SNPs. Recent findings show that brain white matter structures are more strongly associated with genetic markers than other brain imaging features, including cortical volume and thickness (Zhao et al., 2019; 2021).

We focus on 2 important SNPs, rs12653308 and rs2237077, as genetic predictors, since they are (i) mutually uncorrelated and available in all three studies and (ii) significantly associated with both the FA and MD values in the GCC in the existing literature (Huang et al., 2017; Zhao et al., 2021). The SNPs are coded as quantitative variables with values equal to the number of alternative alleles (ie, 0, 1, or 2), and thus an additive model is implemented with respect to each SNP. The number of alternative alleles was extracted from imputed genotype data. Specifically, ABCD and HCP genotype data were imputed using the 1000 Genomes reference panel (Zhao et al., 2021), whereas UKB genotype data were imputed using the Haplotype Reference Consortium and UK10K + 1000 Genomes reference panels (Bycroft et al., 2018).

4.2. Data analysis

In scenarios A and B, we divided all available data into a training part and a test part as follows. For scenario A, we selected phases 1 and 2 of the UKB study and randomly split them into 5 subsets of approximately equal size such that Inline graphic and Inline graphic for Inline graphic in the training part and Inline graphic in the test part. For scenario B, both the ABCD study (Inline graphic) and the UKB study (Inline graphic) were used for training (Inline graphic), while the HCP study (Inline graphic) was used for testing. The covariates of interest Inline graphic include an intercept, age, ageInline graphic, sex, sexInline graphicage, sexInline graphicageInline graphic, the top 10 principal scores, and the number of alternative alleles for SNPs rs12653308 and rs2237077. We estimated Inline graphic, Inline graphic, Inline graphic, and the optimal weights Inline graphic in each scenario accordingly. Table 1 shows all these estimation results. We observe that Inline graphic is greater than Inline graphic in both scenarios, regardless of whether equal ensemble weights or optimal ensemble weights are used. According to Theorem 2, the ensemble learner yields a smaller expected SPE in both scenarios. However, different from scenario B, both Inline graphic and Inline graphic are very close to 0 in scenario A.

TABLE 1.

Estimation results of Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic and Inline graphic in Theorem 2 and the optimal weights Inline graphic in Theorem 3 for scenarios A and B.

scenario A scenario B
Training data 4 UKB subsamples (Inline graphic) ABCD (Inline graphic)
UKB (Inline graphic)
Testing data 1 UKB subsample (Inline graphic) HCP (Inline graphic)
Inline graphic Inline graphic 0.363
Inline graphic 2.811 3.749
Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic Inline graphic Inline graphic
Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic Inline graphic
Inline graphic Inline graphic Inline graphic
Inline graphic Inline graphic Inline graphic

The empirical SPEs for both learners were calculated to further compare their prediction performance. For each scenario, we calculated Inline graphic for Inline graphic, Inline graphic, Inline graphic based on equal ensemble weights Inline graphic, and Inline graphic based on the optimal ensemble weights Inline graphic estimated in Table 1. Given the predictors Inline graphic in the testing data, we calculated the predictions Inline graphic based on Inline graphic, Inline graphic, and Inline graphic, respectively, and their corresponding empirical SPEs. Furthermore, we generated 100 bootstrap samples by randomly drawing with replacement from the original training data, while keeping the test data unchanged. Given each bootstrap sample, all the learners were retrained and used to predict the imaging responses in the test data.
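
A sketch of this bootstrap comparison, using the equal-weight ensemble and resampling subjects within each training study (an assumption on our part about the resampling scheme), is given below; it reuses fit_wls_merged, fit_ensemble, and empirical_spe from the earlier sketches, and all names are illustrative.

```r
## Sketch of the bootstrap comparison: resample training subjects with
## replacement, refit both learners, and recompute the SPE on the fixed test data.
boot_spe <- function(studies, test, grid, h, n_boot = 100) {
  K <- length(studies)
  replicate(n_boot, {
    boot <- lapply(studies, function(s) {
      idx <- sample(nrow(s$Y), replace = TRUE)    # resample subjects within a study
      list(Y = s$Y[idx, , drop = FALSE], X = s$X[idx, , drop = FALSE])
    })
    Y_all <- do.call(rbind, lapply(boot, `[[`, "Y"))
    X_all <- do.call(rbind, lapply(boot, `[[`, "X"))
    B_merge <- sapply(grid, function(v0) fit_wls_merged(Y_all, X_all, grid, v0, h))
    B_ens <- sapply(grid, function(v0)
      fit_ensemble(lapply(boot, `[[`, "Y"), lapply(boot, `[[`, "X"),
                   grid, v0, rep(list(h), K), rep(1 / K, K)))
    c(merged = empirical_spe(test$Y, test$X, B_merge),
      ensemble = empirical_spe(test$Y, test$X, B_ens))
  })
}
```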

Figure 2 presents the bootstrap SPEs, with the empirical SPEs based on the original training data indicated by the blue lines, for both scenarios. We have several important findings from Figure 2. For scenario A, Inline graphic, Inline graphic, and Inline graphic have similar prediction performance in terms of their empirical SPEs and corresponding variations. This indicates low heterogeneity within the UKB cohort and small variation among the different subsets, consistent with the small Inline graphic for scenario A in Table 1. For scenario B, Inline graphic and Inline graphic outperform the merged learner. This is consistent with the sufficient condition of Theorem 2 (b) and the corresponding estimation results in Table 1, that is, Inline graphic. In addition, since the optimal weights are close to the equal weights, Inline graphic and Inline graphic are comparable with each other in terms of their empirical SPEs.

FIGURE 2. Boxplots of 100 bootstrap SPEs for scenario A (left) and scenario B (right), with the original SPEs indicated by the red circles.

5. DISCUSSION

In this paper, we have systematically compared the merged and ensemble learners to explicitly deal with interstudy heterogeneity in the integrative analysis of neuroimaging data obtained from multiple studies. We have considered SVCMEM to model the spatially varying association between imaging measures and a set of covariates, while explicitly accounting for the spatial smoothness and correlation of neuroimaging data. We have constructed the ensemble and merged learners for the regression coefficient functions and compared them with respect to the prediction accuracy of neuroimaging outcomes. We have used both simulation studies and real data analysis to examine the finite-sample performance of both learners.

There are several topics of interest for future research. First, we may extend the current linear spatially varying coefficient mixed effects model in (1) to the nonlinear setting, that is, Inline graphic, Inline graphic where Inline graphic is an unknown link function applied to each element in Inline graphic. Then, the estimation and inference procedures of single-index varying coefficient models (Luo et al., 2016) may be adopted here to establish the corresponding ensemble and merged learners. Second, we may compare the statistical efficiency of the ensemble and merged learners based on hypothesis testing and confidence intervals (Xie et al., 2011). Specifically, given the p-values of all K studies, denoted as Inline graphic, we may define an ensemble p-value, Inline graphic, where Inline graphic is the cumulative distribution function of the standard normal distribution. We may compare Inline graphic with the p-value of the merged learner, but we leave this for future research.
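
As an illustration only, one combination rule consistent with this description is a weighted Stouffer-type rule based on the standard normal distribution; the exact definition of the ensemble p-value may differ, so the sketch below is an assumption.

```r
## Hedged sketch of a weighted Stouffer-type ensemble p-value; the published
## formula is not reproduced here and may differ from this choice.
ensemble_pvalue <- function(p, w = rep(1 / length(p), length(p))) {
  z <- qnorm(1 - p)                                # per-study z-scores
  1 - pnorm(sum(w * z) / sqrt(sum(w^2)))           # combined one-sided p-value
}
ensemble_pvalue(c(0.03, 0.20, 0.08))               # example with 3 hypothetical studies
```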

Supplementary Material

ujae003_Supplemental_Files

Web Appendices (detailed derivations of estimation procedures, comprehensive proof, and the underlying assumptions that facilitate the technical details) and code (R codes for implementing the simulation studies) referenced in Sections 2.2, 2.3, 2.4, and 3 are available with this paper at the Biometrics website on Oxford Academic. In addition, the R codes, including synthetic multivariate functional data generation, merged and ensemble learning algorithms, and calculation of asymptotic transition points in Section 2.4, can also be found from the GitHub link https://github.com/BIG-S2/SVCMEM.

Acknowledgement

The first 2 authors, Dr. Shan and Dr. Huang, contributed equally to this paper. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH and NSF.

Contributor Information

Yue Shan, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States.

Chao Huang, Department of Statistics, Florida State University, Tallahassee, FL 32306, United States.

Yun Li, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States; Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States.

Hongtu Zhu, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States; Department of Statistics, Florida State University, Tallahassee, FL 32306, United States; Department of Statistics & Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States; Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States.

FUNDING

The research of Dr. Zhu was partially supported by the National Institute On Aging (NIA) of the National Institutes of Health (NIH) under Award Numbers RF1AG082938 and NIH MH116527. The research of Dr. Li was partially supported by NIH grants R56-AG079291 and U01HG011720. The research of Dr. Huang was partially supported by National Science Foundation (NSF) under Award Number DMS-1953087.

CONFLICT OF INTEREST

None declared.

DATA AVAILABILITY

The data that support the findings in this paper include (i) the UK Biobank resource (application no. 22783), which is subject to a data transfer agreement, (ii) the Adolescent Brain Cognitive Development (ABCD) study (https://abcdstudy.org), held in the National Institute of Mental Health Data Archive (NDA), and (iii) the Human Connectome Project (HCP) study (https://www.humanconnectome.org) by the WU-Minn Consortium (1U54MH091657).

References

  1. Alfaro-Almagro F., McCarthy P., Afyouni S., Andersson J. L., Bastiani M., Miller K. L. et al. (2021). Confound modelling in UK Biobank brain imaging. NeuroImage, 224, 117002.
  2. Basser P. J., Mattiello J., Lebihan D. (1994). Estimation of the effective self-diffusion tensor from the NMR spin echo. Journal of Magnetic Resonance, Series B, 103, 247–254.
  3. Beer J. C., Tustison N. J., Cook P. A., Davatzikos C., Sheline Y. I., Shinohara R. T. et al. (2020). Longitudinal ComBat: a method for harmonizing longitudinal multi-scanner imaging data. NeuroImage, 220, 117129.
  4. Bycroft C., Freeman C., Petkova D., Band G., Elliott L. T., Sharp K. et al. (2018). The UK Biobank resource with deep phenotyping and genomic data. Nature, 562, 203–209.
  5. Cai C., Chen R., Xie M. (2020). Individualized inference through fusion learning. WIREs Computational Statistics, 12, e1498.
  6. Casey B. J., Cannonier T., Conley M. I., Cohen A. O., Barch D. M., Heitzeg M. M. et al. (2018). The Adolescent Brain Cognitive Development (ABCD) study: imaging acquisition across 21 sites. Developmental Cognitive Neuroscience, 32, 43–54.
  7. Chen A. A., Luo C., Chen Y., Shinohara R. T., Shou H. and ADNI (2022). Privacy-preserving harmonization via distributed ComBat. NeuroImage, 248, 118822.
  8. Fortin J.-P., Parker D., Tunç B., Watanabe T., Elliott M. A., Ruparel K. et al. (2017). Harmonization of multi-site diffusion tensor imaging data. NeuroImage, 161, 149–170.
  9. Guan Z., Parmigiani G., Patil P. (2019). Merging versus ensembling in multi-study prediction: theoretical insight from random effects. arXiv preprint arXiv:1905.07382.
  10. Guillaume B., Wang C., Poh J., Shen M. J., Ong M. L., Tan P. F. et al. (2018). Improving mass-univariate analysis of neuroimaging data by modelling important unknown covariates: application to epigenome-wide association studies. NeuroImage, 173, 57–71.
  11. Hu F., Chen A. A., Horng H., Bashyam V., Davatzikos C., Alexander-Bloch A. et al. (2023). Image harmonization: a review of statistical and deep learning methods for removing batch effects and evaluation metrics for effective harmonization. NeuroImage, 274, 120125.
  12. Huang C., Thompson P., Wang Y., Yu Y., Zhang J., Kong D. et al. (2017). FGWAS: functional genome wide association analysis. NeuroImage, 159, 107–121.
  13. Huang C., Zhu H. (2022). Functional hybrid factor regression model for handling heterogeneity in imaging studies. Biometrika, 109, 1133–1148.
  14. Jackson D., Riley R. D. (2014). A refined method for multivariate meta-analysis and meta-regression. Statistics in Medicine, 33, 541–554.
  15. Lagani V., Karozou A. D., Gomez-Cabrero D., Silberberg G., Tsamardinos I. (2016). A comparative evaluation of data-merging and meta-analysis methods for reconstructing gene-gene interactions. BMC Bioinformatics, 17, 287–305.
  16. Lee S., Sun W., Wright F. A., Zou F. (2017). An improved and explicit surrogate variable analysis procedure by coefficient adjustment. Biometrika, 104, 303–316.
  17. Leek J. T., Storey J. D. (2007). Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics, 3, 1724–1735.
  18. Luo X., Zhu L., Zhu H. (2016). Single-index varying coefficient model for functional responses. Biometrics, 72, 1275–1284.
  19. Patil P., Parmigiani G. (2018). Training replicable predictors in multiple studies. Proceedings of the National Academy of Sciences, 115, 2578–2583.
  20. Price A. L., Patterson N. J., Plenge R. M., Weinblatt M. E., Shadick N. A., Reich D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38, 904–909.
  21. Prusty S., Patnaik S., Dash S. K. (2022). SKCV: stratified K-fold cross-validation on ML classifiers for predicting cervical cancer. Frontiers in Nanotechnology, 4, 972421.
  22. Somerville L. H., Bookheimer S. Y., Buckner R. L., Burgess G. C., Curtiss S. W., Dapretto M. et al. (2018). The Lifespan Human Connectome Project in Development: a large-scale study of brain connectivity development in 5-21 year olds. NeuroImage, 183, 456–468.
  23. Sudlow C., Gallacher J., Allen N., Beral V., Burton P., Danesh J. et al. (2015). UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine, 12, e1001779.
  24. Wang J., Zhao Q., Hastie T., Owen A. B. (2017). Confounder adjustment in multiple hypothesis testing. Annals of Statistics, 45, 1863–1894.
  25. Weiner M. W., Veitch D. P., Aisen P. S., Beckett L. A., Cairns N. J., Green R. C., ..., Alzheimer’s Disease Neuroimaging Initiative (2017). Recent publications from the Alzheimer’s Disease Neuroimaging Initiative: reviewing progress toward improved AD clinical trials. Alzheimer’s & Dementia: The Journal of the Alzheimer’s Association, 13, e1–e85.
  26. Xie M., Singh K., Strawderman W. E. (2011). Confidence distributions and a unifying framework for meta-analysis. Journal of the American Statistical Association, 106, 320–333.
  27. Zeng D., Lin D. (2015). On random-effects meta-analysis. Biometrika, 102, 281–294.
  28. Zhang J., Chen J. (2007). Statistical inference for functional data. Annals of Statistics, 35, 1052–1079.
  29. Zhang Y., Bernau C., Parmigiani G., Waldron L. (2020). The impact of different sources of heterogeneity on loss of accuracy from genomic prediction models. Biostatistics, 21, 253–268.
  30. Zhao B., Li T., Yang Y., Wang X., Luo T., Shan Y. et al. (2021). Common genetic variation influencing human white matter microstructure. Science, 372, eabf3736.
  31. Zhao B., Luo T., Li T., Li Y., Zhang J., Shan Y. et al. (2019). Genome-wide association analysis of 19,629 individuals identifies variants influencing regional brain volumes and refines their genetic co-architecture with cognitive and mental health traits. Nature Genetics, 51, 1637–1644.
  32. Zhu H., Chen K., Luo X., Yuan Y., Wang J.-L. (2019). FMEM: functional mixed effects models for longitudinal functional responses. Statistica Sinica, 29, 2007–2033.
  33. Zhu H., Li R., Kong L. (2012). Multivariate varying coefficient model for functional responses. Annals of Statistics, 40, 2634–2666.
  34. Zhu H., Li T., Zhao B. (2023). Statistical learning methods for neuroimaging data analysis with applications. Annual Review of Biomedical Data Science, 6, 73–104.


