Efficacy of MRI data harmonization in the age of machine learning: a multicenter study across 36 datasets

Chiara Marzi; Marco Giannelli; Andrea Barucci; Carlo Tessa; Mario Mascalchi; Stefano Diciotti

doi:10.1038/s41597-023-02421-7

2024 Jan 23;11:115. doi: 10.1038/s41597-023-02421-7

PMCID: PMC10805868 PMID: 38263181

Abstract

Pooling publicly available MRI data from multiple sites makes it possible to assemble extensive groups of subjects, increase statistical power, and promote data reuse with machine learning techniques. The harmonization of multicenter data is necessary to reduce the confounding effect associated with non-biological sources of variability in the data. However, when applied to the entire dataset before machine learning, harmonization leads to data leakage, because information outside the training set may affect model building and potentially lead to falsely overestimated performance. We propose 1) a measurement of the efficacy of data harmonization; 2) a harmonizer transformer, i.e., an implementation of the ComBat harmonization allowing its encapsulation among the preprocessing steps of a machine learning pipeline, avoiding data leakage by design. We tested these tools using brain T₁-weighted MRI data from 1740 healthy subjects acquired at 36 sites. After harmonization, the site effect was removed or reduced, and we showed the data leakage effect in predicting individual age from MRI data, highlighting that introducing the harmonizer transformer into a machine learning pipeline allows for avoiding data leakage by design.

Subject terms: Predictive markers, Brain imaging

Introduction

In recent years, there has been an increasing trend toward data sharing in neuroimaging research communities, leading to a rising number of public neuroimaging databases and collaborative multicenter initiatives^1–4. Indeed, pooling MRI data from multiple sites provides an opportunity to assemble more extensive and diverse groups of subjects^2,3,5,6, increase statistical power^3,7–10, and study rare disorders and subtle effects^11,12. However, a major drawback of combining neuroimaging data across sites is the introduction of confounding effects due to non-biological variability in the data, typically related to image acquisition hardware and protocol. Indeed, properties of MRI such as scanner field strength, radiofrequency coil type, gradient coil characteristics, hardware, image reconstruction algorithm, and non-standardized acquisition protocol parameters can introduce unwanted technical variability, which is also reflected in MRI-derived features^13–15.

The harmonization of multicenter data, defined as applying mathematical and statistical concepts to reduce unwanted site variability while maintaining the biological content, is, therefore, necessary to ensure the success of cooperative analyses. Currently, among the harmonization methods for tabular data available to the neuroimaging scientific community, ComBat is one of the most widely used^7,12,16–34. The ComBat model was first introduced in gene expression analysis as a batch-effect correction tool to remove unwanted variation associated with the site and preserve biological associations in the data³⁵. In general, ComBat applies to situations where multiple features of the same type are measured for each participant, i.e., expression levels for different genes or imaging-derived metrics from different voxels or anatomical regions. The success of ComBat and derivatives has been measured compared to other harmonization techniques^3,5,6 and through simulations of the site effect from single-center data². Previous literature has primarily focused on assessing the maintenance of biological variability in harmonized data^2,3,5. However, less effort has been put into quantitative measurements of the efficacy of harmonization in removing the unwanted site effect.

Moreover, the pooling of multicenter data and the consequent availability of large sample sizes pave the way for data reuse with machine and deep learning techniques^17,19,22,23,25. In the case of multicenter data, harmonization is thus added to conventional data preprocessing steps, including, e.g., data cleaning and imputation, feature extraction, and reduction. Similar to other procedures, the harmonization parameters should be optimized on training data only and subsequently applied to test data. Indeed, this approach avoids data leakage, which happens when information from outside the training set is used to create the model, potentially leading to falsely overestimated performance. Crucially, this aspect has sometimes been overlooked in previous applications of ComBat, in which the entire data sample was harmonized before being split into the training and test sets used for training and testing machine or deep learning techniques^2,5,17,19,22,23,25,36–41.

To the best of our knowledge, the harmonization techniques for neuroimaging data have been applied without attention to avoiding data leakage, and this effect has not been quantified. In addition, although the Python package neuroHarmonize² and the R code provided by Radua and colleagues³ include functions that estimate the harmonization model on the training data and apply it separately to the test data, they were not conceived to be executed in a machine learning pipeline, i.e., an end-to-end framework that orchestrates the flow of data into a machine learning model, speeds up the development and testing of machine learning systems, and natively avoids data leakage by design.

For these reasons, in this study, we propose 1) a measurement of the efficacy of data harmonization in reducing the site effect by the performance of a machine learning classifier trained to identify the imaging site, 2) a ComBat implementation using a harmonizer transformer, i.e., a method that, combined with a classifier/regressor, forms a composite estimator, to be used in a machine learning pipeline, thus simplifying data analysis and avoiding data leakage by design (the source code of the efficacy measurement and harmonizer transformer is publicly available in a GitHub repository at https://github.com/Imaging-AI-for-Health-virtual-lab/harmonizer). First, we showed and measured the effect of data leakage when harmonization is performed before data splitting using simulated neuroimaging data with known site effect. Then, we estimated the efficacy of data harmonization in reducing the site effect using the harmonizer transformer on brain T₁-weighted MRI data from 1787 healthy subjects aged 5–87 years acquired at 36 imaging sites. The morphological features of cortical thickness (CT) and fractal dimension (FD), a descriptor of the structural complexity of objects with self-similarity properties⁴², were extracted to characterize brain morphology. To the best of our knowledge, this is the first time that measures of brain structural complexity, such as FD, have been studied on such a large, multicenter, and harmonized data sample. Finally, we investigated age prediction using neuroimaging variables harmonized in the entire dataset before machine learning and using the harmonizer transformer, to estimate the effect of data leakage in in vivo data.

Methods

MRI datasets

We gathered brain MR T₁-weighted images of 1787 healthy subjects aged 5–87 years belonging to 36 single-center datasets of various studies. These include the Autism Brain Imaging Data Exchange (ABIDE) (https://fcon_1000.projects.nitrc.org/indi/abide/) first and second initiatives (ABIDE I and ABIDE II, respectively)^43,44, the Information eXtraction from Images (IXI) study (https://brain-development.org/ixi-dataset/), the 1000 Functional Connectomes Project (FCP) (https://fcon_1000.projects.nitrc.org/fcpClassic/FcpTable.html), and the Consortium for Reliability and Reproducibility (CoRR) (https://fcon_1000.projects.nitrc.org/indi/CoRR/html/index.html). From each study, we drew several specific datasets of brain MR T₁-weighted images acquired in the same place with the same scanner and acquisition protocol (see Table 1). Both ABIDE I and II initiatives contributed with 17 datasets, and we named them with the initiative prefix (ABIDEI or ABIDEII) followed by the name of the institution that collected the images (e.g., ABIDEI-CALTECH and ABIDEII-BNI_1). For the institution names, we used the same nomenclature as reported online⁴⁵ with the following exceptions: (i) we merged LEUVEN_1 and LEUVEN_2 data in ABIDEI-LEUVEN, UCLA_1 and UCLA_2 data in ABIDEI-UCLA, and UM_1 and UM_2 data in ABIDEI-UM, because the acquisition parameters were the same, (ii) we split the data from ABIDEII-KKI_1 into ABIDEII-KKI_8ch and ABIDEII-KKI_32ch, because the acquisitions were performed using an 8-channel or a 32-channel phased-array head coil, respectively. The IXI study provided three different datasets named with the prefix IXI followed by the name of the institution that collected the images (e.g., IXI-Guys). From the 1000 FCP and CoRR studies, we used the International Consortium for Brain Mapping (ICBM) and the Nathan Kline Institute - Rockland Sample Pediatric Multimodal Imaging Test-Retest Sample (NKI2) datasets, respectively.

Table 1.

Scanning parameters for each single-center dataset.

Dataset	Manufacturer and model	Magnetostatic field (T)	In-plane resolution (mm × mm)	Slice thickness (mm)	TR (ms)	TE (ms)	TI (ms)	FA (°)
ABIDEI-CALTECH	Siemens MAGNETOM Trio	3	1.0 × 1.0	1.0	1590	2.73	800	10
ABIDEI-CMU	Siemens MAGNETOM Verio	3	1.0 × 1.0	1.0	1870	2.48	1100	8
ABIDEI-KKI	Philips Achieva	3	1.0 × 1.0	1.0	8.0	3.7	843	8
ABIDEI-LEUVEN	Philips Intera	3	0.98 × 0.98	1.20	9.6	4.6	885	8
ABIDEI-MAX_MUN	Siemens MAGNETOM Verio	3	1.0 × 1.0	1.0	1800	3.06	900	9
ABIDEI-NYU	Siemens MAGNETOM Allegra	3	1.3 × 1.0	1.3	2530	3.25	1100	7
ABIDEI-OHSU	Siemens MAGNETOM Trio	3	1.0 × 1.0	1.1	2300	3.58	900	10
ABIDEI-OLIN	Siemens MAGNETOM Allegra	3	1.0 × 1.0	1.0	2500	2.74	900	8
ABIDEI-PITT	Siemens MAGNETOM Allegra	3	1.1 × 1.1	1.1	2100	3.93	1000	7
ABIDEI-SBL	Philips Intera	3	1.0 × 1.0	1.0	9.0	3.5	144	8
ABIDEI-SDSU	GE Discovery MR750	3	1.0 × 1.0	1.0	11.08	4.3	600	45
ABIDEI-STANFORD	GE Signa	3	0.859 × 1.500	0.859	8.4	1.8	NA	15
ABIDEI-TRINITY	Philips Achieva	3	1.0 × 1.0	1.0	8.5	3.9	1060	8
ABIDEI-UCLA	Siemens MAGNETOM Trio	3	1.0 × 1.0	1.2	2300	2.84	853	9
ABIDEI-UM	GE Signa	3	NA	1.2	NA	1.8	NA	15
ABIDEI-USM	Siemens MAGNETOM Trio	3	1.0 × 1.0	1.2	2300	2.91	900	9
ABIDEII-BNI_1	Philips Ingenia	3	1.1 × 1.1	1.2	6.7	3.1	799	9
ABIDEII-EMC_1	GE Discovery MR750	3	1.1 × 1.1	1.2	6.7	3.1	350	9
ABIDEII-ETH_1	Philips Achieva	3	0.9 × 0.9	0.9	8.4	3.9	1150	8
ABIDEII-GU_1	Siemens MAGNETOM Trio	3	1.0 × 1.0	1.0	2530	3.5	1100	7
ABIDEII-IP_1	Philips Achieva	1.5	1.0 × 1.0	1.0	25	5.6	NA	30
ABIDEII-IU_1	Siemens TrioTim	3	0.7 × 0.7	0.7	2400	2.3	1000	8
ABIDEII-KKI_32ch	Philips Achieva	3	0.95 × 0.96	1.0	8.2	3.7	753	8
ABIDEII-KKI_8ch	Philips Achieva	3	1.0 × 1.0	1.0	8.0	3.7	843	8
ABIDEII-NYU_1	Siemens Allegra	3	1.3 × 1.0	1.33	2530	3.25	1100	7
ABIDEII-OHSU_1	Siemens TrioTim	3	1.0 × 1.0	1.1	2300	3.58	900	10
ABIDEII-SDSU_1	GE Discovery MR750	3	1.0 × 1.0	1.0	8.136	3.172	600	8
ABIDEII-TCD_1	Philips Intera Achieva	3	0.9 × 0.9	0.9	8.4	3.9	1150	8
ABIDEII-UCD_1	Siemens TrioTim	3	1.0 × 1.0	1.0	2000	3.16	1050	8
ABIDEII-UCLA_1	Siemens TrioTim	3	1.0 × 1.0	1.2	2300	2.86	853	9
ABIDEII-USM_1	Siemens TrioTim	3	1.0 × 1.0	1.2	2300	2.91	900	9
ICBM	NA	3	1.0 × 1.0	1.0	NA	NA	NA	NA
IXI-Guys	Philips Gyroscan Intera	1.5	NA	NA	9.813	4.603	NA	8
IXI-HH	Philips Intera	3	NA	NA	9.6	4.6	NA	8
IXI-IOP	NA	NA	NA	NA	NA	NA	NA	NA
NKI2	NA	3	1.0 × 1.0	1.0	NA	NA	NA	NA

FA, flip angle; NA, not available; TE, echo time; TI, inversion time; TR, repetition time.

In each single-center dataset, baseline MRI scans of typically developing and aging brain (one for each subject) with available age and sex information were included. The lack of a recognized neurological or psychiatric disorder diagnosis was used to define normal development and aging. The leading institutions, at each site where the MR images were collected, had obtained informed consent from all participants, and were authorized by the local Ethics Committees. Table S1 shows the general characteristics of each single-center dataset. In this study, we grouped the single-center datasets into three multicenter meta-datasets based on age and the amount of overlap between age distributions. We considered the following age ranges: childhood (5–13 years), adolescence (11–20 years), and adulthood (18–87 years). We measured the overlap between age distributions by the n-distribution Bhattacharyya coefficient (BC)⁴⁶, an extension of the 2-distribution BC⁴⁷. The BC is 0 when there is no overlap and 1 when the overlap is complete. In our study, n is the number of single-center datasets grouped in the meta-dataset covering the above-mentioned age ranges and may differ between meta-datasets. Therefore, we constructed the CHILDHOOD meta-dataset containing 11 single-center datasets, whose subjects’ age varies between 5 and 13 years, and whose age distributions have a BC = 0.71. The ADOLESCENCE meta-dataset includes 9 single-center datasets whose subjects’ age ranges from 11 to 20 years, and whose age distributions have a BC = 0.45. Finally, the ADULTHOOD meta-dataset consists of all data belonging to subjects aged between 18 and 87 years (12 single-center datasets), whose age distributions have a BC = 0. A detailed description of the composition of each meta-dataset and their age distributions are shown in Table 2 and Fig. 1, respectively.
In addition, we also merged all single-center datasets, creating a meta-dataset, called LIFESPAN, that covers the entire age range (5–87 years). In this meta-dataset, composed of 36 imaging sites, the single-center age distributions have a null overlap (Fig. 1).
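As an illustration of how an overlap of this kind can be computed, the following minimal sketch estimates an n-distribution BC from histograms. It assumes the n-distribution extension is the sum, over bins, of the geometric mean of the n normalized histograms (which reduces to the usual sum of sqrt(p·q) for n = 2). The function name, the histogram-based density estimate, and the number of bins are illustrative assumptions, not details taken from this study.

```python
import numpy as np

def bhattacharyya_n(samples, bins=20, value_range=None):
    """Overlap of n empirical distributions (0 = none, 1 = complete).

    Assumption: the n-distribution BC is the sum over bins of the
    geometric mean of the n normalized histograms.
    """
    if value_range is None:
        lo = min(np.min(s) for s in samples)
        hi = max(np.max(s) for s in samples)
        value_range = (lo, hi)
    probs = []
    for s in samples:
        counts, _ = np.histogram(s, bins=bins, range=value_range)
        probs.append(counts / counts.sum())  # normalize to a probability mass
    probs = np.array(probs)
    # Geometric mean across the n distributions, bin by bin, then summed.
    return float(np.sum(np.prod(probs, axis=0) ** (1.0 / len(samples))))

# Identical age samples overlap completely; disjoint ones do not overlap.
rng = np.random.default_rng(0)
ages = rng.uniform(5, 13, size=200)
bc_same = bhattacharyya_n([ages, ages, ages])
bc_disjoint = bhattacharyya_n([np.array([5.0, 6.0, 7.0]),
                               np.array([60.0, 70.0, 80.0])])
```

Because the product across distributions is zero in any bin where at least one site has no subjects, a single site with a disjoint age range is enough to drive the coefficient to 0, which is consistent with the null overlap reported for the meta-datasets spanning wide age ranges.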

Table 2.

Description of the demographic characteristics of each meta-dataset.

Meta-dataset	# of single-center datasets included	# participants	Females (%)	Age range min-max in years	Age median (IQR) in years	Age distributions BC
CHILDHOOD	11	442	34.39	5.89–13.0	9.91 (2.08)	0.71
ADOLESCENCE	9	222	15.32	11.0–20.0	14.22 (3.3)	0.45
ADULTHOOD	12	814	47.42	18.0–86.32	42.5 (30.74)	0
LIFESPAN	36	1787	35.65	5.89–86.32	19.98 (27.94)	0

BC, Bhattacharyya coefficient; IQR, interquartile range.

Fig. 1 — Age distributions. Age distributions of participants for CHILDHOOD, ADOLESCENCE, ADULTHOOD, and LIFESPAN meta-datasets, grouped by single-center dataset and sorted by median age.

MR image processing

For each brain MR T₁-weighted image, we performed a cortical reconstruction and a volumetric segmentation. In this work, we analyzed cerebral structures only, and we extracted neuroimaging features from various regions of the cerebral cortex: the entire cerebral cortex, separately the left/right hemispheres of the cerebral cortex, and left/right frontal, temporal, parietal, and occipital lobes. In particular, for each region, we computed the average cortical thickness (CT) and the fractal dimension (FD).

Cortical reconstruction and volumetric segmentation

We used the FreeSurfer package to perform completely automated cortical reconstruction and volumetric segmentation of each subject’s structural T₁-weighted scan. We used version 7.1.1, except in a few cases: (i) for T₁-weighted images belonging to ICBM and NKI2 datasets, we used FreeSurfer version 5.3, and (ii) for the ABIDEI datasets, we used the FreeSurfer version 5.1 outputs previously made available online by Cameron and colleagues⁴⁸ (http://preprocessed-connectomes-project.org/abide/index.html). Even though different FreeSurfer versions may affect neuroimaging variables^49–53, such variability is considered part of the site variability and handled by the harmonization procedure. Indeed, all subjects in each center have been processed with the same version of FreeSurfer. FreeSurfer is extensively documented (see ref. ⁵⁴ for a review) and publicly accessible (http://surfer.nmr.mgh.harvard.edu/). In addition to the standard FreeSurfer outputs, we performed a parcellation of the cortical lobes using the mri_annotation2label tool with the --lobesStrict option.

All FreeSurfer outputs used in this study were visually inspected for quality assurance by two experienced radiologists (M.M. and C.T., with 35 and 30 years of experience, respectively) following an improved version of the ENIGMA Cortical Quality Control Protocol 2.0 (http://enigma.ini.usc.edu/protocols/imaging-protocols/). First, we created an HTML file for each single-center dataset showing, for each subject, the segmentation of the cortical regions overlaid on the T₁-weighted images. Then, we scrolled through the HTML file to visually identify gross segmentation errors in any cortical region. For each single-center dataset, we estimated the statistical outliers for CT features, defined as any data point more than 2.698 standard deviations below or above the mean. For each subject, we carefully inspected the cortical segmentations that showed feature values labeled as statistical outliers to assess whether the outlier was an actual segmentation error. If so, the subject was excluded from further analyses.
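The statistical outlier rule described above can be sketched as follows. This is a minimal illustration; the function name and the use of the sample standard deviation (ddof=1) are assumptions, since the text does not specify which estimator was used.

```python
import numpy as np

def flag_outliers(values, n_sd=2.698):
    """Flag values lying more than n_sd standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    sd = values.std(ddof=1)  # assumption: sample standard deviation
    return np.abs(values - mean) > n_sd * sd

# Hypothetical CT values (mm) for one single-center dataset, with one
# implausibly thick cortex appended at the end.
ct_values = np.append(np.linspace(2.4, 2.6, 30), 4.0)
flags = flag_outliers(ct_values)
```

A flagged value only marks a segmentation for closer visual inspection; as stated above, a subject is excluded only if the inspection confirms an actual segmentation error.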

Extraction of cortical thickness and fractal dimension features

For each subject, using FreeSurfer tools, we computed the average CT of each cortical region as the average distance measured from each vertex of the gray/white boundary surface to the pial surface⁵⁵.

The FD is a numerical representation of shape complexity⁵⁶. The FD is normally a fractional value and is considered a dimension because it gives a measure of space-filling capacity⁵⁷. An FD value between 2 and 3 is typical of a complex and heavily folded 2-D surface buried in a 3-D region, such as the human cerebral cortex. The FD is a very compact measure of shape complexity, combining cortical thickness, sulcal depth, and folding area into a single numeric value^58,59. In this study, the fractal analysis was carried out using the fractalbrain toolkit version 1.1 (freely available at https://github.com/chiaramarzi/fractalbrain-toolkit) and described in detail in Marzi et al.⁵⁹. The fractalbrain toolkit processes FreeSurfer outputs directly, computing the FD of various regions of the cerebral cortex: the entire cerebral cortex, separately the left/right hemispheres, and left/right frontal, temporal, parietal, and occipital lobes. Fractalbrain performs the 3D box-counting algorithm⁶⁰, adopting an automated selection of the fractal scaling window⁵⁹, a crucial step for establishing the FD for non-ideal fractals^59,61.

Briefly, we overlapped a grid composed of 3D cubes of different sizes s (where s = 2^k voxels, and k = 0, 1, …, 8) onto the segmentation and recorded the number of cubes N(s) needed to fully enclose the structure for each size. This process was repeated with 20 uniformly distributed random offsets to prevent the systematic influence of the grid placement, and the relative box count was averaged to obtain a single N(s) value^62,63. For a fractal object, the data points of the number of cubes N(s) vs. size s in the log-log plane can be modeled through a linear regression within a range of spatial scales called the fractal scaling window. Fractalbrain automatically selects the optimal fractal scaling window by searching for the interval of spatial scales that provides the best linear fit, as measured by the rounded coefficient of determination adjusted for the number of data points (R²_adj). If multiple intervals have the same rounded R²_adj, the widest interval (i.e., the one that contains the most data points in the log-log plot) is selected⁵⁹. The FD of the brain structure is then estimated as the slope (in absolute value) of the linear regression model included in the automatically selected fractal scaling window. As an example, in Fig. 2, we reported a log-log plot of the 3D box-counting algorithm optimized for the automatic selection of the best fractal scaling window of the cerebral cortex of one subject.

Fig. 2 — 3D box-counting for computation of the FD. An example of the 3D box-counting algorithm that uses an automated selection of the fractal scaling window through the *fractalbrain* toolkit⁵⁹. *N(s)* is the average number of 3D cubes of side s needed to fully enclose the brain structure computed using 20 uniformly distributed random offsets to the grid origin. The regression line within the optimal fractal scaling window, whose slope (sign changed) is the FD, is depicted in red.
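The box-counting procedure can be sketched in a few lines of NumPy. This is a simplified illustration, not the fractalbrain implementation: the function names are hypothetical, the regression uses all supplied scales instead of an automatically selected fractal scaling window, and the offsets are drawn as integer voxel shifts.

```python
import numpy as np

def box_count(coords, size, offset):
    """Number of cubes of side `size` that contain at least one voxel."""
    boxes = (coords + offset) // size  # grid cell index of every voxel
    return len(np.unique(boxes, axis=0))

def fractal_dimension(volume, k_max=8, n_offsets=20, seed=0):
    """Box-counting FD of a binary 3D volume over sizes s = 2^k, k = 0..k_max.

    Simplification: the fit uses every scale; no scaling-window selection.
    """
    rng = np.random.default_rng(seed)
    coords = np.argwhere(volume)  # coordinates of the structure's voxels
    sizes = 2 ** np.arange(k_max + 1)
    mean_counts = []
    for s in sizes:
        # Random grid offsets reduce the influence of the grid placement.
        offsets = rng.integers(0, s, size=(n_offsets, 3))
        counts = [box_count(coords, s, off) for off in offsets]
        mean_counts.append(np.mean(counts))
    # FD is the slope (sign changed) of log N(s) versus log s.
    slope, _ = np.polyfit(np.log2(sizes), np.log2(mean_counts), 1)
    return -slope

# A solid cube is a space-filling object, so its FD should be close to 3
# (slightly lower here because of edge effects from the random offsets).
solid = np.ones((32, 32, 32), dtype=bool)
fd_solid = fractal_dimension(solid, k_max=3)
```

On real cortical segmentations the largest and smallest scales typically depart from the linear trend, which is why the selection of the scaling window described above matters for the final estimate.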

Harmonization of brain cortical features

We harmonized cortical features using ComBat, a model that builds upon the statistical harmonization technique proposed by Johnson and colleagues³⁵ for location and scale (L/S) adjustments to the data while preserving between-subject biological variability. Briefly, let y_ijf be the value of the neuroimaging feature f for participant j of the single-center dataset i, for a total of k single-center datasets, n participants, and V features. Further, let X be the n × p matrix of biological covariates of interest, and Z be the n × k matrix of single-center labels. The ComBat harmonization model can be written as follows:

$$y_{ijf} = f_f(X_{ij}) + Z_{ij}\vartheta_f + \delta_{if}\,\varepsilon_{ijf} \qquad (1)$$

where $f_f(X_{ij})$ denotes the variation of $y_{ijf}$ captured by the biologically relevant covariates $X_{ij}$, and $\vartheta_f$ is the one-dimensional array of the k coefficients associated with the single-center labels $Z_{ij}$ for the feature f. We assume that the residual terms $\varepsilon_{ijf}$ have mean 0. The parameters $\delta_{if}$ describe the multiplicative site effect of the i-th site on the feature f, i.e., the scale (S) adjustment, while the location (L) parameter for the i-th site on the feature f is represented by $\gamma_{if}$ (the empirical Bayes estimates of the term $Z_{ij}\vartheta_f$). Consistent with the ComBat model notation used in Fortin et al. (2017), the harmonized $y_{ijf}^{*}$ become:

$$y_{ijf}^{*} = \frac{y_{ijf} - f_f(X_{ij}) - \gamma_{if}}{\delta_{if}} + f_f(X_{ij}) \qquad (2)$$

In this study, we used the ComBat model implemented in the neuroHarmonize v. 2.1.0 package (freely available at https://github.com/rpomponio/neuroHarmonize), an open-source and easy-to-use Python module². neuroHarmonize extends the neuroCombat package^5,6 with the possibility of specifying covariates with generic nonlinear effects on the neuroimaging feature to harmonize. In particular, the f_f (X_ij) term in Eq. (1) is a Generalized Additive Model (GAM) function of the specific covariates². Indeed, MRI-derived features are known to be influenced by demographic factors, such as age^2,3,5,59,64–70 and sex⁷¹. In our study, these variables were included in the harmonization process as sources of inter-subject biological variability. Finally, since it is not evident that the site effect affects all MRI-derived measures in the same way³, we performed a separate harmonization for each feature group of the same type (i.e., CT and FD).
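To make the location and scale adjustment concrete, the sketch below applies the harmonization formula to a single feature. It is a deliberately simplified illustration and not the neuroHarmonize implementation: the biological term is a linear function of one covariate instead of a GAM, the site parameters are plain least-squares estimates without empirical Bayes shrinkage, and the function name and simulated values are hypothetical.

```python
import numpy as np

def location_scale_adjust(y, covariate, site):
    """Simplified L/S adjustment of one feature (no empirical Bayes)."""
    y = np.asarray(y, dtype=float)
    covariate = np.asarray(covariate, dtype=float)
    site = np.asarray(site)
    sites = np.unique(site)
    one_hot = (site[:, None] == sites[None, :]).astype(float)
    # Least-squares fit of y on the covariate and one intercept per site.
    design = np.column_stack([covariate, one_hot])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    beta, site_intercepts = coef[0], coef[1:]
    n_per_site = one_hot.sum(axis=0)
    alpha = np.sum(n_per_site * site_intercepts) / n_per_site.sum()
    gamma = site_intercepts - alpha    # additive (location) site effect
    f_x = alpha + beta * covariate     # biological term f(X), linear here
    resid = y - f_x - one_hot @ gamma
    # Multiplicative (scale) site effect relative to the pooled residual SD.
    delta = np.array([resid[site == s].std() for s in sites]) / resid.std()
    y_star = resid / (one_hot @ delta) + f_x
    return y_star, f_x

# Simulated feature with known additive and multiplicative site effects.
rng = np.random.default_rng(0)
site = np.repeat([0, 1, 2], 100)
age = rng.uniform(5, 87, size=300)
gamma_true = np.array([-0.3, 0.0, 0.4])
delta_true = np.array([0.5, 1.0, 2.0])
y = 2.5 - 0.01 * age + gamma_true[site] + delta_true[site] * rng.normal(0, 0.1, 300)
y_star, f_x = location_scale_adjust(y, age, site)
```

After the adjustment, the residuals around the biological term have the same mean and spread at every site, while the age trend is retained. Note that all parameters here are estimated from whatever data are passed in, which is exactly why the estimation must be restricted to training data when a predictive model follows.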

The harmonizer transformer

The increased sample size due to the pooling of data acquired in various centers facilitates the application of machine learning techniques. For training and testing machine learning models, a proper validation scheme that handles data splitting must be chosen (Fig. 3). This choice is crucial to avoid data leakage by ensuring that the entire workflow (preprocessing and model-building steps) is constructed on training data and evaluated on test data never seen during the learning phase. Indeed, data leakage in the training process may yield falsely high performance in the test set (see, e.g., ref. ⁷² and ref. ⁷³). Especially in medicine and healthcare, where relatively small datasets are usually available, the straightforward hold-out validation scheme is rarely applied. Instead, cross-validation (CV) and its nested version (nested CV) for hyperparameter optimization of the entire workflow^74–76 are frequently preferred. Also, repeated CVs or repeated nested CVs are suggested for improving the reproducibility of the entire machine learning system⁷⁵. In all these validation schemes, several training and test procedures are carried out on different data splits, which calls for a compact code structure to avoid errors that may lead to data leakage. In this view, machine learning pipelines are a solution because they orchestrate all the processing steps in a short, easier-to-read, and easier-to-maintain code structure (Fig. 3). A pipeline represents the entire data workflow, combining all transformation steps (e.g., data cleaning, data imputation, data scaling, and general data preprocessing) and machine learning model training. It is essential to automate an end-to-end training/test process without any form of data leakage and improve reproducibility, ease of deployment, and code reuse, especially when complex validation schemes are needed.

Fig. 3 — Machine learning pipeline. A pipeline represents the entire data workflow, combining all transformation steps and machine learning model training. It is essential to automate an end-to-end training/test process without any form of data leakage and improve reproducibility, ease of deployment, and code reuse, especially when complex validation schemes are needed.

In the Scikit-learn library, a popular, open-source, well-documented, and easy-to-learn machine learning package that implements a vast number of machine learning algorithms, a pipeline is a chain of “transformers” and a final “estimator” acting as a single object. The transformers are modules that apply preprocessing to the data, whereas estimators are modules that fit a model based on training data and are capable of inferring some properties on new data (https://scikit-learn.org/stable/developers/develop.html). In particular, transformers are classes with a “fit” method, which learns model parameters (e.g., mean and standard deviation for data standardization) from a training set, and a “transform” method, which applies this transformation model to any data. For example, for data standardization (transforming data to have zero mean and unit standard deviation), the mean μ must be subtracted from the data, and the result must be divided by the standard deviation σ. However, this procedure must first be performed on the training set (using μ and σ computed in the training set). In the test set, or any validation set, the same transformation must be applied to the data using the same two parameters μ and σ computed on the training set. In short, the “fit” method calculates the parameters (e.g., μ and σ in our case) and saves them internally, whereas the “transform” method applies the transformation (using the saved parameters) to any particular set of data.
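The fit/transform contract for standardization can be illustrated with a minimal NumPy-only class. The class below is a hypothetical sketch that mimics the convention; in practice, Scikit-learn's StandardScaler provides this behavior.

```python
import numpy as np

class Standardizer:
    """Minimal sketch of the fit/transform convention for standardization."""

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        # Parameters are learned from the data passed to fit and stored.
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        return self

    def transform(self, X):
        # The stored parameters are applied to any data, seen or unseen.
        return (np.asarray(X, dtype=float) - self.mean_) / self.scale_

X_train = np.array([[1.0], [3.0]])  # mean 2, standard deviation 1
X_test = np.array([[5.0]])
scaler = Standardizer().fit(X_train)   # mu and sigma from training data only
X_test_std = scaler.transform(X_test)  # (5 - 2) / 1 = 3
```

Calling fit on the union of training and test data would instead let the test sample shift μ and σ, which is the simplest form of the leakage discussed in this section.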

For these reasons, in this study, we propose the harmonizer, a Scikit-learn Python transformer that encapsulates the neuroHarmonize procedure among the preprocessing steps of a machine learning pipeline. The “fit” method of the harmonizer transformer learns the neuroHarmonize model parameters from a training set and saves the parameters internally, whereas the “transform” method is used to apply the neuroHarmonize model, previously learned on the training data set, e.g., to unseen data. The source code of the harmonizer transformer is publicly available in a GitHub repository at https://github.com/Imaging-AI-for-Health-virtual-lab/harmonizer.

In the following, we included the harmonizer transformer in a pipeline to learn the harmonization procedure parameters on the training data only and apply the harmonization procedure (with parameters obtained in the training set) to the test data. This prevented data leakage by design in the harmonization procedure independently of the chosen validation scheme.
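The way a site-aware transformer slots into a pipeline can be sketched as follows. The transformer below is a hypothetical stand-in, not the harmonizer from the repository and not ComBat: it only removes per-site mean offsets, and it assumes that the site label travels with the features in one column of X. Only the Scikit-learn classes and functions (Pipeline, KFold, cross_val_score, LinearRegression, BaseEstimator, TransformerMixin) are real library components; everything else is illustrative.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

class SiteCenterer(TransformerMixin, BaseEstimator):
    """Toy site-effect remover (assumption: site label in column site_col)."""

    def __init__(self, site_col=0):
        self.site_col = site_col

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        site = X[:, self.site_col]
        feats = np.delete(X, self.site_col, axis=1)
        grand_mean = feats.mean(axis=0)
        # Per-site offsets are learned from the data passed to fit only.
        self.offsets_ = {s: feats[site == s].mean(axis=0) - grand_mean
                         for s in np.unique(site)}
        self.n_features_ = feats.shape[1]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        site = X[:, self.site_col]
        feats = np.delete(X, self.site_col, axis=1)
        zero = np.zeros(self.n_features_)  # sites unseen in fit: no shift
        shift = np.array([self.offsets_.get(s, zero) for s in site])
        return feats - shift

# Simulated data: one feature that depends on age plus a site offset.
rng = np.random.default_rng(0)
site = rng.integers(0, 3, size=150)
age = rng.uniform(18, 87, size=150)
feature = 3.0 - 0.01 * age + 0.2 * site + rng.normal(0, 0.05, size=150)
X = np.column_stack([site, feature])

pipe = Pipeline([("harmonize", SiteCenterer(site_col=0)),
                 ("model", LinearRegression())])
cv = KFold(n_splits=5, shuffle=True, random_state=0)
# In every fold, the transformer is refit on the training part only.
scores = cross_val_score(pipe, X, age, cv=cv,
                         scoring="neg_mean_absolute_error")
```

Because cross_val_score clones and refits the whole pipeline inside each fold, the site offsets applied to a test fold never depend on that fold's data, regardless of the validation scheme wrapped around the pipeline.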

Statistical and machine learning analyses

We performed the statistical and machine learning analyses described in the following paragraphs for each feature group of the same type (i.e., CT and FD) and each meta-dataset (i.e., CHILDHOOD, ADOLESCENCE, ADULTHOOD, and LIFESPAN).

Visualization and quantification of site effect

We first performed a series of analyses of increasing complexity to explore the actual existence of a site effect in the data. For each region-feature pair, we qualitatively showed the site effect on raw data through boxplots, using the site as the independent variable and each region-feature pair as the dependent variable. Quantitatively, the site effect was measured by analysis of covariance (ANCOVA), a general linear model that blends analysis of variance (ANOVA) and linear regression. ANCOVA evaluates whether the means of a dependent variable are equal across levels of a categorical independent variable while statistically controlling for the effects of other variables that are not of primary interest, known as covariates or nuisance variables. In this study, we set the single-center dataset as the independent variable, age, age×age, and sex as covariates, and each region-feature pair as the dependent variable.

Additionally, to further investigate the site effect on raw data and to measure the success of ComBat harmonization, we predicted the imaging site from the neuroimaging features, grouped by feature type, namely CT and FD. Specifically, we used the supervised eXtreme Gradient Boosting (XGBoost) method (with version 0.90 default hyperparameters for a classification task), a scalable end-to-end tree-boosting system widely used to achieve state-of-the-art performance on many recent machine learning challenges⁷⁷. Using N=100 repetitions of a stratified 5-fold CV, we estimated the median balanced accuracy. The statistical significance of prediction performance was determined via permutation analysis. Thus, for each features group, 5000 new models were created using a random permutation of the target labels (i.e., the imaging site), such that the explanatory neuroimaging variables were dissociated from their corresponding imaging site to simulate the null distribution of the performance measure against which the observed value was tested⁷⁸. Since, in this study, single-center datasets showed different age groups, the random target labels permutation was performed within groups of subjects of similar age⁷⁹, which were categorized into five-year intervals. The selection of a 5-year value was made to ensure it was sufficiently small to discern age differences while being large enough to avoid an excessive reduction in the potential permutations within each age group.

Median balanced accuracy was considered significantly different from the chance level when the p-value computed using permutation tests was < 0.05. Additionally, we calculated the average confusion matrix over repetitions to graphically evaluate the goodness of prediction. The same imaging site prediction was performed on raw data (i.e., without harmonization) to confirm the existence of the site effect and on harmonized data (with neuroHarmonize and the harmonizer transformer) to investigate whether the site effect was reduced or removed.
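The restricted permutation scheme and the resulting p-value can be sketched as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, the five-year blocks are taken as floor(age / 5), and the p-value uses the common (count + 1) / (permutations + 1) convention, none of which is specified in the text.

```python
import numpy as np

def permute_within_blocks(labels, age, block_years=5, rng=None):
    """Shuffle site labels only among subjects in the same age block."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    # Assumption: blocks are [0, 5), [5, 10), ... years.
    blocks = np.floor(np.asarray(age) / block_years).astype(int)
    permuted = labels.copy()
    for b in np.unique(blocks):
        idx = np.flatnonzero(blocks == b)
        permuted[idx] = rng.permutation(labels[idx])
    return permuted

def permutation_p_value(observed, null_scores):
    """One-sided p-value of an observed score against a null distribution."""
    null_scores = np.asarray(null_scores)
    return (np.sum(null_scores >= observed) + 1) / (len(null_scores) + 1)

# Hypothetical subjects: ages in years and imaging site labels.
age = np.array([6, 7, 8, 12, 13, 14, 31, 33])
labels = np.array(["A", "B", "A", "C", "C", "B", "D", "E"])
perm = permute_within_blocks(labels, age, rng=np.random.default_rng(0))
p_value = permutation_p_value(0.9, [0.1, 0.2, 0.3])
```

Restricting the shuffle to subjects of similar age keeps the association between age and site intact under the null, so a significant result reflects site information in the features beyond what age alone would explain.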

We propose to measure the efficacy of harmonization in reducing or removing the site effect through a two-step assessment. First, we evaluated whether the site prediction after the harmonization process was not significantly different from a random prediction by comparing the median balanced accuracy over repetitions with the distribution of balanced accuracies estimated using the permutation test with 5000 permutations (the default value in the randomise tool of FSL, the FMRIB Software Library, for non-parametric permutation inference on neuroimaging data⁸⁰). Considering, for example, a significance threshold of 0.05 in the permutation test, in the case of complete removal of the site effect, the site prediction will not be different from that of a random model (i.e., p-value ≥ 0.05). Second, in the case of a permutation test p-value < 0.05, we compared the balanced accuracy obtained by predicting the site without and with the harmonization procedure. In particular, we assessed the site effect reduction by testing whether the median balanced accuracy obtained predicting the imaging site with harmonized data was significantly lower than that estimated with raw data through the non-parametric one-sided Wilcoxon signed-rank test, with a significance threshold of 0.05⁸¹. The source code for evaluating the effectiveness of harmonization using the harmonizer transformer is publicly available in a GitHub repository at https://github.com/Imaging-AI-for-Health-virtual-lab/harmonizer.

To estimate the effect of data leakage in the prediction of the imaging site caused by performing the harmonization on all data before splitting into training and test sets, we tested whether the balanced accuracies obtained using neuroHarmonize on all data before any split were consistently lower than those estimated using the harmonizer transformer in the above mentioned stratified CV scheme. Since the same data set splits were applied for both CT and FD, the comparison was carried out through a paired test, i.e., the non-parametric one-sided Wilcoxon signed-rank test with a significance threshold of 0.05⁸¹.

Associations with age

While it is essential to show that a harmonization method successfully reduces a possible site effect, it is equally crucial to verify that it preserves the biological variability in the data. Indeed, a harmonization method that removes both site and biological effects has no utility. One of the most influential sources of biological variability in the neuroimaging features of healthy subjects is undoubtedly chronological age. Throughout the lifespan, the brain structure changes because of a complex interplay between multiple maturational and neurodegenerative processes. Such processes could yield large spatial and temporal variations in the brain^65,82,83.

For these reasons, we attempted to predict individual age from neuroimaging features through an XGBoost model (version 0.90 with default hyperparameters for a regression task)⁷⁷. We estimated the median (over repetitions) mean absolute error (MAE) using N = 100 repetitions of a 5-fold CV. Age prediction was performed on harmonized data using both neuroHarmonize and the harmonizer transformer in the CV pipeline. To estimate the effect of data leakage in the age prediction caused by performing the harmonization on all data before splitting into training and test sets, we compared the MAE values obtained using neuroHarmonize on all data before any split and the harmonizer transformer in the above-mentioned CV scheme. In particular, since the same data set splits were applied for both CT and FD, we assessed whether the median MAE using neuroHarmonize on all data before any split was consistently lower than that estimated using the harmonizer transformer through a paired test, i.e., the non-parametric one-sided Wilcoxon signed-rank test with a significance threshold of 0.05⁸¹.

Moreover, for each region-feature pair, we qualitatively visualized the site effect on its relationship with age through scatterplots drawn before and after the harmonization procedure (with age as the independent variable and the region-feature pair as the dependent variable).

Simulation experiments

The harmonizer transformer prevents data leakage by design in the harmonization procedure of any machine learning pipeline, independently of the chosen validation scheme. In contrast, when harmonization is applied before data splitting, data leakage is present, and its severity depends on the specific context and the extent of the leakage. In neuroimaging, the magnitude and impact of the data leakage effect are still underexplored. Therefore, we performed simulation experiments (with known site effects) and computational tests to assess the data leakage effect when the harmonization process is performed before the training-test data splitting.
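The difference between the two approaches comes down to where the harmonization parameters are estimated. The sketch below illustrates the fit-on-training, apply-to-test pattern with a deliberately simplified per-site location/scale adjustment. It is not ComBat and not the harmonizer transformer itself: the class name, its interface, and the adjustment are assumptions made for illustration.

```python
import numpy as np


class SiteLocationScaleHarmonizer:
    """Simplified stand-in for a harmonization transformer (NOT ComBat).

    Assumptions: the site label is passed alongside the features, every site
    seen at transform time was present at fit time, and harmonization is a
    plain per-site location/scale adjustment toward the pooled training data.
    """

    def fit(self, X, sites):
        X, sites = np.asarray(X, dtype=float), np.asarray(sites)
        self.pooled_mean_ = X.mean(axis=0)
        self.pooled_std_ = X.std(axis=0)
        self.site_mean_ = {s: X[sites == s].mean(axis=0) for s in np.unique(sites)}
        self.site_std_ = {s: X[sites == s].std(axis=0) for s in np.unique(sites)}
        return self

    def transform(self, X, sites):
        X, sites = np.asarray(X, dtype=float), np.asarray(sites)
        out = np.empty_like(X)
        for s in np.unique(sites):
            mask = sites == s
            # Standardize with the site's TRAINING statistics, then map back
            # to the pooled training location and scale.
            z = (X[mask] - self.site_mean_[s]) / self.site_std_[s]
            out[mask] = z * self.pooled_std_ + self.pooled_mean_
        return out


# Toy data: three sites with an additive site shift.
rng = np.random.default_rng(0)
sites = np.repeat([0, 1, 2], 40)
X = rng.normal(size=(120, 2)) + sites[:, None]
train = rng.permutation(120)[:96]
test = np.setdiff1d(np.arange(120), train)

harmonizer = SiteLocationScaleHarmonizer().fit(X[train], sites[train])  # training data only
X_train_h = harmonizer.transform(X[train], sites[train])
X_test_h = harmonizer.transform(X[test], sites[test])  # reuses training parameters
```

Because the test samples never enter `fit`, no information from them can influence the harmonization parameters. Fitting the same object on all 120 samples before splitting would reproduce the leaking configuration.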

CT and FD data simulation settings

Let y_ijf be the one-dimensional array of the simulated feature f, for the single-center i, and participant j, for a total of k single-center datasets, n_i participants for each center, and V features. In this study, we simulated CT and FD data for k = 3, 10, and 36 single-center datasets. Each single-center dataset provided the same number of participants (i.e., n_i = n), with n assuming the values 25, 50, 100, and 250. In total, we ran 24 experiments, i.e., we simulated 24 different multicenter datasets (12 for the CT features and 12 for the FD measures).

Each y_ijf was generated based on the model proposed by Johnson and colleagues³⁵ and recently used for neuroimaging features’ simulation by Chen and collaborators⁸⁴:

$y_{ijf} = \alpha_{f} + \beta_{f1} x_{ij} + \beta_{f2} x_{ij}^{2} + \gamma_{if} + \delta_{if} \varepsilon_{ijf}$

where α_f is the average value of the feature f in the single-center ICBM dataset, β_f1 = −0.0009 and β_f2 = −0.00005 are the linear and quadratic effects of the age on the feature f, respectively, and x_ij is a simulated age variable drawn from a uniform distribution X ~ uniform([20,90]). Considering the nature of our investigation, which examines the relationship between cortical thickness and FD with age, it is reasonable to assume that the relationship is no more than quadratic^59,85. The mean site effect γ_if was drawn from a normal distribution with zero mean and standard deviation equal to 0.1, while the variance site effect δ_if was drawn from a center-specific inverse gamma distribution with chosen parameters. For our simulations, we chose to distinguish the site-specific location factors by assuming independent and identically distributed (i.i.d.) normal distributions and scaling factors using the parameters described as follows. We set the value of the inverse gamma shape, for each center, as {46, 51, 56}, respectively, when k = 3, as {40, 42, .., 58} when k = 10, and as {10, 12, .., 40, 41, .., 50, 52, .., 70} when k = 36. In all cases, the inverse gamma scale was set to 50.
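The generative model above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the simulation code used in the study: the function names are hypothetical, the `alpha` values in the usage line are placeholders (the study takes them from the ICBM dataset), and the distribution of the error term is not specified in the text, so a zero-mean normal with a configurable standard deviation `noise_sd` is assumed.

```python
import numpy as np

BETA_1, BETA_2 = -0.0009, -0.00005  # linear and quadratic age effects (from the text)


def inverse_gamma_shapes(k):
    """Center-specific inverse gamma shape parameters listed in the text."""
    if k == 3:
        return [46, 51, 56]
    if k == 10:
        return list(range(40, 60, 2))
    if k == 36:
        return list(range(10, 42, 2)) + list(range(41, 51)) + list(range(52, 72, 2))
    raise ValueError("shapes are only given for k = 3, 10, 36")


def simulate(k, n, alpha, noise_sd=1.0, scale=50.0, seed=0):
    """Simulate features for k centers with n participants each.

    alpha:    per-feature baseline values (placeholders in the example below).
    noise_sd: standard deviation of the error term (an assumption).
    """
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)
    n_features = alpha.size
    features, ages, sites = [], [], []
    for i, shape in enumerate(inverse_gamma_shapes(k)):
        x = rng.uniform(20, 90, size=n)                # simulated age
        gamma = rng.normal(0.0, 0.1, size=n_features)  # mean (additive) site effect
        # Inverse gamma(shape, scale) sampled as 1 / Gamma(shape, rate=scale).
        delta = 1.0 / rng.gamma(shape, 1.0 / scale, size=n_features)
        eps = rng.normal(0.0, noise_sd, size=(n, n_features))
        y = alpha + BETA_1 * x[:, None] + BETA_2 * x[:, None] ** 2 + gamma + delta * eps
        features.append(y)
        ages.append(x)
        sites.append(np.full(n, i))
    return np.vstack(features), np.concatenate(ages), np.concatenate(sites)


# Placeholder baselines for two features; not values from the study.
Y, age, site = simulate(k=3, n=25, alpha=[2.5, 2.6])
```

Each center draws its own additive and multiplicative site effects once per feature, so all participants of a center share them, which is what makes the site predictable from the features before harmonization.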

Measuring the effect of data leakage

We measured the effect of data leakage for both the site and age prediction independently. Hereinafter, we will refer generically to performance, indicating the balanced accuracy for the site prediction task and the MAE for the age prediction task. To measure the effect of data leakage, after an external hold-out (Fig. 4), firstly, we computed the performance of an imaging site/age prediction estimator trained using a) the harmonizer transformer within the machine learning pipeline (internal not leaked test set) and b) harmonizing all data with neuroHarmonize before the actual prediction (internal leaked test set). Secondly, we compared these performances with that observed on an external test set never used for harmonization and training (Fig. 4). In the absence of data leakage, the performance in the internal and external test sets should be similar and not significantly different. When data leakage is present, the performance in the internal test set is overly optimistic (i.e., significantly better than that on the external test set). In detail, for each experiment, we performed the following steps.

Fig. 4 — Overview of the analysis of simulated data for each experiment. After an external hold-out, we computed the performance of a site prediction classifier trained using (a) the *harmonizer* transformer within the machine learning pipeline (internal not leaked test set) and (b) harmonizing all data with *neuroHarmonize* before imaging site/age prediction (internal leaked test set). Secondly, we compared these performances with that observed on an external test set never used for harmonization and training.

External hold-out

We randomly split the data into two parts, i.e., a data set containing 50% of the samples and an external test set with the other 50% of the instances.

Imaging site/age prediction estimator training and test on the external test set

We fitted a harmonization model with neuroHarmonize using age as a covariate with a nonlinear relationship with individual MRI-derived features. To fit the harmonization model, we used the same number of instances adopted for the other two approaches (see next analyses), i.e., 80% of samples, randomly chosen, of the data set. Then, we applied the harmonization model to the data set and the external test set. Finally, we trained an XGBoost model (with version 0.90 default hyperparameters for a classification task) to predict the imaging site/age and tested it on the harmonized external test set.
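The sample bookkeeping for the external hold-out and for fitting the harmonization model can be sketched as below. The helper name is hypothetical, and the sketch assumes plain random splits without stratification by site, since the text only states that the samples were randomly chosen.

```python
import numpy as np


def holdout_indices(n_samples, seed=0):
    """Split sample indices for the external hold-out (hypothetical helper).

    Returns (fit_idx, dataset_idx, external_idx):
      dataset_idx  - 50% of the samples, used for the internal analyses
      external_idx - the other 50%, kept aside as the external test set
      fit_idx      - a random 80% of dataset_idx, used to fit the harmonization model
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    half = n_samples // 2
    dataset_idx, external_idx = order[:half], order[half:]
    n_fit = int(round(0.8 * dataset_idx.size))
    fit_idx = rng.choice(dataset_idx, size=n_fit, replace=False)
    return fit_idx, dataset_idx, external_idx


# Example: k = 10 centers with n = 25 participants each.
fit_idx, dataset_idx, external_idx = holdout_indices(250)
```

The 80% fraction matches the size of a training set in a 5-fold CV, so the harmonization model here is fitted on as many samples as in the two internal approaches.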

Imaging site/age prediction estimator training and test using harmonizer transformer within the machine learning pipeline (not leaked internal test set)

We trained and tested a pipeline containing the harmonizer transformer and an XGBoost estimator (with version 0.90 default hyperparameters) on the data set to predict the imaging site/age through a stratified 10-times repeated 5-fold CV. Thus, we trained the pipeline in the training sets of each iteration of the CV and considered the performance within the test sets of the CV.

Imaging site/age prediction estimator training and test harmonizing all data with neuroHarmonize before imaging site/age prediction (leaked internal test set)

We trained and tested a pipeline containing an XGBoost estimator (with version 0.90 default hyperparameters) on the harmonized dataset to predict the imaging site/age through a stratified 10-times repeated 5-fold CV. Thus, we trained the pipeline in the training sets of each iteration of the CV and considered the performance metric within the test sets of the CV.

For each task, i.e., imaging site and age prediction, we repeated each experiment (i.e., all these steps) 100 times with random data splits and computed the average performance across the 100 repetitions. Finally, we compared the average performance across the 100 repetitions of each internal test set (leaked and not-leaked) with that of the external test set. When data leakage is present, the performance in the internal test set is better than that on the external test set (i.e., lower balanced accuracy and MAE values for the imaging site and age prediction, respectively). To assess whether the average performance of each internal test set was lower than that of the external test set, we conducted a one-tailed t-test, applying Bonferroni correction for multiple comparisons. This statistical analysis allowed us to evaluate the significance of any differences observed between the average performance of the internal and external test sets.

In addition, we calculated, for each internal test set, the Cohen’s d effect size to estimate the magnitude of the differences between performance distributions’ means. Specifically, we used the following Cohen’s d formula: $d = \frac{\bar{x_{e}} - \bar{x_{i}}}{s}$ where $\bar{x_{e}}$ is the average performance in the external test set, $\bar{x_{i}}$ is the average performance in the internal test set, and s is the standard deviation of the difference between performance obtained in the external test set and that achieved in the internal test set.
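As a worked illustration of this formula, the function below computes d from paired performance values, one pair per repetition. The function name is hypothetical, and the use of the sample standard deviation (n - 1 denominator) for s is an assumption, since the estimator is not stated. The input values are toy numbers, not results from the study.

```python
from statistics import mean, stdev


def cohens_d(external, internal):
    """Cohen's d for paired performance values (one pair per repetition).

    Assumption: s is the sample standard deviation (n - 1 denominator) of the
    paired differences between external and internal performance.
    """
    diffs = [e - i for e, i in zip(external, internal)]
    return (mean(external) - mean(internal)) / stdev(diffs)


# Toy values, not results from the study.
d = cohens_d([1.0, 2.0, 3.0, 4.0], [0.0, 2.0, 2.0, 2.0])
print(d)
```

With this sign convention, a positive d means that the external test set yields the larger value of the metric.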

Results

Measuring the effect of data leakage in simulated data

Regarding the imaging site prediction, the results were similar for both CT (Table 3) and FD (Table 4) simulated features. The performances obtained on the leaked internal test set were overly optimistic, i.e., significantly better than those obtained in the external test set, indicating the presence of data leakage. In contrast, the average balanced accuracies recorded on the not leaked internal test set were not statistically different from those of the external test set (except in one case – see details in Table 4).

Table 3.

Imaging site prediction results with CT simulated data.

k	n	External test set	Leaked internal test set	Not leaked internal test set
3	25	0.330 (0.080)	0.253 (0.050)*	0.340 (0.049)
3	50	0.330 (0.052)	0.269 (0.035)*	0.335 (0.034)
3	100	0.347 (0.037)	0.299 (0.033)*	0.347 (0.032)
3	250	0.321 (0.023)	0.288 (0.016)*	0.320 (0.015)
10	25	0.096 (0.026)	0.054 (0.016)*	0.100 (0.020)
10	50	0.098 (0.019)	0.072 (0.012)*	0.108 (0.013)
10	100	0.096 (0.013)	0.071 (0.007)*	0.097 (0.007)
10	250	0.100 (0.008)	0.085 (0.005)*	0.102 (0.005)
36	25	0.025 (0.007)	0.010 (0.002)*	0.027 (0.004)
36	50	0.027 (0.005)	0.014 (0.002)*	0.027 (0.002)
36	100	0.028 (0.004)	0.018 (0.001)*	0.029 (0.002)
36	250	0.027 (0.002)	0.021 (0.001)*	0.028 (0.001)

Average balanced accuracy (standard deviation) obtained in the external and internal test sets.

k is the number of single-center datasets, each one containing n participants. CT: cortical thickness.

*One-tailed paired t-test Bonferroni adjusted p-value < 10⁻⁹ for the comparison with the external test set average balanced accuracy.

Table 4.

Imaging site prediction results with FD simulated data.

k	n	External test set	Leaked internal test set	Not leaked internal test set
3	25	0.358 (0.078)	0.288 (0.048)*	0.369 (0.046)
3	50	0.324 (0.060)	0.270 (0.045)*	0.333 (0.045)
3	100	0.352 (0.047)	0.296 (0.029)*	0.347 (0.027)
3	250	0.328 (0.021)	0.292 (0.019)*	0.325 (0.017)
10	25	0.084 (0.026)	0.049 (0.014)*	0.089 (0.017)
10	50	0.104 (0.017)	0.068 (0.011)*	0.105 (0.013)
10	100	0.098 (0.013)	0.072 (0.008)*	0.097 (0.009)
10	250	0.097 (0.008)	0.084 (0.005)*	0.100 (0.005)
36	25	0.024 (0.007)	0.010 (0.002)*	0.026 (0.004)
36	50	0.028 (0.005)	0.013 (0.002)*	0.026 (0.002)^
36	100	0.028 (0.003)	0.017 (0.002)*	0.028 (0.002)
36	250	0.028 (0.003)	0.021 (0.001)*	0.028 (0.001)

Average balanced accuracy (standard deviation) obtained in the external and internal test sets.

k is the number of single-center datasets, each one containing n participants. FD: fractal dimension.

*One-tailed paired t-test Bonferroni adjusted p-value < 10⁻¹⁰ for the comparison with the external test set average balanced accuracy.

^One-tailed t-test Bonferroni adjusted p-value = 0.003 for the comparison with the external test set average balanced accuracy.

Moreover, as the number of samples available in each single-center dataset decreases, the effect of data leakage increases (Tables 3, 4 for CT and FD, respectively). This phenomenon is even more evident in Fig. 5, where we reported the difference between the average balanced accuracy obtained in the external test set and that obtained in the internal test sets vs. the number of participants in each single-center dataset for CT and FD. When data leakage is present (dashed lines in Fig. 5), the difference between the average balanced accuracy in the external test set and that in the internal leaked test set always differs significantly from zero (Bonferroni adjusted p-values < 10⁻⁹ and < 10⁻¹⁰ for CT and FD, respectively) and increases as the number of participants in each single-center dataset decreases. This result has a profound impact because most neuroimaging studies (with in vivo data) have single-center datasets with a number of subjects between 25 and 100. Conversely, when data leakage is not present (solid lines in Fig. 5), the difference between the average balanced accuracy in the external test set and that in the internal not leaked test set is approximately zero and remains constant as the number of participants in each single-center dataset changes.

Fig. 5 — Imaging site prediction results with CT and FD simulated data. We reported the difference between the average balanced accuracy obtained in the external test set and that gained in the internal test sets (dotted line for leaked internal test set and solid line for not leaked internal test set) and Cohen’s d effect size vs. the number of participants per single-center dataset n. The cross marker indicates a significant difference between balanced accuracy distributions (one-tailed paired t-test Bonferroni adjusted p-value < 10⁻⁹ and < 10⁻¹⁰ for CT and FD, respectively). The colors and line types in Cohen’s d plots are consistent with those employed in the other plots.

Data leakage was also observed in the age prediction task for both CT and FD features. Similarly to the site prediction task, the performance on the leaked internal test set appears overly optimistic (Tables 5, 6 for CT and FD, respectively), and the impact of data leakage becomes more pronounced as the number of samples in each single-center dataset decreases (Fig. 6).

Table 5.

Age prediction results with CT simulated data.

k	n	External test set	Leaked internal test set	Not leaked internal test set
3	25	6.908 (1.199)	6.561 (0.676)	6.997 (0.621)
3	50	5.409 (0.524)	5.076 (0.322)*	5.279 (0.301)
3	100	5.326 (0.314)	5.273 (0.207)	5.376 (0.196)
3	250	4.596 (0.144)	4.643 (0.137)	4.670 (0.132)
10	25	5.473 (0.402)	5.099 (0.270)*	5.651 (0.273)
10	50	4.930 (0.200)	4.891 (0.166)	5.128 (0.153)
10	100	4.823 (0.163)	4.589 (0.116)*	4.701 (0.109)^
10	250	4.529 (0.097)	4.556 (0.068)	4.604 (0.068)
36	25	8.114 (0.308)	7.272 (0.223)*	7.931 (0.201)^
36	50	7.478 (0.175)	7.117 (0.150)*	7.424 (0.146)
36	100	7.175 (0.129)	7.147 (0.083)*	7.300 (0.083)
36	250	7.118 (0.073)	7.076 (0.054)*	7.139 (0.050)

Average MAE (standard deviation) obtained in the external and internal test sets.

k is the number of single-center datasets, each one containing n participants. CT: cortical thickness.

*One-tailed t-test Bonferroni adjusted p-value < 10⁻⁴ for the comparison with the external test set MAE.

^One-tailed t-test Bonferroni adjusted p-value < 10⁻⁶ for the comparison with the external test set MAE.

Table 6.

Age prediction results with FD simulated data.

k	n	External test set	Leaked internal test set	Not leaked internal test set
3	25	6.736 (0.857)	6.060 (0.682)*	6.526 (0.672)
3	50	5.280 (0.500)	4.971 (0.333)*	5.133 (0.322)
3	100	5.069 (0.282)	4.827 (0.225)*	4.896 (0.220)^
3	250	4.423 (0.159)	4.486 (0.138)	4.516 (0.134)
10	25	5.644 (0.462)	5.128 (0.319)*	5.691 (0.299)
10	50	5.263 (0.288)	5.007 (0.269)*	5.254 (0.256)
10	100	4.830 (0.144)	4.599 (0.134)*	4.707 (0.124)^
10	250	4.383 (0.072)	4.422 (0.065)	4.467 (0.062)
36	25	8.199 (0.298)	7.653 (0.162)*	8.281 (0.144)
36	50	7.608 (0.201)	7.398 (0.136)*	7.707 (0.133)
36	100	7.119 (0.144)	7.149 (0.084)	7.307 (0.078)
36	250	7.174 (0.098)	7.102 (0.061)*	7.160 (0.057)

Average MAE (standard deviation) obtained in the external and internal test sets.

k is the number of single-center datasets, each one containing n participants. FD: fractal dimension.

*One-tailed t-test Bonferroni adjusted p-value < 10⁻⁷ for the comparison with the external test set MAE.

^One-tailed t-test Bonferroni adjusted p-value < 0.0001 for the comparison with the external test set MAE.

Fig. 6 — Age prediction results with CT and FD simulated data. We reported the difference between the average MAE obtained in the external test set and that obtained in the internal test sets (dotted line for leaked internal test set and solid line for not leaked internal test set) and Cohen’s d effect size vs. the number of participants per single-center dataset n. The cross marker indicates a significant difference between MAE distributions (see Tables 5, 6 for details). The colors and line types in Cohen’s d plots are consistent with those employed in the other plots.

Visualization and quantification of the site effect in in vivo data

Quality control of FreeSurfer’s outputs resulted in removing 47 subjects based on the overall low quality of cortical reconstruction or segmentation errors in any regions. All brain regions of the remaining 1740 subjects had both CT and FD features. Thus, we have been able to analyze the site effect, the harmonization adjustments, and age prediction on the same subjects for the CT and FD groups of features. The demographic characteristics of the subjects included in the study after the quality control have been reported in Table 7.

Table 7.

Demographic characteristics of the subjects remaining after quality control and who entered into the analyses.

Site	# participants	Females (%)	Age min – max (years)	Age mean (SD) (years)	Age median (IQR) (years)
ABIDEI-CALTECH	19	21.05	17.0–56.2	28.87 (11.21)	23.6 (15.85)
ABIDEI-CMU	13	23.08	20.0–40.0	26.85 (5.74)	27.0 (9.0)
ABIDEI-KKI	32	28.12	8.07–12.77	10.15 (1.28)	9.97 (1.58)
ABIDEI-LEUVEN	35	14.29	12.2–29.0	18.17 (4.99)	16.6 (7.95)
ABIDEI-MAX_MUN	33	12.12	7.0–48.0	26.21 (9.8)	26.0 (9.0)
ABIDEI-NYU	104	24.04	6.47–31.78	15.87 (6.25)	14.4 (8.78)
ABIDEI-OHSU	15	0	8.2–11.99	10.06 (1.08)	10.08 (1.31)
ABIDEI-OLIN	14	14.29	10.0–23.0	16.93 (3.63)	16.5 (5.75)
ABIDEI-PITT	27	14.81	9.44–33.24	18.88 (6.64)	17.13 (8.3)
ABIDEI-SBL	15	0	20.0–42.0	33.73 (6.61)	36.0 (11.5)
ABIDEI-SDSU	21	28.57	8.67–16.88	14.18 (1.94)	14.1 (2.62)
ABIDEI-STANFORD	12	33.33	7.75–12.43	10.2 (1.68)	9.76 (2.9)
ABIDEI-TRINITY	25	0	12.04–25.66	17.08 (3.77)	15.91 (5.25)
ABIDEI-UCLA	42	14.29	9.21–17.79	12.92 (1.96)	12.7 (2.15)
ABIDEI-UM	72	25	8.2–28.8	14.85 (3.64)	14.8 (4.95)
ABIDEI-USM	43	0	8.77–39.39	21.36 (7.64)	19.76 (10.4)
ABIDEII-BNI_1	29	0	18.0–64.0	39.59 (15.09)	43.0 (27.0)
ABIDEII-EMC_1	25	16	6.33–10.12	8.16 (1.03)	8.19 (1.52)
ABIDEII-ETH_1	24	0	13.83–30.67	23.88 (4.5)	24.0 (6.79)
ABIDEII-GU_1	51	49.02	8.06–13.8	10.49 (1.72)	10.43 (3.04)
ABIDEII-IP_1	32	68.75	8.07–46.6	24.05 (11.64)	22.49 (14.71)
ABIDEII-IU_1	20	25	19.0–37.0	23.75 (4.9)	22.0 (4.25)
ABIDEII-KKI_32ch	45	26.67	8.06–12.67	10.42 (1.26)	10.27 (1.67)
ABIDEII-KKI_8ch	107	40.19	8.02–12.9	10.3 (1.17)	10.3 (1.67)
ABIDEII-NYU_1	30	6.67	5.89–23.81	9.52 (3.33)	9.11 (3.12)
ABIDEII-OHSU_1	55	52.73	8.0–14.0	10.4 (1.64)	10.0 (2.5)
ABIDEII-SDSU_1	25	8	8.1–17.7	13.25 (3.04)	13.0 (5.2)
ABIDEII-TCD_1	21	0	10.25–20.0	15.61 (3.12)	15.25 (5.25)
ABIDEII-UCD_1	14	28.57	12.25–17.17	14.8 (1.71)	14.75 (2.65)
ABIDEII-UCLA_1	15	33.33	7.76–14.09	9.81 (2.18)	9.02 (1.93)
ABIDEII-USM_1	16	18.75	11.5–36.15	23.98 (7.8)	23.78 (10.72)
ICBM	86	52.33	19.0–85.0	44.19 (17.92)	44.5 (31.5)
IXI-Guys	309	55.99	20.07–86.2	50.75 (15.83)	53.41 (25.59)
IXI-HH	178	52.25	20.17–81.94	47.12 (16.65)	47.38 (29.16)
IXI-IOP	63	65.08	19.98–86.32	40.58 (15.41)	35.46 (17.33)
NKI2	73	41.1	6.0–17.0	11.85 (3.14)	12.0 (6.0)

IQR: interquartile range; SD: standard deviation.

The boxplots in Figs. 7, 8 summarize the distribution of the average CT and FD of the cerebral cortex at each imaging site. Notably, the site effect differs between the two features. For example, in the CHILDHOOD meta-dataset, the ABIDEII-KKI_32ch, ABIDEII-KKI_8ch, and ABIDEII-NYU_1 single-center datasets show the lowest average CT values, while subjects from the ABIDEI-STANFORD dataset have the lowest FD values. Also, for the ADOLESCENCE meta-dataset, the site effect has a different behavior for CT and FD features: for example, ABIDEII-TCD_1 shows the lowest values of CT, while ABIDEI-LEUVEN shows the lowest values of FD. Similarly, in the ADULTHOOD meta-dataset, ABIDEI-SBL has the lowest mean CT values, whereas ABIDEII-BNI_1 has the lowest FD values.

Fig. 7 — Boxplot of the average CT of the cerebral cortex. The boxplots of the average CT of the cerebral cortex without harmonization are shown for the CHILDHOOD, ADOLESCENCE, ADULTHOOD, and LIFESPAN meta-datasets.

Fig. 8 — Boxplot of the average FD of the cerebral cortex. The boxplots of the FD of the cerebral cortex without harmonization are shown for the CHILDHOOD, ADOLESCENCE, ADULTHOOD, and LIFESPAN meta-datasets.

The same result was measured quantitatively using ANCOVA. Indeed, all CT and FD features were significantly different across the single-center datasets (Table 8), but the site effect, measured by the partial η², was different in the two feature sets. In the CHILDHOOD meta-dataset, for example, each cortical region showed a higher partial η² for FD than for CT, suggesting that, in childhood, acquisition characteristics have a greater impact on the structural complexity measure, i.e., FD, than on the cortical thickness. On the other hand, in the ADOLESCENCE meta-dataset, the frontal and temporal lobes (bilaterally), along with the entire cortex, show lower partial η² for FD than for CT, whereas the parietal and occipital lobes (bilaterally) have higher partial η² for FD than for CT. Finally, in the ADULTHOOD meta-dataset, only the occipital and temporal lobes (bilaterally) have lower partial η² for FD than for CT.

Table 8.

ANCOVA results on raw data.

Region	CHILDHOOD meta-dataset		ADOLESCENCE meta-dataset		ADULTHOOD meta-dataset		LIFESPAN meta-dataset
Region	CT	FD	CT	FD	CT	FD	CT	FD
Entire cortex	0.42	0.52	0.34	0.32	0.28	0.31	0.33	0.41
lh cortex	0.44	0.49	0.32	0.36	0.28	0.36	0.33	0.43
rh cortex	0.39	0.42	0.34	0.36	0.26	0.27	0.32	0.39
lh cortex frontal lobe	0.34	0.43	0.48	0.42	0.22	0.26	0.35	0.35
lh cortex occipital lobe	0.26	0.37	0.31	0.46	0.44	0.30	0.36	0.41
lh cortex temporal lobe	0.57	0.65	0.25	0.21	0.33	0.25	0.44	0.51
lh cortex parietal lobe	0.26	0.39	0.25	0.39	0.21	0.27	0.24	0.37
rh cortex frontal lobe	0.21	0.34	0.53	0.41	0.18	0.26	0.30	0.33
rh cortex occipital lobe	0.21	0.41	0.30	0.46	0.46	0.34	0.37	0.46
rh cortex temporal lobe	0.59	0.66	0.23	0.17	0.35	0.29	0.48	0.54
rh cortex parietal lobe	0.27	0.47	0.32	0.38	0.24	0.28	0.29	0.39

The partial η² values are reported for each region/feature pair analyzed separately in each meta-dataset. All p-values are < 0.001.

CT: cortical thickness; FD: fractal dimension; lh: left hemisphere; rh: right hemisphere.

Harmonization efficacy

To assess whether most of the variation in the data was still associated with the site after harmonization, we predicted the imaging site using neuroimaging features grouped by feature type (i.e., CT and FD). Figures 9, 10 report the average confusion matrices (over 100 repetitions) for CT and FD features, respectively. When predicting the site using the raw data, the main diagonal of the confusion matrix is prominent (i.e., the predicted site is usually the actual site) for both feature groups and each meta-dataset (Figs. 9, 10). On the other hand, when the prediction of the site is performed using harmonized data (through neuroHarmonize or harmonizer transformer), the impact of the main diagonal of the confusion matrix is weak. The confusion matrices show a vertical pattern indicating that the predicted site is often the same site, regardless of the actual site (Figs. 9, 10). Moreover, the confusion matrix obtained using the harmonizer within the machine learning pipeline seems similar to that obtained by harmonizing all the data with neuroHarmonize before imaging site prediction. This result suggests that the action of the harmonizer resembles that of neuroHarmonize, although the model is built on training data only and then applied to test data. The confusion matrices for CT and FD features in the LIFESPAN meta-dataset have also been shown in Fig. 11.
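The link between the shape of a confusion matrix and the balanced accuracy can be made concrete with a small sketch. The matrices below are toy examples for three sites, not data from the study, and the helper names are hypothetical. The sketch relies on the standard definition of balanced accuracy as the mean per-class recall.

```python
import numpy as np


def row_normalize(confusion):
    """Divide each row (actual site) by its number of subjects, so rows sum to 1."""
    confusion = np.asarray(confusion, dtype=float)
    return confusion / confusion.sum(axis=1, keepdims=True)


def balanced_accuracy(confusion):
    """Mean per-site recall, i.e., the mean of the row-normalized diagonal."""
    return float(np.mean(np.diag(row_normalize(confusion))))


# Toy 3-site matrices (rows: actual site, columns: predicted site).
raw = np.array([[18, 1, 1], [2, 27, 1], [1, 2, 47]])         # prominent diagonal
harmonized = np.array([[2, 16, 2], [3, 25, 2], [4, 42, 4]])  # vertical pattern

print(balanced_accuracy(raw), balanced_accuracy(harmonized))
```

In the second matrix almost every subject is assigned to the same site, so only that site has a high recall and the balanced accuracy falls close to the chance level of 1/3 for three sites.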

Fig. 9 — Confusion matrices of site prediction using CT features. Each confusion matrix was normalized for the number of subjects belonging to each site. In this way, the sum of the matrix cells of each row gives 1. The confusion matrix obtained using the *harmonizer* within the machine learning pipeline seems similar to that obtained by harmonizing all the data with *neuroHarmonize* before imaging site prediction, even though the model is built on training data only and then applied to test data.

Fig. 10 — Confusion matrices of site prediction using FD features. Each confusion matrix was normalized for the number of subjects belonging to each site. In this way, the sum of the matrix cells of each row gives 1. The confusion matrix obtained using the *harmonizer* within the machine learning pipeline seems similar to that obtained by harmonizing all the data with *neuroHarmonize* before imaging site prediction, even though the model is built on training data only and then applied to test data.

Fig. 11 — Confusion matrices of site prediction using CT and FD features in the LIFESPAN meta-dataset. Each confusion matrix was normalized for the number of subjects belonging to each site. In this way, the sum of the matrix cells of each row gives 1. The confusion matrix obtained using the *harmonizer* within the machine learning pipeline seems similar to that obtained by harmonizing all the data with *neuroHarmonize* before imaging site prediction, even though the model is built on training data only and then applied to test data.

Table 9 reports the median balanced accuracies (over 100 repetitions) of imaging site prediction, and the efficacy of the harmonization is shown in Table 10. Specifically, we have reported the pair (age-group permutation test p-value, one-sided Wilcoxon signed-rank test p-value) to statistically assess the removal or reduction of the site effect, respectively. As expected, the median balanced accuracy of site prediction using the raw data was significantly different from the chance level (age-group permutation test p-value < 0.05 for all data), and thus, an actual imaging site effect was present on raw data. After harmonization, with neuroHarmonize or harmonizer transformer, the site effect was removed (age-group permutation test p-value ≥ 0.05 in Table 10) or only reduced (age-group permutation test p-value < 0.05, but with median balanced accuracy reduced on harmonized data, as statistically measured by the one-sided Wilcoxon signed-rank test p-value < 0.05 in Table 10). Specifically, by performing harmonization using neuroHarmonize on all data, we observe that the site effect removal seems to be ensured in all analyses performed except for the imaging site predictions using FD features in the ADOLESCENCE and ADULTHOOD meta-datasets (age-group permutation test p-value equal to 0.0188 and 0.0002, respectively, in Table 10). We found the same behavior when predicting the imaging site using CT and FD features in the LIFESPAN meta-dataset (age-group permutation test p-value equal to 0.0002 in Table 10). In the latter cases, although significantly different from a random prediction, the balanced accuracies were significantly lower than those obtained using the original data (one-sided Wilcoxon signed-rank test p-values < 0.001 in Table 10), and this indicates a site effect reduction. When applying the harmonizer transformer to the data (within the CV), we observed the actual efficacy of the harmonization, free of the data leakage that affected the previous case. Indeed, we confirmed a complete removal of the site effect only in imaging site prediction using CT features in the ADULTHOOD meta-dataset (age-group permutation test p-value equal to 0.1064 in Table 10). In all the other cases, the imaging site prediction was significantly different from the chance level (age-group permutation test p-values < 0.05 in Table 10), but the balanced accuracies were significantly lower than those obtained using the original data (one-sided Wilcoxon signed-rank test p-values < 0.001 in Table 10). Thus, the site effect removal measured using data harmonized before the splitting into training and test sets was a clear sign of data leakage even in in vivo data.

Table 9.

Site prediction results.

Meta-dataset and feature type	Balanced accuracy median (IQR)
Meta-dataset and feature type	Without harmonization	Harmonization with neuroHarmonize	Harmonization with harmonizer transformer
CHILDHOOD
CT	0.45 (0.03)	0.09 (0.01)	0.13 (0.02)
FD	0.35 (0.02)	0.09 (0.01)	0.13 (0.02)
ADOLESCENCE
CT	0.45 (0.03)	0.13 (0.02)	0.15 (0.03)
FD	0.43 (0.04)	0.16 (0.03)	0.22 (0.04)
ADULTHOOD
CT	0.40 (0.03)	0.09 (0.01)	0.09 (0.01)
FD	0.29 (0.02)	0.12 (0.01)	0.13 (0.01)
LIFESPAN
CT	0.28 (0.01)	0.06 (0.01)	0.07 (0.01)
FD	0.22 (0.01)	0.08 (0.01)	0.10 (0.01)

The median and the interquartile range of the balanced accuracy over 100 repetitions of the 5-fold CV have been reported. In bold, we have highlighted significant falsely overestimated performance due to data leakage (the median balanced accuracy in imaging site prediction using data harmonized with neuroHarmonize is lower, i.e., better performance, than that estimated using data harmonized with the harmonizer transformer within the CV – one-sided Wilcoxon signed-rank test p-values < 0.001 for all the analyses).

CT: cortical thickness; CV: cross-validation; FD: fractal dimension; IQR: interquartile range.

Table 10.

Harmonization efficacy.

Meta-dataset and feature type	Harmonization efficacy (age-group permutation test p-value, One-sided Wilcoxon signed-rank test p-value)
Meta-dataset and feature type	Harmonization with neuroHarmonize	Harmonization with harmonizer transformer
CHILDHOOD
CT	(0.5363, 10⁻¹⁸)	(0.0036, 10⁻¹⁸)
FD	(0.5853, 10⁻¹⁸)	(0.0268, 10⁻¹⁸)
ADOLESCENCE
CT	(0.3559, 10⁻¹⁸)	(0.0484, 10⁻¹⁸)
FD	(0.0090, 10⁻¹⁸)	(0.0002, 10⁻¹⁸)
ADULTHOOD
CT	(0.4545, 10⁻¹⁸)	(0.4727, 10⁻¹⁸)
FD	(0.0042, 10⁻¹⁸)	(0.0006, 10⁻¹⁸)
LIFESPAN
CT	(0.1128, 10⁻¹⁸)	(0.0006, 10⁻¹⁸)
FD	(0.0002, 10⁻¹⁸)	(0.0002, 10⁻¹⁸)

The age-group permutation test p-value and one-sided Wilcoxon signed-rank test p-value have been reported. The permutation test p-value indicates whether the site effect has been removed (i.e., p-value ≥ 0.05 means that the imaging site prediction is not different from a random prediction). One-sided Wilcoxon signed-rank test p-value indicates whether the site effect has been reduced (i.e., p-value less than 0.05 means that the prediction of imaging site using the harmonized features obtains a balanced accuracy significantly less than that estimated using raw data).

CT: cortical thickness; FD: fractal dimension.
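
The two-step reading of Table 10 described in the caption can be sketched as a small decision rule. This is only an illustration of how the two p-values are interpreted, not the authors' implementation; the function name and the returned labels are assumptions introduced here, and the example p-values are copied from the harmonizer transformer column of Table 10.

```python
def harmonization_efficacy(p_permutation, p_wilcoxon, alpha=0.05):
    """Interpret the two-step assessment of the site effect.

    p_permutation: p-value of the permutation test against chance level.
    p_wilcoxon: p-value of the one-sided Wilcoxon signed-rank test
        (balanced accuracy on harmonized data lower than on raw data).
    """
    if p_permutation >= alpha:
        # Site prediction is not distinguishable from a random prediction.
        return "removed"
    if p_wilcoxon < alpha:
        # Still above chance, but significantly lower than with raw data.
        return "reduced"
    return "not reduced"


# (permutation p-value, Wilcoxon p-value) pairs as listed in Table 10.
table_10 = {
    "ADULTHOOD CT": (0.4727, 1e-18),
    "CHILDHOOD CT": (0.0036, 1e-18),
}
for name, (p_perm, p_wil) in table_10.items():
    print(name, harmonization_efficacy(p_perm, p_wil))
```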

Age prediction

Table 11 reports the median MAE values (over 100 repetitions) of the age prediction model. Overall, MAE values of age prediction using data harmonized with neuroHarmonize before the splitting into training and test sets are significantly lower than those obtained using data harmonized with the harmonizer within the CV (one-sided Wilcoxon signed-rank p-values < 0.05 for all the cases, except for CT features in the CHILDHOOD meta-dataset, see Table 11). In line with the results of simulations, the data leakage introduced by harmonizing the data all at once leads to an overly optimistic performance.

Table 11.

Age prediction results.

Meta-dataset and feature type	MAE median (IQR)
Meta-dataset and feature type	Harmonization with neuroHarmonize	Harmonization with harmonizer	One-sided Wilcoxon signed-rank test p-value
CHILDHOOD
CT	1.31 (0.03)	1.28 (0.03)	1
FD	1.16 (0.02)	1.18 (0.02)	10⁻¹³
ADOLESCENCE
CT	1.69 (0.06)	1.76 (0.07)	10⁻¹⁵
FD	1.56 (0.06)	1.59 (0.06)	10⁻⁶
ADULTHOOD
CT	10.85 (0.12)	10.88 (0.14)	0.007
FD	8.68 (0.13)	8.73 (0.13)	10⁻⁵
LIFESPAN
CT	7.35 (0.06)	7.55 (0.09)	10⁻¹⁸
FD	5.48 (0.04)	5.60 (0.07)	10⁻¹⁸

The median MAE and the interquartile range over 100 repetitions have been reported. In bold, we have highlighted significant falsely overestimated performance due to data leakage (the median MAE in predicting age using data harmonized with neuroHarmonize is lower than that estimated using data harmonized with the harmonizer transformer within the CV – one-sided Wilcoxon signed-rank test p-values < 0.05 for all the analyses, except for the CT features of the CHILDHOOD meta-dataset).

CT: cortical thickness; CV: cross-validation; FD: fractal dimension; IQR: interquartile range; MAE: mean absolute error.
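
The entries of Table 11 summarize one error value per repetition by its median and interquartile range. A minimal sketch of that summary is given below, assuming MAE denotes the mean absolute error; the ages and per-repetition errors are hypothetical, and the IQR uses the default quantile method of Python's `statistics` module, which may differ slightly from the one used in the study.

```python
from statistics import mean, median, quantiles


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def median_and_iqr(scores):
    """Median and interquartile range (Q3 - Q1) of a list of scores."""
    q1, _, q3 = quantiles(scores, n=4)
    return median(scores), q3 - q1


# Hypothetical chronological and predicted ages (years) for one test fold.
true_ages = [10.0, 12.0, 15.0]
predicted_ages = [11.0, 11.0, 17.0]
print(mean_absolute_error(true_ages, predicted_ages))

# Hypothetical MAE values, one per repetition of the cross-validation.
repetition_maes = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(median_and_iqr(repetition_maes))
```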

Finally, Figs. 12 and 13 report the age-dependent trends of the average CT and FD of the cerebral cortex, respectively, both without harmonization and after harmonization with the harmonizer transformer. In line with previous literature concerning features such as CT and volumes^2,5, the harmonized average CT and FD values in this study showed less variability than the raw data.

Fig. 12 — Scatterplot of the average CT of the cerebral cortex vs. age. The plot of the average CT of the cerebral cortex vs. age is shown for the CHILDHOOD, ADOLESCENCE, ADULTHOOD, and LIFESPAN meta-datasets without and with harmonization using the *harmonizer* transformer. In the latter case, we considered only the first CV among the 100 repetitions. Specifically, for each subject, we plotted the harmonized value obtained in the fold when the subject was included in the test set.

Fig. 13 — Scatterplot of the FD of the cerebral cortex vs. age. The plot of the FD of the cerebral cortex vs. age is shown for the CHILDHOOD, ADOLESCENCE, ADULTHOOD, and LIFESPAN meta-datasets without and with harmonization using the *harmonizer* transformer. In the latter case, we considered only the first CV among the 100 repetitions. Specifically, for each subject, we plotted the harmonized value obtained in the fold when the subject was included in the test set.

Discussion

In this study, we introduced the harmonizer transformer, which encapsulates the data harmonization procedure among the preprocessing steps of a machine learning pipeline to avoid data leakage by design. To this end, we explored the ComBat harmonization of CT and FD features extracted from brain T₁-weighted MRI data of 1740 healthy subjects aged 5–87 years acquired at 36 sites, as well as of simulated data. We measured the efficacy of the harmonization process in reducing or removing the unwanted site effect through a two-step assessment comparing the performance in imaging site prediction using harmonized data with that of 1) a random prediction and 2) a prediction using non-harmonized data. Finally, we confirmed how data leakage related to harmonization performed before data splitting leads to overestimating performance in simulated and in vivo data.
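
The idea of fitting the harmonization on the training fold only and then applying it to the held-out fold can be sketched as follows. This is a deliberately simplified stand-in, not ComBat and not the API of the harmonizer package: the class `SiteMeanCenterer` and the helper `k_fold_indices` are hypothetical, the correction removes only per-site mean offsets (no empirical Bayes estimation, no variance scaling, no covariates), and the feature values are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


class SiteMeanCenterer:
    """Toy harmonizer with a fit/transform interface (NOT ComBat)."""

    def fit(self, values, sites):
        grand_mean = mean(values)
        by_site = defaultdict(list)
        for v, s in zip(values, sites):
            by_site[s].append(v)
        # Offset of each site's mean from the grand mean, learned only
        # from the data passed to fit (i.e., the training fold).
        self.offsets_ = {s: mean(vs) - grand_mean for s, vs in by_site.items()}
        return self

    def transform(self, values, sites):
        # Sites unseen during fit receive no correction in this sketch.
        return [v - self.offsets_.get(s, 0.0) for v, s in zip(values, sites)]


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) for a simple interleaved k-fold split."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test


# Hypothetical feature values (e.g., average CT in mm) from two sites.
values = [2.50, 2.60, 2.55, 2.65, 2.58, 2.90, 3.00, 2.95, 3.05, 2.98]
sites = ["A"] * 5 + ["B"] * 5

for train_idx, test_idx in k_fold_indices(len(values), k=5):
    # The harmonization parameters never see the test fold.
    harmonizer = SiteMeanCenterer().fit(
        [values[i] for i in train_idx], [sites[i] for i in train_idx]
    )
    test_harmonized = harmonizer.transform(
        [values[i] for i in test_idx], [sites[i] for i in test_idx]
    )
    # A downstream model would be trained on the harmonized training fold
    # and evaluated on test_harmonized at this point.
```

Calling `fit` once on all subjects before the loop would instead let test-fold subjects contribute to the site offsets, which is the leakage scenario discussed in this study.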

Using simulated data, we showed that the data leakage effect introduced by performing the harmonization before data splitting is evident, and that it is more severe when the single-center dataset size is small and comparable with the size of the most common neuroimaging in vivo studies. In these simulated experiments, we paid particular attention to comparing different harmonization and machine learning approaches in the same conditions, i.e., the same data splits and the same number of subjects for harmonization. For this reason, we adopted 80% of the dataset size for fitting the neuroHarmonize model: with the harmonizer approach, the harmonization was computed in the training fold of a 5-fold CV, i.e., using 80% of the samples.

We chose the ComBat harmonization method due to its widespread use in the scientific community^7,12,16–34 and its implementation in the neuroHarmonize package, which enables the specification of covariates with generic non-linear effects². The efficacy of ComBat and its variants has been evaluated by comparing their performance with other harmonization techniques^3,5,6 and by simulating site effects using single-center data². However, various harmonization techniques can be used for features extracted from MRI images. One such method is the residuals harmonization, which employs a global scaling procedure to account for the influence of each site using a pair of parameters (offset and scale). These parameters can be estimated through a linear regression model or a more sophisticated approach that considers non-linearities⁵. Global scaling was initially introduced to harmonize images directly⁶. The adjusted residuals harmonization, an advancement of the residuals harmonization, integrates biological covariates (such as age, sex, and diseases) into the linear regression model, facilitating the removal of unwanted site effects while maintaining biological variability⁵. Lastly, the Correcting Covariance Batch Effects (CovBat) method is a recent variant of the ComBat method that aims to address site effects in the mean, variance, and covariance of the neuroimaging features⁸⁴.

It is important to note that this study was the first in which the efficacy of the harmonization procedure of neuroimaging data has been evaluated by also comparing the accuracy of the imaging site prediction to the chance level. Indeed, previous works have consistently shown a decrease in the accuracy of the imaging site prediction after harmonization, but without applying a significance test, and thus it was not known whether the site effect was removed or only reduced (see, e.g., refs. ^2,5). As hypothesized, there was a real imaging site effect on the raw data (age-group permutation test p-value < 0.05 for all data). The site effect was either eliminated or only reduced after data harmonization with neuroHarmonize or the harmonizer transformer. Specifically, the difference between the efficacy of harmonization by applying neuroHarmonize on all data or the harmonizer within the CV was expected because, in the former case, data leakage is present, leading to a falsely overestimated performance, i.e., an age-group permutation test p-value ≥ 0.05 and a lower median balanced accuracy (Tables 9, 10). The complete removal of the imaging site effect measured using the data harmonized with neuroHarmonize was therefore only apparent. Indeed, using the harmonizer within the CV, the imaging site effect was completely removed only for CT features in the ADULTHOOD meta-dataset. In line with the results of the simulations, the median balanced accuracies obtained by performing site prediction with data harmonized using neuroHarmonize were significantly lower than those observed using the harmonizer transformer within the CV (one-sided Wilcoxon signed-rank p-values < 0.001 for all the analyses). The differences found in the median balanced accuracy of imaging site prediction using the harmonizer transformer and neuroHarmonize emphasize the importance of introducing the harmonizer transformer into a machine learning pipeline to avoid data leakage, a source of bias in prediction results. Notably, the procedure used to measure data leakage on the simulated data (i.e., comparing the performance of imaging site prediction between the internal test set of the CV and an external test set) was not viable for the in vivo data due to the limited sample size in several centers (less than 20 subjects).

Looking at the age-group permutation test p-values for imaging site prediction using data harmonized with neuroHarmonize (i.e., harmonized before splitting into training and test sets), it can be observed that the efficacy of harmonization worsened as the overlap of the age distributions in multicenter meta-datasets decreased (Table 10). Specifically, for CT features, the age-group permutation test p-value was 0.5363 in the CHILDHOOD meta-dataset, which exhibits a good overlap of age distributions (BC = 0.71), but dropped to 0.1128 in the LIFESPAN meta-dataset, which exhibits a BC = 0. Similar behavior was observed for FD features. These results on in vivo data are in line with the simulations performed by Pomponio and colleagues, which suggested that age-disjoint studies should be challenging to harmonize in the presence of nonlinear age effects². The efficacy of the harmonization performed in CV using the harmonizer transformer does not appear to be closely linked to the degree of overlap of the age distributions in the multicenter meta-datasets. This may be explained by the fact that the harmonizer transformer handles training data only (randomly chosen within the whole meta-dataset) in the different folds of the CV, so that the actual BC values may vary.
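
To make the overlap measure concrete, the sketch below computes the Bhattacharyya coefficient (BC) between two binned age distributions, for which BC = 1 means identical histograms and BC = 0 means no overlap. It covers only the two-distribution case, whereas the study refers to an n-distribution generalization^46; the ages, the bin edges, and the function names are hypothetical choices made for illustration.

```python
from math import sqrt


def age_histogram(ages, edges):
    """Normalized histogram over half-open bins [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for age in ages:
        for i in range(len(counts)):
            if edges[i] <= age < edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts]


def bhattacharyya_coefficient(p, q):
    """Sum over bins of sqrt(p_i * q_i) for two normalized histograms."""
    return sum(sqrt(pi * qi) for pi, qi in zip(p, q))


# Hypothetical ages (years) at three sites and hypothetical bin edges.
edges = [0, 10, 20, 100]
site_a = age_histogram([6, 8, 9, 11, 12], edges)
site_b = age_histogram([7, 9, 10, 12, 13], edges)
site_c = age_histogram([60, 65, 70], edges)

bc_ab = bhattacharyya_coefficient(site_a, site_b)  # overlapping age ranges
bc_ac = bhattacharyya_coefficient(site_a, site_c)  # age-disjoint sites
print(round(bc_ab, 2), bc_ac)
```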

The age prediction performance obtained using the data harmonized with neuroHarmonize before the splitting into training and test sets is falsely inflated compared with that obtained using data harmonized with the harmonizer within the CV. Indeed, the median MAE values obtained in predicting age using data harmonized with neuroHarmonize before splitting into training and test sets were significantly lower than those estimated using data harmonized with the harmonizer within the CV (one-sided Wilcoxon signed-rank p-values < 0.05 for all the cases, except for CT features in the CHILDHOOD meta-dataset, see Table 11). These results confirm that data leakage related to harmonizing the data before splitting them into training and test sets leads to performance overestimation even for in vivo data, and they underline the importance of encapsulating the data harmonization procedure among the preprocessing steps of a machine learning pipeline.

In previous single-center studies, we observed that the computation of the FD using the box-counting algorithm with the automated selection of the optimal fractal scaling window implemented in fractalbrain best predicted chronological age in two datasets of healthy children and adults among various FD approaches and more conventional features, such as CT and gyrification index⁵⁹. In this large multicenter study, we confirmed that the FD of the cerebral cortex predicts individual age better than the average CT. In the LIFESPAN meta-dataset, for example, the error in age prediction using CT features (MAE = 7.55 years) was reduced by more than 25% using FD features (MAE = 5.60 years), in line with previous literature^59,68. This result further confirms that FD conveys additional information to that provided by other conventional structural features^{58,59,67,68,86–99}.
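
The relative reduction quoted above follows directly from the two LIFESPAN values in Table 11 (harmonizer within the CV). The short check below is only a worked example of that arithmetic; the helper name is an assumption introduced here.

```python
def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


mae_ct = 7.55  # LIFESPAN, CT features, median MAE in years (Table 11)
mae_fd = 5.60  # LIFESPAN, FD features, median MAE in years (Table 11)

print(f"{relative_reduction(mae_ct, mae_fd):.1%}")  # prints 25.8%
```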

This study has some limitations. Firstly, to show the utility of encapsulating the data harmonization procedure among the preprocessing steps of a machine learning pipeline to avoid data leakage, we used only the ComBat harmonization method. However, other harmonization techniques are available and could be similarly effective, including the recent CovBat model, which adds harmonization of covariance between sites⁸⁴. Future research may consider comparing and contrasting the performance of different harmonization methods to identify the optimal approach for specific research questions and data sets.

Secondly, we showed and measured the data leakage effect using simulated and in vivo data of CT and FD of the cerebral cortex only. Various other morphological and functional MRI-derived features might be considered. However, the focus of the study was mainly to measure the efficacy of the harmonization and to show a possible detrimental effect of harmonizing the entire dataset before machine learning analysis, and this effect does not depend on the specific features considered.

Lastly, for site/age prediction, we adopted an XGBoost model with default parameters. It is well known that classification/regression performance may be affected by the value of the hyperparameters, and proper hyperparameter optimization, e.g., through a nested CV, could be adopted. However, this procedure was not feasible in our study because of the relatively small size of data in many centers, an undesired but common scenario in many publicly available datasets. Thus, though this choice was arbitrary, we feel that using the same hyperparameters for both neuroHarmonize and harmonizer transformer data was reasonable.

In conclusion, we showed that introducing the harmonizer transformer, which encapsulates the harmonization procedure among the preprocessing steps of a machine learning pipeline, avoided data leakage. Using in vivo data, after ComBat harmonization, the site effect was completely removed or reduced while preserving the biological variability. We therefore suggest that future multicenter imaging studies include the data harmonization method in their machine learning pipelines and measure the efficacy of the harmonization process.

Acknowledgements

We wish to thank Federica Giorgini and Riccardo Benedetti for data management, Stefano Orsolini for the technical support in the quality control assessment of all FreeSurfer outputs, and Martina Franco for her preliminary analysis of a part of this data. The research leading to these results has received funding from the European Union - NextGenerationEU through the Italian Ministry of University and Research under PNRR - M4C2-I1.3 Project PE_00000019 "HEAL ITALIA" to Stefano Diciotti, CUP J33C22002920006. The views and opinions expressed are those of the authors only and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the European Commission can be held responsible for them.

Author contributions

Chiara Marzi: Conceptualization, Methodology, Software, Validation, Formal analysis, Data curation, Writing – Original draft, Writing – Review & Editing, Visualization. Marco Giannelli, Andrea Barucci, Carlo Tessa and Mario Mascalchi: Writing – Review & Editing. Stefano Diciotti: Conceptualization, Methodology, Resources, Writing – Original draft, Writing – Review & Editing, Visualization, Supervision, Project administration.

Data availability

The brain MR T₁-weighted images that support the findings of this study are available from the following online repositories:

- Autism Brain Imaging Data Exchange (ABIDE): https://fcon_1000.projects.nitrc.org/indi/abide/

- Information eXtraction from Images (IXI) study: https://brain-development.org/ixi-dataset/

- 1000 Functional Connectomes Project (FCP) – ICBM dataset: https://fcon_1000.projects.nitrc.org/fcpClassic/FcpTable.html

- Consortium for Reliability and Reproducibility (CoRR) - NKI 2 - Nathan Kline Institute (Milham): https://fcon_1000.projects.nitrc.org/indi/CoRR/html/index.html

The CT and FD features, derived from brain MR T₁-weighted images of 1740 subjects, that support the findings of this study are freely available on Zenodo^100,101. The simulated CT and FD features that support the findings of this study are freely available on a Zenodo repository¹⁰².

Code availability

The source code of the efficacy measurement and harmonizer transformer is publicly available in a GitHub repository at https://github.com/Imaging-AI-for-Health-virtual-lab/harmonizer. The following are the versions of software and Python libraries used to obtain the results presented in this study:

- FreeSurfer version 7.1.1. For T₁-weighted images belonging to ICBM and NKI2 datasets, we used FreeSurfer version 5.3. ABIDEI T₁-weighted images were already processed using FreeSurfer version 5.1.

- fractalbrain toolkit version 1.1

- neuroHarmonize v. 2.1.0 package

- eXtreme Gradient Boosting (XGBoost) version 0.90.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

The online version contains supplementary material available at 10.1038/s41597-023-02421-7.

References

1. Alfaro-Almagro F, et al. Image processing and Quality Control for the first 10,000 brain imaging datasets from UK Biobank. NeuroImage. 2018;166:400–424. doi: 10.1016/j.neuroimage.2017.10.034.
2. Pomponio R, et al. Harmonization of large MRI datasets for the analysis of brain imaging patterns throughout the lifespan. NeuroImage. 2020;208:116450. doi: 10.1016/j.neuroimage.2019.116450.
3. Radua J, et al. Increased power by harmonizing structural MRI site differences with the ComBat batch adjustment method in ENIGMA. NeuroImage. 2020;218:116956. doi: 10.1016/j.neuroimage.2020.116956.
4. Thompson PM, et al. The ENIGMA Consortium: large-scale collaborative analyses of neuroimaging and genetic data. Brain Imaging Behav. 2014;8:153–182. doi: 10.1007/s11682-013-9269-5.
5. Fortin JP, et al. Harmonization of cortical thickness measurements across scanners and sites. NeuroImage. 2018;167:104–120. doi: 10.1016/j.neuroimage.2017.11.024.
6. Fortin JP, et al. Harmonization of multi-site diffusion tensor imaging data. NeuroImage. 2017;161:149–170. doi: 10.1016/j.neuroimage.2017.08.047.
7. Beer JC, et al. Longitudinal ComBat: A method for harmonizing longitudinal multi-scanner imaging data. NeuroImage. 2020;220:117129. doi: 10.1016/j.neuroimage.2020.117129.
8. Keshavan A, et al. Power estimation for non-standardized multisite studies. NeuroImage. 2016;134:281–294. doi: 10.1016/j.neuroimage.2016.03.051.
9. Pinto MS, et al. Harmonization of Brain Diffusion MRI: Concepts and Methods. Front. Neurosci. 2020;14:396. doi: 10.3389/fnins.2020.00396.
10. Suckling J, et al. Components of variance in a multicentre functional MRI study and implications for calculation of statistical power. Hum. Brain Mapp. 2008;29:1111–1122. doi: 10.1002/hbm.20451.
11. Dansereau C, et al. Statistical power and prediction accuracy in multisite resting-state fMRI connectivity. NeuroImage. 2017;149:220–232. doi: 10.1016/j.neuroimage.2017.01.072.
12. Yu M, et al. Statistical harmonization corrects site effects in functional connectivity measurements from multi‐site fMRI data. Hum. Brain Mapp. 2018;39:4213–4227. doi: 10.1002/hbm.24241.
13. Han X, et al. Reliability of MRI-derived measurements of human cerebral cortical thickness: The effects of field strength, scanner upgrade and manufacturer. NeuroImage. 2006;32:180–194. doi: 10.1016/j.neuroimage.2006.02.051.
14. Jovicich J, et al. Reliability in multi-site structural MRI studies: Effects of gradient non-linearity correction on phantom and human data. NeuroImage. 2006;30:436–443. doi: 10.1016/j.neuroimage.2005.09.046.
15. Takao H, Hayashi N, Ohtomo K. Effect of scanner in longitudinal studies of brain volume changes. J. Magn. Reson. Imaging. 2011;34:438–444. doi: 10.1002/jmri.22636.
16. Hatton SN, et al. White matter abnormalities across different epilepsy syndromes in adults: an ENIGMA-Epilepsy study. Brain. 2020;143:2454–2473. doi: 10.1093/brain/awaa200.
17. Ingalhalikar M, et al. Functional Connectivity-Based Prediction of Autism on Site Harmonized ABIDE Dataset. IEEE Trans. Biomed. Eng. 2021;68:3628–3637. doi: 10.1109/TBME.2021.3080259.
18. Li Y, Ammari S, Balleyguier C, Lassau N, Chouzenoux E. Impact of Preprocessing and Harmonization Methods on the Removal of Scanner Effects in Brain MRI Radiomic Features. Cancers. 2021;13:3000. doi: 10.3390/cancers13123000.
19. Luna A, et al. Maturity of gray matter structures and white matter connectomes, and their relationship with psychiatric symptoms in youth. Hum. Brain Mapp. 2021;42:4568–4579. doi: 10.1002/hbm.25565.
20. Maikusa N, et al. Comparison of traveling‐subject and ComBat harmonization methods for assessing structural brain characteristics. Hum. Brain Mapp. 2021;42:5278–5287. doi: 10.1002/hbm.25615.
21. Orlhac F, et al. How can we combat multicenter variability in MR radiomics? Validation of a correction procedure. Eur. Radiol. 2021;31:2272–2280. doi: 10.1007/s00330-020-07284-9.
22. Wachinger C, Rieckmann A, Pölsterl S. Detect and correct bias in multi-site neuroimaging datasets. Med. Image Anal. 2021;67:101879. doi: 10.1016/j.media.2020.101879.
23. Wengler K, et al. Cross‐Scanner Harmonization of Neuromelanin‐Sensitive MRI for Multisite Studies. J. Magn. Reson. Imaging. 2021;54:1189–1199. doi: 10.1002/jmri.27679.
24. Zavaliangos-Petropulu A, et al. Diffusion MRI Indices and Their Relation to Cognitive Impairment in Brain Aging: The Updated Multi-protocol Approach in ADNI3. Front. Neuroinformatics. 2019;13:2. doi: 10.3389/fninf.2019.00002.
25. Zhu, Y. et al. Application of a Machine Learning Algorithm for Structural Brain Images in Chronic Schizophrenia to Earlier Clinical Stages of Psychosis and Autism Spectrum Disorder: A Multiprotocol Imaging Dataset Study. Schizophr. Bull. sbac030 (2022).
26. Tafuri B, et al. The impact of harmonization on radiomic features in Parkinson’s disease and healthy controls: A multicenter study. Front. Neurosci. 2022;16:1012287. doi: 10.3389/fnins.2022.1012287.
27. Parekh P, et al. Sample size requirement for achieving multisite harmonization using structural brain MRI features. NeuroImage. 2022;264:119768. doi: 10.1016/j.neuroimage.2022.119768.
28. Chen AA, Luo C, Chen Y, Shinohara RT, Shou H. Privacy-preserving harmonization via distributed ComBat. NeuroImage. 2022;248:118822. doi: 10.1016/j.neuroimage.2021.118822.
29. Lombardi A, et al. Extensive Evaluation of Morphological Statistical Harmonization for Brain Age Prediction. Brain Sci. 2020;10:364. doi: 10.3390/brainsci10060364.
30. Zounek AJ, et al. Feasibility of radiomic feature harmonization for pooling of [18F]FET or [18F]GE-180 PET images of gliomas. Z. Für Med. Phys. 2023;33:91–102. doi: 10.1016/j.zemedi.2022.12.005.
31. Dai P, et al. The alterations of brain functional connectivity networks in major depressive disorder detected by machine learning through multisite rs-fMRI data. Behav. Brain Res. 2022;435:114058. doi: 10.1016/j.bbr.2022.114058.
32. Saponaro S, et al. Multi-site harmonization of MRI data uncovers machine-learning discrimination capability in barely separable populations: An example from the ABIDE dataset. NeuroImage Clin. 2022;35:103082. doi: 10.1016/j.nicl.2022.103082.
33. Du X, et al. Unraveling schizophrenia replicable functional connectivity disruption patterns across sites. Hum. Brain Mapp. 2023;44:156–169. doi: 10.1002/hbm.26108.
34. Dudley JA, et al. ABCD_Harmonizer: An Open-source Tool for Mapping and Controlling for Scanner Induced Variance in the Adolescent Brain Cognitive Development Study. Neuroinformatics. 2023;21:323–337. doi: 10.1007/s12021-023-09624-8.
35.Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostat. Oxf. Engl. 2007;8:118–127. doi: 10.1093/biostatistics/kxj037. [DOI] [PubMed] [Google Scholar]
36.He L, et al. Deep Multimodal Learning From MRI and Clinical Data for Early Prediction of Neurodevelopmental Deficits in Very Preterm Infants. Front. Neurosci. 2021;15:753033. doi: 10.3389/fnins.2021.753033. [DOI] [PMC free article] [PubMed] [Google Scholar]
37.Kim, J. I. et al. Classification of Preschoolers with Low-Functioning Autism Spectrum Disorder Using Multimodal MRI Data. J. Autism Dev. Disord. (2022). [DOI] [PubMed]
38.Lo Gullo R, et al. Assessing PD-L1 Expression Status Using Radiomic Features from Contrast-Enhanced Breast MRI in Breast Cancer Patients: Initial Results. Cancers. 2021;13:6273. doi: 10.3390/cancers13246273. [DOI] [PMC free article] [PubMed] [Google Scholar]
39.Lopez-Soley E, et al. Dynamics and Predictors of Cognitive Impairment along the Disease Course in Multiple Sclerosis. J. Pers. Med. 2021;11:1107. doi: 10.3390/jpm11111107. [DOI] [PMC free article] [PubMed] [Google Scholar]
40.Simhal AK, et al. Predicting multiscan MRI outcomes in children with neurodevelopmental conditions following MRI simulator training. Dev. Cogn. Neurosci. 2021;52:101009. doi: 10.1016/j.dcn.2021.101009. [DOI] [PMC free article] [PubMed] [Google Scholar]
41.Zhou X, et al. Multimodal MR Images-Based Diagnosis of Early Adolescent Attention-Deficit/Hyperactivity Disorder Using Multiple Kernel Learning. Front. Neurosci. 2021;15:710133. doi: 10.3389/fnins.2021.710133. [DOI] [PMC free article] [PubMed] [Google Scholar]
42.Mandelbrot, B. B. The fractal geometry of nature. (W.H. Freeman, 1982).
43.Di Martino A, et al. Enhancing studies of the connectome in autism using the autism brain imaging data exchange II. Sci. Data. 2017;4:170010. doi: 10.1038/sdata.2017.10. [DOI] [PMC free article] [PubMed] [Google Scholar]
44.Di Martino A, et al. The autism brain imaging data exchange: towards a large-scale evaluation of the intrinsic brain architecture in autism. Mol. Psychiatry. 2014;19:659–667. doi: 10.1038/mp.2013.78. [DOI] [PMC free article] [PubMed] [Google Scholar]
45.Autism Brain Imaging Data Exchange (ABIDE). https://fcon_1000.projects.nitrc.org/indi/abide/ (2017).
46.Kang, S. M. & Wildes, R. P. The n-distribution Bhattacharyya coefficient. York Univ. (2015).
47.Bhattacharyya A. On a measure of divergence between two statistical populations defined by their probability distributions. Bull Calcutta Math Soc. 1943;35:99–109. [Google Scholar]
48.Cameron, C. et al. The Neuro Bureau Preprocessing Initiative: open sharing of preprocessed neuroimaging data and derivatives. Front. Neuroinformatics7 (2013).
49.Bigler ED, et al. FreeSurfer 5.3 versus 6.0: are volumes comparable? A Chronic Effects of Neurotrauma Consortium study. Brain Imaging Behav. 2020;14:1318–1327. doi: 10.1007/s11682-018-9994-x. [DOI] [PubMed] [Google Scholar]
50.Chepkoech J-L, Walhovd KB, Grydeland H, Fjell AM, for the Alzheimer’s Disease Neuroimaging Initiative Effects of change in FreeSurfer version on classification accuracy of patients with Alzheimer’s disease and mild cognitive impairment: Effects of Change in FreeSurfer Version. Hum. Brain Mapp. 2016;37:1831–1841. doi: 10.1002/hbm.23139. [DOI] [PMC free article] [PubMed] [Google Scholar]
51.Filip P, et al. Different FreeSurfer versions might generate different statistical outcomes in case–control comparison studies. Neuroradiology. 2022;64:765–773. doi: 10.1007/s00234-021-02862-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
52.Glatard, T. et al. Reproducibility of neuroimaging analyses across operating systems. Front. Neuroinformatics9, (2015). [DOI] [PMC free article] [PubMed]
53.Gronenschild EHBM, et al. The Effects of FreeSurfer Version, Workstation Type, and Macintosh Operating System Version on Anatomical Volume and Cortical Thickness Measurements. PLoS ONE. 2012;7:e38234. doi: 10.1371/journal.pone.0038234. [DOI] [PMC free article] [PubMed] [Google Scholar]
54.Fischl B. FreeSurfer. NeuroImage. 2012;62:774–781. doi: 10.1016/j.neuroimage.2012.01.021. [DOI] [PMC free article] [PubMed] [Google Scholar]
55.Fischl B, Dale AM. Measuring the thickness of the human cerebral cortex from magnetic resonance images. Proc. Natl. Acad. Sci. 2000;97:11050–11055. doi: 10.1073/pnas.200033797. [DOI] [PMC free article] [PubMed] [Google Scholar]
56.Cutting JE, Garvin JJ. Fractal curves and complexity. Percept. Psychophys. 1987;42:365–370. doi: 10.3758/BF03203093. [DOI] [PubMed] [Google Scholar]
57.Fernández E, Jelinek HF. Use of Fractal Theory in Neuroscience: Methods, Advantages, and Potential Problems. Methods. 2001;24:309–321. doi: 10.1006/meth.2001.1201. [DOI] [PubMed] [Google Scholar]
58.Im K, et al. Fractal dimension in human cortical surface: Multiple regression analysis with cortical thickness, sulcal depth, and folding area. Hum. Brain Mapp. 2006;27:994–1003. doi: 10.1002/hbm.20238. [DOI] [PMC free article] [PubMed] [Google Scholar]
59.Marzi C, Giannelli M, Tessa C, Mascalchi M, Diciotti S. Toward a more reliable characterization of fractal properties of the cerebral cortex of healthy subjects during the lifespan. Sci. Rep. 2020;10:16957. doi: 10.1038/s41598-020-73961-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
60.Russell DA, Hanson JD, Ott E. Dimension of Strange Attractors. Phys. Rev. Lett. 1980;45:1175–1178. doi: 10.1103/PhysRevLett.45.1175. [DOI] [Google Scholar]
61.Losa GA. The fractal geometry of life. Riv. Biol. 2009;102:29–59. [PubMed] [Google Scholar]
62.Falconer, K. J. Fractal geometry: mathematical foundations and applications. (John Wiley & Sons Inc, 2014).
63. Goñi J, et al. Robust estimation of fractal measures for characterizing the structural complexity of the human brain: Optimization and reproducibility. NeuroImage. 2013;83:646–657. doi: 10.1016/j.neuroimage.2013.06.072.
64. Courchesne E, et al. Normal Brain Development and Aging: Quantitative Analysis at in Vivo MR Imaging in Healthy Volunteers. Radiology. 2000;216:672–682. doi: 10.1148/radiology.216.3.r00au37672.
65. Fjell, A. M. & Walhovd, K. B. Structural Brain Changes in Aging: Courses, Causes and Cognitive Consequences. Rev. Neurosci. 21 (2010).
66. Hogstrom LJ, Westlye LT, Walhovd KB, Fjell AM. The Structure of the Cerebral Cortex Across Adult Life: Age-Related Patterns of Surface Area, Thickness, and Gyrification. Cereb. Cortex. 2013;23:2521–2530. doi: 10.1093/cercor/bhs231.
67. Madan CR, Kensinger EA. Predicting age from cortical structure across the lifespan. Eur. J. Neurosci. 2018;47:399–416. doi: 10.1111/ejn.13835.
68. Madan CR, Kensinger EA. Cortical complexity as a measure of age-related brain atrophy. NeuroImage. 2016;134:617–629. doi: 10.1016/j.neuroimage.2016.04.029.
69. Raznahan A, et al. How Does Your Cortex Grow? J. Neurosci. 2011;31:7174–7177. doi: 10.1523/JNEUROSCI.0054-11.2011.
70. Zheng F, et al. Age-related changes in cortical and subcortical structures of healthy adult brains: A surface-based morphometry study: Age-Related Study in Healthy Adult Brain Structure. J. Magn. Reson. Imaging. 2019;49:152–163. doi: 10.1002/jmri.26037.
71. Sowell ER, et al. Sex Differences in Cortical Thickness Mapped in 176 Healthy Individuals between 7 and 87 Years of Age. Cereb. Cortex. 2007;17:1550–1560. doi: 10.1093/cercor/bhl066.
72. Yagis E, et al. Effect of data leakage in brain MRI classification using 2D convolutional neural networks. Sci. Rep. 2021;11:22544. doi: 10.1038/s41598-021-01681-w.
73. Tampu IE, Eklund A, Haj-Hosseini N. Inflation of test accuracy due to data leakage in deep learning-based classification of OCT images. Sci. Data. 2022;9:580. doi: 10.1038/s41597-022-01618-6.
74. Müller, A. C. & Guido, S. Introduction to machine learning with Python: a guide for data scientists. (O’Reilly Media, Inc, 2016).
75. Scheda R, Diciotti S. Explanations of Machine Learning Models in Repeated Nested Cross-Validation: An Application in Age Prediction Using Brain Complexity Features. Appl. Sci. 2022;12:6681. doi: 10.3390/app12136681.
76. Varma S, Simon R. Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics. 2006;7:91. doi: 10.1186/1471-2105-7-91.
77. Chen, T. & Guestrin, C. XGBoost: A Scalable Tree Boosting System. in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794, doi: 10.1145/2939672.2939785 (ACM, 2016).
78. Nichols TE, Holmes AP. Nonparametric permutation tests for functional neuroimaging: A primer with examples. Hum. Brain Mapp. 2002;15:1–25. doi: 10.1002/hbm.1058.
79. Ojala M, Garriga GC. Permutation Tests for Studying Classifier Performance. J Mach Learn Res. 2010;11:1833–1863.
80. Winkler AM, Ridgway GR, Webster MA, Smith SM, Nichols TE. Permutation inference for the general linear model. NeuroImage. 2014;92:381–397. doi: 10.1016/j.neuroimage.2014.01.060.
81. Wilcoxon F. Individual Comparisons by Ranking Methods. Biom. Bull. 1945;1:80. doi: 10.2307/3001968.
82. Brouwer RM, et al. Genetic variants associated with longitudinal changes in brain structure across the lifespan. Nat. Neurosci. 2022;25:421–432. doi: 10.1038/s41593-022-01042-4.
83. Oschwald J, et al. Brain structure and cognitive ability in healthy aging: a review on longitudinal correlated change. Rev. Neurosci. 2019;31:1–57. doi: 10.1515/revneuro-2018-0096.
84. Chen AA, et al. Mitigating site effects in covariance for machine learning in neuroimaging data. Hum. Brain Mapp. 2022;43:1179–1195. doi: 10.1002/hbm.25688.
85. Steffener J. Education and age-related differences in cortical thickness and volume across the lifespan. Neurobiol. Aging. 2021;102:102–110. doi: 10.1016/j.neurobiolaging.2020.10.034.
86. Free SL, Sisodiya SM, Cook MJ, Fish DR, Shorvon SD. Three-dimensional fractal analysis of the white matter surface from magnetic resonance images of the human brain. Cereb. Cortex. 1996;6:830–836. doi: 10.1093/cercor/6.6.830.
87. King RD, et al. Fractal dimension analysis of the cortical ribbon in mild Alzheimer’s disease. NeuroImage. 2010;53:471–479. doi: 10.1016/j.neuroimage.2010.06.050.
88. King RD, et al. Characterization of Atrophic Changes in the Cerebral Cortex Using Fractal Dimensional Analysis. Brain Imaging Behav. 2009;3:154–166. doi: 10.1007/s11682-008-9057-9.
89. Marzi C, Giannelli M, Tessa C, Mascalchi M, Diciotti S. Fractal Analysis of MRI Data at 7 T: How Much Complex Is the Cerebral Cortex? IEEE Access. 2021;9:69226–69234. doi: 10.1109/ACCESS.2021.3077370.
90. Marzi C, et al. Structural Complexity of the Cerebellum and Cerebral Cortex is Reduced in Spinocerebellar Ataxia Type 2. J. Neuroimaging. 2018;28:688–693. doi: 10.1111/jon.12534.
91. Pani, J. et al. Longitudinal study of the effect of a 5-year exercise intervention on structural brain complexity in older adults. A Generation 100 substudy. NeuroImage 119226 (2022).
92. Pantoni L, et al. Fractal dimension of cerebral white matter: A consistent feature for prediction of the cognitive performance in patients with small vessel disease and mild cognitive impairment. NeuroImage Clin. 2019;24:101990. doi: 10.1016/j.nicl.2019.101990.
93. Nazlee, N., Waiter, G. D. & Sandu, A. Age‐associated sex and asymmetry differentiation in hemispheric and lobar cortical ribbon complexity across adulthood: A UK Biobank imaging study. Hum. Brain Mapp. hbm.26076, doi: 10.1002/hbm.26076 (2022).
94. Sandu A-L, et al. Fractal dimension analysis of MR images reveals grey matter structure irregularities in schizophrenia. Comput. Med. Imaging Graph. 2008;32:150–158. doi: 10.1016/j.compmedimag.2007.10.005.
95. Sandu A-L, et al. Post-adolescent developmental changes in cortical complexity. Behav. Brain Funct. 2014;10:44. doi: 10.1186/1744-9081-10-44.
96. Sandu A-L, et al. Sexual dimorphism in the relationship between brain complexity, volume and general intelligence (g): a cross-cohort study. Sci. Rep. 2022;12:11025. doi: 10.1038/s41598-022-15208-4.
97. Sandu A-L, Specht K, Beneventi H, Lundervold A, Hugdahl K. Sex-differences in grey–white matter structure in normal-reading and dyslexic adolescents. Neurosci. Lett. 2008;438:80–84. doi: 10.1016/j.neulet.2008.04.022.
98. Sandu A-L, et al. Structural brain complexity and cognitive decline in late life — A longitudinal study in the Aberdeen 1936 Birth Cohort. NeuroImage. 2014;100:558–563. doi: 10.1016/j.neuroimage.2014.06.054.
99. Sandu A-L, Paillère Martinot M-L, Artiges E, Martinot J-L. 1910s’ brains revisited. Cortical complexity in early 20th century patients with intellectual disability or with dementia praecox. Acta Psychiatr. Scand. 2014;130:227–237. doi: 10.1111/acps.12243.
100. Marzi C, Diciotti S. 2023. Multicenter dataset of neuroimaging features (part I). Zenodo.
101. Marzi C, Diciotti S. 2023. Multicenter dataset of neuroimaging features (part II). Zenodo.
102. Marzi C, Diciotti S. 2023. Multicenter dataset of simulated neuroimaging features - quadratic relationship with age. Zenodo.

Data Availability Statement

The brain MR T₁-weighted images that support the findings of this study are available from the following online repositories:

- Autism Brain Imaging Data Exchange (ABIDE): https://fcon_1000.projects.nitrc.org/indi/abide/

- Information eXtraction from Images (IXI) study: https://brain-development.org/ixi-dataset/

- 1000 Functional Connectomes Project (FCP) – ICBM dataset: https://fcon_1000.projects.nitrc.org/fcpClassic/FcpTable.html

- Consortium for Reliability and Reproducibility (CoRR) - NKI 2 - Nathan Kline Institute (Milham): https://fcon_1000.projects.nitrc.org/indi/CoRR/html/index.html

The following software was used:

- fractalbrain toolkit version 1.1

- neuroHarmonize package version 2.1.0

- eXtreme Gradient Boosting (XGBoost) version 0.90

