Journal of the American Medical Informatics Association (JAMIA). 2023 Sep 5;30(12):1925–1933. doi: 10.1093/jamia/ocad171

Image-encoded biological and non-biological variables may be used as shortcuts in deep learning models trained on multisite neuroimaging data

Raissa Souza 1,2,3, Matthias Wilms 4,5,6,7, Milton Camacho 8,9,10, G Bruce Pike 11,12, Richard Camicioli 13, Oury Monchi 14,15,16,17, Nils D Forkert 18,19,20,21
PMCID: PMC10654841  PMID: 37669158

Abstract

Objective

This work investigates if deep learning (DL) models can classify originating site locations directly from magnetic resonance imaging (MRI) scans with and without correction for intensity differences.

Material and Methods

A large database of 1880 T1-weighted MRI scans collected across 41 sites originally for Parkinson’s disease (PD) classification was used to classify sites in this study. Forty-six percent of the datasets are from PD patients, while 54% are from healthy participants. After preprocessing the T1-weighted scans, 2 additional data types were generated: intensity-harmonized T1-weighted scans and log-Jacobian deformation maps resulting from nonlinear atlas registration. Corresponding DL models were trained to classify sites for each data type. Additionally, logistic regression models were used to investigate the contribution of biological (age, sex, disease status) and non-biological (scanner type) variables to the models’ decision.

Results

A comparison of the 3 different types of data revealed that DL models trained using T1-weighted and intensity-harmonized T1-weighted scans can classify sites with an accuracy of 85%, while the model using log-Jacobian deformation maps achieved a site classification accuracy of 54%. Disease status and scanner type were found to be significant confounders.

Discussion

Our results demonstrate that MRI scans encode relevant site-specific information that models could use as shortcuts that cannot be removed using simple intensity harmonization methods.

Conclusion

The ability of DL models to exploit site-specific biases as shortcuts raises concerns about their reliability, generalization, and deployability in clinical settings.

Keywords: shortcut learning, deep learning, multisite classification, MRI, data harmonization

Introduction

Research has shown that machine learning, especially deep learning (DL) models, can achieve accuracies similar to human experts in many domains, including healthcare.1,2 These data-driven advancements in precision medicine may reduce healthcare costs by diagnosing diseases earlier and more accurately, preventing diseases before they become clinically evident, and providing better patient-specific treatment.3 To date, machine learning techniques have been successfully applied in several healthcare specialties, such as radiology4 and cardiology.5

Convolutional neural networks (CNNs), a DL method specifically designed for image analysis,1,3 have great potential to support many clinically relevant prediction, classification, and segmentation tasks using medical images.6,7 However, in order for CNNs to achieve high accuracy and precision, a large number of diverse training samples is typically required to capture the full real-world variability of the problem at hand.1 The standard approach to increasing the number and diversity of training samples in healthcare is to collect multisite data into a central database before training.

Generally, the training data need to represent the population of interest well in order to train a machine learning model that generalizes to new institutions. If the data used during training differ from those encountered in the real world, the corresponding models will likely perform poorly when tested on new data and may also encode considerable biases.

Within this context, several researchers have shown that biological and/or non-biological biases can influence DL models as well as traditional statistical analyses, such as voxel-based morphometry, for single-site and multisite applications using magnetic resonance imaging (MRI) data.8–13 Briefly described, biases can be categorized into biological (participant cohort) and non-biological (technical) sources. Biological biases include variables such as sex, age, and other demographic variables, while non-biological biases involve differences regarding the number and class distribution of participants per site, imaging acquisition protocols, and scanners. While it is generally accepted in the MRI domain that the hardware and the protocols used can lead to significant differences in the resulting images, it remains unclear how much of an effect this has on DL models, especially in the case of large multicenter studies. Practically, these differences can manifest as local and global distortions, imaging artifacts (eg, eddy currents), and variations in the intensity distributions of non-quantitative imaging data resulting from differences in acquisition parameters and scanner hardware. This becomes particularly relevant when combining data acquired using non-harmonized imaging protocols from multiple sites to increase data diversity and size for machine learning training. Machine learning models trained using such data may exploit these differences as shortcuts, learning features associated with the specific site and its corresponding patient distribution rather than imaging patterns relevant to the intended clinical task. Consequently, deploying such models in clinical centers that did not contribute to the training data collection may pose significant challenges, as the model may not rely on disease-related patterns for accurate predictions but on spurious shortcuts that do not apply to the new data.

Geirhos et al.14 identified 4 shortcut opportunities, briefly described in the following. First, models can identify features that are different from the ones intended by model developers to accomplish their tasks, for example, the scanner type. Second, they can combine various features to inform their decision, for instance, disease status and scanner type. Third, models usually operate with minimal effort; once a shortcut is found, the model can rely entirely on it, even though it represents an artifact of the data. Finally, models can focus on the majority group while accepting misclassifications for the minority group(s).

To overcome this potential problem, data harmonization techniques are often applied in practice. For example, rather simple intensity normalization methods, such as standardizing the image intensities using their mean and standard deviation (z-score normalization) or normalizing the image intensities with respect to a reference image (histogram matching), are often applied as a preprocessing step prior to DL model training. However, it remains unclear whether such simple intensity harmonization techniques are really suitable for removing the relevant biases that could be used for shortcut learning.
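For illustration, the z-score variant of these simple harmonization steps can be sketched in a few lines of NumPy (a minimal sketch; the optional brain mask and the toy intensity values are illustrative assumptions, not part of any study pipeline):

```python
import numpy as np

def zscore_normalize(scan, brain_mask=None):
    """Z-score intensity normalization: zero mean, unit standard deviation.

    If a brain mask is given, statistics are computed over brain voxels only,
    which is common practice for skull-stripped MRI data (assumption for
    illustration; masking choices vary between pipelines).
    """
    values = scan[brain_mask] if brain_mask is not None else scan
    return (scan - values.mean()) / values.std()

# Toy 3D "scan" with a site-specific intensity offset and scale
rng = np.random.default_rng(0)
scan = 300.0 + 50.0 * rng.standard_normal((8, 8, 8))
normalized = zscore_normalize(scan)
print(normalized.mean(), normalized.std())  # ≈ 0.0 and ≈ 1.0
```

Note that such a global rescaling cannot, by construction, remove spatially varying (local) intensity differences between scanners.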

Therefore, the aim of this work was to develop a site classifier based on a large clinical database consisting of T1-weighted MRI brain scans originally collected for the classification of patients with Parkinson’s disease (PD) and investigate the effect of simple intensity harmonization techniques on the ability of the machine learning model to classify the originating sites. The created database covers all aspects of data heterogeneity (biological and non-biological) expected when working with multisite datasets. Therefore, the major contributions of this work can be summarized as follows: (1) the implementation of a site classifier using neuroimaging data from 41 different sites, and (2) the first evaluation of the effect of 3 biological variables (sex, age, and disease status) and one non-biological variable (scanner type) on the accuracy of the model using the raw as well as harmonized data.

Materials and methods

Dataset

A multisite database was created by collecting datasets from patients with PD and healthy participants acquired across 41 different sites managed by 12 studies.15–26 In total, 1880 high-resolution T1-weighted brain MRI scans were included in this work. Table 1 summarizes the database distribution with respect to biological variables (refer to Table S1 for site-specific information). A variety of scanners and protocols were used during image acquisition. Siemens, GE, and Philips were among the manufacturers, with magnetic field strengths of either 1.5 or 3.0 T. Figure 1 shows the scanner type distribution per site (refer to Figures S1–S3 for sex, age, and disease status distributions).

Table 1.

Sex and age distributions per subject characteristic.

| Characteristic | Males | Females | Young (<60) | Old (60+) |
| --- | --- | --- | --- | --- |
| Parkinson’s disease | 542 | 325 | 222 | 645 |
| Healthy participants | 635 | 378 | 243 | 770 |

Figure 1. Scanner type distribution per site.

Each study received ethics approval from its local ethics board and obtained written informed consent from all participants in accordance with the Declaration of Helsinki.

Dataset preprocessing

Three different imaging data types were used to train DL models to classify originating sites (see Figure 2): raw T1-weighted MRI scans, harmonized T1-weighted MRI scans, and log-Jacobian maps. A detailed explanation of how each image type was generated is provided in the following.

Figure 2. Example imaging data. T1-weighted denotes the preprocessed raw MRI brain scans, harmonized T1-weighted denotes the T1-weighted MRI scans after intensity harmonization (histogram matching), and log-Jacobians denotes the deformation fields generated during registration.

The collected database was preprocessed as follows. First, skull stripping was performed using HD-BET.27 Second, the scans were resampled to an isotropic resolution of 1 mm using linear interpolation. Third, bias field correction was applied using the Advanced Normalization Tools (ANTs) nonparametric nonuniform intensity normalization technique (version 2.3.1).28 After that, each scan was registered to the PD25-T1-MPRAGE-1mm brain atlas29 from the Montreal Neurological Institute (MNI) (fixed image) using ANTs. Registration of the scans to the atlas was performed in 2 steps. First, ANTs was used to affinely align the scans to the atlas. These scans were used as the raw T1-weighted MRI data type after cropping them to 160 × 192 × 160 voxels to eliminate extraneous background voxels and decrease the computational burden during site classifier training, as described in the next sections. Next, the affine registration was used to initialize a nonlinear registration step. The displacement fields resulting from the nonlinear registration were then utilized to generate the associated log-Jacobian maps30 as a means of data harmonization. Briefly described, log-Jacobians are symmetric around zero and indicate local changes in volume between the atlas and the individual scans for each voxel. Thus, values less than zero reflect a loss in volume, while values greater than zero represent an increase in volume. It can be assumed that any difference in intensity distribution is removed from these log-Jacobian maps and that they mainly represent morphological characteristics of the analyzed brain. The MNI PD25 brain mask was then used to crop the log-Jacobian maps to remove information outside the brain (background voxels).
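The log-Jacobian computation described above can be sketched with NumPy finite differences (a minimal sketch, assuming a dense displacement field in voxel coordinates; the uniform-expansion toy field is an illustrative assumption, not study data):

```python
import numpy as np

def log_jacobian_map(disp):
    """Voxel-wise log-Jacobian determinant of a 3D displacement field.

    disp has shape (X, Y, Z, 3): disp[..., k] is the displacement along
    axis k. The deformation is phi(x) = x + u(x), so the Jacobian is
    J = I + du/dx at every voxel; log|J| > 0 indicates local expansion,
    log|J| < 0 local shrinkage.
    """
    X, Y, Z, _ = disp.shape
    jac = np.zeros((X, Y, Z, 3, 3))
    for k in range(3):          # component of the displacement
        for a in range(3):      # axis of differentiation
            jac[..., k, a] = np.gradient(disp[..., k], axis=a)
        jac[..., k, k] += 1.0   # add the identity
    det = np.linalg.det(jac)
    return np.log(det)

# Uniform 10% expansion along every axis: log|J| = 3 * ln(1.1) everywhere
grid = np.stack(np.meshgrid(np.arange(8), np.arange(8), np.arange(8),
                            indexing="ij"), axis=-1).astype(float)
logj = log_jacobian_map(0.1 * grid)
print(logj.mean())  # ≈ 0.2877 = 3 * ln(1.1)
```

Registration toolkits such as ANTs provide this map directly; the sketch only illustrates what the values represent.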

After the preprocessing steps, which generated the raw T1-weighted MRI scans and log-Jacobian maps, intensity-harmonized T1-weighted scans were generated by applying histogram matching,31 an intensity harmonization technique. This technique normalizes the intensity values of each raw T1-weighted MRI scan based on a reference scan to account for site differences.13,32 In this work, the brain atlas used for registration was used as a reference for intensity harmonization.
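The histogram matching step can be sketched with a quantile-mapping implementation in NumPy (a minimal sketch of the general technique, not the exact library routine used in the study; the toy volumes are illustrative assumptions):

```python
import numpy as np

def histogram_match(source, reference):
    """Map source intensities onto the reference intensity distribution.

    Each source voxel is assigned the reference intensity at the same
    quantile, the core idea behind histogram-matching harmonization.
    """
    shape = source.shape
    src = source.ravel()
    ref = np.sort(reference.ravel())
    # Rank of every source voxel -> quantile -> reference intensity
    ranks = np.argsort(np.argsort(src))
    quantiles = ranks / (src.size - 1)
    matched = np.interp(quantiles, np.linspace(0, 1, ref.size), ref)
    return matched.reshape(shape)

rng = np.random.default_rng(1)
scan = 400.0 + 80.0 * rng.standard_normal((16, 16, 16))   # "site A" contrast
atlas = 100.0 + 20.0 * rng.standard_normal((16, 16, 16))  # reference atlas
harmonized = histogram_match(scan, atlas)
```

After matching, the harmonized scan carries the atlas intensity distribution while preserving the within-scan intensity ordering; like z-score normalization, it is a global transform and cannot correct spatially local effects.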

Thus, it is expected that each image type encodes a different amount of site-specific information: (1) raw T1-weighted scans likely contain the most discriminatory information to perform site classification, (2) harmonized T1-weighted scans may remove some site information encoded in the grey values, and (3) log-Jacobian maps should eliminate most site information as they contain mainly morphological information.

Deep learning model

This work uses the state-of-the-art simple fully convolutional network (SFCN)33 as the basis for the site classifier because it achieved high performance for adult brain age prediction and sex classification using T1-weighted MRI datasets.6,8 Our DL architecture consisted of 7 blocks: the first 6 blocks are identical to the original SFCN model, with 5 blocks containing a 3D convolutional layer with a 3 × 3 × 3 kernel, batch normalization, 2 × 2 × 2 max pooling, and ReLU activation, and one block including a 3D convolutional layer with a 1 × 1 × 1 kernel, batch normalization, and ReLU activation. The final block was adapted for our task and consisted of a 3D average pooling layer, a dropout layer with a rate of 0.2, a flattening layer, and a multiclass classification layer with softmax activation to classify the originating site. The created database was stratified based on the number of subjects available at each site. More precisely, 80% of the MRI scans provided by each site were randomly selected for training and 20% for the testing set to develop and evaluate the originating acquisition site classifier. The Adam optimizer, with an initial learning rate of 0.001 and a decay rate of 0.003, was utilized for training the networks. Early stopping with a patience of 10 was applied, and the best models (lowest validation loss) were saved for evaluation.
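Assuming "same"-padded convolutions as in the original SFCN (an assumption on our part, as padding is not restated above), the spatial size of the feature maps entering the final block can be traced as follows:

```python
def sfcn_feature_shape(input_shape=(160, 192, 160), n_pool_blocks=5):
    """Trace the spatial size through the SFCN-style blocks described above.

    Five conv blocks each halve the volume with 2x2x2 max pooling (the
    "same"-padded 3x3x3 convolutions keep the size), and the 1x1x1 block
    leaves it unchanged before global average pooling.
    """
    shape = input_shape
    for _ in range(n_pool_blocks):
        shape = tuple(s // 2 for s in shape)
    return shape

print(sfcn_feature_shape())  # (5, 6, 5)
```

This shows why the 160 × 192 × 160 crop is convenient: every dimension divides cleanly through 5 pooling stages.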

Simulated data distribution

As a baseline for our work, we generated 10 artificial sites by randomly splitting the created database, assigning 188 MRI scans to each artificial site (148 scans for training and 40 for testing). This way, every site has a more similar distribution (less biased), which should result in lower site classification accuracies when compared to the models trained on the real data distribution, for which models may learn site-dependent information as shortcuts.14
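The artificial-site assignment above can be sketched as follows (a minimal sketch; the random seed and dictionary layout are illustrative choices, not the study's actual implementation):

```python
import numpy as np

def simulate_sites(n_scans=1880, n_sites=10, n_train=148, n_test=40, seed=0):
    """Randomly assign scans to artificial sites with matched train/test sizes.

    Mirrors the baseline described above: 1880 scans are shuffled and
    divided into 10 artificial sites of 188 scans each (148 train / 40 test),
    so no site-specific structure remains for a classifier to exploit.
    """
    assert n_sites * (n_train + n_test) == n_scans
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_scans)
    sites = []
    for s in range(n_sites):
        chunk = order[s * (n_train + n_test):(s + 1) * (n_train + n_test)]
        sites.append({"train": chunk[:n_train], "test": chunk[n_train:]})
    return sites

sites = simulate_sites()
print(len(sites), len(sites[0]["train"]), len(sites[0]["test"]))  # 10 148 40
```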

Metrics and evaluation

In this study, a total of 6 site classifiers were trained and evaluated using the centrally collected scans from all centers. Among them, 3 were trained to analyze the real data distribution, while the remaining 3 were applied to evaluate the simulated data distribution. Each classifier utilized a distinct type of imaging data as input: raw T1-weighted MRI scans, intensity-harmonized T1-weighted MRI scans, or log-Jacobian maps.

All models were evaluated with respect to their ability to correctly classify the originating sites (accuracy). Furthermore, the site classifiers’ classification rates grouped by sex, age, disease status, and scanner type were computed. In this context, classification rate refers to the percentage of subjects that had their sites correctly identified when grouped by each variable.
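The groupwise classification rate defined above reduces to a per-group mean of a correctness indicator (a minimal sketch; the toy labels are illustrative, not study data):

```python
import numpy as np

def classification_rates(correct, groups):
    """Fraction of subjects whose site was correctly identified, per group.

    `correct` is a boolean array (site prediction right/wrong) and `groups`
    holds a grouping variable such as sex, age bin, or disease status.
    """
    correct = np.asarray(correct)
    groups = np.asarray(groups)
    return {g: correct[groups == g].mean() for g in np.unique(groups)}

# Toy example: rates grouped by disease status
correct = [True, True, False, True, False, True]
status = ["PD", "PD", "PD", "HP", "HP", "HP"]
print(classification_rates(correct, status))  # 2/3 for each group
```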

Additionally, a statistical likelihood ratio test analysis of the contribution of each variable was performed using logistic regression. In this case, multiple regression models were computed by dropping one variable of interest at a time. Then, a likelihood ratio test was conducted between each model with a dropped variable and a base model including all variables to determine how significant the effect of each variable is on the trained model.
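The likelihood ratio test can be sketched as follows (a minimal NumPy sketch on synthetic data; the Newton-method fit, variable names, and effect size are our illustrative assumptions, and 3.84 is the chi-squared critical value for 1 degree of freedom at alpha = .05):

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic regression via Newton's method (IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1 - p)                      # observation weights
        H = X.T @ (X * W[:, None])           # Hessian of the log-likelihood
        beta += np.linalg.solve(H, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return beta, loglik

def likelihood_ratio_stat(X_full, X_reduced, y):
    """LR statistic comparing the full model against one with dropped columns.

    Under H0 it follows a chi-squared distribution with degrees of freedom
    equal to the number of dropped columns.
    """
    _, ll_full = fit_logistic(X_full, y)
    _, ll_reduced = fit_logistic(X_reduced, y)
    return 2.0 * (ll_full - ll_reduced)

# Synthetic data where one variable strongly drives correct classification
rng = np.random.default_rng(2)
n = 1000
x = rng.standard_normal(n)                   # eg, a scanner-type covariate
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * x))).astype(float)
X_full = np.column_stack([np.ones(n), x])
X_reduced = np.ones((n, 1))                  # variable of interest dropped
lr = likelihood_ratio_stat(X_full, X_reduced, y)
print(lr > 3.84)  # the dropped variable contributes significantly
```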

Lastly, saliency maps were generated to display the brain regions contributing the most to the models’ decisions when classifying originating sites. In this work, the SmoothGrad34 method available in the tf-keras-vis toolkit35 was employed using the trained models and up to 5 correctly classified participants from each site in the testing set (if fewer participants were available, all of them were included). In this method, Gaussian noise is added to the data before computing the standard saliency map, and the resulting maps are averaged. The saliency maps were averaged over 20 noisy samples per participant, and the noise was sampled from a zero-mean Gaussian distribution with a variance of 0.2.
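The SmoothGrad averaging itself reduces to a few lines (a minimal sketch with a toy linear "model"; the study used tf-keras-vis with the trained CNNs, and `sigma` here is a noise standard deviation, an illustrative simplification):

```python
import numpy as np

def smoothgrad(grad_fn, x, n_samples=20, sigma=0.2, seed=0):
    """Average input gradients over Gaussian-perturbed copies of the input.

    Mirrors the SmoothGrad recipe described above: add zero-mean Gaussian
    noise to the input, compute the saliency (gradient) each time, and
    average the resulting maps.
    """
    rng = np.random.default_rng(seed)
    maps = [grad_fn(x + sigma * rng.standard_normal(x.shape))
            for _ in range(n_samples)]
    return np.mean(maps, axis=0)

# Toy "model": a linear scorer f(x) = w @ x, whose true saliency is w
w = np.array([0.0, 3.0, 0.0, -1.0])
grad_fn = lambda x: w            # gradient of a linear function is constant
saliency = smoothgrad(grad_fn, np.ones(4))
print(saliency)  # [ 0.  3.  0. -1.]
```

For a real network, `grad_fn` would backpropagate the class score to the input volume; the averaging step is unchanged.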

Results

Table 2 summarizes accuracy and classification rates grouped by sex, age, disease status, and scanner types for the models trained on the real data distribution, while Figure 3 shows site-specific performance in a confusion matrix for real and simulated data distributions.

Table 2.

Site classifier accuracy and classification rates stratified by biological and non-biological variables per input type.

| Input | Accuracy | Sex (M/F) | Age (<60/60+) | Disease status (PD/HP) | Averaged scanner types |
| --- | --- | --- | --- | --- | --- |
| T1-weighted | 0.85 | 0.82/0.90 | 0.81/0.86 | 0.75/0.93 | 0.88 |
| Harmonized T1-weighted | 0.85 | 0.83/0.88 | 0.80/0.87 | 0.77/0.92 | 0.86 |
| Log-Jacobians | 0.54 | 0.56/0.52 | 0.36/0.60 | 0.32/0.74 | 0.52 |

Columns after Accuracy report the site classifiers’ classification rates grouped by each variable.

Scanner-specific classification rates are available in Supplementary Table S2.

Abbreviations: F, female; HP, healthy participants; M, male; PD, Parkinson’s disease.

Figure 3. Site classification rates in a confusion matrix. (A) illustrates the results of the site classifiers using the real data distribution, while (B) presents the results using the simulated data distribution.

Overall, the results highlight that site classification using these observational MRI data is generally possible. As can be seen in Table 2, models trained with raw T1-weighted and intensity-harmonized T1-weighted MRI scans performed similarly in terms of accuracy (85%). Classification rates for biological and non-biological variables are also highly similar. For instance, the averaged sex, age, disease status, and scanner classification rates were 86%, 84%, 85%, and 87%, respectively. These results suggest that a simple intensity correction is not sufficient for removing site information.

Although the model employing the log-Jacobian maps achieved considerably lower accuracy (54%), it could still identify sites with an accuracy substantially better than chance level. These results suggest that even removing intensity information completely is insufficient to remove site-related information. Figure 3A supports this result, showing that the model trained using the log-Jacobian maps could identify 18 sites precisely.

Furthermore, Figure 3A shows that 5 specific sites (BIOCOG, HAMBURG, OASIS, PPMI_20, and UKBB) were classified with very high accuracy by the models trained with raw T1-weighted, harmonized T1-weighted, and log-Jacobian maps. For the misclassified sites, 2 patterns were identified (Figure 3A). First, data from sites tended to be misclassified as other sites managed by the same study. For instance, some PPMI sites were misclassified as other PPMI sites, and SBK was misclassified as UOA, 2 sites from the same study. Second, sites tended to be misclassified as OASIS, which is one of the studies providing a large number of scans from a single site but multiple scanner types to this database.

As expected, Figure 3B demonstrates that all models trained using the simulated data distribution (less biased) performed poorly on the site classification task, achieving accuracy at the chance level of around 10%.

Statistical likelihood ratio tests were performed on the 3 models trained on the real data distribution. These tests revealed that disease status and scanner type contributed significantly (P-value < .05) to the accuracy of all 3 models. Furthermore, age also contributed significantly to the model trained using the log-Jacobian maps. Table S3 shows the likelihood ratio test results for the 3 models and all biological and non-biological variables.

Figure 4 highlights the brain regions used by the models to identify the 5 sites classified with high accuracy: BIOCOG, HAMBURG, OASIS, PPMI_20, and UKBB. For completeness, Figures S4–S6 present saliency maps for all sites correctly identified by each model.

Figure 4. Saliency maps for 5 sites of the models trained on the real data distribution.

Overall, it can be observed that each model focused on distinct areas of the brain, suggesting that local and global artifacts and distortions are present in the data. The models trained with raw T1-weighted and harmonized T1-weighted MRI scans highlight similar brain areas as most informative. However, it is noticeable that the frontal lobe and some background regions are less important in the intensity-harmonized T1-weighted MRI scans. Although this result suggests that harmonizing the intensities reduced the ability of the model to rely on areas of the brain that are more susceptible to artifacts caused by differences in the head coil and during image registration, originating sites are still distinguishable by the model. Concerning the saliency maps of the log-Jacobian model, brain structures most affected by age, such as the ventricles and sulci, are highlighted as important, supporting the statistical likelihood ratio test results that disease status, age, and scanner type contributed significantly (P-value < .05) to this model. Nevertheless, it is essential to note that those brain areas correspond to regions that are usually more challenging for registration methods to align. While it is very difficult to disentangle the contributions of biological age and possible registration errors, our results suggest that even when the data contain mainly morphological information, site classification is still possible.

Discussion

The main finding of this work is that machine learning models can classify sites directly from multisite MRI data that were initially compiled for a disease classification task, even after intensity harmonization or complete removal of intensity information. This finding suggests that there is site-specific information in brain MRI scans that cannot be removed by the rather simple intensity harmonization techniques used as standard preprocessing steps prior to model training. Moreover, log-Jacobian maps, which mostly contain morphological information, are also not capable of completely removing site-specific information, which could, for example, be caused by MRI magnetic field inhomogeneities or other imaging artifacts. Although this work specifically investigated the ability of DL to perform site classification, we believe that the features used by this model after intensity harmonization could also be used as shortcuts by a disease classifier trained on the same data. This has many practical implications regarding the training of machine learning models based on multisite data and suggests not only that more advanced data harmonization techniques may be required but also that trained models need to be carefully evaluated to rule out that shortcuts were used in the background. Overall, these results may explain why many DL models do not generalize well or even fail when applied to new datasets acquired in centers that did not contribute to the training set.

Our analysis shows that disease status and scanner type significantly contributed to all models. Such biological and non-biological influences were also observed in previous studies. For example, Stanley et al.8 noticed a contribution of the pubertal score when classifying sex using a DL model. While Tardif et al.9,10 reported a strong influence of scanners, the magnetic field strength, and the imaging protocol in their voxel-based morphometry analysis, Nielson et al.36 observed similar influences in a DL model. Within this context, it is important to note that there are considerable differences in the patient distribution between some of the studies included in this work. For example, BIOCOG is a clinical study that aimed at examining representative patients across the whole disease spectrum (with the usual biases of patients who agree to be part of studies), while the Parkinson’s Progression Markers Initiative (PPMI) was primarily designed to investigate de novo patients with PD, leading to differences in disease stage as a potential bias that can be used as a shortcut. Moreover, it is important to note that the imbalanced number of subjects scanned using the same scanner type and protocol may add potential biases that could be used as shortcuts. In contrast to other well-known databases, such as the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database, PPMI does not acquire and collect data using a harmonized imaging protocol. Instead, the data available in PPMI were mostly acquired using local imaging protocols that can differ considerably between sites. In a real-world scenario, it is unlikely that every medical center will have access to the same scanner device, which justifies treating each center managed by PPMI as an individual site in our work and highlights the importance of analyzing this unique, realistic database.

Our results clearly show that simple intensity normalization,31 the most commonly used data harmonization technique for training DL models, is not optimal and cannot remove all site-related information from raw T1-weighted MRI scans. Although histogram matching may make the intensity distributions more similar between sites, it is not able to correct for local intensity differences caused, for example, by coil inhomogeneities or local/global artifacts and distortions that may be scanner and protocol dependent. Similar results were observed by Glocker et al.13 when analyzing scanner effects between 2 sites. The results of the present study expand those findings and show that this is even true when using databases collected from a much larger number of centers. Thus, the general assumption that using data from many centers helps to train more generalizable machine learning models may not hold true and needs to be carefully revisited.

Our findings indicate that log-Jacobian maps, which completely remove intensity information, do not fully eliminate site-related information. A potential reason for this may be that the model trained with these maps relies on brain areas that are traditionally challenging for registration methods to align accurately. While disentangling the contribution of brain morphology from potential registration errors is complex, our results suggest that site classification remains possible, even when the data predominantly consist of morphological information.

Overall, our findings indicate that various factors within cohorts, such as disease duration, sex, and age distribution of patients, the number of subjects scanned at each site, the representation of disease status (PD vs healthy) at each site, and the heterogeneity of scanner devices across sites, are potential features that can be exploited by DL models to accurately discern originating sites. Thus, these findings highlight the importance of and need for more research on advanced data harmonization and bias mitigation techniques.

Recently, more complex data harmonization strategies that utilize (generative) DL techniques such as generative adversarial networks (GANs) have been explored. For example, Dewey et al.37 proposed a DL model that generates image data with consistent contrast, whereas Dinsdale et al.11 employed 2 networks in an adversarial training technique—one to remove protocol-specific effects and the other to predict brain aging. More generally, the objective of GANs is to generate images that preserve structural differences while altering their appearance. For instance, the image-to-image translation technique can translate MRI into computed tomography (CT) datasets.38 On the other hand, the style transfer technique applies a reference image’s style, such as the intensities of an atlas, to the structural information of the datasets of interest.39 While these advanced approaches hold promise for improved results compared to the simple histogram matching used in our study, it is essential to consider that preserving structural information may still retain local and global distortions as in the case of log-Jacobian maps, potentially leading to shortcut learning opportunities. Thus, it may be important to consider additional methods during or after imaging, such as distortion correction.40

It is essential to highlight some of the limitations of this work. First, although we demonstrated that site classification is possible and identified potential shortcuts, we did not show that the identified factors are indeed used as shortcuts, which will be explored in more detail in future work. Second, our statistical analysis using logistic regression assumes that the effect of biological and non-biological variables on the accuracy of the models is linear. A non-linear analysis may be necessary to investigate if sex and age are indeed insignificant. Third, although saliency maps are widely used to interpret machine learning models’ decisions, it is known that gradient-based attribution maps are not always perfect and may highlight some background regions close to regions of high informative value, even when they are null (eg, all black or zeros). Lastly, we only analyzed the SFCN model in this work. Thus, results could be different for other machine learning models.

Conclusion

This work investigated the ability of machine learning models to classify sites from raw T1-weighted MRI scans with and without applying intensity harmonization techniques. To the best of our knowledge, this is the first work that implemented site classifiers using a very large multisite MRI database collected across 41 sites. Our results demonstrate that MRI scans encode relevant site-specific information that models could use as shortcuts and that cannot be removed using simple intensity harmonization methods. Thus, caution is advised when developing machine learning applications, especially when using multisite datasets, as biological and non-biological variables may introduce biases into the model. Most importantly, caution is essential when drawing conclusions from machine learning models trained in a multisite setup, as a disease classifier (modeled task) could end up being a secret site classifier (shortcut/bias task) due to complex confounding factors encoded in the data.

Supplementary Material

ocad171_Supplementary_Data

Contributor Information

Raissa Souza, Department of Radiology, Cumming School of Medicine, University of Calgary, Calgary, AB T2N 4N1, Canada; Hotchkiss Brain Institute, University of Calgary, Calgary, AB T2N 4N1, Canada; Biomedical Engineering Graduate Program, University of Calgary, Calgary, AB T2N 4N1, Canada.

Matthias Wilms, Hotchkiss Brain Institute, University of Calgary, Calgary, AB T2N 4N1, Canada; Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB T2N 4N1, Canada; Department of Pediatrics, University of Calgary, Calgary, AB T2N 4N1, Canada; Department of Community Health Sciences, University of Calgary, Calgary, AB T2N 4N1, Canada.

Milton Camacho, Department of Radiology, Cumming School of Medicine, University of Calgary, Calgary, AB T2N 4N1, Canada; Hotchkiss Brain Institute, University of Calgary, Calgary, AB T2N 4N1, Canada; Biomedical Engineering Graduate Program, University of Calgary, Calgary, AB T2N 4N1, Canada.

G Bruce Pike, Department of Radiology, Cumming School of Medicine, University of Calgary, Calgary, AB T2N 4N1, Canada; Hotchkiss Brain Institute, University of Calgary, Calgary, AB T2N 4N1, Canada.

Richard Camicioli, Department of Medicine (Neurology), Neuroscience and Mental Health Institute, University of Alberta, Edmonton, AB T6G 2E1, Canada.

Oury Monchi, Hotchkiss Brain Institute, University of Calgary, Calgary, AB T2N 4N1, Canada; Department of Radiology, Radio-Oncology and Nuclear Medicine, Université de Montréal, Montréal, QC H3C 3J7, Canada; Centre de Recherche, Institut Universitaire de Gériatrie de Montréal, Montréal, QC H3W 1W4, Canada; Department of Clinical Neurosciences, Cumming School of Medicine, University of Calgary, Calgary, AB T2N 4N1, Canada.

Nils D Forkert, Department of Radiology, Cumming School of Medicine, University of Calgary, Calgary, AB T2N 4N1, Canada; Hotchkiss Brain Institute, University of Calgary, Calgary, AB T2N 4N1, Canada; Alberta Children’s Hospital Research Institute, University of Calgary, Calgary, AB T2N 4N1, Canada; Department of Clinical Neurosciences, Cumming School of Medicine, University of Calgary, Calgary, AB T2N 4N1, Canada.

Author contribution

R.S., M.W., and N.D.F. contributed to the study’s conception. R.C. and O.M. contributed to data acquisition, and M.C., O.M., and R.C. contributed to data curation. R.S., G.B.P., N.D.F., and M.W. analyzed the results. R.S. wrote the first draft of the article. All authors critically revised the previous versions of the article and approved the final article.

Supplementary material

Supplementary material is available at Journal of the American Medical Informatics Association online.

Funding

This work was supported by the Parkinson Association of Alberta, the Hotchkiss Brain Institute, the Canadian Consortium on Neurodegeneration in Aging (CCNA), the Canadian Open Neuroscience Platform (CONP), the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant program, the Canada Research Chairs program, the River Fund at Calgary Foundation, the Canadian Institutes for Health Research, and the Tourmaline Chair in Parkinson disease.

Conflict of interest

None declared.

Data availability

Image data used were provided, in part, by the OASIS-3 project (Principal Investigators: T. Benzinger, D. Marcus, J. Morris; NIH P50 AG00561, P30 NS09857781, P01 AG026276, P01 AG003991, R01 AG043434, UL1 TR000448, R01 EB009352), by PPMI, a public–private partnership funded by the Michael J. Fox Foundation, by the OpenfMRI database (accession number ds000245), and by the UK Biobank (application number 77508).

References

  1. Lo Vercio L, Amador K, Bannister JJ, et al. Supervised machine learning tools: a tutorial for clinicians. J Neural Eng. 2020;17(6):062001. 10.1088/1741-2552/abbff2
  2. Noorbakhsh-Sabet N, Zand R, Zhang Y, Abedi V. Artificial intelligence transforms the future of health care. Am J Med. 2019;132(7):795-801. 10.1016/j.amjmed.2019.01.017
  3. Maceachern SJ, Forkert ND. Machine learning for precision medicine. Genome. 2021;64(4):416-425. 10.1139/gen-2020-0131
  4. Pesapane F, Codari M, Sardanelli F. Artificial intelligence in medical imaging: threat or opportunity? Radiologists again at the forefront of innovation in medicine. Eur Radiol Exp. 2018;2(1):35. 10.1186/s41747-018-0061-6
  5. Mouches P, Langner S, Domin M, Hill MD, Forkert ND. Influence of cardiovascular risk-factors on morphological changes of cerebral arteries in healthy adults across the life span. Sci Rep. 2021;11(1):12236. 10.1038/s41598-021-91669-3
  6. Mouches P, Wilms M, Rajashekar D, Langner S, Forkert ND. Multimodal biological brain age prediction using magnetic resonance imaging and angiography with the identification of predictive regions. Hum Brain Mapp. 2022;43(8):2554-2566. 10.1002/hbm.25805
  7. Talai AS, Sedlacik J, Boelmans K, Forkert ND. Utility of multi-modal MRI for differentiating of Parkinson’s disease and progressive supranuclear palsy using machine learning. Front Neurol. 2021;12:648548. 10.3389/fneur.2021.648548
  8. Stanley EAM, Wilms M, Mouches P, Forkert ND. Fairness-related performance and explainability effects in deep learning models for brain image analysis. J Med Imaging (Bellingham). 2022;9(6):061102. 10.1117/1.jmi.9.6.061102
  9. Tardif CL, Collins DL, Pike GB. Sensitivity of voxel-based morphometry analysis to choice of imaging protocol at 3 T. Neuroimage. 2009;44(3):827-838. 10.1016/j.neuroimage.2008.09.053
  10. Tardif CL, Collins DL, Pike GB. Regional impact of field strength on voxel-based morphometry results. Hum Brain Mapp. 2010;31(7):943-957. 10.1002/hbm.20908
  11. Dinsdale NK, Jenkinson M, Namburete AIL. Unlearning scanner bias for MRI harmonization. In: Martel AL, et al., eds. Medical Image Computing and Computer Assisted Intervention – MICCAI 2020. Lecture Notes in Computer Science. Vol 12262. Cham: Springer; 2020:369-378. 10.1007/978-3-030-59713-9_36
  12. Nath V, Parvathaneni P, Hansen CB, et al. Inter-scanner harmonization of high angular resolution DW-MRI using null space deep learning. Comput Diffus MRI. 2019;2019:193-201.
  13. Glocker B, Robinson R, Castro DC, Dou Q, Konukoglu E. Machine learning with multi-site imaging data: an empirical study on the impact of scanner effects. arXiv. 10.48550/arXiv.1910.04597, 2019, preprint: not peer reviewed.
  14. Geirhos R, Jacobsen J-H, Michaelis C, et al. Shortcut learning in deep neural networks. Nat Mach Intell. 2020;2(11):665-673. 10.1038/s42256-020-00257-z
  15. Sudlow C, Gallacher J, Allen N, et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015;12(3):e1001779. 10.1371/journal.pmed.1001779
  16. Parkinson’s Progression Markers Initiative. https://www.ppmi-info.org/. Accessed November 22, 2022.
  17. Duchesne S, Chouinard I, Potvin O, et al.; CIMA-Q group and the CCNA group. The Canadian dementia imaging protocol: harmonizing national cohorts. J Magn Reson Imaging. 2019;49(2):456-465. 10.1002/jmri.26197
  18. Acharya HJ, Bouchard TP, Emery DJ, Camicioli RM. Axial signs and magnetic resonance imaging correlates in Parkinson’s disease. Can J Neurol Sci. 2007;34(1):56-61. 10.1017/s0317167100005795
  19. Lang S, Hanganu A, Gan LS, et al. Network basis of the dysexecutive and posterior cortical cognitive profiles in Parkinson’s disease. Mov Disord. 2019;34(6):893-902. 10.1002/mds.27674
  20. Hanganu A, Bedetti C, Degroot C, et al. Mild cognitive impairment is linked with faster rate of cortical thinning in patients with Parkinson’s disease longitudinally. Brain. 2014;137(Pt 4):1120-1129. 10.1093/brain/awu036
  21. Open Science, to Accelerate Discovery and Deliver Cures | The Neuro—McGill University. https://www.mcgill.ca/neuro/open-science/c-big-repository. Accessed November 22, 2022.
  22. Badea L, Onu M, Wu T, Roceanu A, Bajenaru O. Exploring the reproducibility of functional connectivity alterations in Parkinson’s disease. PLoS One. 2017;12(11):e0188196. 10.1371/journal.pone.0188196
  23. OpenNeuro. https://openneuro.org/datasets/ds000245/versions/00001. Accessed November 24, 2022.
  24. Boelmans K, Holst B, Hackius M, et al. Brain iron deposition fingerprints in Parkinson’s disease and progressive supranuclear palsy. Mov Disord. 2012;27(3):421-427. 10.1002/mds.24926
  25. LaMontagne PJ, Benzinger TLS, Morris JC, et al. OASIS-3: longitudinal neuroimaging, clinical, and cognitive dataset for normal aging and Alzheimer disease. medRxiv 2019.12.13.19014902. 10.1101/2019.12.13.19014902, December 2019, preprint: not peer reviewed.
  26. Wei D, Zhuang K, Ai L, et al. Structural and functional brain scans from the cross-sectional Southwest University adult lifespan dataset. Sci Data. 2018;5:180134. 10.1038/sdata.2018.134
  27. Isensee F, Schell M, Pflueger I, et al. Automated brain extraction of multisequence MRI using artificial neural networks. Hum Brain Mapp. 2019;40(17):4952-4964. 10.1002/hbm.24750
  28. Tustison NJ, Avants BB, Cook PA, et al. N4ITK: improved N3 bias correction. IEEE Trans Med Imaging. 2010;29(6):1310-1320. 10.1109/tmi.2010.2046908
  29. Xiao Y, Fonov V, Chakravarty MM, et al. A dataset of multi-contrast population-averaged brain MRI atlases of a Parkinson’s disease cohort. Data Brief. 2017;12:370-379. 10.1016/j.dib.2017.04.013
  30. Leow AD, Yanovsky I, Chiang M-C, et al. Statistical properties of Jacobian maps and the realization of unbiased large-deformation nonlinear image registration. IEEE Trans Med Imaging. 2007;26(6):822-832. 10.1109/tmi.2007.892646
  31. Nyúl LG, Udupa JK. On standardizing the MR image intensity scale. Magn Reson Med. 1999;42(6):1072-1081. 10.1002/(SICI)1522-2594(199912)42:6
  32. Bashyam VM, Doshi J, Erus G, et al.; iSTAGING and PHENOM consortia. Deep generative medical image harmonization for improving cross-site generalization in deep learning predictors. J Magn Reson Imaging. 2022;55(3):908-916. 10.1002/jmri.27908
  33. Peng H, Gong W, Beckmann CF, Vedaldi A, Smith SM. Accurate brain age prediction with lightweight deep neural networks. Med Image Anal. 2021;68:101871. 10.1016/j.media.2020.101871
  34. Smilkov D, Thorat N, Kim B, Viégas F, Wattenberg M. SmoothGrad: removing noise by adding noise. arXiv. 10.48550/arXiv.1706.03825, June 2017, preprint: not peer reviewed.
  35. Kubota Y. keisen/tf-keras-vis: Neural Network Visualization Toolkit for tf.keras. https://keisen.github.io/tf-keras-vis-docs/. Accessed February 14, 2023.
  36. Nielson DM, Pereira F, Zheng CY, et al. Detecting and harmonizing scanner differences in the ABCD study—annual release 1.0. bioRxiv 309260. 10.1101/309260, May 2018, preprint: not peer reviewed.
  37. Dewey BE, Zhao C, Reinhold JC, et al. DeepHarmony: a deep learning approach to contrast harmonization across scanner changes. Magn Reson Imaging. 2019;64:160-170. 10.1016/j.mri.2019.05.041
  38. Gutierrez A, Tuladhar A, Rajashekar D, Forkert ND. Lesion-preserving unpaired image-to-image translation between MRI and CT from ischemic stroke patients. 2022;12033:38-328. 10.1117/12.2613203
  39. Cho H, Lim S, Choi G, Min H. Neural stain-style transfer learning using GAN for histopathological images. 2017;80:1-10. 10.48550/arXiv.1710.08543
  40. Thaler C, Sedlacik J, Forkert ND, et al. Effect of geometric distortion correction on thickness and volume measurements of cortical parcellations in 3D T1w gradient echo sequences. PLoS One. 2023;18(4):e0284440. 10.1371/journal.pone.0284440



Articles from Journal of the American Medical Informatics Association : JAMIA are provided here courtesy of Oxford University Press
