Abstract
Deep learning has the potential to standardize and automate diagnostics for complex medical imaging data, but real-world clinical images are plagued by a high degree of heterogeneity and confounding factors that may introduce imbalances and biases into such processes. To address this, we developed and applied a data matching algorithm to 467,464 clinical brain magnetic resonance imaging (MRI) scans from the Mass General Brigham (MGB) healthcare system for Alzheimer’s disease (AD) classification. We identified 18 technical and demographic confounding factors that can be readily distinguished by MRI or have significant correlations with AD status, and isolated a training set free from these confounds. We then applied an ensemble of 3D ResNet-50 deep learning models to classify brain MRIs among groups of AD, mild cognitive impairment (MCI), and healthy controls. From a confounder-free matched dataset of 287,367 MRI files, we achieved an area under the receiver operating characteristic curve (AUROC) of 0.82 in distinguishing healthy controls from patients with AD or MCI. We also showed that confounding factors in heterogeneous clinical data can lead to artificial gains in model performance for disease classification, which our data matching approach corrects. This approach could accelerate the use of deep learning models for clinical diagnosis and find broad applications in medical image analysis.
Keywords: Deep learning, Magnetic resonance imaging, Alzheimer’s disease, Mild cognitive impairment, Data matching, Confounding factors
1. Introduction
Magnetic resonance imaging (MRI) plays an important role in understanding and diagnosing mental and neurological disorders. With the advances in machine learning (ML), several ML models have been developed and applied for MRI-based single-subject phenotypic classification of mental and neurological disorders [1–9]. In research settings, such studies in medical image analysis are often limited by sample size, as medical images are expensive and labor-intensive to acquire. With growing interest in big data in MRI [10], organized initiatives to acquire more data, such as the UK Biobank, have allowed for training sets in the hundreds [11] or thousands [12]. Such large databases represent cohorts for which considerable effort has been expended, even if imperfectly [13], to minimize heterogeneities in order to account for site differences and other confounding factors between test and control groups [14]. Heterogeneities, as noted by Ma et al. [15], can “emanate from scanner differences, differences in image acquisition protocols, or ethnic and treatment differences among participating patient populations,” so attempts to minimize such differences are important for multi-center studies. Voxel-based morphometry studies have shown reasonable, though imperfect, reliability in reproducing results on data across different sites [16,17].
Despite its feasibility in research settings, applying ML methods to clinical MRI data remains challenging due to a much higher degree of heterogeneity and confounding in clinical data. This arises not only from differences in technical variables (e.g., sites and scanners), but also from demographic factors such as age that would be difficult to regress out of individual medical images. In most clinical data, except data collected for a particular study, little to no effort is typically expended to reduce heterogeneities, and thus ML models usually cannot effectively be trained on such data. For instance, an ML model trained on data from a hospital that uses different scanners in its intensive care unit and dementia ward could determine dementia based on scanner type (i.e., a technical confounding factor) rather than true biomarkers in MRI. Previous efforts to amass clinical data have also drawn controversy: Google’s Project Nightingale [18] raised concerns about data privacy, and IBM Watson Health’s partnerships with hospitals [19] raised concerns about diagnostic accuracy. Nonetheless, clinical repositories hold promise for large and useful real-world data that can aid ML researchers in creating generalizable diagnostic models.
Conventionally, confounded or imbalanced datasets are addressed using confound regression techniques. However, most standard regression techniques, such as multivariate linear regression, are inapplicable to high-dimensional imaging data. Some efforts at confound regression in medical imaging have focused on very specific covariates in MRI (e.g., motion [20]), which cannot be applied to every confounding factor, while others have focused on adjusting the output accuracies themselves [21]. Deep learning models for confound regression have also recently been proposed specifically for medical imaging applications [22]. These approaches, however, require an adversarial training process, so achieving convergence is more difficult, and certain deep learning network layers and architectures would be unusable [23,24].
Another possible strategy is data matching between two or more groups across a number of discrete or continuous covariates. This can be done by finding data points between groups that are “closest” to one another (for continuous covariates) or that fall into the same categories (for discrete covariates), and excluding those without a match. The concept of data matching to remove bias from observational studies has been in practice since the 1940s [25,26], with a theoretical basis developed in the 1970s [27,28]. The main advantage of data matching is its general applicability, and thus the development of such methods has spread across different fields [29], such as statistics [30], sociology [31], epidemiology [32], economics [33], and political science [34]. However, data matching is relatively undeveloped in the context of big data for ML: many computational methods for matching data with continuous covariates are computationally intensive, discard too much data, or were designed to construct test/control divisions rather than to find a matched subset of a larger dataset.
Here, we developed an advanced data matching algorithm and applied it across 467,464 multimodal clinical brain MRIs from the Mass General Brigham (MGB) Healthcare System to isolate a training set free from technical and demographic confounding factors. Our data matching algorithm can match both categorical and continuous variables at large scale. As a model system, we elected to test ML’s ability to classify patients with Alzheimer’s disease (AD) and mild cognitive impairment (MCI) from healthy controls based on brain MRI. From a total of 141 variables analyzed, we identified 10 categorical and 8 continuous (3 demographic and 15 technical) confounding factors that can be readily distinguished by a deep-learning model, and generated a matched dataset of 287,367 MRI files. Using an ensemble of 3D ResNet-50 deep learning models, we achieved an area under the receiver operating characteristic curve (AUROC) of 0.82 in distinguishing patients with AD/MCI from healthy controls. Our results also showed that confounding factors in heterogeneous clinical data can lead to artificial gains in model performance for disease classification, but that these can be controlled by our data matching scheme. Given the generalizability of the data matching scheme, this approach could accelerate the testing and use of deep learning models for clinical diagnosis and find broad applications in medical image analysis.
2. Methods
2.1. Data
Two sets of MRI data, collected between 1995 and 2020 (with the vast majority collected after 2004), were requested from the Research Patient Data Registry of MGB. The first set (the “test group”, totaling 13,401 patients) comprised patients who had previously had a head or brain MRI and who had been prescribed at least one of Rivastigmine, Galantamine, Donepezil, or Memantine, or who had an International Classification of Diseases (ICD)-9 or ICD-10 code indicating AD (G30) or MCI (G31.84). The second set (the “control group”, totaling 23,910 patients) comprised patients who had previously had a head or brain MRI, had never been prescribed a central nervous system medication, and had never had a brain lesion or tumor. The dataset contained demographic data, MRIs, and technical covariates about the MRI. Dated ICD-9 and ICD-10 codes for all other disorders were also collected; ICD-9 codes were translated to their corresponding ICD-10 codes according to a database lookup table in the Mass General Brigham Enterprise Data Warehouse (EDW). Table S1 in the Supplementary Information shows all of the variables associated with the dataset, and Fig. S1 shows the distribution of MRIs per patient in the dataset. Ethics review and institutional review boards approved the study (IRB protocol 2015P001915).
2.2. Data acquisition and preprocessing
Fig. 1 depicts a high-level description of the study design for preprocessing, data matching, and training the deep learning model. Test group patients were divided into two groups: patients with an ICD code for AD or a previous prescription of Memantine, and patients with an ICD code for MCI or a previous prescription for Galantamine, Rivastigmine, or Donepezil, but not Memantine. The vast majority were labeled on the basis of medications, with only 8.7% of the AD group having an associated ICD code for AD and 7.6% of the MCI group having an ICD code for MCI. In total, 7546 unique ICD-10 codes were present across the full dataset. Analysis of the dataset indicated a high number of MRI files from patients with ICD-10 codes for a malignant neoplasm of the brain (C71.1, n = 2841; C71.9, n = 24,713; C79.31, n = 40,426), cerebral infarction (I63.9, n = 6755), neoplasm of unspecified behavior of the brain (D49.6, n = 17,425), benign neoplasm of cerebral meninges (D32.0, n = 2531), and previous head trauma (S00–S09, n = 8217); these cases were excluded entirely. Other ICD codes analyzed bore a relatively large gap between groups, though these were either symptoms consistent with the diagnosis (e.g., other amnesia (R41.3)) or irrelevant (e.g., encounter for general adult medical examination without abnormal findings (Z00.00)). Because clinical records span a long period of time, the dates of medication prescription and ICD code application were taken into account, so labels were given to individual scanning sessions rather than to particular patients (thus, a patient may fall into the control group if their scan was taken many years before a diagnosis). In all, 189,688 MRIs were classified as healthy controls; 38,229 as having MCI; 51,239 as having AD; and 180,097 were excluded.
MRI data were converted from their original DICOM format to the NIFTI format, reoriented using fslreorient2std from the FMRIB Software Library [35], and resized to a 96 × 96 × 96 NumPy array; further computations, such as motion correction and registration to a standard space, were forgone.
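The resizing step can be sketched as follows. This is an illustrative minimal version (the function name `resize_to_cube` is ours, and nearest-neighbour indexing stands in for whatever interpolation the actual pipeline used):

```python
import numpy as np

def resize_to_cube(vol, size=96):
    """Resample a 3D volume to size x size x size by nearest-neighbour
    indexing (a lightweight stand-in for, e.g., scipy.ndimage.zoom)."""
    idx = [np.round(np.linspace(0, s - 1, size)).astype(int) for s in vol.shape]
    return vol[np.ix_(idx[0], idx[1], idx[2])]

vol = np.random.rand(182, 218, 182)  # e.g., an MNI-sized volume
cube = resize_to_cube(vol)
print(cube.shape)  # (96, 96, 96)
```

In practice an interpolating resampler would be preferable for downsampling quality, but the shape contract is the same: every scan, whatever its native dimensions, arrives at the network as a fixed 96 × 96 × 96 array.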
Fig. 1.

A high-level description of the full pipeline for pre-processing and training a deep learning model on clinical MRI, with a display, on the right, of the reduction of the full dataset to the matched one.
2.3. Data covariates
MRI data were accompanied by 101 categorical variables (e.g., sex) and 40 continuous variables (e.g., age) at the time of the scan (Table S1). Several labels were simplified using keywords rather than taken verbatim due to their inconsistent descriptions. For instance, “ProtocolName”, which describes the MRI modality used, had 25,562 unique strings across the whole of the data; the most common descriptions of axial FLAIR were, variously, “AX_FLAIR_T2”, “AX_FLAIR”, “AX_T2_FLAIR”, “AXIAL_FLAIR”, “FLAIR_AXIAL”, and “FLAIR_AX”. We incorporated these into a new variable called “ProtocolNameSimplified”, which describes the modality and acquisition angle. Some labels were not used in the data matching scheme (see “Data matching”). Labels relating to the geographic location of the patient had too many unique values to preprocess in this way (City Name, for instance, had 2260 values), making their use impractical, while others were missing from too many data points to be incorporated (Technologist User ID, which identifies the individual collecting the data, was missing from 69.1% of data).
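A keyword-based normalization of this kind can be sketched as below. The keyword lists and function name are illustrative assumptions, not the study’s actual code:

```python
# Hypothetical keyword rules for collapsing free-text protocol names into
# a simplified modality + angle label.
MODALITY_KEYWORDS = ["FLAIR", "MPRAGE", "DWI", "SWI", "T2", "T1"]
ANGLE_KEYWORDS = {"AX": "AX", "AXIAL": "AX", "SAG": "SAG",
                  "SAGITTAL": "SAG", "COR": "COR", "CORONAL": "COR"}

def simplify_protocol(name):
    # tokenize on underscores/hyphens, then pick the first matching keyword
    tokens = name.upper().replace("-", "_").split("_")
    modality = next((m for m in MODALITY_KEYWORDS if m in tokens), "OTHER")
    angle = next((ANGLE_KEYWORDS[t] for t in tokens if t in ANGLE_KEYWORDS), "UNK")
    return f"{angle}_{modality}"

variants = ["AX_FLAIR_T2", "AX_FLAIR", "AX_T2_FLAIR",
            "AXIAL_FLAIR", "FLAIR_AXIAL", "FLAIR_AX"]
print({simplify_protocol(v) for v in variants})  # {'AX_FLAIR'}
```

All six free-text variants of axial FLAIR listed above collapse to a single category, which is what makes the variable usable for matching.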
2.4. Data matching
The isolation of a subset of data to classify by Alzheimer’s stage (i.e., AD, MCI, healthy control) was achieved using data matching. Thus, for each healthy control in the training and validation sets, there is a corresponding MCI and AD file with the same covariates, ensuring equal distributions with regard to factors such as age. The goal of a data matching function is to create an environment in which, given a label A (in this case, A ∈ {“Control”, “AD”, “MCI”}) and a number of confounds {C1, C2, …, Cn}, the following is true:

P(A | C1, C2, …, Cn) = P(A)

That is, after matching, knowing the confound values of a datapoint conveys no information about its class label.
This offers an alternative to directly regressing confounds, which is exceedingly difficult on data as complex and varied as clinical MRI.
A typical data matching algorithm operates on a number of categorical covariates by ensuring that a datapoint in a given class has a corresponding datapoint in another class with the same set of variables. For example, when considering sex and collection site, a male control patient from site A ought to be matched by male AD and MCI patients from site A.
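A minimal sketch of this exact-matching step over categorical covariates follows; the function and field names are illustrative, not the study’s implementation:

```python
import random
from collections import defaultdict

def match_categorical(records, label_key, covariate_keys, seed=0):
    """Within each covariate combination, keep an equal number of records
    from every class; combinations missing any class are dropped entirely."""
    rng = random.Random(seed)
    classes = {r[label_key] for r in records}
    cells = defaultdict(lambda: defaultdict(list))
    for r in records:
        cells[tuple(r[k] for k in covariate_keys)][r[label_key]].append(r)
    matched = []
    for by_class in cells.values():
        if set(by_class) != classes:
            continue  # no counterpart in some class: exclude this cell
        n = min(len(group) for group in by_class.values())
        for group in by_class.values():
            matched.extend(rng.sample(group, n))
    return matched

records = (
    [{"label": "Control", "sex": "M", "site": "A"}] * 5
    + [{"label": "AD", "sex": "M", "site": "A"}] * 3
    + [{"label": "MCI", "sex": "M", "site": "A"}] * 4
    + [{"label": "Control", "sex": "F", "site": "B"}] * 6  # no AD/MCI match
)
matched = match_categorical(records, "label", ["sex", "site"])
print(len(matched))  # 9: three per class from the (M, A) cell
```

The unmatched female controls from site B are excluded, mirroring the paper’s rule that a datapoint without a counterpart in every other class is dropped.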
Our data matching algorithm extends this beyond categorical covariates to continuous covariates, such as age. It does so by converting continuous covariates into categorical ones, so that the basic categorical data matching algorithm may then be applied; this conversion, however, is the most technically complex aspect of its implementation. Fig. 2 shows a flowchart of the data matching algorithm. The algorithm first translates continuous variables into categorical ones by discretizing them by range. To do so, it places each continuous value into K = 2 buckets (transforming it into a categorical variable with two values), such that each label has an equal number of datapoints in a bucket, and excludes the rest. The standard categorical data matching algorithm is then applied. Then, for the data points that were included, it compares, between each class, the statistical distributions of the continuous covariates within these buckets using the non-parametric Mann-Whitney U test; if any two classes are significantly different with p < 0.10 with respect to a particular continuous covariate, K is raised by one on that covariate and the process starts again. It terminates either when p ≥ 0.10 for all continuous covariates or when no data can be matched.
Fig. 2.

Flow chart of the data matching algorithm.
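The discretization-and-check machinery described above can be sketched as follows. Here `mann_whitney_p` is a pure-Python normal-approximation stand-in for a library routine such as scipy.stats.mannwhitneyu, and the helper names are our own illustrative assumptions:

```python
import math

def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney U p-value via the normal approximation
    (adequate for sample sizes of roughly 20 or more)."""
    pooled = sorted(list(x) + list(y))
    def midrank(v):  # average rank, handling ties
        lo = sum(1 for p in pooled if p < v)
        eq = sum(1 for p in pooled if p == v)
        return lo + (eq + 1) / 2.0
    n1, n2 = len(x), len(y)
    u = sum(midrank(v) for v in x) - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = abs(u - mu) / max(sigma, 1e-12)
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

def quantile_edges(values, k):
    """Bucket edges that give roughly equal counts per bucket."""
    s = sorted(values)
    return [s[len(s) * i // k] for i in range(1, k)]

def discretize(values, edges):
    """Map each continuous value to a bucket index (a categorical value)."""
    return [sum(v >= e for e in edges) for v in values]
```

After discretization, bucket indices are treated like any other categorical covariate by the matching step; if the retained classes still differ significantly (p < 0.10) on the underlying continuous values, K is incremented and the procedure repeats, as in the flowchart.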
This algorithm was extended from a previous method developed by Leming and Suckling [36] in two ways. First, bucket ranges were selected using the mean of the numerical range (i.e., equal divisions between the lowest and highest values in the continuous set of numbers) and the density range (i.e., buckets with equal numbers of datapoints between them), whereas the previous method used only the numerical range; this accommodated a wider range of distributions and helped in handling outliers. Second, the function was re-applied on data that had been excluded in its previous application until no more could be matched (i.e., recursively). While this potentially compromised the p-value of two matched datasets (as the union of two matched datasets, each with p > 0.10 individually, is not guaranteed to have p > 0.10), it increased the size of the training/test/validation sets substantially. Data that had one or more variables missing as a null value were excluded entirely (though a similar string, such as “unavailable”, was not treated as a null value).
We distinguished variables into two general categories: demographic (relating to the patient) and technical (relating to the MRI site or scanner). Variables with statistically significant differences between groups (i.e., those that can readily be distinguished by a deep-learning MRI model) were included in the data matching algorithm. Table 1 shows the results when several of these variables were matched for and classified directly in their own ML tests. The ResNet-50 model is capable of classifying sex, manufacturer, and ethnic group with a high degree of accuracy; these variables should therefore be matched prior to training. However, certain variables, such as religion, patient class (e.g., inpatient, outpatient, emergency), veteran status, or marital status, cannot be distinguished by MRI with a high degree of certainty (AUROC 0.51–0.54), making it less necessary to match for them.
Table 1.
Results of other variables, matched for and trained on a single 3D Resnet-50 model.
| Variable | Highest AUROC | Top labels |
|---|---|---|
| Manufacturer | 0.951 | Siemens; general electric |
| Employment status | 0.517 | Disabled; full-time; not employed; retired |
| Ethnic group | 0.624 | No - non hispanic; yes - hispanic |
| Marital status | 0.537 | Married; single; divorced |
| Patient class | 0.533 | Emergency; inpatient; outpatient |
| Religion | 0.528 | Not affiliated; roman catholic |
| Sex | 0.886 | Female; male |
| Veteran status | 0.540 | Yes; no, never served or is currently active |
Table 2 shows a list of variables used in the data matching algorithm. Sex, age (between 4 and 100 years), and ethnicity were the three demographic variables used in the data matching scheme. The following technical variables were used: MR Acquisition Type (e.g., 2D or 3D image), station name (e.g., the machine used for acquisition), protocol name simplified (see Data Covariates), specific absorption rate (SAR), pixel bandwidth, sequence variant, repetition time, slice thickness, imaging frequency, percent sampling, software version of the MRI, manufacturer of the MRI machine (limited to Siemens or General Electric), percent phase field of view (FOV), procedure ID, and number of Current Procedural Terminology (CPT) codes.
Table 2.
Descriptions of variables used in the data matching scheme. D/T* denote demographic and technical variables, respectively. Relevant technical variable descriptions were taken from online DICOM definitions [37].
2.5. Machine learning model
Classifications were undertaken using an ensemble of five 3D ResNet-50 models [38] (see Fig. S2 in the Supplementary Information for model parameters). Ensembles were used because, in practice, averaging more classifications led to higher model performance than single models [39]. For each model, data were matched and divided into training and validation sets, with PatientID taken into account to ensure multiple files from the same patient were not distributed across the training/validation/test sets. The training set was iteratively loaded into memory in batches of 500, each fit for ten iterations using an Adam optimizer on a categorical cross-entropy loss function; this was repeated 50 times for each model. The model with the highest score on the validation set was then evaluated. A test set, consisting of the remaining data from patients outside the training and validation sets, was iteratively loaded into memory.
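The AUROC used to evaluate these models can be computed directly from raw scores with a rank-based formula; this sketch (the name `auroc` is ours) is equivalent to the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative:

```python
def auroc(scores, labels):
    """Rank-based AUROC with mid-rank tie handling; `labels` are 0/1."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum, i = 0.0, 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1  # group tied scores
        mid = (i + 1 + j) / 2.0  # average rank of the tied group
        rank_sum += mid * sum(1 for k in range(i, j) if pairs[k][1] == 1)
        i = j
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

print(auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

This is numerically identical to what library routines such as sklearn's roc_auc_score report, and is threshold-free, which is why it suits imbalanced clinical test sets.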
In this work, ensembles of independent models were also used, allowing for a distribution of predictions for each particular MRI file; furthermore, MRI predictions were combined to allow for a distribution of predictions for a particular scanning session that incorporated multiple modalities, or multiple scanning sessions with the same patient. In other words, if 30 different MRIs of a patient each fell into the test set of 10 models, then the resultant 30 × 10 = 300 predictions would be averaged to a single value; likewise, if a patient only had a single MRI prediction in one of the models’ test sets, this single prediction would be used.
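The per-session and per-patient aggregation just described amounts to grouped averaging of scores; a minimal sketch, with illustrative field names rather than the study’s actual data schema, is:

```python
from collections import defaultdict

def average_scores(predictions, key):
    """Average per-file model outputs over a grouping key such as
    'session' or 'patient'. Field names here are illustrative."""
    totals, counts = defaultdict(float), defaultdict(int)
    for p in predictions:
        totals[p[key]] += p["score"]
        counts[p[key]] += 1
    return {k: totals[k] / counts[k] for k in totals}

# e.g., 3 files from one patient scored by 2 models -> 6 predictions,
# collapsed to a single patient-level value
predictions = [
    {"patient": "P1", "session": "S1", "score": s}
    for s in (0.6, 0.7, 0.8, 0.5, 0.9, 0.7)
]
print(round(average_scores(predictions, "patient")["P1"], 6))  # 0.7
```

Grouping by "session" instead of "patient" yields the intermediate, session-level predictions reported in the results.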
2.6. Three-stage AD classification
For this work, we conducted classifications on three AD stages: healthy control, MCI, and AD. We demonstrated the effectiveness of the data matching algorithm by comparing the statistical significance of correlations between substantially represented variables in an unmatched and a fully matched dataset. We then applied the same classification to an unmatched dataset and a partially matched dataset using only technical covariates (both equally sized) to illustrate the artificial performance gains of models with confounding factors. A final classification was undertaken on an ensemble of five ResNet-50 models, with further analysis of which groups classified better.
3. Results
3.1. Data matching
Fig. 3 shows statistically significant correlations between selected variables in the full dataset and the corresponding correlations in a matched dataset. These heatmaps display the p-values of non-parametric statistical tests used to compare variables’ dependencies. Continuous variables were compared to each other using the Mann-Whitney U test; categorical variables were compared to each other using the Kruskal-Wallis one-way ANOVA test; and the dependency of continuous variables on categorical variables was also assessed using the Kruskal-Wallis test. No appropriate statistical test was found for the dependency of categorical variables on continuous variables; these comparisons are shown as white squares (NaN). The results show that nearly all of the indicated variables (highlighted in bold), except vendor-reported echo spacing, were statistically different between the AD stages (the first column, indicated by a red arrow). However, these correlations were largely removed after data matching, leaving only 10 variables showing statistical correlations with AD stage (AlzStage). Notably, these include marital and veteran status, which are previously known risk factors for AD. Unlike the three demographic variables (age, sex, ethnicity) that we included in the data matching, marital and veteran status cannot be distinguished by an MRI-based deep learning model (AUROC = 0.537 for marital status and 0.540 for veteran status). These variables are therefore less necessary to match despite their significant correlation with AD stage. Medication was used to define the AD/MCI label and thus cannot be matched out. The other variables are locational (city name, country, county, current PCP location) and administrative (in/outpatient status, primary care provider ID, registration status) information, which could not effectively be matched for because they had too many distinct categories.
Fig. 3.

P-values from comparisons of dependency of variables. Shown is the dependency of variables on the y-axis to those on the x-axis in the full dataset (top) and a matched sample (bottom), using non-parametrical statistical tests. Variables in bold on the left axis are those that have a statistically significant correlation with the presence of Alzheimer’s (indicated by the variable name AlzStage, in the first column). White tiles indicate dependencies of categorical on continuous variables, for which no test was found.
Fig. 4A–C show the distributions of age, sex, and MRI manufacturer for unmatched data, partially matched data using only technical variables, and fully matched data. For patient ages, there is a significant discrepancy between healthy controls (39.7 ± 15.7) and the MCI/AD groups (63.7 ± 13.0 for MCI, 60.7 ± 15.9 for AD) in the original unmatched data. This is expected in real-world clinical data, as MCI/AD mostly develop in elderly populations. While a deep-learning model trained on clinical data without data matching might show higher classification accuracy than a model trained on matched data, this is likely due to age acting as a confounding factor between the AD/MCI and healthy control groups. Notably, while it is typical to threshold subjects by age in AD studies, the data matching algorithm does so naturally, excluding younger patients from the training groups.
Fig. 4.

Comparison of variable distributions in unmatched, partially matched only on technical variables, and fully matched data. A. Patient ages in years, B. Patient sex C. Manufacturer of the MRI machines. D. The comparison of model performances in classifying AD status for the matched data, data matched only on technical variables, and unmatched data, using single ResNet-50 models trained on equal-sized training/validation sets and evaluated on the entire test set (excluding patients with MRIs in the training/validation sets).
Fig. 4D also shows the classification results on unmatched data, partially matched data using the 15 technical variables, and fully matched data using all 18 demographic and technical variables. Each used a single ResNet-50 model with a training set totaling 2176 MRI files and a validation set totaling 544 files; the test set consisted of all other datapoints from patients outside the training and validation sets. The model trained on unmatched data distinguished healthy controls with 0.861 AUROC, data matched only on technical confounding factors performed at 0.832 AUROC, and fully matched data performed at 0.748. This supports the hypothesis that confounding factors can lead to artificial increases in model performance.
3.2. Classification on ensemble models
Fig. 5 shows the results of the multi-stage AD classification using an ensemble of five ResNet-50 models. The average size of a matched dataset for a single model was 3487 out of a possible 279,156 files. Fig. 5A shows that model performance notably increased when predictions were averaged; while predictions of individual MRI files performed at 0.665, predictions averaged over particular scanning sessions increased to 0.758, and those averaged further over whole patients performed at 0.816. Because of this, all other AUROCs are reported as an average over a particular patient ID. For different imaging modalities, the model distinguished between healthy control and AD/MCI with above 0.80 AUROC (Fig. 5B). The results show that magnetization-prepared rapid acquisition gradient echo (MPRAGE, AUROC = 0.859) and 3D MRI (AUROC = 0.825) led to the highest model performance, though these performances are all comparable and fall within standard error.
Fig. 5.

Descriptions of distributions and results on different subdivisions of data. A. AUROCs of ensemble model across predictions of individual MRI files; predictions averaged by scanning session; and predictions averaged by patient. B. Performance relative to MRI modality. C. Performance in 2D and 3D MRI.
The model was likely better able to distinguish between control and AD/MCI than between AD and MCI specifically because detecting gradations between stages of dementia is simply a more difficult task than distinguishing a healthy brain from a diseased one. The results further suggested that our model was better able to utilize fully 3D sequences (i.e., MPRAGE) than 2D sequences (Fig. 5C), suggesting that sequences with more information overall aid classification accuracy. There was no clear relationship between the number of patients included (N) and performance.
4. Discussion
In this work, we describe a practical approach to classifying large amounts of clinical MRIs of the brain. This work represents an attempt to bring big data MRI analysis from research data to highly heterogeneous clinical data, highlighting, as well, the considerations and limitations in doing so. We isolated measured covariates and established how successfully each may be estimated from MRI data alone. This led us to identify variables that could adversely act as confounding factors in deep learning models and construct matched datasets. With fully matched data in an ensemble, our results showed an AUROC of 0.82 when distinguishing multimodal MRIs between healthy controls from AD/MCI patients.
4.1. Classification results
The present work is novel in the context of applying deep learning models to distinguish between AD, MCI, and controls from large clinical MRI data. In deep learning applied to research MRI, distinguishing between AD, MCI, and controls is an oft-studied problem. Several large MRI datasets have been made available that power these studies, including the AD Neuroimaging Initiative [40]; the Australian Imaging, Biomarker & Lifestyle Flagship Study of Ageing [41]; the Open-Access Series of Imaging Studies (OASIS) [42]; the National Alzheimer’s Coordinating Center (NACC) [43]; and the Framingham Heart Study (FHS) [44]. Wen et al. 2020 [45] provide a review of 30 ML studies that have diagnosed T1 MRIs at various stages of AD and MCI development across these databases, achieving up to 91% accuracy [46], and more such studies continue to be released [4,46–56]. Table S2 shows a comparison of some of these works with the present study. This study is unique in that it i) applies deep learning models to routinely collected clinical data rather than data collected in a research setting; ii) uses a dataset orders of magnitude larger than these public datasets; and iii) articulates the importance of considering confounding factors present in highly heterogeneous clinical data. It should also be noted that direct comparison of model performance is less meaningful here because these studies were done in different settings and modalities from the present study. We showed that not matching for demographic covariates leads to a deceptive increase in classification accuracy, and that matching for neither technical nor demographic covariates leads to even greater increases. This emphasizes that higher model performance does not necessarily mean a better model in clinical settings, because clinical data are highly heterogeneous. It is therefore critical to preprocess data carefully and construct matched datasets for model training to minimize potential errors.
Besides the purely scientific difficulties in seeking biomarkers to detect AD, working with clinical data brought to light other difficulties, namely labeling. We used the presence of an ICD code or a previous prescription of Memantine as an indicator of AD, and an ICD code or a previous prescription of Rivastigmine, Galantamine, or Donepezil as an indicator of MCI. However, while prescription history is recorded in detail in the databases, it is an imperfect diagnostic indicator. For instance, while Memantine ought only to be prescribed to those with a severe case of cognitive impairment, this may not always be the case in common clinical practice. Furthermore, the databases we drew from may be incomplete, as it is possible that a patient receiving an MRI in the MGB healthcare system may have had an ICD code or a Memantine prescription stored in an outside system. This issue of incomplete or possibly inaccurate labeling applies not only to the main disease labels used – in this study, healthy control, MCI, and AD – but also to those for confounding factors incorporated into the matching algorithm. This study lacked the usual rigor with which data collected in a research setting are curated, and this is reflected in overall model performance. Another difficulty with labeling in a clinical database is that the lack of consistent information prevents using the finer gradations and classifications applied to MCI in a research setting, like progressive and stable states. As deep learning analysis becomes more popular in clinical spaces, the consideration and construction of an electronic health record system facilitating consistent labeling of prospective data could help accelerate clinical testing of deep learning algorithms.
Furthermore, incorporating other neurological examination results for cognitive impairment assessment, such as the mini-mental state examination (MMSE), the Montreal cognitive assessment (MoCA), and the clinical dementia rating (CDR), could extend deep learning to more systematic studies that evaluate the progressive status of patients, identify patients at higher risk of developing AD, and develop new diagnostic approaches for AD.
Our results also indicated differences in model performance with respect to data resolution, given the higher accuracy of the model in 3D as opposed to 2D data (Fig. 5C), as well as general differences with respect to MRI modality. Further gains in model performance could be made with higher-resolution data, and this likely accounts for some of the discrepancies between model performance in this study and those of similar AD classification studies in a purely research environment.
Taken together, these results offer two common-sense recommendations for clinical systems that wish to incorporate automated diagnostics. First, standardization and completeness of databases are critical to optimizing accuracy. Second, collection of high-quality, high-resolution data, even when it does not directly aid the task for which it is collected, improves the performance of diagnostic systems overall. While these recommendations would almost certainly improve model performance, the results would still most likely fall short of what would be necessary for single-subject AD diagnostics based on MRI alone. Even so, because the results were far above chance, there is promise that automatic evaluation of MRI may be one factor aiding clinicians in establishing the overall likelihood of AD. Combined with other covariates about the patient and other imaging modalities, diagnostic and prognostic accuracy could be substantially improved in the future to the point of clinical usefulness.
4.2. Technical design of the study
A point of contention about this study may be the lack of a conventional data preprocessing pipeline. We used simple methods with low computational overhead that focused on bringing data into a similar orientation and dimension; methodologically, our approach was closer to a deep learning or computer vision study than to a typical brain MRI study relying on inferential statistics. There were three reasons for this. First, we practically lacked the computing power to carry out all the preprocessing that would be necessary on over 400,000 MRIs; skull stripping, motion correction, and nonlinear registration are all methods that, in their current state of the art, may take two to three hours per MRI to compute. Second, most preprocessing pipelines were designed for research MRI, not clinical; clinical MRI, in all its forms, would prove too diverse for any single pipeline to process consistently, especially given that effective quality control would be extremely time-consuming. Third, considering the applications in clinical settings, the deep learning models were designed to handle inconsistent and spatially irregular data, making preprocessing less of a necessity than it would be if only inferential statistics were used. As MRI diagnostics move from research to clinical settings, performing computations in real time on very imperfect data may become more and more of a necessity.
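The kind of low-overhead normalization described above can be approximated in a few lines. The sketch below is our own illustration, not the study's released code (the function name and the nearest-neighbour resampling choice are assumptions): it resamples an arbitrarily sized volume to a common cubic grid and z-scores its intensities, with no skull stripping, motion correction, or registration.

```python
import numpy as np

def to_model_input(vol, target=(96, 96, 96)):
    """Resample a 3D volume to a fixed cube by nearest-neighbour index
    mapping, then z-score intensities per volume. Deliberately minimal:
    no skull stripping, motion correction, or registration."""
    # Map each target coordinate back to a source index along each axis.
    idx = [np.clip((np.arange(t) * s / t).astype(int), 0, s - 1)
           for t, s in zip(target, vol.shape)]
    out = vol[np.ix_(*idx)].astype(np.float32)
    # Per-volume normalization so scans from different scanners share a scale.
    return (out - out.mean()) / (out.std() + 1e-8)

# Stand-in for a clinical scan with non-cubic dimensions.
scan = np.random.rand(256, 256, 40)
x = to_model_input(scan)
print(x.shape)  # (96, 96, 96)
```

A step like this runs in milliseconds per volume, which is what makes it feasible at the scale of hundreds of thousands of scans where registration-based pipelines are not.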
Another design decision was to pool all MRI data, regardless of modality, into the same model, rather than training independent models per modality. This was done, practically, because it saved computational power (only one set of models had to be trained, rather than five or six, one per modality); because it increased training set size and diversity; and because generalist models able to classify any type of MRI are more useful in clinical settings. As the data used are routinely collected and disease labeling practices become more sophisticated, the applicability of dataset matching to ever-growing clinical datasets can only increase in the long term.
4.3. Data matching and real-world applicability
Even so, the need for as large a dataset as possible to power deep learning models also limits clinical applicability to diseases that have many labels and few confounds, making the approach less applicable to rare diseases, diseases measured only on particular sites or scanners, or diseases found in small populations. Because of the nature of the data matching algorithm, the results of this study represent the most “normative” cases; our study says very little about patients who fall into rare extrema. The extremum of most clinical interest here is early-onset Alzheimer’s; as may be observed in Fig. 4, the matched group did contain adults with Alzheimer’s in their 30s, but this constituted a very small sample. Given the nature of this task, it is unlikely that any study could practically diagnose such outliers unless more such data were acquired, and this is an ongoing issue in the study of AD. The present algorithm does not explicitly exclude such instances, however, and it would include such data if more were acquired.
In data matching, there is generally an inverse relationship between the number of confounding factors included in the algorithm and the size of the resultant matched dataset: the more factors included, the smaller the matched dataset. In deciding which covariates to match by, a balance ought to be struck between the number of covariates included and the size of the resulting dataset, which makes it preferable to start with an extremely large dataset.
The presence of a large, labeled dataset, however, does not guarantee that a data matching scheme is applicable. If one covariate had values exclusive to one class of data — for instance, if everyone with AD/MCI were scanned with a T1 sequence and every control with a T2 sequence (in practice, such exclusivity would more likely arise across several variables jointly) — the resulting “matched” dataset would be empty. Nor does it follow that every covariate ought to be included; as a trivial example, including a patient’s street name would account for geography, but it is, in itself, an irrelevant variable that would likely result in a vanishingly small matched set.
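Both behaviors — shrinkage as covariates are added, and collapse under class-exclusive covariates — can be seen in a toy version of exact covariate matching. The sketch below is a simplified stand-in for the full algorithm, with invented records and field names, not the study's implementation.

```python
from collections import defaultdict

def exact_match(records, covariates):
    """Toy exact matching: keep only records whose covariate profile
    occurs in every class, balancing class counts within each profile."""
    strata = defaultdict(lambda: defaultdict(list))
    classes = {r["label"] for r in records}
    for r in records:
        key = tuple(r[c] for c in covariates)
        strata[key][r["label"]].append(r)
    matched = []
    for by_label in strata.values():
        if set(by_label) == classes:          # profile present in every class
            n = min(len(v) for v in by_label.values())
            for v in by_label.values():
                matched.extend(v[:n])         # equalize counts per class
    return matched

# Invented example: scan sequence perfectly separates the classes.
records = [
    {"label": "AD",  "sex": "F", "sequence": "T1"},
    {"label": "AD",  "sex": "M", "sequence": "T1"},
    {"label": "CTL", "sex": "F", "sequence": "T2"},
    {"label": "CTL", "sex": "M", "sequence": "T2"},
]
print(len(exact_match(records, ["sex"])))              # 4: sex alone matches
print(len(exact_match(records, ["sex", "sequence"])))  # 0: exclusivity empties the set
```

Adding the class-exclusive covariate leaves no stratum containing both classes, so the matched set is empty — exactly the failure mode that a covariate like street name would trigger at scale.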
Results of this study may also be improved with different design parameters – for instance, more computer memory to allow for higher-resolution versions of the data than the 96 × 96 × 96 arrays used, adding more models to the ensemble ad infinitum, or redesigning the matching algorithm to distribute the test set more evenly across the ensemble models and thus increase overall dataset utilization – though these would represent relatively moderate gains in performance compared to the more fundamental issues addressed above with incomplete labeling, diagnostic criteria, and so on.
5. Conclusion
In conclusion, while the relationship between MRI and Alzheimer’s diagnostics has been extensively investigated, this work represents a unique case study in using deep learning to extract useful information from a highly heterogeneous clinical dataset from one of the biggest hospital systems in the U.S. While confounding factors may complicate the analysis of clinical MRI, this is not insurmountable as long as those confounding factors are measurable and accounted for, which we have done through the data matching process proposed in this work. Further improvements in automated clinical diagnostics may come with database completeness and standardization, as well as the collection of high-resolution data in clinical settings. This could lead to deep learning models serving as one factor in establishing the likelihood of an Alzheimer’s prognosis in clinical practice. Another important task will be testing the general applicability of deep learning algorithms for AD classification through external validation across different clinical settings; we are actively working on this by testing deep learning models on external data collected from hospitals other than the one used for training. As both the size of clinical databases and interest in automated diagnostics continue to grow, we envision that data matching could find broad application in managing these very large and potentially useful databases.
Acknowledgements
This study was funded by U.S. NIH grants P30AG062421 and R00CA201248-S1.
Declaration of competing interest
The authors declare no conflicts of interest.
CRediT authorship contribution statement
M.L. conceived the idea and performed experiments. H.I. supervised the overall study. S.D. provided data resources. M.L. and H.I. wrote the manuscript. All authors analyzed the results and edited the manuscript.
Appendix A. Supplementary data
Supplementary data to this article can be found online at https://doi.org/10.1016/j.artmed.2022.102309.
Data availability
All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. The code for analysis is available at https://csb.mgh.harvard.edu/bme/_software.
References
- [1]. Anderson JS, Nielsen JA, Froehlich AL, DuBray MB, Druzgal TJ, Cariello AN, et al. Functional connectivity magnetic resonance imaging classification of autism. Brain 2011;134:3742–54.
- [2]. Arbabshirani MR, Kiehl KA, Pearlson GD, Calhoun VD. Classification of schizophrenia patients based on resting-state functional network connectivity. Front Neurosci 2013;7.
- [3]. Kim J, Calhoun VD, Shim E, Lee JH. Deep neural network with weight sparsity control and pre-training extracts hierarchical features and enhances classification performance: evidence from whole-brain resting-state functional connectivity patterns of schizophrenia. Neuroimage 2016;124:127–46.
- [4]. Ortiz A, Munilla J, Gorriz JM, Ramirez J. Ensembles of deep learning architectures for the early diagnosis of the Alzheimer’s disease. Int J Neural Syst 2016;26:1650025.
- [5]. Tejwani R, Liska A, You H, Reinen J, Das P. Autism classification using brain functional connectivity dynamics and machine learning. arXiv; 2017.
- [6]. Heinsfeld AS, Franco AR, Craddock RC, Buchweitz A, Meneguzzi F. Identification of autism spectrum disorder using deep learning and the ABIDE dataset. NeuroImage Clin 2018;17:16–23.
- [7]. Kazeminejad A, Sotero RC. Topological properties of resting-state fMRI functional networks improve machine learning-based autism classification. Front Neurosci 2019;12.
- [8]. Wang T, Kamata S. Classification of structural MRI images in ADHD using 3D fractal dimension complexity map. In: 2019 IEEE International Conference on Image Processing (ICIP); 2019. p. 215–9.
- [9]. Nunes A, Schnack HG, Ching CRK, et al. Using structural MRI to identify bipolar disorders – 13 site machine learning study in 3020 individuals from the ENIGMA bipolar disorders working group. Mol Psychiatry 2020;25:2130–43.
- [10]. Smith SM, Nichols TE. Statistical challenges in “big data” human neuroimaging. Neuron 2018;97:263–8.
- [11]. Abraham A, Milham MP, Di Martino A, Craddock RC, Samaras D, Thirion B, et al. Deriving reproducible biomarkers from multi-site resting-state data: an autism-based example. Neuroimage 2016;147:736–45.
- [12]. He T, Kong R, Holmes AJ, Sabuncu MR, Eickhoff MR, Bzdok D, et al. Is deep learning better than kernel regression for functional connectivity prediction of fluid intelligence? In: 2018 International Workshop on Pattern Recognition in Neuroimaging (PRNI); 2018.
- [13]. Alfaro-Almagro F, McCarthy P, Afyouni S, Andersson JLR, Bastiani M, Miller KL, et al. Confound modelling in UK Biobank brain imaging. Neuroimage 2021;224:117002.
- [14]. Van Horn JD, Toga AW. Multisite neuroimaging trials. Curr Opin Neurol 2009;22:370–8.
- [15]. Ma Q, Zhang T, Zanetti MV, Shena H, Satterthwaite TD, Wolf DH, et al. Classification of multi-site MR images in the presence of heterogeneity using multi-task learning. NeuroImage Clin 2018;19:476–86.
- [16]. Schnack HG, van Haren NE, Brouwer RM, van Baal GC, Picchioni M, Weisbrod M, et al. Mapping reliability in multicenter MRI: voxel-based morphometry and cortical thickness. Hum Brain Mapp 2010;31:1967–82.
- [17]. Jovicich J, Minati L, Marizzoni M, Marchitelli R, Sala-Llonch R, Bartres-Faz D, et al. Longitudinal reproducibility of default-mode network connectivity in healthy elderly participants: a multicentric resting-state fMRI study. Neuroimage 2015;124(Pt A):442–54.
- [18]. Copeland R. Google’s ‘Project Nightingale’ gathers personal health data on millions of Americans. The Wall Street Journal; 2019. Available from: https://www.wsj.com/articles/google-s-secret-project-nightingale-gathers-personal-health-data-on-millions-of-americans-11573496790.
- [19]. Quach K. IBM Watson dishes out ’dodgy cancer advice’, Google Translate isn’t better than humans yet, and other AI tidbits. The Register; 2018. Available from: https://www.theregister.co.uk/2018/07/28/ai_roundup_720718/.
- [20]. Parkes L, Fulcher B, Yucel M, Fornito A. An evaluation of the efficacy, reliability, and sensitivity of motion correction strategies for resting-state functional MRI. Neuroimage 2018;171:415–36.
- [21]. Dinga R, Schmaal L, Penninx BWJH, Veltman DJ, Marquand AF. Controlling for effects of confounding variables on machine learning predictions. bioRxiv; 2020.
- [22]. Zhao Q, Adeli E, Pohl K. Training confounder-free deep learning models for medical applications. Nat Commun 2020;11.
- [23]. Salimans T, Goodfellow I, Zaremba W, Cheung V, Radford A, Chen X. Improved techniques for training GANs. arXiv; 2016.
- [24]. Chintala S, Denton E, Arjovsky M, Mathieu M. How to train a GAN? Tips and tricks to make GANs work. NIPS; 2016. Available from: https://github.com/soumith/ganhacks.
- [25]. Greenwood E. Experimental sociology: a study in method. 1st ed. New York: King’s Crown Press; 1945.
- [26]. Chapin F. Experimental designs in sociological research. 1st ed. New York: Harper; 1947.
- [27]. Cochran WG, Rubin DB. Controlling bias in observational studies: a review. Sankhya: Indian J Stat A 1973;35:417–46.
- [28]. Rubin DB. Matching to remove bias in observational studies. Biometrics 1973;29:159–84.
- [29]. Stuart EA. Matching methods for causal inference: a review and a look forward. Stat Sci 2010;25:1–21.
- [30]. Rosenbaum PR. Optimal matching for observational studies. J Am Stat Assoc 1989;84:1024–42.
- [31]. Morgan SL, Harding DJ. Matching estimators of causal effects: prospects and pitfalls in theory and practice. Sociol Methods Res 2006;35:3–60.
- [32]. Brookhart MA, Schneeweiss S, Rothman KJ, Glynn RJ, Avorn J, Stürmer T. Variable selection for propensity score models. Am J Epidemiol 2006;163:1149–56.
- [33]. Imbens GW. Nonparametric estimation of average treatment effects under exogeneity: a review. Rev Econ Stat 2004;86:4–29.
- [34]. Ho DE, Imai K, King G, Stuart EA. Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Polit Anal 2007;15:199–236.
- [35]. Woolrich MW, Jbabdi S, Patenaude B, Chappell M, Makni S, Behrens T, et al. Bayesian analysis of neuroimaging data in FSL. Neuroimage 2009;45:S173–86.
- [36]. Leming M, Suckling J. Deep learning for sex classification in resting-state and task functional brain networks from the UK Biobank. Neuroimage 2021;241:118409.
- [37]. DICOM PS3.3 2018d - Information object definitions. Medical Imaging Technology Association (MITA); 2018. http://dicom.nema.org/medical/Dicom/2018d/output/chtml/part03/sect_C.8.3.html [accessed 25 June 2021].
- [38]. Ju J, contributors. keras-resnet3d. GitHub; 2019. https://github.com/JihongJu/keras-resnet3d.
- [39]. Leming M, Suckling J. Ensemble deep learning on large, mixed-site fMRI datasets in autism and other tasks. Int J Neural Syst 2020;30:2050012.
- [40]. Weiner MW, Veitch DP, Aisen PS, Beckett LA, Cairns NJ, Green RC, et al. The Alzheimer’s disease neuroimaging initiative: a review of papers published since its inception. Alzheimers Dement 2013;9:e111–194.
- [41]. Ellis KA, Bush AI, Darby D, De Fazio D, Foster J, Hudson P, et al. The Australian imaging, biomarkers and lifestyle (AIBL) study of aging: methodology and baseline characteristics of 1112 individuals recruited for a longitudinal study of Alzheimer’s disease. Int Psychogeriatr 2009;21:672–87.
- [42]. Marcus DS, Fotenos AF, Csernansky JG, Morris JC, Buckner RL. Open access series of imaging studies: longitudinal MRI data in nondemented and demented older adults. J Cogn Neurosci 2010;22:2677–84.
- [43]. Beekly DL, Ramos EM, van Belle G, Deitrich W, Clark AD, Jacka ME, et al. The National Alzheimer’s Coordinating Center (NACC) database: an Alzheimer disease database. Alzheimer Dis Assoc Disord 2004;18:270–7.
- [44]. Andersson C, Johnson AD, Benjamin EJ, Levy D, Vasan RS. 70-year legacy of the Framingham heart study. Nat Rev Cardiol 2019;16:687–98.
- [45]. Wen J, Thibeau-Sutre E, Diaz-Melo M, Samper-Gonzalez J, Routier A, Bottani S, et al. Convolutional neural networks for classification of Alzheimer’s disease: overview and reproducible evaluation. Med Image Anal 2020;63.
- [46]. Liu M, Zhang J, Adeli E, Shen D. Landmark-based deep multi-instance learning for brain disease diagnosis. Med Image Anal 2018;43:157–68. Available from: http://www.sciencedirect.com/science/article/pii/S1361841517301524.
- [47]. Cheng D, Liu M, Fu J, Wang Y. Classification of MR brain images by combination of multi-CNNs for AD diagnosis. In: Proceedings Volume 10420, Ninth International Conference on Digital Image Processing (ICDIP 2017); 2017.
- [48]. Korolev S, Safiullin A, Belyaev M, Dodonova Y. Residual and plain convolutional neural networks for 3D brain MRI classification. In: 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017); 2017.
- [49]. Lin W, Tong T, Gao Q, Guo D, Du X, Yang Y, et al. Convolutional neural networks-based MRI image analysis for the Alzheimer’s disease prediction from mild cognitive impairment. Front Neurosci 2018;12.
- [50]. Li F, Liu M, the Alzheimer’s Disease Neuroimaging Initiative. Alzheimer’s disease diagnosis based on multiple cluster dense convolutional networks. Comput Med Imaging Graph 2018;70:101–10.
- [51]. Bae J, Stocks J, Heywood A, Jung Y, Jenkins L, Katsaggelos A, et al. Transfer learning for predicting conversion from mild cognitive impairment to dementia of Alzheimer’s type based on 3D-convolutional neural network. bioRxiv; 2019. Available from: https://www.biorxiv.org/content/early/2019/12/23/2019.12.20.884932.
- [52]. Yagis E, de Herrera AGS, Citi L. Generalization performance of deep learning models in neurodegenerative disease classification. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); 2019. p. 1692–8.
- [53]. Pan D, Zeng A, Jia L, Huang Y, Frizzell T, Song X. Early detection of Alzheimer’s disease using magnetic resonance imaging: a novel approach combining convolutional neural networks and ensemble learning. Front Neurosci 2020;14.
- [54]. Liu S, Yadav C, Fernandez-Granda C, Razavian N. On the design of convolutional neural networks for automatic detection of Alzheimer’s disease. In: Dalca AV, McDermott MBA, Alsentzer E, Finlayson SG, Oberst M, Falck F, et al., editors. Proceedings of the Machine Learning for Health NeurIPS Workshop. vol. 116 of Proceedings of Machine Learning Research. PMLR; 2020. p. 184–201. Available from: http://proceedings.mlr.press/v116/liu20a.html.
- [55]. Du Y, Fu Z, Calhoun VD. Classification and prediction of brain disorders using functional connectivity: promising but challenging. Front Neurosci 2018;12:525.
- [56]. Folego G, Weiler M, Casseb RF, Pires R, Rocha A, ADN Initiative, et al. Front Bioeng Biotechnol 2020;8. https://www.frontiersin.org/articles/10.3389/fbioe.2020.534592/full.
