Translational Psychiatry. 2021 Mar 15;11:168. doi: 10.1038/s41398-021-01286-x

Magnetic resonance imaging for individual prediction of treatment response in major depressive disorder: a systematic review and meta-analysis

Sem E Cohen 1, Jasper B Zantvoord 1,2, Babet N Wezenberg 1, Claudi L H Bockting 1,3, Guido A van Wingen 1
PMCID: PMC7960732  PMID: 33723229

Abstract

No tools are currently available to predict whether a patient suffering from major depressive disorder (MDD) will respond to a certain treatment. Machine learning analysis of magnetic resonance imaging (MRI) data has shown potential in predicting response for individual patients, which may enable personalized treatment decisions and increase treatment efficacy. Here, we evaluated the accuracy of MRI-guided response prediction in MDD. We conducted a systematic review and meta-analysis of all studies using MRI to predict single-subject response to antidepressant treatment in patients with MDD. Classification performance was calculated using a bivariate model and expressed as area under the curve, sensitivity, and specificity. In addition, we analyzed differences in classification performance between different interventions and MRI modalities. Meta-analysis of 22 samples including 957 patients showed an overall area under the bivariate summary receiver operating characteristic curve of 0.84 (95% CI 0.81–0.87), sensitivity of 77% (95% CI 71–82), and specificity of 79% (95% CI 73–84). Although classification performance was higher for electroconvulsive therapy outcome prediction (n = 285, 80% sensitivity, 83% specificity) than medication outcome prediction (n = 283, 75% sensitivity, 72% specificity), there was no significant difference in classification performance between treatments or MRI modalities. Prediction of treatment response using machine learning analysis of MRI data is promising but should not yet be implemented into clinical practice. Future studies with more generalizable samples and external validation are needed to establish the potential of MRI to realize individualized patient care in MDD.

Subject terms: Predictive markers, Neuroscience, Depression

Introduction

Major depressive disorder (MDD) is a debilitating disease, accounting for 40% of the global disability-adjusted life years caused by psychiatric disorders1. Depression is associated with impaired social functioning and unemployment, as well as with a wide range of chronic physical illnesses, such as diabetes and cardiovascular disease2,3. MDD is estimated to have a lifetime prevalence of 20.6% in the United States4. Despite general consensus that effective treatment of depression is paramount both for a patient’s health and for reducing the global burden of disease, the disease burden of MDD has not decreased in the past decades5. This is partly because treatment selection is based on trial and error, with no possibility to predict an individual’s response to a certain treatment6. Non-response to initial pharmacological and psychotherapeutic interventions is highly prevalent, with treatment-resistant depression affecting 20–30% of depressed patients in current clinical practice7–9. The treatment of choice for patients who have not responded to pharmacological and psychotherapeutic treatments is electroconvulsive therapy (ECT), which produces remission in about 50% of therapy-resistant patients10,11. Furthermore, non-response can only be determined at least 4 weeks after initiation of pharmacotherapy, ECT requires 4–6 weeks on average, and effects of psychotherapy can take 16 weeks to manifest7,12. Consequently, patients are regularly exposed to multiple failed treatments and might spend months to years waiting for successful treatment. This stresses the need for markers that, before treatment commencement, can inform clinicians on the chance of responding to a particular treatment.

A large number of studies have correlated baseline clinical characteristics and biomarkers with MDD status and treatment outcome and have identified many factors that are associated with treatment success13. However, such descriptive analyses only provide inference at the group level and not at the level of the individual patient, which is required for clinical decision-making14. More recent studies have started to use machine learning analyses that aim to develop predictive models and that are tested using independent data15. More than correlational analyses, single-subject response prediction studies using machine learning might be able to redeem the promise of individualized psychiatry16. Without being explicitly pre-programmed, these algorithms (either linear or non-linear) are able to learn from aggregated data in a patient sample using multivariate pattern recognition, in order to provide the best prediction of an output variable17,18. In predictive modeling, machine learning could enable clinicians to judge the viability of treatments for individual patients. As such, it might increase treatment efficacy, decrease illness duration, and reduce MDD’s impact on the global burden of disease.

Multiple modalities have been considered for single-subject response prediction. A recent meta-analysis covering different markers found neuroimaging to be the most successful overall in predicting treatment response in depressed patients (i.e., more successful than phenomenological or genetic markers)19. However, that review pooled different treatments and neuroimaging modalities such as electroencephalography (EEG) and magnetic resonance imaging (MRI). Since it did not differentiate between prediction success of different neuroimaging techniques, the study offers little insight into treatment-specific biomarkers or specific (MRI) modalities. A recent meta-analysis on EEG for individual prediction of antidepressant treatment response found reasonable accuracy (72% sensitivity and 68% specificity) but concluded that EEG should not yet be used clinically as a prediction tool, since the generalizability and validity of the reported studies are limited20. However, no meta-analysis of prediction accuracy in antidepressive treatment has focused specifically on MRI, which may offer better predictive value than EEG.

The primary aim of the present study was to calculate the aggregate classification performance of predictive MRI biomarkers in patients with MDD using a bivariate random-effect model meta-analysis. We further investigated whether classification performance was influenced by intervention type (i.e., pharmacotherapy, psychotherapy, or ECT) or imaging modality (i.e., structural MRI (sMRI), resting-state functional MRI (fMRI), task-based fMRI, diffusion tensor imaging (DTI)).

Methods and materials

Inclusion and exclusion criteria

Two authors (S.E.C. and B.N.W.) included studies using any form of MRI (structural, resting-state, task-based, spectroscopy, DTI), which were conducted at baseline, i.e., within 4 weeks before the start of antidepressant treatment. Furthermore, inclusion criteria were an overarching definition of antidepressant treatment according to the current NICE guidelines and a non-selective patient population with MDD suffering from a current depressive episode. Studies that used feature selection based on in-sample data without validating prediction outcomes either internally (e.g., through cross-validation) or externally (through independent set validation) were excluded. Inclusion or exclusion conflicts were resolved by consensus or, if necessary, by authors J.B.Z. and G.A.v.W.

Search strategy

We conducted a search in EMBASE, Medline, PsycInfo, and Web of Science databases. Each database was searched from inception to January 2020. Furthermore, we searched the WHO International Clinical Trial Registry Platforms search portal for registered and unpublished studies, and we looked for “gray” literature such as abstracts and conference articles through conference websites and from other relevant sources. Additionally, we checked included articles for references and conducted citation screening. For a full account of our search strategy and inclusion criteria, see the Supplementary Material.

Data extraction

Two authors (S.E.C. and B.N.W.) independently extracted data from included studies, including the number of participants, patient population and depression severity subtype, treatment history, antidepressant intervention and outcome measures, response/remission rates, neuroimaging technique, brain region and feature selection, method of analysis, and validation strategy (see Table 1). From the included articles, we extracted the confusion table (a 2 × 2 table for correctly and incorrectly classified patients) for sensitivity or specificity. If these were not supplied, we computed the matrix from additional information in the article. If multiple studies analyzed the same patient sample, we used mean outcome measures based on these studies. If necessary, we contacted authors requesting additional information.
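When an article reported sensitivity, specificity, and group sizes rather than the confusion table itself, the 2 × 2 matrix can be recovered by rounding. A minimal sketch of that reconstruction step (the function name and the example numbers are ours, not from any included study):

```python
def confusion_from_metrics(sensitivity, specificity, n_pos, n_neg):
    """Reconstruct a 2x2 confusion table (TP, FN, TN, FP) from reported
    sensitivity/specificity and the number of responders (n_pos) and
    non-responders (n_neg)."""
    tp = round(sensitivity * n_pos)   # true positives among responders
    fn = n_pos - tp                   # responders the model missed
    tn = round(specificity * n_neg)   # true negatives among non-responders
    fp = n_neg - tn                   # non-responders flagged as responders
    return tp, fn, tn, fp

# Hypothetical study: 80% sensitivity, 75% specificity,
# 20 responders and 16 non-responders
print(confusion_from_metrics(0.80, 0.75, 20, 16))  # → (16, 4, 12, 4)
```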

Table 1.

Methodological summary of the studies.

| Study + year | n | Outcome | Intervention | Duration | Modality | Analysis | Validation |
|---|---|---|---|---|---|---|---|
| Costafreda et al. 2009—1 | 16 | Remission | CBT | 16 wk | tbfMRI | SVM | LOO CV |
| Siegle et al. 2012 | 12 | Response | CBT | 12 wk | tbfMRI | RF | Ind. replication |
| Queirazza et al. 2019 | 37 | Response | CBT | 6–10 wk | tbfMRI | SVM, LR | LOO nested CV |
| Van Waarde et al. 2015 | 45 | Remission | ECT | 10 wk | rsfMRI | lSVM | LOO CV |
| Moreno-Ortega et al. 2019 | 19 | Remission | ECT | ns | rsfMRI | LR | LOO CV |
| Sun et al. 2019 | 122 | Response + remission | ECT | 3–4 wk | rsfMRI | LR | LOO CV |
| Redlich et al. 2016 | 23 | Response | ECT | 3–8 wk | sMRI | lSVM/GPC | LOO CV |
| Wade et al. 2016 | 34 | Response | ECT | 2–7 wk | sMRI | RBFSVM | LOO CV |
| Cao et al. 2018 | 24 | Response + remission | ECT | 3–4 wk | sMRI | lSVM | LOO CV |
| Jiang et al. 2018 | 38 | Remission | ECT | 3–4 wk | sMRI | LR | 10-fold LOO CV + independent cohort rep |
| Wade et al. 2017 | 44a | Remission | ECT | ns | sMRI | RF | Nested CV |
| Leaver et al. 2017 | | Response | ECT | ns | rsfMRI, aslMRI | RBFSVM | 5-fold LOO CV |
| Drysdale et al. 2017 | 124 / 30 | Response | rTMS | 4–6 wk | rsfMRI | lSVM | LOO CV + ind. replication |
| Cash et al. 2019 | 33 | Remission | rTMS | 5–8 wk | rsfMRI | lSVM | LOO + k-fold CV |
| Costafreda et al. 2009—2 | 18a | Remission | SSRI | 8 wk | sMRI | lSVM | LOO CV |
| Nouretdinov et al. 2011 | | Remission | SSRI | 8 wk | sMRI | TCP | LOO CV |
| Gong et al. 2011 | 46 | Response | SSRI/TCA/SNRI | 12 wk | sMRI | lSVM | LOO CV |
| Marquand et al. 2008 | 20 | Response | SSRI | 8 wk | tbfMRI | lSVM | LOO CV |
| Godlewska et al. 2018 | 32 | Response | SSRI | 6 wk | tbfMRI | LR | LOO CV |
| Meyer et al. 2019 | 22 | Remission/non-response | SSRI | 8 wk | tbfMRI | LR | LOO CV |
| Karim et al. 2018 | 49 | Remission | SNRI | 12 wk | tbfMRI | LR | 10-fold LOO CV |
| Patel et al. 2015 | 19 | Remission | SSRI/SNRI | ns | rsfMRI, DTI, sMRI | ADTree/lSVM/RBFSVM/L1LR | Nested LOO CV |
| iSPOT trials | 77a | | SSRI/SNRI | 8 wk | | | |
| — Korgaonkar et al. 2014 | | Remission | | | DTI | LR | K-fold CV |
| — Williams et al. 2015 | | Response | | | tbfMRI | LDA | LOO CV |
| — Goldstein-Piekarski et al. 2016 | | Remission | | | tbfMRI | LR | 10-fold LOO CV |
| — Grieve et al. 2016 | | Non-remission | | | DTI | LR | Independent rep |
| — Goldstein-Piekarski et al. 2018 | | Remission | | | rsfMRI | LR | LOO CV |

Reported sample sizes were not necessarily equal in articles with overlapping sample.

SSRI selective serotonin reuptake inhibitor, TCA tricyclic antidepressant, SNRI serotonin-norepinephrine reuptake inhibitor, ECT electroconvulsive therapy, CBT cognitive behavioral therapy, rTMS repetitive transcranial magnetic stimulation, iTBS intermittent theta burst stimulation, AP antipsychotics, ns not specified, tb task based, rs resting state, asl arterial spin labeling, fMRI functional magnetic resonance imaging, sMRI structural magnetic resonance imaging, WB whole brain, ROI region of interest, DTI diffusion tensor imaging, lSVM linear support vector machine, RBF radial basis function, TCP transductive conformal predictor, LR logistic regression, LinR linear regression, LDA linear discriminant analysis, RF random forest, LOO CV leave-one-out cross-validation, wm white matter, sLR stepwise linear regression, beta-w beta-weights, LARS least-angle regression, PMVD proportional marginal decomposition.

a: n is a weighted average across studies.

Meta-analytic method

For quantitative analysis, we used confusion matrices to pool studies using Reitsma’s bivariate random effect model, as suggested in the Cochrane handbook for diagnostic tests of accuracy studies21,22. We used this method for computing our main outcomes, which were the overall area under the summary receiver operating characteristic (SROC) curve, sensitivity, and specificity, as well as sensitivity and specificity of intervention subsets. Additionally, we performed a separate bivariate regression for modalities (fMRI and sMRI) by including from each study both sMRI and fMRI, if provided in the original article or after our request for further information. As a post hoc analysis, we excluded DTI from this regression, and in the fMRI group, we subdivided resting-state and task-based modalities.
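The authors fit the bivariate Reitsma model with the mada package in R. As a rough illustration of the underlying idea only, the sketch below pools study-level sensitivities on the logit scale with a univariate DerSimonian–Laird random-effects model; this is a deliberate simplification, since the real bivariate model estimates sensitivity and specificity jointly with their correlation. The input proportions and sample sizes are invented example values.

```python
import math

def dersimonian_laird_pool(props, ns):
    """Pool per-study proportions (e.g., sensitivities) on the logit scale
    with a DerSimonian-Laird random-effects model. A simplified univariate
    stand-in for the bivariate Reitsma model used in the paper."""
    # logit transform with a 0.5 continuity correction against zero cells
    events = [p * n for p, n in zip(props, ns)]
    y = [math.log((e + 0.5) / (n - e + 0.5)) for e, n in zip(events, ns)]
    v = [1.0 / (e + 0.5) + 1.0 / (n - e + 0.5) for e, n in zip(events, ns)]
    w = [1.0 / vi for vi in v]                       # inverse-variance weights
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    # DerSimonian-Laird estimate of between-study variance tau^2
    q = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)
    w_re = [1.0 / (vi + tau2) for vi in v]           # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    return 1.0 / (1.0 + math.exp(-pooled))           # back to a proportion

# Three hypothetical studies with sensitivities 0.70, 0.80, 0.75
print(dersimonian_laird_pool([0.70, 0.80, 0.75], [40, 30, 50]))
```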

Heterogeneity and publication bias

To visualize between-study differences, we conducted a univariate random-effect forest plot of the diagnostic odds ratios (ORs), subdivided per treatment group. We identified clinical and statistical heterogeneity by visually assessing confidence interval (CI) overlap and by identifying outlying studies. We avoided using an objective measure of heterogeneity, since these have been shown to be inappropriately conservative for accuracy studies23. Rather, we used a random-effect model that assumes that our data was heterogeneous and set out to investigate potential sources of heterogeneity22. We did not perform any sensitivity analyses, as no studies were of such low quality, or were such outliers, that sensitivity analysis was appropriate. To assess sample size effects and possible publication bias, we used Deeks’ test, as recommended for diagnostic accuracy studies24,25. For assessing quality of the primary studies, we used the QUADAS-2 tool26. We pre-specified methods in the PROSPERO database for systematic reviews (registration number CRD42019137497). All analyses were conducted using the mada and metafor packages in R27–29.
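Deeks’ test regresses each study’s log diagnostic odds ratio against 1/√(effective sample size); a non-zero slope suggests small-study effects. The sketch below is a simplified, unweighted ordinary-least-squares version (the canonical test weights studies by effective sample size), and the confusion tables are invented examples.

```python
import math

def deeks_test(confusions):
    """Sketch of Deeks' funnel plot asymmetry test: regress ln(DOR) on
    1/sqrt(effective sample size). Each study is (tp, fn, tn, fp).
    Returns (slope, standard error); a slope clearly different from zero
    suggests small-study effects."""
    xs, ys = [], []
    for tp, fn, tn, fp in confusions:
        # 0.5 continuity correction guards against zero cells
        lndor = math.log(((tp + 0.5) * (tn + 0.5)) / ((fn + 0.5) * (fp + 0.5)))
        n_pos, n_neg = tp + fn, tn + fp
        ess = 4.0 * n_pos * n_neg / (n_pos + n_neg)  # effective sample size
        xs.append(1.0 / math.sqrt(ess))
        ys.append(lndor)
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    intercept = ybar - slope * xbar
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    se = math.sqrt(sum(r ** 2 for r in resid) / (n - 2) / sxx)
    return slope, se

# Hypothetical studies where smaller samples report larger effects
studies = [(50, 10, 45, 15), (30, 6, 28, 8), (15, 2, 14, 3)]
slope, se = deeks_test(studies)
```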

Results

Search results

Our search yielded 5824 hits, 168 of which were included for full-text review (see Fig. 1). After contacting the authors for additional information, we excluded 21 studies for not reporting data necessary for reconstructing a confusion matrix, all of which were “gray literature”, i.e., abstracts or conference summary articles. Furthermore, we excluded 11 articles for not reporting any form of validation of their prediction model. After excluding non-eligible studies and adding, through citation searching, 2 eligible studies that did not appear in the search hits, 27 studies remained30–56.

Fig. 1. Flow diagram of the study inclusion process.


n number.

Description of the study characteristics

We included 27 studies with an accumulated number of 957 unique patients and a mean sample size of 44 per study, with a median of 33 (see Table 1 for a full methodological summary; see Supplementary Table 1 for an overview of patient characteristics and study demographics). Three patient samples were used in more than one article30,32,40,41,51–55.

Of the included studies, 50% used some form of pharmacotherapeutic intervention (total n = 283), all of which administered a clinically viable dosage, with response time varying from 2 weeks (early response) to 12 weeks. Only one study did not use selective serotonin reuptake inhibitors (SSRIs), instead using a serotonin-norepinephrine reuptake inhibitor (SNRI)49. Three studies used either an SSRI or SNRI, and one of these three chose a tricyclic antidepressant as a third treatment option45,50,57. ECT was administered in 35% of studies (total n = 285), 8% used transcranial magnetic stimulation, and 8% used cognitive therapy. Most studies used either sMRI (31%) or task-based fMRI (31%), most often using emotional stimuli, 19% used resting-state fMRI, and 8% used DTI. Two studies combined multiple modalities40,50.

As machine learning paradigm, 31% of studies used support vector machines (SVMs) for data analysis, while 28% used logistic regression. After comparing classification accuracy across multiple algorithms (among others, SVM and random forest), Patel and colleagues used an alternating decision tree method50. For validation, 85% used leave-one-out cross-validation. Two studies used an independent cohort to validate their results, while one study first cross-validated classification results and then validated its prediction model in two small, independent cohorts, achieving similar results39,43,53. For additional information on approaches to imaging analysis, please refer to Supplementary Table 2.
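The dominant design among the included studies — a linear SVM evaluated with leave-one-out cross-validation — can be sketched in a few lines with scikit-learn. The features here are synthetic random numbers standing in for MRI-derived inputs, with an artificial group difference injected; nothing below reproduces any included study.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

# Synthetic stand-in for MRI-derived features: 40 "patients", 50 features,
# with responders (label 1) shifted so the classes are separable.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 50))
y = np.array([0] * 20 + [1] * 20)
X[y == 1] += 0.8  # inject signal for the responder class

# Leave-one-out cross-validation of a linear SVM, as most included studies did
correct = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = SVC(kernel="linear").fit(X[train_idx], y[train_idx])
    correct += int(clf.predict(X[test_idx])[0] == y[test_idx][0])
accuracy = correct / len(y)
```

Because each left-out patient is classified by a model trained on the remaining 39, the estimate avoids the in-sample inflation of testing on the training set — the same motivation behind the review's exclusion criterion.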

Meta-analysis

General outcome

After pooling results from studies with overlapping patient samples, we quantitatively analyzed 22 samples, including one independent cohort replication that we have interpreted as a separate study43. For all imaging modalities and interventions taken together, the meta-analytic estimate for the SROC AUC was 0.84 (95% CI 0.81–0.87), with 77% sensitivity (95% CI 71–82) and 79% specificity (95% CI 73–84), amounting to a moderately high classification performance (see Fig. 2).

Fig. 2. Overall accuracy measures: area under the curve 0.84 (95% CI 0.81–0.87), sensitivity 77% (95% CI 71–82), specificity 79% (95% CI 73–84).


Reitsma bivariate SROC model of the receiver operating characteristic curve. Summary of sensitivity and false-positive rate (1 − specificity) is indicated in black, sensitivity and false-positive rates for different interventions are gray-scale. ECT electroconvulsive therapy, rTMS repetitive transcranial magnetic stimulation, pharmacological pharmacotherapeutic antidepressive interventions.

Intervention differences

Sensitivity and specificity of ECT interventions were 80% (95% CI 73–85) and 83% (95% CI 72–90), respectively, compared to 75% (95% CI 68–82) and 72% (95% CI 64–80) for antidepressant medication. Excluding the studies that did not use an SSRI as pharmacological agent had little influence on the results49. Although prediction outcomes in ECT studies show a trend toward higher classification performance, CIs overlapped (see Table 2). With only a few primary studies each, sensitivity and specificity were 84% (95% CI 68–92) and 72% (95% CI 39–92) for psychotherapy, and 79% (95% CI 71–86) and 82% (95% CI 74–88) for repetitive transcranial magnetic stimulation (rTMS).

Table 2.

Summary estimates of sensitivity/specificity for different interventions.

Intervention group Sensitivity 95% CI Specificity 95% CI
Combined 77% 71–82 79% 73–84
Medication 75% 68–82 73% 64–80
ECT 80% 73–85 83% 72–90
Psychotherapy 84% 68–92 72% 39–92
rTMS 79% 71–86 82% 74–88

CI confidence interval, rTMS repetitive transcranial magnetic stimulation, ECT electroconvulsive therapy.

Modality differences

In order to assess whether sMRI studies yielded different performance measures compared to fMRI studies, we performed random-effect meta-regression for modality subtypes. When comparing fMRI and sMRI, z-regression values for sensitivities and specificities were non-significant, suggesting that prediction success for structural or functional neuroimaging did not differ between studies (see Table 3). Post hoc analysis excluding DTI and subdividing task-based and resting-state fMRI did not alter the results.

Table 3.

Bivariate random-effect meta-regression z-scores for modality as covariate.

Point estimate Standard error 95% Lower 95% Upper z-value p Value
Sensitivity 0.221 0.233 −0.236 0.677 0.948 0.343
Specificity 0.217 0.252 −0.277 0.711 0.861 0.389

p Values for both sensitivity and specificity >0.05, i.e., z-score differences for functional and structural MRI are non-significant.
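The meta-regression itself was run with the mada package in R. As a rough, hypothetical illustration of how such a z-test behaves, the sketch below compares two pooled proportions on the logit scale; the estimates and standard errors are invented example values, not the paper's.

```python
import math

def logit_z_test(p1, se1, p2, se2):
    """Approximate z-test for a difference between two pooled estimates on
    the logit scale (a simplified analogue of the bivariate meta-regression
    in Table 3). p1/p2 are pooled proportions, se1/se2 their logit-scale
    standard errors."""
    d = math.log(p1 / (1 - p1)) - math.log(p2 / (1 - p2))
    z = d / math.sqrt(se1 ** 2 + se2 ** 2)
    # two-sided p-value from the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Two hypothetical subgroup sensitivities that barely differ
z, p = logit_z_test(0.77, 0.15, 0.75, 0.15)  # small z, p well above 0.05
```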

Quality assessment

Three studies included only late-life depression, which reduces applicability in the general MDD population (see Supplementary Fig. 1 and Supplementary Table 3). In terms of flow and timing, drop-outs were a common issue, with 10 studies having a drop-out rate of ≥30%, while 11 studies did not clarify drop-outs, possibly leading to attrition bias. Furthermore, two studies adapted the definition of response to create an even split in responders/non-responders, causing applicability concerns45,48. One study did not pre-specify the pharmacological intervention50.

Heterogeneity and publication bias

The univariate forest plot of diagnostic performance (in ln OR) showed considerable overlap in CIs between studies with different ORs, indicating that heterogeneity might be caused by sample variance (see Fig. 3)23. As described in the study description above, inter-study differences were present in population, modalities, intervention type, response/remission definition, feature selection, and analysis technique. Deeks’ funnel plot asymmetry test showed study size and diagnostic OR to be inversely related (p = 0.044; see Supplementary Fig. 2), indicating that classification performance was lower in studies with larger samples. Inspection of the gray literature that was excluded because it lacked the information needed to construct a confusion matrix (all of which were conference/poster abstracts) showed sample sizes (22 studies, mean n = 56) and accuracies (ranging from 73 to 95%) comparable to the included studies. For an overview of gray literature results, see Supplementary Table 4.

Fig. 3. Univariate random-effect forest plot of natural logarithm of diagnostic odds ratios.


Summary estimates for odds ratios are computed assuming normal distribution. CI confidence interval, rTMS repetitive transcranial magnetic stimulation, ECT electroconvulsive therapy.

Discussion

Our results show that machine learning analysis of MRI data can predict antidepressive treatment success with an AUC of 0.84, 77% sensitivity, and 79% specificity (Fig. 2). Furthermore, we did not find a difference in classification performance between studies using pharmacotherapy and ECT. Although ECT showed somewhat higher sensitivity and specificity, CIs largely overlapped between the two intervention types (Table 2). There were few primary studies for psychotherapy and rTMS, which also show overlapping CIs. In addition, classification performance of sMRI and fMRI did not differ significantly (Table 3).

To our knowledge, this is the first meta-analysis specifically examining MRI for predicting treatment effects in depression. The overall classification performance is comparable to that reported by Lee et al., who found a general accuracy of 85% when combining the results for different neuroimaging modalities (defined as EEG, computed tomography, positron emission tomography, or MRI)58. Those results were, however, based on a total of 8 MRI studies, whereas our search resulted in 22 individual studies for analysis. This is partly due to the time gap between the two reviews, which underscores the rapid development in this research area. Our results show that MRI prediction studies perform somewhat better than EEG (AUC of 0.76) and comparable to the accuracy of diagnostic classification studies with MRI that distinguish depressed patients from healthy controls20,59. In contrast to the review of EEG studies, we excluded studies that tested their model on the training set, which increased generalizability of our sample and avoided presenting inflated accuracy results.

Clinical practice would require different prediction approaches for a broad range of specific settings. It would be useful to have a single predictive test for therapy-resistant patients, especially to guide decision-making for invasive treatments such as ECT. For example, ECT is associated with cognitive side effects that are preferably avoided in case the treatment is unsuccessful60. In addition, ECT is applied in only 1–2% of patients with persistent or severe depression, and a biomarker indicating a high probability of success may reduce hesitance about its use61. However, for most treatments, a differential biomarker would be preferable, which would enable selecting the treatment with the highest chance of success. As of yet, no MRI study has used such prospective prediction and subsequent treatment matching to guide decision-making between two treatment options (for instance, between cognitive behavioral therapy and an SSRI). Furthermore, no studies have yet compared efficacy of prediction-guided treatments versus regular treatment based on patient–clinician preference. Thus, although the predictive performance of MRI biomarkers is certainly promising, the current study designs do not yet enable the translation of research findings to the clinic.

Generally, studies were of acceptable quality, although drop-out rates could raise concerns about reliability. Drop-out rates were not mentioned in 11 studies, and in 10 studies drop-out rates were >30% without an intention-to-diagnose approach. Not accounting for drop-outs, who might be less likely to respond to treatment, could inflate response/remission data and consequently alter sensitivity and specificity of the predictive test. Additionally, our results show between-study variability regarding the response criterion, which typically consisted of clinical response (≥50% symptom reduction) or symptom remission. Different clinical settings might require different prediction outcomes. For instance, one could expect treatment of a first-time depressive episode to lead to complete remission, while in severe treatment-resistant depression, response might be a more practical and achievable goal62. Authors should take care to pre-specify which outcome they will use and why that outcome is the most appropriate for their sample or intervention.
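The two outcome definitions discussed above can be made concrete in a few lines. The ≥50% reduction criterion for response follows the included studies; the absolute remission cutoff of 7 is a common HAM-D convention, but cutoffs vary by rating scale, so it is shown here only as an illustrative default.

```python
def classify_outcome(baseline, endpoint, remission_cutoff=7):
    """Label a treatment outcome from depression-scale scores.
    Response: >=50% symptom reduction from baseline; remission: endpoint
    at or below an absolute cutoff (7 is a common HAM-D criterion,
    but the appropriate cutoff depends on the scale used)."""
    response = endpoint <= 0.5 * baseline
    remission = endpoint <= remission_cutoff
    return {"response": response, "remission": remission}

# A patient improving from 24 to 10: a responder, but not remitted
print(classify_outcome(24, 10))  # → {'response': True, 'remission': False}
```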

Furthermore, although no objective investigation for clinical heterogeneity in prediction studies exists, our random-effect forest plot shows considerable overlap of CIs with differing study results, implying the presence of sampling variation (Fig. 3)22. Clinical variance between samples is an important obstacle to the generalizability of any diagnostic or predictive marker, especially in psychiatric illnesses such as MDD, which is heterogeneous in both its clinical and neurophysiological manifestations63,64. Thus, inter-sample diversity of inclusion criteria and methodological design might hamper the realization of a reliable predictive biomarker.

In the current literature on diagnostic accuracy studies, the role of selective publication as a source of bias is still under debate25,65. Common formal tests for publication bias, such as Egger’s or Begg’s test, are not recommended for meta-analyses of prediction studies, since their sensitivity in diagnostic accuracy studies is generally poor23. However, the recommended Deeks’ funnel plot asymmetry test (see Supplementary Fig. 2) shows the presence of a sample size effect, with a study’s n being negatively correlated with classification performance, which could be attributable to publication bias66. Another explanation for this significant correlation might be that large-scale studies with large samples are more likely to consist of heterogeneous patient groups, which in turn reduces prediction accuracy67. As a further exploration of publication bias, our search also took gray literature into account, which suggested that publication (or positive result) bias was absent. In conclusion, quantitative testing could not distinguish between a real effect (due to accuracy reduction in large heterogeneous samples) and publication bias. Although the gray literature makes its presence less likely, we cannot exclude publication bias.

The following limitations warrant further discussion. First, we did not find modality differences, but studies conducting fMRI research might have also attempted prediction with (less time-consuming and cheaper) sMRI, which remained unpublished. Although we did contact authors for additional information, response was poor, so we were unable to rule out reporting bias for modality differences. We would advise authors of future studies to publish non-significant results as well as significant but less accurate results, since both are potentially useful in comparing the merits of different modalities. Second, the number of studies predicting psychotherapy outcome, specifically cognitive therapy, was low, resulting in a blind spot for a commonly deployed treatment in MDD68. Third, cross-validation in small samples results in large variation of the estimated accuracy, and as indicated above, accuracy reduces with larger sample heterogeneity67,69. Since the mean sample size of our studies was 44 (with a median n of 33), the reported results may be optimistic because of overfitting. Overfitting is a cause for concern specifically in MRI studies, with relatively small sample sizes and large amounts of fitted data70. Furthermore, characteristics of the test set during cross-validation will approximate the characteristics of the training set more than when tested in the general population, due to selection bias71. Only two included studies replicated their training data in an independent cohort, and one included study used an out-of-sample cohort to further test their cross-validated results, leaving open the question to what extent the majority of results can be generalized to new patients.
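The instability of small-sample cross-validation is easy to demonstrate by simulation. The sketch below repeatedly runs leave-one-out validation of a simple nearest-centroid classifier (our stand-in, not a method from the included studies) on pure-noise data with 20 "patients": even with no real signal, the accuracy estimates scatter widely across repetitions.

```python
import numpy as np

def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    hits = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        Xtr, ytr = X[mask], y[mask]
        c0 = Xtr[ytr == 0].mean(axis=0)  # centroid of class 0
        c1 = Xtr[ytr == 1].mean(axis=0)  # centroid of class 1
        pred = int(np.linalg.norm(X[i] - c1) < np.linalg.norm(X[i] - c0))
        hits += pred == y[i]
    return hits / len(y)

# Null data (no signal relating features to labels): the spread of LOO
# estimates across repetitions shows how unstable small-sample accuracy is.
rng = np.random.default_rng(1)
labels = np.array([0] * 10 + [1] * 10)
accs = [loo_accuracy(rng.normal(size=(20, 30)), labels) for _ in range(50)]
spread = max(accs) - min(accs)
```

With 20 patients, a single reclassified case moves the estimate by 5 percentage points, which is why the small median sample (n = 33) in this review warrants caution.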

In order to optimize patient care, reduce treatment resistance, and shorten duration of illness, developing models that predict treatment success at the individual-patient level is an urgent task. In a 2012 consensus report on diagnostic imaging markers in psychiatry, the American Psychiatric Association research council proposed 80% sensitivity and specificity as a prerequisite for the clinical application of a biomarker72. Furthermore, biomarkers should ideally be reliable, reproducible, non-invasive, simple to perform, and inexpensive. The results for an ECT biomarker fulfilled the 80% criterion, but the results for a medication biomarker fell short. By these criteria, however, reproducibility in particular has not yet been sufficiently established, given the small sample sizes and external validation in only a minority of studies. This precludes recommending MRI for treatment response prediction in clinical practice at this point. Future multicenter studies with large patient samples that represent clinical heterogeneity are required to warrant MRI biomarker generalizability73. However, one might question whether excellent generalizability is a goal that should be aimed for: if each clinical site were to develop its own locally reliable and replicable biomarker that incorporates the local hardware, patient, and treatment variability, the predictive accuracy is expected to be higher than when all potential sources of heterogeneity are accounted for67,74. Standard machine learning analysis would then mean a departure from the traditional universalist paradigm in diagnostics and instead initiate a shift to a paradigm of localization: heterogeneous yet locally applicable classification models. This would make it possible to retrain predictive models for even better performance as more data accumulate after biomarker deployment, and to benefit from, rather than be hindered by, (inevitable) hardware upgrades, such as the higher signal-to-noise ratio of new generations of MR scanners and coils.

In conclusion, prediction of treatment success using machine learning analysis of MRI data holds promise but has not transcended the research status and should not yet be implemented into clinical practice. Once it overcomes the aforementioned hurdles, MRI may become a clinical decision support tool aimed to reduce unsuccessful treatments and improve treatment efficacy and efficiency.

Supplementary information

Supplemental material (278.3KB, pdf)

Acknowledgements

We thank Joost Daams, clinical librarian at the Amsterdam UMC hospital, for his excellent help in defining our search terms and in running the search. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Conflict of interest

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

The online version contains supplementary material available at 10.1038/s41398-021-01286-x.

References

1. Whiteford HA, et al. Global burden of disease attributable to mental and substance use disorders: findings from the Global Burden of Disease Study 2010. Lancet. 2013;382:1575–1586. doi: 10.1016/S0140-6736(13)61611-6.
2. Kawakami N, et al. Early-life mental disorders and adult household income in the World Mental Health Surveys. Biol. Psychiatry. 2012;72:228–237. doi: 10.1016/j.biopsych.2012.03.009.
3. Kessler RC, Bromet EJ. The epidemiology of depression across cultures. Annu. Rev. Public Health. 2013;34:119–138. doi: 10.1146/annurev-publhealth-031912-114409.
4. Hasin DS, et al. Epidemiology of adult DSM-5 major depressive disorder and its specifiers in the United States. JAMA Psychiatry. 2018;75:336–346. doi: 10.1001/jamapsychiatry.2017.4602.
5. Herrman H, et al. Reducing the global burden of depression: a Lancet–World Psychiatric Association Commission. Lancet. 2019;393:e42–e43. doi: 10.1016/S0140-6736(18)32408-5.
6. Gelenberg AJ, et al. American Psychiatric Association Practice Guideline for the Treatment of Patients With Major Depressive Disorder, Third Edition. Am. J. Psychiatry. 2010;167:167.
7. Pigott HE, Leventhal AM, Alter GS, Boren JJ. Efficacy and effectiveness of antidepressants: current status of research. Psychother. Psychosom. 2010;79:267–279. doi: 10.1159/000318293.
8. Loerinc AG, et al. Response rates for CBT for anxiety disorders: need for standardized criteria. Clin. Psychol. Rev. 2015;42:72–82. doi: 10.1016/j.cpr.2015.08.004.
9. John Rush A, et al. Acute and longer-term outcomes in depressed outpatients requiring one or several treatment steps: a STAR*D report. Am. J. Psychiatry. 2006;163:1905–1917. doi: 10.1176/ajp.2006.163.11.1905.
10. Heijnen WT, Birkenhager TK, Wierdsma AI, van den Broek WW. Antidepressant pharmacotherapy failure and response to subsequent electroconvulsive therapy: a meta-analysis. J. Clin. Psychopharmacol. 2010;30:616–619. doi: 10.1097/JCP.0b013e3181ee0f5f.
11. Kellner CH, et al. ECT in treatment-resistant depression. Am. J. Psychiatry. 2012;169:1238–1244. doi: 10.1176/appi.ajp.2012.12050648.
12. McIntyre RS, et al. Treatment-resistant depression: definitions, review of the evidence, and algorithmic approach. J. Affect. Disord. 2014;156:1–7. doi: 10.1016/j.jad.2013.10.043.
13. Kennis M, et al. Prospective biomarkers of major depressive disorder: a systematic review and meta-analysis. Mol. Psychiatry. 2020;25:321–338. doi: 10.1038/s41380-019-0585-z.
14. Ozomaro U, Wahlestedt C, Nemeroff CB. Personalized medicine in psychiatry: problems and promises. BMC Med. 2013;11:132. doi: 10.1186/1741-7015-11-132.
15. Perlman K, et al. A systematic meta-review of predictors of antidepressant treatment outcome in major depressive disorder. J. Affect. Disord. 2019;243:503–515. doi: 10.1016/j.jad.2018.09.067.
16. Bzdok D, Meyer-Lindenberg A. Machine learning for precision psychiatry: opportunities and challenges. Biol. Psychiatry Cogn. Neurosci. Neuroimaging. 2018;3:223–230. doi: 10.1016/j.bpsc.2017.11.007.
17. Yahata N, Kasai K, Kawato M. Computational neuroscience approach to biomarkers and treatments for mental disorders. Psychiatry Clin. Neurosci. 2017;71:215–237. doi: 10.1111/pcn.12502.
18. Deo RC. Machine learning in medicine. Circulation. 2015;132:1920–1930. doi: 10.1161/CIRCULATIONAHA.115.001593.
19. Lee Y, et al. Corrigendum to "Applications of machine learning algorithms to predict therapeutic outcomes in depression: a meta-analysis and systematic review" [J. Affect. Disord. 241 (2018) 519–532]. J. Affect. Disord. 2020;274:1211–1215.
20. Widge AS, et al. Electroencephalographic biomarkers for treatment response prediction in major depressive illness: a meta-analysis. Am. J. Psychiatry. 2019;176:44–56. doi: 10.1176/appi.ajp.2018.17121358.
21. Reitsma JB, et al. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J. Clin. Epidemiol. 2005;58:982–990. doi: 10.1016/j.jclinepi.2005.02.022.
22. Macaskill P, Gatsonis C, Deeks JJ, Harbord RM, Takwoingi Y. In: Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy Version 1.0 (eds Deeks JJ, Bossuyt PM, Gatsonis C) Ch. 10 (The Cochrane Collaboration, 2010).
23. Bossuyt PDC, Deeks J, Hyde C, Leeflang M, Scholten R (eds). Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy Version 1.0 (The Cochrane Collaboration, 2010).
24. Deeks JJ, Macaskill P, Irwig L. The performance of tests of publication bias and other sample size effects in systematic reviews of diagnostic test accuracy was assessed. J. Clin. Epidemiol. 2005;58:882–893. doi: 10.1016/j.jclinepi.2005.01.016.
25. van Enst WA, Ochodo E, Scholten RJPM, Hooft L, Leeflang MM. Investigation of publication bias in meta-analyses of diagnostic test accuracy: a meta-epidemiological study. BMC Med. Res. Methodol. 2014;14:70. doi: 10.1186/1471-2288-14-70.
26. Whiting PF, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann. Intern. Med. 2011;155:529–536. doi: 10.7326/0003-4819-155-8-201110180-00009.
27. Viechtbauer W. Conducting meta-analyses in R with the metafor package. J. Stat. Softw. 2010;36:1–48. doi: 10.18637/jss.v036.i03.
28. Doebler P. mada: meta-analysis of diagnostic accuracy. R package version 0.5.10. https://rdrr.io/rforge/mada/ (2020).
29. R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2019).
30. Costafreda SG, Chu C, Ashburner J, Fu CH. Prognostic and diagnostic potential of the structural neuroanatomy of depression. PLoS ONE. 2009;4:e6353. doi: 10.1371/journal.pone.0006353.
31. Costafreda SG, Khanna A, Mourao-Miranda J, Fu CH. Neural correlates of sad faces predict clinical remission to cognitive behavioural therapy in depression. Neuroreport. 2009;20:637–641. doi: 10.1097/WNR.0b013e3283294159.
32. Nouretdinov I, et al. Machine learning classification with confidence: application of transductive conformal predictors to MRI-based diagnostic and prognostic markers in depression. Neuroimage. 2011;56:809–813. doi: 10.1016/j.neuroimage.2010.05.023.
33. Siegle GJ, et al. Toward clinically useful neuroimaging in depression treatment: prognostic utility of subgenual cingulate activity for determining depression outcome in cognitive therapy across studies, scanners, and patient characteristics. Arch. Gen. Psychiatry. 2012;69:913–924. doi: 10.1001/archgenpsychiatry.2012.65.
34. Queirazza F, Fouragnan E, Steele JD, Cavanagh J, Philiastides MG. Neural correlates of weighted reward prediction error during reinforcement learning classify response to cognitive behavioral therapy in depression. Sci. Adv. 2019;5:eaav4962. doi: 10.1126/sciadv.aav4962.
35. van Waarde JA, et al. A functional MRI marker may predict the outcome of electroconvulsive therapy in severe and treatment-resistant depression. Mol. Psychiatry. 2015;20:609–614. doi: 10.1038/mp.2014.78.
36. Moreno-Ortega M, et al. Resting state functional connectivity predictors of treatment response to electroconvulsive therapy in depression. Sci. Rep. 2019;9:5071. doi: 10.1038/s41598-019-41175-4.
37. Sun H, et al. Preliminary prediction of individual response to electroconvulsive therapy using whole-brain functional magnetic resonance imaging data. Neuroimage Clin. 2020;26:102080.
38. Redlich R, et al. Prediction of individual response to electroconvulsive therapy via machine learning on structural magnetic resonance imaging data. JAMA Psychiatry. 2016;73:557–564. doi: 10.1001/jamapsychiatry.2016.0316.
39. Jiang R, et al. SMRI biomarkers predict electroconvulsive treatment outcomes: accuracy with independent data sets. Neuropsychopharmacology. 2017;43:1078. doi: 10.1038/npp.2017.165.
40. Leaver AM, et al. Fronto-temporal connectivity predicts ECT outcome in major depression. Front. Psychiatry. 2018;9:92. doi: 10.3389/fpsyt.2018.00092.
41. Wade BSC, et al. Data-driven cluster selection for subcortical shape and cortical thickness predicts recovery from depressive symptoms. Proc. IEEE Int. Symp. Biomed. Imaging. 2017;2017:502–506. doi: 10.1109/ISBI.2017.7950570.
42. Cao B, et al. Predicting individual responses to the electroconvulsive therapy with hippocampal subfield volumes in major depression disorder. Sci. Rep. 2018;8:5434. doi: 10.1038/s41598-018-23685-9.
43. Drysdale AT, et al. Resting-state connectivity biomarkers define neurophysiological subtypes of depression. Nat. Med. 2017;23:28–38. doi: 10.1038/nm.4246.
44. Cash RFH, et al. A multivariate neuroimaging biomarker of individual outcome to transcranial magnetic stimulation in depression. Hum. Brain Mapp. 2019;40:4618–4629. doi: 10.1002/hbm.24725.
45. Gong Q, et al. Prognostic prediction of therapeutic response in depression using high-field MR imaging. Neuroimage. 2011;55:1497–1503. doi: 10.1016/j.neuroimage.2010.11.079.
46. Marquand AF, Mourao-Miranda J, Brammer MJ, Cleare AJ, Fu CH. Neuroanatomy of verbal working memory as a diagnostic biomarker for depression. Neuroreport. 2008;19:1507–1511. doi: 10.1097/WNR.0b013e328310425e.
47. Godlewska BR, et al. Predicting treatment response in depression: the role of anterior cingulate cortex. Int. J. Neuropsychopharmacol. 2018;21:988–996. doi: 10.1093/ijnp/pyy069.
48. Meyer BM, et al. Prefrontal networks dynamically related to recovery from major depressive disorder: a longitudinal pharmacological fMRI study. Transl. Psychiatry. 2019;9:64. doi: 10.1038/s41398-019-0395-8.
49. Karim HT, et al. Acute trajectories of neural activation predict remission to pharmacotherapy in late-life depression. Neuroimage Clin. 2018;19:831–839. doi: 10.1016/j.nicl.2018.06.006.
50. Patel MJ, et al. Machine learning approaches for integrating clinical and imaging features in late-life depression classification and response prediction. Int. J. Geriatr. Psychiatry. 2015;30:1056–1067. doi: 10.1002/gps.4262.
51. Goldstein-Piekarski AN, et al. Human amygdala engagement moderated by early life stress exposure is a biobehavioral target for predicting recovery on antidepressants. Proc. Natl Acad. Sci. USA. 2016;113:11955–11960. doi: 10.1073/pnas.1606671113.
52. Goldstein-Piekarski AN, et al. Intrinsic functional connectivity predicts remission on antidepressants: a randomized controlled trial to identify clinically applicable imaging biomarkers. Transl. Psychiatry. 2018;8:57. doi: 10.1038/s41398-018-0100-3.
53. Grieve SM, Korgaonkar MS, Gordon E, Williams LM, Rush AJ. Prediction of nonremission to antidepressant therapy using diffusion tensor imaging. J. Clin. Psychiatry. 2016;77:e436–e443. doi: 10.4088/JCP.14m09577.
54. Korgaonkar MS, Williams LM, Song YJ, Usherwood T, Grieve SM. Diffusion tensor imaging predictors of treatment outcomes in major depressive disorder. Br. J. Psychiatry. 2014;205:321–328. doi: 10.1192/bjp.bp.113.140376.
55. Williams LM, et al. Amygdala reactivity to emotional faces in the prediction of general and medication-specific responses to antidepressant treatment in the randomized iSPOT-D trial. Neuropsychopharmacology. 2015;40:2398–2408. doi: 10.1038/npp.2015.89.
56. Wade BS, et al. Effect of electroconvulsive therapy on striatal morphometry in major depressive disorder. Neuropsychopharmacology. 2016;41:2481–2491. doi: 10.1038/npp.2016.48.
57. Williams LM, et al. International study to predict optimized treatment for depression (iSPOT-D), a randomized clinical trial: rationale and protocol. Trials. 2011;12:4. doi: 10.1186/1745-6215-12-4.
58. Lee Y, et al. Applications of machine learning algorithms to predict therapeutic outcomes in depression: a meta-analysis and systematic review. J. Affect. Disord. 2018;241:519–532. doi: 10.1016/j.jad.2018.08.073.
59. Kambeitz J, et al. Detecting neuroimaging biomarkers for depression: a meta-analysis of multivariate pattern recognition studies. Biol. Psychiatry. 2017;82:330–338. doi: 10.1016/j.biopsych.2016.10.028.
60. Semkovska M, McLoughlin DM. Objective cognitive performance associated with electroconvulsive therapy for depression: a systematic review and meta-analysis. Biol. Psychiatry. 2010;68:568–577. doi: 10.1016/j.biopsych.2010.06.009.
61. Slade EP, Jahn DR, Regenold WT, Case BG. Association of electroconvulsive therapy with psychiatric readmissions in US hospitals. JAMA Psychiatry. 2017;74:798–804. doi: 10.1001/jamapsychiatry.2017.1378.
62. Rush AJ, et al. Report by the ACNP Task Force on Response and Remission in Major Depressive Disorder. Neuropsychopharmacology. 2006;31:1841–1853. doi: 10.1038/sj.npp.1301131.
63. Fried EI. Moving forward: how depression heterogeneity hinders progress in treatment and research. Expert Rev. Neurother. 2017;17:423–425. doi: 10.1080/14737175.2017.1307737.
64. Dinga R, et al. Evaluating the evidence for biotypes of depression: methodological replication and extension of Drysdale et al. Neuroimage Clin. 2019;22:101796. doi: 10.1016/j.nicl.2019.101796.
65. Murad MH, Chu H, Lin L, Wang Z. The effect of publication bias magnitude and direction on the certainty in evidence. BMJ Evid. Based Med. 2018;23:84. doi: 10.1136/bmjebm-2018-110891.
66. Leeflang MMG. Systematic reviews and meta-analyses of diagnostic test accuracy. Clin. Microbiol. Infect. 2014;20:105–113. doi: 10.1111/1469-0691.12474.
67. Schnack HG, Kahn RS. Detecting neuroimaging biomarkers for psychiatric disorders: sample size matters. Front. Psychiatry. 2016;7:50. doi: 10.3389/fpsyt.2016.00050.
68. Widnall E, Price A, Trompetter H, Dunn BD. Routine cognitive behavioural therapy for anxiety and depression is more effective at repairing symptoms of psychopathology than enhancing wellbeing. Cogn. Ther. Res. 2020;44:28–39. doi: 10.1007/s10608-019-10041-y.
69. Varoquaux G. Cross-validation failure: small sample sizes lead to large error bars. Neuroimage. 2018;180:68–77. doi: 10.1016/j.neuroimage.2017.06.061.
70. Mateos-Pérez JM, et al. Structural neuroimaging as clinical predictor: a review of machine learning applications. Neuroimage Clin. 2018;20:506–522. doi: 10.1016/j.nicl.2018.08.019.
71. Schmidt RL, Factor RE. Understanding sources of bias in diagnostic accuracy studies. Arch. Pathol. Lab. Med. 2013;137:558–565. doi: 10.5858/arpa.2012-0198-RA.
72. First MB, et al. Consensus Report of the APA Work Group on Neuroimaging Markers of Psychiatric Disorders. APA Council on Research Consensus Paper (APA, 2012).
73. Woo C-W, Chang LJ, Lindquist MA, Wager TD. Building better biomarkers: brain models in translational neuroimaging. Nat. Neurosci. 2017;20:365. doi: 10.1038/nn.4478.
74. Dluhos P, et al. Multi-center machine learning in imaging psychiatry: a meta-model approach. Neuroimage. 2017;155:10–24. doi: 10.1016/j.neuroimage.2017.03.027.
