Abstract
To bring biomarkers closer to clinical application, they should be generalizable, reliable, and maintain performance within the constraints of routine clinical conditions. The functional striatal abnormalities (FSA), is among the most advanced neuroimaging biomarkers in schizophrenia, trained to discriminate diagnosis, with post-hoc analyses indicating prognostic properties. Here, we attempt to replicate its diagnostic capabilities measured by the area under the curve (AUC) in receiver operator characteristic curves discriminating individuals with psychosis (n = 101) from healthy controls (n = 51) in the Human Connectome Project for Early Psychosis. We also measured the test-retest (run 1 vs 2) and phase encoding direction (i.e., AP vs PA) reliability with intraclass correlation coefficients (ICC). Additionally, we measured effects of scan length on classification accuracy (i.e., AUCs) and reliability (i.e., ICCs). Finally, we tested the prognostic capability of the FSA by the correlation between baseline scores and symptom improvement over 12 weeks of antipsychotic treatment in a separate cohort (n = 97). Similar analyses were conducted for the Yeo networks intrinsic connectivity as a reference. The FSA had good/excellent diagnostic discrimination (AUC = 75.4%, 95% CI = 67.0–83.3%; in non-affective psychosis AUC = 80.5%, 95% CI = 72.1–88.0%, and in affective psychosis AUC = 58.7%, 95% CI = 44.2–72.0%). Test-retest reliability ranged between ICC = 0.48 (95% CI = 0.35–0.59) and ICC = 0.22 (95% CI = 0.06–0.36), which was comparable to that of networks intrinsic connectivity. Phase encoding direction reliability for the FSA was ICC = 0.51 (95% CI = 0.42–0.59), generally lower than for networks intrinsic connectivity. By increasing scan length from 2 to 10 min, diagnostic classification of the FSA increased from AUC = 71.7% (95% CI = 63.1–80.3%) to 75.4% (95% CI = 67.0–83.3%) and phase encoding direction reliability from ICC = 0.29 (95% CI = 0.14–0.43) to ICC = 0.51 (95% CI = 0.42–0.59). FSA scores did not correlate with symptom improvement. These results reassure that the FSA is a generalizable diagnostic – but not prognostic – biomarker. Given the replicable results of the FSA as a diagnostic biomarker trained on case-control datasets, next the development of prognostic biomarkers should be on treatment-response data.
INTRODUCTION
Personalized medicine aims to use biomarkers to match individuals with their most appropriate interventions [1]. This is particularly relevant in psychiatry, since most often treatment choice is determined by trial-and-error, which is associated with disengagement and greater chance of suboptimal outcomes [2]. For personalized medicine to deliver on its promise of improving outcomes, biomarker development should fulfill several criteria. Akin to the phases of drug discovery, these consecutive steps are: identifying a biological measure fit for the clinical paradigm of interest; demonstrating that it scales with the clinical measure of interest independent of potential confounding (internal validity); showing out of sample performance (external validity); and finally proving clinical utility [3].
In a recent review covering the state of the field in biomarker development [4], we identified the Functional Striatal Abnormalities (FSA) [5] index among the most developed in schizophrenia. FSA is a data-driven measure informed by previous evidence on the role of striatal function in schizophrenia. For its development, investigators used functional magnetic resonance images (fMRI) acquired from seven independent scanners (n = 1100), which were used to derive intra- and extra-striatal functional connectivity and striatal fractional amplitude of low-frequency fluctuations (fALFF) images. Features from these three sources were selected using a support vector machine classifier trained to discriminate between patients with non-affective psychosis (n = 560) and controls (n = 540), and the FSA score value was defined as the distance in the feature space to the separating hyperplane. Polarity was defined so that positive values were predictive for healthy controls. The authors found accuracy of 80.4%, sensitivity of 79.3%, and specificity of 81.5% based on leave-one-site-out cross-validation. Despite not having been developed as a predictive biomarker, post-hoc analyses showed that in two separate cohorts for which there were data on treatment response, baseline FSA score was significantly correlated with symptom reduction (r = 0.62, 0.42; p < 0.01 and p < 0.01 respectively), showing promise as a prognostic biomarker.
While these data are encouraging, the ability to move biomarkers towards clinical practice depends on independent confirmations of external validity [3]. The FSA’s performance was tested by leave-one-site-out cross-validation [5], but validation of these results by independent groups under generalizable conditions (i.e., different scanners, imaging protocols, or participant characteristics) is still necessary [6]. Furthermore, in addition to testing the accuracy of predictions, several additional requirements for an effective biomarker include characterization of reliability; examination of potential confounding effects by clinical and demographic variables; testing sensitivity to effects of imaging acquisition and analysis parameters. Biomarker stability in the face of these potential confounds is critical to plan subsequent experiments geared towards demonstrating clinical utility.
Here, we aim to advance this line of research on biomarker development in schizophrenia by testing whether the classification performance of the FSA replicates in the Human Connectome Project for Early Psychosis (HCP-EP) [7], a publicly available dataset. The objective of this manuscript is to replicate the findings reported in the original FSA publication, using an approach and independent data as similar as possible to the original, in order to test the external validity of this biomarker. Furthermore, we aim to study parameters that affect internal validity and reliability. Specifically, we measured the effects of relevant confounding on FSA values, calculated test-retest (i.e., run 1 vs run 2) and phase encoding direction reliability (i.e., FSA generated from PA scan vs from AP scan). In addition, we repeated these analyses for intrinsic connectivity for the Yeo networks, which have been well characterized [8]. Also, we concatenated scan runs and resliced them to obtain scans of increasing lengths from which we obtained FSA and network intrinsic connectivity to study the effects of increasing scan duration on accuracy of classification and reliability. Finally, since the HCP-EP [7] does not have data on treatment response, to replicate the post-hoc finding of the correlation between treatment response and baseline FSA scores, we tested this on a separate cohort of individuals with first episode psychosis who were treated with 12 weeks of antipsychotics (Fig. 1).
METHODS
Participant characteristics
We included data from the HCP-EP as in its 1.1 public release of August 2021 [9]. Participants were patients within 5 years of having been diagnosed with either non-affective or affective psychosis (total n = 101, n = 78 non-affective and n = 23 affective psychosis), and healthy controls (n = 51). All participants were administered the SCID-5 [10] diagnostic interview, which generated rule in and rule out diagnoses. Main entry criteria included age 16–35 and confirmation of primary psychosis for patients and ruling out history of primary psychosis or active affective, substance use, or anxiety disorder for controls. About half of the patients were receiving treatment with antipsychotics at the time of the scan (n = 42, for those on treatment with a mean dose of 384.1 mg/d) and overall were mildly symptomatic (mean total PANSS = 46.29, SD = 16.58). Data were acquired at four sites: Indiana University, Brigham and Women’s Hospital, McLean Hospital, and Massachusetts General Hospital. All procedures were approved by the Partners Healthcare Human Research Committee/IRB and comply with the regulations set forth by the Declaration of Helsinki.
In addition, for the prediction of treatment response we used data from the Zucker Hillside Hospital cohort (ZHH cohort) 97 patients at the time of their first treatment for psychosis (non-affective n = 69, affective psychosis n = 28) with minimal exposure to antipsychotics (43% treatment naïve, median exposure = 5 days). Inclusion criteria were: 1) Demonstration of acute psychosis by presence of scoring ≥4 (moderate) on one or more of these Brief Psychopathology Rating Scale (BPRS-A) [11] items: hallucinatory behavior, unusual thought content, conceptual disorganization; 2) Early phase of illness as defined by having taken antipsychotic drugs for a cumulative lifetime period of 2 weeks or less; 3) age 16 to 30; 4) competent to sign informed consent. Exclusion criteria were: 1) serious neurological or endocrine disorder; 2) any medical condition which requires treatment with a medication with psychotropic effects; 3) significant risk of suicidal or homicidal behavior; 4) inability to provide informed consent; 5) contraindications to monotherapy with risperidone; 6) contraindications to MR imaging. Diagnostic eligibility was confirmed by the SCID-5 [10]. Patients underwent a standardized treatment protocol with risperidone or aripiprazole for 12 weeks, and regular clinical ratings. Patients underwent assessment by the BPRS-A [11] at baseline and weeks 2,4,6,8 and 12, and change in symptom severity between last assessment and baseline was calculated. In addition, we calculated treatment response, defined as two consecutive ratings of much or very much improved on the CGI, as well as a rating of ≤3 on four psychosis-related items of the BPRS-A [11], resulting in n = 52 (53.61%) responders (being the rest either non-responders or early exits). This recently collected dataset is different from those in which we have developed similar work [12]. This sample size was deemed adequate since it is larger than the one in the original study. All participants were scanned at treatment onset after providing informed consent as approved by the Institutional Review Board of the Feinstein Institutes for Medical Research. Participant characteristics are described in Supplementary Table S1.
fMRI acquisition
Resting state fMRI (rs-fMRI) scans were collected on 3T Siemens Prisma scanners using a multi-band accelerated echo-planar imaging sequence described in detail in the Human Connectome Project [13]. For each study participant, a T1-weighted scan (TR = 2400 msec, TE = 2.22 msec, voxel size = 0.8 mm3, scan length = 6 min, 38 s) and two rs-fMRI sessions (TR = 800 ms, TE = 37 ms) were acquired at each timepoint. During each fMRI session, two runs of 5-min 47-s scans (i.e. 420 volumes) were collected in opposite phase encoding directions (one with PA and the other with AP). Rs-fMRI scans were collected with eyes open. To ensure signal stability, the first 13 volumes were discarded in data analysis. Each rs-fMRI scan consisted of 72 contiguous axial/oblique slices in the AC-PC orientation (TR = 720 ms, TE = 33.1 ms, matrix = 104 × 90, FOV = 208 mm, voxel = 2 × 2 × 2 mm, multi-band ×8). The only major differences in acquisition for the scans from the ZHH cohort were that the rsMRI runs were of 7-min 17-s rsMRI runs, one each with AP and PA phase encoding directions (total 2 runs at each timepoint, 594 volumes each), and that images were acquired with eyes closed, with verification of wakefulness.
Calculation of FSA scores and intrinsic connectivity of canonical networks
Image preprocessing.
Scans from both datasets were preprocessed using HCP based pipelines [14]. Briefly, structural preprocessing included gradient distortion correction, brain extraction, cross-modal registration of T2 weighted (T2w) images to T1w, bias field correction based on square root (T1w*T2w) and non-linear registration to MNI space. The functional preprocessing methods used were gradient distortion correction, motion correction, and EPI image distortion correction based on spin-echo EPI field maps and spatial registration to T1w image and MNI space [14]. An initial high pass filter of 2000 Hz was applied to remove slow drift trends before nuisance regression was performed using FMRIB’s ICA-based X-noiseifier (FIX) [15–17]. Functional images then underwent 5-mm full-width-at-half-maximum spatial smoothing. Finally, we ran global signal regression (GSR), since this step was taken in the original FSA publication [5], and also additional literature has suggested that GSR may facilitate behavioral predictions from rs-fMRI [18]. Both GSR and no-GSR results are presented. To control for head motion, frame-wise displacement (FD) was calculated for each scan time point [19]. We applied a stringent motion threshold so that individuals for whom >30% of volumes had FD > 0.3 were excluded from the analyses. This led to the exclusion of 13 and 8 individuals in the HCP-EP dataset and ZHH cohort respectively.
Calculation of FSA and intrinsic network connectivity.
For the FSA calculation, we followed the procedures described in the original study [5], using publicly available scripts for the calculation of the FSA [20], using weights and features as originally developed, since our objective is to replicate the original findings in an independent dataset. The input to calculate the FSA were the preprocessed resting state fMRI images as described above, resliced to 3 mm isotropic voxels, and fALFF images [21] that we calculated separately using the RestPlus toolbox in Matlab [22]. For this step, we used preprocessed images as described above. Fourier transformation was used at every voxel to calculate the power of BOLD signal in the low frequency range of 0.01–0.08 Hz, and then divided by the entire frequency range. Both resting state and fALFF images were finally used to calculate the FSA scores, which reflect the distance to the separating hyperplane between cases and controls based on the model developed in the original publication [5]. For each scan, we generated FSA values for each run (i.e., 1 and 2), phase encoding direction (PA and AP) and GSR (with and without).
In addition to the FSA, we measured the intrinsic network connectivity for each subject using custom python scripts based on nilearn [23]. We chose these resting state fMRI measure as a control given its high reproducibility and ease of interpretation [8]. For this step, we generated voxel-wise connectivity matrices for each scan and fetched the ‘Dictionaries of Functional Modes’ (DiFuMo) atlas [24] subsequently deriving functional connectivity for the Yeo networks [8]: cognitive control, default mode, dorsal attention, salience, somatosensory and visual. We extracted and averaged the correlations between nodes within each network as the intrinsic connectivity value for each subject.
Generation of values based on scans of increasing length.
We used concatenation and slicing to generate scans of increasing length from which to calculate FSA and network intrinsic connectivity values. For this step, we normalized, mean centered, and concatenated the two consecutive runs in each phase encoding direction for each individual’s rs-fMRI scan. Subsequently, we sliced each ~10’ concatenated scan in 10 increments (i.e., first 82 volumes, first 164 volumes…). Finally, we preprocessed each increment and analyzed it to obtain FSA and intrinsic network connectivity values as described above.
Analyses
Distribution and potential confounding.
We compared the distribution of the FSA and the network intrinsic connectivity by participant type (patients vs controls) and calculated the corresponding effect sizes of the difference in values between patients and controls. Then, we ran a linear regression including as dependent variable the FSA and independent variables age, sex, race, medication dose, and illness severity score to measure whether these variables may behave as confounding. Of those, antipsychotic dose is relevant, since antipsychotics may have effects on the connectome that if picked up by the FSA, instead of features associated with the diagnosis, could confound the results. In addition, we examined whether these variables affected differently patients or controls by running group interactions.
Classification performance and correlation with symptom improvement.
We calculated receiver operating characteristic (ROC) curves to estimate the accuracy of classification of diagnosis. Analyses were repeated separately for patient sub-groups that only included affective and non-affective psychoses. Since the FSA was developed on patients with non-affective psychosis, we hypothesized that it would perform better in this population. We ran separate ROCs models for FSA derived from PA GSR, PA NoGSR, AP GSR and AP NoGSR scans. Classification accuracy was measured by the area under the curve (AUC) of the ROC curve, and 95% confidence intervals (95% CIs) were generated using 2000 bootstraps. ROCs were also used to identify the score with the best discriminating ability (i.e., value with greatest true positive and lowest false positive fraction), which was then used to calculate the sensitivity and specificity for that discriminating threshold. As a reference, it has been suggested that AUC > 80% is necessary for clinical utility of a biomarker [25] although lower accuracy may still be useful depending on consequences of wrong classification and alternatives to the biomarker [4].
To replicate the post-hoc finding reported in the original publication, of a significant correlation between baseline FSA score and total symptom improvement following an antipsychotic trial, we ran a correlation between FSA scores derived for each phase encoding direction and GSR condition, and total symptom improvement defined as difference in total BPRS-A score between the baseline assessment and the last available measure. Although other approaches such as using individual random slopes of a mixed model may be reflect better individual treatment response [26], we designed the analyses to be as similar to the original study [5].
Test-retest and phase encoding direction reliability.
We compared the FSA and intrinsic network connectivity from each run (i.e., test-retest reliability) and PA vs AP scans (i.e., phase encoding direction reliability) by using the intraclass correlation coefficient (ICC), an established measure of reliability [27] that reflects the ratio of within-individual variance over total variance. Specifically, we used a two-way mixed, single score intraclass correlation coefficient [ICC(3,1)] generating also 95% CIs, which is recommended to test the reliability of the same measure by one rater [28]. We calculated reliabilities for the entire sample, as well as separately for patients and controls. As a rule of thumb ICC is deemed poor <0.4, fair 0.4–0.59, good 0.6–0.74, and excellent >0.75 [29]. The reason to focus on phase encoding direction reliability, in addition to test-retest, is based on recent findings from our group about general differences in reliability in the connectome by phase encoding direction [30].
Effects of scan length on classification performance and reliability.
For this particular part of the analyses, we re-calculated FSA and network intrinsic connectivity values from the scans of increasing lengths described above. For values generated from scans of each duration, we calculated AUCs with corresponding 95% CIs, as well ICC with 95% CIs for phase encoding direction reliability.
RESULTS
Distribution and potential confounding
FSA values were significantly lower in patients than in controls (p < 0.0001) for both phase encoding directions and GSR analyses, in the same direction as in the original publication [5], with effect sizes that ranged between cohen’s d = 0.79 (95% CI = 0.44–1.15) for PA No GSR to d = 0.91 (95% CI = 0.55–1.26) (Fig. 2; see effect sizes and p values in Supplementary Table S2). Differences between patients and controls were for the most part not significant for intrinsic network connectivity, and when significant of small effect size. Out of all the networks and scans (i.e., PA vs AP and GSR vs No GSR, only somatosensory network PA No GSR, d = 0.41 (95% CI = 0.07–0.75) and default mode network AP No GSR, d = 0.35 (95% CI = 0.01–0.68) were different between patients and controls. As expected, network intrinsic connectivity values were lower in scans that were subject to GSR. Age, sex, race, antipsychotic dose, total psychopathology and negative symptoms scores did not have any significant impact on FSA values (Supplementary Table S3).
Classification performance and correlation with symptom improvement
For classification of diagnosis (i.e., psychosis vs healthy control), the classification performance of the FSA ranged between AUC = 75.7%, (95% CI = 67.3–84.1%) for PA scans with GSR with sensitivity of 82% and specificity of 65%, and AUC = 73.6%, (95% CI = 65.0–82.2%) for AP scans without GSR with sensitivity of 76% and specificity of 66%. When the discrimination was between non-affective psychosis and healthy controls, the classification ranged between AUC = 80.5% (95% CI = 72.1–88.0%) for PA scans with GSR with specificity of 82% and sensitivity of 73% and AUC = 74.5% (95% CI = 65.2–82.1%) for AP scans without GSR with specificity of 65% and specificity of 77%, whereas when the discrimination was between affective psychosis and healthy controls the classification ranged between AUC = 70.2% (95% CI = 57.4–81.9%) for AP scans without GSR with specificity of 43% and specificity of 96% and AUC = 58.7% (95% CI = 44.2–72.0%) for PA scans with GSR with sensitivity of 82% and specificity of 39% (Fig. 3, panels A–C; Supplementary Table S4).
When we tested correlation between baseline FSA scores and change in total symptom severity in the ZHH cohort, we found that the regression coefficients ranged between r = 0.038 p = 0.75 for AP scans without GSR and r = 0.089 p = 0.45 for PA scans with GSR (Fig. 3, panel D). When using a dichotomous definition of treatment response, classification performance ranged between AUC = 55.7% (95% CI = 40.4–70.8%) for PA scans with GSR with sensitivity of 35% and specificity of 88%, and AUC = 49.3% (95% CI = 34.4–65.9%) for AP scans with GSR with sensitivity of 72% and specificity of 41%. When the classification was between non-affective psychosis and healthy controls, the classification performance ranged between AUC = 55.7% (95% CI = 39.2–51.5%) for PA scans with GSR with sensitivity of 31% and specificity of 93%, and AUC = 47.9% (95% CI = 33.2–63.7%) for PA scans without GSR with sensitivity of 31% and specificity of 93%. There were not enough individuals with affective psychosis with enough data on treatment response to run separate classification analyses (Supplementary Table S5; Supplementary Fig. S1).
Test-retest and phase encoding direction reliability
Test-retest reliability between two runs within the same scan session for the FSA ranged between ICC = 0.48 (95% CI = 0.35–0.6) for AP scans without GSR, and ICC = 0.22 (95% CI = 0.06–0.37) for PA scans without GSR. Test-retest reliability for network intrinsic connectivity ranged between ICC = 0.5 (95% CI = 0.37–0.61) for cognitive control network in AP scans with GSR and ICC = 0.07 (95% CI =−0.09–0.22) in salience network with PA No GSR scans (Fig. 4, panel A; Supplementary Table S6).
Phase encoding direction reliability for the FSA was ICC = 0.51 (95% CI = 0.42–0.59), while it ranged between ICC = 0.86 (95% CI = 0.82–0.88) for the dorsal attention network and ICC = 0.75 (95% CI = 0.70–0.80) for the somatosensory network (Fig. 4, panel B; Supplementary Table S7).
Effects of scan length on classification performance and reliability
Using a slice of approximately 2’ (164 volumes) out of the concatenated runs (total 820 volumes) the diagnostic classification of the FSA ranged between AUC = 67.2% (95% CI = 57.6–76.7%) for AP scans without GSR and AUC = 71.7% (95% CI = 63.1–80.3%) for PA scans with GSR. When the entire concatenated scan was used, the diagnostic classification ranged between AUC = 73.6% (95% CI = 65.0–82.2%) for AP scans without GSR and AUC = 75.7% (95% CI = 67.3–84.1%) for PA scans with GSR. None of the classifications made by network intrinsic connectivity were significant, regardless of the scan time (Fig. 5).
When the same slice scheme was used to test phase encoding direction reliability, we found significant increments in reliability for the FSA, starting from ICC = 0.29 (95% CI = 0.14–0.43) in slices of approximately 2’ duration to ICC = 0.56 (95% CI = 0.45–0.66) when using the entire concatenated scan. Similar increments were observed across network intrinsic connectivity values (Fig. 6).
DISCUSSION
Using a public independent dataset, we found that the original findings on the performance of the FSA as a diagnostic biomarker for schizophrenia [5] are generalizable out of sample. However, the post-hoc finding of the association between treatment response and baseline FSA scores in the original report [5] did not replicate in a separate treatment response dataset, suggesting that while as a diagnostic biomarker of schizophrenia the FSA has demonstrated external validity, it may be limited as a prognostic biomarker in its current form. Furthermore, this biomarker showed test-retest reliability comparable -and in some cases superior – to that of the Yeo networks, which have been previously well-validated [8]. Like recent work by our group [30], we identified differences in reliability by phase encoding direction, which seemed to be more pronounced in the FSA than in network intrinsic connectivity, although this did not have any significant impact in the accuracy of predictions. Similarly, there seemed to be greater gains in reliability than in accuracy of predictions by increasing the scan length.
Despite differences in participant demographic and clinical characteristics, scanner type, acquisition parameters, and preprocessing approach – all of which may reduce the reliability of fMRI scans and potentially the reproducibility of findings [31–33] – the results in the HCP-EP are consistent with those of the Chinese datasets in which it was developed. In addition to a classification accuracy of 75.7% (vs 80.1% in the original publication), we also observed consistently higher FSA values in controls than in patients, improvement in classification accuracy when only non-affective psychoses were included, and better results with global signal regression, all of which were reported in the original publication [5]. Furthermore, we did not find significant effects of demographic, clinical, or acquisition variables on FSA values, highlighting the internal validity of this biomarker. The replication of these findings attests to the value of the approach used to develop the FSA: advanced multivariable classification methods on large datasets with leave-one-site-out cross-validation. Similar methods applying data-hungry models with contingencies to mitigate overfitting have been successfully applied to other areas of medicine, such as liquid biopsy for cancer [34], radiomics in medical imaging [35], or ophthalmology [36]. However, diagnostic biomarkers have limited applicability in psychiatric conditions since the psychiatric interview is still necessary to assess the needs of each individual. It is possible that diagnostic biomarkers that classify individuals with a first episode of psychosis between different diagnostic trajectories (e.g., schizophrenia vs bipolar disorder) could be useful, although this was separate from the diagnostic question assessed by the FSA. Despite there may not be immediate clinical application of this biomarker, it is remarkable that the FSA findings replicate under so different conditions, which as a field, brings closer the application of neuroimaging biomarkers in psychiatric disorders.
An application more likely to be adopted in the clinic is that of prognostic biomarker. Predicting that an individual will not respond to an antipsychotic could avoid unsatisfactory experiences with ineffective treatments, a known risk factor of non-adherence [37] and ultimately relapse [38]. Furthermore, it could reduce the time while someone is acutely psychotic – and potentially a danger to self or others – by allowing the expedited use of clozapine, the only drug approved for treatment resistant schizophrenia [39]. Neuroimaging biomarkers could be very helpful for this purpose since clinical information alone is rather limited to predict treatment response [40]. However, it may be necessary to develop these biomarkers using datasets from treatment response, instead of datasets of cases vs controls. It is not entirely surprising that despite robust performance for diagnosis classification, the FSA – for which features were selected using case-control datasets – was limited as a prognostic biomarker. Our interpretation is that, despite similar approach and comparable independent data, including in sample size, the original correlation between baseline FSA and treatment response did not replicate, and therefore it cannot be confirmed that the FSA has external validity as a prognostic biomarker. Meta-analyses on functional connectivity in schizophrenia vs controls [41], and in treatment responders vs non-responders [42] do not necessarily overlap in their findings, aligning with the idea that the pathophysiology of the schizophrenia syndrome may go beyond the mechanisms engaged by antipsychotic drugs. In fact, in about one third of individuals with schizophrenia, positive symptoms fail to respond to antipsychotic drugs [43], and furthermore negative symptoms or cognitive deficits, core aspects of the syndrome, are not engaged by antipsychotic medication [44]. Thus, the mechanisms involved in treatment response likely account for only a proportion of the connectivity abnormalities observed in schizophrenia that were used for feature selection of the FSA. A similar example is the application of polygenic risk scores (PRS) in schizophrenia. PRSs can classify individuals with schizophrenia with AUCs ranging between 60 and 66% [45], but in general the proportion of the variance on treatment response explained by schizophrenia PRSs is lower [46]. There were differences in the demographic and scanning parameter characteristics between the treatment response cohort and the case-control sample from the HCP-EP which make reasonable to test the performance of the FSA as a prognostic biomarker in other datasets prior to completely ruling out this capability. Overall, this line of research suggests that applying approaches like the ones used to develop the FSA on large datasets of treatment response could be helpful to optimize prognostic biomarkers in schizophrenia and bring them closer to clinical application.
We observed effects of the phase encoding direction of the scan on reliability. Compared with network intrinsic network connectivity, the reliability of FSA values generated from PA vs AP scans was significantly lower, suggesting that FSA values are more vulnerable to phase encoding direction effects than other connectivity measures. AP scans showed better test-retest reliability; however, the predictions of AP and PA scans were not significantly different, suggesting that phase encoding direction may be more relevant for reliability than for accuracy of predictions. Similarly, we observed significant improvements in phase encoding direction reliability by increasing the acquisition length, however this had little effect on predictions, which were already good with scans of about 2 min. This apparent decoupling between accuracy of predictions and reliability has been well documented [47–50], and indeed reflects the fact that accuracy and reliability are related but independent constructs, and that possibly the FSA draws on features that are disease related in a categorical sense rather than on a continuous liability to psychosis proneness. For instance, a given measure may get better reliability by capitalizing on reliable but uninformative features in the connectome (i.e., consistent noise), whereas other measures may capitalize on various states that while different, are all related to the phenotype of interest. Thus, it has been argued that since these discrepancies may occur, the focus in biomarker development should be on accuracy of predictions rather than on reliability [50]. Yet, to our knowledge in the development of the FSA there was no explicit management of the potential effects of phase encoding direction. These data indicate that in future iterations of biomarker development at the very least it should be explicitly managed in the analyses, and ideally at the stage of image acquisition.
These data should be interpreted in the light of several limitations. First, the sample size in this replication was relatively small (n = 152) compared to the development sample size in the original study (n = 1100). However, this sample size is comparable to the sample sizes for the cross-validation sets that were used in the original analyses. Second, we did not test the specificity of the replication by attempting to measure the discrimination of diagnoses other than psychosis, although we observed that predictions were significantly better for discrimination between non-affective psychosis and healthy controls than between affective psychosis and healthy controls. Third, treatment response was tested on a cohort slightly different from the cohorts used to test treatment response in the original FSA study (multiepisode patients treated with both clozapine and non-clozapine antipsychotics vs first episode patients treated with non-clozapine antipsychotics). Fourth, test-retest reliability could only be tested for the individual runs, but only reliability between scans of different phase encoding direction (i.e., PA vs AP) could be tested used in the time analysis since we concatenated both runs to obtain longer scans to test the effects of acquisition length. Fifth, despite concatenation of both runs, scans were relatively short and did not reach a plateau in the time analysis, thus not allowing to conclude about the minimal scan duration for best reliability and accuracy. Sixth, although treatment response causes the actual observed phenotype – symptom change over time – other elements contribute to this, including regression to the mean, placebo response and measurement error. However, approaches to separate true treatment response from these other components of symptom improvement over time, such as a placebo-controlled trial or a head-to-head of a clozapine vs a non-clozapine antipsychotic may have ethical challenges. Seven, theoretically the classification could have been confounded by antipsychotics effects (i.e., the FSA would be sensitive to fMRI changes caused by antipsychotics, rather than the illness, and classify accordingly). However, we believe that the risk of confounding by antipsychotics is small, since in the original publication there was no antipsychotic effect on FSA scores (p = 0.36), in the HCP-EP dataset more than half of the patients were not medicated at the time of the scan, and that in our regression, antipsychotic dose was not associated with FSA scores. Finally, the finding that acquisitions as short as 2 min yielded FSA scores that already classified diagnosis significantly suggests that spatial autocorrelation pattern in addition or instead of the temporal dynamics may account for the classificatory power.
In conclusion, using the framework for biomarker development of consecutive contingent phases – target identification, internal validity, external validity, and demonstration of clinical utility [3] - this work emphasizes the internal validity of the FSA given the limited effect of common confounding, such as demographic or clinical characteristics. It also confirms the external validity as a diagnostic biomarker by replicating the original findings out of sample under rather different conditions. By and large, the test-retest reliability of the FSA comparable to that of the intrinsic connectivity in the Yeo networks [8], which have been well studied. These findings demonstrate the value of multi-site research to develop large datasets in which data hungry advanced statistical techniques can be applied, yielding robust results. To move forward biomarker development in schizophrenia – from demonstrating external validity to clinical utility – it is necessary to apply similar approaches to treatment response data, to generate robust prognostic biomarkers that can predict treatment response.
Supplementary Material
ACKNOWLEDGEMENTS
JR was supported by NIH grant K23MH127300.
Footnotes
COMPETING INTERESTS
JR has received research support from Alkermes, speaker bureau or advisory board compensation from TEVA, Karuna, Janssen, royalties from UpToDate. PH has received grants and honoraria from Novartis, Lundbeck, Mepha, Janssen, Boehringer Ingelheim, Neurolite outside of this work. JK has received funds for research support from H. Lundbeck, Janssen, and Otsuka; has received consulting fees or honoraria for lectures from Alkermes, Biogen, Boehringer Ingelheim, Cerevel, Dainippon Sumitomo, H. Lundbeck, HLS, Indivior, Intra-Cellular Therapies, Janssen, Johnson and Johnson, Karuna, LB Pharmaceuticals, Merck, Minerva, Neurocrine, Neumora, Newron, Novartis, Otsuka, Reviva, Roche, Saladax, Sunovion, Takeda, and Teva; and has ownership interest in HealthRhythms, North Shore Therapies, LB Pharmaceuticals, Medincell, and the Vanguard Research Group. The rest of the authors declare no conflict of interest.
ADDITIONAL INFORMATION
Reprints and permission information is available at http://www.nature.com/reprints
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41380-023-02381-9.
DATA AVAILABILITY
Analyses were conducted with custom scripts in R and python available on https://github.com/lorente01. Data for the HCP Early Psychosis is available at https://www.humanconnectome.org/study/human-connectome-project-for-early-psychosis and data for the ZHH cohort was accessed directly from the study Principal Investigator and is expected to be publicly available at https://nda.nih.gov/.
REFERENCES
- 1.Collins FS, Varmus H. A new initiative on precision medicine. N Engl J Med.2015;372:793–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Stephan KE, Bach DR, Fletcher PC, Flint J, Frank MJ, Friston KJ, et al. Charting the landscape of priority problems in psychiatry, part 1: classification and diagnosis. Lancet Psychiatry. 2016;3:77–83. [DOI] [PubMed] [Google Scholar]
- 3.Abi-Dargham A, Horga G. The search for imaging biomarkers in psychiatric disorders. Nat Med. 2016;22:1248–55. [DOI] [PubMed] [Google Scholar]
- 4.Abi-Dargham A, Moeller SJ, Ali F, DeLorenzo C, Domschke K, Horga G, et al. Candidate biomarkers in psychiatric disorders: state of the field. World Psychiatry. 2023;22:236–62. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Li A, Zalesky A, Yue W, Howes O, Yan H, Liu Y, et al. A neuroimaging biomarker for striatal dysfunction in schizophrenia. Nat Med. 2020;26:558–65. [DOI] [PubMed] [Google Scholar]
- 6.Scheinost D, Noble S, Horien C, Greene AS, Lake EM, Salehi M, et al. Ten simple rules for predictive modeling of individual differences in neuroimaging. Neuroimage. 2019;193:35–45. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Demro C, Mueller BA, Kent JS, Burton PC, Olman CA, Schallmo M-P, et al. The psychosis human connectome project: an overview. NeuroImage. 2021;241:118439. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Yeo BTT, Krienen FM, Sepulcre J, Sabuncu MR, Lashkari D, Hollinshead M, et al. The organization of the human cerebral cortex estimated by intrinsic functional connectivity. J Neurophysiol. 2011;106:1125–65. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Human Connectome Project for Early Psychosis Investigators. HCP Early Psychosis1.1 Data Release https://www.human-connectome.org/study/humanconnectome-project-for-early-psychosis.
- 10.First MB, Williams JBW, Karg RS, Spitzer RL. Structured Clinical Interview for DSM-5—Research Version (SCID-5 for DSM-5, Research Version; SCID-5-RV). Arlington, VA: American Psychiatric Association; US; 2015. [Google Scholar]
- 11.Woerner MG, Mannuzza S, Kane JM. Anchoring the BPRS: an aid to improved reliability. Psychopharmacol Bull. 1988;24:112–7. [PubMed] [Google Scholar]
- 12.Sarpal DK, Robinson DG, Fales C, Lencz T, Argyelan M, Karlsgodt KH, et al. Relationship between duration of untreated psychosis and intrinsic corticostriatal connectivity in patients with early phase schizophrenia. Neuropsychopharmacology. 2017;42:2214–21. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Van Essen DC, Ugurbil K, Auerbach E, Barch D, Behrens TEJ, Bucholz R, et al. The Human Connectome Project: a data acquisition perspective. Neuroimage. 2012;62:2222–31. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Glasser MF, Sotiropoulos SN, Wilson JA, Coalson TS, Fischl B, Andersson JL, et al. The minimal preprocessing pipelines for the Human Connectome Project. Neuroimage. 2013;80:105–24. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Smith SM, Beckmann CF, Andersson J, Auerbach EJ, Bijsterbosch J, Douaud G, et al. Resting-state fMRI in the Human Connectome Project. Neuroimage. 2013;80:144–68. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Griffanti L, Salimi-Khorshidi G, Beckmann CF, Auerbach EJ, Douaud G, Sexton CE, et al. ICA-based artefact removal and accelerated fMRI acquisition for improved resting state network imaging. Neuroimage. 2014;95:232–47. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Salimi-Khorshidi G, Douaud G, Beckmann CF, Glasser MF, Griffanti L, Smith SM. Automatic denoising of functional MRI data: combining independent component analysis and hierarchical fusion of classifiers. Neuroimage. 2014;90:449–68. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Li J, Kong R, Liégeois R, Orban C, Tan Y, Sun N, et al. Global signal regression strengthens association between resting-state functional connectivity and behavior. NeuroImage. 2019;196:126–41. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Power JD, Barnes KA, Snyder AZ, Schlaggar BL, Petersen SE. Spurious but systematic correlations in functional connectivity MRI networks arise from subject motion. Neuroimage. 2012;59:2142–54. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Bing Liu Lab Github repository, https://github.com/BingLiu-Lab/FSA. [Google Scholar]
- 21.Zou Q-H, Zhu C-Z, Yang Y, Zuo X-N, Long X-Y, Cao Q-J, et al. An improved approach to detection of amplitude of low-frequency fluctuation (ALFF) for resting-state fMRI: fractional ALFF. J Neurosci Methods. 2008;172:137–41. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Jia X-Z, Wang J, Sun H-Y, Zhang H, Liao W, Wang Z, et al. RESTplus: an improved toolkit for resting-state functional magnetic resonance imaging data processing. Sci Bull. 2019;64:953–4. [DOI] [PubMed] [Google Scholar]
- 23.Abraham A, Pedregosa F, Eickenberg M, Gervais P, Mueller A, Kossaifi J, et al. Machine learning for neuroimaging with scikit-learn. Front Neuroinform. 2014;8:14. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Dadi K, Varoquaux G, Machlouzarides-Shalit A, Gorgolewski KJ, Wassermann D, Thirion B, et al. Fine-grain atlases of functional modes for fMRI analysis. NeuroImage. 2020;221:117126. [DOI] [PubMed] [Google Scholar]
- 25.Kraguljac NV, McDonald WM, Widge AS, Rodriguez CI, Tohen M, Nemeroff CB. Neuroimaging Biomarkers in Schizophrenia. AJP. 2021;178:509–21. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Hedeker D, Gibbons RD. Longitudinal data analysis. Wiley-Interscience; Chicago, Illinois, US: 2006. [Google Scholar]
- 27.Noble S, Scheinost D, Constable RT. A guide to the measurement and interpretation of fMRI test-retest reliability. Curr Opin Behav Sci. 2021;40:27–32. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86:420–8. [DOI] [PubMed] [Google Scholar]
- 29.Cicchetti DV, Sparrow SA. Developing criteria for establishing interrater reliability of specific items: applications to assessment of adaptive behavior. Am J Ment Defic. 1981;86:127–37. [PubMed] [Google Scholar]
- 30.Cao H, Barber AD, Rubio JM, Argyelan M, Gallego JA, Lencz T, et al. Effects of phase encoding direction on test-retest reliability of human functional connectome. Neuroimage. 2023;277:120238. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Li J, Bzdok D, Chen J, Tam A, Ooi LQR, Holmes AJ, et al. Cross-ethnicity/race generalization failure of behavioral prediction from resting-state functional connectivity. Sci Adv. 2022;8:eabj1812. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Badhwar A, Collin-Verreault Y, Orban P, Urchs S, Chouinard I, Vogel J, et al. Multivariate consistency of resting-state fMRI connectivity maps acquired on a single individual over 2.5 years, 13 sites and 3 vendors. NeuroImage. 2020;205:116210. [DOI] [PubMed] [Google Scholar]
- 33.Srirangarajan T, Mortazavi L, Bortolini T, Moll J, Knutson B. Multi-band FMRI compromises detection of mesolimbic reward responses. NeuroImage. 2021;244:118617. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Abbosh C, Birkbak NJ, Wilson GA, Jamal-Hanjani M, Constantin T, Salari R, et al. Phylogenetic ctDNA analysis depicts early-stage lung cancer evolution. Nature. 2017;545:446–51. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Lambin P, Leijenaar RTH, Deist TM, Peerlings J, de Jong EEC, van Timmeren J, et al. Radiomics: the bridge between medical imaging and personalized medicine. Nat Rev Clin Oncol. 2017;14:749–62. [DOI] [PubMed] [Google Scholar]
- 36.Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316:2402–10. [DOI] [PubMed] [Google Scholar]
- 37.Kane JM, Kishimoto T, Correll CU. Non-adherence to medication in patients with psychotic disorders: epidemiology, contributing factors and management strategies. World Psychiatry. 2013;12:216–26. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Alvarez-Jimenez M, Priede A, Hetrick SE, Bendall S, Killackey E, Parker AG, et al. Risk factors for relapse following treatment for first episode psychosis: a systematic review and meta-analysis of longitudinal studies. Schizophr Res. 2012;139:116–28. [DOI] [PubMed] [Google Scholar]
- 39.Howes OD, McCutcheon R, Agid O, de Bartolomeis A, van Beveren NJM, Birnbaum ML, et al. Treatment-Resistant Schizophrenia: Treatment Response and Resistance in Psychosis (TRRIP) Working Group Consensus Guidelines on Diagnosis and Terminology. Am J Psychiatry. 2017;174:216–29. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Carbon M, Correll CU. Clinical predictors of therapeutic response to antipsychotics in schizophrenia. Dialogues Clin Neurosci. 2014;16:505–24. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Dong D, Wang Y, Chang X, Luo C, Yao D. Dysfunction of large-scale brain networks in schizophrenia: a meta-analysis of resting-state functional connectivity. Schizophrenia Bull. 2018;44:168–81. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Mehta UM, Ibrahim FA, Sharma MS, Venkatasubramanian G, Thirthalli J, Bharath RD, et al. Resting-state functional connectivity predictors of treatment response in schizophrenia – a systematic review and meta-analysis. Schizophrenia Res. 2021;237:153–65. [DOI] [PubMed] [Google Scholar]
- 43.Wimberley T, Støvring H, Sørensen HJ, Horsdal HT, MacCabe JH, Gasse C. Predictors of treatment resistance in patients with schizophrenia: a population-based cohort study. Lancet Psychiatry. 2016;3:358–66. [DOI] [PubMed] [Google Scholar]
- 44.Jauhar S, Johnstone M, McKenna PJ. Schizophrenia. Lancet. 2022;399:473–86. [DOI] [PubMed] [Google Scholar]
- 45.Vivian-Griffiths T, Baker E, Schmidt KM, Bracher-Smith M, Walters J, Artemiou A, et al. Predictive modeling of schizophrenia from genomic data: comparison of polygenic risk score with kernel support vector machines approach. Am J Med Genet. 2019;180:80–85. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Zhang J-P, Robinson D, Yu J, Gallego J, Fleischhacker WW, Kahn RS, et al. Schizophrenia polygenic risk score as a predictor of antipsychotic efficacy in first-episode psychosis. AJP. 2019;176:21–28. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Noble S, Spann MN, Tokoglu F, Shen X, Constable RT, Scheinost D. Influences on the test-retest reliability of functional connectivity MRI and its relationship with behavioral utility. Cereb Cortex. 2017;27:5415–29. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Liu J, Liao X, Xia M, He Y. Chronnectome fingerprinting: identifying individuals and predicting higher cognitive functions using dynamic brain connectivity patterns. Hum Brain Mapp. 2018;39:902–15. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Byrge L, Kennedy DP. Accurate prediction of individual subject identity and task, but not autism diagnosis, from functional connectomes. Hum Brain Mapp. 2020;41:2249–62. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Finn ES, Rosenberg MD. Beyond fingerprinting: choosing predictive connectomes over reliable connectomes. Neuroimage. 2021;239:118254. [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
Analyses were conducted with custom scripts in R and python available on https://github.com/lorente01. Data for the HCP Early Psychosis is available at https://www.humanconnectome.org/study/human-connectome-project-for-early-psychosis and data for the ZHH cohort was accessed directly from the study Principal Investigator and is expected to be publicly available at https://nda.nih.gov/.