GigaScience. 2025 Feb 11;14:giae119. doi: 10.1093/gigascience/giae119

Opaque ontology: neuroimaging classification of ICD-10 diagnostic groups in the UK Biobank

Ty Easley 1,✉,3, Xiaoke Luo 2,3, Kayla Hannon 3, Petra Lenzini 4, Janine Bijsterbosch 5
PMCID: PMC11811528  PMID: 39931027

Abstract

Background

The use of machine learning to classify diagnostic cases versus controls from neuroimaging features, with cases and controls defined based on diagnostic ontologies such as the International Classification of Diseases, Tenth Revision (ICD-10), is now commonplace across a wide range of diagnostic fields. However, transdiagnostic comparisons of such classifications are lacking. Such comparisons are important to establish the specificity of classification models, set benchmarks, and assess the value of diagnostic ontologies.

Results

We investigated case-control classification accuracy in 17 different ICD-10 diagnostic groups from Chapter V (mental and behavioral disorders) and Chapter VI (diseases of the nervous system) using data from the UK Biobank. Classification models were trained using either neuroimaging (structural or functional brain magnetic resonance imaging feature sets) or sociodemographic features. Random forest classification models were adopted using rigorous shuffle-splits to estimate stability as well as accuracy of case-control classifications. Diagnostic classification accuracies were benchmarked against age classification (oldest vs. youngest) from the same feature sets and against additional classifier types (k-nearest neighbors and linear support vector machine). In contrast to age classification accuracy, which was high for all feature sets, few ICD-10 diagnostic groups were classified significantly above chance (namely, demyelinating diseases based on structural neuroimaging features and depression based on sociodemographic and functional neuroimaging features).

Conclusion

These findings highlight challenges with the current disease classification system, leading us to recommend caution with the use of ICD-10 diagnostic groups as target labels in brain-based disease prediction studies.

Keywords: UK Biobank, machine learning, neuroimaging, mental health disorders, nervous system diseases

Background

Many studies have trained machine learning classifiers on features derived from noninvasive structural and/or functional neuroimaging data to differentiate between cases and healthy controls in a range of diseases (for a review, see [1]). In such disease classification studies, the definition of cases and controls is commonly based on standardized disease ontologies such as the International Classification of Diseases, Tenth Revision (ICD-10) or the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5) and assessed via structured clinical interviews and/or health records. However, most published diagnostic classification efforts are single-disease studies performed in disease-specific cohorts. As such, a comprehensive analysis across diseases in population data is lacking. The goal of this study is to leverage epidemiological data to comprehensively assess the ability to accurately classify 17 ICD-10 diagnostic groups from Chapters V (mental and behavioral disorders) and VI (diseases of the nervous system) based on neuroimaging features.

The UK Biobank (UKB) offers the first available neuroimaging dataset that adopts an epidemiological approach in terms of its prospective recruitment strategy and large sample size [2]. At the time of writing, neuroimaging data for tens of thousands of participants recruited from the National Health Service (NHS) database had been acquired and released, with data acquisition still ongoing toward the goal of N = 100,000 [3]. Although the UKB has some healthy volunteer selection bias resulting from the opt-in choice of participation [4], there are no explicit health-based inclusion/exclusion criteria apart from standard magnetic resonance imaging (MRI) contraindications for neuroimaging. As such, participants with a wide range of ICD-10 diagnoses (derived from clinical records) are included in the UKB cohort [5]. This work uses the UKB dataset to systematically compare brain-based classification models across 17 different ICD-10 diagnostic groups.

This study provides a unique lens on brain-based classifications across a large set of mental and neurological ICD-10 diagnostic groups. There are a number of reasons why such transdiagnostic comparisons of diagnostic classification models are important. First, such comparisons are needed to assess the specificity of classification models (in addition to their accuracy/sensitivity). Machine learning tools are increasingly adopted in clinical care settings, yet their disease specificity cannot be assessed in single-disease studies. Although some transdiagnostic comparisons have been performed across related disorders (e.g., bipolar–schizophrenia [6]; mild cognitive impairment–Alzheimer disease [7]), broader comparisons are lacking. Second, our findings provide a comprehensive benchmark for future research on diagnostic classification models in the UKB and beyond. The UKB cohort will become an increasingly valuable resource to train disease classification models as it follows participants longitudinally, capturing hospital and death records. As such, our findings provide an important benchmark for future efforts. Third, our findings shed light on the limited validity of the ICD-10 diagnostic ontology, consistent with other research pointing to the limited reliability of diagnostic coding systems, including the ICD-10 [8] and DSM-5 [9, 10], and their vulnerabilities to both systemic bias [11–16] and fraud [17–20]. Importantly, the use of suboptimal clinical labels as targets for machine learning models impedes meaningful biomarker discovery [21, 22].

In total, this work included N = 5,861 unique cases and the same number of carefully matched healthy controls. We trained and tested over 400 diagnostic classification models to gather a comprehensive overview of results. Despite well-matched samples of moderate to large sample sizes (N range: 250–2,658), mean classification accuracies ranged from chance (0.5) up to 0.69, and many diagnostic groups could not be classified significantly beyond chance. As such, our comprehensive results revealed limits of the ICD-10 ontology (as available in the UKB) and provide an important benchmark for future work in the UKB and beyond.

Data Description

Case selection

We used UKB variable ID 41270 to collect ICD-10 information for all participants. We focused on ICD-10 diagnostic groups in Chapter V (mental and behavioral disorders) and Chapter VI (diseases of the nervous system) because these diagnostic groups are most relevant for brain-based classification. The numbers of UKB participants with complete neuroimaging data at each of the first levels of the ICD-10 hierarchy (e.g., F00–F09, which we term broad diagnostic groups) and at each of the second levels of the ICD-10 hierarchy (e.g., F00 separately, which we term narrow diagnostic groups) were determined. ICD-10 groups with N ≥ 125 cases at the narrow diagnostic category level were retained. If no individual narrow diagnostic category contained N ≥ 125 cases but the combined broad diagnostic category did, then the broad diagnostic category was retained. As a result, we retained 17 ICD-10 diagnostic groups (see Supplementary Table S1). The overall comorbidity rate, defined as the proportion of individuals who were in more than 1 of the 17 diagnostic groups, was 42.55% (see Supplementary Table S10). Specifically, 38.95% of individuals in the anxiety group also had a diagnosis of depression, indicating substantial overlap between these 2 diagnostic groups (see Supplementary Fig. S3). Thus, we generated a unique case list by excluding all individuals who were in more than 1 of the 17 diagnostic groups (Supplementary Table S2). This unique case list was used for the multiclass classification analysis.
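The retention rule described above can be sketched as follows. This is a minimal illustration with made-up counts and code groupings, not the real UKB tallies:

```python
# Sketch of the diagnostic-group retention rule: keep narrow ICD-10
# groups with N >= 125 cases; if no narrow group within a broad group
# qualifies but the broad total does, keep the broad group instead.
# Counts and code mappings below are illustrative only.
from collections import defaultdict

MIN_CASES = 125

def retain_groups(narrow_counts, broad_of):
    """narrow_counts: {narrow_code: n_cases}; broad_of: {narrow_code: broad_code}."""
    retained = [c for c, n in narrow_counts.items() if n >= MIN_CASES]
    broad_totals = defaultdict(int)
    for code, n in narrow_counts.items():
        broad_totals[broad_of[code]] += n
    for broad, total in broad_totals.items():
        members = [c for c in narrow_counts if broad_of[c] == broad]
        if not any(narrow_counts[c] >= MIN_CASES for c in members) and total >= MIN_CASES:
            retained.append(broad)
    return sorted(retained)

# Toy example: F32 qualifies on its own; G35-G37 only qualifies combined.
narrow = {"F32": 1329, "F33": 90, "G35": 80, "G36": 30, "G37": 20}
broad = {"F32": "F30-F39", "F33": "F30-F39",
         "G35": "G35-G37", "G36": "G35-G37", "G37": "G35-G37"}
print(retain_groups(narrow, broad))  # ['F32', 'G35-G37']
```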

Matched healthy controls

For each of the 17 diagnostic groups, we matched cases to controls to achieve 17 fully balanced case-control groups for classification. The total number of UKB participants with complete neuroimaging data but no ICD-10 labels in either Chapter V (mental and behavioral disorders; all F codes) or Chapter VI (diseases of the nervous system; all G codes) was N = 31,225, which constituted our pool of healthy controls. From this pool, a control was selected for each case to match sex, age, and resting state head motion as closely as possible (in this order of priority). Each case (combined N = 5,861) was matched to a unique control participant across all combined diagnostic groups. The matching procedure resulted in perfectly matched groups for sex (χ2  P = 1 for all diagnostic groups) and no significant group differences for age (P > 0.3) or head motion (P > 0.7).
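A minimal greedy sketch of this kind of priority matching follows; the paper's actual matching procedure may differ, and the sex/age/motion values below are invented:

```python
# Greedy case-control matching sketch: sex must match exactly, then the
# nearest unused control by age, breaking ties by head motion (the
# stated priority order). Illustrative only, not the study's algorithm.
def match_controls(cases, controls):
    """cases/controls: lists of dicts with 'sex', 'age', 'motion'.
    Returns the index of the control assigned to each case."""
    used = set()
    assigned = []
    for case in cases:
        candidates = [
            (abs(c["age"] - case["age"]), abs(c["motion"] - case["motion"]), i)
            for i, c in enumerate(controls)
            if i not in used and c["sex"] == case["sex"]
        ]
        if not candidates:
            raise ValueError("control pool exhausted for this sex")
        _, _, best = min(candidates)  # lexicographic: age first, then motion
        used.add(best)
        assigned.append(best)
    return assigned

cases = [{"sex": "F", "age": 60, "motion": 0.10},
         {"sex": "M", "age": 55, "motion": 0.20}]
controls = [{"sex": "F", "age": 61, "motion": 0.30},
            {"sex": "F", "age": 70, "motion": 0.10},
            {"sex": "M", "age": 54, "motion": 0.25}]
print(match_controls(cases, controls))  # [0, 2]
```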

Matched sample size subgroups

The resulting ICD-10 diagnostic groups varied substantially in sample size (ranging from 125 to 1,329 cases). To test whether classification performance was affected by sample size, we repeated the analyses after subsampling each ICD-10 diagnostic group to match the minimum sample size (125 cases + 125 controls). Subsampling was performed by matching the cases from each ICD-10 diagnostic group to those of the smallest group (G35–G37; demyelinating diseases of the central nervous system) for sex, age, and resting state head motion using the same procedure described above. This subsampling procedure therefore additionally removed potential differences in confounding variables between ICD-10 diagnostic groups.

Analyses

Random forest classification models were trained separately for each diagnostic category across 100 shuffle-split repeats with 80% training data and 20% validation data (see Supplementary Table S3). The primary neuroimaging features used to drive the classification algorithms comprised 2 sets of structural measures (285 surface-based measures or 153 volumetric measures; Supplementary Table S4). Significance testing was performed by comparing the resulting 100 classification accuracies for each shuffle-split against chance level (0.5), as described in the Methods. Follow-up analyses were performed to test whether the primary classification results could be improved by (i) classifications using matched sample sizes, (ii) classifications using a multiclass algorithm, (iii) alternative classification models (support vector and k-nearest neighbors classifiers), and (iv) alternative feature sets (from functional neuroimaging or sociodemographic information; see Supplementary Table S5). Furthermore, diagnostic classification accuracies were benchmarked against age classification by differentiating between extreme ends of the age distribution (youngest vs. oldest). Two age classification groups were generated that matched the sample sizes of the largest and smallest ICD-10 diagnostic groups. A more detailed description of the analytical approaches can be found in the Methods section.
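The chance-level test can be illustrated roughly as follows. This is a hedged sketch of a z-style score of the split accuracies against 0.5, using the standard deviation as the denominator (as noted in the Discussion); the exact test in the Methods may differ, and the accuracies below are toy values:

```python
# Sketch of a one-sided test of shuffle-split accuracies against chance
# (0.5), using the standard deviation rather than the standard error as
# the denominator. Illustrative; not necessarily the paper's exact test.
import math
import statistics

def p_above_chance(accuracies, chance=0.5):
    mu = statistics.mean(accuracies)
    sd = statistics.stdev(accuracies)
    z = (mu - chance) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))  # one-sided normal p-value

# Toy distribution of 100 split accuracies centered on 0.68.
accs = [0.68 + 0.01 * ((i % 5) - 2) for i in range(100)]
print(p_above_chance(accs))
```

Using the standard deviation instead of the standard error makes the criterion much stricter, since the denominator does not shrink as the number of splits grows.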

Diagnostic classification using structural neuroimaging features

Only the “demyelinating diseases” ICD-10 diagnostic group (G35–G37) was classified significantly above chance by volumetric features (μαCC = 0.68, P = 0.094, n = 248; Fig. 1A & Supplementary Table S6) after false discovery rate correction over the 2 structural feature sets (volume and surface). Comparable results using the surface feature set can be found in Supplementary Fig. S1. Volumetric neuroimaging features failed to classify the remaining 16 diagnostic groups significantly more accurately than chance (P ≳ 0.25; Fig. 1A). To address site effects, the same primary classification was repeated using leave-one-group-out cross-validation, and the results confirmed that the “demyelinating diseases” ICD-10 diagnostic group (G35–G37) was classified above chance (Fig. 1B).
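Leave-one-group-out cross-validation of this kind can be set up with scikit-learn as follows; the site labels, feature counts, and synthetic data are illustrative only:

```python
# Sketch of leave-one-group-out cross-validation, holding out one
# acquisition site at a time, with a random forest classifier.
# Synthetic stand-in data; real analyses used UKB imaging features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))      # stand-in for structural IDPs
y = rng.integers(0, 2, size=120)    # case/control labels
site = np.repeat([0, 1, 2], 40)     # imaging site per participant

logo = LeaveOneGroupOut()
clf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(clf, X, y, groups=site, cv=logo)
print(len(scores))  # one held-out accuracy per site
```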

Figure 1:

(A) Diagnostic classification based on volume-based structural neuroimaging features. Classification accuracy distributions across ICD-10 diagnostic groups for cortical volumetric features derived from T1-weighted structural MRI data. Only demyelinating diseases were classified significantly above chance. Mean classification accuracy across splits is shown as a single red dot within each distribution. Matching results using the surface feature sets are available in Supplementary Fig. S1, and all numeric results are available in Supplementary Table S6. (B) Structural results using leave-one-group-out cross-validation (LOGO CV). The blue dots represent the individual classification accuracies from the random forest classifier using LOGO CV, with each blue dot corresponding to the result for a single group left out, and the black dots represent the mean of the LOGO results. The red diamond marks the mean classification accuracy obtained from random forest classification using StratifiedShuffleSplit cross-validation as a comparison.

Diagnostic classification corrected for sample size

To mitigate the impact of varying sample sizes, we repeated the diagnostic classification after subsampling each ICD-10 diagnostic group to match the minimum sample size of 125 cases and 125 matched controls. With a uniform size of 125 for diagnostic classification, the main results shown in Fig. 1 were replicated. Namely, only the “demyelinating diseases” ICD-10 diagnostic group (G35–G37) was classified significantly above chance by surface-based structural neuroimaging features (μαCC = 0.64, P = 0.01, n = 125; Fig. 2A). Full numeric results for the surface-based and volumetric feature sets are available in Supplementary Table S7.

Figure 2:

(A) Size-corrected diagnostic classification based on surface-based structural neuroimaging features. Classification accuracy distributions across ICD-10 diagnostic groups matched in sample size. All classification results matched for sample size are reported in Supplementary Table S7. (B) Size-corrected multiclass diagnostic classification. Confusion matrix of the multiclass classification for all 17 ICD-10 diagnostic groups along with healthy controls. The color bar on the right side represents the number of individuals who fall into the intersection between the true and predicted diagnostic groups. The darker shades (near black) indicate fewer individuals, while the lighter shades (near white or light peach) indicate a higher number of individuals in that particular cell.

Multiclass classification of ICD-10 diagnostic groups

To test whether our diagnostic classifications might improve by combining all diagnoses into 1 model, we trained a multiclass classifier. Participants with multiple diagnostic labels were removed to ensure disjoint groups, and sample sizes were matched across ICD-10 diagnostic groups to avoid the impact of unbalanced groups (see Methods for more details; Supplementary Table S2). Multiclass classification results are summarized in the confusion matrix in Fig. 2B. With a uniform sample size of N = 59 per ICD-10 diagnostic group for multiclass classification, 92% (977 off-diagonal individuals out of N = 1,062) of the subjects were misclassified across all the diagnostic groups. Aligning with the results from diagnostic classification, the ICD-10 diagnostic group with the highest number of correct classifications was “demyelinating diseases.” Notably, although the individual group sizes are small due to the removal of participants with multiple diagnostic labels, the total sample size for the multiclass classification algorithm remained relatively large (N = 1,062). Comparable results using the volumetric feature set can be found in Supplementary Fig. S2.
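The off-diagonal misclassification rate reported above is read from the confusion matrix as follows; this is a toy sketch, not the study's data:

```python
# Build a confusion matrix from true/predicted labels and compute the
# off-diagonal (misclassified) fraction, as in the multiclass analysis.
# Toy labels; the real matrix covered 17 diagnostic groups + controls.
import numpy as np

def misclassification_rate(y_true, y_pred, n_classes):
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1  # rows: true group; columns: predicted group
    off_diag = cm.sum() - np.trace(cm)
    return cm, off_diag / cm.sum()

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]
cm, rate = misclassification_rate(y_true, y_pred, 3)
print(rate)  # 0.5
```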

Comparison against alternative classification models

We also classified ICD-10 diagnostic groups from structural features using support vector classification and k-nearest neighbors classification to check the robustness of our findings in other classification paradigms. The results were similar across classification models, and the 2 additional models did not significantly predict any additional diagnostic groups beyond the random forest classification results (Fig. 3; Supplementary Table S8).

Figure 3:

Effect of classification models. Classifier performance for all ICD-10 diagnostic groups on T1-weighted structural MRI surface data. Results for all classification models are shown in gray dots, and results for the best-performing classification model are shown in colored dots. Because classification accuracy distributions for different classifiers heavily overlap, few gray dots are visible. Large red diamonds show mean classification accuracy of the best-performing classifier, while large black diamonds show mean classification accuracy across all classifiers. All classification results across different models are reported in Supplementary Table S8.

Comparison against alternative feature sets

We furthermore tested whether adopting alternative feature sets might improve the classification accuracy of ICD-10 diagnostic groups. Specifically, we tested 17 different feature sets derived from resting state functional MRI (see Supplementary Materials section 3) and 1 sociodemographic feature set (based on [23]; see Supplementary Materials section 4). Within each diagnostic group, multiple comparisons correction using false discovery rate control was performed across the total number of feature sets tested (i.e., across 20 feature sets, including the 2 primary structural MRI feature sets, 17 functional MRI feature sets, and 1 sociodemographic feature set). Because this correction was more stringent than when using only structural features, demyelinating diseases (G35–G37) were no longer significantly classified after correcting across all feature types. Instead, we found that the depression group (F32) was classified significantly above chance after correction across multiple feature groups (Fig. 4). Depression classification accuracy was highest when using the sociodemographic feature set (μαCC = 0.58, P = 3.5e-3, n = 2,692), but depression was also classified significantly above chance by several resting state feature sets after multiple comparisons correction (see Supplementary Table S9). Within the 16 diagnostic groups that were not classified significantly above chance, we found little variation in classification accuracy across feature sets. This suggests that classification accuracy was generally low and largely insensitive to feature choice, given the statistical power available in this study.
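The per-group false discovery rate control over the 20 feature-set p-values follows the Benjamini-Hochberg procedure, which can be sketched as below; the p-values are toy values, and in practice a library routine such as statsmodels' multipletests would typically be used:

```python
# Benjamini-Hochberg FDR adjustment: sort p-values, scale each by
# n/rank, and enforce monotonicity from the largest p downward.
# Toy p-values; not the study's actual results.
import numpy as np

def bh_adjust(pvals):
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    adj = np.minimum.accumulate(ranked[::-1])[::-1]  # monotone step-up
    out = np.empty(n)
    out[order] = np.minimum(adj, 1.0)
    return out

pvals = [0.001, 0.01, 0.02, 0.3]
print(bh_adjust(pvals))  # adjusted q-values in the original order
```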

Figure 4:

Diagnostic and age classification across feature sets. Summary of classification accuracy distributions across all 17 diagnostic groups for structural, functional, and sociodemographic classification features. In each classification group’s swarm plot, the accuracy distribution of the feature set with the highest mean accuracy is shown in color; the rest are shown in gray. Of the diagnostic groups, only the depression group (N = 2,692) was classified significantly above chance after multiple comparisons correction. Depression was significantly classified by sociodemographic features and 3 functional MRI feature sets (full network connectivity matrices for the Schaefer parcellation and full network connectivity matrices for the ICA parcellation at ranks 150 and 300). Benchmark results for age (youngest vs. oldest) classifications in both large and small groups (size-matched, respectively, to the largest and smallest diagnostic groups) are shown on the right and approximate an upper bound on diagnostic classification accuracy in terms of effect size versus sample size. All classification results are reported in Supplementary Table S9.

Comparison against age classification

To establish a plausible upper bound on classification accuracy of our random forest classification model and neuroimaging feature sets, we classified age (oldest vs. youngest; see Methods). Specifically, we classified age (oldest vs. youngest) in both a large group, size-matched to the depression group (N = 2,676; our largest ICD-10 diagnostic group; F32), and a small group, size-matched to the demyelinating diseases group (N = 250; our smallest ICD-10 diagnostic group; G35–G37). Structural neuroimaging features (derived from surface data) best classified age in both the small (μαCC = 0.90, P = 1.3e-15, n = 246) and large (μαCC = 0.94, P = 1.4e-66, n = 2,676) groups. All considered features classified age significantly above chance in the large age group (P < 1.6e-3; Fig. 4). Nearly all features were able to significantly classify age above chance in the small group as well (P < 1.8e-3), with the exception of partial connectivity matrices derived from the Schaefer parcellation (P = 0.35) and independent component analyses of ranks 150 (P = 0.085) and 300 (P = 0.34). See Supplementary Table S9 for a detailed summary of results. These results confirm that the classification model and neuroimaging feature sets adopted in this work can achieve higher classification accuracies than those observed across all ICD-10 diagnostic groups.

Discussion

The goal of this study was to systematically compare brain-based classification models across 17 different ICD-10 diagnostic groups. Our findings revealed that most ICD-10 diagnostic groups were not classified significantly above chance from neuroimaging or sociodemographic data in the UK Biobank. We found that only a single diagnostic group (“demyelinating diseases,” G35–G37) could be accurately classified from structural neuroimaging features alone. After adding additional feature sets, the largest group (“Depression,” F32) was classified significantly above chance by both sociodemographic features [24] and several functional neuroimaging features but not by structural neuroimaging features. Size-matched classification groups and a multiclass formulation of the problem both yielded similar insights. Classifications with random forest, support vector, and k-nearest neighbor classifiers all gave comparable results. Ultimately, no set of either neuroimaging or sociodemographic features was able to classify more than 1 ICD-10 diagnostic group significantly above chance. By contrast, nearly all neuroimaging features were able to classify age with high accuracy in samples that were size-matched to both our smallest and largest diagnostic groups.

Paradoxically, the 2 diagnostic groups that were significantly classified by 1 or more feature sets represented the largest group (depression; N = 2,658) and the smallest group (demyelinating diseases; N = 250) included in this work. These results are consistent with the notion that the statistical power to detect an effect depends on 3 factors: (i) the magnitude of the effect of interest in the population, (ii) the sample size used to detect the effect, and (iii) the statistical significance criterion used in the test. The demyelinating diseases diagnostic group revealed outsized sensitivity, suggesting a relatively larger magnitude of effect of interest in this diagnostic population. The impact of the statistical significance criterion, particularly the multiple comparisons burden, can be observed in the finding that the demyelinating diseases group was significantly classified from our primary structural neuroimaging feature sets (controlling across 2 results) but no longer reached significance after correction across the expanded number of feature sets (controlling across 20 results; Supplementary Table S9). These results highlight the importance of balancing hypothesis-driven choices with data-driven exploration. Notably, our statistical significance criterion was relatively strict in using the standard deviation, rather than the standard error of the mean, as the denominator of the test statistic. Although more results would reach significance under a more lenient criterion, this does not alter the central interpretation that classification accuracy was close to chance for the majority of ICD-10 diagnostic groups.

For the depression group, our findings revealed that classification accuracy based on functional neuroimaging feature sets outperformed classification accuracy based on structural neuroimaging feature sets (Supplementary Table S9). These findings are consistent with the conceptualization of depression as driven by circuit dysconnectivity [25, 26]. Although structural brain correlates of depression have also routinely been reported in the literature [27, 28], our findings are partially consistent with a recent meta-analysis that did not find significant convergence of structural or functional brain correlates of depression [29]. Importantly, the individual studies included in this meta-analysis were relatively underpowered (N < 100), further emphasizing the need for more well-powered research [30]. It is possible that our findings may be linked to the larger feature space for functional features (>10,000 features compared to <300 structural features), although our rigorous shuffle-split validation framework mitigates the risk of overfitting. Overall, our findings may suggest that functional neuroimaging features are more sensitive to depression than structural neuroimaging features, although future research is needed to confirm this finding.

Our classification results showed little variation in performance over different classifiers, classification features, and target diagnostic groups. Sociodemographic features were also unable to accurately classify ICD-10 diagnostic groups, despite previous work showing higher classification accuracy of sociodemographic features for other phenotypes [23]. In contrast to the high accuracy of age classification across all feature types and observed sample sizes, we found ICD-10 diagnostic groups to have nearly uniformly low classification accuracy. Our investigation across feature types, classification algorithms, and classification targets therefore suggests that ICD-10 diagnostic groups may constitute unreliable phenotypes. This conclusion corroborates past work on low interrater reliability in ICD-10 coding, which may be driven by many factors, including patient–provider communication, administrative decision chains, and insurance incentives [8, 17–20]. Furthermore, these results are consistent with prior work highlighting challenges with UK Biobank clinical codes [31]. Specifically, Stroganov et al. [31] highlighted mapping issues of available hospital inpatient and general practitioner information onto the ICD-10 ontology. The limited classification accuracy for most ICD-10 diagnostic groups observed in this study furthermore confirms previous work suggesting that the identification of biomarkers of psychopathology is not feasible without increased efforts to address suboptimal phenotypic reliability [21]. Indeed, a recent study empirically showed the degree to which accuracy of classification models is attenuated as a function of the reliability of the classification target [22]. Notably, age is a phenotype with high reliability and was therefore chosen as a classification target to benchmark our analyses. 
In summary, our findings reveal limits of the ICD-10 diagnostic ontology that arise from a variety of complex sources, including clinical heterogeneity, lack of interrater reliability, and structural inequities in health care incentives and access.

Finally, we would like to discuss some methodological limitations of the current study. First, ICD-10 diagnostic groups in the UK Biobank were automatically mapped from available hospital inpatient and general practitioner data sources. These automated mappings are known to be imprecise [31], suggesting that the reliability of ICD-10 diagnostic information in the UK Biobank may be lower than in other research and clinical settings. Second, although we tested our classifications across a range of feature sets and classifiers, it is possible that higher classification accuracy can be achieved using alternative feature sets or imaging modalities (e.g., clinically sensitive modalities such as susceptibility weighted imaging) and/or by using more sensitive classifier architectures; in particular, future work leveraging deep learning on image data may yield further insight. Third, we identified target diagnostic groups at different levels of the ICD-10 diagnostic hierarchy depending on sample size availability. It is possible that this choice may negatively impact classification accuracy (e.g., by including varying degrees of heterogeneity), although we note that the 2 diagnostic groups with significant classification results covered both levels of the hierarchy. Higher classification accuracy may be obtained by comparing severe patient groups to particularly healthy controls. Although this was not feasible based on available metrics in the UKB, it may be a fruitful area for future research.

Potential Implications

In summary, demyelinating diseases and depression could be classified above chance. These findings were likely driven by a larger magnitude of effect for demyelinating diseases and a larger sample size for depression. Notably, depression was significantly classified based on several functional neuroimaging feature sets but not based on structural feature sets, suggesting that functional neuroimaging features may be more sensitive to depression than structural neuroimaging features. We were unable to reliably classify cases in 15 of the 17 ICD-10 diagnostic groups from matched controls. This classification failure implies that either (i) ICD-10 diagnostic categories in the UKB cannot be strongly predicted by neuroimaging data, suggesting limitations of the ICD-10 ontology, or (ii) the extracted imaging features and/or predictive algorithms used were insufficient. Although future work should explore the predictive capabilities of imaging data (before feature extraction) with alternative algorithms, our findings demonstrate, at minimum, that standard predictive models using standard neuroimaging features do not predict ICD-10 labels in the largest epidemiological dataset available. These results reveal limits of the ICD-10 ontology. We recommend caution with the use of ICD-10 diagnostic codes as target labels in future work exploring brain-based diagnostic prediction, particularly in epidemiological cohorts like the UK Biobank.

Methods

Neuroimaging data and preprocessing

Out of the available UKB neuroimaging data, this work used the T1-weighted scan (1-mm isotropic voxels, TR = 2,000 ms, TI = 880 ms) and the resting state functional scan (2.4-mm isotropic voxels, TR = 735 ms, TE = 39 ms, multiband factor 8). Processed data released through the UKB were used (up until and including independent component analysis [ICA]–FIX clean-up for resting state). T1 and resting state data were transformed into MNI space using the linear and nonlinear transforms provided. For detailed preprocessing information, please see [32].

Classification model

The primary analyses were performed using the random forest classifier as implemented in scikit-learn, informed by prior work [23]. The random forest classifier was selected for its flexibility in handling data of varied units, its suitability for nonlinear classification tasks, and its scalability [33]. Notably, the findings presented in this article generalized across other classification algorithms (see Comparison against alternative classification models). Using the Pipeline class in scikit-learn, our estimator included scaling, feature space dimension reduction using principal component analysis (PCA), and classification. Nested 5-fold cross-validation was used to tune the PCA dimensionality and the classifier-specific hyperparameters (Supplementary Table S3): for the random forest, the depth of the trees and the number of variables considered for splitting were tuned. The number of trees was fixed at 250 following prior work [23, 34]. A shuffle-split resampling scheme was used to subdivide the data into 100 stratified training (80%) and validation (20%) splits. Validation performance on each split was used to generate the swarm plots.
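Under this description, the estimator can be sketched in scikit-learn roughly as follows. The synthetic data and the hyperparameter grid values are illustrative (only the fixed 250 trees follows the text), and the real analysis used 100 outer splits rather than the 3 shown here:

```python
# Sketch of the described pipeline: scaling -> PCA -> random forest,
# with inner 5-fold CV tuning PCA dimensionality and tree depth, and an
# outer stratified shuffle-split loop collecting validation accuracies.
# Synthetic data; grid values are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))   # stand-in for IDP features
y = rng.integers(0, 2, size=100)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("rf", RandomForestClassifier(n_estimators=250, random_state=0)),
])
grid = GridSearchCV(
    pipe,
    param_grid={"pca__n_components": [5, 10], "rf__max_depth": [2, 4]},
    cv=5,  # inner loop: hyperparameter tuning
)
outer = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
accs = [grid.fit(X[tr], y[tr]).score(X[te], y[te]) for tr, te in outer.split(X, y)]
print(len(accs))  # one validation accuracy per outer split
```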

Structural neuroimaging feature extraction

Two structural neuroimaging feature sets from T1-weighted images were defined based on available imaging derived phenotypes (IDPs) provided by the UKB pipelines [32] (Supplementary Table S4). The surface-based structural feature set included 285 IDPs available in UKB variable IDs 190 and 196 derived from Freesurfer pipelines. Here, category 196 consists of 186 cortical IDPs from Freesurfer’s DKT-based parcellation, and category 190 contains 99 subcortical IDPs (ASEG). The volumetric structural feature set consisted of 153 IDPs from variable IDs 1101 and 1102. Here, category 1101 contains 139 regional gray matter volumes segmented using FSL FAST, and category 1102 contains 14 subcortical volumes segmented using FSL FIRST.

Alternative feature sets

Given the limited classification accuracy based on the primary structural neuroimaging feature sets, we broadened our scope to determine whether alternative feature sets may outperform the primary results. Specifically, we tested 17 different feature sets derived from resting state functional MRI data (Supplementary Table S4) and 1 feature set comprising demographic information (i.e., nonbrain data; Supplementary Table S5). The 17 different resting state functional MRI feature sets reflected combinations across 3 different brain parcellations (Schaefer parcellation [35], data-driven decomposition using independent component analysis [36, 37], or data-driven decomposition using probabilistic function modes [38, 39]), across feature types (partial correlation matrix, full correlation matrix, or amplitude [40]), and across dimensionality (ICA only, considering data-driven decomposition dimensionalities of 25, 100, 250, and 300). More detail regarding the resting state functional MRI feature extraction can be found in section 3 of the supplementary materials. The sociodemographic feature set was included based on previous work that revealed that this feature set outperformed neuroimaging-derived features in phenotype prediction [23]. Further details on the sociodemographic feature set can be found in Supplementary Table S5.

Statistical analysis

We used statistical significance as our measure of successful classification. In this study, statistical significance was computed from the distribution of split-wise accuracy scores as the empirical probability of classifying above chance. For a given ICD-10 diagnostic group, a feature set was considered to classify significantly above chance if its fitted Student’s t-distribution lies above the guess line (chance accuracy) at the chosen significance threshold:

ā − t(1 − α, n) · s > 0.5,

where ā and s are the mean and sample standard deviation of accuracy across shuffle-splits, and t(1 − α, n) is the (1 − α) quantile of Student’s t-distribution with n degrees of freedom.

Above, n is the number of degrees of freedom (given by the number of shuffle-splits), and we use a significance threshold of α = 0.05. Note that this computation treats the classification accuracy score of a given shuffle-split as a mean of independent and identically distributed Bernoulli variables and assumes it is asymptotically normally distributed.

We emphasize, however, that we used the sample standard deviation instead of the sample standard error, which makes our significance criterion more stringent than a 1-sample t-test against 0.5 (chance). A 1-sample t-test (against chance) would be inappropriate in this context, since it would reflect the likelihood of the population mean accuracy lying above chance, rather than the likelihood of a particular (sub)population being well classified in a study on a single dataset. Correction for multiple comparisons was performed across feature sets using the false discovery rate.
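Our reading of this criterion can be sketched in a few lines. The accuracy distributions below are synthetic, and `above_chance_p` is a hypothetical helper name, not code from the released repository.

```python
import numpy as np
from scipy import stats

def above_chance_p(accuracies, chance=0.5):
    """P value: mass of the fitted t-distribution at or below chance.

    The t location-scale distribution is fitted using the sample mean and the
    sample standard deviation (ddof=1) -- not the standard error -- with
    degrees of freedom given by the number of shuffle-splits.
    """
    a = np.asarray(accuracies, dtype=float)
    return stats.t.cdf(chance, df=len(a), loc=a.mean(), scale=a.std(ddof=1))

rng = np.random.default_rng(0)
p_signal = above_chance_p(rng.normal(0.62, 0.03, size=100))  # well above chance
p_null = above_chance_p(rng.normal(0.50, 0.05, size=100))    # at chance
significant = p_signal < 0.05
```

Because the spread term is the standard deviation rather than the standard error, the resulting p value does not shrink simply by running more shuffle-splits, which is what makes the criterion more stringent than a 1-sample t-test.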

Comparison against multiclass diagnostic classification

In addition to the separate diagnosis-specific case-control classifications, we performed a multiclass classification, aiming to categorize samples into the 17 ICD-10 groups and the control group (i.e., a total of 18 possible labels; see Supplementary Table S4). Internally, the multiclass procedure trains 1 classifier per class, treating the samples of that group as positive and all other samples as negative, and combines the outputs across all groups (one-vs-rest). In this multiclass classification, we utilized the unique, sample-size-matched subject list, as detailed in the case sample selection section, to avoid multiple diagnostic labels per individual case. We employed a random forest classifier with the number of trees set to 250, the splitting criterion set to “gini,” and the random state set to 42 to ensure reproducibility.
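A minimal sketch of this one-vs-rest setup on synthetic data follows: 18 mock classes stand in for the 17 diagnostic groups plus controls, the feature matrix is simulated, and the tree count is reduced from the paper's 250 for speed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 18-class problem: 17 mock diagnostic groups + 1 control group.
X, y = make_classification(
    n_samples=540, n_features=30, n_informative=15,
    n_classes=18, n_clusters_per_class=1, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# One binary forest per class; predictions are combined across classes.
clf = OneVsRestClassifier(
    RandomForestClassifier(
        n_estimators=50,  # reduced from the paper's 250 for speed
        criterion="gini", random_state=42,
    )
)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

After fitting, `clf.estimators_` holds the 18 per-class binary classifiers whose outputs are combined for the final multiclass prediction.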

Comparison against alternative classification models

In addition to the random forest classifier, 2 further classifiers were tested for classification, namely, the support vector classifier and k-nearest neighbors classifier, as implemented in scikit-learn. For the support vector classifier, the regularization parameter C and the kernel type were tuned. For the k-nearest neighbors classifier, the number of neighbors, the weight function, and the distance metric were tuned. See Supplementary Table S3 for hyperparameter values included in the tuning. The classification pipeline described above—including principal component analysis, nested folds for hyperparameter estimation, and shuffle-splits—was identical for random forest, support vector, and k-nearest neighbors classifiers.
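The two alternative classifiers slot into the same scale–PCA–classify pipeline; a sketch with illustrative placeholder grids (not the exact values of Supplementary Table S3) is given below.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

candidates = {
    # support vector classifier: regularization parameter C and kernel type
    "svc": (SVC(), {"clf__C": [0.1, 1.0, 10.0], "clf__kernel": ["linear", "rbf"]}),
    # k-nearest neighbors: number of neighbors, weight function, distance metric
    "knn": (KNeighborsClassifier(), {
        "clf__n_neighbors": [5, 15, 25],
        "clf__weights": ["uniform", "distance"],
        "clf__metric": ["euclidean", "manhattan"],
    }),
}

# Demonstration fit on synthetic data with a reduced cross-validation scheme.
X, y = make_classification(n_samples=120, n_features=20, random_state=0)
results = {}
for name, (estimator, grid) in candidates.items():
    pipe = Pipeline([("scale", StandardScaler()), ("pca", PCA()), ("clf", estimator)])
    results[name] = GridSearchCV(pipe, grid, cv=3).fit(X, y).best_score_
```

Only the final pipeline step and its grid change between classifiers, which is what keeps the comparison across random forest, support vector, and k-nearest neighbors models fair.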

Comparison against alternative feature sets

In addition to the diagnostic classifications using structural neuroimaging features, we repeated the classification analyses using functional neuroimaging features and sociodemographic features instead. A wide range of options exists to calculate features from resting state functional MRI data [41, 42]. We compared classifications based on feature sets obtained from data-driven approaches, namely independent component analysis [36, 37] and probabilistic functional modes [38, 39], as well as atlas-based features [35]. Please see the supplementary methods and Supplementary Table S4 for further information.

For the sociodemographic feature set, we based our selection of features on prior work [23]. Existing UK Biobank variables in 36 variable IDs across categories of age, sex, education, early life, and lifestyle were selected (see Supplementary Table S5). Compared to prior work (see Appendix 2, Supplementary Table S7 in [23]), all variables in the mood and sentiment category and any variables related to smoking behaviors were excluded because these variables directly reflect key diagnostic symptoms of ICD-10 labels. For example, to receive an ICD-10 diagnosis of major depressive disorder, patients have to meet 5 out of 9 criteria (including depressed mood, loss of interest, sleep disturbance, etc.). The excluded variables explicitly measured these key criteria.

Comparison against age classification

To benchmark our classification analyses, we repeated the same random forest classification model to classify older versus younger groups based on the same feature sets. To this end, we formed a cohort by combining all subjects from the 17 ICD-10 diagnostic groups with subjects who had complete neuroimaging data but no ICD-10 labels in either Chapter V or VI. We selected subjects from this cohort by pairing those with the largest age differences while ensuring that the older and younger groups were matched for sex and head motion. The older subjects (aged 67–70) were considered our “case” group, and the younger subjects (aged 40–42) the “control” group. We compiled a balanced group (half “young” and half “old”) of N = 2,656 subjects to match the size of the largest ICD-10 diagnostic group and subsampled a balanced sublist of N = 252 to match the smallest ICD-10 diagnostic group. This allowed us to benchmark classification effect size across all ICD-10 diagnostic groups against the classification effect size of age. To assess classification accuracy, we employed the same random forest classification models (see the classification model section) on both the structural feature sets (surface and volume) and the functional feature sets extracted through independent component analysis, probabilistic functional modes, and the Schaefer atlas (see supplements for details).
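The age-benchmark group construction can be sketched roughly as follows. The cohort is synthetic, the column names are assumptions, and the head-motion matching is omitted for brevity.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Mock cohort: diagnostic-group subjects plus label-free complete-data subjects.
cohort = pd.DataFrame({
    "age": rng.integers(40, 71, size=5000),
    "sex": rng.integers(0, 2, size=5000),
})

n_per_group = 126  # half of the smallest-group benchmark (N = 252)
old = cohort[cohort["age"].between(67, 70)]    # "case" group
young = cohort[cohort["age"].between(40, 42)]  # "control" group

# Balance on sex: take equal numbers of each sex from both age groups.
parts = []
for grp, is_old in [(old, 1), (young, 0)]:
    for sex in (0, 1):
        sub = grp[grp["sex"] == sex].head(n_per_group // 2).copy()
        sub["is_old"] = is_old
        parts.append(sub)
benchmark = pd.concat(parts)
```

The resulting `benchmark` frame is half "old" and half "young" with identical sex composition in both halves, mirroring the matched case-control construction used for the diagnostic groups.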

Availability of Source Code

Project name: WAPIAW3

Project homepage: https://github.com/tyo8/WAPIAW3

Operating system: Platform independent

Programming language: Python, shell

Other requirements: None

License: MIT

RRID: SCR_026093.

The code has also been archived in Software Heritage [43].

Additional Files

Supplementary Fig. S1. Surface structural results.

Supplementary Fig. S2. Multiclass volumetric results.

Supplementary Fig. S3. Heatmap of comorbidity.

Supplementary Table S1. Demographics for the surface-based structural analysis diagnostic groups. The comorbidity rate column shows the percentage of individuals who have more than 1 diagnosis within the 17 diagnostic groups. The age and head motion columns show mean (standard deviation). The sex column reports percent male. Years since diagnosis was obtained by subtracting the date of first occurrence (UKB variable 41280) from the date of scanning (UKB variable 53, instance 2) and dividing the difference in days by 365 (i.e., negative values indicate diagnosis occurred after scanning). Case-control groups were perfectly matched for sex (χ2  P = 1 for all diagnostic groups), and there were no significant group differences for age (P > 0.3) or head motion (P > 0.7). Supplementary Table S2 shows the same table for unique groups.

Supplementary Table S2. Demographics for the unique diagnostic groups. The age and head motion columns show mean (± standard deviation). The sex column reports percent recorded as male. *Indicates a significant (P < 0.05) case-control difference within the diagnostic group.

Supplementary Table S3. Hyperparameter search grids for all scikit-learn classification models. Note that “rank” here means “maximum possible rank” and is given by min(n, m) for an n × m data matrix; rank was not estimated for each input dataset.

Supplementary Table S4. Overview of 2 structural MRI feature sets, 16 resting state functional MRI feature sets, and 1 sociodemographic feature set. Parcellation dimensionality is in regions/network within the whole-brain map unless otherwise stated.

Supplementary Table S5. Information regarding sociodemographic features. The 36 sociodemographic variables listed constitute a 36-dimensional predictive feature set after one-hot encoding categorical variables.

Supplementary Table S6. Overview of primary random forest classification results from structural neuroimaging data only. Each cell reports the mean accuracy (Ac), uncorrected P value (pu), and corrected P value (pc). Corrected P values were corrected for multiple comparisons across 2 feature sets (i.e., within each column) using false discovery rate correction. Cells highlighted in green indicate significant results (pc < 0.05), and cells highlighted in yellow indicate trend-level results (0.05 < pc < 0.10).

Supplementary Table S7. Overview of primary random forest classification results from structural neuroimaging data using the matched sample sizes (i.e., 125 cases and 125 controls in each ICD-10 diagnostic category). Each cell reports the mean accuracy (Ac), uncorrected P value (pu), and corrected P value (pc). Corrected P values were corrected for multiple comparisons across 2 feature sets (i.e., within each column) using false discovery rate correction. Cells highlighted in green indicate significant results (pc < 0.05), and cells highlighted in yellow indicate trend-level results (0.05 < pc < 0.10).

Supplementary Table S8. Overview of results over varying classifiers. Support vector (SVC) and k-nearest neighbors classification (KNC) of diagnostic codes from structural neuroimaging features. In this experiment, we corrected for a total of 6 multiple comparisons over classifier choice and structural feature type. The results do not change: only G35-37 (demyelinating diseases; Supplementary Table S1) is classified significantly above chance from structural features.

Supplementary Table S9. Overview of all random forest classification results. Complete table summarizing all random forest classification results. Each cell reports the mean accuracy (Ac), uncorrected P value (pu), and corrected P value (pc). Corrected P values were corrected for multiple comparisons across 20 feature sets (i.e., within each row) using false discovery rate correction. Cells highlighted in green indicate significant results (pc < 0.05), and cells highlighted in yellow indicate trend-level results (0.05 < pc < 0.10). Age prediction was not performed for PROFUMO and sociodemographic results, as indicated in the cells highlighted in black. Amps: amplitudes; Fnets: full correlation network matrix; ICA: independent component analysis; Pnets: partial correlation network matrix; PROF.: PROFUMO; Schaef.: Schaefer parcellation; SpNets: spatial correlation overlap matrix; Surf.: surface. Only F32 (depression; see Supplementary Table S1) was classified significantly above chance after multiple comparisons correction.

Supplementary Table S10. Overview of the site effect test. This table reports the distribution of sites for cases and controls across each diagnostic group. Cells highlighted in green indicate a significant site difference between cases and controls, based on the chi-square test (P < 0.05).

giae119_Supplemental_Files
giae119_Response_to_Reviewer_Comments_Original_Submission
giae119_Response_to_Reviewer_Comments_Revision_1
giae119_Revision_1
giae119_revision_1.pdf (2.6MB, pdf)
giae119_Revision_2
giae119_revision_2.pdf (1.8MB, pdf)
giae119_Reviewer_1_Report_Original_Submission

Ashlea Segal -- 5/14/2024

giae119_Reviewer_2_Report_Original_Submission

Janna Hastings -- 6/10/2024

giae119_Reviewer_2_Report_Revision_1

Janna Hastings -- 11/6/2024

Abbreviations

ASEG: automatic segmentation; DKT: Desikan–Killiany–Tourville; DSM-5: Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition; FAST: FMRIB’s automated segmentation tool; FIRST: FMRIB’s integrated registration and segmentation tool; FSL: FMRIB Software Library; ICA-FIX: independent component analysis—FMRIB’s ICA-based X-noiseifier; ICD-10: International Classification of Diseases, Tenth Revision; IDP: imaging-derived phenotype; MNI: Montreal Neurological Institute; MRI: magnetic resonance imaging; NHS: National Health Service; PCA: principal component analysis; TI: inversion time; TR: repetition time; UKB: UK Biobank.

Acknowledgments

We are grateful to UK Biobank and the UK Biobank participants for making the resource data possible and to the data processing team at Oxford University for producing the shared processed data. Thanks to Hailey Modi for pointing us to the scikit-learn pipeline option.

Contributor Information

Ty Easley, Department of Radiology, Washington University School of Medicine, Saint Louis, Missouri 63110, USA.

Xiaoke Luo, Department of Radiology, Washington University School of Medicine, Saint Louis, Missouri 63110, USA.

Kayla Hannon, Department of Radiology, Washington University School of Medicine, Saint Louis, Missouri 63110, USA.

Petra Lenzini, Department of Radiology, Washington University School of Medicine, Saint Louis, Missouri 63110, USA.

Janine Bijsterbosch, Department of Radiology, Washington University School of Medicine, Saint Louis, Missouri 63110, USA.

Author Contributions

Conceptualization: Ty Easley, Janine Bijsterbosch; Data Curation: Ty Easley, Petra Lenzini, Xiaoke Luo; Formal Analysis: Ty Easley, Xiaoke Luo; Funding Acquisition: Janine Bijsterbosch, Ty Easley; Methodology: Ty Easley, Xiaoke Luo, Kayla Hannon, Petra Lenzini, Janine Bijsterbosch; Supervision: Janine Bijsterbosch; Visualization: Xiaoke Luo, Petra Lenzini, Kayla Hannon; Writing—Original Draft: Janine Bijsterbosch, Ty Easley, Kayla Hannon, Petra Lenzini, Xiaoke Luo; Writing—Review & Editing: Ty Easley, Janine Bijsterbosch, Xiaoke Luo.

Funding

Janine Bijsterbosch is supported by the National Institutes of Health (NIMH R01 MH128286 & NIMH R01 MH132962), and Ty Easley is supported by the National Science Foundation under grant DGE-2139839. Computations were performed using the facilities of the Washington University Research Computing and Informatics Facility (RCIF), which has received funding from the following NIH S10 program grants: 1S10OD025200-01A1 and 1S10OD030477-01.

Data Availability

UK Biobank data [2, 3] are available following an access application process. For more information, please see https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access. This research was performed under UK Biobank application number 47267. The DOME-ML annotations can be found in the DOME-ML registry [44].

Competing Interests

The authors declare that they have no competing interests.

References

1. Rashid B, Calhoun V. Towards a brain-based predictome of mental illness. Hum Brain Mapp. 2020;41:3468–535. doi:10.1002/hbm.25013.
2. Sudlow C, Gallacher J, Allen N, et al. UK Biobank: An open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015;12:e1001779. doi:10.1371/journal.pmed.1001779.
3. Miller KL, Alfaro-Almagro F, Bangerter NK, et al. Multimodal population brain imaging in the UK Biobank prospective epidemiological study. Nat Neurosci. 2016;19:1523–36. doi:10.1038/nn.4393.
4. Fry A, Littlejohns TJ, Sudlow C, et al. Comparison of sociodemographic and health-related characteristics of UK Biobank participants with those of the general population. Am J Epidemiol. 2017;186:1026–34. doi:10.1093/aje/kwx246.
5. Allen N, Sudlow C, Downey P, et al. UK Biobank: Current status and what it means for epidemiology. Health Policy Technol. 2012;1:123–26. doi:10.1016/j.hlpt.2012.07.003.
6. Smucny J, Barch DM, Gold JM, et al. Cross-diagnostic analysis of cognitive control in mental illness: Insights from the CNTRACS consortium. Schizophr Res. 2019;208:377–83. doi:10.1016/j.schres.2019.01.018.
7. Lampe L, Niehaus S, Huppertz H-J, et al. Comparative analysis of machine learning algorithms for multi-syndrome classification of neurodegenerative syndromes. Alzheimers Res Ther. 2022;14:62. doi:10.1186/s13195-022-00983-z.
8. Stausberg J, Lehmann N, Kaczmarek D, et al. Reliability of diagnoses coding with ICD-10. Int J Med Informatics. 2008;77:50–57. doi:10.1016/j.ijmedinf.2006.11.005.
9. Shankman SA, Funkhouser CJ, Klein DN, et al. Reliability and validity of severity dimensions of psychopathology assessed using the Structured Clinical Interview for DSM-5 (SCID). Int J Methods Psychiatr Res. 2018;27:e1590. doi:10.1002/mpr.1590.
10. Regier DA, Narrow WE, Clarke DE, et al. DSM-5 field trials in the United States and Canada, part II: test-retest reliability of selected categorical diagnoses. Am J Psychiatry. 2013;170:59–70. doi:10.1176/appi.ajp.2012.12070999.
11. Tait RC, Chibnall JT, Kalauokalani D. Provider judgments of patients in pain: Seeking symptom certainty. Pain Med. 2009;10:11–34. doi:10.1111/j.1526-4637.2008.00527.x.
12. Tversky A, Kahneman D. Judgment under uncertainty: Heuristics and biases. Science. 1974;185:1124–31. doi:10.1126/science.185.4157.1124.
13. Bernheim SM, Ross JS, Krumholz HM, et al. Influence of patients’ socioeconomic status on clinical management decisions: A qualitative study. Ann Fam Med. 2008;6:53–59. doi:10.1370/afm.749.
14. van Ryn M, Burke J. The effect of patient race and socio-economic status on physicians’ perceptions of patients. Soc Sci Med. 2000;50:813–28. doi:10.1016/S0277-9536(99)00338-X.
15. Job C, Adenipekun B, Cleves A, et al. Health professional's implicit bias of adult patients with low socioeconomic status (SES) and its effects on clinical decision-making: a scoping review protocol. BMJ Open. 2022;12:e059837. doi:10.1136/bmjopen-2021-059837.
16. Gara MA, Minsky S, Silverstein SM, et al. A naturalistic study of racial disparities in diagnoses at an outpatient behavioral health clinic. Psychiatr Serv. 2019;70:130–34. doi:10.1176/appi.ps.201800223.
17. Geruso M, Layton T. Upcoding: evidence from Medicare on squishy risk adjustment. J Polit Econ. 2020;128:984–1026. doi:10.1086/704756.
18. Coustasse A, Layton W, Nelson L, et al. Upcoding Medicare: Is healthcare fraud and abuse increasing? Perspect Health Inf Manag. 2021;18:1f.
19. Luo W, Gallagher M. Unsupervised DRG upcoding detection in healthcare databases. In: 2010 IEEE International Conference on Data Mining Workshops, Sydney, NSW, Australia: IEEE; 2010:600–5. doi:10.1109/ICDMW.2010.108.
20. Moscelli G, Jacobs R, Gutacker N, et al. Prospective payment systems and discretionary coding-evidence from English mental health providers. Health Econ. 2019;28:387–402. doi:10.1002/hec.3851.
21. Nikolaidis A, Chen AA, He X, et al. Suboptimal phenotypic reliability impedes reproducible human neuroscience. bioRxiv. 2022. doi:10.1101/2022.07.22.501193. Accessed 13 January 2025.
22. Gell M, Eickhoff SB, Omidvarnia A, et al. How measurement noise limits the accuracy of brain-behaviour predictions. Nat Commun. 2024;15:10678. doi:10.1038/s41467-024-54022-6.
23. Dadi K, Varoquaux G, Houenou J, et al. Population modeling with machine learning can enhance measures of mental health. Gigascience. 2021;10:giab071. doi:10.1093/gigascience/giab071.
24. Winter NR, Leenings R, Ernsting J, et al. Quantifying deviations of brain structure and function in major depressive disorder across neuroimaging modalities. JAMA Psychiatry. 2022;79:879–88. doi:10.1001/jamapsychiatry.2022.1780.
25. Williams LM. Precision psychiatry: A neural circuit taxonomy for depression and anxiety. Lancet Psychiatry. 2016;3:472–80. doi:10.1016/S2215-0366(15)00579-9.
26. Berton O, Hahn C-G, Thase ME. Are we getting closer to valid translational models for major depression? Science. 2012;338:75–79. doi:10.1126/science.1222940.
27. Lorenzetti V, Allen NB, Fornito A, et al. Structural brain abnormalities in major depressive disorder: A selective review of recent MRI studies. J Affect Disord. 2009;117:1–17. doi:10.1016/j.jad.2008.11.021.
28. Dai L, Zhou H, Xu X, et al. Brain structural and functional changes in patients with major depressive disorder: A literature review. PeerJ. 2019;7:e8170. doi:10.7717/peerj.8170.
29. Saberi A, Mohammadi E, Zarei M, et al. Structural and functional neuroimaging of late-life depression: A coordinate-based meta-analysis. Brain Imaging Behav. 2022;16:518–31. doi:10.1007/s11682-021-00494-9.
30. Marek S, Tervo-Clemmens B, Calabro FJ, et al. Reproducible brain-wide association studies require thousands of individuals. Nature. 2022;603:654–60. doi:10.1038/s41586-022-04492-9.
31. Stroganov O, Fedarovich A, Wong E, et al. Mapping of UK Biobank clinical codes: Challenges and possible solutions. PLoS One. 2022;17:e0275816. doi:10.1371/journal.pone.0275816.
32. Alfaro-Almagro F, Jenkinson M, Bangerter NK, et al. Image processing and quality control for the first 10,000 brain imaging datasets from UK Biobank. Neuroimage. 2018;166:400–24. doi:10.1016/j.neuroimage.2017.10.034.
33. Liu Y. Random forest algorithm in big data environment. Comput Model New Technol. 2014;18:147–51.
34. Easley T, Chen R, Hannon K, et al. Population modeling with machine learning can enhance measures of mental health—open-data replication. Neuroimage Rep. 2023;3:100163. doi:10.1016/j.ynirp.2023.100163.
35. Schaefer A, Kong R, Gordon EM, et al. Local-global parcellation of the human cerebral cortex from intrinsic functional connectivity MRI. Cereb Cortex. 2018;28:3095–114. doi:10.1093/cercor/bhx179.
36. Beckmann CF, Smith SM. Probabilistic independent component analysis for functional magnetic resonance imaging. IEEE Trans Med Imaging. 2004;23:137–52. doi:10.1109/TMI.2003.822821.
37. Nickerson LD, Smith SM, Öngür D, et al. Using dual regression to investigate network shape and amplitude in functional connectivity analyses. Front Neurosci. 2017;11:115. doi:10.3389/fnins.2017.00115.
38. Harrison SJ, Bijsterbosch JD, Segerdahl AR, et al. Modelling subject variability in the spatial and temporal characteristics of functional modes. Neuroimage. 2020;222:117226. doi:10.1016/j.neuroimage.2020.117226.
39. Harrison SJ, Woolrich MW, Robinson EC, et al. Large-scale probabilistic functional modes from resting state fMRI. Neuroimage. 2015;109:217–31. doi:10.1016/j.neuroimage.2015.01.013.
40. Bijsterbosch J, Harrison S, Duff E, et al. Investigations into within- and between-subject resting-state amplitude variations. Neuroimage. 2017;159:57–69. doi:10.1016/j.neuroimage.2017.07.014.
41. Bijsterbosch JD, Harrison SJ, Jbabdi S, et al. Challenges and future directions for representations of functional brain organization. Nat Neurosci. 2020;23:1–12. doi:10.1038/s41593-020-00726-z.
42. Bijsterbosch JD, Valk SL, Wang D, et al. Recent developments in representations of the connectome. Neuroimage. 2021;243:118533. doi:10.1016/j.neuroimage.2021.118533.
43. Easley T, Luo X, Hannon K, et al. Opaque ontology: Neuroimaging classification of ICD-10 diagnostic groups in the UK Biobank (version 1) [Computer software]. 2024. https://archive.softwareheritage.org/swh:1:snp:6b02c96d795ae80de099c23000ff61414b76ae9c;origin=https://github.com/tyo8/WAPIAW3. Accessed 13 January 2025.
44. Easley T, et al. Opaque ontology: Neuroimaging classification of ICD-10 diagnostic groups in the UK Biobank. DOME Registry. 2024. https://registry.dome-ml.org/review/e0eaxjld0h. Accessed 9 December 2024.
