Abstract
Mental health diagnostics increasingly seek biological markers that can work alongside advanced machine learning approaches. It is difficult, however, to identify a biological marker of disease when the traditional diagnostic labels themselves are not necessarily valid. To begin to address this, we analyzed brain imaging data from over 1400 individuals with mood and psychosis disorders, comprising healthy controls, psychosis patients, and their unaffected first-degree relatives, and we assumed there may be noise in the diagnostic labeling process. We detected label noise by classifying the data multiple times using a support vector machine classifier and flagging those individuals whom all classifiers unanimously mislabeled. We then assigned a new diagnostic label to these individuals, based on the biological data, using an iterative data cleansing approach. Simulation results showed our method was highly accurate in identifying label noise. We evaluated the method with a deep learning model, whose performance improved on the cleansed dataset. Both diagnostic and Biotype categories showed a large percentage of noisy labels, with the largest amount of relabeling occurring between the healthy control and the bipolar and schizophrenia groups, as well as in the unaffected close relatives. Extraction of imaging features highlighted regional brain changes associated with each group. In sum, this approach represents an initial step towards methods that need not assume existing mental health diagnostic categories are always valid, but rather leverage this information while acknowledging that there are mis-assignments.
Keywords: Psychosis Disorders, Label Noise, Machine Learning, Data Cleansing, Deep Learning, Structural MRI
1. INTRODUCTION
Psychiatry has struggled to identify a biological basis for mental illness. Current categorization approaches, including the APA Diagnostic and Statistical Manual (DSM), are not fully valid [1, 2, 3]. Diagnosis of mental illnesses such as schizophrenia and bipolar disorder is typically based on unreliable symptom-based measures. It is also known that different DSM diagnoses overlap substantially, not only in their clinical symptoms but also in biological measures including disease risk genes, structural and functional brain measures, electrophysiology, and cognitive deficits [4, 5, 6]. Additional challenges include the debate over the validity of "mixed" diagnostic categories such as schizoaffective disorder [4]. Unreliable self-report information about symptoms further complicates the diagnostic process and leads to incorrect labeling. Understanding brain biomarkers in psychiatry may help diagnose and treat mental disorders more effectively. Integrating clinical data with genomics and other patient information, such as brain biomarkers, can help define valid disease subtypes, categorize subjects more accurately, and improve treatment outcomes [7]. The challenge is that applying biological classification while using traditional psychiatric diagnoses, which lack inherent validity, as ground truth is unlikely to prove productive.
Identifying and prioritizing diagnostic categorization is a major challenge in psychiatry which, if not carried out correctly, translates into inaccuracies in the diagnostic labeling of biological data, such as data from medical imaging (e.g., magnetic resonance imaging or MRI). Addressing these inaccuracies (which we refer to as label noise in a diagnostic classification problem) is a topic of great interest that serves the ultimate goal of helping patients [8]. Artificial intelligence (AI) can be leveraged to help with this task. Among AI approaches, classification, the process of predicting the class of new samples, has been broadly used in machine learning. To compute a prediction for an unseen sample, a supervised classifier first learns from the data based on the provided labels. As such, the reliability of the dataset plays a vital role in the performance of classification models: if there is label noise in the dataset, prediction accuracy will decrease. However, a completely pure, label-noise-free dataset is unlikely in mental health, given that current diagnostic strategies are inaccurate and the diagnoses are of questionable validity [9, 1]. Noise in this context refers to anything that obscures the relationship between the features of an instance and its class [1, 10]. Previous work has shown how noise may affect classification accuracy [1, 11]. Studies show that neural networks are robust in handling label noise [12]; however, deep learning models require large datasets [13], and the minimum amount of clean training data required increases with the label noise level [12].
In order to improve categorization or nosology in psychiatry, it is important to address the challenges in the existing categorization. One approach is to consider existing categories as 'noisy' and develop methods to eliminate, or at least reduce, the consequences of this noise during a classification task. To do this, we identify cases where the biological evidence pushes against an existing categorization. Our proposed method estimates which individuals are labeled (diagnosed) incorrectly, called class or label noise, removes them from the dataset, and retains the rest for further analysis. We first detected label noise by classifying the data multiple times using a support vector machine (SVM), a supervised machine learning model. We then retained those subjects for whom the number of correct classifications exceeded a specified criterion relative to the number of incorrect classifications. Using these votes, and based on the similarities between the noisy-label subjects and the different diagnostic categories, we relabeled noisy subjects with a new label such that subjects within each new group are more similar to one another. We repeated these steps until the amount of label noise identified in the dataset fell below a specified threshold.
2. MATERIALS AND METHODS
1. B-SNIP-1 DATASET
In this study we analyzed the bipolar-schizophrenia network on intermediate phenotypes (B-SNIP-1) structural imaging (sMRI) dataset [14, 15]. B-SNIP is a multi-site NIH-funded consortium of investigators that collected multiple brain imaging and assessment measures for stable patients with one of three psychotic disorders (schizophrenia, schizoaffective disorder, or bipolar disorder with psychosis). The B-SNIP-1 dataset used in this study included 912 subjects after quality control assessment. Structural MRI three-dimensional acquisitions were carried out on 3T scanners. High-resolution isotropic T1-weighted MP-RAGE sequences were acquired following the Alzheimer's Disease Neuroimaging Initiative (ADNI) protocol [16, 17]. Subject demographics are reported in Table 1.
Table 1.
B-SNIP dataset demographics
| | DSM-IV | | | | | | | | Biotype | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | BPD (N=176) | | SADP (N=134) | | SZP (N=240) | | HC (N=362) | | B1 (N=147) | | B2 (N=185) | | B3 (N=218) | | NC (N=362) | |
| | N | % | N | % | N | % | N | % | N | % | N | % | N | % | N | % |
| Size | 176/912 | 19.30% | 134/912 | 14.69% | 240/912 | 26.32% | 362/912 | 39.69% | 147/912 | 16.12% | 185/912 | 20.29% | 218/912 | 23.90% | 362/912 | 39.69% |
| Gender (Male) | 61/176 | 34.66% | 56/134 | 41.79% | 164/240 | 41.79% | 162/362 | 44.75% | 77/147 | 52.38% | 90/185 | 48.65% | 114/218 | 52.29% | 162/362 | 44.75% |
| Race | ||||||||||||||||
| African American | 34 | 52 | 107 | 99 | 79 | 57 | 57 | 99 | ||||||||
| American Indian | - | - | 1 | 3 | 1 | - | - | 3 | ||||||||
| Asian | 5 | 2 | 5 | 15 | - | 7 | 5 | 15 | ||||||||
| Caucasian | 131 | 73 | 112 | 231 | 57 | 114 | 145 | 231 | ||||||||
| Native Hawaiian | - | - | - | 1 | - | - | - | 1 | ||||||||
| Multiracial / Mixed Race | 2 | 6 | 8 | 6 | 6 | 3 | 7 | 6 | ||||||||
| Other Race | 4 | 1 | 7 | 6 | 4 | 4 | 4 | 6 | ||||||||
| Unknown / Missing | - | - | - | 1 | - | - | - | 1 | ||||||||
| | Mean | Std | Mean | Std | Mean | Std | Mean | Std | Mean | Std | Mean | Std | Mean | Std | Mean | Std |
| Age | 35.9 | 13.14 | 35.88 | 12.01 | 34.51 | 12.33 | 39.8 | 52.05 | 35.52 | 13.14 | 35.32 | 11.91 | 35.11 | 12.64 | 39.8 | 52.05 |
| PANSS | ||||||||||||||||
| Positive | 29.83 | 128.06 | 18.17 | 5.2 | 45.35 | 165.73 | - | - | 36.61 | 139.5 | 47.94 | 174.68 | 19.81 | 66.82 | - | - |
| Negative | 29.01 | 128.15 | 15.52 | 4.63 | 45.41 | 165.73 | - | - | 29.68 | 114.36 | 52.5 | 188.27 | 18.39 | 66.92 | - | - |
| General | 45.31 | 126.19 | 41.74 | 83.77 | 56.37 | 151.5 | - | - | 44.79 | 112.79 | 63.87 | 171.89 | 39.89 | 92.89 | - | - |
| Total | 70.11 | 123.41 | 75.14 | 81.99 | 96.73 | 168.74 | - | - | 83.55 | 133.7 | 99.21 | 179.69 | 68.75 | 91.15 | - | - |
Data were categorized and labeled using two different approaches: standard clinical diagnoses based on the DSM-IV, and neurobiologically defined groups called Biotypes [18]. Each Biotype category contained individuals from all DSM psychosis categories and vice versa (see Table 2). According to [19], "Biotypes" (biologically distinctive phenotypes) refer to neurobiologically distinct subgroups of psychosis cases, independent of clinical phenomenology, that differentiate people with psychosis from healthy controls. Biotype group B1 comprised cases with impaired cognitive control and poor sensorimotor function, group B2 was characterized by impaired cognitive control but an exaggerated sensorimotor response, and group B3 presented near-normal cognitive and sensorimotor function [20]. DSM categories included bipolar disorder with psychosis (BPD), schizoaffective disorder (SADP), schizophrenia (SZP), and healthy control (HC) subjects. In this study we evaluated the DSM- and Biotype-labelled data separately.
Table 2.
Overlap between DSM-IV diagnostic categories and Biotype groups
| DSM-IV | Biotype | # of individuals |
|---|---|---|
| BPD | B1 | 25 |
| B2 | 57 | |
| B3 | 94 | |
| SADP | B1 | 33 |
| B2 | 48 | |
| B3 | 53 | |
| SZP | B1 | 89 |
| B2 | 80 | |
| B3 | 71 | |
| HC | NC | 362 |
The B-SNIP dataset also includes 581 patients' relatives, consisting of 193 relatives of bipolar probands (BPDR), 152 relatives of schizoaffective probands (SADPR), and 236 relatives of schizophrenia probands (SZPR). Within the Biotype categorization, 147 were relatives of Biotype group 1 (BR1), 191 were relatives of Biotype group 2 (BR2), and 116 were relatives of Biotype group 3 (BR3).
2. PREPROCESSING
Structural MRI data (1mm isotropic MPRAGE) were collected from all individuals [14, 15]. Images were preprocessed in SPM (https://www.fil.ion.ucl.ac.uk/spm/) via a unified approach [21] that included tissue classification, bias correction, image registration, and spatial normalization, and were resliced to 2×2×2 mm. The unsmoothed gray matter density (GMD) images were then correlated with the gray matter template to assess segmentation outliers; outliers were corrected where possible or otherwise removed.
3. METHOD DESCRIPTION
We used a novel classification-voting filtering method, based on a data cleansing approach, to eliminate label noise (i.e., noisy or mislabeled subjects) from the sMRI dataset. This also provided informative patterns for further investigation and supplied suggested labels for the identified mislabeled subjects based on the model. The method is based on m inner SVM classifiers that are trained and evaluated via cross-validation. These m SVM models are then used to identify mislabeled subjects across different runs of the cross-validation sets. Thus, each individual is classified m times by an SVM for each of k cross-validation loops, totaling m × k classification votes. Based on these m × k predicted labels, we used consensus voting: a subject is considered to have a noisy label only if all m × k votes disagree with the given label. Fig 1 presents a summary of our method, which we detail below.
Fig 1.

Visual summary and workflow of the method. Preprocessing was done using SPM on structural MRI data collected at multiple sites. Univariate feature selection using ANOVA was then applied for dimension reduction, and subsampling was performed to handle imbalanced classes. Cross-validated classification/voting filtering was applied to the resampled data. Noisy labels were then identified based on their votes, and the cleansed dataset (excluding noisy subjects) and the unlabeled dataset (noisy subjects with their labels dropped) were fed to supervised and deep learning models to evaluate performance and obtain new suggested labels. For generalization, cross-validated classification voting filtering was performed iteratively on the cleansed and newly labeled data until the amount of label noise identified fell below a specified threshold.
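The voting-filter step can be sketched as follows. This is a hedged, minimal sketch, not the study's exact implementation: the function name `run_voting_filter` and its parameters are illustrative, and the m × k votes are approximated here by repeated shuffled cross-validation, each repetition contributing one vote per subject.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import SVC

def run_voting_filter(X, y, n_votes=15, n_folds=5, seed=0):
    """Flag subjects whose given label disagrees with every vote.

    Each vote is one cross-validated prediction per subject; repeated
    shuffled partitions stand in for the paper's m x k classification votes.
    """
    y = np.asarray(y)
    wrong = np.zeros(len(y), dtype=int)
    for v in range(n_votes):
        cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed + v)
        preds = cross_val_predict(SVC(kernel="rbf"), X, y, cv=cv)
        wrong += preds != y
    # Consensus voting: a subject is noisy only if all votes mislabeled it.
    return wrong == n_votes
```

The returned boolean mask can then be used either to drop the flagged subjects (cleansed dataset) or to strip their labels for relabeling.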
Simulation:
Because there is no ground truth for the neuroimaging data, we first evaluated our method on a handwritten digit image dataset, which allowed us to test it on data with a well-defined ground truth. We introduced label noise by randomly shuffling the labels of a proportion of instances, repeating this for different amounts of label noise, and then evaluated the model.
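The noise-injection step of the simulation might look like the following sketch; the function name `add_label_noise` is illustrative, and note that shuffling a subset of labels leaves some of them unchanged by chance.

```python
import numpy as np

def add_label_noise(y, noise_fraction, seed=0):
    """Shuffle the labels of a randomly chosen fraction of instances."""
    rng = np.random.default_rng(seed)
    y_noisy = np.asarray(y).copy()
    n_noisy = int(noise_fraction * len(y_noisy))
    idx = rng.choice(len(y_noisy), size=n_noisy, replace=False)
    y_noisy[idx] = rng.permutation(y_noisy[idx])  # shuffle only the chosen labels
    return y_noisy, idx
```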
Dimension Reduction and feature extraction:
The curse of dimensionality arises with high-dimensional neuroimaging data, where the number of features (voxels) is far larger than the number of samples (brain images) [22, 23]. This is problematic because it may lead to overfitting in predictive models and a lack of generalizability. We therefore used univariate feature selection: we computed a univariate ANOVA F-value between groups based on their assigned labels and used it to select the 100 best features from each subject's vector of voxel features.
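In scikit-learn terms this step reduces to `SelectKBest` with the ANOVA F-statistic; the voxel matrix below is random stand-in data, not the B-SNIP features.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5000))          # subjects x voxels (stand-in data)
y = rng.integers(0, 4, size=200)          # four diagnostic labels (stand-in)
selector = SelectKBest(score_func=f_classif, k=100)
X_reduced = selector.fit_transform(X, y)  # keep the 100 highest F-value voxels
```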
Imbalanced classes and subsampling:
A class imbalance problem occurs when the class distribution is skewed, so that each class does not represent an equal portion of the dataset. Imbalanced classes appear in most real-world datasets. Depending on the classification task, an imbalanced dataset can impair classification accuracy [24], because many classifiers assume the same distribution and proportion of samples across classes; more samples from a given class bias the result in favor of that class [25]. We therefore handled imbalanced classes with random under-sampling, resampling the majority class randomly and uniformly. This was done repeatedly until every instance of the majority classes had been visited at least once.
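One way to realize this repeated under-sampling is sketched below; the function name `balanced_subsamples` is illustrative, and for the coverage guarantee this sketch draws the not-yet-visited instances of each class first, which is one possible reading of "until all instances were visited".

```python
import numpy as np

def balanced_subsamples(y, seed=0):
    """Draw balanced index subsets until every instance has been used once."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()                      # size of the minority class
    remaining = {c: set(np.flatnonzero(y == c).tolist()) for c in classes}
    subsets = []
    while any(remaining[c] for c in classes):
        idx = []
        for c in classes:
            rem = np.array(sorted(remaining[c]), dtype=int)
            if len(rem) >= n_min:             # enough unvisited instances
                chosen = rng.choice(rem, size=n_min, replace=False)
            else:                             # top up from already-visited ones
                pool = np.setdiff1d(np.flatnonzero(y == c), rem)
                extra = rng.choice(pool, size=n_min - len(rem), replace=False)
                chosen = np.concatenate([rem, extra])
            remaining[c].difference_update(int(i) for i in chosen)
            idx.append(chosen)
        subsets.append(np.concatenate(idx))
    return subsets
```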
Hyper-parameters optimization and parameter selection:
Hyper-parameters are parameters of a learning algorithm that must be defined a priori, before model training and fitting. In contrast to ad hoc hyper-parameter settings, hyper-parameter tuning/optimization refers to identifying the set of hyper-parameters that maximizes the performance of the algorithm. To find the best hyper-parameters for the training model, we used a grid search over the hyper-parameter space for the SVM model. This approach evaluates the hyper-parameter space via cross-validation and proposes the best candidate among the possible hyper-parameter values.
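A minimal sketch of such a grid search with scikit-learn follows; the grid values and the synthetic data are illustrative, not the exact grid used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
param_grid = {
    "C": [0.1, 1, 10],           # SVM penalty parameter
    "gamma": [0.01, 0.04, 0.1],  # RBF kernel coefficient
    "kernel": ["rbf"],
}
search = GridSearchCV(SVC(), param_grid, cv=5)  # cross-validated grid search
search.fit(X, y)
best = search.best_params_                      # best hyper-parameter setting
```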
Data cleansing classification/voting filtering:
We obtained votes using cross-validated classification. Using a consensus voting approach, we considered individuals noisy if all of their votes were inaccurate. Based on this, we generated two datasets for further analysis. In one experiment, we removed the noisy-label data and kept only the clean-label data; we call this the cleansed dataset. In another experiment, we kept all the data but dropped the labels of the noisy subjects, treating them as unlabeled samples for the cleansed relabeled dataset.
Supervised classification:
We next analyzed the cleansed dataset with a deep convolutional neural network (CNN), a ResNet model. The architecture is a modification of the open-source PyTorch implementation of the ResNet framework [26]. 3D structural MRI scans are passed through a convolutional layer followed by batch normalization and a rectified linear unit (ReLU) layer. The output then goes through a max-pooling layer and into residual blocks, each containing a series of 3D convolutional, batch normalization, and ReLU layers [4]. The output of the residual blocks feeds into an average pooling layer followed by fully connected layers, whose output is the class probability score. See Fig 2.
Fig 2.

Deep convolutional neural network (ResNet) architecture. 3D structural MRI scans are passed through a convolutional layer followed by batch normalization and a rectified linear unit layer. The output goes through a max-pooling layer and then through residual blocks, each containing a series of 3D convolutional, batch normalization, and ReLU layers.
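One residual block of this kind might look as follows in PyTorch. This is a minimal illustrative sketch (class name, channel count, and kernel size are assumptions), not the study's exact architecture, which also includes the stem, pooling, and fully connected layers.

```python
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    """One 3D residual block: conv -> BN -> ReLU -> conv -> BN, plus a skip."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection preserves the input shape
```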
Relabeling using classification:
After removing the labels from the noisy-label data, we trained an SVM model on the cleansed data and predicted new labels for the noisy subjects. This classification-based relabeling rests on the assumption that when two samples in a high-density region are close and similar, their classifier outputs should be close as well.
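The relabeling step reduces to fitting on the clean subjects and predicting the flagged ones; the blob data and the hard-coded noisy mask below are stand-ins for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Stand-in data: three well-separated groups play the role of diagnostic classes.
X, y = make_blobs(n_samples=150, centers=[[-6, -6], [0, 6], [6, -6]],
                  cluster_std=1.0, random_state=0)
noisy_mask = np.zeros(len(y), dtype=bool)
noisy_mask[:15] = True                      # pretend these were flagged as noisy
clf = SVC(kernel="rbf").fit(X[~noisy_mask], y[~noisy_mask])
new_labels = clf.predict(X[noisy_mask])     # suggested labels for noisy subjects
```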
Convergence:
After the relabeling step, we repeated the process from the subsampling step through the relabeling step iteratively, using the newly suggested labels, until the number of noisy labels identified in the filtering step fell below an ad hoc threshold (e.g., 5%). Through this iteration, noisy labels are detected gradually and the performance of the model increases. Combining this iteration with randomly chosen training/validation partitions increases the generalizability of our approach.
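A compact sketch of the iterate-until-convergence loop, under simplifying assumptions: a single cross-validated SVM stands in for the voting ensemble, blob data stand in for the imaging features, and the 5% threshold and iteration cap are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=[[-5, 0], [5, 0]],
                  cluster_std=1.0, random_state=1)
rng = np.random.default_rng(1)
flip = rng.choice(len(y), 30, replace=False)
y[flip] = 1 - y[flip]                       # inject synthetic label noise

threshold, noise_frac = 0.05, 1.0
for _ in range(10):                         # cap iterations as a safeguard
    preds = cross_val_predict(SVC(kernel="rbf"), X, y, cv=5)
    noisy = preds != y                      # single-model stand-in for voting
    noise_frac = noisy.mean()
    if noise_frac <= threshold:             # convergence criterion
        break
    clf = SVC(kernel="rbf").fit(X[~noisy], y[~noisy])
    y[noisy] = clf.predict(X[noisy])        # relabel the flagged subjects
```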
Visualization:
After the iterative classification filtering, we performed voxelwise t-tests on the gray matter maps to highlight brain regions where the groups differed significantly. For each pair of groups, we ran t-tests on the original non-zero feature (voxel) vectors and applied false discovery rate (FDR) correction for multiple comparisons. We identified the features that differed significantly between each pair of groups and created an image containing the means of those features in the larger group. We also projected the dataset before and after cleansing onto a 2D plot using t-SNE [27].
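The voxelwise comparison can be sketched as per-voxel two-sample t-tests followed by Benjamini-Hochberg FDR correction; the group matrices below are random stand-in data with a planted effect, and `fdr_bh` is an illustrative helper name.

```python
import numpy as np
from scipy import stats

def fdr_bh(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: boolean mask of rejected hypotheses."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    m = len(p)
    thresh = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.flatnonzero(below))   # largest index meeting the criterion
        reject[order[: k + 1]] = True
    return reject

rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, size=(40, 500))   # subjects x voxels (stand-in)
group_b = rng.normal(0.0, 1.0, size=(40, 500))
group_b[:, :50] += 1.0                           # planted effect in 50 voxels
t, p = stats.ttest_ind(group_a, group_b, axis=0)
significant = fdr_bh(p, alpha=0.05)              # FDR-corrected voxel mask
```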
3. RESULTS
Simulation:
To validate our approach we conducted a simulation on handwritten digit images. We included images of the digits 0, 3, 5, and 8, because their shape similarity makes prediction more challenging. The images have 64 pixels (8×8); see Fig 3. We randomly selected digit images in different proportions and shuffled their true labels. Fig 4 shows that as the amount of label noise increases, the accuracy of the SVM classifier on the original dataset decreases approximately linearly; with our method, however, we obtained high accuracy on the cleansed dataset by first identifying the noisy-label instances and then filtering them from the dataset.
Figure 3.

a) Random samples from the handwritten digits dataset. The digits 8, 1, 3, 5 were selected because of their similarities and common features. b) The mean over all samples for each digit.
Figure 4.

Effect of different proportions of label noise on the accuracy of the SVM classifier, and the improvement obtained with the data cleansing approach. Corrupting the data by shuffling instance labels (adding synthetic label noise to the ground truth) decreases the performance of the predictive model. By applying the data cleansing classification filtering method, we could identify noisy labels and improve model performance. Even when 80% of the labels in the dataset were noisy, model accuracy was boosted from 0.38 on the noisy dataset to 0.8 on the cleansed dataset.
B-SNIP MRI Analysis:
For the B-SNIP data, we used a univariate ANOVA F-value to select the 100 best features among 2,122,945 voxels for 912 individuals. This was done once for each type of labelling approach (Biotype and DSM). To handle imbalanced classes, we used the subsampling approach described previously.
Using a grid search, an SVM classifier with an RBF kernel, penalty parameter C = 10, and kernel coefficient γ = 0.04 was selected as the model with optimized hyper-parameters for the B-SNIP dataset. The average accuracy of classification filtering is shown in Fig 5. The average cross-validated SVM accuracy increased from 0.38 to 0.89 for both the Biotype and DSM-IV categories using consensus voting.
Figure 5.

Average cross-validation accuracy of classification filtering over 5 iterations. In each iteration, label noise is identified and new labels are assigned based on similarities between the noisy instances and the cleansed instances. The performance of the model increases after each iteration.
Among the 912 subjects, 573 (63%) were identified as label noise in the Biotype category and 601 (65%) in the DSM-IV category; 452 of the 912 labels were identified as noisy in both. Table 3 shows the proportions and numbers of noisy labels found after 5 iterations of cross-validated classification filtering, revealing convergence at approximately 90% cleansing. At each iteration we removed the labels of the noisy subjects (retaining the labels of the clean subjects) and fit an SVM model to predict new suggested labels. Fig 6 shows heatmaps of the identified label noise for each label type. Using the fitted model, we also predicted labels for the relatives: 435 of the 581 relatives were labeled Biotype B3 or normal control in the Biotype category, and all 581 were labeled healthy control in the DSM-IV category.
Table 3.
Shared label noise using consensus voting: 452 noisy and 460 clean subjects of 912 (the NC/HC column counts the same 362 control subjects in both categorizations)
| B1 | | B2 | | B3 | | NC/HC | | BPD | | SADP | | SZP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Noise | Clean | Noise | Clean | Noise | Clean | Noise | Clean | Noise | Clean | Noise | Clean | Noise | Clean |
| 84 | 78 | 107 | 101 | 117 | 218 | 144 | 89 | 87 | 69 | 65 | 84 | 156 | 218 |
Figure 6.

Top: Confusion matrices showing label-noise subjects for the Biotype and DSM-IV categories separately, after convergence to 90% cleansed data using consensus voting. Middle: Confusion matrices showing label-noise subjects shared between the Biotype and DSM-IV categories after convergence. About 50% of Biotype proband 2 (B2) and 45% of Biotype proband 3 (B3) subjects were relabeled as normal controls. Among Biotype proband 1 (B1), 81% of the instances identified as noisy were relabeled as Biotype proband 2 (B2) or Biotype proband 3 (B3); likewise, 81% of noisy normal-control subjects were relabeled as B2 or B3. In the DSM-IV category, 71% of noisy bipolar probands were relabeled as healthy controls, whereas 64% and 32% of noisy healthy controls were relabeled as bipolar and schizoaffective probands, respectively. Among schizophrenia label-noise subjects, 38% were relabeled as healthy controls and 34% as schizoaffective probands. Bottom: Confusion matrices showing the predicted labels of the relatives using the updated labels from the above analysis. Most relatives are labeled Biotype 3 (B3) or normal control in the Biotype category, and healthy control in the DSM-IV category.
We analyzed the performance of our deep learning model after the first iteration of classification filtering. The accuracy of our deep supervised convolutional neural network (ResNet) improved by about 20%, from 0.65 to 0.79, averaged over 6 binary classification tasks using DSM-IV labels, and by 23%, from 0.60 to 0.74, over 6 binary classification tasks using Biotype labels. Fig 7 shows receiver operating characteristic (ROC) curves for these 12 binary classification tasks. Boxplots of the accuracies of the 12 tasks on the cleansed and original datasets are shown in Fig 8.
Figure 7.

ROC curves for 12 binary classification tasks using Biotype and DSM-IV labels on the cleansed dataset after the first iteration of our data cleansing approach (before convergence). After one iteration of classification filtering and training the deep model on the cleansed dataset, accuracy improved by about 20% (from 0.65 to 0.79) averaged over 6 binary tasks using DSM-IV labels, and by 23% (from 0.60 to 0.74) over 6 binary tasks using Biotype labels.
Figure 8.

Boxplots of accuracy for the 12 binary classification tasks of the deep model on the cleansed dataset after the first iteration. Red boxplots show the accuracy of stratified cross-validation using the original labels. Blue boxplots show the accuracy of stratified cross-validation runs on the cleansed dataset after a single round of classification voting filtering. Accuracy improved on the cleansed dataset after a single iteration of classification filtering.
We used t-SNE 2D projections of the dataset with the original and suggested new labels to visualize how well subjects are grouped by their labels. First, 100 features were extracted from the non-zero voxels using univariate feature selection. These features were then projected onto a 2D plot using t-SNE. Fig 9 shows the projection of the original dataset with label noise in the left panel and the projection using the new suggested labels in the right panel.
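The projection step can be sketched with scikit-learn's `TSNE`; the feature matrix and labels below are random stand-ins, and the perplexity value is an illustrative default rather than the study's setting.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 1000))     # subjects x non-zero voxels (stand-in)
y = rng.integers(0, 4, size=150)     # group labels (stand-in)
X100 = SelectKBest(f_classif, k=100).fit_transform(X, y)  # 100 best features
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X100)
# emb holds one 2D coordinate per subject, ready to scatter-plot by label.
```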
Figure 9.

The left panel shows the t-SNE 2D projection of the original dataset with label noise; the right panel shows the projection using the new suggested labels. Top: projections using Biotype labels. Bottom: projections using DSM-IV labels. For both the Biotype and DSM-IV categories, the affinities between data points show that patient and healthy-control subjects overlap in the noisy data. After identifying label noise and relabeling, the similarity between data points is more apparent and subjects with the same labels lie close together. Embedding the data into 2D space using the original labels does not support the idea that subjects are labeled according to their similarities; it is hard to interpret how the groups differ, because there is considerable overlap between subjects. The cleansed dataset with the new suggested labels shows a gradient pattern, in both the DSM-IV and Biotype categories, from healthy/normal controls to the most severe cases (Biotype B1 or schizophrenia probands), with the milder cases in between.
Voxel-wise t-tests were run on the non-zero voxels between each pair of categories, and FDR correction was applied for multiple comparisons. Voxels showing significant differences at significance level α = 0.05 were identified. Figs 10-13 show the results of statistical testing between each pair of groups. Figs 10 and 11 show gray matter brain maps using the cleansed Biotype labels and the original (noisy) Biotype labels, respectively. Figs 12 and 13 show gray matter brain maps using the cleansed DSM-IV labels and the original DSM-IV diagnostic labels. The results show more voxels with significant differences when using the cleansed data, and more significant voxels for the DSM than for the Biotype data. Importantly, the regions identified are consistent with previous work (for example, in schizophrenia we find reductions in bilateral temporal and insula regions as well as medial frontal regions).
Figure 10.

Gray matter map results of voxel-wise t-tests between the 4 Biotype groups after data cleansing using classification voting filtering. Top row: NC vs. B1, NC vs. B2, NC vs. B3. Bottom row: B1 vs. B2, B1 vs. B3, B2 vs. B3. The gray matter contrast between normal controls and Biotype probands shows overlapping regions that differ to varying degrees after cleansing and relabeling. The difference between normal controls and Biotype proband 1 (B1) shows the strongest separation among the comparisons; the contrast between healthy controls and Biotype proband 3 (B3) is weaker, and Biotype proband 2 (B2) vs. Biotype proband 3 (B3) shows the fewest differences.
Figure 11.

Gray matter map results of voxel-wise t-tests between the 4 Biotype groups using the given labels. Top row: NC vs. B1, NC vs. B2, NC vs. B3. Bottom row: B1 vs. B2, B1 vs. B3, B2 vs. B3. With the original labels, the gray matter contrast shows significant differences only in some regions between normal controls and Biotype proband 1 (B1); gray matter density differences between the other groups are not observed.
Figure 12.

Gray matter map results of voxel-wise t-tests between the 4 DSM-IV groups after data cleansing using classification voting filtering. Top row: BPD vs. HC, HC vs. SADP, HC vs. SZP. Bottom row: BPD vs. SADP, BPD vs. SZP, SADP vs. SZP. The gray matter contrast between healthy controls and DSM-IV probands shows overlapping regions that differ to varying degrees after cleansing and relabeling. The difference between healthy controls and bipolar probands (BPD) shows the strongest separation. The contrast also shows differences between healthy controls and the schizophrenia and schizoaffective probands, and group differences are observed for bipolar vs. schizophrenia probands and bipolar vs. schizoaffective probands. However, the schizophrenia and schizoaffective groups did not separate after data cleansing and relabeling.
Figure 13.

Gray matter map results of voxel-wise t-tests between the 4 DSM-IV groups using the given labels. Top row: BPD vs. HC, HC vs. SADP, HC vs. SZP. Bottom row: BPD vs. SADP, BPD vs. SZP, SADP vs. SZP. With the original labels, the gray matter contrast was significant only in some areas between healthy controls and schizophrenia probands; differences between the other groups were not found.
Table 4 shows the regions where the statistical tests remained significant after FDR correction of the p-values. Using the original dataset, we found differences in far fewer regions: only some regions differed between healthy controls and schizophrenia probands, and between bipolar and schizophrenia probands. Significantly different brain regions are indicated with ● for the relabeled data and with ◊ for the original dataset.
Table 4.
Brain areas containing voxels that showed statistically significant differences for the relabeled data (●) and the original data (◊)
| AREA | B1 VS. B2 | B1 VS. B3 | B2 VS. B3 | NC VS. B1 | NC VS. B2 | NC VS. B3 | BPD VS. HC | BPD VS. SADP | BPD VS. SZP | HC VS. SADP | HC VS. SZP | SADP VS. SZP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ANGULAR GYRUS | ● | ● | ● | ● | ||||||||
| ANTERIOR CINGULATE | ● | ● ◊ | ● | ● | ● | ● | ● | ● ◊ | ||||
| CAUDATE | ● | ● | ||||||||||
| CEREBELLAR LINGUAL | ● | |||||||||||
| CEREBELLAR TONSIL | ● | ● ◊ | ● | ● ◊ | ● | ● | ● | ● | ● | ● | ● ◊ | ● |
| CINGULATE GYRUS | ● ◊ | ● | ● | ● ◊ | ● ◊ | |||||||
| CLAUSTRUM | ● | |||||||||||
| CULMEN | ● | ● | ● ◊ | ● | ● | ● | ● | ● | ● ◊ | ● ◊ | ● | |
| CUNEUS | ● ◊ | ● | ● | ● | ● | ● | ● | ● ◊ | ||||
| DECLIVE | ● ◊ | ● | ● | ● | ● | ● | ● ◊ | |||||
| EXTRA-NUCLEAR | ● | ● | ● ◊ | ● | ● | ● | ● | ● | ● ◊ | ● ◊ | ● | |
| FUSIFORM GYRUS | ● | ● | ● | ● | ● | ● ◊ | ● | |||||
| INFERIOR FRONTAL GYRUS | ● | ● ◊ | ● ◊ | ● | ● | ● | ● | ● | ● | ● ◊ | ● | |
| INFERIOR OCCIPITAL GYRUS | ● | ● | ● | ● ◊ | ||||||||
| INFERIOR PARIETAL LOBULE | ● ◊ | ● | ● | ● | ● | |||||||
| INFERIOR TEMPORAL GYRUS | ● | ● | ● | ● | ● | ● | ● | ● | ||||
| INSULA | ● | ● | ● ◊ | ● | ● | ● | ● ◊ | ● ◊ | ● | |||
| LATERAL VENTRICLE | ◊ | ● | ● | ● | ● | |||||||
| LENTIFORM NUCLEUS | ◊ | ● | ● | ● | ● | ● | ● ◊ | ● ◊ | ||||
| LINGUAL GYRUS | ● ◊ | ● | ● | ● | ● | ● | ● | ● | ||||
| MEDIAL FRONTAL GYRUS | ● | ● | ● ◊ | ● | ● | ● | ● | ● | ● ◊ | |||
| MIDDLE FRONTAL GYRUS | ● ◊ | ● | ● | ● | ● | ● | ● | |||||
| MIDDLE OCCIPITAL GYRUS | ● ◊ | ● | ● | ● ◊ | ● ◊ | |||||||
| MIDDLE TEMPORAL GYRUS | ● ◊ | ● | ● | ● | ● | ● | ● ◊ | |||||
| NODULE | ◊ | ● ◊ | ● | ● ◊ | ● | ● ◊ | ||||||
| PARACENTRAL LOBULE | ● | ● ◊ | ● | ● | ● | ● | ● | ● | ● | |||
| PARAHIPPOCAMPAL GYRUS | ● | ● | ● | ● ◊ | ● | ● | ● | ● | ● | ● ◊ | ● ◊ | ● |
| POSTCENTRAL GYRUS | ● ◊ | ● | ● | ● | ● ◊ | |||||||
| POSTERIOR CINGULATE | ● | ● | ● | ● | ● | ● | ||||||
| PRECENTRAL GYRUS | ● | ● | ● ◊ | ● | ● | ● | ● | ● | ||||
| PRECUNEUS | ● ◊ | ● | ● | ● | ● | ● ◊ | ● ◊ | |||||
| PYRAMIS | ● | ● | ● ◊ | ● | ● | ● | ● | ● | ● | |||
| SUB-GYRAL | ● | ● | ● ◊ | ● | ● | ● | ● | ● | ● ◊ | ● ◊ | ||
| SUBCALLOSAL GYRUS | ● | ● | ● | ● | ● | ● | ● | ● | ● | ● | ||
| SUPERIOR FRONTAL GYRUS | ● ◊ | ● | ● | ● | ● | ● | ● | ● ◊ | ||||
| SUPERIOR PARIETAL LOBULE | ● | ● | ● | ● | ||||||||
| SUPERIOR TEMPORAL GYRUS | ● | ● ◊ | ● | ● | ● | ● | ● | ● | ● ◊ | |||
| SUPRAMARGINAL GYRUS | ● | ● | ● | ● | ||||||||
| THALAMUS | ● | ● ◊ | ● | ● ◊ | ● | ● ◊ | ● ◊ | ● | ||||
| TRANSVERSE TEMPORAL GYRUS | ● | ● | ● | ● | ● | |||||||
| TUBER | ● ◊ | ● | ● | ● | ● | ● | ● | |||||
| UNCUS | ● ◊ | ● | ● | ● | ● | ● | ● ◊ | ● ◊ | ● ◊ | |||
| UVULA | ● ◊ | ● | ● | ● | ● | ● | ◊ |
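The voxel-wise comparisons behind Table 4 follow a standard pattern: a two-sample t-test at every voxel, followed by FDR correction of the resulting p-values. The following is a minimal sketch of that pattern, not the paper's pipeline; it substitutes synthetic arrays for the gray matter maps and includes a hand-rolled Benjamini–Hochberg step so the example is self-contained.

```python
import numpy as np
from scipy import stats

def fdr_bh(pvals, q=0.05):
    """Benjamini-Hochberg FDR: boolean mask of p-values significant at level q."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    n = len(p)
    # Step-up rule: largest k with p_(k) <= q*k/n; reject the k smallest p-values.
    below = p[order] <= q * (np.arange(1, n + 1) / n)
    sig = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        sig[order[: k + 1]] = True
    return sig

def voxelwise_group_test(gm_a, gm_b, q=0.05):
    """Two-sample t-test at every voxel, FDR-corrected.
    gm_a, gm_b: (subjects, voxels) arrays of gray matter values."""
    t, p = stats.ttest_ind(gm_a, gm_b, axis=0)
    return t, p, fdr_bh(p, q)

# Toy data: group b differs from group a only in the first 50 "voxels".
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(40, 200))
b = rng.normal(0.0, 1.0, size=(40, 200))
b[:, :50] += 1.5
t, p, sig = voxelwise_group_test(a, b)
```

In the actual analysis, surviving voxels would then be mapped back to anatomical labels, as in the area names of Table 4, via an atlas lookup.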
4. DISCUSSION
Current diagnostic categories for mental illness are not biologically based and exhibit considerable heterogeneity within diagnoses as well as overlap across diagnoses [28, 29]. Not surprisingly, most imaging biomarker studies use the DSM categories as the ground truth, which, while informative, does not help us move beyond the known validity issues. In this work we present a first step towards addressing the known problems with existing labelling schemes by providing a way for them to be updated by biological data (e.g. brain imaging). Our approach takes an existing categorization as input but then updates it using additional biological data, by assuming there is noise in the assignment process.
Many studies have addressed the label noise issue in different domains, including medical imaging [30, 31, 32, 33, 34, 35]. However, label noise remains a challenge and an open question in computer-aided diagnosis systems. The problem is especially difficult for mood and psychosis disorders, which share symptoms, and is arguably made worse by the use of advanced approaches like deep learning, which also typically require a ground truth [31]. In practice, it is often hard to acquire reliable labels for a classification task; for many reasons, such as insufficient information, expert mistakes, or poor data quality [1], we must deal with labels that have been polluted by label noise [2]. Identifying label noise without additional knowledge of the application domain is a difficult task. Mislabeled instances may appear in any portion of the data, i.e. the training, validation, and test sets, which affects model evaluation and produces unreliable results. In addition, there is often considerable overlap among subjects and no clear boundaries between groups. For all these reasons, and given the lack of a ground truth in neuroimaging studies, identifying label noise becomes extremely difficult.
Voting filtering is one approach that has been proposed to identify label noise [3, 4, 5]; in its consensus form, an instance is removed when all learners agree that it is mislabeled [2]. We approach this from a data cleansing perspective; that is, we first identify the noisy labels. One advantage of a data cleansing approach is that it reduces the complexity of the model. It also increases the performance of the classifier; however, as we saw, it may remove a large number of cases. Indeed, if those cases are truly noisy (either because the label or the labelling system is inaccurate or incomplete), then they should be excluded.
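As a concrete illustration of the consensus (unanimous) voting filter idea, and not the exact pipeline used in this work, the scikit-learn sketch below flags an instance as noisy only when the held-out SVM in every round mislabels it; the function and variable names here are our own.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def consensus_filter(X, y, n_rounds=5, seed=0):
    """Flag an instance as label noise only if it is misclassified by the
    held-out classifier in every round (unanimous voting filter)."""
    wrong = np.zeros((n_rounds, len(y)), dtype=bool)
    for r in range(n_rounds):
        skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed + r)
        for tr, te in skf.split(X, y):
            clf = SVC(kernel="linear").fit(X[tr], y[tr])
            wrong[r, te] = clf.predict(X[te]) != y[te]
    # True where every round's held-out prediction disagrees with the label.
    return wrong.all(axis=0)

# Toy data: two well-separated classes with the first 10 labels flipped.
X, y = make_blobs(n_samples=200, centers=[[-3, -3], [3, 3]],
                  cluster_std=1.0, random_state=0)
y_noisy = y.copy()
y_noisy[:10] = 1 - y_noisy[:10]
noisy_mask = consensus_filter(X, y_noisy)
```

After filtering, the flagged instances can either be discarded or, as in the iterative cleansing step, reassigned to the class the classifiers predict for them.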
One potential concern that may arise is that we do not have a ground truth to assess whether the model is biased by the approach. In the presence of label noise, the validation and test accuracy of the model can be degraded by noisy subjects, and likewise it is not possible to estimate generalization metrics without a ground truth [1]. To validate our model, we applied our method to a simulated handwritten digit dataset, where identification of the label noise is possible since we have the ground truth. We artificially introduced label noise into the simulation dataset by shuffling a portion of the labels. Results showed our method successfully identified the correct labels, and the performance of the model increased significantly on the newly cleansed data. This supports our view that one consequence of label noise is decreased model performance and accuracy. The presence of label noise can obscure the underlying patterns we are trying to discover. Our analysis of real data also supports this: label noise blurred the underlying patterns. The 2D projection using the original labels showed a lack of within-group similarity between subjects in the 2D space, which is inconsistent with the expectation that probands belonging to the same group are related. The cleansed 2D projection, however, showed a gradient between groups, from healthy controls to the most severe Biotype case, B1, or from healthy controls to schizophrenia for the DSM-IV categories.
In terms of voxel-wise results, after data cleansing there were significant sMRI differences among the different patient groups and healthy controls, especially in the precuneus, entorhinal cortex, and cerebellum, which were not identified using the original DSM-IV and Biotype labels. In general, the cleansed data showed many more significant voxels than did the original data, providing possible support for our approach. In addition, the DSM-IV cleansed data showed more significant voxels than the Biotype cleansed data. In the cleansed dataset, reclassified relatives showed brain-region features more similar to those of DSM-IV healthy controls and to Biotype B3 and Biotype normal controls.
Among the evaluated criteria, the label noise found in the DSM-IV categories showed more irregularity than that in the Biotype criteria after iterative classification filtering. Among the 912 subjects across the DSM-IV groups, the schizophrenia group showed the most changes. Comparing the given labels with the newly assigned labels, the proportion of the schizophrenia group was reduced from 26% to 4% of all subjects after convergence, and only 6% of schizophrenia subjects remained in the same category after filtering and relabeling. About 62% of schizophrenia subjects were assigned new labels in the schizoaffective and bipolar disorder categories, and 32% of them were relabeled as healthy controls. Also, out of the 912 subjects, about 6% of the bipolar group, 2% of the healthy controls, and 6% of the schizoaffective group were relabeled as schizophrenia. Among the DSM-IV patient groups, 50% of bipolar, 20% of schizoaffective, and 42% of schizophrenia subjects were relabeled as healthy controls. Of the healthy controls, 56% remained in their group; of the remaining noisy healthy control subjects, 27% were categorized as bipolar, 15% were relabeled as schizoaffective, and 2% were relabeled as schizophrenia.
On the other hand, for the Biotype groups, about 32% of subjects remained in their original Biotype patient groups and 45% of normal controls remained in their group. Normal controls and Biotype 3 (B3) exchanged subjects with each other more than any other pair of groups in the Biotype criterion. This may be because the Biotype categories are based more on neuromarkers than on clinical symptoms, and Biotype 3 is biologically the most similar to normal controls [20]. In the Biotype cleansed, relabeled dataset, 51% of relatives were categorized as Biotype B3, 24% were labeled as normal controls, 17% were categorized as Biotype B2, and 8% were categorized as Biotype B1.
In this work, we focused only on biological features extracted from sMRI and proposed an approach to ‘update’ an existing labelling scheme with biological data. This is just an initial step and does not represent a final solution. Reclassifying subjects does not suggest that a patient is not actually sick or that a healthy person has a mental disorder. It does, however, show very clearly that the current categories do not reflect the underlying biology well. In addition, the facts that 1) the relabeled data showed a clear gradient from the most to the least severe categories, and 2) the cleansed data showed more voxel-wise group differences, in regions consistent with what might be expected, provide intriguing evidence and support continued work in this direction. Future work could include incorporating multiple types of data (e.g. EEG, sMRI, and fMRI). In addition, allowing the approach to develop new categories (e.g. via splitting and merging) is another interesting topic for future work. Ultimately, the results will of course need to be validated clinically.
5. CONCLUSION
In this paper we proposed a novel approach to estimate label noise arising from incorrect diagnostic classification, using iterative classification voting filtering with an SVM model. We applied our method to brain imaging data in the context of a multi-class, imbalanced dataset comprising various psychosis disorders and healthy controls. Our method improved the accuracy of the deep learning model even after the first iteration; overall accuracy increased and converged as the classification filtering and relabeling steps were iterated. The method provides a promising approach for feature extraction from brain images even on noisy datasets, by assigning new labels to inconsistently diagnosed/labelled individuals over multiple iterations. It identifies noisy-label samples and suggests new labels for them using only the current noisy dataset, without requiring extra clean samples, by estimating appropriate models and finding optimal hyperparameters for each classification task. Our results show that although the proportions of label noise were similar in the multiclass Biotype and DSM-IV categories, the label noise distributions were more irregular in DSM-IV than in the Biotype categories; however, the DSM-IV data showed more significant voxels when evaluating group differences on the cleansed data. Both approaches showed a clear gradient from the most to the least clinically severe groups. Our hope is that this represents an initial step towards a semi-blind categorization approach informed by both clinical and high-dimensional biological data.
6. ACKNOWLEDGMENTS
Research reported in this work was supported by the National Institute of Mental Health under award numbers R01EB005846 and 1R01MH104680.
REFERENCES
- [1]. Frenay B and Verleysen M, "Classification in the presence of label noise: a survey," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 5, pp. 845–869, 2013.
- [2]. Bross I, "Misclassification in 2 × 2 tables," Biometrics, vol. 10, no. 4, pp. 478–486, 1954.
- [3]. Hadgu A, "The discrepancy in discrepant analysis," The Lancet, vol. 348, no. 9027, pp. 592–593, 1996.
- [4]. Abrol A, Rokham H and Calhoun VD, "Diagnostic and prognostic classification of brain disorders using residual learning on structural MRI data," in 41st Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Berlin, 2019.
- [5]. Wang Z, Meda SA, Keshavan MS, Tamminga CA, Sweeney JA, Clementz BA, Schretlen DJ, Calhoun VD, Lui S and Pearlson GD, "Large-scale fusion of gray matter and resting-state functional MRI reveals common and distinct biological markers across the psychosis spectrum in the B-SNIP cohort," Frontiers in Psychiatry, vol. 6, p. 174, 2015.
- [6]. Pearlson GD, Clementz BA, Sweeney JA, Keshavan MS and Tamminga CA, "Does biology transcend the symptom-based boundaries of psychosis?," Psychiatric Clinics, vol. 39, no. 2, pp. 165–174, 2016.
- [7]. Insel TR and Cuthbert BN, "Brain disorders? Precisely," Science, vol. 348, no. 6234, pp. 499–500, 2015.
- [8]. Langlotz CP, Allen B, Erickson BJ, Kalpathy-Cramer J, Bigelow K, Cook TS, Flanders AE, Lungren MP, Mendelson DS, Rudie JD, Wang G and Kandarpa K, "A roadmap for foundational research on artificial intelligence in medical imaging: from the 2018 NIH/RSNA/ACR/The Academy Workshop," Radiology, vol. 291, no. 3, pp. 781–791, 2019.
- [9]. Hickey RJ, "Noise modelling and evaluating learning from examples," Artificial Intelligence, vol. 82, no. 1–2, pp. 157–179, 1996.
- [10]. Quinlan JR, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.
- [11]. Zhu X and Wu X, "Class noise vs. attribute noise: a quantitative study," Artificial Intelligence Review, vol. 22, no. 3, pp. 177–210, 2004.
- [12]. Rolnick D, Veit A, Belongie S and Shavit N, "Deep learning is robust to massive label noise," arXiv preprint arXiv:1705.10694, 2017.
- [13]. Sun C, Shrivastava A, Singh S and Gupta A, "Revisiting unreasonable effectiveness of data in deep learning era," in Proceedings of the IEEE International Conference on Computer Vision, 2017.
- [14]. Tamminga CA, Pearlson G, Keshavan M, Sweeney J, Clementz B and Thaker G, "Bipolar and schizophrenia network for intermediate phenotypes: outcomes across the psychosis continuum," Schizophrenia Bulletin, vol. 40, no. Suppl_2, pp. S131–S137, 2014.
- [15]. Tamminga CA, Ivleva EI, Keshavan MS, Pearlson GD, Clementz BA, Witte B, Morris DW, Bishop J, Thaker GK and Sweeney JA, "Clinical phenotypes of psychosis in the Bipolar-Schizophrenia Network on Intermediate Phenotypes (B-SNIP)," American Journal of Psychiatry, vol. 170, no. 11, pp. 1263–1274, 2013.
- [16]. Ivleva EI, Bidesi AS, Keshavan MS, Pearlson GD, Meda SA, Dodig D, Moates AF, Lu H, Francis AN, Tandon N and others, "Gray matter volume as an intermediate phenotype for psychosis: Bipolar-Schizophrenia Network on Intermediate Phenotypes (B-SNIP)," American Journal of Psychiatry, vol. 170, no. 11, pp. 1285–1296, 2013.
- [17]. Jack CR Jr, Bernstein MA, Fox NC, Thompson P, Alexander G, Harvey D, Borowski B, Britson PJ, Whitwell JL, Ward C and others, "The Alzheimer's Disease Neuroimaging Initiative (ADNI): MRI methods," Journal of Magnetic Resonance Imaging, vol. 27, no. 4, pp. 685–691, 2008.
- [18]. Clementz BA, Sweeney JA, Hamm JP, Ivleva EI, Ethridge LE, Pearlson GD, Keshavan MS and Tamminga CA, "Identification of distinct psychosis biotypes using brain-based biomarkers," American Journal of Psychiatry, vol. 173, no. 4, pp. 373–384, 2015.
- [19]. Clementz BA, Sweeney JA, Hamm JP, Ivleva EI, Ethridge LE, Pearlson GD, Keshavan MS and Tamminga CA, "Identification of distinct psychosis biotypes using brain-based biomarkers," American Journal of Psychiatry, vol. 173, no. 4, pp. 373–384, 2015.
- [20]. Ivleva EI, Clementz BA, Dutcher AM, Arnold SJ, Jeon-Slaughter H, Aslan S, Witte B, Poudyal G, Lu H, Meda SA and others, "Brain structure biomarkers in the psychosis biotypes: findings from the Bipolar-Schizophrenia Network for Intermediate Phenotypes," Biological Psychiatry, vol. 82, no. 1, pp. 26–39, 2017.
- [21]. Ashburner J and Friston KJ, "Unified segmentation," NeuroImage, vol. 26, no. 3, pp. 839–851, 2005.
- [22]. Mwangi B, Tian TS and Soares JC, "A review of feature reduction techniques in neuroimaging," Neuroinformatics, vol. 12, no. 2, pp. 229–244, 2014.
- [23]. Bellman RE, Adaptive Control Processes: A Guided Tour, vol. 2045, Princeton University Press, 2015.
- [24]. Japkowicz N and Stephen S, "The class imbalance problem: a systematic study," Intelligent Data Analysis, vol. 6, no. 5, pp. 429–449, 2002.
- [25]. Provost F, "Machine learning from imbalanced data sets 101," in Proceedings of the AAAI'2000 Workshop on Imbalanced Data Sets, 2000.
- [26]. Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L and Lerer A, "Automatic differentiation in PyTorch," 2017.
- [27]. van der Maaten L and Hinton G, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
- [28]. Allsopp K, Read J, Corcoran R and Kinderman P, "Heterogeneity in psychiatric diagnostic classification," Psychiatry Research, vol. 279, pp. 15–22, 2019.
- [29]. Olbert CM, Gala GJ and Tupler LA, "Quantifying heterogeneity attributable to polythetic diagnostic criteria: theoretical framework and empirical application," Journal of Abnormal Psychology, vol. 123, no. 2, p. 452, 2014.
- [30]. Calli E, Sogancioglu E, Scholten ET, Murphy K and van Ginneken B, "Handling label noise through model confidence and uncertainty: application to chest radiograph classification," in Medical Imaging 2019: Computer-Aided Diagnosis, 2019.
- [31]. Xue C, Dou Q, Shi X, Chen H and Heng PA, "Robust learning at noisy labeled medical images: applied to skin lesion classification," arXiv preprint arXiv:1901.07759, 2019.
- [32]. Pechenizkiy M, Tsymbal A, Puuronen S and Pechenizkiy O, "Class noise and supervised learning in medical domains: the effect of feature extraction," in 19th IEEE Symposium on Computer-Based Medical Systems (CBMS'06), 2006.
- [33]. Gamberger D, Lavrac N and Groselj C, "Experiments with noise filtering in a medical domain," in ICML, 1999, pp. 143–151.
- [34]. Ji S and Ye J, "Generalized linear discriminant analysis: a unified framework and efficient model selection," IEEE Transactions on Neural Networks, 2008.
- [35]. Robbins K, Joseph S, Zhang W, Rekaya R and Bertrand J, "Classification of incipient Alzheimer patients using gene expression data: dealing with potential misdiagnosis," Online Journal of Bioinformatics, vol. 7, no. 1, pp. 22–31, 2006.
- [36]. Chapelle O, Scholkopf B and Zien A (eds.), Semi-Supervised Learning, MIT Press, 2006.
- [37]. Dawid AP and Skene AM, "Maximum likelihood estimation of observer error-rates using the EM algorithm," Journal of the Royal Statistical Society: Series C (Applied Statistics), vol. 28, no. 1, pp. 20–28, 1979.
- [38]. Frenay B, Kaban A and others, "A comprehensive introduction to label noise," 2014.
- [39]. Brodley CE and Friedl MA, "Identifying mislabeled training data," Journal of Artificial Intelligence Research, vol. 11, pp. 131–167, 1999.
- [40]. Khoshgoftaar TM and Rebours P, "Generating multiple noise elimination filters with the ensemble-partitioning filter," in Proceedings of the 2004 IEEE International Conference on Information Reuse and Integration (IRI 2004), pp. 369–375.
