Abstract
Mental illness, a serious problem across the globe, requires multi-pronged solutions, including effective computational models to predict illness. Mental illness diagnosis is complicated by pronounced sharing of symptoms and mutual predispositions. Set in this context, we offer a systematic comparison of seven deep learning and two conventional machine learning models for predicting mental illness from the history of present illness free-text descriptions in patient records. The models tested include a new architecture, CB-MH, which ranks best for F1 (0.62), while another attention model is best for F2 (0.71). We also explore model decisions using the Integrated Gradients interpretability method, which we use to identify key influential features. Overall, the majority of true positives have key features appearing in meaningful contexts. False negatives are the most challenging, with most key features appearing in unclear contexts. False positives are mostly true positives in actuality, as supported by a small-scale clinician-based judgement study.
Introduction
Mental illness is the leading cause of disability globally: one in five adults in the U.S. lives with it19. Even prior to the COVID-19 pandemic, mental illness and suicidal ideation were increasing among adults32. The same report shows that the number of people seeking help for moderate or severe anxiety and depression has increased alarmingly. These trends emphasize the importance of multi-pronged strategies to subdue the upsurge in mental health challenges; prediction models able to assist in the effective diagnosis of mental illness are a key component.
The national drive towards Electronic Health Records (EHRs) has resulted in vast collections of multimodal, structured and semi-structured patient records that compile information from various patient encounters and activities. These are naturally seen as resources for identifying patients with specific diseases or disorders, including those related to mental health. While EHRs are generally encoded with disease codes from systems such as ICD-9/10/CM, there are well recognized limitations when these codes are used for secondary purposes such as phenotype or cohort identification; for example, the codes are known to be incomplete and error-prone16,31. This has generated active research in building models that rely more on the free-text portions of patient EHRs, for instance to predict disease states and prognosis. Thus we see papers and reviews on machine learning methods, both conventional and more modern deep learning (DL), for building prediction models from free-text EHR data30,31.
Despite the excitement over the development of DL models in health care applications15,21, the emphasis on mental illness diagnosis from the free-text portions of the clinical record has been minimal; in contrast, there has been active research with traditional machine learning models2. A 2020 review28 of DL in mental health identifies fewer than five papers exploring the free-text portions of EHRs, and these were directed at different goals, including symptom severity33,34 and cohort definition35. The one paper directly about mental illness prediction from free text is by Tran et al.23, who explore CNNs and RNNs over the history of present illness portions of clinical notes to predict 11 mental health conditions; its limitation is a small dataset of only 1,000 notes. Thus, it is still unknown how well deep learning models perform at predicting mental health conditions from the free-text portions of EHRs, and consequently which DL models are best suited for this problem.
The problem of predicting mental illness is particularly challenging because, unlike for typical chronic conditions, laboratory tests to aid diagnosis are not generally available28. Instead there is reliance on patient observation, self-reports, responses to questionnaires, etc., and these impressions are generally summarized in the patient notes. Another challenge is that mental diseases share pronounced underlying biology5. For example, people with depression share symptoms, such as irritability and sleep problems, with those having generalized anxiety disorder17,18. Moreover, multiple problems may co-exist, with one predisposing another13. Any predictive model has to make fine distinctions between classes in this multi-label, multi-class setting.
Set in this context, our goal is to rigorously compare deep learning (DL) methods in their ability to predict mental illnesses from free-text clinical notes; we also include conventional machine learning algorithms in our comparison. We systematically compare nine models representing CNNs, LSTMs, different mechanisms for including attention components (exploring both self- and soft-attention mechanisms), transformer-based models, and conventional SVMs and Logistic Regression. We also propose a DL model, Convolutional BiLSTM with Multi-Head attention (CB-MH), whose architecture includes a multi-head attention mechanism. We compare the models using a large collection of 150,085 psychiatry clinical notes (history of present illness portions) spanning 10 years, making comparisons with both F1 and F2 measures. We then go deeper into one best-performing DL model to better understand its predictions. For this we utilize the Integrated Gradients22 interpretability methodology, which allows us to identify key features, with weights designating their relative importance in model decisions. We use these key features to examine the reasons for true positive decisions as well as the reasons for errors (both false positives and false negatives). We close with a small-scale judgement study involving two clinicians, who provide further insights on the nature of a sample of erroneous model decisions.
Under Materials and Methods we describe our dataset and the DL and traditional ML models we compare. The Results section follows, and then a section analyzing the decisions made by a selected DL model for one disease. Discussion, Conclusion, and future work sections close the paper.
Materials and Methods
Dataset
We used a 10-year span (2008-2017) of de-identified patient visit data from the psychiatry outpatient clinic of a leading midwestern hospital. There are 203,466 clinical visit records corresponding to 15,986 distinct patients. Each record has a de-identified patient id, visit date, notes on history of present illness (average length 219 words), and diagnoses in the form of ICD9 codes. After removing duplicates and records without a mental health diagnosis, there are 150,085 records for 14,916 patients. Domain experts grouped low frequency ICD labels resulting in a final set of 8 codes/diagnoses for our prediction goals. Table 1 describes our data.
Table 1:
Data distribution of Full set with 150,085 records (Since a record can have multiple class labels the columns do not add up to 100%.)
| Diagnosis | All | Train | Val | Test |
|---|---|---|---|---|
| Unipolar Depression | 51% | 51% | 49% | 51% |
| Other | 45% | 46% | 43% | 45% |
| Anxiety Disorders | 23% | 23% | 20% | 23% |
| Substance Use Disorders | 19% | 18% | 23% | 19% |
| PTSD_OCD_PanicDisorder | 19% | 19% | 16% | 21% |
| Cluster B Personality Traits | 16% | 16% | 16% | 17% |
| Psychotic Disorders | 14% | 14% | 16% | 12% |
| Bipolar Disorders | 12% | 11% | 13% | 14% |
Besides performing experiments on the Full set of 150,085 records, we also experimented on a sample (Sample) of 15,009 records (10% of Full) to see whether results scale well. Unlike most prior work, we explore this question while considering prediction confidence levels. The smaller set is obtained using stratified sampling to maintain the same distribution of diagnoses as in the Full set (Table 1) within a 1% difference. The data were split into training (65%), validation (15%), and testing (20%); all records of a given patient fall entirely within training, validation, or testing.
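The patient-level split described above can be sketched with scikit-learn's GroupShuffleSplit, which splits by patient so that no patient's records straddle partitions. The patient ids below are toy stand-ins for our data, not the actual records:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for the visit records: one patient id per record.
patient_ids = np.array([1, 1, 2, 2, 2, 3, 4, 4, 5, 5])
records = np.arange(len(patient_ids))

# First carve off ~20% of patients for testing (fractions apply to groups,
# so record-level fractions are approximate)...
gss = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=0)
trainval_idx, test_idx = next(gss.split(records, groups=patient_ids))

# ...then split the remainder into train (65%) and validation (15%) overall,
# i.e. validation is 0.15 / 0.80 of the remaining records.
gss2 = GroupShuffleSplit(n_splits=1, test_size=0.15 / 0.80, random_state=0)
rel_train, rel_val = next(gss2.split(trainval_idx, groups=patient_ids[trainval_idx]))
train_idx, val_idx = trainval_idx[rel_train], trainval_idx[rel_val]

# No patient appears in more than one partition.
parts = [set(patient_ids[idx]) for idx in (train_idx, val_idx, test_idx)]
assert parts[0].isdisjoint(parts[1]) and parts[0].isdisjoint(parts[2])
assert parts[1].isdisjoint(parts[2])
```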
Deep Learning Models Tested
BERT: This is a bidirectional language model based on the transformer24 and is trained on two tasks: masked language modeling (MLM) and next sentence prediction (NSP). MLM helps achieve bidirectionality and NSP learns sentence relationships during training. BERT is pretrained on a large corpus (English Wikipedia and BooksCorpus) and fine-tuned for downstream NLP tasks. We used the pretrained uncased BERT base model4, modifying the last layer for our prediction task and fine-tuning it.
LSTM: This7 is the standard sequence model (an extension of the RNN) involving an input gate, forget gate, output gate, and a cell state that carries information across time steps. The gates control the flow of input information (i.e., what to remember or forget) and help handle context better than an RNN7. Equations representing these gates are as in Lipton et al.11. At an abstract level, we represent the final learned representation of the text as $\overrightarrow{H} = \mathrm{LSTM}(x)$, where $x$ is the input clinical note and the hyperparameter $h_n$ represents the size of the hidden layer.
BiLSTM: Here two LSTMs are combined, one processing the text in the forward direction and the other in the backward direction; both LSTMs have the same dimensions. Representing the text $x$ processed in the reverse direction as $\overleftarrow{H} = \mathrm{LSTM}(\mathrm{reverse}(x))$, the final representation $H$ is as in Eq. 1, where $H \in \mathbb{R}^{n \times 2h_n}$ for an $n$-word note.
$H = [\overrightarrow{H}; \overleftarrow{H}]$        (1)
Based on insights from He et al.6 and preliminary experimentation with the validation data, we found it effective to add a global max pooling layer followed by a dense layer with ReLU activation before the final output layer.
CNN: This is a standard CNN with convolutional filters and max pooling followed by a dense layer with a non-linear activation function ReLU9.
CNN-BiLSTM: The objective here is to take advantage of the CNN convolutional layer and BiLSTM hidden representations. Thus we built this model by adding a BiLSTM on top of the CNN.
CB-Atn: This is our first model exploring attention. Here we add a soft attention layer1 on top of the CNN-BiLSTM architecture. We use $H$ from Eq. 1 to obtain word-level attention weights $a$ using Eqs. 2a-2b. These attention weights indicate the relative importance of different portions of the input text. The final vector representation $f$, weighted by attention, uses Eq. 2c, where $a \in \mathbb{R}^n$ and $c$ is a learned context vector.
$u_t = \tanh(W H_t + b)$        (2a)
$a_t = \dfrac{\exp(u_t^{\top} c)}{\sum_{t'=1}^{n} \exp(u_{t'}^{\top} c)}$        (2b)
$f = \sum_{t=1}^{n} a_t H_t$        (2c)
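As a concrete sketch, the soft-attention computation of Eqs. 2a-2c can be written as a NumPy forward pass. The dimensions and random weights below are illustrative assumptions, not the trained parameters:

```python
import numpy as np

def soft_attention(H, W, b, c):
    """H: (n, 2h) BiLSTM outputs; W: (2h, 2h); b: (2h,); c: (2h,) learned context vector."""
    u = np.tanh(H @ W + b)                     # Eq. 2a: hidden representation per word
    scores = u @ c                             # alignment of each word with context vector c
    a = np.exp(scores) / np.exp(scores).sum()  # Eq. 2b: softmax -> word-level weights
    f = (a[:, None] * H).sum(axis=0)           # Eq. 2c: attention-weighted sum over words
    return a, f

rng = np.random.default_rng(0)
n, d = 5, 8                                    # toy sizes: 5 words, 2h = 8
H = rng.standard_normal((n, d))
a, f = soft_attention(H, rng.standard_normal((d, d)),
                      rng.standard_normal(d), rng.standard_normal(d))
assert np.isclose(a.sum(), 1.0)                # attention weights form a distribution
assert f.shape == (d,)
```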
CB-MH: Inspired by work on the transformer model24, we propose a new architecture with a multi-head self-attention mechanism in place of the single-head soft attention. A single head will produce a different attention distribution each time it is run. Our insight is that a multi-head architecture will reduce this variation across distributions, thus attending to similar phrases. We therefore capture attention multiple ($r$) times with different weight matrices $W_i^Q$, $W_i^K$, $W_i^V$ for $i = 1, \ldots, r$, with a final concatenation of the heads using Eq. 3a. Like CB-Atn, this model builds off CNN-BiLSTM.
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_r)\, W^O$        (3a)
$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$        (3b)
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{Q K^{\top}}{\sqrt{d_k}}\right) V$        (3c)
Here $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$ are learned parameters, and $d_k = d_v$, where $d_v$ represents the hidden size per head. The configuration for hidden size and the number of attention heads is the same as in BERTBASE4. In Eq. 3c, the attention function maps a set of queries (Q) and key-value pairs (K-V) to an output. In our case, we set Q, K, and V all to the output of the BiLSTM layer (Eq. 1), i.e., MultiHead(H, H, H). Our code for CB-MH is available at https://github.com/IngrojShrestha/CB-MH.
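A minimal NumPy sketch of the multi-head mechanism of Eqs. 3a-3c, with Q = K = V = H as in CB-MH. The sizes are toy assumptions rather than the 768-dimension, 12-head configuration used in the model:

```python
import numpy as np

def attention(Q, K, V):
    """Eq. 3c: scaled dot-product attention."""
    dk = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

def multi_head(H, Wq, Wk, Wv, Wo):
    """Eqs. 3a-3b with Q = K = V = H; Wq/Wk/Wv are lists of r per-head projections."""
    heads = [attention(H @ wq, H @ wk, H @ wv) for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo      # Eq. 3a: concat heads, then project

rng = np.random.default_rng(1)
n, d_model, r = 6, 12, 3          # toy sizes; the paper uses hidden size 768 and 12 heads
dv = d_model // r                 # hidden size per head (d_k = d_v)
Wq = [rng.standard_normal((d_model, dv)) for _ in range(r)]
Wk = [rng.standard_normal((d_model, dv)) for _ in range(r)]
Wv = [rng.standard_normal((d_model, dv)) for _ in range(r)]
Wo = rng.standard_normal((r * dv, d_model))
out = multi_head(rng.standard_normal((n, d_model)), Wq, Wk, Wv, Wo)
assert out.shape == (n, d_model)  # one contextualized vector per word
```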
All models were configured with optimal hyperparameter settings using validation data. Table 2 lists the key parameters explored and the selected values.
Table 2:
Parameters for deep learning models (tuning ranges, hidden layer: [64, 128, 256]; embedding dim.: [200, 400]; dropout: [0.5, 0.6, 0.7, 0.8]; global output threshold: [0.2, 0.4, 0.5, 0.6, 0.8]; kernel size: [3, 4, 5, (3,4,5), (2,3,4,5)]). Dropout is applied in the penultimate layer.
| Model | Parameters (Optimal values selected) |
|---|---|
| General | embedding dim. (400), learning rate (0.001), batch size (100), IG steps (m = 100)22, global output threshold (0.2), epochs (10) |
| BERTBASE | default from Devlin et al.4 |
| LSTM | hidden layer size (128) |
| BiLSTM | hidden layer size (128), patience (2), L2 regularization (λ = 0.0001), dense(128) |
| CNN | kernel size (2,3,4,5), stride (1), pool size (4), L2 (λ = 0.0001), dropout (0.5), dense (128) |
| CNN-BiLSTM | CNN kernel size (3), dropout (0.7), dense (128) |
| CB-Atn | cv (128), dropout (0.7), dense (128) |
| CB-MH | attention head (12), hidden size (768), dropout (0.7) |
General settings for DL models: The input texts had stopwords removed and were stemmed. We use pre-padding and pre-truncation to handle variable sequence lengths. Input text tokens were represented by 400-dimensional pre-trained embedding vectors built with word2vec14 trained on PubMed abstracts from the PubMed 2018 baseline29. The final output layer (of size L, the number of classes) uses a sigmoid function to output the probability of prediction for each label $l \in \{1, \ldots, L\}$. All DL models learn their optimal parameters by minimizing the binary cross-entropy loss in Eq. 4. We use the Adam optimizer with a learning rate of 0.001 and implement the models in Keras, using one NVIDIA Tesla P100 PCIE (16 GB) GPU.
$\mathcal{L} = -\dfrac{1}{L} \sum_{l=1}^{L} \left[\, y_l \log(\hat{y}_l) + (1 - y_l) \log(1 - \hat{y}_l) \,\right]$        (4)
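For illustration, the multi-label binary cross-entropy loss can be computed directly from the sigmoid outputs. The labels and probabilities below are a toy example, and we average over the L outputs:

```python
import numpy as np

def multilabel_bce(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy averaged over the L sigmoid outputs."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # guard against log(0)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return losses.mean()

# A toy note carrying two of L = 4 diagnosis labels.
y_true = np.array([1, 1, 0, 0])
y_prob = np.array([0.9, 0.6, 0.2, 0.1])  # hypothetical model outputs
loss = multilabel_bce(y_true, y_prob)
assert loss > 0
# A perfect prediction drives the loss to (near) zero.
assert multilabel_bce(y_true, y_true.astype(float)) < 1e-6
```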
Traditional Machine Learning Models Tested
We test (1) Logistic Regression (LR) and (2) SVM, both with L2 regularization. Each note is represented by a TF-IDF-weighted word unigram vector after stopword removal. For LR, no stemming is optimal, while for the SVM, stemming (Porter's stemmer26) is optimal. We create one SVM/LR classifier for each class using scikit-learn20.
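A minimal sketch of this setup with scikit-learn, for a single class. The notes and labels are toy stand-ins; the real models train one classifier per diagnosis over the full TF-IDF vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy notes with binary labels for one class (e.g., depression present).
notes = [
    "patient reports depressed mood and poor sleep",
    "follow up for anxiety no depressive symptoms",
    "evaluation and treatment of depression ongoing",
    "panic attacks with no low mood reported",
]
labels = [1, 0, 1, 0]

# TF-IDF unigram features with stopword removal, then an L2-regularized LR.
clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(penalty="l2", max_iter=1000),
)
clf.fit(notes, labels)
pred = clf.predict(["presents for treatment of depression"])
assert pred.shape == (1,) and pred[0] in (0, 1)
```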
Performance Measures
We report micro-averaged recall (R), precision (P), F1, and F2 scores. F2, which weights recall twice as heavily as precision, is important in medicine to minimize the risk of missing a positive, i.e., of making Type II errors. Micro-averaging is appropriate as each visit is considered equally important.
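Micro-averaging pools true positives, false positives, and false negatives over all (visit, label) cells before computing the scores; for example, with scikit-learn on a toy multi-label prediction matrix:

```python
import numpy as np
from sklearn.metrics import fbeta_score

# Toy multi-label matrix: rows = visits, columns = diagnoses (3 of the 8 here).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])

# Pooled counts: TP = 4, FP = 0, FN = 2 -> P = 1.0, R = 2/3.
f1 = fbeta_score(y_true, y_pred, beta=1, average="micro")
f2 = fbeta_score(y_true, y_pred, beta=2, average="micro")  # recall weighted 2x precision

assert np.isclose(f1, 0.8)       # 2PR / (P + R)
assert np.isclose(f2, 10 / 14)   # 5PR / (4P + R); lower because recall lags
```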
Model Comparison Results
Table 3 presents the results. We use the bootstrapping shift test25 to estimate 95% confidence intervals for comparing models. Specifically, we sample the test set with replacement 10K times; for each sample we calculate both F1 and F2 per model, and we then generate 95% confidence intervals over these 10K bootstraps for each model and measure.
Table 3:
Results with 95% confidence intervals (p-value < 0.05)
| Model | F1 (Sample) | F2 (Sample) | F1 (Full) | F2 (Full) |
|---|---|---|---|---|
| SVM | 0.50 [0.492, 0.516] | 0.47 [0.454, 0.478] | 0.54 [0.534, 0.541] | 0.49 [0.488, 0.496] |
| LR | 0.51 [0.494, 0.518] | 0.44 [0.436, 0.457] | 0.55 [0.547, 0.554] | 0.49 [0.491, 0.498] |
| BERT | 0.57 [0.559, 0.576] | 0.68 [0.670, 0.688] | 0.58 [0.576, 0.582] | 0.645 [0.642, 0.648] |
| LSTM | 0.53 [0.526, 0.543] | 0.63 [0.624, 0.644] | 0.54 [0.533, 0.539] | 0.55 [0.548, 0.555] |
| BiLSTM | 0.56 [0.553, 0.571] | 0.68 [0.668, 0.687] | 0.60 [0.597, 0.602] | 0.70 [0.694, 0.700] |
| CNN | 0.52 [0.513, 0.532] | 0.57 [0.561, 0.582] | 0.57 [0.570, 0.576] | 0.65 [0.645, 0.651] |
| CNN-BiLSTM | 0.54 [0.531, 0.548] | 0.68 [0.671, 0.688] | 0.60 [0.599, 0.604] | 0.70 [0.696, 0.701] |
| CB-MH | 0.56 [0.551, 0.567] | 0.68 [0.671, 0.688] | 0.62 [0.613, 0.618] | 0.68 [0.678, 0.685] |
| CB-Atn | 0.55 [0.543, 0.560] | 0.67 [0.658, 0.675] | 0.595 [0.593, 0.598] | 0.71 [0.705, 0.710] |
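The confidence-interval step of the bootstrap procedure described above reduces to resampling the test set with replacement and reading off percentiles. A minimal sketch, using accuracy as a stand-in metric and fewer iterations for speed:

```python
import numpy as np

def bootstrap_ci(y_true, y_pred, metric, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample the test set with replacement and
    recompute the metric each time, then take the (alpha/2, 1-alpha/2) band."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # one bootstrap resample of the test set
        stats.append(metric(y_true[idx], y_pred[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Toy binary predictions; accuracy stands in for the F1/F2 used in the paper.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 1])
lo, hi = bootstrap_ci(y_true, y_pred, lambda t, p: (t == p).mean(), n_boot=2000)
assert 0.0 <= lo <= hi <= 1.0
```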
We see that F1 and F2 scores are significantly higher on the Full dataset than on Sample, with only four exceptions among the 18 comparisons (9 models × 2 measures); for example, the F2 scores for our CB-MH model are statistically equivalent on the two datasets. In some cases, on moving from Sample to Full, (a) F1 improved but F2 worsened, or (b) F1 improvements are significant while F2 improvements are marginal. When we move from Sample to Full, precision (P) improved but recall (R) decreased for some models; moreover, the magnitude of change in precision was relatively large compared to that for recall in the 2 cases (out of 9) where F2 drops while F1 increases. Most importantly, the confidence intervals for Full are tighter than for Sample, making the DL models better distinguishable. For instance, in Sample, CB-MH and BiLSTM do not have significantly different scores, whereas in Full they do. This is consistent with expectations, as DL models are more stable given more data. Thus, in our remaining analyses we focus on the Full dataset results.
With one exception (LSTM in F1), all DL models are significantly better than SVM and LR in both F1 and F2. The smallest statistically significant improvements are 3.6% in F1 (CNN) and 12.2% in F2 (LSTM), the highest 12.7% in F1 (CB-MH) and 44.9% in F2 (CB-Atn).
We find that F2 scores (minimizing Type II errors) are systematically higher than F1 scores for the DL models. For SVM and LR, in contrast, F1 exceeds F2: these models appear to favor precision over recall. But as noted earlier, they do not do as well as the DL models.
The attention models perform the best. CB-MH is the best for F1 (0.62, 95% CI: [0.613, 0.618]). As described earlier, this model involves a multi-head attention (12 heads) architecture. For F2 the best is CB-Atn at 0.71 (95% CI: [0.705, 0.710]), which uses a soft attention component. While our CB-MH comes close (F2 of 0.68), there is a small difference of 0.02 between their confidence intervals. The use of multi-head attention seems to provide an advantage in making more precise predictions, with a penalty in recall that brings down F2. BiLSTM and CNN-BiLSTM are second best on both F1 and F2, and LSTM is the weakest. The unidirectional sequence-based model (LSTM) is not better than the non-sequential CNN; on the other hand, bidirectionality makes BiLSTM better than CNN. Also, BERT, pretrained on Wikipedia and BooksCorpus, is likely limited for biomedicine.
Table 4 provides class-level results (ignoring the Other class) for CB-MH, with rows ordered by decreasing data size. With one strong exception (Psychotic Disorders), we see higher F1 with larger class size. The pattern is less consistent for F2 due to differences in the relative success of recall and precision: e.g., recall is favored for Unipolar Depression and Cluster B Personality Traits and precision for Bipolar Disorders. Exploring influential factors behind these patterns in recall versus precision (beyond dataset size variations) is left to future work. For completeness, we report macro averages across classes for CB-MH and for the best traditional machine learning approach (LR). The percentage improvement of our proposed model over LR is 18.8% in macro F1 and 47.5% in macro F2; these improvements are in fact larger than the micro-average improvements reported earlier (12.7% F1, 38.8% F2, Table 3).
Table 4:
Classwise performance for CB-MH, with macro averages for CB-MH and LR
| Diagnosis | P | R | F1 | F2 |
|---|---|---|---|---|
| Unipolar Depression | 0.58 | 0.96 | 0.72 | 0.85 |
| Anxiety Disorders | 0.42 | 0.56 | 0.48 | 0.53 |
| PTSD_OCD_PanicDisorder | 0.57 | 0.51 | 0.54 | 0.52 |
| Substance Use Disorders | 0.56 | 0.47 | 0.51 | 0.49 |
| Cluster B Personality Traits | 0.42 | 0.63 | 0.51 | 0.57 |
| Psychotic Disorders | 0.73 | 0.74 | 0.73 | 0.74 |
| Bipolar Disorders | 0.60 | 0.42 | 0.49 | 0.45 |
| macro averages (CB-MH) | 0.55 | 0.61 | 0.57 | 0.59 |
| macro averages (LR) | 0.70 | 0.37 | 0.48 | 0.40 |
Analysis of Correct and Erroneous Model Predictions
We analyze model predictions using Integrated Gradients (IG)22, an interpretability (explainability) technique. We analyze both correct decisions (focusing on true positives (TP)) and erroneous decisions, both false positives (FP) and false negatives (FN) (see the error analysis sections). We focus our analysis on decisions made by the CB-Atn model, which had the highest F2 score, and on the class unipolar depression. Our CB-MH model is an equally viable candidate for analysis since it leads in F1; however, we chose to analyze the model with fewer Type II errors. First we present an overview of the IG-based analysis approach and then our observations.
Integrated Gradients (IG) estimates the extent to which each word feature contributes to model predictions, based on the gradients of the model with respect to the input. At a high level the method operates as follows. We are given the embedding vector representation $e$ of the input text, a baseline zero embedding vector $e'$, and $F(\cdot)$, the model function that produces the output vector. The method computes a series of $m$ interpolated vectors (here $m$ = 100) between $e'$ and $e$. Gradients for each interpolated vector are then computed; the gradient for an input feature is the partial derivative of $F(\cdot)$ with respect to that feature $e_i$ (here, features are words). The final integrated gradient score for a word feature, $IG_i$, is the average of its $m$ gradients multiplied by $(e_i - e'_i)$. This calculation follows Sundararajan et al.22 and is shown in Eq. 5.
$IG_i(e) = (e_i - e'_i) \times \dfrac{1}{m} \sum_{k=1}^{m} \dfrac{\partial F\!\left(e' + \frac{k}{m}(e - e')\right)}{\partial e_i}$        (5)
IG scores are normalized from (−∞, +∞) to [0, 1], with negative values mapped to [0, 0.5) and positive values to (0.5, 1]. A feature with a positive value indicates that its presence increases the likelihood of the model predicting a label; a negative value indicates that the feature's presence in the note reduces that likelihood. Scores close to 0.5 signal neutral features.
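Eq. 5 reduces to a Riemann-sum average of gradients along the straight path from the baseline to the input. A minimal NumPy sketch, using a toy linear scorer (an illustrative assumption) for which the attributions are exact and satisfy IG's completeness property, summing to $F(e) - F(e')$:

```python
import numpy as np

def integrated_gradients(F_grad, e, e_base, m=100):
    """Average the gradient at m points interpolated between the baseline
    e_base and the input e, then scale by (e - e_base), per Eq. 5."""
    alphas = np.arange(1, m + 1) / m  # k/m for k = 1..m
    grads = np.array([F_grad(e_base + a * (e - e_base)) for a in alphas])
    return (e - e_base) * grads.mean(axis=0)

# Toy linear scorer F(x) = w @ x, whose gradient is simply w everywhere.
w = np.array([0.5, -1.0, 2.0])
F = lambda x: w @ x
F_grad = lambda x: w
e = np.array([1.0, 2.0, 3.0])   # "embedding" of the input words
e_base = np.zeros_like(e)       # zero baseline, as in our setup
ig = integrated_gradients(F_grad, e, e_base)

# Completeness: attributions sum to F(e) - F(baseline).
assert np.isclose(ig.sum(), F(e) - F(e_base))
```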
Analysis strategy
Given a note of interest, we first use IG to identify the key features used by the CB-Atn model to make its predictions; the key feature(s) are the word(s) with the highest IG scores. We limit features to relevant UMLS semantic types: Mental or Behavioral Dysfunction, Sign or Symptom, Disease or Syndrome, Mental Process, and Pharmacologic Substance. Our code for semantic filtering is available at https://github.com/IngrojShrestha/CB-MH. We analyze a note if it has a key feature with an IG score ≥ 0.51, i.e., one with some positive influence on the prediction. The two authors then manually categorized the surrounding context of k (= 15) words into: (1) positive: context indicates depression present, (2) negative: context indicates depression absent, or (3) unclear: context insufficient to analyze. There was strong agreement between the two sets of judgements (kappa = 0.81; 89.9% agreement).
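The context-extraction step can be sketched as pulling the ±k-word window around a key feature. The note text below is a toy stand-in, and for simplicity the sketch uses only the first occurrence of the feature:

```python
def key_feature_context(tokens, feature, k=15):
    """Return the window of up to k words on each side of the first
    occurrence of the key feature (the context shown to annotators)."""
    i = tokens.index(feature)
    return tokens[max(0, i - k): i + k + 1]

# Toy note; k = 3 keeps the example short (the paper uses k = 15).
note = ("patient presents for evaluation and treatment of depression "
        "reports good mood sleep appetite interest and concentration").split()
ctx = key_feature_context(note, "depression", k=3)
assert ctx == ["and", "treatment", "of", "depression", "reports", "good", "mood"]
```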
Analysis of True Positives
There were 113 TP notes to analyze. The majority (82%) of key features appear in positive contexts, indicating sound reasons for these correct positive decisions; Table 5 shows examples. Notably, in the last example the patient has depression that is improving, making this a correct prediction. While depression appears as a key feature in some notes, it is crucial to note that a simple keyword search using 'depression' on our Full dataset yields very poor results (F1: 0.47, F2: 0.41). The remaining 18% of TP key features appear in unclear or uninterpretable contexts. Thus the majority of the true positive decisions made by the model are sensible and justifiable.
Table 5:
True positive examples (key feature is in bold)
| Key Feature | Context |
|---|---|
| depressive | a 44 year old white male with a history of major depressive disorder extending back to an original episode in 2005 tried on several different medications including |
| exhausted | to take the next dose she has significant anxiety in addition to depression and feels exhausted she said that part of her sleep problems come from the fact that her chronic |
| disorder | mg by mouth daily for 2 weeks then 2 tablets daily thereafter indications major depressive disorder previous medications acetaminophen tylenol 500 mg tablet take 500 mg by mouth 4 times daily |
| depression | 45 y o y o caucasian woman presents for evaluation and treatment of depression pt reports she is doing well pt reports good mood sleep appetite interest and concentration |
Analysis of False Positive and False Negative Errors
False Positives: Key features occur in two types of contexts
The 34 FP notes analyzed were about evenly split between contexts positive for depression and unclear contexts. In the first example in Table 6, the patient is on antidepressants, which indicates that the model prediction is correct. The next two examples indicate that depression continues from the past into the present. In the last example, while the patient is improving, treatment for depression is clearly stated. The interpretations of these FP decisions (and we found several similar ones) point to likely errors of omission in the ICD annotations.
Table 6:
False positive examples (key feature is in bold)
| Key Feature | Context |
|---|---|
| emotion | not want to continue antidepressants because she reported that she is less control over her emotions she recently was in the process of applying for a part time job for approximately |
| depression | name patient states that she was first diagnosed and put on medications for depression in 1991 she was hospitalized once in 1996 she states that typically depression manifests as |
| disorder | that encounter are available the patient reports that she has been diagnosed with major depressive disorder and has been treated for that for nearly 20 years she presents today requesting refills |
| insomnia | some overall improvement review of depressive symptoms shows that she has some initial and intermittent insomnia and fatigue particularly in the early morning she feels she should resume her previous morning |
False negatives: Key features are mostly in unclear contexts
The vast majority (84%) of the 74 FNs analyzed fall into unclear contexts, making this the most challenging class of model errors to analyze. In example 1 in Table 7, the patient note reads: "...Trazodone causes severe daytime sedation so he avoids it... .... He agrees to reduce Remeron to 7.5 ..." The model fails to connect Remeron and Trazodone with unipolar depression, possibly due to insufficient training data. Instead, it views thoughts as the key feature, but its context is not specific enough to predict unipolar depression; Bipolar Disorders is predicted instead. In the second (psychosis) example the model clearly made an error. In example three, the context for anxiety indicates the absence of depression (i.e., a negative context), which could be why the model does not predict unipolar depression; examining the full note, we find no indication of the patient having depression. Similarly, in the last example (paranoia) there is no evidence in the rest of the note. Thus in a few cases the IG interpretation shows that evidence for a label is missing, pointing to possible errors of commission in ICD labeling.
Table 7:
False negative examples (key feature is in bold)
| Key Feature | Context |
|---|---|
| thoughts | having episdoes like this in the past which resolve on there own he denies racing thoughts or worries at night trazodone causes severe daytime sedation so he avoids it once asleep |
| psychosis | on her husband s part the differential diagnosis was considered major depression with psychosis versus psychosis nos and started on zoloft titrated rapidly to 100 mg and risperdal 1 mg qhs |
| anxiety | davies and has a key chain and jacket no evidence of depression psychosis mania or anxiety no alcohol or drugs past psychiatric history |
| paranoia | no current depressive sx does not believe he currently has an anxiety problem no hallucinations paranoia no history of mania past psychiatric history |
Error Analysis with Clinical Expertise
Next we present a small user study to determine whether the model's FN and FP decisions are due to annotation errors.
Sample selection: We first ranked the FP and FN notes by the IG score of their key feature; the top-ranking note is the one whose key feature had the highest weight. Rankings for FN and FP notes were done independently. We then selected the top-ranked 15 notes in each set and removed five notes that had more than 500 words (to make the task a bit easier on the annotators). This gave us a set of 12 FN and 13 FP notes, which were randomly shuffled before being shown to the clinicians.
Manual annotation: Two clinicians independently annotated each note. They were given the full history of present illness and asked to decide whether an ICD code pertaining to depression should or should not be assigned; they were also allowed a 'not sure' category. Four notes had at least one judge mark 'not sure' and the judges disagreed on three notes; they agreed on the remaining 18 of 25 cases (kappa = 0.59; 85.7% agreement over definite judgements). Gauged against the 18 decisions where the judges agreed, our model's FP and FN decisions were correct in 78% (14/18) of cases. To explain, we marked an FP (FN) decision as correctly declared positive (negative) if both judges indicated that the note should (should not) be annotated with a depression ICD code. Interestingly, all FP decisions were declared correct (12/12) while a third (2/6) of the FN decisions were correct. These results are encouraging, especially for the FP decisions.
Discussion
Our proposed model, CB-MH, is top-ranked in F1 (0.62) while CB-Atn is best in F2 (0.71); the difference between their confidence intervals is slight (at most 0.02). Both models use attention. For our model, the combined training and testing run time on the Full set is approximately 122 minutes. We find that while scores are generally higher on the Full set than on the Sample, notably the confidence intervals are tighter with the larger set. Traditional machine learning algorithms do not fare as well: they seem less equipped than DL models to capitalize on the semantics underlying diagnoses. DL models are good at reducing Type II errors, achieving higher F2 scores than F1 scores. As pointed out in a recent review of DL models operating on clinical texts27, it is important to continue comparing DL with traditional methods: while our DL models fared well, in 11% of the studies that review surveyed, DL algorithms performed worse than traditional machine learning algorithms.
Setting aside differences across papers, our best scores are lower than recent DL results for heart failure (F1: 0.9), COPD (F1: 0.9), and kidney diseases (F1: 0.92)12. Our LSTM (F1:0.54) also fared lower than an LSTM for cardio-vascular disease (F1:0.61)8. These differences underline the greater challenge in predicting mental illness with its fine distinctions between disease classes. Our results also point to room for improvement and the need for research on stronger decision models for this challenging problem setting.
A high majority (82%) of the true positive model decisions analyzed for depression involve key features that appear in meaningful contexts, which points to model success in this regard. Interestingly, about half of the FPs analyzed involve features in contexts indicating disease presence, suggesting possible errors of omission in the ICD labels. Our follow-up small-scale study using clinician judges lends further support to this suggestion: all of the model's FPs reviewed by the judges using the full notes were declared to be true positives. The FN group was the most challenging; 84% of the analyzed notes had key features in unclear contexts. For some notes the model was clearly incorrect, possibly reflecting weaknesses in the training data; in others, evidence supporting a positive annotation for depression was not in the note. In this context, the clinician judges found a third of the model's FN decisions to be true negatives. Overall, agreement between judges was moderate (clinician judgements) to strong (author judgements). This suggests, in general, that the task of looking for evidence supporting a depression diagnosis, whether in a text window of ±15 words or in the whole note, is more straightforward than nuanced.
Our error analysis, especially of the false positives, indicates a possibility for the future: combining model prediction from free text with interpretability-based error analysis as a pathway towards strengthening, i.e., selectively correcting, the ICD annotations of clinical records. We will explore this in future research.
Conclusion
We compared seven deep learning and two traditional machine learning algorithms on the multi-class, multi-label problem of predicting mental illness from free-text notes. The tested models included one that we proposed, CB-MH, a multi-head soft-attention model; this model achieved the highest F1 score and a close-to-highest F2 score. Error analysis was conducted using the Integrated Gradients (IG) methodology to identify the features that contributed most to model decisions. We showed through clinician judgements that the FPs with the highest IG-scoring features were actually TPs; similarly, a third of the FNs were true negatives. We also show a potential way forward for using DL models, with error analysis through IG, as a mechanism to correct ICD annotation errors in clinical datasets.
Limitations and Future Work
We have not explored resources such as BioBERT10; though built for applications related to genes and diseases, it may prove useful. Given the manual effort involved, we focused on extracting explanations for a single disease. Our next goal is to scale this up by automating the identification of explanations from the IG scores of features and their contexts. We analyzed micro-averaged results alone (Table 3), primarily due to space restrictions; with micro averaging, larger classes are likely to dominate. However, Table 4 does provide some results with macro averaging.
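To illustrate why micro averaging can be dominated by larger classes, the toy calculation below uses hypothetical per-class confusion counts (not our actual results). Micro averaging pools the counts before computing F1, so the large, easy class drives the score; macro averaging computes F1 per class and then takes an unweighted mean.

```python
def f1(tp, fp, fn):
    # Standard F1 from confusion counts, guarding against zero denominators.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Hypothetical counts: a large class predicted well, a small class poorly.
counts = {"depression": dict(tp=90, fp=10, fn=10),
          "phobia":     dict(tp=2,  fp=8,  fn=8)}

macro = sum(f1(**c) for c in counts.values()) / len(counts)
micro = f1(tp=sum(c["tp"] for c in counts.values()),
           fp=sum(c["fp"] for c in counts.values()),
           fn=sum(c["fn"] for c in counts.values()))

print(round(macro, 2), round(micro, 2))  # 0.55 0.84
```

Here the micro-averaged F1 (0.84) sits close to the large class's F1 (0.9), while the macro average (0.55) exposes the poor performance on the small class.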
Figure 1:
Model interpretability pipeline
Acknowledgements
The authors thank the two clinicians Dr. L. Durairaj and Dr. A. Aggarwal for their participation as judges in the user study.
References
- 1. Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. ICLR. 2015.
- 2. Cho G, Yim J, Choi Y, Ko J, Lee SH. Review of machine learning algorithms for diagnosing mental illness. Psychiatry Investigation. 2019 Apr;16(4):262. doi: 10.30773/pi.2018.12.21.2.
- 3. Dabek F, Caban JJ. A neural network based model for predicting psychological conditions. International Conference on Brain Informatics and Health. 2015 Aug 30:252–261. Cham: Springer.
- 4. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019.
- 5. Doherty JL, Owen MJ. Genomic insights into the overlap between psychiatric disorders: implications for research and clinical practice. Genome Medicine. 2014 Dec;6(4):1–3. doi: 10.1186/gm546.
- 6. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016:770–778.
- 7. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation. 1997 Nov 15;9(8):1735–80. doi: 10.1162/neco.1997.9.8.1735.
- 8. Junwei K, Yang H, Junjiang L, Zhijun Y. Dynamic prediction of cardiovascular disease using improved LSTM. International Journal of Crowd Science. 2019 May 10.
- 9. Kim Y. Convolutional neural networks for sentence classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014.
- 10. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020 Feb 15;36(4):1234–40. doi: 10.1093/bioinformatics/btz682.
- 11. Lipton ZC, Kale DC, Elkan C, Wetzel R. Learning to diagnose with LSTM recurrent neural networks. ICLR. 2016.
- 12. Ma F, Gao J, Suo Q, You Q, Zhou J, Zhang A. Risk prediction on electronic health records with prior medical knowledge. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018:1910–1919.
- 13. Marshall M. The hidden links between mental disorders. Nature. 2020 May;581(7806):19–21. doi: 10.1038/d41586-020-00922-8.
- 14. Mikolov T, Sutskever I, Chen K, Corrado G, Dean J. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems. 2013:3111–3119.
- 15. Miotto R, Wang F, Wang S, Jiang X, Dudley JT. Deep learning for healthcare: review, opportunities and challenges. Briefings in Bioinformatics. 2018 Nov;19(6):1236–46. doi: 10.1093/bib/bbx044.
- 16. Mullenbach J, Wiegreffe S, Duke J, Sun J, Eisenstein J. Explainable prediction of medical codes from clinical text. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2018:1101–1111.
- 17. NIMH. Anxiety Disorders. https://www.nimh.nih.gov/health/topics/anxiety-disorders/index.shtml
- 18. NIMH. Depression. https://www.nimh.nih.gov/health/topics/depression/index.shtml
- 19. NIMH. Mental Illness. https://www.nimh.nih.gov/health/statistics/mental-illness.shtml
- 20. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research. 2011 Nov 1;12:2825–30.
- 21. Shickel B, Tighe PJ, Bihorac A, Rashidi P. Deep EHR: a survey of recent advances in deep learning techniques for electronic health record (EHR) analysis. IEEE Journal of Biomedical and Health Informatics. 2017 Oct 27;22(5):1589–604. doi: 10.1109/JBHI.2017.2767063.
- 22. Sundararajan M, Taly A, Yan Q. Axiomatic attribution for deep networks. International Conference on Machine Learning. 2017 Jul 17:3319–3328. PMLR.
- 23. Tran T, Kavuluru R. Predicting mental conditions based on history of present illness in psychiatric notes with deep neural networks. Journal of Biomedical Informatics. 2017 Nov 1;75:S138–48. doi: 10.1016/j.jbi.2017.06.010.
- 24. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. Attention is all you need. Advances in Neural Information Processing Systems. 2017:5998–6008.
- 25. Wilbur WJ. Non-parametric significance tests of retrieval performance comparisons. Journal of Information Science. 1994 Aug;20(4):270–84.
- 26. Willett P. The Porter stemming algorithm: then and now. Program. 2006 Jul 1.
- 27. Wu S, Roberts K, Datta S, Du J, Ji Z, Si Y, Soni S, Wang Q, Wei Q, Xiang Y, Zhao B. Deep learning in clinical natural language processing: a methodical review. Journal of the American Medical Informatics Association. 2020 Mar;27(3):457–70. doi: 10.1093/jamia/ocz200.
- 28. Su C, Xu Z, Pathak J, Wang F. Deep learning in mental health outcome research: a scoping review. Translational Psychiatry. 2020 Apr 22;10(1):1–26. doi: 10.1038/s41398-020-0780-3.
- 29. NLM. The PubMed Baseline Repository. 2018. ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline
- 30. Smoller JW. The use of electronic health records for psychiatric phenotyping and genomics. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics. 2018 Oct;177(7):601–12. doi: 10.1002/ajmg.b.32548.
- 31. Perotte A, Pivovarov R, Natarajan K, Weiskopf N, Wood F, Elhadad N. Diagnosis code assignment: models and evaluation metrics. Journal of the American Medical Informatics Association. 2014 Mar 1;21(2):231–7. doi: 10.1136/amiajnl-2013-002159.
- 32. Mental Health America. The State of Mental Health in America. https://www.mhanational.org/
- 33. Rios A, Kavuluru R. Ordinal convolutional neural networks for predicting RDoC positive valence psychiatric symptom severity scores. Journal of Biomedical Informatics. 2017 Nov 1;75:S85–93. doi: 10.1016/j.jbi.2017.05.008.
- 34. Dai HJ, Jonnagaddala J. Assessing the severity of positive valence symptoms in initial psychiatric evaluation records: Should we use convolutional neural networks? PLoS ONE. 2018 Oct 16;13(10):e0204493. doi: 10.1371/journal.pone.0204493.
- 35. Geraci J, Wilansky P, de Luca V, Roy A, Kennedy JL, Strauss J. Applying deep neural networks to unstructured text notes in electronic medical records for phenotyping youth depression. Evidence-Based Mental Health. 2017 Aug 1;20(3):83–7. doi: 10.1136/eb-2017-102688.

