Abstract
Background
Relying on diagnostic categories of neuropsychiatric illness obscures the complexity of these disorders. Capturing multiple dimensional measures of neuropathology could facilitate clinical and neurobiological investigation of cognitive and behavioral phenotypes.
Methods
We developed a natural language processing (NLP)-based approach to extract five symptom dimensions, based on the NIMH Research Domain Criteria (RDoC) definitions, from narrative clinical notes. Estimates of RDoC loading (eRDoC) were derived from a cohort of 3,619 individuals with 4,623 hospital admissions. We applied this tool to a large corpus of psychiatric inpatient admission and discharge notes (2010–2015), and using the same cohort examined face validity, predictive validity, and convergent validity with gold standard annotations.
Results
In mixed effects models adjusted for sociodemographic and clinical features, greater negative and positive symptom domains were associated with shorter stay (β=−0.88; p=0.001 and β=−1.22; p<0.001, respectively) while greater social and arousal domain scores were associated with longer stay (β=0.93; p<0.001 and β=0.81; p=0.007, respectively). In fully-adjusted Cox regression models, greater positive domain score at discharge was also associated with significant increase in readmission risk (HR=1.22; p<0.001). Positive and negative valence domains were correlated with expert annotation (by ANOVA (3df), R2=0.13 and 0.19, respectively). Likewise, in a subset of patients, neurocognitive testing was correlated with cognitive performance scores (p<0.008 for 3/6 measures).
Discussion
These results demonstrate that NLP can be used to efficiently and transparently score clinical notes in terms of cognitive and psychopathologic domains.
Keywords: transdiagnostic, computed phenotype, natural language processing, electronic health record, research domain criteria, topic modeling
Introduction
The limitations of a categorical diagnostic system in neuropsychiatric illness have become increasingly apparent in an era of genomic study. A diagnostic category such as major depressive disorder (MDD) captures a large heterogeneous range of presentations.(1) Co-occurrence of psychiatric disorders is the norm, conflating true comorbidity with different manifestations of the same underlying pathology, such as in cases of bipolar disorder (BPD) and anxiety disorders.(2) The overlap in presentations and symptoms between disorders is not well-captured – for example, this limitation manifests in the complexity of the relationship between mood disorders and psychotic disorders.
The information loss from categorization has become even more striking with the emergence of alternate means of defining the relationships between disorders. Twin and family studies dating back decades illustrated that individual disorders are familial and heritable; an abundance of data now demonstrates continuity between psychiatric disorders in terms of genomic liability and environmental risk.(3–5)
As investigators frequently encounter the limitations of this system, increasing attention has turned to multidimensional alternatives.(6) The National Institute of Mental Health (NIMH) introduced the Research Domain Criteria (RDoC) as an alternate nosology focusing on linking clinical symptoms to relevant biology.(7) These five domains – negative and positive valence, social function, cognition, and arousal – are intended to capture the full range of brain-associated function.(8) Despite the appeal of RDoC as a means of facilitating translational studies, efficient assessment of these domains in clinical samples has yet to be established - it is intended as a research framework, not a clinical assessment per se. NIMH leadership has suggested that approaches incorporating 'big data', or large clinical data sets, will be necessary for continued progress in understanding dimensional psychopathology.(9–11) Still, the ability to estimate manifestation of these domains - even coarsely - in clinical data could greatly facilitate targeted investigations.
Natural language processing (NLP) refers to a broad set of methods for extracting concepts or structured information from text (e.g., narrative clinical notes). These methods range from simple (e.g., matching particular strings in a block of text, or treating a document as a 'bag of words') to extremely complex, incorporating context and attempting to extract meaning.(12, 13) In a clinical context, NLP provides a means of investigating phenotypic hypotheses not addressed by structured clinical data (e.g., health billing information or rating scales).(14) In psychiatry, diverse applications of NLP include identifying the presence or absence of depression in a given clinical visit and identifying negative symptoms in psychosis, facilitating measures of the quantity of symptoms present.(15–17) The utility of NLP has also been demonstrated outside of psychiatry, including effective identification of the presence or absence of pulmonary embolism in radiology reports.(18) Importantly, these are examples of restructuring text: identifying an individual symptom or outcome that could conceptually have been collected as structured data during the initial encounter. These examples apply NLP as a 'force multiplier' by training models on expert annotations and then generalizing to a large number of new cases in a supervised learning paradigm. In both cases, re-structuring and supervised learning, a priori knowledge of a gold standard is assumed.
An alternate and complementary approach employs NLP to characterize notes without the assumption of known gold standard labels. Such methods assist in identifying unlabeled latent traits that are not yet well-studied. We previously demonstrated the feasibility of applying NLP to extract multiple continuous symptom domains from psychiatric notes, and found that the extracted dimensions improved prediction of hospital readmission.(19) However, this approach had three major limitations preventing broader application. First, it did not allow for inspection of the contributors to domain estimates and thus was not conducive to hypothesis generation. Second, it was computationally intensive and technically difficult to implement across health systems. Third, the model employed cohort-level score normalization, which precluded on-line scoring. An ideal method would allow high-throughput on-line estimates from existing clinical text, yield estimates with predictive and face validity, and allow the source of those estimates to be inspected. Here, we describe a novel method for deriving estimates of loading for each of the five RDoC domains, distinct from our prior work in its improved inspectability, portability, and performance. We demonstrate that this method has strong face validity and interpretability, and that it improves prediction of clinical outcomes compared to structured data alone.
Methods and Materials
Overview and Data Set Generation
Sociodemographic and clinical data were extracted from the longitudinal electronic health record (EHR) of the Massachusetts General Hospital (MGH). Clinical data include billing (claims) codes as well as medication e-prescriptions and narrative clinical notes. We included any individuals age 18 or older with between 1 and 10 inpatient psychiatric hospitalizations from 2010–2015. We determined principal clinical diagnoses based on the ICD9 code at admission, incorporating any psychiatric diagnosis with at least 20 individuals represented in the cohort. These included schizophrenia (ICD9 295.x, except 295.7), schizoaffective disorder (295.7), post-traumatic stress disorder (309.8), anxiety disorders (300.0/1/2), substance use disorders (291 or 292), psychosis not otherwise specified (298.9), MDD (296.2 or 296.3), BPD - manic (296.0/1/4), other BPD (296.5/6/7/8), and suicidality without other primary diagnosis (V628).
A datamart containing all clinical data was generated with the i2b2 server software (i2b2 v1.6, Boston, MA, USA), a computational framework for managing human health data.(20–22) The Partners Institutional Review Board approved the study protocol, waiving the requirement for informed consent as detailed by 45 CFR 46.116.
Study Design and Analysis
Primary analyses utilized a cohort design including all patients admitted during the time period noted above; no individuals were excluded for missing data. Admission and discharge documentation were used to estimate RDoC domain scores at both time points for all encounters. Clinical outcomes, including length of stay and psychiatric hospital readmission, were then used to validate the clinical utility of the scores. Length of stay was defined as discharge date minus admission date. Psychiatric hospital readmission was defined as a second psychiatric hospitalization at MGH within one year (a period during which individuals would be highly likely to be readmitted to the index hospital).
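As an illustration of these two outcome definitions, the following sketch computes length of stay and a one-year readmission flag from encounter dates; the function names and 365-day convention are ours for illustration, not part of the study codebase.

```python
from datetime import date

def length_of_stay(admit: date, discharge: date) -> int:
    """Length of stay: discharge date minus admission date, in days."""
    return (discharge - admit).days

def readmitted_within_one_year(index_discharge: date, later_admissions: list) -> bool:
    """Readmission: any subsequent psychiatric admission within 365 days
    of the index discharge (365 days taken here as 'one year')."""
    return any(0 < (a - index_discharge).days <= 365 for a in later_admissions)

los = length_of_stay(date(2012, 3, 1), date(2012, 3, 11))            # 10 days
readmit = readmitted_within_one_year(date(2012, 3, 11), [date(2012, 9, 1)])
```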
Derivation of estimated Research Domain Criteria (eRDoC) Token List
The goal of phenotype derivation was to identify a set of tokens (i.e., single words (unigrams) or two-word phrases (bigrams)) reflecting individual RDoC domains in narrative notes. We developed a multi-step process that used the text of the DSM-IV-TR, a list of 10–50 seed unigrams or bigrams per domain manually curated based on expert consensus (THM, RHP) review of the RDoC workgroup statements, and psychiatric discharge summaries to identify terms conceptually similar to those experts associate with each of the five RDoC domains.(23) For an overview of the entire process, please see Supplemental Figure S1. Both the DSM-IV and the corpus of narrative discharge notes were normalized using the UMLS Lexical Variant Generation (LVG) package.(24) The corpus of narrative discharge notes was tokenized to unigrams and bigrams, and stop words were eliminated.
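The tokenization described above (lowercasing, stop-word elimination, and unigram/bigram generation) can be sketched in plain Python. The stop-word list here is a small illustrative subset, and the UMLS LVG normalization step is omitted.

```python
import re

# Illustrative stop-word subset; the full pipeline used a standard stop list.
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "with", "to", "is", "was"}

def tokenize(text: str) -> list:
    """Lowercase, keep alphabetic words, drop stop words, and emit
    unigrams plus adjacent-word bigrams."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams

tokens = tokenize("The patient was admitted with suicidal ideation.")
```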
For subsequent steps, thresholding choices were made by inspecting the relevant distributions, based on the authors' prior experience with health record NLP method development.(25) Choices to trim distributions balanced the computational complexity of the task against the breadth of symptoms captured, aiming to minimize overfitting risk and so maximize portability. All thresholding choices were made prior to analysis of outcomes, and blind to token identity.
The DSM-IV-TR (provided by the APA) was then similarly preprocessed to generate unigram and bigram counts. DSM-IV-TR tokens were limited to those appearing in the narrative note corpus and further limited to unigrams occurring in 0.1–99% of paragraphs and bigrams occurring four or more times. The retained DSM tokens were weighted by inverse document frequency, with paragraphs treated as “documents”. We then applied latent semantic analysis to the weighted paragraph-wise count data, using singular value decomposition to transform the token-paragraph associations into token-topic associations.(26) Based on inspection of the distribution, the top 300 topics were retained for the subsequent between-word similarity analysis. For each seed word, we identified the 50 unigrams and bigrams with the greatest cosine similarity in the DSM as that seed’s candidate synonyms.
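Two building blocks of this step, inverse document frequency with paragraphs as "documents" and cosine similarity between vectors in the reduced topic space, can be sketched as follows; the singular value decomposition itself would be performed with a numerical library and is not reproduced here, and the example inputs are invented.

```python
import math
from collections import Counter

def idf_weights(paragraphs):
    """Inverse document frequency with paragraphs treated as 'documents':
    idf(t) = log(N / number of paragraphs containing t)."""
    n = len(paragraphs)
    df = Counter(token for p in paragraphs for token in set(p))
    return {token: math.log(n / count) for token, count in df.items()}

def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors,
    e.g., rows of the token-topic matrix produced by the SVD."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

weights = idf_weights([["anhedonia", "sleep"], ["anhedonia"]])
sim_same = cosine([1.0, 0.0], [1.0, 0.0])
sim_orth = cosine([1.0, 0.0], [0.0, 1.0])
```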
Next, the candidate synonyms for each RDoC domain were filtered to ensure only synonyms associated with a domain seed term could appear in the final model. For each domain, we jointly tested the association between each seed word token and each candidate synonym token. Significant associations were identified as those with p-values lower than a threshold chosen to control a false discovery rate of 10%.(27) Candidate synonyms identified in the DSM were dropped if they were not associated with any curated seed word in the clinical corpus.
To further filter these candidate terms, we required occurrences of candidate terms to be predictive of occurrences of seed words in the clinical notes. Only seed words appearing in 10–90% of notes were considered, and candidate terms were limited to those appearing in 5–90% of notes. We first performed univariate screening, retaining terms correlated with the token sum of the domain-specific seed terms at an absolute rank correlation of at least 0.05. Subsequently, for each seed term, we sub-sampled 500 notes and fitted adaptive LASSO penalized logistic regression models to predict its occurrence in each note, with features being the candidate terms that passed the univariate screening. Candidate terms with zero coefficients were considered non-informative. We repeated this process 20 times for each seed term of each domain, randomly sub-sampling 500 notes each time. For each domain, we then summarized the predictiveness of each candidate term as the proportion of fits in which the term received a non-zero coefficient, averaged over the repeated fits and the seed terms. Terms with a non-zero-coefficient frequency greater than 5% were considered the final set of synonyms for each domain. This process generated a set of tokens believed, by virtue of their derivation, to be associated with domain symptomatology.
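The univariate rank-correlation screen can be sketched in plain Python; the adaptive LASSO refits would use a penalized-regression package and are omitted. `spearman` below is our illustrative implementation of the rank correlation underlying the |rho| ≥ 0.05 filter.

```python
import math

def ranks(xs):
    """1-based ranks; ties receive the mean of their rank positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = math.sqrt(sum((a - mx) ** 2 for a in rx) *
                    sum((b - my) ** 2 for b in ry))
    return num / den if den else 0.0

# Candidate-term counts perfectly (anti-)concordant with seed-term counts:
rho_pos = spearman([1, 2, 3, 4], [10, 20, 30, 40])
rho_neg = spearman([1, 2, 3, 4], [40, 30, 20, 10])
```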
Alternative to LASSO for synonym identification
While LASSO remains a widely-used method for variable selection, it may fail to identify pairs of terms that overlap in their variance explained and may thus be less sensitive than alternatives.(28, 29) As a sensitivity analysis of the primary LASSO selection algorithm, we therefore applied probabilistic topic modeling. Consistent with prior work, we applied latent Dirichlet allocation (LDA) to fit a 75-topic model to the full note corpus.(30) We then selected one topic per domain from this model by inspecting the posterior distribution to identify the topic under which the original seed words were most likely (e.g., the negative valence topic was selected as the topic under which the negative valence seed words were most probable). To build the final token list, we cut the five selected topic distributions at 95% of the cumulative probability distribution. The resulting token lists were then used to generate domain scores as described for the primary analysis. In the interest of providing the most portable scoring system (developed below), both token identification systems employed the common “bag of words” model without extensions such as negation detection, temporality detection, or word sense disambiguation.
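The final step of this alternative pipeline, truncating a fitted topic's token distribution at 95% of cumulative probability, can be sketched as follows; the LDA fit itself would use a topic-modeling library, and the token probabilities below are invented for illustration.

```python
def topic_token_list(topic_probs, mass=0.95):
    """Keep the highest-probability tokens of a topic until their cumulative
    probability reaches `mass` (95% in the primary analysis)."""
    kept, cumulative = [], 0.0
    for token, p in sorted(topic_probs.items(), key=lambda kv: -kv[1]):
        if cumulative >= mass:
            break
        kept.append(token)
        cumulative += p
    return kept

# Invented token probabilities for a hypothetical negative valence topic:
terms = topic_token_list({"anhedonia": 0.50, "guilt": 0.30,
                          "hopeless": 0.15, "unit": 0.05})
```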
Implementation of Scoring System
Having derived token lists associated with each domain, we then implemented a greatly simplified scoring code to facilitate dissemination, reapplication, and replication. The simplification involved first inverting the lexical variant index such that all lexical variants were added to the token list to be identified, then implementing basic preprocessing and tokenization code depending only on the Python standard library.
To assign an estimated domain score to each note, we used the percent of domain-specific terms appearing in the note. So, for example, if a hypothetical domain contained 100 terms in its final term list, a note containing five of those terms would be assigned a domain score of 5/100. In presentation of results, we multiply these values by 10 to facilitate readability. All notes for the full cohort (all psychiatric discharges between 2010 and 2015) were scored.
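A minimal sketch of this scoring rule, assuming the score is simply the fraction of a domain's term list appearing in a note; the ten-term list below is invented for illustration (actual eRDoC term lists are far longer, so real scores are small fractions).

```python
def domain_score(note_tokens, domain_terms):
    """Fraction of the domain term list appearing in the note."""
    present = sum(1 for term in domain_terms if term in note_tokens)
    return present / len(domain_terms)

# Invented ten-term 'negative valence' list and a tokenized note:
NEGATIVE_TERMS = ["anhedonia", "hopeless", "guilt", "worthless", "tearful",
                  "insomnia", "anergia", "sad", "crying", "suicidal"]
note = {"patient", "tearful", "hopeless", "suicidal", "guilt", "anhedonia"}
score = domain_score(note, NEGATIVE_TERMS)  # 5 of 10 terms present
```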
In sum, this method should be viewed as two distinct parts: derivation of the words and phrases of interest, followed by development of an easy-to-use scoring system based on them. Token lists and the code used are available online.
Validation Analysis
As an optimal clinical RDoC assessment is not yet available, it is not possible to compare the NLP-based RDoC domain scores to a ‘gold standard’ evaluation. Instead, we adopted multiple convergent strategies to estimate validity. We first examined the predictive validity of the individual psychopathology estimates by associating these scores with clinical outcomes for all psychiatric discharges between 2010 and 2015, adjusting for age, sex, public versus private insurance, self-reported race/ethnicity, and baseline clinical features. Specifically, we assessed the extent to which these domain estimates predict hospital length of stay (based on admission notes) and readmission (based on discharge notes) in regression models, quantifying improvement by comparing nested models with a likelihood-ratio test. For hospital length of stay, we made domain estimates from the admission note and, for individuals with multiple admissions, accounted for multiple observations per subject using mixed effects models to maximize efficiency of analysis. For time to hospital readmission, we used Cox clustered regression with results censored at death or at three years.
Second, we examined face validity by plotting selected DSM-IV diagnoses in terms of symptom domains. Qualitatively, we also generated word clouds to visualize the terms loading onto each domain. To better reflect the contribution of individual words to predictions, we weighted these word clouds using the Gini index from a random forest trained to predict readmission from individual token counts at discharge.
Third, we examined convergent validity for two domains (negative and positive valence) by comparing scores to expert annotation of 200 randomly-selected individuals drawn from the full discharge cohort. We utilized a set of clinically meaningful anchor points developed by the authors (RHP, THM, HEB, JNR) for an American Medical Informatics Association NLP challenge, where 0=absence of symptoms, 1=subthreshold symptoms or symptoms of questionable importance, 2=threshold symptoms requiring outpatient treatment, and 3=threshold symptoms likely to require inpatient treatment or emergent intervention.(31) These anchor points were used by an expert clinician (RHP) familiar with the NIMH workgroup statements to score current symptoms reflecting negative and positive valence. The rater was blinded to computed scores and did not access token lists prior to scoring. Correlations were computed between estimated scores and expert ratings.
Pilot Study of Neurocognitive Measures
As noted, the extent to which specific measures load onto specific domains has been asserted but not tested systematically. To demonstrate the potential application of automated scoring and future validation, we analyzed data from a battery of measures from the Cambridge Neuropsychological Test Automated Battery (CANTAB), collected within a random subset of individuals with psychiatric discharges between 2010–2015 during a systematic assessment for a cellular biobanking study.(32–34) Pearson product moment correlations were examined between estimated cognition score and six measures of cognitive domains, with p<0.05/6, or p<0.008, conservatively considered a corrected threshold for association. All analyses utilized R v3.3.(35)
In all cases, validation studies used either the full discharge cohort (predictive validity, face validity) or a subset (convergent validity). For an example of portability via application in a distinct cohort, see [cross-reference GWAS paper].
Results
We identified 3,619 individuals with 4,623 hospital discharges from 2010–2015; sociodemographic and clinical descriptors are available in Table 1. Figure 1 illustrates the distribution of cognition and negative valence eRDoC scores for individuals with MDD or BPD mania, at hospital admission (top) and discharge (bottom). While depressive symptoms are generally more severe among those admitted for MDD, the range of depressive symptoms among those with mania illustrates the spectrum of mixed features. At discharge, depressive symptoms diminished in both groups, but cognitive symptoms did not change appreciably. To illustrate face validity, Supplemental Figure S2 depicts individual terms contributing to the negative valence and cognitive RDoC domain estimates at discharge as word clouds.
Table 1.
Cohort Characteristics
| Characteristic | Value |
|---|---|
| N | 3619 |
| Demographics | Mean (SD) |
| Age (Years) | 43.89 (16.62) |
| | N (%) |
| Sex (Male) | 1840 (50.8) |
| Insurance (Public) | 2095 (57.9) |
| Admit via ED | 2429 (67.1) |
| Race/ethnicity | N (%) |
| Asian | 145 (4.0) |
| Black | 343 (9.5) |
| Hispanic | 315 (8.7) |
| Other | 211 (5.8) |
| White | 2605 (72.0) |
| Admit Diagnosis | N (%) |
| Major depression | 774 (41.6) |
| Bipolar, depressed | 171 (9.2) |
| Bipolar, manic | 144 (7.7) |
| Psychosis NOS | 272 (14.6) |
| Schizoaffective disorder | 135 (7.3) |
| Schizophrenia | 160 (8.6) |
| Comorbidity | Mean (SD) |
| Log Charlson score | 0.61 (0.74) |

ED, emergency department
Figure 1.
Domain Comparison Contour Plots Showing Change between Admission (top) and Discharge (bottom)
Using the full cohort, we examined the association between individual domains extracted from hospital admission notes and length of hospital stay (mean=9.67 (SD=8.92) days). Table 2 (right) reports a mixed effects model including only sociodemographic and coded clinical features, including diagnosis; Table 2 (left) adds the five domain scores (sensitivity analysis in Supplemental Figure S3). Among the individual domains, greater negative and positive symptom domain scores were associated with shorter length of stay (β=−0.88; p=0.001 and β=−1.22; p<0.001, respectively), while greater social and arousal domain scores were associated with longer stay (β=0.93; p<0.001 and β=0.81; p=0.007, respectively). Adding domain scores improved model fit (likelihood ratio χ2(5)=107.12, p<2.2×10−16).
Table 2.
Length of Stay Regression Model with and without Domain Scores from Admission Documentation
Columns 2–4: model with RDoC domains; columns 5–7: model without RDoC domains.

| | B | CI | p | B | CI | p |
|---|---|---|---|---|---|---|
| Negative | −0.88 | −1.41 – −0.36 | .001* | | | |
| Positive | −1.22 | −1.61 – −0.84 | <.001* | | | |
| Cognitive | 0.47 | −0.17 – 1.11 | .154 | | | |
| Social | 0.93 | 0.40 – 1.46 | <.001* | | | |
| Arousal & Regulatory | 0.81 | 0.22 – 1.40 | .007* | | | |
| Age (years) | 0.09 | 0.07 – 0.11 | <.001* | 0.10 | 0.08 – 0.12 | <.001* |
| Sex (Male) | −0.49 | −1.03 – 0.04 | .071 | −0.57 | −1.11 – −0.03 | .037* |
| Race (White) | 0.20 | −0.41 – 0.82 | .520 | −0.39 | −1.00 – 0.23 | .218 |
| Insurance (Public) | 0.05 | −0.48 – 0.59 | .844 | 0.16 | −0.38 – 0.71 | .557 |
| Log Charlson score | −0.36 | −0.80 – 0.07 | .104 | −0.47 | −0.91 – −0.03 | .038* |

* indicates p < .05
Next, we examined the association between domain scores in discharge notes and time to index hospital readmission, again using the full discharge cohort with 10,187 years of follow-up (median follow-up 951 days). Once again, we compared Cox regression models with and without the five domain scores, the baseline model incorporating only coded sociodemographic and clinical data (Table 3; sensitivity analysis in Supplemental Figure S4). Greater positive domain score at discharge was associated with a significant increase in readmission risk (HR=1.22; p<0.001) - i.e., for every one standard deviation increase in positive valence score, readmission hazard increased by 22%. Figure 2 illustrates Kaplan-Meier survival curves for time to readmission split by the median discharge positive valence estimate (p<.001).
Table 3.
Cox Regression of Time to Readmission with and without Domain Scores from Discharge Documentation
Columns 2–4: model with RDoC domains; columns 5–7: model without RDoC domains.

| | HR | 95% CI | p | HR | 95% CI | p |
|---|---|---|---|---|---|---|
| Negative | 0.98 | 0.89 – 1.07 | 0.60 | | | |
| Positive | 1.22 | 1.14 – 1.30 | 3.34e-09* | | | |
| Cognitive | 0.96 | 0.88 – 1.04 | 0.31 | | | |
| Social | 1.02 | 0.95 – 1.11 | 0.59 | | | |
| Arousal & Regulatory | 0.90 | 0.82 – 0.99 | 0.03* | | | |
| Age (years) | 0.99 | 0.99 – 0.99 | 7.83e-11* | 0.99 | 0.99 – 0.99 | 8.49e-13* |
| Sex (Male) | 0.97 | 0.89 – 1.05 | 0.42 | 0.99 | 0.92 – 1.08 | 0.90 |
| Race (White) | 1.20 | 1.09 – 1.33 | 0.0003* | 1.22 | 1.11 – 1.35 | 4.92e-05* |
| Insurance (Public) | 1.44 | 1.32 – 1.57 | 5.55e-16* | 1.44 | 1.32 – 1.57 | 3.33e-16* |
| Log Charlson score | 1.53 | 1.43 – 1.63 | <2.00e-16* | 1.53 | 1.43 – 1.63 | <2.00e-16* |

* indicates p < .05
Figure 2.
Kaplan-Meier Curves for Time to Readmission, Split by Median eRDoC Positive Valence
Validation against expert annotation
In a subset of 200 randomly-selected individuals from this cohort, we then examined the extent to which automated assignment of positive and negative valence severity correlated with a single expert annotator. Figure 3 depicts mean automated score, by expert annotation, for positive (top) and negative (bottom) valence for these 200 admissions. Positive and negative valence domains were correlated with expert annotation (by ANOVA R2=0.13 and 0.19, respectively).
Figure 3.
Comparison of Automated Score and Expert Annotation (Positive Top / Negative Bottom)
Lastly, as a pilot validation of the cognitive domain score, we examined association between estimated cognitive score and neurocognitive measures from the CANTAB in a convenience sample of 11 individuals from the original cohort undergoing assessment as outpatients. Among the six measures, adjusting a priori for age and sex, we observed three associated with cognition at hospital disposition with p<0.008 (Supplemental Table S1).
Sensitivity analysis
As a further examination of the robustness of these effects, we applied an alternate method to expand the token lists based on topic modeling that does not rely on LASSO. Results were markedly similar to those presented for primary analysis, despite the substantial methodologic difference. For example, in regression models for length of stay or readmission hazard, coefficients changed by less than 10% in all cases (See Supplemental Results).
Discussion
We characterized 3,619 individuals discharged from a psychiatric hospital between 2010–2015 in terms of symptom dimensions based upon the NIMH RDoC framework. We show that using NLP to calculate symptom dimension scores improves meaningfully on structured data alone, through inspectable collections of uni- and bigrams. This method analyzes the vast corpora of narrative clinical notes available in EHRs in order to study the dimensional nature and implications of brain disease. We further demonstrate this method’s predictive validity - explaining significant variance in length of hospital stay and readmission risk - as well as face and convergent validity.
Standard applications of EHR data sets have drawn on methodologies built for analyses of health claims data dating back decades. While well-established, these methods neglect a growing understanding of the complexity of psychopathology. They assume categorical diagnoses, even though such diagnoses are often unreliable and divorced from the underlying neurobiology, including burgeoning research showing overlap between major psychiatric diseases (and some neurologic diseases) in terms of common genetic risk.(3, 4)
As products of routine clinical care, detailed narrative notes provide an opportunity to recapture the complexity of psychopathology, but require the extraction of relevant symptoms from unstructured data.(14) Because structured elements are frequently incompletely coded, prior extraction work focused on validating classifiers for categorical phenotypes not coded at the time of diagnosis.(15, 16) However, an alternative approach, which may track more closely with biologically valid phenotypes, is to extract multiple continuous conceptual symptom domains rather than individual diagnostic labels.(36) This approach differs fundamentally in methods and goals by operating independent of gold standard labels and estimates, instead aiding in creating such labels and estimates.(19) In addition to offering new avenues for investigating causal biology, these multidimensional estimation based approaches may be more scalable at the systems level. Instead of building a classifier per disease, multidimensional estimators could be used to create high dimensionality concept spaces within which clinically interesting categorical diagnoses are positioned.
In a prior investigation, we had developed a dimensional phenotyping approach that demonstrated predictive validity in narrative clinical notes.(19) While valuable as proof of concept, a lack of transparency coupled with computational complexity limited its application. To overcome these limitations, we report here a phenotyping system based on multiple estimates instead of clinical categorization, and show the distribution of these estimates of dimensional pathology across an inpatient psychiatric population. We demonstrate that these dimensions differ in face validity between diagnostic categories, while also illustrating substantial overlap. Further, they help to explain length of hospital stay and readmission risk beyond the variance captured by available structured data.
In particular, the impact of positive valence score on readmission, which based upon inspection of token lists strongly reflects substance use disorders, appears face-valid; we note the important distinction that the positive valence domain of RDoC reflects disorders of reward and positive motivation, not positive affect per se. Substance abuse admissions to the psychiatric unit at this hospital are typically brief. Similarly, the association between negative valence and shorter stay may reflect acute stabilization of suicidal patients (i.e., those with high negative valence) who are discharged once risk diminishes. The replication of these results when applying an entirely different term-amplification (synonym-generation) method suggests their robustness in this cohort.
We note several key limitations. First, this method will benefit from further validation against larger ‘gold standard’ measures drawn from individual clinical assessment batteries. Such efforts are ongoing, but await greater agreement on the necessary assessments and cohort generation. For two domains (positive and negative affect), we were able to draw on clinical expert annotations to examine convergent validity. The modest correlation here suggests opportunity to further refine concept detection as better gold standards are developed. For a third, cognition, we were able to draw on neuropsychological testing available in a small subset of the cohort to examine correlation between an estimate of cognitive symptomatology and objective cognitive measures. Validation of arousal and social domains may require additional data collection but is another important future direction.
Another limitation is that, while this method is readily portable to other health data, the portability and generalizability of our method must be demonstrated, as with any such tool. In particular, it will be useful to examine the extent to which incorporating other note types (e.g., outpatient visit notes, or inpatient progress notes), or use of structured data from rating scales, may improve precision of estimating symptom domains, or allow time series analysis. For an illustration of the application of this method to a biological question in a different patient population, see [cross-reference GWAS article]. We present this work together with dependency-free open-source software readily applicable to other narrative note sets to encourage replication efforts. Achieving this portability entailed constraints on toolchain and algorithmic complexity. For example, negative findings (e.g., "no evidence of psychosis") are common in medical documentation.(37, 38) Systems for automated detection of negation are an active area of research, and further work is needed to incorporate these findings in a fashion which preserves portability.(39–41) Finally, we would emphasize that the approach we employed constrains the symptom dimensions to correspond to those specified by NIMH workgroups; an important area of future work will be understanding the extent to which less constrained approaches to deriving symptom domains do or do not correspond to these hypothesized domains.
With these caveats in mind, the present report represents a key next step in developing and validating a simple, transparent and scalable NLP approach to extracting dimensional psychopathology. Such strategies may ensure that EHRs can be leveraged for modern data-driven analysis without abandoning the wealth of dimensional data they contain.
Supplementary Material
Supplemental Figure S1: Overview of term selection and scoring with examples for the Cognitive and Negative Domain.
Supplemental Figure S2: A) Word Cloud of Individual Terms Contributing to Negative Valence Domain in Discharge Documentation. B) Word Cloud of Individual Terms Contributing to Cognitive Domain in Discharge Documentation.
Supplemental Figure S3: Sensitivity analysis comparing length-of-stay regression coefficients (with confidence intervals) for the baseline structured-data-only model and the secondary LDA-based term selection method.
Supplemental Figure S4: Sensitivity analysis comparing hazard ratios for readmission for the baseline structured-data-only model and the secondary LDA-based term selection method.
Supplemental Table S1. Correlation between Automated Scores and Formal Testing.
Acknowledgments
This study was funded by the National Human Genome Research Institute, the National Institute of Mental Health, and the Stanley Center at the Broad Institute. The sponsor had no role in study design, writing of the report, or data collection, analysis, or interpretation. The corresponding and senior authors had full access to all data and made the decision to submit for publication. The authors wish to acknowledge the NIMH RDoC Unit members who contributed to manual review and curation of the RDoC term list.
RHP reports grants from National Human Genome Research Institute, grants from National Institute of Mental Health, during the conduct of the study; personal fees from Genomind, personal fees from Healthrageous, personal fees from Perfect Health, personal fees from Psy Therapeutics, and personal fees from RID Ventures. THM reports grants from the Broad Institute and Brain and Behavior Foundation.
Footnotes
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Financial Disclosures:
The other authors report no biomedical financial interests or potential conflicts of interest.
References and Notes
- 1.Zimmerman M, Ellison W, Young D, Chelminski I, Dalrymple K. How many different ways do patients meet the diagnostic criteria for major depressive disorder? Compr Psychiatry. 2015;56:29–34. doi: 10.1016/j.comppsych.2014.09.007. [DOI] [PubMed] [Google Scholar]
- 2.Pavlova B, Perlis RH, Alda M, Uher R. Lifetime prevalence of anxiety disorders in people with bipolar disorder: a systematic review and meta-analysis. Lancet Psychiatry. 2015;2:710–717. doi: 10.1016/S2215-0366(15)00112-1. [DOI] [PubMed] [Google Scholar]
- 3.Bulik-Sullivan B, Finucane HK, Anttila V, Gusev A, Day FR, Loh PR, et al. An atlas of genetic correlations across human diseases and traits. Nat Genet. 2015;47:1236–1241. doi: 10.1038/ng.3406. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Cross-Disorder Group of the Psychiatric Genomics Consortium, Lee SH, Ripke S, Neale BM, Faraone SV, Purcell SM, et al. Genetic relationship between five psychiatric disorders estimated from genome-wide SNPs. Nat Genet. 2013;45:984–994. doi: 10.1038/ng.2711. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Gilman SE, Ni MY, Dunn EC, Breslau J, McLaughlin KA, Smoller JW, et al. Contributions of the social environment to first-onset and recurrent mania. Mol Psychiatry. 2015;20:329–336. doi: 10.1038/mp.2014.36. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Insel T, Cuthbert B, Garvey M, Heinssen R, Pine DS, Quinn K, et al. Research domain criteria (RDoC): toward a new classification framework for research on mental disorders. Am J Psychiatry. 2010;167:748–751. doi: 10.1176/appi.ajp.2010.09091379. [DOI] [PubMed] [Google Scholar]
- 7.Sanislow CA, Pine DS, Quinn KJ, Kozak MJ, Garvey MA, Heinssen RK, et al. Developing constructs for psychopathology research: research domain criteria. J Abnorm Psychol. 2010;119:631–639. doi: 10.1037/a0020909. [DOI] [PubMed] [Google Scholar]
- 8.Morris SE, Cuthbert BN. Research Domain Criteria: cognitive systems, neural circuits, and dimensions of behavior. Dialogues Clin Neurosci. 2012;14:29–37. doi: 10.31887/DCNS.2012.14.1/smorris. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Gordon J. The Future of RDoC. National Institute of Mental Health. 2017 https://www.nimh.nih.gov/about/director/messages/2017/the-future-of-rdoc.shtml.
- 10.Insel TR, Cuthbert BN. Brain disorders? Precisely. Science. 2015;348:499–500. doi: 10.1126/science.aab2358. [DOI] [PubMed] [Google Scholar]
- 11.Redish AD, Gordon JA. Computational psychiatry: New perspectives on mental illness. MIT Press; 2016. [Google Scholar]
- 12.Nadkarni PM, Ohno-Machado L, Chapman WW. Natural language processing: an introduction. JAMIA. 2011;18:544–551. doi: 10.1136/amiajnl-2011-000464. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Manning CD, Schütze H. Foundations of statistical natural language processing. MIT Press; 1999. [Google Scholar]
- 14.Forbush TB, Gundlapalli AV, Palmer MN, Shen S, South BR, Divita G, et al. "Sitting on Pins and Needles”: Characterization of Symptom Descriptions in Clinical Notes. AMIA summits on Translational Science Proceedings. 2013:67–71. [PMC free article] [PubMed] [Google Scholar]
- 15.Perlis RH, Iosifescu DV, Castro VM, Murphy SN, Gainer VS, Minnier J, et al. Using electronic medical records to enable large-scale studies in psychiatry: treatment resistant depression as a model. Psychol Med. 2012;42:41–50. doi: 10.1017/S0033291711000997. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Patel R, Jayatilleke N, Broadbent M, Chang CK, Foskett N, Gorrell G, et al. Negative symptoms in schizophrenia: a study in a large clinical sample of patients using a novel automated method. BMJ Open. 2015;5:e007619. doi: 10.1136/bmjopen-2015-007619. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Gorrell G, Jackson R, Roberts A, Stewart R. Finding negative symptoms of schizophrenia in patient records. Proc NLP Med Biol Work (NLPMedBio), Recent Adv Nat Lang Process (RANLP), Hissar, Bulg. 2013:9–17. [Google Scholar]
- 18.Yu S, Kumamaru KK, George E, Dunne RM, Bedayat A, Neykov M, et al. Classification of CT pulmonary angiography reports by presence, chronicity, and location of pulmonary embolism with natural language processing. J Biomed Inform. 2014;52:386–393. doi: 10.1016/j.jbi.2014.08.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.McCoy TH, Castro VM, Rosenfield HR, Cagan A, Kohane IS, Perlis RH. A clinical perspective on the relevance of research domain criteria in electronic health records. Am J Psychiatry. 2015;172:316–320. doi: 10.1176/appi.ajp.2014.14091177. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Murphy SN, Mendis M, Hackett K, Kuttan R, Pan W, Phillips LC, et al. Architecture of the open-source clinical research chart from Informatics for Integrating Biology and the Bedside; AMIA Annu Symp Proc; 2007. pp. 548–552. [PMC free article] [PubMed] [Google Scholar]
- 21.Murphy S, Churchill S, Bry L, Chueh H, Weiss S, Lazarus R, et al. Instrumenting the health care enterprise for discovery research in the genomic era. Genome Res. 2009;19:1675–1681. doi: 10.1101/gr.094615.109. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Murphy SN, Weber G, Mendis M, Gainer V, Chueh HC, Churchill S, et al. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2) J Am Med Inform Assoc. 2010;17:124–130. doi: 10.1136/jamia.2009.000893. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.National Institute of Mental Health. Development of the RDoC Framework. [Google Scholar]
- 24.Anonymous. UMLS Reference Manual [Internet] Bethesda (MD): National Library of Medicine; 2009. The SPECIALIST Lexicon and Lexical Tools. [Google Scholar]
- 25.Yu S, Liao KP, Shaw SY, Gainer VS, Churchill SE, Szolovits P, et al. Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources. J Am Med Inform Assoc. 2015;22:993–1000. doi: 10.1093/jamia/ocv034. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R. Indexing by latent semantic analysis. J Am Soc Inf Sci. 1990;41:391. [Google Scholar]
- 27.Cai TT, Liu W. Large-Scale Multiple Testing of Correlations. J Am Stat Assoc. 2016;111:229–240. doi: 10.1080/01621459.2014.999157. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Austin PC, Tu JV. Bootstrap Methods for Developing Predictive Models. The American Statistician. 2004;58:131–137. [Google Scholar]
- 29.Zou H. The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association. 2006;101:1418–1429. [Google Scholar]
- 30.McCoy TH, Castro VM, Snapper LA, Hart KL, Januzzi JL, Huffman JC, et al. Polygenic loading for major depression is associated with specific medical comorbidity. Translational Psychiatry. 2017;7(9):e1238. doi: 10.1038/tp.2017.201. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Anonymous. 2016 CEGS N-GRID Shared-Tasks and Workshop on Challenges in Natural Language Processing for Clinical Data. 2016 https://www.i2b2.org/NLP/RDoCforPsychiatry/
- 32.Falconer DW, Cleland J, Fielding S, Reid IC. Using the Cambridge Neuropsychological Test Automated Battery (CANTAB) to assess the cognitive impact of electroconvulsive therapy on visual and visuospatial memory. Psychol Med. 2010;40:1017–1025. doi: 10.1017/S0033291709991243. [DOI] [PubMed] [Google Scholar]
- 33.Egerhazi A, Berecz R, Bartok E, Degrell I. Automated Neuropsychological Test Battery (CANTAB) in mild cognitive impairment and in Alzheimer's disease. Prog Neuropsychopharmacol Biol Psychiatry. 2007;31:746–751. doi: 10.1016/j.pnpbp.2007.01.011. [DOI] [PubMed] [Google Scholar]
- 34.Levaux MN, Potvin S, Sepehry AA, Sablier J, Mendrek A, Stip E. Computerized assessment of cognition in schizophrenia: promises and pitfalls of CANTAB. Eur Psychiatry. 2007;22:104–115. doi: 10.1016/j.eurpsy.2006.11.004. [DOI] [PubMed] [Google Scholar]
- 35.R Development Core Team. R: A Language and Environment for Statistical Computing. 3.1.1. Vienna, Austria: R Foundation for Statistical Computing; 2016. [Google Scholar]
- 36.McCoy TH, Jr, Castro VM, Roberson AM, Snapper LA, Perlis RH. Improving Prediction of Suicide and Accidental Death After Discharge From General Hospitals With Natural Language Processing. JAMA Psychiatry. 2016;73:1064–1071. doi: 10.1001/jamapsychiatry.2016.2172. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. Evaluation of negation phrases in narrative clinical reports; Proc AMIA Symp; 2001. pp. 105–109. [PMC free article] [PubMed] [Google Scholar]
- 38.Harkema H, Dowling JN, Thornblade T, Chapman WW. Context: An Algorithm for Determining Negation, Experiencer, and Temporal Status from Clinical Reports. Journal of biomedical informatics. 2009;42:839–851. doi: 10.1016/j.jbi.2009.05.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Wu S, Miller T, Masanz J, Coarr M, Halgrim S, Carrell D, et al. Negation's not solved: generalizability versus optimizability in clinical natural language processing. PLoS One. 2014;9:e112774. doi: 10.1371/journal.pone.0112774. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Mehrabi S, Krishnan A, Sohn S, Roch AM, Schmidt H, Kesterson J, et al. DEEPEN: A negation detection system for clinical text incorporating dependency relation into NegEx. J Biomed Inform. 2015;54:213–219. doi: 10.1016/j.jbi.2015.02.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Mukherjee P, Leroy G, Kauchak D, Rajanarayanan S, Romero Diaz DY, Yuan NP, et al. NegAIT: A new parser for medical text simplification using morphological, sentential and double negation. J Biomed Inform. 2017;69:55–62. doi: 10.1016/j.jbi.2017.03.014. [DOI] [PMC free article] [PubMed] [Google Scholar]