Abstract
The second track of the CEGS N-GRID 2016 natural language processing shared tasks focused on predicting symptom severity from neuropsychiatric clinical records. For the first time, initial psychiatric evaluation records have been collected, de-identified, annotated and shared with the scientific community. One hundred and ten researchers, organized into twenty-four teams, participated in this track and submitted sixty-five system runs for evaluation. The top ten teams each achieved an inverse normalized macro-averaged mean absolute error score above 0.80. The top performing system employed an ensemble of six different machine learning-based classifiers to achieve a score of 0.86. The task proved to be generally easy, with the exception of two specific classes of records: records with very few but crucial positive valence signals, and records describing patients predominantly affected by negative rather than positive valence. Those cases proved to be very challenging for most of the systems, and further research is required before the task can be considered solved. Overall, the results of this track demonstrate the effectiveness of data-driven approaches to the task of symptom severity classification.
1. Introduction
The 2016 CEGS N-GRID shared tasks for clinical records had three different tracks focused on a new corpus of 1,000 initial psychiatric evaluation records, the first corpus of mental health records ever released to the scientific community. These records consist of initial psychiatric evaluations of individual patients produced by psychiatrists. They are provided to the community as they are, meaning that we did not attempt to split them into sentences or fix missing whitespace characters. An example of a neuropsychiatric clinical record is shown in Figure 1. Accurate diagnoses of psychiatric disorders, which are essential to proper treatment, require expert analysis of relevant medical signs and symptoms. Initial psychiatric evaluation records are rich in these factors. Therefore, the CEGS N-GRID challenge dataset can support research on therapeutic alliances, safe and appropriate treatment planning, and treatment outcomes.
Figure 1.
Excerpts of an initial psychiatric evaluation record. Whitespace characters are highlighted.
In Track 2 of the 2016 CEGS N-GRID shared tasks, we used the Research Domain Criteria (RDoC) to define the prediction task. The RDoC framework aims to support a better understanding of the basic dimensions of human behavior, from normal to abnormal, for the purpose of studying mental disorders. In support of this goal, many sources of information are integrated into RDoC, including genetic, molecular, cellular, physiological, behavioral, self-reported and paradigmatic material. The RDoC framework is organized into five domains, each intended to capture a different, cross-diagnostic perspective on psychiatric illness and health. While there has been some effort to identify elements of depressive symptomatology in narrative notes, mostly to diagnose major depression, other domains have largely been overlooked1. In the CEGS N-GRID challenge, we focused on just one domain: positive valence. The positive valence domain concerns events, objects or situations that are harmful but attractive to patients, to the point that patients actively engage with them in spite of the consequences. Abnormalities of positive valence may be observed in disorders as diverse as substance abuse and dependence, mania, gambling, obsessive-compulsive disorder, and depression.
The RDoC framework is graphically summarized by a matrix. Each row represents a specified functional dimension of behavior, and each column (from genomic to paradigmatic) characterizes the constructs at a different level of analysis. The individual cells contain lists of tasks and measures (i.e., elements). Figure 2 shows the portion of the matrix for the positive valence domain, with some of the genetic, molecular and behavioral traits that characterize it. In this work, the RDoC framework helped us focus on the positive valence aspect of mental disorders, rather than on the entire spectrum of possible psychiatric disorders.
Figure 2.
The RDoC matrix for the positive valence domain with some of its genetic, molecular and behavioral traits.
Detection of symptoms and abnormalities is not sufficient for determining a course of treatment for patients. Severity of symptoms can affect the course and urgency of treatment. Information about severity is given, sometimes implicitly and sometimes explicitly, in the initial psychiatric interview records. Some symptoms can show up multiple times, at different levels of severity.
Our aim for Track 2 of the CEGS N-GRID shared tasks is to determine the lifetime maximum symptom severity of a patient’s mental disorders, based on the information reported in their initial psychiatric evaluation. Since we are interested in lifetime maximum severity, our task is not time dependent. We define severity on an ordinal scale from 0 to 3 as follows:
ABSENT: there is no evidence of relevant symptoms at any point in time.
MILD: there is some evidence of relevant symptoms, but it is never the focus of treatment.
MODERATE: there is evidence of relevant symptoms, sufficient to be the focus of outpatient treatment.
SEVERE: there is evidence of relevant symptoms, sufficient to warrant inpatient treatment or hospitalization.
We represent each patient with one severity score for positive valence. The goal of Track 2 of CEGS N-GRID challenge is to build automatic systems that predict the lifetime maximum positive valence severity score from each patient’s initial psychiatric evaluation record.
This paper provides an overview of Track 2 (the symptom severity prediction task) of the CEGS N-GRID 2016 shared task, including related work (Section 2), the data (Section 3), the annotation process (Section 4) and the evaluation measure designed to benchmark the submissions (Section 5). The submitted systems, along with their distinctive characteristics, are presented in Section 6, followed by their results (Section 7). We end the paper with a discussion (Section 8) and conclusion (Section 9).
2. Related work
Organizing clinical shared tasks is arduous due to the sensitive nature of the data involved: acquiring, de-identifying, annotating and sharing clinical records are all strictly regulated tasks [1]. This explains the relative scarcity of clinical shared tasks using medical narratives, even though such tasks offer great opportunities for the advancement of medical research.
Clinical shared tasks include those of the Conference and Labs of the Evaluation Forum (CLEF) Initiative [2, 3], the Semantic Evaluation (SemEval) series [4, 5, 6] and the Informatics for Integrating Biology and the Bedside (i2b2) challenges.
The shared task presented in this paper is important for Natural Language Processing (NLP), since it sets new directions of research. To begin with, this is the first shared task organized around neuropsychiatric records. These records present structural and linguistic peculiarities not yet explored in the literature (see Figure 1). They are transcriptions of interviews between a doctor and a patient, organized as a sequence of questions and answers, statements with an associated true/false value, and narrative blocks: free-text sections corresponding to the doctor's notes about the patient's information and symptoms. Moreover, the linguistic content of these records differs from pre-existing data in that they contain reported behavioral patterns related to mental health problems rather than, for example, the patient's medical history or discharge notes.
Finally, this paper adds a new task to those already tackled by the i2b2 series, which include identifying patient smoking status [7], detecting obesity and its comorbidities [8], extracting medication and dosages [9], extracting medical concepts, assertions and relations [10], resolving medical co-reference [11], extracting temporal information [12], automatically de-identifying medical records [13, 14] and identifying Coronary Artery Disease (CAD) [15].
3. Data
The corpus consists of 816 de-identified neuropsychiatric clinical records provided by Partners Healthcare and the Neuropsychiatric Genome-Scale and RDoC Individualized Domains (N-GRID) project of Harvard Medical School. This corpus is a subset of the one used in the de-identification task [14].
We split this corpus into two: 600 records for training and 216 for testing. The training set consisted of (I) 325 records annotated by at least two annotators, (II) 108 records annotated by the most experienced annotator only, and (III) 167 unannotated records. The unannotated records were included in the training set in order to allow the participants to experiment with semi-supervised approaches. The test set consisted of 216 records, all of which were annotated by at least two annotators. An extra 184 unannotated documents were not shared with the participants since they could not serve the benchmarking purpose2. The test set represents 40% of all the records annotated by at least two annotators (216 out of 541 records). Figure 3 shows that the class distributions of the training and test sets are comparable.
Figure 3.
Distribution of classes between training and test set.
We shared the corpus with the research community under a Rules of Conduct and Data User Agreement.
4. Annotation and adjudication
The annotation task focused on symptom severity in the positive valence domain of the RDoC framework. Three expert psychiatrists from the Massachusetts General Hospital (MGH) and the Harvard Medical School, with several years of experience each, volunteered as annotators. The most experienced annotator trained the other annotators. Appendix A summarizes the conventions used by the annotators to carry out the task. Annotating a clinical record for this task consisted of assigning to it a label that specifies symptom severity according to an ordinal scale with the following possible values [16]: absent, mild, moderate, and severe (see Section 1).
We assigned each record to two annotators. Due to time constraints, only 541 records were multiply annotated. Of those records, 329 had agreeing severity labels, whereas the remaining 212 were tie-broken by the third annotator. After tie breaking, for 179 records out of the 212, the majority severity label was used as gold standard. The remaining 33 records did not have a majority severity label, in which case we considered the one provided by the most experienced annotator as gold standard. The same annotator also marked the 108 singly annotated records in the training set.
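The resulting adjudication logic can be summarized programmatically. The following is a minimal sketch (not the actual adjudication tooling), assuming three severity labels per tie-broken record and the most experienced annotator's label as the fallback:

```python
from collections import Counter

def adjudicate(labels, most_experienced_label):
    """Return the gold label for a multiply annotated record.

    `labels` holds the severity labels from all annotators of the record;
    the majority label wins, and the most experienced annotator's label
    is used when no majority exists (e.g., three different labels).
    """
    counts = Counter(labels)
    label, top = counts.most_common(1)[0]
    if sum(1 for c in counts.values() if c == top) == 1:
        return label                      # a unique majority label exists
    return most_experienced_label         # no majority: defer to the expert

# Example: two annotators agree on "moderate", so it becomes the gold label.
print(adjudicate(["moderate", "moderate", "severe"], most_experienced_label="moderate"))
```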
Figure 4 compares the class distributions of the individual annotators and the gold standard. Annotator 2 had a slight tendency to overestimate the presence of positive valence symptoms. Annotator 3, the least experienced among the three, generally overestimated the gravity of the cases by consistently classifying mild cases as moderate or severe. Annotator 1, the most experienced one, presented the distribution closest to the gold standard, although he slightly overestimated the mild cases. The same analysis performed only on the records annotated by all the experts (179) produces the same distributions.
Figure 4.
Distribution of classes between annotators and gold standard.
We also measured the agreement of each annotator against the gold standard by using the official ranking score (see Section 5 for details). We found the scores to be 0.98, 0.96, and 0.90, positively correlated with the annotators' years of experience.
5. Evaluation
We selected the evaluation metric by experimenting with several performance measures borrowed from different scale types: nominal (precision, recall, F1-score, accuracy, Cohen's Kappa and Scott's Pi coefficients), ordinal/interval (median absolute error, mean absolute error, mean squared error) [17] and continuous (R2 and Pearson's correlation coefficients). The aim was to score systems by taking into account the distance between gold and predicted labels. The results of our experiments led us to choose the Mean Absolute Error (MAE), as in other related natural language processing tasks before [18, 19]. We customized MAE by (I) normalizing the deviations, and (II) macro-averaging the errors across the classes according to the following formula:
\[
\mathrm{NMAE}^{M}(x, T) \;=\; \frac{1}{|C|} \sum_{j \in C} \frac{1}{|T_j|} \sum_{i \in T_j} \frac{|x_i - y_i|}{\max(y_i - 1,\; 4 - y_i)} \tag{1}
\]
where C is the set of classes, T_j is the set of test documents labeled with class j, and x_i and y_i are the i-th prediction and gold label in the test set, respectively. Both x_i and y_i take values in the interval [1, 4]. Since the classes are imbalanced (see Figure 3), we macro-averaged the formula to make each class count the same, rather than proportionally to its size [18]. Finally, by normalizing the deviations we make the errors independent of the gold label and comparable across the records. We ranked the systems using the inverse normalized macro-averaged MAE (INMAEM), 1 − NMAE^M, which we refer to as the official ranking score. The script implementing the official ranking score is available online3.
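As an illustration of Equation (1), the following is a minimal Python sketch of the ranking score, not the official evaluation script. It assumes integer severity labels in [1, 4] and normalizes each deviation by the largest error possible for its gold label:

```python
from collections import defaultdict

def inverse_normalized_macro_mae(gold, predicted, low=1, high=4):
    """Inverse normalized macro-averaged MAE (1 - NMAE^M).

    `gold` and `predicted` are parallel sequences of integer severity
    scores in [low, high]. Each absolute deviation is divided by the
    maximum deviation possible for its gold label, and the normalized
    errors are then macro-averaged over the classes present in `gold`.
    """
    per_class = defaultdict(list)
    for y, x in zip(gold, predicted):
        max_dev = max(y - low, high - y)          # worst possible error for this gold label
        per_class[y].append(abs(x - y) / max_dev)
    nmae = sum(sum(e) / len(e) for e in per_class.values()) / len(per_class)
    return 1.0 - nmae

# Toy example with three classes present in the gold standard (score of about 0.83).
print(inverse_normalized_macro_mae(gold=[1, 1, 2, 4], predicted=[1, 2, 2, 3]))
```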
6. Shared Task
We released the training data and their gold standard annotations to the research community on the 11th of June 2016. On the 10th of August 2016, we released the test records without gold standard annotations, and collected the system outputs on the test data during the next 2 days. On the 12th of August 2016, we stopped accepting submissions.
Each team could submit up to three system runs for evaluation. We received 65 submissions from 24 teams, consisting of a total of 110 researchers from 42 institutions and 11 different countries. All twenty-four teams tackled the task with supervised machine learning approaches. Four teams additionally experimented with semi-supervised approaches. We received one submission based on hand-written rules and two hybrid systems. Only three teams out of twenty-four used medical expertise to better tackle the task.
7. Results
We evaluated and ranked each team on their best performing submission. The average performance among all the submissions was 0.771 (median: 0.776) with a standard deviation of 0.056. The lowest performing run scored 0.525, whereas the best one scored 0.863. We did not receive any invalid submissions.
Table 1 shows the top 10 performing submissions and gives details of their methods and the per-class INMAE.
The top teams share the use of ensemble strategies. The run submitted by SentiMetrix Inc. proposed a complex ensemble of association rules and machine learning-based classifiers [20]. The system trained each classifier on a subset of the most discriminative features picked by using feature selection algorithms well suited to the classifiers. Those algorithms were optimized with respect to the official ranking score. Similarly, the research group from the University of Kentucky experimented with embeddings extracted from the training data in an ensemble of several differently-trained Convolutional Neural Networks (CNNs) [21]. In their experiments, they treated singly annotated records the same as doubly annotated ones.
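For illustration only, an ensemble of classifiers of the kind listed in Table 1 can be assembled from off-the-shelf components. The sketch below is a minimal scikit-learn voting ensemble over bag-of-n-gram features; it is not the SentiMetrix pipeline, which additionally used association rules, a deep neural network, and per-classifier feature selection, and the records and labels shown are hypothetical:

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical training data: raw record text with gold severity labels (1=absent .. 4=severe).
records = ["no substance use, mild anxiety ...", "daily MJ use, focus of treatment ...",
           "heavy EtOH use, intensive outpatient program ...", "hospitalized after overdose ..."]
labels = [1, 2, 3, 4]

ensemble = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),          # unigram and bigram features
    VotingClassifier(
        estimators=[
            ("svm_rbf", SVC(kernel="rbf")),
            ("rf", RandomForestClassifier(n_estimators=200)),
            ("nb", MultinomialNB()),
            ("ada", AdaBoostClassifier()),
        ],
        voting="hard",                            # majority vote over the classifiers' predictions
    ),
)
ensemble.fit(records, labels)
print(ensemble.predict(["occasional MJ use, not a focus of treatment ..."]))
```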
Some teams put particular emphasis on segmenting the documents into different semantic blocks. The team from The University of Texas at Dallas extracted, among other linguistic features, the structure of the paragraphs by assigning to each a specific pattern: Yes/No questions, questions with an affirmative answer, or non-negated narrative blocks [22]. Their best run was designed as a linear combination of two approaches. In the first approach, per-class Support Vector Regressors (SVRs) with three separate Support Vector Machines (SVMs) predicted the class of each record. In the second approach, the records were instead classified with a learning-to-rank approach. The team from the University of Pittsburgh paid the same attention to the note structure. They proposed a pipeline that normalizes section titles and Yes/No questions into categorical features. They used the remaining narrative blocks to extract keywords and medical entities using MedLEE. A Decision Tree classifier predicted the labels. The LIMSI-CNRS research group used a similar feature extraction process, but experimented in a semi-supervised setting: they used the unannotated data to explore self-training with SVM classifiers. Med Data Quest Inc. proposed a 3-tier pipeline that predicted the label of each record with a Gradient Tree Boosting (GTB) classifier based on binary, numerical and categorical features extracted from section names and questions [23]. They also used the outputs of a battery of SVRs, each trained on a narrative block, as additional numerical features. The systems from the Harbin Institute of Technology Shenzhen Graduate School divided each record into a sequence of question/answer pairs, symptom/medicine pairs and section names. They then used this information to train a neural network regressor composed of two tiers: a CNN and a Long Short-Term Memory (LSTM) network. This layout forced the regressor to take into account the order of appearance of the different sections in each record.
The approach presented by the University of Minnesota extracted unigrams and bigrams and used Information Gain to reduce the feature space to just 1,000 features. They predicted the labels by using an SVM classifier with re-weighted responses to reflect the unbalanced nature of the data. The Antwerp University team corrected misspellings and erroneously concatenated words by using hand-written regular expressions, and mapped the words to UMLS concepts by using fuzzy matching rules [24]. Instead of using the entire set of mapped concepts, they restricted it to the ones related to psychiatric diagnoses by means of the Diagnostic and Statistical Manual of Mental Disorders (DSM). The system predicted record labels by using Random Forests in combination with outlier detection and bootstrapping.
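A minimal sketch of the n-gram selection and class-weighting strategy described above is shown next. It is an approximation, not the Minnesota system: scikit-learn's mutual information is used here as a stand-in for Information Gain, and the toy records and labels are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical records with gold severity labels (1=absent .. 4=severe).
records = ["patient denies EtOH and street drugs ...", "smokes cigarettes, no other use ...",
           "MJ use is the focus of outpatient treatment ...", "detox admission after blackouts ..."]
labels = [1, 2, 3, 4]

pipeline = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),            # unigram and bigram counts
    SelectKBest(mutual_info_classif, k=1000),       # keep the 1,000 most informative features
    LinearSVC(class_weight="balanced"),             # re-weight classes against the imbalance
)
# With this toy corpus k exceeds the vocabulary size, so all features are kept (a warning is emitted).
pipeline.fit(records, labels)
print(pipeline.predict(["occasional MJ use, not a focus of treatment ..."]))
```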
Finally, the research group from The University of Manchester and collaborating institutions proposed a rule-based system. A domain expert analyzed 50 randomly chosen records in the training data and identified syntactic patterns suggesting the presence of positive valence. The system matched the patterns in the records by using dictionaries enriched with lexical variations, and took into account the order of importance of the activated rules: more effective rules superseded less effective ones.
8. Discussion
Challenge participants interpreted the symptom severity prediction task in various ways and developed different solutions to the problem. Despite the differences in interpretation and technical implementation, we found that almost every run in the top 10 used a form of record segmentation: splitting up the record into different blocks and applying different feature extraction and classification strategies to them. Among those, the systems using ensemble strategies scored higher than the rest.
Specifically, most participants found the record segmentation phase to be pivotal for the classification task. Several teams went as far as developing ad hoc segmenters. Most systems included Yes/No questions in the analysis only when the answer was positive, and used narrative blocks (free-text portions outside of Yes/No questions) to extract keywords, medical entities and n-grams, or to train regressors. Several teams treated narrative blocks differently, and in some cases even discarded them, when they mostly contained negated statements.
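A minimal sketch of such a segmenter is shown below. The line-level question pattern is an assumption made for illustration, since the record layout is only partially shown in Figure 1:

```python
import re

# Assumed line format for answered Yes/No questions, e.g. "Depression: Yes" or "EtOH abuse: No".
QUESTION = re.compile(r"^(?P<question>[^:]+):\s*(?P<answer>Yes|No)\s*$", re.IGNORECASE)

def segment_record(text):
    """Split a record into affirmatively answered Yes/No questions and narrative lines."""
    positive_questions, narrative = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = QUESTION.match(line)
        if match:
            # keep only questions with an affirmative answer, as most systems did
            if match.group("answer").lower() == "yes":
                positive_questions.append(match.group("question").strip())
        else:
            narrative.append(line)    # free-text block: source of keywords, entities, n-grams
    return positive_questions, narrative

# Example on a hypothetical excerpt.
print(segment_record("Depression: Yes\nEtOH abuse: No\nPatient reports occasional MJ use."))
```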
The choice of providing some unannotated records paid off: four teams (9 runs) experimented with semi-supervised approaches. Somewhat surprisingly, 11 teams opted not to use the singly annotated records, whereas the rest of the teams used them in different ways: from treating them the same as the fully annotated ones, to using them for bootstrapping or hyperparameter tuning.
We also investigated the performance of the top 5 runs by means of their confusion matrices (see Figure 5). As expected, when a wrong prediction is made, the predicted class is usually an adjacent neighbor of the gold class (mild for absent, moderate for severe). Absent is never misclassified as severe, and vice versa. The top 5 submissions make this type of mistake because they are selected according to the official ranking score, which favors small errors over large ones. More interestingly, for the internal classes (mild and moderate) there is a general tendency to confuse them with each other: mild is mispredicted as moderate, and moderate as mild. In other words, moderate is less likely to be misclassified as severe, and mild is less likely to be misclassified as absent.
Figure 5.
Confusion matrices for the top 5 runs.
Finally, the overall level of performance reached by the top-ranked system (0.863019) qualifies the problem as effectively addressable with data-driven approaches. To better frame the systems' results, we also measured the annotators' performance with respect to the gold standard by using the official score (see Section 5 for details). We found the scores to be 0.98, 0.96 and 0.90, positively correlated with the annotators' years of experience. Although close, the presented systems are not yet at the level of a young psychiatrist.
8.1. Error analysis
We analyzed the predictions from the ten best runs with the aim of studying which records were most difficult to classify and why. For each record in the test set, we averaged the absolute distance between the gold label and the systems' predictions (average absolute error). We then divided the distance scale into twenty-one bins and three groups of records, which we called "easy", "medium" and "difficult". Each group comprises seven bins. Figure 6 shows the resulting graph, with the three groups highlighted in different colors. For example, for 28 test records the recorded average absolute error was less than 0.1; in other words, almost all the ten systems correctly classified them.
Figure 6.
Number of test set records grouped by absolute distance. The distance is averaged across the top 10 best systems predictions.
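This grouping can be reproduced as follows. The sketch assumes twenty-one equal-width bins over the [0, 3] range of possible average errors, with seven bins per group; the exact bin boundaries are an assumption, since they are not stated explicitly above:

```python
import numpy as np

def group_records(gold, run_predictions, n_bins=21):
    """Bin records by their average absolute error across runs.

    `gold` has shape (n_records,) and `run_predictions` has shape (n_runs, n_records);
    both hold integer severity scores in [1, 4]. Returns the per-bin histogram and
    the sizes of the easy / medium / difficult groups (seven bins each).
    """
    gold = np.asarray(gold, dtype=float)
    runs = np.asarray(run_predictions, dtype=float)
    avg_error = np.abs(runs - gold).mean(axis=0)              # average absolute error per record
    edges = np.linspace(0.0, 3.0, n_bins + 1)                 # equal-width bins over [0, 3]
    counts, _ = np.histogram(avg_error, bins=edges)
    easy = int((avg_error < edges[7]).sum())                  # first seven bins
    medium = int(((avg_error >= edges[7]) & (avg_error < edges[14])).sum())
    difficult = int((avg_error >= edges[14]).sum())
    return counts, easy, medium, difficult
```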
Overall, 177 (82% of the test set) records fell into the easy group. On average, only 2.49 runs among the ten best runs misclassified these records. 28 out of 177 records were correctly classified by all of the runs.
Thirty-three (15% of the test set) records fell into the medium group. The systems that misclassified these records generally predicted a higher severity level than the correct one. When we studied them closely, we discovered that these records presented low severity for the positive valence domain, but expressed high severity for symptoms outside of the positive valence domain. Often, they contained high severity for the negative valence domain (depression, stress, anxiety, panic and similar signals). Although the two domains are to some extent correlated (see Appendix A), the systems misclassified those records since they presented an unusually high prominence of negative valence symptoms. On average, 7.7 runs out of ten misclassified these records.
The difficult group contained just 6 records (3% of the test set). These records were misclassified by almost every system in the top 10. With the exception of several negations related to alcohol or drugs, these records contained almost no positive valence signals. The presence of very few signalling sentences hidden in a large number of negative valence signals is the most prominent characteristic that distinguishes these records from the easy and medium ones.
System outputs for this track are available upon request from the authors of this paper.
9. Conclusion
This paper provides an overview of Track 2 of the 2016 CEGS N-GRID shared task in natural language processing for clinical data, which focused on predicting symptom severity in the positive valence domain of the RDoC framework.
The data used for this task comes from a new corpus of 1,000 initial psychiatric evaluation records: the first corpus of mental health records ever released to the scientific community. Professional psychiatrists labeled the records with one of four levels of symptom severity: absent, mild, moderate, and severe.
One hundred and ten researchers, organized into twenty-four teams, participated in the track, submitting 65 system runs for evaluation. The highest performing system obtained an official ranking score of 0.863019, which is close to the level of performance registered for the least experienced psychiatrist among the annotators.
The results of this track frame the symptom severity classification task as effectively addressable with data-driven approaches, although room for improvement remains. The task proved to be generally easy, with the exception of two specific classes of records: records with very few but crucial positive valence signals, and records describing patients predominantly affected by negative rather than positive valence. Those cases proved to be very challenging for most of the systems and set the stage for further research before the task can be considered solved.
Table 1.
Top 10 performing runs, ranked according to the INMAEM (see Section 5). Bold figures highlight the highest number in each column.
| # | Affiliations | Methods | INMAE (absent) | INMAE (mild) | INMAE (moderate) | INMAE (severe) | Ranking score |
|---|---|---|---|---|---|---|---|
| 16 | SentiMetrix Inc. | Ensemble of Support Vector Machine with RBF kernel, Random Forest, Multinomial Naive Bayes, AdaBoost, Deep Neural Network | .892473 | .860465 | **.793478** | .905660 | **.863019** |
| 17 | The University of Texas at Dallas | Support Vector Machine and Random Forest | .806452 | .831395 | .782609 | **.943396** | .840963 |
| 18 | University of Kentucky | Convolutional Neural Network | **.913978** | .918605 | .760870 | .761006 | .838615 |
| 21 | University of Pittsburgh | Decision Tree and Bayes Network | .881720 | .790698 | **.793478** | .836478 | .825594 |
| 11 | Med Data Quest Inc. | Support Vector Regressor and Gradient Tree Boosting | .860215 | .854651 | .750000 | .805031 | .817474 |
| 7 | Harbin Institute of Technology Shenzhen Graduate School | Convolutional Recurrent Neural Network | .827957 | .854651 | .760870 | .823899 | .816844 |
| 20 | University of Minnesota | Support Vector Machine and Naive Bayes | .849462 | .831395 | .717391 | .861635 | .814971 |
| 2 | Antwerp University Hospital, University of Antwerp, UZA | Random Forest with Ontology-based rules | .870968 | .872093 | .652174 | .830189 | .806356 |
| 10 | LIMSI-CNRS | Support Vector Machine | .827957 | .866279 | .739130 | .773585 | .801738 |
| 3 | Australian Institute of Health Innovation, The University of Manchester, University of Novi Sad, University of Oxford | Hand-written rules with Neural Network | .817204 | **.924419** | .695652 | .767296 | .801143 |
Highlights.
Shared a clinical corpus of 1,000 neuropsychiatric initial evaluation records.
Scientific challenge involving 110 researchers, 24 teams, 65 submissions.
The best system achieved a level of performance comparable to the one recorded for the least experienced among the experts involved in the study.
Classifying positive valence domain symptom severity is accomplishable automatically.
The records describing patients affected by both negative and positive valence proved to be challenging for the automatic systems.
Acknowledgments
The authors would like to thank the 2016 CEGS N-GRID program committee, along with all the researchers who committed to the shared tasks. The authors would also like to thank the three medical doctors who annotated the data: Roy Perlis, Hannah Brown, and James Rosenquist. We also thank the two anonymous reviewers for their insights. The first author would like to thank Halip Saifi for his technical support in managing the CEGS N-GRID 2016 official website and Christopher Kotfila for sharing the 2014 evaluation scripts on which we built. The grants NIH P50 MH106933 (PI: Isaac Kohane) and NIH 4R13LM011411 (PI: Ozlem Uzuner) provided funding for this project.
Appendix A
We studied the experts' annotations and summarized their decision process, the cross-domain relations, and the relevant keywords. The following is not meant to be a set of strict rules, but rather the conventions that guided the psychiatrists during the annotation process, especially in the presence of borderline cases.
Depression
- A depressed patient can score absent (0).
- A patient who is depressed and has decreased interest scores at least mild (1).
  - Examples: people with major depressive disorder
- A patient who is depressed and has decreased interest and that is the main focus of treatment scores at least moderate (2).
  - Examples: people who need an intervention to get out of bed
- A patient who is depressed and needs hospitalization/electroconvulsive therapy scores severe (3).
Addiction
- A patient who is violent and has a history of substance abuse scores at least moderate (2).
- A patient who smokes cigarettes scores at least mild (1).
- A patient who uses alcohol (EtOH) or street drugs scores at least moderate (2).
- A patient who uses marijuana (MJ):
  - occasional use, not the focus of a treatment = mild (1)
  - specific focus of a treatment = moderate (2)
  - inpatient (or could have been) = severe (3)
- A patient who has legal troubles (driving under the influence, arrest) or follows an intensive outpatient program because of substance abuse scores at least moderate (2).
- A patient who has blackouts/detoxification/withdrawal symptoms because of substance abuse scores severe (3).
- A patient who participates in Alcoholics Anonymous scores at least moderate (2).
Motivation (decrease)
- A psychotic patient who is amotivated (anhedonia):
  - not the focus of treatment = mild (1)
  - needs treatment = moderate (2)
  - requires ER or hospitalization = severe (3)
- A patient who has little interest or pleasure AND this is not a focus of treatment scores mild (1).
Motivation (increase)
- A patient who has mania:
  - few hypomanic symptoms scores mild (1)
  - elevation/abnormal drive scores moderate (2)
  - dangerous behavior that needs hospitalization scores severe (3)
  - is already hospitalized scores severe (3)
- A patient who has obsessive compulsive disorder:
  - present AND not under prescription scores mild (1)
  - present AND under prescription scores at least moderate (2)
  - present AND hospitalized scores severe (3)
- A patient who has bipolar disorder:
  - has symptoms scores mild (1)
  - present AND under prescription scores at least moderate (2)
  - present AND hospitalized scores severe (3)
Some general facts about psychiatry
a disorder belongs to the positive valence domain if there is a rewarding component (e.g., intentional vs. unintentional overdose)
main sources of positive valence are abnormal changes in drive/approach/motivation: increase (mania, substance use) and decrease (symptoms of depression)
depression, anxiety, anhedonia and lack of motivation are correlated
violence not caused by substance use does not score in positive valence
cutting is an abnormally rewarding behavior
PTSD has a negative valence component (arousal, social withdrawal, cognition, intrusive thoughts)
any mood/anxiety problem falls in the negative valence domain
Some general facts about the data
all the notes contain a review of systems which is standardized, although some categories come and go (PTSD)
most notes contain useful alcohol screen
some words, like panic and abuse, are hard because they appear both negated and non-negated
Important dimensions
the frequency in history
is or is not the focus of a treatment (not focus: mild, focus: moderate, inpatient: severe)
the gravity of the specific event
Challenges
patients describing someone else’s symptoms (spouse worried about her husband, parent worried about one child)
vague prior event reference
clinician diagnosis not consistent with symptoms (patient describes panic attacks, diagnosed with general anxiety)
extremely complex patients (PTSD + borderline + substance +/− psychosis)
extremely ill patients where history is not helpful (but mental status and impression may be)
many syndromes are reflected in multiple domains (depressed mood (negative), loss of interest (positive), insomnia (arousal), OCD, PTSD, panic)
many domains reflect multiple syndromes (cognition = ADHD, depressive symptoms, psychosis, PTSD)
Trigger words
Depression, anxiety, anhedonia, bipolar disorder, smoking, smoking history, cigarette, EtOH (alcohol), MJ (marijuana), street drugs, illegality, manic, mood, decreased interest, decreased pleasure, violence, blackout, amotivation, cutting, bingering (addiction to crack/cocaine), purging (to cause intestinal evacuation), IOP (intensive outpatient program), PHP (partial hospitalization program), DUI (driving under the influence), OUI (operating under the influence), legal trouble, arrest, outpatient treatment, CBT (cognitive behavioral therapy), AA (Alcoholics Anonymous), hypomania, hypomanic, hyperactivity, increased libido, elevation, abnormal drive, ECT (electroconvulsive therapy), MDD (major depressive disorder), OCD (obsessive compulsive disorder), BPAD (bipolar disorder), ER (emergency room), ROS (review of systems), PTSD (post-traumatic stress disorder), ADHD (attention deficit hyperactivity disorder).
Footnotes
Although differently annotated, Track 1 and Track 2 share the same clinical records. We kept the same training/test split for both tracks to avoid accidental disclosure of information to the participants committed to both tracks.
Kappa scores: 0.92692, 0.92523 and 0.90223 respectively.
Conflict of interests
None.
References
1. Jensen PB, Jensen LJ, Brunak S. Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics. 2012;13(6):395–405. doi: 10.1038/nrg3208.
2. Suominen H, Salanterä S, Velupillai S, Chapman WW, Savova G, Elhadad N, Pradhan S, South BR, Mowery DL, Jones GJ, et al. Overview of the ShARe/CLEF eHealth evaluation lab 2013. In: International Conference of the Cross-Language Evaluation Forum for European Languages. Springer; 2013. pp. 212–231.
3. Kelly L, Goeuriot L, Suominen H, Schreck T, Leroy G, Mowery DL, Velupillai S, Chapman WW, Martinez D, Zuccon G, et al. Overview of the ShARe/CLEF eHealth evaluation lab 2014. In: International Conference of the Cross-Language Evaluation Forum for European Languages. Springer; 2014. pp. 172–191.
4. Pradhan S, Elhadad N, Chapman W, Manandhar S, Savova G. SemEval-2014 task 7: Analysis of clinical text. Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014). 2014;199:54–62.
5. Bethard S, Derczynski L, Savova G, Pustejovsky J, Verhagen M. SemEval-2015 task 6: Clinical TempEval. In: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Association for Computational Linguistics; Denver, Colorado: 2015. pp. 806–814.
6. Bethard S, Savova G, Chen WT, Derczynski L, Pustejovsky J, Verhagen M. SemEval-2016 task 12: Clinical TempEval. Proceedings of SemEval. 2016:1052–1062.
7. Uzuner Ö, Goldstein I, Luo Y, Kohane I. Identifying patient smoking status from medical discharge records. Journal of the American Medical Informatics Association. 2008;15(1):14–24. doi: 10.1197/jamia.M2408.
8. Uzuner Ö. Second i2b2 workshop on natural language processing challenges for clinical records. AMIA Annual Symposium Proceedings. 2007:1252–1253.
9. Uzuner Ö, Solti I, Cadag E. Extracting medication information from clinical text. Journal of the American Medical Informatics Association. 2010;17(5):514–518. doi: 10.1136/jamia.2010.003947.
10. Uzuner Ö, South BR, Shen S, DuVall SL. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association. 2011;18(5):552–556. doi: 10.1136/amiajnl-2011-000203.
11. Uzuner Ö, Bodnari A, Shen S, Forbush T, Pestian J, South BR. Evaluating the state of the art in coreference resolution for electronic medical records. Journal of the American Medical Informatics Association. 2012;19(5):786–791. doi: 10.1136/amiajnl-2011-000784.
12. Sun W, Rumshisky A, Uzuner Ö. Evaluating temporal relations in clinical text: 2012 i2b2 challenge. Journal of the American Medical Informatics Association. 2013;20(5):806–813. doi: 10.1136/amiajnl-2013-001628.
13. Stubbs A, Kotfila C, Uzuner Ö. Automated systems for the deidentification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task track 1. Journal of Biomedical Informatics. 2015;58:S11–S19. doi: 10.1016/j.jbi.2015.06.007.
14. Stubbs A, Filannino M, Uzuner Ö. De-identification of psychiatric intake records: Overview of 2016 CEGS N-GRID shared tasks track 1. Journal of Biomedical Informatics. doi: 10.1016/j.jbi.2017.06.011.
15. Stubbs A, Kotfila C, Xu H, Uzuner Ö. Identifying risk factors for heart disease over time: Overview of 2014 i2b2/UTHealth shared task track 2. Journal of Biomedical Informatics. 2015;58:S67–S77. doi: 10.1016/j.jbi.2015.07.001.
16. Spores JM. Clinician's guide to psychological assessment and testing: With forms and templates for effective practice. Springer Publishing Company; 2012.
17. Baccianella S, Esuli A, Sebastiani F. Evaluation measures for ordinal regression. In: Proceedings of the 2009 Ninth International Conference on Intelligent Systems Design and Applications, ISDA '09. IEEE Computer Society; Washington, DC, USA: 2009. pp. 283–287. doi: 10.1109/ISDA.2009.230.
18. Moreau E, Vogel C. Limitations of MT quality estimation supervised systems: The tails prediction problem. In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics. Dublin City University and Association for Computational Linguistics; 2014. pp. 2205–2216.
19. Graham Y, et al. Improving evaluation of machine translation quality estimation. ACL (1). 2015:1804–1813.
20. Kagan V, Subrahmanian V, Dekhtyar A, Eglowski S, Stevens A, Terrell J. Create: Clinical records analysis technologies ensemble. Journal of Biomedical Informatics.
21. Kavuluru R, Rios A. Ordinal convolutional neural networks for predicting RDoC positive valence psychiatric symptom severity scores. Journal of Biomedical Informatics. doi: 10.1016/j.jbi.2017.05.008.
22. Goodwin T, Maldonado R, Harabagiu SM. Automatic recognition of symptom severity from psychiatric evaluation records. Journal of Biomedical Informatics. doi: 10.1016/j.jbi.2017.05.020.
23. Liu Y, Gu Y, Nguyen JC, Li H, Zhang J, Gao Y, Huang Y. Symptom severity classification with gradient tree boosting. Journal of Biomedical Informatics. doi: 10.1016/j.jbi.2017.05.015.
24. Scheurwegs E, Sushil M, Tulkens S, Daelemans W, Luyckx K. Counting trees in random forests: Predicting symptom severity in psychiatric intake reports. Journal of Biomedical Informatics. doi: 10.1016/j.jbi.2017.06.007.