Abstract
Substance abuse carries many negative health consequences. Detailed information about patients’ substance abuse history is usually captured in free-text clinical notes. Automatic extraction of substance abuse information is vital for assessing patients’ risk of developing certain diseases and adverse outcomes. We introduce a novel neural architecture to automatically extract substance abuse information. The model, which uses multi-task learning, outperformed previous work and several baselines created using discrete models. The classifier obtained 0.88-0.95 F1 for detecting substance abuse status (current, none, past, unknown) on a withheld test set. Other substance abuse entities (amount, frequency, exposure history, quit history, and type) were also extracted with high performance. Our results demonstrate the feasibility of extracting substance abuse information with little annotated data. Additionally, we used the neural multi-task model to automatically annotate 59.7K notes from a different source. Manual review of a subset of these notes resulted in 0.84-0.89 precision for substance abuse status.
Introduction
The negative impact of substance abuse (including alcohol, drug, and tobacco) on health is increasingly recognized as a key factor in morbidity and mortality.1-3 In clinical research, it has been established that 5-10% of cancers can be attributed to hereditary factors, while 90-95% are correlated with lifestyle and environmental factors, such as smoking and alcohol consumption.4 The consequences of illicit drug and prescribed opioid abuse are also widespread, causing permanent physical and emotional damage to users. In many cases, users die prematurely from drug overdoses or other drug-associated illnesses. Characterization of a patient’s substance abuse history can be used to assess the risk of future negative health outcomes related to substance abuse. Clinical notes contain rich information detailing the history of substance abuse from the caregivers’ perspective, beyond what is available from structured electronic health record (EHR) databases, and this information can be used to quantify the severity of abuse. To make the reported substance abuse information available for secondary use, automatic extraction approaches are needed to process free-text clinical notes. In this work, we extend prior work using machine learning to automatically extract substance abuse information from clinical text. We introduce and evaluate a neural information extraction (IE) model using a corpus of clinical notes annotated by Yetisgen and Vanderwende.5 The corpus was created using a publicly available data source (MTSamples) and annotated for alcohol, drug, and tobacco abuse information documented in the social history sections of history and physical notes.
We create a neural multi-task model that outperformed Yetisgen and Vanderwende’s previous work and the strong baselines we created using discrete models. To assess the generalizability of the extraction model, we used the multi-task model to annotate 59.7K discharge summaries from the MIMIC-III corpus,6 and hand-scored the substance status predictions of a randomly selected subset of notes. The performance results are encouraging and demonstrate the feasibility and generalizability of our extraction approach. We make the annotated corpus, source code, and annotations for the MIMIC-III notes available online.7
Task
Details of substance abuse are often described in free-text clinical notes. The 2008 i2b2 smoking challenge leveraged de-identified discharge records annotated as “past smoker,” “current smoker,” “smoker,” “non-smoker,” and “unknown.”8 Since its release, many IE systems have been developed to automatically annotate clinical text with these smoking annotations.9-11 Presented methods included lexicon-based, rule-based, and supervised machine-learning models.
Melton et al. reviewed three widely used public health surveys, i.e. practitioners’ questions to measure behaviors that may be relevant to clinical care; they proposed an information model for survey items related to alcohol, drug, and tobacco use, as well as occupation.12-14 Dimensions included temporality, degree of exposure, and frequency. They also investigated longitudinal tobacco use information from clinical notes.15
In the present study, we used Yetisgen and Vanderwende’s annotated corpus to explore IE algorithms for multiple types of substance abuse in clinical notes.5 They followed Melton’s model14 in annotating substance use, because it reflects practitioners’ needs and insights into how these substances impact patient health. Most of the dimensions discussed in the Carter et al. study of drug use were also annotated.14 Chen et al.16 surveyed the use of free-text descriptions of alcohol use, coding for “type,” “status,” “temporal,” and “amount,” dimensions that were also captured in Yetisgen and Vanderwende’s5 annotation. Wang et al. explored extracting similar dimensions using rule-based methods.17
In the Yetisgen and Vanderwende data, each category of substance abuse (alcohol, drug, and tobacco) is represented with seven entities: status, amount, frequency, exposure history, quit history, type, and method.5 Status is a sentence-level label with four classes: unknown, none, current, and past. The other annotations correspond to multi-word spans, which are referred to as “entities” although they are not always noun phrases. In this study, we explore the use of neural networks, since recent work on IE in other domains has shown that they give state-of-the-art performance. We did not extract method, because the number of occurrences is insufficient to adequately assess performance. The substance associated with an entity is implicit in the label, but we also separately detect whether or not each sentence has information related to substance use, referred to here as an “event.” Table 1 provides status label examples and span label examples for the other entities. The amount spans often have a similar format, ([quantity] [units]); however, the units generally vary by substance (e.g. “three packs” vs. “one beer”). The phrases used to describe frequency, exposure history, and quit history are similar for all substances.
Table 1. Examples of sentence-level status labels and multi-word span entities.
Entity/label | Examples |
status/tobacco-unknown | Currently employed. |
status/tobacco-none | She does not smoke. |
status/alcohol-current | He occasionally uses alcohol. |
status/drug-past | He used cocaine in the past. |
type | He occasionally has a beer. |
amount | He smokes less than a pack of cigarettes daily. |
frequency | She occasionally uses alcohol. |
exposure history | ...smokes half a pack per day for 30 years. |
quit history | The patient quit smoking 17 years ago. |
IE performance is evaluated as in prior studies.5 For each substance, status F1 scores are computed at the sentence level for each of the none, past, and current categories and the scores are combined with micro-averaging. The unknown category is our assumed default category for status, so we do not report its performance. Performance on labels involving multi-word spans is evaluated at the token-level, as in prior work on this corpus. Token-level (vs. span-level) evaluation is common in clinical IE tasks such as this, particularly when the dataset size is limited.
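As a concrete illustration of the token-level convention, the following sketch (our own illustration, not the evaluation code used in this study; the `token_prf` helper and the merging of B-/I- prefixes are assumptions) counts per-token hits for one entity, so a partially detected span receives partial credit:

```python
from collections import Counter

def token_prf(gold_tags, pred_tags, target):
    """Token-level P/R/F1 for one entity label.

    gold_tags, pred_tags: per-token labels, e.g. "B-amount",
    "I-amount", "O". A token counts as `target` if its label ends
    with the target name, merging the B-/I- prefixes.
    """
    counts = Counter()
    for g, p in zip(gold_tags, pred_tags):
        g_hit, p_hit = g.endswith(target), p.endswith(target)
        if g_hit and p_hit:
            counts["tp"] += 1
        elif g_hit:
            counts["fn"] += 1
        elif p_hit:
            counts["fp"] += 1
    prec = counts["tp"] / max(counts["tp"] + counts["fp"], 1)
    rec = counts["tp"] / max(counts["tp"] + counts["fn"], 1)
    f1 = 2 * prec * rec / max(prec + rec, 1e-9)
    return prec, rec, f1

# Gold has a two-token amount span; the prediction catches one token.
gold = ["O", "B-amount", "I-amount", "O"]
pred = ["O", "B-amount", "O", "O"]
print(token_prf(gold, pred, "amount"))  # (1.0, 0.5, 0.667)
```

Under span-level scoring, the partially detected span in this example would count as a miss; token-level scoring credits the overlapping token, which is better suited to a small dataset.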
Related Work
Clinical IE tasks have been approached using rule-based and machine learning systems, and more recent work has utilized neural networks. Maldonado et al. automatically annotated electroencephalography (EEG) concepts and attributes using active learning in conjunction with a multi-task, deep neural network, including a long short-term memory (LSTM) network.18 Dernoncourt et al. detected protected health information (such as names, locations, and dates) in clinical notes using a bidirectional LSTM with character-aware word embeddings.19 Other clinical natural language processing (NLP) tasks have also benefited from the use of neural approaches.20,21
There is a significant body of IE work outside of the clinical domain, including IE subtasks like named entity recognition (NER), slot filling, event extraction, and text classification. Several works have achieved high performance in IE tasks using the Conditional Random Field (CRF) model.22-24 More recent IE work utilized recurrent neural networks (RNN), including the LSTM network, which captures long-range word dependencies.25-27 A popular LSTM-based approach to sequence tagging IE tasks incorporates a CRF layer at the output of the LSTM.25,27,28 The inclusion of the CRF allows the model to learn allowable transitions between labels, rather than making conditionally independent per-token predictions. Convolutional neural networks (CNN) have also successfully been applied to IE tasks, including text classification, slot filling, and event extraction.29-31 These methods were incorporated in the neural model developed in this study.
In multi-task modeling, a single model generates multiple outputs and/or leverages shared parameters across multiple data sets or tasks. Neural multi-task models have achieved state-of-the-art performance in a variety of NLP tasks, including IE.26,32-34 Neural multi-task modeling is also useful for applications in the medical domain outside of NLP. For example, Harutyunyan et al. used clinical time series data and neural multi-task modeling to predict in-hospital mortality, length of stay, phenotyping, and decompensation.35 Jaques et al. predicted health, stress, and happiness using a neural multi-task model and data from wearable sensors and smartphone logs.36
Attention mechanisms have been used in neural encoder-decoder models to identify and attend to the most relevant position within the input sequence.37 Attention has been extended to representation learning through self-attention.38 Self-attention learns a vector representation of target phenomena and calculates attention weights for an input sequence, to emphasize the elements within a sequence that are most salient to the task. The learned attention weights provide an interpretable result, which highlights the most predictive elements within the sequence.
Methods
This section describes the data that forms the foundation of the experimental work, the discrete models implemented as a baseline, and the new neural multi-task model. The training set and test set assignments from Yetisgen and Vanderwende’s previous work were unavailable. Consequently, we reimplemented discrete models, similar to Yetisgen and Vanderwende’s previous work, to facilitate direct comparison of the neural multi-task model with a discrete baseline. In the reimplementation of the discrete models, some additional features were explored.
In the discrete and neural multi-task modeling frameworks, a separate module was designed to determine whether or not a sentence is associated with any substance events (single binary detector), and the result was used to filter out sentences with no relevant information in both frameworks. Both frameworks also treat the span detection problems as sequence tagging, using a begin-inside-outside (BIO) approach to representing spans of different types.
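For example, one plausible BIO encoding of a sentence with an amount span and a frequency span (the span boundaries here are illustrative assumptions, not annotations taken from the corpus) is:

```python
# Hypothetical BIO tags; span boundaries are assumed for illustration.
tokens = ["He", "smokes", "one", "pack", "daily", "."]
tags   = ["O",  "O", "B-amount", "I-amount", "B-frequency", "O"]
```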
Data
Yetisgen and Vanderwende annotated a corpus of 364 social history sections drawn from 516 history and physical notes from the MTSamples website (http://mtsamples.com).5 The annotated dataset is available for download at the UW-BioNLP website (http://depts.washington.edu/bionlp/index.html?corpora). Table 2 contains a summary of the entity frequencies. For multi-word span entities, these counts reflect the number of spans, not the number of associated tokens.
Table 2: Entity frequencies in the annotated corpus.
Entity | Alcohol | Drug | Tobacco |
status | 254 | 154 | 278 |
type | 26 | 112 | 50 |
method | 0 | 10 | 4 |
amount | 69 | 25 | 78 |
frequency | 65 | 6 | 59 |
exposure history | 7 | 10 | 37 |
quit history | 6 | 2 | 37 |
Discharge summaries (59.7K notes) and physician notes (142K notes) from the MIMIC-III corpus6 were used to pre-train word embeddings for the neural multi-task model in an unsupervised fashion. The discharge summaries were also used to evaluate the generalizability of the multi-task model and to provide data with automatically detected labels. We experimented with using the entirety of these notes vs. only the “Social History” and “History of Present Illness” sections. Performance was similar, so the smaller subset was used. Note sections were identified using simple pattern matching. Some notes in the MIMIC-III corpus include extraneous line breaks within sentences. All of the lines within a given section were merged into a single line, and then a sentence boundary detector39 was used to parse the section into sentences. The extracted sections resulted in a corpus of 19M tokens from 113K notes for word embedding training.
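A minimal sketch of this preprocessing, assuming NLTK’s sentence tokenizer as the boundary detector cited above (the `section_sentences` helper and the regex are our own, not the study’s exact code):

```python
import re
import nltk  # requires: nltk.download("punkt")

def section_sentences(section_text):
    # Merge extraneous line breaks within the section into spaces,
    # then split the merged text into sentences.
    merged = re.sub(r"\s*\n\s*", " ", section_text).strip()
    return nltk.sent_tokenize(merged)

section = "The patient quit smoking 17\nyears ago.\nDrinks alcohol socially."
print(section_sentences(section))  # -> two sentences, line breaks repaired
```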
Discrete Models
Events. A substance detection model was trained to predict the presence of any substance event (alcohol, drug, or tobacco) within a sentence. The predictions from the substance detection model were used as a first-stage mask, such that subsequent modeling only used sentences that were predicted to contain a substance event. The substance detection model was created using logistic regression (LR). The same substance detection model was used as the first-stage mask for both discrete and neural modeling approaches. Substance indicator models were also trained to predict each individual substance using LR. LR models used word n-gram features (unigram-trigram) and gazetteer features. The gazetteer features consisted of three binary features, indicating the presence of an alcohol, drug, or tobacco gazetteer word. The gazetteer word lists were generated by searching WordNet for hyponyms of each substance, resulting in 324 alcohol, 271 drug, and 46 tobacco words (search terms included “tobacco,” “alcoholic_drink,” “sedative,” “narcotic,” and “controlled_substance”).
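A sketch of such a detector in scikit-learn, assuming toy gazetteers in place of the WordNet-derived lists (the pipeline layout and the `gazetteer_features` helper are our assumptions, not the study’s code):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical miniature gazetteers; the paper's lists came from WordNet.
GAZETTEERS = {
    "alcohol": {"beer", "wine", "vodka", "etoh"},
    "drug": {"cocaine", "heroin", "opioid", "marijuana"},
    "tobacco": {"cigarette", "cigar", "tobacco", "pack"},
}

def gazetteer_features(sentences):
    # One binary feature per substance: does any gazetteer word occur?
    rows = []
    for s in sentences:
        toks = set(s.lower().split())
        rows.append([float(bool(toks & words)) for words in GAZETTEERS.values()])
    return np.array(rows)

detector = Pipeline([
    ("features", FeatureUnion([
        ("ngrams", CountVectorizer(ngram_range=(1, 3), binary=True)),
        ("gazetteer", FunctionTransformer(gazetteer_features, validate=False)),
    ])),
    ("clf", LogisticRegression(C=1.0, solver="liblinear")),
])

X = ["He smokes a pack daily.", "Currently employed.", "Drinks two beers nightly."]
y = [1, 0, 1]  # 1 = sentence contains a substance event
detector.fit(X, y)
print(detector.predict(["She denies tobacco use."]))
```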
Status. The status was classified using separate Maximum Entropy (MaxEnt) models for each substance. The MaxEnt models used word n-gram features (unigram-trigram) and the same gazetteer features as the LR models.
Sequence Tags. The amount, frequency, exposure history, quit history, and type entities were extracted using linear-chain CRF models.22 Features included word n-grams (unigram-trigram), part-of-speech (POS) tags, capitalization indicators (lowercase, uppercase, title case, and other), and string type indicators (punctuation, number, alphabetic, alphanumeric, and other). Amount, frequency, exposure history, and quit history entities were extracted using two approaches. In the first approach, a separate CRF model was trained for each substance to extract substance-specific amount, frequency, exposure history, and quit history entities. In the second approach, these labels were merged across all substances, and a single CRF model was created to extract these substance-independent entities. Then, the substance indicator models were used to associate a substance with each extracted entity using the heuristic that any predicted entity in the sentence is assigned to all substances detected for the sentence. In the Results section, the first approach is referred to as “CRF,” and the second, two-stage, approach is referred to as “CRF+LR.”
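A minimal version of such a tagger can be sketched with the sklearn-crfsuite package (an implementation assumption; the study does not name its CRF toolkit), using word identity plus the shape-style indicators described above. The feature templates and the tagged spans in the toy example are our own:

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def token_features(tokens, i):
    # Word identity plus capitalization/string-type style indicators
    # (a sketch; the exact feature templates are assumptions).
    w = tokens[i]
    feats = {
        "word": w.lower(),
        "is_title": w.istitle(),
        "is_upper": w.isupper(),
        "is_digit": w.isdigit(),
        "is_alnum": w.isalnum(),
    }
    if i > 0:
        feats["prev_word"] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        feats["next_word"] = tokens[i + 1].lower()
    return feats

sents = [["He", "smokes", "one", "pack", "daily", "."]]
tags = [["O", "O", "B-amount", "I-amount", "B-frequency", "O"]]  # assumed spans
X = [[token_features(s, i) for i in range(len(s))] for s in sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c2=0.1, max_iterations=100)
crf.fit(X, tags)
print(crf.predict(X))
```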
Neural Multi-task Model
A single neural multi-task model was created to predict all entities for all substances, leveraging the shared information across entities and substances. Figure 1 is a diagram of the neural multi-task model. In the discrete modeling approaches, there is no shared learning between the different substances or entities; the multi-task approach is intended to take advantage of the similarities between substances and entities. The inputs to the model were pre-trained word embeddings, which fed into a bidirectional LSTM layer that used layer normalization.40 Word embeddings were pretrained using word2vec41 on the MIMIC-III data and held constant during the training of the multi-task model, due to the limited size of the annotated data. A 38K token vocabulary was chosen using tokens that occurred at least five times in the MIMIC-III subset, corresponding to a relatively low out-of-vocabulary rate of 2.3%. The forward and backward output states of the LSTM were concatenated, resulting in an n-by-2u matrix H, where u is the LSTM hidden size and n is the sequence length. H was used as the feature input to the downstream entity- and substance-specific classifiers.
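In modern Keras, the shared encoder might be sketched as follows (an approximation: the embedding dimension is assumed, pre-trained word2vec vectors would be loaded into the Embedding layer, and Keras’s stock LSTM omits the layer normalization used in the paper):

```python
import tensorflow as tf

vocab_size, embed_dim, u, n = 38_000, 100, 100, 40  # embed_dim is assumed

word_ids = tf.keras.Input(shape=(n,), dtype="int32")
# trainable=False mirrors holding the pre-trained embeddings constant.
emb = tf.keras.layers.Embedding(vocab_size, embed_dim, trainable=False)(word_ids)
H = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(u, return_sequences=True, dropout=0.2)
)(emb)  # concatenated forward/backward states: (batch, n, 2u)
encoder = tf.keras.Model(word_ids, H)
print(encoder.output_shape)  # (None, 40, 200), matching H size = 200
```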
Events. The same LR model as in the discrete model was used to filter out sentences with no events. Then, self-attention was used to estimate the probability of each substance event, automatically identifying word positions that best predict a given substance. Attention weights, Ai, were calculated as
$$A_i = \mathrm{softmax}\left(W_{S,i} H^{T}\right) \tag{1}$$

where i denotes the substance, WS,i is a 1-by-2u learned vector, and Ai is a 1-by-n vector. The probability of a substance event was calculated using the weighted average of the hidden states, AiH, as

$$P_{event,i} = \mathrm{softmax}\left(W_{A,i} (A_i H)^{T} + b_{A,i}\right) \tag{2}$$
where WA,i is a 2-by-2u weight matrix and bA,i is a 2-by-1 bias vector. The probability of the substance events was concatenated to form a 3-dimensional vector (# substances) for use in status classification. The ground truth for learning Pevent was determined based on whether or not a given substance event occurred within the sentence. Error from the status classification was not back-propagated to Pevent.
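The following NumPy sketch traces equations (1) and (2) with random stand-in values to make the shapes concrete (toy dimensions, not trained parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, two_u = 6, 8                      # toy sequence length and hidden size
rng = np.random.default_rng(0)
H = rng.normal(size=(n, two_u))      # BiLSTM output states
W_S = rng.normal(size=(1, two_u))    # learned attention vector, eq. (1)
W_A = rng.normal(size=(2, two_u))    # event classifier weights, eq. (2)
b_A = np.zeros((2, 1))

A = softmax(W_S @ H.T)               # (1, n) attention weights, eq. (1)
context = A @ H                      # (1, 2u) weighted average of states
P_event = softmax(W_A @ context.T + b_A, axis=0)  # (2, 1), eq. (2)
print(A.round(2), P_event.ravel().round(2))
```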
Status. Status was predicted using separate, substance-specific classifiers. A CNN with max-pooling was used to create a low dimensional sentence representation for the target substance, xCNN,i, where i denotes the substance. The use of this CNN approach was motivated by the ability of CNNs to highlight salient information in the input sequence and preserve word order information. The CNNs used multiple filter widths, and multiple filters were created for each filter width. The substance probability vector, Pevents, was concatenated with xCNN,i to form a vector of size m-by-1 to predict status as
$$P_{status,i} = \mathrm{softmax}\left(W_{status,i}\,[P_{events};\, x_{CNN,i}] + b_{status,i}\right) \tag{3}$$
where Wstatus,i is a 4-by-m weight matrix and bstatus,i is a 4-by-1 bias vector (m = 3 + the total number of CNN filters). Status probabilities, Pstatus,i, for each substance were concatenated to form a 12-dimensional vector (4 labels × 3 substances), which was used as input features in the sequence tagging tasks. To prevent the sequence tagging tasks from negatively impacting the status classification, error from the sequence tagging tasks was not back-propagated to the status classifiers.
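A Keras sketch of one substance-specific status classifier, using the filter widths, filter counts, and dropout reported in the Training section (the layer wiring is our reconstruction of the description, not the original graph):

```python
import tensorflow as tf

n, two_u, n_filters = 40, 200, 10    # 10 filters at each of widths 2 and 3
H_in = tf.keras.Input(shape=(n, two_u))
P_events = tf.keras.Input(shape=(3,))  # event probabilities, one per substance

pooled = []
for width in (2, 3):
    c = tf.keras.layers.Conv1D(n_filters, width, activation="relu")(H_in)
    pooled.append(tf.keras.layers.GlobalMaxPooling1D()(c))  # max over time
x_cnn = tf.keras.layers.Dropout(0.4)(tf.keras.layers.Concatenate()(pooled))

x = tf.keras.layers.Concatenate()([P_events, x_cnn])  # m = 3 + 20 features
P_status = tf.keras.layers.Dense(4, activation="softmax")(x)  # 4 status labels
status_model = tf.keras.Model([H_in, P_events], P_status)
status_model.summary()
```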
Sequence Tags. Linear-chain CRF models were used to extract sequence tags, because CRFs can learn and enforce allowable transitions between sequence labels (e.g., an I-frequency label may follow a B-frequency label, but an I-amount label may not). The type entities for all substances were extracted using a single CRF, with input features H and Pstatus. The amount, frequency, exposure history, and quit history labels were merged across all three substances for training a substance-independent CRF that estimates probabilities, Pmerged, for these entities, with input features H and Pstatus. Then, substance-specific amount, frequency, exposure history, and quit history entities were extracted using separate, substance-specific CRFs, with input features H, Pstatus, and Pmerged.
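To see why the CRF layer matters, the toy Viterbi decoder below uses hand-set scores (not learned parameters) in which a large negative transition score blocks the invalid "O followed by I-amount" sequence that a per-token argmax would select:

```python
import numpy as np

labels = ["O", "B-amount", "I-amount", "B-frequency", "I-frequency"]
NEG = -1e4
trans = np.zeros((5, 5))             # trans[i, j]: score of label i -> label j
trans[0, 2] = trans[0, 4] = NEG      # O cannot precede an I- tag
trans[1, 4] = trans[3, 2] = NEG      # B- of one entity cannot precede I- of another
trans[2, 4] = trans[4, 2] = NEG      # I- tags of different entities cannot adjoin

def viterbi(emissions, trans):
    """emissions: (n, L) per-token label scores; returns the best label path."""
    n, L = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n, L), dtype=int)
    for t in range(1, n):
        total = score[:, None] + trans + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [labels[j] for j in reversed(path)]

# Per-token argmax would pick O then I-amount, an inadmissible sequence.
em = np.array([[2.0, 1.95, 0.0, 0.0, 0.0],
               [0.0, 0.0, 1.1, 0.0, 1.0]])
print(viterbi(em, trans))  # ['B-amount', 'I-amount']
```

Here the transition penalties steer the decoder to the valid B-amount, I-amount path even though the per-token scores slightly favor the invalid one.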
Training
The annotated corpus was split into training and test sets using an 80%/20% split. All models were tuned using 5-fold cross validation (CV) on the training set. After determining the best configuration, models were retrained on the entirety of the training set. Only the highest performing discrete and multi-task models were applied to the withheld test set.
Discrete model tuning included the selection of feature types (including n-gram order), regularization type (L1 or L2), and regularization strength. Discrete models were created using Python scikit-learn.42 The best performing MaxEnt status models used the following features: unigrams for alcohol; unigrams and gazetteer for drug; and unigrams-bigrams and gazetteer for tobacco.
Neural multi-task model tuning included: selecting the word embedding training set, determining the connections within the multi-task graph, regularization strength, LSTM size, learning rate, number of epochs, CNN filter widths, number of CNN filters at each width, and layer normalization. The multi-task model was regularized using dropout in the LSTM and status CNN layers. The model was trained in multiple stages. In the first stage, the entire model was trained jointly to minimize the average loss across all classifiers (tasks), updating all graph variables except for the pre-trained word embeddings. In the second stage, the learned parameters in the LSTM were frozen, and each set of entity-specific classifiers was trained jointly, with the other classifier variables fixed. The entity-specific classifiers were trained in the following order: 1) events, 2) status, 3) amount, frequency, exposure history, and quit history, and 4) type. The multi-task model configuration that achieved the best overall performance was: H size = 200, H dropout = 20%, # joint epochs = 1000, # entity-specific epochs = 50, learning rate = 0.005, batch size = 20, CNN dropout = 40%, # CNN filters = 10, and CNN filter widths = [2, 3]. The neural multi-task model was created using Google’s TensorFlow.43
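For reference, the selected hyperparameters can be collected into a single configuration (values taken from the text; the per-direction LSTM size is inferred from H size = 2u = 200):

```python
# Best multi-task configuration from cross-validation.
BEST_CONFIG = {
    "lstm_hidden_size": 100,        # u per direction; concatenated H size = 200
    "h_dropout": 0.20,
    "joint_epochs": 1000,
    "entity_specific_epochs": 50,
    "learning_rate": 0.005,
    "batch_size": 20,
    "cnn_dropout": 0.40,
    "cnn_filters_per_width": 10,
    "cnn_filter_widths": [2, 3],
}
```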
All of the CV and test results presented reflect the end-to-end performance, i.e., any event detection errors impact the final results. Performance is presented in terms of the true positive count (TP), false negative count (FN), false positive count (FP), precision (P), recall (R) and F1 score (F1).
Results
The substance detector, which was used as a first-stage classifier, achieved a performance of F1=0.97 during CV and F1=0.98 on the test set. Tables 3 and 4 present the CV development and test set results, respectively, for the sentence-level status. Results are micro-averaged across labels.
Table 3: Sentence-level status results for 5-fold CV on the training set, micro-averaged across labels.
Substance | Model | TP | FN | FP | P | R | F1 |
alcohol | MaxEnt | 168 | 32 | 17 | 0.91 | 0.84 | 0.87 |
alcohol | Multi | 178 | 22 | 23 | 0.89 | 0.89 | 0.89 |
drug | MaxEnt | 94 | 21 | 5 | 0.95 | 0.82 | 0.88 |
drug | Multi | 99 | 16 | 8 | 0.93 | 0.86 | 0.89 |
tobacco | MaxEnt | 172 | 42 | 33 | 0.84 | 0.80 | 0.82 |
tobacco | Multi | 185 | 29 | 33 | 0.85 | 0.86 | 0.86 |
Table 4: Sentence-level status results on the withheld test set, micro-averaged across labels.
Substance | Model | TP | FN | FP | P | R | F1 |
alcohol | MaxEnt | 49 | 5 | 5 | 0.91 | 0.91 | 0.91 |
alcohol | Multi | 52 | 2 | 4 | 0.93 | 0.96 | 0.95 |
drug | MaxEnt | 20 | 6 | 3 | 0.87 | 0.77 | 0.82 |
drug | Multi | 24 | 2 | 1 | 0.96 | 0.92 | 0.94 |
tobacco | MaxEnt | 49 | 11 | 10 | 0.83 | 0.82 | 0.82 |
tobacco | Multi | 53 | 7 | 7 | 0.88 | 0.88 | 0.88 |
High performance was obtained for all substances in both data sets, with the neural model giving slightly better results in all cases. The multi-task model achieved higher performance on status extraction than Yetisgen and Vanderwende’s previous work (F1 scores of 0.91 for alcohol, 0.86 for drug, and 0.80 for tobacco on the test set).5
Tables 5 and 6 present type, amount, frequency, exposure history, and quit history CV development and test set results, respectively. Results are presented for each entity, micro-averaged across substances. For amount, frequency, exposure history, and quit history extraction, the best performing CRF models used the following features: unigrams and gazetteer for alcohol; unigrams, POS, capitalization, string type, and gazetteer for drug; and unigrams and POS for tobacco. For the “CRF+LR” approach, the best performing CRF model used unigrams-trigrams, POS, capitalization, string type, and gazetteer features. The substance indicator models used in the “CRF+LR” approach achieved F1 performance of 0.95 for alcohol, 0.93 for drug, and 0.97 for tobacco during CV and F1 performance of 0.95 for alcohol, 0.88 for drug, and 0.95 for tobacco on the test set. The best-performing CRF model for type extraction used unigram and string type features. On the test set, the neural model produced similar or better results in all categories. Performance is robust in moving from CV to the test set, except for a small degradation for exposure history, which stems from a drop in the drug subset, where data is sparse. For amount, frequency, exposure history, and quit history, the multi-task model outperformed Yetisgen and Vanderwende’s previous work (F1 scores of 0.76 for amount, 0.75 for frequency, 0.63 for exposure history, and 0.75 for quit history)5 and both CRF baselines for each entity.
Table 5: Span entity results for 5-fold CV on the training set, micro-averaged across substances.
Entity | Model | TP | FN | FP | P | R | F1 |
type | CRF | 119 | 53 | 5 | 0.96 | 0.69 | 0.80 |
type | Multi | 123 | 49 | 8 | 0.94 | 0.72 | 0.81 |
amount | CRF | 151 | 97 | 16 | 0.90 | 0.61 | 0.73 |
amount | CRF+LR | 168 | 80 | 90 | 0.65 | 0.68 | 0.66 |
amount | Multi | 180 | 68 | 39 | 0.82 | 0.73 | 0.77 |
exposure hist. | CRF | 79 | 53 | 29 | 0.73 | 0.60 | 0.66 |
exposure hist. | CRF+LR | 89 | 43 | 23 | 0.79 | 0.67 | 0.73 |
exposure hist. | Multi | 85 | 47 | 17 | 0.83 | 0.64 | 0.73 |
frequency | CRF | 104 | 83 | 14 | 0.88 | 0.56 | 0.68 |
frequency | CRF+LR | 120 | 67 | 70 | 0.63 | 0.64 | 0.64 |
frequency | Multi | 142 | 45 | 25 | 0.85 | 0.76 | 0.80 |
quit history | CRF | 59 | 55 | 6 | 0.91 | 0.52 | 0.66 |
quit history | CRF+LR | 82 | 32 | 19 | 0.81 | 0.72 | 0.76 |
quit history | Multi | 65 | 49 | 23 | 0.74 | 0.57 | 0.64 |
Table 6: Span entity results on the withheld test set, micro-averaged across substances.
Entity | Model | TP | FN | FP | P | R | F1 |
type | CRF | 31 | 3 | 1 | 0.97 | 0.91 | 0.94 |
type | Multi | 31 | 3 | 2 | 0.94 | 0.91 | 0.93 |
amount | CRF | 51 | 29 | 8 | 0.86 | 0.64 | 0.73 |
amount | CRF+LR | 61 | 19 | 22 | 0.73 | 0.76 | 0.75 |
amount | Multi | 65 | 15 | 12 | 0.84 | 0.81 | 0.83 |
exposure hist. | CRF | 19 | 41 | 10 | 0.66 | 0.32 | 0.43 |
exposure hist. | CRF+LR | 31 | 29 | 27 | 0.53 | 0.52 | 0.53 |
exposure hist. | Multi | 37 | 23 | 12 | 0.76 | 0.62 | 0.68 |
frequency | CRF | 32 | 21 | 5 | 0.86 | 0.60 | 0.71 |
frequency | CRF+LR | 35 | 18 | 9 | 0.80 | 0.66 | 0.72 |
frequency | Multi | 39 | 14 | 8 | 0.83 | 0.74 | 0.78 |
quit history | CRF | 18 | 9 | 12 | 0.60 | 0.67 | 0.63 |
quit history | CRF+LR | 18 | 9 | 4 | 0.82 | 0.67 | 0.73 |
quit history | Multi | 21 | 6 | 4 | 0.84 | 0.78 | 0.81 |
Tables 7, 8, and 9 present the detailed status test set results for alcohol, drug, and tobacco, respectively. The none case is the dominant class for all substances. Again, the neural model gives the best result on the test set. For the cases where the difference between CV and test performance is greatest, the numbers of samples are small.
Table 7: Detailed status results for alcohol on the test set.
Label | Model | TP | FN | FP | P | R | F1 |
current | MaxEnt | 17 | 3 | 2 | 0.89 | 0.85 | 0.87 |
current | Multi | 20 | 0 | 2 | 0.91 | 1.00 | 0.95 |
none | MaxEnt | 30 | 1 | 3 | 0.91 | 0.97 | 0.94 |
none | Multi | 30 | 1 | 1 | 0.97 | 0.97 | 0.97 |
past | MaxEnt | 2 | 1 | 0 | 1.00 | 0.67 | 0.80 |
past | Multi | 2 | 1 | 1 | 0.67 | 0.67 | 0.67 |
Table 8: Detailed status results for drug on the test set.
Label | Model | TP | FN | FP | P | R | F1 |
current | MaxEnt | 1 | 4 | 0 | 1.00 | 0.20 | 0.33 |
current | Multi | 3 | 2 | 0 | 1.00 | 0.60 | 0.75 |
none | MaxEnt | 19 | 1 | 2 | 0.90 | 0.95 | 0.93 |
none | Multi | 20 | 0 | 1 | 0.95 | 1.00 | 0.98 |
past | MaxEnt | 0 | 1 | 1 | 0.00 | 0.00 | 0.00 |
past | Multi | 1 | 0 | 0 | 1.00 | 1.00 | 1.00 |
Table 9: Detailed status results for tobacco on the test set.
Label | Model | TP | FN | FP | P | R | F1 |
current | MaxEnt | 8 | 3 | 1 | 0.89 | 0.73 | 0.80 |
current | Multi | 8 | 3 | 1 | 0.89 | 0.73 | 0.80 |
none | MaxEnt | 34 | 3 | 6 | 0.85 | 0.92 | 0.88 |
none | Multi | 35 | 2 | 2 | 0.95 | 0.95 | 0.95 |
past | MaxEnt | 7 | 5 | 3 | 0.70 | 0.58 | 0.64 |
past | Multi | 10 | 2 | 4 | 0.71 | 0.83 | 0.77 |
MIMIC Annotation
The substance detection and multi-task models were used to predict substance events in the discharge summaries within the MIMIC-III corpus (59.7K notes). The resulting labeled texts are available online.7
The substance detection model predicted 40.3K MIMIC-III discharge summaries to have a substance event. Table 10 presents the entity occurrence counts from the unsupervised labeling of these notes.
Table 10: Entity occurrence counts from the automatic labeling of MIMIC-III discharge summaries.
Entity | Alcohol | Drug | Tobacco |
status | 44,536 | 20,725 | 45,244 |
type | 4,756 | 14,509 | 11,443 |
amount | 13,262 | 2,551 | 11,298 |
frequency | 12,200 | 355 | 7,183 |
exposure history | 770 | 396 | 4,639 |
quit history | 176 | 72 | 10,241 |
Of the notes predicted to contain at least one substance event, 50 randomly sampled notes were hand-scored to evaluate the precision of the status labels. Table 11 summarizes this manual review; status classification precision was high for all substances (0.84-0.89).
Table 11. Status label precision from the manual review of 50 MIMIC-III discharge summaries.
Substance | TP | FP | P |
alcohol | 76 | 14 | 0.84 |
drug | 80 | 10 | 0.89 |
tobacco | 80 | 10 | 0.89 |
Conclusion
In summary, we implemented a state-of-the-art neural multi-task approach to the extraction of substance use information from clinical notes. The multi-task model outperformed Yetisgen and Vanderwende’s previous work and directly comparable discrete baselines on the test set for all entities, except type. For type, the multi-task model and the CRF baseline had similar performance. Excluding type, the performance gap between the multi-task model and the discrete baselines was larger on the test set than the training CV runs, suggesting the multi-task model generalized better to the test set. The improved extraction performance of the multi-task model will benefit downstream applications, including clinical decision support systems. Additionally, the multi-task modeling framework is well-suited to extending the current model to handle other types of information, such as socio-demographic, behavioral, and environmental exposure factors. A limited evaluation of the unsupervised labeling of the MIMIC-III discharge summaries indicated the multi-task model maintained high precision in the prediction of status.
This work is limited by the size and homogeneity of the annotated corpus, which may negatively impact generalizability of the extraction models to other clinical data sets. However, the initial results on the MIMIC-III data are encouraging. While semi-supervised learning techniques can be used to improve generalization,27,44,45 a larger annotated corpus with notes from different disciplines and institutions is required for assessing the generalizability of the extraction models.
References
1. Centers for Disease Control and Prevention. Annual smoking-attributable mortality, years of potential life lost, and productivity losses–United States, 1997-2001. Morb Mortal Wkly Rep. 2005;54:625–628.
2. World Health Organization. Global status report on alcohol and health, 2014. Available from: http://apps.who.int/iris/bitstream/10665/112736/1/9789240692763_eng.pdf.
3. Degenhardt L, Hall W. Extent of illicit drug use and dependence, and their contribution to the global burden of disease. The Lancet. 2012;379:55–70.
4. Anand P, Kunnumakara AB, Sundaram C, Harikumar KB, Tharakan ST, Lai OS, et al. Cancer is a preventable disease that requires major lifestyle changes. Pharmaceutical Research. 2008;25(9):2097–2116.
5. Yetisgen M, Vanderwende L. Automatic identification of substance abuse from social history in clinical text. In: Artificial Intelligence in Medicine in Europe; 2017. pp. 171–181.
6. Johnson AE, Pollard TJ, Shen L, Lehman LwH, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Scientific Data. 2016;3.
7. Lybarger K. Clinical extraction; 2018. Available from: https://github.com/Lybarger/clinical_extraction.
8. Uzuner Ö, Goldstein I, Luo Y, Kohane I. Identifying patient smoking status from medical discharge records. J Am Med Inform Assoc. 2008;15(1):14–24.
9. Cohen AM. Five-way smoking status classification using text hot-spot identification and error-correcting output codes. J Am Med Inform Assoc. 2008;15(1):32–35.
10. Clark C, Good K, Jezierny L, Macpherson M, Wilson B, Chajewska U. Identifying smokers with a medical extraction system. J Am Med Inform Assoc. 2008;15(1):36–39.
11. Jonnagaddala J, Dai HJ, Ray P, Liaw ST. A preliminary study on automatic identification of patient smoking status in unstructured electronic health records. In: ACL-IJCNLP; 2015. pp. 147–151.
12. Melton GB, Manaktala S, Sarkar IN, Chen ES. Social and behavioral history information in public health datasets. In: AMIA Annu Symp Proc; 2012. pp. 625–634.
13. Chen ES, Carter EW, Sarkar IN, Winden TJ, Melton GB. Examining the use, contents, and quality of free-text tobacco use documentation in the electronic health record. In: AMIA Annu Symp Proc; 2014. pp. 366–374.
14. Carter EW, Sarkar IN, Melton GB, Chen ES. Representation of drug use in biomedical standards, clinical text, and research measures. In: AMIA Annu Symp Proc; 2015. pp. 376–385.
15. Wang Y, Chen ES, Pakhomov S, Lindemann E, Melton GB. Investigating longitudinal tobacco use information from social history and clinical notes in the electronic health record. In: AMIA Annu Symp Proc; 2016. pp. 1209–1218.
16. Chen E, Garcia-Webb M. An analysis of free-text alcohol use documentation in the electronic health record. Applied Clinical Informatics. 2014;5(2):402–415.
17. Wang Y, Chen ES, Pakhomov S, Arsoniadis E, Carter EW, Lindemann E. Automated extraction of substance use information from clinical texts. In: AMIA Annu Symp Proc; 2015. pp. 2121–2130.
18. Maldonado R, Goodwin TR, Harabagiu SM. Active deep learning-based annotation of electroencephalography reports for cohort identification. In: AMIA Jt Summits Transl Sci Proc; 2017. p. 229.
19. Dernoncourt F, Lee JY, Uzuner O, Szolovits P. De-identification of patient notes with recurrent neural networks. J Am Med Inform Assoc. 2017;24(3):596–606.
20. Xu J, Zhang Y, Xu H. Clinical abbreviation disambiguation using neural word embeddings. In: Proceedings of BioNLP; 2015. pp. 171–176.
21. Jagannatha AN, Yu H. Bidirectional RNN for medical event detection in electronic health records. In: NAACL; 2016. pp. 473–482.
22. Lafferty J, McCallum A, Pereira FC. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proc Int Conf Mach Learn; 2001. pp. 282–289.
23. Finkel JR, Grenager T, Manning C. Incorporating non-local information into information extraction systems by Gibbs sampling. In: Proc Conf Assoc Comput Linguist Meet; 2005. pp. 363–370.
24. Peng F, McCallum A. Information extraction from research papers using conditional random fields. Inf Process Manag. 2006;42(4):963–979.
25. Lample G, Ballesteros M, Subramanian S, Kawakami K, Dyer C. Neural architectures for named entity recognition. In: NAACL; 2016. pp. 260–270.
26. Augenstein I, Søgaard A. Multi-task learning of keyphrase boundary classification. In: Proc Conf Assoc Comput Linguist Meet; 2017. pp. 341–346.
27. Luan Y, Ostendorf M, Hajishirzi H. Scientific information extraction with semi-supervised neural tagging. In: Proc Conf Empir Methods Nat Lang Process; 2017. pp. 2641–2651.
28. Reimers N, Gurevych I. Optimal hyperparameters for deep LSTM-networks for sequence labeling tasks. arXiv preprint arXiv:1707.06799. 2017.
29. Johnson R, Zhang T. Effective use of word order for text categorization with convolutional neural networks. arXiv preprint arXiv:1412.1058. 2014.
30. Chen Y, Xu L, Liu K, Zeng D, Zhao J. Event extraction via dynamic multi-pooling convolutional neural networks. In: Proc Conf Assoc Comput Linguist Meet; 2015. pp. 167–176.
31. Adel H, Roth B, Schütze H. Comparing convolutional neural networks to traditional models for slot filling. In: NAACL; 2016. pp. 828–838.
32. Collobert R, Weston J. A unified architecture for natural language processing: deep neural networks with multitask learning. In: Proc Int Conf Mach Learn; 2008. pp. 160–167.
33. Luan Y, Brockett C, Dolan B, Gao J, Galley M. Multi-task learning for speaker-role adaptation in neural conversation models. In: Int Joint Conf Natural Lang Proc; 2017. vol. 1, pp. 605–614.
34. Jaech A, Heck LP, Ostendorf M. Domain adaptation of recurrent neural networks for natural language understanding. In: Interspeech; 2016. pp. 690–694.
35. Harutyunyan H, Khachatrian H, Kale DC, Galstyan A. Multitask learning and benchmarking with clinical time series data. arXiv preprint arXiv:1703.07771. 2017.
36. Jaques N, Taylor S, Nosakhare E, Sano A, Picard R. Multi-task learning for predicting health, stress, and happiness. In: NIPS Workshop on Machine Learning for Healthcare; 2016.
37. Liu B, Lane I. Attention-based recurrent neural network models for joint intent detection and slot filling. In: Interspeech; 2016. pp. 685–689.
38. Lin Z, Feng M, Santos CNd, Yu M, Xiang B, Zhou B. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130. 2017.
39. Bird S, Klein E, Loper E. Natural Language Processing with Python. O’Reilly Media; 2009.
40. Ba JL, Kiros JR, Hinton GE. Layer normalization. arXiv preprint arXiv:1607.06450. 2016.
41. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 2013.
42. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. Journal of Machine Learning Research. 2011;12:2825–2830.
43. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, et al. TensorFlow: large-scale machine learning on heterogeneous systems. 2015. Available from: https://www.tensorflow.org/.
44. Aliannejadi M, Kiaeeha M, Khadivi S, Ghidary SS. Graph-based semi-supervised conditional random fields for spoken language understanding using unaligned data. In: Australasian Lang Tech Assoc; 2014. pp. 98–103.
45. Yang Z, Cohen WW, Salakhutdinov R. Revisiting semi-supervised learning with graph embeddings. In: Proc Int Conf Mach Learn; 2016. vol. 48, pp. 40–48.