Abstract
Medication doses, one of the determining factors in medication safety and effectiveness, are present in the literature, but only in free-text form. We set out to determine if the systems developed for extracting drug prescription information from clinical text would yield comparable results on scientific literature and if sequence-to-sequence learning with neural networks could improve over the current state-of-the-art. We developed a collection of 694 PubMed Central documents annotated with drug dose information using the i2b2 schema. We found that less than half of the drug doses are present in the MEDLINE/PubMed abstracts, and full-text is needed to identify the other half. We identified the differences in the scope and formatting of drug dose information in the literature and clinical text, which require developing new dose extraction approaches. Finally, we achieved 83.9% recall, 87.2% precision and 85.5% F1 score in extracting complete drug prescription information from the literature.
Introduction
Medication doses are one of the determining factors in medication safety and effectiveness. Dose determination studies report on ineffective, effective and toxic dose ranges. A toxic dose or an adverse drug reaction reported in scientific literature but not reported by the pharmaceutical company to the regulatory authorities could result in a serious warning. The companies and regulators monitor PubMed®/MEDLINE® for adverse events, often using Medical Subject Headings® (MeSH®) indexing to capture reliable signals. MeSH-based filters successfully capture adverse reactions1 and drug-drug interactions2, but information about dose safety and effectiveness can only be found in the text of the article.
The importance of extracting medication doses from clinical text is well established, and the i2b2 challenges on medication extraction3 stimulated research and development of several publicly available tools and approaches; the highest F1 score reported to date for dose extraction, 91.5%, was achieved using Conditional Random Fields classifiers and word embeddings4. Surprisingly, extraction of doses from the scientific literature has not yet been actively pursued by researchers. To the best of our knowledge, we present the first publicly available collection of scientific abstracts and excerpts from full-text articles annotated using the i2b2 annotation schema3 that includes: 1) medication name; 2) dosage; 3) route of administration; 4) frequency of administration; 5) duration of administration; and 6) the reason for which the medication is given. We expanded the schema by splitting the 7) form from the medication name and attempting to distinguish between the dose and the 8) strength of the medication. Collections and tools are available for extraction of some of the important elements of the prescription information from the literature and other non-clinical text. Extraction of the drug and chemical names is still a hard problem regularly addressed in the community-wide challenges. Drug name extraction was recently extended to extraction of the events and relations that involve drugs. For example, the 2016 and 2017 BioCreative evaluations promoted the development and evaluation of systems able to detect relations between chemical compounds/drugs and diseases and genes/proteins, respectively5. Drug name identification was also an essential part of the SemEval-2013 drug-drug interaction extraction task6.
Expanding on the previous work on the full prescription information extraction from clinical text and partial drug information extraction from scientific literature, we set out to study if the systems developed for clinical text would yield comparable results on scientific literature and if sequence-to-sequence learning with neural networks could improve over the current state-of-the-art. Our initial assumption was that the specific doses are more likely to appear in the full text of the article, specifically in the methods section. To test this hypothesis, we compared dose information provided in the MEDLINE/PubMed abstracts alone with that found in different parts of the full text.
The contributions of this work are: 1) a collection of PubMed abstracts and sentences from the full-text articles fully annotated for drug doses using the i2b2 schema; 2) the baseline and deep learning approaches to extraction of dose information from the literature; and 3) the analysis that shows that less than half of drug dose information can be found in the titles and abstracts of MEDLINE citations.
Methods
We developed a collection of 694 documents fully annotated with drug doses/strengths, forms, routes of administration, frequencies and durations of administration and the reasons for administration. The collection includes 70 full MEDLINE/PubMed abstracts as the test set and a training set of 624 documents containing sentences extracted from the abstracts and full-text articles using various baseline approaches described below. The 70 abstracts were retrieved using the same approach as the original 624 articles, but at a later date, to ensure that our tests are aligned with the natural growth of the literature. We annotated only those drug mentions for which the doses/strengths were provided. Drug mentions without the doses were not annotated, even if some other prescription information was provided, e.g., form or duration. Table 1 shows the numbers of annotated entities and Table 2 shows the prescription annotations with different numbers of attributes. In all annotations, medication name and dose or strength are required. Other attributes are optional and any of the remaining six attributes (route, frequency, duration, reason, form, dose or strength) are annotated, if present. The most frequent other attributes were route and frequency of administration.
Table 1.
Prescription attribute | Instances |
---|---|
Medication | 2663 |
Strength | 1486 |
Dose | 1467 |
Route | 518 |
Frequency | 438 |
Form | 363 |
Duration | 308 |
Reason | 170 |
Table 2.
Number of prescription attributes | Instances |
---|---|
1 (dose or strength only) | 1416 |
2 | 688 |
3 | 393 |
4 | 115 |
5 | 33 |
6 | 4 |
7 | 1 |
We then used the collection to test if the algorithms that performed well in the chemical extraction challenges and sequence tagging could reliably extract drug dose information.
Drug Dosage Data Set Creation
Our constraints for selecting the publications were as follows: 1) a full year of articles indexed for MEDLINE; 2) with full text available in PubMed Central; and 3) containing Drug and Dosage information in the text. We excluded the following PubMed subsets that were not likely to contain the original articles meeting all our inclusion criteria: OLDMEDLINE, PubMed-not-MEDLINE, completed but no indexing assigned, Comment On, Erratum For, Partial Retraction Of, Retraction Of, Republished From, Update Of, abstracts not owned by NLM, and Review, Letter, and Editorial Publication Types.
Our initial PubMed search (Figure 1) identified 686,290 MEDLINE citations. To further narrow down the search to those articles that may contain drug dose information, we selected citations with a Pharmacologic MeSH assigned (anything in the D27.505 MeSH Tree) using “Pharmacologic Actions [mh]”, which reduced the 686,290 citations to 108,865. We then limited the Pharmacologic MeSH assigned articles to ones that also had at least one of the Pharmacological MeSH headings with the administration & dosage MeSH qualifier: “Pharmacologic Actions/AD”, which further reduced the 108,865 articles to 22,082. To compare drug doses in the title and abstract with the doses in the full text, we needed articles that have abstracts, which reduced the 22,082 articles to 21,068. To get the articles for which the full text is freely available, we used the “free full text[sb]” filter, reducing the 21,068 articles to 8,150, with 5,958 of these actually having full text available in PubMed Central (PMC). Unfortunately, for 2,298 of these PMC full text articles, “The publisher of this article does not allow downloading of the full text in XML form,” leaving us with 3,660 articles that met all of our criteria and had full text available in PMC. We retrieved these articles on September 25, 2017.
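The selection funnel described above can be tallied in a few lines. This is a minimal sketch that uses only the counts reported in this section; the step labels and the function name `retained_fractions` are illustrative shorthand and are not taken from the original retrieval scripts.

```python
# Counts are the ones reported in the text; step labels are shorthand.
FUNNEL = [
    ("initial PubMed search", 686_290),
    ("Pharmacologic Actions [mh]", 108_865),
    ("Pharmacologic Actions/AD", 22_082),
    ("has abstract", 21_068),
    ("free full text[sb]", 8_150),
    ("full text in PMC", 5_958),
    ("XML download allowed", 3_660),
]


def retained_fractions(funnel):
    """Return (label, count, fraction of the previous step retained) per step."""
    rows = []
    for (_, prev_count), (label, count) in zip(funnel, funnel[1:]):
        rows.append((label, count, count / prev_count))
    return rows


for label, count, fraction in retained_fractions(FUNNEL):
    print(f"{label}: {count:,} ({fraction:.1%} of previous step)")

# The final count follows from excluding the 2,298 PMC articles
# whose publishers do not allow XML download.
assert 5_958 - 2_298 == 3_660
```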
Baselines Applied to 3,660 Articles
Our most basic approach (DoseRegEx) is recall oriented and extracts all sentences that contain numeric characters and units of measure. A slightly more advanced approach reduces this set of sentences by requiring each sentence also to contain a chemical extracted using Chemlistem7.
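A recall-oriented filter of this kind might look roughly like the sketch below. It is not the DoseRegEx implementation: the unit list, the pattern, and the optional `has_chemical` callable (a stand-in for a chemical recognizer such as Chemlistem, whose API is not shown here) are all assumptions made for illustration.

```python
import re

# Hypothetical unit list; the actual table of units is not reproduced here.
# "\u03bc" is the Greek letter mu and "\u00b5" is the micro sign.
UNITS = ["mg/kg", "mg/mL", "\u03bcg/mL", "mg", "\u03bcg", "\u00b5g", "ug",
         "g", "kg", "mL", "ml", "IU", "U", "units"]

# Try longer units first so that "mg/kg" is preferred over "mg".
_UNIT_ALTERNATIVES = "|".join(
    re.escape(u) for u in sorted(UNITS, key=len, reverse=True)
)

# A number, optional whitespace, then a unit that is not the start of a word.
DOSE_PATTERN = re.compile(
    r"\d+(?:[.,]\d+)*\s*(?:" + _UNIT_ALTERNATIVES + r")(?![A-Za-z])"
)


def candidate_sentences(sentences, has_chemical=None):
    """Keep sentences with a number followed by a unit of measure.

    has_chemical is an optional, hypothetical callable that returns True
    when a chemical name is found in the sentence.
    """
    kept = [s for s in sentences if DOSE_PATTERN.search(s)]
    if has_chemical is not None:
        kept = [s for s in kept if has_chemical(s)]
    return kept
```

The negative lookahead keeps phrases such as "3 groups" from matching the unit "g", at the cost of missing units that are immediately followed by a letter.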
Our more sophisticated approach, which is also used to test how the tools developed for clinical text perform on the literature, is based on MedEx8. We downloaded and installed Vanderbilt’s MedEx UIMA 1.3.7 (https://sbmi.uth.edu/ccb/resources/medex.htm) and used it with default parameters. We processed the 3,660 articles using both MedEx and DoseRegEx on three different segments of the articles: full abstract, full text body without the title and abstract, and full text methods sections and image and table captions. We had to create two versions of the articles since MedEx did not seem to handle UTF-8 characters, e.g., 250 μg/animal/day/i.p. For MedEx, we created a copy of the abstracts and the full text files and converted all of the text to ASCII. The abstracts and full text were both in XML format from PubMed and PMC, respectively; and no attempt was made to convert XML encodings (e.g., &amp;) to their text equivalents. DoseRegEx used the UTF-8 versions and MedEx processed the ASCII versions of the XML files.
Annotation of the training and test sets
From the documents in which at least one of the tools has identified candidate sentences potentially containing drug doses, we selected 624 for manual annotation. We stratified the documents so that each extraction approach was represented equally, pulling all distinct documents found by an approach, if needed, and randomly selecting documents from the sets retrieved by all baselines. The filenames were coded so that annotators did not know which method found the documents.
We used the brat annotation tool9 to annotate the drug dose information as shown in Figure 2. AA, SS, LR and DF annotated each document independently, with two annotators per document. The differences were reconciled in meetings with all four annotators. In addition to annotating sentences extracted by the baseline approaches, we annotated 70 full abstracts to estimate sensitivity (recall) of the approaches. We added some more granularity to the dose annotation compared to the i2b2 guidelines and annotated the drug names and the numeric values of the doses and the units separately. We also attempted to distinguish the dose (the amount of drug given in a single administration) and the strength of the drug, i.e., the amount of drug in a unit of the dosage form.
Deep Learning approach
We used the annotated sentences from the 624 annotated documents to train a Long Short-Term Memory (LSTM) neural network with a Conditional Random Field (CRF) layer using character embeddings, as shown in Figure 3. (For more details, see https://guillaumegenthial.github.io/sequence-tagging-with-tensorflow.html). This approach had the best performance on part-of-speech tagging and CoNLL 2003 named entity recognition tasks10.
We converted the annotations to the BIO (Beginning-Inside-Outside) format required by the sequence tagger, initially matching the annotations exactly, as shown in Figure 4.
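To make the BIO conversion concrete, the sketch below tags whitespace-separated tokens from character-offset annotations. It is an illustration under stated assumptions, not the conversion code used for the collection: the function name `to_bio`, the whitespace tokenization, and the label names are hypothetical.

```python
import re


def to_bio(text, annotations):
    """Convert character-offset annotations to (token, BIO tag) pairs.

    annotations: non-overlapping (start, end, label) tuples.
    Tokens are split on whitespace (an assumption); a token is tagged only
    if it lies entirely inside an annotated span.
    """
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in annotations:
            if match.start() >= start and match.end() <= end:
                # The first token of a span gets B-, the rest get I-.
                prefix = "B-" if match.start() == start else "I-"
                tag = prefix + label
                break
        tagged.append((match.group(), tag))
    return tagged


text = "aspirin 100 mg daily for 2 weeks"
annotations = [
    (0, 7, "Medication"),
    (8, 14, "Dose"),
    (15, 20, "Frequency"),
    (25, 32, "Duration"),
]
for token, tag in to_bio(text, annotations):
    print(token, tag)
```

Whitespace tokenization leaves punctuation attached to tokens, so a production conversion would need a tokenizer that agrees with the annotation boundaries.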
In the second approach, we renamed the features labeled as Strength to Dose to create a larger and more uniform pool of examples. In the third approach, we masked the measurement units with ‘y’ characters, using a table of common measurements, as well as any measurements encountered during the annotation task. We used this table during both training and testing to mask both labeled and unlabeled (testing) features. We also tried converting medications into a mask of ‘X’ characters, which required a separate medication detection and lookup step for both feature generation for training and testing of unlabeled features. We used Chemlistem and an in-house chemical annotator to recognize chemicals to be masked. In the training set, only chemicals that were annotated as medications were masked, but, as there is no knowledge of true positives in the test set, all chemicals recognized by the tools in the test set had to be masked. Finally, we tested two combination approaches: one combining dose and strength labels and masking medications, and one combining dose and strength labels, medication masking, and measurements masking.
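The masking step can be pictured with the sketch below. It assumes that each masked term is replaced by a run of mask characters of the same length, which the text does not specify; the term lists and the function name `mask_terms` are likewise hypothetical, and in practice the medication list would come from a chemical recognizer rather than a fixed list.

```python
import re


def mask_terms(text, terms, mask_char):
    """Replace each whole-word occurrence of a term with mask characters.

    Assumption: the mask preserves the length of the original term.
    """
    if not terms:
        return text
    # Longer terms first so that "mg/kg" is masked before "mg".
    alternatives = "|".join(
        re.escape(t) for t in sorted(terms, key=len, reverse=True)
    )
    pattern = re.compile(r"(?<![A-Za-z])(?:" + alternatives + r")(?![A-Za-z])")
    return pattern.sub(lambda m: mask_char * len(m.group()), text)


# Hypothetical lookup tables for illustration only.
UNIT_TABLE = ["mg/kg", "mg", "mL"]
MEDICATIONS = ["aspirin"]

sentence = "aspirin 100 mg daily"
units_masked = mask_terms(sentence, UNIT_TABLE, "y")
both_masked = mask_terms(units_masked, MEDICATIONS, "X")
print(units_masked)
print(both_masked)
```

Because the same function is applied at training and test time, the only difference between the two settings is where the list of medications comes from: gold annotations for training, recognizer output for testing.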
Evaluation
We evaluated dose extraction at both the document and the sentence levels on the 70 fully annotated abstracts. For the deep learning approaches, we report recall, precision and F1 score at the entity level, i.e., we evaluate the model predictions of the exact spans and types of the annotations. For the baseline approaches, we did not require exact matches of the spans annotated in the test set: for each drug in the test set, if the output of the tool for a given article contained the drug name and at least one other dosage attribute, it was considered a true positive. The missed test set drugs were considered false negatives, and the drugs extracted by the tools but not found in the test set were considered false positives. For the three baseline approaches we also report sentence-level precision on the gold-standard annotations of the 624 randomly selected articles.
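Entity-level scoring with exact span and type matching can be sketched as follows. The tuple layout `(document id, start, end, label)` and the function names are assumptions made for illustration; this is not the evaluation script used in the study.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def entity_prf(gold, predicted):
    """Exact-match precision, recall and F1 over annotated entities.

    Each entity is a hashable tuple such as (doc_id, start, end, label);
    a prediction counts only if span and type both match a gold entity.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


gold = {("d1", 0, 7, "Medication"), ("d1", 8, 14, "Dose"), ("d1", 15, 20, "Frequency")}
predicted = {("d1", 0, 7, "Medication"), ("d1", 8, 11, "Dose")}  # Dose span too short
print(entity_prf(gold, predicted))
```

In this toy case the truncated Dose span counts as both a false positive and a false negative, which is what makes exact matching stricter than the relaxed matching applied to the baselines.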
Results
Table 3 shows the numbers of sentences extracted by the baseline tools from the 3,660 articles that met all of our criteria and had full text available in PMC.
Table 3.
Table 4 shows the counts of sentences and documents contributed to the manually annotated set of documents by the two baseline approaches and the results of discarding DoseRegEx sentences in which Chemlistem has not identified any chemicals. The results in Table 4 indicate that MedEx performed better on the abstracts it uniquely identified as containing dose information, but MedEx performed poorly on the full text when it was the only method to find the article. For abstracts and full text, the percentage of correctly extracted sentences was higher if both methods found the article, and even higher if the DoseRegEx sentences were filtered through Chemlistem. On the Full Text Body Only, i.e., sentences from the documents in which no doses were identified in the abstracts and in the methods sections, the extracted sentences were the best overall and did well even when the sentences were identified uniquely by each method. Extraction from the abstracts and full text seems to perform closely; however, Table 3 shows that for about 55% of the articles the dose information can be found only in the full text of the article, e.g., MedEx found only 1,070 abstracts, but 2,425 articles containing dose information. Given that the accuracy of extraction from these two document types is similar, we believe this demonstrates that full text is a richer source of drug dose information.
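The "about 55%" figure can be reproduced from the two MedEx counts quoted above. The short sketch below assumes that the 1,070 abstracts are a subset of the 2,425 articles, which the text implies but does not state explicitly; the variable names are illustrative.

```python
# MedEx counts quoted in the text.
abstracts_with_doses = 1070   # articles whose abstract contains dose information
articles_with_doses = 2425    # articles with dose information anywhere

# Assumption: every abstract counted above belongs to one of the 2,425 articles.
share_in_abstract = abstracts_with_doses / articles_with_doses
share_full_text_only = 1 - share_in_abstract

print(f"dose information in the abstract: {share_in_abstract:.1%}")
print(f"dose information only in the full text: {share_full_text_only:.1%}")
```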
Table 4.
Table 5 presents the results of the evaluation of all approaches on the test set of 70 manually annotated abstracts. As expected, MedEx outperformed the other baseline approaches. Merging the strength and the dose annotations improved the results for the neural network approach, as did masking of the units and medications. To our surprise, the type of the characters used as the mask mattered significantly, as illustrated by the differences in the results of medications masked with tildes and Xs. Finally, conflating the dose and strength labels and masking medications, but not units, resulted in the best performance with 83.9% recall, 87.2% precision and 85.5% F1 score.
Table 5.
Columns Default through D&S & Meds Mask report the LSTM neural network with a conditional random field (CRF) layer using character embeddings.
Metric | DoseRegEx & Chemicals | MedEx | Default | Dose & Strength conflated (D&S) | Units Mask ‘y’ | Meds Mask ‘~’ | Meds Mask ‘X’ | D&S, Meds & Units Mask | D&S & Meds Mask |
---|---|---|---|---|---|---|---|---|---|
Recall | 0.592 | 0.718 | 0.173 | 0.282 | 0.446 | 0.191 | 0.801 | 0.780 | 0.839 |
Precision | 0.356 | 0.689 | 0.453 | 0.560 | 0.545 | 0.367 | 0.867 | 0.861 | 0.872 |
F1 | 0.444 | 0.699 | 0.250 | 0.376 | 0.378 | 0.252 | 0.833 | 0.818 | 0.855 |
Discussion
Despite being extremely important to pharmacovigilance and systematic reviews, extraction of drug doses from the literature is under-researched. We developed the first publicly available collection of MEDLINE abstracts and excerpts from the full-text PubMed Central articles to facilitate this research. The collection includes 624 training documents that were generated through automated extraction of sentences potentially containing drug dose information and then manually annotated for drug doses/strengths, forms, routes of administration, frequencies and durations of administration and the reasons for administration. The test set consists of 70 manually annotated abstracts.
The baseline approaches identified thousands of sentences potentially containing drug dose information, and although the overlap between the approaches was significant, each method identified distinct documents missed by the others. Although neither MedEx nor DoseRegEx achieved perfect recall on the test abstracts, the methods retrieved enough examples for building a training set and training an LSTM model that achieved 83.9% recall, 87.2% precision and 85.5% F1 score.
Our question about the need for literature-specific approaches to dose extraction was answered positively. The drop in MedEx performance from the over 90% F-score it achieves on clinical text to close to 70% is understandable: our manual annotation revealed differences in the language used to describe drug dose information in the literature and in clinical text. Figure 5 shows one such example, in which the units precede the drug names and doses and reference all three drugs. Other obvious differences include the subjects of the study: in some cases, the doses were tested on animals. The authors also have very inventive ways of including study subjects in the dosage information, e.g., 16000 U/mouse/3 times or 250 mg/animal/day. As opposed to clinical text, in the literature, the medication doses and names often can be found only in different sentences and even paragraphs. During manual annotation, we used acronyms and abbreviations as shown in Figure 3, but sometimes the automatically extracted sentences contained no drug names. We could not annotate several automatically generated documents because they contained some prescription information but not the drug names. We decided to annotate only sentences that contained at least a drug name and a dose, either numeric or relative, e.g., low-dose aspirin. We therefore tested MedEx only on the sentences that contained numeric values or doses, but we tested the deep learning approach on all sentences in the 70 abstracts.
Another source of confusion is the description of the media for cell lines and bacterial cultures, e.g.,
“Human ovarian cancer cells (SKOV3, OVCAR3) and HOSE were purchased from American Type Culture Collection (ATTC, USA) and were maintained at 37 °C in 5% CO2 in complete media using RPMI 1640 containing 10% (v/v) fetal bovine serum, streptomycin (100 μg/mL), and penicillin (100 units/mL).”
Although the patterns describing the media and the drug doses are practically indistinguishable, we decided against annotating the media. It appears that the LSTM approach had enough training data to learn how to exclude the medium descriptions.
The numeric patterns that do not represent drug doses are often found in reviews that report the percentages of the studies, e.g., “ciprofloxacin 82.6%, norfloxacin 12.1%, ofloxacin 3.2%, moxifloxacin 1.2%, and other fluoroquinolones 0.9%) and 909,656 courses of penicillin V use, matched 1:1 on propensity score, were included.”
Finally, we encountered several new drug forms that were hard to annotate, e.g., “The particle size, DL and encapsulation efficiency of the obtained insulin-loaded microspheres with 10 films were 5.25 +/- 0.15 microm, 111.33 +/- 1.15 mg/g and 33.7 +/- 0.19%, respectively.”
Our study clearly demonstrates the benefits of the full text: more than half of the dose information is available only in the body of the paper. We did not expect, though, that about 45% of the abstracts and titles would contain full prescription information. It appears that identifying the methods section is not necessary for dose extraction, probably because the patterns are so specific that not many false positives are found in the full body of the paper.
Our collection, annotations and our best dose extraction approach have some limitations. We decided to annotate a few sentences from as many papers as possible, as opposed to such in-depth efforts as the CRAFT collection11 that contains 97 fully annotated articles. Annotating excerpts proved to be more difficult than annotating full abstracts, partly because the automated segmenters sometimes break up sentences, and partly because some sentences are hard to understand out of context. In several cases, the annotators had to consult the full text to distinguish between the abbreviations of drug names and trial study arms. Our conversion of the text to ASCII was not always smooth and resulted in some extraneous characters that had to be included in the annotations. Our attempt to distinguish between the doses and the strengths of the drugs was unsuccessful, as most of the time in the reconciliation discussions was spent on these fine differences, and ultimately unnecessary, since merging these two annotation types produced better results. Our best approach to dose extraction requires masking medication names and units, which we do using Chemlistem for the names and lookup tables for the units. When we applied the same masking approach to the training and test sets, i.e., masked all chemicals found by the tools in the training set, the results were much lower than the best result reported above. This essentially makes our approach supervised and reflects the differences between our supervised and weakly-supervised approaches. We hope this pre-processing step will not be necessary if a larger annotated collection becomes available.
Conclusion
We present the first publicly available collection of scientific articles annotated with drug doses/strengths, forms, routes of administration, frequencies and durations of administration and the reasons for administration. Using this collection, we established that state-of-the-art dose extraction tools developed to process clinical text experience a drop in performance characteristic of switching to a different domain. We found that dose information is predominantly reported in the full text of the articles; but unexpectedly, about 45% of the articles provide dose information in the titles and abstracts as well. Finally, we present a deep learning approach that achieves 83.9% recall, 87.2% precision and 85.5% F1 score in extracting complete drug prescription information from the literature.
The annotated collection and the code are available at https://ii.nlm.nih.gov/DataSets/index.shtml
Acknowledgements
This work was supported by the intramural research program at the U.S. National Library of Medicine, National Institutes of Health. We thank Guillaume Genthial for providing his TensorFlow-based code and for an extremely helpful tutorial that accompanies the code. We thank Peter Corbett for making Chemlistem publicly available, and the BioCreative and i2b2 organizers for developing the collections annotated with drug and chemical names.
References
- 1. Winnenburg R, Sorbello A, Ripple A, Harpaz R, Tonning J, Szarfman A, Francis H, Bodenreider O. Leveraging MEDLINE indexing for pharmacovigilance - Inherent limitations and mitigation strategies. J Biomed Inform. 2015 Oct;57:425–35. doi: 10.1016/j.jbi.2015.08.022.
- 2. Lu Y, Figler B, Huang H, Tu YC, Wang J, Cheng F. Characterization of the mechanism of drug-drug interactions from PubMed using MeSH terms. PLoS One. 2017 Apr 19;12(4):e0173548. doi: 10.1371/journal.pone.0173548.
- 3. Uzuner O, Solti I, Cadag E. Extracting medication information from clinical text. J Am Med Inform Assoc. 2010 Sep-Oct;17(5):514–8. doi: 10.1136/jamia.2010.003947.
- 4. Tao C, Filannino M, Uzuner Ö. Prescription extraction using CRFs and word embeddings. J Biomed Inform. 2017 Aug;72:60–66. doi: 10.1016/j.jbi.2017.07.002.
- 5. Wei CH, Peng Y, Leaman R, Davis AP, Mattingly CJ, Li J, Wiegers TC, Lu Z. Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task. Database (Oxford). 2016 Mar 19.
- 6. Segura-Bedmar I, Martínez P, Zazo MH. SemEval-2013 task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013). In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013). 2013;2:341–350.
- 7. Corbett P, Boyle J. Chemlistem-chemical named entity recognition using recurrent neural networks. In Proceedings of the BioCreative V.5 Challenge Evaluation Workshop. 2017.
- 8. Xu H, Stenner SP, Doan S, Johnson KB, Waitman LR, Denny JC. MedEx: a medication information extraction system for clinical narratives. J Am Med Inform Assoc. 2010 Jan-Feb;17(1):19–24. doi: 10.1197/jamia.M3378.
- 9. Stenetorp P, Pyysalo S, Topic G, Ohta T, Ananiadou S, Tsujii J. Brat: a Web-based Tool for NLP-Assisted Text Annotation. In Proceedings of the Demonstrations Session at EACL. 2012.
- 10. Ma X, Hovy E. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 2016;1:1064–1074.
- 11. Bada M, Eckert M, Evans D, Garcia K, Shipley K, Sitnikov D, Baumgartner WA Jr, Cohen KB, Verspoor K, Blake JA, Hunter LE. Concept annotation in the CRAFT corpus. BMC Bioinformatics. 2012 Jul 9;13:161. doi: 10.1186/1471-2105-13-161.