AMIA Annual Symposium Proceedings. 2014 Nov 14;2014:787–794.

Automatic Extraction of Drug Indications from FDA Drug Labels

Ritu Khare 1, Chih-Hsuan Wei 1, Zhiyong Lu 1
PMCID: PMC4419914  PMID: 25954385

Abstract

Extracting computable indications, i.e. drug-disease treatment relationships, from narrative drug resources is key to building a gold standard drug indication repository. The extraction problem involves two steps: disease named-entity recognition (NER) to identify disease mentions in a free-text description, and disease classification to distinguish indications from other disease mentions in the description. While there exist many tools for disease NER, disease classification is mostly achieved through human annotation. For example, we recently resorted to human annotations to prepare a corpus, LabeledIn, capturing structured indications from the drug labels submitted to FDA by pharmaceutical companies. In this study, we present an automatic end-to-end framework to extract structured and normalized indications from FDA drug labels. In addition to automatic disease NER, a key component of our framework is a machine learning method that is trained on the LabeledIn corpus to classify the NER-computed disease mentions as “indication vs. non-indication.” In experiments with 500 drug labels, our end-to-end system delivered an 86.3% F1-measure in drug indication extraction, a 17% improvement over the baseline. Further analysis shows that the indication classifier delivers a performance comparable to human experts and that the remaining errors are mostly (more than 50%) due to disease NER. Given its performance, we conclude that our end-to-end approach has the potential to significantly reduce human annotation costs.

Introduction

Drug-disease treatment relationships, i.e. drugs and their indications, are among the top searched topics in PubMed1,2. The primary application of this information is to help healthcare professionals and patients answer questions3 such as “what are the indications of fluoxetine?” Such information is also valuable for developing computational methods for predicting and validating novel drug indications4–6 and drug side effects7, for controlling data-entry and medication errors in electronic medical records8–10, for feeding the Google Knowledge Graph, and for assisting PubMed Health® (http://www.ncbi.nlm.nih.gov/pubmedhealth/) editors in cross-linking drug and disease monographs11. Given the variety of applications, it is important to build a structured and normalized drug indication repository, or a computable “gold standard” of drug-disease relationships, from credible drug resources. Most existing high-quality drug resources, such as DailyMed12, DrugBank13, MedlinePlus14, and MedicineNet15, are described in free text. Figure 1 presents excerpts from DailyMed descriptions, i.e. FDA drug labels, for two drugs.

Figure 1. Illustration of Disease Named Entity Recognition (NER) and Classification Problems (Source: DailyMed Drug Labels)

Figure 1 also illustrates the two key steps in creating a drug indication repository from existing free-text resources: (i) disease named entity recognition (NER) to identify all the disease mentions (underlined) in drug narratives, and (ii) disease classification to distinguish indications from other disease mentions, marked with a check for yes and an ‘X’ for no. Disease NER, in general, is not a new problem; there exist high-performing, state-of-the-art tools such as MetaMap16, DNorm17,18, and KMCI19 that automatically identify UMLS disease concepts from biomedical or clinical narratives. Also, many text-mining methods20–25 have been proposed to extract disease mentions from drug resources such as DailyMed, MedlinePlus, and even Wikipedia.

The disease classification problem, however, has not received much attention. Through a detailed analysis of 100 FDA drug labels, we learned that even in the designated “INDICATIONS AND USAGE” section, about half of the disease mentions are not indications; they include false positives such as characteristics of indications, contraindications, side effects, usages of another drug, unrelated diseases, etc.26 There have been a few earlier attempts to address the disease classification problem in the context of building a drug indication repository: SIDER-2 derives structured indications from FDA drug labels by filtering out the disease mentions that overlap with its side-effect repository23, and Wei et al.21 classify a given disease mention as an indication based on its frequency of occurrence across multiple structured and unstructured resources. However, the performance of these hand-crafted, rule-based methods is limited either in precision23 or in recall21. More recently, we resorted to manual expert annotations to identify indications from pre-recognized diseases in FDA drug labels, and curated a source-linked resource called LabeledIn24,26. Upon systematic comparison with an existing automatically curated resource27, we concluded that human judgment is a reliable solution to the classification problem. However, it is time-consuming, and hence too expensive to scale to a repository with wide coverage of drugs.

In this study, we present an automatic end-to-end indication extraction framework that, given an FDA drug label, (1) extracts the relevant section, (2) recognizes and normalizes all the disease mentions, (3) most importantly, classifies the disease mentions as “indications” vs. “non-indications,” and (4) outputs the indication mentions and corresponding UMLS disease concepts that can be directly used to populate the target drug indication repository. The main contribution of this work is the use of supervised machine learning to address the under-explored disease classification problem in step (3) and achieve a performance comparable to that of human experts.

Methods

The overall framework to extract drug indications from FDA drug labels is shown in Figure 2. As our data source, we use the FDA drug labels from the DailyMed website12, which contains the most up-to-date drug labels submitted to FDA by drug manufacturers. The output of the framework is a set of structured and precisely normalized drug-disease relationships that can be used to automatically populate a computable drug indication repository. In the following subsections, we describe our disease NER and classification methods in further detail.

Figure 2. Overall Framework to Extract Drug Indications from FDA Drug Labels

Disease Named Entity Recognition

The goal of this step is to identify all the disease mentions, or indication candidates, in the textual descriptions of a given drug label. For this purpose, we prepared a disease lexicon using two seed ontologies, MeSH and SNOMED-CT, which are useful for annotating scientific articles17,28,29 and clinical documents30–32, respectively. The lexicon consists of 77,464 concepts taken from: (i) the disease branch of MeSH, and (ii) the 11 disorder semantic types (UMLS disorder semantic types excluding ‘Finding’) in SNOMED-CT, as recommended in a recent shared task30. As the automatic tool, we applied MetaMap16, a highly configurable program for mapping biomedical text to the UMLS, to identify the mentions, their offsets, and the associated CUIs. We experimented with multiple settings of MetaMap; the optimal setting for this study is illustrated in Figure 3.

Figure 3. Disease Named Entity Recognition (NER) Method
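For illustration, the lexicon-lookup step can be sketched as follows; the candidate records and CUIs shown here are made up, and in practice the candidates come from MetaMap's Metathesaurus candidate output.

```python
# Minimal sketch of restricting NER candidates to the custom disease lexicon
# (MeSH disease branch + 11 SNOMED-CT disorder semantic types). The CUIs and
# candidate records below are illustrative, not real UMLS identifiers.

def load_disease_lexicon(path):
    """Load one CUI per line into a set (the file format is an assumption)."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def filter_candidates(candidates, lexicon):
    """Keep only candidate concepts whose CUI is in the disease lexicon."""
    return [c for c in candidates if c["cui"] in lexicon]

if __name__ == "__main__":
    lexicon = {"C1111111", "C2222222"}  # illustrative CUIs
    candidates = [
        {"text": "soft tissue infections", "cui": "C1111111", "span": (9, 31)},
        {"text": "physician",              "cui": "C9999999", "span": (40, 49)},
    ]
    # Only the concept present in the disease lexicon survives.
    print(filter_candidates(candidates, lexicon))
```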

Drug labels may contain overlapping disease mentions, e.g. the phrase “skin and soft tissue infections” denotes two specific diseases, “skin infections” and “soft tissue infections.” While the final results from MetaMap do not return such overlapping mentions, they are captured in MetaMap’s intermediate results, known as the Metathesaurus candidates. Hence, we utilized these candidate concepts in our method. MetaMap provides two types of candidate concepts, contiguous and dis-contiguous: in the phrase “skin and soft tissue infections,” “soft tissue infections” is a contiguous candidate, and “skin + infections” is a dis-contiguous candidate. We found that MetaMap returns different sets of dis-contiguous candidates with and without the term-processing feature; hence, we conducted two runs of MetaMap for comprehensive results. Also, the word sense disambiguation feature was turned on to disambiguate mentions that may map to multiple CUIs, e.g. “depression.” To restrict the returned candidates to the specific semantic types from the two vocabularies mentioned above, we used a lookup against our custom disease lexicon rather than running multiple rounds of MetaMap for the two vocabularies. Finally, candidates with overlapping spans (e.g. “moderate to severe pain”) were resolved in the following manner: (i) when both candidates were contiguous, the longer candidate was selected; (ii) when one candidate was dis-contiguous: (a) if the merged span contained conjunctions (e.g. “or,” “and”) or prepositions (e.g. “to”), then the merged span was pre-annotated and both CUIs were retained, e.g. the elliptical coordination in “skin and soft tissue infections”; (b) if the two mentions were related by a parent-child UMLS relationship (e.g., the phrase “acute bacterial otitis media” maps to the hierarchically related concepts “acute + otitis media” and “otitis”), then the longer mention was retained; otherwise, the shorter mention was retained (e.g. the phrase “drug hypersensitivity reactions” maps to the non-hierarchically related concepts “drug + reactions” and “hypersensitivity reactions”).
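The overlap-resolution rules above can be roughly sketched as follows; the candidate representation and the is_parent_child() check are simplified stand-ins for the MetaMap output and UMLS hierarchy lookups used in our pipeline.

```python
# Sketch of the overlap-resolution rules (i), (ii-a), and (ii-b) described
# above, using a simplified candidate record: {'start', 'end', 'cui',
# 'contiguous'}. The CUIs in the example are placeholders.

CONJ_PREP = {"and", "or", "to"}

def resolve_overlap(sentence, c1, c2, is_parent_child):
    """Resolve two overlapping candidates; returns the candidates to keep."""
    longer, shorter = sorted((c1, c2), key=lambda c: c["end"] - c["start"],
                             reverse=True)

    # (i) Both candidates contiguous: keep the longer one.
    if c1["contiguous"] and c2["contiguous"]:
        return [longer]

    # (ii-a) Merged span contains a conjunction or preposition, e.g. the
    # elliptical coordination "skin and soft tissue infections": keep both CUIs.
    merged = sentence[min(c1["start"], c2["start"]):max(c1["end"], c2["end"])]
    if set(merged.lower().split()) & CONJ_PREP:
        return [c1, c2]

    # (ii-b) Parent-child UMLS relationship: keep the longer mention;
    # otherwise keep the shorter one.
    return [longer] if is_parent_child(c1["cui"], c2["cui"]) else [shorter]

if __name__ == "__main__":
    sent = "skin and soft tissue infections"
    skin_inf = {"start": 0, "end": 31, "cui": "C_SKIN_INF", "contiguous": False}
    soft_inf = {"start": 9, "end": 31, "cui": "C_SOFT_INF", "contiguous": True}
    # The merged span contains "and", so both CUIs are retained.
    print(resolve_overlap(sent, skin_inf, soft_inf, lambda a, b: False))
```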

Disease Classification

Next, we model indication extraction as a binary classification problem: judging whether a disease mention identified by the NER method is an indicated usage of the drug. To build a high-quality gold standard, it is important to judge and remove irrelevant disease mentions, e.g. “Difficulty falling asleep” in drug label A and “cancer” in drug label B in Figure 1. This problem can also be seen as another flavor of classifying diseases in clinical narratives into linguistic or temporal bins33,34.

To address the classification problem, we used a support vector machine (SVM), a discriminative binary classifier known to be highly accurate and widely used in text classification. Given a set of data points (or examples) with known labels, the SVM assigns a score to each data point and finds a hyperplane that partitions the examples into two distinct bins, positive and negative, based on the assigned scores. For this problem, a data point is a disease mention at a given location; e.g. in Figure 1, drug label A has five data points and drug label B has three. We hand-picked the classification features based on our previous study with human experts, experimented with several candidates, and selected a set of linguistic, contextual, semantic, and dictionary-based features, described and exemplified in Table 1. Other features considered but not selected for the final classifier include the relative location of the sentence in the drug label and the presence of the drug name in the sentence containing the mention.

Table 1.

SVM Features for Disease Classification Problem

Data Points: Example 1 = “Difficulty falling asleep” (Drug Label A in Fig 1); Example 2 = “Vomiting” (Drug Label B in Fig 1)

Feature (Description): Example 1 value | Example 2 value
Mention (represented surface value): Difficulty falling asleep | Vomiting
Neighboring Tokens (left/right 5 tokens within sentence): may, frequently, be, associated, with | the, prevention, of, -DISEASE-, and, associated, with, initial, and, repeat
Location (of sentence): 2 | 1
NDF-RT Match (relationship of this disease concept with a concept catalogued in NDF-RT35 for this drug): No Match | Exact Match
Semantic Category (of the UMLS concept corresponding to the mention): Mental or Behavioral Dysfunction | Sign or Symptom
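To illustrate how the Table 1 features could be assembled and fed to a classifier, the following sketch uses scikit-learn's DictVectorizer and LinearSVC as stand-ins for the in-house SVM implementation described below; the two data points mirror the examples in Table 1.

```python
# Illustrative feature assembly and SVM training for the Table 1 features.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def mention_features(mention, left_tokens, right_tokens, sent_idx,
                     ndfrt_match, semantic_category):
    """Assemble the Table 1 features for one disease mention (data point)."""
    feats = {
        "mention=" + mention.lower(): 1,      # mention surface value
        "location": sent_idx,                 # sentence location
        "ndfrt=" + ndfrt_match: 1,            # NDF-RT match type
        "semtype=" + semantic_category: 1,    # UMLS semantic category
    }
    for tok in left_tokens[-5:] + right_tokens[:5]:  # left/right 5 neighbors
        feats["neighbor=" + tok.lower()] = 1
    return feats

# The two data points from Table 1 (0 = non-indication, 1 = indication).
X = [
    mention_features("Difficulty falling asleep", [],
                     ["may", "frequently", "be", "associated", "with"],
                     2, "No Match", "Mental or Behavioral Dysfunction"),
    mention_features("Vomiting",
                     ["the", "prevention", "of", "-DISEASE-", "and"],
                     ["associated", "with", "initial", "and", "repeat"],
                     1, "Exact Match", "Sign or Symptom"),
]
y = [0, 1]

clf = make_pipeline(DictVectorizer(), LinearSVC())
clf.fit(X, y)
print(clf.predict(X))
```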

To train the SVM, we relied on a recently curated, high-quality annotated text corpus, LabeledIn24, which contains labeled (marketed) indications for 250 drug ingredients frequently searched on PubMed Health. LabeledIn is unique among existing structured drug indication resources21,22,27,35,36 in that it is human-validated, source-linked, and normalized to the most precise concepts in widely used UMLS vocabularies (SNOMED-CT, MeSH, and RxNorm). LabeledIn was created through double annotation (88% kappa agreement) of 500 drug labels with assistance from a high-recall disease NER tool. In particular, for each drug label, the expert was presented with all disease mentions identified by the NER tool and was asked to assign a yes/no judgment; at the backend, the offsets of the disease mentions and the associated expert judgments were recorded. This yielded a fine-grained annotated text corpus containing all disease mentions along with the associated human judgments, which serves as the training dataset for this problem, as described in the first row of Table 2. Since our goal is to automatically populate a computable gold standard of drug indications, we evaluate the results of the classifier at the concept level. For this purpose, we created an evaluation dataset that contains all the disease concepts (CUIs) present in a given drug label and the respective expert-determined yes/no assignments; this dataset is described in the second row of Table 2. The distributions of examples in the two datasets clearly differ because a mention identified as positive is likely to be repeated several times in a given drug label, e.g. the CUI represented by “RLS” occurs four times in drug label A in Figure 1.

Table 2.

Classification Data Points for 500 FDA Drug Labels

Data Point Definition #Data Points (+,−)
Disease Mentions with offset 5,336 (70%, 30%)
Disease Concepts (UMLS CUIs) 3,013 (55%, 45%)

Results

We first measured the baseline performance on the 500 drug labels from LabeledIn by assuming all recognized diseases to be indications. As shown in the first row of Table 3, this baseline yields high recall but low precision. The recall is less than perfect because of cases where the NER method identified a less specific concept than the one annotated by the human experts, e.g., from the phrase “Acute Bacterial Otitis Media,” the NER identified “Otitis Media” whereas the annotators expanded the indication to the entire phrase.

Table 3.

Indication Extraction Performance at Concept-level (micro-averaged)

Indication Extraction Method Precision(%) Recall(%) F1-measure(%)
Disease NER (baseline) 54.99 93.93 69.37
Disease NER + NB 80.16 79.65 79.91
Disease NER + ME 85.02 86.16 85.59
Disease NER + SVM 86.99 85.58 86.28
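The concept-level, micro-averaged scores in Table 3 can be computed along the following lines, assuming predicted and gold indications are represented as sets of CUIs per drug label (the data structures are illustrative; the baseline simply predicts every NER-recognized CUI).

```python
# Sketch of concept-level, micro-averaged precision/recall/F1: true/false
# positives and false negatives are pooled across all drug labels.

def micro_prf(per_label_pairs):
    """per_label_pairs: iterable of (predicted_cuis, gold_cuis) set pairs."""
    tp = fp = fn = 0
    for predicted, gold in per_label_pairs:
        tp += len(predicted & gold)
        fp += len(predicted - gold)
        fn += len(gold - predicted)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example with two drug labels (CUIs abbreviated for illustration).
pairs = [({"C1", "C2", "C3"}, {"C1"}), ({"C4"}, {"C4", "C5"})]
print(micro_prf(pairs))
```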

We then conducted the experiment with a classification method included in the framework. We experimented with three different classifiers: SVM, Maximum Entropy (ME), and Naïve Bayes (NB). We used an internal C++ implementation of the classifiers, trained on the 5,336 training examples, and used 10-fold cross validation for evaluation against the evaluation dataset. To align the classifier results with our concept-level evaluation benchmark, we post-processed the results where multiple occurrences of the same concept were classified differently; e.g. the SVM may classify two occurrences of “RLS” in drug label A (Figure 1) as positive and negative, respectively. We resolved such conflicts by preferring the positive over the negative classification. Finally, we selected the SVM because it delivered the highest overall classification accuracy (optimized at a score threshold of 0.4) and hence the highest indication extraction F1-measure on our data, as shown in Table 3. The importance of each SVM feature is shown through the feature ablation experiment in Table 4, which lists the features in decreasing order of importance.

Table 4.

Results of Feature Ablation Experiment (SVM)

Ablated (Removed) Feature F1-measure(%)
Neighbors 79.75
Mention 83.09
Location 84.97
NDF-RT Match 85.11
Semantic Category 86.01
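Returning to the concept-level post-processing described above, the aggregation step can be sketched as follows, assuming mention-level predictions are available as (CUI, label) pairs per drug label; the CUI shown in the example is illustrative.

```python
# Sketch of collapsing mention-level predictions to concept-level labels:
# when occurrences of the same CUI receive conflicting labels, the positive
# (indication) label wins.

def aggregate_to_concepts(mention_predictions):
    """mention_predictions: iterable of (cui, is_indication) pairs for one
    drug label. Returns a dict mapping each CUI to its final label."""
    concept_labels = {}
    for cui, is_indication in mention_predictions:
        # Prefer positive over negative when occurrences disagree.
        concept_labels[cui] = concept_labels.get(cui, False) or is_indication
    return concept_labels

# E.g. four occurrences of the "RLS" concept in drug label A (Figure 1), one
# misclassified as negative, still yield a positive concept-level label.
print(aggregate_to_concepts([("C0000001", True), ("C0000001", False),
                             ("C0000001", True), ("C0000001", True)]))
```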

Using the results from the SVM-based extraction method, we randomly selected 100 erroneous cases (50 of 220 false positives and 50 of 143 false negatives) for further analysis. Figure 4 shows the different categories of false positive errors that we observed. The two largest categories (shown separately) correspond to errors from the NER component of the framework, which either identifies only part of the correct concept (Boundary Error) or confuses it with some other entity (Not a Disease Mention). About 36% of false positives are boundary errors; e.g. from the phrase “carcinoma of the prostate,” the disease tagger identified “carcinoma” and the classifier labeled this mention as positive. This example is counted as a false positive because it does not match our gold standard, which contains the complete mention “carcinomas of the prostate.” About 20% of false positives are non-disease mentions, such as part of an organization name (e.g. the “cancer” society), a disease homonym (“seizure” used in a non-biological context), etc.

Figure 4. Analysis of 50 Randomly Selected False Positives

The remaining categories of false positive errors are due to indication misclassification, as described next. About 18% are diseases mentioned in some other context, unrelated to the actual indications. Another 18% represent advanced cases, such as pre-existing conditions and conditions that cause the indications. In our earlier study creating LabeledIn with expert annotations, we found situations where experts needed to consult additional references and drug properties to make their judgment. Consider the two drug labels in Figure 5: “HIV infection” and “epilepsy” have similar feature values, e.g. neighboring tokens and semantic categories; however, “HIV infection” should be identified as a negative example and “epilepsy” as a positive one. Such cases are difficult for a domain-independent classifier to learn. Finally, the remaining few are characteristics of indications, contraindications, or side effects. As for false negatives, fewer than 5% were due to NER tool limitations; the rest are misclassification errors caused by poorly structured sentences (such as lists), noisy neighbors (such as parentheses), advanced cases, etc.

Figure 5. Example of Training Dataset showing an advanced case of classification (“HIV infection” and “epilepsy”)

Since our final goal is to minimize human effort, we also studied how many concepts can be correctly classified by the SVM, and thereby eliminated from human judgment in a traditional machine-assisted annotation pipeline26, without incurring any errors. We investigated whether the scores (or ranks) assigned by the SVM could be used to identify some perfectly classified examples. Figure 6 shows how the error rate varies with the proportion of highest/lowest ranked concepts bypassed from further human validation; we found that the top-scored 2.2% and the bottom-scored 5.1% of disease concepts could be safely eliminated from further human validation.

Figure 6. Effect of Eliminating Highest/Lowest Ranked Concepts on Error Rate
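A rough sketch of the bypass analysis underlying Figure 6, assuming each disease concept carries an SVM score and a gold-standard judgment (the input format is illustrative):

```python
# Sketch: scan concepts ranked by SVM score from the top (and bottom) and
# count how many can be accepted (or rejected) without any disagreement with
# the gold standard, i.e. without incurring errors.

def safe_bypass_fractions(scored_concepts):
    """scored_concepts: list of (svm_score, gold_is_indication) pairs.
    Returns (top_fraction, bottom_fraction) that can skip human review."""
    ranked = sorted(scored_concepts, key=lambda x: x[0], reverse=True)
    n = len(ranked)

    top = 0
    for _, gold in ranked:              # highest scores first
        if not gold:                    # first top-ranked non-indication
            break
        top += 1

    bottom = 0
    for _, gold in reversed(ranked):    # lowest scores first
        if gold:                        # first bottom-ranked indication
            break
        bottom += 1

    return top / n, bottom / n

# Toy example: (0.4, 0.6) for this input.
scores = [(0.95, True), (0.90, True), (0.60, False), (0.20, False), (0.05, False)]
print(safe_bypass_fractions(scores))
```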

Conclusion

In this study, we have addressed the disease recognition and classification problems in order to prepare a comprehensive drug indication repository serving many critical applications. In the context of curating a drug indication repository, there have been many important efforts based primarily on automatic recognition of disease mentions from drug narratives20–24. However, the important next step of classifying the automatically identified diseases as “indication” vs. “not indication” is largely missing from existing work; as a result, existing methods suffer from lower precision and require further validation and cleaning by human experts. Taking a step forward, this study presents a framework that not only recognizes disease mentions in drug narratives but further distinguishes indications from other disease mentions using an SVM-based classification method, thus minimizing the need to keep humans in the pipeline. We experimented with 500 FDA drug labels corresponding to frequently sought human drugs, and configured a disease NER tool to deliver optimal recall on drug labels. We find that the combined (recognition + classification) framework achieves a 32% improvement in precision and a 17% improvement in F1-measure over a baseline method based on recognition alone.

The proposed framework delivered an 86% F1-measure on 500 drug labels, whereas two human experts in a previous study using a similar NER mechanism26 delivered an 88% joint F1-measure on 100 drug labels after their first round of annotation. We consider these performances to be equivalent, given the upper limit (94% recall) that the NER tool’s natural-language limitations place on classification performance. We also observed that over half of the errors are due to NER limitations. This clearly indicates that the proposed classifier can act as an independent or complementary annotator, saving annotation costs and time26. Furthermore, we find that about 7% of the training examples fed to the classifier could be bypassed altogether from further human judgment without incurring any errors. In terms of improvement, we found that several erroneous cases require advanced domain knowledge and are difficult to learn. Also, many false negative errors could be resolved by refining the neighbor-calculation algorithm and adding formatting features to the classifier. Our future work also includes improving technical aspects of the classifier such as automatic feature generation and handling imbalanced data. Lastly, we observed that certain annotated drug indications are specific to certain procedures or conditions (e.g. “nausea” and “vomiting” are associated with “cancer chemotherapy” in drug label B of Figure 1). Such information is not yet captured by the NER or classification modules of our current pipeline and requires further processing of drug labels.

Acknowledgments

This research was supported by the Intramural Research Program of the NIH - National Library of Medicine. The authors would like to thank Robert Leaman for proofreading the manuscript.

References

1. Islamaj Dogan R, Murray GC, Neveol A, Lu Z. Understanding PubMed user search behavior through log analysis. Database: the journal of biological databases and curation. 2009;2009:bap018. doi: 10.1093/database/bap018.
2. Neveol A, Islamaj Dogan R, Lu Z. Semi-automatic semantic annotation of PubMed queries: a study on quality, efficiency, satisfaction. Journal of biomedical informatics. 2011;44(2):310–8. doi: 10.1016/j.jbi.2010.11.001.
3. Ely JW, Osheroff JA, Gorman PN, Ebell MH, Chambliss ML, Pifer EA, et al. A taxonomy of generic clinical questions: classification study. BMJ. 2000;321(7258):429–32. doi: 10.1136/bmj.321.7258.429.
4. Lu Z, Agarwal P, Butte AJ. Computational Drug Repositioning - Session Introduction. Pacific Symposium on Biocomputing. 2013:1–4.
5. Li J, Lu Z. A New Method for Computational Drug Repositioning Using Drug Pairwise Similarity. IEEE International Conference on Bioinformatics and Biomedicine; 2012. pp. 1–4.
6. Li J, Lu Z. Pathway-based drug repositioning using causal inference. BMC bioinformatics. 2013;14(Suppl 16):S3. doi: 10.1186/1471-2105-14-S16-S3.
7. Chang RL, Xie L, Bourne PE, Palsson BO. Drug off-target effects predicted using structural analysis in the context of a metabolic network model. PLoS computational biology. 2010;6(9):e1000938. doi: 10.1371/journal.pcbi.1000938.
8. Khare R, An Y, Wolf S, Nyirjesy P, Liu L, Chou E. Understanding the EMR error control practices among gynecologic physicians. iConference 2013; Fort Worth, TX. 2013. pp. 289–301.
9. Lesar TS. Prescribing errors involving medication dosage forms. Journal of general internal medicine. 2002;17(8):579–87. doi: 10.1046/j.1525-1497.2002.11056.x.
10. Ling Y, An Y, Liu M, Hu X. An error detecting and tagging framework for reducing data entry errors in electronic medical records (EMR) system. IEEE International Conference on Bioinformatics and Biomedicine (BIBM); Shanghai, China. 2013. pp. 249–54.
11. Li J, Khare R, Lu Z. Improving Online Access to Drug-Related Information. IEEE Second International Conference on Healthcare Informatics, Imaging and Systems Biology; La Jolla, CA. 2012.
12. DailyMed: Current Medication Information. Available from: http://dailymed.nlm.nih.gov.
13. DrugBank: Open Data Drug and Drug Target Database. Available from: http://www.drugbank.ca/
14. MedlinePlus: Trusted Health Information for You. Available from: http://www.nlm.nih.gov/medlineplus/
15. MedicineNet: We Bring Doctor’s Knowledge to You. Available from: http://www.medicinenet.com/script/main/hp.asp.
16. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proceedings / AMIA Annual Symposium. 2001:17–21.
17. Leaman R, Islamaj Dogan R, Lu Z. DNorm: disease name normalization with pairwise learning to rank. Bioinformatics. 2013. doi: 10.1093/bioinformatics/btt474.
18. Leaman R, Khare R, Lu Z. Automatic Disease Normalization in Clinical Notes with DNorm. Journal of the American Medical Informatics Association (JAMIA). Under review.
19. VBLP Lab. KMCI - KnowledgeMap Concept Indexer. Available from: http://knowledgemap.mc.vanderbilt.edu/research/content/kmci-knowledgemap-concept-indexer.
20. Neveol A, Lu Z. Automatic Integration of Drug Indications from Multiple Health Resources. ACM International Health Informatics Symposium; Arlington, VA. 2010. pp. 666–73.
21. Wei WQ, Cronin RM, Xu H, Lasko TA, Bastarache L, Denny JC. Development and evaluation of an ensemble resource linking medications to their indications. Journal of the American Medical Informatics Association: JAMIA. 2013;20(5):954–61. doi: 10.1136/amiajnl-2012-001431.
22. Fung KW, Jao CS, Demner-Fushman D. Extracting drug indication information from structured product labels using natural language processing. Journal of the American Medical Informatics Association: JAMIA. 2013;20(3):482–8. doi: 10.1136/amiajnl-2012-001291.
23. A side effect resource to capture phenotypic effects of drugs [database on the Internet]. 2010.
24. Khare R, Li J, Lu Z. LabeledIn: Cataloging Labeled Indications for Human Drugs. Journal of biomedical informatics. 2014. doi: 10.1016/j.jbi.2014.08.004.
25. Li Q, Deleger L, Lingren T, Zhai H, Kaiser M, Stoutenborough L, et al. Mining FDA drug labels for medical conditions. BMC medical informatics and decision making. 2013;13:53. doi: 10.1186/1472-6947-13-53.
26. Khare R, Li J, Lu Z. Toward Creating a Gold Standard of Drug Indications from FDA Drug Labels. IEEE International Conference on Health Informatics; September 09–11, 2013; Philadelphia, PA. 2013.
27. SIDER 2 Side Effect Resource. Available from: http://sideeffects.embl.de/
28. Dogan RI, Lu Z. An improved corpus of disease mentions in PubMed citations. Workshop on Biomedical Natural Language Processing; Bethesda, MD. 2012. pp. 91–9.
29. Huang M, Neveol A, Lu Z. Recommending MeSH terms for annotating biomedical articles. Journal of the American Medical Informatics Association: JAMIA. 2011;18(5):660–7. doi: 10.1136/amiajnl-2010-000055.
30. Leaman R, Khare R, Lu Z. NCBI at 2013 ShARe/CLEF eHealth Shared Task: Disorder Normalization in Clinical Notes with DNorm. Conference and Labs of the Evaluation Forum 2013 Working Notes; 2013.
31. Khare R, An Y, Li J, Song I-Y, Hu X. Exploiting semantic structure for mapping user-specified form terms to SNOMED CT concepts. ACM SIGHIT International Health Informatics Symposium; Miami, FL. 2012. pp. 285–94.
32. An Y, Khare R, Hu X, Song IY. Bridging encounter forms and electronic medical record databases: Annotation, mapping, and integration. IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2012); October 04–07, 2012; Philadelphia, PA. 2012.
33. Raghavan P, Fosler-Lussier E, Lai AM. Temporal Classification of Medical Events. Workshop on Biomedical Natural Language Processing (BioNLP); Montreal, Quebec, Canada. 2012.
34. Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of biomedical informatics. 2001;34(5):301–10. doi: 10.1006/jbin.2001.1029.
35. 2012AA National Drug File - Reference Terminology Source Information. Available from: http://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/NDFRT/
36. Freebase: A Community-curated Database of well-known People, Places, and Things. Available from: http://www.freebase.com/
