Abstract
One particular challenge in biomedical named entity recognition (NER) and normalization is the identification and resolution of composite named entities, where a single span refers to more than one concept (e.g., BRCA1/2). Previous NER and normalization studies have either ignored composite mentions, used simple ad-hoc rules, or only handled coordination ellipsis, making a robust approach for handling multi-type composite mentions greatly needed. To this end, we propose a hybrid method integrating a machine-learning model with a pattern identification strategy to identify the individual components of each composite mention. Our method, which we have named SimConcept, is the first to systematically handle many types of composite mentions. The technique achieves high performance in identifying and resolving composite mentions for three key biological entities: genes (90.42% in F-measure), diseases (86.47% in F-measure) and chemicals (86.05% in F-measure). Furthermore, our results show that using our SimConcept method can subsequently improve the performance of gene and disease concept recognition and normalization.
I. Introduction
In biomedical text mining, many studies have focused on automatically extracting relevant information from published literature. The relevant information is commonly focused on a specific topic, such as protein-protein interactions [2, 3]; protein transport and localization [4–6]; drug-disease associations [7–9], or gene function extraction[10]. Most of the common retrieval methods apply natural language processing or machine learning to identify relations in text. One crucial step towards this goal is automatically recognizing bioconcept mentions (e.g., gene/protein) – the task of named entity recognition (NER) –and mapping the bioconcept to a specific database identifier (e.g., NCBI EntrezGene) – the task of normalization. Many international biomedical text mining competitions (e.g., BioCreative) have therefore focused on these tasks [11–13]. Genes, diseases and chemicals are particularly notable for not only being important concepts, but also being the most popular concepts in biomedical literature search [14, 15]. Most normalization studies face two challenges: term variation and ambiguity [16–22]. Many previous studies have defined individual strategies (e.g., machine learning, statistical inference and rule-based methods) to deal with these two issues. However, a particular type of error which has not been handled well is composite mentions, where a single span refers to more than one concept (e.g., “SMADs 1, 5, and 8”). Such mentions specifically refer to multiple concepts; they are thus distinct from phenomena such as protein complexes and chemical mixtures where multiple entities combine to form a single physical unit. We observe that in our datasets, approximately 10% of gene, disease and chemical mentions are composite mentions, hence it is important to handle them properly. This study presents a new method for bio-concept mention simplification in a systematic fashion.
Most related previous studies have focused on text simplification (including both document/paragraph [23–27] and sentence [28–31] levels). The few studies which have considered mention simplification have only addressed coordination ellipsis. Buyko et al., [32] developed a CRF-based method with three states: conjunction, conjuncts, and ellipsis antecedent. For example, in “human and mouse cells”, “human” and “mouse” are conjuncts, “and” is a conjunction, and “cells” is an ellipsis antecedent. They evaluated their method using the GENIA [33] corpus, obtaining 86% accuracy. Due to the lower performance of this method on complex ellipsis (e.g., “recombinant human nm23- H1, -H2, mouse nm23-M1, and -M2”), Chae et al., [34] developed a pattern based method, using lexicons to identify the region of each component (i.e., conjunction, conjuncts, and ellipsis antecedent) for each mention. However, these previous studies have focused on only one type of composite mention: mentions with coordination ellipsis.
In this study, a total of six types of composite mentions are considered including five distinct types (including abbreviation pair) and a mixed type of mentions.
Mention with coordination ellipsis: the concepts in this type of mentions share part of the mention region, such as the token “SMAD” in “SMADs 1, 5, and 8.”
Range mention: Like mentions with coordination ellipsis, these mentions share part of the mention region, however this type represents a range of entities rather than a discrete set (e.g. “SMAD 2 to 4”).
Individual mention: this is an independent composite mention. All concepts can be separated into non-overlapping spans (e.g., “BTK/ITK/TEC/TXK”).
Overlap abbreviation pair mention: The long form and short form share some tokens, like “COUP (chicken ovalbumin upstream promoter) transcription factor” where the phrase “transcription factor” is shared across “COUP transcription factor” and “chicken ovalbumin upstream promoter transcription factor”. But the two concepts indicate the same database identifier.
Individual abbreviation pair mention: this is an independent composite mention where the two individual concepts indicate the same database identifier (e.g. “ectodermal dysplasia (EDA)”).
Mixed mention: It is a mixed mention of any two above types, like “high mobility group (HMG) protein 1 and 2” - a mix of type 1 and 4.
The three main contributions of this work are: 1) a new tool called SimConcept was developed to handle six types of composite mentions, more than any other methods previously reported; 2) When applied to the three bio-concepts (i.e., gene, disease, and chemical), our method achieved state-of-the-art performance and 3) Based on our success on more than one entity type, our approach is shown to be robust and generalizable.
II. METHODS
Overall, our method consists of two modules as shown in Figure 1. The first module consists of a conditional random field model. In this module, the input mention is separated into tokens and each token assigned labels according to the most likely sequence of states through the model. The second module reassembles the tokens into individual mentions using a pattern identification method.
A. Conditional random fields model
As mentioned above, we regarded this mention simplification problem as a sequence-labeling task. To recognize the composite mentions, we observed the composition of those mentions and defined nine states for building a conditional random fields (CRF) model [35]: Antecedent (A); Strain/Suffix(S); Conjunction of mentions with coordination ellipsis (C); Conjunction of range mentions (CR); Left parentheses of abbreviation pair (L); Right parentheses of abbreviation (R); Right parentheses of abbreviation, but the abbreviation and long form cannot be separated (Ro); Conjunction of individual mentions (I); Redundant (O). The states “C”, “CR”, “L”, “R”, “Ro” (L and R/Ro occur in pairs), and “I” are conjunction states which can use to recognize the mention types. If one mention includes two or more conjunction states, this mention would be identified as a mixed mention.
Our implementation uses a linear chain Conditional Random Fields (CRF) [35] provided by CRF++ (http://crfpp.googlecode.com/svn/trunk/doc/index.html). CRF++ applies L-BFGS [36] which is a quasi-newton algorithm for large scale numeric optimization problems.
B. CRF Features
We used tmVar’s tokenization [37] and part of its features in SimConcept development. Like tmVar, our tokenization separates uppercase characters, lowercase characters and digits. For example, “SMADs 2 to 4” is separated to “SMAD”, “s”, “2”, “to” and “4”. We adapted tmVar’s features to reflect the difference in input between tmVar (i.e., documents) and SimConcept (i.e., individual mentions). After reviewing the evidence for different token types of a mention, we defined several suffixes, prefixes and some semantic types for identifying bioconcepts (i.e., gene, disease, and chemical) mention characteristics. In particular, most mention suffixes for disease and chemical mentions are not digits, for example “breast and ovarian cancer” (disease) and “b-sitosteryl and stigmasteryl linoleates” (chemical), which might be difficult to recognize without semantic evidence. Therefore, we collected the semantic features used in some previous studies [37–39] and grouped the suffixes/prefixes we defined into semantic feature types such as those shown below.
Chemical Suffix: yl, ylidyne, oyl, sulfony one, and etc.
Chemical Alkane Stem: meth, eth, prop, tetracos
Chemical Trivial Ring: benzene, pyridine, toluen
Chemical Simple Multiplier: di, tri, tetra and etc.
Chemical elements: hydrogen, helium, lithium, carbon and etc.
Disease Suffix: cancer, disease, symptom and etc.
Gene/Protein Suffix: gene, protein, receptor, unit and etc.
Family, Complex : family, subfamily, superfamily complex
We also continue to use three of tmVar’s features types (i.e., Character features, Case pattern features and Contextual features). Character features include number of digits, number of uppercase and lowercase letters, number of all characters, and specific characters (; , . − > + _ / ?). Case pattern features are created by replacing uppercase alphabetic character to “A” and any lower case to ‘a’. Likewise any number (0–9) is replaced by ‘0’. Moreover, we also merged consecutive letters and numbers to generate additional features, such as “AAA” to “A”. Next, we used evidence in full-text as a feature. That is, we would search candidate mentions in full text and look for their presence. For example, in determining how to decompose “W1 and W2 W3”, we would search the bigram “W1 W3” in full text. If found, it is reasonable to infer that “Wl W3” is a valid and meaningful mention (e.g. ovarian and breast cancer). Otherwise, it is more likely that W1 should be separated by itself (e.g. emphysema and liver disease). Lastly, in order to take advantage of contextual information, for a given token we included the token and semantic features of 3 neighboring tokens from each side.
C. Token reassembly through pattern identification
By observing the characteristics of composite mentions in our training data, we manually defined four patterns to Observing these two cases, it becomes clear that the difference between range mentions and mentions with coordination ellipsis is the conjunctions. In case the conjunction is recognized as a conjunction of range mentions (CR state), all values in the range of these two suffixes should be considered as conjunct candidates. model the six types of composite bioconcept mentions, as shown in Figure 2. To simplify mentions, we distinguish between the antecedent region (green), conjuncts region (frame), conjunct candidate (blue) and conjunctions (red). The tokens in antecedent region should be present in all possible mentions. The tokens in conjuncts region should be replaced by all possible conjunct candidates in this region. Every conjuncts region consists of at least one conjunction. Conjunctions are used to separate individual conjunct candidates. In our definition, every mention can map to one of the patterns. Range mentions and mentions with coordination ellipsis map to Pattern 1. As shown in Figure 3a, the “ORP-2 to -4” is a range mention which can be separated to “ORP-” (antecedent region), and “2 to -4” (conjuncts region). In conjuncts region, all possible candidates (i.e., 2, 3 and 4 in “2 to -4”) belong to one of the possible mentions. Therefore, “ORP-2 to -4” is reassembled to “ORP-2”, “ORP-3” and “ORP-4”. In another similar case, the “ORP-1 and -2” is similar to “ORP-2 to -4”. The major difference is the conjunction (i.e., “and”). In this case, “-1” and “-2” in conjuncts region are independent. Therefore “ORP-1 and -2” becomes “ORP-1” and “ORP-2”. Otherwise, once the conjunction is recognized as a conjunction of mentions with coordination ellipsis (C state), the candidates in the conjuncts region are independent to each other, and dependent to the antecedent region. Therefore, the reassembly mentions are the combinations of antecedent region and each conjunct candidate.
As shown in Figure 3b, individual and overlap abbreviation pair mentions belong to Pattern 2. The pair (long form “neurokinin-3” and abbreviation “NK-3”) of abbreviation mentions is in the conjuncts region. Therefore, the two candidates, long form and abbreviation, are reassembled with antecedent region individually. We detected the long form region by applying the Ab3P abbreviation identification tool [40]. Thus, we are able to identify the conjuncts region in these mentions. Patterns 3 and 4 in Figure 2 are relatively easier than Pattern 1 and 2. Since the patterns do not contain a conjuncts region, assembling the individual mentions only requires splitting conjunctions and parentheses. As shown in Figure 3c/d, the mentions can be separated individually.
In addition to the above types, mixed mentions are more complicated. We defined a two phase strategy to divide concepts in mixed mentions. In the 1st phase we split the mention using Patterns 3 or 4, which do not contain any conjuncts region. In this phase, all conjunctions of range (CR), mention with coordination ellipsis (C) and abbreviation parentheses (L/Ro) are considered as part of antecedent region. In the 2n phase, the mention is decomposed by Pattern 1 or 2. In this phase we start to face the conjuncts region. As shown in Figure 4, “interferon gamma (IFN-gamma)-inducible protein 10 (gamma IP-10)” is split to “interferon gamma (IFN-gamma)-inducible protein 10” and “gamma IP-10” by cutting in 1st phase. According the states L/Ro which have been identified in “interferon gamma (IFN-gamma)- inducible protein 10”, the 2th phase should choose Patten 2 for simplifying. We therefore obtained “interferon gamma-inducible protein 10” and “IFN-gamma-inducible protein 10” from the 2th phase simplification.
In other words, the main idea of this two phase strategy is to retain all sub-mentions with a conjuncts region in the second phase. Since the sub-mentions which map to Patterns 1 & 2 are more complicated and cannot be separated individually, those sub-mentions will be processed in the second step.
D. Post processing
To optimize the CRF result in SimConcept, we developed several heuristic rules in post processing. The first rule is concerned with some plural mentions, such as “SMADs 1 and 3”. Typically, we remove the letter “s” from individual mentions when extracting them from the composite mention. For example, the output of “SMADs 1 and 3” becomes “SMAD 1” and “SMAD 3”. However, in some cases, the letter “s” is actually part of an entity name in individual mentions (e.g. “Vps 35”). Therefore, we first extract individual mentions without the letter “s” and search for its occurrence in the full text. If we cannot locate it in full text, we will add the “s” suffix to the individual mention. The second post-processing rule handles some antecedent and prefix tokens which cannot be normally split by our tokenization module, such as “tri- and diorganton”. In such a case, we recognize the prefix (i.e., tri-, di-, mono-, hexachloro-, hexabromo-, and hexa-) and change the state of tokens accordingly (e.g. to ). As a result, “tri- and diorganton” is decomposed to “triorganton” and “diorganton”.
E. The SimConcept Corpus
The SimConcept corpus was compiled using 5 datasets: three for genes, one for diseases and one for chemicals. For genes, we integrated the BioCreative II gene normalization task training (281 abstracts) and test (262 abstracts) corpora and the 151 GIA test collection (http://ii.nlm.nih.gov/DataSets/index.shtml#GIA). In addition, we also collected disease mention corpus from NCBI Disease corpus [16, 41] with 793 abstracts, and sampled Chemical mention corpus from BioCreative IV CHEMDNER task [42] training dataset for 937 abstracts. As shown in Table 1, we collected 2,424 abstracts in total. For each article, in addition to the annotations of all described bioconcept mentions, we appended following annotations: 1) the decomposed mentions, such as “BRCA1” and “BRCA2” for “BRCA1/2”; 2) the five types of composite mentions (e.g., “mention with coordination ellipsis”); 3) the states of tokens ( ). We used PubTator [43, 44], a web-based annotation tool to annotate the corpus. The distributions of the five composite mention types (CR: Range mention, C: mention of coordination ellipsis, I: individual mentions, IA: individual abbreviation, and OA: overlap abbreviation.) in Table 1 are different between the three sets. Chemicals contain significantly more range mentions than either disease or genes, and diseases contain more individual abbreviations than chemicals or genes. The distribution for genes is more even across all types than either diseases or chemicals.
Table 1.
Concept | # of abstracts | Five types of composite mentions | |||||
---|---|---|---|---|---|---|---|
| |||||||
All | CR | C | I | IA | OA | ||
Gene | 694 | 810 (1895) | 14 (60) | 101 (246) | 442 (1089) | 253 (534) | 41 (107) |
Disease | 793 | 1012 (2293) | 2 (18) | 245 (583) | 303 (809) | 486 (1045) | 52 (123) |
Chemical | 937 | 1012 (2944) | 99 (505) | 201 (771) | 496 (1389) | 302 (716) | 0 (0) |
III. EXPERIMENTAL RESULTS AND DISCUSSION
To evaluate our method, we used leave-one-out cross validation on the three sets (i.e., gene, disease and chemical). Table 2 shows the results of our evaluation, where we see that the overall performance is high for all three entity types.
Table 2.
Precision | Recall | F-measure | |
---|---|---|---|
Gene | 89.51% | 91.35% | 90.42% |
Disease | 87.92% | 85.07% | 86.47% |
Chemical | 87.44% | 84.71% | 86.05% |
The chemical corpus includes a type of mentions which is not addressed by our patterns. It is a joint mention which the second mention uses coreference to indicate the previous mention, such as “3–0-propargylated betulinic acid and its 1,2,3-triazoles”. The pronoun “its” represents the previous mention “3–0-propargylated betulinic acid”. Therefore, this composite mention contains two individual mentions “3–0-propargylated betulinic acid” and “1,2,3-triazoles of 3–0-propargylated betulinic acid”. We have ignored this type of mentions.
To assess the performance on each composite mention type, we computed results shown in Table 3. There are only two range mentions in the disease set, and we therefore ignored these. There are also no overlap mentions in the chemical set. Since two exception mentions belong to continuous mention type in chemical corpus, the performance of continuous mention becomes lower.
Table 3.
Gene | Disease | Chemical | |
---|---|---|---|
Individual abbreviation | 92.05% | 84.21% | 86.69% |
Overlap abbreviation | 80.91% | 91.53% | N/A |
Mention with coordination ellipsis | 76.35% | 80.21% | 61.10% |
Range mention | 91.67% | N/A | 94.14% |
Individual mention | 91.11% | 87.13% | 87.34% |
Mixed mention | 81.75% | 81.45% | 83.84% |
All composite mentions | 90.42% | 86.47% | 86.05 % |
As mentioned in introduction, this study is aimed at helping bioconcept normalization. We therefore applied SimConcept in GenNorm [19] and DNorm [16], and evaluated on the test sets of BioCreative II gene normalization task [45] and NCBI disease corpus [46], respectively (no normalized chemical corpus is available). To avoid training on the test set, the training set for SimConcept excluded the test corpora for GenNorm and DNorm. As shown in Table 4 and Table 5, using SimConcept can further improve the state-of-the-art performance for 1.17% in F-measure (P-value=0.02) for gene normalization and 1.34% in F-measure (P-value=0.03) for disease normalization. We also applied the heuristic rules used in previous gene normalization studies [47, 48] and showed the result in second row of Table 4. Our set of heuristics includes 9 rules. Those rules are defined by regular expressions to recognize the conjuncts at the end of the mention (e.g., detecting “1” and “2” in “BRCA1/2”) and handle some mentions containing coordination ellipsis and ranges. However, the composite mentions which are not considered in the refinement of heuristic rules cannot be recognized. This comparison shows that using heuristic rules is not as robust as SimConcept. As also shown in Table 5, using heuristic rules raises performance about half as much as SimConcept.
Table 4.
Method | Precision | Recall | F-measure |
---|---|---|---|
GenNorm + SimConcept | 87.01% | 86.13% | 86.57% |
GenNorm + heuristic rules | 86.78% | 85.23% | 86.00% |
GenNorm | 86.72% | 84.09% | 85.38% |
Table 5.
Method | Precision | Recall | F-measure |
---|---|---|---|
DNorm + SimConcept | 80.91% | 79.23% | 80.06% |
DNorm | 80.69% | 76.85% | 78.72% |
In order to examine the contribution of individual feature types, we performed a feature ablation study where different feature types were removed from the entire set of features one at a time. As shown in Table 6, the largest drop in performance was due to the removal of token features, followed by semantic and character features. The removal of case pattern or contextual features had little effect on final performance. In addition to removing features, we also changed the order of CRF model from order 2 to order 1. The result shows order 2 performs better than order 1.
Table 6.
Precision | Recall | F-measure | |
---|---|---|---|
SimConcept | 89.51% | 91.35% | 90.42% |
- Token features | 87.41% | 89.27% | 88.33% (−2.09%) |
- Semantic features | 88.64% | 89.78% | 89.21% (−1.21%) |
- Full text features | 88.54% | 90.34% | 89.42% (−1.00%) |
- Character features | 89.03% | 90.42% | 89.72% (−0.70%) |
- Pattern features | 89.50% | 91.12% | 90.30% (−0.12%) |
Order 2 → Order 1 | 87.61% | 89.38% | 88.49% (−1.93%) |
Despite our best efforts, there are still errors in our decomposition results. We examined a sample set of errors and grouped them into two major categories. The first group is due to incorrect conjunction type detection. For example, “AKR1C1-AKR1C4” is a range mention including AKR1C1, AKR1C2, AKR1C3, and AKR1C4. However, SimConcept incorrectly recognizes this as a mention with coordination ellipsis, thus missing two individual mentions AKR1C2 and AKR1C3. This category accounts for 84% of all errors in SimConcept. The second category is because of the incorrect antecedent/conjuncts region detection. Approximately 12% of SimConcept errors belong to this second group. For example, “cytosolic and mitochondrial serine hydroxymethyltransferase” is a composite mention with coordination ellipsis, which should be decomposed into “cytosolic serine hydroxymethyltransferase” and “mitochondrial serine hydroxymethyltransferase”. However, SimConcept recognizes serine as part of the conjuncts region of “mitochondrial serine” by mistake. As a result, the output of first extracted mention becomes “cytosolic hydroxymethyltransferase”.
IV. CONCLUSION
In this study, we present SimConcept – a method to handle the task of composite named entity simplification. We integrated a CRF-based method with a pattern identification strategy to systematically decompose the six types of composite mentions. To handle the three most fundamental bioconcepts, we re-annotated the composite mentions in five existing corpora for gene (BioCreative 2 GN task train/test corpus and NLM GIA corpus), disease (NCBI disease corpus) and chemicals (BioCreative IV ChemDNER task corpus), and used these to evaluate SimConcept. The results show that SimConcept handles composite mention simplification effectively.
We further used SimConcept to assist the bioconcept normalization task. The results suggest that SimConcept is helpful for improving normalization performance. Our approach should generalize to other entity types in addition to the three concepts that were the focus of this study: genes, diseases and chemicals. However, the problems of token reassembly step of different concepts are highly diverse. Using a pattern based method, may not be able to address all issues. In our future work, we will apply statistical methods (e.g., an unsupervised statistical approach [49]) to handle the reassembly issue.
Acknowledgments
This research was supported by the NIH Intramural Research Program, National Library of Medicine.
Biographies
Dr. Chih-Hsuan Wei received his PhD degree in computer science and information engineering from the National Cheng-Kung University, Taiwan. He is currently a research fellow at the National Center for Biotechnology Information (NCBI) and dedicates on bioconcept normalization and biocuration assistance.
Dr. Robert Leaman received his PhD in computer science from Arizona State University. He is currently a research fellow at the National Center for Biotechnology Information (NCBI). Dr. Leaman’s research focuses on bioconcept entity recognition and normalization.
Dr. Zhiyong Lu received his PhD in biomedical informatics from the University of Colorado School of Medicine. Dr. Lu is Earl Stadtman investigator at the National Center for Biotechnology Information (NCBI), National Institutes of Health (NIH) where he leads the biomedical text mining research group. Dr. Lu’s research has been applied to NCBI’s PubMed system and he has authored over 100 publications.
References
- 1.Wei C-H, Leaman R, Lu Z. SimConcept: A Hybrid Approach for Simplifying Composite Named Entities in Biomedicine. ACM Conference on Bioinformatics Computational Biology and Health Informatics; Newport Beach, CA. 2014. pp. 138–146. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Krallinger M, Vazquez M, Leitner F, Salgado D, Chatr-aryamontri A, Winter A, et al. The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC bioinformatics. 2011;12:S3. doi: 10.1186/1471-2105-12-S8-S3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Baumgartner WA, Jr, Lu Z, Johnson HL, Caporaso JG, Paquette J, Lindemann A, et al. An integrated approach to concept recognition in biomedical text. Proceedings of the Second BioCreative Challenge Evaluation Workshop; 2007. pp. 257–271. [Google Scholar]
- 4.Poon H, Vanderwende L. Joint inference for knowledge extraction from biomedical literature. presented at the Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics; Los Angeles, CA, USA. 2010. [Google Scholar]
- 5.Hunter L, Lu Z, Firby J, Baumgartner WA, Johnson HL, Ogren PV, et al. OpenDMAP: an open source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression. BMC bioinformatics. 2008;9:78. doi: 10.1186/1471-2105-9-78. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Bethard S, Lu Z, Martin JH, Hunter L. Semantic role labeling for protein transport predicates. BMC bioinformatics. 2008;9:277. doi: 10.1186/1471-2105-9-277. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Yang CC, Yang H, Jiang L. Postmarketing Drug Safety Surveillance Using Publicly Available Health-Consumer-Contributed Content in Social Media. ACM Transactions on Management Information Systems (TMIS) 2014 Apr;5:2. [Google Scholar]
- 8.Dogğan RI, Névéol A, Lu Z. A context-blocks model for identifying clinical relationships in patient records. BMC bioinformatics. 2011;12:S3. doi: 10.1186/1471-2105-12-S3-S3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Li J, Lu Z. Systematic identification of pharmacogenomics information from clinical trials. Journal of Biomedical Informatics. 2012;45:870–878. doi: 10.1016/j.jbi.2012.04.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Mao Y, Van Auken K, Li D, Arighi CN, McQuilton P, Hayman GT, et al. Overview of the gene ontology task at BioCreative IV. Database. 2014;2014:bau086. doi: 10.1093/database/bau086. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Arighi CN, Wu CH, Cohen KB, Hirschman L, Krallinger M, Valencia A, et al. BioCreative-IV virtual issue. Database. 2014;2014:bau039. doi: 10.1093/database/bau039. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Lu Z, Kao HY, Wei CH, Huang M, Liu J, Kuo CJ, et al. The gene normalization task in BioCreative III. BMC bioinformatics. 2011;12:S2. doi: 10.1186/1471-2105-12-S8-S2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Arighi CN, Lu Z, Krallinger M, Cohen KB, Wilbur WJ, Valencia A, et al. Overview of the BioCreative III workshop. BMC bioinformatics. 2011;12:S1. doi: 10.1186/1471-2105-12-S8-S1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Névéol A, Doğan RI, Lu Z. Semi-automatic semantic annotation of PubMed queries: A study on quality, efficiency, satisfaction. Journal of Biomedical Informatics. 2011;44:310–318. doi: 10.1016/j.jbi.2010.11.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Dogan RI, Murray GC, Névéol A, Lu Z. Understanding PubMed® user search behavior through log analysis. Database. 2009;2009:bap018. doi: 10.1093/database/bap018. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Leaman R, Doğan RI, Lu Z. DNorm: disease name normalization with pairwise learning to rank. Bioinformatics. 2013;29:2909–2917. doi: 10.1093/bioinformatics/btt474. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Wei CH, Kao HY, Lu Z. SR4GN: a species recognition software tool for gene normalization. Plos one. 2012;7:e38460. doi: 10.1371/journal.pone.0038460. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Rocktäschel T, Weidlich M, Leser U. ChemSpot: a hybrid system for chemical named entity recognition. Bioinformatics. 2012;28:1633–1640. doi: 10.1093/bioinformatics/bts183. [DOI] [PubMed] [Google Scholar]
- 19.Wei CH, Kao HY. Cross-species gene normalization by species inference. BMC bioinformatics. 2011;12:S5. doi: 10.1186/1471-2105-12-S8-S5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Torii M, Wagholikar K, Liu H. Detecting concept mentions in biomedical text using hidden Markov model: multiple concept types at once or one at a time? Journal of biomedical semantics. 2014:3. doi: 10.1186/2041-1480-5-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Leaman R, Wei CH, Lu Z. tmChem: a high performance approach for chemical named entity recognition and normalization. J Cheminform. 2015;7:S3. doi: 10.1186/1758-2946-7-S1-S3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Van Landeghem S, Björne J, Wei CH, Hakala K, Pyysalo S, Ananiadou S, et al. Large-scale event extraction from literature with multi-level gene normalization. PloS one. 2013;8:e55814. doi: 10.1371/journal.pone.0055814. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Leroy G, Endicott JE, Mouradi O, Kauchak A, Melissa, Just L. Improving Perceived and Actual Text Difficulty for Health Information Consumers using Semi-Automated Methods. presented at the American Medical Infomatics Association (AMIA) Annual Symposium; Chicago, USA. 2012. [PMC free article] [PubMed] [Google Scholar]
- 24.Ong E, Damay J, Lojico G, Lu K, Tarantan D. Simplifying text in medical literature. Journal of Research in Science Computing and Engineering. 2007;4:37–47. [Google Scholar]
- 25.Siddharthan A. Syntactic simplification and text cohesion. Research on Language and Computation. 2006;4:77–109. [Google Scholar]
- 26.Chandrasekar R, Srinivas B. Automatic induction of rules for text simplification. Knowledge-Based Systems. 1997;10:183–190. [Google Scholar]
- 27.Kauchak D. Improving Text Simplification Language Modeling Using Unsimplified Text Data. presented at the Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics; Sofia, Bulgaria. 2013. [Google Scholar]
- 28.Silveira SB, Branco A. Enhancing multi-document summaries with sentence simplification. presented at the ICAI 2012: International Conference on Artificial Intelligence; Las Vegas, Nevada, USA. 2012. [Google Scholar]
- 29.Peng Y, Tudor CO, Torii M, Wu CH, Vijay-Shanker K. iSimp: A Sentence Simplification System for Biomedical Text. presented at the The 2012 IEEE International Conference on Bioinformatics and Biomedicine; Philadelphia, PA, USA. 2012. [Google Scholar]
- 30.Vickrey D, Koller D. Sentence Simplification for Semantic Role Labeling. presented at the 22nd International Conference on. Computational Linguistics; Stroudsburg, PA, USA. 2008. [Google Scholar]
- 31.Vanderwende L, Suzuki H, Brockett C, Nenkova A. Beyond SumBasic: Task-focused summarization with sentence simplification and lexical expansion. Information Processing & Management. 2007;43:1606–1618. doi: 10.1016/j.ipm.2007.01.023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Buyko E, Tomanek K, Hahn U. Resolution of coordination ellipses in biological named entities using conditional random fields. presented at the 10th Conference of the Pacific Association for Computational Linguistics; Melbourne, Australia. 2007. [Google Scholar]
- 33.Kim J-D, Ohta T, Tateisi Y, Tsujii Ji. GENIA corpus—a semantically annotated corpus for bio-textmining. Bioinformatics. 2003;19:i180–i182. doi: 10.1093/bioinformatics/btg1023. [DOI] [PubMed] [Google Scholar]
- 34.Chae J, Jung Y, Lee T, Jung S, Huh C, Kim G, et al. Identifying non-elliptical entity mentions in a coordinated NP with ellipses. Journal of Biomedical Informatics. 2013:139–152. doi: 10.1016/j.jbi.2013.10.002. [DOI] [PubMed] [Google Scholar]
- 35.Lafferty J, McCallum A, Pereira F. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. presented at the Proceedings of the International Conference on Machine Learning (ICML 01); Williamstown, MA, USA. 2001. [Google Scholar]
- 36.Liu DC, Nocedal J. On the limited memory BFGS method for large scale optimization. Mathematical Programming B. 1989;45:503–528. [Google Scholar]
- 37.Wei CH, Harris BR, Kao HY, Lu Z. tmVar: A text mining approach for extracting sequence variants in biomedical literature. Bioinformatics. 2013;29:1433–1439. doi: 10.1093/bioinformatics/btt156. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Lowe DM, Corbett PT, Murray-Rust P, Glen RC. Chemical name to structure: OPSIN, an open source solution. Journal of chemical information and modeling. 2011;51:739–753. doi: 10.1021/ci100384d. [DOI] [PubMed] [Google Scholar]
- 39.Tanabe L, Wilbur WJ. Tagging gene and protein names in biomedical text. Bioinformatics. 2002;18:1124–1132. doi: 10.1093/bioinformatics/18.8.1124. [DOI] [PubMed] [Google Scholar]
- 40.Sohn S, Comeau DC, Kim W, Wilbur WJ. Abbreviation definition identification based on automatic precision estimates. BMC bioinformatics. 2008;9:402. doi: 10.1186/1471-2105-9-402. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Doğan RI, Lu Z. An improved corpus of disease mentions in PubMed citations. presented at the Proceedings of the 2012 Workshop on Biomedical Natural Language Processing; Montreal, Canada. 2012. [Google Scholar]
- 42.Krallinger M, Rabal O, Leitner F, Vazquez M, Salgado D, Lu Z, et al. The CHEMDNER corpus of chemicals and drugs and its annotation principles. Journal of Cheminformatics. 2015;(Suppl 1):S1. doi: 10.1186/1758-2946-7-S1-S2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Wei CH, Kao HY, Lu Z. PubTator: a Web-based text mining tool for assisting Biocuration. Nucleic acids research. 2013;41:W518–W522. doi: 10.1093/nar/gkt441. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Wei C-H, Harris BR, Li D, Berardini TZ, Huala E, Kao H-Y, et al. Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts. Database: The Journal of Biological Databases & Curation. 2012:bas041. doi: 10.1093/database/bas041. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Morgan AA, Lu Z, Wang X, Cohen AM, Fluck J, Ruch P, et al. Overview of BioCreative II gene normalization. Genome biology. 2008;9:S3. doi: 10.1186/gb-2008-9-s2-s3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Doğan RI, Leaman R, Lu Z. NCBI disease corpus: A resource for disease name recognition and concept normalization. Journal of Biomedical Informatics. 2014 Jan;:1–10. doi: 10.1016/j.jbi.2013.12.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Wermter J, Tomanek K, Hahn U. High-performance gene name normalization with GeNo. Bioinformatics. 2009;25:815–821. doi: 10.1093/bioinformatics/btp071. [DOI] [PubMed] [Google Scholar]
- 48.Hakenberg J, Plake C, Leaman R, Schroeder M, Gonzalez G. Inter-species normalization of gene mentions with GNAT. Bioinformatics. 2008;24:i126–i132. doi: 10.1093/bioinformatics/btn299. [DOI] [PubMed] [Google Scholar]
- 49.Zhai C. Fast statistical parsing of noun phrases for document indexing. Proceedings of the fifth conference on Applied natural language processing; 1997. pp. 312–319. [Google Scholar]