Ambiguity and variability of database and software names in bioinformatics

Geraint Duck; Aleksandar Kovacevic; David L Robertson; Robert Stevens; Goran Nenadic

doi:10.1186/s13326-015-0026-0

. 2015 Jun 29;6:29. doi: 10.1186/s13326-015-0026-0

Ambiguity and variability of database and software names in bioinformatics

Geraint Duck ¹, Aleksandar Kovacevic ², David L Robertson ³, Robert Stevens ¹, Goran Nenadic ^1,^4,^✉

PMCID: PMC4485340 PMID: 26131352

Abstract

Background

There are numerous options available to achieve various tasks in bioinformatics, but until recently, there were no tools that could systematically identify mentions of databases and tools within the literature. In this paper we explore the variability and ambiguity of database and software name mentions and compare dictionary and machine learning approaches to their identification.

Results

Through the development and analysis of a corpus of 60 full-text documents manually annotated at the mention level, we report high variability and ambiguity in database and software mentions. On a test set of 25 full-text documents, a baseline dictionary look-up achieved an F-score of 46 %, highlighting not only variability and ambiguity but also the extensive number of new resources introduced. A machine learning approach achieved an F-score of 63 % (with precision of 74 %) and 70 % (with precision of 83 %) for strict and lenient matching respectively. We characterise the issues with various mention types and propose potential ways of capturing additional database and software mentions in the literature.

Conclusions

Our analyses show that identification of mentions of databases and tools is a challenging task that cannot be achieved by relying on current manually-curated resource repositories. Although machine learning shows improvement and promise (primarily in precision), more contextual information needs to be taken into account to achieve a good degree of accuracy.

Keywords: Bioinformatics, Computational biology, CRF, Dictionary, Resource extraction, Text-mining

Background

Bioinformatics and computational biology rely on domain databases and software to support data collection, aggregation and analysis and, as such, have been reported in research papers, typically as part of the methods section. However, limited progress has been made to systematically capture mentions of databases and tools in order to explore the bioinformatics practice of computational method on a large-scale. An evaluation of the resources available could help bioinformaticians to identify common usage patterns [1] and potentially infer scientific “best practice” [2] based on a measure of how often or where a particular resource is currently being used within an in silico workflow [3]. Although there are several inventories that list available database and software resources (e.g., the NAR databases and web-services special issues [4, 5], ExPASy [6], the Online Bioinformatics Resources Collection [7], etc.), until recently, to the best of our knowledge, there were no attempts to systematically identify resource mentions in the literature [8].

Biomedical text mining has seen wide usage in identifying mentions of entities of different types in the literature in recent years. Named entity recognition (NER) enables automated literature insights [9] and provides input to other text-mining applications. For example, within the fields of biology and bioinformatics, NER systems have been developed to capture species [10], proteins/genes [11–13], chemicals [14], etc. Issues of naming inconsistencies, numerous synonyms and acronyms, and an inability to distinguish entity names from common words in a natural language combined with ambiguous definitions of concepts, make NER a difficult task [15, 16]. Still, for some applications, NER tools achieve relatively high precision and recall scores. For example, LINNAEUS achieved F-scores around the 95 % mark for species name recognition and disambiguation on the mention and document levels [10]. On the other hand, gene names are known for their ambiguity and variability, resulting in lower reported F-scores. For example, ABNER [12] recorded an F-score of just under 73 % for strict-match gene name recognition (85 % with some boundary error toleration), and GNAT [13] reported an F-score of 81 % for the same task (up to a maximum of 90 % for single species gene name recognition, e.g., for yeast).

Some previous work exists on automated identification and harvesting of bioinformatics database and software names from the literature. For example, OReFiL [17] utilises the mentions of Unified Resource Locators (URLs) in text to recognise new resources to update its own internal index. Similarly, BIRI (BioInformatics Resource Inventory) uses a series of hand crafted regular expressions to automatically capture resource names, their functionality and classification from paper titles and abstracts [18]. The reported quality of the identification process was in line with other NER tasks. For example, BIRI successfully extracted resource names in 94 % of cases in a test corpus, which consisted of 392 abstracts that matched a search for “bioinformatics resource” and eight documents that were manually included to test domain robustness. However, both of these tools focused on updates and have biased their evaluation to resource rich text, which prevents full understanding of false negative errors in the general bioinformatics literature.

This paper aims to analyse database and software name mentions in the bioinformatics/computational biology literature to assess challenges for automated extraction. We analyse database and software names in the computational biology literature using a set of 60 full-text documents manually annotated at the mention level, building on our previous work [19]. We analyse the variability and ambiguity of bioinformatics resource names and compare dictionary and machine learning approaches for their identification based on the results on an additional dataset of 25 full-text documents. Although we focus here on bioinformatics resources, the challenges and solutions encountered in database and software recognition are generic, and thus not unique to this domain [20].

Methods

Corpus annotation and analysis

For the purpose of this study, we define databases as any electronic resource that stores records in a structured form, and provides unique identifiers to each record. These include any database, ontology, repository or classification resource, etc. Examples include SCOP (a database of protein structural classification) [21], UniProt (a database of protein sequences and functional information) [22], Gene Ontology (ontology that describes gene product attributes) [23], PubMed (a repository of abstracts) [24], etc. We adopt Wikipedia’s definition of software [25]: “a collection of computer programs … that provides the instructions for telling a computer what to do and how to do it”. We use program and tool as synonyms for software. Examples include BLAST (automated sequence comparison) [26], eUtils (access to literature data) [27], etc. We also include mentions of web-services as well as package names (e.g., R packages from Bioconductor [28, 29]). We explicitly exclude database record numbers/identifiers (e.g., GO:0002474, Q8HWB0), file formats (e.g., PDF), programming languages and their libraries (e.g., Python, BioPython), operating systems (e.g., Linux), algorithms (e.g., Merge-Sort), methods (e.g., ANOVA, Random Forests) and approaches (e.g., Machine Learning, Dynamic Programming).

To explore the use of database and tool names, we have developed an annotated set of 60 full-text articles from the PubMed Central [30] open-access subset. The articles were randomly selected from Genome Biology (5 articles), BMC Bioinformatics (36) and PLoS Computational Biology (19). These journals were selected as they could provide a broad overview of the bioinformatics and computational biology domain(s).

The articles were primarily annotated by a bioinformatician (GD) with experience in text mining. The annotation process included marking each database/software name mention. We note that associated designators of resources (e.g., words such as database, software) were included only if part of the official name (e.g., Gene Ontology). The inter-annotator agreement (IAA) [31] for the annotation of database and software names was calculated from five full-text articles randomly selected from the annotated corpus, which were annotated by a PhD student with bioinformatics and a text-mining background.

To assess the complexity, composition, variability and ambiguity of resource names, we performed an analysis of the annotated mentions. The corpus was pre-processed using a typical text-mining pipeline consisting of a tokeniser, sentence splitter and part-of-speech (POS) tagger from GATE’s ANNIE [32]. We analysed the length of names, their lexical (stemmed token-level) and structural composition (using POS tag patterns) and the level of variability and ambiguity as compared to common English words, acronyms and abbreviations.

In addition to the dataset of 60 articles that was used for analysis and development of NER tools, an additional dataset of 25 full-text annotated papers was created to assess the quality of the proposed NER approaches (see below).

Dictionary-based approach (baseline)

We compiled an extensive dictionary of database and software names from several existing sources (see Table 1). Some well-known acronyms and spelling/orthographic variants have also been added, resulting in 7322 entries with 8169 variants (6929 after removing repeats) for 6126 resources. The names collected in the dictionary were also analysed using a similar approach as used for the names appearing in the corpus (see above). We then used LINNAEUS [10] to match these names in text.

Table 1.

Sources from which the database and software name dictionary is comprised

Type	Entries	Variants	Source
DB	195	298	databases.biomedcentral.com
SW	263	278	www.bioinformatik.de
PK	799	799	www.bioconductor.org
SW	2033	2087	bioinformatics.ca/links_directory/
SW	389	391	evolution.genetics.washington.edu/phylip/software.html
DB	379	379	www.ebi.ac.uk/miriam/main/
DB	1452	1670	www.oxfordjournals.org/nar/database/a/
SW	135	135	www.netsci.org/Resources/Software/Bioinform/index.html
SW	36	41	www.bioinf.manchester.ac.uk/recombination/programs.shtml
SW	1149	1183	en.wikipedia.org/wiki/Wiki/<various>
SW, DB	171	231	Manually added entries
Our dictionary (DB, SW, PK)	7322	6929	http://sourceforge.net/projects/bionerds/

Name	Description
isAcronym	token is an acronym
containsAllCaps	all the letters in the token are capitalised
isCapitalised	token is capitalised
containsCapLetter	token contains at least one capital letter
containsDigits	token contains at least one digit
isAllDigits	token is made up of digits only

	Development	Test
Total number of documents	60	25
Total database and software mentions	2416	1479
Total unique resource mentions	401	301
Percentage of database mentions	36 %	28 %
Percentage of unique database mentions	27 %	30 %
Average mentions per document	40.3	70.0
Average unique mentions per document	8.1	13.4
Maximum mentions in a single document	227	217
Maximum unique mentions in a single document	57	55
Resources with only a single lexicographic mention	201	147

Pattern	Count	Frequency
NNP	258	63.7 %
NNP NNP	34	8.4 %
NNP NNP NNP	26	6.4 %
NN	20	4.9 %
NNP CD	16	4.0 %
NNP NNP NNP NNP	8	2.0 %
Other Patterns	43	10.6 %

Development corpus	Recall (%)	Precision (%)	F-score (%)
Dictionary	49 (47)	38 (37)	43 (41)
CRF with post-processing	58 (52)	76 (67)	65 (58)
CRF without post-processing	54 (49)	78 (70)	62 (57)
Test Corpus
Dictionary	46 (44)	46 (44)	46 (44)
CRF with post-processing	60 (54)	83 (74)	70 (63)
CRF without post-processing	53 (45)	71 (65)	62 (53)

Fold	Recall (%)	Precision (%)	F-score (%)
1	46 (43)	41 (39)	43 (41)
2	34 (31)	37 (34)	36 (32)
3	36 (34)	24 (23)	29 (27)
4	55 (53)	46 (45)	50 (49)
5	76 (75)	44 (43)	56 (55)
Min	34 (31)	24 (23)	29 (27)
Max	76 (75)	46 (45)	56 (55)
Mean	49 (47)	38 (37)	43 (41)

Fold	Recall (%)	Precision (%)	F-score (%)
1	51 (44)	71 (60)	59 (51)
2	44 (35)	88 (71)	59 (47)
3	51 (44)	76 (66)	61 (53)
4	65 (60)	73 (67)	69 (63)
5	80 (76)	74 (70)	77 (73)
Min	44 (35)	71 (60)	59 (47)
Max	80 (76)	88 (71)	77 (73)
Mean	58 (52)	76 (67)	65 (58)
Micro Avg	56 (50)	76 (67)	65 (57)

Fold	Recall (%)	Precision (%)	F-score (%)
1	46 (41)	78 (69)	58 (51)
2	42 (35)	89 (75)	57 (48)
3	45 (41)	75 (70)	56 (52)
4	60 (55)	71 (66)	65 (60)
5	76 (74)	74 (72)	75 (73)
Min	42 (35)	71 (66)	56 (52)
Max	76 (74)	89 (75)	75 (73)
Mean	54 (49)	78 (70)	62 (57)
Micro Avg	52 (47)	77 (70)	62 (56)

Fold	Recall (%)	Precision (%)	F-score (%)
1	56 (49)	43 (38)	49 (42)
2	50 (41)	45 (37)	48 (39)
3	57 (52)	32 (29)	41 (37)
4	68 (64)	45 (42)	54 (51)
5	87 (84)	45 (43)	59 (57)
Min	50 (41)	32 (29)	41 (37)
Max	87 (84)	45 (43)	59 (57)
Mean	64 (58)	42 (38)	50 (45)

Feature group	Recall (%)	Precision (%)	F-score (%)
All features	54 (49)	78 (70)	62 (57)
No lexical features	46 (43)	68 (62)	54 (50)
No syntactic features	53 (48)	77 (69)	61 (55)
No orthographic features	48 (43)	70 (62)	55 (50)
No dictionary features	49 (44)	70 (62)	57 (51)

Type	Contribution to total TPs
Machine learning matches	55.3 %
Heads and Hearst Patterns	9.8 %
Title appearances	0.5 %
References and URLs	1.8 %
Version information	0.9 %
Noun/verb associations	21.4 %
Comparisons	4.0 %
Remaining	6.3 %

PERMALINK

Ambiguity and variability of database and software names in bioinformatics

Geraint Duck

Aleksandar Kovacevic

David L Robertson

Robert Stevens

Goran Nenadic

Abstract

Background

Results

Conclusions

Background

Methods

Corpus annotation and analysis

Dictionary-based approach (baseline)

Table 1.

Machine learning approach

Table 2.

Machine learning – post-processing

Evaluation

Results and discussion

Corpus annotations

Table 3.

Database and software name composition

Fig. 1.

Table 4.

Variability of resource names

Ambiguity of resource names

Dictionary-based matching

Table 5.

Table 6.

Machine-learning approach

Table 7.

Table 8.

Table 9.

Feature impact analysis for the ML model

Table 10.

Missed database and software mentions

Table 11.

Table 12.

Table 13.

Table 14.

Table 15.

Table 16.

Table 17.

False positive filtering

Conclusions

Availability of supporting data

Acknowledgements

Abbreviations

Footnotes

Contributor Information

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases