Optimising chemical named entity recognition with pre-processing analytics, knowledge-rich features and heuristics

Riza Batista-Navarro; Rafal Rak; Sophia Ananiadou

doi:10.1186/1758-2946-7-S1-S6

2015 Jan 19;7(Suppl 1):S6. doi: 10.1186/1758-2946-7-S1-S6

PMCID: PMC4331696 PMID: 25810777

Abstract

Background

The development of robust methods for chemical named entity recognition, a challenging natural language processing task, was previously hindered by the lack of publicly available, large-scale, gold standard corpora. The recent public release of a large chemical entity-annotated corpus as a resource for the CHEMDNER track of the Fourth BioCreative Challenge Evaluation (BioCreative IV) workshop greatly alleviated this problem and allowed us to develop a conditional random fields-based chemical entity recogniser. In order to optimise its performance, we introduced customisations in various aspects of our solution. These include the selection of specialised pre-processing analytics, the incorporation of chemistry knowledge-rich features in the training and application of the statistical model, and the addition of post-processing rules.

Results

Our evaluation shows that optimal performance is obtained when our customisations are integrated into the chemical entity recogniser. Compared with state-of-the-art methods under comparable experimental settings, our solution achieves competitive performance. We also show that our recogniser, when using a model trained on the CHEMDNER corpus, is suitable for recognising names in a wide range of corpora, consistently outperforming two popular chemical NER tools.

Conclusion

The contributions resulting from this work are two-fold. Firstly, we present the details of a chemical entity recognition methodology whose performance is competitive with, if not superior to, that of state-of-the-art methods. Secondly, the developed suite of solutions has been made publicly available as a configurable workflow in the interoperable text mining workbench Argo. This allows interested users to conveniently apply and evaluate our solutions in the context of other chemical text mining tasks.

Keywords: Chemical named entity recognition, Text mining, Sequence labelling, Conditional random fields, Feature engineering, Configurable workflows, Workflow optimisation

Background

In carrying out scientific work, most researchers rely on published information in order to keep abreast of recent developments in the field, to avoid repetition of work and to guide the direction of current studies. This is especially true in the field of chemistry where endeavours such as drug discovery and development are largely driven by information screened from the copious amounts of data available. Whilst databases storing structured chemical information have proliferated in the last few years, published scientific articles, technical reports, patent documents and other forms of unstructured data remain the richest source of the most current information.

Text mining facilitates the efficient distillation of information from the plethora of scientific literature. Whilst most of the scientific text mining efforts in the last decade have focussed on the identification of biomedical entities such as genes, their products and the interactions between them, the community has recently begun to appreciate the need for automatically extracting chemical information from text. Applications in chemoinformatics, drug discovery and systems biology such as automatic database curation [1], compound screening [2], detection of adverse drug reactions [3], drug repurposing [4] and metabolic pathway curation [5] are facilitated and informed by the outcomes of chemical text mining, a fundamental task of which is the recognition of chemical named entities.

Chemical named entity recognition (NER), the automatic demarcation of expressions pertaining to chemical entities within text, is considered a challenging task for a number of reasons. First, chemical names may appear in various forms, ranging from the popular and human-readable trivial and brand names to the more obscure abbreviations, molecular formulas and database identifiers, to long nomenclature-conforming expressions, e.g., International Union of Pure and Applied Chemistry (IUPAC) names and Simplified Molecular-Input Line-Entry System (SMILES) strings [6-8]. Moreover, researchers working on lead compound identification and discovery sometimes report their results using their own arbitrarily assigned abbreviations, further aggravating the proliferation of chemical names. Also considered a barrier to the development of chemical named entity recognisers is the relatively small number of available supporting corpora, compared to those developed for the recognition of biological names such as those of genes and proteins [9]. Whilst a few notable data sets containing chemical named entity annotations have been developed, there was a lack of publicly available, wide-coverage, large-scale gold standard corpora of scientific publications. Although the SciBorg corpus [10,11] contains a substantial number of manually annotated chemical names in its 42 full-text articles, it had not been publicly available until very recently. In contrast, the large-scale CALBC corpus [12] is publicly available, but is considered "silver standard" as it contains annotations resulting from the harmonisation of the outputs of five different automatic tools, rather than manual annotations. The similarly publicly available SCAI pilot corpus [13,14] contains gold standard annotations for various types of chemical names but is relatively small with only 100 MEDLINE abstracts.

This limited number of resources has influenced the means by which the state-of-the-art chemical named entity recognisers have been developed and evaluated. Built as a pipeline of several Markov model-based classifiers, the publicly available OSCAR tool [15] was tuned to recognise the annotation types defined in the SciBorg corpus. The system was evaluated by means of three-fold cross validation on this corpus as well as on a bespoke data set of 500 annotated MEDLINE abstracts. ChemSpot [16], another publicly available chemical named entity recogniser, is a hybrid between methods for dictionary matching and machine learning. For capturing brand names, this tool uses a lexicon-based approach for matching expressions against the Joint Chemical Dictionary [17]. For recognising nomenclature-based expressions, however, it employs a conditional random fields (CRF) [18] model trained on the SCAI corpus subset that contains annotations for only IUPAC names. The developers carried out a comparative evaluation of ChemSpot and OSCAR on the SCAI pilot corpus, in which the former was reported to have outperformed the latter by a margin of 10.8 percentage points. It is worth noting, however, that neither of these tools has been comparatively evaluated or benchmarked against any large-scale, gold standard corpus.

In aiming to alleviate these issues, the Critical Assessment of Information Extraction in Biology (BioCreative) initiative organised a track in the Fourth BioCreative Challenge Evaluation workshop to encourage the text mining community to develop methods for chemical named entity recognition, and enable the benchmarking of these methods against substantial gold standard data [19]. Known as CHEMDNER, this track publicly released a large corpus of documents containing manually annotated chemical named entities. The 10,000 MEDLINE abstracts in the CHEMDNER corpus [20], which were grouped into disjoint sets for training (3,500), development (3,500) and testing (3,000), came from various chemical subdomains including pharmacology, medicinal chemistry, pharmacy, toxicology and organic chemistry. Each annotated chemical name was labelled with one of the following mention types: systematic, trivial, family, abbreviation, formula, identifier, coordination and a catch-all category. The corpus served as the primary resource for the two CHEMDNER subtasks, namely, chemical entity mention recognition (CEM) and chemical document indexing (CDI). Whilst the former required participating systems to return the locations of all chemical mention instances found within a given document, the latter expected a ranked listing of unique mentions without any location information.
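To make the difference between the two subtask outputs concrete, the following minimal sketch collapses CEM-style located mentions into a CDI-style ranked list of unique strings. The tuple layout, the confidence scores and the rule of ranking by the highest score per string are all assumptions made for illustration; this passage does not describe how mentions were actually ranked.

```python
def cem_to_cdi(mentions):
    """Collapse located mentions into a ranked list of unique mention strings.

    Each mention is (start, end, text, confidence). The confidence field and
    the max-confidence ranking rule are hypothetical, for illustration only.
    """
    best = {}
    for _start, _end, text, confidence in mentions:
        best[text] = max(confidence, best.get(text, 0.0))
    # Highest confidence first; ties are broken alphabetically for determinism.
    return sorted(best, key=lambda text: (-best[text], text))


# CEM-style output: every occurrence is reported with its character offsets.
cem_output = [
    (0, 7, "aspirin", 0.91),
    (25, 32, "aspirin", 0.97),
    (40, 44, "NaCl", 0.88),
]

# CDI-style output: unique strings only, no offsets.
cdi_output = cem_to_cdi(cem_output)
```

Here the two occurrences of "aspirin" are merged into a single entry, which is the essential difference between the two outputs.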

Having participated in the CHEMDNER challenge, we have developed our own chemical named entity recogniser that obtained top-ranking performance in both the CDI (1st) and CEM (3rd) tasks. Extending that work, we describe in this paper the details of our proposed methods for optimising chemical NER performance. In the next section, we compare the performance of our methods with the state of the art and present results of our evaluation on several corpora. Furthermore, we share details on how our contributions, publicly available as a service, can be accessed and utilised by the community. The Experiments section contains a detailed discussion of our proposed methods and the experiments we have performed in order to identify the optimal solution on each of the data sets considered. We summarise the results of our work in the Conclusions section. Lastly, we provide some technical background on the techniques and evaluation metrics we have used in this study in the Methods section.

Results and discussion

We developed a conditional random fields (CRF)-based method for chemical named entity recognition whose performance was optimised by (a) the selection of best-suited pre-processing components, (b) the incorporation of CRF features capturing chemistry-specific information, and (c) the application of post-processing heuristics.

We begin by describing the results from the evaluation of our method under the settings of the CHEMDNER challenge. Next, we demonstrate that our method obtains competitive performance compared to the state of the art. We then show that a statistical model trained on a large-scale, gold standard corpus such as CHEMDNER is suitable for recognising chemical names in a wider range of corpora, on which it consistently outperforms two known chemical NER tools. Finally, we describe the availability of our approach as a configurable workflow in the interoperable text mining platform Argo [21]. Hereafter, we refer to our suite of solutions collectively as Chemical Entity Recogniser, or ChER.

Performance evaluation under the CHEMDNER challenge settings

The first set of experiments was performed based on the specifications of the BioCreative IV CHEMDNER track [19], which our research team participated in. The micro-averaged results on the CHEMDNER test set obtained by our solutions using specialised pre-processing analytics (i.e., Cafetiere Sentence Splitter and OSCAR4 Tokeniser) are presented in Table 1. These closely approximate the results which were reported for our submissions during the official BioCreative challenge evaluation [22], in which the variant employing knowledge-rich features and abbreviation recognition achieved the best performance in both the CEM and CDI subtasks.

Table 1. Performance of ChER under the BioCreative IV CHEMDNER track setting.

Custom features	Post-processing		CEM			CDI
	Abbr.	Comp.	P	R	F₁	P	R	F₁
✓	✗	✗	92.76	81.02	86.49	91.39	85.29	88.23
✓	✓	✗	92.76	81.30	86.65	91.37	85.45	88.31
✓	✗	✓	92.14	81.41	86.44	90.55	85.72	88.07
✓	✓	✓	92.14	81.69	86.60	90.53	85.88	88.14
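As a consistency check on the F₁ columns, the sketch below recomputes F₁ as the harmonic mean of precision and recall for the four CEM rows of Table 1. The function and variable names are illustrative and are not part of ChER.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as percentages)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs copied from the CEM columns of Table 1, top to bottom.
cem_rows = [(92.76, 81.02), (92.76, 81.30), (92.14, 81.41), (92.14, 81.69)]

# Rounded to two decimals, as in the table.
cem_f1 = [round(f1_score(p, r), 2) for p, r in cem_rows]
```

The recomputed values agree with the reported CEM F₁ column (86.49, 86.65, 86.44 and 86.60).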

	SciBorg (chemical molecules)				SCAI-100 (systematic names)
System	P	R	F₁	System	P	R	F₁
ChER	85.96	74.22	79.66	ChER	86.70	67.50	75.90
OSCAR	-	-	81.20	ChemSpot	57.47	67.70	62.17

NaCTeM Metabolites (ChER model trained on the CHEMDNER training & dev. sets)
System	P	R	F₁
ChER	65.08	83.29	73.07
ChemSpot	58.02	73.99	65.04
OSCAR4	35.37	84.18	49.81

Feature	Brief description	Sample features (bigrams)
Character n-grams	the set of all sequences of n consecutive characters in a token (n = 2, 3, 4)	{GS}, {SK}, {K2}, {21}, {14}, {4a}
Token n-grams	unigrams and bigrams of surface forms; unigrams and bigrams of normalised surface forms where numbers are replaced with '0's, the consecutive instances of which are compressed	{It, attenuated}, {attenuated, GSK214a}; {Aa, aaaaaaaaaa}, {aaaaaaaaaa, AAA000a}
Lemma n-grams	unigrams and bigrams of lemmatised surface forms	{It, attenuate}, {attenuate, GSK214a}
POS tag n-grams	unigrams and bigrams of part-of-speech (POS) tags	{PRP, VBD}, {VBD, NN}
Lemma & POS tag n-grams	unigrams and bigrams of lemmatised forms combined with POS tags	{It:PRP, attenuate:VBD}, {attenuate:VBD, GSK214a:NN}
Chunk information	chunk tag of current token; surface form of the enclosing chunk's	{B-NP}; {gestation}
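The character n-gram feature in the table above can be illustrated with a short sketch. It reproduces the sample bigrams listed for the token GSK214a; the function name is hypothetical, and how the features are encoded for the CRF is not shown here.

```python
def char_ngrams(token: str, n: int) -> list[str]:
    """Return every run of n consecutive characters in token, left to right."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


# Bigrams of the sample token used in the feature table.
bigrams = char_ngrams("GSK214a", 2)

# The table states that n ranges over 2, 3 and 4.
all_ngrams = {n: char_ngrams("GSK214a", n) for n in (2, 3, 4)}
```

A token shorter than n simply yields an empty list, so no special casing is needed for short tokens.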

Surface form	Lemma	Part-of-speech tag	Chunk tag
It	It	PRP	B-NP
attenuated	attenuate	VBD	B-VP
GSK214a	GSK214a	NN	B-NP
-induced	-induced	JJ	I-NP
gestation	gestation	NN	I-NP
in	in	IN	B-PP
rats	rat	NN	B-NP
.	.	.	O

Feature	Example
Initial letter is in uppercase	Boc-L-leucine
Contains only digits	206553
Contains digits	5-HTP
Contains only alphanumeric characters	HClO4
Contains only uppercase letters and digits	AFB1
Contains only uppercase letters	NO
Does not contain any lowercase letters	SKF81297
Contains non-initial uppercase letters	PbS
Contains two consecutive uppercase letters	PAHs
Has a Greek letter name as a substring	alpha-ketoacid
Contains a comma	3,14-dibromo
Contains a full stop	In(0.2)Ga(0.8)As
Contains a hyphen	HP-β-CD
Contains a forward slash	(E/Z)-Goniothalamin
Contains an opening square bracket	[(14)C]pazopanib
Contains a closing square bracket	pyrido[3,2-d]pyrimidines
Contains an opening parenthesis	I3 (-)
Contains a closing parenthesis	Fe(C10 H15)2
Contains a semi-colon	R = Me, Et; X = O, S;
Contains a percentage symbol	85%
Contains an apostrophe	5-methyl-2'-deoxycytidine
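Several of the orthographic features listed above can be expressed as simple regular expressions. The sketch below covers only a subset, uses hypothetical feature names, and considers ASCII letters only (so a Greek character such as β is not treated as lowercase); it is an illustration, not the implementation used in ChER.

```python
import re

# Illustrative subset of the orthographic checks; all names are hypothetical.
ORTHOGRAPHIC_CHECKS = {
    "initial_upper": re.compile(r"^[A-Z]"),
    "only_digits": re.compile(r"^[0-9]+$"),
    "has_digit": re.compile(r"[0-9]"),
    "only_alnum": re.compile(r"^[A-Za-z0-9]+$"),
    "only_upper_and_digits": re.compile(r"^[A-Z0-9]+$"),
    "only_upper": re.compile(r"^[A-Z]+$"),
    "no_lower": re.compile(r"^[^a-z]+$"),
    "non_initial_upper": re.compile(r"^.+[A-Z]"),
    "two_consecutive_upper": re.compile(r"[A-Z]{2}"),
    "has_hyphen": re.compile(r"-"),
}


def orthographic_features(token: str) -> set[str]:
    """Return the names of all checks that the token satisfies."""
    return {name for name, pattern in ORTHOGRAPHIC_CHECKS.items()
            if pattern.search(token)}


# Example tokens taken from the table above.
pbs = orthographic_features("PbS")
htp = orthographic_features("5-HTP")
```

Because each check is independent, a single token usually fires several features at once; for example, 5-HTP contains digits, a hyphen and two consecutive uppercase letters.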

Token	Normal form	ChEBI	DrugBank	CTD	PubChem	Jochem
For	for	O	O	O	O	O
the	the	O	O	O	O	O
preparation	preparation	O	O	O	O	O
of	of	O	O	O	O	O
hydrogel	hydrogel	O	O	B	O	B
microspheres	microsphere	O	O	O	O	O
based	base	O	O	O	O	O
on	on	O	O	O	O	O
hydroxyethyl	hydroxyethyl	O	O	B	O	B
starch	starch	B	O	I	O	I
-	_	B	O	O	O	O
hydroxyethyl	hydroxyethyl	I	O	B	O	B
methacrylate	methacrylate	I	O	I	B	I
(	_	O	O	O	O	O
HES-HEMA	hes_hema	O	O	O	O	O
)	_	O	O	O	O	O
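The B, I and O values in the table above appear to follow the usual begin/inside/outside convention for marking dictionary matches over a token sequence. A minimal sketch of such tagging is given below, assuming greedy longest-match lookup over the normalised tokens; the matching strategy is an assumption, and the three toy entries are hypothetical stand-ins chosen only so that the output mirrors the CTD column, not actual dictionary content.

```python
def bio_dictionary_tags(tokens, entries):
    """Tag tokens with B/I/O using greedy longest-match lookup.

    entries is a set of tuples of normalised tokens (a toy dictionary).
    """
    max_len = max((len(entry) for entry in entries), default=0)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shrink.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + length]) in entries:
                matched = length
                break
        if matched:
            tags[i] = "B"
            for j in range(i + 1, i + matched):
                tags[j] = "I"
            i += matched
        else:
            i += 1
    return tags


# Normal forms copied from the table above.
normal_forms = ["for", "the", "preparation", "of", "hydrogel", "microsphere",
                "base", "on", "hydroxyethyl", "starch", "_", "hydroxyethyl",
                "methacrylate", "_", "hes_hema", "_"]

# Hypothetical toy dictionary (not real CTD content).
toy_entries = {("hydrogel",), ("hydroxyethyl", "starch"),
               ("hydroxyethyl", "methacrylate")}

ctd_like_tags = bio_dictionary_tags(normal_forms, toy_entries)
```

Running one such lookup per dictionary would yield one tag column per resource, as in the table.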

	Prefixes			Suffixes
Token	size 2	size 3	size 4	size 2	size 3	size 4

Incubation	O	O	O	O	O	O
with	O	O	O	O	O	O
diisopropyl	di	O	O	yl	O	O
fluorophosphate	O	O	fluo	O	ate	O
and	O	O	O	O	O	O
bis-(4-nitrophenyl)	O	O	O	O	O	O
phosphate	O	O	O	O	ate	O
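The affix features in the table above can be sketched as a lookup of each token's leading and trailing 2, 3 and 4 characters in lists of known chemical affixes. The affix sets below are toy values chosen only to reproduce the rows shown; the actual affix lists are not given in this passage, and the function name is hypothetical.

```python
def affix_features(token, prefixes, suffixes, sizes=(2, 3, 4)):
    """Return prefix features followed by suffix features for each size.

    A feature is the affix itself if it is a known chemical affix, else 'O'.
    Tokens shorter than the requested size yield 'O'.
    """
    def lookup(affix, known):
        return affix if affix in known else "O"

    prefix_feats = [lookup(token[:n], prefixes) if len(token) >= n else "O"
                    for n in sizes]
    suffix_feats = [lookup(token[-n:], suffixes) if len(token) >= n else "O"
                    for n in sizes]
    return prefix_feats + suffix_feats


# Toy affix lists (assumed for illustration only).
toy_prefixes = {"di", "fluo"}
toy_suffixes = {"yl", "ate"}

diisopropyl = affix_features("diisopropyl", toy_prefixes, toy_suffixes)
fluorophosphate = affix_features("fluorophosphate", toy_prefixes, toy_suffixes)
```

With these toy lists, tokens such as "and" or "with" produce 'O' in all six positions, matching the corresponding rows.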

Token	Basic segments	No. of basic segments
10-acetoxyactinidine	10, acet, oxy, actin, idine	5

methylergonovine	methyl, ergo, novi, ne	4

interleukin-2	interleukin, 2	2

	Macro			Micro
Feature set	P	R	F₁	P	R	F₁
Default features	86.66	79.01	80.89	88.55	76.82	82.27
Enriched features	88.26	81.11	82.86	89.87	78.99	84.07
Margin	+1.60	+2.10	+1.97	+1.32	+2.17	+1.80

Subtype	Frequency	Percentage
Abbreviation	1,882	30.32%
Formula	1,291	20.80%
Family	979	15.77%
Trivial	926	14.92%
Systematic	693	11.16%
Identifier	293	4.72%
Multiple	118	1.90%
No class	25	0.40%

NaCTeM Metabolites (ChER evaluated by 10-fold cross-validation)
System	P	R	F₁
ChER	81.42	79.66	80.53
MetaboliNER	83.02	74.42	78.49

	SCAI-100 (all names)			Patents
System	P	R	F₁	P	R	F₁
ChER	77.85	78.69	78.27	73.43	57.91	64.75
ChemSpot	76.35	72.55	74.41	67.79	41.97	51.84
OSCAR4	50.88	81.34	62.60	49.90	60.73	54.79

	DDI test			PK
System	P	R	F₁	P	R	F₁
ChER	75.88	92.05	83.18	79.83	88.34	83.87
ChemSpot	73.09	89.49	80.46	65.29	86.07	74.25
OSCAR4	60.20	85.51	70.66	42.65	81.71	56.04

Token initially recognised as non-chemical	Chemical basic segments	Ratio
polycalcium	poly, calcium	1.0

2-methoxyestradiol	meth, oxy, estra, di, ol	0.89

palytoxin	toxin	0.56
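The ratios in the table above are consistent with dividing the number of characters covered by chemical basic segments by the length of the token. That reading is inferred from the three rows shown rather than stated in this passage, so the sketch below should be taken as an assumption; the function name is hypothetical.

```python
def chemical_segment_ratio(token: str, chemical_segments: list[str]) -> float:
    """Fraction of the token's characters covered by chemical basic segments."""
    if not token:
        return 0.0
    covered = sum(len(segment) for segment in chemical_segments)
    return round(covered / len(token), 2)


# Tokens and segments copied from the table above.
ratios = {
    "polycalcium": chemical_segment_ratio("polycalcium", ["poly", "calcium"]),
    "2-methoxyestradiol": chemical_segment_ratio(
        "2-methoxyestradiol", ["meth", "oxy", "estra", "di", "ol"]),
    "palytoxin": chemical_segment_ratio("palytoxin", ["toxin"]),
}
```

Under this reading, 2-methoxyestradiol has 16 of its 18 characters covered (0.89) and palytoxin 5 of 9 (0.56), in agreement with the table.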

	Data		Pre-processing		Cust.	Post-processing		Micro-averages
No.	Training	Test	Splitter	Tokeniser	Feats.	Abbr.	Comp.	P	R	F₁
1	CHEMDNER training & dev.	CHEMDNER test	LingPipe	GENIA	✗	✗	✗	88.87	70.95	78.91
			Cafetiere	OSCAR4	✓	✓	✓	92.76	81.30	86.65
2	SciBorg (CM): 3-fold CV		LingPipe	GENIA	✗	✗	✗	80.44	55.16	65.45
			Cafetiere	OSCAR4	✓	✓	✓	85.96	74.22	79.66
3	SCAI-IUPAC training	SCAI-100 (IUPAC)	LingPipe	GENIA	✗	✗	✗	84.78	66.87	74.77
			Cafetiere	GENIA	✓	✓	✓	86.70	67.50	75.90
4	NaCTeM Metabolites: 10-fold CV		LingPipe	GENIA	✗	✗	✗	81.72	64.49	72.09
			Cafetiere	OSCAR4	✓	✓	✓	81.42	79.66	80.53
5	CHEMDNER training & dev.	SCAI-100 (All)	LingPipe	GENIA	✗	✗	✗	72.56	66.00	69.13
			Cafetiere	OSCAR4	✓	✓	✓	77.85	78.69	78.27
6	CHEMDNER training & dev.	Patents	LingPipe	GENIA	✗	✗	✗	72.66	52.97	61.27
			Cafetiere	OSCAR4	✓	✓	✓	73.43	57.91	64.75
7	CHEMDNER training & dev.	DDI test	LingPipe	GENIA	✗	✗	✗	76.52	75.00	75.75
			Cafetiere	OSCAR4	✓	•	✓	75.88	92.05	83.18
8	CHEMDNER training & dev.	PK	LingPipe	GENIA	✗	✗	✗	79.29	84.66	81.89
			Cafetiere	GENIA	✓	✓	✓	79.83	88.34	83.87
9	CHEMDNER training & dev.	NaCTeM Metabolites	LingPipe	GENIA	✗	✗	✗	63.57	71.63	67.36
			Cafetiere	OSCAR4	✓	✓	✓	65.08	83.29	73.07
