Enriching a biomedical event corpus with meta-knowledge annotation

Paul Thompson; Raheel Nawaz; John McNaught; Sophia Ananiadou

doi:10.1186/1471-2105-12-393

2011 Oct 10;12:393. doi: 10.1186/1471-2105-12-393

PMCID: PMC3222636 PMID: 21985429

Abstract

Background

Biomedical papers contain rich information about entities, facts and events of biological relevance. To discover these automatically, we use text mining techniques, which rely on annotated corpora for training. In order to extract protein-protein interactions, genotype-phenotype/gene-disease associations, etc., we rely on event corpora that are annotated with classified, structured representations of important facts and findings contained within text. These provide an important resource for the training of domain-specific information extraction (IE) systems, to facilitate semantic-based searching of documents. Correct interpretation of these events is not possible without additional information, e.g., does an event describe a fact, a hypothesis, an experimental result or an analysis of results? How confident is the author about the validity of her analyses? These and other types of information, which we collectively term meta-knowledge, can be derived from the context of the event.

Results

We have designed an annotation scheme for meta-knowledge enrichment of biomedical event corpora. The scheme is multi-dimensional, in that each event is annotated for 5 different aspects of meta-knowledge that can be derived from the textual context of the event. Textual clues used to determine the values are also annotated. The scheme is intended to be general enough to allow integration with different types of bio-event annotation, whilst being detailed enough to capture important subtleties in the nature of the meta-knowledge expressed in the text. We report here on both the main features of the annotation scheme and its application to the GENIA event corpus (1000 abstracts with 36,858 events). High levels of inter-annotator agreement have been achieved, falling in the range of 0.84-0.93 Kappa.

Conclusion

By augmenting event annotations with meta-knowledge, more sophisticated IE systems can be trained, which allow interpretative information to be specified as part of the search criteria. This can assist in a number of important tasks, e.g., finding new experimental knowledge to facilitate database curation, enabling textual inference to detect entailments and contradictions, etc. To our knowledge, our scheme is unique within the field with regard to the diversity of meta-knowledge aspects annotated for each event.

Background

Due to the rapid advances in biomedical research, scientific literature is being published at an ever-increasing rate. This makes it highly important to provide researchers with automated, efficient and accurate means of locating the information they require, allowing them to keep abreast of developments within biomedicine [1-5]. Such automated means can be facilitated through text mining, which is receiving increasing interest within the biomedical field [6,7]. Text mining enriches text via the addition of semantic metadata, and thus permits tasks such as analysing molecular pathways [8] and semantic searching [9].

Event-based text mining

Information extraction (IE) systems facilitate semantic searching by producing classified, structured, template-like representations of important facts and findings contained within documents, called events. As the features of texts and the types of events to be recognised vary between different subject domains, IE systems must be adapted to deal with specific domains. A well-established method of carrying out this adaptation is through training using annotated corpora (e.g., [10-12]). Accordingly, a number of corpora of biomedical texts annotated for events have been produced (e.g., [13-15]), on which IE systems in the biomedical domain can be trained.

Event annotation in these corpora typically includes the identification of the trigger, type and participants of the event. The event trigger is a word or phrase in the sentence which indicates the occurrence of the event, and around which the other parts of the event are organised. The event type (generally assigned from an ontology) categorises the type of information expressed by the event. The event participants, i.e., entities or other events that contribute towards the description of the event, are also part of the event representation, and are often categorised using semantic role labels such as CAUSE (i.e., the entity or other event that is responsible for the event occurring) and THEME (i.e., the entity or other event that is affected by the event) to indicate their contribution towards the event description. Events that contain further events as participants are often referred to as complex events, while simple events only contain entities as participants. Usually, semantic types (e.g. gene, protein, etc.) are also assigned to the named entities (NEs) participating in the event. Other types of participants are also possible, corresponding, for example, to the location or environmental conditions under which the event took place.

In order to illustrate this typical event representation, consider sentence (1).

(1) The results suggest that the narL gene product activates the nitrate reductase operon.

The typical structured representation of the biomedical event described in this sentence is as follows:

EVENT-TRIGGER: activates

EVENT-TYPE: positive_regulation

THEME: nitrate reductase operon: operon

CAUSE: narL gene product: protein
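The representation above can be sketched as a small data structure. This is only an illustration: the class and field names (`Entity`, `Event`, `participants`, `is_complex`) are assumptions made for the example and are not taken from any of the corpora discussed.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass(frozen=True)
class Entity:
    """A named entity with its semantic type (e.g. protein, operon)."""
    text: str
    semantic_type: str


@dataclass
class Event:
    """A bio-event: trigger, ontology type and role-labelled participants."""
    trigger: str
    event_type: str
    # Role label (e.g. "THEME", "CAUSE") -> entity or nested event.
    participants: Dict[str, Union[Entity, "Event"]] = field(default_factory=dict)

    def is_complex(self) -> bool:
        """Complex events have at least one event as a participant."""
        return any(isinstance(p, Event) for p in self.participants.values())


# The event described in sentence (1).
event_1 = Event(
    trigger="activates",
    event_type="positive_regulation",
    participants={
        "THEME": Entity("nitrate reductase operon", "operon"),
        "CAUSE": Entity("narL gene product", "protein"),
    },
)
print(event_1.is_complex())  # False: both participants are entities
```

Allowing a participant to be either an entity or another event is what lets the same structure cover both simple and complex events.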

The automatic recognition of such structured events facilitates sophisticated semantic querying of documents, which provides much greater power than conventional search techniques. Rather than simply searching for keywords in documents, users can search for specific types of events in documents, through (partial) completion of a template. This template allows different types of restrictions to be placed on the events that are required to be found [9], e.g.:

• The type of event to be retrieved.

• The types of participants that should be present in the event.

• The values of these participants, which could be specified in terms of either specific values or NE types.

The fact that event and NE types are often hierarchically structured can provide the user with a large amount of flexibility regarding the generality or specificity of their query.
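Template-based searching of this kind can be sketched as follows. The sketch is hypothetical: the dictionary layout, the `matches` and `is_a` helpers, and the two-level type hierarchy are assumptions made for illustration, not the query mechanism of any particular search system.

```python
from typing import Dict, Optional

# Hypothetical fragment of an event-type hierarchy (child -> parent).
TYPE_PARENT: Dict[str, str] = {
    "positive_regulation": "regulation",
    "negative_regulation": "regulation",
}


def is_a(event_type: str, wanted: str) -> bool:
    """True if event_type equals wanted or is a descendant of it."""
    current: Optional[str] = event_type
    while current is not None:
        if current == wanted:
            return True
        current = TYPE_PARENT.get(current)
    return False


def matches(event: dict, template: dict) -> bool:
    """Check an event against a partially completed template.

    Keys that are absent from the template place no restriction on the
    event. A participant restriction may give a specific value ("text"),
    an NE type ("type"), or both.
    """
    if "event_type" in template and not is_a(event["event_type"], template["event_type"]):
        return False
    for role, wanted in template.get("participants", {}).items():
        actual = event["participants"].get(role)
        if actual is None:
            return False
        for key in ("text", "type"):
            if key in wanted and actual[key] != wanted[key]:
                return False
    return True


event = {
    "event_type": "positive_regulation",
    "participants": {
        "THEME": {"text": "nitrate reductase operon", "type": "operon"},
        "CAUSE": {"text": "narL gene product", "type": "protein"},
    },
}
# A general query: any regulation event whose CAUSE is a protein.
template = {"event_type": "regulation", "participants": {"CAUSE": {"type": "protein"}}}
print(matches(event, template))  # True
```

Querying with the parent type `regulation` retrieves the more specific `positive_regulation` event, which is the flexibility that a hierarchical type system gives the user.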

Event interpretation and the role of meta-knowledge

Despite the increased power and more focussed searching that event-based searching can provide over traditional keyword-based searches, typical event annotations do not capture contextual information from the sentence, which can be vital for the correct interpretation of the event [16]. Let us consider again sentence (1), and in particular the phrase at the beginning of the sentence, i.e., "The results suggest that ...". This phrase allows us to determine the following about the event that follows:

• It is based on an analysis of experimental results.

• It is stated with a certain amount of speculation (according to the use of the verb suggest, rather than a more definite verb, such as demonstrate).

Altering the words in the context of the event can affect its interpretation in both subtle and significant ways. Consider the examples below:

(2a) It is known that the narL gene product activates the nitrate reductase operon.

(2b) We examined whether the narL gene product activates the nitrate reductase operon.

(2c) The narL gene product did not activate the nitrate reductase operon.

(2d) These results suggest that the narL gene product might be activated by the nitrate reductase operon.

(2e) The narL gene product partially activated the nitrate reductase operon.

(2f) Previous studies have shown that the narL gene product activates the nitrate reductase operon.

If only the event type and participants are considered, then the events in sentences (2a-f) are identical to the event in sentence (1). However, the examples clearly illustrate that it is important to consider the context in which the event occurs, since a wide range of different types of information may be expressed that relate directly to the interpretation of the event.

In sentence (2a), the word "known" tells us that the event is a generally accepted fact, while in (2b), the interpretation is completely different. The word "examined" shows that the event is under investigation, and hence the truth value of the event is unknown. The presence of the word "not" in sentence (2c) shows that the event is negated, i.e. it did not happen. In sentence (2d), the presence of the word "might" (in addition to "suggest") adds further speculation regarding the truth of the event. The word "partially" in (2e) does not challenge the truth of the event, but rather conveys the information that the strength or intensity of the event is less than may be expected by default. Finally, the phrase "previous studies" in (2f) shows that the event is based on information available in previously published papers, rather than relating to new information from the current study.

We use the term meta-knowledge to collectively refer to the different types of interpretative information available in the above sentences. There are several tasks in which biologists have to search and review the literature that could benefit from the automatic recognition of meta-knowledge about events. These tasks include building and updating models of biological processes, such as pathways [17] and curation of biological databases [18,19]. Central to both of these tasks is the identification of new knowledge that can enhance these resources, e.g., to build upon an existing, but incomplete model of a biological process [20], or to ensure that a database is kept up to date. New knowledge should correspond to experimental findings or conclusions that relate to the current study, which are stated with a high degree of confidence, rather than, e.g., more tentative hypotheses. In the case of an analytical conclusion, it may be important to find appropriate evidence that supports this claim [16] before allowing it to be added to the database.

Other users may be interested in checking for inconsistencies or contradictions in the literature. The identification of meta-knowledge could also help to flag such information. Consider, for example, the case where an event with the same ontological type and identical participants is stated as being true in one article and false in another. If the textual context of both events shows them to have been stated as facts, then this could constitute a serious contradiction. If, however, one of the events is marked as being a hypothesis, then the consequences are not so serious, since the hypothesis may have been later disproved. The automatic identification of meta-knowledge about events can clearly be an asset in such scenarios, and can prevent users from spending time manually examining the textual context of each and every event that has been extracted from a large document collection in order to determine the intended interpretation.
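The contradiction check described in this paragraph might be sketched as below. The record layout (`doc`, `negated`, `status`) and the severity labels are assumptions made for the example; they are not part of the annotation scheme itself.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def find_conflicts(events: List[dict]) -> List[Tuple[str, str, str]]:
    """Pair up events that share type and participants but differ in polarity."""
    groups: Dict[tuple, List[dict]] = defaultdict(list)
    for ev in events:
        key = (ev["event_type"], tuple(sorted(ev["participants"].items())))
        groups[key].append(ev)

    conflicts = []
    for same in groups.values():
        for i, a in enumerate(same):
            for b in same[i + 1:]:
                if a["negated"] == b["negated"]:
                    continue  # same polarity, nothing to flag
                # Only two events both stated as facts count as serious.
                both_facts = a["status"] == "fact" and b["status"] == "fact"
                severity = "serious" if both_facts else "tentative"
                conflicts.append((a["doc"], b["doc"], severity))
    return conflicts


participants = {"CAUSE": "narL gene product", "THEME": "nitrate reductase operon"}
events = [
    {"doc": "A", "event_type": "positive_regulation", "participants": participants,
     "negated": False, "status": "fact"},
    {"doc": "B", "event_type": "positive_regulation", "participants": participants,
     "negated": True, "status": "fact"},
    {"doc": "C", "event_type": "positive_regulation", "participants": participants,
     "negated": True, "status": "hypothesis"},
]
print(find_conflicts(events))
# [('A', 'B', 'serious'), ('A', 'C', 'tentative')]
```

Without the meta-knowledge fields, all three records would look identical and a user would have to read each source sentence to tell them apart.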

In response to the issues outlined above, we have developed a new annotation scheme that is specifically tailored to enriching biomedical event corpora with meta-knowledge, in order to facilitate the training of more useful systems in the context of various IE tasks performed on biomedical literature. As illustrated by the example sentences above, a number of different types of meta-knowledge may be encoded in the context of an event, e.g., general information type (fact, experimental result, analysis of results), level of confidence/certainty towards the event, polarity of the event (positive or negative), etc. In order to account for this, our annotation scheme is multi-dimensional, with each dimension encoding a different type of information. Each of the 5 dimensions has a fixed set of possible values. For each event, the annotation task consists of determining the most appropriate value for each dimension. Textual clue expressions that are used to determine the values are also annotated, when they are present.

Following an initial annotation experiment by two of the authors to evaluate the feasibility of the scheme [21], we applied our scheme to the complete GENIA event corpus [14]. This consists of 1000 MEDLINE abstracts, containing a total of 36,858 events. The annotation was carried out by two annotators, who were trained in the application of the scheme, and provided with a comprehensive set of annotation guidelines. The consistency and quality of the annotations produced were ensured through double annotation of a portion of the corpus.
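Agreement on a double-annotated portion is the kind of quantity that the Kappa figures in the abstract summarise. The following is a minimal sketch, assuming Cohen's kappa for two annotators (this section does not say which variant was used); the function name and the labels are made up for illustration.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same category
    # if each annotator labelled at random with their own frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single, identical category
    return (observed - expected) / (1 - expected)


# Made-up Polarity labels from two annotators for four events.
annotator_1 = ["Positive", "Positive", "Negative", "Positive"]
annotator_2 = ["Positive", "Negative", "Negative", "Positive"]
print(cohen_kappa(annotator_1, annotator_2))  # 0.5
```

Because each dimension has its own set of categories, agreement would be computed separately per dimension.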

To our knowledge, the enriched corpus represents a unique effort within the domain, in terms of the amount of meta-knowledge information annotated at such a fine-grained level of granularity (i.e., events). As the GENIA event corpus is currently the largest biomedical corpus annotated with events, the enrichment of this entire corpus with meta-knowledge annotation constitutes a valuable resource for training IE systems to recognise not only the core information about events and their participants, but also additional information to aid in their correct interpretation and provide enhanced search facilities. The corpus and annotation guidelines may be downloaded for academic purposes from http://www.nactem.ac.uk/meta-knowledge/.

Related work

Although our approach to annotating multi-dimensional meta-knowledge information at the level of events is novel, the more general study of how knowledge in biomedical texts can be classified to aid in its interpretation is a well-established research topic. Two main threads of research can be identified, i.e.:

1) Construction of classified inventories of lexical markers (i.e., words or phrases) which can accompany statements to indicate their intended interpretation.

2) Production of corpora annotated with various different types of meta-knowledge at differing levels of granularity.

Lexical markers of meta-knowledge

The presence of specific cue words and phrases has been shown to be an important factor in classifying biomedical sentences automatically according to whether or not they express speculation [22,23]. Corpus-based studies of hedging (i.e., speculative statements) in biological texts [24,25] reinforce the above experimental findings, in that 85% of hedges were found to be conveyed lexically, i.e., through the use of particular words and phrases, rather than through more complex means, e.g., by using conditional clauses. The lexical means of hedging in biological texts have also been found to be quite different to academic writing in general, with modal auxiliaries (e.g., may, could, would, etc.) playing a more minor role, and other verbs, adjectives and adverbs playing a more significant role [24]. It has also been shown that, in addition to speculation, specific lexical markers can denote other types of information pertinent to meta-knowledge identification, e.g., markers of certainty [26], as well as deductions or sensory (i.e. visual) evidence [24].

Based on the above, we can determine that lexical markers play an important role in distinguishing several different types of meta-knowledge, and also that there is a potentially wide range of different markers that can be used. For example, [27] identified 190 hedging cues that are used in biomedical research articles. Our own previous work [28] on identifying and categorising lexical markers of meta-knowledge demonstrated that such markers are to some extent domain-dependent. In contrast to other studies, we took a multi-dimensional approach to the categorisation, acknowledging that different types of meta-knowledge may be expressed through different words in the same sentence. As an example, consider sentence (3).

(3) The DNA-binding properties of mutations at positions 849 and 668 may indicate that the catalytic role of these side chains is associated with their interaction with the DNA substrate.

Firstly, the word "indicate" denotes that the statement that follows is to be interpreted as an analysis based on the evidence given at the beginning of the sentence (rather than, e.g., a well-known fact or a direct experimental observation). Secondly, the word "may" conveys the fact that the author only has a medium level of confidence regarding this analysis.

Although such examples serve to demonstrate that a multi-dimensional approach recognising meta-knowledge information is necessary to correctly capture potential nuances of interpretation, it is important to note that taking a purely lexical approach to recognising meta-knowledge is not sufficient (i.e., simply looking for words from lists of cues that co-occur in the same sentences as events of interest). The reasons for this include:

a) The presence of a particular marker does not guarantee that the "expected" interpretation can be assumed [29]. Some markers may have senses that vary according to their context. As noted in [30], "Every instance should ... be studied in its sentential co-text" (p.125).

b) Although lexical markers are an important part of meta-knowledge recognition, there are other ways in which meta-knowledge can be expressed. This has been demonstrated in a study involving the annotation of rhetorical zones in biology papers (e.g., background, method, result, implication, etc.) [31], based on a scheme originally proposed in [32]. An analysis of features used to determine different types of zone in the annotated papers revealed that, in addition to explicit lexical markers, features such as the main verb in the clause, tense, section, position of the sentence within the paragraph and presence of citations in the sentence can also be important.

Thus, rather than assigning meta-knowledge based only on categorised lists of clue words and expressions, there is a need to produce corpora annotated with meta-knowledge, on which enhanced IE systems can be trained. By annotating meta-knowledge information for each relevant instance (e.g., an event), regardless of the presence of particular lexical markers, systems can be trained to recognise other types of features that can help to assign meta-knowledge values. However, given that the importance of lexical markers in the recognition of meta-knowledge has been clearly illustrated, explicit annotation of such markers should be carried out as part of the annotation process, whenever they are present.

Existing corpora with meta-knowledge annotations

There are several existing corpora with some degree of meta-knowledge annotation. These corpora vary in both the richness of the annotation added, and the type/size of the units at which the meta-knowledge annotation has been performed. Taking the unit of annotation into account, we can distinguish between annotations that apply to continuous text spans, and annotations that have been performed at the event level.

Annotations applied to continuous text spans usually cover only a single aspect of meta-knowledge, and are most often carried out at the level of the sentence. The most common types of meta-knowledge annotated correspond to either speculation/certainty level, e.g., [22,23], or general information content/rhetorical intent, e.g., background, methods, results, insights, etc. This latter type of annotation has been attempted both on abstracts [33,34] and full papers [31,32,35], using schemes of varying complexity, ranging from 4 categories for abstracts, up to 14 categories for one of the full paper schemes. Accurate automatic categorisation of sentences in abstracts has been shown to be highly feasible [36], and this functionality has been integrated into the MEDIE intelligent search system [37].

A few annotation schemes consider more than one aspect of meta-knowledge. For example, the ART corpus and its CoreSC annotation scheme [38,39] augment general information content categories with additional attributes, such as New and Old, to denote current or previous work. The corpus described in [40] annotates both speculation and negation, together with their scopes. Uniquely amongst the corpora mentioned above, [40] also annotates the clue expressions (i.e. the negative and speculative keywords) on which the annotations are based.

Although sentences or larger zones of text [32] constitute straightforward and easily identifiable units of text on which to perform annotation, a problem is that a single sentence may express several different pieces of information, as illustrated by sentence (4).

(4) Inhibition of the MAP kinase cascade with PD98059, a specific inhibitor of MAPK kinase 1, may prevent the rapid expression of the alpha2 integrin subunit.

This sentence contains at least 3 distinct pieces of information:

• Description of an experimental method: Inhibition of the MAP kinase cascade with PD98059.

• A general fact: PD98059 is a specific inhibitor of MAPK kinase 1.

• A speculative analysis: Inhibition of the MAP kinase may prevent the expression of the alpha2 integrin subunit.

The main verb in the sentence (i.e., prevent) describes the speculative analysis. In a sentence-based annotation scheme, this is likely to be the only information that is encoded. However, this means that other potentially important information in the sentence is disregarded. Some annotation schemes have attempted to overcome such problems by annotating meta-knowledge below the sentence level, i.e., clauses [41,42] or segments [43]. In the case of the latter scheme, a new segment is created whenever there is a change in the meta-knowledge being expressed. The scheme proposed for segments is more complex than the sentence-based schemes, in that it covers multiple types of meta-knowledge, i.e., focus (content type), polarity, certainty, type of evidence and direction/trend (either increase or decrease in quantity/quality). It has, however, been shown that training a system to automatically annotate along these different dimensions is highly feasible [44].

At the level of biomedical events, annotation of meta-knowledge is generally very basic, and is normally limited to negation, e.g., [15]. Negation is also the only attribute annotated in the corpus described in [45], even though a more complex scheme involving certainty, manner and direction was also initially proposed. To our knowledge, only the GENIA event corpus [14] goes beyond negation annotation, in that different levels of certainty (i.e. probable and doubtful) are also annotated.

Despite this current paucity of meta-knowledge annotation for events, our earlier examples have demonstrated that further information can usefully be specified at this level, including at least the general information content of the event, e.g. fact, experimental observation, analysis, etc. A possibility would be to "inherit" this information from a system trained to assign such information at the text span level (e.g. sentences or fragments), although this would not provide an optimal solution. The problem lies in the fact that text spans constitute continuous stretches of text, but events do not. The different constituents of an event annotation (i.e., trigger and participants) can be drawn from multiple, discontinuous parts of a sentence. There are almost always multiple events within a sentence, and the different participants of a particular event may be drawn from multiple sentence fragments. This means that mapping between text span meta-knowledge and event-level meta-knowledge cannot be carried out in a straightforward manner. Thus, for the purposes of training more sophisticated event-based information search systems, annotation of meta-knowledge directly at the event level can provide more precise and accurate information that relates directly to the event.

Based on the above findings, we embarked upon the design of an event-based meta-knowledge annotation scheme specifically tailored for biomedical events. In the remainder of this paper, we firstly cover the key aspects of this annotation scheme, followed by a description of the recruitment and training of annotators. We follow this by providing detailed statistics, results and evaluation of the application of the scheme to the GENIA event corpus. Finally, we present some conclusions and directions for further research.

Methods

In this section, we begin by providing a general overview of our annotation scheme, followed by a more detailed description of each annotation dimension. Following a brief overview of the software used to perform the annotation, we describe how we conducted an annotation experiment to test the feasibility and soundness of our scheme, prior to beginning full-scale annotation. The section concludes with a brief explanation of the recruitment and training of our annotators.

Meta-knowledge annotation scheme for events

The aim of our meta-knowledge scheme is to capture as much useful information as possible that is specified about individual events in their textual context, in order to support users of event-based search systems in a number of tasks, including the discovery of new knowledge and the detection of contradictions. In order to achieve this aim, our annotation scheme identifies 5 different dimensions of information for each event, taking inspiration from previous multi-dimensional schemes (e.g. [39,43,45]). In addition to allowing several distinct types of information to be encoded about events, a multi-dimensional scheme is advantageous, in that the interplay between the different dimension values can be used to derive further useful information (hyper-dimensions) regarding the interpretation of the event.

Each dimension of the meta-knowledge scheme consists of a set of complete and mutually-exclusive categories, i.e., any given bio-event belongs to exactly one category in each dimension. The set of possible values for each dimension was determined through a detailed study of over 100 event-annotated biomedical abstracts. In order to minimise the annotation burden, the number of possible categories within each dimension has been kept as small as possible, whilst still respecting important distinctions in meta-knowledge that have been observed during our corpus study. Due to the demonstrated importance of lexical clues in the identification of certain meta-knowledge categories, the annotation task involves identifying such clues, when they are present.

Figure 1 provides an overview of the annotation scheme. Below, we provide a brief description of each annotation dimension. Further details and examples are provided in the comprehensive (66-page) annotation guidelines, which are available at: http://www.nactem.ac.uk/meta-knowledge/Annotation_Guidelines.pdf

**Figure 1. Meta-knowledge annotation scheme**. The boxes with the grey background correspond to information that is common to most bio-event annotation schemes, i.e., the participants in the event, together with an indication of the class or type of the event. The boxes with the dark green backgrounds correspond to our proposed meta-knowledge annotation dimensions and their possible values, whilst the light green box shows the hyper-dimensions that can be derived by considering a combination of the annotated dimensions.

Knowledge Type (KT)

This dimension is responsible for capturing the general information content of the event. The type of information encoded is at a slightly different level to some of the comparable sentence-based schemes, which have categories relating to structure or "zones" within a document, e.g. background or conclusion. Rather, our KT dimension attempts to identify a small number of more general information types that can be used to characterise events, regardless of the zone in which they occur. As such, our scheme can be seen as complementary to structure or zone-based schemes, providing a finer-grained analysis of the different types of information that can occur within a particular zone. The KT features we have defined are as follows:

• Investigation: Enquiries or investigations, which have either already been conducted or are planned for the future, typically accompanied by lexical clues like examined, investigated and studied, etc.

• Observation: Direct observations, sometimes represented by lexical clues like found, observed and report, etc. Event triggers in the past tense typically also describe observations.

• Analysis: Inferences, interpretations, speculations or other types of cognitive analysis, always accompanied by lexical clues, typical examples of which include suggest, indicate, therefore and conclude, etc.

• Method: Events that describe experimental methods. Denoted by trigger words that describe experimental methods, e.g., stimulate, addition.

• Fact: Events that describe general facts and well-established knowledge, typically denoted by present tense event triggers that describe biological processes, and sometimes accompanied by the lexical clue known.

• Other: The default category, assigned to events that either do not fit into one of the above categories, do not express complete information, or whose KT is unclear or cannot be assigned from the context. These are mostly non-propositional events, i.e., events which cannot be ascribed a truth value due to lack of available (contextual) information.

Certainty Level (CL)

This dimension aims to identify the level of certainty associated with the occurrence of the event, as ascribed by the authors. It comes into play whenever there is an explicit indication that there is less than complete confidence that the specified event will occur. This could be because:

• There is uncertainty regarding the general truth value ascribed to the event.

• It is perceived that the event may not take place all of the time.

Different degrees of uncertainty and frequency can be considered as points on a continuous scale, and there is an ongoing discussion regarding whether it is possible to partition the epistemic scale into discrete categories [42]. However, the use of a number of distinct categories is undoubtedly easier for annotation purposes and has been proposed in a number of previous schemes. Although recent work has suggested the use of four or more categories [28,42,44], our initial analysis of bio-event corpora showed that only three levels of certainty seem readily distinguishable for bio-events. This is in line with [46], whose analysis of general English showed that there are at least three articulated points on the epistemic scale.

Like the scheme described in [43], we have chosen to use numerical values for the CL dimension, in order to reduce potential annotator confusions or biases that may be introduced through the use of labels corresponding to particular lexical markers of each category, such as probable or possible. Such labels could in any case be misleading, given that frequency can also come into play in assigning the correct category. Our chosen values of the CL dimension are defined as follows:

• L3: The default category. No explicit expression that either:

(a) There is uncertainty or speculation towards the event.

(b) The event does not occur all of the time.

• L2: Explicit indication of either:

(a) High (but not complete) confidence or slight speculation towards the event. Typical lexical clues include likely, probably, suggest and indicate.

(b) The event occurs frequently, but not all of the time. Typical lexical clues include normally, often, frequently.

• L1: Explicit indication of either:

(a) Low confidence or considerable speculation towards the event. Typical lexical clues include may, might and perhaps.

(b) The event occurs infrequently or only some of the time. Typical lexical markers may include sometimes, rarely, scarcely, etc.
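The clue lists above can be turned into a simple candidate lookup. This is a sketch only: the function name is hypothetical, matching is on exact lower-cased tokens (so inflected forms such as "suggests" are missed), and choosing the most speculative level when several clues co-occur is an assumption based on the discussion of sentence (2d). A clue found this way is only a candidate, since each marker still has to be judged in its sentential context.

```python
from typing import Iterable, List, Tuple

# Typical lexical clues listed for each non-default certainty level.
CL_CLUES = {
    "L2": {"likely", "probably", "suggest", "indicate",
           "normally", "often", "frequently"},
    "L1": {"may", "might", "perhaps", "sometimes", "rarely", "scarcely"},
}


def candidate_cl(tokens: Iterable[str]) -> Tuple[str, List[str]]:
    """Propose a CL value and the clue tokens that support it."""
    lowered = [t.lower() for t in tokens]
    for level in ("L1", "L2"):  # most speculative level first
        found = [t for t in lowered if t in CL_CLUES[level]]
        if found:
            return level, found
    return "L3", []  # default: no explicit uncertainty or frequency clue


sentence_2d = ("These results suggest that the narL gene product might be "
               "activated by the nitrate reductase operon").split()
print(candidate_cl(sentence_2d))  # ('L1', ['might'])
```

Returning the supporting clue alongside the value mirrors the scheme's requirement that clue expressions are annotated whenever they are present.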

Polarity

This dimension has been designed to capture the truth value of the assertion encapsulated by the event. We define a negated event as one that describes the absence or non-existence of an entity or a process. That is to say, the event may describe that a process does not or did not happen, or that an entity is absent or does not exist. The recognition of such information is vital, as the interpretation of a negated event instance is completely opposite to the interpretation of a non-negated (positive) instance of the same event. Our scheme permits the following two values for this dimension:

• Positive: No explicit negation of the event (default)

• Negative: The event has been negated according to the description above. The negation may be indicated through lexical clues such as no, not, fail, lack, etc.

Manner

This dimension identifies the rate, level, strength or intensity of the event (in biological terms). Such information has previously been shown to be relevant for biologists. The event annotation scheme for the GREC corpus [13], which was designed in consultation with biologists, identified expressions of manner as one of the semantic roles associated with events. The proposal for the annotation of protein-protein interactions suggested in [45] also lists manner as a potentially useful attribute to annotate. Inspired by these works, we build upon the types of manner annotation available in the GREC corpus by adopting a three-way categorisation of manner, as shown below:

• High: Explicit indication that the event occurs at a high rate, level, strength or intensity. Clue expressions are typically adjectives or adverbs such as high, strongly, rapidly, potent, etc.

• Low: Explicit indication that the event occurs at a low rate, level, strength or intensity. Clue expressions are typically adjectives and adverbs such as slightly, partially, small, etc.

• Neutral: The default category. Assigned when there is no explicit indication of either high or low manner, but also in the rare cases when neutral manner is explicitly indicated, using clue words such as normal or medium, etc.

Source

This dimension denotes the source or origin of the knowledge being expressed by the event. Specifically, we distinguish between events that can be attributed to the current study, and those that are attributed to other studies. Information about knowledge source has been demonstrated to be important through its annotation in both the Gene Ontology [18] and the corpora presented in [38] and [43]. This dimension can help in distinguishing new experimental knowledge from previously reported knowledge. Two possible values are distinguished, as follows:

• Other: The event is attributed to a previous study. In this case, explicit clues are normally present, in the form of either clue words such as previously, recent studies, etc., or citations.

• Current: The event makes an assertion that can be attributed to the current study. This is the default category, and is assigned in the absence of explicit lexical or contextual clues, although explicit clues such as the present study may be encountered.
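The dimensions and their default values can be summarised as a small record type. The sketch below is one possible representation, not a format defined by the scheme: the class and field names are hypothetical, and the six KT values are those that appear in the inference and frequency tables of this paper. KT is left without a default here because the text above does not state one.

```python
from dataclasses import dataclass
from enum import Enum


class KnowledgeType(Enum):
    OBSERVATION = "Observation"
    ANALYSIS = "Analysis"
    FACT = "Fact"
    INVESTIGATION = "Investigation"
    METHOD = "Method"
    OTHER = "Other"


class CertaintyLevel(Enum):
    L3 = 3  # default: no explicit uncertainty or frequency marker
    L2 = 2
    L1 = 1


class Polarity(Enum):
    POSITIVE = "Positive"  # default
    NEGATIVE = "Negative"


class Manner(Enum):
    NEUTRAL = "Neutral"  # default
    HIGH = "High"
    LOW = "Low"


class Source(Enum):
    CURRENT = "Current"  # default
    OTHER = "Other"


@dataclass(frozen=True)
class MetaKnowledge:
    """Meta-knowledge values attached to one event (hypothetical layout)."""
    kt: KnowledgeType
    cl: CertaintyLevel = CertaintyLevel.L3
    polarity: Polarity = Polarity.POSITIVE
    manner: Manner = Manner.NEUTRAL
    source: Source = Source.CURRENT
```

For example, `MetaKnowledge(kt=KnowledgeType.ANALYSIS, cl=CertaintyLevel.L2)` describes an analysis event with slight speculation and default values for the remaining dimensions.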

Hyper-Dimensions

A defining feature of our annotation scheme is the fact that, in addition to the explicitly annotated dimensions, further information can be inferred by considering combinations of some of these dimensions. We refer to these additional types of information as the hyper-dimensions of our scheme, of which we have identified two.

• New Knowledge - The isolation of events describing new knowledge is, as we have described earlier, important for certain tasks undertaken by biologists. However, it is not possible to determine whether an event represents new knowledge by considering only a single annotation dimension. For example, events that have been assigned KT = Observation could correspond to new knowledge, but only if they represent observations from the current study, rather than observations cited from elsewhere. In a similar way, a KT = Analysis event drawn from experimental results in the current study could be treated as new knowledge, but generally only if it represents a straightforward interpretation of results, rather than something more speculative. Thus, we consider New Knowledge to be a hyper-dimension, whose value (either Yes or No) can be inferred by considering a combination of the values assigned to the KT, Source and CL dimensions. Table 1 is an inference table that can be used to obtain the appropriate value for New Knowledge, based on the values assigned to the three dimensions mentioned above.

Table 1.

Inference table for New Knowledge hyper-dimension

Source (Annotated)	KT (Annotated)	CL (Annotated)	New Knowledge (Inferred)
Other	X	X	No
X	X	L2	No
X	X	L1	No
Current	Observation	L3	Yes
Current	Analysis	L3	Yes
X	Fact	X	No
X	Method	X	No
X	Other	X	No
X	Investigation	X	No
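The inference in Table 1 can be expressed directly as a rule lookup. The sketch below encodes the rows of the table as tuples, with "X" acting as a wildcard as in the table itself. The names `NEW_KNOWLEDGE_RULES` and `infer_new_knowledge` are hypothetical, and the use of plain strings for dimension values is an assumption made to keep the example short.

```python
# Rows of Table 1: (Source, KT, CL, New Knowledge). "X" matches any value.
NEW_KNOWLEDGE_RULES = [
    ("Other", "X", "X", False),
    ("X", "X", "L2", False),
    ("X", "X", "L1", False),
    ("Current", "Observation", "L3", True),
    ("Current", "Analysis", "L3", True),
    ("X", "Fact", "X", False),
    ("X", "Method", "X", False),
    ("X", "Other", "X", False),
    ("X", "Investigation", "X", False),
]


def infer_new_knowledge(source, kt, cl):
    """Return True if the annotated values imply New Knowledge = Yes."""
    for rule_source, rule_kt, rule_cl, result in NEW_KNOWLEDGE_RULES:
        if (rule_source in ("X", source)
                and rule_kt in ("X", kt)
                and rule_cl in ("X", cl)):
            return result
    # No row of the table applies, e.g. because of an unknown KT value.
    raise ValueError(f"No rule matches: {source}, {kt}, {cl}")
```

Since no two rows of the table give conflicting answers for the same combination of values, the order in which the rules are checked does not affect the result. In effect, an event counts as new knowledge only when Source = Current, CL = L3 and KT is either Observation or Analysis.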

KT Category	Freq	% of total events
Observation	12821	34.7%
Other	11537	31.3%
Analysis	6578	17.8%
Fact	2998	8.1%
Investigation	1948	5.3%
Method	976	2.6%

Analysis Clue	Freq	Investigation Clue	Freq	Observation Clue	Freq
suggest	408	examined	207	found	361
show	353	investigated	205	observed	226
demonstrate	335	analyzed	119	detected	141
demonstrated	332	studied	94	detectable	48
showed	246	to determine	50	seen	32
shown	244	tested	39	noted	17
may	242	measured	25	find	11
can	232	monitored	25	detect	11
associated	215	to investigate	23	findings	11
indicate	211	to examine	21	observations	9
revealed	196	to study	21	finding	9
suggesting	140	analysis	20	show	6
report	114	studies	20	report	6
identified	112	to identify	16	exhibit	5
thus	108	investigate	15

L2 Clue	Freq	L1 Clue	Freq
can	407	may	516
suggest	285	might	75
indicate	150	could	55
suggesting	112	possible	32
ability	108	potential	23
indicated	99	possibility	10
appears	88	possibly	10
able	86	potentially	10
indicating	72	perhaps	5
likely	52	propose	4

KT Category	Negated events (% within category)
Observation	1364 (10.6%)
Analysis	577 (8.7%)
Fact	105 (3.5%)
Other	187 (1.6%)
Method	10 (1.0%)
Investigation	20 (1.0%)

Negation Clue	Freq
not	1141
no	199
independent	113
without	65
failed	47
nor	47
absence	42
neither	38
unaffected	28
lack	23
un	23
unable	19
independently	18
resistant	15
fails	13

KT Category	Events with High or Low Manner annotated (% within category)
Observation	1141 (8.9%)
Analysis	276 (4.2%)
Fact	120 (4.0%)
Other	171 (1.5%)
Investigation	5 (0.2%)
Method	2 (0.2%)

High Manner Clue	Freq	Low Manner Clue	Freq
significantly	140	little	22
potent	84	low	15
markedly	81	little or no	13
rapidly	73	low levels	11
strongly	72	weak	11
rapid	65	limited	10
significant	39	low level	9
completely	36	weakly	9
strong	30	minimal	8
high	28	only a partial	8
high levels	28	no significant	8
overexpression	26	partially	8
highly	23	barely	7
marked	23	to a lesser extent	6
dramatically	22	not significant	6

Clue (Source = Other)	Freq
previously	118
has been	89
recently	67
have been	39
previous studies	24
recent studies	17
recent	15
previous	14
our previous studies	10
earlier	6

Hyper-dimension	Category	Freq	% of total events
New Knowledge	Yes	15985	43.4%
New Knowledge	No	20873	56.6%
Hypothesis	Yes	4924	13.4%
Hypothesis	No	31934	86.6%

CL	Freq	% of total events
L3 (default)	33876	91.9%
L2	2216	6.0%
L1	766	2.1%

Polarity	Freq	% of total events
Positive (default)	34595	93.9%
Negative	2263	6.1%

Manner	Freq	% of total events
Neutral (default)	35143	95.3%
High	1392	3.8%
Low	323	0.8%

Dimension	Kappa value
Polarity	0.929
Source	0.878
CL	0.864
Manner	0.864
KT	0.843
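For readers unfamiliar with the statistic, the sketch below shows how a Kappa value of this kind can be computed from two annotators' labels for the same events. It implements Cohen's kappa for two annotators; whether this exact variant was used to obtain the values above is an assumption, and the function name `cohen_kappa` is hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of category labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Need two non-empty label sequences of equal length")
    n = len(labels_a)

    # Observed agreement: proportion of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters for dimensions such as Polarity and Manner, where the default value covers well over 90% of events and high raw agreement would be easy to reach.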
