AMIA Annual Symposium Proceedings. 2011 Oct 22;2011:1471–1480.

It’s about This and That: A Description of Anaphoric Expressions in Clinical Text

Yan Wang 1, Genevieve B Melton 1,2, Serguei Pakhomov 1,3
PMCID: PMC3243152  PMID: 22195211

Abstract

Although anaphoric expressions are very common in biomedical and clinical documents, little work has been done to systematically characterize their use in clinical text. Samples of ‘it’, ‘this’, and ‘that’ expressions occurring in inpatient clinical notes from four metropolitan hospitals were analyzed using a combination of semi-automated and manual annotation techniques. We developed a rule-based approach to filter potential non-referential expressions. A physician then manually annotated 1000 potential referential instances to determine referent status and the antecedent of each referring expression. A distributional analysis of the three referring expressions in the entire corpus of notes demonstrates a high prevalence of anaphora and large variance in the distribution of referential expressions across note types. Our results confirm that anaphoric expressions are common in clinical texts. Effective co-reference resolution with anaphoric expressions remains an important challenge in medical natural language processing research.

1. Introduction

The proliferation of electronic clinical texts with widespread electronic health record (EHR) system implementation brings an unparalleled opportunity to use clinical documents for secondary functions such as healthcare quality improvement, decision support, and clinical research through information extraction (IE) and medical natural language processing (NLP) techniques. Clinical texts contain a wealth of information about patients in computer-readable format, and medical NLP research has become an active area of study. Developing effective NLP techniques for the clinical domain requires special considerations, since unlike the biomedical literature, these documents are created in the process of clinical care and are constructed using one or a combination of methods, including transcription of dictated narrative, semi-structured templated text entry, or manual typing. As a result, in addition to highly domain-specific information, these texts contain abbreviations and acronyms, as well as a number of other informal language use features (e.g. ellipses, incomplete or ungrammatical sentences, misspellings). These features are more typical of transcribed spontaneous speech than formal written discourse such as biomedical literature. One of the features adding to the informal nature of clinical discourse is the use of anaphora.

In linguistics, anaphora is the phenomenon of one linguistic expression (typically a pronoun) referring to another linguistic expression in the same discourse to avoid repetition. For example, the operative note excerpt in (1) contains two instances of anaphora where both pronouns ‘it’ point to the same noun phrase ‘the ureter.’

  • (1) The peritoneum was not opened and the ureter_i was identified as it_i crossed the pelvic brim. It_i was markedly dilated and had a bluish appearance.

A referent is the object, idea, fact or event named by (referred to by) a referring expression (typically a noun phrase or a pronoun; however, other syntactic phrases and even grammatical functions such as verb tense can be referential too). In example (1), the referent is the actual body organ that is about to be ligated during the operative procedure. An antecedent is the linguistic expression to which the anaphor (the referring expression) points, thus forming the anaphora. In example (1), the noun phrase ‘the ureter’ is the antecedent. Finally, co-reference arises when two or more expressions refer to the same item (i.e., have the same referent), as is the case with the anaphor and its antecedent. In example (1), all three linguistic expressions (‘the ureter’ and two instances of the personal pronoun ‘it’) are said to co-refer, a relationship indicated in the example by the same index (i) on all three expressions (shown in subscript).
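The terminology above can be made concrete with a small data structure. The following is a minimal sketch, not part of the study's tooling: the `Mention` class, its field names, and the `antecedent_of` helper are hypothetical and introduced only for illustration. It encodes example (1) as three mentions sharing the index "i" and recovers the antecedent of the second pronoun.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Mention:
    """A linguistic expression in the discourse (hypothetical structure)."""
    text: str         # surface form of the expression
    sentence: int     # 0-based index of the sentence containing it
    index: str        # co-reference index; mentions sharing it co-refer
    is_anaphor: bool  # True for the referring pronoun, False for the antecedent


# Example (1): 'the ureter' and two instances of 'it' share the index "i".
mentions = [
    Mention("the ureter", 0, "i", False),
    Mention("it", 0, "i", True),
    Mention("It", 1, "i", True),
]


def antecedent_of(anaphor: Mention, candidates: List[Mention]) -> Optional[Mention]:
    """Return the non-anaphoric mention with the same index that lies in the
    same or an earlier sentence, preferring the latest such sentence."""
    matching = [
        m for m in candidates
        if m.index == anaphor.index
        and not m.is_anaphor
        and m.sentence <= anaphor.sentence
    ]
    return max(matching, key=lambda m: m.sentence, default=None)


print(antecedent_of(mentions[2], mentions).text)
```

In this sketch the co-reference chain is simply the set of mentions with a shared index, while anaphora resolution is the narrower task of mapping each pronoun back to its antecedent.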

While the terms co-reference resolution and anaphora resolution are often used interchangeably, anaphora resolution can be viewed as a special case of co-reference resolution and constitutes the process of determining the antecedent of anaphoric expressions. Automated anaphora resolution is an active area of NLP research and is an important task for effective NLP systems, particularly those used to support IE [1–3]. Pronouns such as ‘it’, ‘this’ and ‘that’ are linguistic forms that are underspecified for meaning: their meaning can only be established via the link to their antecedent. Thus, whatever information is predicated on these pronouns is attributable to the entity referred to by the antecedent. For example, the second sentence in example (1) conveys important additional information (‘dilated and bluish in appearance’) attributable to the entity referred to as ‘the ureter.’ Without correctly resolving the anaphora, this information may be mistakenly attributed to another entity (e.g., ‘the peritoneum’).

In computational linguistics, both heuristic rule-based [4–8] and machine-learning approaches [9–12] have been used to address automated anaphora resolution. Heuristic techniques for anaphora resolution are labor-intensive and require domain knowledge; in contrast, machine-learning techniques require large amounts of training data to be effective. Several theoretical frameworks have been used in the development of co-reference and anaphora resolution algorithms. The two most prominent are Centering Theory and the Givenness Hierarchy. Centering Theory attempts to capture how local discourse information is structured and encoded linguistically via referring expressions [13]. A basic principle of Centering Theory is that a single entity is in the focus of attention, and that subsequent utterances in the same discourse segment will preferentially keep that most salient entity in the center of attention, relying on pronominalization to encode this centering relationship. The Givenness Hierarchy takes a cognitive view of the problem and postulates that the choice of form for a referring expression (e.g., pronouns, definite or indefinite noun phrases) is driven by what the speaker perceives to be the memory and attention state of the referent in the mind of the reader/hearer [14]. According to this theory, entities in the current focus of attention are likely to be referred to with a personal pronoun such as ‘it’, while entities that have been recently activated by being mentioned in the immediately preceding discourse (previous sentence or utterance) are likely to be referred to with a demonstrative pronoun such as ‘this’ or a demonstrative noun phrase (i.e., ‘this + noun’). This theory also predicts that entities familiar to the hearer/reader by virtue of having been introduced recently, but not in the immediately preceding discourse, are likely to be referred to by a demonstrative noun phrase ‘that + noun’. These theories have inspired the development of a number of automated anaphora resolution systems [8, 15].

Applying any existing anaphora resolution system or developing new approaches for the clinical domain requires a better understanding of anaphoric expression usage in clinical texts. These insights will provide medical NLP researchers with a better ability to predict which approaches for co-reference resolution might be most useful for the clinical domain. The goal of our study was to systematically characterize anaphoric expressions containing ‘it’, ‘this’, and ‘that’ in a large corpus of clinical text with a combination of automated and manual techniques.

2. Background

2.1. Classification of anaphoric expressions

A number of classifications of anaphora and referring expressions have been developed [16–21]. Most classifications that attempt to account for pronouns and demonstratives separate these expressions into ‘referential’ and ‘non-referential’ categories. The referential uses of these expressions are further classified based on the type of referent into exophoric [16] (i.e., non-linguistically introduced referent [20]) or endophoric [16] (i.e., linguistically introduced referent [20]). Endophoric uses of referring expressions are those that have an antecedent in the preceding discourse and are thus of particular interest in the current study. These expressions are classified into those with an antecedent that refers to a concrete referent and those with an abstract referent [20] introduced with a proposition, a sentence or several sentences (a.k.a. ‘discourse deictic’ [16], ‘associative anaphoric’ [17], ‘inferable anaphoric’ [18], ‘bridging anaphor’ [19, 21]). Endophoric anaphora with a concrete referent are further classified into those where the anaphor has the same head as the antecedent versus those with different heads [21]. Prior work also points out that the reliability of manual classification of referring expressions is sensitive to the number and complexity of the categories.

Thus, for the purposes of this study, we conceptualized and classified these expressions according to a coarser and simplified scheme as follows: 1) referential vs. non-referential and 2) in the case of referential expressions, by the form of the antecedent as shown in Table 1. A noun phrase antecedent is a linguistic expression consisting of a noun phrase explicitly mentioned in the discourse. A proposition antecedent is an idea, fact or event rather than a specific object and may be expressed as a verb phrase, prepositional phrase, a sentence or a set of sentences in the preceding discourse. An inferable is an anaphoric expression that refers to something that is not explicitly stated but instead can be inferred. The difference between an inferable and an exophoric expression is that an inferable has a linguistically introduced antecedent while an exophoric expression does not. Table 1 includes some examples in each category.

Table 1.

Anaphoric expression categories

Category: Examples

Anaphoric expression

Referential:
  • Hydrodissection of the lens nucleus was performed and this was phacoemulsified using a modified stop-and-chop technique.
  • This bifascicular block was present on previous EKG from Apple Valley Medical Center. There is no date on that EKG, however.

Non-referential:
  • He has had many psychiatric admissions, most recently at XXX hospital earlier this month.
  • In fact, she had angioplasty of that artery so that cardiac catheterization could be performed.

Antecedent form (referential expressions only)

Noun phrase:
  • We then identified the anterior medial portion of the sternocleidomastoid muscle and followed this border.
  • She does have cerebrovascular disease with a history of stroke and some memory loss from that.
  • Papaverine had to be used to identify the testicular artery. Doppler was also used to identify it.

Proposition:
  • Finally, the patient will likely eventually need her gallbladder removed. Conceivably this could be considered in 2–3 weeks.
  • She started using marijuana. She said by 16 it was a major problem.
  • The drain was placed across the pectoral muscles into the axilla. Following that, we closed the skin flaps over the drain.

Inferable:
  • The implant began with anastomosis of donor and recipient left atria, beginning at the level of the atrial appendage using running 3-0 Prolene. This incision was run around to the interatrial septum where it was completed.
  • The venous cannula was removed, and that tie oversewed with 4-0 Prolene.
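The two-level scheme in Table 1 can be expressed as a small annotation record. This is a minimal sketch under stated assumptions: the `AntecedentForm` and `Annotation` names are hypothetical and do not correspond to the annotation tool's actual schema. The validation rule encodes only what the scheme states, namely that antecedent form applies to referential expressions only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AntecedentForm(Enum):
    """Second-level categories from Table 1 (referential expressions only)."""
    NOUN_PHRASE = "noun phrase"
    PROPOSITION = "proposition"
    INFERABLE = "inferable"


@dataclass
class Annotation:
    """One annotated instance (hypothetical record layout)."""
    expression: str                                   # e.g. 'this incision'
    referential: bool                                 # first-level category
    antecedent_form: Optional[AntecedentForm] = None  # second-level category

    def __post_init__(self) -> None:
        # Referential instances need an antecedent form; others must not have one.
        if self.referential and self.antecedent_form is None:
            raise ValueError("referential instance requires an antecedent form")
        if not self.referential and self.antecedent_form is not None:
            raise ValueError("non-referential instance cannot have an antecedent form")


examples = [
    Annotation("this incision", True, AntecedentForm.INFERABLE),
    Annotation("this month", False),
]
```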

2.2. Anaphoric expression usage in biomedical and clinical texts

Because of the lack of adequate training corpora in clinical and biomedical texts, most early approaches to anaphora resolution in biomedical text were heuristic-based. As biomedical corpora have become increasingly available, more research has utilized machine-learning approaches. Previous studies on biomedical text have demonstrated that, unlike in common English, the most common anaphoric expressions in biomedical texts are sortal anaphora (most commonly ‘this’ or ‘these’), such as ‘this enzyme’, where the expression contains type or sortal information within the phrase itself [22, 23]. Moreover, the use of ‘it’ is uncommon in biomedical texts (Torii et al. found only 7 instances in 50 abstracts). In contrast to biomedicine, the pronoun ‘it’ is by far the most frequently used pronoun in some common English texts, such as the British National Corpus (BNC) [24].

While investigators have looked at consumer health disease-summary documents [25], drug interactions [26], pathology reports, and discharge summaries [27], anaphoric expression usage in clinical text remains largely uncharacterized. Divita et al. [25] manually examined a corpus of disease summary documents (National Library of Medicine consumer health site) processed with MetaMap Transfer (MMTx) and found that MMTx errors were largely due to missing inferential and domain knowledge and concluded that effective co-reference resolution would be a central item for improving performance. When looking at drug interaction text, Segura-Bedmar and colleagues[26] found that pronominal anaphora were common, particularly personal pronouns (it, they) and relative pronouns (which, that). Demonstrative pronouns (this, these, those) were also often seen in reference to the drug or drug property/effect.

Hahn and Romacker [15] examined German language pathology reports for use of pronominal anaphors (it/they) and nominal anaphors (noun or noun phrase referent), as well as textual ellipses (also called functional anaphora, requiring implicit information to interpret the text). In addition to common use of both pronominal and nominal anaphors, the investigators found textual ellipses to be highly prevalent and important for maintaining text coherence. Finally, He [27] focused on co-reference resolution in discharge summaries for all types of noun phrases/markables in texts with potential co-reference. In this case, anaphoric expressions were not the main focus and the related work with co-reference resolution in these texts did not characterize discharge summary anaphoric expressions. In a preliminary study of the distributions of different anaphoric expressions, including personal pronouns (it, they), relative pronouns (which, that), and demonstrative pronouns (this, these, those), we noticed that ‘it’, ‘this’ and ‘that’ were the most frequently occurring anaphoric expressions apart from personal pronouns such as ‘he’, ‘she’, ‘his’, and ‘her’, which most often refer to the provider or the patient. For this reason, the focus of this manuscript was to analyze the use of ‘it’, ‘this’ and ‘that’ within clinical text.

3. Methods

3.1. Study setting

Our overall approach was to 1) develop and apply a linguistic rule-based filter to identify potential non-referential expressions using previously described techniques; 2) perform expert annotation (reference standard) on 1000 instances each of ‘it’, ‘this’, and ‘that’, and annotate a random sample of 500 potential non-referential expressions (removed with the rule-based filter) to confirm the performance of our rule-based system; 3) summarize the use of anaphoric expressions in our manually annotated corpus as well as in a semi-automatically annotated corpus; and 4) provide overall statistics of anaphoric usage from our automated system (Figure 1). We chose ‘this’, ‘that’, and ‘it’ as opposed to definite articles (‘the’), pronominal anaphora in the first, second, or third person with the exception of ‘it’ (i.e., ‘I’, ‘me’, ‘you’, ‘your’, ‘who’, ‘they’, ‘he’, or ‘she’) and possessive anaphora (such as ‘their’, ‘his’, ‘her’, ‘its’, ‘our’, ‘my’), which less commonly refer to a named entity of interest (such as a disease, body part, procedure, medication, etc.) and more commonly refer to the speaker or writer (clinician) or patient.

Figure 1.

Experiment Schematic for Anaphora Analysis in Clinical Notes

3.2. Corpus and document pre-processing

The main corpus used for this study consisted of data from approximately 875,000 inpatient notes from University of Minnesota-affiliated Fairview Health Services, with data from 4 metropolitan hospitals in the Twin Cities that include both community and tertiary-referral settings. The inpatient notes included admission notes, discharge notes, consultation notes and operation notes. Since there are no widely available annotated corpora dedicated to the anaphoric usage with these clinical notes, we first created annotated corpora for anaphoric expression usage.

When creating the annotated corpora, 3000 instances of ‘it’ and ‘this’ and 6000 instances of ‘that’ were randomly selected from the data repository. All notes with these samples were subsequently processed by a sentence boundary detector trained on the GENIA Corpus and a probabilistic natural language parser. Pre-processing also included tokenization, part-of-speech (POS) tagging, and parsing with the OpenNLP parser. The overall corpus of 875,000 inpatient notes was also parsed with the OpenNLP parser, which in our experiments was fast and fairly accurate, to generate overall corpus statistics. At this preliminary stage in our investigation, we wanted to get a general understanding of the distribution of anaphora in clinical text and therefore did not perform any formal evaluation of the sentence boundary detector or the parsers used to identify syntactic structure. Noun phrases obtained from the parser were regarded as markables, from which we selected a set of 1000 potential referential instances each for ‘it’, ‘this’ and ‘that’. After applying our rule-based filter, we compiled three enriched development corpora, each consisting of 1000 anaphoric expression instances. We analyzed the performance of the rule-based filter; as shown in section 4.1, the filter achieved very high accuracy for filtering out non-referential instances. The referential instances remaining from the 3000 (6000 for ‘that’) randomly pre-selected instances after applying the filter are representative of the entire corpus. The development corpora were used to conduct this preliminary study examining anaphoric usage of ‘it’, ‘this’ and ‘that’. University of Minnesota institutional review board approval was obtained, and informed consent was waived for this minimal risk study. The corpora with 1000 anaphoric expression instances for ‘it’, ‘this’ and ‘that’ were manually annotated by a physician with GATE (General Architecture for Text Engineering) [28] as described in section 3.4. In addition, we compared clinical text with the Brown Corpus [29].

3.3. Rule-based filter

3.3.1. Exclusion of possibly non-referential ‘this’

In examining the development corpus, we observed a large number of frequently occurring noun phrases such as ‘this patient’, ‘this consultation’, ‘this diagnosis’, ‘this visit’, or ‘this morning’ within clinical notes that do not have a specific linguistically introduced antecedent. Rather, the target of these exophoric references is extralinguistic but is part of the overall discourse. Since our analysis focused on endophoric reference, we grouped these expressions together with non-referential uses of pronouns (e.g., ‘It’s raining today.’). From a separate, randomly selected set of notes, we collected a list of frequently occurring noun phrases that start with the word ‘this’. We reviewed each of these noun phrases and generated a list of 44 headwords from these phrases (see Table 2). We then developed a rule-based filter and applied it to the noun phrase chunks that started with ‘this’ to eliminate possibly non-referential uses of ‘this’. We would like to emphasize that while the head nouns in Table 2 are likely to signal exophoric reference, this is clearly not always the case. However, manual examination of a random sample of these expressions confirmed that the majority are indeed exophoric.

Table 2.

Head nouns for possibly non-referential uses of ‘this’

admission episode infection problem task
afternoon evaluation information procedure test
appointment evening interview reason time
assessment event moment regard visit
consultation examiner morning sort week
diagnosis fall murmur spring winter
dictation family patient stage writer
EKG gentleman point study year
encephalopathy hospitalization pregnancy summer
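A filter of this kind can be sketched as a simple head-noun lookup. The following is a minimal illustration rather than the implementation used in the study: the function name is hypothetical, and the head noun is assumed to be the last token of the noun phrase chunk, which is a simplification of what a parser would provide. The word list is taken from Table 2.

```python
# The 44 head nouns from Table 2, lowercased for matching.
HEAD_NOUNS = {
    "admission", "afternoon", "appointment", "assessment", "consultation",
    "diagnosis", "dictation", "ekg", "encephalopathy", "episode",
    "evaluation", "evening", "event", "examiner", "fall", "family",
    "gentleman", "hospitalization", "infection", "information", "interview",
    "moment", "morning", "murmur", "patient", "point", "pregnancy",
    "problem", "procedure", "reason", "regard", "sort", "spring", "stage",
    "study", "summer", "task", "test", "time", "visit", "week", "winter",
    "writer", "year",
}


def is_possibly_non_referential_this(noun_phrase: str) -> bool:
    """Flag 'this + ... + head noun' chunks whose head noun is in Table 2.

    Assumption: the head noun is the final token of the chunk.
    """
    tokens = noun_phrase.lower().split()
    if len(tokens) < 2 or tokens[0] != "this":
        return False  # bare pronoun 'this' or a chunk not starting with 'this'
    return tokens[-1] in HEAD_NOUNS


for chunk in ["this morning", "this patient", "this incision",
              "this retroperitoneal mass"]:
    print(chunk, "->", is_possibly_non_referential_this(chunk))
```

As noted above, a match only marks the expression as possibly non-referential; some phrases with these head nouns will still have a linguistic antecedent.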

3.3.2. Exclusion of possibly non-referential ‘that’

In general English, ‘that’ can be used as a pronoun, a determiner in a demonstrative noun phrase, or a conjunction. From a set of 691,216 sentences containing the word ‘that’, fully parsed with the OpenNLP parser [30], we examined the syntactic trees and analyzed the Penn Treebank clause tags [31] and POS tags assigned to the word ‘that’. The most frequently used POS tags included preposition or subordinating conjunction (IN) 58.7%, determiner (DT) 16.8%, wh-determiner (WDT) 12.6% and other tags (MD, NN, etc.) 9.4%. From these, we identified several patterns of potential non-referential expressions, the most common of which are exemplified in Table 3 with the distribution of their Penn Treebank clause tag patterns. In this study, we did not use the DT POS tag as a rule for filtering out non-referential ‘that’ because the OpenNLP parser labeled 83.2% of all referential instances of ‘that’ as DT.

Table 3.

Syntactic tree tags for non-referential ‘that’

Tag pattern: (IN that) (S
Percent: 51.8%
Examples:
  1. It should be noted that he didn’t have any short of breath.

  2. Records from XXX Hospital suggest that he was treated with nine bilateral ECT treatments.

Tag pattern: (WDT that)
Percent: 12.6%
Examples:
  1. She was found then to have adhesion that was lysed.

  2. He had an ultrasound of the gallbladder that showed …

Tag pattern: (VB[A-Z]* \w+) (PP (IN that) (NP
Percent: 0.6%
Examples:
  1. The patient stated that 2 days ago she was assaulted in the street.

  2. Given that the patient lives about 80 miles from …

Based on the patterns and statistics in Table 3, the fully parsed syntactic tree tag patterns were used to capture non-referential ‘that’. These patterns are easily implemented as regular expressions and are capable of detecting a large portion of non-referential ‘that’. One potential limitation of this approach is that the performance of the filter depends largely on parsing accuracy.
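To illustrate how the Table 3 tag patterns translate into regular expressions, the sketch below matches them against bracketed parse strings. It is an approximation under stated assumptions: the function name is hypothetical, the bracketed strings are hand-written examples in Penn Treebank style rather than actual parser output, and single-space separation between constituents is assumed.

```python
import re

# Regular expressions corresponding to the three tag patterns in Table 3.
NON_REFERENTIAL_THAT = [
    re.compile(r"\(IN [Tt]hat\) \(S"),
    re.compile(r"\(WDT [Tt]hat\)"),
    re.compile(r"\(VB[A-Z]* \w+\) \(PP \(IN [Tt]hat\) \(NP"),
]


def is_non_referential_that(parse: str) -> bool:
    """Return True if any Table 3 pattern occurs in the bracketed parse."""
    return any(pattern.search(parse) for pattern in NON_REFERENTIAL_THAT)


# Hand-written, simplified parses (assumed format, not real parser output).
complement = ("(S (NP (NNS Records)) (VP (VBP suggest) (SBAR (IN that) "
              "(S (NP (PRP he)) (VP (VBD was) (VP (VBN treated)))))))")
relative = ("(NP (NP (NN adhesion)) (SBAR (WHNP (WDT that)) "
            "(S (VP (VBD was) (VP (VBN lysed))))))")
demonstrative = "(PP (IN of) (NP (DT that) (NN artery)))"

print(is_non_referential_that(complement))     # complementizer 'that'
print(is_non_referential_that(relative))       # relative pronoun 'that'
print(is_non_referential_that(demonstrative))  # demonstrative 'that'
```

Because the match is purely structural, any parsing error that mislabels ‘that’ propagates directly into the filter, which is the limitation noted above.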

3.3.3. Exclusion of pleonastic-‘it’

Pleonastic-‘it’ expressions (a.k.a. “dummy ‘it’”) are idiomatic and non-referential and thus were excluded using rules similar to previous work [32] (Table 4). We extended these rules to detect extrapositional ‘it’ expressions, a subset of pleonastic ‘it’ expressions not addressed by the rules of Lin et al., such as the sentence: “It is my pleasure to see her in my office” (see last rule in Table 4).

Table 4.

Example statements with pleonastic-‘it’

Rule: It be [Adj|Adv| verb]* that
Example: It has been shown that plavix after drug-elluding stents is preventative of acute coronary thrombosis.

Rule: It be Adj [for NP] to VP
Example: However, it is possible for acute renal failure to be followed by renal insufficiency.

Rule: It [seems|appears|means|follows] [that]*
Example: It seems that she is willing to forgo chemotherapy despite the lower survival associated with this choice.

Rule: NP [makes|finds|take] it [Adj]* [for NP]* [to VP|Ving]
Example: Furthermore, the patient’s angina makes it possible to perform a pre-operative stress test now.

Rule: It [MD*|RB*|VB*|VP*] be NP + Clause (NOTE: added rule)
Example: But it has certainly been a pleasure to be involved...
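The first three rules in Table 4 can be approximated with surface-level regular expressions. This is only a rough sketch: the rules in the table are stated over POS and phrase categories, whereas the patterns below stand in for those categories with word wildcards and a small assumed list of forms of ‘be’. The names `BE`, `PLEONASTIC_IT`, and `is_pleonastic_it` are hypothetical.

```python
import re

# Assumed surface forms standing in for the category 'be' in Table 4.
BE = r"(?:is|was|be|been|has been|had been|will be|would be)"

PLEONASTIC_IT = [
    # It be [Adj|Adv|verb]* that  (up to three intervening words assumed)
    re.compile(r"\bit\s+" + BE + r"\s+(?:\w+\s+){0,3}?that\b", re.IGNORECASE),
    # It be Adj [for NP] to VP  (NP approximated as one to four words)
    re.compile(r"\bit\s+" + BE + r"\s+\w+\s+(?:for\s+(?:\w+\s+){1,4}?)?to\s+\w+",
               re.IGNORECASE),
    # It [seems|appears|means|follows] [that]*
    re.compile(r"\bit\s+(?:seems|appears|means|follows)\b", re.IGNORECASE),
]


def is_pleonastic_it(sentence: str) -> bool:
    """Return True if any of the approximated rules matches the sentence."""
    return any(rule.search(sentence) for rule in PLEONASTIC_IT)


sentences = [
    "It has been shown that plavix after stents is preventative.",
    "However, it is possible for acute renal failure to be followed by renal insufficiency.",
    "It seems that she is willing to forgo chemotherapy.",
    "It was markedly dilated and had a bluish appearance.",
]
for s in sentences:
    print(is_pleonastic_it(s), "-", s)
```

A lexical approximation like this will over- and under-match relative to rules applied over parsed text, so it is best read as an illustration of the rule shapes rather than a usable filter.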

3.4. Reference standard

Each enriched set of 1000 anaphoric expressions for ‘it’, ‘this’ and ‘that’ was classified by an expert physician annotator in the manner described in section 2.1. First, each potential anaphoric word was labeled as referential or non-referential based on whether the word, or the noun phrase containing the target word, referred to an object or idea, which may or may not be explicitly evoked in the context around the anaphoric expression. At this stage, the entire anaphoric expression was also annotated (e.g., ‘this incision’ or ‘this retroperitoneal mass’). Each referential anaphoric expression was further classified into those with noun phrase antecedents, proposition antecedents, or inferables. The physician reviewed the context around the referring expression and, if an antecedent noun phrase or set of noun phrases was found associated with the referring expression, labeled it as a noun phrase antecedent type. For proposition referents, which refer to an idea or an event rather than a specific object, the entire sentence (or set of sentences), prepositional phrase, or verb phrase was labeled. When the anaphoric expression was inferable and referred to something inferentially related to an evoked entity, the annotator entered the inferred concept as a manual text annotation since it was not explicitly stated in the text.

In addition to the 1000 annotated samples, a random sample of 500 filtered potential non-referential expressions for ‘it’, ‘this’ and ‘that’ was annotated separately to determine the accuracy of our rule-based filtering techniques described in section 3.3.

3.5. Analysis

We verified the performance of our non-referential filtering system on 500 samples as described in section 3.4. The proportions of referential and non-referential instances and of antecedent types were then determined on the enriched annotated sets for ‘this’, ‘that’ and ‘it’. For each type of referent, the distance between the anaphoric expression and its antecedent was calculated. We also examined the entire clinical document repository and determined the distribution of potential referential and non-referential expressions using our filtering system for different note types. We also labeled each anaphor instance as a pronoun or a noun phrase. We normalized the distribution of each pronoun over 1,000 sentences.
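The two summary measures described here, sentence distance and frequency per 1,000 sentences, can be sketched as follows. The function names and the input numbers are hypothetical and chosen only to show the arithmetic; the sketch also assumes that distance is counted in sentences, with 0 meaning the antecedent is in the same sentence as the anaphor.

```python
from typing import List, Tuple


def mean_sentence_distance(pairs: List[Tuple[int, int]]) -> float:
    """Average number of sentences between each anaphor and its antecedent.

    Each pair is (anaphor_sentence_index, antecedent_sentence_index).
    """
    distances = [anaphor - antecedent for anaphor, antecedent in pairs]
    return sum(distances) / len(distances)


def per_thousand_sentences(count: int, total_sentences: int) -> float:
    """Normalize a raw occurrence count to a rate per 1,000 sentences."""
    return 1000 * count / total_sentences


# Hypothetical inputs, not data from the study.
pairs = [(1, 0), (0, 0), (3, 1)]
print(mean_sentence_distance(pairs))     # (1 + 0 + 2) / 3
print(per_thousand_sentences(50, 2000))  # 50 occurrences in 2,000 sentences
```

Normalizing per 1,000 sentences makes note types with very different corpus sizes directly comparable.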

4. Results

4.1. Rule-based system for filtering potential non-referential cases

The referential categories classified by our rule-based filter were compared with the manual annotations across the three corpora for ‘it’, ‘this’ and ‘that’. Three sets of 500 instances that were determined to be potentially non-referential by the rule-based filter were reviewed and used to evaluate the performance. We processed the instances with two well-known parsers, Stanford parser[33] and OpenNLP parser. Statistics were collected based on both parsers.

4.1.1. Results based on the Stanford parser

For ‘this’, nearly all of the 500 instances that were classified as non-referential by the automated system were non-referential. The automated system also classified each instance as either a noun phrase (e.g., ‘this area’, ‘this incision’) or a pronoun. Comparing the classification results with the annotations, 98.3% of noun phrase instances of ‘this’ and 97.3% of pronoun instances of ‘this’ were correctly classified. The classification errors were mostly caused by erroneous parsing results, a known limitation of the rule-based system. For ‘that’, 95.2% of the 500 instances classified as non-referential were correct. All noun phrases were correctly classified and 95.4% of pronoun instances were correctly classified. For ‘it’, 88% of the 500 instances classified as non-referential were correctly classified. This performance was lower than for ‘this’ and ‘that’, owing to errors with complicated syntactic tree patterns for non-referential ‘it’.

4.1.2. Results based on the OpenNLP parser

Similar to the results obtained with the Stanford parser, all of the 500 instances of ‘this’ that were classified as non-referential by the automated system were non-referential. Also, 99% of noun phrase instances of ‘this’ and 92.9% of pronoun instances of ‘this’ were correctly classified. For ‘that’, 94.8% of the 500 instances that were classified as non-referential were correct; 91.2% of noun phrases and 93.6% of pronoun instances were correctly classified. For ‘it’, 89.8% of the 500 instances classified as non-referential were correctly classified. The results based on the two parsers were very close. Overall, the Stanford parser provides more accurate parsing results than the OpenNLP parser. One limitation of the Stanford parser, particularly for any larger corpus, is that its processing time is about twice as long as that of the OpenNLP parser.

4.2. Annotation statistics

Table 5 summarizes the annotation statistics for each of the 1,000 annotated samples for ‘it’, ‘this’ and ‘that’. As expected, this was an enriched corpus of referential instances, although up to 20% were non-referential. There were several instances, particularly with ‘that’, where the referent was inferable (and not directly mentioned in the text). Distances between the antecedent and the anaphoric expression are summarized, demonstrating a shorter distance with ‘it’ compared to ‘this’ and ‘that’. This is consistent with predictions of Gundel’s Givenness Hierarchy theory, where personal pronouns such as ‘it’ usually refer to entities that are in the focus of immediate attention while demonstrative pronouns such as ‘this’ and ‘that’ more often refer to ‘activated’ entities, as shown in the example below. As such, entities in the focus of attention or in short-term memory tend to be in the same sentence as the anaphoric expression.

Table 5.

Annotation statistics for annotated instances

This That It

Antecedent form, 1000 in total

Non-referential 140 202 123
Referential 860 798 877

  Noun Phrase 614 577 822
  Proposition 241 203 46
  Inferable 5 18 9

Antecedent-Anaphor distance by sentences

All 1.13 0.87 0.68

  Demonstrative NP 1.87 1.19 N/A
  Pronoun 1.05 0.61
  • (2) “Using three finger technique, I identified the left vas, brought it close to the midline, infiltrated the skin and the vas and cord with about 2 cc of 1% lidocaine.”

Compared with ‘it’, many times ‘this’ refers to an idea or an action (propositional referent) like the example below.

  • (3) “According to records, Ms. Morrissey was noted to be agitated and confused on 10/23/2006. This was felt possibly related to her pain medications, which were decreased.”

According to the Givenness Hierarchy, “each of the cognitive statuses in the Givenness Hierarchy entails all lower statuses. A particular form can often be replaced by forms which require a lower status.” [14] As illustrated in the following example, each sentence is compliant with the Givenness Hierarchy.

  • (4) “These incredibly small magnetic bubbles are the vanguard of a new generation of ultradense memory-storage systems.”

    “These systems are extremely rugged: they are resistant to radiation and are nonvolatile.”

    “Those systems are extremely rugged: they are resistant to radiation and are nonvolatile.”

    “The systems are extremely rugged: they are resistant to radiation and are nonvolatile.”

    “New generation ultradense memory-storage systems are extremely rugged: they are resistant to radiation and are nonvolatile.” [14]

In this study, we observed a large number of demonstrative anaphoric expressions in which the pattern ‘that + noun’ occurred where the Givenness Hierarchy would predict ‘this + noun’. One possible reason for this unexpected finding is that the majority of clinical notes in this corpus were dictated after clinical care was provided. Therefore, care providers may tend to use the lower-status form ‘that’ to refer to something not present in the immediate environment and therefore probably no longer having the ‘activated’ status, despite its antecedent being found in the immediately preceding discourse.

4.3. Distribution of anaphors by note type

Table 6 summarizes the overall corpus statistics by clinical document type.

Table 6. Statistics by clinical note type. NP = noun phrase.

| | Admission | Discharge | Operation | Consultation | Brown |
|---|---|---|---|---|---|
| Overall corpus | | | | | |
| Number of notes | 196,735 | 305,153 | 362,311 | 11,302 | N/A |
| Total sentences | 15,058,39 | 18,696,16 | 20,114,473 | 1,397,860 | 57,339 |
| Average sentences per note | 76 (1–282) | 61 (6–278) | 55 (1–252) | 123 (13–245) | N/A |
| Anaphoric expressions (per 1,000 sentences) | | | | | |
| ‘This’ total | 40 | 36 | 42 | 75 | 87 |
| Non-referential (%) | 16 (40%) | 12 | 9 (21.4%) | 33 (44%) | 9 (10.3%) |
| Referential (%) | 24 | 24 | 33 (78.6%) | 42 (56.0%) | 78 (89.7%) |
| Demonstrative NP (%) | 6 (25.0%) | 6 (25.0%) | 10 (30.3%) | 15 (35.7%) | 50 (64.1%) |
| Pronoun (%) | 18 (75%) | 18 (75%) | 23 (69.7%) | 27 (64.3%) | 28 (35.9%) |
| ‘That’ total | 46 | 35 | 17 | 112 | 163 |
| Non-referential (%) | 34 | 25 | 12 (70.6%) | 90 (80.4%) | 138 (84.7%) |
| Referential (%) | 12 | 10 | 5 (29.4%) | 22 (19.6%) | 25 (15.3%) |
| Demonstrative NP (%) | 6 (50.0%) | 6 (60.0%) | 3 (60.0%) | 10 (45.5%) | 17 (68.0%) |
| Pronoun (%) | 6 (50%) | 4 (40%) | 2 (40%) | 12 (54.5%) | 8 (32.0%) |
| ‘It’ total | 16 | 14 | 19 | 34 | 140 |
| Non-referential (%) | 5 (31.2%) | 6 (42.9%) | 5 (26.3%) | 13 (38.2%) | 27 (19.3%) |
| Referential (%) | 11 | 8 (57.1%) | 14 (73.7%) | 21 (61.8%) | 113 (80.7%) |

The average number of sentences per note varied widely by note type. We applied our rule-based filter to the fully parsed output of all notes of each type. For each of ‘it’, ‘this’ and ‘that’, we collected the total number of occurrences and the number of non-referential occurrences in all four note types. Moreover, each referential ‘this’ and ‘that’ instance was classified as a noun phrase or a pronoun based on the parse. To compare the anaphoric usage of ‘it’, ‘this’ and ‘that’ in clinical text with that in general English text, we also applied the rule-based filter to the fully parsed Brown Corpus (Table 6). The comparison showed that the anaphoric expressions ‘it’, ‘this’ and ‘that’ had a high overall prevalence in clinical text, even though their incidence was lower than in general English text.
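The percentages in Table 6 follow directly from the per-1,000-sentence counts. The short Python sketch below reproduces two of them for ‘this’; the counts are copied from Table 6, while the names `THIS_COUNTS` and `referential_share` are illustrative assumptions rather than part of the original analysis.

```python
# Counts of 'this' per 1,000 sentences, copied from Table 6.
THIS_COUNTS = {
    "Operation": {"total": 42, "non_referential": 9},
    "Brown": {"total": 87, "non_referential": 9},
}


def referential_share(total, non_referential):
    """Return (referential count, referential percentage of the total)."""
    referential = total - non_referential
    return referential, round(100 * referential / total, 1)


for source, counts in THIS_COUNTS.items():
    count, percent = referential_share(**counts)
    print(f"{source}: {count} referential per 1,000 sentences ({percent}%)")
```

For operative notes this gives 33 referential instances (78.6%), and for the Brown Corpus 78 (89.7%), matching the corresponding cells of Table 6.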

5. Discussion

This study represents a systematic attempt to characterize anaphoric expressions in clinical text from a large corpus of electronic clinical documents. These findings are helpful in understanding issues particular to automated co-reference resolution in the clinical domain. While previous studies have demonstrated the predominance of sortal anaphora in biomedical text, we found that other anaphoric expressions are common and will need to be addressed by clinical NLP systems. Moreover, we observed that while consultation notes had the highest density of ‘it’, ‘this’ and ‘that’, a larger proportion of instances of these pronouns were non-referential, possibly due to the more informal nature of the discourse found in consultation notes.

We observed that operative notes consistently had the highest proportion of potential referential expressions (the proportion of instances that refer to an object, idea, etc.) among the four types of notes (78.6% for ‘this’, 29.4% for ‘that’, 73.7% for ‘it’). In contrast, consultation notes had lower rates of referential expressions overall. One possible reason is that, compared with consultation notes, operative notes include many more “bottom-level” semantic concepts or entities in the UMLS semantic network. Most operative notes contain a specific description of procedures, which includes anatomy, procedures, and actions performed on anatomy, as illustrated in (5).

  • (5) “We carefully removed these lymph nodes, one at a time until I could not feel any additional nodes. Included in these nodes were probably 5 or 6 good sized lymph nodes. One of them had a distinctly different look to it and I suspect is a lymph node with Hodgkin’s disease in it.”

This may lead to frequent use of anaphoric expressions and co-reference within operative notes to track discourse participants and to establish links between chunks of ongoing discourse.

Additionally, we observed that, compared with ‘it’ and ‘this’, the variance in referential rates of ‘that’ across document types was small, with 70–80% of ‘that’ instances representing non-referential expressions. As expected, a large portion of ‘that’ instances represented conjunction usage, which accounts for most of the non-referential ‘that’. This may also account for the flatness observed in the referential rate of ‘that’.
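To make the distinction between conjunction and demonstrative uses of ‘that’ concrete, the following Python sketch classifies a pre-tagged ‘that’ token. It is a minimal illustration and not the rule-based filter used in this study: it assumes Penn Treebank-style part-of-speech tags [31], in which complementizer ‘that’ is tagged IN, relative ‘that’ WDT, and demonstrative ‘that’ DT. Treating every non-DT use as non-referential is a simplifying assumption, and `classify_that` and `NOMINAL_TAGS` are hypothetical names.

```python
# Tags that can begin the nominal part of a 'that + noun' phrase
# (an assumed, deliberately small set).
NOMINAL_TAGS = {"NN", "NNS", "NNP", "NNPS", "JJ", "JJR", "JJS", "CD"}


def classify_that(tagged, i):
    """Classify the 'that' token at index i of a list of (word, tag) pairs."""
    word, tag = tagged[i]
    if word.lower() != "that":
        raise ValueError("token at index i is not 'that'")
    if tag != "DT":
        # IN (complementizer), WDT (relative), RB (adverb), etc.
        return "non-referential"
    next_tag = tagged[i + 1][1] if i + 1 < len(tagged) else None
    return "demonstrative NP" if next_tag in NOMINAL_TAGS else "pronoun"


conjunction = [("She", "PRP"), ("reports", "VBZ"), ("that", "IN"),
               ("pain", "NN"), ("improved", "VBD")]
noun_phrase = [("That", "DT"), ("mass", "NN"), ("was", "VBD"), ("excised", "VBN")]
pronoun = [("That", "DT"), ("was", "VBD"), ("discussed", "VBN")]

print(classify_that(conjunction, 2))
print(classify_that(noun_phrase, 0))
print(classify_that(pronoun, 0))
```

The three example sentences are invented for illustration. In practice the tags would come from a parser or tagger, and tagging errors on dictated clinical text would propagate into the classification.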

Another finding that is important for developing anaphora resolution systems for clinical discourse is that, while the average distance between the anaphor and its antecedent was within 1–2 clauses for all three expressions that we studied, some maximum distances fell far outside the expected range. Both Centering Theory and the Givenness Hierarchy would predict the antecedents for these anaphors to be located in the immediate discourse context; however, some of the antecedents for the ‘that’ anaphor were found up to 43 sentences away. Long-range anaphora are more typical of formal written discourse than of informal conversational discourse. In written discourse, the use of long-range anaphora is supported by the lower short-term memory demands required to keep an entity ‘on-line’ in order to make a reference to it. This finding is particularly interesting because it indicates that, even though a large number of clinical notes in our study were dictated and mostly represented spoken discourse, some of the anaphora represent very long-range dependencies, likely due to the professional and in many cases repetitive nature of clinical documentation.
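The gap between typical and maximum antecedent distance can be summarized per anaphor as sketched below in Python. The annotation records are hypothetical and exist only to illustrate the calculation; they are not data from this study. For simplicity the sketch measures distance in sentences throughout, whereas the averages reported above are in clauses.

```python
from statistics import mean

# Hypothetical annotations: (anaphor, anaphor sentence index, antecedent sentence index).
ANNOTATIONS = [
    ("it", 12, 12),
    ("this", 30, 29),
    ("that", 58, 15),
    ("that", 61, 60),
]


def distance_summary(annotations):
    """Return the mean and maximum sentence distance for each anaphor form."""
    by_form = {}
    for form, anaphor_sent, antecedent_sent in annotations:
        by_form.setdefault(form, []).append(anaphor_sent - antecedent_sent)
    return {
        form: {"mean": mean(distances), "max": max(distances)}
        for form, distances in by_form.items()
    }


summary = distance_summary(ANNOTATIONS)
for form, stats in summary.items():
    print(form, stats)
```

Reporting the maximum alongside the mean matters for system design: a resolver that searches only a fixed window of one or two sentences would miss the long-range cases entirely.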

While this study addressed the common anaphoric expressions ‘this’, ‘that’, and ‘it’, we did not characterize all anaphoric expressions. Instead, we concentrated on the anaphoric expressions thought to have the highest significance in clinical corpora. We recognize, however, that certain specialized clinical texts, such as family history, contain a large number of third-person pronominal anaphora requiring specific resolution (i.e., ‘he’ or ‘she’ might refer to a family member and not the patient) [34], and these remain uncharacterized in this study. We plan to continue this work by exploring feature selection using both heuristic and machine-learning approaches for anaphora resolution in clinical text, and by expanding this corpus to improve training data, with the ultimate goal of integrating these co-reference tools as components of medical NLP systems.

6. Acknowledgements

This research was supported by the National Library of Medicine (#R01 LM009623-01) (SP), American Surgical Association Foundation Fellowship (GM), and University of Minnesota Institute for Health Informatics Seed Grant (GM & SP). We would like to thank Fairview Health Services for ongoing support of this research and Andrew Riggle for assistance with annotation.

7. References

  • 1. Gaizauskas R, Humphreys K. Quantitative Evaluation of Coreference Algorithms in an Information Extraction System. Colloquium Lancaster University; 1996.
  • 2. Savova GK, Chapman WW, Zheng J. Annotation schema for anaphoric relations in the clinical domain. American Medical Informatics Association; San Francisco, CA: 2009.
  • 3. Savova GK, Chapman WW, Zheng J, Crowley RS. Anaphoric relations in the clinical narrative: corpus creation. Journal of the American Medical Informatics Association. 2011 Feb 20;18(4):7. doi: 10.1136/amiajnl-2011-000108.
  • 4. Hobbs JR. Resolving Pronoun References. Lingua. 1978;44(4):311–38.
  • 5. Lappin S, Leass HJ. An algorithm for pronominal anaphora resolution. Comput Linguist. 1994;20(4):535–61.
  • 6. Stuckardt R. Design and enhanced evaluation of a robust anaphor resolution algorithm. Comput Linguist. 2001;27(4):479–506.
  • 7. Kennedy C, Boguraev B. Anaphora for everyone: pronominal anaphora resolution without a parser. Proceedings of the 16th conference on Computational linguistics - Volume 1; Copenhagen, Denmark. 1996. pp. 113–8. 992651: Association for Computational Linguistics.
  • 8. Tetreault JR. A corpus-based evaluation of centering and pronoun resolution. Comput Linguist. 2001;27(4):507–20.
  • 9. Ng V. Technical report. 2003. Dec, Machine learning for coreference resolution: Recent successes and future challenges.
  • 10. Ng V. Learning noun phrase anaphoricity to improve coreference resolution: issues in representation and optimization. Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics; Barcelona, Spain. 2004. p. 151. 1218975: Association for Computational Linguistics.
  • 11. Ng V, Cardie C. Improving machine learning approaches to coreference resolution. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics; Philadelphia, Pennsylvania. 2002. pp. 104–11. 1073102: Association for Computational Linguistics.
  • 12. Soon WM, Ng HT, Lim DCY. A machine learning approach to coreference resolution of noun phrases. Comput Linguist. 2001;27(4):521–44.
  • 13. Grosz BJ, Weinstein S, Joshi AK. Centering: a framework for modeling the local coherence of discourse. Comput Linguist. 1995;21(2):203–25.
  • 14. Gundel JK. Cognitive status and the form of referring expressions in discourse. Language. 1993;69(2):274.
  • 15. Strube M, Hahn U. Functional centering: grounding referential coherence in information structure. Comput Linguist. 1999;25(3):309–44.
  • 16. Diessel H. Demonstratives: Form, function, and grammaticalization. Amsterdam: John Benjamins; 1999.
  • 17. Hawkins JA. Definiteness and indefiniteness: A study in reference and grammaticality prediction. London: Croom Helm; 1978.
  • 18. Prince EF. On the inferencing of indefinite-this NPs. In: Joshi AK, Webber BL, Sag IA, editors. Elements of Discourse Understanding. Cambridge: Cambridge University Press; 1981.
  • 19. Clark HH. Thinking: Readings in Cognitive Science. Cambridge University Press; London and New York: 1977.
  • 20. Gundel JK. Information structure and referential givenness/newness: How much belongs in the grammar. Journal of Cognitive Science. 2003;4:177–99.
  • 21. Poesio M, Vieira R. A corpus-based investigation of definite description use. Comput Linguist. 1998;24(2):183–216.
  • 22. Castano J, Zhang J, Pustejovsky J. Anaphora resolution in biomedical literature. International Symposium on Reference Resolution for Natural Language Processing; 3–4 June 2002; Alicante. 2002.
  • 23. Torii M, Vijay-Shanker K. Sortal anaphora resolution in Medline abstracts. Computational Intelligence. 2007;23(1):15–27.
  • 24. Li Y, Musilek P, Reformat M, Wyard-Scott L. Identification of pleonastic it using the web. J Artif Int Res. 2009;34(1):339–89.
  • 25. Divita GTT, Roth L. Failure analysis of MetaMap Transfer (MMTx). Stud Health Technol Inform. 2004;107:763–7.
  • 26. Segura-Bedmar I, Crespo M, de Pablo-Sanchez C, Martinez P. Resolving anaphoras for the extraction of drug-drug interactions in pharmacological documents. BMC Bioinformatics. 2010;11(Suppl 2):S1. doi: 10.1186/1471-2105-11-S2-S1.
  • 27. He T. Coreference Resolution on Entities and Events for Hospital Discharge Summaries. Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science; 2007. Thesis.
  • 28. Gate. http://gate.ac.uk/.
  • 29. Francis WN, Kucera H. A Standard Corpus of Edited Present-Day American English. College English. 1979 1965 Jan;26(4):7.
  • 30. OpenNLP. http://incubator.apache.org/opennlp/.
  • 31. Marcus MP, Marcinkiewicz MA, Santorini B. Building a large annotated corpus of English: the Penn Treebank. Comput Linguist. 1993;19(2):313–30.
  • 32. Lin YLT, Hsinehu T. Pronominal and sortal anaphora resolution for biomedical literature. Proceedings of ROCLING XVI: Conference on Computational Linguistics and Speech Processing; 2–3 September 2004; Taiwan. 2004.
  • 33. Stanford Parser. http://nlp.stanford.edu/software/lex-parser.shtml.
  • 34. Melton GB, Raman N, Chen ES, Sarkar IN, Pakhomov S, Madoff RD. Evaluation of family history information within clinical documents and adequacy of HL7 clinical statement and clinical genomics family history models for its representation: a case report 2010.

Articles from AMIA Annual Symposium Proceedings are provided here courtesy of American Medical Informatics Association
