Author manuscript; available in PMC: 2017 Jan 31.
Published in final edited form as: J Biomed Inform. 2016 Feb 26;60:334–341. doi: 10.1016/j.jbi.2016.02.011

Speculation Detection for Chinese Clinical Notes: Impacts of Word Segmentation and Embedding Models

Shaodian Zhang 1, Tian Kang 1, Xingting Zhang 2, Dong Wen 2, Noémie Elhadad 1, Jianbo Lei 2
PMCID: PMC5282586  NIHMSID: NIHMS839659  PMID: 26923634

Abstract

Speculations represent uncertainty towards certain facts. In clinical texts, identifying speculations is a critical step of natural language processing (NLP). While it is a nontrivial task in many languages, detecting speculations in Chinese clinical notes can be particularly challenging because word segmentation may be necessary as an upstream operation. The objective of this paper is to construct a state-of-the-art speculation detection system for Chinese clinical notes and to investigate whether embedding features and word segmentations are worth exploiting towards this overall task. We propose a sequence labeling based system for speculation detection, which relies on features from bag of characters, bag of words, character embedding, and word embedding. We experiment on a novel dataset of 36,828 clinical notes with 5,103 gold-standard speculation annotations on 2,000 notes, and compare the systems in which word embeddings are calculated based on word segmentations given by general and by domain specific segmenters respectively. Our systems are able to reach performance as high as 92.2% measured by F score. We demonstrate that word segmentation is critical to produce high quality word embedding to facilitate downstream information extraction applications, and suggest that a domain dependent word segmenter can be vital to such a clinical NLP task in Chinese language.

Keywords: speculation detection, natural language processing, Chinese NLP, Clinical NLP, word embedding, word segmentation

Graphical abstract

[Graphical abstract: nihms839659u1.jpg]

1. Introduction

An increasing amount of computerized clinical data is becoming available with the adoption of electronic health records (EHRs). Natural language processing (NLP) and information extraction techniques have been critical parts of the pipeline for automating EHR data structuring, data mining, and knowledge discovery, and have become an active research field in biomedical informatics. For example, researchers have made significant progress on named entity recognition (NER) from clinical and biomedical texts, whose aim is to detect the boundaries of clinical entities, identify their categories, and map them to concepts in standardized terminologies [1–3].

One of the issues that needs to be addressed along with NER in clinical information extraction is detecting cues of speculation (sometimes also referred to as hedges, or uncertainty), which have a vital impact on the credibility of statements or hypotheses in clinical content [4]. Linguistic speculations are used when uncertainty is expressed, such as in the sentence “the patient may have a UTI”, where “may” is the cue responsible for the uncertainty. Identification of such cues is critical to downstream applications like knowledge discovery, question answering, and predictive modeling of diseases.

Most existing methods for speculation detection are developed for English text. In recent years, hospitals in China have been rapidly deploying EHR systems, which generate a great amount of clinical data. Efforts have been made in the research community to construct NLP components for Chinese clinical notes [5–8], but to the best of our knowledge there is no established system for identifying clinical speculations.

Secondary use of EHR data in China requires NLP pipelines tailored to clinical language in Chinese, which is dramatically different from clinical notes in English and makes migration of existing NLP systems impractical. One of the primary differences between Chinese and English NLP is that most Chinese NLP systems need to begin with word segmentation, a step unnecessary for their English counterparts [9]. The need for word segmentation, which also exists in other languages like Arabic [10], Hebrew [11], and Japanese [12], creates additional challenges for automated NLP pipelines including named-entity recognition and other components (e.g., parsing, information extraction, etc.). Linguistic resources and tools have been created for Chinese word segmentation [9,13,14], but they were built for general purposes and were not tailored to handling clinical texts. As such, one question worthy of investigation is the impact of word segmentation on our task at hand, speculation detection in Chinese clinical texts.

In recent years, a novel type of feature has been proposed for NLP tasks such as text classification, parsing, and sentiment analysis, namely, word embeddings [15,16]. Word embeddings are usually learned via neural networks. Instead of using each word in an NLP task as a feature, words are represented as vectors which can encode rich contextual information. In a broader scope, word representation based on distributional semantics has been exploited in a wide range of clinical NLP tasks such as named entity recognition [17–19] and lexicon expansion [20]. An important motivation for using word embeddings in many information extraction tasks, including speculation detection, is that semantic representations of words can be learned from a large unannotated corpus, while the training corpus for the task itself is usually much smaller and sparser. This can be particularly helpful in clinical NLP because large amounts of clinical notes can be available, but manual annotations by medical experts are usually too costly. As such, the use of word embeddings as features in a speculation detection algorithm is another important question to investigate in our study.

In Chinese texts, word segmentation must be carried out before training a word embedding model. Unfortunately, high-quality segmentation is not always available when the text is highly domain specific, as with clinical notes. As a substitute for word embeddings, character embeddings have been exploited in applications [21,22], which build feature vectors for each character instead of each word. However, such substitution may not be ideal, since words, not characters, are the units of language that carry the meanings upon which semantic representations should be built. For Chinese clinical NLP, the two questions of interest, the impact of word segmentation algorithms and the impact of embedding features, are thus inter-related.

In this paper, we propose a sequence labeling based system for identifying speculations in Chinese clinical text. To investigate our two questions, we experiment with four types of features: bag of characters, character embedding, bag of words, and word embedding. For the latter two, which rely on word segmentation, we carry out experiments with a general-purpose Chinese word segmenter and with a segmenter specially trained on clinical notes, respectively. Our system, to the best of our knowledge, is the first speculation detection system for Chinese clinical notes. We compare the effectiveness of these four groups of features, and demonstrate that a domain-dependent word segmenter and embedding features can be helpful in such a clinical information extraction task.

1.1. Related work

In the biomedical domain, speculation detection has been an organic component of clinical NLP since as early as the first wave of automated text processing systems, such as in [4]. These early systems usually rely on hand-crafted rules or grammatical patterns to identify statements with uncertainty in clinical notes. Since 2000, speculation detection from biomedical texts, especially scientific literature, has been flourishing thanks to the emergence of shared linguistic corpora and pioneering works [23,24]. The BioScope corpus, in particular, provides a benchmark for speculation detection as well as negation detection in several systems [25–27]. The corpus comprises both biomedical literature texts and clinical notes. Part of the corpus was also adopted for shared tasks on hedge detection, such as CoNLL-2010, which prompted an enthusiastic response from the international research community [27]. Utilizing manually labeled corpora, machine learning-based speculation cue detectors have been developed, which leverage SVMs [25,28,29], CRF sequence labeling [30–32], or decision trees [25]. Semi-supervised learning [33] and hybrid methods [34,35] were also developed to identify hedges. It is noteworthy that many of the works detecting speculative cues also identify the linguistic scopes of these cues, given that BioScope and CoNLL-2010 both include scope annotations in addition to cues. Scope finding, compared to cue detection, is a more challenging task which usually requires deeper syntactic analysis of the text [27]. While most of the previous works focus on biomedical speculations, other research has focused on identifying hedges in general scientific literature [36] or Wikipedia text [27,37].

In Chinese NLP, several negation and speculation detection systems have recently been developed, but primarily for scientific literature [38–40] and general news texts [41]. In the clinical domain, various NLP systems for Chinese clinical text have been created, such as those for word segmentation [8], named entity recognition [6], information extraction [7], and term alignment [42]. To the best of our knowledge, no speculation detection system has been developed specifically for Chinese clinical notes so far.

2. Material and methods

2.1. Dataset and annotations

One month of admission notes and discharge summaries were collected from the EHR database of Peking Union Medical College Hospital, resulting in 36,828 notes in total from 17 departments and clinics. After excluding incomplete notes, 2,000 clinical notes (1,000 admission notes and 1,000 discharge summaries) were randomly sampled for this study. The 2,000 notes were de-identified manually. Two medical doctors (DW and XZ) and one clinical informaticist (SZ) drafted the first version of the annotation guideline, which was strongly inspired by the “minimal unit” principle in the guideline of the BioScope corpus [24]. Then a pilot annotation of 10 discharge summaries, corresponding to approximately 30 speculation cues, was carried out by the two medical doctors to refine the annotation guideline. It is noteworthy that only speculative cues (keywords) were annotated; the linguistic scopes of these cues were not taken into consideration for this study. Our definition of speculation cues also includes keywords and phrases expressing vagueness, such as the “more than” in “more than 10 year history of diabetes”, since they occur very frequently and bring uncertainty to the facts as well. Furthermore, we asked the annotators to pay special attention to hedge cues for four categories of information: disease and syndrome, symptom and sign, treatment and drug, and laboratory test, based on the named entity recognition results which were described in our previous work on the same data set [6]. The following are several example sentences with speculations annotated (bold and italic):

  • Inline graphic Most likely lung infection

  • Inline graphic suspicious of ulcerative colitis

  • Inline graphic Renal function roughly in the normal range

The two medical doctors (DW and XZ) then double annotated 80 notes to calculate the inter-rater agreement. An agreement of 0.76 measured by Kappa [43] was reached. The two annotators then resolved all disagreements and refined the guideline. Our final annotation guideline is given in Appendix A. The remaining 1,920 notes were evenly split and coded by a single annotator only. In total, 5,103 speculation cues were identified, of which 2,558 modify diseases and diagnoses, 1,134 modify symptoms, 960 modify laboratory tests, and 451 modify treatments and drugs.
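The inter-rater agreement above is Cohen's kappa; as a minimal sketch of how it is computed (the label lists here are hypothetical, not the actual annotations):

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' parallel label sequences."""
    assert len(a) == len(b) and a
    n = len(a)
    labels = set(a) | set(b)
    # Observed agreement: fraction of items the two annotators label identically.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: expected overlap given each annotator's label frequencies.
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (po - pe) / (1 - pe)
```

For example, `cohens_kappa(["spec", "o", "spec", "o"], ["spec", "o", "o", "o"])` returns 0.5: the annotators agree on 3 of 4 items, but half of that agreement is expected by chance.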

2.2. Baseline

The annotators were asked to keep track of and generalize annotation rules heuristically during the annotation process. The common part of the rule sets given by the two annotators, which consists of a list of 31 speculation cues with 16 constraint rules (Table 1), was implemented with regular expressions and used as the baseline system in our experiments. Such rules include, for example: when “Inline graphic (more than)” appears right after a number, as it does in “10 Inline graphic (more than 10)”, it should be annotated as a speculation cue.

Table 1.

Cues and constraints selected by the two annotators. Cues and constraints listed in this table are used as the baseline system to match speculation cues from the original text.

Cues without constraints Cues with constraints

Inline graphic(cannot exclude), Inline graphic (cannot exclude), Inline graphic(to be excluded), Inline graphic(to be excluded), Inline graphic(to be excluded), Inline graphic(very likely), Inline graphic(very likely), Inline graphic (speculate), Inline graphic(probably), Inline graphic (suspicious), Inline graphic (cannot exclude), Inline graphic (suspicious), Inline graphic(a little), Inline graphic(a little), Inline graphic(occasionally) 1.? after disease names; 2. Inline graphic(probably) before Inline graphic(diagnose); 3. Inline graphic(around) before numbers; 4. Inline graphic(around) after numbers; 5. Inline graphic(more than) after numbers; 6. Inline graphic(more than) after numbers; 7. Inline graphic(consider) when appearing with Inline graphic(possibility); 8. Inline graphic(consider) when appearing with Inline graphic(possibility); 9. Inline graphic(possible) after disease names; 10. Inline graphic(around) before numbers except time; 11. Inline graphic(around) before numbers except time; 12. Inline graphic(overall) before Inline graphic(normal); 13. Inline graphic(roughly) before Inline graphic(normal); 14. Inline graphic(occasionally happen) before or after symptoms; 15. Inline graphic(occasionally have) before or after symptoms; 16. Inline graphic(occasionally see) before or after symptoms
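The cue-plus-constraint lookup in Table 1 can be implemented as a small set of regular expressions. A sketch follows; the cue strings here are hypothetical placeholders (the actual cues are the Chinese characters listed in the table), and only one illustrative constraint is shown:

```python
import re

# Hypothetical cue list; the real baseline uses the 31 cues and 16
# constraints from Table 1.
UNCONSTRAINED_CUES = ["cue_a", "cue_b"]          # matched anywhere in the text
CONSTRAINED = [
    # (pattern, description): e.g. a "more than" cue valid only after a number.
    (re.compile(r"(?<=\d)(cue_more)"), "more-than after numbers"),
]

def match_cues(text):
    """Return sorted (start, end) spans of speculation cues found by the rules."""
    spans = []
    for cue in UNCONSTRAINED_CUES:
        for m in re.finditer(re.escape(cue), text):
            spans.append((m.start(), m.end()))
    for pattern, _desc in CONSTRAINED:
        for m in pattern.finditer(text):
            spans.append((m.start(1), m.end(1)))
    return sorted(spans)
```

The lookbehind `(?<=\d)` enforces the "right after a number" constraint without consuming the digit, so the reported span covers only the cue itself.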

2.3. CRF-based sequence labeling

The task of identifying speculative cues can be cast as a sequence labeling problem. Conditional Random Fields (CRF), an established method for numerous information extraction tasks, including in the clinical domain, was adopted [1,44–46]. In our task, we used the classical ‘BIO’ notation to represent the boundaries of cues. The following examples show how a sentence with speculation is labeled, in which B-spec marks the beginning character of a speculation cue, I-spec marks a character inside a cue, and O marks a character outside any speculation cue:

  • Inline graphic/O Inline graphic/O Inline graphic/O Inline graphic/O Inline graphic/B-spec Inline graphic/I-spec Inline graphic/I-spec Inline graphic/I-spec Most likely lung infection

  • Inline graphic/O Inline graphic/O Inline graphic/O Inline graphic/O 10/O Inline graphic/B-spec Inline graphic/O More than 10 year history of diabetes

We only used “B-spec”, “I-spec”, and “O” as labels and did not distinguish cues by entity types they modify. We rely on the open source CRF++ tool [47] for the training and prediction with default parameters for our experiments.
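The BIO label sequence above can be produced mechanically from the gold-standard cue spans; a minimal sketch (the helper name is ours, not part of CRF++):

```python
def to_bio(chars, cue_spans):
    """Label each character with B-spec/I-spec/O given cue (start, end) spans.

    chars:     list of characters in the sentence
    cue_spans: list of (start, end) spans, end exclusive
    """
    labels = ["O"] * len(chars)
    for start, end in cue_spans:
        labels[start] = "B-spec"                 # first character of the cue
        for i in range(start + 1, end):
            labels[i] = "I-spec"                 # remaining cue characters
    return labels
```

Each (character, label) pair then becomes one line of CRF++ training data.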

2.4. Features

To investigate our research questions (impact of word segmentation quality and use of embeddings as features for speculation detection), we fed different types of features into a series of CRF labelers. For each labeler, the input is a sentence and the output is a BIO-type sequence.

The first system in the study relies on the original input of individual characters as features. No word segmentation was carried out for this system. Suppose that the vocabulary in the training set has Vc characters, and each unique character vi has a unique index i. For each character to be labeled, a one-hot feature vector fi of Vc dimensions is created: fi = <0,…,1,…,0>, where the value of the ith element is 1 and all other elements are zero.
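The one-hot construction of fi can be sketched as follows (a minimal illustration; characters unseen in training get an all-zero vector, an assumption the paper does not spell out):

```python
def one_hot(char, char_index):
    """Bag-of-characters feature f_i: one-hot vector over the training vocabulary.

    char_index: dict mapping each training-set character to its index in V_c.
    """
    f = [0.0] * len(char_index)
    if char in char_index:
        f[char_index[char]] = 1.0   # the i-th element is 1, all others 0
    return f
```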

The second type of feature we use is bag of words. State-of-the-art Chinese word segmenters can achieve around 95% accuracy on general text, which is sufficient for most applications [9]. However, in the clinical domain, general-purpose word segmenters usually fail, as other NLP systems do, since content in clinical notes is highly domain specific and dependent, no matter what the language is [48]. As such, a segmentation tool needs to be re-trained on clinical notes in order to provide accurate results for the downstream word embedding models. In this study, we separately try two segmenters to generate bag of words representations: one taken directly from a state-of-the-art general-purpose Chinese NLP pipeline, and the other specifically trained on clinical notes. Section 2.6 describes how we trained the domain-specific word segmenter.

In order to enrich the feature set, we expand the space by adding distributional representations of characters and/or words, leading to the third and fourth types of features we use: character embedding and word embedding. More precisely, we use vectorized embeddings to get richer representations of linguistic units (characters or words); an embedding is essentially a parameterized function mapping units to high-dimensional vectors [15]. Intuitively, embedding models encode the hidden linguistic information that a word/character can convey in different contexts into a vector of a certain number of dimensions. Previous research shows that the embedding space is much more powerful than the one-hot representation (e.g., bag-of-words), and can enable breakthroughs in many NLP tasks, as it conveys more semantic meaning and is particularly useful in overcoming sparsity [15,16]. In our task, for each word or character t, depending on whether the embedding is word level or character level, a vector with a fixed number of dimensions (N) is calculated as follows: w(t) = <0.2, -0.4, 0.7, …>, based on unsupervised parameter estimation on the unlabeled corpus. The vector can then be used directly as an N-dimensional feature vector for the CRF labeler. As with bag of words, word embeddings can be calculated based on different segmentation results, given by a general-purpose segmenter or by a domain-specific one.

Figure 1 illustrates how the four types of feature vectors described above (bag of characters, bag of words, character embedding, and word embedding) are calculated for a snippet of text. For bag of words and bag of characters, one-hot representations are calculated for the units of interest, i.e., characters or words. Character embeddings and word embeddings are obtained by training neural network models on the corpus, relying on the original sequence of characters and the segmented sequence of words, respectively. In the case of bag of characters and bag of words, the individual dimensions correspond to the characters and words in the corpus, and are thus quite numerous. In the embedding representations, the number of dimensions is much lower, but as a trade-off, each individual dimension is harder for humans to interpret. We also note that the weights in the embedding representations are learned, unlike in the bag-of-words and bag-of-characters representations, where each weight simply indicates presence or absence in the input.

Figure 1.

Figure 1

Different representations of the feature vectors for the four types (bag of characters, bag of words, character embedding, and word embedding) for a given text snippet. Upper-left are one-hot vectors in the bag-of-characters representation, where |Vc| is the total number of possible characters. Lower-left are one-hot vectors in the bag-of-words representation, where |Vw| is the total number of words after word segmentation. Upper-right and lower-right are character embedding and word embedding representations, trained by two independent neural networks. N is a hyper-parameter and is set to 100 in this study.

2.5. The workflow

For each type of feature, two systems have been implemented: one using only the vector of the current unit as features (unigram), the other including vectors from both the current unit and the previous unit (bigram). The entire extracted unlabeled dataset (36,828 notes) was used to learn the word/character embeddings. We used the word2vec tool’s Continuous Bag of Words (CBOW) model to train all embeddings [49], and set the vector size N = 100, the iteration number to 20, and all other parameters to their defaults.
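The CBOW model trains embeddings by predicting each unit from its surrounding context window. A sketch of how such (context, target) training pairs are formed from a tokenized sentence (the window size here is illustrative, not word2vec's default):

```python
def cbow_pairs(tokens, window=2):
    """Extract CBOW training pairs: (context tokens, target token) per position.

    For each position i, the context is the up-to-`window` tokens on each side;
    CBOW learns to predict the target from the average of its context vectors.
    """
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs
```

Whether these pairs are built over characters or over segmented words is exactly the difference between the character embedding and word embedding models.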

An overall workflow for all systems is given in Figure 2. Twelve systems based on four types of features are implemented, in addition to the rule-based baseline. The BOC systems use one-hot encodings of characters as features. The C2V-unigram and C2V-bigram systems use character embeddings. The BOW-G and W2V-G systems are bag of words and word embedding systems based on the word segmentation given by the Stanford word segmenter [50], which is trained on Chinese news text, and the BOW-D and W2V-D systems use bag of words and word embedding based on the word segmentation given by a CRF segmenter trained on the same type of clinical notes.

Figure 2.

Figure 2

Workflow to obtain the twelve systems. BOC systems use bag of characters as features. C2V systems use character embeddings. BOW-G and W2V-G systems use bag of words and word embedding based on the word segmentation given by the Stanford word segmenter. BOW-D and W2V-D systems use bag of words and word embedding based on the word segmentation given by a CRF segmenter trained on clinical notes.

2.6. Supervised word segmentation for Chinese clinical notes

A domain-specific word segmenter is needed to generate features for the BOW-D and W2V-D systems described above. To train such a segmenter, 100 notes were randomly sampled and annotated manually by the same annotators as for the speculation detection task. An agreement of 0.74 was achieved before the annotators resolved their disagreements. A CRF segmenter was then trained and evaluated on the annotated data. Finally, the segmenter was applied to all 2,000 notes with speculation annotations, and the results were fed to the BOW-D and W2V-D systems for feature extraction.
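CRF-based word segmentation is typically cast as per-character tagging, analogous to the BIO scheme used for cues. A sketch using the common BMES scheme (the paper does not specify its tag set, so this particular scheme is an assumption):

```python
def words_to_tags(words):
    """Convert a segmented sentence into per-character BMES training tags.

    B = begin of a multi-character word, M = middle, E = end,
    S = single-character word.
    """
    chars, tags = [], []
    for w in words:
        chars.extend(w)
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return chars, tags
```

At prediction time, the inverse mapping (cutting the character stream after every E or S tag) recovers the word boundaries.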

3. Results

For each system, we carry out a 5-fold cross validation and report average performance in precision, recall, and F-score. Detailed performance for all systems is given in Table 2. For each system, we re-sampled the 5 folds of data five times, resulting in 25 folds in total with 25 scores. We calculated 95% confidence intervals using these 25 performance scores, assuming they are normally distributed. The confidence intervals can be used to assess whether differences among systems are indeed significant.
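Under the normality assumption stated above, one standard construction of the 95% confidence interval is mean ± 1.96 standard errors over the 25 scores; a minimal sketch (whether the paper used the normal or the t quantile is not stated, so 1.96 is an assumption):

```python
import statistics

def mean_ci95(scores):
    """Mean and 95% CI half-width, assuming scores are normally distributed."""
    m = statistics.mean(scores)
    # Standard error of the mean; 1.96 is the normal 97.5% quantile.
    se = statistics.stdev(scores) / len(scores) ** 0.5
    return m, 1.96 * se
```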

Table 2.

Performance of all systems measured by precision, recall, and F score. 95% confidence intervals are given in brackets. For descriptions of the systems, see Figure 2. The best precision, recall, and F scores are in bold.

Precision (95% CI) Recall (95% CI) F (95% CI)
Baseline (rule based) 59.1 (±0.7) 84.5 (±0.6) 69.5 (±0.6)
BOC-unigram 85.1 (±0.5) 85.6 (±0.4) 85.4 (±0.5)
BOC-bigram 91.5 (±0.6) 88.1 (±0.6) 89.8 (±0.6)
C2V-unigram 86.7 (±0.5) 85.5 (±0.5) 86.1 (±0.6)
C2V-bigram 92.4 (±0.5) 90.2 (±0.4) 91.3 (±0.4)
BOW-G-unigram 83.7 (±0.6) 82.3 (±0.5) 82.9 (±0.6)
BOW-G-bigram 87.4 (±0.5) 86.9 (±0.4) 87.1 (±0.4)
BOW-D-unigram 84.9 (±0.4) 86.0 (±0.5) 85.4 (±0.5)
BOW-D-bigram 91.2 (±0.3) 89.5 (±0.3) 90.3 (±0.3)
W2V-G-unigram 85.6 (±0.8) 85.1 (±0.6) 85.4 (±0.7)
W2V-G-bigram 91.8 (±0.5) 89.9 (±0.4) 90.9 (±0.5)
W2V-D-unigram 89.9 (±0.7) 84.5 (±0.6) 87.1 (±0.6)
W2V-D-bigram 94.5 (±0.5) 90.1 (±0.3) 92.2 (±0.4)

Overall, all the CRF-based systems significantly outperform the rule-based baseline, and all bigram systems reach F scores of over 90, except BOC-bigram and BOW-G-bigram. Several observations can be made. First, for all feature types, adding bigram features increases performance significantly. Second, most word-based systems outperform their character-based counterparts, but only if domain-specific word segmentation is used. The general-purpose segmenter may undermine the effectiveness of using words, making the BOW-G and W2V-G systems perform worse than the BOC and C2V systems. Third, embedding representations can enhance system performance compared with one-hot bag of characters or bag of words. The differences are particularly significant for the word-based systems (i.e., W2V vs. BOW). Finally, the best system, W2V-D-bigram, which combines the advantages of bigram features, embeddings, and domain-specific word segmentation, outperforms all others with significant differences.

The type of segmenter used made a difference in the overall task of speculation detection. Before discussing their impact, we report the evaluation of the segmenters themselves on our dataset. The Stanford word segmenter was used in the W2V-G and BOW-G systems, and the CRF word segmenter was used in the W2V-D and BOW-D systems. We cross-validated the CRF word segmenter on the 100 annotated admission notes with 5 folds, and for each fold we also evaluated the performance of the Stanford segmenter. For both tools, we used the Contemporary Chinese Dictionary (CCD) [51] as the dictionary for evaluating IV (in vocabulary) and OOV (out of vocabulary) performance. The rationale is that the CCD lists common Chinese words and excludes domain-specific clinical terms. As such, IV performance can be regarded as how well the word segmenters identify common words, and OOV performance approximates how well the segmenters handle clinical terms. Table 3 gives overall, IV, and OOV F scores for the two systems.

Table 3.

Overall, In-Vocabulary, and Out-of-Vocabulary performance of the Stanford segmenter and our CRF segmenter trained on admission notes, measured by F score. The vocabulary used is the Contemporary Chinese Dictionary (CCD). IV can be seen as how well the word segmenters identify common words, and OOV approximates how well the segmenters handle clinical terms.

Overall IV OOV
Stanford 69.0 84.9 43.8
Ours 83.1 87.6 74.1

The results show that the Stanford word segmenter, trained on general Chinese news text, cannot handle clinical notes well, especially clinical terms, while training our CRF segmenter on annotated admission notes of the same genre boosts word segmentation performance substantially. The results are not surprising, but they help confirm that the performance increase from the BOW-G and W2V-G systems to the BOW-D and W2V-D systems shown previously is indeed the effect of a better word segmenter, especially better segmentation of clinical terms.

4. Discussion

4.1. Findings and error analysis

Some of the disagreements between the two annotators in the double-annotation phase are caused by ambiguities of clinical statements, sometimes erroneous narratives in the notes. For instance, our annotators made different decisions on the following snippets of text: Inline graphic (cannot exclude inequality), Inline graphic (cannot exclude no tenderness). Such meaningless or misleading statements are usually caused by typos entered by physicians. Some other disagreements were caused by the presence of signals of both certainty and uncertainty, such as in the following piece of text found in a discharge summary describing the process of treatment: “Inline graphic? Inline graphic” (See clear evidence of infection? Continue anti-infection therapies). The question mark was believed to be a typo given the context. Many of the disagreements, though, were simply caused by missing annotations by one of the annotators. Appendix B lists all disagreements between the two annotators that were not such mismatches of common cues due to a missing annotation by one coder. The annotators then agreed to abandon such cases, making no annotations when a statement is meaningless. In our study, all disagreements between the annotators were guaranteed to be resolved before moving forward to the single annotation stage.

Our annotators originally believed that their rule set could solve most of the cases, and that there was therefore little need for machine learning for the task. However, the experimental results indicate that the rules produce too many false positives, although they can indeed cover most of the cues and have roughly the same recall as the machine learning based systems. Around half of the false positives were caused by the following keywords: Inline graphic (a little), Inline graphic (a little), Inline graphic (occasionally), since they are common Chinese characters/words with multiple meanings. Complex constraints must be added to correct such false positives, which may harm the sensitivity of the system. Inline graphic (occasionally), specifically, yields interesting errors regarding word segmentation, such as in ‘9 Inline graphic’ (the nurse made her usual rounds at 9am). Some other false positives do express uncertainty, but do not modify the clinical terms of interest according to our guideline. It is noteworthy that although the four types of clinical terms were identified before speculation annotation and were presented to our annotators, it is difficult to write regular expression rules that strictly define when a speculation cue is actually modifying an entity of interest. For instance, in ‘Inline graphic (suspect) Inline graphic (finding) Inline graphic (perforation)’, the speculation cue ‘Inline graphic (suspect)’ and the clinical entity ‘Inline graphic (perforation)’ are separated by another word, ‘Inline graphic (finding)’. This type of exception makes it impossible to accurately determine whether a cue modifies a clinical term without syntactic analysis. As such, no constraints were added to the rule set to eliminate false positives in which speculation cues describe other events.

Our experimental results indicate that CRF-based sequence labeling is effective in identifying speculative cues in Chinese clinical notes. Our best system achieves an F score of 92.2, which is roughly on par with its state-of-the-art English counterparts [27]. The Kappa agreements between our best system, W2V-D-bigram, and the two annotators are 0.71 and 0.74, respectively, which are almost as high as the inter-rater agreement between the two annotators. This indicates that the system may replace human coders in identifying speculation cues in most cases, although machine learning may still not handle unseen scenarios as well as humans. Most of the hedge cues in the clinical notes are certain keywords, such as “Inline graphic (about)”, “Inline graphic (very likely)”, “Inline graphic (suspicious)”. However, in rare cases with little evidence in the training data, the cues may not be detected properly (e.g., “Inline graphic (lack of evidence)” in our corpus).

Many of the errors made by our system are caused by incorrect boundaries. For example, in “Inline graphic (likely) Inline graphic (to have) Inline graphic (ascites)”, our gold standard annotated “Inline graphic (likely) Inline graphic (to have)” as the speculative cue while the system labels only “Inline graphic (likely)”. In contrast to English, both “Inline graphic (likely)” and “Inline graphic (likely to have)” are legal words in Chinese, which means either annotation is acceptable in practice. One reason for this issue is that in our manual annotation, the “minimal units principle” we follow is taken directly from the English guideline, where it is essentially a “minimal words principle”. Cases like “Inline graphic (likely to have)” could be avoided if a “minimal characters principle” were followed, which reminds us to pay more attention to language-specific adjustments in future studies when annotation guidelines for Chinese NLP tasks are inherited from English studies. Boundary detection errors can hopefully be reduced if part-of-speech and syntactic features are added, since some boundary errors like “Inline graphic (is suspected to be)” (gold standard: Inline graphic is suspected) are caused by including unnecessary syntactic constituents of sentences. In order to get a sense of how many boundary errors occur, we also evaluated our system using a slightly different evaluation metric, which allows for partial matches of the speculative cues. Here, if a gold-standard cue is a one-character cue, the predicted cue must be exactly the same one; if the gold-standard cue is multi-character, then at least 50% of the characters in the gold-standard cue must also be identified to be counted as a true positive prediction. Under this partial-match evaluation, our best system, W2V-D-bigram, yields an F score of 95.4.
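The partial-match criterion described above can be made precise as a span-overlap test; a sketch reflecting our reading of the rule (span arithmetic details are an assumption):

```python
def is_partial_match(gold, pred):
    """True positive under the partial-match rule.

    A one-character gold cue must be predicted exactly; a multi-character cue
    counts if at least 50% of its character positions are covered by the
    predicted span. gold and pred are (start, end) spans, end exclusive.
    """
    g = set(range(*gold))
    p = set(range(*pred))
    if len(g) == 1:
        return g == p
    return len(g & p) / len(g) >= 0.5
```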

We also decompose the results by the type of entity each cue modifies. Our best system, W2V-D-bigram, yields F scores of 93.6, 91.5, 90.9, and 89.3 for diagnoses, symptoms, laboratory tests, and treatments, respectively. The differences across types may result from unbalanced sample sizes. In addition, one major difference between cues modifying diseases and cues modifying other types of entities is that almost all disease-related cues indicate true speculations, such as “ Inline graphic(cannot exclude cor triatriatum)”, while many of the others express vagueness, as in “ Inline graphic3cm Inline graphic (about 3cm cyst)”. These results suggest that it may be necessary to treat speculations and other uncertainties separately in future work.

Our experimental results also show how word segmentation affects downstream NLP tasks like speculation detection. Characters in Chinese, which are roughly equivalent to syllables, or to morphemes like affixes and stems in English, can indeed express meaning by themselves, but their meaning is usually more ambiguous or vague than that of words. Our results demonstrate that word segmentation of clinical notes can be difficult if no annotated data is provided, since existing tools trained on data of a different genre perform poorly at segmenting clinical terms. The most critical message our results convey regarding word segmentation is that although it may be good practice to use character-level representations directly in Chinese clinical NLP, a good domain-specific word segmenter can significantly strengthen bag-of-words and word embedding features, making them superior to character-based ones.
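The effect of a domain-aware vocabulary on segmentation can be illustrated with a toy forward maximum-matching segmenter. This is a deliberately simplified sketch, not the CRF segmenter used in the study, and both dictionaries below are hypothetical:

```python
def max_match(text, vocab, max_len=5):
    """Greedy forward maximum matching: at each position, take the
    longest dictionary word; fall back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + length]
            if length == 1 or word in vocab:
                tokens.append(word)
                i += length
                break
    return tokens

sentence = "心功能不全待查"          # "cardiac insufficiency, to be examined"
general = {"功能", "不全"}           # general-purpose dictionary
clinical = general | {"心功能不全", "待查"}  # adds clinical terms
print(max_match(sentence, general))   # → ['心', '功能', '不全', '待', '查']
print(max_match(sentence, clinical))  # → ['心功能不全', '待查']
```

The general-purpose dictionary fragments the clinical term into misleading pieces, whereas the domain dictionary recovers the full term; word embeddings trained on the latter tokenization carry correspondingly less ambiguous units.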

Our experimental results also demonstrate the effectiveness of embedding models for speculation detection in Chinese clinical notes. Although the technique has been applied successfully to many NLP applications, this is the first time that character embedding and word embedding have been systematically built and compared for a Chinese NLP task. From a clinical NLP standpoint, our study also creates opportunities for future research on optimizing representations in clinical language processing tasks.

4.2. Limitations

There are several limitations to this study. First, our CRF word segmenter, although retrained on admission notes, learned from a very small training set of only 100 samples. A larger set of training data for word segmentation may boost system performance further. Second, we did not push the system to its limit by tuning parameters or carrying out sophisticated feature engineering. In fact, we focused only on basic, standard features and embedding models, since our goal was to investigate the added value of embeddings and word segmentation in speculation detection. It might be helpful to add common features such as POS tags, to combine different features, and to cascade systems with different foci. Third, most of the data used in this study was coded by a single annotator, although quality control was carried out by tracking inter-rater agreement on a small portion of the clinical notes. Without full double annotation of the entire data set, the gold standard may be vulnerable to random annotation errors. The fourth major limitation of this work is that only speculative cues are identified, while the scopes of the speculations are ignored. Scope finding can be equally important because it determines which facts in context are affected by the uncertainty cues, and hence facilitates further semantic analyses over the texts. Scope finding can also be approached by sequence labeling [27] and will be part of our future work. Finally, all data in this study comes from a single hospital, one of the most prestigious hospitals in China and among the most advanced in standardizing EHR data. However, since no state-level or region-level standards for EHR data have been established and implemented in China, there could be large differences between the terminologies used in EHR narratives by different hospitals, clinics, or even individual physicians. Therefore, the generalizability of our system still needs to be evaluated in order to support cross-institution applications.

5. Conclusions

Our study proposes a state-of-the-art speculation detection approach for Chinese clinical text. We experimented with bag of characters, bag of words, word embedding, and character embedding as features, demonstrated the effectiveness of embedding models, and showed that word segmentation is a critical step in generating high-quality word representations for downstream information extraction applications. In particular, we suggest that a domain-dependent word segmenter can be vital to clinical NLP tasks in Chinese.

Highlights.

  1. Chinese speculation detection is approached by supervised sequence labeling.

  2. Embedding features, especially word embeddings, can enhance system accuracy.

  3. Domain specific word segmentation is critical to Chinese speculation detection.

Acknowledgments

The authors would like to thank Dr. Yonghui Wu and Dr. Hua Xu, who helped with the data preparation. This study was supported by the National Natural Science Foundation of China (NSFC) Grant # 81171426 and #81471756, and National Institute of General Medical Sciences Grant R01GM114355.

Appendix A. Annotation Guideline for Speculation Detection

1. Task and Examples

A speculation cue in this task is defined as any word, phrase, or punctuation mark that introduces or increases uncertainty in a statement. Two types of uncertainty were considered in this annotation:

A. True speculations, which express that a fact, such as a diagnosis or a history of treatment, is not known with certainty. In particular, doctors sometimes use question marks in Chinese clinical notes to express an uncertain diagnosis; these should be annotated as speculation cues. Example cues include:

Inline graphic (Hypertensive heart disease, very likely)

Inline graphic (Cannot exclude diabetic retinopathy)

Inline graphic? (Intestinal obstruction?)

Inline graphic (suggestive of brain tumor)

B. Vagueness in describing facts, especially numeric information.

Inline graphic5 Inline graphic (More than 5 years history of rapid atrial fibrillation)

Inline graphic38.5 Inline graphic (Temperature around 38.5 degree)

2. Detailed guideline

A. Minimal unit principle

The minimal unit that expresses uncertainty is marked. For example, in Inline graphic (renal function roughly in the normal range), Inline graphic (roughly) rather than Inline graphic (roughly in the normal range) should be annotated.

B. A cue can be split by the syntactic structure of Chinese. For example, in:

Inline graphic (Cannot exclude the possibility of hepatic encephalopathy)

In this case, the minimal unit principle should also be followed: only the first part, Inline graphic (cannot exclude), which alone expresses the uncertainty, should be marked.

C. The annotators should pay particular attention to distinguishing linguistic uncertainty from clinical uncertainty.

Only linguistic uncertainty should be annotated. An example of clinical uncertainty is as follows:

Inline graphic (right lung pneumonia yet to be examined).

In this statement, pneumonia is an uncertain diagnosis that needs to be confirmed by further examination. Linguistically, however, no uncertainty is expressed about this yet-to-be-confirmed diagnosis. This distinction can be particularly confusing in Chinese.

D. A cue should be annotated only when it modifies the following types of information: diagnoses and diseases, signs and symptoms, laboratory tests, and treatment procedures and methods. Other cues are not considered, even if they describe clinical events. For example:

Inline graphic (enter the ED at about 5pm)

E. Do not annotate a statement that is meaningless to a human reader, usually due to typos in the clinical notes.

Appendix B

Disagreements between annotators that are not caused by one coder missing an annotation are listed below. Bold text in the three columns indicates the speculation cues annotated by the two annotators and in the resolved annotation, respectively.

Annotator A Annotator B Resolved
Inline graphic (multiple sites of scars) Inline graphic (multiple sites of scars) Inline graphic (multiple sites of scars)
Inline graphic(cannot exclude inequality) Inline graphic(cannot exclude inequality) Inline graphic(cannot exclude inequality)
Inline graphic(cannot exclude no tenderness) Inline graphic(cannot exclude no tenderness) Inline graphic(cannot exclude no tenderness)
Inline graphic (possible risks during transfer) Inline graphic (possible risks during transfer) Inline graphic (possible risks during transfer)
Inline graphic? Inline graphic (See clear evidence of infection? Continue anti-infection therapies) Inline graphic? Inline graphic (See clear evidence of infection? Continue anti-infection therapies) Inline graphic? Inline graphic (See clear evidence of infection? Continue anti-infection therapies)
Inline graphic(not see clear abnormity) Inline graphic(not see clear abnormity) Inline graphic(not see clear abnormity)
Inline graphic(appear multiple times) Inline graphic(appear multiple times) Inline graphic(appear multiple times)


References

  • 1.Li D, Kipper-Schuler K, Savova G. Conditional Random Fields and Support Vector Machines for Disorder Named Entity Recognition in Clinical Texts. Proc Curr Trends Biomed Nat Lang Process (BioNLP 2008); 2008. [Google Scholar]
  • 2.Demner-Fushman D, Chapman WW, McDonald CJ. What can Natural Language Processing do for Clinical Decision Support? J Biomed Inform. 2009:760–772. doi: 10.1016/j.jbi.2009.08.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Jiang M, Chen Y, Liu M, Rosenbloom ST, Mani S, Denny JC, et al. A study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summaries. J Am Med Inform Assoc. 2011;18(5):601–606. doi: 10.1136/amiajnl-2011-000163. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Friedman C, Alderson PO, Austin JH, Cimino JJ, Johnson SB. A general natural-language text processor for clinical radiology. J Am Med Informatics Assoc. 1994;1(2):161–174. doi: 10.1136/jamia.1994.95236146. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Wu Y, Leia J, Wei WQ, Tang B, Denny JC, Rosenbloom ST, et al. Analyzing differences between Chinese and English clinical text: A cross-institution comparison of discharge summaries in two languages. Stud Health Technol Inform. 2013;192:662–666. [PMC free article] [PubMed] [Google Scholar]
  • 6.Lei J, Tang B, Lu X, Gao K, Jiang M, Xu H. A comprehensive study of named entity recognition in Chinese clinical text. J Am Med Informatics Assoc. 2014;21(Ml):808–814. doi: 10.1136/amiajnl-2013-002381. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Wang H, Zhang W, Zeng Q, Li Z, Feng K, Liu L. Extracting important information from Chinese Operation Notes with natural language processing methods. J Biomed Inform Elsevier Inc. 2014 Apr;48:130–6. doi: 10.1016/j.jbi.2013.12.017. [DOI] [PubMed] [Google Scholar]
  • 8.Xu Y, Wang Y, Liu T, Liu J, Fan Y, Qian Y, et al. Joint segmentation and named entity recognition using dual decomposition in Chinese discharge summaries. J Am Med Inform Assoc. 2014;21:e84–92. doi: 10.1136/amiajnl-2013-001806. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Chang HG, Hai Z. Chinese Word Segmentation: A Decade Review. J Chinese Inf Process. 2007;21(3):8–20. [Google Scholar]
  • 10.Lee Y, Papineni K, Roukos S. Language model based Arabic word segmentation. ACL. 2003:399–406. [Google Scholar]
  • 11.Bar Haim R, Sima’an K, Winter Y. Choosing an optimal architecture for segmentation and POS tagging of modern Hebrew. ACL Work Comput Approaches to Semit Lang. 2005 Jun;:39–46. [Google Scholar]
  • 12.Sassano M. An Empirical Study of Active Learning with Support Vector Machines for Japanese Word Segmentation. ACL. 2002;(July):505–512. [Google Scholar]
  • 13.Xue N, Xia F, Chiou F, Palmer M. The Penn Chinese TreeBank: Phrase Structure Annotation of a Large Corpus. Nat Lang Eng [Internet] 2005;11(2):207–238. [Google Scholar]
  • 14.Tseng H, Tseng H, Chang P, Chang P, Andrew G, Andrew G, et al. A Conditional Random Field Word Segmenter. Proc 4th SIGHAN Work Chinese Lang Process; 2005. [Google Scholar]
  • 15.Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P. Natural Language Processing (Almost) from Scratch. J Mach Learn Res. 2011;12:2493–2537. [Google Scholar]
  • 16.Turian J, Ratinov L, Bengio Y. Word Representations: A Simple and General Method for Semi-supervised Learning. Proc 48th Annu Meet Assoc Comput Linguist; 2010; pp. 384–394. [Google Scholar]
  • 17.Pivovarov R, Elhadad N. A hybrid knowledge-based and data-driven approach to identifying semantically similar concepts. J Biomed Inform [Internet] Elsevier Inc. 2012 Jun;45(3):471–81. doi: 10.1016/j.jbi.2012.01.002. [cited 2014 Mar 5] [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Henriksson A, Dalianis H, Kowalski S. Generating Features for Named Entity Recognition by Learning Prototypes in Semantic Space: The Case of De-Identifying Health Records. Bioinforma Biomed (BIBM), 2014 IEEE Int Conf; 2014; pp. 450–457. [Google Scholar]
  • 19.Jonnalagadda S, Cohen T, Wu S, Gonzalez G. Enhancing clinical concept extraction with distributional semantics. J Biomed Inform Elsevier Inc. 2012;45(1):129–140. doi: 10.1016/j.jbi.2011.10.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Elhadad N, Zhang S, Driscoll P, Brody S. Characterizing the Sublanguage of Online Breast Cancer Forums for Medications, Symptoms, and Emotions. Proc AMIA Annu Fall Symp; 2014; [PMC free article] [PubMed] [Google Scholar]
  • 21.Wu Y, Jiang M, Lei J, Xu H. Named Entity Recognition in Chinese Clinical Text Using Deep Neural Network. Stud Health Technol Inform. 2015;(216):624–628. [PMC free article] [PubMed] [Google Scholar]
  • 22.Chen H. Deep Learning for Chinese Word Segmentation and POS Tagging. EMNLP. 2013 Oct;:647–657. [Google Scholar]
  • 23.Light M, Qiu XY, Srinivasan P. The Language of Bioscience: Facts, Speculations, and Statements in Between. BioLink 2004 -- Proc Work Link Biol Lit Ontol Databases; 2004; pp. 17–24. [Google Scholar]
  • 24.Vincze V, Szarvas G, Farkas R, Mora G, Csirik J. The BioScope corpus: biomedical texts annotated for uncertainty, negation and their scopes. 2008;9:1–9. doi: 10.1186/1471-2105-9-S11-S9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Cruz Díaz NP, Maña López MJ, Vázquez JM, Álvarez VP. A machine-learning approach to negation and speculation detection in clinical texts. J Am Soc Inf Sci Technol. 2012;63(7):1398–1410. [Google Scholar]
  • 26.Díaz NPC. Detecting Negated and Uncertain Information in Biomedical and Review Texts. RANLP. 2013:45–50. [Google Scholar]
  • 27.Farkas R, Vincze V, Móra G, Csirik J, Szarvas G. The CoNLL-2010 shared task: learning to detect hedges and their scope in natural language text. Proc Fourteenth Conf Comput Nat Lang Learn; 2010; pp. 1–12. [Google Scholar]
  • 28.Velldal E. Predicting speculation: a simple disambiguation approach to hedge detection in biomedical literature. J Biomed Semantics. 2013;2(Suppl 5):S7. doi: 10.1186/2041-1480-2-S5-S7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Roberts K, Harabagiu SM. A flexible framework for deriving assertions from electronic medical records. J Am Med Inform Assoc. 2011;18(5):568–573. doi: 10.1136/amiajnl-2011-000152. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Agarwal S, Yu H. Detecting hedge cues and their scope in biomedical text with conditional random fields. J Biomed Inform Elsevier Inc. 2010;43(6):953–961. doi: 10.1016/j.jbi.2010.08.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Zhang S, Zhao H, Zhou G, Lu B. Hedge Detection and Scope Finding by Sequence Labeling with Procedural Feature Selection. Comput Linguist. 2010 Jul;:92–99. [Google Scholar]
  • 32.Tang B, Wang X, Wang X, Yuan B, Fan S. A Cascade Method for Detecting Hedges and their Scope in Natural Language Text. Proc Fourteenth Conf Comput Nat Lang Learn; 2010; pp. 13–17. [Google Scholar]
  • 33.Medlock B, Briscoe T. Weakly Supervised Learning for Hedge Classification in Scientific Literature. Proc 45th Meet Assoc Comput Linguist; 2007; pp. 992–999. [Google Scholar]
  • 34.Hanauer DA, Liu Y, Mei Q, Manion FJ, Balis UJ, Zheng K. Hedging their mets: the use of uncertainty terms in clinical documents and its potential implications when sharing the documents with patients. AMIA Annu Symp Proc. 2012;2012(September 2015):321–30. [PMC free article] [PubMed] [Google Scholar]
  • 35.Fujikawa K, Seki K, Uehara K. A hybrid approach to finding negated and uncertain expressions in biomedical documents. Proc 2nd Int Work Manag interoperability Complex Heal Syst - Mix ‘12; 2012; p. 67. [Google Scholar]
  • 36.Özgür A, Radev DR. Detecting speculations and their scopes in scientific text. Proc 2009 Conf Empir Methods Nat Lang Process; 2009 August; pp. 1398–1407. [Google Scholar]
  • 37.Ganter V, Ganter V, Strube M, Strube M. Finding Hedges by Chasing Weasels: Hedge Detection Using Wikipedia Tags and Shallow Linguistic Features. Proc ACL-IJCNLP; 2009 August; pp. 173–176. [Google Scholar]
  • 38.Chen Z, Zou B, Zhu Q, Li P. Chinese Lex Semant. Springer; Berlin Heidelberg: 2013. The Scientific Literature Corpus for Chinese Negation and uncertainty identification; pp. 657–667. [Google Scholar]
  • 39.Chen Z, Zou B, Zhu Q, Li P. Nat Lang Process Chinese Comput. Springer; Berlin Heidelberg: 2013. Chinese Negation and Speculation Detection with Conditional Random Fields; pp. 30–40. [Google Scholar]
  • 40.Zou B, Zhu Q, Zhou G. Negation and Speculation Identification in Chinese Language. Proc 53rd Annu Meet Assoc Comput Linguist 7th Int Jt Conf Nat Lang Process; 2015; pp. 656–665. [Google Scholar]
  • 41.Ji F, Qiu X, Huang X. Exploring uncertainty sentences in Chinese. Proc 16th China Conf Inf Retreval; 2010; pp. 594–601. [Google Scholar]
  • 42.Xu Y, Chen L, Wei J, Ananiadou S, Fan Y, Qian Y, et al. Bilingual term alignment from comparable corpora in English discharge summary and Chinese discharge summary. BMC Bioinformatics. 2015;16(1):1–10. doi: 10.1186/s12859-015-0606-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Cohen J, et al. A coefficient of agreement for nominal scales. Educ Psychol Meas Durham. 1960;20(1):37–46. [Google Scholar]
  • 44.Xu Y, Tsujii J, Chang EI-C. Named entity recognition of follow-up and time information in 20 000 radiology reports. J Am Med Informatics Assoc. 2012;19(5):792–799. doi: 10.1136/amiajnl-2012-000812. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.He L, Yang Z, Lin H, Li Y. Drug name recognition in biomedical texts: A machine-learning-based method. Drug Discov Today Elsevier Ltd. 2014;19(5):610–617. doi: 10.1016/j.drudis.2013.10.006. [DOI] [PubMed] [Google Scholar]
  • 46.Lamurias A, Grego T, Couto FM. Chemical compound and drug name recognition using CRFs and semantic similarity based on ChEBI. BioCreative Chall Eval Work. 2013;(Cdi):75. [Google Scholar]
  • 47.Kudo T. CRF++: Yet Another CRF Toolkit. 2005. Software available at http://crfpp.sourceforge.net.
  • 48.Nadkarni PM, Ohno-Machado L, Chapman WW. Natural language processing: an introduction. J Am Med Inform Assoc. 2011;18(5):544–551. doi: 10.1136/amiajnl-2011-000464. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. 2013 Jan 16;:1–12. arXiv preprint arXiv:1301.3781. [Google Scholar]
  • 50.Chang PC, Galley M, Manning CD. Optimizing Chinese word segmentation for machine translation performance. Proc Third Work Stat Mach Transl. 2008 Jun;:224–232. [Google Scholar]
  • 51.Contemporary Chinese Dictionary [Internet] Available from: https://en.wikipedia.org/wiki/Xiandai_Hanyu_Cidian.
