AMIA Annual Symposium Proceedings. 2021 Jan 25;2020:658–667.

Facilitating information extraction without annotated data using unsupervised and positive-unlabeled learning

Zfania Tom Korach 1,2, Sharmitha Yerneni 1, Jonathan Einbinder 2,3, Carl Kallenberg 3, Li Zhou 1,2
PMCID: PMC8075513  PMID: 33936440

Abstract

Information extraction (IE), the distillation of specific information from unstructured data, is a core task in natural language processing. For rare entities (<1% prevalence), collection of the positive examples required to train a model may require an infeasibly large sample of mostly negative ones. We combined unsupervised with biased positive-unlabeled (PU) learning methods to: 1) facilitate positive example collection while maintaining the assumptions needed to 2) learn a binary classifier from the biased positive-unlabeled data alone. We tested the methods on a real-life use case of rare (<0.42%) entity extraction from medical malpractice documents. When tested on a manually reviewed random sample of documents, the PU model achieved an area under the precision-recall curve of 0.283 and F1 of 0.410, outperforming fully supervised learning (0.022 and 0.096, respectively). The results demonstrate our method's potential to reduce the manual effort required for extracting rare entities from narrative texts.

Introduction

Information extraction (IE), defined as scanning unstructured data to elucidate structured, typically pre-defined, information, is a core task of natural language processing (NLP). In the medical domain, it typically involves detecting clinical entities (e.g. signs, symptoms, treatments) in clinical notes. It is typically formulated as a binary classification task: "given a text (e.g. a clinician's note), does the entity appear in it?", or more formally: given a document represented by a vector of features x, estimate the probability that it contains (y=1, "positive") or does not contain (y=0, "negative") the entity, i.e. p(y=1 | x). Although IE has been extensively investigated, the variety of entities and the variability of clinical documentation hinder the development of a generic "silver bullet" solution and frequently require the development of custom extraction logic (what and how to extract). Machine learning (ML) can help overcome the linguistic variability of clinical documentation and detect non-canonical manifestations of the desired entity. However, ML algorithms, particularly supervised ML, require examples, both positive (ones manifesting the entity) and negative (ones from which the entity is absent), and in large numbers. While the specific needs differ from case to case, in practice many ML-based IE studies involved hundreds to several thousands of examples for training the model (apart from the examples needed to test the model). If the examples are collected by reviewing a random sample of notes, then following the binomial distribution the expected number of notes to review to accumulate m positive examples of an entity with a prevalence of p is n = m/p. Many clinical entities, despite their importance, are rare. For an entity with a prevalence of 1%, the expected (average) number of documents to review to collect 500 positive examples is 500/0.01 = 50,000.
Such sample size may be infeasible due to the scarcity and high cost of clinician time for annotation. To overcome this challenge, we looked for methods to reduce the number of examples needed to be reviewed. Our goal was to 1) automatically find a focused subset of the data with a higher prevalence of the entity, and 2) use the collected examples to train a classifier that accurately extracts the entity from unlabeled text.
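The review-burden arithmetic above can be made concrete with a short script (the helper name is ours):

```python
def expected_reviews(m_positives: int, prevalence: float) -> float:
    """Expected number of randomly sampled documents to review in order
    to accumulate m positive examples, given the entity's prevalence:
    n = m / p (mean of a negative-binomial waiting time)."""
    return m_positives / prevalence

# 500 positives at 1% prevalence -> 50,000 documents on average
print(expected_reviews(500, 0.01))  # 50000.0

# At a sub-1% prevalence such as 0.42%, the burden grows past 100,000
print(round(expected_reviews(500, 0.0042)))
```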

Background and Related Work

Focused subset finding

Unsupervised learning may be used to uncover latent relationships among the data and organize the items in a vector space where the distance between vectors correlates with the distance in meaning between their respective items. Such a vector space can be searched to find items (e.g. words) related to a probe item. We previously used distributional semantics, a form of unsupervised learning that automatically learns the meaning of words from their co-occurrence patterns, to facilitate rule-based IE.1 Briefly, the words and phrases in a corpus were encoded using word embeddings (e.g. word2vec). The user entered the common, intuitive terms for the entity and the system returned other words and phrases whose embeddings were close to those of the query terms.

Recent years have seen the advent of algorithms capable of representing whole sentences. The transformer neural-network architecture, the basis for BERT and its derivative language models, provides an efficient and effective way to automatically learn how to represent whole sentences from unlabeled corpora.2 These representations can be used as input for classification models and have outperformed the existing state of the art on many NLP tasks. For IE tasks, the entity's description or canonical terms from a standard terminology can be used to probe the sentences of the task population's documents for similar-meaning ones. The retrieved sentences are adjudicated by a human annotator and the positive ones are used as queries to retrieve additional similar sentences, in a snowballing fashion. Eventually the positive examples can serve as training data for an ML classifier. This way, instead of wading through numerous negative examples to get to the few rare positive ones, the manual effort can focus on a subset with a higher prevalence of positives, requiring the review of fewer examples to yield the same number of positive ones.

Classifier training

Supervised ML algorithms require both positive and negative examples to learn to distinguish between the two classes. Typically, independent (i.e. not correlated) examples are selected randomly, to yield an "independent and identically distributed" (IID) sample that is representative of the task's population (e.g. the documents of the study population). This set is then reviewed in full to provide the true label (y = {1,0}) for each and every example. In this scenario, called "traditional learning", the classifier is exposed to the distribution of the features across the classes p(x | y) and can therefore learn the posterior probability p(y = 1 | x).

In some use-cases, a fully labeled dataset is not available. Instead, a positive label is available for some cases, and the rest have no information about their true label (i.e. unlabeled). For example, when using the problem list from the electronic health record (EHR), only cases where a diagnosis has been made have a (positive) label. The rest of the patients are unlabeled, and could either have or not have the disease, since the absence of a diagnosis does not necessitate the absence of the disease. Formally, instead of a tuple (x,y), each example is a tuple (x,y,s) composed of the features (x), the (unobserved) true class label (y) and a binary indicator (s) of whether the example was labeled as positive or not. The labeling mechanism s is crucial to PU learning. It is assumed to have a perfect precision (i.e. p(y =1 | s = 1) = 1) but imperfect recall (i.e. p(y = 1 | s = 0) > 0), so an unlabeled example (i.e. having s = 0) might be either a negative or a missed positive (in the above example, some disease cases go undiagnosed). The labeling mechanism could be a heuristic or another phenomenon that given a truly positive example will label it (as positive) with a probability e(x) = p(s = 1 | y = 1, x) called the "propensity score". In this setting, called "positive-unlabeled (PU) learning", the set of examples can no longer be considered an IID sample, and therefore does not represent the task's population. Various methods have been developed to handle PU data and to enable learning a traditional classifier (i.e. one that can distinguish between positive and negative examples) from PU data (i.e. data that contains only positive and unlabeled examples).3 Broadly, these methods attempt to tackle the mixture of positive and negative among the unlabeled examples by either 1) looking for unlabeled examples that are very likely ("reliable") negative or 2) considering all unlabeled examples as negative but with a noisy (inaccurate) label.

Regardless of the method, to learn the distribution of the true label p(y = 1 | x), some assumption must be made about the labeling mechanism, i.e. the distribution of p(s = 1 | x, y=1).4 Usually the "selected completely at random" (SCAR) assumption is made: the labeling mechanism finds all positive examples at the same probability, independent of their features, i.e. p(s = 1| y = 1, x) = p(s = 1|y = 1). However, this assumption is frequently violated. If the positive cases are selected for inspection by the mechanism according to their characteristics, the probability of being labeled becomes dependent on the example's features. In our case, since the sentences are selected for human review based on their semantic similarity to the seeds, the labeling mechanism depends on the sentences' features and the SCAR assumption is violated. For the problem list, the decision to work-up a disease might depend on the characteristics of the patient or the clinical presentation, and therefore it also violates this assumption. In such scenarios, the positive examples suffer from a selection bias that renders them non-representative of the full positive population.

To relax the SCAR assumption and allow learning from the biased examples, Kato et al suggested pursuing a different goal: partially identifying the traditional classifier.5 Instead of trying to learn a classifier that predicts the probability of a class p(y = 1|x), this method learns a scoring function that preserves the ordering of the examples that the traditional classifier would yield. Therefore, instead of requiring the SCAR assumption, this method requires that the labeling mechanism preserve the ordering of the examples' probability of being positive. Formally, this method assumes the invariance of order (IVO): for two examples i,j with features xi and xj respectively, p(y = 1|xi) ≤ p(y = 1|xj) ⇔ p(s = 1|xi) ≤ p(s = 1|xj). An example of this assumption is fraud detection: positive examples of fraudulent transactions are collected when transactions are investigated manually. However, transactions with aberrant characteristics are more likely to reach the attention of the human inspector and thus to have the chance to be flagged as fraudulent. Therefore, the SCAR assumption is violated (since not all positive cases have the same probability of being labeled). The IVO assumption, on the other hand, might still be satisfied: as long as the heuristic used to collect cases for inspection follows the same trend as the true distribution of fraudulent transactions, the IVO assumption holds. With a suitable labeling mechanism (a sentence selection mechanism in our case), the goal is to learn a scoring function r(x): X → R such that if p(y = 1|xi) ≤ p(y = 1|xj) (according to a traditional classifier), then r(xi) ≤ r(xj) and vice versa. Such a scoring function is learned by training a model that minimizes a loss function assigning different loss terms to the positive and the unlabeled examples.6 While the original formula requires the SCAR assumption, when learning the scoring function instead of a classifier the estimates from the biased positive-unlabeled data can be used:
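On synthetic data, IVO can be illustrated by checking whether the labeling probabilities sort the examples the same way as the true posteriors (a toy sketch; the helper and the example propensities are ours):

```python
import numpy as np

def order_preserved(p_true, p_label):
    """IVO holds iff the labeling probability is a non-decreasing
    transform of the true positive probability (ignoring ties):
    sort by p(y=1|x) and check p(s=1|x) never decreases."""
    idx = np.argsort(p_true)
    return bool(np.all(np.diff(np.asarray(p_label)[idx]) >= 0))

p_y = np.array([0.05, 0.2, 0.6, 0.9])          # true posteriors p(y=1|x)
print(order_preserved(p_y, p_y * 0.5))          # uniform down-scaling keeps order: True
print(order_preserved(p_y, [0.9, 0.1, 0.2, 0.05]))  # inverted propensities: False
```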

f̂ = argmin_{f ∈ H} [ −π Ê_pbias[log(f(X))] + π Ê_pbias[log(1 − f(X))] − Ê_u[log(1 − f(X))] + R(f) ]

where f is the ranking function from the candidate set H, X is the data set, Ê_pbias is the estimated expectation (sample mean) over the biased positive examples, Ê_u is the corresponding quantity over the unlabeled examples, π is the class prevalence and R(f) is a regularization term. While the scoring function r(x) does not provide an actual probability, for IE tasks such a score may be sufficient since typically the desired outcome is a binary indicator. The score can be converted to a binary indicator, i.e. yield a classifier, by picking a cutoff value, e.g. using the precision-recall tradeoff on a test set. Therefore, training the IE classifier involves two sub-tasks: 1) finding a labeling mechanism that respects the IVO assumption, and 2) constructing and training a model with the special loss function.
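A standalone NumPy sketch of this objective with the log loss, only to make the signs of the three empirical terms concrete (our reading of the estimator; the function name and toy scores are illustrative, not the study's Keras implementation):

```python
import numpy as np

def biased_pu_loss(scores_pos, scores_unl, class_prior, reg=0.0):
    """Empirical PU risk with log loss:
    -pi * E_pbias[log f] + pi * E_pbias[log(1-f)] - E_u[log(1-f)] + R(f).
    scores_pos / scores_unl are model outputs f(x) in (0, 1)."""
    f_p = np.clip(scores_pos, 1e-7, 1 - 1e-7)
    f_u = np.clip(scores_unl, 1e-7, 1 - 1e-7)
    risk = (-class_prior * np.mean(np.log(f_p))        # fit positives high
            + class_prior * np.mean(np.log(1 - f_p))   # correct positives hidden in u
            - np.mean(np.log(1 - f_u)))                # push unlabeled low
    return risk + reg

# A scorer that ranks positives above unlabeled examples incurs lower risk
good = biased_pu_loss(np.array([0.9, 0.8]), np.array([0.1, 0.2, 0.1]), 0.004)
bad = biased_pu_loss(np.array([0.1, 0.2]), np.array([0.1, 0.2, 0.1]), 0.004)
print(good < bad)  # True
```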

We applied this method to a particular use-case: IE from medical malpractice documents of diagnostic error (DE) cases. DEs are a major malpractice concern, responsible for 45.9% and 21.1% of the paid claims in the outpatient and the inpatient setting, respectively.3 Such paid claims may represent only the tip of the iceberg, as in clinical practice diagnostic errors were found in 23.1% of autopsies, and errors that are likely to have affected patient outcomes were found in 9% of autopsies.4 Issues and patterns identified from overt DE cases may guide the search for similar "near-miss" cases for mitigation and prevention. Case facts called "defensibility issues" (DIs) include statements such as "the clinician did not follow clinical instructions" or "the clinician made an error but it did not cause the damage" and are manifested as short sentences in the text. This work focused on "lack of medical record documentation": cases where the absence of documentation of the provided care or patient condition contributes to the error or the liability. Automatic identification of such cases may facilitate active surveillance and intervention for such errors. Since this issue is rare (<1%) and complex, and no annotated corpus is available, no ready-to-use ML model or rules (e.g. keywords) exist. We therefore leveraged our data augmentation and knowledge discovery platform, called "Deep Snow", to facilitate the curation of training data and classifier development for this task.

Methods

An overview of the proposed method is outlined in Figure 1. We first describe the data preparation, then the creation of an unbiased (randomly selected) gold-standard dataset, followed by the process to collect the positive examples for both ML and rule-based classifiers and then the training procedure of the different models (biased-PU, rule-based and traditional ML). We finish with a comparison of the models' accuracy on the gold-standard dataset.

Figure 1. Overview of the proposed method. The study included 4 main stages: 1) Data preparation, to convert the scanned images to text, filter out uninformative document sections and produce individual and tokenized sentences. The sentences are encoded using unsupervised neural language models into a vector space where cosine similarity correlates with semantic similarity. 2) Positive example collection, where Deep Snow uses the sentence representations to focus the manual labeling effort on high-yield sentences, yielding biased positive examples. 3) Positive-unlabeled learning architecture is used to train a classifier by substituting the traditional binary cross-entropy with a loss function that corrects for the bias in the positive set. 4) Validation: the system is evaluated by the sentence-level accuracy of the classifier on a fully annotated random sample of notes held-out throughout the other stages.


1. Data collection

This work was part of the "Similar cAse Finder for Risk Reduction" (SAFRR) study, which aims to identify similar cases and their clusters among diagnostic error malpractice claims for analysis and intervention. The study population includes all closed medical malpractice cases from a predefined time period where the major allegation was a diagnostic error. Each case's information included demographics, structured information about the legal process (e.g. outcome) as well as narrative documents from pre-defined categories describing the investigation into the clinical occurrences and the legal process until the resolution of the case. DIs are currently not coded and need to be extracted from a variety of legal documents. For the extraction of the DIs, the population was limited to the cases that ended either with a settlement or a verdict with monetary compensation.

Document preparation: The documents were provided as scanned images. Therefore, a preliminary optical character recognition (OCR) step was required to extract machine-processable text from them. We used Tesseract, a standard open-source OCR suite from Google, to recognize each of the documents.7,8 Tesseract organizes the output in distinct paragraphs that correspond to the geometric layout of the document image. However, the recognized text lacks the original document segmentation: paragraphs that span multiple lines might be split, and irrelevant text such as headers and footers is injected into the text sequence. Therefore, we performed an additional step to filter out noisy and non-informative paragraphs. Two annotators reviewed Tesseract's output for each document sequentially and labeled each paragraph as informative or not. All of Tesseract's paragraphs were fed into a BERT language model, and the [CLS] embedding from the last 4 layers was used as the paragraph features. A binary classifier was trained using the H2O suite to perform architecture and hyper-parameter search. Following that, the paragraphs whose predicted probability exceeded the threshold corresponding to the maximal F2 value (favoring recall over precision) were concatenated. The cleaned document was tokenized and sentence-segmented using the Stanford CoreNLP suite, filtering out sentences that contained no alphanumeric characters or titles only.9 Due to the temporally sequential nature of the documents, many of them repeated content from previous ones. Therefore, only the chronologically first of all lexically identical sentences was used for data labeling in Deep Snow. The unit of classification for DI extraction was the sentence.

2. Unbiased ground-truth data sets creation

Test set: A random sample of 312 documents was selected and reviewed by a medical malpractice expert for mentions of the DI. To reduce correlation, only a single document of the same type was picked from each case. A random subset of 58 documents was annotated by another annotator to estimate inter-rater agreement. The documents were reviewed in their original scanned image format (with a text layer underneath) to minimize deviation from their true content, and the annotator highlighted the text snippets that manifested the DI. The highlighted snippets were then matched to the processed sentences and each sentence was marked as containing the DI or not. Text snippets that were highlighted but excluded during the document cleaning process were kept in this set (under a "missing sentence #" identifier) and during prediction were imputed by averaging the representations of all other sentences in the test set.

Development set: For the development of a baseline traditional ML classifier, an additional set of 1,508 randomly selected sentences from documents outside of the unbiased test set was reviewed by a clinician (who also performed the positive example collection and word expansion processes, to maintain uniformity of the labeled training data between the methods) and adjudicated for mentions of the DI. Sentences from the documents selected for the test and development sets were not used in other steps.

3. Positive example collection

Sentence representation for semantic similarity search: We used the "robustly optimized BERT pretraining approach" (RoBERTa) language model, an enhancement of BERT with a focus on masked words and an optimized training process, to learn a semantic representation of each sentence.2 Since raw BERT embeddings are typically not suitable for cosine similarity calculations, we used Sentence-BERT: an enhancement of BERT (and similar language models) that uses Siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity.10 Specifically, we used the output of a RoBERTa model fine-tuned on tasks that reflect semantic similarity, namely the Stanford Natural Language Inference (SNLI), Multi-Genre Natural Language Inference (MultiNLI) and Semantic Textual Similarity (STS) benchmarks, achieving a Spearman correlation of 88.33 on the STS benchmark.11,12,13,14 The mean of the tokens' 1024-dimensional embeddings from the last layer was used as the sentence representation. Representations for words were learned from the processed corpus using the skip-gram with negative sampling variant of the FastText algorithm, to allow handling arbitrary query keywords.15
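The retrieval machinery reduces to mean pooling over token embeddings plus cosine similarity; a toy NumPy sketch (random vectors stand in for the 1024-dimensional RoBERTa token embeddings, and the helper names are ours):

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Sentence vector = mean of its token embeddings (tokens x dims)."""
    return token_embeddings.mean(axis=0)

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
probe = mean_pool(rng.normal(size=(7, 1024)))        # e.g. a seed sentence
near = probe + rng.normal(scale=0.1, size=1024)      # a paraphrase: nearby vector
far = rng.normal(size=1024)                          # an unrelated sentence
print(cosine_sim(probe, near) > cosine_sim(probe, far))  # True
```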

Searching for biased positive examples: As explained above, to enable learning from the collected positive examples, their probability of being labeled has to follow the order of their probability of being truly positive. To support this assumption, we use the heuristic that sentences mentioning the keywords specified by the DI's definition (in an affirmative, certain and currently present context) are positive. Since cosine similarity between the sentence representations correlates well with semantic similarity, other sentences similar to positive ones are assumed to also likely be positive. Since the results are presented to the user in descending order of similarity to likely-positive sentences, sentences with a higher probability of being positive are expected to appear higher among the results and are therefore more likely to be reviewed by the user. We enhanced Deep Snow to search for positive sentences as follows:

  1. Verbatim mention search: To yield strongly positive examples, the whole corpus is searched for verbatim mentions of the phrases from the DI definition: "lack of documentation", "failure to document", and "no documentation":
    1. Limiting to the top-10 results ranked by the Okapi BM25 relevance score with parameters k1=1.2, b=0.75.16
    2. Filtering-out negated, historical, hypothetical and non-patient experiencer mentions using the FastContext implementation of the ConText algorithm.17
  2. Adjudication: A clinician adjudicates the retrieved sentences as either manifesting the desired entity ("positive") or not ("negative"). The negative class included both negative assertions and no mention of the entity.

  3. Nearest-neighbors search: Using all positive sentences found so far as probes:

    1. A random sample of sentences is selected uniformly without replacement from the whole corpus, and the probes, duplicates and previously reviewed sentences are filtered out. The sample size is set to 2 × [desired number of results] / [class prior] to achieve the desired number of positive results (pre-defined as 20) after filtering.

    2. The cosine similarity between the probes and the sample is calculated. Since all of the retrieved sentences are ranked by their similarity to the same set of probes, over time the positive examples (i.e. probes) might become correlated. Therefore, they are de-correlated by searching for the subset of probes that minimizes the sum of the absolute pairwise Pearson correlation coefficients (r) of the similarity scores. The search starts from k = [number of probes] with decrements of 1 and stops when either
      1. the adjusted mean absolute pairwise correlation, (Σ_{i,j=1; i≠j}^k |r_i,j|) / k², stops decreasing, or
      2. the mean absolute pairwise correlation, (Σ_{i,j=1; i≠j}^k |r_i,j|) / (k(k−1)), falls below 0.3.
    3. The similarity values for the selected probes are preserved, and those for the non-selected probes are discarded.

  4. The top 20 results are presented to the user by descending similarity order.

  5. Go to step 2. The loop continues until no new examples are found, all of the returned examples are negative, or the pre-defined cutoff of 300 reviewed examples is reached.
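The probe de-correlation in step 3 can be sketched as a greedy backward search over probes (an illustrative simplification with synthetic similarity scores; only the 0.3-threshold stopping rule is applied here, and the helper names are ours):

```python
import numpy as np

def mean_abs_corr(sim_matrix: np.ndarray) -> float:
    """Mean absolute pairwise Pearson correlation between probe
    similarity-score vectors (rows = probes, cols = sampled sentences)."""
    r = np.corrcoef(sim_matrix)
    k = r.shape[0]
    return float(np.abs(r[~np.eye(k, dtype=bool)]).mean())

def decorrelate_probes(sim_matrix: np.ndarray, threshold: float = 0.3):
    """Greedily drop one probe at a time until the mean absolute pairwise
    correlation of the remaining score vectors falls below threshold."""
    keep = list(range(sim_matrix.shape[0]))
    while len(keep) > 2 and mean_abs_corr(sim_matrix[keep]) >= threshold:
        # drop the probe whose removal lowers the correlation the most
        drop = min(keep, key=lambda i: mean_abs_corr(
            sim_matrix[[j for j in keep if j != i]]))
        keep.remove(drop)
    return keep

rng = np.random.default_rng(1)
base = rng.normal(size=50)
scores = np.vstack([base + rng.normal(scale=0.1, size=50),  # two highly
                    base + rng.normal(scale=0.1, size=50),  # correlated probes
                    rng.normal(size=50)])                    # one independent probe
kept = decorrelate_probes(scores)
print(len(kept))  # 2: one of the correlated pair is dropped
```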

4. Classifier training

The biased positive examples from Deep Snow were used to train a sentence-level binary classifier of the DI. We compared our system to two alternatives: 1) rule-based tagger with word-level augmentation, and 2) a traditional classifier trained using the development set.

Biased-PU classifier (model 1): To handle the severe class imbalance, we used proportional batching:18 when one class is very rare, mini-batches of the typical sizes (16-128) have a high probability of including no example of the minority class, leading to overfitting to the majority class. Ensuring the presence of the positive class would require a technically infeasible batch size (e.g. a batch size of 1,000 for a minority class with a prevalence of 0.1%). To use a smaller batch size while avoiding negative-only batches, the minimal number of positives per mini-batch, positives_min, was set to the ceiling of [batch size] × [class prior]. For each batch, positives_min positive examples were randomly selected, and the rest of the batch was filled with randomly selected negative examples. An epoch was completed when all the positive examples had been used. Therefore, the collected positive examples were supplemented with enough randomly selected unique sentences (unlabeled examples) to allow the maximal desired batch size of 64 (selected based on hardware limitations) for the number of batches allowed by the positive examples, i.e.

#negatives = [desired batch size] × [number of collected positive examples] / positives_min.
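Proportional batching under the study's counts can be sketched as follows (the generator structure and helper name are ours; with batch size 64 and a class prior below 1/64, positives_min = 1 and one epoch covers each collected positive once):

```python
import math
import random

def proportional_batches(pos_idx, neg_idx, batch_size, class_prior, seed=0):
    """Yield mini-batches that each contain at least
    ceil(batch_size * class_prior) positive examples."""
    rng = random.Random(seed)
    positives_min = math.ceil(batch_size * class_prior)
    pos = pos_idx[:]
    rng.shuffle(pos)
    # one epoch = every positive example used exactly once
    for start in range(0, len(pos), positives_min):
        batch_pos = pos[start:start + positives_min]
        batch_neg = rng.sample(neg_idx, batch_size - len(batch_pos))
        yield batch_pos + batch_neg

# 113 positives (indices < 113) and 7,119 unlabeled sentences, as in Table 3
batches = list(proportional_batches(list(range(113)), list(range(113, 7232)),
                                    batch_size=64, class_prior=0.0042))
print(len(batches))     # 113
print(len(batches[0]))  # 64
```

With positives_min = 1, an epoch has 113 batches of 1 positive plus 63 unlabeled sentences, i.e. 63 × 113 = 7,119 unlabeled slots, consistent with the unlabeled count reported in Table 3.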

The positive examples and supplemental negative ones formed the PU data set, which was randomly split into train/development/test sets in a 60%/20%/20% ratio for training (PU-train), hyperparameter search (PU-dev) and model selection (PU-test), respectively. A feed-forward neural binary classifier was designed with 1 or more tandem dense-dropout layer pairs before a final sigmoid dense output layer with the bias initialized to the class prior. The model, including the custom loss function, was implemented in Keras, a high-level neural-network design toolkit, version 2.3.1, using a TensorFlow version 2.1 back-end.19,20 Hyper-parameter search was conducted using the NNI toolkit.21 The Tree-structured Parzen Estimator (TPE) tuner was used to guide the hyper-parameter selection from the search space (Table 1).22 The training was performed on a single Nvidia Quadro P6000 graphics-processing unit (GPU). Early stopping of each trial was triggered by the absence of a >0.01 improvement in the PR-AUC for 15 epochs. Each model's training was limited to 500 epochs and the hyper-parameter search itself was limited to 8 hours. The model with the highest PR-AUC on the PU-test set was selected.

Table 1. Biased positive-unlabeled model hyper-parameters search space.

Hyper-parameter Distribution Possible values
Number of dense-dropout pairs Choice 1, 2, 3
Hidden size (applied uniformly to all layer pairs) Choice 128, 256, 512, 1024, 2048
Activation Choice SELU, RELU, TANH
L2 regularization Choice 0.0001, 0.001, 0.01, 0.1
Dropout rate Uniform 0.1 - 0.9
Batch size Choice 16, 32, 64
Initial learning rate Choice 1e-4, 1e-3, 1e-2, 1e-1
Optimizer Choice SGD, RMSprop, Adam
Standardize to mean 0, variance 1 Choice yes, no
L2-normalize Choice yes, no

Rule-based tagger baseline (model 2): As a baseline for comparison, a rule-based tagger was used. The lexicon was curated by expanding the same seed queries using FastText word embeddings. To capture more complex entities, we used our adaptation of the C-Value/NC-Value automatic term recognition algorithm to induce multi-word entities from the documents' text.23 Since the study corpus is relatively small, the phrase mining hyper-parameters were modified to: maximal phrase length of 6 (vs. 4), minimal phrase count of 2 (vs. 10), and kept term proportion of 0.1 (vs. 0.01). FastText embeddings were learned, treating each occurrence of the top 20,000 mined phrases in the cleaned documents as a single token, using the skip-gram with negative sampling algorithm and the hyper-parameters: minimal count 2, window size 10, 15 epochs, 15 negative samples and 600 dimensions. The word expansion process followed a similar approach to that for sentences:

  1. The seed queries for the search were "lack of documentation", "failure to document", and "no documentation".

  2. Query representation: using the word embeddings model, for each item in the query:
    1. Single-word item: retrieve/generate its representation, for in- and out-of-vocabulary words, respectively.
    2. Multi-word item: If this phrase appears as a fused sequence in the corpus, retrieve the representation of its fused form. Otherwise, average the representations of its tokens.
  3. Search the word embeddings for the top 20 nearest neighbors (individual tokens and mined phrases) of the queries.

  4. Average the similarity scores across the queries and return the top-20 most similar results.

  5. A clinician adjudicates each result as positive or negative, and the positive examples are re-submitted as queries.

  6. Go to step 2. The loop continues until no more new results are retrieved or all of the results are deemed negative.

The selected words and phrases were used as the lexicon for MTERMS, our rule-based tagger.24 All cleaned and tokenized sentences of the test set documents were processed. The tagged mentions' context was analyzed using FastContext, and mentions whose context was not certain and currently present (i.e. non-standard context) were discarded. Since the rule-based method provides only a binary indicator, its accuracy was measured using the F1 score.

Fully supervised (traditional) learning (model 3): To compare our method to regular supervised learning in terms of the yield (i.e. classification performance of the final model for the same manual effort), we randomly selected from the development set the same number of sentences that were manually reviewed during the "Searching for biased positive examples" step. These labeled sentences were used to train a ML classifier using the same hyper-parameters and protocol used for the biased PU classifier.

5. Evaluation

The models were compared by their accuracy on the held-out unbiased test set. All metrics were calculated at the sentence level. Area under the precision-recall curve (PR-AUC) and F1 metrics were used instead of the commonly used area under the receiver-operating characteristic curve (ROC-AUC) since the latter may over-estimate the performance when the positive class is very rare.25 Confidence intervals (CI) were calculated using 1,000 bootstrap samples. The study was approved by the Partners HealthCare Institutional Review Board.
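The percentile-bootstrap interval computation can be sketched as follows (the metric implementation, helper names and toy labels are illustrative, not the study's code):

```python
import numpy as np

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a sentence-level metric."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        stats.append(metric(y_true[idx], y_pred[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi

def f1(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

y = np.array([1] * 10 + [0] * 990)                       # rare positive class
yhat = np.array([1] * 8 + [0] * 2 + [1] * 5 + [0] * 985)  # an imperfect classifier
lo, hi = bootstrap_ci(y, yhat, f1)
print(0 <= lo <= hi <= 1)  # True
```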

Results

1. Data collection

Overall, 574 cases (47% males, median age [inter-quartile range, IQR] of 49.7 [37.7 - 61.0] years) with 10,150 documents were included in the study.

Document preparation: Two annotators annotated 3,167 paragraphs. Inter-rater agreement (Cohen's kappa) on a subset of 160 paragraphs annotated by both was 0.81, signaling good agreement. The class labels were balanced (47.7% informative). Features were extracted from a BERT model trained on biomedical documents.26 The optimal model from the hyperparameter search was a feed-forward neural network with two 500-unit hidden layers and a rectified linear unit activation function, achieving a maximal F2 of 0.955. Out of the 641,335 paragraphs yielded by Tesseract, 46.3% were classified as informative, yielding 844,421 sentences, of which 496,161 (58.7%) were unique.

2. Ground-truth held-out test set creation

Overall, 223 documents were annotated. Cohen's kappa on the 58 documents annotated by both annotators was 0.622, signaling substantial agreement. The cleaned documents of the test set included 20,911 sentences, and only 3 annotated snippets were filtered out during cleaning. The DI's prevalence was 0.340% (95% CI: 0.260% - 0.418%).

3. Positive example collection

Fourteen out of the 20 top results of the seed queries ("lack of documentation", "failure to document", and "no documentation") were deemed positive. The 16 subsequent cycles yielded 322 examples of which 113 (31.3%) were deemed positive. The distribution of the selected positive sentences vs. the unbiased test set is depicted in Figure 2.

Figure 2. A 2-D tSNE transformation of the representations of the biased positive examples collected in Deep Snow (red) and positive (yellow) and negative (green) ones from the unbiased test-set. Unbiased positives without adjacent biased ones are highlighted with arrows.

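A visualization like Figure 2 can be produced by projecting the sentence embeddings to two dimensions with t-SNE; a minimal sketch with toy embeddings standing in for the real ones:

```python
# Sketch: 2-D t-SNE projection of sentence embeddings, as used for Figure 2.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(5)
emb = rng.normal(size=(200, 64))     # stand-in for biased positives + test sentences
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(emb)
print(xy.shape)                      # one (x, y) point per sentence
```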

4. Classifier training and evaluation

Table 2 lists the models' performance results. Table 3 lists the data sets used in this study and their composition.

Table 2. Accuracy metrics (95% confidence interval) of the final classifiers on the held-out test set.

Model | PR-AUC | ROC-AUC | F1max | Precision | Recall
1 Biased-PU | 0.283 (0.190 - 0.389) | 0.973 (0.959 - 0.984) | 0.410 (0.298 - 0.508) | 0.467 (0.338 - 0.611) | 0.370 (0.253 - 0.484)
2 Rule-based | N/A | N/A | 0.224a (0.112 - 0.337) | 0.420 (0.231 - 0.625) | 0.154 (0.074 - 0.232)
3 Traditional | 0.022 (0.009 - 0.044) | 0.686 (0.619 - 0.750) | 0.096 (0.033 - 0.165) | 0.107 (0.021 - 0.229) | 0.100 (0.042 - 0.184)

PR-AUC: area under the precision-recall curve, ROC-AUC: area under the receiver-operating characteristic curve. Precision and recall denote their value at the threshold yielding F1max.

a Since the rule-based classifier yields a binary indicator, PR-AUC and ROC-AUC cannot be calculated for it.
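The F1max column reports the best F1 over all decision thresholds, with precision and recall then read off at that threshold. A sketch of that computation (the function name `f1_max` is ours):

```python
# Sketch: maximal F1 over all thresholds, plus precision/recall at that threshold.
import numpy as np
from sklearn.metrics import precision_recall_curve

def f1_max(y_true, y_score):
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # Drop the final (precision=1, recall=0) point, which has no threshold.
    precision, recall = precision[:-1], recall[:-1]
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    i = int(np.argmax(f1))
    return f1[i], precision[i], recall[i], thresholds[i]

y = np.array([0, 0, 0, 1, 0, 1, 1, 0])
s = np.array([0.1, 0.2, 0.3, 0.9, 0.4, 0.7, 0.5, 0.6])
f1, p, r, t = f1_max(y, s)
print(f"F1max={f1:.3f} at threshold {t:.2f} (precision={p:.2f}, recall={r:.2f})")
```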

Table 3. Number of sentences in the training and test data sets.

Data set | Positive | Negative (N) / unlabeled (U)
Unbiased test set | 71 | N: 20,840
Unbiased development set | 5 | N: 1,501
Positive-unlabeled learning (model 1) | 113 | U: 7,119
Traditional machine learning (model 3) | 1 | N: 331

Biased PU classifier (model 1): The 322 positive sentences were combined with 7,119 unlabeled examples to form the PU dataset. The highest-ranking model achieved a PR-AUC of 0.935 on the PU test set. Inspection of erroneous predictions revealed sentences discussing similar topics but not specifically the lack of documentation. The Pearson correlation coefficient between PR-AUC on the PU and unbiased test sets was 0.966 (p-value <0.001).
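To illustrate the PU idea (not the authors' exact biased-PU formulation), the classic Elkan-Noto recipe cited above trains a classifier to separate labeled positives from unlabeled examples, estimates the labeling frequency c = p(s=1|y=1) on the labeled positives, and rescales scores by 1/c to recover p(y=1|x). A sketch on synthetic data:

```python
# Sketch of Elkan-Noto PU learning on synthetic, well-separated data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
X_pos = rng.normal(2, 1, size=(300, 5))      # true positives, centered at +2
X_neg = rng.normal(-2, 1, size=(3000, 5))    # true negatives, centered at -2
labeled = X_pos[:100]                        # the few labeled positives
unlabeled = np.vstack([X_pos[100:], X_neg])  # everything else is unlabeled

X = np.vstack([labeled, unlabeled])
s = np.concatenate([np.ones(len(labeled)), np.zeros(len(unlabeled))])
g = LogisticRegression(max_iter=1000).fit(X, s)   # models p(s=1|x)

c = g.predict_proba(labeled)[:, 1].mean()         # estimate of p(s=1|y=1)
posterior = np.clip(g.predict_proba(X)[:, 1] / c, 0, 1)  # estimate of p(y=1|x)
print("estimated labeling frequency c:", round(c, 3))
```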

Rule-based tagger (model 2): Term expansion yielded 16 patterns. Given the goal of extracting cases where documentation was absent, terms that describe the documentation itself were configured to be extracted when negated, while terms that already describe the absence of documentation were configured to be extracted when affirmed. Inspection of type-I (false-positive) errors revealed many cases that did discuss the absence of clinical documentation, but as a neutral description of the facts rather than as a failure of the physician. Inspection of type-II (false-negative) errors revealed two culprits: the first is unusual surface forms (phrases) that differ substantially from the phrases in the DI definition; the second is implicit statement of the lack of documentation, e.g. describing what could have been done and implying that the inaction constitutes a failure to document, instead of explicitly stating that the documentation is lacking.

Fully supervised (traditional) classifier (model 3): The 332 selected sentences from the development set included a single positive example, precluding cross-validation for hyperparameter search. To increase the number of positives, we attempted to select the same number only from sentences containing the phrases from the DI definition, but only 3 sentences contained them. Therefore, we used a standard architecture, fine-tuning a transformer language model: a 768-dimensional BERT model pre-trained on clinical documents, trained with the hyperparameters recommended in the original paper for 3 epochs.27
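The negation-sensitive logic of the rule-based tagger (model 2) can be caricatured as below. This is a toy sketch: the term lists, cue patterns, and window size are invented for illustration, and real systems such as the ConText algorithm use far richer trigger and scope rules.

```python
# Toy sketch: documentation terms count only when negated; absence terms
# count when affirmed.
import re

NEGATION_CUES = re.compile(r"\b(no|not|never|without|failed to|lack of)\b", re.I)
DOC_TERMS = re.compile(r"\b(documented|charted|recorded)\b", re.I)  # need negation
ABSENCE_TERMS = re.compile(r"\b(failure to document|no documentation|undocumented)\b", re.I)

def tag_sentence(sentence):
    """Return True if the sentence asserts a lack of documentation."""
    if ABSENCE_TERMS.search(sentence):
        return True                      # absence stated explicitly (affirmed)
    m = DOC_TERMS.search(sentence)
    if m:
        window = sentence[max(0, m.start() - 40):m.start()]
        return bool(NEGATION_CUES.search(window))  # documentation term, negated
    return False

print(tag_sentence("The vital signs were not documented."))        # True
print(tag_sentence("The allergy was documented in the chart."))    # False
```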

Discussion

IE is a core task in NLP and forms the basis for many downstream analyses. In this work, we attempted to tackle the challenge of training-data collection by combining the recent advancement of automatic (unsupervised) sentence representation with the theoretical foundations of PU learning in the setting of a very rare entity. Both PU methods (models 1 and 2) outperformed a fully-supervised learning approach (model 3). Specifically, at the manual effort level (332 examples) that was sufficient for the biased-PU method to yield a skillful (better than random) model, the fully-supervised method was practically infeasible, with only a single positive example collected. These findings support the use of data augmentation and PU methods like Deep Snow for situations where the amount of required labeled data is infeasibly high.

Beyond training-data collection, severe class imbalance as in our case (0.34% positive) poses a challenge to the learning process itself, since the positive patterns may seem like noise among the huge number of negative examples. A possible solution could be training a preliminary classifier to discard topically unrelated sentences (e.g. sentences describing a patient's background). However, early attempts in this direction showed only a modest increase in the prevalence of positives. This could stem from the fact that the task's documents are heterogeneous, with no clear internal structure and no workflow limitation on the sections where the DI could be documented. Since the manual annotations are binary (rather than a graded score), the IVO cannot be tested.

The strong correlation (Pearson r: 0.966, p-value <0.001) between the performance on the PU test set and the eventual performance on the task itself (the unbiased test set) signals that the biased (i.e. PU) sample may be used as a surrogate for the unbiased (randomly selected) test set, allowing us to complete the training without a resource-intensive review of a mostly-negative, randomly-selected sample. Figure 2 shows how Deep Snow identified many of the clusters of positive sentences, including remote ones far from the main bulk of unbiased positives. Enriching the search algorithm with signals beyond textual similarity may help find such remote clusters.

PU-learning is sometimes motivated by the difficulty of obtaining negative examples. For example, ruling out the presence of a disease may require a medical work-up that cannot be conducted or justified. In clinical IE, however, the challenge is different: negative examples can be obtained by manual review, while positive examples are harder to obtain due to their rarity. Focusing the review process on a subset with a higher a-priori prevalence of the class is an intuitive approach to address that challenge, but it would violate the SCAR assumption. Biased PU learning may enable learning from such a biased sample, requiring only a heuristic to rank examples by their probability of being positive. However, such knowledge (the heuristic) is essentially the objective of the IE task itself and is likely not available. Therefore, Deep Snow and PU-learning are not silver bullets and may be more suitable for cases where some knowledge of the entity's manifestation is already available, whether from rules, unsupervised learning, or transfer learning.

Additional data-augmentation methods were attempted. Snorkel, a weak-supervision method, was investigated initially.28 It relies on the aggregation of partially accurate heuristics (called "labeling functions") to create a generative model of the classes, which generates more training data for a traditional discriminative classifier. However, the collected positive examples failed to yield labeling functions that reliably distinguish positives from negatives. Given the abundance of negative examples, novelty detection was also attempted. However, it assumes that the positive (anomalous) examples reside in low-density areas of the feature space, separate from the dense areas hosting the negative (normal) ones. The positives in our case did not behave this way (Figure 2), and accordingly the novelty-detection model failed to identify any positives from the unbiased test set, achieving a precision and recall of 0.
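The failure mode of the novelty-detection attempt can be reproduced on synthetic data: when positives are drawn from the same dense region as the negatives, a one-class model flags almost none of them. A sketch (using scikit-learn's IsolationForest as a stand-in for whichever novelty detector was used):

```python
# Sketch: novelty detection fails when positives are not outliers.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
negatives = rng.normal(0, 1, size=(5000, 16))
positives = rng.normal(0, 1, size=(50, 16))    # positives overlap the negatives

model = IsolationForest(contamination=0.01, random_state=0).fit(negatives)
flags = model.predict(positives)               # +1 = inlier, -1 = outlier
recall = (flags == -1).mean()
print("novelty-detection recall on positives:", recall)
```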

Several limitations affect our study. First, it focuses on a single entity from a single institution using a custom data set, hindering generalization of the findings to other settings. Second, annotation of randomly selected data sufficient for training could not be performed due to resource limitations; therefore, the exact number of training examples required to achieve the same performance as the PU method is not known, and the saving (or increase) in training-data requirements cannot be estimated. Third, the choice of the sentence representation model was arbitrary. Fourth, the paucity of positive examples prevented optimization of the fully-supervised training process.

Future directions for our work include testing on standard datasets (e.g. MIMIC) and additional settings, implementing a more elaborate positive example search (e.g. ad-hoc model training) to capture atypical positive cases, improving sentence representation (e.g. transfer learning from other IE tasks) and investigating Deep Snow's effect on downstream tasks.

Conclusion

The combination of unsupervised- and positive-unlabeled learning methods may reduce the manual effort required to train an information-extraction classifier for a rare entity by focusing the review process on a higher-yield subset. The methods' suitability depends on the task and the validity of the heuristic used to select the subset of the data to review.

Acknowledgements

The work in this manuscript has been supported by a grant from Controlled Risk Insurance Company titled "Similar-Cases Finder for Risk Reduction". The Quadro GPU used for this research was donated by the NVIDIA Corporation.


i See http://nlpprogress.com/english/named_entity_recognition.html for examples of standard IE datasets.

References

  • 1. Korach ZT, Collins SA, Cato K, Kang MJ, Schnock KO, Couture B, et al. Active Learning for the Identification of Nurses' Concerns from Nursing Notes. San Francisco, CA, USA; 2019.
  • 2. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 [cs] [Internet]. 2019 Jul 26.
  • 3. Bekker J, Davis J. Learning from positive and unlabeled data: a survey. Mach Learn [Internet]. 2020 Apr 1;109(4):719–60.
  • 4. Elkan C, Noto K. Learning classifiers from only positive and unlabeled data. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08) [Internet]; Las Vegas, Nevada, USA. Association for Computing Machinery; 2008. pp. 213–220.
  • 5. Kato M, Teshima T, Honda J. Learning from Positive and Unlabeled Data with a Selection Bias. In: International Conference on Learning Representations (ICLR) [Internet]; 2019.
  • 6. du Plessis M, Niu G, Sugiyama M. Convex Formulation for Learning from Positive and Unlabeled Data. In: International Conference on Machine Learning [Internet]; 2015. pp. 1386–94.
  • 7. Smith R. An Overview of the Tesseract OCR Engine. In: Proc. Ninth International Conference on Document Analysis and Recognition (ICDAR); 2007. pp. 629–633.
  • 8. Tafti AP, Baghaie A, Assefi M, Arabnia HR, Yu Z, Peissig P. OCR as a Service: An Experimental Evaluation of Google Docs OCR, Tesseract, ABBYY FineReader, and Transym. In: Bebis G, Boyle R, Parvin B, Koracin D, Porikli F, Skaff S, et al., editors. Advances in Visual Computing (Lecture Notes in Computer Science). Springer International Publishing; 2016. pp. 735–46.
  • 9. Manning CD, Surdeanu M, Bauer J, Finkel J, Bethard SJ, McClosky D. The Stanford CoreNLP Natural Language Processing Toolkit. In: Association for Computational Linguistics (ACL) System Demonstrations [Internet]; 2014. pp. 55–60.
  • 10. Reimers N, Gurevych I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) [Internet]; Hong Kong, China. Association for Computational Linguistics; 2019. pp. 3982–3992.
  • 11. Bowman SR, Angeli G, Potts C, Manning CD. A large annotated corpus for learning natural language inference. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics; 2015.
  • 12. Cer D, Diab M, Agirre E, Lopez-Gazpio I, Specia L. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017) [Internet]; Vancouver, Canada. Association for Computational Linguistics; 2017. pp. 1–14.
  • 13. Williams A, Nangia N, Bowman S. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) [Internet]; New Orleans, Louisiana. Association for Computational Linguistics; 2018. pp. 1112–1122.
  • 14. UKPLab/sentence-transformers [Internet]. Ubiquitous Knowledge Processing Lab; 2020.
  • 15. Bojanowski P, Grave E, Joulin A, Mikolov T. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics [Internet]. 2017;5:135–146.
  • 16. Robertson SE, Walker S, Jones S, Hancock-Beaulieu MM, Gatford M, et al. Okapi at TREC-3. NIST Special Publication Sp [Internet]. 1995;109:109.
  • 17. Shi J, Hurdle JF. Trie-based rule processing for clinical NLP: A use-case study of n-trie, making the ConText algorithm more efficient and scalable. Journal of Biomedical Informatics [Internet]. 2018 Sep 1;85:106–13. doi: 10.1016/j.jbi.2018.08.002.
  • 18. Jacovi A, Niu G, Goldberg Y, Sugiyama M. Scalable Evaluation and Improvement of Document Set Expansion via Neural Positive-Unlabeled Learning. arXiv:1910.13339 [cs] [Internet]. 2019 Oct 29.
  • 19. Chollet F, et al. Keras [Internet]. 2015.
  • 20. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems [Internet]. 2015.
  • 21. microsoft/nni [Internet]. Microsoft; 2020.
  • 22. Bergstra J, Bardenet R, Bengio Y, Kégl B. Algorithms for hyper-parameter optimization. In: Proceedings of the 24th International Conference on Neural Information Processing Systems (NIPS '11); Granada, Spain. Curran Associates Inc.; 2011. pp. 2546–2554.
  • 23. Korach ZT, Yang J, Rossetti SC, Cato KD, Kang M-J, Knaplund C, et al. Mining clinical phrases from nursing notes to discover risk factors of patient deterioration. Int J Med Inform. 2019 Dec 14;135:104053. doi: 10.1016/j.ijmedinf.2019.104053.
  • 24. Zhou L, Plasek JM, Mahoney LM, Karipineni N, Chang F, Yan X, et al. Using Medical Text Extraction, Reasoning and Mapping System (MTERMS) to process medication information in outpatient clinical notes. AMIA Annu Symp Proc. 2011;2011:1639–48.
  • 25. Davis J, Goadrich M. The Relationship between Precision-Recall and ROC Curves. In: Proceedings of the 23rd International Conference on Machine Learning (ICML '06) [Internet]; New York, NY, USA. Association for Computing Machinery; 2006. pp. 233–240.
  • 26. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234–40.
  • 27. Alsentzer E, Murphy J, Boag W, Weng W-H, Jin D, Naumann T, et al. Publicly Available Clinical BERT Embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop [Internet]; Minneapolis, Minnesota, USA. Association for Computational Linguistics; 2019. pp. 72–78.
  • 28. Ratner AJ, Bach SH, Ehrenberg HR, Ré C. Snorkel: Fast Training Set Generation for Information Extraction. In: Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD '17) [Internet]; New York, NY, USA. ACM; 2017. pp. 1683–1686.

Articles from AMIA Annual Symposium Proceedings are provided here courtesy of American Medical Informatics Association
